I have a dataframe with millions of rows, like this:
CLI_ID OCCUPA_ID DIG_LABEL
125 2705 1
328 2708 7
400 2712 1
401 2705 2
525 2708 1
I want to take a random sample of 100k rows in which OCCUPA_ID is 70% 2705, 20% 2708 and 10% 2712, and DIG_LABEL is 50% 1, 20% 2 and 30% 7.
How can I do this in Spark, using PySpark?
CodePudding user response:
Use sampleBy instead of sample. sample draws rows without looking at any column, whereas sampleBy samples separately for each value of a column you choose.
sampleBy takes a column, a dictionary of fractions, and an optional seed:
df_sample = df.sampleBy(col, fractions, seed)
where
col is the column whose values define the groups to sample from.
fractions is a dictionary mapping each value of that column to the fraction of its rows to keep, so 0.1 keeps about 10% of the rows with that value. Values that are not listed get a fraction of 0 and are dropped.
seed fixes the random draw. Without it, every run returns a different sample.
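So for your question the answer is:
df_sample = df.sampleBy("OCCUPA_ID",{"2705":0.7,"2708":0.2,"2712":0.1},42).sampleBy("DIG_LABEL",{"1":0.5,"2":0.2,"7":0.3},42)
This samples twice, first on OCCUPA_ID and then on DIG_LABEL, with 42 as the seed both times.
The dictionary keys should have the same type as the column values, so if these columns are stored as integers, use 2705 rather than "2705".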
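To make the meaning of the fractions concrete, here is a rough pure-Python model of this kind of per-value sampling. It is only a sketch, assuming each row is kept or dropped independently with the probability given for its value. The sample_by helper and the toy rows are hypothetical and are not part of PySpark.

```python
import random
from collections import Counter


def sample_by(rows, key, fractions, seed=None):
    """Keep each row with probability fractions[row[key]] (0.0 if the value is not listed)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]


# Hypothetical toy data: 1000 rows for each OCCUPA_ID value.
rows = [{"OCCUPA_ID": occ} for occ in (2705, 2708, 2712) for _ in range(1000)]

sample = sample_by(rows, "OCCUPA_ID", {2705: 0.7, 2708: 0.2, 2712: 0.1}, seed=42)
kept = Counter(row["OCCUPA_ID"] for row in sample)

# The counts land near 700, 200 and 100 only because every value starts with 1000 rows.
print(kept)
```

With unequal group sizes the same fractions would give a different mix, and the total number of rows kept is never fixed in advance.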
CodePudding user response:
You can use the sampleBy method of PySpark DataFrames to perform stratified sampling: pass the column name and a dictionary giving the fraction to keep for each value in that column. For example:
spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()
+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
|     1|     2705|        7|
|     4|     2705|        1|
|     5|     2705|        7|
|     7|     2708|        2|
|    12|     2705|        1|
|    16|     2708|        2|
|    18|     2708|        2|
|    20|     2705|        7|
|    25|     2705|        2|
|    26|     2705|        2|
|    38|     2705|        7|
|    40|     2705|        1|
|    44|     2705|        2|
|    48|     2708|        7|
|    50|     2708|        2|
|    53|     2705|        1|
|    57|     2705|        1|
|    58|     2712|        1|
|    61|     2705|        2|
|    63|     2708|        7|
+------+---------+---------+
only showing top 20 rows
Since you want a single PySpark DataFrame sampled on two different columns, you can chain the sampleBy calls together:
spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)
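The fractions passed to sampleBy are applied to each value's own rows, so getting a sample of about 100k rows with a given mix means first converting the target mix into per-value fractions. Below is a minimal sketch of that conversion for one column. The helper fractions_for_target and the row counts are hypothetical, and the result only matches the target in expectation, not exactly.

```python
def fractions_for_target(counts, shares, total):
    """Per-value keep fractions so the expected sample has `total` rows split by `shares`."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    fractions = {}
    for key, share in shares.items():
        # Rows wanted for this value, divided by rows available for it.
        fraction = total * share / counts[key]
        if fraction > 1.0:
            raise ValueError(f"value {key!r} has too few rows for the requested share")
        fractions[key] = fraction
    return fractions


# Hypothetical row counts per OCCUPA_ID; on a real DataFrame they could be
# collected from df.groupBy("OCCUPA_ID").count().
counts = {2705: 1_000_000, 2708: 500_000, 2712: 200_000}
shares = {2705: 0.7, 2708: 0.2, 2712: 0.1}

fractions = fractions_for_target(counts, shares, total=100_000)
print(fractions)
```

The resulting dictionary can be passed as the fractions argument of sampleBy on that column. Matching both OCCUPA_ID and DIG_LABEL at once would need target shares for each combination of the two values, which the question does not give.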
