I have a dataframe like the following:
rdd = spark.sparkContext.parallelize([
(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),
(23,'fl1.variant,fl2.neither,fl3.control','xxx','yyy','zzz'),
(24,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),
(25,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1')
])
df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])
sample data :
| Date | Type | Data1 | Data2 | Data3 |
|---|---|---|---|---|
| 22 | fl1.variant,fl2.variant,fl3.control | xxx | yyy | zzz |
| 23 | fl1.variant,fl2.neither,fl3.control | xxx | yyy | zzz |
| 24 | fl4.variant,fl2.variant,fl4.variant | xxx1 | yyy1 | zzz1 |
| 25 | fl3.control,fl3.control,fl3.variant | xxx1 | yyy1 | zzz1 |
I need to identify the distinct values of Type column data based on Data1, Data2 and Data3 columns. Type column's datatype is string separated by ,.
Based on Data1, Data2, Data3, I need to combine all the data and identify the unique values of Type column.
expected output :
| Data1 | Data2 | Data3 | Type_list |
|---|---|---|---|
| xxx | yyy | zzz | [fl1.variant,fl2.variant,fl3.control, fl2.neither] |
| xxx1 | yyy1 | zzz1 | [fl4.variant,fl2.variant,fl3.control,fl3.variant] |
I tried like below but could not get the expected distinct values
df1 = df.sort("Data1","Data2","Data3","Type"). \
groupBy("Data1","Data2","Data3"). \
agg(func.collect_set("Type").cast(func.StringType())). \
withColumnRenamed("CAST(collect_set(Type) AS STRING)", "Type_list")
| Data1 | Data2 | Data3 | Type_list |
|---|---|---|---|
| xxx | yyy | zzz | [fl1.variant,fl2.variant,fl3.control, fl1.variant, fl2.neither,fl3.control] |
| xxx1 | yyy1 | zzz1 | [fl4.variant,fl2.variant,fl4.variant, fl3.control,fl3.control,fl3.variant] |
df2 = df1.select("Data1","Data2","Data3",func.array_distinct(func.split("Type_list" , ",")))
| Data1 | Data2 | Data3 | array_distinct(split(Type_list, ,, -1)) |
|---|---|---|---|
| xxx | yyy | zzz | [[fl1.variant, fl2.variant, fl3.control, fl1.variant, fl2.neither, fl3.control]] |
| xxx1 | yyy1 | zzz1 | [[fl4.variant, fl2.variant, fl4.variant, fl3.control, fl3.control, fl3.variant]] |
CodePudding user response:
first split then array_distinct(flatten(collect_list()))
data_sdf. \
withColumn('type_split', func.split('type', ',')). \
groupBy('data1','data2','data3'). \
agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('type_list')). \
show(truncate=False)
# ----- ----- ----- ----------------------------------------------------
# |data1|data2|data3|type_list |
# ----- ----- ----- ----------------------------------------------------
# |xxx |yyy |zzz |[fl1.variant, fl2.variant, fl3.control, fl2.neither]|
# |xxx1 |yyy1 |zzz1 |[fl4.variant, fl2.variant, fl3.control, fl3.variant]|
# ----- ----- ----- ----------------------------------------------------
