Home > Back-end >  Pyspark Group a Dataframe by Multiple Keys in Parallel
Pyspark Group a Dataframe by Multiple Keys in Parallel

Time:01-13

I need to group a dataframe by all possible column combinations and output the count.

For example:

df:

col1 col2
val1 val2

combination_list:

[[col1], [col2], [col1, col2]]

the number of combination = 2^n - 1, meaning I need to do 2^n -1 times groupby operation.

I have tried:

for c in combination_list:
  df_result = df.groupby(c).count()
  df_result.repartition(1).write.csv('file_path')

This is very slow as I have hundreds of combinations and I know for loop should be avoided in pyspark. How can I conduct groupby for all the combinations in the list in parallel?

CodePudding user response:

The operation you are looking for is cube. Cube generates aggregations for all combination of columns passed as arguements.


data = [("c1v1", "c2v1",),
        ("c1v2", "c2v2",),
        ("c1v1", "c2v2",), ]

df = spark.createDataFrame(data, ("col1", "col2"))

df.show()

"""
 ---- ---- 
|col1|col2|
 ---- ---- 
|c1v1|c2v1|
|c1v2|c2v2|
|c1v1|c2v2|
 ---- ---- 
"""

df.cube(df.columns).count().show()

"""
 ---- ---- ----- 
|col1|col2|count|
 ---- ---- ----- 
|c1v2|c2v2|    1|
|c1v1|c2v1|    1|
|c1v1|c2v2|    1|
|null|null|    3|
|null|c2v2|    2|
|null|c2v1|    1|
|c1v2|null|    1|
|c1v1|null|    2|
 ---- ---- ----- 
"""
  •  Tags:  
  • Related