Pyspark Group a Dataframe by Multiple Keys in Parallel-CodePudding

I need to group a dataframe by all possible column combinations and output the count.

For example:

df:

col1	col2
val1	val2

combination_list:

[[col1], [col2], [col1, col2]]

the number of combination = 2^n - 1, meaning I need to do 2^n -1 times groupby operation.

I have tried:

for c in combination_list:
  df_result = df.groupby(c).count()
  df_result.repartition(1).write.csv('file_path')

This is very slow as I have hundreds of combinations and I know for loop should be avoided in pyspark. How can I conduct groupby for all the combinations in the list in parallel?

CodePudding user response：

The operation you are looking for is cube. Cube generates aggregations for all combination of columns passed as arguements.


data = [("c1v1", "c2v1",),
        ("c1v2", "c2v2",),
        ("c1v1", "c2v2",), ]

df = spark.createDataFrame(data, ("col1", "col2"))

df.show()

"""
 ---- ---- 
|col1|col2|
 ---- ---- 
|c1v1|c2v1|
|c1v2|c2v2|
|c1v1|c2v2|
 ---- ---- 
"""

df.cube(df.columns).count().show()

"""
 ---- ---- ----- 
|col1|col2|count|
 ---- ---- ----- 
|c1v2|c2v2|    1|
|c1v1|c2v1|    1|
|c1v1|c2v2|    1|
|null|null|    3|
|null|c2v2|    2|
|null|c2v1|    1|
|c1v2|null|    1|
|c1v1|null|    2|
 ---- ---- ----- 
"""