I need to group a dataframe by all possible column combinations and output the count.
For example:
df:
| col1 | col2 |
|---|---|
| val1 | val2 |
combination_list:
[[col1], [col2], [col1, col2]]
the number of combination = 2^n - 1, meaning I need to do 2^n -1 times groupby operation.
I have tried:
for c in combination_list:
df_result = df.groupby(c).count()
df_result.repartition(1).write.csv('file_path')
This is very slow as I have hundreds of combinations and I know for loop should be avoided in pyspark. How can I conduct groupby for all the combinations in the list in parallel?
CodePudding user response:
The operation you are looking for is cube.
Cube generates aggregations for all combination of columns passed as arguements.
data = [("c1v1", "c2v1",),
("c1v2", "c2v2",),
("c1v1", "c2v2",), ]
df = spark.createDataFrame(data, ("col1", "col2"))
df.show()
"""
---- ----
|col1|col2|
---- ----
|c1v1|c2v1|
|c1v2|c2v2|
|c1v1|c2v2|
---- ----
"""
df.cube(df.columns).count().show()
"""
---- ---- -----
|col1|col2|count|
---- ---- -----
|c1v2|c2v2| 1|
|c1v1|c2v1| 1|
|c1v1|c2v2| 1|
|null|null| 3|
|null|c2v2| 2|
|null|c2v1| 1|
|c1v2|null| 1|
|c1v1|null| 2|
---- ---- -----
"""
