I have a spark daframe like
item_id attribute_id attribute_value
1001 color blue
1001 shape rectangular
1001 material copper
1002 color black
1002 material copper
1003 color grey
I Want Resultant dataframe to match below
item_id color shape meterial
1001 blue rectangular copper
1002 black null copper
1003 grey null null
I am trying achieve this in PySpark but not sure about the syntax, Any Hints? Note: Any pointers in PySpark or spark sql is appreciated
CodePudding user response:
The operation you are looking for is pivot. You should group the dataframe by item_id then pivot based on attribute_id and finally choose the first value of attribute_value.
data = [(1001, "color", "blue",),
(1001, "shape", "rectangular",),
(1001, "material", "copper",),
(1002, "color", "black",),
(1002, "material", "copper",),
(1003, "color", "grey",), ]
df = spark.createDataFrame(data, ("item_id", "attribute_id", "attribute_value",))
df.groupBy("item_id").pivot("attribute_id").agg(F.first("attribute_value")).show()
"""
------- ----- -------- -----------
|item_id|color|material| shape|
------- ----- -------- -----------
| 1001| blue| copper|rectangular|
| 1002|black| copper| null|
| 1003| grey| null| null|
------- ----- -------- -----------
"""
