Forming multiple columns from data in a perticular column based on condition in PySpark-CodePudding

I have a spark daframe like

item_id   attribute_id  attribute_value
1001      color          blue
1001      shape          rectangular
1001      material       copper
1002      color          black
1002      material       copper
1003      color          grey

I Want Resultant dataframe to match below

item_id   color    shape        meterial 
1001      blue     rectangular  copper
1002      black    null         copper
1003      grey     null         null

I am trying achieve this in PySpark but not sure about the syntax, Any Hints? Note: Any pointers in PySpark or spark sql is appreciated

CodePudding user response：

The operation you are looking for is pivot. You should group the dataframe by item_id then pivot based on attribute_id and finally choose the first value of attribute_value.

data = [(1001, "color", "blue",),
        (1001, "shape", "rectangular",),
        (1001, "material", "copper",),
        (1002, "color", "black",),
        (1002, "material", "copper",),
        (1003, "color", "grey",), ]

df = spark.createDataFrame(data, ("item_id", "attribute_id", "attribute_value",))

df.groupBy("item_id").pivot("attribute_id").agg(F.first("attribute_value")).show()

"""
 ------- ----- -------- ----------- 
|item_id|color|material|      shape|
 ------- ----- -------- ----------- 
|   1001| blue|  copper|rectangular|
|   1002|black|  copper|       null|
|   1003| grey|    null|       null|
 ------- ----- -------- ----------- 
"""