Home > Enterprise >  Pyspark Pivot based on column values combinations
Pyspark Pivot based on column values combinations

Time:01-15

I have a dataframe with 3 columns as shown below

enter image description here

I would like to Pivot and fill the columns on the id so that each row contains a column for each id column combination where the value is for that id, as shown below

enter image description here

Note: Zero or Null is shown if the ID does not match. For instance ID2_colA and Id2_ColB get 0 in the first two rows, and ID1_calA abd ID1_ColB get 0 in row 3 There are more distinct values in ID column. Shortened it for the ease of illustration

How can I achieve this in pyspark?

Here is the code for the first dataframe:

data = [(("ID1", 3, 5)), (("ID1", 4, 12)), (("ID2", 8, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])

CodePudding user response:

You can create a map column where the values are columns colA and colB and the keys concatenation of literal colA and colB names with IDcolumn. Then, explode the map and pivot the resulting value column like this:

from itertools import chain
import pyspark.sql.functions as F

df.select(
    "ID", "colA", "colB",
    F.explode(
        F.create_map(
            *list(chain(*[[F.concat_ws("_", F.lit(c), F.col("ID")), F.col(c)] for c in ["colA", "colB"]]))
        )
    )
).groupBy("ID", "colA", "colB") \
    .pivot("key").agg(F.first("value")) \
    .fillna(0) \
    .show()

# --- ---- ---- -------- -------- -------- -------- 
#|ID |colA|colB|colA_ID1|colA_ID2|colB_ID1|colB_ID2|
# --- ---- ---- -------- -------- -------- -------- 
#|ID2|8   |3   |0       |8       |0       |3       |
#|ID1|3   |5   |3       |0       |5       |0       |
#|ID1|4   |12  |4       |0       |12      |0       |
# --- ---- ---- -------- -------- -------- -------- 
  •  Tags:  
  • Related