I have a dataframe with 3 columns as shown below
I would like to Pivot and fill the columns on the id so that each row contains a column for each id column combination where the value is for that id, as shown below
Note: Zero or Null is shown if the ID does not match. For instance ID2_colA and Id2_ColB get 0 in the first two rows, and ID1_calA abd ID1_ColB get 0 in row 3
There are more distinct values in ID column. Shortened it for the ease of illustration
How can I achieve this in pyspark?
Here is the code for the first dataframe:
data = [(("ID1", 3, 5)), (("ID1", 4, 12)), (("ID2", 8, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
CodePudding user response:
You can create a map column where the values are columns colA and colB and the keys concatenation of literal colA and colB names with IDcolumn. Then, explode the map and pivot the resulting value column like this:
from itertools import chain
import pyspark.sql.functions as F
df.select(
"ID", "colA", "colB",
F.explode(
F.create_map(
*list(chain(*[[F.concat_ws("_", F.lit(c), F.col("ID")), F.col(c)] for c in ["colA", "colB"]]))
)
)
).groupBy("ID", "colA", "colB") \
.pivot("key").agg(F.first("value")) \
.fillna(0) \
.show()
# --- ---- ---- -------- -------- -------- --------
#|ID |colA|colB|colA_ID1|colA_ID2|colB_ID1|colB_ID2|
# --- ---- ---- -------- -------- -------- --------
#|ID2|8 |3 |0 |8 |0 |3 |
#|ID1|3 |5 |3 |0 |5 |0 |
#|ID1|4 |12 |4 |0 |12 |0 |
# --- ---- ---- -------- -------- -------- --------


