I have two dataframes, one simply with some unique ids with associated names like so:
Id name
0 name_a
1 name_b
2 name_c
Second dataframe contains the ids from the first dataframe stored in an array, in each row:
Row_1 row_2
0 [0,2]
1 [1,0]
My question is it possible to replace the arrays from the second dataframe so it checks the names from the first df based on the ids, so:
Row_1 row_2
0 [name_a, name_c]
1 [name_b, name_a]
It seems too time consuming to create a map of the first df and just add it to the second df with an udf. Any help is much appreciated on how to approach this.
CodePudding user response:
You can try using explode function to convert array into rows, then join the data with the initial data frame, in the last step do a group by & .agg(collect_list())
from pyspark.sql.functions import explode
df3 = df2.select(df2.row_1,explode(df2.row_2))
df4 = df3.join(df1,df3.row_1==df1.Id).select(df3.row_1,df1.name)
df5 = df4.groupBy('row_1').agg(collect_list('name').alias('name'))
Reference links:
- https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/#:~:text=explode – PySpark explode array or,it contains all array elements.
- https://www.owenrumney.co.uk/pyspark-opposite-of-explode/
CodePudding user response:
Join using array_contains function groupby and collect_list:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(0, "name_a"), (1, "name_b"), (2, "name_c")], ["Id", "name"])
df2 = spark.createDataFrame([(0, [0, 2]), (1, [1, 0])], ["Row_1", "Row_2"])
result = df2.join(
df1, on=F.array_contains("Row_2", F.col("Id")), how="left"
).groupBy("Row_1").agg(
F.collect_list("name").alias("Row_2")
)
result.show()
# ----- ----------------
#|Row_1| Row_2|
# ----- ----------------
#| 0|[name_a, name_c]|
#| 1|[name_a, name_b]|
# ----- ----------------
