Pyspark replace values on array column based on another dataframe-CodePudding

I have two dataframes, one simply with some unique ids with associated names like so:

Id name
0 name_a
1 name_b
2 name_c

Second dataframe contains the ids from the first dataframe stored in an array, in each row:

Row_1 row_2
0 [0,2]
1 [1,0]

My question is it possible to replace the arrays from the second dataframe so it checks the names from the first df based on the ids, so:

Row_1 row_2
0 [name_a, name_c]
1 [name_b, name_a]

It seems too time consuming to create a map of the first df and just add it to the second df with an udf. Any help is much appreciated on how to approach this.

CodePudding user response：

You can try using explode function to convert array into rows, then join the data with the initial data frame, in the last step do a group by & .agg(collect_list())

from pyspark.sql.functions import explode

df3 = df2.select(df2.row_1,explode(df2.row_2))
df4 = df3.join(df1,df3.row_1==df1.Id).select(df3.row_1,df1.name)
df5 = df4.groupBy('row_1').agg(collect_list('name').alias('name'))

Reference links:

CodePudding user response：

Join using array_contains function groupby and collect_list:

from pyspark.sql import functions as F

df1 = spark.createDataFrame([(0, "name_a"), (1, "name_b"), (2, "name_c")], ["Id", "name"])
df2 = spark.createDataFrame([(0, [0, 2]), (1, [1, 0])], ["Row_1", "Row_2"])

result = df2.join(
    df1, on=F.array_contains("Row_2", F.col("Id")), how="left"
).groupBy("Row_1").agg(
    F.collect_list("name").alias("Row_2")
)

result.show()
# ----- ---------------- 
#|Row_1|           Row_2|
# ----- ---------------- 
#|    0|[name_a, name_c]|
#|    1|[name_a, name_b]|
# ----- ----------------