I'm new to pyspark from pandas.
Joining on one condition and dropping duplicate seemed to work perfectly when I do:
df1.join(df2, df1.col1 == df2.col1, how="left").drop(df2.col1)
However what if I want to join on two columns condition and drop two columns of joined df b.c. it is a duplicate.
I've tried:
df1.join(df2, [df1.col1 == df2.col1, df1.col2 == df2.col2, how="left").drop(df2.col1, df2.col2)
CodePudding user response:
The method drop can only take a single Column expression OR one/more string column names to drop. That's why it work for drop(df2.col1) but raises en exception when using drop(df2.col1, df2.col2).
Having these two dataframes as an example:
df1 = spark.createDataFrame([(1, 1), (2, 2)], ["col1", "col2"])
df2 = spark.createDataFrame([(5, 3, "ok"), (2, 2, "ko")], ["col1", "col2", "status"])
You can drop the duplicates columns like this:
- Using list of column names as join condition
df1.join(df2, ["col1", "col2"], "left").show()
# ---- ---- ------
#|col1|col2|status|
# ---- ---- ------
#| 1| 1| null|
#| 2| 2| ko|
# ---- ---- ------
- Using
selectexpression
df1.join(df2, (df1["col1"] == df2["col1"]) & (df1["col2"] == df2["col2"]), "left")\
.select(
df1["*"],
*[df2[c] for c in df2.columns if c not in ["col1", "col2"]]
).show()
# ---- ---- ------
#|col1|col2|status|
# ---- ---- ------
#| 1| 1| null|
#| 2| 2| ko|
# ---- ---- ------
