pyspark join multiple conditon and drop both duplicate column-CodePudding

I'm new to pyspark from pandas.

Joining on one condition and dropping duplicate seemed to work perfectly when I do:

df1.join(df2, df1.col1 == df2.col1, how="left").drop(df2.col1)

However what if I want to join on two columns condition and drop two columns of joined df b.c. it is a duplicate.

I've tried:

df1.join(df2, [df1.col1 == df2.col1, df1.col2 == df2.col2, how="left").drop(df2.col1, df2.col2)

CodePudding user response：

The method drop can only take a single Column expression OR one/more string column names to drop. That's why it work for drop(df2.col1) but raises en exception when using drop(df2.col1, df2.col2).

Having these two dataframes as an example:

df1 = spark.createDataFrame([(1, 1), (2, 2)], ["col1", "col2"])
df2 = spark.createDataFrame([(5, 3, "ok"), (2, 2, "ko")], ["col1", "col2", "status"])

You can drop the duplicates columns like this:

Using list of column names as join condition

df1.join(df2, ["col1", "col2"], "left").show()

# ---- ---- ------ 
#|col1|col2|status|
# ---- ---- ------ 
#|   1|   1|  null|
#|   2|   2|    ko|
# ---- ---- ------

Using select expression

df1.join(df2, (df1["col1"] == df2["col1"]) & (df1["col2"] == df2["col2"]), "left")\
    .select(
        df1["*"],
        *[df2[c] for c in df2.columns if c not in ["col1", "col2"]]
    ).show()

# ---- ---- ------ 
#|col1|col2|status|
# ---- ---- ------ 
#|   1|   1|  null|
#|   2|   2|    ko|
# ---- ---- ------