I am trying to join two apache spark sql DataFrame and replace column value of first dataframe with another. Eg:
Df1:
col1 | col2 | other columns .... say (col-x, col-y, col-z)
------------ |--------------------------------
x | a |random values
y | b |random values
z | c |random values
Df2:
col1 | col3 | other columns .. say (col-a, col-b, col-c)
-------------|--------------------------------
x | a1 |different random values
y | b1 |different random values
w | w1 |different random values
resultant dataframe should be
DF:
col1 | col2 | other columns of DF1 (col-x. col-y, col-z)
-------------|-------------------------------
a1 | a |random values
b1 | b |random values
z | c |random values
I need to perform left join and replace values of DF1.col1 with DF2.col3 wherever DF1.col1 = DF2.col1. I am not sure how to do that. Furthermore, as it can be seen in above example, DF1 has a lot more columns other than "col1" and "col2" and I cannot apply select on all of them. I was trying something like,
val df = df1.join(df2, Seq("col1"), "left").select(
coalesce(df2("col2"), df1("col1")).as("col1")
)
but this doesn't seem to work. Also, I think it will filter out other columns of DF1. I want to keep all columns of DF1.
How can I do this in Scala?
CodePudding user response:
You can construct the required 3 columns as follows.
val df = df1.join(df2, Seq("col1"), "left").select(coalesce(df2("col3"), df1("col1")).as("col1"),col("col2"), col("colx"))
CodePudding user response:
For get all columns from "df1" after join, alias can be used for Dataframe:
val updatedCol1 = coalesce(df2("col3"), df1("col1")).alias("col1")
val columns = updatedCol1 :: df1.columns
.filterNot(_ == "col1")
.map(cname => col("df1." cname))
.toList
df1.alias("df1")
.join(df2, Seq("col1"), "left")
.select(columns: _*)
