In the below example I would like to override the values in Spark Dataframe A with corresponding value in Dataframe B (if it exists). Is there a way to do it using Spark (Scala)?
Dataframe A
ID Name Age
1 Paul 30
2 Sean 35
3 Rob 25
Dataframe B
ID Name Age
1 Paul 40
Result
ID Name Age
1 Paul 40
2 Sean 35
3 Rob 25
CodePudding user response:
The combined use of a left join and coalesce should do the trick, something like:
dfA
.join(dfB, "ID", "left")
.select(
dfA.col("ID"),
dfA.col("Name"),
coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
)
Explanation: For a specific ID some_id, there is 2 cases:
- if
dfBdoes not containsome_id: then the left join will produce null fordfB.col("Age")and thecoalescewill return the first non-null value from expressions we passed to it, i.e. the value ofdfA.col("Age") - if
dfBcontainssome_idthen the value fromdfB.col("Age")will be used.
