There are 2 data frames having a date column each with a different name (start_date, finish_date). I need to take the month value from the second data frame (from the column "finish_date") and compare it with the first data frame (from the column "start_date"). The date is in the format MM/DD/YYYY If the month matches then we need to return another column's corresponding value, if it doesn't match then fill it by null
CodePudding user response:
Extract month using org.apache.spark.sql.functions.month function and join both dataframes on extracted column.
Example:
val d1 = Seq("01/23/2019","06/24/2019","09/20/2019")
val d2 = Seq("04/23/2019","06/01/2008","09/05/2020")
import spark.implicits._
val df1 = d1.toDF("dateColStr").withColumn("dateCol1", to_date(col("dateColStr"),"MM/dd/yyyy")).drop("DateColStr")
val df2 = d2.toDF("dateColStr").withColumn("dateCol2", to_date(col("dateColStr"),"MM/dd/yyyy")).drop("DateColStr")
df1.printSchema()
// root
// |-- dateCol1: date (nullable = true)
df2.printSchema()
// root
// |-- dateCol2: date (nullable = true)
val jDF = df1.join(df2, month($"dateCol1") === month($"dateCol2"))
jDF.show()
// ---------- ----------
// | dateCol1| dateCol2|
// ---------- ----------
// |2019-06-24|2008-06-01|
// |2019-09-20|2020-09-05|
// ---------- ----------
