Home > Blockchain >  How can I compare 2 data frames by using the month from the date format MM/DD/YYYY.. in spark Scala?
How can I compare 2 data frames by using the month from the date format MM/DD/YYYY.. in spark Scala?

Time:01-20

There are 2 data frames having a date column each with a different name (start_date, finish_date). I need to take the month value from the second data frame (from the column "finish_date") and compare it with the first data frame (from the column "start_date"). The date is in the format MM/DD/YYYY If the month matches then we need to return another column's corresponding value, if it doesn't match then fill it by null

CodePudding user response:

Extract month using org.apache.spark.sql.functions.month function and join both dataframes on extracted column.

Example:

  val d1 = Seq("01/23/2019","06/24/2019","09/20/2019")
  val d2 = Seq("04/23/2019","06/01/2008","09/05/2020")

  import spark.implicits._
  val df1 = d1.toDF("dateColStr").withColumn("dateCol1", to_date(col("dateColStr"),"MM/dd/yyyy")).drop("DateColStr")
  val df2 = d2.toDF("dateColStr").withColumn("dateCol2", to_date(col("dateColStr"),"MM/dd/yyyy")).drop("DateColStr")

  df1.printSchema()
//  root
//  |-- dateCol1: date (nullable = true)

  df2.printSchema()
//  root
//  |-- dateCol2: date (nullable = true)

  val jDF = df1.join(df2, month($"dateCol1") === month($"dateCol2"))
  jDF.show()

//   ---------- ---------- 
//  |  dateCol1|  dateCol2|
//   ---------- ---------- 
//  |2019-06-24|2008-06-01|
//  |2019-09-20|2020-09-05|
//   ---------- ---------- 
  •  Tags:  
  • Related