Home > Blockchain >  How to get count of Dates that fall in between to Date columns in Spark?
How to get count of Dates that fall in between to Date columns in Spark?

Time:01-21

I have the following Data,

logType log_create_date error_date
A   2019-01-01  2019-01-02
A   2019-02-04  2019-02-03
A   2019-03-10  2019-03-07
A   2019-04-13  2019-04-08
B   2019-05-17  2019-05-10
B   2019-06-20  2019-06-11
B   2019-07-24  2019-07-13
B   2019-08-27  2019-08-14

Which I have managed to turn in to below using,

val window = Window.orderBy("logType","log_crate_date")
val lagCol = lag(col("log_crate_date"), 1).over(window)
spark.sql("SELECT * FROM test")
.withColumn("log_create_date_previous", lagCol)
.select("logType","log_create_date_previous","log_create_date","error_date")

    logType log_create_date_previous    log_create_date error_date
    A   null    2019-01-01  2019-01-02
    A   2019-01-01  2019-02-04  2019-02-03
    A   2019-02-04  2019-03-10  2019-03-07
    A   2019-03-10  2019-04-13  2019-04-08
    B   2019-04-13  2019-05-17  2019-05-10
    B   2019-05-17  2019-06-20  2019-06-11
    B   2019-06-20  2019-07-24  2019-07-13
    B   2019-07-24  2019-08-27  2019-08-14
    B   2019-08-27  2019-08-27  null

Now I want to count for each $logType, and in b/w a distinct log_create_date_previous & log_create_date how many error_date Counts has happened by counting how many error_date's falls under log_create_date_previous & log_create_date.

CodePudding user response:

For calculate how many records has "error_date" between "log_create_date_previous" and "log_create_date" for each type, "between" function can be used:

  .withColumn("error_in_between", $"error_date".between($"log_create_date_previous", $"log_create_date"))
  .groupBy("logType").agg(sum(when($"error_in_between", 1).otherwise(0)).alias("error_count"))

For provided dataset, result is:

 ------- ----------- 
|logType|error_count|
 ------- ----------- 
|A      |3          |
|B      |4          |
 ------- ----------- 
  •  Tags:  
  • Related