I have the following Data,
logType log_create_date error_date
A 2019-01-01 2019-01-02
A 2019-02-04 2019-02-03
A 2019-03-10 2019-03-07
A 2019-04-13 2019-04-08
B 2019-05-17 2019-05-10
B 2019-06-20 2019-06-11
B 2019-07-24 2019-07-13
B 2019-08-27 2019-08-14
Which I have managed to turn in to below using,
val window = Window.orderBy("logType","log_crate_date")
val lagCol = lag(col("log_crate_date"), 1).over(window)
spark.sql("SELECT * FROM test")
.withColumn("log_create_date_previous", lagCol)
.select("logType","log_create_date_previous","log_create_date","error_date")
logType log_create_date_previous log_create_date error_date
A null 2019-01-01 2019-01-02
A 2019-01-01 2019-02-04 2019-02-03
A 2019-02-04 2019-03-10 2019-03-07
A 2019-03-10 2019-04-13 2019-04-08
B 2019-04-13 2019-05-17 2019-05-10
B 2019-05-17 2019-06-20 2019-06-11
B 2019-06-20 2019-07-24 2019-07-13
B 2019-07-24 2019-08-27 2019-08-14
B 2019-08-27 2019-08-27 null
Now I want to count for each $logType, and in b/w a distinct log_create_date_previous & log_create_date how many error_date Counts has happened by counting how many error_date's falls under log_create_date_previous & log_create_date.
CodePudding user response:
For calculate how many records has "error_date" between "log_create_date_previous" and "log_create_date" for each type, "between" function can be used:
.withColumn("error_in_between", $"error_date".between($"log_create_date_previous", $"log_create_date"))
.groupBy("logType").agg(sum(when($"error_in_between", 1).otherwise(0)).alias("error_count"))
For provided dataset, result is:
------- -----------
|logType|error_count|
------- -----------
|A |3 |
|B |4 |
------- -----------
