spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
In the above I need to add the requirements such as,
1.the output should be validated with the format 'yyyymmdd' using unix_timestamp.
- if it is not valid then should transform the extracted digits string by moving the first four (4) characters to the end of the extracted digits string (
MMDDYYYY to YYYYMMDD) and then should be validated with the 'yyyymmdd' format, if this condition is satisfied then print that date.
I'm not sure how to include the Unix timestamp in my query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
Using unix_timestamp in place of to_date
spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)
CodePudding user response:
Try Below code.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
----------------
|dt |
----------------
|2021dgsth02hdg02|
|0101def20dr21 |
----------------
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z] ",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
----------
|dt |
----------
|2021-02-02|
|2021-01-01|
----------
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
----------
|dt |
----------
|2021-02-02|
|2021-01-01|
----------
