I'm currently transforming some date data in a dataframe which looks like:
----------- ------------
|first_col|sec_col-------|
--------- --------------
|a--------|28-04-2021 |
|a--------|01-03-2017 |
|a--------|"Feb 23, 2012"|
|a--------|"May 01, 2019"|
--------- --------------
I now want to transform the last two lines into better date formats like so: 23-Feb-2012 I thought doing this by regular expressions but the following code doesn't work:
from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, regexp_extract
#(a lot of stuff happens here which is not important for the question so I let it out)
input_df = input_df.withColumn("sec_col", input_df.sec_col.cast("String"))
.withColumn("sec_col2",
f.when(input_df.sec_col.rlike("\"\w{3} \d{2}, \d{4}\""),
f.concat(regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",2),f.lit("-"), regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",1),f.lit("-"),regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",3))))
.otherwise(f.col("sec_col"))
Can anyone help?
CodePudding user response:
Instead of trying to parse the data format with regex you can directly convert it to a date using to_date as you already know the date_format to parse and fetch the first non-null value
Data Preparation
sparkDF = sql.createDataFrame([("28-04-2021",),
("01-03-2017",),
("Feb 23, 2012",),
("May 01, 2019",)
]
,['timestamp'])
sparkDF.show()
------------
| timestamp|
------------
| 28-04-2021|
| 01-03-2017|
|Feb 23, 2012|
|May 01, 2019|
------------
Parsing Dates
sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"MMM dd, yyyy"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"dd-MM-yyyy"))
------------ ---------- ----------
| timestamp| p1| p2|
------------ ---------- ----------
| 28-04-2021| null|2021-04-28|
| 01-03-2017| null|2017-03-01|
|Feb 23, 2012|2012-02-23| null|
|May 01, 2019|2019-05-01| null|
------------ ---------- ----------
Coalesce
sparkDF = sparkDF.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
.drop(*['p1','p2'])
sparkDF.show()
------------ ----------------
| timestamp|timestamp_parsed|
------------ ----------------
| 28-04-2021| 2021-04-28|
| 01-03-2017| 2017-03-01|
|Feb 23, 2012| 2012-02-23|
|May 01, 2019| 2019-05-01|
------------ ----------------
Furthermore you can check various parsing formats available here and a similar question can be found here
