Pyspark dataframe change format if regex applies-CodePudding

I'm currently transforming some date data in a dataframe which looks like:

 ----------- ------------ 
|first_col|sec_col-------|
 --------- -------------- 
|a--------|28-04-2021    |
|a--------|01-03-2017    |
|a--------|"Feb 23, 2012"|
|a--------|"May 01, 2019"|
 --------- --------------

I now want to transform the last two lines into better date formats like so: 23-Feb-2012 I thought doing this by regular expressions but the following code doesn't work:

from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, regexp_extract

#(a lot of stuff happens here which is not important for the question so I let it out)

input_df = input_df.withColumn("sec_col", input_df.sec_col.cast("String"))
    .withColumn("sec_col2", 
        f.when(input_df.sec_col.rlike("\"\w{3} \d{2}, \d{4}\""),
            f.concat(regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",2),f.lit("-"), regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",1),f.lit("-"),regexp_extract("sec_col","\"(\w{3}) (\d{2}), (\d{4})\"",3))))
         .otherwise(f.col("sec_col"))

Can anyone help?

CodePudding user response：

Instead of trying to parse the data format with regex you can directly convert it to a date using to_date as you already know the date_format to parse and fetch the first non-null value

Data Preparation

sparkDF = sql.createDataFrame([("28-04-2021",),
                               ("01-03-2017",),
                               ("Feb 23, 2012",),
                               ("May 01, 2019",)
                              ]
                              ,['timestamp'])

sparkDF.show()

 ------------ 
|   timestamp|
 ------------ 
|  28-04-2021|
|  01-03-2017|
|Feb 23, 2012|
|May 01, 2019|
 ------------

Parsing Dates


sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"MMM dd, yyyy"))\
                 .withColumn('p2',F.to_date(F.col('timestamp'),"dd-MM-yyyy"))


 ------------ ---------- ---------- 
|   timestamp|        p1|        p2|
 ------------ ---------- ---------- 
|  28-04-2021|      null|2021-04-28|
|  01-03-2017|      null|2017-03-01|
|Feb 23, 2012|2012-02-23|      null|
|May 01, 2019|2019-05-01|      null|
 ------------ ---------- ----------

Coalesce

sparkDF = sparkDF.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
                 .drop(*['p1','p2'])

sparkDF.show()

 ------------ ---------------- 
|   timestamp|timestamp_parsed|
 ------------ ---------------- 
|  28-04-2021|      2021-04-28|
|  01-03-2017|      2017-03-01|
|Feb 23, 2012|      2012-02-23|
|May 01, 2019|      2019-05-01|
 ------------ ----------------

Furthermore you can check various parsing formats available here and a similar question can be found here