I'm currently working on pyspark and I've a csv file(having a few columns among which I'll display only the date datatype columns) which when opened in Excel looks like this:
Date received Date sent to company
11/13/2014 11/13/2014
11/13/2014 11/13/2014
11/13/2014 11/13/2014
11/13/2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
Here is the screenshot for more clear understanding
As you can see I've used this csv file for my pyspark but I really want to have the date columns in one particular format say: "dd-mm-yyyy".
Can somebody help me with it?!
Although I've tried:
df.select(col("Date_received"),to_date(col("Date_received"),"dd-MM-yyyy").alias("date")) \
.show()
Which gives the following ouput:
------------- ----------
|Date_received| date|
------------- ----------
| 11/13/2014| null|
| 11/13/2014| null|
| 11/13/2014| null|
| 11/13/2014| null|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
------------- ----------
only showing top 20 rows
Observe how the output for first 4 rows is "null". And also I'm providing "dd-mm-yyyy" then how come the output has a "yyyy-mm-dd" format?
How to tackle this problem? Coz I want to change the date_format here(to "dd-mm-yyyy").
CodePudding user response:
To tackle multiple date_formats available within your data , you can parse each one of them into a new column(s) using to_date and then finally coalesce the first non-null value
You can find more on this - Parse Date Format
Available Date Parsing Formats within Spark - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
A typical example is as below -
Data Preparation
df = pd.read_csv(StringIO("""
Date received,Date sent to company
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
"""),delimiter=",")
sparkDF = sql.createDataFrame(df)
sparkDF.show()
------------- --------------------
|Date received|Date sent to company|
------------- --------------------
| 11/13/2014| 11/13/2014|
| 11/13/2014| 11/13/2014|
| 11/13/2014| 11/13/2014|
| 11/13/2014| 11/13/2014|
| 12-11-2014| 11/13/2014|
| 12-11-2014| 11/13/2014|
| 12-11-2014| 11/13/2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
| 12-11-2014| 11-12-2014|
------------- --------------------
To Date
sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('Date received'),'MM/dd/yyyy'))\
.withColumn('p2',F.to_date(F.col('Date received'),'MM-dd-yyyy'))
sparkDF.show()
------------- -------------------- ---------- ----------
|Date received|Date sent to company| p1| p2|
------------- -------------------- ---------- ----------
| 11/13/2014| 11/13/2014|2014-11-13| null|
| 11/13/2014| 11/13/2014|2014-11-13| null|
| 11/13/2014| 11/13/2014|2014-11-13| null|
| 11/13/2014| 11/13/2014|2014-11-13| null|
| 12-11-2014| 11/13/2014| null|2014-12-11|
| 12-11-2014| 11/13/2014| null|2014-12-11|
| 12-11-2014| 11/13/2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
| 12-11-2014| 11-12-2014| null|2014-12-11|
------------- -------------------- ---------- ----------
Coalesce
sparkDF = sparkDF.withColumn('date_received_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
.drop(*['p1','p2'])
sparkDF.show()
------------- -------------------- --------------------
|Date received|Date sent to company|date_received_parsed|
------------- -------------------- --------------------
| 11/13/2014| 11/13/2014| 2014-11-13|
| 11/13/2014| 11/13/2014| 2014-11-13|
| 11/13/2014| 11/13/2014| 2014-11-13|
| 11/13/2014| 11/13/2014| 2014-11-13|
| 12-11-2014| 11/13/2014| 2014-12-11|
| 12-11-2014| 11/13/2014| 2014-12-11|
| 12-11-2014| 11/13/2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
| 12-11-2014| 11-12-2014| 2014-12-11|
------------- -------------------- --------------------
