How can I extract a date from a struct type column in a PySpark dataframe?

I'm working with a PySpark dataframe that has a struct type column, as shown below:

df.printSchema()

#root
#|-- timeframe: struct (nullable = false)
#|    |-- start: timestamp (nullable = true)
#|    |-- end: timestamp (nullable = true)

I tried to collect() the end timestamps of that window column so that I can pass them on for plotting:

from pyspark.sql.functions import *

# method 1 
ts1 = [val('timeframe.end') for val in df.select(date_format(col('timeframe.end'),"yyyy-MM-dd")).collect()]

# method 2
ts2 = [val('timeframe.end') for val in df.select('timeframe.end').collect()]

Normally, when the column is not a struct, I follow this answer. In this case I couldn't find a better way than this post and this answer, which both convert the struct to arrays. I'm not sure that is the best practice.

I tried the 2 methods shown above without success. They output the following:

print(ts1)     #[Row(2021-12-28='timeframe.end')]
print(ts2)     #[Row(2021-12-28 00:00:00='timeframe.end')]

Expected outputs are below:

print(ts1)     #[2021-12-28]          just date format
print(ts2)     #[2021-12-28 00:00:00] just timestamp format

How can I handle this matter?

CodePudding user response:

You can access Row fields using brackets (row["field"]) or dot notation (row.field), but not with parentheses. Try this instead:

from pyspark.sql import Row
import pyspark.sql.functions as F

df = spark.createDataFrame([Row(timeframe=Row(start="2021-12-28 00:00:00", end="2022-01-06 00:00:00"))])

ts1 = [r["end"] for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]
# or
# ts1 = [r.end for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]

print(ts1)
#['2022-01-06']

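When you write row("timeframe.end"), you are not reading a field: you are calling the Row object itself, which builds a new Row that uses the old row's values as field names and your argument as the value. That is why you get outputs like Row(2021-12-28='timeframe.end').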