I have a dataframe of the following structure:
timeStatistics.show(10, False)
------ --------------------------------------- --------------------------------------- -------------------------------------- -----
|idByte|min(time_delta) |max(time_delta) |avg(time_delta) |count|
------ --------------------------------------- --------------------------------------- -------------------------------------- -----
|1002b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b2|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b3|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b4|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b5|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b6|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b7|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
------ --------------------------------------- --------------------------------------- -------------------------------------- -----
only showing top 10 rows
I want to add a column that gives me the factor by which min(time_delta) and max(time_delta) differ.
My first atttempt was to just add:
.withColumn("min_max_split", (F.col("max(time_delta)")/F.col("min(time_delta)")))
However, it seems that division of two intervals is not supported:
AnalysisException: cannot resolve '(
max(time_delta)/min(time_delta))' due to data type mismatch: argument 2 requires numeric type, however, 'min(time_delta)' is of interval day to second type.
I thought of converting the intervals using the unix_timestamp() function. However, my intervals are sometimes smaller than a second, so unix_timestamp() would return zero.
CodePudding user response:
You can add the intervals to current_timestamp, convert the result into double type then divide:
from pyspark.sql import functions as F
df1 = df.withColumn(
"min_max_split",
(F.current_timestamp() F.col("max(time_delta)")).cast('double') / (
F.current_timestamp() F.col("min(time_delta)")).cast('double')
)
df1.show(1)
# ------ -------------------- -------------------- -------------------- ----- ------------------
#|idByte| min(time_delta)| max(time_delta)| avg(time_delta)|count| min_max_split|
# ------ -------------------- -------------------- -------------------- ----- ------------------
#|1002b0|INTERVAL '0 00:00...|INTERVAL '0 00:00...|INTERVAL '0 00:00...| 4198|1.0000000000048699|
# ------ -------------------- -------------------- -------------------- ----- ------------------
CodePudding user response:
I found a solution, which is a slight modification from @blackbishop 's answer:
.withColumn("min_max_split",\
(\
(F.to_date(F.from_unixtime(F.lit(0))) F.col("min(time_delta)")).cast('double')\
/ (F.to_date(F.from_unixtime(F.lit(0))) F.col("max(time_delta)")).cast('double')\
)\
)
