Home > Blockchain >  Spark: How to divide interval by interval
Spark: How to divide interval by interval

Time:01-22

I have a dataframe of the following structure:

timeStatistics.show(10, False)

 ------ --------------------------------------- --------------------------------------- -------------------------------------- ----- 
|idByte|min(time_delta)                        |max(time_delta)                        |avg(time_delta)                       |count|
 ------ --------------------------------------- --------------------------------------- -------------------------------------- ----- 
|1002b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b2|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b3|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b4|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b5|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b6|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b7|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
 ------ --------------------------------------- --------------------------------------- -------------------------------------- ----- 
only showing top 10 rows

I want to add a column that gives me the factor by which min(time_delta) and max(time_delta) differ.

My first atttempt was to just add:

.withColumn("min_max_split", (F.col("max(time_delta)")/F.col("min(time_delta)")))

However, it seems that division of two intervals is not supported:

AnalysisException: cannot resolve '(max(time_delta) / min(time_delta))' due to data type mismatch: argument 2 requires numeric type, however, 'min(time_delta)' is of interval day to second type.

I thought of converting the intervals using the unix_timestamp() function. However, my intervals are sometimes smaller than a second, so unix_timestamp() would return zero.

CodePudding user response:

You can add the intervals to current_timestamp, convert the result into double type then divide:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "min_max_split",
    (F.current_timestamp()   F.col("max(time_delta)")).cast('double') / (
                F.current_timestamp()   F.col("min(time_delta)")).cast('double')
)

df1.show(1)
# ------ -------------------- -------------------- -------------------- ----- ------------------ 
#|idByte|     min(time_delta)|     max(time_delta)|     avg(time_delta)|count|     min_max_split|
# ------ -------------------- -------------------- -------------------- ----- ------------------ 
#|1002b0|INTERVAL '0 00:00...|INTERVAL '0 00:00...|INTERVAL '0 00:00...| 4198|1.0000000000048699|
# ------ -------------------- -------------------- -------------------- ----- ------------------ 

CodePudding user response:

I found a solution, which is a slight modification from @blackbishop 's answer:

.withColumn("min_max_split",\
  (\
    (F.to_date(F.from_unixtime(F.lit(0))) F.col("min(time_delta)")).cast('double')\
    / (F.to_date(F.from_unixtime(F.lit(0))) F.col("max(time_delta)")).cast('double')\
  )\
)
  •  Tags:  
  • Related