Is there a Scala Spark equivalent to pandas Grouper freq feature?

Time:01-12

In pandas, if we have a time series and need to group it by a certain frequency (say, every two weeks), it's possible to use the Grouper class, like this:

import pandas as pd
df.groupby(pd.Grouper(key='timestamp', freq='2W'))

Is there any equivalent in Spark (more specifically, using Scala) for this feature?

CodePudding user response:

You can use the SQL function window. First, if you don't already have a timestamp column, create one from the string datetime values:

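import org.apache.spark.sql.functions.{col, sum, to_timestamp, window}
import spark.implicits._ // assumes an active SparkSession named spark

val data =
  Seq(("2022-01-01 00:00:00", 1),
      ("2022-01-01 00:15:00", 1),
      ("2022-01-08 23:30:00", 1),
      ("2022-01-22 23:30:00", 4))
// Name the columns as used below and parse the strings into timestamps
val df = data.toDF("date", "a").withColumn("date", to_timestamp(col("date")))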

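Then apply the window function to the timestamp column and aggregate the column you need, to obtain one result per time slot. The example uses one-week windows; passing "2 weeks" as the window duration (and slide duration) would give the two-week grouping from the question: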

val df0 = 
  df.groupBy(window(col("date"), "1 week", "1 week", "0 minutes"))
    .agg(sum("a") as "sum_a")

The result includes the calculated windows in a struct column named window, with start and end fields. See the documentation for a better understanding of the input parameters: https://spark.apache.org/docs/latest/api/sql/index.html#window.

val df1 = df0.select("window.start", "window.end", "sum_a")

df1.show()

This gives:

+-------------------+-------------------+-----+
|              start|                end|sum_a|
+-------------------+-------------------+-----+
|2022-01-20 01:00:00|2022-01-27 01:00:00|    4|
|2021-12-30 01:00:00|2022-01-06 01:00:00|    2|
|2022-01-06 01:00:00|2022-01-13 01:00:00|    1|
+-------------------+-------------------+-----+
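
The window boundaries in the output fall on Thursdays rather than on a calendar week start. That is because Spark aligns tumbling windows to the Unix epoch (1970-01-01 00:00:00 UTC, a Thursday), shifted by the start-time argument. The 01:00 in the output suggests the session time zone was UTC+1 when this was run, but that is an inference. The plain-Scala sketch below reproduces the alignment arithmetic without Spark. The names WindowAlignment, windowStart and utc are made up for illustration, and timestamps are treated as UTC for simplicity:

```scala
import java.time.{Duration, Instant, LocalDateTime, ZoneOffset}

// Illustrative helper, not part of Spark: mimics how an epoch-aligned
// tumbling window picks its start for a given timestamp.
object WindowAlignment {

  // Start of the window of the given width that contains ts, where windows
  // are aligned to (epoch + startOffset).
  def windowStart(ts: Instant, width: Duration,
                  startOffset: Duration = Duration.ZERO): Instant = {
    val shifted = ts.getEpochSecond - startOffset.getSeconds
    // floorMod keeps the remainder non-negative, also before the epoch
    val intoWindow = Math.floorMod(shifted, width.getSeconds)
    Instant.ofEpochSecond(ts.getEpochSecond - intoWindow)
  }

  // Parse an ISO local date-time string and interpret it as UTC.
  def utc(s: String): Instant =
    LocalDateTime.parse(s).toInstant(ZoneOffset.UTC)
}

// One-week windows with no offset start on Thursdays:
//   WindowAlignment.windowStart(WindowAlignment.utc("2022-01-01T00:00:00"), Duration.ofDays(7))
//   is 2021-12-30T00:00:00Z
// A 3-day offset moves the boundary to Sundays:
//   WindowAlignment.windowStart(WindowAlignment.utc("2022-01-01T00:00:00"), Duration.ofDays(7), Duration.ofDays(3))
//   is 2021-12-26T00:00:00Z
```

In Spark terms, the 3-day offset corresponds to passing "3 days" as the fourth argument of window instead of "0 minutes". Note that pandas' '2W' frequency anchors on Sundays and labels each bin by its right edge, so even with an offset the Spark result will not necessarily match pandas bin for bin.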