I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from a updated_at column during ingestion. Here is how it looks like:
| id | user | updated_at | year | month | day |
|---|---|---|---|---|---|
| 1 | a | 1992-01-19 | 1992 | 1 | 19 |
| 2 | c | 1992-01-20 | 1992 | 1 | 20 |
| 3 | a | 1992-01-21 | 1992 | 1 | 21 |
| ... | ... | ... | ... | ... | ... |
| 720987 | c | 2012-07-20 | 2012 | 7 | 20 |
| 720988 | a | 2012-07-21 | 2012 | 7 | 21 |
| ... | ... | ... | ... | ... | ... |
I need to use Apache Spark to find the 5th earliest event per user.
A simple window function like the one below is impossible since I use a shared cluster and I won't have enough resources to do in-memory processing at any given time due to the size of the dataset.
window = Window.partitionBy("user").orderBy(F.asc("updated_at"))
.withColumn("rank", F.dense_rank().over(window))
.filter(F.col("rank") == 5)
I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!
CodePudding user response:
I think the code below will be faster because data is partitioned by these cols and spark can benefit from data locality.
Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))
