Find nth row per group in a large dataset with Spark-CodePudding

I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from a updated_at column during ingestion. Here is how it looks like:

id	user	updated_at	year	month	day
1	a	1992-01-19	1992	1	19
2	c	1992-01-20	1992	1	20
3	a	1992-01-21	1992	1	21
...	...	...	...	...	...
720987	c	2012-07-20	2012	7	20
720988	a	2012-07-21	2012	7	21
...	...	...	...	...	...

I need to use Apache Spark to find the 5th earliest event per user.

A simple window function like the one below is impossible since I use a shared cluster and I won't have enough resources to do in-memory processing at any given time due to the size of the dataset.

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))

.withColumn("rank", F.dense_rank().over(window))
.filter(F.col("rank") == 5)

I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!

CodePudding user response：

I think the code below will be faster because data is partitioned by these cols and spark can benefit from data locality.

Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))