I am using AWS Athena (Presto based) and I have this table named base:
| id | category | year | month |
|---|---|---|---|
| 1 | a | 2021 | 6 |
| 1 | b | 2022 | 8 |
| 1 | a | 2022 | 11 |
| 2 | a | 2022 | 1 |
| 2 | a | 2022 | 4 |
| 2 | b | 2022 | 6 |
I would like to craft a query that counts the distinct values of the categories per id, cumulatively per month and year, but retaining the original columns:
| id | category | year | month | sumC |
|---|---|---|---|---|
| 1 | a | 2021 | 6 | 1 |
| 1 | b | 2022 | 8 | 2 |
| 1 | a | 2022 | 11 | 2 |
| 2 | a | 2022 | 1 | 1 |
| 2 | a | 2022 | 4 | 1 |
| 2 | b | 2022 | 6 | 2 |
I've tried doing the following query with no success:
SELECT id,
category,
year,
month,
COUNT(category) OVER (PARTITION BY id, ORDER BY year, month) AS sumC FROM base;
This results in 1, 2, 3, 1, 2, 3 which is not what I'm looking for. I'd rather need something like a COUNT(DISTINCT) inside a window function, though it's not supported as a construct.
I also tried the DENSE_RANK trick:
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
- 1 as sumC
Though, because there is no ordering between year and month, it just results in 2, 2, 2, 2, 2, 2.
Any help is appreciated!
CodePudding user response:
One option is
- creating a new column that will contain when each "category" is seen for the first time (partitioning on "id", "category" and ordering on "year", "month")
- computing a running sum over this column, with the same partition
WITH cte AS (
SELECT *,
CASE WHEN ROW_NUMBER() OVER(
PARTITION BY id, category
ORDER BY year, month) = 1
THEN 1
ELSE 0
END AS rn1
FROM base
ORDER BY id,
year_,
month_
)
SELECT id,
category,
year_,
month_,
SUM(rn1) OVER(
PARTITION BY id
ORDER BY year, month
) AS sumC
FROM cte
Does it work for you?
