SQL count distinct over partition by cumulatively-CodePudding

I am using AWS Athena (Presto based) and I have this table named base:

id	category	year	month
1	a	2021	6
1	b	2022	8
1	a	2022	11
2	a	2022	1
2	a	2022	4
2	b	2022	6

I would like to craft a query that counts the distinct values of the categories per id, cumulatively per month and year, but retaining the original columns:

id	category	year	month	sumC
1	a	2021	6	1
1	b	2022	8	2
1	a	2022	11	2
2	a	2022	1	1
2	a	2022	4	1
2	b	2022	6	2

I've tried doing the following query with no success:

SELECT id, 
       category, 
       year, 
       month, 
       COUNT(category) OVER (PARTITION BY id, ORDER BY year, month) AS sumC FROM base;

This results in 1, 2, 3, 1, 2, 3 which is not what I'm looking for. I'd rather need something like a COUNT(DISTINCT) inside a window function, though it's not supported as a construct.

I also tried the DENSE_RANK trick:

  DENSE_RANK() OVER (PARTITION BY id ORDER BY category) 
  DENSE_RANK() OVER (PARTITION BY id ORDER BY category) 
- 1 as sumC

Though, because there is no ordering between year and month, it just results in 2, 2, 2, 2, 2, 2.

Any help is appreciated!

CodePudding user response：

One option is

creating a new column that will contain when each "category" is seen for the first time (partitioning on "id", "category" and ordering on "year", "month")
computing a running sum over this column, with the same partition

WITH cte AS (
    SELECT *, 
           CASE WHEN ROW_NUMBER() OVER(
                         PARTITION BY id, category
                         ORDER     BY year, month) = 1
                THEN 1 
                ELSE 0 
           END AS rn1
    FROM base
    ORDER BY id, 
             year_, 
             month_
)
SELECT id,
       category,
       year_,
       month_,
       SUM(rn1) OVER(
            PARTITION BY id
            ORDER     BY year, month 
       ) AS sumC
FROM cte

Does it work for you?