Moving Distinct Count in Big Query (SQL syntax)-CodePudding

I have sort of a weird one for all the SQL masters out there. I need to get the distinct count of items in a moving window of 14 days. I tried dense_rank but it didn't specify (or I did not know how to) specify the 14 day moving window.

For simplicity my dataset has 3 columns.

store (string)
item code (string)
date (date)

A quick example of my endo goal would be the following:

Day 1 I scan items 1,2,3,4
Day 2 I scan items 2,3,4,5
Day 3 I scan items 1,6

So then for day 1 my uniques would be 4, Day 2 my uniques would be 5 and day 3 my uniques would be 6 (1,2,3,4,5,6)

Once I get to day 15 I would ignore values found in day 1 and only take days 2-15

Any and all help would be greatly appreciated.

CodePudding user response：

Consider below approach

select store, date, 
  ( select count(distinct item) 
    from t.items item
  ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date, 
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t

CodePudding user response：

Another option to consider - with use of HyperLogLog functions - so it consumes less resources and faster

select store, date, 
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch 
  ) distinct_items_count
from (
  select store, date, 
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from (
    select store, date, hll_count.init(item_code) daily_sketch
    from your_table
    group by store, date
  )
) t

Note:

HLL functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.