Choosing variable with MAX timestamp in Snowflake-CodePudding

So I have some data as follows:

ID  status date
001  happy 01-01-2021
001    sad 01-02-2021
002  angry 01-03-2021
003    sad 01-04-2021
004  happy 01-05-2021
003  happy 01-05-2021
004  happy 01-06-2021

And all I want to do is have a table with unique ID and the status on their most recent date.

Final Output:

ID  status date
001    sad 01-02-2021
002  angry 01-03-2021
003  happy 01-05-2021
004  happy 01-06-2021

I know how to do this with a row_number PARTITION but this is very computationally taxing. Any other method of which I can accomplish the above?

CodePudding user response：

The cheap way, is to tell the DB want you want and let it do the work. A partition and then sort is not the expensive.

SELECT a,b,c,e
FROM table
QUALIFY ROW_NUMBER() OVER(PARTITION BY a,b ORDER BY c) = 1;

so to compare "speed" of the current two solutions, let make a table with 10M row:

CREATE TABLE small_data AS
SELECT 
    SEQ8() as seq,
    mod(seq, 20) as part,
    random() as rnd
FROM TABLE(generator(ROWCOUNT => 10000000));

running my code

SELECT seq, part, rnd
FROM small_data
QUALIFY ROW_NUMBER() OVER(PARTITION BY part ORDER BY rnd) = 1;

605 ms to return.

running the self join of Tim's code

SELECT t1.*
FROM small_data t1
WHERE NOT EXISTS (SELECT 1 FROM small_data t2
                  WHERE t2.part = t1.part AND t2.rnd > t1.rnd);

I aborted after 3m 30s

CodePudding user response：

You could try using exists logic as an alternative to ROW_NUMBER:

SELECT t1.*
FROM yourTable t1
WHERE NOT EXISTS (SELECT 1 FROM yourTable t2
                  WHERE t2.ID = t1.ID AND t2.date > t1.date);