How might I write in SQL a query that accumulates a table of rows representing selective updates into a single "latest and greatest" row per BY group?
For instance from the following table:
| index | date | ssn | first | last | title | car | shoe |
|---|---|---|---|---|---|---|---|
| 1 | Apr 1 | 100-00-0001 | Joseph | Schmoe | Analyst | Honda | Adidas |
| 2 | May 1 | 100-00-0001 | Joe | ||||
| 3 | May 16 | 100-00-0001 | Nike | ||||
| 4 | June 20 | 100-00-0001 | Sr Analyst | ||||
| 5 | Jul 13 | 200-22-2222 | Jane | Doe | Architect | ||
| 6 | Aug 4 | 100-00-0001 | Tesla | ||||
| 7 | Sep 9 | 100-00-0001 | VP | All birds | |||
| 8 | Sep 9 | 200-22-2222 | Hoka | ||||
| 9 | Oct 4 | 100-00-0001 | '' | Timberlands | |||
| 10 | Nov 9 | 100-00-0001 | Jeep | ||||
| 11 | Dec 4 | 200-22-2222 | Principal |
I'd like a query that returns something like this:
| index | date | ssn | first | last | title | car | shoe |
|---|---|---|---|---|---|---|---|
| 10 | Nov 9 | 100-00-0001 | Joe | Schmoe | '' | Jeep | Timberlands |
| 11 | Dec 4 | 200-22-2222 | Jane | Doe | Principal | Hoka |
My first instinct is there might be some aggregate function analogous to MAX() that works on an ordered set of rows, e.g. LAST(), e.g. something like:
SELECT
LATEST(index) OVER (PARTITION BY ssn ORDER BY date) AS index,
LATEST(date) OVER (PARTITION BY ssn ORDER BY date) AS date,
LATEST(ssn) OVER (PARTITION BY ssn ORDER BY date) AS ssn,
LATEST(first) OVER (PARTITION BY ssn ORDER BY date) AS first,
LATEST(last) OVER (PARTITION BY ssn ORDER BY date) AS last,
LATEST(title) OVER (PARTITION BY ssn ORDER BY date) AS title,
LATEST(car) OVER (PARTITION BY ssn ORDER BY date) AS car,
LATEST(shoe) OVER (PARTITION BY ssn ORDER BY date) AS shoe
FROM myschema.updates
GROUP BY ssn
Apologies if this is trivial, I just don't yet know what term to search for.
CodePudding user response:
You can try using sub-query with window function and filter the top ones in each ssn.
Below query for your reference(not tested)
SELECT
index,date,snn,first,last,title,car,shoe
FROM
(
SELECT
index,
date,
ssn,
first,
last,
title,
car,
shoe,
ROW_NUMBER() OVER (PARTITION BY ssn ORDER BY date DESC) rn
FROM myschema.updates
) A
WHERE
rn = 1;
CodePudding user response:
If you want the most recent not-null, not-blank value of each column, for each ssn, reported against the last date, then you can use something like this:
select
s1.ssn
, max(s1.date) as date
, max(case when s1.first_rn=1 then first else null end) as first
, max(case when s1.last_rn=1 then last else null end) as last
from (
select t.ssn,
t.date,
t.first,
t.last,
row_number() over (partition by ssn
order by case when coalesce(first,'')=''
then 2 else 1
end asc
, date desc
) as first_rn,
row_number() over (partition by ssn
order by case when coalesce(last,'')=''
then 2 else 1
end asc
, date desc
) as last_rn
from Tbl t
) s1
group by s1.ssn
We first find the most recent row (marking it with xxx_rn=1) with non-empty column (by making sure that nulls/blanks come last in numbering), then in the outer level we summarise, picking up values for each ssn by inspecting the rows with xxx_rn=1 (you could use max or min, it wouldn't matter, because there will only be one xxx_rn=1 record for each ssn in the subquery; min/max will ignore all the others). I have only demonstrated on first and last, you need to repeat for the other columns.
CodePudding user response:
You can aggregate the values into an array then pick the first array element. Unlike standard aggregation functions, array_agg() will not ignore null values, so we need an additional filter clause to get rid of them:
select ssn,
(array_agg("index" order by date desc) filter (where "index" is not null))[1] as "index",
(array_agg("date" order by date desc))[1] as date,
(array_agg("first" order by date desc) filter (where "first" is not null))[1] as "first",
(array_agg("last" order by date desc) filter (where "last" is not null))[1] as "last",
(array_agg(title order by date desc) filter (where title is not null))[1] as title,
(array_agg(car order by date desc) filter (where car is not null))[1] as car,
(array_agg(shoe order by date desc) filter (where shoe is not null))[1] as shoe
from updates
group by ssn;
This assumes that date is a real date column not a text column that stores 'Apr 1'
