Postgresql: Return most recent record based on where conditions-CodePudding

I'm trying to return a set of rows where if a car_id in the table below contains a duplicate, just use the most recent entry by created_on and apply the WHERE clause to the most recent entry. In which case if the most recent entry doesn't satisfy, do not return anything.

SQLFiddle: For the question - http://sqlfiddle.com/#!17/cc0a5/2

Note: Please see edit - statement at the end of question for further clarification

Data

Table: car_wash

Here is a sample table with all the data.

id	car_id	next_clean	end_at	created_on
79d	d48	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235425 00:00
1f3	645	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235434 00:00
397	d41	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235435 00:00
435	372	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235436 00:00
3d5	d41	2022-06-25 18:32:52.068891 00:00	2022-06-25 18:34:32.269289 00:00	2022-06-25 18:34:32.269252 00:00
2df	372	2022-07-25 18:35:32.068891 00:00	2022-06-25 18:35:32.269293 00:00	2022-06-25 18:35:32.269291 00:00

Based on the data, I want to retrieve rows of cars that need to be cleaned i.e their next_clean date has elapsed, ~and the log of their previous recorded clean (end_at) is NULL, i.e the clean never took place or finished.~

In the event a car appears multiple times e.g. car_id: 372 for id: 435 and id: 2df, then only take into consideration the most recent (created_on) record i.e id: 2df and if id: 2df meets the criteria mentioned above return it, if not do not return it or even id: 372. I only care about the most recent entry.

For the attempted postgres query below, assume that now() takes place between the next_clean dates for id: 435 and id: 2df (a month in the future), meaning 2df is in the future and will fail the WHERE check. For those wondering why id: 2df has a populated end_at field even though the next_clean hasn't taken place. I'm using the end_at to signify a clean happened, and the next_clean stores when a new clean should take place)

What I Have Tried

Based on numerous solutions, the use of window functions was suggested, but I am not sure I'm applying it correctly

SELECT *
FROM (SELECT
             id,
             car_id,
             next_clean,
             end_at,
             created_on,
             row_number() over (partition by car_id order by created_on desc) as rn
      FROM car_wash
      WHERE car_wash.user_id = 'some-id'
        AND (car_wash.next_clean <= now())
        AND car_wash.end_at IS NULL
      LIMIT 100 OFFSET 0) t
WHERE rn = 1

Result

id	car_id	next_clean	end_at	created_on	rn
435	372	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235436 00:00	1
1f3	645	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235434 00:00	1
397	d41	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235435 00:00	1
79d	d48	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235425 00:00	1

The problem with this result is that since id: 2df fails the check as it's next_clean is in the future, id: 435 for car_id: 372 is returned which is not what I want (as it unfortunately does pass the check, but I only want to focus on the most recently created car_id).

Intended/Expected

id	car_id	next_clean	end_at	created_on	rn
1f3	645	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235434 00:00	1
397	d41	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235435 00:00	1
79d	d48	2022-06-25 18:29:32.210955 00:00	NULL	2022-06-25 18:34:32.235425 00:00	1

Edit: @jholkin raised a good point regarding why we return id: 3d5 in the first place since end_at is not null. It introduces conflicting priorities. So ideally the WHERE clause should only care that the created_on is most recent if there are duplicates, and that the next_clean is <= now(). The expected result is still the same as now end_at isn't a factor.

CodePudding user response：

You can use DISTINCT ON to produce only the first row per group (per car_id) according to an ordering (created_on DESC). Then, filtering is trivial.

For example:

select *
from (
  select distinct on (car_id) *
  from car_wash
  order by car_id, created_on desc
) x
where next_clean <= now() and end_at is null

CodePudding user response：

if the condition says, that when there are more than two records of the same car_id, it should take the most recent one based on the create_on field and this record must have null value in its end_at field, why should it return the record with id 397? If the most current create_on value is id 3d5. this query should help you:

select t.* 
from ( select distinct on (cw.car_id) cw.id, cw.car_id ,
last_value (cw.next_clean) over (partition by cw.next_clean order by cw.next_clean desc) as next_clean ,
cw.end_at ,
(cw.created_on) as created_on  from car_wash cw ) t 
where t.next_clean <= now()     and t.end_at is null;

CodePudding user response：

you can do it with FIRST_VALUE() window functions:

https://www.postgresqltutorial.com/postgresql-window-function/postgresql-first_value-function/

https://www.postgresql.org/docs/current/tutorial-window.html#:~:text=A window function performs a,done with an aggregate function.
https://www.postgresql.org/docs/current/functions-window.html