I'm trying to return a set of rows where if a car_id in the table below contains a duplicate, just use the most recent entry by created_on and apply the WHERE clause to the most recent entry. In which case if the most recent entry doesn't satisfy, do not return anything.
SQLFiddle: For the question - http://sqlfiddle.com/#!17/cc0a5/2
Note: Please see edit - statement at the end of question for further clarification
Data
Table: car_wash
Here is a sample table with all the data.
| id | car_id | next_clean | end_at | created_on |
|---|---|---|---|---|
| 79d | d48 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235425 00:00 |
| 1f3 | 645 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235434 00:00 |
| 397 | d41 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235435 00:00 |
| 435 | 372 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235436 00:00 |
| 3d5 | d41 | 2022-06-25 18:32:52.068891 00:00 | 2022-06-25 18:34:32.269289 00:00 | 2022-06-25 18:34:32.269252 00:00 |
| 2df | 372 | 2022-07-25 18:35:32.068891 00:00 | 2022-06-25 18:35:32.269293 00:00 | 2022-06-25 18:35:32.269291 00:00 |
Based on the data, I want to retrieve rows of cars that need to be cleaned i.e their next_clean date has elapsed, ~and the log of their previous recorded clean (end_at) is NULL, i.e the clean never took place or finished.~
In the event a car appears multiple times e.g. car_id: 372 for id: 435 and id: 2df, then only take into consideration the most recent (created_on) record i.e id: 2df and if id: 2df meets the criteria mentioned above return it, if not do not return it or even id: 372. I only care about the most recent entry.
For the attempted postgres query below, assume that now() takes place between the next_clean dates for id: 435 and id: 2df (a month in the future), meaning 2df is in the future and will fail the WHERE check. For those wondering why id: 2df has a populated end_at field even though the next_clean hasn't taken place. I'm using the end_at to signify a clean happened, and the next_clean stores when a new clean should take place)
What I Have Tried
Based on numerous solutions, the use of window functions was suggested, but I am not sure I'm applying it correctly
SELECT *
FROM (SELECT
id,
car_id,
next_clean,
end_at,
created_on,
row_number() over (partition by car_id order by created_on desc) as rn
FROM car_wash
WHERE car_wash.user_id = 'some-id'
AND (car_wash.next_clean <= now())
AND car_wash.end_at IS NULL
LIMIT 100 OFFSET 0) t
WHERE rn = 1
Result
| id | car_id | next_clean | end_at | created_on | rn |
|---|---|---|---|---|---|
| 435 | 372 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235436 00:00 | 1 |
| 1f3 | 645 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235434 00:00 | 1 |
| 397 | d41 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235435 00:00 | 1 |
| 79d | d48 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235425 00:00 | 1 |
The problem with this result is that since id: 2df fails the check as it's next_clean is in the future, id: 435 for car_id: 372 is returned which is not what I want (as it unfortunately does pass the check, but I only want to focus on the most recently created car_id).
Intended/Expected
| id | car_id | next_clean | end_at | created_on | rn |
|---|---|---|---|---|---|
| 1f3 | 645 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235434 00:00 | 1 |
| 397 | d41 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235435 00:00 | 1 |
| 79d | d48 | 2022-06-25 18:29:32.210955 00:00 | NULL | 2022-06-25 18:34:32.235425 00:00 | 1 |
Edit: @jholkin raised a good point regarding why we return id: 3d5 in the first place since end_at is not null. It introduces conflicting priorities. So ideally the WHERE clause should only care that the created_on is most recent if there are duplicates, and that the next_clean is <= now(). The expected result is still the same as now end_at isn't a factor.
CodePudding user response:
You can use DISTINCT ON to produce only the first row per group (per car_id) according to an ordering (created_on DESC). Then, filtering is trivial.
For example:
select *
from (
select distinct on (car_id) *
from car_wash
order by car_id, created_on desc
) x
where next_clean <= now() and end_at is null
CodePudding user response:
if the condition says, that when there are more than two records of the same car_id, it should take the most recent one based on the create_on field and this record must have null value in its end_at field, why should it return the record with id 397? If the most current create_on value is id 3d5. this query should help you:
select t.*
from ( select distinct on (cw.car_id) cw.id, cw.car_id ,
last_value (cw.next_clean) over (partition by cw.next_clean order by cw.next_clean desc) as next_clean ,
cw.end_at ,
(cw.created_on) as created_on from car_wash cw ) t
where t.next_clean <= now() and t.end_at is null;
CodePudding user response:
you can do it with FIRST_VALUE() window functions:
https://www.postgresqltutorial.com/postgresql-window-function/postgresql-first_value-function/
https://www.postgresql.org/docs/current/tutorial-window.html#:~:text=A window function performs a,done with an aggregate function.
https://www.postgresql.org/docs/current/functions-window.html
