This is an example of table data that I am working on (the table contained a lot of columns, I am showing here only the relevant ones):
| Id | job_number | status | parent_id |
|---|---|---|---|
| 1 | 42FWD-42 | 0 | 0 |
| 2 | 42FWD-42 | 1 | 1 |
| 3 | 42FWD-42 | 5 | 1 |
Id is auto generated. parent_id links the job using the id.
When a new job is created via the app, a new row is created (with status "0"). The auto-generated Id is then used for subsequent rows of same job, and set as parent id.
Another record with status "1" (which is code for started) is also created just after parent record.
Explanation of the problem: due to a bug in the app, there are duplicate set of rows for the same job.
Example of problem
| Id | job_number | status | parent_id |
|---|---|---|---|
| 1 | 42FWD-42 | 0 | 0 |
| 2 | 42FWD-42 | 0 | 0 |
| 3 | 42FWD-42 | 1 | 1 |
| 4 | 42FWD-42 | 1 | 2 |
| 5 | 42FWD-42 | 5 | 1 |
As you can see from this example, due to the bug, there are 2 rows with "0" status for the same job, and 2 rows with "1" status.
This creates a lot of problems in operation in app where the job is updated using the job number.
The status number should not repeat for a specific job.
What I want to do is to find all duplicates like those in example. For example, I want a query where I can find all duplicates which have same job number, but different parent_id and NO "5" status.
Example result using the example table above, I need the query to return:
| Id | job_number | status | parent_id |
|---|---|---|---|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
Explanation of this result:
Row with
Id=1is considered the correct record because it has an associated record with status "5"Row with
Id=2is considered duplicate and its associated records are also considered duplicate
Another possible case: there are duplicate rows, but none have status=5. These rows can be discarded, ie need not be shown in results.
A brief explanation of how the query works will be appreciated.
CodePudding user response:
you can do this easy with a group by
declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0), (2, '42FWD-42', 0, 0), (3, '42FWD-42', 1, 1), (4, '42FWD-42', 1, 2), (5, '42FWD-42', 5, 1)
select max(t.id) as id, t.job_number, t.status
from @t t
group by t.job_number, t.status
having count(*) > 1
the result is
id job_number status
2 42FWD-42 0
4 42FWD-42 1
and to get also the parent_id you can add a self join
select max(t.id) as id,
t.job_number,
t.status,
(select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id
from @t t
group by t.job_number, t.status
having count(*) > 1
this returns
id job_number status parent_id
2 42FWD-42 0 0
4 42FWD-42 1 2
CodePudding user response:
you can try to use ROW_NUMBER window function to get duplicate row data and its id by job_number, then using cte recursive to find all error records by this id
Query 1:
;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY job_number ORDER BY Id) rn
FROM T
WHERE status = 0
), CTE1 AS (
SELECT id,job_number,status,parent_id
FROM CTE
WHERE rn > 1
UNION ALL
SELECT t.id,t.job_number,t.status,t.parent_id
FROM CTE1 c INNER JOIN T t
ON c.id = t.parent_id
)
SELECT *
FROM CTE1
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
