I have a table that contains the status of units we are monitoring reported throughout the day. I've made a much simplified version of this table below.
| ID | Unit | Time | Status |
|---|---|---|---|
| 1 | A | 10:01 | 1 |
| 2 | A | 10:03 | 1 |
| 3 | B | 10:04 | 5 |
| 4 | A | 10:06 | 1 |
| 5 | B | 10:09 | 4 |
| 6 | B | 10:10 | 4 |
| 7 | A | 10:11 | 2 |
| 8 | A | 10:18 | 2 |
| 9 | B | 10:21 | 4 |
| 10 | A | 10:25 | 2 |
I am provided with a list of IDs for which I need to retrieve the corresponding data plus the status of the same unit in it's previous 2 status updates.
For example, if I'm given ID 8, I need to return:
| Unit | Time | Status | Status_lag1 | Status_lag2 |
|---|---|---|---|---|
| A | 10:18 | 2 | 2 | 1 |
My problem isn't with using the lag function, that's fine. The issue is that if I apply the filter by ID in the same statement, I'm not applying the lag function over the whole table.
SELECT Unit,
Time,
Status,
LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
FROM mydata
WHERE ID = 8;
But when I do this correctly with a subquery – first doing the lag on the whole table, then applying my ID filter – it's quite slow (it's a large table).
SELECT sub.*
FROM (
SELECT Unit,
Time,
Status,
LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
FROM mydata
) sub
WHERE sub.ID = 8;
Any alternative approaches I'm not thinking of that would improve performance? I'm using SQL Server 2017.
CodePudding user response:
I am provided with a list of IDs for which I need to retrieve the corresponding data plus the status of the same unit in it's previous 2 status updates. ... doing the lag on the whole table, then applying my ID filter – it's quite slow (it's a large table)
One way (assuming list of 2 and 8 for example purposes) would be
SELECT md1.Unit,
md1.Time,
md1.Status,
md1.ID,
oa.Status_lag1,
oa.Status_lag2
FROM mydata md1
OUTER APPLY (SELECT MAX(CASE WHEN RN = 1 THEN Status END) Status_lag1,
MAX(CASE WHEN RN = 2 THEN Status END) Status_lag2
FROM (SELECT TOP 2 ROW_NUMBER() OVER (ORDER BY md2.Time DESC) AS RN,
Status
FROM mydata md2
WHERE md1.Unit = md2.Unit
AND md2.Time < md1.Time
ORDER BY md2.Time DESC) t) oa
WHERE md1.ID IN ( 2, 8 )
For the example above this only reads the 5 rows neccessary to calculate the result (assumes an index on id and one on mydata(Unit, TIME) INCLUDE (Status))
i.e. it reads the row for id=2 and then the one preceding row with the same Unit (as only one exists) and it reads the row for id=8 and then the two preceding rows with the same Unit as that.
This should be much less resource intensive if the table is large and you are only interested in a few rows. You'd need to do some testing to find the number of list elements at which it becomes quicker to just scan the whole table rather than do the correlated seek.
CodePudding user response:
I figured out an answer to my own questions. Moving the subquery into a temp table took the execution time from over 1 hour to a few seconds:
SELECT Unit,
Time,
Status,
LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
INTO #LaggedTable
FROM mydata;
SELECT *
FROM #LaggedTable
WHERE sub.ID = 8;
I'll have to dig into the execution plan to figure out what's the actual difference here.

