Home > Software engineering >  Window functions (LAG) with a very specific filter
Window functions (LAG) with a very specific filter

Time:01-06

I have a table that contains the status of units we are monitoring reported throughout the day. I've made a much simplified version of this table below.

ID Unit Time Status
1 A 10:01 1
2 A 10:03 1
3 B 10:04 5
4 A 10:06 1
5 B 10:09 4
6 B 10:10 4
7 A 10:11 2
8 A 10:18 2
9 B 10:21 4
10 A 10:25 2

I am provided with a list of IDs for which I need to retrieve the corresponding data plus the status of the same unit in it's previous 2 status updates.

For example, if I'm given ID 8, I need to return:

Unit Time Status Status_lag1 Status_lag2
A 10:18 2 2 1

My problem isn't with using the lag function, that's fine. The issue is that if I apply the filter by ID in the same statement, I'm not applying the lag function over the whole table.

SELECT  Unit,
        Time,
        Status,
        LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
        LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
FROM    mydata
WHERE ID = 8;

But when I do this correctly with a subquery – first doing the lag on the whole table, then applying my ID filter – it's quite slow (it's a large table).

SELECT sub.*
FROM (
    SELECT  Unit,
            Time,
            Status,
            LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
            LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
    FROM    mydata
) sub
WHERE sub.ID = 8;

Any alternative approaches I'm not thinking of that would improve performance? I'm using SQL Server 2017.

CodePudding user response:

I am provided with a list of IDs for which I need to retrieve the corresponding data plus the status of the same unit in it's previous 2 status updates. ... doing the lag on the whole table, then applying my ID filter – it's quite slow (it's a large table)

One way (assuming list of 2 and 8 for example purposes) would be

SELECT md1.Unit,
       md1.Time,
       md1.Status,
       md1.ID,
       oa.Status_lag1,
       oa.Status_lag2
FROM   mydata md1
       OUTER APPLY (SELECT MAX(CASE WHEN RN = 1 THEN Status END) Status_lag1,
                           MAX(CASE WHEN RN = 2 THEN Status END) Status_lag2
                    FROM   (SELECT TOP 2 ROW_NUMBER() OVER (ORDER BY md2.Time DESC) AS RN,
                                         Status
                            FROM   mydata md2
                            WHERE  md1.Unit = md2.Unit
                                   AND md2.Time < md1.Time
                            ORDER  BY md2.Time DESC) t) oa
WHERE  md1.ID IN ( 2, 8 ) 

For the example above this only reads the 5 rows neccessary to calculate the result (assumes an index on id and one on mydata(Unit, TIME) INCLUDE (Status))

enter image description here

i.e. it reads the row for id=2 and then the one preceding row with the same Unit (as only one exists) and it reads the row for id=8 and then the two preceding rows with the same Unit as that.

This should be much less resource intensive if the table is large and you are only interested in a few rows. You'd need to do some testing to find the number of list elements at which it becomes quicker to just scan the whole table rather than do the correlated seek.

CodePudding user response:

I figured out an answer to my own questions. Moving the subquery into a temp table took the execution time from over 1 hour to a few seconds:

SELECT  Unit,
        Time,
        Status,
        LAG(Status, 1) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag1,
        LAG(STATUS, 2) OVER (PARTITION BY Unit ORDER BY Time) AS Status_lag2
INTO    #LaggedTable
FROM    mydata;

SELECT  *
FROM    #LaggedTable
WHERE   sub.ID = 8;

I'll have to dig into the execution plan to figure out what's the actual difference here.

  •  Tags:  
  • Related