Home > Software design >  SQL Server group by overlapping 10 day intervals
SQL Server group by overlapping 10 day intervals

Time:01-14

I have a table which logs each individual piece produced across several production machines, going back a number of years. Periodically (e.g. once per week) I want to check this table to establish "best performance" records for each machine and product combination, storing them in a new table according to the following rules;

  • The machine must have produced a minimum of 10,000 parts over a 10 day period - if only 9000 parts were produced over 10 days, this is an invalid record
  • The machine must have been running the same product without changing over for the entire period i.e. if on day 5 the product changed, this is an invalid record

The Performance data table looks like below [VisionMachineResults]

ID MCSAP DateTime ProductName InspectionResult
1 123456 2020-01-01 08:29:34:456 Product A 0
2 123456 2020-01-01 08:45:50:456 Product B 1
3 844214 2020-01-01 08:34:48:456 Product A 2
4 978415 2020-01-02 09:29:26:456 Product C 0
5 985633 2020-01-04 23:29:11:456 Product A 2

I am able to produce a result which gives a list of individual days performance per SAP / Product Combination, but I then need to process the data in a complex loop outside of SQL to establish the 10 day groups.

My current query is:

SELECT CAST(DateTime AS date) AS InputDate,
       MCSAP,
       ZAssetRegister.LocalName,
       ProductName,
       SUM(CASE WHEN InspectionResult = 0 THEN 1 END) AS OKParts,
       COUNT(CASE WHEN InspectionResult > 0 THEN 1 END) AS NGParts
FROM [VisionMachineResults]
     INNER JOIN ZAssetRegister ON VisionMachineResults.MCSAP = ZAssetRegister.SAP_Number
GROUP BY CAST(DateTime AS date),
         MCSAP,
         ProductName,
         ZAssetRegister.LocalName
ORDER BY InputDate,
         ZAssetRegister.LocalName;

Would it be possible to have the SQL query give the result in 10 day groups, instead of per individual day i.e.

01-01-2021 to 11-01-2021 | Machine 1 | Product 1 | 20,000 | 5,000
02-01-2021 to 12-01-2021 | Machine 1 | Product 1 | 22,000 | 1,000
03-01-2021 to 13-01-2021 | Machine 1 | Product 1 | 18,000 | 4,000
etc...

I would then iterate through the rows to find the one with the best percentage of OK parts. Any ideas appreciated!

CodePudding user response:

One option that comes to my mind is the use of a numbers table (google Jeff Moden on SQL Server Central for more background).

The number table then uses a start date (from the range of dates to investigate) that in addition to generate a date to link to also generates a "bucket" by which to group afterwards.

Similar to:

-- generate date frame from and to
DECLARE
  @date_start  date = Convert( date, '20211110', 112 ),
  @date_end    date = Convert( date, '20220110', 112 )
;
WITH
  cteN
(
  Number
)
AS
( -- build a list of 10 single digit numbers
  SELECT Cast( 0 AS int ) AS Number UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
,
  cteNumbers
(
  Number
)
AS
( -- splice single digit numbers to list from 0 to 99999
  SELECT
    cN10000.Number * 10000   cN1000.Number * 1000   cN100.Number * 100   cN10.Number * 10   cN1.Number
  FROM
    cteN AS cN10000
    CROSS JOIN cteN AS cN1000
    CROSS JOIN cteN AS cN100
    CROSS JOIN cteN AS cN10
    CROSS JOIN cteN AS cN1
)
,
  cteBucketOffset
( 
  DatediffNum,
  Offset
)
AS
( -- determine the offset in datediffs to number buckets later correctly
  SELECT
    Cast( Datediff( dd, @date_start, @date_end ) AS int ) - 1 AS DatediffNum,
    Cast( Datediff( dd, @date_start, @date_end ) % 10 AS tinyint ) - 1 AS Offset
)
,
  cteDates
(
  Dated,
  Bucket,
  BucketNumber,
  BucketOffset,
  DatediffNum
)
AS
( -- generate list of dates with bucket batches and numbers
  SELECT 
    Dateadd( dd, cN.Number * -1, @date_end ) AS Dated,
    Cast( ( cBO.Offset   cN.Number ) / 10 AS int ) AS Bucket,
    Cast( ( cBO.Offset   cN.Number ) % 10 AS tinyint ) AS BucketNumber,
    cBO.Offset,
    cBO.DatediffNum
  FROM
    cteNumbers AS cN
    CROSS JOIN cteBucketOffset AS cBO
  WHERE
      cN.Number <= Datediff( dd, @date_start, @date_end )
)
SELECT
  *
FROM
  cteDates AS cD
ORDER BY
  cD.Dated ASC
;

Long winded due to showing each step. The result is a table-on-the-fly usable to join back to the raw data. "Bucket" can then be used instead of the date itself to group raw data.

Once this data is built then decisions can be made on the grouped conditions like having a minimum number of rows.

CodePudding user response:

This process needs to be considered in many levels. First, you mention 10 consecutive days. We dont know if those days include weekends, if the machines are running 24/7. If the dates running can skip over holidays as well? So, 10 days could be Jan 1 to Jan 10. But if you skip weekends, you only have 6 actual WEEKDAYS.

Next, consideration of a machine working on more than one product, such as a switching between dates, or even within a single day.

As a commenter indicated, having column names by same as a reserved word (such as DateTime), bad practice and try to see if any new columns are common key words that may cause confusion and avoid them.

You also mention that you had to do complex looping checks, and how to handle joining out to 10 days, the splits, etc. I think I have a somewhat elegant approach to doing this and should prove to be rather simple in the scheme of things.'

You are using SQL-Server, so I will do this using TEMP tables via "#" table names. This way, when you are done with a connection, or a call to making this a stored procedure, you dont have to keep deleting and recreating them. That said, let me take you one-step-at-a-time.

First, I'm creating a simple table matching your structure, even with the DateTime context.

CREATE TABLE VisionMachineResults
(
    ID int IDENTITY(1,1) NOT NULL,
    MCSAP nvarchar(6) NOT NULL,
    DateTime datetime NOT NULL,
    ProductName nvarchar(10) NOT NULL,
    InspectionResult int NOT NULL,
    CONSTRAINT ID PRIMARY KEY CLUSTERED 
    (
        [ID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Now, I'm inserting the data, similar to what you have, but not millions of rows. You mention you are looking for 10 days out, so I just padded the end with several extras to simulate that. I also explicitly forced a gap change of product by the one machine on Jan 5th. Additionally, I added a product change on Jan 7th to trigger this a "break" within your 10-day consideration. You'll see the results later.

insert into VisionMachineResults
    (MCSAP, [DateTime], ProductName, InspectionResult )
values
( '123456', '2020-01-01 08:29:34.456', 'Product A', 0 ),
( '123456', '2020-01-01 08:29:34.456', 'Product B', 1 ),
( '844214', '2020-01-01 08:29:34.456', 'Product A', 2 ),
( '978415', '2020-01-02 08:29:34.456', 'Product C', 0 ),
( '985633', '2020-01-04 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-05 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-05 08:29:34.456', 'Product B', 0 ),
( '985633', '2020-01-06 08:29:34.456', 'Product A', 2 ),
( '985633', '2020-01-07 08:29:34.456', 'Product B', 0 ),
( '985633', '2020-01-08 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-09 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-10 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-11 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-12 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-13 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-14 08:29:34.456', 'Product A', 1 ),
( '985633', '2020-01-15 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-16 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-17 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-18 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-19 08:29:34.456', 'Product A', 0 ),
( '985633', '2020-01-20 08:29:34.456', 'Product A', 0 )
go

So now, consider this the baseline of YOUR production data. My first query will be doing a bunch of things, but storing the pre-aggregations INTO #tmpPartDailyCounts result table. This way you can look at them at the different stages to apply sanity check to my approach.

Here, on a per machine (MCSAP), and Date (without time portion), I am grabbing certain aggregates, and keeping them grouped by machine and date.

select
        VMR.MCSAP,
        cast(VMR.DateTime as Date) as InputDate,
        min( VMR.ProductName ) ProductName,
        max( VMR.ProductName ) LastProductName,
        count( distinct VMR.ProductName ) as MultipleProductsSameDay,
        sum( case when VMR.InspectionResult = 0 then 1 else 0 end ) OKParts,
        sum( case when NOT VMR.InspectionResult = 0 then 1 else 0 end ) BadParts,
        count(*) TotalParts
    into
        #tmpPartDailyCounts
    from
        VisionMachineResults VMR
    group by
        VMR.MCSAP,
        cast(VMR.DateTime as Date)

You were joining to an asset table and dont think you really need that. If the machine made the product, does it matter if a final assembly is complete? Dont know, you would know better.

Now, the aggregates and why. The min( VMR.ProductName ) ProductName and max( VMR.ProductName ) LastProductName, this is just to carry-forward the product name created on the date in question for any final output. If on a given day, only one product was made, it would be the same anyhow, just pick one. However, if on any day there are multiple products, the MIN() and MAX() will be of different values. If the same product across all that are built, then both values would be the same -- ON ANY SINGLE GIVEN DATE.

The rest are simple aggregates of OK parts, BAD parts (something was wrong), but also the TOTAL Parts created, regardless of any inspection failure. This is the primary qualifier for you to hit you 10,000, but if you wanted to change to 10,000 GOOD parts, change accordingly.

select
        VMR.MCSAP,
        cast(VMR.DateTime as Date) as InputDate,
        min( VMR.ProductName ) ProductName,
        max( VMR.ProductName ) LastProductName,
        count( distinct VMR.ProductName ) as MultipleProductsSameDay,
        sum( case when VMR.InspectionResult = 0 then 1 else 0 end ) OKParts,
        sum( case when NOT VMR.InspectionResult = 0 then 1 else 0 end ) BadParts,
        count(*) TotalParts
    into
        #tmpPartDailyCounts
    from
        VisionMachineResults VMR
    group by
        VMR.MCSAP,
        cast(VMR.DateTime as Date)

Now, at this point, I have a pre-aggregation done on a per machine and date basis. Now, I want to get some counter that is sequentially applied on a per date that a product was done. I will pull this result into a temp table #tmpPartDays. By using the over/partition, this will create a result that first puts the records in order of MCSAP, then by the date and dumps an output with whatever the ROW_NUMBER() is to that. So, if there is no activity for a given machine such as over a weekend or holiday that the machine is not running, the SEQUENTIAL counter via OVER/PARTITION will keep them sequentially 1 through however many days... Again, query the result of this table and you'll see it.

By querying against the pre-aggregated table, that may account for 500k records and results down to say 450 via per machine/day, This query is now only querying against the 450 and will be very quick.

   SELECT 
            PDC.MCSAP,
            PDC.InputDate,
            MultipleProductsSameDay,
            ROW_NUMBER() OVER(PARTITION BY MCSAP 
                            ORDER BY [InputDate] ) 
                AS CapDay
        into
            #tmpPartDays
        FROM 
            #tmpPartDailyCounts PDC
        ORDER BY 
            PDC.MCSAP;  

Now, is the kicker, tying this all together. I'm starting with just the #tmpPartDays JOINED to itself on the same MCSAP AND a MUST-HAVE matching record 10 days out... So this resolves issues of weekend / holidays since serial consecutive.

This now give me the begin/end date range such as 1-10, 2-11, 3-12, 4-13, etc.

I then join to the tmpPartDailyCounts result on the same part AND the date is at the respective begin (PD.InputDate) and END (PD2.InputDate). I re-apply the same aggregates to get the total counts WITHIN EACH Part 10 day period. Run this query WITHOUT the "HAVING" clause to see what is coming out.

select
        PD.MCSAP,
        PD.InputDate BeginDate,
        PD2.InputDate EndDate,
        SUM( PDC.MultipleProductsSameDay ) as TotalProductsMade,
        sum( PDC.OKParts ) OKParts,
        sum( PDC.BadParts ) BadParts,
        sum( PDC.TotalParts ) TotalParts,
        min( PDC.ProductName ) ProductName,
        max( PDC.LastProductName ) LastProductName
    from
        #tmpPartDays PD
            -- join again to get 10 days out for the END cycle
            JOIN #tmpPartDays PD2
                on PD.MCSAP = PD2.MCSAP
                AND PD.CapDay  9 = PD2.CapDay
                -- Now join to daily counts for same machine and within the 10 day period
                JOIN #tmpPartDailyCounts PDC
                    on PD.MCSAP = PDC.MCSAP
                    AND PDC.InputDate >= PD.InputDate
                    AND PDC.InputDate <= PD2.InputDate
    group by
        PD.MCSAP,
        PD.InputDate,
        PD2.InputDate
    having
            SUM( PDC.MultipleProductsSameDay ) = 10
        AND min( PDC.ProductName ) = max( PDC.LastProductName ) 
        AND SUM( PDC.TotalParts ) >= 10
    

Finally, the elimination of the records you DONT want. Since I dont have millions of records to simulate, just follow along. I am doing a HAVING on

  1. SUM( PDC.TotalParts ) >= 10

    1. SUM( PDC.MultipleProductsSameDay ) = 10

If on ANY day there are MORE than 1 product created, the count would be 11 or more, thus indicating not the same product, so that would cause an exclusion. But also, if at the tail-end of data such as only 7 days of production, it would never HIT 10 which was your 10-day qualifier also.

2.   AND min( PDC.ProductName ) = max( PDC.LastProductName ) 

Here, since we are spanning back to the DAILY context, if ANY product changes on any date, the Product Name (via min) and LastProductName (via max) will change, regardless of the day, and regardless of the name context. So, by making sure both the min() and max() are the same, you know it is the same product across the entire span.

3.  AND SUM( PDC.TotalParts ) >= 10

Finally, the count of things made. In this case, I did >= 10 because I was only testing with 1 item per day, thus 10 days = 10 items. In your scenario, you may have 987 in one day, but 1100 in another, thus balancing low and high production days to get to that 10,000, but for sample of data, just change YOUR context to the 10,000 limit minimum.

This SQLFiddle shows the results as it gets down to the per machine/day and showing the sequential activity. The last MCSAP machine starts on Jan 4th, but has a sequential day row assignment starting at 1 to give proper context to the 1-10, 2-11, etc.

First SQL Fiddle showing machine/day

Second fiddle shows final query WITHOUT the HAVING clause and you can see the first couple rows of TotalProductsMade is 11 which means SOMETHING on any of the day-span in question created different products and would be excluded from final. For the begin and end dates of Jan 6-15 and Jan 7-16, you will see the MIN/MAX products showing Product A and Product B, thus indicating that SOMEWHERE within its 10-day span a product switched... These too will be excluded.

The FINAL query This query shows the results with the HAVING clause applied.

CodePudding user response:

Seems just a matter of grouping on the year and the day of the year divided by 10.

SELECT 
  CONCAT(CONVERT(VARCHAR(10),MIN([DateTime]),105), ' to ', CONVERT(VARCHAR(10), MAX([DateTime]), 105)) AS InputDateRange
, MCSAP
, MAX(ZAssetRegister.LocalName) AS LocalName
, ProductName
, SUM(CASE WHEN InspectionResult = 0 THEN 1 END) AS OKParts
, COUNT(CASE WHEN InspectionResult > 0 THEN 1 END) AS NGParts
, COUNT(DISTINCT CAST([Datetime] AS DATE)) AS total_days
FROM VisionMachineResults
JOIN ZAssetRegister
  ON VisionMachineResults.MCSAP = ZAssetRegister.SAP_Number
GROUP BY 
  DATEPART(YEAR, [DateTime]), 
  CEILING(DATEPART(DAYOFYEAR, [DateTime])/10.0), 
  MCSAP,
  ProductName
ORDER BY 
  MIN([DateTime]),
  MAX(ZAssetRegister.LocalName);

Simplified test on db<>fiddle here

  •  Tags:  
  • Related