Postgres SQL - How do I aggregated groups of multiple consecutive rows?-CodePudding

I have a data representing tagged continuous-spans in a single table with a <tag, start & stop>.

Example below.

I'm trying to combine multiple rows into a single row where the condition is that they create a "continuous span". In the query below - I would like the functionality that LEFT_MOST_CONTINUOUS returns the minimum v_start of a continuous span (same for RIGHT_MOST_CONTINUOUS for maximum v_stop). Note that there might be more than a single continuous span (that should have different v_start and v_stop values).

Input:

WITH data AS (
    SELECT *
    FROM (VALUES 
        ('a', 2, 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
)
SELECT tag,
       LEFT_MOST_CONTINUOUS(v_start) OVER (PARTITION BY tag),
       RIGHT_MOST_CONTINUOUS(v_stop) OVER (PARTITION BY tag)
FROM data
ORDER BY 1, 2, 3

Where I expect to get the following output:

"a" 2   7
"a" 8   14
"b" 7   10
"b" 12  15
"c" 10  11

Since I want to merge the first 3 tuples (for tag "a") which are consecutive into a single value representing the entire span; same for the next 3 tuples (again for "a"). Then for "b" we can merge the next 2, but leave out the 3rd (which has it's v_start != the other's v_stop). And "c" there is nothing to merge with.

Help appreciated,

Tal

CodePudding user response：

You can use a gaps-and-islands approach by marking the first record of each group when either there is no previous record for the tag or the v_start is greater than v_stop of the previous record:

select tag, v_start, v_stop, 
         coalesce(lag(v_stop) over w < v_start, true) as is_end_grp
    from data
  window w as (partition by tag order by v_start)

Use a windowed sum() of the boolean is_end_grp cast to int (1 if true, 0 if false) to number the groups:

  select tag, sum(is_end_grp::int) over (partition by tag 
                                             order by v_start) as grp_num,
         v_start, v_stop
    from mark_gaps

Aggregation over (tag, grp_num) will produce your desired result:

select tag, min(v_start) as v_start, max(v_stop) as v_stop
  from numbered_groups
 group by tag, grp_num
 order by tag, v_start

Working DB<>Fiddle

CodePudding user response：

Using the numbered_groups logic from @Mike Organek answer. I just started from a different place

WITH data AS (
    SELECT *
    , case when lead(v_start) over(partition by tag order by v_start) = v_stop then 0 else 1 end stopcheck
    , case when lag(v_stop) over(partition by tag order by v_stop) = v_start then 0 else 1 end startcheck
    FROM (VALUES 
        ('a' , 2 , 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
)
,cnt as (
  select *
  , sum(startcheck) over (partition by tag order by v_start) grpn 
  from data)
select c1.tag, c1.v_start, c2.v_stop
from cnt c1 
inner join cnt c2 
  on c1.tag = c2.tag and c1.grpn = c2.grpn
  where c1.startcheck = 1 and c2.stopcheck = 1

This logic is all based on the assumption that your data always starts where the last row left off, there is no overlap etc.

Create a startcheck and stopcheck by comparing the prior row and next row relatively. From here use another window function sum() over to order the start records (so we don't match start of second batch to stop of first batch)

Join the table to itself matching like tag and groups. Filtering start and stop records

CodePudding user response：

You can use following query

WITH data AS (
    SELECT *
    FROM (VALUES 
        ('a', 2, 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
),
cte1 as(
   select *,
      case
        when lag(v_stop)over(partition by tag order by(select null)) = v_start
          then 0
          else 1
      end as grp
  from data
),
cte2 as(
  select *, 
         sum(grp) over (partition by tag  order by v_start) as rnk
  from cte1
)

select tag,min(v_start)v_start,max(v_stop)v_stop
from cte2
group by tag,rnk
order by tag

Demo in db<>fiddle