I have a data representing tagged continuous-spans in a single table with a <tag, start & stop>.
Example below.
I'm trying to combine multiple rows into a single row where the condition is that they create a "continuous span".
In the query below - I would like the functionality that LEFT_MOST_CONTINUOUS returns the minimum v_start of a continuous span (same for RIGHT_MOST_CONTINUOUS for maximum v_stop). Note that there might be more than a single continuous span (that should have different v_start and v_stop values).
Input:
WITH data AS (
SELECT *
FROM (VALUES
('a', 2, 3),
('a', 3, 5),
('a', 5, 7),
('a', 8, 10),
('a', 10, 12),
('a', 12, 14),
('b', 7, 8),
('b', 8, 10),
('b', 12, 15),
('c', 10, 11)
) AS T(tag, v_start, v_stop)
ORDER BY tag, v_start, v_stop
)
SELECT tag,
LEFT_MOST_CONTINUOUS(v_start) OVER (PARTITION BY tag),
RIGHT_MOST_CONTINUOUS(v_stop) OVER (PARTITION BY tag)
FROM data
ORDER BY 1, 2, 3
Where I expect to get the following output:
"a" 2 7
"a" 8 14
"b" 7 10
"b" 12 15
"c" 10 11
Since I want to merge the first 3 tuples (for tag "a") which are consecutive into a single value representing the entire span; same for the next 3 tuples (again for "a"). Then for "b" we can merge the next 2, but leave out the 3rd (which has it's v_start != the other's v_stop). And "c" there is nothing to merge with.
Help appreciated,
Tal
CodePudding user response:
You can use a gaps-and-islands approach by marking the first record of each group when either there is no previous record for the tag or the v_start is greater than v_stop of the previous record:
select tag, v_start, v_stop,
coalesce(lag(v_stop) over w < v_start, true) as is_end_grp
from data
window w as (partition by tag order by v_start)
Use a windowed sum() of the boolean is_end_grp cast to int (1 if true, 0 if false) to number the groups:
select tag, sum(is_end_grp::int) over (partition by tag
order by v_start) as grp_num,
v_start, v_stop
from mark_gaps
Aggregation over (tag, grp_num) will produce your desired result:
select tag, min(v_start) as v_start, max(v_stop) as v_stop
from numbered_groups
group by tag, grp_num
order by tag, v_start
CodePudding user response:
Using the numbered_groups logic from @Mike Organek answer. I just started from a different place
WITH data AS (
SELECT *
, case when lead(v_start) over(partition by tag order by v_start) = v_stop then 0 else 1 end stopcheck
, case when lag(v_stop) over(partition by tag order by v_stop) = v_start then 0 else 1 end startcheck
FROM (VALUES
('a' , 2 , 3),
('a', 3, 5),
('a', 5, 7),
('a', 8, 10),
('a', 10, 12),
('a', 12, 14),
('b', 7, 8),
('b', 8, 10),
('b', 12, 15),
('c', 10, 11)
) AS T(tag, v_start, v_stop)
ORDER BY tag, v_start, v_stop
)
,cnt as (
select *
, sum(startcheck) over (partition by tag order by v_start) grpn
from data)
select c1.tag, c1.v_start, c2.v_stop
from cnt c1
inner join cnt c2
on c1.tag = c2.tag and c1.grpn = c2.grpn
where c1.startcheck = 1 and c2.stopcheck = 1
This logic is all based on the assumption that your data always starts where the last row left off, there is no overlap etc.
Create a startcheck and stopcheck by comparing the prior row and next row relatively. From here use another window function sum() over to order the start records (so we don't match start of second batch to stop of first batch)
Join the table to itself matching like tag and groups. Filtering start and stop records
CodePudding user response:
You can use following query
WITH data AS (
SELECT *
FROM (VALUES
('a', 2, 3),
('a', 3, 5),
('a', 5, 7),
('a', 8, 10),
('a', 10, 12),
('a', 12, 14),
('b', 7, 8),
('b', 8, 10),
('b', 12, 15),
('c', 10, 11)
) AS T(tag, v_start, v_stop)
ORDER BY tag, v_start, v_stop
),
cte1 as(
select *,
case
when lag(v_stop)over(partition by tag order by(select null)) = v_start
then 0
else 1
end as grp
from data
),
cte2 as(
select *,
sum(grp) over (partition by tag order by v_start) as rnk
from cte1
)
select tag,min(v_start)v_start,max(v_stop)v_stop
from cte2
group by tag,rnk
order by tag
Demo in db<>fiddle
