I'm very new to BigQuery and not terribly familiar with SQL. I have a table of data that looks like this, where MyDate is a Timestamp object:
| Row | MyDate | StateTransition | MyIdentifier |
|---|---|---|---|
| 1 | 2022-09-23 00:08:00 UTC | Start | 6371 |
| 2 | 2022-10-10 01:17:14 UTC | Finished | 6371 |
| 3 | 2022-09-26 04:51:40 UTC | Start | 7768 |
| 4 | 2022-10-05 03:44:32 UTC | Finished | 7768 |
etc.
My query looks something like
SELECT *
FROM <my-data-source>
WHERE (StateTransition="Start" OR StateTransition="Finished")
ORDER BY MyIdentifier, MyDate
What I'm trying to do is calculate the elapsed time (in days) between the Start and Finished timestamps associated with each MyIdentifier, and to have that displayed in another column. It could look like:
| Row | MyDate | StateTransition | MyIdentifier | ElapsedTime |
|---|---|---|---|---|
| 1 | 2022-09-23 00:08:00 UTC | Start | 6371 | |
| 2 | 2022-10-10 01:17:14 UTC | Finished | 6371 | 0.33 |
| 3 | 2022-09-26 04:51:40 UTC | Start | 7768 | |
| 4 | 2022-10-05 03:44:32 UTC | Finished | 7768 | 0.04 |
Alternatively, it could even be flattened a little to something like:
| Row | StartTransition | FinishedTransition | MyIdentifier | ElapsedTime |
|---|---|---|---|---|
| 1 | 2022-09-23 00:08:00 UTC | 2022-10-10 01:17:14 UTC | 6371 | 0.33 |
| 2 | 2022-09-26 04:51:40 UTC | 2022-10-05 03:44:32 UTC | 7768 | 0.04 |
I've tried looking through the BigQuery docs and Stack Overflow but haven't found anything that addresses this use case of selecting items from multiple rows with a common identifier and then performing an operation on them. It seems like subtracting the two timestamps would be done with the 
But for the intermediate result, we need a window function.
SELECT *,
IF(
StateTransition = 'Finished',
TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
NULL
) AS ElapsedTime
FROM sample_table
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);
and if you want flattend result from the above result (using a window function), the query will looks like below which shows same result as the first query using an aggregation.
SELECT MyIdentifier,
FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w AS StartTransition,
MyDate AS FinishedTransition,
IF(
StateTransition = 'Finished',
TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
NULL
) AS ElapsedTime
FROM sample_table
QUALIFY StateTransition = 'Finished'
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);
CodePudding user response:
I think that for each MyIdentifier you should have only one start and one finish, so you can simply split and join:
;WITH
ts AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Start'),
tf AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Finished')
SELECT
ts.MyIdentifier,
ts.MyDate StartTransition,
tf.MyDate FinishedTransition,
TIMESTAMP_DIFF(ts.MyDate, tf.MyDate, DAY) ElapsedTime
FROM ts
LEFT JOIN tf on ts.MyIdentifier = tf.MyIdentifier

