I have a large dataset (~300,000 rows) of fish detections. Each detection has a date, a station (location), and a tagID, among many other variables like temperature, depth, etc. I want to pull out the first and last detection for each station, every time the fish visits that station. The end goal is to compute residency time at each station before it moves, and again when it comes back.
Here is a small example of the data
tagID <- c("8272", "8272", "8272", "8272", "8272", "8272", "8272", "8272", "8272", "8272")
date <- c("2020-07-12", "2020-07-12", "2020-07-13", "2020-07-13", "2020-07-16", "2020-07-17", "2020-07-20", "2020-07-29", "2020-07-30", "2020-08-04")
station <- c("4", "4", "4", "5", "5", "6", "6", "6", "4", "4")
temp <- c("10", "9", "11", "12", "10", "12", "11", "12", "12", "9")
depth <- c("6.14", "34.2", "21", "23.5", "15.4", "54", "32.4", "23", "33.3", "32.7")
df <- data.frame(tagID, date, station, temp, depth)
with the dataframe looking like:
tagID date station temp depth
1 8272 2020-07-12 4 10 6.14
2 8272 2020-07-12 4 9 34.2
3 8272 2020-07-13 4 11 21
4 8272 2020-07-13 5 12 23.5
5 8272 2020-07-16 5 10 15.4
6 8272 2020-07-17 6 12 54
7 8272 2020-07-20 6 11 32.4
8 8272 2020-07-29 6 12 23
9 8272 2020-07-30 4 12 33.3
10 8272 2020-08-04 4 9 32.7
I would like to find an efficient way to go through all 300K rows and extract something like:
tagID date station temp depth
1 8272 2020-07-12 4 10 6.14
3 8272 2020-07-13 4 11 21
4 8272 2020-07-13 5 12 23.5
5 8272 2020-07-16 5 10 15.4
6 8272 2020-07-17 6 12 54
8 8272 2020-07-29 6 12 23
9 8272 2020-07-30 4 12 33.3
10 8272 2020-08-04 4 9 32.7
This shows the first and last detection while the fish was at station 4, and then the first and last detection again when the fish comes back to station 4 later in the season.
I've looked at questions like Select first and last row from grouped data and Select the first and last row by group in a data frame, and other similar questions, but none of those account for a 2nd (3rd, 4th, n... time) the group (in my case: station) appears in the data.
Please let me know if you can help. Thank you. (This is my first question on stack overflow, any tips for future questions are helpful)
CodePudding user response:
My approach here is to tag each occasion that a fish changes stations, and then count the cumulative number of those changes. Then we can group by fish and # of station changes, and filter for the first and last of each.
library(dplyr)
df %>%
group_by(tagID) %>%
mutate(station_chg = station != lag(station, default = ""),
station_cuml = cumsum(station_chg)) %>%
group_by(tagID, station_cuml) %>%
slice(1, n()) %>%
ungroup()
Result
# A tibble: 8 x 7
tagID date station temp depth station_chg station_cuml
<chr> <chr> <chr> <chr> <chr> <lgl> <int>
1 8272 2020-07-12 4 10 6.14 TRUE 1
2 8272 2020-07-13 4 11 21 FALSE 1
3 8272 2020-07-13 5 12 23.5 TRUE 2
4 8272 2020-07-16 5 10 15.4 FALSE 2
5 8272 2020-07-17 6 12 54 TRUE 3
6 8272 2020-07-29 6 12 23 FALSE 3
7 8272 2020-07-30 4 12 33.3 TRUE 4
8 8272 2020-08-04 4 9 32.7 FALSE 4
CodePudding user response:
Here is a data.table approach. As mentioned by @Henrik in the comments, you can use rleid to create a new column to group by, instead of using station as values for station get repeated. rleid. Then, for each group, it will include the first and last .N values. Note that unique is added to consider situations when only one row of data may be present for a given group. I hope this may be a fast solution for you.
library(data.table)
setDT(df)
df[ , id := rleid(station)][ , .SD[unique(c(1, .N))], by = id]
Output
id tagID date station temp depth
1: 1 8272 2020-07-12 4 10 6.14
2: 1 8272 2020-07-13 4 11 21
3: 2 8272 2020-07-13 5 12 23.5
4: 2 8272 2020-07-16 5 10 15.4
5: 3 8272 2020-07-17 6 12 54
6: 3 8272 2020-07-29 6 12 23
7: 4 8272 2020-07-30 4 12 33.3
8: 4 8272 2020-08-04 4 9 32.7
