I have people's trip records as a data frame in R. Which looks as below:
t_participant_id t_destination_PostCode tripReasonString tripSequence
1 304 1082 work 1
2 304 1000 work 2
3 304 1000 work 3
4 304 1000 other 4
5 304 1000 other 5
6 304 1082 other 6
7 304 1082 home 7
8 362 1070 shopping 1
9 362 1070 work 2
10 362 1070 work 3
11 362 1070 shopping 4
12 362 1180 home 5
13 362 1070 leisure 6
14 362 1180 home 7
15 482 2800 work 1
16 482 2800 work 2
17 482 1020 home 3
Basically, t_participant_id is a person's unique ID, tripReason is the reason for people's trip, and the t_destination_postcode is the area that they are going to (for instance, in row 1, person 304 go to area 1082 for work).
Different people have different numbers of trips during one day (person 304 and 362has 7 trips in total, whereas person 482 has 3). I would like to filter out the person that the t_destination_PostCode for the trip purpose work is the same as other work trips under this person's activity chain.
In short, for person 304, his first work trip is at line 1, and going to the area 1082. The next work trip is at line 2, and going to the area 1000. So I know he is not the person I would like to filter out. However, for person 362, he has the first work activity at tripSequence == 2, heading to 1070. The next work trip is at tripSequence 3, and also to 1070. No other work trips in the activity chain anymore. So in this case, we would like to filter out person 362, and record the first work tripSequence, which is 2.
So the final result should be a data frame that says
t_participant_id firstWorkTrip
362 2
482 1
I really don't want to do this manually but not sure how can I achieve this in R. Thanks very much for your help in advance!
Here is the reprex for the dataframe
test <- data.frame(list(t_participant_id = c(304L, 304L, 304L, 304L, 304L,
304L, 304L, 362L, 362L, 362L, 362L, 362L, 362L, 362L, 482L, 482L,
482L), t_destination_PostCode = c(1082L, 1000L, 1000L, 1000L,
1000L, 1082L, 1082L, 1070L, 1070L, 1070L, 1070L, 1180L, 1070L,
1180L, 2800L, 2800L, 1020L), tripReasonString = c("work", "work",
"work", "other", "other", "other", "home", "shopping", "work",
"work", "shopping", "home", "leisure", "home", "work", "work",
"home"), tripSequence = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L)))
CodePudding user response:
df %>%
filter(tripReasonString == "work") %>%
group_by(t_participant_id) %>%
filter(n_distinct(t_destination_PostCode) == 1) %>%
summarize(first_work_trip = min(tripSequence))
# # A tibble: 2 × 2
# t_participant_id first_work_trip
# <int> <int>
# 1 362 2
# 2 482 1
