I am working with a dataset that is in a wide format in R. It is a user generated dataset, with each row per UserID showing consecutive events per date. Here is a mock example of the type of data I am working with:-
df<-structure(list(UserID = c("hdyyu-192", "yeui-1893", "dnnd-1882", "nopr-738", "ieka-1728"),
`05/06/2021` = c("Activity B", "Activity A", "Activity B", "Activity C", "Activity D"),
`06/06/2021` = c("Activity B", "Activity A", "Activity A", "Activity C", "Activity D"),
`07/06/2021` = c("Activity D","Activity A", NA, "Activity C", "Activity D"),
`08/06/2021` = c("Activity A","Activity B", NA, "Activity B", "Activity A"),
`09/06/2021` = c("Activity A","Activity B", NA, "Activity B", "Activity C"),
`10/06/2021` = c("Activity C",NA, NA, NA, NA),
`11/06/2021` = c("Activity B", NA, NA, NA, NA)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,-5L))
head(df)
# A tibble: 5 x 8
UserID `05/06/2021` `06/06/2021` `07/06/2021` `08/06/2021` `09/06/2021` `10/06/2021` `11/06/2021`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hdyyu-192 Activity B Activity B Activity D Activity A Activity A Activity C Activity B
2 yeui-1893 Activity A Activity A Activity A Activity B Activity B NA NA
3 dnnd-1882 Activity B Activity A NA NA NA NA NA
4 nopr-738 Activity C Activity C Activity C Activity B Activity B NA NA
5 ieka-1728 Activity D Activity D Activity D Activity A Activity C NA NA
What I am trying to do is count the frequency of consecutive events; in other words, the frequency the antecedent/consequent event.
Here is the desired output:-
#desired outcome
Antecedent | Consequent | Count
Activity A | Activity A | 3
Activity A | Activity B | 1
Activity A | Activity C | 1
........................................
Activity D | Activity A | 2
Activity D | Activity B | 0
Activity D | Activity C | 0
Activity D | Activity D | 2
Can someone kindly suggest an approach in R that would let me achieve this output? Many thanks :)
CodePudding user response:
I'd do using string searches. We can create all possible combinations with expand.grid and then use do.call to convert them to strings, doing the same on df (without the UserID column) and just use stringr::str_count() to get the number of instances in each row, the sum being the total number of times it's occurred in df. This helps us maintain the ordering you want.
Note, we need to use a lookahead in the regex to count overlapping instances.
library(stringr)
activities <- paste0("Activity ", LETTERS[1:5])
activities_df <- expand.grid(activities, activities)
activities_df$count <- sapply(
do.call(paste, activities_df),
\(x) sum(str_count(do.call(paste, df[,-1]), paste0("(?=", x, ")")))
)
head(activities_df)
#> Var1 Var2 count
#> 1 Activity A Activity A 3
#> 2 Activity B Activity A 1
#> 3 Activity C Activity A 0
#> 4 Activity D Activity A 2
#> 5 Activity E Activity A 0
#> 6 Activity A Activity B 1
Note that \(x) is the shorthand for function(x) in R >= 4.1.0, just change to function(x) if using a previous version.
CodePudding user response:
With data.table:
library(data.table)
setDT(df)
DT <- melt(df,id.vars = "UserID",variable.name = 'Timestamp',value.name = 'Activity')
DT[,TimeStamp:=as.Date(Timestamp,format='%d/%m/%Y')]
DT <- DT[order(UserID,Timestamp)][,Activity_prec:=shift(Activity),by=.(UserID)]
DT[!is.na(Activity_prec)&!is.na(Activity),.(Count=.N),by=.(Activity_prec,Activity)][order(Activity_prec,Activity)]
Activity_prec Activity Count
<char> <char> <int>
1: Activity A Activity A 3
2: Activity A Activity B 1
3: Activity A Activity C 2
4: Activity B Activity A 1
5: Activity B Activity B 3
6: Activity B Activity D 1
7: Activity C Activity B 2
8: Activity C Activity C 2
9: Activity D Activity A 2
10: Activity D Activity D 2
CodePudding user response:
Another possible solution, using tidyverse:
library(tidyverse)
df %>%
pivot_longer(-UserID) %>%
group_by(UserID) %>%
mutate(aux = str_c(lag(value), value, sep = "-")) %>%
ungroup %>% select(aux) %>%
mutate(aux2 = "count") %>%
pivot_wider(id_cols = aux, names_from = aux2,
values_from = aux2, values_fn = length) %>% drop_na(aux) %>%
separate(aux, into = c("Antecedent", "Consequent"), sep = "-") %>%
arrange(Antecedent, Consequent)
#> # A tibble: 10 × 3
#> Antecedent Consequent count
#> <chr> <chr> <int>
#> 1 Activity A Activity A 3
#> 2 Activity A Activity B 1
#> 3 Activity A Activity C 2
#> 4 Activity B Activity A 1
#> 5 Activity B Activity B 3
#> 6 Activity B Activity D 1
#> 7 Activity C Activity B 2
#> 8 Activity C Activity C 2
#> 9 Activity D Activity A 2
#> 10 Activity D Activity D 2
In case one is interested in showing also the zero counts:
df %>%
pivot_longer(-UserID) %>%
group_by(UserID) %>%
mutate(aux = str_c(lag(value), value, sep = "-")) %>%
ungroup %>% select(aux) %>%
mutate(aux2 = "count") %>%
pivot_wider(id_cols = aux, names_from = aux2,
values_from = aux2, values_fn = length) %>% drop_na(aux) %>%
separate(aux, into = c("Antecedent", "Consequent"), sep = "-") %>%
bind_rows(df[-1] %>% unlist %>% as.vector() %>% unique %>%
expand_grid(Antecedent = ., Consequent = .) ) %>%
drop_na(c(Antecedent, Consequent)) %>%
group_by(Antecedent, Consequent) %>%
summarise(count = sum(count, na.rm = T), .groups = "drop") %>%
arrange(Antecedent, Consequent)
#> # A tibble: 16 × 3
#> Antecedent Consequent count
#> <chr> <chr> <int>
#> 1 Activity A Activity A 3
#> 2 Activity A Activity B 1
#> 3 Activity A Activity C 2
#> 4 Activity A Activity D 0
#> 5 Activity B Activity A 1
#> 6 Activity B Activity B 3
#> 7 Activity B Activity C 0
#> 8 Activity B Activity D 1
#> 9 Activity C Activity A 0
#> 10 Activity C Activity B 2
#> 11 Activity C Activity C 2
#> 12 Activity C Activity D 0
#> 13 Activity D Activity A 2
#> 14 Activity D Activity B 0
#> 15 Activity D Activity C 0
#> 16 Activity D Activity D 2
