I have large dataframe of e-mails that looks like this:
structure(list(messagesubject = c("Re: Fuentes/referencias para tutorial YoPeatón",
"Re: Fuentes/referencias para tutorial YoPeatón", "Protejamos nuestras playas",
"Protejamos nuestras playas", "Protejamos nuestras playas", "Protejamos nuestras playas",
"Protejamos nuestras playas", "Protejamos nuestras playas", "Protejamos nuestras playas",
"Protejamos nuestras playas", "documento", "documento", "documento"
), senderaddress = c("6607", "7998", "4799", "9731", "4799",
"79", "9731", "4799", "9731", "4799", "1846", "1846", "1846"
), recipient = c("94529", "163910", "289", "289", "270177", "2310",
"1118", "1118", "2551", "2551", "753", "12291", "106610"), datetimesent = structure(c(1515632629,
1515632629, 1515632636, 1515632636, 1515632639, 1515632640, 1515632641,
1515632641, 1515632647, 1515632647, 1515632680, 1515632680, 1515632680
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
I need to extract e-mails with the same "messagesubject", same "senderaddress" and within a 5 seconds timeframe ("datetimesent") from each other. I´m cleaning a dataset and most probably these are just the same e-mail but delivered with lag.
I am able to extract exact repetitions with:
d[(duplicated(d[c(4,2,1)]) | duplicated(d[c(4,2,1)], fromLast = TRUE)), ]
but I have no idea of how to do it considering a timeframe in "datetimesent"
The result should have lines 3,5,8 (have the same senderaddress, same messagesubject and have a maximun of 5 seconds between them, as well as lines 4 and 7. It should look like:
messagesubject senderaddress recipient datetimesent
1 Protejamos nuestras playas 4799 289 2018-01-11 01:03:56
2 Protejamos nuestras playas 9731 289 2018-01-11 01:03:56
3 Protejamos nuestras playas 4799 270177 2018-01-11 01:03:59
4 Protejamos nuestras playas 9731 1118 2018-01-11 01:04:01
5 Protejamos nuestras playas 4799 1118 2018-01-11 01:04:01
CodePudding user response:
I think this does what you want:
# first ensure file is in order
df<-df %>% arrange(senderaddress, datetimesent)
# if the sender address is the same, test if subject is the same
df$diffsub<-ifelse(df$senderaddress==data.table::shift(df$senderaddress,1), ifelse(df$messagesubject!=data.table::shift(df$messagesubject,1),"Different","Same Subj"),NA)
# if diffSub is "Same Subj" compute time difference
df$difftime<-ifelse(df$diffsub=="Same Subj",df$datetimesent-data.table::shift(df$datetimesent,1),NA)
# if diffsub=Same Subj, email is also the same, so
df # examine
CodePudding user response:
We can do this using grouping in tidyverse. The key is to group by subject and address, order by time, then find the time difference with the previous and next message.
We can then filter for groups with multiple rows with a message <= 5 seconds before or after it:
library(dplyr)
library(lubridate)
d %>%
group_by(messagesubject, senderaddress) %>%
arrange(datetimesent) %>%
mutate(lag = (datetimesent-lag(datetimesent)), # time from previous msg
lead = lead(datetimesent)-datetimesent) %>% # time to next msg
filter(lag <= seconds(5) | lead <= seconds(5), # require msg within 5sec
n() > 1) # require >1 msg with sub/add
# A tibble: 8 × 6
# Groups: messagesubject, senderaddress [3]
messagesubject senderaddress recipient datetimesent lag lead
<chr> <chr> <chr> <dttm> <drtn> <drtn>
1 Protejamos nuestras playas 4799 289 2018-01-11 01:03:56 NA secs 3 secs
2 Protejamos nuestras playas 9731 289 2018-01-11 01:03:56 NA secs 5 secs
3 Protejamos nuestras playas 4799 270177 2018-01-11 01:03:59 3 secs 2 secs
4 Protejamos nuestras playas 9731 1118 2018-01-11 01:04:01 5 secs 6 secs
5 Protejamos nuestras playas 4799 1118 2018-01-11 01:04:01 2 secs 6 secs
6 documento 1846 753 2018-01-11 01:04:40 NA secs 0 secs
7 documento 1846 12291 2018-01-11 01:04:40 0 secs 0 secs
8 documento 1846 106610 2018-01-11 01:04:40 0 secs NA secs
