Home > database >  Identify duplicates with two fixed values and within a range in r
Identify duplicates with two fixed values and within a range in r

Time:01-20

I have large dataframe of e-mails that looks like this:

structure(list(messagesubject = c("Re: Fuentes/referencias para tutorial YoPeatón", 
"Re: Fuentes/referencias para tutorial YoPeatón", "Protejamos nuestras playas", 
"Protejamos nuestras playas", "Protejamos nuestras playas", "Protejamos nuestras playas", 
"Protejamos nuestras playas", "Protejamos nuestras playas", "Protejamos nuestras playas", 
"Protejamos nuestras playas", "documento", "documento", "documento"
), senderaddress = c("6607", "7998", "4799", "9731", "4799", 
"79", "9731", "4799", "9731", "4799", "1846", "1846", "1846"
), recipient = c("94529", "163910", "289", "289", "270177", "2310", 
"1118", "1118", "2551", "2551", "753", "12291", "106610"), datetimesent = structure(c(1515632629, 
1515632629, 1515632636, 1515632636, 1515632639, 1515632640, 1515632641, 
1515632641, 1515632647, 1515632647, 1515632680, 1515632680, 1515632680
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA, 
-13L), class = c("tbl_df", "tbl", "data.frame"))

I need to extract e-mails with the same "messagesubject", same "senderaddress" and within a 5 seconds timeframe ("datetimesent") from each other. I´m cleaning a dataset and most probably these are just the same e-mail but delivered with lag.

I am able to extract exact repetitions with:

d[(duplicated(d[c(4,2,1)]) | duplicated(d[c(4,2,1)], fromLast = TRUE)), ]

but I have no idea of how to do it considering a timeframe in "datetimesent"

The result should have lines 3,5,8 (have the same senderaddress, same messagesubject and have a maximun of 5 seconds between them, as well as lines 4 and 7. It should look like:

messagesubject             senderaddress recipient datetimesent       
              
1 Protejamos nuestras playas 4799          289       2018-01-11 01:03:56
2 Protejamos nuestras playas 9731          289       2018-01-11 01:03:56
3 Protejamos nuestras playas 4799          270177    2018-01-11 01:03:59
4 Protejamos nuestras playas 9731          1118      2018-01-11 01:04:01
5 Protejamos nuestras playas 4799          1118      2018-01-11 01:04:01

CodePudding user response:

I think this does what you want:

# first ensure file is in order
df<-df %>% arrange(senderaddress, datetimesent)

# if the sender address is the same, test if subject is the same
df$diffsub<-ifelse(df$senderaddress==data.table::shift(df$senderaddress,1), ifelse(df$messagesubject!=data.table::shift(df$messagesubject,1),"Different","Same Subj"),NA) 


# if diffSub is "Same Subj" compute time difference   
df$difftime<-ifelse(df$diffsub=="Same Subj",df$datetimesent-data.table::shift(df$datetimesent,1),NA)  

# if diffsub=Same Subj, email is also the same, so
df # examine

CodePudding user response:

We can do this using grouping in tidyverse. The key is to group by subject and address, order by time, then find the time difference with the previous and next message.

We can then filter for groups with multiple rows with a message <= 5 seconds before or after it:

library(dplyr)
library(lubridate)

d %>%
    group_by(messagesubject, senderaddress) %>%
    arrange(datetimesent) %>%
    mutate(lag = (datetimesent-lag(datetimesent)),     # time from previous msg
           lead = lead(datetimesent)-datetimesent) %>% # time to next msg
    filter(lag <= seconds(5) | lead <= seconds(5), # require msg within 5sec
           n() > 1)                                # require >1 msg with sub/add

# A tibble: 8 × 6
# Groups:   messagesubject, senderaddress [3]
  messagesubject             senderaddress recipient datetimesent        lag     lead   
  <chr>                      <chr>         <chr>     <dttm>              <drtn>  <drtn> 
1 Protejamos nuestras playas 4799          289       2018-01-11 01:03:56 NA secs  3 secs
2 Protejamos nuestras playas 9731          289       2018-01-11 01:03:56 NA secs  5 secs
3 Protejamos nuestras playas 4799          270177    2018-01-11 01:03:59  3 secs  2 secs
4 Protejamos nuestras playas 9731          1118      2018-01-11 01:04:01  5 secs  6 secs
5 Protejamos nuestras playas 4799          1118      2018-01-11 01:04:01  2 secs  6 secs
6 documento                  1846          753       2018-01-11 01:04:40 NA secs  0 secs
7 documento                  1846          12291     2018-01-11 01:04:40  0 secs  0 secs
8 documento                  1846          106610    2018-01-11 01:04:40  0 secs NA secs

  •  Tags:  
  • Related