I am somewhat new to data.table & trying to replicate my dplyr code to data.table but unable to get the same results.
libs
library(data.table)
library(lubridate)
library(tidyverse)
df
(Have not used any NA in this dummy data but need to filter out NA's)
test_df <- data.frame(id = c(1234, 1234, 5678, 5678),
date = c("2021-10-10","2021-10-10", "2021-8-10", "2021-8-15")) %>%
mutate(date = ymd(date))
dplyr code:
Find out ids that have more than one distinct date.
test_df %>%
group_by(id) %>%
filter(!is.na(date)) %>%
distinct(date) %>%
count(id) %>%
filter(n > 1)
id n
5678 2
data.table attempt:
test_dt <- setDT(test_df)
test_dt[!is.na(date), by = id][
,keyby = .(date)][
,.N, by = id][
N > 1
]
CodePudding user response:
The distinct in dplyr can be unique in data.table with by option
unique(setDT(test_df)[!is.na(date)], by = c("id", "date"))[, .N, by = id][N > 1]
id N
1: 5678 2
Steps are as follows
- Convert to data.table (
setDT) - Remove the rows with
NAfrom 'date' (!is.na(date)) - Get the
uniquerowsbythe 'id' and 'date' column - Do a group by 'id' to get the count (
.N) - Finally, filter the rows where count is greater than 1
