I have a data.table like below
d_in_small <- data.table( score = c(94.272159,94.751695,106.588274,109.361865,111.017053,113.703024,118.638858,124.007419,125.695266,125.696385,94.272191,94.751649,95.496644,108.902001,113.703027),
group = c('a','a','a','a','a','a','a','a','a','a','a','a','a','a','a'))
| id | score | group |
|---|---|---|
| 1 | 94.2721 | a |
| 2 | 94.7516 | a |
| 3 | 106.5882 | a |
| 4 | 109.3618 | a |
| 5 | 111.0170 | a |
| 6 | 113.7030 | a |
| 7 | 118.6388 | a |
| 8 | 124.0074 | a |
| 9 | 125.6952 | a |
| 10 | 125.6963 | a |
| 11 | 94.2721 | a |
| 12 | 94.7516 | a |
| 13 | 95.4966 | a |
| 14 | 108.9020 | a |
| 15 | 113.7030 | a |
I would like to identify which scores occur only once in the dataset. e.g. score in row 1 is in row 11 as well, but score in row 3 is unique.
So, Is there any way to easily identify non-duplicated items in a data.table and add them in a new column to get an output like below?
| id | score | group | tag |
|---|---|---|---|
| 1 | 94.2721 | a | |
| 2 | 94.7516 | a | |
| 3 | 106.5882 | a | outlier |
| 4 | 109.3618 | a | outlier |
| 5 | 111.0170 | a | outlier |
| 6 | 113.7030 | a | |
| 7 | 118.6388 | a | outlier |
| 8 | 124.0074 | a | outlier |
| 9 | 125.6952 | a | outlier |
| 10 | 125.6963 | a | outlier |
| 11 | 94.2721 | a | |
| 12 | 94.7516 | a | |
| 13 | 95.4966 | a | outlier |
| 14 | 108.9020 | a | outlier |
| 15 | 113.7030 | a |
The DT snap here is a sample, so they will be grouped by the column group when applied in the entire dataset.
Thanks
CodePudding user response:
Try:
d_in_small[,outlier:=fifelse(.N==1,'outlier',NA),by=score]
Note that the example dataset you provide has only outliers because of the last decimals.
You could group by round(score,decimals) to get the precision you want to detect outliers.
For example:
d_in_small[,outlier:=fifelse(.N==1,'outlier',NA),by=round(score,4)][]
score group outlier
<num> <char> <char>
1: 94.27216 a <NA>
2: 94.75169 a outlier
3: 106.58827 a outlier
4: 109.36186 a outlier
5: 111.01705 a outlier
6: 113.70302 a <NA>
7: 118.63886 a outlier
8: 124.00742 a outlier
9: 125.69527 a outlier
10: 125.69639 a outlier
11: 94.27216 a <NA>
12: 94.75165 a outlier
13: 95.49664 a outlier
14: 108.90200 a outlier
15: 113.70303 a <NA>
CodePudding user response:
Here is an option. I define a rounding function to make the code more readable.
f <- function(x) round(x, 4)
d_in_small[, tag := duplicated(f(score)) | duplicated(f(score), fromLast = TRUE)]
d_in_small[, tag := c("outlier", "")[tag 1L]]
d_in_small
CodePudding user response:
Mark the duplicates first and then change the FALSE value to outlier could be one way to handle it
d_in_small[, tag := .N > 1, by = score]
d_in_small$tag <- ifelse(d_in_small$tag=='FALSE','outlier','TRUE')
