Home > database >  data.table tag rows that have no duplicate values within a column
data.table tag rows that have no duplicate values within a column

Time:01-13

I have a data.table like below

d_in_small <- data.table( score = c(94.272159,94.751695,106.588274,109.361865,111.017053,113.703024,118.638858,124.007419,125.695266,125.696385,94.272191,94.751649,95.496644,108.902001,113.703027),
                        group = c('a','a','a','a','a','a','a','a','a','a','a','a','a','a','a'))
id score group
1 94.2721 a
2 94.7516 a
3 106.5882 a
4 109.3618 a
5 111.0170 a
6 113.7030 a
7 118.6388 a
8 124.0074 a
9 125.6952 a
10 125.6963 a
11 94.2721 a
12 94.7516 a
13 95.4966 a
14 108.9020 a
15 113.7030 a

I would like to identify which scores occur only once in the dataset. e.g. score in row 1 is in row 11 as well, but score in row 3 is unique.

So, Is there any way to easily identify non-duplicated items in a data.table and add them in a new column to get an output like below?

id score group tag
1 94.2721 a
2 94.7516 a
3 106.5882 a outlier
4 109.3618 a outlier
5 111.0170 a outlier
6 113.7030 a
7 118.6388 a outlier
8 124.0074 a outlier
9 125.6952 a outlier
10 125.6963 a outlier
11 94.2721 a
12 94.7516 a
13 95.4966 a outlier
14 108.9020 a outlier
15 113.7030 a

The DT snap here is a sample, so they will be grouped by the column group when applied in the entire dataset.

Thanks

CodePudding user response:

Try:

d_in_small[,outlier:=fifelse(.N==1,'outlier',NA),by=score]

Note that the example dataset you provide has only outliers because of the last decimals. You could group by round(score,decimals) to get the precision you want to detect outliers.

For example:

d_in_small[,outlier:=fifelse(.N==1,'outlier',NA),by=round(score,4)][]
        score  group outlier
        <num> <char>  <char>
 1:  94.27216      a    <NA>
 2:  94.75169      a outlier
 3: 106.58827      a outlier
 4: 109.36186      a outlier
 5: 111.01705      a outlier
 6: 113.70302      a    <NA>
 7: 118.63886      a outlier
 8: 124.00742      a outlier
 9: 125.69527      a outlier
10: 125.69639      a outlier
11:  94.27216      a    <NA>
12:  94.75165      a outlier
13:  95.49664      a outlier
14: 108.90200      a outlier
15: 113.70303      a    <NA>

CodePudding user response:

Here is an option. I define a rounding function to make the code more readable.

f <- function(x) round(x, 4)

d_in_small[, tag := duplicated(f(score)) | duplicated(f(score), fromLast = TRUE)]
d_in_small[, tag := c("outlier", "")[tag   1L]]
d_in_small

CodePudding user response:

Mark the duplicates first and then change the FALSE value to outlier could be one way to handle it

d_in_small[, tag := .N > 1, by = score]
d_in_small$tag <- ifelse(d_in_small$tag=='FALSE','outlier','TRUE')
  •  Tags:  
  • Related