R: remove all records that have duplicates based on more than one variable

Time:01-16

I know there are many questions about duplicate removal, but I couldn't find anything that matches my needs.

I have

df<-data.frame(var1=c("A", "A", "B", "B", "C", "D", "E"), var2=c(1, 2, 3, 4,5, 5, 6 ))

A is mapped to 1, 2

B is mapped to 3, 4

5 is mapped to C, D

and only E is uniquely mapped to 6, and 6 is uniquely mapped to E.

I would like to filter the dataset so that only

   var1 var2
7    E    6

is returned. Base or tidyverse solutions are welcome.

I have tried

unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)

but none of these gives the desired result.
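A short sketch of why these attempts leave the data unchanged, with df recreated so the snippet runs on its own: duplicated(df) and distinct(var1, var2) only drop rows that are repeated as a whole, and no complete row of df occurs twice. (In unique(df$var1, df$var2), the second argument is taken as incomparables, not as a second column.) The variable names n_dup_rows, rep_var1 and rep_var2 are made up for this illustration.

```r
df <- data.frame(
  var1 = c("A", "A", "B", "B", "C", "D", "E"),
  var2 = c(1, 2, 3, 4, 5, 5, 6)
)

# Whole-row check: 0 means no complete row is repeated,
# so row-wise deduplication has nothing to remove
n_dup_rows <- anyDuplicated(df)

# Per-column check: values that occur more than once within a column
rep_var1 <- unique(df$var1[duplicated(df$var1)])  # "A" "B"
rep_var2 <- unique(df$var2[duplicated(df$var2)])  # 5
```

The repeats only show up when each column is inspected separately, which is what the filter needs to act on.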

CodePudding user response:

With a custom function that checks whether a mapping is unique, you could get the desired result like so:

df <- data.frame(
  var1 = c("A", "A", "B", "B", "C", "D", "E"),
  var2 = c(1, 2, 3, 4, 5, 5, 6)
)

# For each row: 1 if all rows sharing its y value have a single distinct x value, else 0
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(v) length(unique(v)) == 1)

df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#>   var1 var2
#> 7    E    6
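A shorter base R variant, offered as a sketch rather than a drop-in replacement: flag every value that occurs more than once in a column by running duplicated() from both ends, then keep the rows where neither column is flagged. The helper name occurs_once is made up for this example, and the approach assumes df contains no fully repeated rows (a row appearing twice would be dropped here but kept by is_unique).

```r
df <- data.frame(
  var1 = c("A", "A", "B", "B", "C", "D", "E"),
  var2 = c(1, 2, 3, 4, 5, 5, 6)
)

# TRUE for values that appear exactly once in x
# (hypothetical helper, not part of base R)
occurs_once <- function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))

# Keep rows whose var1 and var2 values are both non-repeated
res <- df[occurs_once(df$var1) & occurs_once(df$var2), ]
res
#>   var1 var2
#> 7    E    6
```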

CodePudding user response:

Using igraph tools:

library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
cmp$membership[cmp$membership %in% which(cmp$csize == 2)]
# E 6 
# 4 4

plot(g)
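To get the rows of df back instead of the membership vector, one possible sketch is to collect the vertex names in the two-vertex components and subset on them. It assumes the igraph package is installed and that var1 and var2 never share a label, since graph_from_data_frame places both columns in one vertex namespace; keep and res are names invented for this example.

```r
library(igraph)

df <- data.frame(
  var1 = c("A", "A", "B", "B", "C", "D", "E"),
  var2 = c(1, 2, 3, 4, 5, 5, 6)
)

g <- graph_from_data_frame(df)
cmp <- components(g)

# Names of vertices in components with exactly two vertices,
# i.e. one var1 value paired with one var2 value
keep <- names(cmp$membership)[cmp$membership %in% which(cmp$csize == 2)]

# Rows whose var1 value belongs to such a component
res <- df[df$var1 %in% keep, ]
```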
