How to identify pairs of rows within a group in R?-CodePudding

I am wondering how to identify pairs in a table. The pair is defined by a set of columns with the same information in the row, but in one of the columns, the information of both rows differ. It may also differs for other columns as well. See the example below:

a1 <- c(100,110,100,200,105,110) 
a2 <- c(10,5,10,50,10,5)
a3 <- c(1,1,5,3,1,4)
a4 <- c('A','B','A','C','B','B')
a5 <- c('car blue','bus red','car blue','bicycle','bus red','bus red')
a6 <- c(0,0,1,1,0,1)
A <- data.frame(a1,a2,a3,a4,a5,a6)

In this example, there are two pairs: rows (1,3) and (2,6). You can see that all the information in the columns a1, a2, a4 and a5 are the same for the pairs, whereas the last column contains a different information defined by the category 0 and 1. The column a3 is also different but it does not define the pair. The pair is determined when one of them has 0 in the last column and the other one has 1. It is possible that we find a fake pair, with the same information for the 5 columns but being both with the category 0 (or both 1) in the last column. These ones I want to remove. So, can you help me with this? According to the example above, I need to find the corresponding table

b1 <- c(100,110,100,110)
b2 <- c(10,5,10,5)
b3 <- c(1,1,5,4)
b4 <- c('A','B','A','B')
b5 <- c('car blue','bus red','car blue','bus red')
b6 <- c(0,0,1,1)
B <- data.frame(b1,b2,b3,b4,b5,b6)

Can you help me with that?

Thanks a lot!

CodePudding user response：

We can do it these ways.

base R

A[ave(A$a6, A[,c("a1","a2","a4","a5")], FUN = function(z) all(0:1 %in% z)) > 0,]
#    a1 a2 a3 a4       a5 a6
# 1 100 10  1  A car blue  0
# 2 110  5  1  B  bus red  0
# 3 100 10  5  A car blue  1
# 6 110  5  4  B  bus red  1

dplyr

library(dplyr)
A %>%
  group_by(a1, a2, a4, a5) %>%
  filter(all(0:1 %in% a6)) %>%
  ungroup()
# # A tibble: 4 x 6
#      a1    a2    a3 a4    a5          a6
#   <dbl> <dbl> <dbl> <chr> <chr>    <dbl>
# 1   100    10     1 A     car blue     0
# 2   110     5     1 B     bus red      0
# 3   100    10     5 A     car blue     1
# 4   110     5     4 B     bus red      1

data.table

library(data.table)
> as.data.table(A)[, .SD[all(0:1 %in% a6),], by = .(a1, a2, a4, a5)]
    a1 a2 a4       a5 a3 a6
1: 100 10  A car blue  1  0
2: 100 10  A car blue  5  1
3: 110  5  B  bus red  1  0
4: 110  5  B  bus red  4  1

Data

A <- structure(list(a1 = c(100, 110, 100, 200, 105, 110), a2 = c(10, 5, 10, 50, 10, 5), a3 = c(1, 1, 5, 3, 1, 4), a4 = c("A", "B", "A", "C", "B", "B"), a5 = c("car blue", "bus red", "car blue", "bicycle", "bus red", "bus red"), a6 = c(0, 0, 1, 1, 0, 1)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response：

Another solution is using duplicated(), but remember that it does not consider the first (or the last) as a duplicate. So using it once with and once without fromLast = T you get all duplicates.

cols <- c("a1", "a2", "a4", "a5")
A[duplicated(A[cols]) | duplicated(A[cols], fromLast = T), ]

   a1 a2 a3 a4       a5 a6
1 100 10  1  A car blue  0
2 110  5  1  B  bus red  0
3 100 10  5  A car blue  1
6 110  5  4  B  bus red  1

CodePudding user response：

an optional dplyr approach can be this:

A %>%
    # build the groupings
    dplyr::group_by(a1, a2, a4, a5) %>%
    # select those  groups with exactly two rows, given that theire A6 sum is equal to 1 
    dplyr::filter(n() == 2 & sum(a6) == 1 & a6 %in% 0:1) %>%
    # good practise is to ungroup to avoid unwanted behaviour down stream
    dplyr::ungroup()

# A tibble: 4 x 6
     a1    a2    a3 a4    a5          a6
  <dbl> <dbl> <dbl> <chr> <chr>    <dbl>
1   100    10     1 A     car blue     0
2   110     5     1 B     bus red      0
3   100    10     5 A     car blue     1
4   110     5     4 B     bus red      1