Remove duplicate only if same content within a column for each groups-CodePudding

I have a dataframe such as :

Groups Subgroups Species Anothercolum
G1     1         SP1     jkl
G1     1         SP2     ds
G1     2         SP1     9DS
G1     2         SP2     JD
G1     3         SP1     Js2_0
G1     3         SP2     KUIO
G2     1         SP4     JUI
G2     1         SP5     NA
G2     1         SP6     NA
G2     2         SP4     900
G2     2         SP5     LO
G2     2         SP7     dez
G2     3         SP4     dez
G2     3         SP5     ge
G2     3         SP7     gez

And I would like to remove duplicate elements within Groups and Subgroups but only when there is the same content of Species.

For instance, for G1:

Groups Subgroups Species
G1     1         SP1
G1     1         SP2
G1     2         SP1
G1     2         SP2
G1     3         SP1
G1     3         SP2

I keep the first Subgroups1 since SP1 and SP2 are both in Subgroup1 and Subgroup2;

Groups Subgroups Species
G1     1         SP1
G1     1         SP2

But for G2 :

Groups Subgroups Species
    G2     1         SP4
    G2     1         SP5
    G2     1         SP6
    G2     2         SP4
    G2     2         SP5
    G2     2         SP7
    G2     3         SP4
    G2     3         SP5
    G2     3         SP7

The Subgroup1 does not have the same 3 species as subgroup2 and 3, so I keep that one, but remove the subgroup3 and keep subgroup2 since they have the same species.

I should then get the following expected output :

Groups Subgroups Species Anothercolum
G1     1         SP1     jkl
G1     1         SP2     ds
G2     1         SP4     JUI
G2     1         SP5     NA
G2     1         SP6     NA
G2     2         SP4     900
G2     2         SP5     LO
G2     2         SP7     dez

Here is the dataframe if it can help:

structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G1", 
"G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2"), Subgroups = c(1L, 
1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Species = c("SP1", 
"SP2", "SP1", "SP2", "SP1", "SP2", "SP4", "SP5", "SP6", "SP4", 
"SP5", "SP7", "SP4", "SP5", "SP7")), class = "data.frame", row.names = c(NA, 
-15L))

CodePudding user response：

A base R option

merge(
  df,
  subset(
    aggregate(
      Species ~ .,
      df,
      c
    ),
    as.logical(ave(Species, Groups, FUN = Negate(duplicated))),
    select = c(Groups, Subgroups)
  )
)

gives

  Groups Subgroups Species
1     G1         1     SP1
2     G1         1     SP2
3     G2         1     SP5
4     G2         1     SP4
5     G2         1     SP6
6     G2         2     SP7
7     G2         2     SP4
8     G2         2     SP5

CodePudding user response：

You can use duplicated to find duplicates even for a list. Something like this

library(dplyr)
library(tidyr)

df %>% 
  group_by(Groups, Subgroups) %>% 
  summarize(across(c(Species, Anothercolum), list), .groups = "drop") %>% 
  filter(!duplicated(Species)) %>% 
  unnest(c(Species, Anothercolum))

Output

# A tibble: 8 x 4
  Groups Subgroups Species Anothercolum
  <chr>      <int> <chr>   <chr>       
1 G1             1 SP1     jkl         
2 G1             1 SP2     ds          
3 G2             1 SP4     JUI         
4 G2             1 SP5     NA          
5 G2             1 SP6     NA          
6 G2             2 SP4     900         
7 G2             2 SP5     LO          
8 G2             2 SP7     dez

Another (more efficient) version using paste0 instead of list since you can find this warning from ?duplicated

Warning
Using this for lists is potentially slow, especially if the elements are not atomic vectors (see vector) or differ only in their attributes. In the worst case it is O(n^2).

library(dplyr)
library(tidyr)

df %>% 
  group_by(Groups, Subgroups) %>% 
  summarize(across(c(Species, Anothercolum), paste0, collapse = ","), .groups = "drop") %>% 
  filter(!duplicated(Species)) %>%
  separate_rows(Species, Anothercolum, sep = ",", convert = TRUE)

For this one, you will have to be careful about the separator used. However, both should be fine if you do not have a large dataset.