I have a dataframe such as :
Groups Subgroups Species Anothercolum
G1 1 SP1 jkl
G1 1 SP2 ds
G1 2 SP1 9DS
G1 2 SP2 JD
G1 3 SP1 Js2_0
G1 3 SP2 KUIO
G2 1 SP4 JUI
G2 1 SP5 NA
G2 1 SP6 NA
G2 2 SP4 900
G2 2 SP5 LO
G2 2 SP7 dez
G2 3 SP4 dez
G2 3 SP5 ge
G2 3 SP7 gez
And I would like to remove duplicate elements within Groups and Subgroups but only when there is the same content of Species.
For instance, for G1:
Groups Subgroups Species
G1 1 SP1
G1 1 SP2
G1 2 SP1
G1 2 SP2
G1 3 SP1
G1 3 SP2
I keep the first Subgroups1 since SP1 and SP2 are both in Subgroup1 and Subgroup2;
Groups Subgroups Species
G1 1 SP1
G1 1 SP2
But for G2 :
Groups Subgroups Species
G2 1 SP4
G2 1 SP5
G2 1 SP6
G2 2 SP4
G2 2 SP5
G2 2 SP7
G2 3 SP4
G2 3 SP5
G2 3 SP7
The Subgroup1 does not have the same 3 species as subgroup2 and 3, so I keep that one, but remove the subgroup3 and keep subgroup2 since they have the same species.
I should then get the following expected output :
Groups Subgroups Species Anothercolum
G1 1 SP1 jkl
G1 1 SP2 ds
G2 1 SP4 JUI
G2 1 SP5 NA
G2 1 SP6 NA
G2 2 SP4 900
G2 2 SP5 LO
G2 2 SP7 dez
Here is the dataframe if it can help:
structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G1",
"G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2"), Subgroups = c(1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Species = c("SP1",
"SP2", "SP1", "SP2", "SP1", "SP2", "SP4", "SP5", "SP6", "SP4",
"SP5", "SP7", "SP4", "SP5", "SP7")), class = "data.frame", row.names = c(NA,
-15L))
CodePudding user response:
A base R option
merge(
df,
subset(
aggregate(
Species ~ .,
df,
c
),
as.logical(ave(Species, Groups, FUN = Negate(duplicated))),
select = c(Groups, Subgroups)
)
)
gives
Groups Subgroups Species
1 G1 1 SP1
2 G1 1 SP2
3 G2 1 SP5
4 G2 1 SP4
5 G2 1 SP6
6 G2 2 SP7
7 G2 2 SP4
8 G2 2 SP5
CodePudding user response:
You can use duplicated to find duplicates even for a list. Something like this
library(dplyr)
library(tidyr)
df %>%
group_by(Groups, Subgroups) %>%
summarize(across(c(Species, Anothercolum), list), .groups = "drop") %>%
filter(!duplicated(Species)) %>%
unnest(c(Species, Anothercolum))
Output
# A tibble: 8 x 4
Groups Subgroups Species Anothercolum
<chr> <int> <chr> <chr>
1 G1 1 SP1 jkl
2 G1 1 SP2 ds
3 G2 1 SP4 JUI
4 G2 1 SP5 NA
5 G2 1 SP6 NA
6 G2 2 SP4 900
7 G2 2 SP5 LO
8 G2 2 SP7 dez
Another (more efficient) version using paste0 instead of list since you can find this warning from ?duplicated
Warning
Using this for lists is potentially slow, especially if the elements are not atomic vectors (see vector) or differ only in their attributes. In the worst case it is O(n^2).
library(dplyr)
library(tidyr)
df %>%
group_by(Groups, Subgroups) %>%
summarize(across(c(Species, Anothercolum), paste0, collapse = ","), .groups = "drop") %>%
filter(!duplicated(Species)) %>%
separate_rows(Species, Anothercolum, sep = ",", convert = TRUE)
For this one, you will have to be careful about the separator used. However, both should be fine if you do not have a large dataset.
