In R, how to save only duplicated values in two columns?

I have the following dataset with 4 columns:

head(L12_17)
       species.2017 cooperative.2017     species.2012 cooperative.2012
1  Abrocoma cinerea               no Abrocoma cinerea               no
2 Acomys cineraceus               no Acinonyx jubatus               no
3      Acomys kempi               no Acomys cahirinus               no
4    Acomys louisae               no Acomys cilicicus               no
5     Acomys minous               no   Acomys ignitus               no
6  Acomys percivali               no     Acomys kempi               no

How can I save in column "species.2017" and column "species.2012" only those species that are present in both columns?

The end result should be a new dataset with 3 columns, "species name", "cooperative 2012" and "cooperative 2017", where "species name" keeps only those species (and their corresponding cooperative 2012 and cooperative 2017 data) that are present in both the "species.2017" AND "species.2012" columns. Thanks!

This is the end result I wish for:
    > end.result
              species cooperative.2012 cooperative.2017
1        Acomys kempi               no              yes
2           Acomys 22               no               no
3          Acomys 444               no               no
4 Addax nasomaculatus              yes               no

This is my current data:

> dput(head(data, 20))
structure(list(species.2017 = c("Abrocoma cinerea", "Acomys cineraceus", 
"Acomys kempi", "Acomys louisae", "Acomys minous", "Acomys percivali", 
"Acomys russatus", "Acomys spinosissimus", "Acomys subspinosus", 
"Acomys wilsoni", "Aconaemys fuscus", "Acrobates pygmaeus", "Addax nasomaculatus", 
"Aepyceros melampus", "Aethomys chrysophilus", "Aethomys hindei", 
"Aethomys kaiseri", "Ailuropoda melanoleuca", "Ailurus fulgens", 
"Akodon azarae"), cooperative.2017 = c("no", "no", "no", "no", 
"no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", 
"no", "no", "no", "no", "no"), species.2012 = c("Abrocoma cinerea", 
"Acinonyx jubatus", "Acomys cahirinus", "Acomys cilicicus", "Acomys ignitus", 
"Acomys kempi", "Acomys louisae", "Acomys minous", "Acomys mullah", 
"Acomys nesiotes", "Acomys percivali", "Acomys russatus", "Acomys spinosissimus", 
"Acomys subspinosus", "Acomys wilsoni", "Aconaemys fuscus", "Acrobates pygmaeus", 
"Addax nasomaculatus", "Aepyceros melampus", "Aethomys chrysophilus"
), cooperative.2012 = c("no", "no", "no", "no", "no", "no", "no", 
"no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", 
"no", "no")), row.names = c(NA, 20L), class = "data.frame")

CodePudding user response：

So you want to keep rows where either of the species columns holds a species that exists in both groups? The following code probably gets you what you need, although there are eight rows in which both species.2012 and species.2017 are common species, and I'm not sure which one you want to keep in those rows.

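library(dplyr)

# df is the data frame from the question
species.2017 <- df$species.2017
species.2012 <- df$species.2012
# species that occur in both columns
common <- intersect(species.2017, species.2012)

df <- df %>% 
  # keep rows where at least one species column holds a common species
  filter(species.2012 %in% common | species.2017 %in% common) %>% 
  # use the 2012 name when it is common, otherwise the 2017 name
  mutate(species = ifelse(species.2012 %in% common, species.2012, species.2017)) %>% 
  select(-c(species.2012, species.2017))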
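
To make the ambiguity concrete, here is a minimal base R sketch on made-up data (the data frame `toy` and its values are hypothetical, not the asker's dataset). It shows what `intersect()` returns, which rows the `|` condition keeps, and which rows hold two different common species.

```r
# Hypothetical toy data, not the dataset from the question
toy <- data.frame(
  species.2017 = c("a", "b", "c", "d"),
  species.2012 = c("a", "z", "b", "c"),
  stringsAsFactors = FALSE
)

# Species present in both columns
common <- intersect(toy$species.2017, toy$species.2012)

# Rows where at least one of the two columns holds a common species
keep_either <- toy$species.2012 %in% common | toy$species.2017 %in% common

# Rows where both columns hold a common species, but not the same one
ambiguous <- toy$species.2012 %in% common &
  toy$species.2017 %in% common &
  toy$species.2012 != toy$species.2017

common
keep_either
ambiguous
```

In this toy case only the third row is ambiguous: both "c" and "b" are common species, so a single `species` value has to be picked for that row.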

CodePudding user response：

Here is a base R way.
First create a logical index of the values of species.2017 that also occur in species.2012. Then build the vector of columns to keep: the first species column plus both cooperative columns. Finally, subset the data with those two vectors.


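# TRUE where the 2017 species also occurs in the 2012 column
i <- data$species.2017 %in% data$species.2012
# position of the first species column, then the cooperative columns
j <- grep("species", names(data))[1]
icol <- c(j, grep("cooperative", names(data)))
result1 <- data[i, icol]
# rename in the result only, so that data keeps its original column names
names(result1)[1] <- sub("\\..*$", "", names(result1)[1])
row.names(result1) <- NULL
result1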
#>                  species cooperative.2017 cooperative.2012
#> 1       Abrocoma cinerea               no               no
#> 2           Acomys kempi               no               no
#> 3         Acomys louisae               no               no
#> 4          Acomys minous               no               no
#> 5       Acomys percivali               no               no
#> 6        Acomys russatus               no               no
#> 7   Acomys spinosissimus               no               no
#> 8     Acomys subspinosus               no               no
#> 9         Acomys wilsoni               no               no
#> 10      Aconaemys fuscus               no               no
#> 11    Acrobates pygmaeus               no               no
#> 12   Addax nasomaculatus               no               no
#> 13    Aepyceros melampus               no               no
#> 14 Aethomys chrysophilus               no               no

Created on 2022-01-30 by the reprex package (v2.0.1)

Another way, with merge. Since all matches between the species columns must be in the final result, split the data by columns and merge the two data frames. The result is identical to the result above.

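# Assumes data still has its original column names (species.2017, ...)
tmp1 <- data[1:2]  # 2017 species and cooperative status
tmp2 <- data[3:4]  # 2012 species and cooperative status
# inner join: keeps only species found in both years
result2 <- merge(tmp1, tmp2, by.x = "species.2017", by.y = "species.2012")
names(result2)[1] <- "species"
rm(tmp1, tmp2)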

identical(result1, result2)
#> [1] TRUE
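
As a small illustration of why merging matches species by name rather than by row position, here is a sketch on hypothetical data (`toy`, `y2017`, `y2012` and all values are made up for the example). It also reorders the columns to the species / 2012 / 2017 layout shown in the question's desired output.

```r
# Hypothetical toy data: the same species sits in different rows per year
toy <- data.frame(
  species.2017     = c("a", "b", "c"),
  cooperative.2017 = c("no", "yes", "no"),
  species.2012     = c("b", "c", "z"),
  cooperative.2012 = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Split by year, then inner-join on the species name
y2017 <- toy[c("species.2017", "cooperative.2017")]
y2012 <- toy[c("species.2012", "cooperative.2012")]
res <- merge(y2017, y2012, by.x = "species.2017", by.y = "species.2012")
names(res)[1] <- "species"

# Reorder the columns: species, then 2012, then 2017
res <- res[c("species", "cooperative.2012", "cooperative.2017")]
res
```

Only "b" and "c" occur in both years, so only those two rows survive, each carrying the cooperative value from its own year.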


CodePudding user response：

It is difficult to know exactly what you want. One option: turn the values of one of the two columns into a regex alternation pattern, filter for the rows where the other column matches that pattern, then deselect one of the two now-identical columns and rename the remaining one:

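library(dplyr)
# uses the toy data frame df defined below
df %>%
  # keep rows whose spec2 contains any spec1 value as a whole word
  filter(grepl(paste0("\\b(", paste0(spec1, collapse = "|"), ")\\b"), spec2)) %>%
  select(-spec2) %>%
  rename(spec = spec1)
#>   spec smelse
#> 1  XYZ      1
#> 2    A      4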

Toy data:

df <- data.frame(
  spec1 = c("XYZ", "QWE", "P", "A"),
  spec2 = c("XYZ", "abc", "Pothead", "A"),
  smelse = 1:4
)
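
If whole-value matches are all that is needed, a base R sketch with `%in%` may be simpler than building a regex, and it sidesteps escaping problems when a value contains regex metacharacters. This is an alternative technique, not the approach above: `%in%` compares complete strings, whereas the regex also matches a value that appears as a whole word inside a longer string. The toy data is repeated so the block runs on its own; `res` is a name introduced here for illustration.

```r
# Same toy data as above, repeated so this block is self-contained
df <- data.frame(
  spec1 = c("XYZ", "QWE", "P", "A"),
  spec2 = c("XYZ", "abc", "Pothead", "A"),
  smelse = 1:4
)

# Keep rows whose spec2 value occurs somewhere in spec1 (exact match),
# and drop spec2 in the same step
res <- df[df$spec2 %in% df$spec1, c("spec1", "smelse")]
names(res)[1] <- "spec"
row.names(res) <- NULL
res
```

On this toy data both approaches keep the same two rows, because "Pothead" is neither equal to "P" nor does it contain "P" as a separate word.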