Home > Software engineering >  How to subset R data frame based on duplicates in one column and unique values in another
How to subset R data frame based on duplicates in one column and unique values in another

Time:02-08

This seems pretty straightforward but I am stumped. I have a data frame that looks like this:

df1 %>% head()
  values  paired
  <ch>     <int>
1 apples      1         
2 x           1          
3 oranges     2          
4 z           2          
5 bananas     3          
6 y           3 
7 apples      4
8 p           4

I would like to create a new data frame that extracts all paired values based on a search criteria. So if I want all pairs that correspond to apples I would like to end up with something like this:

df1 %>% head()
  values  paired
  <ch>     <int>
1 apples      1         
2 x           1                   
3 apples      4
4 p           4

I have tried using:

new_pairs <- df1 %>%
  arrange(values, paired) %>%
  filter(duplicated(paired) == TRUE,
         values=="apples") 

But I am getting only the apple rows back

CodePudding user response:

You'll need to group on the paired variable before filtering.

How about:

df1 %>% 
  group_by(paired) %>% 
  filter("apples" %in% values) %>%
  ungroup()

Result:

# A tibble: 4 x 2
  values paired
  <chr>   <int>
1 apples      1
2 x           1
3 apples      4
4 p           4

Your data:

df1 <- structure(list(values = c("apples", "x", "oranges", "z", "bananas", "y", "apples", "p"), 
                      paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)), 
                 class = "data.frame", 
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))

CodePudding user response:

Here is another tidyverse possibility. I filter for the rows that have apples and also keep the rows that immediately follow apples.

library(tidyverse)

df %>%
  filter((values == "apples" |
            row_number() %in% c(which(
              values == "apples", arr.ind = TRUE
            )   1)))

Output

  values paired
1 apples      1
2      x      1
3 apples      4
4      p      4

Here is a data.table option (subset is only used to change the order of the columns):

library(data.table)

dt <- as.data.table(df)
subset(dt[, .SD[any(values == "apples")], by = paired], select = c("values", "paired"))

   values paired
1: apples      1
2:      x      1
3: apples      4
4:      p      4

Data

df <-
    structure(list(
      values = c("apples", "x", "oranges", "z", "bananas",
                 "y", "apples", "p"),
      paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
    ),
    class = "data.frame",
    row.names = c(NA,-8L))
  •  Tags:  
  • Related