Home > Mobile >  How to remove duplicated rows from a dataframe `df` but only when a specific column of the `df` is N
How to remove duplicated rows from a dataframe `df` but only when a specific column of the `df` is N

Time:01-20

I have a data frame df with 5 columns. Region.Label indicates the Region where the study was performed, Sample.Label is a specific area within that Region where I counted for birds, Sp is the bird species I found in that specific area, Distance is the distance between the bird and me, and Effort is the time I was in the area looking for birds. When Distance is NA means that for this area the species was not observed. As an example of the data frame I have:

df <- data.frame(Region.Label=c("A","A","A","A","A","A","A","A"),
                 Sample.Label=c(1,1,1,2,2,2,3,3),
                 Sp=c("ZZ","ZZ","BB","ZZ","BB","CC","ZZ","BB"),
                 Distance=c(2,7,NA,NA,NA,6,NA,NA),
                 Effort=c(99,99,99,87,87,87,72,72))

df$Region.Label <- as.factor(df$Region.Label)
df$Sample.Label <- as.numeric(df$Sample.Label)
df

  Region.Label Sample.Label Sp Distance Effort
1            A            1 ZZ        2     99
2            A            1 ZZ        7     99
3            A            1 BB       NA     99
4            A            2 ZZ       NA     87
5            A            2 BB       NA     87
6            A            2 CC        6     87
7            A            3 ZZ       NA     72
8            A            3 BB       NA     72

Here, I would like to remove all the rows which have NA for the column df$Distance, as it indicates that the species was not observed in the area, but I want to delete rows with NA for df$Distance when that row with the NA is a duplicate of another row excluding the column df$Sp.

I would like to obtain this:

 Region.Label Sample.Label Sp Distance Effort
1            A            1 ZZ        2     99
2            A            1 ZZ        7     99
3            A            2 CC        6     87
4            A            3 ZZ       NA     72

In this example, I do not remove df[7,] because the Sample.Label is different from the previous rows. I remove df[8,] because df[7,] and df[8,] are equal except for df$Sp.

Does anyone know how to get what I want?

CodePudding user response:

Perhaps, a group by operation would help - grouped by 'Region.Label', 'Sample.Label', 'Effort', filter the non-NA elements of 'Distance' if there are any non-NA elements or else get the first row (row_number() == 1)

library(dplyr)
df %>% 
 group_by(Region.Label, Sample.Label, Effort) %>% 
 filter(if(all(is.na(Distance))) row_number() == 1 else !is.na(Distance)) %>%
 ungroup

-output

# A tibble: 4 × 5
  Region.Label Sample.Label Sp    Distance Effort
  <fct>               <dbl> <chr>    <dbl>  <dbl>
1 A                       1 ZZ           2     99
2 A                       1 ZZ           7     99
3 A                       2 CC           6     87
4 A                       3 ZZ          NA     72
  •  Tags:  
  • Related