removing duplicate rows based on two conditionals on columns r-CodePudding

I'm trying to drop duplicates and keep the row with the maximum values. I can do this separately per strategy.

However, when trying to do this based on two separate conditionals based on the strategy, the dataframe tends to overwrite one another when trying to apply these.

This is needed given that one strategy contains values that one strategy has and another does not; note these do share one common column though.

Current Data

  ID strategy Common DNA_Col RNA_Col
1 ABA      DNA   0.65    0.66      NA
2 ABB      RNA   0.65      NA    0.15
3 ABB      RNA   0.65      NA    0.12
4 ABC      DNA   0.55    0.88      NA
5 ABC      DNA   0.14    0.14      NA
6 ABC      DNA   0.15    0.50      NA
7 ABD      RNA   0.25      NA    0.12

Desired DataFrame

   ID strategy Common DNA_Col RNA_Col
1 ABA      DNA   0.65    0.66      NA
2 ABB      RNA   0.65      NA    0.15
3 ABC      DNA   0.55    0.88      NA
4 ABD      RNA   0.25      NA    0.12

Code

Producing the Dataframe:

> df <- data.frame(
    stringsAsFactors = FALSE,
    ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
    strategy =c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
    Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
    DNA_Col= c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
    RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
  )

Applying Conditionals

if (df$strategy == "RNA") {
  df = df %>% group_by(id) %>% slice_max(RNA_Col, n=1) %>% ungroup
} else if (df$strategy == "DNA") {
  df = df %>% group_by(df) %>% slice_max(DNA_Col, n=1) %>% ungroup
}

CodePudding user response：

This can be done by using the pivot_longer() function to bring the values from the RNA_Col and DNA_Col variables into one single column to be handled simultaneously, this column can then be used to repopulate the columns using ifelse().


library(tidyverse)

df <- data.frame(
     stringsAsFactors = FALSE,
     ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
     strategy =c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
     Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
     DNA_Col= c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
     RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
   )

df %>% 
  pivot_longer(cols = c(DNA_Col, RNA_Col),
               names_to = "Original_Col",
               values_to = "Value") %>% 
  group_by(ID) %>% 
  slice_max(Value, n = 1) %>% 
  ungroup() %>% 
  mutate(DNA_Col = ifelse(
    Original_Col == "DNA_Col", Value, NA
  ),
  RNA_Col = ifelse(
    Original_Col == "RNA_Col", Value, NA
  )) %>% 
  select(ID, strategy, Common, DNA_Col, RNA_Col)

# A tibble: 4 × 5
  ID    strategy Common DNA_Col RNA_Col
  <chr> <chr>     <dbl>   <dbl>   <dbl>
1 ABA   DNA        0.65    0.66   NA   
2 ABB   RNA        0.65   NA       0.15
3 ABC   DNA        0.55    0.88   NA   
4 ABD   RNA        0.25   NA       0.12

CodePudding user response：

A possible solution:

library(tidyverse)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
  strategy = c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
  Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
  DNA_Col = c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
  RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
)

df %>% 
  group_by(ID) %>% 
  slice_max(DNA_Col, n = 1) %>% 
  ungroup %>% 
  bind_rows(
    df %>% 
      group_by(ID) %>% 
      slice_max(RNA_Col, n = 1) %>% 
      ungroup)

#> # A tibble: 4 × 5
#>   ID    strategy Common DNA_Col RNA_Col
#>   <chr> <chr>     <dbl>   <dbl>   <dbl>
#> 1 ABA   DNA        0.65    0.66   NA   
#> 2 ABC   DNA        0.55    0.88   NA   
#> 3 ABB   RNA        0.65   NA       0.15
#> 4 ABD   RNA        0.25   NA       0.12