I'm trying to drop duplicates and keep the row with the maximum values. I can do this separately per strategy.
However, when trying to do this based on two separate conditionals based on the strategy, the dataframe tends to overwrite one another when trying to apply these.
This is needed given that one strategy contains values that one strategy has and another does not; note these do share one common column though.
Current Data
ID strategy Common DNA_Col RNA_Col
1 ABA DNA 0.65 0.66 NA
2 ABB RNA 0.65 NA 0.15
3 ABB RNA 0.65 NA 0.12
4 ABC DNA 0.55 0.88 NA
5 ABC DNA 0.14 0.14 NA
6 ABC DNA 0.15 0.50 NA
7 ABD RNA 0.25 NA 0.12
Desired DataFrame
ID strategy Common DNA_Col RNA_Col
1 ABA DNA 0.65 0.66 NA
2 ABB RNA 0.65 NA 0.15
3 ABC DNA 0.55 0.88 NA
4 ABD RNA 0.25 NA 0.12
Code
Producing the Dataframe:
> df <- data.frame(
stringsAsFactors = FALSE,
ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
strategy =c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
DNA_Col= c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
)
Applying Conditionals
if (df$strategy == "RNA") {
df = df %>% group_by(id) %>% slice_max(RNA_Col, n=1) %>% ungroup
} else if (df$strategy == "DNA") {
df = df %>% group_by(df) %>% slice_max(DNA_Col, n=1) %>% ungroup
}
CodePudding user response:
This can be done by using the pivot_longer() function to bring the values from the RNA_Col and DNA_Col variables into one single column to be handled simultaneously, this column can then be used to repopulate the columns using ifelse().
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
strategy =c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
DNA_Col= c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
)
df %>%
pivot_longer(cols = c(DNA_Col, RNA_Col),
names_to = "Original_Col",
values_to = "Value") %>%
group_by(ID) %>%
slice_max(Value, n = 1) %>%
ungroup() %>%
mutate(DNA_Col = ifelse(
Original_Col == "DNA_Col", Value, NA
),
RNA_Col = ifelse(
Original_Col == "RNA_Col", Value, NA
)) %>%
select(ID, strategy, Common, DNA_Col, RNA_Col)
# A tibble: 4 × 5
ID strategy Common DNA_Col RNA_Col
<chr> <chr> <dbl> <dbl> <dbl>
1 ABA DNA 0.65 0.66 NA
2 ABB RNA 0.65 NA 0.15
3 ABC DNA 0.55 0.88 NA
4 ABD RNA 0.25 NA 0.12
CodePudding user response:
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
ID = c("ABA", "ABB", "ABB", "ABC", "ABC", "ABC", "ABD"),
strategy = c("DNA", "RNA", "RNA", "DNA", "DNA", "DNA", "RNA"),
Common = c(0.65, 0.65, 0.65, 0.55, 0.14, 0.15, 0.25),
DNA_Col = c(0.66, NA, NA, 0.88, 0.14, 0.5, NA),
RNA_Col = c(NA, 0.15, 0.12, NA, NA, NA, 0.12)
)
df %>%
group_by(ID) %>%
slice_max(DNA_Col, n = 1) %>%
ungroup %>%
bind_rows(
df %>%
group_by(ID) %>%
slice_max(RNA_Col, n = 1) %>%
ungroup)
#> # A tibble: 4 × 5
#> ID strategy Common DNA_Col RNA_Col
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ABA DNA 0.65 0.66 NA
#> 2 ABC DNA 0.55 0.88 NA
#> 3 ABB RNA 0.65 NA 0.15
#> 4 ABD RNA 0.25 NA 0.12
