I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique() or distinct() because then some data would be erased, as the rows are not identical.
Please see this example:
# The style of df I have
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df
# IDs Intervention Other_data1 Other_data2
# 1 1 Progesterone 22 a
# 2 1 Stitch 22 a
# 3 2 <NA> 32 b
# 4 3 Stitch 44 c
# 5 4 Progesterone 24 d
# 6 4 Stitch 24 d
# 7 4 Pessary 24 d
# 8 5 Progesterone NA e
So if I used unique() I would lose the full information in the df$Intervention column.
Please could someone let me know how I can get the df into this format:
# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
Stitch=c("Yes", NA, "Yes", "Yes", "No"),
Pessary=c("No", NA, "No", "Yes", "No"),
Other_data1= c(22,32,44,24,NA),
Other_data2=c("a","b","c","d","e"))
df_I_want
# IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1 1 Yes Yes No 22 a
# 2 2 <NA> <NA> <NA> 32 b
# 3 3 No Yes No 44 c
# 4 4 Yes Yes Yes 24 d
# 5 5 Yes No No NA e
My real df contains thousands of rows x hundreds of columns, so I have many cases of df$Other_data so I cannot really manually type out excluding these rows when reshaping the df. But there is only 1 column where the data differs by the row, as in the above example with df$Intervention.
CodePudding user response:
Here is another pivot_wider solution, but here I use mutate and case_when to identify their corresponding values under the newly expanded columns.
If all of the three newly expanded columns are NA, they should remain NA. Otherwise, treat NA as "No" and non-NA as "Yes".
Note that within across(), you should input the column position (or column names) of the newly expanded columns (e.g. Progesterone, Stitch and Pessary are newly created, and they are in position 4 to 6, therefore 4:6).
Edit: Added length(unique(na.omit(df$Intervention))) when calculating and comparing the number of NAs across the newly expanded columns so that it's more dynamic
library(tidyverse)
df %>%
pivot_wider(names_from = Intervention, values_from = Intervention) %>%
select(-"NA") %>%
mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == length(unique(na.omit(df$Intervention))) ~ NA_character_,
is.na(.x) ~ "No",
!is.na(.x) ~ "Yes")))
# A tibble: 5 × 6
IDs Other_data1 Other_data2 Progesterone Stitch Pessary
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 22 a Yes Yes No
2 2 32 b NA NA NA
3 3 44 c No Yes No
4 4 24 d Yes Yes Yes
5 5 NA e Yes No No
CodePudding user response:
Updated: I have now updated the code so that you have "Yes" and "No" for each intervention.
You can use the function pivot_wider() to achieve this:
library(tidyverse)
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df %>%
pivot_wider(names_from = Intervention,
values_from = Intervention) %>%
select(-c(`NA`)) %>%
mutate(across(.cols = Progesterone:Pessary,
.fns = ~case_when(is.na(.) ~ "No",
TRUE ~ "Yes")))
#> # A tibble: 5 × 6
#> IDs Other_data1 Other_data2 Progesterone Stitch Pessary
#> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 22 a Yes Yes No
#> 2 2 32 b No No No
#> 3 3 44 c No Yes No
#> 4 4 24 d Yes Yes Yes
#> 5 5 NA e Yes No No
Created on 2022-08-19 with reprex v2.0.2
