I'm working with a dataset of author collaborations. Here's a subset of five articles with article ID, corresponding author (RP), corresponding author's country (Country), author full names (AF_1:AF_3), and author short names (AU_1:AU_3):
articles <- structure(list(
ArtID = 1:5,
RP = c("DE GARRIDO, L","CURSEU, PL","HENIKE, T","DI VINCENZO, F","OMIGIE, D"),
Country = c("spain", "romania", "germany", "italy", NA),
AF_1 = c("DE GARRIDO, LUIS","CURSEU, PETRU L.","STIELER, MAXIMILIAN","DI VINCENZO, FAUSTO","OMIGIE, DIANA"),
AF_2 = c(NA,"SCHRUIJER, SANDRA G. L.","HENIKE, TASSILO","IACOPINO, VALENTINA","RICCI, JESSICA"),
AF_3 = c(NA, "FODOR, OANA C.", NA, NA, NA),
AU_1 = c("DE GARRIDO L", "CURSEU PL", "STIELER M","DI VINCENZO F", "OMIGIE D"),
AU_2 = c(NA, "SCHRUIJER SGL", "HENIKE T","IACOPINO V", "RICCI J"),
AU_3 = c(NA, "FODOR OC", NA, NA, NA)),
row.names = c(NA,-5L),class = c("data.frame"))
> articles
ArtID RP Country AF_1 AF_2 AF_3 AU_1 AU_2 AU_3
1 1 DE GARRIDO, L spain DE GARRIDO, LUIS <NA> <NA> DE GARRIDO L <NA> <NA>
2 2 CURSEU, PL romania CURSEU, PETRU L. SCHRUIJER, SANDRA G. L. FODOR, OANA C. CURSEU PL SCHRUIJER SGL FODOR OC
3 3 HENIKE, T germany STIELER, MAXIMILIAN HENIKE, TASSILO <NA> STIELER M HENIKE T <NA>
4 4 DI VINCENZO, F italy DI VINCENZO, FAUSTO IACOPINO, VALENTINA <NA> DI VINCENZO F IACOPINO V <NA>
5 5 OMIGIE, D <NA> OMIGIE, DIANA RICCI, JESSICA <NA> OMIGIE D RICCI J <NA>
When pivoting the data from wide (articles) to long (authors), it automatically copies other column values for all authors (i.e., country). pivot_longer code and data:
authors <- articles %>% pivot_longer(cols=starts_with(c("AF","AU")),names_to=c(".value","ArtAthID"),names_sep="_",values_drop_na=T)
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL romania 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL romania 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F italy 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
However, when pivoting, I want some columns (Country, Uni, Email) to stay with the corresponding author and be NA for the other authors. For example, Country should look like this:
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
I've tried a few things (trying to use matching between RP and AU), but am hitting a wall. Any suggestions for how to do this?
Thanks for any help!
CodePudding user response:
I think the pivot is working as designed, as it is keeping the data associated with all authors. What you're asking to do is de-duplicate Country (and perhaps others) within a particular ArtID.
Try this:
authors %>%
group_by(ArtID) %>%
mutate(across(c(Country), ~ replace(., duplicated(.), .[NA][1]))) %>%
ungroup()
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
Notes:
- You mentioned doing the same for
UniandEmail, but those columns were not in the original data (which is fine). For this reason, I chose to usemutate(across(c(Country), ..))instead of the more conventionalmutate(Country = replace(..)): include your other columns in thatc(.)vector, perhaps justmutate(across(c(Country, Uni, Email), ~ ...)). - The
replaceshould be clear enough, but the.[NA][1]is to make sure that the replacement (ofNA) is the same class ofNAas the original column. There are at least six different classes ofNA, and some R tools -- notably many withindplyrandtidyr-- complain when trying to combine an integer NA (NA_integer_) with a logical (NA) or string (NA_character_) or real/floating-point (NA_real_), to name a few. The use of.[NA]will always give the correct class. The addition of[1]is to get around the fact thatreplace(.)requires the third argument to be the same length as the number of values to replace, not necessarily the same length as the input argumentx; this same-length is relaxed when recycling, so I truncate this to be always length-1.
Your desired output for ArtID=3 showed the NA before "germany", which seemed like just a typo in the sample data. If you want it matched (grepl) between RP and the other fields, then perhaps this works better (though some articles with this example clear out all countries):
authors %>%
mutate(tmp = mapply(function(x, ...) any(grepl(x, unlist(list(...)))), RP, AF, AU), across(c(Country), ~ if_else(tmp, ., .[NA]))) %>%
select(-tmp)
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL NA 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
This second method is flawed because of the mismatch in author name formats ("HENIKE, T" != "HENIKE T", for instance). If you can come up with a better way to match RP with the other columns, then this may be more to your liking.
CodePudding user response:
I don't think you can do what you want within pivot_longer, but I think this gives you what you need for one "additional" variable (Which is all you've given us in your test data...)
articles %>%
group_by(Country) %>%
pivot_longer(
cols=starts_with(c("AF","AU")),
names_to=c(".value","ArtAthID"),
names_sep="_",
values_drop_na=T
) %>%
mutate(Country=ifelse(row_number() == 1, Country, NA)) %>%
ungroup()
# A tibble: 10 × 6
# Groups: Country [5]
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
if you need more than one additional variable, add each to the group_by() call and then use across() in the mutate(). Something like (untested code):
...
%>% mutate(
across(
c(Country, Uni, Email),
function(x) ifelse(row_number() == 1, x, NA)
)
)
