I have a dataframe below and I want to remove duplicates based on columns country and year, and keep the non NA values for the columns 3 to the last column. If all rows within (country, year) are NA, the value for the row should be an NA as well.
iso2c country year DT.ODA.ODAT.GN.ZS NY.GDP.MKTP.CD DT.ODA.ODAT.GD.ZS DT.ODA.ALLD.GD.ZS DT.ODA.ODAT.XP.ZS DT.ODA.ALLD.XP.ZS NY.GNP.MKTP.CD
1 AGO Angola 1985 NA NA 1.329899 1.329899 NA NA NA
2 AO Angola 1985 1.352825 7558613008 NA NA NA NA 6688963211
3 AGO Angola 1986 NA NA 2.049293 2.049293 NA NA NA
4 AO Angola 1986 1.947237 7076793823 NA NA NA NA 6688963211
5 AGO Angola 1987 NA NA 1.820775 1.820775 NA NA NA
6 AO Angola 1987 2.009728 8089279285 NA NA NA NA 6688963211
7 AGO Angola 1988 NA NA 1.968970 1.968970 NA NA NA
8 AO Angola 1988 2.347598 8775116269 NA NA NA NA 6688963211
9 AGO Angola 1989 NA NA 1.799623 1.799623 NA NA NA
10 AO Angola 1989 1.665031 10207922517 NA NA NA NA 10033444816
I have tried to use summarise accross,
df %>%
select(-iso2c) %>%
group_by(country, year) %>%
summarise(across(, ~ sum(.x, na.rm = TRUE)))
country year DT.ODA.ODAT.GN.ZS NY.GDP.MKTP.CD DT.ODA.ODAT.GD.ZS DT.ODA.ALLD.GD.ZS DT.ODA.ODAT.XP.ZS DT.ODA.ALLD.XP.ZS NY.GNP.MKTP.CD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angola 1985 1.35 7558613008. 1.33 1.33 0 0 6688963211.
2 Angola 1986 1.95 7076793823. 2.05 2.05 0 0 6688963211.
3 Angola 1987 2.01 8089279285. 1.82 1.82 0 0 6688963211.
4 Angola 1988 2.35 8775116269. 1.97 1.97 0 0 6688963211.
5 Angola 1989 1.67 10207922517. 1.80 1.80 0 0 10033444816.
6 Angola 1990 2.65 11236275843. 2.59 2.59 0 0 10033444816.
7 Angola 1991 0 0 2.27 2.27 0 0 0
8 Angola 1992 0 0 5.95 5.95 0 0 0
9 Angola 1993 0 0 5.48 5.48 0 0 0
10 Angola 1994 30.1 3390500000 11.0 11.0 0 0 1484500000
but then it returns 0 for the groups that all rows were NA, this is problematic as I might have observations that are actually 0 then I might not be able to differentiate the real 0 and the 0 created by NA.
Reproducible dataset is below
df <- structure(list(iso2c = c("AGO", "AO", "AGO", "AO", "AGO", "AO",
"AGO", "AO", "AGO", "AO", "AGO", "AO", "AGO", "AO", "AGO", "AO",
"AGO", "AO", "AGO", "AO"), country = c("Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola"), year = c(1985L, 1985L, 1986L,
1986L, 1987L, 1987L, 1988L, 1988L, 1989L, 1989L, 1990L, 1990L,
1991L, 1991L, 1992L, 1992L, 1993L, 1993L, 1994L, 1994L), DT.ODA.ODAT.GN.ZS = c(NA,
1.35282546806335, NA, 1.9472375, NA, 2.00972839050293, NA, 2.34759848175049,
NA, 1.66503130900065, NA, 2.64884089050293, NA, NA, NA, NA, NA,
NA, NA, 30.1158644206278), NY.GDP.MKTP.CD = c(NA, 7558613007.90635,
NA, 7076793822.60201, NA, 8089279284.72241, NA, 8775116269.16722,
NA, 10207922517.1839, NA, 11236275842.7358, NA, NA, NA, NA, NA,
NA, NA, 3390500000), DT.ODA.ODAT.GD.ZS = c(1.32989861920417,
NA, 2.04929343541605, NA, 1.82077454909391, NA, 1.9689704235723,
NA, 1.79962315882805, NA, 2.59030206019162, NA, 2.27231226407093,
NA, 5.94508667614588, NA, 5.47506427358001, NA, 11.0127233451064,
NA), DT.ODA.ALLD.GD.ZS = c(1.32989861920417, NA, 2.04929343541605,
NA, 1.82077454909391, NA, 1.9689704235723, NA, 1.79962315882805,
NA, 2.59030206019162, NA, 2.27231226407093, NA, 5.94508667614588,
NA, 5.47506427358001, NA, 11.0127233451064, NA), DT.ODA.ODAT.XP.ZS = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), DT.ODA.ALLD.XP.ZS = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), NY.GNP.MKTP.CD = c(NA,
6688963210.70234, NA, 6688963210.70234, NA, 6688963210.70234,
NA, 6688963210.70234, NA, 10033444816.0535, NA, 10033444816.0535,
NA, NA, NA, NA, NA, NA, NA, 1484500000)), row.names = c(NA, 20L
), class = "data.frame")
CodePudding user response:
A possible solution, using all(is.na(.x)) to detect when all elements inside the grouped column are NA:
df %>%
select(-iso2c) %>%
group_by(country, year) %>%
summarise(across(everything(),
~ if_else(all(is.na(.x)), NA_real_, sum(.x, na.rm = TRUE))))
