Home > Enterprise >  Remove duplicates based on few columns and keep numeric numeric value if any and keep NA if there is
Remove duplicates based on few columns and keep numeric numeric value if any and keep NA if there is

Time:01-10

I have a dataframe below and I want to remove duplicates based on columns country and year, and keep the non NA values for the columns 3 to the last column. If all rows within (country, year) are NA, the value for the row should be an NA as well.

   iso2c country year DT.ODA.ODAT.GN.ZS NY.GDP.MKTP.CD DT.ODA.ODAT.GD.ZS DT.ODA.ALLD.GD.ZS DT.ODA.ODAT.XP.ZS DT.ODA.ALLD.XP.ZS NY.GNP.MKTP.CD
1    AGO  Angola 1985                NA             NA          1.329899          1.329899                NA                NA             NA
2     AO  Angola 1985          1.352825     7558613008                NA                NA                NA                NA     6688963211
3    AGO  Angola 1986                NA             NA          2.049293          2.049293                NA                NA             NA
4     AO  Angola 1986          1.947237     7076793823                NA                NA                NA                NA     6688963211
5    AGO  Angola 1987                NA             NA          1.820775          1.820775                NA                NA             NA
6     AO  Angola 1987          2.009728     8089279285                NA                NA                NA                NA     6688963211
7    AGO  Angola 1988                NA             NA          1.968970          1.968970                NA                NA             NA
8     AO  Angola 1988          2.347598     8775116269                NA                NA                NA                NA     6688963211
9    AGO  Angola 1989                NA             NA          1.799623          1.799623                NA                NA             NA
10    AO  Angola 1989          1.665031    10207922517                NA                NA                NA                NA    10033444816

I have tried to use summarise accross,

df %>%
  select(-iso2c) %>%
  group_by(country, year) %>%
  summarise(across(, ~ sum(.x, na.rm = TRUE)))

   country  year DT.ODA.ODAT.GN.ZS NY.GDP.MKTP.CD DT.ODA.ODAT.GD.ZS DT.ODA.ALLD.GD.ZS DT.ODA.ODAT.XP.ZS DT.ODA.ALLD.XP.ZS NY.GNP.MKTP.CD
   <chr>   <int>             <dbl>          <dbl>             <dbl>             <dbl>             <dbl>             <dbl>          <dbl>
 1 Angola   1985              1.35    7558613008.              1.33              1.33                 0                 0    6688963211.
 2 Angola   1986              1.95    7076793823.              2.05              2.05                 0                 0    6688963211.
 3 Angola   1987              2.01    8089279285.              1.82              1.82                 0                 0    6688963211.
 4 Angola   1988              2.35    8775116269.              1.97              1.97                 0                 0    6688963211.
 5 Angola   1989              1.67   10207922517.              1.80              1.80                 0                 0   10033444816.
 6 Angola   1990              2.65   11236275843.              2.59              2.59                 0                 0   10033444816.
 7 Angola   1991              0                0               2.27              2.27                 0                 0             0 
 8 Angola   1992              0                0               5.95              5.95                 0                 0             0 
 9 Angola   1993              0                0               5.48              5.48                 0                 0             0 
10 Angola   1994             30.1     3390500000              11.0              11.0                  0                 0    1484500000 

but then it returns 0 for the groups that all rows were NA, this is problematic as I might have observations that are actually 0 then I might not be able to differentiate the real 0 and the 0 created by NA.

Reproducible dataset is below

df <- structure(list(iso2c = c("AGO", "AO", "AGO", "AO", "AGO", "AO", 
"AGO", "AO", "AGO", "AO", "AGO", "AO", "AGO", "AO", "AGO", "AO", 
"AGO", "AO", "AGO", "AO"), country = c("Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola"), year = c(1985L, 1985L, 1986L, 
1986L, 1987L, 1987L, 1988L, 1988L, 1989L, 1989L, 1990L, 1990L, 
1991L, 1991L, 1992L, 1992L, 1993L, 1993L, 1994L, 1994L), DT.ODA.ODAT.GN.ZS = c(NA, 
1.35282546806335, NA, 1.9472375, NA, 2.00972839050293, NA, 2.34759848175049, 
NA, 1.66503130900065, NA, 2.64884089050293, NA, NA, NA, NA, NA, 
NA, NA, 30.1158644206278), NY.GDP.MKTP.CD = c(NA, 7558613007.90635, 
NA, 7076793822.60201, NA, 8089279284.72241, NA, 8775116269.16722, 
NA, 10207922517.1839, NA, 11236275842.7358, NA, NA, NA, NA, NA, 
NA, NA, 3390500000), DT.ODA.ODAT.GD.ZS = c(1.32989861920417, 
NA, 2.04929343541605, NA, 1.82077454909391, NA, 1.9689704235723, 
NA, 1.79962315882805, NA, 2.59030206019162, NA, 2.27231226407093, 
NA, 5.94508667614588, NA, 5.47506427358001, NA, 11.0127233451064, 
NA), DT.ODA.ALLD.GD.ZS = c(1.32989861920417, NA, 2.04929343541605, 
NA, 1.82077454909391, NA, 1.9689704235723, NA, 1.79962315882805, 
NA, 2.59030206019162, NA, 2.27231226407093, NA, 5.94508667614588, 
NA, 5.47506427358001, NA, 11.0127233451064, NA), DT.ODA.ODAT.XP.ZS = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), DT.ODA.ALLD.XP.ZS = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), NY.GNP.MKTP.CD = c(NA, 
6688963210.70234, NA, 6688963210.70234, NA, 6688963210.70234, 
NA, 6688963210.70234, NA, 10033444816.0535, NA, 10033444816.0535, 
NA, NA, NA, NA, NA, NA, NA, 1484500000)), row.names = c(NA, 20L
), class = "data.frame")

CodePudding user response:

A possible solution, using all(is.na(.x)) to detect when all elements inside the grouped column are NA:

df %>%
  select(-iso2c) %>%
  group_by(country, year) %>%
  summarise(across(everything(),
    ~ if_else(all(is.na(.x)), NA_real_, sum(.x, na.rm = TRUE))))
  •  Tags:  
  • Related