Home > Software design >  How do I duplicate and add rows between the values of two different columns?
How do I duplicate and add rows between the values of two different columns?

Time:01-14

I'm trying to duplicate each observation for all of the years that fall between "styear" and "endyear." So, for example, there should end up being 118 USA rows with years 1898-2016.

Here's the data:

# A tibble: 14 x 9
   stateabb ccode styear stmonth stday endyear endmonth endday version
   <chr>    <dbl>  <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>   <dbl>
 1 USA          2   1898       8    13    2016       12     31    2016
 2 UKG        200   1816       1     1    2016       12     31    2016
 3 FRN        220   1816       1     1    1940        6     22    2016
 4 FRN        220   1945       8    15    2016       12     31    2016
 5 GMY        255   1816       1     1    1918       11     11    2016
 6 GMY        255   1925       1     1    1945        5      7    2016
 7 GMY        255   1991      12    11    2016       12     31    2016
 8 AUH        300   1816       1     1    1918       11      3    2016
 9 ITA        325   1860       1     1    1943        9      2    2016
10 RUS        365   1816       1     1    1917       12      5    2016
11 RUS        365   1922       1     1    2016       12     31    2016
12 CHN        710   1950       1     1    2016       12     31    2016
13 JPN        740   1895       4     1    1945        8     14    2016
14 JPN        740   1991      12    11    2016       12     31    2016

I have tried various combinations of slice() and seq() and complete(), but I think I'm just too new at coding to a) know what to do and b) how to really understand other answers to similar questions.

Ultimately, I am merging this data with other data and creating 0/1 dummy variable to indicate if a country was a "great power" in a given year. The easiest way I thought of was to do this by creating individual rows for each year a country was a great power (the data in this question) because the data I am merging it with is also in the country-year format. I am open to other options, though, if something else is easier!

Thank you!

CodePudding user response:

My guess is that there is a better way to do this, but here is one way with a small reduced data set. The steps are

  1. Make a minimal dataset.
  2. Make function that creates the seq of dates that you want
  3. Split the dataframe into a list of small dataframes by country with dplyr group_split.
  4. Apply a function using purrr map that maps your list into new expanded date dataframes.
  5. Bind list of dataframes back into one dataframe.
library(dplyr)
library(purrr )

df<-data.frame(
  stringsAsFactors = FALSE,
          stateabb = c("USA", "UKG"),
            styear = c(1898L, 1816L),
           endyear = c(2016L, 2016L)
    )

expand_dates<-function(df) {
  stateabb<-df %>% pluck("stateabb")
  styear<-df %>% pluck("styear")
  endyear<-df%>% pluck("endyear")
  years=seq(styear,endyear )
  
  data.frame(years) %>%
    mutate(stateabb=stateabb,styear=styear,endyear=endyear)
}

df_new<-df %>%
  group_split(stateabb)%>%
  map(expand_dates) %>%
  bind_rows()

head(df_new)
#>   years stateabb styear endyear
#> 1  1816      UKG   1816    2016
#> 2  1817      UKG   1816    2016
#> 3  1818      UKG   1816    2016
#> 4  1819      UKG   1816    2016
#> 5  1820      UKG   1816    2016
#> 6  1821      UKG   1816    2016

Created on 2022-01-13 by the reprex package (v2.0.1)

CodePudding user response:

I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.

Assuming your data frame is named mydata, something like this. I have retained the column of expanded years and named it filled_year, but you may want to remove it.

library(dplyr)
library(tidyr)

new_data <- mydata %>% 
  group_by(stateabb, styear) %>% 
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>% 
  inner_join(mydata) %>% 
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()

The top and bottom of the USA rows:

new_data %>% 
  filter(stateabb == "USA") %>% 
  head()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             1898     2       8    13    2016       12     31    2016
2   1898 USA             1899     2       8    13    2016       12     31    2016
3   1898 USA             1900     2       8    13    2016       12     31    2016
4   1898 USA             1901     2       8    13    2016       12     31    2016
5   1898 USA             1902     2       8    13    2016       12     31    2016
6   1898 USA             1903     2       8    13    2016       12     31    2016

new_data %>% 
  filter(stateabb == "USA") %>% 
  tail()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             2011     2       8    13    2016       12     31    2016
2   1898 USA             2012     2       8    13    2016       12     31    2016
3   1898 USA             2013     2       8    13    2016       12     31    2016
4   1898 USA             2014     2       8    13    2016       12     31    2016
5   1898 USA             2015     2       8    13    2016       12     31    2016
6   1898 USA             2016     2       8    13    2016       12     31    2016

Your example data:

mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY", 
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L, 
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L, 
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L, 
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L, 
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L, 
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L, 
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L, 
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L, 
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L, 
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14"))
  •  Tags:  
  • Related