How do I duplicate and add rows between the values of two different columns?-CodePudding

I'm trying to duplicate each observation for all of the years that fall between "styear" and "endyear." So, for example, there should end up being 118 USA rows with years 1898-2016.

Here's the data:

# A tibble: 14 x 9
   stateabb ccode styear stmonth stday endyear endmonth endday version
   <chr>    <dbl>  <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>   <dbl>
 1 USA          2   1898       8    13    2016       12     31    2016
 2 UKG        200   1816       1     1    2016       12     31    2016
 3 FRN        220   1816       1     1    1940        6     22    2016
 4 FRN        220   1945       8    15    2016       12     31    2016
 5 GMY        255   1816       1     1    1918       11     11    2016
 6 GMY        255   1925       1     1    1945        5      7    2016
 7 GMY        255   1991      12    11    2016       12     31    2016
 8 AUH        300   1816       1     1    1918       11      3    2016
 9 ITA        325   1860       1     1    1943        9      2    2016
10 RUS        365   1816       1     1    1917       12      5    2016
11 RUS        365   1922       1     1    2016       12     31    2016
12 CHN        710   1950       1     1    2016       12     31    2016
13 JPN        740   1895       4     1    1945        8     14    2016
14 JPN        740   1991      12    11    2016       12     31    2016

I have tried various combinations of slice() and seq() and complete(), but I think I'm just too new at coding to a) know what to do and b) how to really understand other answers to similar questions.

Ultimately, I am merging this data with other data and creating 0/1 dummy variable to indicate if a country was a "great power" in a given year. The easiest way I thought of was to do this by creating individual rows for each year a country was a great power (the data in this question) because the data I am merging it with is also in the country-year format. I am open to other options, though, if something else is easier!

Thank you!

CodePudding user response：

My guess is that there is a better way to do this, but here is one way with a small reduced data set. The steps are

Make a minimal dataset.
Make function that creates the seq of dates that you want
Split the dataframe into a list of small dataframes by country with dplyr group_split.
Apply a function using purrr map that maps your list into new expanded date dataframes.
Bind list of dataframes back into one dataframe.

library(dplyr)
library(purrr )

df<-data.frame(
  stringsAsFactors = FALSE,
          stateabb = c("USA", "UKG"),
            styear = c(1898L, 1816L),
           endyear = c(2016L, 2016L)
    )

expand_dates<-function(df) {
  stateabb<-df %>% pluck("stateabb")
  styear<-df %>% pluck("styear")
  endyear<-df%>% pluck("endyear")
  years=seq(styear,endyear )
  
  data.frame(years) %>%
    mutate(stateabb=stateabb,styear=styear,endyear=endyear)
}

df_new<-df %>%
  group_split(stateabb)%>%
  map(expand_dates) %>%
  bind_rows()

head(df_new)
#>   years stateabb styear endyear
#> 1  1816      UKG   1816    2016
#> 2  1817      UKG   1816    2016
#> 3  1818      UKG   1816    2016
#> 4  1819      UKG   1816    2016
#> 5  1820      UKG   1816    2016
#> 6  1821      UKG   1816    2016

^{Created on 2022-01-13 by the reprex package (v2.0.1)}

CodePudding user response：

I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.

Assuming your data frame is named mydata, something like this. I have retained the column of expanded years and named it filled_year, but you may want to remove it.

library(dplyr)
library(tidyr)

new_data <- mydata %>% 
  group_by(stateabb, styear) %>% 
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>% 
  inner_join(mydata) %>% 
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()

The top and bottom of the USA rows:

new_data %>% 
  filter(stateabb == "USA") %>% 
  head()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             1898     2       8    13    2016       12     31    2016
2   1898 USA             1899     2       8    13    2016       12     31    2016
3   1898 USA             1900     2       8    13    2016       12     31    2016
4   1898 USA             1901     2       8    13    2016       12     31    2016
5   1898 USA             1902     2       8    13    2016       12     31    2016
6   1898 USA             1903     2       8    13    2016       12     31    2016

new_data %>% 
  filter(stateabb == "USA") %>% 
  tail()

# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             2011     2       8    13    2016       12     31    2016
2   1898 USA             2012     2       8    13    2016       12     31    2016
3   1898 USA             2013     2       8    13    2016       12     31    2016
4   1898 USA             2014     2       8    13    2016       12     31    2016
5   1898 USA             2015     2       8    13    2016       12     31    2016
6   1898 USA             2016     2       8    13    2016       12     31    2016

Your example data:

mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY", 
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L, 
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L, 
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L, 
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L, 
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L, 
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L, 
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L, 
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L, 
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L, 
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14"))