I'm trying to duplicate each observation for all of the years that fall between "styear" and "endyear" (inclusive). So, for example, there should end up being 119 USA rows with years 1898-2016.
Here's the data:
# A tibble: 14 x 9
   stateabb ccode styear stmonth stday endyear endmonth endday version
   <chr>    <dbl>  <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>   <dbl>
 1 USA          2   1898       8    13    2016       12     31    2016
 2 UKG        200   1816       1     1    2016       12     31    2016
 3 FRN        220   1816       1     1    1940        6     22    2016
 4 FRN        220   1945       8    15    2016       12     31    2016
 5 GMY        255   1816       1     1    1918       11     11    2016
 6 GMY        255   1925       1     1    1945        5      7    2016
 7 GMY        255   1991      12    11    2016       12     31    2016
 8 AUH        300   1816       1     1    1918       11      3    2016
 9 ITA        325   1860       1     1    1943        9      2    2016
10 RUS        365   1816       1     1    1917       12      5    2016
11 RUS        365   1922       1     1    2016       12     31    2016
12 CHN        710   1950       1     1    2016       12     31    2016
13 JPN        740   1895       4     1    1945        8     14    2016
14 JPN        740   1991      12    11    2016       12     31    2016
I have tried various combinations of slice(), seq(), and complete(), but I think I'm just too new at coding to a) know what to do and b) really understand other answers to similar questions.
Ultimately, I am merging this data with other data and creating a 0/1 dummy variable to indicate whether a country was a "great power" in a given year. The easiest way I could think of was to create an individual row for each year a country was a great power (from the data in this question), because the data I am merging it with is also in country-year format. I am open to other options, though, if something else is easier!
Thank you!
CodePudding user response:
My guess is that there is a better way to do this, but here is one way with a small, reduced data set. The steps are:
- Make a minimal dataset.
- Make a function that creates the seq of years that you want.
- Split the dataframe into a list of small dataframes by country with dplyr::group_split.
- Apply a function using purrr::map that maps your list into new expanded date dataframes.
- Bind the list of dataframes back into one dataframe.
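library(dplyr)
library(purrr)

# Minimal dataset: one row per great-power spell.
df <- data.frame(
  stringsAsFactors = FALSE,
  stateabb = c("USA", "UKG"),
  styear = c(1898L, 1816L),
  endyear = c(2016L, 2016L)
)

# Expand a one-row data frame into one row per year of the spell.
expand_dates <- function(df) {
  stateabb <- df %>% pluck("stateabb")
  styear <- df %>% pluck("styear")
  endyear <- df %>% pluck("endyear")
  years <- seq(styear, endyear)
  data.frame(years) %>%
    mutate(stateabb = stateabb, styear = styear, endyear = endyear)
}

# Split by country and start year so that each piece is a single spell
# (in the full data some countries have more than one spell).
df_new <- df %>%
  group_split(stateabb, styear) %>%
  map(expand_dates) %>%
  bind_rows()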
head(df_new)
#>   years stateabb styear endyear
#> 1  1816      UKG   1816    2016
#> 2  1817      UKG   1816    2016
#> 3  1818      UKG   1816    2016
#> 4  1819      UKG   1816    2016
#> 5  1820      UKG   1816    2016
#> 6  1821      UKG   1816    2016
Created on 2022-01-13 by the reprex package (v2.0.1)
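In the full data, FRN, GMY, RUS and JPN each have more than one spell, so the expansion has to happen per row rather than per country. Here is a minimal base R sketch of that idea. It is not taken from the answer above, and the `spells` data frame and the `year` column name are made up for illustration:

```r
# Toy data (hypothetical subset): FRN has two separate spells.
spells <- data.frame(
  stateabb = c("USA", "FRN", "FRN"),
  styear   = c(1898L, 1816L, 1945L),
  endyear  = c(2016L, 1940L, 2016L),
  stringsAsFactors = FALSE
)

# One vector of years per row: seq(styear[i], endyear[i]).
year_list <- Map(seq, spells$styear, spells$endyear)

# Repeat each original row once per year in its spell, then attach the years.
expanded <- spells[rep(seq_len(nrow(spells)), lengths(year_list)), ]
expanded$year <- unlist(year_list)
rownames(expanded) <- NULL

# Row counts per country.
table(expanded$stateabb)
```

With this toy input, USA ends up with 119 rows (1898 through 2016 inclusive) and FRN with 125 + 72 = 197 rows, one block per spell.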
CodePudding user response:
I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.
Assuming your data frame is named mydata, something like this should work. I have retained the column of expanded years and named it filled_year; the original styear and endyear columns are kept as well, and you may want to remove them.
library(dplyr)
library(tidyr)
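new_data <- mydata %>%
  # one group per spell (country + start year)
  group_by(stateabb, styear) %>%
  # every year from styear to endyear, in steps of 1
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
  # bring back the remaining columns of the original data
  inner_join(mydata, by = c("stateabb", "styear")) %>%
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()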
The top and bottom of the USA rows:
new_data %>%
filter(stateabb == "USA") %>%
head()
# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             1898     2       8    13    2016       12     31    2016
2   1898 USA             1899     2       8    13    2016       12     31    2016
3   1898 USA             1900     2       8    13    2016       12     31    2016
4   1898 USA             1901     2       8    13    2016       12     31    2016
5   1898 USA             1902     2       8    13    2016       12     31    2016
6   1898 USA             1903     2       8    13    2016       12     31    2016
new_data %>%
filter(stateabb == "USA") %>%
tail()
# A tibble: 6 x 10
  styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
   <int> <chr>          <dbl> <int>   <int> <int>   <int>    <int>  <int>   <int>
1   1898 USA             2011     2       8    13    2016       12     31    2016
2   1898 USA             2012     2       8    13    2016       12     31    2016
3   1898 USA             2013     2       8    13    2016       12     31    2016
4   1898 USA             2014     2       8    13    2016       12     31    2016
5   1898 USA             2015     2       8    13    2016       12     31    2016
6   1898 USA             2016     2       8    13    2016       12     31    2016
Your example data:
mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY",
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))

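For the final step mentioned in the question, the 0/1 "great power" dummy, one possible sketch is a left join of the expanded rows onto the country-year data. The `panel` and `gp_years` data frames, the `year` column, and the `great_power` name below are assumptions made for illustration and are not taken from the original data:

```r
library(dplyr)

# Hypothetical expanded great-power rows (one row per country-year).
gp_years <- tibble(
  stateabb = c("USA", "USA", "JPN"),
  year     = c(1898L, 1899L, 1895L)
)

# Hypothetical country-year panel that the dummy should be added to.
panel <- tibble(
  stateabb = c("USA", "USA", "USA", "JPN", "MEX"),
  year     = c(1897L, 1898L, 1899L, 1895L, 1898L)
)

panel_flagged <- panel %>%
  left_join(
    # flag every great-power country-year with 1
    gp_years %>% distinct(stateabb, year) %>% mutate(great_power = 1L),
    by = c("stateabb", "year")
  ) %>%
  # country-years with no match were not great powers: turn NA into 0
  mutate(great_power = coalesce(great_power, 0L))

panel_flagged
```

Because the join keeps every row of `panel`, the number of rows does not change; only the new `great_power` column is added.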