Compute cummulative occurences of value (string) until a new string appears and years are continuous-CodePudding

I have data like this:

dataset <- data.frame(year = c(2001, 2002, 2003, 2005, 2006))
dataset$firm <- c("A", "A", "B","B","B" )

I would like to compute the number of consecutive years that firm appears in the dataset. The expected results is like this:

dataset <- data.frame(year = c(2001, 2002, 2003, 2005, 2006))
dataset$firm <- c("A", "A", "B","B","B" )
dataset$tenure <- c(1,2,1,1,2)

How can I get the tenure variable here? Thank you so much.

CodePudding user response：

With tidyverse you can do this. It checks if the years are one year apart and takes the cummulative sum of the logical result.

library(dplyr)
library(tidyr)

dataset %>% 
  group_by(firm) %>% 
  mutate(tenure=(year-1==lag(year))*1, 
    tenure=replace_na(tenure,1), 
    tenure=cumsum(tenure)) %>% 
  ungroup()
# A tibble: 9 × 3
   year firm  tenure
  <dbl> <chr>  <dbl>
1  2001 A          1
2  2002 A          2
3  2003 B          1
4  2005 B          1
5  2006 B          2
6  2007 B          3
7  2008 B          4
8  2010 B          4
9  2011 B          5

extended data

dataset <- structure(list(year = c(2001, 2002, 2003, 2005, 2006, 2007, 2008,
2010, 2011), firm = c("A", "A", "B", "B", "B", "B", "B", "B",
"B")), row.names = c(NA, -9L), class = "data.frame")

CodePudding user response：

You can use -

library(dplyr)

dataset %>%
  arrange(firm, year) %>%
  group_by(firm, temp = cumsum(c(TRUE, diff(year) > 1))) %>%
  mutate(tenure = row_number()) %>%
  ungroup %>%
  select(-temp)

#   year firm  tenure
#  <dbl> <chr>  <int>
#1  2001 A          1
#2  2002 A          2
#3  2003 B          1
#4  2005 B          1
#5  2006 B          2