Say I have the following data:
| date | name | rolename |
|---|---|---|
| 2009-12-01 | John | helper |
| 2010-12-01 | John | helper |
| 2011-12-01 | John | senior helper |
| 2012-12-01 | John | manager |
| 2009-12-01 | Will | helper |
| 2010-12-01 | Will | senior helper |
| 2011-12-01 | Will | manager |
| 2012-12-01 | Will | senior manager |
I am trying to count the number of roles, based on the column rolename for the person in the name column, the person has worked to date. For example, for the above data, I want a fourth column which measures the number of positions a person has worked so far:
| date | name | rolename | nopositions |
|---|---|---|---|
| 2009-12-01 | John | helper | 1 |
| 2010-12-01 | John | helper | 1 |
| 2011-12-01 | John | senior helper | 2 |
| 2012-12-01 | John | manager | 3 |
| 2009-12-01 | Will | helper | 1 |
| 2010-12-01 | Will | senior helper | 2 |
| 2011-12-01 | Will | manager | 3 |
| 2012-12-01 | Will | senior manager | 4 |
My failed attempts:
#attempt 1
library(dplyr)
data %>%
group_by(name) %>%
mutate(nopositions = count(rolename))
#attempt2
library(runner)
data %>%
group_by(name) %>%
mutate(nopositions = runner(x = rolename,
k = inf,
idx = date,
f = function(x) length(x))
CodePudding user response:
Assuming that the order by date is assured,
library(dplyr)
quux %>%
group_by(name) %>%
mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
ungroup()
# # A tibble: 8 × 4
# date name rolename noposition
# <chr> <chr> <chr> <int>
# 1 2009-12-01 John helper 1
# 2 2010-12-01 John helper 1
# 3 2011-12-01 John senior helper 2
# 4 2012-12-01 John manager 3
# 5 2009-12-01 Will helper 1
# 6 2010-12-01 Will senior helper 2
# 7 2011-12-01 Will manager 3
# 8 2012-12-01 Will senior manager 4
We might get away without cummax, except that if a name returns to a previous rolename, its noposition will decreases (revert to a previous value). However, we want to keep the most-recent max value.
This does assume that unique preserves the natural order of the first-occurrences. If something goes amiss with this (I cannot think of something off-hand), we could do a window of words:
quux %>%
group_by(name) %>%
mutate(noposition = sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))) %>%
ungroup()
# # A tibble: 8 × 4
# date name rolename noposition
# <chr> <chr> <chr> <int>
# 1 2009-12-01 John helper 1
# 2 2010-12-01 John helper 1
# 3 2011-12-01 John senior helper 2
# 4 2012-12-01 John manager 3
# 5 2009-12-01 Will helper 1
# 6 2010-12-01 Will senior helper 2
# 7 2011-12-01 Will manager 3
# 8 2012-12-01 Will senior manager 4
This produces the same results here, and it will tend to perform more poorly with larger groups (as it is iterating a lot more). I offer it as an extension in case assumptions preclude the use of cummax(match(..)).
