I am attempting to create a dataframe column that assigns a sequential number for each change in a combination of values. Each combination of student and term represents a group. My data looks like this.
| student | year |
|---|---|
| A | 20211 |
| A | 20222 |
| A | 20222 |
| A | 20225 |
| B | 20211 |
| B | 20211 |
| B | 20227 |
| C | 20211 |
| C | 20222 |
| C | 20229 |
And I want to assign values in a new column to indicate each unique student and year combination. I've tried sort_values with groupby and cumcount but I'm getting a sequence of rows not just when the year value changes. This is what I want
| student | year | enrollment |
|---|---|---|
| A | 20211 | 1 |
| A | 20222 | 2 |
| A | 20222 | 2 |
| A | 20225 | 3 |
| B | 20211 | 1 |
| B | 20211 | 1 |
| B | 20227 | 2 |
| C | 20211 | 1 |
| C | 20222 | 2 |
| C | 20229 | 3 |
CodePudding user response:
You can use pd.factorize per student group:
df['enrollment'] = df.groupby('student')['year'] \
.transform(lambda x: pd.factorize(x)[0] 1)
print(df)
# Output:
student year enrollment
0 A 20211 1
1 A 20222 2
2 A 20222 2
3 A 20225 3
4 B 20211 1
5 B 20211 1
6 B 20227 2
7 C 20211 1
8 C 20222 2
9 C 20229 3
