I have a DataFrame:
dct = {'A':['abc','abc','abc', 'xyz', 'xyz','abc','abc','abc', 'xyz', 'xyz', 'xyz', 'xyz'],
'B':['a','a','a','a','a','z','z','z','p','p','p','q'],
'C':[1,1,1,1,2,5,5,5,9,9,9,9],
'GROUP':[123,123,123,123,123,456,456,456,767,767,767,767]
}
df = pd.DataFrame(dct)
A B C GROUP
0 abc a 1 123
1 abc a 1 123
2 abc a 1 123
3 xyz a 1 123
4 xyz a 2 123
5 abc z 5 456
6 abc z 5 456
7 abc z 5 456
8 xyz p 9 767
9 xyz p 9 767
10 xyz p 9 767
11 xyz q 9 767
I am trying to create a new column called 'change'.
Assume that I am grouping by Group, and a change occurs when anything in columns A, B or C changes from the previous row. 'change' is incremented by 1 for that group. If nothing changes the same value remains. When a new group starts the change value begins again at 1. I am able to accomplish this using lists and loops but feel like there should be a more pythonic resolution using pandas?
Example output would look like this:
A B C GROUP change
0 abc a 1 123 1
1 abc a 1 123 1
2 abc a 1 123 1
3 xyz a 1 123 2
4 xyz a 2 123 3
5 abc z 5 456 1
6 abc z 5 456 1
7 abc z 5 456 1
8 xyz p 9 767 1
9 xyz p 9 767 1
10 xyz p 9 767 1
11 xyz q 9 767 2
CodePudding user response:
You can calculate the cumulative sum of non-duplicated values per group:
df["change"] = (df.groupby("GROUP", group_keys=False)
.apply(lambda gr: (~gr.duplicated()).cumsum()))
to get
A B C GROUP change
0 abc a 1 123 1
1 abc a 1 123 1
2 abc a 1 123 1
3 xyz a 1 123 2
4 xyz a 2 123 3
5 abc z 5 456 1
6 abc z 5 456 1
7 abc z 5 456 1
8 xyz p 9 767 1
9 xyz p 9 767 1
10 xyz p 9 767 1
11 xyz q 9 767 2
The group_keys=False won't result in MultiIndex after apply and eases the assignment with index alignment.
An example group and what apply does to it:
>>> gr = df.groupby("GROUP", group_keys=False).get_group(123)
>>> gr
A B C GROUP change
0 abc a 1 123 1
1 abc a 1 123 1
2 abc a 1 123 1
3 xyz a 1 123 2
4 xyz a 2 123 3
# note the `~` in front; negates the result
# so the "first" seen values are True
>>> ~gr.duplicated()
0 True
1 False
2 False
3 True
4 True
dtype: bool
>>> (~gr.duplicated()).cumsum()
0 1
1 1
2 1
3 2
4 3
dtype: int32
So with duplicated's flag, False ones (i.e., duplicates) won't contribute to the cumulative sum and it'll stay the same through them.
CodePudding user response:
You could use a double groupby and the groupby number (ngroup):
df['change'] = (df.groupby('GROUP', group_keys=False)
.apply(lambda d: d.groupby(list(df.columns), sort=False)
.ngroup().add(1))
)
output:
A B C GROUP change
0 abc a 1 123 1
1 abc a 1 123 1
2 abc a 1 123 1
3 xyz a 1 123 2
4 xyz a 2 123 3
5 abc z 5 456 1
6 abc z 5 456 1
7 abc z 5 456 1
8 xyz p 9 767 1
9 xyz p 9 767 1
10 xyz p 9 767 1
11 xyz q 9 767 2
