Identify a change in a group of parameters within Pandas Dataframe and create an incrementing value-CodePudding

I have a DataFrame:

dct = {'A':['abc','abc','abc', 'xyz', 'xyz','abc','abc','abc', 'xyz', 'xyz', 'xyz', 'xyz'],
       'B':['a','a','a','a','a','z','z','z','p','p','p','q'],
       'C':[1,1,1,1,2,5,5,5,9,9,9,9],
   'GROUP':[123,123,123,123,123,456,456,456,767,767,767,767]
    }

df = pd.DataFrame(dct)

    A   B   C   GROUP
0   abc a   1   123
1   abc a   1   123
2   abc a   1   123
3   xyz a   1   123
4   xyz a   2   123
5   abc z   5   456
6   abc z   5   456
7   abc z   5   456
8   xyz p   9   767
9   xyz p   9   767
10  xyz p   9   767
11  xyz q   9   767

I am trying to create a new column called 'change'.

Assume that I am grouping by Group, and a change occurs when anything in columns A, B or C changes from the previous row. 'change' is incremented by 1 for that group. If nothing changes the same value remains. When a new group starts the change value begins again at 1. I am able to accomplish this using lists and loops but feel like there should be a more pythonic resolution using pandas?

Example output would look like this:

    A   B   C GROUP change
0   abc a   1   123 1
1   abc a   1   123 1
2   abc a   1   123 1
3   xyz a   1   123 2
4   xyz a   2   123 3
5   abc z   5   456 1
6   abc z   5   456 1
7   abc z   5   456 1
8   xyz p   9   767 1
9   xyz p   9   767 1
10  xyz p   9   767 1
11  xyz q   9   767 2

CodePudding user response：

You can calculate the cumulative sum of non-duplicated values per group:

df["change"] = (df.groupby("GROUP", group_keys=False)
                  .apply(lambda gr: (~gr.duplicated()).cumsum()))

to get

      A  B  C  GROUP  change
0   abc  a  1    123       1
1   abc  a  1    123       1
2   abc  a  1    123       1
3   xyz  a  1    123       2
4   xyz  a  2    123       3
5   abc  z  5    456       1
6   abc  z  5    456       1
7   abc  z  5    456       1
8   xyz  p  9    767       1
9   xyz  p  9    767       1
10  xyz  p  9    767       1
11  xyz  q  9    767       2

The group_keys=False won't result in MultiIndex after apply and eases the assignment with index alignment.

An example group and what apply does to it:

>>> gr = df.groupby("GROUP", group_keys=False).get_group(123)
>>> gr
     A  B  C  GROUP  change
0  abc  a  1    123       1
1  abc  a  1    123       1
2  abc  a  1    123       1
3  xyz  a  1    123       2
4  xyz  a  2    123       3

# note the `~` in front; negates the result
# so the "first" seen values are True
>>> ~gr.duplicated()
0     True
1    False
2    False
3     True
4     True
dtype: bool

>>> (~gr.duplicated()).cumsum()
0    1
1    1
2    1
3    2
4    3
dtype: int32

So with duplicated's flag, False ones (i.e., duplicates) won't contribute to the cumulative sum and it'll stay the same through them.

CodePudding user response：

You could use a double groupby and the groupby number (ngroup):

df['change'] = (df.groupby('GROUP', group_keys=False)
                  .apply(lambda d: d.groupby(list(df.columns), sort=False)
                                    .ngroup().add(1))
                )

output:

      A  B  C  GROUP  change
0   abc  a  1    123       1
1   abc  a  1    123       1
2   abc  a  1    123       1
3   xyz  a  1    123       2
4   xyz  a  2    123       3
5   abc  z  5    456       1
6   abc  z  5    456       1
7   abc  z  5    456       1
8   xyz  p  9    767       1
9   xyz  p  9    767       1
10  xyz  p  9    767       1
11  xyz  q  9    767       2