Comparaison row by row with groupby-CodePudding

I have a data frame like this:

ordre/id /date /origine /destination /horaire A /horaire B

1 1112 2021-03-11 Paris / Marseille/10:00/14:00
2 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
3 1112 2021-03-11 Paris / Marseille/10:00/14:00
4 1114 2021-05-11 Paris / Bordeaux/10:20/14:00 
5 1112 2021-03-11 Paris / Marseille/10:00/14:00
6 1112 2021-03-11 Paris / Marseille/10:00/14:00
7 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
8 1114 2021-05-11 Paris / Bordeaux/10:00/14:00
9 1112 2021-03-11 Paris / Lyon/10:00/12:00

I want to add a new column note that will store comparaison value for each object groupby same id and date, any change 'date /origine /destination /horaire A /horaire B' so note True

example:

for row 9, destination is lyon where is difference previous line 'marseille' so note is True
for row 4, we have a diffrences data 'Horaire A' and 'Horaire B' so note is True

Output:

1 1112 2021-03-11 Paris / Marseille/10:00/14:00
2 1112 2021-03-11 Paris / Marseille/10:00/14:00
3 1112 2021-03-11 Paris / Marseille/10:00/14:00
4 1112 2021-03-11 Paris / Lyon/10:00/12:00/True
5 1112 2021-03-11 Paris / Marseille/10:00/14:00/True
6 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
7 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
8 1114 2021-05-11 Paris / Bordeaux/10:00/14:00/True
9 1114 2021-05-11 Paris / Bordeaux/10:20/14:00/True

I wrote this code:

df['Note'] = df.groupby(['Date','id']).apply(lambda x: (x['Origine'] != x['Origine'].shift(-1)) | (x['Destination'] != x['Destination'].shift(-1)) | (x['Horaire A'] != x['Horaire A'].shift(-1)) | (x['Horaire B'] != x['Horaire B'].shift(-1)))
df['Note'] = df['Note'].shift(1)

But this program gives an error: incompatible index of inserted column with frame index

How do i fix it?

CodePudding user response：

IIUC, you can compare the rows with the shifted rows, per group. If any field mismatches, then we set the output as True.

I am relying on "ordre" as unique key here to merge back to the original data, but you could use the index if this is not the case. In such case, "order" should be dropped of the groupby.

df.merge(df.set_index('ordre')
           .groupby(['id', 'date'], group_keys=False)
           .apply(lambda d: d.ne(d.shift().bfill()).any(1))
           .rename('diff_previous'),
         left_on='ordre', right_index=True
        )

output:

   ordre    id        date origine destination horaire A horaire B  diff_previous
0      1  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
1      2  1114  2021-05-11   Paris    Bordeaux     09:00     13:00          False
2      3  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
3      4  1114  2021-05-11   Paris    Bordeaux     10:20     14:00           True
4      5  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
5      6  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
6      7  1114  2021-05-11   Paris    Bordeaux     09:00     13:00           True
7      8  1114  2021-05-11   Paris    Bordeaux     10:00     14:00           True
8      9  1112  2021-03-11   Paris        Lyon     10:00     12:00           True

CodePudding user response：

I've used the following code to generate the DataFrame:

data = {
    "ordre": range(1, 10),
    "id": [1112, 1114, 1112, 1114, 1112, 1112, 1114, 1114, 1112],
    "date": [
        "2021-03-11",
        "2021-05-11",
        "2021-03-11",
        "2021-05-11",
        "2021-03-11",
        "2021-03-11",
        "2021-05-11",
        "2021-05-11",
        "2021-03-11",
    ],
    "origine": ["Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris"],
    "destination": [
        "Marseille",
        "Bordeaux",
        "Marseille",
        "Bordeaux",
        "Marseille",
        "Marseille",
        "Bordeaux",
        "Bordeaux",
        "Lyon",
    ],
    "horaire A": ["10:00", "09:00", "10:00", "10:20", "10:00", "10:00", "09:00", "10:00", "10:00"],
    "horaire B": ["14:00", "13:00", "14:00", "14:00", "14:00", "14:00", "13:00", "14:00", "12:00"],
}

df = pd.DataFrame(data)

Then the idea is to:

Sort the data by ("date", "id", "ordre")
Set "note" to True if :

a. ("date", "id") are the same as previous row

b. One of ("origine", "destination", "horaire A", "horaire B") is different from the previous row

Which translates into:

index_cols = ["date", "id"]
compare_cols = ["origine", "destination", "horaire A", "horaire B"]

df = df.sort_values(by=index_cols ["ordre"])
shifted_compare = df[index_cols   compare_cols].shift(1).eq(df[index_cols   compare_cols])

df["note"] = shifted_compare[index_cols].all(axis=1) & ~shifted_compare[compare_cols].all(axis=1)

Which outputs to:

>>> df.sort_values(by="ordre")
   ordre    id        date origine destination horaire A horaire B   note
0      1  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
1      2  1114  2021-05-11   Paris    Bordeaux     09:00     13:00  False
2      3  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
3      4  1114  2021-05-11   Paris    Bordeaux     10:20     14:00   True
4      5  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
5      6  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
6      7  1114  2021-05-11   Paris    Bordeaux     09:00     13:00   True
7      8  1114  2021-05-11   Paris    Bordeaux     10:00     14:00   True
8      9  1112  2021-03-11   Paris        Lyon     10:00     12:00   True