I have a data frame like this:
ordre/id /date /origine /destination /horaire A /horaire B
1 1112 2021-03-11 Paris / Marseille/10:00/14:00
2 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
3 1112 2021-03-11 Paris / Marseille/10:00/14:00
4 1114 2021-05-11 Paris / Bordeaux/10:20/14:00
5 1112 2021-03-11 Paris / Marseille/10:00/14:00
6 1112 2021-03-11 Paris / Marseille/10:00/14:00
7 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
8 1114 2021-05-11 Paris / Bordeaux/10:00/14:00
9 1112 2021-03-11 Paris / Lyon/10:00/12:00
I want to add a new column note that will store comparaison value for each object groupby same id and date, any change 'date /origine /destination /horaire A /horaire B' so note True
example:
for row 9,
destinationis lyon where is difference previous line 'marseille' sonoteis Truefor row 4, we have a diffrences data
'Horaire A'and'Horaire B'sonoteis True
Output:
1 1112 2021-03-11 Paris / Marseille/10:00/14:00
2 1112 2021-03-11 Paris / Marseille/10:00/14:00
3 1112 2021-03-11 Paris / Marseille/10:00/14:00
4 1112 2021-03-11 Paris / Lyon/10:00/12:00/True
5 1112 2021-03-11 Paris / Marseille/10:00/14:00/True
6 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
7 1114 2021-05-11 Paris / Bordeaux/09:00/13:00
8 1114 2021-05-11 Paris / Bordeaux/10:00/14:00/True
9 1114 2021-05-11 Paris / Bordeaux/10:20/14:00/True
I wrote this code:
df['Note'] = df.groupby(['Date','id']).apply(lambda x: (x['Origine'] != x['Origine'].shift(-1)) | (x['Destination'] != x['Destination'].shift(-1)) | (x['Horaire A'] != x['Horaire A'].shift(-1)) | (x['Horaire B'] != x['Horaire B'].shift(-1)))
df['Note'] = df['Note'].shift(1)
But this program gives an error: incompatible index of inserted column with frame index
How do i fix it?
CodePudding user response:
IIUC, you can compare the rows with the shifted rows, per group. If any field mismatches, then we set the output as True.
I am relying on "ordre" as unique key here to merge back to the original data, but you could use the index if this is not the case. In such case, "order" should be dropped of the groupby.
df.merge(df.set_index('ordre')
.groupby(['id', 'date'], group_keys=False)
.apply(lambda d: d.ne(d.shift().bfill()).any(1))
.rename('diff_previous'),
left_on='ordre', right_index=True
)
output:
ordre id date origine destination horaire A horaire B diff_previous
0 1 1112 2021-03-11 Paris Marseille 10:00 14:00 False
1 2 1114 2021-05-11 Paris Bordeaux 09:00 13:00 False
2 3 1112 2021-03-11 Paris Marseille 10:00 14:00 False
3 4 1114 2021-05-11 Paris Bordeaux 10:20 14:00 True
4 5 1112 2021-03-11 Paris Marseille 10:00 14:00 False
5 6 1112 2021-03-11 Paris Marseille 10:00 14:00 False
6 7 1114 2021-05-11 Paris Bordeaux 09:00 13:00 True
7 8 1114 2021-05-11 Paris Bordeaux 10:00 14:00 True
8 9 1112 2021-03-11 Paris Lyon 10:00 12:00 True
CodePudding user response:
I've used the following code to generate the DataFrame:
data = {
"ordre": range(1, 10),
"id": [1112, 1114, 1112, 1114, 1112, 1112, 1114, 1114, 1112],
"date": [
"2021-03-11",
"2021-05-11",
"2021-03-11",
"2021-05-11",
"2021-03-11",
"2021-03-11",
"2021-05-11",
"2021-05-11",
"2021-03-11",
],
"origine": ["Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris"],
"destination": [
"Marseille",
"Bordeaux",
"Marseille",
"Bordeaux",
"Marseille",
"Marseille",
"Bordeaux",
"Bordeaux",
"Lyon",
],
"horaire A": ["10:00", "09:00", "10:00", "10:20", "10:00", "10:00", "09:00", "10:00", "10:00"],
"horaire B": ["14:00", "13:00", "14:00", "14:00", "14:00", "14:00", "13:00", "14:00", "12:00"],
}
df = pd.DataFrame(data)
Then the idea is to:
Sort the data by
("date", "id", "ordre")Set
"note"toTrueif :a.
("date", "id")are the same as previous rowb. One of
("origine", "destination", "horaire A", "horaire B")is different from the previous row
Which translates into:
index_cols = ["date", "id"]
compare_cols = ["origine", "destination", "horaire A", "horaire B"]
df = df.sort_values(by=index_cols ["ordre"])
shifted_compare = df[index_cols compare_cols].shift(1).eq(df[index_cols compare_cols])
df["note"] = shifted_compare[index_cols].all(axis=1) & ~shifted_compare[compare_cols].all(axis=1)
Which outputs to:
>>> df.sort_values(by="ordre")
ordre id date origine destination horaire A horaire B note
0 1 1112 2021-03-11 Paris Marseille 10:00 14:00 False
1 2 1114 2021-05-11 Paris Bordeaux 09:00 13:00 False
2 3 1112 2021-03-11 Paris Marseille 10:00 14:00 False
3 4 1114 2021-05-11 Paris Bordeaux 10:20 14:00 True
4 5 1112 2021-03-11 Paris Marseille 10:00 14:00 False
5 6 1112 2021-03-11 Paris Marseille 10:00 14:00 False
6 7 1114 2021-05-11 Paris Bordeaux 09:00 13:00 True
7 8 1114 2021-05-11 Paris Bordeaux 10:00 14:00 True
8 9 1112 2021-03-11 Paris Lyon 10:00 12:00 True
