I have a dataframe:
ColA ColB ColC
a 0 1
b 3 3
c 1 1
a 0 1
a 1 2
b 3 3
I need to identify every row which has different values while filtering based on a value in a column. Example : when I filter the dataframe with value 'a' in ColA, the 5th row has different values in the ColB and ColC.
I tried with
df['result']=df['ColA'].ne(df['ColA'].shift().bfill()).astype(int)
which resulted in:
| ColA | ColB | ColC | result |
|---|---|---|---|
| a | 0 | 1 | 0 |
| b | 3 | 3 | 1 |
| c | 1 | 1 | 1 |
| a | 0 | 1 | 1 |
| a | 1 | 2 | 1 |
| b | 3 | 3 | 1 |
What I need is(Filtering for the value 'a' should identify the row with different values in other columns):
| ColA | ColB | ColC | result |
|---|---|---|---|
| a | 0 | 1 | 0 |
| b | 3 | 3 | 1 |
| c | 1 | 1 | 1 |
| a | 0 | 1 | 0 |
| a | 1 | 2 | 1 |
| b | 3 | 3 | 1 |
If I use groupby method:
df.groupby(df.columns.tolist())['ColA'].nunique()
it works only with small dataframes with a few data types.
CodePudding user response:
If I understand correctly, you can drop_duplicates and then create the result column with groupby and cumcount to get an identifier per unique row per group.
print(df.drop_duplicates(subset=['ColA','ColB','ColC'])
.assign(result=lambda x: x.groupby('ColA').cumcount()))
# ColA ColB ColC result
# 0 a 0 1 0
# 1 b 3 3 0
# 2 c 1 1 0
# 4 a 1 2 1
As you can see, you are "missing rows" from the original df, so merge it back to df.
df = (
df.merge(df.drop_duplicates(subset=['ColA','ColB','ColC'])
.assign(result=lambda x: x.groupby('ColA').cumcount()),
how='left')
)
print(df)
# ColA ColB ColC result
# 0 a 0 1 0
# 1 b 3 3 0
# 2 c 1 1 0
# 3 a 0 1 0
# 4 a 1 2 1
# 5 b 3 3 0
CodePudding user response:
The fastest way is removing duplicated rows by drop_duplicates(keep=False) and creating result columns based on indexes.
data = {
'ColA': ['a', 'b', 'c', 'a', 'a', 'b'],
'ColB': [0, 3, 1, 0, 1, 3],
'ColC': [1, 3, 1, 1, 2, 3]
}
df = pd.DataFrame(data)
unique_rows = df.drop_duplicates(keep=False)
df['result'] = df.index.isin(unique_rows.index).astype(int)
print(df)
CodePudding user response:
If you just want to have an unique id, use pd.factorize on your entire dataframe without groupby:
df['result'] = pd.factorize(df.agg(tuple, axis=1))[0]
print(df)
# Output
ColA ColB ColC result
0 a 0 1 0
1 b 3 3 1
2 c 1 1 2
3 a 0 1 0
4 a 1 2 3
5 b 3 3 1
You can use duplicated:
df['result'] = (~df.duplicated(keep=False)).astype(int)
print(df)
# Output
ColA ColB ColC result
0 a 0 1 0
1 b 3 3 0
2 c 1 1 1
3 a 0 1 0
4 a 1 2 1
5 b 3 3 0
