Identify varying rows in pandas dataframe-CodePudding

I have a dataframe:

ColA    ColB    ColC    
a       0        1     
b       3        3     
c       1        1
a       0        1
a       1        2
b       3        3

I need to identify every row which has different values while filtering based on a value in a column. Example : when I filter the dataframe with value 'a' in ColA, the 5th row has different values in the ColB and ColC.

I tried with

df['result']=df['ColA'].ne(df['ColA'].shift().bfill()).astype(int)

which resulted in:

ColA	ColB	ColC	result
a	0	1	0
b	3	3	1
c	1	1	1
a	0	1	1
a	1	2	1
b	3	3	1

What I need is(Filtering for the value 'a' should identify the row with different values in other columns):

ColA	ColB	ColC	result
a	0	1	0
b	3	3	1
c	1	1	1
a	0	1	0
a	1	2	1
b	3	3	1

If I use groupby method:

df.groupby(df.columns.tolist())['ColA'].nunique()

it works only with small dataframes with a few data types.

CodePudding user response：

If I understand correctly, you can drop_duplicates and then create the result column with groupby and cumcount to get an identifier per unique row per group.

print(df.drop_duplicates(subset=['ColA','ColB','ColC'])
        .assign(result=lambda x: x.groupby('ColA').cumcount()))
#   ColA  ColB  ColC  result
# 0    a     0     1       0
# 1    b     3     3       0
# 2    c     1     1       0
# 4    a     1     2       1

As you can see, you are "missing rows" from the original df, so merge it back to df.

df = (
    df.merge(df.drop_duplicates(subset=['ColA','ColB','ColC'])
               .assign(result=lambda x: x.groupby('ColA').cumcount()), 
             how='left')
)
print(df)
#   ColA  ColB  ColC  result
# 0    a     0     1       0
# 1    b     3     3       0
# 2    c     1     1       0
# 3    a     0     1       0
# 4    a     1     2       1
# 5    b     3     3       0

CodePudding user response：

The fastest way is removing duplicated rows by drop_duplicates(keep=False) and creating result columns based on indexes.

data = {
    'ColA': ['a', 'b', 'c', 'a', 'a', 'b'], 
    'ColB': [0, 3, 1, 0, 1, 3],
    'ColC': [1, 3, 1, 1, 2, 3]
}
df = pd.DataFrame(data)

unique_rows = df.drop_duplicates(keep=False)
df['result'] = df.index.isin(unique_rows.index).astype(int)

print(df)

CodePudding user response：

If you just want to have an unique id, use pd.factorize on your entire dataframe without groupby:

df['result'] = pd.factorize(df.agg(tuple, axis=1))[0]
print(df)

# Output
  ColA  ColB  ColC  result
0    a     0     1       0
1    b     3     3       1
2    c     1     1       2
3    a     0     1       0
4    a     1     2       3
5    b     3     3       1

You can use duplicated:

df['result'] = (~df.duplicated(keep=False)).astype(int)
print(df)

# Output
  ColA  ColB  ColC  result
0    a     0     1       0
1    b     3     3       0
2    c     1     1       1
3    a     0     1       0
4    a     1     2       1
5    b     3     3       0