Python dataframe drop duplicates based on two pairs of columns

Time:02-02

I have the following dataframe:

Column1 | Column2 | Column3 | Column4
   A         B         C         D
   A         B         A         B
   A         B         B         A
   E         F         A         B
   E         F         G         H
   E         F         B         A
   X         Y         E         F
   X         Y         A         E

How do I remove every row whose (Column3, Column4) pair matches any (Column1, Column2) pair, in either order, so that I get the following result:

Column1 | Column2 | Column3 | Column4
   A         B         C         D
   E         F         G         H
   X         Y         A         E

My approach is to record the indices of the rows that meet the condition and then drop those rows:

import pandas as pd

df1 = pd.DataFrame({'Column1':['A','A','A','E','E','E','X','X'],
                    'Column2':['B','B','B','F','F','F','Y','Y'],
                    'Column3':['C','A','B','A','G','B','E','A'],
                    'Column4':['D','B','A','B','H','A','F','E']
                    })
excs =[]
for i, (a,b) in enumerate(zip(df1.Column1,df1.Column2)):
    for c,d in zip(df1.Column3,df1.Column4):
        if a == c and b == d:
            excs.append(i)

for i in set(excs):
    df1.drop(i,inplace=True)

And I got:

Column1 | Column2 | Column3 | Column4
   X         Y         E         F
   X         Y         A         E

CodePudding user response:

You can use the drop_duplicates method, which has the following signature:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

so in your case:

df1 = df1.drop_duplicates(subset=['Column1', 'Column2'])
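
To see what this call returns on the sample data from the question, here is a minimal sketch. The name `deduped` is only illustrative. With the default keep='first', drop_duplicates keeps the first row of each (Column1, Column2) group and never looks at Column3 or Column4, so it appears to keep rows 0, 3 and 6 rather than the rows 0, 4 and 7 shown in the desired result.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A', 'A', 'A', 'E', 'E', 'E', 'X', 'X'],
                    'Column2': ['B', 'B', 'B', 'F', 'F', 'F', 'Y', 'Y'],
                    'Column3': ['C', 'A', 'B', 'A', 'G', 'B', 'E', 'A'],
                    'Column4': ['D', 'B', 'A', 'B', 'H', 'A', 'F', 'E']})

# keep='first' (the default) retains the first row seen for each
# distinct (Column1, Column2) combination and drops the later ones.
deduped = df1.drop_duplicates(subset=['Column1', 'Column2'])
print(deduped)
```

If the goal is to compare the (Column3, Column4) values against the (Column1, Column2) values, a different filter is needed.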

CodePudding user response:

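I got the expected result. The problem was that I recorded the wrong indices: I enumerated the outer loop over (Column1, Column2) instead of the inner loop over (Column3, Column4). I also needed to check the reversed pair. My apologies.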

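excs = []
# For each (Column1, Column2) pair, scan every (Column3, Column4) pair
for a, b in zip(df1.Column1, df1.Column2):
    for i, (c, d) in enumerate(zip(df1.Column3, df1.Column4)):
        # Match in either order: (a, b) == (c, d) or (a, b) == (d, c)
        if (a == c and b == d) or (a == d and b == c):
            excs.append(i)

# Drop each matching row once (set removes repeated indices)
for i in set(excs):
    df1.drop(i, inplace=True)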
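
The nested loops above compare every row against every other row. As an alternative sketch (not the approach from the answers above), the same filter can be written with a set of pairs and a boolean mask. The names `pairs`, `mask` and `result` are illustrative, and the code assumes the sample data from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A', 'A', 'A', 'E', 'E', 'E', 'X', 'X'],
                    'Column2': ['B', 'B', 'B', 'F', 'F', 'F', 'Y', 'Y'],
                    'Column3': ['C', 'A', 'B', 'A', 'G', 'B', 'E', 'A'],
                    'Column4': ['D', 'B', 'A', 'B', 'H', 'A', 'F', 'E']})

# Every distinct (Column1, Column2) pair, plus its reversed form.
pairs = set(zip(df1['Column1'], df1['Column2']))
pairs |= {(b, a) for a, b in pairs}

# True where the row's (Column3, Column4) pair appears in that set.
mask = pd.Series(
    [pair in pairs for pair in zip(df1['Column3'], df1['Column4'])],
    index=df1.index,
)

# Keep only the rows that did not match.
result = df1[~mask]
print(result)
```

Set lookups make this a single pass over the rows, and building `result` from a mask leaves `df1` unchanged instead of dropping rows in place.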