I have the following dataframe:
Column1 | Column2 | Column3 | Column4
A | B | C | D
A | B | A | B
A | B | B | A
E | F | A | B
E | F | G | H
E | F | B | A
X | Y | E | F
X | Y | A | E
How do I remove the duplicates based on the values in both Column1 and Column2, so that I get the following result:
Column1 | Column2 | Column3 | Column4
A | B | C | D
E | F | G | H
X | Y | A | E
My approach is to record the indices of the rows that meet the condition and then drop those rows:
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A','A','A','E','E','E','X','X'],
                    'Column2': ['B','B','B','F','F','F','Y','Y'],
                    'Column3': ['C','A','B','A','G','B','E','A'],
                    'Column4': ['D','B','A','B','H','A','F','E']
                    })
excs = []
# i is the index of the (Column1, Column2) row in the outer loop
for i, (a, b) in enumerate(zip(df1.Column1, df1.Column2)):
    for c, d in zip(df1.Column3, df1.Column4):
        if a == c and b == d:
            excs.append(i)
for i in set(excs):
    df1.drop(i, inplace=True)
And I got:
Column1 | Column2 | Column3 | Column4
X | Y | E | F
X | Y | A | E
CodePudding user response:
You can use the drop_duplicates method, which has the following signature:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
So in your case:
df1 = df1.drop_duplicates(subset=['Column1', 'Column2'])
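As a minimal sketch of what this call does on the sample data from the question (the frame is rebuilt here so the snippet runs on its own, and the names `deduped` and `deduped_last` are only illustrative): with the default keep='first', one row is kept for each distinct (Column1, Column2) pair, namely the first one encountered.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A','A','A','E','E','E','X','X'],
                    'Column2': ['B','B','B','F','F','F','Y','Y'],
                    'Column3': ['C','A','B','A','G','B','E','A'],
                    'Column4': ['D','B','A','B','H','A','F','E']})

# keep='first' (the default) retains the first row of each (Column1, Column2) group
deduped = df1.drop_duplicates(subset=['Column1', 'Column2'])
print(deduped)

# keep='last' retains the last row of each group instead
deduped_last = df1.drop_duplicates(subset=['Column1', 'Column2'], keep='last')
print(deduped_last)
```

On this data the first call keeps the rows at index 0, 3 and 6, so which row survives in each group depends only on row order. If the surviving row has to be chosen by looking at Column3 and Column4, as in the expected output in the question, a different filter is probably needed.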
CodePudding user response:
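I just got the expected result. The problem was that I recorded the wrong indices: the index has to come from the inner loop over Column3 and Column4, and the reversed pair has to be checked as well. My apologies.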
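excs = []
for a, b in zip(df1.Column1, df1.Column2):
    # i is now the index of the (Column3, Column4) row being compared
    for i, (c, d) in enumerate(zip(df1.Column3, df1.Column4)):
        # match the pair in either order
        if (a == c and b == d) or (a == d and b == c):
            excs.append(i)
for i in set(excs):
    df1.drop(i, inplace=True)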
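The same filter can probably also be written without nested loops and without dropping rows one at a time. The following is only a sketch of one possible formulation, assuming the goal is to remove every row whose (Column3, Column4) pair equals some (Column1, Column2) pair in either order. The names `pairs`, `matches` and `result` are illustrative and not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A','A','A','E','E','E','X','X'],
                    'Column2': ['B','B','B','F','F','F','Y','Y'],
                    'Column3': ['C','A','B','A','G','B','E','A'],
                    'Column4': ['D','B','A','B','H','A','F','E']})

# All distinct (Column1, Column2) pairs, stored in a set for fast lookup
pairs = set(zip(df1['Column1'], df1['Column2']))

# True for rows whose (Column3, Column4) matches one of those pairs in either order
matches = pd.Series(
    [(c, d) in pairs or (d, c) in pairs
     for c, d in zip(df1['Column3'], df1['Column4'])],
    index=df1.index,
)

# Keep only the rows that did not match; df1 itself is left unchanged
result = df1[~matches]
print(result)
```

Building the set once avoids comparing every row against every other row, and filtering with a boolean mask returns a new frame instead of modifying df1 in place.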
