I want to drop duplicate rows from a DataFrame when the SYSTEM value repeats for the same KEY. Example:
In:
KEY SYSTEM
TD-438426 AAA
TD-438426 BBB
TD-438426 AAA
TD-438709 BBB
TD-438709 BBB
TD-438750 CCC
TD-438750 CCC
TD-438750 CCC
TD-438874 AAA
TD-438874 BBB
Out:
KEY SYSTEM
TD-438426 AAA
TD-438426 BBB
TD-438709 BBB
TD-438750 CCC
TD-438874 AAA
TD-438874 BBB
P.S. Of course, there are some exceptions that I also want to handle.
In:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 ABC BBB
Out:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 ABC BBB
And
In:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 CBA AAA
Out:
KEY TEST SYSTEM
TD-438426 ABC AAA
Answer:
As @mcsioni mentioned in the comments, what you are looking for is df.drop_duplicates().
It also helps to understand two arguments of this method: subset and keep.
For example, to retain only unique values in the KEY column and keep the first SYSTEM value for each unique KEY, you would do:
df.drop_duplicates(subset=['KEY'], keep='first')
If you call df.drop_duplicates() without any arguments, subset defaults to all columns, which is what your desired output asks for.
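To make the difference concrete, here is a small sketch that rebuilds the first example from the question and runs both calls. The variable names (deduped, first_per_key) and the reset_index step are my own additions for readability, not part of the original answer. Note that drop_duplicates returns a new DataFrame by default, so the result has to be assigned.

```python
import pandas as pd

# Sample data copied from the first "In:" table of the question.
df = pd.DataFrame({
    "KEY": ["TD-438426", "TD-438426", "TD-438426", "TD-438709", "TD-438709",
            "TD-438750", "TD-438750", "TD-438750", "TD-438874", "TD-438874"],
    "SYSTEM": ["AAA", "BBB", "AAA", "BBB", "BBB",
               "CCC", "CCC", "CCC", "AAA", "BBB"],
})

# No arguments: rows are compared on all columns (KEY and SYSTEM together),
# so each distinct KEY/SYSTEM pair is kept once.
deduped = df.drop_duplicates().reset_index(drop=True)

# subset=['KEY']: only KEY is compared, so a single row per key survives
# and the second SYSTEM value of TD-438426 and TD-438874 is lost.
first_per_key = df.drop_duplicates(subset=["KEY"], keep="first").reset_index(drop=True)

print(deduped)
print(first_per_key)
```

The first result has six rows and matches the desired "Out:" table. The second has only four rows, one per KEY, which is why subset=['KEY'] alone is not what the question asks for.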
EDIT
To handle your new requirement (the extra TEST column, which should be ignored when comparing rows), do this:
df.drop_duplicates(subset=['KEY', 'SYSTEM'], keep='first')
Note: The default for the keep argument is 'first', but it doesn't hurt to be explicit when working with high-level libraries like pandas.
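As a quick check, the two exception cases from the question can be run through the same call. This is only a sketch: the frame names case_a and case_b are made up for illustration, and the data is copied from the two "In:" tables that include the TEST column.

```python
import pandas as pd

# First exception: same KEY and TEST, different SYSTEM values.
case_a = pd.DataFrame({
    "KEY": ["TD-438426", "TD-438426"],
    "TEST": ["ABC", "ABC"],
    "SYSTEM": ["AAA", "BBB"],
})

# Second exception: same KEY and SYSTEM, different TEST values.
case_b = pd.DataFrame({
    "KEY": ["TD-438426", "TD-438426"],
    "TEST": ["ABC", "CBA"],
    "SYSTEM": ["AAA", "AAA"],
})

# Only KEY and SYSTEM are compared; TEST is ignored when looking for duplicates.
out_a = case_a.drop_duplicates(subset=["KEY", "SYSTEM"], keep="first")
out_b = case_b.drop_duplicates(subset=["KEY", "SYSTEM"], keep="first")

print(out_a)  # both rows survive, because SYSTEM differs
print(out_b)  # only the first row (TEST == 'ABC') survives
```

Because TEST is left out of subset, the row that is kept carries whichever TEST value appeared first for that KEY/SYSTEM pair.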
