Is there a way to customize drop_duplicates so that it drops the "kind of" duplicates?
Example: pandas df
| Year | Name | ID | City |
|---|---|---|---|
| 2011 | Superman | 101 | Metropolis |
| 2011 | Batman | 102 | Gotham |
| 2012 | The Batman | 102 | Gotham |
| 2011 | Noobmaster69 | 103 | Online |
| 2011 | Noobmaster69 | 103 | Online |
I tried using drop_duplicates and got this:
| Year | Name | ID | City |
|---|---|---|---|
| 2011 | Superman | 101 | Metropolis |
| 2011 | Batman | 102 | Gotham |
| 2012 | The Batman | 102 | Gotham |
| 2011 | Noobmaster69 | 103 | Online |
I actually want to squeeze it even more: for ID 102, I only want the row with "The Batman", which is the newer info (2012 > 2011), to stay in the data frame. I'm expecting something like this:
| Year | Name | ID | City |
|---|---|---|---|
| 2011 | Superman | 101 | Metropolis |
| 2012 | The Batman | 102 | Gotham |
| 2011 | Noobmaster69 | 103 | Online |
CodePudding user response:
# Try this: the duplicates can easily be dropped using the ID column.
import pandas as pd
# Read your table data (read_csv already returns a DataFrame)
df = pd.read_csv("your_filename.csv")
# Compare rows on ID only and keep the last row for each ID
df = df.drop_duplicates(subset='ID', keep='last')
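subset='ID' tells drop_duplicates to compare rows on the ID column only, and keep='last' keeps the last occurrence of each ID and drops the earlier ones. Note that this relies on the newest row for each ID appearing last in the data, as it does in your example.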
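If the rows are not guaranteed to be in chronological order, one option is to sort by Year before dropping. Here is a minimal sketch, assuming the table from the question is built inline instead of read from a CSV; the names `df` and `result` are only for illustration.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2011, 2011],
    "Name": ["Superman", "Batman", "The Batman", "Noobmaster69", "Noobmaster69"],
    "ID": [101, 102, 102, 103, 103],
    "City": ["Metropolis", "Gotham", "Gotham", "Online", "Online"],
})

result = (
    df.sort_values("Year", kind="stable")        # newest row per ID ends up last
      .drop_duplicates(subset="ID", keep="last")  # keep only that newest row
      .sort_index()                               # restore the original row order
)

print(result)
```

The stable sort keeps rows with the same Year in their original relative order, and sort_index() is only there so the output follows the order of the input table. You can drop it if the order does not matter.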
