I need to delete duplicates in a large database, but the rows to be deleted must be based on a hierarchy using either SQLite or Python Pandas. Is there a efficient way to relize this? preferably using python pandas dataframe but SQLite is also fine.
| ID | Text | Category |
|---|---|---|
| 1 | text | Priority 3 |
| 2 | text | Priority 1 |
| 3 | text | Priority 2 |
| 4 | text 2 | Priority 3 |
| 5 | text 2 | Priority 2 |
should turn to this:
| ID | Text | Category |
|---|---|---|
| 2 | text | Priority 1 |
| 5 | text 2 | Priority 2 |
CodePudding user response:
Try this:
df = df.sort_values(by=['Text','Category'], ascending=[True,True])
df.groupby('Text')['Category'].first().reset_index()
Output:
| index | Text | Category |
|---|---|---|
| 0 | text | Priority 1 |
| 1 | text 2 | Priority 2 |
CodePudding user response:
Very similar approach to @Drakax but using drop_duplicates instead of groupby and first
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3, 4, 5],
'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})
df.sort_values(['Text','Category']).drop_duplicates('Text')
