Deleting duplicates with hierarchy condition SQL Python Pandas-CodePudding

I need to delete duplicates in a large database, but the rows to be deleted must be based on a hierarchy using either SQLite or Python Pandas. Is there a efficient way to relize this? preferably using python pandas dataframe but SQLite is also fine.

ID	Text	Category
1	text	Priority 3
2	text	Priority 1
3	text	Priority 2
4	text 2	Priority 3
5	text 2	Priority 2

should turn to this:

ID	Text	Category
2	text	Priority 1
5	text 2	Priority 2

CodePudding user response：

Try this:

df = df.sort_values(by=['Text','Category'], ascending=[True,True])
df.groupby('Text')['Category'].first().reset_index()

Output:

index	Text	Category
0	text	Priority 1
1	text 2	Priority 2

CodePudding user response：

Very similar approach to @Drakax but using drop_duplicates instead of groupby and first

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
    'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})

df.sort_values(['Text','Category']).drop_duplicates('Text')