Home > Software engineering >  drop duplicates based on two criteria
drop duplicates based on two criteria

Time:01-14

My dataframe reads

** | Object ID | Top depth | Base Depth|**
 1 | obj1 | 1500 | 2000 |
 2 | obj1 | 1500 | 1800 |
 3 | obj1 | 1900 | 2000 |
 4 | obj2 | 900  | 1200 |
 5 | obj2 | 1000 | 1200 |
 6 | obj3 | 700  | 800  |

I want to drop duplicates but only if it is the same ID with a duplicate top and base depth. for instance, obj1(1) in the first row would be removed as obj1(2) has the same top depth and obj1(3) has the same base depth, obj2 would be left alone because only the base depth is a match.

I was trying to alter the drop duplicates command but I am not sure how to alter it for two true statements i.e. ID & Top Match AND ID & bottom Match

One of my failed attempts:

new_df = old_df.drop_duplicates(subset=[('Object ID', 'Top depth') and ('Object ID', 'Base Depth')]).reset_index(drop=True)

Any help is greatly appreciated.

Edit for further clarification brought to my attention below: The ones I want to drop are combinations of the two others, it encapsulates the depths. Shallowest is the top depth and deepest is the base depth.

Thanks!

CodePudding user response:

Try it like this.

import pandas as pd
df = pd.DataFrame({'Object ID': ['obj1', 'obj1', 'obj1', 'obj2', 'obj2', 'obj3'], 
                    'Top Depth': [1500, 1500, 1900, 900, 1000, 800],
                    'Base Depth':[2000, 1800, 2000, 1200, 1200, 800]})

df

enter image description here

df = df.sort_values('Object ID').drop_duplicates(subset=['Object ID','Top Depth'], keep='last')
df

enter image description here

CodePudding user response:

I think this approach should work.

df = pd.DataFrame({'Object ID': ['obj1', 'obj1', 'obj1', 'obj2', 'obj2', 'obj2','obj2'], 
                    'Top Depth': [1500, 1500, 1900, 900, 1000, 1000, 900],
                    'Base Depth':[2000, 1800, 2000, 1200, 1200, 800, 1200]})

  Object ID  Top Depth  Base Depth
0      obj1       1500        2000
1      obj1       1500        1800
2      obj1       1900        2000
3      obj2        900        1200
4      obj2       1000        1200
5      obj2       1000         800
6      obj2        900        1200
# drop complete duplicates first
df = df.drop_duplicates(subset=['Object ID','Top Depth','Base Depth'])

# get list of values for Top Depth and Base Depth within the same Object ID
df['Top Depth List'] = df['Object ID'].map(df.groupby('Object ID')['Top Depth'].agg(list).to_dict())
df['Base Depth List'] = df['Object ID'].map(df.groupby('Object ID')['Base Depth'].agg(list).to_dict())

# determine duplicates
df['Duplicate'] = df.apply(lambda x: (x['Base Depth List'].count(x['Base Depth']) > 1) & (x['Top Depth List'].count(x['Top Depth']) > 1), axis=1)

# keep only non-duplicates and the original columns
df = df.loc[~df['Duplicate'], ['Object ID','Top Depth','Base Depth']]

  Object ID  Top Depth  Base Depth
1      obj1       1500        1800
2      obj1       1900        2000
3      obj2        900        1200
5      obj2       1000         800

CodePudding user response:

find duplicates in the first two columns, find duplicates in the first and the last columns. Any thing is duplicate in both of them should be dropped: (Note you should use keep = 'last')

df[~(df.duplicated(subset = ['Object ID','Top Depth'], keep = "last") & df.duplicated(subset = ['Object ID','Base Depth'], keep = "last"))]
  •  Tags:  
  • Related