Find duplicates in dataframe and group them by assigning a key

Time:04-30

I've looked around and found similar questions, but none of them helped me find a solution. I want my script to read a CSV that looks like this:

hot_dict = {'Links': links, 'Titles': titles, 'Datestamps': datestamp_extended, 'GroupID': ""}

I want to find all duplicate links in the "Links" column and give every set of identical links the same key in the "GroupID" column:

Links GroupID
A Key1
B Key2
A Key1
B Key2

This obviously gives me only True and False values:

df['GroupID'] = df.duplicated(subset=['Links'], keep=False)

Is there an elegant way to continue from here?

Thanks a lot!

CodePudding user response:

For a simple key with an integer ID, you can first convert the Links column to categorical data, then just obtain the category code from that:

df['GroupID'] = df['Links'].astype('category').cat.codes
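
As a rough sketch of how this could be extended to produce the "Key1"/"Key2" style labels from the question, the example below builds a small frame and derives the keys from the category codes. The sample values and the extra column names (GroupID_by_appearance, DupGroupID) are made up for illustration and are not part of the original data. It also shows pd.factorize as an alternative that numbers links in order of first appearance rather than sorted order, and an optional step that blanks the key for links occurring only once.

```python
import pandas as pd

# Hypothetical sample data; the real frame would be built from the question's hot_dict.
df = pd.DataFrame({"Links": ["A", "B", "A", "B", "C"]})

# Category codes: identical links share one integer,
# numbered by the sorted order of the unique values (0-based).
codes = df["Links"].astype("category").cat.codes

# Turn the 0-based codes into labels shaped like "Key1", "Key2", ...
df["GroupID"] = "Key" + (codes + 1).astype(str)

# Alternative: pd.factorize numbers the links in order of first appearance.
df["GroupID_by_appearance"] = pd.factorize(df["Links"])[0] + 1

# Optional: keep a key only for links that actually occur more than once.
is_dup = df.duplicated(subset=["Links"], keep=False)
df["DupGroupID"] = df["GroupID"].where(is_dup, "")

print(df)
```

Note that both cat.codes and pd.factorize return -1 for missing values, so rows with a missing link would all end up in one group (labelled "Key0" in the sketch above) unless they are handled separately.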