Python Pandas drop duplicate values from grouped data frame-CodePudding

I am tracking how long people have had the same job. The real dataset is large, with lots of duplicate employee ids and jobs. I have the following code:

import pandas as pd

id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']

data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
#print(data)

# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
#print(latest)

# find the latest point in time where there is a change in job title - but how?
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
    
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)

Problem is, it only works if the same person is in the dataset only twice, which never happens in reality. I would need to somehow drop the jtitle duplicates in

prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date

but if I put in .unique(), I get the error message "AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'". The current output of the code shows:

jtitle       date      days
id                              
2000   job1 2021-01-01 3570 days
2001   job1 2021-02-17 1096 days
3000   job3 2021-01-01 7356 days
3300   job5 2022-01-22   38 days

The desired output would be correcting the error of "38 days", which is around 134 days. Does anyone know how to fix it or solve the problem in another way, please?

CodePudding user response：

You should use the method drop_duplicates from pandas.

The following should solve your problem.

Yours code:


import pandas as pd

id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']

data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)

Solution:

# subset employees by ID, sort by date and drop duplicates
latest = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='first').copy()

prev_date = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='last').copy()
    
# calculate the difference in days
latest['days'] = latest['date'].values -  prev_date['date'].values
print(latest)

Output:

  id jtitle       date      days
3300   job5 2022-01-22  134 days
2001   job1 2021-02-17 1096 days
2000   job1 2021-01-01 3570 days
3000   job3 2021-01-01 7356 days

CodePudding user response：

Alternative solution with diff and sum.

data['days'] = data.sort_values('date').groupby('id').date.diff()
data = data.groupby(['id', 'jtitle']).agg({'days': 'sum', 'date': 'first'}).reset_index()

# to filter to only more than 0 days
data[data.days.dt.days > 0]

Result

     id jtitle      days       date
0  2000   job1 3570 days 2021-01-01
1  2001   job1 1096 days 2021-02-17
2  3000   job3 7356 days 2021-01-01
3  3300   job5  134 days 2022-01-22

CodePudding user response：

It is not entirely clear what you are trying to achieve (eg the date column -- is this the date a person starts that job? finishes it?) but let's give it a go

first we aggregate byid and jtitle and find the earliest and the latest date that corresponds to that combo:

df = (data.groupby(['id','jtitle'], sort=False)
        .agg(start_date = ('date','min'), end_date = ('date','max'))
        .reset_index().sort_values(['id','start_date'], ascending = True)
)

df looks like this:

      id  jtitle    start_date           end_date
--  ----  --------  -------------------  -------------------
 4  2000  job3      2011-03-25 00:00:00  2011-03-25 00:00:00
 0  2000  job1      2021-01-01 00:00:00  2021-01-01 00:00:00
 1  2001  job2      2018-02-17 00:00:00  2018-02-17 00:00:00
 2  2001  job1      2021-02-17 00:00:00  2021-02-17 00:00:00
 5  3000  job2      2000-11-11 00:00:00  2000-11-11 00:00:00
 3  3000  job3      2021-01-01 00:00:00  2021-01-01 00:00:00
 7  3300  job6      2021-09-10 00:00:00  2021-09-10 00:00:00
 6  3300  job5      2021-11-11 00:00:00  2022-01-22 00:00:00

Now we get the days column by grouping by id and within each group subtracting end-date from the previous row (ie previous job) from the end_date of the current row (job). We also drop NaNs

df2 = (df.groupby('id').apply(lambda g: g.assign(days = g['end_date']-g['end_date'].shift()))
        .dropna()
        .reset_index(drop = True)
    )

df2:

      id  jtitle    start_date           end_date             days
--  ----  --------  -------------------  -------------------  ------------------
 0  2000  job1      2021-01-01 00:00:00  2021-01-01 00:00:00  3570 days 00:00:00
 1  2001  job1      2021-02-17 00:00:00  2021-02-17 00:00:00  1096 days 00:00:00
 2  3000  job3      2021-01-01 00:00:00  2021-01-01 00:00:00  7356 days 00:00:00
 3  3300  job5      2021-11-11 00:00:00  2022-01-22 00:00:00  134 days 00:00:00

Note that this works for more than two different jtitles per id

for example, with another job, job7, added for id=3300 (last row)

id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6', 'job7']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021', '07/09/2023']

data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])

the output would be

    id  jtitle  start_date  end_date    days
0   2000    job1    2021-01-01  2021-01-01  3570 days
1   2001    job1    2021-02-17  2021-02-17  1096 days
2   3000    job3    2021-01-01  2021-01-01  7356 days
3   3300    job5    2021-11-11  2022-01-22  134 days
4   3300    job7    2023-09-07  2023-09-07  593 days