I am tracking how long people have had the same job. The real dataset is large, with lots of duplicate employee ids and jobs. I have the following code:
import pandas as pd
id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']
data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
#print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
#print(latest)
# find the latest point in time where there is a change in job title - but how?
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Problem is, it only works if the same person is in the dataset only twice, which never happens in reality. I would need to somehow drop the jtitle duplicates in
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
but if I put in .unique(), I get the error message "AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'". The current output of the code shows:
jtitle date days
id
2000 job1 2021-01-01 3570 days
2001 job1 2021-02-17 1096 days
3000 job3 2021-01-01 7356 days
3300 job5 2022-01-22 38 days
The desired output would be correcting the error of "38 days", which is around 134 days. Does anyone know how to fix it or solve the problem in another way, please?
CodePudding user response:
You should use the method drop_duplicates from pandas.
The following should solve your problem.
Yours code:
import pandas as pd
id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']
data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
Solution:
# subset employees by ID, sort by date and drop duplicates
latest = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='first').copy()
prev_date = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='last').copy()
# calculate the difference in days
latest['days'] = latest['date'].values - prev_date['date'].values
print(latest)
Output:
id jtitle date days
3300 job5 2022-01-22 134 days
2001 job1 2021-02-17 1096 days
2000 job1 2021-01-01 3570 days
3000 job3 2021-01-01 7356 days
CodePudding user response:
Alternative solution with diff and sum.
data['days'] = data.sort_values('date').groupby('id').date.diff()
data = data.groupby(['id', 'jtitle']).agg({'days': 'sum', 'date': 'first'}).reset_index()
# to filter to only more than 0 days
data[data.days.dt.days > 0]
Result
id jtitle days date
0 2000 job1 3570 days 2021-01-01
1 2001 job1 1096 days 2021-02-17
2 3000 job3 7356 days 2021-01-01
3 3300 job5 134 days 2022-01-22
CodePudding user response:
It is not entirely clear what you are trying to achieve (eg the date column -- is this the date a person starts that job? finishes it?) but let's give it a go
first we aggregate byid and jtitle and find the earliest and the latest date that corresponds to that combo:
df = (data.groupby(['id','jtitle'], sort=False)
.agg(start_date = ('date','min'), end_date = ('date','max'))
.reset_index().sort_values(['id','start_date'], ascending = True)
)
df looks like this:
id jtitle start_date end_date
-- ---- -------- ------------------- -------------------
4 2000 job3 2011-03-25 00:00:00 2011-03-25 00:00:00
0 2000 job1 2021-01-01 00:00:00 2021-01-01 00:00:00
1 2001 job2 2018-02-17 00:00:00 2018-02-17 00:00:00
2 2001 job1 2021-02-17 00:00:00 2021-02-17 00:00:00
5 3000 job2 2000-11-11 00:00:00 2000-11-11 00:00:00
3 3000 job3 2021-01-01 00:00:00 2021-01-01 00:00:00
7 3300 job6 2021-09-10 00:00:00 2021-09-10 00:00:00
6 3300 job5 2021-11-11 00:00:00 2022-01-22 00:00:00
Now we get the days column by grouping by id and within each group subtracting end-date from the previous row (ie previous job) from the end_date of the current row (job). We also drop NaNs
df2 = (df.groupby('id').apply(lambda g: g.assign(days = g['end_date']-g['end_date'].shift()))
.dropna()
.reset_index(drop = True)
)
df2:
id jtitle start_date end_date days
-- ---- -------- ------------------- ------------------- ------------------
0 2000 job1 2021-01-01 00:00:00 2021-01-01 00:00:00 3570 days 00:00:00
1 2001 job1 2021-02-17 00:00:00 2021-02-17 00:00:00 1096 days 00:00:00
2 3000 job3 2021-01-01 00:00:00 2021-01-01 00:00:00 7356 days 00:00:00
3 3300 job5 2021-11-11 00:00:00 2022-01-22 00:00:00 134 days 00:00:00
Note that this works for more than two different jtitles per id
for example, with another job, job7, added for id=3300 (last row)
id= [2000,2001,2001,3000,2000,3000,3300,3300,3300,3300,3300]
jtitle = ['job1','job2','job1','job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6', 'job7']
date = ['01/01/2021', '17/02/2018','17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021', '07/09/2023']
data= pd.DataFrame(data=zip(id, jtitle, date), columns= ["id", "jtitle", "date"])
the output would be
id jtitle start_date end_date days
0 2000 job1 2021-01-01 2021-01-01 3570 days
1 2001 job1 2021-02-17 2021-02-17 1096 days
2 3000 job3 2021-01-01 2021-01-01 7356 days
3 3300 job5 2021-11-11 2022-01-22 134 days
4 3300 job7 2023-09-07 2023-09-07 593 days
