I have an initial dataframe:
df = pd.DataFrame({
'job': ['job1', None, None, 'job3', None, None, 'job4', None, None, None, 'job5', None, None, None, 'job6', None, None, None, None],
'name': ['n_j1', None, None, 'n_j3', None, None, 'n_j4', None, None, None, 'nj5', None, None, None, 'nj6', None, None, None, None],
'schedule': ['01', None, None, '06', None, None, '09', None, None, None, None, None, None, None, None, None, None, None, None],
'task_type': ['START', 'TA', 'END', 'START', 'TB', 'END', 'START', 'TB', 'TB', 'END', 'START', 'TA', 'TA', 'END', 'START', 'TA', 'TA', 'TA', 'END'],
'task': [None, 'task12', None, None, 'task31', None, None, 'task18', 'nt6', None, None, 'task3', 'task7', None, None, 'task19', 'task6', 'task88', None],
'task_name': [None, 'name_t12', None, None, 'name_t31', None, None, 'name_t18', 'n_t5', None, None, 'name_t33', 'n7', None, None, 'name_t19', 'n6', 'n88', None]
})
job name schedule task_type task task_name
0 job1 n_j1 01 START None None
1 None None None TA task12 name_t12
2 None None None END None None
3 job3 n_j3 06 START None None
4 None None None TB task31 name_t31
5 None None None END None None
6 job4 n_j4 09 START None None
7 None None None TB task18 name_t18
8 None None None TB nt6 n_t5
9 None None None END None None
10 job5 nj5 None START None None
11 None None None TA task3 name_t33
12 None None None TA task7 n7
13 None None None END None None
14 job6 nj6 None START None None
15 None None None TA task19 name_t19
16 None None None TA task6 n6
17 None None None TA task88 n88
18 None None None END None None
My expected result is:
job name schedule types tasks n_names
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 None [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 None [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
That is, one row per job with its tasks collected into lists, where a job's tasks are the rows whose task_type lies between START and END.
What I came up with so far:
handling None:
df[['job', 'name', 'schedule']] = df[['job', 'name', 'schedule']].ffill()
filtering rows for desired task types:
df = df[df['task_type'].isin(['TA', 'TB'])]
grouping the data:
df.groupby(['job', 'name', 'schedule']).apply(lambda x: [list(x['task_type']), list(x['task']), list(x['task_name'])]).apply(pd.Series).reset_index(names=['job', 'name', 'schedule'])
But the result is not what I expect:
job name schedule 0 1 2
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 09 [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 09 [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
My question is how to get rid of the schedule values for job5 and job6, and how to give the list columns friendlier names.
There should be a more elegant way of doing this; any help would be appreciated.
CodePudding user response:
Try:
x = df.groupby(df["job"].ffill()).agg(
lambda x: [v for v in x if v and v not in {"START", "END"}]
)
x["job"] = x["job"].str[0]
x["name"] = x["name"].str[0]
x["schedule"] = x["schedule"].str[0]
print(x.reset_index(drop=True))
Prints:
job name schedule task_type task task_name
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 NaN [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 NaN [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
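If the friendlier column names from the question (types, tasks, n_names) are wanted as well, the same idea can be sketched with named aggregation. This is only a sketch under assumptions: the sample below is a shortened two-job version of the question's data, and job_key is a hypothetical helper name that is not in the original post.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data (two jobs only).
df = pd.DataFrame({
    "job": ["job4", None, None, None, "job5", None, None, None],
    "name": ["n_j4", None, None, None, "nj5", None, None, None],
    "schedule": ["09", None, None, None, None, None, None, None],
    "task_type": ["START", "TB", "TB", "END", "START", "TA", "TA", "END"],
    "task": [None, "task18", "nt6", None, None, "task3", "task7", None],
    "task_name": [None, "name_t18", "n_t5", None, None, "name_t33", "n7", None],
})

# Label every row with its job. Only this key is forward-filled;
# schedule is left alone so a missing schedule stays missing.
# "job_key" is a hypothetical name chosen for this example.
key = df["job"].ffill().rename("job_key")

result = (
    df.groupby(key, sort=False)
    .agg(
        # "first" returns the first non-null value in each group
        job=("job", "first"),
        name=("name", "first"),
        schedule=("schedule", "first"),
        # keep only the real task types, dropping the START/END markers
        types=("task_type", lambda s: [v for v in s if v not in ("START", "END")]),
        # START/END rows carry no task, so dropping nulls leaves the tasks
        tasks=("task", lambda s: s.dropna().tolist()),
        n_names=("task_name", lambda s: s.dropna().tolist()),
    )
    .reset_index(drop=True)
)
print(result)
```

Named aggregation sets the output column names in the same call, so no separate rename step is needed, and the scalar columns never become lists that have to be unwrapped afterwards.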
CodePudding user response:
Here you go:
Do the following steps on your initial df:
df[['job', 'name']] = df[['job', 'name']].ffill()
# collect each column into a list per job, dropping None and the START/END markers
df = df.groupby(['job', 'name']).agg(lambda x: [i for i in x.tolist() if i not in [None, 'START', 'END']])
# unwrap schedule to a scalar; an empty list (no schedule) becomes None
df['schedule'] = df['schedule'].apply(lambda x: x[0] if x else None)
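As for why job5 and job6 picked up 09 in the question's attempt: a plain forward-fill on schedule carries the last seen value across job boundaries. A minimal sketch of filling only within each job block, assuming every job starts with a START row (the shortened sample and the name block are illustrative, not from the original post):

```python
import pandas as pd

# Shortened sample: job4 has a schedule, job5 does not.
df = pd.DataFrame({
    "job": ["job4", None, None, None, "job5", None, None, None],
    "name": ["n_j4", None, None, None, "nj5", None, None, None],
    "schedule": ["09", None, None, None, None, None, None, None],
    "task_type": ["START", "TB", "TB", "END", "START", "TA", "TA", "END"],
})

cols = ["job", "name", "schedule"]

# Plain forward-fill, as in the question: job4's schedule leaks into job5.
leaky = df["schedule"].ffill()

# Block id that increases by one at every START row (assumed job boundary).
block = df["task_type"].eq("START").cumsum().rename("block")

# Forward-fill only inside each block, so a missing schedule stays missing.
df[cols] = df.groupby(block)[cols].ffill()
print(df)
```

Note that if the filled frame is then grouped by ['job', 'name', 'schedule'] as in the question, rows with a missing schedule are dropped by default, so grouping by the block id or by job alone is the safer choice.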
