Groupby dataframe on some columns and list other columns-CodePudding

I have a inital dataframe:

df = pd.DataFrame({
    'job': ['job1', None, None, 'job3', None, None, 'job4', None, None, None, 'job5', None, None, None, 'job6', None, None, None, None],
    'name': ['n_j1', None, None, 'n_j3', None, None, 'n_j4', None, None, None, 'nj5', None, None, None, 'nj6', None, None, None, None],
    'schedule': ['01', None, None, '06', None, None, '09', None, None, None, None, None, None, None, None, None, None, None, None],
    'task_type': ['START', 'TA', 'END', 'START', 'TB', 'END', 'START', 'TB', 'TB', 'END', 'START', 'TA', 'TA', 'END', 'START', 'TA', 'TA', 'TA', 'END'],
    'task': [None, 'task12', None, None, 'task31', None, None, 'task18', 'nt6', None, None, 'task3', 'task7', None, None, 'task19', 'task6', 'task88', None],
    'task_name': [None, 'name_t12', None, None, 'name_t31', None, None, 'name_t18', 'n_t5', None, None, 'name_t33', 'n7', None, None, 'name_t19', 'n6', 'n88', None]
})

     job  name schedule task_type    task task_name
0   job1  n_j1       01     START    None      None
1   None  None     None        TA  task12  name_t12
2   None  None     None       END    None      None
3   job3  n_j3       06     START    None      None
4   None  None     None        TB  task31  name_t31
5   None  None     None       END    None      None
6   job4  n_j4       09     START    None      None
7   None  None     None        TB  task18  name_t18
8   None  None     None        TB     nt6      n_t5
9   None  None     None       END    None      None
10  job5   nj5     None     START    None      None
11  None  None     None        TA   task3  name_t33
12  None  None     None        TA   task7        n7
13  None  None     None       END    None      None
14  job6   nj6     None     START    None      None
15  None  None     None        TA  task19  name_t19
16  None  None     None        TA   task6        n6
17  None  None     None        TA  task88       n88
18  None  None     None       END    None      None

My expected result is:

    job  name schedule          types                   tasks              n_names
0  job1  n_j1       01          [TA]                 [task12]           [name_t12]
1  job3  n_j3       06          [TB]                 [task31]           [name_t31]
2  job4  n_j4       09      [TB, TB]            [task18, nt6]     [name_t18, n_t5]
3  job5   nj5     None      [TA, TA]           [task3, task7]       [name_t33, n7]
4  job6   nj6     None  [TA, TA, TA]  [task19, task6, task88]  [name_t19, n6, n88]

grouped by a job with listed its tasks, where tasks are defined by task_type between START and END

What I came up with so far:

handling None: df[['job', 'name', 'schedule']] = df[['job', 'name', 'schedule']].fillna(method='ffill')

filtering rows for desired task types: df[df['task_type'].isin(['TA', 'TB'])]

grouping the data: df.groupby(['job', 'name', 'schedule']).apply(lambda x: [list(x['task_type']), list(x['task']), list(x['task_name'])]).apply(pd.Series).reset_index(names=['job', 'name', 'schedule'])

But the results is not what expect:

    job  name schedule             0                        1                    2
0  job1  n_j1       01          [TA]                 [task12]           [name_t12]
1  job3  n_j3       06          [TB]                 [task31]           [name_t31]
2  job4  n_j4       09      [TB, TB]            [task18, nt6]     [name_t18, n_t5]
3  job5   nj5       09      [TA, TA]           [task3, task7]       [name_t33, n7]
4  job6   nj6       09  [TA, TA, TA]  [task19, task6, task88]  [name_t19, n6, n88]

My question is how to get rid of schedule values for job5, job6 and name the lists more friendly. There should by more elegant way of doing this, any help would be appreciated.

CodePudding user response：

Try:

x = df.groupby(df["job"].ffill()).agg(
    lambda x: [v for v in x if v and v not in {"START", "END"}]
)
x["job"] = x["job"].str[0]
x["name"] = x["name"].str[0]
x["schedule"] = x["schedule"].str[0]

print(x.reset_index(drop=True))

Prints:

    job  name schedule     task_type                     task            task_name
0  job1  n_j1       01          [TA]                 [task12]           [name_t12]
1  job3  n_j3       06          [TB]                 [task31]           [name_t31]
2  job4  n_j4       09      [TB, TB]            [task18, nt6]     [name_t18, n_t5]
3  job5   nj5      NaN      [TA, TA]           [task3, task7]       [name_t33, n7]
4  job6   nj6      NaN  [TA, TA, TA]  [task19, task6, task88]  [name_t19, n6, n88]

CodePudding user response：

Here you go:

Do the following steps on your initial df:


df[['job', 'name']] = df[['job', 'name']].fillna(method='ffill')

df = df.groupby(['job', 'name']).agg(lambda x: x.tolist()).applymap(lambda x: [i for i in x if i not in [None, 'START', 'END']])

# replace empty lists with None on schedule column
df['schedule'] = df['schedule'].apply(lambda x: None if x == [] else x)