I have a dataset
id category description status
11 A Text_1 Finished
11 A Text_2 Pause
11 A Text_3 Started
22 A Text_1 Pause
33 B Text_1 Finished
33 B Text_2 Finished
I want to group the data by id and concatenate description only for rows with status = 'Finished'.
So the desired output is
id category description
11 A Text_1
22 A
33 B Text_1 Text_2
I can concat it using
data.groupby(['id', 'category'])['description'].apply(' '.join).reset_index()
But how can I use condition inside this expression?
CodePudding user response:
You can filter before the groupby, then reindex to restore the groups that the filter removed:
out = (data.loc[data.status == 'Finished']
       .groupby(['id', 'category'])['description']
       .apply(' '.join)
       # restore every (id, category) pair; groups with no finished rows get ''
       .reindex(pd.MultiIndex.from_frame(data[['id', 'category']].drop_duplicates()),
                fill_value='')
       .reset_index())
Out[70]:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
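For anyone who wants to run this end to end, here is a minimal sketch of the same filter-then-reindex idea split into named steps. The DataFrame is rebuilt from the sample in the question; the names `keys`, `all_groups`, `finished` and `joined` are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

keys = ["id", "category"]

# Every (id, category) pair present in the unfiltered data
all_groups = pd.MultiIndex.from_frame(data[keys].drop_duplicates())

# Keep only finished rows, then join descriptions per group
finished = data.loc[data["status"] == "Finished"]
joined = finished.groupby(keys)["description"].apply(" ".join)

# Groups that lost all their rows in the filter come back with an empty string
out = joined.reindex(all_groups, fill_value="").reset_index()
print(out)
```

Splitting it up this way makes it easier to inspect `joined` and see that the (22, A) group is missing before the reindex.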
CodePudding user response:
You can use groupby.apply with a conditional and a default value if the group is empty after filtering:
# the := (walrus) operator needs Python 3.8+
out = (data
       .groupby(['id', 'category'])
       .apply(lambda g: ' '.join(d['description'])
              if len(d := g[g['status'].eq('Finished')])
              else '')
       .reset_index(name='description')
       )
Output:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
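If the walrus operator is not available or feels hard to read, the same conditional can probably be written with a small named function. This is only a sketch: `join_finished` is a made-up helper name, and selecting the two columns before `apply` is my own choice so that the function only sees the columns it needs.

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

def join_finished(group):
    """Join the descriptions of the finished rows in one group (hypothetical helper)."""
    finished = group.loc[group["status"].eq("Finished"), "description"]
    # Joining an empty selection yields '', so no explicit else branch is needed
    return " ".join(finished)

out = (
    data.groupby(["id", "category"])[["description", "status"]]
    .apply(join_finished)
    .reset_index(name="description")
)
print(out)
```

Because `' '.join` of an empty selection is already an empty string, the default value falls out without a separate branch.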
CodePudding user response:
Here's a way:
key = ['id', 'category']
df2 = data[key].drop_duplicates().join(
    data.query("status == 'Finished'").groupby(key).description.apply(' '.join),
    on=key).fillna('').reset_index(drop=True)
Explanation:
- use query() to filter on status for "Finished", use groupby() to group by the key [id, category], then use your str.join() logic on the description column values within each group
- use a de-duped version of the key columns and DataFrame.join() to expand the filtered results to contain all key values, and use fillna() to replace NaN with an empty string in the description column for keys that were filtered out.
Output:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
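To see the two steps from the explanation separately, here is a rough sketch that keeps the intermediate result in its own variable. The DataFrame is rebuilt from the question's sample, and `finished` is just an illustrative name.

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

key = ["id", "category"]

# Step 1: filter, group, join. The (22, 'A') group has no finished rows,
# so it is absent from this intermediate Series.
finished = data.query("status == 'Finished'").groupby(key)["description"].apply(" ".join)
print((22, "A") in finished.index)

# Step 2: left-join onto the de-duplicated keys so every group reappears,
# then replace the NaN left by the join with an empty string.
df2 = data[key].drop_duplicates().join(finished, on=key).fillna("").reset_index(drop=True)
print(df2)
```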
