I have a dataframe
text label title version
0 Alice is in Seattle SA 1
1 Alice is in wonderland. Portlang SA 2
2 Mallory has done the task. Gotland sometitle 4
3 Mallory has done the task. california sometitle 4
4 Mallory has california sometitle 2
5 Bob is final. Portland some different title 3
6 Mallory has done Portland sometitle 3
The final result I want is to find the hightest version text for given title and corresponding label, however the label should be divided as columns. Here is the final result:
text Seattle Portlang Gotland california Portland title
0 Alice is in wonderland. 0 1 0 0 0 SA
1 Mallory has done the task. 0 0 1 1 0 sometitle
2 Bob is final. 0 0 0 0 1 some different title
Thanks in advance,
CodePudding user response:
Use pivot_table. First rename text values with the title of the highest version for each title the pivot your dataframe:
out = (
df.assign(dummy=1)
.mask(df.groupby('title')['version'].rank(method='dense', ascending=False) > 1)
.pivot_table('dummy', ['title', 'text'], 'label', fill_value=0)
.reset_index()
.rename_axis(columns=None)
)
Output:
>>> out
title text Gotland Portland Portlang california
0 SA Alice is in wonderland. 0 0 1 0
1 some different title Bob is final. 0 1 0 0
2 sometitle Mallory has done the task. 1 0 0 1
