Given a dataframe like this:
import pandas as pd
row1 = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']
row2 = ['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA']
col = {'List': [row1, row2]}
df = pd.DataFrame(col)
which leads to:
| | List |
|---|---|
| 0 | ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA'] |
| 1 | ['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA'] |
I would like to generate the following dataframe:
| | List |
|---|---|
| 0 | ['AAA', 'BBB x 2', 'CCC', 'AAA x 2'] |
| 1 | ['CCC x 2', 'BBB', 'AAA x 3'] |
where each entry in the final List column carries a multiplier showing how many times the term appears consecutively in the list.
Could you suggest a pandas approach that solves this task?
CodePudding user response:
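In your case you can use explode, then build the subgroups of consecutive values with shift and cumsum:
# one row per list element; the original row label is repeated in the index
s = df.explode('List')
# a new run starts wherever a value differs from the previous one; cumsum labels the runs
s = s.groupby([s.index, s['List'].shift().ne(s['List']).cumsum()])['List'].agg(['first', 'count'])
# append the run length, but keep the bare value for runs of length 1
out = s['first'] + 'x' + s['count'].astype(str)
out = out.mask(s['count'] == 1, s['first']).groupby(level=0).agg(list)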
out
Out[202]:
0 [AAA, BBBx2, CCC, AAAx2]
1 [CCCx2, BBB, AAAx3]
dtype: object
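To see what the shift/cumsum step does on its own, here is a minimal sketch on a single flat Series. The toy data and the names starts, run_id and runs are only for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical toy data: the elements of one list, already flattened.
s = pd.Series(['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA'])

# True wherever a value differs from its predecessor, i.e. a new run starts.
# The first element is compared against NaN, so it always starts a run.
starts = s.ne(s.shift())

# The cumulative sum of the run starts gives every run its own integer label.
run_id = starts.cumsum()

# Grouping on that label yields the first value and the length of each run.
runs = s.groupby(run_id).agg(['first', 'count'])

print(run_id.tolist())  # [1, 2, 2, 3, 4, 4]
print(runs)
```

In the answer the same label is combined with the exploded index, so a run can never span two of the original rows.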
CodePudding user response:
itertools.groupby can do what you need. The custom function appends the count to the value when a group contains more than one item.
from itertools import groupby
import pandas as pd
row1 = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']
row2 = ['CCC', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA']
col = {'Lists': [row1, row2]}
df = pd.DataFrame(col)
def count_items(row):
    output = []
    for k, d in groupby(row):
        x = list(d)
        if len(x) > 1:
            output.append(' x '.join([k, str(len(x))]))
        else:
            output.append(k)
    return output
df['Lists'] = df['Lists'].apply(count_items)
print(df)
Output
                          Lists
0  [AAA, BBB x 2, CCC, AAA x 2]
1       [CCC x 2, BBB, AAA x 3]
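This works because itertools.groupby only merges adjacent equal items, so a value that comes back later starts a new group instead of being added to the earlier one. A small standard-library sketch of that behaviour, where the variable names are only illustrative:

```python
from itertools import groupby

row = ['AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'AAA']

# Each (key, length) pair describes one consecutive run.
# sum(1 for _ in group) counts the run without building a temporary list.
runs = [(key, sum(1 for _ in group)) for key, group in groupby(row)]

print(runs)
# [('AAA', 1), ('BBB', 2), ('CCC', 1), ('AAA', 2)]
```

Note that 'AAA' appears twice in the result, once per run, which is what the question asks for; a plain frequency count would merge them.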
