I have this DataFrame to groupby key:
df = pd.DataFrame({
'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O']
'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df
I want to make the groupby key and sum data2, it's ok for this part. But concerning data1, I want to :
- If a list doesn't exist yet:
- Single values don't change when key was not duplicated
- Single values assigned to a key are combined into a new list
- If a list already exist:
- Other single values are append to it
- Other lists values are append to it
The resulting DataFrame should then be :
dfgood = pd.DataFrame({
'key': ['1', '2', '3', '4', '5'],
'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O']
'data2': [6, 9, 13, 17, 10]
})
dfgood
In fact, I don't really care about the order of data1 values into the lists, it could also be any structure that keep them together, even a string with separators or a set, if it's easier to make it go the way you think best to do this.
I thought about two solutions :
- Going that way :
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0])==list else list(x),
'data2' : sum,
})
dfgood
It doesn't work because of index out of range in x.iloc[1].
I also tried, because data1 was organized like this in another groupby from the question on this link:
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
But it's creating new lists from preexisting lists or values and not appending data to already existing lists.
- Another way to do it, but I think it's more complicated and there should be a better or faster solution :
- Turning data1 lists and single values into individual series with
apply, - use
wide_to_longto keep single values for each key, - Then groupby applying :
- Turning data1 lists and single values into individual series with
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
I think my problem is that I don't know how to use lambdas correctly and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorial about lambdas, but it's still fuzzy in my mind.
CodePudding user response:
There is problem combinations lists with scalars, possible solution is create first lists form scalars and then flatten them in groupby.agg:
dfgood = (df.assign(data1 = df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
.groupby('key', as_index=False).agg({
'data1' : lambda x: [z for y in x for z in y],
'data2' : sum,
})
)
print (dfgood)
key data1 data2
0 1 [A, B, C, D, P] 6
1 2 [E, F, G, H] 9
2 3 [I, J, K, L] 13
3 4 [M, N] 17
4 5 [O] 10
Another idea is use flatten function for flatten only lists, not strings:
#https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
for x in foo:
if hasattr(x, '__iter__') and not isinstance(x, str):
for y in flatten(x):
yield y
else:
yield x
dfgood = (df.groupby('key', as_index=False).agg({
'data1' : lambda x: list(flatten(x)),
'data2' : sum}))
CodePudding user response:
You could explode to get individual rows, then aggregate again with groupby agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):
(df.explode('data1')
.assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
.groupby('key')
.agg({'data1': list, 'data2': 'sum'})
)
output:
data1 data2
key
1 [A, B, C, D, P] 6
2 [E, F, G, H] 9
3 [I, J, K, L] 13
4 [M, N] 17
5 [O] 10
