Pandas Dataframe GroupBy Agg - LAMBDA - single values go to preexisting or new lists and preexisting-CodePudding

I have this DataFrame to groupby key:

df = pd.DataFrame({
                   'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
                   'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O']
                   'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
                 })
df

I want to make the groupby key and sum data2, it's ok for this part. But concerning data1, I want to :

If a list doesn't exist yet:
- Single values don't change when key was not duplicated
- Single values assigned to a key are combined into a new list
If a list already exist:
- Other single values are append to it
- Other lists values are append to it

The resulting DataFrame should then be :

dfgood = pd.DataFrame({
                   'key': ['1', '2', '3', '4', '5'],
                   'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O']
                   'data2': [6, 9, 13, 17, 10]
                 })
dfgood

In fact, I don't really care about the order of data1 values into the lists, it could also be any structure that keep them together, even a string with separators or a set, if it's easier to make it go the way you think best to do this.

I thought about two solutions :

Going that way :

dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0])==list else list(x),
            'data2' : sum,
            })
dfgood

It doesn't work because of index out of range in x.iloc[1]. I also tried, because data1 was organized like this in another groupby from the question on this link:

dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
            'data2' : sum,
            })
dfgood

But it's creating new lists from preexisting lists or values and not appending data to already existing lists.

Another way to do it, but I think it's more complicated and there should be a better or faster solution :
- Turning data1 lists and single values into individual series with apply,
- use wide_to_long to keep single values for each key,
- Then groupby applying :

dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
            'data2' : sum,
            })
dfgood

I think my problem is that I don't know how to use lambdas correctly and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorial about lambdas, but it's still fuzzy in my mind.

CodePudding user response：

There is problem combinations lists with scalars, possible solution is create first lists form scalars and then flatten them in groupby.agg:

dfgood = (df.assign(data1 = df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
            .groupby('key', as_index=False).agg({
            'data1' : lambda x: [z for y in x for z in y],
            'data2' : sum,
            })
            )
print (dfgood)
  key            data1  data2
0   1  [A, B, C, D, P]      6
1   2     [E, F, G, H]      9
2   3     [I, J, K, L]     13
3   4           [M, N]     17
4   5              [O]     10

Another idea is use flatten function for flatten only lists, not strings:

#https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
    for x in foo:
        if hasattr(x, '__iter__') and not isinstance(x, str):
            for y in flatten(x):
                yield y
        else:
            yield x

dfgood = (df.groupby('key', as_index=False).agg({
            'data1' : lambda x: list(flatten(x)),
            'data2' : sum}))

CodePudding user response：

You could explode to get individual rows, then aggregate again with groupby agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):

(df.explode('data1')
   .assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
   .groupby('key')
   .agg({'data1': list, 'data2': 'sum'})
)

output:

               data1  data2
key                        
1    [A, B, C, D, P]      6
2       [E, F, G, H]      9
3       [I, J, K, L]     13
4             [M, N]     17
5                [O]     10