Pandas grouping with filtering on other columns-CodePudding

I have the following dataframe in Pandas:

name	value	in	out
A	50	1	0
A	-20	0	1
B	150	1	0
C	10	1	0
D	500	1	0
D	-250	0	1
E	800	1	0

There are maximally only 2 observations for each name: one for in and one for out. If there is only in for a name there is only one observation for it.

You can create this dataset with this code:

data = {
        'name': ['A','A','B','C','D','D','E'],
        'values': [50,-20,150,10,500,-250,800],
        'in': [1,0,1,1,1,0,1],
        'out': [0,1,0,0,0,1,0]
       }

df = pd.DataFrame.from_dict(data)

I want to sum the value column for each name but only if name has both in and out record. In other words, only when one unique name has exactly 2 rows.

The result should look like this:

name	value
A	30
D	250

If I run the following code I got all the results without filtering based on in and out.

df.groupby('name').sum()

name	value
A	30
B	150
C	10
D	250
E	800

How to add the beforementioned filtering based on columns?

CodePudding user response：

IIUC, you could filter before aggregation:

# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()

Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):

(df.groupby('name', as_index=False).sum()
   .loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)

output:

  name  values
0    A      30
1    D     250

CodePudding user response：

Maybe you can try something with groupby, agg, and query (like below):

df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]

Output:

        values
name        
A         30
D         250

You could also make .query('name==2') in above if you like but assuming it can occur max at 2 .query('name>1') would also return same.