I have the following dataframe in Pandas:
| name | value | in | out |
|---|---|---|---|
| A | 50 | 1 | 0 |
| A | -20 | 0 | 1 |
| B | 150 | 1 | 0 |
| C | 10 | 1 | 0 |
| D | 500 | 1 | 0 |
| D | -250 | 0 | 1 |
| E | 800 | 1 | 0 |
There are maximally only 2 observations for each name: one for in and one for out.
If there is only in for a name there is only one observation for it.
You can create this dataset with this code:
data = {
'name': ['A','A','B','C','D','D','E'],
'values': [50,-20,150,10,500,-250,800],
'in': [1,0,1,1,1,0,1],
'out': [0,1,0,0,0,1,0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the value column for each name but only if name has both in and out record. In other words, only when one unique name has exactly 2 rows.
The result should look like this:
| name | value |
|---|---|
| A | 30 |
| D | 250 |
If I run the following code I got all the results without filtering based on in and out.
df.groupby('name').sum()
| name | value |
|---|---|
| A | 30 |
| B | 150 |
| C | 10 |
| D | 250 |
| E | 800 |
How to add the beforementioned filtering based on columns?
CodePudding user response:
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
.loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
CodePudding user response:
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also make .query('name==2') in above if you like but assuming it can occur max at 2 .query('name>1') would also return same.
