I got two dataframes looking like this:
df1:
| omg | odds | outcome | |
|---|---|---|---|
| 0 | 2625 | 27 | 1 |
| 1 | 2625 | 22 | 1 |
| 2 | 2626 | 10 | 0 |
| 3 | 2628 | 15 | 1 |
| 4 | 2628 | 20 | 1 |
df2:
| omgangar | |
|---|---|
| 0 | 2625 |
| 1 | 2626 |
| 2 | 2627 |
| 3 | 2628 |
| 4 | 2629 |
In df2 I'd like to add a new column that counts the occurrences of df2[omgangar] in df1[omg] together with some other conditions (odds between two values and outcome equal to 1).
I've tried this code but it throws an error:
df2['outcome'] =
sum((df1['omg'] == df2['omgangar']) & (df1['odds'].between(20,30)) & (df1['outcome'] == 1))
Error:
ValueError: Can only compare identically-labeled Series objects
The desired output of df2 is:
| omgangar | outcome | |
|---|---|---|
| 0 | 2625 | 2 |
| 1 | 2626 | 0 |
| 2 | 2627 | 0 |
| 3 | 2628 | 1 |
| 4 | 2629 | 0 |
CodePudding user response:
There is no need to use countifs, instead apply the filters first and then merge the dataframes.
First filter df1 with the two conditions, then groupby on omg and aggregate using count as follows:
s = df1.loc[df['odds'].between(20,30) & (df['outcome'] == 1)].groupby('omg')['odds'].count()
This will return a series s which looks like this:
omg
2625 2
2628 1
Name: odds, dtype: int64
Now, merge s with df2 and fill any resulting NaN values with 0:
df2.merge(df, left_on='omgangar', right_index=True, how='left').fillna(0.0)
Result:
omgangar odds
0 2625 2.0
1 2626 0.0
2 2627 0.0
3 2628 1.0
4 2629 0.0
CodePudding user response:
My suggestion would be to use merge and groupby for this task:
import pandas as pd
df1 = pd.DataFrame({
'omg': [2625, 2625, 2626, 2628, 2628],
'odds': [27, 22, 10, 15, 20],
'outcome': [1, 1, 0, 1, 1]})
df2 = pd.DataFrame({'omgangar': list(range(2625, 2630))})
df = df2.merge(df1, left_on = 'omgangar', right_on = 'omg', how='left')
df['outcome'] = (df.outcome == 1) & (df.odds.between(20, 30))
df.groupby('omgangar')[['outcome']].agg('sum')
outcome
omgangar
2625 2
2626 0
2627 0
2628 1
2629 0
CodePudding user response:
Use DataFrame.loc for filter by mask with column omg and get counts by Series.value_counts, for new column use Series.map with replace missing not matched values to 0:
s = df1.loc[df1['odds'].between(20,30) & (df1['outcome'] == 1), 'omg'].value_counts()
df2['outcome'] = df2['omgangar'].map(s).fillna(0).astype(int)
print (df2)
omgangar outcome
0 2625 2
1 2626 0
2 2627 0
3 2628 1
4 2629 0
