I'm trying to create a new df from race_dbs that's grouped by 'horse_id' showing the number of times 'place' = 1 as well as the total number of times that 'horse_id' occurs.
Some background on the dataset if it's helpful;
race_dbs contains horse race data. There are 12 horses in a race, for each is shown their odds, fire, place, time, and gate number.
What I'm trying to achieve from this code is the calculation of win rates for each horse.
- A win is denoted by
'place' = 1 - Total race count will be calculated by how many times a particular 'horse_id' occurs in the db.
race_dbs
| race_id | horse_id | odds | fire | place | horse_time | gate |
|---|---|---|---|---|---|---|
| V14qANzi | 398807 | NaN | 0 | 1 | 72.0191 | 7 |
| xeieZak | 191424 | NaN | 0 | 8 | 131.3010 | 10 |
| xeieZak | 139335 | NaN | 0 | 1 | 131.3713 | 9 |
| xeieZak | 137195 | NaN | 0 | 11 | 131.6310 | 11 |
| xeieZak | 398807 | NaN | 0 | 12 | 131.7886 | 2 |
| ... | ... | .. | .. | ... | ... | .. |
From this simple table the output would look like, but please bear in mind my dataset is very large, containing 12882353 rows in total.
desired output
| horse_id | wins | races | win rate |
|---|---|---|---|
| 398807 | 1 | 2 | 50% |
| 191424 | 0 | 1 | 0% |
| 139335 | 1 | 1 | 100% |
| 137195 | 0 | 1 | 0% |
| ... | .. | .. | ... |
It should be noted that I'm a complete coding beginner so forgive me if this is an easy solve.
I have tried to use the groupby and lambda pandas functions but I am struggling to combine both functions, and believe there will be a much simpler way.
import pandas as pd
race_db = pd.read_csv('horse_race_data_db.csv')
race_db_2 = pd.read_csv('2_horse_race_data.csv')
frames = [race_db, race_db_2]
race_dbs = pd.concat(frames, ignore_index=True, sort=False)
race_dbs_horse_wins = race_dbs.groupby('horse_id')['place'].apply(lambda x: x[x == 1].count())
race_dbs_horse_sums = race_dbs.groupby('horse_id').aggregate({"horse_id":['sum']})
Thanks for the help!
CodePudding user response:
For count Trues values create helper boolean column and aggregate sum, for win rate aggregate mean and for count use 
