I have a Pandas data frame (called "ud_flex" below) that looks like the one below:
The data frame has over 27 million observations in it that I'm trying to iterate through to do a calculation for each row. Below is the calculation that I'm using:
def set_fpts(pos, rank, curr_fpts):
    if pos == "RB" and rank >= 3.0:
        return 0
    elif pos == "WR" and rank >= 4.0:
        return 0
    elif (pos == "TE" or pos == "QB") and rank >= 2.0:
        return 0
    else:
        return curr_fpts
Here is the for loop that I've created:
players = ud_flex.shape[0]
for i in range(0, players):
    new_fpts = set_fpts(ud_flex.iloc[i]['position_name'], ud_flex.iloc[i]['wk_rank_orig'], ud_flex.iloc[i]['fpts'])
    ud_flex.at[i, 'fpts_orig'] = new_fpts
Does anyone have any suggestions for how to speed up this loop? It's currently taking nearly an hour! Thanks!
CodePudding user response:
You could start by restructuring the function so that it exits earlier:
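def set_fpts(pos, rank, curr_fpts):
    # Assumes pos is always one of "RB", "WR", "TE", "QB".
    # At rank 4 or worse, every position is zeroed out.
    if rank >= 4:
        return 0
    # Below rank 2, no position is zeroed out.
    if rank < 2:
        return curr_fpts
    # From here on, 2 <= rank < 4.
    if pos in ("TE", "QB"):
        return 0
    if rank >= 3 and pos == "RB":
        return 0
    return curr_fpts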
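Separately from how the function branches, a good share of the loop's time is probably spent in the repeated `ud_flex.iloc[i][...]` lookups, since each one builds a whole row object just to read a single value. As a rough sketch (the sample data frame below is made up for illustration, only the column names come from the question), you can keep the original function and iterate over the three columns directly with `zip`:

```python
import pandas as pd


def set_fpts(pos, rank, curr_fpts):
    # Same rules as the function in the question.
    if pos == "RB" and rank >= 3.0:
        return 0
    elif pos == "WR" and rank >= 4.0:
        return 0
    elif (pos == "TE" or pos == "QB") and rank >= 2.0:
        return 0
    else:
        return curr_fpts


# Hypothetical sample data; the real ud_flex has ~27 million rows.
ud_flex = pd.DataFrame({
    "position_name": ["RB", "RB", "WR", "WR", "TE", "QB"],
    "wk_rank_orig": [1.0, 3.0, 3.0, 4.0, 2.0, 1.0],
    "fpts": [12.5, 8.0, 10.1, 6.3, 4.4, 20.2],
})

# Walk the three columns in lockstep instead of indexing row by row,
# then assign the whole result column in one step.
ud_flex["fpts_orig"] = [
    set_fpts(pos, rank, fpts)
    for pos, rank, fpts in zip(
        ud_flex["position_name"], ud_flex["wk_rank_orig"], ud_flex["fpts"]
    )
]
```

This is still a Python-level loop, so a vectorized approach should be faster again, but it avoids per-row indexing and does not depend on the index labels of the data frame.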
CodePudding user response:
In general, iterating through pandas data frames is slow, so it's not surprising that your for loop based approach is taking a while.
I suspect that the following alternative should work more quickly for a data frame of your size.
# Rows whose rank is at or beyond the cutoff for their position
mask = (((ud_flex['position_name'] == "RB") & (ud_flex['wk_rank_orig'] >= 3))
        | ((ud_flex['position_name'] == "WR") & (ud_flex['wk_rank_orig'] >= 4))
        | ((ud_flex['position_name'].isin(["TE", "QB"])) & (ud_flex['wk_rank_orig'] >= 2)))
# Start from the current points, then zero out the masked rows
ud_flex['fpts_orig'] = ud_flex['fpts']
ud_flex.loc[mask, 'fpts_orig'] = 0
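If the list of positions grows, the same vectorized idea can be written with a lookup of rank cutoffs per position. This is only a sketch: the `cutoffs` dictionary and the sample rows (including the "K" row) are assumptions made for illustration, while the column names and thresholds are taken from the question.

```python
import pandas as pd

# Hypothetical sample data; "K" stands in for a position with no rule.
ud_flex = pd.DataFrame({
    "position_name": ["RB", "RB", "WR", "WR", "TE", "QB", "K"],
    "wk_rank_orig": [1.0, 3.0, 3.0, 4.0, 2.0, 1.0, 5.0],
    "fpts": [12.5, 8.0, 10.1, 6.3, 4.4, 20.2, 9.0],
})

# Rank at or beyond which a player's points are set to 0 (from the question).
cutoffs = {"RB": 3, "WR": 4, "TE": 2, "QB": 2}

# Positions missing from the dict map to NaN, and any comparison with NaN
# is False, so those rows keep their points.
cutoff = ud_flex["position_name"].map(cutoffs)
mask = ud_flex["wk_rank_orig"] >= cutoff

# Keep fpts where the mask is False, replace with 0 where it is True.
ud_flex["fpts_orig"] = ud_flex["fpts"].where(~mask, 0)
```

Everything here runs as whole-column operations, so there is no Python-level loop over the rows.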
