I have a pandas data frame similar like mentioned below and for all the (Domain) unique value I want to calculate Count(EV) Count(PV) count(DV) count(GV) where values are = green / total count of values in that unique domain
| Domain | EV | PV | DV | GV | Numerator(part) | denominator(part) | ideal Output |
|---|---|---|---|---|---|---|---|
| KA-BLR | Green | Blue | Green | 1 | 6 | 0.166 | |
| KA-BLR | Green | Green | Blue | 1 | 6 | 0.166 | |
| KL-TRV | Green | Blue | Yellow | Red | 0.5 | 7 | 0.071 |
| KL-TRV | Green | Blue | Blue | 0.5 | 7 | 0.071 | |
| KL-COK | Blue | Blue | Yellow | Green | 0.25 | 4 | 0.0625 |
| TN-CHN | Green | Blue | 0.5 | 5 | 0.1 | ||
| TN-CHN | Green | Blue | Yellow | 0.5 | 5 | 0.1 |
Sample Code
OVER_ALL_SCORE = {}
for Domain in df_RR["Domain"].unique():
#count of greens
EV_G = (df_RR['EV'] == 'Green').sum()
PV_G = (df_RR['PV'] == 'Green').sum()
DV_G = (df_RR['DV'] == 'Green').sum()
GV_G= (df_RR['GV'] == 'Green').sum()
#count of all values excluding null
EV = df_RR['EV'].sum()
PV = df_RR['PV'].sum()
DV = df_RR['DV'].sum()
GV = df_RR['GV'] .sum()
# so (0.25*(SUM for "DV" of greens (totally correct)) 0.25*(SUM for "PV" of greens (totally correct)) 0.25*(SUM for "EV" of greens (totally correct)) 0.25*(SUM for "GV" of greens (totally correct)) / total count of values
Numerator = (0.25*EV_G) (0.25*PV_G) (0.25* DV_G) (0.25* GV_G)
denominator = EV PV DV GV
try:
OVER_ALL_SCORE [domain]=(Numerator /denominator )
except:
OVER_ALL_SCORE [domain]=0
df_RR['Overall_score']=df_RR['Domain'].map(OVER_ALL_SCORE)
Currently this logic is returning same value across all the domain. please help in resolving it
Thanks in Advance
CodePudding user response:
Here's a solution that gives the ideal output:
OVER_ALL_SCORE = {}
for Domain in df_RR["Domain"].unique():
sub_df = df_RR.loc[df_RR['Domain']==Domain]
#count of greens
EV_G = (sub_df['EV'] == 'Green').sum()
PV_G = (sub_df['PV'] == 'Green').sum()
DV_G = (sub_df['DV'] == 'Green').sum()
GV_G = (sub_df['GV'] == 'Green').sum()
#count of all values
EV = sub_df['EV'].count()
PV = sub_df['PV'].count()
DV = sub_df['DV'].count()
GV = sub_df['GV'].count()
numerator = (0.25*EV_G) (0.25*PV_G) (0.25* DV_G) (0.25* GV_G)
denominator = EV PV DV GV
try:
OVER_ALL_SCORE[Domain] = (numerator /denominator )
except:
OVER_ALL_SCORE[Domain] = 0
df_RR['Overall_score']=df_RR['Domain'].map(OVER_ALL_SCORE)
There are a few changes are were key:
count() vs. sum()
In your count of ALL VALUES you'll want to use the count method rather than the sum method (otherwise, this code will just concatenate the string values in the table):
df_RR['EV'].sum()
This returns: 'GreenGreenGreenBlueGreen' (since the sum method simply adds all of the values in the series).
Use this instead:
df_RR['EV'].count()
The reason it works in your count of the greens is that this code df_RR['EV'] == 'Green' is returning a series of booleans which can be summed correctly to give you the number of greens (since it will add the trues as 1's and the falses as zeros):
True True True False True is the same as 1 1 1 0 1
The Main Problem
Currently, your counts are the same in each loop that you run since you're not filtering according to the domain. I would create sub-dataframe. Based on the domain you're looking at as the first step in your loop:
domain_df = df_RR.loc[df_RR['Domain'] == Domain]
