how to find count of a specific value in column on pandas data frame and use it for calculations-CodePudding

I have a pandas data frame similar like mentioned below and for all the (Domain) unique value I want to calculate Count(EV) Count(PV) count(DV) count(GV) where values are = green / total count of values in that unique domain

Domain	EV	PV	DV	GV	Numerator(part)	denominator(part)	ideal Output
KA-BLR	Green	Blue		Green	1	6	0.166
KA-BLR	Green	Green	Blue		1	6	0.166
KL-TRV	Green	Blue	Yellow	Red	0.5	7	0.071
KL-TRV	Green	Blue	Blue		0.5	7	0.071
KL-COK	Blue	Blue	Yellow	Green	0.25	4	0.0625
TN-CHN	Green		Blue		0.5	5	0.1
TN-CHN	Green	Blue		Yellow	0.5	5	0.1

Sample Code

OVER_ALL_SCORE = {}
for Domain in df_RR["Domain"].unique():

   #count of greens 

    EV_G = (df_RR['EV'] == 'Green').sum()
    
    PV_G = (df_RR['PV'] == 'Green').sum()
    
    DV_G = (df_RR['DV'] == 'Green').sum()
    
    GV_G= (df_RR['GV'] == 'Green').sum()

    #count of all values excluding null

    EV = df_RR['EV'].sum()
    
    PV = df_RR['PV'].sum()
    
    DV = df_RR['DV'].sum()
    
    GV = df_RR['GV'] .sum()
    
    
       
    # so (0.25*(SUM for "DV" of greens (totally correct)) 0.25*(SUM for "PV" of greens (totally correct)) 0.25*(SUM for "EV" of greens (totally correct)) 0.25*(SUM for "GV" of greens (totally correct)) / total count of values
    
   Numerator = (0.25*EV_G)   (0.25*PV_G)   (0.25* DV_G)   (0.25* GV_G)
   

   denominator = EV PV DV GV

   try:
      OVER_ALL_SCORE [domain]=(Numerator /denominator )
   
  except:
        OVER_ALL_SCORE [domain]=0

 df_RR['Overall_score']=df_RR['Domain'].map(OVER_ALL_SCORE)

Currently this logic is returning same value across all the domain. please help in resolving it

Thanks in Advance

CodePudding user response：

Here's a solution that gives the ideal output:

OVER_ALL_SCORE = {}

for Domain in df_RR["Domain"].unique():

    sub_df = df_RR.loc[df_RR['Domain']==Domain]

    #count of greens 

    EV_G = (sub_df['EV'] == 'Green').sum()
    PV_G = (sub_df['PV'] == 'Green').sum()
    DV_G = (sub_df['DV'] == 'Green').sum()
    GV_G = (sub_df['GV'] == 'Green').sum()

    #count of all values

    EV = sub_df['EV'].count()
    PV = sub_df['PV'].count()
    DV = sub_df['DV'].count()
    GV = sub_df['GV'].count()

    numerator = (0.25*EV_G)   (0.25*PV_G)   (0.25* DV_G)   (0.25* GV_G)
    denominator = EV PV DV GV

    try:
        OVER_ALL_SCORE[Domain] = (numerator /denominator )
    except:
        OVER_ALL_SCORE[Domain] = 0

df_RR['Overall_score']=df_RR['Domain'].map(OVER_ALL_SCORE)

There are a few changes are were key:

count() vs. sum()

In your count of ALL VALUES you'll want to use the count method rather than the sum method (otherwise, this code will just concatenate the string values in the table):

df_RR['EV'].sum()

This returns: 'GreenGreenGreenBlueGreen' (since the sum method simply adds all of the values in the series).

Use this instead:

df_RR['EV'].count()

The reason it works in your count of the greens is that this code df_RR['EV'] == 'Green' is returning a series of booleans which can be summed correctly to give you the number of greens (since it will add the trues as 1's and the falses as zeros):

True True True False True is the same as 1 1 1 0 1

The Main Problem

Currently, your counts are the same in each loop that you run since you're not filtering according to the domain. I would create sub-dataframe. Based on the domain you're looking at as the first step in your loop:

domain_df = df_RR.loc[df_RR['Domain'] == Domain]