how to replace a specifc column value based on max count of values in a groupby scenario with pandas-CodePudding

I have a dataframe with two columns district and region. Below is the sample of how the input dataframe looks like:

    district     region

1   Aveiro       -
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        -
9   Braga        Trás-os-Montes

As you can see, There are no null values in the dataframes. But in the region column, there are some records that have this value "-" . Now i want to replcae all the "-" records in that column with the most frequent value based on a groupby scenario with column district: We can get that count with this...

df1['region'].groupby(df1['district']).value_counts()


district          region               
Aveiro            Beira Litoral             3
                  Entre Douro e Minho       2
                  -                         1
Braga             Trás-os-Montes            2
                  -                         1

As you can see, "Beira Litoral" is the most frequent value for Averio, then it should replace the "-" in region column. Similarly, "Trás-os-Montes" is the most frequent value for Braga.

The output dataframe should look like this:

    district     region

1   Aveiro       Beira Litoral
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        Trás-os-Montes
9   Braga        Trás-os-Montes

If i had Nan instead of "-" then I could have solved that with something like this

CodePudding user response：

You can filter out the -, then find the mode with groupby():

modes = (df1['region'].replace('-', np.nan)
   .groupby(df1['district'])
   .transform(lambda x: x.mode()[0])
)

df1['region'] = np.where(df1['region']=='-', modes, df1['region'])

Output:

  district               region
1   Aveiro        Beira Litoral
2   Aveiro  Entre Douro e Minho
3   Aveiro        Beira Litoral
4   Aveiro        Beira Litoral
5   Aveiro  Entre Douro e Minho
6   Aveiro        Beira Litoral
7    Braga       Trás-os-Montes
8    Braga       Trás-os-Montes
9    Braga       Trás-os-Montes

CodePudding user response：

Get the values you want to use for each group and then replace with them.

def get_most_freq_region(group):
    return group.region.value_counts().index[0]
fill_in_values = tmp.groupby('district').apply(get_most_freq_region)
tmp = tmp.set_index('district')
tmp['fill_in_values'] = fill_in_values
tmp.loc[:, 'region'] = tmp.apply(axis=1, func=lambda x: x.region if x.region != '-' else x.fill_in_values)

returned value:

    region  fill_in_values
district        
Aveiro  Beira Litoral   Beira Litoral
Aveiro  Entre Douro e Minho     Beira Litoral
Aveiro  Beira Litoral   Beira Litoral
Aveiro  Beira Litoral   Beira Litoral
Aveiro  Entre Douro e Minho     Beira Litoral
Aveiro  Beira Litoral   Beira Litoral
Braga   Trás-os-Montes  Trás-os-Montes
Braga   Trás-os-Montes  Trás-os-Montes
Braga   Trás-os-Montes  Trás-os-Montes