I have a dataframe with two columns district and region. Below is the sample of how the input dataframe looks like:
district region
1 Aveiro -
2 Aveiro Entre Douro e Minho
3 Aveiro Beira Litoral
4 Aveiro Beira Litoral
5 Aveiro Entre Douro e Minho
6 Aveiro Beira Litoral
7 Braga Trás-os-Montes
8 Braga -
9 Braga Trás-os-Montes
As you can see, There are no null values in the dataframes. But in the region column, there are some records that have this value "-" . Now i want to replcae all the "-" records in that column with the most frequent value based on a groupby scenario with column district: We can get that count with this...
df1['region'].groupby(df1['district']).value_counts()
district region
Aveiro Beira Litoral 3
Entre Douro e Minho 2
- 1
Braga Trás-os-Montes 2
- 1
As you can see, "Beira Litoral" is the most frequent value for Averio, then it should replace the "-" in region column. Similarly, "Trás-os-Montes" is the most frequent value for Braga.
The output dataframe should look like this:
district region
1 Aveiro Beira Litoral
2 Aveiro Entre Douro e Minho
3 Aveiro Beira Litoral
4 Aveiro Beira Litoral
5 Aveiro Entre Douro e Minho
6 Aveiro Beira Litoral
7 Braga Trás-os-Montes
8 Braga Trás-os-Montes
9 Braga Trás-os-Montes
If i had Nan instead of "-" then I could have solved that with something like this
CodePudding user response:
You can filter out the -, then find the mode with groupby():
modes = (df1['region'].replace('-', np.nan)
.groupby(df1['district'])
.transform(lambda x: x.mode()[0])
)
df1['region'] = np.where(df1['region']=='-', modes, df1['region'])
Output:
district region
1 Aveiro Beira Litoral
2 Aveiro Entre Douro e Minho
3 Aveiro Beira Litoral
4 Aveiro Beira Litoral
5 Aveiro Entre Douro e Minho
6 Aveiro Beira Litoral
7 Braga Trás-os-Montes
8 Braga Trás-os-Montes
9 Braga Trás-os-Montes
CodePudding user response:
Get the values you want to use for each group and then replace with them.
def get_most_freq_region(group):
return group.region.value_counts().index[0]
fill_in_values = tmp.groupby('district').apply(get_most_freq_region)
tmp = tmp.set_index('district')
tmp['fill_in_values'] = fill_in_values
tmp.loc[:, 'region'] = tmp.apply(axis=1, func=lambda x: x.region if x.region != '-' else x.fill_in_values)
returned value:
region fill_in_values
district
Aveiro Beira Litoral Beira Litoral
Aveiro Entre Douro e Minho Beira Litoral
Aveiro Beira Litoral Beira Litoral
Aveiro Beira Litoral Beira Litoral
Aveiro Entre Douro e Minho Beira Litoral
Aveiro Beira Litoral Beira Litoral
Braga Trás-os-Montes Trás-os-Montes
Braga Trás-os-Montes Trás-os-Montes
Braga Trás-os-Montes Trás-os-Montes
