I have a pandas DataFrame like this:
| Age | Gender | City |
|---|---|---|
| 10000 | Male | Tokyo |
| 15000 | Male | Tokyo |
| 20000 | Male | Tokyo |
| 12000 | Female | Madrid |
| 14000 | Female | Madrid |
| 16000 | Female | Madrid |
| 15000 | Female | Rome |
| NaN | Female | Rome |
| NaN | Male | Tokyo |
| NaN | Female | Rome |
For the last 3 rows, I'd like to impute the median based on the gender and city. For example, for the Female rows in Rome that have NaN, the value would be 15000, because the only female in Rome with a known value has 15000.
For the male from Tokyo with a NaN value, it would be 15000, because that is the median of the males in Tokyo.
I know I can fill with the median of the whole column using df['Age'] = df['Age'].fillna(median), but I want to calculate it using the other categorical columns too.
Maybe something like this?
df['Age'] = df['Age'].fillna(df[['Age','Gender','City']].groupby(by=['Gender','City']).median())
How can I do this?
I appreciate your help.
CodePudding user response:
Let us try groupby with transform. Also, please make sure the missing values are real NaN, not the string 'Nan'.
df['Age'] = df['Age'].fillna(df.groupby(['Gender','City'])['Age'].transform('median'))
df
Out[628]:
Age Gender City
0 10000.0 Male Tokyo
1 15000.0 Male Tokyo
2 20000.0 Male Tokyo
3 12000.0 Female Madrid
4 14000.0 Female Madrid
5 16000.0 Female Madrid
6 15000.0 Female Rome
7 15000.0 Female Rome
8 15000.0 Male Tokyo
9 15000.0 Female Rome
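For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the DataFrame from the question by hand and shows why transform is used rather than a plain median: the plain grouped median has one value per (Gender, City) group, while transform returns one value per original row, so it lines up with df['Age'] for fillna. The variable names group_medians, row_medians and filled are only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (NaN marks the missing values).
df = pd.DataFrame({
    "Age": [10000, 15000, 20000, 12000, 14000, 16000, 15000, np.nan, np.nan, np.nan],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female",
               "Female", "Female", "Male", "Female"],
    "City": ["Tokyo", "Tokyo", "Tokyo", "Madrid", "Madrid", "Madrid",
             "Rome", "Rome", "Tokyo", "Rome"],
})

# Plain aggregation: one median per (Gender, City) group, indexed by group keys.
group_medians = df.groupby(["Gender", "City"])["Age"].median()

# transform broadcasts each group's median back to every row of that group,
# so the result shares the original row index of df.
row_medians = df.groupby(["Gender", "City"])["Age"].transform("median")

# Only the NaN entries are replaced; existing values are left untouched.
filled = df["Age"].fillna(row_medians)
print(filled.tolist())
```

Because row_medians is aligned to the rows of df, fillna can match values by index. The grouped median from the question's attempt is indexed by Gender and City instead, which is why it cannot be passed to fillna directly.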
CodePudding user response:
Another way:
df['Age'] = df.groupby(['Gender', 'City'], group_keys=False)['Age'].apply(lambda x: x.fillna(x.median()))
print(df)
# Output
Age Gender City
0 10000.0 Male Tokyo
1 15000.0 Male Tokyo
2 20000.0 Male Tokyo
3 12000.0 Female Madrid
4 14000.0 Female Madrid
5 16000.0 Female Madrid
6 15000.0 Female Rome
7 15000.0 Female Rome
8 15000.0 Male Tokyo
9 15000.0 Female Rome
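One caveat with either approach: if a (Gender, City) group has no known Age at all, its median is NaN and those rows stay unfilled. A possible fallback, sketched below under the assumption that the overall column median is an acceptable default, is to chain a second fillna. The small DataFrame here, including the Oslo row, is made-up data for illustration only and is not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the Female/Oslo group has no known Age at all.
df = pd.DataFrame({
    "Age": [10000, 20000, np.nan, np.nan, 30000],
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "City": ["Tokyo", "Tokyo", "Tokyo", "Oslo", "Madrid"],
})

# Per-row group median; NaN for rows whose whole group is missing.
group_median = df.groupby(["Gender", "City"])["Age"].transform("median")

# First fill from the group median, then fall back to the overall median
# (an assumed default) for anything still missing.
overall_median = df["Age"].median()
filled = df["Age"].fillna(group_median).fillna(overall_median)
print(filled.tolist())
```

Whether the overall median is a sensible default depends on the data; dropping such rows or filling by a coarser group (for example Gender only) are other options.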
