Impute Average value of one column grouped by other columns in Pandas Dataframe-CodePudding

I have a dataframe df :

City	Territory	Region	Area	Sales
Chicopee	Springfield MA	Northeast	National	58761
Chicopee	Springfield MA	Northeast	National	65204
Chicopee	Springfield MA	Northeast	National	79862
Feeding Hills	Springfield MA	Northeast	National	67247
Holyoke	Springfield MA	Northeast	National	64347
Holyoke	Springfield MA	Northeast	National	73473
Northampton	Springfield MA	Northeast	National	43349
South Hadley	Springfield MA	Northeast	National	43551
South Hadley	Springfield MA	Northeast	National	73633
Springfield	Springfield MA	Northeast	National	64025
Springfield	Springfield MA	Northeast	National	56670
Springfield	Springfield MA	Northeast	National	79424
Springfield	Springfield MA	Northeast	National	39118
Springfield	Springfield MA	Northeast	National	74262
Leominster	Worcester MA	Northeast	National	39348
Lunenburg	Worcester MA	Northeast	National	33119
Westborough	Worcester MA	Northeast	National	54086
Worcester	Worcester MA	Northeast	National	79859
Worcester	Worcester MA	Northeast	National	40721
Framingham	Worcester MA	Northeast	National	79197
Framingham	Worcester MA	Northeast	National	36837
Framingham	Worcester MA	Northeast	National	66552
Framingham	Worcester MA	Northeast	National	66355
Concord	Worcester MA	Northeast	National	78933
Holliston	Worcester MA	Northeast	National	37515
Hudson	Worcester MA	Northeast	National	49778

I want to impute the columns with average of Sales for the respective columns at the respective levels:

City	Territory_Name	Region	Area	Sales
72628	52548.57143	58083.19231	53497.5	67241
72628	52548.57143	58083.19231	53497.5	68107
72628	52548.57143	58083.19231	53497.5	53908
52495	52548.57143	58083.19231	53497.5	42890
64779	52548.57143	58083.19231	53497.5	37501
64779	52548.57143	58083.19231	53497.5	64936
74919	52548.57143	58083.19231	53497.5	33580
35424.5	52548.57143	58083.19231	53497.5	68582
35424.5	52548.57143	58083.19231	53497.5	34911
57106.6	52548.57143	58083.19231	53497.5	56194
57106.6	52548.57143	58083.19231	53497.5	57250
57106.6	52548.57143	58083.19231	53497.5	36804
57106.6	52548.57143	58083.19231	53497.5	48133
57106.6	52548.57143	58083.19231	53497.5	57877
53656	57681.83333	58083.19231	53497.5	54927
49928	57681.83333	58083.19231	53497.5	45930
35606	57681.83333	58083.19231	53497.5	78042
43763.5	57681.83333	58083.19231	53497.5	66828
43763.5	57681.83333	58083.19231	53497.5	47742
47273.75	57681.83333	58083.19231	53497.5	57319
47273.75	57681.83333	58083.19231	53497.5	74662
47273.75	57681.83333	58083.19231	53497.5	45176
47273.75	57681.83333	58083.19231	53497.5	48625
77168	57681.83333	58083.19231	53497.5	49464
34433	57681.83333	58083.19231	53497.5	37793
51612	57681.83333	58083.19231	53497.5	36054

Is there an inbuilt package I can leverage to achieve this for all the categorical columns that I have?

CodePudding user response：

try this:

avg_sale = df.groupby('Region').agg(avg_sale_region=('Sales', np.mean))
df.merge(avg_sale, on='Region')

This is for region, you can do the same for other columns such as city, ...

for col in columns_you_want: #columns_you_want=['City', 'Region', ...]
    avg_sale = df.groupby(col).agg(**{'avg_sale_by_' col: ('Sales', np.mean)})
    df = df.merge(avg_sale, on=col)

after that you can rename the columns:

df = df.drop(columns=columns_you_want)
df = df.rename(columns={'avg_sale_by_' col: col for col in columns_you_want})