I have a dataframe df :
| City | Territory | Region | Area | Sales |
|---|---|---|---|---|
| Chicopee | Springfield MA | Northeast | National | 58761 |
| Chicopee | Springfield MA | Northeast | National | 65204 |
| Chicopee | Springfield MA | Northeast | National | 79862 |
| Feeding Hills | Springfield MA | Northeast | National | 67247 |
| Holyoke | Springfield MA | Northeast | National | 64347 |
| Holyoke | Springfield MA | Northeast | National | 73473 |
| Northampton | Springfield MA | Northeast | National | 43349 |
| South Hadley | Springfield MA | Northeast | National | 43551 |
| South Hadley | Springfield MA | Northeast | National | 73633 |
| Springfield | Springfield MA | Northeast | National | 64025 |
| Springfield | Springfield MA | Northeast | National | 56670 |
| Springfield | Springfield MA | Northeast | National | 79424 |
| Springfield | Springfield MA | Northeast | National | 39118 |
| Springfield | Springfield MA | Northeast | National | 74262 |
| Leominster | Worcester MA | Northeast | National | 39348 |
| Lunenburg | Worcester MA | Northeast | National | 33119 |
| Westborough | Worcester MA | Northeast | National | 54086 |
| Worcester | Worcester MA | Northeast | National | 79859 |
| Worcester | Worcester MA | Northeast | National | 40721 |
| Framingham | Worcester MA | Northeast | National | 79197 |
| Framingham | Worcester MA | Northeast | National | 36837 |
| Framingham | Worcester MA | Northeast | National | 66552 |
| Framingham | Worcester MA | Northeast | National | 66355 |
| Concord | Worcester MA | Northeast | National | 78933 |
| Holliston | Worcester MA | Northeast | National | 37515 |
| Hudson | Worcester MA | Northeast | National | 49778 |
I want to impute the columns with average of Sales for the respective columns at the respective levels:
| City | Territory_Name | Region | Area | Sales |
|---|---|---|---|---|
| 72628 | 52548.57143 | 58083.19231 | 53497.5 | 67241 |
| 72628 | 52548.57143 | 58083.19231 | 53497.5 | 68107 |
| 72628 | 52548.57143 | 58083.19231 | 53497.5 | 53908 |
| 52495 | 52548.57143 | 58083.19231 | 53497.5 | 42890 |
| 64779 | 52548.57143 | 58083.19231 | 53497.5 | 37501 |
| 64779 | 52548.57143 | 58083.19231 | 53497.5 | 64936 |
| 74919 | 52548.57143 | 58083.19231 | 53497.5 | 33580 |
| 35424.5 | 52548.57143 | 58083.19231 | 53497.5 | 68582 |
| 35424.5 | 52548.57143 | 58083.19231 | 53497.5 | 34911 |
| 57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 56194 |
| 57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 57250 |
| 57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 36804 |
| 57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 48133 |
| 57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 57877 |
| 53656 | 57681.83333 | 58083.19231 | 53497.5 | 54927 |
| 49928 | 57681.83333 | 58083.19231 | 53497.5 | 45930 |
| 35606 | 57681.83333 | 58083.19231 | 53497.5 | 78042 |
| 43763.5 | 57681.83333 | 58083.19231 | 53497.5 | 66828 |
| 43763.5 | 57681.83333 | 58083.19231 | 53497.5 | 47742 |
| 47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 57319 |
| 47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 74662 |
| 47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 45176 |
| 47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 48625 |
| 77168 | 57681.83333 | 58083.19231 | 53497.5 | 49464 |
| 34433 | 57681.83333 | 58083.19231 | 53497.5 | 37793 |
| 51612 | 57681.83333 | 58083.19231 | 53497.5 | 36054 |
Is there an inbuilt package I can leverage to achieve this for all the categorical columns that I have?
CodePudding user response:
try this:
avg_sale = df.groupby('Region').agg(avg_sale_region=('Sales', np.mean))
df.merge(avg_sale, on='Region')
This is for region, you can do the same for other columns such as city, ...
for col in columns_you_want: #columns_you_want=['City', 'Region', ...]
avg_sale = df.groupby(col).agg(**{'avg_sale_by_' col: ('Sales', np.mean)})
df = df.merge(avg_sale, on=col)
after that you can rename the columns:
df = df.drop(columns=columns_you_want)
df = df.rename(columns={'avg_sale_by_' col: col for col in columns_you_want})
