I have two Pandas dataframes as below and I am trying to replace the 0 values from column Price in the first one.
The first dataframe looks like this:
| Region | Country | Product | Year | Price |
|---|---|---|---|---|
| Africa | South Africa | ABC | 2016 | 500 |
| Africa | South Africa | ABC | 2017 | 400 |
| Africa | South Africa | ABC | 2018 | 0 |
| Africa | South Africa | ABC | 2019 | 450 |
| Africa | South Africa | XYZ | 2016 | 750 |
| Africa | South Africa | XYZ | 2017 | 0 |
| Africa | South Africa | XYZ | 2018 | 0 |
| Africa | South Africa | XYZ | 2019 | 890 |
| Asia | Japan | DEF | 2016 | 0 |
| Asia | Japan | DEF | 2017 | 0 |
| Asia | Japan | DEF | 2018 | 415 |
| Asia | Japan | DEF | 2019 | 0 |
It contains price for each type of product for each country for a given year. Some countries have values of 0 for certain years as that data is missing.
The second dataframe contains the average yearly price growth for each region and product and looks like this.
| Region | Product | Average Yearly Price Growth |
|---|---|---|
| Africa | ABC | 0.043 |
| Africa | DEF | 0.071 |
| Africa | XYZ | 0.0128 |
| Asia | XYZ | 0.05 |
| Asia | ABC | -0.009 |
| Asia | DEF | 0.021 |
The task is to replace all 0 values in the first dataframe with the previous year's value for that particular product in that particular country, plus the average yearly rate increase (assuming the earliest data point is available).
In the case of product DEF in Japan, I would want to detect the earliest year where data is available and calculate all the previous years where Price is 0 from that point as follows:
Product DEF in Japan in 2018 has a price of 415 and is the earliest available data point for Price. Therefore the price for 2017 should be 415/(1 0.021)=406.464. After that, the value for 2016 should be 406.464/(1 0.021). Once these values have been calculated, the value for DEF in Japan in 2019 should be 415 415*0.021=423.715.
For other countries such as South Africa where the price for the earliest year (2016) is available it should be easier. For example, for product XYZ in South Africa in 2017, instead of 0 the value should be 750 750*0.0128=759.6. Where 750 is the previous year value for that product in that country and 0.0128 is the average yearly price growth for that product in Africa (from dataframe 2).
Then for the same product but in 2018, the value should be the previously calculated one, 759.6 759.6*0.0128=769.32.
Here is the sample data:
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,0,0,415,0]}
data2 = {'Region': ['Africa','Africa','Africa','Asia','Asia','Asia'],
'Product': ['ABC','DEF','XYZ','XYZ','ABC','DEF'],
'Average Yearly Price Growth': [0.043, 0.071, 0.0128,0.05,-0.009,0.021]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
My code so far is below, but it only works for situations where the earlies available point is available, like South Africa. Any ideas on how to deal with situations like Japan?
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
except:
pass
CodePudding user response:
Simple fix add another for loop after that, this is more like forward fill and backward fill for the value of 0 , since we need to step , just like bfill and ffill chain
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
except:
pass
df = df.iloc[::-1]
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year'] 1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value /(1 df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0])
except:
pass
df = df.iloc[::-1]
Out[88]:
Region Country Product Year Price
0 Africa South Africa ABC 2016 500
1 Africa South Africa ABC 2017 400
2 Africa South Africa ABC 2018 417
3 Africa South Africa ABC 2019 450
4 Africa South Africa XYZ 2016 750
5 Africa South Africa XYZ 2017 759
6 Africa South Africa XYZ 2018 768
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 397
9 Asia Japan DEF 2017 406
10 Asia Japan DEF 2018 415
11 Asia Japan DEF 2019 423
