I am trying to loop over my dataframe and add 3 extra rows for each element in df.con, but the result only contains the new rows for the second element, US, and is missing the ones for UK.
Here is the code.
import pandas as pd
d = { 'year': [2019,2019,2019,2020,2020,2020],
'age group': ['(0-14)','(14-50)','(50 )','(0-14)','(14-50)','(50 )'],
'con': ['UK','UK','UK','US','US','US'],
'population': [10,20,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
   year  age group  con  population
0  2019     (0-14)   UK          10
1  2019    (14-50)   UK          20
2  2019      (50 )   UK         300
3  2020     (0-14)   US         400
4  2020    (14-50)   US        1000
5  2020      (50 )   US        2000
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child','old vs young', 'unemployed vs working']
for country in df.con:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:,'population'] = bev_work.loc[:,'population'].max() / bev_child.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[0]
    bev_child.loc[:,'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_child.loc[:,'population'].max() + bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[2]
    bev_child.loc[:,'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[1]
    bev_child.loc[:,'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
The output is missing the UK rows:
   year              age group                       con  population
0  2019                 (0-14)                        UK        10.0
1  2019                (14-50)                        UK        20.0
2  2019                  (50 )                        UK       300.0
3  2020                 (0-14)                        US       400.0
4  2020                (14-50)                        US      1000.0
5  2020                  (50 )                        US      2000.0
6  2020         young vs child         US-young vs child         2.5
7  2020  unemployed vs working  US-unemployed vs working         4.5
8  2020           old vs young           US-old vs young         2.0
CodePudding user response:
Each time through the loop, s is re-initialized to a new dataframe on this line:
s = n_df_2.append(bev_child, ignore_index=True)
This makes s end up as the original value of n_df_2, plus only the three values that are appended to it the last time the loop body is executed.
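The same effect can be reproduced without pandas. The sketch below uses plain lists, where `base`, `groups`, and `acc` are hypothetical stand-ins for n_df_2, the countries, and a correctly accumulated result; it is only meant to illustrate the pattern, not the original data.

```python
# Hypothetical stand-ins: `base` plays the role of n_df_2, `groups` the countries.
base = ['row0', 'row1']
groups = ['UK', 'US']

# Buggy pattern: the accumulator is rebuilt from `base` on every pass.
for g in groups:
    s = base + [g + '-a']   # discards whatever earlier passes added
    s = s + [g + '-b']

# Fixed pattern: start once, before the loop, and keep extending.
acc = list(base)
for g in groups:
    acc = acc + [g + '-a']
    acc = acc + [g + '-b']

print(s)    # only the rows from the last pass survive
print(acc)  # rows from every pass are kept
```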
I think this is closer to what you want (nothing before the loop changes):
for country in df.con.unique():
    # Select the three age-group rows for this country
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    # young vs child
    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
    # unemployed vs working
    bev_child.loc[:, 'population'] = bev_child.loc[:, 'population'].max() + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
    # old vs young
    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
print(n_df_2)
Output:
    year              age group                       con  population
 0  2019                 (0-14)                        UK        10.0
 1  2019                (14-50)                        UK        20.0
 2  2019                  (50 )                        UK       300.0
 3  2020                 (0-14)                        US       400.0
 4  2020                (14-50)                        US      1000.0
 5  2020                  (50 )                        US      2000.0
 6  2019         young vs child         UK-young vs child         2.0
 7  2019  unemployed vs working  UK-unemployed vs working        17.0
 8  2019           old vs young           UK-old vs young        15.0
 9  2020         young vs child         US-young vs child         2.5
10  2020  unemployed vs working  US-unemployed vs working         4.5
11  2020           old vs young           US-old vs young         2.0
Note that this loops only over the unique values in df.con, so the loop body runs just twice, and three records are added to the output each time it runs. Note also that the output is appended directly to n_df_2, so there's no need for the variable s.
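One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the code above only runs on older versions. Below is a rough sketch of the same calculation using pd.concat instead. It assumes each country has exactly one row per age group, and it swaps str.contains (which treats '(0-14)' as a regular expression) for an exact lookup by label. The names `new_rows`, `pop`, `ratios`, and `result` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2019, 2019, 2020, 2020, 2020],
    'age group': ['(0-14)', '(14-50)', '(50 )', '(0-14)', '(14-50)', '(50 )'],
    'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
    'population': [10, 20, 300, 400, 1000, 2000],
})

child_label, work_label, old_label = '(0-14)', '(14-50)', '(50 )'
new_list = ['young vs child', 'old vs young', 'unemployed vs working']

new_rows = []  # plain dicts, one per derived row
for country in df['con'].unique():
    sub = df[df['con'] == country]
    # Population by age-group label (assumes one row per label per country)
    pop = sub.set_index('age group')['population']
    year = sub['year'].iloc[0]
    young_vs_child = pop[work_label] / pop[child_label]
    old_vs_young = pop[old_label] / pop[work_label]
    # Same three values, in the same order, as the loop in the answer
    ratios = [
        (new_list[0], young_vs_child),
        (new_list[2], young_vs_child + old_vs_young),
        (new_list[1], old_vs_young),
    ]
    for label, value in ratios:
        new_rows.append({'year': year, 'age group': label,
                         'con': country + '-' + label, 'population': value})

# Concatenate once at the end instead of appending inside the loop
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```

Collecting the rows in a list and concatenating once also avoids copying the whole frame on every iteration, which starts to matter as the number of countries grows.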
