For loop is not working while appending rows

I am trying to loop over my dataframe and add three additional rows for each unique value in df.con, but the result only contains the new rows for the second value, US, and is missing the ones for UK.

Please find the attached code.

import pandas as pd
d = { 'year': [2019,2019,2019,2020,2020,2020], 
      'age group': ['(0-14)','(14-50)','(50 )','(0-14)','(14-50)','(50 )'], 
      'con': ['UK','UK','UK','US','US','US'],
      'population': [10,20,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df

   year age group con  population
0  2019    (0-14)  UK          10
1  2019   (14-50)  UK          20
2  2019     (50 )  UK         300
3  2020    (0-14)  US         400
4  2020   (14-50)  US        1000
5  2020     (50 )  US        2000

n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child','old vs young', 'unemployed vs working']

for country in df.con:

      bev_child =  n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
      bev_work =  n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
      bev_old =  n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]


      bev_child.loc[:,'population'] = bev_work.loc[:,'population'].max() / bev_child.loc[:,'population'].max() 
      bev_child.loc[:,'con'] = country + '-' + new_list[0]
      bev_child.loc[:,'age group'] = new_list[0]
      s = n_df_2.append(bev_child, ignore_index=True)


      bev_child.loc[:,'population'] = bev_child.loc[:,'population'].max() + bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
      bev_child.loc[:,'con'] = country + '-' + new_list[2]
      bev_child.loc[:,'age group'] = new_list[2]

      s = s.append(bev_child, ignore_index=True)

      bev_child.loc[:,'population'] = bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max() 
      bev_child.loc[:,'con'] = country + '-' + new_list[1]
      bev_child.loc[:,'age group'] = new_list[1]

      s = s.append(bev_child, ignore_index=True)
s

The output is missing the UK rows:


   year              age group                       con  population
0  2019                 (0-14)                        UK        10.0
1  2019                (14-50)                        UK        20.0
2  2019                  (50 )                        UK       300.0
3  2020                 (0-14)                        US       400.0
4  2020                (14-50)                        US      1000.0
5  2020                  (50 )                        US      2000.0
6  2020         young vs child         US-young vs child         2.5
7  2020  unemployed vs working  US-unemployed vs working         4.5
8  2020           old vs young           US-old vs young         2.0

User response:

Each time through the loop, s is re-initialized to a new dataframe on this line:

s = n_df_2.append(bev_child, ignore_index=True)

Because n_df_2 itself is never modified inside the loop, s ends up as the original n_df_2 plus only the three rows appended during the last iteration of the loop body.
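
The effect can be reproduced in isolation. The following is a minimal sketch with made-up data; the names base, extra, wrong and right are illustrative and do not come from the question. It uses pd.concat rather than append so that it also runs on pandas versions where DataFrame.append is no longer available.

```python
import pandas as pd

base = pd.DataFrame({"con": ["UK", "US"], "population": [10, 400]})

# Buggy pattern: every iteration starts again from the unchanged `base`,
# so the row added in the previous iteration is thrown away.
for country in ["UK", "US"]:
    extra = pd.DataFrame({"con": [country + "-ratio"], "population": [1.0]})
    wrong = pd.concat([base, extra], ignore_index=True)

# Fixed pattern: accumulate into the same variable on every iteration.
right = base.copy()
for country in ["UK", "US"]:
    extra = pd.DataFrame({"con": [country + "-ratio"], "population": [1.0]})
    right = pd.concat([right, extra], ignore_index=True)

print(wrong["con"].tolist())  # ['UK', 'US', 'US-ratio']
print(right["con"].tolist())  # ['UK', 'US', 'UK-ratio', 'US-ratio']
```

Only the second loop keeps the UK row, because each pass builds on the result of the previous one.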

I think this is closer to what you want (nothing before the loop changes):

for country in df.con.unique():

    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]

    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)

    bev_child.loc[:, 'population'] = bev_child.loc[:, 'population'].max() + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)

    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)

print(n_df_2)

Output:

    year              age group                       con  population
0   2019                 (0-14)                        UK        10.0
1   2019                (14-50)                        UK        20.0
2   2019                  (50 )                        UK       300.0
3   2020                 (0-14)                        US       400.0
4   2020                (14-50)                        US      1000.0
5   2020                  (50 )                        US      2000.0
6   2019         young vs child         UK-young vs child         2.0
7   2019  unemployed vs working  UK-unemployed vs working        17.0
8   2019           old vs young           UK-old vs young        15.0
9   2020         young vs child         US-young vs child         2.5
10  2020  unemployed vs working  US-unemployed vs working         4.5
11  2020           old vs young           US-old vs young         2.0

Note that this loops only over the unique values in df.con, so the loop body runs just twice, and three records are added to the output on each pass. Also, the new rows are appended directly to n_df_2, so there's no need for the variable s.
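
On pandas 2.0 and later, DataFrame.append has been removed, so the same idea has to be written with pd.concat. The sketch below is one possible way to do that, not the only one. It assumes there is exactly one row per country and age group (so the .max() calls are not needed), copies the age-group labels exactly as they appear in the question, and matches them by exact label instead of str.contains. The names new_rows, ratios and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "age group": ["(0-14)", "(14-50)", "(50 )", "(0-14)", "(14-50)", "(50 )"],
    "con": ["UK", "UK", "UK", "US", "US", "US"],
    "population": [10, 20, 300, 400, 1000, 2000],
})

new_rows = []
# sort=False keeps the countries in their original order (UK, then US).
for country, group in df.groupby("con", sort=False):
    # Look up each age group's population by exact label (no regex matching).
    pop = group.set_index("age group")["population"]
    child, work, old = pop["(0-14)"], pop["(14-50)"], pop["(50 )"]
    year = group["year"].iloc[0]

    # The same three ratios, in the same order, as in the loop above.
    ratios = {
        "young vs child": work / child,
        "unemployed vs working": work / child + old / work,
        "old vs young": old / work,
    }
    for label, value in ratios.items():
        new_rows.append({"year": year, "age group": label,
                         "con": f"{country}-{label}", "population": value})

# Build the extra rows once and concatenate a single time at the end.
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```

Collecting plain dictionaries and concatenating once also avoids assigning into filtered slices of the dataframe, which is what tends to trigger SettingWithCopyWarning in the loop version.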