Two seperate similar DataFrames with different lengths
df2=
| Index | First Name | Age | Gender | Weight |
|---|---|---|---|---|
| 0 | James | 25 | Male | 155 |
| 1 | John | 27 | Male | 175 |
| 2 | Patricia | 23 | Female | 135 |
| 3 | Mary | 22 | Female | 125 |
| 4 | Martin | 30 | Male | 185 |
| 5 | Margaret | 29 | Female | 141 |
| 6 | Kevin | 22 | Male | 198 |
df1=
| Index | First Name | Age | Gender | Weight | Height |
|---|---|---|---|---|---|
| 0 | James | 25 | Male | 165 | 5'10 |
| 1 | John | 27 | Male | 175 | 5'9 |
| 2 | Matthew | 29 | Male | 183 | 6'0 |
| 3 | Patricia | 23 | Female | 135 | 5'3 |
| 4 | Mary | 22 | Female | 125 | 5'4 |
| 5 | Rachel | 29 | Male | 123 | 5'3 |
| 6 | Jose | 20 | Male | 175 | 5'11 |
| 7 | Kevin | 22 | Male | 192 | 6'2 |
df2 has some rows which are not in df1 and df1 has some values which are not in df2.
I am comparing df1 against df2. I have calculated the newentries with the following code
newentries = df2.loc[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1.loc[~df1['First Name'].isin(df2['First Name'])]
where newentries denote the rows/entries that are there in df2 but not in df1; deletedentries denote the rows/entries that are there in df1 but not in df2. The above code works perfectly fine.
I need to copy the height from df1 to df2 when the first names are equal.
df2.loc[df2['First Name'].isin(df1['First Name']),"Height"] = df1.loc[dfxh2['First Name'].isin(df2['First Name']),"Height"]
The above code copies the values however indexing is causing an issue and the values are not copied to the corresponding rows, I tried to promote First Name as the Index but that doesn't solve the issue. Please help me with a solution
Also, I need to calculate the modified values, if the First Name is same, I need to check for modified values; for example in df1, the weight of James is 155 however in df2 the weight is 165, so I need to store the modified weight of James(165) and index(0) in a new dataframe without iteration; the iteration takes a long time because this is a sample of a big dataframe with a lot more rows and columns.
Desired output:
df2=
| Index | First Name | Age | Gender | Weight | Height |
|---|---|---|---|---|---|
| 0 | James | 25 | Male | 155 | 5'10 |
| 1 | John | 27 | Male | 175 | 5'9 |
| 2 | Patricia | 23 | Female | 135 | 5'3 |
| 3 | Mary | 22 | Female | 125 | 5'4 |
| 4 | Martin | 30 | Male | 185 | |
| 5 | Margaret | 29 | Female | 141 | |
| 6 | Kevin | 22 | Male | 198 | 6'2 |
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
newentries=
| Index | First Name | Age | Gender | Weight | Height |
|---|---|---|---|---|---|
| 4 | Martin | 30 | Male | 185 | |
| 5 | Margaret | 29 | Female | 141 |
deletedentries=
| Index | First Name | Age | Gender | Weight | Height |
|---|---|---|---|---|---|
| 2 | Matthew | 29 | Male | 183 | 6'0 |
| 5 | Rachel | 29 | Male | 123 | 5'3 |
| 6 | Jose | 20 | Male | 175 | 5'11 |
modval=
| Index | First Name | Age | Gender | Weight | Height |
|---|---|---|---|---|---|
| 0 | James | 165 | |||
| 7 | Kevin | 192 |
CodePudding user response:
for your desired output for df2 you can try this:
desired_df2 = df2.merge(df1[['First Name','Height']], on='First Name', how='left')
#if you want to change the "NaN" values just add ".fillna(fill_value=0)" for e.g 0 after the merge
print(desired_df2)
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
new and deleted entries is already right. for the moment I'm a bit stuck how to get the modval dataframe. I'll update my answer if I get a solution.
CodePudding user response:
Building off of Rabinzel's answer:
output = df2.merge(df1, how='left', on='First Name', suffixes=[None, '_old'])
df3 = output[['First Name', 'Age', 'Gender', 'Weight', 'Height']]
cols = df1.columns[1:-1]
modval = pd.DataFrame()
for col in cols:
modval = pd.concat([modval, output[['First Name', col '_old']][output[col] != output[col '_old']].dropna()])
modval.rename(columns={col '_old':col}, inplace=True)
newentries = df2[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1[~df1['First Name'].isin(df2['First Name'])]
print(df3, newentries, deletedentries, modval, sep='\n\n')
Output:
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
First Name Age Gender Weight
4 Martin 30 Male 185
5 Margaret 29 Female 141
First Name Age Gender Weight Height
2 Matthew 29 Male 183 6'0
5 Rachel 29 Male 123 5'3
6 Jose 20 Male 175 5'11
First Name Age Gender Weight
0 James NaN NaN 165.0
6 Kevin NaN NaN 192.0
