Pandas Merge issue-CodePudding

I'm trying to merge one column's values from df2 into df1. df1.merge(df2, how='outer') seems to be what I need, but the result is not what I want because it contains a duplicate row for site1. Using 'on' introduces _x and _y columns, which I don't want either.

In the example below, sub=site1 appears in both df1 and df2, so 'fred' from df2 should replace 'andy' in the 'own' column of df1.

# Pandas Merge test:

import pandas as pd

df1 = pd.DataFrame({'sub': ['site1', 'site2', 'site3'], 'iss': ['enc1', 'enc2', 'enc3'], 'rem': [1, 3, 5], 'own': ['andy', 'brian', 'cody']})
df2 = pd.DataFrame({'sub': ['data1', 'data2', 'site1'], 'rem': [2, 4, 6], 'own': ['david', 'edger', 'fred']})

>>> df1
     sub   iss  rem    own
0  site1  enc1    1   andy
1  site2  enc2    3  brian
2  site3  enc3    5   cody

>>> df2
     sub  rem    own
0  data1    2  david
1  data2    4  edger
2  site1    6   fred

>>> df1.merge(df2, how='outer')
     sub   iss  rem    own
0  site1  enc1    1   andy
1  site2  enc2    3  brian
2  site3  enc3    5   cody
3  data1   NaN    2  david
4  data2   NaN    4  edger
5  site1   NaN    6   fred

>>> df1.merge(df2, on='sub', how='outer')
     sub   iss  rem_x  own_x  rem_y  own_y
0  site1  enc1    1.0   andy    6.0   fred
1  site2  enc2    3.0  brian    NaN    NaN
2  site3  enc3    5.0   cody    NaN    NaN
3  data1   NaN    NaN    NaN    2.0  david
4  data2   NaN    NaN    NaN    4.0  edger

Expected Output:

     sub   iss  rem    own
0  site1  enc1    1   fred
1  site2  enc2    3  brian
2  site3  enc3    5   cody
3  data1   NaN    2  david
4  data2   NaN    4  edger

CodePudding user response：

One fairly simple solution is to use pd.concat together with a boolean filter: keep only the df1 records that are not present in df2, then concatenate them with df2.

# Set 'sub' as the index so the rows can be filtered by index, which is a bit simpler.
df1 = df1.set_index('sub')
df2 = df2.set_index('sub')

Then concatenate the filtered df1 with df2 using pd.concat:

df3 = pd.concat([df1[~df1.index.isin(df2.index)],df2])

Output:

print(df3)
        iss  rem    own
sub                    
site2  enc2    3  brian
site3  enc3    5   cody
data1   NaN    2  david
data2   NaN    4  edger
site1   NaN    6   fred

This does not restore df1's values of rem and iss for site1, though. If that is also needed, one possible solution is to add an extra loc statement, like this:

df3.loc[(df3.index.isin(df1.index.to_list())) & ~(df3['rem'].isin(df1['rem'].to_list())), ['iss','rem']] = df1[['iss','rem']]

Final Output

        iss  rem    own
sub                    
site2  enc2    3  brian
site3  enc3    5   cody
data1   NaN    2  david
data2   NaN    4  edger
site1  enc1    1   fred
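
For reference, the two steps above can be combined into one self-contained sketch. This is only one way to write it: the names left, right, shared and keep_from_left are introduced here for illustration, and the restore step selects the shared keys with Index.intersection instead of the isin-based mask used above.

```python
import pandas as pd

df1 = pd.DataFrame({'sub': ['site1', 'site2', 'site3'],
                    'iss': ['enc1', 'enc2', 'enc3'],
                    'rem': [1, 3, 5],
                    'own': ['andy', 'brian', 'cody']})
df2 = pd.DataFrame({'sub': ['data1', 'data2', 'site1'],
                    'rem': [2, 4, 6],
                    'own': ['david', 'edger', 'fred']})

# Index both frames by the key column so rows can be matched by label.
left = df1.set_index('sub')
right = df2.set_index('sub')

# df1 rows whose key is absent from df2, followed by every df2 row.
combined = pd.concat([left[~left.index.isin(right.index)], right])

# Keys that appear in both frames (only 'site1' in this example).
shared = left.index.intersection(right.index)

# For shared keys, restore df1's values for every column except 'own'.
keep_from_left = ['iss', 'rem']  # illustrative name, not from the answer
combined.loc[shared, keep_from_left] = left.loc[shared, keep_from_left]

# Turn 'sub' back into a regular column.
result = combined.reset_index()
print(result)
```

As in the answer, the site1 row ends up last because df2's rows are appended after the filtered df1 rows.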

CodePudding user response：

Edit: changed to using update instead of fillna as per @bkeesey's comment

You need to merge on sub, then update the overlapping columns and drop the suffixed helper columns.

Try:

import pandas as pd

df1 = pd.DataFrame({'sub': ['site1', 'site2', 'site3'], 'iss': ['enc1', 'enc2', 'enc3'], 'rem': [1, 3, 5], 'own': ['andy', 'brian', 'cody']})
df2 = pd.DataFrame({'sub': ['data1', 'data2', 'site1'], 'rem': [2, 4, 6], 'own': ['david', 'edger', 'fred']})

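# Keep df1's column names; df2's overlapping columns get the "_y" suffix.
dfm = df1.merge(df2, on='sub', how='outer', suffixes=["", "_y"])

# Overwrite df1's values with df2's wherever df2 has a non-NaN value.
dfm.update(dfm[["own_y", "rem_y"]].rename(columns={"own_y": "own", "rem_y": "rem"}))

# Drop the helper columns.
dfm = dfm.drop(columns=["own_y", "rem_y"])

Result (note that rem for site1 also takes df2's value, 6.0):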

     sub   iss  rem    own
0  site1  enc1  6.0   fred
1  site2  enc2  3.0  brian
2  site3  enc3  5.0   cody
3  data1   NaN  2.0  david
4  data2   NaN  4.0  edger
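
The result above sets rem to 6.0 for site1, whereas the expected output in the question keeps df1's value of 1 and only takes own from df2. A minimal sketch of a per-column variant follows, assuming that is the desired rule. The suffixes '_df1' and '_df2' and the name merged are chosen here for illustration, and Series.combine_first is swapped in for update so that each column can state which side wins.

```python
import pandas as pd

df1 = pd.DataFrame({'sub': ['site1', 'site2', 'site3'],
                    'iss': ['enc1', 'enc2', 'enc3'],
                    'rem': [1, 3, 5],
                    'own': ['andy', 'brian', 'cody']})
df2 = pd.DataFrame({'sub': ['data1', 'data2', 'site1'],
                    'rem': [2, 4, 6],
                    'own': ['david', 'edger', 'fred']})

# Explicit suffixes make it obvious which frame each column came from.
merged = df1.merge(df2, on='sub', how='outer', suffixes=('_df1', '_df2'))

# 'own': prefer df2's value, fall back to df1's where df2 has none.
merged['own'] = merged['own_df2'].combine_first(merged['own_df1'])

# 'rem': prefer df1's value, fall back to df2's where df1 has none.
merged['rem'] = merged['rem_df1'].combine_first(merged['rem_df2']).astype(int)

# Keep only the columns from the expected output, in that order.
result = merged[['sub', 'iss', 'rem', 'own']]
print(result)
```

The row order of an outer merge can differ between pandas versions, so it is safer to look rows up by sub than to rely on their position.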

CodePudding user response：

Here is one way to do it:


# Update df1's 'own' with the matching value from df2, looked up by 'sub' using map;
# where df2 has no match, keep df1's original value.
df1['own'] = df1['sub'].map(df2.set_index('sub')['own']).fillna(df1['own'])


out = (pd.concat([df1, df2])              # concat the two DataFrames
       .drop_duplicates(subset=['sub'])   # keep the first row per 'sub' (df1's row)
       .reset_index(drop=True))           # renumber the rows from 0

out

    sub     iss     rem     own
0   site1   enc1    1   fred
1   site2   enc2    3   brian
2   site3   enc3    5   cody
3   data1   NaN     2   david
4   data2   NaN     4   edger

Alternatively:

# concat the two DataFrames and drop the duplicates
out = (pd.concat([df1, df2])
       .drop_duplicates(subset=['sub'])
       .reset_index(drop=True))

# map the own in the resulting DF from concat
out['own'] = out['sub'].map(df2.set_index('sub')['own']).fillna(out['own'])
out

    sub     iss     rem     own
0   site1   enc1    1   fred
1   site2   enc2    3   brian
2   site3   enc3    5   cody
3   data1   NaN     2   david
4   data2   NaN     4   edger
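
The second variant can also be wrapped in a small helper that leaves df1 and df2 unmodified. This is only a sketch: the function name overlay_column and its parameters are hypothetical, and it assumes the key values are unique within right so that the lookup built for map is unambiguous.

```python
import pandas as pd


def overlay_column(left, right, key, column):
    """Stack left and right, keep the first row per key, and take
    `column` from right wherever right has a row for that key.

    Hypothetical helper; assumes `key` is unique within `right`.
    """
    # Lookup Series: key value -> right's value for `column`.
    lookup = right.set_index(key)[column]

    # Stack the frames; for duplicated keys the first (left) row is kept.
    out = (pd.concat([left, right])
           .drop_duplicates(subset=[key])
           .reset_index(drop=True))

    # Use right's value where the key exists there, else keep the current one.
    out[column] = out[key].map(lookup).fillna(out[column])
    return out


df1 = pd.DataFrame({'sub': ['site1', 'site2', 'site3'],
                    'iss': ['enc1', 'enc2', 'enc3'],
                    'rem': [1, 3, 5],
                    'own': ['andy', 'brian', 'cody']})
df2 = pd.DataFrame({'sub': ['data1', 'data2', 'site1'],
                    'rem': [2, 4, 6],
                    'own': ['david', 'edger', 'fred']})

out = overlay_column(df1, df2, key='sub', column='own')
print(out)
```

Because the mapping is applied to the concatenated copy, df1['own'] still holds 'andy' for site1 after the call, unlike the first variant, which overwrites df1 in place.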