Home > Software design >  Pandas - How to keep rows in a Dataframe that are different when comparing with another Dataframe
Pandas - How to keep rows in a Dataframe that are different when comparing with another Dataframe

Time:01-26

my question is: I have two Dataframes:

Dataframe 1:

dataframe 1

Dataframe 2:

Dataframe 2

If you notice, Dataframe 2 has some updated values and I want to create a new Dataframe that has only theses updated values, no matter which column had its updated value.

Desired Dataframe:

Desired Dataframe

CodePudding user response:

Let's make sure we have the right dataframes:

In [270]: df1
Out[270]:
     a   b
0  aaa   5
1  bbb   4
2  ccc   7
3  ddd   9
4  eee  11

In [271]: df2
Out[271]:
       a    b
0  aaaaa    5
1    bbb   38
2    ccc    7
3  ddddd  104
4    eee   11

We want df2's values, so left join:

In [272]: df = df2.set_index('a').join(df1.set_index('a'), how='left', rsuffix="_r")

In [273]: df
Out[273]:
         b   b_r
a
aaaaa    5   NaN
bbb     38   4.0
ccc      7   7.0
ddddd  104   NaN
eee     11  11.0

We only care about when the values differ:

In [274]: df = df[df.b != df.b_r]

In [275]: df
Out[275]:
         b  b_r
a
aaaaa    5  NaN
bbb     38  4.0
ddddd  104  NaN

We no longer need df1's values:

In [276]: df = df.drop(columns=['b_r'])
In [277]: df
Out[277]:
         b
a
aaaaa    5
bbb     38
ddddd  104

CodePudding user response:

After some work, I came with a solution. I don't know if it is the best way to do it:

This is Dataframe 1:

In [4]: df1
Out[4]: 
    id ColA  ColB    ColC
0   4  AAA     5  Test 1
1   7  BBB     4  Test 2
2  49  CCC     7  Test 3
3  71  DDD     9  Test 4
4  84  EEE    11  Test 5

And this is Dataframe 2:

In [5]: df2
Out[5]: 
Out[5]: 
    id   ColA  ColB    ColC
0    4  AAAAA     5  Test 1
1    7    BBB    38  Test 2
2   49    CCC     7  Test 3
3   71  DDDDD   104  Test 4
4   84    EEE    11  Test_5
5  102    FFF    23  Test 6

So, I want to create a new Dataframe that only has items that exists only in Dataframe 2:

In [6]: df_unique = df2[~df2.id.isin(df1.id)]
In [7]: df_unique
Out[7]: 
    id ColA  ColB    ColC
5  102  FFF    23  Test 6

And then I remove these items from Dataframe 2 so that it has only the items that were modified:

In [8]: df2 = df2[~df2.id.isin(df_unique.id)]
In [9]: df2
Out[9]: 
   id   ColA  ColB    ColC
0   4  AAAAA     5  Test 1
1   7    BBB    38  Test 2
2  49    CCC     7  Test 3
3  71  DDDDD   104  Test 4
4  84    EEE    11  Test_5

Finally, I create a new Dataframe that only has the items modified compared to Dataframe 1:

In [10]: df_merged = pd.merge(df2, df1, how='outer', 
                     indicator=True).query('_merge=="left_only"').drop('_merge', axis=1)
In [11]: df_merged
Out[11]: 
   id   ColA  ColB    ColC
0   4  AAAAA     5  Test 1
1   7    BBB    38  Test 2
3  71  DDDDD   104  Test 4
4  84    EEE    11  Test_5
  •  Tags:  
  • Related