Home > database >  Pandas: filter one dataframe by multiple, simultaneous column values of another dataframe
Pandas: filter one dataframe by multiple, simultaneous column values of another dataframe

Time:02-07

I have a filtered dataframe called correct_df and a raw dataframe example_df.

example_df = pd.DataFrame({'Test': ['Test_1', 'Test_1', 'Test_1', 'Test_2', 'Test_2', 'Test_2', 'Test_3', 'Test_3', 'Test_3'], 'A': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
other_df = pd.DataFrame({'Test': ['Test_1', 'Test_1', 'Test_3', 'Test_3'], 'A': [1, 2, 1, 3]})

Desired result:

I want the indexes of example_df where 'Test' and 'A' column values both match the combined row value of correct_df.

I have tried:

result = example_df.loc[ (example_df['Test'].isin(other_df['Test'])) & (example_df['A'].isin(other_df['A'])) ]
result

    Test    A
0   Test_1  1
1   Test_1  2
2   Test_1  3
6   Test_3  1
7   Test_3  2
8   Test_3  3

But as these two conditions are separated, the resulting value only applies the conditions on single columns without chaining them i.e. A then also B, not A and B. How do I get the .loc of both column conditions?

CodePudding user response:

Use DataFrame.reset_index for avoid lost indices and then DataFrame.merge:

result = example_df.reset_index().merge(other_df, on=['Test','A'])
print (result)
   index    Test  A
0      0  Test_1  1
1      1  Test_1  2
2      6  Test_3  1
3      8  Test_3  3

result = (example_df.reset_index()
                    .merge(other_df, on=['Test','A'])
                    .set_index('index')
                    .rename_axis(None))
print (result)
     Test  A
0  Test_1  1
1  Test_1  2
6  Test_3  1
8  Test_3  3

Another idea with MultiIndex with Index.isin and filtering in boolean indexing:

result = example_df[example_df.set_index(['Test','A']).index
                              .isin(other_df.set_index(['Test','A']).index)]
print (result)
     Test  A
0  Test_1  1
1  Test_1  2
6  Test_3  1
8  Test_3  3

CodePudding user response:

@jezrael's solution is absolutely perfect for your question. This is just another way (a little convoluted) to get the same outcome using numpy.

We can filter example_df directly using a boolean array which we can construct by checking if rows in example_df exist other_df. To do that we need to make example_df a 3D array and use numpy broadcasting to compare with other_df. Then use all and any to shrink it back to the 1D array msk:

msk = (example_df.to_numpy()[:, None]==other_df.to_numpy()).all(axis=2).any(axis=1)
out = example_df[msk]

Output:

     Test  A
0  Test_1  1
1  Test_1  2
6  Test_3  1
8  Test_3  3
  •  Tags:  
  • Related