Home > Enterprise >  Match values of two columns with same IDs
Match values of two columns with same IDs

Time:01-30

I'm trying to match the data from reference and result based on the ID but been stuck for an hour now. Problem:

  • Check if result is the same value as reference and record the boolean value in a forth column Match. If Reference is NaN but Result isn't, return False.
  • If there are multiple ID's, and Result has NaN values but at least one row that matches with Result, return True for the matching row. For the rest of the rows return nothing.

Sample Dataframe:

ID     Reference     Result            
1      5             5                    
2      3             NaN                      
2      3             3                             
3      NaN           NaN                         
4      2             NaN                             
4      2             NaN                            
4      2             3   
5      7             NaN
6      NaN           4          

Expected Dataframe:

ID     Reference     Result        Match    
1      5             5             True        
2      3             NaN                      
2      3             3             True                 
3      NaN           NaN           True                 
4      2             NaN                             
4      2             NaN                            
4      2             3             False 
5      7             NaN           False
6      NaN           4             False  
df['Match']=np.where(
    (df["Reference"] == df["Result"]) | 
    (
        (df["Reference"].isnull() &
     (df["Result"].isnull())
    )
), True, False)

CodePudding user response:

Interesting challenge! Here's my take:

df['Match'] = df.Reference.eq(df.Result) | df.Reference.isna() & df.Result.isna()
df.loc[df.drop('ID', axis=1).isna().sum(axis=1).groupby(df['ID']).transform(lambda g: g if g.shape[0] > 1 else 0).astype(bool), 'Match'] = ''

Output:

>>> df
   ID  Reference  Result  Match
0   1        5.0     5.0   True
1   2        3.0     NaN       
2   2        3.0     3.0   True
3   3        NaN     NaN   True
4   4        2.0     NaN       
5   4        2.0     NaN       
6   4        2.0     3.0  False
7   5        7.0     NaN  False
8   6        NaN     4.0  False

>>> df['Match'].values
array([True, '', True, True, '', '', False, False, False], dtype=object)
  •  Tags:  
  • Related