I'm trying to match the data from reference and result based on the ID but been stuck for an hour now. Problem:
- Check if result is the same value as reference and record the boolean value in a forth column Match. If Reference is NaN but Result isn't, return False.
- If there are multiple ID's, and Result has NaN values but at least one row that matches with Result, return True for the matching row. For the rest of the rows return nothing.
Sample Dataframe:
ID Reference Result
1 5 5
2 3 NaN
2 3 3
3 NaN NaN
4 2 NaN
4 2 NaN
4 2 3
5 7 NaN
6 NaN 4
Expected Dataframe:
ID Reference Result Match
1 5 5 True
2 3 NaN
2 3 3 True
3 NaN NaN True
4 2 NaN
4 2 NaN
4 2 3 False
5 7 NaN False
6 NaN 4 False
df['Match']=np.where(
(df["Reference"] == df["Result"]) |
(
(df["Reference"].isnull() &
(df["Result"].isnull())
)
), True, False)
CodePudding user response:
Interesting challenge! Here's my take:
df['Match'] = df.Reference.eq(df.Result) | df.Reference.isna() & df.Result.isna()
df.loc[df.drop('ID', axis=1).isna().sum(axis=1).groupby(df['ID']).transform(lambda g: g if g.shape[0] > 1 else 0).astype(bool), 'Match'] = ''
Output:
>>> df
ID Reference Result Match
0 1 5.0 5.0 True
1 2 3.0 NaN
2 2 3.0 3.0 True
3 3 NaN NaN True
4 4 2.0 NaN
5 4 2.0 NaN
6 4 2.0 3.0 False
7 5 7.0 NaN False
8 6 NaN 4.0 False
>>> df['Match'].values
array([True, '', True, True, '', '', False, False, False], dtype=object)
