| Source | Target |
|---|---|
![]() |
![]() |
Two tables will be joined via composite keys ID1 and ID2 via pandas merge. We have a data testing tool that does the data analysis. It will perform the merge and filter whatever rows are not present on the left or right side of the merge into an external Missing In Source or Missing in Target table.
| Missing in Target | Missing in Source |
|---|---|
![]() |
![]() |
In this example in Source the second row with composite key ABC,345 is missing in Target. So that row will be filtered into Missing in Target. Similarly in Target the second row with composite key ABC,222 is missing in Source so it will be filtered into Missing in Source.
For the Missing ... tables the business wants to know "why exactly are they missing"--which composite key by row made the row missing? For example, for the row with ABC,345 in the Missing in Target above, ABC was present in both table rows but 345 wasn't. Therefore,ID2 with value 345 is the guilty key for this row.
CodePudding user response:
Try this using merge with how=outer and indicator:
import pandas as pd
df1 = pd.DataFrame({'ID1':'XYZ ABC EFG UK'.split(' '),
'ID2': [123,345, 789, 111],
'Data1':'car truck cat lion'.split(' '),
'Data2':'house sky mouse zion'.split(' ')})
df2 = pd.DataFrame({'ID1':'XYZ ABC ZAB UK'.split(' '),
'ID2':[123,222,789,111],
'Data1':'car truck cat lion'.split(' '),
'Data2':'house sky mouse zion'.split(' ')})
df_r = df1.merge(df2, how='outer', indicator='Ind')
df_source_missing = df_r.query('Ind == "left_only"')
df_target_missing = df_r.query('Ind == "right_only"')
Output:
print(df_source_missing)
ID1 ID2 Data1 Data2 Ind
1 ABC 345 truck sky left_only
2 EFG 789 cat mouse left_only
print(df_target_missing)
ID1 ID2 Data1 Data2 Ind
4 ABC 222 truck sky right_only
5 ZAB 789 cat mouse right_only
CodePudding user response:
I believe you want .compare:
import pandas as pd
data1 = {
"ID1": ["XYZ", "ABC", "EFG", "IJK"],
"ID2": [123, 345, 789, 111],
"DATA1": ["car", "truck", "cat", "lion"],
"DATA2": ["house", "sky", "mouse", "zion"]
}
data2 = {
"ID1": ["XYZ", "ABC", "ZAB", "IJK"],
"ID2": [123, 222, 789, 111],
"DATA1": ["car", "truck", "cat", "lion"],
"DATA2": ["house", "sky", "mouse", "zion"]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df = (
df1
.compare(df2, align_axis=0)
.rename(index={"self": "df1", "other": "df2"}, level=-1)
.fillna("")
.reset_index()
.drop("level_0", axis=1)
.rename(columns={"level_1": "source"})
)
print(df)
source ID1 ID2
0 df1 345.0
1 df2 222.0
2 df1 EFG
3 df2 ZAB




