I want to find:
- data not found in provider, but found in accounting,
- data found in accounting, but not in provider
- data found in both, accounting and provider, but the amounts do not match.
To accomplish this I have to use three provided tables Accounting, Helper, Provider First I join Accounting with Helper on invoice. Then I join it with Provider on toID. After merge inner join I have a table which in this case creates two unnecessary records.
Accounting table looks like:
| doc_nr | amount | invoice |
|---|---|---|
| a1 | -9.17 | i1 |
| a2 | 9.17 | i1 |
Helper table:
| toID | invoice |
|---|---|
| toid1 | i1 |
Provider table:
| toID | tID | DATE | tAmount |
|---|---|---|---|
| toid1 | t1 | 2021-01-18 | 9.17 |
| toid1 | t2 | 2021-01-19 | -9.17 |
After joining three tables using merge inner join I got a table:
| doc_nr | amount | invoice | toID | tID | tAmount | DATE |
|---|---|---|---|---|---|---|
| a1 | -9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
| a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
| a2 | 9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
| a2 | 9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
Expected result should be only 2 records instead of 4:
| doc_nr | amount | invoice | toID | tID | tAmount | DATE |
|---|---|---|---|---|---|---|
| a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
| a2 | 9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
Can't think of a way how to filter the table to get expected results.
Tried dropping duplicates based on doc_nr, but results gives:
| doc_nr | amount | invoice | toID | tID | tAmount | DATE |
|---|---|---|---|---|---|---|
| a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
| a2 | 9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-18 |
It should have different tID; doc_nr and amount match tAmount. After join it seems it duplicates doc_nr creating it with other`s doc_nr and tID.
Any ideas on how to clean these duplicating rows and reach the expected result?
CodePudding user response:
Try this:
>>> df.assign(x=df['amount'].eq(df['tAmount']).cumsum()).replace(0, np.nan).dropna().drop_duplicates(subset=['doc_nr', 'x']).drop('x', axis=1)
doc_nr amount invoice toID tID tAmount DATE
1 a1 -9.17 i1 toid1 t2 -9.17 2021-01-19
2 a2 9.17 i1 toid1 t1 9.17 2021-01-18
>>>
CodePudding user response:
As you mentioned, you want amount to match tAmount:
It should have different tID; doc_nr and amount match tAmount.
then, you should include also these 2 fields in the second merge statement, as follows:
(Accounting.merge(Helper)
.merge(Provider, left_on=['toID', 'amount'], right_on=['toID', 'tAmount'])
)
use the left_on= and right_on= parameters to include the related 2 merge fields.
Result:
doc_nr amount invoice toID tID DATE tAmount
0 a1 -9.17 i1 toid1 t2 2021-01-19 -9.17
1 a2 9.17 i1 toid1 t1 2021-01-18 9.17
