How to drop rows based on condition?-CodePudding

I want to find:

data not found in provider, but found in accounting,
data found in accounting, but not in provider
data found in both, accounting and provider, but the amounts do not match.

To accomplish this I have to use three provided tables Accounting, Helper, Provider First I join Accounting with Helper on invoice. Then I join it with Provider on toID. After merge inner join I have a table which in this case creates two unnecessary records.

Accounting table looks like:

doc_nr	amount	invoice
a1	-9.17	i1
a2	9.17	i1

Helper table:

toID	invoice
toid1	i1

Provider table:

toID	tID	DATE	tAmount
toid1	t1	2021-01-18	9.17
toid1	t2	2021-01-19	-9.17

After joining three tables using merge inner join I got a table:

doc_nr	amount	invoice	toID	tID	tAmount	DATE
a1	-9.17	i1	toid1	t1	9.17	2021-01-18
a1	-9.17	i1	toid1	t2	-9.17	2021-01-19
a2	9.17	i1	toid1	t1	9.17	2021-01-18
a2	9.17	i1	toid1	t2	-9.17	2021-01-19

Expected result should be only 2 records instead of 4:

doc_nr	amount	invoice	toID	tID	tAmount	DATE
a1	-9.17	i1	toid1	t2	-9.17	2021-01-19
a2	9.17	i1	toid1	t1	9.17	2021-01-18

Can't think of a way how to filter the table to get expected results.

Tried dropping duplicates based on doc_nr, but results gives:

doc_nr	amount	invoice	toID	tID	tAmount	DATE
a1	-9.17	i1	toid1	t2	-9.17	2021-01-19
a2	9.17	i1	toid1	t2	-9.17	2021-01-18

It should have different tID; doc_nr and amount match tAmount. After join it seems it duplicates doc_nr creating it with other`s doc_nr and tID.

Any ideas on how to clean these duplicating rows and reach the expected result?

CodePudding user response：

Try this:

>>> df.assign(x=df['amount'].eq(df['tAmount']).cumsum()).replace(0, np.nan).dropna().drop_duplicates(subset=['doc_nr', 'x']).drop('x', axis=1)
  doc_nr amount invoice   toID tID tAmount        DATE
1     a1  -9.17      i1  toid1  t2   -9.17  2021-01-19
2     a2   9.17      i1  toid1  t1    9.17  2021-01-18
>>>

CodePudding user response：

As you mentioned, you want amount to match tAmount:

It should have different tID; doc_nr and amount match tAmount.

then, you should include also these 2 fields in the second merge statement, as follows:

(Accounting.merge(Helper)
           .merge(Provider, left_on=['toID', 'amount'], right_on=['toID', 'tAmount'])
)

use the left_on= and right_on= parameters to include the related 2 merge fields.

Result:

  doc_nr  amount invoice   toID tID        DATE  tAmount
0     a1   -9.17      i1  toid1  t2  2021-01-19    -9.17
1     a2    9.17      i1  toid1  t1  2021-01-18     9.17