Background:
I have a pandas df which I am trying to remove rows from if the following conditions are all met.
% Ownership(float64) == 100Ownership Audit Note(object) contains either 'Ignore' or 'ignore'Entity ID %(float64) ==Account # %(object)
Issue - it looks like my code is dropping rows if any conditions is met, rather than all of them. Is there something wrong with my code?
My current code: this is the line of code I wrote -
df = df[(~(df['Ownership Audit Note'].str.contains('ignore|Ignore')))
& (~(df['% Ownership'] == 100))
& (~(df['Entity ID %'] == 'Account # %'))]
Extract from df
df = pd.DataFrame(
{
"% Ownership": {0: 100.0, 1: 50.0, 2: 100.0},
"Ownership Audit Note": {
0: "[ignore] 100% Ownership ",
1: "[Ignore] 50% Ownership ",
2: "[ignored] Recurring 67% as of 2022-01-07 ",
},
"Entity ID %": {
0: 100.0,
1: 50.0,
2: 67.0,
},
"Account # %": {
0: 100.0,
1: 50.0,
2: 65.0,
},
}
)
How my code is currently behaving: using my current code at the db extract as an example.
- Index position 0 - the first row should drop (and is) as all conditions are met.
- Index position 1 - this row should not drop as
% Ownershipis not == 100. This row is still dropping. - Index position 2 - this row should not drop as
Entity ID %does not ==Account # %. This row is still dropping.
CodePudding user response:
You should invert the final result of the and, not each invert each test (or then you'd need to use or). Also there was a mistake, you were comparing df['Entity ID %'] to a fixed string:
df[~(df['Ownership Audit Note'].str.contains('ignore|Ignore')
& df['% Ownership'].eq(100)
& df['Entity ID %'].eq(df['Account # %'])
)]
output:
% Ownership Ownership Audit Note Entity ID % Account # %
1 50.0 [Ignore] 50% Ownership 50.0 50.0
2 100.0 [ignored] Recurring 67% as of 2022-01-07 67.0 65.0
CodePudding user response:
df = df[(df['Ownership Audit Note'].str.contains('ignore', case=False))
&(df['% Ownership'] == 100.0)
&(df['Entity ID %'] == df['Account # %'])]
CodePudding user response:
As you said correctly, the problem is that you are keeping only rows that match every condition. This is because you use the NOT operator (~) on each condition and not on the sum of them all. This will first find the rows in which all conditions are met, and then take the opposite rows:
df = df[(~(df['Ownership Audit Note'].str.contains('ignore|Ignore'))
& (df['% Ownership'] == 100)
& (df['Entity ID %'] == 'Account # %'))]
