I am trying to remove rows from a dataframe where the first sequence of letters in the Ref column are equal to the Product column.
For example, for the input:
--------- ---------------
| Product | Provision Ref |
--------- ---------------
| DVX | DVX9251 |
--------- ---------------
| CDV | 22CDV95 |
--------- ---------------
| TV | TV12369 |
--------- ---------------
| TV | 992TV15 |
--------- ---------------
Desired output:
--------- ---------------
| Product | Provision Ref |
--------- ---------------
| CDV | 22CDV95 |
--------- ---------------
| TV | 992TV15 |
--------- ---------------
I have tried both of the following pieces of code but they are not working
df = df.loc[df['Provision Ref'].str[0:df['Product'].map(len)] != df['Product']]
df = df.loc[df['Provision Ref'].str[0:int(df['Product'].map(len))] != df['Product']]
CodePudding user response:
Try this:
filtered = df[df.groupby('Product', sort=False).apply(lambda g: g['Provision Ref'].str.startswith(g['Product'].iloc[0])).tolist()]
Output:
>>> filtered
Product Provision Ref
0 DVX DVX9251
2 TV TV12369
More readable but less efficient:
filtered = df[df.apply(lambda x: x['Provision Ref'].startswith(x['Product']), axis=1)]
Another method, probably more efficient if the items of Product have few unique lengths (e.g. most are either 2, 3 or 4 chars long, etc.):
filtered = df[df.groupby(df['Product'].str.len(), sort=False).apply(lambda x: x['Provision Ref'].str[:len(x['Product'].iloc[0])] == x['Product']).tolist()]
CodePudding user response:
We can use a row-wise .apply(), because
df['Provision Ref'].str.startswith(df['Product']) isn't vectorized like that (as @anarchy wrote).
df[~df.apply(lambda row: row['Provision Ref'].startswith(row['Product']), axis=1)]
Product Provision Ref
1 CDV 22CDV95
3 TV 992TV15
