I am trying to tag each row of an email message DataFrame as TRUE or FALSE. The DataFrame has the columns SenderEmail, Counterparties, and MessageBody.
df['Spam'] = df['SenderEmail'].apply(lambda x: True if "no" and "reply" in x.lower() else "")
df['Spam'] = df['MessageBody'].apply(lambda x: True if "please do not reply" in x.lower() else "")
The code works, but I realised that when I run one line after the other, the results from the second line overwrite the results from the first, leaving me with only the second line's results. I can't remove the else "" while using this approach, so I was thinking of running a for loop instead, but I'm not sure how to do so.
CodePudding user response:
You can use
df['Spam'] = (df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) |
              df['MessageBody'].str.contains('please do not reply', case=False))
Here,
df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) checks if the SenderEmail column value contains both substrings, no and reply.
df['MessageBody'].str.contains('please do not reply', case=False) checks if the MessageBody column value contains the substring please do not reply.
The case=False argument makes the check case-insensitive.
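If the lookahead regex is hard to read, the same condition can probably be expressed as separate boolean masks combined with & and |. This is only a sketch: the sample addresses are made up, and passing na=False assumes you want rows with a missing sender or body to count as not matching.

```python
import pandas as pd

# Hypothetical sample data; the None sender shows what na=False does.
df = pd.DataFrame({
    'SenderEmail': ['No-Reply@example.com', 'reply@example.com', None, 'alice@example.com'],
    'MessageBody': ['ok', 'Please do not reply', 'ok', 'ok'],
})

# One mask per substring; na=False treats missing values as "no match".
has_no = df['SenderEmail'].str.contains('no', case=False, na=False)
has_reply = df['SenderEmail'].str.contains('reply', case=False, na=False)
body_hit = df['MessageBody'].str.contains('please do not reply', case=False, na=False)

# Sender has both substrings, OR the body has the phrase.
df['Spam'] = (has_no & has_reply) | body_hit
print(df)
```

Because both conditions go into a single assignment, neither one overwrites the other, and the column holds real booleans rather than a mix of True and "".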
Pandas test:
import pandas as pd
df = pd.DataFrame(
    {'SenderEmail': ['no reply', 'reply', 'no', 'and more no some reply'],
     'MessageBody': ['ok', 'please do not reply', 'ok', 'ok']})
df['Spam'] = (df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) |
              df['MessageBody'].str.contains('please do not reply', case=False))
# => df
#               SenderEmail          MessageBody   Spam
# 0                no reply                   ok   True
# 1                   reply  please do not reply   True
# 2                      no                   ok  False
# 3  and more no some reply                   ok   True
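A side note on the lambda in the question: "no" and "reply" in x.lower() does not test for both substrings. Python parses it as "no" and ("reply" in x.lower()), and since the non-empty string "no" is always truthy, only "reply" is actually checked. A small sketch in plain Python (the address is made up for illustration):

```python
x = "reply@example.com"  # hypothetical address: contains "reply" but not "no"

# Parsed as: "no" and ("reply" in x.lower()), so only "reply" is tested.
buggy = "no" and "reply" in x.lower()

# Each substring needs its own `in` test.
fixed = "no" in x.lower() and "reply" in x.lower()

print(buggy, fixed)
```

So if you stay with apply, each substring needs its own in check.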
