I have a pandas data frame, which looks like the following:
email col2 col3
[email protected] John Doe
[email protected] John Doe
[email protected] John Doe
[email protected] John Doe
[email protected] Jane Doe
I want to go through each email address starting with at least two 'x's and check whether the same email address exists without those 'x's.
Required result:
email col2 col3 exists_in_valid_form
[email protected] John Doe False
[email protected] John Doe True
[email protected] John Doe True
[email protected] John Doe True
[email protected] Jane Doe False
I was able to get a sub-data frame containing all rows whose emails contain 'xx' using df[df['email'].str.contains("xx")], and I was also able to get the email addresses without the leading 'x's using str.lstrip('x'), but neither seems to help me determine whether the email appears somewhere else without those 'x's.
CodePudding user response:
You can use duplicated() to check whether a value also exists in another row. With keep=False, every occurrence of a repeated value is marked True, not just the later ones.
df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')
I added df.email.str.startswith('xx') so that only addresses starting with at least two 'x's can be flagged, which makes the result False for "[email protected]".
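One caveat with the duplicated() approach: it also returns True when two prefixed addresses share the same stripped form but the clean address never appears (for example, only 'xx...' and 'xxx...' variants exist). If that case matters, a possible alternative is to compare the stripped addresses against the rows that do not start with 'xx', using isin(). The following is only a sketch; the sample addresses are invented for illustration, and it assumes that "valid form" means "a row whose email does not start with 'xx'".

```python
import pandas as pd

# Hypothetical sample data (addresses invented for illustration).
df = pd.DataFrame({
    "email": [
        "john@example.com",
        "xxjohn@example.com",
        "xxxjohn@example.com",
        "xxxxjohn@example.com",
        "jane@example.com",
        "xxbob@example.com",   # no clean "bob@example.com" row exists
        "xxxbob@example.com",
    ],
    "col2": ["John", "John", "John", "John", "Jane", "Bob", "Bob"],
    "col3": ["Doe"] * 7,
})

# Rows whose email starts with at least two 'x's.
starts_xx = df["email"].str.startswith("xx")

# Emails with all leading 'x' characters removed.
stripped = df["email"].str.lstrip("x")

# Assumption: the "valid" addresses are those that do not start with 'xx'.
valid_emails = df.loc[~starts_xx, "email"]

# True only for prefixed rows whose stripped form is an actual valid row.
df["exists_in_valid_form"] = starts_xx & stripped.isin(valid_emails)

print(df)
```

On this sample, the two 'bob' rows come out False, whereas the duplicated(keep=False) version would mark them True because they duplicate each other after stripping.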
