I have a string column and would like to create a function to extract parts of the string based on some conditions of the original string column
EMAIL NUM_ID
[email protected] a9927345A
[email protected] B2722144X
[email protected] A2822876H
[email protected] b6722111A
[email protected] X8923314X
Would like to create NEW_NUM_ID based on the first letter of NUM_ID. If NUM_ID has "A" or "a" for the first alphabet, then the NEW_NUM_ID would pick the last 5 characters If NUM_ID has "B" or "b" for the first alphabet, then the NEW_NUM_ID would be the original string If NUM_ID has "X or "x" for the first alphabet, then the NEW_NUM_ID would be the original string
EMAIL NUM_ID NEW_NUM_ID
[email protected] a9927345A 7345A
[email protected] B2722144X B2722144X
[email protected] A2822876H 2876H
[email protected] b6722111A B6722111A
[email protected] X8923314X X8923314X
I have created the following code but can't seem to get it.
#Function to create the NEW_NUM_ID
def create_new_id(number_id):
match = re.findall(r'^[a-zA-Z].*', number_id)
if match[0] == 'A':
return number_id[-5:]
elif match[0] == 'B':
return number_id
elif match[0] == 'X':
return number_id
else:
return 'NA'
df['NEW_NUM_ID'] = df['NUM_ID'].apply(create_new_id)
Appreciate any form of help I can get, thank you.
CodePudding user response:
Using np.where we can try:
df["NEW_NUM_ID"] = np.where(df["NUM_ID"].str.contains(r'^[bBxX]'),
df["NUM_ID"], # original
df["NUM_ID"].str.extract(r'(.{5})$')) # last 5
CodePudding user response:
You can use a str.replace like
df["NEW_NUM_ID"] = df["NUM_ID"].str.replace(r'(?i)^a.*(\S{5})\s*$', r'\1', regex=True)
See the regex demo.
If there can be leading whitespace, use a (?i)^\s*a.*(\S{5})\s*$ as regex. Details:
(?i)- case insensitive search and replace is enabled^- start of stringa- anaorA.*- any zero or more chars other than line break chars as many as possible(\S{5})- Group 1: five non-whitespace chars\s*- zero or more whitespaces$- end of string.
The replacement is the backreference to Group 1 value.
