Function to create new column with substring based on regex condition of another string column

I have a string column and would like to create a function to extract parts of the string based on some conditions of the original string column

EMAIL               NUM_ID    
[email protected]       a9927345A
[email protected]     B2722144X 
[email protected]   A2822876H 
[email protected]    b6722111A
[email protected]      X8923314X

Would like to create NEW_NUM_ID based on the first letter of NUM_ID. If NUM_ID has "A" or "a" for the first alphabet, then the NEW_NUM_ID would pick the last 5 characters If NUM_ID has "B" or "b" for the first alphabet, then the NEW_NUM_ID would be the original string If NUM_ID has "X or "x" for the first alphabet, then the NEW_NUM_ID would be the original string

EMAIL               NUM_ID        NEW_NUM_ID
[email protected]       a9927345A     7345A 
[email protected]     B2722144X     B2722144X
[email protected]   A2822876H     2876H
[email protected]    b6722111A     B6722111A
[email protected]      X8923314X     X8923314X

I have created the following code but can't seem to get it.

#Function to create the NEW_NUM_ID

def create_new_id(number_id):
    match = re.findall(r'^[a-zA-Z].*', number_id)
    if match[0] == 'A':
        return number_id[-5:]
    elif match[0] == 'B':
        return number_id
    elif match[0] == 'X':
        return number_id
    else:
        return 'NA'

df['NEW_NUM_ID'] = df['NUM_ID'].apply(create_new_id)

Appreciate any form of help I can get, thank you.

CodePudding user response：

Using np.where we can try:

df["NEW_NUM_ID"] = np.where(df["NUM_ID"].str.contains(r'^[bBxX]'),
                            df["NUM_ID"],                           # original
                            df["NUM_ID"].str.extract(r'(.{5})$'))   # last 5

CodePudding user response：

You can use a str.replace like

df["NEW_NUM_ID"] = df["NUM_ID"].str.replace(r'(?i)^a.*(\S{5})\s*$', r'\1', regex=True)

See the regex demo.

If there can be leading whitespace, use a (?i)^\s*a.*(\S{5})\s*$ as regex. Details:

(?i) - case insensitive search and replace is enabled
^ - start of string
a - an a or A
.* - any zero or more chars other than line break chars as many as possible
(\S{5}) - Group 1: five non-whitespace chars
\s* - zero or more whitespaces
$ - end of string.

The replacement is the backreference to Group 1 value.