Create new column based on substring

Time:01-21

I'm trying to create a new column based on whether the strings in the original column contain a certain substring. What I tried was this:

def get_group(row):
    stores = pd.Series(row['store'])
    if (stores.str.contains('Blue')): 'Blue'
    elif (stores.str.contains('Yellow')): 'Yellow'
    elif (stores.str.contains('Green')): 'Green'
    elif (stores.str.contains('Red')): 'Red'
    elif (stores.str.contains('Purple')): 'Purple'
    elif (stores.str.contains('Pink')): 'Pink'
    elif (stores.str.contains('Orange')): 'Orange'
    else: 'Outhers'

db['group'] = db.apply(lambda row: get_group(row), axis=1)

However, it is not working.

CodePudding user response:

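You are missing a return in your function. Also, assuming the store column holds strings, row['store'] is already a plain string when the function is applied with axis=1, so wrapping it in pd.Series(...) is unnecessary. To check whether a plain string contains a substring, use the in operator.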

Your function should look like this:

def get_group(row):
    stores = row['store']
    to_return='Others'
    if ('Blue' in stores): to_return='Blue'
    elif ('Yellow' in stores): to_return='Yellow'
    elif ('Green' in stores): to_return='Green'
    elif ('Red' in stores): to_return='Red'
    elif ('Purple' in stores): to_return='Purple'
    elif ('Pink' in stores): to_return='Pink'
    elif ('Orange' in stores): to_return='Orange'
    return to_return

Be aware that this function is case-sensitive: it will detect 'Blue' but not 'blue'.

To make it case-insensitive, lowercase the string before comparing, for instance: if ('blue' in stores.lower())
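As a rough sketch of the case-insensitive variant described above: the name get_group_ci, the COLOURS list and the sample db are made up for illustration (the real db is not shown in the question), and it assumes the store column holds strings.

```python
import pandas as pd

# Colours to look for, in the same priority order as the if/elif chain.
COLOURS = ['Blue', 'Yellow', 'Green', 'Red', 'Purple', 'Pink', 'Orange']

def get_group_ci(row):
    # Lowercase the store name once, then compare against lowercased colours.
    store = str(row['store']).lower()
    for colour in COLOURS:
        if colour.lower() in store:
            return colour
    return 'Others'

# Hypothetical sample data, only for illustration.
db = pd.DataFrame({'store': ['BLUE Market', 'red corner', 'Corner Shop']})
db['group'] = db.apply(get_group_ci, axis=1)
print(db)
```

Keep in mind that a plain substring test also matches inside longer words, so a store called 'Fred's' would land in the 'Red' group with this approach.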

CodePudding user response:

There are two things you need to fix:

  1. Your if-statements test a boolean Series, which has an ambiguous truth value. pandas will not reduce a Series to a single True or False on its own, even when it holds only one element, so the if raises a ValueError. One way to obtain a single truth value is .any(), which returns True if any of the values are True.
  2. You need to add a return statement for each string; as written, each branch evaluates the string and then discards it.
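To see the first point in isolation, here is a small sketch. The 'Blue Market' value is a made-up example, and the variable names are only for illustration.

```python
import pandas as pd

# A one-element Series, like the one built by pd.Series(row['store']).
stores = pd.Series('Blue Market')
mask = stores.str.contains('Blue')   # a boolean Series, not a plain bool

try:
    if mask:                         # pandas refuses to pick a truth value here
        pass
    raised = False
except ValueError:
    raised = True

# .any() collapses the Series to a single truth value.
has_blue = bool(mask.any())
print(raised, has_blue)
```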

With that being said, the following should work:

def get_group(row):
    stores = pd.Series(row['store'])
    if stores.str.contains('Blue').any(): return 'Blue'
    elif stores.str.contains('Yellow').any(): return 'Yellow'
    elif stores.str.contains('Green').any(): return 'Green'
    elif stores.str.contains('Red').any(): return 'Red'
    elif stores.str.contains('Purple').any(): return 'Purple'
    elif stores.str.contains('Pink').any(): return 'Pink'
    elif stores.str.contains('Orange').any(): return 'Orange'
    else: return 'Others'

db['group'] = db.apply(lambda row: get_group(row), axis=1)
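If the frame is large, a column-wise version may be worth trying instead of apply with axis=1. This is not what either answer above proposes, just a possible alternative sketched with numpy.select; the COLOURS list and the sample db are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Colours in priority order; np.select picks the first condition that is True.
COLOURS = ['Blue', 'Yellow', 'Green', 'Red', 'Purple', 'Pink', 'Orange']

# Hypothetical sample data, only for illustration.
db = pd.DataFrame({'store': ['Blue Market', 'Red Corner', 'Corner Shop']})

# One boolean mask per colour, computed on the whole column at once.
conditions = [
    db['store'].str.contains(colour, regex=False, na=False)
    for colour in COLOURS
]
db['group'] = np.select(conditions, COLOURS, default='Others')
print(db)
```

Because np.select takes the first matching condition, a store name containing two colours gets the one listed earlier in COLOURS, which mirrors the if/elif order.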