Check columns for groups of strings, replace with 1 if they exist 0 if they do not - python, pandas,

Time:02-01

I'm trying to search for a set of strings in a column in a pandas dataframe and replace with 1 if the strings exist and 0 if they do not.

Per the example below, this works fine on the first pass:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':[1,2,3,4], 'Event':['1 Day', '2 Days','3 Days','4 Days']})
df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'),1,df['Event'])

df

but when I try to apply the opposite logic and replace the instances where the strings do not exist:

df = pd.DataFrame({'ID':[1,2,3,4], 'Event':['1 Day', '2 Days','3 Days','4 Days']})
df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'),1,df['Event'])
df['Event'] = np.where(~df['Event'].str.contains('3 Days|4 Days'),0,df['Event'])  

df

I get this error - TypeError: bad operand type for unary ~: 'float'

I tried using logical operators so the actions would occur simultaneously:

df = pd.DataFrame({'ID':[1,2,3,4], 'Event':['1 Day', '2 Days','3 Days','4 Days']})
df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'),1,df['Event']) & np.where(~df['Event'].str.contains('3 Days|4 Days'),0,df['Event'])  

df

but received this error... TypeError: unsupported operand type(s) for &: 'str' and 'int'

What I'm ultimately trying to achieve is a df in which every cell where the strings exist is replaced with 1 and every cell where they do not exist is replaced with 0, so I can analyze it. Like so:

ID  Event
1   0  
2   0
3   1
4   1

Answer:

After this line:

df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'),1, df['Event'])

df['Event'] holds the integer 1 in the rows that matched, and an integer is not a string, so the second time you check (inside np.where):

df['Event'].str.contains('3 Days|4 Days')

it returns:

0    False
1    False
2      NaN
3      NaN
Name: Event, dtype: object

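The .str accessor returns NaN for the non-string entries. NaN is a float, and the ~ operator is not defined for floats, so inverting this Series raises the TypeError.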
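The missing values can be reproduced directly. The sketch below reruns the first pass and inspects the resulting mask. The `na=False` argument of `str.contains` is shown only as one way to make the mask invertible, and the names `mask`, `safe_mask`, and `inverted` are illustrative rather than taken from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Event': ['1 Day', '2 Days', '3 Days', '4 Days']})

# First pass: matching rows become the integer 1, so the column is now mixed.
df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'), 1, df['Event'])

# Non-string entries yield missing values instead of True/False.
mask = df['Event'].str.contains('3 Days|4 Days')
print(mask.isna().tolist())

# na=False fills those missing entries with False, so ~ no longer fails.
safe_mask = df['Event'].str.contains('3 Days|4 Days', na=False)
inverted = ~safe_mask
print(inverted.tolist())
```

Note that `na=False` only removes the error. The rows already replaced by 1 then count as non-matching, so a second np.where driven by `inverted` would overwrite them with 0 as well.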

To get the desired outcome, simply use np.where once where you select 1 if True, 0 otherwise:

df['Event'] = np.where(df['Event'].str.contains('3 Days|4 Days'), 1, 0)

Output:

   ID  Event
0   1      0
1   2      0
2   3      1
3   4      1
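An equivalent variant, offered as a sketch rather than as part of the original answer, skips np.where and casts the boolean mask to integers. It also builds the pattern from a list of search terms, which may be convenient when there are many strings to check. The names `terms` and `pattern` are hypothetical, and `re.escape` is used on the assumption that the terms should be matched literally.

```python
import re

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Event': ['1 Day', '2 Days', '3 Days', '4 Days']})

# Hypothetical list of strings to look for; escape each one so that
# regex metacharacters in a term are matched literally.
terms = ['3 Days', '4 Days']
pattern = '|'.join(re.escape(term) for term in terms)

# True/False from str.contains becomes 1/0 after casting to int.
df['Event'] = df['Event'].str.contains(pattern).astype(int)
print(df)
```

Because the column is rewritten in a single step from the original strings, no intermediate mixed-type column is created and the NaN problem does not arise.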