I am trying to create a binary (yes/no) variable based on what is in a particular text string (in Python).
The data looks something like:
| Person ID | Test Result |
|---|---|
| 87 | No exercise induced ischaemia |
| 88 | Treadmill test induced increased BP |
| 89 | NORMAL test on treadmill |
and so on.
I need to pick out all the people who have "No exercise induced ischaemia". Can anybody shed some light on how to do this, given I have about 20 columns in the real data set and about 14000 rows that need to be searched.
Here's an example dataframe for convenience
d = {'ID': [87, 88, 89, 90, 91, 92], 'TestResult': ["No exercise induced ischaemia", "NORMAL test on treadmill", "No exercise induced ischaemia", "treadmill induced ischaemia", "NORMAL test on treadmill", "No exercise induced ischaemia"]}
df = pd.DataFrame(data=d)
I've tried things like
df['NegTest'] = df[df.TestResult.str.contains('No exercise induced ischaemia',case=True)]
with no luck.
Thanks for any help!
CodePudding user response:
You're very close. Just use np.where to actually generate the yes/no:
df['NegTest'] = np.where(df.TestResult.str.contains('No exercise induced ischaemia', case=True), 'yes', 'no')
Output:
>>> df
ID TestResult NegTest
0 87 No exercise induced ischaemia yes
1 88 NORMAL test on treadmill no
2 89 No exercise induced ischaemia yes
3 90 treadmill induced ischaemia no
4 91 NORMAL test on treadmill no
5 92 No exercise induced ischaemia yes
If you want it to just be True/False, you can even skip np.where:
df['NegTest'] = df.TestResult.str.contains('No exercise induced ischaemia', case=True)
Output:
>>> df
ID TestResult NegTest
0 87 No exercise induced ischaemia True
1 88 NORMAL test on treadmill False
2 89 No exercise induced ischaemia True
3 90 treadmill induced ischaemia False
4 91 NORMAL test on treadmill False
5 92 No exercise induced ischaemia True
