I am trying to apply a regex to one of the columns in a pandas DataFrame. The column contains text data, and I want to extract a specific block from it. This is a sample of what my data looks like:
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.
I am trying to extract everything up to the empty line after "CT Head". The entities may or may not appear in the same order, except for "Patient Referred from", which will always be the first entity, and there won't be any empty line until that entire block is finished. This is what I have tried:
import re

def extract_patient_details(s):
    match = re.search('(?s)Patient Referred(.*?)(?:(?:\r*\n){2})', s)
    return match.group(0)

extract_patient_details(s)
The above snippet runs fine and gives me this output:
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
But what I want is:
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
CodePudding user response:
It's because your regex is:
(?s)Patient Referred(.*?)(?:(?:\r*\n){2})
It matches exactly what you told it to: the pattern starts at "Patient Referred", not at "Patient Name", so the match begins at the "Patient Referred from" line and everything before it is left out. Try this instead:
match = re.search(r'(?s)(Patient Name).*?(CT Head)', s)
output:
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
Entire code:
import re

def extract_patient_details(s):
    # (?s) lets "." match newlines; .*? stops at the first "CT Head"
    match = re.search(r'(?s)(Patient Name).*?(CT Head)', s)
    return match.group(0)

s = """
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.
"""
s = extract_patient_details(s)
print(s)
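One thing to watch when applying this across a whole column: re.search returns None when a report does not contain the block, and calling .group(0) on None raises AttributeError. A minimal sketch of a guarded version, assuming a pattern like the one above (the names PATTERN and extract_patient_details_safe are made up for this illustration):

```python
import re

# Hypothetical names for illustration; the marker strings come from the sample report.
PATTERN = re.compile(r"(?s)Patient Name.*?CT Head")

def extract_patient_details_safe(s):
    """Return the matched block, or None when the text has no such block."""
    match = PATTERN.search(s)
    return match.group(0) if match else None

report = (
    "Patient Name :\n"
    "NHI: ABC2134\n"
    "Examination(s) included in this report:\n"
    "CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago"
)

found = extract_patient_details_safe(report)
missing = extract_patient_details_safe("no header here")
print(found)
print(missing)
```

The second call prints None instead of raising, so rows without the block can be filtered out afterwards.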
CodePudding user response:
Can you try re.match(r'(?sm).*?CT Head', s).group(0)?
(?sm) turns ON re.DOTALL and re.MULTILINE
We're using re.match() as we're matching from the beginning of the string
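To illustrate the flag behaviour, here is a small sketch on a made-up string (not the real report). It shows that the inline (?s) flag and the re.DOTALL argument give the same match, and that without DOTALL the dot stops at the first newline, so re.match finds nothing:

```python
import re

# Made-up sample text, only for demonstrating the flags.
text = "line one\nline two\nCT Head\nrest"

# Inline flag and explicit flag argument behave the same way.
inline = re.match(r"(?s).*?CT Head", text)
explicit = re.match(r".*?CT Head", text, re.DOTALL)
print(inline.group(0) == explicit.group(0))

# Without DOTALL, "." cannot cross the newline after "line one",
# and re.match only tries at the start of the string.
no_flag = re.match(r".*?CT Head", text)
print(no_flag)
```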
CodePudding user response:
re.DOTALL plays an important role in your case.
def extract_patient_details(s):
    match = re.search(r'^(.*Patient Referred.*?)(?:\r?\n){2}', s, re.DOTALL)
    return match.group(1)
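The choice of group(1) over group(0) matters here. A short sketch on a trimmed-down version of the sample (the string below is shortened for illustration) shows that group(0) is the whole match including the terminating blank line, while group(1) is only the captured block:

```python
import re

# Shortened sample; the blank line after "CT Head" ends the block.
s = "Patient Name :\nPatient Referred from: WTH ABC\nCT Head\n\nINDICATION:\nFall"

m = re.search(r"^(.*Patient Referred.*?)(?:\r?\n){2}", s, re.DOTALL)
whole = m.group(0)   # captured block plus the two newlines that ended it
block = m.group(1)   # captured block only

print(repr(whole))
print(repr(block))
```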
In pandas, you can use the str.extract method as well:
import pandas as pd
import re
# Create a sample dataframe
df = pd.DataFrame([
{'diagnosis': '''Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.'''}
])
pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)
df_extracted = df.diagnosis.str.extract(pat, expand=True)
