Python regex to select everything before and after a particular string

I am trying to apply a regex to one of the columns in a pandas DataFrame. The column holds text data, and I want to extract a specific block from it. Here is a sample of what the data looks like:

Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with 
tachybrady syndrome.

I am trying to extract everything up to the empty line after "CT Head". The entities may or may not appear in the same order, except for "Patient Referred from", which will always be the first entity of its block, and there won't be any empty line until that entire block is finished. I have tried the following:

def extract_patient_details(s):

    match = re.search('(?s)Patient Referred(.*?)(?:(?:\r*\n){2})', s)
    return match.group(0)

extract_patient_details(s)

The above snippet works fine and gives me this output

Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

But what I want is:

Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

CodePudding user response：

It's because your regex is:

(?s)Patient Referred(.*?)(?:(?:\r*\n){2})

It matches exactly what you told it to: the pattern starts at "Patient Referred", not "Patient Name", so it only covers the text from "Patient Referred" to the blank line after "CT Head". Try this instead:

match = re.search('(?s)(Patient Name).*?( CT Head)', s)

output:

Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

Entire code:

import re


def extract_patient_details(s):

    match = re.search('(?s)(Patient Name).*?( CT Head)', s)
    return match.group(0)


s = """
Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with 
tachybrady syndrome. 
"""

s = extract_patient_details(s)
print(s)

CodePudding user response：

Can you try re.match(r'(?sm).*? CT Head', s).group(0)?

(?sm) turns on re.DOTALL (so that . also matches newlines) and re.MULTILINE.

We're using re.match() as we're matching from the beginning of the string
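
A minimal runnable sketch of this suggestion, assuming the text is held in a variable named report (a name chosen here for illustration) and that the block always ends at the literal " CT Head" line. The sample text is shortened from the question, and the guard against a missing match is an addition, not part of the original suggestion:

```python
import re

# Hypothetical sample text, shortened from the question.
report = (
    "Patient Name :\n"
    "NHI:  ABC2134\n"
    "\n"
    "Patient Referred from: WTH ABC\n"
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)

# re.match anchors at the start of the string; (?s) lets "." cross newlines,
# and the lazy ".*?" stops at the first occurrence of " CT Head".
m = re.match(r"(?s).*? CT Head", report)

# re.match returns None when nothing matches, so guard before calling group().
header = m.group(0) if m else None
print(header)
```

Because the end marker is hardcoded, this only works for reports whose examination is "CT Head".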

CodePudding user response：

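re.DOTALL plays an important role in your case: it lets . match newline characters, so the pattern can span several lines. The greedy .* takes everything up to "Patient Referred", and the lazy .*? then stops at the first blank line after it.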

def extract_patient_details(s):
    match = re.search(r'^(.*Patient Referred.*?)(?:\r?\n){2}', s, re.DOTALL)
    return match.group(1)
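
If some texts might not contain the block at all, re.search returns None and match.group(1) raises AttributeError. A hedged variant that returns None instead could look like this (the name extract_patient_details_safe and the shortened sample are assumptions made for illustration):

```python
import re


def extract_patient_details_safe(s):
    """Return the text up to the first blank line after 'Patient Referred', or None."""
    # \r?\n accepts both Unix (\n) and Windows (\r\n) line endings.
    match = re.search(r'^(.*Patient Referred.*?)(?:\r?\n){2}', s, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical sample text, shortened from the question.
sample = (
    "Patient Name :\n"
    "NHI:  ABC2134\n"
    "\n"
    "Patient Referred from: WTH ABC\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago."
)

details = extract_patient_details_safe(sample)
missing = extract_patient_details_safe("free text without the header block")
print(details)
print(missing)
```

Note that the pattern still requires a blank line after the block, so a text that ends right after the block would also yield None.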

In pandas, you can use the str.extract method as well.

import pandas as pd
import re

# Create a sample dataframe
df = pd.DataFrame([
    {'diagnosis': '''Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with 
tachybrady syndrome.'''}
])

pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)
df_extracted = df.diagnosis.str.extract(pat, expand=True)
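
As an alternative sketch, str.extract also takes a plain string pattern together with a flags argument, and expand=False returns a Series that can be assigned straight to a new column. The column name patient_details and the shortened sample rows below are assumptions made for illustration:

```python
import re

import pandas as pd

# Hypothetical two-row frame: one report with the header block, one without.
df = pd.DataFrame({
    "diagnosis": [
        "Patient Name :\nNHI:  ABC2134\n\n"
        "Patient Referred from: WTH ABC\n CT Head\n\n"
        "INDICATION:\nFall some time ago.",
        "free text without the header block",
    ]
})

pattern = r"^(.*Patient Referred.*?)(?:\r?\n){2}"

# flags=re.DOTALL lets "." match newlines; with a single capture group and
# expand=False the result is a Series rather than a one-column DataFrame.
df["patient_details"] = df["diagnosis"].str.extract(
    pattern, flags=re.DOTALL, expand=False
)
print(df["patient_details"])
```

Rows where the pattern does not match get a missing value in the new column rather than raising an error.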