Home > Software design >  How to extract a human name in a string python
How to extract a human name in a string python

Time:01-08

I have a string out of an OCR'ed image, and I need to find a way to extract human names from it. here is the image required to OCR, which comes out as:

From: Al Amri, Salim <[email protected]>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <[email protected]>

Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: $4540

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500

Room: Nizwa. Available on 13/12/2022. Rate: $1200

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri

There are 4 names in total in the heading, and i am required to get the output:

names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' #desired output

but I have no idea how to extract the names. I have tried RegEx and came up with:

names = re.findall(r'(?i)([A-Z][a-z] [A-Z][a-z][, ] [A-Z][a-z] )', string) #regex to find names

which searches for a Capital letter, then a comma, then another word starting with a capital letter. it is close to the desired result but it comes out as:

names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] #acutal result

I have thought of maybe using another string that extracts the room names and excludes them from the list, but i have no idea how to implement that idea. i am new to RegEx, so any help will be appreciated. thanks in advance

CodePudding user response:

Depending on the contents of your email, a reasonable approach might be to use:

[:;]\s*(. ?)\s*<

See an online demo.

  • [:;] - A (semi-)colon;
  • \s* - 0 (Greedy) whitespaces;
  • (. ?) - A 1st capture group of 1 (Lazy) characters;
  • \s* - 0 (Greedy) whitespaces;
  • < - A literal '<'.

Note that I specifically use (. ?) to capture names since names are notoriously hard to match.


import re
s = """From: Al Amri, Salim <[email protected]>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <[email protected]>

Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: $4540

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500

Room: Nizwa. Available on 13/12/2022. Rate: $1200

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri"""
print(re.findall(r'[:;]\s*(. ?)\s*<', s))

Prints:

['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']

CodePudding user response:

Notwithstanding the excellent RE approach suggested by @JvdV, here's a step-by-step way in which you could achieve this:

OCR = """From: Al Amri, Salim <[email protected]>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <[email protected]>

Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: $4540

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500

Room: Nizwa. Available on 13/12/2022. Rate: $1200

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri"""

names = []
for line in OCR.split('\n'):
    tokens = line.split()
    if tokens and tokens[0] in ['From:', 'To:', 'Ce:']: # Ce or Cc ???
        parts = line.split(';')
        for i, p in enumerate(parts):
            names.append(' '.join(p.split()[i==0:-1]))
print(names)
  •  Tags:  
  • Related