Use regex in Python to find text after A (which appears after B) but before C

Fellows,

I have a bunch of PDFs (IPO prospectuses) from which I need to extract specific information (about the investment banks sponsoring each IPO or the "sponsors"). I have already written code that extracts the text from each PDF using PyPDF2. Now to get the "sponsors" information, I need to write a regular expression with the following logic:

Extract text which appears:

after the "Joint sponsors" OR "Sole sponsor" that appears after "Parties involved in..."

(The reason for this double after condition is that you will find the words "joint sponsors" or "sole sponsors" used all over the 500 page document, but only in the section "Parties involved in..." is "joint sponsors" or "sole sponsor" followed by the actual information on the banks. The OR condition is there since any IPO can have either joint or sole sponsors.)

before "Legal Advisers"

Please see the sample text below, extracted from a 500-page prospectus. You may test your code on it to check that it extracts only the bank information. I would much appreciate your help!

THIS DOCUMENT IS IN DRAFT FORM, INCOMPLETE AND SUBJECT TO CHANGE AND THAT THE INFORMATION MUST BE READ IN CONJUNCTION WITH THE SECTION HEADED “WARNING” ON THE COVER OF THIS DOCUMENT. DIRECTORS AND PARTIES INVOLVED IN THE [REDACTED] PARTIES INVOLVED IN THE [REDACTED] Joint Sponsors China International Capital Corporation Hong Kong Securities Limited 29/F, One International Finance Centre 1 Harbour View Street Central Hong Kong China Securities (International) Corporate Finance Company Limited 18/F, Two Exchange Square 8 Connaught Place Central Hong Kong [REDACTED] – 166 – Legal Advisers to the Company As to Hong Kong and United States laws: O’Melveny & Myers 31/F, AIA Central 1 Connaught Road Central Hong Kong

CodePudding user response：

The pattern below should be what you're looking for. However, if these phrases occur multiple times in the document, you might want to narrow the search to reduce false positives.

One option is to search in reverse, since it sounds like this text is expected to be towards the end of the document.
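One way to approximate a reverse search is to locate the last occurrence of the section heading and only look for the sponsors after it. This is a rough sketch, not a tested solution for real prospectuses: the function name `extract_from_last_section` and the example text are made up for illustration, and it assumes the last "Parties involved in" heading really is the one that precedes the sponsor details.

```python
import re


def extract_from_last_section(text, heading="PARTIES INVOLVED IN"):
    """Search for the sponsor block only after the last occurrence of `heading`.

    Hypothetical helper: assumes the final heading occurrence marks the
    section that lists the banks.
    """
    # Find every occurrence of the heading, ignoring case.
    starts = [m.start() for m in re.finditer(re.escape(heading), text, re.IGNORECASE)]
    if not starts:
        return None

    # Keep only the text from the last heading onwards.
    tail = text[starts[-1]:]

    # Capture lazily between the sponsor label and "Legal Advisers".
    match = re.search(
        r"(?:Joint sponsors|Sole sponsor)(.*?)Legal Advisers",
        tail,
        re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else None


# Made-up text with an early, irrelevant mention of the sponsors.
example = (
    "Contents: Parties involved in the offering. "
    "The Joint Sponsors have agreed terms. Legal Advisers were consulted. "
    "DIRECTORS AND PARTIES INVOLVED IN THE OFFERING Joint Sponsors "
    "Example Bank Limited 1 Example Street Legal Advisers to the Company"
)
print(extract_from_last_section(example))
```

If the heading is repeated as a running header on later pages, the last occurrence may fall after the sponsor list, so this would need adjusting for such documents.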

Be sure to enable the IGNORECASE flag (re.IGNORECASE) in the Python regex, or fix the casing in the pattern to match what you expect all the documents to have.

(?:PARTIES INVOLVED IN).*?(?:Joint sponsors|Sole sponsor)(.*?)(?:Legal Advisers to the Company)

(?:PARTIES INVOLVED IN)           - Search for "Parties Involved In"
.*?                               - Match any number of characters, as few as possible, until the next part is found
(?:Joint sponsors|Sole sponsor)   - Match on 'Joint Sponsors' OR 'Sole Sponsor'
(.*?)                             - Capturing group for the text I think you want
(?:Legal Advisers to the Company) - Stop the capturing group when you see `Legal Advisers to the Company`

The matching group will capture the following from the sample text:

China International Capital Corporation Hong Kong Securities Limited 29/F, One International Finance Centre 1 Harbour View Street Central Hong Kong China Securities (International) Corporate Finance Company Limited 18/F, Two Exchange Square 8 Connaught Place Central Hong Kong [REDACTED] – 166 –
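For reference, here is a minimal sketch of how a pattern like this might be applied in Python. It assumes lazy `.*?` quantifiers between the parts, shortens the end marker to "Legal Advisers" as in the question, and adds `re.DOTALL` in case the extracted text contains line breaks. The names `sample_text`, `SPONSOR_PATTERN` and `extract_sponsors` are made up for illustration, and the sample is a shortened copy of the text in the question.

```python
import re

# Shortened copy of the sample text from the question.
sample_text = (
    "DIRECTORS AND PARTIES INVOLVED IN THE [REDACTED] "
    "PARTIES INVOLVED IN THE [REDACTED] Joint Sponsors "
    "China International Capital Corporation Hong Kong Securities Limited "
    "29/F, One International Finance Centre 1 Harbour View Street Central Hong Kong "
    "[REDACTED] – 166 – Legal Advisers to the Company "
    "As to Hong Kong and United States laws"
)

# IGNORECASE handles "Joint Sponsors" vs "Joint sponsors";
# DOTALL lets "." match newlines (an assumption about the extracted text).
SPONSOR_PATTERN = re.compile(
    r"PARTIES INVOLVED IN.*?(?:Joint sponsors|Sole sponsor)(.*?)Legal Advisers",
    re.IGNORECASE | re.DOTALL,
)


def extract_sponsors(text):
    """Return the captured sponsor block, or None if nothing matches."""
    match = SPONSOR_PATTERN.search(text)
    return match.group(1).strip() if match else None


print(extract_sponsors(sample_text))
```

The `.strip()` call only removes the spaces left on either side of the captured group.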

CodePudding user response：

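This regex pattern seems to work. Python's built-in re module only accepts fixed-width lookbehinds, so the prefix is matched normally here and the wanted text is taken from a capturing group:

import re

pattern = r"PARTIES INVOLVED IN.*?(?:Joint Sponsors|Sole Sponsor)(.*?)(?=Legal Advisers)"

list_of_matches = re.findall(pattern, string)

Because the pattern contains one capturing group, re.findall returns a list of the captured strings.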
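A note on lookbehinds: Python's built-in `re` module rejects a lookbehind whose width is not fixed, so a prefix containing a wildcard cannot go inside `(?<=...)`. The sketch below shows that check, then a capturing-group alternative used with `re.findall`. The example text and variable names are made up for illustration, and the `.*` inside the lookbehind is an assumption about what was intended.

```python
import re

# Made-up text in the same shape as the prospectus sample.
text = (
    "PARTIES INVOLVED IN THE [REDACTED] Joint Sponsors "
    "Example Bank Limited 1 Example Street Central Hong Kong "
    "Legal Advisers to the Company"
)

# A lookbehind containing ".*" has no fixed width, so re refuses to compile it.
try:
    re.compile(r"(?<=PARTIES INVOLVED IN.*Joint Sponsors).*(?=Legal Advisers)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative: match the prefix normally and capture only the wanted text.
pattern = re.compile(
    r"PARTIES INVOLVED IN.*?(?:Joint Sponsors|Sole Sponsor)(.*?)(?=Legal Advisers)",
    re.IGNORECASE | re.DOTALL,
)

# With one capturing group, findall returns the captured strings.
list_of_matches = [m.strip() for m in pattern.findall(text)]

print(lookbehind_ok)
print(list_of_matches)
```

The third-party `regex` package does accept variable-width lookbehinds, which may be why such a pattern works in some online testers.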