I am interested in extracting some information from some PDF files. I only need the information on pages 2 and after, which looks like this:
[number]. (U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot, and a country, and ends with text in brackets, which may also continue onto the next line.
My implementation in Python is the following:
- Use pdfminer's extract_text function to get the whole text.
- Then use re.findall on the whole text with this regex:
^\d{1,2}\. \(u\) \w .\w*.\w*:.* on \d{1,2} \w .*$
with the re.MULTILINE option too.
I have noticed that this extracts the first line of every paragraph I am interested in, but I cannot find a way to grab everything up to the end of the paragraph, which is the text in brackets (.*).
I was wondering if anyone can help with this. I was hoping to match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
Answer:
You could update the pattern to use a negated character class that matches up to the first occurrence of : and then matches at least "on" after it.
To match all following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
- ^ matches the start of a line (re.M is set)
- \d{1,2}\.\s\(u\)\s matches 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
- [^:\n]*: matches any char except : or a newline, then matches :
- .*?\son\s matches up to the first occurrence of on between whitespace chars
- \d{1,2}\s matches 1-2 digits and a whitespace char
- .* matches the rest of the line
- (?: opens a non-capture group
- \n(?![^\S\r\n]*\n).* matches a newline, asserts that the next line is not only spaces followed by a newline, then matches that line
- )* closes the non-capture group and optionally repeats it
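To see what the trailing group adds, here is a small sketch that runs the pattern with and without it. The sample text is made up to stand in for pdfminer output (it is an assumption, not content from the real PDFs), and the names head and tail are only used for illustration.

```python
import re

# Made-up sample text standing in for pdfminer output (an assumption).
sample = (
    "1. (U) France: On 12 March something happened in the\n"
    "capital. (source text in\n"
    "brackets)\n"
    "\n"
    "2. (U) Sri Lanka: On 3 April another event took place. (more text)\n"
    "   \n"
    "Unrelated footer line\n"
)

# First line of an entry: number, dot, (u), country, colon, "on", day number.
head = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*"
# Continuation lines: keep going until the next line is empty or spaces only.
tail = r"(?:\n(?![^\S\r\n]*\n).*)*"
flags = re.M | re.I

first_lines = re.findall(head, sample, flags)        # stops at each line end
paragraphs = re.findall(head + tail, sample, flags)  # runs to the blank line

for paragraph in paragraphs:
    print(repr(paragraph))
```

Without the tail, each match stops at the end of its first line because . does not match a newline. With the tail, the first entry in this sample spans three lines and stops before the empty line, and the second entry stops before the line that contains only spaces.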
For example
import re
# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
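If a single regex becomes hard to maintain, the line-by-line fallback mentioned in the question could look roughly like the sketch below. It assumes, as the pattern above does, that an empty or whitespace-only line ends an entry. The function name split_entries, the simplified start-of-entry test, and the sample text are all hypothetical.

```python
import re

# Start-of-entry test reduced to the "number, dot, (U)" prefix (a simplification).
ENTRY_START = re.compile(r"^\d{1,2}\.\s\(u\)\s", re.I)


def split_entries(text):
    """Group lines into entries; a blank line ends the current entry."""
    entries, current = [], None
    for line in text.splitlines():
        if ENTRY_START.match(line):
            # A new entry begins; flush the previous one if it is still open.
            if current is not None:
                entries.append(" ".join(current))
            current = [line.strip()]
        elif current is not None:
            if line.strip():
                current.append(line.strip())  # continuation line
            else:
                entries.append(" ".join(current))  # blank line closes the entry
                current = None
    if current is not None:
        entries.append(" ".join(current))
    return entries


# Made-up sample text standing in for pdfminer output (an assumption).
sample = (
    "1. (U) France: On 12 March something happened in the\n"
    "capital. (source text in\n"
    "brackets)\n"
    "\n"
    "2. (U) Sri Lanka: On 3 April another event took place. (more text)\n"
    "   \n"
    "Unrelated footer line\n"
)

entries = split_entries(sample)
for entry in entries:
    print(entry)
```

Unlike re.findall, this version joins the wrapped lines of each entry with single spaces, which may or may not be what you want for later processing.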
