Suppose I have the following text:
test1 = '\n\nDisclaimer ...........................\t10\n\nITOM - IT Object Model ...............\t11\n\nDB – Datenbank Model..................\t11\n\nDB - Datenbank Model - Views .........\t12'
which looks like:
Disclaimer ........................... 10
ITOM - IT Object Model ............... 11
DB – Datenbank Model.................. 11
DB - Datenbank Model - Views ......... 12
I want to make a list of the contents such that I get:
['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views']
so I do the following:
re.findall(r'^[a-zA-Z\%\$\#\@\!\-\_]\S*', test1, re.MULTILINE)
which returns:
['Disclaimer', 'ITOM', 'DB', 'DB']
Why doesn't my regex pick up the words after the "-"?
CodePudding user response:
You can use either a non-regex or a regex approach here:
[line.split('...')[0].strip() for line in test1.splitlines() if line.strip()]
[re.sub(r'\s*\.+\s*\d+\s*$', '', line) for line in test1.splitlines() if line.strip()]
re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)
Notes:
- The text is split into separate lines.
- Blank lines are dropped.
- Either split each line on the triple dots and keep the first chunk (first line of code),
- Or, if you prefer regex, remove the run of dots at the end of the line together with the digits that follow it and any whitespace around them (second line of code).
Or, if you prefer the fully-regex approach (see the third line of code in the above snippet), you can use re.findall with a ^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$ pattern:
- ^ - start of a line
- (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
- [^\S\n]* - zero or more horizontal whitespaces
- \.+ - one or more dots
- [^\S\n]* - zero or more horizontal whitespaces
- \d+ - one or more digits
- [^\S\n]* - zero or more horizontal whitespaces
- $ - end of line.
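For reference, here is a self-contained sketch that runs all three variants on the sample string from the question. The names by_split, by_sub and titles are only illustrative.

```python
import re

# Sample text from the question (the third entry uses an en dash).
test1 = ('\n\nDisclaimer ...........................\t10'
         '\n\nITOM - IT Object Model ...............\t11'
         '\n\nDB – Datenbank Model..................\t11'
         '\n\nDB - Datenbank Model - Views .........\t12')

# 1) Non-regex: split on the first triple dot and keep the left part.
by_split = [line.split('...')[0].strip()
            for line in test1.splitlines() if line.strip()]

# 2) Regex per line: strip the dot leader and the page number at the end.
by_sub = [re.sub(r'\s*\.+\s*\d+\s*$', '', line)
          for line in test1.splitlines() if line.strip()]

# 3) Fully regex: capture everything before the dot leader on each line.
titles = re.findall(r'^(.*?)[^\S\n]*\.+[^\S\n]*\d+[^\S\n]*$', test1, re.M)

print(titles)
# ['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views']
```

All three give the same list for this input. The split variant assumes a title never contains three consecutive dots itself.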
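As for why the original pattern stops after the first word: \S* matches only non-whitespace characters, so the match ends at the first space and never reaches the hyphen. A small sketch on one line of the sample (the variable names are only illustrative):

```python
import re

line = 'ITOM - IT Object Model ...............\t11'
original = r'^[a-zA-Z\%\$\#\@\!\-\_]\S*'

# The character class consumes 'I', then \S* consumes 'TOM' and stops
# at the space, because \S never matches whitespace.
first_word = re.findall(original, line, re.MULTILINE)
print(first_word)  # ['ITOM']

# The match ends at index 4, which is exactly where the first space sits.
match = re.match(original, line)
print(match.end(), repr(line[match.end()]))  # 4 ' '
```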
CodePudding user response:
I'm proposing an alternate approach, with a different regex. Replace the unwanted characters, instead of finding the needed ones, as it seems easy for your case.
See below:
contents = re.sub(r"\s?(\.)+\s+(\d)+\b", "", test1, flags=re.MULTILINE).splitlines(keepends=False)
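This produces the contents you want, one entry per line. Because the sample text has blank lines between entries, the list also contains empty strings, so filter those out to get exactly the list from the question.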
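A runnable sketch of this substitution approach, assuming the sample string is stored in test1 as in the question. The capture groups are dropped because they are not needed for a plain removal, and the re.MULTILINE flag is left out because the pattern has no ^ or $ anchors. Note that the fourth positional argument of re.sub is count, not flags, so flags should always be passed by keyword.

```python
import re

test1 = ('\n\nDisclaimer ...........................\t10'
         '\n\nITOM - IT Object Model ...............\t11'
         '\n\nDB – Datenbank Model..................\t11'
         '\n\nDB - Datenbank Model - Views .........\t12')

# Remove an optional space, the dot leader, the whitespace after it,
# and the page number.
stripped = re.sub(r'\s?\.+\s+\d+\b', '', test1)

# The blank separator lines survive the substitution, so drop empty strings.
contents = [line for line in stripped.splitlines() if line]

print(contents)
# ['Disclaimer', 'ITOM - IT Object Model', 'DB – Datenbank Model', 'DB - Datenbank Model - Views']
```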
