Regex match strings divided by 'and'

I need to parse a string to get the desired number and position from it, for example:

2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses

Currently I am using code like this, which returns a list of tuples such as [('2', 'Better Developers'), ('3', 'Testers')]:

import re

def parse_workers_list_from_str(string_value: str) -> list[tuple[str, str]]:
    result: list[tuple[str, str]] = []
    if string_value:
        for part in string_value.split('and'):
            # optional count, optional space, then the position name
            result.append(re.findall(r'(?: *)(\d+|)(?: |)([\w ]+)', part.strip())[0])
    return result

Can I do it without .split() using only regex?

CodePudding user response：

With re.MULTILINE you can do everything in a single regex, which also splits the parts correctly:

>>> s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""
>>> re.findall(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)", s, re.MULTILINE)
[('2', 'Better Developers'), ('3', 'Testers'), ('5', 'Mechanics'), ('', 'chef'), ('', 'medic'), ('3', 'nurses')]

The same regex with explanatory comments, and with an empty count '' converted to 1:

import re

s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""

results = re.findall(r"""
    # Capture the number if one exists
    (\d*)
    # Remove spacing between number and text
    \s*
    # Capture the text
    (.+?)
    # Attempt to match the word 'and' or the end of the line
    (?:\s+and\s+|$\n?)
    """, s, re.MULTILINE|re.VERBOSE)

results = [(int(n or 1), t.title()) for n, t in results]
assert results == [(2, 'Better Developers'), (3, 'Testers'), (5, 'Mechanics'), (1, 'Chef'), (1, 'Medic'), (3, 'Nurses')]
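
For a single input string, as in the question, the same idea can be wrapped in a function that replaces the .split() version. This is only a sketch: the names WORKER_RE and parse_workers are made up for illustration, and the pattern is the one above written with its + quantifiers.

```python
import re

# Same pattern as in the answer; re.MULTILINE is not needed for a single line.
# WORKER_RE and parse_workers are illustrative names, not from the original post.
WORKER_RE = re.compile(r"(\d*)\s*(.+?)(?:\s+and\s+|$)")

def parse_workers(text: str) -> list[tuple[int, str]]:
    """Return (count, position) pairs; a missing count is treated as 1."""
    return [(int(n or 1), name) for n, name in WORKER_RE.findall(text)]

print(parse_workers("2 Better Developers and 3 Testers"))
print(parse_workers("2 sandwich makers and chef"))
```

Because the separator is matched as \s+and\s+, a position name that merely contains the letters "and" (such as "sandwich") is left intact, whereas a plain .split('and') would cut it in two.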

CodePudding user response：

You may use this regex:

(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)

Here we match the literal word 'and' surrounded by a single space on either side. Before and after 'and' we match using this sub-pattern:

(\d*) *(\S+(?: \S+)*?)

This matches zero or more digits at the start, followed by zero or more spaces, followed by one or more non-whitespace words separated by single spaces.
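
To see what the sub-pattern captures by itself, it can be tried on one side of the 'and' at a time. A small sketch, assuming the sub-pattern is written with its + quantifiers; the name PART is only illustrative:

```python
import re

# The sub-pattern compiled on its own (PART is an illustrative name).
PART = re.compile(r"(\d*) *(\S+(?: \S+)*?)")

for text in ["2 Better Developers", "chef", "3 nurses"]:
    # fullmatch forces the lazy word group to extend to the end of the string
    print(PART.fullmatch(text).groups())
```

The first group is the optional count (an empty string when no digits are present) and the second is the position name.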


Code:

import re
arr = ['2 Better Developers and 3 Testers', '5 Mechanics and chef', 'medic and 3 nurses', '5 foo']

rx = re.compile(r'(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)')

for s in arr:
    print(rx.findall(s))

Output:

[('2', 'Better Developers', '3', 'Testers')]
[('5', 'Mechanics', '', 'chef')]
[('', 'medic', '3', 'nurses')]
[]
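
The last result is empty because the pattern requires a literal ' and ', so '5 foo' cannot match. If single-item strings should be accepted too, one possible variation (an assumption, not part of the answer above) is to make the second half optional and match the whole string; RX_OPTIONAL is an illustrative name:

```python
import re

# Variation on the pattern above: the " and ..." half is wrapped in an optional group.
RX_OPTIONAL = re.compile(r"(\d*) *(\S+(?: \S+)*?)(?: and (\d*) *(\S+(?: \S+)*))?")

for s in ["2 Better Developers and 3 Testers", "5 foo"]:
    m = RX_OPTIONAL.fullmatch(s)
    # Groups 3 and 4 are None when there is no " and " part
    print(m.groups() if m else None)
```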