Regex match returning None

I know similar questions like this have already been asked on the platform but I checked them and did not find the help I needed.

I have some strings such as:

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"

path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"

I have a function:

def extract_id(path):
    pattern = re.compile(r"([0-9]+(_[0-9]+)+)", re.IGNORECASE)
    return pattern.match(path)

The expected results are 5438_133195_9917949_1218833 and 2356_15890_9397775. I tested the pattern online and it seems to produce the expected result, but the function returns None in my app. What am I doing wrong? Thanks.

CodePudding user response：

You don't need any capture groups. You can get just the match by using re.search and returning .group():

\b\d+(?:_\d+)+\b

\b A word boundary
\d+ Match 1+ digits
(?:_\d+)+ Repeat 1+ times: _ followed by 1+ digits
\b A word boundary


import re

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
pattern = re.compile(r"\b\d+(?:_\d+)+\b")
def extract_id(path):
    return pattern.search(path).group()

print(extract_id(path))

Output

5438_133195_9917949_1218833
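
As a quick check that the same pattern handles both strings from the question, and to show what happens when nothing matches, here is a minimal sketch. The names `ID_PATTERN` and `extract_id_or_none` and the `re.VERBOSE` layout are illustrative choices, not part of the answer above.

```python
import re

# Same pattern, written with re.VERBOSE so each part can carry a comment.
ID_PATTERN = re.compile(
    r"""
    \b          # word boundary
    \d+         # one or more digits
    (?:_\d+)+   # one or more repetitions of an underscore followed by digits
    \b          # word boundary
    """,
    re.VERBOSE,
)

def extract_id_or_none(path):
    """Return the first underscore-separated id found in path, or None."""
    m = ID_PATTERN.search(path)
    # search() returns None when there is no match, so guard before .group()
    return m.group() if m else None

paths = [
    "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks",
    "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression",
    "no id in this one",
]

for p in paths:
    print(extract_id_or_none(p))
```

Calling `.group()` directly on the result of `search`, as in the shorter version above, raises `AttributeError` when there is no match, which is why this sketch checks the result first.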

CodePudding user response：

match only matches at the beginning of the string, and your id is not at the beginning. What you want is search, which scans the whole string. You have to call group on the result of a search to retrieve the matched text. You don't need re.IGNORECASE if you are looking for characters that don't have a case, such as digits and underscores. You should also compile your regex only once. Compiling a pattern that never changes, every time a function is called, is not optimal.
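
A short sketch of how `match`, `search`, and `fullmatch` differ on one of the strings from the question. The variable names are only for illustration.

```python
import re

pattern = re.compile(r"\d+(?:_\d+)+")
path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"

# match() only tries at position 0, and the string starts with "activewear"
print(pattern.match(path))            # None

# search() scans forward until the pattern matches somewhere
print(pattern.search(path).group())   # 2356_15890_9397775

# fullmatch() requires the whole string to be the id
print(pattern.fullmatch(path))        # None
print(pattern.fullmatch("2356_15890_9397775").group())  # 2356_15890_9397775
```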

You could simplify your expression to ((\d+_?)+)\?, which finds a repeating sequence of one or more digits, each optionally followed by an underscore, ending at a question mark.

example:

import re

#do this once
pathid = re.compile(r'((\d+_?)+)\?')

def extract_id(path:str) -> str:
    if m := pathid.search(path): #make sure there is a match
        return m.group(1)        #return match from group 1 `((\d _?) )`
    return None                  #no match

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833


Your id comes after the last / and before the ?, so you can also extract it by position instead of by pattern. The solution below will likely be faster than a regex, since it only splits the string.

def extract_id(path:str) -> str:
    #right of the last / to left of the ?
    return path.split('/')[-1].split('?')[0]

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833
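
One caveat with pruning by position: if the string has no / or no ?, the split still returns something, just not an id. A minimal sketch that keeps the split but validates the result, assuming ids always look like digit groups joined by underscores (the names `ID_RE` and `extract_id_checked` are made up for this example):

```python
import re

# Assumed id shape: digit groups joined by underscores, e.g. 2356_15890_9397775
ID_RE = re.compile(r"\d+(?:_\d+)+")

def extract_id_checked(path):
    """Split by position, then confirm the piece really looks like an id."""
    # right of the last / to left of the first ? in that piece
    candidate = path.split('/')[-1].split('?')[0]
    return candidate if ID_RE.fullmatch(candidate) else None

print(extract_id_checked("thingsbefore/5438_133195_9917949_1218833?thingsafter"))
print(extract_id_checked("no separators here"))
print(extract_id_checked("a/2356_15890_9397775?next=b/c"))
```

The last call returns None because a / after the ? shifts what "the last /" means, which is the kind of input where the pattern-based search is the safer choice.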