I know similar questions have already been asked on the platform, but I checked them and did not find the help I needed.
I have some strings such as:
path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"
I have a function:
def extract_id(path):
    pattern = re.compile(r"([0-9]+(_[0-9]+)+)", re.IGNORECASE)
    return pattern.match(path)
The expected results are 5438_133195_9917949_1218833 and 2356_15890_9397775. I tested the function online, and it seems to produce the expected result, but it's returning None in my app. What am I doing wrong? Thanks.
CodePudding user response:
You don't need any capture groups; you can get the match only and return .group() using re.search:
\b\d+(?:_\d+)+\b
\b - a word boundary
\d+ - match 1+ digits
(?:_\d+)+ - repeat 1+ times: _ and 1+ digits
\b - a word boundary
import re
path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
pattern = re.compile(r"\b\d+(?:_\d+)+\b")
def extract_id(path):
    return pattern.search(path).group()
print(extract_id(path))
Output
5438_133195_9917949_1218833
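As a possible illustration, the breakdown of the pattern can be written into the pattern itself with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch; the names id_pattern, paths and ids are made up for the example, and unlike the function above it skips paths with no match instead of raising.

```python
import re

# Same pattern, spelled out with re.VERBOSE so each part carries its own comment.
id_pattern = re.compile(r"""
    \b          # a word boundary
    \d+         # one or more digits
    (?:_\d+)+   # one or more repetitions of: an underscore, then one or more digits
    \b          # a word boundary
""", re.VERBOSE)

paths = [
    "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks",
    "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression",
]

# Keep only the paths where search() found something.
ids = [m.group() for p in paths if (m := id_pattern.search(p))]
print(ids)  # ['5438_133195_9917949_1218833', '2356_15890_9397775']
```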
CodePudding user response:
match only matches at the beginning of the string, and your ID is not at the beginning, so it returns None. What you want is search. You have to use group to retrieve the matched text from the match object that search returns. You don't need re.IGNORECASE if you are looking for characters that don't have a case. You should compile your regex only once; recompiling a pattern that never changes every time a function is called is not optimal.
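A small sketch of how the three matching methods behave on one of the strings from the question (shortened here). The variable names are only for illustration.

```python
import re

path = "activewear/2356_15890_9397775? povid=ApparelNav"
pattern = re.compile(r"\d+(?:_\d+)+")

# match() only tries at position 0; the string starts with "a", so there is no match.
at_start = pattern.match(path)
print(at_start)  # None

# search() scans forward until the pattern fits somewhere in the string.
found = pattern.search(path).group()
print(found)  # 2356_15890_9397775

# fullmatch() requires the whole string to fit the pattern.
whole = pattern.fullmatch("2356_15890_9397775")
print(whole is not None)  # True
```

Online regex testers usually behave like search, which would explain why the pattern seemed to work there.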
You could simplify your expression to ((\d+_?)+)\?, which will find a repeating sequence of one or more digits that may be followed by an underscore, and is ultimately ended with a question mark.
example:
import re
#do this once
pathid = re.compile(r'((\d+_?)+)\?')
def extract_id(path:str) -> str:
    if m := pathid.search(path): #make sure there is a match
        return m.group(1) #return match from group 1 `((\d+_?)+)`
    return None #no match
#use
path = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)
#proof
print(result) #5438_133195_9917949_1218833
Your id comes after the last / and before the ?. The solution below will likely be much faster. It doesn't search by pattern; it prunes by position.
def extract_id(path:str) -> str:
    #right of the last / to left of the ?
    return path.split('/')[-1].split('?')[0]
#use
path = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)
#proof
print(result) #5438_133195_9917949_1218833
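One thing to keep in mind is that the positional version returns whatever sits between the last / and the ?, whether or not it looks like an id. A rough sketch of a checked variant follows. The name extract_id_checked and the ID_RE pattern are hypothetical, and the pattern assumes an id is one or more groups of digits joined by underscores. It also cuts at the first ? before looking for the last /, so a / inside the query part does not get in the way.

```python
import re
from typing import Optional

# Assumed id shape: digit groups joined by underscores (a single group is allowed here).
ID_RE = re.compile(r"\d+(?:_\d+)*")

def extract_id_checked(path: str) -> Optional[str]:
    # Drop everything from the first "?" on, then keep the last "/"-separated segment.
    candidate = path.split("?", 1)[0].rsplit("/", 1)[-1]
    # Accept the candidate only if all of it looks like an id.
    return candidate if ID_RE.fullmatch(candidate) else None

print(extract_id_checked("thingsbefore/5438_133195_9917949_1218833?thingsafter"))  # 5438_133195_9917949_1218833
print(extract_id_checked("thingsbefore/no-id-here?thingsafter"))  # None
print(extract_id_checked("a/1_2?x=/y"))  # 1_2
```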
