Regular expression bug?

I'm trying to build a list from a text file using a regular expression. I need to grab the somenumber.mp4 part of every line.

This is what I have inside the txt file:

 http://example.com:80/path/to/file/151542.mp4
 http://example.com:80/path/to/file/151543.mp4
 http://example.com:80/path/to/file/151544.mp4
 http://example.com:80/path/to/file/151545.mp4
 http://example.com:80/path/to/file/151546.mp4
 http://example.com:80/path/to/file/151547.mp4
 http://example.com:80/path/to/file/151548.mp4

Code:

import os # To move a file in Python
import re # regex

path_to_txt = "C:/Users/Administrator/Documents/Edit/episodes.txt"

regex = re.compile("[0-9]*\.(mp4|mkv|avi)")
with open(path_to_txt, 'r', encoding='utf8') as n:
    n = n.read()
    file_name = re.findall(regex, n)

print(file_name)

The regular expression [0-9]*\.(mp4|mkv|avi) seems to be right, but I wonder why it doesn't grab the part that I want.

Output:

['mp4', 'mp4', 'mp4', 'mp4', 'mp4', 'mp4', 'mp4']

CodePudding user response：

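The documentation explains this "feature" of findall. The parentheses establish a capturing group, and when the pattern contains a capturing group, findall returns only what the group matched rather than the whole match (with several groups it returns a list of tuples). Make the inner group non-capturing with ?: and add a set of parens around the whole thing, and you'll get what you want.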

regex = re.compile(r"([0-9]*\.(?:mp4|mkv|avi))")  # raw string so \. reaches the regex engine unchanged

Output:

['151542.mp4', '151543.mp4', '151544.mp4', '151545.mp4', '151546.mp4', '151547.mp4', '151548.mp4']
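
To make the findall behavior concrete, here is a small sketch comparing three variants of the pattern on two of the URLs from the question. The variable names (text, one_group, no_group, two_groups) are only illustrative and not part of the original code.

```python
import re

# Two sample lines taken from the question's text file.
text = (
    "http://example.com:80/path/to/file/151542.mp4\n"
    "http://example.com:80/path/to/file/151543.mp4\n"
)

# One capturing group: findall returns only what the group matched.
one_group = re.findall(r"[0-9]*\.(mp4|mkv|avi)", text)

# Non-capturing group (?:...): findall returns the whole match.
no_group = re.findall(r"[0-9]*\.(?:mp4|mkv|avi)", text)

# Two capturing groups: findall returns one tuple per match,
# with one entry per group.
two_groups = re.findall(r"([0-9]*\.(mp4|mkv|avi))", text)

print(one_group)   # ['mp4', 'mp4']
print(no_group)    # ['151542.mp4', '151543.mp4']
print(two_groups)  # [('151542.mp4', 'mp4'), ('151543.mp4', 'mp4')]
```

The last case shows why simply wrapping the original pattern in extra parens is not enough on its own: the inner group also has to become non-capturing, otherwise you get tuples.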

CodePudding user response：

As @tim-roberts said, you just need to add ?: inside the parentheses to make it a non-capturing group.

r"[0-9]*\.(?:mp4|mkv|avi)"
# => ['151547.mp4']

I would also like to add something:

If all you need is the filename from the URL, a more general solution is os.path.basename:

from os import path
txt = "http://example.com:80/path/to/file/151547.mp4"
path.basename(txt)
# => '151547.mp4'
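
Applied to the whole file, the same idea could look like the sketch below. It is only one possible approach: the helper name file_names is made up for illustration, the list lines stands in for the lines read from episodes.txt, and posixpath.basename plus urllib.parse.urlsplit are swapped in for os.path.basename so that a query string or fragment, if a URL ever had one, would not end up in the name.

```python
import posixpath
from urllib.parse import urlsplit

# Stand-in for the lines of the text file (hypothetical sample data).
lines = [
    " http://example.com:80/path/to/file/151542.mp4\n",
    " http://example.com:80/path/to/file/151543.mp4\n",
    "\n",
]

def file_names(lines):
    """Return the last path component of each non-empty URL line."""
    names = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # urlsplit isolates the path, so ?query and #fragment are left out.
        names.append(posixpath.basename(urlsplit(url).path))
    return names

print(file_names(lines))  # ['151542.mp4', '151543.mp4']
```

With the real file, you would pass the open file object (or its readlines() result) instead of the sample list.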