I'm trying to build a list from a text file using a regular expression. I need to grab this part of every line: somenumber.mp4.
This is what I have inside the txt file:
http://example.com:80/path/to/file/151542.mp4
http://example.com:80/path/to/file/151543.mp4
http://example.com:80/path/to/file/151544.mp4
http://example.com:80/path/to/file/151545.mp4
http://example.com:80/path/to/file/151546.mp4
http://example.com:80/path/to/file/151547.mp4
http://example.com:80/path/to/file/151548.mp4
Code:
import os # To move a file in Python
import re # regex
path_to_txt = "C:/Users/Administrator/Documents/Edit/episodes.txt"
regex = re.compile("[0-9]*\.(mp4|mkv|avi)")
with open(path_to_txt, 'r', encoding='utf8') as n:
    n = n.read()
file_name = re.findall(regex, n)
print(file_name)
The regular expression [0-9]*\.(mp4|mkv|avi) seems to be right, but I wonder why it doesn't grab the part that I want.
Output:
['mp4', 'mp4', 'mp4', 'mp4', 'mp4', 'mp4', 'mp4']
CodePudding user response:
The documentation explains this "feature" of findall. The parentheses establish a capturing group, and if the pattern has exactly one capturing group, findall returns only that group instead of the whole match. Wrap the whole pattern in its own group and mark the inner one as non-capturing with ?:, and you'll get what you want.
regex = re.compile("([0-9]*\.(?:mp4|mkv|avi))")
Output:
['151542.mp4', '151543.mp4', '151544.mp4', '151545.mp4', '151546.mp4', '151547.mp4', '151548.mp4']
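To make the rule concrete, here is a small sketch of how the return value of findall changes with the number of capturing groups. The two-line sample text is taken from the question; the use of [0-9]+ instead of [0-9]* (so that at least one digit is required) and of raw strings are my own choices, not part of the original pattern.

```python
import re

# Sample text for illustration; the URLs are the ones from the question.
text = (
    "http://example.com:80/path/to/file/151542.mp4\n"
    "http://example.com:80/path/to/file/151543.mp4\n"
)

# No capturing group: findall returns the whole match.
no_group = re.findall(r"[0-9]+\.(?:mp4|mkv|avi)", text)

# Exactly one capturing group: findall returns only that group.
one_group = re.findall(r"[0-9]+\.(mp4|mkv|avi)", text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"([0-9]+)\.(mp4|mkv|avi)", text)

print(no_group)    # ['151542.mp4', '151543.mp4']
print(one_group)   # ['mp4', 'mp4']
print(two_groups)  # [('151542', 'mp4'), ('151543', 'mp4')]
```

The middle case is exactly what the question ran into.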
CodePudding user response:
As @tim-roberts said, you just need to add ?: inside the parentheses to make it a non-capturing group.
"[0-9]*\.(?:mp4|mkv|avi)"
# => ['151547.mp4']
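If you would rather keep the capturing group (for example, to read the extension separately), another option is re.finditer, which yields match objects, so group(0) always gives the full match. This is only a sketch; the sample text and the names pattern, names and extensions are made up for illustration.

```python
import re

# Sample text for illustration; the URLs are the ones from the question.
text = (
    "http://example.com:80/path/to/file/151546.mp4\n"
    "http://example.com:80/path/to/file/151547.mp4\n"
)

# The group stays capturing here; finditer does not change what it returns.
pattern = re.compile(r"[0-9]+\.(mp4|mkv|avi)")

# group(0) is the whole match, group(1) is the captured extension.
names = [m.group(0) for m in pattern.finditer(text)]
extensions = [m.group(1) for m in pattern.finditer(text)]

print(names)       # ['151546.mp4', '151547.mp4']
print(extensions)  # ['mp4', 'mp4']
```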
Nevertheless, I would like to add something:
If all you need is the filename from the URL, a more general solution is os.path.basename:
from os import path
txt = "http://example.com:80/path/to/file/151547.mp4"
path.basename(txt)
# => '151547.mp4'
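Applied to every line of the file, that could look roughly like the sketch below. Two things here are assumptions on my part: the lines list stands in for the contents of the text file, and the query string on the second URL is hypothetical, added only to show why it can help to parse the URL first. I also use posixpath.basename so that the splitting is always done on forward slashes, regardless of the operating system.

```python
import posixpath
from urllib.parse import urlsplit

# Stand-in for the lines read from the text file.
lines = [
    "http://example.com:80/path/to/file/151542.mp4\n",
    "http://example.com:80/path/to/file/151543.mp4?token=abc\n",  # hypothetical query string
]

file_names = []
for line in lines:
    url = line.strip()
    if not url:
        continue  # skip blank lines
    # urlsplit(...).path drops the scheme, host, port and query string.
    file_names.append(posixpath.basename(urlsplit(url).path))

print(file_names)  # ['151542.mp4', '151543.mp4']
```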
