Python(3) - store one specific link of output from soup.find

I am working on a web scraping function. I want to find the latest download link ending with ".img.html" from:

"https://dl.twrp.me/gauguin/"

and store this link in a variable, not just print it.

My code so far:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://dl.twrp.me/gauguin/"

html_doc = urlopen(url)
# defining html link (twrp...)

soup = BeautifulSoup(html_doc, "html.parser")

for link in soup.find_all('a'):
    links = link.get('href')
    print(links)

My Output:

https://twrp.me
#
https://twrp.me/about/
https://twrp.me/contactus/
https://twrp.me/Devices/
https://twrp.me/FAQ/
/public.asc
/gauguin/twrp-3.5.2_10-0-gauguin.img.html
/gauguin/twrp-3.5.1_10-0-gauguin.img.html
/gauguin/twrp-3.5.0_10-0-gauguin.img.html
https://twrp.me/terms/termsofservice.html
https://twrp.me/terms/cookiepolicy.html
https://github.com/TeamWin

So my goal is to filter this output down to just one link (the latest):

/gauguin/twrp-3.5.2_10-0-gauguin.img.html

stored in a variable, so I can call this variable later on or even directly download it with wget for example.

Answer:

As you iterate over all the <a> elements, you can use the str.endswith() function to check if the URL ends with .img.html. Do this to extract all your URLs to a single list:

urls = []
for link in soup.find_all('a'):
    href = link.get('href')
    # href is None for anchors without the attribute, so guard before endswith()
    if href and href.endswith('.img.html'):
        urls.append(href)

Which gives a list like so:

urls = ['/gauguin/twrp-3.5.2_10-0-gauguin.img.html',
'/gauguin/twrp-3.5.1_10-0-gauguin.img.html',
'/gauguin/twrp-3.5.0_10-0-gauguin.img.html']

Next, it depends on whether the version-specifiers in your URLs are guaranteed to follow lexicographic order, i.e. will doing a simple string compare get you the latest one? This is usually true if they're all of the same format where the number of digits in each part of the version number remains the same for different strings. The ones you have shown in your example meet this requirement.

If this is the case, simply do

max(urls)

which gives

'/gauguin/twrp-3.5.2_10-0-gauguin.img.html'

If this is not the case (for example, if you had '/gauguin/twrp-3.15.2_10-0-gauguin.img.html', which is numerically greater than 3.5.2 but not lexicographically), you will have to parse the version number out of the string, possibly using a regex, and compare those version numbers. You can do this using the key argument to the max() function.
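To make the difference concrete, here is a minimal sketch comparing plain string ordering with ordering on integers. The 3.15.2 entry is the hypothetical URL mentioned above, not a real release.

```python
urls = [
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.15.2_10-0-gauguin.img.html",  # hypothetical release
]

# String comparison goes character by character: after the shared prefix,
# "5" is compared with "1", so the 3.5.2 entry is considered the largest.
lexicographic_latest = max(urls)

# Tuples of integers compare element by element, so 15 sorts after 5.
numeric_check = (3, 15, 2) > (3, 5, 2)

print(lexicographic_latest)
print(numeric_check)
```

So max(urls) alone would pick the wrong entry here, which is why the version has to be parsed into numbers first.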

Let's say your version numbers have the format <numbers>.<numbers>.<numbers>_<numbers>-<numbers>. You'd use the following regex:

\d+\.\d+\.\d+_\d+-\d+

Explanation
\d+    : One or more digits
\.     : The . character

To use it with max(), you could write a function that extracts the version number from the file name:

import re

def extract_version(filename):
    # e.g. filename = '/gauguin/twrp-3.5.2_10-0-gauguin.img.html'
    
    match = re.search(r"(\d+)\.(\d+)\.(\d+)_(\d+)-(\d+)", filename)
    # e.g. match = <re.Match object; span=(14, 24), match='3.5.2_10-0'>
    # e.g. match.groups() = ('3', '5', '2', '10', '0')

    if match is not None:
        return tuple(int(m) for m in match.groups()) # Convert match to tuple of integer for correct comparison

    return tuple() # if match is none, return an empty tuple

The one modification I made to the regex was to surround each \d+ in parentheses. The parentheses make it a capturing group, so all the numeric parts are captured as separate groups. Then, re.search() returns a match object, and its .groups() method gives a tuple containing the numbers, but as strings. So we need to convert these strings to integers before returning them.

Then, use the function as the key argument to max():

max(urls, key=extract_version)
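Putting the pieces together, the following sketch stores the result in a variable and turns the relative link into an absolute URL. It assumes the three hrefs from the question as sample data; the names BASE_URL, latest and latest_url are only illustrative, and urljoin from the standard library is one possible way to build the full address.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://dl.twrp.me/gauguin/"  # page URL from the question

def extract_version(filename):
    """Return the version as a tuple of ints, or () if nothing matches."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)_(\d+)-(\d+)", filename)
    return tuple(int(m) for m in match.groups()) if match else ()

# Sample data: the hrefs shown in the question's output
urls = [
    "/gauguin/twrp-3.5.1_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.0_10-0-gauguin.img.html",
]

latest = max(urls, key=extract_version)   # relative link with the highest version
latest_url = urljoin(BASE_URL, latest)    # absolute URL, usable with wget or requests

print(latest_url)
```

Because unmatched names map to an empty tuple, they sort below every real version and are never picked as long as at least one link matches.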

Answer:

From your question, I understand that you want the links that end with ".img.html" so that you can use them later. The code below extracts all the target links and stores them in a Python list, which is easy to reuse.

You can try this:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://dl.twrp.me/gauguin/"

html_doc = urlopen(url)
# defining html link (twrp...)
soup = BeautifulSoup(html_doc, "html.parser")

links = []

for link in soup.find_all('a'):
    links.append((link.get('href')))

# target_strings will contain all the links that end with .img.html

target_strings = []
for i in links:
    # skip anchors without an href (None) and match only the suffix
    if i and i.endswith('.img.html'):
        target_strings.append(i)

# and if needed in the future you can extract a single element from the list
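Picking a single element could look like the sketch below. It assumes the page lists the newest build first, which matches the output in the question but is not guaranteed; the sample list and the name latest_link are only illustrative.

```python
# Sample data: what target_strings would hold for the output in the question
target_strings = [
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.1_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.0_10-0-gauguin.img.html",
]

# Assumption: the first entry is the newest one.
# None signals that no matching link was found.
latest_link = target_strings[0] if target_strings else None

print(latest_link)
```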

Answer:

By latest I assume you want the link with the most recent date. As such, you need to capture both the URL and the date given for each link. The date can then be converted into a datetime object and added to a list together with the URL.

After all URLs are found, the list can be easily sorted into date order, with the newest first. The latest URL can then be used to download the img file.

For example:

from bs4 import BeautifulSoup
import requests
from datetime import datetime

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []

for a in soup.find_all('a', href=True):
    link = a['href']
    
    if link.endswith('.img.html'):
        date_text = a.find_next('em').get_text(strip=True)
        date_dt = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S %Z")
        urls.append([date_dt, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url

# Download the latest img file
url_img = base_url + latest.split('.html')[0]
filename = url_img.split('/')[-1]

with requests.get(url_img, stream=True, headers={'Referer' : base_url + latest}) as req_img:
    with open(filename, 'wb') as f_img:
        for chunk in req_img.iter_content(chunk_size=2**15): 
            f_img.write(chunk)

This approach might have the advantage of still working if the naming or numbering scheme is changed. A referer header is added to avoid the website returning the HTML for the download page.

This results in a download of 131,072 KB
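The date-based selection can also be tried offline. The sketch below uses the same strptime format as the code above; the date strings are invented for illustration and are not the real upload dates.

```python
from datetime import datetime

# Hypothetical (date text, href) pairs; the dates are made up for this example
rows = [
    ("2021-02-01 10:00:00 UTC", "/gauguin/twrp-3.5.0_10-0-gauguin.img.html"),
    ("2021-04-05 10:00:00 UTC", "/gauguin/twrp-3.5.2_10-0-gauguin.img.html"),
    ("2021-03-17 10:00:00 UTC", "/gauguin/twrp-3.5.1_10-0-gauguin.img.html"),
]

# Parse each date so that entries compare chronologically, not as text
parsed = [
    (datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S %Z"), link)
    for date_text, link in rows
]

# Tuples compare on their first element first, so max() picks the newest date
latest_date, latest = max(parsed)

print(latest)
```

Using max() instead of sorting the whole list is enough when only the newest entry is needed.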

Note, if you prefer to ignore the date and just sort on the version number, use the following approach:

from bs4 import BeautifulSoup
import requests
import re

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []

for a in soup.find_all('a', href=True):
    link = a['href']
    
    if link.endswith('.img.html'):
        # convert the numeric parts to integers so that 15 sorts after 5
        version = [int(n) for n in re.findall(r'\d+', link)]
        urls.append([version, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url
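As a quick check of what the version key looks like, here is a small sketch on one link from the question. It assumes the pattern is meant to match runs of digits (\d+) and that the parts should be compared as integers rather than as strings.

```python
import re

link = "/gauguin/twrp-3.5.2_10-0-gauguin.img.html"

parts = re.findall(r"\d+", link)      # every run of digits, as strings
version = [int(p) for p in parts]     # integers compare numerically

# Strings compare character by character, integers by value
string_order = ["10"] > ["9"]
integer_order = [10] > [9]

print(parts)
print(version)
print(string_order, integer_order)
```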