Extract a link using a pattern - Python requests

Using a request, I gathered all the text from a website. I am now facing a new issue: at some point the following appears in the text.

<I>Season 2021/2022</I><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I1.csv">Serie A</A> <FONT SIZE="1">(FT & HT results; match stats; match, total goals & AH odds)</FONT><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I2.csv">Serie B</A> <FONT SIZE="1">(FT & HT results; match stats;  match odds and total goals odds)</FONT><BR><BR>

What I would like to do is get just the HREF ("mmz4281/2122/I1.csv") that follows the first occurrence of "<I>Season" (I do not need the HREFs for older seasons; the data is about football matches). Note that the request returned quite a large file.

Is there an easy way to handle this?

CodePudding user response：

This code outputs the first URL in the string:

string = """
<I>Season 2021/2022</I><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I1.csv">Serie A</A> <FONT SIZE="1">(FT & HT results; match stats; match, total goals & AH odds)</FONT><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I2.csv">Serie B</A> <FONT SIZE="1">(FT & HT results; match stats;  match odds and total goals odds)</FONT><BR><BR>
"""
num = string.find('HREF="')     # Find the index of the first HREF="
url = string[num + 6:]          # Keep everything after HREF=" (6 characters)
num = url.find('">')            # Find the index of the closing ">
url = url[:num]                 # Keep everything before ">
print(url)
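
The snippet above takes the first HREF anywhere in the string. If the page has other links before the first season heading, a variant that starts searching at "<I>Season" may be closer to what the question asks. This is only a sketch: first_href_after is a hypothetical helper name, and the sample string (including the index.php link) is made up for illustration.

```python
def first_href_after(text, marker="<I>Season"):
    """Return the first HREF value that follows `marker`, or None if absent."""
    start = text.find(marker)            # position of the first season heading
    if start == -1:
        return None
    key = 'HREF="'
    begin = text.find(key, start)        # first HREF=" after the heading
    if begin == -1:
        return None
    begin += len(key)                    # skip past HREF="
    end = text.find('"', begin)          # closing quote of the attribute value
    if end == -1:
        return None
    return text[begin:end]


# Made-up sample: one unrelated link before the season heading.
sample = (
    '<A HREF="index.php">Home</A>\n'
    '<I>Season 2021/2022</I><BR>\n'
    '<A HREF="mmz4281/2122/I1.csv">Serie A</A><BR>\n'
)
print(first_href_after(sample))
```

Checking each find() result for -1 avoids silently slicing from the wrong position when the marker or the link is missing.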

CodePudding user response：

You might want to use regular expressions for this. They can work wonders.

import re

# The raw data
data = """
<I>Season 2021/2022</I><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I1.csv">Serie A</A> <FONT SIZE="1">(FT & HT results; match stats; match, total goals & AH odds)</FONT><BR>
<IMG SRC="Excel.gif" BORDER="0" ALIGN="Absmiddle">&nbsp;&nbsp;<A HREF="mmz4281/2122/I2.csv">Serie B</A> <FONT SIZE="1">(FT & HT results; match stats;  match odds and total goals odds)</FONT><BR><BR>
"""

# Compile a regular expression with the re.S (DOTALL) flag, so that . also matches newlines.
pattern = re.compile(r'<I>Season.*?HREF="(.*?)"', re.S)

# Search for a match of this pattern in the data
match = pattern.search(data)

# Print the matched group
print(match.group(1))

This outputs

mmz4281/2122/I1.csv
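
One thing to watch for: pattern.search returns None when nothing matches, so match.group(1) would raise an AttributeError on a page without a season heading. A minimal sketch of a guarded version follows. The helper name first_season_href is hypothetical, and adding re.I is an assumption in case the page mixes upper-case and lower-case markup.

```python
import re

# Same pattern as above, plus re.I so lower-case markup also matches.
PATTERN = re.compile(r'<I>Season.*?HREF="(.*?)"', re.S | re.I)


def first_season_href(html):
    """Return the first HREF after the first season marker, or None."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


lower_case = '<i>Season 2021/2022</i><br>\n<a href="mmz4281/2122/I1.csv">Serie A</a>'
print(first_season_href(lower_case))
print(first_season_href("<p>No seasons here</p>"))
```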

The regex broken down:

<I>Season   # Indication of a new Season
.*?         # Match any character lazily, including newline
HREF="      # Match HREF=" literally. Start of the HREF attribute value that we want.
(.*?)       # Capture a lazy match of any characters (the URL itself).
"           # Match " literally. End of the HREF attribute value.

Furthermore, if your file contains multiple seasons and you want an href for each of them, you can use pattern.findall(data) to get a list with the first href of each season.
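
As a rough illustration of findall with this pattern, here is a sketch on a shortened, made-up sample with two season blocks (the 2020/2021 entry is invented for the example). Because the pattern stops at the first HREF after each "<I>Season", the result holds one link per season, not every link.

```python
import re

pattern = re.compile(r'<I>Season.*?HREF="(.*?)"', re.S)

# Shortened, made-up sample with two season blocks.
data = """
<I>Season 2021/2022</I><BR>
<A HREF="mmz4281/2122/I1.csv">Serie A</A><BR>
<A HREF="mmz4281/2122/I2.csv">Serie B</A><BR><BR>
<I>Season 2020/2021</I><BR>
<A HREF="mmz4281/2021/I1.csv">Serie A</A><BR>
"""

# With a single capture group, findall returns a list of the captured strings.
hrefs = pattern.findall(data)
print(hrefs)
```

The first element of the list is the same value that pattern.search(data).group(1) returns.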