How to webscrape a text inside a link in Python?

I would like to webscrape the following page: https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html

In particular, I would like to get the text inside every link you see displayed clicking on the link above. So far I can only do it by clicking on each link manually. For example, for the first one:

import requests
from bs4 import BeautifulSoup

x = "https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html"

x1 = requests.get(x)                         # fetch a single article page
x2 = BeautifulSoup(x1.text, "html.parser")   # parse its HTML
x3 = x2.select("p")                          # collect the paragraph tags

The problem is that I cannot automate the step from the index page that lists the links (https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html) to the individual pages where the text I need is stored (e.g. https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html).

Can anyone help me?

Thanks!

CodePudding user response:

To get a list of all links on https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html')
soup = BeautifulSoup(r.text, 'html.parser')
links = [link.get('href') for link in soup.find_all('a')]
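
Note that `link.get('href')` returns None for anchors without an href, and the hrefs may be relative (the later answers prepend https://www.ecb.europa.eu to them). A minimal sketch of cleaning up such a list with the standard library's `urljoin`; the helper name `absolutize` and the sample hrefs are made up for illustration and were not fetched from the live page:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"

def absolutize(hrefs, base_url=BASE_URL):
    """Drop missing hrefs and resolve the rest against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs if href]

# Illustrative hrefs only: one site-relative, one missing, one already absolute.
sample = [
    "/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html",
    None,
    "https://example.org/a",
]
print(absolutize(sample))
```

Resolving against the page URL rather than concatenating strings should also cope with hrefs that are already absolute.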

CodePudding user response:

Wouter's answer is correct for getting all links, but if you need just the title links, you could use a more specific selector such as select('div.title > a'). Here's an example:

from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"

html = BeautifulSoup(requests.get(url).text, 'html.parser')
links = html.select('div.title > a')
for link in links:
    print(link.attrs['href'])
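
To see what the narrower selector changes without hitting the network, here is a small offline sketch. The HTML is invented and only imitates a page where each title link sits directly inside a `div` with class `title`; the real ECB markup may differ:

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the structure the selector expects.
SAMPLE_HTML = """
<dl>
  <dd>
    <div class="title"><a href="/press/inter/a.en.html">Interview A</a></div>
    <div class="subtitle"><a href="/press/inter/a.pdf">PDF</a></div>
  </dd>
  <dd>
    <div class="title"><a href="/press/inter/b.en.html">Interview B</a></div>
  </dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every anchor on the page versus only anchors that are direct children of div.title.
all_links = [a["href"] for a in soup.find_all("a")]
title_links = [a["href"] for a in soup.select("div.title > a")]

print(all_links)    # three hrefs, including the PDF link
print(title_links)  # only the two title hrefs
```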

CodePudding user response:

Quoting the question: "In particular, I would like to get the text inside every link you see displayed clicking on the link above."

To get the text of every linked article you have to iterate over your list of links and request each of them:

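for link in soup.select('div.title > a'):
    # the hrefs are treated as site-relative here, so the host is prepended
    article_url = f"https://www.ecb.europa.eu{link['href']}"
    article = BeautifulSoup(requests.get(article_url).content, 'html.parser')
    data.append({
        'title': link.text,
        'url': article_url,  # URL of the article itself, not of the index page
        'subtitle': article.main.h2.text,
        'text': ' '.join(p.text for p in article.select('main .section p:not([class])'))
    })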

Example

Contents are stored in a list of dicts, so you can easily access and process the data later.

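from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []

for link in soup.select('div.title > a'):
    # the hrefs are treated as site-relative here, so the host is prepended
    article_url = f"https://www.ecb.europa.eu{link['href']}"
    # parse each article into its own variable instead of overwriting the index soup
    article = BeautifulSoup(requests.get(article_url).content, 'html.parser')
    data.append({
        'title': link.text,
        'url': article_url,  # URL of the article itself, not of the index page
        'subtitle': article.main.h2.text,
        'text': ' '.join(p.text for p in article.select('main .section p:not([class])'))
    })

print(data)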
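
As one possible way to process the list of dicts afterwards, the sketch below writes it out as CSV using only the standard library. The records are invented stand-ins with the same keys as `data` above, and `to_csv` is a hypothetical helper name:

```python
import csv
import io

# Stand-in records with the same keys as the scraped `data`; the values are invented.
data = [
    {"title": "Interview A", "url": "https://example.org/a", "subtitle": "Sub A", "text": "First text"},
    {"title": "Interview B", "url": "https://example.org/b", "subtitle": "Sub B", "text": "Second, with a comma"},
]

def to_csv(records):
    """Serialise a non-empty list of dicts with identical keys to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(data)
print(csv_text)

# Individual fields stay accessible by key.
titles = [row["title"] for row in data]
print(titles)
```

If pandas is already in use, as in the question, `pd.DataFrame(data)` accepts the same list of dicts.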