I'm trying to extract data from a website with BeautifulSoup.
I'm currently stuck on this:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>"
I want to get the translators' names, but the link's href is built from their ID.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I tried str.startswith, but it doesn't work. Can someone help me, please?
CodePudding user response:
Provided your HTML is correct and static (not loaded by JavaScript after the initial page load), this is one way to select that link, or those links:
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
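As for why the attempt in the question returns nothing: a plain string passed as href to find_all is compared against the whole attribute value, so a bare prefix never matches. If you would rather keep find_all and str.startswith, you can pass a function instead. The snippet below is a minimal sketch using the same sample HTML as above; the prefix variable and the (name, id) tuples are just illustrative choices, and pulling the ID out with urllib.parse is an optional extra.

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
prefix = "/searchinternet/advanced?all_authors_id="

soup = BeautifulSoup(html, "html.parser")

# When href is a function, bs4 calls it with each tag's href value
# (None when the tag has no href), so guard against None first.
links = soup.find_all("a", href=lambda h: h is not None and h.startswith(prefix))

translators = []
for link in links:
    # Split the query string to recover the author ID from the link
    query = parse_qs(urlparse(link["href"]).query)
    translators.append((link.get_text(strip=True), query["all_authors_id"][0]))

print(translators)
```

The CSS selector with ^= shown above does the same prefix match in one line; the function form is mostly useful when the condition is more than a simple prefix.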
EDIT: Who doesn't like a challenge? Based on further comments from the OP, here is a way to obtain the titles, authors, translators and illustrators from that page, allowing for one or more translators and one or more illustrators per book:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR LIVRE HEROS::Folio Junior - Un Livre dont Vous êtes le Héros @ DEFIS FANTASTIQ::Série Défis Fantastiques/(limit)/3?date[from]=1980-01-01&date[to]=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[] > table div[]')
print()
for i in items:
    title = i.select_one('div[] h3')
    author = i.select_one('div[] a')
    history = i.select_one('p[]')
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
| | Title | Author | Translator(s) | Illustrator(s) |
|---|---|---|---|---|
| 0 | Le Sépulcre des Ombres | Jonathan Green | Noël Chassériau | Alan Langford |
| 1 | La Légende de Zagor | Ian Livingstone | Pascale Houssin | Martin McKenna |
| 2 | Les Mages de Solani | Keith Martin | Noël Chassériau | Russ Nicholson |
| 3 | Le Siège de Sardath | Keith P. Phillips | Yannick Surcouf | Pete Knifton |
| 4 | Retour à la Montagne de Feu | Ian Livingstone | Yannick Surcouf | Martin McKenna |
| 5 | Les Mondes de l'Aleph | Peter Darvill-Evans | Yannick Surcouf | Tony Hough |
| 6 | Les Mercenaires du Levant | Paul Mason | Mona de Pracontal | Terry Oakes |
| 7 | L'Arpenteur de la Lune | Stephen Hand | Pierre de Laubier | Martin McKenna, Terry Oakes |
| 8 | La Tour de la Destruction | Keith Martin | Mona de Pracontal | Pete Knifton |
| 9 | La Légende des Guerriers Fantômes | Stephen Hand | Alexis Galmot | Martin McKenna |
| 10 | Le Repaire des Morts-Vivants | Dave Morris | Nicolas Grenier | David Gallagher |
| 11 | L'Ancienne Prophétie | Paul Mason | Mona de Pracontal | Terry Oakes |
| 12 | La Vengeance des Démons | Jim Bambra | Mona de Pracontal | Martin McKenna |
| 13 | Le Sceptre Noir | Keith Martin | Camille Fabien | David Gallagher |
| 14 | La Nuit des Mutants | Peter Darvill-Evans | Anne Collas | Alan Langford |
| 15 | L'Élu des Six Clans | Luke Sharp | Noël Chassériau | Martin Mac Kenna, Martin McKenna |
| 16 | Le Volcan de Zamarra | Luke Sharp | Olivier Meyer | David Gallagher |
| 17 | Les Sombres Cohortes | Ian Livingstone | Noël Chassériau | Nik William |
| 18 | Le Vampire du Château Noir | Keith Martin | Mona de Pracontal | Martin McKenna |
| 19 | Le Voleur d'Âmes | Keith Martin | Mona de Pracontal | Russ Nicholson |
| 20 | Le Justicier de l'Univers | Martin Allen | Mona de Pracontal | Tim Sell |
| 21 | Les Esclaves de l'Eternité | Paul Mason | Sylvie Bonnet | Bob Harvey |
| 22 | La Créature venue du Chaos | Steve Jackson | Noël Chassériau | Alan Langford |
| 23 | Les Rôdeurs de la Nuit | Graeme Davis | Nicolas Grenier | John Sibbick |
| 24 | L'Empire des Hommes-Lézards | Marc Gascoigne | Jean Lacroix | David Gallagher |
| 25 | Les Gouffres de la Cruauté | Luke Sharp | Sylvie Bonnet | Russ Nicholson |
| 26 | Les Spectres de l'Angoisse | Robin Waterfield | Mona de Pracontal | Ian Miller |
| 27 | Le Chasseur des Étoiles | Luke Sharp | Arnaud Dupin de Beyssat | Cary Mayes, Gary Mayes |
| 28 | Les Sceaux de la Destruction | Robin Waterfield | Sylvie Bonnet | Russ Nicholson |
| 29 | La Crypte du Sorcier | Ian Livingstone | Noël Chassériau | John Sibbick |
| 30 | La Forteresse du Cauchemar | Peter Darvill-Evans | Mona de Pracontal | Dave Carson |
| 31 | La Grande Menace des Robots | Steve Jackson | Danielle Plociennik | Gary Mayes |
| 32 | L'Épée du Samouraï | Mark Smith | Pascale Jusforgues | Alan Langford |
| 33 | L'Épreuve des Champions | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Brian Williams |
| 34 | Défis Sanglants sur l'Océan | Andrew Chapman | Jean Walter | Bob Harvey |
| 35 | Les Démons des Profondeurs | Steve Jackson | Noël Chassériau | Bob Harvey |
| 36 | Rendez-vous avec la M.O.R.T. | Steve Jackson | Arnaud Dupin de Beyssat | Declan Considine |
| 37 | La Planète Rebelle | Robin Waterfield | C. Degolf | Gary Mayes |
| 38 | Les Trafiquants de Kelter | Andrew Chapman | Anne Blanchet | Nik Spender |
| 39 | Le Combattant de l'Autoroute | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Kevin Bulmer |
| 40 | Le Mercenaire de l'Espace | Andrew Chapman | Jean Walthers | Geoffroy Senior |
| 41 | Le Temple de la Terreur | Ian Livingstone | Denise May | Bill Houston |
| 42 | Le Manoir de l'Enfer | Steve Jackson | | |
| 43 | Le Marais aux Scorpions | Steve Jackson | Camille Fabien | Duncan Smith |
| 44 | Le Talisman de la Mort | Steve Jackson | Camille Fabien | Bob Harvey |
| 45 | La Sorcière des Neiges | Ian Livingstone | Michel Zénon | Edward Crosby, Gary Ward |
| 46 | La Citadelle du Chaos | Steve Jackson | Marie-Raymond Farré | Russ Nicholson |
| 47 | La Galaxie Tragique | Steve Jackson | Camille Fabien | Peter Jones |
| 48 | La Forêt de la Malédiction | Ian Livingstone | Camille Fabien | Malcolm Barter |
| 49 | La Cité des Voleurs | Ian Livingstone | Henri Robillot | Iain McCaig |
| 50 | Le Labyrinthe de la Mort | Ian Livingstone | Patricia Marais | Iain McCaig |
| 51 | L'Île du Roi Lézard | Ian Livingstone | Fabienne Vimereu | Alan Langford |
| 52 | Le Sorcier de la Montagne de Feu | Steve Jackson | Camille Fabien | Russ Nicholson |
Bear in mind that this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' does not appear in that entry's text, so both columns come back empty. It's up to the OP to find a solution for that one.
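One possible direction for that case, offered only as a sketch: instead of pivoting on the 'Illustrations' text node, walk the paragraph's children in order and attach each link to the most recent role marker seen before it. The markup below is made up for illustration, and the 'Trad.' marker is taken from the snippet in the question; the real page may be structured differently, and split_credits and MARKERS are hypothetical names.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical role markers; 'Trad.' comes from the snippet in the question
MARKERS = {"translators": "Trad.", "illustrators": "Illustrations"}

def split_credits(paragraph):
    """Assign each direct <a> child to the last role marker seen before it."""
    credits = {role: [] for role in MARKERS}
    current = None  # no marker seen yet
    for node in paragraph.children:
        if isinstance(node, NavigableString):
            text = str(node)
            # If several markers share one text node, the last one wins
            positions = {role: text.rfind(marker) for role, marker in MARKERS.items()}
            latest = max(positions, key=positions.get)
            if positions[latest] >= 0:
                current = latest
        elif isinstance(node, Tag) and node.name == "a" and current is not None:
            credits[current].append(node.get_text(strip=True))
    return credits

# Made-up sample markup: one entry with both roles, one without an illustrator
html = '''
<p>Trad. de l'anglais par <a href="/x?all_authors_id=1">Camille Fabien</a>. Illustrations de <a href="/x?all_authors_id=2">Russ Nicholson</a></p>
<p>Trad. de l'anglais par <a href="/x?all_authors_id=1">Camille Fabien</a></p>
'''
soup = BeautifulSoup(html, "html.parser")
results = [split_credits(p) for p in soup.find_all("p")]
print(results)
```

This only looks at direct children of the paragraph, so links wrapped in another tag would be skipped; check it against the actual markup of the failing entry before relying on it.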
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
CodePudding user response:
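from bs4 import BeautifulSoup
# Parse the saved page; BeautifulSoup returns a parse tree, not a list
with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')
names = []
# Collect the text of every link in the document
for elem in soup.find_all("a"):
    names.append(elem.text)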
