I am trying to build a web scraper to collect data for a research project at university. However, I cannot scrape the whole page; there seems to be a problem with how I use the result of soup.find_all...
This is what I've come up with so far:
from bs4 import BeautifulSoup
import requests
from csv import writer
url = 'https://pubmed.ncbi.nlm.nih.gov/?term=("spontaneous intracranial hypotension"[All Fields] OR "spontaneous cerebrospinal fluid leak"[All Fields] OR "cerebrospinal fluid hypovolemia"[All Fields] OR "cerebrospinal fluid hypovolemia syndrome"[All Fields] OR "Hypoliquorrhea"[All Fields] OR "Spontaneous spinal cerebrospinal fluid leak"[All Fields]) NOT "letter to the editor"[All Fields]&filter=dates.1000/1/1-2022/3/31&filter=lang.english&ac=no&format=abstract&sort=date&size=200'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('article', class_="article-overview")
with open('disstest.csv', 'w', encoding= 'utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Herkunftsland', 'Journal', 'Anzahl Zitationen']
    thewriter.writerow(header)
    for list in lists:
        herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
        journal = lists.find('div', class_="article-source").text.replace('\n', '')
        zitationen = lists.find('li', class_="references-count").text.replace('\n', '')
        info = [herkunftsland, journal, zitationen]
        thewriter.writerow(info)
I am getting the following error:
Traceback (most recent call last):
  File "/Users/***/Documents/Test/scrape.py", line 17, in <module>
    herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/bs4/element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'.
You're probably treating a list of elements like a single element.
Did you call find_all() when you meant to call find()?
CodePudding user response:
Inside the loop you call find() on lists, which is the whole ResultSet returned by find_all(), instead of on the current element. Call find() on the loop variable (renamed to _list here):
for _list in lists:
    herkunftsland = _list.find('ul', class_="item-list").text.replace('\n', '')
    journal = _list.find('div', class_="article-source").text.replace('\n', '')
    zitationen = _list.find('li', class_="references-count").text.replace('\n', '')
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)
CodePudding user response:
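As mentioned by @Charls Ken, you used the wrong variable (lists) to extract your data. You should also avoid shadowing built-in names such as list; it is not a reserved keyword, but reusing the name hides the built-in type for the rest of the script.
I would also recommend checking that an element was actually found before calling methods on it. find() returns None when nothing matches, so accessing .text on the result raises an AttributeError.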
for _list in lists:
    herkunftsland = e.text.replace('\n','') if (e := _list.find('ul', class_="item-list")) else None
    journal = e.text.replace('\n','').strip() if (e := _list.find('div', class_="article-source")) else None
    zitationen = e.text.replace('\n','').strip() if (e := _list.find('li', class_="references-count")) else None
    info = [herkunftsland, journal, zitationen]
Note: This uses the walrus operator (:=), which requires Python 3.8 or later.
Without the walrus operator, find() has to be called twice per field:
journal = _list.find('div', class_="article-source").text.replace('\n','').strip() if _list.find('div', class_="article-source") else None
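Another way to avoid both the walrus operator and the duplicated find() call is to wrap the lookup in a small helper. This is only a minimal sketch: the name text_or_none and its parameters are made up for illustration and are not part of BeautifulSoup. It assumes parent is a single element from the find_all() result.

```python
def text_or_none(parent, name, css_class):
    """Return the cleaned text of the first matching child, or None if absent.

    Hypothetical helper: `parent` is assumed to be a single bs4 Tag
    (one item of the ResultSet), not the ResultSet itself.
    """
    # find() returns None when no element matches
    element = parent.find(name, class_=css_class)
    if element is None:
        return None
    # Same cleanup as in the snippets above: drop newlines, trim whitespace
    return element.text.replace('\n', '').strip()
```

Inside the loop this would be used as journal = text_or_none(_list, 'div', 'article-source'), and likewise for the other two fields.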
