In this webpage I want to scrape the title and the URL for each row.
The tag for each row seems to be <div> but since there are a lot of divs, how do I ensure I am pulling the right one?
Screenshot of the multiple pages/articles attached: https://i.stack.imgur.com/0mPXS.png
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
productlinks=[]
url='https://journals.lww.com/ccmjournal/toc/2022/01001'
r=requests.get(url)
soup= BeautifulSoup(r.content,'html.parser')
content=soup.find_all('div')
for item in content:
    title=item.find('Title')
    link=item.find_element_by_css_selector('a').get_attribute('href')
CodePudding user response:
Assuming you also want the top two rows and not just the numbered articles, you can use an attribute = value CSS selector with the starts-with operator (^=) to target the parent div element (for all listings) whose id starts with itemListContainer, then specifically target the anchor tags that are direct children of h4 elements:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
soup = bs(r.content, 'lxml')
d = {i.text.strip():i['href'] for i in soup.select('[id^=itemListContainer] h4 > a')}
You can restrict to only the numbered items with:
d = {i.text.strip():i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
The class selector gets you to the subheaders; the adjacent sibling combinator (+), together with the div type selector, then moves you to the div immediately following each subheader. The h4 type selector, the child combinator (>) and the a type selector then take you to the target anchor tags.
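To see how the two selectors differ, here is a minimal sketch you can run offline. The HTML fragment is made up for illustration (the real page's ids, classes and nesting may differ), and it uses the built-in html.parser so that lxml is not needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the structure described above.
# The ids, classes and hrefs are invented for this demo.
html = """
<div id="itemListContainer1">
  <div class="featured">
    <h4><a href="/a/featured-1">Featured one</a></h4>
  </div>
  <div class="ej-toc-subheader">Abstracts</div>
  <div>
    <h4><a href="/a/numbered-1">1: First abstract</a></h4>
    <h4><span><a href="/a/nested">Not a direct child</a></span></h4>
  </div>
</div>
<div id="sidebar">
  <h4><a href="/a/sidebar">Sidebar link</a></h4>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every h4 > a inside an element whose id starts with itemListContainer
all_rows = {a.text.strip(): a["href"]
            for a in soup.select("[id^=itemListContainer] h4 > a")}

# Only h4 > a inside the div that immediately follows a subheader
numbered = {a.text.strip(): a["href"]
            for a in soup.select(".ej-toc-subheader + div h4 > a")}

print(all_rows)
print(numbered)
```

The anchor wrapped in a span is skipped by both selectors because it is not a direct child of the h4, and the sidebar link is skipped because it sits outside the container.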
If you want all results, you can set the results to 100 per page and then keep clicking next while that element is present. Use a pause to avoid being rate-limited:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Selenium 4.3+ removed find_element_by_css_selector; use find_element(By.CSS_SELECTOR, ...)
d = webdriver.Chrome()
d.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
results = {}

try:
    # dismiss the cookie banner if it is shown
    d.find_element(By.CSS_SELECTOR, '.cookie-jar-content .button').click()
except Exception:
    pass  # banner absent or not clickable

# show 100 results per page
d.find_element(By.CSS_SELECTOR, '.js-items-on-page-selectize [value="100"]').click()

while True:
    soup = bs(d.page_source, 'lxml')
    for i in soup.select('.ej-toc-subheader + div h4 > a'):
        results[i.text.strip()] = i['href']
    next_page = soup.select_one('.element__nav--next')
    if next_page is None:
        break  # no next button, so this was the last page
    d.find_element(By.CSS_SELECTOR, '.element__nav--next').click()
    time.sleep(1)  # rate-limit
Using only Selenium:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

# Selenium 4.3+ removed the find_element(s)_by_* helpers; use find_element(By.CSS_SELECTOR, ...)
d = webdriver.Chrome()
d.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
results = {}

try:
    # dismiss the cookie banner if it is shown
    d.find_element(By.CSS_SELECTOR, '.cookie-jar-content .button').click()
except NoSuchElementException:
    pass

# show 100 results per page
d.find_element(By.CSS_SELECTOR, '.js-items-on-page-selectize [value="100"]').click()

while True:
    for i in d.find_elements(By.CSS_SELECTOR, '.ej-toc-subheader + div h4 > a'):
        results[i.text.strip()] = i.get_attribute('href')
    try:
        next_page = d.find_element(By.CSS_SELECTOR, '.element__nav--next')
    except NoSuchElementException:
        break  # no next button, so this was the last page
    next_page.click()
    time.sleep(1)  # rate-limit

d.close()
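One difference between the versions: BeautifulSoup's i['href'] gives you the attribute exactly as written in the HTML, whereas Selenium's get_attribute('href') returns the resolved absolute URL. I have not checked whether this site uses relative links, so treat the following as a sketch under that assumption. The absolutize helper and the sample values are hypothetical, and urljoin leaves URLs that are already absolute unchanged:

```python
from urllib.parse import urljoin

BASE_URL = "https://journals.lww.com/ccmjournal/toc/2022/01001"


def absolutize(results, base_url=BASE_URL):
    """Return a copy of results with every href resolved against base_url."""
    return {title: urljoin(base_url, href) for title, href in results.items()}


# Hypothetical scraped values, for illustration only
sample = {
    "Relative link": "/ccmjournal/Fulltext/example.aspx",
    "Already absolute": "https://journals.lww.com/ccmjournal/other.aspx",
}

resolved = absolutize(sample)
for title, link in resolved.items():
    print(title, link)
```

Running the BeautifulSoup results through something like this should make the two approaches produce comparable URLs.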
