How to scrape a website with no clear class/id names

Time:01-15

When it comes to scraping, I love websites that have a clear schema, because it makes the data easy to get.

Unfortunately, several websites don't.

What is your approach when scraping them?

Let's take this source code as our example: [image: page source, not available]

As we can see, the <div> elements all have the same class.

I was thinking of approaching it like this:

from selenium.webdriver.common.by import By

divs = driver.find_elements(By.XPATH, '//div[@]//div')
for div in divs:
    tel = div.find_element(By.XPATH, './/div[3]')

Unfortunately, not all pages include a phone number, so I can't rely on the position of an element inside the parent div.

I am using Selenium at the moment, but I also work with Scrapy.

Any approach or help for this kind of case would be amazing!

CodePudding user response:

Hello, if you are using Python, you might want to use Beautiful Soup 4:

pip install beautifulsoup4

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
'bad'
>>> soup.i
<i>HTML</i>
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>

bs4 documentation

CodePudding user response:

I am not sure if this is what you are asking for, but if the phone number is not found, you need to catch the exception and continue with the next element instead of letting the script terminate. If that is the case, you could try something like this. Feel free to improve on it:

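from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

divs = driver.find_elements(By.XPATH, '//div[@]//div')
for div in divs:
    try:
        tel = div.find_element(By.XPATH, './/div[3]')
    except NoSuchElementException:
        # This entry has no third <div>, so skip it and keep going.
        print("Tel not found")
        continue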

With BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests

resp = requests.get("https://www.kitanetz.de/homepage/error.php")
soup = bs(resp.text, 'lxml')
print(soup.prettify())
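
Once the page is parsed, one option is to pick out the phone number by what the text looks like rather than by its position, so an entry without a number simply yields None. The following is only a rough sketch: the real markup from the question is not available, so the HTML string, the class names "entry" and "row", the "Tel" label and the helper extract_tel are all assumptions made up for illustration. It uses the built-in "html.parser" so no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (not the real page): all inner <div>s share one class,
# and the second entry has no phone line at all.
HTML = """
<div class="entry">
  <div class="row">Example A</div>
  <div class="row">Main Street 1</div>
  <div class="row">Tel: 030 1234567</div>
</div>
<div class="entry">
  <div class="row">Example B</div>
  <div class="row">Side Road 2</div>
</div>
"""

# Assumed label format: "Tel", "Tel." or "Telefon", an optional colon,
# then something that starts with a digit or "+".
TEL_PATTERN = re.compile(
    r"^\s*Tel(?:efon)?\.?\s*:?\s*([+\d][\d\s/()-]*)$", re.IGNORECASE
)


def extract_tel(entry):
    """Return the phone number found in an entry, or None if there is none."""
    # Only look at the direct child <div>s of this entry.
    for row in entry.find_all("div", recursive=False):
        match = TEL_PATTERN.match(row.get_text())
        if match:
            return match.group(1).strip()
    return None


soup = BeautifulSoup(HTML, "html.parser")
results = [extract_tel(entry) for entry in soup.find_all("div", class_="entry")]
print(results)  # ['030 1234567', None]
```

Because the match depends on the text and not on div[3], it keeps working when a row is missing or the order changes. The pattern would need adjusting to whatever label the real site uses.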