Today I wanted to try out Nokogiri (Ruby) to list the addresses shown on this site: https://www.funda.nl/koop/rotterdam/straat-oostzeedijk/
I tried to display the addresses in the debugger, following this video: https://www.youtube.com/watch?v=b3CLEUBdWwQ
The results on the page are:
- Oostzeedijk 6 B01
- Oostzeedijk 166 C
Their class is called "search-result__header-title".
I tried different things, such as selecting div elements, but I can't get the results to show.
require 'nokogiri'
require 'httparty'
require 'byebug'
def scraper
  url = "https://www.funda.nl/koop/rotterdam/straat-oostzeedijk/"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  byebug
end
scraper
In the debugger I have tried this:
(byebug) parsed_page
This gives me a result, but when I narrow it down with a selector, the result is empty:
(byebug) parsed_page.css('div.search-content-output')
[]
Can somebody give me a hint? I am stuck.
CodePudding user response:
The problem is that on the URL you are using (https://www.funda.nl/koop/rotterdam/straat-oostzeedijk/), content is loaded asynchronously.
The tutorial you're following assumes a "simple" web-page, where all of the page's content is loaded immediately. But for your scenario, unparsed_page is initially missing lots of page content that only loads later.
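One way to check this diagnosis is to look for the class name in the raw response body before handing it to Nokogiri. The sketch below uses only the Ruby standard library; the helper name class_in_markup? and both sample HTML strings are made up for illustration and are not the real funda.nl response.

```ruby
# Returns true when class_name appears as one of the classes in any
# class="..." attribute of the raw HTML string.
def class_in_markup?(html, class_name)
  html.scan(/class\s*=\s*["']([^"']*)["']/i).flatten.any? do |value|
    value.split.include?(class_name)
  end
end

# Hypothetical sample: a page whose results are present in the initial HTML.
static_html = '<div class="search-content-output">' \
              '<h2 class="search-result__header-title">Oostzeedijk 6 B01</h2></div>'

# Hypothetical sample: a page that only ships a shell and fills it in with JavaScript.
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

puts class_in_markup?(static_html, 'search-result__header-title')
puts class_in_markup?(js_shell, 'search-result__header-title')

# With the code from the question, the same check would be:
#   class_in_markup?(HTTParty.get(url).body, 'search-result__header-title')
```

If the check comes back false for the real response, no CSS selector in Nokogiri will find the element, because the markup was never in the downloaded HTML.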
So what you need to do here is run code that actually mimics the behaviour of a user interacting with the website. There are many libraries designed to do this, so my solution below is certainly not the only option available, but hopefully you will find this concrete example useful.
I will be using Google Chrome, Chromedriver and the Ruby library Watir. Prerequisites:
- Install chromedriver. This step will vary depending on your operating system. For example, on macOS, you can probably just run brew install chromedriver.
- gem install watir
The code:
require 'watir'
b = Watir::Browser.new :chrome
b.goto("https://www.funda.nl/koop/rotterdam/straat-oostzeedijk/")
puts b.div(class: 'search-content-output').text
Result:
Hartschelp 111 Monster, € 1.395.000 k.k.
Uitgelicht door Kolpa van der Hoek Makelaars Rotterdam
Buitenbassinweg 506 Rotterdam, € 495.000 k.k.
Uitgelicht door Oranje Bouwgroep B.V.
Van der Duijn van Maasdamweg 614 O.3.6. Rotterdam, € 890.000 v.o.n.
...
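To get only the addresses rather than the whole results block, the same approach can target the class mentioned in the question. This is a rough sketch, not a verified solution: it assumes the address elements still carry the class "search-result__header-title", the method names address_titles and scrape_addresses are made up for this example, and the CAPTCHA or cookie popup may get in the way.

```ruby
# Collects the stripped text of every element carrying css_class.
# `browser` is anything that responds to `elements(class: ...)`,
# for example a Watir::Browser.
def address_titles(browser, css_class = 'search-result__header-title')
  browser.elements(class: css_class).map { |el| el.text.strip }.reject(&:empty?)
end

# Opens Chrome, waits for the first address element to appear, then
# returns all address texts. The class name is an assumption taken
# from the question and may not match the current site markup.
def scrape_addresses(url)
  require 'watir' # loaded lazily so address_titles works without a browser
  browser = Watir::Browser.new :chrome
  browser.goto(url)
  browser.element(class: 'search-result__header-title').wait_until(&:present?)
  address_titles(browser)
ensure
  browser&.close
end

# Usage (starts a real browser):
#   puts scrape_addresses("https://www.funda.nl/koop/rotterdam/straat-oostzeedijk/")
```

Waiting for the element before reading it matters here for the same reason Nokogiri failed: the results are only in the page after the asynchronous load has finished.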
Note that this website also seems to have a CAPTCHA to deter web scrapers. However, the developers seem to have gotten this wrong: at the moment, the EU cookie consent popup appears before the CAPTCHA, which renders the CAPTCHA somewhat useless.
