Web scraping URLs using a for loop

I am scraping tables from a website and have been scraping each web page one at a time, but since the URLs follow a pattern, I am thinking of running them through a for loop.

I am trying to use the following script:

for(i in 1:38) {
  webpage <- read_html(paste0("www.website.com/", i))
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>% 
    html_table()
}

My main issue is that the sites I am scraping do not follow a pattern I am able to put in the above for loop, but rather read as the following (if the /W wasn't included it would make it a lot easier): www.website.com/sample/test-01/W, www.website.com/sample/test-02/W, www.website.com/sample/test-03/W etc.

I feel as though there is an extremely simple way to place these into the above for loop but I am not sure of the syntax.

EDIT: one more issue is the 0 in the url www.website.com/sample/test-01/W. I can't paste the i after the 0 since the pattern goes 06-07-08-09-10-11 with the 0 not being valid after 09. And the website www.website.com/sample/test-012/W does not exist.

CodePudding user response：

To append the /W at the end, you just need to call the paste0 function once more on the URL.

for(i in 1:38) {
  webpage <- paste0("www.website.com/", i)
  temp <- paste0(webpage, "/W")
}

It will make the URL look like this:

www.website.com/1/W
www.website.com/2/W
...

To get the digits part, you can use sprintf from base R. To get zero-padded two-digit numbers, use sprintf("%02d", i) in the loop.

The code will look like this:

for(i in 1:38) {
  webpage <- paste0("www.website.com/", sprintf("%02d", i))
  temp <- paste0(webpage, "/W")
  print(temp)
}

Note: I've modified the code to print the URLs instead of scraping them, to illustrate the point.

The output will look like this:

[1] "www.website.com/01/W"
[1] "www.website.com/02/W"
[1] "www.website.com/03/W"
[1] "www.website.com/04/W"
[1] "www.website.com/05/W"
[1] "www.website.com/06/W"
[1] "www.website.com/07/W"
[1] "www.website.com/08/W"
[1] "www.website.com/09/W"
[1] "www.website.com/10/W"
[1] "www.website.com/11/W"
[1] "www.website.com/12/W"
[1] "www.website.com/13/W"
[1] "www.website.com/14/W"
[1] "www.website.com/15/W"
[1] "www.website.com/16/W"
[1] "www.website.com/17/W"
[1] "www.website.com/18/W"
[1] "www.website.com/19/W"
[1] "www.website.com/20/W"
[1] "www.website.com/21/W"
[1] "www.website.com/22/W"
[1] "www.website.com/23/W"
[1] "www.website.com/24/W"
[1] "www.website.com/25/W"
[1] "www.website.com/26/W"
[1] "www.website.com/27/W"
[1] "www.website.com/28/W"
[1] "www.website.com/29/W"
[1] "www.website.com/30/W"
[1] "www.website.com/31/W"
[1] "www.website.com/32/W"
[1] "www.website.com/33/W"
[1] "www.website.com/34/W"
[1] "www.website.com/35/W"
[1] "www.website.com/36/W"
[1] "www.website.com/37/W"
[1] "www.website.com/38/W"
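
The zero-padding can also be checked on its own, and sprintf is vectorized, so the whole set of URLs can be built without a loop. A minimal sketch, assuming the URL pattern from the question (www.website.com/sample/test-01/W); the variable names ids and urls are only illustrative:

```r
# "%02d" means: format as an integer, minimum width 2, padded with zeros.
ids <- sprintf("%02d", c(1L, 9L, 10L, 38L))
print(ids)
# [1] "01" "09" "10" "38"

# sprintf recycles over the vector 1:38, giving one URL per page.
urls <- sprintf("www.website.com/sample/test-%02d/W", 1:38)
print(head(urls, 3))
# [1] "www.website.com/sample/test-01/W" "www.website.com/sample/test-02/W"
# [3] "www.website.com/sample/test-03/W"
```

Because the width is a minimum, numbers from 10 upward are left as they are, which avoids the test-012 problem mentioned in the question.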

CodePudding user response：

You can create a vector of URLs using sprintf:

web_urls <- sprintf('www.website.com/test-%02d/W', 1:38)

Then use lapply to run the rvest code on each URL.

library(rvest)

# Read one page and return its first table as a data frame.
extract_table <- function(url) {
  webpage <- read_html(url)
  webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}

result <- lapply(web_urls, extract_table)
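
With 38 pages, one missing page or table would stop the whole lapply call. One possible way to guard against that is to wrap the extractor in tryCatch and drop the failures afterwards. This is only a sketch: safely and fake_extract are hypothetical names, fake_extract is a stand-in so the example runs without network access (in practice it would be extract_table), and the https:// prefix is an assumption, since the thread gives the URLs without a scheme.

```r
# Hypothetical helper: wraps an extractor so that a failing URL
# yields NULL (with a message) instead of aborting the whole run.
safely <- function(f) {
  function(url) {
    tryCatch(f(url), error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    })
  }
}

# Stand-in extractor for this demonstration only; it fails on page 02
# to simulate a missing page.
fake_extract <- function(url) {
  if (grepl("test-02", url)) stop("page not found")
  data.frame(url = url, value = 1)
}

# Scheme (https://) is assumed; the path follows the question's pattern.
web_urls <- sprintf("https://www.website.com/sample/test-%02d/W", 1:3)
result <- lapply(web_urls, safely(fake_extract))
names(result) <- web_urls

# Keep only the pages that succeeded and stack them into one data frame.
ok <- Filter(Negate(is.null), result)
combined <- do.call(rbind, ok)
```

Stacking with rbind only works if every table has the same columns; if they differ, keeping result as a list is the safer choice.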