I am scraping tables from a website. So far I have scraped each web page one at a time, but since the URLs follow a pattern, I am thinking of running the URLs through a for loop.
I am trying to use the following script:
for(i in 1:38) {
  webpage <- read_html(paste0("www.website.com/", i))
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}
My main issue is that the sites I am scraping do not follow a pattern I am able to put in the above for loop, but rather read as the following (if the /W wasn't included it would make it a lot easier): www.website.com/sample/test-01/W, www.website.com/sample/test-02/W, www.website.com/sample/test-03/W etc.
I feel as though there is an extremely simple way to place these into the above for loop but I am not sure of the syntax.
EDIT: one more issue is the 0 in the url www.website.com/sample/test-01/W. I can't paste the i after the 0 since the pattern goes 06-07-08-09-10-11 with the 0 not being valid after 09. And the website www.website.com/sample/test-012/W does not exist.
CodePudding user response:
To append the /W at the end, you just need to call the paste0 function once more on the webpage string.
for(i in 1:38) {
  webpage <- paste0("www.website.com/", i)
  temp <- paste0(webpage, "/W")
}
It will make the URL look like this:
www.website.com/1/W
www.website.com/2/W
...
To get the zero-padded digits, you can use sprintf from base R. For two-digit numbers, use sprintf("%02d", i) inside the loop.
The code will look like this:
for(i in 1:38) {
  webpage <- paste0("www.website.com/", sprintf("%02d", i))
  temp <- paste0(webpage, "/W")
  print(temp)
}
Note: I've modified the code to print the URLs rather than scrape them, to prove my point.
The output will look like this:
[1] "www.website.com/01/W"
[1] "www.website.com/02/W"
[1] "www.website.com/03/W"
[1] "www.website.com/04/W"
[1] "www.website.com/05/W"
[1] "www.website.com/06/W"
[1] "www.website.com/07/W"
[1] "www.website.com/08/W"
[1] "www.website.com/09/W"
[1] "www.website.com/10/W"
[1] "www.website.com/11/W"
[1] "www.website.com/12/W"
[1] "www.website.com/13/W"
[1] "www.website.com/14/W"
[1] "www.website.com/15/W"
[1] "www.website.com/16/W"
[1] "www.website.com/17/W"
[1] "www.website.com/18/W"
[1] "www.website.com/19/W"
[1] "www.website.com/20/W"
[1] "www.website.com/21/W"
[1] "www.website.com/22/W"
[1] "www.website.com/23/W"
[1] "www.website.com/24/W"
[1] "www.website.com/25/W"
[1] "www.website.com/26/W"
[1] "www.website.com/27/W"
[1] "www.website.com/28/W"
[1] "www.website.com/29/W"
[1] "www.website.com/30/W"
[1] "www.website.com/31/W"
[1] "www.website.com/32/W"
[1] "www.website.com/33/W"
[1] "www.website.com/34/W"
[1] "www.website.com/35/W"
[1] "www.website.com/36/W"
[1] "www.website.com/37/W"
[1] "www.website.com/38/W"
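As a quick check of what the format string does: in "%02d", d formats an integer, 2 is the minimum width, and 0 pads with zeros instead of spaces, so 9 becomes "09" while 10 stays "10". A minimal sketch in base R (the sample values are chosen only for illustration; formatC is shown as an alternative, not something the answer above uses):

```r
# "%02d": integer, minimum width 2, padded with leading zeros
ids <- sprintf("%02d", c(1L, 9L, 10L, 38L))
print(ids)

# formatC from base R gives the same padding
ids_alt <- formatC(c(1L, 9L, 10L, 38L), width = 2, flag = "0")
print(ids_alt)
```

Because only numbers below 10 are padded, no URL of the form test-010 or test-012 is ever produced.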
CodePudding user response:
You can create a vector of URLs using sprintf:
web_urls <- sprintf('www.website.com/test-%02d/W', 1:38)
Then use lapply to run the rvest code on each URL.
library(rvest)

# Return the first table on the page as a data frame
extract_table <- function(url) {
  webpage <- read_html(url)
  webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}
result <- lapply(web_urls, extract_table)
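If one of the 38 pages is missing or has no table, lapply will stop at the first error. Here is a hedged sketch of one way to keep going, using only base R. scrape_all is a hypothetical helper (not part of rvest) that wraps any extractor function in tryCatch and returns NULL for pages that fail. fake_extract is a stand-in used only so the example runs without network access. Note also that read_html generally expects a full URL including the scheme (for example https://), which the placeholder URLs here omit.

```r
# Hypothetical helper: apply `extract` to every URL, returning NULL on failure
scrape_all <- function(urls, extract) {
  results <- lapply(urls, function(u) {
    tryCatch(extract(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- urls  # label each result with its URL
  results
}

# Stand-in extractor for demonstration only (no network access)
fake_extract <- function(url) {
  if (grepl("test-02", url)) stop("no table found")
  data.frame(source = url)
}

demo_urls <- sprintf("www.website.com/sample/test-%02d/W", 1:3)
demo <- scrape_all(demo_urls, fake_extract)
ok <- Filter(Negate(is.null), demo)  # drop the pages that failed
print(names(ok))
```

With real pages, the equivalent call would be scrape_all(web_urls, extract_table). Keeping the URL as the list name makes it easy to see which pages were skipped.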
