I am trying to scrape the main text of about 800 news articles in R, starting from a list of RSS links.
I was able to scrape an individual URL with:
library(rvest)
cnn_url <- "http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html"
cnn_html <- read_html(cnn_url)
cnn_html
cnn_nodes <- cnn_html %>% html_elements(".Article__body")
# look for texts
cnn_texts <- cnn_html %>%
  html_elements(".Article__body") %>%
  html_text()
cnn_texts[1]
But I am trying to scrape 800 articles (the main text of news stories), and I can't run the code above separately for each URL because I have more than 800 links. So I used this code:
cnn_data <- news_data %>%
filter(media_name == "CNN")
head(cnn_data$url)
head(cnn_data)
urls <- cnn_data[,4] #column with url
url_xml <- try(apply(urls, 1, read_html))
textScraper <- function(x) {
  html_text(html_nodes(x, ".Article__body") %>%
              html_nodes("p")) %>%
    paste(collapse = '')
}
cnn_text <- lapply(url_xml, textScraper)
cnn_text[1]
cnn_data$full_article <- cnn_text
head(cnn_data$full_article)
But when I ran the line:
url_xml <- try(apply(urls, 1, read_html))
I got an error message that says: Error in open.connection(x, "rb") : HTTP error 404.
I assume this may be because the URLs are linked to RSS; is there any way I can scrape those news stories by using the URLs that I have?
FYI: the data file consists of rows with links like these:
http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WvpC9ZKjJXo/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/przZf_johNY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/TieFj4roU_M/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/iqRZ7f8MhzQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/Uq46bJROhiI/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6u-D9sna6uY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/JNTXgcM1yY0/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WG8UTHcZvwQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6YHMwdj6W7s/index.html
CodePudding user response:
Your try() wraps the whole call to apply(), so if there is a single bad link, apply() stops with an error, try() catches that one error, and every page that was read successfully is thrown away. You need to wrap try() around read_html(), not around apply().
Something like this should work, returning a list of web pages. Note that all of the links you listed above work.
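library(rvest)
urls <- cnn_data$url  # character vector of links, not a one-column data frame
mylist <- lapply(urls, function(url) {
  # be kind and do not hammer the server
  Sys.sleep(1)
  print(url) # debug
  # a failed download becomes a "try-error" object instead of stopping the loop
  try(read_html(url))
})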
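Once the list is built, a failed download is stored as an object of class "try-error", so it can be told apart from a parsed page before extracting any text. Here is a minimal sketch of that step, assuming the `.Article__body` selector from the question still matches the pages. `extract_text` and `demo_list` are names made up for illustration, and `demo_list` is only a small offline stand-in for `mylist`.

```r
library(rvest)

# Hypothetical helper: returns the article text, or NA if the download failed
extract_text <- function(page, css = ".Article__body") {
  if (inherits(page, "try-error")) return(NA_character_)
  paragraphs <- page %>%
    html_elements(css) %>%
    html_elements("p") %>%
    html_text()
  paste(paragraphs, collapse = " ")
}

# Offline stand-in for mylist: one parsed page and one failed download
demo_list <- list(
  minimal_html('<div class="Article__body"><p>First.</p><p>Second.</p></div>'),
  try(stop("HTTP error 404."), silent = TRUE)
)

# One result per input, so positions still line up with the original rows
demo_text <- vapply(demo_list, extract_text, character(1))
demo_text
```

With the real data, something like `cnn_data$full_article <- vapply(mylist, extract_text, character(1))` should keep one entry per row, so links that failed show up as NA instead of shifting the results.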
Yes, it is possible to write code that handles the different page layouts, but that is potentially a much bigger question to answer.
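As a starting point only, one possible approach is to try a list of candidate CSS selectors and keep the text from the first one that matches. This is a rough sketch, not a complete solution: only `.Article__body` comes from the question, while the other selectors and the function name `first_matching_text` are placeholders that would have to be replaced after inspecting the actual pages.

```r
library(rvest)

# Hypothetical helper: text from the first selector that matches, else NA
first_matching_text <- function(page, selectors) {
  for (css in selectors) {
    nodes <- html_elements(page, css)
    if (length(nodes) > 0) {
      return(paste(html_text(nodes, trim = TRUE), collapse = " "))
    }
  }
  NA_character_
}

# Only the first selector is from the question; the rest are assumptions
candidates <- c(".Article__body", ".article-content", "article")

# Offline example page that uses a different layout
page <- minimal_html("<article><p>Fallback layout text.</p></article>")
first_matching_text(page, candidates)
```

Pages where nothing matches come back as NA, which gives a list of URLs to inspect by hand before adding more selectors.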
