Scrape multiple RSS links in R

I am trying to scrape multiple RSS links in R (about 800 news articles).

I was able to scrape an individual URL with:

    library(rvest)

    cnn_url <- "http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html"
    cnn_html <- read_html(cnn_url)
    cnn_html

    cnn_nodes <- cnn_html %>% html_elements(".Article__body")
    # look for texts
    cnn_texts <- cnn_html %>%
      html_elements(".Article__body") %>%
      html_text()

    cnn_texts[1]

But I need the main text of all 800 news stories, and I can't run the code above by hand for each URL. So I used this code:

    cnn_data <- news_data %>%
      filter(media_name == "CNN")
    head(cnn_data$url)
    head(cnn_data)

    urls <- cnn_data[,4] #column with url
    url_xml <- try(apply(urls, 1, read_html)) 

    textScraper <- function(x) {
      html_text(html_nodes(x, ".Article__body") %>%
                  html_nodes("p")) %>%
        paste(collapse = '')
    }

    cnn_text <- lapply(url_xml, textScraper)
    cnn_text[1]

    cnn_data$full_article <- cnn_text
    head(cnn_data$full_article)

But when I ran the line:

    url_xml <- try(apply(urls, 1, read_html))

I got an error message that says: Error in open.connection(x, "rb") : HTTP error 404.

I assume this may be because the URLs are linked to RSS; is there any way I can scrape those news stories by using the URLs that I have?

FYI: the data file consists of rows with links like these:

http://rss.cnn.com/~r/rss/cnn_travel/~3/-GFuCIsYZgQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WvpC9ZKjJXo/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/przZf_johNY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/TieFj4roU_M/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/iqRZ7f8MhzQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/Uq46bJROhiI/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6u-D9sna6uY/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/JNTXgcM1yY0/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/WG8UTHcZvwQ/index.html
http://rss.cnn.com/~r/rss/cnn_latest/~3/6YHMwdj6W7s/index.html

Answer:

Your try() wraps the whole apply() call, so a single bad link makes apply() fail and try() only catches that one failure; the pages that did load are discarded along with it. Wrap try() around read_html() instead, not around apply().
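The difference is easy to see without touching the network. Here is a small sketch in base R; fake_read() is a made-up stand-in for read_html() that fails on one input, and the names links, whole, and each are only for illustration:

```r
# fake_read() is a hypothetical stand-in for read_html(): it fails on "bad"
fake_read <- function(url) {
  if (url == "bad") stop("HTTP error 404.")
  paste("page for", url)
}

links <- c("a", "bad", "c")

# try() around the whole loop: one failure replaces every result
whole <- try(lapply(links, fake_read), silent = TRUE)
inherits(whole, "try-error")   # TRUE, and the pages for "a" and "c" are lost

# try() around each call: only the failing element becomes a "try-error"
each <- lapply(links, function(u) try(fake_read(u), silent = TRUE))
sapply(each, inherits, what = "try-error")   # FALSE TRUE FALSE
```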
Something like this should work, returning a list of web pages. Note that all of the links you listed above work.

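    library(rvest)

    # cnn_data$url is the URL column as a plain vector, so lapply() visits one URL at a time
    mylist <- lapply(cnn_data$url, function(url) {
       # be kind and do not hammer the server
       Sys.sleep(1)
       print(url)  # debug
       # a failed request returns a "try-error" object instead of stopping the loop
       try(read_html(url))
    })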
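
The list that comes back mixes parsed pages with "try-error" objects, so the failures have to be separated out before extracting any text. A minimal sketch, assuming rvest is installed; the inline HTML strings and the names pages, ok, and article_text are made up for illustration and stand in for real downloads:

```r
library(rvest)

# Stand-ins for downloaded pages: two parsed documents and one failed request
pages <- list(
  read_html("<div class='Article__body'><p>First story.</p></div>"),
  try(stop("HTTP error 404."), silent = TRUE),
  read_html("<div class='Article__body'><p>Second</p><p> story.</p></div>")
)

# TRUE where the request succeeded
ok <- !vapply(pages, inherits, logical(1), what = "try-error")

# Same selectors as textScraper in the question
textScraper <- function(x) {
  x %>%
    html_elements(".Article__body") %>%
    html_elements("p") %>%
    html_text() %>%
    paste(collapse = "")
}

# Keep one slot per URL so the result lines up with the data frame rows
article_text <- rep(NA_character_, length(pages))
article_text[ok] <- vapply(pages[ok], textScraper, character(1))
article_text
```

Because article_text has one entry per URL, with NA for the failed requests, it can be assigned straight to a column such as cnn_data$full_article.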

Yes, it is possible to write code that handles the different page layouts, but that is potentially a much bigger question to answer.
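
As a rough starting point only, one way to cope with more than one layout is to try a list of selectors in order and keep the first that matches. In this sketch only ".Article__body" comes from the question; ".article__content", the helper extract_body(), and the inline pages are hypothetical, so the real selectors would have to be found by inspecting the pages that come back empty:

```r
library(rvest)

# Candidate selectors, tried in order. Only ".Article__body" comes from the
# question; ".article__content" is a hypothetical placeholder.
selectors <- c(".Article__body", ".article__content")

# Return the paragraph text from the first selector that matches, else NA
extract_body <- function(page, selectors) {
  for (sel in selectors) {
    paragraphs <- page %>%
      html_elements(sel) %>%
      html_elements("p") %>%
      html_text()
    if (length(paragraphs) > 0) return(paste(paragraphs, collapse = " "))
  }
  NA_character_
}

# Inline stand-ins for pages with different layouts
page_a <- read_html("<div class='Article__body'><p>Travel story.</p></div>")
page_b <- read_html("<div class='article__content'><p>Video page.</p><p>Caption.</p></div>")
page_c <- read_html("<div class='other'><p>Unknown layout.</p></div>")

extract_body(page_a, selectors)
extract_body(page_b, selectors)
extract_body(page_c, selectors)
```

Pages that match none of the selectors come back as NA, which makes them easy to list and inspect by hand.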