Is it possible for R to read URLs in a column and then paste each page's title into another column? Asking for a lot, I know, but is it possible?
CodePudding user response:
Yes, it is.
library(xml2)
url <- "http://www.sapijaszko.net"
html <- read_html(url)
title <- xml_text(xml_find_all(html, ".//title"))
df <- data.frame("url" = url, "title" = title)
df
#>                         url          title
#> 1 http://www.sapijaszko.net sapijaszko.net
Now, you can create your list of URLs, get the titles, and compose a data frame.
Created on 2022-01-24 by the reprex package (v2.0.1)
CodePudding user response:
Just wrap it in a function, then run it over your list of URLs.
Let's load the two libraries we will use in this example.
library(xml2) # reads and manipulates XML/HTML documents
library(dplyr) # tools for wrangling data frames
Let's create a list of URLs and put it in a data frame:
urls <- data.frame(url = c("https://cran.r-project.org/",
                           "https://stackoverflow.com/"))
Right now, urls is a data.frame object with one column called url. For details, you can run str(urls).
Let's create a function that returns the title of a web page. It takes a URL as its argument, fetches the page content, searches it for the title, and returns that title. The function uses three functions from the xml2 package: read_html() fetches and parses the page, xml_find_all() searches the parsed document for nodes matching an XPath expression (in our case the title element), and xml_text() extracts the text of those nodes.
getTitle <- function(url) {
  html <- read_html(url)
  title <- xml_text(xml_find_all(html, ".//title"))
  return(title)
}
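To see what each xml2 call inside getTitle() contributes, the same steps can be tried on a small HTML string instead of a live page. This is only a sketch: the sample markup and the names page, nodes, and page_title are made up for illustration. It relies on read_html() accepting literal HTML text as well as URLs, so no network access is needed.

```r
library(xml2)

# Made-up sample page; read_html() parses literal HTML text as well as URLs.
page <- read_html("<html><head><title>Example page</title></head><body><p>Hello</p></body></html>")

# xml_find_all() returns every node that matches the XPath expression.
nodes <- xml_find_all(page, ".//title")
length(nodes)

# xml_text() extracts the text content of those nodes.
page_title <- xml_text(nodes)
page_title
```

On this sample there is exactly one title node, so page_title holds a single string. A real page may contain more than one title element (for example inside inline SVG), in which case xml_find_all() returns all of them.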
Now we apply our function to the list of URLs and create another column with the titles. We use mutate() from the dplyr package to create the column and lapply() to apply our function to each URL. The .before = "url" argument places the new column in front of url.
urls |>
  mutate(title = lapply(url, getTitle), .before = "url")
As a result, we will get output similar to this:
                                                            title                         url
1                             The Comprehensive R Archive Network https://cran.r-project.org/
2 Stack Overflow - Where Developers Learn, Share, & Build Careers  https://stackoverflow.com/
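Two things about this approach can be surprising in practice: lapply() makes title a list column, and a single unreachable URL stops the whole pipeline with an error. One possible, more defensive variant is sketched below. safeTitle() is a hypothetical helper, not part of the answer above, and returning NA for pages that cannot be read is an assumption about what is wanted. It uses xml_find_first() so that at most one title comes back per page, and vapply() so the result is a plain character column. Temporary local files stand in for URLs so the sketch runs offline; read_html() accepts file paths as well as URLs.

```r
library(xml2)

# Hypothetical helper: returns the page title, or NA if the page cannot be
# read or has no <title> element.
safeTitle <- function(url) {
  tryCatch({
    html <- read_html(url)
    # xml_find_first() returns one node, or a missing node if nothing matches;
    # xml_text() turns a missing node into NA.
    xml_text(xml_find_first(html, ".//title"))
  }, error = function(e) NA_character_)
}

# Temporary local files stand in for real URLs so the example runs offline.
good <- tempfile(fileext = ".html")
writeLines("<html><head><title>Local test page</title></head><body></body></html>", good)
missing <- tempfile(fileext = ".html")  # never created, so reading it fails

pages <- data.frame(url = c(good, missing), stringsAsFactors = FALSE)

# vapply() enforces exactly one character value per URL.
pages$title <- vapply(pages$url, safeTitle, character(1), USE.NAMES = FALSE)
pages$title
```

With real URLs the same helper can replace getTitle in the mutate() call, using vapply(url, safeTitle, character(1)) in place of lapply(url, getTitle).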
So, the whole code:
library(xml2)
library(dplyr)
urls <- data.frame(url = c("https://cran.r-project.org/", "https://stackoverflow.com/"))
getTitle <- function(url) {
  html <- read_html(url)
  title <- xml_text(xml_find_all(html, ".//title"))
  return(title)
}
urls |>
  mutate(title = lapply(url, getTitle), .before = "url")
For details on any function, you can run ?function_name, for example ?lapply.
