If I query my target link as follows:
library(jsonlite)
link <- "https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch_single.php?pid=1"
df <- fromJSON(link)
I get a JSON list with one element: df$html. I would like to parse this HTML using rvest in order to access tags like psize and pstatus. But the double backslashes \\ seem to stop me. Any idea how to formulate my rvest query correctly? I'm thinking of something like:
df$html %>% html_node(xpath = '//div[contains(@class, \"psize\")]') %>% html_text()
Answer:
By combining a few different functions, you can get there. This is not supposed to be a 100% correct answer, but it may give you some ideas about how to format the string.
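library(rvest)  # read_html(), html_node(), html_text(), and the %>% pipe
library(tidyr)  # separate()
# Read the response as HTML and pull the text out of the first <div> in the body
parts <- read_html(link) %>%
  html_node(xpath = '/html/body/div') %>%
  html_text() %>%
  # split on the literal \n and \t escape sequences left in the text
  strsplit(split = "\\\\n|\\\\t")
# Drop NA and empty pieces (named `parts` so that base::split is not masked)
parts <- parts[[1]][!is.na(parts[[1]]) & parts[[1]] != ""]
# Keep the first five "label: value" entries and split them into two columns
data.frame(col1 = parts[1:5]) %>%
  separate(col = col1, into = c("col1", "col2"), sep = ": ", extra = "drop")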
  col1          col2
1 Size          85000 ha
2 Status        In development
3 Description   REDD project in Madre de Dios, Peru
4 Objective     Carbon sequestration or avoided, Carbon sequestration or avoided
5 Interventions Afforestation or reforestation
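As a possible alternative to splitting the text by hand: the value that fromJSON() returns is an ordinary character string, and the doubled backslashes are only how R prints escaped characters. html_node() needs a parsed document rather than a string, so passing the string through read_html() first may be all that is missing. Below is a minimal sketch of that idea. The html_string value is a made-up stand-in for df$html, and it assumes psize and pstatus are class names on div elements, so the real markup may need different selectors.

```r
library(rvest)

# Hypothetical stand-in for df$html; the real markup may differ.
html_string <- "<div class=\"psize\">Size: 85000 ha</div>\n\t<div class=\"pstatus\">Status: In development</div>"

# Parse the character string into a document that rvest can query.
doc <- read_html(html_string)

# XPath lookup, as in the question (assumes psize is a class on a div).
size <- doc %>%
  html_node(xpath = "//div[contains(@class, 'psize')]") %>%
  html_text(trim = TRUE)

# Equivalent lookup with a CSS selector (assumes pstatus is also a class).
status <- doc %>%
  html_node(css = ".pstatus") %>%
  html_text(trim = TRUE)

print(size)
print(status)
```

With the real data, the same calls would start from read_html(df$html) instead of the stand-in string.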
