Parsing a JSON document that contains HTML in R

If I query my target link as follows:

library(jsonlite)
link <- "https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch_single.php?pid=1"
df <- fromJSON(link)

I get a list with one element: df$html. I would like to parse this HTML using rvest in order to access elements with classes such as psize and pstatus, but the escape sequences (the double backslashes \\) seem to get in the way. Any idea how to formulate my rvest query correctly? I'm thinking of something like:

df$html %>% html_node(xpath = '//div[contains(@class, \"psize\")]') %>% html_text()

CodePudding user response:

By combining a few functions you can get there. This is not meant to be a fully correct answer, but it should give some ideas about how to clean up the string.

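library(rvest)
library(tidyr)

# `link` is the URL defined in the question.
# Read the raw response as HTML, take the first <div>, and split its text on
# the literal "\n" and "\t" escape sequences left over from the JSON encoding.
split <- read_html(link) %>%
  html_node(xpath = '/html/body/div') %>%
  html_text() %>%
  strsplit(., split = "\\\\n|\\\\t")

# Drop empty pieces, then split each "key: value" string into two columns.
split <- split[[1]][!is.na(split[[1]]) & split[[1]] != ""]
data.frame(col1 = split[1:5]) %>%
  separate(col = col1, into = c("col1", "col2"), sep = ": ", extra = "drop")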

          col1                                                             col2
1          Size                                                         85000 ha
2        Status                                                   In development
3   Description                              REDD project in Madre de Dios, Peru
4     Objective Carbon sequestration or avoided, Carbon sequestration or avoided
5 Interventions                                   Afforestation or reforestation
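
A more direct route may also work. fromJSON() already decodes the JSON escapes, so the backslashes you see are only how R prints the string, and df$html can be passed straight to read_html(). Below is a minimal sketch of that idea. The json_txt string is a made-up stand-in for the server response (the real markup will differ), and the class names psize and pstatus are taken from the question rather than verified against the site.

```r
library(jsonlite)
library(rvest)

# Hypothetical stand-in for the server response; the real markup may differ.
json_txt <- '{"html":"<div class=\\"psize\\">Size: 85000 ha</div>\\n\\t<div class=\\"pstatus\\">Status: In development</div>"}'

# fromJSON() decodes the JSON escapes (\", \n, \t), so payload$html holds
# ordinary HTML text rather than an escaped string.
payload <- fromJSON(json_txt)

# read_html() accepts a string of HTML as well as a URL or a file path.
page <- read_html(payload$html)

# Select by class, once with XPath (as in the question) and once with CSS.
size   <- page %>% html_node(xpath = '//div[contains(@class, "psize")]') %>% html_text()
status <- page %>% html_node(css = "div.pstatus") %>% html_text()

print(size)
print(status)
```

With the real data, replacing json_txt with link in the fromJSON() call should be the only change needed, assuming the response has the shape described in the question.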