Dynamically scrape content with Rvest where 3 parts of URL are dynamically changing

Time:01-22

I am trying to scrape player data from transfermarkt. The URLs for individual teams are almost identical, but three parts of each URL change: the team name, the club id, and the season.

I have already scraped 5 years of data, but only for one team, and I want to do the same for all of them.

library(rvest)  # read_html(), html_nodes(), html_text()
library(purrr)  # map_df()

# target URL with a placeholder (%d) for the season
url_base <- 'https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=%d'

asSquad <- map_df(2017:2021, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(name=html_text(html_nodes(pg, ".hauptlink a , #yw1_c1")),
             date_of_birth=html_text(html_nodes(pg, ".posrela  .zentriert , .sort-link")),
             market_value=html_text(html_nodes(pg, ".rechts")),
             season=i,
             stringsAsFactors=FALSE)

})

Example of URLs per team:

https://www.transfermarkt.com/**as-trencin**/kader/verein/**7918**/plus/1/galerie/0?saison_id=**2017**

https://www.transfermarkt.com/**slovan-bratislava**/kader/verein/**540**/plus/1/galerie/0?saison_id=**2019**

So far I have been able to scrape one team for 5 years. How can I scrape all teams at once when three parts of the URL change?

Any advice is welcome. Thank you!

CodePudding user response:

Something like:

library(rvest)

# team slugs and their club ids, paired by position
teams <- c("as-trencin", "slovan-bratislava")
var2 <- c("7918", "540")
years <- 2017:2018

all_squads <- data.frame()

for (i in seq_along(teams)) {
  for (year in years) {
    url <- paste0("https://www.transfermarkt.com/", teams[i], "/kader/verein/", var2[i], "/plus/1/galerie/0?saison_id=", year)
    print(url)
    # scrape the page, store the result in a data frame, then append it to the main data frame
    pg <- read_html(url)
    asSquad <- data.frame(
      name=stringi::stri_trim(html_text(html_nodes(pg, ".hauptlink a , #yw1_c1"))),
      date_of_birth=html_text(html_nodes(pg, ".posrela  .zentriert , .sort-link")),
      market_value=html_text(html_nodes(pg, ".rechts")),
      season=year,
      stringsAsFactors=FALSE)
    # drop the first row (the header cells that the selectors also appear to match)
    asSquad <- asSquad[-1, ]

    all_squads <- rbind(all_squads, asSquad)
  }
}
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2018"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2018"

If var2 (the club id in the URL) differs for the same team, add another loop.
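
As a possible alternative to the nested loops, the URLs can be built up front from a small lookup table. The following is only a sketch in base R: the names `clubs`, `template`, `grid` and `urls` are made up for illustration, and it assumes each team has exactly one club id. It only constructs the URLs and does not download anything.

```r
# Hypothetical lookup table: one row per club, slug and id kept side by side
clubs <- data.frame(
  team = c("as-trencin", "slovan-bratislava"),
  id   = c("7918", "540"),
  stringsAsFactors = FALSE
)
years <- 2017:2018

# URL pattern with three placeholders: team slug, club id, season
template <- "https://www.transfermarkt.com/%s/kader/verein/%s/plus/1/galerie/0?saison_id=%d"

# Every combination of club (by row index) and season
grid <- expand.grid(club = seq_len(nrow(clubs)), season = years)

# sprintf() is vectorised, so this yields one URL per row of grid
urls <- sprintf(template, clubs$team[grid$club], clubs$id[grid$club], grid$season)

print(urls)
```

Each element of `urls` could then be passed to `read_html()` in a single loop, with `grid$season` supplying the matching season for the `season` column.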

Grzegorz
