I am trying to scrape player data from Transfermarkt. The URLs for the individual teams are almost identical, but three parts of each URL change.
I have already scraped 5 years of data, but only for one team, and I want to do it for all of them.
library(rvest)
library(purrr)

# make a target url with the relevant year (%d is filled in by sprintf)
url_base <- 'https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=%d'
map_df(2017:2021, function(i) {
  # simple but effective progress indicator
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(name=html_text(html_nodes(pg, ".hauptlink a , #yw1_c1")),
             date_of_birth=html_text(html_nodes(pg, ".posrela .zentriert , .sort-link")),
             market_value=html_text(html_nodes(pg, ".rechts")),
             season=i,
             stringsAsFactors=FALSE)
}) -> asSquad
Example of URLs per team:
So far I have been able to scrape one team for 5 years. How can I scrape all teams at once when three parts of the URL change?
Any advice is welcome. Thank you!
CodePudding user response:
Something like:
library(rvest)

# team slugs and their matching club ids, in the same order
teams <- c("as-trencin", "slovan-bratislava")
var2 <- c("7918", "540")
years <- 2017:2018
all <- data.frame()
for (i in seq_along(teams)) {
  for (year in years) {
    url <- paste0("https://www.transfermarkt.com/", teams[i], "/kader/verein/", var2[i], "/plus/1/galerie/0?saison_id=", year)
    print(url)
    # scrape the page, store it in a data frame, then append it to the main data frame
    pg <- read_html(url)
    asSquad <- data.frame(
      name=stringi::stri_trim(html_text(html_nodes(pg, ".hauptlink a , #yw1_c1"))),
      date_of_birth=html_text(html_nodes(pg, ".posrela .zentriert , .sort-link")),
      market_value=html_text(html_nodes(pg, ".rechts")),
      season=year,
      stringsAsFactors=FALSE)
    # drop the first row, which appears to be the table header matched by the selectors
    asSquad <- asSquad[-1, ]
    all <- rbind(all, asSquad)
  }
}
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2018"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2018"
If var2 differs for the same team, then add another loop.
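An alternative to nested loops, as a rough sketch only: keep each team slug and its id together in one lookup table, cross that table with the seasons, and build every URL in a single vectorised `sprintf()` call. The names `team_table` and `grid` are hypothetical and not part of the code above; the slugs, ids and URL pattern are the ones already shown.

```r
# Hypothetical lookup table: one row per (slug, id) pair, values from the answer above
team_table <- data.frame(
  slug = c("as-trencin", "slovan-bratislava"),
  id   = c("7918", "540"),
  stringsAsFactors = FALSE
)
years <- 2017:2018

# The two data frames share no column names, so merge() returns
# every combination of team row and season (a cross join)
grid <- merge(team_table, data.frame(season = years))

# sprintf() is vectorised, so this fills in all three changing parts per row
grid$url <- sprintf(
  "https://www.transfermarkt.com/%s/kader/verein/%s/plus/1/galerie/0?saison_id=%d",
  grid$slug, grid$id, grid$season
)

print(grid$url)
```

Each row of `grid` then carries everything needed for one request, so the scraping step could run over `grid$url` with `lapply()` and the results could be combined with `do.call(rbind, ...)`. If a team's id ever differs, adding another row to the table should be enough and no extra loop is needed.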
Grzegorz
