Home > database >  Trying to use importxml in google sheets to extract a value from a website, but it returns a complet
Trying to use importxml in google sheets to extract a value from a website, but it returns a complet

Time:01-26

image of my google sheet

I am trying to create a sheet where I want to scrape values from www.tapology.com of individual fighters and fill columns created like fighter's name, his record, where he is from, etc.

I would like to scrape some values of this fighter so I started with name:

https://www.tapology.com/fightcenter/fighters/119825-dominik-mazur

I put together something like this:

=IMPORTXML("https://www.tapology.com/fightcenter/fighters/119825-dominik-mazur", "//div[@id='stats']/ul[1]/li[1]/span[1]")

The XPath should lead to the name of the fighter which is Dominik Mazur, but for some reason it returns a different fighters name at random.

I guess it gets all the names that have the same structure, but I have no idea how to single out the value that I'm looking for.

Generally my goal is to create a google sheet where I paste a link to any fighter's tapology and it fills my sheet with the information I want instead of me manually typing everything.

Unfortunately I am not a programmer or very technical person so I would really appreciate your help. If anything is unclear please let me know.

CodePudding user response:

Solution

Try this

=arrayformula(split(transpose(split(scraping("yourURLhere"),"♥")),"♦"))

with custom function

function scraping(url){
  var source = UrlFetchApp.fetch(url).getContentText().replace(/(\r\n|\n|\r|\t|  )/gm,"")
  var data = source.match(/<div class='details details_two_columns' id='stats'>[\s\S\w] ?<\/div>/g)
  return (data[0].replace(/<li>/g,'♥').replace(/<span>/g,'♦').replace(/<\/li>|<strong>|<\/strong>|<\/span>/g,'').replace(/(<([^>] )>)/g,''))
}

Scrapping multiple urls

function tapology(url){
  var source = UrlFetchApp.fetch(url).getContentText().replace(/(\r\n|\n|\r|\t|  )/gm,"")
  var data = source.match(/<div class='details details_two_columns' id='stats'>[\s\S\w] ?<\/div>/g)
  var infos = data[0].replace(/<li>/g,'♥').replace(/<span>/g,'♦').replace(/<\/li>|<strong>|<\/strong>|<\/span>/g,'').replace(/(<([^>] )>)/g,'').replace(/&#39;/,'’').replace(/&quot;/,'”').split('♥')
  var result = []
  infos.forEach(function(info){result.push(info.split('♦')[1])})
  return([result])
}

A comparison between importxml and custom function:

https://docs.google.com/spreadsheets/d/1se1lWH4g4OSvAW2NldjGiTDS5XB19_5QwSBpMqVRy84/copy

  •  Tags:  
  • Related