How to scrape and merge visible and hidden data from table with BeautifulSoup?

Time:01-06

I want to scrape this web page and combine the tables for every company into a single DataFrame:

https://rk.americaeconomia.com/display/embed/500-latam/2021

or

https://www.americaeconomia.com/negocios-industrias/estas-son-las-500-mayores-empresas-de-america-latina-2021

It has been very difficult to find a solution.

CodePudding user response:

Situation

There are two types of <tr>: one holds the visible data, and the other holds extra data that belongs to the row before it, so the two have to be brought together. In addition, the extra data is not stored in a regular table structure.

How to achieve?

Each pair of data and extra data has to be joined into a single row. There are many ways to do this; this approach uses DataFrames. The joined DataFrames are collected in a list and concatenated once the iteration has finished.

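data = []
# Column names come from the first row of the table (the header).
columns = list(soup.select_one('#awesomeTable tr').stripped_strings)
for row in soup.select('#awesomeTable tbody tr.extraDataRow'):
    # The visible data row is the <tr> just before each extra-data row.
    df = pd.DataFrame([list(row.find_previous('tr').stripped_strings)], columns=columns)
    # Each <p> in the extra row yields a label/value pair; join them as extra columns.
    df = df.join(pd.DataFrame(dict([list(x.stripped_strings) for x in row.select('p')]), index=df.index))
    data.append(df)

# Combine the one-row DataFrames and renumber the index.
df = pd.concat(data, ignore_index=True)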
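
The pairing logic can also be tried offline. The following is only a minimal sketch: the HTML fragment is hypothetical and merely imitates the layout described above (the real page's markup, including the <b> labels used here, may differ), and it collects plain dictionaries instead of one-row DataFrames. Cells are read one <td> at a time, so an empty cell would still occupy its column.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the assumed layout: a visible data row
# followed by an "extraDataRow" whose details sit in <p> tags.
html = """
<table id="awesomeTable">
  <thead><tr><th>RK 2021</th><th>EMPRESA</th><th>PAÍS</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>PETROBRAS</td><td>BRA</td></tr>
    <tr class="extraDataRow"><td colspan="3">
      <p><b>EMPLEADOS 2019</b> 58,513</p>
      <p><b>SITIO WEB (www.)</b> petrobras.com</p>
    </td></tr>
    <tr><td>2</td><td>JBS</td><td>BRA</td></tr>
    <tr class="extraDataRow"><td colspan="3">
      <p><b>EMPLEADOS 2019</b> 234,192</p>
      <p><b>SITIO WEB (www.)</b> jbs.com.br</p>
    </td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
header = [th.get_text(strip=True) for th in soup.select("#awesomeTable thead th")]

records = []
for extra in soup.select("#awesomeTable tbody tr.extraDataRow"):
    # The visible row is the nearest preceding <tr> sibling.
    visible = extra.find_previous_sibling("tr")
    # Read cell by cell so an empty <td> keeps its position.
    cells = [td.get_text(strip=True) for td in visible.find_all("td")]
    record = dict(zip(header, cells))
    for p in extra.select("p"):
        parts = list(p.stripped_strings)
        if not parts:
            continue  # skip empty paragraphs
        # First string is taken as the label; the value may be missing.
        record[parts[0]] = parts[1] if len(parts) > 1 else None
    records.append(record)

# One DataFrame built in a single step from all merged rows.
df = pd.DataFrame(records)
print(df)
```

Building the DataFrame once from a list of dictionaries avoids creating and concatenating many one-row DataFrames, and a label that is missing for some company simply becomes a missing value in that column.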

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
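r = requests.get('https://rk.americaeconomia.com/display/embed/500-latam/2021', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

data = []
# Column names come from the first row of the table (the header).
columns = list(soup.select_one('#awesomeTable tr').stripped_strings)
for row in soup.select('#awesomeTable tbody tr.extraDataRow'):
    # The visible data row is the <tr> just before each extra-data row.
    df = pd.DataFrame([list(row.find_previous('tr').stripped_strings)], columns=columns)
    # Each <p> in the extra row yields a label/value pair; join them as extra columns.
    df = df.join(pd.DataFrame(dict([list(x.stripped_strings) for x in row.select('p')]), index=df.index))
    data.append(df)

# Combine the one-row DataFrames and renumber the index.
df = pd.concat(data, ignore_index=True)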

Output

| RK 2021 | EMPRESA | PAÍS | RK 2020* | RK 2019 | SECTOR / RUBRO | VENTAS 2020 US$ Millones | VENTAS 2019 US$ Millones | VARIACIÓN VENTAS 20/19 (%) | UTILIDAD NETA 2020 US$ Millones | UTILIDAD NETA 2019 US$ Millones | VARIACIÓN UTILIDAD 20/19 (%) | EBITDA 2020 US$ Millones | EBITDA 2019 US$ Millones | VARIACIÓN EBITDA 20/19 (%) | ACTIVO TOTAL 2020 US$ Millones | PATRIMONIO NETO 2020 US$ Millones | EMPLEADOS 2020 | EMPLEADOS 2019 | ROA (%) 2020 | ROE (%) 2020 | MARGEN NETO (%) 2020 | Presencia EN BOLSA | SITIO WEB (www.) |
| 1 | PETROBRAS | BRA | 1 | 1 | Petróleo/Gas | 53,282.0 | 76,746.8 | -30.6 | 1,392.0 | 10,191.7 | -86.3 | 21,777.0 | 35,461.8 | -38.6 | 190,107.6 | 59,905.7 | N.D. | 58,513 | 0.7 | 2.3 | 2.6 |  | petrobras.com |
| 2 | JBS | BRA | 4 | 4 | Alimentos | 52,916.8 | 51,933.1 | 1.9 | 900.5 | 1,540.9 | -41.6 | 5,539.7 | 5,018.0 | 10.4 | 31,536.7 | 8,383.6 | N.D. | 234,192 | 2.9 | 10.7 | 1.7 |  | jbs.com.br |
| 3 | AMÉRICA MÓVIL | MX | 3 | 3 | Telecomunicaciones | 51,352.7 | 53,288.7 | -3.6 | 2,366.0 | 3,582.9 | -34 | 16,644.7 | 16,597.6 | 0.3 | 82,064.9 | 15,913.4 | 186,851 | 191,523 | 2.9 | 14.9 | 4.6 |  | americamovil.com |
| 4 | PEMEX | MX | 2 | 2 | Petróleo/Gas | 44,676.3 | 72,837.4 | -38.7 | -22,520.7 | -18,042.9 | 24.8 | 4,779.8 | 9,465.4 | -49.5 | 95,527.7 | -122,327.0 | N.D. | 156,614 | -23.6 | 18.4 | -50.4 | No | pemex.com |
| 5 | VALE | BRA | 5 | 5 | Minería | 40,838.3 | 37,743.0 | 8.2 | 5,231.4 | -1,694.0 | -408.8 | 14,527.9 | 4,995.8 | 190.8 | 92,054.2 | 34,845.2 | N.D. | 71,149 | 5.7 | 15 | 12.8 |  | vale.com |