I want to scrape this web page and collect the tables for every company into a single DataFrame:
https://rk.americaeconomia.com/display/embed/500-latam/2021
It has been very difficult to find a solution.
CodePudding user response:
Situation
There are two different types of <tr>: one holds the data, and the other holds extra data that belongs to the row before it, so the two have to be brought together. In addition, the extra data is not stored in a regular table structure.
How to achieve?
Each pair of data row and extra-data row has to be joined into a single row. There are many ways to do this; this approach works with DataFrames. The joined DataFrames are collected in a list and concatenated once the loop has finished.
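data = []
# Header cells of the first table row become the column names
columns = list(soup.select_one('#awesomeTable tr').stripped_strings)
for row in soup.select('#awesomeTable tbody tr.extraDataRow'):
    # The data row is the <tr> directly before each extra-data row
    df = pd.DataFrame([list(row.find_previous('tr').stripped_strings)], columns=columns)
    # Each <p> is expected to hold a label/value pair; join them on as extra columns
    df = df.join(pd.DataFrame(dict([list(x.stripped_strings) for x in row.select('p')]), index=df.index))
    data.append(df)
df = pd.concat(data)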
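As one of the other options mentioned above, the same pairing can be done with plain dicts, building the DataFrame only once at the end. The sketch below is not the approach used in this answer; it runs against a small hand-written HTML snippet that only mimics the assumed layout of the page (the `HTML` string, the `<b>`/`<span>` tags inside each `<p>`, and the name `df_demo` are all made up for illustration), so it can be tried without requesting the live site.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the assumed layout: a data <tr> followed by
# an "extraDataRow" <tr> whose <p> tags each hold a label and a value.
HTML = """
<table id="awesomeTable">
  <thead><tr><th>RK 2021</th><th>EMPRESA</th><th>PAÍS</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>PETROBRAS</td><td>BRA</td></tr>
    <tr class="extraDataRow"><td colspan="3">
      <p><b>SECTOR / RUBRO</b> <span>Petróleo/Gas</span></p>
      <p><b>SITIO WEB (www.)</b> <span>petrobras.com</span></p>
    </td></tr>
    <tr><td>2</td><td>JBS</td><td>BRA</td></tr>
    <tr class="extraDataRow"><td colspan="3">
      <p><b>SECTOR / RUBRO</b> <span>Alimentos</span></p>
      <p><b>SITIO WEB (www.)</b> <span>jbs.com.br</span></p>
    </td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Column names come from the first <tr> of the table (the header row).
columns = list(soup.select_one("#awesomeTable tr").stripped_strings)

records = []
for extra in soup.select("#awesomeTable tbody tr.extraDataRow"):
    # The data row is the closest <tr> before the extra-data row.
    main = extra.find_previous("tr")
    record = dict(zip(columns, main.stripped_strings))
    # Add each label/value pair from the extra-data row as a new column.
    for p in extra.select("p"):
        parts = list(p.stripped_strings)
        if len(parts) == 2:  # skip anything that is not a clean pair
            record[parts[0]] = parts[1]
    records.append(record)

# One row per company, built in a single DataFrame constructor call.
df_demo = pd.DataFrame(records)
print(df_demo)
```

Building one DataFrame from a list of dicts avoids creating and concatenating many one-row DataFrames, which may matter for a table with several hundred companies.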
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

# A browser-like user agent is sent with the request
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get('https://rk.americaeconomia.com/display/embed/500-latam/2021', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

data = []
# Header cells of the first table row become the column names
columns = list(soup.select_one('#awesomeTable tr').stripped_strings)
for row in soup.select('#awesomeTable tbody tr.extraDataRow'):
    # The data row is the <tr> directly before each extra-data row
    df = pd.DataFrame([list(row.find_previous('tr').stripped_strings)], columns=columns)
    # Each <p> is expected to hold a label/value pair; join them on as extra columns
    df = df.join(pd.DataFrame(dict([list(x.stripped_strings) for x in row.select('p')]), index=df.index))
    data.append(df)

# Stack the one-row DataFrames into the final result
df = pd.concat(data)
Output
| RK 2021 | EMPRESA | PAÍS | RK 2020* | RK 2019 | SECTOR / RUBRO | VENTAS 2020 US$ Millones | VENTAS 2019 US$ Millones | VARIACIÓN VENTAS 20/19 (%) | UTILIDAD NETA 2020 US$ Millones | UTILIDAD NETA 2019 US$ Millones | VARIACIÓN UTILIDAD 20/19 (%) | EBITDA 2020 US$ Millones | EBITDA 2019 US$ Millones | VARIACIÓN EBITDA 20/19 (%) | ACTIVO TOTAL 2020 US$ Millones | PATRIMONIO NETO 2020 US$ Millones | EMPLEADOS 2020 | EMPLEADOS 2019 | ROA (%) 2020 | ROE (%) 2020 | MARGEN NETO (%) 2020 | Presencia EN BOLSA | SITIO WEB (www.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PETROBRAS | BRA | 1 | 1 | Petróleo/Gas | 53,282.0 | 76,746.8 | -30.6 | 1,392.0 | 10,191.7 | -86.3 | 21,777.0 | 35,461.8 | -38.6 | 190,107.6 | 59,905.7 | N.D. | 58,513 | 0.7 | 2.3 | 2.6 | Sí | petrobras.com |
| 2 | JBS | BRA | 4 | 4 | Alimentos | 52,916.8 | 51,933.1 | 1.9 | 900.5 | 1,540.9 | -41.6 | 5,539.7 | 5,018.0 | 10.4 | 31,536.7 | 8,383.6 | N.D. | 234,192 | 2.9 | 10.7 | 1.7 | Sí | jbs.com.br |
| 3 | AMÉRICA MÓVIL | MX | 3 | 3 | Telecomunicaciones | 51,352.7 | 53,288.7 | -3.6 | 2,366.0 | 3,582.9 | -34 | 16,644.7 | 16,597.6 | 0.3 | 82,064.9 | 15,913.4 | 186,851 | 191,523 | 2.9 | 14.9 | 4.6 | Sí | americamovil.com |
| 4 | PEMEX | MX | 2 | 2 | Petróleo/Gas | 44,676.3 | 72,837.4 | -38.7 | -22,520.7 | -18,042.9 | 24.8 | 4,779.8 | 9,465.4 | -49.5 | 95,527.7 | -122,327.0 | N.D. | 156,614 | -23.6 | 18.4 | -50.4 | No | pemex.com |
| 5 | VALE | BRA | 5 | 5 | Minería | 40,838.3 | 37,743.0 | 8.2 | 5,231.4 | -1,694.0 | -408.8 | 14,527.9 | 4,995.8 | 190.8 | 92,054.2 | 34,845.2 | N.D. | 71,149 | 5.7 | 15 | 12.8 | Sí | vale.com |
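
All values in the output are scraped as strings. If numeric columns are needed, a conversion along the following lines could be applied afterwards. This is only a sketch: it assumes that the comma is a thousands separator and that markers such as N.D. should become NaN; the `sample` frame (two rows copied from the output above) and the `to_number` helper are made up for illustration.

```python
import pandas as pd

# Two rows copied from the output above, limited to a few columns.
sample = pd.DataFrame({
    "EMPRESA": ["PETROBRAS", "AMÉRICA MÓVIL"],
    "VENTAS 2020 US$ Millones": ["53,282.0", "51,352.7"],
    "EMPLEADOS 2020": ["N.D.", "186,851"],
})


def to_number(column: pd.Series) -> pd.Series:
    """Drop thousands separators; non-numeric markers such as 'N.D.' become NaN."""
    cleaned = column.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Convert only the columns that are expected to hold numbers.
for name in ["VENTAS 2020 US$ Millones", "EMPLEADOS 2020"]:
    sample[name] = to_number(sample[name])

print(sample.dtypes)
```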
