Home > Enterprise >  Beautiful Soup cannot find table on iShares
Beautiful Soup cannot find table on iShares

Time:01-07

I have been trying to scrape ETF data from iShares.com for an ongoing project for a while now. I am trying to create web scrapers for multiple websites but they are all identical. Essentially I run into two issues:

  1. I keep getting the error :"AttributeError: 'NoneType' object has no attribute 'tr'" although I am quite sure that I have chosen the correct table.

  2. When I look into the "Inspect elements" on some of the websites, I have to click the "Show more" in order to see the code for all of the rows.

I am not a computer scientist, but I have tried many different approaches which have sadly all been unsuccessful so I hope you can help.

The URL: https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf

The table can be found on the URL under "Holdings". Alternatively, it can be found under the following paths: JS Path: <document.querySelector("#allHoldingsTable > tbody")> xPath: //*[@id="allHoldingsTable"]/tbody

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup


urls = [
'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf'
]

all_data = []
for url in urls:
    print("Loading URL {}".format(url))

    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # find correct table:
    tbl = soup.select_one(".allHoldingsTable")

    # remove the first row (it's not header):
    tbl.tr.extract()

    # convert the html to pandas DF:
    df = pd.read_html(str(tbl),thousands='.', decimal=',')[0]

    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})

    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])

df = pd.concat(all_data, ignore_index=True)
print(df)


from openpyxl import load_workbook
path= '/Users/karlemilthulstrup/Downloads/ishares.xlsx'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
df.to_excel(writer, index=False)
writer.save()
writer.close()

CodePudding user response:

As stated in the comments, the data is dynamically rendered. If you don't want to go the route of accessing the data directly, you could use something like Selenium, that will allow the page to render, THEN you can go in there the way you have it above.

Also, there's a button that will download this into a csv for you. Why not just do that?

But if you must scrape the page, you get the data in json format. Just parse it:

import requests
import json
import pandas as pd

url = 'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf/1506575576011.ajax?tab=all&fileType=json'
r = requests.get(url)
r.encoding='utf-8-sig'
jsonData = json.loads(r.text)


rows = []
for each in jsonData['aaData']:
    row = {'Issuer Ticker':each[0],
     'Name':each[1],
     'Sector':each[2],
     'Asset Class':each[3],
     'Market Value':each[4]['display'],
     'Market Value Raw':each[4]['raw'],
     'Weight (%)':each[5]['display'],
     'Weight (%) Raw':each[5]['raw'],
     'Notaional Value':each[6]['display'],
     'Notaional Value Raw':each[6]['raw'],
     'Nominal':each[7]['display'],
     'Nominal Raw':each[7]['raw'],
     'ISIN':each[8],
     'Price':each[9]['display'],
     'Price Raw':each[9]['raw'],
     'Location':each[10],
     'Exchange':each[11],
     'Market Currency':each[12]}
     
    rows.append(row)
     
df = pd.DataFrame(rows)

Output:

print(df)
    Issuer Ticker  ... Market Currency
0              VZ  ...             USD
1             ROG  ...             CHF
2            NESN  ...             CHF
3              WM  ...             USD
4             PEP  ...             USD
..            ...  ...             ...
309          ESH2  ...             USD
310          TUH2  ...             USD
311           JPY  ...             USD
312    MARGIN_JPY  ...             JPY
313    MARGIN_SGD  ...             SGD

[314 rows x 18 columns]
  •  Tags:  
  • Related