Using Python and Selenium to get elements in <div> to a list or DataFrame

I have a table built from 'div' elements. Its content is dynamic: it depends on the selected option, and the displayed data is generated with JavaScript. The HTML structure looks like this:

<div >
<div >
<div >
<div >
</div></div></div>
<div  style="box-shadow:none">
<div  style="width:0"></div>
<span >Total common shares outstanding</span></div>
<div ></div>
<div >
<div >
<div >
<div>‪22.32B‬</div>
</div></div>
<div >
<div >
<div>‪21.34B‬</div>
</div></div>
<div >
<div ><div>‪20.50B‬</div>
</div></div>

With the Python code below, the result looks like this: Total common shares outstanding22.32B21.34B20.50B19.02B17.77B16.98B16.43B16.33B. Instead, I would like it in a list or in a DataFrame like this:

['Total common shares outstanding', 22.32, 21.34, 20.50, 19.02, 17.77, 16.98, 16.43, 16.33]

This is the Python code I'm using to scrape the data:

from selenium import webdriver
import pandas as pd
import requests, bs4
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url ='https://www.tradingview.com/symbols/NASDAQ-AAPL/financials-statistics-and-ratios/'
# Recent Selenium 4 releases no longer accept the driver path as a positional argument
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
#print(html)
soup = bs4.BeautifulSoup(html, 'html.parser')
for title in soup.find_all("div", {"class": "container-jKD0Exn-"}):
    print(title.text, '\n')

Is there any way in Selenium or BeautifulSoup to get a list like that?

CodePudding user response：

If there is no API available (an API would be the preferable option), one approach is to use BeautifulSoup and stripped_strings:

data = []
for title in soup.find_all("div", {"class": "container-jKD0Exn-"}):
     data.append(list(title.stripped_strings))

pd.DataFrame(data)

Output DataFrame:

0	1	2	3	4	5	6	7	8
Key stats
Total common shares outstanding	‪22.32B‬	‪21.34B‬	‪20.50B‬	‪19.02B‬	‪17.77B‬	‪16.98B‬	‪16.43B‬	‪16.33B‬
Float shares outstanding	‪22.29B‬	‪21.32B‬	‪20.48B‬	‪18.99B‬	‪17.75B‬	‪16.96B‬	‪16.41B‬	‪16.32B‬
Number of employees	‪110.00K‬	‪116.00K‬	‪123.00K‬	‪132.00K‬	‪137.00K‬	‪147.00K‬	‪154.00K‬	—
Number of shareholders	‪23.50K‬	‪23.50K‬	‪23.50K‬	‪23.50K‬	‪23.50K‬	‪23.50K‬	‪23.50K‬	—
...	...	...	...	...	...	...	...	...	...
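
The rows above still hold strings such as '22.32B', while the question asks for numbers. A minimal sketch of one way to convert them, assuming each row is a list of strings like the ones stripped_strings yields, is shown below. The names parse_value and parse_row and the suffix table are hypothetical, and the values are scaled to plain numbers (22.32B becomes roughly 22.32e9) rather than kept in billions:

```python
# Hypothetical helpers; the suffix table is an assumption based on the values shown above.
BIDI_MARKS = "\u202a\u202c"  # invisible direction marks wrapped around each value
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_value(text):
    """Convert a cell such as '22.32B' to a float; return None if it is not numeric."""
    cleaned = text.strip(BIDI_MARKS + " ")
    if not cleaned:
        return None
    suffix = cleaned[-1].upper()
    number = cleaned[:-1] if suffix in MULTIPLIERS else cleaned
    try:
        return float(number.replace(",", "")) * MULTIPLIERS.get(suffix, 1)
    except ValueError:
        # Placeholders for missing data (the dash in the table above) end up here
        return None


def parse_row(cells):
    """Keep the first cell as the label and convert the remaining cells to numbers."""
    return [cells[0]] + [parse_value(cell) for cell in cells[1:]]


# "\u2014" is the dash used for missing values in the table above
row = ["Total common shares outstanding", "\u202a22.32B\u202c", "\u202a110.00K\u202c", "\u2014"]
print(parse_row(row))
```

Applied to the list built in the answer, something like [parse_row(r) for r in data if len(r) > 1] would skip single-cell heading rows such as "Key stats" before the result is passed to pd.DataFrame.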

CodePudding user response：

To extract the desired texts using Selenium, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:

Using xpath:

driver.get("https://www.tradingview.com/symbols/NASDAQ-AAPL/financials-statistics-and-ratios/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Accept']"))).click()
xpath = "//span[text()='Total common shares outstanding']//following::div[2]//div[starts-with(@class, 'wrap')]/div"
cells = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, xpath)))
# Remove the invisible direction marks (U+202A, U+202C) that wrap each value
values = [cell.text.replace('\u202a', '').replace('\u202c', '') for cell in cells]
df = pd.DataFrame(values, columns=['Total common shares outstanding'])
print(df)
driver.quit()

Console Output:

      Total common shares outstanding
0                         22.32B
1                         21.34B
2                         20.50B
3                         19.02B
4                         17.77B
5                         16.98B
6                         16.43B
7                         16.33B

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
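
The replace calls in the Selenium answer target two invisible characters, U+202A (LEFT-TO-RIGHT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING), which the page wraps around each number. As a rough sketch, and only as one possible alternative, the standard unicodedata module can identify such characters and remove every format character in a single pass. The function name strip_format_chars is hypothetical:

```python
import unicodedata


def strip_format_chars(text):
    """Remove every character in Unicode category Cf (format), which includes U+202A and U+202C."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


raw = "\u202a22.32B\u202c"

# Show which invisible characters are present in the scraped value
for ch in raw:
    if unicodedata.category(ch) == "Cf":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(repr(strip_format_chars(raw)))
```

This is broader than the two replace calls: it also drops other format characters such as zero-width joiners, which may or may not be wanted depending on the text being scraped.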