I am trying to scrape the NASDAQ values from the www.n-tv.de website, using Selenium to crawl through the pages. The stock values are presented in tables on the site.
The source code of such a table looks like this, for example:
<div >
<table >
<thead>
<tr>
<th>Name</th><th >Kurs</th><th >%</th><th >Absolut</th><th >Relation</th><th >Zeit</th><th >Handelsvolumen</th><th >ISIN</th>
</tr>
</thead>
<tbody>
<tr onclick="document.location='https://www.n-tv.de/boersenkurse/aktien/activision-blizzard-295693';">
<td>Activision Blizzard</td>
<td ><span >66,53$</span></td>
<td ><span >-1,42%</span></td>
<td ><span >-0,96</span></td>
<td ><span > <span><span></span></span><span style="border-width: 24px;"></span></span></td>
<td >31.12.</td>
<td >8 Tsd.</td>
<td >US00507V1098</td>
</tr>
...
</tbody>
</table>
</div>
What I do not understand is the following problem:
I search for the web element of the NASDAQ table via XPath:
nasdaq = driver.find_element_by_xpath('//table[@]')
rows_nasdaq = nasdaq.find_elements_by_class_name('linked')
I have found another solution that works correctly: searching for the tableholder elements (3 on this site), listing them, and then taking only the third object. But I really want to understand why the XPath selector above does not find the element, although the class name I use is unique on this site as an attribute of the table tag.
I am not using CSS selectors or anything else. Does anyone have an idea why the XPath does not match in this case?
CodePudding user response:
Assuming you want to scrape this URL: https://www.n-tv.de/boersenkurse/suche/?suchbegriff=to le.
You have to wait until the element you are trying to find is present in the DOM. Selenium's explicit waits can be used for this:
nasdaq = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//table[@]')))
The following imports are needed:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Example:
....
driver.get('https://www.n-tv.de/boersenkurse/suche/?suchbegriff=to le')
nasdaq = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//table[@]')))
for i in nasdaq.find_elements_by_class_name('linked'):
    print(i.get_attribute('onclick'))
Output
document.location='https://www.n-tv.de/boersenkurse/indizes/swx-sp-tra-leis-tr-303397';
document.location='https://www.n-tv.de/boersenkurse/aktien/apollo-tourism- -leisure-1562996';
document.location='https://www.n-tv.de/boersenkurse/aktien/toqublanmonde--eo-047-11904326';
document.location='https://www.n-tv.de/boersenkurse/indizes/cb-p2p-onl-lend---digbanking-12533785';
document.location='https://www.n-tv.de/boersenkurse/indizes/concinngenddivwomin-leader-3254557';
document.location='https://www.n-tv.de/boersenkurse/indizes/concinnity-msos-leaders-39076931';
...
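If only the target URL is of interest rather than the whole onclick handler, it can be pulled out of the attribute string with a regular expression. This is a minimal sketch, assuming every handler has the `document.location='...';` form shown in the output above; the helper name `url_from_onclick` is made up for illustration and is not part of Selenium.

```python
import re

# Matches the URL inside a handler of the form: document.location='<url>';
ONCLICK_URL = re.compile(r"document\.location\s*=\s*'([^']+)'")


def url_from_onclick(onclick):
    """Return the URL embedded in an onclick handler, or None if there is none."""
    match = ONCLICK_URL.search(onclick or "")
    return match.group(1) if match else None


# Sample value copied from the output above.
sample = "document.location='https://www.n-tv.de/boersenkurse/indizes/concinnity-msos-leaders-39076931';"
print(url_from_onclick(sample))
```

In the loop above, this would be applied as `url_from_onclick(i.get_attribute('onclick'))`.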
EDIT
Based on your comment I understood the "link" issue: there is no table under the URL https://www.n-tv.de/, but the NASDAQ page is linked at https://www.n-tv.de/boersenkurse/indizes/nasdaq-849974, and there I found your table.
So it is not necessary to wait, but it can't hurt either. I imported the table directly into a pandas DataFrame:
import pandas as pd
...
driver.get('https://www.n-tv.de/boersenkurse/indizes/nasdaq-849974')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//table[@]')))
pd.read_html(driver.page_source)[3]
Output
Note: The Relation column is empty because no text is stored in it; you can simply drop it if you like.
| Name | Kurs | % | Absolut | Relation | Zeit | Handelsvolumen | ISIN |
|---|---|---|---|---|---|---|---|
| Activision Blizzard | 67,12$ | -0,44% | -30 | nan | 18:05 | 4 Mio. | US00507V1098 |
| Adobe | 545,25$ | -3,39% | -1912 | nan | 18:05 | 2 Mio. | US00724F1012 |
| Advanced Micro Devices | 141,89$ | -5,55% | -834 | nan | 18:05 | 44 Mio. | US0079031078 |
| Airbnb | 167,86$ | -2,79% | -481 | nan | 18:05 | 2 Mio. | US0090661010 |
| Align Technology | 629,44$ | -2,87% | -1861 | nan | 18:02 | 178 Tsd. | US0162551016 |
| ... | ... | ... | ... | ... | ... | ... | ... |
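The values in the table use German number formatting (comma as decimal separator, "Tsd." for thousand, "Mio." for million). The Absolut values such as -30 and -1912 look like -0,30 and -19,12 with the comma read as a thousands separator; that is an inference from the percentages shown, not something checked against the site. A minimal sketch for converting the raw cell strings to floats follows. The function name `parse_german_number` is hypothetical, and the "Mrd." entry (billion) does not occur in the output above and is included only as an assumption.

```python
import re

# German magnitude abbreviations as seen in the Handelsvolumen column.
# "Mrd." is an assumption; it does not appear in the sample output.
MULTIPLIERS = {"Tsd.": 1_000, "Mio.": 1_000_000, "Mrd.": 1_000_000_000}


def parse_german_number(text):
    """Convert strings like '66,53$', '-1,42%' or '8 Tsd.' to a float."""
    text = text.strip()
    factor = 1
    for suffix, value in MULTIPLIERS.items():
        if text.endswith(suffix):
            factor = value
            text = text[: -len(suffix)]
            break
    # Drop currency and percent signs and whitespace; keep digits, sign, separators.
    cleaned = re.sub(r"[^0-9,.\-]", "", text)
    # German format: '.' groups thousands, ',' is the decimal separator.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned) * factor


for raw in ["66,53$", "-1,42%", "-0,96", "8 Tsd.", "4 Mio."]:
    print(raw, "->", parse_german_number(raw))
```

This is meant for the numeric columns only; date cells such as 31.12. would need separate handling. Alternatively, `pd.read_html` accepts `decimal` and `thousands` arguments, which may avoid the misread values at parse time.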
