I have a basic web scrapping application that creates the files I need, but in my for loop, when it loops through different searches the panda series being created only writes to the first row. This version is cut down to meet the minimal necessary code.
Below is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
URL = "https://mor.nlm.nih.gov/RxClass/search?query=ALIMENTARY TRACT AND METABOLISM|ATC1-4&searchBy=class&sourceIds=a&drugSources=atc1-4|atc,epc|dailymed,meshpa|mesh,disease|medrt,chem|dailymed,moa|dailymed,pe|dailymed,pk|medrt,tc|fmtsme,va|va,dispos|snomedct,struct|snomedct,schedule|rxnorm"
driver = webdriver.Chrome('C:\Program Files\Chrome Driver\chromedriver.exe')
driver.get(URL)
filename = 'tabletest.csv'
classid = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(2)")))]
classname = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(1)")))]
classtype = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(3)")))]
df = pd.read_html(driver.page_source)[1].iloc[:,:-1]
df["ClassID"] = pd.Series(classid)
df["ClassName"] = pd.Series(classname)
df["ClassType"] = pd.Series(classtype)
df.to_csv(filename,index=False)
I've tried using iloc with several variations as I'm doing from the table in the line of code above the series, but I get the following errors on the following attempts. I know why I'm getting the errors, these are just attempts. The current code that writes a single line produces no error.
The reason for this is because it's not in a series...
df["ClassID"] = classid
df["ClassName"] = classname
df["ClassType"] = classtype
Traceback (most recent call last):
File "c:\Python\selenium example.py", line 19, in <module>
df["ClassID"] = classid
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 3612, in __setitem__
self._set_item(key, value)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 3784, in _set_item
value = self._sanitize_column(value)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 4509, in _sanitize_column
com.require_length_match(value, self.index)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\common.py", line 531, in require_length_match
raise ValueError(
ValueError: Length of values (1) does not match length of index (397)
The errors using iloc
df["ClassID"] = pd.Series(classid).iloc[:,:-1]
df["ClassName"] = pd.Series(classname).iloc[:,:-1]
df["ClassType"] = pd.Series(classtype).iloc[:,:-1]
Traceback (most recent call last):
File "c:\Python\selenium example.py", line 19, in <module>
df["ClassID"] = pd.Series(classid).iloc[:,:-1]
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 1506, in _getitem_tuple
self._has_valid_tuple(tup)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 751, in _has_valid_tuple
self._validate_key_length(key)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 792, in _validate_key_length
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
df["ClassID"] = pd.Series(classid)[1].iloc[:,:-1]
df["ClassName"] = pd.Series(classname)[1].iloc[:,:-1]
df["ClassType"] = pd.Series(classtype)[1].iloc[:,:-1]
Traceback (most recent call last):
File "c:\Python\selenium example.py", line 19, in <module>
df["ClassID"] = pd.Series(classid)[1].iloc[:,:-1]
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
return self._get_value(key)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
loc = self.index.get_loc(label)
File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 1
What I'm not understanding is why does my series only write to a single row in the output files?
CodePudding user response:
What happens?
Trying to assign each single value as list (it just contains one string) wont work, cause the length of your list and these of the dataframe does not match.
How to fix?
Collect your values as string instead of list and assign it as value with your new columns:
classid = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(2)"))).text
classname = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(1)"))).text
classtype = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(3)"))).text
pd.read_html(driver.page_source)[1].iloc[:,:-1]\
.assign(ClassID=classid, ClassName=classname, ClassType=classtype)\
.to_csv('a.csv',index=False)
Note Alternative you can collected the three class props as list of strings and assign it by index
Example (alternative approach)
Wait for the presence of the table, collect the props and put everything together:
...
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#drugMemberPage table')))
classProps = [c.text for c in driver.find_elements(By.CSS_SELECTOR, '.propText strong')]
pd.read_html(driver.page_source)[1].iloc[:,:-1]\
.assign(ClassID=classProps[1], ClassName=classProps[0], ClassType=classProps[2])\
.to_csv('a.csv',index=False)
Output
| Type | RXCUI | RxNorm Name | Relation | ClassID | ClassName | ClassType |
|---|---|---|---|---|---|---|
| IN | 16681 | acarbose | INDIRECT | A | ALIMENTARY TRACT AND METABOLISM | ATC1-4 |
| IN | 173 | acetohexamide | INDIRECT | A | ALIMENTARY TRACT AND METABOLISM | ATC1-4 |
| IN | 16784 | acetoxolone | INDIRECT | A | ALIMENTARY TRACT AND METABOLISM | ATC1-4 |
| IN | 272 | activated charcoal | INDIRECT | A | ALIMENTARY TRACT AND METABOLISM | ATC1-4 |
| ... | ... | ... | ... | ... | ... | ... |
