How can I get the pd.Series to write more than one row?-CodePudding

I have a basic web scrapping application that creates the files I need, but in my for loop, when it loops through different searches the panda series being created only writes to the first row. This version is cut down to meet the minimal necessary code.

Below is my code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


URL = "https://mor.nlm.nih.gov/RxClass/search?query=ALIMENTARY TRACT AND METABOLISM|ATC1-4&searchBy=class&sourceIds=a&drugSources=atc1-4|atc,epc|dailymed,meshpa|mesh,disease|medrt,chem|dailymed,moa|dailymed,pe|dailymed,pk|medrt,tc|fmtsme,va|va,dispos|snomedct,struct|snomedct,schedule|rxnorm"

driver = webdriver.Chrome('C:\Program Files\Chrome Driver\chromedriver.exe')
driver.get(URL)
filename = 'tabletest.csv'
classid = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(2)")))]
classname = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(1)")))]
classtype = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(3)")))]
df = pd.read_html(driver.page_source)[1].iloc[:,:-1]
df["ClassID"] = pd.Series(classid)
df["ClassName"] = pd.Series(classname)
df["ClassType"] = pd.Series(classtype)
df.to_csv(filename,index=False)

I've tried using iloc with several variations as I'm doing from the table in the line of code above the series, but I get the following errors on the following attempts. I know why I'm getting the errors, these are just attempts. The current code that writes a single line produces no error.

The reason for this is because it's not in a series...

df["ClassID"] = classid
df["ClassName"] = classname
df["ClassType"] = classtype
Traceback (most recent call last):
  File "c:\Python\selenium example.py", line 19, in <module>
    df["ClassID"] = classid
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 3612, in __setitem__
    self._set_item(key, value)
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 3784, in _set_item
    value = self._sanitize_column(value)
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\frame.py", line 4509, in _sanitize_column
    com.require_length_match(value, self.index)
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\common.py", line 531, in require_length_match
    raise ValueError(
ValueError: Length of values (1) does not match length of index (397)

The errors using iloc

df["ClassID"] = pd.Series(classid).iloc[:,:-1]
df["ClassName"] = pd.Series(classname).iloc[:,:-1]
df["ClassType"] = pd.Series(classtype).iloc[:,:-1]

    Traceback (most recent call last):
      File "c:\Python\selenium example.py", line 19, in <module>
        df["ClassID"] = pd.Series(classid).iloc[:,:-1]
      File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 925, in __getitem__
        return self._getitem_tuple(key)
      File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 1506, in _getitem_tuple
        self._has_valid_tuple(tup)
      File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 751, in _has_valid_tuple
        self._validate_key_length(key)
      File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexing.py", line 792, in _validate_key_length
        raise IndexingError("Too many indexers")
    pandas.core.indexing.IndexingError: Too many indexers


df["ClassID"] = pd.Series(classid)[1].iloc[:,:-1]
df["ClassName"] = pd.Series(classname)[1].iloc[:,:-1]
df["ClassType"] = pd.Series(classtype)[1].iloc[:,:-1]
Traceback (most recent call last):
  File "c:\Python\selenium example.py", line 19, in <module>
    df["ClassID"] = pd.Series(classid)[1].iloc[:,:-1]
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Anaconda3\envs\datawrangler\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
    raise KeyError(key) from err
KeyError: 1

What I'm not understanding is why does my series only write to a single row in the output files?

CodePudding user response：

What happens?

Trying to assign each single value as list (it just contains one string) wont work, cause the length of your list and these of the dataframe does not match.

How to fix?

Collect your values as string instead of list and assign it as value with your new columns:

classid = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(2)"))).text
classname = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(1)"))).text
classtype = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.table-responsive div.propText strong:nth-child(3)"))).text
pd.read_html(driver.page_source)[1].iloc[:,:-1]\
    .assign(ClassID=classid, ClassName=classname, ClassType=classtype)\
    .to_csv('a.csv',index=False)

Note Alternative you can collected the three class props as list of strings and assign it by index

Example (alternative approach)

Wait for the presence of the table, collect the props and put everything together:

...
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#drugMemberPage table')))
classProps = [c.text for c in driver.find_elements(By.CSS_SELECTOR, '.propText strong')]

pd.read_html(driver.page_source)[1].iloc[:,:-1]\
    .assign(ClassID=classProps[1], ClassName=classProps[0], ClassType=classProps[2])\
    .to_csv('a.csv',index=False)

Output

Type	RXCUI	RxNorm Name	Relation	ClassID	ClassName	ClassType
IN	16681	acarbose	INDIRECT	A	ALIMENTARY TRACT AND METABOLISM	ATC1-4
IN	173	acetohexamide	INDIRECT	A	ALIMENTARY TRACT AND METABOLISM	ATC1-4
IN	16784	acetoxolone	INDIRECT	A	ALIMENTARY TRACT AND METABOLISM	ATC1-4
IN	272	activated charcoal	INDIRECT	A	ALIMENTARY TRACT AND METABOLISM	ATC1-4
...	...	...	...	...	...	...