I am scraping a series of URLs with this code:
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)
This prints the results I want to see. The problem is that when I try to put these URLs into my empty DataFrame "df1" with the following code:
df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()
It does not show me the URLs I want. It does not raise an error, but the result does not make sense.
I am a beginner in Python, so there is probably a simple answer to my question. I hope I was clear.
CodePudding user response:
The problem with your code is that each loop iteration overwrites the urls variable, so the df1.append call that runs after the loop adds only the last scraped URL to the DataFrame. Move the df1.append statement inside the for block and assign the result back to df1:
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    url = elem.get_attribute("href")  # <--- get the url from the <a> tag
    df1 = df1.append({'URLS': url}, ignore_index=True)  # <--- add the url to the dataframe in the URLS column
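
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas version the call raises an AttributeError. A minimal sketch of an alternative is to collect the values in a plain Python list and build the DataFrame once after the loop. The hrefs list below is a hypothetical stand-in for the values returned by elem.get_attribute("href"), and the example.com URLs are made up for illustration, so that the snippet runs without a browser:

```python
import pandas as pd

# Hypothetical stand-in for the scraped values, i.e. what
# elem.get_attribute("href") would return for each element.
hrefs = [
    "https://example.com/jobs/1",
    "https://example.com/jobs/2",
    "https://example.com/jobs/3",
]

urls = []
for href in hrefs:
    # list.append mutates the list in place, so every value is kept
    urls.append(href)

# Build the DataFrame once, with one row per collected URL
df1 = pd.DataFrame({"URLS": urls})
print(df1)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what a row-by-row append does.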
