I am trying to make a web scraper with Python, and I have a problem extracting the job title.
def extract_indeed_job():
    jobs = []
    result = requests.get(f"{url}&start={0*LIMIT}")
    result_soup = BeautifulSoup(result.text, "html.parser")
    results = result_soup.find_all("a", {"class": "tapItem"})
    for result in results:
        title = result.find("h2", {"class": "jobTitle"}).find("span")["title"]
        company = result.find("span", {"class": "companyName"}).get_text()
        location = result.find("div", {"class": "companyLocation"}).get_text()
        print(title, company, location)
In some posts there are two span tags inside the h2 class="jobTitle" tag, and I need only the span that has a title attribute, so I indexed the span with ["title"]. But Python raises a KeyError and it doesn't work.
What can I do to solve this? Is there a problem in my code?
CodePudding user response:
Note that there are multiple <span> elements inside the <h2> element. You want the <span> that is an immediate child of <h2> rather than the <span> inside the <div>. To get it, you might replace
result.find("h2", {"class": "jobTitle"}).find("span")
with
result.find("h2", {"class": "jobTitle"}).find("span", recursive=False)
This prevents a recursive search (i.e. looking at children of children and further), so only direct children of <h2> are considered.
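To see the difference concretely, here is a minimal sketch. The HTML snippet is a made-up stand-in for one job card (the real Indeed markup may differ), and the names first_any and direct are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a nested <span> comes before the direct-child <span>.
html = """
<h2 class="jobTitle jobTitle-newJob">
  <div class="new"><span class="label">new</span></div>
  <span title="Data Engineer">Data Engineer</span>
</h2>
"""

h2 = BeautifulSoup(html, "html.parser").find("h2", {"class": "jobTitle"})

# Default (recursive) search returns the first <span> in document order,
# which here is the one nested inside the <div> and has no title attribute.
first_any = h2.find("span")

# recursive=False only looks at direct children of <h2>.
direct = h2.find("span", recursive=False)

print(first_any.get("title"))  # None, so first_any["title"] would raise KeyError
print(direct["title"])
```

Using .get("title") on the first result avoids the KeyError while you are inspecting what each lookup actually returned.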
CodePudding user response:
Passing title=True makes find return only a span that has a title attribute, so accessing its value does not raise an error. A plain find("span") returns the first span regardless of which attributes it has.
result.find("span", title=True)['title']
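As a rough sketch of how this could be wrapped up, the helper below returns None when a card has no span with a title attribute instead of failing. The function name job_title and the sample HTML are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup


def job_title(card):
    """Return the title attribute of the first <span title=...>, or None."""
    span = card.find("span", title=True)  # only spans that carry a title attribute
    return span["title"] if span is not None else None


# Hypothetical job cards: one with a titled span, one without.
html = """
<a class="tapItem"><h2 class="jobTitle"><span>new</span><span title="Web Developer">Web Developer</span></h2></a>
<a class="tapItem"><h2 class="jobTitle"><span>new</span></h2></a>
"""

cards = BeautifulSoup(html, "html.parser").find_all("a", {"class": "tapItem"})
titles = [job_title(card) for card in cards]
print(titles)
```

Checking for None matters because find returns None when nothing matches, and indexing None would raise a TypeError.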
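The code and HTML you provided are ambiguous. Note that the h2 tag's class attribute holds several values, `jobTitle jobTitle-color-purple jobTitle-newJob`. BeautifulSoup treats class as a multi-valued attribute and compares a plain string against each value, so {"class": "jobTitle"} does match this h2. If you want to match on a pattern rather than on one exact class name, you can use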
import re
...
result.find("h2", class_=re.compile(r'jobTitle'))
Using a regular expression makes the search in the soup more flexible.
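A small sketch of the regular-expression lookup, assuming made-up markup with one multi-class h2 and one unrelated h2. The pattern is tested against each class value of a tag.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: only the first <h2> has a class starting with "jobTitle".
html = """
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob"><span title="QA Analyst">QA Analyst</span></h2>
<h2 class="sectionHeader">Other</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ accepts a compiled pattern; a tag matches if any of its classes matches.
matches = soup.find_all("h2", class_=re.compile(r"^jobTitle"))

print(len(matches))
print(matches[0].find("span", title=True)["title"])
```

Anchoring the pattern with ^ is a design choice here so that unrelated classes that merely contain the text are not picked up.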
