I'm trying to grab the email address from this webpage for a school project. I can get it successfully, along with the name of the organization, but now I have a new problem: it looks like each entry is being grabbed 3 times, which causes problems for my lists. Removing duplicates afterwards is not ideal in this situation. Does anybody have any idea how I can grab the email and organization name just once?
Is it an issue with my for loop?
Code:
U4Etest1 = ['http://www.usavolleyballclubs.com/VolleyballClubDirectory.asp?Customer_ID=26045','http://www.usavolleyballclubs.com/VolleyballClubDirectory.asp?Customer_ID=36914']
email_list = []
org_name_list = []
for u in U4Etest1:
    url2 = u
    driver.get(url2)
    time.sleep(3)
    html = urlopen(url2)
    soup = BeautifulSoup(html, 'lxml')
    emailsoup = soup.find('table', class_="table table-striped")
    for es in emailsoup:
        org_name2 = emailsoup.find('h3').text
        org_name_list.append(org_name2)
        try:
            malito = emailsoup.find('a', {'target':'_top'})['href']
            email_list.append(malito)
        except:
            email_list.append('N/A')
        print(f'''
Org Name: {org_name2}
Email: {malito}
''')
Output:
Org Name: Eastern Elite
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Org Name: Eastern Elite
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Org Name: Eastern Elite
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Org Name: Corpus Christi Legacy Volleyball Club
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Org Name: Corpus Christi Legacy Volleyball Club
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Org Name: Corpus Christi Legacy Volleyball Club
Email: mailto:[email protected]&[email protected]&subject=Club Volleyball Inquiry from
Answer:
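First, the inner for loop is the problem. soup.find() returns a single Tag, and iterating over a Tag walks its direct children, so the loop body runs once per child of the table (three times here, judging by your output). Because every pass calls emailsoup.find() on the whole table again, you get the same name and email each time. You also don't need Selenium: driver.get() loads the page, but the HTML you actually parse comes from urlopen(). Second, please always include your imports and variable definitions along with your code.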
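To see the repetition in isolation, here is a minimal sketch that runs without any network access. The HTML string is made up for illustration (it is not the real page), and it uses the built-in 'html.parser' instead of 'lxml' so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page: one table with three rows.
html = (
    '<table class="table table-striped">'
    '<tr><td><h3>Example Club</h3></td></tr>'
    '<tr><td><a target="_top" href="mailto:info@example.com">Email</a></td></tr>'
    '<tr><td>Other details</td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='table table-striped')

# Iterating over a Tag yields its direct children (the three <tr> rows here),
# but find() is called on the whole table each time, so the result repeats.
looped = [table.find('h3').text for _child in table]

# Searching the table once, with no loop, gives a single result.
single = table.find('h3').text

print(len(looped), looped)
print(single)
```

The loop variable is never used inside the loop body, which is usually a sign that the loop itself is not needed.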
The following code works for me:
import time
from urllib.request import urlopen

from bs4 import BeautifulSoup

urls = ['http://www.usavolleyballclubs.com/VolleyballClubDirectory.asp?Customer_ID=26045',
        'http://www.usavolleyballclubs.com/VolleyballClubDirectory.asp?Customer_ID=36914']
email_list = []
org_name_list = []

for url in urls:
    time.sleep(3)  # pause between requests to go easy on the server
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    # find() returns a single Tag, so there is nothing to loop over here
    email_soup = soup.find('table', class_="table table-striped")
    org_name = email_soup.find('h3').text
    org_name_list.append(org_name)
    try:
        malito = email_soup.find('a', {'target': '_top'})['href']
    except (TypeError, KeyError):
        # no matching link (find() returned None) or the link has no href
        malito = 'N/A'
    email_list.append(malito)
    print(f'Org Name: {org_name} \nEmail: {malito}')
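If you want only the address rather than the whole mailto: link, something like the sketch below might help. extract_address is a hypothetical helper name, and the sample link in the usage line is invented; it assumes the address ends at the first '?' or '&', which is what the hrefs in your output appear to do.

```python
import re


def extract_address(href):
    """Return the address part of a mailto: link, or 'N/A' if it is not one."""
    if not href or not href.lower().startswith('mailto:'):
        return 'N/A'
    rest = href[len('mailto:'):]
    # Assumption: extra fields (cc, subject, ...) start at the first '?' or '&'.
    address = re.split(r'[?&]', rest, maxsplit=1)[0].strip()
    return address or 'N/A'


# Invented example link, shaped like the ones in the question's output.
print(extract_address('mailto:coach@example.com&subject=Club Volleyball Inquiry from'))
```

You could call it on malito before appending to email_list, so the list holds plain addresses.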
