I'm testing the code below, which tries to download around 120 Excel files from one URL:
import requests
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files", headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url = "https://healthcare.ascension.org" + link['href']
        data = requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()
This line, data=requests.get(url), always gives me a Response [406] result. According to HTTP.CAT and Mozilla, HTTP 406 means "Not Acceptable". I'm not sure what is wrong here; I expected 120 Excel files, with data, to be downloaded. Instead, I am getting 120 Excel files on my laptop, but none of them have any data in them.
CodePudding user response:
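The HTTP 406 error arises because the download request is sent without the custom User-Agent header: requests then falls back to its default python-requests User-Agent, which this server appears to reject.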
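For context, requests does send a User-Agent of its own when none is supplied; it just isn't a browser-like one. A minimal sketch for inspecting that default without touching the network (that the server rejects this particular value is inferred from the 406, not confirmed):

```python
import requests

# The value requests sends when no User-Agent header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/<installed version>

# A fresh Session carries the same default until it is overridden.
session = requests.Session()
print(session.headers["User-Agent"] == default_ua)
```

Comparing this string with the browser-like one in the question's headers dictionary shows what differs between the page request and the failing download request.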
Once that's been resolved and the HREFs are parsed appropriately (for format and relevance), the OP's code should work. However, it's going to be very slow, because the XLSX files being downloaded are many MB each and are fetched one at a time.
Therefore, matters can be improved greatly with a multithreaded approach as follows:
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = 'C:/Users/ryans/Downloads'
HOST = 'https://healthcare.ascension.org'

def download(url):
    base = os.path.basename(url)
    print(f'Processing {base}')
    with requests.Session() as session:
        (r := session.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=4096):
                xl.write(chunk)

with requests.Session() as session:
    (r := session.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
    soup = BS(r.text, 'lxml')
    urls = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if not href.startswith('java'):
            if not href.startswith('http'):
                href = HOST + href
            if href.endswith('xlsx'):
                urls.append(href)

with ThreadPoolExecutor() as executor:
    executor.map(download, urls)
print('Done')
Note:
Python 3.8 or later is required, because the code uses the assignment expression operator (:=).
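One caveat with executor.map as used above: an exception raised inside download (for example by raise_for_status()) is only re-raised when the results of map are iterated, so a failed download can pass silently. A hedged sketch of one way to surface failures, using submit and as_completed instead. The names download_all and fake_download and the value max_workers=8 are illustrative assumptions, not part of the code above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(download, urls, max_workers=8):
    """Run download(url) for each URL in a thread pool.

    Returns (succeeded, failed): a list of URLs that completed, and a dict
    mapping each failed URL to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so failures can be reported.
        futures = {executor.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failed[url] = exc
            else:
                succeeded.append(url)
    return succeeded, failed


# Stand-in for the real download function, so the sketch runs offline.
def fake_download(url):
    if url.endswith("bad.xlsx"):
        raise RuntimeError("simulated HTTP error")


ok, errors = download_all(fake_download, ["a.xlsx", "bad.xlsx", "c.xlsx"])
print(sorted(ok))
print(list(errors))
```

With the real download function passed in place of fake_download, the failed dictionary would show which files need to be retried.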
CodePudding user response:
This website seems to filter on the User-Agent. Since you already set the header in your dictionary, you just need to pass it to requests when invoking the get method:
requests.get(url, headers=headers)
It seems that only the user-agent is checked.
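To avoid forgetting the headers argument on a later call, one option is to set the header once on a requests.Session. A minimal sketch, assuming the same User-Agent string as in the question; the example.xlsx file name and the fetch helper are made-up placeholders:

```python
import requests

# Same browser-like User-Agent string as in the question.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/97.0.4692.71 Safari/537.36"
}

session = requests.Session()
session.headers.update(HEADERS)  # applied to every request made via session

# Build (but do not send) a request to confirm the header is attached.
# The file name in this URL is a made-up placeholder.
req = requests.Request("GET", "https://healthcare.ascension.org/example.xlsx")
prepared = session.prepare_request(req)
print(prepared.headers["User-Agent"] == HEADERS["User-Agent"])


def fetch(url):
    """Download url with the shared session; raise on 4xx/5xx responses."""
    resp = session.get(url)
    resp.raise_for_status()  # avoids silently saving an error response as a file
    return resp.content
```

Calling raise_for_status() before writing anything to disk would also have made the original problem visible immediately, rather than leaving 120 empty files.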
