I have a CSV file with a list of HTTP URLs. For each URL listed, I need to check whether it is reachable over HTTP. How can I do that?
CodePudding user response:
You can check the URLs with a Python script.
As input you need a CSV file named links.csv with this structure:
name,link
google,https://google.com
bla,https://doesnot.exist.com
Copy the following Python code into a file named check_url.py.
Then run it from the directory that contains links.csv with: python3 check_url.py
import csv
import socket
import urllib.error
import urllib.parse
import urllib.request

# try to resolve the hostname
def hostname_resolves(hostname):
    try:
        socket.gethostbyname(hostname)
        return 1
    except socket.error:
        return 0

# open file (it is closed automatically at the end of the with block)
with open("links.csv", newline="", encoding="UTF8") as file:
    csvreader = csv.reader(file)
    # extract headers
    header = next(csvreader)
    # extract data, skipping blank lines
    rows = [row for row in csvreader if row]

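# iterate over the links and check whether each one can be reached and responds with a valid http response code
for row in rows:
    # extract url
    url = row[1]
    print("check url: " + url)
    # extract host (hostname drops any port or credentials contained in netloc)
    parsed_url = urllib.parse.urlparse(url)
    host = parsed_url.hostname or ""
    # try to resolve host over dns
    resolvable = hostname_resolves(host)
    # if the host could be resolved, try to do a http request
    url_reachable_over_http = 0
    if resolvable == 1:
        try:
            http_status_code = urllib.request.urlopen(url, timeout=10).getcode()
        except urllib.error.HTTPError as e:
            # urlopen raises on 4xx/5xx responses, but the server did answer
            http_status_code = e.code
        except (OSError, ValueError):
            # connection failure, timeout or malformed url
            http_status_code = None
        if http_status_code is not None and http_status_code < 500:
            url_reachable_over_http = 1
    row.append(url_reachable_over_http)
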
# write the result to a new csv file (newline='' avoids blank lines between rows on Windows)
with open('links_checked_result.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    # write the header
    writer.writerow(header)
    # write the data
    writer.writerows(rows)
The output should be a file links_checked_result.csv with this content:
name,link
google,https://google.com,1
bla,https://doesnot.exist.com,0
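
Note that the script writes the header row unchanged while each data row gains a third value, so the result column has no name. If a named column is preferred, one possible approach is to use csv.DictReader and csv.DictWriter. The following is only a minimal sketch: the helper add_reachable_column, the column name "reachable" and the stand-in checker are assumptions for illustration and are not part of the script above.

```python
import csv
import io


def add_reachable_column(in_file, out_file, check, column="reachable"):
    """Copy a name,link CSV and append a named 0/1 result column.

    `check` is any callable that takes a URL and returns True or False.
    The column name "reachable" is an assumption, not from the original script.
    """
    reader = csv.DictReader(in_file)
    # keep the original columns and add the new one at the end
    fieldnames = list(reader.fieldnames) + [column]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = 1 if check(row["link"]) else 0
        writer.writerow(row)


# Usage with in-memory files and a stand-in checker (no network access needed)
source = io.StringIO(
    "name,link\n"
    "google,https://google.com\n"
    "bla,https://doesnot.exist.com\n"
)
result = io.StringIO()
add_reachable_column(source, result, check=lambda url: "google" in url)
print(result.getvalue())
```

In real use, the file objects would come from open() with newline='' and check would be a function that performs the DNS lookup and HTTP request described above. With the stand-in checker, the header printed is name,link,reachable, followed by the two data rows ending in 1 and 0.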
