The idea of the program is to check which domains/subdomains listed in the subdomains.txt file are live (over HTTP/HTTPS).
I do this by sending HEAD requests to the domains/subdomains and reading the response status code. If a status code comes back, the domain or subdomain is live (see the load_url_http function).
To speed up the program, I used concurrent.futures.ThreadPoolExecutor with 200 threads. However, even after increasing the number of threads to 300, the program's speed hasn't improved much.
I want to improve my program so that it can send thousands of requests at once. Below is part of my source code:
python-request-multil.py
import time
import requests
import concurrent.futures

def load_url_http(protocol: str, domain: str, timeout: int = 10):
    try:
        conn = requests.head(protocol + "://" + domain, timeout=timeout)
        return conn.status_code
    except Exception:
        return None

#--- main ---#
start_time = time.time()
worker = 400
protocol = "http"
timeout = 10
print("Number of worker:", worker)
with concurrent.futures.ThreadPoolExecutor(max_workers=worker) as executor:
    # The file object that the subdomain lives on will be written to
    file_live_subdomain = open("live_subdomains.txt", "a")
    # load domain/subdomain list from file
    URLS = open("subdomains.txt", "r").read().split("\n")
    URLS_length = len(URLS)
    # Count the number of live subdomains
    live_count = 0
    # Start the load operations and mark each future with its URL
    future_to_url = {
        executor.submit(load_url_http, protocol, url, timeout): url for url in URLS
    }
    for i, future in zip(range(URLS_length), concurrent.futures.as_completed(future_to_url)):
        url = future_to_url[future]
        print(f"\r--> Checking live subdomain.........{i + 1}/{URLS_length}", end="")
        try:
            data = future.result()
            # If `load_url_http` returns any status code
            if data != None:
                # print(f'{protocol}://{url}:{data}')
                live_count = live_count + 1
                file_live_subdomain.write(f"\n{protocol}://" + url)
        except Exception as exc:
            print(exc)
    print(f"\n[ ] Live domain: {live_count}/{URLS_length}", end="")
    file_live_subdomain.close()
print("\n--- %s seconds ---" % (time.time() - start_time))
Run:
┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 100
--> Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 67.41670227050781 seconds ---
┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 200
--> Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 54.6825795173645 seconds ---
┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 300
--> Checking live subdomain.........1117/1117
[ ] Live domain: 339/1117
--- 54.186068058013916 seconds ---
┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 400
--> Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 54.19181728363037 seconds ---
CodePudding user response:
In CPython, multithreading doesn't actually run your Python code in parallel: all the threads live in one process, and the global interpreter lock (GIL) lets only one of them execute Python bytecode at a time, so in effect they share a single CPU core. (Threads that are blocked waiting on the network release the GIL, which is why threads still help with requests up to a point.)
You can create as many threads as you want, but it won't solve the problem and can actually make it worse. Those 300 threads compete for that one core, and the core can only run one thread at a time, so it runs a few instructions on one thread, switches to another thread, runs a few instructions there, switches again, and so on. Switching between threads costs resources, and during a switch your program's code isn't running. So if you open too many threads, the core ends up spending a growing share of its time switching between threads instead of executing your program.
To actually run your code simultaneously, open a few processes instead of threads using the multiprocessing library; your code will then run on several cores. The same rule applies here: don't open hundreds of processes, open only a few. I recommend opening one process for each core your CPU has. The multiprocessing library has a built-in function that returns the number of cores your CPU has:
import multiprocessing
print(multiprocessing.cpu_count())
Note that because the processes really do run your code simultaneously, your prints can interleave with each other, so you will need to use multiprocessing.Lock() and do something like this:
import multiprocessing
lock = multiprocessing.Lock()
lock.acquire()
print("something")
lock.release()
Call lock.acquire() before each print and lock.release() after it (if you don't release the lock, your program will hang). This makes sure your prints don't get mixed together.
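The same locking idea can also be written with the lock as a context manager, which releases the lock even if the print raises. This is only a minimal sketch; the helper name `locked_print` is hypothetical and not part of the code above. With multiple processes, the lock would have to be created once in the parent and passed to each process as an argument.

```python
import multiprocessing


def locked_print(lock, message: str) -> None:
    """Print `message` while holding `lock` (hypothetical helper)."""
    # `with lock:` calls acquire() on entry and release() on exit,
    # including when the body raises an exception.
    with lock:
        print(message)


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    locked_print(lock, "something")
```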
EDIT:
In your case, because of the timeout, it will actually be better to open a few processes and run something like 20 threads in each one,
because if a process waits 10 seconds for each failed address, it will end up being slower than opening a lot of threads.
So for your case, the fastest way I can think of is to open a few processes and, in each process, open 20 to 30 threads.
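To see why threads help when most of the time is spent waiting, here is a small sketch that replaces the network call with a sleep. The names `fake_request` and `run_with_threads` are made up for illustration, and the sleep is only an assumed stand-in for a slow or timed-out request; real timings will depend on your network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(delay: float) -> float:
    """Stand-in for a slow network call: it only sleeps (assumption)."""
    time.sleep(delay)
    return delay


def run_with_threads(delays, workers: int) -> float:
    """Run all fake requests on `workers` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Consume the iterator so every task has finished before timing stops.
        list(executor.map(fake_request, delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 20
    print("1 thread  :", round(run_with_threads(delays, 1), 2), "s")
    print("20 threads:", round(run_with_threads(delays, 20), 2), "s")
```

With one worker the waits happen one after another; with many workers they overlap, because a sleeping (or network-blocked) thread does not hold up the others.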
You can try something like this:
import multiprocessing
import random
import threading
import time
import requests

MAX_NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS = 30
PROTOCOL = "http"
TIMEOUT = 4
START_TIME = time.time()

def load_url_http(protocol: str, domains: list[str], timeout: int = 10):
    with open("live_subdomains.txt", "a") as live_domains_file:
        for domain in domains:
            try:
                conn = requests.head(protocol + "://" + domain, timeout=timeout)
                if conn.status_code is not None:
                    live_domains_file.write(f"{protocol}://{domain}\n")
            except Exception:
                pass
    return

def create_threads_for_process(protocol: str, domains: list[str], timeout: int = 10):
    # create threads
    threads_list = []
    threads_urls = {}
    start = 0
    number_of_threads_to_open = len(domains) if len(domains) < MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS \
        else MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS
    for i in range(1, number_of_threads_to_open + 1):
        # distribute the work of this process evenly between all the threads
        if i != number_of_threads_to_open:
            threads_urls[i] = domains[start:(len(domains) // number_of_threads_to_open) * i]
            start = (len(domains) // number_of_threads_to_open) * i
        else:
            threads_urls[i] = domains[start:]
        # create and start the thread
        thread = threading.Thread(target=load_url_http,
                                  args=(protocol, threads_urls[i], timeout,),
                                  daemon=True)
        thread.start()
        threads_list.append(thread)
    # wait for all threads to finish
    while threads_list:
        # iterate over a copy so removing finished threads doesn't skip entries
        for thread in threads_list[:]:
            if not thread.is_alive():
                threads_list.remove(thread)
        time.sleep(0.8)

def main():
    with open("live_subdomains.txt", "w") as file:
        file.write("")
    with open("subdomains.txt", "r") as file:
        urls = file.read().split("\n")
    random.shuffle(urls)  # shuffle the urls list
    # create the processes
    processes_list = []
    processes_urls = {}
    start = 0
    number_of_processes_to_open = len(urls) if len(urls) < MAX_NUMBER_OF_PROCESSES else MAX_NUMBER_OF_PROCESSES
    for i in range(1, number_of_processes_to_open + 1):
        if i != number_of_processes_to_open:
            # give each process an even amount of work
            processes_urls[i] = urls[start:(len(urls) // number_of_processes_to_open) * i]
            start = (len(urls) // number_of_processes_to_open) * i
        else:
            # the last process will get a bit more / a bit less in
            # case len(urls) isn't divisible by number_of_processes_to_open
            processes_urls[i] = urls[start:]
        # create the process, start it and add to processes list
        process = multiprocessing.Process(target=create_threads_for_process,
                                          args=(PROTOCOL, processes_urls[i], TIMEOUT,),
                                          daemon=True)
        process.start()
        processes_list.append(process)
    # wait for all processes to finish
    while multiprocessing.active_children():
        time.sleep(0.8)
    # print result
    with open("live_subdomains.txt", "r") as live_urls_file:
        live_count = len(live_urls_file.read().split("\n")) - 1  # -1 empty line at the end
    print(f"\n[ ] Live domain: {live_count}/{len(urls)}", end="")
    print("\n--- %s seconds ---" % (time.time() - START_TIME))

if __name__ == '__main__':
    main()
In this code I open one process for each core the CPU has, and in each process I open 30 threads. Also, you don't need to wait 10 seconds for a reply; something like 5 seconds is enough, especially when you use HEAD and not GET.
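The slicing that hands each process (and each thread) its share of the list could also be pulled out into one helper. The sketch below is only an illustration: `split_evenly` is a hypothetical name, and unlike the code above, which gives the whole remainder to the last worker, it spreads the remainder over the first chunks so sizes differ by at most one.

```python
def split_evenly(items: list, parts: int) -> list[list]:
    """Split `items` into `parts` contiguous chunks whose sizes differ by at most 1."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # `base` items go to every chunk; the first `extra` chunks get one more.
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    # 10 items over 3 workers: chunk sizes 4, 3 and 3
    print(split_evenly(list(range(10)), 3))
```

The same helper could be called once with the number of processes and again, inside each process, with the number of threads.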
I ran your code and compared it to mine: your code took 35 seconds to finish and mine took 25 seconds. That is on 1117 URLs; the bigger the URL list, the more significant the difference will be.
