Python Multithreading: Check active domains with multithreading in Python optimally

The idea of the program is to check which domains/subdomains listed in the subdomains.txt file are live (over HTTP/HTTPS).

I do this by sending a HEAD request to each domain/subdomain and reading the response status code. If any status code comes back, the domain or subdomain is considered live (see the load_url_http function).

To speed up the program, I used concurrent.futures.ThreadPoolExecutor with 200 threads. However, even after increasing the number of threads to 300, the program is not much faster.

I would like to improve the program so that it can send thousands of requests at once. Below is part of my source code:

python-request-multil.py

import time

import requests
import concurrent.futures


def load_url_http(protocol: str, domain: str, timeout: int = 10):
    try:
        conn = requests.head(protocol + "://" + domain, timeout=timeout)
        return conn.status_code
    except Exception:
        return None


#--- main ---#
start_time = time.time()

worker = 400
protocol = "http"
timeout = 10

print("Number of worker:", worker)

with concurrent.futures.ThreadPoolExecutor(max_workers=worker) as executor:
    # The file object that the subdomain lives on will be written to
    file_live_subdomain = open("live_subdomains.txt", "a")
    
    # load domain/subdomain list from file
    URLS = open("subdomains.txt", "r").read().split("\n")
    URLS_length = len(URLS)
    
    # Count the number of live subdomains
    live_count = 0
    
    # Start the load operations and mark each future with its URL
    future_to_url = {
        executor.submit(load_url_http, protocol, url, timeout): url for url in URLS
    }
    
    for i, future in zip(range(URLS_length), concurrent.futures.as_completed(future_to_url)):
        url = future_to_url[future]
        print(f"\r-->  Checking live subdomain.........{i + 1}/{URLS_length}", end="")
        try:
            data = future.result()
            
            # If `load_url_http` returns any status code
            if data is not None:
                # print(f'{protocol}://{url}:{data}')
                live_count = live_count + 1
                file_live_subdomain.write(f"\n{protocol}://" + url)
        except Exception as exc:
            print(exc)
    print(f"\n[ ] Live domain: {live_count}/{URLS_length}", end="")
    file_live_subdomain.close()

print("\n--- %s seconds ---" % (time.time() - start_time))

Run:

┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 100
-->  Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 67.41670227050781 seconds ---

┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 200
-->  Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 54.6825795173645 seconds ---

┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 300
-->  Checking live subdomain.........1117/1117
[ ] Live domain: 339/1117
--- 54.186068058013916 seconds ---

┌──(quangtb㉿QuangTB)-[/mnt/e/DATA/Downloads]
└─$ python3 python-request-multil.py
Number of worker: 400
-->  Checking live subdomain.........1117/1117
[ ] Live domain: 344/1117
--- 54.19181728363037 seconds ---

CodePudding user response：

In Python (CPython), multithreading doesn't actually run your code in parallel: all the threads live in one process, and the global interpreter lock (GIL) lets only one of them execute Python bytecode at a time, so in effect they share one CPU core. Threads still overlap while they wait on network I/O, because the GIL is released during blocking calls, but the Python-level work for each request is serialized.

You can create as many threads as you want, but it won't solve the problem and can actually make it worse. Those 300 threads take turns: the interpreter runs a few instructions on one thread, switches to another, runs a few instructions there, and so on. Each switch costs resources, and while it happens your program code doesn't run. So if you open too many threads, the CPU can end up spending more time switching between threads than executing your program.
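The effect described above can be observed with a small experiment. This is only a sketch: it assumes a standard CPython build with the global interpreter lock (GIL), and the names count_down, run_sequential and run_threaded are made up for illustration. It times the same CPU-bound loop run back to back and run in threads; no particular timings are claimed, since they depend on the machine.

```python
import threading
import time


def count_down(n: int) -> int:
    """Pure-Python busy loop: CPU-bound, so it needs the GIL the whole time."""
    while n > 0:
        n -= 1
    return n


def run_sequential(jobs: int, n: int) -> float:
    """Run the loop `jobs` times one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(jobs: int, n: int) -> float:
    """Run the loop in `jobs` threads at once; return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    jobs, n = 4, 1_000_000
    print(f"sequential: {run_sequential(jobs, n):.2f}s")
    print(f"threaded:   {run_threaded(jobs, n):.2f}s")
```

If the two timings come out close to each other, the threads were not running the loop in parallel. Replacing the busy loop with a blocking wait such as time.sleep would be expected to behave differently, because waiting does not hold the GIL.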

To actually run your code simultaneously, open a few processes instead of threads using the multiprocessing library; your code will then run on several cores. The same caution applies here: don't open hundreds of processes, open only a few. I recommend opening one process per CPU core. The multiprocessing library has a built-in function that returns the number of cores your CPU has:

import multiprocessing
print(multiprocessing.cpu_count())

Note that because the processes really do run your code simultaneously, their prints can interfere with each other, so you will need to use multiprocessing.Lock() and do something like this:

import multiprocessing

lock = multiprocessing.Lock()

lock.acquire()
print("something")
lock.release()

Call lock.acquire() before each print and lock.release() after it (if you don't release the lock, your program will get stuck). This makes sure your prints don't get mixed into each other.
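One way to make a forgotten release impossible is to use the lock as a context manager, which releases it even if the print raises. The helper name safe_print below is hypothetical and is only meant as a sketch of the acquire/print/release pattern; in a real program the same lock object would have to be passed to every process that prints.

```python
import multiprocessing


def safe_print(lock, *args) -> None:
    """Print while holding `lock`; the `with` block always releases it."""
    with lock:
        print(*args)


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    safe_print(lock, "something")
```

The with statement is equivalent to calling lock.acquire() before the print and lock.release() after it, including when an exception is raised in between.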

EDIT:

In your case, because of the timeout, it is actually better to open a few processes and run something like 20 to 30 threads in each one.

If each process had to wait 10 seconds for every failed address one at a time, it would end up slower than opening a lot of threads, so this combination is the fastest approach I can think of for your case.
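A rough back-of-envelope calculation shows why the timeout matters so much. The sketch below uses the numbers from the question (1117 addresses, 344 of them live) and makes two simplifying assumptions that are not from the original: every non-live address waits for the full timeout, and the machine has 8 cores. The function name estimated_timeout_wait is made up for illustration. Real runs will differ, since many dead hosts fail quickly and live hosts take time too.

```python
import math


def estimated_timeout_wait(dead_hosts: int, timeout_s: int, workers: int) -> int:
    """Estimate seconds spent waiting on timeouts.

    Assumes each worker handles one address at a time and every dead
    host takes the full timeout, so the work proceeds in "rounds".
    """
    rounds = math.ceil(dead_hosts / workers)
    return rounds * timeout_s


dead = 1117 - 344  # addresses that never answered in the question's runs

print(estimated_timeout_wait(dead, 10, 8))    # 8 processes, no threads -> 970
print(estimated_timeout_wait(dead, 10, 240))  # 8 processes x 30 threads -> 40
print(estimated_timeout_wait(dead, 4, 240))   # same, with a 4 s timeout -> 16
```

Under these assumptions, a handful of single-threaded processes would spend most of their time waiting, which is why each process still needs its own group of threads.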

You can try something like this:

import multiprocessing
import random
import threading
import time
import requests


MAX_NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS = 30
PROTOCOL = "http"
TIMEOUT = 4
START_TIME = time.time()


def load_url_http(protocol: str, domains: list[str], timeout: int = 10):
    with open("live_subdomains.txt", "a") as live_domains_file:
        for domain in domains:
            try:
                conn = requests.head(protocol + "://" + domain, timeout=timeout)
                if conn.status_code is not None:
                    live_domains_file.write(f"{protocol}://{domain}\n")
            except Exception:
                pass
    return


def create_threads_for_process(protocol: str, domains: list[str], timeout: int = 10):
    # create threads
    threads_list = []
    threads_urls = {}
    start = 0
    number_of_threads_to_open = len(domains) if len(domains) < MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS \
        else MAX_NUMBER_OF_THREADS_IN_EACH_PROCESS
    for i in range(1, number_of_threads_to_open + 1):
        # distribute the work of this process evenly between all the threads
        if i != number_of_threads_to_open:
            threads_urls[i] = domains[start:(len(domains) // number_of_threads_to_open) * i]
            start = (len(domains) // number_of_threads_to_open) * i
        else:
            threads_urls[i] = domains[start:]
        # create and start the thread
        thread = threading.Thread(target=load_url_http,
                                  args=(protocol, threads_urls[i], timeout,),
                                  daemon=True)
        thread.start()
        threads_list.append(thread)
    # wait for all threads to finish
    for thread in threads_list:
        thread.join()


def main():
    with open("live_subdomains.txt", "w") as file:
        file.write("")
    with open("subdomains.txt", "r") as file:
        urls = file.read().split("\n")
    random.shuffle(urls)  # shuffle the urls list
    # create the processes
    processes_list = []
    processes_urls = {}
    start = 0
    number_of_processes_to_open = len(urls) if len(urls) < MAX_NUMBER_OF_PROCESSES else MAX_NUMBER_OF_PROCESSES
    for i in range(1, number_of_processes_to_open + 1):
        if i != number_of_processes_to_open:
            # give each process an even amount of work
            processes_urls[i] = urls[start:(len(urls) // number_of_processes_to_open) * i]
            start = (len(urls) // number_of_processes_to_open) * i
        else:
            # the last process will get a bit more / a bit less in
            # case len(urls) isn't dividable by number_of_processes_to_open
            processes_urls[i] = urls[start:]
        # create the process, start it and add to processes list
        process = multiprocessing.Process(target=create_threads_for_process,
                                          args=(PROTOCOL, processes_urls[i], TIMEOUT,),
                                          daemon=True)
        process.start()
        processes_list.append(process)
    # wait for all processes to finish
    while multiprocessing.active_children():
        time.sleep(0.8)
    # print result
    with open("live_subdomains.txt", "r") as live_urls_file:
        live_count = len(live_urls_file.read().split("\n")) - 1  # -1 empty line at the end
    print(f"\n[ ] Live domain: {live_count}/{len(urls)}", end="")
    print("\n--- %s seconds ---" % (time.time() - START_TIME))


if __name__ == '__main__':
    main()

In this code I open one process for each core the CPU has, and 30 threads in each process. Also, you don't need to wait 10 seconds for a reply; something like 5 seconds is enough, especially when you use HEAD and not GET.
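The code above repeats the same slicing logic twice, once to divide the URLs among processes and once to divide them among threads. As a possible simplification, that logic could live in one helper. The function split_evenly below is a hypothetical name, and unlike the loops above it spreads any remainder over the first chunks instead of giving it all to the last one; treat it as a sketch, not as the original author's method.

```python
def split_evenly(items: list, parts: int) -> list:
    """Split `items` into at most `parts` consecutive chunks whose sizes differ by at most one."""
    # Never create more chunks than there are items, and always at least one.
    parts = max(1, min(parts, len(items)))
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


print(split_evenly(list(range(10)), 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With such a helper, each process would receive one chunk of the URL list, and inside the process the same function would divide that chunk among the threads.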

I ran your code and compared it to mine: your code took 35 seconds to finish and mine took 25 seconds. That is on 1117 URLs; the bigger the URL list, the more significant the difference will be.