So I have been trying for days to find a solution for my Problem...
I have the following structure
main.py
-> module1.py
-> task (Threadable Class)
-> module2.py
-> task (Threadable Class)
In my main the user is able to pick a module and the amount of tasks to start in that module. I used to print() their statuse each time something new happens. Example valid response from a web server. But doing this caused the output to get messed up as mutliple threads were calling print at the same time. Even flush=True didnt help here.
From what I have heard the only solution to achieve nice outputs is to use logging. Logging is already implemented as the program creates a log with debug information upon startup and keeps writing to it during runtime. This is for better error-tracing and stuff.
I have tried following solouting:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.FileHandler(f'log/{CURR_LOGFILE}.txt'),
logging.StreamHandler()
]
)
But this does not help me as I want logging.info() to be printed to the console while logging.debug() to the logfile.
How would I go on about making this work and thread safe? Or is there maybe a better method for each thread to emit their status?
Lock wont work as it temporary locks the thread and every ms counts in this usecase.
CodePudding user response:
Technically, using logging is not the only solution. You correctly identified the problem : the multiple threads share a same stdout buffer, in which they write concurrently. The standard logging library happens to be thread-safe, but you could do the same without it 1.
You can create a threading.Lock and rewrite some of your code so that a thread must only print when it holds the lock.
Your workload does not seem output-heavy, so performance should not be a problem.
1 I consider it a case of XY problem : you don't actually want to use logging, what you actually want, If I understood correctly, is just to synchronize your outputs so that the display at the end is not messed up.
CodePudding user response:
Using logging works because internally it uses a lock too. So saying that
I already tried using
threading.Lockbut as the code has to be executed without any stopping running 100 threads would cause other threads to have to wait until they hold the lock
is erroneous.
Here is a simple benchmark, I did not intend it to be bullet-proof, but just to show you that both approaches work in a similar way :
import logging
import time
import threading
import random
import sys
logger = logging.getLogger("so70643825")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)
lock = threading.Lock()
def lock_print(*args, **kwargs):
with lock:
print(*args, **kwargs)
def task_creator(task_number: int, display_function):
def actual_task(x: int):
for i in range(x):
display_function(f"Task {task_number}, {i=}")
time.sleep(random.random() / 10) # sleep up to 0.1s
return actual_task
def run_tasks(number_of_threads: int, number_of_iterations: int, display_function):
threads = []
for thread_number in range(number_of_threads):
thread = threading.Thread(target=task_creator(task_number=thread_number, display_function=display_function),
args=(number_of_iterations,))
threads.append(thread)
thread.start()
for thread in threads:
thread.join()
def main():
N_THREADS = 100
N_LOOPS = 10
N_MEASURES = 10
times = []
for i in range(N_MEASURES):
time_before_logging_tasks = time.time()
run_tasks(N_THREADS, N_LOOPS, logger.info)
time_after_logging_tasks = time.time()
time_before_locking_tasks = time.time()
run_tasks(N_THREADS, N_LOOPS, lock_print)
time_after_locking_tasks = time.time()
time_logging = time_after_logging_tasks - time_before_logging_tasks
time_locking = time_after_locking_tasks - time_before_locking_tasks
times.append((time_logging, time_locking,))
for time_logging, time_locking in times:
print(f"logging: {time_logging} locking: {time_locking} best={'LOG' if time_logging < time_locking else 'LOCK'}")
if __name__ == "__main__":
main()
Results :
logging: 0.7429051399230957 locking: 0.7113175392150879 best=LOCK
logging: 0.7250735759735107 locking: 0.6937775611877441 best=LOCK
logging: 0.712653636932373 locking: 0.691455602645874 best=LOCK
logging: 0.7124729156494141 locking: 0.6672348976135254 best=LOCK
logging: 0.7507524490356445 locking: 0.7678782939910889 best=LOG
logging: 0.690934419631958 locking: 0.7111172676086426 best=LOG
logging: 0.7271988391876221 locking: 0.7642250061035156 best=LOG
logging: 0.7295012474060059 locking: 0.7794029712677002 best=LOG
logging: 0.7717809677124023 locking: 0.7706160545349121 best=LOCK
logging: 0.7927803993225098 locking: 0.7397556304931641 best=LOCK
Which is a mix of LOG and LOCK wins, but each time the difference between them is small.
Removing the random sleep makes LOCK a clear winner (at least ten times quicker). I think it's because the sleep dwarfs the overhead introduced by all the logging machinery, which is just not required for lock_print.
If the performances are VERY (and I mean it) important to the problem, you should have said it originally.
