Python INFO to CONSOLE and DEBUG to FILE Logging-CodePudding

So I have been trying for days to find a solution for my Problem...

I have the following structure

main.py
   -> module1.py
     -> task (Threadable Class)
   -> module2.py
     -> task (Threadable Class)

In my main the user is able to pick a module and the amount of tasks to start in that module. I used to print() their statuse each time something new happens. Example valid response from a web server. But doing this caused the output to get messed up as mutliple threads were calling print at the same time. Even flush=True didnt help here.

From what I have heard the only solution to achieve nice outputs is to use logging. Logging is already implemented as the program creates a log with debug information upon startup and keeps writing to it during runtime. This is for better error-tracing and stuff.

I have tried following solouting:

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler(f'log/{CURR_LOGFILE}.txt'),
        logging.StreamHandler()
    ]
)

But this does not help me as I want logging.info() to be printed to the console while logging.debug() to the logfile.

How would I go on about making this work and thread safe? Or is there maybe a better method for each thread to emit their status?

Lock wont work as it temporary locks the thread and every ms counts in this usecase.

CodePudding user response：

Technically, using logging is not the only solution. You correctly identified the problem : the multiple threads share a same stdout buffer, in which they write concurrently. The standard logging library happens to be thread-safe, but you could do the same without it ¹.

You can create a threading.Lock and rewrite some of your code so that a thread must only print when it holds the lock.
Your workload does not seem output-heavy, so performance should not be a problem.

¹ I consider it a case of XY problem : you don't actually want to use logging, what you actually want, If I understood correctly, is just to synchronize your outputs so that the display at the end is not messed up.

CodePudding user response：

Using logging works because internally it uses a lock too. So saying that

I already tried using threading.Lock but as the code has to be executed without any stopping running 100 threads would cause other threads to have to wait until they hold the lock

is erroneous.

Here is a simple benchmark, I did not intend it to be bullet-proof, but just to show you that both approaches work in a similar way :

import logging
import time
import threading
import random
import sys


logger = logging.getLogger("so70643825")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)


lock = threading.Lock()
def lock_print(*args, **kwargs):
    with lock:
        print(*args, **kwargs)


def task_creator(task_number: int, display_function):
    def actual_task(x: int):
        for i in range(x):
            display_function(f"Task {task_number}, {i=}")
            time.sleep(random.random() / 10)  # sleep up to 0.1s
    return actual_task


def run_tasks(number_of_threads: int, number_of_iterations: int, display_function):
    threads = []
    for thread_number in range(number_of_threads):
        thread = threading.Thread(target=task_creator(task_number=thread_number, display_function=display_function),
                                  args=(number_of_iterations,))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()


def main():
    N_THREADS = 100
    N_LOOPS = 10
    N_MEASURES = 10
    times = []

    for i in range(N_MEASURES):
        time_before_logging_tasks = time.time()
        run_tasks(N_THREADS, N_LOOPS, logger.info)
        time_after_logging_tasks = time.time()

        time_before_locking_tasks = time.time()
        run_tasks(N_THREADS, N_LOOPS, lock_print)
        time_after_locking_tasks = time.time()

        time_logging = time_after_logging_tasks - time_before_logging_tasks
        time_locking = time_after_locking_tasks - time_before_locking_tasks
        times.append((time_logging, time_locking,))

    for time_logging, time_locking in times:
        print(f"logging: {time_logging}   locking: {time_locking} best={'LOG' if time_logging < time_locking else 'LOCK'}")


if __name__ == "__main__":
    main()

Results :

logging: 0.7429051399230957   locking: 0.7113175392150879 best=LOCK
logging: 0.7250735759735107   locking: 0.6937775611877441 best=LOCK
logging: 0.712653636932373    locking: 0.691455602645874  best=LOCK
logging: 0.7124729156494141   locking: 0.6672348976135254 best=LOCK
logging: 0.7507524490356445   locking: 0.7678782939910889 best=LOG
logging: 0.690934419631958    locking: 0.7111172676086426 best=LOG
logging: 0.7271988391876221   locking: 0.7642250061035156 best=LOG
logging: 0.7295012474060059   locking: 0.7794029712677002 best=LOG
logging: 0.7717809677124023   locking: 0.7706160545349121 best=LOCK
logging: 0.7927803993225098   locking: 0.7397556304931641 best=LOCK

Which is a mix of LOG and LOCK wins, but each time the difference between them is small.
Removing the random sleep makes LOCK a clear winner (at least ten times quicker). I think it's because the sleep dwarfs the overhead introduced by all the logging machinery, which is just not required for lock_print.

If the performances are VERY (and I mean it) important to the problem, you should have said it originally.