Splitting Dictionary on Bytes-CodePudding

I have some python code that:

Pulls various metrics from different endpoints
Joins them in a common dictionary with some standardized key/values
Uploads the dictionary to another tool for analysis

While this generally works, there are issues when the dictionary gets too large, it causes performance issues in various points.

I've seen examples using itertools to split based on ranges of keys, to evenly split based on number of keys. However, I would like to try and split it based on the size in bytes, as some of the metrics are drastically larger than others.

Can a dictionary be dynamically split into a list of dictionaries based on the size in bytes?

CodePudding user response：

Assuming that both keys and values are sane types that you can call sys.getsizeof on in a meaningful way, and all distinct objects, you can use that information to split your dictionary into equal-ish chunks.

First compute the total size if you want the max chunk to be a divisor of that. If your maximum size is fixed externally, you can skip this step:

total_size = sum(getsizeof(k)   getsizeof(v) for k, v in my_dict.items())

Now you can iterate the dictionary, assuming approximately random distribution of sizes throughout, cutting a new dict before you exceed the max_size threshold:

from sys import getsizeof

def split_dict(d, max_size):
    result = []
    current_size = 0
    current_dict = {}
    while d:
        k, v = d.popitem()
        increment = getsizeof(k)   getsizeof(v)
        if increment   current_size > max_size:
            result.append(current_dict)
            if current_size:
                current_dict = {k: v}
                current_size = increment
            else:
                current_dict[k] = v  # going to list
                current_dict = {}
                current_size = 0
        else:
            current_dict[k] = v
            current_size  = increment
    if current_dict:
        result.append(current_dict)
    return result

Keep in mind that dict.popitem is descructive: you are actually removing everything from my_dict to populate the smaller versions.

Here is a highly simplified example:

>>> from string import ascii_letters
>>> d = {s: i for i, s in enumerate(ascii_letters)}
>>> total_size = sum(getsizeof(k)   getsizeof(v) for k, v in d.items())
>>> split_dict(d, total_size // 5)
[{'Z': 51, 'Y': 50, 'X': 49, 'W': 48, 'V': 47, 'U': 46, 'T': 45, 'S': 44, 'R': 43, 'Q': 42},
 {'P': 41, 'O': 40, 'N': 39, 'M': 38, 'L': 37, 'K': 36, 'J': 35, 'I': 34, 'H': 33, 'G': 32},
 {'F': 31, 'E': 30, 'D': 29, 'C': 28, 'B': 27, 'A': 26, 'z': 25, 'y': 24, 'x': 23, 'w': 22},
 {'v': 21, 'u': 20, 't': 19, 's': 18, 'r': 17, 'q': 16, 'p': 15, 'o': 14, 'n': 13, 'm': 12},
 {'l': 11, 'k': 10, 'j': 9, 'i': 8, 'h': 7, 'g': 6, 'f': 5, 'e': 4, 'd': 3, 'c': 2},
 {'b': 1, 'a': 0}]

As you can see, the split is not necessarily optimal in terms of distribution, but it ensures that no chunk is bigger than max_size, unless one single entry requires more bytes than that.

Update For Not-Sane Values

If you have arbitrarily large nested values, you can still split at the top level, however, you will have to replace getsizeof(v) with something more robust. For example:

from collections.abc import Mapping, Iterable

def xgetsizeof(x):
    if isinstance(x, Mapping):
        return getsizeof(x)   sum(xgetsizeof(k)   xgetsizeof(v) for k, v in x.items())
    if isinstance(x, Iterable) and not isintance(x, str):
        return getsizeof(x)   sum(xgetizeof(e) for e in x)
    return getsizeof(x)

Now you can also compute total_size with a single call:

total_size = xgetsizeof(d)

Notice that this is bigger than the value you saw before. The earlier result was

xgetsizeof(d) - getsizeof(d)

To make the solution really robust, you would need to add instance tracking to avoid circular references and double-counting.