I have some python code that:
- Pulls various metrics from different endpoints
- Joins them in a common dictionary with some standardized key/values
- Uploads the dictionary to another tool for analysis
While this generally works, there are issues when the dictionary gets too large, it causes performance issues in various points.
I've seen examples using itertools to split based on ranges of keys, to evenly split based on number of keys. However, I would like to try and split it based on the size in bytes, as some of the metrics are drastically larger than others.
Can a dictionary be dynamically split into a list of dictionaries based on the size in bytes?
CodePudding user response:
Assuming that both keys and values are sane types that you can call sys.getsizeof on in a meaningful way, and all distinct objects, you can use that information to split your dictionary into equal-ish chunks.
First compute the total size if you want the max chunk to be a divisor of that. If your maximum size is fixed externally, you can skip this step:
total_size = sum(getsizeof(k) getsizeof(v) for k, v in my_dict.items())
Now you can iterate the dictionary, assuming approximately random distribution of sizes throughout, cutting a new dict before you exceed the max_size threshold:
from sys import getsizeof
def split_dict(d, max_size):
result = []
current_size = 0
current_dict = {}
while d:
k, v = d.popitem()
increment = getsizeof(k) getsizeof(v)
if increment current_size > max_size:
result.append(current_dict)
if current_size:
current_dict = {k: v}
current_size = increment
else:
current_dict[k] = v # going to list
current_dict = {}
current_size = 0
else:
current_dict[k] = v
current_size = increment
if current_dict:
result.append(current_dict)
return result
Keep in mind that dict.popitem is descructive: you are actually removing everything from my_dict to populate the smaller versions.
Here is a highly simplified example:
>>> from string import ascii_letters
>>> d = {s: i for i, s in enumerate(ascii_letters)}
>>> total_size = sum(getsizeof(k) getsizeof(v) for k, v in d.items())
>>> split_dict(d, total_size // 5)
[{'Z': 51, 'Y': 50, 'X': 49, 'W': 48, 'V': 47, 'U': 46, 'T': 45, 'S': 44, 'R': 43, 'Q': 42},
{'P': 41, 'O': 40, 'N': 39, 'M': 38, 'L': 37, 'K': 36, 'J': 35, 'I': 34, 'H': 33, 'G': 32},
{'F': 31, 'E': 30, 'D': 29, 'C': 28, 'B': 27, 'A': 26, 'z': 25, 'y': 24, 'x': 23, 'w': 22},
{'v': 21, 'u': 20, 't': 19, 's': 18, 'r': 17, 'q': 16, 'p': 15, 'o': 14, 'n': 13, 'm': 12},
{'l': 11, 'k': 10, 'j': 9, 'i': 8, 'h': 7, 'g': 6, 'f': 5, 'e': 4, 'd': 3, 'c': 2},
{'b': 1, 'a': 0}]
As you can see, the split is not necessarily optimal in terms of distribution, but it ensures that no chunk is bigger than max_size, unless one single entry requires more bytes than that.
Update For Not-Sane Values
If you have arbitrarily large nested values, you can still split at the top level, however, you will have to replace getsizeof(v) with something more robust. For example:
from collections.abc import Mapping, Iterable
def xgetsizeof(x):
if isinstance(x, Mapping):
return getsizeof(x) sum(xgetsizeof(k) xgetsizeof(v) for k, v in x.items())
if isinstance(x, Iterable) and not isintance(x, str):
return getsizeof(x) sum(xgetizeof(e) for e in x)
return getsizeof(x)
Now you can also compute total_size with a single call:
total_size = xgetsizeof(d)
Notice that this is bigger than the value you saw before. The earlier result was
xgetsizeof(d) - getsizeof(d)
To make the solution really robust, you would need to add instance tracking to avoid circular references and double-counting.
