Home > Blockchain >  Fast String "Startswith" Matching for Dict like object
Fast String "Startswith" Matching for Dict like object

Time:01-07

I currently have some code which needs to be very performant, where I am essentially doing a string dictionary key lookup:

class Foo:
    def __init__(self):
        self.fast_lookup = {"a": 1, "b": 2}

    def bar(self, s):
        return self.fast_lookup[s]

self.fast_lookup has O(1) lookup time, and there is no try/if etc code that would slow down the lookup

Is there anyway to retain this speed while doing a "startswith" lookup instead? With the code above calling bar on s="az" would result in a key error, if it were changed to a "startswith" implementation then it would return 1.

NB: I am well aware how I could do this with a regex/startswith statement, I am looking for performance specifically for startswith dict lookup

CodePudding user response:

I dont fully understand the question, but what I would do is try and think of ways to reduce the work the lookup even has to do. If you know the basic searches the startswith is going to do, you can just add those as keys to the dictionary and values that point to the same object. Your dict will get pretty big pretty fast, however it will greatly reduce the lookup i believe. So maybe for a more dynamic method you can add dict keys for the first groups of letters up to three for each entry.

Without activly storing the references for each search, your code will always need to get each dict objects value until it gets one that matches. You cannot reduce that.

CodePudding user response:

An efficient way to do this would be to use the pyahocorasick module to construct a trie with the possible keys to match, then use the longest_prefix method to determine how much of a given string matches. If no "key" matched, it returns 0, otherwise it will say how much of the string passed exists in the automata.

After installing pyahocorasick, it would look something like:

import ahocorasick

class Foo:
    def __init__(self):
        self.fast_lookup = ahocorasick.Automaton()
        for k, v in {"a": 1, "b": 2}.items():
            self.fast_lookup.add_word(k, v)

    def bar(self, s):
        index = self.fast_lookup.longest_prefix(s)
        if not index:  # No prefix match at all
            raise KeyError(s)
        return self.fast_lookup.get(s[:index])

If it turns out the longest prefix doesn't actually map to a value (say, 'cat' is mapped, but you're looking up 'cab', and no other entry actually maps 'ca' or 'cab'), this will die with a KeyError. Tweak as needed to achieve precise behavior desired (you might need to use longest_prefix as a starting point and try to .get() for all substrings of that length or less until you get a hit for instance).

Note that this isn't the primary purpose of Aho-Corasick (it's an efficient way to search for many fixed strings in one or more long strings in a single pass), but tries as a whole are an efficient way to deal with prefix search of this form, and Aho-Corasick is implemented in terms of tries and provides most of the useful features of tries to make it more broadly useful (as in this case).

  •  Tags:  
  • Related