regex for being the start of a specific string-CodePudding

For example, say I want to match strings that are the start of "dog". The possible matches are "", "d", "do", and "dog". What's a good regex for this?

The best I've come up with so far is (d(o(g)?)?)?, which is not easily constructed for arbitrary examples. I am wondering if there is something simpler.

(Note that this is not asking to match strings that start with "dog", which would include things like "dogs" and "doggo", and can be easily achieved with dog.*.)

Similar questions could be asked about matching strings that are the end of a specific string, or contained in a specific string.

CodePudding user response：

I am probably misunderstanding your question. But if you are looking for a general way of constructing a prefix regular expression for an arbitrary non-zero-length string, you could use a recursive function such as the following coded in Python:

import re

def prefix_re(s):
    assert len(s)
    if len(s) == 1:
        s = re.escape(s) # Handle possible special regex characters
        return f'({s})?'
    s0 = re.escape(s[0])
    return f'({s0}{prefix_re(s[1:])})?'

print(prefix_re('dog'))
print(prefix_re('Mr. Jones'))

Prints:

(d(o(g)?)?)?
(M(r(\.(\ (J(o(n(e(s)?)?)?)?)?)?)?)?)?

CodePudding user response：

I can't quite remember where I found this class (also Python), but it takes a list of words and using a trie structure attempts to create an optimal regex for matching any of the words in the list. So we just create a list that consists of all the prefixes:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char)   recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('['   ''.join(cc)   ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:"   "|".join(alt)   ")"

        if q:
            if cconly:
                result  = "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

if __name__ == '__main__':
    trie = Trie()
    for word in ['', 'd', 'do', 'dog']:
        trie.add(word)
    print(trie.pattern())

Prints:

(?:d(?:og?)?)?

CodePudding user response：

It becomes a bit more easy to construct when the last letter can become optional: ^(?:d(?:og?)?)?$. This will save you one nested grouping but for obvious reasons still not very easy to construct for longer words. Maybe alternations can come in handy here as they can be easier on the eye:

^(?:do?|dog)?$

See an online demo

^ - Start line anchor;
(?: - Open non-capturing group to start alternations;
- do? - Either 'd' or 'do';
- | - Or;
- dog - Literally 'dog';
- )? - Close non-capture group and make it optional;
$ - End-line anchor (but swap this for any pattern you'd need to validate).

Note that depending on the length of the word you are looking for, we can make each alernation two letters longer where the 2nd letter becomes optional. For each odd-length word the alternations will end with the full word.

To differentiate between these two options, let's see what it would look like for the word 'contains' or 'container':

^(?:c(?:o(?:n(?:t(?:a(?:i(?:ns?)?)?)?)?)?)?)?$
^(?:c(?:o(?:n(?:t(?:a(?:i(?:n(?:er?)?)?)?)?)?)?)?)?$

vs.

^(?:co?|cont?|contai?|contains?)?$
^(?:co?|cont?|contai?|containe?|container)?$

I guess it's personal which you find 'easier' but I think the alternations will require a little less steps and is therefor somewhat more recource-friendly, but more importantly they are easier to construct and read.

CodePudding user response：

I am wondering if there is something simpler.

Not really, there are some other options but there isn't really a super clean way to write it using regex.

The solution you already came up with can be build with relative ease though:

function buildRegex(string) {
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#escaping
  const escape = string => string.replace(/[.* ?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(
    Array.from(string)
      .map(escape) // <- optional, but recomended if your string can contain special regex chars
      .reduceRight((pattern, char) => `(${char}${pattern})?`, "") // also known as "fold right"
  );
}

console.log(buildRegex("dog"));