So basically, my input string is some kind of text containing keywords that I want to match, provided that:
- each keyword may have whitespace/non-word chars prepended/appended, or none: (|\s\W)
- there must be exactly one non-word/whitespace char separating multiple keywords, or the keyword is at the beginning/end of the line
- a keyword simply occurring as a substring does not count, e.g. bar does not match foobarbaz
E.g.:
input: "#foo barbazboo tree car"
keywords: {"foo", "bar", "baz", "boo", "tree", "car"}
I am dynamically generating a Regex in C# using an enumerable of keywords and a StringBuilder:
StringBuilder sb = new();
foreach (var kwd in keywords)
{
sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
}
sb.Remove(sb.Length - 1, 1); // last '|'
_regex = new Regex(sb.ToString(), RegexOptions.Compiled | RegexOptions.IgnoreCase);
Testing this pattern on regexr.com, the given input matches all keywords. However, I do not want {bar, baz, boo} included, since there is no whitespace between them.
Ideally, I'd want my regex to only match {foo, tree, car}.
Modifying my pattern to (( |[\s\W])kwd([\s\W]| )) excludes {bar, baz, boo}, but it also fails on {tree, car}: each match consumes its surrounding separators, so there would have to be at least two spaces between adjacent keywords.
How do I specify "there may be only one whitespace separating two keywords", or, to put it differently, "half a whitespace is ok", while preserving the ability to create the regex dynamically?
CodePudding user response:
In your case, you need to build the pattern like this:
var pattern = $@"\b(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})\b";
_regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
Here, the longer keywords come before the shorter ones, so if you have foo, bar and foo bar, the pattern will look like \b(?:foo\ bar|foo|bar)\b and will match foo bar as a whole rather than foo and bar separately.
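As a quick check, here is a minimal sketch that runs the \b-based pattern against the sample input from the question. The variable names and the top-level program layout are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Keywords and input taken from the question.
var keywords = new[] { "foo", "bar", "baz", "boo", "tree", "car" };
var input = "#foo barbazboo tree car";

// Longer keywords first, each escaped, joined into one alternation.
var alternation = string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape));

// Wrap the alternation in word boundaries.
var pattern = $@"\b(?:{alternation})\b";
var regex = new Regex(pattern, RegexOptions.IgnoreCase);

// Collect the matched values in order of appearance.
List<string> found = regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

Console.WriteLine(string.Join(", ", found)); // foo, tree, car
```

bar, baz and boo are skipped because inside barbazboo there is no word boundary between one keyword and the next.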
In case your keywords can look like keywords: {"$foo", "^bar^", "[baz]", "(boo)", "tree ", " car"}, i.e. they can have special chars at the start/end of the keyword, you can use
_regex = new Regex($@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
The $@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)" is an interpolated verbatim string literal that contains
- (?!\B\w) - left-hand adaptive dynamic word boundary
- (?: - start of a non-capturing group
- {string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))} - sorts the keywords by length in descending order, escapes them and joins them with |
- ) - end of the group
- (?<!\w\B) - right-hand adaptive dynamic word boundary.
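To see how the adaptive boundaries differ from plain \b, the following sketch runs both variants over the same text. The keyword subset, the input string and the FindAll helper are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical keywords and input, chosen only for illustration.
var keywords = new[] { "$foo", "[baz]", "tree" };
var input = "$foo [baz] $foobar tree trees";

// Longer keywords first, each escaped, joined into one alternation.
var alternation = string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape));

// Plain word boundaries versus the adaptive lookaround version.
var plain = $@"\b(?:{alternation})\b";
var adaptive = $@"(?!\B\w)(?:{alternation})(?<!\w\B)";

// Helper (hypothetical name): returns all match values in order of appearance.
List<string> FindAll(string pattern) =>
    Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToList();

var plainMatches = FindAll(plain);
var adaptiveMatches = FindAll(adaptive);

Console.WriteLine(string.Join(", ", plainMatches));    // tree
Console.WriteLine(string.Join(", ", adaptiveMatches)); // $foo, [baz], tree
```

With plain \b, $foo and [baz] are missed because there is no word boundary between a space (or the start of the string) and a non-word character. The adaptive version only requires a boundary on a side where the keyword itself starts or ends with a word character, so $foobar and trees are still rejected.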
