Issue
A string should be tested whether it contains numbers with optional characters between the digits. I'll use the terms number meaning a series of digits (e.g. 123) and digit meaning an element of a number (e.g. 0, 1, 2).
Example strings the regex should test and detect the number 012345:
abc 012345 def
abc 0.1.2.3.4.5 def
abc 0-1 2a3x4,,,5 def
Characteristics
- a) The numbers are known, i.e. it should not match any number but specific numbers.
- b) The symbols between digits are unknown and of unknown lengths; though in practice, the length won't exceed 10 in 99% of cases.
- c) The strings to test are < 100 characters in 99% of cases.
The code run in JavaScript Node.js:
// For each regex
for (const regex of regexes) {
if (new RegExp(regex, 'iu').test(text)) {
// ...
}
}
The most simple (and most inefficient) regex requires a significant compute time:
const regexes = [
'0.*1.*2.*3.*4.*5',
'1.*1.*2.*2.*3.*3'
];
Is there any potential optimization, either in the regex or in the JS code?
Performance
Execution time comparison from suggestions in comments and answers below, in order from fastest to slowest:
- remove non-number chars with
str.replace(/\D/g, '')and then use the regex012345is the fastest; it's not a pure regex optimization, but as I said, the optimization can include the JavaScript code (thanks @anubhava) - use negated character classes
0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5is only ~3% slower than (1) and doesn't require any additional JS code, so it's the best pure regex optimization (thanks @WiktorStribiżew) - with real word data, both the greedy
1.*2.*3.*4.*5and range limited1.{0,10}2.{0,10}3.{0,10}4.{0,10}5variants are ~50% - 90% slower than (1)
CodePudding user response:
You may use negated character classes negating the char that is on the right:
const regexes = [
'0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5',
'1[^1]*1[^2]*2[^2]*2[^3]*3[^3]*3'
]
See, [^1]*1 matches any zero or more chars other than 1 and then a 1 char. Negated character classes are much more efficient than .* or .*? construts since they help avoid excessive backtracking and should be used always when you need to match shortest substrings between two chars.
