Home > OS >  Why is std::regex notoriously much slower than other regular expression libraries?
Why is std::regex notoriously much slower than other regular expression libraries?

Time:01-05

enter image description here

UPDATE: Added comparison with boost::regex.

Twain 
   boost: 8 ms
   std: 203 ms
[a-z]shing
   boost: 282 ms
   std: 295 ms
Huck[a-zA-Z] |Saw[a-zA-Z] 
   boost: 8 ms
   std: 317 ms
\b\w nn\b
   boost: 178 ms
   std: n/a
[a-q][^u-z]{13}x
   boost: 379 ms
   std: 667 ms
Tom|Sawyer|Huckleberry|Finn
   boost: 11 ms
   std: 548 ms
.{0,2}(Tom|Sawyer|Huckleberry|Finn)
   boost: 747 ms
   std:   1861 ms
.{2,4}(Tom|Sawyer|Huckleberry|Finn)
   boost: 754 ms
   std: 2016 ms
Tom.{10,25}river|river.{10,25}Tom
   boost: 27 ms
   std: 328 ms
[a-zA-Z] ing
   boost: 161 ms
   std:   787 ms
\s[a-zA-Z]{0,12}ing\s
   boost: 206 ms
   std: n/a
([A-Za-z]awyer|[A-Za-z]inn)\s
   boost: 739 ms
   std: n/a
["'][^"']{0,30}[?!\.]["']
   boost: 15 ms
   std: 255 ms
∞|✓
   boost: 6 ms
   std: 316 ms
\p{Sm}
   boost: n/a
   std: n/a

CodePudding user response:

Is that because of the C standard requirements or it is just that that particular implementation is not very well optimized?

The answer is yes. Kinda.

There is no question that libstdc 's implementation of <regex> is not well optimized. But there is more to it than that. It's not that the standard requirements inhibit optimizations so much as the standard requirements inhibit changes.

The regex library is defined through a bunch of templates. This allows people to choose between char and wchar_t, which is in theory good. But there's a catch.

Template libraries are used by copy-and-pasting the code directly into the code compiling against those libraries. Because of how templates get included, even types that nobody outside of the template library knows about are effectively part of the library's ABI. If you change them, two libraries compiled against different versions of the standard library cannot work with each other. And because the template parameter for regex is its character type, those implementation details touch basically everything about the implementation.

The minute libstdc (and other standard library implementations) started shipping an implementation of C regular expressions, they bound themselves to a specific implementation that could not be changed in a way that impacted the ABI of the library. And while they could cause another ABI break to fix it, standard library implementers don't like breaking ABI because people don't upgrade to standard libraries that break their code.

When C 11 forbid basic_string copy-on-write implementations, libstdc had an ABI problem. Their COW string was widely used, and changing it would make code that compiled against the new one break when used with code compiled against the old one. It took years before libstdc bit the bullet and actually implemented C 11 strings.

If Regex had been defined without templates, implementations could use traditional mechanisms to hide implementation details. The ABI for the interface to external code could be fixed and unchanging, with only the implementation of the functions behind that ABI changing from version to version.

  •  Tags:  
  • Related