Background
I've made a couple simple compilers before, but I've never properly addressed this issue:
Say I have a token LT which searches the expression < and a token LTEQ which searches <=.
A LT would match part of <= in this case, and I don't want this.
Other potential examples are:
===vs==vs==>vs>vs=vs>=
Before, what I've done is as follows:
- Order the regular expressions from most specific to least specific
- For each expression match, replace it in the line with an equal-length series of whitespace characters and record the token with its start index (Replacing
||:"true || false"=>"true false") - Sort the list of tokens by their start index
This is inefficient and only works in regex libraries that allow for a lambda to be passed in to process matches. Naturally I want to improve this.
The problem
How do I tokenize a string with tokens definitions that overlap as show with LT and LTEQ?
Possible solutions I've thought of but not tried:
Keeping a map of all used spans of characters and ignore matches within these spans Only having very simple token definitions and creating more complicated patterns from the simple ones
LTEQ<=would be created fromLT<andEQ=
CodePudding user response:
You should try to tokenise the input in one single sweep. You would create a "monster" regex that would match any token (without regarding context unless absolutely necessary -- that would come later), giving precedence to larger tokens over smaller ones.
Here is an example with a very basic regex just to demonstrate the idea. It is implemented in JavaScript, and it parses a tiny Java-like input:
let input = `
for (int i = 0; i < 1e 4; i ) {
int k = i <= n / 2 ? 9.42 : 3.14;
System.out.println(k);
}
`;
const regex = /[ -]?\d (\.\d*)?(e[- ]?\d )?|\w |[<>=]=?|\ [ =]?|-[-=]?|\S/g;
const tokens = input.match(regex);
console.log(tokens);
The final \S is the "anything else" match, so that this is guaranteed to capture all non-space text. Following this basic tokenisation you'd then possibly create an abstract syntax tree from it.
