ANTLR4 - What is the correct way to define an array type?

I am creating my own grammar, and so far it has only had primitive types. Now I would like to add a reference type, arrays, with a syntax similar to Java or C#, but I have not been able to make it work with ANTLR.

The code example I'm working with would be similar to this:

VariableDefinition
{
    id1: string;
    anotherId: bool;
    arrayVariable: string[5];
    anotherArray: bool[6];
}

MyMethod()
{
    temp: string[3];
    temp2: string;
    temp2 = "Some text";
    temp[0] = temp2;
    temp2 = temp[0];
}

The Lexer contains:

BOOL:                   'bool';
STRING:                 'string';

fragment DIGIT:         [0-9];
fragment LETTER:        [[a-zA-Z\u0080-\u00FF_];
fragment ESCAPE :          '\\"' | '\\\\' ; // Escape 2-char sequences: \" and \\
LITERAL_INT:            DIGIT+ ;
LITERAL_STRING:         '"' (ESCAPE|.)*? '"' ;

OPEN_BRACKET:           '[';
CLOSE_BRACKET:          ']';
COLON:                  ':';
SEMICOLON:              ';';

ID:                     LETTER (LETTER|DIGIT)*;

My parser is an extension of this (there are more rules and other expressions, but I don't think they are related to this scenario):


global_
    : GLOBAL '{' globalVariables+=variableDefinition* '}'
    ;

variableDefinition
    : name=ID ':' type=type_ ';'                                               
    ;

type_
    : referenceType                     # TypeReference
    | primitiveType                     # TypePrimitive
    ;

primitiveType
    : BOOL                              # TypeBool
    | CHAR                              # TypeChar
    | DOUBLE                            # TypeDouble
    | INT                               # TypeInteger
    | STRING                            # TypeString
    ;

referenceType
    : primitiveType '[' LITERAL_INT ']' # TypeArray
    ;

expression_
    : identifier=expression_ '[' position=expression_ ']'      # AccessArrayExpression
    | left=expression_ operator=( '*' | '/' | '%') right=expression_      # ArithmeticExpression
    | left=expression_ operator=( '+' | '-' ) right=expression_      # ArithmeticExpression
    | value=ID                              # LiteralID

I've tried:

Putting spaces between the different lexemes in the example program, in case there was a problem with the lexer (nothing changed).
Creating a rule called arrayType that is referenced from type_ and itself references type_. This fails due to left recursion; ANTLR reports: The following sets of rules are mutually left-recursive [type_, arrayType]
Putting primitive and reference types into a single rule:

type_
    : BOOL                              # TypeBool
    | CHAR                              # TypeChar
    | DOUBLE                            # TypeDouble
    | INT                               # TypeInteger
    | STRING                            # TypeString
    | type_ '[' LITERAL_INT ']'         # TypeArray
    ;

Results:

· With whitespace separating the array (temp: string [5] ;).

line 23:25 missing ';' at '[5'
line 23:27 mismatched input ']' expecting {'[', ';'}

· Without whitespace (temp: string[5];).

line 23:18 mismatched input 'string[5' expecting {BOOL, 'char', 'double', INT, 'string'}
line 23:26 mismatched input ']' expecting ':'

EDIT 1: This is what the parse tree looks like for the example I gave: [image: Parse tree Inspector]

CodePudding user response：

It's common for languages that want to be flexible with whitespace to have a rule, something like this:

WS: [ \t\r\n]+ -> skip; // or -> channel(HIDDEN)

It should address your problem.

This shuttles whitespace off to the side so you don't have to be concerned with it in your parser rules.

Without that sort of approach, you'd still need to define a whitespace rule (same pattern as above), but if you don't skip it (or send it to the HIDDEN channel), you'll have to include it everywhere you want to allow whitespace by inserting a WS?. Clearly this has the potential to become quite tedious (and adds a lot of "noise" to both your grammar and the resulting parse trees).

CodePudding user response：

fragment LETTER:        [[a-zA-Z\u0080-\u00FF_];

You're allowing [ as a letter (and thus as a character in identifiers), so in string[5], string[5 is interpreted as an identifier, which makes the parser think the subsequent ] has no matching [. Similarly in string [5], [5 is interpreted as an identifier, which makes the parser see two consecutive identifiers, which is also not allowed.

To fix this you should remove the [ from LETTER.

As a general tip, when getting parse errors that you don't understand, you should try to look at which tokens are being generated and whether they match what you expect.
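
To see the effect without running ANTLR, here is a rough Python sketch that imitates the lexer's matching policy as I understand it (longest match wins, ties go to the rule listed first). It is only an approximation and not ANTLR itself: the rule list is transcribed from the question, LITERAL_STRING is left out, whitespace is simply skipped, and the names make_rules and tokenize are made up for this illustration.

```python
import re

# Character classes transcribed from the question's LETTER fragment.
BUGGY_LETTER = r"[\[a-zA-Z\u0080-\u00FF_]"  # the stray '[' is part of the set
FIXED_LETTER = r"[a-zA-Z\u0080-\u00FF_]"    # '[' removed


def make_rules(letter):
    """Return (name, pattern) pairs in the same order as the question's lexer."""
    return [
        ("BOOL", re.compile(r"bool")),
        ("STRING", re.compile(r"string")),
        ("LITERAL_INT", re.compile(r"[0-9]+")),
        ("OPEN_BRACKET", re.compile(r"\[")),
        ("CLOSE_BRACKET", re.compile(r"\]")),
        ("COLON", re.compile(r":")),
        ("SEMICOLON", re.compile(r";")),
        ("ID", re.compile(f"{letter}(?:{letter}|[0-9])*")),
    ]


def tokenize(text, rules):
    """Split text into (rule name, lexeme) pairs using longest-match-first."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # stands in for a WS rule with -> skip
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for label, letter in (("buggy", BUGGY_LETTER), ("fixed", FIXED_LETTER)):
        for source in ("temp: string[3];", "temp: string [3];"):
            print(label, repr(source), tokenize(source, make_rules(letter)))
```

With the buggy class, string[3 comes out as a single ID (it is longer than the STRING match), and in the spaced variant [3 becomes an ID right after STRING, which lines up with the two error messages in the question. With the fixed class both inputs give STRING, OPEN_BRACKET, LITERAL_INT, CLOSE_BRACKET as expected.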