2 years ago

#56135

natem

ANTLR4 Best practice on token ambiguities: Lexer predicate, or Parser tree walker

I have a question about a certain ambiguity I am encountering in a grammar I am currently working on. Here is the problem, in brief. Consider these two inputs:

1010
0101

In isolation, in my grammar the first input is interpreted as a decimal number, the second as an octal due to the leading zero.

However, if the character preceding each of these sequences is a %, then both are interpreted as binary numbers. That would not be a problem if we stopped there.

Now, suppose we encountered a 5 before the %. What would happen? Does my grammar consider each of these valid input:

5%1010
5%0101

The answer is "Yes!" The rightmost sequences of 1s and 0s simply revert back to decimal and octal, respectively, and the % is a modulo operator.

This wouldn't be a problem if expressions in my grammar consisted only of digits, but unfortunately that is not the case: any number of non-digit tokens could take the place of the 5 in the example above, such as variables, braces, parentheses, and even operators like minus signs.

The solution I have come to in ANTLR is simply to have an expression rule where one of the alternatives concatenates an expression and a binary number, so you have:

expr
    :    expr Binary 
    |    expr '%' expr
    |    Integer
    |    Octal
    |    Binary
    ;    
 
Integer
    :    '0'
    |    [1-9] [0-9]*
    ;
    
Octal
    :    '0' [0-7]+
    ;

Binary
    :    '%' [01]+
    ;

I then leave it up to my visitor to actually "pull apart" the right-hand side of the expression alternative above (the expr Binary one) and properly calculate the modulo, which means I essentially have to "re-tokenize" the % and the digits that follow it.
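The "re-tokenizing" step described above might look roughly like the following. This is only a sketch in plain Python, not an actual ANTLR visitor: `reinterpret_binary` and `eval_expr_binary` are hypothetical helper names, and the decimal/octal rules are assumed to be exactly those of the Integer and Octal lexer rules in the grammar.

```python
import re

# Mirror the grammar's lexer rules:
#   Integer : '0' | [1-9] [0-9]*
#   Octal   : '0' [0-7]+
_INTEGER = re.compile(r"0|[1-9][0-9]*")
_OCTAL = re.compile(r"0[0-7]+")


def reinterpret_binary(token_text: str) -> int:
    """Re-read the text of a Binary token (e.g. '%1010') as the
    right-hand operand of a modulo: strip the '%' and classify the
    remaining digits as Octal or Integer, as the lexer would have."""
    if not token_text.startswith("%"):
        raise ValueError(f"not a Binary token: {token_text!r}")
    digits = token_text[1:]
    if _OCTAL.fullmatch(digits):
        return int(digits, 8)
    if _INTEGER.fullmatch(digits):
        return int(digits, 10)
    raise ValueError(f"unexpected digits after '%': {digits!r}")


def eval_expr_binary(left_value: int, binary_token_text: str) -> int:
    """Evaluate the `expr Binary` alternative as `left % right`."""
    return left_value % reinterpret_binary(binary_token_text)
```

For example, `eval_expr_binary(2025, "%1010")` treats the right side as decimal 1010, while `eval_expr_binary(200, "%0101")` treats it as octal 0101 (decimal 65). A right operand of zero (`"%0"`) is not handled here and would raise `ZeroDivisionError`.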

I guess my question is: Is this the best solution given my case? I fully accept it if so, but I am curious if others have had to resort to things like these.

I cooked up a lexer predicate to do some crazy lookaheads (and lookbehinds) in the input, but my instinct was that this felt wrong, as I was essentially hand-parsing rather than leveraging the tool itself to give me what I needed to work with.
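To make the lookbehind idea concrete, here is a minimal sketch of a hand-written tokenizer that decides what `%` means from the kind of the previous token. It is not an ANTLR lexer predicate and not the predicate mentioned above; the token kinds `Ident`, `LParen`, `RParen`, `Op`, and `Mod` are assumptions added for illustration, since the grammar shown only defines Integer, Octal, and Binary.

```python
import re

# Token kinds that end an operand. A '%' right after one of these is
# assumed to be the modulo operator; anywhere else it starts a Binary.
_OPERAND_END = {"Integer", "Octal", "Binary", "Ident", "RParen"}

_BINARY = re.compile(r"%[01]+")

# Octal is listed before Integer so that '0101' is not split into '0' and '101'.
_TOKEN_SPEC = [
    ("Octal", re.compile(r"0[0-7]+")),
    ("Integer", re.compile(r"0|[1-9][0-9]*")),
    ("Ident", re.compile(r"[A-Za-z_]\w*")),
    ("LParen", re.compile(r"\(")),
    ("RParen", re.compile(r"\)")),
    ("Op", re.compile(r"[-+*/]")),
]


def tokenize(text):
    """Return a list of (kind, value) pairs for the input string."""
    tokens = []
    pos = 0
    prev_kind = None
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        if text[pos] == "%":
            match = _BINARY.match(text, pos)
            # The "lookbehind": only treat '%' as a binary prefix when
            # the previous token could not have ended an operand.
            if match and prev_kind not in _OPERAND_END:
                kind, value = "Binary", match.group()
            else:
                kind, value = "Mod", "%"
        else:
            for kind, pattern in _TOKEN_SPEC:
                match = pattern.match(text, pos)
                if match:
                    value = match.group()
                    break
            else:
                raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((kind, value))
        prev_kind = kind
        pos += len(value)
    return tokens
```

With this sketch, `tokenize("%1010")` yields a single Binary token, while `tokenize("5%1010")` yields an Integer, a Mod, and another Integer. The trade-off is the one raised above: the decision about what follows an operand is grammar knowledge that has been moved into the lexer by hand.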

antlr

antlr4

ambiguity
