Matching default identifiers

Discussion:

Rui Vilão

2012-11-29 19:10:07 UTC

Hi all,

I want to match a list of well-known identifiers, and if the input is not
one of them I want it to become something like a default token. It
would be something like:

TOKEN1: 'Token1'
;

TOKEN2: 'Token2'
;

DEFAULT: NONE OF THE ABOVE AND MUST FOLLOW LETTER (LETTER|NUMBER)*;

Anyone know if it's possible to accomplish that?

Thanks in advance,
Best regards,

Rui

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address

Bernard Kaiflin

2012-11-29 22:34:06 UTC

Hi,

The rule DEFAULT should work:

DEFAULT
: LETTER (LETTER|NUMBER)*
;

at least in v4, going by the explanation on page 293 of the beta 3 book
http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference

Loosely speaking, the lexer’s goal is to choose the rule that matches the
most input characters. At each character, the lexer decides which rules are
still viable. Eventually, only a single rule will be still viable. At that
point, the lexer creates a token object according to the rule’s token type and
matched text.

Sometimes the lexer is faced with more than a single viable matching rule.
For example, input enum would match an ENUM rule and an ID rule. If the
next character after enum is a space, neither rule can continue. The lexer
resolves the ambiguity by choosing the viable rule specified first in the
grammar. That’s why we have to place keyword rules before an identifier
rule like this:

ENUM : 'enum' ;
ID : [a-z]+ ;

If, on the other hand, the next character after input enum is a letter,
then only ID is viable.

In your case, an input `Token1` followed by a character that DEFAULT cannot
match (a space, for example) will be matched by TOKEN1, but `Token12` will be
matched by DEFAULT.
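
To make the selection rule concrete, here is a rough Python sketch that mimics it for the three rules in this thread. It is not ANTLR and not how the generated lexer is implemented; the RULES table and the tokenize function are names made up for this illustration, and the regular expressions are assumed equivalents of TOKEN1, TOKEN2, DEFAULT and WS. The idea is only: try every rule at the current position, keep the longest match, and on a tie keep the rule listed first.

```python
import re

# Rules in grammar order. Earlier rules win ties, as in the quoted explanation.
# These regexes are assumed stand-ins for the lexer rules in this thread.
RULES = [
    ("TOKEN1", re.compile(r"Token1")),
    ("TOKEN2", re.compile(r"Token2")),
    ("DEFAULT", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Return (rule_name, lexeme) pairs for text, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly greater: a later rule only wins by matching more text,
            # so on equal length the rule listed first is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("Token1 Token2 asjdjffh123 Token123"))
```

With this sketch, `Token1` and DEFAULT both match six characters of the input `Token1`, so the earlier rule TOKEN1 is kept, while for `Token123` DEFAULT matches eight characters and wins on length.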

Verification :

grammar Question;

question
@init {System.out.println("Question last update 2326");}
: entry+
;

entry
: ( TOKEN1 | TOKEN2 | DEFAULT ) {System.out.println("entry found : " + $entry.text);}
;

TOKEN1
: 'Token1' {System.out.println("TOKEN1 found : " + getText());}
;

TOKEN2
: 'Token2' {System.out.println("TOKEN2 found : " + getText());}
;

DEFAULT
: LETTER (LETTER|DIGIT)* {System.out.println("DEFAULT found : " + getText());}
;

WS : [ \t\r\n]+ -> channel(HIDDEN) ;
fragment DIGIT : [0-9] ;
fragment LETTER : [a-zA-Z] ;

$ antlr4 Question.g4
$ javac Q*.java
$ grun Question question
Token1 Token2
asjdjffh123
Token123
➾EOF [ctrl-D / Ctrl-Z]
TOKEN1 found : Token1
TOKEN2 found : Token2
DEFAULT found : asjdjffh123
DEFAULT found : Token123
Question last update 2326
entry found : Token1
entry found : Token2
entry found : asjdjffh123
entry found : Token123

HTH
Bernard

PS: See also http://stackoverflow.com/tags and type *antlr* in the field
*Type to find tags:*.


Rui Vilão

2012-11-29 22:55:49 UTC

First of all thank you very much for such a complete reply :)

Well, indeed I'm using ANTLR 3. I'll try with ANTLR 4 tomorrow and get back
with the results. I tried what you're suggesting with ANTLR 3 and couldn't
get it to work, so I never thought it would be a "version thing".
Basically the message was something like "token X can match input such as Y,
Z", etc.

Thanks again,

Rui
