Discussion:
parsing just a subset of a grammar
Alexander Kostikov
2012-11-19 19:23:22 UTC
Hi,

I'm new to ANTLR and I'm looking for some good advice.

Here is my story. I'm parsing Cisco IOS config files. They are quite
loosely defined, but I don't actually need to have the whole config
file parsed. I'm interested in just a subset of the config file (the acl
rule below) and I don't really care about the other parts of the file
right now. That said, in the future I'll need to add other parts
as well (e.g. a rule for interface definitions), but again, I don't need
to have all of the config file parsed. I don't want to implement a
complete Cisco IOS grammar since it seems that would be a very hard
task indeed.

To ignore all the uninteresting parts of the config file I defined the
grammar this way:

/*
* Parser Rules
*/

config: (acl | any)* EOF;
any: (ID|INT)* EOL;
acl: 'ip' 'access-list' 'extended'? ID EOL (remark | rule)+ EOF;
remark: (index)? 'remark' (~EOL)* EOL;
rule: (index)? verb protocol source source_port destination
destination_port flag? log? EOL;

// Not so interesting parser rules here...

/*
* Lexer Rules
*/

fragment
CHAR: 'a'..'z' | 'A'..'Z' | '_' | '-' | '.' | '+' | '/' | ':' | '%';
fragment
NUMBER: '0'..'9';
INT: NUMBER+;
ID: CHAR (CHAR | NUMBER)*;
EOL: ('\r' | '\n')+;
WS: (' ' | '\t') { $channel=HIDDEN; };
COMMENT: '!' (~('\r' | '\n'))* EOL { $channel=HIDDEN; };
ILLEGAL: .;

It turns out ANTLR doesn't behave the way I expected =) What I wanted
is for ANTLR to parse the line "no ip bootp server" via the
'any' rule, but ANTLR finds the 'ip' token in the line and treats the line
as an incorrect 'acl' rule. Specifying a syntactic predicate, "config:
(('ip' 'access-list') => acl | any)* EOF", only makes things worse
judging by the ANTLRWorks output: the parser stops almost immediately with an
unrecoverable error.

My question is: is there a way to achieve the kind of filtering I'm
talking about (parse only 'acl', ignore everything else) via an ANTLR
grammar? What should I use? A syntactic predicate? Several-pass parsing?
A custom lexer (how do I even start implementing such a beast?)? Parsing out
all the interesting sections from a file via regex before supplying them
to an ANTLR grammar that is only ACL-oriented (at least I know how to
implement this last option)?

-- Alexander
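
The last option in the question (pulling the interesting sections out before handing them to an ACL-only grammar) could look roughly like the sketch below. It is only an illustration, not anything from the thread: the function name extract_acl_blocks is made up, and it assumes that an ACL block starts with an "ip access-list" header line and continues for as long as the following lines are indented, which is the usual IOS layout but is not guaranteed by the grammar above.

```python
import re

# Assumption: header looks like "ip access-list [extended] NAME".
ACL_HEADER = re.compile(r"^ip access-list\s+(?:extended\s+)?(\S+)\s*$")


def extract_acl_blocks(config_text):
    """Return a list of (name, body_lines) pairs, one per ACL block found.

    Assumption: body lines of a block are indented; the first
    non-indented line that is not an ACL header ends the block.
    """
    blocks = []
    current = None
    for line in config_text.splitlines():
        header = ACL_HEADER.match(line)
        if header:
            # Start a new block named after the captured identifier.
            current = (header.group(1), [])
            blocks.append(current)
        elif current is not None and line[:1] in (" ", "\t") and line.strip():
            # Indented, non-empty line: part of the current block.
            current[1].append(line.strip())
        else:
            # Anything else is "uninteresting" and closes the block.
            current = None
    return blocks
```

Each extracted block could then be joined back into text and passed to a parser whose grammar only knows about the acl rule, so lines such as "no ip bootp server" never reach it.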

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
Ivan Brezina
2012-11-19 20:28:45 UTC
Maybe this is not what you want. Look at the PLSQL grammar.
For embedded SQL it uses such a trick:

SEMI: ';' ;

swallow_to_semi :
~( SEMI )+
;

select: 'SELECT' swallow_to_semi SEMI;

By using this you can "bypass" all the sections you're not interested in.

Ivan
PS: be warned, negation can make the grammar very complex if you
use many lexer tokens.
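
To make the idea behind swallow_to_semi concrete outside of ANTLR, here is a minimal sketch over a plain list of token strings. The helper names swallow_to and find_statements are hypothetical, and unlike the grammar rule above (which uses + and so requires at least one token before the terminator) this version also accepts an empty body.

```python
def swallow_to(tokens, start, terminator):
    """Return the index just past the first `terminator` at or after `start`."""
    i = start
    while i < len(tokens) and tokens[i] != terminator:
        i += 1
    # Include the terminator itself if one was found.
    return min(i + 1, len(tokens))


def find_statements(tokens, keyword, terminator=";"):
    """Collect token slices that begin with `keyword`.

    The body of each statement is skipped without being analysed,
    which is the point of the swallow-to-terminator trick.
    """
    found = []
    i = 0
    while i < len(tokens):
        if tokens[i] == keyword:
            end = swallow_to(tokens, i + 1, terminator)
            found.append(tokens[i:end])
            i = end
        else:
            i += 1
    return found
```

The recogniser only needs to know how a section starts and what token ends it; everything in between is consumed blindly.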



Alexander Kostikov
2012-11-20 17:49:08 UTC
Ivan,

Thank you for the swallow_to_semi technique.

I've tried the fuzzy parsing Terence pointed out, but the downside is that
the parser became very loose and it no longer detects input that _almost_
matches the acl rule. The swallow_to_semi technique could probably
let me avoid implementing the full parser and still detect
almost-matching input (indicating that the rule must be updated) at
the same time.
--
Alexander
Bernard Kaiflin
2012-11-19 23:05:13 UTC
Hi,
welcome to the club "ANTLR doesn't behave the way I expected" :D

As Ivan says in the PS, negation is difficult to handle, as is
ignoring portions of the input. But it is possible in some circumstances; see CHUNK
in the thread containing
http://www.antlr.org/pipermail/antlr-interest/2012-November/045765.html

As I don't have the full grammar, I made a short version. Given the four
rules

any : ( ID | INT )* EOL ;
acl : 'ip' 'access-list' 'extended'? ID EOL ( remark | rule )+ EOF ;
remark : INT? 'remark' (~EOL)* EOL ;
rule : INT? ID+ EOL ;

and the input

$ cat t.config
no ip bootp server
ip access-list xyz
abc def

$ echo $CLASSPATH
.:/usr/local/lib/antlr-3.4-complete-no-antlrv2.jar
$ java org.antlr.Tool -trace Cisco.g
$ java Test < t.config
enter ID n line=1:0
enter CHAR n line=1:0
exit CHAR o line=1:1
exit ID line=1:2
enter config [@0,0:1='no',<7>,1:0]
Cisco last update 2127
enter any [@0,0:1='no',<7>,1:0]
enter WS line=1:2
exit WS i line=1:3
enter T__14 i line=1:3
exit T__14 line=1:5
enter WS line=1:5
exit WS b line=1:6
enter ID b line=1:6
enter CHAR b line=1:6
exit CHAR o line=1:7
exit ID line=1:11
line 1:3 missing EOL at 'ip'
exit any [@2,3:4='ip',<14>,1:3]

I can see:
1) ID is built character by character; it would be better to group them as
in ID : ( 'a'..'z' | 'A'..'Z' | '_')+
2) rule any has been chosen because it's the first rule (the other is the
rule named rule) that matches a line starting with an ID
3) the lexer consumes the token [@0,0:1='no',<7>,1:0]; <7> is in my case the
type of ID, see the file <grammar name>.tokens for a list of token types
4) the lexer skips the white space and sees `ip`. As 'ip' appears as an
implicit token in the parser rule acl, it has received its own token type,
in this case T__14, so it is not an ID
5) I don't know why the lexer doesn't stop here and still reads the next
character; anyway, the parser cannot continue with the loop ( ID | INT )* in
rule any because T__14 is neither an ID nor an INT. It expects an EOL to
terminate the rule and complains with "missing EOL"

exit any [@2,3:4='ip',<14>,1:3]
enter acl [@2,3:4='ip',<14>,1:3]
enter WS line=1:11
exit WS s line=1:12
enter ID s line=1:12
enter CHAR s line=1:12
exit CHAR e line=1:13
exit ID line=1:18
line 1:6 missing 'access-list' at 'bootp'

6) Now the parser receives from the lexer T__14='ip', the second token in
the line "no ip bootp server", and naturally chooses the rule acl which
starts with 'ip'.
7) the lexer advances in the input, finds 'server', returns 'bootp' (which
has not been consumed yet) to the parser
8) the parser complains because it expects 'access-list' as the next token
in rule acl.
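
The effect described in points 4) to 8) can be mimicked with a toy tokenizer. This is only a sketch of the idea, not ANTLR's actual algorithm: the KEYWORDS set, the token type names and both functions are invented for illustration. The assumption it encodes is the one visible in the trace, namely that a word used as a literal in a parser rule gets its own token type and is therefore never delivered as an ID.

```python
# Assumed set of literals that appear in the parser rules (illustrative only).
KEYWORDS = {"ip", "access-list", "extended", "remark"}


def tokenize(line):
    """Split one config line into (type, text) pairs, ending with EOL."""
    tokens = []
    for word in line.split():
        if word in KEYWORDS:
            # Plays the role of T__14 in the trace: not an ID.
            tokens.append(("KEYWORD", word))
        elif word.isdigit():
            tokens.append(("INT", word))
        else:
            tokens.append(("ID", word))
    tokens.append(("EOL", "\n"))
    return tokens


def matches_any(tokens, allowed=("ID", "INT")):
    """Mimic `any : ( ID | INT )* EOL ;` over the tokens of a full line."""
    *body, last = tokens
    return last[0] == "EOL" and all(kind in allowed for kind, _ in body)
```

With these definitions the line "no ip bootp server" is rejected by matches_any, because its second token is a KEYWORD, while passing allowed=("ID", "INT", "KEYWORD") accepts it.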


Now let's do a small change to accept ID and keywords like 'ip' in the rule
any :

grammar Cisco;

/* Parse Cisco config file. */

config
@init {System.out.println("Cisco last update 2320");}
: ( acl | any )* EOF
;

any : ( id_or_keyword | INT )* EOL
{System.out.print("--- any " + $any.text);}
;

acl : IP 'access-list' 'extended'? ID EOL ( remark | rule )+ // EOF already in config
{System.out.print("--- acl " + $acl.text);}
;

remark
: INT? 'remark' (~EOL)* EOL
;

rule: INT? ID+ EOL;

id_or_keyword
: ID | IP
;

IP : 'ip' ; // before ID, or else 'ip' will be captured by ID and rule acl will not match
ID : ( LETTER | SPECIAL ) ( LETTER | SPECIAL | NUMBER )* ;
INT : NUMBER+ ;
EOL : ('\r' | '\n')+;
WS : (' ' | '\t') { $channel=HIDDEN; };
COMMENT : '!' (~('\r' | '\n'))* EOL { $channel=HIDDEN; } ;
ILLEGAL : . ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
fragment SPECIAL : '_' | '-' | '.' | '+' | '/' | ':' | '%' ;
fragment NUMBER : '0'..'9' ;

$ java Test < t.config
enter ID n line=1:0
exit ID line=1:2
enter config [@0,0:1='no',<6>,1:0]
Cisco last update 2320
enter any [@0,0:1='no',<6>,1:0]
enter id_or_keyword [@0,0:1='no',<6>,1:0]
enter WS line=1:2
...
enter id_or_keyword [@2,3:4='ip',<9>,1:3]
...
enter id_or_keyword [@4,6:10='bootp',<6>,1:6]
...
exit EOL i line=2:0
exit id_or_keyword [@7,18:18='\n',<5>,1:18]
enter IP i line=2:0
exit IP line=2:2
--- any no ip bootp server
exit any [@8,19:20='ip',<9>,2:0]
enter WS line=2:2
exit WS a line=2:3
enter T__14 a line=2:3
exit T__14 line=2:14
enter acl [@8,19:20='ip',<9>,2:0]
...
enter rule [@14,38:40='abc',<6>,3:0]
exit rule [@18,46:46='<EOF>',<-1>,4:0]
--- acl ip access-list xyz
abc def
exit acl [@18,46:46='<EOF>',<-1>,4:0]
exit config [@19,46:46='<EOF>',<-1>,4:0]

Looks better, as I expected :)

HTH
Bernard
Terence Parr
2012-11-19 23:49:32 UTC
In the new v4 book and the v4 doc:

http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules

I talk about fuzzy parsing.

See

http://media.pragprog.com/titles/tpantlr2/code/reference/FuzzyJava.g4

Terence
Bernard Kaiflin
2012-11-20 15:18:21 UTC
Learning every day ... I have rewritten the grammar to use fuzzy parsing in
v4.

grammar Cisco;

/* Parse Cisco config file using fuzzy parsing. */

config
@init {System.out.println("Cisco last update 1606");}
: .*? ( acl .*? )+
;

acl : 'ip' 'access-list' 'extended'? ID '\n'? ( remark | rule ) '\n'
{System.out.print("--- acl " + $acl.text);}
;

remark
: INT? 'remark' ~'\n'*
;

rule: INT? ID+ // the + either here or in rule acl after ( remark | rule )
; // to avoid ambiguity

ID : ( LETTER | SPECIAL ) ( LETTER | SPECIAL | NUMBER )* ;
INT : NUMBER+ ;
COMMENT : '!' .*? '\n' -> channel(HIDDEN) ;
WS : [ \t\r\n]+ -> channel(HIDDEN) ;

ILLEGAL : . ; // after all other lexer rules

fragment LETTER : 'a'..'z' | 'A'..'Z' ;
fragment SPECIAL : '_' | '-' | '.' | '+' | '/' | ':' | '%' ;
fragment NUMBER : '0'..'9' ;

To install ANTLR4 you can start here :
http://forums.pragprog.com/forums/206/topics/11231

$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.0b3-complete.jar
$ antlr4 Cisco.g4
$ javac Cisco*.java
$ grun Cisco config -tokens -diagnostics -trace t.config
[@0,0:1='no',<6>,1:0]
[@1,2:2=' ',<9>,channel=1,1:2]
[@2,3:4='ip',<2>,1:3]
...
[@7,18:18='\n',<4>,1:18]
[@8,19:20='ip',<2>,2:0]
[@9,21:21=' ',<9>,channel=1,2:2]
[@10,22:32='access-list',<3>,2:3]
...
[@18,46:45='<EOF>',<-1>,4:8]
enter config, LT(1)=no
Cisco last update 1606
consume [@0,0:1='no',<6>,1:0] rule config alt=1
consume [@2,3:4='ip',<2>,1:3] rule config alt=1
consume [@4,6:10='bootp',<6>,1:6] rule config alt=1
consume [@6,12:17='server',<6>,1:12] rule config alt=1
consume [@7,18:18='\n',<4>,1:18] rule config alt=1
enter acl, LT(1)=ip
consume [@8,19:20='ip',<2>,2:0] rule acl alt=1
consume [@10,22:32='access-list',<3>,2:3] rule acl alt=1
consume [@12,34:36='xyz',<6>,2:15] rule acl alt=1
consume [@13,37:37='\n',<4>,2:18] rule acl alt=1
enter rule, LT(1)=abc
consume [@14,38:40='abc',<6>,3:0] rule rule alt=1
consume [@16,42:44='def',<6>,3:4] rule rule alt=1
exit rule, LT(1)=

consume [@17,45:45='\n',<4>,3:7] rule acl alt=1
--- acl ip access-list xyz
abc def
exit acl, LT(1)=<EOF>
exit config, LT(1)=<EOF>
Alexander Kostikov
2012-11-20 17:45:41 UTC
Bernard,

Thanks for the debugging technique!

Was the resolution for '1) ID is built character by character, it
would be better to group them' to move all fragments to the very end
of the grammar?

I can't use ANTLR4 since there is no C# target for it yet (as far as I
know). I'm targeting C#, but for the sake of grammar debuggability I'm
trying out the grammar in ANTLRWorks first.

The problem with the id_or_keyword approach is that there would be too many
keywords to keep track of in the 'any' rule. Plus, the IP token is not a
keyword that always starts the 'acl' rule: 'ip' can be used as a
protocol identifier as well. It looks like I would have to use an
alternation like (IP|ID) in the parser rules, and that doesn't seem right.
--
Alexander

Bernard Kaiflin
2012-11-20 20:25:59 UTC
A. I confirm that if the central rule, in this case acl, does not match
the input exactly, the whole input is consumed by the first .*?.
Post by Alexander Kostikov
Was the resolution for '1) ID is built character by character, it would be
better to group them' to move all fragments to the very end of the grammar?

B. Putting all the fragments at the end is just my personal preference.

C. The trace
enter CHAR b line=1:6
exit CHAR o line=1:7
gave me the feeling that ID is constructed painfully, character by character.
Rewriting CHAR as
CHAR: ( 'a'..'z' | 'A'..'Z' | '_' | '-' | '.' | '+' | '/' | ':' | '%' )
{System.out.println("--- CHAR " + $text);}
;
shows
enter ID b line=1:6
enter CHAR b line=1:6
--- CHAR b
exit CHAR o line=1:7
enter CHAR o line=1:7
--- CHAR bo
exit CHAR o line=1:8
enter CHAR o line=1:8
--- CHAR boo
exit CHAR t line=1:9
enter CHAR t line=1:9
--- CHAR boot
exit CHAR p line=1:10
enter CHAR p line=1:10
--- CHAR bootp
exit CHAR line=1:11
--- ID bootp

But using a fragment CHAR, or putting a print in the fragment LETTER, also
shows that ID is built character by character. So you can forget my remark 1). I
was fooled by the trace
enter ID b line=1:6
--- ID bootp
exit ID line=1:11
that looked shorter because fragments are not traced. In your version, CHAR
is a lexer rule and is traced in detail.
Alexander Kostikov
2012-11-20 17:47:31 UTC
Terence,

Thank you for the fuzzy parsing advice.

Fuzzy parsing seems to be the natural choice here. I tried it
yesterday and it worked on sample data. But when I tried to supply
a real file, two things came up:

1) The parser became very loose. ANTLR no longer detects cases where the
input almost matches the acl rule. Fuzzy parsing via 'config: (acl |
.)* EOF' ignores all input that is not 100% described by the acl rule.
I understand that these are conflicting goals, but it looks like the
swallow_to_semi technique from Ivan's email could bring the benefits of
both fuzzy parsing and error handling by making the grammar more verbose.

2) The ANTLRWorks debugger took significant time to parse the real data.
It was about 40 seconds per file compared to about 1 second with
my old regex-based parser. It was just a run under the debugger and for a
different target language (I'm targeting CSharp3), but performance is a
valid concern for me. I don't want a speed regression when
porting from the current regex parser. If there is no way to build
a quick parser, I'll introduce an intermediate representation, so that
only the parsing speed from this intermediate representation would
matter.
--
Alexander
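
The tension in point 1) is between skipping silently and noticing near-misses. One way to picture the behaviour being asked for is the sketch below. It is an assumption-laden illustration, not something proposed in the thread: STRICT_ACL, CANDIDATE_PREFIX and classify_lines are hypothetical names, and the strict pattern only covers the header line of an ACL.

```python
import re

# Assumed strict form of an ACL header line.
STRICT_ACL = re.compile(r"^ip access-list (?:extended )?\S+$")
# Assumed marker for "this line was probably meant to be an ACL header".
CANDIDATE_PREFIX = "ip access-list"


def classify_lines(lines):
    """Return (matched, near_misses) as lists of 1-based line numbers.

    A line that starts like an ACL header but fails the strict pattern
    is reported as a near-miss instead of being skipped silently.
    """
    matched, near_misses = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if STRICT_ACL.match(text):
            matched.append(number)
        elif text.startswith(CANDIDATE_PREFIX):
            near_misses.append(number)
    return matched, near_misses
```

Committing to a section as soon as its first keywords are seen, and only then being strict, is what lets an almost-matching line surface as an error rather than disappear into the ignored input.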
Terence Parr
2012-11-20 18:08:07 UTC
Post by Alexander Kostikov
Terence,
Thank you for the fuzzy parsing advice.
Fuzzy parsing seems to be the natural choice here. I've tried it
yesterday and it worked on a sample data. But when I tried to supply
1) Parser became very loose. ANTLR no longer finds out cases when
input almost matches the acl rule. Fuzzy parsing via 'config: (acl |
.)* EOF' ignores all input that is not 100% described by the acl rule.
I understand that this is a conflicting goal but it looks like
swallow_to_semi technique from Ivan's email could bring benefits from
both fuzzy parsing and error handling by making grammar more verbose.
I think it's very sensitive to how you write the grammar. I was very happy with the fuzzy Java parser, as it wasn't loose at all. This is with v4.
Post by Alexander Kostikov
2) ANTLRWorks debugger took significant time to parse the real data.
It was about ~40 seconds per file compared to ~1 second when I'm using
my old regex-based parser. It was just a run under debugger and for a
different target language (I'm targeting CSharp3) but performance is a
valid concern for me. I don't want to have a speed regression when
porting from the current regex parser. If there would be no way of
doing quick parser I'll introduce an intermediate representation -
only the parsing speed from this intermediate representation would
matter.
Ah, you must be using v3. All bets are off: the v3 fuzzy option is very slow, O(n^2).

T

Alexander Kostikov
2012-11-20 20:52:10 UTC
Terence,

Is there an estimate of when a C# target will become available for ANTLR v4?
I would gladly switch to the newest bits, but it doesn't seem like C#
output is currently possible.

-- Alexander
Terence Parr
2012-11-20 21:27:56 UTC
Post by Alexander Kostikov
Terence,
Is there an estimation when C# target would become available for ANTLR v4?
I would gladly switch to the newest bits but it doesn't seem like C#
output is currently possible.
Sam is working on it and C++. No real estimate, I'm afraid; he's finishing up the semester and the Java version with me at the moment.

Ter

Kevin J. Cummings
2012-11-20 02:27:04 UTC
Post by Alexander Kostikov
Hi,
I'm new to ANTLR and I seek for a good advice.
To ignore all not interesting parts of the config file I defined the
/*
* Parser Rules
*/
config: (acl | any)* EOF;
any: (ID|INT)* EOL;
acl: 'ip' 'access-list' 'extended'? ID EOL (remark | rule)+ EOF;
Did you really want an EOF at the end of your acl rule?
Or should that have been an EOL?
--
Kevin J. Cummings
Registered Linux User #1232 (http://www.linuxcounter.net/)
Alexander Kostikov
2012-11-20 17:42:49 UTC
Kevin,

That should be EOL. I simplified the grammar a bit before
sending it and a typo sneaked in.

-- Alexander
