How to treat EOF as NEWLINE?

Discussion:

Alexander Kostikov

2012-11-29 17:38:36 UTC

Hi,

The grammar I'm implementing is pretty big and I'm covering it with
unit tests to pin-point the rules that are already implemented and
stable. The problem shows up when I test not the top-most rule but the
smaller ones (the ones I want to pin-point and mark as stable). Here is
a small repro that I came up with:

---------------
grammar Test;

config: (rule)* EOF;
rule: 'rule' ID NEWLINE | 'rule' NEWLINE;

ID: CHAR (CHAR | NUMBER)*;
NEWLINE: ('\r' | '\n')+;
WS: (' ' | '\t') { $channel=HIDDEN; };

fragment CHAR: 'a'..'z' | 'A'..'Z' | '_' | '-';
fragment NUMBER: '0'..'9';
---------------

In tests I want to test that 'rule' is producing the correct AST.
Something like TestRuleAst( "rule test", "(RULE test)" ) where the
first argument is input to be parsed by 'rule', and the second
argument is expected text representation for the AST. But for that
test to work I'd like to treat all EOF tokens as NEWLINEs. Otherwise I
get NoViableAltException and MissingTokenException depending on the
rule I'm testing. But changing NEWLINE to 'NEWLINE: ('\r' | '\n')+ |
EOF;' doesn't solve this problem.

The simplest way to fix the tests for me would be to add '\n' at the
end of the tested strings. But I would like to know if it is possible
to treat EOF as NEWLINE in the grammar itself.
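
The workaround mentioned here (adding '\n' at the end of the tested strings) could live in one test helper instead of in every test string. A minimal sketch in Python, assuming a hypothetical helper named `with_trailing_newline`; it is not part of ANTLR or of the original test suite:

```python
def with_trailing_newline(text: str) -> str:
    # Hypothetical test helper (an assumption, not from the thread):
    # make sure the input handed to the parser ends with a line terminator,
    # so a rule that requires NEWLINE can be tested on its own.
    if text.endswith(("\r", "\n")):
        return text  # already terminated, leave untouched
    return text + "\n"
```

A test would then call something like `TestRuleAst(with_trailing_newline("rule test"), "(RULE test)")`, keeping the grammar itself unchanged.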

-- Alexander

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address

Gerald Rosenberg

2012-11-29 20:00:54 UTC

The ANTLR message identifies the grammar rule causing the error. Which
rule is it? It probably changes depending on whether EOF is included in
NEWLINE (NL). Given an input string of "rule <ID> <EOF>", with EOF
included in the definition of NL, the config rule is expecting
"rule <ID> <EOF> <EOF>", hence the error.

To the parser, the EOF token is just an ordinary token. EOF is special
only in the sense that it is automatically generated and injected by the
lexer.
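
To illustrate the point that EOF is an ordinary token which no lexer rule produces, here is a toy tokenizer in Python. It is only a sketch that loosely mirrors the small grammar in this thread, not ANTLR's actual lexer; the names `Token` and `tokenize` are made up for the example.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    type: str
    text: str


# Patterns loosely mirroring the thread's lexer rules (an approximation).
# Order matters: the 'rule' keyword must be tried before ID.
_PATTERNS = [
    ("RULE", r"rule(?![A-Za-z0-9_-])"),
    ("ID", r"[A-Za-z_-][A-Za-z0-9_-]*"),
    ("NEWLINE", r"[\r\n]+"),
    ("WS", r"[ \t]+"),
]
_MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in _PATTERNS))


def tokenize(text: str) -> List[Token]:
    tokens = []
    pos = 0
    while pos < len(text):
        m = _MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # whitespace is hidden from the parser
            tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    # No pattern above produced this token: it is injected when input runs out.
    tokens.append(Token("EOF", "<EOF>"))
    return tokens
```

For the input "rule test" this yields the token types RULE, ID, EOF, so a parser rule that demands NEWLINE after ID has nothing to match.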

Since you are testing a rule with a required NL, valid test strings will
have to include a NL; fudging by equating a NL with EOF kind of defeats
the purpose of the test, doesn't it?

Alexander Kostikov

2012-11-29 21:46:51 UTC

Post by Gerald Rosenberg
Given an
input string of "rule <ID> <EOF>", with EOF included in the definition of
NL, the config rule is expecting "rule <ID> <EOF> <EOF>", hence the error.

I wouldn't be asking if the error were in the 'config' rule. The thing
is, I see this error for the 'rule' rule. Try it out in ANTLRWorks:

--- grammar ---
grammar Test;

config: (rule)* EOF;
rule: 'rule' ID NEWLINE | 'rule' NEWLINE;

ID: CHAR (CHAR | NUMBER)*;
NEWLINE: ('\r' | '\n')+ | EOF;
WS: (' ' | '\t') { $channel=HIDDEN; };

fragment CHAR: 'a'..'z' | 'A'..'Z' | '_' | '-';
fragment NUMBER: '0'..'9';
--- input (without \n at the end!; unix line endings) ---
rule test
--- start rule ---
'rule'

Post by Gerald Rosenberg
fudging by equating a NL with EOF kind of defeats the
purpose of the test, doesn't it?

Agreed. But as I said, I want to figure out why the grammar approach is
not working. The files I'm parsing don't have a strict syntax. NEWLINE
is a surrogate command separator that seems to be working. The thing
is, EOF should also be a valid command separator.

-- Alexander

Gerald Rosenberg

2012-11-30 03:11:18 UTC

Post by Alexander Kostikov

Post by Gerald Rosenberg
Given an
input string of "rule <ID> <EOF>", with EOF included in the definition of
NL, the config rule is expecting "rule <ID> <EOF> <EOF>", hence the error.

I wouldn't have this question if it was in 'config' rule. The thing is

Yes, but what is the error message that Antlr is giving? Antlr is quite
good at identifying the line and offset of an error and more than just
the identity of the exception. AntlrWorks, not so much (at least the
original version -- have not used the version Sam is working on now).
BTW, it would help to post the text of the error message.

In any event, if you want to monitor the actual parsing of your test
input, switch to Eclipse (or NetBeans) and step your way through with
the debugger. That would be the only certain way to spot something as
unexpected as a Unicode character having snuck into your test input. Or
maybe even to see that the presence of the config rule is causing the
generated parser to expect something after the NL (if the NL does not
eat the EOF, then the parser has an EOF token with nowhere to go, so a
NoViableAlt is reasonable; if it does eat the EOF, then MissingToken
might be reasonable and the Antlr error message should identify what
token was expected).

FWIW, Eclipse has a feature that enables an automatic stop in the
debugger at the point of an exception. You can then 'drop to frame',
which will be the state of execution just prior to the exception, and
inspect all of the current variables as well as the prior execution
trace frames and their variable states.

Alexander Kostikov

2012-11-30 21:31:49 UTC

Here is the shortest possible repro:

input: 'rule' without any line endings

this grammar throws MissingTokenException:

grammar Test;
rule: 'rule' NEWLINE;
NEWLINE: EOF;

this grammar matches the input just fine:

grammar Test;
rule: 'rule' EOF;
NEWLINE: EOF;

It looks like the EOF token is a special one after all.
My question is: is it possible to use EOF in the NEWLINE token somehow?

-- Alexander

Alexander Kostikov

2012-11-30 21:59:51 UTC

Just read the "Lexer Grammar Ambiguities" chapter in the ANTLR
reference. This kind of grammar works as expected:

grammar Test;
rule: 'rule' newline;
newline: NEWLINE | EOF;
NEWLINE: '\n';

=)
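
The grammar above moves the NEWLINE-or-EOF choice out of the lexer and into a parser rule. A hand-written Python sketch of the same idea follows; it works on a list of token type names, and both that representation and the function names are assumptions for illustration, not ANTLR-generated code.

```python
from typing import List, Set


class ParseError(Exception):
    pass


def parse_rule(tokens: List[str]) -> int:
    """Parse  rule: 'rule' newline;  and return how many tokens were consumed."""
    pos = _match(tokens, 0, {"RULE"})
    return _parse_newline(tokens, pos)


def _parse_newline(tokens: List[str], pos: int) -> int:
    # newline: NEWLINE | EOF;  the choice is made by the parser, not the lexer
    return _match(tokens, pos, {"NEWLINE", "EOF"})


def _match(tokens: List[str], pos: int, expected: Set[str]) -> int:
    # Assumption of this sketch: reading past the end of the list looks like EOF.
    found = tokens[pos] if pos < len(tokens) else "EOF"
    if found not in expected:
        raise ParseError(f"expected one of {sorted(expected)}, found {found}")
    return pos + 1
```

With this shape, both `["RULE", "NEWLINE", "EOF"]` and `["RULE", "EOF"]` are accepted, while `["RULE", "ID", "EOF"]` is rejected because this reduced grammar has no ID alternative.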

-- Alexander

Alexander Kostikov

2012-11-30 22:05:40 UTC

Here is another test that proves that EOF is special:

grammar Test;
config: 'config' rule+ EOF;
rule: 'rule' newline;
newline: NEWLINE | EOF;
NEWLINE: '\n';

The 'config' rule matches the input 'config rule'. ANTLR inserts two EOF
tokens and doesn't produce any warnings/errors.
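
One way to picture why both `newline` and `config` can match EOF on the same input: if the token stream keeps answering EOF once the real tokens are used up, any number of consecutive EOF matches will succeed. The Python sketch below models that behaviour. It is an assumption about how the stream behaves, made for illustration only, and `TokenCursor` and `parse_config` are hypothetical names, not ANTLR classes.

```python
from typing import List


class TokenCursor:
    """Walks a list of token type names; reports EOF forever once exhausted."""

    def __init__(self, tokens: List[str]) -> None:
        self._tokens = tokens
        self._pos = 0

    def peek(self) -> str:
        if self._pos < len(self._tokens):
            return self._tokens[self._pos]
        return "EOF"  # assumption: every read past the end looks like EOF

    def match(self, *expected: str) -> str:
        found = self.peek()
        if found not in expected:
            raise SyntaxError(f"expected {expected}, found {found}")
        self._pos += 1
        return found


def parse_config(cursor: TokenCursor) -> int:
    """config: 'config' rule+ EOF;  returns the number of rules parsed."""
    cursor.match("CONFIG")
    count = 0
    while True:
        cursor.match("RULE")            # rule: 'rule' newline;
        cursor.match("NEWLINE", "EOF")  # newline: NEWLINE | EOF;
        count += 1
        if cursor.peek() != "RULE":
            break
    cursor.match("EOF")  # succeeds even if newline already matched EOF
    return count
```

Here the token list `["CONFIG", "RULE", "EOF"]` contains a single EOF, yet it is matched twice: once inside `newline` and once at the end of `config`.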

-- Alexander
