Trying to keep whitespace in an AST

Discussion:

Jamie Penney

2008-02-08 03:50:36 UTC

Hi all,
I am trying to work out how to create a grammar that will build an AST
that keeps both comments and some whitespace. Basically the output will
be formatted code, but we need the semantic information provided by the
AST for other parts of the system. Any comments and blank lines need to
be kept in the output code. Is it possible to have rewriting and AST
generation turned on at the same time, or do I have to write two
separate grammars? I am new to ANTLR so sorry if I have the wrong idea
about anything.
To give a concrete example, say I have a language that represents basic
C style statements like so:

int a = 0;
int b = 1;
int c = 2;

// reassign a
a = b + c;

What I need is the semantic information provided by an AST (whether a
statement is a declaration, assignment, ect), but I need to transform
the language partially too. I need to format the individual elements
consistently, so each would be of the form a = b + c; but I also need to
retain the newlines and comments between elements.

If anyone could point me in the right direction I would be very grateful.

Thanks,
Jamie Penney

Matt Benson

2008-02-08 04:16:07 UTC

Permalink

Are you trying to pretty-print all the source, or just
what you've augmented? Does this wiki entry approach
what you're looking for:

http://www.antlr.org/wiki/pages/viewpage.action?pageId=1752

?

-Matt

Post by Jamie Penney
Hi all,
I am trying to work out how to create a grammar that
will build an AST
that keeps both comments and some whitespace.
Basically the output will
be formatted code, but we need the semantic
information provided by the
AST for other parts of the system. Any comments and
blank lines need to
be kept in the output code. Is it possible to have
rewriting and AST
generation turned on at the same time, or do I have
to write two
separate grammars? I am new to ANTLR so sorry if I
have the wrong idea
about anything.
To give a concrete example, say I have a language
that represents basic
int a = 0;
int b = 1;
int c = 2;
// reassign a
a = b + c;
What I need is the semantic information provided by
an AST (whether a
statement is a declaration, assignment, ect), but I
need to transform
the language partially too. I need to format the
individual elements
consistently, so each would be of the form a = b +
c; but I also need to
retain the newlines and comments between elements.
If anyone could point me in the right direction I
would be very grateful.
Thanks,
Jamie Penney

____________________________________________________________________________________
Looking for last minute shopping deals?
Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping

Matt Benson

2008-02-08 04:37:38 UTC

Permalink

copying list...

Date: Fri, 08 Feb 2008 17:25:04 +1300
Subject: Re: [antlr-interest] Trying to keep
whitespace in an AST
What I need to do is use the AST to create an in
memory model of the
code using our own model classes. So for instance we
have a model to
represent a method, which needs to know everything
about the method. It
needs the text of the method, but doesn't need to
parse it. I need to be
able to record the amount of whitespace and the
comments that come
before the method, so that I can put that into our
code model objects.
These code model objects are used for a bunch of
different stuff,
modified, then printed out (which does the actual
pretty printing).
So I guess I need an AST that keeps newlines and
comments associated
with particular nodes. The pretty printing is done
by our code so the
grammar doesn't need to do that.
Thanks,
Jamie Penney

Post by Matt Benson
Are you trying to pretty-print all the source, or

just

Post by Matt Benson
what you've augmented? Does this wiki entry

approach

http://www.antlr.org/wiki/pages/viewpage.action?pageId=1752

Post by Matt Benson
?
-Matt

Post by Jamie Penney
Hi all,
I am trying to work out how to create a grammar

that

Post by Matt Benson

Post by Jamie Penney
will build an AST
that keeps both comments and some whitespace.
Basically the output will
be formatted code, but we need the semantic
information provided by the
AST for other parts of the system. Any comments

and

Post by Matt Benson

Post by Jamie Penney
blank lines need to
be kept in the output code. Is it possible to

have

Post by Matt Benson

Post by Jamie Penney
rewriting and AST
generation turned on at the same time, or do I

have

Post by Matt Benson

Post by Jamie Penney
to write two
separate grammars? I am new to ANTLR so sorry if

Post by Matt Benson

Post by Jamie Penney
have the wrong idea
about anything.
To give a concrete example, say I have a language
that represents basic
int a = 0;
int b = 1;
int c = 2;
// reassign a
a = b + c;
What I need is the semantic information provided

Post by Matt Benson

Post by Jamie Penney
an AST (whether a
statement is a declaration, assignment, ect), but

Post by Matt Benson

Post by Jamie Penney
need to transform
the language partially too. I need to format the
individual elements
consistently, so each would be of the form a = b

Post by Matt Benson

Post by Jamie Penney
c; but I also need to
retain the newlines and comments between

elements.

Post by Matt Benson

Post by Jamie Penney
If anyone could point me in the right direction I
would be very grateful.
Thanks,
Jamie Penney

____________________________________________________________________________________

Post by Matt Benson
Looking for last minute shopping deals?
Find them fast with Yahoo! Search.

http://tools.search.yahoo.com/newsearch/category.php?category=shopping
____________________________________________________________________________________
Never miss a thing. Make Yahoo your home page.
http://www.yahoo.com/r/hs

Jamie Penney

2008-02-08 05:24:56 UTC

Permalink

My apologies - one of the other email lists I use has the reply-to
address set to the list address.

Post by Matt Benson
copying list...

Post by Matt Benson
Are you trying to pretty-print all the source, or

just

Post by Matt Benson
what you've augmented? Does this wiki entry

approach

http://www.antlr.org/wiki/pages/viewpage.action?pageId=1752

Post by Matt Benson
?
-Matt

Post by Jamie Penney
Hi all,
I am trying to work out how to create a grammar

that

Post by Matt Benson

and

Post by Matt Benson

Post by Jamie Penney
blank lines need to
be kept in the output code. Is it possible to

have

Post by Matt Benson

Post by Jamie Penney
rewriting and AST
generation turned on at the same time, or do I

have

Post by Matt Benson

Post by Jamie Penney
to write two
separate grammars? I am new to ANTLR so sorry if

Post by Matt Benson

Post by Jamie Penney
an AST (whether a
statement is a declaration, assignment, ect), but

Post by Matt Benson

Post by Jamie Penney
need to transform
the language partially too. I need to format the
individual elements
consistently, so each would be of the form a = b

Post by Matt Benson

Post by Jamie Penney
c; but I also need to
retain the newlines and comments between

elements.

Post by Matt Benson

Post by Jamie Penney
If anyone could point me in the right direction I
would be very grateful.
Thanks,
Jamie Penney

____________________________________________________________________________________

Post by Matt Benson
Looking for last minute shopping deals?
Find them fast with Yahoo! Search.

--
Jamie Penney

http://www.jamiepenney.co.nz

Johannes Luber

2008-02-08 15:24:14 UTC

Permalink

So I guess I need an AST that keeps newlines and
comments associated
with particular nodes. The pretty printing is done
by our code so the
grammar doesn't need to do that.

Well, there are problems with the naive way of simply attaching newlines
and comments. The AST grammar doesn't recognize "The comments and
whitespace everywhere" requirement. So you can insert at every possible
point "(WS | COMMENT)*", but that would be ugly.

One could try to tell the parser to check for WS and COMMENT in addition
to the expected node, but that would interfere with syntactic
predicates, which don't check for theses variants. One could try to
change the template but I'm not sure how feasible this is. There could
be other problems lurking.

The way I'd go is to use channels. Put newlines and COMMENT into a
channel separate from the unneeded whitespace. This would reduce the
problem into accessing the off-channel tokens at the right places and to
keep them after grammar transformations. But so far I haven't checked
out, how one can do the latter properly.

Johannes

Artem Portnoy

2008-02-08 16:33:29 UTC

Permalink

I'm trying to solve the exact same problem. I found some sample code here http://www.antlr.org/article/whitespace/index.html. The problem is that the way the existing Java grammar is written, not all the tokens end up in the AST tree, such as semicolons, curly brakets, etc. The result of this is some of the comments end up getting weaved onto these "missing" tokens, and there's no way to get them back from AST. If somebody knows how to resolve this issue please share the information.

Thank you,

Artem

Date: Fri, 8 Feb 2008 16:24:14 +0100
Subject: [antlr-interest] Fwd: Re: Trying to keep whitespace in an AST

So I guess I need an AST that keeps newlines and
comments associated
with particular nodes. The pretty printing is done
by our code so the
grammar doesn't need to do that.

Well, there are problems with the naive way of simply attaching newlines
and comments. The AST grammar doesn't recognize "The comments and
whitespace everywhere" requirement. So you can insert at every possible
point "(WS | COMMENT)*", but that would be ugly.
One could try to tell the parser to check for WS and COMMENT in addition
to the expected node, but that would interfere with syntactic
predicates, which don't check for theses variants. One could try to
change the template but I'm not sure how feasible this is. There could
be other problems lurking.
The way I'd go is to use channels. Put newlines and COMMENT into a
channel separate from the unneeded whitespace. This would reduce the
problem into accessing the off-channel tokens at the right places and to
keep them after grammar transformations. But so far I haven't checked
out, how one can do the latter properly.
Johannes

_________________________________________________________________
Helping your favorite cause is as easy as instant messaging. You IM, we give.
http://im.live.com/Messenger/IM/Home/?source=text_hotmail_join

Johannes Luber

2008-02-08 16:54:47 UTC

Permalink

Post by Artem Portnoy
I'm trying to solve the exact same problem. I found some sample code
here http://www.antlr.org/article/whitespace/index.html. The problem is
that the way the existing Java grammar is written, not all the tokens
end up in the AST tree, such as semicolons, curly brakets, etc. The
result of this is some of the comments end up getting weaved onto these
"missing" tokens, and there's no way to get them back from AST. If
somebody knows how to resolve this issue please share the information.

The description is still for ANTLR 2.7.7. Did you manage to translate
into ANTLR 3 syntax? I am nonetheless wondering, why you get the pure
syntactical tokens with the whitespace. My understanding is that each
token includes only the text it actually matched. Did you check the
"intermediate" tokens for being only whitespace and comments? If that
doesn't work the only option is to include the neglected tokens into the
AST.

Johannes

Artem Portnoy

2008-02-08 17:28:42 UTC

Permalink

Johannes,

I have no idea how to do the same thing in Antlr v3. I couldn't find much information or examples. This is the only thing I found and I couldn't make much sense of it http://www.antlr.org/wiki/pages/viewpage.action?pageId=1057. It just sounds too vague for me with what little understanding of antlr I've got. In my case, I do not care about whitespace. All I need is the ability to access the comments.

The way TokenStreamHiddenTokenFilter works in Antlr 2 is that it looks for hidden tokens, such as comments and whitespace and sets them onto the preceding or trailing "regular" token. Because it has no knowledge of which tokens are bound for inclusion in the AST tree I'm faced with the problem I described before.

Regards,
Artem

Date: Fri, 8 Feb 2008 17:54:47 +0100
Subject: Re: [antlr-interest] Fwd: Re: Trying to keep whitespace in an AST

Johannes Luber

2008-02-08 18:04:26 UTC

Permalink

Post by Artem Portnoy
Johannes,
I have no idea how to do the same thing in Antlr v3. I couldn't find
much information or examples. This is the only thing I found and I
couldn't make much sense of it
http://www.antlr.org/wiki/pages/viewpage.action?pageId=1057. It just
sounds too vague for me with what little understanding of antlr I've
got. In my case, I do not care about whitespace. All I need is the
ability to access the comments.

Looking at this, one would set in the comments rule "{channel =
COMMENT_CHANNEL;}", with COMMENT_CHANNEL defined like 50. Then you need
to access from your CommonTree node in the tree grammar getStartIndex()
and getStopIndex(). Then you scan in the TokenStream via a for-loop from
start index until stop index via get(i), if the returned token has the
channel value COMMENT_CHANNEL. Output those.

That should principally work; start and stop index may vary on your
needs though.

Johannes

P.S.:
<http://www.antlr.org/wiki/display/ANTLR3/encoding+off-channel+tokens+in+the+tree>
looks at first as the right match, but is missing all the required
information.

Jim Idle

2008-02-08 18:22:37 UTC

Permalink

Well, remember that the AST is, err abstract ;-). It is just a construct
made from the token stream that you parsed. The parser skips tokens that
you create "off-channel", such as comments:

COMMENT: '//' ~NL* { $channel = 2; } ;

Now, when you walk you AST and find a method, you just need the token
index of the start sequence of your method declaration (this of course
depends on the language). Then you can traverse backwards in the token
stream (the stream you passed to the parser, mostly CommonTokenStream)
for that index, and pick up any off-channel tokens that were ignored by
the parser. If your common token stream is called tstream, then:

tstream.get(index) will return the token at index n, whether it is on
the parsing channel or not. There is also tstrem.getRange(.., which will
return a List of the tokens in a range, whether on channel or off
channel.

So, you hit the 'method' keyword/node/token and find out its index (or
the index of a real token rather than an imaginary one perhaps). Then
you traverse back through the stream until some trigger point such as
the first on-channel token before the comments or something. Only you
can know exactly where you start and stop, and the problem of
associating comments with the correct syntactical element is a thorny
one!

Jim

-----Original Message-----
Sent: Thursday, February 07, 2008 7:51 PM
Subject: [antlr-interest] Trying to keep whitespace in an AST
Hi all,
I am trying to work out how to create a grammar that will build an AST
that keeps both comments and some whitespace. Basically the output will
be formatted code, but we need the semantic information provided by the
AST for other parts of the system. Any comments and blank lines need to
be kept in the output code. Is it possible to have rewriting and AST
generation turned on at the same time, or do I have to write two
separate grammars? I am new to ANTLR so sorry if I have the wrong idea
about anything.
To give a concrete example, say I have a language that represents basic
int a = 0;
int b = 1;
int c = 2;
// reassign a
a = b + c;
What I need is the semantic information provided by an AST (whether a
statement is a declaration, assignment, ect), but I need to transform
the language partially too. I need to format the individual elements
consistently, so each would be of the form a = b + c; but I also need to
retain the newlines and comments between elements.
If anyone could point me in the right direction I would be very grateful.
Thanks,
Jamie Penney

David Holroyd

2008-02-08 22:44:17 UTC

Permalink

Post by Jim Idle
Now, when you walk you AST and find a method, you just need the token
index of the start sequence of your method declaration (this of course
depends on the language). Then you can traverse backwards in the token
stream (the stream you passed to the parser, mostly CommonTokenStream)
for that index, and pick up any off-channel tokens that were ignored by
the parser.

I've implemented this approach, more or less, but with additional effort
to take into account changes to the AST.

What made most sense to me was to alter the token stream to account for
the AST changes. i.e. if I remove a method, I delete from the stream
between child.start and child.stop tokens inclusive.

Of course, CommonTree nodes store start/stop *indexes*, which will leave
lots of AST nodes with invalid indexes if was actually change the
underlying token-stream array.

My solution was to switch the token-stream from an array to a
doubly-linked list, and to replace start/stop indexes with references to
the actual token objects. Now if a subtree is deleted, I can unlink the
appropriate sublist from the token-stream, and all remaining start/stop
references in the AST will still be valid.

It's quite a lot of effort to maintain that list of tokens, and some
things aren't as nice as I'd wish, but I can happily refactor AST
subtrees while maintaining the formatting of the rest of the file.
That includes keeping doc-comments connected to methods, etc.

Code here,

http://svn.badgers-in-foil.co.uk/metaas/trunk/src/main/java/uk/co/badgersinfoil/metaas/impl/antlr/

ta,
dave

--
http://david.holroyd.me.uk/

Johannes Luber

2008-02-08 23:35:42 UTC

Permalink

Post by David Holroyd
Code here,
http://svn.badgers-in-foil.co.uk/metaas/trunk/src/main/java/uk/co/badgersinfoil/metaas/impl/antlr/

Very nice! But what license is the code under? A "We may do as we
please, as long we attribute you" one? That would be the MIT/X license.
I use it it like:

"The MIT License

Copyright (c) 2007-2008 Johannes Luber

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to
deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE."

Or do prefer one other?

Johannes

David Holroyd

2008-02-08 23:46:55 UTC

Permalink

Post by Johannes Luber

Post by David Holroyd
Code here,
http://svn.badgers-in-foil.co.uk/metaas/trunk/src/main/java/uk/co/badgersinfoil/metaas/impl/antlr/

Very nice! But what license is the code under? A "We may do as we
please, as long we attribute you" one? That would be the MIT/X license.

Oh yes: LGPL, due to the grammar I took from another open-source
project. (I keep meaning to ask the original author if they're willing
to make it available under the Apache license, and these parts might be
seperable into a new project if there is interest.)

http://maven.badgers-in-foil.co.uk/sites/metaas/license.html

ta,
dave

--
http://david.holroyd.me.uk/

Jamie Penney

2008-02-09 00:48:33 UTC

Permalink

Thanks for all the replies everyone. I will give the off channel idea a

shot and see if it is feasible. I am working on a simple proof of
concept grammar
over the weekend so I will report back with how difficult it is to
associate the whitespace/comments with the correct node.

Thanks
Jamie

Post by Jim Idle
Well, remember that the AST is, err abstract ;-). It is just a construct
made from the token stream that you parsed. The parser skips tokens that
COMMENT: '//' ~NL* { $channel = 2; } ;
Now, when you walk you AST and find a method, you just need the token
index of the start sequence of your method declaration (this of course
depends on the language). Then you can traverse backwards in the token
stream (the stream you passed to the parser, mostly CommonTokenStream)
for that index, and pick up any off-channel tokens that were ignored by
tstream.get(index) will return the token at index n, whether it is on
the parsing channel or not. There is also tstrem.getRange(.., which will
return a List of the tokens in a range, whether on channel or off
channel.
So, you hit the 'method' keyword/node/token and find out its index (or
the index of a real token rather than an imaginary one perhaps). Then
you traverse back through the stream until some trigger point such as
the first on-channel token before the comments or something. Only you
can know exactly where you start and stop, and the problem of
associating comments with the correct syntactical element is a thorny
one!
Jim

will

be formatted code, but we need the semantic information provided by

the

AST for other parts of the system. Any comments and blank lines need

be kept in the output code. Is it possible to have rewriting and AST
generation turned on at the same time, or do I have to write two
separate grammars? I am new to ANTLR so sorry if I have the wrong idea
about anything.
To give a concrete example, say I have a language that represents

basic

int a = 0;
int b = 1;
int c = 2;
// reassign a
a = b + c;
What I need is the semantic information provided by an AST (whether a
statement is a declaration, assignment, ect), but I need to transform
the language partially too. I need to format the individual elements
consistently, so each would be of the form a = b + c; but I also need
to
retain the newlines and comments between elements.
If anyone could point me in the right direction I would be very
grateful.
Thanks,
Jamie Penney

--
Jamie Penney

http://www.jamiepenney.co.nz