Friday, February 7, 2014

Python-like indentation using Antlr4

Introducing a small helper class for implementing python-like indentation in antlr4: antlr-denter on github.

I've started writing the grammar for my language, but soon ran into a problem generating python-like INDENT/DEDENT tokens in ANTLR v4. Googling found other people with the same problem and half-answers, but nothing seemed satisfactory. I decided to do something about it.

Why didn't I like what's out there? First of all, most of the answers are based on antlr v3, and don't quite work for v4. But more than that, they seemed incomplete. The ones I found don't look like they handle various edge cases, like an EOF while indended, or "half-dedents."

if something():
    takeAction()
    butDontUnindent()<eof>

or

if something():
      if anotherThing()
            takeAction()
   thisIndentationIsWrong()

(I should add a disclaimer, which is that I didn't actually try the proposed solutions; I just looked at the code and noticed that they don't seem to handle it.)

But even if they do work perfectly, there are other issues. The indent/dedent logic is embedded in the grammar, which makes it hard to test separately. You have to copy-paste that code, meaning it could be a pain to update if the person you copied it from has a bug fix (especially if you had to make changes to fit your grammar). In fact, you probably won't even notice that they have a bug fix, if they even bother to mention it in the web.

So, I wrote a modular, testable helper class: welcome the unimaginatively-named antlr-denter. Plugging it into your grammar is wicked easy (docs are in the README.md on github, link above), and if there's ever a bug fix, all you need to do is make sure you have the latest version of the antlr-denter jar. Boom!

The basic strategy is to override your parser's nextToken() and have it delegate to DenterHelper::nextToken. That does some processing, ultimately pulling from your lexer's super.nextToken.

One interesting tidbit is that the overhead of this code — which is completely un-optimized and just in a "get it working" state, for now — seems pretty much negligible. There is about a 10% - 20% overhead in the DenterHelper, but it's offset by the shorter program lengths. Consider these two snippets:

if foo() {
    if bar() {
        doBar()
    }
}

and

if foo()
    if bar()
        dobar()

The indentation-based program is 38 chars long; the braced one is 50 chars — 24% longer! If that's surprising, consider that the fourth line in the braced version, which just closes the if bar() block, is six chars. Sure, most of that is just whitespace, but it's still data the lexer needs to trudge through. The equivalent in the indentation-based version is zero chars.

I've only tested it with simple programs in a simpler language, but the results are all consistent. I wrote up the analysis in the readme for the examples submodule.

antlr-denter isn't on maven central yet, mostly because I don't need it to be. If anyone comes across this post and wants me to upload the module, I'd be happy to. The project is available in maven central, so getting up and running should be pretty straightforward. And feel free to open issues or PRs on the github project.

2 comments:

  1. Thanks for this. I like to write DSLs with indentation style because it makes it easier for non-programmers to read. I have my own ad-hoc parser library, which includes indentation processing. When I decided to switch from ad-hoc to ANTLR, I was at a loss. Denter fills the gap.

    ReplyDelete
  2. Super helpful! Thanks!

    ReplyDelete

Note: Only a member of this blog may post a comment.