Yuval Shavit: February 2014

Thursday, February 20, 2014

Method invocation syntax

I’ve been giving some thought to method invocation lately, trying to come up with something that’s fluent in simple cases, and familiar (to programmers) for the more complex cases. After a bit of playing around, I think I have a system I like.

Consider Java’s BigDecimal, and specifically its divide method. It feels very programmer-y:

aNum = myNum.divide(someOtherNum)

Wouldn’t it be nice if we could make this feel more natural?

aNum = myNum dividedBy someOtherNum

That suggests a grammar of object methodName arg0[, arg1, arg2...]. But if you have more than a couple args, this gets a bit muddy; the words all clump together in my brain, and it’s not entirely clear what’s what anymore: foo doBar baz, apple, banana, coconut. If anything, it looks like the logical grouping is (foo doBar baz) (apple banana coconut). Of course, it isn’t, and my brain knows that… but it’s not intuitive to my eye.

As I was looking around at various methods, I noticed another interesting thing: very often for multi-arg methods, there’s one “main” argument that’s followed by “secondary” arguments. In human-language grammar terms, there’s a single direct object, and then some adjectives and adverbs.

BigDecimal.divide(BigDecimal, RoundingMode) is a good example: the first argument is what you’re dividing by, and the second is just some almost-parenthetical info on how to do the division. It feels like this:

aNum = myNum dividedBy someOtherNum: HalfUp

This suggests a grammar of object methodName arg0 [: arg1, arg2, arg3...]. And that’s in fact what I think I’m going to go with (with a slight tweak that I’ll get to in a bit).

There’s an obvious problem, which is that not all methods follow that semantic pattern. For instance, List.sublist takes two arguments, fromIndex and toIndex. Neither modifies the other; they’re both “primary” args. (This may have been different if the arguments were fromIndex and length, but they’re not). You really do want to invoke this using the parentheses we all know and love:

aList = myList sublist (3, 7)

Yikes — does that mean I need two ways to invoke methods? Worse yet, do I let the call sites determine which to use, so that sometimes I’ll see myList sublist 3: 7 and sometimes I’ll see myNum dividedBy (someOtherNum, HALF_UP)? The latter isn’t bad, but I don’t want my language to encourage inconsistent style on things like this. So maybe I want to let the method declaration define which syntax to use… but how?

The solution is actually pretty simple: methods like sublist take only one arg, but it’s a tuple! That’s not enforced by the language, of course, but the syntax for declaring methods should mirror the syntax for calling them, so that things will naturally work out.

The one big issue with that grammar is that the : char is already used in lots of places, and in particular as a way of declaring a variable’s type (including to upcast it). For instance, myNum divided by someOtherNum : SomeType is ambiguous; does it take one arg, someOtherNum : SomeType, or does it take two args, someOtherNum and SomeType?

To solve this, I’m going to make a slight aesthetic concession and replace the : with {...} in method invocation.

aNum = myNum dividedBy someOtherNum { HalfUp } -- two args, num and mode
aList = myList sublist (3, 7)        -- one arg, a tuple of (start, end)

As I mentioned above, the method declaration should mirror invocation. Something like:

dividedBy divisor:BigDecimal { mode: RoundingMode } -> BigDecimal: ...
aList (start: Int, end: Int) -> List[E]: ...

I like this approach a lot, except for the curly braces. Ideally I’d use a colon, or even a pipe, but all of the single-char approaches I could think of would either cause ambiguity or be ugly. For instance, a pipe would be fine at the call site, but create visual ambiguities at declaration:

dividedBy divisor: BigDecimal | mode: RoundingMode -> BigDecimal: ...

That pipe looks like a disjunctive type at a glance. This isn’t an ambiguity from the grammar’s perspective, since mode is clearly a variable name and not a type (Effes enforces the capitalization scheme that Java suggests), but it’s not nice on the eyes. Some optional parentheses would help, but it’s hard to get excited about that. So for now, curly braces are it.

The thing I like about this syntax is that with one rule, I get everything I want. Simple methods look fluent; methods with adverbs look good (if a tad clunky with the braces); and in the worst case, I get something that’s no worse than what most of the popular languages out there require or recommend.

Tuesday, February 18, 2014

An alternative to function overloading

Method overloading has always struck me as a bit clunky. It separates a method’s main code from helper code, adds clutter, and doesn’t play nicely with inheritance (at least in Java). On the other hand, its ability to provide variants for a given method is useful. I think Effes provides a better alternative.

Overloads provide two axes by which you can create variants of a method: they let you omit arguments by supplying a reasonable default, and they let you pass in a value whose type is similar to (but different from) the “main” type. For instance, you can imagine a method add(double n, RoundingMode mode) with an overload add(long n). That second overload would call the first variant, casting the long to double and using RoundingMode.HALF_UP.

Lots of languages let you omit arguments by providing default values: add(n, mode=HALF_UP) or similar. Effes will, too, but it’s tough for a statically typed language to handle the arg-of-similar-type problem. The only thing you can really do is to accept a supertype, like add(Number n). But to do that, you need control over the type hierarchy, which you obviously may not have.

In Effes, you can use disjunctive types instead:

add(n: Double|Long):
  d = case n of
    Double: n
    Long: n toDouble -- e.g., if there's no automatic type promotion
  ...

Thursday, February 13, 2014

Statements and expressions: an exploration of ambiguity

I've been working on the parser for Effes a bit, and I got a bit stuck on an ambiguity in case constructs; I want them to work as either statements or expressions.

To anchor things a bit, here are two uses of case, one of which is used as an expression, and the other as a statement:

-- as an expression
firstInt = case ints of
    (): 0
    (head, tail): head

-- as a statement
case ints of
    (): print "empty list!"
    (_): print "list has one element"
    _: print "list has #{intsList size} elements"

Languages handle this in various ways that make things simple. For instance:

In Java, it's always unambiguous whether something is expected to be a statement or an expression.
In Haskell, each function is just a single (potentially complex) expression; there are no statements, and thus no ambiguity
In Scala, you can put an expression anywhere in a function body, and the last expression is the function's return value — so again there's no ambiguity, because you can just make case constructs (match as they're called in Scala) always be expressions.

Scala's approach works, but it also lets you define a function as def g() = { 1; 2; 3; }, which I don't like. Statements and expressions are different beasts to me, and conflating them seems like a lazy and inelegant solution.

So then, is the case in that f example above about to introduce a statement or expression?

One solution is to take a hint from Java and have method bodies always consist of statements. If we take that approach, f... : case ints of is a statement. To make it be an expression whose value is returned, we'd have to write f... : return case ints of....

That's not the end of the world. In fact, I've never liked the Ruby-style return statements, where you just plop an expression at the end of a method:

def ugly
    123
end

There are a few reasons I don't like this, but the main reason is that in an imperative context (which a Ruby/Scala/etc method is), returning a value is an action. It should look like one! When I write imperative code, I'm telling the computer a series of actions to take. An implicit return feels like this to me:

First, ask the user how many apples they want.
Then find out how many apples are available.
Then, the minimum of that number and the number of apples requested.

That last sentence feels wrong, because it's not a sentence; it's a phrase. You can figure out what it means, but it feels stilted.

On the other hand, when writing one-liners, the return feels superfluous. Here's a nice size function for a list:

size -> Int: case this of
    (): 0
    _: 1 + (list tail)

One option I'm considering is to look at the return value if the method is a one-liner (that is, just a single statement or expression — even if it's complex). If it's Unit, that one line is a statement; otherwise, it's an expression. (If the function's body is a block instead of a one-liner, that block consists entirely of statements, including possibly a return statement.)

This feels a bit subtle and potentially confusing, and maybe that should be a big warning. On the other hand, I think that for most cases, it'll "just work." Crucially, since this only applies to one-liners, nearly all the cases should hopefully be simple cases. I can't think of any that wouldn't be.

This approach also means that the compiler will have to know about the Unit type specially. My instincts are that this smells wrong, but maybe it's not so bad.

Ah, what the heck. Despite all these warning bells going off, I'll try it out. If nothing else, it'll be good to see if my intuition (that this is a sketchy idea) is right, and why specifically. As Batman Begins put it, we fall so we can learn to pick ourselves up.

Friday, February 7, 2014

Python-like indentation using Antlr4

Introducing a small helper class for implementing python-like indentation in antlr4: antlr-denter on github.

I've started writing the grammar for my language, but soon ran into a problem generating python-like INDENT/DEDENT tokens in ANTLR v4. Googling found other people with the same problem and half-answers, but nothing seemed satisfactory. I decided to do something about it.

Why didn't I like what's out there? First of all, most of the answers are based on antlr v3, and don't quite work for v4. But more than that, they seemed incomplete. The ones I found don't look like they handle various edge cases, like an EOF while indended, or "half-dedents."

if something():
    takeAction()
    butDontUnindent()<eof>

if something():
      if anotherThing()
            takeAction()
   thisIndentationIsWrong()

(I should add a disclaimer, which is that I didn't actually try the proposed solutions; I just looked at the code and noticed that they don't seem to handle it.)

But even if they do work perfectly, there are other issues. The indent/dedent logic is embedded in the grammar, which makes it hard to test separately. You have to copy-paste that code, meaning it could be a pain to update if the person you copied it from has a bug fix (especially if you had to make changes to fit your grammar). In fact, you probably won't even notice that they have a bug fix, if they even bother to mention it in the web.

So, I wrote a modular, testable helper class: welcome the unimaginatively-named antlr-denter. Plugging it into your grammar is wicked easy (docs are in the README.md on github, link above), and if there's ever a bug fix, all you need to do is make sure you have the latest version of the antlr-denter jar. Boom!

The basic strategy is to override your parser's nextToken() and have it delegate to DenterHelper::nextToken. That does some processing, ultimately pulling from your lexer's super.nextToken.

One interesting tidbit is that the overhead of this code — which is completely un-optimized and just in a "get it working" state, for now — seems pretty much negligible. There is about a 10% - 20% overhead in the DenterHelper, but it's offset by the shorter program lengths. Consider these two snippets:

if foo() {
    if bar() {
        doBar()
    }
}

and

if foo()
    if bar()
        dobar()

The indentation-based program is 38 chars long; the braced one is 50 chars — 24% longer! If that's surprising, consider that the fourth line in the braced version, which just closes the if bar() block, is six chars. Sure, most of that is just whitespace, but it's still data the lexer needs to trudge through. The equivalent in the indentation-based version is zero chars.

I've only tested it with simple programs in a simpler language, but the results are all consistent. I wrote up the analysis in the readme for the examples submodule.

~~antlr-denter isn't on maven central yet, mostly because I don't need it to be. If anyone comes across this post and wants me to upload the module, I'd be happy to.~~ The project is available in maven central, so getting up and running should be pretty straightforward. And feel free to open issues or PRs on the github project.