Yuval Shavit

Monday, March 17, 2014

Inheritance is dead, long live composition

One aspect of the type system that’s always left me unsatisfied is its asymmetry against traditional object-oriented languages. Most OO languages formally recognize inheritance within the type system, but not composition. Given that Effes formally recognizes composition, shouldn’t it not recognize inheritance?

This is important to me for more than just aesthetic reasons. Recognizing both patterns makes for a more complicated type system, but worse, it gives the programmer a too-easy crutch. One of the reasons I turned to Haskell when I was interested in learning about functional programming was that I wanted to force myself to really start thinking in FP terms. If I were learning on a language like Scala, which combines OO and FP patterns, it’d be too easy to fall back on familiar ways of looking at a problem.

In the same way, I want Effes to force me into thinking with a composition-based perspective, rather than letting me have another inheritance-based language with a shot of composition flavoring.

The hurdle, though, has been polymorphism. It’s useful to have a method that takes Sizeable objects, whether they’re List, Map or anything else that’s Sizeable. It’s also nice to have that size method on both List and Map.

My solution is to replace “List is-a Sizeable” with “List has-a Size component:”

type List[A]:
    has Size
    add A -> List[A]
    -- etc...

For a user of List to get to the size method, they’ll need to access its Size component, which can be done explicitly with (list @ Size) size. But, if the Size component doesn’t conflict with any other of List’s components, you can implicitly access it: list size. And similarly, if a method takes a Size argument, you can explicitly give it the list’s Size component by calling theMethod (list @ Size), but you can also just call theMethod list, and the compiler will figure out that you want to pass it the Size component.

A nice side benefit of all this is that it provides a nicer answer to the question of conflicting components, which I addressed in earlier posts. Rather than handling conflicts at composition time by knocking out some components, I’ll allow the conflict there, and force the user into stating which component they want, when there’s a conflict. So for instance, if List and Set both have an add method, you can’t write listAndSet add foo. You have to explicitly call out the component you want: (listAndSet @ List[Foo]) add foo.

There are two syntax details I have yet to work out with this all-composition scheme.

The first involves cleaning up the code when a type has only one component: ConsList[A] “implements” List[A], for instance. Everything is fine from a useage perspective, but it’s a bit awkward to write out:

type ConsList[A]:
    has List[A]:
        -- all of the ConsList code goes here

So, I’m thinking of allowing a special “is-a” statement for this situation, which just lets you inline the second line in the above:

type ConsList[A] is List[A]:
    -- all of the ConsList code goes here

The second is in cleaning up implementations of nested types. Remember how List had a Size component above? Does that mean we have to implement it as:

type ConsList[A] is List[A]:
    add elem: ...
    Size:
        size: ...

or can we just write:

type ConsList[A] is List[A]:
    add elem: ...
    size: ...

My inclination here is to mirror the call site rules: you can inline the method definitions for a given component if that component doesn’t conflict with any other components. That keeps things simple, consistent and clean.

Thursday, February 20, 2014

Method invocation syntax

I’ve been giving some thought to method invocation lately, trying to come up with something that’s fluent in simple cases, and familiar (to programmers) for the more complex cases. After a bit of playing around, I think I have a system I like.

Consider Java’s BigDecimal, and specifically its divide method. It feels very programmer-y:

aNum = myNum.divide(someOtherNum)

Wouldn’t it be nice if we could make this feel more natural?

aNum = myNum dividedBy someOtherNum

That suggests a grammar of object methodName arg0[, arg1, arg2...]. But if you have more than a couple args, this gets a bit muddy; the words all clump together in my brain, and it’s not entirely clear what’s what anymore: foo doBar baz, apple, banana, coconut. If anything, it looks like the logical grouping is (foo doBar baz) (apple banana coconut). Of course, it isn’t, and my brain knows that… but it’s not intuitive to my eye.

As I was looking around at various methods, I noticed another interesting thing: very often for multi-arg methods, there’s one “main” argument that’s followed by “secondary” arguments. In human-language grammar terms, there’s a single direct object, and then some adjectives and adverbs.

BigDecimal.divide(BigDecimal, RoundingMode) is a good example: the first argument is what you’re dividing by, and the second is just some almost-parenthetical info on how to do the division. It feels like this:

aNum = myNum dividedBy someOtherNum: HalfUp

This suggests a grammar of object methodName arg0 [: arg1, arg2, arg3...]. And that’s in fact what I think I’m going to go with (with a slight tweak that I’ll get to in a bit).

There’s an obvious problem, which is that not all methods follow that semantic pattern. For instance, List.sublist takes two arguments, fromIndex and toIndex. Neither modifies the other; they’re both “primary” args. (This may have been different if the arguments were fromIndex and length, but they’re not). You really do want to invoke this using the parentheses we all know and love:

aList = myList sublist (3, 7)

Yikes — does that mean I need two ways to invoke methods? Worse yet, do I let the call sites determine which to use, so that sometimes I’ll see myList sublist 3: 7 and sometimes I’ll see myNum dividedBy (someOtherNum, HALF_UP)? The latter isn’t bad, but I don’t want my language to encourage inconsistent style on things like this. So maybe I want to let the method declaration define which syntax to use… but how?

The solution is actually pretty simple: methods like sublist take only one arg, but it’s a tuple! That’s not enforced by the language, of course, but the syntax for declaring methods should mirror the syntax for calling them, so that things will naturally work out.

The one big issue with that grammar is that the : char is already used in lots of places, and in particular as a way of declaring a variable’s type (including to upcast it). For instance, myNum divided by someOtherNum : SomeType is ambiguous; does it take one arg, someOtherNum : SomeType, or does it take two args, someOtherNum and SomeType?

To solve this, I’m going to make a slight aesthetic concession and replace the : with {...} in method invocation.

aNum = myNum dividedBy someOtherNum { HalfUp } -- two args, num and mode
aList = myList sublist (3, 7)        -- one arg, a tuple of (start, end)

As I mentioned above, the method declaration should mirror invocation. Something like:

dividedBy divisor:BigDecimal { mode: RoundingMode } -> BigDecimal: ...
aList (start: Int, end: Int) -> List[E]: ...

I like this approach a lot, except for the curly braces. Ideally I’d use a colon, or even a pipe, but all of the single-char approaches I could think of would either cause ambiguity or be ugly. For instance, a pipe would be fine at the call site, but create visual ambiguities at declaration:

dividedBy divisor: BigDecimal | mode: RoundingMode -> BigDecimal: ...

That pipe looks like a disjunctive type at a glance. This isn’t an ambiguity from the grammar’s perspective, since mode is clearly a variable name and not a type (Effes enforces the capitalization scheme that Java suggests), but it’s not nice on the eyes. Some optional parentheses would help, but it’s hard to get excited about that. So for now, curly braces are it.

The thing I like about this syntax is that with one rule, I get everything I want. Simple methods look fluent; methods with adverbs look good (if a tad clunky with the braces); and in the worst case, I get something that’s no worse than what most of the popular languages out there require or recommend.

Tuesday, February 18, 2014

An alternative to function overloading

Method overloading has always struck me as a bit clunky. It separates a method’s main code from helper code, adds clutter, and doesn’t play nicely with inheritance (at least in Java). On the other hand, its ability to provide variants for a given method is useful. I think Effes provides a better alternative.

Overloads provide two axes by which you can create variants of a method: they let you omit arguments by supplying a reasonable default, and they let you pass in a value whose type is similar to (but different from) the “main” type. For instance, you can imagine a method add(double n, RoundingMode mode) with an overload add(long n). That second overload would call the first variant, casting the long to double and using RoundingMode.HALF_UP.

Lots of languages let you omit arguments by providing default values: add(n, mode=HALF_UP) or similar. Effes will, too, but it’s tough for a statically typed language to handle the arg-of-similar-type problem. The only thing you can really do is to accept a supertype, like add(Number n). But to do that, you need control over the type hierarchy, which you obviously may not have.

In Effes, you can use disjunctive types instead:

add(n: Double|Long):
  d = case n of
    Double: n
    Long: n toDouble -- e.g., if there's no automatic type promotion
  ...

Thursday, February 13, 2014

Statements and expressions: an exploration of ambiguity

I've been working on the parser for Effes a bit, and I got a bit stuck on an ambiguity in case constructs; I want them to work as either statements or expressions.

To anchor things a bit, here are two uses of case, one of which is used as an expression, and the other as a statement:

-- as an expression
firstInt = case ints of
    (): 0
    (head, tail): head

-- as a statement
case ints of
    (): print "empty list!"
    (_): print "list has one element"
    _: print "list has #{intsList size} elements"

Languages handle this in various ways that make things simple. For instance:

In Java, it's always unambiguous whether something is expected to be a statement or an expression.
In Haskell, each function is just a single (potentially complex) expression; there are no statements, and thus no ambiguity
In Scala, you can put an expression anywhere in a function body, and the last expression is the function's return value — so again there's no ambiguity, because you can just make case constructs (match as they're called in Scala) always be expressions.

Scala's approach works, but it also lets you define a function as def g() = { 1; 2; 3; }, which I don't like. Statements and expressions are different beasts to me, and conflating them seems like a lazy and inelegant solution.

So then, is the case in that f example above about to introduce a statement or expression?

One solution is to take a hint from Java and have method bodies always consist of statements. If we take that approach, f... : case ints of is a statement. To make it be an expression whose value is returned, we'd have to write f... : return case ints of....

That's not the end of the world. In fact, I've never liked the Ruby-style return statements, where you just plop an expression at the end of a method:

def ugly
    123
end

There are a few reasons I don't like this, but the main reason is that in an imperative context (which a Ruby/Scala/etc method is), returning a value is an action. It should look like one! When I write imperative code, I'm telling the computer a series of actions to take. An implicit return feels like this to me:

First, ask the user how many apples they want.
Then find out how many apples are available.
Then, the minimum of that number and the number of apples requested.

That last sentence feels wrong, because it's not a sentence; it's a phrase. You can figure out what it means, but it feels stilted.

On the other hand, when writing one-liners, the return feels superfluous. Here's a nice size function for a list:

size -> Int: case this of
    (): 0
    _: 1 + (list tail)

One option I'm considering is to look at the return value if the method is a one-liner (that is, just a single statement or expression — even if it's complex). If it's Unit, that one line is a statement; otherwise, it's an expression. (If the function's body is a block instead of a one-liner, that block consists entirely of statements, including possibly a return statement.)

This feels a bit subtle and potentially confusing, and maybe that should be a big warning. On the other hand, I think that for most cases, it'll "just work." Crucially, since this only applies to one-liners, nearly all the cases should hopefully be simple cases. I can't think of any that wouldn't be.

This approach also means that the compiler will have to know about the Unit type specially. My instincts are that this smells wrong, but maybe it's not so bad.

Ah, what the heck. Despite all these warning bells going off, I'll try it out. If nothing else, it'll be good to see if my intuition (that this is a sketchy idea) is right, and why specifically. As Batman Begins put it, we fall so we can learn to pick ourselves up.

Friday, February 7, 2014

Python-like indentation using Antlr4

Introducing a small helper class for implementing python-like indentation in antlr4: antlr-denter on github.

I've started writing the grammar for my language, but soon ran into a problem generating python-like INDENT/DEDENT tokens in ANTLR v4. Googling found other people with the same problem and half-answers, but nothing seemed satisfactory. I decided to do something about it.

Why didn't I like what's out there? First of all, most of the answers are based on antlr v3, and don't quite work for v4. But more than that, they seemed incomplete. The ones I found don't look like they handle various edge cases, like an EOF while indended, or "half-dedents."

if something():
    takeAction()
    butDontUnindent()<eof>

if something():
      if anotherThing()
            takeAction()
   thisIndentationIsWrong()

(I should add a disclaimer, which is that I didn't actually try the proposed solutions; I just looked at the code and noticed that they don't seem to handle it.)

But even if they do work perfectly, there are other issues. The indent/dedent logic is embedded in the grammar, which makes it hard to test separately. You have to copy-paste that code, meaning it could be a pain to update if the person you copied it from has a bug fix (especially if you had to make changes to fit your grammar). In fact, you probably won't even notice that they have a bug fix, if they even bother to mention it in the web.

So, I wrote a modular, testable helper class: welcome the unimaginatively-named antlr-denter. Plugging it into your grammar is wicked easy (docs are in the README.md on github, link above), and if there's ever a bug fix, all you need to do is make sure you have the latest version of the antlr-denter jar. Boom!

The basic strategy is to override your parser's nextToken() and have it delegate to DenterHelper::nextToken. That does some processing, ultimately pulling from your lexer's super.nextToken.

One interesting tidbit is that the overhead of this code — which is completely un-optimized and just in a "get it working" state, for now — seems pretty much negligible. There is about a 10% - 20% overhead in the DenterHelper, but it's offset by the shorter program lengths. Consider these two snippets:

if foo() {
    if bar() {
        doBar()
    }
}

and

if foo()
    if bar()
        dobar()

The indentation-based program is 38 chars long; the braced one is 50 chars — 24% longer! If that's surprising, consider that the fourth line in the braced version, which just closes the if bar() block, is six chars. Sure, most of that is just whitespace, but it's still data the lexer needs to trudge through. The equivalent in the indentation-based version is zero chars.

I've only tested it with simple programs in a simpler language, but the results are all consistent. I wrote up the analysis in the readme for the examples submodule.

~~antlr-denter isn't on maven central yet, mostly because I don't need it to be. If anyone comes across this post and wants me to upload the module, I'd be happy to.~~ The project is available in maven central, so getting up and running should be pretty straightforward. And feel free to open issues or PRs on the github project.

Sunday, January 12, 2014

A solution to the object-equality problem

As I mentioned in my last post, there are certain methods that require transitivity — and that means both objects must agree on a single implementation of those methods. For instance, if a Ball expects to compare itself against another Ball using only its radius, a WeightedBall cannot use radius and weight, even if it's being compared to another WeightedBall. If it did, transitivity could break when a plain, unweighted Ball joins the party.

My solution in Effes is to define compiletime methods, which look like normal instance methods but are in fact similar to Comparator objects whose instances are picked at compile time. This is basically an implementation of the strategy pattern.

To see this in action, consider the following snippet:

type Eq:
    [compiletime] == (other: This) -> Bool

data type Ball(radius: Int):
    is Eq
    [override compiletime] (== other):
        radius == other.radius

type WeightedBall:
    is Ball -- also inherits Eq
    weight -> Int:
    [override compiletime] (== other):
        radius == other.radius && weight == other.weight

    ByRadius:
        [override compiletime] (== other): (this: Ball) == other

First we define an Eq type, with a compiletime method named ==. This method takes one argument, which must be of the same type as the declaring class (the This class is a reserved class name, just as this is a reserved variable name in many OO languages, including Effes). The This restriction is re-applied in each specific type, so Ball == must take a Ball argument (not just a Eq).

The WeightedBall subtype of Ball redefines == to include the two balls' weights. But it also defines a "variant" called ByRadius, which only compares radii. It does this by casting itself down to a plain Ball, and then calling ==, such that this call to == has a most specific implementation of (Ball ==).

To someone trying to compare two objects, the == method looks like a normal method call. But under the hood, Effes delegates it to a strategy object. Here's a simple set implementation which maintains a list of elements and just checks each new element as it comes in, to make sure it's not already in the list:

type ListSet[T:Eq]:
  val values : List[T] = Empty
  add elem:T -> ListSet[T]: if values contains T
    then this
    else values' = values push elem
  contains (needle : T): case of:
    Nothing: False
    (head, tail): case head of:
      needle: True
      _: tail contains elem

The == magic happens in the last three lines. The case head of statement tries to match head against needle, using ==. If they match, the answer is True; otherwise, we try recursively on the tail list.

As you can see, contains takes one argument, the needle we're looking for. But under the hood, it actually takes two arguments: the needle, and an "equality strategy" that holds some implementation of ==. This object is picked at compile time, ensuring that runtime polymorphism doesn't screw up transitivity.

For instance, let's try to add a heavyBall : WeightedBall to two ListSets, one typed to Ball and the other to WeightedBall:

plainSet : ListSet[Ball] = getSomeListSet
weightedSet : ListSet[WeightedBall] = getSomeListSet
s1 = plainSet add heavyBall
s2 = weightedSet add heavyBall

Note that both variables are initially assigned to identical sets, meaning that both have all WeightedBall elements. But, plainSet is typed to ListSet[Ball]. The last two lines are internally translated to something like:

s1_ = plainSet add heavyBall (Ball ==)
s2_ = weightedSet add heavyBall (WeightedBall ==)

At the == call site (in ListSet contains), the case head of ... needle snippet is translated to something like if (_equalityTester head needle), where _equalityTester is the strategy object.

If we wanted the set to contain WeightedBalls, but only restrict to unique radii, we would write:

uniqueRadii : ListSet[WeightedBall:ByRadius] = getSomeListSet
s3 = uniqueRadii add heavyBall

The WeightedBall:ByRadius bit tells the compiler to pick the ByRadius variant we defined above:

s3_ = uniqueRadii add heavyBall (WeightedBall:ByRadius ==)

There's one important bit to note here. Comparing two objects by radius is going to give you a different result than comparing them by radius and weight; there could be more duplicates. So when we cast the result of getSomeListSet to ListSet[Ball], that's not the relatively light operation that a cast usually is. We have to actually rebuild the ListSet so that it knocks out those extra duplicates. I'm still working on the mechanism for this; maybe the casting is only allowed if the type defines a method for it, something like:

type ListSet[T:Eq]:
    @cast ListSet[T2]: ListSet[T2]() addAll values

The signatures of these compiletime methods are a bit deceitful. They look like instance methods, and even define this but they're actually just normal two-arg methods. This is of course the same for all instance methods (this is always just an implicit argument), but it feels like more of a lie here. I think it's because the this argument isn't even semantically special; the use case for compiletime methods is precisely that this and other are on the same footing!

The reason I went for the cheat is simply that it feels more natural to me, even if it's a bit inaccurate. It lets us write things like a == b (instead of == a b, as would be the case for a non-instance method) without tricky rules for declaring methods as inline. My goal from the beginning was to try to get the best of both worlds that Java offers: the correctness of Comparators with the natural feel of this.equals. I think this approach succeeds in that goal, at the cost of pretty minimal sugar/lies.

Thursday, January 2, 2014

Equalities in object-oriented programs

There's a problem in object-oriented programming: equality and polymorphism don't generally get along. Specifically, extra state introduced in subclasses breaks important properties of equality like transitivity. I have an idea for a solution in Effes which, though I'm sure isn't novel, I haven't seen in other languages. But first, I'd like to take a post to explore some aspects of the problem.

Consider two classes: Ball and WeightedBall extends Ball. A Ball has one piece of state, radius; two objects are equal if they're both balls with the same radius. A WeightedBall adds another piece of state, weight. How does this factor into equality?

One answer could be that WeightedBall checks to see if the other object is a WeightedBall or just a plain Ball. If it's a WeightedBall, look at both the weight and the radius; otherwise, just look at the radius. In pseudocode:

boolean isEqual(Object other):
    if other is a WeightedBall:
        return this.radius == other.radius
            && this.weight == other.weight
    else if other is a Ball:
        return this.radius == other.radius
    else:
        return false

This approach fails because it doesn't satisfy transitivity. Consider three balls and some equalities:

Given:
- ball5 = Ball(radius=5)
- lightBall5 = WeightedBall(radius=5, weight=10)
- heavyBall5 = WeightedBall(radius=5, weight=200)
Then:
- ball5 == lightBall5 (only check radius)
- ball5 == heavyBall5 (only check radius)
- lightBall5 != heavyBall5 (check radius and weight)

In a language like Java, there aren't any great solutions to this. Either you require that both objects have exactly the same class, so that a Ball is never equal to a WeightedBall; or you don't let subclasses add state to the equality check, so that lightBall5 and heavyBall5 are equal because they only check radius, as specified by the Ball.equals contract.

This problem occurs with ordering, too. Java handles this variant by letting you create explicit Comparator classes, so that a Comparator<? super Ball> only checks radius while a Comparator<? super WeightedBall> also checks weight. This also lets you define several orderings, so that you can sort balls by radius, weight, bounciness, or any other property.

The Comparator pattern is really the right one. It lets the programmer pick the right ordering for each task, and it makes reflexivity and transitivity easy. The problem is that it's verbose: every method that cares about T's ordering needs to take a Comparator<? super T>, and every class that defines an ordering needs to create a nested class, preferably as a singleton.

Java addressed this by creating a Comparable interface for providing a "natural" ordering; basically, a class can be its own comparator. That's helpful for the programmer who's writing a comparable class, but it inflates APIs that use ordering. Each one needs two variants: one that takes a T extends Comparable<? super T> and another that takes a T and a Comparator<? super T>. Here are two methods from java.util.Collections, for instance:

public static <T extends Comparable<? super T>>
       void sort(List<T> list)
public static <T>
       void sort(List<T> list, Comparator<? super T> c)

The Comparator pattern is a step in the right direction, but it does nothing for equality, nor for its close cousin, hashing. If you create a HashSet<WeightedBall>, it can only use the default definition of equals and hashCode that WeightedBall defines — and which must still be consistent with Ball's equals and hashCode. I suppose you could create a Hasher interface similar to Comparator, but nobody does — the JDK certainly doesn't. Your only option is to wrap each WeightedBall in a WeightedBallByWeight object that defines equality/hash.

This hybrid approach — Comparable and equals/hashCode on the object, Comparator as a separate object — also leads to inconsistencies between ordering and equality. The JavaDocs for TreeSet includes this bit:

Note that the ordering maintained by a set (whether or not an explicit comparator is provided) must be consistent with equals if it is to correctly implement the Set interface.

This means it's virtually impossible to use a Comparator correctly in some cases. If WeightedBall above went with the exact-class approach (so that it can check both radius and weight), you can't create a TreeSet<WeightedBall> with a custom comparator that only checks radius; if you did, you could have two balls that are not equal (same radius, different weights) but can't both be in the set, because the custom ordering says they are equal.

Similarly, if you decided to let Ball define the equality interface, so that WeightedBall doesn't check weight, then you can't create a TreeSet<WeightedBall> with a Comparator that also factors in weight. If you did, you could have two objects that are both in the set, and yet are equals to one another!

What we really need is a hierarchy of Comparator-like interfaces:

Equalitator<T> with boolean areEqual(T a0, T a1)
- Hasher<T> extends Equalitator<T> with int hashCode(T a0)
- Comparator<T> extends Equalitor<T> with int compare(T a0, T a1) and a restriction that areEqual(a, b) == (compare(a, b) == 0)

As I teased above, I have an idea for how to solve this in Effes. Stay tuned!