Sunday, October 3, 2021

A look back

It's obviously been a while since I updated this, and I've gotten one or two questions about it — enough to get me to write a brief update.

In short, this project is (surprise!) now officially abandonware. I'm not planning on developing Effes any more. That said, I consider the project a success!

Going back to the start, the crux of Effes was to explore what happens when I let programmers define variables whose type is a sum type of declared types: "A or B". That is, I wanted to move one aspect of composition from the API designer's realm to the API user's realm. For example, in Java, an Optional<T> is defined as being one of two things: either an "absent" or a "present of T". This sum type (absent or present) is defined by the Java API. I wanted to see what happens if we let the API define just the "absent" and "present of T" types, and let the user of the API declare when they want a variable that is exactly one of those — or when they want something else, like an "absent or present-of-T or list-of-T", or even a "present-of-T or list-of-S".
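To make the distinction concrete, here's a sketch in TypeScript-style union syntax (hypothetical Absent/Present types for illustration, not real Effes code):

```typescript
// The API designer defines only the building blocks:
class Absent {
  readonly kind = "absent" as const;
}
class Present<T> {
  readonly kind = "present" as const;
  constructor(readonly value: T) {}
}

// The API *user* declares which combinations they want:
type Maybe<T> = Absent | Present<T>;              // the classic Optional shape
type MaybeOrList<T> = Absent | Present<T> | T[];  // "absent or present-of-T or list-of-T"
type Mixed<T, S> = Present<T> | S[];              // "present-of-T or list-of-S"

function describe<T>(m: MaybeOrList<T>): string {
  if (Array.isArray(m)) return "list of " + m.length;
  return m.kind === "absent" ? "absent" : "present";
}
```

The point is that `MaybeOrList` and `Mixed` are declared at the use site; the designer of `Absent` and `Present` never had to anticipate them.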

To get there, I started with a project called Effes to create a parser and type-checker for such a language. With that done, I moved on to Effes2, which was to be self-bootstrapping: Effes2 would be written in Effes2.

Since I didn't want to deal with Effes2-to-JVM compatibility, I couldn't rely on any external libraries; the Effes2 compiler had to be written from scratch. One of the first steps was writing the parser, and this was actually a perfect test of the language! Parser rules are sum types (a statement is an assignment, or a function call, or a return, or a...), so they would fit perfectly with the central feature of my language.

But the result was... not great. You can see the result in Parser.ef, which basically looks like gobbledygook. If my idea was really that powerful, the code behind this parser would be intuitive, approaching (if not equaling) a BNF definition. A quick glance at _parseSimpleExpression(tokens) shows that it isn't.

Well, a negative result is still a result — and that's where we stand today. The project started out to explore "what happens if we let users create sum types?" and the answer was, "not enough to warrant designing a language around it."

There are some caveats around the negative result, to be sure: I never finished Effes2, and it could be that using the parser rules in a compiler would be more intuitive than using them in a parser. I also never explored an earlier idea I had, which was to see how intersection types ("A and B") could replace inheritance (imagine a List type whose instantiation was something like List & AbstractList & ArrayBasedListFunctionality).

Interestingly enough, I've since learned that TypeScript has this feature, which it calls union types. I have a couple ideas for future projects bouncing around in my head, and if I get around to pursuing those, I look forward to using TypeScript for them and seeing what role union types have in the context of a fully-formed language. But for now, Effes has served its purpose, and I'm content to close the book on it.

Friday, April 8, 2016

Algebraic sum types in Java 8

I’ve been busy busy busy, so I haven’t had much time to work on Effes, but I did want to share a technique I developed for mimicking algebraic sum types in Java. Those are basically what Effes calls disjunctive types: Foo | Bar to represent an object that might be a Foo or a Bar.

Though they’re quite different in a lot of ways, you can think of subtypes as kinda-sorta like sum types. If you have an abstract class Animal with exactly two subclasses, Dog and Cat, then an animal must either be a dog or a cat: Animal -> Dog | Cat.

As it happens, this is how Antlr expresses alternatives within a grammar rule. For instance, my grammar includes this production:

singleType: TYPE_NAME singleTypeParameters # SingleDataType
          | GENERIC_NAME                   # SingleGenericType
          ;

The classes Antlr provides include the hierarchy:

class SingleTypeContext
class SingleDataTypeContext extends SingleTypeContext
class SingleGenericTypeContext extends SingleTypeContext

Since the grammar is fully defined within my compiler, I know that these are the only two subclasses of SingleTypeContext; I can treat it as a closed sum type. A lot of what I do involves translating ASTs, which in turn requires looking at the specific subtype (or, in sum type terms, the type’s tag).

The approach I take to these is implemented in a Dispatcher class, and it has three main parts:

  • a mapping from subtype to a function whose input is that subtype
  • a dispatcher that figures out which function to use, and invokes it
  • the ability to test that all subtypes are accounted for

To take a trivial example, let’s say I were describing the AST. The first thing I’d do is to translate each node to a string, and for that I’d use the mapping functionality. I’d say that a SingleDataTypeContext maps to a function _ -> "I'm a data type!" while SingleGenericTypeContext maps to g -> "I'm a generic named " + g.name(). I then pass an instance typed as the superclass, SingleTypeContext, to the dispatcher. The dispatcher figures out the specific class of the instance, finds the corresponding function, downcasts the instance, and passes it to that function.
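In TypeScript rather than Java (to keep the snippet blog-sized), the mapping-plus-dispatch halves of that example might look like this — the class names are borrowed from the Antlr hierarchy above, everything else is a hypothetical stand-in:

```typescript
// Minimal stand-ins for the Antlr-generated hierarchy:
class SingleTypeContext {}
class SingleDataTypeContext extends SingleTypeContext {}
class SingleGenericTypeContext extends SingleTypeContext {
  name(): string { return "T"; }
}

// Part 1: a mapping from subtype to a function whose input is that subtype
const handlers = new Map<Function, (ctx: any) => string>([
  [SingleDataTypeContext, (_) => "I'm a data type!"],
  [SingleGenericTypeContext, (g: SingleGenericTypeContext) => "I'm a generic named " + g.name()],
]);

// Part 2: a dispatcher that finds the right function and invokes it
function dispatch(ctx: SingleTypeContext): string {
  const handler = handlers.get(ctx.constructor);
  if (!handler) throw new Error("unmapped subtype: " + ctx.constructor.name);
  return handler(ctx);
}
```

The third part (verifying that every subtype is in the map) doesn't fit in a snippet this size, but the idea is the same as in the Java version: enumerate the subtypes and check each one against the map's keys.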

As for the test that all subtypes are accounted for, I basically declare as an invariant in my code (which is up to me to not break) that if I want to treat a type as a sum type, all of its specific types have to be nested classes of a single enclosing class. That makes it easy to find those subtypes, which in turn makes it easy to verify that they’re all in the mapping.

My code has a few places where I use the Dispatcher, but one example (with shorthand Java, to fit in a blog format) is in ExpressionCompiler, where I parse an “expression line” (basically a one-line expression or the first line of a case statement):

private static final Dispatcher<~> exprLineDispatcher
  = Dispatcher.builder(/*...*/)
      .put(SingleLineExpressionContext.class, ExpressionCompiler::singleLine)
      .put(CaseExpressionContext.class, ExpressionCompiler::caseExpression)
      .build(ExpressionCompiler::exprLineErr); // default, for err handling

Needless to say, this is not the most efficient approach; it’s basically a visitor-style dispatch, but backed by a HashMap instead of Java’s built-in polymorphism. For my (current) needs, it’s good enough. It’s also quite convenient for when I add a new alternative to a production in the grammar, since a quick run of the find-all-subtypes test tells me exactly where I need to add new code to handle that alternative.

Wednesday, January 6, 2016

Pattern matching reaches 1.0!

I finally finished implementing pattern matching!

I just looked to see when I last updated this blog, and holy crap, it’s been half a year! I tell ya, this had better be one nice feature.

For what it’s worth, I think it is. The pattern matching is recursive, meaning you can break components down as far as you want:

case myWhatever of
    Something(List(BlahBlah(5, a), _)) -> ...

That will match if myWhatever is a Something whose only argument is a List whose first element is a BlahBlah with two arguments, the first being an int 5 and the second being anything, and that thing is now bound to the variable a.
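In case the prose is hard to follow, here's what that one pattern does, sketched in TypeScript over a hypothetical tagged-node representation (not the compiler's actual data structures):

```typescript
// Hypothetical tagged representation of runtime values:
type Value =
  | { tag: "Something"; arg: Value }
  | { tag: "List"; first: Value; rest: Value }
  | { tag: "BlahBlah"; args: [Value, Value] }
  | { tag: "Int"; n: number }
  | { tag: "Empty" };

// Match Something(List(BlahBlah(5, a), _)): return the value bound to `a`,
// or null if the pattern doesn't match.
function matchPattern(myWhatever: Value): Value | null {
  if (myWhatever.tag !== "Something") return null;
  const list = myWhatever.arg;                       // Something's only argument...
  if (list.tag !== "List") return null;              // ...must be a List
  const first = list.first;                          // whose first element...
  if (first.tag !== "BlahBlah") return null;         // ...is a BlahBlah with two args:
  const [lhs, a] = first.args;
  if (lhs.tag !== "Int" || lhs.n !== 5) return null; // the first is the int 5,
  return a;                                          // and the second binds to `a`
}
```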

That took a bit of work, though it only gets me to parity with other pattern matching languages. What’s neat about Haskell is that the remaining type — that is, the next line in the case list — will take all of that into account. Taking a simpler example, this would fail to compile:

case coinFlips of
    List(Heads, List(Tails, _)) -> ...
    List(Heads, List(Tails, _)) -> ...
    _ -> ...

because the second line is trying to match against a pattern that’s already been matched. And likewise, without that last _ -> ... line, the statement would fail to compile because some coinFlips types have not been accounted for (namely, List(Heads, Empty | List(Heads, _) | List(Tails, _))).
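For comparison, the missing-case half of this check exists in TypeScript too. A sketch, with a hypothetical encoding of the coinFlips type:

```typescript
type Flip = "Heads" | "Tails";
type CoinFlips = { tag: "Empty" } | { tag: "List"; head: Flip; tail: CoinFlips };

function firstFlip(c: CoinFlips): string {
  switch (c.tag) {
    case "Empty": return "no flips";
    case "List":  return c.head;
    default: {
      // If a new alternative were added to CoinFlips, `c` would no longer
      // narrow to `never` here, and this assignment would fail to compile.
      const unhandled: never = c;
      throw new Error(String(unhandled));
    }
  }
}
```

Note that this only catches inexhaustiveness; it doesn't flag the other half (a pattern that's already been fully matched by an earlier case).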

There’s still room for improvements:

  • you can’t match against an alias or disjunctive type (i.e., you can’t match for Boolean; you have to look for True and False separately)
  • the “remaining type” tends to be pretty exploded out; instead of List(Heads, Empty | List(Heads, _)) it might say List(Heads, Empty) | List(Heads, List(Heads, _)). After a few of these exploded forms chain together, it can get pretty hairy
  • There’s no tracking of values for what I call “large domain values,” such as ints (as opposed to small domain values, such as True | False). For instance, if you match an int against 5, nothing prevents you from matching against 5 again.
  • There’s no mechanism for variable-arity destructors, which is why the List(...) pattern above had to be destructured node by node, instead of just as List(Heads, Tails, _)

I may pick off one or two of these soonish, but frankly I’m ready to put pattern matching aside for a bit. It’s taken longer than I’d have liked, and it’s time to move on to some more features.

I think the next thing I’m going to do is exception handling, about which I have some neat ideas. After that, I’m going to fill in the built-in types (which currently only include ints), and for strings I intend to include string interpolation. After that, I’m going to do some I/O work of some sort. And at that point, I think I’ll be ready to start writing some real code in Effes, at long last.

Monday, August 3, 2015

Evolutionary programming

I’ve realized that working on Effes is a different beast than working in my day job, primarily due to time and a lack of continuity. It hasn’t been quite as challenging as I thought it would be, but it does result in a slightly different development style.

At work, I come in and get to chew on a problem for hours at a time (modulo the usual office interruptions). If I don’t finish, I’ll pick things up the next day pretty much where I left off, with everything still more or less fresh in my head. But with Effes, I usually work on things an hour here, a half-hour there, maybe a stretch of a few hours on a weekend. Sometimes I won’t have time to work on it for weeks at a time.

So I’ve taken an evolution-like approach to Effes. Good traits are brought in atomically and incrementally; bad traits are culled, but ones that are merely superfluous can stick around for a while. Sometimes I work on pieces that I think will mesh, but I work on them without a grand spec of how to combine them. Instead, I work on each one individually, and then try to stitch them together.

It’s not that I don’t do cleanup (I do!), but the goals are different: there’s less emphasis on a tight, succinct code base, and more on making progress. Less on the ethos, and more on the id.

Two things I haven’t deemphasized are code clarity and maintainability. Like with the day job, those characteristics are vital, since I may not come back to a piece of code for months. On the other hand, if something works fine and is clear, but could maybe be collapsed a bit or consolidated with another class for aesthetic reasons — I have less incentive to do that, since that half hour task might well be the only half hour I get to spend on Effes for two weeks.

These are, of course, skills not unique to my hobby program. I have a bit of an “architectural perfectionist” streak in me, and working on Effes has helped me hone my ability to get things done efficiently, without worrying too much about how pretty they are.

Tuesday, July 28, 2015

A type by any other name

I think it’s actually going to be easier than I feared.
~ Me, writing Famous Last Words

Uh, yeah. So it turns out it won’t actually be that easy. Instead of just plugging things in straightforwardly, it turns out that I’m going to need to take a tangent to kinda overhaul the classes that represent Effes types.

Here’s the problem. Let’s say I do something like this:

b : Something[Boolean]
case b of
  Something[True]: doSomethingWith b
  c: doSomethingElseWith c

… then I want two things to happen. Firstly, in doSomethingWith b, I want the type system to know that b : Something[True]. And secondly, in doSomethingElseWith c, I want the type system to know that c : Something[False] | Nothing.

Neither one of those works today, because of an implementation choice I made a while back. To simplify recursive types like Cons[T](head: T, tail: Cons[T] | Empty), I separated each type’s name from its arguments. One bit of information says “I’m a Cons with a generic param T”, and a separate class holds the information “given a Cons[E], I can tell you that its arguments are (E, Cons[E] | Empty).”

This means that there’s no place to store the information that in some particular case, the arguments are (T, Empty) (i.e., that this cons represents the last item of a list). That’s exactly the kind of information that the pattern matching can and should provide.

So, before I go further with the pattern matching, I’m going to have to redo my type registration quite a bit: I’m going to need to remove my “arguments for each type” registry altogether and put that information in the types themselves.
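A before-and-after sketch of that overhaul, in TypeScript for brevity (hypothetical shapes, not the actual compiler classes):

```typescript
// Before: one registry entry per type *name*; every Cons[E] shares it,
// so there's nowhere to record that one particular cons has tail Empty.
const argsRegistry = new Map<string, string>([
  ["Cons", "(E, Cons[E] | Empty)"],
]);

// After: each type reference carries its own, possibly-narrowed arguments.
interface TypeRef { name: string; args: TypeRef[]; }

const empty: TypeRef = { name: "Empty", args: [] };
const t: TypeRef = { name: "T", args: [] };
// "the last item of a list": a Cons whose arguments are (T, Empty)
const lastCell: TypeRef = { name: "Cons", args: [t, empty] };
```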

Friday, July 10, 2015

I'm too lazy to type

The title is a double joke. It’s a pun with “lazy” and [data] types, but it’s also funny cause it’s true! This post was supposed to be about lazy evaluations in my pattern matching code, but I don’t feel like writing much about it.

And actually, there’s not a whole lot to write.

  • the “type subtraction” has to be lazy, because one of the operands is a potentially-infinitely-recursive type. Trivial example: IntSequence(head: Int, tail: IntSequence).
  • for the sake of keeping expansions down, it’s better to defer forcing the evaluations as long as possible — but that was covered in my last post
  • step three, profit I guess?
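The first bullet is easier to see in code than in prose. A TypeScript sketch (hypothetical representation) of the IntSequence example as a lazily-forced structure:

```typescript
// A type is a name plus lazily-computed arguments (thunks), so an
// infinitely recursive type like IntSequence(head: Int, tail: IntSequence)
// only ever gets forced one level at a time.
interface LazyType { name: string; args: Array<() => LazyType>; }

const intType: LazyType = { name: "Int", args: [] };
const intSequence: LazyType = {
  name: "IntSequence",
  args: [() => intType, () => intSequence], // the tail refers back to itself
};

// Force exactly as many levels as we need, and no more:
const tail = intSequence.args[1]();
```

Any eager representation of this type would never finish building; the thunks are what make "type subtraction" over it possible at all.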

Anyway, point is there’s not much to say, so I’m just going to leave it at that.

My next step is to actually plug all this stuff into my compiler. I did a quick look the other day to remind myself what the compiler code actually looks like in that area, and I think it’s actually going to be easier than I feared. There’s already a decent abstraction of “the pattern that’s being matched,” with logic that takes that pattern and assigns variables to it and such, so with luck it’ll be pretty easy to swap in the new abstractions I’ve been using.

Wednesday, July 8, 2015

In Soviet Russia, recursion invokes you!

This is one of those “I know what worked, but I don’t quite know what I learned” posts.

To implement my type-safe pattern matching, I have to essentially recurse down into two structures: the type of the thing being matched, and the individual pattern that’s trying to match it. That is, given:

t: List[Maybe[Boolean]]     //          call this the haystack type
case t of
    Cons(One(True), _): doWhatever(...) // and this the needle type

… I need to recurse both into the List[Maybe[Boolean]] and into the Cons(One(True), _). The question is, which of those looks like recursion, and which looks like a dispatch? The problem is interesting because both structures are polymorphic: the haystack can be a simple type (Cons(head: T, tail: List[T])) or a union type (Cons(...) | Empty), while the needle type can either be a concrete type (One(True)) or a wildcard (_).

Essentially, some piece of code has to look something like this:

Result recursePolymorphically(Arg arg) {
  if (arg instanceof OneThing) { ... }
  else if (arg instanceof AnotherThing) { ... }
}

and the question is whether I:

  • recurse into the haystack polymorphically, and instanceof-ify the needle
  • recurse into the needle polymorphically, and instanceof-ify the haystack

A quick observation that drove my thinking on this: the haystack type is potentially infinite (the tail of a Cons is a disjunction that includes a Cons, the tail of which is a disjunction that includes a Cons, and so on) while the needle is always finite. Thus, the base case must depend on the needle.

I tried the first of those first, since I like the idea of the base case being driven off the method’s arguments rather than an attribute of this; with the needle as an argument, the base case is when that argument represents a wildcard or a concrete, no-arg type.

The problem is that this didn’t work well with laziness (the subject of my next post), since it meant forcing the haystack more than I wanted. This in turn caused types to explode out much faster than they needed to. Instead of ending up with something like:

t': Cons(Nothing, ...)
  | Empty

I might end up with something like:

t': Cons(Nothing, Cons(Nothing, ...) | Empty)
  | Cons(Nothing, Cons(Maybe[True], ...) | Empty)
  | Cons(Nothing, Cons(Maybe[False], ...) | Empty)
  | Empty

This made testing more difficult, as I had to start balancing thoroughness and verbosity in the test code. It also means that in the case of a compilation error, the user would see a much more confusing error message. Would you rather see “saw True but expected Cons(Nothing, …)” or “saw True but expected Cons(Nothing, Cons(Nothing, …) | Empty) | Cons(Nothing, Cons(Maybe[True], …) | Empty) | Cons(Nothing, Cons(Maybe[False], …) | Empty) | Empty”? I just wrote that, and I can barely even make sense of it!

As it happens, the testing was important because this approach lent itself to more bugs, though I haven’t done the introspection to figure out why.

The explosion problem pretty much went away when I switched the recursion and the dispatch. Now, the polymorphic call to the needle’s Any form can take its unforced argument (a thunk representing some piece of the haystack’s type) and shove it into the result, still in its lazy form. The result becomes something like matched={<thunk>}, which keeps things from having to expand further.
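Roughly, in TypeScript (hypothetical shapes; the real code is Java), the winning arrangement looks like this — the recursion is driven by the finite needle, and the wildcard base case captures its haystack piece as a thunk without forcing it:

```typescript
// The haystack is potentially infinite, so its arguments are thunks:
interface Haystack { name: string; args: Array<() => Haystack>; }

// The needle is always finite:
type Needle =
  | { kind: "wildcard" }
  | { kind: "concrete"; name: string; args: Needle[] };

// Returns the (still-lazy) matched piece, or null on mismatch.
function matchNeedle(needle: Needle, haystack: () => Haystack): (() => Haystack) | null {
  if (needle.kind === "wildcard") {
    return haystack; // base case: keep the thunk; nothing gets expanded
  }
  const oneLevel = haystack(); // force only a single level of the haystack
  if (oneLevel.name !== needle.name || oneLevel.args.length !== needle.args.length) {
    return null;
  }
  // recurse pairwise into the needle's (finite) arguments
  for (let i = 0; i < needle.args.length; i++) {
    if (matchNeedle(needle.args[i], oneLevel.args[i]) === null) return null;
  }
  return haystack;
}
```

(The real result type carries more than the matched thunk — bindings, remaining types, and so on — but the shape of the recursion is the point here.)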

Which gets me back to the question at the top: what did I learn? I still don’t know. I tried one technique, identified its flaws, and tried the other — but I didn’t learn how to pick techniques better. My algorithm is still a linear search.

Maybe what I learned is that the base case should be driven off polymorphism, not off method arguments. Or maybe it’s that lazy evaluation and polymorphism don’t mix well. Or maybe that sometimes, you just have to use brute force to figure new stuff out.