Showing posts with label java. Show all posts

Friday, April 8, 2016

Algebraic sum types in Java 8

I’ve been busy busy busy, so I haven’t had much time to work on Effes, but I did want to share a technique I developed for mimicking algebraic sum types in Java. Those are basically what Effes calls disjunctive types: Foo | Bar to represent an object that might be a Foo or a Bar.

Though they’re quite different in a lot of ways, you can think of subtypes as kinda-sorta like sum types. If you have an abstract class Animal with exactly two subclasses, Dog and Cat, then an animal must either be a dog or a cat: Animal -> Dog | Cat.

As it happens, this is how Antlr expresses alternatives within a grammar rule. For instance, my grammar includes this production:

singleType: TYPE_NAME singleTypeParameters # SingleDataType
          | GENERIC_NAME                   # SingleGenericType
          ;

The classes Antlr provides include the hierarchy:

class SingleTypeContext
SingleDataTypeContext extends SingleTypeContext
SingleGenericTypeContext extends SingleTypeContext

Since the grammar is fully defined within my compiler, I know that these are the only two subclasses of SingleTypeContext; I can treat it as a closed sum type. A lot of what I do involves translating ASTs, which in turn requires looking at the specific subtype (or, in sum type terms, the type’s tag).

The approach I take to these is implemented in a Dispatcher class, and it has three main parts:

  • a mapping from subtype to a function whose input is that subtype
  • a dispatcher that figures out which function to use, and invokes it
  • the ability to test that all subtypes are accounted for

To take a trivial example, let's say I'm describing the AST. The first thing I'd do is translate each node to a string, and for that I'd use the mapping functionality. I'd say that a SingleDataTypeContext maps to a function _ -> "I'm a data type!" while SingleGenericTypeContext maps to g -> "I'm a generic named " + g.name(). I then pass an instance typed as the superclass, SingleTypeContext, to the dispatcher. The dispatcher figures out the specific class of the instance, finds that class's function, downcasts the instance, and passes it to the function.
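
In sketch form, the mapping and dispatch parts look something like this. (SimpleDispatcher and the context classes here are simplified stand-ins I'm making up for illustration; this isn't the real Dispatcher API.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Stand-ins for the two Antlr context subtypes:
abstract class SingleTypeContext {}
class SingleDataTypeContext extends SingleTypeContext {}
class SingleGenericTypeContext extends SingleTypeContext {
  String name() { return "T"; }
}

// A minimal dispatcher: map each subtype's Class token to a function that
// takes that subtype, then look up and invoke the function at dispatch time.
class SimpleDispatcher<T, R> {
  private final Map<Class<? extends T>, Function<? super T, ? extends R>> handlers = new HashMap<>();

  <S extends T> SimpleDispatcher<T, R> put(Class<S> cls, Function<? super S, ? extends R> fn) {
    // The unchecked cast is safe by construction: dispatch() only ever passes
    // instances of cls to this function.
    @SuppressWarnings("unchecked")
    Function<? super T, ? extends R> widened = (Function<? super T, ? extends R>) fn;
    handlers.put(cls, widened);
    return this;
  }

  R dispatch(T instance) {
    Function<? super T, ? extends R> fn = handlers.get(instance.getClass());
    if (fn == null) {
      throw new IllegalArgumentException("no handler for " + instance.getClass());
    }
    return fn.apply(instance);
  }
}
```

The unchecked cast in put is the one place type safety leans on an invariant rather than the compiler: a class token is only ever paired with a function over that same class.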

As for the test that all subtypes are accounted for, I basically declare as an invariant in my code (which is up to me to not break) that if I want to treat a type as a sum type, all of its specific types have to be nested classes of a single enclosing class. That makes it easy to find those subtypes, which in turn makes it easy to verify that they’re all in the mapping.
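
The core of that check can be sketched with reflection, using a toy enclosing class in place of the real grammar classes (CoverageCheck and missingHandlers are made-up names, not my actual test code):

```java
import java.util.HashSet;
import java.util.Set;

// Toy example of the invariant: all subtypes of Shape are nested in Shapes.
class Shapes {
  static abstract class Shape {}
  static class Circle extends Shape {}
  static class Square extends Shape {}
}

class CoverageCheck {
  // Returns the nested subclasses of baseType that have no handler in the mapping.
  static Set<Class<?>> missingHandlers(Class<?> enclosing, Class<?> baseType, Set<Class<?>> mapped) {
    Set<Class<?>> missing = new HashSet<>();
    for (Class<?> nested : enclosing.getDeclaredClasses()) {
      if (baseType.isAssignableFrom(nested) && !nested.equals(baseType) && !mapped.contains(nested)) {
        missing.add(nested);
      }
    }
    return missing;
  }
}
```

If the mapping knows about Circle but not Square, the check reports Square as missing; once both are mapped, it reports nothing.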

My code has a few places where I use the Dispatcher, but one example (with shorthand Java, to fit in a blog format) is in ExpressionCompiler, where I parse an “expression line” (basically a one-line expression or the first line of a case statement):

private static final Dispatcher<~> exprLineDispatcher
  = Dispatcher.builder(/*...*/)
      .put(SingleLineExpressionContext.class, ExpressionCompiler::singleLine)
      .put(CaseExpressionContext.class, ExpressionCompiler::caseExpression)
      .build(ExpressionCompiler::exprLineErr); // default, for err handling

Needless to say, this is not the most efficient approach; it’s basically a visitor-style dispatch, but backed by a HashMap instead of Java’s built-in polymorphism. For my (current) needs, it’s good enough. It’s also quite convenient for when I add a new alternative to a production in the grammar, since a quick run of the find-all-subtypes test tells me exactly where I need to add new code to handle that alternative.

Thursday, January 2, 2014

Equalities in object-oriented programs

There's a problem in object-oriented programming: equality and polymorphism don't generally get along. Specifically, extra state introduced in subclasses breaks important properties of equality like transitivity. I have an idea for a solution in Effes which, though I'm sure isn't novel, I haven't seen in other languages. But first, I'd like to take a post to explore some aspects of the problem.

Consider two classes: Ball and WeightedBall extends Ball. A Ball has one piece of state, radius; two objects are equal if they're both balls with the same radius. A WeightedBall adds another piece of state, weight. How does this factor into equality?

One answer could be that WeightedBall checks to see if the other object is a WeightedBall or just a plain Ball. If it's a WeightedBall, look at both the weight and the radius; otherwise, just look at the radius. In pseudocode:

boolean isEqual(Object other):
    if other is a WeightedBall:
        return this.radius == other.radius
            && this.weight == other.weight
    else if other is a Ball:
        return this.radius == other.radius
    else:
        return false

This approach fails because it doesn't satisfy transitivity. Consider three balls and some equalities:

  • Given:
    • ball5 = Ball(radius=5)
    • lightBall5 = WeightedBall(radius=5, weight=10)
    • heavyBall5 = WeightedBall(radius=5, weight=200)
  • Then:
    • ball5 == lightBall5 (only check radius)
    • ball5 == heavyBall5 (only check radius)
    • lightBall5 != heavyBall5 (check radius and weight)
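
To make that concrete, here's roughly what those equality methods look like in real Java (a sketch of the pseudocode above, not code from Effes):

```java
class Ball {
  final int radius;
  Ball(int radius) { this.radius = radius; }
  @Override public boolean equals(Object o) {
    return o instanceof Ball && ((Ball) o).radius == radius;
  }
  @Override public int hashCode() { return radius; }
}

class WeightedBall extends Ball {
  final int weight;
  WeightedBall(int radius, int weight) {
    super(radius);
    this.weight = weight;
  }
  @Override public boolean equals(Object o) {
    if (o instanceof WeightedBall) {
      // Another weighted ball: check both pieces of state.
      WeightedBall that = (WeightedBall) o;
      return that.radius == radius && that.weight == weight;
    }
    // A plain ball: fall back to radius-only equality.
    return o instanceof Ball && ((Ball) o).radius == radius;
  }
}
```

With these classes, ball5.equals(lightBall5) and ball5.equals(heavyBall5) both return true, but lightBall5.equals(heavyBall5) returns false: exactly the transitivity failure listed above.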

In a language like Java, there aren't any great solutions to this. Either you require that both objects have exactly the same class, so that a Ball is never equal to a WeightedBall; or you don't let subclasses add state to the equality check, so that lightBall5 and heavyBall5 are equal because they only check radius, as specified by the Ball.equals contract.

This problem occurs with ordering, too. Java handles this variant by letting you create explicit Comparator classes, so that a Comparator<? super Ball> only checks radius while a Comparator<? super WeightedBall> also checks weight. This also lets you define several orderings, so that you can sort balls by radius, weight, bounciness, or any other property.

The Comparator pattern is really the right one. It lets the programmer pick the right ordering for each task, and it makes reflexivity and transitivity easy. The problem is that it's verbose: every method that cares about T's ordering needs to take a Comparator<? super T>, and every class that defines an ordering needs to create a nested class, preferably as a singleton.

Java addressed this by creating a Comparable interface for providing a "natural" ordering; basically, a class can be its own comparator. That's helpful for the programmer who's writing a comparable class, but it inflates APIs that use ordering. Each one needs two variants: one that takes a T extends Comparable<? super T> and another that takes a T and a Comparator<? super T>. Here are two methods from java.util.Collections, for instance:

public static <T extends Comparable<? super T>>
       void sort(List<T> list)
public static <T>
       void sort(List<T> list, Comparator<? super T> c)

The Comparator pattern is a step in the right direction, but it does nothing for equality, nor for its close cousin, hashing. If you create a HashSet<WeightedBall>, it can only use the default definition of equals and hashCode that WeightedBall defines — and which must still be consistent with Ball's equals and hashCode. I suppose you could create a Hasher interface similar to Comparator, but nobody does — the JDK certainly doesn't. Your only option is to wrap each WeightedBall in a WeightedBallByWeight object that defines equality/hash.
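
A general form of that wrapper might look like this sketch (Hasher and Wrapped are hypothetical names; nothing like them ships in the JDK):

```java
import java.util.HashSet;
import java.util.Set;

// A hypothetical Comparator-style interface for equality and hashing:
interface Hasher<T> {
  boolean areEqual(T a, T b);
  int hashCode(T a);
}

// The wrapper workaround: HashSet can only call the element's own
// equals/hashCode, so the wrapper substitutes the Hasher's definitions.
class Wrapped<T> {
  final T value;
  final Hasher<? super T> hasher;

  Wrapped(T value, Hasher<? super T> hasher) {
    this.value = value;
    this.hasher = hasher;
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof Wrapped)) return false;
    @SuppressWarnings("unchecked")
    T otherValue = (T) ((Wrapped<?>) o).value;
    return hasher.areEqual(value, otherValue);
  }

  @Override public int hashCode() { return hasher.hashCode(value); }
}
```

A HashSet<Wrapped<WeightedBall>> built this way uses whatever equality the Hasher defines, at the cost of wrapping and unwrapping every element.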

This hybrid approach — Comparable and equals/hashCode on the object, Comparator as a separate object — also leads to inconsistencies between ordering and equality. The JavaDoc for TreeSet includes this bit:

Note that the ordering maintained by a set (whether or not an explicit comparator is provided) must be consistent with equals if it is to correctly implement the Set interface.


This means it's virtually impossible to use a Comparator correctly in some cases. If WeightedBall above went with the exact-class approach (so that it can check both radius and weight), you can't create a TreeSet<WeightedBall> with a custom comparator that only checks radius; if you did, you could have two balls that are not equal (same radius, different weights) but can't both be in the set, because the custom ordering says they are equal.
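
Here's a sketch of that trap, with WBall standing in for the exact-class-equality WeightedBall:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Exact-class equality: radius and weight both count.
class WBall {
  final int radius;
  final int weight;
  WBall(int radius, int weight) { this.radius = radius; this.weight = weight; }
  @Override public boolean equals(Object o) {
    if (o == null || o.getClass() != WBall.class) return false;
    WBall that = (WBall) o;
    return that.radius == radius && that.weight == weight;
  }
  @Override public int hashCode() { return 31 * radius + weight; }
}

class TreeSetDemo {
  // A TreeSet whose ordering looks only at radius -- inconsistent with equals.
  static TreeSet<WBall> byRadius() {
    return new TreeSet<>(Comparator.comparingInt((WBall b) -> b.radius));
  }
}
```

Adding WBall(5, 10) succeeds, but then adding WBall(5, 200) returns false: the comparator calls it a duplicate even though equals says the two balls are different.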

Similarly, if you decided to let Ball define the equality interface, so that WeightedBall doesn't check weight, then you can't create a TreeSet<WeightedBall> with a Comparator that also factors in weight. If you did, you could have two objects that are both in the set, and yet are equal to one another!

What we really need is a hierarchy of Comparator-like interfaces:

  • Equalitator<T> with boolean areEqual(T a0, T a1)
    • Hasher<T> extends Equalitator<T> with int hashCode(T a0)
    • Comparator<T> extends Equalitator<T> with int compare(T a0, T a1) and a restriction that areEqual(a, b) == (compare(a, b) == 0)
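
As Java interfaces, that hierarchy might look like this sketch (none of these exist in the JDK, and this Comparator would deliberately shadow java.util.Comparator):

```java
// Root: equality only.
interface Equalitator<T> {
  boolean areEqual(T a0, T a1);
}

// Adds hashing, with the usual contract that equal values hash alike.
interface Hasher<T> extends Equalitator<T> {
  int hashCode(T a0);
}

// Adds ordering; the restriction that areEqual agrees with compare can be
// baked in as a default method, so implementations only supply compare.
interface Comparator<T> extends Equalitator<T> {
  int compare(T a0, T a1);

  @Override
  default boolean areEqual(T a0, T a1) {
    return compare(a0, a1) == 0;
  }
}
```

Since compare is the only abstract method, this Comparator is still a functional interface and can be written as a lambda or method reference.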

As I teased above, I have an idea for how to solve this in Effes. Stay tuned!

Thursday, November 21, 2013

Half-thread-safe collections in Java

I thought about an "interesting" (i.e., I'm probably the only one who finds it interesting) question yesterday. Is it possible to define a thread-safe collection in Java that doesn't transfer objects across threads safely?

I know, that sounds weird; let me explain. The thread-safe collections in the JDK all have documentation to the effect of "actions in a thread prior to placing an object into a BlockingQueue happen-before actions subsequent to the access or removal of that element from the BlockingQueue in another thread."

My question is: is it possible to write a collection such that the collection itself is thread-safe, but it doesn't provide that sort of guarantee? I'm using "collection" colloquially as "some class that provides put-like and get-like operations."

That is, given:

class Foo {
    int val;
}

// Thread A:
Foo fooIn = new Foo();
fooIn.val = 5;
weirdCollection.add(fooIn);

// Thread B:
Foo fooOut = weirdCollection.get();
assert fooOut.val == 5;

... is it possible to create a weirdCollection such that it is apparently thread-safe, yet the assertion would fail?

You might ask what I mean by "apparently thread-safe." The easiest definition might include that its implementation contains no data races, but that's too strong. After all, java.lang.String contains an intentional data race (in hashCode()), but it's considered thread-safe.

Let me define "apparently thread-safe" to mean that the collection itself appears to work in a thread-safe manner from the outside. For instance, if the get method is supposed to block until an element is put in, then you should expect that (a) it will return reasonably soon after an element is put into the collection (i.e., almost instantly) and (b) the result will never be null (let's assume the put method doesn't accept nulls).

Here's what I came up with:

public class Holder<E> {
  private E unsafe;
  private volatile E safe;

  public void put(E element) {
    if (element == null) throw new NullPointerException();
    this.unsafe = element;
    this.safe = element;
  }

  public E get() {
    E result = unsafe;
    // We weren't lucky enough to get the unsafe reference
    while (result == null) {
      result = safe;
    }
    return result;
  }
}

So, my first step is to demonstrate that it's apparently thread-safe (as defined above). To do that, I have to demonstrate that get() will return a non-null reference soon after an element is put in.

The get() method starts by grabbing the value of unsafe and assigning it to result. At this point, result may or may not be null. If it is, the code reads safe in a busy-wait loop until it's not. The nature of volatile fields tells us that the JVM will update this state reasonably soon after the write to safe in put. Once result is non-null, we never assign it to anything else. Actions within a thread always appear sequential, and since result is thread-local, we're guaranteed that it won't flip back to null.

(Side note: the JVM isn't actually obligated by the JLS to update the state any time soon! It could decide to park the thread for a year, for instance, or order the read of safe after a year's worth of actions in other threads. But in practice, people expect volatile writes to become visible essentially instantly, and if they weren't, that would be a serious performance bug in the JVM. It's similar to how the spec doesn't say how quickly i++ has to happen, but there'd be hell to pay if it weren't nearly instantaneous.)

Okay, so get() returns a non-null value almost-instantly after put. Now I need to demonstrate that this publication is unsafe. This is actually pretty easy! Imagine that result = unsafe assigns a non-null value in that first line of get. If that happens, the body of the while loop will never be executed, meaning that thread B will never read the volatile variable safe. This means there is no happens-before relationship between any of thread B's actions and any of thread A's!

Essentially, thread B got the reference that A unsafely published, but it did so purely out of the kindness of the compiler and the CPU's cache. That kindness doesn't come with any other guarantees, such as that the memory referenced by that object has also been flushed. And so, the compiler/cache/whatever is perfectly free to reorder the assignment fooIn.val = 5 to happen after thread B reads fooOut.val, which would then have its default value of 0.

I wonder how far this could go. I suspect that it can't go very far: even something as simple as a singly-linked list would get into trouble, as actions like removing a node would be fraught with data races that would be hard or impossible to detect and compensate for in the code (as we did by resorting to the spin on result = safe above).