Yuval Shavit: November 2013

Thursday, November 21, 2013

Half-thread-safe collections in Java

I thought about an "interesting" (ie, I'm probably the only one who finds it interesting) question yesterday. Is it possible to define a thread-safe collection in Java that doesn't transfer objects across threads safely?

I know, that sounds weird; let me explain. The thread-safe collections in the JDK all have documentation to the effect of "actions in a thread prior to placing an object into a BlockingQueue happen-before actions subsequent to the access or removal of that element from the BlockingQueue in another thread."

My question is: is it possible to write a collection such that the collection itself is thread-safe, but it doesn't provide that sort of guarantee? I'm using "collection" colloquially as "some class that provides put-like and get-like operations."

That is, given:

class Foo {
    int val;
}

// Thread A:
Foo fooIn = new Foo();
fooIn.val = 5;
weirdCollection.add(fooIn);

// Thread B:
Foo fooOut = weirdCollection.get();
assert fooOut.val == 5;

... is it possible to create a weirdCollection such that it is apparently thread-safe, yet the assertion would fail?

You might ask what I mean by "apparently thread-safe." The easiest definition might include that its implementation contains no data races, but that's too strong. After all, java.lang.String contains an intentional data race (in hashCode()), but it's considered thread-safe.

Let me define "apparently thread-safe" to mean that the collection itself appears to work in a thread-safe manner from the outside. For instance, if the get method is supposed to block until an element is put in, then you should expect that (a) it will return reasonably soon after an element is put into the collection (i.e., almost instantly) and (b) the result will never be null (let's assume the put method doesn't accept nulls).

Here's what I came up with:

public class Holder<E> {
  private E unsafe;
  private volatile E safe;

  public void put(E element) {
    if (element == null) throw new NullPointerException();
    this.unsafe = element;
    this.safe = element;
  }

  public E get() {
    E result = unsafe;
    // We weren't lucky enough to get the unsafe reference
    while (result == null) {
      result = safe;
    }
    return result;
  }
}

So, my first step is to demonstrate that it's apparently thread-safe (as defined above). To do that, I have to demonstrate that get() will return a non-null reference soon after an element.

The get() method starts by grabbing the value of unsafe and assign it to result. At this point, result may or may not be null. If it is, the code reads safe in a busy-wait loop until it's not. The nature of volatile fields tells us that the JVM will update this state reasonably soon after the write to safe in put. Once result is null, we never assign it to anything else. Actions within a thread always appear sequential, and since result is thread-local, we're guaranteed that it won't flip back to a non-null state.

(Side note: the JVM isn't actually obligated by the JLS to update the state any time soon! It could decide to park the thread for a year, for instance, or order the read to safe after a year's worth of actions in other threads. But in practice, people expect volatile fields to get published essentially instantly, and for them to not be would be a serious performance bug in the JVM. It's similar to how the spec doesn't say how quickly i++ has to happen, but there'd be hell to pay if it weren't nearly instantaneous.)

Okay, so get() returns a non-null value almost-instantly after put. Now I need to demonstrate that this publication is unsafe. This is actually pretty easy! Imagine that result = unsafe assigns a non-null value in that first line of get. If that happens, the body of the while loop will never be executed, meaning that thread B will never read the volatile variable safe. This means there is no happens-before relationship between any of thread B's actions and any of thread A's!

Essentially, thread B got the reference that A unsafely published, but it did so purely out of the kindness of the compiler and CPU's cache. That kindness doesn't come with any other guarantees, such as that the memory referenced by that object has also been flushed. And so, the compiler/cache/whatever is perfectly free to reorder the assignment foo.val = 5 to happen after thread B reads foo.val, which would then have its default value of 0.

I wonder how far this could go. I suspect that it can't go very far: even something as simple as a singly-linked list would get into trouble, as actions like removing a node would be fraught in data races that would be hard or impossible to detect and compensate for in the code (as we did by resorting to the spin on result = safe above).

Tuesday, November 19, 2013

I'm going to have to maybe-kill the cat

In my last post, I discussed problems with the runtime binding of composed objects: if object a has a method foo, and object b also has a method foo, then how does the composed object (a <?> b) behave? In this post, I'd like to explore a separate but related pair of problems: what's the type of a <?> b, both at run-time and compile-time?

(Before you get too far into this post, a disclaimer: this post describes a method of composition and then explains why it doesn't work. If that strikes you as meandering and pointless — if you prefer to read about ideas that might work, rather than those that definitely won't — then you might want to skip this post.)

As before, let me set this up with an example. As an added bonus, you'll get to see the latest revision of Effes' syntax, which is pretty close to complete (now I really mean it!). Since I haven't laid out how composition works, I'll use <?> as a placeholder for "some variant of composition." With that said:

type List[A] = data Node(head:A, tail:List[A]) | Nothing
  put e:A -> List[A] = TODO
  get -> (Maybe[A], List[A]) = TODO

type Box[A] = Maybe[A]:
  put e:A -> Box[A] = TODO
  get -> Maybe[A] = TODO

myList : List[Int] = TODO
myBox : Box[Int] = TODO
myContainer = myList <?> myBox

Here we have two simple types: one representing a linked list, and one representing a box that can contain zero or one items. Their details aren't important. We also have three references: one of type List[A], one of Box[A], and one representing the composition of the first two.

As I discussed in the last post, we have some options with a call like myContainer.put 1. We can take the Schrodinger approach, in which we invoke put on both components of myContainer and then compose those two results; or we can pick one as the "winning" binding and only invoke it.

Let's say we pick the single-bound, "winner" approach; that's the option I was leaning towards in the last post. To make it concrete, let's say that the left-hand side always wins out. That seems reasonable, but what about this situation:

myContainer2 = myContainer.put 123
myVal : Maybe[Int]
myVal = case myContainer2 of
  Box[Int] -> myContainer.get
  _ -> Nothing

In this snippet, we check to see if myContainer is a Box[Int]. It is, so the first pattern will match. This casts myContainer2 to Box[Int] and, within that new context, invokes get. Note that get here has to be bound to the Box version; the List[A] version has a different return type (it returns a pair containing the list's possible head and its tail).

The problem is that myContainer.put was bound to the List[A] method, meaning that the Box component of myContainer never had 123 inserted into it: the value of myVal is Nothing, not 123! This is fully consistent, but it's confusing and violates the principle of least surprise. There has to be a better way!

One possibility is to limit composition such that myContainer2 does not have a runtime type of Box[Int]: the second (catch-all) pattern matches, and the value of myVal is Nothing.

Of course, we always want the runtime type to be a subtype of the compile-time type, so this new approach means that the composed object's compile-time type can't include Box[Int]. We can't generalize that to all compositions, or else the whole idea of composition falls apart and becomes shorthand for the not-very-useful function a <?> b = a.

We can come up with a more precise limit on composition, one that doesn't just throw away the RHS altogether. One approach to compose objects in multiple phases. Here's one such algorithm, in rough terms:

Decompose the RHS object to its constituent objects.
Fold each one of those into the LHS, one at a time. As you fold each object, check to see whether it has any methods that conflict with methods on the composed object; if so, ignore that object (don't fold it in).
Check whether the resulting object has any abstract methods (that weren't given a concrete definition in the composed object). If so, remove the objects that introduced those methods.

These three phases happen separately for both the compile-time type and the runtime type. Here's an example, starting with some types and some objects:

type List[A] =
    ... (put/get, as above)
    size -> Int
type Box[A] = ... (put/get, as above)

type Container:
  isEmpty -> Bool -- abstract method

type Container Box[A]:
  @override isEmpty -> Bool = TODO

intsList : List[Int] = list(1, 2, 3)
intsBox : Box[Int] = box(4)

intsContainer1 = intsBox <?> Container

The List[A] and Box[A] types are as before, except that List[A] now also has a size method.

Let's look at that last line. The LHS is Box[Int] with methods put and get, while the RHS is Container with just one method, isEmpty. There's no overlap in methods, so the composition is straightforward and results in Box[Int] <?> Container for both the compile-time and runtime type.

Okay, so that case works fine. But what about this?

intsContainer2 : Container = intsContainer1
composed2 = intsContainer2 <?> intsList
composedSize = composed2.size

At compile time, that last line looks like Container <?> List[Int], which has no conflicts and thus doesn't remove any types. But the Container component's isEmpty method isn't implemented, so it's removed: the resulting type is List[Int]. We call its size method, which is a pretty reasonable operation to call on a list.

At runtime, the composition looks like (Box[Int] <?> Container) <?> List[Int], and there is a conflict: List[Int]'s methods collide with Box[Int], and we therefore don't fold List[Int] into the composition. That means that the resulting type is just Box[Int] <?> Container — the inverse of the compile-time type! And in particular, there's no size method on that object. Boom.

I've tried a few variants on this sort of decompose-and-recompose theme: decomposing both sides, different handling of abstract methods, etc. Inevitably, the mismatch between runtime and compile-time types always blows up in my face. I don't think there's a way around it.

Unfortunately, I think this leads me to the conclusion that if I'm going to do composition, I have to do composition all the way: Schrodinger-style. This means I'll need to figure out various issues about collapsing the wave form: how does pattern-matching work, and what happens when an object is composed with an object of the same type? For instance, how does "foo" <?> "bar" work?