I know, that sounds weird; let me explain. The thread-safe collections in the JDK all have documentation to the effect of "actions in a thread prior to placing an object into a BlockingQueue happen-before actions subsequent to the access or removal of that element from the BlockingQueue in another thread."
My question is: is it possible to write a collection such that the collection itself is thread-safe, but it doesn't provide that sort of guarantee? I'm using "collection" colloquially as "some class that provides put-like and get-like operations."
That is, given:
class Foo { int val; } // Thread A: Foo fooIn = new Foo(); fooIn.val = 5; weirdCollection.add(fooIn); // Thread B: Foo fooOut = weirdCollection.get(); assert fooOut.val == 5;
... is it possible to create a weirdCollection such that it is apparently thread-safe, yet the assertion would fail?
You might ask what I mean by "apparently thread-safe." The easiest definition might include that its implementation contains no data races, but that's too strong. After all,
java.lang.String
contains an intentional data race (in hashCode()
), but it's considered thread-safe.Let me define "apparently thread-safe" to mean that the collection itself appears to work in a thread-safe manner from the outside. For instance, if the get method is supposed to block until an element is put in, then you should expect that (a) it will return reasonably soon after an element is put into the collection (i.e., almost instantly) and (b) the result will never be
null
(let's assume the put method doesn't accept null
s).Here's what I came up with:
public class Holder<E> { private E unsafe; private volatile E safe; public void put(E element) { if (element == null) throw new NullPointerException(); this.unsafe = element; this.safe = element; } public E get() { E result = unsafe; // We weren't lucky enough to get the unsafe reference while (result == null) { result = safe; } return result; } }
So, my first step is to demonstrate that it's apparently thread-safe (as defined above). To do that, I have to demonstrate that
get()
will return a non-null
reference soon after an element.The
get()
method starts by grabbing the value of unsafe
and assign it to result
. At this point, result
may or may not be null
. If it is, the code reads safe
in a busy-wait loop until it's not. The nature of volatile
fields tells us that the JVM will update this state reasonably soon after the write to safe
in put
. Once result
is null
, we never assign it to anything else. Actions within a thread always appear sequential, and since result
is thread-local, we're guaranteed that it won't flip back to a non-null
state.(Side note: the JVM isn't actually obligated by the JLS to update the state any time soon! It could decide to park the thread for a year, for instance, or order the read to
safe
after a year's worth of actions in other threads. But in practice, people expect volatile
fields to get published essentially instantly, and for them to not be would be a serious performance bug in the JVM. It's similar to how the spec doesn't say how quickly i++
has to happen, but there'd be hell to pay if it weren't nearly instantaneous.)Okay, so
get()
returns a non-null
value almost-instantly after put
. Now I need to demonstrate that this publication is unsafe. This is actually pretty easy! Imagine that result = unsafe
assigns a non-null
value in that first line of get
. If that happens, the body of the while
loop will never be executed, meaning that thread B will never read the volatile
variable safe
. This means there is no happens-before relationship between any of thread B's actions and any of thread A's!Essentially, thread B got the reference that A unsafely published, but it did so purely out of the kindness of the compiler and CPU's cache. That kindness doesn't come with any other guarantees, such as that the memory referenced by that object has also been flushed. And so, the compiler/cache/whatever is perfectly free to reorder the assignment
foo.val = 5
to happen after thread B reads foo.val
, which would then have its default value of 0.I wonder how far this could go. I suspect that it can't go very far: even something as simple as a singly-linked list would get into trouble, as actions like removing a node would be fraught in data races that would be hard or impossible to detect and compensate for in the code (as we did by resorting to the spin on
result = safe
above).