Friday, September 04, 2009

Grand Central Dispatch

Apple has this new API in Snow Leopard called "Grand Central Dispatch". Basically it's a bunch of message queue facilities - the idea is that with concurrent message queues it will be easier for people to write scalable code on multicore machines; if the OS libs use the same API and queues, then there won't be thrash between library worker threads and app worker threads. (OS X does create a fair number of worker threads on your behalf - you can see them in the debugger in 10.5.)
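For flavor, here is roughly what the API looks like - a minimal sketch of my own, not Apple sample code (the printf work is made up; dispatch_get_global_queue, dispatch_async, and the dispatch_semaphore_* calls are the real 10.6 API):

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void)
    {
        /* One of the library-managed concurrent queues.  GCD owns the
           worker threads behind it, so app code and the OS libraries
           draw from the same pool instead of fighting over cores. */
        dispatch_queue_t q = dispatch_get_global_queue(
            DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Used here only so main() doesn't exit before the work runs. */
        dispatch_semaphore_t done = dispatch_semaphore_create(0);

        dispatch_async(q, ^{
            printf("hello from a GCD worker thread\n");
            dispatch_semaphore_signal(done);
        });

        dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
        dispatch_release(done);
        return 0;
    }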

So first of all: told you so.

Now that I got that out of my system: it was interesting to read Apple's documentation on how queues (their name for message queues whose messages are "do this") solve threading problems. The docs spend a lot of time describing how much you won't have to lock, but I don't think this is entirely true. The rule is pretty simple:
Resources that are owned entirely and exclusively by the "task" that has been dispatched to the queue do not need to be locked, because a task cannot be run in two places at once.
And that alone is a pretty nice improvement! (If I had a nickel for every forum thread I read where people suggest two threads and a mutex as a way to do concurrent processing. The over-use of a mutex as something it is not is probably its own blog topic.) At least if we can separate the data object being processed from the rest of the world, the processing becomes lock-free.
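To make the rule concrete, here is a minimal sketch of the idea - the work_item struct and the processing are hypothetical, but the pattern (hand the queue a task that is the sole owner of its data) is exactly what the rule describes:

    #include <dispatch/dispatch.h>
    #include <stdlib.h>

    /* Hypothetical work item.  After submit() hands it to the queue,
       nothing else holds a pointer to it. */
    typedef struct { float samples[1024]; } work_item;

    static void submit(dispatch_queue_t q, work_item * w)
    {
        dispatch_async(q, ^{
            /* No mutex: this task is the sole owner of 'w', and a
               task cannot run in two places at once. */
            for (int i = 0; i < 1024; ++i)
                w->samples[i] *= 0.5f;
            free(w);        /* ownership ends with the task */
        });
    }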

But what do we do if that processing task needs to talk to the rest of our application? There are basically only a few options:
  1. Make the APIs that we have to call non-blocking and lock-free.
  2. Put fine-grained locking into the various APIs (e.g. use internal locks as needed). Now there is a risk that our tasks can block, but perhaps it won't be too bad. Use a performance analyzer to see if this is hosing us.
  3. Put some kind of coarse-grained lock on the entity the task will need - perhaps for the entire execution of the task, perhaps for longer. At this point you have to wonder if you're losing concurrency.
X-Plane uses a mix of all three strategies. Basically, any API that is called from the main thread (which runs the rendering loop) gets optimized using technique 1 for minimum latency; the rest of the APIs can use technique 2 or 3, whichever is easier.

An example of strategy 1: by using atomics to swap textures into their container objects, we don't need to obtain a lock on a texture object to use it.
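Something like this, say - a sketch rather than X-Plane's actual code, using the GCC __sync builtins (on 10.6 the OSAtomic* functions would do the same job):

    /* Hypothetical texture slot.  The loader publishes a new texture
       with an atomic swap; the render thread does a plain read.
       Neither side takes a lock on the container. */
    typedef struct texture texture;
    typedef struct { texture * volatile tex; } tex_slot;

    /* Loader thread: swap in the freshly loaded texture, returning
       the old one so the caller can dispose of it. */
    static texture * publish_texture(tex_slot * slot, texture * fresh)
    {
        texture * old;
        do {
            old = slot->tex;
        } while (!__sync_bool_compare_and_swap(&slot->tex, old, fresh));
        return old;
    }

    /* Render thread: lock-free read - no lock on the texture object. */
    static texture * current_texture(tex_slot * slot)
    {
        return slot->tex;
    }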

Strategies 1 and 2 combined: a central registry of objects (which requires a lock) is used only by the async loader code. Once we have an object, we are lock-free. (Objects are reference counted, and reference counting can be lock-free - see also the glibc++ string implementation.)
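The ref-counting half looks something like this - again a sketch with a made-up object layout; the point is that retain and release are just atomic increments and decrements:

    #include <stdlib.h>

    /* Hypothetical ref-counted object.  retain/release never lock;
       the count is maintained entirely with atomic ops. */
    typedef struct {
        volatile int    ref_count;
        /* ... payload ... */
    } ref_obj;

    static void retain(ref_obj * o)
    {
        __sync_add_and_fetch(&o->ref_count, 1);
    }

    static void release(ref_obj * o)
    {
        /* Exactly one thread sees the count hit zero and frees. */
        if (__sync_sub_and_fetch(&o->ref_count, 1) == 0)
            free(o);
    }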

A final comment: Apple suggests replacing fine-grained resource locking (e.g. grab a lock, do a few things, release the lock) with queueing a "block" (a small piece of code) on a serial queue (a message queue serviced by only one thread). This can be done synchronously or asynchronously; there's a sketch after the list below.
  • It's refreshing to see someone abusing threading primitives by suggesting that a semaphore be used instead of a mutex - normally on the net I see people doing just the opposite (and wondering why their design has 5000 edge cases and blows up as soon as they add a third thread).
  • If the block is queued asynchronously, this is a performance win. Except for the part where you don't know if your work has actually been done. It's a great refactoring if you can get away with it, but I wouldn't expect to hit this case any time soon.
  • If the block is queued synchronously, this solution is exactly the same, concurrency-wise, as a mutex.
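Here is roughly what that refactoring looks like - a sketch with a made-up resource (a bank "balance") and queue label:

    #include <dispatch/dispatch.h>

    /* Hypothetical shared resource guarded by a serial queue instead
       of a mutex.  Only blocks queued on 'guard' ever touch 'balance',
       and a serial queue runs one block at a time. */
    static dispatch_queue_t guard;
    static int              balance;

    static void init_guard(void)
    {
        /* NULL attr = serial queue on 10.6. */
        guard = dispatch_queue_create("com.example.balance", NULL);
    }

    /* Synchronous flavor: the caller waits, just like taking a mutex. */
    static int get_balance(void)
    {
        __block int result;
        dispatch_sync(guard, ^{ result = balance; });
        return result;
    }

    /* Asynchronous flavor: the caller moves on immediately - the
       performance win, with the caveat (above) that you don't know
       when the work has actually been done. */
    static void deposit(int amount)
    {
        dispatch_async(guard, ^{ balance += amount; });
    }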
Except...I'm not 100% sure that the synchronous case really is "exactly the same" as a mutex. If I take a lock on my own thread, there is no thread switch if the lock is uncontended, and I might not even trap into the kernel (if the mutex is a critical section or another lightweight lock that can spin a few times before trapping).

By comparison, my understanding is that a message queue is still a message queue - I would normally expect the serial thread with sync queue to have to "toss" the work to another thread, then wait (because if the calling thread did the work itself, the queue's worker thread would be free to pick up later tasks out of order). But perhaps Apple has optimized this, with a fast path that executes synchronously queued tasks on the calling thread when a serial queue is idling.

3 comments:

  1. "It's refreshing to see someone abusing threading primitives by suggesting that a semaphore be used instead of a mutex - normally on the net I see people doing just the opposite (and wondering why their design has 5000 edge cases and blows up as soon as they add a third thread)."

Because I didn't know why this was a problem, I looked it up on Wikipedia.

    From Semaphores on Wikipedia:

    "Binary semaphore vs. Mutex

    A mutex is a binary semaphore, usually including extra features like ownership or priority inversion protection. The differences between mutexes and semaphores are operating system dependent. Mutexes are meant to be used for mutual exclusion only and binary semaphores are meant to be used for event notification and mutual exclusion."

  2. "I would normally expect the serial thread with sync queue to have to "toss" the work to another thread, then wait "

libdispatch itself is open source now so you can check, but dispatch_sync to a currently inactive (maybe empty) queue will borrow the current thread to do the work and will not trap into the kernel. This is faster than any of the current pthread operations.

    On a currently active queue it will avoid trapping into the kernel to get the work done (it will trap to halt the current thread until the work completes though).

  3. Chris - fair enough on the definition of semaphores and mutexes (mutices?) - mostly I was being snarky and trying to take a jab at all of the terrible mutex-based code I've seen.

As stripes points out, libdispatch is open and I took a look under the hood. My original surprise was because a safe assumption for a threading package used to be that the critical section might be optimized for the uncontended case (maybe it spins a few times) whereas the semaphore might not be.

In the case of libdispatch, queues are optimized at two levels: the underlying semaphores have fast-path user-space code, and the queue itself is lock-free and can rapidly detect the uncontended case.

    So that's fun...I would still say that Apple's suggestion of replacing critical sections with a message queue only makes sense if you are using a highly tuned message queue system like libdispatch. Try that with pthreads or on Win32 and you'll be in for some misery.
