Saturday, March 18, 2017

Why Your C++ Should Be Simple

I have a lot of strong (meaning "perhaps stupid") opinions about C++ and how to use it (or not use it), and I've been meaning to write posts on the thinking behind these stupid^H^H^H^H^H^Hstrong opinions, particularly where they go against the "conventional wisdom" of the modern C++ community.

This one is sort of an over-arching goal:
Write C++ that is less sophisticated than you are capable of writing.
This idea is a rip off^H^H^H^H^H^Hriff on a Brian Kernighan quote that I think is perhaps the best advice about  programming you'll ever hear:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it? 
-- Brian Kernighan, the Elements of Programming Style
First, that's not wrong, that's dead on. So for debugging alone, don't write the most complicated C++ you can. C++ is an extremely complex language in its spec and if you use it to its full capability, you can write something that is completely un-debuggable.

Limited Brain Bandwidth

Here's my take on the idea. You're welcome to dismiss this as the mad ramblings of a father who is over 40 and literally cannot remember to finish steeping my tea because it's been an entire four minutes since I left it in the kitchen.

There's a limit to your mental bandwidth. That mental concentration, focus, short term memory, and whatever else the squishy gray GPU between our ears does can be spent on a few different things:

  • Figuring out how to write code.
  • Figuring out how to make code faster.
  • Figuring out how to solve non-trivial algorthm problems.
  • Figuring out how to solve business problems and design the architecture of a large production system.
  • Finding and fixing bugs.
  • Multi-tasking.
  • Looking at cats on the internet.
  • Swearing at your customers.
  • Trying to maintain focus while your 2 year old repeatedly smashes your keyboard with your mouse.
  • Figuring out the fastest path through a series of coding tasks.
I'm sure there's more, but you see where I'm going with this. You have a lot of things you can use your brain for, and some are more useful for getting your job done than others. You have to decide where to spend your brain budget, and you can't spend it on everything. (If you feel like you're not tight on brain budget, it means you're not working on a problem that's hard enough yet! You can always make the problem harder by shortening the time frame. Shipping things sooner is pretty much always a win.)

My view is that "being clever with C++" isn't a fantastic investment. Using the most bleeding edge C++ doesn't have enough benefit in other areas (code is faster, code is more maintainable, code is easier to debug, code is easier to understand) to justify burning brain power on it. In some cases, it actually moves you in the wrong direction.

Just because C++ lets you do it doesn't mean you have to or that it's even a good idea. Here are some examples where I consider a C++ feature to add complexity and not provide returns:

  • Clever operator overloading. Your coworkers will thank you if you give your functions good names and don't make operator+ turn on the coffee maker. You could skip operator overloading entirely and your code will be fine. (I'm okay with a small amount of overloading for a few obvious cases: dereference on smart pointers and math on math types).
  • Template meta-programming. I make myself watch talks on this from cppcon and what I see is the smartest C++ programmers in the world spending huge amounts of brain power to accomplish really trivial tasks (writing a program to make a program) in a way that is almost completely impossible to understand. You don't have to use the compiler to do your meta-program generation.
  • Heavy Duty Templating. This is a matter of degree - simple use of templates is fantastic and I use them in my code all the time. But at some point, as you add layers, you go past an inflection point and the cure starts to hurt more than the disease. This one is a little bit like smut: I don't have a great definition for when you've jumped the shark with your templates, but I know it when I see it. A good rule of thumb: if templates are making it harder to debug, don't add more. (Debuggers are easily good enough to debug the STL, generic algorithms, don't have to drop that stuff. It's not 1998 anymore!)
  • Overly Complicated Class Hierarchies. This is also a matter of degree, but at some point, it's okay for your class hierarchy to be less pure, clean and perfect if it's simpler and creates less chaos in the rest of the program. For example, our UI framework has one view type - parents, children, leaf nodes, roots, none of these are specialized by type. I've used a lot of other frameworks, and my finding is that clever "factoring out" of aspects of a view hierarchy does nothing to make the code better, but it creates issues you have to work around. Just keep it simple.
I could go on, but you get the idea...I'm one of those old bastards who thinks it's okay to just use C++ as "a nice version of C". Because it is! In my time here in the matrix, I have found that I have written past code that was too complex, language wise, and too simple. Here's the difference:

  • For the code that was too simple (I wrote a C struct when I needed a class), the upgrade is simple, easy to write, easy to understand, doesn't produce a lot of bugs, and the compiler helps me.
  • For code that was too complex (I wrote something worthy of Boost when  I should have made POD), ripping out that complexity is time consuming and goes slowly.
So why not err on the side of simplicity? Here are some other things you could do with your brain instead of writing C++ for the C++ elite:

  • Watch John Oliver on Youtube.
  • Put debugging facilities into your sub-system for more efficient debugging.
  • Put performance analysis data collection into your sub-system so you can tune it.
  • Document your sub-system so your coworkers don't poison your coffee.
  • Get the code to high quality faster and go on to the next thing!
If you are a professional software engineer, you are paid to solve people's business problems, and you use code to do that. Every Gang of Four pattern you use that doesn't do that is wasted brain cells.

Friday, February 17, 2017

Vroom, Vroom! I'm Going To Win This Race Condition!

Here's a riddle:
Q: Santa Clause, the Easter Bunny, a Benign Race Condition and Bjarne Stroustrup each stand 100 feet away from a $100 bill. They all run to the bill at the same time. Who gets the bill?

A: Bjarne Stroustrup, because the other three don't exist.
Okay, now that we have established that (1) I am a nerd and (2) race conditions are bad, we can talk about thread sanitizer (tsan), or as I call it, "the tool that reminds you that you're not nearly as clever as you thought you were when you wrote that multi-threaded widget reduplicator."

Address Sanitizer (asan) is fantastic because it catches memory scribbles and other Really Bad Behavior™ and it's fast - fast enough that you run it in real time on your shipping product.  tsan is fantastic because it finds things that we just didn't have tools to find before.

tsan's design is quite similar to asan: a block of shadow memory records thread access to a real location, and an app's memory operations are in-place modified to work with shadow memory.

Let's See Some Dumb Code

Here is an example of tsan in action:

This is a back-trace of the first race condition I hit when starting X-Plane: our main thread is reading preferences and has just inited the texture resolution setting in memory.

As it turns out, background threads are already loading textures. Doh! I had no idea this race condition existed.

The Yeti of Race Conditions

A benign race condition may not be real, but we can still define what properties it would have if it did exist. It would have to be a case where:

  1. All actual racing reads-writes are relaxed atomics for this particular architecture and data alignment and 
  2. We can live with the resulting operation of the code in all possible orderings the code actually runs at and
  3. Clang doesn't reorganize our code into something really dangerous, even though it has permission to do so.
All of these are sort of scary, and I think I just heard Herb Sutter vomit into his shoe, but x86 programmers often get item one for free, and we might be able to reason our way to item two.

Item three is a tall order and getting harder every day. While Clang is not yet Skynet, it is still a lot smarter than I am. Five years ago a C++ programmer could know enough about compilers and code to reason about the behavior of optimized code under undefined conditions and possibly live to tell about it; at this point the possible code changes are just too astonishing. I have begrudgingly admitted that I have to follow all of the rules or Clang and LLVM will hose me in ways that it would take weeks to even understand.

With that in mind, some of the first few race conditions I hit were, perhaps, benign:
  • In the case above, some texture loads have a flag to indicate "always max res, don't look at prefs" - the code was written to set a local variable to the preferences-based res first, then overwrite it; restructuring this code to not touch the preferences data if it was not going to be used silenced the "benign" race condition.
  • We had some unsynchronized stats counter - this is just wrong, and their data was junk, but this was a "don't care" - the stats counters weren't needed and didn't affect program operation. They could have been turned into relaxed atomics, but instead I just removed them.

So That's What That Was

After cleaning up texture load and some stats counters, I finally hit my first real bug, and it was a big one. The background tasks that build airports sometimes have to add shader-layers (groups of meshes drawn in the same part of the rendering pass, sharing a common shader)  to the main scenery data structure; they bounce between worker thread land (where they build meshes slowly from raw data) and the main thread (where they add the shader layers into our global data structures between frames and thus don't crash us).

That "bouncing" idiom is simple, it works great, we use it everywhere. Except... turns out that all of the airport extruding tasks (and there are a bunch of them, 64 per scenery tile divided spatially) share a single vector of layers.

And while one task is not doing async work while the layer list is being mutated on the same thread, another one might be! (The "bounce between threads" idiom only works if the thing bouncing around has exclusive access to its data structures.)

Every now and then, the vector gets resized and the memory recycled fast enough that the stale use of the vector blows up on the worker thread. This was visible in our automatic crash reporting as the airport extruder just randomly exploding on a worker thread for no reason.

Sure enough, tsan caught this, pointing to the vector resize on the main thread and the mutation on the worker thread. Because tsan catches races when they are unsynchronized and not just when that failure to synchronize results in Really Bad Data™, this race condition hit every time I booted the product, rather than every now and then. (While we have crash reports to know this happens in field, I never saw this race condition on my own machine, ever.)

What Tsan Can Find

While tsan is about 1000x better than what we had before (E.g. staring at your code and going "is this stupid?"), it does have some limitations. Because the problem space is harder than asan, the tracking in tsan isn't perfect; scenarios with more than four threads can lead to evictions of tracking data, history can be purged, etc. I saw a number of issues in my first session with the tool, but having reviewed the docs and presentations again, I think a lot of these things can be worked around.

Because tsan looks for race conditions between synchronization points, it is great for finding race conditions without depending on the actual order of execution. That's huge. But there are still cases where your code's execution order will hide race conditions (and not just by not running the buggy code).

Our weather engine uses a double-buffering scheme to mutate the weather while background threads are producing particle systems for the clouds. The main thread, per frame, adjusts weather, then the buffer is switched.

As it turns out, this idiom is totally wrong, because we don't guarantee that worker threads finish their work and synchronize with the main thread in the same frame. In other words, the double buffer might swap twice while a worker runs, leaving the worker reading from the mutating copy of the weather.

The problem is: if the weather system is pretty fast (and we want it to be!!) and the particle system tasks finish within a frame, they will then synchronize with the main thread (to "check in" their finished work) before the main thread swaps the thread buffers.  Since everything on a single thread is "happens before" by program order (at least, if they cross sequence points in C++, and we do here), tsan goes "yep - the thread finished, checked in, and we're all good." What it can't realize is that had the timing been different, the synchronization path would have run after the racy code.

The result was that I had to let tsan run for almost 30 minutes before it caught a race condition so totally brain damaged that it should have been immediate.

Even if a clean short run with tsan doesn't guarantee your threading code is perfect, it's still a fantastic tool that moves a series of bugs from "nearly impossible to find" to "possible to find easily a lot of the time."

Monday, November 07, 2016

Terminal Voodoo for Code Signing

If you develop apps for the Mac or IOS, there may come a day when code signing fails mysteriously. When this day happens, mix in a boiling cauldron 2 tbsp Eye-of-Newt*, 1 lb Toe of Frog, a healthy pinch of Wool of Bat, and honestly you can skip the dog's tongue - have you seen the places that thing has licked?

Should this incantation neither fix your code signing problems nor make you King of Scotland, these terminal incantations are a good plan B.

(In all three cases, app_path can be a path to your bundle, not the executable deep inside - this way signing of the entire bundle is verified.)

codesign -vvv app_path

This tells you whether the app is signed, and it iterates sub-components.

spctl -vvv -a -t exec -vvv app_path

This tells you whether GateKeeper is going to run your app, and if not, why not. For example, if you get an error that an app is damaged and should be thrown in the trash, this command tells you something more specific, like that your certificate is revoked.

pkgutil --check-signature app_path

This command outputs the fingerprints of the signatures used in the signing, as well as some thumbs up/down on the certificate chain. This is useful when an app's certificate is bad and you are trying to figure out which certificate the app went out the door with.

* The former Speaker of the House may be unamused by this, but sacrifices must be made!

Tuesday, September 20, 2016

Finding Slow Frames With Instruments 8.0

There are two kinds of programmers: those that download a brand new version of X-Code on the very first day, and those that laugh at them when being an early adopter doesn't go well for them.

I am definitively in that second category; my view is that most software has gotten "good enough", so when a new version comes out, you pay a lot of disruption costs for not a lot of benefits. It's been years since a major OS update has shipped a feature I actually care about.

Yet the last two versions of X-Code actually have delivered new features compelling enough to make a tool-chain upgrade worth it. X-Code 7 delivered Address Sanitizer, which was a huge step forward in catching memory scribbles.

X-Code 8 has Thread Sanitizer, which I will blog about shortly, but this post is show and tell from the new version of Instruments, which finally has a clean way to annotate your timeline.*

Finding Dropped Frames

Adaptive Sampling Profilers are spectacular for finding performance problems, because by the definition of their sampling, they are better at finding the code that's taking the most CPU time. If you want to focus your effort on performance problems that really matter, an adaptive sampling profile will tell you where to aim.

There's one case though where an adaptive sampling profiler falls on its face: if you need to find a single "slow frame" in a sea of fast frames.

Imagine we have 300 consecutive frames, each taking about 15 ms. In other words, our game is running solidly at 60 fps.

Then we hit a "slow" frame - some part of the frame takes 60 ms (!), making that frame 4x as slow as the other ones. This is a visible "hitch" to the user and a bug.

But here's the problem: we have 4500 ms of good frames and 60 ms of bad frames. The bad frame represents 1.3% of the sample data. Even if the bad frame is made up of one single 45 ms function call that pushed us past our budget, that function is < 1% of the sample data and is going to be buried in the noise of non-critical every-frame functions (like the 1% we spent on the overlay UI).

Mark My Words

What we need is some kind of marker, showing us where in the timeline the bad frame occurred. Instruments 8 adds this with the "points of interest" track, which is basically user-data where you can mark what the app was doing.

Points of interest is part of the system trace; the system trace is a fairly heavy instrument, so unfortunately you have to use it in Windowed mode - that is, tell it to record the last N seconds, then bail out within N seconds of something happening. Fortunately on my laptop I can run with a 15 second window and the high frequency timed profiler and the machine is okay (barely). I would describe this performance as "usable" and not "amazing" - but then the Instruments team probably has access to some performance tools to make Instruments faster.

The newest Apple OSes have a framework to put debug markers into your app, but fortunately if you need to run on older tech, Apple has documented the secret spell to get a marker in legacy OS:

#if APL

syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,0) | DBG_FUNC_NONE,0,0,0,0);
That puts a single marker into the timeline, and it is what I used to see my frame boundaries.

Here's a zoom-in of my timeline - the S in a circle are the frame boundaries - the short spans are solid fps and the long spans are the drops.
The middle blue track is VM activity - as you can see, the VM manager goes haywire every time we glitch.

I have an SSD and all I was doing was circling around a parked plane, so this was really weird. I took a look at the back-traces around the VM activity and the GLSL shader compiler was being run.

You Know That's Really Stupid, Right?

One of the purposes of this blog is to act as a confessional for dumb things I have coded; I try to bury these at the bottom of the post so no one sees them.

In this case, X-Plane has an inherited legacy that haunts us: on-the-fly shader compile. I will write another blog post about OpenGL and Vulkan and why this will never really be fixed until we go to a modern API that promises to not compile on the fly behind our backs, but for now the thing to note is that as you fly, X-Plane sometimes discovers that it hasn't compiled the terrain shader with the right options for the particular scenario it hits, which is a function of time of day and weather, the particular scenery, the rendering pass, and user rendering settings.

Here we can do even better with the points of interest: we can mark regions of interest like this:

syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,1) | DBG_FUNC_START,0,0,0,0);
/* do really slow stuff */
syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,1) | DBG_FUNC_END,0,0,0,0);
In the picture above, you can see the red bars at the very top - these mark the in-frame shader compiles, and they exactly fit the 'extra space' in the long frames. Problem child found!

What Was That?!?!

I did hit one dropped frame that did not have a shader compile in the middle - the sim went totally crazy with VM activity, and yet the time profile showed nothing unusual. It didn't make any sense.

So I switched to the "core" view, where the system trace shows you what each CPU is doing. I have an i7, so we can see 8 virtual cores. New to Instruments 8, the orange sections are where there are more runnable threads at reasonably high priority than there are cores - in other words, performance is degraded by running out of core count.
The play-head annotates what process is running over any given section, and the amazing thing about this trace is that as you scrub over all of those tiny blocks, it's almost never the same process twice. What happened?

Chris managed to psychically debug this: X-Plane dropped frames when I hit command-tab and switched out of X-Plane and back to Instruments to stop recording. Apparently every service in the universe on my machine is programmed to briefly wake up and reconsider its life choices when the foreground application changes.

The good news is that this kind of crazy activity was not at all present in the regular operation of the sim, and "sim drops frames when wildly changing between other apps" is not a bug I have to fix.

* Timeline markers in a profiler are not revolutionary at all. VTune has had this for at least longer than I have been using VTune. But even if a tool-chain feature is not new to the universe, it is still very useful when it is new to a platform if you are using that platform. VTune's ability to spot long frames two years ago was of no help to our iPhone app!

Thursday, September 08, 2016

You Can Never Have a Custom Z Buffer Distribution and Early Z at the Same Time

The distribution of precision in the default Z buffer configuration is not ideal. Precision is distributed via a 1 / x curve. This means more precision close to the camera, and less precision far away from it (which is good) but the curve goes asymptotic near the front, so it uses up too much precision. If you've had to try to tune your near clip plane, you know that bringing that near clip plane in just a little bit makes things much worse in the distance.

In order to get correct perspective rendering via a view frustum, our projection matrix has to put our eye-space Z coordinate into our clip space W coordinate. But we don't get to pick and choose - perspective divide is part of fixed function hardware, and all of X, Y and Z in clip space are divided by the same W. So if we are dividing by our eye-space Z for our screen location, we have to do so for depth as well. This is what distributes our depth precision.

We Have Shaders - We Can Do Better

Now that we have shaders, we have the power to mess with this scheme. Let's say we want our Z buffer to contain an eye-space linear distribution. No problem! At the end of our vertex shader, we just load whatever Z we want, then multiply by our clip-space W to pre-nullify the perspective divide. Perfect!

You can use this to have a logarithmic Z buffer, or you can use an eye-space Z buffer in floating point and in both cases you get really great distribution of depth precision. You can have a really, really close near clip plane and still have a huge depth range.

The Problem With Not Being Linear

All of these custom Z distributions have one thing in common: the clip space Z values distributed across a triangle are non-linear. That is, if we find a point half-way along the edge of a triangle, the Z value that our formula produces will not be the same as the Z value the hardware interpolator produces based on the two triangle corners.

When are you non-linear? Pretty much any time you do anything to your Z value that can't be described as Ax + By + Cz + D.  So for example, for eye-space linear, if we need to multiply by eye-space Z again, we're not linear.

The problem with not being linear is that the Z buffers written by the middle of a triangle may be very different from where it would have been if the triangle was tessellated. If you have a large-ish terrain triangle with a small decal like a rock on top of it, the middle of the large triangle might interpolate to be too close (or too far), causing Z errors.  If you have a mesh with layers that requires hidden surface removal and it isn't extremely uniform in vertex density, again, interior interpolation error can lead to Z artifacts. Any polygon that intersects the clip planes is also going to go somewhat haywire, as the clipping changes Z mid-triangle while the camera moves.

Let's Try The Fragment Shader

The above problems are almost certainly a show-stopper for any engine that shows artist-made 3-d meshes. The solution is to move the custom Z calculation to the fragment shader, so it can be done per pixel. This works great from a correctness standpoint - every pixel results in the right Z according to your encoding, but it has a nasty side effect: you lose early-Z.

Early Z is an optimization where your GPU runs the depth-stencil test before the fragment shader, instead of after. This is a huge win for any scene with significant hidden surfaces; fragment shading costs for hidden surfaces are completely removed, because the hardware doesn't even dispatch the fragments for shading. Without Early Z, your fragment shader runs, fetching textures, burning bandwidth, consuming ALU, and then throws the result out at the very end when it turns out that, oh noes, you've been drawing a secret base that's behind a closed door.

In order for early-Z to be effective, the Z coordinate out of the rasterizer has to be correct - that is, the fragment shader can't modify it. So any scheme that encodes a non-linear depth buffer in fragment shader defeats this.

The really bad news is: you don't just pay once. Now that your Z buffer is encoded with a special encoding, anything that is Z tested (even if it isn't Z-writing) has to calculate Z in the fragment shader as well. So for example, particle systems, clouds, and other fill-rate-intensive effects become more expensive.

In summary: the need to move a custom Z function to the fragment shader is caused by the very thing that gives it better depth distribution, and this defeats early Z. So you can never have early Z and a super depth buffer distribution at the same time.

There is one work-around I have seen that gets around this: use floating point depth and reverse the near and far encoding values; since floating point has more precision at 0.0, you are using the higher precision of float where we need it (where 1/Z is imprecise).

Split Depth for Now

X-Plane, like many games, uses two separate Z environments to cope with a very close view (inside the cockpit, where millimeters matter) and the world (where you can be millions of meters from the horizon). This isn't ideal - the cost of separating the Z passes isn't zero.

I wrote this post up after having worked out the math behind custom Z encodings (log Z, floating point eye-space, etc.); I had a todo item to investigate whether we could move X-Plane to a single Z pass and save some overhead.

The answer is, unfortunately, no for now. Losing early Z on our particle and cloud effects is a non-starter; we'd need to use a floating point depth buffer. Unfortunately, ARB_clip_control isn't wide-spread enough for us to count on. We'd also eat bandwidth in moving from a D24_S8 integer depth buffer to a D32F_S8 depth buffer (which pads out to 64 bits per depth sample.

One last note: X-Plane uses a floating point channel of the depth buffer to write floating point depth in eye space. While we split the actual Z buffer and draw in two passes, we build a single unified G-Buffer; our eye-space linear floating point depth has enough range to span both passes, and lets us then shade our G-Buffer in a single pass.

(When we run our two depth passes, we change only the near and far clip planes and nothing else; this means we can reuse our shadow maps and share our G-Buffer.  This takes most of the cost out of our two passes.)

Tuesday, July 19, 2016

Is Anyone Generating PMREMs In-Game in Real-Time?

One of the standard solutions that has emerged for physically based rendering (PBR) is to use pre-filtered mipmapped radiance environment maps (PMREMs).

Put in very imprecise terms, a PMREM is a cube map that has a perfect reflection of the environment in the lowest (largest) mip and successively blurrier images in the lower mips. Typically the image is convolved with your material's distribution function at various roughnesses, and a cosine lobe is used at the lowest resolution.

From this, specular lighting is sampled from the LOD level that matches material roughness, and a low LOD is used for diffuse lighting.

What I haven't seen is any papers on games doing this entirely in-engine.

There's plenty written on baking these cubemaps ahead of time, and it appears that quite a few engines are augmenting PMREMs with screen space reflections (with various heuristics to cope with materials being rough and all of the weird things SSR can have).

But the only work I've seen on real-time PMREMs is the old GPU Gems 2 chapter that projects the original cube map into spherical harmonics (and then back into a cube map) as a way of getting a reasonable low frequency or diffuse cube map. But this was written back when a clean reflection and a diffuse map were good enough; it doesn't handle rough specular reflections at all.*

The problem that X-Plane faces in adopting modern game-engine graphics is that we can't bake. Our "level" is the entire planet, and it is built out of user-installed scenery packages that can be mixed and matched in real-time.  This includes adding a mix of surface objects onto a mesh from another pack. Baking is out the question because the final assembly of the level only exists when the user starts the app.

So I have been experimenting with both SH-based convolution of cube maps and simply importance sampling a distribution function on the GPU. It appears we're reaching the point where both of these can potentially function in least, if you have a big GPU, a relatively low-res cube map (e.g. 256x256, not 1024x1024) and only one cube map.**

My question is: is anyone else already doing this? Is this a thing?

* You can add a lot more spherical harmonic coefficients, but it doesn't scale well in image quality; the amazing thing about SH are that the artifacts from having a very small number of bands are, perhaps by luck, very acceptable for low frequency lighting. The problem is that, as coefficients are added in, things get worse. The original image isn't reconstructed well (for the number of bands we can hope to use on a GPU) and the artifacts become significantly less desirable.

** To be clear: importance sampling is only going to work for a very, very small number of samples. I believe that for "tight" distributions it should be possible to find filter kernels that are equivalent to the importance-sampled result that can run in realtime.  For very wide distributions, this is out of the question, but in that case, SH convolution might provide a reasonable proxy.  What I don't know yet is what goes "in the middle".  My guess is: some kind of incorrect and hacky but tolerable blend of the two.

Friday, July 08, 2016

Worst. Lock. Ever

Here's some really good code:

    StLock take_lock(&global_asset_lock);
    delete this;

This was the code to release an art asset in X-Plane 10.45r2.

Can you see what's wrong with it? Don't over think it; it's not a relaxed vs sequential atomics issue or a bug in the RAII code. The logic is roughly this:

  • Decrement my reference count.
  • If I was the last one (and my count is now zero):
  • Lock the global table of art assets.
  • While the table is locked, clear out my entry.
  • Delete myself.
We should get "more optimal" and drop the lock before we delete ourselves, but that's not the issue either.

The issue is a data race!  The race exists between the completion of the atomic decrement and acquiring the table lock.  In this space another thread can come along and:

  1. Try to load this same art asset; it will acquire the table lock, look up our art asset, find it, increase the reference count back to 1.
  2. It will then drop the lock, leaving us exactly how we were except our reference count is now 0.
  3. When we proceed to nuke the art asset we will leave that other client with a bad pointer and crash.
I found this because I actually caught it in the debugger -- what are the odds?

Being too clever for my own good, I thought "let's be clever and just re-check the reference count after we take the global lock; in the rare load-during-delete we can then abort the load, and in most cases releasing our reference count stays fast.

That fixes the above race condition but doesn't fix this other race condition: in the space between the decrement and the lock:

  1. Another thread takes the global lock, finds our art asset, and increases its reference count, drops the table lock.
  2. That thread then releases its reference, which hits zero, takes the lock, deletes the art asset and the table entry and releases the lock.
Now we are the ones with a bad pointer, and we crash re-deleting the art asset.

So I've come to the conclusion that the only safe thing to do is to take the table lock first before doing any decrement. In the event that we hit zero reference count, since we did so while the table is locked, no one could have found our art asset by the table to increase the count (and clearly no one else already had it if we went from ref == 1 to ref == 0).  So now we know it's safe to delete.

This isn't great for performance; it means we take a global lock (per class of art asset) on any reference count decrement, but I think we can survive this; we have already moved not only most loading but most unloading to asynchronous worker threads, so we're eating the lock in a place where we can afford to take our time.

A More Asynchronous Design

I can imagine a more asynchronous design:
  1. The table of art assets itself holds one reference count.
  2. When we decrement the reference count to one, we queue up an asynchronous table "garbage collection" to run on a worker thread.
  3. The garbage collection takes the table lock once to find and clean out unused art assets, blocking loads for a small amount of time while the table is inspected.
Here the reference count decrement doesn't need to do any extra work unless it is "probably" finished (e.g. gets down to 1 reference count remaining) so transient reference count chagnes (e.g. we went from 3 to 4 back to 3) remain as fast as we can perform an atomic operation with some acq/rel semantics.

I have not had a chance to profile the simple solution since trying to make it correct; if it turns out to be a perf problem I'll post a follow-up. The only time we've seen a perf problem is when the table lock was required for all operations (this was a design that we fixed almost a decade ago -- a design that was okay when it was single threaded and thus lock free).

The rendering path in X-Plane requires no reference count changes at all, so it is unaffected by anything on this path.