Sunday, May 25, 2008

Pitfalls of Performance-Tuning OpenGL

Performance-profiling an OpenGL application is more difficult than profiling a non-OpenGL application because:
  1. OpenGL uses a pipelined architecture; the slowest stage limits performance while the other parts of the system go idle, and
  2. You don't always have good insight into what's going on inside the OpenGL pipeline.
(On the second point, tools like NVidia's PerfHUD and GLExpert can give you some of this visibility, but your regular adaptive-sampling profiler won't help you much.)

Pitfall 1: Not Being Opportunistic

The key to opportunistic performance tuning is to look at your whole application - creating a specialized "test case" to isolate performance problems can be misleading. For example, in X-Plane our breakdown of main-thread CPU use might be roughly this:
  • Flight model/physics: 10%.
  • 2-d Panel: 5%.
  • 3-d World Rendering: 85%.
This is telling us that the panel doesn't matter much - it's a drop in the bucket. Say I make the panel code twice as fast (a huge win); I get roughly a 2.5% overall performance boost - not worth much. But if I make the world rendering twice as fast, I get roughly a 74% boost.
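
(To make the arithmetic concrete, here's a quick back-of-the-envelope calculation using the percentages above - illustrative numbers, not real measurements.)

    // Back-of-the-envelope speedup math for the breakdown above. Halving a
    // component that takes fraction f of the frame leaves (1 - f/2) of the
    // frame time, so the overall boost is 1/(1 - f/2) - 1.
    #include <cstdio>

    int main()
    {
        const double panel = 0.05;   // panel: 5% of the frame
        const double world = 0.85;   // world rendering: 85% of the frame

        printf("Halve the panel code:  ~%.1f%% faster overall\n",
               (1.0 / (1.0 - panel / 2.0) - 1.0) * 100.0);
        printf("Halve world rendering: ~%.1f%% faster overall\n",
               (1.0 / (1.0 - world / 2.0) - 1.0) * 100.0);
        return 0;
    }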

The naive mistake is to stub out the physics and world rendering to "drill down" into the panel. What the profiler is saying is: don't even bother with the panel - there are bigger fish to fry.

Pitfall 2: Non-Realistic Usage and the Pipeline

The first pitfall is not OpenGL specific - any app can have that problem. But drill-down gets a lot weirder when we have a pipeline.

The key point here is: the OpenGL pipeline is as slow as the slowest stage - all other stages will go idle and wait. (Your bucket brigade is as slow as the slowest member.)

Therefore when you stub out a section of code to "focus" on another and OpenGL is in the equation, you do more than distort your optimization potential; you distort the actual problem at hand.

For example, imagine that the panel and scenery code together use up all of the GPU's command buffers (one stage of the pipeline), but pixel fill rate is not used up. When we comment out the scenery code, we use fewer command buffers, the panel runs faster, and all of a sudden the panel bottlenecks on fill rate.

We look at our profiler and go "huh - we're fill rate bound" and try to optimize fill rate in the panel through tricks like stenciling.

When we turn the scenery engine back on, our fill rate use is even lower, but since we are now bottlenecked on command buffers again, the fill rate optimization gets us nothing. We were misled because we optimized the wrong bottleneck, and we saw the wrong bottleneck because the bottleneck changed when we removed part of the real workload.
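
(For reference, the kind of stencil trick meant here might look roughly like the sketch below: mark the panel pixels that actually change, then restrict redraw to them. This is a fixed-function sketch that assumes a current GL context with a stencil buffer; draw_dynamic_panel_regions() and draw_panel() are hypothetical stand-ins for the application's own drawing code.)

    #include <GL/gl.h>   // <OpenGL/gl.h> on the Mac

    void draw_dynamic_panel_regions();   // hypothetical: just the instruments that change
    void draw_panel();                   // hypothetical: the panel drawing itself

    void draw_panel_stencil_masked()
    {
        // Masking pass: write 1 into the stencil buffer wherever the dynamic
        // instruments live, without touching the color buffer.
        glEnable(GL_STENCIL_TEST);
        glClearStencil(0);
        glClear(GL_STENCIL_BUFFER_BIT);
        glStencilFunc(GL_ALWAYS, 1, 0xFF);
        glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        draw_dynamic_panel_regions();
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

        // Redraw pass: only pixels whose stencil value is 1 get filled, so the
        // static parts of the panel cost no fill rate.
        glStencilFunc(GL_EQUAL, 1, 0xFF);
        glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
        draw_panel();
        glDisable(GL_STENCIL_TEST);
    }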

One non-obvious case of this is frame rate itself; high frame rate can cause the GPU to bottleneck on memory bandwidth and fill rate, as it spends time simply "flipping" the screen over and over again. It's as if there's an implicit full-screen quad drawn per frame; that's a lot of fill for very little vertex processing - as the ratio of work per frame to number of frames changes, that "hidden quad" starts to matter.
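
(To put a rough number on that hidden quad - the resolution and frame rates below are assumptions for illustration, not measurements:)

    // Rough bandwidth cost of just presenting frames: a 1920x1080 RGBA8 color
    // buffer written once per frame, ignoring depth, blending and overdraw.
    #include <cstdio>

    int main()
    {
        const double bytes_per_frame = 1920.0 * 1080.0 * 4.0;   // ~8 MB

        printf("At 300 fps: ~%.1f GB/s just filling the screen\n",
               bytes_per_frame * 300.0 / 1e9);
        printf("At  60 fps: ~%.1f GB/s\n",
               bytes_per_frame * 60.0 / 1e9);
        return 0;
    }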

So as a general rule, load up your program and profile graphics in a real-world scenario - not at 300 fps with too little work (or 3 fps with too much work).

Pitfall 3: Revealing New Bottlenecks

There's another way you can get in trouble profiling OpenGL: once you make the application faster, OpenGL load shifts again and new bottlenecks are revealed.

As an example, imagine that we are bottlenecked on the CPU (typical), but pixel fill rate is at 90% of capacity and other GPU resources are relatively idle. We go to improve CPU performance (because it's the bottleneck), but we optimize so far that we end up bottlenecked completely on fill rate; we only see a fraction of the benefit of our CPU work.

Now this isn't the worst thing; there will always be "one worst problem" and you can then optimize fill rate. But it's good to know how much benefit you'll get for an optimization; some optimizations are time consuming or have a real cost in code complexity, so knowing in advance that you won't get a big win is useful.
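
(Here's the kind of back-of-the-envelope bound I mean, using the 90% fill figure from the example above - illustrative only:)

    // If fill rate is already at 90% of capacity, frame rate can rise at most
    // ~11% before fill becomes the new wall, no matter how much CPU time we
    // recover - a cap on the win before we write a line of optimization.
    #include <cstdio>

    int main()
    {
        const double fill_utilization = 0.90;   // fraction of fill capacity in use
        printf("Upper bound on frame-rate gain from CPU work alone: ~%.0f%%\n",
               (1.0 / fill_utilization - 1.0) * 100.0);
        return 0;
    }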

Therefore there is a case of stubbing that I do recommend: stubbing old code to emulate the performance profile of new code. This can be difficult - for example, if you're optimizing CPU code that emits geometry, you can't just stub the code out, or the geometry (and thus the GPU load) goes away. But when you can set this up, you get a real measurement of what kind of performance win is possible.
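
(A sketch of what that can look like when the CPU work feeds geometry to the GPU: keep a previous frame's vertex buffer around and re-submit it instead of rebuilding it, so the GPU workload stays realistic while the CPU cost approximates the planned optimization. The flag and helper names below are hypothetical, and the code assumes VBO entry points - GL 1.5 or the ARB extension - are available.)

    #include <GL/gl.h>   // plus an extension loader on Windows for glBindBuffer et al.

    extern bool    gEmulateOptimizedCPU;   // flip on for the measurement run
    extern GLuint  gSceneryVBO;            // VBO filled on a previous frame
    extern GLsizei gSceneryVertCount;      // vertex count in that VBO

    void build_scenery_vbo();              // hypothetical: the expensive CPU path

    void draw_scenery()
    {
        if (!gEmulateOptimizedCPU)
            build_scenery_vbo();           // normal path: rebuild every frame

        // Either way the GPU sees the same draw call, so pipeline load stays
        // realistic while the CPU cost emulates the hoped-for optimization.
        glBindBuffer(GL_ARRAY_BUFFER, gSceneryVBO);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, 0);
        glDrawArrays(GL_TRIANGLES, 0, gSceneryVertCount);
        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }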
