Debugging GLSL shaders in 0 A.D.

Today I want to discuss a graphics glitch that long plagued a project I work on, called 0 A.D.  The glitch was first noticed years ago, when major changes were made to the renderer system. Until recently, the cause was unknown and not a focus of our efforts. I will present the debugging techniques used to solve it.

First, a word about 0 A.D., since this may begin a series of posts on the project: it’s a free, open-source, cross-platform 3D real-time strategy game, built more or less from the ground up with an engine (Pyrogenesis) written in C/C++. We also use various third-party open-source libraries. The renderer is implemented in OpenGL and supports both the old fixed-function pipeline and newer technology, including both ARB and GLSL shaders (even GL ES 2.0 for mobile platforms!).

The bug is apparent in the following images:

Side-by-side view of 0 A.D. renderer glitch

Notice how the texture at the base of the building appears to “flicker” as the camera moves. In Pyrogenesis, these models are called decals: they are flat, conform to the terrain, and help blend objects such as buildings and resources into the ground around them. Some experimenting led to the following observations:

  1. Only decals show this particular bug, but not all of them
  2. It is only visible when GLSL shaders are used
  3. It is dependent on camera angle
  4. It is independent of graphics card / driver
  5. It is not reproducible in the scenario editor or on minimal test maps

The first two points are critical: the bug doesn’t affect ARB shaders or the fixed-function pipeline, and it only shows up on a subset of models. Currently, 0 A.D. has some fancy graphical effects only implemented in GLSL, such as specular maps, normal maps, parallax, and ambient occlusion. Could one of these be causing the problem? I suspected a bug in 0 A.D. rather than in the drivers, because it occurred on basically any graphics card that supports GLSL, on Windows or Linux.

0 A.D. models (actors, to be precise) are defined in a custom XML format that links together meshes, textures, props, animations and materials. By inspecting the XML of decals known to have this bug, I found something interesting: they all reference a material with normal and specular mapping – effects only implemented in GLSL. We’re getting closer!

Now it was a matter of finding the shader responsible for decal rendering. There are some nice tools for debugging OpenGL applications, including gDEBugger and AMD’s GPU PerfStudio. 0 A.D. shaders are a little odd, because we make heavy use of preprocessor definitions to generate different shaders from a single source file and to switch between different effects. That means the source file on disk isn’t exactly the code that gets compiled, which makes debugging a bit more complicated. And in this case, simple maps with only a few models don’t trigger the bug, so finding the right permutation of the right shader on the right model can be quite tricky with a debugger.
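To illustrate why the file on disk is not what gets compiled, here is a minimal C++ sketch of assembling one shader permutation from a shared body plus a list of defines. This is not the actual Pyrogenesis code: the function name `BuildPermutation` and the define name `USE_NORMAL_MAP` are hypothetical, chosen only for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the text that would reach the GLSL compiler for one permutation:
// a "#version" line, one "#define" per enabled effect, then the shared body.
// All names are hypothetical; this is a sketch, not the engine's code.
std::string BuildPermutation(const std::string& version,
                             const std::vector<std::string>& defines,
                             const std::string& body)
{
    assert(!version.empty()); // GLSL requires #version to come first, if present
    std::string source = "#version " + version + "\n";
    for (const std::string& name : defines)
        source += "#define " + name + " 1\n";
    source += body;
    return source;
}
```

Logging the returned string for the permutation in use is one way to see exactly which branches of the shader are active for a given material.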

One of the simplest means to troubleshoot broken code is to narrow the problem down as closely as possible to a particular file, function, or code block, and begin commenting lines out until the problem is “fixed” or you see some difference in the observed behavior. It sounds primitive, but can be surprisingly quick and easier than working with a debugger. Fortunately, 0 A.D. has hotloading support for shaders, so I can see my modifications in real-time.

Using knowledge of 0 A.D.’s material system, I was able to track this down to the GLSL fragment shader responsible for rendering both terrain and decals. With a little trial and error, I discovered the offending line was the following lighting calculation:

vec3 bumplight = max(dot(-sunDir, normal), 0.0) * sunColor;

If I set this variable to a constant, the flicker went away. I began wondering if the code was incorrect, but as far as I could tell, it wasn’t obviously wrong. More importantly, the calculation isn’t based on camera angle in any way (the sun direction, the sun color, and the terrain normals are all independent of the camera view).

So a bug triggered by moving the camera was caused by a line of code with no relation to the camera view? Something wasn’t right there. I broke the calculation down further: replacing sunDir and sunColor with constants didn’t help, but replacing normal with a constant made the flicker disappear again. Now I knew the problem was with the normal, but what exactly was going on?
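The same isolation step can be tried on the CPU. The sketch below mirrors the lighting line in plain C++ (the `Vec3` alias and the helper names are my own, not engine code), so each input can be pinned to a constant in turn to see which one the result is sensitive to.

```cpp
#include <algorithm>
#include <array>

using Vec3 = std::array<float, 3>;

float Dot(const Vec3& a, const Vec3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// CPU mirror of: max(dot(-sunDir, normal), 0.0) * sunColor
// The result depends only on these three inputs, never on the camera.
Vec3 BumpLight(const Vec3& sunDir, const Vec3& normal, const Vec3& sunColor)
{
    const Vec3 toSun = { -sunDir[0], -sunDir[1], -sunDir[2] };
    const float lambert = std::max(Dot(toSun, normal), 0.0f);
    return { lambert * sunColor[0], lambert * sunColor[1], lambert * sunColor[2] };
}
```

With the sun and color held fixed, any frame-to-frame change in the output has to come from the normal, which is the conclusion reached above.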

Here’s a useful quick-and-dirty technique for troubleshooting shaders when you want to know the value of a variable without relying on a debugger: set the fragment color with it! This way, you can visualize any 3D vector as an RGB color and, from that, deduce its value over the scene. It’s exactly what I did for terrain normals, and here is the result:
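As a rough sketch of how a vector ends up as a color, here are two mappings written in C++ rather than GLSL (function names are hypothetical). The first mimics writing the vector straight to the fragment color, where each channel is clamped to [0, 1] and negative components show up as black. The second is a common alternative that I am adding for comparison, not something used in this post: it remaps [-1, 1] to [0, 1] so negative components stay visible.

```cpp
#include <algorithm>
#include <array>

using Vec3 = std::array<float, 3>;

// Direct output: each channel is clamped to [0, 1], so negative
// components are lost (they all display as 0).
Vec3 VectorToColorClamped(const Vec3& v)
{
    return { std::min(std::max(v[0], 0.0f), 1.0f),
             std::min(std::max(v[1], 0.0f), 1.0f),
             std::min(std::max(v[2], 0.0f), 1.0f) };
}

// Remapped output: [-1, 1] becomes [0, 1], assuming a unit-length input.
Vec3 VectorToColorRemapped(const Vec3& v)
{
    return { v[0] * 0.5f + 0.5f, v[1] * 0.5f + 0.5f, v[2] * 0.5f + 0.5f };
}
```

With the direct mapping, an upward-pointing normal (0, 1, 0) comes out as pure green, which matches what the terrain looks like in the screenshot.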

Side-by-side view showing the bug with shader debugging technique

Let’s think about what that means. For most of the terrain, it looks roughly correct. Green dominates, and that corresponds to the Y (up) coordinate of the normal, since the terrain mostly points upward. No problem. But what about the orange and red under those buildings, where the decals are? Why does the color change so drastically with slight camera movements? Why does it change at all?!

At this point, I knew something was badly wrong with terrain decal normals. Another basic debugging technique: start with what you know and work backwards. I traced the normals back through the shader system and didn’t find anything decal-specific. In fact, the definitions for the GLSL terrain decal effect didn’t even declare a normal stream and attribute. Position, color, and UV coordinates were there, but not normals. (The details are very specific to Pyrogenesis, but I suppose a problem like this can and will happen to others.)

Strange, but adding those didn’t actually fix the problem. So I looked a bit deeper and found where normal attributes are set up in the vertex arrays for other models. Other models, but not decals… they don’t have normals defined! This means the shaders were using garbage data. Don’t ask me how this is allowed by GLSL or why there is no error. But something became obvious now. Remember how simple test maps didn’t trigger the glitch? It was still garbage data, but there was less of it filling up memory than on a real map. And moving the camera altered the data in memory by chance, such that sometimes the garbage was more or less visible.
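Since nothing reported an error here, a small sanity check of my own can catch this class of mistake. The C++ sketch below is hypothetical and not part of Pyrogenesis: it compares the attributes a shader reads against the streams a vertex array actually supplies. The attribute names are made up; in a real OpenGL program the shader-side list could be gathered with glGetProgramiv(GL_ACTIVE_ATTRIBUTES) and glGetActiveAttrib.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns every attribute the shader reads that the vertex array never
// supplies, in the order the shader lists them. Hypothetical helper.
std::vector<std::string> MissingAttributes(
    const std::vector<std::string>& shaderInputs,
    const std::set<std::string>& suppliedStreams)
{
    std::vector<std::string> missing;
    for (const std::string& name : shaderInputs)
    {
        if (suppliedStreams.count(name) == 0)
            missing.push_back(name);
    }
    return missing;
}
```

Run once per shader and vertex layout pair in a debug build, a check like this would have flagged the decal normals immediately instead of leaving them to show up as flicker.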

The fix? A whopping 8 lines of code (r16349), but hours of debugging and years of experiencing this annoying little bug.

Often, this is how debugging goes. Nobody is around who can point you to a specific section of code, and the original author may have left the project years before. It may be a technology you haven’t used or don’t fully understand, and there may be few debugging tools or IDEs available to assist. But it’s always possible to fall back on the basic methods of debugging: comment out code, add debug logging, work backwards or forwards from a known point. The best part is that they work for every language and system. The more you apply these techniques, the easier it becomes =)