Debugging Undefined Behavior in C++

Today I’m going to describe more strange issues we encountered in 0 A.D. last year. It was reported by some OS X users that they were unable to complete multiplayer games without an “out of sync” error occurring.

First, a little explanation of 0 A.D. multiplayer mode is needed.  Each player (client) runs a deterministic simulation of the game, producing a particular game state each turn. When a player or AI takes an action, they send their command(s) to all players, modifying the game state in the same way. The advantage of this approach is that only a small amount of data needs to be sent each turn, but each player must run an identical version of the game. The game state can be serialized into a binary blob of data, which can be passed into a hash function, the result being used to compare different states. This happens frequently and if the states differ between any two players, we call that an “out of sync error,” and the game is basically invalid.

Issue #1

Sometimes these OOS errors occur because of a bug in 0 A.D. code, sometimes because a player has a modified/inconsistent version of the game. So the first troubleshooting step is to ensure all players run the same version of the game. This is determined by collecting logs from all participating players. For OOS errors, the game also dumps a JSON debug version of the game state to a text file. These debug dumps are nice, because they include component and property names, and are nicely formatted for human eyes. By comparing these dumps with e.g. a diff tool, we can determine where the state varied and perhaps infer the cause.

However, in this case, three important clues were reported:

  1. Not all OS X players encountered the error. In particular, it only seemed to effect users of the older Mountain Lion (10.8), but not the new Yosemite (10.9).
  2. Only occurs with players using different OSes. So for instance, if all players used the same version of OS X, there was no problem. However, most 0 A.D. players use Windows or Linux, so cross-platform matches are far more likely.
  3. The dumps were identical between the affected players; both the binary dump and the JSON debug dump.

Uh oh, one of the troubleshooting standbys wasn’t helping here. The OOS dumps were identical, but somehow a different hash was being calculated… Was there a bug in the hash algorithm? To get to the bottom of this, some code modification was necessary. I needed to modify the serializer at the point where data was passed into the hash function, and dump the raw bytes to a file. Doing this, I discovered the most interesting thing yet, there was a big difference! A significant amount of raw data was being hashed, but not properly making it through the serializer.

By now, I was deep in the weeds of the game and unfamiliar source code. One of our original devs, Philip, pointed me in the right direction. It turns out that we use a custom “stream buffer,” allowing some parts of the game to implement their own custom serialization logic, passing the results seamlessly into the serialized data stream. This is an advanced C++ I/O concept. Was there a bug in our stream buffer implementation, or some difference that only occurred on OS X?

I needed to test the theory. Our stream buffer only implemented the xsputn() function, so I placed some debug printouts there. Indeed, the function was being called on Windows, Linux, and OS X 10.9, but not on older OS X!  Well, in fact there are two output functions which can be overridden in a stream buffer: xsputn and overflow. We never implemented overflow.

To be sure, I needed to test this with as many compilers as possible. Enter one of the more useful debugging techniques: the minimal test case. 0 A.D. has a lot of source code, and there are so many layers involved, sometimes it is more efficient to extract the relevant parts into a small piece of code, testing it separately. The serializer by itself is spread out across many source files and headers, and touches many other parts of the code base. All we really needed to test was our stream buffer:

#include <iostream>
#include <streambuf>

class OurStreamBuf : public std::streambuf
{
public:
    OurStreamBuf()
    {
    }

    int overflow(int ch)
    {
        std::cout << "overflow: " << char(ch) << std::endl;
        return ch;
    }

    std::streamsize xsputn(const char* s, std::streamsize n)
    {
        std::cout << "xsputn: " << s << std::endl;
        return n;
    }
};

int main(int argc, char* argv[])
{
    OurStreamBuf ourbuf;
    std::streambuf&amp; streambuf = ourbuf;
    std::ostream stream(&amp;streambuf);

    stream << "This is a test\n";

    return 0;
}

When compiled on MSVC, GCC, clang with libstdc++, or clang with libc++ on OS X 10.9+, only xsputn is ever called with the string “This is a test\n”. Our implementation worked on those systems. However, on clang with older libc++ (prior to r172542, in fact), only overflow was ever called. Incidentally, xsputn is intended as an optimization, which can optionally be used, but if it’s not used, then overflow is used instead. The moral of the story is that overflow should always be implemented for an output stream buffer, or bad things can happen when least expected.

Alas, even our test suite didn’t detect this bug, despite its extensive serialization tests. There was previously no test of our custom stream buffer class. As a bonus, older versions of libc++ had a bug in handling of eofbit for istreams, which caused a number of serialization tests to fail. This bug affected SDKs for OS X 10.7 and older, and is discussed extensively here. 0 A.D. prefers to support as many of our users on older systems as possible, so rather than saying “sorry, you must use OS X 10.8 or newer,” we will do our best to work around such bugs. It’s a tradeoff of supporting older hardware and software, to give users the most flexibility, at the expense of more complicated code with special cases for different platforms.

The solution was to fix the stream buffer, fix the tests to be OS X 10.7-compatible, and call it a day 🙂

Issue #2

With that fixed, there was another perplexing OOS error. This one likewise seemed to affect only OS X. Early in a game, perhaps a few turns in, there would be an OOS error. Comparing the debug simulation state dumps, the only differences were found in the pathfinder state. However, with as many units moving around the world and interacting as we have in 0 A.D., it was difficult to isolate anything more specific. A diff of the debug dumps looked something like this:

--- oos_dump_client_osx.txt
+++ oos_dump_host_win.txt
@@ -35402,8 +35402,8 @@ entities:
     group: 6667
     key: 23
     entity: 6668
-    x: 291.37151
-    z: 1239.88567
+    x: 291.09096
+    z: 1238.9258
     clearance: 1
     flags: 21
     group: 6668
@@ -163352,8 +163352,8 @@ entities:
     flags: 1
     scripted visibility: 0
     key: 6668
-    x: 291.37151
-    z: 1239.88567
+    x: 291.09096
+    z: 1238.9258
     vision: 0
     visibilities: 0
     size: 1
@@ -796328,7 +796328,7 @@ entities:
     goal u z: 0
     goal v x: 0
     goal v z: 0
-    goal hw: 9
+    goal hw: 8
     goal hh: 0
     maxdist: 0
   Vision:
@@ -798710,8 +798710,8 @@ entities:
   Footprint:
   Minimap:
     active: true
-    x: 291.37151
-    z: 1239.88567
+    x: 291.09096
+    z: 1238.9258
   Obstruction:
     active: true
     moving: true

To simplify troubleshooting, why not create a blank map with only a handful of units? I set up two instances of the game with my test map: one on OS X and one on Windows. Moving the units independently didn’t cause any problems. No sooner did the units encounter one another, and begin attacking, than the OOS occurred and now it was quite clear which data was at fault:

--- windows-oos_dump.txt
+++ osx-oos_dump.txt
@@ -76153,7 +76153,7 @@ entities:
     facePointAfterMove: true
     length: 2
     waypoint x: 519.5
-    waypoint z: 322.5
+    waypoint z: 323.5
     waypoint x: 519.5
     waypoint z: 276.5
     length: 0
@@ -76168,7 +76168,7 @@ entities:
     goal u z: 0
     goal v x: 0
     goal v z: 0
-    goal hw: 29.4142
+    goal hw: 28.4142
     goal hh: 0
     maxdist: 0
   Vision:

This is the UnitMotion component, which handles various high-level movement behaviors for units, including interactions with the pathfinder. Worth noting is that both numbers differ by exactly 1.0. No matter how many times I repeated the test, it was always 1.0. This meant that somehow the Windows and OS X versions of the game were calculating different pathfinder goals. Yet we have no OS-specific code in the pathfinder or UnitMotion.

The puzzle began to resolve itself, when considering the pathfinder had recently been updated (r16751). By reverting to the previous commit, and repeating the test, it should be possible to determine if that commit introduced the issue. I tested r16750 and there was no OOS error! So, the problem was narrowed down to a specific OS and commit. Now what?

It was time to inspect the UnitMotion component more closely. I didn’t design the pathfinder update, and it’s rather complicated logic. Sometimes the easiest way to troubleshoot is by process of elimination. In this case, there were only five lines in the code base where “goal hw” was modified. It was a simple matter of determining which line was responsible for a difference of exactly 1. Curiously, that line referenced a global static const in the same source file. The const itself was initialized with a global (static) const from a different source file.

Here’s a simplified listing of the files involved in this problem:

Terrain.cpp

#include “Pathfinding.h”

const int Pathfinding::NAVCELL_SIZE = 1;

CCmpUnitMotion.cpp

#include “Pathfinding.h”

static const int g_GoalDelta = Pathfinding::NAVCELL_SIZE; // uh-oh

bool CmpUnitMotion::MoveToTargetRange()
{
    // ... minRange + g_GoalDelta
}

Pathfinding.h

namespace Pathfinding
{
    extern const int NAVCELL_SIZE;
}

You might assume that, in this situation, your C++ compiler will be smart enough to do something “sensible” here. Like determine the dependencies of the translation units for Terrain.cpp and CCmpUnitMotion.cpp, so your globals are always properly initialized… or at least throw an error? You would be sadly mistaken. In fact, C++ does not specify or enforce any static initialization order between translation units, so this is undefined behavior. It compiles, and on some platforms, it even works as expected, but purely by chance.

With MSVC and GCC, g_GoalDelta was incorrectly assigned the value of 0, whereas it was assigned the correct value of 1 with clang/libc++. This meant that most 0 A.D. users would never receive the error, as they use Windows or Linux. Similarly, multiplayer games between only OS X clients would encounter no errors. This is one example of the challenges of developing a cross-platform application. Sometimes a bug lies dormant for months or years, before popping up in a cross-platform context like this.

There are many ways to solve this problem. My solution was simple: define GOAL_DELTA as a constant in the same source file as NAVCELL_SIZE. Ironically, this exposed another bug in the pathfinder, as goal delta had essentially been disabled by initializing to 0 for most users, which now being fixed, resulted in unexpected interactions with unit obstructions. I won’t go into those boring details here, as I think the post is already long enough, and I mostly wanted to give examples of undefined C++ behavior in the wild, how they were debugged and ultimately solved.

Leave a comment

Your email address will not be published. Required fields are marked *