realloc

Anecdotes don't necessarily teach practices or process, but they do expand the scope of Things That Actually Happen. These are mostly post-mortems of events where The Debugging Was Interesting.

Why is this so slow?

A utility function in a search engine back end would just sometimes hang. The operation would get retried (sometimes by the user, sometimes just as part of a timeout) and continue correctly, so the obvious techniques of adding more checks and debug information weren't helping, and we didn't know how to make it more reproducible.

So we reached for a Big Hammer: the Linux strace command. Normally this tells you something new about what's going on at the next layer down and you dig there - instead, when strace was attached, the hanging just stopped happening. At the time, that alone was a hint: in the Linux 2.x era strace had a fundamental problem where sleep() calls would just return immediately and not actually sleep at all. So we looked more carefully at the strace output and found sleep(4294967) calls. Clearly, without strace, those would have been stuck for over a month...

There's another layer to this. There weren't any mentions of 4294967 in the code, and yet it looks vaguely familiar: it's the first 7 digits of 2³², or rather "all but the last three" digits. Thinking about that in the context of time suggests that the missing factor of 1000 is the conversion of milliseconds to seconds (sleep() takes seconds). It's a further leap to notice that you don't see 2³² itself in a 32 bit environment... but you do see -1 converted (wrongly) from signed to unsigned, giving 2³²-1. So it's not too much of a leap to consider the sequence

  • some function returned -1 as an error value (many Linux functions, for example select(), do this as an indicator that you didn't get a value and should look at errno instead.)
  • whatever called that function expected it to return a time in milliseconds and the author didn't even consider that it could fail with a flag value - and likely just stored it in an unsigned integer to begin with.
  • since sleep() takes seconds, they divided by 1000 and got 4294967 (if the intermediate value had been signed, they'd have probably gotten -1/1000=0 instead, making the bug less disruptive and thus harder to find.)

These leaps aren't as ludicrous as they look - there's an entire debugging category that goes by "Computational Numerology", the idea that there's value in recognizing Interesting Numbers1 in debugging contexts - but there was some luck involved. Still, it was more about looking under the floorboards (with strace) and paying attention to what was scurrying around down there.

Cygwin Kerberos

This one is a kind of messy real story that doesn't lead to clean single lessons, but it gives some things to ponder about what parts of the system you can trust...

In the mid 1990s there was a project at Cygnus to get Kerberos 4 running on Windows NT using the Cygwin toolchain. As part of the porting effort there were a handful of improvements to Cygwin itself, to add useful corners of POSIX that it hadn't yet needed, particularly time-related functions. Since, from our perspective, Windows was just another especially-annoying embedded target system, the development was all done on stable Linux systems and then the files were copied over and tested.

We ran into a weird authentication problem, turned on full debugging, and pushed the updated build over - and it immediately stopped failing. Turned off debugging, changed something else, and it started failing again. We went back and forth for half a day of tedious (10-15 minute) edit-compile-test cycles until, due to fatigue, I accidentally failed to save a change, recompiled - and the failing/not-failing state flipped anyway.

That slap-in-the-face led to digging in to my Cygwin changes - Windows time was declared as a pair of 32 bit ints making up a 64 bit microseconds-since-the-start-of-time value, with some explicit math to combine them. This is where the error was: they should have been2 explicitly unsigned integer values - so whenever the high bit of the lower int was set, time would count downwards instead of up. This meant that for around 12 minutes of every 35, if the Windows clock was perfectly synchronized, the Cygwin clock seen by the Kerberos code would be more than 5 minutes off, thus exceeding the replay-bounds threshold and failing to authenticate. The failures were entirely based on "what time it actually was" - nothing to do with which version of the code I was testing, what changes I'd made, or whether debugging was actually turned on.

The lesson here is a little obscure. The strongest form is probably "the thing that broke is probably the thing you just changed" - but you have to be completely clear on what you actually changed, not just the part you're focusing on. Mechanically translating the type declarations would also have avoided this (though there may have been perceived license issues with that at the time). From another angle, "the bug goes away with debugging enabled" might be rarer than you think - if you don't have a reason that debugging "unbreaks it", maybe step back and think about whether something else might be going on. There's a lot of debugging-related mythology - I'm calling it mythology because even back then, enabling debugging with gcc only added symbols to the code, it did not change the generated code at all; if you wanted simpler machine code to make things easier to track, you actually had to use -O0 or otherwise turn off particular optimizations. So there wasn't a good reason to think that turning debugging on was the cause of it starting to work, either.

Python Lists

In the "late aughts" (200x) I had a problem with a production geographic news search system crashing because a Python interpreter dumped core. We preserved the coredump and context (which was huge - the process had been chewing on gigabytes of compressed data for hours, and called in to some custom libraries which were, correctly, the first suspect). We considered ourselves lucky to have caught this on a system we controlled and not a customer-fielded system - part of the reason to run this in-house search project was to "get ahead of" big data problems by attempting to process more data than most of our customers did.3

Initially the problem proved very difficult to reproduce, even with the exact input data. This led to some improvements in our internal record-keeping with an eye towards making it easier to reproduce next time; it was several months later that we reproduced it (under testing conditions) at all, and by that point the priority got dropped a bit simply because it wasn't happening regularly in production. Eventually we got it down to a case that took 8 hours to run but only failed about one in five times; at that point it was self-regulating - we could try a new perspective on the problem about once a week.

What got strange was that as we got better at reproducing the problem, it got further from any of our code. At the time, Python lists were a malloc()ed array of pointers to objects, with a "fibonacci" growth pattern - amortizing some of the allocation overhead but without the pathologies that power-of-two growth causes. The failure converged on finding a single pointer element within that list of pointers - not at the start or end, but "towards the end" - that was a literal NULL (0). An empty Python list element isn't NULL, it's a pointer to the singleton Python None object; there's no intentional way to get an actual C-level NULL in place from Python code (without using C escapes like pyrex or CDLL). It also didn't happen at a particular size of list - it happened more for larger ones, but it sometimes happened with relatively short lists too.

We did work out that the particular missing value lined up with the "normal" growth points of the Python list allocator - which is surprisingly approachable code, and wasn't that hard to just audit. It wasn't copying things at all; that was delegated to the C runtime, using realloc(). GNU libc turns out to be highly performance tuned and much harder to audit - it also has to handle obscure cases like detecting that allocations are being made "simultaneously" in multiple threads (which Python didn't do at all, at the time, but which made the libc code significantly more complex, and dependent on fairly random initial conditions).

At this point we stepped back and considered that while there aren't many eyes on our code, there are a lot of eyes on Python itself (I pestered at least one senior community member about this at Pycon!) - and there's an even larger community of people banging on libc (and at least finding bugs, if not necessarily reading the code). Since we were running a relatively current Debian based distribution, we dug through Debian's very short (and very conservative) list of critical post-release updates - and actually found a very obscure fix to realloc. In some difficult-to-trigger conditions, when realloc needed to move the data to a different page, it would copy one fewer word than it should have - so the new array would have a NULL instead of the previously-last element. Nothing would notice this until higher level code, much later, iterated over the Python list and "tripped" on the NULL element, dumping core.

Since we had a reproduction case, we were able to apply the single Debian update and rerun everything, convincing ourselves within a couple of weeks that it really was the same problem we were seeing. (One of the big reasons to create reproduction cases like this is the difference between thinking that "yeah, sure, that's a plausible sounding fix" and having evidence for it.)

Lessons are varied, but include:

  • "if you didn't find a kernel bug yesterday, you didn't find one today" - but at the same time, sometimes the bugs really are several layers below the abstraction you're normally working at. (And sometimes you really are near the edge of the cliff - this particular startup found at least two different kernel bugs in the span of about five years.)
  • Debian post-release updates are rare and usually critical so if you're doing a Debian-based release, track them (or just take them promptly.) You've probably got the resources to at least read the announcements and filter them against the packages you actually care about. (I learned at the end of the story that a coworker had tried a quick duplication of the problem on a "clean" Debian baseline and couldn't reproduce it, and discarded the effort as unhelpful, not realizing that "clean" meant "fully-updated"...)
  • There are basically no reasonable conditions that lead to a Python interpreter coredump - but it's not hard to get at least a traceback out of them and that's easier to talk about. (Believing the traceback when it goes through something as fundamental as List Item is perhaps harder.)
  • Not exactly "never waste a crisis" but it's certainly worth considering, for any "why did that blow up" kind of mess, at least updating the product to gather the evidence that you wished you had, so you have it next time.

  1. Interesting Numbers include powers of two in decimal, powers of ten in other number bases, and more obscure things like strings of ASCII digits converted to numbers - not specifically that 1163022927 or 1330795077 are the beginning of ERROR but that they're worth at least converting to hex and pondering. 

  2. To be very clear - the Microsoft type declarations were correct, the converted types elsewhere in Cygwin were correct - this was only in my local copy of the new code, and this was fixed long before it went upstream. 

  3. The New York Stock Exchange is famous for taking a trading week (Monday-Friday) of data every week and replaying it at 5x speed on the weekend - as a "canary" for any trade-volume-scaling problems. This is mostly used as an example of what you should have been doing if you get caught by surprise when your system gets popular...