Archaeology isn't just an excuse to wear a cool hat and carry a whip - it's about recognizing that "today" is built upon layer after layer of "the past" and sometimes you need to peel back those layers to explain a problem.
In the software development context, it primarily comes up when fixing bugs and when modifying long-stable code to meet changing requirements. This breaks down into answering the near-term question of "how does this code work now" and the longer-term question of "how did we get here."
Digging into the present (forensics)
Outside of some modern-art contexts, archaeology is not about the present; "digging into the present" more properly lands somewhere between journalism and forensics. In software we aim to approach it from the forensic side: separating reports and hearsay from concrete evidence, and keeping meticulous records of any discoveries. (This is simplified by being able to perfectly copy memory states, program inputs, and program outputs, and store them at terabyte scale - criminal forensics would be greatly simplified by being able to make hundreds of exact copies of a crime scene and poke at them with impunity, without disrupting the original!)
This is sort of the reverse of the bug funnel, and at the same time a good part of the motivation for that model: ultimately a bug report is "something happened that shouldn't have happened," and the "shouldn't" part is based on some model of how the system works. Your customers have a model of the system that is arguably more important than yours, but also vastly less precise. To usefully pass a complaint along the chain from user to developer, you need to make sure the complaint continues to fit the model, and that involves making it more concrete without losing sight of the end user's perspective.
A powerful way of making the complaint more concrete is to turn it into a test case.2 This test case can serve several purposes:
- Others, particularly at review time, can look at the test case and agree (or disagree) that it represents the actual problem.
- The test case can be applied to multiple past versions of the code (see "archaeology" below) to narrow down what versions express the undesired behaviour.
- The test case can be applied to future versions of the code - as a "regression test" to make sure the problem doesn't come back.1
- Finally the test case can be used on the present version of the code to confirm that the change does the desired thing - serving as a "free" first intent-based test.3
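The shape of such a test case can be tiny. Here's a minimal sketch in shell, where `sum_lines` is an invented stand-in for the real program under test and the "bug" (wrong output for empty input) is equally invented - the point is only the shape: a report made concrete as a re-runnable check that fails today and passes after the fix.

```shell
# Invented stand-in for the program under test; this version has the
# reported bug: it prints nothing (rather than 0) for empty input.
sum_lines() {
  awk '{ s += $1 } END { if (NR > 0) print s }'
}

# The bug report, made concrete: "an empty list should total to 0".
repro_test() {
  [ "$(printf '' | sum_lines)" = "0" ]
}

if repro_test; then echo PASS; else echo FAIL; fi   # prints FAIL today...

# ...then after the fix,
sum_lines() {
  awk '{ s += $1 } END { print s + 0 }'
}

if repro_test; then echo PASS; else echo FAIL; fi   # ...the same check prints PASS.
```

The same `repro_test` serves every purpose on the list above: reviewers can read it, it can be replayed against old versions, and it stays behind as a regression guardrail.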
Once you have a (failing) test to work from, you can apply the usual tools for figuring out what's going on - tracing execution, single-stepping, measuring performance. The test should give you an "expected path" through the code, and you can see where things deviate. Sometimes that will be enough - especially if you're responding to an environmental change or a new business requirement, since the difference between "code that doesn't try to do that at all" and "code that successfully does that" is usually straightforward and direct.
It's when the code already almost does that - and the change isn't driven by some external requirement - that you have to start digging more deeply…
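A concrete failing test also unlocks the "apply it to past versions" purpose mentioned above: `git bisect run` will binary-search the history for the first commit where the test fails. The sketch below builds a throwaway repo with an invented bug so the commands are runnable as-is; on a real project you'd point bisect at your actual repro script.

```shell
set -e
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com
repo=$(mktemp -d); cd "$repo"; git init -q

# Three good commits, then the bug sneaks in, then an unrelated change.
for i in 1 2 3; do
  echo "ok $i" > app.txt
  git add app.txt && git commit -qm "good change $i"
done
echo "broken" > app.txt && git commit -qam 'refactor app'
echo "more work" >> app.txt && git commit -qam 'unrelated change'

# The repro script lives outside the repo so bisect's checkouts don't
# disturb it; exit 0 = behaves correctly, non-zero = bug present.
repro=$(mktemp)
printf '#!/bin/sh\n! grep -q broken app.txt\n' > "$repro"
chmod +x "$repro"

git bisect start HEAD HEAD~4       # bad now, known good four commits back
git bisect run "$repro"            # binary-searches for the first bad commit
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset
git show -s --format='%h %s' "$first_bad"   # the 'refactor app' commit
```

This is why it pays to make the repro script cheap to run: bisect will execute it O(log n) times without any human judgment in the loop.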
Digging into the past (archaeology)
Once you've identified where the relevant code is, you may have questions about how it got that way. Unless you're really aggressive about narratively documenting your project timeline, you're going to be digging into version control history.4
Once you've found the area of concern, the first question to try to answer is "what were the last changes to this, specifically?" At the file level, a simple `git log` is enough, showing every change made to the file; you can then `git show` each change one by one (or see everything with `git log --patch`, which is noisy, but useful if you want to see the whole flow of the development process, or if you just want to search for keywords and don't know whether they're in code, comments, or commit messages.)
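That loop looks like this in practice. The sketch uses a two-commit throwaway repo so the commands work as-is; the file name and commit messages are invented.

```shell
set -e
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com
repo=$(mktemp -d); cd "$repo"; git init -q

echo 'version one' > widget.c && git add widget.c && git commit -qm 'add widget'
echo 'version two' > widget.c && git commit -qam 'fix widget overflow'

# Every change that touched this file, newest first:
git log --oneline -- widget.c

# Inspect a single change in full (commit message plus diff):
git show HEAD

# Or the whole flow at once: every commit with its patch - noisy,
# but greppable for keywords in code, comments, or commit messages:
git log --patch -- widget.c
```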
In practice, you've probably narrowed your problem down more precisely than "entire file" - maybe to a single function or group of functions, or even to a few lines of code. This is where `git blame` (formerly `git annotate`) comes in: since git can reconstruct any version of a file from its recorded history, `blame` walks through that history, keeps track of which commit last touched each line, and presents the entire file with those annotations. At that point, you can examine the relevant revision directly (using `git show`) and look at the discussion in the commit message, what the previous code was, or even just which branch it was merged from (which should be enough to let you trace it back to the Pull Request that inspired it, and see the review discussion.)
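Here's that narrowing step as runnable commands, again in a throwaway repo with invented names - going from "this line looks wrong" to the full commit that last touched it:

```shell
set -e
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com
repo=$(mktemp -d); cd "$repo"; git init -q

echo 'int limit = 10;' > config.c && git add config.c && git commit -qm 'initial config'
echo 'int limit = 255;' > config.c && git commit -qam 'raise limit for batch jobs'

# Annotate just the line(s) you care about, not the whole file:
git blame -L 1,1 config.c

# Grab the commit hash for that line and examine it in full -
# commit message, previous code, the works:
culprit=$(git blame -L 1,1 --porcelain config.c | head -n 1 | cut -d' ' -f1)
git show "$culprit"
```

The `-L` flag keeps the output focused when you only care about a few lines, and `--porcelain` gives a stable format for scripting.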
While it is very often true that "what you just changed is what broke things," that's more about pushing back on the idea that software "just breaks" without some human intervention - it doesn't mean that the most recent change is guaranteed to be the problem. Sometimes the flaw has been there for a long time (see also "how did this ever work") and is only revealed by new external circumstances.5 In these cases you probably need to do deeper archaeology - the same thing you did to find the recent change, but digging another layer back. While continuing with `git log` and `git blame` works fine, `git blameall` gets you the entire history of the file (including deletions) with all changes from the first commit visible at one time. This can be a bit overwhelmingly noisy, but can give you quick answers to "did anything in this file ever call this function?" if your concern happens to be shaped that way.
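Git's built-in "pickaxe" search answers the same shape of question - it finds every commit that added or removed occurrences of a string, including code deleted long ago. And `git blame` itself can be pointed at an older revision to dig below the most recent change. Throwaway repo and invented names again:

```shell
set -e
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com
repo=$(mktemp -d); cd "$repo"; git init -q

echo 'frobnicate();' > main.c && git add main.c && git commit -qm 'call frobnicate'
echo '/* call removed */' > main.c && git commit -qam 'drop frobnicate call'

# -S finds commits that changed the number of occurrences of the
# string: both the commit that introduced the call and the one that
# removed it show up, even though the current file has neither.
git log --oneline -S 'frobnicate' -- main.c

# Blame as of an older revision, one layer below the latest change:
git blame HEAD~1 -- main.c
```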
A note about `git blame`
The "blame" framing is a bit of tech toxicity that can distract from how useful the feature is. Perforce and Subversion had `annotate` commands; Subversion got the `blame` alias (and then a `praise` alias in reaction to that.) Git documents `blame` as the primary command (it doesn't have `praise` at all), and `git annotate` exists only for similarity with older version control systems like the ones mentioned above.
The toxicity is that you can only blame people for something; the code doesn't change itself. At the same time, the recorded code doesn't encapsulate what was going on when the code was written - it only contains the actual code itself. The thing to remember (that gets short-circuited by even calling it "blame") is that you're not (at least on a healthy team) looking for someone to blame - you're only trying to take code that doesn't work today and turn it into code that does work. You're doing this digging to see how the code has been shaped over time because other more direct approaches haven't worked and you need more context.
That doesn't mean that "who worked on this particular bit of code" can't be important - perhaps you'll learn that it was some code one of the founders wrote five years ago after three all-nighters and no one has dared touch it since - and you can probably ask them about it and get a good story about how the company tried to get a Red Bull™ sponsorship before the VC money came in. Or perhaps the change was from a senior engineer who was experimenting with something new... or they only touched it incidentally as part of a global cleanup chore. These are weird - culturally interesting, but still weird - corner cases that don't really help you solve the problem. Most of the time you're going to find code written by one of your peers, with similar experience and constraints, and similar habits - who makes similar mistakes to those you'd have made if the task had been in your hands instead. This is why you're going to be able to build your own mental model of what's going on, and fix the thing. That is the essence of Rule 3 itself - patterns in code are human patterns of thought, and treating your peers as human isn't just empathetic, it's accurate.
Chesterton's Fence
The principle that one should not tear down a fence until one knows why it was erected is fundamentally about acting on principle in the absence of research; while this applies as much in software as it does anywhere else, good practices in software (version control, code review) mean that the information should exist - it's just a matter of digging it up. Archaeology is about having the tools in place to do that digging.
Conclusion
Archaeology is a tool for understanding code. It's not your only tool, and you will probably have more immediate options for most bugs - "what did you just change, that's probably what broke" is a far more potent investigative tool - but as you work on systems of greater age and complexity, it's more likely to be what you need to understand how the system got where it is.
-
Expressing it as "the problem coming back" is somewhat magical thinking; a regression test is really a "guardrail" to make sure a developer doesn't reintroduce the problem, without requiring that they have perfect awareness of all past problems. As such, it needs to be easy and convenient for all developers to run all such tests as part of development - the guardrail doesn't serve its purpose if someone has to go digging for it. ↩
-
Open source projects often push the work of generating reproduction cases back to the reporting user, since the whole point is that end users have full access to the source code; commercial projects more often have one or more tiers of internal support organization to handle this. (In both cases it may still end up in the developer's lap.) ↩
-
TDD or "Test Driven Development" is the logical conclusion of this - development starts with "writing a test that fails"; it fails for the more specific reason that "the code implementing the feature doesn't exist yet," but it still gives you a very clear stepping stone from "wanting the code to do this" to "showing that (with these changes) it does this." ↩
-
Concrete examples here all use `git` - not because you have to, but `git` is so widespread that any system you do use will likely have a detailed comparison list just to explain why you'd even consider using it. (You can even find these lists for much older systems like Subversion, Perforce, or CVS; they just describe the glorious future instead of the glorious present.) ↩
-
This is also why it's important to turn a bug report into a concrete failing test - not just so that you can prevent the problem from recurring in the future, but so that you can resurrect past versions of the code and see if it previously worked or never did. ↩