bugs

A surprisingly controversial perspective on software engineering and computers in general is the idea that There Are No Mysteries - that computers are entirely "knowable"1 and it's primarily a matter of how deep you choose to look. (That choice may be limited by resources or access, but not by possibility.) This particularly applies to plumbing around software - npm packages, pypi modules, and Debian packages aren't magic, and actually help you figure out where things are coming from.2

Complexity is still a battle

If you always had to understand every system in its entirety, debugging might actually be intractable - the "spaghetti code" moniker for such tangles arose in the 1980s.3 Instead, we generally build code in "layers of abstraction" - which sounds clever, but is really an emergent property of starting with something simple, building one layer on top of it, then doing the same thing again. The main "modern" distinction is that the layer is more likely to be defined by a formal API fronting a remote service than by local function calls; even "local" services often come from a local container rather than the same machine, which allows larger subsystems to be properly isolated - even your develop-and-test "local database" is likely to be an entirely isolated machine that just happens to share a CPU.

The important part of these layers is that each layer only "sees" (communicates with) immediately adjacent ones.4 The term for when an upper layer sneaks through and uses a lower layer's interfaces directly is "layering violation".

The Good Stuff is the Next Level Down

Understanding how a system behaves at the surface or user level is sufficient when it is working but often falls apart when you're trying to figure out why it isn't working. A great deal of "debugging" is about peeling back that surface and looking at the mechanisms underneath to get clarity on what is "really" going on.

This isn't an argument about purity - just that the layer immediately below the visible part is what's most directly causing things, properly or improperly.

Just as the layering itself is an emergent property, so is debugging by layer - you've got something wrong at the surface, so you step through at the surface until you get a wrong result - and then you open up that layer and do the same thing. Much of the time you can repeat that with each lower layer - sometimes you need to synthesize inputs to keep the operations at that layer "narrow", but that's fine, as long as you can show that they correspond to the upper-level problem.

How do you get to the next level?

Tooling varies depending on what you're starting with. On Linux, the "big" layer is "your user space code vs. the kernel", and the tool for looking at that boundary in particular is strace. (Realistically you are probably using libraries and frameworks above that and should look at them first! You'll normally be examining those with debuggers or simply by printing out return values. Still, the kernel/user boundary is where a lot of interesting problems happen.)

To avoid turning this into an strace tutorial, I'll just suggest you practice by running it on something known to work, like ls, and something known to fail, like ls /nonexistant, and seeing what the differences are. (To go sideways a little: a file not existing is often merely a fact, not an error - but if the file should be there under normal circumstances, it is common to write code that simply assumes it is present - since you're treating the absence as exceptional, and you're going to have an error-handling path anyway, which can sometimes be a simpler5 structure.)
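The parenthetical pattern - assume the file is present and let one error path handle absence - might look like this minimal Python sketch (the function name and error wrapper are illustrative assumptions, not anything from the original):

```python
# Minimal sketch: treat a missing file as exceptional rather than
# checking for existence first (which would race with other processes
# anyway). read_required_file and the RuntimeError are invented here.
def read_required_file(path):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # One error-handling path covers the "shouldn't happen" case.
        raise RuntimeError(f"required file missing: {path}")
```

The straight-line code stays simple; the exceptional case gets exactly one place to be handled.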

Higher level environments often have built-in tools - when developing javascript in a web context, you have an entire in-browser console and inspection system that lets you focus on a problematic area of a page (document) and go from there to the related code, since the problems you care about are often more directly visible (at least to end-users) from the page perspective rather than from the flow of logic.

Stories

This isn't just theory, but it helps to have actual cases to show it realistically mattering. Most of these are very situational and don't necessarily lead to modern debugging techniques, so you'll find them under Debugging Tales instead.

Other history: Literature

"The Cuckoo's Egg" and "With Microscope and Tweezers" cover the Morris Worm incident - notably, the part where Bill Sommerfeld noticed that a piece of code was running as the nobody user, and leveraged that into a clean-room duplication of the buffer-overrun attack that was one of the worm's more successful vectors. (The nobody part had mostly been dismissed by other people investigating, but turned out to be significant.)

Other history: 6.004

MIT teaches this directly in 6.004, Computation Structures, which (as taught in the late 1980s) built a small computer all the way up from

  • The "Digital Abstraction" - the idea that you could use amplifier circuits to slam a continuous voltage to one extreme or the other, and oversimplify6 those levels into 1/0 true/false logic7
  • Simple multi-bit arithmetic units, first as chips and wires, then as small pre-made circuit boards8 (containing the same chips and wires.)
  • An instruction set (basically a counter and some gates attached to the arithmetic units) and storage
  • Toggle switches for entering code

and finally a tool chain on a host system to assemble (and later compile and optimize) higher level code to run on this hand-built system.

As you crawl "up" the stack during the course, you don't just build the lower levels, you build a belief in the lower levels - voltages representing logic levels, representing numbers, representing code, with nothing hidden9 about how you got there.


  1. Nelson Elhage also wrote about this in "Computers can be understood"

  2. Package systems are particularly helpful for encapsulating provenance metadata - while we primarily care about that for licenses and permissions, it's also an important part of tracing the actual code. 

  3. The term seems to have fallen back from a high-water point in 2010 which I am not prepared to explain. 

  4. A particular angle on this is called The Law of Demeter - which is more of an example of how 1980s computer science was mostly an extended exercise in puns on ancient Greek deities (see also MIT's Project Athena) - but it does have a kernel of truth about thinking of software in terms of growing it rather than building it. 

  5. Error Oriented Design is the idea (reflected somewhat distinctively in the Go language) that your code should be written from the defensive perspective that something is going to fail and you should treat that failure as important information and handle it as carefully as you handle things actually working. This is a good way of questioning assumptions and is a critical part of any security-related programming. 

  6. Famously, early digital message encryption had the problem that the "exclusive or" function used to combine key and data wasn't quite digital enough: while 1⊕0 and 0⊕1 are both 1, in practice there would be a voltage difference of a few percent, which was enough to "peel apart" the key and data, as long as you intercepted the signal in the right part of the channel. "Security Doesn't Respect Abstraction Boundaries" has a decent walkthrough of this in terms of voltages. 

  7. The course also immediately covers the Asynchronous Arbiter problem, where the digital abstraction is hiding an unbounded timing problem that is so fundamental it can be demonstrated with mechanical coin-operated devices - but as long as you aren't specifically arranging your circuits to force it to happen, you can reduce the probability below that of spontaneous explosion of the circuit. 

  8. The pre-made arithmetic boards weren't special, they just made it possible for students to "get past" practical problems with the early lots-of-chips implementation and not be "stuck" for later work; it also freed up a lot of protoboard space to build later parts of the system. 

  9. Nothing hidden from a CS perspective. A not-quite-joke about the difference in coverage between EE and CS was that EE didn't do compilers and CS didn't do semiconductor physics. It was certainly the case that you couldn't get through the CS core without doing at least some soldering - perhaps badly, perhaps enough to convince you that you wanted to be at the other end of the abstraction stack - but you knew why, you had "run your fingers through it" and it wasn't a mystery despite it not being Your Thing. While we're mostly focusing on layers in code, it should be pretty clear that once you get from software to hardware, the physical sciences are layered as well. 

Archaeology isn't just an excuse to wear a cool hat and carry a whip - it's about recognizing that "today" is built upon layer after layer of "the past" and sometimes you need to peel back those layers to explain a problem.

In the software development context it primarily comes up when fixing bugs and when modifying long-stable code to meet changing requirements. This breaks down into answering near-term "how does this code work now" and longer term "how did we get here" questions.

Digging into the present (forensics)

Outside of some modern art contexts, archaeology is not about the present; "digging into the present" more properly lands somewhere between journalism and forensics. In software we aim to approach it from the forensic side - separating reports and hearsay from concrete evidence, and keeping meticulous records of any discoveries. (This is simplified by being able to perfectly copy memory states, program inputs, and outputs, and store them at the terabyte scale - criminal forensics would be greatly simplified by being able to make hundreds of exact copies of a crime scene and poke at them with impunity, without disrupting the original!)

This is sort of the reverse of the bug funnel, and at the same time is a good part of the motivation for that model: ultimately a bug report is "something happened that shouldn't have happened" and the shouldn't part is based on some model of how the system works - your customers will have a model of the system that is arguably more important than yours, but is also vastly less precise. To usefully pass a complaint along the chain from user to developer, you need to make sure the complaint continues to fit the model, and that involves making it more concrete without losing sight of the end user's perspective.

A powerful way of making the complaint more concrete is to turn it into a test case.2 This test case can serve several purposes:

  • Others, particularly at review time, can look at the test case and agree (or disagree) that it represents the actual problem.
  • The test case can be applied to multiple past versions of the code (see "archaeology" below) to narrow down what versions express the undesired behaviour.
  • The test case can be applied to future versions of the code - as a "regression test" to make sure the problem doesn't come back.1
  • Finally, the test case can be used on the present version of the code to confirm that the change does the desired thing - serving as a "free" first intent-based test.3

Once you have a (failing) test to work from, you can apply the usual tools for figuring out what's going on - tracing execution, single stepping, measuring performance - the test should give you an "expected path" through the code and you can see where things deviate. Sometimes that will be enough - especially if you're responding to an environmental change or a new business requirement, the difference between "code that doesn't try to do that at all" and "code that successfully does that" is usually straightforward and direct.

It's when the code already almost does that and the change isn't an external desire to change it that you have to start digging more deeply…

Digging into the past (archaeology)

Once you've identified where the relevant code is, you may have questions about how it got that way. Unless you're really aggressive about narratively documenting your project timeline, you're going to be digging into version control history.4

Once you've found the area of concern, the first question to try and answer is "what were the last changes to this specifically." At the file level, a simple git log is enough, just showing every change made to the file; you can then git show each change one by one (or see everything with git log --patch which is noisy, but useful if you want to see the whole flow of the development process, or if you just want to search for keywords and don't know if they're in code, comments, or commit messages.)

In practice, you've probably narrowed your problem down more precisely than "entire file" - maybe to a single function or group of functions, or even a few lines of code. This is where git blame (formerly git annotate) comes in - it walks the history of changes to the file, keeping track of which commit last touched each line, then presents the entire file with those annotations. At this point, you can examine the relevant revision directly (using git show) and look at the discussion in the commit message, what the previous code was, or even just what branch it was merged from (which should be enough to let you trace it back to the Pull Request that inspired it, to see the review discussion.)
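Conceptually, what blame computes can be sketched as a toy (illustrative only - git's real implementation derives line history from diffs between snapshots, and the revision history here is invented):

```python
# Toy model of blame: replay a file's revisions in order, keeping the
# old annotation for any line that survives unchanged in place, and
# stamping changed lines with the revision that changed them.
history = [
    ("r1", ["def f():", "    return 1"]),
    ("r2", ["def f():", "    return 2"]),
]

def blame(history):
    annotated = []          # list of (revision, line) pairs
    for rev, lines in history:
        new = []
        for i, line in enumerate(lines):
            if i < len(annotated) and annotated[i][1] == line:
                new.append(annotated[i])      # line unchanged: keep origin
            else:
                new.append((rev, line))       # line changed: new origin
        annotated = new
    return annotated

for rev, line in blame(history):
    print(rev, line)
# r1 def f():
# r2     return 2
```

The real tool is vastly more clever about lines that move around, but the output shape - every line tagged with the commit that last touched it - is the same.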

While it is very often true that "what you just changed is what broke things", that's more about pushing back on the idea that software "just breaks" without some human intervention - it doesn't mean that the most recent change is guaranteed to be the problem. Sometimes the flaw has been there for a long time (see also "how did this ever work") and is only revealed by new external circumstances.5 In these cases you probably need to do deeper archaeology - the same thing you did to find the recent change, but digging another layer back. Continuing with git log and git blame works fine, but the third-party git blameall script gets you the entire history of the file (including deletions), with all changes from the first commit visible at one time. This can be overwhelmingly noisy, but it can give you quick answers to "did anything in this file ever call this function" if your concern happens to be shaped that way.

A note about git blame

The "blame" name is a bit of tech toxicity that can distract from how useful the feature is. Perforce and Subversion had annotate commands; Subversion gained the blame alias (and then a praise alias in reaction to that.) Git documents blame as the "primary" command (it doesn't have praise at all), and git annotate exists only for similarity with older version control systems like the ones mentioned above.

The toxicity is that you can only blame people for something; the code doesn't change itself. At the same time, the recorded code doesn't encapsulate what was going on when the code was written - it only contains the actual code itself. The thing to remember (that gets short-circuited by even calling it "blame") is that you're not (at least on a healthy team) looking for someone to blame - you're only trying to take code that doesn't work today and turn it into code that does work. You're doing this digging to see how the code has been shaped over time because other more direct approaches haven't worked and you need more context.

That doesn't mean that "who worked on this particular bit of code" can't be important - perhaps you'll learn that it was some code one of the founders wrote five years ago after three all-nighters and no one has dared touch it since - and you can probably ask them about it and get a good story about how the company tried to get a Red Bull™ sponsorship before the VC money came in. Or perhaps the change was from a senior engineer who was experimenting with something new... or they only touched it incidentally as part of a global cleanup chore. These are weird - culturally interesting, but still weird - corner cases that don't really help you solve the problem. Most of the time you're going to find code written by one of your peers, with similar experience, constraints, and habits - someone who makes mistakes similar to those you'd have made if the task had been in your hands instead. This is why you're going to be able to build your own mental model of what's going on, and fix the thing. That is the essence of Rule 3 itself - patterns in code are human patterns of thought, and treating your peers as human isn't just empathetic, it's accurate.

Chesterton's Fence

The principle that one should not tear down a fence unless one knows why it was erected is fundamentally a substitute for research - a rule for acting when you can't find out. While this applies as much in software as it does anywhere else, good practices in software (version control, code review) mean that you should actually have the information, and it's just a matter of digging it up. Archaeology is about having the tools in place to do that digging.

Conclusion

Archaeology is a tool for understanding code. It's not your only tool, and you will probably have more immediate options for most bugs - "what did you just change, that's probably what broke" is a far more potent investigative tool - but as you work on systems of greater age and complexity, it's more likely to be what you need to understand how the system got where it is.


  1. Expressing it as "the problem coming back" is somewhat magical thinking: a regression test is really a "guardrail" to make sure a developer doesn't reintroduce the problem without requiring that they have perfect awareness of all past problems. As such, it needs to be easy and convenient for all developers to run all such tests as part of development - the guardrail doesn't serve its purpose if someone has to go digging for it. 

  2. Open source projects often push the work of generating reproduction cases back to the reporting user, since the whole point is that end users have full access to the source code; commercial projects more often have one or more tiers of internal support organization to handle this. (In both cases it may still end up in the developer's lap.) 

  3. TDD or "Test Driven Development" is the logical conclusion of this - where development starts with "writing a test that fails"; it fails for the more specific reason that "the code implementing the feature doesn't exist yet", but it still gives you a very clear stepping stone from "wanting the code to do this" to "showing that (with these changes) it does this." 

  4. Concrete examples here all use git - not because you have to, but git is so widespread that any system you do use will likely have a detailed comparison list just to explain why you'd even consider using it. (You can even find these lists for much older systems like Subversion, Perforce, or CVS; they just describe the glorious future instead of the glorious present.) 

  5. This is also why it's important to turn a bug report into a concrete failing test - not just so that you can prevent the problem from recurring in the future, but so that you can resurrect past versions of the code and see if it previously worked or never did. 

Anecdotes don't necessarily teach practices or process, but they do expand the scope of Things That Actually Happen. These are mostly post-mortems of events where The Debugging Was Interesting.

Why is this so slow?

A utility function in a search engine back end would just sometimes hang. The operation would get retried (sometimes by the user, sometimes just as part of a timeout) and continue correctly, so the obvious techniques of adding more checks and debug information weren't helping; we didn't know how to make the hang more reproducible.

So we reached for a Big Hammer: the Linux strace command. Normally this tells you something new about what's going on at the next layer down, and you dig there - instead, when strace was attached, the hanging just stopped happening. At the time, that alone was a hint; in the Linux 2.x era strace had a fundamental problem that sleep() calls would return immediately and not actually sleep at all, so we looked more carefully at the strace output and found sleep(4294967) calls. Clearly, without strace, those would have been stuck for over a month...

There's another layer to this. There weren't any mentions of 4294967 in the code, and yet it looks vaguely familiar: it's the first 7 digits of 2³², or rather "all but the last three" digits. Thinking about that in the context of time suggests that the difference is also the conversion of milliseconds to seconds (sleep takes seconds.) It's a further leap to consider that you don't see 2³² in a 32 bit environment... but you do see -1 converted (wrongly) from signed to unsigned giving 2³²-1, so it's not too much of a leap to consider the sequence

  • some function returned -1 as an error value (many Linux functions, for example select(), do this as an indicator that you didn't get a value and should look at errno instead.)
  • whatever called that function expected it to return a time in milliseconds and the author didn't even consider that it could fail with a flag value - and likely just stored it in an unsigned integer to begin with.
  • since sleep() takes seconds, they divided by 1000 and got 4294967 (if the intermediate value had been signed, they'd have probably gotten -1/1000=0 instead, making the bug less disruptive and thus harder to find.)
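The sequence above is easy to verify numerically - a quick Python sketch of the arithmetic (modeling the C conversions, not the original code):

```python
# Reproducing the arithmetic of the bug, not the bug itself.
err = -1                          # error return from some C library call
as_unsigned = err & 0xFFFFFFFF    # what storing -1 in a uint32 yields
print(as_unsigned)                # 4294967295, i.e. 2**32 - 1
seconds = as_unsigned // 1000     # "milliseconds" wrongly divided to seconds
print(seconds)                    # 4294967 - the value seen in strace
# Had the intermediate stayed signed, C's truncating division gives zero:
print(int(-1 / 1000))             # 0 - a much less visible bug
```

The same three steps - flag value, unsigned storage, unit conversion - fall straight out of the number itself.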

These leaps aren't as ludicrous as they look - there's an entire debugging category that goes by "Computational Numerology", the idea that there's value in recognizing Interesting Numbers1 in debugging contexts - but there was some luck involved. Still, it was more about looking under the floorboards (with strace) and paying attention to what was scurrying around down there.
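As a taste of Computational Numerology, the footnote's example numbers decode directly (Python sketch):

```python
# "Interesting Numbers": these decimal values are ASCII bytes in
# disguise - worth at least converting to hex and pondering.
for n in (1163022927, 1330795077):
    print(hex(n), n.to_bytes(4, "big"), n.to_bytes(4, "little"))
# 0x4552524f b'ERRO' b'ORRE'
# 0x4f525245 b'ORRE' b'ERRO'
```

A four-byte value that spells the start of ERROR (in either byte order) is a strong hint that a log string got stomped over a number somewhere.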

Cygwin Kerberos

This one is a kind of messy real story that doesn't lead to clean single lessons, but it gives some things to ponder about what parts of the system you can trust...

In the mid 1990s there was a project at Cygnus to get Kerberos 4 running on Windows NT using the Cygwin toolchain. As part of the porting effort there were a handful of improvements to Cygwin itself, to add useful corners of POSIX that it hadn't yet needed, particularly time-related functions. Since from our perspective, Windows was just another especially-annoying embedded target system, the development was all done on stable Linux systems and then the files were copied over and tested.

Ran into a weird authentication problem, turned on full debugging and pushed the updated build over - and it immediately stopped failing. Turned off debugging and changed something else and it started failing again. Went back and forth for half a day of tedious (10-15 minute) edit-compile-test cycles until, due to fatigue, I accidentally failed to save a change, recompiled - and the failing/not-failing state flipped anyway.

That slap-in-the-face led to digging into my Cygwin changes - Windows time was declared as a pair of 32 bit ints making up a 64 bit microseconds-since-the-start-of-time value, with some explicit math to combine them. This is where the error was: they should have been2 explicitly unsigned integer values - so whenever the high bit of the lower int was set, time would count downwards instead of up. This meant that for around 12 minutes of every 35, if the Windows clock was perfectly synchronized, the Cygwin clock seen by the Kerberos code would be more than 5 minutes off, thus exceeding the replay-bounds threshold and failing to authenticate - entirely based on "what time it actually was" and not anything to do with which version of the code I was testing, what changes I'd made, or whether debugging was actually turned on.
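The arithmetic of that signed/unsigned mistake is easy to model (a simplified Python sketch in arbitrary tick units - not the actual Cygwin code):

```python
# Simplified model: a 64-bit time value assembled from two 32-bit halves.
def combine_unsigned(high, low):
    return (high << 32) | (low & 0xFFFFFFFF)

def combine_signed_low(high, low):
    # BUG: the low half gets interpreted as a *signed* 32-bit int
    if low & 0x80000000:
        low -= 1 << 32
    return (high << 32) + low

before = 0x7FFFFFFF   # just before the low word's high bit flips on
after = 0x80000000    # one tick later
print(combine_unsigned(0, after) - combine_unsigned(0, before))      # 1
print(combine_signed_low(0, after) - combine_signed_low(0, before))  # -4294967295
```

One tick of real time, and the buggy combination jumps backwards by nearly 2³² ticks - which is why the failure tracked the wall clock rather than anything in the build.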

The lesson here is a little obscure - the strongest form is probably "the thing that broke is probably the thing you just changed", but you have to be completely clear on what you actually changed - not just the part you're focusing on. Mechanically translating the type declarations would have also avoided this (though there may have been perceived license issues with that at the time.) From another angle, "the bug goes away with debugging enabled" might be rarer than you think - if you don't have a reason that debugging "unbreaks it", maybe step back and think about whether something else might be going on. There's a lot of debugging-related mythology - I'm calling it mythology because even back then, enabling debugging with gcc only added symbol information; it did not change the generated code at all. If you wanted simpler machine code to make things easier to track, you actually had to use -O0 or otherwise turn off particular optimizations. So there wasn't a good reason to think that turning debugging on was the cause of it starting to work, either.

Python Lists

In the "late aughts" (200x) I had a problem with a production geographic news search system crashing because a python interpreter dumped core. We preserved the coredump and context (which was huge - the process had been chewing on gigabytes of compressed data for hours, and called in to some custom libraries which were, correctly, the first suspect.) We considered ourselves lucky to have caught this on a system we controlled and not a customer-fielded system - part of the reason to run this in-house search project was to "get ahead of" big data problems by attempting to process more data than most of our customers did.3

Initially the problem proved very difficult to reproduce, even with the exact input data. This led to some improvements in our internal record-keeping, with an eye towards making it easier to reproduce next time; it was several months later that we reproduced it (under testing conditions) at all, and by that point the priority had dropped a bit simply because it wasn't happening regularly in production. Eventually we got it down to a case that took 8 hours to run but only failed about one in five times; at that rate it was self-regulating - we could try a new perspective on the problem about once a week.

What was strange was that as we got better at reproducing the problem, it moved further from any of our code. At the time, Python lists were a malloc()ed array of pointers to objects, with a "fibonacci" growth pattern - amortizing some of the allocation overhead but without the pathologies that power-of-two growth causes. The failure converged on a single pointer element within that array - not at the start or end, but "towards the end" - that was a literal NULL (0). An empty Python list element isn't NULL - it's a pointer to the singleton Python None object - and there's no intentional way to get an actual C-level NULL in place from Python code (without using C escapes like pyrex or CDLL.) It also didn't happen at a particular size of list - it happened more for larger ones, but sometimes with relatively short lists too.
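You can still watch the list allocator's growth points from pure Python (a sketch; modern CPython uses a roughly 1.125x over-allocation pattern rather than the one described above, so the exact points vary by version):

```python
import sys

# Watch where the list allocator actually grows its backing array:
# sys.getsizeof changes only when the array is reallocated.
lst = []
last = sys.getsizeof(lst)
growth_points = []
for _ in range(100):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last:
        growth_points.append(len(lst))
        last = size
print(growth_points)  # reallocation happens in bursts, not per-append
```

Each reallocation is a call into the C runtime's realloc() - exactly the layer the story ends up in.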

We did work out that the particular missing value lined up with the "normal" growth points of the Python list allocator - which is surprisingly approachable code and wasn't that hard to just audit - and it wasn't copying things at all; that was delegated to the C runtime, using realloc(). GNU libc turns out to be highly performance-tuned and much harder to audit - it also has to handle obscure cases like detecting that allocations are being made "simultaneously" in multiple threads (something Python didn't do at all at the time, but which made the libc code significantly more complex, and dependent on fairly random initial conditions.)

At this point we stepped back and considered that while there aren't many eyes on our code, there are a lot of eyes on Python itself (I pestered at least one senior community member about this at PyCon!) - and there is an even larger community of people banging on libc (at least finding bugs, if not necessarily reading the code.) Since we were running a relatively current Debian-based distribution, we dug through Debian's very short (and very conservative) list of critical post-release updates - and actually found a very obscure fix to realloc. In some difficult-to-trigger conditions, when realloc needed to move the data to a different page, it would copy one fewer word than it should have - so the new array would have a NULL instead of the previously-last element. Nothing would notice this until higher-level code, much later, iterated over the Python list and "tripped" on the NULL element, dumping core.

Since we had a reproduction case, we were able to apply the single Debian update and rerun everything, convincing ourselves within a couple of weeks that it really was the same problem we were seeing. (One of the big reasons to create reproduction cases like this is the difference between thinking "yeah, sure, that's a plausible-sounding fix" and having evidence for it.)

Lessons are varied but include

  • "if you didn't find a kernel bug yesterday, you didn't find one today" but at the same time, sometimes the bugs really are several layers down the abstraction you're normally working at. (And sometimes you really are near the edge of the cliff - this particular startup found at least two different kernel bugs in the span of about five years.)
  • Debian post-release updates are rare and usually critical, so if you're doing a Debian-based release, track them (or just take them promptly.) You've probably got the resources to at least read the announcements and filter them against the packages you actually care about. (I learned at the end of the story that a coworker had tried a quick duplication of the problem on a "clean" Debian baseline and couldn't reproduce it, and discarded the effort as unhelpful - not realizing that "clean" meant "fully-updated"...)
  • There are basically no reasonable conditions that lead to a Python interpreter coredump - but it's not hard to get at least a traceback out of them and that's easier to talk about. (Believing the traceback when it goes through something as fundamental as List Item is perhaps harder.)
  • Not exactly "never waste a crisis" but it's certainly worth considering, for any "why did that blow up" kind of mess, at least updating the product to gather the evidence that you wished you had, so you have it next time.

  1. Interesting Numbers include powers of two in decimal, powers of ten in other number bases, and more obscure things like strings of ASCII digits converted to numbers - not specifically that 1163022927 or 1330795077 are the beginning of ERROR but that they're worth at least converting to hex and pondering. 

  2. To be very clear - the Microsoft type declarations were correct, and the converted types elsewhere in Cygwin were correct - this was only in my local copy of the new code, and it was fixed long before it went upstream. 

  3. The New York Stock Exchange is famous for taking a trading week (Monday-Friday) of data every week and replaying it at 5x speed on the weekend - as a "canary" for any trade-volume-scaling problems. This is mostly used as an example of what you should have been doing if you get caught by surprise when your system gets popular... 

Doesn't everyone already do this?

One expects there to be pretty good practices around testing, but instead it is often an afterthought; sometimes even with resources devoted to it, there's a narrow focus on only one of many kinds of testing, while other areas are ignored entirely.

Keeping tests near relevant code

We need to start with a simple principle of debugging: "What you changed is what you broke." It's sufficiently rare that something truly interesting and "at a distance" is your problem - we tell stories about them because they're interesting but also because they're rare. To a first approximation, what's wrong is not the JVM, it's not the kernel1, it's just the most recent thing you changed.

The hardest part about applying this rule is that if you don't have local enough (and prompt enough) tests, you may not even know that you just broke something - you really need tests that are close to your changes (possibly even before your changes, as we'll discuss below.) Sure, you can narrow things down with version control later, if you commit often enough, but you want at least some basic checks as you go.

The Testing Funnel

Since you can't couple everything that tightly, we end up with another funnel - by analogy with the bug funnel, a "testing funnel": a hierarchy of tests that runs from needle-precise unit testing of particular lines of code up to human- and system-scale interaction.

Unit Tests (function scale up to object scale, usually involves one developer)
Unit tests apply to small amounts of code, and attempt to answer the question "Does what I wrote do what I intended?" Writing them in advance can serve as a stronger guarantee of this. They also aid in review of an interface design - the reviewer can look at the tests as examples ("Does it make sense to call these functions this way?") and as secondary documentation ("So we'd do this thing by making this call - does that make sense? Does this expose obvious edge cases? Does it encourage the next developer that comes along to do something that's obvious but wrong?")
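As a concrete sketch of the scale involved - the `chunk` function here is hypothetical, invented purely for illustration - a unit test pins down one function's intended behaviour, including the edge cases a reviewer might ask about:

```python
# A minimal unit test sketch. The function under test (`chunk`) is a
# made-up example, not from any real codebase.

def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

def test_chunk_even_split():
    # Does what I wrote do what I intended?
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

def test_chunk_uneven_tail():
    # Edge case: a remainder smaller than `size` still comes through.
    assert chunk([1, 2, 3], 2) == [[1, 2], [3]]

def test_chunk_rejects_nonsense():
    # An intentionally documented failure mode.
    try:
        chunk([1], 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Note that the tests double as usage examples: a reviewer who has never seen `chunk` can read the three cases and judge whether the interface makes sense.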
Internal API/contract tests (interface scale - which can have a lot of
complexity "behind" it but the interface itself is bounded - team scale)
If you're providing a user-facing API or some other concrete promise to your customers ("this interface can handle 100 values at a time", "this data field can handle any unicode string containing only characters with general category Letter") it's worth expressing all of those as tests so you don't ship anything that breaks them in front of live customers. (These are still development tests, meant to be more exhaustive, rather than operational/delivery tests that only probe that certain paths work.) These are often as narrow as unit tests, but they're expressing a relationship with consumers rather than developers.
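A promise like "this interface can handle 100 values at a time" translates almost directly into a test. In this sketch, `lookup_batch` is a stand-in for whichever API actually carries the promise - the names and the toy implementation are assumptions for illustration:

```python
# Expressing a published promise ("this interface can handle 100 values
# at a time") as a contract test. `lookup_batch` is a hypothetical
# stand-in for the real API.

def lookup_batch(keys):
    # Toy implementation: map each key to a computed value.
    if len(keys) > 100:
        raise ValueError("too many keys")
    return {k: len(k) for k in keys}

def test_contract_batch_of_100_accepted():
    # The full promised batch size must work, not just small batches.
    keys = [f"key-{i}" for i in range(100)]
    assert len(lookup_batch(keys)) == 100

def test_contract_101_rejected_cleanly():
    # Going past the promise fails loudly, not with silent truncation.
    try:
        lookup_batch([f"key-{i}" for i in range(101)])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

The shape is the same as a unit test; what's different is that the asserted numbers come from the promise to the consumer, not from the implementation.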
External interface tests (interface scale but you probably have many
of these - team scale if the team is agreeing on the overhead of using
these particular interfaces)
If you're not going to aggressively review third party library updates, you probably also want API/contract tests facing outward and testing the interfaces you consume from those libraries or services. These are often the Internal API tests that you wish the providers would be running themselves. (External service tests might fit better as "delivery tests" (see below) since you likely don't control their updates - even for well behaved services that use /v1 /v2 -style versioned API URIs, you don't necessarily learn that /v1 has been deprecated until it's actually turned off2, which won't wait for your release schedule.)
Requirements/"business rules" tests (interface scale up to entire
product scale - not just the team but also external "stakeholders")
Requirements testing is usually an attempt to translate external documentation (including meeting minutes) into code, in order to express that "The thing we agreed on in the meeting is actually in the product, if this test passes." This helps review the concrete understanding within and outside the team. While there's value in showing "we think we're finished", there's even more value in having the tests reviewed by other involved parties, especially if they don't think the test describes the agreement; it's a lot easier to argue with something "down on paper" than a blank page.
External Quality Measurement tests (product scale - CVE scans,
audits - usually from external sources with an external audience)
These are usually tests from outside sources that are somewhat independent of the code itself - while there are more tools available to search for security anti-patterns directly, simply checking your dependencies and versions against published lists of CVEs and USNs is useful. Doing this as part of development testing isn't enough but it's a good start; you really want to be getting some kind of feed of advisories, filtering that against your actual dependencies, and responding directly - in practice this gets done for the most attention-getting advisories3 and the thorough checks are only done at release time (which is fine for producing an "everything is fine" report, but isn't early enough in the pipeline to actually react and inform customers as needed.)

Outside the funnel

While the space of things you need to test rapidly grows beyond the scope of a single developer testing their own work, there are also test opportunities that work from a different angle entirely.

Performance tests
If you're promising the customer anything with a rate or time, you should be able to demonstrate that - and even if they are slow or expensive, these tests should block merges so that you can immediately show that "this specific change needs more work because otherwise it breaks that contract", instead of having to reverse engineer the problem later. But even if you're not making any such promises, you likely have an internal constraint of "it works on the hardware we have now but haven't budgeted for any more" (or at a smaller scale, "we haven't run out of google cloud bootstrap points, and don't want to start spending Real Money") and so you still want some warning of performance changes, even without an explicit performance story.
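A merge-blocking performance check can be as simple as timing the promised code path against a budget. In this sketch both the operation and the budget are placeholders - real budgets come from your customer promises or capacity plans, not from a blog post:

```python
# A sketch of a merge-blocking performance check. The operation and the
# 0.5 second budget are made-up placeholders.
import time

def operation_under_test():
    # Stand-in for whatever code path you are promising a rate/time for.
    return sum(i * i for i in range(100_000))

def test_stays_within_latency_budget():
    start = time.perf_counter()
    operation_under_test()
    elapsed = time.perf_counter() - start
    # Failing here blocks the merge, so the offending change is
    # identified immediately instead of reverse engineered later.
    assert elapsed < 0.5, f"took {elapsed:.3f}s, budget is 0.5s"
```

In practice you'd want a less noise-prone measurement (repeated runs, dedicated hardware), but even a crude budget turns "it got slow somewhere" into "this change got slow."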
Delivery/integration tests
These are operational tests, used to confirm that the delivery is operating correctly (and not just "thrown over the wall".) Ideally you'd work with your operations team on these but you'll want to show that they work in your development and staging environments, and then make them part of your production environment "health tests" to provide some feedback that the system is working as fielded. (The core ideas of DevOps can be oversimplified as extending good software practices to include the actual deployment and use of the software; "tests" in that context expands to include deployment-specific metrics for whether the system is working as intended, measured more on a performance continuum and less on a binary works/doesn't work scale.)
"Other" tests
This list isn't exhaustive - anything that you have working and you'd like to stay working should have a test, even if it doesn't fit neatly into these categories. Customers find it far more frustrating if you fix something and then let it break again than if you don't fix it in the first place4, so you're going to want an entire category of "don't let this regress because of this customer" tests. (If you're shipping different things to different customers, consider fixing that - but maybe start by running all of the customer-related tests for everything, not just for the inciting customer.)

Coverage

Coverage isn't a kind of test - it's a measurement of testing itself: "how much of the code is run by this test suite", "how much of this code is actually run in actual product usage." It's not an absolute good, but more of a directional improvement indicator - not so much a productivity metric as a ratchet that should always be increasing. This doesn't mean you can't ever "clean up" tests - coverage improves if you get rid of untested code, too. (Deleting any code likely deletes bugs, but untested code has had even less attention, so deleting it is even more likely to.)

Measuring coverage is also a way to cure some illusions about how your own codebase works. In the mid 1990's it was popular to compile gcc with itself, on the theory that this tested many paths through the compiler (and the simultaneous but contradictory theory that the final version would be "better" because the code had been generated by gcc instead of the potentially buggy vendor compiler.) Coverage metrics eventually showed that gcc building itself only used about 20% of its own code, and that this really wasn't a particularly convincing test. (At the time, there was also an enormous test suite based on expressing every detail of the ANSI standard as code, and this was much more effective, at least as a regression test.)

Assertions

Assertions or asserts (named for the language keyword that implements them) are sometimes treated as equivalent to unit tests - in practice they are unit tests run at an unusual time (in dynamic languages, module-level assertions run on import; in compiled languages they can sometimes run at compile time, but more often only at eventual run time.) It's generally better practice to convert them into proper tests, so they're part of your testing and test reporting, and reviewed as such. The common anti-pattern is to end up with only the assertions running - because they're forced to run as a side effect of running the code - while the more formal unit tests you have aren't getting run at all. Making them Actual Tests puts a little more pressure on making sure your Actual Tests are run sufficiently often.
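The conversion is usually mechanical. In this sketch (the lookup table and its invariant are hypothetical), the same check moves from an import-time side effect into the test suite proper:

```python
# Converting a module-level assertion into a proper test. The lookup
# table and the invariant are made up for illustration.

STATUS_LABELS = {0: "ok", 1: "warning", 2: "error"}

# Anti-pattern: this line would only run as a side effect of importing
# the module, and its failure would surface as an import-time error,
# invisible to test reporting:
#
#   assert len(STATUS_LABELS) == 3

def test_status_labels_cover_all_codes():
    # Same invariant, but now it runs (and is reported) with the rest
    # of the test suite instead of whenever the module happens to load.
    assert set(STATUS_LABELS) == {0, 1, 2}
    assert all(isinstance(label, str) for label in STATUS_LABELS.values())
```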

Moving towards "enough" testing

"Working Effectively with Legacy Code" gives a good framework for thinking about the shape of the testing problem. Feathers specifically defines legacy code not as old code but as "code that doesn't have enough test coverage for you to change it with confidence" - even if you wrote it in 2025...

Headlights

"Code without tests is like driving without headlights" - with them you can make progress at driving speed; without them you stumble along slowly in the dark. Your tests are both an alternate expression of your intent for the code and a self-enforcing contract with your past self - even if you don't look at them, as long as you run them you have assertions that your ground remains solid.

Tooling to make it easier to write tests is important

You certainly don't want your developers to be evaluating new test harnesses when they're trying to do feature work - while "how do I write this code in a testable way" is a good thing for them to be considering, this is a lot easier if they have context and existing practices that enable this.

This goes beyond choice of framework - you will get a huge amount of mileage out of developing tooling that supports testing the distinctive kind of things your project does - at very least, it means when someone has to estimate development effort for a feature, it's a lot easier for them to say "testing this thing is very much like testing that other thing, so we can make a convincing estimate for the effort" than if they need to put more effort into inventing "how do I test this" than they do into "how do I solve the actual problem".

Consider tests that are shaped like your interfaces - if you've got a user-facing web interface, Playwright is available in most popular languages. If your tooling is mostly on the command line, Cram is a doctest-style tool that lets you mix prose explanations with chunks of shell command lines and result-matching patterns (which also lets you write self-checking documentation, if you write the tests as a narrative rather than as an exhaustive walk through what the code does - this mode is also effective for turning a customer complaint into an acceptance test while keeping a clear relationship between the two.)
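For a sense of the format - this snippet is illustrative, not from any real test suite - a Cram test file interleaves prose with indented shell commands; the indented lines after each "$" line are matched against the command's actual output:

```
Anything not indented is commentary, which is where the narrative goes.
Indented "$ " lines are run in a shell, and the indented lines after
them must match the command's output:

  $ echo hello
  hello

Expected output can be loosened with annotations like (glob):

  $ echo "build id 12345"
  build id * (glob)
```

Because the file reads top-to-bottom as a document, the same file can serve as both a regression test and a worked example for users.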

Fast-fail tests/Relevant tests

Getting a failure quickly means a developer doesn't have to come back later and recreate their mental context to work on it. (Unfortunately this may need to be instantly - as in, within a few hundred milliseconds - to actually feel fast enough; but in practice, seconds are better than minutes, and anything over ten minutes is probably only going to get run by choice as part of submitting for code review and likely not much sooner than that.)

At the same time, you can't just skip the "long" tests - the reason you wrote long, rigorous tests was to make sure some aspect of the product continues to work, and you want to find out as soon as you can after the change that breaks it - not a month later when a release is going out. Unfortunately, higher-level tests are often both more effective at finding flaws and slower - so you really want to run them on most changes, but they're more costly when you do.

Forty-five minutes of testing that checks something you "know" you didn't change is important at some level, but bad for morale - it really is valuable to the team to put the work into doing some kind of test dependency selection (and improving module isolation!) so that you can safely avoid testing things that you really couldn't have influenced. (At the same time, some part of your workflow should recognize when that can't be perfect and that running global cross-functional tests can find horrible and obscure problems - and you want to find them early just like with any other bug - you just need to find ways to limit the day-to-day burden to the team.)

Not much you can do but acknowledge the tradeoffs and keep track of what tests you do choose to run less often - and how often you find out later that you missed something valuable that those tests could have caught sooner; then you can use that feedback to re-balance which tests you run when (or even to convince the team that additional delay pays off.)

Complexity/Fragility

Flaky tests are really bad for morale, and just as inconsistent enforcement reduces respect for the law itself, tests that fail your build because of nothing you've done discourage you from running, let alone trusting, a full test suite.

This does mean that making tests robust the first time around is worth the trouble, while at the same time being difficult to enforce; explicitly budgeting effort to make new things "testable" is one way to handle it culturally.

Being able to say that a feature isn't done if it has tests that get in everyone else's way is theoretically satisfying but doesn't work very well in practice - since flaky tests tend to turn up long after the developers have moved on to other tasks. Instead, you need to have experienced engineers associated with the development of tests, to recognize risky behaviors (the kinds of things that lead to race conditions, for example.) This also includes having the authority to point out things like "calling sleep(1) in a test is probably bad and doesn't do what you think it does5", and to put the effort in to define race-free interfaces in your product, even if the only immediate need is to enable correct tests.
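One concrete replacement for the sleep(1) anti-pattern is a poll-with-deadline helper: it returns as soon as the condition holds, and only waits the full timeout in the failure case. This is a sketch - the names are illustrative, and a real completion-checking interface in your product is still better than polling:

```python
# A poll-with-deadline helper as an alternative to a bare sleep(1).
import time

def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll `condition` until it returns true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline

# Usage in a test, instead of sleep-and-hope (both names hypothetical):
#   start_background_job()
#   assert wait_until(job_is_done), "job never completed"
```

The fast path costs milliseconds instead of a fixed second, and the slow path fails with a clear timeout instead of a race.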

Coverage

Coverage is difficult to improve and hits diminishing returns fairly quickly. Starting out with high standards for it on greenfield work is both more effective and more demanding than trying to impose them later. Also, until a lot of your code actually is tested, the interesting metric for quality is how much of it isn't tested at all, rather than how well the rest of it is tested.

If you do go on a campaign to improve coverage specifically, one Neat Trick is to make sure it's clear to the team that removing untested code is a valid way to improve coverage, especially if you have accumulated speculative code that doesn't actually support real features. Simplifying the code base (and especially removing technical debt) is likely a bigger quality improvement - in its own right, not just in terms of coverage - than just having more tests is.

A Naïve Coverage/Regression Test Trick

If you built an initial system with vastly insufficient testing and are finally dedicating resources to correct this, one cheap but high-traction approach is to just write tests for the current behaviour of the system. While you ultimately want tests that answer the question "this system behaves in this intentional and desired way", you can still get a lot of coverage from tests that answer the weaker question "customers are successfully using the system the way it currently behaves - make sure it keeps behaving that way." This is a common approach to performance tests: rather than doing extensive analysis to see if some particular metric is good enough, you assert that "customers aren't complaining, so it's good enough Right Now" and make sure it doesn't deteriorate from there.
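A sketch of such a "pin the current behaviour" (characterization) test - the function and its quirks are hypothetical, standing in for whatever your system actually does today:

```python
# A characterization test: instead of arguing about what normalize_name
# *should* do, record what it does today and make sure it keeps doing it.
# The function is a made-up stand-in for existing production code.

def normalize_name(raw):
    # Existing behaviour we are pinning, warts and all.
    return " ".join(raw.split()).title()

def test_current_behaviour_is_preserved():
    # These expectations were captured from the running system, not
    # derived from a spec - if one fails, a behaviour change needs review.
    assert normalize_name("  ada   lovelace ") == "Ada Lovelace"
    # Quirk preserved deliberately: title() capitalizes after apostrophes.
    assert normalize_name("o'brien") == "O'Brien"
```

The second assertion is the interesting one: it encodes a quirk, not a requirement, and exists so that any change to the quirk is noticed rather than silent.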

One problem with these tests being weaker (and easier to write) is that when they do fail, you may have to defend them, and revisit whether they are testing the right things. Usually at that point you can get more attention on them and actually answer that.

Write A Failing Test First

A large chunk of your development time is going to be spent on fixes for visible failures of your system - which means a large chunk of your test development time will be too. If these are customer-visible, you really want to be sure the customer never sees the same problem again. There's really only one way to do this: once you have a test that you think covers the problem, make sure that it fails on the exact version of the system the customer is running (don't just show that the new code works.)

There's an ambitious variation of this, called Test Driven Development, where you start by writing tests for nonexistent features. These are particularly good for fleshing out API designs, since you're keeping in mind "how does someone want to use this" instead of "what falls naturally out of the implementation", and it encourages actually doing design instead of just coding things. (You can get similar value out of writing the documentation first.) This is somewhat different from the customer-facing case, where you already have code, input, and a misbehaviour to capture, and you're trying to prove a particular assertion rather than explore a design space.

Predictability needs Honesty about the costs

If you want the release phase to be constant and predictable, you can't be discovering problems for the first time at release freeze; that just causes firefighting, and sometimes that means the whole thing burns down. You have to push the "discover global problems" step back up the funnel and that means doing earlier global tests - possibly costing time for every developer and reviewer. While unpopular, this isn't wrong, but there are options for trading it off.

In practice, having a separate QA team doesn't actually matter here, unless they're empowered to pause development to get attention on release-blocking problems, or they themselves are part of product development and really can do it themselves (this is even more rare.)

The "Agile Release Train" model is about slipping features but not slipping releases. What it ignores is that cross-feature interactions aren't really captured by individual feature branches, and when you find complex failures you may find you really haven't isolated things well enough to kick the "failure" off the train - so it reduces to any other release model. (Also, most customers care about their pet feature, not about your release model.)

The Release Train model is also unrealistic about releases that are customer visible, especially when they're exposed to sales organizations that want to be able to promise things to customers in particular releases. The honest/true thing to do is to advertise what you've actually finished - it's just nearly impossible to get the customer-facing side of the organization to acknowledge that.

Brief History of test tooling

In the mid 1990s, nightly (or even weekly!) tests were a big thing. They could be slow because they just needed to finish overnight (but that didn't mean they were especially thorough, computers were just slower back then.)

By the late 1990s and early 2000s, part of having version control was having single points of Intentional Change where all of the files were consistent and you could at least expect the build to work, and maybe even the product itself. CruiseControl came out of ThoughtWorks in 2001 and popularized actually doing that build every possible time, with a straightforward java tool that could just watch for changes (and wait for a repo to "settle" since CVS at that point was only a loose collection of versioned files, fully mechanical multi-file commits didn't catch on until Subversion started replacing CVS a couple of years later.) While it wasn't associated with particularly sophisticated testing, simply knowing "who broke the build"6 in less than a day was powerful.

Eventually having feature branches, and tests before merging, became normalized with systems like Hudson/Jenkins and Atlassian Bamboo in the late 2000's and early 2010's; Bamboo could even auto-detect merge-candidate branches, test them, and record the test results directly in the pull request review page. (By 2020 pretty much every multi-user version control interface had some form of this available, with varying degrees of enforcement vs. cooperation.)

Conclusion

There's as much work and detail in testing as there is in building the software itself; while "bits" don't decay, any interesting software system or subsystem has a sophisticated organic interaction within its ecosystem, and static code doesn't respond to that - testing can act as "life support" and at very least as "early warning" that the world is changing around it.


  1. The author has found exactly one C runtime bug this century, and has a particular coworker that found two linux kernel bugs across the entire lifetime of a startup. It took vast amounts of isolation and proving to determine the truth of these; in the same time period we also had a bunch of speculative "that has to be a JVM problem" complaints that turned out to be either locally-written bugs or "no, unix actually works that way"... 

  2. This happened to the author in late 2025 when a casual curl script to check if a particular flatpak had been updated to a new version suddenly started returning a descriptive HTML 404 error page instead of detailed JSON. Apparently /api/v1/apps was replaced with /api/v2/appstream at some point. 

  3. Heartbleed, ShellShock, RowHammer, aCropalypse and a vast range of other bugs with "cute" names and specialized domains that pop up faster than the CVEs themselves do - basically "self-marketing" vulnerabilities. In practice many of these are disproportionately well-known, but since your customers and investors are likely to hear about them quickly too, you'll be expected to have an answer available promptly, even if it's a trivial dismissal. 

  4. At one support-focussed company early in my career I learned that customers valued prompt engagement more than prompt fixes - you still don't want to let things drag on without some negotiation about priorities, but "we're worrying about this so you don't have to" is a powerful message even when you don't yet know how long the engineering is going to take. 

  5. Usually calling sleep to let something "settle" is painting over a race condition - instead, the changes made by an interface should be complete on return or have an explicit way to check for completion. (Also, the test probably fails if the machine has competing loads, and maybe the product will too!) If the sleep call really does require time to pass (such that a timestamp crosses an integer boundary) consider if you want to use "Mock" testing techniques which would let you replace the clock - then your test can perform the first step, ask the harness to advance the clock, then perform the next step with confidence that the test-clock has updated correctly. This can take slightly more work to set up (faketime, for example, uses LD_PRELOAD to insinuate a shared library to replace the time checks in the executable under test) but can turn tests that previously needed seconds or more to run into tests that execute robustly in milliseconds. 

  6. One late 1990's startup developed a culture of bringing candy or chocolate for the team to apologize for breaking the build; this eventually drifted into there almost always being candy in the engineering wing, so the sales people would make excuses to come by and end up actually talking to us :-) 

This essay is part of a series on "Pretty Good Practices", aimed at development teams that would like to improve their craft, and thus improve their results.

Code Review is the process of having peer developers look at changes early, before they become part of the product, in order to try and find problems before customers do. (This is narrower than "Change Control" in that parts of it are very specific to software, rather than hardware or content.) The broadest modern version of this is "given enough eyeballs, all bugs are shallow" - but on a more conventionally sized development team you can expect most of the benefit from the first couple of reviewers.1

Code Review is not a new concept - it has roots as far back as IBM in the 1970's - though it is modern enough that it's not part of the "surgical team" model from The Mythical Man Month. It didn't really get popular (outside of environments where it was legally mandated) until free tools like Rietveld (2008) and Gerrit (2009) started becoming available to particularly motivated teams. The popularity of the code review process within large (FAANG) organizations was a source of inspiration for the industry at large, even if the full-blown tools themselves weren't available. (The Github "Reviews" feature wasn't generally available until the end of 2016.)

Before the tooling made the practice a widespread default, a lot of review only happened after problems were found - even at places where senior engineers read all of the changes that went by, looking for red flags, changes were often "integrated by default" and complex changes could get insufficient attention because there wasn't an enforced time budget. The popular exception, and one of the big examples of "what could be done, even if you're not realistically doing it", was Linux itself - changes went out over email and were only integrated when Linus (or later, his "lieutenants") accepted them, in a refreshingly public process. Git, BitKeeper, and other tooling adapted to this process, rather than the process adapting to the tooling, and it has remained consistent over a surprisingly long timeline.

It's 2025, not 1995 - doesn't everyone already do this?

Unfortunately, no. In the extreme case, you have your brilliant technical founder who has pumped out a bunch of code that impresses the seed investors or an initial customer or two - and works far better than it has any right to - but now you're expanding the team and adding the first people for whom software is a Real Thing and not just some details that distract from the thing they're actually brilliant about. After all, startups are rarely interesting in a purely software way - instead, software provides some novel leverage in an existing space, and serves the problem rather than being the problem.

This isn't unique to startups - sometimes a new project "falls out" of some other field - maybe a mechanical engineer coded up some automation, maybe an admin intern got lucky with getting a Fiverr contractor or vibe coder to build something - and then existing developers get stuck with a "demo that a bunch of people already rely on" dumped in their laps.2 Rarely do you jump into a project in a new space that followed established software engineering principles from the start - instead, you usually stumble into the new space and have to retrofit software engineering principles onto it.

Whatever path the software takes as it matures, it usually progresses from "this did something useful on one computer, at least once" to "multiple people need to understand what's going on here well enough to improve it further". This is the ultimate point of a bunch of software engineering techniques and tools, even if it often isn't expressed that way.

So what are they doing instead?

I've been involved in projects where

  • the code was just some files, version control was a New Thing3
  • the code was checked in to a free private account, but everyone was working on the same problem on the main branch and branching wasn't really a thing4
  • "this function works fine if I type it in but not if I put it in a file"5

... and many other points on a spectrum of how much the craft of software development was taken seriously. (Not out of disdain or disrespect, just inexperience or lack of exposure to Real Software Practices.) While these examples are primarily about version control, it turns out that reasonably sophisticated version control is what code review is built on top of - both mechanically (review tooling itself) and in terms of process (the review acts as a point of gatekeeping, so you need strict definitions of what the change is being applied to.)

Why Review

We've now established that a world without code review is a bit of a jungle, where software development is possible but more through heroism than technique. Why is code review part of this complete breakfast?

The primary motivation is just getting more eyes on the code you're producing, early in the development process. The obvious benefit is giving more than one developer a chance to say "that's odd" or "what's up with that" - but it's also a chance for the team members who will, in the future, be fixing or updating this code to check on how readable (and thus maintainable) it is, as long as "this was difficult to figure out" is feedback that you'll actually use to improve the code.6 To put it another way: review time is a chance to move "reviewability" issues further up the funnel, so you're not discovering mid-crisis that no one can figure out what the code is doing.

It also gets more eyes on the requirements and gives you another chance to catch a misinterpretation of what was intended, where the code appears perfectly good but is doing a different thing. Clearly you want to have already found such things sooner than this but it's better than finding them later.

It's a chance to emphasize team effort. "Peers pointing out details to each other from our coding standards" gets a lot less reflexive resistance than "coding standards imposed from On High" does, even if you have senior developers driving the standards effort.

It's also a chance to spread historical context from developers who have had longer exposure to the code base - though that can reasonably turn into "that's good to know - can we get it into developer documentation?"78 This is a useful question to prime junior reviewers to ask when someone responds with a detailed explanation - "great, what documentation should I have found that answer in, and can we put it there?" recognizes the value of the explanation while also taking the opportunity to move the comprehensibility of the code base forward.

Ultimately, code review is an early demonstration that the code successfully communicates with other developers. You really don't want whoever makes the next change six months down the line to go "what were you thinking" - especially if that ends up being you, because six months is plenty of time to turn it into "what was I thinking"; it's just much easier to test for that with someone else's eyes while there's still a chance to clarify things.

Review of Public Code

This discussion is mainly about in-team review of proprietary code. If you have a tightly-knit development team for an open source project, it would also apply there, but open source has some additional complexity: one of your sources of code to review is "complete strangers from the Internet." Nurturing a community around a piece of open source code tends to require generosity or at least starting from a non-adversarial stance - but at the same time, some contributors are malicious and you need diligence (and process!) to defend yourself from them.

It's a difficult line to walk, and all I'm trying to do here is acknowledge the difficulty in the hope that you also acknowledge it. It's worth familiarizing yourself with the XZ Utils backdoor (also known as CVE-2024-3094) which is a combination of supply chain threats, the difficulty of dependency proliferation (even with reasonable dependency management) and the kind of things your review process might catch.

This isn't advice on running a public project at scale - dealing with thousands of developers from a starting point of what is honestly outright paranoia isn't healthy or productive. Instead, I'd suggest looking at things like curl's contributor guide for what a good "public face" looks like, and how you can still strive for quality with many strangers instead of a closely knit team - commercial development almost never has the structures (or need) to wrangle that many contributors, but it's still a useful perspective, and you'll find many good ideas on how to treat other developers humanely (that you wouldn't necessarily come up with on your own for coworkers.)

How Code Review Works

Mechanically, code review is part of the process of making changes to code9. You start with what you believe is a working codebase, take a proposed change, and try to analyze the change to determine if it

  1. leaves you with a working codebase afterwards
  2. actually achieves the goal it was intended to accomplish.

This means you need reviewers to understand what "working codebase" means - "does it conform to our coding standards" contributes to that, and having coding standards include things like "we assume that code that isn't tested doesn't actually work" is a particularly strong one - but ultimately a reviewer needs to know what is changing and understand what the change is actually doing.

They also need to understand the intent of the change - not just as expressed in the code, but expressed externally by what the planned change was - which means you need some shared system that expresses those plans. How you do that is a project management question and out of scope of this article, but it should be clear that you have to have some record, even if it's a simple prose explanation at the top of the review request.

This is somewhat biased towards small changes, but that's the way to make them comprehensible - large changes can generally be broken down (and large mechanical changes, like coding-style consistency improvements, can be made more reviewable by keeping them isolated - it's easier for a reviewer to convincingly affirm that a change is, as intended, only mechanical, and thus has much lower risk to the codebase because it really isn't changing anything else.)

If you're reviewing an entire codebase, perhaps because your team is seeing it for the first time and needs to bring it up to standard so that they can start doing new work on it, you likely want to instead do an audit which needs a different approach.

Gatekeeping

Review is implicitly a gatekeeping action - "yes, this can go in" or "not yet, these improvements are needed." (In a functioning team you rarely need an outright rejection - that should be handled at the tasking level - any developer should be able to start from a position of confidence that the change is actually desired, code review does not substitute for team management.) That means it needs to happen at a point in the process where it can actually be used for that, though the details depend on your branching strategy.

Feature Delivery Cadence

Unless you're doing early "exploratory" development, you probably have some sort of time pressure10 for delivering features. Since review is in between creating features and landing them, deadlines put certain pressures on the review process.

For reviewers, it's important not to delay review until too close to the deadline - we've already established that you want to do it promptly while things are still top-of-mind for the developer, but you also want to allow the developer time to turn around thoughtful responses to feedback. It's also important to point out right away if you have any planned reasons that you will not be available to perform your review duties, like vacations, to allow the developer (or team lead) to find an alternate reviewer.

For developers, once your team has a normal review cadence established, it will become important to explicitly communicate about unusual review deadlines that legitimately need faster response (this should be rare, and is not just "hurry up so I can move on to the next project".)

Complex projects that are expected to interfere with other development (particularly by making things hard to merge) such as "updating to a new coding standard all-at-once" or "interface renaming" or "upgrading a major non-compatible dependency" need to be coordinated across the team, but you want to avoid having everyone stop work during the process. One approach is to do the work on a branch, aiming for a point where most other work can synchronize so there are few outstanding branches, then do a "final burst" of pulling those changes back in to the complex branch and reviewing and landing it, before freeing everyone up to continue.

Even simple features can collide, so while you don't usually want everyone on the team keeping track of everything that's going on, it's still valuable for anyone who's about to land something to know what else recently landed, to help understand where to look first for unusual behaviours. (The better your testing, the less of a problem this is!)

Feature Branches

One common strategy is the use of "feature branches" - you manage changes to the code base in units of features, and each new feature starts by branching the main codebase at a specific point (basically a managed copy, so you don't have to deal with competing changes from other developers in an uncontrolled manner. After all, they'll also be working on feature branches.)

In the straightforward case, you've implemented your new code on this branch, you've added new tests and they pass, existing tests also pass, and it's ready to review. The mechanics of setting up the review will usually involve selecting users and providing a description of the change (which may already be available as part of a higher level ticketing system that drives all of this at the project management level - it's still worth adding specific implementation commentary here, to aim your reviewers in the right direction.)

The primary thing keeping things from being straightforward is when some other feature branch has landed and has overlapping work with yours. (In the most obvious cases this will show up as conflicting code directly in the version control system11 but it's more likely to be something like an API or calling convention change where if you're lucky, tests will catch it.) If there's any other work going on in parallel, you'll want to do an update from master immediately before review, so that the reviewers are seeing what's actually landing (rather than what you started from) and so that you can show that "if you hit Merge right now, the tests would pass" because your branch matches what the post-merge main line would be. The primary defense here is communication, rather than tooling - your team members should have at least a rough idea of what related work is going on, and if the project is too big for that, it needs to be broken up into smaller components with boundaries/contracts that make it possible to have enough context most of the time.

Depending on how long your feature work takes, other features may land that change areas you rely on - in some cases, even features that you need to finish your work. This isn't just about avoiding merge conflicts at the end12, it's about work that you specifically need to absorb into your feature branch to continue making progress. Ideally, for things like this you'd also be a reviewer on the other changes as part of making sure they support what you're trying to do. This also applies pressure towards keeping features isolated and small - or at least breaking larger features into smaller components.

(As your team gets bigger and there's more churn, you may have to do this again after approvals; you may need to go as far as having a merge schedule, especially as you're approaching a formal release point where you want the option of delaying a feature rather than delaying a release, based on what external commitments are driving your deadlines.)

Multi-layer feature branches

When you are programming in a more exploratory mode (particularly common when bug fixing) you might not know at the start how the changes will break down; one feature might need another feature, or cleaning up some technical debt, or some unanticipated restructuring - the change may just be shaped such that breaking it into several pieces simplifies the review - "adding a system-wide feature" and "using the new system-wide feature" have different kinds of complexity and possibly different reviewer audiences. Perhaps you're the first one touching some old code and need to do a "cosmetic changes only" whitespace cleanup or style pass separate from your own less neutral changes.

There are a few approaches to this. One is to just split the additional work off as new blocking features, so that they start from trunk as their own independent feature branches; then you can just update your branch once they land (or depending on the version control system, you can repoint your branch at them until they land, then point it back at the main branch.) You can describe them as blocking in the review description, but it's an entirely social construct; you're peeling off some self-contained work that you'll use when it lands on the main line.

Slightly more advanced is to develop the "sub-feature" as an independent feature but pull the development branch into your primary feature branch as it evolves. This lets you use it as it develops, but leaves your branch with all of the complexity from both parts, until the sub-feature gets merged and it "disappears" from your feature branch relative to the main line.

Most expressive (but also usually requiring the most sophisticated use of the version control system) is the version where you start with the same feature/sub-feature branches, but "repoint" your feature branch to be based on the sub-feature. This means that your feature branch doesn't reference the main line directly, just indirectly "through" the sub-feature; likewise the mechanics of merging the sub-feature and then "losing" it (until you re-point your branch to the main line post-sub-feature-merge) have more moving parts. It does let the version control system represent more accurately the way you're actually thinking about the code; it can also reduce confusion about which changes are part of the sub-feature vs. the primary feature, though if that confusion is likely this might not be the right way to split them up.

Choice of Reviewers

Depending on the size of your team and the distribution of skills, you may have a bunch of options on reviewers. At the simple end of the spectrum, you may only have three developers and one worked on the branch, so your policy will basically be "one other reviewer" (or depending on your growth plans, start with a two-reviewer policy even if that means "everybody".) "Minimum two-reviewer signoff" is a popular starting policy - at worst, it's inspired by the two-key launch authorization policy; at best, it is an opportunity for reviewer mentoring, or for getting review coverage from different perspectives or areas of experience.

If you have a variety of experience levels (or areas) on your team, you might instead tag reviewers with particular areas of specialization - "John needs to be a reviewer for any embedded code", "Mark needs to sign off on anything with security implications". Ultimately you don't want to block on any single reviewer (people take vacations, after all!) so even if you start out with specialist reviewers you should consider pairing them with other developers specifically tasked with "levelling up" in those areas.

Another way to handle specialist reviewers is to have generally distributed "required" reviewers and add the specialists as "optional" reviewers, who are expected to chime in on anything specific to their expertise but aren't necessarily expected to review the entire change. (Review tooling rarely has good support for "partial" review like this, but policy can be enough.) You do have to watch out for getting specialist reviewers "stuck in a rut" where an area of the project can only be worked on and reviewed by two or three people - there's a tendency to develop tunnel vision or even shared but unrecorded knowledge of how the code works, which reduces the effectiveness of the review, especially the quality of the "can future readers comprehend this" measurement.

Depending on how explicit you are about scheduling your team's time, another option is to have a handful of reviewers on a change but count the first two approvals as enough. (If you do this, you generally want formal policy on reviewer response time, otherwise you will run into unfair effort-sharing issues that will harken back to college team projects and chore sharing - it can still work, it's just likely to cause people-management issues comparable to just being explicit about the assignments in the first place.) If your team is already comfortable with a top-down management hierarchy, then you can just assign reviewers. If your team is more self-organizing and team leadership is more about reducing friction and interfacing with other systems in the company, then review assignments can also be self organizing (though you'll want to keep an eye on spreading review experience and helping keep reviewers from getting too specialized.)

What Should Reviewers Do

The main thing a reviewer is expected to do is understand the change that's being made. Given that understanding, it is an oversimplification to say that they then need to "approve" or "reject" it - your team needs to have an understanding of what that means in the context of your project.

The main thing to do with the understanding is to compare it to what the change is supposed to be doing. This usually comes from your project management tooling, and should have been established before even starting the feature branch. It should also be reflected in any internal project documentation - which would ideally also be managed in the feature branch, but might be in a parallel external system instead (if it is, that should be part of the scope of the review as well.)

After the basic "it does what needed doing" review, the next question is "does it do it well". Does it meet your team's standards? Does it use your preferred coding idioms? Does it include sufficient testing (and does the review request include evidence of those tests working?) How about related documentation? If it has a UI component, have the translations been updated? Does it meet accessibility standards? If the change incorporates code from elsewhere, have the licenses and CVE history (or other security reputation) been reviewed? If user data is involved, does the GDPR apply? If the code involves new network connections, has there been an "attack surface" review, and has deployment documentation been updated?

If enough of these apply, you probably need a formal checklist - with as much automation as you can arrange. (One team built a literal can-i-merge tool that did a bunch of automated tests and quality checks on a trial merge, which was great except it took over half an hour to run, and still covered less than half of the suggestions above.) In some organizations this is (somewhat mistakenly) called "CI" tooling; it is integration-adjacent, but unless you explicitly manage a sequence of merges (which would let you test each feature as it is about to merge) you end up needing effectively simultaneous pipelines for each of your developers.
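As a minimal sketch of what automating such a checklist might look like - the check names here are placeholders, and a real runner would shell out to your test suite, linters, and license scanners rather than use inline lambdas:

```python
# Hypothetical "can-i-merge" checklist runner. Each check is a named
# callable returning True/False; real checks would invoke pytest,
# black --check, a license scanner, etc. on a trial merge.
from typing import Callable


def run_checklist(checks: dict[str, Callable[[], bool]]) -> bool:
    """Run every check, report each result, and return overall pass/fail."""
    ok = True
    for name, check in checks.items():
        passed = check()
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
        ok = ok and passed  # keep running all checks even after a failure
    return ok


# Example usage with trivial placeholder checks:
mergeable = run_checklist({
    "unit tests": lambda: True,     # e.g. run the test suite on a trial merge
    "style": lambda: True,          # e.g. black --check
    "docs updated": lambda: True,   # e.g. does the diff touch docs/ when needed
})
```

Running all checks (rather than stopping at the first failure) matters here: the developer gets the whole list of problems in one pass instead of a slow ping-pong of single failures.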

What Should Reviewers NOT Do

One important reason to formalize coding standards (even if it's as simple as "Run black on all python code") is that experienced developers will often have preferences that are not written down, "I know it when I see it", or just biases, and it's unfair to other developers to get review push back for things that there's no reason for them to have known. Put another way, "I wouldn't have written it in that particular way" is not a valid reviewer comment; if the reviewer actually has concrete beliefs behind the opinion, they get to write them down and build consensus around them first. Critique isn't an effective way to get elegance - you need to actually teach it.

There will also be issues around "how much additional testing can a reviewer insist on". One common dividing line is "if we have a test like that, insisting on using that technique in this area is fine; if we have to invent a new way to test this thing, there's a reasonable discussion to have about building that additional test infrastructure as a separate task" (which may still block the higher level feature.)

Reviewers should also not need to be running tests - the developer should be running them all along during their development, locally (as long as they include some convincing record of the test run) or as part of per-branch CI (which can just be a link to a result page.) Developers shouldn't be submitting changes that don't pass; the reviewers should be looking at "do these new tests represent the feature well, do they cover the interesting parts of the change" and not "do they actually pass if run". (If you're very early in adopting the testing that supports this kind of review, it may make sense for reviewers to run the tests as a cross-check that you're getting reproducibility right, but this should be an additional check on your process, not a replacement for developer-reported test results.)

Coding Standards as Crystallized Arguments

Coding standards are simultaneously trivial and vital. The common feature of most coding standards is that they apply to things where there isn't an obviously superior choice - or at least there's enough ambiguity that your team is wasting time arguing about it. At the same time, developers typically have the kind of independence and self-perceived authority that they tend to reflexively reject arbitrary choices handed down from on high. For new team members, having a standard to point at reduces friction and simplifies integration with the team - and if it seems arbitrary to them, you can point out that there was an argument phase and arguments got distilled down into the final resulting policy. It may help to have developers sign off on the standards explicitly, so you can show new team members "look, this standard has five of your peers behind it already, do you have a strong enough case that you think would convince all of them to change their minds?" Being able to point to past efforts to reach consensus helps it look less authoritarian, and makes it easier to convince newer team members that they'll be included in similar decisions in the future.

For current members when you're trying to establish a standard, it gives them a time and place to bring up their opinions and preferences, with the understanding that once you have a decision, it's time to move on to more important things.13

(It isn't necessary to completely suppress further discussion of the topic - just to get buy-in on the idea that the status quo is based on battles already fought and there needs to be significant new information to revisit the question.)

Mechanically, it's also good to produce worked examples of what the standard actually means - your project isn't public enough or big enough to have its own StackOverflow section, but an internal library of examples is a great introduction to how to apply a standard. Ideally these examples will look more like "use this interface" or "configure your editor this way" than "copy this boilerplate" - you will find a lot less resistance to standards if you put effort into making it straightforward to conform to them.
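A worked example in such a library might look like this before/after pair - the specific standard ("use pathlib, not string concatenation, for filesystem paths") and the helper names are hypothetical, just to show the shape:

```python
# Hypothetical worked example for an internal standards library.
# Standard: build filesystem paths with pathlib, not string concatenation.
from pathlib import Path


# Discouraged: manual separator handling is easy to get wrong across
# platforms and invites subtle bugs with doubled or missing slashes.
def config_path_old(home: str) -> str:
    return home + "/.myapp/config.toml"


# Preferred: Path objects compose with "/" and normalize separators,
# and the return type tells callers they're getting a real path object.
def config_path(home: str) -> Path:
    return Path(home) / ".myapp" / "config.toml"
```

Showing both forms side by side is the point: the reader learns not just the rule but what code it's meant to replace.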

Writing for Review

One reason to have everyone on the team involved in review is that it gives them perspective from "the other side" on what kind of things are harder or easier to review.

The common pressure here is towards minimalism - smaller changes are easier to review, and well-scoped changes - where everything that changed is clearly and directly related to the problem at hand - are more reviewable even when they are larger. This pressure is strong enough that you'll need some explicit back pressure on things like "you actually need more tests than that" or "this really does need to be better commented or simpler" or "yes, that's very Clever, but no one will appreciate that when they look at it again six months from now."

In particular, this gets in the way of "casually" cleaning up technical debt, even if said cleanup might make the feature change simpler. The sensible thing to do is to acknowledge those explicitly and turn those cleanups into tasks that block the features - in theory you can show that these paths get the features done sooner than without the cleanup, but that is a significant communication challenge (a worthwhile one - but systems like Agile, for example, make this particular benefit harder to express in conventional backlog management, especially if you discover the cleanup opportunity while doing the feature work.)

Depending on the history (and "maturity") of the code base, you can get some value out of "deliberate scope creep" - team-wide recognition that when a new feature exposes the need for broader changes to enable continuing work and reduce technical debt, it's worthwhile to take the opportunity to make time for the broader change as part of "paying down" the technical debt, or "recognizing that the technical debt has come due". This is basically another explicit anti-minimalism force - and you can develop a consensus in the team about how much pain technical debt is causing vs. tightening up feature work.

Keep Function-Neutral Changes Separate

As an example of how trivial details can cause disproportionate amounts of friction: ideally you'd start with a common coding style and never have to transition. And by "ideally" I mean "that's never actually happened" - so you're going to have to make the existing code base conform. Ideally you'd do that all-at-once, do a testing pass, and be done with it, but, well, "ideally" - in practice you're not going to make a global change because you don't trust your testing enough, even in the face of a self-evidently neutral change. Instead, you're going to encourage people to make these style changes when they touch old code for the first time - and then their branches will be cluttered with style work (which often touches most of the file) and end up harder to reason about and review, possibly causing more friction than giving up and not changing the style in the first place would.

You can compensate for that by requiring that neutral changes get their own reviewed branches, where you only change style and nothing else; you need to do these before your real changes, so you probably also want to fast-track them. You can do that reasonably safely - the reviewer isn't trying to confirm correctness or requirements, just "this really is only changing style and does nothing to what the code does, and is a neutral change" which only really requires language familiarity/competence and not deep familiarity with that part of the codebase. This also means that you can safely have junior devs do the review, and you can have senior devs do them rapidly without derailing them much. (You may be able to encourage senior devs to take on these reviews because they have a better appreciation for the benefits of the change, and may have even driven the adoption of the policy in the first place.)

Neutral changes do still have intent - primarily, to make the code easier to work on (by becoming consistent with more modern parts of the code base, and aligning with the standards the team has come to expect) but also to make those changes complete so you don't need to revisit them again in the same area. It's just that all changes of this sort also share a common intent - "don't change the behaviour of the code" - that is unlike any feature change. This does make these kinds of reviews simpler, since there's nothing to interpret.

With some languages and environments you can use tooling to prove that the changes are neutral - for example, in python you can use the ast module to turn the before and after versions of the file into an abstract syntax tree and show that those haven't changed. Not a perfect test, but for this kind of thing, a fast cheap test that leaves you with a couple of bits that need manual explanation gives a better return on investment than full scale tooling - if you're going to build that much tooling you can probably build the tool that makes the style change directly, which you can then review/audit and then trust directly. You probably don't want to go too far down that path unless your product already has you building software-modification tools, though.
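The ast approach can be sketched in a few lines. This compares the dumped syntax trees of the two versions (ast.dump omits line and column information by default, so pure formatting changes compare equal); note that comments vanish at parse time and docstrings are part of the tree, so a reworded docstring will - correctly - register as a change:

```python
# Sketch: check that a style-only edit leaves the Python AST unchanged.
import ast


def is_ast_neutral(before_src: str, after_src: str) -> bool:
    """Return True if two Python sources parse to identical syntax trees.

    ast.dump() excludes line/column attributes by default, so whitespace
    and line-wrapping changes don't affect the comparison. Comments are
    discarded by the parser; docstrings are NOT, so editing one counts
    as a real change.
    """
    return ast.dump(ast.parse(before_src)) == ast.dump(ast.parse(after_src))


# Example: re-wrapping an expression is neutral...
assert is_ast_neutral("x = (1 +\n     2)\n", "x = 1 + 2\n")
# ...but changing a value is not.
assert not is_ast_neutral("x = 1 + 2\n", "x = 1 + 3\n")
```

In a review workflow you'd run this over each file's before/after pair from the branch diff; anything that fails the check is exactly the "couple of bits that need manual explanation".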

Social aspects of Code Review

Code Review is ultimately a structured form of communication among developers, which means that there are ways to explicitly manage that structure.

Senior reviewer vs Junior

Specifically encourage juniors14 to ask about testing to get them engaged; it can serve as a "safe" question that they can ask when reviewing code from senior devs, since all of you are trying to reinforce a culture in which everyone is expected to respect the question wherever it comes from (and you don't need that much familiarity with the code base to be able to tell if tests are present.) Also, a test suite often serves double-duty by explaining the code from a different angle - the test calls are also "worked examples" of the interface.

Another way to keep juniors from falling into the "this code is from a senior developer, I barely understand it, it must be good" trap is to encourage the idea that a review is a reasonable place for asking good questions about the code. The review process should still emphasize finding problems, but readability and clarity are core components of code being reviewable in the first place. A reviewer who is still becoming familiar with the code can often have clearer judgement of the quality of an explanation, since they felt the need for it in the first place.

In the other direction - remind seniors that the clever idea they came up with during the review is not usually superior enough to derail the review. It's one thing if the code under review is wrong - even if that's merely of the form "we have a tested and working API over here that already does this, can you use it instead of duplicating it here?" - but if it's just a matter of style, and the style isn't codified and agreed on by the whole team, then that's not necessarily enough to push for it. (Using this to launch a conversation about codifying that detail as a shared standard is fine - but it probably belongs in some other forum, not this particular feature review.) An explicit mentoring relationship can change where this bar is - but then the discussion can and should happen in private pre-review, not in "public" (in front of the whole team.)

Tone

Whether or not you're a stereotypical "cranky developer" it's worth considering tone extra carefully in review discussions - because being judgmental about the code is part of the task, it's easy to slip over the line into being judgmental about the developer which is harmful to team relationships.

It's also useful to consider that review comments are usually preserved, not just the code - and if they aren't, that's a tooling issue that's worth fixing - when doing "archaeology" (or "spelunking") on the code, especially when debugging a regression, it's pretty valuable to be able to resurface the discussions around when a problematic change was made. Did this weird case come up at the time, and was dismissed? Great, we were almost sufficiently foresightful, maybe recalibrate that reflex a little - especially if "it would be inconvenient to write a test for this" was in the arguments, turning it into a concrete "Test Better!" argument can be valuable. Or maybe no one thought of it - possibly that means you need to "review a little harder", or maybe it was Just That Strange; either point is interesting to include in a post-mortem.

Interactivity

Respond promptly, in both directions. The whole point is keeping things "early in the funnel" while the developer still has things in their head and can answer questions; at the same time, it's important for the developer to come back with either "good question, here's why that's not a problem" or "good question, it's going to take me a day or two to improve this and have you look at it again." Most code review systems don't have good affordances for this, so encourage explicit communication that makes it clear whose court the ball is in.

You want your team to understand that reviewing the work of others is as important as their own work - especially if you care about feature "velocity". While this does require balance, review is happening at the point where something is "almost done" and giving it prompt attention means that it can land that much sooner, and you can both move on to the next thing (or return to your current thing) more quickly.

One of the workflow challenges in being responsive is that it means review is an interruption - not as severe as being on pager duty, but still something difficult to plan your time around as a developer. One conventional approach is to just observe how much time you spend on review and adjust the amount of review you take on based on that - estimating review time has some of the same problems as estimating software in the first place, but you can generally develop the ability. This is one flaw with the common approach of assigning reviewers at review time - the reviewer can't plan as much as they might like. If you have a review cadence, being explicit about expecting developers to be spending a standard amount of time on review at the end of the cycle can help by providing some structure. If you have particularly specialized reviewers, making sure they know up front (when a new feature is started) that a change is coming up in their specialty can also help - even if they aren't explicitly engaged with the work (though if you can get early design feedback from them, that's probably a good tradeoff against doing their own work in that area) just having a "heads up" can help them keep context in mind and make jumping in to review a little bit less disruptive.

Another approach (sometimes inspired by having a "review problem") is to actually schedule reviews as meetings. This makes the time more explicit and high-bandwidth, since the reviewers and the developers are mostly prevented from engaging in other distractions; it can backfire if the reviewers quickly turn up a handful of things that need developer work before continuing, especially if there's pressure to Fix Them Right Now during the meeting. It's important to be aware that "coding with people looking over your shoulder" is a distinct skill that's rarely explicitly developed - it's more of a teaching or rehearsed presentation activity - and unless it's already a specific mentoring situation, you'll usually get better results from letting the developer go off and actually think about the revisions, particularly if the review has turned up additional (hopefully small) problems to solve. It's also important to understand that meetings have a lot of overhead regardless of what they're for - the "review meeting" approach won't be efficient but it might serve to improve shared alignment, or to focus extra attention on particularly critical cases where the expense is justified. (There's also likely to be a lot less resistance to additional meetings if everyone is clear that the goal is to stop having them once communication has improved.)

The Review Lifecycle

The review process doesn't end with the first round of feedback - while it's important to converge quickly, there's still a step of making changes based on feedback and then getting those reviewed. Part of it is making sure the developers involved can communicate efficiently - but part of that communication is the changes to the code itself.

As the team and the code mature (and practices like Writing for Review take hold) you may find yourself actually needing very few changes in response to the initial review feedback. This goes much more smoothly if reviewers can see those changes as increments. Most version control systems support this in straight-line development, but among modern git users there is a popular development pattern of "cleaning up" your changeset using rebase and force push. Effectively, this makes the changes look like the final form is what you intended all along - on the one hand, this is rewriting or "retconning"15 history; on the other hand, it does mean that when you look back at the code there isn't any distraction from the incremental problem solving, and only the final reviewed change is preserved as history. Applied to an in-progress review, though, this creates an immediate problem - the reviewers have to look at the whole changeset from scratch, instead of just seeing the incremental change as incremental. Not only is this inconvenient, it leads to lower quality review, since "oh, I've seen that bit before, it was ok last time" is a far less rigorous (and possibly untrue) judgment than "that code is literally unchanged because this incremental diff doesn't touch it".

There are ways to mitigate this16 with tooling improvements, but the simpler approach is to just declare that clean up - and force push in particular - is only allowed on private branches, never shared branches, and that once review has begun, your branch is not private. Having to use different workflows before and after a branch becomes shared can be error prone or at least tedious, so it can be useful to be able to point to "review friction" as a team-wide motivation to put up with it.
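A minimal sketch of the "no force push once review has begun" policy: review feedback lands as additional commits on the shared branch, so reviewers only need to look at the deltas. This uses a throwaway demo repository; the commit messages are invented, and `git init -b` assumes git 2.28 or newer.

```shell
# Throwaway demo repo; in real life this is your shared feature branch.
set -e
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email dev@example.com
git config user.name Dev

# The change that originally went up for review:
git commit -q --allow-empty -m "feature: initial implementation"

# Review feedback arrives; instead of rebasing and force-pushing,
# record each revision as a new commit so reviewers see only the deltas:
git commit -q --allow-empty -m "review: rename helper per feedback"
git commit -q --allow-empty -m "review: add missing error check"

# History preserves the review conversation as incremental changes:
git log --oneline
```

The point of the sketch is what it avoids: no `git rebase -i`, no `git push --force`, so the reviewer's "what changed since I last looked?" question has a cheap, trustworthy answer.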

In open source projects one occasionally finds a pattern of keeping all the commits for review but doing a squash merge. This keeps the development visible at review time but doesn't keep it long term - the theory is that this gives novice contributors more freedom to flail, since it won't reflect on them at all after the contribution is merged. (It's sometimes suggested that this makes git bisect easier to use, but using --first-parent is enough if you're keeping the usual discipline of the main line always passing all tests.)

Conclusions

Code Review is a process that helps you get more eyes on your team's code, but there are lots of details that go into a good workflow for this. It's an extreme case of the belief that code is as much about communicating with people as it is about communicating with machines, which means you need to be attentive to social friction while developing it. While it requires a reasonable version control workflow, it doesn't need all that much additional plumbing - it mainly needs more communication.


  1. Jakob Nielsen's 2000 research was specific to qualitative usability testing, but the model intuitively fits code review as well. 

  2. The flexibility for non-developers to express things in code is often a valuable business practice - but it's also a hint that overall, software development perhaps isn't as much of an engineering discipline as we insist it is. 

  3. There wasn't even .old or date-based history; just one set of files that got changed by two or three people (occasionally yelling at each other.) RCS at least let people "lock" the file they were changing, which meant the yelling happened before anyone lost conflicting changes instead of after. Full disclosure: this was a 1990 problem, not a 21st century one - but inexperience sometimes looks like time travel. 

  4. For a short time around 2015, Atlassian Bitbucket had a big advantage over GitHub in the not-yet-off-the-ground startup market - a free account could have a private repo. What it didn't have was GitHub's strong bias toward making changes via pull request, so everyone was just checking out master and resolving conflicts rather than avoiding them - which, honestly, isn't too much of a distraction when you have three people sharing a desk. 

  5. A much-ignored point of consistency is tabs vs. spaces. While developers tend to be the sort that want personal flexibility, this is one of those areas where you must make a choice (and that choice is spaces) and enforce it (VC hooks if you can, plus shared default configs for the editors and IDEs your team prefers), because trying to use both will hurt your team - and no one really prefers one to the other that strongly, though they're likely to have been burned in the past by not having nailed it down. 

  6. Google famously goes as far as to have a company-wide "language skills and standards" level among reviewers, called "readability", specifically to encourage raising the quality bar for all internal Google code. I'm using the term more generally, but it's interesting to see where Google puts emphasis. 

  7. See also Code Retirement Pages

  8. A common conversation on teams that use something basic like Confluence or Google Docs, without someone tasked as "information architect" or "gardener", is for discussions about documenting something to go "well, where did you look for it? I'll put it there" - "I just searched for these keywords" - which doesn't help to discover structure. It turns out to still be a useful conversation - just make sure to add those keywords explicitly in the document. You could think of it as "old-school SEO", but it's just making sure you include literal keywords that are relevant to your topic even though they're not actually part of how you're discussing the solution. 

  9. This sounds like "Change Review" but that's a broader concept applied to systems and configuration rather than simply code (often hiding the "code review" part as "change the system to use this specific release" that was itself reviewed before getting released from the software organization.) 

  10. Time pressure can be real ("we need to be in compliance with new regulations by 1 January"), fictional (Agile "sprints"), or self-inflicted ("our marketing campaign promises these features this quarter but they never talked to engineering about it"). In rare cases (XP's "Release Train" concept from the late 1990s) they can be entirely decoupled, with features delivered as they qualify, but that usually requires very tight coupling with a single user rather than having a product. Even without direct customer pressure on timing, if you have a product that has expensive tests (like certification audits, or mechanical performance tests of robot systems) you might reasonably have a group performing those tests on a limited schedule that needs a stream of releases that are "worth testing", which provides the same kind of back-pressure on the review and merge process. 

  11. Atlassian Bitbucket used to aggressively render "this won't merge back cleanly into the target branch" as part of the review mechanism, but they dropped it at some point and you're probably left to your own discipline - likely pulling your main branch into your feature branch and re-testing before landing it. 
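  The "pull your main branch into your feature branch and re-test before landing" discipline that footnote describes might look like the following sketch (throwaway demo repo; branch and file names invented, `git init -b` assumes git 2.28+):

```shell
set -e
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email dev@example.com
git config user.name Dev

echo "v1" > app.txt && git add app.txt && git commit -qm "base"

# Feature branch diverges...
git checkout -q -b feature
echo "feature" > feature.txt && git add feature.txt && git commit -qm "feature work"

# ...while main moves on underneath it:
git checkout -q main
echo "hotfix" > fix.txt && git add fix.txt && git commit -qm "mainline hotfix"

# Before landing: merge current main into the feature branch and re-run
# your tests there, so any conflict or breakage surfaces on your branch
# rather than on main after the merge.
git checkout -q feature
git merge -q -m "sync with main before landing" main
git log --oneline
```

  After the sync merge passes your test suite, the eventual merge back into main is guaranteed to be trivial.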

  12. You also never force push to the mainline, but that's because ideally you don't commit to the mainline at all: you get review approvals on your branch and merge it and that's all you ever do to the mainline. (Most professional grade systems will let you outright forbid this by policy, or even only allow a specific "repo owner" to actually ever perform merges - but the fact that it is mechanically disruptive to everyone else sharing your repo and they will come after you with torches and pitchforks will usually encourage self-policing.) 

  13. This is related to the military concept of "under orders" - a well run team will support giving a lot of direct feedback during the planning stage, but when a plan is executed, it's executed pursuant to authority - not inflexibly, but without pushback about preferences or committed decisions. (Not that software teams can be run to military standards - "herding cats" has been the descriptive term since at least the 1990s - but it is helpful to see that arguing is friction and there's a time and place for it.) 

  14. While "Junior Developer" usually means someone who has relatively less career experience, in this context it also includes someone who has only recently had their first exposure to this team or this product. 

  15. Retroactive Continuity - that is, failing entirely at preserving continuity but producing something that looks (from the present) as if it were a coherent story all along - is a staple of the comic book and comic book movie industries. The term is usually pejorative. 

  16. This author hasn't experienced any beyond "don't do that" but is eager to hear of them. 

The bug funnel isn't a practice or tool, it's a perspective that applies across a range of development tasks. The concept is a drastic oversimplification of something Steve McConnell calls the "Cone of Uncertainty" in Software Estimation: Demystifying the Black Art (we'd both be well served if you paused here and read it before you return to my storytelling - it's one of the few books that I've had multiple startups go out of their way to buy for Tech Leads.)

The main point is that the cost of fixing a bug increases dramatically with the "distance" from the initial development of the bug. This directly implies that it's worth spending money early on the process, so you're not forced to spend even more further down the line. Even calling it "down the line" points out that this is an assembly line manufacturing metaphor, and you should be wary of any analogy between physical systems and software systems, so let's look at some actual software situations to make the case more clearly. These are organized by increasing time to resolve and number of parties involved.

Seconds to Minutes, One person
A developer mis-types a keyword and their IDE highlights it, the developer fixes it directly, ideally without even breaking their train of thought.
Minutes to Hours, One person
A developer uses an obsolete API and a quick-check pass (linter1, unit tests) flags it - immediately after they've written that chunk of code, and haven't yet moved on to the next piece.2
Hours to Days, Several people
A low-level feature gets implemented, with tests, merged, and then when another team goes to use the feature it turns out that something about the design was unsuitable.3
Days to Weeks, Many people
A developer makes a simple change to something that interacts with certain specialized hardware that they don't have available for testing, and the resulting bug isn't found until much later, by whole-system testing from the dedicated QA team, leading to an open issue long after the developer has moved on from the change itself. At this point it has also derailed the QA team, and possibly the release engineering team if your QA team is under-resourced and only running pre-shipment approval tests - and yet, you still consider yourself lucky that you caught it before it shipped to anyone.
Weeks to Months, Many people and Multiple companies
A third party dependency update introduces an incompatibility that isn't noticed in QA but causes a bad interaction or crash on a customer network. Even identifying the problem here is very expensive, both due to the distance between cause and effect, and due to the exposure of the problem on a customer system - causing commercial embarrassment and being harder to diagnose in the field instead of in the lab. This is the level where, if you (or your customer) are interesting enough, there will be public post-mortem articles, possibly hostile ones...

There are ways to mitigate these individually - better field introspection tools to make the customer-visible case less expensive to resolve, for example, and there is always a market for developer tools that claim to reduce errors. These tools tend to reduce the costs within one level, and that is valuable, but they don't change the overall shape of the funnel: catching problems earlier is still cheaper, and moving problems closer to the developer still has a huge benefit.

Why this matters

Recognizing the structure of the "funnel" is a good way to see the value of release engineering and infrastructure work like continuous integration, pre-merge linting and unit tests, and tooling that lets developers run whole-system tests themselves.

All of these help find problems faster, and push them "closer" to the developers (who, after all, implemented the problems in the first place.) The "funnel" model should make it clearer why you want to push in that direction.

Opposing forces

"Move fast and break things" is a prototyping technique for when you're trying to see what shape something is - so you're building it as cheaply as possible and there's an exceptionally high chance of throwing it away. It's generally not customer-facing - most of your users should be on your team or otherwise dedicated to evaluating the proof of concept.

The main problem with this approach is that not everyone involved understands that that's what you're doing - you're expecting "OK, this gave us an understanding of the problem, now let's design something real" and what you get instead is "everyone was happy with that, so let's just ship it". The bug funnel can serve as a formal structure to explain why "fixing it later" is extraordinarily expensive.


  1. Traditionally, lint was a tool that picked at little syntactic concerns in C code, but the name has mostly stuck for the category of fast ad-hoc code-checkers. Do not underestimate the power of a simple grep for scanning code for "APIs we shouldn't call any more" that gets run as part of your automated tests... 
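  As a sketch of that grep-as-test idea - the directory layout, file contents, and choice of banned call (gets, long deprecated in C) are all illustrative:

```shell
set -e
cd "$(mktemp -d)"
mkdir src
cat > src/legacy.c <<'EOF'
#include <stdio.h>
void f(void) {
    char buf[16];
    gets(buf);  /* obsolete and unsafe */
}
EOF

# A one-line "linter" your CI can run: complain loudly if any source
# file still calls the banned API (in a real CI job, exit nonzero here).
if grep -rn 'gets(' src/ ; then
    echo "FOUND banned API: fix before merging"
fi
```

  The appeal is that adding a newly-deprecated API to the check is a one-line change, with no parser or tooling upgrade required.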

  2. What about code review? Anything that can be automatically tested before you submit something for review should be, so you don't waste reviewer time with things that you could have already discovered to not work. 

  3. This is often an example of how unit tests aren't enough; while they are important, they don't replace practical use-case tests that show off the business logic.