Doesn't everyone already do this?
One would expect there to be pretty good practices around testing, but instead it is often an afterthought; sometimes, even with resources devoted to it, there's a narrow focus on only one of many kinds of testing while other areas are ignored entirely.
Keeping tests near relevant code
We need to start with a simple principle of debugging: "What you changed is what you broke." It's sufficiently rare that something truly interesting and "at a distance" is your problem - we tell stories about them because they're interesting but also because they're rare. To a first approximation, what's wrong is not the JVM, it's not the kernel1, it's just the most recent thing you changed.
The hardest part about applying this rule is that if you don't have local enough (and prompt enough) tests, you may not even know that you just broke something - you really need tests that are close to your changes (possibly even before your changes, as we'll discuss below.) Sure, you can narrow things down with version control later, if you commit often enough, but you want at least some basic checks as you go.
The Testing Funnel
Since you can't couple everything that tightly, we end up with another "funnel" - by analogy with the bug funnel, there is also a "testing funnel": a hierarchy of tests that goes from needle-precise unit testing of particular lines of code up to human- and system-scale interaction.
- Unit Tests (function scale up to object scale, usually involves one developer)
- Unit tests apply to small amounts of code, and attempt to answer the question "Does what I wrote do what I intended?" Writing them in advance can serve as a stronger guarantee of this. They also aid in review of an interface design - the reviewer can look at the tests as examples, "Does it make sense to call these functions this way?" and secondary documentation, "So we'd do this thing by making this call, does that make sense? Does this expose obvious edge cases? Does it encourage the next developer that comes along to do something that's obvious but wrong?"
- Internal API/contract tests (interface scale - which can have a lot of complexity "behind" it but the interface itself is bounded - team scale)
- If you're providing a user-facing API or some other concrete promise to your customers ("this interface can handle 100 values at a time", "this data field can handle any unicode string containing only characters with general category Letter") it's worth expressing all of those as tests so you don't ship anything that breaks them in front of live customers. (These are still development tests, meant to be more exhaustive, rather than operational/delivery tests that only probe that certain paths work.) These are often as narrow as unit tests, but they're expressing a relationship with consumers rather than developers.
- External interface tests (interface scale but you probably have many of these - team scale if the team is agreeing on the overhead of using these particular interfaces)
- If you're not going to aggressively review third party library updates, you probably also want API/contract tests facing outward and testing the interfaces you consume from those libraries or services. These are often the Internal API tests that you wish the providers would be running themselves. (External service tests might fit better as "delivery tests" (see below) since you likely don't control their updates - even for well behaved services that use `/v1`/`/v2`-style versioned API URIs, you don't necessarily learn that `/v1` has been deprecated until it's actually turned off2, which won't wait for your release schedule.)
- Requirements/"business rules" tests (interface scale up to entire product scale - not just the team but also external "stakeholders")
- Requirements testing is usually an attempt to translate external documentation (including meeting minutes) into code, in order to express that "The thing we agreed on in the meeting is actually in the product, if this test passes." This helps review the concrete understanding within and outside the team. While there's value in showing "we think we're finished", there's even more value in having the tests reviewed by other involved parties, especially if they don't think the test describes the agreement; it's a lot easier to argue with something "down on paper" than a blank page.
- External Quality Measurement tests (product scale - CVE scans, audits - usually from external sources with an external audience)
- These are usually tests from outside sources that are somewhat independent of the code itself - while there are more tools available to search for security anti-patterns directly, simply checking your dependencies and versions against published lists of CVEs and USNs is useful. Doing this as part of development testing isn't enough but it's a good start; you really want to be getting some kind of feed of advisories, filtering that against your actual dependencies, and responding directly - in practice this gets done for the most attention-getting advisories3 and the thorough checks are only done at release time (which is fine for producing an "everything is fine" report, but isn't early enough in the pipeline to actually react and inform customers as needed.)
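The unit-test level at the top of the funnel can be sketched concretely - the `slugify` function and its tests here are hypothetical, using Python's stdlib `unittest`; note how the tests double as usage examples for a reviewer looking at the interface:

```python
import unittest

def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, hyphen-joined words."""
    return "-".join(title.lower().split())

class TestSlugify(unittest.TestCase):
    # "Does what I wrote do what I intended?"
    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        # Leading/trailing/repeated whitespace shouldn't leak into the slug.
        self.assertEqual(slugify("  a   b "), "a-b")

# run with: python -m unittest <module>
```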
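The "general category Letter" contract mentioned above could be pinned down as a development test along these lines - `accepts_name` is an invented stand-in for whatever validation the real field uses:

```python
import unicodedata
import unittest

def accepts_name(value: str) -> bool:
    """Hypothetical validator for the promised contract: the field takes
    any non-empty string whose characters all have Unicode general
    category Letter (Lu, Ll, Lt, Lm, Lo)."""
    return bool(value) and all(
        unicodedata.category(ch).startswith("L") for ch in value
    )

class TestNameFieldContract(unittest.TestCase):
    def test_accepts_any_letters_not_just_ascii(self):
        # The promise is *any* Letter, so probe outside Latin-1 too.
        for name in ("Alice", "Søren", "李明"):
            self.assertTrue(accepts_name(name))

    def test_rejects_nonletters(self):
        # Digits, punctuation, whitespace, and empty strings are
        # outside the contract.
        for name in ("Alice2", "a-b", "a b", ""):
            self.assertFalse(accepts_name(name))
```

Because these express a promise to consumers rather than an implementation detail, they should keep passing even if the validation code is completely rewritten.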
Outside the funnel
While the space of things you need to test rapidly grows beyond the scope of a single developer testing their own work, there are also test opportunities that work from a different angle entirely.
- Performance tests
- If you're promising the customer anything with a rate or time, you should be able to demonstrate that - and even if they are slow or expensive, these tests should block merges so that you can immediately show that "this specific change needs more work because otherwise it breaks that contract", instead of having to reverse engineer the problem later. But even if you're not making any such promises, you likely have an internal constraint of "it works on the hardware we have now but haven't budgeted for any more" (or at a smaller scale, "we haven't run out of google cloud bootstrap points, and don't want to start spending Real Money") and so you still want some warning of performance changes, even without an explicit performance story.
- Delivery/integration tests
- These are operational tests, used to confirm that the delivery is operating correctly (and not just "thrown over the wall".) Ideally you'd work with your operations team on these but you'll want to show that they work in your development and staging environments, and then make them part of your production environment "health tests" to provide some feedback that the system is working as fielded. (The core ideas of DevOps can be oversimplified as extending good software practices to include the actual deployment and use of the software; "tests" in that context expands to include deployment-specific metrics for whether the system is working as intended, measured more on a performance continuum and less on a binary works/doesn't work scale.)
- "Other" tests
- This list isn't exhaustive - anything that you have working and you'd like to stay working should have a test, even if it doesn't fit neatly into these categories. Customers find it far more frustrating if you fix something and then let it break again than if you don't fix it in the first place4, so you're going to want an entire category of "don't let this regress because of this customer" tests. (If you're shipping different things to different customers, consider fixing that, but maybe start by running all of the customer-related tests for everything, not just for the inciting customer.)
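A minimal sketch of a merge-blocking performance check, assuming a hypothetical `handle_batch` operation and an invented "100,000 values in under a second" promise; a real version would want warmup runs and generous margins to stay robust on loaded CI machines:

```python
import time
import unittest

def handle_batch(values):
    """Hypothetical operation covered by a rate promise."""
    return [v * 2 for v in values]

class TestThroughputPromise(unittest.TestCase):
    def test_100k_values_within_budget(self):
        start = time.monotonic()
        handle_batch(range(100_000))
        elapsed = time.monotonic() - start
        # Failing the merge immediately points at the offending change,
        # instead of leaving the regression to be excavated at release.
        self.assertLess(elapsed, 1.0)
```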
Coverage
Coverage isn't a kind of test, it's instead about measuring testing itself - "how much of the code is run by this test suite", "how much of this code is actually run in actual product usage." It's not an absolute good, but more of a directional improvement indicator - not so much a productivity metric as a ratchet that should always be increasing. This doesn't mean you can't "clean up" tests ever - coverage improves if you get rid of untested code, too. (It is plausible that deleting any code likely deletes bugs, but untested code has had even less attention so it's even more likely.)
Measuring coverage is also a way to cure some illusions about how your own codebase works. In the mid 1990's it was popular to compile gcc with itself, on the theory that this tested many paths through the compiler (and the simultaneous but contradictory theory that the final version would be "better" because the code had been generated by gcc instead of the potentially buggy vendor compiler.) Coverage metrics eventually showed that gcc building itself only used about 20% of its own code, and that this really wasn't a particularly convincing test. (At the time, there was also an enormous test suite based on expressing every detail of the ANSI standard as code, and this was much more effective, at least as a regression test.)
Assertions
Assertions or asserts (named for the language keyword that implements them) are sometimes treated as equivalent to unit tests - in practice they are unit tests that are run at some unusual time (in dynamic languages, they run on module import; in compiled languages, they can sometimes run at compile time but more often only at eventual run time.) It's generally better practice to convert them into proper tests, so they're part of your testing and test reporting, and they're reviewed as such. The common anti-pattern is to end up having only the assertions run, because they're forced to run as a side effect of running the code, while the more formal unit tests you have aren't getting run at all. Making them Actual Tests puts a little more pressure on making sure your Actual Tests are run sufficiently often.
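As an illustration of that promotion (the timeout table here is invented): the module-level assert only runs as an import side effect and never appears in a test report, while the test version is visible, reported, and reviewed like any other test:

```python
import unittest

TIMEOUTS = {"connect": 5, "read": 30}

# Anti-pattern: this check only runs when the module happens to be
# imported, and its result never shows up in any test report.
assert all(v > 0 for v in TIMEOUTS.values())

# Better: the same check as an Actual Test.
class TestTimeoutTable(unittest.TestCase):
    def test_all_timeouts_positive(self):
        for name, value in TIMEOUTS.items():
            self.assertGreater(value, 0, f"timeout {name!r} must be > 0")
```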
Moving towards "enough" testing
"Working Effectively with Legacy Code" gives a good framework for thinking about the shape of the testing problem. Feathers specifically defines legacy code not as old code but as "code that doesn't have enough test coverage for you to change it with confidence", even if you wrote it in 2025...
Headlights
"Code without tests is like driving without headlights" - you can make progress at driving speed, or you can stumble along slowly in the dark. Your tests are both an alternate expression of your intent for the code and a self-enforcing contract with your past self - even if you don't look at them, as long as you run them you have assertions that your ground remains solid.
Tooling to make it easier to write tests is important
You certainly don't want your developers to be evaluating new test harnesses when they're trying to do feature work - while "how do I write this code in a testable way" is a good thing for them to be considering, this is a lot easier if they have context and existing practices that enable this.
This goes beyond choice of framework - you will get a huge amount of mileage out of developing tooling that supports testing the distinctive kind of things your project does - at very least, it means when someone has to estimate development effort for a feature, it's a lot easier for them to say "testing this thing is very much like testing that other thing, so we can make a convincing estimate for the effort" than if they need to put more effort into inventing "how do I test this" than they do into "how do I solve the actual problem".
Consider tests that are shaped like your interfaces - if you've got a user-facing web interface, Playwright is available in most popular languages. If your tooling is mostly on the command line, Cram is a doctest-style tool that lets you mix Markdown explanations with chunks of shell command lines and result-matching patterns (which also lets you write self-checking documentation, if you write the tests with a narrative rather than as exhaustively following what the code does - this mode is also effective for turning a customer complaint into an acceptance test while keeping a clear relationship between the two.)
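In the same self-checking-documentation spirit, Python's stdlib doctest lets prose examples double as tests - the `parse_version` function here is invented for illustration:

```python
import doctest

def parse_version(s: str) -> tuple:
    """Split a dotted version string into integer components.

    The examples below are documentation *and* assertions:

    >>> parse_version("1.2.3")
    (1, 2, 3)
    >>> parse_version("10.0")
    (10, 0)
    """
    return tuple(int(part) for part in s.split("."))

# run with: python -m doctest <module> -v
```

If the documented examples drift out of sync with the code, the test run says so, which keeps the documentation honest for free.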
Fast-fail tests/Relevant tests
Getting a failure quickly means a developer doesn't have to come back later and recreate their mental context to work on it. (Unfortunately this may need to be nearly instant - as in, within a few hundred milliseconds - to actually feel fast enough; but in practice, seconds are better than minutes, and anything over ten minutes is probably only going to get run by choice as part of submitting for code review, and likely not much sooner than that.)
At the same time, you can't just skip the "long" tests - the reason you wrote long, rigorous tests was to make sure some aspect of the product continues to work, and you want to find out as soon as you can after the change that breaks it - not a month later when a release is going out. Unfortunately, higher-level tests are often both more effective at finding flaws and slower, so you really want to run them on most changes, yet they're more costly when you do.
Forty-five minutes of testing that checks something you "know" you didn't change is important at some level, but bad for morale - it really is valuable to the team to put the work into doing some kind of test dependency selection (and improving module isolation!) so that you can safely avoid testing things that you really couldn't have influenced. (At the same time, some part of your workflow should recognize when that can't be perfect and that running global cross-functional tests can find horrible and obscure problems - and you want to find them early just like with any other bug - you just need to find ways to limit the day-to-day burden to the team.)
Not much you can do but acknowledge the tradeoffs and keep track of what tests you do choose to run less often - and how often you find out later that you missed something valuable that those tests could have caught sooner; then you can use that feedback to re-balance which tests you run when (or even to convince the team that additional delay pays off.)
Complexity/Fragility
Flaky tests are really bad for morale, and just as inconsistent enforcement reduces respect for the law itself, tests that fail your build because of nothing you've done discourage you from running, let alone trusting, a full test suite.
This does mean that making tests robust the first time around is worth the trouble, while at the same time being difficult to enforce; explicitly budgeting effort to make new things "testable" is one way to handle it culturally.
Being able to say that a feature isn't done if it has tests that get in everyone else's way is theoretically satisfying but doesn't work very well in practice - since flaky tests tend to turn up long after the developers have moved on to other tasks. Instead, you need to have experienced engineers associated with the development of tests, to recognize risky behaviors (the kinds of things that lead to race conditions, for example.) This also includes having the authority to point out things like "calling sleep(1) in a test is probably bad and doesn't do what you think it does5", and to put the effort in to define race-free interfaces in your product, even if the only immediate need is to enable correct tests.
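One way to get that race-freedom is to let tests inject the clock instead of sleeping; a sketch with an invented `TokenCache` whose entries expire after 60 seconds:

```python
import time
import unittest

class TokenCache:
    """Hypothetical cache with 60-second expiry; the clock is injectable
    so tests never have to call sleep()."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._stored = {}

    def put(self, key, value):
        self._stored[key] = (value, self._clock())

    def get(self, key):
        value, stamp = self._stored[key]
        if self._clock() - stamp > 60:
            raise KeyError(key)
        return value

class TestExpiryWithoutSleep(unittest.TestCase):
    def test_entry_expires(self):
        now = [1000.0]                      # the test's own clock
        cache = TokenCache(clock=lambda: now[0])
        cache.put("a", 1)
        self.assertEqual(cache.get("a"), 1)
        now[0] += 61                        # "advance" time instantly
        with self.assertRaises(KeyError):
            cache.get("a")
```

The test runs in milliseconds, never flakes under load, and the injectable clock is exactly the kind of race-free interface worth building into the product itself.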
Coverage
Coverage is difficult to improve and hits diminishing returns fairly quickly. It's more effective, and more difficult, to start out with high standards for it on greenfield work than it is to impose them later. Also, until a lot of your code actually is tested, the interesting quality metric is how much of it isn't tested at all, rather than how well the rest of it is tested.
If you do go on a campaign to improve coverage specifically, one Neat Trick is to make sure it's clear to the team that removing untested code is a valid way to improve coverage, especially if you have accumulated speculative code that doesn't actually support real features. Simplifying the code base (and especially removing technical debt) is likely a bigger quality improvement - in its own right, not just in terms of coverage - than just having more tests is.
A Naïve Coverage/Regression Test Trick
If you built an initial system with vastly insufficient testing and are finally dedicating resources to correct this - one cheap but high-traction approach is to just write tests for the current behaviour of the system. While you want tests that answer the question "this system behaves in this intentional and desired way", you can still get a lot of coverage by writing tests that answer the weaker question "customers are successfully using the system the way it currently behaves - make sure it keeps behaving that way." This is a common approach to performance tests: rather than doing extensive analysis to see if some particular metric is good enough, you assert that "customers aren't complaining so it's good enough Right Now" and make sure it doesn't deteriorate from there.
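A sketch of such a "keep behaving that way" (characterization) test - `format_price` is invented here, and the expected strings would be captured from the running system rather than from any spec:

```python
import unittest

def format_price(cents: int) -> str:
    """Stand-in for existing code whose exact output customers
    already depend on, documented or not."""
    return "$%d.%02d" % (cents // 100, cents % 100)

class TestCurrentBehaviour(unittest.TestCase):
    def test_outputs_stay_as_captured(self):
        # These assert "what it does today", not "what the spec says" -
        # cheap to write, and they buy broad regression coverage.
        self.assertEqual(format_price(0), "$0.00")
        self.assertEqual(format_price(199), "$1.99")
        self.assertEqual(format_price(100000), "$1000.00")
```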
One problem with these tests being weaker (and easier to write) is that when they do fail, you may have to defend them, and revisit whether they are testing the right things. Usually at that point you can get more attention on them and actually answer that.
Write A Failing Test First
A large chunk of your development time is going to be spent on fixes for visible failures of your system - which means a large chunk of your test development time will be too. If these are customer-visible, you really want to be sure the customer never sees the same problem again. There's really only one way to do this: once you have a test that you think covers the problem, make sure that it fails on the exact version of the system the customer is running (don't just show that the new code works.)
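Sketching that discipline with an invented bug report: the point is to run the new test against the shipped code first and watch it fail, proving the test actually captures the customer's problem before you trust the fix:

```python
import unittest

def shipped_median(xs):
    """The (hypothetical) version the customer is running: forgets to
    sort before picking the middle element."""
    return xs[len(xs) // 2]

def fixed_median(xs):
    return sorted(xs)[len(xs) // 2]

class TestCustomerMedianReport(unittest.TestCase):
    def test_unsorted_input_from_report(self):
        # Input taken verbatim from the customer's report.
        self.assertEqual(fixed_median([9, 1, 5]), 5)

# Before landing the fix, point the same assertion at shipped_median
# and confirm it fails there - otherwise the test proves nothing:
assert shipped_median([9, 1, 5]) != 5   # the bug reproduces
```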
There's an ambitious variation of this, called Test Driven Development, where you start by writing tests for nonexistent features. These are particularly good for fleshing out API designs, since you're keeping in mind "how does someone want to use this" instead of "what falls naturally out of the implementation", and encourages actually doing design instead of just coding things. (You can get similar value out of writing the documentation first.) This is somewhat different since in the customer-facing case you already have code and input and a misbehaviour to capture and you're trying to prove a particular assertion, rather than explore a design space.
Predictability needs Honesty about the costs
If you want the release phase to be constant and predictable, you can't be discovering problems for the first time at release freeze; that just causes firefighting, and sometimes that means the whole thing burns down. You have to push the "discover global problems" step back up the funnel and that means doing earlier global tests - possibly costing time for every developer and reviewer. While unpopular, this isn't wrong, but there are options for trading it off.
In practice, having a separate QA team doesn't actually matter here, unless they're empowered to pause development to get attention on release-blocking problems, or they themselves are part of product development and really can do it themselves (this is even more rare.)
The "Agile Release Train" model is about slipping features but not slipping releases. What it ignores is that cross-feature interactions aren't really captured by individual feature branches, and when you find complex failures you may find you really haven't isolated things well enough to kick the "failure" off the train - so it reduces to any other release model. (Also, most customers care about their pet feature, not about your release model.)
The Release Train model is also unrealistic about releases that are customer visible, especially when they're exposed to sales organizations that want to be able to promise things to customers in particular releases. The honest/true thing to do is to advertise what you've actually finished - it's just nearly impossible to get the customer-facing side of the organization to acknowledge that.
Brief History of test tooling
In the mid 1990s, nightly (or even weekly!) tests were a big thing. They could be slow because they just needed to finish overnight (but that didn't mean they were especially thorough, computers were just slower back then.)
By the late 1990s and early 2000s, part of having version control was having single points of Intentional Change where all of the files were consistent and you could at least expect the build to work, and maybe even the product itself. CruiseControl came out of ThoughtWorks in 2001 and popularized actually doing that build every possible time, with a straightforward Java tool that could just watch for changes (and wait for a repo to "settle", since CVS at that point was only a loose collection of versioned files; fully mechanical multi-file commits didn't catch on until Subversion started replacing CVS a couple of years later.) While it wasn't associated with particularly sophisticated testing, simply knowing "who broke the build"6 in less than a day was powerful.
Eventually having feature branches, and tests before merging, became normalized with systems like Hudson/Jenkins and Atlassian Bamboo in the late 2000's and early 2010's; Bamboo could even auto-detect merge-candidate branches, test them, and record the test results directly in the pull request review page. (By 2020 pretty much every multi-user version control interface had some form of this available, with varying degrees of enforcement vs. cooperation.)
Conclusion
There's as much work and detail in testing as there is in building the software itself; while "bits" don't decay, any interesting software system or subsystem has a sophisticated organic interaction within its ecosystem, and static code doesn't respond to that - testing can act as "life support" and at very least as "early warning" that the world is changing around it.
-
The author has found exactly one C runtime bug this century, and has a particular coworker who found two Linux kernel bugs across the entire lifetime of a startup. It took vast amounts of isolation and proving to determine the truth of these; in the same time period we also had a bunch of speculative "that has to be a JVM problem" complaints that turned out to be either locally-written bugs or "no, unix actually works that way"... ↩
-
This happened to the author in late 2025 when a casual `curl` script to check if a particular flatpak had been updated to a new version suddenly started returning a descriptive HTML 404 error page instead of detailed JSON. Apparently `/api/v1/apps` was replaced with `/api/v2/appstream` at some point. ↩
-
Heartbleed, ShellShock, RowHammer, aCropalypse and a vast range of other bugs with "cute" names and specialized domains that pop up faster than the CVEs themselves do - basically "self-marketing" vulnerabilities. In practice many of these are disproportionately well-known, but since your customers and investors are likely to hear about them quickly too, you'll be expected to have an answer available promptly, even if it's a trivial dismissal. ↩
-
At one support-focussed company early in my career I learned that customers valued prompt engagement more than prompt fixes - you still don't want to let things drag on without some negotiation about priorities, but "we're worrying about this so you don't have to" is a powerful message even when you don't yet know how long the engineering is going to take. ↩
-
Usually calling `sleep` to let something "settle" is painting over a race condition - instead, the changes made by an interface should be complete on return or have an explicit way to check for completion. (Also, the test probably fails if the machine has competing loads, and maybe the product will too!) If the `sleep` call really does require time to pass (such that a timestamp crosses an integer boundary) consider if you want to use "Mock" testing techniques which would let you replace the clock - then your test can perform the first step, ask the harness to advance the clock, then perform the next step with confidence that the test-clock has updated correctly. This can take slightly more work to set up (faketime, for example, uses `LD_PRELOAD` to insinuate a shared library to replace the time checks in the executable under test) but can turn tests that previously needed seconds or more to run into tests that execute robustly in milliseconds. ↩