Pretty Good Practices
Doesn't everyone already do this?
One expects there to be pretty good practices around testing, but instead it is often an afterthought; sometimes even with resources devoted to it, there's a narrow focus on only one of many kinds of testing, while other areas are ignored entirely.
Keeping tests near relevant code
We need to start with a simple principle of debugging: "What you changed is what you broke." It's sufficiently rare that something truly interesting and "at a distance" is your problem - we tell stories about them because they're interesting but also because they're rare. To a first approximation, what's wrong is not the JVM, it's not the kernel1, it's just the most recent thing you changed.
The hardest part about applying this rule is that if you don't have local enough (and prompt enough) tests, you may not even know that you just broke something - you really need tests that are close to your changes (possibly even before your changes, as we'll discuss below.) Sure, you can narrow things down with version control later, if you commit often enough, but you want at least some basic checks as you go.
The Testing Funnel
Since you can't couple everything that tightly, we end up with another "funnel" - by analogy with the bug funnel there is also a "testing funnel" - a hierarchy of tests that goes from needle-precise unit testing of particular lines of code, up to human and system-scale interaction.
- Unit Tests (function scale up to object scale, usually involves one developer)
- Unit tests apply to small amounts of code, and attempt to answer the question "Does what I wrote do what I intended?" Writing them in advance can serve as a stronger guarantee of this. They also aid in review of an interface design - the reviewer can look at the tests as examples, "Does it make sense to call these functions this way?" and secondary documentation, "So we'd do this thing by making this call, does that make sense? Does this expose obvious edge cases? Does it encourage the next developer that comes along to do something that's obvious but wrong?"
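A sketch of the idea, in Python with pytest-style test functions - the `slugify` function is an invented example, not something from this text:

```python
import re

def slugify(title: str) -> str:
    """Convert a title to a lowercase, hyphen-separated URL slug."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# The tests double as usage examples and secondary documentation for reviewers.
def test_slugify_basic():
    assert slugify("Pretty Good Practices") == "pretty-good-practices"

def test_slugify_edge_cases():
    assert slugify("") == ""             # empty input is a deliberate decision
    assert slugify("  A  B  ") == "a-b"  # runs of whitespace collapse
```

A reviewer reading `test_slugify_edge_cases` can immediately ask "is returning an empty slug the right behaviour?" - exactly the interface-design conversation unit tests are supposed to provoke.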
- Internal API/contract tests (interface scale - which can have a lot of complexity "behind" it but the interface itself is bounded - team scale)
- If you're providing a user-facing API or some other concrete promise to your customers ("this interface can handle 100 values at a time", "this data field can handle any unicode string containing only characters with general category Letter") it's worth expressing all of those as tests so you don't ship anything that breaks them in front of live customers. (These are still development tests, meant to be more exhaustive, rather than operational/delivery tests that only probe that certain paths work.) These are often as narrow as unit tests, but they're expressing a relationship with consumers rather than developers.
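A contract test can follow the published promise almost word for word. A sketch in Python, using the "general category Letter" promise above (`validate_name` is an invented stand-in for the real interface):

```python
import unicodedata

def validate_name(value: str) -> bool:
    """Contract: accept any non-empty string containing only characters
    with Unicode general category Letter (Lu, Ll, Lt, Lm, Lo)."""
    return value != "" and all(
        unicodedata.category(ch).startswith("L") for ch in value
    )

def test_contract_letters_accepted():
    # Letters from several scripts must pass, per the published promise.
    for name in ["Alice", "Zoë", "Ἀλέξανδρος", "山田"]:
        assert validate_name(name), name

def test_contract_non_letters_rejected():
    # Digits, spaces, punctuation, and the empty string are outside the promise.
    for name in ["R2D2", "a b", "x-y", ""]:
        assert not validate_name(name), name
```

Because the test restates the promise rather than the implementation, it keeps failing honestly even if the implementation is rewritten.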
- External interface tests (interface scale but you probably have many of these - team scale if the team is agreeing on the overhead of using these particular interfaces)
- If you're not going to aggressively review third party library updates, you probably also want API/contract tests facing outward and testing the interfaces you consume from those libraries or services. These are often the Internal API tests that you wish the providers would be running themselves.
(External service tests might fit better as "delivery tests" (see below) since you likely don't control their updates - even for well behaved services that use /v1/v2-style versioned API URIs, you don't necessarily learn that /v1 has been deprecated until it's actually turned off2, which won't wait for your release schedule.)
- Requirements/"business rules" tests (interface scale up to entire product scale - not just the team but also external "stakeholders")
- Requirements testing is usually an attempt to translate external documentation (including meeting minutes) into code, in order to express that "The thing we agreed on in the meeting is actually in the product, if this test passes." This helps review the concrete understanding within and outside the team. While there's value in showing "we think we're finished", there's even more value in having the tests reviewed by other involved parties, especially if they don't think the test describes the agreement; it's a lot easier to argue with something "down on paper" than a blank page.
- External Quality Measurement tests (product scale - CVE scans, audits - usually from external sources with an external audience)
- These are usually tests from outside sources that are somewhat independent of the code itself - while there are more tools available to search for security anti-patterns directly, simply checking your dependencies and versions against published lists of CVEs and USNs is useful. Doing this as part of development testing isn't enough but it's a good start; you really want to be getting some kind of feed of advisories, filtering that against your actual dependencies, and responding directly - in practice this gets done for the most attention-getting advisories3 and the thorough checks are only done at release time (which is fine for producing an "everything is fine" report, but isn't early enough in the pipeline to actually react and inform customers as needed.)
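The "filter the advisory feed against your actual dependencies" step can be sketched mechanically; the installed versions and advisory records below are invented stand-ins for a real feed (OSV, USN, vendor advisories):

```python
# Hypothetical installed-dependency snapshot and advisory feed.
installed = {"libexample": "1.4.2", "webthing": "2.0.1"}

advisories = [
    {"id": "CVE-2025-0001", "package": "libexample", "fixed_in": "1.4.3"},
    {"id": "CVE-2025-0002", "package": "unused-lib", "fixed_in": "9.9.9"},
]

def parse(version: str) -> tuple:
    """Naive dotted-version parse; real feeds need a proper version scheme."""
    return tuple(int(part) for part in version.split("."))

def relevant_advisories(installed, advisories):
    """Keep only advisories for packages we actually ship, at a vulnerable version."""
    return [
        adv for adv in advisories
        if adv["package"] in installed
        and parse(installed[adv["package"]]) < parse(adv["fixed_in"])
    ]
```

The point is the shape of the pipeline - feed in, filter against reality, respond - not this toy version comparison.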
Outside the funnel
While the space of things you need to test rapidly grows beyond the scope of a single developer testing their own work, there are also test opportunities that work from a different angle entirely.
- Performance tests
- If you're promising the customer anything with a rate or time, you should be able to demonstrate that - and even if they are slow or expensive, these tests should block merges so that you can immediately show that "this specific change needs more work because otherwise it breaks that contract", instead of having to reverse engineer the problem later. But even if you're not making any such promises, you likely have an internal constraint of "it works on the hardware we have now but haven't budgeted for any more" (or at a smaller scale, "we haven't run out of google cloud bootstrap points, and don't want to start spending Real Money") and so you still want some warning of performance changes, even without an explicit performance story.
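Even a crude performance ratchet is better than none. A sketch in Python - the workload and the one-second budget are placeholders you would calibrate against your actual contract and hardware:

```python
import time

def run_workload():
    """Stand-in for a representative operation from your product."""
    return sum(i * i for i in range(100_000))

def check_performance(budget_seconds: float = 1.0) -> float:
    """Raise AssertionError if the workload exceeds its time budget."""
    start = time.perf_counter()
    run_workload()
    elapsed = time.perf_counter() - start
    assert elapsed < budget_seconds, (
        f"perf regression: {elapsed:.3f}s exceeds {budget_seconds}s budget")
    return elapsed
```

Wall-clock budgets on shared CI hardware are noisy; tracking the trend over time (or counting operations instead of seconds) is often more robust than a single hard threshold.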
- Delivery/integration tests
- These are operational tests, used to confirm that the delivery is operating correctly (and not just "thrown over the wall".) Ideally you'd work with your operations team on these but you'll want to show that they work in your development and staging environments, and then make them part of your production environment "health tests" to provide some feedback that the system is working as fielded. (The core ideas of DevOps can be oversimplified as extending good software practices to include the actual deployment and use of the software; "tests" in that context expands to include deployment-specific metrics for whether the system is working as intended, measured more on a performance continuum and less on a binary works/doesn't work scale.)
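As a sketch, a health test can be as simple as parsing the response from a hypothetical /health endpoint and listing problems - the field names and threshold here are assumptions, not a standard:

```python
import json

def check_health(payload: str, max_queue: int = 100) -> list:
    """Return a list of problems found in a health-endpoint JSON response;
    an empty list means the deployment looks healthy."""
    problems = []
    data = json.loads(payload)
    if data.get("status") != "ok":
        problems.append("status is not ok")
    if data.get("queue_depth", 0) > max_queue:
        problems.append("queue backlog exceeds threshold")
    return problems
```

Returning a list of problems (rather than a bare pass/fail) fits the "performance continuum" framing above: operations can alert on one problem and page on several.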
- "Other" tests
- This list isn't exhaustive - anything that you have working and you'd like to stay working should have a test, even if it doesn't fit neatly into these categories. Customers find it far more frustrating if you fix something and then let it break again than if you don't fix it in the first place4 so you're going to want an entire category of "don't let this regress because of this customer" tests. (If you're shipping different things to different customers, consider fixing that but maybe start by running all of the customer-related tests for everything, not just for the inciting customer.)
Coverage
Coverage isn't a kind of test, it's instead about measuring testing itself - "how much of the code is run by this test suite", "how much of this code is actually run in actual product usage." It's not an absolute good, but more of a directional improvement indicator - not so much a productivity metric as a ratchet that should always be increasing. This doesn't mean you can't "clean up" tests ever - coverage improves if you get rid of untested code, too. (Deleting any code plausibly deletes bugs, but untested code has had even less attention, so deleting it is even more likely to.)
Measuring coverage is also a way to cure some illusions about how your
own codebase works. In the mid 1990's it was popular to compile
gcc with itself, on the theory
that this tested many paths through the compiler (and the simultaneous
but contradictory theory that the final version would be "better"
because the code had been generated by gcc instead of the potentially
buggy vendor compiler.) Coverage metrics eventually showed that gcc
building itself only used about 20% of its own code, and that this
really wasn't a particularly convincing test. (At the time, there was
also an enormous test suite based on expressing every detail of the
ANSI standard as code, and this was much more effective, at least as a
regression test.)
Assertions
Assertions or asserts (based on the language keyword that implements
them) are sometimes treated as equivalent to unit tests - in practice
they are unit tests that are run at some unusual time (in dynamic
languages, they run on module import; in compiled languages, they can
sometimes run at compile time but more often only at eventual run
time.) It's generally better practice to convert them into proper
tests, so they're part of your testing and test reporting, and they're
reviewed as such. The common anti-pattern is to end up having only
the assertions run, because they're forced to run as a side effect
of running the code, while the more formal unit tests you have
aren't getting run. Making them Actual Tests puts a little more
pressure on making sure your Actual Tests are run sufficiently often.
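To make that concrete: the same check, moved from an import-time assert into named tests that a runner discovers and reports (`parse_port` is an invented example):

```python
def parse_port(text: str) -> int:
    """Parse a TCP port number, rejecting values outside 1..65535."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

# Anti-pattern: this runs only as a side effect of importing the module,
# and never appears in any test report.
#   assert parse_port("8080") == 8080

# Better: named tests that are discovered, reported, and reviewed as tests.
def test_parse_port_roundtrip():
    assert parse_port("8080") == 8080

def test_parse_port_rejects_out_of_range():
    try:
        parse_port("70000")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for out-of-range port")
```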
Moving towards "enough" testing
Michael Feathers' "Working Effectively with Legacy Code" gives a good framework for thinking about the shape of the testing problem. Feathers specifically defines legacy code not as old code but as "code that doesn't have enough test coverage for you to change it with confidence", even if you wrote it in 2025...
Headlights
"Code without tests is like driving without headlights" - you can make progress at driving speed, or you can stumble along slowly in the dark. Your tests are both an alternate expression of your intent for the code and a self-enforcing contract with your past self - even if you don't look at them, as long as you run them you have assertions that your ground remains solid.
Tooling to make it easier to write tests is important
You certainly don't want your developers to be evaluating new test harnesses when they're trying to do feature work - while "how do I write this code in a testable way" is a good thing for them to be considering, this is a lot easier if they have context and existing practices that enable this.
This goes beyond choice of framework - you will get a huge amount of mileage out of developing tooling that supports testing the distinctive kind of things your project does - at very least, it means when someone has to estimate development effort for a feature, it's a lot easier for them to say "testing this thing is very much like testing that other thing, so we can make a convincing estimate for the effort" than if they need to put more effort into inventing "how do I test this" than they do into "how do I solve the actual problem".
Consider tests that are shaped like your interfaces - if you've got a user-facing web interface, Playwright is available in most popular languages. If your tooling is mostly on the command line, Cram is a doctest-style tool that lets you mix Markdown explanations with chunks of shell command lines and result-matching patterns (which also lets you write self-checking documentation, if you write the tests with a narrative rather than as exhaustively following what the code does - this mode is also effective for turning a customer complaint into an acceptance test while keeping a clear relationship between the two.)
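For library code, Python's built-in doctest offers the same "narrative that checks itself" idea in miniature - prose and examples verified together:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words.

    >>> word_count("release early, release often")
    4
    >>> word_count("")
    0
    """
    return len(text.split())

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # re-runs every example in the docstrings above
```

The examples in the docstring are documentation first and tests second, which is exactly the property that makes narrative-style tests good acceptance tests.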
Fast-fail tests/Relevant tests
Getting a failure quickly means a developer doesn't have to come back later and recreate their mental context to work on it. (Unfortunately this may need to be instantly - as in, within a few hundred milliseconds - to actually feel fast enough; but in practice, seconds are better than minutes, and anything over ten minutes is probably only going to get run by choice as part of submitting for code review and likely not much sooner than that.)
At the same time, you can't just skip the "long" tests - the reason you wrote long rigorous tests was to make sure some aspect of the product continues to work, and you want to find out as soon as you can after the change that breaks it - not a month later when a release is going out. Unfortunately, higher level tests are often both more effective at finding flaws and slower, so you really want to run them on most changes and they're more costly when you do.
Forty-five minutes of testing that checks something you "know" you didn't change is important at some level, but bad for morale - it really is valuable to the team to put the work into doing some kind of test dependency selection (and improving module isolation!) so that you can safely avoid testing things that you really couldn't have influenced. (At the same time, some part of your workflow should recognize when that can't be perfect and that running global cross-functional tests can find horrible and obscure problems - and you want to find them early just like with any other bug - you just need to find ways to limit the day-to-day burden to the team.)
Not much you can do but acknowledge the tradeoffs and keep track of what tests you do choose to run less often - and how often you find out later that you missed something valuable that those tests could have caught sooner; then you can use that feedback to re-balance which tests you run when (or even to convince the team that additional delay pays off.)
Complexity/Fragility
Flaky tests are really bad for morale, and just as inconsistent enforcement reduces respect for the law itself, tests that fail your build because of nothing you've done discourage you from running, let alone trusting, a full test suite.
This does mean that making tests robust the first time around is worth the trouble, while at the same time being difficult to enforce; explicitly budgeting effort to make new things "testable" is one way to handle it culturally.
Being able to say that a feature isn't done if it has tests that get
in everyone else's way is theoretically satisfying but doesn't work
very well in practice - since flaky tests tend to turn up long after
the developers have moved on to other tasks. Instead, you need to
have experienced engineers associated with the development of tests,
to recognize risky behaviors (the kinds of things that lead to race
conditions, for example.) This also includes having the authority to
point out things like "calling sleep(1) in a test is probably bad
and doesn't do what you think it does5", and to put the effort in to
define race-free interfaces in your product, even if the only
immediate need is to enable correct tests.
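As a sketch of the "replace the clock" alternative to sleep-based tests, in Python - `RateLimiter` is an invented example; the point is that the test controls time instead of waiting through it:

```python
import time
from unittest import mock

class RateLimiter:
    """Allows one event per `interval` seconds, judged by time.monotonic()."""
    def __init__(self, interval: float):
        self.interval = interval
        self.last = None

    def allow(self) -> bool:
        now = time.monotonic()
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False

def test_rate_limiter_without_sleeping():
    fake_now = [1000.0]
    with mock.patch("time.monotonic", side_effect=lambda: fake_now[0]):
        limiter = RateLimiter(interval=60.0)
        assert limiter.allow()       # first event always passes
        assert not limiter.allow()   # same instant: blocked
        fake_now[0] += 61.0          # "advance" the clock, no real waiting
        assert limiter.allow()
```

A sixty-second rate limit gets tested in milliseconds, and the test is immune to a loaded CI machine.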
Coverage
Coverage is difficult to improve and hits diminishing returns fairly quickly. It's more effective and more difficult to start out with high standards for it on greenfield work, than it is to impose it later. Also, the interesting metric for quality is how much of your code isn't tested at all, rather than how well the rest of it is tested, until a lot of it actually is.
If you do go on a campaign to improve coverage specifically, one Neat Trick is to make sure it's clear to the team that removing untested code is a valid way to improve coverage, especially if you have accumulated speculative code that doesn't actually support real features. Simplifying the code base (and especially removing technical debt) is likely a bigger quality improvement - in its own right, not just in terms of coverage - than just having more tests is.
A Naïve Coverage/Regression Test Trick
If you built an initial system with vastly insufficient testing and are finally dedicating resources to correct this - one cheap but high traction approach is to just write tests for the current behaviour of the system. While you want tests that answer the question "this system behaves in this intentional and desired way", you can still get a lot of coverage by writing tests that answer the weaker question "customers are successfully using the system the way it currently behaves - make sure it keeps behaving that way." This is a common approach to performance tests, rather than doing extensive analysis to see if some particular metric is good enough, you assert that "customers aren't complaining so it's good enough Right Now" and make sure it doesn't deteriorate from there.
One problem with these tests being weaker (and easier to write) is that when they do fail, you may have to defend them, and revisit whether they are testing the right things. Usually at that point you can get more attention on them and actually answer that.
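A characterization ("golden") test in miniature - `format_invoice` stands in for an untested legacy function, and the expected string is simply recorded from what the system does today:

```python
def format_invoice(items):
    """Legacy stand-in: render (name, price) pairs plus a total."""
    lines = [f"{name}: {price:.2f}" for name, price in items]
    lines.append(f"TOTAL: {sum(p for _, p in items):.2f}")
    return "\n".join(lines)

# Captured once from the current system; makes no claim that this
# format was ever designed, only that customers rely on it now.
GOLDEN = "widget: 9.99\ngadget: 4.50\nTOTAL: 14.49"

def test_invoice_matches_current_behaviour():
    assert format_invoice([("widget", 9.99), ("gadget", 4.50)]) == GOLDEN
```

When this eventually fails, that is the moment to decide whether the change was intentional - at which point you can upgrade the golden string into a real requirements test.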
Write A Failing Test First
A large chunk of your development time is going to be spent on fixes for visible failures of your system - which means a large chunk of your test development time will be too. If these are customer-visible, you really want to be sure the customer never sees the same problem again. There's really only one way to do this: once you have a test that you think covers the problem, make sure that it fails on the exact version of the system the customer is running (don't just show that the new code works.)
There's an ambitious variation of this, called Test Driven Development, where you start by writing tests for nonexistent features. These are particularly good for fleshing out API designs, since you're keeping in mind "how does someone want to use this" instead of "what falls naturally out of the implementation", and it encourages actually doing design instead of just coding things. (You can get similar value out of writing the documentation first.) This is somewhat different from the customer-facing case, where you already have code and input and a misbehaviour to capture, and you're trying to prove a particular assertion rather than explore a design space.
Predictability needs Honesty about the costs
If you want the release phase to be constant and predictable, you can't be discovering problems for the first time at release freeze; that just causes firefighting, and sometimes that means the whole thing burns down. You have to push the "discover global problems" step back up the funnel and that means doing earlier global tests - possibly costing time for every developer and reviewer. While unpopular, this isn't wrong, but there are options for trading it off.
In practice, having a separate QA team doesn't actually matter here, unless they're empowered to pause development to get attention on release-blocking problems, or they themselves are part of product development and really can do it themselves (this is even more rare.)
The "Agile Release Train" model is about slipping features but not slipping releases. What it ignores is that cross-feature interactions aren't really captured by individual feature branches, and when you find complex failures you may find you really haven't isolated things well enough to kick the "failure" off the train - so it reduces to any other release model. (Also, most customers care about their pet feature, not about your release model.)
The Release Train model is also unrealistic about releases that are customer visible, especially when they're exposed to sales organizations that want to be able to promise things to customers in particular releases. The honest/true thing to do is to advertise what you've actually finished - it's just nearly impossible to get the customer-facing side of the organization to acknowledge that.
Brief History of test tooling
In the mid 1990s, nightly (or even weekly!) tests were a big thing. They could be slow because they just needed to finish overnight (but that didn't mean they were especially thorough, computers were just slower back then.)
By the late 1990s and early 2000s, part of having version control was having single points of Intentional Change where all of the files were consistent and you could at least expect the build to work, and maybe even the product itself. CruiseControl came out of ThoughtWorks in 2001 and popularized actually doing that build every possible time, with a straightforward Java tool that could just watch for changes (and wait for a repo to "settle", since CVS at that point was only a loose collection of versioned files; fully mechanical multi-file commits didn't catch on until Subversion started replacing CVS a couple of years later.) While it wasn't associated with particularly sophisticated testing, simply knowing "who broke the build"6 in less than a day was powerful.
Eventually having feature branches, and tests before merging, became normalized with systems like Hudson/Jenkins and Atlassian Bamboo in the late 2000's and early 2010's; Bamboo could even auto-detect merge-candidate branches, test them, and record the test results directly in the pull request review page. (By 2020 pretty much every multi-user version control interface had some form of this available, with varying degrees of enforcement vs. cooperation.)
Conclusion
There's as much work and detail in testing as there is in building the software itself; while "bits" don't decay, any interesting software system or subsystem has a sophisticated organic interaction within its ecosystem, and static code doesn't respond to that - testing can act as "life support" and at very least as "early warning" that the world is changing around it.
-
The author has found exactly one C runtime bug this century, and has a particular coworker that found two linux kernel bugs across the entire lifetime of a startup. It took vast amounts of isolation and proving to determine the truth of these; in the same time period we also had a bunch of speculative "that has to be a JVM problem" complaints that turned out to be either locally-written bugs or "no, unix actually works that way"... ↩
-
This happened to the author in late 2025 when a casual curl script to check if a particular flatpak had been updated to a new version suddenly started returning a descriptive HTML 404 error page instead of detailed JSON. Apparently /api/v1/apps was replaced with /api/v2/appstream at some point. ↩
-
Heartbleed, ShellShock, RowHammer, aCropalypse and a vast range of other bugs with "cute" names and specialized domains that pop up faster than the CVEs themselves do - basically "self-marketing" vulnerabilities. In practice many of these are disproportionately well-known, but since your customers and investors are likely to hear about them quickly too, you'll be expected to have an answer available promptly, even if it's a trivial dismissal. ↩
-
At one support-focussed company early in my career I learned that customers valued prompt engagement more than prompt fixes - you still don't want to let things drag on without some negotiation about priorities, but "we're worrying about this so you don't have to" is a powerful message even when you don't yet know how long the engineering is going to take. ↩
-
Usually calling sleep to let something "settle" is painting over a race condition - instead, the changes made by an interface should be complete on return or have an explicit way to check for completion. (Also, the test probably fails if the machine has competing loads, and maybe the product will too!) If the sleep call really does require time to pass (such that a timestamp crosses an integer boundary) consider whether you want to use "Mock" testing techniques which would let you replace the clock - then your test can perform the first step, ask the harness to advance the clock, then perform the next step with confidence that the test-clock has updated correctly. This can take slightly more work to set up (faketime, for example, uses LD_PRELOAD to insinuate a shared library to replace the time checks in the executable under test) but can turn tests that previously needed seconds or more to run into tests that execute robustly in milliseconds. ↩
Security design as the dual of software design
Security-related design isn't mysteriously difficult. It's primarily explained by one simple principle: in non-security software, it's usually sufficient that there is some single successful path through the system - in security software, it's not enough that the good path does work, but all of the other paths must not work. This leads to the horrible realization that the main thing about security awareness is that it involves a lot more work.
Think Like an Attacker?
"Red Team" and "Think Like An Attacker" sound like a lot of fun, but they're generally not helpful for improving overall product security. The attacker only needs to break one thing to win - and sure, fixing that one thing in your product is important - but as a defender you need to keep all interfaces clean and correct. There's certainly value in understanding the kind of "tricks" attackers use, but this ends up being more about pointing out areas where you need to pay attention and not take shortcuts. It's also about developing a sense of dismay about how easy many attacks are once they're noticed - and about how, 30 years later, the same kinds of exploits still work.
Attack Surface
As part of Microsoft "turning things around" in terms of Windows Security they formalized the idea that a system had an "attack surface" - the area that was exposed to outside attackers - which you could consciously minimize. This means obvious things like closing unneeded ports, design choices like having distributed systems contact servers and not the other way around, and higher level ideas like reducing or isolating third party dependencies because they're part of your attack surface.
As mentioned above, you need to keep all interfaces clean and correct - this is more tractable if there are fewer and smaller/simpler interfaces in the first place.
Privilege Boundaries
The biggest place where security becomes a concern is any place someone can get more access. While there are theoretical reasons to focus on these areas in both design and review, the practical reason is more important: that's where the attackers are going to focus.
In the simplest case, being able to get from off-box to on-box is an increase in access. Likewise, getting from an isolated sandbox user to some other user, or from a non-root user to root. There are more complicated paths for some of these - file creation attacks, symlink race conditions - but the root of the problem is fundamentally an unintended increase in access.
Correctness and Security
There are some common areas where one way of writing a piece of code is correct, and another is wrong, and the wrong version happens to also have a security problem. One popular example is the SQL Injection attack, often framed as a "quoting" problem, but fundamentally is about treating user input as anything other than untrustworthy data that needs to be handled with gloves. Using naïve string interpolation on queries is wrong from a data type perspective, and the roots of why it's done at all go back to at least the 1980s when the only interface to the database involved passing strings around rather than a typed interface; this kind of interface persists because it's simple, and the bugs persist because it's slightly difficult to automatically detect.1
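The fix is mechanical once the driver exposes a typed interface: pass user input as a bound parameter, never as interpolated query text. A sketch using Python's standard-library sqlite3 (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name: str):
    # WRONG (injectable): f"SELECT name FROM users WHERE name = '{name}'"
    # RIGHT: the ? placeholder keeps the value out of the SQL parse entirely,
    # so the classic "' OR '1'='1" trick is just a weird username that matches nothing.
    cur = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cur.fetchall()
```

The parameterized form is also the correct one from a data-type perspective, independent of security: the value is data, so it travels as data.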
The "S" in IoT
For ages the "Internet of Things" has been derided as a security problem - calling it The Internet of "Things that should not be on the internet", or the catchphrase "The "S" in IoT is for Security" - which usually comes down to
- outdated beliefs about how much hardware you need to run secure software, once you have enough to be on a network at all, or
- poor planning about the inevitability of firmware updates.
(Let's assume that the case of "the 15 year old firmware that you're copying from a discontinued WiFi router to make an Internet dishwasher" - while definitely part of the problem - doesn't suggest that you're actually interested in making anything better.)
Modern (2025) embedded hardware has support for modern cryptographic security interfaces, even at the cheap end of the scale. For example, the ESP32 supports
- WPA/WPA2/WPA3 and WAPI (for wifi)
- Acceleration for AES, SHA-2, RSA, and ECC
- Secure Boot
- Flash Encryption
and the ESP32 version of CircuitPython supports SSL and HTTPS directly. So while you'll need a process to update your embedded hardware to respond to security issues that appear over time, the tools are there, the hardware and operating systems are not letting you down today.
History
CPU performance has traditionally tracked Moore's Law (in terms of marketing and physics) with a doubling of transistor density roughly every two years since 1965. This has a bunch of implications:
- performance-based security (password hashing for example) does need to adapt over time (fortunately, modern mechanisms simply have dynamically adjustable tuning parameters)
- encryption that needs fast hardware this year will run on literal toys in five years.
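As a sketch of the "dynamically adjustable tuning parameter" point, using scrypt from Python's standard library - the cost parameters here are illustrative, not a recommendation:

```python
import hashlib
import hmac
import os

def hash_password(password: str, n: int = 2**14) -> dict:
    """Hash a password with scrypt; `n` is the tunable CPU/memory cost."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=n, r=8, p=1)
    # Storing the cost alongside the hash lets you raise it later for
    # new hashes without invalidating old ones.
    return {"salt": salt, "n": n, "digest": digest}

def verify_password(password: str, record: dict) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=record["salt"],
                               n=record["n"], r=8, p=1)
    return hmac.compare_digest(candidate, record["digest"])
```

When hardware speeds up, you raise `n` for newly stored credentials and opportunistically re-hash old ones at next login - no format change required.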
In the 1980s, "encrypting everything over the wire" was seen as ludicrously paranoid. By the mid-1990s, a desktop computer could keep up with DES encryption of an entire 10 megabit network connection (and CPU speed has only increased faster than network bandwidth - coupled with stronger encryption algorithms that are more CPU-efficient than ever.) Fortunately, by 2010 you could "obviously" use no less than SSH and HTTPS2 for everything over the wire, and the balance had shifted from "that's too paranoid" to "that's minimum basic responsibility".
-
The unbounded-string input interface (gets) that was a key part of the Morris Worm was eventually suppressed by various techniques, including developing a new mechanism to emit link-time warnings, before it was finally removed by ISO C11 a quarter-century later. SQL injection is harder to detect at the language level because "interpolating user data into the string" happens far away from "passing the string to something that could execute it"; Perl's "tainted data" mechanism was one way of treating this but it didn't really catch on elsewhere. ↩
-
HTTPS Certificates were a big political roadblock here, but LetsEncrypt solved this a decade ago. ↩
Why is release engineering a big deal? How does it help your developers instead of getting in their way?
Can't we just do Continuous Delivery?
No.
Continuous Delivery is about minimizing the lag between any development work being "done" and all customers being able to use the result. While there is significant value in thinking in those terms and figuring out what you're doing in that gap - actually eliminating the gap can often result in feeding your customers into a buzzsaw of poorly considered reactions to changing requirements, rather than handing them the polished results of well-considered design.
Teams that haven't worked in this mode before may find that there are manual review processes, hard-to-automate system tests, or even external legal approvals that currently occupy those gaps. These are not as unusual as it might seem - even a trivial website, if it's delivered to the government of Canada, must comply with the Official Languages Act and provide multiple translations, which you might not be able to update continuously along with other code changes.
Also, while ideally you should be able to run every possible test on every change, this is often not possible in practice (and not merely because it's expensive.)
Examples (from past personal experience):
- Some validation that a public search engine worked "at scale" involved spinning up $20k of cloud compute and feeding it synthetic load for at least half a day. (There was corporate enthusiasm for doing this before release rollouts - and for firefighting - but not for every change on every branch.)
- Robot systems where low-fidelity simulations were possible, but literally no one had a convincing simulation of gripper "fingertips" against actual consumer packaging over wildly varying temperatures without actually doing it in physical space (and a specialized team that helped work out which items were worth testing in the face of particular types of change was an amazing asset to have - but again, they scaled to "releases and firefighting", not to every change.)
- Sometimes you don't "own" the customers - you're delivering a service to someone else to integrate and they have their own constraints. A performance accelerator component delivered to a major global website got validated on their schedule, by installing it on staging hardware and feeding it a fraction of their real login load, then "turning the knob" until it handled a convincing amount (or didn't, and all traffic got reverted to the previous system - because our experiments didn't justify their downtime.)
- Sometimes, while your customers want new features, they don't actually want the disruption, IT time, or data reprocessing effort that comes with an upgrade, and you've negotiated that they get quarterly releases but aren't actually required to take them more than annually. (Unlike modern consumers, these customers are actually paying you money so they do in fact have a say in this kind of thing.) Here you want as much pre-release testing as possible because you really don't want the reputation hit of having the product fail once the customer finally gets around to rolling it out.
What can we do, then?
The first pushback you'll get from adding any process at all will be "why are we even doing this?" It helps to start with principles, and examples of problems that are solved by sticking to those principles. (While it's also educational to look at disasters caused by failing to adhere to principles1 that doesn't give much software-specific actionable guidance.)
Principle: Traceability (and Provenance)
Traceability is simply the idea that there should be no question about what you shipped - a product name and version number should be enough to get you back to all of your inputs. Even if you're building bespoke one-off things for individual customers - as often happens in a startup while it's still trying to converge on a product - you should still be able to identify everything that goes into each of those things. It can save you a lot of time to see that a customer complaint is on release X and see that release X+1 has a fix for it; you need clearly labeled releases to be able to do that. Even if you're delivering bespoke fixes, knowing what "future" release to extract them from makes this kind of traceability worth the trouble.
Keep in mind that you can get a lot of traceability "for free" from process automation - not that the automation itself is free, but it's a good way to forestall "excess creativity" in the build process, and then you can let that process do the labelling automatically.
Provenance is the rest of the story - for every input you use to build a product (third-party libraries or your own code, including data) you should know where those inputs come from - if you track a customer problem down to third party code, that doesn't absolve you of the problem, that just means you've narrowed down where to keep digging. Sometimes that means "knowing which support contract to invoke", sometimes it means "fixing the problem locally and posting the fix on their github", or other things in between; if you're using a component that's actively developed, you need to know what version you shipped so you can see if they've already fixed it upstream. (Ideally you want to be able to say the precise version as they publish it, but even knowing "latest, but downloaded on this date" can be workable.)
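As a sketch of how little machinery this takes: the following is hypothetical (every name, field, and value here is illustrative, not from any particular tool), but it shows traceability and provenance captured together - the product and version, the exact source commit, and where each input came from.

```python
import json
from datetime import datetime, timezone

def build_manifest(product, version, vcs_commit, inputs):
    """Record everything that went into a build, keyed by product name
    and version. Each input carries an "origin" - a URL, a support
    contract id, or at worst "latest, downloaded on <date>" - which is
    where the provenance lives."""
    return {
        "product": product,
        "version": version,
        "vcs_commit": vcs_commit,
        "built_at": datetime.now(timezone.utc).isoformat(),
        "inputs": sorted(inputs, key=lambda d: d["name"]),
    }

manifest = build_manifest(
    "example-widget", "2.3.1", "0c0ffee",
    [{"name": "libfoo", "version": "1.4",
      "origin": "https://example.org/libfoo/releases/1.4"}],
)
print(json.dumps(manifest, indent=2))
```

If your build automation emits something like this on every release, the "which version of libfoo did we ship in 2.3.1?" question answers itself.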
Provenance also lets you "connect the dots" between the public security announcements your customer is pestering you about - CVEs, USNs, RHSAs, etc. are generally quite specific about which version numbers a given advisory applies to. (If your software is public-facing you might find yourself having to provide these identifications - or if you have certain large customers, they may insist even if you're not otherwise obligated.)
Finally, if your company is being acquired or taking on significant investment, one part of modern Due Diligence is a License Audit - a review of your third party components to see if there are any license compliance issues. If that's in your future, it's a lot easier to produce that inventory if you just keep track of licenses as you add dependencies - at the very least it gives you a hook to find problems before they become load-bearing. (At one company with a product with a large number of open-source dependencies, the board of directors wanted a regular report of what new licenses we had to worry about - this turned out to be easiest to integrate by just comparing our package inventories at release time and reporting on the difference.)
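The "compare inventories at release time" report can be implemented as a simple diff over {package: license} maps. This sketch is illustrative (the inventory format and package names are made up), but the shape of the real thing wasn't much bigger:

```python
def new_license_concerns(previous, current):
    """Compare two {package: license} inventories taken at release time
    and report any license that appears in the current release but not
    the previous one, along with the packages carrying it."""
    prev_licenses = set(previous.values())
    report = {}
    for pkg, lic in current.items():
        if lic not in prev_licenses:
            report.setdefault(lic, []).append(pkg)
    return report

old = {"libfoo": "MIT", "libbar": "BSD-3-Clause"}
new = {"libfoo": "MIT", "libbar": "BSD-3-Clause", "libbaz": "LGPL-2.1"}
print(new_license_concerns(old, new))  # {'LGPL-2.1': ['libbaz']}
```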
(See the Traceability article for more depth.)
Principle: Repeatability → Debuggability
Repeatability is just the idea that if you built something once, you can revisit it and build it again. The most practical reason for this is that if you see problems in the field, you should be able to easily build the exact same fielded version with
- extra debugging (though if you do this a lot, you should think about what quality level your product is actually at and maybe start shipping builds with a lot of debugging turned on by default)
- small "trial" workarounds - while your fix should be based on an understanding of the problem, it can sometimes be worth seeing if a speculative fix works by trying it.
It should also be clear to everyone working on the project how "what you intended to build" turns into "what you actually shipped" - not necessarily how all of the mechanics work, but how to operate them.
(You should also do the "easy version" and actually keep copies of what got shipped - but that's usually not enough to support even basic debugging, unless you're actually shipping fully debug-capable builds to customers, which is rare.)
Principle: Delivery reflects Intent
There should be no question that what you shipped is what you intended to ship. While this sounds obvious, be aware that there are red flags like "didn't we fix that last release? Why is this customer still seeing it?" - if the problem was subtle in some way that your fix didn't capture, that's one thing, but if the code got written, reviewed, and landed, but still didn't end up in the next release, that should be a matter of great concern.
Relatedly, if you did fix something, and then two releases later it's broken again - that looks really bad from the customer's perspective and you should be putting more effort into preventing it. Usually that means "customer-visible fixes get more test coverage than would seem reasonable", but if the code "just got lost", that's another red flag that keeps your customer from taking your ability to meet your commitments seriously.
Principle: Automate software, not people
There is often pushback about doing serious automation around release engineering, like "it's not the product focus", "we can just have some checklists", "do we have to review and test the automation too". In practice, you don't need that much code (and the simpler you can keep it, the better.) Checklists2 are critically important, but they're also hard for humans - anything you can reduce to mechanical automation will be more reliable and eliminate human tedium and fatigue (which lead to errors.)
That said, starting with a recipe or detailed checklist can help work out the initial process and figure out what parts are worth automating; they also make it easier to hand the chore around, especially on a small team that's reluctant to dedicate resources to process work. Ultimately you want the release process to be "boring" - it's not where you should be innovating, and you probably don't want to spend creativity on it either.
Principle: Release schedules support planning.
You need to get working software out the door to your customers. Software estimation is hard, and some things really are "researchy" enough that they're done when they're done - but as you accumulate experience in a particular problem domain and product environment you'll get an idea of what parts you can make predictions about. Predictions can help you allocate resources all along the delivery pipeline - testing, marketing, customer delivery and support, and even with the customer themselves.
It also helps with planning across subsystems. In the extreme, Mark Shuttleworth gave a PyCon 2010 Keynote about the "Release Cadence" that Ubuntu was using - and gently imposing on the open source community - you didn't have to pay attention to it, but related projects naturally synchronized around shipping features in time to hit the Ubuntu 6-month release cycle, or the 2-year "Long Term Support" cycle. This was especially helpful for foundational projects like Gnome and KDE to be able to say "we'll have these features by this target, so applications can start using them in the same target" instead of waiting for a later cycle.
This is the opposite of the model used by classic systems like XP "Extreme Programming" which still had a cadence - a release every 3 or 6 weeks - but the release was "only things that are completely ready" with no implied pressure to "fit" a deadline. The lack of pressure was considered important for quality (basically, "we will ship no code before its time") and while it did encourage having very clear terms for what it meant for code to be ready, it did nothing to help related projects plan for future deliveries.
Principle: Release versioning supports planning.
In addition to having a schedule, you can communicate (and commit to) more detail by properly "naming" (with version numbers) the software you release.
The best practice here is Semantic Versioning where you always have a three part version number - changing the major version for incompatible changes, the minor version for "new features that don't break compatibility", and a third "patch" version for compatible bug-fixes only. While this is completely clear (and mostly mechanical) for simple things like library dependencies, for an entire product you'll end up using it to communicate with people (customers), which is where it gets more complicated. You may choose to define support contracts based on these versions; is a major version change just a continuation of an existing contract, or a new purchase? How long can you run a given version before you are required to upgrade? Sometimes there's a "lighter" version where support is allowed to push back on customer complaints if there's a newer version available, and it's a matter of service quality whether you even try to confirm that the problem is fixed on the newer version.
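The mechanical part of Semantic Versioning fits in a few lines. This is an illustrative helper of my own, not from any semver library, and it ignores pre-release and build-metadata suffixes that the full specification allows:

```python
def parse_semver(version):
    # "major.minor.patch" only - real semver also allows suffixes like
    # "-rc.1" and "+build5", omitted here for brevity.
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def upgrade_kind(installed, candidate):
    """Classify an upgrade under Semantic Versioning: a major bump may
    break compatibility, a minor bump adds features compatibly, and a
    patch bump is compatible bug fixes only."""
    old, new = parse_semver(installed), parse_semver(candidate)
    if new[0] != old[0]:
        return "major (possibly incompatible)"
    if new[1] != old[1]:
        return "minor (new features, compatible)"
    if new[2] != old[2]:
        return "patch (bug fixes only)"
    return "same version"

print(upgrade_kind("2.4.1", "2.5.0"))  # minor (new features, compatible)
print(upgrade_kind("2.4.1", "3.0.0"))  # major (possibly incompatible)
```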
Sometimes you'll have sales and marketing interest in how these are presented; for one interesting version of this, the first number was controlled by marketing and tied to high-effort customer presentations on all of the new things "System 4" would get them - even if they were already partly rolled out and customer-tested before that. (Engineering controlled the rest of the version number, primarily mechanically.)
Principle: Version what you Test
With all of this emphasis on versioning and supporting code - at my last three companies, the version of the data3 accompanying the code was just as important. Since we generally tested the latest data with the latest code, rather than a grid of code and data combinations, we didn't actually need a different version number for the data, just enough information to trace what data was current for that version of the code. (Attaching direct references in version control, using git lfs or something like it, is an obvious way to handle this, but you'll need to watch out for the data getting updated and the cross-reference falling behind; it's often worth including some basic acceptance tests to defend against that.)
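One shape such an acceptance test might take (the file names and manifest format here are hypothetical) is a checksum cross-check: record the digest of the data the code was actually tested against, and fail loudly if the shipped data no longer matches.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def check_data_reference(manifest_path, data_path):
    """Acceptance check: the shipped data file must match the checksum
    recorded (in version control, next to the code) when this data was
    last validated against this version of the code."""
    expected = json.loads(Path(manifest_path).read_text())["sha256"]
    actual = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    if actual != expected:
        raise AssertionError(
            f"{data_path} does not match the data this code was tested "
            f"with (expected sha256 {expected[:12]}...)")

# Demonstration with throwaway files standing in for the real model data.
with tempfile.TemporaryDirectory() as d:
    data = Path(d, "model.bin")
    data.write_bytes(b"model weights v7")
    ref = Path(d, "data_manifest.json")
    ref.write_text(json.dumps(
        {"sha256": hashlib.sha256(b"model weights v7").hexdigest()}))
    check_data_reference(ref, data)  # current data: passes silently
```

If someone refreshes the data without re-running validation, this is the test that catches the cross-reference falling behind.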
Conclusion
While a lot of release engineering looks from the outside like bookkeeping for bookkeeping's sake, it all comes from principles, which support avoiding problems (past and future.) There are arguments for and against - but "it takes time" usually isn't a good one; knowing the tradeoffs, and knowing what you're asking for, are a good way to show your team that using release engineering processes is actually in their interests.
-
For example, any NTSB or USCSB investigation but specifically the Containership Dali taking out a bridge because a poorly installed label interfered with a single wire connection is practically begging for a certain kind of attention to detail - might be a bit much for a typical software project, though. ↩
-
Worth reading "The Checklist Manifesto", Atul Gawande, if you have any repeated process that needs to be done right; the book is far more significant than it sounds. ↩
-
Data in this case meant bulk machine learning models and map data, product-specific but not generally customer-specific. ↩
"Provenance" is a term more common in the art world, where it means the origin or source of a thing - and can more specifically mean the history of ownership, as distinct from the thing itself. In art, provenance is a mixture of giving credit to the artist and imputing value to the artifact (the difference in value between a true original and a perfect forgery is bounded entirely by the provenance of the piece.) In software, the primary use of provenance is in software licensing with some secondary value in identifying security issues (CVEs).
Open Source
Open Source software specifically has licenses (usually from a well-known set) that allow a developer to use it within some generous constraints. On one end of the generosity scale (within Open Source) you have BSD and MIT, which are basically "do whatever, just don't file off the serial numbers and claim it as yours", ranging to messier licenses like LGPL, where you have to enable the end-user to update components of the deliverable; some of these licenses require that you clearly identify any changes you've made. Whatever actions they require, the important thing is that you be able to identify (for things that you ship1) which bits have which licenses, and that you're conforming to them.
Proprietary
Proprietary licenses are usually much more restricted in specific ways2, since they're often direct contracts and don't rely on copyright. Since they are bespoke, you don't have the shortcut of finding interpretations from the community for what you can do - either you need to do a close reading with your desired use in mind, or to come up with formal questions and ask your own IP lawyer about them.3
Again, the important thing is for you to understand where the pieces of your code come from, what licenses apply to them, and what actions those licenses require.
The Code Itself
While licenses and ownership are important, you also need to be able to trace where the actual code comes from. Sometimes that's just a matter of being able to map out your software supply chain risks, but it's also useful when debugging - not only can you confirm you have the correct upstream code, often the upstream bug tracking system will be adjacent to the code repository and you can find out if someone has already reported something related to your issue, possibly even with a proposed solution branch. Ubuntu Launchpad acts as a "switchboard" for this, with links to further-upstream sources and ticketing systems (including CVEs) and some information about tracking the progress of fixes within Ubuntu's own releases. This is especially useful if you're trying to decide between building a local version of a package and waiting for it to show up as an official "backport" - Launchpad will have public policy discussion and decisions for each package, as well as being a place to vote on impact or contribute your own test results.
Stackoverflow
There are a number of online sources of problem-solving code snippets to read and learn from. If they don't have an explicit license, that's all you can do - the author has copyright by default. Very popular systems like stackoverflow do have specific licenses - in this case, part of the Creative Commons, with requirements for including credit and not restricting further copying.
In practice, any given stack overflow answer isn't an exact fit anyway, so a pretty good practice is to learn from it, confirm it with more specific documentation (one of the best results of reading stackoverflow answers is that they get you to say "oh that's what that man page meant!"), apply that new understanding locally, and leave a note (in a comment or version control commit log) about where you got the idea - not for compliance, but so the next time someone comes along to debug that bit of code, they can look and see if the original answer has been updated or refined, so you don't have to rediscover it.
Mechanisms
Debian/Ubuntu copyright and license information
Debian packaging is a combination of tooling and policy/discipline. One important piece of this is that every installed package contains a /usr/share/doc/$package/copyright file with information about upstream source, licenses, contributors, and copyright holders. Though it was originally freeform, about 80% of packages in a typical install use a strictly parseable format that is easy to build auditing tools around.
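To illustrate why the strict format matters: the machine-readable copyright format is stanza-based, with each stanza's `License:` field naming the license on its first line (indented continuation lines carry the full text.) This deliberately naive sketch - a real auditing tool should use a proper parser for the format - pulls out just the short license names:

```python
def licenses_in_copyright(text):
    """Collect the short license names from the 'License:' fields of a
    machine-readable debian/copyright file. Only the first line of each
    field names the license; indented continuation lines (the license
    text itself) are skipped."""
    found = set()
    for line in text.splitlines():
        if line.startswith("License:"):
            name = line.split(":", 1)[1].strip()
            if name:
                found.add(name)
    return found

sample = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: libfoo

Files: *
Copyright: 2019 A. Developer
License: MIT

Files: contrib/*
Copyright: 2021 B. Contributor
License: BSD-3-Clause
"""
print(sorted(licenses_in_copyright(sample)))  # ['BSD-3-Clause', 'MIT']
```

Run over every file under /usr/share/doc, even something this crude gets you most of the way to a license inventory.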
If you're starting from scratch, you'll probably want to do a bulk "look at every package" review; given your specific requirements, you can come up with a few simple questions and have your whole team chip in on reviewing. After that, you should simply have a license review every time you add a new package - for the package itself and the new dependencies it brings in. (You want to keep this simple and do it early to reduce sunk costs in building on top of something with an inappropriate license - or even with a "probably but not quite sure" license that needs lawyer time.) One traditional way of handling this is to have a short list of licenses that are "automatically OK for dependencies" - MIT, BSD, ISC, and Apache 2 usually head the list - with others requiring escalation to someone who can talk to (and spend money on) the company's IP counsel, or take responsibility on behalf of the company for using that license. (You'll quickly find engineers taking the initiative to choose alternative solutions that fall in the approved license list just so they can keep going and not get blocked on license review; depending on what problem space you're working with, that might actually be a reasonable accidental result.)
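The "automatically OK" list can be enforced mechanically. In this sketch the approved list and package names are illustrative - your own list (and your escalation path) will differ:

```python
# Licenses pre-cleared for use as dependencies; anything else goes to
# whoever can talk to (and spend money on) the company's IP counsel.
APPROVED = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Apache-2.0"}

def review_new_dependencies(new_dependencies):
    """Split a {package: license} map of newly added dependencies into
    auto-approved and needs-escalation buckets."""
    ok, escalate = {}, {}
    for pkg, lic in sorted(new_dependencies.items()):
        (ok if lic in APPROVED else escalate)[pkg] = lic
    return ok, escalate

ok, escalate = review_new_dependencies(
    {"left-pad": "MIT", "libssl-shim": "OpenSSL", "fastjson": "Apache-2.0"})
print("auto-approved:", ok)
print("needs review:", escalate)
```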
This author has successfully used the debian copyright files to satisfy acquirer and investor due-diligence, mostly of the form "is there any use of the GPL that would require releasing our proprietary source code, and thus reducing its value", for two startups. One of the acquirers was impressed at getting an obviously-exhaustive spreadsheet of dependencies and license categories, instead of the "umm, we have some code?" that they'd gotten from previous startups, saving months of review. Open source dependency management is sufficiently mainstream these days that there are companies4 that will do this auditing as a paid service, usually with value-added features like notifying you when security advisories are issued for anything in your "software supply chain".
Debian/Ubuntu file-by-file packaging information
The entire point of a packaging system is that you know exactly what files in your install come from what packages; a big part of this is just that you need to know where they are so you can remove them cleanly - a step that "just run make install" leaves out - but it also means that if you find a problem in a particular file at the operating system level, you can trace it all the way back to the author, even if it isn't your code.
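As a sketch of the bookkeeping the packaging system maintains for you (the package contents here are made up - the real data lives in files like /var/lib/dpkg/info/<package>.list), the file-to-package lookup behind `dpkg -S` is essentially an inverted index:

```python
def build_file_index(package_lists):
    """Invert per-package file lists (the shape of dpkg's
    /var/lib/dpkg/info/<package>.list data) into a file -> package map,
    which is the lookup `dpkg -S <file>` performs."""
    index = {}
    for package, files in package_lists.items():
        for path in files:
            index[path] = package
    return index

index = build_file_index({
    "coreutils": ["/bin/ls", "/bin/cat"],
    "libfoo1": ["/usr/lib/libfoo.so.1"],
})
print(index["/usr/lib/libfoo.so.1"])  # libfoo1
```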
Other distributions
Red Hat's rpm (also used by SUSE) has a parallel history with Debian's dpkg; while it has made different choices, it has equivalent support for the mechanical aspects of provenance tracing.
Language-specific packaging
NPM (for Javascript), Cargo (rust), CPAN (perl), PyPI (python), Gems (ruby) are all community packaging systems oriented around a specific programming language, rather than an operating system. This means they focus more on multiplatform support and community openness, but often less on package quality or malware prevention5. The upside is that peer pressure (and ease of copying) usually means that one license will usually dominate - for example, most NPM packages are MIT licensed, with ISC and BSD filling in most of the rest.
(Go is a little odd in that while it has a packaging mechanism the format is "a URL pointing at a git repository", making Github into the accidental package repository for the industry; this means it completely lacks any kind of policy enforcement beyond "does the package actually build". You probably want to at least add a layer of your own git mirrors to have a little bit of control, here, even if you'd use upstream package repositories for other languages.)
You'll sometimes hear that the language-specific package system for C and C++ is dpkg (or rpm). That has a kernel of truth (in that nothing else is), but those systems also have tools for automatically extracting metadata and build options from language-based packages directly into operating-system-based packages, so they really are a "higher level" system. They also carry more policy information - the Debian Policy Manual has details on handling packaging subtleties and upgrade issues that have come up across a thirty-year lifespan, which most language-specific packaging can't handle (arguably it doesn't need to, if you're regularly doing clean builds into freshly created environments, but that's a relatively recent innovation.)
Conclusion
If your product is an entire platform shipped on an operating system, you'll probably need all of this - but even if you're working with simpler architectures (microservices on AWS, web-frontend-only tooling) you still need to know what you're building on top of - even if you're not aiming for acquisition, you'll still need to be able to identify which upstream security vulnerabilities have put you at risk, or at least have forced you into upgrades. Provenance - knowing where all of your pieces come from - is just a matter of having (and keeping) the data you need to keep track of it all; this doesn't require a lot of new work, just a little discipline.
-
Key to copyright is copy - you can generally screw around as much as you want locally/personally, copyright-based licenses don't even apply until you give a thing to someone else. ↩
-
Sometimes they are only restricted in those specific ways and otherwise are very broad; one commercial linux IDE licenses individual people but the software can be installed anywhere at all - so we could ship a pseudo-embedded system with that IDE as one of the local editors, and as long as the individual developer had a license for it, they could run it on any fielded system. In another case, we could use a set of embedded libraries from a CPU-vendor SDK and ship those object files anywhere - but we weren't allowed to run them on similar but competing chips (again, not complicated, just very specific.) ↩
-
In some circumstances you want to talk to an IP lawyer about your open source use too - but it's 2025, if they have any experience in software licensing at all this should be a relatively short conversation because (for example) the GPL is older than they are and they have no excuse for not already being familiar with it. ↩
-
Examples include Black Duck which has been around for ages; Snyk has more recently joined the field; their services are shaped differently but they can both provide the kind of "open source audit" that investors are looking for. ↩
-
Aside from simply "selling off" packages, NPM and PyPI (by virtue of being popular and thus high-value targets, not through any particular flaws) have long been victims of typo-squatting, and more recently "vibe-squatting" attacks on similarly-named packages. ↩
This essay is part of a series on "Pretty Good Practices", aimed at development teams that would like to improve their craft, and thus improve their results.
Code Review is the process of having peer developers look at changes early, before they become part of the product, in order to try and find problems before customers do. (This is narrower than "Change Control" in that parts of it are very specific to software, rather than hardware or content.) The broadest modern version of this is Given enough eyeballs, all bugs are shallow but on a more conventionally sized development team you can expect most of the benefit from the first couple of reviewers.1
Code Review is not a new concept - it has roots as far back as IBM in the 1970's - though it is modern enough that it's not part of the "surgical team" model from The Mythical Man Month. It didn't really get popular (outside of environments where it was legally mandated) until free tools like Rietveld (2008) and Gerrit (2009) started becoming available to particularly motivated teams. The popularity of the code review process within large (FAANG) organizations was a source of inspiration for the industry at large, even if the full-blown tools themselves weren't available. (The Github "Reviews" feature wasn't generally available until the end of 2016.)
Before the tooling made the practice a widespread default, a lot of review only happened after problems were found - even at places where senior engineers read all of the changes that went by, looking for red flags, changes were often "integrated by default" and complex changes could get insufficient attention because there wasn't an enforced time budget. The popular exception, and one of the big "examples of what could be done but you're not realistically doing it" was Linux itself - changes went out over email and were only integrated when Linus (or later, his "lieutenants") accepted them, in a refreshingly public process. Git, Bitkeeper, and other tooling adapted to this, rather than supporting or encouraging it, and the process has remained consistent over a surprisingly long timeline.
It's 2025, not 1995 - doesn't everyone already do this?
Unfortunately, no. In the extreme case, you have your brilliant technical founder who has pumped out a bunch of code that impresses the seed investors or an initial customer or two - and works far better than it has any right to - but now you're expanding the team and adding the first people for whom software is a Real Thing and not just some details that distract from the thing they're actually brilliant about. After all, startups are rarely interesting in a purely software way - instead, software provides some novel leverage in an existing space, and serves the problem rather than being the problem.
This isn't unique to startups - sometimes a new project "falls out" of some other field - maybe a mechanical engineer coded up some automation, maybe an admin intern got lucky with getting a Fiverr contractor or vibe coder to build something - and then existing developers get stuck with a "demo that a bunch of people already rely on" dumped in their laps.2 Rarely do you jump into a project in a new space that followed established software engineering principles from the start - instead, you usually stumble into the new space and have to retrofit software engineering principles onto it.
Whatever path the software takes as it matures, it usually progresses from "this did something useful on one computer, at least once" to "multiple people need to understand what's going on here well enough to improve it further". This is the ultimate point of a bunch of software engineering techniques and tools, even if it often isn't expressed that way.
So what are they doing instead?
I've been involved in projects where
- the code was just some files, version control was a New Thing3
- the code was checked in to a free private account, but everyone was working on the same problem on the main branch and branching wasn't really a thing4
- "this function works fine if I type it in but not if I put it in a file"5
... and many other points on a spectrum of how much the craft of software development was taken seriously. (Not out of disdain or disrespect, just inexperience or lack of exposure to Real Software Practices.) While these examples are primarily about version control, it turns out that reasonably sophisticated version control is what code review is built on top of - both mechanically (review tooling itself) and in terms of process (the review acts as a point of gatekeeping, so you need strict definitions of what the change is being applied to.)
Why Review
We've now established that a world without code review is a bit of a jungle, where software development is possible but more through heroism than technique. Why is code review part of this complete breakfast?
The primary motivation is just getting more eyes on the code you're producing, early in the development process. The obvious benefit is giving more than one developer a chance to say "that's odd" or "what's up with that" - but it's also a chance for the team members who, in the future, are going to be fixing or updating this code, to check on how readable (and thus, maintainable) it is, as long as "this was difficult to figure out" is feedback that you'll actually use to improve the code.6 To put it another way: review time is a chance to move "reviewability" issues further up the funnel, so you're not discovering mid-crisis that no one can figure out what the code is doing.
It also gets more eyes on the requirements and gives you another chance to catch a misinterpretation of what was intended, where the code appears perfectly good but is doing a different thing. Clearly you want to have already found such things sooner than this but it's better than finding them later.
It's a chance to emphasize team effort. "Peers pointing out details to each other from our coding standards" gets a lot less reflexive resistance than "coding standards imposed from On High" does, even if you have senior developers driving the standards effort.
It's also a chance to spread historical context from developers who have had longer exposure to the code base - though that can reasonably turn into "that's good to know, can we get it into developer documentation"?78 This is a useful question to prime junior reviewers to ask, when someone responds with a detailed explanation - "great, what documentation should I have found that answer in, and can we put it there" recognizes the value of the explanation while also taking the opportunity to move the comprehensibility of the code base forward.
Ultimately, code review is an early demonstration that the code successfully communicates with other developers. You really don't want whoever makes the next change six months down the line to go "what were you thinking" - especially if that ends up being you, because six months is plenty of time to turn it into "what was I thinking"; it's just much easier to test for that with someone else's eyes while there's still a chance to clarify things.
Review of Public Code
This discussion is mainly about in-team review of proprietary code. If you have a tightly-knit development team for an open source project, it would also apply there, but open source has some additional complexity: one of your sources of code to review is "complete strangers from the Internet." Nurturing a community around a piece of open source code tends to require generosity or at least starting from a non-adversarial stance - but at the same time, some contributors are malicious and you need diligence (and process!) to defend yourself from them.
It's a difficult line to walk, and all I'm trying to do here is acknowledge the difficulty in the hope that you also acknowledge it. It's worth familiarizing yourself with the XZ Utils backdoor (also known as CVE-2024-3094) which is a combination of supply chain threats, the difficulty of dependency proliferation (even with reasonable dependency management) and the kind of things your review process might catch.
This isn't advice on running a public project at scale - dealing with thousands of developers from a starting point of what is honestly outright paranoia isn't healthy or productive. Instead, I'd suggest looking at things like curl's contributor guide for what a good "public face" looks like, and how you can still strive for quality with many strangers instead of a closely knit team - commercial development almost never has the structures (or need) to wrangle that many contributors, but it's still a useful perspective, and you'll find many good ideas on how to treat other developers humanely (that you wouldn't necessarily come up with on your own for coworkers.)
How Code Review Works
Mechanically, code review is part of the process of making changes to code9. You start with what you believe is a working codebase, take a proposed change, and try to analyze the change to determine if it
- leaves you with a working codebase afterwards
- actually achieves the goal it was intended to accomplish.
This means you need reviewers to understand what "working codebase" means - "does it conform to our coding standards" contributes to that, and having coding standards include things like "we assume that code that isn't tested doesn't actually work" is a particularly strong one - but ultimately a reviewer needs to know what is changing and understand what the change is actually doing.
They also need to understand the intent of the change - not just as expressed in the code, but expressed externally by what the planned change was - which means you need some shared system that expresses those plans. How you do that is a project management question and out of scope of this article, but it should be clear that you have to have some record, even if it's a simple prose explanation at the top of the review request.
This is somewhat biased towards small changes, but that's the way to make them comprehensible - large changes can generally be broken down (and large mechanical changes, like coding-style consistency improvements, can be made more reviewable by keeping them isolated - it's easier for a reviewer to convincingly affirm that a change is, as intended, only mechanical, and thus has much lower risk to the codebase because it really isn't changing anything else.)
If you're reviewing an entire codebase, perhaps because your team is seeing it for the first time and needs to bring it up to standard so that they can start doing new work on it, you likely want to instead do an audit which needs a different approach.
Gatekeeping
Review is implicitly a gatekeeping action - "yes, this can go in" or "not yet, these improvements are needed." (In a functioning team you rarely need an outright rejection - that should be handled at the tasking level - any developer should be able to start from a position of confidence that the change is actually desired, code review does not substitute for team management.) That means it needs to happen at a point in the process where it can actually be used for that, though the details depend on your branching strategy.
Feature Delivery Cadence
Unless you're doing early "exploratory" development, you probably have some sort of time pressure10 for delivering features. Since review is in between creating features and landing them, deadlines put certain pressures on the review process.
For reviewers, it's important not to delay review until too close to the deadline - we've already established that you want to do it promptly while things are still top-of-mind for the developer, but you also want to allow the developer time to turn around thoughtful responses to feedback. It's also important to point out right away if you have any planned reasons that you will not be available to perform your review duties, like vacations, to allow the developer (or team lead) to find an alternate reviewer.
For developers, once your team has a normal review cadence established, it will become important to explicitly communicate about unusual review deadlines that legitimately need faster response (this should be rare, and is not just "hurry up so I can move on to the next project".)
Complex projects that are expected to interfere with other development (particularly by making things hard to merge) such as "updating to a new coding standard all-at-once" or "interface renaming" or "upgrading a major non-compatible dependency" need to be coordinated across the team, but you want to avoid having everyone stop work during the process. One approach is to do the work on a branch, aiming for a point where most other work can synchronize so there are few outstanding branches, then do a "final burst" of pulling those changes back in to the complex branch and reviewing and landing it, before freeing everyone up to continue.
Even simple features can collide, so while you don't usually want everyone on the team keeping track of everything that's going on, it's still valuable for anyone who's about to land something to know what else recently landed, to help understand where to look first for unusual behaviours. (The better your testing, the less of a problem this is!)
Feature Branches
One common strategy is the use of "feature branches" - you manage changes to the code base in units of features, and each new feature starts by branching the main codebase at a specific point (basically a managed copy, so you don't have to deal with competing changes from other developers in an uncontrolled manner. After all, they'll also be working on feature branches.)
In the straightforward case, you've implemented your new code on this branch, you've added new tests and they pass, existing tests also pass, and it's ready to review. The mechanics of setting up the review will usually involve selecting users and providing a description of the change (which may already be available as part of a higher level ticketing system that drives all of this at the project management level - it's still worth adding specific implementation commentary here, to aim your reviewers in the right direction.)
The primary thing keeping things from being straightforward is when some other feature branch has landed and has overlapping work with yours. (In the most obvious cases this will show up as conflicting code directly in the version control system11 but it's more likely to be something like an API or calling convention change where, if you're lucky, tests will catch it.) If there's any other work going on in parallel, you'll want to do an update from master immediately before review, so the reviewers are seeing what's actually landing (rather than what you started from) and so that you can show that "if you hit Merge right now, the tests would pass" because your branch matches what the post-merge main line would be. The primary defense here is communication, rather than tooling - your team members should have at least a rough idea of what related work is going on, and if the project is too big for that, it needs to be broken up into smaller components with boundaries/contracts that make it possible to have enough context most of the time.
Depending on how long your feature work takes, other features may land that change areas you rely on - in some cases, even features that you need to finish your work. This isn't just about avoiding merge conflicts at the end12, it's about work that you specifically need to absorb into your feature branch to continue making progress. Ideally, for things like this you'd also be a reviewer on the other changes as part of making sure they support what you're trying to do. This also applies pressure towards keeping features isolated and small - or at least breaking larger features into smaller components.
(As your team gets bigger and there's more churn, you may have to do this again after approvals; you may need to go as far as having a merge schedule, especially as you're approaching a formal release point where you want the option of delaying a feature rather than delaying a release, based on what external commitments are driving your deadlines.)
Multi-layer feature branches
When you are programming in a more exploratory mode (particularly common when bug fixing) you might not know at the start how the changes will break down; one feature might need another feature, or cleaning up some technical debt, or some unanticipated restructuring - the change may just be shaped such that breaking it into several pieces simplifies the review - "adding a system-wide feature" and "using the new system-wide feature" have different kinds of complexity and possibly different reviewer audiences. Perhaps you're the first one touching some old code and need to do a "cosmetic changes only" whitespace cleanup or style pass separate from your own less neutral changes.
There are a few approaches to this. One is to just split the additional work off as new blocking features, so that they start from trunk as their own independent feature branches; then you can just update your branch once they land (or depending on the version control system, you can repoint your branch at them until they land, then point it back at the main branch.) You can describe them as blocking in the review description, but it's an entirely social construct; you're peeling off some self-contained work that you'll use when it lands on the main line.
Slightly more advanced is to develop the "sub-feature" as an independent feature but pull the development branch into your primary feature branch as it evolves. This lets you use it as it develops, but leaves your branch with all of the complexity from both parts, until the sub-feature gets merged and it "disappears" from your feature branch relative to the main line.
Most expressive (but also usually requiring the most sophisticated use of the version control system) is the version where you start with the same feature/sub-feature branches, but "repoint" your feature branch to be based on the sub-feature. This means that your feature branch doesn't reference the main line directly, just indirectly "through" the sub-feature; likewise the mechanics of merging the sub-feature and then "losing" it (until you re-point your branch to the main line post-sub-feature-merge) have more moving parts. It does let the version control system represent more accurately the way you're actually thinking about the code; it can also reduce confusion about which changes are part of the sub-feature vs. the primary feature, though if that confusion is likely this might not be the right way to split them up.
Choice of Reviewers
Depending on the size of your team and the distribution of skills, you may have a bunch of options on reviewers. At the simple end of the spectrum, you may only have three developers and one worked on the branch, so your policy will basically be "one other reviewer" (or depending on your growth plans, start with a two-reviewer policy even if that means "everybody".) "Minimum two-reviewer signoff" is a popular starting policy - at worst, it's inspired by the two-key launch authorization policy; at best, it is an opportunity for reviewer mentoring, or for getting review coverage from different perspectives or areas of experience.
If you have a variety of experience levels (or areas) on your team, you might instead tag reviewers with particular areas of specialization - "John needs to be a reviewer for any embedded code", "Mark needs to sign off on anything with security implications". Ultimately you don't want to block on any single reviewer (people take vacations, after all!) so even if you start out with specialist reviewers you should consider pairing them with other developers specifically tasked with "levelling up" in those areas.
Another way to handle specialist reviewers is to have generally distributed "required" reviewers and add the specialists as "optional" reviewers, who are expected to chime in on anything specific to their expertise but aren't necessarily expected to review the entire change. (Review tooling rarely has good support for "partial" review like this, but policy can be enough.) You do have to watch out for getting specialist reviewers "stuck in a rut" where an area of the project can only be worked on and reviewed by two or three people - there's a tendency to develop tunnel vision or even shared but unrecorded knowledge of how the code works, which reduces the effectiveness of the review, especially the quality of the "can future readers comprehend this" measurement.
Depending on how explicit you are about scheduling your team's time, another option is to have a handful of reviewers on a change but count the first two approvals as enough. (If you do this, you generally want formal policy on reviewer response time, otherwise you will run into unfair effort-sharing issues that will harken back to college team projects and chore sharing - it can still work, it's just likely to cause people-management issues comparable to just being explicit about the assignments in the first place.) If your team is already comfortable with a top-down management hierarchy, then you can just assign reviewers. If your team is more self-organizing and team leadership is more about reducing friction and interfacing with other systems in the company, then review assignments can also be self organizing (though you'll want to keep an eye on spreading review experience and helping keep reviewers from getting too specialized.)
What Should Reviewers Do
The main thing a reviewer is expected to do is understand the change that's being made. Given that understanding, it is an oversimplification to say that they then need to "approve" or "reject" it - your team needs to have an understanding of what that means in the context of your project.
The main thing to do with the understanding is to compare it to what the change is supposed to be doing. This usually comes from your project management tooling, and should have been established before even starting the feature branch. It should also be reflected in any internal project documentation - which would ideally also be managed in the feature branch, but might be in a parallel external system instead (if it is, that should be part of the scope of the review as well.)
After the basic "it does what needed doing" review, the next question is "does it do it well". Does it meet your team's standards? Does it use your preferred coding idioms? Does it include sufficient testing (and does the review request include evidence of those tests working?) How about related documentation? If it has a UI component, have the translations been updated? Does it meet accessibility standards? If the change incorporates code from elsewhere, have the licenses and CVE history (or other security reputation) been reviewed? If user data is involved, is the GDPR applicable? If the code involves new network connections, has there been an "attack surface" review, and has deployment documentation been updated?
If enough of these apply, you probably need a formal checklist - with as much automation as you can arrange. (One team built a literal can-i-merge tool that did a bunch of automated tests and quality checks on a trial merge, which was great except it took over half an hour to run, and still covered less than half of the suggestions above.) In some organizations this is (somewhat mistakenly) called "CI" tooling; it is integration-adjacent, but unless you explicitly manage a sequence of merges (which would let you test each feature as it is about to merge) you end up needing effectively simultaneous pipelines for each of your developers.
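A checklist runner like that can start very small. Here's a minimal sketch - the check commands and the can_i_merge name are illustrative, not a real tool; your team's actual checks would come from its own standards:

```python
import subprocess

# Hypothetical pre-merge checklist: (name, command) pairs. The merge is
# allowed only if every command exits cleanly.
CHECKS = [
    ("formatting", ["black", "--check", "."]),
    ("unit tests", ["pytest", "-q"]),
]

def can_i_merge(checks=CHECKS, runner=subprocess.run):
    """Run every check; return (ok, failed_names) so CI can report specifics.

    `runner` is injectable so the checklist itself can be unit tested
    without invoking the real tools.
    """
    failures = [name for name, cmd in checks
                if runner(cmd, capture_output=True).returncode != 0]
    return (not failures, failures)
```

The point of returning the failure names rather than a bare boolean is that the tool's report becomes part of the review record - "blocked by: unit tests" is actionable in a way that a red X isn't.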
What Should Reviewers NOT Do
One important reason to formalize coding standards (even if it's as simple as "Run black on all Python code") is that experienced developers will often have preferences that are not written down, "I know it when I see it", or just biases, and it's unfair to other developers to get review pushback for things that there's no reason for them to have known. Put another way, "I wouldn't have written it in that particular way" is not a valid reviewer comment; if the reviewer actually has concrete beliefs behind the opinion, they get to write them down and build consensus around them first. Critique isn't an effective way to get elegance - you need to actually teach it.
There will also be issues around "how much additional testing can a reviewer insist on". One common dividing line is "if we have a test like that, insisting on using that technique in this area is fine; if we have to invent a new way to test this thing, there's a reasonable discussion to have about building that additional test infrastructure as a separate task" (which may still block the higher level feature.)
Reviewers should also not need to be running tests - the developer should be running them all along during their development, locally (as long as they include some convincing record of the test run) or as part of per-branch CI (which can just be a link to a result page.) Developers shouldn't be submitting changes that don't pass; the reviewers should be looking at "do these new tests represent the feature well, do they cover the interesting parts of the change" and not "do they actually pass if run". (If you're very early in adopting the testing that supports this kind of review, it may make sense for reviewers to run the tests as a cross-check that you're getting reproducibility right, but that should be an additional check of your process, not a replacement for developer-reported test results.)
Coding Standards as Crystallized Arguments
Coding standards are simultaneously trivial and vital. The common feature of most coding standards is that they apply to things where there isn't an obviously superior choice - or at least there's enough ambiguity that your team is wasting time arguing about it. At the same time, developers typically have the kind of independence and self-perceived authority that they tend to reflexively reject arbitrary choices handed down from on high. For new team members, having a standard to point at reduces friction and simplifies integration with the team - and if it seems arbitrary to them, you can point out that there was an argument phase and arguments got distilled down into the final resulting policy. It may help to have developers sign off on the standards explicitly, so you can show new team members "look, this standard has five of your peers behind it already, do you have a strong enough case that you think would convince all of them to change their minds?" Being able to point to past efforts to reach consensus helps it look less authoritarian, and makes it easier to convince newer team members that they'll be included in similar decisions in the future.
For current members when you're trying to establish a standard, it gives them a time and place to bring up their opinions and preferences, with the understanding that once you have a decision, it's time to move on to more important things.13
(It isn't necessary to completely suppress further discussion of the topic - just to get buy-in on the idea that the status quo is based on battles already fought and there needs to be significant new information to revisit the question.)
Mechanically, it's also good to produce worked examples of what the standard actually means - your project isn't public enough or big enough to have its own StackOverflow section, but an internal library of examples is a great introduction to how to apply a standard. Ideally these examples will look more like "use this interface" or "configure your editor this way" than "copy this boilerplate" - you will find a lot less resistance to standards if you put effort into making it straightforward to conform to them.
Writing for Review
One reason to have everyone on the team involved in review is that it gives them perspective from "the other side" on what kind of things are harder or easier to review.
The common pressure here is towards minimalism - smaller changes are easier to review, and well-scoped changes - where everything that changed is clearly and directly related to the problem at hand - are more reviewable even when they are larger. This pressure is strong enough that you'll need some explicit back pressure on things like "you actually need more tests than that" or "this really does need to be better commented or simpler" or "yes, that's very Clever, but no one will appreciate that when they look at it again six months from now."
In particular, this gets in the way of "casually" cleaning up technical debt, even if said cleanup might make the feature change simpler. The sensible thing to do is to acknowledge those explicitly and turn those cleanups into tasks that block the features - in theory you can show that these paths get the features done sooner than without the cleanup, but that is a significant communication challenge (a worthwhile one - but systems like Agile, for example, make this particular benefit harder to express in conventional backlog management, especially if you discover the cleanup opportunity while doing the feature work.)
Depending on the history (and "maturity") of the code base, you can get some value out of "deliberate scope creep" - team-wide recognition that when a new feature exposes the need for broader changes to enable continuing work and reduce technical debt, it's worthwhile to take the opportunity to make time for the broader change as part of "paying down" the technical debt, or "recognizing that the technical debt has come due". This is basically another explicit anti-minimalism force - and you can develop a consensus in the team about how much pain technical debt is causing vs. tightening up feature work.
Keep Function-Neutral Changes Separate
As an example of how trivial details can cause disproportionate amounts of friction: ideally you'd start with a common coding style and never have to transition. And by "ideally" I mean "that's never actually happened" - so you're going to have to make the existing code base conform. Ideally you'd do that all-at-once, do a testing pass, and be done with it, but, well, "ideally" - in practice you're not going to make a global change, because you don't trust your testing enough, even in the face of a self-evidently neutral change. Instead, you're going to encourage people to make these style changes when they touch old code for the first time - and then their branches will be cluttered with style work (which often touches most of the file) and end up harder to reason about and review, possibly causing more friction than giving up and not changing the style in the first place would.
You can compensate for that by requiring that neutral changes get their own reviewed branches, where you only change style and nothing else; you need to do these before your real changes, so you probably also want to fast-track them. You can do that reasonably safely - the reviewer isn't trying to confirm correctness or requirements, just "this really is only changing style and does nothing to what the code does, and is a neutral change" which only really requires language familiarity/competence and not deep familiarity with that part of the codebase. This also means that you can safely have junior devs do the review, and you can have senior devs do them rapidly without derailing them much. (You may be able to encourage senior devs to take on these reviews because they have a better appreciation for the benefits of the change, and may have even driven the adoption of the policy in the first place.)
Neutral changes do still have intent - primarily, to make the code easier to work on (by becoming consistent with more modern parts of the code base, and aligning with the standards the team has come to expect) but also to make those changes complete so you don't need to revisit them again in the same area. It's just that all changes of this sort also share a common intent - "don't change the behaviour of the code" - that is unlike any feature change. This does make these kinds of reviews simpler, since there's nothing to interpret.
With some languages and environments you can use tooling to prove that the changes are neutral - for example, in Python you can use the ast module to turn the before and after versions of the file into an abstract syntax tree and show that those haven't changed. Not a perfect test, but for this kind of thing, a fast cheap test that leaves you with a couple of bits that need manual explanation gives a better return on investment than full-scale tooling - if you're going to build that much tooling, you can probably build the tool that makes the style change directly, which you can then review/audit and then trust directly. You probably don't want to go too far down that path unless your product already has you building software-modification tools, though.
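The ast comparison fits in a few lines - a sketch (the is_neutral_change name is mine; the ast module itself is standard library):

```python
import ast

def is_neutral_change(before: str, after: str) -> bool:
    """True if two versions of a Python source file parse to the same AST.

    Comments and whitespace never reach the AST, so pure style changes
    compare equal. Caveats: docstring edits DO change the AST, and this
    says nothing about non-Python files touched on the same branch -
    those are the "couple of bits that need manual explanation".
    """
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))
```

For example, `is_neutral_change("x=1+2", "x = 1 + 2")` holds, while changing `1` to `2` in the source does not. (`ast.dump` omits line and column attributes by default, which is exactly what makes reformatted code compare equal.)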
Social aspects of Code Review
Code Review is ultimately a structured form of communication among developers, which means that there are ways to explicitly manage that structure.
Senior reviewer vs Junior
Specifically encourage juniors14 to ask about testing to get them engaged; it can serve as a "safe" question that they can ask when reviewing code from senior devs, since all of you are trying to reinforce a culture in which everyone is expected to respect the question wherever it comes from (and you don't need that much familiarity with the code base to be able to tell if tests are present.) Also, a test suite often serves double-duty by explaining the code from a different angle - the test calls are also "worked examples" of the interface.
Another way to keep juniors from falling into "this code is from a senior developer, I barely understand it, it must be good" trap is to encourage the idea that a review is a reasonable place for asking good questions about the code. The review process should still emphasize finding problems, but readability and clarity are core components of code being reviewable in the first place. A reviewer who is still becoming familiar with the code can often have clearer judgement of the quality of an explanation, since they felt the need for it in the first place.
In the other direction - remind seniors that the clever idea they came up with during the review is not usually superior enough to derail the review. It's one thing if the code under review is wrong - even if that's merely of the form "we have a tested and working API over here that already does this, can you use it instead of duplicating it here?" - but if it's just a matter of style, and the style isn't codified and agreed on by the whole team, then that's not necessarily enough to push for it. (Using this to launch a conversation about codifying that detail as a shared standard is fine - but it probably belongs in some other forum, not this particular feature review.) An explicit mentoring relationship can change where this bar is - but then the discussion can and should happen in private pre-review, not in "public" (in front of the whole team.)
Tone
Whether or not you're a stereotypical "cranky developer" it's worth considering tone extra carefully in review discussions - because being judgmental about the code is part of the task, it's easy to slip over the line into being judgmental about the developer which is harmful to team relationships.
It's also useful to consider that review comments are usually preserved, not just the code - and if they aren't, that's a tooling issue that's worth fixing - when doing "archaeology" (or "spelunking") on the code, especially when debugging a regression, it's pretty valuable to be able to resurface the discussions around when a problematic change was made. Did this weird case come up at the time, and get dismissed? Great, we were almost sufficiently foresightful, maybe recalibrate that reflex a little - especially if "it would be inconvenient to write a test for this" was in the arguments, turning it into a concrete "Test Better!" argument can be valuable. Or maybe no one thought of it - possibly that means you need to "review a little harder", or maybe it was Just That Strange; either point is interesting to include in a post-mortem.
Interactivity
Respond promptly, in both directions. The whole point is keeping things "early in the funnel" while the developer still has things in their head and can answer questions; at the same time, it's important for the developer to come back with either "good question, here's why that's not a problem" or "good question, it's going to take me a day or two to improve this and have you look at it again." Most code review systems don't have good affordances for this, so encourage explicit communication that makes it clear whose court the ball is in.
You want your team to understand that reviewing the work of others is as important as their own work - especially if you care about feature "velocity". While this does require balance, review is happening at the point where something is "almost done", and giving it prompt attention means that it can land that much sooner, and you can both move on to the next thing (or return to your current thing) more quickly.
One of the workflow challenges in being responsive is that it means review is an interruption - not as severe as being on pager duty, but still something difficult to plan your time around as a developer. One conventional approach is to just observe how much time you spend on review and adjust the amount of review you take on based on that - estimating review time has some of the same problems as estimating software in the first place, but you can generally develop the ability. This is one flaw with the common approach of assigning reviewers at review time - the reviewer can't plan as much as they might like. If you have a review cadence, being explicit about expecting developers to be spending a standard amount of time on review at the end of the cycle can help by providing some structure. If you have particularly specialized reviewers, making sure they know up front (when a new feature is started) that a change is coming up in their specialty can also help - even if they aren't explicitly engaged with the work (though if you can get early design feedback from them, that's probably a good tradeoff against doing their own work in that area) just having a "heads up" can help them keep context in mind and make jumping in to review a little bit less disruptive.
Another approach (sometimes inspired by having a "review problem") is to actually schedule reviews as meetings. This makes the time more explicit and high-bandwidth, since the reviewers and the developers are mostly prevented from engaging in other distractions; it can backfire if the reviewers quickly turn up a handful of things that need developer work before continuing, especially if there's pressure to Fix Them Right Now during the meeting. It's important to be aware that "coding with people looking over your shoulder" is a distinct skill that's rarely explicitly developed - it's more of a teaching or rehearsed presentation activity - and unless it's already a specific mentoring situation, you'll usually get better results from letting the developer go off and actually think about the revisions, particularly if the review has turned up additional (hopefully small) problems to solve. It's also important to understand that meetings have a lot of overhead regardless of what they're for - the "review meeting" approach won't be efficient but it might serve to improve shared alignment, or to focus extra attention on particularly critical cases where the expense is justified. (There's also likely to be a lot less resistance to additional meetings if everyone is clear that the goal is to stop having them once communication has improved.)
The Review Lifecycle
The review process doesn't end with the first round of feedback - while it's important to converge quickly, there's still a step of making changes based on feedback and then getting those reviewed. Part of it is making sure the developers involved can communicate efficiently - but part of that communication is the changes to the code itself.
As the team and the code mature (and practices like Writing for Review take hold) you may find yourself needing very few changes in response to the initial review feedback. Re-review goes much more smoothly if reviewers can see those changes as increments. Most version control systems support this in straight-line development, but among modern git users there is a popular development pattern of "cleaning up" your changeset using rebase and force push. Effectively, this makes the changes look like the final form is what you intended all along - on the one hand, this is rewriting or "retconning"15 history; on the other hand, it does mean that when you look back at the code there isn't any distraction from the incremental problem solving, and only the final reviewed change is preserved as history. Applied during a review, though, this creates an immediate problem: the reviewers have to look at the whole changeset from scratch, instead of seeing the incremental change as incremental. Not only is that inconvenient, it leads to lower quality review, since "oh, I've seen that bit before, it was ok last time" is a far less rigorous (and possibly untrue) judgment than "that code is literally unchanged because this incremental diff doesn't touch it".
There are ways to mitigate this16 with tooling improvements, but the simpler approach is to just declare that cleanup, and force push in particular, are only allowed on private branches, never shared branches - and once review has begun, your branch is not private. Having to use different workflows before and after review begins can be error prone, or at least tedious, so it's useful to be able to point to "review friction" as a team-wide motivation to put up with it.
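The tradeoff can be made concrete with a runnable sketch (all file and commit names hypothetical; needs a reasonably recent git for `init -b`). It builds a throwaway repo, answers review feedback with a separate fixup commit that a reviewer could diff on its own, then "cleans up" with an autosquash rebase - after which the reviewable increment no longer exists:

```shell
#!/bin/sh
# Sketch: what a reviewer loses to a rebase-and-force-push "cleanup".
set -e
repo=$(mktemp -d); cd "$repo"
git init -q -b main .                       # -b needs git >= 2.28
G="git -c user.email=dev@example.com -c user.name=Dev"
echo v1 > code; $G add code; $G commit -qm "feature: first draft"
# Review-friendly response: a separate commit reviewers can diff in isolation.
echo v2 > code; $G commit -qam "fixup! feature: first draft"
echo "before cleanup: $(git rev-list --count HEAD) commits"
# Private-branch-only cleanup: rewrite history so the fixup never existed.
# (GIT_SEQUENCE_EDITOR=: accepts the generated todo list unmodified.)
GIT_SEQUENCE_EDITOR=: $G rebase -i --autosquash --root >/dev/null 2>&1
echo "after cleanup:  $(git rev-list --count HEAD) commits"
```

After the rebase there is a single commit containing the final content - tidy history, but any reviewer who had seen the first draft now has to re-review the whole changeset.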
In open source projects one occasionally finds a pattern of keeping all the commits for review but doing a squash merge. This keeps the development visible at review time but doesn't keep it long term - the theory is that this gives novice contributors more freedom to flail, since it won't reflect on them at all after the contribution is merged. (It's sometimes suggested that this makes git bisect easier to use, but using --first-parent is enough if you're keeping the usual discipline of the main line always passing all tests.)
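To sketch the --first-parent point (all names hypothetical; `bisect --first-parent` needs git 2.29 or later): the toy repo below merges a feature branch containing a failing work-in-progress commit, then introduces a real regression on the mainline. Bisecting along first parents finds the regression without ever visiting the flailing intermediate commits:

```shell
#!/bin/sh
# Sketch: git bisect --first-parent stays on the mainline, so intermediate
# feature-branch commits that never passed tests don't mislead the search.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q -b main .                       # -b needs git >= 2.28
G="git -c user.email=dev@example.com -c user.name=Dev"
echo ok > check; $G add check; $G commit -qm "good baseline"
git checkout -q -b feature
echo broken > check; $G commit -qam "wip: not working yet"   # failing midpoint
echo ok > check;     $G commit -qam "feature done, tests pass again"
git checkout -q main
$G merge -q --no-ff --no-edit feature
echo broken > check; $G commit -qam "regression slips onto main"
# Bisect bad=HEAD, good=HEAD~2, along first parents only.
git bisect start --first-parent HEAD HEAD~2 >/dev/null
result=$(git bisect run sh -c 'grep -q ok check' 2>/dev/null \
         | grep "is the first bad commit")
echo "$result"
git bisect reset >/dev/null 2>&1
```

The "test" here is just `grep -q ok check`, a stand-in for a real test suite; the key observation is that the failing "wip" commit is never checked out.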
Conclusions
Code Review is a process that helps you get more eyes on your team's code, but there are lots of details that go into a good workflow for this. It's an extreme case of the belief that code is as much about communicating with people as it is about communicating with machines, which means you need to be attentive to social friction while developing it. While it requires a reasonable version control workflow, it doesn't need all that much additional plumbing - it mainly needs more communication.
-
Jakob Nielsen, 2000 was original research specific to qualitative usability testing, but the model intuitively fits code review as well. ↩
-
The flexibility for non-developers to express things in code is often a valuable business practice - but it's also a hint that overall, software development perhaps isn't as much of an engineering discipline as we insist it is. ↩
-
There wasn't even .old or date-based history; just one set of files that got changed by two or three people (occasionally yelling at each other.) RCS let people at least "lock" the file they were changing, which meant people yelling at each other before losing conflicting changes. Full disclosure: this was a 1990 problem, not a 21st century one - but inexperience sometimes looks like time travel. ↩
-
For a short time around 2015, Atlassian Bitbucket had a big advantage over Github in the not-yet-off-the-ground startup market - a free account could have a private repo. What it didn't have was Github's forced bias for making changes via pull request, so everyone was just checking out master and resolving conflicts rather than avoiding them - which, honestly, isn't too much of a distraction when you have three people sharing a desk. ↩
-
A much-ignored point of consistency is tabs vs. spaces, and while developers tend to be the sort that want personal flexibility, this is one of those areas where you must make a choice (and that choice is spaces) and enforce it (VC hooks if you can, plus shared default configs for the editors and IDEs your team prefers), because trying to use both will hurt your team - and no one really prefers one to the other that strongly, though they're likely to have been burned in the past by having not nailed it down. ↩
-
Google famously goes as far as to have a company-wide "language skills and standards" level among reviewers, called "readability", specifically to encourage raising the quality bar for all internal Google code. I'm using the term more generally, but it's interesting to see where Google puts emphasis. ↩
-
See also Code Retirement Pages. ↩
-
A common conversation on teams that use something basic like Confluence or Google Docs without someone tasked as "information architect" or "gardener" is for discussions about documenting something to go "Well, where did you look for it? I'll put it there." "I just searched for these keywords." - which doesn't help to discover structure. It turns out to still be a useful conversation - just make sure to add those keywords explicitly in the document. You could think of it as "old-school SEO", but it's just making sure you include literal keywords that are relevant to your topic even though they're not actually part of how you're discussing the solution. ↩
-
This sounds like "Change Review" but that's a broader concept applied to systems and configuration rather than simply code (often hiding the "code review" part as "change the system to use this specific release" that was itself reviewed before getting released from the software organization.) ↩
-
Time pressure can be real ("we need to be in compliance with new regulations by 1 January"), fictional (Agile "sprints"), or self-inflicted ("our marketing campaign promises these features this quarter but they never talked to engineering about it"). In rare cases (XP's "Release Train" concept from the late 1990s) they can be entirely decoupled, with features delivered as they qualify, but that usually requires very tight coupling with a single user rather than having a product. Even without direct customer pressure on timing, if you have a product that has expensive tests (like certification audits, or mechanical performance tests of robot systems) you might reasonably have a group performing those tests on a limited schedule that needs a stream of releases that are "worth testing", which provides the same kind of back-pressure on the review and merge process. ↩
-
Atlassian Bitbucket used to aggressively render "this won't merge back cleanly into the target branch" as part of the review mechanism, but they dropped it at some point and you're probably left to your own discipline - likely pulling your main branch into your feature branch and re-testing before landing it. ↩
-
You also never force push to the mainline, but that's because ideally you don't commit to the mainline at all: you get review approvals on your branch and merge it and that's all you ever do to the mainline. (Most professional grade systems will let you outright forbid this by policy, or even only allow a specific "repo owner" to actually ever perform merges - but the fact that it is mechanically disruptive to everyone else sharing your repo and they will come after you with torches and pitchforks will usually encourage self-policing.) ↩
-
This is related to the military concept of "under orders" - a well run team will support giving a lot of direct feedback during the planning stage, but when a plan is executed, it's executed pursuant to authority - not inflexibly, but without pushback about preferences or committed decisions. (Not that software teams can be run to military standards - "herding cats" has been the descriptive term since at least the 1990s - but it is helpful to see that arguing is friction and there's a time and place for it.) ↩
-
While "Junior Developer" usually means someone who has relatively less career experience, in this context it also includes someone who has only recently had their first exposure to this team or this product. ↩
-
Retroactive Continuity - that is, failing entirely at preserving continuity but producing something that looks (from the present) as if it were a coherent story all along - is a staple of the comic book and comic book movie industries. The term is usually pejorative. ↩
-
This author hasn't experienced any beyond "don't do that" but is eager to hear of them. ↩
Code Review is a fundamentally incremental process, where you start with working code, and analyze proposed changes with the goal of continuing to have working code. What if instead you've been handed an entirely unfamiliar chunk of code - from a subcontractor, a vendor, a founder, or "the Internet" - and it has grown and evolved without ever having been subjected to your review processes?
In the extreme case, you only have the code - perhaps the only test suite it has is that "it's working in production already", and it's now your problem to keep it that way.
Testing
Obviously1 your first task is going to be to start coming up with tests that represent "it works in production" in a more fine-grained way - even if that's just "stand up a staging system, run these commands against it" (in a repeatable, and preferably automated, way.) This leans into the idea that "what the system does now is Correct because someone is paying for it", which isn't the most sophisticated expression of intent but it's a fine starting point.
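A minimal version of "what it does now is Correct" testing is a characterization (or "golden master") harness: record today's output, then fail on any drift. Here's a sketch, with report.sh as a hypothetical stand-in for whatever command you'd actually run against a staging system:

```shell
#!/bin/sh
# Sketch: a characterization-test harness - record what the system does
# today, then treat any change in behavior as a failure to investigate.
set -e
workdir=$(mktemp -d)
# Hypothetical stand-in for a real command run against staging.
cat > "$workdir/report.sh" <<'EOF'
#!/bin/sh
echo "42 active users"
EOF
chmod +x "$workdir/report.sh"
# First run: record current behavior as the "golden" baseline.
"$workdir/report.sh" > "$workdir/golden.txt"
# Every later run: any difference from the baseline is a failure.
if "$workdir/report.sh" | diff -u "$workdir/golden.txt" - >/dev/null; then
    echo "PASS: behavior unchanged"
else
    echo "FAIL: behavior drifted from recorded baseline"
fi
```

This encodes no understanding of the system at all - it only pins down "whatever it does now" - which is exactly the right starting point when the paying behavior is the only specification you have.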
Documentation
If you have documentation you're in better than average shape, as it's usually a higher level expression of what features people cared about enough to write down - it can also capture "what we thought it should do" which can be clearer guidance than "what it happens to have ended up doing". (This is also why documentation is more important than you think it is for highly technical software products - trying to explain what's going on can lead to "huh, that is stupid when you say it out loud" Ah-Hah!-moments that can feed directly back into valuable customer-facing improvements. This is also why many documentation people have an inner product manager that you'd be well served to take advantage of...)
The Code
Once you're ready to look at the code, you'll probably want to go back and forth between identifying the structure and figuring out where things go, and examining the details of how this is accomplished. Ideally you're recording this analysis for your coworkers and producing internal documentation for future use - you may come away with a "good enough"/"can't use this" verdict, but that verdict isn't the most important output, especially if you do end up actually accepting the code. Recorded decisions are also good draft material for a more formal auditing policy guide, if you end up doing a lot of this work.
Code Structure
Depending on the kind of code you're working with, the structure may be more discoverable "outside-in" than "top-down", but in either case you want to start looking for
- interfaces and data flow
- component structure (internal - how it's built, external - dependencies)
- how "features" are expressed (since those are likely the first things you're going to be asked to change)
This structure should provide guidance for what kinds of tests you should be writing and which parts of the system you want to be able to demonstrate still work after future changes.
(In a system where all changes go through code review, you're generally showing that "all tests still pass" - for an initial audit like this, you are instead likely to write tests that seem entirely reasonable and should pass but don't - so you need some way of tracking "expected" failures, so you can put pressure on fixing them soon, even if you can't do it immediately.)
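One lightweight way to track such expected failures is a known-failures list consulted by the test runner, so audit-time discoveries report as "XFAIL" (a visible fix-me backlog) instead of breaking the run. A sketch, with hypothetical test names:

```shell
#!/bin/sh
# Sketch: an expected-failures ("XFAIL") mechanism for an initial audit.
# Tests that should pass but don't yet are listed in known_failures.txt.
workdir=$(mktemp -d)
printf 'test_handles_empty_input\n' > "$workdir/known_failures.txt"

run_test() {  # usage: run_test <name> <command...>
    name=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $name"
    elif grep -qx "$name" "$workdir/known_failures.txt"; then
        echo "XFAIL $name (known problem - schedule a fix)"
    else
        echo "FAIL  $name"
        status=1
    fi
}

status=0
run_test test_basic_arithmetic    sh -c '[ $((2 + 2)) -eq 4 ]'
run_test test_handles_empty_input sh -c 'false'   # hypothetical: fails today
[ "$status" -eq 0 ] && echo "no unexpected failures"
```

The run stays green while the XFAIL lines remain visible in every report - which is the pressure you want: the list should only ever shrink.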
Code Details
If you have fully automated tooling around your coding standards (formatting, cyclomatic complexity, comment density, type annotations) it's worth giving those a shot to "see what happens". If they don't fail outright, they can give you an initial view of what parts of the system are likely to need attention soonest; if they do, then at least you have a clear early cleanup task to schedule.
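If no tooling runs cleanly yet, even a crude survey can provide that initial view. As a hypothetical zero-setup stand-in for real complexity tools, ranking files by length gives a first map of "where to look soonest":

```shell
#!/bin/sh
# Sketch: a crude "where will attention be needed" survey by file size.
# File names are hypothetical; real audits would layer linters and
# complexity tools on top once they run cleanly.
set -e
workdir=$(mktemp -d)
printf 'int x;\n' > "$workdir/util.c"
seq 1 500 > "$workdir/monolith.c"   # stand-in for a 500-line hot spot
# Largest files first - long files are a cheap proxy for complexity.
wc -l "$workdir"/*.c | sort -rn | sed -n '2p'
```

(The first sorted line is wc's grand total; the second is the largest single file.) It's a blunt instrument, but it's a defensible way to order the cleanup schedule until real metrics are available.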
Security is usually a system design issue, but there are things that can stand out directly in code - and that tools can find - and there are often domain-specific checklists like OWASP that can suggest areas to look at where even shallow examination can expose problems. These are often in the form of "simple industry-wide problems that we should have gotten beyond, some time in the last three decades", like "examine all SQL queries for injection attacks" or "be suspicious of all explicit memory allocation".
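Even without dedicated tools, a shallow pattern scan surfaces checklist items like string-built SQL. A sketch (the file and the pattern are hypothetical; a real audit would reach for tools like semgrep or bandit):

```shell
#!/bin/sh
# Sketch: a shallow grep-based scan for string-formatted SQL, the classic
# injection shape. Hypothetical sample code; the second query shows the
# parameterized form that does NOT get flagged.
set -e
workdir=$(mktemp -d)
cat > "$workdir/orders.py" <<'EOF'
def find(cursor, name):
    cursor.execute("SELECT * FROM orders WHERE name = '%s'" % name)  # injectable
    cursor.execute("SELECT * FROM orders WHERE name = ?", (name,))   # parameterized
EOF
# Flag execute() calls built with % string formatting for human review.
grep -n 'execute(.*%' "$workdir/orders.py" \
    && echo "^ review these lines for SQL injection"
```

A scan like this produces false positives and misses plenty, but as the text says: it's the shallow examination that exposes the decades-old classics first.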
-
Michael Feathers, "Working with Legacy Code" specifically coins the term "Legacy code" to mean "any code that doesn't have sufficient tests for you to safely make real changes to it." ↩