Traceability

Not to be confused with Observability, traceability is specifically the idea that for anything (software, data, hardware) that you've shipped to a customer, there's a way to know exactly where it came from.

Value: debugging

In the simplest case, something fails at a customer site - what are they actually running? Can you tell the difference between "we fixed that and the customer just needs the latest version" and "we fixed it but the fix didn't actually work", without blindly costing the customer time and effort upgrading?1 Sometimes the customer side of this falls under "change control" or "inventory", especially if the customer is primarily in a non-software industry - after all, you're selling to them, so you likely have something at a higher level of software sophistication than they do - but you still need properly labelled artifacts on your end to handle this.

For more complex problems (and more interesting customers) you need to know exactly what the customer is running so you can reproduce the problem in-house. For some levels of customer support you may be committed to making a minimal incremental change from that version so as not to disrupt them with new features, so you actually want to be able to rebuild exactly that version. Even if you will only include the fix in the next feature release, you'll want a convincing test for the fix - and to be convincing, it needs to fail first, without the fix. (This isn't just "any change needs a test" - fixes for bugs that customers have actually experienced should be held to higher standards, so that you don't embarrass yourself by breaking things in the exact same way in a future release; customers are a lot more tolerant of novel problems.)

Value: liability

Imagine your customer discovers that your product is running a cryptominer using their computing resources. If you ever want them to trust you again, you need to be able to figure out how it got there, and what you've done to prevent it in future releases. This may involve exposing your process and process changes to the customer - or it may be as simple as identifying a software supply chain issue and being able to point to CVEs or tech news and say that you were using best practices and everybody else in the industry got hit by the same thing; this probably isn't sufficient but it can be useful for calming things down.

In order to do any of that you need to be able to figure it out yourself, and that means being able to get from a customer shipment or install back to your own artifacts and source code, and then further to where any software components came from.

This is where provenance is both more and less than a "software bill of materials" - explicit dependency versions are only one piece, you also need to understand how the pieces come together and who has access to change this.

This also works in reverse - if a customer comes to you in a panic about a newsworthy security problem, you want to be able to quickly, efficiently (without tying up engineering resources), and convincingly show that you (and the customer) were not vulnerable.2

Value: provenance

Provenance is about knowing who and where your software components came from. This isn't just about "random packages pulled in by npm", which is a plague of the web development space - even if you're building low level code for raw embedded hardware, you likely still have vendor SDKs and toolkits, drivers and network stacks that you didn't write from scratch.

The value here comes from being able to make defensible statements about where your components originated and under what terms you're using them. Traceability gives you the concrete information to support assertions like "we're using that software from this version, which is before they got acquired and changed the license" or "we are in compliance with the reporting requirements for all licenses in the code we ship" for customers that track such things.

Repeatability

Traceability by itself is just recording (in an actionable way) the necessary information to find the correct original artifacts. Repeatability is the ability to use that information to produce identical artifacts from source. This is powerful validation of your mechanisms but may not be where you start, especially with a high velocity product - there is a level of "close enough" that you can reasonably accept, especially if you're not committed to delivering minimal incremental changes to customers but can just ship your latest release once it actually includes the desired fixes.
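The core repeatability check is mechanical: rebuild from the recorded source and confirm the result matches the archived artifact byte-for-byte. A minimal sketch, assuming a POSIX shell with coreutils; the artifact paths in the usage comment are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch: confirm that rebuilding a tagged release reproduces the shipped
# artifact byte-for-byte, by comparing checksums.
set -eu

# Compare a shipped artifact against a rebuild from the same tag.
# Succeeds only if the two files are byte-for-byte identical.
verify_reproducible() {
    shipped_sum=$(sha256sum "$1" | cut -d' ' -f1)
    rebuilt_sum=$(sha256sum "$2" | cut -d' ' -f1)
    if [ "$shipped_sum" = "$rebuilt_sum" ]; then
        echo "reproducible: $shipped_sum"
    else
        echo "MISMATCH: shipped $shipped_sum vs rebuilt $rebuilt_sum" >&2
        return 1
    fi
}

# Hypothetical usage, after rebuilding the tagged source into rebuild/:
#   verify_reproducible dist/product-1.4.2.tar.gz rebuild/product-1.4.2.tar.gz
```

A mismatch here doesn't tell you *what* differs (timestamps and embedded paths are common culprits), but it tells you cheaply and early that your "close enough" has drifted.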

The open source community (and Debian in particular) has come to recognize that, in the presence of supply chain attacks, there is value in Fully Reproducible Builds and that achieving that requires working your way through your entire dependency stack - the kind of long-term attention to detail that Debian is particularly well-suited to putting steady effort into over more than a decade.

Release Engineering Mechanisms

There are different levels of sophistication, depending on where you are along the path from "shipping daily builds from a developer to a very friendly beta customer" to "we have a product with many customers and a big marketing splash, or at least press releases, for every version we ship" - and different tools fit different points along that path.

The minimum baseline for this, in the "a developer ships one thing to one user" case, is keeping a copy of every artifact that goes out the door and reliable records of how to rebuild it. If your build process is solid, a single version control tag is enough for the latter, but you will want to test that - especially if the developer builds on their own laptop rather than through an automated build system. At this early stage you're probably not yet shipping through a separate QA team, so you also want to include records of what tests the developer ran, which leads naturally to enforcing them as part of the build.
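That baseline can be as small as a script run at ship time: copy the artifact into an archive and write a manifest recording its checksum and the source revision. A hedged sketch, assuming a POSIX shell and a git checkout; the paths and directory names are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch of the minimum baseline: keep a copy of every shipped artifact
# and record enough alongside it to find and rebuild the exact source.
set -eu

record_shipment() {
    artifact="$1"
    archive_dir="$2"

    # Keep a copy of the artifact exactly as it went out the door.
    mkdir -p "$archive_dir"
    cp "$artifact" "$archive_dir/"

    # Record how to get back to the source; fall back to "unknown"
    # when not building from a git checkout.
    commit=$(git rev-parse HEAD 2>/dev/null || echo unknown)

    {
        echo "artifact: $(basename "$artifact")"
        echo "sha256: $(sha256sum "$artifact" | cut -d' ' -f1)"
        echo "commit: $commit"
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    } > "$archive_dir/$(basename "$artifact").manifest"
}

# Hypothetical usage:
#   record_shipment build/product.tar.gz /srv/ship-archive
```

The same manifest is a natural place to record which test suite ran, which is the hook for later enforcing tests as part of the build.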

The first improvement you can make is to automate builds that are shipped. You still want developers to be able to make builds directly (to show that even small changes don't accidentally have global impacts) though one path to consistency is to use a container environment for "official" builds and then have developers use that same container environment for their own builds. (This also helps with repeatability - even if your environment has a great deal of churn, if you archive these containers you can rebuild an old release by pulling the contemporary build container from your archive.)
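The container approach above can be sketched as a few commands - pin the build image by digest rather than a floating tag, run the official build inside it, and archive the image with the release. This is an environment-setup fragment, not a drop-in script: the registry, image name, digest, and build command are all hypothetical, and it assumes a Docker-compatible CLI.

```shell
# Pin the build environment by digest, not by a floating tag, so "the
# build container" means one specific image. (Digest elided here.)
IMAGE=registry.example.com/build-env@sha256:...

# Official build: the same container developers use for local builds.
docker run --rm -v "$PWD:/src" -w /src "$IMAGE" ./build.sh

# Archive the exact environment alongside the release artifacts, so an
# old release can be rebuilt with its contemporary toolchain.
mkdir -p ship-archive
docker save "$IMAGE" | gzip > ship-archive/build-env-1.4.2.tar.gz
```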

You can also improve things by reducing the number of distinct products. Even if you don't start out with the discipline to keep individual customer tweaks behind configuration flags (while still part of the main product), you can get there with effort; both improved traceability (you simply have fewer distinct answers to the "what are they running" question) and having fewer customer-specific bugs are useful motivators.


  1. Sometimes forcing upgrades is the right thing to do, especially if you've put effort into making upgrades seamless, but that's often more work than making your artifacts traceable. 

  2. Communication with a customer isn't necessarily about an audit - audit processes usually have fairly specific requirements that are outside the scope of this article. A swift response based on these principles can help avoid the need for an audit in the first place, though.