Provenance


"Provenance" is a term more common in the art world, where it means the origin or source of a thing - and can more specifically mean the history of ownership, as distinct from the thing itself. In art, provenance is a mixture of giving credit to the artist and imputing value to the artifact (the difference in value between a true original and a perfect forgery is bounded entirely by the provenance of the piece.) In software, the primary use of provenance is in software licensing with some secondary value in identifying security issues (CVEs).

Open Source

Open Source software specifically has licenses (usually from a well-known set) that allow a developer to use the software within some generous constraints. At one end of the generosity scale (within Open Source) you have BSD and MIT, which are basically "do whatever, just don't file off the serial numbers and claim it as yours"; at the other are messier licenses like LGPL, where you have to enable the end-user to update components of the deliverable. Some of these licenses require that you clearly identify any changes you've made. Whatever actions they require, the important thing is that you be able to identify (for things that you ship¹) what bits have what licenses, and that you're conforming to them.

Proprietary

Proprietary licenses are usually much more restricted in specific ways², since they're often direct contracts and don't rely on copyright. Since they are bespoke, you don't have the shortcut of finding interpretations from the community for what you can do - either you need to do a close reading with your desired use in mind, or to come up with formal questions and ask your own IP lawyer about them.³

Again, the important thing for you to understand is where the pieces of your code come from, what licenses apply to them, and what actions those licenses require.

The Code Itself

While licenses and ownership are important, you also need to be able to trace where the actual code comes from. Sometimes that's just a matter of being able to map out your software supply chain risks, but it's also useful when debugging - not only can you confirm you have the correct upstream code, but often the upstream bug tracking system will be adjacent to the code repository, so you can find out if someone has already reported something related to your issue, possibly even with a proposed solution branch. Ubuntu Launchpad acts as a "switchboard" for this, with links to further-upstream sources and ticketing systems (including CVEs) and some information about tracking the progress of fixes within Ubuntu's own releases. That's especially useful if you're trying to decide between building a local version of a package and waiting for it to show up as an official "backport": Launchpad will have public policy discussion and decisions for each package, as well as being a place to vote on impact or contribute your own test results.

Stack Overflow

There are a number of online sources of problem-solving code snippets to read and learn from. If they don't have an explicit license, reading and learning is all you can do - the author has copyright by default. Very popular systems like Stack Overflow do have specific licenses - in this case, one of the Creative Commons licenses (CC BY-SA), with requirements for including credit and not restricting further copying.

In practice, any given Stack Overflow answer isn't an exact fit anyway, so a pretty good practice is to learn from it, confirm it with more specific documentation (one of the best results of reading Stack Overflow answers is that they get you to say "oh that's what that man page meant!"), apply that new understanding locally, and leave a note (in a comment or version control commit log) about where you got the idea. The note isn't for compliance; it's so the next person who comes along to debug that bit of code can look and see if the original answer has been updated or refined, and doesn't have to rediscover it.

Mechanisms

Debian/Ubuntu copyright and license information

Debian packaging is a combination of tooling and policy/discipline. One important piece of this is that every installed package contains a /usr/share/doc/$package/copyright file with information about upstream source, licenses, contributors, and copyright holders. Though it was originally freeform, about 80% of packages in a typical install use a strictly parseable format that is easy to build auditing tools around.
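As a rough illustration of the kind of auditing tool this enables, here is a minimal Python sketch. It assumes the strictly parseable files are the ones whose first line is a "Format:" field, that paragraphs are separated by blank lines, and that the first line of each "License:" field is a short license name; the function names (parse_copyright, license_names, scan_installed) are made up for this example, and freeform files are simply reported as None for manual review.

```python
from pathlib import Path


def parse_copyright(text):
    """Split a machine-readable copyright file into a list of field dicts."""
    paragraphs, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
            current, last_field = {}, None
        elif line.startswith("#"):
            continue  # comment line
        elif line[0] in " \t":
            # Continuation of the previous field (e.g. full license text).
            if last_field is not None:
                current[last_field] += "\n" + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip().lower()
            current[last_field] = value.strip()
    if current:
        paragraphs.append(current)
    return paragraphs


def license_names(text):
    """Return the set of short license names, or None for freeform files."""
    if not text.lstrip().startswith("Format:"):
        return None
    names = set()
    for paragraph in parse_copyright(text):
        first_line = paragraph.get("license", "").split("\n", 1)[0].strip()
        if first_line:
            names.add(first_line)
    return names


def scan_installed(doc_root="/usr/share/doc"):
    """Map each installed package to its license names (None = read by hand)."""
    report = {}
    for path in sorted(Path(doc_root).glob("*/copyright")):
        text = path.read_text(encoding="utf-8", errors="replace")
        report[path.parent.name] = license_names(text)
    return report
```

Calling scan_installed() on a Debian or Ubuntu machine gives a dictionary you can dump into a spreadsheet; the None entries are the freeform files that still need a human to read them. Note that a short name can be an expression such as "GPL-2+ or MIT", which this sketch leaves as a single string.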

If you're starting from scratch, you'll probably want to do a bulk "look at every package" review; given your specific requirements, you can come up with a few simple questions and have your whole team chip in on reviewing. After that, you should simply have a license review every time you add a new package - for the package itself and the new dependencies it brings in. (You want to keep this simple and do it early to reduce sunk costs in building on top of something with an inappropriate license - or even with a "probably but not quite sure" license that needs lawyer time.) One traditional way of handling this is to have a short list of licenses that are "automatically OK for dependencies" - MIT, BSD, ISC, and Apache 2 usually head the list - with others requiring escalation to someone who can talk to (and spend money on) the company's IP counsel, or take responsibility on behalf of the company for using that license. (You'll quickly find engineers taking the initiative to choose alternative solutions that fall in the approved license list just so they can keep going and not get blocked on license review; depending on what problem space you're working with, that might actually be a reasonable accidental result.)
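The "automatically OK" list is simple enough to sketch in a few lines of Python. Everything here is illustrative: the identifiers in AUTO_OK are one plausible spelling of "MIT, BSD, ISC, and Apache 2" (yours should match however your license data is actually labelled), and triage is a hypothetical helper name, not part of any existing tool.

```python
# Licenses that need no further review for dependencies (an assumed list;
# adjust the spellings to match your own license data).
AUTO_OK = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Apache-2.0"}


def triage(dependencies, approved=AUTO_OK):
    """Split a {package: license} mapping into approved and escalate lists."""
    ok, escalate = [], []
    for package, license_name in sorted(dependencies.items()):
        if license_name in approved:
            ok.append(package)
        else:
            # Anything not on the short list goes to a human, including
            # unknown or missing licenses.
            escalate.append((package, license_name))
    return ok, escalate
```

Running this as part of the review for each new package keeps the common case fast, and the escalation list is exactly what you hand to whoever talks to the company's IP counsel.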

This author has successfully used the Debian copyright files to satisfy acquirer and investor due-diligence, mostly of the form "is there any use of the GPL that would require releasing our proprietary source code, and thus reducing its value", for two startups. One of the acquirers was impressed at getting an obviously-exhaustive spreadsheet of dependencies and license categories, instead of the "umm, we have some code?" that they'd gotten from previous startups, saving months of review. Open source dependency management is sufficiently mainstream these days that there are companies⁴ that will do this auditing as a paid service, usually with value-added features like notifying you when security advisories are issued for anything in your "software supply chain".

Debian/Ubuntu file-by-file packaging information

The entire point of a packaging system is that you know exactly what files in your install come from what packages; a big part of this is just that you need to know where they are so you can remove them cleanly - a step that "just run make install" leaves out - but it also means that if you find a problem in a particular file at the operating system level, you can trace it all the way back to the author, even if it isn't your code.
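On a Debian or Ubuntu system, "dpkg -S /path/to/file" answers the "which package owns this file" question directly. As a sketch of what is underneath, the Python below reads the per-package file lists that dpkg keeps in /var/lib/dpkg/info (one "package.list" or "package:arch.list" file per installed package) and builds a reverse index. The function name build_file_index is made up for this example, and the sketch ignores subtleties such as diversions.

```python
from pathlib import Path


def build_file_index(info_dir="/var/lib/dpkg/info"):
    """Map each installed path to the set of packages that list it."""
    index = {}
    for list_file in Path(info_dir).glob("*.list"):
        # "libc6:amd64.list" -> "libc6"; "python3.11.list" -> "python3.11"
        package = list_file.stem.split(":", 1)[0]
        text = list_file.read_text(encoding="utf-8", errors="replace")
        for path in text.splitlines():
            if path:
                # Directories such as /usr appear in many packages' lists,
                # so each path maps to a set rather than a single owner.
                index.setdefault(path, set()).add(package)
    return index
```

Any file under the system directories that doesn't show up in the index at all is worth a second look: it got there some other way than through the packaging system.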

Other distributions

Red Hat's rpm (also used by SUSE) has a parallel history with Debian's dpkg; while it has made different choices, it also supports the mechanical aspects of provenance tracing.

Language-specific packaging

NPM (for JavaScript), Cargo (Rust), CPAN (Perl), PyPI (Python), and Gems (Ruby) are all community packaging systems oriented around a specific programming language, rather than an operating system. This means they focus more on multiplatform support and community openness, but often less on package quality or malware prevention⁵. The upside is that peer pressure (and ease of copying) usually means that one license will dominate - for example, most NPM packages are MIT licensed, with ISC and BSD filling in most of the rest.
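You can check the "one license dominates" effect on your own dependencies. The sketch below does it for Python packages using the standard library's importlib.metadata; it assumes each package declares its license in a "License-Expression" field, a "License ::" classifier, or a "License" field (checked in that order), and declared_license and license_summary are hypothetical helper names. What comes back is whatever the package author typed, so treat it as a starting point for review rather than a verified answer.

```python
from collections import Counter
from importlib import metadata


def declared_license(meta):
    """Best-effort license string from one package's metadata headers."""
    expression = meta.get("License-Expression")
    if expression:
        return expression.strip()
    for classifier in meta.get_all("Classifier") or []:
        if classifier.startswith("License ::"):
            # "License :: OSI Approved :: MIT License" -> "MIT License"
            return classifier.split("::")[-1].strip()
    license_field = meta.get("License")
    if license_field:
        first_line = license_field.splitlines()[0].strip()
        if first_line:
            return first_line
    return "UNKNOWN"


def license_summary():
    """Count declared licenses across every installed Python package."""
    return Counter(
        declared_license(dist.metadata) for dist in metadata.distributions()
    )
```

Printing license_summary().most_common() in a project's virtual environment shows the overall shape at a glance; the "UNKNOWN" bucket is where the manual review time goes.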

(Go is a little odd in that while it has a packaging mechanism, the format is "a URL pointing at a git repository", making GitHub into the accidental package repository for the industry; this means it completely lacks any kind of policy enforcement beyond "does the package actually build". You probably want to at least add a layer of your own git mirrors to have a little bit of control here, even if you'd use upstream package repositories for other languages.)
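Before setting up mirrors, it helps to know which hosts you actually depend on. The Python sketch below pulls the module paths out of the "require" directives in a go.mod file and counts them by host. It assumes gofmt-style layout (either "require path version" on one line or a "require ( ... )" block) and unquoted module paths; required_modules and hosts are names invented for this example.

```python
from collections import Counter


def required_modules(go_mod_text):
    """Return the module paths named in a go.mod file's require directives."""
    modules, in_block = [], False
    for raw in go_mod_text.splitlines():
        # Drop trailing comments such as "// indirect".
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
            else:
                modules.append(line.split()[0])
        elif line.startswith("require ("):
            in_block = True
        elif line.startswith("require "):
            modules.append(line.split()[1])
    return modules


def hosts(modules):
    """Count modules by the host part of their path."""
    return Counter(module.split("/", 1)[0] for module in modules)
```

Feeding the contents of your go.mod through required_modules and then hosts gives the list of places your build reaches out to, which is also the list of things a local mirror would need to cover.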

You'll sometimes hear that the language-specific package system for C and C++ is dpkg (or rpm). That's got a kernel of truth (in that nothing else is), but those also have tools for automatically extracting metadata and build options from language-based packages directly into operating-system-based packages, so they really are a "higher level" system. They also have more policy information - the Debian Policy Manual has details on handling packaging subtleties and upgrade issues that have come up across a thirty-year lifespan, which most language-specific packaging can't handle (arguably it doesn't need to, if you're regularly doing clean builds into freshly created environments, but that's a relatively recent innovation).

Conclusion

If your product is an entire platform shipped on an operating system, you'll probably need all of this - but even if you're working with simpler architectures (microservices on AWS, web-frontend-only tooling) you're still likely to need to figure out what you're building on top of. Even if you're not aiming for acquisition, you'll still need to be able to identify what upstream security vulnerabilities have put you at risk or at least have caused you to need upgrades. Provenance - knowing where all of your pieces come from - is just a matter of having (and keeping) the data you need to keep track of it all; this doesn't require a lot of new work, just a little discipline.

¹ Key to copyright is copy - you can generally screw around as much as you want locally/personally, copyright-based licenses don't even apply until you give a thing to someone else.
² Sometimes they are only restricted in those specific ways and otherwise are very broad; one commercial Linux IDE licenses individual people but the software can be installed anywhere at all - so we could ship a pseudo-embedded system with that IDE as one of the local editors, and as long as the individual developer had a license for it, they could run it on any fielded system. In another case, we could use a set of embedded libraries from a CPU-vendor SDK and ship those object files anywhere - but we weren't allowed to run them on similar but competing chips (again, not complicated, just very specific.)
³ In some circumstances you want to talk to an IP lawyer about your open source use too - but it's 2025, if they have any experience in software licensing at all this should be a relatively short conversation because (for example) the GPL is older than they are and they have no excuse for not already being familiar with it.
⁴ Examples include Black Duck which has been around for ages; Snyk has more recently joined the field; their services are shaped differently but they can both provide the kind of "open source audit" that investors are looking for.
⁵ Aside from simply "selling off" packages, NPM and PyPI (by virtue of being popular and thus high-value targets, not through any particular flaws) have long been victims of typo-squatting, and more recently "vibe-squatting" attacks on similarly-named packages.