A surprisingly controversial perspective on software engineering and
computers in general is the idea that There Are No Mysteries - that
computers are entirely "knowable"1 and it's primarily a matter of how
deep you choose to look. (That choice may be limited by resources
or access, but not by possibility.) This particularly applies to
plumbing around software - npm packages, pypi modules, and
Debian packages aren't magic, and actually help you figure out where
things are coming from.2
Complexity is still a battle
If you always had to understand every system in its entirety, debugging might actually be intractable - the "spaghetti code" moniker for exactly that situation arose in the 1980s.3 Instead, we generally build code in "layers of abstraction" - which sounds clever, but is really an emergent property of starting with something simple, building one layer on top of it, and then doing the same thing again. The main "modern" distinction is that the layer is more likely to be defined by a formal API in front of a remote service than by local function calls; even "local" services often come from a local container rather than the same machine, which lets larger subsystems be properly isolated - your develop-and-test "local database" is likely to be an entirely isolated machine that just happens to share a CPU.
The important part of these layers is that each layer only "sees" (communicates with) the immediately adjacent ones.4 The term for an upper layer sneaking past its neighbor and using a lower layer's interfaces directly is a "layering violation".
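As a minimal sketch (all the names here are invented for illustration), the same discipline shows up in ordinary code: each layer calls only the layer directly beneath it, and reaching past that neighbor is a layering violation:

```python
# Hypothetical three-layer sketch: each layer calls only the one below it.

def read_block(n):
    # Bottom layer: raw "storage".
    return b"block-%d" % n

def read_record(key):
    # Middle layer: records, implemented on top of blocks.
    return read_block(len(key) % 16)

def lookup(key):
    # Top layer: talks only to the record layer.
    return read_record(key)

# A layering violation would be lookup() calling read_block() directly,
# bypassing the record layer's interface.
print(lookup("users"))
```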
The Good Stuff is the Next Level Down
Understanding how a system behaves at the surface or user level is sufficient when it is working but often falls apart when you're trying to figure out why it isn't working. A great deal of "debugging" is about peeling back that surface and looking at the mechanisms underneath to get clarity on what is "really" going on.
This isn't an argument about purity - just that the layer immediately below the visible part is what's most directly causing things, properly or improperly.
Just as the layering itself is an emergent property, so is debugging by layer - something is wrong at the surface, so you step through the surface operations until you find the one that gives a wrong result - and then you open up that layer and do the same thing. Much of the time you can repeat that with each lower layer - sometimes you need to synthesize inputs to keep the operations at that layer "narrow", but that's fine, as long as you can show that they correspond to the upper-level problem.
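A toy illustration of that narrowing step (the functions and the bug are invented for the example): the surface result is wrong, so you call the next layer down directly with a synthesized, trivial input and show the bug reproduces there:

```python
def cents_to_dollars(cents):
    # Lower layer - contains the (deliberate) bug: a stray "+ 1".
    return cents / 100.0 + 1

def price_label(cents):
    # Surface layer: formatting on top of conversion.
    return "$%.2f" % cents_to_dollars(cents)

# Surface symptom: the label is wrong.
print(price_label(250))        # "$3.50", expected "$2.50"

# Narrow it: synthesize the simplest possible input and exercise
# the lower layer alone.
print(cents_to_dollars(0))     # 1.0, expected 0.0 - the bug is here,
                               # not in the formatting layer
```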
How do you get to the next level?
Tooling varies depending on what you're starting with. On Linux, the
"big" layer is "your user space code vs. the kernel", and the tool for
looking at that boundary in particular is strace. (Realistically
you are probably using libraries and frameworks above that and should
look at them first! You'll normally be examining those with debuggers
or simply by printing out return values. Still, the kernel/user
boundary is where a lot of interesting problems happen.)
To avoid turning this into an strace tutorial, I'll just suggest you
practice by running it on something known to work, like ls, and
something known to fail, like ls /nonexistant and see what the
differences are. (To go sideways a little, a file not existing is
often merely a fact and not an error - but if it should be there in
normal circumstances, it is common to program in such a way that your
code assumes it is present - since you're treating the absence as
exceptional, and you're going to have an error handling path anyway
that can sometimes be a simpler5 structure.)
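As a sketch of that pattern (the path and function name are made up), here is code that assumes the file is present in normal operation and puts all the handling for the exceptional absence on a single error path:

```python
def load_config(path="/etc/myapp.conf"):   # hypothetical config path
    # Normal operation assumes the file exists; absence is treated as
    # exceptional, so the handling lives on one error path.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # This is ENOENT at the syscall level - the same failing
        # openat()/stat() you'd spot in the strace output.
        return None

print(load_config("/nonexistant"))   # None - absence handled, not fatal
```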
Higher-level environments often have built-in tools - when developing JavaScript in a web context, you have an entire in-browser console and inspection system that lets you focus on a problematic area of a page (document) and go from there to the related code. That works well because the problems you care about are often more directly visible (at least to end users) from the page perspective than from the flow of logic.
Stories
This isn't just theory, but it helps to have actual cases that show it mattering in practice. Most of these are very situational and don't necessarily lead to modern debugging techniques, so you'll find them under Debugging Tales instead.
Other history: Literature
"The Cuckoo's Egg" and "With Microscope and
Tweezers" cover the Morris Worm incident - notably,
the part where Bill Sommerfeld noticed that a piece of code was
running as the nobody user, and leveraged that into a clean-room
duplication of the buffer-overrun attack that was one of the worm's
more successful vectors. (The nobody part had mostly been dismissed
by other people investigating, but turned out to be significant.)
Other history: 6.004
MIT teaches this directly in 6.004, Computation Structures, which (as taught in the late 1980s) built a small computer all the way up from
- The "Digital Abstraction" - the idea that you could use amplifier circuits to slam a continuous voltage to one extreme or the other, and oversimplify6 those levels into 1/0 true/false logic7
- Simple multi-bit arithmetic units, first as chips and wires, then as small pre-made circuit boards8 (containing the same chips and wires.)
- An instruction set (basically a counter and some gates attached to the arithmetic units) and storage
- Toggle switches for entering code
and finally a tool chain on a host system to assemble (and later compile and optimize) higher level code to run on this hand-built system.
As you crawl "up" the stack during the course, you don't just build the lower levels, you build a belief in the lower levels - voltages representing logic levels, representing numbers, representing code, with nothing hidden9 about how you got there.
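The first of those layers can be sketched in a few lines (the threshold voltages here are purely illustrative): a continuous voltage is slammed to one of two logic levels, and anything in the band between them is simply not a valid digital value:

```python
def to_logic_level(volts, v_low=0.8, v_high=2.0):
    # The digital abstraction: at or below v_low reads as logic 0, at or
    # above v_high as logic 1. The band in between is a "forbidden zone"
    # that a valid digital signal must not linger in.
    if volts <= v_low:
        return 0
    if volts >= v_high:
        return 1
    raise ValueError("forbidden zone: %.2fV is not a valid logic level" % volts)

print(to_logic_level(0.2))   # 0
print(to_logic_level(3.1))   # 1
```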
-
Nelson Elhage also wrote about this in "Computers can be understood". ↩
-
Package systems are particularly helpful for encapsulating provenance metadata - while we primarily care about that for licenses and permissions, it's also an important part of tracing the actual code. ↩
-
The term seems to have fallen back from a high-water point in 2010 which I am not prepared to explain. ↩
-
A particular angle on this is called The Law of Demeter - more an example of how 1980s computer science was an extended exercise in puns on ancient Greek deities (see also MIT's Project Athena) - but it does have a kernel of truth about thinking of software as something you grow rather than build. ↩
-
Error Oriented Design is the idea (reflected somewhat distinctively in the Go language) that your code should be written from the defensive perspective that something is going to fail and you should treat that failure as important information and handle it as carefully as you handle things actually working. This is a good way of questioning assumptions and is a critical part of any security-related programming. ↩
-
Famously, early digital message encryption had the problem that the "exclusive or" function used to combine key and data wasn't quite digital enough: while 1⊕0 and 0⊕1 are both 1, in practice there would be a voltage difference of a few percent, which was enough to "peel apart" the key and data, as long as you intercepted the signal in the right part of the channel. "Security Doesn't Respect Abstraction Boundaries" has a decent walkthrough of this in terms of voltages. ↩
-
The course also immediately covers the Asynchronous Arbiter problem, where the digital abstraction is hiding an unbounded timing problem so fundamental that it can be demonstrated with mechanical coin-operated devices - but as long as you aren't specifically arranging your circuits to force it to happen, you can reduce the probability below that of spontaneous explosion of the circuit. ↩
-
The pre-made arithmetic boards weren't special, they just made it possible for students to "get past" practical problems with the early lots-of-chips implementation and not be "stuck" for later work; it also freed up a lot of protoboard space to build later parts of the system. ↩
-
Nothing hidden from a CS perspective. A not-quite-joke about the difference in coverage between EE and CS was that EE didn't do compilers and CS didn't do semiconductor physics. It was certainly the case that you couldn't get through the CS core without doing at least some soldering - perhaps badly, perhaps enough to convince you that you wanted to be at the other end of the abstraction stack - but you knew why, you had "run your fingers through it" and it wasn't a mystery despite it not being Your Thing. While we're mostly focusing on layers in code, it should be pretty clear that once you get from software to hardware, the physical sciences are layered as well. ↩
Nelson Elhage's Software Engineers should keep lab
notebooks
discusses the value of keeping detailed notes - more than just a
screen recording, including notes on intent (what you were trying
to do, or how you were interpreting results). The article emphasizes
it primarily for retrospective analysis (both immediate and longer
term) but I've also found that one side effect of keeping such detailed
notes (in a searchable manner) is that you can also find that snippet
of awk code or those weird lsblk arguments you figured out that
time, even if the problem itself is unrelated.
(See also Lab Notebooking for the Software Engineer for basic mechanisms for applying this technique.)
- Advantage over AI: snippets that actually worked, with a clear record that they did, of what happened, and when - "Oh, $this stopped working after 16.04 and we need to figure out $that instead" is important.