The Art of Debugging
In my long career as a software developer and architect, I have had to debug a lot of things. It continually amazes me, when working with others on a problem, how some people get lost trying to find its root cause. Next steps that seem obvious to me often elude them. Many times they eventually solve the problem in their own way, but at the expense of time. At best, that delays development timelines to a greater or lesser degree; at worst, it prolongs a customer's downtime if the bug being chased is "live".
And so it is with no small degree of hubris that I make the claim that debugging is as much an art as a skill. Certainly, anyone can learn to debug, just like they can learn to code. But equally certainly (as far as I am concerned) not everyone can debug well in the same way that not everyone can code well.
When I approach a debugging problem, I tend to always do it the same way, and that tends to yield satisfactory results. This approach involves asking the following questions:
- Can the problem be reproduced?
- Are there deep diagnostics to more clearly show the problem?
- If the code or system used to work, but now does not, what has changed?
- If the characteristics of the problem are known, can the problem be reduced to a simpler form?
These seem like obvious questions, but it is surprising how often they are not used to probe the issue and get at the root cause of it.
Can the problem be reproduced?
It goes without saying that a problem that happens sporadically or, worse, is a one-off case, is exceedingly difficult to debug. So when you first approach a debugging problem, you absolutely need to uncover a sequence of steps that will reproduce the problem. In some cases, this won't be until you understand the problem more deeply (following the other questions), but your goal has to be replication. If you can cause the problem to occur over and over, you can:
- increase the diagnostics for each iteration as you narrow down the problem; and
- measure or demonstrate the effectiveness of your code fix(es).
You can only truly say you have fixed a bug when you can say, "you did this, then that bad thing used to happen, and now with my code fix that no longer happens."
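As a rough illustration of that before-and-after claim, here is a minimal Python sketch. Everything in it is hypothetical: `flaky_parse` stands in for the buggy code, `fixed_parse` for the repaired version, and `failure_rate` is a small harness I made up for the example. The fixed seed is what makes a sporadic failure repeatable from one run to the next.

```python
import random


def flaky_parse(value, rng):
    """Hypothetical buggy code: fails sporadically on negative input."""
    if value < 0 and rng.random() < 0.3:
        raise ValueError("negative value slipped through")
    return abs(value)


def fixed_parse(value, rng):
    """Hypothetical fix: negative input is handled on every call."""
    return abs(value)


def failure_rate(func, value, attempts, seed):
    """Call func repeatedly with a seeded random source and
    return the fraction of calls that raised an exception."""
    rng = random.Random(seed)  # fixed seed, so every run sees the same sequence
    failures = 0
    for _ in range(attempts):
        try:
            func(value, rng)
        except Exception:
            failures += 1
    return failures / attempts


if __name__ == "__main__":
    print("before fix:", failure_rate(flaky_parse, -5, 200, seed=1))
    print("after fix: ", failure_rate(fixed_parse, -5, 200, seed=1))
```

The same harness gives you both halves of the statement: a non-zero failure rate before the change, and zero after it, under identical conditions.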
Are there diagnostics?
Having a reproducible problem but not being able to see what the code is doing is like trying to swim with one hand tied behind your back. Ample diagnostics in the form of logs, traces, and metrics are critical to effectively debugging hard problems.
Countless times I have been involved in debugging escalations where logging was not turned on, or log files were being purged. So one of the first questions I ask is "where are the logs?"
"What, no logs? Ok, call me when you have logs."
If you are a front-line support person and you need developer help to solve a problem, take this point seriously. You have to collect whatever logs are available to make any sort of headway, and developers will appreciate that. Sure, they can be prima donnas at times, but this is a basic requirement that should be slavishly followed.
Of course, even if you have logs, they may not be sufficient. The log level may not be deep enough. The logs may not span enough time, or cover enough related services, to see the whole problem. Effective debugging involves extracting as much diagnostic information as possible to get at the root cause of a problem. Coupled with the ability to reproduce the problem at will, this greatly enhances your chances of success.
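To make the point about log levels concrete, here is a small sketch using Python's standard `logging` module. The `process_order` function, the "orders" logger name, and the `capture_run` helper are all invented for the example; the idea is only that the same code tells you far more when the level is lowered from INFO to DEBUG.

```python
import io
import logging


def make_logger(name, level_name, stream):
    """Build a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level_name.upper())  # e.g. "DEBUG" or "INFO"
    logger.handlers.clear()              # avoid duplicate handlers on reuse
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


def process_order(logger, order_id, quantity):
    """Hypothetical business function with logging at several levels."""
    logger.debug("start order_id=%s quantity=%s", order_id, quantity)
    if quantity <= 0:
        logger.error("rejecting order_id=%s: non-positive quantity", order_id)
        return False
    logger.info("accepted order_id=%s", order_id)
    return True


def capture_run(level_name, order_id, quantity):
    """Run process_order once and return everything that was logged."""
    stream = io.StringIO()
    logger = make_logger("orders", level_name, stream)
    process_order(logger, order_id, quantity)
    return stream.getvalue()


if __name__ == "__main__":
    print(capture_run("info", 7, 0))   # only the rejection is visible
    print(capture_run("debug", 7, 0))  # the inputs that led to it are visible too
```

At INFO you see that the order was rejected; at DEBUG you also see the inputs that caused it, which is usually what you need to reproduce the problem.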
What has changed?
Seemingly obvious, but very, very often overlooked. Stuff doesn't "just break". There is always a trigger, even when the user or customer claims that "nothing has changed". You need to find that trigger. System update? Network change? Find it.
Often, during a development cycle, code that used to work now does not (and hopefully this is found in QA and not once deployed to a customer). What has changed? When did it work last? If you can isolate that, even over a surprisingly broad window of time, you can use your source code repository to tell you exactly what has changed. Looking at code diffs is a very powerful tool in your arsenal for finding subtle, and not so subtle, coding bugs.
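When the window between "last known good" and "first known bad" is wide, you do not have to read every diff in it. A binary search over the revisions narrows it quickly, and Git automates the idea with its `git bisect` command. The Python sketch below shows only the underlying search. The revision names and the `is_good` check are placeholders, and it assumes that once the bug appears it stays in every later revision.

```python
def first_bad_revision(revisions, is_good):
    """Return the earliest revision that fails the is_good check.

    Assumptions (all must hold for the answer to be meaningful):
    - revisions is ordered oldest to newest
    - revisions[0] is known good and revisions[-1] is known bad
    - once a revision is bad, every later revision is bad too
    """
    low, high = 0, len(revisions) - 1  # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(revisions[mid]):
            low = mid    # the breakage happened after mid
        else:
            high = mid   # the breakage happened at or before mid
    return revisions[high]


if __name__ == "__main__":
    # Hypothetical history: r1..r8, where the bug was introduced in r6.
    history = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
    print(first_bad_revision(history, lambda rev: int(rev[1:]) < 6))
```

In practice `is_good` would check out the revision and run your reproduction steps. Each check halves the remaining window, so even a few hundred revisions need fewer than ten checks.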
Can the problem be reduced?
This is a power move that is really effective in solving problems efficiently. Sometimes an issue is difficult to reproduce in the sense that it takes many steps, or a long-ish amount of elapsed time. When you understand the problem well enough to localize the effect in some reduced form, you can dramatically speed up resolution by targeting the offending code.
Reducing the problem can also enable you to create a proxy for the issue, a small application that has the guts of what you think is the problem. Small programs are often way more efficient to instrument and test with. This is also very useful when you cannot reproduce the problem but know, roughly, what it may be. Having a small app that a customer can run that simulates the issue and captures a ton of diagnostics can be used to then isolate the problem in the real code.
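When the failure is triggered by a particular input (a long sequence of steps, records, or requests), part of the reduction can be done mechanically. The sketch below is one simple greedy approach, not a definitive algorithm: it keeps dropping chunks of the input for as long as the failure still reproduces. The `still_fails` callback is hypothetical and stands in for your own reproduction check, and the result is a smaller failing input, not necessarily the smallest possible one.

```python
def reduce_failing_input(items, still_fails):
    """Shrink a failing input by dropping chunks that are not needed
    to reproduce the failure. Returns a smaller list that still fails."""
    items = list(items)
    if not still_fails(items):
        raise ValueError("the starting input does not reproduce the failure")

    chunk = max(len(items) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(items):
            candidate = items[:i] + items[i + chunk:]
            if still_fails(candidate):
                items = candidate  # chunk was irrelevant; retry at same position
            else:
                i += chunk         # chunk is needed; keep it and move on
        chunk //= 2                # try again with finer-grained chunks
    return items


if __name__ == "__main__":
    # Hypothetical failure: it only occurs when steps 3 and 7 are both present.
    steps = list(range(10))
    print(reduce_failing_input(steps, lambda xs: 3 in xs and 7 in xs))
```

The reduced input that comes out of this is a good starting point for the small proxy application described above.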
The questions posed above are not hard, and are fairly obvious, but I am writing this because time and again, I see one or more of them missed. In this blog I will be coming back to these over and over, as I discuss debugging scenarios that have come up in my day-to-day work. It is my sincere hope that they prove useful to you.