To an observer whose go-to method of fixing things is "turn it off and then back on", software can seem pretty mysterious.
"Well, it used to work. It just stopped."
Sorry, but things like that simply do not happen. Software can be complex, but it is 100% deterministic. It doesn't have a mind of its own, is not capricious, cruel, or arbitrary. And so if something used to work but then stopped working, there is a reason for it. Always.
Your mission as an advanced debugger is to uncover what's changed.
And again, as with all of these "advanced" tips, it may seem gobsmackingly obvious, but too many times I have seen people go down rabbit holes without taking a step back and recalibrating their investigation by asking "what's changed".
A short inventory of changes
There can of course be many, many things that have changed, but here is a short inventory to get you thinking about some of the ways software can be impacted:
- Changes to the environment, such as base OS, or machine upgrades. Did a server patch or upgrade alter something your software depends on (e.g., a TLS version deprecation, which bit us a while back)?
- Changes to network configuration or topology. Arguably this is "environment" too, but deserves a special call-out because of how frequently IP changes or new firewall rules cause problems.
- Changes to key services you depend on. Think about web servers, file servers, and cloud services.
- Changes to your code.
Note that code changes are listed last, not because code changes aren't likely to trigger issues (they usually do), but because you need to eliminate the obvious first before digging into code. Sometimes it is the combination of new code on your part, coupled with some other change, that tips things over from "it used to work" to not working now.
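One way to make environment changes visible is to record a small "fingerprint" of the machine while things still work, and compare it against a fresh one when they stop. The sketch below is only an illustration: the function names `take_snapshot` and `diff_snapshots` are my own, and the three facts it records (OS, Python, and OpenSSL versions) are an assumed starting point, not a complete inventory.

```python
import platform
import ssl
import sys


def take_snapshot():
    """Record a few environment facts that software commonly depends on."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
    }


def diff_snapshots(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Hypothetical snapshot saved while everything still worked.
known_good = {"os": "ExampleOS 1.0", "python": "3.11.4", "openssl": "OpenSSL 1.1.1"}

for key, (old, new) in diff_snapshots(known_good, take_snapshot()).items():
    print(f"{key}: {old!r} -> {new!r}")
```

An empty result does not prove the environment is unchanged, only that the facts you chose to record are. Add whatever your software really leans on (service endpoints, certificates, driver versions) to the snapshot.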
Effective debugging is about being methodical. You are essentially performing a binary search over the entire space of what can go wrong, and eliminating key things at the outset can reduce the problem space drastically.
A point in time
If you have (usually quickly) eliminated external issues, and now know that something in your code is broken, how do you go about finding the issue? The next most important piece of data you can obtain is when the problem started happening. Note that this is not necessarily when it was first noticed, but when it actually started happening. You may have a customer report, or a QA bug logged, but you need to dig deeper, uncover what the high-level issue being observed is, and try to see when, in time, it first became apparent. This may be through log files, or it may be by using the same tool or application the customer or QA is using. But assume that when it is noticed, and when it really started happening, are two different things.
As a concrete example, we can consider our integration problem from last week. An external system is consuming data passed from our system, and the data appears wrong in the external system. Luckily, that external system has a usable interface for looking back in time. We can see the wrong data and, more importantly, we can see when the data went from being right to being wrong.
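If the history can be exported as timestamped records, finding that right-to-wrong transition can be automated. Here is a minimal sketch, assuming each record has already been reduced to a `(timestamp, is_correct)` pair; the function name `find_onset` and the record shape are hypothetical, not something the external system provides.

```python
from datetime import datetime


def find_onset(observations):
    """Return (last_good_time, first_bad_time) around the first good-to-bad flip.

    observations: iterable of (timestamp, is_correct) pairs, in any order.
    Returns None if the data never goes from correct to incorrect.
    """
    last_good = None
    for timestamp, is_correct in sorted(observations):
        if is_correct:
            last_good = timestamp
        elif last_good is not None:
            return last_good, timestamp
    return None


# Hypothetical records pulled from the external system's history view.
records = [
    (datetime(2024, 3, 4, 9, 0), True),
    (datetime(2024, 3, 5, 9, 0), True),
    (datetime(2024, 3, 6, 9, 0), False),
    (datetime(2024, 3, 7, 9, 0), False),
]
onset = find_onset(records)
```

For the sample records, `onset` is the pair of timestamps for March 5 and March 6: the problem started somewhere in that window, regardless of when anyone first reported it.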
So now we can ask the much more precise question: what changed at this point in time?
Knowing when something happened is a vital clue to determining what has changed, because there are often cases where multiple things change over a period of time, and a problem is not noticed until after those changes are all in play. Which change caused the problem? By knowing the time as precisely as possible, you can eliminate changes that had no effect, and focus on those that did.
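The elimination step can be sketched in a few lines. This assumes you have a simple list of `(timestamp, description)` entries for deployments, patches, and configuration changes, plus the last time the behaviour was known good and the first time it was known bad. The function name `candidate_changes` and the sample change log are made up for illustration.

```python
from datetime import datetime


def candidate_changes(changes, last_good, first_bad):
    """Return descriptions of changes made after last_good and up to first_bad.

    changes: iterable of (timestamp, description) pairs, in any order.
    Anything after first_bad came too late to be the trigger; anything at or
    before last_good was already in place while things still worked.
    """
    return [
        description
        for timestamp, description in sorted(changes)
        if last_good < timestamp <= first_bad
    ]


# Hypothetical change log for illustration only.
change_log = [
    (datetime(2024, 3, 1, 8, 0), "OS patch"),
    (datetime(2024, 3, 5, 17, 30), "deploy fix A"),
    (datetime(2024, 3, 8, 11, 0), "deploy fix B"),
]
suspects = candidate_changes(
    change_log,
    last_good=datetime(2024, 3, 5, 9, 0),
    first_bad=datetime(2024, 3, 6, 9, 0),
)
```

Here `suspects` contains only "deploy fix A". Treat the earlier changes as deprioritized rather than innocent, since an older change can still combine with a newer one to cause the failure.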
In our integration example, a code fix was deployed to solve one problem, and suddenly another problem started to appear. However, that problem was masked by a third problem, which seemed more important because the second problem hadn't been reported yet, and so that was fixed next. By the time the second problem was noted, we had two fixes deployed. Which was the cause? By working backwards, seeing what was really being reported, and seeing how the initial fix had an unintended side effect, we were able to eliminate the fix to the third problem as being the root issue. And furthermore, by seeing what the actual trigger was, we were well on our way to understanding the root cause of it all.
Source code control is your friend
While you may know exactly what the issue is based on what version of code was deployed, more often than not all you know is that between this version and that, some functionality went bad. This is where source code control is your (very best) friend.
After you have reduced the problem to a key functional unit of code, rather than just looking at the code and trying to deduce what is going wrong, use source code control to tell you exactly what has changed. Sometimes that might point at a massive change set, but just as often it will point you at one or two key functions that have changed in some material way. This works between releases, and it also works during code development where code is being tested at key milestones. As long as you have the when, that line in the sand that says "before here it was good, after here it was bad", you can inspect the changes made by you or your team one by one and see how they may have contributed to the issue.
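When the list of changes between "good" and "bad" is long, you don't have to walk it one by one: you can binary search it, which is the idea behind Git's `git bisect` command. The sketch below shows only the search itself. It assumes the changes are ordered oldest to newest, that the state before the first change was good, that the state after the last change is bad, and that once the problem appears it stays. The `is_good` callback is a hypothetical stand-in for however you build and test the system at a given change.

```python
def first_bad_change(changes, is_good):
    """Return the first change at which is_good() reports a failure.

    changes: non-empty list ordered oldest to newest; the last one is known bad.
    is_good: callable that tests the system as of the given change.
    """
    if not changes:
        raise ValueError("need at least one change to search")
    lo, hi = 0, len(changes) - 1  # invariant: the first bad change is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(changes[mid]):
            lo = mid + 1  # still good here, so the culprit is later
        else:
            hi = mid  # already bad here, so the culprit is this one or earlier
    return changes[lo]


# Hypothetical history of eight change sets where the sixth broke things.
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
culprit = first_bad_change(history, lambda change: history.index(change) < 5)
```

With eight change sets this needs three checks instead of up to eight, and the saving grows quickly as the history gets longer.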
TFS has a nice visual way to do a diff between two arbitrary change sets, and Git also enables you to easily compare two commits on a branch. Be sure you know how to use this functionality in whatever tool you use!
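For Git, the two commands I reach for most are `git diff <good> <bad>` and `git log --oneline <good>..<bad>`. If you want to script them, something like the following sketch works. The helper names and the `v1.0`/`v1.1` tags in the usage note are hypothetical, and it assumes `git` is on your PATH and that you run it against a real repository.

```python
import subprocess


def diff_command(good, bad, path=None):
    """Build a `git diff` command comparing two commits, optionally for one path."""
    command = ["git", "diff", good, bad]
    if path is not None:
        command += ["--", path]
    return command


def log_command(good, bad):
    """Build a `git log` command listing commits reachable from bad but not good."""
    return ["git", "log", "--oneline", f"{good}..{bad}"]


def run(command, repo_dir="."):
    """Run a git command inside repo_dir and return its output as text."""
    result = subprocess.run(
        command, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `print(run(log_command("v1.0", "v1.1")))` lists the commits between two tags, and `run(diff_command("v1.0", "v1.1", "src/export.py"))` narrows the diff to a single file once you know which functional unit is misbehaving.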
Stand your ground!
The last bit of advice circles back to the opening comments: 99.9% of the time, software doesn't just "stop working". (I am hedging a bit because of lingering bugs like Y2K, but you get my meaning, I hope.) So if your code has not changed and yet something has stopped working, approach it logically and firmly insist that there must be something else that has changed. Many times, someone will insist nothing else has changed and yet, on closer inspection, "oh yeah, there was that one update last week, but it couldn't be that, could it?"
Over the last several weeks, I have tried to drill into the core concepts, the "power moves" of effective debugging. These principally involve asking the right questions and focusing on the things that matter the most, instead of working at random or being pulled into unproductive rabbit holes.
Moving forward, I intend to expand on these core concepts, sometimes with real-life examples from my day-to-day work, or through deep dives into technology that I think you should know.
Feedback on this, or any other topic mentioned here, is very welcome. Please register to have access to my inbox!
Here are some links to tech that I am actively investigating right now:
- Logging is half the battle for understanding what your application or service is doing. Monitoring metrics effectively gives you early warning of operational problems, and Prometheus is a tool for exactly that.
- As mentioned last week, being able to hook into the logging of applications you do not directly control the source for can be vital to capturing diagnostic information. This is a key component of that.