If a Tree Falls
"If a tree falls in a forest and no one is around to hear it, does it make a sound?" – Unknown
"If you can’t measure it, you can’t improve it." – Peter Drucker
The major theme for this blog thusfar has been "Duh, that's pretty obvious". And continuing in that vein, today we are going to talk about logging, which is probably the single most important (and obvious) tool in your belt when debugging difficult problems. Of course when you are doing code development, and debugging functionality along the way and through unit tests, log files play a secondary role to the debugger of choice you are using. But when your code gets deployed, and something goes wrong, the log file is the first critical diagnostic aid that you should be asking for outside of the bug report.
But for a log file to be truly helpful in diagnosing issues, there are some key things you need to be aware of as a developer to maximize its value. These key things, which are true regardless of language or execution runtime, include:
- Creating useful log output always, regardless of logging level.
- Creating variable log levels to allow you (or support techs) to "turn up the volume" on what gets emitted in a log file.
- Creating trace and span output in the log file, at all logging levels.
- Assuming that log files will be huge, and optimizing for searching.
- Outputting what you think you are going to need to debug a problem!
Let's dive into each of these to understand their impact.
Log something useful, always
This first point is so self-evident, it is almost embarrassing to include it. But here it is. When your code is running in production, it is very unlikely to be running in a debug mode. That means, generally, you are not going to be generating a lot of output in your log files. But that does not mean you generate no output beyond:
[2021-05-01] Starting application...
Pretty useless. Even in production, you want meaningful output to show that the program or service is running, and that if something goes awry you have context. Do you have multiple threads? Show them starting up and doing stuff. Are you accepting REST calls? Log them. Do you have some scheduled task that kicks off every now and then? Log when it starts, and when it ends.
If there is a runtime exception, you should be logging that exception with sufficient detail to indicate the cause of the fault, and with sufficient supporting context (like a stack trace) to indicate precisely where the problem occurred. Exceptions are, by definition, exceptional and you should not be worried about filling your log files with exception details. Well, ok, if you are filling your logs with exception details, you should be worried, but not about log file size.
What else is particularly useless about that output example above? No time. Always log the time, to milliseconds ideally, in human-readable form. No Unix time please and, if you are writing code for the Internet, always use UTC. Yes, it takes a bit of getting used to but if you need to correlate activity between different time zones, having a common time format is a huge time saver.
Remember that the log file will be the first piece of data you receive about an issue, so think about triage: what do I need to know right away to narrow down the scope of the issue?
Pump up the volume
Regardless of the logging you always emit, you should always have a way to tune the output, both in terms of the amount of data you put into the log file, and the scope of the functionality being logged. Most (if not all) commonly-available logging libraries support variable logging levels and scope, so this is mostly directed at devs who, for some reason, write their own loggers.
But this is also pointed at the human debugger: once you have a basic idea of the scope of the issue being debugged, and you have a reproducible case, you need to be able to get more information to allow the problem to be further reduced.
As a final note on this, I should point out that when your software is interacting with other software that you do not own, being able to increase the logging in these components can be invaluable to finding the issue. I was working on a problem (mentioned last week) that involved creating a proxy scenario once I had a rough idea of where the problem was. This was a NodeJS thing, and I was interfacing with a standard NodeJS packages for SockJS. The key to my solving that problem was not only to reduce the problem, but to also hook into the SockJS package logging, cranking that way up, and eventually getting enough diagnostics from the customer machine to isolate the problem. Frankly, I didn't know a whole lot about the internals of that package, but was able to figure out how the logging worked and coaxed it into giving out.
Following the thread
The concept of trace and span are fundamental to effective debugging software systems, where the processing is distributed across multiple components. The concept is also useful in monolithic, multi-threaded systems, albeit less so, but in basic terms:
- a trace is a sequence of logged steps that represent the entire flow of an operation;
- a span is a subset of a trace that is handled by a single component.
Each of the span and trace are typically identified by some unique number (the trace ID or span ID) and that number is passed along the workflow so that different components can emit the same number in logs. Imagine a workflow where you have two separate services handling part of the work. A request for the work comes in, and the first service identifies the trace ID for that work item. It also creates a span ID while it does its particular thing. It then passes the work item over to a second service that generates its own span ID but, critically, it uses the same trace ID as the initial service. In this way, these two bits of work can be connected (or correlated) through the log files.
Standards such as Open Telemetry try and make it easy to pass these values around. But the concept is very fundamental and you don't need a library to do it. As long as there is a single value that can be shared across different parts of a distributed system, you have that critical link that ties everything together. And obviously, that value must be emitted in the log files.
Even within a component that executes work in a multi-threaded fashion, identifying each processing thread is critical because, by definition, multi-threaded components will be doing multiple things at the same time. Log output will be interleaved between the various threads as they execute concurrently, and having a unique identifer to each thread is essential to teasing them apart.
And please note that having the values, and making them obvious in the log file are two different things, so make them obvious and searchable, which leads me to...
Log files need to be searched
Non-trivial debugging will almost invariably involve working through a lot of log files. These files may be huge, and so you cannot assume that you can just open them in Notepad or BBedit and be on your way. Sometimes, grep is your only choice, or you may need to write a small tool to pull data out. Over the years, I have done it all.
But one thing stands out: log files must be text. It is the lingua franca of computer systems and, even if log files are ingested into bigger systems for correlation and searching (e.g., ELK), a text file will always be consumable whereas proprietary "compressed" log formats are just a pain in the ass. For years, I had to work with an inherited logging system that used proprietary file formats, necessitating a custom log reader to look at them. Needless to say, as files got bigger, the reader kept choking on the data, and it was a painful exercise to debug across multiple files.
Fast forward to my recent experience, where I can quickly pull in multiple files into BBedit, filter rows based on traces and spans, consolidate them into a single window that I can then sort by date and time (because they are all UTC, of course) and I can do really powerful analysis and debugging. Smart, simple logging empowers deep introspection and effective debugging.
But as noted above, sometimes grep is your only choice when you have to search across multiple, gigabyte files. And to facilitate effective pattern matching, the log files need to have they key data elements (date, time, trace, span, log message type, etc.) highly structured and easily parsed. Aim to have a standard logging format that is used across all the software you or your company produces, and you'll have developers that can effectively diagnose issues, even on components they did not write.
Think like a debugger
When you are writing code and emitting what you hope are useful log entries, think like you are going to need to debug this exact piece of code six months or a year from now. Are you putting out what you need to make it easy to infer the context? Sure, trace and span identifiers are technically all you need to link everything up, but trace and span identifiers are for machines to use to correlate logs. What about you, the human called on to debug this piece of code?
Make it easy on yourself and add that extra bit of "stuff" to help figure things out. I am currently working in the SaaS world, where requests to services come from literally everywhere. So I need to identify the "who" for a request as much as I need to correlate the log entries tied to that request. How many log files do I need to pore through to find the "who" associated with a particular span? Would it help if I included the "who", redundantly perhaps, in every major step in the processing path?
Yeah, it would help.
There are challenges, of course, to emitting too much identifying data. There are even laws preventing personally identifying information from being logged in some contexts. But those cases are far less common, and merely being aware should steer you clear and still allow you to log effectively.
When your software is in final validation, look closely at the logs. Do they tell the full story, or are there assumptions or missing data? While it may not be part of the acceptance criteria for a component, treat the log output as your personal acceptance gate. You, or a co-worker, will be relying on it someday.
Logging is a very important topic, and later posts will dive deeper into some more specifics, such as Open Telemetry. In the meantime, here are some other technologies that might be of interest:
- DataDog, https://www.datadoghq.com
- Splunk, https://www.splunk.com
- ELK or ElasticStack, https://www.elastic.co/elastic-stack