Time Warp

Let's talk about time.

Time has a way of going by, really fast. Which is why I am dismayed that it has been so long since my last post. That is definitely not my intent, but when the grind gets grinding, it is tough to carve out time to share my thoughts.

But enough navel gazing, let's get down to time and how it impacts debugging.

The timeline

I have mentioned before that time is a critical piece of data when debugging a problem, especially in a distributed system where you cannot easily just "step through code". The timeline of events, captured in logs and database records, is often the only way you can see the big picture of what is going on, in order to narrow down the problem and get to the root cause.

So obviously it is important to capture time as a part of debug output. But how you do that is just as important as doing the thing itself, so let's double-click on that.

3...2...1...0

You know how in those heist movies, the team synchronizes their watches to make sure the laser beams turn off just as the acrobat is shimmying in through the air vent? Synchronization ensures each team member does their thing exactly when they are supposed to. Your watch is 5 seconds too fast? Alarms go off!

When you are debugging across multiple systems, it is vitally important that the clocks on those systems are also in sync so that logged timestamps can be easily correlated. If they are not in sync, then it is impossible to know for sure whether "this" happened before, or after, "that". To successfully debug in that case, you would need to know how fast or slow a particular system's clock is, and then do the math on each time value. Painful.
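
To make the pain concrete, here is a minimal sketch (in Python, with a made-up 5-second skew) of the kind of correction you would have to apply to every timestamp from a fast-running server before comparing it against the others:

```python
from datetime import datetime, timedelta

# Made-up example: server B's clock is known to run 5 seconds fast relative
# to server A, so every timestamp from B must be shifted before comparing.
SKEW_B = timedelta(seconds=5)

def normalize(ts_from_b: datetime) -> datetime:
    """Shift a timestamp taken from server B onto server A's clock."""
    return ts_from_b - SKEW_B

entry_b = datetime.fromisoformat("2021-07-17T22:04:08.933+00:00")
print(normalize(entry_b))  # -> 2021-07-17 22:04:03.933000+00:00
```

Now imagine doing that for three or four systems, each with a different (and drifting) offset.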

These days, this is far less of an issue than it was even a few years ago, since so much is running in the cloud and on-premises servers are generally reaching out to network time servers (NTP) to keep their clocks correct. But older systems still exist, so it is important to check that off your list.

UTC

Even if clocks are in sync, there is the pesky issue of time zones. Customers (if you are lucky) are spread across the world. Even a single larger customer may be operating in multiple time zones. So debugging issues can become complicated if you have to do the time zone math in your head as you analyze logs.

Oh, and don't forget Daylight Saving Time, if you happen to be looking at logs for March or November!

For these reasons, you should really do yourself a favour and track time in UTC.

Unless you are absolutely positive that your software will never run in a different time zone, UTC makes all time issues go away.

Sure, you need to know what time zone a customer is in so you can translate their bug reports ("this thing happened at 9 am"), but after that you should have no problem looking at logs across systems and geographies to build your timeline.
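
As an illustration, here is a small sketch, assuming Python's zoneinfo module and a hypothetical customer in New York, of doing that translation once and then working purely in UTC:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Hypothetical bug report: "this thing happened at 9 am" from a customer in New York.
reported = datetime(2021, 7, 17, 9, 0, tzinfo=ZoneInfo("America/New_York"))

# Translate once to UTC, then search the (UTC) logs around that moment.
print(reported.astimezone(ZoneInfo("UTC")).isoformat())
# -> 2021-07-17T13:00:00+00:00
```

One conversion at the start, and every log line after that lines up without any mental arithmetic.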

Unix time

There was a time (heh, heh) when I hated Unix time. For those who don't know, Unix time is the number of seconds since midnight UTC on Jan. 1, 1970, although some systems will emit those values in milliseconds, microseconds, or even nanoseconds (!!!).

But seconds are most common and, because a huge amount of code written with Unix time still uses 32-bit integers, those applications will have a bit of a problem on January 19, 2038, when those integers overflow.
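
You can see the cliff edge for yourself: the largest value a signed 32-bit integer can hold, read as seconds since the epoch, lands squarely in January 2038.

```python
from datetime import datetime, timezone

# The largest value a signed 32-bit integer can hold, interpreted as
# seconds since the Unix epoch:
print(datetime.fromtimestamp(2**31 - 1, tz=timezone.utc))
# -> 2038-01-19 03:14:07+00:00  (one second later, the counter wraps)
```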

So yeah, Unix time is a bit archaic, and it always surprises me when I see it in logs or databases.

But that said, it has a really nice property: values are emitted as simple integers, making it easy to see when one thing happened before another. Sorting logs with your text editor of choice is also very easy. And since I had to do exactly that recently, I am now warming to Unix time. Luckily I have 17 more years to take advantage of it.

Of course, looking at a very large number is pretty opaque when you are trying to relate Unix time to something humans are more used to. There are probably a gazillion tools to do that kind of conversion, but I just have https://www.unixtimestamp.com always open to quickly do the math. Thanks, Dan's Tools!
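
If you would rather not keep a browser tab open, a small helper does the trick. This is just a sketch that guesses the unit from the magnitude of the value, which works for present-day timestamps but is a heuristic, not a rule:

```python
from datetime import datetime, timezone

def epoch_to_iso(value: int) -> str:
    """Convert a Unix timestamp to ISO 8601 UTC, guessing the unit
    (seconds, milliseconds, microseconds, or nanoseconds) from its size."""
    for divisor in (1, 1_000, 1_000_000, 1_000_000_000):
        seconds = value / divisor
        if seconds < 10_000_000_000:  # anything smaller is before the year 2286
            return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
    raise ValueError(f"value out of expected range: {value}")

print(epoch_to_iso(1626573843))     # a value in seconds
print(epoch_to_iso(1626573843933))  # the same instant, in milliseconds
```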

Beware the filter

As you know, being a master debugger and all, rich log files are big log files. So invariably you are going to be filtering log data to isolate certain flows. But the moment you filter the logs, you are hiding data that could be very material to the issue you are debugging.

Filtering, while useful and necessary, is making assumptions about what is, and is not, important.

I was looking at some logs the other day, filtering on the specific flow I was interested in. Yup, there were the expected log entries. Counting them, though, one was missing. Hmm. Looking closely at the timestamps of the expected entries (which were periodic), there was an obvious gap. Since this was a server log, the obvious (and naive) conclusion was that the request for that time period was never made by the client system.

But because the logs were filtered, they were obscuring an important fact: the server was restarting. A lot. Looking at an unfiltered view of the time span where the "missing" log entry should have been, it became clear that the entry was not there because the server was restarting at the time.
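
One way to widen the view is to take the suspicious window you found in the filtered output and dump everything in it, unfiltered. A rough sketch, assuming each log line starts with an ISO 8601 timestamp (the file name and time window here are made up):

```python
from datetime import datetime, timedelta

def lines_around(log_path: str, center: datetime, minutes: int = 5) -> None:
    """Print every log line within a few minutes of a suspicious moment --
    restarts, errors, and other clients' traffic included."""
    window = timedelta(minutes=minutes)
    with open(log_path) as f:
        for line in f:
            try:
                # Assumes each line starts with an ISO 8601 timestamp.
                ts = datetime.fromisoformat(line.split(" ", 1)[0])
            except ValueError:
                continue  # skip lines that don't start with a timestamp
            if abs(ts - center) <= window:
                print(line, end="")

# Hypothetical usage: the window around the "missing" periodic entry.
lines_around("server.log", datetime.fromisoformat("2021-07-17T22:04:03.933-04:00"))
```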

In another case, for another problem, the log for a particular flow looked correct. Because we were looking at a particular customer's flow, the logs were being filtered by "client ID". Request received, request processed, data pushed out. Yup, looks good.

But again, looking at the timestamps revealed a problem: the time between the receipt of the initial request and the processing was 2 minutes. (Interestingly, exactly 2 minutes.) It was easy to miss because when you are looking at a lot of log data, timestamps tend to blur, especially when they look like "2021-07-17T22:04:03.933-04:00". Which is the minute and which is the second? But there it was, if you looked closely.
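
One way to keep timestamps from blurring is to let the computer do the squinting: parse them and print the delta between consecutive entries, flagging anything unusually large. A small sketch, with made-up log lines modeled on that flow:

```python
from datetime import datetime, timedelta

# Hypothetical filtered log lines for one client's flow.
entries = [
    "2021-07-17T22:04:03.933-04:00 request received  client=1234",
    "2021-07-17T22:06:03.933-04:00 request processed client=1234",
    "2021-07-17T22:06:04.101-04:00 data pushed       client=1234",
]

previous = None
for line in entries:
    ts = datetime.fromisoformat(line.split(" ", 1)[0])
    if previous is not None and ts - previous > timedelta(seconds=30):
        print(f"gap of {ts - previous} before: {line}")
    previous = ts
# -> gap of 0:02:00 before: 2021-07-17T22:06:03.933-04:00 request processed ...
```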


The techniques and tips here are pretty basic, and yet absolutely critical to successfully solving issues, especially in distributed systems. Just remember: time is on your side.


println

Here are a couple of debugging-related posts from around the web, for your reading pleasure:

5 Debugging Tips for Beginners
In the process of learning how to code, we inevitably run into problems. Some are easier to solve tha...
Application Introspection and Debugging
Once your application is running, you’ll inevitably need to debug problems with it. Earlier we described how you can use kubectl get pods to retrieve simple status information about your pods. But there are a number of ways to get even more information about your application. Using kubectl describe …
Capturing logs at scale with Fluent Bit and Amazon EKS | Amazon Web Services
Earlier this year, AWS support engineers noticed an uptick in customers experiencing Kubernetes API server slowness with their Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Seasoned Kubernetes users know that a slow Kubernetes API server is often indicative of a large, overloaded cluster …