Debugging Fluency

In previous posts, I have stressed the importance of good logs and called them the cornerstone of effective debugging. However, as I have also pointed out, we are not always in control of how the logging is done when we are debugging a system of software components, where only part of that system is something you or your team has written.

In these cases, you would typically want to:

  1. Consolidate all the logs from all components into a single place.
  2. Transform the logs (as best you can) into a common format.
  3. Enrich the logs (as best you can) with useful metadata to link things up.

Once the logs are all consolidated, you are then empowered to do a deep analysis of flows and interactions, assuming you consolidated the logs into a suitably deep tool. Examples of such tools include (but are not limited to):

For today, I am going to focus on the last of these, the ELK stack, because it is insanely popular, and because it can be self-hosted thanks to its open source components.

ELK vs. EFK

The "ELK stack" has become one of the standards for deep log file analysis. It is comprised of three parts:

  1. Elasticsearch, for storage and indexing
  2. Logstash, for gathering and transforming
  3. Kibana, for visualization

In more recent times, the middle piece, Logstash, is often replaced by another open source tool called Fluentd. This is partly because Fluentd has more capabilities for manipulating the log data at ingestion, but equally because FluentBit, the lightweight variant of Fluentd, is super tight and lean, allowing it to run in containerized environments like Kubernetes.

You can find numerous comparisons between Logstash and Fluentd. Here is one, but Google will help you find lots more.

You will frequently hear the term "EFK stack" instead of "ELK stack" to indicate the use of Fluentd or FluentBit. Funnily enough, Elastic.co still calls it "ELK", I think to keep the cool name, and repurposes the "L" for "ELasticsearch".

Over the next few weeks, I'll try to dive deeper into Elasticsearch and Kibana but, for now, I want to focus on using FluentBit to do the log gathering.

FluentBit

Consider a server that is running some mission-critical piece of software. It generates a bunch of logging data but, for disk storage reasons, only keeps the logs around for a short period of time. This log rotation is the key reason why, when things go bad, you often have nothing to work with unless you are on the scene when the problem occurs.

Because it is mission-critical, you cannot bog the server down, so you need something that is lightweight. Because it generates a lot of log information, you also need something that is fast. One approach, of course, is to use some scheduled task that periodically zips up the log files and copies them to a network share. Kludgey but effective.

The downside to the zip-and-copy approach is that it merely solves the archiving problem. Yes, logs will be available if something goes wrong, but it is better to ship the logs into something that provides on-demand as well as proactive analysis capabilities. So you need something that is collecting the log data in real time and shipping it in a more structured fashion.

FluentBit hits the mark because it is small and fast (it is written in C), and has a wide variety of "plugins" and configurations to perform log ingestion, parsing, filtering, and shipping.

This post will not be a detailed tutorial on FluentBit. I found this to be useful, as well as the documentation of course, although the latter sometimes has some awkward English syntax. Instead, I will focus on my experiments with using FluentBit to gather logs from that mission-critical system, to give you a taste of its power.

Tailing logs

FluentBit has a wide range of input plugins but, for pulling logs from arbitrary software components, the most effective by far is "tail". Simply put, this plugin watches one or more text files and, every time something gets written to a watched file, pulls the new lines into its processing pipeline. You can tell it to watch a specific file, or you can use a wildcard to watch multiple files. It doesn't get more basic than that and, in my tests, it was super effective, even when watching files on a remote server.
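To make that concrete, here is a minimal sketch of a tail input section. Name, Path, and Tag are real keys of the tail plugin; the path and tag values are just placeholders standing in for my setup:

[INPUT]
    Name    tail
    Path    /var/log/swx/*.log
    Tag     swx.logs

The Tag value matters later, because routing decisions are made by matching against it.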

The "tail" plugin has numerous configuration values, but two are particularly useful. The first allows you to specify that the log file(s) being watched are checkpointed in a SQLite database. This allows FluentBit to restart and pick up where it left off for each file it is watching, which can be handy if log data cannot be missed. Without checkpointing, each time FluentBit starts up it begins reading from the current end-of-file.

I should note that, generally, configuring FluentBit is somewhat painful. The configuration files are plain text and use indentation for grouping values together (not unlike YAML). However, there were multiple times when it would not accept tabs as my indents, and I had to use runs of spaces to make things work. Blech.

Parsing

The second useful configuration allows you to associate a specific input source with a "parser". Parsers let you transform raw log input into structured data that can then be matched, filtered, or otherwise mutated before being shipped to the output destination. In the most general case, the regex parser allows arbitrary log lines to be split up to pull out timestamp information (at the very least) and, if the log file format is reasonably regular, additional metadata such as span and trace IDs and other workflow identifiers.

Here is an example of a parser I devised for my use case:

[PARSER]
    Name        swxiface
    Format      regex
    Regex       ^(?<time>[\d\-]*\s[\d:]*),\d+\s*(?<logtype>\w*)\s*(?<process>\S+)\s*(?<msg>.*C:(?<tenant>\d*).*)$
    Time_Key    time
    Time_Format %Y-%m-%d %H:%M:%S

Note that because my typical log line was quite regular, I could easily group parts of each line into meaningful bits, like "time", "logtype", and "process". These get put into the structured "events" that FluentBit then uses in the rest of the processing pipeline.
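To actually apply the parser, the tail input refers to it by name via the Parser key. One wrinkle worth knowing: parser definitions live in their own parsers file, which the [SERVICE] section points at with Parsers_File. A sketch of the wiring, with the file paths as placeholders:

[SERVICE]
    Parsers_File  parsers.conf

[INPUT]
    Name    tail
    Path    /var/log/swx/*.log
    Tag     swx.logs
    Parser  swxiface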

This structure allows me, for example, to only ship log entries that have a "logtype" of "ERROR". Or, more usefully, I can ship all log information to one output destination, but also ship the "ERROR" log lines somewhere else. This is what FluentBit refers to as tag-based routing.
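As a rough sketch of how that might look (the tag names are made up, and I am not claiming this is the exact pipeline I ended up with): FluentBit's rewrite_tag filter can emit a copy of matching records under a new tag, and each output then picks up records by matching on tag.

[FILTER]
    Name    rewrite_tag
    Match   swx.logs
    # copy any record whose logtype is ERROR under a new tag, keeping the original record too
    Rule    $logtype ^ERROR$ swx.errors true

Two [OUTPUT] sections, one with Match swx.logs and one with Match swx.errors, would then send each stream to its own destination.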

Routing to the console

When developing a FluentBit workflow, one of the most useful outputs is stdout. This allows you to test your source, parsing logic, and filtering before taking the next step and shipping the events somewhere. A minimal output section for that looks something like this (Match * simply matches every tag):
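[OUTPUT]
    Name    stdout
    Match   *

I'll end this post here, and pick up the Elasticsearch leg next time. But using the parser given above, here is what the structured events look like (note the tagged elements):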

[25] tail.0: [1622218260.000000000, {"logtype"=>"INFO", "process"=>"r_Worker-2", "msg"=>"(C:2 U:master S:JobScheduler_Worker-2_237 X:264093)  [c.i.t.i.s.j.k.KettleRunner] Job result: false", "tenant"=>"2"}]
[26] tail.0: [1622218260.000000000, {"logtype"=>"INFO", "process"=>"r_Worker-2", "msg"=>"(C: U: S: X:)  [c.i.t.i.s.j.BaseJob] [customer2.Plugin] [elapsed:0.019904448s]"}]
[27] tail.0: [1622218260.000000000, {"logtype"=>"DEBUG", "process"=>"r_Worker-2", "msg"=>"(C: U: S: X:)  [c.i.t.i.s.j.s.JobHandler] [customer2.Plugin]: Execute done"}]

The output needs some more work, because the timestamp is shown as a Unix-style double value instead of something more human-readable. But then, this part is not supposed to be human-readable. That is the job of the rest of the stack.

Stay tuned.