Open Your Eyes

All problems have a solution. You may not see it yet, but the answer is there somewhere.

You just have to open your eyes.

Helping two junior programmers this week, I found they had come up with an elaborate solution to a problem based on their limited knowledge of SQL. No problem, kudos for coming up with something.

But the elaborate solution was very...elaborate. And performance-killing. What else could they do? The answer is: open their eyes and look around.

Don't presuppose the answer is going to be this or that. Look at the problem you are trying to solve, the crux of the issue, and solve that. Or assemble data and clues, and see how they might support what you want to do.

"Well we were looking at the data. Didn't see that."

Uh, it is right there. In the data. Look.

"I don't understand the data model."

Hmm, ok, that is a fair comment in the broad sense. But a table with "summary" in the name, and a column with "total" in the name, is not really arbitrary. It is reasonable to assume, as a starting point, that the people creating those schemas knew what they were doing and were not deliberately trying to throw you off the scent.

Open your eyes, the answer is right there.


On another issue, a way more complicated one, opening one's eyes only goes so far. There just wasn't enough data in the logs to provide a clue about what was going on, which, in this case, was a service mysteriously stopping its core function. No exceptions were logged, and other parts (like heartbeats) were still going strong. But according to the logs, the main part of the service just stopped.

But the logs were only part of the story. We had metrics too. And with different metrics being bumped at key parts of the workflow, we could infer a bit more about where the code was, and was not, executing when it "stopped".
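As a sketch of the idea (the metric names, stages, and statsd-style client here are assumptions, not the actual service): if the earlier counters keep advancing in production but a later one flatlines, the stall is narrowed to the step in between.

```python
from statsd import StatsClient

# Hypothetical metric names and workflow stages, purely to illustrate the inference.
metrics = StatsClient(host="localhost", port=8125, prefix="worker")

def process_batch(batch):
    metrics.incr("batch.received")   # still climbing in production...
    records = fetch(batch)
    metrics.incr("batch.fetched")    # ...also still climbing...
    write(records)
    metrics.incr("batch.written")    # ...flatlined: the stall is somewhere in write()

def fetch(batch):
    # Stand-in for the real fetch step.
    return batch

def write(records):
    # Stand-in for the real write step.
    pass
```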

That still wasn't enough data to provide an answer, so at this point the only solution is not really a solution but a tactic: get more data. The problem is that this only seems to happen in production, randomly, so we need to push out instrumented code or, at least, code with better logging.

But in the meantime, the service stops in production and the only way to get it back up and running is to bounce it. And here is where containerization really shines, because the container orchestration control plane (ECS, in this case) is constantly calling the service's health check. While that check always passes the boilerplate response, we can add a flag so that when we know we have stopped doing our main thing (because our "main thing" metric is no longer advancing), the health check can say "uh, we have a problem here, please restart us".
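A minimal sketch of that trick, assuming a plain HTTP health endpoint and made-up names and thresholds (record_progress, STALL_LIMIT_SECONDS, port 8080 are all hypothetical); the only real requirement is that a stalled service returns a non-2xx status so the orchestrator marks the task unhealthy and replaces it:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical threshold; the real service's numbers are not in this post.
STALL_LIMIT_SECONDS = 120
_last_progress = time.monotonic()

def record_progress():
    """Call this at the same spot the 'main thing' metric is bumped."""
    global _last_progress
    _last_progress = time.monotonic()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stalled = (time.monotonic() - _last_progress) > STALL_LIMIT_SECONDS
        # A non-2xx response fails the health check, so the task gets replaced
        # instead of sitting there alive but doing nothing.
        self.send_response(503 if stalled else 200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of the logs

def serve_health(port=8080):
    HTTPServer(("", port), HealthHandler).serve_forever()

threading.Thread(target=serve_health, daemon=True).start()
```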

"Turn it off and back on again."

The Support 101 fix from the '90s is back.