Swarming

While these posts have generally been about elevating your debugging abilities to Yoda-like stature, today I would like to riff a bit on swarming.

For those not familiar with the term, swarming refers to assembling a large group of people on a problem in order to come to some kind of resolution. Ideally, a swarm should have the following basic properties:

  1. There should be a decent number of individuals to bring lots of "eyes" on the problem.
  2. There should not be so many humans that a large number of them are sitting idle, twiddling their thumbs.
  3. There should be a wide representation of skills and SMEs (Subject Matter Experts) in the swarm.
  4. There needs to be a good (single!) communication channel for the swarm.

Eyes on the prize

The point of a swarm is to bring different perspectives to a problem. While it is possible to swarm with as few as two people, generally that won't be as effective as a larger group. That said, I have found that the very act of verbalizing a problem to even a single person not intimately familiar with an issue can be a catalyst for uncovering a solution.

Generally an effective swarm will have a half-dozen or so people.

Of course, there is no hard and fast rule here, so it is probably more instructive to grok the basic principle rather than slavishly follow some prescription. The effectiveness of the swarm comes primarily from the different perspectives of the people involved. There will probably be one person that is driving the group (as a master debugger, that may be you). But think of a swarm as a brainstorming exercise, where no idea is "stupid". Get stuff out on the table and look at everything.

Remember that scene in "Apollo 13" where they are trying to figure out how to not kill the astronauts? Someone dumps a bunch of hardware on the table, and the engineers around that table try to figure out a solution to the air scrubbing problem. That was a swarm.

On the flip side, a swarm can be too large when a significant number of people are not contributing. This can be a waste of time but, frankly, if a problem is important enough to swarm then maybe that is a cost worth paying. You'll need to decide that for yourself, but when we look at communication later, this might be less of an issue than you may think.

Broad, not deep

A really good swarm will have people with a wide range of skills, and represent a variety of domains. This has two benefits:

  1. Different perspectives often yield insights that you may not have thought of. Even a "silly" question from someone who doesn't know the specific details of some aspect of the problem (note: there are no silly questions!) can uncover a small thread that, once pulled, unravels the problem.
  2. Different domains can grease the skids where there is some blocker to moving forward. A manager may not necessarily get the deep technical details of an issue, but can facilitate pulling someone else in to, say, solve a permission or access issue.

A SME can shed light on how something is supposed to work, or how it is supposed to look from a customer perspective, while a team dev might know what gets written to a database table. Correlating these views in real time can often illuminate an issue that might otherwise be missed, and that can lead to finding the solution.

If you are the facilitator of the swarm, this is the time to use the debugging skills you have developed, and work through them with the group. Assume nothing, look at logs, reproduce and reduce the problem, and use the collective analytical skills of everyone in the swarm to drive to a solution.

An open channel

There needs to be a frictionless channel of communication when swarming, and if there is any silver lining to the pandemic it is that those channels have matured at a significantly accelerated rate. There is really no barrier to having a swarm distributed geographically except, possibly, time zones. You can certainly swarm in a meeting room where everyone is in the same location and can all see the same whiteboard or presentation screen. But I think it is the shared view, more than the shared room, that matters, and that is what makes remote swarming possible.

To be more specific, effective debugging, either alone or in a swarm, requires you to look at code, logs, databases, UI, etc. Therefore whatever channel you use must ensure that everyone is looking at the same thing at the same time, and if you can all look at the same screen then that can happen remotely just as easily.

I was in a swarm this week where I was sharing my screen to look at logs and databases, and then someone else (two time zones away) would take control for a few minutes and show what the customer was seeing as we replayed a scenario over and over, while a third would then show other logs to see what was happening in their specific part of the solution. Transitioning was painless thanks to the tool we were using, and everyone could follow along with advice or questions.

The right tool for the job should also allow for two important side benefits:

  1. The tool should collect all the necessary artifacts of the swarm in one place, so that they can be reviewed later. And if the swarm needs to be continued with other people, there is a record of what was done so they can come up to speed. These artifacts could include chat messages, video recordings, screenshots, etc.

    Note the emphasis on one place. An effective swarming tool should not scatter artifacts across several locations, which only makes them harder to find.
  2. The tool should allow people to drop in and out of the swarm as they are needed. This avoids the thumb-twiddling noted earlier, maximizes the effectiveness of the swarm while respecting people's time, and gives an opportunity for senior management to drop in and out as needed if an issue is highly escalated.

Debugging as a team through swarming can be an extremely effective way of solving deep or hard problems. It can also be a great way to teach others how to debug a problem using the techniques you learn here. Part of mastery is teaching others, so share the wealth! While you are at it, please share this with anyone you think might benefit from it.