This is original content by Andriy Buday.

Yes, even this picture of rubber ducks is an original picture I took ~10 years ago.

I just wanted to document a few different strategies for debugging and root-causing issues that have worked for me in the past, with examples. Two things about this post: it isn’t an attempt to provide a holistic view of debugging, mainly because there are plenty of articles out there already; besides, I recently tried an LLM with a relatively simple prompt, and it generated quite an impressive summary. Instead, I want to share a few examples and just a handful of approaches I like. I have two or three favorite strategies for debugging:

Time & changes

In my experience, if something worked and then stopped, more likely than not a change was introduced recently; this is why the concept of “change management” exists. Simply looking at anything that changed around the time the problem appeared helps. Steps:

  1. Identify when the problem first started.
  2. Collect available symptom information, such as error messages.
  3. Look up relevant codebase, release, data, and config changes by that timestamp. Prioritize the search using clues from the symptoms, but don’t skip a lookup just because it seems less relevant (a small sketch of this step follows below).
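To make step 3 a little more concrete, here is a minimal sketch, in Python, of gathering everything that landed shortly before an incident. All of the names here are made up for illustration; in practice the change records would come from whatever sources you have, such as version-control history, release logs, or config-push audit trails.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    source: str          # e.g. "code", "release", "config", "data"
    description: str
    timestamp: datetime  # when the change landed


def suspects(changes, incident_start, window=timedelta(hours=6)):
    """Return changes that landed within `window` before the incident,
    ordered with the most recent (closest to the incident) first."""
    nearby = [c for c in changes
              if incident_start - window <= c.timestamp <= incident_start]
    return sorted(nearby, key=lambda c: incident_start - c.timestamp)
```

The window size is a judgment call; the point is simply to collect everything that changed around the incident, not only the changes that look related to the symptom.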

Sometimes an error message gives a straightforward path to identifying an issue. A happy scenario could look like this: you find the error message in the codebase, open the file’s change history, and right there you see a problematic change. But things aren’t always that straightforward.

Because a symptom can be at the end of a long chain of events, I don’t always focus only on the clues from the symptom; instead, I advise looking at anything unusual that started around the same time.

As an example, I was involved in root-causing an issue where our server jobs were crashing and could not start up again after the crash. From the logs, it seemed something prevented them from loading a piece of configuration data, so we looked at the stack traces of the crash, but that wasn’t bringing us any closer to solving the issue. Instead, by looking at the logs of healthy servers, we noticed non-critical warnings and an inconsistency in configuration. These two things seemed unrelated, but they had started appearing around the same time, and that observation led us to the fix.

As another example, a colleague of mine was investigating an issue where an API was crashing on a specific type of request. Looking at changes to the API itself, he couldn’t see anything around the time this started happening. I suggested we expand our search and check whether other clients of this API had any changes. What we found was that someone had updated all of the clients of the API at exactly the timestamp when we started seeing the issue, but had forgotten to update our client.

Isolation and Bisection

My next favorite approach to debugging is reducing the universe of possibilities. What can be better than halving it every time?

I found this approach works best when debugging a bunch of recently written code. If it is code you have locally, you can simply split it into smaller chunks (for example, by commenting out half of a function). If you pulled recent changes by other engineers, a strategy could be to sync to a version midway between your original base and the most recent changes. I have applied this so many times it feels natural.
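As a minimal sketch of the halving itself: assume you have an ordered list of suspect changes and some way to check whether the problem reproduces with only a prefix of them applied. Both `changes` and `is_broken` below are placeholders for whatever that means in your setup (commits, experiment IDs, config pushes).

```python
def find_culprit(changes, is_broken):
    """Binary-search for the first change that makes the problem reproduce.

    `changes` is an ordered list of suspects; `is_broken(prefix)` returns True
    if the problem reproduces with only `prefix` applied. Assumes the empty
    prefix is healthy and the full list is broken.
    """
    lo, hi = 0, len(changes)  # invariant: changes[:lo] healthy, changes[:hi] broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(changes[:mid]):
            hi = mid  # culprit is among the first `mid` changes
        else:
            lo = mid  # first `mid` changes are fine, look later
    return changes[hi - 1]
```

When the suspects are commits, version control usually has this built in (for example, `git bisect`); the hand-rolled version is handy when they are something else, such as a batch of experiments.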

As an example, we run a huge number of experiments at Google, and occasionally these experiments cause issues, ranging from server crashes to revenue impact, in which case automatic systems normally kick in and disable the culprit experiments. But sometimes things are more subtle and the culprit has to be found semi-manually in a new batch of experiments, which is where the bisection strategy comes in.

Rubber duck debugging

Yeah, you’ve read that right; google it after you finish with this post. Rubber duck debugging is a technique of articulating your problem to someone else: explaining what research has been done so far and what you are trying to solve. Oftentimes, this results in an “aha” moment and a realization of where the problem is. Since you don’t need much involvement from the other party beyond listening, you can just talk to an actual rubber duck.

As a story, I once told a manager of a team of consultants I worked with about rubber duck debugging; the next day he bought all of us rubber ducks. It was a very nice gesture.

But a bit more seriously, there is a lot of merit to this strategy. I have joined dozens of small calls with other engineers who were trying to debug a problem, and after they explained the issue to me and I asked a few simple “why?” and “give me more details” questions, at some point they realized where the problem was. All of this without me even understanding their code or what they were doing.

Other strategies

As I mentioned, I didn’t attempt to write a holistic overview of debugging strategies. Other things that have worked for me: using a debugger (obviously), profilers and other tools, reading stack traces top to bottom and bottom to top, logging more things, and so on. There are simply so many ways of diagnosing and root-causing a problem, and not all strategies are equally applicable.

What else worked great for you?

If you haven't subscribed yet, you can subscribe below: