David Caulfield

Save Your Team 100s of Hours

How is time wasted in your team?

Teams require constant maintenance, updates and improvements. Otherwise, they become stagnant and inefficient. To become hyper-efficient, each team must analyse how it works. One key metric is how much time your team wastes on repetitive tasks. While software developers are excellent at automating away their own problems, they often forget to focus on the team's problems.

If you analyse your team's time and value output, you will see many areas to improve. Obvious areas are poorly written bug reports, too many meetings, poor communication with colleagues and so forth. But there is a category of work that is often overlooked: repetitive team tasks. A repetitive team task is any task that the team needs to perform on a regular basis.

Ask your team:

  • What tasks do we do on a regular basis? e.g. Gather logs from a customer system.
  • What topics do we constantly explain? e.g. How to install MyComponent on a Linux server.
  • What issues regularly crop up? e.g. A bug with MyComponent which needs a workaround.
  • What support do we regularly give? e.g. Recovering MyComponent in the event of a failure.

These examples are straightforward for the team member who knows what to do. However, team members are often out of office or unavailable to help. When this happens, something as simple as resolving a code conflict is difficult for someone who does not have the know-how. This results in wasted cognitive effort as your team attempts to solve a problem that was previously solved. Your team's valuable time for new features is replaced with old, repetitive work.

We need to reduce this waste.

What are Runbooks?

Runbooks document your team's knowledge for the future. A runbook is a step-by-step recipe to solve a problem. With runbooks, your team can rely less on each other's availability. Instead, when a problem arises, they can search through a database of runbooks for a solution.

Runbooks contain clear and concise steps. They can be "How-To" guides, tutorials or any step-by-step instructions. Runbooks should not have walls of text. This is better left to blog posts and articles. Instead, sentences should be brief and bulleted.

What are the benefits of runbooks?

Most people don't write. It's not surprising - we're not paid to write. But writing down solutions to problems has a host of benefits. When teams document their solutions, they achieve hyper-efficiency quickly. With a runbook, the writer must deep dive into their problem. They must write in such a way that other people can read it quickly and efficiently.

After 6 months of documenting solutions, my team has written 114 runbooks. The top runbook has been viewed 125 times by our team of only 8 people. This top runbook is an extensive 'How-To' document which explains gotchas and workarounds when installing a dev environment. I estimate that each viewing saves about 30 mins. Therefore, this one runbook has saved the team roughly 60 hours (125 views x 0.5 hours = 62.5 hours). If we account for the other 113 runbooks, our team has easily saved 100s of hours.
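The estimate above is simple enough to rerun with your own team's numbers. Here is a minimal sketch of the arithmetic; the function name `estimate_hours_saved` is made up for illustration, and the 30 minutes per view is only my estimate, not a measurement.

```python
def estimate_hours_saved(views: int, minutes_saved_per_view: float) -> float:
    """Rough time saved by one runbook: views x minutes saved per view, in hours."""
    return views * minutes_saved_per_view / 60


# Figures quoted in the text: 125 views, about 30 minutes saved per view.
top_runbook_hours = estimate_hours_saved(views=125, minutes_saved_per_view=30)
print(f"Estimated hours saved by the top runbook: {top_runbook_hours}")  # 62.5
```

Plug in the view counts from your own documentation tool to see which runbooks earn their keep.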

What kind of runbooks can I create?

Create runbooks to solve specific problems. Not everything should be put into a runbook. Long topics do not facilitate quick and easy searching. Long-form writing also makes it difficult to give the reader information quickly. So, what should go into a runbook?

Bug Runbooks

An in-depth analysis of my team's most difficult bugs has proved useful on multiple occasions. When I create a Bug Runbook, I ask myself: Who will need this in the future? Very often, a new bug is opened against our team and someone will say 'Hey I've seen this one before. Anyone remember how we solved it?'. When this happens, I need quick information about the previous bug.

Without runbooks, I need to go through the team's history of Jira tickets and search for keywords until I find it. Even when I find the old bug, if the assignee closed it in a hasty manner then a lot of information will be missing. Root causes are nowhere to be found. I am lucky if the assignee has even linked their code fix. On the other hand, let's say the bug ticket has a link to a runbook with the following information:

Key phrases

  • RuntimeError in /path/to/logs.

Problem Statement

  • User's login page failed to load.

Steps to Diagnose

  • Viewed logs x and y.
  • Checked the VM resources using df -h.
  • Discovered the VM directory /var was full due to large logs.

Root Cause

  • Customer set their logs to MESSAGE for some investigation then forgot to change back to INFO.

Steps to Fix

  • Remove the log directory /path/to/logs to bring the VM back up.

Steps to Prevent

  • Create an alarm to detect when the VM resources become too low.

Other things to note

  • This happened to a system under heavy consumer load over the course of 2 weeks.

I now have a detailed root cause for a similar ticket. Instead of starting my investigation with no information, I have a possible root cause that I can quickly check against my new ticket.

Runbooks with this analysis have saved us many hours of repetitive troubleshooting. I admit that most of our bug runbooks never get opened again after creation. But every so often we get a valuable analysis that is useful again and again. We think of runbooks as a method of automating our effort. Rather than spend a limited supply of brainpower repeatedly, we record it once with the hope that it will be useful in the future.
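The 'Steps to Prevent' entry in the example Bug Runbook (an alarm for low VM resources) could be sketched in a few lines. This is an assumption-laden illustration, not our actual alarm: the function names and the 90% threshold are hypothetical, and the percentage may differ slightly from what df -h reports because of reserved blocks.

```python
import shutil


def percent_used(used: int, total: int) -> float:
    """Return used space as a percentage of total space."""
    return used / total * 100


def should_alarm(path: str, threshold_percent: float = 90.0) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold.

    The 90% default is an illustrative assumption, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.total) >= threshold_percent
```

In the scenario from the runbook, a scheduled job could call should_alarm("/var") and notify the team when it returns True.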

How-To Runbook

How-To Runbooks allow repetitive tasks to be semi-automated. A How-To Runbook should contain all the details needed to work through a particular scenario. Use these runbooks to set up dev environments, close tickets, create improvements, install software etc. Once a How-To Runbook is created, anyone in your team can use it whenever they please. Otherwise, you would have to ask your teammate for help when they have already explained it to someone else. With a How-To Runbook, distractions are minimized as you streamline the team's knowledge into a shared database.

Our template for How-To Runbooks is very straightforward:

Scenario

  • Explain when your team needs to use this runbook.

Steps

  1. Do x.
  2. Do y.
  3. Do z.
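If your documentation tool accepts plain text, the template above could also be kept as a small data structure and rendered on demand. This is only a sketch: the `HowToRunbook` class, its `title` field and the `render` method are hypothetical names, not part of any tool mentioned here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HowToRunbook:
    """A How-To Runbook: one scenario plus ordered steps (hypothetical structure)."""

    title: str  # assumption: the template itself only lists Scenario and Steps
    scenario: str
    steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as plain text, numbering the steps from 1."""
        lines = [self.title, "", "Scenario", "", f"  • {self.scenario}", "", "Steps", ""]
        lines += [f"  {number}. {step}" for number, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


example = HowToRunbook(
    title="How-To Runbook: Example",
    scenario="Explain when your team needs to use this runbook.",
    steps=["Do x.", "Do y.", "Do z."],
)
print(example.render())
```

Keeping the scenario and steps as separate fields makes it harder to slip a wall of text into a runbook.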

Process Runbook

Teams constantly shift. They get new members, people leave, they go on holidays and so forth. In many cases, the team processes are only known in depth by one or two people. If they take a sick day or leave the company, does someone know how to take over? For example, if your scrum master takes a day off unexpectedly, do you know how to plan the sprint with the dev team and product owner? Or if a new member joins the team, can they quickly learn about the team's processes?

Take a moment to think about what your team does on a regular basis. Describe that process in a Process Runbook.

For example, my team has a runbook for 'Story Workflow'.

Scenario

  • A developer wants to take a story or task from the backlog to complete.

Steps

  • Open your Jira ticket.
  • Read the description.
  • Write down anything you do not understand.
  • Read the acceptance criteria.
    • Is it clear what you need to do?
    • If not, talk with your scrum master, product owner or a team member.
  • What testing do you need? Comment this in the Jira ticket.
  • What documentation do you need to update? Comment this in the Jira ticket.
  • What tasks or steps will it take to complete this story? Comment this in the Jira ticket.
  • After completing each step, add a comment to your ticket with your progress.
    • You should have at least one update per day.
  • If you get stuck, request help from your team.
  • Once all tasks are completed, show your team what you have accomplished. This is called a demo.
  • To commit your code, follow this Runbook => Link.
  • Ensure all code is reviewed by a team member and merged before closing your story.

When a new developer needs to work on a story, you can provide a workflow that is easy to follow. The developer can work by themselves (to a certain extent) without being hand-held through the process. Without a runbook for this, someone must assist the new developer to ensure that all steps are completed. This increases distractions and eats into the team's total time capacity.

Troubleshooting Runbook

Troubleshooting is a key part of any team. It can also be a frustrating part of the job, particularly for newer members. In general, teams learn new information ad hoc. Ad hoc learning results in different team members picking up different levels of information. This naturally leads to an inequality of expertise across the team. If there is a large inequality of expertise across the team, then a high dependency gets placed on one or two team members. Then, what happens when that person is absent? Or worse, what happens when that person is hit by a bus?

A couple of times a year, my team takes a meeting to ask, 'What is our bus factor?'. In this meeting, the goal is to identify each area of expertise we own and assign it a bus factor. Thereafter, we work to increase our bus factor on anything below 3. The quickest way to do this is for each expert to document their knowledge in a runbook.
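The bus factor exercise is easy to tabulate. Here is a minimal sketch, assuming the bus factor of an area is simply the number of people who can handle it alone; the area names, the team member names and both function names are invented for illustration.

```python
from typing import Dict, List, Set


def bus_factors(expertise: Dict[str, Set[str]]) -> Dict[str, int]:
    """Map each area of expertise to the number of people who know it."""
    return {area: len(people) for area, people in expertise.items()}


def areas_at_risk(expertise: Dict[str, Set[str]], minimum: int = 3) -> List[str]:
    """Return the areas whose bus factor is below `minimum`, sorted by name."""
    return sorted(area for area, people in expertise.items() if len(people) < minimum)


# Hypothetical data: area of expertise -> team members who know it well.
team_expertise = {
    "MyComponent install": {"Ann", "Ben", "Cat"},
    "Log collection": {"Ben", "Cat"},
    "Release process": {"Ann"},
}
print(areas_at_risk(team_expertise))  # ['Log collection', 'Release process']
```

Every area in the printed list is a candidate for a new runbook written by one of its experts.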

Troubleshooting Runbooks are extremely beneficial for documenting methods of solving bugs or support issues. They will prevent trouble in your team when one or two members are missing and a high priority bug comes in. Here is a short troubleshooting runbook we have in our team.

Scenario

  • Developer wants to enable debug mode on MyComponent and check the debug logs.

Steps

  • Open the MyComponent GUI => https://example.com/mycomponent
  • Click on the 'Server' tab.
  • Click on the 'Debugging' tab.
  • Change the dropdown from 'Error' to 'Message'.
  • Save.
  • Log in to the Linux terminal for your system.
  • cd /path/to/debug
  • You will see the debug logs are populated with new information.

Here is an example of a longer Troubleshooting Runbook.

Scenario

Developer wants to check the health of MyComponent.

Step 1. Set up a password-less SSH connection.

  • ssh-keygen
  • ...

Step 2. Verify the following endpoints are reachable

  • curl https://example.com/mycomponent/isAlive
  • ...

Step 3. Run diagnostic script

  • cd /path/to/diag
  • ./diagnostics.py

Step 4. ...

  • And so forth

With a couple of these runbooks, you can reduce stress in the team if someone is out for the day. Troubleshooting Runbooks also work as a proxy for the expert themselves. By providing some basic information in a Troubleshooting Runbook, the work can be passed to a more junior member, freeing up capacity for the more senior member. This benefits everyone in the team.
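Steps like the endpoint check in the longer runbook are good candidates for a small script that a junior member can run. Here is a minimal sketch using only the Python standard library, assuming the endpoints are plain HTTP(S) URLs; the function name `is_reachable` and the 5 second timeout are my own assumptions, and the isAlive URL is the placeholder from the runbook.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx or 3xx status.

    Any connection failure, timeout, HTTP error status or malformed URL
    is reported as unreachable rather than raised.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A runbook step could then read 'Run is_reachable("https://example.com/mycomponent/isAlive") and escalate if it returns False', with one line per endpoint.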

How can I get started with my team?

To get started, adopt your team's current method of documentation. Try not to introduce a new tool that your team is not used to. For example, my team uses Confluence to document everything. When we introduced our Runbook system, we created each Runbook in a Confluence page under the same heading 'Team Runbooks'. Now, our Confluence space looks something like this:

  • Team Runbooks
    • Bug Runbook: JIRA-1234 MyComponent crashed under load
    • Bug Runbook: JIRA-9999 MyOtherComponent failed to upgrade
    • Troubleshooting Runbook: MyComponent
    • Process Runbook: How to create a story
    • etc.

It is important that you use the easiest method for your team. Software engineers hate documentation because most practices do not encourage quick and easy changes. For your team to be enthusiastic, pick the method of documentation that is simplest for everyone. Then, once your team gets started and sees that it saves them time and effort, they will be eager to continue. Eventually, the phrase 'Can you create a runbook for that?' will be a regular ask in your team.

