Do not get overwhelmed
Mar 30th, 2020
Much has been said about why the novel coronavirus, SARS-CoV-2, is so dangerous. It is now believed to be 10x more deadly than influenza, in terms of both hospitalization rate and mortality rate. But that is only part of the story. COVID-19 can quickly overwhelm the healthcare system, causing severe second-order effects. Tech debt can do the same harm to engineering teams. I have long been pondering ways to avoid getting overwhelmed by tech debt. I do not have a playbook yet, but I have learned some promising ideas to share.
It is overwhelmingly bad to get the healthcare system overwhelmed. Should it happen, healthcare providers would have to ration care; people with other conditions would be denied care; healthcare workers would fall ill and reduce capacity even further. The fatality rate for COVID-19 is multiple times higher in Wuhan than in the rest of China, at least partially because the hospitals in Wuhan were flooded with thousands of sick people. Even that number understates the problem, probably by a large margin, because it does not include people who could not get care for conditions other than COVID-19. It is overwhelmingly sad to see the same script seemingly playing out in Italy, Spain and New York.
Uncontrolled tech debt can overwhelm engineering teams in a similar way¹. Teams would often have to make difficult tradeoffs of delivering new features versus paying down tech debt. Facing scheduling pressure, engineers would have to implement workarounds, thus creating more tech debt. In the long run no one would be able to make sense of a system full of hacks, and productivity would plummet even more.
Measuring the debt
Having learned my own lesson, I would start by measuring the cost associated with the tech debt, and use that as an indicator to decide whether to pay it down and where to focus the effort. The metric tends to be analog rather than digital, and I would remind myself to fight my internal perfectionism. Good is definitely better than perfect here.
One way I have experimented with is tied to the run rotation. The run rotation, employed by many Stripe teams, is a parallel construct to the on-call rotation. Whereas on-call engineers get paged by service interruptions, the run rotation is dedicated to handling human interruptions. The on-run engineer dedicates all their time to responding to user asks². They act as customer service representatives between teams, by either addressing problems or triaging and delegating them. The run tasks are overwhelmingly manual or semi-manual. Therefore, the run load is a good indicator of the ongoing cost we are paying for the existing tech debt. We can also categorize run tasks to find hot spots.
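As a rough illustration of categorizing run tasks to find hot spots, here is a minimal sketch. It assumes each run ask is logged with a category and the minutes spent on it; the log format, the category names, the numbers, and the `hot_spots` helper are all invented for this example and are not taken from any real tooling.

```python
from collections import Counter

# Hypothetical run-ask log: (category, minutes spent).
# Categories and numbers are made up for illustration.
run_asks = [
    ("user provisioning", 120),
    ("user provisioning", 90),
    ("access debugging", 30),
    ("documentation question", 15),
    ("user provisioning", 150),
    ("access debugging", 45),
]


def hot_spots(asks):
    """Return (category, total minutes) pairs, largest total first."""
    minutes = Counter()
    for category, spent in asks:
        minutes[category] += spent
    return minutes.most_common()


ranking = hot_spots(run_asks)
for category, total in ranking:
    print(f"{category}: {total} min")
```

Ranking by time spent rather than by number of asks is a deliberate choice in this sketch: a handful of hour-long asks can cost more than many quick ones.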
Another approach we are trying across the entire organization is to capture the percentage of time individuals and teams spend on Keeping-The-Lights-On (KTLO) work. We started simple by asking each individual to self-report numbers on a weekly basis and then rolling them up by team and group. Some tactics, like limiting the granularity to days instead of hours, were adopted to avoid over-engineering. The idea is to identify teams with hot spots, to reduce time spent on KTLO work, and to eventually increase velocity. It is still too early to tell how effective this is, but I am optimistic.
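The weekly roll-up can be sketched in a few lines. This is only an illustration under stated assumptions: a five-day work week, self-reports recorded as whole KTLO days per engineer, and made-up group, team, and engineer names. The `ktlo_share` helper is hypothetical.

```python
from collections import defaultdict

WORKDAYS_PER_WEEK = 5  # assumption: every engineer reports on a 5-day week

# Hypothetical weekly self-reports: (group, team, engineer, ktlo_days).
reports = [
    ("infra", "storage", "alice", 2),
    ("infra", "storage", "bob", 3),
    ("infra", "compute", "carol", 1),
    ("product", "checkout", "dave", 0),
    ("product", "checkout", "erin", 1),
]


def ktlo_share(reports, key):
    """Fraction of reported working days spent on KTLO, grouped by key(report)."""
    ktlo = defaultdict(int)
    total = defaultdict(int)
    for report in reports:
        bucket = key(report)
        ktlo[bucket] += report[3]
        total[bucket] += WORKDAYS_PER_WEEK
    return {bucket: ktlo[bucket] / total[bucket] for bucket in total}


by_team = ktlo_share(reports, key=lambda r: (r[0], r[1]))
by_group = ktlo_share(reports, key=lambda r: r[0])

for team, share in sorted(by_team.items(), key=lambda item: -item[1]):
    print(f"{team}: {share:.0%} KTLO")
```

Counting whole days keeps the self-report cheap to fill in, which matches the goal of not over-engineering the metric.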
Amortizing the debt
With a usable metric in place, I would next worry about paying down the tech debt. I definitely do not want to delay paying it forever until it overwhelms the team, but neither am I passionate about halting all projects and going all in on fixing the tech debt. There has to be some balance, and so far I have learned two effective ways to amortize the cost of paying down the tech debt.
Will Larson has written about migrations as the sole scalable fix to tech debt, and he has given a talk on that topic. I found it to be a good framework for thinking about tech debt hot spots. As an example, my team supports an antique yet essential piece of infrastructure where user credentials are still managed manually. It takes on the order of hours to onboard and to offboard users. It has become crystal clear from the number of run asks we receive that user management is a huge productivity killer, but it is also evident that no one can fix it as a side hustle. The one viable way forward is to create a migration project that automates user provisioning, and that is what we did during our yearly planning.
The migration solution is a heavy hammer though. I have also noticed cases that do not warrant multi-week-long projects. Instead, someone can focus on them for two days and make a big difference. This might look like increasing parallelism to speed up the tooling by 10x, or reorganizing our team’s documentation to make it easier to discover and to follow. I used to hope³ that on-run engineers would achieve quick wins like this since they are not working on their primary projects, but that did not work because we were also asking them to buy us time for large migrations with those hour-long user provisioning runs. Tamar Ben-Shachar, one of my peers, pioneered the Tech Debt Squashathon week in the group. During that week, every engineer puts aside their primary project, and instead focuses on squashing some tech debt and making an immediate impact. It is a super productive week. The demos are fun to watch, and team morale gets a boost while tech debt is being paid down. I think that is a brilliant strategy.
Be eager yet patient
I feel bad that I can offer only limited help to prevent healthcare workers from being overwhelmed, other than donating a small amount of money and supplies. I am eager to see the pandemic contained, yet I am patient enough to socially distance myself and my family for the foreseeable future. I think a similar strategy would serve me well in preventing my team from being overwhelmed by tech debt. I am eager to squash all of it, yet I will be patient in measuring the benefit versus the cost and in amortizing the debt in a calculated way.