Here’s Why Optimizing for MTTR over MTBF Is Better for Companies


In the Software as a Service (SaaS) business, one of the ongoing debates is release frequency versus reliability and availability. That is, are you Team MTTR (mean time to restore) or Team MTBF (mean time between failures)? In this blog post I argue for MTTR: push more often, welcome the uncertainty this can introduce, and invest in the training and resources needed to cope with the resulting outages. It comes down to the principle of constantly shipping minimum viable products (MVPs), testing in production, and accepting failure. And here's why it works.

More Testing Leads to Improved Quality

For engineering teams, the whole idea of optimizing for MTTR can seem counterintuitive, because it can be overwhelming when things go wrong. But that attitude is precisely why teams that take an MTBF approach struggle to handle issues when they do arise. It's difficult to refine the response process when failures happen infrequently.

With a continuous stream of releases, there is more testing and more code review, because failure is both anticipated and encouraged. With that in mind, the team is ready to refine and iterate on the code for failure recovery, which eventually gives them greater familiarity with the product and improves reliability.

Changes often take longer to ship for teams that minimize how often code is deployed. The 80/20 rule applied to engineering says that the last 20 percent of the job takes 80 percent of the time. With SaaS, that last 20 percent is usually getting your feature deployed through development and staging. If you only deploy infrequently, say, weekly or monthly, the "quantization" size for each feature cannot be smaller than a week.
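To make the "quantization" point concrete, here is a minimal sketch of how long a finished change sits waiting for the next release. It assumes a hypothetical fixed schedule in which releases go out on days 0, N, 2N, and so on; the function name and the schedule are illustrative assumptions, not a description of any real release tool.

```python
def release_wait_days(finished_day: float, release_interval_days: float) -> float:
    """Days a finished change waits until the next scheduled release.

    Assumes (hypothetically) that releases go out on days 0, N, 2N, ...
    A change that finishes exactly on a release day ships with no wait.
    """
    return (-finished_day) % release_interval_days


# A change finished on day 1 of a weekly cycle waits for the day-7 release.
weekly_wait = release_wait_days(1, 7)

# With daily releases, a change finished mid-day on day 1 ships on day 2.
daily_wait = release_wait_days(1.5, 1)

print(f"weekly cadence: {weekly_wait} days, daily cadence: {daily_wait} days")
```

Under these assumptions, the longest possible wait equals the release interval itself, which is why a weekly cadence cannot deliver a feature in less than "week-sized" steps.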

Although this conservative strategy leads to a more stable site on paper, in practice it results in a stale product. The product that wins in the marketplace is not always the best one. It's the one that responds most readily to customer needs.

With an MTTR strategy, we do not spend all of our resources on building the most available product possible. Instead, we put our effort into building the minimum product and tightening the feedback loop as much as possible. If the product does not behave correctly, and that will always happen, we can quickly stand up a different, and even better, service that reflects the evolving needs of the customer.


Embracing the Outcomes Contributes to Stability

Unlike conventional enterprise software, SaaS can be released very frequently, as often as several times a day. This lets SaaS companies adapt rapidly to evolving customer demands while placing no burden on their users (i.e., users do not need to keep updating their software).

That said, during the holiday season, there are usually no "releases" or changes to a website. During this period,

a) you have lots of customers and it is your most lucrative time of the year, and

b) lots of key employees are taking time off.

Therefore, many companies do not want to ship releases that could endanger revenue and pull people away from their holiday plans.

That is why e-commerce websites have code freezes, and why many brands stop launching anything during this time. All of these practices show that when you want to maximize MTBF, you minimize change.

For a SaaS vendor that optimizes for MTTR, by contrast, the holiday season (like every other holiday) can be the most stable time of year, thanks to the high pace at which the team deploys updates and the familiarity every developer has with the code base.

However, each new update brings the possibility of introducing new bugs and outages. Continuous deployment, blue-green deployments, and canarying are all examples of strategies used to decrease that risk. The principle is that by making releases more frequent, you reduce the amount of change between any two releases. As a consequence, there is less risk of unexpected interactions and a better chance of quickly determining which release triggered a problem, and thus which change.
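The last point can be illustrated with a small sketch. It assumes a hypothetical release log in which each entry records when it was deployed and how many changes it bundled; the `Release` class, the hour-based timestamps, and the example numbers are all made up for illustration. Given the time an incident started, it picks the most recent release before it and shows how many changes you would have to inspect.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    deployed_at: float  # hours since an arbitrary reference point (hypothetical unit)
    changes: int        # number of changes bundled into this release


def suspect_release(releases: List[Release], incident_at: float) -> Optional[Release]:
    """Return the most recent release deployed at or before the incident.

    `releases` must be sorted by `deployed_at` in ascending order.
    """
    times = [r.deployed_at for r in releases]
    idx = bisect_right(times, incident_at)
    return releases[idx - 1] if idx else None


# Hypothetical month of history: two big monthly releases vs. small daily ones.
monthly = [Release(0, 120), Release(720, 115)]
daily = [Release(24 * day, 4) for day in range(31)]

incident_at = 730  # the incident starts 10 hours after the latest deploy

for name, log in (("monthly", monthly), ("daily", daily)):
    culprit = suspect_release(log, incident_at)
    print(f"{name}: inspect {culprit.changes} changes deployed at hour {culprit.deployed_at}")
```

In this made-up example, both teams trace the incident to the release at hour 720, but the daily-release team has 4 changes to look through instead of 115. The most recent release is only the first suspect, of course; a bug can lie dormant for several releases.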

Reinforcing the Teams

Finally, optimizing for MTTR helps create a more resilient on-call team that does not flinch when there is an outage. When the team is practiced at handling failures on a daily basis, there is no panic when they get paged, and the process can be streamlined, which means quicker repairs. Another reason to welcome uncertainty is that new versions of the code are likely to fail in unpredictable and interesting ways.

These unpredictable issues can help to train new members of the on-call team. With drills and training alone, it is difficult to instill confidence in engineers. Sometimes you just need real incidents. One solution is to inject problems artificially. This helps you find things like single points of failure, but it only works well with systems that do not undergo a high rate of change to begin with.

Another solution is to deploy new software more regularly, which may introduce real issues to fix.


Here's How to Optimize for MTTR

To recap, when you optimize for MTBF, you release less frequently, resulting in a "stale" product that cannot adapt to changing customer demands. You will also have an on-call team that is not routinely paged, so each new alert is a high-stress scenario.

By optimizing for MTTR, you are building a team that knows how to respond to failures and repair them quickly. You can also adopt a high release and deployment cadence that lets you adapt quickly to the needs of customers and ship the features they want.

So, how do you do this?

  • Adopt tools and technologies such as Kubernetes, which allow you to automate releases and deployments and run them frequently.
  • Ensure that your application is well-instrumented and that you have a strong observability strategy, with tools such as Grafana, Prometheus, and Loki. With the right monitoring tools, your team will have the confidence to debug problems in production.
  • Track release cadence and on-call load (incidents per shift), and balance the two. Too many incidents? Push back on features and concentrate on tech debt. Too few? Push the team to take more risks.
  • Promote iteration. What would it take to release more often? Which pages (alerts) fire most often?
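
As a rough illustration of the tracking idea in the list above, the sketch below computes release cadence, incidents per shift, and MTTR for a review window, then suggests which way to lean. Everything here is an assumption made for the example: the `Incident` record, the hour-based timestamps, and especially the `low` and `high` thresholds, which each team would need to choose for itself.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Union


@dataclass
class Incident:
    detected_at: float  # hours since the start of the review window
    restored_at: float  # hours since the start of the review window


def mttr_hours(incidents: List[Incident]) -> float:
    """Mean time to restore, in hours (0.0 if there were no incidents)."""
    if not incidents:
        return 0.0
    return mean(i.restored_at - i.detected_at for i in incidents)


def review_window(
    release_count: int,
    incidents: List[Incident],
    shifts: int,
    weeks: float,
    low: float = 0.5,   # hypothetical threshold: below this, the team is under-challenged
    high: float = 2.0,  # hypothetical threshold: above this, the team is overloaded
) -> Dict[str, Union[float, str]]:
    """Summarize cadence and on-call load, and suggest which way to lean."""
    load = len(incidents) / shifts
    if load > high:
        advice = "push back on features and pay down tech debt"
    elif load < low:
        advice = "take more risks and release more often"
    else:
        advice = "hold steady"
    return {
        "releases_per_week": release_count / weeks,
        "incidents_per_shift": load,
        "mttr_hours": mttr_hours(incidents),
        "advice": advice,
    }


# Hypothetical two-week window: 20 releases, 14 on-call shifts, 2 incidents.
report = review_window(
    release_count=20,
    incidents=[Incident(1.0, 1.5), Incident(10.0, 11.0)],
    shifts=14,
    weeks=2,
)
print(report)
```

In a real setup these numbers would more likely come from your deployment pipeline and paging system than from hand-written records; the point is only that both sides of the balance are cheap to measure and compare.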

Summing Up

Some people use the "worse is better" argument as a critique. In the end, it's just about being agile and getting something out there so that people can try it, even if it is not yet best in class.

A lot of people with no customers put their energy into building highly available services. When you start from zero and build a service quickly instead, you can move fast, even if that means it is unstable at first. When outages occur or vulnerabilities emerge, your team will adapt and improve the system, learning from the issues that arise and pivoting when required. Or they will just throw everything away and quickly stand up a new product. Anything is possible.

As they say, Rome wasn't built in a day.