Monday, January 21, 2019

SLO - Service Level Objectives

SLOs specify a target level for the reliability of your service. SLOs are key to making data-driven decisions about reliability, and they are at the core of SRE practices. SLOs are also a tool to help determine what engineering work to prioritize. For example, consider the engineering tradeoffs between two tasks: upgrading a critical piece of scheduler software to the next version, or patching the current version. By calculating each task's estimated impact on the error budget, we can determine which one is more beneficial.
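As a rough illustration of that comparison, the sketch below derives an error budget from an SLO target and estimates what fraction of it each task would use up. The 99.9% target, the request volume, and the per-task failure estimates are made-up numbers for illustration only, and the function names are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window while still meeting the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(expected_failed_requests: float, budget: float) -> float:
    """Fraction of the error budget a task is expected to consume."""
    return expected_failed_requests / budget


# Assumption: a 99.9% SLO over 10 million requests in the measurement window.
budget = error_budget(0.999, 10_000_000)

# Assumption: rough estimates of failed requests each task would cause.
tasks = {
    "upgrade to next version": 4_000,
    "patch current version": 1_500,
}

for name, failures in tasks.items():
    share = budget_consumed(failures, budget)
    print(f"{name}: {share:.0%} of the error budget")
```

With estimates like these in hand, the two tasks can be compared on the same scale instead of by gut feeling.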

Why SREs Need SLOs

In any organization, engineers are valuable assets and a scarce resource. Good engineering time should be invested in the most important characteristics of the most important services. Investing in new features or functionality wins new customers or users (or retains existing ones), while investing in reliability and scalability keeps existing customers happy. SLOs help you strike the right balance: they are key to making data-informed decisions about the opportunity cost of reliability work, and to determining how to prioritize that work appropriately.

In fact, SREs' core responsibilities aren't merely to automate "all the things" and hold the pager; their day-to-day tasks and projects are driven by SLOs:

  • Ensure SLOs are defended in the short term and can be maintained in the medium to long term.

How To Get Started

Assume your current service(s) fall into one of the following three categories:

  • A greenfield development, with nothing currently deployed
  • A system in production with some monitoring to notify you when things go awry, but no formal objectives, no concept of an error budget, and an unspoken goal of 100% uptime
  • A running deployment with an SLO below 100%, but without a common understanding about its importance or how to leverage it to make continuous improvement choices - An SLO without teeth

In order to adopt an error budget-based approach to SRE, you need to reach a state in which the following hold true:

  • There are SLOs that all stakeholders in the organization have approved as being fit for the product
  • The people responsible for ensuring that the service meets its SLO have agreed that it is possible to meet this SLO under normal circumstances
  • The organization has committed to using the error budget for decision making and prioritizing. This commitment is formalized in an error budget policy.
  • There is a process in place for refining the SLO.
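To make the third condition more concrete, here is a minimal sketch of what an error budget policy check could look like. The rule used here (freeze feature releases once the budget is spent) is only one possible policy and is an assumption, as are the function names and the sample numbers.

```python
def remaining_budget_fraction(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def release_decision(slo_target: float, good: int, total: int) -> str:
    """Hypothetical policy: stop shipping features once the budget is gone."""
    if remaining_budget_fraction(slo_target, good, total) <= 0:
        return "freeze feature releases and prioritize reliability work"
    return "continue feature releases"


# Assumption: a 99.9% SLO, with 500 bad events out of 1,000,000.
print(release_decision(0.999, good=999_500, total=1_000_000))
```

The value of writing the policy down, in code or in prose, is that the decision is agreed on before the budget runs out rather than argued about afterwards.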

What an SLO Should Be

An SLO sets a target level of reliability for the service's customers. The graph below shows the relationship between the SLO threshold and customers:

Ultimately, user happiness is what matters.

SLO of 100% Is The Wrong Target:

  • 100% reliability is not a reasonable goal. Even with the most expensive and redundant hardware, automated health checks, and the fastest failover, there is a nonzero probability that one or more components will fail simultaneously, resulting in less than 100% reliability.
  • Customers won't experience 100% reliability anyway. The chain of systems between you and your customers is long and complex, and any of its components can fail.
  • Even if you do manage to create a 100% reliable experience, you can never update or improve your service, because any change comes with risk. Eventually your service will stagnate and users will go elsewhere.
  • An SLO of 100% means you only have time to be reactive, not proactive.
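The arithmetic behind these points can be sketched in a few lines. The example below assumes a 30-day window and assumes that components in the request path fail independently, which is a simplification; the availability figures are illustrative, not measurements.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_target: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime the SLO tolerates within the window."""
    return (1.0 - slo_target) * window_minutes


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a chain of components that must all work.

    Assumes independent failures, so the individual values multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


for target in (0.99, 0.999, 0.9999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"SLO {target:.2%}: {minutes:.1f} minutes of downtime allowed")

# Assumption: five components between you and the user, each 99.9% available.
print(f"end-to-end: {serial_availability([0.999] * 5):.4%}")
```

A 100% target leaves zero minutes of budget, so there is no room for any change, and a chain of individually reliable components is still less reliable end to end than any single one of them.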
