Why SREs Need SLOs
In any organization, engineers are valuable assets and a scarce resource. Good-quality engineering time should be invested in the most important characteristics of the most important services. Investing in new features or functionality wins new customers or users (or retains existing ones), while investing in reliability and scalability keeps existing customers happy. SLOs help you strike the right balance, because they are key to making data-informed decisions about the opportunity cost of reliability work and to determining how to prioritize that work appropriately. In fact, SREs' core responsibilities aren't merely to automate "all the things" and hold the pager; their day-to-day tasks and projects are driven by SLOs:
- Ensure SLOs are defended in the short term and can be maintained in the medium to long term.
How To Get Started
Assume your current service(s) fall into one of the following three categories:
- A greenfield development, with nothing currently deployed
- A system in production with some monitoring to notify you when things go awry, but no formal objectives, no concept of an error budget, and an unspoken goal of 100% uptime
- A running deployment with an SLO below 100%, but without a common understanding about its importance or how to leverage it to make continuous improvement choices; in other words, an SLO without teeth
To adopt an error budget-based approach to SRE, you need to reach a state in which the following hold true:
- There are SLOs that all stakeholders in the organization have approved as being fit for the product
- The people responsible for ensuring that the service meets its SLO have agreed that it is possible to meet this SLO under normal circumstances
- The organization has committed to using the error budget for decision making and prioritizing. This commitment is formalized in an error budget policy.
- There is a process in place for refining the SLO.
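The idea of using the error budget for decision making can be made concrete with a little arithmetic. The following is a minimal sketch, not a definitive implementation: the `ErrorBudget` class, the `release_decision` function, and the request counts are hypothetical names and numbers chosen for illustration, and the "ship or stop" rule is only one simple form an error budget policy might take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks budget spend for one SLO window."""

    slo_target: float   # e.g. 0.999 for a 99.9% SLO (assumed below 100%)
    total_events: int   # events (e.g. requests) seen in the SLO window
    bad_events: int     # events that failed the reliability criterion

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_events <= 0:
            raise ValueError("total_events must be positive")

    @property
    def allowed_bad_events(self) -> float:
        # The error budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = fully spent, negative = overspent.
        return 1.0 - self.bad_events / self.allowed_bad_events


def release_decision(budget: ErrorBudget) -> str:
    # Assumed policy: keep shipping while budget remains, otherwise
    # redirect effort to reliability work.
    if budget.remaining_fraction > 0:
        return "ship features"
    return "prioritize reliability work"


healthy = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"{healthy.remaining_fraction:.0%} of budget left: {release_decision(healthy)}")
```

A real policy would usually be agreed on by all stakeholders and could include more graduated responses than this two-way choice.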
What an SLO Should Be
An SLO sets a target level of reliability for the service's customers. The graph below shows the relationship between the SLO threshold and customer happiness:
Ultimately, user happiness is what matters.
SLO of 100% Is The Wrong Target:
- 100% reliability is not a reasonable goal. Even with the most expensive and redundant hardware, automated health checks, and the fastest failover, there is a nonzero probability that one or more components will fail simultaneously, resulting in less than 100% reliability.
- Customers won't experience 100% reliability. The chain of systems between you and your customers is long and complex, and any of its components can fail.
- Even if you do manage to create a 100% reliable experience, you can never update or improve your services, because any change comes with risk. Eventually your service will stagnate and users will go elsewhere.
- An SLO of 100% means you only have time to be reactive, not proactive.
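To see why a 100% target leaves no room for change, it can help to convert a few SLO targets into allowed downtime. This is a small sketch under stated assumptions: the function name `allowed_downtime_minutes`, the 30-day window, and the availability-style (time-based) SLO are illustrative choices, not something prescribed here.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO tolerates per window.

    Assumes a time-based SLO measured over a fixed window (30 days by default).
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"SLO {target:.2%}: {minutes:.2f} minutes of downtime allowed")
```

Over 30 days, a 99.9% target allows roughly 43 minutes of downtime, while a 100% target allows none, so every risky change, however small, would put the SLO at risk.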