TΩИΨ: DevOps vs SRE (Site Reliability Engineer)

Background

DevOps:

DevOps is a loose set of practices, guidelines, and culture designed to break down silos in IT development, operations, networking, and security. The key points of DevOps can be summarized by a useful acronym: CALMS (Culture, Automation, Lean, Measurement, and Sharing).

SRE:

SRE is more like a job role: a set of practices found to work, and some beliefs that animate those practices. SRE is an implementation of some of the DevOps philosophy. In a way, class SRE implements interface DevOps.
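The "class SRE implements interface DevOps" analogy can be written out literally. The sketch below is illustrative only: the method names and the pairing of each DevOps idea with an SRE practice are assumptions based on the key ideas listed in this post, not an official mapping.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': principles stated without saying how to carry them out."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def accept_failure(self) -> str: ...

    @abstractmethod
    def change_gradually(self) -> str: ...

    @abstractmethod
    def leverage_tooling(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One possible 'implementation': each principle gets a concrete practice."""

    def reduce_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure(self) -> str:
        return "hold blameless postmortems"

    def change_gradually(self) -> str:
        return "move fast by reducing the cost of failure"

    def leverage_tooling(self) -> str:
        return "minimize toil and use the same tooling"

    def measure_everything(self) -> str:
        return "manage by SLOs"


if __name__ == "__main__":
    sre = SRE()
    print(isinstance(sre, DevOps))
    print(sre.measure_everything())
```

The abstract class cannot be instantiated on its own, which mirrors the point of the analogy: DevOps states what should be true, and a concrete practice such as SRE has to supply the how.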

Key Ideas

DevOps

No More Silos:

Operations and development should no longer be separate teams.
Extreme siloization of knowledge and incentives, together with a lack of collaboration, is bad for business.

Accidents Are Normal:

Accidents are not just the result of an individual's isolated actions; they result from missing safeguards for when things inevitably go wrong. It is more profitable to focus on speeding up recovery than on preventing accidents, and to build safeguards and guardrails into the system to minimize individual incidents.

Change Should Be Gradual:

Change is best when it is small and frequent. Change is risky, and the correct response is to split your changes into smaller subcomponents where possible, then build a steady pipeline of low-risk changes out of the regular output of your system. This strategy often leads to CI/CD.
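A minimal sketch of why small changes are lower risk, assuming each change can be applied and checked on its own. The function name, the dictionary-based "state", and the health check are all hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Change = Tuple[str, Callable[[State], State]]  # (name, function: old state -> new state)


def apply_incrementally(
    state: State,
    changes: List[Change],
    is_healthy: Callable[[State], bool],
) -> Tuple[State, List[str], Optional[str]]:
    """Apply changes one at a time, stopping at the first unhealthy result.

    Returns the last healthy state, the names of the changes that were
    applied, and the name of the change that was rejected (or None).
    """
    applied: List[str] = []
    for name, change in changes:
        candidate = change(dict(state))  # work on a copy so a bad change never touches the live state
        if not is_healthy(candidate):
            return state, applied, name  # only this one small change needs investigating
        state = candidate
        applied.append(name)
    return state, applied, None


if __name__ == "__main__":
    # Hypothetical example: three small config changes, the second of which is bad.
    changes: List[Change] = [
        ("raise replicas", lambda s: {**s, "replicas": 3}),
        ("zero timeout", lambda s: {**s, "timeout_s": 0}),
        ("bigger cache", lambda s: {**s, "cache_mb": 512}),
    ]
    final, applied, rejected = apply_incrementally(
        {"replicas": 1, "timeout_s": 30, "cache_mb": 128},
        changes,
        is_healthy=lambda s: s["timeout_s"] > 0,
    )
    print(final, applied, rejected)
```

When the bad change is rejected, the earlier change stays in place and the search for the cause is limited to one small step, which is the property a CI/CD pipeline tries to preserve at scale.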

Tooling and Culture Are Interrelated:

Tooling is an important component, but organizational culture is the key to success. A good culture can work around broken tooling, but the opposite rarely holds.

Measurement Is Crucial:

Establish the reality of what is happening through objective measurement, verify that you are changing the situation as you expect, and create an objective foundation for conversations that different functions can agree on.

SRE

Operations Is a Software Problem:

SRE should use software engineering approaches to solve operational problems.

Manage by Service Level Objectives (SLOs):

Instead of promising 100% availability for everything, an SRE team should set an availability target appropriate for the service and its user base, and the service is then managed to that SLO.
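As a rough illustration of what managing to an SLO means in numbers, an availability target implies an allowance of downtime, often called an error budget. The sketch below assumes a time-based availability SLO over a fixed window; the function names and the 30-day default are assumptions for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed (in minutes) by an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining_minutes(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already observed; negative means the SLO is missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f}")    # 43.2 minutes per 30 days
    print(f"{error_budget_minutes(0.9999):.2f}")   # 4.32 minutes per 30 days
    print(f"{budget_remaining_minutes(0.999, 30.0):.1f}")  # 13.2 minutes left
```

A 100% target leaves a budget of zero, which is one way to see why it is not a useful promise: every change would have to be risk-free.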

Work to Minimize Toil:

Any manual, structurally mandated operational task is abhorrent. If an automated task can perform a desired operation, then it often should. Any time spent on operational tasks is time not spent on project work, and project work is how we make services more reliable and scalable.

Move Fast by Reducing the Cost of Failure:

One of the main benefits of SRE engagement is not necessarily increased reliability but improved product development output. A reduced MTTR (Mean Time To Repair) for faults increases product developer velocity, because engineers do not have to waste time fixing issues.
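One way to put numbers on the effect of MTTR is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula and the figures below are not from the original post; they are a sketch that assumes failures and repairs alternate with stable averages.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same failure rate (one fault every 500 hours), two different repair times.
    print(f"{availability(500, 2):.4%}")  # 99.6016%
    print(f"{availability(500, 1):.4%}")  # 99.8004%
```

With the failure rate held fixed, halving the repair time roughly halves the unavailable fraction, so faster recovery pays off even when faults are no less frequent.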

Share Ownership with Developers:

There should be no rigid boundaries between "dev" and "ops". SRE teams share a skill set with development teams, and have particular expertise in availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Both "dev" and "ops" teams should have a holistic view of the stack.

Use the Same Tooling:

Teams minding a service should use the same tools, regardless of function or job title. There is no good way to manage a service that has one tool for "dev" and another tool for "ops". The more divergence you have, the less your organization benefits from each effort to improve a tool.

Conclusion:

Change is necessary in order to improve.
Collaboration is front and center for the work.
Change management is best pursued as small, continual actions, the majority of which are ideally both automatically tested and applied.
The right tooling is critically important, but API orientation for system management is the more important philosophy.
Measurement is absolutely key.
Bad things happen occasionally; accept that reality and hold blameless postmortems to offset unhelpful, adrenaline-laden reactions.

In general, DevOps is more like a philosophy and culture, and is relatively silent on how to run operations at a detailed level. It is not prescriptive about the precise management of services. It chooses instead to concentrate on breaking down barriers in the wider organization, and this has much value. SRE has relatively narrowly defined responsibilities, and its remit is generally service-oriented and end-user-oriented rather than business-oriented. It brings an opinionated intellectual framework to the problem of how to run systems effectively. SRE believes in the same things as DevOps, but for slightly different reasons.

Sunday, January 20, 2019
