Various chapters of this book deal with ITIL practices familiar to us all:
• Incident Management (Chapters 11, 12 and 14),
• Problem Management (Chapters 15 and 16),
• Availability and Continuity (Chapter 13),
• Validation and Testing (Chapter 17), and
• Capacity Management (Chapters 18-22).
These chapters are a deep dive into the practical means of delivering on these practices. Certainly worth the read. I suggest much of the content will find its way into the various Practice Guides of ITIL 4.
The First Rule of SRE
Google applies an SRE rule of no more than 50% operations (Service Support), leaving at least 50% for development (Service Delivery). Remember ITIL v2! The development half exists to improve the services from a customer reliability perspective (Service Warranty); the operations half keeps the SRE in touch with the real world of operations.
The SRE is an ideal fit in the DevOps space. Engage SREs early in new service development to plan and simplify the launch of that service. All services will undergo many releases as future enhancements pass through service value streams. The SRE can make this transition a whole lot smoother. In fact, there is a specialist role – Launch Coordination Engineer – within SREs, especially for this purpose.
Rule 2: Toil is bad
One of the major goals of SRE is “the elimination of toil.” The book defines toil as follows:
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Some great points here:
1. Tasks that can be automated
2. Devoid of enduring value (backups, housekeeping)
3. Scales linearly as a service grows.
From automation to autonomous
Removing the human element of supporting services improves the reliability of systems. Yet automation is more than automating current processes. First review activities against the Service Level Objectives (SLOs) and lean out the processes, then automate what remains.
Identify complexity, both Essential Complexity and Accidental Complexity, and look to remove the accidental component. Simplified services make automated recovery procedures easier to spot and design. A balance between Stability and Agility can only be achieved through Simplicity.
Throw in some AI and we start to have services with reliability at levels hard to imagine, let alone achieve. The book then answers the interesting question: how much reliability do we need? Look at the Service Level Objective on reliability. Is it 99.99% or 99.999%? Where is the weakest link in the reliability chain?
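To put numbers on those nines, here is a minimal Python sketch. It assumes a 365-day year and that the components of a service sit in series, so every one of them must be up for the service to be up. The function names and the three component figures are hypothetical, chosen only for illustration, and are not taken from the book.

```python
# Minutes in a 365-day year (assumption: leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted in the period by an availability SLO given in percent."""
    return period_minutes * (1 - slo_percent / 100)


def serial_availability(component_percents):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    result = 1.0
    for percent in component_percents:
        result *= percent / 100
    return result * 100


if __name__ == "__main__":
    for slo in (99.99, 99.999):
        print(slo, "->", round(allowed_downtime_minutes(slo), 2), "minutes per year")

    # Hypothetical chain: the weakest link (99.9%) caps the whole service.
    chain = [99.999, 99.99, 99.9]
    print("chain ->", round(serial_availability(chain), 3), "%")
```

Under these assumptions 99.99% allows roughly 52.6 minutes of downtime a year and 99.999% roughly 5.3 minutes. The chain comes out at about 99.89%, which is below its weakest component, so adding a fifth nine to one link does little while another link sits at three.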
Devoid of enduring value
All human activities should be focussed on adding value. This is perhaps more so in operations than in development. Where tasks are not adding value, either eliminate them (if they are accidental complexity) or simplify and automate them (if they are essential complexity).
Scales linearly as a service grows
This aligns with another principle in the SRE space: as the service grows, the effort to support it must grow more slowly than linearly.
Otherwise, you will never have the human resources to support the services. SREs are given the development element of their role so as to address this. Removing the toil will remove the demand to constantly recruit more support staff.
Some other interesting concepts in SRE
Error Budgets
Not actually a monetary budget. Error Budgets are a means of balancing release agility against reliability expectations.
Service Uptime – Service Level Objective = Error Budget
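As a rough illustration of the formula above, the following Python sketch computes the budget from a measured uptime and an SLO, both in percent. The function names and sample figures are hypothetical. The sketch also assumes the common reading that the total budget for a period is 100% minus the SLO, and that the formula above gives what is left of it once actual uptime is measured.

```python
def total_error_budget(slo_percent):
    """Unreliability the SLO tolerates over the period, in percentage points."""
    return 100.0 - slo_percent


def remaining_error_budget(measured_uptime_percent, slo_percent):
    """Service Uptime minus Service Level Objective, in percentage points."""
    return measured_uptime_percent - slo_percent


def releases_allowed(measured_uptime_percent, slo_percent):
    """Assumed policy: keep releasing only while some budget remains."""
    return remaining_error_budget(measured_uptime_percent, slo_percent) > 0


if __name__ == "__main__":
    slo = 99.9  # hypothetical SLO
    for uptime in (99.95, 99.85):  # hypothetical measured uptimes
        print(
            "uptime", uptime,
            "remaining", round(remaining_error_budget(uptime, slo), 3),
            "release?", releases_allowed(uptime, slo),
        )
```

With a 99.9% SLO, a measured uptime of 99.95% leaves about 0.05 points of budget to spend on releases, while 99.85% means the budget is overspent and the focus shifts from new releases to reliability.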