
SRE is not something that applies only to Google; it is a very useful concept for all companies and engineers who use the cloud to develop, release, and maintain their services. This presentation is designed to help not only those who currently run or lead DevOps and engineers who actually write code and develop applications, but also those who have no programming or development experience, understand the new concept of SRE in an accessible way. I hope that you will be able to follow along with me to the end.
I would like to explain why we need the concept of SRE. To understand SRE, it is important to understand why a new way of thinking was necessary. In this article, I would like to explain where, by whom, and why SRE was conceived, implemented, and proven to be a useful methodology.
Applications are designed, developed, tested, and released, and then made available to us in the form of services.
However, to ensure that the service is always available, the operations phase after release is just as important. Two main groups are involved: those in charge of "design, development, testing, and release" (the development group, called "developers") and those in charge of "operation, maintenance, and troubleshooting" (the operations group, called "operations"). This is the so-called DevOps relationship.
Simply put, the conflict of interest within DevOps is between the development side's desire to "develop and release services whenever necessary to win the competition" and the operations side's desire to "keep the release frequency low to maintain the availability and reliability of the service."
When Benjamin Treynor Sloss, who first launched SRE, was asked, "What is SRE?", he answered, "Fundamentally, it's what happens when you ask a software engineer to design an operations function..."
Google has a concept called 10x (ten-X). Rather than thinking about how to increase operating profit to 110%, we should think about what it would take to increase it tenfold. Once you set that goal, you are forced to think of a completely different approach, because you cannot achieve it with your current methods.
An example of availability that you may see often is a statement such as "This service guarantees 99.9% availability." If the service you are about to use carries such a statement, it means that the service is guaranteed to be unavailable for no more than 8 hours and 46 minutes per year, as shown in the table here.
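The downtime figure above can be checked with a short calculation. The following is a minimal sketch, not an official formula: the function names are hypothetical, and it assumes a year averaged to 365.25 days, which is the convention under which 99.9% works out to roughly 8 hours and 46 minutes.

```python
from datetime import timedelta

# Assumption: one year is averaged to 365.25 days (leap years included).
YEAR = timedelta(days=365.25)


def allowed_downtime(availability_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return the longest total outage that still meets the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return period * unavailable_fraction


def format_duration(duration: timedelta) -> str:
    """Render a duration as hours and minutes, rounded to the nearest minute."""
    total_minutes = round(duration.total_seconds() / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


if __name__ == "__main__":
    # Print the yearly downtime allowance for two to five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {format_duration(allowed_downtime(target))} per year")
```

Passing a different `period`, for example `timedelta(days=30)`, gives the allowance for a monthly measurement window instead of a yearly one.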
On reliability, Benjamin remarks, "reliability is not determined by monitoring, it is determined by the user." This challenges the conventional notion that reliability is measured by constantly monitoring servers to see whether they are alive or dead and constantly verifying that they are working properly.
An SLI (Service Level Indicator) is a predetermined indicator of whether a user's use of a service is completed to an extent that is acceptable to that user. More simply put, it captures the "good events" of the most important service. So what is a "good event"? Not every system uses a "good event" as its indicator; depending on the system, the indicator may be accuracy, freshness, or throughput, as in data processing.
A Critical User Journey (CUJ) is the specific series of steps that a particular user performs to achieve a goal using a service. It must be identified when determining SLIs (Service Level Indicators).
For example, suppose a client sends a request to the server 100 times and the response time is measured each time: 99 responses are returned in less than one second, but the remaining one takes two seconds. The average response time would still be approximately 1 second, which makes the 2-second outlier, a response that took more than twice as long as usual, hard to see.
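The effect described above can be reproduced with a few lines of code. This is only an illustrative sketch: the 0.95-second value for the 99 fast responses is an assumption (the text only says "less than one second"), and the helper names are hypothetical. It compares the average with a nearest-rank percentile and with the share of responses under a threshold.

```python
import math
import statistics

# Assumed sample: 99 responses at 0.95 s and one slow response at 2.0 s.
latencies = [0.95] * 99 + [2.0]


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def fraction_within(values, threshold_seconds):
    """Share of responses that completed within the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v <= threshold_seconds) / len(values)


if __name__ == "__main__":
    print("mean:", statistics.mean(latencies))        # close to 1 s, hides the outlier
    print("median:", percentile(latencies, 50))
    print("slowest:", percentile(latencies, 100))     # exposes the 2 s response
    print("share within 1 s:", fraction_within(latencies, 1.0))
```

Expressing latency as "the share of responses within a threshold" has the same good-events-over-all-events shape as other SLIs, which is one reason it is often preferred to an average.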
The idea of the error budget came about as a new rule designed to strike a good balance between the respective goals of the two sides of DevOps: Dev (Development) and Ops (Operations).
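As a rough illustration of how an error budget turns this balance into a number, the sketch below treats the budget as the share of events that is allowed to fail under a given SLO. The class name, the request counts, and the "pause releases when the budget is spent" rule are assumptions made for the example, not a policy stated in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float   # e.g. 0.999 for a 99.9% SLO
    total_events: int   # all requests in the measurement period
    bad_events: int     # requests that did not count as "good"

    @property
    def allowed_bad_events(self) -> float:
        """Number of failures the SLO tolerates for this many events."""
        return (1 - self.slo_target) * self.total_events

    @property
    def remaining(self) -> float:
        """Failures still available before the SLO is violated (negative if overspent)."""
        return self.allowed_bad_events - self.bad_events

    def can_release(self) -> bool:
        """Assumed policy: new releases proceed only while budget remains."""
        return self.remaining > 0


if __name__ == "__main__":
    # Hypothetical month: one million requests, 400 of them bad.
    budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=400)
    print("allowed bad events:", round(budget.allowed_bad_events))
    print("remaining budget:", round(budget.remaining))
    print("release allowed:", budget.can_release())
```

With a rule like this, Dev can release as often as it likes while budget remains, and Ops has an agreed, measurable reason to slow releases once it is exhausted.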
Development (Dev) is constantly adding new features and upgrading existing features to increase service usage and win the competitive differentiation race.
A postmortem begins by assuming a worst-case scenario in which the service cannot be provided at all. Suppose a new feature is released to the production environment and, at the moment of release, all users become unable to access the site or see only an error screen. The operations side will naturally focus all its efforts on restoring the site to its original state immediately. Suppose that, fortunately, the service is restored a few hours after the problem occurred. A postmortem is a tool for preventing such a situation from happening again, carried out after all services have been restored.
When I hear the word "toil," it reminds me of boring work or work I don't want to do. Some of the tasks you perform on a daily basis may be considered "toil." However, the word "toil" alone may be taken to mean simply "a hassle" or "something I don't want to do."
Let us now introduce SRE in concrete terms, using the "Glossary of Terms to Understand SRE" explained in the previous chapter, with an actual service as an example.
In this section, we will consider what SLI is appropriate based on YouTube's business situation as defined earlier. For definitions 1 through 3, it is necessary to consider priorities for each of the following items when setting SLI.
As we discussed as an example in the glossary, if we consider YouTube as a service, what would be the most important CUJ of the service?
So what would be a good SLI item to choose for that CUJ? Some users may think that it is important for the video to play immediately after pressing the Play button. But what if the video thumbnail itself is not displayed, and of course the Play button that is supposed to be displayed there is not displayed, or worse, the YouTube site is not displayed when accessing the site?
Next, define how to evaluate the SLI. Here we measure the availability of youtube.com: the SLI is the percentage of all responses that are good, where a good response is one that returns a 2xx, 3xx, or 4xx status code (excluding 429 Too Many Requests).
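The definition above translates almost directly into code. The following is a minimal sketch under stated assumptions: the function names and the sample status codes are hypothetical, and a real system would read the codes from load balancer or client-side logs rather than from a list.

```python
def is_good_response(status: int) -> bool:
    """A response is good if it is 2xx, 3xx, or 4xx, except 429 (Too Many Requests)."""
    return 200 <= status < 500 and status != 429


def availability_sli(statuses) -> float:
    """Return good responses divided by all responses."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("cannot compute an SLI without any responses")
    good = sum(1 for status in statuses if is_good_response(status))
    return good / len(statuses)


if __name__ == "__main__":
    # Hypothetical sample of 1,000 responses.
    sample = [200] * 995 + [404] * 2 + [429] + [500] + [503]
    print(f"availability SLI: {availability_sli(sample):.1%}")
```

In this sample the two 404 responses count as good because the server answered correctly, while the 429 and the two 5xx responses count against availability.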
When setting SLOs, the goal and the measurement period both need to be included. For example, "99.9% of youtube.com responses over the past 30 days must be good responses."
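An SLO of this form, a target plus a measurement period, can be checked as sketched below. This is an illustration only: the function name, the event format of `(timestamp, is_good)` pairs, and the sample data are assumptions made for the example.

```python
from datetime import datetime, timedelta


def evaluate_slo(events, target, window, now):
    """Return (met, sli) for events that fall inside the window ending at `now`.

    `events` is an iterable of (timestamp, is_good) pairs.
    """
    start = now - window
    in_window = [is_good for timestamp, is_good in events if start < timestamp <= now]
    if not in_window:
        raise ValueError("no events inside the measurement window")
    sli = sum(in_window) / len(in_window)
    return sli >= target, sli


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    recent = now - timedelta(days=1)
    old = now - timedelta(days=40)   # outside the 30-day window, so ignored

    # Hypothetical data: 9,995 good and 5 bad recent responses, plus old failures.
    events = [(recent, True)] * 9995 + [(recent, False)] * 5 + [(old, False)] * 100
    met, sli = evaluate_slo(events, target=0.999, window=timedelta(days=30), now=now)
    print(f"SLI over the past 30 days: {sli:.2%}, SLO met: {met}")
```

Because the window slides with `now`, old failures eventually drop out of the calculation, which is why the measurement period has to be stated together with the target.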
As mentioned earlier, usually, server health checks, etc., are set up on the server side for the purpose of autoscaling, and the logs are monitored to detect error codes, and alerts are sent to the operations manager for notification and response. However, with this method, errors happening on the server side and errors or delays occurring on the user side may be different. For this reason, Google's best practices recommend user-side monitoring whenever possible.
It is possible to achieve three nines with well-designed software. To achieve four nines, you also need a well-designed operation, well-designed failure response procedures, and a well-designed organization to execute them. To achieve five nines, the business itself must be well designed; without that, five nines is impossible. On the other hand, it is fair to say that he considers the part of the business that he defines as the SLA to be very important.
I would like to summarize each service level here. I encourage everyone to consider each item as a team with a specific service in mind.
I have talked about DevOps and SRE in various ways: DevOps is an organizational theory for continuous development and operation, or a conceptual development and operation method, while SRE is a methodology for how to rationally implement that method of operation.
It is not easy to consistently meet SLOs and release new services that reflect market needs without running out of error budget, especially when SLOs are set at 99.9% or 99.99%. Many development sites are shifting from the traditional waterfall method, which takes more than a year from development to release in an on-premises in-house environment, to agile development in a cloud (public cloud) environment.
So is agile development in a cloud environment essential to achieve SRE and SLO? If I were asked this question, the answer would be "Yes."
In this section, we would like to explain the keywords needed to understand what kind of development environment is required to achieve SRE.
CI/CD is a concept of automating as much of the software development process as possible. The software development process here refers to a series of tasks such as writing code, building it, testing it, and deploying it. When I was a programmer from the 1990s through the 2000s, we would compile the code to make sure there were no errors, test it against the test specification to check for bugs and verify behavior, deploy it to the production environment once all problems were gone, and double-check it again in the production environment.
Source repositories are introduced to prevent duplication of modifications by multiple developers and to reuse libraries and source code created by others. Usually, an application is not developed by a single developer, and most applications are developed by multiple engineers.
IaC (Infrastructure as Code) is an important item because it is a method unique to the cloud environment. In a cloud environment, computers and networks are virtual, and there is no need to order, install, and wire physical equipment as in an on-premises environment. While these resources can be configured through a management console or UI, the work can also be completed from start to finish using only commands. Tools such as Google Cloud Deployment Manager in GCP and Terraform can be used to automate the creation and updating of infrastructure with code.
To achieve high SLOs, services must always be available. Rollback is an important mechanism to achieve this.
Immutable Infrastructure is a concept of infrastructure based on virtualization and cloud computing in which a server, once built, is never modified in place; when a change is needed, it is replaced by a separately prepared new version of the environment.
When you hear the word "container," your first image might be of big steel boxes stacked on top of a large cargo ship like this. The cargo ship is the infrastructure for running applications such as clustered servers, and the containers stacked seamlessly on top of the cargo ship are the applications.
In this section, I will talk about what you, as an SRE, need to consider when an incident occurs.
If you got a phone call or email saying, "The system is down right now! Do something about it right away!", what would you do first? Your first thought would probably be to contact the person in charge to check the current status.
Google has defined a process called the Incident Command System. First, when an incident occurs, a clear division of roles is predetermined, and each role is recognized from the beginning of the incident response and begins to act autonomously.
After the IC (Incident Commander), OL (Operations Lead), and CL (Communications Lead) have cooperated to resolve the incident and the system has returned to normal operation, one more thing needs to be done.
We have discussed the importance of sharing information. When an incident occurs, what tools should the IC, OL, and CL use to share information while responding to it? And can CLs use those same tools to report the situation in a timely manner?
Many companies are adopting the concept of SRE, and with the rapid progress of DX (Digital Transformation), an increasing number of companies have set cloud-native as one of their goals and are working not only to shift their infrastructure to the cloud, but also to change their development methods and culture.
This course will help you understand the basics of SRE, a methodology for application development and service operation proposed and practiced by Google, and will be a great reference for promoting cloud computing in your company.
In recent years, some companies have begun to consider implementing site reliability engineering (SRE) or have already done so and found it to be effective for their business.
For example, what would you do if you were in charge of an operations team and faced a situation where a service currently running with 5 operations team members and 4 servers needs to handle a 20-fold increase in users within 2 years? Even if you manage to achieve it, it is easy to see that it will be impossible to keep working this way indefinitely. Google's SRE approach is a very rational and practical way to maintain scalability while continuing to release more and more services every day, serving over 2 billion users worldwide.