SRE ( Site Reliability Engineering ) Quick Learning Course

Rating: 3.5 (26 reviews)

Google's innovative methodology for DevOps

Created by GG トップ

Last updated 10/2022

English

What you'll learn

You can learn the points of SRE (site reliability engineering) advocated by Google for development and operation in a cloud environment, which is important for
In order to make it easy for cloud beginners to understand, we will explain each necessary technical element one by one without using technical terms as much as
Current development engineers and infrastructure operators can aim to improve their skills and acquire new development and operation methods that Google is prac
Before jumping straight into an SRE technical book, you can understand the main points clearly, and you can expect to improve your understanding by 10 ti

Course content

9 sections • 39 lectures • 1h 56m total length

Chapter Zero (1:53)
SRE is not something that applies only to Google; it is a very useful concept for all companies and engineers who use the cloud to develop, release, and maintain their services. This presentation is designed to help not only those who are currently in charge of or leading DevOps and engineers who actually write code and develop applications, but also those who have no programming or development experience, understand the new concept of SRE in an easy-to-understand manner. I hope that you will be able to follow along with me to the end.

Why was the idea of SRE born? (2:35)
I would like to explain why we need the concept of SRE. To understand SRE, it is important to understand why this new way of thinking became necessary. In this lecture, I would like to explain where, by whom, and why SRE was conceived, implemented, and proved to be a useful methodology.
Dev vs Ops? (4:50)
Applications are designed, developed, tested, and released, and then made available to us in the form of services.
However, to ensure that the service is always available, the operations phase after release is just as important. There are two main groups: those in charge of "design, development, testing, and release" (mainly the development group, called "developers") and those in charge of "operation, maintenance, and troubleshooting" (mainly the operations group, called "operations"). This is the so-called DevOps relationship.
How to successfully resolve DevOps interests? (2:22)
Simply put, the conflict of interests within DevOps is between the development side's desire to "develop and release services whenever necessary to win the competition" and the operations side's desire to "keep the release frequency low to maintain the availability and reliability of the service".
What is SRE? (2:07)
When Benjamin Treynor Sloss, who first launched SRE, was asked, "What is SRE?" he answered, "Fundamentally, it's what happens when you ask a software engineer to design an operations function..."
Is it easier to make it 10 times better than 10% better? (2:51)
Google has a concept called 10x (ten-X). Rather than thinking about how to raise operating profit to 110%, we should think about what it would take to increase it 10x. Faced with that goal, you are forced to come up with a completely different approach, because you cannot achieve it with your current method.

Availability (3:01)
An example of availability that you may see often is a notation such as "This service guarantees 99.9% availability." If the service you are about to use has such a statement, it means that the service is guaranteed to be unavailable for no more than 8 hours and 46 minutes per year, as shown in the table here.
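The 8 hours and 46 minutes figure can be reproduced with a little arithmetic. The following is a minimal sketch, not part of the course material; the function name `allowed_downtime` and the choice of a 365.25-day average year are assumptions made for illustration.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days (accounts for leap years).
YEAR = timedelta(days=365.25)


def allowed_downtime(availability_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return the maximum downtime permitted in `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    # Print the yearly downtime allowance for two, three, four, and five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% availability -> at most {allowed_downtime(target)} of downtime per year")
```

For 99.9%, the result rounds to 526 minutes, that is, 8 hours and 46 minutes per year. Each additional nine cuts the allowance by a factor of ten.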
Reliability (2:34)
On reliability, Benjamin remarks, "reliability is not determined by monitoring, it is determined by the user". This challenges the conventional notion that reliability is measured by constantly monitoring servers to see if they are alive or dead and constantly verifying that they are working properly.
SLI/SLO/SLA (6:50)
An SLI (Service Level Indicator) is an indicator of whether a predetermined use of the service completes to a degree that is acceptable to the user. More simply put, it measures the "good events" of the most important service. So what is a "good event"? Not every system uses the same kind of "good event" as its indicator; depending on the system, it may be accuracy, freshness, or throughput in data processing.
CUJ (1:04)
A Critical User Journey (CUJ) is the specific procedure that a particular user performs in order to achieve a goal using a service. It is needed when determining an SLI (Service Level Indicator).
Percentile (1:15)
For example, suppose a client sends a request to the server and the response time is measured 100 times: 99 responses are returned in less than one second, but the remaining one takes two seconds. The average response time would be approximately 1 second, which hides the 2-second outlier that took more than twice as long as usual.
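The scenario above can be checked numerically. This is a minimal sketch under stated assumptions: the 99 fast responses are all taken to be 0.9 seconds (the text only says "less than one second"), and `nearest_rank_percentile` is a hypothetical helper using the nearest-rank definition of a percentile, which is one of several definitions in use.

```python
import math
import statistics


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile of `samples` using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value such that at least p% of samples are <= it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Assumption: every fast response took 0.9 s; one response took 2.0 s.
latencies = [0.9] * 99 + [2.0]

mean = statistics.mean(latencies)
p50 = nearest_rank_percentile(latencies, 50)
p100 = nearest_rank_percentile(latencies, 100)

if __name__ == "__main__":
    print(f"mean={mean:.3f}s  p50={p50}s  p100={p100}s")
```

The mean stays close to the typical response time, so the slow request is invisible in it, while the top percentile exposes it directly. With only one slow request in 100 samples, the outlier appears at the very top of the distribution; with larger sample sizes, high percentiles such as the 99th are commonly used to track this kind of tail latency.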
Error budget (5:49)
The idea of error budgeting came about as a new rule designed to achieve a good balance between the respective goals of the two sides of DevOps: Dev (development) and Ops (operations).
Development (Dev) is constantly adding new features and upgrading existing features to increase service usage and win the competitive differentiation race.
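As a rough illustration of how an error budget turns this balance into a number, the sketch below treats the budget as the share of requests that is allowed to fail under a given SLO (100% minus the SLO). This is the commonly used definition, not a formula quoted from the course; the function names and the request counts are hypothetical.

```python
from fractions import Fraction


def error_budget(slo_percent: str, total_requests: int) -> Fraction:
    """Return how many requests may fail in the period without breaking the SLO.

    `slo_percent` is passed as a string (for example "99.9") so that it can be
    converted to an exact fraction without floating-point rounding.
    """
    slo = Fraction(slo_percent)
    if not 0 <= slo <= 100:
        raise ValueError("SLO must be between 0 and 100 percent")
    return total_requests * (100 - slo) / 100


def budget_remaining(slo_percent: str, total_requests: int, bad_requests: int) -> Fraction:
    """Return the unused part of the error budget (negative if overspent)."""
    return error_budget(slo_percent, total_requests) - bad_requests


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 of them failed.
    remaining = budget_remaining("99.9", 1_000_000, 400)
    if remaining > 0:
        print(f"{remaining} failures still allowed: Dev can keep releasing.")
    else:
        print(f"Budget overspent by {-remaining}: prioritize reliability work.")
```

While budget remains, releasing new features is acceptable; once it is spent, the focus shifts to stability. That gives Dev and Ops a shared rule instead of a negotiation.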
Postmortem (3:50)
A postmortem begins by assuming a worst-case scenario in which the service cannot be provided at all. Suppose a new feature is released to the production environment and, at the moment of release, all users are unable to access the site or are only presented with an error screen. The operations side will naturally focus all its efforts on restoring the site to its original state immediately. Suppose that, fortunately, the service is restored a few hours after the problem occurred. A postmortem is a tool used after all services are restored to prevent such a situation from happening again.
Toil (4:23)
When I hear the word "toil," it reminds me of boring work or work I don't want to do. Some of the tasks you perform on a daily basis may be considered "toil." However, the word "toil" alone may be taken to mean simply "a hassle" or "something I don't want to do."

Define the business situation (4:00)
Let us now introduce SRE in concrete terms, using the "Glossary of Terms to Understand SRE" explained in the previous chapter, with an actual service as an example.
Define SLI (1:36)
In this section, we will consider what SLI is appropriate based on YouTube's business situation as defined earlier. For definitions 1 through 3, it is necessary to consider priorities for each of the following items when setting SLI.
Determine CUJ (1:14)
As we discussed as an example in the glossary, if we consider YouTube as a service, what would be the most important CUJ of the service?
Determine SLI items (1:21)
So what would be a good SLI item to choose for that CUJ? Some users may think that it is important for the video to play immediately after pressing the Play button. But what if the video thumbnail itself is not displayed, and of course the Play button that is supposed to be displayed there is not displayed, or worse, the YouTube site is not displayed when accessing the site?
Determine SLI implementation (2:39)
Next, define how to evaluate the SLI. In this case, for availability when accessing youtube.com, a good response is one that returns 2xx, 3xx, or 4xx (excluding 429 Too Many Requests), and the SLI is the percentage of good responses among all responses to requests.
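The rule for a "good response" described above can be written down directly. This is a minimal sketch, not the course's implementation: the function names and the sample status codes are hypothetical, and in practice the status codes would come from load balancer or client-side logs.

```python
def is_good_response(status: int) -> bool:
    """A response is good if it is 2xx, 3xx, or 4xx, except 429 Too Many Requests."""
    return 200 <= status < 500 and status != 429


def availability_sli(statuses) -> float:
    """Return the percentage of good responses among all observed responses."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("at least one response is required to compute the SLI")
    good = sum(1 for status in statuses if is_good_response(status))
    return 100 * good / len(statuses)


if __name__ == "__main__":
    # Hypothetical sample: 95 successes, a redirect, a not-found, and three bad responses.
    sample = [200] * 95 + [301, 404, 429, 500, 503]
    print(f"availability SLI: {availability_sli(sample)}%")
```

Here 404 counts as good because the server answered correctly even though the page does not exist, while 429 and 5xx indicate that the service could not serve the user.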
Determine SLOs (3:07)
When setting SLOs, the goal and measurement period need to be included. For example, "99.9% of youtube.com responses over the past 30 days must be good responses."
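An SLO of this form combines a target (99.9%) with a measurement window (30 days). The sketch below shows one way such a check could look, assuming each observed response is recorded as a `(timestamp, is_good)` pair; the function name `slo_met`, the event format, and the decision to treat a window with no traffic as compliant are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from fractions import Fraction


def slo_met(events, now, target_percent="99.9", window=timedelta(days=30)):
    """Return True if the share of good events inside the window meets the target.

    `events` is an iterable of (timestamp, is_good) pairs. `target_percent` is a
    string so that it converts to an exact fraction.
    """
    in_window = [is_good for ts, is_good in events if now - window <= ts <= now]
    if not in_window:
        # Assumption: no traffic in the window is treated as meeting the SLO.
        return True
    good_percent = Fraction(sum(in_window), len(in_window)) * 100
    return good_percent >= Fraction(target_percent)


if __name__ == "__main__":
    now = datetime(2022, 10, 31)
    recent = now - timedelta(days=1)
    stale = now - timedelta(days=45)  # outside the 30-day window, so ignored
    events = [(recent, True)] * 9995 + [(recent, False)] * 5 + [(stale, False)] * 100
    print("SLO met:", slo_met(events, now))
```

Only events inside the rolling window count, so old failures drop out of the calculation as time passes.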
Example of how to implement SLO monitoring (1:13)
As mentioned earlier, usually, server health checks, etc., are set up on the server side for the purpose of autoscaling, and the logs are monitored to detect error codes, and alerts are sent to the operations manager for notification and response. However, with this method, errors happening on the server side and errors or delays occurring on the user side may be different. For this reason, Google's best practices recommend user-side monitoring whenever possible.
Determine SLAs (5:23)
It is possible to achieve three nines with well-designed software, but to achieve four nines, you need to have a well-designed operation, a well-designed failure response procedure, and a well-designed execution organization, and to achieve five nines, the business itself must be well-designed. In other words, it is impossible to achieve five nines without a well-designed business. On the other hand, it is fair to say that he considers this part of the business, the part of the business that he defines as the SLA, to be very important.
Summary of Service Levels (1:45)
I would like to summarize each service level here. I encourage everyone to consider each item as a team with a specific service in mind.

What development environment is required to achieve SLO? (3:12)
I have talked about DevOps and SRE in various ways: DevOps is an organizational theory for continuous development and operation, or a conceptual development and operation method, while SRE is a methodology for how to rationally implement that method of operation.
Cloud environment + CI/CD achieve SLO through SRE implementation (1:44)
It is not easy to consistently meet SLOs and release new services that reflect market needs without running out of error budget, especially when SLOs are set at 99.9% or 99.99%. Many development sites are shifting from the traditional waterfall method, which takes more than a year from development to release in an on-premises, in-house environment, to agile development in a cloud (public cloud) environment.
So is agile development in a cloud environment essential to achieve SRE and SLO? If I were asked this question, the answer would be "Yes."

Cloud Native Architecture (3:37)
In this section, we would like to explain the keywords needed to understand what kind of development environment is required for SRE.
CI/CD (2:04)
CI/CD is the concept of automating as much of the software development process as possible. The software development process here refers to a series of tasks such as writing code, building it, testing it, and deploying it. When I was a programmer from the 1990s through the 2000s, we would compile the code to make sure there were no errors, test it according to the test specification to check for bugs and behavior, deploy it to the production environment when all problems were gone, and double-check it again in the production environment.
Source Repositories, Images, and Registries (2:50)
Source repositories are introduced to prevent duplication of modifications by multiple developers and to reuse libraries and source code created by others. Usually, an application is not developed by a single developer, and most applications are developed by multiple engineers.
IaC (Infrastructure as Code) (2:55)
This IaC is an important item as it is a method unique to the cloud environment. In a cloud environment, computers and networks are virtual, and there is no need to order, install, and wire physical equipment as in an on-premises environment. While it is possible to configure this setting through a management console or UI, it is also possible to complete the work from start to finish using only commands. Google Cloud Deployment Manager in GCP and Terraform can be used to automate the creation and updating of infrastructure with code.
Rollback, Blue-Green Deployment, Canary Release (4:21)
To achieve high SLOs, services must always be available. Rollback is an important mechanism to achieve this.
Immutable Infrastructure (2:27)
Immutable Infrastructure is a concept of infrastructure based on virtualization and cloud computing in which a server, once built, is never modified; when a change is needed, it is replaced by a separately prepared version of the environment.
Containers and Kubernetes (3:36)
When you hear the word "container," the first image that comes to mind might be big steel boxes stacked on top of a large cargo ship like this. The cargo ship is the infrastructure for running applications, such as clustered servers, and the containers stacked seamlessly on top of the cargo ship are the applications.

If an incident occurs (1:35)
In this section, I will talk about what you, as an SRE, need to consider when an incident occurs.
Explicit presentation of response procedures (1:06)
If you get a phone call or email saying, "The system is down right now! Do something about it right away!" What would you do first? Your first thought would probably be to contact the person in charge to check the current status.
Don't take on the responsibility alone (3:11)
Google has defined a process called the Incident Command System. First, when an incident occurs, a clear division of roles is predetermined, and each role is recognized from the beginning of the incident response and begins to act autonomously.
Record postmortem learning and prevent recurrence (1:50)
Once the incident has been successfully resolved through the cooperation of IC, OL, and CL, there is something that needs to be done once the system returns to normal operation.
Tools (2:52)
We have discussed the importance of sharing information. When an incident occurs, what tools should the IC, OL, and CL use to share information while responding to the incident? Also, can the same tools be used for CLs to report the situation in a timely manner as those used during incident response?

Requirements

This course explains technical elements in an easy-to-understand manner even for IT beginners, so no prior skills or experience are required.

Description

This course will help you understand the basics of SRE, a methodology for application development and service operation proposed and practiced by Google, and will be a great reference for promoting cloud computing in your company.

In recent years, some companies have begun to consider implementing site reliability engineering (SRE) or have already done so and found it to be effective for their business.

For example, what would you do if you were in charge of an operations team and faced a situation where a service currently running with 5 operations team members and 4 servers needs to handle a 20-fold increase in users in 2 years? Even if you manage to achieve it, it is easy to see that it will be impossible to keep doing it this way for good. Google's SRE approach is a very rational and practical way to maintain scalability while continuing to release more and more services every day, serving over 2 billion users worldwide.

SRE is not something that can only be applied to Google, but is a very useful concept for all companies and engineers who use the cloud to develop, release, and maintain their products. This presentation is designed to help those who are currently in charge of or leading DevOps, engineers who actually write code and develop applications, and those who have no programming or development experience understand the new concept of SRE in an easy-to-understand manner. I hope that you will be able to follow along with me to the end.

Who this course is for:

People in charge of DX promotion and cloud promotion who want to learn the development and operation approach used on the most advanced cloud infrastructure at the moment
People in charge of operations
CxO-level executives and IT leaders who are concerned about improving their service quality
Non-technical students who want to learn the new SRE methodology that Google is practicing

Course content

Introduction (1 lecture • 2min)

Why was the idea of SRE born? (5 lectures • 15min)

Glossary of terms to understand SRE (8 lectures • 29min)

How to implement SRE - SLI, SLO, SLA configuration examples (9 lectures • 22min)

What development environment is required to achieve SLO? (2 lectures • 5min)

Keywords to understand the development environment required for SRE (7 lectures • 22min)

If an incident occurs (5 lectures • 11min)

What skills are required of a Site Reliability Engineer? (1 lecture • 9min)

Final Section (1 lecture • 3min)
