Mastering SRE on Google Cloud

Master Google Cloud SRE principles, SLIs/SLOs, monitoring, incident response, automation & scalability for high-reliabil

Created by Skills Marathon

Last updated 8/2025

English

What you'll learn

Implement Site Reliability Engineering (SRE) principles on Google Cloud effectively.
Design and maintain reliable, scalable, and fault-tolerant cloud infrastructure.
Use Google Cloud tools like Cloud Monitoring, Logging, and Error Reporting.
Apply incident management, SLIs, SLOs, and error budgets in real-world scenarios.

Course content

8 sections • 58 lectures • 4h 9m total length

Introduction • 1:53
Kick off this SRE bootcamp with hands-on practice on GCE, GKE, Cloud Run, and Cloud Logging, building observability with golden signals, SLIs, SLOs, Grafana, and Cloud Monitoring.
video2 • 0:27
Engage with the instructor across YouTube, professional networks, or blogs to ask questions about this course on the Udemy platform as we dive into the agenda.
video3 • 1:52
Explore the origins of SRE and master observability with golden signals, SLIs, SLOs, and error budgets. Deploy demo apps on GCE, GKE, and Cloud Run, and build dashboards with Grafana.
video4 • 11:03
Create a new Google project, configure gcloud, authenticate with service keys, enable compute and container services, and export billing data to BigQuery for visibility on free tier credits.

video1 • 0:41
Explore the origins and core concepts of site reliability engineering, including observability, golden signals, SLIs, SLOs, and error budgets, and define the SRE role and foundational skills.
video2 • 12:17
Explore Google SRE reliability concepts using the golden signals (traffic, errors, latency, saturation) and implement SLIs, SLOs, and error budgets for GCE, GKE, and Cloud Run.
video3 • 9:44
Define site reliability engineer characteristics, including metrics, automation, SLOs, error budgets, and logs. Learn eight foundational skills in cloud, DevOps, Linux, and Kubernetes.
video4 • 0:27
video5 • 0:34
Explore site reliability engineering concepts, including observability, the golden signals, SLIs, SLOs, and error budgets, plus the Google definition of a site reliability engineer and the essential skill set.
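To make the SLI, SLO, and error budget vocabulary from this section concrete, here is a minimal Python sketch for a request-based availability SLO. It is not taken from the course: the function name `error_budget_report`, the 99.9% target, and the request counts are illustrative assumptions.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget use for a request-based availability SLO.

    slo_target is a fraction, for example 0.999 for a 99.9% objective.
    All names and numbers here are hypothetical, for illustration only.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the measured fraction of good requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many requests may fail before the SLO is missed.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Assumed sample window: 1,000,000 requests with 250 failures.
    report = error_budget_report(0.999, 1_000_000, 250)
    print(f"SLI: {report['sli']:.5f}")
    print(f"Budget consumed: {report['budget_consumed']:.1%}")
    print(f"SLO met: {report['slo_met']}")
```

The same arithmetic applies whether the request counts come from GCE, GKE, or Cloud Run; only the source of the two counters changes.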

video1 • 0:51
Gain a bird's-eye view of Google Cloud Platform, compare five key services (GCE, GKE, Cloud Run, Cloud Logging, Cloud Monitoring), and preview their core features.
video2 • 7:47
Get a view of Google Cloud Platform services across compute, storage, databases, networking, and monitoring, and run apps on GCE, GKE, and Cloud Run with gcloud.
video3 • 7:49
Explore Google Compute Engine, GKE, and Cloud Run, plus Google Cloud's observability and security features, to design scalable, secure, and monitored applications with managed instance groups, autoscaling, and automation.
video4 • 0:22
Explore the GCP overview, review products and services, and identify the five core GCP services you'll use throughout the course, then move on to the next section.

video1 • 3:13
video2 • 3:57
Learn to find help in the Linux terminal using ls --help and man pages, mix switches like -l and -R, and use apropos to locate commands.
video3 • 2:15
video4 • 15:08
video5 • 5:12
video6 • 7:12
Master the find command to locate files and folders using -name, -iname, -mtime, -size, and -perm, then combine conditions with and, or, and not for precise searches.
video7 • 7:02
video8 • 5:20
Learn to use grep and egrep with regex patterns to search text, including case sensitivity, ignore-case (-i), and patterns for starting with, ending with, or matching multiple occurrences.
video9 • 3:55
Explore Linux file permissions, using octal values and umask to set default rights, and modify permissions with chmod, including adding execute to scripts and verifying permissions after changes.
video10 • 1:40
video11 • 1:33
video12 • 1:04
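The file permissions lecture in this section covers octal values, umask, and adding execute rights with chmod. The arithmetic behind those ideas can be sketched in Python as below. This is an illustration under stated assumptions, not course material: the helper names `default_mode`, `add_execute`, and `symbolic` are hypothetical, and 0o022 is just a common umask value.

```python
import stat


def default_mode(umask: int, is_directory: bool = False) -> int:
    """Return the permission bits a new file or directory typically gets.

    New files usually start from 0o666 and directories from 0o777;
    the umask clears the bits it names.
    """
    base = 0o777 if is_directory else 0o666
    return base & ~umask


def add_execute(mode: int) -> int:
    """Set the execute bit for user, group, and others (like chmod a+x)."""
    return mode | 0o111


def symbolic(mode: int, is_directory: bool = False) -> str:
    """Render permission bits in the style of `ls -l`, e.g. -rw-r--r--."""
    file_type = stat.S_IFDIR if is_directory else stat.S_IFREG
    return stat.filemode(file_type | mode)


if __name__ == "__main__":
    umask = 0o022  # assumed umask for the example
    script_mode = default_mode(umask)
    print(oct(script_mode), symbolic(script_mode))
    executable = add_execute(script_mode)
    print(oct(executable), symbolic(executable))
    dir_mode = default_mode(umask, is_directory=True)
    print(oct(dir_mode), symbolic(dir_mode, is_directory=True))
```

Printing both the octal and the symbolic form mirrors the habit of verifying permissions after each change.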

video1 • 2:04
video2 • 4:04
Master zsh profile workflows with exports and aliases, stored in GitHub, and accelerate cloud tasks using gcloud and kubectl aliases for GCE, compute, and GKE.
video3 • 3:53
Create a quick get command utility to search across notes and files for gcloud roles and kubectl commands, saving minutes and boosting efficiency in daily SRE workflows.
video4 • 3:46
Explore how to find and verify Google Cloud IAM roles with a specific permission using gcloud commands, grep filtering, and a bash script to enforce least privilege.
video5 • 0:42
Explore practical bash scripting with examples of a get command utility and a get roles by permission utility, including if-else and for loops to automate file processing.
video6 • 0:43
Explore why automation matters for site reliability engineering and infrastructure as code, and see practical bash utilities like get cmd and get roles, plus zsh profile customization to reduce toil.
video7 • 2:23
Explore gcloud, the Google Cloud command line interface, to manage GCP resources from the CLI, automate tasks, and format, filter, and sort outputs using interactive help.
video8 • 7:38
video9 • 10:13
video10 • 6:54
Learn to filter and sort gcloud compute machine types by zone and specs, using exact and partial searches, wc counting, and sorting by cpus and memory.
video11 • 0:42
Explore how to leverage Google Cloud official documentation, cheat sheets, and CLI interactive help to list, describe, and filter compute instances for targeted insights.

video1 • 2:57
video2 • 5:45
Connect to your GKE cluster with gcloud credentials and verify the current kubectl context. Set the default namespace to ECP to run deployments in that namespace.
video3 • 2:59
Learn to use kubectl for version checks, cluster info, and deployment management in production Kubernetes environments. Discover helpful commands, aliases, and resources for creating, exposing, and scaling pods and deployments.
video5 • 11:33
video6 • 5:49
Deploy an nginx pod with kubectl, expose it as a load balancer, test with curl, and clean up, while contrasting imperative commands with declarative configuration and CI/CD.
video7 • 9:22
video8 • 0:54
Recap connecting to the GKE cluster and using kubectl, with JSON, YAML, and JSONPath outputs for deployments, pods, and services, declarative and imperative deployment, and troubleshooting with describe and logs.

video1 • 1:49
Master the vi editor on Unix-like systems through command-line navigation, editing, search and replace, and configuration, with a practical cheat sheet for Linux workflows.
video2 • 3:22
video3 • 6:51
Explore vim editing basics for editing files: insert modes with i, I, a, A, o, O; delete commands like dd, 5dd, dw, D; and copy-paste with yy, p, and P.
video4 • 7:04
Explore how to use the vi editor for search and replace, including case-sensitive and case-insensitive searches, with examples like replacing Carolina with India and undoing changes.
video5 • 0:55
Configure vim by creating a vim profile (vimrc) in your home directory and adding default options like set number and set ignorecase so the settings become permanent.
video6 • 1:01
video7 • 0:57
Navigate vim with j k h l w b gg G, edit with i o dd y p, search with /, and make set options permanent in your vim profile.

video1 • 1:45
Design and subnet an RFC 1918 10.240.0.0/16 address space for GKE and GCE in east and west regions within a multi-cloud hybrid landscape, routing applications onto the GCP landing zone.
video2 • 1:41
Create and configure eight subnets in your organization’s VPC network, selecting a CIDR range like /20, and troubleshoot overlapping subnet errors whether using the console or Terraform.
video3 • 1:42
video4 • 4:31
video5 • 7:33
Learn to access a Google Cloud VM via gcloud compute ssh and the console, monitor logs with journalctl, run essential Linux commands, and inspect system information like hostname, uptime, and IP.
video6 • 8:22
video7 • 5:48
video8 • 0:43
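The subnetting lectures in this section split a 10.240.0.0/16 range into eight subnets at a size like /20 and deal with overlapping-subnet errors. As a rough sketch of that planning step, the standard-library `ipaddress` module can carve the range and flag overlaps before anything is created in the console or Terraform. The helper names `carve_subnets` and `find_overlaps` are hypothetical, and the extra 10.240.8.0/21 range is an assumed example of a conflicting request.

```python
import ipaddress
from itertools import combinations


def carve_subnets(parent_cidr: str, new_prefix: int, count: int) -> list:
    """Return the first `count` subnets of size /new_prefix inside parent_cidr."""
    parent = ipaddress.ip_network(parent_cidr)
    candidates = list(parent.subnets(new_prefix=new_prefix))
    if count > len(candidates):
        raise ValueError(f"{parent} only holds {len(candidates)} /{new_prefix} subnets")
    return candidates[:count]


def find_overlaps(networks: list) -> list:
    """Return every pair of networks whose address ranges overlap."""
    return [(a, b) for a, b in combinations(networks, 2) if a.overlaps(b)]


if __name__ == "__main__":
    plan = carve_subnets("10.240.0.0/16", 20, 8)
    for net in plan:
        print(net, "usable addresses:", net.num_addresses - 2)

    # Assumed conflicting request: this /21 sits inside the first /20.
    conflicting = ipaddress.ip_network("10.240.8.0/21")
    for a, b in find_overlaps(plan + [conflicting]):
        print("overlap:", a, "and", b)
```

A /16 holds sixteen /20 blocks, so taking eight leaves half of the range free for later growth. The usable-address figure subtracts only the network and broadcast addresses; a cloud provider may reserve more per subnet.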

Requirements

Basic understanding of cloud computing concepts.
Familiarity with Google Cloud Platform services (helpful but not mandatory).
Willingness to learn and apply SRE practices in hands-on projects.

Description

Want to become an in-demand Site Reliability Engineer (SRE) for Google Cloud?
This course takes you from the foundations of SRE to advanced, hands-on practices tailored for Google Cloud Platform (GCP). Whether you’re aiming for a career in SRE, DevOps, or Cloud Engineering, this course equips you with the skills to build and maintain reliable, scalable, and secure cloud infrastructure.

In this practical, 4-hour deep-dive, you will:

Understand core SRE concepts like SLIs, SLOs, SLAs, and error budgets.
Learn how to design fault-tolerant architectures on GCP.
Master monitoring, logging, and alerting with Cloud Monitoring, Logging, and Error Reporting.
Implement incident response and automation using GCP tools and best practices.
Apply capacity planning, performance tuning, and cost optimization strategies.

You’ll work through real-world case studies, industry scenarios, and hands-on exercises to gain job-ready skills.

By the end of this course, you will be able to:

Confidently apply SRE principles to Google Cloud environments.
Set up automated monitoring and alerting pipelines.
Handle production incidents effectively and reduce downtime.
Optimize cloud operations for both reliability and cost.

Who is this course for?

Cloud engineers and DevOps professionals looking to specialize in SRE.
IT professionals and software engineers transitioning into reliability engineering roles.
Students and beginners interested in cloud reliability best practices.

No advanced programming skills are required — just a willingness to learn and apply SRE strategies in a hands-on way.

Who this course is for:

Cloud engineers and DevOps professionals looking to master SRE practices on Google Cloud.
IT professionals aiming to improve system reliability and performance.
Students or beginners interested in a career in Cloud Engineering or SRE.
Software engineers transitioning into SRE roles.

Course content

Introduction • 4 lectures • 15min

section2 • 5 lectures • 24min

section3 • 4 lectures • 17min

section4 • 12 lectures • 58min

section5 • 11 lectures • 43min

section6 • 7 lectures • 39min

section7 • 7 lectures • 22min

section8 • 8 lectures • 32min