Managing Big Data on Google's Cloud Platform
The Second Course in a Series for Attaining the Google Certified Data Engineer
Best Seller
4.7 (10 ratings)
113 students enrolled
Created by Mike West
Last updated 7/2017
30-Day Money-Back Guarantee
  • 1.5 hours on-demand video
  • 6 Articles
  • 5 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Understand Cloud Dataproc, Google's managed Hadoop and Spark service.
  • Craft machine learning projects at scale on GCP.
  • Integrate Cloud Dataproc with other core services such as BigQuery.
  • Migrate on-premises Hadoop and Spark jobs to Cloud Dataproc.
Requirements
  • You'll need a basic understanding of cloud technologies.
  • You'll have taken the first course in the series on Google Cloud Platform.
  • You'll need a basic knowledge of Big Data, specifically Hadoop and Spark.

Welcome to Managing Big Data on Google's Cloud Platform. This is the second course in a series of courses designed to help you attain the coveted Google Certified Data Engineer certification.

Additionally, the series of courses will show you the role of the data engineer on the Google Cloud Platform.

At this juncture, the Google Certified Data Engineer is the only real-world certification for data and machine learning engineers.

NOTE: This is NOT a course on Big Data. This is a course on a specific cloud service called Google Cloud Dataproc. The course was designed to be part of a series for those who want to become data engineers on Google's Cloud Platform.

This course is all about Google's Cloud and migrating on-premise Hadoop jobs to GCP. In reality, Big Data is simply about unstructured data. There are two core types of data in the real world. The first is structured data, the kind found in a relational database. The second is unstructured data, such as a file sitting on a file system. Approximately 90% of all data in the enterprise is unstructured, and our job is to give it structure.

Why do we want to give it structure? We want to give it structure so we can analyze it. Recall that 99% of all applied machine learning is supervised learning. That simply means we have a data set, and we point our machine learning models at that data set in order to gain insight into it.

In the course we will spend much of the time working in Cloud Dataproc. This is Google’s managed Hadoop and Spark platform. 

Recall the end goal of big data is to get that data into a state where it can be analyzed and modeled. Therefore, we are also going to cover how to work on machine learning projects with big data at scale.

Please keep in mind this course alone will not give you the knowledge and skills to pass the exam. The course will provide you with the big data knowledge you need for working with Cloud Dataproc and for moving existing projects to the Google Cloud Platform. 

Five Reasons to Take This Course

1) The Top Job in the World

The data engineer role is the single most needed role in the world. Many believe it's the data scientist, but several studies have broken down the job descriptions, and the most needed position is that of the data engineer.

2) Google's the World Leader in Data

Amazon's AWS is the most used cloud and Azure has the best UI, but no cloud vendor in the world understands data like Google. They are the world leader in open-source artificial intelligence. You can't be the leader in AI without being the leader in data.

3) 90% of all Organizational Data is Unstructured

The study of big data is the study of unstructured data. As the data in companies grows, most will need to scale to unprecedented levels. Without the cloud, that kind of scale demands a significant investment in infrastructure and talent that few organizations can make.

4) The Data Revolution is Now

We are in a data revolution. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. Beyond the ubiquitous spreadsheet, graduating from RDBMS (which will always have a place in the data stack), we now work with NoSQL and Big Data technologies.

5) Data is Foundation 

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data. 

Thank you for your interest in Managing Big Data on Google's Cloud Platform and we will see you in the course!!

Who is the target audience?
  • Data engineers preparing for the Google Certified Data Engineering Exam.
  • Machine learning engineers learning how to build models at scale on GCP.
  • Database professionals and developers moving to the Google Cloud Platform.
Curriculum For This Course
41 Lectures
6 Lectures 11:26

In this lesson let's take a high-level look at what this course is about. 

This is the second course in a series of courses on Google's Cloud Platform. 

This is not an entry level course on Hadoop or Spark. 

This course is about Cloud Dataproc and how to move existing projects to GCP. 

Preview 01:39

In this lesson let's talk about the course's targeted audience. 

If you fit one of these roles, or just want to learn more about the data engineer path on GCP, then this course is for you. 

Preview 01:31

This lesson is just a series of questions I've been asked about this course or other similar courses. 

I try to answer them before you take the course. 

Preview 02:22

There are only two kinds of data in the enterprise. 

One is structured and the other one is unstructured. 

Let's learn about them in this lesson. 

Preview 02:47

Every organization has 4 different kinds of data. 

Let's learn what they are in this lesson. 

Preview 02:16


5 questions
Why Cloud Dataproc
7 Lectures 11:34

Why use GCP for big data if you already have an existing on-premise Hadoop setup? 

Lots of reasons. 

Money and scale are two reasons we will discuss in the following videos. 

Why Use GCP for Big Data?

In this lesson let's learn about all the sundry components of an on-premise build. 

On-premise builds are costly and don't scale very well. 

On-Premise Hadoop Build

Is it better to scale up or out? 

The answer is out, but let's find out why in this lesson. 

Scaling up or Scaling Out

In this lesson let's learn about regions and zones. 

Zones and Regions

In GCP we decouple storage and compute. 

We do this so we can easily spin up and tear down our clusters. 

Let's learn about on-premise versus the GCP. 

Separating Compute and Storage

In GCP our end goal is to use Google Cloud Storage to house our data. 

We can then use other services like BigQuery to analyze our data once that data is sitting on common storage. 

Cloud Dataproc Architecture


10 questions
Cloud Dataproc in Action
14 Lectures 36:20

The entire cluster creation process in one screen. 

Let's talk about the core parts in this lesson. 

Create Cluster Screen

In this lesson let's learn how to create a very simple cluster using the Google Cloud Console. 

Create Dataproc Cluster in GCP Console

Creating a Dataproc cluster using the shell is just as easy. 

In this lesson let's open a cloud shell session and spin up a cluster.

Create a Cluster using the Shell
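As a rough sketch of the shell workflow this lesson demonstrates (not taken from the course itself — the cluster name, zone, and machine types below are placeholders, and the flags reflect the gcloud SDK of this course's era):

```shell
# Hypothetical sketch: spin up a small Dataproc cluster from Cloud Shell.
# "example-cluster", the zone, and the machine types are placeholders.
gcloud dataproc clusters create example-cluster \
    --zone us-central1-a \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2

# Tear the cluster down when you're done to stop the billing clock.
gcloud dataproc clusters delete example-cluster
```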

We have three options when spinning up our Dataproc clusters. 

In this lesson let's learn what they are and why we should probably choose high availability for production loads. 

The Three Dataproc Configurations

Preemption is going to save your organization or your clients money. 

Let's talk about what it is and how to leverage it on GCP.

Using Preemption on Cloud Dataproc
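A hedged sketch of what the lesson describes — the cluster name and worker counts are placeholders, and the flag name reflects the gcloud SDK of this era (newer SDKs refer to preemptible workers as "secondary workers"):

```shell
# Hypothetical sketch: mix cheap preemptible workers into a cluster.
# Preemptible nodes can be reclaimed by GCP at any time, so keep
# enough regular workers to finish the job if they disappear.
gcloud dataproc clusters create example-cluster \
    --num-workers 2 \
    --num-preemptible-workers 4
```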

In this brief lesson let's learn how preemption is handled on our preemptible clusters. 

How GCP Handles Preemption

We have several images to choose from. 

Let's learn why we might want to use the most stable ones instead of our other options. 

Image Version Options

We can easily scale our clusters even when jobs are running. 

Let's learn how to do that in this lesson. 

Scaling Clusters
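As a minimal sketch of scaling a live cluster (the cluster name and worker count are placeholders, not from the course):

```shell
# Hypothetical sketch: resize a running cluster in place.
# Works even while jobs are executing on the cluster.
gcloud dataproc clusters update example-cluster \
    --num-workers 5
```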

We can spin up custom boxes in GCP. 

Let's learn how to in this lesson.

Creating a Custom Image

In this lesson let's learn how to customize our clusters.

Cluster Customization

We can easily install additional software on our clusters. 

Let's learn how to use initialization scripts to do that. 

3 Steps to Install Additional Software on Clusters

In this lesson let's demo how to implement an initialization script. 

Initialization Actions
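A hypothetical sketch of the initialization-script mechanism these two lessons cover — the bucket path and script name are placeholders you would stage in Cloud Storage ahead of time:

```shell
# Hypothetical sketch: run a startup script on every node at creation time.
# gs://my-bucket/install-extras.sh is a placeholder path, not a real script.
gcloud dataproc clusters create example-cluster \
    --initialization-actions gs://my-bucket/install-extras.sh
```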

Clusters have a high availability option. 

Let's learn how to implement it in this lesson via the console. 

High Availability


15 questions
Submitting Jobs
7 Lectures 17:44

In this lesson let's learn how to submit jobs to our cluster once it's created. 

The Submit Jobs Screen

In this lesson let's submit some jobs. 

In the lesson we will see that our cluster isn't large enough, forcing us to kill it and create a new one. 

Submitting Spark Job - Console

In this lesson let's learn how to submit a spark job to our cluster and view the output. 

Submitting Spark Job - Google Cloud Shell
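As a hedged sketch of a shell-based Spark submission (the cluster name is a placeholder; the SparkPi example jar ships with Spark on Dataproc nodes):

```shell
# Hypothetical sketch: submit the bundled SparkPi example to a cluster.
gcloud dataproc jobs submit spark \
    --cluster example-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```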

In this lesson let's learn how to submit a PySpark job via the Google Cloud Shell. 

Submitting PySpark Job - SSH
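A minimal sketch of the PySpark submission path — the script path and cluster name below are placeholders, assuming the script has already been copied to Cloud Storage:

```shell
# Hypothetical sketch: submit a PySpark script staged in Cloud Storage.
# gs://my-bucket/wordcount.py and "example-cluster" are placeholders.
gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster example-cluster
```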

In this lesson let's learn how to move our on-premise Hadoop jobs to GCP. 

Moving from On-Premise to Google Cloud Dataproc

In this lesson let's look at the code reference changes between Python and Scala. 

Python and Scala Code Reference Change


10 questions
You're the Data Engineer
7 Lectures 11:26

In this lesson let's learn about white boarding. 

It's used heavily at Google for interviewing. 

It's also used for architecting cloud solutions. 

White Boarding: Difference between On-prem and Cloud Dataproc

Let's white board the approach we'd use to move on-premise Hadoop and other big data jobs to GCP. 

White Boarding: Moving Jobs to GCP

When we are designing solutions for clients we want to make sure they understand that their data and clusters need to be in the same zones or regions so they don't incur excessive data movement charges. 

White Boarding: Data Near Clusters

You'll get a lot of questions about preemptibles and how to use them. 

Let's cover some high-level talking points and reinforce the idea of temporary clusters. 

White Boarding: Defining Preemptibles

Clients are going to want to know the exact steps to moving their jobs to GCP. 

In this lesson we will explain our phased approach to them. 

White Boarding: On-Premise Architecture to GCP

Clients always want something they can customize. 

In this lesson let's explain to them how easy it is to use initialization scripts.  

White Boarding: Add Software to Nodes

About the Instructor
Mike West
4.2 Average rating
2,916 Reviews
48,984 Students
42 Courses
SQL Server and Machine Learning Evangelist

I've been a production SQL Server DBA most of my career.

I've worked with databases for over two decades, and I've worked for or consulted with over 50 different companies as a full-time employee or consultant, from the Fortune 500 to small and mid-size firms. Some include: Georgia Pacific, SunTrust, Reed Construction Data, Building Systems Design, NetCertainty, The Home Shopping Network, SwingVote, Atlanta Gas and Light and Northrop Grumman.

Experience, education and passion

I learn something almost every day. I work with insanely smart people. I'm a voracious learner of all things SQL Server and I'm passionate about sharing what I've learned. My area of concentration is performance tuning. SQL Server is like an exotic sports car: it will run just fine in anyone's hands, but put it in the hands of a skilled tuner and it will perform like a race car.


Certifications are like college degrees: they are great starting points for learning. I'm a Microsoft Certified Database Administrator (MCDBA), Microsoft Certified Systems Engineer (MCSE) and Microsoft Certified Trainer (MCT).


Born in Ohio, raised and educated in Pennsylvania, I currently reside in Atlanta with my wife and two children.