Data Science and Machine Learning with Python - Hands On!
4.5 (5,409 ratings)
Instead of using a simple lifetime average, Udemy calculates a course's star rating by considering a number of different factors such as the number of ratings, the age of ratings, and the likelihood of fraudulent ratings.
36,696 students enrolled
Become a data scientist in the tech industry! Comprehensive data mining and machine learning course with Python & Spark.
Last updated 8/2017
Current price: $19 Original price: $160 Discount: 88% off
30-Day Money-Back Guarantee
  • 9 hours on-demand video
  • 2 Articles
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Develop using iPython notebooks
  • Understand statistical measures such as standard deviation
  • Visualize data distributions, probability mass functions, and probability density functions
  • Visualize data with matplotlib
  • Use covariance and correlation metrics
  • Apply conditional probability for finding correlated features
  • Use Bayes' Theorem to identify false positives
  • Make predictions using linear regression, polynomial regression, and multivariate regression
  • Understand complex multi-level models
  • Use train/test and K-Fold cross validation to choose the right model
  • Build a spam classifier using Naive Bayes
  • Use decision trees to predict hiring decisions
  • Cluster data using K-Means, and classify data using Support Vector Machines (SVM)
  • Build a movie recommender system using item-based and user-based collaborative filtering
  • Predict classifications using K-Nearest-Neighbor (KNN)
  • Apply dimensionality reduction with Principal Component Analysis (PCA) to classify flowers
  • Understand reinforcement learning - and how to build a Pac-Man bot
  • Clean your input data to remove outliers
  • Implement machine learning, clustering, and search using TF/IDF at massive scale with Apache Spark's MLLib
  • Design and evaluate A/B tests using T-Tests and P-Values
Requirements
  • You'll need a desktop computer (Windows, Mac, or Linux) capable of running Enthought Canopy 1.6.2 or newer. The course will walk you through installing the necessary free software.
  • Some prior coding or scripting experience is required.
  • At least high school level math skills will be required.
  • This course walks through getting set up on a Microsoft Windows based desktop PC. While the code in this course will run on other operating systems, we cannot provide OS-specific support for them.

Data Scientists enjoy one of the top-paying jobs, with an average salary of $120,000 according to Glassdoor and Indeed. That's just the average! And it's not just about money - it's interesting work too!

If you've got some programming or scripting experience, this course will teach you the techniques used by real data scientists in the tech industry - and prepare you for a move into this hot career path. This comprehensive course includes 71 lectures spanning almost 9 hours of video, and most topics include hands-on Python code examples you can use for reference and for practice. I’ll draw on my 9 years of experience at Amazon and IMDb to guide you through what matters, and what doesn’t.

The topics in this course come from an analysis of real requirements in data scientist job listings from the biggest tech employers. We'll cover the machine learning and data mining techniques real employers are looking for, including:

  • Regression analysis
  • K-Means Clustering
  • Principal Component Analysis
  • Train/Test and cross validation
  • Bayesian Methods
  • Decision Trees and Random Forests
  • Multivariate Regression
  • Multi-Level Models
  • Support Vector Machines
  • Reinforcement Learning
  • Collaborative Filtering
  • K-Nearest Neighbor
  • Bias/Variance Tradeoff
  • Ensemble Learning
  • Term Frequency / Inverse Document Frequency
  • Experimental Design and A/B Tests

...and much more! There's also an entire section on machine learning with Apache Spark, which lets you scale up these techniques to "big data" analyzed on a computing cluster.

If you're new to Python, don't worry - the course starts with a crash course. If you've done some programming before, you should pick it up quickly. This course shows you how to get set up on Microsoft Windows-based PCs; the sample code will also run on macOS or Linux desktop systems, but I can't provide OS-specific support for them.

Each concept is introduced in plain English, avoiding confusing mathematical notation and jargon. It’s then demonstrated using Python code you can experiment with and build upon, along with notes you can keep for future reference.

If you’re a programmer looking to switch into an exciting new career track, or a data analyst looking to make the transition into the tech industry – this course will teach you the basic techniques used by real-world industry data scientists. I think you'll enjoy it!

Who is the target audience?
  • Software developers or programmers who want to transition into the lucrative data science career path will learn a lot from this course.
  • Data analysts in the finance or other non-tech industries who want to transition into the tech industry can use this course to learn how to analyze data using code instead of tools. But, you'll need some prior experience in coding or scripting to be successful.
  • If you have no prior coding or scripting experience, you should NOT take this course - yet. Go take an introductory Python course first.
Curriculum For This Course
71 Lectures
Getting Started
6 Lectures 40:04

What to expect in this course, who it's for, and the general format we'll follow.

Preview 02:44

We'll show you where to download the scripts and sample data used in this course, and where to put it.

[Activity] Getting What You Need

We'll install our Python 3.5 environment, Enthought Canopy, and install the Python libraries and packages we'll need for this course. When we're done, we'll do a quick test of running a real Python notebook!

[Activity] Installing Enthought Canopy

In a crash course on Python and what's different about it, we'll cover the importance of whitespace in Python scripts, how to import Python modules, and Python data structures including lists, tuples, and dictionaries.

Python Basics, Part 1

In part 2 of our Python crash course, we'll cover functions, boolean expressions, and looping constructs in Python.

Preview 09:41

This course presents Python examples in the form of iPython Notebooks, but we'll cover the other ways to run Python code: interactively from the Python shell, or running stand-alone Python script files.

Running Python Scripts
Statistics and Probability Refresher, and Python Practice
12 Lectures 01:39:05

We cover the differences between continuous and discrete numerical data, categorical data, and ordinal data.

Preview 06:58

A refresher on mean, median, and mode - and when it's appropriate to use each.

Mean, Median, Mode

We'll use mean, median, and mode in some real Python code, and set you loose to write some code of your own.

[Activity] Using mean, median, and mode in Python
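
As a small illustration of why the choice between mean, median, and mode matters, here is a minimal sketch using Python's built-in statistics module. The purchases list is made-up sample data, not the data set used in the course.

```python
import statistics

# Hypothetical sample: number of purchases per customer, with one outlier.
purchases = [1, 2, 2, 3, 3, 3, 4, 50]

mean = statistics.mean(purchases)      # pulled upward by the outlier (50)
median = statistics.median(purchases)  # middle value, robust to the outlier
mode = statistics.mode(purchases)      # most frequently occurring value

print("mean:", mean, "median:", median, "mode:", mode)
```

The single large value drags the mean well above the median, which is one reason the median is often preferred for skewed data.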

We'll cover how to compute the variance and standard deviation of a data distribution, and how to do it using some examples in Python.

Preview 11:12
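
The computation can be sketched by hand in a few lines. This assumes a tiny made-up data list; it shows both the population variance (divide by N) and the sample variance (divide by N - 1).

```python
import math

# Hypothetical sample data.
data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)

# Variance is the average of the squared differences from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_diffs) / len(data)
sample_variance = sum(squared_diffs) / (len(data) - 1)

# Standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)

print(population_variance, sample_variance, population_std)
```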

Introducing the concepts of probability density functions (PDFs) and probability mass functions (PMFs).

Probability Density Function; Probability Mass Function

We'll show examples of continuous, normal, exponential, binomial, and Poisson distributions using iPython.

Common Data Distributions

We'll look at some examples of percentiles and quartiles in data distributions, and then move on to the concept of the first four moments of data sets.

[Activity] Percentiles and Moments

An overview of different tricks in matplotlib for creating graphs of your data, using different graph types and styles.

[Activity] A Crash Course in matplotlib

We'll cover the concepts of covariance and correlation, which are used to look for relationships between different sets of attributes, along with some examples in Python.

[Activity] Covariance and Correlation
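
One way to see how the two measures relate is to write them out directly. This is a minimal sketch using the sample (N - 1) form of covariance and the Pearson form of correlation; the helper names and the input lists are illustrative, not taken from the course materials.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled into the range -1 to 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: ys rises in lockstep with xs.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys), correlation(xs, ys))
```

Covariance depends on the units of the inputs, while correlation is normalized, which makes it easier to compare across attribute pairs.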

We cover the concepts and equations behind conditional probability, and use it to try and find a relationship between age and purchases in some fabricated data using Python.

[Exercise] Conditional Probability

Here we'll go over my solution to the exercise I challenged you with in the previous lecture - changing our fabricated data to have no real correlation between ages and purchases, and seeing if you can detect that using conditional probability.

Exercise Solution: Conditional Probability of Purchase by Age
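
The core idea can be sketched with a handful of records. The data below is fabricated for illustration only (it is not the course's generated data), and p_purchase_given_age is a hypothetical helper name.

```python
# Hypothetical records: (age group, whether that person made a purchase).
records = [
    (20, True), (20, False), (20, False), (20, False),
    (30, True), (30, True), (30, False), (30, False),
]

def p_purchase_given_age(records, age):
    """P(purchase | age): the fraction of purchasers within one age group."""
    in_group = [purchased for a, purchased in records if a == age]
    return sum(in_group) / len(in_group)

# Overall P(purchase), ignoring age entirely.
p_purchase = sum(purchased for _, purchased in records) / len(records)

print(p_purchase_given_age(records, 20), p_purchase_given_age(records, 30), p_purchase)
```

If age and purchases were unrelated, each conditional probability would sit close to the overall purchase probability; a large gap suggests a dependency.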

An overview of Bayes' Theorem, and an example of using it to uncover misleading statistics surrounding the accuracy of drug testing.

Preview 05:23
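
To make the drug-testing example concrete, here is a minimal sketch of Bayes' Theorem in code. The prior, sensitivity, and specificity values are assumed numbers chosen for illustration, not figures quoted from the lecture.

```python
def posterior(prior, sensitivity, specificity):
    """P(user | positive test) via Bayes' Theorem."""
    true_positives = sensitivity * prior                # P(positive | user) * P(user)
    false_positives = (1 - specificity) * (1 - prior)   # P(positive | non-user) * P(non-user)
    return true_positives / (true_positives + false_positives)

# Assumed numbers: 0.3% of people are users, and the test is 99% accurate both ways.
p_user_given_positive = posterior(prior=0.003, sensitivity=0.99, specificity=0.99)
print(round(p_user_given_positive, 4))
```

Even with a test that sounds very accurate, a positive result here points to an actual user less than a quarter of the time, because non-users vastly outnumber users.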
Predictive Models
4 Lectures 33:33

We introduce the concept of linear regression and how it works, and use it to fit a line to some sample data using Python.

Preview 11:01
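
For a single input variable, the least-squares line can be computed directly. This is a sketch under the assumption of ordinary least squares; fit_line and the sample points are illustrative and not necessarily how the lecture's notebook does it.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```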

We cover the concepts of polynomial regression, and use it to fit a more complex relationship between page speed and purchases in Python.

Preview 08:04

Multivariate models let us predict some value given more than one attribute. We cover the concept, then use it to build a model in Python to predict car prices based on their number of doors, mileage, and number of cylinders. We'll also get our first look at the statsmodels library in Python.

[Activity] Multivariate Regression, and Predicting Car Prices

We'll just cover the concept of multi-level modeling, as it is a very advanced topic. But you'll get the ideas and challenges behind it.

Multi-Level Models
Machine Learning with Python
13 Lectures 01:17:49

The concepts of supervised and unsupervised machine learning, and how to evaluate the ability of a machine learning model to predict new values using the train/test technique.

Supervised vs. Unsupervised Learning, and Train/Test

We'll apply train/test to a real example using Python.

[Activity] Using Train/Test to Prevent Overfitting a Polynomial Regression
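
The splitting step itself is simple enough to sketch by hand. split_train_test below is a hypothetical helper written for illustration (it is not scikit-learn's function of a similar name), and the 80/20 split is an assumed default.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))
train, test = split_train_test(rows)
print(len(train), len(test))
```

The model is fit only on the training rows and scored on the held-out rows; a model that scores well on training data but poorly on test data is overfitting.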

We'll introduce the concept of Naive Bayes and how we might apply it to the problem of building a spam classifier.

Bayesian Methods: Concepts

We'll actually write a working spam classifier, using real email training data and a surprisingly small amount of code!

Preview 08:05

K-Means is a way to identify things that are similar to each other. It's a case of unsupervised learning, which could result in clusters you never expected!

K-Means Clustering

We'll apply K-Means clustering to find interesting groupings of people based on their age and income.

[Activity] Clustering people based on income and age
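
As a rough sketch of what K-Means does under the hood, here is a bare-bones version of the assign-then-update loop. The points, the starting centroids, and the fixed iteration count are all assumptions made for illustration; a library implementation would pick starting centroids and stopping criteria for you.

```python
import math

def k_means(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then re-center."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (age, income in thousands) pairs forming two obvious groups.
people = [(25, 30), (27, 32), (26, 31), (55, 90), (57, 95), (56, 92)]
centroids, clusters = k_means(people, centroids=[(25, 30), (55, 90)])
print(centroids)
```

In practice the features would usually be scaled first, since an attribute with a larger numeric range (such as income) would otherwise dominate the distance.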

Entropy is a measure of the disorder in a data set - we'll learn what that means, and how to compute it mathematically.

Measuring Entropy
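
Assuming the usual Shannon definition of entropy (in bits), the computation is short. The label lists below are made-up examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over each class proportion p."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

print(entropy(["hire", "hire", "hire", "hire"]))      # all one class: no disorder
print(entropy(["hire", "hire", "no hire", "no hire"]))  # evenly mixed: maximum disorder
```

A set containing a single class has zero entropy, and an even two-class mix has one bit; decision trees use this to pick splits that reduce disorder the most.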

In order to run the next lecture on decision trees, you'll need some software called "GraphViz" installed. Here's how.

[Activity] Install GraphViz

Decision trees can automatically create a flow chart for making some decision, based on machine learning! Let's learn how they work.

Preview 08:43

We'll create a decision tree and an entire "random forest" to predict hiring decisions for job candidates.

[Activity] Decision Trees: Predicting Hiring Decisions

Random Forests was an example of ensemble learning; we'll cover other techniques for combining the results of many models to create a better result than any one could produce on its own.

Ensemble Learning

Support Vector Machines are an advanced technique for classifying data that has multiple features. They treat those features as dimensions, and partition this higher-dimensional space using "support vectors."

Support Vector Machines (SVM) Overview

We'll use scikit-learn to easily classify people using a C-Support Vector Classifier.

[Activity] Using SVM to cluster people using scikit-learn
Recommender Systems
6 Lectures 49:10

One way to recommend items is to look for other people similar to you based on their behavior, and recommend stuff they liked that you haven't seen yet.

Preview 07:57

The shortcomings of user-based collaborative filtering can be solved by flipping it on its head, and looking at relationships between items instead of relationships between people.

Item-Based Collaborative Filtering

We'll use the real-world MovieLens data set of movie ratings to take a first crack at finding movies that are similar to each other, which is the first step in item-based collaborative filtering.

[Activity] Finding Movie Similarities

Our initial results for movies similar to Star Wars weren't very good. Let's figure out why, and fix it.

[Activity] Improving the Results of Movie Similarities

We'll implement a complete item-based collaborative filtering system that uses real-world movie ratings data to recommend movies to any user.

Preview 10:22

As a student exercise, try some of my ideas - or some ideas of your own - to make the results of our item-based collaborative filter even better.

[Exercise] Improve the recommender's results
More Data Mining and Machine Learning Techniques
6 Lectures 52:51

KNN is a very simple supervised machine learning technique; we'll quickly cover the concept here.

K-Nearest-Neighbors: Concepts

We'll use the simple KNN technique and apply it to a more complicated problem: finding the most similar movies to a given movie just given its genre and rating information, and then using those "nearest neighbors" to predict the movie's rating.

[Activity] Using KNN to predict a rating for a movie
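
The KNN idea can be sketched in a few lines: find the K closest known examples and average their values. The feature vectors, ratings, and the knn_predict name are illustrative assumptions, and plain Euclidean distance stands in for whatever distance metric the activity actually uses.

```python
import math

def knn_predict(query, examples, k=3):
    """Average the values of the k examples closest to the query point."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical (feature vector, rating) pairs.
examples = [
    ((0, 0), 1.0),
    ((0, 1), 2.0),
    ((1, 0), 3.0),
    ((10, 10), 5.0),  # far away, so it should not influence the prediction
]
print(knn_predict((0, 0), examples, k=3))
```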

Data that includes many features or many different vectors can be thought of as having many dimensions. Often it's useful to reduce those dimensions down to something more easily visualized, for compression, or to just distill the most important information from a data set (that is, information that contributes the most to the data's variance). Principal Component Analysis and Singular Value Decomposition do that.

Dimensionality Reduction; Principal Component Analysis

We'll use scikit-learn's built-in PCA system to reduce the 4-dimensional Iris data set down to 2 dimensions, while still preserving most of its variance.

[Activity] PCA Example with the Iris data set

Cloud-based data storage and analysis systems like Hadoop, Hive, Spark, and MapReduce are turning the field of data warehousing on its head. Instead of extracting, transforming, and then loading data into a data warehouse, the transformation step is now more efficiently done using a cluster after it's already been loaded. With computing and storage resources so cheap, this new approach now makes sense.

Data Warehousing Overview: ETL and ELT

We'll describe the concept of reinforcement learning - including Markov Decision Processes, Q-Learning, and Dynamic Programming - all using a simple example of developing an intelligent Pac-Man.

Preview 12:44
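
The heart of Q-Learning is a single update rule, sketched below. The state and action names, the learning rate alpha, and the discount factor gamma are all assumed values for illustration; this is not code from the lecture.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Nudge Q(state, action) toward reward + discounted best future value."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

# Q-values default to 0.0 for state/action pairs we have not seen yet.
q = defaultdict(float)
actions = ["left", "right"]

# Hypothetical step: moving right from s0 earned a reward of 1 and led to s1.
q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(q[("s0", "right")])
```

Repeating this update over many explored moves gradually builds a table of which action is most valuable in each state.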
Dealing with Real-World Data
6 Lectures 45:38

Bias and Variance both contribute to overall error; understand these components of error and how they relate to each other.

Bias/Variance Tradeoff

We'll introduce the concept of K-Fold Cross-Validation to make train/test even more robust, and apply it to a real model.

[Activity] K-Fold Cross-Validation to avoid overfitting
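
The bookkeeping behind K-Fold can be sketched as a generator of index lists. k_fold_indices is a hypothetical helper for illustration; it does no shuffling, which a real workflow would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder across the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, and the model's scores across the folds are averaged to give a steadier estimate than a single train/test split.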

Cleaning your raw input data is often the most important, and time-consuming, part of your job as a data scientist!

Preview 07:10

In this example, we'll try to find the top-viewed web pages on a web site - and see how much data pollution makes that into a very difficult task!

[Activity] Cleaning web log data

A brief reminder: some models require input data to be normalized, or within the same range as each other. Always read the documentation on the techniques you are using.

Normalizing numerical data

A review of how outliers can affect your results, and how to identify and deal with them in a principled manner.

[Activity] Detecting outliers
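
One simple rule of thumb, sketched here as an assumption rather than the lecture's exact method, is to drop values that sit more than a chosen number of standard deviations from the median. The incomes list and the threshold of 2 are made up for illustration.

```python
import statistics

def remove_outliers(values, max_deviations=2.0):
    """Keep values within max_deviations standard deviations of the median."""
    center = statistics.median(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - center) <= max_deviations * spread]

# Hypothetical incomes in thousands, with one extreme value.
incomes = [40, 42, 45, 47, 50, 52, 55, 1000]
cleaned = remove_outliers(incomes)
print(cleaned)
```

Whether a point should really be discarded depends on why it is extreme, so the threshold should be a deliberate choice rather than a default.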
Apache Spark: Machine Learning on Big Data
10 Lectures 01:36:31

We'll present an overview of the steps needed to install Apache Spark on your desktop in standalone mode, and get started by getting a Java Development Kit installed on your system.

[Activity] Installing Spark - Part 1

We'll install Spark itself, along with all the associated environment variables and ancillary files and settings needed for it to function properly.

[Activity] Installing Spark - Part 2

A high-level overview of Apache Spark, what it is, and how it works.

Spark Introduction

We'll go in more depth on the core of Spark - the RDD object, and what you can do with it.

Spark and the Resilient Distributed Dataset (RDD)

A quick overview of MLLib's capabilities, and the new data types it introduces to Spark.

Introducing MLLib

We'll take the same problem from our earlier Decision Tree lecture - predicting hiring decisions for job candidates - but implement it using Spark and MLLib!

Preview 16:00

We'll take the same example of clustering people by age and income from our earlier K-Means lecture - but solve it in Spark!

[Activity] K-Means Clustering in Spark

We'll introduce the concept of TF-IDF (Term Frequency / Inverse Document Frequency) and how it applies to search problems, in preparation for using it with MLLib.

Preview 06:44
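
Before scaling it up, the TF-IDF score can be sketched in plain Python. This uses one common formulation (term frequency times the natural log of total documents over documents containing the term); MLLib's own weighting may differ in details, and the tiny corpus is invented for illustration.

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency in one document, weighted by rarity across the corpus."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf

# Hypothetical corpus: each document is a list of lowercase words.
corpus = [
    ["spark", "is", "fast"],
    ["spark", "runs", "on", "clusters"],
    ["python", "is", "popular"],
]
print(tf_idf("fast", corpus[0], corpus), tf_idf("spark", corpus[0], corpus))
```

A word that appears in fewer documents scores higher, which is why TF-IDF is useful for ranking search results by relevance.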

Let's use TF-IDF, Spark, and MLLib to create a rudimentary search engine for real Wikipedia pages!

[Activity] Searching Wikipedia with Spark

Spark 2.0 introduced a new API for MLLib based on DataFrame objects; we'll look at an example of using this to create and use a linear regression model.

[Activity] Using the Spark 2.0 DataFrame API for MLLib
Experimental Design
5 Lectures 33:16

Running controlled experiments on your website usually involves a technique called the A/B test. We'll learn how they work.

A/B Testing Concepts

How to determine the significance of an A/B test's results, and measure the probability of the results being just from random chance, using T-Tests, the T-statistic, and the P-value.

T-Tests and P-Values

We'll fabricate A/B test data from several scenarios, and measure the T-statistic and P-Value for each using Python.

[Activity] Hands-on With T-Tests
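
The T-statistic itself can be computed by hand. This sketch assumes the Welch form (unequal variances) and uses tiny invented samples; turning the statistic into a P-value needs the t-distribution, which is normally left to a statistics library.

```python
import math
import statistics

def t_statistic(a, b):
    """Welch's t-statistic: difference in means over its standard error."""
    standard_error = math.sqrt(statistics.variance(a) / len(a)
                               + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Hypothetical order amounts for the control (A) and treatment (B) groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(t_statistic(group_a, group_b))
```

A T-statistic near zero means the groups look alike; a large magnitude suggests a real difference, with the sign showing which group came out ahead.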

Some A/B tests just don't affect customer behavior one way or another. How do you know how long to let an experiment run for before giving up?

Determining How Long to Run an Experiment

There are many limitations associated with running short-term A/B tests - novelty effects, seasonal effects, and more can lead you to the wrong decisions. We'll discuss the forces that may result in misleading A/B test results so you can watch out for them.

Preview 09:26
You made it!
3 Lectures 06:03

Where to go from here - recommendations for books, websites, and career advice to get you into the data science job you want.

More to Explore

If you enjoyed this course, please leave a star rating for it!

Don't Forget to Leave a Rating!

Let's stay in touch! Head to my website for discounts on my other courses, and to follow me on social media. You'll also find info on getting this course in printed book form!

Bonus Lecture: Discounts on my Spark and MapReduce courses!
About the Instructor
Sundog Education by Frank Kane
4.5 Average rating
15,245 Reviews
73,374 Students
9 Courses
Training the World in Big Data and Machine Learning

Sundog Education's mission is to make highly valuable career skills in big data, data science, and machine learning accessible to everyone in the world. Our consortium of expert instructors shares our knowledge in these emerging fields with you, at prices anyone can afford. 

Sundog Education is led by Frank Kane and owned by Frank's company, Sundog Software LLC. Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.

Frank Kane
4.5 Average rating
14,853 Reviews
69,690 Students
7 Courses
Founder, Sundog Education

Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.