Pro data science in Python

Rating: 4.0 (39 reviews)

Learn Keras, Deep Learning, Scikit-learn, Pandas and Statsmodels

Created by Francisco Juretig

Last updated 4/2017

English

What you'll learn

Use complex scikit-learn tools for machine learning
Do statistical analysis using Statsmodels
Read, transform and manipulate data using Pandas
Use Keras for neural networks
Solve both supervised and unsupervised machine learning problems
Do time series analysis and forecasting using Statsmodels
Classify images using Deep Convolutional Networks

Course content

12 sections • 47 lectures • 11h 20m total length

Introduction • 4:20

Loading data in Pandas • 11:36
Reading a CSV file via Pandas, and doing some basic data manipulation.
Looping through Pandas Datasets - Lambda expressions • 6:42
Lambda expressions let us apply a function to every element of a Pandas dataframe, looping in a natural way.
Merging data • 19:39
We review the different ways of merging data in Pandas: full, inner, left and right joins.
Grouping data in Pandas • 8:16
Building aggregate analyses is a fundamental technique for validating our data and results before jumping into the actual machine learning algorithms. For example, we can compare our results against aggregated reports that we might receive, thus validating that our data is in good shape.
Pivoting data in Pandas • 12:31
Pivoting our dataframes is quite easy in Pandas. We show how to transform a dataframe from the long format to the wide format and vice versa.
Pandas
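
The lambda, merging, grouping and pivoting steps described in these lectures can be sketched as follows. This is only an illustration: the two toy tables and their column names are hypothetical and are not the course's data.

```python
import pandas as pd

# Hypothetical toy tables (assumption: the course works with its own CSV files).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "C"],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "revenue": [10.0, 12.0, 7.0, 5.0],
})
regions = pd.DataFrame({
    "store": ["A", "B", "D"],
    "region": ["North", "South", "East"],
})

# Lambda expression applied to every element of a column.
sales["size"] = sales["revenue"].apply(lambda r: "large" if r >= 10 else "small")

# The `how` argument selects the join type; "outer" is the full join.
inner = sales.merge(regions, on="store", how="inner")
left = sales.merge(regions, on="store", how="left")
right = sales.merge(regions, on="store", how="right")
full = sales.merge(regions, on="store", how="outer")

# Aggregate report: total revenue per store.
revenue_per_store = sales.groupby("store")["revenue"].sum()

# Long to wide: one row per store, one column per quarter.
wide = sales.pivot(index="store", columns="quarter", values="revenue")

# Wide back to long, dropping the store/quarter pairs that had no data.
long_again = (
    wide.reset_index()
    .melt(id_vars="store", var_name="quarter", value_name="revenue")
    .dropna()
)
```

Comparing `len(inner)`, `len(left)`, `len(right)` and `len(full)` is a quick way to see which rows each join keeps.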

Intro to time series • 19:52
Basic ideas behind ARIMA modelling: why we need stationary series, and how we can decompose a stationary series into the sum of AR and MA terms.
Forecasting the US GDP: Part 1 • 19:03
Identifying the AR and MA orders of the United States GDP series by inspecting the ACF and PACF. Ensuring that the series is stationary by using the ADF test.
Forecasting the US GDP: Part 2 • 15:41
Building an actual model for the GDP of the US. The essential ARIMA() parameters. Making sure that the residuals are valid, and making the predictions for the next quarters.
Forecasting London property prices • 14:48
Forecasting the prices of new houses in London using the techniques that we learnt in the previous lectures.
Forecasting

Naive Bayes - Bernoulli - Multinomial • 18:45
We briefly review the Bayesian ideas behind Naive Bayes. We then explain when to use the Bernoulli or the multinomial variant, depending on the assumptions we make about the data.
Detecting spam in SMS • 19:55
We use Bernoulli and Multinomial Naive Bayes classifiers to predict spam in a real SMS dataset from Kaggle. We finally achieve 96% accuracy (in sample), versus the 86% that we would have obtained by always predicting non-spam (the proportion of non-spam messages among all SMS). This probably gives spammers a good reason to hate machine learning!
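
The difference between the two Naive Bayes variants can be sketched on a handful of made-up messages. Everything here is an assumption for illustration: the messages and labels are invented and are not the Kaggle SMS dataset. The multinomial model uses word counts, while the Bernoulli model uses only word presence or absence.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented toy messages (assumption: not the Kaggle SMS data). 1 = spam, 0 = not spam.
messages = [
    "win a free prize now",
    "free entry claim your prize",
    "urgent call now to win cash",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after lunch",
]
labels = [1, 1, 1, 0, 0, 0]

# Multinomial Naive Bayes models how many times each word occurs.
counts = CountVectorizer()
multinomial = MultinomialNB().fit(counts.fit_transform(messages), labels)

# Bernoulli Naive Bayes models whether each word occurs at all.
presence = CountVectorizer(binary=True)
bernoulli = BernoulliNB().fit(presence.fit_transform(messages), labels)

new_message = ["claim your free cash prize"]
pred_multinomial = multinomial.predict(counts.transform(new_message))
pred_bernoulli = bernoulli.predict(presence.transform(new_message))
print(pred_multinomial, pred_bernoulli)
```

On a real dataset, the accuracy of either model should be compared against the share of the majority class, as the lecture does with its 86% baseline.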
Linear Support Vector Machines: SVM (SVM and LinearSVC) • 18:00
We introduce SVM within a very simple (linear) context. Even though it is an extremely powerful algorithm, it tends to generate too many support vectors, possibly over-fitting the data. Is there a solution to that? And even though SVM is famous as a classification tool, we will see how it can be used as a very powerful regression tool.
Lasso - Ridge • 19:49
We show how to run a linear regression model via ordinary least squares (OLS), LASSO, and Ridge. We see how LASSO can reduce the dimensionality of a feature set, and how Ridge can estimate using a correlated feature set. Both produce biased models, but ones that can generate more stable predictions. In the example analyzed here, all models end up with a very similar "score", so we can't conclude that any one is "better" than another in terms of prediction. But we show how LASSO can generate a model that competes really well with Ridge and OLS, even with high correlation, while correctly reducing the dimensionality of the problem. We also show how to use the LassoCV and RidgeCV estimators, which automatically compute the regularization parameter we need for those methods, even though in this case they bring no noticeable improvement.
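
As a rough illustration of the comparison described above, the sketch below fits OLS, LassoCV and RidgeCV on synthetic data with two highly correlated features and three irrelevant ones. The data-generating process, the grid of Ridge penalties and the number of folds are all assumptions made for this example, not the course's setup.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV

# Synthetic data (assumption): x2 is almost a copy of x1, the rest is pure noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
irrelevant = rng.normal(size=(n, 3))
X = np.column_stack([x1, x2, irrelevant])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)                        # penalty chosen by cross-validation
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

for name, model in [("OLS", ols), ("LASSO", lasso), ("Ridge", ridge)]:
    # score() is the in-sample R^2; coefficients show which features survive.
    print(name, round(model.score(X, y), 3), np.round(model.coef_, 2))

print("chosen penalties:", lasso.alpha_, ridge.alpha_)
```

Coefficients that LASSO sets exactly to zero correspond to features dropped from the model, which is the dimensionality reduction the lecture refers to.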
Decision Trees • 19:24
We review the tree functions available in scikit-learn, both for classification and regression.
Introduction to ensemble methods • 5:11
The best performing methods nowadays rely on building smaller models and then averaging them (or choosing among them). Many of the winning algorithms in Kaggle competitions do exactly this. We describe the two big families of ensemble methods: (A) averaging ensemble methods and (B) boosting ensemble methods.
Averaging ensemble methods: Part 1: Bagging • 15:43
We introduce one of the very best functions in scikit-learn: ensemble.BaggingClassifier. It allows us to plug any estimator into an ensemble, reducing the variance of our estimator and typically performing much better in out-of-sample scenarios.
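
The idea of plugging an estimator into a bagging ensemble can be sketched as follows. The synthetic dataset, the choice of a decision tree as base estimator and the number of estimators are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem (assumption: stands in for a real dataset).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# A single, fully grown tree: flexible but unstable from sample to sample.
tree = DecisionTreeClassifier(random_state=0)

# The same tree fitted on 50 bootstrap samples, with predictions combined by voting.
bagged = BaggingClassifier(tree, n_estimators=50, random_state=0)

# Cross-validated accuracy approximates out-of-sample performance.
tree_score = cross_val_score(tree, X, y, cv=5).mean()
bagged_score = cross_val_score(bagged, X, y, cv=5).mean()
print(f"single tree: {tree_score:.3f}, bagged trees: {bagged_score:.3f}")
```

Any scikit-learn classifier can be passed in place of the tree; whether bagging helps depends on how unstable the base estimator is.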
Averaging ensemble methods: Part 2: Random forests • 15:08
Because trees are used frequently in an ensemble context, scikit-learn has specific classes for this. We focus on ExtraTreesClassifier, ExtraTreesRegressor, RandomForestClassifier and RandomForestRegressor.
Boosting ensemble methods • 11:37
Boosting is a process of generating simple classifiers and then improving them. We focus on AdaBoost, a simple idea with very solid results for image processing, text classification, and general ML.

Principal components • 18:37
In ML, we typically deal with hundreds (if not thousands) of features, and for many reasons (plotting, modelling, identifying rare observations) we need to reduce that set. We show how to use scikit-learn to compute PCA, and later project that same data into a low-dimensional space. After that, we plot that data, understand which features move in similar directions, which features have high loadings on the principal components, and even identify unusual observations.
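
A minimal sketch of computing PCA, projecting the data and reading the loadings is shown below. The data is synthetic (six features driven by two hidden factors), and standardizing the features first is an assumption of this example rather than something the lecture states.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): six observed features driven by two hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(150, 6))

# Put all features on the same scale before extracting components.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)   # the data projected onto two dimensions
loadings = pca.components_.T        # one row per feature, one column per component
explained = pca.explained_variance_ratio_

print("share of variance per component:", np.round(explained, 3))
print("loadings:")
print(np.round(loadings, 2))
```

Plotting the two columns of `scores` against each other gives the low-dimensional view in which unusual observations tend to stand out.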
K-Means • 10:47
When we have M observations that we want to group into L groups, there is no easier way than K-Means. We review how to use it in scikit-learn, and show when it does not perform as expected.
DBSCAN • 11:55
We review the theory behind one of the best clustering algorithms available nowadays: how it estimates the density and when it considers a point to be an outlier. We review some tuning strategies for its parameters.
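
How DBSCAN separates dense regions from outliers can be sketched on synthetic points. The two blobs, the isolated point and the values of `eps` and `min_samples` are all assumptions chosen so that the behaviour is easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (assumption): two tight blobs plus one isolated point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.2, size=(50, 2))
isolated = np.array([[2.5, -4.0]])
X = np.vstack([blob_a, blob_b, isolated])

# eps is the neighbourhood radius; min_samples is how many points must fall
# inside that radius for a point to count as part of a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points that belong to no dense region receive the label -1.
n_clusters = len(set(labels) - {-1})
print("clusters found:", n_clusters)
print("label of the isolated point:", labels[-1])
```

Shrinking `eps` or raising `min_samples` makes the algorithm stricter, so more points end up labelled as outliers.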
Clustering and PCA on real countries data from Kaggle • 19:54
We use a dataset containing multiple human development indexes to cluster the countries into 3 groups. We show that both K-Means and PCA + K-Means (with one principal component extracted) achieve practically the same results. Finally, we report the results per cluster and present some insights.
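
The comparison between K-Means on all features and K-Means on a single principal component could be set up roughly as below. The data is synthetic rather than the Kaggle countries dataset, and using the adjusted Rand index to measure agreement is an assumption of this sketch; the lecture does not say how the two results were compared.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the countries data (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_std = StandardScaler().fit_transform(X)

# K-Means on all standardized features.
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# K-Means on the first principal component only.
first_pc = PCA(n_components=1).fit_transform(X_std)
labels_pc = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(first_pc)

# 1.0 means identical groupings, values near 0 mean agreement no better than chance.
agreement = adjusted_rand_score(labels_full, labels_pc)
print(f"agreement between the two clusterings: {agreement:.3f}")
```

How close the two clusterings are depends on how much of the group structure the first component captures, so the result will vary from dataset to dataset.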

Reading WAV files and extracting features • 16:44
We have multiple recordings per word: "Banana", "Chair", "IceCream", "Hello", "Goodbye". We want to extract some metrics from each file so that we can do machine learning later. The difficult part is that the metrics we need relate to the audio signal encoded in each file. Luckily, we can leverage an existing R package that reads .wav files and outputs many properties of the frequencies present in each file. At the end, we produce 2 CSV files (one for training and one for testing) containing 21 features that we can use later for machine learning. The approach presented here can be extended to situations requiring the classification of any sound.
Classifying words using AdaBoost and SVM • 15:47
We load the features that we extracted before, for both our training and testing datasets, and evaluate the performance of AdaBoost and SVM. Both methods achieve an in-sample accuracy of practically 100%, a cross-validation accuracy of 80%, and an out-of-sample accuracy of 80%.

Requirements

Some experience with data science, Python and statistics
Being able to code functions, and understand a Python program
Understand the basics behind regression, random variables, and classification

Description

This course explores several data science and machine learning techniques that every data science practitioner should be familiar with. Fundamentally, the course revolves around four axes:

Pandas and Matplotlib for working with data
Keras for Deep Learning
Scikit-learn for machine learning
Statsmodels for statistics

This course explores the fundamental concepts in these big four topics, and provides the student with an overview of the problems that can be solved nowadays.

I focus only on the computational and practical implications of these techniques, and it is assumed that the student is partially familiar with statistics, ML and data science, or is willing to complement the techniques presented here with theoretical material. Python programming experience is absolutely necessary, as the only Python topic we explain is how to define classes (since we will use them throughout the course).

The teaching strategy is to briefly explain the theory behind these techniques, show how they work in very simple problems, and finally present the student with some real examples. I believe that these real examples add enormous value for the student, as they help explain why these techniques are so widely used nowadays (because they solve real problems!).

Some of the examples that we will tackle here are: forecasting the GDP of the United States, forecasting the prices of new houses in London, identifying squares and triangles in pictures, predicting the value of vehicles using online data, detecting spam in SMS data, and many more!

In a nutshell, this course explains how to:

Define classes for storing data in a better way
Plot data
Merge, pivot, subset, and group data via Pandas
Use linear regression via Statsmodels
Work with time series/forecasting in Statsmodels
Apply several unsupervised machine learning techniques, such as clustering
Apply several supervised techniques, such as random forests, classification trees, Naive Bayes classifiers, etc.
Define Deep Learning architectures using Keras
Design different neural networks, such as recurrent neural networks, multi-layer perceptrons, etc.
Classify audio/sounds in a way similar to what Alexa, Siri and Cortana do using machine learning

The student needs to be familiar with statistics, Python and some machine learning concepts

Who this course is for:

Data science beginners, and intermediate users
Statisticians, and CS students wanting to strengthen their data science skills

Course content

Introduction • 1 lecture • 4min

Object Oriented programming in Python • 2 lectures • 37min

Pandas • 5 lectures • 59min

Plotting • 3 lectures • 23min

Linear regression in Statsmodels • 3 lectures • 50min

Time Series in Statsmodels • 4 lectures • 1hr 9min

Introduction to machine learning • 2 lectures • 9min

Machine learning with Scikit-learn: Supervised problems • 9 lectures • 2hr 24min

Machine learning with Scikit-learn: Unsupervised problems • 4 lectures • 1hr 1min

Processing sound and identifying words in Audio • 2 lectures • 33min