Real data science problems with Python
Practice machine learning and data science with real problems
4.0 (18 ratings)
339 students enrolled
Created by Francisco Juretig
Last updated 6/2017
English
Current price: $10 Original price: $50 Discount: 80% off
30-Day Money-Back Guarantee
Includes:
  • 7.5 hours on-demand video
  • 50 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Work with many ML techniques on real problems such as classification, image processing, and regression
  • Build neural networks for classification and regression
  • Apply machine learning and data science to audio processing, image detection, real-time video, sentiment analysis, and much more
Requirements
  • Some experience with Python
  • General knowledge of machine learning and statistics
Description

This course explores a variety of machine learning and data science techniques using real-life datasets, images, and audio collected from several sources. These realistic situations are much better than dummy examples, because they force the student to think through the problem, pre-process the data properly, and evaluate the performance of the predictions in several ways.

The datasets used here come from different sources such as Kaggle, US Data.gov, and CrowdFlower. Each lecture shows how to preprocess the data, model it using an appropriate technique, and measure how well that technique works on the specific problem. Some lectures also compare multiple techniques and discuss which one performs best. Naturally, all the code is shared here, and you can contact me if you have any questions. Every lecture can also be downloaded, so you can enjoy them while travelling.

The student should already be familiar with Python and some data science techniques. In each lecture we discuss some technical details of each method, but we do not invest much time in explaining the underlying mathematical principles.

Some of the techniques presented here are: 

  • Pure image processing using OpenCV
  • Convolutional neural networks using Keras-Theano
  • Logistic and naive Bayes classifiers
  • AdaBoost, Support Vector Machines for regression and classification, Random Forests
  • Real-time video processing, Multilayer Perceptrons, Deep Neural Networks, etc.
  • Linear regression
  • Penalized estimators
  • Clustering
  • Principal components

The modules/libraries used here are:

  • Scikit-learn
  • Keras-theano
  • Pandas
  • OpenCV

Some of the real examples used here:

  • Predicting the GDP based on socio-economic variables
  • Detecting human parts and gestures in images
  • Tracking objects in real time video
  • Machine learning on speech recognition
  • Detecting spam in SMS messages
  • Sentiment analysis using Twitter data
  • Counting objects in pictures and retrieving their position
  • Forecasting London property prices
  • Predicting whether people earn more than a 50K threshold based on US Census data
  • Predicting the nuclear output of US based reactors
  • Predicting the house prices for some US counties
  • And much more...

The motivation for this course is that many students willing to learn data science and machine learning are usually stuck with dummy datasets that are not challenging enough. This course aims to ease the transition between knowing machine learning and doing real machine learning in real situations.

Who is the target audience?
  • Intermediate Python users with some knowledge on data science
  • Students wanting to practice with real datasets
  • Students who know some machine learning, but want to apply scikit-learn and Keras (Theano/TensorFlow) to real problems they will encounter in the analytics industry
Curriculum For This Course
31 Lectures
07:43:41
Introduction
1 Lecture 12:21
Wines
1 Lecture 11:17

We explain the importance of grid search cross-validation, using a real example of wine classification based on chemical characteristics. In this example we easily improve our classification rate by 5%, by choosing the max_depth of the classification trees that we feed to AdaBoost.

Predicting Wine characteristics - Using GridSearchCV
11:17
Doing Machine learning with Audio - Classifying sounds
3 Lectures 39:58

We have multiple recordings per word: "Banana", "Chair", "IceCream", "Hello", "Goodbye". We want to extract some metrics from each file, so we can do machine learning later. The difficult part is that the metrics we need relate to the signal encoded in each audio file. Luckily, we can leverage an existing R package that reads .wav files and outputs many properties of the frequencies operating in each file. At the end, we produce two CSV files (one for training and one for testing) containing 21 features that we can later use for machine learning. The approach presented here can be extended to any situation requiring the classification of sounds.
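The lecture relies on an R package for the extraction step; as a hedged Python-only sketch, a few simple spectral features can be pulled from a WAV file with SciPy (the features below are illustrative, not the 21 computed in the course):

```python
import numpy as np
from scipy.io import wavfile
from scipy.fft import rfft, rfftfreq

def wav_features(path):
    """Extract a few simple spectral features from a WAV file."""
    rate, signal = wavfile.read(path)
    signal = signal.astype(float)
    if signal.ndim > 1:              # down-mix stereo to mono
        signal = signal.mean(axis=1)
    spectrum = np.abs(rfft(signal))
    freqs = rfftfreq(len(signal), d=1.0 / rate)
    power = spectrum / spectrum.sum()
    return {
        "centroid_hz": float((freqs * power).sum()),   # spectral centroid
        "peak_hz": float(freqs[spectrum.argmax()]),    # dominant frequency
        "rms": float(np.sqrt((signal ** 2).mean())),   # loudness proxy
    }
```

Looping this function over a directory of recordings and writing one row per file (e.g. with pandas) yields the kind of training/testing CSVs the lecture produces.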

Reading WAV files and extracting features
16:44

We load the features that we extracted before, for both our training and testing datasets, and evaluate the performance of both AdaBoost and SVM. Both methods achieve practically 100% in-sample accuracy, 80% cross-validation accuracy, and 80% out-of-sample accuracy.
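The course data is not bundled here, so as a rough sketch the same AdaBoost-versus-SVM comparison can be run on synthetic data standing in for the 21 audio features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# stand-in for the 21 audio features and 5 word classes of the lecture
X, y = make_classification(n_samples=200, n_features=21, n_informative=8,
                           n_classes=5, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),  # SVMs need scaled inputs
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f} cross-validation accuracy")
```

Comparing cross-validated scores (rather than in-sample accuracy, which is near 100% for both) is what makes the 80% figures in the lecture meaningful.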

Classifying words using Adaboost and SVM
15:47

We design an MLP neural network for classifying the audio files we used in the previous lecture. In this case, however, we get roughly the same out-of-sample accuracy as before, around 72%, so the extra effort of configuring and running a neural net was not justified.

Classifying words using Multilayer Perceptron Deep Neural networks
07:27
Nuclear reactors in the US
2 Lectures 28:43

We use official data from the US Nuclear Regulatory Commission to predict the percentage usage of the existing reactors in the US. We test both Multilayer Perceptrons and Support Vector Regression (SVR). In this case, however, neither method performs well, which is a good reminder that machine learning cannot always predict everything.

Predicting nuclear output in the US via MLP and SVR
15:00

We use a deep neural network to predict the output of US commercial reactors, but instead of predicting one value per observation, we will predict multiple ones. Sounds hard? It's quite easy using Keras.
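In Keras, multi-output prediction simply means a final Dense layer with several units; the same idea can be sketched with scikit-learn's MLPRegressor, which accepts a 2-D target natively (the data below is synthetic, standing in for the reactor data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                # stand-in reactor features
# two targets per observation instead of one
Y = np.column_stack([X[:, 0] + 0.5 * X[:, 1], X[:, 2] - X[:, 3]])
Y += 0.05 * rng.normal(size=Y.shape)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X_tr, Y_tr)                          # 2-D Y trains a multi-output net
pred = net.predict(X_te)
print(pred.shape)                            # one column per target
```

The Keras version the lecture uses would end with something like `Dense(2)` so that the network emits both values at once.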

Multi-output neural networks
13:43
Clustering
1 Lecture 19:54
K-Means and PCA on a real dataset containing data for 168 countries
19:54
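A minimal sketch of the PCA + K-Means workflow, on random stand-in data instead of the 168-country dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(168, 10))   # stand-in: 10 indicators for 168 countries

X_std = StandardScaler().fit_transform(X)           # indicators differ in units
coords = PCA(n_components=2).fit_transform(X_std)   # 2-D projection for plotting
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)
print(coords.shape, np.unique(labels))
```

Plotting `coords` coloured by `labels` is the usual way to see whether the clusters separate in the first two principal components.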
Used car prices for German Ebay
1 Lecture 19:56

We use a real Kaggle example containing 350K observations of used cars on eBay Germany. The problem is that constructing the full feature matrix is not viable, as we would end up with a NumPy matrix of over 250 columns and 350K observations. Such a matrix will not fit into RAM, and we won't even be able to call Keras (and even if we somehow could, it would not work).

We therefore train the model using train_on_batch(), feeding it batches of around 17K observations. With this incremental approach, we can easily construct the matrix for each batch as it is created. We finally fit a deep neural network, achieving a mean absolute error of around 1,500 euros per car.
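The same incremental idea behind Keras's train_on_batch() can be sketched with scikit-learn's partial_fit, building the feature matrix one batch at a time (synthetic data and illustrative batch sizes):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = rng.normal(size=20)
model = SGDRegressor(random_state=0)

# stream the data in batches instead of building one giant matrix
for step in range(20):
    X_batch = rng.normal(size=(1000, 20))  # build features for this batch only
    y_batch = X_batch @ true_w + 0.1 * rng.normal(size=1000)
    model.partial_fit(X_batch, y_batch)    # Keras analogue: model.train_on_batch

X_test = rng.normal(size=(100, 20))
mae = np.abs(model.predict(X_test) - X_test @ true_w).mean()
print(f"test MAE: {mae:.3f}")
```

Only one batch's feature matrix is ever in memory at a time, which is what makes the 350K-row, 250-column problem tractable.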

Preview 19:56
Identifying poisonous mushrooms
2 Lectures 18:50

We work with a dataset containing multiple features per mushroom, and the objective is to predict whether each mushroom is edible. That is particularly challenging for humans, as there is no clear characteristic or rule that states when a mushroom is poisonous.

Poisonous mushrooms detection using Kaggle Data
09:35

We redo our previous exercise, but now using deep neural networks in Keras. We easily get to 100% accuracy after very few epochs.

Classifying mushrooms using a super GPU on AWS
09:15
Plotting
1 Lecture 16:48

We use a special package in Python that allows us to plot heatmaps directly over Google Maps. This is incredibly useful for visualizing complex patterns in geospatial data. We use this approach to visualize the cameras used by the police in Chicago, and to plot the homicides in the US since 1980.

Heatmaps: plotting traffic camera revenues in Chicago and Homicides in the US
16:48
Useful image classes
2 Lectures 22:37

Images are used frequently in machine learning, both for deep neural networks and for traditional algorithms (SVM, random forests, etc.). We review the basics of image loading and present a class that can read an entire directory and build the matrices needed for machine learning. This class transforms images into black-and-white (0/1) matrices, and should only be used when reading images that are already in black-and-white format.

A class that maps Black&White images to Python objects
17:01

We present a similar class, but designed to accommodate 3-channel image data (RGB images), which we typically need to treat as a 4-dimensional tensor (samples, height, width, channels). This class will be useful for the convolutional neural nets in the next section.

A class that maps RGB Images to Python objects
05:36
Image classification
3 Lectures 54:42

We train a deep convolutional network in Keras to identify hand gestures, achieving excellent accuracy. We explain how to prepare the data and preprocess the images before loading them into Python.

Detecting hands in pictures via Convolutional Neural Networks
19:52

Identifying bolts and nuts in images
15:50

We process images via OpenCV to detect and count nuts. We combine the results of a blob detector with DBSCAN clustering to recover the exact number of nuts appearing in the images (and their positions).
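A hedged sketch of the counting step: given (x, y) blob-detection candidates (synthetic here, standing in for the detector's output), DBSCAN groups nearby detections into one cluster per nut and flags stray detections as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical blob-detector output: (x, y) centres of candidate blobs.
# Each physical nut triggers several nearby detections, plus some noise.
rng = np.random.default_rng(0)
nut_centres = np.array([[50.0, 60.0], [200.0, 80.0], [120.0, 220.0]])
detections = np.vstack([c + rng.normal(scale=3.0, size=(6, 2))
                        for c in nut_centres])
detections = np.vstack([detections, [[400.0, 400.0]]])  # spurious detection

labels = DBSCAN(eps=15, min_samples=3).fit_predict(detections)
clusters = sorted(set(labels) - {-1})        # label -1 marks noise points
positions = np.array([detections[labels == k].mean(axis=0)
                      for k in clusters])
print("nuts found:", len(clusters))
print(positions.round(1))
```

Averaging each cluster's member coordinates recovers both the count and the positions, which is exactly what the lecture combines with the blob detector.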

Identifying bolts and nuts by calculating polygons
19:00
6 More Sections
About the Instructor
Francisco Juretig
4.0 Average rating
128 Reviews
1,171 Students
8 Courses
Mr

I have worked for 7+ years as a statistical programmer in the industry. I am an expert in programming, statistics, data science, and statistical algorithms, and I have wide experience in many programming languages. I am a regular contributor to the R community, with 3 published packages, and an expert SAS programmer. I have contributed to scientific statistical journals; my latest publication is in the Journal of Statistical Software.