This course explores a variety of machine learning and data science techniques using real-life datasets, images, and audio collected from several sources. These realistic situations are far more instructive than toy examples, because they force the student to think through the problem, pre-process the data properly, and evaluate the performance of the predictions in several different ways.
The datasets used here come from different sources such as Kaggle, US Data.gov, and CrowdFlower. Each lecture shows how to preprocess the data, model it using an appropriate technique, and measure how well that technique works on the specific problem. Some lectures also cover multiple techniques, and we discuss which one outperforms the others. Naturally, all the code is shared here, and you can contact me if you have any questions. Every lecture can also be downloaded, so you can enjoy them while travelling.
The student should already be familiar with Python and some data science techniques. In each lecture we discuss some technical details of each method, but we do not spend much time explaining the underlying mathematical principles behind it.
Some of the techniques presented here are:
The modules/libraries used here are:
Some of the real examples used here:
The motivation for this course is that many students willing to learn data science/machine learning are usually stuck with dummy datasets that are not challenging enough. This course aims to ease the transition between knowing machine learning and doing real machine learning in real situations.
We explain the importance of grid-search cross-validation, using a real example: classifying wines by their chemical characteristics. In this example we easily improve our classification rate by 5%, simply by tuning the max_depth of the classification trees that we feed to AdaBoost.
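The tuning step can be sketched as follows. This is a minimal illustration, not the lecture's code: it uses scikit-learn's built-in wine dataset as a stand-in for the course's wine data, and arbitrary candidate depths.

```python
# Tune the max_depth of the trees fed to AdaBoost via 5-fold
# cross-validation, keeping the depth with the best mean accuracy.
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

scores = {}
for depth in [1, 2, 3, 4]:
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                             n_estimators=50, random_state=0)
    scores[depth] = cross_val_score(clf, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "CV accuracy:", round(scores[best_depth], 3))
```

The same search can be run with scikit-learn's GridSearchCV; the explicit loop just makes the cross-validation step visible.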
We have multiple recordings per word: "Banana", "Chair", "IceCream", "Hello", "Goodbye". We want to extract metrics from each file, so we can do machine learning later. The difficult part is that the metrics we need relate to the signal encoded in each audio file. Luckily, we can leverage an existing R package that reads .wav files and outputs many properties of the frequencies present in each file. At the end, we produce 2 CSV files (one for training and one for testing) containing 21 features that we can use later for machine learning. The approach presented here can be extended to any situation requiring the classification of sound.
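The extraction itself is done with an R package in the lecture; as a rough Python analogue of the idea, here is a sketch that computes a few frequency-domain features with NumPy. The feature names are illustrative, not the 21 features used in the course.

```python
# Compute simple spectral features (dominant frequency, spectral
# centroid, spectral bandwidth) from a 1-D audio signal.
import numpy as np

def spectral_features(signal, rate):
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    dominant = freqs[np.argmax(spectrum)]
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * spectrum)
                        / np.sum(spectrum))
    return dominant, centroid, bandwidth

# Example: one second of a pure 440 Hz tone sampled at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)
dom, cen, bw = spectral_features(tone, rate)
print("dominant frequency:", round(dom))
```

In practice the signal would come from each .wav file (e.g. via scipy.io.wavfile.read), and one row of such features per recording would be written to the CSV files.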
We load the features that we extracted before, for both our training and testing datasets, and evaluate the performance of AdaBoost and SVM. Both methods reach 100% in-sample accuracy, around 80% cross-validation accuracy, and around 80% out-of-sample accuracy.
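The three views of accuracy mentioned above (in-sample, cross-validation, out-of-sample) can be sketched like this. The extracted audio features are not bundled with this description, so a synthetic 5-class, 21-feature dataset stands in for them.

```python
# Compare in-sample, cross-validation, and out-of-sample accuracy
# for AdaBoost and an SVM on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=300, n_features=21, n_informative=8,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for model in (AdaBoostClassifier(random_state=0), SVC()):
    model.fit(X_tr, y_tr)
    results[type(model).__name__] = (
        model.score(X_tr, y_tr),                          # in-sample
        cross_val_score(model, X_tr, y_tr, cv=5).mean(),  # cross-validation
        model.score(X_te, y_te),                          # out-of-sample
    )
print(results)
```

A large gap between the first number and the other two is the usual sign of overfitting, which is why all three are reported in the lecture.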
We design an MLP neural network for classifying the audio files we used in the previous lecture. In this case, however, we get basically the same out-of-sample accuracy as before, around 72%, so the extra effort of configuring and running a neural net was not justified.
We use official data from the US Nuclear Regulatory Commission to predict the percentage utilization of the existing reactors in the US. We test both Multilayer Perceptrons and Support Vector Regression (SVR). In this case, however, neither method performs well, which is a good reminder that machine learning cannot always predict everything.
We use a deep neural network to predict the output of US commercial reactors, but instead of predicting one value per observation, we predict multiple ones. Sounds hard? It is quite easy using Keras.
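In Keras, moving to multiple outputs mostly amounts to widening the final Dense layer. A minimal sketch with synthetic data, assuming TensorFlow/Keras is installed (layer sizes and targets here are arbitrary; the lecture uses the real reactor data):

```python
# Multi-output regression in Keras: the final Dense layer has one
# unit per target value, so each observation gets three predictions.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = np.column_stack([X.sum(axis=1),      # three synthetic targets
                     2 * X[:, 0],
                     X[:, 1] - X[:, 2]])

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3),               # three outputs per observation
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=5, verbose=0)

preds = model.predict(X, verbose=0)
print(preds.shape)  # (200, 3)
```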
We use a real Kaggle example containing 350K observations of used cars on eBay Germany. The problem is that constructing the full feature matrix is not viable: we would end up with a NumPy matrix of over 250 columns and 350K rows. Such a matrix will not fit into RAM, so we cannot even call Keras on it (and even if we somehow could, it would not work).
We therefore train the model using train_on_batch(), feeding it batches of around 17K observations. Using this incremental approach, we can easily construct the matrix at each batch-creation step. We fit a deep neural network this way, achieving a mean absolute error of around 1,500 euros per car.
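The incremental loop looks roughly like the following, assuming TensorFlow/Keras is installed. Random data simulates the per-chunk feature matrices here; in the lecture, each chunk of raw rows is read and encoded just before the corresponding call.

```python
# Train a Keras model chunk by chunk with train_on_batch(), so the
# full 350K x 250+ feature matrix never has to exist in memory at once.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(250,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")

rng = np.random.default_rng(0)
for step in range(10):                      # one iteration per chunk of rows
    X_batch = rng.normal(size=(1000, 250))  # build the feature matrix for
    y_batch = rng.normal(size=(1000, 1))    # this chunk only, then discard it
    loss = model.train_on_batch(X_batch, y_batch)

print("final batch loss:", float(np.ravel(loss)[0]))
```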
We work with a dataset containing multiple features per mushroom, and the objective is to predict whether each mushroom is edible. This is particularly challenging for humans, as there is no clear characteristic or rule that states when a mushroom is poisonous.
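Since the mushroom features are categorical, they must be one-hot encoded before fitting a model. A sketch of that preprocessing step on made-up mushroom-like rows (the feature names and values are illustrative; the real dataset has many more features and observations):

```python
# One-hot encode categorical mushroom features with pandas, then
# fit a classifier on the resulting numeric matrix.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "cap_color": ["brown", "white", "brown", "yellow", "white", "yellow"],
    "odor":      ["none", "foul", "almond", "foul", "none", "foul"],
    "gill_size": ["broad", "narrow", "broad", "narrow", "broad", "narrow"],
    "edible":    [1, 0, 1, 0, 1, 0],
})

X = pd.get_dummies(df.drop(columns="edible"))  # one column per category level
y = df["edible"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
print("in-sample accuracy:", clf.score(X, y))
```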
We redo our previous exercise, but now using deep neural networks in Keras. We easily reach 100% accuracy after very few epochs.
We use a special package in Python that allows us to plot heatmaps directly over Google Maps. This is incredibly useful for visualizing complex patterns in geospatial data. We use this approach to visualize the cameras used by the police in Chicago, and to plot homicides in the US since 1980.
Images are used frequently in machine learning, both for deep neural networks and for traditional algorithms (SVM, random forests, etc.). We review the basics of image loading and present a class that reads an entire directory and builds the matrices needed for machine learning. This class transforms images stored as RGB channels (rank-3 tensors) into black-and-white (0/1) matrices, and it should only be used on images that are already in black-and-white format.
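A minimal sketch of such a class, assuming Pillow is available (the class name, threshold, and demo images are illustrative, not the course's code):

```python
# Read every image in a directory, convert to greyscale, threshold to
# 0/1, and stack the results into one matrix (one flattened image per row).
import numpy as np
from pathlib import Path
from PIL import Image

class BWImageLoader:
    def __init__(self, threshold=128):
        self.threshold = threshold

    def load_dir(self, directory, pattern="*.png"):
        rows = []
        for path in sorted(Path(directory).glob(pattern)):
            grey = np.asarray(Image.open(path).convert("L"))  # drop channels
            rows.append((grey >= self.threshold).astype(np.uint8).ravel())
        return np.vstack(rows)  # shape: (n_images, height * width)

# Demo on two tiny generated images:
import tempfile
tmp = tempfile.mkdtemp()
Image.new("RGB", (8, 8), "white").save(tmp + "/a.png")
Image.new("RGB", (8, 8), "black").save(tmp + "/b.png")
matrix = BWImageLoader().load_dir(tmp)
print(matrix.shape)  # (2, 64): two images, 8*8 pixels each
```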
We present a similar class, but now designed to accommodate 3-channel (RGB) image data, which we typically treat as a 4-dimensional tensor (samples, height, width, channels). This class will be useful for the convolutional neural nets in the next section.
We train a deep convolutional network in Keras to identify hand gestures, and we achieve excellent accuracy. We explain how to prepare the data and preprocess the images before loading them into Python.
We process images via OpenCV to detect and count nuts. We combine the results from a blob detector with DBSCAN clustering to recover the exact number of nuts appearing in each image, along with their positions.
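The blob-detection step requires OpenCV, so here we sketch only the counting step: clustering hypothetical detected keypoint coordinates with DBSCAN. Nearby detections collapse into one cluster, so the number of clusters gives the number of nuts.

```python
# Cluster (x, y) blob-detector hits with DBSCAN; each cluster of
# nearby detections is counted as one nut.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical detector output: several hits around each of three nuts.
rng = np.random.default_rng(0)
centers = np.array([[50, 60], [200, 80], [120, 220]])
points = np.vstack([c + rng.normal(scale=2.0, size=(5, 2)) for c in centers])

labels = DBSCAN(eps=20, min_samples=3).fit_predict(points)
n_nuts = len(set(labels) - {-1})  # label -1 marks noise points
print("nuts found:", n_nuts)
```

The cluster means of the labelled points then give the position of each nut.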
I have worked for over 7 years as a statistical programmer in industry, and I am an expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. I am a regular contributor to the R community, with 3 published packages, and an expert SAS programmer. I have contributed to scientific statistical journals; my latest publication appeared in the Journal of Statistical Software.