Principal Component Analysis (PCA)

Sundog Education by Frank Kane
A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.5 instructor rating • 21 courses • 418,723 students

Lecture description

Let's learn how PCA allows us to reduce higher-dimensional data into lower dimensions, which is the first step toward understanding SVD.

Learn more from the full course

Building Recommender Systems with Machine Learning and AI

How to create recommendation systems with deep learning, collaborative filtering, and machine learning.

10:05:15 of on-demand video • Updated August 2020

  • Understand and apply user-based and item-based collaborative filtering to recommend items to users
  • Create recommendations using deep learning at massive scale
  • Build recommender systems with neural networks and Restricted Boltzmann Machines (RBMs)
  • Make session-based recommendations with recurrent neural networks and Gated Recurrent Units (GRU)
  • Build a framework for testing and evaluating recommendation algorithms with Python
  • Apply the right measurements of a recommender system's success
  • Build recommender systems with matrix factorization methods such as SVD and SVD++
  • Apply real-world learnings from Netflix and YouTube to your own recommendation projects
  • Combine many recommendation algorithms together in hybrid and ensemble approaches
  • Use Apache Spark to compute recommendations at large scale on a cluster
  • Use K-Nearest-Neighbors to recommend items to users
  • Solve the "cold start" problem with content-based recommendations
  • Understand solutions to common issues with large-scale recommender systems
So, as we've seen, collaborative filtering can produce great results that have been shown to work really well in real, large-scale situations. So why should we look for something even better? Collaborative filtering has been criticized as having limited scalability, since computing similarity matrices on very large sets of items or users can take a lot of computing horsepower. I don't really buy this, however. As we've seen, using item-based collaborative filtering reduces that complexity substantially, to the point where you can compute an entire similarity matrix for extremely large product catalogs on a single machine. And even if you couldn't, technologies such as Apache Spark allow you to distribute the construction of this matrix across a cluster if you need to. A legitimate problem with collaborative filtering, though, is that it's sensitive to noisy and sparse data. You'll only get really good results if you have a large data set to work with that's nice and clean.

So, let's explore some other ways to make recommendations. We'll group these together under the label of model-based methods, since instead of trying to find items or users that are similar to each other, we'll apply data science and machine learning techniques to extract predictions from our ratings data. Machine learning is all about training models to make predictions, so we'll treat the problem of making recommendations the same way: we'll train models with our user ratings data and use those models to predict the ratings of new items by our users. This puts us squarely in the space of the rating prediction architecture that SurpriseLib is built around, and although we've shown it's not always the most efficient approach, it allows us to repurpose machine learning algorithms to build recommender systems, and some of those algorithms are very good at predicting ratings. Whether they make for good top-N recommendations is something we'll need to find out.

A wide variety of techniques fall under the category of matrix factorization. These algorithms can get a little bit creepy: they manage to find broader features of users and items on their own, like "action" or "romantic." The math doesn't know what to call those features, of course; they're just matrices describing whatever attributes fall out of the data. The general idea is to describe users and movies as combinations of different amounts of each feature. For example, maybe Bob is defined as being 80% an action fan and 20% a comedy fan. We'd then know to match him up with movies that are a blend of about 80% action and 20% comedy. That's the general idea; let's dig into how it works.

Working backwards from what we want to achieve is usually a good approach, so what are we trying to do when predicting ratings for users? Well, you can think of it as filling in a really sparse matrix. We can think of all our ratings as existing in a 2D matrix, with rows representing users and columns representing items, in our case, movies. The problem is that most of the cells in this matrix are unknown, and our challenge is to fill in those unknown cells with predictions (there's a small code sketch of this idea below). Well, if we're trying to come up with a matrix at the end of the day, maybe there are some machine learning techniques that work on matrices we can look at. One such technique is called principal component analysis, or PCA. Principal component analysis is usually described as a dimensionality reduction technique.
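Before we get to PCA, here's a minimal sketch of that latent-factor picture, with entirely made-up numbers: the factor values, the "action"/"comedy" labels, and the 5-star scaling are illustrative assumptions, not anything learned from real ratings data. The point is just to show how a user's factor vector and a movie's factor vector combine, via a dot product, to fill in a cell of the sparse ratings matrix.

```python
# Toy latent-factor sketch: every number here is invented for illustration.
import numpy as np

# Sparse ratings matrix: rows are users, columns are movies, NaN = unknown.
ratings = np.array([
    [5.0, np.nan, 1.0],
    [np.nan, 4.0, np.nan],
    [2.0, np.nan, 5.0],
])

# Pretend the two hidden dimensions correspond to "action" and "comedy".
user_factors = np.array([
    [0.8, 0.2],   # Bob: mostly an action fan
    [0.1, 0.9],   # mostly a comedy fan
    [0.5, 0.5],   # likes both about equally
])
movie_factors = np.array([
    [0.9, 0.1],   # an action-heavy movie
    [0.2, 0.8],   # a comedy
    [0.5, 0.5],   # a blend of the two
])

# Every cell, known or unknown, can be estimated as the dot product of the
# corresponding user and movie factor vectors (scaled here to a 5-star range).
predicted = 5.0 * user_factors @ movie_factors.T
print(np.round(predicted, 2))
```

In a real matrix factorization method those factor vectors are learned from the known ratings rather than written by hand; that's what techniques like SVD, covered later in this course, are for.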
That is, PCA takes data that exists in many dimensions, like all of the movies a user might rate, and reduces it to a smaller set of dimensions that can still accurately describe it, such as a movie's genres. So, why is there a picture of a flower on the screen? Well, it's a little hard to imagine how this works on movies at first, but there's a data set on iris flowers that makes for a good example of PCA in action. Look closely at an iris, and you'll see that it has a few large petal-like parts on the outside and some smaller petals on the inside. Those large outer parts aren't actually petals at all; they're called sepals. So, one way to describe the shape of a specific iris flower is by the length and width of its petals, and the length and width of its sepals. If we look at the data we're working with in the iris data set, we have the length and width of petals and sepals for each iris we've measured, for a total of four dimensions of data. Our feeble human brains can't picture a plot of 4D data, so let's just think about the petal length and width at first. That's plotted here for all of the irises in our data set.

Without getting into the mathematical details of how it works, those black arrows are what are called the eigenvectors of this data: the vector that best describes the variance in the data, and the vector orthogonal to it. Together they define a new vector space, or basis, that better fits the data. These eigenvectors are the principal components of the data; that's why it's called principal component analysis. What we're trying to find are the principal components that describe our data, and they're given by these eigenvectors. Let's think about why these principal components are useful. First of all, we can just look at where each data point falls along that first eigenvector. That position along the eigenvector is a single number, a single dimension, that does a pretty good job of reconstructing the two-dimensional data we started with for petal length and petal width. So, you can see how identifying these principal components can let us represent data using fewer dimensions than we started with. Also, these eigenvectors have a way of finding interesting features that are inherent in the data. In this case, we're basically discovering that the overall size of an iris's petals is what's important for classifying which species of iris it is, and that the ratio of width to length is more or less constant. So, the distance along this eigenvector is basically measuring the "bigness" of the flower. The math has no idea what bigness means, but it found the vector that defines some hidden feature in the data that seems to be important in describing it, and it happens to correspond to what we call bigness. So, we can also think of PCA as a feature extraction tool. We call the features it discovers latent features.

So, here's what the final result of PCA looks like on our complete four-dimensional iris data set. We've used PCA to identify the two dimensions within that 4D space that best represent the data, and plotted the data in those two dimensions. PCA by itself will give you back as many dimensions as you started with, but you can choose to discard the dimensions that contain the least amount of information. So, by discarding the two dimensions from PCA that told us the least about our data, we could go from four dimensions down to two. We can't really say what these two dimensions represent. What does the X axis in this plot mean? What does the Y axis mean?
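To make that concrete, here's a minimal sketch of the four-to-two reduction using scikit-learn's bundled copy of the iris data set; the library choice and the two-component setting are my assumptions for illustration, not something the lecture specifies.

```python
# Reduce the 4D iris measurements to their 2 most informative principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data                # 4 columns: sepal length/width, petal length/width

pca = PCA(n_components=2)    # keep only the two most informative components
X_2d = pca.fit_transform(X)  # project each flower onto those components

# Fraction of the original variance captured by each retained component.
print(pca.explained_variance_ratio_)

print(X_2d[:5])              # each flower is now described by just two numbers
```

Plotting X_2d reproduces the two-dimensional scatter shown on the slide, but the code can't tell us what those two axes mean any more than the plot can.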
All we know for sure is that those two axes represent some sort of latent factors, or features, that PCA extracted from the data. They mean something, but only by examining the data can we attempt to put some sort of human label on them.
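One way to do that examining, sketched below under the same scikit-learn assumptions as above, is to look at how strongly each of the original four measurements contributes to each principal component (the component loadings) and try to read a human label off of that.

```python
# Inspect the loadings: how much each original measurement contributes
# to each of the two retained principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

for i, component in enumerate(pca.components_):
    print(f"Component {i + 1} loadings:")
    for name, weight in zip(iris.feature_names, component):
        print(f"  {name}: {weight:+.2f}")
```

If a component turns out to weight the length and width measurements heavily in the same direction, then "overall size" is a reasonable human label for it, much like the "bigness" feature in the petal example; but that label is still our interpretation, not something PCA hands us.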