What are Restricted Boltzmann Machines (RBM's)?

Sundog Education by Frank Kane
A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro

Lecture description

We'll cover a very simple neural network called the Restricted Boltzmann Machine, and show how it can be used to produce recommendations given sparse rating data.

Learn more from the full course

Building Recommender Systems with Machine Learning and AI

How to create recommendation systems with deep learning, collaborative filtering, and machine learning.

10:05:15 of on-demand video • Updated August 2020

  • Understand and apply user-based and item-based collaborative filtering to recommend items to users
  • Create recommendations using deep learning at massive scale
  • Build recommender systems with neural networks and Restricted Boltzmann Machines (RBM's)
  • Make session-based recommendations with recurrent neural networks and Gated Recurrent Units (GRU)
  • Build a framework for testing and evaluating recommendation algorithms with Python
  • Apply the right measurements of a recommender system's success
  • Build recommender systems with matrix factorization methods such as SVD and SVD++
  • Apply real-world learnings from Netflix and YouTube to your own recommendation projects
  • Combine many recommendation algorithms together in hybrid and ensemble approaches
  • Use Apache Spark to compute recommendations at large scale on a cluster
  • Use K-Nearest-Neighbors to recommend items to users
  • Solve the "cold start" problem with content-based recommendations
  • Understand solutions to common issues with large-scale recommender systems
Instructor: The grand-daddy of neural networks in recommender systems is the Restricted Boltzmann Machine, or RBM for short. It's been in use since 2007, long before AI had its big resurgence, but the original paper is still commonly cited and the technique is still in use today. Going back to the Netflix Prize, one of the main things Netflix learned was that matrix factorization and RBM's had the best performance as measured by RMSE, and their scores were almost identical. Again, this shouldn't surprise us too much, since we know that you can model matrix factorization as a neural network. But they found that combining matrix factorization with RBM's, the two of them working together, provided even better results: they went from an RMSE of 0.89 to 0.88. A few years ago, Netflix confirmed they were still using RBM's as part of the recommender system they have in production. Let's learn how it works.

First of all, if you're serious about using RBM's for recommendations, I recommend tracking down this paper so you can study it later, once you understand the general concepts. It's from a team at the University of Toronto, and was published in the Proceedings of the 24th International Conference on Machine Learning in 2007. If you just Google the title of the paper, "Restricted Boltzmann Machines for Collaborative Filtering," you should find a free PDF copy of it from the author's page on the University of Toronto website, so I think it's legitimately free for you there.

RBM's are really one of the simplest neural networks: just two layers, a visible layer and a hidden layer. We train it by feeding our training data into the visible layer in a forward pass, and adjusting the weights and biases between the two layers during a backward pass. An activation function, classically a sigmoid, is used to produce the output of each hidden neuron. Why are they called Restricted Boltzmann Machines?
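To make the forward pass concrete, here's a minimal NumPy sketch, not taken from the course code; the layer sizes, the weight initialization, and the `forward_pass` name are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    # Classic RBM activation: squashes each hidden unit's input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3            # toy sizes, chosen only for illustration
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # weights between the layers
hidden_bias = np.zeros(n_hidden)      # bias terms on the hidden layer

def forward_pass(v):
    """Forward pass: probability that each hidden unit turns on, given visible data."""
    return sigmoid(v @ W + hidden_bias)

v = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # example binary visible vector
h = forward_pass(v)                           # three hidden activations, each in (0, 1)
```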
Well, they are restricted because neurons in the same layer can't communicate with each other directly; there are only connections between the two different layers. That's just what you do these days with modern neural networks, but that restriction didn't exist in the earlier Boltzmann Machines, back when AI was still kind of floundering as a field. And RBM's weren't invented by a guy named Boltzmann; the name refers to the Boltzmann distribution used in their sampling function. RBM's are actually credited to Geoffrey Hinton, who was a professor at Carnegie Mellon University at the time; the idea dates back to 1985.

So, RBM's get trained by doing a forward pass, which we just described, and then a backward pass, where the inputs get reconstructed. We do this iteratively over many epochs, just like when we train a deep neural network, until it converges on a set of weights and biases that minimizes the error.

Let's take a closer look at that backward pass. During the backward pass, we try to reconstruct the original input by feeding the output of the forward pass back through the hidden layer, and seeing what values come out of the visible layer. Since those weights are initially random, there can be a big difference between the inputs we started with and the ones we reconstruct. In the process, we end up with another set of bias terms, this time on the visible layer. So, we share weights between the forward and backward passes, but we have two sets of biases: the hidden bias used in the forward pass, and the visible bias used in the backward pass. We can then measure the error we end up with, and use that information to adjust the weights a little bit during the next iteration to try and minimize that error. Conceptually, you can see it's not too different from what we call a linear threshold unit in more modern terminology.
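A toy sketch of that backward pass, with my own variable names and toy sizes, shows the key structural point: the same weight matrix is reused (transposed) going backward, with a separate visible bias, and the reconstruction error is what training tries to shrink.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # shared by both passes
hidden_bias = np.zeros(n_hidden)    # first bias set: used only in the forward pass
visible_bias = np.zeros(n_visible)  # second bias set: used only in the backward pass

v0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # original input
h = sigmoid(v0 @ W + hidden_bias)               # forward pass
v1 = sigmoid(h @ W.T + visible_bias)            # backward pass: same W, transposed
reconstruction_error = np.sum((v0 - v1) ** 2)   # large at first, since W is random
```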
You can also construct multi-layer RBM's that are akin to modern deep learning architectures. The main difference is that we read the output of an RBM from the lower level during a backward pass, as opposed to just taking outputs from the other side like we would with a modern neural network.

This all works well when you have a complete set of training data; for example, applying an RBM to the same MNIST handwriting recognition problem we looked at in our deep learning intro section is a straightforward thing to do. When you apply RBM's, or neural networks in general, to recommender systems, though, things get weird. The problem is that we now have sparse training data, very sparse in most cases. How do you train a neural network when most of your input nodes have no data to work with? Adapting an RBM for, say, recommending movies given five-star ratings data requires a few twists on the generic RBM architecture we just described.

Let's take a step back and think about what we're doing here. The general idea is to use each individual user in our training data as a set of inputs into our RBM to help train it. So, we process each user as part of a batch during training, looking at their ratings for every movie they rated. Our visible nodes represent a given user's ratings on every movie, and we're trying to learn weights and biases that let us reconstruct ratings for user/movie pairs we don't know yet.

First of all, our visible units aren't just simple nodes taking in a single input. Ratings are really categorical data, so we actually want to treat each individual rating as five nodes, one for each possible rating value. So, let's say the first rating we have in our training data is a five-star rating; that would be represented as four nodes with a value of zero and one node with a value of one, as represented here. Then we have a couple of ratings that are missing, for user/item pairs that are unknown and need to be predicted.
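Here's a short sketch of that categorical encoding; `encode_ratings` is a hypothetical helper of my own, not from the course code.

```python
import numpy as np

def encode_ratings(ratings, n_items):
    """Encode one user's ratings as n_items groups of five one-hot nodes.
    `ratings` maps item index -> star rating (1..5); unrated items stay all zeros."""
    v = np.zeros((n_items, 5))
    for item, stars in ratings.items():
        v[item, stars - 1] = 1.0      # e.g. a 5-star rating becomes [0, 0, 0, 0, 1]
    return v.reshape(-1)              # flatten into a single visible-layer vector

# A user who gave item 0 five stars and item 3 three stars; items 1 and 2 are unknown.
v = encode_ratings({0: 5, 3: 3}, n_items=4)
```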
Then we have a three-star rating, represented like this with a one in the third slot. When we're done training the RBM, we'll have a set of weights and biases that should allow us to reconstruct ratings for any user. So, to use it to predict ratings for a new user, we just run it once again using the known ratings of the user we're interested in. We run those through in the forward pass, then back again in the backward pass, to end up with reconstructed rating values for that user. We can then run softmax on each group of five rating values to translate the output back into a five-star rating for every item.

But again, the big problem is that the data we have is sparse. If we are training an RBM on every possible combination of users and movies, most of that data will be missing, because most movies have not been rated at all by any specific user. We want to predict user ratings for every movie, though, so we need to leave space for all of them. That means if we have N movies, we end up with N times five visible nodes, and for any given user, most of them are undefined and empty. We deal with this by excluding any missing ratings from processing while we're training the RBM. This is kind of a tricky thing to do, because most frameworks built for deep learning, such as TensorFlow, assume you want to process everything in parallel, all the time. Sparse data isn't something they were really built for originally, but there are ways to trick them into doing what we want.

Notice that we've only drawn lines between the hidden layer and the visible units that actually have known ratings data in them. So, as we're training our RBM with a given user's known ratings, we only attempt to learn the weights and biases used for the movies that user actually rated. As we iterate through training on all of the other users, we fill in the other weights and biases as we go.
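Translating the reconstruction back into star ratings with softmax might look like this sketch; taking the probability-weighted average of the five rating values is one reasonable decoding choice (you could also take the argmax), and the function names here are my own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # subtract the max for numerical stability
    return e / e.sum()

def decode_ratings(v_reconstructed, n_items):
    """Softmax each group of five visible nodes, then take the expected star rating."""
    groups = v_reconstructed.reshape(n_items, 5)
    probs = np.array([softmax(g) for g in groups])  # rating distribution per item
    stars = np.arange(1, 6)
    return probs @ stars              # expected rating per item, always in [1, 5]

# Hypothetical reconstructed activations for two items:
recon = np.array([0.1, 0.2, 0.3, 1.5, 3.0,   # item 0 leans strongly toward 5 stars
                  0.5, 0.5, 0.5, 0.5, 0.5])  # item 1 is completely uncertain
predicted = decode_ratings(recon, n_items=2)
```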
For the sake of completeness, I should point out that TensorFlow actually does have a sparse tensor these days that you can use, and there are other frameworks, such as Amazon's DSSTNE system, that are designed to construct more typical deep neural networks with sparse data. RBM's will probably become a thing of the past now that we can treat sparse data in the same manner as complete data with modern neural networks, and we will examine that in more depth in an upcoming section of this course.

The other twist is how to best train an RBM that contains huge amounts of sparse data. Training needs an efficient way to approximate the gradient we're optimizing, and the technique the paper uses successfully for this is called contrastive divergence. The math behind it gets a bit complicated, but the basic idea is that it samples probability distributions during training using something called a Gibbs sampler. We only train on the ratings that actually exist, but re-use the resulting weights and biases across other users to fill in the missing ratings we want to predict. So, let's look at some code that actually implements an RBM on our MovieLens data, and play around with it!
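As a rough illustration of those two ideas together, here's a toy CD-1 sketch: one Gibbs sampling step for the hidden units, and a mask so that only visible nodes backed by known ratings drive the updates. This is a simplified sketch under my own assumptions, not the paper's exact algorithm (which, among other things, treats each group of five rating nodes as a single softmax unit).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(7)
n_visible, n_hidden = 10, 4   # e.g. 2 items x 5 rating nodes (toy sizes)
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)     # visible biases (backward pass)
b_h = np.zeros(n_hidden)      # hidden biases (forward pass)

def cd1_update(v0, mask, lr=0.05):
    """One contrastive divergence step (CD-1) with a single Gibbs sampling pass.
    `mask` is 1 for visible nodes backed by known ratings, 0 for missing ones."""
    global W, b_v, b_h
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)  # Gibbs sample of hidden units
    v1 = sigmoid(h0 @ W.T + b_v) * mask               # reconstruct known nodes only
    p_h1 = sigmoid(v1 @ W + b_h)
    # CD-1 update: difference of data-driven and reconstruction-driven statistics.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1) * mask
    b_h += lr * (p_h0 - p_h1)
    return np.sum(((v0 - v1) * mask) ** 2)            # masked reconstruction error

# One user who rated item 0 with 5 stars but never rated item 1.
v0 = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
mask = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
errors = [cd1_update(v0, mask) for _ in range(300)]
```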