New Theory Topics
A free video tutorial from Jose Portilla
Head of Data Science, Pierian Data Inc.
Learn more from the full course: Complete Guide to TensorFlow for Deep Learning with Python
Learn how to use Google's Deep Learning Framework - TensorFlow with Python! Solve problems with cutting edge techniques!
14:07:23 of on-demand video • Updated April 2020
- Understand how Neural Networks Work
- Build your own Neural Network from Scratch with Python
- Use TensorFlow for Classification and Regression Tasks
- Use TensorFlow for Image Classification with Convolutional Neural Networks
- Use TensorFlow for Time Series Analysis with Recurrent Neural Networks
- Use TensorFlow for solving Unsupervised Learning Problems with AutoEncoders
- Learn how to conduct Reinforcement Learning with OpenAI Gym
- Create Generative Adversarial Networks with TensorFlow
- Become a Deep Learning Guru!
Welcome, everyone, to the new theory topics lecture. We've already reviewed the basics of neural networks in the previous lecture, but there are still some theory components we haven't covered yet, and there are also some things we've introduced but haven't really explored in depth. So let's go ahead and cover some of these topics in what we're calling this new theory lecture.
First off, we'll discuss initialization of weights and the various options we have, and then we'll settle on Xavier initialization. We said previously that we would just choose some sort of random value for our weights, but let's go through the options first. The first thing we could have done is just initialize all our weights as zeros. However, that presents a problem because there is essentially no randomness there, meaning we're being a little too subjective when creating the neural network: we're introducing our own heavy hand by choosing all zeros, and every neuron in a layer starts out identical. So that's really not a great choice, because we want to be as impartial as possible when creating this network.
So then we decide, well, let's use some sort of random distribution near zero to try to get some of those smaller values. However, this is still not optimal, even if you try a uniform distribution from negative one to one or a normal distribution over roughly that same range. When you pass those random values into an activation function, they can sometimes get distorted to much larger values.
So what kind of solution do we have then? Well, we can use Xavier initialization, which comes in both uniform and normal flavors, if you will. The basic idea behind Xavier initialization is to draw weights from a distribution that has zero mean and a specific variance. That variance is defined as Var(W) = 1 / n_in, where W is the initialization distribution for the neuron in question.
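As a rough illustration of the rule just stated, the sketch below draws weights with zero mean and a variance of one over the number of incoming neurons (written n_in here), in both the normal and the uniform flavor. It uses only the Python standard library and is not the TensorFlow import used later in the course; the function names `xavier_normal` and `xavier_uniform` and the layer sizes are hypothetical choices made for this example.

```python
import math
import random
import statistics


def xavier_normal(n_in, n_out, rng):
    """Draw an n_in x n_out weight table from Normal(0, 1/n_in)."""
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def xavier_uniform(n_in, n_out, rng):
    """Draw from Uniform(-a, a) with a = sqrt(3/n_in), whose variance a**2/3 is 1/n_in."""
    limit = math.sqrt(3.0 / n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n_in, n_out = 200, 100   # hypothetical layer sizes

for name, init in [("normal", xavier_normal), ("uniform", xavier_uniform)]:
    weights = init(n_in, n_out, rng)
    flat = [w for row in weights for w in row]
    # The sample variance should land close to the 1/n_in target.
    print(name, "sample variance:", statistics.pvariance(flat), "target:", 1.0 / n_in)
```

Both flavors aim at the same variance; only the shape of the distribution differs.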
Here, n_in is the number of neurons feeding into it. So this is for one very specific neuron: again, W is the initialization distribution for that neuron, and 1 / n_in is one over the number of neurons feeding into that neuron. The distribution is typically either Gaussian or uniform, where Gaussian is just another word for a normal distribution.
Let's go ahead and briefly discuss where this formula comes from. If you want a more in-depth discussion, you can check out the resource links for this lecture, but let's walk through some of these equations. Let's suppose that we have an input X with n components and a linear neuron with random weights W that spits out a number Y. It's linear, so we basically just have Y = W_1 X_1 + W_2 X_2 + ... + W_n X_n: the weights times the inputs, all the way up to W_n and X_n.
So now we have to ask: what if we wanted to know the variance of Y? Well, if you look up the variance formula on Wikipedia or in a statistics book, you end up with the second equation: Var(W_i X_i) = E[X_i]^2 Var(W_i) + E[W_i]^2 Var(X_i) + Var(W_i) Var(X_i). If our inputs and weights both have a mean of zero, that second equation simplifies to the third equation, where we can say Var(W_i X_i) = Var(W_i) Var(X_i). You can do that separation because we know our inputs and our weights both have a mean of zero; we're basically defining it that way.
Then, if we make the further assumption that the X_i and W_i are all independent and identically distributed, which is a common term in statistics otherwise known as i.i.d., we can work out that the variance of Y is this top equation right here. So we're saying now that the variance of Y is equal to, after some transformations, n times the variance of W_i times the variance of X_i: Var(Y) = n Var(W_i) Var(X_i). In other words, the variance of the output is the variance of the input, but scaled by n times the variance of W_i. So if we want the variance of the input and the output to be the same, that means n Var(W_i) should be equal to 1, which leads us to that second equation, where we're saying Var(W_i) = 1 / n, or equivalently 1 / n_in, with n_in the number of neurons feeding into that neuron. And that is your Xavier initialization.
Again, you really don't need to worry too much about where this formula came from, because it's essentially just an import we're going to be using from TensorFlow. But in case you're more interested in it, you can always check out the resource links. As a quick note, the actual original formula was defined as two divided by the neurons in plus the neurons out: Var(W_i) = 2 / (n_in + n_out). So if you check out the link to the original paper, you'll see that bottom formula. But a lot of framework implementations just use the more simplified one over neurons in. OK, so that's Xavier initialization; this is basically just telling you why we're going to be using Xavier initialization later on in TensorFlow.
Now that we understand Xavier initialization, what I want to go over are three components of gradient descent, something you've definitely heard of before: the learning rate, the batch size, and the second-order behavior of gradient descent, which relates back to that learning rate. Remember that the learning rate basically defines the step size during gradient descent. If we choose a really small learning rate, then you're going to be descending at a very, very slow pace, so you may take forever to actually train your model, or you may never even converge within a reasonable timeframe.
However, if you choose too large a step size, then you may overshoot the minimum and never converge. So that's the learning rate. We're going to keep in the back of our minds the question of how to choose a good learning rate, and we'll come back to it when we discuss the second-order behavior of gradient descent.
Then there's also batch size, which we've actually already seen before. Batches allow us to use the stochastic form of gradient descent, which is essentially what we've been using whenever we've shown batch sizes in TensorFlow, although we haven't specifically said that we're using stochastic gradient descent. The reason for batching is that if we were to feed all of our data into our neural network at once, there would be so much to compute at each step that it would just be computationally too expensive to perform gradient descent, which is why we need to feed in these so-called mini-batches of data. Now, there are tradeoffs here. The smaller the batch size, the less representative it is of the entire dataset, and the larger the batch size, the longer it takes to train. If you feed in too much data at once, it will just take absolutely forever.
Next, I want to discuss the second-order behavior of gradient descent, which is the third piece of this puzzle. Second-order behavior of gradient descent allows us to adjust our learning rate based on the rate of descent. As you can imagine, when you're first beginning gradient descent and just starting off training, your error is going to be really large. So it would be nice if you could take really large steps, or have a much faster learning rate, in the beginning. Then, as you get closer to the actual minimum, which you would know from the rate of descent (that second-order behavior is essentially the derivative), you could adjust your learning rate to make it slower.
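To make the effect of the step size concrete, here is a small sketch of plain gradient descent on the toy function f(w) = w**2. The function `gradient_descent`, its parameters, and the simple "divide the rate by a growing number" decay schedule are all assumptions made for this illustration; the schedule is a hand-written stand-in for the idea of slowing down near the minimum, not the adaptive method the course uses later.

```python
def gradient_descent(learning_rate, steps=50, start=10.0, decay=0.0):
    """Minimize f(w) = w**2 (gradient 2*w), optionally shrinking the step size over time."""
    w = start
    for step in range(steps):
        # With decay=0 the rate is constant; otherwise it shrinks as training proceeds.
        rate = learning_rate / (1.0 + decay * step)
        w -= rate * 2.0 * w
    return w


# The minimum is at w = 0, so values near zero mean the run converged.
print("very small rate:", gradient_descent(0.001))
print("moderate rate  :", gradient_descent(0.1))
print("too large rate :", gradient_descent(1.1))
print("decaying rate  :", gradient_descent(0.9, decay=0.5))
```

With the very small rate the weight has barely moved after 50 steps, with the too-large rate each step overshoots further than the last, and the decaying run starts with big steps and settles as the rate shrinks.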
So it would be nice if we had some sort of mechanism for having faster learning rates in the beginning and then slowing down as we get closer to that minimum. There are different methods of doing this: AdaGrad, RMSProp, and Adam. We're really going to be focusing on Adam, and we'll introduce that later on. Essentially, it's going to be a nice import from TensorFlow, but keep in mind, as you see us working with it in TensorFlow, that it's essentially taking advantage of second-order behavior when it comes to gradient descent. All right, so again, this allows us to start with larger steps and then eventually move to smaller step sizes, and Adam lets that change happen automatically.
Now I want to mention unstable and vanishing gradients. As you increase the number of layers in a network, the layers towards the input will be affected less by the error calculation occurring at the output. This is especially true if you have a very deep neural network with tons of layers: as you backpropagate that gradient, it's going to get smaller and smaller and smaller, which is where the term vanishing gradient comes from. So as you go back through the network, if you have a super deep network, eventually you won't be changing any of the weights at the very beginning of the network. Initialization and normalization will help us mitigate some of these issues, and in fact, if you have good initialization and good normalization, you rarely have to worry about vanishing gradients. There's also an opposite problem called exploding gradients, but that's a little more rare. We're going to discuss these unstable gradient ideas in a lot more detail when we discuss recurrent neural networks, because that's a situation where you really have to keep them in mind. OK.
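The shrinking gradient can be seen in a deliberately tiny model. The sketch below chains together one-neuron sigmoid layers and backpropagates the derivative of the output with respect to the input; the sigmoid activation, the weight of 1.0, and the function name `gradient_at_input` are assumptions chosen for the illustration, not details from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_at_input(num_layers, weight=1.0, x=0.5):
    """Backpropagate d(output)/d(input) through a chain of one-neuron sigmoid layers."""
    # Forward pass: remember each layer's pre-activation value.
    pre_activations = []
    activation = x
    for _ in range(num_layers):
        z = weight * activation
        pre_activations.append(z)
        activation = sigmoid(z)

    # Backward pass: the chain rule multiplies by weight * sigmoid'(z) at every layer.
    grad = 1.0
    for z in reversed(pre_activations):
        s = sigmoid(z)
        grad *= weight * s * (1.0 - s)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, "layers -> gradient reaching the input:", gradient_at_input(depth))
```

Because the sigmoid's derivative is never larger than 0.25, every extra layer multiplies the gradient by a small factor, so the signal reaching the earliest layers shrinks rapidly with depth.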
Finally, I want to discuss overfitting versus underfitting a model. You've already heard me use these terms before, and you may already have an idea of what we're talking about, but I really want to make sure you fully understand what we mean when we say overfitting or underfitting.
Let's imagine a really simple regression task. We have some training data here in blue, with an x-axis and a y-axis. Let's say we create this red line model and fit it to the training data. You can see here that we're basically underfitting: we're not really capturing the more or less parabolic behavior that is true of the data. So when underfitting we get a large error on the training data, and if we introduce test points, we'll also get a large error on those test points. That's essentially the indication that you're underfitting: if you're getting a large error on both your training set and your testing set, then you're underfitting, and you need to go back and either adjust parameters in your model or change your model. Again: high error on both the training and test data.
Now let's discuss overfitting a model. Let's go back to the situation where we have some training data. You might be thinking to yourself, well, if my last model was underfitting and I just drew a straight line, let me try to build a model that basically hits every single point of my training data. So you may get some sort of wacky model that looks like this. However, the danger here is that when you actually evaluate this model, it's going to have a very low error on your training data, and in practice your data is going to be multi-dimensional, so you won't be able to visualize it as we've done here. Looking at this, I can clearly see that there's an issue with my model. But if I'm working with something in 12 dimensions, I'm not going to be able to visualize it like this; the only report I'm going to get back is the error on the training data.
And unfortunately, if you have a model that looks like this, you're actually getting a very low error on that training data, because you're basically hitting every point. The problem with overfitting shows up when you introduce the test data: you'll notice you end up getting a very large error on that test dataset. Again, this is a cartoonish example, exaggerated on purpose, but hopefully you get the idea: if you're fitting very well to your training data but then get a much larger error on your test data, you're overfitting your model.
So those are the differences between overfitting and underfitting. Overfitting is kind of dangerous and deceptive, because you may run the model on your training data and think, wow, look at all these great results I'm getting, but then you introduce your test data and it performs really poorly. That's a classic example of overfitting. You need to strike some sort of balance, and here again that's the parabolic shape.
With potentially hundreds of parameters in a deep neural network, the possibility of overfitting is very high. There are a few ways to help mitigate this issue of overfitting, which plagues neural networks in general. There are statistical methods like L1 and L2 regularization, and the basic idea behind them is that they add a penalty for larger weights in the model, so you don't end up with one feature in your training set, or one neuron, that has a really large weight attached to it. This idea of L1 and L2 regularization is not unique to neural networks; if you've done any sort of machine learning before, you've probably heard of these regularization methods. Again, it's just adding a penalty for larger weights in the model.
Now, another common technique is called dropout.
It's fundamentally a really simple idea, and it's unique to neural networks, but it's actually really effective. The idea is that you just remove neurons randomly during training. As you're training, you pick random neurons to remove, and that way the network doesn't over-rely on any particular neuron. That can help mitigate overfitting.
Then there's another technique, known as expanding your data. You can artificially expand your data by adding noise: you can tilt images, or maybe add low white noise to sound data, things of that nature, so that you change the very data you're training on. That way you don't overfit to the original data source, and we'll explore that later on in the course.
So we still have more theory to learn, things such as pooling layers, convolutional layers, et cetera, but we'll wait until we actually begin to build convolutional neural networks to cover those. We'll have an upcoming theory lecture where we really dive into the specific theory of convolutional neural networks. For now, let's go ahead and explore the famous MNIST dataset, which is essentially a must-know for convolutional neural networks and for deep learning in general. Pretty much every deep learning framework and course has some sort of MNIST example, so we're definitely going to cover it here. Coming up next, we'll discuss the data in general. I'll see you there.
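As a recap of the dropout idea from this lecture, here is a minimal sketch of randomly removing neurons during training. It works on a plain list of activations rather than a real network layer, and the function name `dropout`, its parameters, and the sample values are hypothetical. One detail goes beyond what the lecture describes: the surviving activations are scaled up by 1 / keep_prob (often called inverted dropout) so that the expected total stays the same between training and inference.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; pass them through unchanged otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Inverted dropout: scale the survivors so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(42)  # fixed seed so the run is repeatable
layer_output = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]  # made-up activations for illustration

print("training :", dropout(layer_output, drop_prob=0.5, rng=rng))
print("inference:", dropout(layer_output, drop_prob=0.5, rng=rng, training=False))
```

Each training call removes a different random subset of neurons, which is what keeps the network from leaning too heavily on any single one.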