Technology behind Neural Networks

A free video tutorial from Kashyap Murali
Deep Learning / AI Developer

Lecture description

Understanding Activation Functions, Deep Neural Networks, How to prevent overfitting, Biological vs. Artificial Neurons and Gradient Descent

Learn more from the full course

Kickstart Artificial Intelligence

Practical Hands on course to Artificial Intelligence

03:05:03 of on-demand video • Updated August 2018

Students will be able to apply Artificial Intelligence in real world tasks and will be able to build fully functioning AI solutions on their own.
Hi, I'm Kashyap Murali. In the previous lecture we talked about how to get started in artificial intelligence, covering the key points such as the definitions and also the relationship between artificial intelligence, machine learning, and deep learning. In this lecture we're going to be talking about the technology behind neural networks, and at least try to understand some of the underlying concepts behind them. This won't go into too much detail or do all the math and stuff like that, but it will help you understand how it kind of works under the hood. So let's get started.
So this is the technology behind neural networks. First, let's look at a sample neural network. It all starts with the input layer. These three inputs feed data through, and these arrows each represent a weight of the model. So think of all these arrows like dials for sound. If you look at a music producer's or a music artist's studio, you see there are a bunch of dials, and you can turn them left or right, etc. What this does is modify how the output will be, and modifying the output is very useful in order to get different results. Getting the right combination of dials is kind of the whole goal of the neural network.
Next, we have these things called hidden layers. And just to be clear, there do not need to be two. There can be one, or there can be way more than one, and that doesn't matter. But the whole point is that these are extra parts where you make your computation, and the more you have, the deeper the model is. What this means is that with such a deep model you can compute more difficult tasks.
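As a rough sketch of the structure described so far (three inputs, weighted arrows, two hidden layers, one final output), the data flow could look like the Python below. The layer sizes, the random starting weights, and the helper names `dense`, `make_weights`, and `forward` are assumptions made for illustration; activation functions, which come later in the lecture, are deliberately left out.

```python
import random

def dense(inputs, weights):
    """One layer: every unit adds up its inputs, each scaled by one weight (one "dial")."""
    return [sum(x * w for x, w in zip(inputs, unit_weights))
            for unit_weights in weights]

def make_weights(n_in, n_out, rng):
    """Random starting weights: one row of n_in dials for each of the n_out units."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward(x, layers):
    """Feed the inputs through every layer in order and return the final output."""
    for weights in layers:
        x = dense(x, weights)
    return x

rng = random.Random(0)       # fixed seed so the sketch is repeatable
layer_sizes = [3, 4, 4, 1]   # 3 inputs, two hidden layers of 4 units (assumed), 1 output
layers = [make_weights(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

prediction = forward([0.5, -1.2, 3.0], layers)
print(prediction)            # a list holding one number: the untrained network's output
```

Turning any single weight changes the output, which is the dial analogy in code. Training is the search for a good combination of those weights.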
And on top of that, finally, these weights converge to an output, which is checked in both the training phase and the testing phase. You make a prediction, and if the prediction is wrong, you have a cost function which determines how bad your model is. Then you go back with backpropagation, you fix the weights, you forward propagate, and you come back, so you can think of this as a rinse-and-repeat process. You keep on going until your predictions are good.
And just to be clear, there are two important phases in pretty much all machine learning: training and testing. There is a lot of training data, and I'll talk about the data itself later, but basically the data can be split into a 70/30 split or a 60/40 split; that's not too big of a deal. With the 70 percent, which is the training phase, you actually know the answers, and the neural network makes the associations between the weights and the answers. Then we test that. So you use those associations, you take some data, you make an output on the data, and you see whether the output is correct. If the output is correct, the model knows those weights are working. But if the output is incorrect, the loss computed by the cost function increases. So the model knows it did something wrong, and during training it fixes the weights as necessary. These two important phases dictate how a neural network makes its predictions.
So now I'm going to talk about biological versus artificial neurons in this diagram.
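The 70/30 split mentioned above can be sketched in a few lines of Python. The function name `train_test_split`, the fixed seed, and the toy `(input, answer)` pairs are assumptions for illustration, not something taken from the course materials.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle the examples, then cut them into a training part and a testing part."""
    shuffled = list(examples)              # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # shuffle so both parts look alike
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy data: each example pairs an input with its known answer.
data = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```

The model only adjusts its weights on `train`; `test` is held back to check whether the learned associations carry over to examples it has not seen.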
As I mentioned, these arrows are the weights. These circles that you see are the units of the model, and the more you have, the deeper the model is and the more prone you are to overfitting, which I will be talking about later on in this lecture.
So first, here is biological versus artificial. It's fundamental to understand how an artificial neuron was actually implemented from a biological neuron. This is actually a sort of biomimicry, because it's copying how a neuron is originally constructed. I don't want to go into this too much, but there are dendrites, which take the input, and then the impulses are carried to the nucleus, which is kind of the brain of the neuron. Then the signal is carried to the axon terminals, from where it's passed on, and a synapse is just the gap between neurons. It does actually help to understand how the two correlate, so it would be useful, but I mainly want to get to the activation function.
So what the activation function does is this: you have all your weights, and the weighted inputs coming into a neuron could be anything, really any number, and they could be of completely different ranges as well. So we use an activation function to normalize these values so that they are understandable by the neuron. What this means is that, using an activation function, we can say how much we want a neuron to be activated, or, if we're using a very basic activation function, either 0 or 1. Now this might be kind of confusing right now, but on the next slide it will make a lot more sense.
So first, the types of activation functions. This was the one I talked about, either 0 or 1. In this perceptron, you can have a zero, which is when the weighted input is below zero, and that means the neuron should be zero.
But if the weighted input is above zero, which is positive, then the neuron should be turned on. Now this is for binary classification and it gets you a very basic level. But what about multiclass classification? What about multiple categories? Then you just have a bunch of ones and zeros, and you don't really know how those correlate. So using a perceptron is pretty much never a good idea, and it's not something that people do in the real world.
The others that I'm covering here are all actually used in the real world. And there is no best activation function that you can decide a priori for the problem you're implementing. So let's just get started, and for each one I'll tell you what you might want to think about in the beginning, but don't rule out any of these.
Beginning with the sigmoid: the sigmoid basically ranges from zero to one, and what it does is say that a neuron should be 50 percent activated, or 60, 70, 80, or 90, and backwards, it should be 40, 20, 10, or not activated. This is actually very useful because it helps with multiclass classification: now you can have one neuron at 70 percent and another at 20, and it's much easier to determine in what capacity each one is active. There is one problem with this, but it's also the same problem with tanh, so let me describe tanh first.
Tanh is very similar to sigmoid. The main difference is that sigmoid is from zero to 1, whereas tanh is from negative 1 to 1. What this means is how steep of a gradient we want, and to put it in simple words, just try both of these and look at the accuracies, and that can help you understand which works better.
Now, to talk about the problem: both of these are asymptotic.
What this means is that they get infinitely close to one without ever reaching it, and it gets computationally more difficult for the computer, which takes a lot of time. So while these are very good activation functions, it's better to stick with ReLU, because it's a piecewise linear activation function, and this keeps things very, very simple for the neuron.
No activation function is perfect, though; there are always problems. The thing with ReLU is that the output can be anywhere from zero to infinity. So it could jump up to huge, huge numbers, which is not good whatsoever, but sometimes that's a risk to take. Another issue is that everything below zero is automatically pushed to zero: any weighted input below zero automatically tells the neuron to be turned off. That's not really good, because it can cause a vanishing gradient for that neuron (often called the dying ReLU problem). Anyway, the whole point of ReLU is that it's much, much more computationally efficient, which helps neural networks run faster and essentially get better accuracies.
One thing that I'd like to mention, though: for starters, don't worry about what the activation function should be. It should not be a problem to get started. Just use ReLU, because that works on many problems, and that's what pretty much everybody says to start with. Then experiment and change as necessary. It really is a fantastic starting point, because it avoids a lot of difficulties and it does help solve a lot of problems.
Now, on to gradient descent. Gradient descent is what drives the neural network to get to its best weights. And what does this mean? Think back to the dial analogy that I used.
Gradient descent is what makes those adjustments. It pretty much tunes all those dials, aka the weights, to the right amount so that the cost function reaches this minimum over here. That's the lowest value of the cost function, and once you're at the minimum you can make better and better predictions.
As for the cost function, there are actually many different types. One is mean squared error, which is basically the average of the squared difference between the actual and the predicted value. There are a few others listed here, including binary cross-entropy and categorical cross-entropy. I personally use either kind of cross-entropy depending on the problem. Binary cross-entropy, it's kind of in the name: I use it only for binary classification problems, and categorical cross-entropy for categorical classification problems. So cost functions are not a big deal, but as I said, just use either kind of cross-entropy.
The next important part, which I want to go into in a bit more detail, is the learning rate. The learning rate is how big of a jump, or how big of a change in the weights, you make on the way to the minimum. So what does that mean? It changes how much of the dial you're turning in a single turn.
And that's where we get to mini-batches. There's either full-batch gradient descent or mini-batch gradient descent. Full batch is where the update is computed over the entire dataset, while with mini-batches it is computed over a small portion of the dataset. So mini-batch is computationally more efficient and you get very similar results.
Next is Adam, which is an optimization function. But first, back to the learning rate: I forgot to mention, I'm sorry, what happens if you have too big of a learning rate.
You could skip the minimum: you could go from here to here to here and then skip it, and that would just be a problem. The next thing is that if you choose too low a learning rate, it takes too long to get to the minimum. You just keep jumping here and here until you get to the minimum, but the problem is that it takes too long.
That's where Adam comes in, and it uses this thing called momentum. You make a big jump in the beginning, based on how much of a change there is, which is based on the partial derivative. Anyway, I don't remember the details, and that doesn't matter here. Basically, it makes a bigger jump at first, and then as it gets closer and closer to the minimum you make smaller and smaller jumps. So you get both speed at the beginning, from a large learning rate, and precision and accuracy, from a small one.
Now, this is actually a great cost function, because it's pretty much a U shape and there is only one minimum, so it's very easy. Think about a different case where the curve is all wiggly; those cost functions are not good whatsoever. That case is actually a problem that many people face, and what it means is that the algorithm is not able to get to the global minimum. You want the global minimum, but you stop at only a local minimum. Basically, the algorithm gets stuck there and cannot move past that. The solution to that is to increase your learning rate, so it can jump out of that hole it's stuck in.
So now we're going to talk about how to reduce overfitting, which is a very common problem. For overfitting there are three main ways. One is regularization, and regularization is actually very important to implement.
With regularization, you might think, "Oh, let me just delete these unnecessary features," but that is actually not a smart idea, because these "unnecessary" features might just have some important role to play in the decision, and you don't want to add your own bias. By deleting these features you're adding bias to the model, which would not help it understand how to conform with the real world. That's where regularization comes in, which penalizes parameters according to their usefulness. If you have too high a regularization parameter, you penalize too much, and with too low a regularization parameter you overfit to the data.
Now, before I move on, here is a visual understanding of overfitting. You can see that these are different points in the data. If you're overfitting, this is the line that will go through them. And that's really bad, because it will do great on the training data, where it matches exactly, but when it gets a random point like over here, the model's line would go down while the point would actually be up here, so the prediction would be off. So overfitting is very bad and it's not useful whatsoever.
Now I'm going to talk about dropout. Before I get to it, I just want to say that it might seem pretty counterintuitive, but dropout is very useful and it's proven to work very well. What dropout does is turn off some neurons. So you add a dropout layer, say with a value of 0.25, and then it turns off 25 percent of the neurons. Dropout just deactivates some of the neurons, and this might seem pretty counterintuitive: why would you turn neurons off? But as I said earlier, it has proven to work. The story is that dropout was introduced in a paper that was actually rejected at first, but it definitely made its way into the real world, because the approach really does work.
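To make the dropout idea above concrete, here is a minimal Python sketch. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled up so the layer's expected total stays the same; that scaling, the function name `dropout`, and the fixed seed are assumptions for illustration and are not described in the lecture. Each neuron is switched off independently with probability `rate`, so roughly (not exactly) 25 percent end up at zero.

```python
import random

def dropout(activations, rate=0.25, rng=None, training=True):
    """Randomly switch off a fraction of the activations while training."""
    if not training or rate == 0.0:
        return list(activations)   # at test time every neuron stays on
    rng = rng or random.Random()
    keep = 1.0 - rate              # probability that a neuron stays active
    # Survivors are divided by `keep` (inverted dropout, an assumption here).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)            # fixed seed so the sketch is repeatable
hidden = [1.0] * 1000              # pretend output of one hidden layer
dropped = dropout(hidden, rate=0.25, rng=rng)

fraction_off = sum(1 for a in dropped if a == 0.0) / len(dropped)
print(fraction_off)                # close to 0.25
```

Because a different random set of neurons is silenced on every pass, no single neuron can be relied on too heavily, which is one common explanation for why dropout reduces overfitting.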
And then finally, expanding the data. I want to talk about image data here. You could gather more data, mainly, but in the case of images there are tools, which I'll also talk about in a bit more detail in later lectures, that help increase the size of the data. What I mean here is that when you take a picture and train on it, the model also picks up on things that are only in this training data. If you look at things like the brightness, the contrast, the blurriness, and the zoom, the model could be tuned to only those in the training data. But in the testing data you have different contrast or different brightness, because you can't always get the same picture, and the model would struggle to find the right answer for those images.
So what we do is expand the data: we twist it a bit, we make it fuzzy, we increase the contrast, the brightness, etc. What this does is give the model more options to generalize, instead of purely focusing on this training data. So to help the model generalize, it is important to expand the data. One of the options that helps do this is the ImageDataGenerator in Keras, and I'll be talking about that in some future lectures.
All right. In the next lecture I'll talk about the different types of layers, but for now that's it. I'll see you guys in the next lecture. Bye.
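The idea of expanding image data can be illustrated with a small standard-library Python sketch. It is a toy stand-in, not the Keras tool mentioned in the lecture: the image is assumed to be a grayscale grid of values from 0 to 255, and the helper names, the brightness and contrast amounts, and the use of a 90-degree rotation as a coarse version of "twisting" are all assumptions for illustration.

```python
def clamp(value):
    """Keep a pixel inside the valid 0..255 range."""
    return max(0, min(255, int(round(value))))

def adjust_brightness(image, delta):
    """Add the same amount to every pixel (positive = brighter)."""
    return [[clamp(p + delta) for p in row] for row in image]

def adjust_contrast(image, factor):
    """Stretch (factor > 1) or squeeze (factor < 1) pixels around mid-gray 128."""
    return [[clamp(128 + (p - 128) * factor) for p in row] for row in image]

def rotate_90(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the original plus three altered copies: four training images from one."""
    return [
        image,
        adjust_brightness(image, 40),
        adjust_contrast(image, 1.5),
        rotate_90(image),
    ]

picture = [[0, 100],
           [200, 255]]             # a tiny 2x2 grayscale "image"
for variant in augment(picture):
    print(variant)
```

Each variant keeps the same label as the original picture, so the model sees the same object under different brightness, contrast, and orientation and has less opportunity to memorize one particular look.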