Fundamentals of Linear Regression

Akhilendra Singh MBA, CSPO, PSM1
Business Analyst

Lecture description

Learn the fundamentals of linear regression.

Learn more from the full course

Machine Learning, Business analytics & Data Science

Machine learning, deep learning & data science with R & Python. Build an image recognition model with Keras & much more.

18:25:53 of on-demand video • Updated May 2020

  • Machine learning & Data science with R & Python
  • Fundamentals of Machine learning
  • Data science
  • Deep learning models
  • Image recognition
  • Keras
  • R programming
  • Anaconda distribution & jupyter notebook
  • Numpy & pandas
  • Multi-layer perceptron
  • Data visualization with pandas, seaborn & matplotlib
  • Data visualization with base R & libraries like ggplot2, lattice, scatter3d plot & more
  • Applied statistics for machine learning covering important topics like standard error, variance, p value, t-test etc.
  • Machine learning models like Neural network, linear regression, logistic regression & more.
  • Handle advanced concepts like dimension reduction & data reduction techniques with PCA & K-Means
  • Classification & Regression Tree with Random Forest machine learning model
  • Real life projects to help you understand industry application
  • Tips & Tools to create your online portfolio to promote your skills
  • Tutorial on job searching strategy to find appropriate jobs in machine learning, data science or any other industry.
  • Learn business analytics
Hi guys. So let's start building machine learning models. In this video we are going to talk about the simple linear regression model. I'm going to focus a lot on the foundations so that you get a better hang of the concept. You don't have to worry about the calculation part; I will show you everything in the simplest possible way. Honestly, I don't think it is very complicated, and you will enjoy it. It's just that if you are new to calculations and this kind of thing, it may take some time, so go over the video again and again. The most important part is the calculation; the rest of the information matters, but the calculation is critical. Once you get the hang of simple linear regression, you know what we are dealing with.

As I mentioned in one of the earlier videos, regression is used when we are looking to predict a continuous value, a continuous variable. It is used when we have two or more variables and we want to find the interaction between them, a relationship between them. In simple linear regression we have only two variables: one dependent and one independent. We need to find out whether there is a relationship between them, for example speed and distance. This is a very common example used in the world of machine learning to explain simple linear regression. If we increase the speed, we can cover more distance in a given time. In one hour, if we are driving at a speed of 50 miles or 50 kilometers per hour and we increase the speed to 80 miles or 80 kilometers per hour, we are going to cover more distance in that hour, keeping the time constant. Also, since we are dealing with data, outliers and missing values will usually impact linear regression.
So when you are working on linear regression problems, whether simple linear or multivariate, you need to make sure you are handling your missing values and outliers. A lot of the time, outliers are simply dropped, and missing values are either replaced or dropped depending on how many there are. If there are very few missing values, it's better to drop those rows; but if your dataset has a high level of missing values, you will need to find better ways of handling them before you apply linear regression.

The relationship can be directly proportional or inversely proportional. For example, in the case of speed and distance, if we increase the speed, the distance increases: a directly proportional relationship. But if, when we increase the independent variable, the output, that is, our dependent variable, goes down, then the relationship is inversely proportional. To show this in terms of a graph: this is our positive regression line, which we get when the relationship is directly proportional; whereas if the relationship is inversely proportional, the line goes down, and we get a negative regression line. As you can see, as values increase on the x axis, let's say our independent variable, the values on the y axis, our dependent variable, go down: the values are inversely proportional to each other.

In terms of notation, this is the common formula used for simple linear regression: ŷ = β₁x + β₀. β₁ is basically the parameter of x, and x is the input variable. Usually we will find β₁ first and then calculate β₀. Now, I know this is getting confusing, so let's use a different representation. The beta notation is absolutely right; the alternative is just slightly more user-friendly.
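As a concrete illustration of the clean-up step described above, here is a minimal pure-Python sketch. The data, the z-score rule, and the cutoff of 2 are all hypothetical choices for illustration; real projects often use domain rules, IQR fences, or imputation instead.

```python
import statistics

def clean_for_regression(xs, ys, z_cutoff=2.0):
    """Drop pairs with missing values, then drop points whose y value
    lies more than z_cutoff standard deviations from the mean of y."""
    # Step 1: drop rows where either value is missing (None).
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    # Step 2: flag outliers in y with a simple z-score rule.
    y_vals = [y for _, y in pairs]
    mean_y = statistics.mean(y_vals)
    sd_y = statistics.stdev(y_vals)
    return [(x, y) for x, y in pairs if abs(y - mean_y) <= z_cutoff * sd_y]

# Hypothetical data: one missing value and one extreme outlier (y = 500).
xs = list(range(1, 12))
ys = [15, 12, None, 18, 24, 20, 23, 26, 28, 31, 500]
cleaned = clean_for_regression(xs, ys)
print(len(cleaned))  # 9 points survive: the None row and the outlier are gone
```

The point of the sketch is only the order of operations: handle missing values first, then decide about outliers, and only then fit the regression.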
That representation is y = mx + c. This is the formula for simple linear regression, where m is given by a formula of its own. Don't worry, I will show you the exact calculation; if you're not used to formulas and calculation it may look scary, but this is really, really simple. The numerator is the sum of the products of (x − x̄) and (y − ȳ), and the denominator is the sum of the squares of (x − x̄):

m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

So we will find x̄ and ȳ from our data points, then calculate m, and once we have these values we will calculate c. Without confusing you any further, let me jump into Excel, where I have laid everything out so it will be really easy to understand. Don't worry, we will look at the implementation in R; but in R it is a single line of code, and you won't see what is happening behind the scenes, so it's important to understand the concept first. That's why I have used Excel to show the exact calculation.

Okay. So I have created this chart and this entire calculation. At this moment, just looking at everything, it might be overwhelming, so don't worry, I will explain everything. We have two variables here: y on the y axis, let's say our dependent variable, and x, the independent variable. For now, just look at the chart and ignore everything else on the screen. We have values running from zero to twelve on the x axis and zero to forty-five on the y axis. These are the individual data points: when x is 1, y is 15; when x is 2, y is 12; similarly, when x is 3, y is 18; and all of this is coming from the table here. You can see that when x is 3, y is 18, when x is 4, y is 24, and so on. This red line is the linear trend line, which I have inserted into the graph using Excel's functionality. But right now that is not our concern; we first need to find the intersection.
So for that, this orange line here represents the mean of x, and this yellow line represents the mean of y, and this is our intersection point. Let's go back to our data, to this Excel sheet. The mean of x is 5.5 and the mean of y is 26.6, so the orange line is at around 5.5 and the yellow one at around 26.6. I have drawn them manually, so I agree this line is slightly past 5 and this one is probably closer to 26, but they represent the overall picture: this value is 5.5, and this is 26.6.

If you remember, our original formula was y = mx + c, where m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)². Here we have (x − x̄), (y − ȳ), and the square of (x − x̄). First we need these values: x means the data points on the x axis and x̄ means the mean of x; similarly, in (y − ȳ), y means the individual data point and ȳ means the mean of y.

So what are we going to do? We calculate x − x̄ first. Here I have simply taken x minus x̄: 1 − 5.5 is −4.5, 2 − 5.5 is −3.5, 3 − 5.5 is −2.5, and so on for all the values. The formula remains the same, x − x̄, where x̄ is the mean: we take individual values and subtract the mean. Then we do exactly the same thing for y: the first value is 15, and we subtract the mean from it; likewise for the second value, the third value, and so on, taking y − ȳ. So we have values for x − x̄ and y − ȳ.

Now, the numerator is (x − x̄)(y − ȳ), and then we need the denominator, which is (x − x̄)². We already have x − x̄; we just need to square it: 20.25 is the square of −4.5, 12.25 is the square of −3.5, and so on. The denominator is nothing but the square of this value, this column: all the individual elements have been squared.

To solve the numerator, (x − x̄) multiplied by (y − ȳ), we have multiplied these two columns. This column represents (x − x̄)(y − ȳ), and adding everything up gives the summation; you can see the summation sign in the formula. The individual products of (x − x̄) and (y − ȳ), added up, come to 218. That is our numerator. For Σ(x − x̄)² we do the same thing: we square all these numbers and add them, which gives 82.5. When we solve this, 218 divided by 82.5, we get 2.642.

So now we have ȳ, the mean of y. In y = mx + c we use ȳ = 26.6 and x̄ = 5.5, and we have already calculated m. We put these three values into the equation, solve it, and get the value of c, which is 12.069. That's how we solve the formula for linear regression.

But the question is: this is used in predictive modelling, so how does it help? It helps because initially, when you come up with this chart, we don't know the value of m or the value of c. We use this formula to calculate the values of all the variables in the equation, and then we use them for prediction of y. Remember, y is our dependent variable. Here we have points up to 31 on the y axis, and x goes up to 10. But in real life, when you use predictive modelling, you first train on the data, which means you use this dataset for training, and then you feed in unseen data: let's say you feed 100 into this model, and you want your machine learning model.
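The Excel walk-through can be condensed into a few lines of code. Since the full y column from the sheet isn't reproduced here, this sketch uses made-up data that lies exactly on y = 2x + 1 so the answer is easy to check; the formulas are exactly the ones above: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c,
    computed with the same column sums as the Excel sheet."""
    n = len(xs)
    x_bar = sum(xs) / n                     # mean of x
    y_bar = sum(ys) / n                     # mean of y
    # Numerator: sum of (x - x_bar) * (y - y_bar)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (x - x_bar) squared
    den = sum((x - x_bar) ** 2 for x in xs)
    m = num / den
    c = y_bar - m * x_bar                   # solve y_bar = m*x_bar + c for c
    return m, c

# Made-up data lying exactly on y = 2x + 1, so the fit should recover m=2, c=1.
m, c = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(m, c)  # → 2.0 1.0
```

With the lecture's own columns this same function would return the 218 / 82.5 = 2.642 slope and the 12.069 intercept.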
In this case that means our simple linear regression model, and we want it to determine what the value of y will be if the value of x is 100. In that case, in this formula, you replace the value of x with 100, keep c at 12.069, and find the value of y. So the equation will be y = 2.642 × 100 + 12.069. Let me increase the font size. This will be your predictive model, and this is simple linear regression in a nutshell. Okay, I hope it's helping.

So you first have an initial equation; you solve that equation to get the values of all its terms, and then, for a given value of x, you find the value of y, and you draw the line of best fit. If you look at this line, there is a difference between the line and the individual data points. Take this data point where the value of y is 12: the value of the line at this point is somewhere around, let's say, 18, so there is a gap between the line plotted by our model and the actual value of y, which is 12. This difference is called the residual error. Now, our approach is to minimize this error: our model should come up with a line which is close to all these points. For instance, this point here, where y is close to 26: if we look at our red linear regression line, it is very close to this actual point. So we could say the ideal linear regression model is the one where this difference, the residual, is least.

That's how the linear regression model works. If you have any question on the calculation part or anything in general, please leave a comment. But overall, in linear regression, this is what you do: you try to find the value of y for a given x, and in order to do that, you first need to solve the equation for linear regression. Let's jump back into the slides and cover a few more important topics. So this is the same thing so far.
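Plugging in the slope and intercept from the worked example (m = 2.642, c = 12.069), prediction and residual error look like this; the point (x = 2, y = 12) is the one from the chart that sits below the fitted line.

```python
m, c = 2.642, 12.069          # slope and intercept from the worked example

def predict(x):
    """Predicted y for a given x under y = m*x + c."""
    return m * x + c

# Prediction for unseen data, x = 100, as in the lecture:
print(predict(100))           # ≈ 276.269

# Residual error = actual y minus predicted y.
actual_y = 12                 # the chart's point at x = 2
residual = actual_y - predict(2)
print(round(residual, 3))     # → -5.353
```

A negative residual just means the actual point lies below the regression line; the least squares fit tries to keep all such gaps small at once.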
If the value of x is 100 or 1,000 or 10,000, whatever value you come up with for x, you will use this equation to find the value of y. I have also mentioned terms like beta here, because they are commonly used in the industry, and if somebody is using the formula written with betas, you need to remember it is the same formula, just expressed differently: they are talking about the parameters, which are m and c in this case. β₁ is called the slope parameter and β₀ is called the intercept parameter, so β₁ is your m and β₀ is your c. These are also referred to as estimators, because in reality we don't calculate parameters, we estimate them; that is how it's put in the good old world of statistical analysis. So if somebody asks you what an estimator is in linear regression, you know they are talking about the beta parameters, which are also called regression coefficients.

In simple linear regression, the method used is called the least squares method. Basically, to measure the difference between the linear regression line, the red line we saw, and the individual data points, the method we use is the least squares method. There is also another way of solving the equation, by using the chart: you find β₀ as the y intercept, where x equals zero, and then you find β₁ by measuring the change in y whenever you change x by one unit, using the plot. In this chart, when you change x by one unit you observe a change in y of two units, so β₁ will be 2 divided by 1, which is 2. It could be anything: say you change x by one unit and observe a change in y of 20 units or 40 units; you would use that for the calculation. But the easier and better option is m x + c, using the data points we have.
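"Least squares" can also be demonstrated directly: the fitted line has a smaller sum of squared residuals than any other line we might try. The noisy data below is hypothetical; the slope and intercept are computed with the same sums used throughout this lecture.

```python
def sse(xs, ys, m, c):
    """Sum of squared residuals for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy data, roughly following y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Least-squares estimates, from the usual column sums:
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
c = y_bar - m * x_bar

# No other line does better on the squared-error criterion:
best = sse(xs, ys, m, c)
assert best <= sse(xs, ys, m + 0.1, c)   # a steeper line is worse
assert best <= sse(xs, ys, m, c + 0.5)   # a shifted line is worse
print(round(m, 2), round(c, 2))  # → 1.97 1.09
```

Perturbing the slope or the intercept in either direction always raises the sum of squared residuals; that is exactly what "least squares" means.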
But again, all of this is happening in the statistical world. In machine learning we are not going to calculate these things manually; we will simply use tools to do it for us, in R or Python or whatever language you are using, and they will do the job.

This is the negative regression line which I showed you earlier. In terms of hypotheses, this is how you present your null hypothesis; H₀ simply means the null hypothesis. But in machine learning projects you are not usually going to write it formally. You will simply say H₀: there is no interaction between the given variables. If we have speed and distance, you would state that there is no interaction between speed and distance. And, as I have mentioned, the residual value is basically the difference between the actual values of y and the predicted values of y, where the predicted values simply mean the red line.

In terms of key concepts, you need to remember homoscedasticity and heteroscedasticity. When there is similar variance in different samples from the same population, we call it homoscedasticity; when there is no similarity, when the variances differ, we talk about heteroscedasticity. These are important key terms which come up in interviews, which is why I have mentioned them here; if you are preparing for interviews, make sure to remember them.

And if we were to include more variables in the mix, so instead of, let's say, just speed and distance we also include something like the horsepower of the car, the displacement of the car, or whatever parameters a car has, then we are looking at multiple linear regression. All the mechanics, all the concepts of linear regression remain the same; the calculation remains the same. It's just that the equation expands to include more variables: you can see that we have β₀ + β₁x₁ + β₂x₂ and so on. So the overall equation remains the same.
It's just that we include more variables in the equation; as far as the functionality, the concept, is concerned, it remains the same. In the next lecture we will go through the implementation in R. Thank you.