Regression Assumptions

George Ingersoll
A free video tutorial from George Ingersoll
MBA & PhD
4.6 instructor rating • 1 course • 27,005 students

Lecture description

Overview of the four main assumptions of linear regression: linearity, independence of errors, homoscedasticity, and normality of residual distribution.

Learn more from the full course

Workshop in Probability and Statistics

This workshop will teach you the fundamentals of statistics in order to give you a leg up at work or in school.

21:37:01 of on-demand video • Updated April 2020

  • By the end of this workshop you should be able to pass any introductory statistics course
  • This workshop will teach you probability, sampling, regression, and decision analysis
All right, in this video we're going to be talking about the assumptions of linear regression. These are the things that must hold true in order for a regression to be legitimate. We use the term "assumptions" not because we just assume they're true any time we run a regression and then forget about them, but because when we see a regression output we assume these things have been checked and that they all hold true, and therefore it's a legitimate regression and we can go ahead and interpret the results. If you were a very serious researcher, these would be extremely important, things you'd want to check every time, and there is a whole battery of tests for validating these assumptions. For business statistics, though, I think it's more important to understand that these are the underlying assumptions of regression, know what they mean, and be able to do a quick eyeball test and maybe one or two formal tests, basically to understand the foundations of regression. So I'm not going to spend a lot of time on this; let's proceed.

Here you have the four assumptions, and we're going to go into each of them in a little more detail. Before we do that, I want to review some of the terms of regression. When we're talking about simple linear regression with one predictor variable, our prediction equation looks like y = mx + b. The slope is m, which is essentially just the change in y over the change in x, otherwise known as rise over run. The y-intercept is b, also called the constant in simple linear regression. The real term I want to define here is residuals. A residual is the difference between an observed data point and the value predicted by our regression equation. This red line here is our regression equation, a best-fit line through the data, and you'll notice that in most cases the observed data doesn't land exactly on that line; the residuals are the differences between the observed values and what the regression equation would predict. If you remember, the way you get this equation is least squares: you find the line that results in the smallest sum of the squares of all of these residuals. So that's the term I really wanted to define: residuals.
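To make residuals and least squares concrete, here's a minimal Python sketch, assuming NumPy is available and using made-up example data; np.polyfit finds the slope and intercept that minimize the sum of squared residuals:

    import numpy as np

    # Hypothetical example data: x is the predictor, y is the observed response
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

    # Least-squares fit of y = m*x + b (polyfit minimizes the sum of squared residuals)
    m, b = np.polyfit(x, y, deg=1)

    # Predicted values from the regression equation, and the residuals
    y_hat = m * x + b          # predictions on the best-fit line
    residuals = y - y_hat      # observed minus predicted

    print("slope m:", m, "intercept b:", b)
    print("sum of squared residuals:", np.sum(residuals ** 2))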
Let's talk about these regression assumptions and get to the first one, linearity. I'll point out that I got a little carried away when I was making the example drawings for these slides, so as we go through them, please try not to be distracted by my really cool drawings; they are pretty elaborate. This is the linearity assumption, and we've talked about it already: you really shouldn't be running a linear regression on data that is not linear. The whole point is that we should be dealing with linear data and fitting a best-fit line through it. If the data aren't linear, you can't put a line through them and have it make sense. That's what the two examples here show: through the first data set we can draw a sensible best-fit line; through the second you could draw a line, but it wouldn't give you accurate predictions, because the data aren't linear. The same goes for an inverted U or all sorts of other shapes. So, first assumption, linearity: the data must be basically linear.

Now, assumption number two: independence of errors. Over here we have data that is essentially a cloud, a linear-looking cloud, scattered randomly around the prediction line we've drawn. Now notice this other example (and again, try to ignore the cool drawings). If we put this red line through the data, it's linear, sort of, but the residuals, the errors, follow a pattern: they start out close to the line, get farther and farther away up to a point, and then come back. You could say the residuals are influencing each other: a small residual, then a little bigger, a little bigger, a big one, then smaller, smaller, smaller, then small negative residuals growing into larger negative residuals as we go up the line. What this tells us is that the errors are not independent; each one depends on the previous one. We can't have that. We need the errors to be basically random across our x values. So the second assumption is that the errors are independent. If they're not, you may have to make an adjustment, because the data as currently constructed will not be a good candidate for regression and your predictions won't be accurate.
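If you do want one quick numerical check for independence of errors, one option is the Durbin-Watson statistic computed from the ordered residuals; the sketch below uses NumPy and invented residuals that drift in a wave-like pattern. Values near 2 suggest independent errors, while values near 0 or 4 suggest the errors depend on their neighbors:

    import numpy as np

    def durbin_watson(residuals):
        # Ratio of summed squared successive differences to summed squared residuals.
        # Near 2: little autocorrelation; near 0 or 4: strong dependence between errors.
        diffs = np.diff(residuals)
        return np.sum(diffs ** 2) / np.sum(residuals ** 2)

    # Hypothetical residuals that follow a wave-like pattern along x
    patterned = np.array([0.1, 0.4, 0.9, 1.2, 0.8, 0.3, -0.2, -0.7, -1.1, -0.5])
    print(durbin_watson(patterned))   # well below 2, hinting at dependent errors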
The third assumption has a very tricky name, but it actually isn't that confusing: the assumption is that the data are homoscedastic (I pretty much said that right). What that means is that as x gets larger or smaller, there is no trend in the size of the residuals. This is a bit like independence of errors: there we were worried about the residuals following a pattern, even though they don't necessarily get larger or smaller on the whole. With heteroscedastic data, which we don't want, there is a trend in the size of the residuals. They could be trending larger as x gets larger, or the exact opposite: you start out with big residuals that get smaller as x gets larger. Usually that means the data look like a cone of one sort or another, a funnel opening to the right or a funnel opening to the left. You can eyeball the data and ask whether the points are getting more spread out, in other words whether the residuals are getting larger, as you move in one direction or the other. If so, that's bad, and we may need to do a transformation in order to run an accurate regression.

This slide shows a test for heteroscedasticity. Beyond eyeballing the data and deciding it looks heteroscedastic, we can plot the residuals, or the squares of the residuals, against the x values. If we square the residuals, or take their absolute values, we should see basically a slope of zero, meaning that no matter how large the x value is, we can expect a fairly random spread of residuals; some will be larger and some smaller, some positive and some negative, but the x value shouldn't make a difference in how large they are. On the other hand, here we have heteroscedastic data (I'm struggling to say the word). As you can see, the square of the residuals gets larger and larger as the x values go up. So this side would be heteroscedastic, where the residuals grow as x grows, whereas with the homoscedastic data on the left the residuals don't follow any trend as x changes.
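As a rough stand-in for that plot, you can fit a line to the squared residuals against x and look at the slope: a slope near zero is consistent with homoscedasticity, while a clear trend suggests the funnel shape of heteroscedastic data. This sketch uses made-up numbers and is an eyeball-style check, not a formal test like Breusch-Pagan:

    import numpy as np

    # Hypothetical residuals whose spread grows with x (funnel / cone shape)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    residuals = np.array([0.1, -0.2, 0.4, -0.6, 0.9, -1.3, 1.8, -2.4])

    # Slope of squared residuals vs. x: near zero suggests homoscedasticity,
    # a clear positive or negative trend suggests heteroscedasticity
    slope, intercept = np.polyfit(x, residuals ** 2, deg=1)
    print("trend in squared residuals:", slope)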
Okay, the fourth assumption; we'll get through this quickly, and this is where my drawings get really, really cool. As you can see, this is another kind of tricky one. The fourth assumption is normality of the error distribution. What that means is that at any given x value, we expect the actual data points to be normally distributed around the predicted value in terms of their distance from it. In other words, our observations should be normally distributed around the prediction line: most of the observed values would be pretty close to the predicted values, with a few outliers. This is just like a normal curve: more data points toward the middle and fewer at the edges. That's what we're hoping to see in the residuals: fewer big residuals, positive or negative, and more small ones clustered near zero, with an even balance of positive and negative residuals at any point along the x axis.

There are a lot of ways to have a non-normal error distribution. In this example, the observed values are evenly balanced between positive and negative, but there's no middle ground: the residuals are either very small or very large, positive or negative, with nothing in between. That distribution would look something like this, which is not normal; it's more like waves, or whatever you want to call it. There are lots of ways to have non-normal data, but the assumption is that your error distribution is normal at any point on x. That's what this normal curve is showing: yes, we have a couple of outliers out here, but mostly the residuals are clustered close to the line. So there you have it.
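And if you want a quick numerical check on normality to go with the eyeball test, a histogram of the residuals or a Shapiro-Wilk test will do it; this sketch assumes SciPy is available and uses invented residuals, where a small p-value is evidence against normally distributed errors:

    import numpy as np
    from scipy import stats

    # Hypothetical residuals mostly clustered near zero with a couple of outliers
    residuals = np.array([-0.3, 0.1, -0.1, 0.2, 0.0, -0.2, 0.4, -0.4, 1.9, -2.1])

    stat, p_value = stats.shapiro(residuals)
    print("Shapiro-Wilk statistic:", stat, "p-value:", p_value)
    # A p-value well below 0.05 would be evidence against normally distributed errors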