Feel The Chi Square Statistic

Gopal Prasad Malakar
A free video tutorial from Gopal Prasad Malakar
Trains Industry Practices on data science / machine learning

Statistics Made Easy by Example for Analytics/ data science

Statistics Simplified - Statistics Made Easy by Excel Simulations. Master fundamentals of statistics & Probability.

10:37:53 of on-demand video • Updated July 2019

  • By the end of this course, you should become very comfortable with popular concepts of statistics
  • You should know the genesis of popular statistical concepts
  • You should know how to apply them to business problems
  • You should have the required course material for reference
In the last section, you got an intuitive understanding of the chi-square statistic. In this section, we will dig into the chi-square formula further and try to understand why it has been designed the way it has. This is the formula that you have seen: (observed - expected)² / expected, that is, observed minus expected, whole squared, divided by expected. We'll understand each portion, the numerator as well as the denominator, one by one.

So let's first understand the numerator, which is observed minus expected, whole squared. Why we take observed minus expected is clear: the larger the difference between observed and expected, the stronger the signal. It means that there is a category which has a distinctive success rate, or which has a disproportionate share of the other categorical variable. But the question comes, why have we squared it? Understand it this way. Suppose observed was 12 and expected was 10; the numerator will be 12 minus 10, squared, which is 4. Take another case where observed is 15 and expected is 10; the numerator is 25. A larger difference between observed and expected means a larger impact, that is right, but squaring increases the impact even more. What was just a difference of 2 (12 minus 10) became 4, and the difference of 5 (15 minus 10) became 25. If we had not squared, on a chart the second difference would have looked just two and a half times the first. But the moment you square, the relative importance of the second case increases: put the squares on the same kind of chart and the second one looks about six times the first (25 against 4). So the impact of the difference increases where the observed is farther from the expected.
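The magnification effect can be checked with a few lines of Python. This is only a minimal sketch using the two observed/expected pairs from the lecture; the function and variable names are illustrative and not part of the course material.

```python
# Observed/expected pairs from the lecture: (12, 10) and (15, 10).
cases = [(12, 10), (15, 10)]


def raw_difference(observed, expected):
    """Plain difference between observed and expected."""
    return observed - expected


def squared_difference(observed, expected):
    """Numerator of the chi-square term: the difference, squared."""
    return (observed - expected) ** 2


raw = [raw_difference(o, e) for o, e in cases]          # differences 2 and 5
squared = [squared_difference(o, e) for o, e in cases]  # squares 4 and 25

# How much bigger the second case looks relative to the first.
raw_ratio = raw[1] / raw[0]              # 5 / 2
squared_ratio = squared[1] / squared[0]  # 25 / 4

print("raw differences:", raw, "ratio:", raw_ratio)
print("squared differences:", squared, "ratio:", squared_ratio)
```

Without squaring, the second case is 2.5 times the first; after squaring it is 6.25 times, which is the "about six times" seen on the chart.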
Squaring magnifies the impact of those observations, those values, which are farther from the expected. So that's the first thing it does: the magnification job. There is one more thing it does. With 12 and 10 you got 4. Even if it was 8 and 10, you would have got 8 minus 10, which is minus 2, squared, which is again 4. So it makes everything positive, and that is a very important thing. That's the reason why in a contingency table you can sum across all the cells without the differences cancelling out. So the squaring serves two purposes: first, it magnifies the impact of those values which are farther from the expected, and second, it makes every term positive so that when you add them up they do not cancel.

Now, why do we have a denominator? Why was the chi-square formula not designed as just observed minus expected, whole squared? There is a clear-cut reason. If you do not divide by the expected, you are not getting the relative importance of the difference. Let me explain with an example. Say case one has observed 115 and expected 110. The numerator becomes 115 minus 110, squared, which is 25. In case two, observed is 15 and expected is 10, and the numerator is again 25. Now if I ask you intuitively which value is farther from its expected, I'm sure your answer will be case two, because there you expected 10 and it has come out as 15, which is 50 percent away. Whereas in case one you expected 110 and got 115, not even 5 percent away from expected. But if you go just by observed minus expected, whole squared, it gives the same value for both cases: 25 and 25.
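Both points, that squaring stops positive and negative differences from cancelling and that the numerator alone ignores scale, can be verified numerically. The sketch below uses only the numbers quoted in the lecture; the names are illustrative assumptions.

```python
def squared_difference(observed, expected):
    """Numerator of the chi-square term: the difference, squared."""
    return (observed - expected) ** 2


# Sign: 12 vs 10 and 8 vs 10 deviate by the same amount in opposite directions.
signed_differences = [12 - 10, 8 - 10]
signed_total = sum(signed_differences)  # the +2 and -2 cancel each other
squared_total = squared_difference(12, 10) + squared_difference(8, 10)  # 4 + 4

# Scale: the numerator alone cannot tell the two cases apart.
case_one_numerator = squared_difference(115, 110)
case_two_numerator = squared_difference(15, 10)

print("sum of signed differences:", signed_total)
print("sum of squared differences:", squared_total)
print("numerators:", case_one_numerator, case_two_numerator)
```

The signed differences add up to zero while the squared ones add up to 8, and both cases produce a numerator of 25 even though one is 50 percent away from its expected value and the other less than 5 percent.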
So the numerator alone is not able to capture the relative importance. But the moment you divide by the expected, what happens? In case one, 25 divided by 110 is about 0.23, whereas in case two, 25 divided by 10 is 2.5. So it's much larger in case two. What you need to understand is that in case one the observed was not even 5 percent away (115 against an expected 110), yet squaring gave 25, and in case two squaring also gave 25. But when you divide by the expected, it becomes crystal clear that case two is much more divergent than case one: in case one the divergence is just about 23 percent (25 divided by 110), whereas in case two it is 250 percent (25 divided by 10). So dividing by the expected value captures a sense similar to percentage difference.

And that makes the statistic comparable across the same or similar contingency tables. That's another twist. What I'm trying to say is this: suppose you have two contingency tables. Take the kind of scenario we discussed earlier, where one side of the table is a grouping such as male-female (or north, south and others) and the other side is pass-fail. If both tables have the same shape, say two by two and two by two, then just by looking at the chi-square itself you can say that the bigger the chi-square, the larger the impact. So if it's the same or similar contingency table, the values are comparable. But how do you know which contingency tables are similar and which are not?
That gets explained by degrees of freedom. So now let me explain the meaning of degrees of freedom in a contingency table.
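As a recap of this section's arithmetic, the complete term (observed - expected)² / expected can be sketched in Python. The helper names `cell_contribution` and `chi_square_statistic` are hypothetical, chosen for illustration, and the two-cell example simply reuses the lecture's 12-vs-10 and 8-vs-10 values rather than a real contingency table.

```python
def cell_contribution(observed, expected):
    """One cell's chi-square term: squared difference divided by expected."""
    return (observed - expected) ** 2 / expected


def chi_square_statistic(observed_counts, expected_counts):
    """Sum the per-cell terms over all cells (inputs are flat lists of cells)."""
    return sum(
        cell_contribution(o, e) for o, e in zip(observed_counts, expected_counts)
    )


# The two cases from the lecture: same numerator (25), very different scale.
case_one = cell_contribution(115, 110)  # 25 / 110, roughly 0.23
case_two = cell_contribution(15, 10)    # 25 / 10 = 2.5

# Illustrative two-cell total using the lecture's 12-vs-10 and 8-vs-10 values.
toy_total = chi_square_statistic([12, 8], [10, 10])  # 0.4 + 0.4

print(f"case one: {case_one:.3f}, case two: {case_two:.3f}")
print(f"toy total: {toy_total:.3f}")
```

Dividing by the expected value makes case two about eleven times larger than case one, which matches the intuition that 15 against 10 is a far bigger departure than 115 against 110.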