Sampling

George Ingersoll
A free video tutorial from George Ingersoll
MBA & PhD

Lecture description

Introduction to Sampling and the Central Limit Theorem. Also how the size of a sample relates to the accuracy of a prediction for a population parameter.

Workshop in Probability and Statistics

This workshop will teach you the fundamentals of statistics in order to give you a leg up at work or in school.

21:37:01 of on-demand video • Updated April 2020

  • By the end of this workshop you should be able to pass any introductory statistics course
  • This workshop will teach you probability, sampling, regression, and decision analysis
Today we're going to talk about sampling. Before we begin talking about sampling techniques and things like that, we need to define a couple of very important terms. The first term we're going to talk about is the population. When we're talking about sampling, the population is defined as the big group that we're trying to find something out about. This could be everybody in the world, sort of the literal definition of population, but it could also be something like all the registered Democrats in the U.S., or all U.S. college students, or all of the customers of a certain company. But a population does not need to be a population of people. It could also be all of the products produced by a certain company, or it could be all registered vehicles in a state. In other words, it's a large group that we want to know something about, but it's too big to go out and test every member of the population, or to ask every member a question, or to go observe every member, whether it be a person or a thing. They're just too numerous.

So sampling is a way to find something out about the population. We use a sample, which is a small group taken from the population, and we're going to test, ask, or observe: we're going to do something with the members of the sample. Because it's a smaller, representative group from the population, that's going to tell us something about the population. It's going to give us an estimate that should hold true for the entire population.

Now, I mentioned the term "representative." Finding a good sample is a real challenge and one of the hardest things to do in research. What that means is you want to pull a sample that is exactly representative of the members of the population, but that's very hard to do. So, for instance, say you want to survey all registered Democrats in the U.S.
How do you get a group that is exactly representative of all Democrats? You can't just go to your political science class and say, "Hey, who is a Democrat? Let me ask you some questions," and then use that as representative of all Democrats, because these are college or graduate students, and they might not have the same views as the entire population of Democrats in the United States.

Generally speaking, randomisation is better. In other words, you want the selection to be completely random: you want to just sort of grab members of the population at random. That ought to work, because randomisation should give you the same distribution as the population (and I keep using the term "people," but this could be items or whatever): the ones that are middle of the road and the ones that are at the extremes.

But it's hard even to get randomisation. To go back to the same example, if you want to survey Democrats, you can't just call up a bunch of registered Democrats, because it matters when you're calling them. I mean, OK, you can, but it won't necessarily be a 100 percent representative sample, because if you call them during the day, you're only getting the people who are home during the day, and that might not be the same as all Democrats. There are many, many challenges. But essentially what you're trying to do is get a representative sample, and usually the way to do that is, to the extent you can, to randomise your selection from the population.

So the basics are here. The population is the big group, the whole. The sample is a smaller, more manageable group, and you're trying to use it to get some truth about the larger group. Oftentimes what you're trying to figure out about the population is the mean (the average) and maybe the standard deviation.
It could also be a proportion. So again, if you're surveying people's political opinions, you might be interested in what percentage supports some sort of gun control legislation. And so that's what you're really trying to figure out: we think it's maybe 60 percent, so we're going to take a sample and see if that holds true. So it can be a proportion, it can be a mean (an average), or maybe you want to know the standard deviation, the variance, things like that.

One thing that is generally true is that the larger the sample, the more likely it is to be accurate in terms of making predictions about your population. So if you are interested in finding out about people's political viewpoints and you ask two people, you might get two extremists. But if you ask a thousand people, they're much more likely to represent the entire population that you're interested in. You'll certainly get some extremists, but you'll get extremists on both sides, and you'll get a lot of middle-of-the-road people. And so it will all sort of balance out: you're much more likely to end up at an average that is representative of the average of the population. So big samples, to the extent that they're possible, tend to be good in terms of their predictive power with regard to the population.

All right, so now I want to switch to a whole new color (we're going through the spectrum today) and talk about another concept that is extremely, extremely important. I alluded to this in a previous video, but I really want to talk about it at length here and hammer it home. It's called the central limit theorem. This is very important, and it's actually one of the more challenging concepts in statistics as regards sampling. I think for a lot of people it sort of goes over their heads. But we're going to talk about it a lot.
We're going to give some examples. The central limit theorem says this: when you take a sample from a population, the probability distribution for the mean of your sample is going to be normal around the mean of your population. Let's draw this out. It seems like a lot of words, but it actually kind of makes sense.

This is a probability distribution. On the y axis we have the probability of getting any specific value. (I think that's the first time I've written vertically in these videos.) All right, so that's probability here. Along here we have values that the sample mean might be. We've taken a sample, and we're measuring that sample mean, the average of the sample. It could take any of these values, and they center around the population mean. This vertical center line right here is the population mean, and all along here are values that the sample mean might be. So when we take a sample, its mean could be anywhere in here. And all this represents the probabilities: where the curve is higher, there's higher probability, and where the curve is lower, there's lower probability. And obviously it all adds up to 100 percent.

So it means that we're a lot more likely to end up with a sample mean that's close to the population mean as opposed to being extremely far to the right or to the left. Those are much less likely; being close to the population mean is likely. Also notice these are all the properties of a normal distribution. It's symmetrical, so we're as likely to be higher than the population mean as we are to be lower. It's all centered around the population mean, and we have this nice bell-shaped curve in terms of probability.

Really important point here: this tendency, the fact that the sample means will be normally distributed around the population mean, is true no matter what the distribution looks like for the population.
In other words, the distribution of values in the population could be a uniform distribution. The probability distribution of the population could look like this; it could be something totally bizarre. But both of these are going to have a mean. And when you take a sample, the distribution for those sample means is going to be normally shaped.

An easier way to think about this is to say: look, suppose you take a whole bunch of samples, say a thousand samples. If, instead of a probability distribution, this were a distribution like a histogram, what you'd end up with (this is my histogram) is values that are normal. So if you took a whole bunch of samples, the means of each of those samples are going to be normally distributed.

Let's actually look at it. I'll give you a specific example here. Let's say we are interested in finding out how much the average U.S. family (or, we'll say, the average person in the U.S.) spends per week on groceries. Grocery spend per week: we'll say the actual average is $70. That's the actual amount that the entire U.S. population spends on groceries per week. I made that number up. If you know the real number, then please write to me. It's not the actual number. So we say the average person in the U.S. spends $70 per week on groceries.

Now, what would we get if we took a thousand samples of 100 people each? So we have a thousand samples, each with a sample size of 100. What we would see is that the means of those samples (these are the sample means on the horizontal axis) would be normally distributed. The distribution of the sample means would be normally distributed around $70.
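The thousand-samples thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the lecture gives no shape for the population, so the sketch assumes an exponential distribution with mean $70, picked only because it is clearly not bell-shaped. The names `draw_spending`, `sample_means`, and `share_within_one_sd` are made up for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

POPULATION_MEAN = 70.0  # the made-up $70 per week figure from the lecture
NUM_SAMPLES = 1000      # a thousand samples...
SAMPLE_SIZE = 100       # ...of 100 people each


def draw_spending():
    """One person's weekly grocery spend.

    Assumption: an exponential population (heavily skewed, not normal),
    used only to show that the shape of the population does not matter much.
    """
    return random.expovariate(1.0 / POPULATION_MEAN)


# Take 1,000 samples and record the mean of each one.
sample_means = [
    statistics.fmean([draw_spending() for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# For a bell-shaped distribution, roughly 68% of values fall within
# one standard deviation of the center.
share_within_one_sd = sum(
    abs(m - mean_of_means) <= spread_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of the sample means:  {mean_of_means:.2f}")
print(f"spread of the sample means: {spread_of_means:.2f}")
print(f"share within one spread:    {share_within_one_sd:.2f}")
```

Even though individual spending in this assumed population is heavily skewed, the sample means should cluster near 70, with roughly two thirds of them within one standard deviation of the center, which is what a bell curve would give.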
In other words, most of those samples of a hundred people each would have an average that's pretty close to 70. There would be a few outliers that were higher than 70 and maybe a few outliers that were significantly lower than 70, but mostly they'd be around 70.

So this is a thousand samples of 100 people each. What if we did a thousand samples of 500 people each? Well, remember that larger samples are generally better, more accurate predictors of the actual value in the population. So we're still going to see a normal distribution centered around 70. It's going to be harder for me to draw because I've put my center line in already. Let's try this. What you're going to see is a distribution that goes like this. That was a remarkably good try. You've got to give me credit for that. Maybe you're better at it, but for me that was as good as it gets.

So what we see here is that it's still a normal distribution, but it's taller, meaning that more of the values are centered around $70. There are fewer outliers, and they aren't going to be as far away. So among those samples of 500 each, there are going to be fewer that are far away from 70 and more that are clustered right around 70, because larger sample sizes are better predictors.

Now let's say we took a thousand samples of 20 people each. It's still a normal distribution centered at 70, which is the actual population mean, but we're going to see something like this. What we're seeing here is still normal, but with a lot more outliers, because with a sample of 20 we might get pretty far away from 70, the actual average of the population. So we might see samples that have a mean that's pretty far away from 70 in either direction.
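The comparison of sample sizes 20, 100, and 500 can be sketched the same way. The lecture never gives a standard deviation for grocery spending, so the $25 used here is purely an assumption, as is the normal shape of the population; `spread_for_sample_size` is a made-up helper name. The sketch compares the simulated spread of the sample means with the textbook standard error, sigma divided by the square root of n.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

POPULATION_MEAN = 70.0  # the lecture's made-up weekly grocery figure
POPULATION_SD = 25.0    # assumption: the lecture gives no standard deviation
NUM_SAMPLES = 1000


def spread_for_sample_size(n):
    """Return (center, spread) of the means of NUM_SAMPLES samples of size n."""
    means = [
        statistics.fmean(
            [random.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(n)]
        )
        for _ in range(NUM_SAMPLES)
    ]
    return statistics.fmean(means), statistics.stdev(means)


centers = {}
spreads = {}
for n in (20, 100, 500):
    centers[n], spreads[n] = spread_for_sample_size(n)
    predicted = POPULATION_SD / math.sqrt(n)  # standard error of the mean
    print(
        f"n={n:>3}: center {centers[n]:.2f}, "
        f"simulated spread {spreads[n]:.2f}, predicted {predicted:.2f}"
    )
```

All three sets of sample means should center near 70, and the spread should shrink as the sample size grows: wide for samples of 20, narrower for 100, and narrowest for 500.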
With samples of 20, the means are still going to be mostly clustered around 70, but there's a much wider distribution.

All right, so let's talk about why this is important. Why is the central limit theorem important? It's important because we know that the distribution of our sample means is going to be normal around the actual population mean, and we know things about the normal curve. That's what we've been looking at in the last few videos: we can figure things out based on standard deviations, the number of standard deviations away, and confidence intervals. Once we have a sample mean, we can say, OK, now we should have an idea of what the population mean is. And because we have confidence intervals based on the normal distribution, we can say we're confident that the actual population mean is going to be within this range. We're going to do a lot of examples of this. I just want to put that out there right now: because we know things about the normal curve (we have these tables and things like that), we can actually make accurate predictions about the population mean, and know how accurate they are, based on the size of our sample and, obviously, the standard deviation and the average that we measure in that sample. So the central limit theorem, because it says the distribution of sample means is going to be normal around the population mean, gives us tons of things that we can use later on.

So let's take a look at an example to illustrate how we might actually use sampling in a real situation. Let's say we're a marketing firm, and five years ago we did a very extensive study on Americans' internet usage. In that study five years ago, we found that average internet usage was three hours per day.
And the standard deviation was one hour per day. We decided that internet habits have possibly changed in the last five years, and so we decided to update our study. What we did is we took a sample of 2,000 Americans. Our sample size is 2,000, so we could say n equals 2,000; sample size is usually denoted n. In our new study we got a good group. They're very representative: it's a good mix in terms of ages, genders, professions, incomes, and educational backgrounds. It's an excellent sample. And what we find is that our sample mean is six hours per day.

What does that tell us? Well, the central limit theorem says the sample mean should be normally distributed around the population mean. So let's say this was our assumption: the thing that we're interested in knowing is whether the average of the population is still three hours per day. If that were the case, if the population average still was three hours a day, what's the likelihood we would end up out here at six hours? The standard deviation is one hour per day. What's the chance that we'd end up three standard deviations above the mean? By the way, there's extra math involved that takes into account the size of the sample and the distribution of the sample mean, so it's actually more than three standard deviations. But just based on this, with a sample of two thousand and a mean of six, what kind of conclusions can we draw?

Well, I think it's pretty obvious that we can say it's pretty unlikely that the average is still three hours a day. That does not seem very plausible considering what we got. We had 2,000 people in our sample. That's a pretty decent-sized sample.
We got six hours per day as the average of that sample, and unless our sample was crazily unrepresentative (and I told you it was a good sample), there's just no possible way that we could get six hours if three hours was still the average. So what do we assume? We assume that the average number of hours that Americans spend on the internet has changed in the last five years. There's a lot of evidence to suggest it has gone up, and gone up significantly.

So this is actually one extremely common use of sampling, and we're going to be talking a lot more about it later. It's called hypothesis testing. Basically you say, "I have this hypothesis that I want to check out." We haven't even done the math around this, but this is just sort of an obvious case. The hypothesis here is that Americans spend three hours per day on the internet, because that's the last measured value we have. That's called the null hypothesis. We also have an alternative hypothesis, and the alternative hypothesis is: no, that's actually changed; that value is no longer three hours a day. What we're looking for is evidence to say either that we can reject the null hypothesis, because we do think there has been a change, or that we don't have enough evidence, and we're going to keep on believing that three hours per day is the average for Americans.

This sample here gives us plenty of evidence that we should reject the null hypothesis. We're going to reject our original hypothesis, which is that Americans spend three hours per day, and say that now it's probably something more. And so that's how you might use sampling in a real-world type of situation.
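The lecture leaves the math for later videos, but the "extra math" it mentions can be sketched with the numbers given: a null-hypothesis mean of three hours, a standard deviation of one hour, a sample of 2,000, and a sample mean of six hours. This is a rough sketch, not the course's own worked solution. It assumes the old study's one-hour standard deviation can stand in for the population standard deviation (a z-test) and picks a 5 percent significance level, which the lecture does not specify.

```python
import math
from statistics import NormalDist

NULL_MEAN = 3.0      # hours per day, from the study five years ago
POPULATION_SD = 1.0  # assumption: the old study's SD still applies
SAMPLE_SIZE = 2000
SAMPLE_MEAN = 6.0    # hours per day, from the new sample
ALPHA = 0.05         # assumption: a conventional significance level

# The sample mean varies far less than an individual does:
# its standard deviation (the standard error) is sigma / sqrt(n).
standard_error = POPULATION_SD / math.sqrt(SAMPLE_SIZE)

# How many standard errors the sample mean sits above the null-hypothesis mean.
z_score = (SAMPLE_MEAN - NULL_MEAN) / standard_error

# Two-sided p-value: the chance of a sample mean at least this far from 3
# in either direction if the population mean really were still 3.
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))

reject_null = p_value < ALPHA

print(f"standard error: {standard_error:.4f} hours")
print(f"z-score:        {z_score:.1f}")
print(f"reject the null hypothesis: {reject_null}")
```

Under these assumptions the standard error is about 0.022 hours, so a sample mean of six sits roughly 134 standard errors above three, not just three standard deviations. The p-value is so small that it rounds to zero in floating point, which matches the lecture's conclusion that the null hypothesis should be rejected.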