Calculating confidence intervals within a population with a known variance

365 Careers
A free video tutorial from 365 Careers
Creating opportunities for Business & Finance students

Lecture description

We see our first example of the use of confidence intervals and introduce the concept of the z-score.

Learn more from the full course

Statistics for Data Science and Business Analysis

Statistics you need in the office: Descriptive & Inferential statistics, Hypothesis testing, Regression analysis

04:48:23 of on-demand video • Updated January 2021

A confidence interval is the range within which you expect the population parameter to be, and its estimation is based on the data we have in our sample. There are two main situations when we calculate confidence intervals for a population: when the population variance is known and when it is unknown. Depending on which situation we are in, we use a different calculation method.

Now, the whole field of statistics exists because we almost never have population data. Even if we do have population data, we may not be able to analyze it (there may be so much of it that it doesn't make sense to use it all at once).

Think about people using the Internet. The data Google has approximates population data, BUT even their data is not population data. There are people who are not a part of the Google ecosystem in any way. They may use other browsers like Opera and Safari, other search engines like Bing or DuckDuckGo, or video providers other than YouTube. Furthermore, they can browse in incognito mode. These people are a part of the population of people using the Internet, but Google doesn't have much data on them. As you can see, even the company that has the most data doesn't necessarily have population data. So, if Google wants to use statistical methods to target them with Google ads, they will basically be using sample data, with an unknown population variance, to guess their preferences.

Okay. In this lesson, we will explore the confidence intervals for a population mean with a known variance. An important assumption in this calculation is that the population is normally distributed. Even if it is not, you should use a large sample and let the central limit theorem do the normalization magic for you. Remember? If you work with a sample that is large enough, you can assume normality of sample means.

Alright. Let's say you want to become a data scientist and you are interested in the salary you are going to get.
Imagine you have certain information that the population standard deviation of data science salaries is equal to $15,000. Furthermore, you know the salaries are normally distributed and your sample consists of 30 salaries.

The formula for the confidence interval with a known variance is the following. The population mean will fall between the sample mean minus z of alpha divided by 2, times the standard error, and the sample mean plus z of alpha divided by 2, times the standard error. In symbols: x̄ ± z(α/2) · σ/√n. The sample mean is the point estimate. You know all about the standard error already: it is the population standard deviation divided by the square root of the sample size.

What we have left is the so-called reliability factor, z of alpha divided by 2. Z is the statistic that we've described earlier: the standardized variable that has a standard normal distribution. Right? And what about alpha? This is the same alpha we had when we defined our confidence level. So, for a confidence level of 95%, alpha would be equal to 5%. Similarly, for a confidence level of 99%, alpha would be equal to 1%. It all fits into place now, doesn't it?

Let's get back to our example. The sample mean is $100,200 and the standard deviation is known to be $15,000, thus the standard error is $15,000 divided by the square root of 30, or about $2,739. Having calculated these values, we can take the next step and choose our confidence level. Common confidence levels are 90%, 95% and 99%, with respective alphas of 10%, 5% and 1%. Another way to put the value of alpha is 0.1, 0.05 and 0.01 respectively. Keep in mind that a 95% confidence level means that if you repeated the sampling many times, about 95% of the intervals built this way would contain the true population parameter.

Okay. The z of alpha divided by 2 comes from the so-called standard normal distribution table. It is best to first see it and then comment on it. Let's say that we want to find the values for the 95% confidence interval. Alpha is 0.05, therefore we are looking for z of alpha divided by 2, that is, z of 0.025.
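As a quick check of the numbers so far, here is a minimal Python sketch that computes the standard error and the alpha values for the common confidence levels. The function names `standard_error` and `alpha_for` are made up for this illustration and are not part of the lecture.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean when the population standard deviation is known."""
    return sigma / math.sqrt(n)


def alpha_for(confidence_level: float) -> float:
    """Alpha is simply 1 minus the confidence level (e.g. 0.95 -> 0.05)."""
    return 1.0 - confidence_level


# Values taken from the salary example in the lecture.
sigma = 15_000  # known population standard deviation, in dollars
n = 30          # sample size

se = standard_error(sigma, n)
print(f"standard error: {se:.2f}")  # rounds to the $2,739 used in the lecture

for level in (0.90, 0.95, 0.99):
    alpha = alpha_for(level)
    # alpha / 2 is the tail area used to look up the reliability factor
    print(f"{level:.0%} confidence: alpha = {alpha:.2f}, alpha/2 = {alpha / 2:.3f}")
```

The script only reproduces the arithmetic described above; the reliability factor itself still has to be looked up from the tail area alpha/2.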
In the table, this will match the value of 1 minus 0.025, or 0.975. The corresponding z comes from the sum of the row and column table headers associated with this cell. In our case, the value is 1.9 + 0.06, or 1.96. A commonly used term for the z is critical value. So, we have found the critical value for this confidence interval. Now, we can easily substitute in the formula. The final confidence interval becomes 94,833 to 105,568. The interpretation is the following: we are 95% confident that the average data scientist salary will be in the interval between 94,833 and 105,568 dollars.

Let's repeat the exercise using a higher confidence level. Say we want to be 99% certain of the outcome. Alpha is 0.01. We look at the table for the value of 1 minus 0.005, which is equal to 0.995. Bummer! There is no such value. When this happens, we just have to round to the nearest value available. The corresponding critical value is 2.5 + 0.08, thus 2.58. We plug it into our formula once more and the new confidence interval is 93,135 to 107,265. This means that we are 99% confident that the average data scientist salary is going to lie in the interval between 93,135 and 107,265 dollars.

Please note that in this case there is a trade-off between the level of confidence we chose and the estimation precision. The interval we obtained is broader. The opposite is also true: a narrower confidence interval comes with a lower level of confidence. Makes sense, right? If we are trying to estimate the population mean and we are picking a larger interval, we are increasing our chances of having an interval that actually includes the mean. And vice versa. If we want to be more specific about the population mean range, this will take away from our confidence about this statement.

Okay. This lecture was a bit longer, but very insightful. Don't skip the exercises provided. They will help you reinforce the knowledge about this concept, which is fundamental for everybody who wants to work with numbers in their job. In the next few lessons we will study some particular cases and teach you how to find confidence intervals for them. Thanks for watching!
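For anyone who wants to reproduce the two intervals from this lecture without a z-table, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+). The function name `z_confidence_interval` is made up for this illustration. Because it uses unrounded critical values (about 1.95996 and 2.57583 rather than the table values 1.96 and 2.58), the bounds differ slightly from the ones quoted above, more noticeably at the 99% level.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence_level):
    """Confidence interval for a population mean with a KNOWN population
    standard deviation, assuming normally distributed sample means."""
    alpha = 1.0 - confidence_level
    # Critical value: the z with cumulative probability 1 - alpha/2
    # under the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / math.sqrt(n)  # reliability factor times standard error
    return sample_mean - margin, sample_mean + margin


# Values taken from the salary example in the lecture.
for level in (0.95, 0.99):
    low, high = z_confidence_interval(100_200, 15_000, 30, level)
    print(f"{level:.0%} interval: {low:,.0f} to {high:,.0f} (width {high - low:,.0f})")
```

Running it also shows the trade-off from the lecture directly: the 99% interval is wider than the 95% one for the same sample.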