Understand the fundamentals of statistics
Learn how to work with different types of data
Plot different types of data
Calculate the measures of central tendency, asymmetry, and variability
Calculate correlation and covariance
Distinguish and work with different types of distributions
Estimate confidence intervals
Perform hypothesis testing
Make data-driven decisions
Understand the mechanics of regression analysis
Carry out regression analysis
Use and understand dummy variables
Understand the concepts needed for data science even with Python and R!
A confidence interval is the range within
which you expect the population parameter to be. And, its estimation is based on the
data we have in our sample. There can be two main situations when we calculate
the confidence intervals for a population - when the population variance is known and
when it is unknown. Depending on which situation we are in, we would use a different calculation
method. Now, the whole field of statistics exists
because we almost never have population data. Even if we do have population data, we may not
be able to analyze it (there may be so much of it that it doesn't make sense to use it all at once).
Think about people using the Internet. The data Google has approximates population data,
but even their data is not population data. There are people who are not a part of the Google ecosystem
in any way. They may use other browsers like Opera and Safari, other search
engines like Bing and DuckDuckGo, or video providers other than YouTube. Furthermore, they
can browse in incognito mode. These people are a part of the population of people using the
Internet, but Google doesn't have much data on them.
As you can see, even the company that has the most data doesn't necessarily have
population data. So, if Google wants to use statistical methods
to target these people with Google ads, they will basically be using sample data with
an unknown population variance to guess their preferences.
Okay. In this lesson, we will explore the confidence
intervals for a population mean with a known variance.
An important assumption in this calculation is that the population is normally distributed.
Even if it is not, you can use a large sample and let the central limit theorem do
the normalization magic for you. Remember? If you work with a sample that is large
enough, you can assume normality of the sample means.
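To see the central limit theorem at work, here is a minimal simulation sketch. It is an illustration only: the exponential population, the sample size of 50, the 2,000 repetitions and the helper name `simulate_sample_means` are all assumptions chosen for the demo, not something taken from the lecture.

```python
import random
import statistics

def simulate_sample_means(sample_size, num_samples, seed=0):
    """Draw many samples from a skewed (exponential) population and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

sample_size = 50
sample_means = simulate_sample_means(sample_size, num_samples=2000)

# The exponential(1) population is clearly not normal; it has mean 1 and standard deviation 1.
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
expected_spread = 1 / sample_size ** 0.5  # sigma / sqrt(n), the standard error

# If the sample means are roughly normal, about 95% of them should lie
# within 1.96 standard errors of the population mean.
share_within = sum(
    abs(m - 1.0) <= 1.96 * expected_spread for m in sample_means
) / len(sample_means)

print(f"mean of sample means:   {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f} (theory: {expected_spread:.3f})")
print(f"share within 1.96 standard errors: {share_within:.3f}")
```

Even though the individual observations are heavily skewed, the sample means cluster around the population mean with a spread close to the standard error.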
Alright. Let’s say you want to become a data scientist
and you are interested in the salary you are going to get. Imagine you know for certain
that the population standard deviation of data science salaries is equal to $15,000.
Furthermore, you know the salaries are normally distributed and your sample consists of 30
salaries. The formula for the confidence interval with
a known variance is as follows. The population mean will fall between
the sample mean minus z of alpha divided by 2 times the standard error, and the sample
mean plus z of alpha divided by 2 times the standard error.
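Written out in symbols, the spoken formula looks like this. The notation is an assumption for illustration: x-bar stands for the sample mean, sigma for the known population standard deviation, and n for the sample size, so that sigma over the square root of n is the standard error.

```latex
\left[\; \bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\quad \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\right]
```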
The sample mean is the point estimate. You know all about the standard error already,
so let’s compute it using the formula: the population standard deviation divided by the square
root of the sample size. What we have left is the so-called reliability factor, z of alpha
divided by 2. Z is the statistic that we’ve described earlier: the standardized variable
that has a standard normal distribution.
Right? And what about alpha? This is the same alpha
we had when we defined our confidence level. So, for a confidence level of 95%, alpha would
be equal to 5%. Similarly, for a confidence level of 99%, alpha would be equal to 1%.
It all fits into place now, doesn’t it? Let’s get back to our example.
The sample mean is $100,200 and the standard deviation is known to be $15,000, thus the
standard error is $15,000 divided by the square root of 30, or about $2,739. Having calculated
these values, we can take the next step and choose our confidence level. Common confidence
levels are 90%, 95% and 99%, with respective alphas of 10%, 5% and 1%. Another way to put the
value of alpha is: 0.1, 0.05 and 0.01 respectively. Keep in mind that a 95% confidence level means
that if you repeated the sampling many times, about 95% of the intervals built this way would
contain the true population parameter. Okay.
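As a quick check of the arithmetic so far, here is a small sketch that computes the standard error and the alphas for the common confidence levels. The variable names are just illustrative choices.

```python
import math

population_sd = 15_000  # known population standard deviation, in dollars
sample_size = 30        # number of salaries in the sample

# Standard error = population standard deviation / sqrt(sample size)
standard_error = population_sd / math.sqrt(sample_size)

# Alpha is simply 1 minus the confidence level.
confidence_levels = [0.90, 0.95, 0.99]
alphas = [round(1 - level, 2) for level in confidence_levels]

print(f"standard error: {standard_error:,.0f}")
print(f"alphas: {alphas}")
```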
The z of alpha divided by 2 comes from the so-called standard normal distribution table. It is
best to first see it and then comment on it. Let’s say that we want to find the values for the 95%
confidence interval. Alpha is 0.05, therefore we are looking for z of alpha divided by two,
or z of 0.025. In the table, this will match the value of 1 minus 0.025, or 0.975. The corresponding
z comes from the sum of the row and column table headers associated with this cell. In
our case, the value is 1.9+0.06, or 1.96. A commonly used term for the z is critical
value. So, we have found the critical value for this confidence interval.
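If you don't have a z table at hand, the same critical value can be found in code. This is a minimal sketch using `statistics.NormalDist` from Python's standard library. The helper name `critical_value` is a hypothetical one introduced here. Note that the inverse CDF gives more decimals than the table, so the result is rounded to two places for comparison.

```python
from statistics import NormalDist

def critical_value(confidence_level):
    """Return z of alpha/2 for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence_level
    # Same lookup as in the table: find the z whose cumulative probability is 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)

z_95 = critical_value(0.95)
z_99 = critical_value(0.99)

print(f"95% critical value: {z_95:.2f}")
print(f"99% critical value: {z_99:.2f}")
```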
Now, we can easily substitute in the formula. The final confidence interval becomes approximately
94,832 to 105,568. The interpretation is the following: we are 95% confident that the average data
scientist salary will be in the interval between 94,832 and 105,568 dollars.
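Putting the pieces together, the whole calculation can be sketched as one small function. The function name `z_confidence_interval` is hypothetical, and it uses the unrounded critical value from `statistics.NormalDist`, so the bounds may differ from a table-based calculation by a dollar or so.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, population_sd, sample_size, confidence_level):
    """Confidence interval for a population mean when the population variance is known."""
    standard_error = population_sd / math.sqrt(sample_size)
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# The data scientist salary example: mean $100,200, sigma $15,000, 30 salaries.
lower, upper = z_confidence_interval(100_200, 15_000, 30, 0.95)
print(f"95% confidence interval: {lower:,.0f} to {upper:,.0f}")
```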
Let’s repeat the exercise using a higher confidence level. Say we want to be 99% certain
of the outcome. Alpha is 0.01. We look at the table for the value of 1 minus 0.005,
which is equal to 0.995. Bummer!
There is no such value. When this happens, we just have to round to the nearest value
available. The corresponding critical value is 2.5+0.08, thus 2.58. We plug it into our
formula once more and the new confidence interval is approximately 93,134 to 107,266. This means
that we are 99% confident that the average data scientist salary is going to lie in the
interval between 93,134 and 107,266 dollars. Please note that in this case there is a trade-off
between the level of confidence we chose and the estimation precision. The interval we
obtained is broader. The opposite is also true. A narrower confidence interval comes
with a lower level of confidence. Makes sense, right?
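The trade-off is easy to see if we compute the interval for several confidence levels side by side. This is a small sketch on the salary example. It uses exact critical values rather than table-rounded ones, so the 99% bounds come out slightly narrower than with 2.58.

```python
import math
from statistics import NormalDist

sample_mean, population_sd, sample_size = 100_200, 15_000, 30
standard_error = population_sd / math.sqrt(sample_size)

widths = {}
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * standard_error
    widths[level] = 2 * margin  # total width of the interval
    print(f"{level:.0%}: {sample_mean - margin:,.0f} to {sample_mean + margin:,.0f} "
          f"(width {widths[level]:,.0f})")
```

The higher the confidence level, the wider the interval gets.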
If we are trying to estimate the population mean and we are picking a larger interval,
we are increasing our chances of having an interval that actually includes the mean.
And vice versa. If we want to be more specific about the population mean range, this will
take away from our confidence about this statement. Okay. This lecture was a bit longer, but very
insightful. Don’t skip the exercises provided. They will help you reinforce the knowledge
about this concept, which is fundamental for everybody who wants to work with numbers in
their job. In the next few lessons we will study some
particular cases and teach you how to find confidence intervals for them.
Thanks for watching!