Understand the fundamentals of statistics
Learn how to work with different types of data
Plot different types of data
Calculate the measures of central tendency, asymmetry, and variability
Calculate correlation and covariance
Distinguish and work with different types of distributions
Estimate confidence intervals
Perform hypothesis testing
Make data-driven decisions
Understand the mechanics of regression analysis
Carry out regression analysis
Use and understand dummy variables
Understand the concepts needed for data science even with Python and R!
Ok! Here we go!
So far, we have learned that the distribution of a dataset shows us the frequency at which
possible values occur within an interval. We also said that there are dozens of distributions.
Experienced statisticians can immediately distinguish a Binomial from a Poisson distribution,
as well as a Uniform from an Exponential distribution, with a quick glance at a plot.
In this course, though, we will focus on the Normal and Student’s t distributions
for the following reasons:
• They approximate a wide variety of random variables.
• Distributions of sample means with large enough sample sizes can be approximated by a normal distribution.
• All computable statistics are elegant.
• Decisions based on normal distribution insights have a good track record.
If it sounds like gibberish now, I promise
that things will be much easier once we get started 😊
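The second reason in the list, that sample means tend toward a normal shape, can be checked with a small simulation. What follows is only a rough sketch using Python's standard library. The exponential source distribution, the sample size of 50, the 2000 repetitions, and the seed are all assumptions picked for illustration, not values from this course.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable (assumed value)

SAMPLE_SIZE = 50   # observations per sample (assumed value)
N_SAMPLES = 2000   # number of samples drawn (assumed value)

# Draw from an exponential distribution with mean 1, which is strongly skewed,
# and keep only the mean of each sample.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean.
share_within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / N_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} "
      f"(theory: {1 / math.sqrt(SAMPLE_SIZE):.3f})")
print(f"share within one sd:  {share_within_one_sd:.3f}")
```

Even though the individual observations are far from bell-shaped, the share of sample means within one standard deviation should land close to the 68% we expect from a normal distribution.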
Here is a visual representation of a Normal distribution.
You have surely seen a normal distribution before as it is the most common one. The statistical
term for it is Gaussian distribution, but many people call it the Bell Curve as it is
shaped like a bell. It is symmetrical, and its mean, median and mode are equal. If you
remember the lesson about skewness, you will recognize that it has no skew! It is perfectly
centered around its mean. Alright.
It is denoted in this way: X ~ N(μ, σ²). N stands for normal, the tilde sign shows it is a distribution,
and in brackets we have the mean μ and the variance σ² of the distribution. On the graph, you can
see that the highest point is located at the mean, because it coincides with the mode.
The spread of the graph is determined by the standard deviation.
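To make these properties concrete, here is a minimal sketch using `statistics.NormalDist` from Python's standard library. The mean of 10 and the standard deviation of 2 are hypothetical values chosen only for illustration. The sketch checks that the mean, median and mode coincide, that the curve peaks at the mean, and that it is symmetrical around it.

```python
import math
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
dist = NormalDist(mu=10.0, sigma=2.0)

# Mean, median and mode are the same point for a normal distribution.
print("mean, median, mode:", dist.mean, dist.median, dist.mode)
# The notation N(mean, variance) uses the variance, i.e. sigma squared.
print("variance:", dist.variance)

# Evaluate the density on a grid of points around the mean.
offsets = [i / 10 for i in range(0, 61)]
grid = [dist.mean + d for d in offsets] + [dist.mean - d for d in offsets]

# The grid point with the highest density should be the mean itself.
peak_x = max(grid, key=dist.pdf)

# Points at equal distance on either side of the mean have equal density.
is_symmetric = all(
    math.isclose(dist.pdf(dist.mean + d), dist.pdf(dist.mean - d))
    for d in offsets
)

print("highest point at:", peak_x)
print("symmetric around the mean:", is_symmetric)
```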
Now, let’s try to understand the normal distribution a little bit better.
Let’s look at this approximately normally distributed histogram. There is a concentration
of the observations around the mean, which makes sense as it is equal to the mode. Moreover,
it is symmetrical on both sides of the mean. We used 80 observations to create this histogram.
Its mean is 743 and its standard deviation is 140.
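If you would like to build a similar histogram yourself, here is one possible sketch. Note that it simulates 80 observations from a normal distribution with mean 743 and standard deviation 140. It does not use the lesson's actual dataset, and the seed and the bin width of 100 are assumptions, so the sample mean and standard deviation will be close to, but not exactly, 743 and 140.

```python
from collections import Counter
from statistics import NormalDist, fmean, stdev

# Simulated data: NOT the dataset from the lesson, just 80 random draws
# from N(743, 140^2). The seed is an arbitrary assumption.
population = NormalDist(mu=743, sigma=140)
observations = population.samples(80, seed=1)

sample_mean = fmean(observations)
sample_sd = stdev(observations)
print(f"sample mean: {sample_mean:.1f}, sample sd: {sample_sd:.1f}")

# Group the observations into bins of width 100 (assumed bin width),
# keyed by the left edge of each bin.
BIN_WIDTH = 100
counts = Counter(int(x // BIN_WIDTH) * BIN_WIDTH for x in observations)

# Print a simple text histogram, one row per bin.
for left_edge in sorted(counts):
    bar = "#" * counts[left_edge]
    print(f"{left_edge:>5} to {left_edge + BIN_WIDTH:<5} {bar}")
```

With only 80 observations the bars will look a little ragged, which is why we call such a histogram approximately normal.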
Okay, great! But what if the mean is smaller or bigger?
Let’s first zoom out a bit by adding the origin of the graph. The origin is the zero
point. Adding it to any graph gives perspective. Keeping the standard deviation fixed, or in
statistical jargon, controlling for the standard deviation, a lower mean would result in the
same shape of the distribution, but shifted to the left. In the same way, a
bigger mean would move the graph to the right. In our example, this resulted in two new distributions
– one with a mean of 470 and a standard deviation of 140, and one with a mean of 960
and a standard deviation of 140. Alright, let’s do the opposite.
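Before doing that, here is a quick numerical check of the shift we just described. It is a small sketch with `statistics.NormalDist`, using the three means from the example (470, 743 and 960) and the common standard deviation of 140. The choice to compare the curves at the mean and 100 units to its right is an assumption made only for illustration.

```python
import math
from statistics import NormalDist

SIGMA = 140               # standard deviation kept fixed
means = [470, 743, 960]   # the three means from the example

curves = {mu: NormalDist(mu=mu, sigma=SIGMA) for mu in means}

# Height of each curve at its own mean (the peak).
peak_heights = {mu: curve.pdf(mu) for mu, curve in curves.items()}

# Height of each curve 100 units to the right of its own mean
# (100 is an arbitrary distance chosen for illustration).
heights_100_right = {mu: curve.pdf(mu + 100) for mu, curve in curves.items()}

for mu in means:
    print(f"mean {mu}: peak height {peak_heights[mu]:.6f}, "
          f"height at mean + 100: {heights_100_right[mu]:.6f}")
```

The heights match across all three curves, so the shape is the same. Only the location on the horizontal axis changes.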
Controlling for the mean, we can change the standard deviation and see what happens. This
time the graph does not move; it changes shape. A lower standard deviation results
in a lower dispersion, so more data in the middle and thinner tails. On the other hand,
a higher standard deviation will cause the graph to flatten out, with fewer points in the
middle and more toward the ends, or, in statistics jargon, fatter tails.
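The same kind of check works for the standard deviation. In this sketch the mean stays at 743, while the standard deviations of 70 and 280 are hypothetical values added next to the original 140. The cutoff of 200 units from the mean, used to measure the tails, is also an assumption.

```python
from statistics import NormalDist

MU = 743                 # mean kept fixed
sigmas = [70, 140, 280]  # 70 and 280 are hypothetical, for illustration
DISTANCE = 200           # assumed cutoff for what counts as "in the tails"

peak_heights = []
tail_shares = []
for sigma in sigmas:
    curve = NormalDist(mu=MU, sigma=sigma)
    # Height of the curve at the mean: lower means a flatter graph.
    peak = curve.pdf(MU)
    # Probability of landing more than DISTANCE away from the mean,
    # counting both tails (the curve is symmetrical).
    tail = 2 * (1 - curve.cdf(MU + DISTANCE))
    peak_heights.append(peak)
    tail_shares.append(tail)
    print(f"sd {sigma:>3}: peak height {peak:.6f}, "
          f"share beyond {DISTANCE} from the mean {tail:.4f}")
```

As the standard deviation grows, the peak gets lower and a larger share of the distribution sits far from the mean. That is exactly what we mean by fatter tails.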
Great! These are the basics of a normal distribution.
In our next lesson, we will use this knowledge to talk about standardization.