What is a Distribution?

A free video tutorial from Kirill Eremenko
Data Scientist
4.5 instructor rating • 44 courses • 1,750,216 students

Lecture description

Here you will learn what a distribution is and how it is used in data analysis.

Learn more from the full course

Statistics for Business Analytics and Data Science A-Z™

Learn The Core Stats For A Data Science Career. Master Statistical Significance, Confidence Intervals And Much More!

06:02:26 of on-demand video • Updated April 2021

  • Understand what a Normal Distribution is
  • Understand standard deviations
  • Explain the difference between continuous and discrete variables
  • Understand what a sampling distribution is
  • Understand the Central Limit Theorem
  • Apply the Central Limit Theorem in practice
  • Apply Hypothesis Testing for Means
  • Apply Hypothesis Testing for Proportions
  • Use the Z-Score and Z-Tables
  • Use the t-Score and t-Tables
  • Understand the difference between a normal distribution and a t-distribution
  • Understand and apply statistical significance
  • Create confidence intervals
  • Understand the potential pitfalls of overusing p-Values
Hello and welcome back. What is a distribution? Let's have a look. According to Wikipedia, a distribution, or probability distribution, is a mathematical function that, stated in simple terms, can be thought of as providing the probability of occurrence of different possible outcomes in an experiment. So, a very interesting definition. The main takeaway from here is that it's a mathematical function. Let's see what a distribution is for us in terms of data analytics, business analytics and data science.
And we're back to our dataset. Unless you're already quite familiar with distributions, you were probably expecting a chart: a graph, a bell curve or something. But that is a common misconception. I specifically didn't want to start by showing a chart, because too many people think that a distribution is a chart. You see a bell curve, you think of a distribution. It's not the same thing. A distribution is actually a function, as we saw. If we go back to the definition, it is a mathematical function that, stated in simple terms, can be thought of as providing the probability of occurrence of different possible outcomes. It says nothing about a chart and nothing about a curve. A distribution is a function which is linked to the underlying data, to the underlying observations.
Did you notice that word, outcomes? Well, that's about experiments. In our case, in data analytics, an experiment is us picking a random value of a variable from our data. For instance, here you've got age. We pick a random one; that's our experiment. What's the possible outcome? Well, the outcome can only be one of the values that we have here.
So a distribution is a function which will tell us the probability of getting 59, or the probability of getting 31, or 40, and so on. There is another way to think about it, probably a better way, because it works better for variables like balance or, in this case, age. Remember we talked about continuous versus discrete. This second way of thinking about it is better for continuous variables, and age is actually a continuous variable; it's just rounded. For balance, for instance, you could ask: if I pick a random row from this dataset, what is the probability of getting, let's say, 197.33? Well, what happens if you don't have that value? Is the probability zero?
That's why the better way to think about it is the second way, which I've been mentioning. We say: OK, this is our dataset, and now we're going to add a new random person in here, and we know that that person will look kind of like the people that we already have. What is the probability that that person will have $197.33 in their bank account? What is the probability that that person will be 37 years old? What is the probability that that person will be male or female, or be residing in this country or in that country? So that's probably the better way to imagine it.
And as you can see, again, it has nothing so far to do with charts or graphs or bell curves and so on. All of that comes later. Well, just now, in a few seconds. But the chart is an auxiliary element, a supplementary thing that aids our understanding. So let's proceed to that part.
We're going to shorten this dataset just so we have some space, and we're going to look at two specific variables: age group and balance. We're going to talk about distributions, and you'll see that they're a bit different for discrete and continuous variables. That's why we specifically picked a discrete variable. With age, it's debatable whether you want it to be discrete or continuous; in this case it's probably continuous. Age group, on the other hand, has intentionally been made discrete: there are only certain categories. So this is definitely a discrete variable. Balance is continuous because it's not in certain buckets.
All right, so discrete. Let's plot the distribution. The chart for this distribution would look something like that: it's a bar chart, and that's how distributions are presented for discrete variables, because you have a finite number of categories or, if we're talking about experiments, a finite number of possible outcomes when you're dealing with this variable. So you have, for instance, 10 to 20, 20 to 30, 30 to 40, 40 to 50, 50 to 60 and 60 to 70 years old among the customers of this bank.
And what is this whole chart about? Well, let's put a 0.3 here; we'll explain it just now. What this chart is showing us is the probability, if we pick a person from this dataset or, as we said in way of thinking number two, if a new random person enters the bank, of them being in this bucket, this bucket or this bucket. So here are the probabilities: maybe 30 percent of them being here, maybe 25 percent of them being here, maybe 5 percent or less being here, and so on. So this is what the chart gives us.
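
The bar-chart idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 20 customers and their age groups are made-up data (not the dataset from the lecture), and `discrete_distribution` is a hypothetical helper name. It computes the relative frequency of each bucket, which is the height of each bar.

```python
from collections import Counter

# Hypothetical age-group labels for 20 bank customers (illustrative data only,
# chosen so that the 30-40 bucket has probability 0.30 as on the lecture chart).
age_groups = (
    ["10-20"] * 1 + ["20-30"] * 5 + ["30-40"] * 6
    + ["40-50"] * 4 + ["50-60"] * 3 + ["60-70"] * 1
)


def discrete_distribution(values):
    """Map each distinct outcome to its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return {outcome: count / total for outcome, count in counts.items()}


probabilities = discrete_distribution(age_groups)

for bucket in sorted(probabilities):
    # Each printed number corresponds to the height of one bar in the chart.
    print(f"{bucket}: {probabilities[bucket]:.2f}")

# The probabilities of all buckets together account for every customer.
print("total:", round(sum(probabilities.values()), 10))
```

Because the outcomes are a finite set of buckets, reading off one probability is just a dictionary lookup, for example `probabilities["30-40"]`.
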
The bar chart gives us the probabilities of a person being in one of these groups. For instance, here the probability of X, that variable, being 30 to 40 equals 0.3, that is, 30 percent. So that's how the visualization of discrete distributions works.
Now let's talk about continuous. Let's draw that. This is going to be a bit more interesting, a bit more complex, simply because it's continuous. So there is always zero, and the continuous distribution is a continuous line. What does this line mean and how does it work? Let's have a look. Let's say that's 0.5; it doesn't matter that over there it was 0.3, these are different charts. This is 0.5. And let's say that here we have balances: we're visualizing the graph of the distribution for the balance variable. Let's say we have $10,000 here on this chart. So what does it mean that the peak here is at $10,000, or whatever value we get here? If I draw a line here, can I say that the probability of a person in this dataset, or of a new person who comes to this bank, having $10,000 in their bank account is 50 percent?
Well, let's imagine that that is true for a second. Let's say yes, that's the case, just like we had with the discrete distribution, where the probability tells you the likelihood that the variable will be exactly in this bucket. In that case, remember, we said that there's a limited number of outcomes there; we only had five or six. Here there's an unlimited number. So we take 11,000: according to this chart the probability is, what, 49 percent. And then the probability for 12,000 is 42 percent. Together with the 50 percent for 10,000, just these three values already add up to more than one. And so that can't be the case. Our assumption, that what we're seeing here is the probability of that exact value, is incorrect.
And that's why we can't do that. The correct way of thinking about this is through areas. This curve that we see over here is actually called the probability density function, or PDF, and the probability of having any one exact value is pretty much zero. That's how this function works, and that's how pretty much everything in the world works when you're dealing with continuous variables, because there are unlimited values on this axis. We're actually going to include a separate tutorial about that, on Cantor's diagonal argument; it's a really cool proof of why there are unlimited values on this axis, between 0 and 1 billion dollars or even between 0 and 100,000.
The way to think about picking a random value is to imagine that you're throwing a dart at this line, given that you hit the line. Because there is an infinite number of values between 0 and here, the probability of the dart landing exactly on 10,000 is zero; it converges to zero. So the probability of exactly 10,000 is zero. The correct way of thinking about this is areas: you take 10,000 plus a little bit and 10,000 minus a little bit, and then the probability of that dart landing inside there is given by the area of this little shaded part of the chart that we've created here. As you can see, as the area gets wider, the probability gets higher.
That's the probability for your dart, or for the random value that you pick out of your distribution. It doesn't have to come from a dataset: your distribution, or this column, has a function attached to it, a probability function, which characterizes the distribution.
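
The dart argument can be checked numerically. The sketch below assumes, purely for illustration, that balances follow a normal (bell-shaped) distribution with a mean of $10,000 and a standard deviation of $2,000; neither number comes from the lecture, and `probability_within` is a hypothetical helper. It shows the probability of landing within "10,000 plus or minus a little bit" shrinking towards zero as the "little bit" shrinks.

```python
import math

# Assumed parameters of the balance distribution (hypothetical, not from the lecture).
MEAN = 10_000.0
STD_DEV = 2_000.0


def normal_cdf(x, mean=MEAN, std_dev=STD_DEV):
    """Probability that a normally distributed value is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))


def probability_within(center, half_width):
    """Probability of landing between center - half_width and center + half_width."""
    return normal_cdf(center + half_width) - normal_cdf(center - half_width)


for half_width in (1_000, 100, 10, 1, 0):
    p = probability_within(10_000, half_width)
    # The narrower the window around 10,000, the smaller the probability.
    print(f"10,000 +/- {half_width:>5}: {p:.6f}")
```

With a half-width of 0 the window collapses to the single value 10,000 and the probability is exactly zero, which is the point made about continuous variables.
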
And so basically you're picking from this probability density function. That's what we said in the second approach of thinking about it: somebody new enters the bank, and the probability that their balance is going to be between $9,500 and $10,500 is the shaded area under the curve over here. In terms of mathematics, we're not going to go into calculus here, but that is integration. You need to integrate the curve between 9,500 and 10,500, and that's going to be your value. That's the probability. In simple terms, the best way to think about it is just the area under the curve. And of course, if you extend this area, taking not from 9,500 but from 8,000 to 12,000, the area is going to be bigger and the probability is going to be bigger. And ultimately, if you take from zero to a million, the probability is going to be 100 percent. So that's how the probability density function works, and that's the difference between discrete distributions and continuous distributions.
It's important to understand, and this is probably the key takeaway for today, that a distribution is attached to the variable, or in our case to the column itself, rather than to the chart or anything else. In this case you have a discrete distribution; in this case you have a continuous distribution characterized by a probability density function. Both can help you understand the likelihood of a certain event occurring; it's just the approaches that are different. If you have a discrete distribution, it's very simple: you just pick the bucket that you're interested in and look at the probability. With continuous distributions it's a bit more complex. You need to find the range of values you're looking for. You can't just find the probability of one value, because in the world of continuous variables the probability of one value is always going to be zero.
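
The "area under the curve" can also be approximated directly, without calculus, by slicing the range into thin trapezoids and adding them up. The sketch below again assumes a normal density with a mean of $10,000 and a standard deviation of $2,000 (illustrative values, not taken from the lecture), and `area_under_curve` is a hypothetical helper using the trapezoid rule.

```python
import math

# Assumed shape of the balance density (hypothetical parameters).
MEAN = 10_000.0
STD_DEV = 2_000.0


def normal_pdf(x, mean=MEAN, std_dev=STD_DEV):
    """Height of the normal probability density function at x."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))


def area_under_curve(lower, upper, steps=100_000):
    """Approximate the integral of the density from lower to upper (trapezoid rule)."""
    width = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower) + normal_pdf(upper))
    for i in range(1, steps):
        total += normal_pdf(lower + i * width)
    return total * width


narrow = area_under_curve(9_500, 10_500)    # the small shaded area
wide = area_under_curve(8_000, 12_000)      # a wider range, so a bigger area
everything = area_under_curve(0, 1_000_000) # practically the whole curve

print(f"P(9,500 to 10,500)  ~ {narrow:.4f}")
print(f"P(8,000 to 12,000)  ~ {wide:.4f}")
print(f"P(0 to 1,000,000)   ~ {everything:.4f}")
```

Under these assumptions the wider range always gives the larger probability, and the area over the whole axis comes out at essentially 1, that is, 100 percent.
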
You can only find probabilities of ranges, and the probability of a range is the shaded area under the curve. So there you are. Hopefully that simplifies things and makes it a bit clearer what distributions are and how their probabilities are governed. And that's all for today. I look forward to seeing you in the next tutorial. Until then, happy analyzing.