Distribution Plots in Python

A free video tutorial from Jose Portilla
Head of Data Science, Pierian Data Inc.

Lecture description

Learn about Data Visualization with Seaborn and Python!

Learn more from the full course

Python for Data Science and Machine Learning Bootcamp

Learn how to use NumPy, Pandas, Seaborn, Matplotlib, Plotly, Scikit-Learn, Machine Learning, TensorFlow, and more!

24:46:41 of on-demand video • Updated May 2020

  • Use Python for Data Science and Machine Learning
  • Use Spark for Big Data Analysis
  • Implement Machine Learning Algorithms
  • Learn to use NumPy for Numerical Data
  • Learn to use Pandas for Data Analysis
  • Learn to use Matplotlib for Python Plotting
  • Learn to use Seaborn for statistical plots
  • Use Plotly for interactive dynamic visualizations
  • Use Scikit-Learn for Machine Learning Tasks
  • K-Means Clustering
  • Logistic Regression
  • Linear Regression
  • Random Forest and Decision Trees
  • Natural Language Processing and Spam Filters
  • Neural Networks
  • Support Vector Machines
Hello everyone, and welcome to the distribution plots lecture for Seaborn. In this lecture we're going to discuss different plot types with Seaborn that allow us to visualize the distribution of a data set. Let's go ahead and jump to the Jupyter notebook to get started.

OK, here I am at the notebook. I want to get started by importing Seaborn, and by convention we import seaborn as sns. And since I'm in the notebook, I'm going to go ahead and say %matplotlib inline; that way I can see our visualizations inside of the notebook.

All right. Now let's get some data to plot. Seaborn actually comes with a few built-in data sets that you can directly load, and I'm going to grab one called tips and save it as a data frame called tips. You can do this by just saying tips = sns.load_dataset and then passing in 'tips' as a string. This will load the tips data set, and then I can check the head of the data frame, and it looks something like this. There are seven columns here, and this is basically just data referring to people who had a meal and then left a tip afterwards. So you have the total price or bill of the meal, how much they left as a tip, the gender or sex of the person leaving the tip, whether or not they were a smoker, what day and time they ate their meal, and then the size of the party.

All right. Let's go ahead and discuss our first plot type, which is distplot. The distplot allows us to show the distribution of a univariate set of observations, and univariate is just a different way of saying just one variable. Let's go ahead and explore this. I'm going to say sns.distplot, and then for distplot what you do is you just pass in a single column of your data frame. In this case let's go ahead and see how the total bill is distributed. So I'm going to say tips['total_bill'] and then run the cell, and you should get a plot that looks like this. If you get a warning here, don't worry about it.
That actually has to do with another package called statsmodels. It won't affect your actual Seaborn code. But here we don't have any warning, so we're OK. Notice here that I get basically a histogram and what's known as a KDE, a kernel density estimation; that's the line here. Later on in this lecture we're going to discuss what this KDE is and how we can actually build it up. But for now we can remove it if we want to by passing an additional argument here, kde=False. Now you essentially just have a histogram, and a histogram is essentially just a distribution of where your total bill lies. So you can see here that on the y axis you have a count, and then you have these bars on the x axis as bins. And this basically means that most of your total bills are somewhere between $10 and $20.

If you want to get a little more information on this, you can change the number of bins. There's a third argument, bins, and then the appropriate number of bins, and the number really depends on your data set. But let's go ahead and choose 30 for now. Now we get a little more definition, and we can still see that most of the bills happen between 10 and 20. If you choose a value that's too high, for instance let's go ahead and put in 100, you'll start to get kind of a weird scenario where you're essentially beginning to plot every single instance of total bill for every single price point. So usually you want to try to find a balance in bin size, but that really depends on your plot itself.

OK. Looks like we have a good idea of the information here. If you read this graph, you can basically just say most of the bills happen somewhere between $10 and $20 and begin to fade away as you get to a higher and higher bill price. That's distplot, and it allows you to visualize the distribution, essentially a histogram, and you can add a KDE on top of that. But we'll learn about KDE plots later on.
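
To make the bins idea concrete, here is a minimal pure-Python sketch of what a histogram counts. It is only an illustration under stated assumptions: the bill amounts are made up rather than taken from the tips data set, and the helper name bin_counts is hypothetical, not a Seaborn function. Seaborn does this counting for you. Note also that Seaborn 0.11 and later deprecate distplot in favor of histplot and displot, so the lecture's calls may produce a warning on current versions.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not one past the end.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up bill amounts, for illustration only (not the tips data set).
bills = [8.5, 10.3, 12.0, 13.4, 14.8, 15.0, 16.2, 17.9,
         18.3, 19.6, 21.0, 24.5, 28.1, 35.3, 48.2]

coarse = bin_counts(bills, 4)    # a few wide bins: the overall shape is visible
fine = bin_counts(bills, 100)    # many narrow bins: mostly empty or single points

print(coarse)
print(max(fine))
```

With only a few bins, most of the bills pile into the lowest bin; with 100 bins nearly every bar holds a single observation, which is the "weird scenario" described above.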
Let's talk about jointplot. jointplot from Seaborn, which I can call with sns.jointplot, allows you to basically match up two distplots for bivariate data, meaning you can essentially combine two different distribution plots. And bivariate just means two variables. Then you have a kind parameter that we're going to play around with, which allows us to choose how we actually want to compare these two distributions.

Let me go ahead and show you how we can use sns.jointplot. First you have to pass in an x variable, then you have to pass in a y variable, and then you have to pass in your data set. Let's start from the back end, so pass in your data set as tips; that's our data frame. Then for x and y you just pass in strings that are column names, the two things you want to compare to each other. So for instance maybe I want to compare the distribution of the total bill versus the tip size. Let's go ahead and do that. I'm going to say 'total_bill' as my x, and on my y axis I'm going to put in 'tip', the tip column. So right now I'm just passing in the total bill column, the tip column, and then data=tips, and I get a plot that looks like this, which is essentially just two distribution plots. You can see the tip on the y axis and total bill along the x axis, and I'll zoom out so you can see the whole plot. Then in between I have a scatter plot, and this scatter plot makes sense because it looks like it has a trend: as you go higher in total bill you go higher in tip, and that makes sense because tips are usually proportional to your total bill.

Now jointplot actually gives you an additional parameter called kind, and this kind parameter allows you to affect what's actually going on inside of this joint plot. Right now by default it's scatter, but you can also pass in an argument such as hex, and hex allows you to make basically a hexagon distribution representation.
It's similar to scatter, except that if a hexagon has a lot of points in it, it gets darker, and if it has fewer points, it gets lighter. Essentially it's just a way of not having to put all those scatter points on, but instead showing a distribution with these hexagons.

Another argument we can put in for kind is reg, which stands for regression. This will look a lot like a scatter plot, except Seaborn is actually going to draw a regression line on it. Now we haven't actually learned about linear regression yet as a machine learning topic, but later on when we do approach that topic we'll come back to this and actually discuss how this line is built. Essentially this is just showing almost like a linear fit to the scattered point data, and you can see it has a p-value and a Pearson coefficient, which we'll discuss later on when we actually discuss linear regression. Finally, another kind that you can put in here is kde, and that allows you to have a two-dimensional KDE, which essentially just shows you the density of where these points match up the most.

All right, let's go ahead and move on from jointplot. We'll usually be using jointplot with the default scatter, because that's the one that's easiest to read and gives you quite a bit of information right off the bat. We're going to go ahead and expand that idea by showing you pairplot. pairplot is essentially going to plot pairwise relationships across an entire data frame, at least for the numerical columns. It also supports a color hue argument for categorical columns, which I'll show you later on. We see here on top that we have this jointplot; what pairplot is essentially going to do is do this joint plot for every single possible combination of the numerical columns in this data frame. Let me go ahead and show you what I mean. Because it's going to do it for all the combinations, you basically just have to call sns.pairplot and pass in your data frame.
And this is something we're going to be doing quite a bit throughout the course. Keep in mind, the larger your data frame, the longer pairplot takes. So a lot of times pairplot takes a while if you have a very large data frame. Our data frame is relatively small, so we're OK. Here we basically have a pair plot for all the numerical columns. So we have size versus total bill, size versus tip. And then when you get to a parameter versus itself, for instance size versus size, instead of actually doing a scatter plot, which wouldn't make sense because you'd just have a straight line, you see a histogram instead. Same thing for tip versus tip and for total bill versus total bill. That means pairplot is a really nice way to quickly visualize your data.

What's even nicer is that you can add a hue argument to this, h-u-e, and the hue argument is where you pass in the name of a categorical column. Categorical means not numerical or continuous but actual categories. For instance, the sex column is categorical because there are two categories in it: male and female. When you pass this in, you pass in the column name, hue='sex', and it will color the data points based off of the column you put in for hue. So here all the green points are female, based on this legend, and, let me zoom out so we can see the whole thing, all the blue points are male.

As a third argument you can specify a palette, and the palette allows you to color this with some specific color palette. We're going to discuss palettes and color and style at the very end of the Seaborn lecture series. Right now I'll just show you an example. Essentially there are these color map strings from Matplotlib that you can pass in as palette, and they will choose certain colors for whatever the categories are. Here we can see that now male is blue and female is this kind of light pink color. All right.
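
The grid that pairplot draws can be pictured with a short sketch. This is not Seaborn's implementation, just a pure-Python illustration of the layout described above; the function name pair_grid_layout is hypothetical, and the column list assumes the three numerical columns of the tips data frame named in the lecture.

```python
from itertools import product

# Numerical columns of the tips data frame, as named in the lecture.
numeric_columns = ["total_bill", "tip", "size"]


def pair_grid_layout(columns):
    """Map every (row column, column column) pairing to the kind of plot drawn."""
    layout = {}
    for y_col, x_col in product(columns, repeat=2):
        # A column against itself gets a histogram, since a scatter plot
        # of a variable versus itself would just be a straight line.
        layout[(y_col, x_col)] = "histogram" if y_col == x_col else "scatter"
    return layout


layout = pair_grid_layout(numeric_columns)
for (y_col, x_col), kind in layout.items():
    print(f"{y_col} vs {x_col}: {kind}")
```

Three numerical columns give a 3 by 3 grid of nine panels, which is also why the plot gets slow as the number of columns and rows in the data frame grows.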
We'll touch on palettes, colors, and styles a lot more later. Let's go ahead and move on to rug plots. Rug plots are actually a very simple concept, but we're going to use the concept of a rug plot to build up and explain the KDE plot we saw earlier. I'm going to go ahead and say sns.rugplot, and just like distplot, the distribution plot, you're going to pass in a single column here. So we're going to say tips, and let's pass in the total_bill column. What the rug plot does is very simple. It just draws a dash mark for every point on this univariate distribution, essentially one single variable.

To compare that with a histogram, let me go ahead and make that distplot one more time. I will say sns.distplot, tips['total_bill']. Run that, and let's say kde=False. OK, so the difference between the histogram here below and this rug plot is that the histogram has bins, and it counts how many dashes were in each bin and then shows that as a number up here. That means between like 10 and, I don't know, 11, there's about, if we take a look at this, forty-five dashes there. They're all kind of stacked on top of each other. And then over here, as we go further in total bill price, there are fewer dashes, and that means the bin is going to be less high. That's the basic relationship between the histogram and this rug plot. Again, the rug plot is a really simple concept: you just draw a dash mark for every single point along the distribution line.

All right. That's the total bill. What we want to do is build off this idea of rug plots to explain what this KDE plot actually is, and that's going to be this line right here. How do we actually build this line based off of these rug plots?
And you can see that it kind of has a relationship to the rug plot counts. KDE plots stand for kernel density estimation plots, and you can actually Google this and check out the Wikipedia article on kernel density estimation, and the page will look something like this. If you scroll down, there is a really nice figure here, and essentially that's what we're going to try to construct. You'll notice that each of these black dashes here is the rug plot, so the actual points. Then you have these little normal, Gaussian distributions on top of each point. And then you sum them all up, so you get this final kernel density estimation.

Now what do I mean by normal distribution or Gaussian distribution? Well, if you also look it up on Wikipedia, it says that in probability theory the normal distribution is probably the most common continuous probability distribution. Essentially, it's these kinds of distributions where you say, oh, how did everyone do on their test, and you grade all the students and then see the distribution of scores. So usually you get something normally distributed like this, or for instance people's ages or people's heights. A lot of things tend to follow a normal distribution.

OK. Let's go ahead and jump back to the Jupyter notebook and touch upon these topics in a little more detail. In order to do this, I'm going to copy and paste some code from the notebook, and you don't need to worry about understanding this code. It's just to build out a diagram for explanation. I've copied and pasted this code, so let me break down real quick what this code is doing. I just have a few imports. I create a data set of random data. Then I use a rug plot on that random data. I set up the x axis for the plot, using np.linspace to create 100 equally spaced points from x_min to x_max. And then here, this is probably the hardest part to understand, because it uses a library we haven't talked about yet.
That's stats.norm. All this does is plot a normal distribution for each of the rug plot points. And that looks like this. Let's go ahead and zoom in on this. Here I have my data set, and this is a random data set, so if you run this, yours may look a little different. Keep in mind we're not working with tips anymore; we're just working on some random data. Notice I have blue dashes here, and then these gray lines represent normal distributions on top of each of these blue dashes. So this is a normal distribution centered around the dash, and we have a bunch of them on top of each other. What we're going to go ahead and do next is sum them all up to get the kernel density estimate. And this is just the sum of all of these little normal distributions.

All right. Copying and pasting the second block of code from the notebook allows us to actually sum up all these basis functions, which are just these normal distributions. When you sum them all up, you get something that looks like this, which is just the KDE plot from before, and that's how the KDE line is constructed in the distplot, the very first plot we looked at.

All right. So those are all the major ways you can show distributions of data that we have in Seaborn. Let's go ahead and quickly review all the various distribution plot types. If we scroll back up, there was distplot, and again, distplot you can use in two ways: have kde=False and essentially just see a histogram, or leave this blank, and then you can actually see the KDE, the kernel density estimation, which, as I explained at the end, is just the sum of all the normal distributions around the rug plot. jointplot is really similar to this idea, except you're passing in two columns, and you pass them in as x and y arguments.
Your third argument is equal to the data. Then the next plot we learned about was pairplot, and pairplot is just building off of jointplot: it is essentially a joint plot for every single numerical column in your data set, and that means you just pass in the data set itself, the data frame, and you can pass in hue and palette if you want to color by a categorical column.

The next plot we learned about was the rug plot. You usually won't be using rug plots, but it's there for you, and the main idea of using a rug plot is building the logic of the kernel density estimation plot, which is done through this code here. You can take the time and read through this code, but I just wanted to get the point across that when you have a rug plot and you want to build a kernel density estimation plot for it, the KDE plot, you can do that just by saying: the rug plot has all these normal distributions on top of each point, and then you take the sum of all of them. And that's the kernel density estimation plot.

We've seen how we can do that using distplot. As a quick point, if you are using distplot here, we know that we can get rid of the KDE plot by saying kde=False. If you actually just want the KDE plot and don't want the bins here, then instead of distplot you can call sns.kdeplot and pass in tips['total_bill'], and this will build the KDE plot without any of the distribution bars.

All right. Hopefully you realize that Seaborn is incredibly powerful and also very simple as far as the code you need to write. Everything we did was just done in one line. If you tried to do this in Matplotlib it would have taken you multiple lines, but what's nice about this is that it works off of what you know about Matplotlib, and we'll see that a lot more when we talk about styling and colors. A lot of that Matplotlib knowledge is going to be transferable to editing little things in these plots.
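
The construction described in this lecture (a normal curve on top of each rug dash, then the sum of those curves) can be sketched in a few lines of standard-library Python. This is not the notebook's code, which is not reproduced in the transcript. The data points, the bandwidth value, and the function name kde_curve are assumptions chosen for illustration, and dividing by the number of points so that the curve has an area of about 1 is a choice made here, not something stated in the lecture.

```python
from statistics import NormalDist


def kde_curve(points, x_axis, bandwidth):
    """Sum one normal curve per data point, evaluated along x_axis."""
    total = [0.0] * len(x_axis)
    for point in points:
        # One "basis function": a normal distribution centered on the dash.
        kernel = NormalDist(mu=point, sigma=bandwidth)
        for i, x in enumerate(x_axis):
            total[i] += kernel.pdf(x)
    # Divide by the number of points so the area under the curve is about 1.
    return [t / len(points) for t in total]


# Made-up observations (the rug plot dashes), for illustration only.
data = [1.0, 1.5, 2.0, 4.0, 4.2]
bandwidth = 0.5  # assumed kernel width, not the value used in the notebook

# 100 equally spaced points from x_min to x_max, like np.linspace.
x_min, x_max = min(data) - 2, max(data) + 2
step = (x_max - x_min) / 99
x_axis = [x_min + i * step for i in range(100)]

density = kde_curve(data, x_axis, bandwidth)
peak_x = x_axis[density.index(max(density))]
print(f"density peaks near x = {peak_x:.2f}")
```

The curve is highest where the dashes are most crowded, which is the same relationship the lecture points out between the rug plot, the histogram bars, and the KDE line.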
OK, I hope you're beginning to enjoy Seaborn. Again, like I mentioned before, it's one of my favorite libraries, and I can't wait to show you the next couple of plot types we're going to learn about with Seaborn. Thanks everyone, and I'll see you at the next lecture.