Build 10 advanced Python scripts which together make up a data analysis and visualization program.
Solve six exercises related to processing, analyzing and visualizing US income data with Python.
Learn the fundamental building blocks of the Python programming language such as variables, data types, loops, conditionals, functions and more.
Use Python to batch download files from FTP sites, extract, rename and store remote files locally.
Import data into Python for analysis and visualization from various sources such as CSV and delimited TXT files.
Keep the data organized inside Python in easily manageable pandas dataframes.
Merge large datasets taken from various data file formats.
Create pivot tables in Python out of large datasets.
Perform various operations among data columns and rows.
Query data from Python pandas dataframes.
Export data from Python into various formats such as TXT, CSV, Excel, HTML and more.
Use Python to perform various visualizations such as time series, plots, heatmaps, and more.
Create KML Google Earth files out of CSV files.
Well, in the previous lecture I taught you everything you need to know to get started doing visualizations with the seaborn and matplotlib libraries. In this lecture we will focus on our real-world weather data and try to visualize it using
the seaborn library, which, I should mention, is built on top of matplotlib, and matplotlib is installed automatically when you install seaborn. I'll start by deleting every file I have in my file input folder, and I suggest you clean the files out of your folder too.
The data I'm deleting are the data of our four weather stations for the years 2010 to 2014, which we downloaded in our previous sections. I'm deleting them now because I want data for the same four stations but over a longer time frame. So let's go ahead and download the data again. I already taught you how to create an FTP downloader function, and we have that function right here. I'll execute the function definition now, so that I have it in my IPython interactive session, and then I want to call that function inside a loop over station IDs so that the function is executed multiple times. The IDs of the stations I'd like data for are these here. I call the function and pass the loop variable as the ID, along with the starting year and the end year. I want data from 1950 to 2014, so that's quite a lot of years of data. Great! I'll run this now and I expect it to take a while for the files to download. We have four stations and 65 years,
so 65 times four gives 260 files. Well, maybe not exactly 260, because some station files may be missing for some of the years. When a station file is missing, the function prints out a message while it is running,
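The download loop just described can be sketched as follows. Note that the function name `download_data` and the station IDs below are placeholders, not the actual names from the course; substitute the FTP downloader function and the station IDs from your own earlier sections.

```python
# Hypothetical station IDs -- replace with the four IDs from the lecture.
station_ids = ["037070", "037680", "038560", "038940"]

def download_data(station_id, year_start, year_end):
    """Stand-in for the FTP downloader function built in a previous section.
    The real function fetches one gzipped file per station per year."""
    print("Downloading station %s for %d-%d" % (station_id, year_start, year_end))

# Call the downloader once per station, for the full 1950-2014 range.
for station in station_ids:
    download_data(station, 1950, 2014)

# 4 stations x 65 years (1950-2014 inclusive) = at most 260 files;
# fewer if some station files are missing for some years.
expected_files = len(station_ids) * (2014 - 1950 + 1)
print(expected_files)
```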
and the downloading has finished at this point. So what's next? Next, we extract these gzip files in our folder,
using the extract-file function for that. Just like this. Next, we want to add a field denoting the station name, so we run our add-field function. Great. We built all three of these functions in our previous sections, so you should already know about them. Then we want to concatenate all of the extracted files into one single file. That's exactly what our concatenate function is for,
so: concatenate. That's it. We also want to merge in another column holding the geographic coordinates, so we use our merge function for that. And lastly, we want to create a pivot table out of all the single observations, so we aggregate those data. We call the pivot function for that. That's it. This is the final tabular product we have been looking for. There is one thing I want to do before visualizing the data frame.
I want to store the data frame in a variable. I'll just write df equals the pivot function call, so the variable will receive the return value of the function. This is the data frame I'll be visualizing in this lecture.
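The final aggregation step can be illustrated with a toy stand-in. The observation values below are invented, and `pivot_table` here plays the role of the custom pivot function from the earlier section, which produces the same kind of station-by-year table:

```python
import pandas as pd

# Invented single observations: one temperature per station per year.
obs = pd.DataFrame({
    "station": ["ST1", "ST1", "ST2", "ST2"],
    "year":    [2010, 2011, 2010, 2011],
    "temp":    [0.57, 0.98, 8.10, 8.40],
})

# Aggregate into a station-by-year pivot table of annual mean temperatures.
df = obs.pivot_table(values="temp", index="station", columns="year", aggfunc="mean")
print(df)
```

With real data, `df` would have one row per weather station and one column per year, exactly like the tabular product in the lecture.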
As you already know from the previous lectures, the values here represent the annual mean temperature for a specific station during a specific year: this station had an annual mean temperature of 0.57 degrees Celsius for 2010, 0.98 for 2011, and so on. Now, this table gives a good overview of the data because it
shows an aggregation of the values. However, we can get an even better overview using a visualization technique referred to as a heat map. A heat map can be generated with the heatmap function of the seaborn library. So let's import seaborn into IPython. Then we can access heatmap through sns, so sns.heatmap, and here you have to pass at least the data argument, which in our case is the data frame. Here is the heat map. We have four rows representing the weather stations and many, many columns representing the years. There's a legend bar here too that helps you read the heat map: red means high temperature values. The dark blue here represents the NaN values, in other words, no data. As you noticed, we didn't create a figure or subplots as we did in our previous lecture. We can do that if you like, but I prefer this more direct way when exploring data, as it is quicker. This way, the figure and the subplots are created
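A minimal sketch of the heatmap call, using random placeholder values in place of the real station/year temperatures:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this when working interactively
import seaborn as sns

# Random values stand in for the annual mean temperatures of four stations.
df = pd.DataFrame(np.random.rand(4, 10),
                  index=["ST1", "ST2", "ST3", "ST4"],
                  columns=range(2005, 2015))

# Only the data argument is required; the figure and axis are created on the fly.
ax = sns.heatmap(data=df)
```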
on the fly. When I called the heatmap function, a figure and one axis were automatically created, and the heat map was generated on that axis. If you would like to save the figure as an image file,
you need to apply the get_figure method to the heatmap expression, and as you see, the figure is stored in a variable. Then you apply yet another method to that variable: savefig, which saves the figure to the file path you pass here. You can also set the dots-per-inch argument, which specifies the resolution of the image, and you're good to go. The
image should now be in the folder I specified. Here it is. Let's go back to our heat map and see what we got. These dark-colored cells here represent the no-data values, and as you see, the deep blue represents the lowest values in the colormap scale. I don't like having the no-data values in the same colormap.
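The save step looks like this as a sketch; the output path below is a placeholder, so point it at whichever folder you prefer:

```python
import os
import tempfile
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

df = pd.DataFrame(np.random.rand(4, 10))  # placeholder data
ax = sns.heatmap(data=df)

figure = ax.get_figure()  # grab the Figure object the heatmap was drawn on
out_path = os.path.join(tempfile.gettempdir(), "heatmap.png")  # placeholder path
figure.savefig(out_path, dpi=300)  # dpi sets the resolution of the saved image
```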
I would instead like to have the no-data values plotted in a separate color. To do that, we call the heatmap function again and add a mask argument to tell Python not to consider certain values
of our data frame when coloring the data. So that's the mask parameter, and it is set to our data frame's null values: df.isnull() filters out the null values. We execute. Great. As you see, the NaN values have now been colored in gray, so this is a general representation of the whole data frame. If you have a look at the last years in the top row, you can see a hot pattern, meaning that the temperature has been relatively high during the last 15 years or so. We are probably dealing with a warming climate here. Anyway, I see these two deep blue cells here, which look like outliers. Outliers are values that seem to be abnormally high or low, so let's inspect this station more closely. For that, we need to plot only the row of this station, and to plot only one row of the data frame, you can use data frame methods to filter out a row first and then apply the plot method. So df.loc, and then in the brackets we input the name
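The masking trick, sketched with a small invented data frame containing two NaN cells:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

# Two NaN cells stand in for missing yearly observations.
df = pd.DataFrame([[0.5, np.nan, 1.2],
                   [np.nan, 0.9, 1.1]],
                  index=["ST1", "ST2"], columns=[2010, 2011, 2012])

# mask=df.isnull() keeps the NaN cells out of the colormap entirely,
# so missing data no longer looks like the lowest temperature values.
ax = sns.heatmap(data=df, mask=df.isnull())
```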
of the row we want to plot, and then we apply the plot method. Here we have the year values on the x-axis and the temperature plotted along the y-axis. The graph is not complete, of course, because we have a lot of missing data. This is not a nice-looking graph, I'll admit, but we can make it better by applying another graph style to the same data. By default, the graph style is a line, as you see. To change that, we need to specify the kind parameter, like so.
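Selecting one station's row and plotting it, first as the default line chart and then with the kind parameter, can be sketched like this (the station label and the gappy values are invented):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")

# One station's yearly means, with gaps, standing in for a row of df.
df = pd.DataFrame([[0.5, np.nan, 1.2, np.nan, 0.9]],
                  index=["ST1"], columns=range(2010, 2015))

row = df.loc["ST1"]        # select one station's row by its label
row.plot()                 # default style: a line chart (gaps break the line)
ax = row.plot(kind="bar")  # a bar chart reads better when values are missing
```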
That's called a bar graph. And here is the new figure, and I do think this is a much better representation given the existence of no-data values. The minus 10 and minus 11 values are the ones we saw in dark colors in the heat map we generated earlier. We can't tell why these values are so low right now, so let's set that aside. What I wanted to mention is that you can explore the plot function by adding a question mark at the end of the expression. Here all the parameters are explained; we entered the bar value for the kind parameter, and you can do this any time you need something more than a basic line chart. Now let's try to plot the whole data frame in a single chart. That is quite easy. You can just pass the data frame,
a dot, and the plot method. This looks odd, because the stations have been plotted along the x-axis here while the years are on the y-axis. We sure don't want this; we want the years along x, which would make more sense. The solution is to transpose our data frame and plot the transposed version, and that is quite simple. To transpose the data frame, we simply write the data frame name followed by T, where T stands for transpose, and that returns a transposed data frame, as you see. Therefore we can transpose it on the fly and apply the plot method, just like this. These are the four graphs of the four weather stations, and each one represents the temperature over the years. We also have a legend here, which was generated automatically. Now, if you would like to have the four graphs in four different subplots, you can pass another argument called subplots and set it to True. Execute. Here we have different subplots, or axes if you like.
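The transpose-and-plot steps above can be sketched as follows, again with invented temperatures for four stations:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")

# Four stations by five years of invented annual mean temperatures.
df = pd.DataFrame(np.random.rand(4, 5),
                  index=["ST1", "ST2", "ST3", "ST4"],
                  columns=range(2010, 2015))

transposed = df.T  # years become the index, stations become the columns
transposed.plot()  # one line per station, with the years along the x-axis

# subplots=True draws each station on its own axis and returns one Axes each.
axes = transposed.plot(subplots=True)
```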
We can set the kind parameter value to bar if we want, for perhaps a better representation. That's what I wanted to show you in this lecture as far as visualizations of data frames are concerned. Again, visualizing data involves much more than this, but I can't cover everything here, so I've tried to extract and teach you the most essential tools to get you started with visualizing arrays of data in general, but also with how
to visualize pandas data frames specifically. I'll talk to you in the next lecture.