Data Analysis with Pandas and Python
- 19 hours on-demand video
- 2 articles
- 4 downloadable resources
- 7 coding exercises
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Perform a multitude of data operations in Python's popular "pandas" library including grouping, pivoting, joining and more!
- Learn hundreds of methods and attributes across numerous pandas objects
- Possess a strong understanding of manipulating 1D, 2D, and 3D data sets
- Resolve common issues in broken or incomplete data sets
Welcome to Data Analysis with Pandas and Python. In this lesson, we
- introduce the pandas library, including its history and purpose
- introduce Jupyter Notebook, the environment in which we'll be writing our code
- explore sample Jupyter Notebooks to showcase some of the technology's features
The datasets for this course are available in a single pandas.zip file. Download and unpack the pandas.zip file in the directory of your choice.
The next batch of lessons focuses on installing and configuring the Anaconda distribution on a macOS machine. When downloading the distribution, choose the latest version of the language; it will have the greatest version number. In this lesson, we also discuss the differences between Python 2 and 3.
In this lesson, we install the Anaconda distribution on a macOS machine. The setup installs Python and over 100 of the most popular libraries for data science in a central directory on your computer. We also explore the Anaconda Navigator program, a visual application for interacting with Anaconda.
The Terminal is an application for issuing text-based commands to your macOS operating system. In this lesson, you'll learn two ways to access the Terminal. We also verify that Anaconda has been successfully installed and update the version of the conda environment manager.
The course materials, a collection of datasets in .csv and .xlsx file formats, are available for download in a single zip file attached to this lesson. I strongly recommend following along with my tutorials by practicing the syntax on your end. In this lesson, we walk through the startup and shutdown process for a Jupyter Notebook session. We also execute our first line of Python code!
In this lesson, we download the Anaconda distribution, a software bundle that includes Python and the conda environment manager, for our Windows computers. We discuss the differences between Python 2 and 3 and also determine which version of the distribution to download (32-bit vs 64-bit).
In this lesson, we install the Anaconda distribution on our Windows machines. The executable installs Python, pandas, Jupyter Notebook and over 100 popular libraries for data analysis in a standard "base" environment. We conclude by launching the Anaconda prompt.
Access the Command Prompt on a Windows machine. The prompt (also known as the command line) is used to interact with the computer with text-based commands. We'll use it to download additional Python libraries for the course and update all installed Anaconda libraries.
In this lesson, we extract our .csv and .xlsx datasets, which are available in a single .zip file attached to this lesson. We also walk through the startup and shutdown process for a study session, which includes
- activating the correct Anaconda environment
- launching the Jupyter Notebook application
- opening and closing a Jupyter Notebook
- shutting down the Jupyter server
The pd.Series constructor method accepts a variety of inputs, including native Python objects. In this lesson, we'll create a Series from a Python dictionary. We'll also explore the differences between the Series and Python's built-in objects, and understand how the index operates in a Series.
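As a quick sketch of this idea (the dictionary below is invented sample data, not a course dataset):

```python
import pandas as pd

# A hypothetical dictionary mapping sushi fish to their colors.
sushi = {"salmon": "orange", "tuna": "red", "eel": "brown"}
s = pd.Series(sushi)

# The dictionary keys become the index labels; the values become the data.
print(s.index.tolist())  # ['salmon', 'tuna', 'eel']
print(s["tuna"])         # red
```

Unlike a plain dictionary, the resulting Series keeps a defined order and supports positional access alongside label-based access.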
The time has come to import our first datasets into our Jupyter Notebook work environment. In this lesson, we use the pd.read_csv method to import a dataset of Pokemon and Google stock prices. We also explore the squeeze parameter, which coerces an imported one-column DataFrame into a Series object.
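A minimal sketch of the coercion described above, using an in-memory stand-in for the course's CSV file (recent pandas versions removed read_csv's squeeze parameter, so this uses the .squeeze() method instead):

```python
import io
import pandas as pd

# Simulated one-column CSV; the course uses a pokemon.csv file instead.
csv_data = "Pokemon\nBulbasaur\nIvysaur\nVenusaur\n"

# pd.read_csv returns a DataFrame; .squeeze("columns") coerces a
# single-column DataFrame down to a Series. Older pandas versions
# achieved the same effect with read_csv's squeeze=True parameter.
s = pd.read_csv(io.StringIO(csv_data)).squeeze("columns")
print(type(s).__name__)  # Series
print(s.iloc[0])         # Bulbasaur
```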
Call the .sort_values() method on a Series to sort the values in ascending or descending order. We'll see how this command operates on both a numeric and alphabetical dataset.
In this lesson, we explore an alternative approach to extracting one or more values from a Series by index position or index label. The .get() method accepts the key to search for in the index as well as a fallback value to return if the key is not found.
The pandas Series and DataFrame objects share many attributes and methods. In this lesson, we'll review attributes like .index, .values, .shape, .ndim, and .dtypes and see what they return on a 2D DataFrame. We'll also introduce new attributes including .columns and .axes that are exclusive to DataFrames.
Use two syntactical options to extract a single column from a pandas DataFrame. I prefer the square bracket approach because it works 100% of the time. The alternative option is using dot syntax, which treats the columns as attributes of the larger DataFrame object.
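Both options in a short sketch (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"],
                   "Start Date": ["2020-01-01", "2021-06-15"]})

# Square brackets work for any column name.
names = df["Name"]
# Dot syntax works only when the name is a valid Python attribute
# (no spaces, no clashes with existing DataFrame methods).
same = df.Name
# df.Start Date would be a SyntaxError; brackets handle it fine.
start = df["Start Date"]
```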
A broadcasting operation performs an operation on all values within a pandas object. In this lesson, we'll apply several mathematical operations to values in a DataFrame column (i.e. a Series) including the .add(), .sub(), .mul() and .div() methods. We'll also cover the operator shortcuts for these methods.
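A compact illustration with made-up salary figures:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [50000, 60000, 70000]})

# Method syntax and operator shortcuts broadcast across every value.
raised = df["Salary"].add(5000)   # same as df["Salary"] + 5000
monthly = df["Salary"].div(12)    # same as df["Salary"] / 12
doubled = df["Salary"].mul(2)     # same as df["Salary"] * 2
print(raised.tolist())  # [55000, 65000, 75000]
```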
Null values are represented with a NaN marker in pandas. In this lesson, we'll delete rows with null (NaN) values by calling the .dropna() method. We'll also modify the arguments of the method to specify how to select the rows to be deleted.
Data types in a Series will not always be the types we want or the types that are best for efficiency. In this lesson, we'll convert the data types in a Series with the .astype() method. We'll also show how to overwrite an old Series with a Series of new data values.
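A minimal sketch of the conversion and overwrite pattern:

```python
import pandas as pd

s = pd.Series(["10", "20", "30"])  # numbers stored as strings
print(s.dtype)  # object

# Convert the values and overwrite the old Series with the new one.
s = s.astype(int)
print(s.sum())  # 60
```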
Call the .sort_values() method to sort the values in a DataFrame based on the values in a single column. The method is a bit more complex than when called on a single-dimensional pandas Series.
In this lesson, we'll explore additional parameters to the .sort_values() method to sort the values in a DataFrame based on the values in multiple columns. We'll also cover how to specify different sort orders (ascending vs. descending) on different columns.
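A sketch of a multi-column sort with per-column orders (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Red", "Blue", "Red", "Blue"],
    "Score": [10, 30, 20, 15],
})

# Sort by Team ascending, then by Score descending within each team.
result = df.sort_values(by=["Team", "Score"], ascending=[True, False])
print(result["Score"].tolist())  # [30, 15, 20, 10]
```

The ascending parameter accepts a list the same length as by, pairing a sort direction with each column.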
In this lesson, we create the Jupyter Notebook for our new section, our second focusing on the 2D DataFrame object. The focus of this module is filtering data or, in other words, how we extract rows based on one or more conditions. We also introduce the employees.csv dataset that we'll be working with.
In this lesson, we'll filter rows from the DataFrame based on a single condition. The logic involves creating a Boolean Series of True and False values, then passing it in square brackets after our DataFrame.
In this lesson, we'll continue filtering rows from the DataFrame based on multiple conditions. However, this time we'll use a new symbol ( | ) to specify an OR check. This requires only one of the tested conditions to evaluate to True in order to include the row.
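A sketch of an OR filter (the columns mimic an employees dataset but the values here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Team": ["Legal", "Sales", "Legal"],
    "Salary": [90000, 50000, 60000],
})

# Each condition yields a Boolean Series; | keeps a row when
# at least one condition is True. Parentheses are required.
mask = (df["Team"] == "Sales") | (df["Salary"] > 80000)
result = df[mask]
print(result["Name"].tolist())  # ['Alice', 'Bob']
```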
Call the .isnull() and .notnull() methods to create Boolean Series for extracting rows with null or non-null values. Both methods return a Boolean Series object, which can be passed within square brackets after the DataFrame to filter it.
Call the .between() method to extract rows where a column value falls in between a predefined range. This is another method that returns a Boolean Series object, which can be passed within square brackets after the DataFrame to filter it.
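For example (invented salary values):

```python
import pandas as pd

df = pd.DataFrame({"Salary": [40000, 55000, 70000, 90000]})

# .between is inclusive of both endpoints by default.
mask = df["Salary"].between(50000, 70000)
print(df[mask]["Salary"].tolist())  # [55000, 70000]
```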
Call the .duplicated() method to create a Boolean Series and use it to extract rows that have duplicate values. This is another example of a method that returns a Boolean Series object, which can be passed within square brackets after the DataFrame to filter it.
An alternative option to identifying duplicate rows and removing them through filtering is the .drop_duplicates() method. In this lesson, we'll invoke the method to remove rows with duplicate values in a DataFrame. We'll also provide custom arguments to modify how the method operates.
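A sketch of the default behavior and two common custom arguments:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Team": ["Legal", "Sales", "Legal"],
})

# Default: drop rows that duplicate ALL column values, keeping the first.
print(len(df.drop_duplicates()))  # 2
# subset= checks only the given columns; keep=False discards every
# member of a duplicate group rather than keeping one.
print(len(df.drop_duplicates(subset=["Name"], keep=False)))  # 1
```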
Call the .unique() and .nunique() methods on a Series to extract the unique values and a count of the unique values. These methods are one letter apart but return completely different results. In addition, the .nunique() method requires an additional argument to include null values in its count.
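A short sketch showing the difference, including the null-counting argument:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Red", "Blue", "Red", np.nan])

print(s.unique())               # array of the distinct values, NaN included
print(s.nunique())              # 2 -- nulls are excluded by default
print(s.nunique(dropna=False))  # 3 -- pass dropna=False to count nulls
```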
In this lesson, we introduce the third DataFrame-focused section of the course. The upcoming lessons cover how to:
set and reset an index in a DataFrame
retrieve DataFrame rows by index position or index label
set new values for one or more cells in the DataFrame
rename or delete rows or columns
extract a random sample of rows / columns
Pandas will default to assigning a data structure a numeric index starting at 0. In this lesson, we'll explore how we can use the set_index and reset_index methods to customize and reset the index labels of a DataFrame object.
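A minimal round trip with invented data:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["US", "FR"], "Price": [5.5, 6.1]})
print(df.index.tolist())      # [0, 1] -- the default numeric index

df = df.set_index("Country")  # promote a column to the index
print(df.loc["FR", "Price"])  # 6.1

df = df.reset_index()         # demote it back to a regular column
print(df.columns.tolist())    # ['Country', 'Price']
```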
In this lesson, we invoke the rename method on a DataFrame to change the names of the index labels or column names. We can either combine the mapper and axis parameters, or target the columns and index parameters exclusively. In either case, we provide an argument of a dictionary where the keys represent the current label names and the values represent the desired label names.
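A sketch of the dictionary-based renaming (labels invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"nm": ["Alice"], "sal": [90000]}, index=["row1"])

# columns= / index= each take a dict of {current label: desired label}.
df = df.rename(columns={"nm": "Name", "sal": "Salary"},
               index={"row1": "employee_1"})
# Equivalent for columns: df.rename(mapper={"nm": "Name"}, axis="columns")
print(df.columns.tolist())  # ['Name', 'Salary']
print(df.index.tolist())    # ['employee_1']
```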
There is a shortcut available to pull out the rows with the smallest or largest values in a column. Instead of sorting the rows and using the .head() method, we can call the .nsmallest() and .nlargest() methods. We'll dive into these methods and their parameters in this lesson.
In this review of a lesson from our Series Module, we'll call the .apply() method on a Series to apply a Python function on every value within it. This will act as a foundation for the next lesson, where we'll invoke the same method on a DataFrame.
The default bracket syntax extracts a component of the larger DataFrame. Operations on that component can affect the larger DataFrame. If we want to separate the two objects, we can use the .copy() method, which creates an independent copy of a pandas object.
Datasets can arrive with plenty of improperly formatted text data. The Working with Text Data section introduces the methods available in pandas to clean your data. In this introductory lesson, we create a Jupyter Notebook for this section and import a CSV file with public data on employees in the city of Chicago. We also optimize the DataFrame for speed and efficiency.
String methods in pandas require a .str prefix to operate properly. In this lesson, we'll explore four popular string methods we can invoke on all values in a Series:
str.lower() to convert a string's characters to lowercase
str.upper() to convert a string's characters to uppercase
str.title() to capitalize the first letter of every word in a string
str.len() to return a count of the number of characters in a string
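The four methods above in one sketch (sample names invented):

```python
import pandas as pd

names = pd.Series(["alice SMITH", "bob JONES"])

print(names.str.lower().tolist())  # ['alice smith', 'bob jones']
print(names.str.upper().tolist())  # ['ALICE SMITH', 'BOB JONES']
print(names.str.title().tolist())  # ['Alice Smith', 'Bob Jones']
print(names.str.len().tolist())    # [11, 9]
```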
The str.replace() method replaces a substring within a string with another value for all Series values. In this lesson, we use it to convert our Employee Annual Salary column to store numeric values instead of text ones.
In this lesson, we'll introduce the .str.contains(), .str.startswith(), and .str.endswith() methods. All three create a Boolean Series, which can be used to extract rows from a DataFrame. We'll also discuss case normalization to increase the accuracy of our results.
In this lesson, we'll invoke the .str.strip() family of methods to remove leading and trailing whitespace from strings in a Series. The .str.lstrip() method removes whitespace from the left side (beginning) of a string, the .str.rstrip() method removes whitespace from the right side (end) of a string, and the .str.strip() method does both.
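All three in a short sketch:

```python
import pandas as pd

s = pd.Series(["  Water Rate  ", "  Police  "])

print(s.str.lstrip().tolist())  # ['Water Rate  ', 'Police  ']
print(s.str.rstrip().tolist())  # ['  Water Rate', '  Police']
print(s.str.strip().tolist())   # ['Water Rate', 'Police']
```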
Strings can often contain multiple pieces of information that are separated by a common delimiter. In this lesson, we'll introduce the .str.split() method, which can split a string value based on an occurrence of a user-specified value. This is equivalent to the Text to Columns feature in Microsoft Excel.
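A sketch of the split, using expand=True to spread the pieces across columns (sample data invented):

```python
import pandas as pd

s = pd.Series(["SMITH, ALICE", "JONES, BOB"])

# Split each string on the delimiter; expand=True spreads the pieces
# across DataFrame columns, like Excel's Text to Columns feature.
parts = s.str.split(", ", expand=True)
print(parts[0].tolist())  # ['SMITH', 'JONES']
print(parts[1].tolist())  # ['ALICE', 'BOB']
```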
The index attribute returns the underlying object that makes up the index of a DataFrame. In this lesson, we invoke the get_level_values method on the index to extract the values from one of its levels. We show how this can be done either by the layer's index position or by its name.
The .stack() method stacks an index from the column axis to the row axis. It essentially transfers the columns to the row index. In this lesson, we'll see a live example on our bigmac dataset.
Multiple levels of the row-based MultiIndex can be shifted with the .unstack() method. In this lesson, we'll explore how to provide a list argument to the level parameter to move multiple layers at a time. We'll also introduce the fill_value parameter to plug in missing values in the resulting DataFrame.
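A sketch with a small invented MultiIndex Series (the course demonstrates this on its bigmac dataset):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("US", 2020, "A"), ("US", 2021, "A"), ("FR", 2020, "B")],
    names=["Country", "Year", "Group"],
)
s = pd.Series([1.0, 2.0, 3.0], index=index)

# Move the Year and Group levels to the column axis at once; plug the
# combinations that have no data with 0 instead of NaN.
df = s.unstack(level=["Year", "Group"], fill_value=0)
print(df.loc["FR", (2020, "B")])  # 3.0
print(df.loc["FR", (2021, "A")])  # 0 -- filled, not NaN
```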
In this lesson, we'll emulate Excel's Pivot Table functionality with the .pivot_table() method. We'll explore the values, index, columns, and aggfunc parameters. We'll also discuss the variety of aggregation functions that we can use including sum, count, max, and min.
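A minimal pivot with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Team": ["Sales", "Sales", "Legal", "Legal"],
    "Salary": [50000, 60000, 70000, 80000],
})

# Rows come from Gender, columns from Team, and each cell aggregates
# the Salary values that fall into that row/column combination.
table = df.pivot_table(values="Salary", index="Gender",
                       columns="Team", aggfunc="sum")
print(table.loc["F", "Legal"])  # 80000
```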
The pd.melt() method can effectively perform anti-pivot operations. In this lesson, we'll call the method on a DataFrame to convert its current data structure into a more tabular format. We'll also explore the optional parameters available to modify the resulting column names in the new DataFrame.
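A sketch of the anti-pivot, going from wide to tall (sample data invented):

```python
import pandas as pd

# Wide format: one column per year.
df = pd.DataFrame({"City": ["NYC", "LA"], "2020": [100, 90], "2021": [110, 95]})

# Melt back to a tall, tabular format; var_name and value_name
# set the names of the two generated columns.
tall = pd.melt(df, id_vars="City", var_name="Year", value_name="Sales")
print(len(tall))              # 4
print(tall.columns.tolist())  # ['City', 'Year', 'Sales']
```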
Certain situations may require different aggregation methods on different columns within our groupings. In this lesson, we'll invoke the .agg() method on our GroupBy object to apply a different aggregation operation to each inner column.
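For example, averaging one column while summing another (values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Sales", "Sales", "Legal"],
    "Salary": [50000, 60000, 70000],
    "Bonus": [1000, 2000, 3000],
})

# Pass .agg a dict mapping each column to its own aggregation.
result = df.groupby("Team").agg({"Salary": "mean", "Bonus": "sum"})
print(result.loc["Sales", "Salary"])  # 55000.0
print(result.loc["Sales", "Bonus"])   # 3000
```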
A standard Python for loop can be used to iterate over the groups in a pandas GroupBy object. In this lesson, we'll loop over all of our groupings to extract selected rows from each inner DataFrame. We'll append these rows to a running DataFrame and then view the final result.
Welcome to the Merging, Joining, and Concatenating section! In this module, we'll cover how to combine data from multiple DataFrames into one. In this section, we create a new Jupyter Notebook and introduce the 4 CSV files that we will be using.
In this lesson, we use the keys parameter on the pd.concat method to label each concatenated DataFrame with a unique identifier. This parameter yields a MultiIndex DataFrame where the outermost layer holds the keys and the innermost layer holds each DataFrame's original index values.
A left join establishes one of the DataFrames as the base dataset for the merge. It attempts to find each value in another DataFrame and drag over that DataFrame's rows when there's a value match. In this lesson, we'll practice executing this join with the .merge() method.
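A sketch of a left join with two invented DataFrames:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Ben", "Cam"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3], "total": [20, 35, 15]})

# Left join: every customer row is kept; order columns are dragged over
# where cust_id matches, and NaN fills in where there is no match.
merged = customers.merge(orders, how="left", on="cust_id")
print(len(merged))  # 4 -- Ann matched twice, Ben unmatched, Cam once
```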
The Working with Dates and Times section offers a review of Python's built-in datetime objects as well as a comprehensive introduction to similar tools in the pandas library. In this lesson, we set up our Jupyter Notebook and import Python's datetime module.
Over the course of the next three lessons, we'll call the pd.date_range() method to generate a DatetimeIndex of Timestamp objects. This constructor method includes 3 critical parameters (start, end, and periods); we need to provide 2 of these 3 for it to function. In this lesson, we'll see how the pd.date_range() method operates with arguments for the start and end parameters.
In this lesson, we'll see how the pd.date_range() method operates with arguments for the end and periods parameters. This approach creates a set number of dates, proceeding backwards from a specified date point. We'll also continue our exploration of the freq parameter to vary the durations between each Timestamp.
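The parameter combinations from the last two lessons in one sketch (dates chosen arbitrarily):

```python
import pandas as pd

# start + end: every date between the two, at the default daily frequency.
a = pd.date_range(start="2020-01-01", end="2020-01-05")
print(len(a))  # 5

# end + periods: a fixed number of dates, counting backwards from the end.
b = pd.date_range(end="2020-01-05", periods=3)
print(b[0])    # 2020-01-03 00:00:00

# freq varies the duration between each Timestamp, e.g. 7-day steps.
c = pd.date_range(start="2020-01-01", periods=3, freq="7D")
print(c[-1])   # 2020-01-15 00:00:00
```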
Upcoming lessons rely on the pandas-datareader library to fetch financial datasets from Yahoo Finance. In this lesson, we'll install the pandas-datareader library.
On a Mac system, open the Terminal. On a Windows machine, look for the Anaconda Prompt from the Start Menu.
Once the application is open, run the following commands.
- conda activate followed by your environment name (for example, conda activate pandas_playground)
- conda install pandas-datareader
Extracting rows from a DataFrame with a DatetimeIndex is no different than in previous sections. In this lesson, we review the familiar .loc and .iloc accessors. As a reminder, these methods use a pair of square brackets to target one or more rows by either index label or index position.
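A sketch with an invented price Series in place of the Yahoo Finance data:

```python
import pandas as pd

dates = pd.date_range(start="2020-01-01", periods=5)
df = pd.DataFrame({"price": [10, 11, 12, 13, 14]}, index=dates)

# .loc targets rows by index label -- here, a date string works.
print(df.loc["2020-01-03", "price"])  # 12
# .iloc targets rows by position, regardless of index type.
print(df.iloc[0, 0])                  # 10
```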
- Basic / intermediate experience with Microsoft Excel or another spreadsheet software (common functions, VLOOKUPs, Pivot Tables, etc.)
- Basic experience with the Python programming language
- Strong knowledge of data types (strings, integers, floating points, booleans, etc.)
The instructor knows the material, and has detailed explanation on every topic he discusses. Has clarity too, and warns students of potential pitfalls. He has a very logical explanation, and it is easy to follow him. I highly recommend this class, and would look into taking a new class from him. - Diana
This is excellent, and I cannot compliment the instructor enough. Extremely clear, relevant, and high quality - with helpful practical tips and advice. Would recommend this to anyone wanting to learn pandas. Lessons are well constructed. I'm actually surprised at how well done this is. I don't give many 5 stars, but this has earned it so far. - Michael
This course is very thorough, clear, and well thought out. This is the best Udemy course I have taken thus far. (This is my third course.) The instruction is excellent! - James
Welcome to the most comprehensive Pandas course available on Udemy! An excellent choice for both beginners and experts looking to expand their knowledge on one of the most popular Python libraries in the world!
Data Analysis with Pandas and Python offers 19+ hours of in-depth video tutorials on the most powerful data analysis toolkit available today. Lessons include:
Why learn pandas?
If you've spent time in a spreadsheet software like Microsoft Excel, Apple Numbers, or Google Sheets and are eager to take your data analysis skills to the next level, this course is for you!
Data Analysis with Pandas and Python introduces you to the popular Pandas library built on top of the Python programming language.
Pandas is a powerhouse tool that allows you to do anything and everything with colossal data sets -- analyzing, organizing, sorting, filtering, pivoting, aggregating, munging, cleaning, calculating, and more!
I call it "Excel on steroids"!
Over the course of more than 19 hours, I'll take you step-by-step through Pandas, from installation to visualization! We'll cover hundreds of different methods, attributes, features, and functionalities packed away inside this awesome library. We'll dive into tons of different datasets, short and long, broken and pristine, to demonstrate the incredible versatility and efficiency of this package.
Data Analysis with Pandas and Python is bundled with dozens of datasets for you to use. Dive right in and follow along with my lessons to see how easy it is to get started with pandas!
Whether you're a new data analyst or have spent years (*cough* too long *cough*) in Excel, Data Analysis with pandas and Python offers you an incredible introduction to one of the most powerful data toolkits available today!
- Data analysts and business analysts
- Excel users looking to learn a more powerful software for data analysis