
We begin with an overview of the entire project. There are 40 steps and nearly 100 unit tests that must be passed in order to complete the project. The end result will be a fully-functioning data analysis library similar to pandas.
The Pandas Cub library will offer a wide range of functionality upon completion. A tour of its functionality within the Jupyter Notebook is provided.
All of the course material must be downloaded from its GitHub repository (https://github.com/tdpetrou/pandas_cub).
VS Code is a popular, free and open source code editor that I will use to develop Pandas Cub.
It's important to create a separate development environment when beginning new Python projects to ensure a clean work space that can be reliably tested. We use conda to create the environment.
Test-driven development is a popular method for ensuring code quality that involves writing tests first and then writing code that passes them.
Creating a development environment with conda unfortunately does not mean it is available when launching a Jupyter Notebook. In this video, we install an IPython kernel so that we can connect to our environment from within the notebook.
The __init__.py file is the only file you will be editing for the entire project. We inspect its contents in this video.
We discuss how Python discovers packages and modules.
Unit tests are a good way to ensure code quality, but sometimes you need a quick way to interact with your code. We use the Jupyter Notebook to manually experiment with our DataFrame.
We make sure we are ready to write our first lines of code and attempt the following 40 steps.
Our DataFrame class is constructed with a single parameter, data. Python will call the special __init__ method when first constructing our DataFrame. This method has already been completed for you and you will not need to edit it. Within the __init__ method, several more methods are called that check to see if the user has passed it valid data. You will be editing these methods during the next few steps.
In this step, you will only be editing the _check_input_types method. This is the first method called within the __init__ method. It will ensure that our users have passed us a valid data parameter.
We are going to force our users to set data as a dictionary that has strings as the keys and one-dimensional numpy arrays as the values. The keys will eventually become the column names and the arrays will be the values of those columns.
Specifically, _check_input_types must do the following:
raise a TypeError if data is not a dictionary
raise a TypeError if the keys of data are not strings
raise a TypeError if the values of data are not numpy arrays
raise a ValueError if the values of data are not 1-dimensional
Edit this method now. Use the isinstance function to help you determine the type of an object.
Run the following command to test this step. Once you have passed this test move on to the next step.
$ pytest tests/test_dataframe.py::TestDataFrameCreation::test_input_types
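The four checks listed above could be sketched roughly as follows. This is a minimal standalone function rather than the method itself; the name check_input_types and the error messages are assumptions made for illustration.

```python
import numpy as np


def check_input_types(data):
    """Validate that data maps string keys to one-dimensional numpy arrays."""
    if not isinstance(data, dict):
        raise TypeError("`data` must be a dictionary")
    for col_name, values in data.items():
        if not isinstance(col_name, str):
            raise TypeError("All column names must be strings")
        if not isinstance(values, np.ndarray):
            raise TypeError("All values must be numpy arrays")
        if values.ndim != 1:
            raise ValueError("Each array must be one-dimensional")


# Valid input passes silently
check_input_types({"a": np.array([1, 2, 3])})
```

The type checks come before the dimension check because ndim only exists once we know the value is an array.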
We are now guaranteed that data is a dictionary of strings mapped to one-dimensional arrays. Each column of data in our DataFrame must have the same number of elements. In this step, you must ensure that this is the case. Edit the _check_array_lengths method and raise a ValueError if any of the arrays differ in length.
Run the following test:
$ pytest tests/test_dataframe.py::TestDataFrameCreation::test_array_length
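One possible way to compare the lengths is to collect them in a set, which holds more than one element only when the arrays disagree. The function name and message below are illustrative assumptions.

```python
import numpy as np


def check_array_lengths(data):
    """Raise a ValueError unless every array in data has the same length."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("All arrays must be the same length")


check_array_lengths({"a": np.array([1, 2]), "b": np.array([3.0, 4.0])})
```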
Whenever you create a numpy array of Python strings, it will default the data type of that array to unicode. Take a look at the following simple numpy array created from strings. Its data type, found in the dtype attribute, is shown to be 'U' plus the length of the longest string.
>>> a = np.array(['cat', 'dog', 'snake'])
>>> a.dtype
dtype('<U5')
Unicode arrays are more difficult to manipulate and don't have the flexibility that we desire. So, if our user passes us a Unicode array, we will convert it to a data type called 'object'. This is a flexible type and will help us later when creating methods just for string columns. Technically, this data type allows any Python objects within the array.
In this step, you will change the data type of Unicode arrays to object. You will do this by checking each array's data type kind. The data type kind is a single-character value available by doing array.dtype.kind. See the numpy docs for a list of all the available kinds. Let's retrieve the kind of our array from above.
>>> a.dtype.kind
'U'
Pass the astype array method the correct kind character to change its type.
Edit the _convert_unicode_to_object method and fill the dictionary new_data with the converted arrays. The result of this method will be returned and assigned as the _data instance variable.
Run test_unicode_to_object to test.
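As a rough sketch of the conversion, assuming a standalone function in place of the method: arrays whose kind is 'U' are converted with astype('O'), and every other array is passed through unchanged.

```python
import numpy as np


def convert_unicode_to_object(data):
    """Return a new dict in which Unicode arrays have been converted to object."""
    new_data = {}
    for col_name, values in data.items():
        if values.dtype.kind == "U":
            new_data[col_name] = values.astype("O")
        else:
            new_data[col_name] = values
    return new_data


converted = convert_unicode_to_object(
    {"animal": np.array(["cat", "dog", "snake"]), "age": np.array([3, 5, 1])}
)
print(converted["animal"].dtype.kind)  # O
print(converted["age"].dtype.kind)     # i
```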
The number of rows is returned when passing a pandas DataFrame to the builtin len function. We will make pandas_cub behave the exact same way.
To do so we need to implement the special method __len__. This is what Python calls whenever an object is passed to the len function.
Edit the __len__ method and have it return the number of rows. Test with test_len.
Special Methods
Step 4 introduced us to the __len__ 'special method'. Python has over 100 special methods that allow you to define how your class behaves when it interacts with a builtin function or operator. In the above example, if df is a DataFrame and a user calls len(df) then internally the __len__ method will be called. All special methods begin and end with two underscores.
Let's see a few more examples:
df + 5 calls the __add__ special method
df > 5 calls the __gt__ special method
-df calls the __neg__ special method
round(df) calls the __round__ special method
We've actually already seen the special method __init__ which is used to initialize an object and called when a user calls DataFrame(data).
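To watch this dispatch happen outside of the DataFrame, here is a small sketch using a hypothetical Wallet class that exists only for this illustration. Each builtin function or operator ends up calling the matching special method.

```python
class Wallet:
    """Hypothetical class used only to show special-method dispatch."""

    def __init__(self, amounts):
        self.amounts = list(amounts)

    def __len__(self):
        # Called by len(w)
        return len(self.amounts)

    def __add__(self, other):
        # Called by w + other
        return Wallet(a + other for a in self.amounts)

    def __gt__(self, other):
        # Called by w > other
        return [a > other for a in self.amounts]

    def __neg__(self):
        # Called by -w
        return Wallet(-a for a in self.amounts)


w = Wallet([1, 7, 10])
print(len(w))           # 3
print((w + 5).amounts)  # [6, 12, 15]
print(w > 5)            # [False, True, True]
print((-w).amounts)     # [-1, -7, -10]
```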
The Python documentation has good (though complex) coverage of all the special methods. We will be implementing many more special methods. I strongly recommend referencing the documentation to learn more.
In this step you will make df.columns return a list of the column names. Notice that df.columns is not a method here. There will be no parentheses that follow it.
Looking at the source code, you will see that columns appears to be defined as if it is a method. But directly above it is the property decorator. The property decorator lets df.columns execute code just like a method, even though it is accessed without parentheses.
Currently the keys in our _data dictionary refer to the columns in our DataFrame. Edit the columns 'method' (really a property) to return a list of the columns in order. Since we are working with Python 3.6, the dictionary keys are internally ordered. Take advantage of this. Validate with the test_columns test.
In this step, we will be assigning all new columns to our DataFrame by setting the columns property equal to a list. A concrete example below shows how you would set new columns for a 3-column DataFrame.
df.columns = ['state', 'age', 'fruit']
There are three parts to properties in Python: the getter, setter, and deleter. In the previous step, we defined the getter. In this step we will define the setter with the columns.setter decorator. The value on the right hand side of the assignment statement is passed to the method decorated by columns.setter. Edit this method and complete the following tasks:
Raise a TypeError if the object used to set new columns is not a list
Raise a ValueError if the number of column names in the list does not match the current DataFrame
Raise a TypeError if any of the columns are not strings
Raise a ValueError if any of the column names are duplicated in the list
Reassign the _data variable so that all the keys have been updated
Test with test_set_columns.
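The getter and setter pair can be sketched with a hypothetical Table class that stores its columns in a _data dictionary. The class name and error messages are assumptions; only the property mechanics are the point here.

```python
class Table:
    """Hypothetical stand-in for a DataFrame that keeps columns in a dict."""

    def __init__(self, data):
        self._data = data

    @property
    def columns(self):
        # Getter: runs when t.columns is read
        return list(self._data)

    @columns.setter
    def columns(self, columns):
        # Setter: receives the right-hand side of `t.columns = ...`
        if not isinstance(columns, list):
            raise TypeError("New columns must be a list")
        if len(columns) != len(self._data):
            raise ValueError("New columns must match the current number of columns")
        if not all(isinstance(col, str) for col in columns):
            raise TypeError("New column names must be strings")
        if len(columns) != len(set(columns)):
            raise ValueError("Column names must be unique")
        # Rebuild the dictionary so that the keys are the new names
        self._data = dict(zip(columns, self._data.values()))


t = Table({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
t.columns = ["state", "age", "fruit"]
print(t.columns)  # ['state', 'age', 'fruit']
```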
The shape property will return a two-item tuple of the number of rows and columns. The property decorator is used again here so that df.shape can execute code like a method. We could just make it a normal method and invoke it with df.shape() but we are following pandas' lead and keeping shape as a property.
Test with test_shape.
Currently we have no representation of our DataFrame. If you try and output your DataFrame, you'll just get its location in memory and it will look something like this:
>>> df
<pandas_cub.DataFrame at 0x116d405c0>
The _repr_html_ method is made available to developers by IPython so that your objects can have nicely formatted HTML displays within Jupyter Notebooks. Read more on this method in the IPython documentation, which also covers similar methods for different representations.
This method must return a string of HTML. This method is fairly complex and you must know some basic HTML to complete it. I recommend copying and pasting the implementation from pandas_cub_final instead of doing it yourself.
If you do know HTML and are seeking a greater challenge, use the docstrings to give you an idea of how the HTML may be formatted. There are no tests for this method.
In pandas, values is a property that returns a single array of all the columns of data. Our DataFrame will do the same. Edit the values property and concatenate all the column arrays into a single two-dimensional numpy array. Return this array. The numpy column_stack function can be helpful here.
Test with test_values.
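A quick look at what column_stack does with a plain dictionary of arrays (the data here is made up): each one-dimensional array becomes one column of the two-dimensional result.

```python
import numpy as np

data = {"a": np.array([1, 2, 3]), "b": np.array([4.5, 5.5, 6.5])}

# Stack the one-dimensional arrays side by side as columns
values = np.column_stack(list(data.values()))
print(values.shape)       # (3, 2)
print(values.dtype.kind)  # f
```

Because a numpy array has a single data type, the integer column is upcast to float when stacked next to the float column.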
Hint when returning a DataFrame from a property/method
Many of the next steps require you to return a DataFrame as the result of the property/method. To do so, you will use the DataFrame constructor like this.
return DataFrame(new_data)
where new_data is a dictionary mapping the column names to a one-dimensional numpy array. It is your job to create the new_data dictionary correctly.
The dtypes property will return a two-column DataFrame with the column names in the first column and their data type as a string in the other. Use 'Column Name' and 'Data Type' as column names.
Use the DTYPE_NAME dictionary to convert from array kind to the string name of the data type. Test with test_dtypes.
In pandas, you can select a single column with df['colname']. Our DataFrame will do the same. To make an object work with the brackets, you must implement the __getitem__ special method. See the official documentation for more.
This special method is always passed a single parameter, the value within the brackets. We use item as the parameter name.
In this step, use isinstance to check whether item is a string. If it is, return a one column DataFrame of that column. You will need to use the DataFrame constructor to return a DataFrame.
These tests are under the TestSelection class. Run the test_one_column test.
Our DataFrame will also be able to select multiple columns if given a list within the brackets. For example, df[['colname1', 'colname2']] will return a two column DataFrame.
Continue editing the __getitem__ method. If item is a list, return a DataFrame of just those columns. Run test_multiple_columns to test.
In pandas, you can filter for specific rows of a DataFrame by passing in a boolean Series/array to the brackets. For instance, the following will select only the rows such that a is greater than 10.
>>> s = df['a'] > 10
>>> df[s]
This is called boolean selection. We will make our DataFrame work similarly. Edit the __getitem__ method and check whether item is a DataFrame. If it is then do the following:
If it is more than one column, raise a ValueError
Extract the underlying array from the single column
If the underlying array kind is not boolean ('b') raise a ValueError
Use the boolean array to return a new DataFrame with just the rows where the boolean array is True along with all the columns.
Run test_simple_boolean to test
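Stripped of the DataFrame wrapper, the filtering itself can be sketched as below, assuming the data lives in a plain dictionary of arrays. The same boolean array is applied to every column.

```python
import numpy as np

data = {
    "a": np.array([5, 20, 11]),
    "b": np.array(["x", "y", "z"], dtype="O"),
}

# Boolean array with one entry per row
mask = data["a"] > 10
if mask.dtype.kind != "b":
    raise ValueError("Selection array must be boolean")

# Keep only the rows where mask is True, for all columns
new_data = {col_name: values[mask] for col_name, values in data.items()}
print(new_data["a"].tolist())  # [20, 11]
print(new_data["b"].tolist())  # ['y', 'z']
```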
When you pass the brackets operator a sequence of comma separated values with df[rs, cs], Python passes the __getitem__ special method a tuple of all the values.
To get started coding, within the __getitem__ special method check whether item is a tuple instance. If it is not, raise a TypeError and inform the user that they need to pass in either a string (step 11), a list of strings (step 12), a one column boolean DataFrame (step 13) or both a row and column selection (step 14).
If item is a tuple, return the result of a call to the _getitem_tuple method.
Edit the _getitem_tuple method from now through step 18.
Within the _getitem_tuple method, raise a ValueError if it is not exactly two items in length.
Run test_simultaneous_tuple to test.
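To see exactly what Python hands to __getitem__, a hypothetical Probe class that simply returns its argument is enough. Comma-separated values inside the brackets arrive as a single tuple.

```python
class Probe:
    """Hypothetical class that reports what the brackets received."""

    def __getitem__(self, item):
        return item


p = Probe()
print(p["a"])        # a
print(p[2:5, "b"])   # (slice(2, 5, None), 'b')
print(p[[0, 1], 3])  # ([0, 1], 3)
```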
In this step, we will select a single cell of data with df[rs, cs]. We will assume rs is an integer and cs is either an integer or a string.
To get started, assign the first element of item to the variable row_selection and the second element of item to col_selection. From step 14, we know that item must be a two-item tuple.
If row_selection is an integer, reassign it as a one-element list of that integer.
Check whether col_selection is an integer. If it is, reassign to a one-element list of the string column name it represents.
If col_selection is a string, assign it to a one-element list of that string.
Now both row_selection and col_selection are lists. You will return a single-row, single-column DataFrame. This is different than pandas, which just returns a scalar value.
Write a for loop to iterate through each column in the col_selection list to create the new_data dictionary. Make sure to select just the row that is needed.
This for-loop will be used for the steps through 18 to return the desired DataFrame.
Run test_single_element to test.
In this step, we will again be selecting rows and columns simultaneously with df[rs, cs]. We will allow rs to be either a single-column boolean DataFrame, a list of integers, or a slice. For now, cs will remain either an integer or a string. The following selections will be possible after this step.
df[df['a'] < 10, 'b']
df[[2, 4, 1], 'e']
df[2:5, 3]
If row_selection is a DataFrame, raise a ValueError if it is not one column. Reassign row_selection to the values (numpy array) of its column. Raise a TypeError if it is not a boolean data type.
If row_selection is not a list or a slice raise a TypeError and inform the user that the row selection must be either an integer, list, slice, or DataFrame. You will not need to reassign row_selection for this case as it will select properly from a numpy array.
Your for-loop from step 15 should return the DataFrame.
Run test_all_row_selections to test.
The row_selection variable is now fully implemented. It can be either an integer, list of integers, a slice, or a one-column boolean DataFrame.
As of now, the col_selection can only be an integer or a string. In this step, we will handle the case when it is a list.
If col_selection is a list, create an empty list named new_col_selection. Iterate through each element of col_selection and check if it is an integer. If it is, append the string column name to new_col_selection. If not, assume it is a string and append it as it is to new_col_selection.
new_col_selection will now be a list of string column names. Reassign col_selection to it.
Again, your for-loop from step 15 will return the DataFrame.
Run test_list_columns to test.
In this step, we will allow our columns to be sliced with either strings or integers. The following selections will be acceptable.
df[rs, :3]
df[rs, 1:10:2]
df[rs, 'a':'f':2]
Where rs is any of the previously acceptable row selections.
Check if col_selection is a slice. Slice objects have start, stop, and step attributes. Define new variables with the same name to hold those attributes of the slice object.
If col_selection is not a slice raise a TypeError informing the user that the column selection must be an integer, string, list, or slice.
If start is a string, reassign it to its integer index amongst the columns.
If stop is a string, reassign it to its integer index amongst the columns plus 1. We add one here so that we include the last column.
start, stop, and step should now be integers. Use them to reassign col_selection to a list of all the column names that are to be selected. You'll use slice notation to do this.
The for-loop from step 15 will still work to return the desired DataFrame.
Run test_col_slice to test.
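The translation from a slice of labels to a list of column names might look something like this sketch. The helper name columns_from_slice is an assumption, and columns is taken to be a plain list of names.

```python
def columns_from_slice(columns, col_selection):
    """Turn a slice of integers or column names into a list of column names."""
    start = col_selection.start
    stop = col_selection.stop
    step = col_selection.step
    if isinstance(start, str):
        start = columns.index(start)
    if isinstance(stop, str):
        # Add 1 so that the stop column itself is included
        stop = columns.index(stop) + 1
    return columns[start:stop:step]


cols = ["a", "b", "c", "d", "e", "f"]
print(columns_from_slice(cols, slice("a", "f", 2)))  # ['a', 'c', 'e']
print(columns_from_slice(cols, slice(None, 3)))      # ['a', 'b', 'c']
print(columns_from_slice(cols, slice("b", "d")))     # ['b', 'c', 'd']
```

Missing start, stop, or step values stay as None, which list slicing already treats as "use the default".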
It is possible to get help completing column names when doing single-column selections. For instance, let's say we had a column name called 'state' and began making a column selection with df['s]. If we press tab right here, IPython can show us a dropdown list of all the column names beginning with 's'.
We do this by returning the list of possible values we want to see from the _ipython_key_completions_ method. Complete that method now.
Run test_tab_complete to test.
We will now have our DataFrame create a single new column or overwrite an existing one. Pandas allows for setting multiple columns at once, and even setting rows and columns simultaneously. Doing such is fairly complex and we will not implement those cases and instead focus on just single-column setting.
Python allows setting via the brackets with the __setitem__ special method. It receives two values when called, the key and the value. For instance, if we set a new column like this:
df['new col'] = np.array([10, 4, 99])
the key would be 'new col' and the value would be the numpy array.
If the key is not a string, raise a NotImplementedError stating that the DataFrame can only set a single column.
If value is a numpy array, raise a ValueError if it is not 1D. Raise a different ValueError if the length is different than the calling DataFrame.
If value is a DataFrame, raise a ValueError if it is not a single column. Raise a different ValueError if the length is different than the calling DataFrame. Reassign value to the underlying numpy array of the column.
If value is a single integer, string, float, or boolean, use the numpy repeat function to reassign value to be an array the same length as the DataFrame with all values the same. For instance, the following should work.
>>> df['new col'] = 85
Raise a TypeError if value is not one of the above types.
After completing the above, value will be a one-dimensional array. If its data type kind is the string 'U', change its type to object.
Finally, assign a new column by modifying the _data dictionary.
Run test_new_column to test.
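A partial sketch of how value could be normalized before being stored is shown below. The helper prepare_column and its messages are hypothetical, and the DataFrame branch is left out so that the example stays self-contained.

```python
import numpy as np


def prepare_column(value, n_rows):
    """Return a one-dimensional array of length n_rows built from value."""
    if isinstance(value, np.ndarray):
        if value.ndim != 1:
            raise ValueError("Setting array must be one-dimensional")
        if len(value) != n_rows:
            raise ValueError("Setting array must match the DataFrame length")
    elif isinstance(value, (int, str, float, bool)):
        # Broadcast a single value to every row
        value = np.repeat(value, n_rows)
    else:
        raise TypeError("Value must be an array or a single int, str, float, or bool")
    if value.dtype.kind == "U":
        value = value.astype("O")
    return value


print(prepare_column(85, 3).tolist())         # [85, 85, 85]
print(prepare_column("unknown", 2).tolist())  # ['unknown', 'unknown']
```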
The head and tail methods each accept a single integer parameter n which is defaulted to 5. Have them return the first/last n rows.
Run test_head_tail to complete this.
We will now implement several methods that perform an aggregation. These methods all return a single value for each column. The following aggregation methods are defined.
min
max
mean
median
sum
var
std
all
any
argmax - index of the maximum
argmin - index of the minimum
We will only be performing these aggregations column-wise and not row-wise. Pandas enables users to perform both row and column aggregations.
If you look at our source code, you will see all of the aggregation methods already defined. You will not have to modify any of these methods individually. Instead, they all call the underlying _agg method passing it the numpy function.
Complete the generic method _agg that accepts an aggregation function.
Iterate through each column of your DataFrame and pass the underlying array to the aggregation function. Return a new DataFrame with the same number of columns, but with just a single row, the value of the aggregation.
String columns with missing values raise a TypeError. Catch this error with an except clause and leave out of the result any column whose aggregation cannot be computed.
Defining just the _agg method will make all the other aggregation methods work.
All the aggregation methods have their own tests in a separate class named TestAggregation. They are all named similarly with 'test_' preceding the name of the aggregation. Run all the tests at once.
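The generic aggregation can be sketched on a plain dictionary of arrays as follows. The function name agg and the sample data are assumptions; a real implementation would wrap the result in a DataFrame.

```python
import numpy as np


def agg(data, aggfunc):
    """Apply aggfunc to each column and keep the columns where it succeeds."""
    new_data = {}
    for col_name, values in data.items():
        try:
            result = aggfunc(values)
        except TypeError:
            # For example, a string column that contains None
            continue
        # Wrap the scalar so that each column holds a single row
        new_data[col_name] = np.array([result])
    return new_data


data = {
    "height": np.array([2.5, 3.5, 1.0]),
    "name": np.array(["a", None, "c"], dtype="O"),
}
result = agg(data, np.min)
print(list(result))                # ['height']
print(result["height"].tolist())   # [1.0]
```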
The isna method will return a DataFrame the same shape as the original but with a boolean in place of every value, indicating whether that value is missing. Use np.isnan for all columns except string columns, where you can use a vectorized equality comparison to None.
Test with test_isna found in the TestOtherMethods class.
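A minimal sketch of the two cases, again assuming a plain dictionary of arrays in which string columns have the object data type:

```python
import numpy as np


def isna(data):
    """Return a dict of boolean arrays marking the missing values."""
    new_data = {}
    for col_name, values in data.items():
        if values.dtype.kind == "O":
            # Elementwise comparison; `is None` would test the array itself
            new_data[col_name] = values == None  # noqa: E711
        else:
            new_data[col_name] = np.isnan(values)
    return new_data


missing = isna(
    {
        "name": np.array(["a", None, "c"], dtype="O"),
        "score": np.array([1.5, 2.5, np.nan]),
    }
)
print(missing["name"].tolist())   # [False, True, False]
print(missing["score"].tolist())  # [False, False, True]
```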
The count method returns a single-row DataFrame with the number of non-missing values for each column. You will want to use the result of isna.
Test with test_count
This method will return the unique values for each column in the DataFrame. Specifically, it will return a list of one-column DataFrames of unique values in each column. If there is a single column, just return the DataFrame.
The reason we use a list of DataFrames is that each column may contain a different number of unique values. Use the unique numpy function.
Test with test_unique
Return a single-row DataFrame with the number of unique values for each column.
Test with test_nunique
Return a list of DataFrames, unless there is just one column and then just return a single DataFrame. Each DataFrame will be two columns. The first column name will be the name of the original column. The second column name will be 'count'. The first column will contain the unique values in the original DataFrame column. The 'count' column will hold the frequency of each of those unique values.
Use the numpy unique function with return_counts set to True. Return the DataFrames with sorted counts from greatest to least. Use the numpy argsort to help with this.
Use the test_value_counts test within the TestGrouping class.
We will modify the value_counts method to return relative frequencies. The value_counts method also accepts a boolean parameter normalize that by default is set to False. If it is True, then return the relative frequencies of each value instead.
Test with test_value_counts_normalize
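The core of both value_counts steps can be tried on a single made-up array. np.unique returns the values in sorted order, so the counts are reordered here with argsort on the negated counts to go from greatest to least.

```python
import numpy as np

values = np.array(["b", "a", "b", "c", "b", "a"])

uniques, counts = np.unique(values, return_counts=True)

# argsort is ascending, so negate the counts to sort greatest to least
order = np.argsort(-counts)
print(uniques[order].tolist())  # ['b', 'a', 'c']
print(counts[order].tolist())   # [3, 2, 1]

# Relative frequencies, as used when normalize is True; they sum to 1
relative = counts[order] / counts.sum()
```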
The rename method renames one or more column names. Accept a dictionary of old column names mapped to new column names. Return a DataFrame. Raise a TypeError if columns is not a dictionary.
Test with test_rename within the TestOtherMethods class
Accept a single string or a list of column names as strings. Return a DataFrame without those columns. Raise a TypeError if a string or list is not provided.
Test with test_drop
Update:
I updated the pandas_cub_final init file after this video to contain a more robust solution. The `round` method should ignore boolean columns. The original solution had each non-aggregation method work on boolean, integer, and float columns.
Original
There are several non-aggregation methods that function similarly. All of the following non-aggregation methods return a DataFrame that is the same shape as the original.
abs
cummin
cummax
cumsum
clip
round
copy
All of the above methods will be implemented with the generic _non_agg method. This method is sent the numpy function name of the non-aggregating method.
Pass only the boolean, integer, and float columns to this non-aggregating numpy function.
Keep the string columns (only other data type) in your returned DataFrame. Use the copy array method to make an independent copy of them.
Notice that some of these non-aggregating methods have extra keyword arguments. These are passed to _non_agg and collected with **kwargs. Make sure to pass them to the numpy function as well.
There is a different test for each method in the TestNonAgg class.
The diff method accepts a single parameter n and takes the difference between the current row and the row n positions before it. For instance, if a column has the values [5, 10, 2] and n=1, the diff method would return [NaN, 5, -8]. The first value is missing because there is no value preceding it.
The diff method is a non-aggregating method as well, but there is no direct numpy function that computes it. Instead, we will define a function within this method that computes this difference.
Complete the body of the func function.
Allow n to be either a negative or positive integer. You will have to set the first or last n values to np.nan. If you are doing this on an integer column, you will have to convert it to a float first as integer arrays cannot contain missing values. Use np.roll to help shift the data in the arrays.
Test with test_diff
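One way the inner function could work is sketched below as a standalone function with n passed explicitly; this signature is an assumption made so that the example runs on its own. np.roll wraps values around the ends, which is why the wrapped positions are overwritten with np.nan.

```python
import numpy as np


def diff(values, n=1):
    """Difference between each value and the value n positions before it."""
    # Convert to float so that the array can hold np.nan (this also copies)
    values = values.astype("float64")
    shifted = np.roll(values, n)
    result = values - shifted
    if n >= 0:
        result[:n] = np.nan  # the first n rows have no earlier row
    else:
        result[n:] = np.nan  # the last |n| rows have no later row
    return result


print(diff(np.array([5, 10, 2]), n=1).tolist())   # [nan, 5.0, -8.0]
print(diff(np.array([5, 10, 2]), n=-1).tolist())  # [-5.0, 8.0, nan]
```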
The pct_change method is nearly identical to the diff method. The only difference is that this method returns the percentage change between the values and not the raw difference. Again, complete the body of the func function.
Test with test_pct_change
All the common arithmetic and comparison operators will be made available to our DataFrame. For example, df + 5 uses the plus operator to add 5 to each element of the DataFrame. Take a look at some of the following examples:
df + 5
df - 5
df > 5
df != 5
5 + df
5 < df
All the arithmetic and comparison operators have corresponding special methods that are called whenever the operator is used. For instance __add__ is called when the plus operator is used, and __le__ is called whenever the less than or equal to operator is used. See the full list in the documentation.
Each of these methods accepts a single parameter, which we have named other. All of these methods call a more generic _oper method which you will complete.
Within the _oper method check if other is a DataFrame. Raise a ValueError if this DataFrame is not one column. Otherwise, reassign other to be a 1D array of the values of its only column.
If other is not a DataFrame do nothing and continue executing the rest of the method. We will not check directly if the types are compatible. Instead we will pass this task onto numpy. So, df + 5 should work if all the columns in df are booleans, integers, or floats.
Iterate through all the columns of your DataFrame and apply the operation to each array. You will need to use the getattr function along with the op string to retrieve the underlying numpy array method. For instance, getattr(values, '__add__') returns the method that uses the plus operator for the numpy array values. Return a new DataFrame with the operation applied to each column.
Run all the tests in class TestOperators
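The getattr lookup can be tried on a plain dictionary of arrays. The function name oper and the sample data are assumptions for this sketch.

```python
import numpy as np


def oper(data, op, other):
    """Apply the array special method named by op to every column."""
    new_data = {}
    for col_name, values in data.items():
        method = getattr(values, op)  # e.g. values.__add__
        new_data[col_name] = method(other)
    return new_data


data = {"a": np.array([1, 2, 3]), "b": np.array([0.5, 1.5, 2.5])}
print(oper(data, "__add__", 5)["a"].tolist())  # [6, 7, 8]
print(oper(data, "__gt__", 2)["b"].tolist())   # [False, False, True]
```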
This method will sort the rows of the DataFrame by one or more columns. Allow the parameter by to be either a single column name as a string or a list of column names as strings. The DataFrame will be sorted by this column or columns.
The second parameter, asc, will be a boolean controlling the direction of the sort. It is defaulted to True indicating that sorting will be ascending (lowest to greatest). Raise a TypeError if by is not a string or list.
You will need to use numpy's argsort to get the order of the sort for a single column and lexsort to sort multiple columns.
Run the following tests in the TestMoreMethods class.
test_sort_values
test_sort_values_desc
test_sort_values_two
test_sort_values_two_desc
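Both numpy functions return the row order rather than the sorted values, as this sketch on made-up data shows. Note that lexsort treats the last key it is given as the primary sort key.

```python
import numpy as np

data = {
    "team": np.array(["b", "a", "b", "a"]),
    "score": np.array([3, 9, 1, 4]),
}

# Single column: argsort gives the row order
order_one = np.argsort(data["score"])
print(order_one.tolist())  # [2, 0, 3, 1]

# Two columns: sort by team first, then by score within each team
order_two = np.lexsort((data["score"], data["team"]))
print(order_two.tolist())  # [3, 1, 2, 0]

# Apply the order to every column; reverse it for a descending sort
sorted_data = {col: values[order_two] for col, values in data.items()}
print(sorted_data["score"].tolist())  # [4, 9, 1, 3]
```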
This method randomly samples the rows of the DataFrame. You can either choose an exact number to sample with n or a fraction with frac. Sample with replacement by using the boolean replace. The seed parameter will be used to set the random number seed.
Raise a ValueError if frac is not positive and a TypeError if n is not an integer.
You will be using numpy's random module to complete this method. Within it are the seed and choice functions. The latter function has a replace parameter that you will need to use. Return a new DataFrame with the new random rows.
Run test_sample to test.
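The row selection could be sketched like this. The helper sample_rows is hypothetical and returns only the sampled row positions; indexing each column array with them would produce the new DataFrame.

```python
import numpy as np


def sample_rows(n_rows, n=None, frac=None, replace=False, seed=None):
    """Return randomly chosen row positions from range(n_rows)."""
    if seed is not None:
        np.random.seed(seed)
    if frac is not None:
        if frac <= 0:
            raise ValueError("`frac` must be positive")
        n = int(frac * n_rows)
    if not isinstance(n, int):
        raise TypeError("`n` must be an integer")
    return np.random.choice(np.arange(n_rows), size=n, replace=replace)


rows = sample_rows(10, n=4, seed=1)
print(len(rows))  # 4
```

Setting the same seed before each call makes the sample reproducible, which is what lets a unit test check the result.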
This is a complex method to implement. See full description here (https://github.com/tdpetrou/pandas_cub#37-pivot_table-method)
Build a Data Analysis Library from Scratch in Python targets those that have a desire to immerse themselves in a single, long, and comprehensive project that covers several advanced Python concepts. By the end of the project you will have built a fully-functioning Python library that is able to complete many common data analysis tasks. The library will be titled Pandas Cub and have similar functionality to the popular pandas library.
This course focuses on developing software within the massive ecosystem of tools available in Python. There are 40 detailed steps that you must complete in order to finish the project. During each step, you will be tasked with writing some code that adds functionality to the library. In order to complete each step, you must pass the unit-tests that have already been written. Once you pass all the unit tests, the project is complete. The nearly 100 unit tests give you immediate feedback on whether or not your code completes the steps correctly.
There are many important concepts that you will learn while building Pandas Cub.
Creating a development environment with conda
Using test-driven development to ensure code quality
Using the Python data model to allow your objects to work seamlessly with builtin Python functions and operators
Build a DataFrame class with the following functionality:
Select subsets of data with the brackets operator
Aggregation methods - sum, min, max, mean, median, etc...
Non-aggregation methods such as isna, unique, rename, drop
Group by one or two columns to create pivot tables
Specific methods for handling string columns
Read in data from a comma-separated value file
A nicely formatted display of the DataFrame in the notebook
It is my experience that many people will learn just enough of a programming language like Python to complete basic tasks, but will not possess the skills to complete larger projects or build entire libraries. This course intends to provide a means for students looking for a challenging and exciting project that will take serious effort and a long time to complete.
This course is taught by expert instructor Ted Petrou, author of Pandas Cookbook, Master Data Analysis with Python, and Master the Fundamentals of Python.