
This course includes our updated coding exercises so you can practice your skills as you learn.
Welcome to Data Analysis with Polars in Python. Polars is a data analysis library written in the Rust programming language with support for Python bindings. In this lesson, we introduce the features and functionalities of the library. We also discuss our setup steps which involve installing the uv Python manager, downloading the data sets, reviewing Python, and then getting started with Polars.
Download the datasets and Jupyter Notebooks for the course from GitHub.
Welcome to a sequence of videos dedicated to installing Python and Polars on a macOS computer. First up, we'll need to get acquainted with the Terminal, a command-line interface where the user can issue text commands to the operating system. We practice with the pwd, ls, and cd commands.
In this lesson, we install the uv command-line program for managing Python projects. uv will help us download Python, Polars, and JupyterLab (our coding environment).
In this lesson, we download the Jupyter Notebooks and datasets from the course's public repository on GitHub. We also use the uv sync command to set up Python, Polars, JupyterLab, and all other project dependencies within our course folder.
Welcome to a sequence of videos dedicated to installing Python and Polars on a Windows computer. First up, we'll need to get acquainted with the PowerShell, a command-line interface where the user can issue text commands to the operating system. We practice with the pwd, ls, and cd commands.
In this lesson, we install the uv command-line program for managing Python projects. uv will help us download Python, Polars, and JupyterLab (our coding environment).
In this lesson, we download the Jupyter Notebooks and datasets from the course's public repository on GitHub. We also use the uv sync command to set up Python, Polars, JupyterLab, and all other project dependencies within our course folder.
In this lesson, we discuss the startup and shutdown process for JupyterLab. We execute the uv run jupyter-lab command from the Terminal. We work within a Jupyter Notebook, then save our work, then shut down the Python kernel for the Notebook, and finally close the Jupyter Lab server.
In this lesson, we introduce the interface of JupyterLab, including how to create cells, delete cells, execute cells, restart the kernel, and more.
In this lesson, we configure our JupyterLab settings to run the Ruff formatter upon every cell's execution. Ruff will format the code to ensure a consistent (and pretty!) aesthetic standard for our Python code.
Use the import keyword to bring libraries like Polars into the Jupyter Notebook. We can assign an alias (an alternate name) to a library with the as keyword. The popular community convention for Polars is pl.
A comment is a line of code ignored by the Python interpreter. We create a comment with a hashtag (#), which effectively disables the line. Developers use comments to provide documentation, metadata, diagrams, and more.
In this lesson, we review the primitive data types in Python: integers, floating-points, strings, Booleans, and the None object. We also introduce operators, symbols that perform operations on the values.
In this lesson, we expand our study of operators by introducing various symbols for addition, subtraction, multiplication, two types of division, exponentiation, modulo, and more.
The equality operator (==) compares whether two values are equal/identical. The complementary inequality operator (!=) confirms that two values are not equal. In this lesson, we practice deriving Booleans from various equality comparisons.
A variable is a name for a value in the program. It serves as a placeholder for the value that also provides context on what the value represents. In this lesson, we declare variables and reassign values to them.
A function is a reusable procedure, a sequence of steps that execute in order. It can accept inputs (parameters) and produce an output (return value). In this lesson, we invoke some of Python's top-level functions including len, int, float, and type.
Custom functions can encapsulate reusable business logic. In this lesson, we define a convert_to_fahrenheit function that converts a Celsius temperature to Fahrenheit. We define a parameter and produce a return value.
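As a minimal sketch of the function described above (the parameter name celsius_temp and the sample inputs are assumptions, not taken from the course), the conversion multiplies by 9/5 and adds 32:

```python
def convert_to_fahrenheit(celsius_temp):
    """Convert a Celsius temperature to Fahrenheit."""
    # Standard formula: F = C * 9/5 + 32
    return celsius_temp * 9 / 5 + 32


boiling_point = convert_to_fahrenheit(100)
freezing_point = convert_to_fahrenheit(0)

print(boiling_point)   # 212.0
print(freezing_point)  # 32.0
```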
A method is a function attached to an object. An object is just a data value in our program. We invoke methods with a dot, the method name, and a pair of parentheses. Like functions, methods can accept arguments and produce a return value. In this lesson, we practice with string methods like upper, lower, strip, and startswith. We also introduce the in keyword to check for inclusion.
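The string methods named above can be tried on any text value. The sample string below is made up for illustration:

```python
greeting = "  Hello, Polars!  "

# strip removes leading and trailing whitespace and returns a new string.
trimmed = greeting.strip()

print(trimmed.upper())              # HELLO, POLARS!
print(trimmed.lower())              # hello, polars!
print(trimmed.startswith("Hello"))  # True

# The in keyword checks whether one string appears inside another.
print("Polars" in trimmed)          # True
```

Each method returns a new string; the original greeting value is left unchanged.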
A list is a mutable data structure for storing elements in order. We declare it with a pair of square brackets ([]). The length of a list is a count of its elements. In this lesson, we instantiate lists and practice adding and removing elements from them.
An index position is the number that marks an element's place in line within a list. The index starts counting from 0. In this lesson, we practice extracting elements/characters from lists/strings using their index position.
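A short sketch of list mutation and index access, using a made-up list of snacks:

```python
snacks = ["chips", "pretzels"]

# Lists are mutable: append adds to the end, remove deletes the first match.
snacks.append("popcorn")
snacks.remove("pretzels")
print(snacks)       # ['chips', 'popcorn']
print(len(snacks))  # 2

# Index positions start at 0; negative positions count from the end.
first_snack = snacks[0]
last_snack = snacks[-1]

# Strings support the same index syntax for individual characters.
first_letter = first_snack[0]
print(first_snack, last_snack, first_letter)  # chips popcorn c
```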
A tuple is an immutable collection of zero or more elements in sequence. We declare tuples by separating multiple values with commas. The community convention is to wrap the tuple in parentheses.
A dictionary is a collection of key-value pairs that we access by key rather than by position. A key serves as a unique identifier for a value. In this lesson, we practice creating dictionaries, as well as reading and writing key-value pairs.
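A small illustration of reading and writing key-value pairs. The menu items and prices are invented for the example:

```python
menu = {"coffee": 3.5, "tea": 2.75}

# Read a value by its key.
coffee_price = menu["coffee"]
print(coffee_price)  # 3.5

# Assigning to a new key adds a pair; assigning to an existing key overwrites it.
menu["muffin"] = 4.0
menu["tea"] = 3.0

print(len(menu))       # 3
print("tea" in menu)   # True
print(menu["tea"])     # 3.0
```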
A class is a blueprint for creating one or more objects (digital data structures). We provide an analogy for blueprints and houses in the real world. We also discuss how Python offers shortcuts for creating common objects like lists, dictionaries, and strings.
A module is a Python file that holds code (classes, functions, constants, etc). Developers use modules to organize related code. We can import modules into our Jupyter Notebook with the import keyword to gain access to additional functionalities within the language. In this lesson, we practice bringing in functionality from the datetime module, which has classes for working with temporal data (datetimes, dates, etc).
The import keyword can import both modules and libraries within the Python ecosystem. In this lesson, we import the polars library for data analysis and assign it the alias pl.
In this lesson, we discuss the different types of integers available in Rust, including unsigned integers (zero or positive values) and signed integers (negative or positive values). We also introduce the smallest unit of memory, the bit.
In this section, we'll start exploring the Series, a one-dimensional column of ordered, homogeneous data (data of the same type). We kick things off by importing Polars and exploring how we can see its version.
Let's create a few Series! In this lesson, we use the pl.Series constructor to accept a list of values and instantiate a Chips Series. We also explore the name and values parameters to customize the Series name and the source of data.
Polars can be strict in its analysis of data. In this lesson, we use the dtype parameter of the Series constructor to customize the data type of the Series' values. We introduce the data types available at the top-level of the Polars library. We also use the strict parameter to ask Polars to be more permissive with mismatches in data types.
An attribute is a piece of data/information that lives on an object. We access an attribute with dot syntax, then the attribute name. In this lesson, we access some Series attributes including name, dtype, and shape.
Polars uses a null value to represent missing data. null is the equivalent of Python's None or pandas's NaN. Most operations on null values will produce null values. We also discuss the differences between null and not a number (NaN) values.
A method is a functionality available on an object. We invoke a method with a dot, the method name, and a pair of parentheses. Methods can accept arguments and produce a return value. In this lesson, we use the alias method to rename a column/Series.
It's time to bring in some data from the outside world! In this lesson, we use the read_csv function to bring in a comma-separated values (CSV) file of data. We also discuss how a CSV text file stores its data. Polars imports a CSV as a 2-dimensional DataFrame, so we also discuss how to convert it to a Series.
In this lesson, we introduce the head and tail methods to extract a specified number of rows from the beginning and end of a Polars data structure.
In this lesson, we introduce the schema and schema_overrides parameters to the read_csv function. They allow the developer to customize the inferred data types of columns. Both parameters accept dictionary arguments. The schema parameter requires a complete mapping of columns to desired data types; the schema_overrides parameter only needs the columns that will replace Polars' default inferred types.
In this lesson, we use the sort method to sort a Series in both ascending and descending order. We discuss how Polars sorts columns of numeric values vs. strings. Most methods in Polars will return a new object rather than mutate the existing one.
In this lesson, we introduce common Series methods for mathematical operations including sum, mean, len, count, null_count, max, min, product, and more.
In this lesson, we introduce several Series methods for rounding including ceil (round up), floor (round down), and round (round up or down depending on proximity).
For data analysts experienced with Pandas, this lesson offers a comparison between the Series object in Pandas vs. Polars.
A DataFrame is a 2-dimensional table consisting of rows and columns. In this lesson, we discuss its basic mechanics with a simple example.
In this lesson, we instantiate a DataFrame from scratch by passing a dictionary to the constructor. The keys serve as column names, and the values are lists of data to populate the columns.
In this lesson, we review the pl.read_csv function to read in a DataFrame from a CSV file. We practice shared methods like head and tail, then introduce DataFrame-specific attributes like columns and dtypes.
Unlike in Pandas, a Polars DataFrame does not have an index. In this lesson, we show how to create an index from scratch if you'd like an experience closer to Pandas.
An expression is a building block, a step in a reusable computation that will be executed at a later point. In this lesson, we build up two expressions with the pl.col function. The first targets a column, and the second calculates the mean of its values.
The select method executes one or more expressions, returning the computed columns in a new DataFrame. In this lesson, we pass the expression objects from the previous lesson to select and observe the results!
In this lesson, we get reacquainted with the alias method to rename a column. The alias method is critical because the select method will throw an error if multiple columns end up with the same name.
In this lesson, we explore the variations of the arguments we can pass to the select method. We expand on the pl.col syntax to show the different inputs it can accept including multiple strings and lists of strings.
The pl.col function is even more flexible than we thought! In this lesson, we target DataFrame columns by their data types.
An expression is not coupled to a specific DataFrame or a data type. In this lesson, we apply the same expression to two different DataFrames to prove this point.
In this lesson, we introduce methods for counting the number of present and missing values within a column. We also show off the helpful describe method for generating a summary of various stats about the DataFrame.
In this lesson, we practice using the item and slice methods to pull out one or multiple rows from the DataFrame. We also show off the slice method's flexibility with negative values!
A Polars DataFrame supports Python's list slicing syntax (although the development team advises against it!). In this lesson, we review the syntax option as well as the shortcuts to pull from the beginning of the DataFrame and to the end of the DataFrame.
In this lesson, we introduce a complementary approach, the get method on an expression, to extract the row values for one or more columns at a specific row index.
If you're looking to locate a value at the numeric intersection of a row and column, the item method can help you accomplish that! We'll learn about it in this lesson.
In this lesson, we use the gather method to pull out multiple rows by index position and the gather_every method to pull out rows at a consistent interval.
The sample method extracts a random collection of rows from the DataFrame. In this lesson, we practice using it to target both a fixed number of rows and a total percentage of rows.
In this lesson, we use the cast method to convert the values in a column from one data type to another. We review how choosing a smaller numeric data type can reduce the memory footprint of the data structure.
In this lesson, we review the schema_overrides and schema parameters to customize the data types of the columns in a DataFrame. Both parameters accept dictionaries, but schema requires the complete mapping of columns to types while schema_overrides only needs the columns where we want to replace Polars' default inference.
In this lesson, we use both the alias and rename methods to name columns within a new DataFrame. We also discuss how to alter column names at the point of dataset import.
Polars nests additional expression methods under attributes/namespaces. The name attribute/namespace holds methods for adjusting column names. In this lesson, we introduce methods for changing the casing of column names and concatenating a string to the beginning/end of each column name.
In this lesson, we use the drop method to remove one or more columns from a DataFrame. We also review the different options for creating an expression targeting multiple columns.
In this lesson, we learn about the replace method, which swaps the values in a column. We demonstrate two syntax options, parameters and dictionaries, to specify old values and replacement values.
Time to math! In this lesson, we tackle the symbols and methods for common mathematical operations including addition, subtraction, division, multiplication, exponentiation, and remainder. We also learn the equals method for comparing the equality of DataFrames.
We can use multiple columns within an expression! In this lesson, we calculate the product of elements across two columns. We also discuss how Polars handles types in calculations.
In this lesson, we introduce a family of methods for cumulative operations (tallying the value up until the current row in the DataFrame).
The with_columns method creates a new DataFrame that keeps all existing columns and adds new columns from the expressions on the right side. It allows us to keep our old work and expand on it with new calculations. This is a powerful method!
In this lesson, we review two top-level Polars functions for creating expressions that target multiple columns: all for targeting all columns and exclude for targeting all columns except for the ones specified.
Polars uses the null keyword to represent a missing value. In this lesson, we cover the first strategy for dealing with null values: replacing them! We use both constants and forward/backward strategies to populate the missing data. Don't miss out!
Interpolation replaces missing values using linear interpolation, which draws a straight line between two values and fills in the gaps along that line. It's a convenient way to fill missing gaps in data that follows a linear pattern.
The other option for missing data is to remove it entirely. The drop_nulls method removes null values from the target column. We talk about the limitations of the method when combined with the with_columns method.
Sorting changes the order of rows based on one or more columns' values. In this lesson, we invoke the sort method to sort columns with a variety of different data types.
We may want to sort within a group of equal values. In this lesson, we show how to sort a DataFrame by multiple columns. We also pass the descending parameter to customize the sort order per column.
In this lesson, we expand on the sorting concepts by passing the descending parameter to customize the sort order per column. Unlike Pandas, Polars requires a list with a length equal to the number of sorted columns.
Length can be tricky! In this lesson, we discuss the differences between characters and bytes. We also introduce the complementary len_bytes and len_chars methods underneath the str namespace.
We can sort a column using the results of another expression! In this lesson, we sort a DataFrame column using the lengths of its string values.
top_k and bottom_k are convenience methods to extract a specific number of rows with the largest/smallest values in a given column.
The rank method assigns each row value a position in line based on its numeric ranking.
The shuffle method randomizes the order of elements in a column. If you call this a totally random lesson, you'd be right!
In this lesson, we learn the n_unique method for counting the number of distinct values in a column and the unique method to pull them out. We explore these methods on expressions as well as the top-level Polars library.
The value_counts method counts the number of occurrences of each unique value. It returns a column of structs, a data structure consisting of key-value pairs that is comparable to a Python dictionary. We discuss how to extract the struct's contents into separate columns.
In this lesson, we introduce the coffee_sales dataset that we'll use throughout this section. It is a collection of transactions from a coffee chain with a wide variety of data types.
This lesson introduces the filter method, which is the primary way to extract rows that satisfy a condition. We'll explore how the filter method relies on expressions that produce Boolean values.
This lesson walks through filtering rows using mathematical comparison operators like >, <, ==, and !=. Polars applies the logical comparison operation on every row value to produce a Boolean column.
This lesson covers how to filter rows based on missing data. You’ll learn to use methods like is_null and is_not_null instead of direct equality checks with Python's None object.
In this lesson, we filter directly on boolean columns. This pattern is common when working with precomputed flags or conditions stored in the DataFrame.
In this lesson, we combine multiple filter conditions using the logical AND operator ( & ). You’ll see how chaining conditions allows for more precise row selection.
This lesson shows an alternative approach for filtering using keyword arguments. It provides a concise alternative when filtering on exact column values (although it is not recommended by the Polars team!).
In this lesson, we introduce the complementary OR operator ( | ) for filtering rows that match one of several conditions.
This lesson explains how operator precedence affects filter expressions. You’ll learn when parentheses are required to ensure Polars evaluates conditions correctly.
The next operator in line is XOR ( ^ ), which ensures that one condition is true but not the other one.
This lesson covers filtering rows based on uniqueness or duplication. Polars identifies a value as a duplicate if it occurs more than once in the column.
In this lesson, we filter rows using datetime/temporal values. We use Python's native datetime module to create the datetime objects to compare row values against.
This lesson introduces is_between, which simplifies filtering for values that fall within a range. The method accepts the lower and upper bounds of the interval, which are both inclusive by default; the closed parameter changes which bounds are included.
The is_in method filters rows based on inclusion in a list of values. The method offers a shortcut to declaring multiple OR conditions.
This lesson shows off the remove method, which excludes rows that match a given condition. It’s conceptually the inverse of the filter method.
In this lesson, we apply logical negation using the tilde (~) operator. Trues become Falses, and Falses become Trues. The operator allows you to elegantly invert filter conditions.
In this lesson, we introduce conditional logic using the when, then, and otherwise methods. These methods are designed to be chained in sequence and they model the if/else if/else paradigm from programming languages.
This final lesson shows how to partition/split a DataFrame into multiple subsets based on a filter condition. The partition method is a nice prerequisite to the groupby object that we'll introduce later in the course.
Welcome to the Joins section. A join merges two DataFrames based on shared values across specified columns. In this lesson, we demonstrate the datasets for our fictional movie streaming service and discuss their logical relationships.
An inner join matches rows with equal values in both DataFrames. Polars will exclude a key if it does not exist in the other DataFrame. In this lesson, we join the users and watch_history DataFrames, looking for the user IDs that are found in both tables.
The on parameter specifies the column whose values will be compared across the two DataFrames to be joined. In this lesson, we cover the complementary left_on and right_on parameters for when the column names differ.
A full join merges two DataFrames, joining rows where there is a match on values but also keeping rows where there is no match. In this lesson, we perform a full join on the users and plans DataFrames to identify both the orphan users and orphan plans across the datasets.
A left join keeps all the records from the left DataFrame and merges matching rows (where possible) from the right DataFrame. Polars will substitute null for values in the right DataFrame's columns when there is no match.
A semi join keeps only the left DataFrame rows that have a match in the right DataFrame. However, Polars does not concatenate the right DataFrame's columns to the new DataFrame. The join is closer to a filter operation than a proper join.
In this lesson, we introduce the complementary join to a semi join, the anti join. An anti join keeps the left DataFrame rows that do not have a match in the right DataFrame. We join the users and support DataFrames to identify the users who did not file a ticket/complaint.
A cross join matches every row from the left DataFrame with every row from the right DataFrame. The strategy is called a Cartesian product. The resulting DataFrame's length will be equal to the product of the two DataFrames' lengths.
Polars can join DataFrames based on matching values across multiple columns. Values must match across both columns in order for the rows to be paired together in the joined DataFrame.
The validate parameter to the join method asserts on the uniqueness of the join keys in both DataFrames. Think of validate as a safety check before the join. In this lesson, we explore the syntax for specifying unique join keys and multiple join keys in the left and right DataFrames.
The join_asof method matches values on the nearest match rather than an exact match. It is ideal for timeseries data, when we care about proximity rather than perfect equality.
The tolerance parameter sets the constraint/boundary by which the join_asof match can occur in the given search direction. In this lesson, we explore how setting a different time window affects the results of joining our outages and uptime_checks DataFrames.
Some datasets require a join by exact keys before performing an approximate match. In this lesson, we expand the join_asof method to apply the by parameter to designate the exact join column between two joined DataFrames.
Concatenation stacks/glues two DataFrames together in a specified direction. We kick this section off by performing vertical concatenation, which adds the second DataFrame's rows to the end of the first DataFrame.
Horizontal concatenation merges the second DataFrame's columns on the right side of the first DataFrame.
In this lesson, we practice diagonal concatenation, which adds both rows and columns to the end of the first DataFrame. Diagonal concatenation expands a DataFrame in both height (rows) and width (columns).
Align concatenation joins rows together based on shared column values, then performs a diagonal concatenation. When there are no matches, Polars fills the missing cells with null.
Relaxed concatenation is a less strict form of concatenation that coerces columns to their supertypes. The supertype is a type with the capacity to model all of the original types. In this lesson, we explore different arguments to the how parameter to make our concatenations relaxed.
Rechunking is the process of merging multiple chunks of data together so that it is stored contiguously in memory. Rechunking requires an upfront cost (Polars must copy data) but improves performance in future queries. In this lesson, we introduce the rechunk method and the n_chunks method for seeing how many chunks each column occupies in memory.
The pl.concat function requires the complete list of DataFrames to merge upfront. In this lesson, we introduce the vstack method on a DataFrame, which allows us to concatenate one DataFrame at a time.
In this lesson, we introduce the extend method for concatenation. We also compare it to the vstack method, including each method's capacity for rechunking.
In this quick lesson, we cover the complementary hstack method to horizontally concatenate a DataFrame on the right side of another.
Wide and long describe two ways of organizing data in a table. Wide DataFrames store the same variable across multiple columns. They expand horizontally with more data. Long DataFrames store each variable in a single column. They expand vertically with more data.
In this lesson, we use the unpivot method to transform a DataFrame from a wide format to a long format. This is equivalent to the melt method in Pandas.
Next up is the pivot method, which converts a long DataFrame into a wide DataFrame. The distinct values from a column become the new column headers, and Polars spreads out the values across the correct intersection of index and column.
A pivot table reshapes data by turning unique values into new rows or columns, then summarizing corresponding values with an aggregation operation. In this lesson, we practice with simple operations like pulling out the first and last value for each intersection of row and column.
The aggregate functions from the previous lesson chose one value from a set of possible values. In this lesson, we introduce additional functions that perform aggregate operations across all values.
The transpose method swaps the axes of a DataFrame. The column headers become row entries, and the row values become column headers. We also discuss some additional parameters to ensure all data is brought over.
Polars has two collection types: the array and the list. Each row in a list column stores a homogeneous collection of zero or more elements in order. In this lesson, we practice creating a list column from scratch.
This lesson offers a more realistic way you might arrive at a list column: the str.split method, which splits a string based on every occurrence of a delimiter.
Polars nests list operations underneath a list attribute/namespace. In this lesson, we explore some convenience methods to calculate the lengths of the lists and pull out one or multiple elements from each list.
In this lesson, we review the sort method on a DataFrame and contrast it with the list.sort method on a column of lists.
The list.explode method creates a row entry for every list value. It is sometimes called a "flatten" operation; it creates a one-dimensional sequence of values from a collection of nested lists.
In this lesson, we practice exploding multiple columns to find every combination of values across two columns of lists.
In this lesson, we introduce more methods underneath the list namespace, including mathematical operations like sum, max, min, and mean.
In this lesson, we utilize the list.eval method to map each list element to a new value. We combine it with the pl.element function to perform a comparison on every list element.
In this lesson, we introduce three top-level Polars functions for concatenating column values: pl.format, pl.concat_str, and pl.concat_list. We also cover the list.join method for concatenating the contents of a string list with a separator.
An array is near identical to a list; it's an ordered container for elements. The difference is that each row's array must be of the same length. If this condition can be met, columns of arrays will be more performant than columns of lists.
In this lesson, we review the familiar methods we covered earlier underneath the list attribute, but now under the complementary arr attribute for array columns.
Welcome to the most comprehensive Polars course on Udemy!
Data Analysis with Polars and Python offers 22+ hours of in-depth video tutorials on the powerful Polars data analysis library. The course also includes a wide collection of datasets, quizzes, and coding challenges to aid your learning.
Why Polars?
The core of Polars is written in Rust, one of the fastest programming languages in the world. At the same time, the library enables us to write our code in Python, the most popular language in the world. We gain the best of both worlds -- the speed and efficiency of Rust and the simplicity and elegance of Python.
Who is this Course For?
The course is designed for learners of all skill levels, from experienced data analysts to students who have never programmed before. Lessons include:
installing Python and Polars on your computer
understanding the core mechanics of Python
working with the Jupyter Lab coding environment
Whether you've spent time in spreadsheet software like Microsoft Excel/Google Sheets or in another data analysis library like Pandas, Polars can help take your data analysis skills to the next level.
What Topics Will We Cover?
We'll cover the core objects of Polars including:
Series
DataFrames
LazyFrames
Most of our work will focus on the DataFrame, a 2-dimensional table of rows and columns. We'll cover data manipulation operations including:
sorting
filtering
grouping
aggregating
de-duplicating
pivoting
deleting
joining
replacing
working with text data
working with temporal/datetime data
We'll also cover some of Polars' unique column data types including:
lists
arrays
structs
and more!
Data Analysis with Polars and Python
I'm excited to share everything I've learned about Polars, a powerful library that is quickly emerging as a dominant competitor in Python's data science ecosystem. I look forward to seeing you in the course!