
Explore the core pandas data structures, starting with series and dataframes, to build a solid foundation and prepare for advanced data analysis across the bootcamp.
Master pandas for advanced data analysis in Python by learning its fast, flexible data manipulation capabilities, its dependency on numpy and matplotlib, and starting in a Python environment.
Learn to install and use the Anaconda data science distribution, manage environments with the Navigator, and work with pandas and Jupyter Notebook, plus compare Google Colab as a cloud alternative.
Launch a Jupyter notebook from the Anaconda Navigator and learn to run Python code in an interactive environment with code and markdown cells, using shortcuts like Shift+Enter.
Compare cloud versus local data science setups using Anaconda or Miniconda, Jupyter notebooks, and Google Colab, highlighting free GPU access and cross-platform code applicability.
Master pandas by understanding that Python runs via an interpreter, with CPython as the standard implementation, and use Appendix A to cover fundamentals and data types.
Explore NumPy, the numerical Python library behind pandas, by learning about ndarrays, fast ufuncs, and contiguous memory storage that enables performance.
Explore the fundamentals of Pandas series, including their attributes and methods, and learn data selection and indexing techniques that underpin the rest of the course.
Learn how Pandas Series are one-dimensional labeled arrays that store values of any data type, built from Python lists with the Series constructor, and support mixed types.
Discover the difference between parameters and arguments and see how Python passes data to functions and Pandas constructors, using the data parameter and actual arguments with concrete examples.
Explore how pandas series derive labels from Python lists or dictionaries, compare list and dictionary inputs, and understand automatic integer indexing when labels are absent.
Pandas infers a series dtype from data and lets you specify a dtype manually, such as floats for numbers, while strings yield object dtype; use type to inspect data type.
Discover how NumPy's fixed-size arrays influence Pandas, showing why strings become dtype('O') and how Pandas stores string data as pointers rather than actual values.
Explore how pandas automatically aligns data by the index, and learn to create custom labels for series using the index parameter, including range index, start-stop semantics, and immutability for performance.
Name a Pandas series to assign a readable label, explore the name attribute, and see how a series and its index names become column names and labels in data frames.
Create a four-item list of actor names and a corresponding ages list, then build a pandas series with ages labeled by actor names.
Create a labeled pandas series of actor ages from Python lists, using the long form with keyword arguments, labeling ages by actor names, then explore dictionary and zip methods.
Explore dictionary comprehension to pair actor names with ages from zip outputs, creating a dict suitable for pandas series, a concise, pythonic pattern beyond loops.
Explore data quickly with the head and tail methods on pandas series, learning how to preview the first or last records, and control output size with the n parameter.
Access and extract values from a pandas series by index position using square bracket indexing, zero-based indexes, and slices. Explore using negative indices and custom labels to retrieve items efficiently.
Learn to access items in a pandas series by labeled index using square brackets, contrasting label-based with position-based indexing. See how a labeled alphabet demonstrates inclusive slicing by label.
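A minimal sketch of the contrast described above, assuming a hypothetical alphabet series whose labels are lowercase letters (the data and variable names are illustrative, not taken from the course notebooks):

```python
import pandas as pd
from string import ascii_lowercase

# Illustrative data: the integers 0..25 labeled 'a'..'z'.
alphabet = pd.Series(range(26), index=list(ascii_lowercase))

# Label-based slice with loc: both endpoints are included.
by_label = alphabet.loc["b":"d"]

# Position-based slice with iloc: the stop position is excluded.
by_position = alphabet.iloc[1:3]

print(by_label.index.tolist())     # ['b', 'c', 'd']
print(by_position.index.tolist())  # ['b', 'c']
```

The label slice returns three items while the positional slice returns two, which is the practical consequence of inclusive versus exclusive endpoints.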
Learn how to use pandas add_prefix and add_suffix to modify index labels for a series or column labels for a dataframe, with copies and reassignment to apply the changes.
Learn how dot notation cleanly accesses label values in a pandas series, but beware its limitations with slices and invalid identifiers.
Learn how boolean masks enable label-based extraction with the loc indexer and square brackets, ensuring the mask length matches the series to selectively return items.
Explore iloc, the integer locator for position-based indexing with zero-based indexing, including positional slices and index lists. Learn how square brackets reflect either positions or labels like iloc or loc.
Explore indexing pandas series with callables using .loc and .iloc to emit labels, positions, slices, or boolean masks. Build functions that take a series and return the appropriate indexing output.
Learn how the get method retrieves values from a series by label, supports a custom default when a label is missing, and can index by position like square bracket indexing.
Master pandas data selection using label with square brackets and loc, or by position with iloc, slices, boolean masks, and callables, with loc favored for labels and iloc for positions.
Create a 100-item series of squares, then extract the last three items by indexing and by tail, and compare results with the equals method to formalize equivalence.
Create a pandas series of squares from 0 to 99 using a list comprehension. Compare last elements with head, tail, iloc, and equals to illustrate element-wise versus boolean results.
Develop skills in pandas series methods and manipulation, including handling NaNs and missing values, read an external dataset, and assess data structure, and apply descriptive statistics, sorting, and filtering transformations.
Learn to import csv data with pandas read_csv, explore data frames and series, and tailor reads with usecols, index_col, and squeeze using a FiveThirtyEight alcohol dataset.
Explore how to size a pandas series using size, shape, and len, confirming equal lengths for values and index, and noting the one-dimensional nature of series.
Explore pandas series attributes to check uniqueness with is_unique and nunique, handle NaNs with dropna, and assess monotonicity using is_monotonic (increasing) and is_monotonic_decreasing, illustrated with real examples.
Learn how pandas' count method counts non-null values in a series, revealing gaps where NaN values exist, and contrasts with size which counts all elements.
Identify and count null values in a pandas series using isnull (alias isna) and boolean masking, then verify totals with size, count, and sum.
Explore a numpy ufunc approach to isolating nulls in pandas, leveraging vectorization for performance and seamless numpy-pandas interoperability.
Use notnull or notna to create a boolean mask of non-null records, then sum to count non-missing values; notnull and notna are aliases.
Discover how booleans map to one and zero in Python, with bool as a subclass of int, and how arithmetic and method resolution order reveal their behavior.
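The following sketch illustrates why summing a boolean mask counts its True entries. The series values are made up for illustration and are not drawn from the course dataset:

```python
import numpy as np
import pandas as pd

# bool is a subclass of int, so True behaves like 1 and False like 0.
print(issubclass(bool, int))  # True
print(True + True)            # 2
print(bool.__mro__)           # (<class 'bool'>, <class 'int'>, <class 'object'>)

# Illustrative series with two missing values.
servings = pd.Series([54.0, np.nan, 14.0, np.nan])

# Summing a boolean mask therefore counts the True entries.
null_count = servings.isna().sum()
non_null_count = servings.notna().sum()

print(null_count, non_null_count)  # 2 2
```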
Isolate non-nulls in the alcohol series as wine_servings and compute wine consumed by countries. For the challenge, apply a boolean mask for countries with less than 100 servings and sum.
Isolate non-nulls in the alcohol series, create a boolean mask for wine servings less than 100, and sum to compute total wine servings across countries in 2010.
Learn how dropna and fillna handle missing values in a pandas series. Drops create copies by default, while inplace toggles can modify the original series or require reassignment.
Apply descriptive statistics in pandas to summarize data with sum, mean, median, quantiles, IQR, min, max, std, and var. Understand how mean exceeds median in a right-skewed wine data distribution.
Apply the describe() method to quickly get descriptive statistics as a pandas series, giving a quick numerical sense of your data, with optional percentiles and include/exclude filters by type.
Explore mode and value_counts in pandas: compute the most frequent item, compare single-value frequency with all unique values, and learn raw and normalized counts in a series.
Learn how to sort a pandas series by values using sort_values, control ascending order, place NaNs with na_position, and compare copy versus in-place sorting, including default quicksort and alternatives.
Explore the nlargest and nsmallest methods to quickly extract top or bottom n values from a sorted series, avoiding explicit sorting and slicing.
Use sort_index to sort a series by its index labels, ascending by default, with optional inplace updates; compare it to sort_values, and note na_position handling and quicksort as the default algorithm.
Practice a pandas skill challenge: filter countries with wine servings over 50 into a 50 plus variable, then select the smallest 21 from 50 plus and compute mean, median, and standard deviation.
Create a new pandas series '50 plus' by boolean masking values over 50, then select the 20 smallest, and compute mean, standard deviation, and median from that subset.
Perform series arithmetic in pandas: add, subtract, multiply, and divide with scalar and series operands; leverage automatic index alignment and the fill_value option to preserve data when labels differ.
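As a rough sketch of index alignment and fill_value, assuming two small made-up series with partially overlapping labels:

```python
import pandas as pd

# Illustrative series; only the label 'y' is shared.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "w"])

# The + operator aligns on labels; labels missing from either side give NaN.
plain = a + b

# The add method with fill_value treats a missing operand as 0 instead.
filled = a.add(b, fill_value=0)

print(plain["y"], plain.isna().sum())  # 12.0 3
print(filled.sort_index().to_dict())
```

With fill_value=0, every label from either series keeps a usable number rather than turning into NaN.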
Compute variance and standard deviation in pandas by applying var method on a series, averaging squared differences from mean, and taking square root after adjusting for n minus one.
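The steps above can be sketched by hand and checked against the built-in methods. The numbers below are illustrative, not from the course data:

```python
import numpy as np
import pandas as pd

# Illustrative values with a mean of 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = values.count()
squared_diffs = (values - values.mean()) ** 2

# Sample variance: sum of squared differences divided by n - 1.
manual_var = squared_diffs.sum() / (n - 1)

# Standard deviation is the square root of the variance.
manual_std = np.sqrt(manual_var)

# pandas uses the same n - 1 adjustment (ddof=1) by default.
print(np.isclose(manual_var, values.var()))  # True
print(np.isclose(manual_std, values.std()))  # True
```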
Explore cumulative operations in pandas, including sum, cumsum, prod, and cumulative min and max, with NaN handling and practical examples on a wine servings series.
Learn how the pandas diff() method computes discrete, element-wise differences between pairs in a series, with the periods parameter adjusting the lag. This is foundational for time series analysis.
Discover how to iterate a pandas series using for loops, iterate over index labels, and the items method (or iteritems) for lazy, zip-based tuples.
Filter a pandas series with filter() using regex to show countries starting with V, such as Vietnam, Venezuela, and Vanuatu. Compare index-based filtering with values-based methods, including where and mask.
Master transforming a pandas series with update, apply, and map for spot and global transformations. Apply in-place changes, lambda functions, and parameterized inputs for flexible results.
Practice creating a pandas series from beer servings, compute mean, median, and std, assess skewness, compare the first ten countries, and explore z-scores and standardized scores.
Create a data URL for a CSV, read it with pandas, select beer_servings and country, set country as the index, and convert the result to a series using squeeze.
Compute mean, median, and standard deviation of beer servings in pandas. Explore quantile methods and numpy to verify results, and assess right skew using describe and a quick histogram.
Compute z-scores for a Pandas series by subtracting the mean and dividing by the standard deviation, then interpret deviations and identify the largest absolute z-score (Namibia).
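A minimal sketch of the z-score calculation, using a small made-up series with placeholder labels rather than the actual beer servings data:

```python
import pandas as pd

# Illustrative values; the labels 'A'..'E' stand in for country names.
beer = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0], index=list("ABCDE"))

# z-score: distance from the mean, measured in standard deviations.
z_scores = (beer - beer.mean()) / beer.std()

# Label of the observation that lies farthest from the mean.
most_extreme = z_scores.abs().idxmax()

print(z_scores.round(2).to_dict())
print(most_extreme)  # E
```

By construction the standardized series has a mean of 0 and a standard deviation of 1.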
Explore the Pandas DataFrame, its relation to series, and essential data cleaning for numerical analysis today. Practice dataframe manipulation, learn regular expressions, and work with a 9000-item nutritional dataset.
Explore how a pandas data frame extends series to two dimensions, with labeled indices and columns, as a collated, heterogeneous collection of series.
Create a data frame by passing a dictionary of column labels to the pandas constructor, using equal-length lists for names, ages, and married to build the frame column by column.
Explore four ways to build data frames in pandas—dictionary of tuples, dictionary of series, dictionary of dictionaries, and row-wise construction with a list of dicts—highlighting the library's flexible constructor.
Explore the pandas info method to review a data frame’s index, columns, data types, non-null values, and memory usage, with verbose, max_cols, and deep options.
Read a 9,000-item nutrition data set with pandas read_csv, inspect its 77 columns and 9,000 records, and note memory usage around 40 MB and embedded units.
Identify a duplicated index in the nutrition dataframe and remove it by dropping the column, setting it as the index, or using read_csv with index_col; prefer the latter.
Explore the data frame sample() method to draw random records, learn how a fixed random state yields deterministic results, and use n or frac to control the sample size.
Explore sampling with and without replacement in pandas, using the replace parameter and bootstrapping concepts, and bias samples with weights via a pandas series and index labels.
Discover how true randomness via natural entropy differs from computer-generated pseudo randomness, and how pandas sample and numpy use the Mersenne Twister PRNG with seeds to produce repeatable results.
Discover how pandas dataframes have two axes, rows and columns, and use the axes attribute to access row and column labels and coordinates.
Explore changing a pandas dataframe index from an int64 range to a meaningful name using set_index, with drop and verify_integrity options, and discover multi index possibilities.
Explore dataframe extraction by position using iloc and loc on a food nutrition dataset, including selecting specific rows and columns, boolean masks, and handling non-consecutive indices for efficient data access.
Learn how to extract a single value from a pandas DataFrame using at and iat, compare them with loc and iloc, and understand their speed advantages for single-value access.
Learn to convert between labels and integer positions with the get_loc method and use loc, iloc, and at to fetch a single value.
Practice extracting data from a nutrition data frame in pandas: randomly select ten foods, pull total fat and cholesterol, and retrieve calories for the third food using attribute-based accessors.
Learn Pandas techniques by sampling ten rows from a nutrition data frame and using label-based and location-based indexing with loc, iloc, and iat to extract total fat, cholesterol, and calories.
Convert all unit-containing columns from strings to numeric values to enable accurate analysis in pandas, noting that 73 columns require casting before further steps.
Use the pandas astype method to cast dataframes and series to new types, reassign changes, and selectively convert columns, addressing non-numeric values in a nutrition dataframe.
Apply DataFrame.replace and regular expressions to strip units from the nutrition data, converting string values to numeric types and inspecting dtype changes from object to 64-bit int.
Isolate the units from each column by removing all numeric values with a regex. Then use mode to derive the most common unit per column for consistent labeling.
Learn to rename index and column labels in a pandas data frame using dictionaries or a mapper, control with axis and inplace, and rename both axes when needed.
Learn how to use dataframe dropna with axis, how (any or all), and thresh to drop rows or columns containing NaN, and apply inplace changes.
Learn to use dropna with the subset parameter to limit drops to selected rows or columns, illustrated with gender and age examples.
Merge units into column labels by building a mapper from the units data frame and renaming columns, preserving unit information while moving toward a numeric data frame.
Remove units from values using the dataframe replace method with a regex pattern. Cast all values to float for a pure numeric dataset ready for numerical analysis.
Enable two-dimensional filtering in pandas by applying the filter method to both index and columns, using like, regex, and items to slice data efficiently.
Sort dataframes with sort_values by calories and by multiple columns using an ascending flag list. Filter grams to assess brain composition, noting fat, protein, and water content.
Choose a one-dimensional series, apply the between method to calories to create a boolean mask, and use boolean indexing to extract four randomly sampled rows from the data frame.
Learn to compute min, max, and idxmin/idxmax across columns or rows in Pandas dataframes, apply to nutrition data like potassium and sodium, and filter with between for dietary insights.
Learn how to use the dataframe methods nlargest and nsmallest to extract top and bottom records, choosing columns and applying to series or dataframes for efficient data selection.
Identify the ten foods with the highest vitamin B12 using Pandas, isolate eggplant-related items to find the one with the most sodium, and sample four random rows and two random columns.
Explore pandas techniques to identify the top ten vitamin B12 foods from a data frame, compare nlargest and sorting, filter for eggplant, and sample four rows by two columns.
Apply pandas to remove all items with any NaN values, then identify foods with 20–40 mg vitamin C and count those between 2 and 3 standard deviations above the mean.
Apply pandas to clean and analyze nutrition data by dropping rows in place, filtering vitamin C with between, and using mean and standard deviation to identify outliers.
Dive into advanced pandas DataFrame concepts, including boolean indexing and bitwise operators, sorting and lookup, pruning duplicates, reshaping, and powerful transformation techniques using pandas, numpy, and Python.
Explore a new dataset of English Premier League players, load it with pandas and numpy, and inspect its structure, data types, and memory usage.
Use pandas boolean masking to generate a boolean sequence from a column comparison and index the dataframe to filter players with a market value over 40 million.
Generate boolean series on the fly with the isin method, the between method, and comparator wrappers, then filter defenders by market-value ranges and age using boolean masks.
Explore how booleans combine with binary operators in pandas series, using bitwise OR (|) and AND (&), and learn that alignment is by label, not order, when indexing data.
Explore XOR and the tilde-based complement operator in Python, learn how XOR differs from OR, and see how to use boolean negation in Pandas and NumPy for dataframe indexing.
Master pandas indexing with multiple boolean conditions using AND, OR, and NOT, applying filters like left backs, age 25 or younger, and market value at least 10 million.
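The combination of conditions described above might look like the sketch below. The players table, its column names, and its values are hypothetical stand-ins for the course dataset:

```python
import pandas as pd

# Hypothetical player data; market_value is in millions.
players = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4"],
    "position": ["LB", "LB", "GK", "CB"],
    "age": [24, 29, 22, 25],
    "market_value": [12.0, 15.0, 8.0, 30.0],
})

# Each comparison yields a boolean series aligned to the dataframe index.
is_left_back = players["position"] == "LB"
is_young = players["age"] <= 25
is_valuable = players["market_value"] >= 10

# & is AND, | is OR, ~ is NOT; they operate element-wise on boolean series.
young_valuable_lbs = players[is_left_back & is_young & is_valuable]
lb_or_young = players[is_left_back | is_young]
not_left_backs = players[~is_left_back]

print(young_valuable_lbs["name"].tolist())  # ['P1']
```

When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & and | bind more tightly than == and <=.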
Refactor complex conditions into standalone boolean variables to simplify data frame indexing. Use parentheses, assignment and comparison operators to filter Arsenal players who are right backs or Chelsea goalkeepers.
Identify English players whose market value exceeds twice the league average and have either more than 4000 views or a new signing, but not both, with refactored conditions as variables.
Learn to build boolean conditions in pandas: filter English players with market value over twice the league mean, using xor for page views or new signings, and apply boolean indexing.
Master two-dimensional indexing in pandas by filtering Chelsea players aged 23 and under with boolean conditions, then select relevant columns, including those starting with P, using the label-based indexer.
Master fancy indexing in pandas with the lookup method to retrieve values by row and column labels. See lookup returns values for label coordinates and compare it to basic indexing.
Sort a data frame by values and by index with sort_values and sort_index, using ascending or descending and in place options, and reset index to demote it to a column.
Learn how to precisely reorder dataframe rows and columns using reindex, beyond basic sort_values and sort_index, including alphabetical column ordering and slicing possibilities.
Explore using any array-like object for Pandas reindex and columns, not just Python lists. See how sorting a Pandas index with sort_values offers a valid alternative, yielding the same result.
Avoid the antipattern of sorting columns by transposing the data frame; learn a direct approach in pandas using sort_index with axis=1 to achieve alphabetically ordered columns efficiently.
Tackle a skill challenge: sort a dataframe by age to find youngest EPL player, reindex by club and sort by index, then sort values by club and market value descending.
Sort by age to reveal the youngest player using sort_values, or the min method with an indexer. Set club as the index, then perform a two-key sort with different directions.
Use pandas' duplicated method to identify duplicates, customize what counts as duplicate with subset, and control which occurrence is original with keep (first, last, or False) for accurate aggregates.
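A short sketch of how subset and keep change the result, using a made-up four-row frame (the names and clubs are illustrative only):

```python
import pandas as pd

# Illustrative data: 'Ann' appears three times, twice with the same club.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "club": ["X", "Y", "X", "Z"],
})

# Default: a row is a duplicate only if every column matches an earlier row.
full_row = df.duplicated()

# subset narrows the comparison to the listed columns.
by_name = df.duplicated(subset="name")

# keep chooses which occurrence counts as the original.
by_name_last = df.duplicated(subset="name", keep="last")
by_name_all = df.duplicated(subset="name", keep=False)

print(full_row.tolist())      # [False, False, True, False]
print(by_name.tolist())       # [False, False, True, True]
print(by_name_last.tolist())  # [True, False, True, False]
print(by_name_all.tolist())   # [True, False, True, True]
```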
Identify and remove duplicate records in the EPL players dataset using the duplicated and drop_duplicates methods, then recalculate the mean market value to reveal the true league-average market value.
Use the drop method to remove rows by index labels or axis, including multiple labels, and return a copy without changing the original.
Learn how to remove columns in pandas using drop with axis=1, specify column labels, or pass directly to the columns parameter, returning a copy.
Learn how to remove a column with the Pandas pop method, which returns the removed column as a series and modifies the dataframe in place.
Explore the reindex method to exclude rows and columns by computing set differences and creating a new data frame, noting that drop is often more durable for data cleaning.
Identify and count NaN values in data frames with isna, then locate missing-data records. Index with boolean arrays, convert to values, and use drop_duplicates to manage duplicates.
Learn to handle NaNs in data frames with fillna and dropna, using column-specific defaults or a dictionary, and axis-based removal for rows or columns.
Apply fillna with the method parameter to fill missing values using forward fill or backward fill, and understand axis 0 for index-wide and axis 1 for column-wide filling.
Practice pandas data wrangling: create a copy DF2 by removing rows and a column, check for NaN values and unique nationalities, then extract unique age–position pairs by club, excluding club.
Apply pandas techniques to clean a dataframe: drop rows and a column, check NaN values, count unique values, extract age and position combinations with a subset, returning age and position.
Apply the pandas agg method to compute aggregates like mean or min across numeric columns, reshape data into a series or dataframe, and filter with select_dtypes.
Learn how pandas transform applies a function to a dataframe without changing its shape, illustrated with currency conversion and a random string capitalization using choice and the str accessor.
Explore the pandas data frame apply method as a flexible tool that handles both aggregations and shape-preserving transforms. Learn axis choices, type checks, and practical rounding of floating point columns.
Explore vectorized operations in numpy and pandas for fast data processing. Learn when to use applymap for element-wise transformations and how it handles logging and inflation adjustments.
Create a numeric classification function that maps inputs to popularity labels using predefined bounds. Apply it to players views with vectorized operations, add a popularity column, count super popular players.
Create a pandas get popularity function with thresholds for labeling page views. Apply it to the players page views, add a popularity column, and count super popular players.
Explore spot value changes in dataframes using label-based and integer-position indexers to modify a single cell. Learn that at and iat are faster than loc and iloc for single-value assignments.
Explore the SettingWithCopyWarning in pandas, focusing on chained indexing, copy versus view, and how inplace and drop_duplicates can influence updates.
Always assume pandas returns a copy, and assign through a single loc or iloc indexer to make sure updates reach the underlying data frame.
Learn practical techniques to add columns to a data frame using assignment, insert, and assign methods, including placing new columns like nicknames and career goals.
Add rows to dataframes using the append method with a series or a dataframe, and learn why setting with enlargement is inefficient and not in place.
Learn how pandas stores dataframes in memory as column blocks managed by a block manager, and why appending rows is slow; optimize by operating on columns for better performance.
Create a 4x4 data frame assigned to DF_random, then perform two separate operations: add a new row and add a new column, and compare their speeds using timeit.
Create a 4x4 data frame by random sampling from players, then append a row and add a column, comparing performance; row additions are about seven times slower than column additions.
Explore how to combine multiple datasets with pandas by concatenating and performing join operations, including inner, outer, left, and right joins, focusing on structure and merge rules.
Load five US college salary datasets by region and major with pandas read_csv from URLs, forming engineering, state, party, liberal arts, and Ivy League groups, then merge them later.
Concatenate five data frames into a master data frame, inspect shapes, and resolve duplicates by using the duplicated method and dropping the party data frame, yielding 249 schools.
Fix duplicated indices after concatenation by resetting the index with drop=True. Alternatively, use pd.concat with ignore_index=True to create a unique range index for reliable slicing.
Enforce unique indices when concatenating data frames with pandas by enabling verify_integrity. Preserve meaningful indices like school name while avoiding ignore_index.
Create multi index dataframes with concat by using the keys parameter to label origin, forming a two-level index, and select with tuple-based labels or iloc.
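The concat options covered in the last few lessons can be compared side by side. This is only a sketch: the two small frames and the key labels "eng" and "ivy" are hypothetical stand-ins for the salary datasets:

```python
import pandas as pd

# Hypothetical frames, each with its own default 0..1 range index.
engineering = pd.DataFrame({"school": ["A", "B"], "salary": [70, 65]})
ivy = pd.DataFrame({"school": ["C", "D"], "salary": [80, 75]})

# Plain concatenation keeps the original labels, so 0 and 1 repeat.
stacked = pd.concat([engineering, ivy])

# ignore_index=True discards the old labels and builds a fresh range index.
renumbered = pd.concat([engineering, ivy], ignore_index=True)

# keys adds an outer index level that records where each row came from.
keyed = pd.concat([engineering, ivy], keys=["eng", "ivy"])

print(stacked.index.is_unique)           # False
print(renumbered.index.tolist())         # [0, 1, 2, 3]
print(keyed.loc[("ivy", 0), "school"])   # C
```

Passing verify_integrity=True to the first call would raise a ValueError instead of silently producing the duplicated labels.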
Concatenate data frames along the column axis with pd.concat by setting axis to 1, enabling side-by-side comparisons of the top five engineering and Ivy League schools by median salary.
Explore how append and pd.concat yield identical results in simple cases, then compare differences: append is an instance method with fixed axis, while concat is a flexible module function.
Learn how pandas concat handles dataframes with extra columns, using the stem column example, and control results with join='inner' versus join='outer' to manage missing values.
Demonstrate a pandas data frame task: concatenate liberal arts and state schools, compute unique names and average median starting salary, then compare top earners with nested column labels.
Concatenate liberal and state frames to compute unique school names; compute mean mid-career median salary, and display top three liberal arts and top three state schools side-by-side with nested labels.
Learn how the Pandas merge method joins data frames on a common key, using an inner join on the school name via the on parameter, and contrast merging with concatenation.
Learn to merge two dataframes with different key names using left_on and right_on, then drop the redundant key to extend schools with mid-career income percentile data.
Explore how the Pandas merge how parameter selects inner or outer joins, showing that inner yields the intersection of keys and outer yields the union with NaN for missing data.
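A brief sketch of inner versus outer joins, assuming two made-up frames that share a "school" key (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical frames; schools B and C appear in both.
salaries = pd.DataFrame({"school": ["A", "B", "C"], "salary": [50, 60, 70]})
regions = pd.DataFrame({"school": ["B", "C", "D"], "region": ["West", "South", "East"]})

# Inner join: only keys present in both frames survive.
inner = salaries.merge(regions, on="school", how="inner")

# Outer join: every key from either frame survives; gaps become NaN.
outer = salaries.merge(regions, on="school", how="outer")

print(sorted(inner["school"]))  # ['B', 'C']
print(sorted(outer["school"]))  # ['A', 'B', 'C', 'D']
print(outer.isna().sum().to_dict())
```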
Master left and right joins in pandas using merge, preserving left or right keys, discarding the rest, and handling NaNs. Flipping input order makes left joins equivalent to right joins.
Identify 1-to-1 and 1-to-many joins in pandas using merge and key uniqueness. Check unique values, duplicates, and use drop_duplicates to control merge outcomes.
Explore many-to-many joins by merging data with duplicate key values, observe how Cartesian products arise, and contrast 1-to-1, 1-to-many, and many-to-many join cardinalities.
Learn to merge data frames by index in pandas, using left_index and right_index as the join keys. Discover mixed merges that combine index and column keys.
Explore the Pandas join method to merge data frames by index or a column key, using a concise instance method that behind the scenes calls merge, for shorter code.
Merge the liberal arts dataframe with the regions dataframe and assign the result to the fme variable to identify the region with the most liberal arts schools.
Learn to merge data frames with pandas, compute region distributions of liberal arts schools with value_counts, and assess 1-to-many joins after setting school name as the index.
Explore advanced pandas indexing with the multi index to represent hierarchical relationships within dataframes. Modify the index to support multiple label levels, enabling efficient analysis of multidimensional data.
Explore a brand-new dataset of daily stock prices for Apple, Facebook, Microsoft, Google, and Amazon, with open, high, low, close, and volume over about five and a half years.
Review index and range index in pandas, and how series and data frames use label-based indexing. Learn to set meaningful labels with set_index, such as dates, to enable date-based selection.
Learn to create a multi-index in pandas by passing a list of labels to set_index, producing a two-level hierarchical index with date and stock name, and applying changes in place.
Learn to create a multi-index dataframe in pandas in one step by using read_csv with the index_col parameter to define date and name as the two-level index.
Extract values from multi-index dataframes by labeling with date and stock ticker, using label-based indexing and iloc for agnostic, position-based access to open and close prices.
Master advanced indexing in a two-level pandas multi-index dataframe by selecting date and stock slices with lists, tuples, and the slice object, including slice(None) for all dates.
Master using pd.IndexSlice to index hierarchical pandas dataframes with the index slice object, enabling concise high-to-low selections across dates and companies.
Explore xs(), the cross section method for hierarchical data frames, compare with the loc indexer, and learn to select multiple levels using tuples, with drop_level and axis options.
Complete a skill challenge practicing dataframe slicing: create tech_df2 by date-slicing, sample ten random Apple days from it, and extract intraday high and low prices for Apple and Google.
Create df2 by slicing the tech data frame to extract stock prices between dates, sample ten Apple days, and intraday high and low for Apple and Google via multiindex cross-section.
Explore the anatomy of a pandas multi index: its names, levels, and values, and how the two-level hierarchy of dates and stock names spans 1421 dates by five tickers.
Add a new level to a pandas multi-index with set_index and append, creating a three-level index (date, stock ticker, volume type) and learn selecting with multi-index tuples and cross section.
Master reordering a pandas multi-index by swapping two levels with swaplevel and using reorder_levels for broader ordering, returning a new index.
Remove multi-index levels using droplevel or reset_index to reshape data frames, with reset_index optionally restoring levels as columns or discarding them.
Sort multi-index dataframes efficiently by using sort_index in place, learn to handle unsorted index errors, and tailor sorting with level parameters for optimized slicing and retrieval.
Master multi-index management with standalone methods: check lexicographic sorting, sort levels without altering data, set names, and convert to a flat index for clear labeling.
Reshape a multi-index dataframe with the stack method, moving the column axis to the innermost index. Label the new level with set_names to complete the multi-index series transformation.
Unstack pivots the innermost level of a multi-index back to the column axis, reversing stack. Use the level parameter to target a specific axis.
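As an illustration of unstack and its level parameter, here is a sketch on a small two-level series. The dates, tickers, and prices are made up, and the level names "date" and "name" are assumptions mirroring the stock dataset described earlier:

```python
import pandas as pd

# Hypothetical two-level index: date (outer) and ticker name (inner).
index = pd.MultiIndex.from_product(
    [["2019-01-02", "2019-01-03"], ["AAPL", "MSFT"]],
    names=["date", "name"],
)
close = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="close")

# Default: the innermost level ('name') moves to the column axis.
by_ticker = close.unstack()

# level selects a different index level to pivot, here the outer 'date'.
by_date = close.unstack(level="date")

print(by_ticker.columns.tolist())  # ['AAPL', 'MSFT']
print(by_date.columns.tolist())    # ['2019-01-02', '2019-01-03']
```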
Learn to manually create a two-level multi-index of columns in pandas by building a cartesian product of volume and ticker, then assemble a data frame of ten records.
Combine set_index and transpose to reshape a dataframe into a two-level multi-index column axis, turning the index into columns and trading date and volume category into the levels.
Recognize that panels are deprecated and not covered in this course. Prefer multi-index dataframes for multidimensional data, and consult documentation when working with legacy panels and axis concepts.
Transform the tech data frame into a four-level index (year, month, day, column axis), assign to tag_df_three, form tag_series for 2019 trading days, then compute mean and std of close.
Learn to build three- and four-level multi-indices, select 2019 with label-based indexing, and compute mean and standard deviation of close prices, including an apply-based option.
Learn the split apply combine pattern in data analysis by mastering the groupby method, exploring aggregation functions, and grouping by multiple keys with transform, filter, or generic apply.
Import pandas and numpy, load a 3000-game sales dataset across xbox and playstation, and learn group by to summarize regional sales by console, year, and publisher.
Review how simple aggregations such as sum, mean, standard deviation, and variance operate behind the scenes in pandas, and how axis control switches between vertical and horizontal aggregation.
Explore conditional aggregates by computing platform-specific regional sales using boolean indexing and per-group sums, and contrast with the verbose approach that leads to a future groupby solution.
Discover the split-apply-combine pattern behind groupby, splitting data into groups, applying an aggregation, and combining results to form a single summary dataframe.
Explore the groupby method to split a data frame by platform, apply aggregations such as sum, mean, or median, and recombine results in a single, powerful command.
Explore the dataframe groupby object as a lazy, intermediate view that splits data into four subgroups and awaits the next apply step to produce results.
Map index labels to platform groups with a dictionary, then group by the mapped labels to aggregate PlayStation and Xbox totals without altering the underlying data.
Explore the groupby pattern, epitome of split apply combine, on Series and DataFrame, using genre and global sales to compute mean by genre and sort results to reveal top genres.
Apply concepts by creating a publishers dataset from the games data frame and identifying top publishers by North America sales, plus the platform with the most sales.
Create a smaller dataframe, group by publisher to sum North American sales, rank top ten publishers, and identify the platform with the highest sales in North America using pandas.
Explore how pandas groupby creates a lazily evaluated object and iterate over each platform subgroup to inspect labels and mini dataframes before aggregation.
Learn to selectively access subgroups in pandas groupby objects, diagnose issues with nulls or invalid data, and use get_group for efficient, direct retrieval.
Master multi-key grouping with pandas by using groupby on genre and publisher to analyze top publishers within each genre by global sales, revealing a two-level hierarchical index.
Harness the aggregate function (alias agg) to apply multiple metrics (sum, mean, std, and count) over grouped data, yielding a multi-index by genre and publisher that is sortable by global sales sum.
Learn to combine group by and filter in pandas to exclude records based on aggregated subgroup totals, such as publishers selling over 50 million in North America within each genre.
Discover how groupby and transform apply subgroup-level calculations while keeping the original shape, converting raw sales into within-genre z-scores using genre mean and standard deviation.
Learn how to use groupby with apply in pandas to run custom functions on genre subgroups, returning transforms, scalars, or lists, and assess solid or weak sales and variability.
Apply pandas to the games dataset to compute total sales by year and identify the top three years. Identify Europe's top selling genre/year/platform and Japan sales higher than Europe.
Analyze the games dataframe in pandas to compute yearly global sales and top three years, then find Europe's top genre-platform and filter platform-genre groups where Japan sales exceed Europe.
Learn pivoting in pandas, inspired by Excel pivot tables, and use a concise functional interface for grouping, aggregation, and multi-index pivots with customizations.
Analyze New York City high schools' SAT scores from a curated subset. Use pandas to read the CSV and convert the percent tested values to floats.
Pivoting reshapes a dataframe by turning rows into columns using index, columns, and values, converting long datasets like SAT scores into a wide, readable table.
Learn how to compute average SAT scores by borough using pandas pivot_table to perform aggregation when pivoting, turning many school records into five borough aggregates.
Learn how to use pandas pivot_table to aggregate school-level data by borough, specifying values, index, and columns, with mean by default or custom functions like std.
Show why averages of percentages are misleading and how weighting by enrollment yields the true SAT takers rate by borough, using takers and enrollment ratios.
Replicate pivot tables using groupby and aggregate on multi-index dataframes by applying mean to the score, then unstack to move the section level to columns for a clean pivot-like view.
Set the margins parameter to true to add total rows and columns to pivot tables, tag city and borough averages, and verify results by comparing with raw data.
Explore how pivot tables create multi-index dataframes by using a list of index labels, swap index and columns to move the hierarchy, or transpose to switch axes.
Explore applying multiple aggregation functions in pandas pivots by passing a list to the aggfunc parameter, generating mean, minimum, and maximum scores by borough in one pivot.
Apply pivot tables to summarize total and average enrollment across five boroughs. Build a Queens high school pivot with SAT section scores as columns, sorted by math scores.
Build pandas pivot tables to sum borough enrollments and show mean; then rank Queens schools by SAT section scores as columns with names as index, sorted by math scores.
Explore Pandas techniques to store and manipulate times and dates, from pure Python foundations to NumPy enhancements, then master datetime indices, resampling, interpolation, and moving averages.
Explore Python's datetime module to manipulate dates and times by importing date, time, and datetime, creating objects, and using attributes and isoformat.
Learn to parse dates from text with Python's datetime strptime by defining format codes, convert strings to datetime objects, access year, month, day, and ISO format.
Convert a datetime object to a string using strftime with format codes, including locale-aware %c, and explore an alternative templated string approach that formats the date via substitution.
Explore NumPy datetime64 for efficient large-scale date operations, including 64-bit encoding, time units, vectorized array arithmetic, rescaling to daily, and business day offsets as a stepping stone to Pandas.
The Pandas Timestamp merges Python datetime simplicity with NumPy datetime64 performance, enabling string-to-timestamp conversion and handling day-month ambiguity with the dayfirst option.
Explore a 19-year Brent crude price time series in pandas by reading a csv, inspecting the data structure, and examining dates and prices in usd per barrel.
Convert the date column from object to datetime64 to unlock date-time operations and reduce memory usage, then set it as a datetime index with time-series attributes.
Learn how to parse dates at read time with pandas read_csv by setting the index to the first column and enabling parse_dates to create a datetime index.
Explore indexing data in a datetime index dataframe using pandas: label-based indexing with the loc indexer, slices, and partial string indexing to retrieve January 2019 data, with date parsing.
Filter the Brent time series for Dec 1, 2015 to Mar 31, 2016; use partial string indexing; compute Brent price standard deviation; compare Feb 2018 mean to Mar 2017 median.
Select price data in a date range with pandas using the loc indexer and string indexing, spanning Dec 2015 to Mar 2016, then compute standard deviation, mean, and compare to median.
Explore DatetimeIndex attribute accessors to extract quarter, week, month, and day name, build boolean masks, and compute means for targeted periods such as leap year Februarys.
Learn to generate flexible date ranges with pandas date_range, creating a DatetimeIndex from start and end dates, or from start dates and periods, with various frequencies.
Learn to shift and adjust dates with the DateOffset object in pandas by subtracting 18 days and adding 18 hours, enabling precise time-aware data arithmetic.
Learn to resample time series data in pandas by changing observation frequency from daily to monthly, using resample with an aggregation (median or mean) to downsample.
Upsample time series data using resample, then fill gaps with linear interpolation in pandas to create eight-hour observations from daily data.
Explore how pandas asfreq converts a time series to a new frequency, fill gaps with forward or backward fill and fill_value options, and contrast it with resample's aggregation capabilities.
Learn how rolling windows create moving averages to smooth time series data using pandas, visualize with matplotlib, and explore weighting schemes like Bartlett and Blackman.
Add a quarter column to the Brent data frame, compute average price and standard deviation for 2014 with groupby, then reproduce the same results using resample on the raw data.
Add a quarter column from date, then compute mean and standard deviation by quarter for 2014 using groupby and aggregation, and reproduce results with resample without a quarter column.
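As a rough illustration of the last challenge above, the sketch below compares the two approaches on synthetic data. The frame name `brent`, the `Price` column, and the random values are assumptions made for this example, not the course's actual dataset or solution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Brent data: made-up daily prices, 2013-2015.
dates = pd.date_range("2013-01-01", "2015-12-31", freq="D")
rng = np.random.default_rng(0)
brent = pd.DataFrame({"Price": rng.uniform(50, 110, len(dates))}, index=dates)

# Keep only 2014 using partial string indexing on the datetime index.
brent_2014 = brent.loc["2014"]

# Approach 1: add a quarter column, then group by it.
with_quarter = brent_2014.assign(Quarter=brent_2014.index.quarter)
by_groupby = with_quarter.groupby("Quarter")["Price"].agg(["mean", "std"])

# Approach 2: resample to quarterly bins; no helper column is needed.
# "QS" labels each bin by the first day of its quarter.
by_resample = brent_2014["Price"].resample("QS").agg(["mean", "std"])

print(by_groupby)
print(by_resample)
```

Both results hold the same four rows of statistics. Only the index differs: quarter numbers for groupby, quarter start dates for resample.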
Welcome to the best resource online for learning and mastering data analysis with pandas and python.
Over 32 hours, 10+ datasets, and 50+ skill challenges, you will gain hands-on mastery of not only pandas 1.x but also dozens of computer science, statistics, and programming concepts.
We will break down, understand, and practice hundreds of methods, attributes, and techniques in pandas and python that will fundamentally change the way you work with data.
In The Ultimate Pandas Bootcamp (2022) you won’t be working with outdated versions of pandas, writing repetitive commands on the same boring dataset. Instead, you’ll learn pandorable and pythonic solutions to interesting, real-world data problems, while working with many diverse datasets that range from wine servings, video game sales, and SAT scores to stock prices, college salaries and more!
Data analysis is an applied science, which is why in each section, you’ll stop and practice what you learn in dedicated skill challenges, followed by detailed solutions where we often consider and compare alternative solutions.
Data analysis is one of the most in-demand skills across all industries and in an increasing number of roles, and python is increasingly the language of choice.
Pandas is the wonderful open-source library that is the embodiment of those trends: based on the python programming language, pandas is the de facto data analysis library in the python data science community.
––––– Structure & Curriculum –––––
In more than 31 hours, we'll cover everything that pandas has to offer, from manipulating series and dataframes, to merging datasets, handling time series, aggregations, filtering, sorting and much more!
The first four sections of the bootcamp constitute the core curriculum. You'll get acquainted with series and dataframes and develop an in-depth understanding of pandas data structures.
· Series at a Glance
· Series Methods and Handling
· Introducing DataFrames
· DataFrames More In Depth
In the next eight sections, you will dive into more advanced topics and take your pandas skills to another level, learning how to work with multiple datasets, manipulate time series, visualize data, write custom functions to transform data and much more.
· Working With Multiple DataFrames
· Going MultiDimensional
· GroupBy And Aggregates
· Reshaping With Pivots
· Working With Dates And Time
· Regular Expressions And Text Manipulation
· Visualizing Data
· Data Formats And I/O
Pandas and python go hand-in-hand which is why this bootcamp also includes a full-length introduction to the python programming language, to get you up and running writing pythonic code in no time.
This is the ultimate course on one of the most-valuable skills today. I hope you commit to mastering data analysis with pandas.
See you inside!