
This course includes our updated coding exercises so you can practice your skills as you learn.
See a demo
discover how to tackle 12 advanced pandas projects, from importing and cleaning data to explanatory data analysis, data presentation, seaborn visualizations, and time series with machine learning workflows.
Follow seven tips to maximize learning on this Udemy course: review the overview, content, and prerequisites, avoid skipping sections, and practice with downloaded materials, coding exercises, and the AI assistant.
Download and install the Anaconda distribution to set up Python for data science, with pre-installed packages, dependency management, and support for Jupyter notebooks on Windows, Mac, and Linux.
Learn to navigate the Anaconda Navigator, launch Jupyter Notebook, and run your first Python code, while understanding environments and key packages like NumPy and pandas.
Explore how Jupyter notebooks enable interactive Python coding with text, images, and markdown, and master editing modes, cell shortcuts, and kernel management.
Explore a 45000-movie dataset, analyze and filter data frames to find movies for next movie night that best meet our preferences, and present results, including CSV import and quick inspections.
Download the project zip, unzip it, and access notebooks and datasets; import pandas, load the CSV data, and create the first pandas data frame while troubleshooting file paths.
Explore project one by importing and inspecting a movies dataset, filter top revenue and return-on-investment films, and use user-defined functions to compare franchises and directors across assignments.
Import 44,691 movie records from a csv, adjust pandas to display all 22 columns, convert release date to datetime, and inspect data types, missing values, and distributions with histograms.
Analyze and rank movies by budget, revenue, profit, and ratings using Pandas, create meaningful features like profit and return on investment, and present results with HTML formatting.
Identify the best and worst movies by revenue, budget, and profit, and rank films by return on investment, votes, and audience rating using a user defined function in pandas.
Learn advanced pandas filtering by building boolean masks for genre and cast, combining conditions, selecting title and vote average, sorting by vote average, and displaying posters.
Learn to identify the most common words in movie titles, taglines, and overviews and visualize them with a word cloud using Python and the word cloud package.
Analyze franchise versus non-franchise films to show franchises deliver higher revenue and return on investment, greater budgets and popularity, with similar average ratings.
Compare franchise movies with standalone titles and learn to aggregate franchise data by collection, computing total and mean budgets, revenues, ratings, popularity, and vote counts to rank the top franchises.
Identify the most active and productive directors using pandas, calculating total revenue, movie counts, and average ratings, with filters for votes and horror directors to rank top creators.
Analyze the most successful, active actors by transforming the cast into a long actor data frame, splitting cast, stacking to movie-actor pairs, and joining details like title, revenue, and popularity.
Explore how to identify the most successful actors using pandas by computing total revenue, mean revenue, rating, and popularity, and filtering for actors with at least 10 movies.
Master rapid error handling by embracing trial and error, using simple rules to fix most issues in under a minute, as debugging can take over half a coder's time.
Practice debugging with a hands-on Python exercise using a dictionary of Olympic medalists; fix typos, type errors, indexing issues, and pandas imports to create a series and compute totals.
Identify the top causes of coding errors, starting with code issues such as typos, wrong indents, and inappropriate inputs, then Python installation and external factors.
Identify the most common, fixable python errors. Learn about dict key typos, underscore vs dot mistakes, bracket mixups, unclosed quotes, wrong method names, and forgetting to assign a pandas series.
Learn how omitting cells and changing cell sequence affect pandas data frames, with set index and loc access, and resolve key or index errors by restarting the kernel.
Explore common index errors in pandas dataframes and lists, including out-of-bounds iloc access and negative indexing pitfalls. Inspect previous cells to identify the root cause and fix the error.
Learn to diagnose Python indentation errors in complex code, including for loops and dictionary iterations, by distinguishing unexpected indents from missing required indents in Jupyter notebooks.
Avoid misusing Python built-in function names and keywords like for, break, and continue, such as assigning a list to the variable list, which overwrites the function and triggers errors.
Explore type errors and value errors in Python, distinguishing incorrect object types from invalid values through practical examples like int plus string, string-to-int conversion, list removal, and negative square roots.
Diagnose pandas key errors with ChatGPT, then use boolean indexing or set the athlete column as the index to fix and understand your data frame structure for reliable debugging.
Learn to use Stack Overflow and Google to resolve coding errors, with a practical pandas example illustrating the positional argument versus keyword argument issue.
Trace back Python errors by following the traceback from oldest to latest calls, using line numbers to localize attribute errors in backtesting code and pandas calculations.
Identify missing packages, outdated versions, and corrupted Python installations; use conda and pip to install or upgrade libraries like pandas, NumPy, and seaborn, or reinstall Anaconda when needed.
Identify external factors that cause issues with web APIs, authentication, eligibility, firewall or admin rights when scraping data or loading financial data for online trading.
Download the notebooks, run the code in the notebooks from the videos, and practice coding rather than transcribing code to avoid transcription errors and master original code.
Apply a debugging flowchart to diagnose and fix errors quickly by reading the error, inspecting code and previous cells, restarting the kernel, and using the course notebook for comparison.
Apply the updated debugging flow with ChatGPT assistance to efficiently fix Python coding problems: read errors, inspect code, restart kernels, then consult ChatGPT before web searches.
Fetch movie data from a public database API, validate it, and import it into notebooks. Explore API mechanics and use an optional API key to pull data or test services.
Learn what JSON is, a human-readable, language-agnostic data format for web APIs. Explore JSON structures and loading them into Python and pandas for analysis.
Explore loading data from JSON files into pandas and fetching web API data with an API key, then build dataframes, filter movies by release date, and save to JSON.
Load json files into pandas by converting dictionaries to a data frame, then flatten nested fields like belongs_to_collection with json_normalize, using record_path and prefixes.
Explore the three json orientations for pandas—records, columns, and split—and compare memory use and loading. The lecture recommends records orientation for efficient data frame creation unless memory is tight.
Learn to pull data from web apis using the movie database api, including http requests, api key authentication, and querying with Python's requests library.
Import pandas and requests, authenticate with a personal API key, and fetch Star Wars data from the movie API, then parse JSON into a pandas frame.
Insert your personal api key into the api key string, replacing the placeholder to form the authentication fragment and build the full http request url for the movie api.
Discover how to use the movie database API with pandas and requests to filter releases from January and February 2020, iterate pages with the Discover module, and build data frames.
Learn best practices for importing and storing the movies dataset using requests and pandas, fetch movie data, flatten nested fields, and export to csv.
Import and store the movies dataset in a data frame, compare json versus csv for nested versus flat data, and show how csv exports can turn lists into strings.
Clean and prepare a messy dataset of over 45,000 movie records from a movies database api by handling missing values, correcting data types, and removing duplicates.
Analyze and clean a messy real-world data set by loading, inspecting, and flattening nested columns. Evaluate Python expressions, extract data, convert types, handle missing values, and save the cleaned dataset.
Import and inspect a messy movie dataset with pandas, noting mixed data types, missing values, and nested json-like fields to guide cleaning and type handling.
Drop irrelevant columns to create a leaner data frame for explanatory data analysis and presentation, such as removing adult, alternative movie idea, original title, video, and homepage columns.
Learn to clean stringified JSON columns in pandas by converting them to Python objects using the json module's loads method and ast.literal_eval, addressing quotes and mixed types.
Learn to clean stringified json columns in pandas by applying a lambda that converts non-strings to missing values, handling nested data in collection-like columns.
Learn to flatten nested JSON columns, extract dictionary names like the Toy Story collection, and join list-based genres and languages with a pipe, while handling missing values during data cleaning.
Convert budget and revenue to numeric, coerce invalid strings to missing values, replace zeros, divide by one million, and rename columns to million US dollars.
Clean numerical columns in a movie dataset by converting runtime and id to numeric. Replace zero minutes with missing, coerce non-numeric entries, and handle duplicates and vote counts and averages.
Convert the movie release date column to datetime, coerce invalid strings to missing, inspect with info, and analyze value counts to spot placeholders like the first of January.
Clean string data in pandas by analyzing text columns, handling missing values, and normalizing fields like language, title, overview, and tagline using value counts and NaN replacements.
Identify and remove duplicates in the dataset using the movie id as the identifier, and drop duplicates while keeping the first occurrence to ensure id column uniqueness.
Explore how to identify and handle missing values in a dataset, decide between keeping or removing incomplete rows, and apply threshold-based row removal to clean movie data.
Finalize a cleaned dataset by selecting only released movies, dropping the status column, reordering columns, resetting the index, and validating poster URLs before export.
Learn to merge diverse data sources into a single dataset by combining a large movies dataset with cast and crew details, and apply data cleaning and transformation techniques using pandas.
Merge and clean the movies and credits datasets, drop duplicates, and transform the cast and crew columns by extracting names, tallying cast size, and saving the final csv.
Import and save assets, clean datasets, convert release dates to datetime, load cast and crew, and align data frames via a shared movie_id for merging.
Prepare data for a merge by cleaning the credits frame, ensuring unique id values, and aligning cast and crew with movies data for a left join on the id column.
Perform a left join between the movies dataframe and the credits dataframe to bring in cast and crew, identify missing data, and validate join keys on movie_id.
Clean and flatten the new cast column by parsing stringified json with literal_eval, compute cast size, and extract actor names into a pipe-separated string, replacing missing values.
Clean and flatten the new crew column in pandas, extracting director names from nested dictionaries and computing crew size for each movie. Use a function to extract the director name.
Inspect the first two rows of movies, drop the crew column, verify media information, and save the cleaned dataset to csv for explanatory data analysis and machine learning.
Learn to load data from databases into pandas, customize queries to fetch only what you need, and improve memory and performance with parallel processing and secure databases.
Explore how relational databases differ from individual tables, link multiple tables by keys, and move data between pandas data frames and databases using SQL queries for efficient, scalable analysis.
Transform a large movie dataset into a four-table sqlite relational database using pandas, including movies, votes, and production companies. Load the tables into dataframes and practice advanced sequel queries.
Learn how to create a lightweight SQLite database with Python using connect to create a new database file, inspect the connection, run basic queries, and close the connection.
Load the 80-movie dataset into an sqlite database by creating movies, votes, and production companies tables, linking them via movie id, using pandas data frames and json normalize.
Learn to load data from an SQLite database into pandas dataframes, using read_sql, create a connection, and set indices while parsing dates and inspecting data types.
learn to pull data from databases with sql queries in pandas using read_sql, selecting from the movies table and computing total revenue, non missing title counts, and mean budget.
Learn how to filter large databases with SQL before loading into pandas, using where clauses, distinct values, order by, and pattern matching to extract profitable, high-value movie data.
Join two dataframes, movies and votes, on movie_id using pandas to load and select key columns, filter for vote average above eight, and sort results ascending.
Explore pandas workflows for merging production companies with movies, using left and inner joins, then group by production companies to aggregate total revenue and rank top studios.
Import and concatenate a baby names dataset split into 100+ files from a US government site, and automate a scalable workflow that handles 10 to 1,000,000 files.
Load over 100 birth data files from the website, merge into one pandas data frame, add a year column, and save as csv with name, gender, and count.
Download US baby names datasets from data.gov, including national and state data from Social Security card applications, via csv or zip files, and prepare them for pandas.
Import a single file with pandas, create headers for name, gender, and count, and inspect a 3-column baby names data frame containing about 2000 names with no missing values.
Importing and merging many files in the easy case, this lecture demonstrates combining 139 datasets into a dataframe using a file name pattern, string replacement, and pd.concat with keys.
Export the final data frame as a csv file and define the name. Drop the index before exporting to enable clean re-import with pandas for US baby names.
Load the Alaska baby names file from data.gov, then concatenate all state text files into one data frame, create a range index, and save state, gender, name, and count.
Import a state baby names text file with pandas, create a dataframe with state, name, and count, and inspect Alaska’s top names before merging all states in the next lecture.
Explore the glob module to list pathnames matching patterns with wildcards across directories, and collect file names into a list for use in a pandas data frame.
Import 51 files into individual frames, concatenate into one dataframe, reset the index, producing a 6 million-row, 5-column dataset, then export to csv and re-import.
Convert gender and state columns to categorical dtype to save memory, reducing usage from 230 MB to about 150 MB, and discuss benefits and drawbacks of using categorical data types.
Analyze baby name popularity over time using advanced visualization and data aggregation with pandas, uncovering long-term trends, year-to-year changes, and unisex name insights.
Explore advanced data analysis and aggregation in Python and pandas. Load the U.S. baby names dataset, compute popularity and rank by year and gender, and identify evergreen names.
Analyze 2018 baby names with pandas, filter by year and gender, and extract the top 10 female and male names while converting gender to a category to save memory.
Explore evergreen names that remain in the top 20 from 1880 to 2018 using pandas match, revealing female Emma, Elizabeth, Ella and male William, James, Henry.
Aggregate the Babynames data by name and gender to compute total counts since 1880, then perform multiple aggregations such as total, years, first and last year, and best year.
Determine the ten most popular female and male names by total counts from an aggregated data frame, flatten the multi-index, save as all_times, and plot with seaborn.
Analyze trends in baby name popularity from 1880 to 2018 by comparing absolute counts to relative counts per one million babies, using bar plots and pandas. Note rising name diversity.
Create two features for the babynames dataset: popularity as babies per million and rank by year and gender, computed via groupby sums and a descending count-based ranking.
Visualize name trends over time by plotting popularity (babies per million) and rank on a two-axis chart, using bars and lines to reveal how evergreen names rise and fall.
Learn to analyze baby name popularity in pandas by computing year-to-year differences, handling missing values, grouping by name and gender, and identifying drivers behind peaks.
Explore why names rise from unseen to popular, driven by TV series, characters, actors, singers, and sports stars, and learn how to compute percentage change and identify shifts.
Explore persistent versus spike-fade names using a cricket data framework and a spike score defined as max count divided by total count times 100, ranking names by persistence.
Identify unisex names by grouping by name and gender, computing total counts, creating male and female columns, and ranking by overall popularity or by near 50-50 gender balance.
Predict house prices for California districts using a dataset of 20,000 entries by training and testing a model. Emphasize exploratory data analysis, feature engineering with pandas, and data preprocessing.
Master explanatory data analysis and feature engineering to predict median house value, using data cleaning, derived features, correlations, and seaborn visuals, then fit a linear or random forest model.
Import and inspect the California housing dataset using pandas and seaborn, assess data types and missing values, review the ocean proximity category, and visualize distributions to identify outliers.
Clean missing values in the total bedrooms column and explore removal versus imputation. Create rooms per household and population per household features to gauge impact on house values.
Explore how median income, rooms per household, and location influence house prices through correlation analysis, scatter plots, KDEs, and non-linear models like random forest regress.
Analyze how median income drives median house values by transforming income into quintiles, visualizing with seaborn plots, and comparing income categories across ocean proximity to prepare for modeling.
Separate features and labels, then standardize numerical features by computing means and standard deviations for a mean of zero and standard deviation of one, applying to float columns.
Learn to transform text and categorical data into numeric features using one-hot encoding with pandas get_dummies, avoid the dummy variable trap by dropping one category, and combine features for modeling.
Split the dataset into training and test sets using pandas sample with an 80/20 split, ensuring reproducibility and checking income category distribution to support stratified sampling.
Train a random forest regressor on the training set to capture nonlinear relationships in house prices, monitor for overfitting, and prepare to test on the test set.
Evaluate random forest on the test set. Measure generalization with the coefficient of determination and root mean squared error, and compare true versus predicted house prices with mean absolute difference.
Explore feature importance from a random forest regression, showing median income as the top factor, with location, house size, and district features also shaping the model.
Welcome to the first advanced and project-based Pandas Data Science Course!
This Course starts where many other courses end: You can write some Pandas code but you are still struggling with real-world Projects because
Real-World Data is typically not provided in a single or a few text/excel files -> more advanced Data Importing Techniques are required
Real-World Data is large, unstructured, nested and unclean -> more advanced Data Manipulation and Data Analysis/Visualization Techniques are required
many easy-to-use Pandas methods work best with relatively small and clean Datasets -> real-world Datasets require more General Code (incorporating other Libraries/Modules)
No matter if you need excellent Pandas skills for Data Analysis, Machine Learning or Finance purposes, this is the right Course for you to get your skills to Expert Level! Master your real-world Projects!
This Course covers the full Data Workflow A-Z:
Import (complex and nested) Data from JSON files.
Import (complex and nested) Data from the Web with Web APIs, JSON and Wrapper Packages.
Import (complex and nested) Data from SQL Databases.
Store (complex and nested) Data in JSON files.
Store (complex and nested) Data in SQL Databases.
Work with Pandas and SQL Databases in parallel (getting the best of both worlds).
Efficiently import and merge Data from many text/CSV files.
Clean large and messy Datasets with more General Code.
Clean, handle and flatten nested and stringified Data in DataFrames.
Know how to handle and normalize Unicode strings.
Merge and Concatenate many Datasets efficiently.
Scale and Automate data merging.
Explanatory Data Analysis and Data Presentation with advanced Visualization Tools (advanced Matplotlib & Seaborn).
Test the Performance Limits of Pandas with advanced Data Aggregations and Grouping.
Data Preprocessing and Feature Engineering for Machine Learning with simple Pandas code.
Use your Data 1: Train and test Machine Learning Models on preprocessed Data and analyze the results.
Use your Data 2: Backtesting and Forward Testing of Investment Strategies (Finance & Investment Stack).
Use your Data 3: Index Tracking (Finance & Investment Stack).
Use your Data 4: Present your Data with Python in a nicely looking HTML format (Website Quality).
and many more...
I am Alexander Hagmann, Finance Professional and Data Scientist (> 7 Years Industry Experience) and best-selling Instructor for Pandas, (Financial) Data Science and Finance with Python. Looking forward to seeing you in this Course!