
Welcome to Python Fundamentals for Data Science, a hands-on course designed to build your confidence in core Python programming. You'll learn essential concepts like variables, loops, functions, and object-oriented programming—skills that form the backbone of any data science workflow. We'll explore real-world datasets using libraries like NumPy, Pandas, and Matplotlib, uncover patterns through visualizations, and perform a complete exploratory data analysis (EDA). By the end, you'll even build a simple predictive model—giving you a solid foundation in Python for data-driven problem solving.
This lesson covers why Python matters, what Jupyter Notebooks are, and how to access Google Colab.
Access additional data science content on the EnhanceImpact YouTube channel: https://www.youtube.com/@EnhanceImpact.
The Python scripts and sample data can be found in the course's GitHub repository: https://github.com/EnhanceImpact/Python-for-Data-Science.
In this section, you will create variables, use print(), and learn how to use f-strings in your code.
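As a small taste of what this looks like, here is a minimal sketch. The variable names and values are made up for illustration and are not taken from the course notebooks.

```python
# Hypothetical values chosen only for illustration.
course = "Python Fundamentals"
lessons = 12
hours = 7.5

# An f-string embeds variables (and optional format specs) in a string literal.
summary = f"{course}: {lessons} lessons, {hours:.1f} hours"
print(summary)
```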
Learn to perform calculations in Python with addition, subtraction, multiplication, division, floor division, and modulus.
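The six operators can be sketched side by side. The numbers 17 and 5 are arbitrary examples.

```python
a, b = 17, 5  # arbitrary example values

total = a + b        # addition
difference = a - b   # subtraction
product = a * b      # multiplication
quotient = a / b     # true division always returns a float
floor_q = a // b     # floor division drops the fractional part
remainder = a % b    # modulus gives what is left over

print(total, difference, product, quotient, floor_q, remainder)
```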
Comparison and logical operators in Python let you test conditions and combine expressions, making them essential for decision-making and control flow.
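A brief sketch of how these operators combine. The scenario (an age check with a ticket) is invented for illustration.

```python
age = 25            # hypothetical inputs
has_ticket = True

is_adult = age >= 18                      # comparison produces a bool
can_enter = is_adult and has_ticket       # both conditions must hold
is_discounted = age < 12 or age >= 65     # either condition is enough
must_buy_ticket = not has_ticket          # negation

print(can_enter, is_discounted, must_buy_ticket)
```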
Most programs need to collect information from users, such as a name, an age, or a number. This lesson covers the input() function, which reads what the user types.
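One point worth seeing in code is that input() always returns a string, so numbers need converting. The helper below is a hypothetical sketch; its `read` parameter exists only so the function can be tried without a keyboard.

```python
def ask_age(read=input):
    """Prompt for an age and return it as an int.

    `read` defaults to the built-in input(); passing another callable
    is just a convenience for trying the function non-interactively.
    """
    raw = read("How old are you? ")  # input() always returns a string
    return int(raw.strip())          # convert the text to a number
```

In a notebook you would simply call `ask_age()` and type a value when prompted.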
Data structures are ways to organize and store data. Algorithms are step-by-step methods for solving problems using that data. Learning data structures and algorithms (DSA) helps you choose the right tools to write code that runs faster and more efficiently.
Control flow is the order in which your code runs. It lets your program make decisions (using if, elif, else), repeat actions (with for and while loops), and stop when needed. In simple terms, it’s how you tell Python what to do next depending on the situation.
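The three ideas (deciding, repeating, stopping) can be sketched in a few lines. The function name and the numbers are illustrative only.

```python
def classify(n):
    # Decide between branches with if / elif / else.
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# Repeat an action for each item with a for loop.
labels = []
for n in [-2, 0, 3]:
    labels.append(classify(n))

# Repeat while a condition holds, and stop early with break.
count = 5
while count > 0:
    if count == 2:
        break
    count -= 1

print(labels, count)
```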
List comprehensions are a quick way to make new lists in Python by looping through something and applying a rule in just one line. They let you write shorter, cleaner code compared to using a full for loop.
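To make the comparison concrete, here is the same task written both ways, using a made-up list of numbers.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Full for loop: square the even numbers.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n ** 2)

# Equivalent list comprehension in one line.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

print(even_squares)
```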
Functions are like reusable mini-programs inside your code. They let you group steps together, give them a name, and run them whenever you need—without rewriting the same code over and over.
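A minimal sketch of defining and reusing functions. The names `mean` and `describe` are hypothetical examples, not functions from the course.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def describe(values, label="data"):
    """Build a one-line summary, reusing mean() instead of repeating code."""
    return f"{label}: n={len(values)}, mean={mean(values):.2f}"

print(describe([2, 4, 6], label="scores"))
```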
Classes are blueprints for creating objects in Python. They let you bundle data (like variables) and actions (like functions) together so you can build your own custom data types. Think of a class as a template, and each object you make from it as a copy of that template with its own details.
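The template idea might look like this in practice. The `House` class and its values are invented for illustration.

```python
class House:
    """A hypothetical custom type bundling data with behavior."""

    def __init__(self, neighborhood, price, sqft):
        # Data: each object stores its own details.
        self.neighborhood = neighborhood
        self.price = price
        self.sqft = sqft

    def price_per_sqft(self):
        # Action: a method that works on the object's own data.
        return self.price / self.sqft

# Two objects made from the same template, each with its own values.
first = House("Midtown", 300_000, 1_500)
second = House("Decatur", 240_000, 1_200)
print(first.price_per_sqft(), second.neighborhood)
```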
Modules in Python are like toolboxes filled with ready-made functions. The math module gives you extra math tools, random helps you create random numbers, and datetime lets you work with dates and times. Instead of building everything from scratch, you just open the right toolbox and use what you need.
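Here is a short sketch that opens each of the three toolboxes once. The radius, seed, and dates are arbitrary.

```python
import math
import random
from datetime import date, timedelta

# math: extra math tools such as pi.
radius = 2.0
area = math.pi * radius ** 2

# random: seeding makes the "random" result repeatable.
random.seed(0)
roll = random.randint(1, 6)  # a whole number from 1 to 6, inclusive

# datetime: date arithmetic.
start = date(2024, 1, 1)
deadline = start + timedelta(days=30)

print(round(area, 2), roll, deadline)
```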
File I/O (Input/Output) is how Python reads from and writes to files on your computer. You can open a file to read its contents, write new information, or add more text—just like opening a notebook to read, write, or add notes.
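The three modes (write, append, read) can be sketched as follows. The file name is hypothetical, and a temporary folder is used so nothing on your machine is overwritten.

```python
import os
import tempfile

# Hypothetical file in a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# "w" creates the file (or overwrites it) and writes new content.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")

# "a" appends to the end without erasing what is already there.
with open(path, "a", encoding="utf-8") as f:
    f.write("second line\n")

# "r" reads the contents back.
with open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
```

Using `with` closes the file automatically, even if an error occurs partway through.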
This section of the course helps students get comfortable with the core Python needed to work with DataFrames.
NumPy and Pandas are two powerful Python libraries for working with data. NumPy makes it easy to handle big sets of numbers and do fast math, while Pandas helps organize that data into tables (DataFrames) so you can clean, explore, and analyze it. Together, they make data science simpler and more efficient.
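A minimal sketch of the two libraries working together, assuming both are installed. The prices and square footage are invented.

```python
import numpy as np
import pandas as pd

# NumPy: fast element-wise math on whole arrays (made-up numbers).
prices = np.array([250_000, 310_000, 199_000])
sqft = np.array([1_200, 1_550, 980])
price_per_sqft = prices / sqft

# Pandas: the same data organized as a labeled table.
df = pd.DataFrame({"price": prices, "sqft": sqft})
df["price_per_sqft"] = df["price"] / df["sqft"]
mean_price = df["price"].mean()

print(df)
print(mean_price)
```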
This lesson covers loading a CSV file into a pandas DataFrame, subsetting data, exploring the number of rows and columns, checking for missing values, feature engineering, and more.
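Those steps might look roughly like this. The CSV content is a made-up stand-in held in memory; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """neighborhood,price,sqft
Midtown,300000,1500
Decatur,,1200
Buckhead,450000,2000
"""

df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape          # number of rows and columns
missing = df.isna().sum()          # missing values per column

# Subset: keep larger homes and only two of the columns.
subset = df[df["sqft"] >= 1500][["neighborhood", "price"]]

# Feature engineering: derive a new column from existing ones.
df["price_per_sqft"] = df["price"] / df["sqft"]

print(n_rows, n_cols)
print(missing)
print(subset)
```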
In this section, learners explore the fundamentals of data visualization using Matplotlib and Seaborn, two powerful Python libraries for turning raw data into meaningful insights. Through hands-on examples with a sample housing dataset, students learn how to:
Create line plots to visualize housing trends over time
Build bar charts to compare categorical features like neighborhood or property type
Customize plot elements such as titles, labels, colors, and legends for clarity and impact
Understand the difference between Matplotlib’s low-level control and Seaborn’s high-level aesthetics
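The first three skills in the list above can be sketched with Matplotlib alone (Seaborn is not shown here). The housing numbers are invented, and the `Agg` backend line is only there so the sketch runs without a display; in a notebook you can drop it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical housing data for illustration only.
years = [2019, 2020, 2021, 2022]
median_price = [240, 255, 290, 320]      # in thousands of dollars
neighborhoods = ["North", "East", "West"]
listings = [12, 7, 15]

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot: a trend over time.
ax_line.plot(years, median_price, marker="o", color="tab:blue",
             label="Median price")
ax_line.set_title("Median price over time")
ax_line.set_xlabel("Year")
ax_line.set_ylabel("Price ($1000s)")
ax_line.legend()

# Bar chart: comparing a categorical feature.
ax_bar.bar(neighborhoods, listings, color="tab:orange")
ax_bar.set_title("Listings by neighborhood")
ax_bar.set_xlabel("Neighborhood")
ax_bar.set_ylabel("Number of listings")

fig.tight_layout()
```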
In this section, students get a brief introduction to Visual Studio Code (VS Code)—a lightweight, flexible code editor widely used in data science and software development. We explain how VS Code supports both .py files (standard Python scripts) and .ipynb files (Jupyter Notebooks), and when to use each format.
Use .py files for clean, modular scripts—ideal for production code, automation, and reproducible workflows.
Use .ipynb files for interactive exploration, visualizations, and step-by-step analysis—perfect for prototyping and teaching.
We also touch on other popular platforms like JupyterLab, Google Colab, and Kaggle Notebooks, helping students understand where and how data science work can happen across different environments.
This section helps learners choose the right tools for their workflow and understand how file formats shape the way we write, share, and run code.
This section of the course explains supervised versus unsupervised learning with practical examples of each.
This section introduces two core types of supervised learning: classification and regression. Students learn how classification models predict categories (e.g., spam vs. not spam), while regression models predict continuous values (e.g., house prices). Through hands-on examples, learners explore how to choose the right approach based on the problem type, evaluate model performance, and interpret predictions in context.
In this section, students learn how to locate and import datasets from a variety of sources to power their analyses. They explore how to load data from CSV, Excel, and JSON files, as well as connect to APIs for dynamic data retrieval. The module also introduces trusted public repositories like Kaggle, UCI Machine Learning Repository, Google Dataset Search, GitHub, and government portals. By the end, learners will be able to confidently access and load real-world datasets into Python for exploration and modeling.
This section covers the foundational steps for preparing data for analysis and modeling. Students learn how to handle missing values through imputation, apply feature scaling to standardize inputs, and address imbalanced data to improve model fairness and accuracy. By mastering these techniques, learners build reliable, reproducible workflows that set the stage for effective machine learning.
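Imputation and scaling can be sketched with scikit-learn as below. This is an illustration on a tiny invented array, assuming mean imputation and standardization; handling imbalanced data is not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix with one missing value.
X = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
])

# Imputation: replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Feature scaling: rescale each column to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

print(X_imputed)
print(X_scaled)
```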
In this section, students learn how to access real-world data from the U.S. Census Bureau website, focusing on the 2022 ACS PUMS dataset for Georgia. We walk through how to load and explore the data, and revisit the GitHub repository where students can open the full exploratory data analysis (EDA) and machine learning notebooks, as well as download the dataset used in both workflows. This module emphasizes practical skills in sourcing public data, performing thorough EDA, and preparing data for modeling.
Students will perform a complete EDA workflow using Census data, including:
Remapping coded values to real-world labels using dictionaries
Cleaning the dataset by dropping unused and high-leakage features
Engineering new features, such as categorizing income levels
Visualizing relationships between income and other variables (e.g., marital status)
Applying statistical tests like the chi-square test to assess significance between income and categorical features
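The workflow above can be sketched on a toy table. Everything here is an assumption made for illustration: the rows are invented, the code-to-label mapping and the 50,000 income cutoff are hypothetical, and the course notebooks may do each step differently. With so few rows the chi-square result is not meaningful; the sketch only shows the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented rows standing in for Census records.
df = pd.DataFrame({
    "MAR": [1, 1, 5, 5, 1, 5, 1, 5],
    "income": [82000, 61000, 24000, 31000, 95000, 58000, 45000, 19000],
})

# Remap coded values to readable labels with a dictionary (assumed mapping).
mar_labels = {1: "Married", 5: "Never married"}
df["marital_status"] = df["MAR"].map(mar_labels)

# Clean: drop the coded column once it is no longer needed.
df = df.drop(columns=["MAR"])

# Engineer a feature: bucket income into two levels (assumed cutoff).
df["income_level"] = pd.cut(
    df["income"], bins=[0, 50000, float("inf")], labels=["<=50K", ">50K"]
)

# Cross-tabulate, then test for association with a chi-square test.
table = pd.crosstab(df["marital_status"], df["income_level"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(p_value)
```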
In this section, students learn how logistic regression models binary outcomes—like predicting whether someone earns above or below a certain income level. We explore the math behind the sigmoid function, how to interpret model coefficients, and how to implement logistic regression using scikit-learn. Learners will also evaluate model performance using metrics like accuracy, precision, recall, and ROC AUC, and understand when logistic regression is a good fit for classification problems.
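The sigmoid idea can be sketched in plain Python before turning to scikit-learn. The intercept and coefficient below are made-up numbers, not fitted values.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, intercept, coef):
    """Probability of the positive class for a single feature x."""
    return sigmoid(intercept + coef * x)

# Hypothetical model parameters for illustration.
intercept, coef = -1.0, 0.5
probability = predict_proba(2.0, intercept, coef)

# Interpreting the coefficient: each one-unit increase in x multiplies
# the odds of the positive class by exp(coef).
odds_ratio = math.exp(coef)

print(probability, odds_ratio)
```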
In this section, students learn how Random Forest builds powerful classification models by combining multiple decision trees. We explore how it reduces overfitting, handles complex data, and improves accuracy through ensemble learning. Learners implement Random Forest using scikit-learn, tune hyperparameters, and evaluate performance using metrics like confusion matrices and ROC AUC. This module emphasizes interpretability, robustness, and practical application in real-world datasets.
In this section, students explore XGBoost, a high-performance gradient boosting algorithm known for speed and accuracy. They’ll learn how XGBoost builds trees sequentially to correct errors, handles missing data, and supports regularization to prevent overfitting. Using scikit-learn and XGBoost’s native API, learners will implement models, tune hyperparameters, and evaluate performance using metrics like ROC AUC and confusion matrices. This module emphasizes practical modeling techniques for competitive, real-world datasets.
In this section, students learn how to split data for model evaluation using train-test split and K-Fold cross-validation. We compare the two methods, highlighting how K-Fold provides a more reliable estimate of model performance, especially when working with smaller datasets. Learners explore when to use each approach and how to implement both in scikit-learn.
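A side-by-side sketch of the two approaches, using a synthetic dataset from `make_classification` in place of the course data. The split sizes and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Train-test split: one held-out set, one score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# K-Fold: every row is used for testing exactly once, giving five scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(holdout_score)
print(cv_scores.mean(), cv_scores.std())
```

The spread of the five K-Fold scores gives a sense of how much a single split's score could vary.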
In this section, students learn how to improve model performance through hyperparameter tuning—the process of adjusting settings like tree depth or learning rate to optimize results. We introduce Grid Search with Cross-Validation (GridSearchCV), a method that systematically tests combinations of hyperparameters across multiple data splits to find the best configuration. Learners implement GridSearchCV using scikit-learn, evaluate results with metrics like ROC AUC, and understand how tuning impacts model accuracy, generalization, and fairness.
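A minimal GridSearchCV sketch on synthetic data. The grid values, the three folds, and the choice of a Random Forest are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in the grid is tried: 2 depths x 2 forest sizes = 4.
param_grid = {"max_depth": [2, 4], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="roc_auc",   # rank combinations by ROC AUC
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, best_auc)
```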
In this section, students learn how to evaluate classification models using tools like the confusion matrix, ROC curves, and AUC scores to assess predictive performance. We also introduce feature importance—a technique for identifying which variables most influence model decisions. By combining statistical metrics with interpretability tools, learners gain a deeper understanding of how models behave and how to select the best-performing and most transparent model for their data.
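The confusion matrix and AUC can be computed on a handful of invented labels and scores, as sketched below. Feature importance is not shown.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# ROC AUC uses the scores rather than the hard predictions.
auc = roc_auc_score(y_true, y_score)

print(cm)
print(auc)
```

AUC here is the share of (positive, negative) pairs in which the positive example received the higher score.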
In this section, students walk through a complete machine learning pipeline—from data prep to model deployment. We start by loading essential libraries and building a scikit-learn pipeline to streamline preprocessing and prevent data leakage. Learners perform hyperparameter tuning using GridSearchCV, then pickle the best model for reuse. We evaluate performance using the confusion matrix, ROC curve, and AUC score, and analyze feature importance to understand which variables drive predictions. This hands-on module emphasizes reproducibility, interpretability, and production-ready modeling.
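The pipeline, tuning, and pickling steps might fit together as sketched below. This uses synthetic data and a scaled logistic regression as stand-ins; the step names `scale` and `clf` and the grid of `C` values are hypothetical.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the course dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Inside a pipeline the scaler is refit on each training fold only,
# so no information from validation folds leaks into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune a pipeline step with the "<step>__<parameter>" naming pattern.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3,
                      scoring="roc_auc")
search.fit(X_train, y_train)

# Pickle the best fitted pipeline, then restore it for reuse.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)

same_predictions = bool(
    (restored.predict(X_test) == search.best_estimator_.predict(X_test)).all()
)
print(search.best_params_, same_predictions)
```

To save to disk instead of memory, `pickle.dump` with an open binary file works the same way. Only unpickle files you trust.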
This hands-on course guides students from the ground up—starting with Python programming fundamentals and building toward applied machine learning for classification tasks. Designed for beginners and aspiring data scientists, the course blends clarity, rigor, and real-world relevance to ensure students gain both technical skills and practical insight.
Students begin by mastering Python essentials, then transition into working with real datasets from the U.S. Census Bureau, performing exploratory data analysis (EDA), feature engineering, and statistical testing. From there, the course introduces core classification models—Logistic Regression, Random Forest, and XGBoost—alongside evaluation techniques like confusion matrices, ROC curves, and AUC scores.
Learners build scikit-learn pipelines to prevent data leakage, apply K-Fold cross-validation, and use GridSearchCV for hyperparameter tuning. The course wraps with model comparison, feature importance analysis, and pickling the best model for future use in production or further experimentation.
Along the way, students are introduced to tools like Visual Studio Code and gain insight into when to use .py versus .ipynb files and how to choose the right platform for their goals.
By the end of the course, students will be equipped to write clean Python code, build and evaluate classification models, and make data-driven decisions with confidence across a variety of data science contexts.