
Explore what data science is and how it solves real world problems, with foundational skills, feature engineering, model selection, and hands-on labs using open tools like Jupyter Notebook.
Install Anaconda on your machine to access over 1,500 Python packages, including NumPy and pandas, and use conda to manage environments and launch Jupyter Notebook for data science.
Set up a Python virtual environment with Anaconda, install NumPy, pandas, Matplotlib, seaborn, and scikit-learn, and launch a Jupyter notebook to build machine learning projects.
Explore the Jupyter notebook interface, learn to write Python code in cells, switch between code and markdown, run and restart kernels, and export notebooks for machine learning projects.
Explore the data science methodology and its ten parts, from business understanding to deployment, highlighting iterative data collection, preparation, modeling, evaluation, and feedback-driven backtracking.
Use the CRISP-DM data mining methodology to guide data science projects from business understanding to deployment, addressing big data challenges, data integration, governance, and cost-benefit analysis.
Explore the data understanding phase of CRISP-DM: collect initial data from various sources, describe and explore it, verify data quality, and answer key questions before data preparation.
Prepare data after understanding it by selecting relevant data, cleaning and encoding, extending and formatting for modeling, then apply AI and machine learning algorithms.
Explore the CRISP-DM modelling phase by selecting an algorithm, defining modelling goals, configuring hyperparameters, training and evaluating the model, and planning deployment with monitoring.
Data cleaning matters in data science, as dirty data causes false conclusions and costly failures; prevention costs beat failure costs in real-world data.
Assess data quality by comparing data against predefined constraints and considering business needs; determine validity, accuracy, completeness, consistency, and uniformity to decide appropriate cleaning.
Explore data quality by evaluating data validity against business-specific constraints, including data type, range, mandatory, uniqueness, and set membership, with cross-field validation examples for real-world data.
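The constraint types listed above can be sketched as plain Python checks. This is a minimal illustration, not the course's own code: the field names (`customer_id`, `age`, `status`, `start_year`, `end_year`) and the allowed ranges are hypothetical.

```python
def validate_record(record):
    """Return a list of constraint violations for one record (hypothetical schema)."""
    errors = []
    # Mandatory constraint: the field must be present and non-empty.
    if record.get("customer_id") is None:
        errors.append("customer_id is mandatory")
    # Data-type constraint, then range constraint.
    age = record.get("age")
    if not isinstance(age, int):
        errors.append("age must be an integer")
    elif not 0 <= age <= 120:
        errors.append("age out of range")
    # Set-membership constraint: value must come from a fixed set.
    if record.get("status") not in {"active", "inactive"}:
        errors.append("status not in allowed set")
    # Cross-field validation: two fields must agree with each other.
    start, end = record.get("start_year"), record.get("end_year")
    if start is not None and end is not None and end < start:
        errors.append("end_year precedes start_year")
    return errors


def find_duplicate_ids(records):
    """Uniqueness constraint: return customer_id values that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        cid = record.get("customer_id")
        if cid in seen:
            duplicates.add(cid)
        seen.add(cid)
    return sorted(duplicates)
```

Which constraints apply, and how strict they are, depends on the business rules for the dataset at hand.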
Explore data accuracy within data quality by comparing measured values to true values, distinguishing accuracy from validity, and using examples like a thermometer and model predictions.
Explore data completeness by identifying relevant data for a customer purchase model; learn how missing income or education data can reduce prediction accuracy.
Assess data quality by checking the consistency of data across fields and attributes. Ensure fields agree with each other and recognize that names alone may not predict outcomes.
Assess data uniformity by identifying mixed units and currencies, convert to a single scale, and ensure data quality through validity, accuracy, completeness, and consistency to improve real-world projects.
Learn to ensure data quality through a continuous, iterative cleanup cycle: inspect data against validity, accuracy, completeness, consistency, and uniformity, then clean, verify, and report improvements.
Inspect data quality by profiling against constraints and visualizing with charts to spot outliers, using the scikit-learn library's preprocessing module to check issues.
Cleanse data by removing irrelevant and duplicate records, and fix type mismatches. Impute or flag missing values, manage nonstandard values and outliers, and tailor cleanup steps to business needs.
Transform raw, noisy data into meaningful, quality data through data wrangling to generate valid insights, improve machine learning model performance, and support proper decisions in real-world data science projects.
Transform raw, unstructured data into structured formats through data wrangling, extracting invoice details from PDFs with OCR and Tesseract, preparing data for machine learning projects.
Identify how outliers depart from the main distribution with box plots, revealing positive and negative skewness in salary data, from modest earnings to billionaire incomes.
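A box plot usually draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside the whiskers are shown as outliers. The sketch below applies that rule with the standard library only; the salary figures are made up for illustration, and the 1.5 multiplier is the common convention rather than something the lecture specifies.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical salaries in thousands; one extreme earner skews the data.
salaries = [30, 32, 35, 36, 38, 40, 41, 45, 48, 1000]
outliers = iqr_outliers(salaries)
```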
Learn to tidy and normalize messy data by transforming rows and columns into clear observations, remove duplicates, and merge multiple tables through data wrangling for better analysis and preparation.
Learn how to detect and clean data type mismatches in real world data using pandas, mapping checks, and filters to ensure homogeneous column types.
Identify and handle duplicate data using pandas to clean datasets before machine learning projects, employing duplicated and drop_duplicates on full and subset columns.
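As a rough sketch of the full-row and subset-column cases, pandas provides `DataFrame.duplicated` to flag repeats and `DataFrame.drop_duplicates` to remove them. The small table below is invented for illustration.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Oslo", "Rome", "Oslo", "Paris"],
})

# Flag rows that repeat an earlier row across all columns.
full_dupes = df.duplicated()

# Flag rows that repeat an earlier row on a subset of columns.
name_dupes = df.duplicated(subset=["name"])

# Drop the repeats, keeping the first occurrence in each case.
deduped_full = df.drop_duplicates()
deduped_by_name = df.drop_duplicates(subset=["name"], keep="first")
```

Whether a subset match counts as a true duplicate is a judgment about the data: two customers can share a name without being the same person.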
Identify and handle missing data in pandas by using isnull with any to locate gaps, count them with sum, and fill them with fillna.
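One possible version of that workflow is sketched below. The column names and fill values are assumptions chosen for illustration; filling a numeric column with its mean is only one of several reasonable strategies.

```python
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "income": [50.0, None, 70.0],
    "education": ["BSc", "MSc", None],
})

# Locate gaps: True for each column that contains at least one missing value.
has_missing = df.isnull().any()

# Count gaps per column.
missing_counts = df.isnull().sum()

# Fill gaps: column mean for the numeric field, a placeholder for the text field.
filled = df.fillna({"income": df["income"].mean(), "education": "unknown"})
```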
Identify and rank feature importance to improve model accuracy by using gradient boosting regression and permutation importance, while preprocessing data, encoding categoricals, and selecting meaningful features.
Identify permutation-based feature importance with gradient boosting regression and select features above a threshold. Train models on the reduced feature set and compare accuracy and efficiency.
Learn feature transformation through normalization and standardization to scale numerical inputs for machine learning, reducing bias toward larger values and improving algorithms like linear regression and distance-based methods.
Compare data normalization and standardization as feature transformation methods: normalization maps data to a 0–1 range, while standardization rescales data to zero mean and unit variance without bounding it to a fixed range, for better model performance.
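To make the difference concrete, here is a small standard-library sketch of both transformations. It is an illustration of the formulas, not the scikit-learn scalers used in the labs, and it assumes the input is not constant (otherwise the denominators are zero).

```python
import statistics


def min_max_normalize(values):
    """Normalization: map values linearly onto the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


normalized = min_max_normalize([10, 20, 30])
standardized = standardize([10, 20, 30])
```

Normalized values always fall between 0 and 1, while standardized values are centered on 0 and can be negative or larger than 1.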
Standardize data with a standard scaler to zero mean and unit variance by subtracting the mean and dividing by the standard deviation, then check with a histogram how closely the result follows a Gaussian distribution.
Learn to apply normalization with a min-max scaler from scikit-learn, fitting and transforming data to a 0–1 range, then convert results back to a pandas DataFrame for modeling.
Explore standardization in practice using a standard scaler to center data by subtracting the mean and dividing by the standard deviation, and compare it with normalization to improve model performance.
Learn how one-hot encoding converts categorical variables into binary vectors by creating a column for each unique category, enabling machine learning models to use gender, color, and other categories.
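The idea can be sketched in a few lines of plain Python. The helper name `one_hot_encode` and the color values are hypothetical; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows): one binary column per unique category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
# categories is the column order; each encoded row has exactly one 1.
```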
Apply one-hot encoding to convert categorical variables into numerical features using a one-hot encoder, transforming flight data from 2013 into training and testing datasets.
Explore numerical data, including continuous and discrete types, with examples like height. Recognize time series, categorical, and textual data, and learn that categorical values must be encoded into numbers.
Explore structured datasets by identifying rows and columns, using features as inputs to predict a target in a machine learning workflow, and learn to download, explore, and visualize data.
Learn the pandas library, built on numpy, for data preprocessing with series and dataframes. Master loading, inspecting, cleaning, and renaming data to prep real-world machine learning projects.
Use train-test split to separate data into training and testing sets, train on the training data, and evaluate on unseen testing data to measure model accuracy.
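A hand-rolled sketch of the split is shown below to make the mechanics visible. The function name `split_train_test`, the 20% test fraction, and the fixed seed are assumptions for illustration; scikit-learn ships its own utility for this.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))
train, test = split_train_test(data)
```

The model is fitted on `train` only; `test` stays unseen until evaluation so the measured score reflects performance on new data.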
Learn to build a decision tree for classification and regression by splitting data with entropy and information gain, creating root and leaf nodes to predict outcomes.
Explore how information theory guides decision tree splits by measuring entropy, computing information gain, and choosing the split that maximizes gain, with pure and impure nodes explained.
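Entropy and information gain can be computed directly from class labels, as in this standard-library sketch. The labels are invented; a pure node has entropy 0, and an evenly mixed two-class node has entropy 1 bit.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # two pure children
useless_split = [["yes", "no"], ["yes", "no"]]   # children as impure as parent
```

A decision tree evaluates candidate splits this way and keeps the one with the highest gain.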
Build a decision tree classifier by generating a 500-sample, four-feature dataset for a three-class classification problem. Prepare input features and labels in a pandas dataframe and visualize the tree.
Learn to build decision tree classifiers using entropy, information gain, or gini impurity, implement with scikit-learn, and assess performance with cross-validation and confusion matrices in supervised learning.
Explore regression within supervised learning, distinguish it from unsupervised learning, and learn how linear regression predicts continuous outputs from input vectors, aiming to minimize error on future data.
Explore how ordinary least squares finds the best-fit line in simple and multiple linear regression, predicting salary from experience with the line y = ω₁x + ω₀.
Learn how linear regression finds the global minimum of the sum of squared errors using ordinary least squares, deriving the normal equations to estimate the intercept and slope.
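For simple linear regression, solving the normal equations gives the slope as the covariance of x and y divided by the variance of x, and the intercept as the mean of y minus the slope times the mean of x. A minimal sketch, with made-up data points that lie exactly on y = 2x + 1:

```python
def fit_simple_ols(xs, ys):
    """Closed-form ordinary least squares for one input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of cross-deviations and squared deviations about the means.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
```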
Learn regression evaluation using mean squared error, root mean squared error, and mean absolute error to measure prediction error.
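The three metrics differ only in how they aggregate the per-sample errors. A small sketch with invented values:

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of MSE; expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 7]
y_pred = [2, 5, 9]
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```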
Learn how R-squared evaluates linear regression by measuring how well the regression line fits data, akin to accuracy. The lecture illustrates variance, the fit, and a mouse size–weight example.
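R-squared compares the squared error of the model with the squared error of always predicting the mean. The sketch below assumes the targets are not all identical, since the total variation would then be zero.

```python
def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
baseline = r_squared(y_true, [2.0, 2.0, 2.0])   # always predict the mean
```

A perfect fit scores 1, and a model no better than the mean scores 0.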
Implement simple linear regression to predict salary from years of experience, deriving the slope and intercept from data and minimizing residual error with a train-test split.
Explore logistic regression as a classification method, contrasting it with linear regression and explaining binary and multiclass problems, where predictions are bounded to finite classes.
Explore gradient descent for logistic regression, minimize the cross entropy cost with learning rate tuning, and update weights and bias toward the global minimum. Compare with linear and multivariate regression.
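The update loop can be sketched for a single input feature as follows. The data, the learning rate of 0.1, and the 2,000 epochs are assumptions chosen for illustration, not values from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(w, b, xs, ys):
    """Mean binary cross-entropy cost for weight w and bias b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the cross-entropy cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the cost: mean of (prediction - label), times x for w.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical 1-D data: small x values belong to class 0, large to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

A learning rate that is too large can make the cost oscillate or diverge, while one that is too small makes convergence slow, which is why it is tuned.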
Implement a line-by-line logistic regression for binary classification of whether an employee will leave, using sigmoid and cross-entropy loss, while exploring and visualizing data to identify key features.
Explore logistic regression with data exploration and visualization, encoding categorical salaries using dummy variables, feature selection, train-test split, and evaluation via confusion matrices for employee retention prediction.
FAQ about Data Science:
What is Data Science?
Data science encapsulates the interdisciplinary activities required to create data-centric artifacts and applications that address specific scientific, socio-political, business, or other questions.
Let’s look at the constituent parts of this statement:
1. Data: Measurable units of information gathered or captured from the activity of people, places, and things.
2. Specific Questions: When seeking to understand a phenomenon, whether natural, social, or other, we ask whether we can formulate specific questions whose answers can be posed in terms of patterns observed, tested, or modeled in data.
3. Interdisciplinary Activities: Formulating a question and assessing the appropriateness of the data and findings used to answer it require understanding of the specific subject area. Deciding on the appropriateness of models, and of the inferences drawn from them given the data at hand, requires understanding of statistical and computational methods.
Why Data Science?
The granularity, size, and accessibility of data spanning the physical, social, commercial, and political spheres have exploded in the last decade or more.
According to Hal Varian, Chief Economist at Google:
“I keep saying that the sexy job in the next 10 years will be statisticians and Data Scientist”
“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids.”
************ Course Organization ************
Section 1: Setting up Anaconda and Editor/Libraries
Section 2: Learning about Data Science Lifecycle and Methodologies
Section 3: Learning about Data preprocessing: Cleaning, normalization, transformation of data
Section 4: Some machine learning models: Linear/Logistic Regression
Section 5: Project 1: Hotel Booking Prediction System
Section 6: Project 2: Natural Language Processing
Section 7: Project 3: Artificial Intelligence
Section 8: Farewell