
Learn how to clean a legacy e-commerce orders dataset from Dataflow, fix duplicates, normalize dates and segments, and compute ten KPIs from a trusted fact table.
Launch the sql workbench, create a clean workspace, and run a simple core query to inspect the raw e-commerce data for mixed date formats, missing values, and extraction issues.
Assess the silver normalize temp view against the brief rules to diagnostically count how many rows would fail each rule, yielding a one-row data quality summary.
Baseline count keeps 9978 rows after filtering, with 308 rows rejected by the amount rule and zero failures elsewhere, validating the filter and establishing a future monitoring anchor.
Define the clean table as the one-row-per-unique combination of eight business fields, deduplicating identical rows for final inspection and KPI-ready dashboards.
Clean the messy legacy orders table by profiling raw data, normalizing the silver layer, enforcing explicit business rules, and removing duplicates to enable kpis and tell the business story.
Calculate KPI six as the premium and platinum share of GMV. Sum order amounts from premium or platinum orders and divide by total GMV across all segments.
Compute the first KPI, average order value (AOV), by building a temp KPI view, rounding to two decimals, and casting to varchar for a unified API schema.
Assess KPI two, the gross margin, to gauge profitability after expenses and pricing power. Compute (revenue − cost) / revenue, weighted by GMV on a clean dataset, six decimals.
Compute KPI three return rate by summing is_return to count returns and dividing by total valid unique orders, yielding 0.079938 (about 7.99%).
Compute the median order amount KPI by querying the 50th percentile of order values, rounding to two decimals, casting to varchar, yielding 6522 as the overall metric.
Determine the median order amount to reveal the typical customer spend, using the 50th percentile to minimize outliers, and compare with average order value to inform forecasting and leadership insights.
Assess KPI five, the return rate by payment method, to see if certain methods attract more returns and what that reveals about customer experience and risk.
Evaluate KPI six by analyzing the high value segment GMV share to show how much revenue comes from premium and platinum customers, about 64.9% of total GMV, indicating concentration risk.
Interpret KPI seven by calculating realized margin per order, attaching the correct margin by segment, and flagging failures to reveal a 0.29% below-margin rate.
Identify KPI eight, the top GMV month, by converting order dates to year-month, aggregating total GMV per month, and selecting December 2024 as the top result.
Consolidate ten KPI views into a single canonical KPI results table via a union all pattern on a clean deduplicated dataset, enabling automated grading and a concise business narrative.
Audit the detailed results page to confirm every KPI value matches the reference, with a pass status, 100% score, and per channel metrics.
This course is built to give you a publishable portfolio project as the end product — a complete SQL data-cleaning and KPI pipeline you can put on GitHub, link on LinkedIn, and confidently talk through in interviews.
It’s a real-world simulation built around one messy dataset and a business brief with a clear target: deliver ten KPIs that are trustworthy enough to go on a dashboard.
Most SQL “data cleaning” courses either stay at the level of syntax drills, or they use clean toy datasets where nothing breaks. That’s not what you face in real data teams.
In this course you’ll work through the same workflow you’d use on a real project:
Read the brief properly so you know what “correct” means
Explore the raw schema and spot the mess early (mixed date formats, typos in categories, missing values, duplicates)
Build a typed, safer silver layer where errors surface in a controlled way
Enforce the business rules and deduplicate into one trusted clean_table
Compute and standardise all KPI outputs into a consistent results table
Validate results, understand tolerances/rounding, and debug mismatches like a professional
Finish by turning the whole pipeline into a portfolio-ready GitHub project, with a clean repo structure, a strong README, and proof of results
Course outline (high level):
Section 00: Course Introduction
Section 01: The Verulam Blue Mint Environment
Section 02: Understanding the Challenge Brief
Section 03: Exploring Source Data Schema
Section 04: Data Cleaning I – Sampling & Completeness
Section 05: Data Cleaning II – Silver Layer & Normalisation
Section 06: Data Cleaning III – Business Rules & Deduplication
Section 07: Understanding the KPIs
Section 08: Computing KPIs
Section 09: Results
Section 10: Portfolio project deployment (repo + README + LinkedIn-style project story)
By the end, you won’t just know “how to clean data using SQL”. You’ll have an end-to-end portfolio project you can explain clearly: what was wrong with the data, what you changed, what rules you enforced, and why your KPIs can be trusted.