What is Database Normalization?

Kirill Eremenko
A free video tutorial from Kirill Eremenko
Data Scientist
4.5 instructor rating • 110 courses • 1,346,797 students

Lecture description

In this tutorial, you will learn why you should know how the databases design and why it is essential for data scientists

Learn more from the full course

SQL & Database Design A-Z™: Learn MS SQL Server + PostgreSQL

Learn Both SQL Server & PostgreSQL By Doing. Enhance Your Data Analytics Career With Real World Data Science Exercises

12:31:26 of on-demand video • Updated May 2020

  • Create basic SQL Queries
  • Create advanced SQL Queries
  • Create Left, Right, Inner and Full Outer joins
  • Create new tables, alter existing tables in Databases
  • Normalize Databases
  • Understand database design
  • Understand first, second and third normal form schemas
English Hello and welcome back to the course on databases. Today I've got a very exciting section prepared for you we're talking about database design and specifically we're going to be delving into the realm of normalization. I've got some very exciting slides prepared ahead and I can't wait to get started. You excited? Let's jump straight into it. All right. So what is normalization Why is that important. Well the definition of normalization according to Wikipedia is the process of organizing the columns or the attributes and tables or the relations of a relational database to reduce data redundancy and improve data integrity. All right so that's quite a broad definition and it promises a lot but what does it actually mean and what does it actually do. Well throughout the section we're going to aim to understand how we can better structure our databases to remove the potential for any update insertion and deletion anomalies and also just speed up the process of the way that these databases are utilized. So let's have a look in a quick trivial example which will show us or guide us in the right direction. So here we've got a table and this is a table on items that are sold at a clothing store and we've got shorts jeans and long pants that are being sold. We've got them in different colors and prices and taxes. So can you see anything wrong with this table. Well first of all right away we can see that this table in the colors column has duplicate colors right so has instead of just one value sometimes. Or in this case and all the rows has two values instead of one so yellow green blue black yellow green and blue brown So that is a potential for error right. So we want just one value in our columns in a database rather than sometimes having two values sometimes having one sometimes having three. So right away we can see that that doesn't feel right that something could go wrong there. So what else can we see that could potentially be wrong here. Well if we look closely we'll see that these two rows are identical. Right. And in a database we don't want to have identical rows because the second row in this case doesn't add any value. So those are some obvious things and in the section we'll definitely learn how to understand the theory and this design behind how we want to get rid of these obvious issues of the table. But actually there are some other things about this table that are not so obvious but could also be improved. Let's say that we are operating under the condition that color does not impact the price in any way. So this in this store color of the item of clothing does not change the price for those yellow green black or whatever other color doesn't matter. And let's say we want to add a new role to this table. Right. So let's say we want to add a new row and we want to add jeans which are purple Now when we want to have this new row What we'll have to do is we'll have to fill in all four columns we will have to fill in item and type jeans the color which is purple then we'll have to fill in the price right and we'll have to fill in the tax. But the thing is because color doesn't impact price and therefore doesn't impact tax. We already know what the price and the tax should be because we have a column row. With jeans already in it. So we've got a row the second row has genes and therefore we know that the price is 35 and the taxes $3 50. And therefore when we filling this information manually or updating this information as we're adding this rows this fifth row our table. There are two things happening. First of all we are duplicating information so we're performing work that isn't necessary. This information is already present in our database. There's no point there's no purpose in us actually writing that information and it's it's extra work that we're performing which takes time which takes resources and generally slows the process up. And the second thing that happens is that there's a potential for error. Right we already know that price is thirty-five dollars and the tax is 3 50 for jeans. But in this case when writing this row in What if we specify the incorrect price or the incorrect tax rate. Why create this additional potential for error. And so that's another kind of those things that is not that obvious in the database and that's something we'll be looking at throughout the section how to understand these things and how to structure and design your databases in a way that will prevent these things from happening. And so the obvious question here is why Kirill, Why do we need to learn about database design. We're data scientists this is the realm of database administrators and data engineers and a data architects and the custodians of the data. Well the reason for that is pretty simple. So there's actually a couple of reasons. First of all, here I'll give you four reasons. First of all by learning about database design and going through this section of the course you definitely don't have to go through the section of course you can totally just skip it and you probably will have a good career in data science. But knowing these things will really boost your career and will help you be a much more proficient data scientists and why is that. Well first of all you will understand why other databases that you are working with have been designed in a certain way. You'll have a better understanding of not just how they work but exactly why that happened and what's the thinking that went into that. Second thing is that you will understand how to navigate your way around those database. So even just by being introduced to a new database right away been you know some very brief exploratory work. You'll be able to tell what kind of structure this database has and therefore you'll be able to navigate it better because you'll know all of these templates and we're going to cover. And next this is point number three you will be able to understand potential vulnerability and flaws in the databases that you're working with. Right. And so you'll know what to look for you'll be able to make recommendations on how to fix those things and that's an additional skill something not many people would be able to do. So people that don't know these things won't be able to understand and make those recommendations. And finally point number 4, you'll be able to communicate effectively with the data custodians be those database administrators database architects database engineers and basically anybody who's behind the database design and who's working with the database on a daily basis. So an effective communication will help you get results faster. So there you go those are quite some substantial reasons why to indeed go through this section and learn these things. And plus it's actually very fun so I'm quite excited about sharing this knowledge with you so if you are on board I look forward to seeing you on the next tutorial And until next time, happy analyzing.