Data Warehouse Concepts: Basic to Advanced concepts

Data Warehouse Concepts: Learn the BI/Data Warehouse/Big Data concepts from scratch and become an expert.
4.4 (95 ratings)
1,500 students enrolled
Instructed by Sid Inf IT & Software / Other
$19
$30
37% off
Take This Course
  • Lectures 123
  • Length 7.5 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

How taking a course works

Discover

Find online courses made by experts from around the world.

Learn

Take your courses with you and learn anywhere, anytime.

Master

Learn and practice real-world skills and achieve your goals.

About This Course

Published 4/2016 English

Course Description

****May 2015 Update****

A new section with a data-centric implementation approach for a Data Warehouse project has been added.

In this course, you will learn all the concepts and terminologies related to the data warehouse, such as OLTP, OLAP, dimensions, and facts, along with related concepts such as what is meant by a Star Schema, a Snowflake Schema, the other options available, and their differences. The course also explains how data is managed within the data warehouse and walks through the process of reading and writing data to the warehouse. Later in the course you will learn the basics of data modeling and how to start with it logically and physically. You will also learn the concepts related to facts, dimensions, and aggregations, and the commonly used techniques of ETL. Upon completion of this course, you will have a clear idea of all the concepts related to the data warehouse, sufficient to help you take the next step of becoming an ETL developer or administering a data warehouse environment with the help of various tools. All the best and happy learning!

What are the requirements?

  • Basic understanding of the IT (Software) industry
  • Familiar with the basic concept of Database/RDBMS.

What am I going to get from this course?

  • In this course, you will learn all the concepts and terminologies related to the data warehouse, such as OLTP, OLAP, dimensions, and facts, along with related concepts such as what is meant by a Star Schema, a Snowflake Schema, the other options available, and their differences.
  • The course also explains how data is managed within the data warehouse and walks through the process of reading and writing data to the warehouse. Later in the course you will learn the basics of data modeling and how to start with it logically and physically, as well as the concepts related to facts, dimensions, aggregations, and commonly used ETL techniques.
  • Upon completion of this course, you will have a clear idea of all the concepts related to the data warehouse, sufficient to help you take the next step of becoming an ETL developer or administering a data warehouse environment with the help of various tools.
  • All the best and happy learning!

What is the target audience?

  • This course is intended for freshers who are new to the data warehouse world: application/ETL developers, mainframe developers, database administrators, system administrators, and database application developers who design, maintain, and use data warehouses.
  • Recent graduates looking to get a job in the IT Industry
  • Finance professionals wanting to learn about Data Warehouses for reporting and analysis purposes, businesses looking to update and improve on data storage, analysis, and reporting processes.

What do you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.

Curriculum

Section 1: Introduction
06:20

Thank you for choosing this course. In this lecture you will learn how to navigate and get the best out of this course.


All the Best and Happy Learning!! :)

Section 2: Brief about the Data warehouse
04:25

The original intent of the data warehouse was to segregate analytical operations from mainframe transaction processing in order to avoid slowdowns in transaction response times, and minimize the increased CPU costs accrued by running ad hoc queries and creating and distributing reports. Over time, the enterprise data warehouse became a core component of information architectures, and it's now rare to find a mature business that doesn't employ some form of an EDW or a collection of smaller data marts to support business intelligence, reporting and analytics applications.

In this lecture we see what the future of the data warehouse will be in the age of Big Data.

05:26

The concept of data warehousing is not hard to understand. The notion is to create a permanent storage space for the data needed to support reporting, analysis, and other BI functions. In this lecture we look at the main reasons behind creating a data warehouse and its benefits.

This long list of benefits is what makes data warehousing an essential management tool for businesses that have reached a certain level of complexity.

05:48

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.


05:43

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William (Bill) Inmon:

  • Subject Oriented
  • Integrated
  • Nonvolatile
  • Time Variant

3 questions

Test your understanding on what is a Data Warehouse

Section 3: Business Intelligence
05:37

Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help corporate executives, business managers and other end users make more informed business decisions.

03:34

Business intelligence (BI) is the use of computing technologies for the identification, discovery and analysis of business data - like sales revenue, products, costs and incomes.

BI technologies provide current, historical and predictive views of internally structured data for products and departments, enabling more effective decision-making and strategic operational insights through functions like online analytical processing (OLAP), reporting, predictive analytics, data/text mining, benchmarking and Business Performance Management (BPM). These technologies and functions are often referred to as information management.

08:02
  • Business operations reporting
  • Forecasting
  • Dashboard
  • Multidimensional analysis
  • Finding correlation among different factors
06:24

The most common tools used for business intelligence are discussed in this lecture.

3 questions

Test your understanding on Business Intelligence

Section 4: Data Warehouse Architectures
04:46

In this lecture we see how the centralized architecture is set up, in which there exists only one data warehouse that stores all the data necessary for business analysis.

03:05

In a federated architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the information relevant to a department.

The amount of data is reduced in contrast to a central data warehouse, while the level of detail is enhanced in this kind of model.

03:13

A multi-tiered architecture is a distributed data approach. This process cannot be done in one step, because many sources have to be integrated into the warehouse.

03:57

Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system.

In general, all data warehouse systems have the following layers:

  • Data Source Layer
  • Data Extraction Layer
  • Staging Area
  • ETL Layer
  • Data Storage Layer
  • Data Logic Layer
  • Data Presentation Layer
  • Metadata Layer
  • System Operations Layer
04:43

This is where data is stored prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration. Based on the business architecture and design there can be more than one staging area which can be termed with different naming conventions. 

03:35


02:30

Data warehouses tend to have a high query success rate, as they have complete control over the four main areas of data management systems:

  • Clean data
  • Indexes: multiple types
  • Query processing: multiple options
  • Security: data and access
06:01

There are considerable disadvantages involved in moving data from multiple, often highly disparate, data sources to one data warehouse; these translate into long implementation times, high cost, lack of flexibility, dated information, and limited capabilities.

Other disadvantages are discussed in the video. 

3 questions

Test your understanding on DWH Architectures

Section 5: ODS - Operational Data Store
02:20

An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse. 

An ODS is similar to your short term memory in that it stores only very recent information; in comparison, the data warehouse is more like long term memory in that it stores relatively permanent information.

08:41

The typical definition of an operational data store (ODS) is that it’s a set of logically related data structures within a database. The data within an ODS is integrated, volatile and at a non-historical granular level that is designed to address a set of operational functions for a specific business purpose. The ODS must also be based on the enterprise standards for data management for the organization.

02:18

To understand the purpose of the ODS and when it is an appropriate solution, its characteristics must first be defined.

Characteristics of an Operational Data Store

Subject Oriented: The ODS contains specific data that is unique to a set of business functions. The data therefore represents a specific subject area.

Integrated: Data in the ODS is sourced from various legacy applications. The source data is taken through a set of ETL operations that include cleansing and transformative processes. These processes are based on rules that have been created through business requirements for data quality and standardization.

Current (non-historical): The data in the ODS is up to date and reflects the current status of data from the sourcing applications.

Detail: Data in the ODS is primarily used to support operational business functions. This means that there is a specific level of granularity, based on business requirements, that dictates the level of detail that data in the ODS will have.

04:01

Please refer to the additional resources of this section which contains the Info-graphic on the differences between the ODS, DWH, OLTP, OLAP, DSS and DM (Data Mart).

3 questions

Test your understanding on ODS

Section 6: OLAP
05:54

OLAP (Online Analytical Processing) is the technology behind many Business Intelligence (BI) applications. OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, and predictive “what if” scenario (budget, forecast) planning.

04:33

- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing and maintaining data integrity in multi-access environments, with effectiveness measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
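The contrast can be sketched with a toy example using sqlite3 from the Python standard library; the table, columns, and data here are illustrative, not from the course:

```python
import sqlite3

# One in-memory database stands in for both styles; real deployments
# would use separate, differently tuned systems.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer TEXT, amount REAL, order_date TEXT)")
rows = [(1, "alice", 120.0, "2016-01-05"),
        (2, "bob",    80.0, "2016-01-06"),
        (3, "alice",  40.0, "2016-02-01")]
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# OLTP-style: a short, targeted transaction touching a single row.
con.execute("UPDATE orders SET amount = 85.0 WHERE order_id = 2")

# OLAP-style: an aggregation that scans the whole table.
monthly = con.execute(
    "SELECT substr(order_date, 1, 7) AS month, SUM(amount) "
    "FROM orders GROUP BY month ORDER BY month").fetchall()
print(monthly)  # [('2016-01', 205.0), ('2016-02', 40.0)]
```

The OLTP statement cares about one row and data integrity; the OLAP query cares only about aggregate response time over many rows.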

06:01


06:36

MOLAP (multidimensional online analytical processing) is online analytical processing (OLAP) that indexes directly into a multidimensional database.

03:40

Relational online analytical processing (ROLAP) is a form of online analytical processing (OLAP) that performs dynamic multidimensional analysis of data stored in a relational database rather than in a multidimensional database (which is usually considered the OLAP standard).

02:28

HOLAP (hybrid online analytical processing) is a combination of ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP) which are other possible implementations of OLAP. HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP store, allowing a tradeoff of the advantages of each.

01:28

DOLAP (Desktop OLAP or Database OLAP) provides multidimensional analysis locally on the client machine, on data collected from relational or multidimensional database servers.

4 questions

Test your understanding on OLAP

Section 7: Data Mart
01:47

The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.

00:36

Data Warehouse:

  • Holds multiple subject areas
  • Holds very detailed information
  • Works to integrate all data sources
  • Does not necessarily use a dimensional model but feeds dimensional models.

Data Mart:

  • Often holds only one subject area - for example, Finance or Sales
  • May hold more summarized data (although many hold full detail)
  • Concentrates on integrating information from a given subject area or set of source systems
  • Is built focused on a dimensional model using a star schema.
02:40

The primary advantages are:

  • Data Segregation: Each box of information is developed without changing the other ones. This boosts information security and the quality of data.
  • Easier Access to Information: These data structures provide an easier way of interpreting the information stored in the database
  • Faster Response: Derived from the adopted structure
  • Simple queries: Based on the structure and size of the data
  • Subject-specific detailed data: Might also provide summarization of the information
  • Specific to User Needs: This set of data is focused on the end user's needs
  • Easy to Create and Maintain
04:00
  • Easy access to frequently needed data
  • Creates collective view by a group of users
  • Improves end-user response time
  • Ease of creation
  • Lower cost than implementing a full data warehouse
  • Potential users are more clearly defined than in a full data warehouse
  • Contains only business essential data and is less cluttered.
03:13

Like any other system, data marts have many issues, including functionality, data size, scalability, performance, data access, and consolidation. Since data marts can be broken out by department to focus on individual needs, this approach makes data access, consolidation, and cleansing very difficult.

For instance, a company may have a data mart for each of its departments, including sales, inventory, tracking, shipping, receiving, and production. Combining revenue information from each of these departments into a single data mart can be overwhelming and confusing, due to the volume of data to be analyzed.

02:15

This lecture talks about the mistakes and the misconceptions one has with regard to the data warehouse.

3 questions

Test your understanding on Data marts

Section 8: Metadata
01:57

Metadata is data that describes other data. Meta is a prefix that in most information technology usages means "an underlying definition or description." Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.

01:46

Metadata provides a number of very important benefits to the enterprise, including:


Consistency of definitions: Metadata contains information about data that helps reconcile the difference in terminology such as "clients" and "customers," "revenue" and "sales," etc.


Clarity of relationships: Metadata helps resolve ambiguity and inconsistencies when determining the associations between entities stored throughout the data environment.


Clarity of data lineage: Metadata contains information about the origins of a particular data set and can be granular enough to define information at the attribute level; metadata may maintain allowed values for a data attribute, its proper format, location, owner, and steward. 

05:32

There are three broad categories of metadata:


Business metadata: includes definitions of data files and attributes in business terms. It may also contain definitions of business rules that apply to these attributes, data owners and stewards, data quality metrics, and similar information that helps business users to navigate the "information ocean." 


Technical metadata: is the most common form of metadata.  Technical metadata is a key metadata type used to build and maintain the enterprise data environment. Technical metadata typically includes database system names, table and column names and sizes, data types and allowed values, and structural information such as primary and foreign key attributes and indices.


Operational metadata: contains information that is available in operational systems and run-time environments. It may contain data file size, date and time of last load, updates, and backups, names of the operational procedures and scripts that have to be used to create, update, restore, or otherwise access data, etc.


All these types of metadata have to be persistent and available in order to provide necessary and timely information to manage often heterogeneous and complex data environments such as those represented by various Data Hub architectures.

05:34

In this lecture, we will see what is to be considered when setting up projects with metadata.

01:30

The standard best practices are discussed in this lecture. Though these may vary from project to project and enterprise to enterprise, they can be considered the ground rules.

03:16

A quick summary on what is discussed on Metadata. 

3 questions

Test your understanding on Metadata

Section 9: Data Modeling
02:13

Data modeling is the formalization and documentation of existing processes and events that occur during application software design and development. 

03:53

Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data flows and processes, creating a blueprint for construction and/or re-engineering. 

Section 10: Entity Relational Data Model
03:41

An entity–relationship model (ER model) is a data model for describing the data or information aspects of a business domain or its process requirements, in an abstract way that lends itself to ultimately being implemented in a database such as a relational database.

01:59

An entity can be a real-world object, either animate or inanimate, that is easily identifiable.

For example, in a university database, students, teachers, classes, and courses offered can be considered as entities. All these entities have some attributes or properties that give them their identity.

An entity set is a collection of similar types of entities. An entity set may contain entities with attributes sharing similar values.

For example, a Students set may contain all the students of a university; likewise a Teachers set may contain all the teachers of a university from all faculties. Entity sets need not be disjoint.

04:01

Entity Types: The following types are discussed in this lecture

  • Independent/Fundamental Entity (Strong)
  • Dependent/Attributive Entity (Weak)
  • Associative Entity
02:09

Entity Types: The following types are discussed in this lecture

  • Super type Entity
  • Sub type Entity

01:54

Entities are represented by means of their properties, called attributes. All attributes have values. 

04:00
  • Simple attribute
  • Composite attribute
  • Derived attribute
  • Single-value attribute 
  • Multi-value attribute
02:44

A key is an attribute or collection of attributes that uniquely identifies an entity within an entity set.

  • Super Key
  • Candidate Key
  • Primary Key
02:02

One or more attributes uniquely identify an instance of an entity.

01:15

The association among entities is called a relationship. For example, an employee works_at a department, a customer enrolls for an offer. Here, Works_at and Enrolls are called relationships.

02:37

An ER diagram is a visual representation of data that describes how entities are related to each other.

01:29

A logical data model describes the data in as much detail as possible, without regard to how it will be physically implemented in the database. Features of a logical data model include:

  • Includes all entities and relationships among them.
  • All attributes for each entity are specified.
  • The primary key for each entity is specified.
  • Foreign keys (keys identifying the relationship between different entities) are specified.
  • Normalization occurs at this level.

The steps for designing the logical data model are as follows:

  1. Specify primary keys for all entities.
  2. Find the relationships between different entities.
  3. Find all attributes for each entity.
  4. Resolve many-to-many relationships.
  5. Normalization.
02:26

A physical data model represents how the model will be built in the database. It shows all table structures, including column names, column data types, column constraints, primary keys, foreign keys, and relationships between tables. Features of a physical data model include:

  • Specification of all tables and columns.
  • Foreign keys are used to identify relationships between tables.
  • Denormalization may occur based on user requirements.
  • Physical considerations may cause the physical data model to be quite different from the logical data model.
  • The physical data model will be different for different RDBMSs. For example, the data type for a column may be different between MySQL and SQL Server.

The steps for physical data model design are as follows:

  1. Convert entities into tables.
  2. Convert relationships into foreign keys.
  3. Convert attributes into columns.
  4. Modify the physical data model based on physical constraints / requirements.
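The first three steps can be sketched for a small two-entity model (a hypothetical Student/Course design whose many-to-many relationship has already been resolved in the logical model); all names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Step 1 and 3: entities become tables, attributes become columns.
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
-- Step 2: the Student/Course relationship becomes a table whose
-- foreign keys implement the relationship.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['course', 'enrollment', 'student']
```

The column data types here are SQLite's; step 4 is where such type choices would be adjusted per RDBMS.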
03:30

Here we compare these three types of data models. 

We can see that the complexity increases from conceptual to logical to physical. This is why we always start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how it will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered a single deliverable.

03:33

Following are advantages of an E-R Model:

  • Straightforward relation representation
  • Easy conversion from E-R to other data models
  • Graphical representation for better understanding

Disadvantages of E-R Data Model:

  • No industry standard for notation
  • Popular for high-level design
  • Software cannot usefully query a general ER model
  • Non-intuitive 
3 questions

Test your understanding on ER Models

Section 11: Dimensional Model
04:39

A dimensional model is a database structure that is optimized for online queries and data warehousing tools. It is made up of "fact" and "dimension" tables. A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for getting at the facts.

01:50

Benefits of the dimensional model:

  • Understandability
  • Query performance
  • Extensibility
02:48

A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place, and time. In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures.

02:05

In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the center of a star schema, surrounded by dimension tables.

There are four types of facts:

  • Additive - Measures that can be added across all dimensions.
  • Non-additive - Measures that cannot be added across any dimension.
  • Semi-additive - Measures that can be added across some dimensions but not others.
  • Factless fact tables - The fact table does not have aggregate numeric values or information.
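The additivity distinction can be illustrated in plain Python on a toy balance fact; account balances are the classic semi-additive measure (they add across accounts but not across time), and the rows below are made up for illustration:

```python
# (account, month, balance) rows from a hypothetical balance fact table.
facts = [
    ("checking", "2016-01", 100.0),
    ("savings",  "2016-01", 300.0),
    ("checking", "2016-02", 150.0),
    ("savings",  "2016-02", 250.0),
]

# Valid: summing across the account dimension at one point in time.
jan_total = sum(b for a, m, b in facts if m == "2016-01")
print(jan_total)  # 400.0

# Invalid for a semi-additive measure: summing across time
# double-counts the same money.
naive_total = sum(b for a, m, b in facts)  # 800.0 -- not a real balance

# The usual fix: take the latest snapshot per account, then sum.
latest = {}
for account, month, balance in sorted(facts, key=lambda r: r[1]):
    latest[account] = balance
ending_total = sum(latest.values())
print(ending_total)  # 400.0
```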
01:45

02:28

01:30

02:29

04:01

A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are all called candidate keys. So a surrogate is a candidate key.
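A minimal sketch of surrogate key assignment during a dimension load, in plain Python; the function name and the customer numbers are illustrative:

```python
# Natural key -> surrogate key mapping, as an ETL lookup might keep it.
key_map = {}
next_key = 1

def surrogate_for(natural_key):
    """Return the surrogate key, assigning a new one on first sight."""
    global next_key
    if natural_key not in key_map:
        key_map[natural_key] = next_key
        next_key += 1
    return key_map[natural_key]

# Incoming source rows keyed by a natural key (a customer number).
for cust_no in ["C-100", "C-200", "C-100"]:
    print(cust_no, surrogate_for(cust_no))
# C-100 1 / C-200 2 / C-100 1 -- the repeat reuses its surrogate
```

In a real warehouse the surrogate is a meaningless integer generated by the load process, which insulates the model from changes in the source's natural keys.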

05:23

A star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions.
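A minimal star schema can be sketched with sqlite3: one fact table surrounded by two dimensions. All table and column names, and the data, are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- The fact table holds numeric measures plus a key into each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
INSERT INTO dim_date VALUES (1, 2015), (2, 2016);
INSERT INTO dim_product VALUES (10, 'widget'), (11, 'gadget');
INSERT INTO fact_sales VALUES (1, 10, 5.0), (2, 10, 7.0), (2, 11, 3.0);
""")
# A typical dimensional query: constrain by dimension attributes,
# aggregate the fact's measure.
result = con.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    WHERE d.year = 2016
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(result)  # [('gadget', 3.0), ('widget', 7.0)]
```

Every query follows the same shape: filter on dimensions, join through the keys, aggregate the facts.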

03:25

The snowflake schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

02:20

A galaxy schema, also known as a fact constellation schema, is the combination of a star schema and a snowflake schema.

05:27

When choosing a database schema for a data warehouse, snowflake and star schemas tend to be popular choices. This comparison discusses the suitability of the star vs. snowflake schema in different scenarios, and their characteristics.

06:38

A conformed dimension is a dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization.

03:06

In a junk dimension, we combine miscellaneous low-cardinality indicator fields into a single dimension. This way, we only need to build a single dimension table, and the number of fields in the fact table, as well as its size, can be decreased.
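The idea can be sketched in plain Python: enumerate every combination of the low-cardinality flags once, and let the fact table carry a single key. The flag names here are made up for illustration:

```python
from itertools import product

# Two yes/no indicator fields that would otherwise each sit in the
# fact table (or each get a tiny dimension of its own).
flags = {"is_gift": ["Y", "N"], "is_rush": ["Y", "N"]}

# The junk dimension: one row per combination, keyed by a small integer.
junk_dim = {i + 1: dict(zip(flags, combo))
            for i, combo in enumerate(product(*flags.values()))}

# A reverse lookup lets the ETL replace the two flags with one key.
reverse = {tuple(row.values()): key for key, row in junk_dim.items()}
fact_row_flags = ("Y", "N")          # is_gift=Y, is_rush=N
print(reverse[fact_row_flags])       # the single junk-dimension key: 2
print(len(junk_dim))                 # 4 combinations in total
```

Two two-valued flags yield only four dimension rows, so the whole dimension can be pre-populated up front.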

03:35

According to Ralph Kimball, who originated the term, a degenerate dimension is a dimension key in the fact table that does not have its own dimension table, because all the interesting attributes have been placed in analytic dimensions.

05:30

We start with the basic definitions of a dimension and a fact, and then move on to Slowly Changing Dimensions.

12:30

There are many approaches to dealing with SCDs. The most popular are:

  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column
  • Type 4 - Using a historical table
  • Type 6 - Combining the approaches of types 1, 2 and 3 (1+2+3=6)
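A Type 2 change (creating a new additional record) can be sketched in plain Python; the row layout, effective dates, and customer data below are illustrative:

```python
# Customer dimension rows: Type 2 keeps history with effective dates
# and a current-row flag.
dim = [
    {"surrogate_key": 1, "customer": "C-100", "city": "Austin",
     "valid_from": "2015-01-01", "valid_to": "9999-12-31", "current": True},
]

def apply_type2(dim, natural_key, new_city, change_date):
    """Expire the current row and append a new version (SCD Type 2)."""
    for row in dim:
        if row["customer"] == natural_key and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"surrogate_key": max(r["surrogate_key"] for r in dim) + 1,
                "customer": natural_key, "city": new_city,
                "valid_from": change_date, "valid_to": "9999-12-31",
                "current": True})

apply_type2(dim, "C-100", "Dallas", "2016-04-01")
print(len(dim))                                  # 2 -- history preserved
print([r["city"] for r in dim if r["current"]])  # ['Dallas']
```

A Type 1 change would instead overwrite `city` in place, losing the Austin history; that trade-off is exactly what the type numbers encode.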
03:33

Dimension, Fact and SCD Type 1, 2 and 3 are reviewed in this lecture. 

Step by Step approach to set up the Dimensional Model using a retail case study
07:28
05:50

In this lecture we talk about the differences between ER model and the Dimensional Model.

10 questions

Test your understanding on Dimensional Model

Section 12: DWH Indexes
02:12

Indexing the data warehouse can reduce the amount of time it takes to see query results. When indexing dimensions, you'll want to index on the dimension key. When indexing the fact table, you'll want to index on the date key or the combined date plus time.

04:17

A bitmap index is a special kind of database index that uses bitmaps. Bitmap indexes have traditionally been considered to work well for low-cardinality columns, which have a modest number of distinct values, either absolutely or relative to the number of records that contain the data.
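The mechanics can be sketched with Python integers standing in for bitmaps (one bit per table row); the column and its values are illustrative:

```python
# A low-cardinality column, one value per table row.
region = ["east", "west", "east", "north", "west", "east"]

# Build one bitmap per distinct value: bit i is set when row i matches.
bitmaps = {}
for row_id, value in enumerate(region):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

# Queries become bitwise operations.  Rows where region = 'east':
east = bitmaps["east"]
print(bin(east))                     # 0b100101 -> rows 0, 2, 5

# region = 'east' OR region = 'west' is a single bitwise OR:
either = bitmaps["east"] | bitmaps["west"]
rows = [i for i in range(len(region)) if either >> i & 1]
print(rows)  # [0, 1, 2, 4, 5]
```

With few distinct values, each bitmap is tiny relative to the table, and AND/OR predicates combine in one pass, which is why the structure suits warehouse-style queries.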


Instructor Biography

Sid Inf, Data/ETL Architect

Business Intelligence consultant and trainer with 13+ years of extensive work experience on various client engagements. Has delivered many large data warehousing projects and trained numerous professionals in business intelligence technologies. Has worked extensively on all facets of data warehousing, including requirements gathering, gap analysis, database design, data integration, data modeling, enterprise reporting, data analytics, data quality, data visualization, and OLAP.

Has worked on a broad range of business verticals and holds exceptional expertise in various ETL tools such as Informatica PowerCenter, SSIS, ODI and IDQ, as well as Data Virtualization, DVO, and MDM.

Ready to start learning?
Take This Course