
In this lecture we talk about the layout of the course, what is covered, and how to get the best out of it.
ETL is commonly associated with Data Warehousing projects, but in reality any form of bulk data movement from a source to a target can be considered ETL. ETL testing is a data-centric testing process that validates that the data has been transformed and loaded into the target as expected.
In this lecture we also talk about data testing and challenges in ETL testing.
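As a rough illustration of the kind of check ETL testing involves, the sketch below compares row counts and a column total between a source and a target table. The table names, the column name and the reconcile helper are hypothetical, and an in-memory SQLite database stands in for real source and target systems.

```python
import sqlite3

def reconcile(conn, source_table, target_table, amount_column):
    """Compare row count and column total between source and target."""
    cur = conn.cursor()
    result = {}
    for label, table in (("source", source_table), ("target", target_table)):
        # Table and column names are trusted test fixtures here, not user input.
        cur.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}"
        )
        result[label] = cur.fetchone()
    result["match"] = result["source"] == result["target"]
    return result

# Hypothetical sample data: the load lost one row on the way to the target.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount INTEGER);
    CREATE TABLE tgt_orders (order_id INTEGER, amount INTEGER);
    INSERT INTO src_orders VALUES (1, 100), (2, 250), (3, 75);
    INSERT INTO tgt_orders VALUES (1, 100), (2, 250);
""")
report = reconcile(conn, "src_orders", "tgt_orders", "amount")
print(report)
```

A mismatch in either the count or the total only signals that something went wrong; finding the offending rows would need a row-level comparison on top of this.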
This is one of the most common questions that IT professionals outside the Java/Big Data space ask about their current technologies and their future.
For the ETL and DW world in particular, the future looks better than ever: "Big Data" increases the need for better data processing, and these tools excel at exactly that.
The original intent of the data warehouse was to segregate analytical operations from mainframe transaction processing in order to avoid slowdowns in transaction response times, and minimize the increased CPU costs accrued by running ad hoc queries and creating and distributing reports. Over time, the enterprise data warehouse became a core component of information architectures, and it's now rare to find a mature business that doesn't employ some form of an EDW or a collection of smaller data marts to support business intelligence, reporting and analytics applications.
In this lecture we see what the future of the data warehouse will be in the age of Big Data.
Data is a collection of raw, unorganized facts that refer to an object.
The concept of data warehousing is not hard to understand. The notion is to create a permanent storage space for the data needed to support reporting, analysis, and other BI functions. In this lecture we look at the main reasons for creating a data warehouse and the benefits it brings.
This long list of benefits is what makes data warehousing an essential management tool for businesses that have reached a certain level of complexity.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
Data Warehouse:
Data Mart:
The primary advantages are:
Disadvantages of Data Marts are discussed in this lecture.
This lecture talks about the common mistakes and misconceptions people have about the data warehouse.
In this lecture we see how the Centralized architecture is set up, in which there exists only one data warehouse which stores all data necessary for the business analysis.
In a Federated Architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the information relevant to a department.
The amount of data is reduced in contrast to a central data warehouse. The level of detail is enhanced in this kind of model.
A Multi-Tiered architecture is a distributed data approach. The process cannot be done in one step because many sources have to be integrated into the warehouse.
Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
This is where data is stored prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes subsequent data processing / integration easier. Depending on the business architecture and design, there can be more than one staging area, and they may follow different naming conventions.
Data modeling is the formalization and documentation of existing processes and events that occur during application software design and development.
The following aspects are discussed in this lecture.
•Functional and Technical Aspects
•Completeness in the design
•Understanding DB Test Execution
•Validation
Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data flows and processes, creating a blueprint for construction and/or re-engineering.
An entity–relationship model (ER model) is a data model for describing the data or information aspects of a business domain or its process requirements, in an abstract way that lends itself to ultimately being implemented in a database such as a relational database.
A Dimensional Model is a database structure that is optimized for online queries and Data Warehousing tools. It is comprised of "fact" and "dimension" tables. A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for getting at the facts.
In this lecture we talk about the differences between ER model and the Dimensional Model.
To build a Dimensional Model we need to go through five phases.
Data Modelers have to interact with business analysts to get the functional requirements and with end users to find out the reporting needs.
This model includes all major entities and relationships. It does not contain much detail about attributes and is often used in the initial planning phase.
In this phase the conceptual model is implemented as a logical data model. A logical data model is the version of the model that represents all of the business requirements of an organization.
This is a complete model that includes all required tables, columns, relationships, database properties for the physical implementation of the database.
DBAs or ETL developers prepare the scripts to create the entities, attributes and their relationships.
In this lecture we also talk about how to create database scripts that can be reused multiple times.
A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time. In a data warehouse, dimensions provide structured labeling information to otherwise un-ordered numeric measures.
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the center of a star schema, surrounded by dimension tables.
There are four types of facts.
The numeric measures in a fact table fall into three categories. The most flexible and useful facts are fully additive; additive measures can be summed across any of the dimensions associated with the fact table. Semi-additive measures can be summed across some dimensions, but not all; balance amounts are common semi-additive facts because they are additive across all dimensions except time. Non-additive measures, such as ratios, cannot be meaningfully summed across any dimension.
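The difference between additive and semi-additive measures can be seen in a small sketch. The snapshot rows, account names and figures below are made up for illustration: deposits are additive, while balances are additive across accounts but not across days.

```python
# Hypothetical daily account snapshots.
snapshots = [
    {"day": 1, "account": "A", "balance": 100, "deposit": 100},
    {"day": 1, "account": "B", "balance": 50, "deposit": 50},
    {"day": 2, "account": "A", "balance": 130, "deposit": 30},
    {"day": 2, "account": "B", "balance": 50, "deposit": 0},
]

# Additive: deposits can be summed across every dimension, including time.
total_deposits = sum(row["deposit"] for row in snapshots)

# Semi-additive: balances can be summed across accounts for a single day...
balance_day_2 = sum(row["balance"] for row in snapshots if row["day"] == 2)

# ...but summing them across days counts the same money more than once.
naive_balance_sum = sum(row["balance"] for row in snapshots)

print(total_deposits, balance_day_2, naive_balance_sum)
```

Here the total deposits and the day 2 balance agree, while the sum over both days does not correspond to any real amount of money.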
A star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions.
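A minimal sketch of a star schema, using an in-memory SQLite database: one fact table joined directly to two dimension tables. All table names, column names and rows are hypothetical and only meant to show the shape of a typical query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds foreign keys to the dimensions plus numeric measures.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        revenue INTEGER);
    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 10, 20), (2, 20240101, 5, 25),
        (3, 20240201, 2, 18), (1, 20240201, 4, 8);
""")

# Dimensions supply the labels to group by; the fact table supplies the numbers.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```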
The snowflake schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.
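A galaxy schema, also known as a fact constellation schema, contains multiple fact tables that share dimension tables. It can be viewed as a collection of star schemas, and its dimensions may also be normalized as in a snowflake schema.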
When choosing a database schema for a data warehouse, snowflake and star schema tend to be popular choices. This comparison discusses suitability of star vs. snowflake schema in different scenarios and their characteristics.
A conformed dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization.
In a Junk dimension, we combine miscellaneous low-cardinality indicator and flag fields into a single dimension. This way, we only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased.
According to Ralph Kimball, who coined the term, a degenerate dimension is a dimension key in the fact table that does not have its own dimension table, because all the interesting attributes have been placed in analytic dimensions.
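One possible way to build a junk dimension is to enumerate every combination of the flag values and assign each combination a surrogate key. The flag names and values below are hypothetical; the fact table would then store a single junk key instead of three separate columns.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise sit in the fact table.
flags = {
    "payment_type": ["cash", "card"],
    "is_gift": [False, True],
    "is_rush": [False, True],
}

# One junk dimension row per combination, keyed by a surrogate key from 1.
junk_dimension = {
    combo: key
    for key, combo in enumerate(product(*flags.values()), start=1)
}

def junk_key(payment_type, is_gift, is_rush):
    """Look up the surrogate key for one combination of flag values."""
    return junk_dimension[(payment_type, is_gift, is_rush)]

print(len(junk_dimension), junk_key("card", True, True))
```

Pre-building every combination only works while the number of combinations stays small; with many flags, rows are usually added as combinations are first seen in the data.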
A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension. For instance, a fact table can have several dates, each of which is represented by a foreign key to the date dimension.
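The sketch below shows the date example with an in-memory SQLite database: one physical date dimension is joined twice under different aliases, once per role. The table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE fact_order (
        order_id INTEGER,
        order_date_key INTEGER REFERENCES dim_date(date_key),
        ship_date_key INTEGER REFERENCES dim_date(date_key));
    INSERT INTO dim_date VALUES (1, '2024-03-01'), (2, '2024-03-04');
    INSERT INTO fact_order VALUES (500, 1, 2);
""")

# The same dimension table plays two roles: order date (od) and ship date (sd).
row = conn.execute("""
    SELECT f.order_id, od.full_date, sd.full_date
    FROM fact_order f
    JOIN dim_date od ON od.date_key = f.order_date_key
    JOIN dim_date sd ON sd.date_key = f.ship_date_key
""").fetchone()
print(row)
```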
Slowly Changing Dimensions (SCD) are dimensions that change slowly and unpredictably over time, rather than on a regular, time-based schedule.
There are many approaches to dealing with SCDs. The most popular are:
Dimension, Fact and SCD Type 1, 2 and 3 are reviewed in this lecture.
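As an illustration, here is a minimal in-memory sketch of SCD Type 2, assuming the common definition in which history is kept by expiring the current row and adding a new row for each change. The function name apply_scd2, the column names and the sample customer are all hypothetical.

```python
from datetime import date

def apply_scd2(dimension, natural_key, new_attrs, effective):
    """Expire the current row for natural_key and append a new version."""
    current = next(
        (r for r in dimension
         if r["customer_id"] == natural_key and r["is_current"]),
        None,
    )
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return dimension  # attributes unchanged, keep the current row
        current["end_date"] = effective
        current["is_current"] = False
    dimension.append({
        "surrogate_key": len(dimension) + 1,
        "customer_id": natural_key,
        **new_attrs,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return dimension

dim_customer = []
apply_scd2(dim_customer, 42, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(dim_customer, 42, {"city": "Mumbai"}, date(2024, 6, 1))
apply_scd2(dim_customer, 42, {"city": "Mumbai"}, date(2024, 7, 1))  # no change
for r in dim_customer:
    print(r)
```

A Type 1 change would instead overwrite the city in place and keep a single row, so no history would survive.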
Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from a variety of sources.
ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database.
Extract is the process of reading data from a database.
Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
Load is the process of writing the data into the target database.
ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format or type to another.
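The three steps can be sketched in a few lines. In this hypothetical example the source is a small CSV string, the transformation cleans up names and applies a lookup table, and the target is an in-memory SQLite table; the data, the lookup and the function names are assumptions made for illustration.

```python
import csv
import io
import sqlite3

SOURCE_CSV = "id,name,country\n1, alice ,us\n2,Bob,in\n3,carol,xx\n"
COUNTRY_LOOKUP = {"us": "United States", "in": "India"}

def extract(text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names and map country codes through a lookup table."""
    return [
        (
            int(row["id"]),
            row["name"].strip().title(),
            COUNTRY_LOOKUP.get(row["country"], "Unknown"),
        )
        for row in rows
    ]

def load(conn, records):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM customer ORDER BY id").fetchall()
print(loaded)
```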
Data acquisition is the process of extracting data from different source systems (operational databases), integrating it, transforming it into a homogeneous format and loading it into the target warehouse database. It is simply called ETL (Extraction, Transformation and Loading). Different ETL vendors use different names for their data acquisition process designs.
Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system.
In this lecture we discuss the common questions that are raised about Data Integration and ETL.
ELT (Extract, Load, Transform) is a variation of Extract, Transform, Load (ETL). In ETL, transformation takes place on an intermediate server before the data is loaded into the target. In ELT, the raw data is loaded into the target first and transformed there.
ELT makes sense when the target is a high-end data engine, such as a data appliance, Hadoop cluster, or cloud installation to name three examples. If this power is there, why not use it?
ETL, on the other hand, is designed using a pipeline approach. While data is flowing from the source to the target, a transformation engine (something unique to the tool) takes care of any data changes.
Which is better depends on priorities. All things being equal, it’s better to have fewer moving parts. ELT has no transformation engine – the work is done by the target system, which is already there and probably being used for other development work. On the other hand, the ETL approach can provide drastically better performance in certain scenarios. The training and development costs of ETL need to be weighed against the need for better performance. (Additionally, if you don’t have a target system powerful enough for ELT, ETL may be more economical.)
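To make the contrast concrete, here is a small ELT-style sketch in which the raw data is loaded unchanged and the transformation runs as SQL inside the target. An in-memory SQLite database stands in for the target engine, and the table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw data lands in the target unchanged, messy text and all.
conn.execute(
    "CREATE TABLE raw_sales (sale_id INTEGER, amount_text TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [(1, "10.50", " north "), (2, "4.25", "SOUTH"), (3, "5.25", "North")],
)

# Transform: the target engine does the cleanup and aggregation itself.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT UPPER(TRIM(region)) AS region,
           SUM(CAST(amount_text AS REAL)) AS total
    FROM raw_sales
    GROUP BY UPPER(TRIM(region))
""")

result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```

In an ETL design the same trimming, casting and aggregation would happen in a separate transformation engine before anything reached the target.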
The DW/BI/ETL Testing Training Course is designed for both entry-level and advanced programmers. It covers the foundations of the data warehouse and its concepts; dimensional modeling, including the important aspects of dimensions, facts and slowly changing dimensions; the DW/BI/ETL setup; database testing versus data warehouse testing; the data warehouse workflow with a case study; data checks using SQL; and the scope of BI testing. As a bonus, you also get the steps to set up an environment with Informatica, one of the most popular ETL tools, so that you can perform all the activities on your personal computer and gain first-hand practical knowledge.