
Are there any prerequisites to this course? What are they?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William (Bill) Inmon:
•Subject Oriented
•Integrated
•Nonvolatile
•Time Variant
The data warehouse architecture is based on a relational database management system server that functions as the central repository for informational data. Operational data and processing is completely separated from data warehouse processing. This central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tools.
Typically, the source data for the warehouse is coming from the operational applications. As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format.
The challenges with the traditional Data Warehouses.
Expensive & time consuming to setup and operate
Difficult to scale/ Data Growth
Handling Data Variety
Not compatible with modern use cases
Compliance and Security
Transactional Data, Legacy Data, Web, 3rd Party Data, Social Media, Demographic/Geo-spatial, Machine/Devices data
Types of Data (Structured, Semi-Structured, Unstructured, Multi-Structured)
This lecture talks about the Data Ingestion methods.
ETL,ELT, Batch processing, Bulk Load, Event Data processing, Custom code using Java/Python/Perl/Pyspark etc…
ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database. Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
ELT is a different way of looking at the tool approach to data movement. Instead of transforming the data before it’s written, ELT leverages the target system to do the transformation. The data is copied to the target and then transformed in place.
ETL vs. ELT – What’s the Big Difference?
What are the different types of Data processing methods?
Batch
Real Time
Streaming
Let's understand what is batch processing of data.
We run batch processes on much smaller accumulations of data - typically less than a minute’s worth of data. These are called as Micro batches.
The real time event data processing.
Data Lake is a repository where the data is stored in its native format.
Where does Data Lake Fit in? and What are the differences?
Understand the Data Lake vs Data Warehouse with an example
Schema on Read and Schema on Write
Differences between different Data Repositories
The Co-existence of Data Warehouse and Data Lakes
Dis-advantages and remediation options of a Data Lake
Data Lake Zones and Data Governance in Data Lakes
Architecture Approach 1
Architecture Approach 2
Architecture Approach 3
Architecture Approach 4
Architecture Approach 5
Which approach should you choose?
What is Data Quality and its Framework?
What are the different Data Quality Dimensions?
Where does the Data Quality fit in in the approaches we have discussed?
Advanced Analytics is the automated or semi-automated way of data or content analysis using sophisticated techniques and tools, typically which are not usually possible with the traditional business intelligence (BI) and these help to discover deeper insights, make predictions, or generate recommendations.
Advanced analytic techniques include those such as data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, sentiment analysis, network and cluster analysis, graph analysis, simulation, complex event processing etc.
Stay tuned for more awesome content!
The objective of this course is to learn/know the fundamentals of the Modern Data Warehouse and what strategies can be used move from a traditional Data Warehouse in combination of Big Data Technologies, Data Lakes and Data Visualization.
A Modern Data Warehouse gives us flexibility to analyze the data we need, in the format we need it by using familiar tools , technologies, concepts like Big Data, Hadoop, Cloud Ecosystems, BI tools and SQL/NoSQL. A Modern Data Warehouse will be able to consume and process variety of data formats like the semi structured, unstructured and multi structured. These data sets come in multiple formats and are generated from a non-transactional systems such as machines, sensors, and customer interaction streams. These systems are not only varied but also, they're producing data at volumes, varieties, and velocities like we've never seen before. This kind of data is not new. We've actually had multi-structured data for a long time, but very few organizations could work with it before because it is so expensive to store and so hard to connect or link them using the traditional data warehouse architectures or models.