
Discuss the need for quality pipeline design for big data pipelines. Explore the key activities in building such a design
Familiarize yourself with the covered topics, out-of-scope topics and prerequisites for the course.
Discuss how serverless technologies from cloud providers relate to the contents of this course.
Describe the overall pipeline network and the building blocks in the network
Discuss the features and challenges for the data acquisition block in a big data pipeline
Discuss the features and challenges for the data transport block in a big data pipeline
Discuss the features and challenges for the data processing block in a big data pipeline
Discuss the features and challenges for the data storage block in a big data pipeline
Discuss the features and challenges for the data serving block in a big data pipeline
Discuss the features and challenges for the pipeline infrastructure in a big data pipeline
Discuss the features and challenges for the operations block in a big data pipeline
Study the overall System Design Process to be followed for Big Data Pipeline Design
Explore the functional requirements provided for the use case and look for key indicators that require special attention for big data processing.
Analyze the input data to the big data pipeline to understand various characteristics like format, protocol and availability schedules
Analyze the non-functional requirements for the big data pipelines, especially those that relate to big data like scalability and fault tolerance
Create a pipeline flowchart that captures the steps and workflow needed to convert inputs to outputs
Add Big Data specific patterns and techniques to the flowchart and create a skeleton design
Analyze scaling of the skeleton architecture to ensure horizontal scalability and detect bottlenecks.
Choose the right technologies for the building blocks used in the solution
Design infrastructure, Security and Serviceability for the big data pipeline
Create a test strategy for testing the big data pipeline that covers regression, scaling and automation
Compare the characteristics of Batch Pipelines and Realtime Pipelines and analyze suitability for use cases
Distributed Architectures help ensure horizontal scalability for handling big data traffic. Discuss the key features and levers for distributed architectures
The principles of Microservices architectures still apply when designing big data pipelines. Explore key principles and how they apply to big data pipelines.
Discuss key best practices when designing batch big data pipelines
Discuss key design practices when designing realtime big data pipelines
Explore the options for benchmarking performance for a big data pipeline
Analyze the File Transfer Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Extraction Client Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Ingestion API Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Pub Sub Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Explore Design Best Practices for Big Data Acquisition
Analyze the Extract Load Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Request Response Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Event Streaming Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies.
Explore some Best Practices for Big Data Transport Design
Explore several Data Processing Patterns that can be used for Big Data Processing Design.
Study how Big Data Processing Engines work behind the scenes to process data in a horizontally scalable manner
Discuss best practices for designing batch processing jobs for big data processing
Discuss best practices for designing stream processing jobs for big data processing
Study the differences between batch and realtime processing jobs. Explore how the design changes depending on which one is used.
Discuss the importance of, and techniques for, reading inputs and writing outputs in a scalable manner inside a processing job
Compare popular processing engine technologies available in the market today.
Analyze the Distributed File System Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Relational Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Document Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Columnar Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Graph Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Distributed Cache Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Discuss Data Storage Best Practices when building big data pipelines
Analyze the Query Interface Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Serving API Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Push Client Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Publish Subscribe Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Discuss Best Practices for Data Serving when building big data pipelines
Discuss the infrastructure technologies available for deploying and operating big data technologies
Use the microservices deployment patterns for building and deploying building blocks in a big data pipeline
Discuss the deployment options for deploying processing jobs in a big data pipeline. Compare their benefits and use cases
Discuss the deployment options for deploying databases and queues in a big data pipeline. Compare their benefits and use cases
Review the use cases where geographically distributed pipelines are needed. Discuss some best practices for them.
Big data technologies have grown rapidly over the past few years and have spread into nearly every domain and industry in software development. Working with them has become a core skill for software engineers. Robust and effective big data pipelines are needed to support the growing volume of data and applications in the big data world. These pipelines have become business critical and help increase revenue and reduce costs.
Do quality big data pipelines happen by magic? No: building and maintaining them requires high-quality designs that are scalable, reliable and cost-effective.
How do you build an end-to-end big data pipeline that leverages big data technologies and practices effectively to solve business problems? How do you integrate them in a scalable and reliable manner? How do you deploy, secure and operate them? How do you look at the overall forest and not just the individual trees? This course focuses on this skill gap.
What are the topics covered in this course?
We start off by discussing the building blocks of big data pipelines, their functions and challenges.
We introduce a structured design process for building big data pipelines.
We then discuss individual building blocks, focusing on the design patterns available, their advantages, shortcomings, use cases and available technologies.
We recommend several best practices across the course.
Finally, we implement two use cases to illustrate how to apply what the course teaches to real-world problems: one batch use case and one realtime use case.