If you want to outrun your competitors by making business decisions based on your data, then this course is for you.
SMACK is an open source full stack for big data architecture: a combination of Spark, Mesos, Akka, Cassandra, and Kafka. This stack is the newest approach developers have begun to use to tackle critical real-time analytics for big data.
SMACK: Getting Started with Scala, Spark, and the SMACK Stack familiarizes you with Scala and the various features it offers. You will also come to understand the process of data analysis using Spark. Finally, you will be introduced to the SMACK stack, which helps us process data blazingly fast. Development using these technologies can be summarized as: more data, less time.
This Learning Path is planned to meet your learning needs. It starts with the basics of Apache Spark, one of the trending big data processing frameworks on the market today. It then moves on to Scala, which has emerged as an important tool for performing various data analysis tasks efficiently, and helps you leverage popular Scala libraries and tools to perform core data analysis tasks with ease in Spark. In the last part, we will teach you how to integrate the SMACK stack to create a highly efficient data analysis system for fast data processing.
By the end of the course, you’ll be able to analyze and process data more swiftly and efficiently than with traditional data analytics systems.
About the Authors:
For this course, we have combined the best works of these esteemed authors:
Nishant Garg has over 16 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a senior technical architect for the Big Data R&D Labs with Impetus Infotech Pvt. Ltd. Nishant has also undertaken many speaking engagements on big data technologies and is also the author of Learning Apache Kafka and HBase Essentials, Packt Publishing.
Anatolii Kmetiuk has been working with Scala-based technologies for four years. He has experience in Deep Learning models for text processing. He is interested in Category Theory and Type-level programming in Scala. Another field of interest is Chaos and Complexity Theory and Artificial Life, and ways to implement them in programming languages.
Raúl Estrada Aparicio has been a programmer since 1996 and a Java developer since 2001. He loves functional languages such as Scala, Elixir, Clojure, and Haskell, as well as all topics related to computer science. With more than 12 years of experience in high availability and enterprise software, he has designed and implemented architectures since 2003. He specializes in systems integration and has participated in projects mainly related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys mobile programming and game development. He considers himself a programmer before an architect, engineer, or developer.
What are the origins of Apache Spark and what are its uses?
What are the various components in Apache Spark?
This video familiarizes us with the tools used in Apache Spark.
This video explains the complete historical journey from the Nutch project to Apache Hadoop—how the Hadoop project was started, which research papers influenced it, and so on. In the end, the various goals achieved by developing Hadoop are explained.
In this video, we are going to look at the JVM processes running in the background of Apache Hadoop—NameNode, DataNode, ResourceManager, and NodeManager. It also provides an overview of the Hadoop components—HDFS, YARN, and the MapReduce programming model.
This video shares more details about the Hadoop Distributed File System (HDFS)—its goals, its components, and how it works. It also explains another Hadoop component, YARN—its components, lifecycle, and use cases.
This video provides an overview of MapReduce—the Hadoop programming model—and its execution behavior at various stages.
The aim of this video is to introduce the Scala language and its features, and by the end of this video, you should be able to get started with Scala.
The aim of this video is to explain the fundamentals of Scala Programming, such as Scala classes, fields, methods, and the different types of arguments, such as default and named arguments passed to class constructors and methods.
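As a minimal sketch of these ideas (the Greeter class and its members are invented for the example), here is how fields, methods, and default and named arguments fit together:

```scala
// A simple class: 'name' and 'greeting' are constructor fields;
// 'greeting' has a default argument.
class Greeter(val name: String, val greeting: String = "Hello") {
  // A method with its own default argument.
  def greet(punctuation: String = "!"): String =
    s"$greeting, $name$punctuation"
}

val a = new Greeter("Scala")                         // default greeting used
val b = new Greeter(greeting = "Hi", name = "Spark") // named arguments, any order
println(a.greet())    // Hello, Scala!
println(b.greet("?")) // Hi, Spark?
```

Named arguments may be passed in any order, which is especially convenient when a constructor has many defaulted parameters.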
The aim of this video is to explain the objects in Scala language, singleton object in Scala, and outline the usages of objects in Scala applications. It also describes companion objects.
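A short sketch of singleton and companion objects (the Temperature and AppConfig names are invented for illustration):

```scala
// A class whose constructor is private: instances can only be created
// through its companion object (same name, defined alongside the class).
class Temperature private (val celsius: Double)

// The companion object can access the class's private members and is the
// idiomatic home for factory methods.
object Temperature {
  def fromFahrenheit(f: Double): Temperature =
    new Temperature((f - 32) / 1.8)
}

// A standalone singleton object: exactly one instance, created lazily.
object AppConfig {
  val appName = "smack-demo"
}

val boiling = Temperature.fromFahrenheit(212)
println(boiling.celsius)   // 100.0
println(AppConfig.appName) // smack-demo
```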
The aim of this video is to explain the structure of the Scala collections hierarchy. Look at the examples of different collection types, such as Array, Set, and Map. It also covers how to apply functions to data in collections and outlines the basics of structural sharing.
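A quick sketch of the collection types mentioned above and of applying functions to their elements (the sample values are invented):

```scala
// Array: indexed, fixed-size.
val numbers = Array(3, 1, 2)

// Set: duplicates are discarded.
val letters = Set("a", "b", "a") // Set("a", "b")

// Map: key-value pairs.
val ages = Map("alice" -> 30, "bob" -> 25)

// Applying functions to data in collections.
val doubled = numbers.map(_ * 2)                   // Array(6, 2, 4)
val adults  = ages.filter { case (_, age) => age >= 28 } // Map("alice" -> 30)
```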
The aim of this video is to start your learning of Apache Spark fundamentals. It introduces you to the Spark component architecture and how different components are stitched together for Spark execution.
The aim of this video is to take the first step towards Spark programming. It explains the Spark context and the need for Resilient Distributed Datasets (RDDs). It also explains how RDDs change the execution approach used in MapReduce.
The aim of this video is to explain the operations that can be applied on RDDs. These operations are in the form of transformations and actions. It explains various operations under both the categories with examples.
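A minimal sketch of the two categories of operations, assuming spark-core is on the classpath and Spark runs in local mode (the data values are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations are lazy: they describe a new RDD without computing it.
val evens   = rdd.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Actions trigger the actual computation on the cluster.
val result = squared.collect()     // Array(4, 16)
val total  = squared.reduce(_ + _) // 20

sc.stop()
```

Nothing is computed until `collect` or `reduce` runs; the chain of transformations is only a recipe until an action demands a result.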
The aim of this video is to explain and demonstrate loading and storing data in Spark from different file types, such as text, CSV, JSON, and sequence files; different filesystems, such as the local filesystem, Amazon S3, and HDFS; and different databases, such as MySQL, Postgres, HBase, and so on.
The aim of this video is to explain the motivations behind key-value-based RDD and the creation of such RDDs. Next, it explains the various transformations and actions that can be applied on key-value-based RDD. Finally, it explains data partitioning techniques in Spark.
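A small sketch of a key-value RDD and a per-key aggregation, again assuming spark-core on the classpath and local mode (the word-count pairs are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("pair-rdd-demo").setMaster("local[*]"))

// A key-value RDD: word-count style (word, 1) pairs.
val pairs = sc.parallelize(Seq(("spark", 1), ("kafka", 1), ("spark", 1)))

// reduceByKey aggregates values per key, combining locally on each
// partition before shuffling — a key reason it scales well.
val counts = pairs.reduceByKey(_ + _).collectAsMap()

sc.stop()
```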
The aim of this video is to explain a few more advanced concepts, such as accumulators, broadcast variables, and passing data to external programs using pipes.
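A hedged sketch of accumulators and broadcast variables, assuming spark-core on the classpath and local mode (the stop-word filtering task is invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

// A broadcast variable ships a read-only value to every executor once.
val stopWords = sc.broadcast(Set("the", "a"))

// An accumulator gathers counts from tasks back to the driver.
val dropped = sc.longAccumulator("dropped words")

val words = sc.parallelize(Seq("the", "smack", "a", "stack"))
val kept = words.filter { w =>
  val keep = !stopWords.value.contains(w)
  if (!keep) dropped.add(1)
  keep
}.collect()

sc.stop()
```

After the `collect` action runs, `kept` holds `smack` and `stack`, and the driver can read `dropped.value` to see how many words were filtered out.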
The aim of this video is to demonstrate the writing of Spark jobs using the Eclipse-based Scala IDE, creating Spark job JAR files, and, finally, copying and executing the Spark job on a Hadoop cluster.
We need a dataset to practice the skills learned in this course, so we download the House Prices dataset from Kaggle.
Spark Notebook is a convenient environment for data analysis and reproducible research. We need to install it.
Before proceeding to load the data, we need to understand how Spark represents and handles it. This theoretical part covers it.
Now that we know the theory, we need to actually see how to load the example dataset in Spark.
Before building a statistical model of a dataset, one must have some understanding of that dataset. This video provides tools to build a visual intuition about the data in the dataset.
Another way to draw insights from the data is to look at its statistical metrics. This video describes how to compute them with Spark.
We need to preprocess the data before feeding it to an ML algorithm. This video describes how to do that with standard SQL/collections methods.
SparkSQL operations are powerful, but SparkML supports some common ML operations out of the box. Learning them may greatly reduce the work to be done.
A particular kind of operation on data that is commonly used is slicing the features (taking a subset of them) based on a predicate.
Before proceeding to concrete examples of using SparkML, we need to understand its structure.
The result of data analysis is usually a model of the data in question. This video explains how to do data modeling with the ML algorithms that Spark has.
To find an efficient solution, we need to learn about the data processing challenges first.
It is important to know the process or pipeline of SMACK to use it better.
To use the stack well, you need to understand each of its technologies.
Now learn about data expert profiles and how data processing can be a data center operation.
We need to understand the Scala hierarchy and how to make the right selections when working with Scala. This video will teach you that.
Iterators are an important part of Scala. This video uses iterators and shows their importance.
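As a brief sketch of why iterators matter (the sample values are invented), note that an iterator traverses its elements lazily and can be consumed only once:

```scala
// An iterator is consumed as you traverse it.
val it = Iterator(10, 20, 30)

// The low-level hasNext/next protocol.
var sum = 0
while (it.hasNext) sum += it.next()
// sum == 60; the iterator is now exhausted.

// Laziness also allows conceptually infinite iterators.
val evens = Iterator.from(0).filter(_ % 2 == 0)
val firstThree = evens.take(3).toList // List(0, 2, 4)
```

The infinite `evens` iterator is safe because `take(3)` only forces the first three matching elements.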
This video shows a host of functions in Scala, including filtering, merging, and sorting, and also covers sets, arrays, queues, and stacks.
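A compact sketch of these operations and structures (the sample values are invented):

```scala
import scala.collection.mutable

// Filtering and sorting immutable sequences.
val xs = List(5, 3, 8, 1)
val small  = xs.filter(_ < 5) // List(3, 1)
val sorted = xs.sorted        // List(1, 3, 5, 8)

// Merging two lists, then sorting the result.
val merged = (xs ++ List(2, 7)).sorted // List(1, 2, 3, 5, 7, 8)

// Queue: first-in, first-out.
val q = mutable.Queue(1, 2)
q.enqueue(3)
val head = q.dequeue() // 1

// Stack: last-in, first-out.
val s = mutable.Stack(1, 2)
s.push(3)
val top = s.pop() // 3
```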
This video compares the Actor Model with traditional OOP, and then describes the actor system and actor references.
Here, we will be learning about the functioning of actors using various katas.
Apache Spark cluster-based installations can become a complex task when we integrate Mesos, Kafka, and Cassandra, since the work draws on several fields: databases, telecommunications, operating systems, and infrastructure.
Spark has four design goals: in-memory data storage (Hadoop is not in-memory), distribution across a cluster, fault tolerance, and speed and efficiency.
Apache Spark has its own built-in standalone cluster manager, but you can also run it on several other cluster managers, including Apache Mesos, Hadoop YARN, and Amazon EC2.
Spark Streaming is the module for managing data streams. Much of it is built on the concept of RDDs, and it introduces the concept of DStreams, or discretized streams.
NoSQL databases are distributed databases with an emphasis on scalability, high availability, and ease of administration—the opposite of established relational databases.
The task was to create a massively decentralized, scalable database, optimized for read operations, whose data structures could be modified painlessly. The solution was found by combining two existing technologies: Google's BigTable and Amazon's Dynamo.
Cassandra lets you create a backup on the local computer: it creates a copy of the database using a snapshot, and it is possible to take a snapshot of all the keyspaces. Compression increases the capacity of cluster nodes by reducing the size of data on disk.
If you use incremental backups, it is also necessary to restore the incremental backups created after the snapshot. There are multiple ways to perform a recovery from a snapshot.
Learn to work with DBMS optimization.
The Spark Cassandra connector is a client used to achieve this connection, but this client is special because it has been designed specifically for Spark and not for a specific language.
In this video, you will learn the basics of the Spark Cassandra connector.
Spark Streaming allows handling and processing of high-throughput, fault-tolerant live data streams. In this video, you will learn about Spark Cassandra streaming and create a stream.
Once our Spark Cassandra connector is set up, we'll look at the different operations we can perform with Cassandra.
In this video, we will use the Akka Cassandra connector to build a simple Akka application, make HTTP requests, and store the data in Cassandra.
Growing data volumes require better data processing systems; hence, Kafka comes into the picture. In this video, you will learn the basics and features of Kafka.
We need to install Kafka to work with it. This video will enable you to do that.
Kafka’s publish-subscribe messaging system runs as a cluster. In this video, you will learn to program with clusters.
In this video, we will look at how the Kafka architecture is designed and understand the components that make it what it is.
Producers are applications that create messages and publish them to the broker. You need to understand how producers work.
Consumers are applications that consume the messages published by the broker, so they are the next step in the Kafka architecture.
To process large volumes of data, we need to integrate Kafka with other big data tools; the integration section teaches us that. Kafka also provides numerous tools for managing its features, which we will learn about in the administration section.
In this video, we will be looking at the relations between Akka and Spark, and between Kafka and Akka.
In this video, we will review the connectors between Kafka and Cassandra.
In this video, you will be introduced to Mesos and learn about the Mesos architecture.
The resource allocation module of Mesos decides the quantity of resources allocated to each framework, so it is important to understand resource allocation in Mesos.
If you don’t want to use cloud services from Amazon, Google, or Microsoft, you can set up your cluster in your own private data center. This video will teach you how to do that.
We need frameworks to deploy, discover, balance load, and handle failure of services. In this video, we will look at the frameworks that are used for service management.
Aurora is a Mesos framework for long-running services and cron jobs. Learn about job scheduling with Aurora.
Singularity is a platform that enables deploying and running services and scheduled jobs in the cloud or data centers. Combined with Apache Mesos, it provides efficient management of the underlying processes’ life cycles and effective use of cluster resources. Let's see what it is all about.
In this video, you will learn how to run Apache Spark on Mesos.
In this video, we will deploy Apache Cassandra on Apache Mesos with the help of Marathon.
In this video, we will deploy Apache Kafka on Apache Mesos.
Packt has been committed to developer learning since 2004. A lot has changed in software since then - but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live. And how to put them to work.
With an extensive library of content - more than 4,000 books and video courses - Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt guides software professionals in every field to what's important to them now.
From skills that will help you develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource for making you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.