
Introduction to the course
Please download and unzip the enclosed configuration files. You can copy and paste from these files during the installation and setup of the various tools. Add the commands in bashrc_addon to the end of your .bashrc file on Linux. Make sure to change trainer1 in these commands to your own Linux username.
Please download the zip file provided that contains the solutions for the Map/Reduce practice activities. You can compile the programs using the instructions provided, but only after setting up Yarn in the next lesson.
Please download the zip file provided that contains the solutions for the Yarn practice activities. You can compile the programs using the instructions provided in the previous lesson.
Please download the zip file provided that contains the solutions for the Hive practice activities. You need to connect to beeline as shown in the lesson.
Please download the Scala examples file provided. This file is used in the demos of this class to illustrate the Scala language. You can copy and paste these commands into spark-shell to practice Scala. If a command fails because a line in the file was split into two parts, join the parts back into a single line.
Please download the zip file provided that contains the solutions for the Spark-Scala practice activities. You need to run these in spark-shell.
Big data processing is moving to the cloud, and many organizations are exploring serverless big data processing on platforms such as Amazon Web Services EMR Serverless. I thought it would be apt to add this demo so that you become familiar with a cloud platform for running big data jobs. It is a simple example, but it covers all the steps needed to run a Spark job. I hope you find this addition useful. Please ignore the background noise in the video, as there was a lot of construction next door.
Please download the zip file provided that contains the solutions for the Spark practice activities. You can run these programs in spark-shell.
Please download the zip file provided that contains the solutions for the Machine Learning practice activities. You can run these programs in spark-shell.
Explains how to set up big data on the cloud with AWS EMR and how to orchestrate workflows with Step Functions.
Download and set up this Ubuntu Linux virtual machine on Windows. It comes loaded with all the big data software taught in this course. The virtual machine has a very low footprint (about 5GB to download and 8GB on disk) and can run with just 2GB of memory on Windows.
You can use WinSCP to securely copy files from Windows to your virtual machine and from the virtual machine back to Windows.
Sometimes the NAT address range in VMware and the virtual machine can cause connection problems. This shows one technique to correct the issue.
Apache Hadoop, Yarn, Hive and Spark are popular big data tools used by many organizations to develop big data analytics solutions. In this course, students develop big data applications with these tools to process data and derive valuable insights from it. By the end of the course, students will be able to set up a personal big data development environment, master the fundamental concepts of Hadoop, Yarn, Hive and Spark, copy data into and out of a big data cluster, process data using the Map/Reduce paradigm, run Map/Reduce and Spark jobs on Yarn, process big data in Spark using the Scala programming language, use RDDs and DataFrames to process big data, store data in the Parquet format, and use the Machine Learning libraries of Spark to develop Machine Learning solutions such as decision trees, recommendation engines, linear regression and anomaly detection.
This is a hands-on development course, and you will practice more than 50 activities during it. While Java knowledge is assumed, the fundamentals of Scala are taught so that you can write Scala code to process data in Spark. The course provides a foundation for developers to join big data development teams in their organization.
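For readers who have not met the Map/Reduce paradigm mentioned in the course description, the following is a minimal sketch of the idea in plain Java. It is not taken from the course materials and does not use the Hadoop API: the class name WordCountSketch and its methods are hypothetical, and everything runs in memory on a single machine. It only illustrates the three conceptual phases (map, shuffle, reduce) with a word count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of a Map/Reduce word count.
// All names here are hypothetical; this does not use the Hadoop API.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data on Yarn", "big data in Spark")));
    }
}
```

On a real cluster the map and reduce steps run in parallel on different machines and the framework performs the shuffle between them; the sketch keeps the same division of work so the roles of the phases are easy to see.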