
Learn to solve big data problems using Apache Spark and Hadoop in a fully configured cluster, with hands-on data ingestion, data cleansing, and production-ready applications.
Explore the Udemy environment and its user interface to optimize learning, including reviews, video speed and quality, and downloading course materials.
Explore the high-level theory behind big data and the need for distributed storage and processing. Introduce Hadoop, its components, the Hadoop Distributed File System (HDFS), and YARN.
Understand big data through the data explosion, velocity and volume, from internet usage and IoT to social media, and learn how unstructured, structured, and semi-structured data differ.
Explore distributed systems for big data storage and processing via computer clusters and distributed file systems, comparing vertical and horizontal scaling with commodity hardware and tools like Hadoop and Spark.
Explore how Hadoop enables reliable, scalable distributed computing with open source software. Analyze features like scalability, resilience, data locality, and the distributed file system, cluster resource manager, and MapReduce.
Apache Spark is a unified analytics engine for large-scale data processing, delivering speed with built-in libraries Spark SQL, Spark Streaming, MLlib, and GraphX.
Define Spark applications as programs that use the Spark APIs with a driver and executors. Explain the master process and the Spark session used for cluster connection and the DataFrame APIs.
Explore Spark's interactive shell, including the Spark shell interfaces for Scala and Python, to write and test code in an interactive Spark session with Hive support.
Deploy Spark on a Hadoop cluster and use HDFS as a data source. Show how YARN acts as the resource scheduler and how Spark executors run in containers on worker nodes.
Set up the Verulam Blue VM with a Spark Hadoop cluster: install Oracle VM VirtualBox, boot the VM, then access the Spark shell or a Zeppelin notebook for practice questions.
Install Oracle VM VirtualBox to load the Verulam Blue VM, enabling Spark on a Hadoop cluster from your desktop, with a Hive metastore server and an Apache Zeppelin notebook.
Load the Verulam Blue VM in Oracle VM VirtualBox by decompressing zip files, importing the VM image, and starting the VM; download files one at a time to avoid corruption.
Start Oracle VM VirtualBox and load the Verulam Blue VM, then work with Spark, Hadoop, Hive, and the Zeppelin notebook, and learn to spin up a cluster.
Spin up the Hadoop cluster by clicking the desktop spin-up icon on the Verulam Blue VM, then verify that seven processes, including the NameNode and ResourceManager, are running.
Launch spark-shell to start a Spark session, use the spark variable to run code, view databases, and exit by typing :quit.
Launch the Apache Zeppelin notebook, open a new notebook, and run code. Use show databases to list databases, then refresh until the readiness indicator appears, and shut down the process.
Access end-of-section problems and the practice test by launching Spark Show from the desktop, then use the spark-shell to review questions once the Verulam Blue VM cluster is up.
Learn how to interact with the Hadoop Distributed File System (HDFS) using the Java API, NFS gateway, web UI, and the file system shell to browse, upload, and manage data.
Learn to use the file system shell to interact with HDFS through the command line, perform basic filesystem operations, and script tasks for efficient data management in a Hadoop cluster.
Learn how to use the dfs command with the -help option to enumerate available options and read detailed information about a specific option, such as put, along with its arguments.
Learn to list directory contents with the dfs -ls option, view root and user directories, and understand the permissions, ownership, and basic file attributes shown.
Use the hdfs dfs -find command to locate paths of files and subdirectories. Employ -name with wildcards to search by partial names and view help with -help find.
Discover how to create directories in HDFS using the -mkdir option, including single and multiple directories, and nested directories with the -p option, and verify with -ls.
Copy files and directories from the local file system to HDFS using the put command, overwrite with -f, copy directories, and copy only contents with wildcards, with verification.
Learn to copy and move files between HDFS locations using the cp and mv options of the file system shell, specifying source and destination paths, and how to rename via move.
Discover how to view Hadoop file contents using hdfs dfs: cat for small files, head and tail for partial views, and text for compressed files.
Learn to delete files and directories using rm and rmdir, including removing empty directories and recursively deleting non-empty folders, and using wildcards to target multiple files.
Learn how to copy files and directories from HDFS to the local file system using the hdfs dfs -get command, covering both directories and single files with practical examples.
Exercise caution when running Hadoop file system shell commands to avoid accidental data deletion, respect cluster permissions, and recover by downloading raw VM files and reloading the overlay.
Explore the three key data structures: Spark data frames, Hive tables, and temporary views, and note that data frames are built on RDDs and are a special case of Datasets.
Explore Spark data frames, distributed structures with named columns and defined data types for large-scale data across clusters. Understand schema, partitions, immutability, and lazy transformations and actions.
Explore database concepts, tables, and data warehouses with Apache Hive on Hadoop. Learn about the Hive metastore, table metadata, managed versus external tables, and schema-on-read for large datasets.
Understand temporary views in Spark as virtual structures that are not stored in the Hive metastore and disappear when the session ends. Create them from data frames or Hive tables to query with SQL.
Explore creating three key data structures (data frames, Hive tables, and temporary views) and use Spark SQL to query them.
Explore querying structured data with Spark SQL, writing SQL queries over Spark dataframes and tables, and using the Spark session to register temp views or save data as tables.
Create data frames with Spark SQL by reading tables or files via the data frame reader interface and its table method. Load from various sources in formats like CSV and JSON.
Learn to create databases and tables with Spark SQL, including create database if not exists and saving data frames as tables in the Hive metastore.
Create temporary views in Spark SQL from data frames or tables, register them for SQL queries, and drop them without affecting the underlying data.
Explore basic Spark SQL operations on data frames and tables, including column and row manipulations and basic SQL queries.
Learn how to manipulate data frame columns with the Spark SQL and DataFrame APIs, covering referencing columns in multiple ways, selecting subsets, renaming, adding, and dropping columns, plus handling reserved names.
Explore basic data frame operations with the Scala API in a Zeppelin notebook, including limit, show(3), count, union, distinct, drop duplicates, and sample.
Explore essential SQL queries for tables, including describe table and function, select, joins, rename with as, cast, backticks, limit, distinct, and drop table, illustrated in a Zeppelin notebook.
Explore data engineering with Spark SQL and Hadoop by performing ETL tasks, loading data from HDFS, reading files in various formats, transforming data, and writing results back to HDFS.
Explore the ETL process: extract, transform, and load data from data lakes and warehouses, clean and transform it in a Spark data frame, and load it for analysis.
Explore the extraction phase of a Spark ETL process by loading files into a data frame with the Read API. Identify formats like text, CSV, JSON, Parquet, and Avro.
Learn how to load CSV and text files into Spark data frames, using schema inference or an explicit schema, header and separator options, and text versus CSV readers.
Load JSON and Parquet data into Spark data frames with automatically inferred schemas, view the schema, reorder columns with select expressions, and use the multiLine and mergeSchema options.
Learn to load Avro and ORC files into a data frame with the Spark session's data frame reader. Compare standard and alternative loading methods and note the limited data source options.
Navigate the transform phase of ETL, covering deduplication, format revision, splitting, encoding, deriving new values, null handling, and using alias, string, numeric, and datetime functions on HDFS data.
Learn string transformations in Spark SQL for data science, including replace with regular expressions, label encoding, splitting, substring, length, trim, capitalization, and joining column values.
Explore numerical transformations in Spark SQL using math functions such as round, absolute value, power, and square root, applied to salary data within a Spark session workflow.
Learn how to perform date and time transformations in Spark, including converting strings to dates, reformatting dates, extracting day, month, and year, obtaining current date, and computing differences between dates.
Explore Spark data types and data type transformations, examine and cast data frame columns, and understand integers, floats, decimals, strings, booleans, and timestamps using print schema and cast.
Transform nulls in Spark dataframes by dropping rows with null values or filling them with replacements, using drop and fill methods with configurable how, thresh, and subset parameters.
Explore the load phase of ETL, writing results back to storage as persistent tables or exporting data frames to file formats like CSV, text, JSON, Parquet, Avro, and ORC.
Write data frame data to files using Spark session and data frame writer; choose formats CSV, JSON, or Avro, and configure options such as schema inference, header, and compression.
Learn how to save data frame data with Spark's partitionBy and coalesce methods and partition control to manage output files and directories, enabling reading of only the relevant partitions.
Save a DataFrame as a persistent table using Spark SQL, choosing formats, setting a custom path, and applying save modes while understanding table drop behavior and the Hive metastore.
Master data analysis with Spark by querying loaded data to generate simple reports, filter data, and compute aggregate statistics with group by and joins across datasets.
Use metastore tables as input sources or output sinks in Spark applications, reading with the data frame reader and table method, or with SQL, and save results to Hive tables.
Learn the fundamentals of querying with SQL in Spark, building select and from clauses, applying where, group by, having, and order by, with results returned as data frames.
Explore SQL math functions in Spark SQL, including round, abs, power, and square root, to transform column values such as calculating monthly salaries and naming result columns.
Explore how to filter data with where clauses and boolean expressions in Spark SQL. Apply limit, distinct, between, and like to retrieve targeted records in data frames.
Learn to sort and rank results in Spark SQL using order by, ascending and descending, and ranking functions rank and dense rank, with partition by for regional top salaries.
Explore aggregation in Spark SQL and Hadoop data science by applying common aggregate functions, count, sum, average, min, max, and distinct, to compute insights across table columns and groups.
Group data using the group by clause, apply where filters, and use having to refine groups. Learn aggregation and dataframe approaches in Spark SQL.
Master multi-table queries by applying the join clause for side-by-side results and the union operator to stack results from multiple selects, using common keys and compatible columns.
Learn how to perform multi-table and data frame joins using on conditions and join clauses, with aliases, and handle duplicate column names by dropping or renaming.
Master multi-table queries with Spark data frames by applying inner, left outer, right outer, full outer, left semi, and left anti joins on left and right tables, including select statements.
Master multi-table queries with the union operator to combine rows from top and bottom tables in the school DB, including union all and union distinct, aliasing, and casting for compatibility.
*Important Notice*
This course has been retired and is no longer receiving support. Originally designed to help students pass the now-retired Cloudera Certification exams, the material remains useful for those wanting to practice their skills on Spark and Hadoop clusters. However, its primary focus was certification preparation, which many students successfully completed.
Apache Spark is currently one of the most popular systems for processing big data.
Apache Hadoop continues to be used by many organisations that look to store data locally on premises. Hadoop allows these organisations to efficiently store big datasets ranging in size from gigabytes to petabytes.
As the number of vacancies for data science, big data analysis and data engineering roles continues to grow, so too will the demand for individuals who possess the knowledge of Spark and Hadoop technologies needed to fill these vacancies.
This course has been designed specifically for data scientists, big data analysts and data engineers looking to leverage the power of Hadoop and Apache Spark to make sense of big data.
This course will help those individuals that are looking to interactively analyse big data or to begin writing production applications to prepare data for further analysis using Spark SQL in a Hadoop environment.
The course is also well suited for university students and recent graduates who are keen to gain exposure to Spark & Hadoop, or anyone who simply wants to apply their SQL skills in a big data environment using Spark SQL.
This course has been designed to be concise and to provide students with a necessary and sufficient amount of theory, enough for them to be able to use Hadoop & Spark without getting bogged down in too much theory about older low-level APIs such as RDDs.
By solving the questions in this course, students will begin to develop the skills and confidence needed to handle real-world scenarios that come their way in a production environment.
(a) There are just under 30 problems in this course. These cover HDFS commands, basic data engineering tasks and data analysis.
(b) Fully worked out solutions to all the problems.
(c) Also included is the Verulam Blue virtual machine, an environment that has a Spark Hadoop cluster already installed so that you can practice working on the problems.
The VM contains a Spark Hadoop environment which allows students to read and write data to and from the Hadoop file system as well as to store tables in the Hive metastore.
All the datasets students will need for the problems are already loaded onto HDFS, so there is no need for students to do any extra work.
The VM also has Apache Zeppelin installed. This is a notebook specific to Spark and is similar to Python’s Jupyter notebook.
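As a rough illustration of the HDFS shell commands that the problems draw on, the sketch below assembles a few `hdfs dfs` invocations in Python without running them. The `hdfs_dfs` helper and all paths and file names are hypothetical, chosen only for illustration; the options themselves (`-mkdir -p`, `-put -f`, `-ls`, `-find -name`, `-cat`, `-get`, `-rm -r`) are the ones named in the lecture summaries.

```python
import shlex


def hdfs_dfs(option, *args):
    """Build (but do not run) an `hdfs dfs` command as an argument list.

    Hypothetical helper for illustration only.
    """
    return ["hdfs", "dfs", f"-{option}", *args]


# Hypothetical paths and file names, used only for illustration.
commands = [
    hdfs_dfs("mkdir", "-p", "/user/demo/raw/2024"),        # create nested directories
    hdfs_dfs("put", "-f", "sales.csv", "/user/demo/raw"),  # upload, overwriting if present
    hdfs_dfs("ls", "/user/demo/raw"),                      # list directory contents
    hdfs_dfs("find", "/user/demo", "-name", "sales*"),     # search by partial name
    hdfs_dfs("cat", "/user/demo/raw/sales.csv"),           # print a small file
    hdfs_dfs("get", "/user/demo/raw/sales.csv", "."),      # copy back to the local file system
    hdfs_dfs("rm", "-r", "/user/demo/raw/2024"),           # recursive delete: use with care
]

# Print each command as it would be typed in a terminal; shlex.join quotes
# the wildcard so the local shell does not expand it before HDFS sees it.
for cmd in commands:
    print(shlex.join(cmd))
```

On a machine where the `hdfs` client is available, each argument list could be passed to `subprocess.run(cmd, check=True)`; building the commands as lists rather than strings avoids accidental shell expansion of wildcards.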
This course will allow students to get hands-on experience working in a Spark Hadoop environment as they practice:
Converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them into HDFS.
Loading data from HDFS for use in Spark applications & writing the results back into HDFS using Spark.
Reading and writing files in a variety of file formats.
Performing standard extract, transform, load (ETL) processes on data using the Spark API.
Using metastore tables as an input source or an output sink for Spark applications.
Applying the fundamentals of querying datasets in Spark.
Filtering data using Spark.
Writing queries that calculate aggregate statistics.
Joining disparate datasets using Spark.
Producing ranked or sorted data.
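The querying skills listed above (filtering, aggregating, joining, and ranking) can be sketched with two small SQL queries. This is a minimal sketch under stated assumptions: Python's built-in `sqlite3` module is swapped in as a local stand-in so the queries run without a cluster, and the `employees` and `regions` tables and their rows are made up for illustration. The statements use only standard clauses and functions that Spark SQL also provides, so on the VM the query text would instead be passed to `spark.sql(...)`.

```python
import sqlite3

# sqlite3 is only a local stand-in for a Spark session (assumption for
# illustration); window functions need SQLite 3.25 or later.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, region_id INTEGER, salary REAL);
    CREATE TABLE regions (region_id INTEGER, region TEXT);
    INSERT INTO employees VALUES
        ('Ann', 1, 5200), ('Bob', 1, 4800), ('Cy', 2, 6100),
        ('Dee', 2, 6100), ('Eli', 2, 3900);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
""")

# Join the two tables, filter rows, and rank salaries within each region.
RANK_QUERY = """
    SELECT r.region, e.name, e.salary,
           RANK() OVER (PARTITION BY r.region ORDER BY e.salary DESC) AS salary_rank
    FROM employees e
    JOIN regions r ON e.region_id = r.region_id
    WHERE e.salary > 4000
    ORDER BY r.region, salary_rank, e.name
"""

# Aggregate statistics per region, keeping only groups with more than one row.
AGG_QUERY = """
    SELECT r.region, COUNT(*) AS headcount, ROUND(AVG(e.salary), 2) AS avg_salary
    FROM employees e
    JOIN regions r ON e.region_id = r.region_id
    GROUP BY r.region
    HAVING COUNT(*) > 1
    ORDER BY r.region
"""

ranked = conn.execute(RANK_QUERY).fetchall()
aggregated = conn.execute(AGG_QUERY).fetchall()

for row in ranked:
    print(row)
for row in aggregated:
    print(row)
```

Tied salaries receive the same `RANK()` value, which is why the course distinguishes rank from dense rank; swapping in `DENSE_RANK()` changes only how the ranks after a tie are numbered.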