Massive amounts of data are being collected on just about everything and only a small part of that data is being analyzed.
In 2014, more than 5,700 tweets and 870 Facebook links were sent every second.
In 2013, about 4.4 zettabytes of data were created and approximately 5% of it was analyzed.
By 2020, it’s estimated that we will collect 44 zettabytes of data and the amount we analyze will jump to 40%.
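To put those percentages in absolute terms, here is a minimal sketch of the arithmetic. The figures are the estimates quoted above; the variable and function names are only illustrative. It works out to roughly 0.22 zettabytes analyzed in 2013 versus about 17.6 zettabytes in 2020, or around 80 times as much.

```python
# Estimates quoted in the text above (zettabytes and fraction analyzed).
ZB_CREATED_2013 = 4.4
SHARE_ANALYZED_2013 = 0.05
ZB_PROJECTED_2020 = 44.0
SHARE_ANALYZED_2020 = 0.40


def analyzed_volume(total_zb, share):
    """Return the volume of data (in ZB) that actually gets analyzed."""
    return total_zb * share


analyzed_2013 = analyzed_volume(ZB_CREATED_2013, SHARE_ANALYZED_2013)
analyzed_2020 = analyzed_volume(ZB_PROJECTED_2020, SHARE_ANALYZED_2020)
growth_factor = analyzed_2020 / analyzed_2013

if __name__ == "__main__":
    print(f"2013: {analyzed_2013:.2f} ZB analyzed")
    print(f"2020: {analyzed_2020:.2f} ZB analyzed")
    print(f"Growth in analyzed data: about {growth_factor:.0f}x")
```

The total volume grows tenfold and the analyzed share grows eightfold, which is where the combined factor of about 80 comes from.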
One of the most overused terms in recent times is “Big Data.”
But what does the term really mean?
Big data refers to data being collected in ever-escalating volumes, at increasingly high velocities, and for a widening variety of unstructured formats and variable semantic contexts.
Big data describes any large body of digital information, from the text in a Twitter feed, to the sensor information from industrial equipment, to information about customer browsing and purchases on an online catalog.
Big data can be historical (meaning stored data) or real-time (meaning streamed directly from the source).
For big data to provide actionable intelligence or insight, not only must the right questions be asked and data relevant to the issues be collected, but the data must also be accessible, cleaned, analyzed, and then presented in a useful way.
HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.
It includes implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on. HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
Note: This is not a hands-on course. It builds the knowledge foundation for my next course in this series, in which we use what we've learned to create a real-world, end-to-end big data solution with Azure HDInsight.
What's the course about?
In this course we are going to cover Hadoop on Azure.
Microsoft's take on Big Data is a game changer.
Approximately 90% of all organizational data is unstructured.
That means only 10% is stored in traditional relational databases.
The amount of data stored and analyzed will grow exponentially in the coming years.
This is a very specific course.
This course will focus on Microsoft's approach to big data.
We will be learning Azure HDInsight.
I want to make sure you are in the right place.
If you are looking to learn Microsoft's approach to big data then this course is for you.
This is not a traditional Hadoop course.
Let's cover some of the key terminology associated with this section.
Let's wrap up what we've learned.
Big Data is an ecosystem and Hadoop is a product.
Let's learn what Hadoop really is.
There's a lot of new terminology here.
Let's learn what's involved and what the key components of Hadoop are.
One of the core components of Hadoop is the NameNode.
Let's learn what it is in this lecture.
MapReduce is both an engine and a programming model.
Let's learn about the map and the reduce.
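As a rough illustration of the programming model, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes, with the framework handling the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

The map step only ever looks at one line and the reduce step only ever looks at one key, which is what lets an engine spread the work over many machines.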
The Pig programming language is designed to handle any kind of data, hence the name.
Let's learn about the two most prevalent Hadoop languages.
MapReduce underwent a complete overhaul in hadoop-0.23, and we now have what is called MapReduce 2.0 (MRv2), or YARN.
Let's learn about this new feature.
YARN was designed for separation of duties.
In this short lecture let's look at a visual representation of YARN.
Let's learn the new vernacular introduced in this section.
Let's wrap up what we've covered so far in this section.
At this juncture, Azure Data Lake is made up of three different services.
Let's learn about them in this lecture.
The data lake concept is fairly new.
Let's learn what a data lake is and, more importantly, whether we need one.
Let's talk about Microsoft's Azure Data Lake in the Cloud.
In this lecture we will learn about Hadoop in the cloud.
A new service built on Apache YARN that dynamically scales distributed infrastructure.
Microsoft's new language for big data.
Let's cover the key terms used in this section.
Let's look at some bullet points on what we've covered in this section.
In order to start our work with big data in the cloud we will need an account.
In this lecture we navigate to the URL to create one.
In order to start working with our HDInsight clusters we have one major dependency.
We need a storage account.
In this lesson we will learn how to create a storage account and provision our first cluster.
In this lesson we will learn the basics of managing our cluster.
We will also look at the new portal, which gives us a much more granular view of our clusters.
In this lesson let's learn how to remote into our cluster.
Let's go over the key words in this lesson.
Let's go over the high points on what we've covered in this section.
We've covered a lot of new information in this course.
My next course will provide real-world examples of working with big data.
We will create native MapReduce jobs and learn more about U-SQL.
Let's wrap up what we've covered in this course.
I've been a production SQL Server DBA most of my career.
I've worked with databases for over two decades. I've worked for or consulted with over 50 different companies, as a full-time employee or as a consultant, ranging from Fortune 500 firms to small and mid-size companies. Some include: Georgia Pacific, SunTrust, Reed Construction Data, Building Systems Design, NetCertainty, The Home Shopping Network, SwingVote, Atlanta Gas and Light and Northrop Grumman.
Experience, education and passion
I learn something almost every day. I work with insanely smart people. I'm a voracious learner of all things SQL Server and I'm passionate about sharing what I've learned. My area of concentration is performance tuning. SQL Server is like an exotic sports car: it will run just fine in anyone's hands, but put it in the hands of a skilled tuner and it will perform like a race car.
Certifications are like college degrees: they are great starting points for learning. I'm a Microsoft Certified Database Administrator (MCDBA), Microsoft Certified Systems Engineer (MCSE) and Microsoft Certified Trainer (MCT).
Born in Ohio, raised and educated in Pennsylvania, I currently reside in Atlanta with my wife and two children.