Elasticsearch is a powerful tool not only for powering search on big websites, but also for analyzing big data sets in a matter of milliseconds! It's an increasingly popular technology, and a valuable skill to have in today's job market. This comprehensive course covers it all, from installation to operations, with 60 lectures and 8 hours of video.
We'll cover setting up search indices on an Elasticsearch cluster, and querying that data in many different ways. Fuzzy searches, partial matches, search-as-you-type, pagination, sorting - you name it. And it's not just theory, every lesson has hands-on examples where you'll practice each skill using a virtual machine running Elasticsearch on your own PC.
We cover, in depth, the often-overlooked problem of importing data into an Elasticsearch index. Whether it's via raw RESTful queries, scripts using Elasticsearch APIs, or integration with other "big data" systems like Spark and Kafka - you'll see many ways to populate Elasticsearch from large, existing data sets at scale. We'll also stream data into Elasticsearch using Logstash and Filebeat - commonly referred to as the "ELK Stack" (Elasticsearch / Logstash / Kibana) or the "Elastic Stack".
Elasticsearch isn't just for search anymore - it has powerful aggregation capabilities for structured data. We'll bucket and analyze data using Elasticsearch, and visualize it using the Elastic Stack's web UI, Kibana.
You'll learn how to manage operations on your Elastic Stack: using X-Pack to monitor your cluster's health, and performing operational tasks like scaling up your cluster and doing rolling restarts. We'll also spin up Elasticsearch clusters in the cloud using Amazon Elasticsearch Service and the Elastic Cloud.
Elasticsearch is positioning itself to be a much faster alternative to Hadoop, Spark, and Flink for many common data analysis requirements. It's an important tool to understand, and it's easy to use! Dive in with me and I'll show you what it's all about.
We'll talk about why Elasticsearch is important and what you can expect from this course. Then, we'll install a virtual Ubuntu machine right on your own desktop PC, install Elasticsearch on it, and search the complete works of William Shakespeare!
Let's look at the components of the Elastic Stack from a 30,000-foot level, and see how they all fit together.
We'll cover the logical concepts of Elasticsearch, how indices work, and different ways to interact with Elasticsearch.
Let's talk about how Elasticsearch scales horizontally on a cluster, using primary and replica shards.
Quiz time! Let's see what you learned about Elasticsearch at a conceptual level.
We'll walk through setting up SSH on your server, and connecting to it from your desktop.
The MovieLens dataset will be used throughout the course; let's familiarize ourselves with it.
We'll define a mapping, or a schema, in Elasticsearch for our movie data prior to importing it.
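As a rough sketch of what such a mapping might look like: the field names follow the MovieLens data, but the "movie" document type and exact field types here are illustrative, using the pre-7.x mapping syntax of the Elasticsearch versions covered in the course.

```python
import json

# An illustrative mapping for MovieLens movie data, defined before
# any documents are imported so Elasticsearch knows how to index each field.
mapping = {
    "mappings": {
        "movie": {  # document type (pre-7.x Elasticsearch)
            "properties": {
                "title": {"type": "text"},     # analyzed full-text field
                "genre": {"type": "keyword"},  # exact-match field, not analyzed
                "year":  {"type": "date", "format": "year"},
            }
        }
    }
}

# The mapping is sent when the index is created: PUT /movies
body = json.dumps(mapping)
```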
We'll insert a single movie into our index the "hard way" - using a JSON request over a REST query.
We'll use Elasticsearch's JSON-based bulk API for inserting many movies into our index with a single request.
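The bulk API's format is distinctive: newline-delimited JSON, with an action line followed by the document itself, one pair per document. A minimal sketch (index, type, and field names follow the course's movies example but are otherwise illustrative):

```python
import json

# Sample documents to insert in one bulk request.
movies = [
    {"id": "135569", "title": "Star Trek Beyond", "year": 2016},
    {"id": "122886", "title": "Star Wars: Episode VII - The Force Awakens", "year": 2015},
]

lines = []
for m in movies:
    # Action line naming the index, type, and document ID...
    lines.append(json.dumps({"index": {"_index": "movies", "_type": "movie", "_id": m["id"]}}))
    # ...followed by the document source itself.
    lines.append(json.dumps({k: v for k, v in m.items() if k != "id"}))

# Bulk payloads must end with a trailing newline.
payload = "\n".join(lines) + "\n"
# Sent with: PUT /_bulk  (Content-Type: application/x-ndjson)
```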
Learn how to atomically update an existing document in Elasticsearch, and how it's handled under the hood (it may surprise you!).
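A partial update goes to the `_update` endpoint with a "doc" body, which Elasticsearch merges into the existing source. Under the hood, documents in Lucene segments are immutable: the old document is marked deleted and a whole new one is written. A sketch (the ID shown is just a placeholder):

```python
# Change only the fields listed under "doc"; everything else in the
# stored document is preserved by the merge.
update_request = {
    "doc": {
        "title": "Interstellar"
    }
}
# Sent with: POST /movies/movie/109487/_update
```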
Let's practice deleting a document, and see what really happens under the hood.
Time to try it yourself without my guidance! Practice what you've learned so far.
What happens if two different clients try to update a document at the same time? We'll practice a way to avoid these possible sources of contention.
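The mechanism here is optimistic concurrency control: every document carries a version number, and a write can demand that the version is still the one the client last read. These parameter names match the pre-7.x API covered in the course era (Elasticsearch 7+ replaces versioned writes with `if_seq_no` / `if_primary_term`):

```python
# Re-index only if the document is still at the version we read;
# a concurrent writer bumping the version makes this request fail.
#   PUT /movies/movie/109487?version=5
write_params = {"version": 5}

# Or let the _update API re-fetch and retry automatically on conflict:
#   POST /movies/movie/109487/_update?retry_on_conflict=3
update_params = {"retry_on_conflict": 3}
```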
Let's dive into the nuances of how Elasticsearch breaks up your text into search terms, and how you can control it.
What do you do with relational data? You don't have to completely de-normalize it with Elasticsearch; let's create an example of grouping movies together into franchises with parent/child data modeling.
Query-string search requests allow quick experimentation without constructing full JSON requests. Let's see how it works, and its limitations.
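For a feel of the shape of such a request: a query-string ("URI") search packs a Lucene query expression into the `q` parameter. It's quick for experiments, but special characters must be URL-encoded and complex queries get unreadable fast, which is one of its main limitations.

```python
from urllib.parse import urlencode

# A Lucene query expression, URL-encoded into the q parameter.
lucene_query = "title:star AND year:>2010"
url = "/movies/_search?" + urlencode({"q": lucene_query})
# -> /movies/_search?q=title%3Astar+AND+year%3A%3E2010
```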
We'll practice some more with the preferred interface for Elasticsearch, using JSON search request bodies.
Full-text search queries sometimes produce unexpected results. We'll illustrate this by searching for Star Wars movies, and see how phrase search can help produce what we're really after.
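The core of the fix: a plain match query for "star wars" also scores documents containing just "star" or just "wars", while `match_phrase` requires the terms to appear together and in order. A sketch, with `slop` permitting a small amount of term movement within the phrase:

```python
# Require "star" and "wars" adjacent and in order; slop=1 allows one
# position of leeway between the terms.
phrase_query = {
    "query": {
        "match_phrase": {
            "title": {"query": "star wars", "slop": 1}
        }
    }
}
```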
Practice using URI and full-body search queries to find Star Wars films released after 1980. Do a phrase search for bonus points!
We'll practice generating subsets of search results, which can be used for pagination to the end user - and cover the limitations of pagination.
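Paging in Elasticsearch uses `from` (a zero-based offset into the full result list) and `size` (hits per page). The limitation is deep paging: every shard must materialize `from + size` hits, so Elasticsearch caps the window (`index.max_result_window`, 10,000 by default). A sketch with illustrative field names:

```python
# Request page 3 with 10 hits per page.
page, page_size = 3, 10
paged_query = {
    "from": (page - 1) * page_size,   # page 3 -> offset 20
    "size": page_size,
    "query": {"match": {"genre": "Sci-Fi"}},
}
```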
We'll practice sorting our results, in this case by release date, and discuss the nuances involved.
Filters are a very efficient way to refine your search results, but can become complex. Let's practice with a search for sci-fi films that don't include the term "Trek" released between 2010 and 2015.
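Filter clauses inside a bool query don't influence relevance scoring and can be cached, which is what makes them efficient; the complexity comes from nesting the clauses. A sketch mirroring the lecture's example (field names are illustrative, modeled on the MovieLens mapping):

```python
# Sci-fi films, excluding any title matching "trek", released 2010-2015.
filtered_query = {
    "query": {
        "bool": {
            "must":     {"term":  {"genre": "Sci-Fi"}},
            "must_not": {"match": {"title": "trek"}},
            "filter":   {"range": {"year": {"gte": 2010, "lte": 2015}}},
        }
    }
}
```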
Try it yourself! I'll get you started, and then show you how I did it.
Make your full-text search resilient to typos using fuzzy queries. We'll see how to specify just how "fuzzy" you want your queries to be, and get some results back in spite of some misspellings.
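A fuzzy query tolerates typos by allowing an edit distance between the query term and the indexed terms; `fuzziness` can be a fixed edit count, or `"AUTO"` to scale the allowance with term length. A sketch:

```python
# Allow up to 2 character edits, so this misspelling can still match
# a title like "interstellar".
fuzzy_query = {
    "query": {
        "fuzzy": {
            "title": {"value": "intersteller", "fuzziness": 2}
        }
    }
}
```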
The inverted index can be leveraged to power partial matching, where you can search for some prefix of a search term. We'll illustrate this by searching for movies released in any year that begins with 201.
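A prefix query walks the inverted index for terms that start with the given string. A sketch for the years example (this relies on the year being indexed as a term that supports prefix matching; on a purely numeric field a range query would be the usual tool):

```python
# Match any document whose indexed year term begins with "201",
# i.e. 2010 through 2019.
prefix_query = {
    "query": {
        "prefix": {"year": "201"}
    }
}
```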
We'll write Python scripts to import data using Elasticsearch's REST interface directly, and using a higher-level API.
If you're comfortable with programming, try building upon the previous lecture to create your own script to import movie tags into a new "tags" index. I'll show you my solution so you can compare it with yours.
Logstash is an extremely handy tool for importing existing and streaming data from server logs into an Elasticsearch index. Let's see what it's about and how it works.
Let's install Logstash on our Ubuntu system and configure it.
As an example, we'll use Logstash to parse and insert log entries from a real Apache access log.
Logstash supports a very wide variety of sources and destinations. Let's set up a MySQL instance with MovieLens data, and use Logstash to copy this data into an Elasticsearch index.
As another example, we'll import data stored on Amazon's S3 service (a cloud-based distributed file system) into our Elasticsearch index using Logstash.
Kafka serves a similar role to Logstash, in that it collects and publishes data. We'll see how to connect it to your Elasticsearch cluster, which may come in handy if you have an existing Kafka setup publishing streaming data that you want to index.
Apache Spark can also read and write to Elasticsearch. We'll see how you can use Spark to crunch big data in complex ways, and output the results into an Elasticsearch index.
See if you can build upon the code in the previous lecture to write a Spark driver script that imports movie ratings into a new "spark" index with a "ratings" type.
Learn how simple aggregations work, why they're important, and practice by finding how many 5-star ratings are in our data, and what the average rating for Star Wars is.
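To give a feel for the shape of these requests, here are the two aggregations sketched side by side: bucket counts per rating value (from which the number of 5-star ratings can be read off) and a single average-rating metric. Index and field names are illustrative, modeled on the MovieLens ratings data.

```python
# Aggregation-only search: size 0 suppresses the hits themselves.
aggs_request = {
    "size": 0,
    "aggs": {
        "ratings":    {"terms": {"field": "rating"}},  # doc count per rating value
        "avg_rating": {"avg":   {"field": "rating"}},  # overall average rating
    },
}
# Sent with GET /ratings/_search; to average only Star Wars ratings,
# you'd add a query clause restricting the documents being aggregated.
```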
Practice generating histogram data from aggregations - break down the distribution of movie ratings, and of movie release years.
Elasticsearch has special capabilities for aggregating time-based data. Let's break down web server hits by hour, and Googlebot hits by hour.
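The key capability here is the `date_histogram` aggregation, which buckets time-stamped documents into fixed intervals. A sketch for hits per hour: the `@timestamp` field matches what Logstash writes by default, and `interval` is the parameter name in the Elasticsearch versions covered here (later releases split it into `calendar_interval` / `fixed_interval`).

```python
# Bucket web server log entries into one bucket per hour.
hits_by_hour = {
    "size": 0,
    "aggs": {
        "timestamp": {
            "date_histogram": {"field": "@timestamp", "interval": "hour"}
        }
    },
}
```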
Use aggregations to search for a spike in error response codes in my web server's access log data, and narrow it down to a specific hour.
Let's get Kibana installed and configured.
Although Kibana is often used to visualize log data, it can do so much more. Let's turn the power of Kibana toward gaining some new insights into the works of William Shakespeare.
Practice using Kibana by visualizing the plays with the most lines in them.
The Elastic Stack is more than just Elasticsearch, Logstash, and Kibana these days. Let's look at the Beats framework, and how it fits in.
We'll install Filebeat and configure it to directly import data into an index from an access log.
We'll see how Kibana's built-in dashboards can quickly produce all the visualizations you need for your server log data.
Use Kibana to narrow down the origin of a spike of 404 error codes in my web log data.
The number of primary shards in an index cannot be changed after it's created - so how do you select the right number of shards with future growth in mind?
We'll add new indices as a scaling strategy, and see how it works.
How do you choose the optimal hardware configuration for your Elasticsearch cluster? Let's talk about the different considerations.
An important configuration setting in production is the size of the memory heap dedicated to Elasticsearch. How do you balance this against memory needed by the OS and file system cache?
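For reference, heap size is set in Elasticsearch's `jvm.options` file; the exact path and the right numbers vary by installation and hardware, but the widely cited guidance looks like this:

```
# /etc/elasticsearch/jvm.options (path varies by installation)
# Set min and max heap equal, to no more than half of physical RAM,
# and keep it under ~32GB so the JVM can still use compressed object
# pointers. The remaining RAM is left to the OS file system cache,
# which Lucene depends on heavily for performance.
-Xms16g
-Xmx16g
```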
X-Pack is a paid add-on to the Elastic Stack that provides monitoring and alerting capabilities for your Elasticsearch cluster. Let's use a free trial to play around with it and see what it can do.
Let's simulate hardware failures and see how a properly configured cluster can be resilient to the failure of any given node.
We'll back up our indices to a snapshot saved to disk, and practice restoring our index from it.
Often you'll need to restart all the machines in your cluster in order to update software or the OS. Learn how to do this safely and without disruption to the clients of your cluster.
We'll set up an Amazon ES cluster, configure it, and see if it works - and talk about managing security with it.
The security considerations of cloud-based services make Logstash integration a little more complicated. Let's see how to connect Logstash to your AWS-based Elasticsearch cluster securely.
Elastic.co offers its own cloud-based service built on top of AWS, called Elastic Cloud. Let's see what it offers.
There's more to explore with Elasticsearch - let's cover additional resources, and where you can go from here.
Visit my website for discounts on my other big data and data science / machine learning courses - and let's stay in touch on social media!
Sundog Education's mission is to make highly valuable career skills in big data, data science, and machine learning accessible to everyone in the world. Our consortium of expert instructors shares our knowledge in these emerging fields with you, at prices anyone can afford.
Sundog Education is led by Frank Kane and owned by Frank's company, Sundog Software LLC. Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.