Scaling with New Indices

A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro

Lecture description

We'll add new indices as a scaling strategy, and see how it works.

Learn more from the full course

Elasticsearch 6 and Elastic Stack - In Depth and Hands On!

Search, analyze, and visualize big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.

08:03:25 of on-demand video • Updated May 2019

  • Install and configure Elasticsearch 6 on a cluster
  • Create search indices and mappings
  • Search full-text and structured data in several different ways
  • Import data into Elasticsearch using several different techniques
  • Integrate Elasticsearch with other systems, such as Spark, Kafka, relational databases, S3, and more
  • Aggregate structured data using buckets and metrics
  • Use Logstash and the "ELK stack" to import streaming log data into Elasticsearch
  • Use Filebeats and the Elastic Stack to import streaming data at scale
  • Analyze and visualize data in Elasticsearch using Kibana
  • Manage operations on production Elasticsearch clusters
  • Use cloud-based solutions including Amazon's Elasticsearch Service and Elastic Cloud
So once you've decided how many shards your index needs, you need to actually create that index and specify the shard count that you want. Let's talk about how to do that, and afterwards we'll talk about how you can use new indexes, or indices, I suppose, is the correct word, as a scaling strategy of its own.

So here's the syntax for specifying the number of primary shards and replicas that you want on your new index. Remember, the number of replicas is applied to each primary shard. So by saying number_of_shards 10 and number_of_replicas 1, you're going to end up with 10 primary shards and 10 replica shards: one replica for each primary. It can be a little bit confusing at first.

Now, in the past we sort of implicitly created new indices as we inserted new data, and we've just been using the default settings for the number of shards because we didn't care. We didn't have a lot of traffic in our little course here and we only had one machine, so it wasn't really worth thinking about too much. But in the real world you want to think about this a lot. So make sure that any new index is formally created, with the number of shards that you think you're going to need for the foreseeable future, before you start inserting data into it.

To make this easier, there's also something you can look up called index templates. That's a way to automatically apply mappings, analyzers, aliases, and settings like this to any new index that gets created. It's worth looking that up; it can save you some time as well.

So let's get some hands-on practice in doing this, because it is an important thing to remember and have muscle memory on. Let's explore how to investigate the settings of an index in terms of its number of shards, and how to create a new index with a given number of shards.
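To make the shard arithmetic concrete, here is a minimal Python sketch of the settings body described above and the shard counts it implies. The helper names `index_settings` and `shard_totals` are hypothetical, made up for illustration; only the `number_of_shards` and `number_of_replicas` settings come from the lecture. Nothing here talks to a cluster; it only builds and prints the JSON you would send when creating the index.

```python
import json


def index_settings(primaries: int, replicas_per_primary: int) -> dict:
    """Build the JSON body for a create-index request (hypothetical helper)."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas_per_primary,
        }
    }


def shard_totals(primaries: int, replicas_per_primary: int) -> dict:
    """The replica count applies to each primary, so multiply to get replica shards."""
    replica_shards = primaries * replicas_per_primary
    return {
        "primary": primaries,
        "replica": replica_shards,
        "total": primaries + replica_shards,
    }


# The example from the lecture: 10 primaries with 1 replica each.
body = index_settings(10, 1)
totals = shard_totals(10, 1)

print(json.dumps(body, indent=2))
print(totals)
```

With 10 primaries and 1 replica per primary, the cluster has to host 20 shards in total, which is the number to keep in mind when sizing nodes.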
Fortunately, we now have Kibana in our tool chest, so we don't have to deal with typing in JSON requests by hand in a console. We can just go to Dev Tools here, after opening Kibana on our running cluster, and delete all this stuff that we were playing with before. Instead, let's do something like this: GET /shakespeare/_settings. We'll hit play, and this gives you back the number of shards and replicas on the shakespeare index that we created way back at the beginning of this course. You can see that the default settings that I went with are five primary shards and one replica. So we have five primaries and five replicas all running on our one little virtual machine here, which is probably more than it really should have; it's doubtful we have enough memory on our little virtual machine to handle five primary shards, all things being equal. But those are the default settings. So that's how you can check what your current index is using for the number of primary shards and replicas. And remember, you multiply the number of primary shards by the number of replicas to get the actual number of replica shards.

So let's say that I wanted to create a new index with three primaries and one replica of each primary. The syntax for that is very simple; I just want to do this to get your fingers to remember it. You can just say PUT /testindex. So we're going to create a new index called testindex that contains the following structure. It will have settings as follows: number_of_shards, and you can see that it automatically fills that in, so I can just arrow down there and autocomplete it. It even put in the default setting for me, but let's change that to three. And we will also set number_of_replicas, and again it makes it really easy to do this: just hit tab to fill that in, and we'll stick with the default of one replica for each primary shard.
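If you want to read those settings from a script rather than by eye, a sketch along these lines may help. The `sample_response` below is a hand-written, trimmed stand-in for what a settings request on the shakespeare index returns (a real response carries more fields), and `shard_counts` is a hypothetical helper. Note that the shard settings come back as strings, so they are converted to integers before multiplying.

```python
import json

# Hand-written, trimmed stand-in for a settings response (assumption:
# a real response contains additional fields such as uuid and version).
sample_response = """
{
  "shakespeare": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "number_of_replicas": "1"
      }
    }
  }
}
"""


def shard_counts(settings_response: str, index_name: str) -> tuple:
    """Return (primary shards, total replica shards) for one index."""
    index_settings = json.loads(settings_response)[index_name]["settings"]["index"]
    primaries = int(index_settings["number_of_shards"])
    replicas_per_primary = int(index_settings["number_of_replicas"])
    # number_of_replicas is per primary, so multiply for the real shard count.
    return primaries, primaries * replicas_per_primary


primaries, replica_shards = shard_counts(sample_response, "shakespeare")
print(f"{primaries} primary shards, {replica_shards} replica shards")
```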
So you can see Kibana makes life a whole lot easier doing this sort of thing. We'll hit play, and it looks like it came back fine. If we want to double check, we can do a GET request and make sure that it took: say GET /testindex/_settings, and sure enough we have three primaries with one replica. Cool. So that's all there is to specifying the number of shards you want on a new index.

Now, adding new shards to an existing index is not your only option for adding capacity to your application. Another thing you can do is create entirely new indices for your application and spread your search requests across those indices. That way you can add more capacity with new indices and leave your old indices untouched. That's a lot easier than trying to re-index an existing one to add more shards to it. So the idea here is to have multiple indices that encompass the data in your application, and you use index aliases to manage which indices you actually care about at runtime.

Now, we saw this in action already back when we were playing with Logstash. As you recall, Logstash by default created a separate index for every day's worth of log data. So we would have one logstash index for 05-01 and another logstash index for 05-02, one for each day that it had input data. The idea there is that you could restrict your searches to, perhaps, the most current day or the last three months, just by querying the indices for the specific dates you're interested in. And you can manage that using index aliases. So if you do have a setup like that, where you have log data split up into separate indices for each individual day, you could maintain an alias called, for example, logs_current, or whatever you want to call it, that points to the most current day's or the most current month's worth of indices that you have available.
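As a rough illustration of per-day indices, the following Python sketch generates the index names that cover a range of days, which is the list an alias such as logs_current would need to point at. The function `daily_index_names` is hypothetical, the `logstash-` prefix with a `YYYY.MM.DD` suffix is an assumption about the naming pattern (adjust it to match your own indices), and the year 2017 is chosen only for the example.

```python
from datetime import date, timedelta


def daily_index_names(end_day: date, days: int,
                      prefix: str = "logstash-",
                      date_format: str = "%Y.%m.%d") -> list:
    """Names of the per-day indices covering `days` days ending at `end_day`,
    oldest first. Prefix and date format are assumptions, not fixed rules."""
    return [
        prefix + (end_day - timedelta(days=offset)).strftime(date_format)
        for offset in range(days - 1, -1, -1)
    ]


# Two days of log data, ending on May 2 (year assumed for illustration).
recent = daily_index_names(date(2017, 5, 2), days=2)
print(recent)

# Several indices can be searched together by listing them comma-separated
# in the request path; an alias saves you from rebuilding this list by hand.
print(",".join(recent))
```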
Or you can have another one called last_three_months that points to all the indices that encompass log data from the past three months. So as these new indices get created for new days of data that come in, you would update those aliases to point to the specific indexes (indices; that's always going to mess me up) for the dates that you want.

OK, so let's look at what that actually looks like syntactically. Let's imagine, if you will, that we have a new month of log data, and we're organizing our indices so that each one contains one month of log information. So let's say that log information for the month of June 2017 comes in and has been added. Now, if you want to have a logs_current alias that points to the most recent month of data, we would do something like this: add logs_2017_06 to the logs_current alias, and at the same time remove from logs_current the previous month, which was logs_2017_05. So what we're doing here is maintaining the logs_current alias by adding the June data to it and removing the May data from it, thereby making sure that logs_current is pointing to the most recent month.

Now, we could have a logs_last_three_months alias as well. In that case you would add June's data into that alias and remove March's data from it, because that's going to be four months old at that point. So you can see that an alias can encompass more than one index, and when you search that alias it will go across all of those indices at once, automatically. So this is a nice little way of adding capacity and adding more data to your Elasticsearch application, by having separate indices that just get managed in this manner.
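The alias update described here can be sketched as a single request body containing a list of add and remove actions, which is the shape the `_aliases` endpoint accepts. The Python below only builds and prints that body. The helper `rotate_alias_actions` is hypothetical, and the exact spellings `logs_2017_06`, `logs_2017_05` (the previous month, May), `logs_2017_03`, and `logs_last_three_months` are assumptions about how the spoken names are written.

```python
import json


def rotate_alias_actions(alias: str, add_index: str, remove_index: str) -> list:
    """One add and one remove action for a single alias (hypothetical helper)."""
    return [
        {"add": {"alias": alias, "index": add_index}},
        {"remove": {"alias": alias, "index": remove_index}},
    ]


# June arrives: logs_current moves from May to June, and the three-month
# alias gains June and drops March.
actions = (
    rotate_alias_actions("logs_current", "logs_2017_06", "logs_2017_05")
    + rotate_alias_actions("logs_last_three_months", "logs_2017_06", "logs_2017_03")
)
body = {"actions": actions}

print("POST /_aliases")
print(json.dumps(body, indent=2))
```

Sending all four actions in one request means both aliases switch over together, so a search never sees an alias that is temporarily missing a month.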
So if you do have time-based requests, where typically you're only going to be querying a certain period of time relative to now, this can be a very, very good strategy. And optionally, as old data falls off those aliases, you could delete those indices to free up more space in your cluster and keep your server capacity needs manageable and constant, or relatively constant at least. Maybe you want to back that up to a snapshot first, and we'll talk about that a few lectures from now. But this is the basic idea of alias rotation, and how you can use multiple indices to add capacity and add new data without having to re-index things all the time. Very powerful idea.
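One way to decide which indices have "fallen off" is to compute the retention window and compare it against the indices that exist. The sketch below assumes monthly indices named `logs_YYYY_MM` and a three-month window; `months_back` and `plan_rotation` are hypothetical helpers, and the code only plans the cleanup, it does not delete anything or take snapshots.

```python
def months_back(year: int, month: int, count: int) -> list:
    """The `count` most recent (year, month) pairs ending at the given month,
    newest first, wrapping across year boundaries."""
    result = []
    for _ in range(count):
        result.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return result


def plan_rotation(existing: list, year: int, month: int,
                  keep_months: int = 3, prefix: str = "logs_") -> dict:
    """Split existing monthly indices into those inside the retention window
    and those that are candidates for snapshot and deletion."""
    keep = {f"{prefix}{y}_{m:02d}" for y, m in months_back(year, month, keep_months)}
    return {
        "keep": sorted(name for name in existing if name in keep),
        "delete_candidates": sorted(name for name in existing if name not in keep),
    }


existing = ["logs_2017_03", "logs_2017_04", "logs_2017_05", "logs_2017_06"]
plan = plan_rotation(existing, 2017, 6)
print(plan)
```

Here March is the only index outside the three-month window once June arrives, so it is the one you might snapshot and then delete.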