N-Grams, and Search as you Type

A free video tutorial from Sundog Education by Frank Kane
Join over 800K students learning ML, AI, AWS, and Data Eng.
Instructor rating: 4.6 out of 5
37 courses
849,072 students

Lecture description

We'll see what N-Grams are, how they can be used for prefix searching, and how to implement a search-as-you-type system using N-Grams.

Learn more from the full course

Elasticsearch 6 and Elastic Stack - In Depth and Hands On!

Search, analyze, and visualize big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.

08:03:25 of on-demand video • Updated May 2019

Install and configure Elasticsearch 6 on a cluster
Create search indices and mappings
Search full-text and structured data in several different ways
Import data into Elasticsearch using several different techniques
Integrate Elasticsearch with other systems, such as Spark, Kafka, relational databases, S3, and more
Aggregate structured data using buckets and metrics
Use Logstash and the "ELK stack" to import streaming log data into Elasticsearch
Use Filebeats and the Elastic Stack to import streaming data at scale
Analyze and visualize data in Elasticsearch using Kibana
Manage operations on production Elasticsearch clusters
Use cloud-based solutions including Amazon's Elasticsearch Service and Elastic Cloud
Let's wrap up our discussion of search with a feature that you might not even think of Elasticsearch as being able to do: search-as-you-type. So you know how when you go to Wikipedia or IMDb and sites like that, when you just start typing in a phrase, it will automatically complete it and give you some suggestions for searches while you're typing? Google does the same thing. You can actually use Elasticsearch in a few different ways to do something similar, so let's find out how that works. The easiest approach is called query-time search-as-you-type, and it's easy because you don't have to index your data in any particular way for this to work. It's just using the same prefix search capability that we talked about earlier. So in this example, let's just imagine that the user typed in the term "star trek". You can use a specialized query called match_phrase_prefix, and it's just like the prefix query that we looked at before for prefix searches, but it works at the phrase level. So by typing in "star trek", it will search for any titles in this example that begin with the phrase "star trek", and you can also specify a slop value with that query. So if you want to provide more flexibility with the ordering of the words in that phrase and things like that, you can specify a slop, and with that, you know, you can actually get back results for people who searched for "trek star", or titles that don't quite match that phrase exactly and might have stuff in between the terms, if you want to. You don't have to. Now the only problem with this is that it is a little bit resource-intensive compared to an index-based solution. So while there are properties of the inverted index that make prefix matching more efficient than you might think, it's still a fairly intensive operation. So if you're gonna be depending on this at large scale, like someone like Google might, you really want an index-time solution, and here's what that would look like.
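As a rough sketch, a query-time search-as-you-type request body might look like the following, expressed here as a Python dict. The index name "movies" in the comment and the exact slop value are assumptions for illustration, not taken from the lecture:

```python
import json

# Hypothetical request body for a query-time search-as-you-type lookup.
# match_phrase_prefix treats the last word as a prefix; slop relaxes
# word ordering so e.g. "trek star" can still match.
query = {
    "query": {
        "match_phrase_prefix": {
            "title": {
                "query": "star trek",  # what the user has typed so far
                "slop": 10             # assumed value; tune for your data
            }
        }
    }
}

# This body would be sent to something like:
#   curl -XGET 'localhost:9200/movies/movie/_search?pretty' -d '<body>'
print(json.dumps(query, indent=2))
```

The slop is optional; leaving it out requires the title to start with the typed phrase in order.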
So the trick with doing this at index-time is using something called n-grams, and it's a very simple concept to wrap your head around, really, so let's examine the term "star". If you were to build up a set of unigrams from the word "star", that would just consist of all of the single-letter sets that compose that term. So the set of unigrams would be S, T, A, and R. You can also compose a set of bigrams, which are just pairs of letters inside the search term. So in this example, the "star" word would consist of the bigrams, S-T, T-A, and A-R. Those are all of the possible two-letter combinations found within that word, in order. Similarly, it has two trigrams, S-T-A, and T-A-R, and only one 4-gram because it's only a four-letter word to begin with, which is S-T-A-R. Now this is applicable to search-as-you-type because you can imagine if you treat the input as an n-gram basically, you can match that against your index to see what terms it matches. So if I were to type in the letter S, 'cause I'm starting the search for Star Trek, that unigram would match the set of unigrams from star, the letter S. When I type in S-T, I now have a bigram I can search against and that S-T bigram would be indexed against star as well, or Star Trek, and if I type in S-T-A, I now have a trigram of S-T-A and that could be inverted-indexed to the term Star Trek as well. So you see how that works. Now there's a specialized type of n-grams called edge n-grams because you may have noticed that all we really care about for search-as-you-type are the beginning n-grams of a given phrase or a search term. So edge n-grams allow you to only compute the n-grams for a given term at the beginning. So for example, if I was computing just the edge n-grams for star, I would only have a single unigram of S, a single bigram of S-T, a single trigram of S-T-A, and a single 4-gram of S-T-A-R. 
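The n-gram idea above can be sketched in a few lines of Python: all n-grams of a term are its contiguous substrings of length n, and the edge n-grams are just the ones anchored at the start of the term. The min/max parameters mirror the 1-to-20 range used later in the lecture:

```python
def ngrams(term, n):
    """All contiguous substrings of length n, in order."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term, min_gram=1, max_gram=20):
    """Only the prefixes of the term, from min_gram to max_gram letters."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(ngrams("star", 2))    # ['st', 'ta', 'ar']
print(ngrams("star", 3))    # ['sta', 'tar']
print(edge_ngrams("star"))  # ['s', 'st', 'sta', 'star']
```

Notice how the edge n-grams of "star" are exactly the prefixes a user would type while searching for it, which is why they're all we need to index for search-as-you-type.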
So now we have a very straightforward way of matching up combinations of letters against the prefixes that they might match for a given search term. How do we set this up in Elasticsearch? Well, it's a little bit of work, but it's pretty easy to understand once you take a look at it. The first step is to create your own custom analyzer that we're going to call autocomplete. So you see in this syntax here, what we're doing is we're creating our own custom filter that creates edge n-grams, and we're saying we're gonna have a minimum n-gram length of one, a single character, with a maximum of 20, so we can handle words up to 20 letters long in our search-as-you-type system. We then set up a new custom analyzer that, in addition to the standard lowercase filter, also has our autocomplete filter with the n-grams as part of it, and we say we'll use the standard tokenizer with that as well. You might choose a different one depending on what you want to do, but this is a sane set of settings to start with here. So once we've actually created this new autocomplete analyzer, we can apply it at index-time. So when we're creating a mapping for our index, and again, this mapping needs to be in place before you index your data, you can set, for example, as part of the properties on the title field, not only that it's of string type, but also that it's going to use our custom analyzer called autocomplete, which will in turn use the filter we created that creates edge n-grams in our index. Once we've done that, we just need to make sure that our query side is doing the right thing. So the trick here is that we want to use the standard analyzer for our queries. We obviously don't want to be splitting up everything we type into n-grams, but we want it to match against the n-grams that we created on the index side. So this is a rare example of where you want a different analyzer on the query side and on the index side.
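The index settings being described can be sketched as follows, written as a Python dict so the structure is easy to see. The filter name "autocomplete_filter" is an assumption; what matters is the edge_ngram type, the 1-to-20 gram range, and chaining the standard tokenizer, lowercase filter, and our edge n-gram filter together:

```python
import json

# Sketch of custom-analyzer index settings, following the Elasticsearch 6
# syntax described in the lecture. "autocomplete_filter" is an assumed name.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,   # down to single-character grams
                    "max_gram": 20   # handle words up to 20 letters
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"]
                }
            }
        }
    }
}

# This body would go with a curl -XPUT when (re)creating the index.
print(json.dumps(settings, indent=2))
```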
So you can see in this example, let's imagine the user typed in S-T-A, so that will be our query "sta". We specify the standard analyzer on the query side so it doesn't mess with that, but on the index side, that will be matched against the trigram S-T-A, which will be associated with Star Trek, okay? So you see how that works? It's a little bit convoluted, but the concepts aren't that difficult really. One last way of doing this is something called completion suggesters. So Elasticsearch has a mechanism called completion suggesters that allows you to upload explicit lists of completions ahead of time. So if you really want things to work as efficiently as possible and have complete control over how autocomplete works, this provides a mechanism to allow you to do that. So if you have a lot of engineering time and you really want the best possible results and the most control over search-as-you-type, completion suggesters are worth looking into. But for now, let's take a look at those other approaches and actually put them into action and get our hands dirty with 'em. So let's try the edge n-gram solution to autocomplete and see that in action. Now like we said, we need to actually reindex our data to get those n-grams in place, so the first thing we need to do, unfortunately, is delete our existing index. So enter in that command there, curl -XDELETE, if you'd like to follow along, and that will blow away our existing index. And now we need to set up our new custom analyzer that contains an autocompletion filter, and again, just to save time, I'm gonna put that up there. So go ahead and hit Pause if you want to type that in yourself, and if you don't want to worry about the tabs, you don't have to, but I put them here to make it easier to see what's going on.
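The lecture only mentions completion suggesters in passing, but as a hedged sketch of what that approach looks like in Elasticsearch 6: you map a field of type "completion", index your explicit suggestions into it, and query it through the suggest API. The field name "title_suggest" and the suggestion name are hypothetical:

```python
import json

# Hypothetical mapping: a dedicated completion field holds the
# pre-supplied suggestions ("title_suggest" is an assumed name).
mapping = {
    "mappings": {
        "movie": {
            "properties": {
                "title_suggest": {"type": "completion"}
            }
        }
    }
}

# Hypothetical suggest request: ask for completions of the prefix the
# user has typed so far.
suggest_query = {
    "suggest": {
        "title-suggest": {
            "prefix": "star tr",
            "completion": {"field": "title_suggest"}
        }
    }
}

print(json.dumps(suggest_query, indent=2))
```

The trade-off the lecture describes is visible here: the results are only as good as the suggestion lists you upload ahead of time.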
Again, you can see that we're setting up a custom analyzer here that has the autocomplete filter we defined, which consists of edge n-grams between one and 20 characters, and that has been made part of the filter set for our new autocomplete analyzer, which consists of a standard tokenizer, the lowercase filter, and our autocomplete filter that we defined that produces the edge n-grams at index-time. So let's hit return here to actually get those settings in place. So now we have a new analyzer in place on the index side. So it's possible to actually test out that analyzer that we just created, and we can do that by passing in a specific text field that we want to analyze and see how it actually breaks it up. So let's do that with curl -XGET as the special command there, and we want pretty results back, with the following request body in JSON format of course. The analyzer that we want to test is called autocomplete, which we just created, and we will pass in the text "Sta" and see what it does with it. All right, so you can see that it broke up "Sta" into a unigram of S, a bigram of S-T, and a trigram of S-T-A. So it looks like our n-gram analyzer is actually working here. Now all we have to do is actually apply that analyzer to our title field and remap things appropriately. So let's get that out of the way, curl -XPUT -d '. All right, and I'll skip the indents this time just to save some time. We'll say movie will contain properties for title, and that'll have a type of text, meaning that it can be analyzed, and we will assign our special analyzer that we named autocomplete. Now we just have to close everything off. It looks like it took it. So let's go ahead and reindex our data, curl -XPUT, --data-binary @movies.json. All right, so now we can play around with it. All right, so let's try it out. Let's say curl -XGET, _search in pretty format, with the following search query. Let's just do a match query like we would normally.
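The two request bodies used in this demo can be sketched as Python dicts. The _analyze body tests the new analyzer against the text "Sta", and the mapping body assigns the autocomplete analyzer to the title field; the exact index and type names in the curl commands are elided in the lecture, so only the bodies are shown:

```python
import json

# Body for the _analyze test (curl -XGET '.../_analyze?pretty' -d '<body>').
# Running it against the autocomplete analyzer should return the edge
# n-grams of "Sta": s, st, sta.
analyze_request = {
    "analyzer": "autocomplete",
    "text": "Sta"
}

# Body for the mapping update (curl -XPUT '.../_mapping/movie' -d '<body>'):
# the title field is analyzable text, handled by our custom analyzer.
title_mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "autocomplete"
        }
    }
}

print(json.dumps(title_mapping, indent=2))
```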
For the title, let's imagine that the user typed in "sta". Let's see what we get back. The observant among you may realize that this isn't going to work. Let's see if you were paying attention. All right, why did we get back Plan 9 from Outer Space when we searched for "sta"? And I did all this work to index all the edge n-grams, so what went wrong? Well, remember that we need to specify explicitly that we use the standard analyzer on the query side and not on the index side. What happened here is that because the edge n-gram analyzer was associated with both sides by default, my "sta" got broken up into its unigram components. So in addition to everything else, we're not just searching for the "sta" prefix, we're actually searching for anything that contains the unigrams S, T, or A, and since the letters S, T, and A all appear in Plan 9 from Outer Space, that came back as a legitimate match for "sta", because I was indexing all of those different unigrams. So to get around that, all we need to do is make sure that we specify the standard analyzer on the query side, and we can do that like this. I'm just gonna hit the up-arrow and modify this query slightly. So instead of just specifying that "sta" title, I'm gonna give it a little bit more information here to go with. We'll specify that the query is "sta", but we'll now specify the analyzer on the query side to be standard, and now we should get the results we expect. Sure enough. So now for "sta", we're getting Star Trek Beyond and Star Wars, because those both match the trigram S-T-A. So we can play around with this some more. Let's change that to actually be a little bit more specific, like "star tr", and we got Star Trek again, but we're still getting Star Wars, so you know, we're again encountering the limitations of the edge n-gram approach here.
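The corrected query described here can be sketched as a Python dict: the same match query on title, but forcing the standard analyzer at query time so the typed-in prefix is not itself broken into n-grams:

```python
import json

# Query body for the fixed search-as-you-type lookup: the "analyzer"
# override applies only at query time; the index-side edge n-grams
# produced by the autocomplete analyzer are untouched.
query = {
    "query": {
        "match": {
            "title": {
                "query": "sta",
                "analyzer": "standard"
            }
        }
    }
}

print(json.dumps(query, indent=2))
```

Without the "analyzer": "standard" line, this is exactly the naive query that matched Plan 9 from Outer Space.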
The problem is that each word is still being treated as an individual search term, so our standard analyzer is still splitting that up, and since the term "star" still matches Star Wars in the edge n-grams, the fact that we also had T-R in there doesn't really help matters. Now the original approach we talked about, using match_phrase_prefix, would get around that, but, you know, again, the only real way to have complete control and really perfect results is to use completion suggesters. So consider these approaches good starting points for getting some good results, but every one of them has its own limitations. Completion suggesters are really the only way to get perfect results, but obviously that requires a lot more work, because you need to submit ahead of time what all those completions are. But anyway, those are some thoughts on how to do autocomplete, search-as-you-type, with Elasticsearch. There are many ways to do it, and each one has its pros and cons. If your aim is to use Elasticsearch for what it was originally intended for, searching, you've got all the basics you need to become productive right now. I hope you're feeling more and more confident about Elasticsearch and how it works at this point. Believe it or not, we're hardly halfway through covering everything Elasticsearch can do. Stay with me. Elasticsearch has many more tricks up its sleeve for us to explore.