Bag-of-words model

Hadelin de Ponteves
A free video tutorial from Hadelin de Ponteves
AI Entrepreneur
4.5 instructor rating • 28 courses • 1,275,092 students

Learn more from the full course

Deep Learning and NLP A-Z™: How to create a ChatBot

Learn the Theory and How to implement state of the art Deep Natural Language Processing models in Tensorflow and Python

11:38:44 of on-demand video • Updated April 2021

  • Why this is important
  • Types of Natural Language Processing
  • Classical vs. Deep Learning Models
  • End to End Deep Learning Models
  • Seq2Seq Architecture & Training
  • Beam Search Decoding
Welcome back to the course on deep natural language processing. Today we're looking at the bag-of-words model.

The first thing I'd like us to look at is an email, an email I received just a few days ago. So here we go. The email is about a catch-up, and my friend is asking: "Hello Kirill, checking if you're back in Oz" (Oz stands for Australia) "let me know if you're around and keen to catch up on how things are going. I could definitely use some of your creative thinking to help with a project of mine. Cheers, V."

So what I'd like us to pay attention to: first of all, you can see that I sent this email to myself, but that's only because I wanted to read the email and reply to it here while keeping my friend's privacy. This is a real email; this is the exact text that I got literally a couple of days ago. The title was different, but I just changed it to "Catch up".

What's interesting is that we're going to be looking at how we can apply natural language processing to this email in the next couple of tutorials, so it will give us a real-life example to work with. The other thing to notice is that here, in the Gmail app for iPhone, you can see Google is giving me some suggestions. Very interesting: it's suggesting some quick replies that I can use, such as "Yes, I'm around", "I'm back", and "No, I'm not". Let's keep that in mind; we'll come back to it later.

In the meantime, the text of the email is here. What can we do with it? All right, first things first, we're going to start simple. We're going to look at how we can create a model that will give us a yes/no response, because that's the kind of question being asked. The question is: are you back in Australia? Let me know if you're around and keen to catch up. So: yes or no. Of course, it's better to have a longer response.
That's the social norm; it's the etiquette when conversing with people. But let's start by trying to get a yes/no response and see how we would go about that, because that's a first step into NLP, and further on we'll see how we can expand on it.

All right, so we're going to start off with a vector, just an array full of zeros. Like that: 0, 0, 0, 0, ... How many zeros? Well, a lot of zeros: 20,000 elements in total. Why is that? Well, it comes from the way we're building the model: 20,000 is roughly the number of words that are commonly used by the average native English speaker. Here's a quick search on Google. I typed "how many words in the English language?" and it came up with "how many words are there in the English language?": 171,476 words. That's how many entries are currently in the dictionary, plus some obsolete words, plus derivative words and so on. But you can also see Google suggesting a related answer: most adult native test-takers range from 20,000 to 35,000 words; average native test-takers of age 8 already know about 10,000 words; average test-takers of age 4 know about 5,000 words. The point I wanted to make, first of all, is the 20,000 figure, and we will see why exactly in a moment. But there's something else worth pointing out: even this search, on its own, is Google applying natural language processing. It looked at what we wrote and then matched it against other, similar questions, like "how many words in the English language does the average person know?". That's not the question I asked, but it came up with that, and with many other related questions.
So you can see the irony: even in this search on its own, we're already experiencing natural language processing, even though that wasn't our intention and it's not what we're here to talk about. But it's funny that it came up anyway.

So, 20,000 words. A fun fact is that out of those 171,476 words, we actually use only about 3,000, and not just in conversational language: as you can see here, a vocabulary of just 3,000 words provides coverage for around 95 percent of common texts, and I'm assuming that includes books and the like. If you do the math, that's only about 1.75 percent of the total number of words in the English language. So our 20,000 is comfortably more than the 3,000 that covers the vast majority of situations; we're definitely covered if we say that our vocabulary, all the possible words we can encounter, is going to fit into a vector of 20,000 elements.

So basically what we're saying, and this is important, is that every word in the English language has a position somewhere on this vector. For example, the word "if" could have this position: count one, two, three, four, five, six, seven; the seventh position in our custom-made vector. The word "if" is always going to be in that position; that's crucial. We can construct this vector any way we want: the word "badminton" could be on this position, and it will always be on this position; the word "table" is going to be on this position. And that's how the bag-of-words model works. So just keep in mind: once we've taken our 20,000 words, we've assigned each of them a place. This position in the vector will be associated with "if", this one will be associated with "badminton", this one with "table".
And another thing: you can see I've greyed out the first two positions and the last one. The first two are reserved for SOS and EOS: SOS stands for start of sentence, EOS stands for end of sentence. The last one is reserved for special words, and that's for those words you're wondering about; I can hear your brain churning right now: what about those other 150,000 or so words that we didn't take into account? What if they come up? Well, if they come up, we're going to associate them with this last element. Any word that we can't recognize among the 20,000 gets thrown into that last element.

All right, so let's go back to our email text. Here it is: "Hello Kirill, checking if you're back in Oz, let me know if you're around", etc., etc. "Cheers, V." Let's see how this can be put into our bag of words. You've probably noticed by now that this vector is the bag of words we're constructing, so now we get to throw the text into it. Here's what's going to happen: I'm just going to throw it in, and then I'll explain what happened. So there it is; that's the result. Of course it depends on how we construct our vector, but this is the result for the way we constructed ours.

Let's look at it this way. As we discussed previously, we took the 20,000 words and associated each position with a word; now we go through our text and increase the counter at the position associated with each word. Let's say "hello" in our vector is at position number five; because we have only one "hello" in this whole email, we put a one there. "Kirill" is definitely not an English-language word, so it has to go into that last element. And the reason there's a three there is because we have "Kirill", "Oz", and "V", and those are not English-language words, not among our 20,000.
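The fixed-position vocabulary described above can be sketched in a few lines of Python. This is a toy sketch, not the course's actual code: the word list is a made-up sample standing in for the 20,000 most common English words, and the token names `<SOS>`/`<EOS>` are illustrative.

```python
# Sketch of the fixed-position vocabulary: positions 0 and 1 are
# reserved for the start/end-of-sentence tokens, and the very last
# position catches every word outside the vocabulary.

VOCAB_SIZE = 20000

# In reality this would be the 20,000 most common English words;
# here we use a tiny illustrative list.
common_words = ["hello", "checking", "if", "you're", "back", "in", ","]

word_to_pos = {"<SOS>": 0, "<EOS>": 1}
for i, w in enumerate(common_words):
    word_to_pos[w] = 2 + i          # every known word gets a fixed slot
UNK_POS = VOCAB_SIZE - 1            # last slot for unrecognized words

def position(word):
    """A word always maps to the same position in the vector."""
    return word_to_pos.get(word.lower(), UNK_POS)

print(position("if"))       # always the same fixed slot
print(position("Kirill"))   # not in the vocabulary -> last position
```

The key property is exactly what the lecture stresses: `position("if")` returns the same index every time, which is what makes vectors from different emails comparable.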
They all go into that last element. Then we've got the comma; surprise, the comma also has a position too. Let's say it's in position number nine, so the ninth position is associated with the comma, and because we have one comma in our email... oh, actually we have two commas. OK, so this should be a two, but let's forget about that second comma; I didn't notice it. So, assuming we have one comma in our email, this is a one. "Checking": let's say this element is associated with the word "checking"; it's a one because there's only one "checking". "If" is a two because we have two "if"s in our email. "You're" is also going to be a two, because we have two "you're"s in our email, including the rest of the text; I don't think there are any more.

And so that's basically how we fill this bag of words: we just put in the count of each word at its position. Pretty straightforward; we're just filling in this vector. As you can see, it's going to be quite a sparse vector: lots of zeros, almost 20,000 zeros, with only some of the positions filled in.

So what is our goal? Our goal, as we discussed before, is to come up with a reply, yes or no, to this email, which is now in the form of a vector. How are we going to do that? Well, we're going to do it through training data. We're going to look at all the emails that I have replied to, because we're training the model to reply to my emails; in your case, in anybody's case, it would be trained to reply to their emails. We're going to need some training data, and I'm going to fish it out of my inbox and outbox. So let's look at a couple. Here we've got: "Hey mate, have you read about Hinton's capsule networks?", and my reply to that: no. So we're going to use that as a training example. Next one: "Did you like that recipe I sent you last week?" The answer was yes; it was a good recipe, I guess. So now we have a third: "Hi Kirill, are you coming to dinner tonight?"
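The counting step can be sketched directly. Again this is an illustrative toy: the position numbers in `word_to_pos` are the hypothetical slots used in the lecture's example, not real vocabulary indices, and the tokenization is deliberately naive.

```python
def bag_of_words(text, word_to_pos, vocab_size):
    """Count each token into its fixed position; unknowns share the last slot."""
    vec = [0] * vocab_size
    # Naive tokenization for illustration: split commas off, lowercase, split on spaces.
    tokens = text.replace(",", " , ").lower().split()
    for tok in tokens:
        vec[word_to_pos.get(tok, vocab_size - 1)] += 1
    return vec

# Hypothetical slots matching the lecture's walkthrough.
word_to_pos = {"hello": 5, "checking": 3, "if": 7, "you're": 4, ",": 9}
email = "Hello Kirill, checking if you're back in Oz, let me know if you're around"
vec = bag_of_words(email, word_to_pos, 20000)

print(vec[7])      # "if" appears twice
print(vec[9])      # both commas, including the one the lecture overlooked
print(vec[19999])  # every token not in the toy vocabulary
```

Note how sparse the result is: out of 20,000 entries, only a handful are non-zero, just as described above.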
Yes. "Dear Kirill, would you like to insure your car with us again?" No. "Are you coming to Australia in December?" Yes. And so on.

Ideally, we would have tens or hundreds of thousands of emails like that, with yes-or-no responses. Of course, it would be a lot of groundwork to get that data, because we usually don't just respond "yes" or "no" to emails; we'd have to look at each answer and understand the sentiment: overall, was it a yes or a no? So of course this is more of a theoretical example; nobody is going to do this for their own inbox. But nevertheless, it makes the point.

So how would we use this data to train? We would apply the same principle and convert each one of those emails to a vector, and again, each vector would be 20,000 elements long. I just threw some numbers in here to get the point across; they're not exactly accurate. So we have lots and lots of vectors, and lots and lots of responses, yes and no. Now, once we have all this data, we're going to apply a model. One of the algorithms we can apply to the bag-of-words representation is logistic regression. So we apply logistic regression to this information, to the vectors and their yes/no responses. Once we have that model, once we've modeled what is likely to yield a yes and what is likely to yield a no, and the boundary between them, then we can feed the actual email that we received into this model and get a response; for instance, yes. And that's it: we use all the training data to create a model.
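A minimal sketch of this logistic-regression step, assuming a toy 5-word vocabulary instead of 20,000 and completely made-up training vectors (the numbers do not come from real email data). It trains one weight per vocabulary position with plain gradient descent on the log-loss:

```python
import math

# Toy training set: tiny bag-of-words vectors paired with
# historical yes(1)/no(0) replies. All numbers are illustrative.
X = [
    [1, 0, 1, 0, 0],   # -> yes
    [0, 1, 0, 1, 0],   # -> no
    [1, 0, 0, 0, 1],   # -> yes
    [0, 1, 1, 0, 0],   # -> no
]
y = [1, 0, 1, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One weight per vocabulary position, plus a bias.
w = [0.0] * 5
b = 0.0
lr = 0.5
for _ in range(1000):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi                       # gradient of the log-loss
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

# The new email has exactly the same format and length as the training vectors.
new_email = [1, 0, 1, 0, 0]
p_yes = sigmoid(sum(wj * xj for wj, xj in zip(w, new_email)) + b)
print("yes" if p_yes > 0.5 else "no")
```

The crucial detail mirrors the lecture: the new email is vectorized into exactly the same 20,000-slot format as the training data, so it can be fed straight into the fitted model.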
Then we feed in our actual email, which, importantly, has exactly the same format. You can see that every input here, every independent-variable vector we train on, always has the same length, 20,000, and always has the same format: we know this position always corresponds to a certain word, this position is always a certain word; say the position we counted earlier, the seventh, always corresponds to "if", right? So because it's the same format and always the same length of 20,000, we can safely feed this vector in; it's got the same number of features. Bam, we get an answer; for instance, we get yes. And then we can look back at what the actual email said: "Hello Kirill...". OK, so based on my training data, I would most likely have replied yes to this. Interesting.

First of all, let's put this on our diagram: a natural language processing algorithm which is called bag-of-words. So it goes over there.

The other approach we could take here: instead of logistic regression, we could use a neural network, because we have vectors, right? We could feed all these vectors, over 20,000 values, into the input layer of a neural network. They would go through as many hidden layers as we want; it's our own decision how to structure it; and bam, we've got an output layer that tells us yes or no. And again, we would use all this data that we have here, all our millions and millions of emails and responses, to train the neural network. Through backpropagation and stochastic gradient descent, all the weights would be updated, and bam, we have an answer.
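The neural-network variant can be sketched at toy scale too. This is a minimal one-hidden-layer network trained with backpropagation and stochastic gradient descent, assuming the same made-up 5-slot vectors as before (5 inputs standing in for 20,000, and an arbitrary choice of 4 hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative training data as the logistic-regression sketch.
X = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 1, 0, 0]], dtype=float)
y = np.array([1.0, 0.0, 1.0, 0.0])

W1 = rng.normal(scale=0.5, size=(5, 4))   # input layer -> hidden layer
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=4)        # hidden layer -> output neuron
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(2000):
    for xi, yi in zip(X, y):              # stochastic: one email at a time
        h = sigmoid(xi @ W1 + b1)         # forward pass
        p = sigmoid(h @ W2 + b2)
        # Backpropagation of the log-loss gradient through both layers.
        d_out = p - yi
        d_h = d_out * W2 * h * (1 - h)
        W2 -= lr * d_out * h
        b2 -= lr * d_out
        W1 -= lr * np.outer(xi, d_h)
        b1 -= lr * d_h

new_email = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
p_yes = sigmoid(sigmoid(new_email @ W1 + b1) @ W2 + b2)
print("yes" if p_yes > 0.5 else "no")
```

As the lecture notes, a well-trained network and a well-trained logistic regression should usually agree on examples like this; the difference is the hidden layer, which is what makes this the "deep" version of the bag-of-words model.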
So, to recap how we get that answer: we would use these pairs, each vector and its answer, to train the network, minimizing the errors through gradient descent and backpropagation as the weights are updated, and bam, we have a neural network, all trained up. Now we feed our vector, which represents our new email, into the neural network, and voila, we get our answer. In this case it might also be yes; it might be different, but if both models are trained well, they should come up with similar or the same answers most of the time. And in this case we've got a deep natural language processing algorithm; I didn't put emphasis on that before: deep, because we're using a neural network, and that is the difference.

So in both cases it's a bag-of-words model: in one case the classical bag-of-words, in the other the deep bag-of-words. But in both cases it is still a bag of words, and it has its own limitations and issues that are not that great. I'll point out one of them right now: the response is very simple, just a yes or a no. We want something more sophisticated; we want a conversation, and you can't really have a conversation, you can't really build a chatbot, with something that's going to say yes or no all the time. So that's one of the limitations. We'll talk about some more of them in the upcoming tutorials, and we'll also see how to overcome those limitations and what models await us in the future.

I hope you enjoyed this; I really enjoyed going through all of this with you. I can't wait to see you in the next tutorial. Until then, enjoy natural language processing.