# Image to text

**A free video tutorial from** Martin Jocqueviel

Freelance data scientist

4.3 instructor rating • 6 courses • 51,029 students

### Learn more from the full course

Modern Natural Language Processing in Python

Solve Seq2Seq and Classification NLP tasks with Transformer and CNN using Tensorflow 2 in Google Colab

05:45:42 of on-demand video • Updated February 2021

- Build a Transformer, a new model created by Google, for any sequence to sequence task (e.g. a translator)
- Build a CNN specialized in NLP for any classification task (e.g. sentiment analysis)
- Write a custom training process for more advanced training methods in NLP
- Create custom layers and models in TF 2.0 for specific NLP tasks
- Use Google Colab and Tensorflow 2.0 for your AI implementations
- Pick the best model for each NLP task
- Understand how we get computers to give meaning to the human language
- Create datasets for AI from those data
- Clean text data
- Understand why and how each of those models works
- Understand everything about the attention mechanism, which lies behind the newest and most powerful NLP algorithms

English [Auto]
Hi and welcome back to this course. So now that we have seen how CNNs are applied to images for computer vision, let's see how we can apply them to text for NLP. And so the first part will be to see how we can transform a text so that it's a valid input for a CNN.

So the idea was to say, OK, a CNN for images, what does it do? It searches for local features in an image. So why wouldn't we process text or sentences the same way, by looking for local features throughout the whole sentence? But in order to do that, we have to have the same kind of inputs as images, which were matrices, so we want to get a matrix out of a sentence. We want to represent the sentence as a matrix, because for now it's just a list of characters or a list of words, and it doesn't really look like a matrix so far.

So what we would like to have is something like this. And the most intuitive way is to say that if we have a matrix, each row will correspond to a word and each column will correspond to, well, we will find out later. But this representation means that each word will be a vector. So let's wrap our heads around that.

The easiest and maybe most natural way to represent a word with a vector is also the least effective one. It is what we call one-hot encoding. So basically, if we say that our vocabulary has 100,000 words, each word will be a vector of that size and it will be mostly zeros. It will only have one one. For example, for dog, if we say that dog is word number 238, well, at index 238 of our vector we will see a one, and all the other numbers will be zero. So that way we have a unique representation for each of our 100,000 words, but there is absolutely no relation between them. There is no mathematical relation that appears, there is no meaning that is conveyed, which shows how little information this encoding carries.
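
To make the one-hot idea concrete, here is a minimal sketch in plain Python (no libraries, so it runs anywhere). The vocabulary size and the index 238 for dog come from the lecture; the index used for a second word is a hypothetical value chosen only to compare two vectors.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range for this vocabulary")
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


VOCAB_SIZE = 100_000  # vocabulary size used in the lecture
DOG_INDEX = 238       # "dog" is word number 238 in the lecture's example
CAT_INDEX = 512       # hypothetical index, for illustration only

dog = one_hot(DOG_INDEX, VOCAB_SIZE)
cat = one_hot(CAT_INDEX, VOCAB_SIZE)

# The dot product of two different one-hot vectors is always 0,
# so this encoding says nothing about how similar two words are.
dot = sum(d * c for d, c in zip(dog, cat))
print(len(dog), sum(dog), dog[DOG_INDEX], dot)  # 100000 1 1 0
```

Every pair of distinct words gets the same dot product of zero, which is exactly the "no relation between them" problem described above.
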
Well, it conveys information about which word we are dealing with, but nothing about the relations between those words. So what we would like to have is a smaller representation of each word. Instead of having a vector of size 100,000, we would have a vector of smaller size, let's say, for instance, 64. That's a common thing in science: we want to make the vector smaller. That means that we add more constraints to our information, in a certain way. It has less freedom, and that forces our system to create relations, links or even meaning in the process, if we do it the right way, of course. So this time, dog, for instance, instead of being a vector of size 100,000 with all zeros but one, would be a vector of size 64, and it wouldn't be binary anymore. It would be numbers between zero and one.

So before going into the mathematical details to see how we do that, we will just have a look at some cool visual representations to see what it does and how it performs. The first one is this quite well-known effect of word embedding. Now that the words are embedded in a smaller vector space, we have some mathematical relations between the words as vectors, and we can actually sum the vectors. And summing the vectors means summing meaning in our embedded space, so we can get some very cool stuff. If we take the word king, for instance, and we say minus man plus woman, we actually get the word queen. So if the embedding has been done properly, we can subtract a part of the meaning of a word and add another one. For king, we can get rid of the male part of king and add the female part to this result in order to get queen.

Another very simple example is Paris minus France plus Italy: we get Rome. With Paris minus France, we get the idea of a capital, and if we add Italy, then we get the capital of Italy, which is Rome. So that means that we are now able to apply mathematics to our words, to our vectors.
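
The king minus man plus woman arithmetic can be sketched with a few hand-made vectors. This is only an illustration: the three-dimensional embeddings below are invented (they are not learned from data), and the function name `analogy` is a hypothetical helper. As is common practice for this kind of query, the three input words are excluded from the candidates.

```python
import math

# Hand-made toy embeddings (NOT learned). The three axes are hypothetical
# "royalty", "male" and "female" dimensions, chosen only for illustration.
EMBEDDINGS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman", EMBEDDINGS))  # queen
```

With real embeddings the result is only approximately the expected word, which is why the nearest neighbour by cosine similarity is used rather than an exact match.
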
And that is very, very useful, of course. So that's a very good point.

Second visual idea: you may have to get a little bit closer to your screen, but this is a two-dimensional representation of some words that have been embedded. And the important thing here is to realize that words that have similar meanings will be very close in the embedded space. So, for instance, if you have a look at some words here, we can see that kingdom is close to states, or that data is close to information. There are also relations between the tenses of verbs here: we can see that become is close to became, close to being, and also close to was. Finally, we can see that he is very close to she, which is close to it also.

So by embedding vectors into a smaller-dimensional vector space, we added constraints, as I said before. And if we do the embedding properly, of course, that forces the positions of our vectors, or the values of our vectors, to convey meaning, and so to make words of similar meaning close in our embedding space.

So that's cool. We get a vector representation of words, which will allow us to get matrices out of sentences. And it seems to be a very powerful tool, because it conveys meaning and we can do mathematical stuff with it. So, awesome. But how does it work exactly, the embedding process, mathematically? What do we do exactly?

The idea is that we start with an input vector that is a one-hot encoded vector, as we saw before: all zeros but only one one for each word. We want to multiply it by a matrix, the embedding matrix, and get our embedded vector. Actually, the fact that we multiply a one-hot vector by that matrix just means that if we have, for instance, word number i, only the i-th entry is activated, so we just take the i-th row of the matrix. So this matrix is just a list of all the embedded vectors of the words. But that's a detail.
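
The remark that multiplying a one-hot vector by the embedding matrix just picks one row can be checked in a few lines. This is a minimal sketch in plain Python; the matrix `W` and its values are hypothetical (4 words, embedding size 3), not taken from the course.

```python
def one_hot(index, size):
    """One-hot row vector of length `size` with a 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def vector_times_matrix(vector, matrix):
    """Multiply a row vector (length V) by a V x D matrix; returns length D."""
    n_cols = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(n_cols)]


# Hypothetical embedding matrix W: one row per word, one column per dimension.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.1, 0.0],
]

word_index = 2
embedded = vector_times_matrix(one_hot(word_index, len(W)), W)

# The full matrix product gives exactly the row of W for that word.
print(embedded == W[word_index])  # True
```

This is why libraries usually implement an embedding layer as a table lookup rather than an actual matrix multiplication: the result is the same and the lookup is much cheaper.
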
So we want to multiply our one-hot encoded vector by this matrix and get an embedded vector. But the question is, how do we train this matrix? How do we train the weights, how do we learn the embedded vectors? We can't, of course, say that from this we want to get back to the original one, because we would just have to apply the opposite operation, just to see which row it corresponds to, and so get the index right here.

So the idea is that we want to multiply by another matrix, which we call the context matrix. You'll see why. And we get, again, a vector of size vocab size. But we want to get something other than the original input vector, while we still want it to have a meaning, to have a correlation with the original word. We want this output vector to have a semantic correlation with the input one.

One idea to do that is to use what we call context. This is the skip-gram model, where basically, for an input word, we pick several other words, which we call context words, and we want them to appear in this output vector. Let's say that word number one corresponds to context words number 10, 20 and 30, for instance. At the end of this operation, so after embedding word number one and after retrieving information with the context matrix, we want the output to have many zeros, but high numbers in positions 10, 20 and 30, which are the indices that correspond to the context words that we defined before for word number one.

In order to get those context words, we actually take a huge corpus of text, and for each word we take, for instance, the two previous words and the two next words as context. So in this sentence, "In spite of everything, I still believe people are really good at heart", which comes from The Diary of Anne Frank, if we take the word good, it's surrounded by the words are, really, at and heart. So this will produce four pairs of input and context, which will be (good, are), (good, really), (good, at) and (good, heart).
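
The way input and context pairs are built from a window of two words on each side can be sketched as follows. This is a minimal illustration in plain Python; the function name `skipgram_pairs` is hypothetical, and the tokenization (lowercasing, punctuation removed by hand) is a simplification assumed for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input, context) pairs using `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# The lecture's example sentence, lowercased and stripped of punctuation.
sentence = "in spite of everything i still believe people are really good at heart"
tokens = sentence.split()

all_pairs = skipgram_pairs(tokens)
good_pairs = [pair for pair in all_pairs if pair[0] == "good"]
print(good_pairs)
# [('good', 'are'), ('good', 'really'), ('good', 'at'), ('good', 'heart')]
```

Words near the start or end of the sentence simply get fewer context words, since the window is clipped at the sentence boundaries.
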
So when we put the word good as input, there will be a one only in the position that corresponds to this word, good. And after embedding it, and after multiplying it by the context matrix, we want to have the indices corresponding to are, really, at and heart activated. Of course, as good will appear many times in our corpus, we'll have a lot more complex vectors than those ones. But the idea is that each time we use a pair of input and context words, we will change those matrices, W and W prime. We will tune the coefficients in order to activate the context word out of the input word.

So the global idea was: from a single word in its one-hot encoded version, get a smaller vector, and from this smaller vector, get several words, or activate several words, that are often close to the initial word in our corpus. That's pretty clever, because we have our embedding phase right here, but it must convey meaning. It must be semantically efficient in order to retrieve the context information after this operation. And of course, this box was only used for the training, you know, to have something that has meaning. But when we embed a vector, we only use this one. This is the one that we are interested in in the end.

An important and interesting thing to notice is that if two words have a close meaning, this means that they will probably have similar contexts in our corpus, and that also means that they will have close embeddings. So if we take this sentence again, we could argue that the word people could be replaced with humans, for instance: "I still believe humans are really good at heart." And that makes sense, because humans and people convey a similar meaning. So in our corpus, we could expect humans and people to usually be surrounded by the same words. That means that after we embed the word people or the word humans, we get two different embedded vectors.
But when we apply the context matrix, we want to get very close results, actually pretty much the same results. So that means that the embedded versions of those two words have to be close. And that's where what we saw earlier comes from: two words of similar meaning will be close in our embedded space.

So that was it for the global idea of word embedding, which will allow us to get efficient vectors from words, and so to get a matrix out of a sentence. If we just want to summarize, this word embedding phase using the skip-gram model performs a dimension reduction of the vector representation of words while adding a semantic relation between them. And I would add that this relation is mathematical, so that we can use it with standard operations in artificial intelligence.

So that was it for the very first step of CNN applied to NLP, which is how to get a valid input for a CNN. And now we're ready to feed these inputs to a CNN and to see what happens in our model, and what we will have to change in order to have a properly working CNN for text. So that's the topic of the next part. And so using.
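
As a recap of the goal of this part, the sketch below turns a sentence into a matrix with one row per word and one column per embedding dimension. It is only an illustration under stated assumptions: the embedding values are random stand-ins for a trained embedding matrix, the helper names are hypothetical, and the size 64 is the example value mentioned earlier.

```python
import random


def build_vocab(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def random_embeddings(vocab_size, dim, seed=0):
    """Stand-in for a trained embedding matrix: random values in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(vocab_size)]


def sentence_to_matrix(tokens, vocab, embeddings):
    """Stack the embedded vector of each word: one row per word."""
    return [embeddings[vocab[token]] for token in tokens]


EMBEDDING_DIM = 64  # example embedding size from the lecture

tokens = "people are really good at heart".split()
vocab = build_vocab(tokens)
embeddings = random_embeddings(len(vocab), EMBEDDING_DIM)
matrix = sentence_to_matrix(tokens, vocab, embeddings)

print(len(matrix), len(matrix[0]))  # 6 64
```

The resulting 6 x 64 matrix has the same kind of shape as a small single-channel image, which is what makes it a valid input for a CNN.
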