Count Vectorization

Abhishek Kumar
A free video tutorial from Abhishek Kumar
Computer Scientist at Adobe
4.1 instructor rating • 16 courses • 9,405 students

Learn more from the full course

Natural Language Processing (NLP) with Python and NLTK

Master Natural Language Processing with Python and NLTK using spam filter detection

03:32:54 of on-demand video • Updated January 2020

  • Natural Language Processing using Python
English [Auto]

Hello and welcome. Let's start with the simplest vectorization technique, which is count vectorization. Count vectorization is a technique that creates a document-term matrix. As we saw in the last video, if we take some documents and clean them, removing punctuation and stop words and then tokenizing, we get a list of words. Count vectorization extracts all the unique words, lists them, counts how many times each one occurs in the entire text corpus, and records the word frequencies in each document: in doc 1 this word appears twice, in doc 2 it appears once, and so on. This matrix is called the document-term matrix.

So CountVectorizer will first learn this word vocabulary and then create a document-term matrix, where each individual cell records the frequency of that word in a particular document. In order to use CountVectorizer, we import it from scikit-learn, from sklearn.feature_extraction.text import CountVectorizer, and then create an object of it. There is an optional parameter, analyzer; we will see the behavior both with and without passing it.

So let's begin in our notebook. The first part is the usual stuff we have seen in previous videos: we do our imports, save the stop words for the English language, and create the Porter stemmer. Then we read the SMS collection into a data frame using pandas' read_csv function, creating two columns, label and message. Let's go ahead and print the first five rows of the data frame: one column is the label and the other is the message. Then we define a function to clean the text; this is also the usual stuff we have been doing.
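The setup described above might be sketched as follows. The stop-word set is a small hardcoded stand-in for nltk.corpus.stopwords.words("english"), and a trivial suffix stripper stands in for NLTK's PorterStemmer so the sketch stays dependency-free; the SMSSpamCollection file name and tab separator in the commented loading step are assumptions based on the usual form of that dataset.

```python
import re
import string

# Hypothetical loading step from the video (file path/format are assumptions):
# import pandas as pd
# data = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])

# Small hardcoded stand-in for nltk.corpus.stopwords.words("english")
stopwords = {"a", "an", "the", "is", "are", "to", "and", "you", "i", "in", "it"}

def simple_stem(word):
    # Trivial suffix stripper standing in for NLTK's PorterStemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    # Remove punctuation, split on non-word characters,
    # drop stop words, and stem the remaining tokens
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", text.lower())
    return [simple_stem(t) for t in tokens if t and t not in stopwords]
```

For example, `clean_text("Winner!! You won a prize")` returns `["winner", "won", "prize"]`: the punctuation and the stop words "you" and "a" are gone, and the surviving tokens are lowercased.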
Here we remove the punctuation and split the text into tokens; the splitting is done on non-word characters. Then we apply stemming to all those words and also remove the stop words. So this is our cleaning function; we will use it shortly.

First, let's look at a sample corpus. This sample corpus is not the SMS spam collection; we will look into that afterwards. In order to understand CountVectorizer, we have imported it here and created an object using all the default parameters. And this is our corpus: just these three sentences. When we call fit on this corpus, CountVectorizer will learn a vocabulary dictionary of all the tokens in the raw documents. It learns all the tokens, but it will not yet create the document-term matrix; that will happen when we call transform. Fit just learns the vocabulary. So let's uncomment all the code and go ahead and run it.

We see that it first printed the vocabulary, which is a dictionary. Here you can see all the unique words listed, and by default it gets rid of any one-character words, so those are not in the list. It sorts the words in alphabetical order and assigns each an index ("document" has index 1, the next word has index 2, and so on). So this is a dictionary mapping each word to its corresponding index, and if we call get_feature_names, we get this list of tokens.

Now let's call transform. Transform converts the documents into the document-term matrix: the columns are the unique tokens, and each row corresponds to one document. In this case there are three documents, so three rows, and seven unique words, so seven columns, and the individual cell values tell us how many times a particular word has occurred in a particular document.
So when we call transform, it creates the document-term matrix. We go ahead and run it and print the shape, and we can see that the shape is 3 × 7, because there are three documents, that is, three sentences in this corpus, and there are seven unique words, with each column corresponding to one token. Then finally we print X itself, and you can see that most of the cell values are zero. This was a small example, and the sentences were chosen to repeat some tokens multiple times; in general, for an extremely large matrix, most of the cell values will be zero, because a typical corpus of documents will contain several thousand or even hundreds of thousands of unique words, whereas one sentence or one document will not be that big, so the matrix will contain zeros in most of the places.

When we print X, only the nonzero entries get printed, because internally it is a sparse matrix, and it is the nonzero values of that sparse matrix that are shown. In order to get the data in matrix form, we have to call toarray on it, and if we want to create a data frame out of it, we can do that as well. Since pandas is already imported, we can directly use pd.DataFrame, pass it X.toarray(), and pass the feature names as the columns; then we can print the resulting data frame. If the data frame is not constructed properly, check that toarray is actually called here: it is a function, so it needs the parentheses. Now we get back this data frame.

We have got a feel for how this works, so now we are ready to use it on our SMS spam collection. Here we will create a new CountVectorizer, and this time we will use our custom text-cleaning function as the analyzer.
We pass this clean_text function, so the vectorizer will use our clean_text instead of its default analyzer. Now let's fit the data. Here we are doing it in two steps: first we call fit on it, and then transform; we could also combine these into one step with fit_transform. So let's go ahead and use fit and then transform, and we will call it on the message column, so it will take the message column and apply the cleaning function that we are passing. Then we print the shape of the result.

We get an error: clean_text is not defined. If I am not wrong, we just need to run the cell that defines it once. Yes. Now let's run it again. It has 5,572 rows, that is, 5,572 documents, because this collection has 5,572 messages, and after cleaning we get thousands of unique tokens. Let's print cv1.get_feature_names(); it will be a huge list. In our sample data earlier, printing the feature names just printed that list of seven tokens; here thousands of tokens get printed, so we need to call it on cv1 and then run it. This is a big list, and you can see it even contains some numbers and then the words as well. We cannot easily inspect things at this scale, so we will work on a sample: we will take the first ten rows of the original data frame that we had. Now we define cv2, copy the same code, and run it. Here we have just ten rows, because we have selected just ten rows of our original data, and in total 131 unique tokens. So we can go ahead and print the data frame: let's create a data frame with pd.DataFrame, passing X.toarray(), and for the columns we will pass cv2.get_feature_names(). Then we print this data frame. You can see it contains 131 columns, which correspond to the unique tokens, and it has ten rows, the first row corresponding to the first document, or first message.
These cell values are 1s, so let's check them against the messages. In the first message we can indeed see these words appearing, although some of the tokens look a little odd after the cleaning. So this is the document-term matrix, and we created it using CountVectorizer. In the next video we will see n-gram vectorization, and after that we will see TF-IDF. So thanks for watching; see you in the next video.