This Natural Language Processing (NLP) tutorial covers the core basics of NLP using the well-known Python package Natural Language Toolkit (NLTK). The course helps trainees become familiar with common concepts like tokens, tokenization, stemming, and lemmatization, and with using regular expressions for tokenization or stemming. It discusses classification, tagging, and normalization of raw input text. It also covers some machine learning algorithms such as Naive Bayes.
After taking this course, you will be familiar with the basic terminology and concepts of Natural Language Processing (NLP), and you should be able to develop NLP applications using the knowledge you gained in this course.
What is Natural Language Processing (NLP)?
Natural language processing, or NLP for short, is the ability of a computer program to understand, manipulate, analyze, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, topic segmentation, and spam detection.
What is NLTK?
The Natural Language Toolkit (NLTK) is a suite of program modules and data-sets for text analysis, covering symbolic and statistical Natural Language Processing (NLP). NLTK is written in Python. Over the past few years, NLTK has become popular in teaching and research.
NLTK includes capabilities for tokenizing, parsing, and identifying named entities as well as many more features.
This Natural Language Processing (NLP) tutorial mainly covers NLTK modules.
This Natural Language Processing (NLP) tutorial is designed to help you understand the fundamental concepts of NLP with Python. We will be learning some machine learning algorithms as well, because natural language processing and machine learning go hand in hand: NLP employs machine learning techniques to learn and understand what a sentence is saying or what a user has said, and to send an appropriate response back.
So, by the end of this course, I hope you will have a clear idea, a clear view of the core fundamental concepts of NLP and how we can actually make applications using these core concepts. Looking forward to seeing you in the course.
Keywords: Natural Language Processing (NLP) tutorial; Python NLTK; Machine Learning; Sentiment Analysis; Data Mining; Text Analysis; Text Processing
What is Natural Language Processing (NLP)?
The question is, what is Natural Language Processing? Natural Language Processing (NLP) is basically a branch of artificial intelligence which deals with understanding our own simple languages and interacting with humans.
Examples of Natural Language Processing (NLP) Applications
In this video, we will download the NLTK module and all the additional resources associated with it. I hope you've downloaded Python and set it up on your PCs. What we do here is we have a new project in Python. In the main directory, I have a TXT file, requirements.txt. In it I write: nltk. What this file does is list the modules you want for this project, so that they are downloaded for you. You don't really need to worry about going to the Terminal or CMD and getting these yourself. PyCharm will handle this all by itself. If I move back to setting-up.py here, if I hadn't installed the NLTK module earlier, it would have displayed a warning box here, showing that the requirements are not satisfied. What you do here is you just click yes, and it will download all the requirements that have not yet been satisfied. For me, I've already downloaded the NLTK module, so it's not giving me that warning. Now the first step is complete. We have downloaded the NLTK module. It is here in the Python library. Now what we want to do is download the additional resources associated with the NLTK module. As I move along in the course, I'll explain why we need these additional resources. So I say: import nltk Then I say: nltk.download() Now I'm going to move to "Run" here, and I'm going to run this. It shows that it is running setting-up. As I move back to the Desktop, you can see that the Python GUI is opening up. Just give it a minute. Here it is, up and running. We can see that it gives me four options: Collections, Corpora, Models, and All Packages. This is the NLTK downloader we have for downloading the additional resources. Now you can see that if I click on Corpora here, all the corpora associated with NLTK are listed here. The same goes for Models and All Packages. I have installed most of them here. What I want you to do is just go to Collections and click on "all". Make sure that this "all" is highlighted.
If you click on "all", you can see that this row is highlighted. Just make sure that you have clicked on "all" and then press "Download". It is now going to download all the resources associated with the NLTK module. It will take some time to download, as the file sizes are huge. You'll get it done. I'll see you in the next video, and we'll proceed with an introduction to the NLTK module. I hope you get this done by then. Thank you!
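For readers who prefer to skip the graphical downloader, the same resources can also be fetched from code. The following is a minimal sketch, assuming NLTK is already installed. fetch_resources is a hypothetical helper name that is not part of NLTK, and "book" is the identifier of the collection that bundles the data used by nltk.book.

```python
import nltk


def fetch_resources(resource_ids):
    """Download each NLTK resource by identifier and report success.

    fetch_resources is a hypothetical helper, not part of NLTK.
    """
    results = {}
    for resource_id in resource_ids:
        # nltk.download() returns True on success and False on failure;
        # quiet=True suppresses the progress messages.
        results[resource_id] = nltk.download(resource_id, quiet=True)
    return results


# Example call (needs an internet connection, so it is left commented out):
# fetch_resources(["book"])
```

Downloading a named collection is usually much faster than downloading "all", at the cost of having to fetch further resources later if a video needs them.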
So in the last video, what we did was we downloaded all the additional resources associated with our NLTK module. I hope you have downloaded them, because you will be needing them throughout this course. In this video, we are going to access some very basic resources of the NLTK module. So I say: from nltk.book import * What this is going to do is import everything from nltk.book. This .book is an additional resource of the NLTK module. If I just run this right now, when I haven't typed anything else and have just imported this, you can see that it is loading all the texts that are associated with this module, nltk.book. These are all introductory texts, and we can actually access them individually, programmatically. So how do I do that? I can say: print(texts()) So this is a function we have, the texts() function; what it does is list all the texts that have been loaded into memory, and if I run it, I can see that. You can see that, up to here, this is what we had previously, and after that comes the output of this command, which is executed at line No. 5. The next function we have here is the sents() function; what it does is list some introductory sentences from the nltk.book module, so I can say: sents() and if I run this now... I'm just going to have a print() command here so that there's a gap between this command and this one. You can see that this has listed all the sentences that this module has, the introductory sentences. And if I want to access them individually, I can definitely do that. So I can say: print(sent1) and if I run this now, you'll see that it will give me the first sentence it has. It has given me a list here, and each element of the list is a word. Okay, so this is returning the first sentence. And similarly, I can also access a text individually, like with this function here.
What it did was list all the texts that are here in the nltk.book module, the introductory texts. So how can I access a text individually? It's similar to what we did with sentences. We say: print(text1) and if I do this, it is going to give me the text1 object. You can see here that it has returned the first text. The first text is "Moby Dick by Herman Melville", and here I have a Text object, which is the same as this one here. Okay, so now there are various functions associated with each text. I'll just show you something. Can you see this here? If I type text1. I can see a lot of functions that are available. So in the next video, we are going to discuss some of the most frequently used functions and help you understand what they do and how we can actually use them to build our own applications.
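The texts in nltk.book behave like sequences of tokens. As a minimal sketch that runs without the downloaded data, the same kind of object can be built by hand; the token list and the name "Toy example" below are made up for illustration.

```python
from nltk.text import Text

# A tiny hand-made token list standing in for a downloaded book.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "."]
my_text = Text(tokens, name="Toy example")

print(my_text)        # short description of the text object
print(len(my_text))   # number of tokens
print(my_text[0:3])   # slicing returns a list of tokens
```

The same indexing and slicing should work on text1 to text9 once the book collection has been downloaded.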
So, in the last video, we discussed how to access the very basic introductory texts of the nltk.book module. In this video, we will be discussing very basic introductory functions which we use extensively for analysis. If you remember, we had these 9 introductory texts in the nltk.book module. Now, what I want you to know is that each text is an instance of a Text class, and the full path of this class is nltk.text.Text. So, if I say print(type(text2)), with text1 or text2 or any of them here, and I run this now, you're going to see that it actually shows us this path, the type of text2, which is nltk.text.Text. That means we have the "nltk" package first, then there's a module named "text", and then there's this class, "Text". So you can see here that we have this as a class. What we will be doing here is discussing some very basic functions of this class, "Text". Okay. So the first function which we are going to discuss is "concordance". It expects a single word as its input, it searches for that word, and as output it shows us the occurrences of that word in our text with some context. By context, I mean the words which appear before that word and the words which appear after that word. We'll just see now what we actually mean by this. So I say: text1.concordance("man") And if I run this now, it is going to give us the occurrences of "man" with some context in text1. You can see that it says displaying 25 of 527 matches. So it has found 527 occurrences of the word "man", and you can see that here it is "man", and it is also giving us some context, some words before the word "man" and some words after the word "man". Okay, so this is what "concordance" does.
This is pretty helpful, you know, if you want to see where a word is appearing and in what context it is appearing usually.
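To make the idea of "a word plus its context" concrete, here is a minimal sketch that looks up the positions of a word and prints a few neighbouring tokens. It uses NLTK's ConcordanceIndex on a made-up sentence; show_context is a hypothetical helper written for this example, not an NLTK function.

```python
from nltk.text import ConcordanceIndex

tokens = "the old man saw a man and the Man waved".split()

# key=str.lower makes the lookup case-insensitive.
index = ConcordanceIndex(tokens, key=str.lower)
positions = index.offsets("man")


def show_context(tokens, position, window=2):
    """Return the tokens around one position as a single string."""
    start = max(0, position - window)
    return " ".join(tokens[start:position + window + 1])


for position in positions:
    print(position, show_context(tokens, position))
```

The window size plays the role of the context width: a larger window shows more words on each side of the match.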
The next function which we are going to discuss is "similar". It expects a single word as its input; it searches for that word, fetches every context in which the word appears, and then shows us all the other words that appear in the same contexts. I'm just going to run this, and you'll understand it better than from what I'm trying to say. I say: print(text1.similar("woman")) So if I run this now, it is going to search for the word "woman" in text1, and then it is going to show us all the words that appear in the same context as the word "woman". You can see them here: man, king, wife, hussar, fiddler, bull, laugh, writer. All of these words appear in the same context as the word "woman" in text1, which is Moby Dick. Okay, so this is what we have for similar.
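The idea behind "similar" can be sketched in plain Python. This is an illustration of the concept only, not NLTK's actual implementation: it assumes a context is simply the pair (previous word, next word), and words_in_shared_contexts is a hypothetical function name.

```python
from collections import Counter, defaultdict

tokens = ("the man laughed . the woman laughed . "
          "the king slept . the woman slept .").split()


def words_in_shared_contexts(tokens, target):
    """Rank words by how many (previous, next) contexts they share with target."""
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        if word != target:
            overlap = len(word_contexts & target_contexts)
            if overlap:
                shared[word] = overlap
    return [word for word, _ in shared.most_common()]


print(words_in_shared_contexts(tokens, "woman"))
```

Here "man" and "king" are reported because each of them appears between the same neighbours as "woman" at least once.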
Another function which we are going to discuss right now is dispersion_plot. This is very helpful in the analysis of data: we can actually see how many times a word has come up in a particular text and where it appeared. Was it used very extensively at the start of the text, at the end, in the middle, or wherever? What this does is return a graph where, on the y axis, we have the words we have given it as input, the ones whose occurrences we want to find. And on the x axis, we have the position in the text, counted in words, and it displays where each word occurred. I'm just going to run this, and hopefully you'll understand better what I'm trying to tell you right now. Okay, so let's say I'll go with text4 here, which is this one. So I say text4.dispersion_plot(). Now, this function expects a list as its input. So I give it a list: text4.dispersion_plot(["democracy", "freedom", "law"]) If I run this now, this function is going to give me a graph as its output. You're going to see that right now, and I'll explain what I was telling you earlier. You can see here that on the y axis, I have the list I gave it as its input, three words: "democracy", "freedom", "law". And on the x axis, what this shows is that this text has roughly this many words. If we go with "democracy", you can see that the first occurrence of "democracy" is somewhere here around 20,000, and as we move forward, you can see that "democracy" has been used a lot at the very end, but it has been rarely used at the start and in the middle. The word "freedom" has been used throughout the text. You can see that here, and similarly "law" has also been used throughout the text.
So this is a pretty helpful function which can be used to run some pretty quick analysis to see what this text contains, like for example, let's say you have a chat of a person, and, you know, maybe a million words, and now you want to see how many negative words this person uses or how extensively that person uses those negative words, so you know you can better assess that person's personality through that. So what do you do with it? You give this function a list of negative words as its input, and you'll see that, okay, this person used this many negative words, and where did he use them? Was he using them a lot at the very start or in the middle or maybe throughout? I mean, you'll get a better idea of what that person is, or what he thinks or how he writes. So, this is a pretty helpful function.
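The data behind such a plot is just a list of positions for each word. The following is a minimal sketch in plain Python, using a made-up sentence; word_offsets is a hypothetical helper and does no plotting itself.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    wanted = set(targets)
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets


tokens = "we value freedom and law ; freedom needs law and democracy".split()
offsets = word_offsets(tokens, ["democracy", "freedom", "law"])
for word, positions in offsets.items():
    print(word, positions)
```

Each position would become one mark on the x axis of the plot, on the row that belongs to its word.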
Now, say you want to know the number of words in a text. What do I do? It's the same function we have in Python, len(), so I say len(text5). Let's run this. This is going to give us the length of our text here, text5. Okay, so this is 45010, which is the length of text5, the "Chat Corpus".
So let's say now we want to find how many times the word "freedom" appeared in our text4. What do I do now? For that, I say print(text4.count("freedom")). Now, this .count() function expects a single word as its input, and it is going to return an integer value, giving me the occurrences of that word. So I say "freedom", and if I run this now, you'll see that it will give me the number of times the word "freedom" occurred in text4. So it's 174 here. Okay, now, let's say I want to calculate a percentage for the word "freedom" based on its occurrence in text4. What do I do now? I take the number of times the word "freedom" appeared, text4.count("freedom"), I divide it by the length of text4, and then I multiply it by 100. So I say print(text4.count("freedom") / len(text4) * 100). If I run this now, it is going to give me the percentage of the words in text4 that are "freedom". You can see that it's 0.1193%, or if we round it off, it's about 0.12%. So I hope that these basic functions are clear to you, and I hope that especially this function, dispersion_plot, is clear to you, as it is very helpful for running a very quick analysis and seeing how a text is laid out. So I'll see you in the next video, and we will be discussing Frequency Distributions.
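The count and percentage steps can be tried without the downloaded corpora. Here is a minimal sketch, assuming a small made-up token list in place of text4:

```python
from nltk.text import Text

tokens = "freedom is not free ; freedom is earned".split()
text = Text(tokens)

occurrences = text.count("freedom")          # how often the word appears
percentage = occurrences / len(text) * 100   # share of all tokens

print(occurrences, round(percentage, 2))
```

Note that count() matches tokens exactly, so "Freedom" and "freedom" would be counted separately unless the tokens are lowercased first.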
How to access and use the built-in corpora of NLTK
How to use your own texts in NLTK
In Natural Language Processing, Tokenization is the process of breaking text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Learn more about Tokenization with Python NLTK in this video tutorial.
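As a small illustration of what tokenization produces, the sketch below uses wordpunct_tokenize, one of several tokenizers in NLTK; it is chosen here only because it needs no downloaded models, and the sentence is made up.

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "Hello, world! NLTK splits text into tokens."
tokens = wordpunct_tokenize(sentence)

# Words and punctuation marks come out as separate tokens.
print(tokens)
```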
What does "token" mean?
Text Normalization with Stemming in NLTK
Text Normalization with Lemmatization in NLTK
Using Regular Expressions to tokenize texts
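The topics above can be combined in a few lines. This is a minimal sketch, assuming the Porter stemmer and a simple \w+ pattern; other stemmers and patterns are possible. Lemmatization is left out because NLTK's WordNet lemmatizer needs the WordNet data to be downloaded first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Keep runs of letters, digits, and underscores; punctuation is dropped.
tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()

tokens = tokenizer.tokenize("Connected, connecting, connections!")
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

All three words reduce to the same stem, which is the point of normalization: different surface forms are treated as one item in later analysis.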
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
Tagging with Regular Expressions
Tagging into single words
Tagging into phrases of multiple words
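Tagging with regular expressions can be sketched with NLTK's RegexpTagger. The patterns and the tag choices below are illustrative assumptions written for this example, not a complete tagset.

```python
from nltk.tag import RegexpTagger

# Patterns are tried in order; the first one that matches wins.
patterns = [
    (r"^-?[0-9]+(?:\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),                  # gerunds
    (r".*ed$", "VBD"),                   # simple past
    (r".*s$", "NNS"),                    # plural nouns
    (r".*", "NN"),                       # default: noun
]
tagger = RegexpTagger(patterns)
tagged = tagger.tag(["dogs", "walked", "running", "42", "cat"])

print(tagged)
```

A tagger like this only looks at the shape of each word, so it is best seen as a baseline or a fallback for more accurate taggers.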
Commonly used in Machine Learning, Naive Bayes is a collection of classification algorithms based on Bayes' Theorem. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A Naive Bayes classifier considers each of these three "features" to contribute independently to the probability that the fruit is an apple, regardless of any correlations between features. However, features are not always independent, which is often seen as a shortcoming of the Naive Bayes algorithm, and this is why it's labeled "naive".
Although it’s a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and is extremely useful in common applications like spam detection and document classification.
In this video tutorial, you will build a small application which takes the name of a person as its input and tells you whether that name is male or female. You are going to use Naive Bayes in your classifier. You are going to make your application learn from a training data set; after it has learned, you are going to give it a name, and it is going to tell you whether that person is male or female.
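A minimal sketch of such a classifier is shown below. It assumes a single feature, the last letter of the name, and a tiny hand-written training list; both are illustrative choices, and a real application would need far more names.

```python
import nltk


def gender_features(name):
    """Use the last letter of the name as the only feature."""
    return {"last_letter": name[-1].lower()}


# Hypothetical toy training data, written by hand for this sketch.
labeled_names = [
    ("Anna", "female"), ("Maria", "female"), ("Olivia", "female"),
    ("Sophia", "female"), ("Emma", "female"), ("Laura", "female"),
    ("John", "male"), ("Mark", "male"), ("David", "male"),
    ("Kevin", "male"), ("Jacob", "male"), ("Frank", "male"),
]
train_set = [(gender_features(name), label) for name, label in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(gender_features("Julia")))
print(classifier.classify(gender_features("Ethan")))
```

With so little data the classifier simply learns that names ending in "a" were labeled female in the training list, which shows how strongly the result depends on the training set.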
In this video tutorial, you are going to build a document classifier: you will give your classifier a movie or product review as input, and it will tell you whether that review is a positive one or a negative one.
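The same Naive Bayes approach can be sketched for reviews. The example below assumes a simple "word is present" feature for every word in a review and uses a handful of made-up reviews, so it only illustrates the mechanics.

```python
import nltk


def review_features(text):
    """Mark every word that occurs in the review as present."""
    return {word: True for word in text.lower().split()}


# Hypothetical hand-written reviews, used only to keep the sketch small.
labeled_reviews = [
    ("great wonderful film", "pos"),
    ("loved it great acting", "pos"),
    ("wonderful story", "pos"),
    ("terrible boring film", "neg"),
    ("hated it boring plot", "neg"),
    ("awful story", "neg"),
]
train_set = [(review_features(text), label) for text, label in labeled_reviews]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(review_features("great wonderful acting")))
print(classifier.classify(review_features("boring awful plot")))
```

Words that appear in both classes, such as "film" or "story", carry little weight, while words seen in only one class drive the decision.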
Chunking, or shallow parsing, is an analysis of a sentence which first identifies the constituent parts of the sentence (nouns, verbs, adjectives, etc.) and then links them to higher-order units that have discrete grammatical meanings, e.g. phrases.
Chunking in Python NLTK
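Chunking can be sketched with NLTK's RegexpParser. The sentence below is tagged by hand so that the example needs no downloaded tagger, and the grammar is one possible noun-phrase rule chosen for illustration.

```python
from nltk import RegexpParser

# A sentence that is already part-of-speech tagged (tags written by hand).
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged_sentence)

noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The words that do not fit the rule, here "barked" and "at", stay outside any chunk, which is what makes the parse "shallow".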
GoTrained is an e-learning academy aiming at creating useful content in different languages and it concentrates on technology and management.
We adopt a special approach for selecting the content we provide: we mainly focus on skills that are frequently requested by clients and employers but are covered by only a few videos. We also try to build video series that cover not only the basics but also the advanced areas.