Natural Language Processing (NLP) - (Theory Lecture)

Dr. Junaid Qazi, PhD
A free video tutorial from Dr. Junaid Qazi, PhD
Data Scientist

Lecture description

We will cover key fundamental concepts in Natural Language Processing: term frequency, inverse document frequency, TF-IDF, and how documents gain importance based on the important words in them.

Learn more from the full course

Data Science and Machine Learning using Python - A Bootcamp

NumPy Pandas Matplotlib Seaborn Plotly Machine Learning Scikit-Learn Data Science Recommender system NLP Theory Hands-on

24:52:28 of on-demand video • Updated February 2020

  • Use Python to analyze data, create state-of-the-art visualizations, and apply machine learning algorithms to facilitate decision making.
  • Python for Data Science and Machine Learning
  • NumPy for Numerical Data
  • Pandas for Data Analysis
  • Plotting with Matplotlib
  • Statistical Plots with Seaborn
  • Interactive dynamic visualizations of data using Plotly
  • SciKit-Learn for Machine Learning
  • K-Means Clustering, Logistic Regression, Linear Regression
  • Random Forest and Decision Trees
  • Principal Component Analysis (PCA)
  • Support Vector Machines
  • Recommender Systems
  • Natural Language Processing and Spam Filters
  • and much more!
English [Auto] Hi guys. Welcome to the natural language processing lecture, also known as NLP. This is another very important area of machine learning and artificial intelligence. Many leading companies are looking for people who have expertise in NLP. In this lecture we will discuss how to manipulate and analyze language data and the concept behind grouping articles, legal documents, and news based on their relevance. We will also discuss how to store language data in a standard format, and much more.

There are several books and lots of material available on the web for free. If you are working with NLTK in Python, you can always look at Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper, which is very good. It is free at the quoted link. The documentation on the official website is always a great resource as well, so you can follow the documentation on nltk.org.

Next, let's create a couple of situations where you would want to use your skills in NLP. Suppose you are working with one of the biggest research publication organizations, like Springer or ScienceDirect. They want you to group the research articles by research area. In another example, you are employed by a leading news agency like the BBC or CNN, and your task is to group the news by headlines or topics. In a third situation, consider that you are part of a big legal firm. You need to find all the relevant documents from thousands of pages.

So in all these situations, are you going to do this work manually? Are you going to find one page out of thousands of pages? This is not an easy task. These and many other similar tasks are not trivial for a human. This is where natural language processing can help when we are dealing with text. By definition, NLP is an area of computer science and artificial intelligence concerned with the interactions between computers and human languages.
In particular, it is concerned with how to program computers to fruitfully process large amounts of natural language data.

For the mentioned tasks, we want to do the following steps. First we compile the documents in some fashion, then we get features from those documents, and then we compare their features for similarity.

Take a very simple example where we have two documents with the following text: document one with the single sentence "red tag" and document two with the single sentence "green tag". A simple way to featurize the text documents is to do a word count. We can transform text into a vector based on word counts. In order to do this, we basically create a count vector over all the possible words in all of the documents. We then count how many times those words appear in each document.

In the above example we have three words: red, green, and tag. The featurized representation of document one and document two, based on the word count over (red, green, tag), will be:

Document one, "red tag": 1 0 1
Document two, "green tag": 0 1 1

So for (red, green, tag), 1 0 1 means that red appeared one time, green zero times, and tag one time, and so on. A document represented as a vector of word counts is called a bag of words.

Treating each document as a vector of features is useful because once we have a bag of words we can perform mathematical operations on it. For example, we can compute cosine similarity using its standard equation. We can also compute other similarity metrics in order to figure out how similar two text documents are to each other.

So we have a bag of words. We can improve this by adjusting the word counts based on their frequency in the corpus. The corpus is a collection of written text, or the group of all the documents. We can then use TF-IDF, which is term frequency and inverse document frequency, in order to do this.
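The bag-of-words and cosine similarity steps described in the lecture can be sketched in plain Python. This is only a minimal illustration, not the code used in the course: the helper names `count_vector` and `cosine_similarity` are hypothetical, the vocabulary order (red, green, tag) is fixed by hand to match the lecture, and tokenization is assumed to be a simple whitespace split.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Vocabulary and documents from the lecture's example.
vocabulary = ["red", "green", "tag"]
doc_one = count_vector("red tag", vocabulary)    # [1, 0, 1]
doc_two = count_vector("green tag", vocabulary)  # [0, 1, 1]

similarity = cosine_similarity(doc_one, doc_two)
print(doc_one, doc_two, round(similarity, 2))
```

The two documents share only the word "tag", so their cosine similarity comes out at about 0.5. Identical documents would score 1 and documents with no words in common would score 0.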
So moving forward, it is important to understand what term frequency is and what inverse document frequency is. Let's define these two terms. First, term frequency measures how frequently a term occurs in a document. Term frequency depends on the number of occurrences of the term t in the document, small d here. This suggests how important the term is in that document. On the other hand, inverse document frequency measures the importance of the term in the corpus, the group of all the documents. It is equal to the log of the total number of documents divided by the number of documents in the corpus in which the term appears. So IDF diminishes the weight of a term that occurs very frequently in the corpus, and it increases the weight of a term that occurs rarely.

Let's try to understand this with another example. Consider that we have two documents, D1 and D2, with some terms and their frequencies as given in the table. Let's compute the TF and IDF for the words "this" and "example".

Let's start with the word "this". In document one, the frequency of "this" is one and the total number of words is five, so the term frequency for "this" in document one is one over five, which is equal to 0.2. In document two the frequency of "this" is again one, but the total number of words in D2 is seven, so the term frequency for "this" is one over seven, which is approximately 0.14. Note that D2 has more words, so the term frequency is smaller in D2.

Let's now compute the IDF for the word "this" in the corpus. The IDF is constant per corpus. In this case, for the word "this", the IDF is equal to log of two over two, which is equal to zero. So now we have the values for TF and IDF. We can compute TF-IDF simply by multiplying TF by IDF. In this case the TF-IDF for both documents is zero. This implies that the word "this" is not really informative, as it appears in all documents. On the other hand, if we consider the word "example", the situation is interesting.
It appeared three times, but only in document D2. Let's compute the TF for "example" in both documents, D1 and D2. For D1, we don't have the term "example", so the term frequency is zero. For D2, "example" has a term count of three, and we get three over seven, approximately 0.429, as the term frequency for "example" in document two. Next, compute the IDF for the term "example", which is constant per corpus. The IDF for "example" is equal to log of two divided by one, which is approximately 0.301. Note that we are using the base-10 logarithm here.

Now we have TF and IDF. We can simply plug in these values and find the TF-IDF for the word "example" in documents D1 and D2. We see that the TF-IDF for "example" in D1 is zero, whereas the TF-IDF for the word in D2 is approximately 0.129. This implies that "example" is an informative word in the corpus, and in this case it is more relevant to document D2.

So I hope you got a really good understanding of natural language processing. You got the idea behind TF and IDF, how they are computed for each document, and how the term frequency and inverse document frequency are important for any word in the corpus. So this was all about the theory behind natural language processing. The next thing is to do some hands-on work. Let's move on and learn a practical example using Python.

Note that natural language processing is not part of scikit-learn. In order to do this we need to install an additional library in Python that deals with NLP. We need to go to the terminal or command line and install another package called NLTK. If you have the Anaconda distribution installed on your computer, you can use the command conda install nltk, and if you are not using the Anaconda distribution, you can use the command pip install nltk to install this natural language processing library. I will see you in the next lecture. Get NLTK installed.
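The worked TF-IDF numbers from the lecture can be checked with a short, standard-library sketch. It is an illustration under stated assumptions rather than the course's code: the function names and the `documents` layout are hypothetical, only the counts quoted in the lecture are included (the rest of the table is not reproduced here), and the base-10 logarithm is used as the lecture states.

```python
import math

# Counts quoted in the lecture: occurrences of each term and total words per document.
documents = {
    "D1": {"counts": {"this": 1, "example": 0}, "total_words": 5},
    "D2": {"counts": {"this": 1, "example": 3}, "total_words": 7},
}

def term_frequency(term, doc):
    """Occurrences of the term divided by the total number of words in the document."""
    return doc["counts"].get(term, 0) / doc["total_words"]

def inverse_document_frequency(term, docs):
    """Base-10 log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs.values() if doc["counts"].get(term, 0) > 0)
    if containing == 0:
        return 0.0  # assumption: a term absent from the whole corpus gets zero weight
    return math.log10(len(docs) / containing)

def tf_idf(term, doc_name, docs):
    """TF-IDF is the product of term frequency and inverse document frequency."""
    return term_frequency(term, docs[doc_name]) * inverse_document_frequency(term, docs)

for term in ("this", "example"):
    for name in documents:
        print(term, name, round(tf_idf(term, name, documents), 3))
```

Running this gives zero for "this" in both documents and for "example" in D1, and roughly 0.129 for "example" in D2, which matches the values worked out by hand in the lecture.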