Latent Dirichlet Allocation Overview

A free video tutorial from Jose Portilla
Head of Data Science at Pierian Training
73 courses
3,538,127 students
Learn more from the full course
NLP - Natural Language Processing with Python
Learn to use Machine Learning, Spacy, NLTK, SciKit-Learn, Deep Learning, and more to conduct Natural Language Processing
11:21:51 of on-demand video • Updated April 2023
Learn to work with Text Files with Python
Learn how to work with PDF files in Python
Utilize Regular Expressions for pattern searching in text
Use Spacy for ultra fast tokenization
Learn about Stemming and Lemmatization
Understand Vocabulary Matching with Spacy
Use Part of Speech Tagging to automatically process raw text files
Understand Named Entity Recognition
Visualize POS and NER with Spacy
Use SciKit-Learn for Text Classification
Use Latent Dirichlet Allocation for Topic Modelling
Learn about Non-negative Matrix Factorization
Use the Word2Vec algorithm
Use NLTK for Sentiment Analysis
Use Deep Learning to build out your own chat bot
(intro music) -: Welcome back everyone. In this lecture, we're going to give an explanatory overview of how LDA, or Latent Dirichlet Allocation, for topic modeling works. Johann Dirichlet was a German mathematician in the 1800s who contributed widely to the field of modern mathematics, and there's a probability distribution named after him called the Dirichlet distribution. This is the distribution that is used later on in LDA, so Latent Dirichlet Allocation is based on this probability distribution. In 2003, LDA was first published as a graphical model for topic discovery in the Journal of Machine Learning Research. So keep in mind, even though Dirichlet's name is attached to this particular method for topic modeling, that really just stems from the fact that it uses the Dirichlet probability distribution, not that Dirichlet himself invented LDA for topic modeling. The actual method is relatively new, from 2003. We're gonna get a high-level overview of how LDA works for topic modeling, but I would really encourage you to also take a look at the original publication paper.

Now, there are two main assumptions we're going to make in order to actually apply LDA for topic modeling. The first one is that documents with similar topics use similar groups of words. That's a pretty reasonable assumption, because it basically says that if you have various documents covering a similar topic, like a bunch of documents covering the topic of business or the economy, they should end up using similar words like money, price, market, stocks, et cetera. The other assumption we're going to make is that latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus. And that's gonna be the assumption we really dive into the details of later on. So again, these are the two assumptions, and they're both actually quite reasonable for the way humans write documents.

We can actually think of these two assumptions mathematically. The way we can model them is the following: documents are probability distributions over some underlying latent topics, and topics themselves are probability distributions over words. So let's see how each of those actually plays out.

We can imagine that any particular document is going to have a probability distribution over a given number of latent topics. So let's say we decide that there are five latent topics across various documents. Then any particular document is going to have a probability of belonging to each topic. Here we can see document one has the highest probability of belonging to topic number two, so we have this discrete probability distribution across the topics for each document. Then we could look at another document, such as document number two, and in this case it does have probabilities of belonging to other topics, but we're gonna say it has the highest probability of belonging to topic four. Notice here, we're not saying definitively that document one or document two belongs to any particular topic. Instead, we're modeling them as having a probability distribution over a variety of latent topics.

And then if we look at the topics themselves, those are simply going to be modeled as probability distributions over words. So for example, we can define topic one by the probability each word in the vocabulary has of belonging to that topic.
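To picture these two kinds of distributions, here is a minimal sketch in NumPy with completely made-up numbers and a tiny vocabulary; none of these values come from the lecture's slides.

```python
import numpy as np

# Toy illustration (made-up numbers): each document is a probability
# distribution over K = 5 latent topics, and each topic is a probability
# distribution over the words in the vocabulary.

# Document-topic distributions: one row per document, rows sum to 1.
doc_topic = np.array([
    [0.05, 0.60, 0.10, 0.15, 0.10],   # document 1 -> highest probability on topic 2
    [0.10, 0.05, 0.15, 0.55, 0.15],   # document 2 -> highest probability on topic 4
])

# Topic-word distribution for a single topic over a tiny vocabulary.
vocab = ["he", "food", "cat", "dog", "home"]
topic_1 = np.array([0.02, 0.05, 0.40, 0.35, 0.18])   # sums to 1

# The most probable words hint at what the latent topic represents.
top_words = [vocab[i] for i in np.argsort(topic_1)[::-1][:3]]
print(top_words)   # ['cat', 'dog', 'home'] -> we might label this topic "pets"
```

The only point here is the shape of things: one row of probabilities per document over the K topics, and one row of probabilities per topic over the vocabulary.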
So, in a topic-word distribution like that, we can see that the word He has a low probability of belonging to topic one, Food has a low probability of belonging to topic one, et cetera, while words such as Cat and Dog have a higher probability of belonging to topic one. And here is where we, as users, actually begin trying to understand what this topic represents. If we were to get this sort of probability distribution across the entire vocabulary of the corpus, what we would end up doing is asking for maybe the top 10 highest-probability words for topic one, and then we would try to work out what the actual underlying topic is. In this case, we could make an educated guess that topic one has to do with pets, and we would label it accordingly. Again, LDA, being an unsupervised learning technique, is not going to be able to tell you that directly. It's up to the user to interpret these probability distributions as topics, and we'll actually get hands-on practice with that when we perform LDA ourselves with Python.

So LDA represents documents as mixtures of topics that spit out words with certain probabilities, and it assumes that documents are produced in the following fashion. First, decide on the number of words, N, that the document will have. Then choose a topic mixture for the document according to a Dirichlet distribution over a fixed set of K topics; that's where the Dirichlet distribution comes into play. So for example, we start off and say this document is 60% business, 20% politics and 20% food. That's our topic mixture. Then we generate each word in the document by first picking a topic according to the multinomial distribution we just sampled, so we pick 60% of the words from the business topic, 20% from politics and 20% from the food topic, and then using that topic to generate the word itself, according to the topic's own multinomial distribution across the words. For example, if we selected the food topic, we might generate the word Apple with 60% probability and another word, Home, with less probability, like 30%, and so on.

Assuming this sort of generative model for a collection of documents, LDA then tries to backtrack from the documents to find the set of topics that are likely to have generated the collection. So again, LDA assumes that the process we just went over is how you built the documents. Obviously, in the real world you're not actually writing documents with this frame of mind, but it's a very useful construct for the way topics can be mixed throughout various documents and the way words can be mixed throughout various topics. What we're gonna do is attempt to backtrack that process.

So, let's actually show you what LDA is going to do, since it's assuming that that's how you built the documents. We're gonna imagine we have a set of documents, and the first step is to choose some fixed number K of topics to discover. And you should note very carefully that this is actually really hard: in order for LDA to work, you as the user need to decide how many topics are going to be discovered. So even before you start LDA, you need to have some sort of intuition about how many topics there are.
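To make that generative story concrete, here is a minimal sketch in NumPy of how LDA assumes a single document gets produced, under an assumed K = 3 topics; the vocabulary and all of the topic-word probabilities are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical vocabulary and topic-word distributions (each row sums to 1).
vocab = np.array(["money", "market", "vote", "law", "apple", "home"])
topics = np.array([
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],   # a "business"-like topic
    [0.05, 0.05, 0.45, 0.35, 0.05, 0.05],   # a "politics"-like topic
    [0.05, 0.05, 0.05, 0.05, 0.60, 0.20],   # a "food"-like topic
])

n_words = 10                              # 1. decide the document length N
alpha = np.ones(len(topics))              # symmetric Dirichlet prior over the K topics
topic_mixture = rng.dirichlet(alpha)      # 2. choose a topic mixture, e.g. roughly [0.6, 0.2, 0.2]

document = []
for _ in range(n_words):
    z = rng.choice(len(topics), p=topic_mixture)   # 3a. pick a topic for this word
    w = rng.choice(len(vocab), p=topics[z])        # 3b. pick a word from that topic
    document.append(vocab[w])

print(topic_mixture.round(2), document)
```

LDA's job is then the reverse of this loop: given only the finished documents, recover plausible topic mixtures and topic-word distributions.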
So, we choose some fixed number K of topics to discover, and then we want to use LDA to learn the topic representation of each document and the words associated with each topic. Then we go through each document and randomly assign each word in the document to one of the K topics. Keep in mind that this random assignment on the very first pass already gives you both topic representations of all the documents and word distributions of all the topics. But we're not done yet, because these initial random topics won't really make sense; they're going to be really poor representations of topics, since we essentially just assigned every word a random topic.

So now it's time to iterate and figure out how to fix these assignments. We're going to iterate over every word in every document to improve these topics. For every word W in every document D, and for each topic T, we calculate the following: the proportion of words in document D that are currently assigned to topic T, and the proportion of assignments to topic T over all documents that come from this particular word W. Then we reassign W a new topic, choosing topic T with probability P(topic T | document D) × P(word W | topic T). This is essentially the probability that topic T generated the word W. After repeating that step a large number of times, we eventually reach a roughly steady state where the assignments are acceptable; the word-to-topic assignments stop changing very often and become pretty stable.

So at the end, what we have is each document assigned to a topic, and we can then search for the words that have the highest probability of being assigned to each topic. We end up with an output such as this: after running LDA over all the documents, you pass in one particular document and LDA reports back, "I think this document is assigned to topic number four." We don't know yet what topic number four represents, but what we can ask LDA is: what are the most common words in topic four, or what words have the highest probability of showing up in topic four? And then you may get results like Cat, Vet, Birds, Dog, Food, Home, et cetera. Given this list of high-probability words for this particular topic, it's up to the user to interpret what the topic is. Here, I think a reasonable interpretation would be that the documents assigned to topic four probably have something to do with pets, so we would say, okay, I think topic four is pets. Now, is that the right answer? There's really no way of knowing, because we didn't have the right answer to begin with, but it's a reasonable interpretation to make given the high-probability words showing up for topic four.

So again, two important notes before we continue and actually apply this with Python. The first is that the user must decide on the number of topics present in the documents before even beginning this process. The second is that the user must interpret what the topics are and what they're actually representing. Okay, in the next lecture, we're gonna explore how to actually implement LDA with Python and Scikit-Learn.
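As a rough preview of what that will look like, here is a minimal sketch using Scikit-Learn's CountVectorizer and LatentDirichletAllocation on a made-up four-document toy corpus with K = 2; treat it only as the shape of the workflow, not the course's actual notebook or dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; in practice this would be thousands of real documents.
docs = [
    "the cat and the dog went to the vet",
    "my dog loves bird food at home",
    "the stock market price fell as investors sold",
    "money flowed into the market after the earnings report",
]

cv = CountVectorizer(stop_words="english")
dtm = cv.fit_transform(docs)                      # document-term matrix

K = 2                                             # we must choose K ourselves
lda = LatentDirichletAllocation(n_components=K, random_state=42)
doc_topics = lda.fit_transform(dtm)               # rows ~ P(topic | document)

print(doc_topics.argmax(axis=1))                  # most likely topic per document

# Highest-probability words per topic -- it's up to us to interpret them.
words = cv.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = [words[i] for i in comp.argsort()[::-1][:5]]
    print(f"topic {t}: {top}")
```

Notice that the two manual steps from the lecture show up directly in the code: we have to pick K before fitting, and we have to read the top words per topic and decide for ourselves what each topic means.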
I'll see you there.