
This video gives an overview of the entire course.
In this video, we will learn how to set up a stack of libraries for natural language processing.
Use Python for machine learning
Learn how NLTK and spaCy fit into natural language processing
Learn how scikit-learn fits into natural language processing
In this video, we will load textual data into Python to perform NLP.
Use iterators to read large text files
Speed up text file input-output with multiprocessing
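Reading a large file through an iterator can be sketched as follows. This is a minimal illustration, not the course's own code: the helper name read_lines and the temporary sample file are assumptions made for the example.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Yield one line at a time so the whole file never sits in memory."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Build a small throwaway file that stands in for a large corpus.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    corpus_path = tmp.name

lines = read_lines(corpus_path)
first_line = next(lines)            # only the first line has been read so far
remaining = sum(1 for _ in lines)   # consume the rest lazily
os.remove(corpus_path)
```

Because the generator yields lines on demand, memory use stays roughly constant regardless of file size.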
In this video, we will be capturing each word in a corpus as a feature.
Split lines of text into word tokens with the split function
Explore a better tokenizer with regular expressions
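The difference between the two tokenizers can be sketched like this. The sample sentence and the regular expression are assumptions chosen for illustration; the course may use a different pattern.

```python
import re

line = "Hello, world! It's a well-known fact."

# Whitespace splitting keeps punctuation attached to the words.
split_tokens = line.split()

# Assumed pattern: runs of letters, optionally joined by an apostrophe or hyphen.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
regex_tokens = TOKEN_PATTERN.findall(line)

print(split_tokens)
print(regex_tokens)
```

The split version returns tokens such as "Hello," and "fact.", while the regex version strips the punctuation and keeps "It's" and "well-known" intact.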
In this video, we will learn to remove the effects of capitalization from our analysis.
Combine what we have learned to read a text file and process it
Split the corpus into case-insensitive tokens
In this video, we will learn to remove noise caused by stop words and uncommon words.
Remove uncommon words
Learn about stop words
Remove uncommon words using the collections module
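Filtering stop words and uncommon words with the collections module might look like the sketch below. The stop-word set and the frequency threshold are hypothetical values picked for the example; real stop-word lists are much longer.

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the log".split()

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "on", "and"}
MIN_COUNT = 2  # assumed threshold: keep words seen at least twice

# Step 1: drop stop words.
content_tokens = [t for t in tokens if t not in STOP_WORDS]

# Step 2: count what is left and drop words below the threshold.
counts = Counter(content_tokens)
filtered = [t for t in content_tokens if counts[t] >= MIN_COUNT]
```

Here only "sat" survives, since every other content word occurs once.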
In this video, we will be getting started with spam classification using an open-source dataset.
Learn how data anchors natural language processing and machine learning
Find and understand the Enron spam dataset
Download the Enron spam dataset
In this video, we will import a directory of data into Python.
Use the os module to list all files in a folder
Import both positive and negative examples into memory
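Importing a directory of examples can be sketched as below. The folder names "spam" and "ham", the file names, and the helper load_folder are assumptions for illustration; the actual layout of the downloaded dataset may differ.

```python
import os
import tempfile


def load_folder(folder):
    """Return the text of every file in `folder`, in sorted filename order."""
    texts = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            # errors="ignore" skips bytes that are not valid UTF-8.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                texts.append(handle.read())
    return texts


# Hypothetical layout: one subfolder per class.
with tempfile.TemporaryDirectory() as root:
    for label, body in [("spam", "win money now"), ("ham", "meeting at noon")]:
        os.makedirs(os.path.join(root, label))
        file_path = os.path.join(root, label, "0001.txt")
        with open(file_path, "w", encoding="utf-8") as handle:
            handle.write(body)

    spam = load_folder(os.path.join(root, "spam"))
    ham = load_folder(os.path.join(root, "ham"))

# Pair each document with a label: 1 for spam (positive), 0 for ham (negative).
labeled = [(text, 1) for text in spam] + [(text, 0) for text in ham]
```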
In this video, we will implement our first natural language preprocessing pipeline.
Learn about tokenization and lemmatization
Learn how to perform these preprocessing steps in Python with NLTK
In this video, we will be implementing your first feature extractor.
Learn about bag-of-words (BOW) features
Extract BOW features from text data in Python
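The bag-of-words idea can be sketched with the standard library alone. This is an illustrative version under assumed sample documents, not necessarily the extractor built in the video.

```python
from collections import Counter

documents = ["free money now", "meeting now", "free free offer"]

# Vocabulary: every distinct word across the corpus, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def bow_vector(text):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


features = [bow_vector(doc) for doc in documents]
```

Each document becomes a fixed-length count vector aligned with the vocabulary, and word order is discarded.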
In this video, we will be creating your first ML model with scikit-learn.
Split the dataset into training and testing
Set up your first machine learning model
Evaluate the Naive Bayes model
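The three steps above can be sketched as follows. The toy corpus is invented for illustration, and the choice of MultinomialNB with CountVectorizer features is an assumption about the setup; the accuracy on such a small sample is not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration: 1 = spam, 0 = ham.
texts = [
    "win free money now", "free prize claim now",
    "cheap pills free offer", "claim your free money",
    "meeting at noon today", "see the project report",
    "lunch with the team", "notes from the meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split first, then fit the vectorizer on the training part only.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

model = MultinomialNB()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of correct test predictions
```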
In this video, we will understand the origin and features of the movie review dataset.
Learn the importance of understanding the data source
Learn how the movie review dataset was collected
In this video, we will see how to get the movie review data into Python.
Learn about scikit-learn's load_files function
Organize your dataset to work with sklearn
In this video, we will preprocess the dataset to remove unwanted words and characters.
Learn what a vectorizer is in scikit-learn
Use CountVectorizer to extract features from text
In this video, we will create TF-IDF weighted natural language features.
Learn what TF-IDF is and why it is useful
Implement TF-IDF in scikit-learn
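A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer is shown below, using default settings and made-up review snippets. It illustrates that a word appearing in fewer documents receives a higher weight than a more common word in the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented review snippets for illustration.
reviews = ["a great great movie", "a dull movie", "a great cast"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)
vocab = vectorizer.vocabulary_  # maps each word to its column index

# In the second review, "dull" occurs in one document and "movie" in two,
# so "dull" gets the larger weight.
dull_weight = tfidf[1, vocab["dull"]]
movie_weight = tfidf[1, vocab["movie"]]
```

Note that the default token pattern drops one-character tokens, so "a" never enters the vocabulary.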
In this video, we will perform basic sentiment analysis with logistic regression.
Learn what logistic regression is and how it fits into NLP
Implement logistic regression for sentiment analysis
In this video, we will learn how to engineer better features by looking at the raw data.
Look at and analyze the raw data to see how it can help us create better features
Whitelist words related to sentiment
In this video, we will see how to clean tokens using Python string functions and regex.
Learn how to use word lists to test regex
Explore wildcards, endings, optionality and more
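Testing patterns against a word list can be sketched like this. The word list and the three patterns are hypothetical examples of an optional character, alternative endings, and a wildcard.

```python
import re

# Hypothetical word list used to check what each pattern matches.
words = ["color", "colour", "colors", "walk", "walked", "walking",
         "cat", "cut", "coat"]

patterns = {
    "optional letter": r"colou?r",   # ? makes the preceding "u" optional
    "endings": r"walk(ed|ing)",      # alternation over two suffixes
    "wildcard": r"c.t",              # . matches any single character
}

# fullmatch requires the whole word to match, not just a prefix.
matches = {
    name: [w for w in words if re.fullmatch(pattern, w)]
    for name, pattern in patterns.items()
}
```

Running each pattern over a known list makes it easy to spot words that match unexpectedly or fail to match.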
In this video, we will create features based on phrases instead of words.
Learn about N-grams
Create N-grams with scikit-learn
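One way to create N-gram features in scikit-learn is the ngram_range parameter of CountVectorizer, sketched below on invented snippets. Using unigrams plus bigrams is an assumption for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented snippets: "not good" carries a meaning the single words lose.
reviews = ["not good at all", "very good indeed"]

# ngram_range=(1, 2) extracts single words and two-word phrases.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)
features = sorted(vectorizer.vocabulary_)
```

The resulting vocabulary contains phrases such as "not good" alongside the individual words.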
In this video, we will see how to experiment with a collection of advanced scikit-learn models.
Explore different models
Use the sklearn module name as a proxy for model type
In this video, we will be combining the predictions of the models into one ensemble model.
Combine models
Use VotingClassifier to combine predictions
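Combining models with VotingClassifier might look like the sketch below. The three base models and the toy data are assumptions for illustration; the course may combine different estimators.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy sentiment data, invented for illustration: 1 = positive, 0 = negative.
texts = [
    "a great and moving film", "wonderful acting and story",
    "great fun from start to finish", "a dull and boring film",
    "terrible acting and weak story", "boring from start to finish",
]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# voting="hard" means each model casts one vote and the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, labels)
predictions = ensemble.predict(X)
```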
In this video, we will understand the origin and features of the 20 newsgroups dataset.
Learn what document or topic classification is
Learn what the 20 newsgroups dataset is
In this video, we will be loading the newsgroup data and extracting features.
Load the 20 newsgroups dataset with load_files
Split up the dataset with train_test_split
In this video, we will see how to build a document classification pipeline.
Learn about pipelines in scikit-learn
Explore chaining feature extraction and model training using pipelines
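Chaining feature extraction and model training can be sketched with scikit-learn's Pipeline. The step names, the choice of CountVectorizer with MultinomialNB, and the toy documents are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy documents, invented for illustration.
docs = [
    "the rocket launch reached orbit",
    "nasa plans a new rocket",
    "the goalie stopped the puck",
    "hockey season starts tonight",
]
topics = ["space", "space", "hockey", "hockey"]

# Each step is a (name, estimator) pair; raw text goes in, labels come out.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(docs, topics)
predicted = pipeline.predict(["a rocket in orbit"])
```

The pipeline applies the fitted vectorizer to new text automatically, so callers never handle the feature matrix directly.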
In this video, we will be creating a performance report of the model on the test set.
Store predictions and evaluate accuracy
Understand performance with a classification report
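A performance report can be sketched with scikit-learn's metrics module. The true and predicted labels below are made up to keep the example self-contained; in practice they would come from the test set and the fitted model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels for illustration.
y_true = ["space", "space", "hockey", "hockey", "space"]
y_pred = ["space", "hockey", "hockey", "hockey", "space"]

accuracy = accuracy_score(y_true, y_pred)       # 4 of 5 correct
report = classification_report(y_true, y_pred)  # per-class text table
print(report)
```

The report lists precision, recall, and F1-score per class, which shows where errors concentrate in a way a single accuracy figure cannot.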
In this video, we will be finding optimal hyperparameters using grid search.
Learn about hyperparameters
Implement grid search with GridSearchCV in scikit-learn
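Grid search over a text pipeline can be sketched with GridSearchCV. The parameter grid, the two-fold cross-validation, and the toy documents are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy documents, invented for illustration.
docs = [
    "the rocket reached orbit", "nasa launched a new rocket",
    "the telescope saw a distant galaxy", "astronauts orbit the earth",
    "the goalie stopped the puck", "hockey season starts tonight",
    "the team scored a late goal", "the puck hit the goal post",
]
topics = ["space"] * 4 + ["hockey"] * 4

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, topics)
best = search.best_params_
```

Every combination in the grid is cross-validated, and best_params_ holds the combination with the highest mean score.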
In this video, we will learn to elegantly queue up text preprocessing steps as a decision tree.
Use send() and yield to pass data between functions
Structure your text preprocessing graph using decorators
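Passing data between generator-based steps can be sketched as follows. This is a simple linear chain rather than a full tree, and the decorator and step names (coroutine, lowercase, tokenize, collect) are hypothetical names chosen for the example.

```python
import functools


def coroutine(func):
    """Advance a new generator to its first yield so it can accept send()."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def lowercase(target):
    while True:
        text = yield              # wait for the next piece of text
        target.send(text.lower())


@coroutine
def tokenize(target):
    while True:
        text = yield
        target.send(text.split())


@coroutine
def collect(results):
    while True:
        tokens = yield
        results.append(tokens)    # final step: store the processed tokens


results = []
pipeline = lowercase(tokenize(collect(results)))
pipeline.send("Hello NLP World")
pipeline.send("Coroutines PASS data")
```

A step could send to more than one target to branch the flow, which is how a chain like this extends into a graph.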
In this video, we will be creating hashing based features from natural language.
Differentiate between HashingVectorizer and CountVectorizer
Learn how to use HashingVectorizer
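The contrast between the two vectorizers can be sketched on a tiny made-up corpus. The value n_features=16 is an arbitrary assumption to keep the output small; real use calls for a much larger value to limit hash collisions.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["spam spam eggs", "eggs and ham"]

# CountVectorizer learns a vocabulary, so it must be fitted first.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs)

# HashingVectorizer is stateless: words are hashed straight to column
# indices, so no fitting is needed and no vocabulary is stored.
hash_vec = HashingVectorizer(n_features=16)
X_hash = hash_vec.transform(docs)
```

The trade-off is that the hashed columns cannot be mapped back to words, while the output width stays fixed no matter how many distinct words appear.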
This video shows how to use LSA to reduce the dimensionality of your term-document matrix.
Reduce the size of the matrix that came out of HashingVectorizer
Leverage the LSA algorithm to perform topic classification
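In scikit-learn, LSA is commonly performed with TruncatedSVD, and the reduction step can be sketched as below. The matrix sizes, the two components, and the toy documents are assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for illustration.
docs = [
    "the rocket reached orbit",
    "nasa launched a new rocket",
    "the goalie stopped the puck",
    "hockey season starts tonight",
]

# 64 hashed columns stand in for a much wider real-world matrix.
X = HashingVectorizer(n_features=64).transform(docs)

# Project each document onto 2 latent components.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_reduced = lsa.fit_transform(X)
```

The reduced matrix has one dense row per document and can be passed to a classifier in place of the original sparse features.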
This video shows how to use SVM to power document classification.
Learn what SVMs are
Implement a model combining TF-IDF and SVMs in scikit-learn
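A model combining TF-IDF and an SVM can be sketched as a two-step pipeline. Using LinearSVC as the SVM implementation is an assumption, as are the step names and toy documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy documents, invented for illustration.
docs = [
    "the rocket reached orbit", "nasa launched a new rocket",
    "astronauts orbit the earth", "the goalie stopped the puck",
    "hockey season starts tonight", "the team scored a late goal",
]
topics = ["space", "space", "space", "hockey", "hockey", "hockey"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # weight words by TF-IDF
    ("svm", LinearSVC()),          # linear support vector classifier
])
model.fit(docs, topics)
predicted = model.predict(docs)
```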
There is an overflow of text data online nowadays. As a Python developer, you need to create a new solution using Natural Language Processing for your next project. Your colleagues depend on you to monetize gigabytes of unstructured text data. What do you do?
Hands-on NLP with NLTK and scikit-learn is the answer. This course puts you right on the spot, starting off with building a spam classifier in our first video. At the end of the course, you are going to walk away with three NLP applications: a spam filter, a topic classifier, and a sentiment analyzer. There is no need for fancy mathematical theory, just plain English explanations of core NLP concepts and how to apply those using Python libraries.
Taking this course will help you create new applications with Python and NLP. You will be able to build working solutions backed by machine learning and NLP models with ease.
This course uses Python 3.6, TensorFlow 1.4, NLTK 2, and scikit-learn 0.19. While these are not the latest versions available, the course provides relevant and informative content for legacy users of NLP with NLTK and scikit-learn.
About the Author
Colibri Ltd is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to make better sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.
Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails to prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance, key analytics that all feed back into how its AI generates content.
Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye.
In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn deeply about reinforcement learning and supervised learning topics in a commercial setting.
Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List, and received awards such as the Deutsche Bank Artificial Intelligence prize.