Hands-on NLP with NLTK and Scikit-learn

Name: Hands-on NLP with NLTK and Scikit-learn
Rating: 3.9 (18 reviews)

A complete Python guide to Natural Language Processing to build spam filters, topic classifiers, and sentiment analyzers

Created by Packt Publishing

Last updated 10/2019

English

What you'll learn

Build end-to-end Natural Language Processing solutions, ranging from getting data for your model to presenting its results.
Understand core NLP concepts such as tokenization, stemming, and stop word removal.
Use open source libraries such as NLTK, scikit-learn, and spaCy to perform routine NLP tasks.
Classify emails as spam or not-spam using basic NLP techniques and simple machine learning models.
Sort documents into their relevant topics using techniques such as TF-IDF, SVMs, and LDAs.
Apply common text data processing steps to increase the performance of your machine learning models.

Course content

6 sections • 30 lectures • 2h 46m total length

The Course Overview (2:10)
This video gives an overview of the entire course.
Use Python, NLTK, spaCy, and Scikit-learn to Build Your NLP Toolset (6:40)
In this video, we will learn how to set up a stack of libraries for natural language processing.
Use Python for machine learning
Learn how NLTK and spaCy fit into natural language processing
Learn how scikit-learn fits into natural language processing
Reading a Simple Natural Language File into Memory (5:57)
In this video, we will load textual data into Python to perform NLP.
Use iterators to read large text files
Speed up text file input-output with multiprocessing
Split the Text into Individual Words with Regular Expression (6:25)
In this video, we will capture each word in a corpus as a feature.
Split lines of text into word tokens with the split function
Explore a better tokenizer with regular expressions
Converting Words into Lists of Lower Case Tokens (4:00)
In this video, we will learn to remove the effects of capitalization from our analysis.
Combine what we have learned to read a text file and process it
Split the corpus into case-insensitive tokens
Removing Uncommon Words and Stop Words (6:35)
In this video, we will learn to remove noise caused by stop words and uncommon words.
Remove uncommon words
Learn about stop words
Remove uncommon words using the collections module
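
The preprocessing steps in this section (regex tokenization, lower-casing, stop word removal, and dropping uncommon words with the collections module) can be sketched roughly as follows. This is a minimal illustration and not the course's own code: the stop word list, the sample sentences, and the names tokenize and clean_tokens are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; NLTK ships a fuller one.
STOP_WORDS = {"the", "a", "is", "on", "and"}


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def clean_tokens(lines, min_count=2):
    """Drop stop words and any word seen fewer than min_count times."""
    tokens = [tok for line in lines for tok in tokenize(line)]
    counts = Counter(tokens)
    return [t for t in tokens if t not in STOP_WORDS and counts[t] >= min_count]


# Hypothetical two-line corpus used only for illustration.
corpus = ["The cat sat on the mat.", "A cat is a CAT, and the mat is flat."]
cleaned = clean_tokens(corpus)
print(cleaned)  # ['cat', 'mat', 'cat', 'cat', 'mat']
```

Counting over the whole corpus before filtering means a word is judged "uncommon" by its overall frequency, not by its frequency within a single line.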

Use an Open Source Dataset, and What Is the Enron Dataset (5:07)
In this video, we will get started with spam classification using an open source dataset.
Learn how data anchors natural language processing and machine learning
Find and understand the Enron spam dataset
Download the Enron spam dataset
Loading the Enron Dataset into Memory (4:36)
In this video, we will import a directory of data into Python.
Use the os module to list all files in a folder
Import both positive and negative examples into memory
Tokenization, Lemmatization, and Stop Word Removal (5:12)
In this video, we will implement our first natural language preprocessing pipeline.
Learn about tokenization and lemmatization
Learn how to perform these preprocessing steps in Python with NLTK
Bag-of-Words Feature Extraction Process with Scikit-learn (4:46)
In this video, we will implement our first feature extractor.
Learn about bag-of-words (BOW) features
Extract BOW features from text data in Python
Basic Spam Classification with NLTK's Naive Bayes (6:50)
In this video, we will create our first ML model with scikit-learn.
Split the dataset into training and testing
Set up your first machine learning model
Evaluate the Naive Bayes model
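
As a rough sketch of how bag-of-words features feed a Naive Bayes spam classifier, the example below uses scikit-learn's CountVectorizer and MultinomialNB. It is an assumption that this mirrors the course's approach; the lecture title mentions NLTK's Naive Bayes, which is not used here. The four training messages are hypothetical stand-ins for the Enron spam dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages standing in for the Enron spam dataset.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words: each column counts how often one vocabulary word appears.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# New messages must be transformed with the vocabulary learned above.
X_new = vectorizer.transform(["claim free money", "monday project meeting"])
predictions = model.predict(X_new)
print(list(predictions))  # ['spam', 'ham']
```

On a real dataset, the held-out messages would come from a train/test split and not from hand-written examples.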

Understanding the Origin and Features of the Movie Review Dataset (6:02)
In this video, we will understand the origin and features of the movie review dataset.
Learn the importance of understanding the data source
Learn how the movie review dataset was collected
Loading and Cleaning the Review Data (5:18)
In this video, we will see how to get the movie review data into Python.
Learn about scikit-learn's load_files function
Organize your dataset to work with sklearn
Preprocessing the Dataset to Remove Unwanted Words and Characters (6:26)
In this video, we will preprocess the dataset to remove unwanted words and characters.
Learn what a Vectorizer in scikit-learn is
Use CountVectorizer to extract features from text
Creating TF-IDF Weighted Natural Language Features (5:01)
In this video, we will create TF-IDF weighted natural language features.
Learn what TF-IDF is and why it is useful
Implement TF-IDF in scikit-learn
Basic Sentiment Analysis with Logistic Regression Model (6:22)
In this video, we will perform basic sentiment analysis with logistic regression.
Learn what logistic regression is and how it fits into NLP
Implement logistic regression for sentiment analysis
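
One way to picture TF-IDF features driving a logistic regression sentiment model is the short sketch below. It assumes scikit-learn's TfidfVectorizer and LogisticRegression chained with make_pipeline; the four reviews are hypothetical and far smaller than the movie review dataset the course uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the movie review dataset.
reviews = [
    "a great and wonderful film",
    "wonderful acting and a great story",
    "a boring and terrible film",
    "terrible acting and a boring story",
]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF down-weights words that appear in many documents (such as "and"),
# so the classifier leans on words that distinguish the two classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

predictions = model.predict(["great story", "boring film"])
print(list(predictions))
```

Words shared equally by both classes ("film", "story", "acting") carry little signal here, so the predictions are driven by the sentiment-bearing words.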

Deep Dive into Raw Tokens from the Movie Reviews (7:20)
In this video, we will learn how to engineer better features by looking at the raw data.
Look at and analyze the raw data to see how it can help us create better features
Whitelist words related to sentiment
Advanced Cleaning of Tokens Using Python String Functions and Regex (6:49)
In this video, we will see how to clean tokens using Python string functions and regex.
Learn how to use word lists to test regex
Explore wildcards, endings, optionality, and more
Creating N-gram Features Using Scikit-learn (5:40)
In this video, we will create features based on phrases instead of words.
Learn about N-grams
Create N-grams with scikit-learn
Experimenting with Advanced Scikit-learn Models Using the NLTK Wrapper (4:33)
In this video, we will see how to experiment with a collection of advanced scikit-learn models.
Explore different models
Use the sklearn module name as a proxy to model type
Building a Voting Model with Scikit-learn (4:04)
In this video, we will combine the predictions of the models into one ensemble model.
Combine models
Use VotingClassifier to combine predictions
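
The sketch below illustrates, under stated assumptions, how phrase-level N-gram features and a voting ensemble might fit together. It calls scikit-learn directly and does not use the NLTK wrapper mentioned above; the choice of three base models and the toy reviews are hypothetical.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews; "not good" only makes sense as a phrase.
reviews = [
    "good movie",
    "really good fun",
    "great fun movie",
    "not good",
    "not fun at all",
    "not a great movie",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# ngram_range=(1, 2) keeps single words and adds two-word phrases.
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Hard voting: each model casts one vote and the majority label wins.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression()),
        ("svm", LinearSVC()),
    ],
    voting="hard",
)

model = make_pipeline(vectorizer, ensemble)
model.fit(reviews, labels)

vocabulary = model.named_steps["countvectorizer"].vocabulary_
print("not good" in vocabulary)  # True: the bigram is now a feature
predictions = model.predict(["not good at all", "really good fun"])
```

Hard voting is used here because LinearSVC does not expose predicted probabilities, which soft voting would require.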

Understanding the Origin and Features of the 20 Newsgroups Dataset (4:57)
In this video, we will understand the origin and features of the 20 newsgroups dataset.
Learn what document or topic classification is
Learn what the 20 newsgroups dataset is
Loading the Newsgroup Data and Extracting Features (4:45)
In this video, we will load the newsgroup data and extract features.
Load the 20 newsgroups dataset with load_files
Split up the dataset with train_test_split
Building a Document Classification Pipeline (4:00)
In this video, we will see how to build a document classification pipeline.
Learn about pipelines in scikit-learn
Explore chaining feature extraction and model training using pipelines
Creating a Performance Report of the Model on the Test Set (5:11)
In this video, we will create a performance report of the model on the test set.
Store predictions and evaluate accuracy
Understand performance with the classification report
Finding Optimal Hyper-parameters Using Grid Search (6:12)
In this video, we will find optimal hyper-parameters using grid search.
Learn about hyper-parameters
Implement grid search in scikit-learn
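
A pipeline, a grid search, and a classification report can be combined along the following lines. This is only a sketch: the eight training posts are hypothetical stand-ins for two of the 20 newsgroups, and the parameter grid and the choice of MultinomialNB are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy posts standing in for two of the 20 newsgroups.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the shuttle docked with the space station",
    "astronauts repaired the orbit telescope",
    "the team won the hockey game",
    "the goalie saved every shot tonight",
    "playoff hockey starts next week",
    "the coach praised the team defense",
]
train_topics = ["space"] * 4 + ["hockey"] * 4
test_docs = ["a new rocket mission to orbit", "the hockey team lost the game"]
test_topics = ["space", "hockey"]

# The pipeline chains feature extraction and model training into one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Grid keys use the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(train_docs, train_topics)

predictions = search.predict(test_docs)
report = classification_report(test_topics, predictions)
print(search.best_params_)
print(report)
```

Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on that fold's training data only, which avoids leaking vocabulary from the validation split.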

Building a Text Preprocessing Pipeline with NLTK (6:15)
In this video, we will learn how to elegantly queue up text preprocessing steps as a decision tree.
Use send() and yield to pass data between functions
Structure your text preprocessing graph using decorators
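
The idea of passing data between preprocessing steps with send() and yield can be illustrated with plain Python generators. The sketch below is a simple linear chain using only built-in string methods, not NLTK and not a branching tree; the coroutine decorator and the step names are hypothetical.

```python
import functools


def coroutine(func):
    """Decorator that primes a generator so it is ready to receive send()."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first yield
        return gen
    return start


@coroutine
def lowercase(target):
    """Receive a string, lowercase it, and pass it downstream."""
    while True:
        text = yield
        target.send(text.lower())


@coroutine
def tokenize(target):
    """Receive a string, split it on whitespace, and pass the tokens on."""
    while True:
        text = yield
        target.send(text.split())


@coroutine
def collect(results):
    """Final step: store whatever arrives in the given list."""
    while True:
        tokens = yield
        results.append(tokens)


results = []
pipeline = lowercase(tokenize(collect(results)))
for line in ["Hello NLP World", "Generators Pass Data"]:
    pipeline.send(line)

print(results)  # [['hello', 'nlp', 'world'], ['generators', 'pass', 'data']]
```

Each step only knows about its downstream target, so steps can be reordered or swapped without changing the others.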
Creating Hashing Based Features from Natural Language (6:52)
In this video, we will create hashing based features from natural language.
Differentiate between HashingVectorizer and CountVectorizer
Learn how to use HashingVectorizer
Classify Documents into 20 Topics with LSA (6:01)
This video shows how to use LSA to reduce the dimensionality of your term-document matrix.
Reduce the size of the matrix that came out of HashingVectorizer
Leverage the LSA algorithm to perform topic classification
Document Classification with TF-IDF and SVMs (6:11)
This video shows how to use SVMs to power document classification.
Learn what SVMs are
Implement a model combining TF-IDF and SVMs in scikit-learn
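
The following sketch shows one plausible way to hash text into a fixed-width matrix, shrink it with LSA, and train an SVM on the result. It assumes scikit-learn's HashingVectorizer, TruncatedSVD (the usual LSA implementation in scikit-learn), and LinearSVC; the documents, the 1,024 hash buckets, and the two LSA components are hypothetical choices.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy documents standing in for newsgroup posts.
docs = [
    "the rocket reached orbit around the moon",
    "astronauts boarded the space station",
    "the moon mission launched a new rocket",
    "the team won the hockey game",
    "the goalie stopped every hockey shot",
    "the coach praised the team after the game",
]
topics = ["space", "space", "space", "hockey", "hockey", "hockey"]

# HashingVectorizer is stateless: it stores no vocabulary, so no fit is needed
# and the output width is fixed in advance by n_features.
hasher = HashingVectorizer(n_features=2**10)
hashed = hasher.transform(docs)

# LSA: project the sparse hashed matrix onto a few dense components.
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(hashed)

classifier = LinearSVC()
classifier.fit(reduced, topics)

new_doc = lsa.transform(hasher.transform(["a rocket to the moon"]))
prediction = classifier.predict(new_doc)
print(hashed.shape, reduced.shape)  # (6, 1024) (6, 2)
```

Swapping HashingVectorizer for TfidfVectorizer would give the TF-IDF plus SVM combination named in the last lecture, at the cost of storing a vocabulary.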

Requirements

Prior programming experience with Python is assumed along with being comfortable dealing with machine learning terms such as supervised learning, regression, and classification. No prior Natural Language Processing or text mining experience is needed.

Description

There is an overflow of text data online nowadays. As a Python developer, you need to create a new solution using Natural Language Processing for your next project. Your colleagues depend on you to monetize gigabytes of unstructured text data. What do you do?

Hands-on NLP with NLTK and scikit-learn is the answer. This course puts you right on the spot, starting off with building a spam classifier in our first video. At the end of the course, you are going to walk away with three NLP applications: a spam filter, a topic classifier, and a sentiment analyzer. There is no need for fancy mathematical theory, just plain English explanations of core NLP concepts and how to apply those using Python libraries.

Taking this course will help you create new applications with Python and NLP. You will be able to build real solutions backed by machine learning and NLP models with ease.

This course uses Python 3.6, TensorFlow 1.4, NLTK 2, and scikit-learn 0.19. While these are not the latest versions available, the course provides relevant and informative content for legacy users of NLTK and scikit-learn.

About the Author

Colibri Ltd is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to make better sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails to prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance: key analytics that all feed back into how our AI generates content.

Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye.

In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn deeply about reinforcement learning and supervised learning topics in a commercial setting.

Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List, and received awards such as the Deutsche Bank Artificial Intelligence prize.

Who this course is for:

This course is for developers, data scientists, and programmers who want to learn about practical Natural Language Processing with Python in a hands-on way. Developers who have an upcoming project that needs NLP, or a pile of unstructured text data on their hands, and don't know what to do with it, will find this course useful.

Course content

Working with Natural Language Data: 6 lectures • 32min
Spam Classification with an Email Dataset: 5 lectures • 27min
Sentiment Analysis with a Movie Review Dataset: 5 lectures • 29min
Boosting the Performance of Your Models with N-grams: 5 lectures • 28min
Document Classification with a Newsgroup Dataset: 5 lectures • 25min
Advanced Topic Modelling with TF-IDF, LSA, and SVMs: 4 lectures • 25min
