Natural Language Processing:Concept along with Case Study

Name: Natural Language Processing:Concept along with Case Study
Rating: 4.6 (213 reviews)

Free Course: Natural Language Processing (NLP), Text Processing, Machine Learning, Spam Filter [Python]

Created by Rishi Bansal

Last updated 6/2020

English

What you'll learn

Various text processing techniques and their implementation in Python.
Case Study: the role of hashing in a spam filter, compared with CountVectorizer.

Course content

3 sections • 19 lectures • 1h 31m total length

What is Natural Language Processing (NLP)? (4:49)
•NLP stands for Natural Language Processing
•It is a subfield of linguistics, computer science, information engineering, and AI
•It deals with the interactions between computers and human languages
•In particular, it covers how to program computers to process and analyze large amounts of natural language data
•With NLP, computers can read text, hear speech, interpret it, measure sentiment, and determine which parts are important
•Applications: Optical Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots
•ML algorithms study millions of text examples written by humans
•From these examples, the algorithms gain an understanding of context
•This helps in differentiating between the meanings of various texts
Tokenization (2:15)
•The task of breaking a text into pieces called tokens
Types:
•Word Tokenization
•Sentence Tokenization
Stop Words Removal (3:28)
•Stop words are English words that do not add much meaning to a sentence.
•They can safely be ignored without sacrificing the meaning of the sentence.
•A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.
N-Grams (3:46)
•An n-gram is a contiguous sequence of n items from a given sample of text or speech.
E.g.: the word suggestions we get while typing
Stemming (1:42)
•Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
E.g.: search engines
Word Sense Disambiguation (2:04)
•WSD is the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.
Count Vectorizer (5:34)
•Provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
•The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.
•Issue: very frequent words such as “the” dominate the counts
•Each column represents one word; the count is the frequency of that word
•The order of words is not preserved
TF-IDF Vectorizer (7:30)
•TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
•The importance score lies on a scale from 0 to 1
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear a lot across documents.
Advantages:
•Feature vector much more tractable in size
•Frequency and relevance captured
Disadvantages:
•Context still not captured
Hashing Vectorizer (4:21)
•Issue with counts and frequencies: the vocabulary can become very large
•The workaround is to use a one-way hash of words to convert them to integers
•No vocabulary is required, and you can choose a fixed-length vector of arbitrary length
•Downside: there is no way to convert the encoding back to a word

Tokenization - Python (5:41)
•The task of breaking a text into pieces called tokens
Types:
•Word Tokenization
•Sentence Tokenization
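The two tokenization types listed above can be sketched with the standard library alone. This is a simplified illustration, not the course's own code: the function names `sentence_tokenize` and `word_tokenize` are chosen here for the example, and the regular expressions are assumptions that ignore abbreviations, decimals, and similar edge cases that dedicated NLP libraries handle.

```python
import re

def sentence_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

def word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "NLP is fun. Tokenization breaks text into tokens!"
print(sentence_tokenize(text))
# ['NLP is fun.', 'Tokenization breaks text into tokens!']
print(word_tokenize("NLP is fun."))
# ['NLP', 'is', 'fun', '.']
```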
Stop Word Removal - Python (6:14)
•Stop words are English words that do not add much meaning to a sentence.
•They can safely be ignored without sacrificing the meaning of the sentence.
•A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.
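Stop word removal amounts to filtering tokens against a fixed list. The sketch below assumes a tiny hand-picked list (`STOP_WORDS` is hypothetical and far shorter than the lists shipped with NLP libraries) and a helper name, `remove_stop_words`, invented for this example.

```python
# A tiny, hand-picked stop word list for illustration only;
# real stop word lists are much longer.
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "on", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The cat is sitting in a box".split()
print(remove_stop_words(tokens))  # ['cat', 'sitting', 'box']
```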
N-Grams - Python (4:21)
•An n-gram is a contiguous sequence of n items from a given sample of text or speech.
E.g.: the word suggestions we get while typing
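As a minimal sketch of the definition above, n-grams over a token list can be produced with a sliding window. The function name `ngrams` is an assumption made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences from a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "while typing we get suggestions".split()
print(ngrams(tokens, 2))
# [('while', 'typing'), ('typing', 'we'), ('we', 'get'), ('get', 'suggestions')]
```

A typing-suggestion feature could, for instance, count which word most often follows a given word in such bigrams.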
Stemming - Python (4:11)
•Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
E.g.: search engines
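The idea can be illustrated with a deliberately naive suffix-stripping stemmer. This is not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length of three letters are assumptions chosen to keep the example short.

```python
# Suffixes are checked in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["searching", "searched", "searches", "search"]])
# ['search', 'search', 'search', 'search']
```

Mapping all four forms to the same stem is what lets a search engine match a query for "searching" against a page that only contains "searched".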
Word Sense Disambiguation - Python (7:48)
•WSD is the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.
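One classic way to approach WSD is a simplified Lesk-style overlap: pick the sense whose dictionary gloss shares the most words with the sentence. The outline does not say which method the lecture uses, so this is only a sketch of that general idea; the sense inventory `SENSES` and its glosses are hand-written and hypothetical.

```python
# Hypothetical, hand-written sense inventory for the word "bank".
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts money deposits and gives loans",
        "river side": "sloping land beside a river or other body of water",
    }
}

def simplified_lesk(word, sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "he sat on the bank of the river"))
# river side
print(simplified_lesk("bank", "she went to the bank to deposit money"))
# financial institution
```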
Count Vectorizer - Python (7:49)
•Provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
•The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.
•Issue: very frequent words such as “the” dominate the counts
•Each column represents one word; the count is the frequency of that word
•The order of words is not preserved
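The behaviour described above (build a vocabulary, encode documents as counts, ignore unseen words) can be sketched in plain Python. This is an illustration of the bag-of-words idea, not scikit-learn's `CountVectorizer` implementation; the helper names `tokenize`, `fit_vocabulary`, and `transform` are assumptions for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    """Map each distinct token to a column index (sorted alphabetically)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Encode each document as a list of counts; unknown tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

docs = ["the cat sat on the mat", "the dog sat"]
vocab = fit_vocabulary(docs)
print(vocab)
# {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(transform(docs, vocab))
# [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
print(transform(["the bird sat"], vocab))  # "bird" is not in the vocabulary
# [[0, 0, 0, 0, 1, 1]]
```

Note how "the" receives the largest count in the first row, and how the row carries no information about word order.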
TF-IDF Vectorizer - Python (3:05)
•TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
•The importance score lies on a scale from 0 to 1
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear a lot across documents.
Advantages:
•Feature vector much more tractable in size
•Frequency and relevance captured
Disadvantages:
•Context still not captured
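A small worked sketch may help make the two terms concrete. It assumes one common variant of the formula: raw term counts, a smoothed inverse document frequency of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row so that every weight falls between 0 and 1. The course may use a different variant, and the function name `tfidf` is chosen for this example.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) where rows hold L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        weights = [tf[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocab, rows = tfidf(["the cat purrs", "the dog barks", "the bird sings"])
first = dict(zip(vocab, rows[0]))
# "the" appears in every document, so it is weighted below "cat".
print(round(first["the"], 3), round(first["cat"], 3))
```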
Hashing Vectorizer - Python (4:41)
•Issue with counts and frequencies: the vocabulary can become very large
•The workaround is to use a one-way hash of words to convert them to integers
•No vocabulary is required, and you can choose a fixed-length vector of arbitrary length
•Downside: there is no way to convert the encoding back to a word
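The hashing trick described above can be sketched as follows. This is a simplified stand-in for a library hashing vectorizer, not its actual implementation: MD5 from the standard library is used here only because it gives the same value on every run (Python's built-in `hash` for strings is randomized per process), and `n_features=16` is an arbitrary choice for the example.

```python
import hashlib

def hash_token(token, n_features):
    """Map a token to a column index using a stable one-way hash (MD5 here)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashing_vectorize(document, n_features=16):
    """Encode a document as a fixed-length count vector without a vocabulary."""
    vector = [0] * n_features
    for token in document.lower().split():
        vector[hash_token(token, n_features)] += 1
    return vector

vec = hashing_vectorize("the cat sat on the mat")
print(len(vec), sum(vec))  # 16 6
```

The vector length stays at `n_features` no matter how many distinct words appear, but two different words may land in the same column, and a column index cannot be mapped back to a word.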

Spam Filter using CountVectorizer (8:50)
•Article classification (e.g. spam classification) using a bag-of-words (BOW) representation
•corpus = [
'i earn 20 lakh rupees per month just chitchating on the net!',
'are you free for a meeting anytime tomorrow?',
]
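To show how word counts can feed a classifier, here is a minimal sketch that trains a multinomial Naive Bayes model with Laplace smoothing on the two-message corpus above. The outline does not state which classifier or labels the lecture uses, so both the choice of Naive Bayes and the `labels` list are assumptions, and the helper names are invented for this example.

```python
import math
import re
from collections import Counter

corpus = [
    'i earn 20 lakh rupees per month just chitchating on the net!',
    'are you free for a meeting anytime tomorrow?',
]
labels = ["spam", "ham"]  # assumed labels, not given in the course outline

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def train(corpus, labels):
    """Collect per-class word counts, class document counts, and the vocabulary."""
    word_counts = {label: Counter() for label in sorted(set(labels))}
    class_counts = Counter(labels)
    for text, label in zip(corpus, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {tok for counts in word_counts.values() for tok in counts}
    return word_counts, class_counts, vocabulary

def predict(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(counts.values())
        for tok in tokenize(text):
            if tok in vocabulary:  # words outside the vocabulary are ignored
                score += math.log((counts[tok] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(corpus, labels)
print(predict("earn rupees on the net", *model))        # spam
print(predict("free for a meeting tomorrow?", *model))  # ham
```

With only two training messages this is a toy; a usable filter needs a much larger labeled corpus.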
Spam Filter using Hashing (3:33)
•Apply a hash function
•“Rishi Bansal” -> 23
•“Rashi Bansal” -> 72
•The output number depends on the hash function

Features:
•Same value for the same string
•Collision: different strings may produce the same value
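Both properties can be demonstrated with a toy hash function. The function below (sum of character codes modulo 100) is an assumption made purely for illustration, so its outputs will not match the 23 and 72 shown above, which depend on whatever hash function the lecture uses.

```python
def toy_hash(text, n_buckets=100):
    """Toy hash: sum of character code points modulo n_buckets."""
    return sum(ord(ch) for ch in text) % n_buckets

# Similar strings usually get different values.
print(toy_hash("Rishi Bansal"), toy_hash("Rashi Bansal"))

# Same value for the same string.
print(toy_hash("spam") == toy_hash("spam"))      # True

# Collision: anagrams contain the same characters, so this toy hash
# gives them the same value even though the strings differ.
print(toy_hash("listen") == toy_hash("silent"))  # True
```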

Requirements

Basic understanding of Python
A laptop with a Python IDE installed
An understanding of machine learning is helpful for the case study but not mandatory

Description

This course provides a basic understanding of NLP. Anyone can take it; no prior understanding of NLP is required. Text processing techniques such as tokenization, stop word removal, stemming, different types of vectorizers, and WSD are explained in detail with Python code. The case study covers the difference between CountVectorizer and hashing in a spam filter.

Who this course is for:

People who want to learn NLP and are looking to build a career in machine learning.

Course content

Introduction to Natural Language Processing • 9 lectures • 35min

Text Preprocessing - Python Code • 8 lectures • 44min

Case Study: Spam Filter • 2 lectures • 12min
