
•NLP: Natural Language Processing
•A subfield of linguistics, computer science, information engineering, and AI
•Deals with the interactions between computers and human languages
•Concerned with how to program computers to process and analyze large amounts of natural language data
•With NLP, computers can read text, hear speech, interpret it, measure sentiment, and determine which parts are important
•Applications: Optical Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots
•ML algorithms study millions of text examples written by humans
•From these examples, the algorithms learn the context in which words are used
•This helps in differentiating between the meanings of various texts
•Tokenization: the task of breaking a text into pieces called tokens
Types:
•Word Tokenization
•Sentence Tokenization
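The two types of tokenization can be sketched with regular expressions. This is a minimal illustration, not a production tokenizer: the function names `word_tokenize` and `sentence_tokenize` and the sample text are assumptions made for this example, and the sentence splitter is naive (it would wrongly split on abbreviations such as "Dr.").

```python
import re

def word_tokenize(text):
    # A word is a run of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is fun. Tokenization comes first!"
print(sentence_tokenize(text))  # ['NLP is fun.', 'Tokenization comes first!']
print(word_tokenize(text))      # ['NLP', 'is', 'fun', 'Tokenization', 'comes', 'first']
```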
•Stop words are common English words that do not add much meaning to a sentence.
•They can usually be ignored without sacrificing the meaning of the sentence.
•For example, a search engine may be programmed to ignore commonly used words such as “the”, “a”, “an”, “in”.
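Stop word removal can be sketched as a simple filter over tokens. The `STOP_WORDS` set below is a tiny hand-picked list assumed for illustration; real stop word lists are much longer and vary by library and application.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "on", "for"}

def remove_stop_words(tokens):
    # Compare in lower case so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is sleeping in a box".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'sleeping', 'box']
```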
•An n-gram is a contiguous sequence of n items from a given sample of text or speech.
E.g.: Next-word suggestions shown while typing
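A short sketch of how word n-grams can be extracted from a list of tokens. The helper name `ngrams` and the sample sentence are assumptions chosen for this example; here the "items" are words, but the same idea applies to characters.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens, one position at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "while typing we get suggestions".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
# [('while', 'typing'), ('typing', 'we'), ('we', 'get'), ('get', 'suggestions')]
```

A text predictor can count which word most often follows a given (n-1)-gram and offer it as the suggestion.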
•Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
E.g.: A search engine can match “searching” and “searched” to the same stem, “search”.
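The idea behind stemming can be shown with a toy suffix stripper. This is a deliberately simplified sketch and not the Porter algorithm or any library's stemmer: the suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["searching", "searched", "searches", "search"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['search', 'search', 'search', 'search']
```

Real stemmers apply many more rules; a crude stripper like this one will produce non-words for many inputs.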
•Word Sense Disambiguation (WSD) is identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.
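One simple way to illustrate WSD is a Lesk-style overlap heuristic: pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below is a simplification under stated assumptions: the `SENSES` inventory and its glosses are hypothetical and written for this example, whereas a real system would draw them from a lexical resource.

```python
# Hypothetical sense inventory: sense label -> gloss text.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river side": "the sloping land beside a river or stream of water",
    }
}

def disambiguate(word, sentence):
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        # Score each sense by how many gloss words also occur in the sentence.
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "he sat on the bank of the river"))  # river side
print(disambiguate("bank", "she deposits money at the bank"))   # financial institution
```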
•CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
•The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.
•Issue: very common words such as “the” appear often and dominate the counts
•Each column represents one word; the count is the frequency of that word in the document
•Word order is not preserved
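The count-based encoding described above can be sketched in pure Python. This illustrates the idea only and is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` and the two sample documents are assumptions for this example.

```python
import re
from collections import Counter

def tokenize(doc):
    return re.findall(r"[a-z0-9]+", doc.lower())

def fit_vocabulary(docs):
    # Map each distinct word to a column index, in sorted order.
    words = sorted({w for d in docs for w in tokenize(d)})
    return {w: i for i, w in enumerate(words)}

def transform(doc, vocab):
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = n
    return vec

docs = ["the cat sat on the mat", "the dog sat"]
vocab = fit_vocabulary(docs)         # columns: cat, dog, mat, on, sat, the
print(transform(docs[0], vocab))     # [1, 0, 1, 1, 1, 2]
print(transform("the bird sat", vocab))  # [0, 0, 0, 0, 1, 1] ("bird" is unknown)
```

Note how “the” gets the largest count in the first vector, and how the vector would be identical for any reordering of the same words.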
•TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. words that are frequent in a document but not across documents.
•The importance score lies between 0 and 1 (when the vectors are normalized)
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear a lot across documents.
Advantages:
•Feature vector is much more tractable in size
•Frequency and relevance are captured
Disadvantages:
•Context is still not captured
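A worked sketch of TF-IDF on a tiny corpus. Several variants of the formula exist; this one assumes term frequency = count / document length, inverse document frequency = log(N / df), and L2 normalization of each document vector so that scores fall between 0 and 1. Libraries may use smoothed variants, so their numbers will differ. The function name `tf_idf` and the corpus are assumptions for this example.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n_docs / df[w])
                  for w, c in counts.items()}
        # L2-normalize so every score lies between 0 and 1.
        norm = math.sqrt(sum(s * s for s in scores.values()))
        if norm > 0:
            scores = {w: s / norm for w, s in scores.items()}
        vectors.append(scores)
    return vectors

scores = tf_idf(docs)
for word, score in sorted(scores[0].items()):
    print(f"{word}: {score:.3f}")
```

In the first document, “the” scores 0 because it occurs in every document, while “mat”, which occurs only there, scores highest.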
•Issue with counts and frequencies: the vocabulary can become very large
•A workaround is to use a one-way hash of words to convert them to integers
•No vocabulary is required, and you can choose a fixed-length vector of arbitrary length
•Downside: there is no way to convert the encoding back to a word
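The hashing workaround can be sketched as follows. This is an illustration of the idea rather than any library's implementation: the choice of MD5, the default vector length of 16, and the names `hash_index` and `hashing_vectorize` are assumptions made for this example.

```python
import hashlib

def hash_index(word, n_features):
    # MD5 yields the same digest in every run, unlike Python's built-in
    # hash() for strings, which is randomized per process by default.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashing_vectorize(doc, n_features=16):
    vec = [0] * n_features
    for word in doc.lower().split():
        # No vocabulary lookup: the hash decides the column directly.
        vec[hash_index(word, n_features)] += 1
    return vec

vec = hashing_vectorize("are you free for a meeting anytime tomorrow")
print(len(vec), sum(vec))  # 16 8
```

The vector length is fixed up front regardless of how many distinct words appear, but a column index cannot be mapped back to the word that produced it, and two different words may land in the same column.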
•Article classification (e.g. spam classification) using the bag-of-words (BOW) representation
•corpus = [
'i earn 20 lakh rupees per month just chitchating on the net!',
'are you free for a meeting anytime tomorrow?',
]
•Apply a hash function
•“Rishi Bansal” -> 23
•“Rashi Bansal” -> 72
•The output number depends on the hash function
Features:
•Same value for the same string
•Collision: possibility of the same value for different strings
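Both features can be demonstrated with a small hash function that maps strings to one of 100 buckets. This is a sketch under assumptions: `toy_hash` is a hypothetical helper built on SHA-256, so its outputs will not match the numbers 23 and 72 shown above, which depend on whichever hash function produced them. The `name-0` … `name-100` strings are made-up inputs used only to force a collision.

```python
import hashlib

def toy_hash(text, buckets=100):
    # Take the first 4 bytes of the SHA-256 digest and reduce to a bucket.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets

# Same value for the same string.
print(toy_hash("Rishi Bansal") == toy_hash("Rishi Bansal"))  # True

# Collision: 101 distinct strings cannot all fit in 100 distinct buckets.
seen = {}
collision = None
for i in range(101):
    key = f"name-{i}"
    h = toy_hash(key)
    if h in seen:
        collision = (seen[h], key)
        break
    seen[h] = key
print(collision)  # two different strings with the same hash value
```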
This course provides a basic understanding of NLP. Anyone can take it; no prior knowledge of NLP is required. Text processing steps such as tokenization, stop word removal, stemming, different types of vectorizers, and WSD are explained in detail with Python code. The course also covers the difference between CountVectorizer and hashing in a spam filter.