When processing a natural language that humans speak, we need to take care of the following things:
2. Semantics: It deals with the meaning of words and their interpretation within sentences.
3. Pragmatics: Like semantics, but it also considers the context in which a word is used.
Applications of NLP
NLP (Natural Language Processing) has a very wide range of applications. I have listed a few of them:
1. Machine translation (like Google Translate)
2. Sentiment analysis (reviews and comments on e-commerce and social-networking sites)
3. Text classification, generation and automatic summarization
5. Personal assistants (like Alexa, Siri, Google Assistant, Cortana etc.)
NLP Toolkit (library) in Python
There are many Python libraries for NLP, but one of the most commonly used is NLTK (Natural Language Toolkit). It provides modules for preprocessing and cleaning raw data, such as removing punctuation, tokenizing, removing stopwords, stemming, lemmatization, vectorization, tagging, parsing, and more.
Pre-processing of raw data in NLP
The following are the basic steps we need to perform when cleaning raw data in NLP:
1. Remove Punctuation
2. Tokenization
3. Remove Stopwords
4. Stemming / Lemmatization
5. Vectorization
1. Remove Punctuation: First of all, we should remove all punctuation marks (comma, semicolon, colon, apostrophe, quotation marks, dash, hyphen, brackets, braces, parentheses, ellipsis etc.) from the text, as these carry negligible weight.
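As a minimal sketch of this step, the snippet below strips punctuation using only Python's standard library. The function name remove_punctuation and the sample sentence are illustrative assumptions, not part of NLTK.

```python
import string

def remove_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    # string.punctuation covers ASCII marks only; typographic characters
    # such as curly quotes would need to be added separately.
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation("Hello, world! NLP is awesome: isn't it?")
print(cleaned)
```

Note that removing the apostrophe turns "isn't" into "isnt", which is one reason some pipelines handle contractions before this step.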
2. Tokenization: Next, create a list of the words used in the text. Each word is called a token. We can use a regular expression to extract tokens from the sentences, or use the tokenizer modules that NLTK provides for this task.
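The regular-expression approach can be sketched roughly as follows. The helper name tokenize is hypothetical, and lowercasing the text first is an assumption made here so that "NLP" and "nlp" count as the same token.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens using a regular expression."""
    # \w+ matches runs of letters, digits and underscores, so
    # punctuation and whitespace act as separators.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("NLP is awesome and NLP is fun")
print(tokens)
```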
3. Remove Stopwords: Next, remove all stopwords from the token list. Stopwords are words that occur frequently in sentences but carry little weight (like the, for, is, and, or, been, to, this, that, but, if, in, a, as etc.).
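A minimal sketch of stopword removal is shown below. The STOPWORDS set only contains the examples listed above and is an assumption for illustration; a real project would normally use a fuller list, such as the one that ships with NLTK.

```python
# Small illustrative stopword set taken from the examples in the text;
# it is an assumption, not a complete list.
STOPWORDS = {"the", "for", "is", "and", "or", "been", "to", "this",
             "that", "but", "if", "in", "a", "as"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in STOPWORDS]

filtered = remove_stopwords(["nlp", "is", "awesome", "and", "this", "is", "fun"])
print(filtered)
```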
4.1 Stemming: Like removing stopwords, stemming reduces the number of distinct tokens. In this process, we reduce inflected words to their word stem or root, so that similar words share one token that keeps their common meaning. For example:
1) Tokens like stemming and stemmed are converted to a token stem.
2) Tokens like working, worked, works and work are converted to a token work.
Points 1 and 2 illustrate how stemming reduces the number of tokens in a token list. But wait! There is a problem. Consider the following examples of stemming:
3) Tokens like meanness and meaning are both converted to the token mean. This is wrong, because the two words have different meanings.
4) Tokens like goose and geese are converted to the tokens goos and gees respectively (the stemmer just strips the "e" suffix from both tokens). This is wrong again: "geese" is just the plural of "goose", yet the two are treated as different tokens.
Points 3 and 4 can be resolved using Lemmatization.
The NLTK library provides several stemmers, including these four:
1) Porter Stemmer
2) Snowball Stemmer
3) Lancaster Stemmer
4) Regex-based Stemmer
I mainly use Porter stemmer for stemming the tokens in my NLP code.
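To make the idea of stemming concrete, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch and not the Porter algorithm; the function name naive_stem, the suffix list and the minimum stem length are all assumptions chosen for illustration.

```python
def naive_stem(token):
    """Strip one common suffix, keeping a stem of at least 3 characters."""
    # Toy rule set (an assumption); real stemmers such as Porter apply
    # many ordered rules with extra conditions.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

stems = {naive_stem(t) for t in ["working", "worked", "works", "work"]}
print(stems)

# Irregular plurals are not related by a simple suffix rule, so the
# two forms remain different tokens.
print(naive_stem("goose"), naive_stem("geese"))
```

Four inflected forms collapse into one stem, while an irregular pair such as goose / geese stays split, which is the kind of limitation described above.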
4.2 Lemmatization: We saw the limitations of stemming in the examples above (3 and 4). We can overcome these limitations using lemmatization. It is more powerful and sophisticated than stemming: it returns more accurate and meaningful words / tokens by considering the context in which the word is used in a sentence (for example, its part of speech).
But the tradeoff is that it is slower and more complex than stemming.
1) Tokens like meanness and meaning are retained as they are instead of being reduced to mean (unlike stemming).
2) Tokens like goose and geese are both converted to the token goose, which is right. We should get rid of the token "geese" as it is just the plural of "goose".
I mainly use the WordNet Lemmatizer provided by the NLTK library.
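The difference from stemming can be sketched with a toy dictionary-based lemmatizer. The LEMMA_TABLE mapping and the lemmatize helper are hypothetical and tiny; a real lemmatizer such as the WordNet Lemmatizer relies on a full lexical database rather than a hand-written table.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "geese": "goose",
    "mice": "mouse",
    "worked": "work",
    "working": "work",
}

def lemmatize(token):
    """Return the dictionary form of a token, or the token itself if unknown."""
    return LEMMA_TABLE.get(token, token)

lemmas = [lemmatize(t) for t in ["goose", "geese", "meanness", "meaning"]]
print(lemmas)
```

Because the lookup returns real words, goose and geese end up as one token, while meanness and meaning are left untouched.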
5. Vectorization: Machine learning algorithms don't understand raw text; they need numeric input that they can operate on (for example, in matrix multiplications). So far, we have only cleaned our tokens. In this step, we encode the final tokens as numbers to create feature vectors that algorithms can work with. In other words, we fit a vectorization method on the preprocessed and cleaned data produced by the steps up to lemmatization, and use it to transform that data.
Document-term matrix: Let's understand this term before proceeding further. A document-term matrix represents the words in the text as a matrix of numbers. The rows of the matrix represent the text responses (documents) to be analyzed, and the columns represent the words / tokens from the text that are used in the analysis.
There are mainly three types of vectorization:
1) Count vectorization
2) N-grams vectorization
3) Term Frequency - Inverse Document Frequency (TF-IDF)
1) Count vectorization: It creates a document-term matrix which contains the count of each unique word / token in the text response.
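A count-based document-term matrix can be sketched in plain Python as follows. The function count_vectorize and the two sample documents are assumptions for illustration; libraries offer ready-made vectorizers, but this shows what the matrix contains.

```python
from collections import Counter

def count_vectorize(documents):
    """Build a vocabulary and a document-term matrix of raw counts.

    documents is a list of token lists, one per document.
    """
    # One column per unique token, in sorted order for reproducibility.
    vocabulary = sorted({token for tokens in documents for token in tokens})
    matrix = []
    for tokens in documents:
        counts = Counter(tokens)
        # Counter returns 0 for tokens that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [["nlp", "is", "awesome"], ["nlp", "is", "nlp"]]
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)
print(matrix)
```

Each row corresponds to one document and each column to one token of the vocabulary.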
2) N-grams vectorization: It creates a document-term matrix whose columns are sequences of N adjacent words / tokens, so it also captures some of the context of each word, depending on the value of N.
If N = 2, it is called bi-gram,
If N = 3, it is called tri-gram,
If N = 4, it is called four-gram and so on...
We need to choose the value of N carefully: a larger N captures more context, but it also produces many more distinct columns, most of which are zero for any given document.
Example: Consider the sentence "NLP is awesome". Count vectorization creates one column in the document-term matrix for each word, while bi-gram vectorization creates the following columns:
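As a rough sketch, N-grams can be generated from a token list with a sliding window. The helper name ngrams and the sample tokens are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, joined by spaces."""
    # A window of size n slides over the list; there are
    # len(tokens) - n + 1 windows when the list is long enough.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = ["NLP", "is", "really", "awesome"]
bigrams = ngrams(sample, 2)
trigrams = ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

Each resulting string would become one column of the document-term matrix.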
"NLP is", "is awesome"
3) Term Frequency - Inverse Document Frequency (TF-IDF): It is just like count vectorization, but instead of a count, it stores a weight for each word, computed with the following formula:
w(i, j) = tf(i, j) * log(N / df(i))
where:
w(i, j) = weight of a particular word "i" in a document "j"
tf(i, j) = term frequency, i.e. how often the word "i" occurs in the document "j"
N = total number of documents
df(i) = number of documents containing the word "i"
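The weighting can be sketched in plain Python, assuming the common form w(i, j) = tf(i, j) * log(N / df(i)), with tf(i, j) taken as the count of word i in document j divided by the number of tokens in j. That choice of tf, the natural logarithm, the function name tf_idf and the sample documents are all assumptions; libraries often use slightly different smoothing.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (list of tokens)."""
    n_docs = len(documents)
    # df[term] = number of documents that contain the term at least once.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["nlp", "is", "awesome"],
    ["python", "is", "awesome"],
    ["python", "is", "popular"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this sketch a word that appears in every document (here "is") gets a weight of zero, while a word unique to one document (here "nlp") gets the highest weight in that document.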
So, in this way, TF-IDF considers two facts while calculating the weight of a word or token: how often it occurs in a document, and how many documents contain it.
df(i) = number of text messages containing the word NLP which in our case is 1.
So, the final equation becomes: