The main goal of stemming and lemmatization is to convert related words to a common base/root word. It’s a special case of text normalization.
Stemming
Stemming a word means reducing it to its stem. A single word can appear in different inflected forms, but all of those forms share a single stem/base/root word. The stem is not necessarily identical to the morphological root of the word.
Example:
The word "work" is the stem word for the words 'working', 'worked', and 'works'.
working => work
worked => work
works => work
Loading Stemmer Module
There are many stemming algorithms. "Porter Stemming Algorithm" is the most popular one.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print (stemmer.stem('working')) # output: work
print (stemmer.stem('works')) # output: work
print (stemmer.stem('worked')) # output: work
Stemming Text Document
First, the text needs to be converted into word tokens. After that, each word in the token list can be stemmed.
As shown in the code snippet below, the word "jumps" has been stemmed to "jump" and the word "lazy" has been stemmed to "lazi".
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
stemmer = PorterStemmer()
words_stem = [stemmer.stem(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
    words_stem.append(stemmer.stem(word))
'''
#words_stem_2 = [str(item) for item in words_stem]
#print (words_stem_2)
print (words_stem)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''
Using split() function
The stemmer can also be run on your text without word tokenization. For this, we use the split() method, which turns a string into a list based on a delimiter. The default delimiter is whitespace.
Note: Tokenizing sentences into words is useful because it separates punctuation from the words. In the example below, the last word "dog" is taken as "dog." (with the full stop at the end) because the punctuation mark is not separated from the word.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "A quick brown fox jumps over the lazy dog."
text_stem = " ".join([stemmer.stem(word) for word in text.split()])
print (text_stem)
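The effect of the attached punctuation can be seen directly by stemming the two versions of the token. A minimal sketch (it uses only the Porter stemmer, which needs no corpus data):

```python
from nltk.stem import PorterStemmer

# split() leaves the full stop attached, so the stemmer sees "dog."
# as a single token and cannot reduce it to the same stem as "dog".
stemmer = PorterStemmer()

print(stemmer.stem('dog'))   # the clean token
print(stemmer.stem('dog.'))  # the token produced by split()
```

This is why word_tokenize, which splits punctuation into its own token, is usually the safer first step.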
Stemming Non-English Words
There are other stemmers like SnowballStemmer, LancasterStemmer, ISRIStemmer, RSLPStemmer, and RegexpStemmer.
SnowballStemmer can stem words of various languages besides English.
from nltk.stem import SnowballStemmer
# Languages supported by SnowballStemmer
print (SnowballStemmer.languages)
'''
Output: ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''
Stemming Spanish Words using SnowballStemmer
Let’s stem some Spanish words.
Below is the English translation of the Spanish words:
trabajando => working
trabajos => works
trabajó => worked
from nltk.stem import SnowballStemmer
stemmer_spanish = SnowballStemmer('spanish')
print (stemmer_spanish.stem('trabajando')) # output: trabaj
print (stemmer_spanish.stem('trabajos')) # output: trabaj
print (stemmer_spanish.stem('trabajó')) # output: trabaj
# Note: in Python 2 the last call needs 'trabajó'.decode('utf-8'),
# otherwise it raises:
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
# In Python 3, strings are Unicode by default, so no decoding is needed.
Stemming English Words using SnowballStemmer
stemmer_english = SnowballStemmer('english')
print (stemmer_english.stem('working')) # output: work
print (stemmer_english.stem('works')) # output: work
print (stemmer_english.stem('worked')) # output: work
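Note that the 'porter' entry in the languages list above refers to the original Porter algorithm, while SnowballStemmer('english') implements its improved revision (often called "Porter2"). A minimal sketch of one word where the two differ, using the word 'generously' from the NLTK Snowball documentation:

```python
from nltk.stem import SnowballStemmer

porter = SnowballStemmer('porter')    # original Porter algorithm
english = SnowballStemmer('english')  # the improved revision ("Porter2")

print(porter.stem('generously'))   # output: gener
print(english.stem('generously'))  # output: generous
```

For most English text, the 'english' stemmer is the recommended choice.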
Lemmatization
Lemmatization is closely related to stemming. Lemmatization returns the lemma of a word, which is its dictionary base/root form.
Difference between Stemming and Lemmatisation
– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
– While converting a word to its root/base form, stemming can create non-existent words, but lemmatization produces actual dictionary words.
– Stemmers are typically easier to implement than Lemmatizers.
– Stemmers run faster than Lemmatizers.
– The accuracy of stemming is less than that of lemmatization.
Lemmatization in NLTK can be done using WordNet’s Lemmatizer. WordNet is a lexical database of English.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatisation depends upon the Part of Speech of the word
# lemmatize(word, pos=NOUN)
# the default part of speech (pos) for lemmatize method is "n", i.e. noun
# we can specify part of speech (pos) value like below:
# noun = n, verb = v, adjective = a, adverb = r
print (lemmatizer.lemmatize('is')) # output: is
print (lemmatizer.lemmatize('are')) # output: are
print (lemmatizer.lemmatize('is', pos='v')) # output: be
print (lemmatizer.lemmatize('are', pos='v')) # output: be
print (lemmatizer.lemmatize('working', pos='n')) # output: working
print (lemmatizer.lemmatize('working', pos='v')) # output: work
Lemmatising Text Document
First, the text needs to be converted into word tokens. After that, each word in the token list can be lemmatized.
As shown in the code snippet below, the word "jumps" has been converted to its base word "jump".
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_lemma = []
for word in words:
words_lemma.append(lemmatizer.lemmatize(word))
'''
#words_lemma_2 = [str(item) for item in words_lemma]
#print (words_lemma_2)
print (words_lemma)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
'''