The main goal of stemming and lemmatization is to convert related words to a common base/root word. Both are special cases of text normalization.

Stemming

Stemming a word means reducing it to its stem. A single word can appear in many inflected forms, but all of those forms share a single stem/base/root word. The stem does not have to be identical to the morphological root of the word.

Example:

The word "work" will be the stem word for working, worked, and works.

working => work
worked => work
works => work
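Under the hood, most stemmers work by stripping known suffixes according to a set of rules. As a toy illustration of the idea (not the actual Porter algorithm, which applies many ordered rules with extra conditions), a minimal suffix-stripper might look like this:

```python
# A toy rule-based stemmer: strip a few common English suffixes.
# This is only a sketch of the suffix-stripping idea; it is NOT
# how PorterStemmer is implemented.
def naive_stem(word):
    for suffix in ('ing', 'ed', 's'):
        # Keep at least 3 characters so we don't over-strip short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem('working'))  # work
print(naive_stem('worked'))   # work
print(naive_stem('works'))    # work
```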

Loading Stemmer Module

There are many stemming algorithms. "Porter Stemming Algorithm" is the most popular one.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print (stemmer.stem('working')) # output: work
print (stemmer.stem('works')) # output: work
print (stemmer.stem('worked')) # output: work

Stemming text document
We need to first convert the text into word tokens. After that, we can stem each word of the token list.

We can see in the below code that the word "jumps" has been stemmed to "jump" and the word "lazy" has been stemmed to "lazi", which is not a dictionary word.

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "A quick brown fox jumps over the lazy dog." 
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower() 
# tokenize text 
words = word_tokenize(text) 
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
''' 
stemmer = PorterStemmer() 
words_stem = [stemmer.stem(word) for word in words] 
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
	words_stem.append(stemmer.stem(word))
''' 
print (words_stem)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''

Using split() function

You can also test the stemmer on your text without word tokenizing. For this, you can use the split() method, which turns a string into a list based on a delimiter; the default delimiter is whitespace.

Note: Tokenizing sentences into words is useful because it separates punctuation from the words. In the below example, the last word "dog" will be taken as "dog." (with the full stop at the end), because split() does not separate the punctuation mark from the word.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "A quick brown fox jumps over the lazy dog."
text_stem = " ".join([stemmer.stem(word) for word in text.split()])
print (text_stem)
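The punctuation effect described above is visible with plain str.split() alone, before any stemming:

```python
text = "A quick brown fox jumps over the lazy dog."

# str.split() only breaks on whitespace, so the full stop
# stays attached to the last word.
print(text.split())
# ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
```

Compare this with the word_tokenize output shown earlier, where '.' is a separate token.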

Stemming Non-English Words

There are other stemmers, like SnowballStemmer, LancasterStemmer, ISRIStemmer, RSLPStemmer, and RegexpStemmer.
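RegexpStemmer, for example, simply removes suffixes that match a regular expression. Its behaviour can be mimicked with the standard re module (this is a sketch of the idea, not NLTK's implementation):

```python
import re

# Strip a trailing 'ing', 'ed', or 's' -- roughly what
# nltk.stem.RegexpStemmer('ing$|ed$|s$', min=4) would do.
suffix_pattern = re.compile(r'(ing|ed|s)$')

def regexp_stem(word, min_len=4):
    # Leave very short words alone, similar to RegexpStemmer's min argument.
    if len(word) < min_len:
        return word
    return suffix_pattern.sub('', word)

print(regexp_stem('working'))  # work
print(regexp_stem('is'))       # is (too short to stem)
```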

SnowballStemmer can stem words of different languages besides English.

from nltk.stem import SnowballStemmer
# Languages supported by SnowballStemmer
print (SnowballStemmer.languages)
'''
Output: ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''

Stemming Spanish Words using SnowballStemmer

Let’s stem some Spanish words.

Here’s the English translation of the Spanish words:

trabajando => working
trabajos => works
trabajó => worked
from nltk.stem import SnowballStemmer

stemmer_spanish = SnowballStemmer('spanish')
print (stemmer_spanish.stem('trabajando')) # output: trabaj
print (stemmer_spanish.stem('trabajos')) # output: trabaj
print (stemmer_spanish.stem('trabajó')) # output: trabaj
# In Python 2, decode the string first to avoid
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3:
# stemmer_spanish.stem('trabajó'.decode('utf-8'))

Stemming English Words using SnowballStemmer

stemmer_english = SnowballStemmer('english')
print (stemmer_english.stem('working')) # output: work
print (stemmer_english.stem('works')) # output: work
print (stemmer_english.stem('worked')) # output: work

Lemmatization

Lemmatization is closely related to stemming. It returns the lemma of a word, i.e., its dictionary base/root form.

Difference between Stemming and Lemmatization

– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

– While converting a word to its root/base form, stemming can create non-existent words, whereas lemmatization produces actual dictionary words.

– Stemmers are typically easier to implement than Lemmatizers.

– Stemmers run faster than Lemmatizers.

– The accuracy of stemming is less than that of lemmatization.
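The non-word issue is easy to demonstrate: Porter stems "studies" to "studi", while a lemmatizer returns the dictionary word "study". A toy sketch of the contrast (the dictionary lookup below stands in for a real lemmatizer such as WordNet's; it is not how NLTK works internally):

```python
# Suffix stripping can produce non-words...
def toy_stem(word):
    # Blindly strip a trailing 'es' -- this can yield non-words.
    return word[:-2] if word.endswith('es') else word

# ...while lemmatization maps each form to a dictionary entry.
LEMMA_DICT = {'studies': 'study', 'jumps': 'jump'}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself.
    return LEMMA_DICT.get(word, word)

print(toy_stem('studies'))       # studi (not a dictionary word)
print(toy_lemmatize('studies'))  # study (a dictionary word)
```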

Lemmatization in NLTK can be done using WordNet’s Lemmatizer. WordNet is a lexical database of English.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
# Lemmatisation depends upon the Part of Speech of the word
# lemmatize(word, pos=NOUN)
# the default part of speech (pos) for lemmatize method is "n", i.e. noun
# we can specify part of speech (pos) value like below:
# noun = n, verb = v, adjective = a, adverb = r 
print (lemmatizer.lemmatize('is')) # output: is
print (lemmatizer.lemmatize('are')) # output: are 
print (lemmatizer.lemmatize('is', pos='v')) # output: be
print (lemmatizer.lemmatize('are', pos='v')) # output: be 
print (lemmatizer.lemmatize('working', pos='n')) # output: working
print (lemmatizer.lemmatize('working', pos='v')) # output: work
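In practice, the pos argument usually comes from a part-of-speech tagger such as nltk.pos_tag, whose Penn Treebank tags must first be mapped to WordNet's single-letter codes. A small helper for this (the function name is ours, not NLTK's):

```python
# Map Penn Treebank tags (as produced by nltk.pos_tag) to the
# single-letter pos values that WordNetLemmatizer expects.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun

print(penn_to_wordnet('VBG'))  # v
print(penn_to_wordnet('NN'))   # n
```

With this helper, each (word, tag) pair from nltk.pos_tag(words) can be lemmatized as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).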

Lemmatizing text document
We need to first convert the text into word tokens. After that, we can lemmatize each word of the token list.

We can see in the below code that the word "jumps" has been converted to its base word "jump".

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
''' 
lemmatizer = WordNetLemmatizer() 
words_lemma = [lemmatizer.lemmatize(word) for word in words] 
# The above line of code is a shorter version of the following code:
'''
words_lemma = [] 
for word in words:
	words_lemma.append(lemmatizer.lemmatize(word))
''' 
print (words_lemma)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
'''