The main goal of stemming and lemmatization is to convert related words to a common base/root word. It’s a special case of text normalization.
Stemming
Stemming a word means reducing it to its stem. A single word can appear in different inflected forms, but all of those forms share a single stem/base/root word. The stem is not necessarily identical to the morphological root of the word.
Example:
The word "work" is the stem word for the words 'working', 'worked', and 'works'.
working => work
worked => work
works => work
Loading Stemmer Module
There are many stemming algorithms. "Porter Stemming Algorithm" is the most popular one.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print (stemmer.stem('working')) # output: work
print (stemmer.stem('works')) # output: work
print (stemmer.stem('worked')) # output: work
Stemming Text Document
First, the text needs to be converted into word tokens. After that, each word in the token list can be stemmed.
As shown in the code snippet below, the word "jumps" has been stemmed to "jump" and the word "lazy" has been stemmed to "lazi".
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
stemmer = PorterStemmer()
words_stem = [stemmer.stem(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
    words_stem.append(stemmer.stem(word))
'''
#words_stem_2 = [str(item) for item in words_stem]
#print (words_stem_2)
print (words_stem)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''
Using split() function
The stemmer can also be run on your text without word tokenization. For this, we use the split() method, which turns a string into a list based on a delimiter. The default delimiter is whitespace.
Note: Tokenizing sentences into words is useful because it separates punctuation from the words. In the example below, the last word "dog" is taken as "dog." (with the full stop at the end) because the punctuation mark is not separated from the word.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "A quick brown fox jumps over the lazy dog."
text_stem = " ".join([stemmer.stem(word) for word in text.split()])
print (text_stem)
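The effect of the attached punctuation can be seen directly by stemming the two versions of the token. A minimal sketch (it uses only the Porter stemmer, which needs no corpus data):

```python
from nltk.stem import PorterStemmer

# split() leaves the full stop attached, so the stemmer sees "dog."
# as a single token and cannot reduce it to the same stem as "dog".
stemmer = PorterStemmer()

print(stemmer.stem('dog'))   # the clean token
print(stemmer.stem('dog.'))  # the token produced by split()
```

This is why word_tokenize, which splits punctuation into its own token, is usually the safer first step.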
Stemming Non-English Words
There are other stemmers like SnowballStemmer, LancasterStemmer, ISRIStemmer, RSLPStemmer, and RegexpStemmer.
SnowballStemmer can stem words of various languages besides English.
from nltk.stem import SnowballStemmer
# Languages supported by SnowballStemmer
print (SnowballStemmer.languages)
'''
Output: ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''
Stemming Spanish Words using SnowballStemmer
Let’s stem some Spanish words.
Below is the English translation of the Spanish words:
trabajando => working
trabajos => works
trabajó => worked
from nltk.stem import SnowballStemmer
stemmer_spanish = SnowballStemmer('spanish')
print (stemmer_spanish.stem('trabajando')) # output: trabaj
print (stemmer_spanish.stem('trabajos')) # output: trabaj
print (stemmer_spanish.stem('trabajó')) # output: trabaj
# Note: in Python 2 the last call needs 'trabajó'.decode('utf-8'),
# otherwise it raises:
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
# In Python 3, strings are Unicode by default, so no decoding is needed.
Stemming English Words using SnowballStemmer
stemmer_english = SnowballStemmer('english')
print (stemmer_english.stem('working')) # output: work
print (stemmer_english.stem('works')) # output: work
print (stemmer_english.stem('worked')) # output: work
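Note that the 'porter' entry in the languages list above refers to the original Porter algorithm, while SnowballStemmer('english') implements its improved revision (often called "Porter2"). A minimal sketch of one word where the two differ, using the word 'generously' from the NLTK Snowball documentation:

```python
from nltk.stem import SnowballStemmer

porter = SnowballStemmer('porter')    # original Porter algorithm
english = SnowballStemmer('english')  # the improved revision ("Porter2")

print(porter.stem('generously'))   # output: gener
print(english.stem('generously'))  # output: generous
```

For most English text, the 'english' stemmer is the recommended choice.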
Lemmatization
Lemmatization is closely related to stemming. Lemmatization returns the lemma of a word, which is its dictionary base/root form.
Difference between Stemming and Lemmatisation
– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
– While converting a word to its root/base form, stemming can create non-existent words, but lemmatization produces actual dictionary words.
– Stemmers are typically easier to implement than Lemmatizers.
– Stemmers run faster than Lemmatizers.
– The accuracy of stemming is less than that of lemmatization.
Lemmatization in NLTK can be done using WordNet’s Lemmatizer. WordNet is a lexical database of English.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatisation depends upon the Part of Speech of the word
# lemmatize(word, pos=NOUN)
# the default part of speech (pos) for lemmatize method is "n", i.e. noun
# we can specify part of speech (pos) value like below:
# noun = n, verb = v, adjective = a, adverb = r
print (lemmatizer.lemmatize('is')) # output: is
print (lemmatizer.lemmatize('are')) # output: are
print (lemmatizer.lemmatize('is', pos='v')) # output: be
print (lemmatizer.lemmatize('are', pos='v')) # output: be
print (lemmatizer.lemmatize('working', pos='n')) # output: working
print (lemmatizer.lemmatize('working', pos='v')) # output: work
Lemmatising Text Document
First, the text needs to be converted into word tokens. After that, each word in the token list can be lemmatized.
As shown in the code snippet below, the word "jumps" has been converted to its base word "jump".
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_lemma = []
for word in words:
words_lemma.append(lemmatizer.lemmatize(word))
'''
#words_lemma_2 = [str(item) for item in words_lemma]
#print (words_lemma_2)
print (words_lemma)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
'''