Go Machine Learning Projects

Normalizing and lemmatizing

In the previous section, I wrote that all the words in the second example, "she shan't be excessively learned", are already in the dictionary from the first sentence. The observant reader might note that the word "be" isn't actually in the dictionary. From a linguistic point of view, however, the claim isn't necessarily false: "be" is the root word of "is", of which "was" is the past tense. The idea here is that instead of adding the words to the dictionary directly, we should add their root words. This is called lemmatization. Continuing from the previous example, the following are the lemmatized words from the first sentence:

the
child
be
learn
a
new
word
and
be
use
it
excessively
shall
not
she
cry

Again, I would like to point out an inconsistency that will be immediately obvious to the observant reader: the word "excessively" has the root word "excess". So why was "excessively" listed? The answer is that lemmatization isn't a straightforward lookup of the root word in a dictionary. In complex NLP-related tasks, words often have to be lemmatized according to the context they appear in. That's beyond the scope of this chapter because, as before, it's a fairly involved topic that could span an entire chapter of a book on NLP preprocessing.
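To make the idea concrete, here is a minimal sketch of the naive, context-free approach: a plain table lookup. The lemmas table and the lemmatize function are hypothetical names invented for this illustration, and the table covers only a handful of words. A real lemmatizer would need a much larger lexicon and, as noted above, contextual information.

```go
package main

import (
	"fmt"
	"strings"
)

// lemmas is a hypothetical, hand-written lookup table used only for
// illustration. The entries for "is" and "was" follow the text; the
// others are assumed inflected forms of words in the lemma list.
var lemmas = map[string]string{
	"is":       "be",
	"was":      "be",
	"learning": "learn",
	"using":    "use",
	"cried":    "cry",
}

// lemmatize lowercases word and returns its lemma if the table has one.
// Words that are not in the table are returned lowercased but otherwise
// unchanged.
func lemmatize(word string) string {
	w := strings.ToLower(word)
	if l, ok := lemmas[w]; ok {
		return l
	}
	return w
}

func main() {
	for _, w := range []string{"was", "Learning", "excessively"} {
		fmt.Printf("%s -> %s\n", w, lemmatize(w))
	}
	// Prints:
	// was -> be
	// Learning -> learn
	// excessively -> excessively
}
```

Note that "excessively" passes through untouched. Whether it should be reduced to "excess" is exactly the kind of decision a lookup table cannot make on its own.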

So, let's go back to the topic of adding a word to a dictionary. Another useful thing to do is to normalize the words. In English, this typically means lowercasing the text, replacing Unicode combining character sequences with a canonical form, and the like. In the Go ecosystem, there is an extended standard library package that handles Unicode normalization: golang.org/x/text/unicode/norm. In particular, if we are going to work on real datasets, I personally prefer the NFC normalization form. A good resource on string normalization is the Go blog post on the topic: https://blog.golang.org/normalization. Its content is not specific to Go, and it is a good guide to string normalization in general.

The LingSpam corpus comes with variants that are normalized (by lowercasing and NFC) and lemmatized. They can be found in the lemm and lemm_stop variants of the corpus.