NLP: Text Vectorization Methods from Scratch
NLP projects work with text, but text cannot be used by machine learning algorithms unless transformed into a numerical representation. This representation is typically called a vector, and it can be applied to any reasonable unit of a text: individual tokens, n-grams, sentences, paragraphs, or even whole documents.
In statistical NLP on whole corpuses, different techniques for vectorization have been applied, such as one-hot, count, or frequency encoding. In neural NLP, word vectors, also called word embeddings, are dominant. Pre-trained vectors as well as learned vector representations in complex neural networks can be used.
This article explains and shows Python implementation for all the mentioned vectorization techniques: one-hot-encoding, counter encoding (bag-of-words), term frequency, and finally word vectors.
The technical context of this article is Python v3.11
and several additional libraries: gensim v4.3.1
, pandas v2.0.1
, numpy v1.26.1
, nltk v3.8.1
and scikit-learn v1.2.2
. All examples should work with newer library versions too.
Requirements and Used Python Libraries
Be sure to read and run the requirements of my previous article in order to have a Jupyter Notebook to run all code examples.
For this article, the following libraries are needed:
- Collections
- The
Counter
object for counting the number of tokens in a document
- The
- Gensim
- The
downloader
object allows loading several pre-trained word vectors
- The
- Pandas
DataFrame
objects to store the text, tokens, and vectors
- Numpy
- Serveral methods for creating and working with
arrays
- Serveral methods for creating and working with
- NLTK
PlaintextCorpusReader
for a traversable object that gives access to documents, provides tokenization methods, and computes statistics about all filessent_tokenizer
andword_tokenizer
for generating tokens- The
stopwords
list for token reduction
- SciKitLearn
Pipeline
object to implement the chain of processing stepsBaseEstimator
andTransformerMixin
to build custom classes that represent Pipeline steps
All examples require these imports and base classes:
import numpy as np
import re
from copy import deepcopy
from collections import Counter
from gensim import downloader
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from time import time
class SciKitTransformer(BaseEstimator, TransformerMixin):
def fit(self, X=None, y=None):
return self
def transform(self, X=None):
return self
Basic Example
Based upon earlier articles, the NLTK PlaintextCorpusReader will be reused.
Here is the updated version of the WikipediaCorpus
class with an additional filter()
method - it reduces the vocabulary to text only, without any stopwords.
class WikipediaCorpus(PlaintextCorpusReader):
def __init__(self, root_path):
PlaintextCorpusReader.__init__(self, root_path, r'.*[0-9].txt')
def filter(self, word):
#only keep letters, numbers, and sentence delimiter
word = re.sub('[\(\)\.,;:+\-—"]', '', word)
#remove multiple whitespace
word = re.sub(r'\s+', '', word)
if not word in stopwords.words("english"):
return word.lower()
return ''
def vocab(self):
return sorted(set([self.filter(word) for word in corpus.words()]))
def max_words(self):
max = 0
for doc in self.fileids():
l = len(self.words(doc))
max = l if l > max else max
return max
def describe(self, fileids=None, categories=None):
started = time()
return {
'files': len(self.fileids()),
'paras': len(self.paras()),
'sents': len(self.sents()),
'words': len(self.words()),
'vocab': len(self.vocab()),
'max_words': self.max_words(),
'time': time()-started
}
To keep the example vectors in this article brief and understandable, the corpus consists of the first three sentences from the Wikipedia article about machine learning.
_Source: [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)_
Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans or by other animals.
Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.
As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.
Using the corpus class to parse these sentences gives the following statistics: A vocabulary of 49 words, and the total number of 113 words. That’s large and small enough to keep the following explanations brief.
corpus = WikipediaCorpus('ai_sentences')
print(corpus.fileids())
# ['sent1.txt', 'sent2.txt', 'sent3.txt']
print(corpus.describe())
# {'files': 3, 'paras': 3, 'sents': 3, 'words': 91, 'vocab': 40, 'max_words': 32, 'time': 0.01642608642578125}
print(corpus.vocab())
# ['', 'ai', 'animals', 'artificial', 'as', 'become', 'capable', 'computer', 'considered', ..., 'well']
One-Hot Encoding
A one-hot encoding represents the relationship which words occur in a specific document based on the total vocabulary of all documents. The implementation therefore requires these steps:
- Compute the total ordered vocabulary of all documents
- Iterate each document and mark which words are occurring
The following implementation builds a vocab_dict
object filled with default float values 0.0
, and then sets these values to 1.0
for each token that appears in the sentence.
class OneHotEncoder(SciKitTransformer):
def __init__(self, vocab):
self.vocab_dict = dict.fromkeys(vocab, 0.0)
def one_hot_vector(self, tokens):
vec_dict = deepcopy(self.vocab_dict)
for token in tokens:
if token in self.vocab_dict:
vec_dict[token] = 1.0
vec = [v for v in vec_dict.values()]
return np.array(vec)
Here are the one-hot vectors for the first two sentences:
encoder = OneHotEncoder(corpus.vocab())
sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.one_hot_vector(sent1)
print(vec1)
# [0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.
# 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(vec1.shape)
# (40,)
sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.one_hot_vector(sent2)
print(vec2)
# [0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1.
# 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 1.]
print(vec2.shape)
# (40,)
Counter Encoding
Counter encoding is an intermediate form for creating vectors. Based on the complete ordered vocabulary of all documents, the number and occurrence of all words in a document is determined. This number is typically scaled, e.g. by the length of the document.
Here is a counter encoding implementation in Python. As before, it builds a vocab_dict
object filled with default float values 0.0
, and for each document, it sets a value of number(word)/len(document)
.
from collections import Counter
class CountEncoder(SciKitTransformer):
def __init__(self, vocab):
self.vocab = dict.fromkeys(vocab, 0.0)
def count_vector(self, tokens):
vec_dict = deepcopy(self.vocab)
token_vec = Counter(tokens)
doc_length = len(tokens)
for token, count in token_vec.items():
if token in self.vocab:
vec_dict[token] = count/doc_length
vec = [v for v in vec_dict.values()]
return np.array(vec)
Using the counter encoding yields these results:
encoder = CountEncoder(corpus.vocab())
sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.count_vector(sent1)
print(vec1)
# [0. 0. 0.03571429 0. 0.03571429 0.
# 0. 0. 0. 0. 0. 0.03571429
# 0. 0. 0. 0.03571429 0. 0.
# 0.03571429 0. 0. 0.07142857 0. 0.
# 0.03571429 0. 0. 0. 0.03571429 0.
# 0. 0. 0. 0. 0. 0.03571429
# 0. 0. 0. 0. ]
print(vec1.shape)
# (40,)
sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.count_vector(sent2)
print(vec2)
# [0. 0. 0. 0. 0.06896552 0.
# 0. 0.03448276 0. 0. 0. 0.
# 0.03448276 0. 0. 0. 0.03448276 0.
# 0. 0. 0.03448276 0. 0. 0.03448276
# 0. 0.03448276 0.03448276 0. 0. 0.
# 0. 0.03448276 0. 0. 0.03448276 0.
# 0.03448276 0.03448276 0.03448276 0.03448276]
print(vec2.shape)
# (40,)
Term Frequency Encoding
The first two encodings lead to the problem that very rare terms do not have enough weight to be of significance when used with machine learning algorithm. Especially to counter this problem, the Term Frequency, Term Indirect Frequency metric balances rare terms in a large corpora of documents. The detailed mathematics can be studied in the TfIdf Wikipedia article - here is the essential summary:
- TF, term frequency, is the number of times a term appears inside a document, divided by the total length of the document, in pseudo code
word_occurences_in_doc/doc_len
- IDF, indirect document frequency, is the logarithm of the number of documents containing a word divided by the total number of documents in the corpus, in pseudo code
log(number_of_docs/number_of_docs_containing_word)
The implementation is quite complex, structured along these considerations:
- The encoder receives the corpus vocabulary as a list, and it receives a dictionary object of the form
{document_name: [tokens]}
(otherwise this implementation would be too tightly coupled to the corpus object) - During initialization, a
word_frequency
dictionary is created that includes the total number of how often a term appears in all documents - The TfIdf method determines the total number of documents as
number_of_docs
, and the documents length asdoc_len
. It then creates aCounter
for all words in the documents, and then for each word that is contained in the vocabulary, calculates the TfIdf value - All values are converted to, and returned as, a Numpy array
Here is the implementation:
class TfIdfEncoder(SciKitTransformer):
def __init__(self, doc_arr, vocab):
self.doc_arr = doc_arr
self.vocab = vocab
self.word_frequency = self._word_frequency()
def _word_frequency(self):
word_frequency = dict.fromkeys(self.vocab, 0.0)
for doc_name in self.doc_arr:
doc_words = Counter([word for word in self.doc_arr[doc_name]])
for word, _ in doc_words.items():
if word in self.vocab:
word_frequency[word] += 1.0
return word_frequency
def TfIdf_vector(self, doc_name):
if not doc_name in self.doc_arr:
print(f'Document "{doc_name}" not found.')
return
number_of_docs = len(self.doc_arr)
doc_len = len(self.doc_arr[doc_name])
doc_words = Counter([word for word in self.doc_arr[doc_name]])
TfIdf_vec = dict.fromkeys(self.vocab, 0.0)
for word, word_count in doc_words.items():
if word in self.vocab:
tf = word_count/doc_len
idf = np.log(number_of_docs/self.word_frequency[word])
idf = 1 if idf == 0 else idf
TfIdf_vec[word] = tf * idf
vec = [v for v in TfIdf_vec.values()]
return np.array(vec)
For our example with only three sentences, the vectors adequately represent documents, but their full potential is only realized for large campus.
doc_list = [doc for doc in corpus.fileids()]
words_list = [corpus.words(doc) for doc in [doc for doc in corpus.fileids()]]
doc_arr = dict(zip(doc_list, words_list))
encoder = TfIdfEncoder(doc_arr, corpus.vocab())
vec1 = encoder.TfIdf_vector('sent1.txt')
print(vec1)
# [0. 0. 0.03433163 0. 0.03125 0.
# 0. 0. 0. 0. 0.03433163 0.03433163
# 0. 0. 0. 0.03433163 0. 0.
# 0.03433163 0.03433163 0. 0.03801235 0. 0.
# 0.01267078 0. 0. 0. 0.03433163 0.03433163
# 0. 0. 0. 0. 0. 0.03433163
# 0. 0. 0. 0. ]
print(vec1.shape)
# (40,)
vec2 = encoder.TfIdf_vector('sent2.txt')
print(vec2)
# [0. 0. 0. 0. 0.06896552 0.
# 0. 0.03788318 0. 0. 0. 0.
# 0.03788318 0. 0. 0. 0.03788318 0.
# 0. 0. 0.03788318 0. 0. 0.03788318
# 0. 0.03788318 0.03788318 0. 0. 0.
# 0. 0.03788318 0. 0. 0.03788318 0.
# 0.01398156 0.03788318 0.03788318 0.03788318]
print(vec2.shape)
# (40,)
WordVectors
The final encoding type is word vectors. In essence, each word is represented with an n-dimensional vector. This vector represents fine-grained relationships between the words, and it enables vector arithmetic’s to enable comparison and composition of vectors, such as to fulfill the vector algebra of king + women = queen
.
Word vectors provided tremendous and surprising value for large scale natural language processing tasks. The three main word vector implementations are the original Word2Vec, FastText, and Glove.
Word2Vec was the first model, trained on news articles, and used different n-grams sizes to capture the meaning of words from the surrounding context. FastText uses a similar continuous n-gram approach, but it considers not only the actual context of a word within the training data, but other context as well. This improves the representation of sparse words and handles unknown words that were nor present during training. Glove considers the total corpus to compute a word-word co-occurrence matrix from the training data, and builds a probabilistic model about the likelihood of any word occurrences from the sampled data.
Word vectors represent the structures that occurred in their training data. If this data is of sufficient size and close to the texts of the corpus, pretrained vectors can be used. Otherwise they need to be trained from the campus.
In the following implementation, the Gensim library will be used to load pretrained Word2Vec vectors and apply them to the corpus. To use one of the pretrained models, you need to download its model with a Gensim helper. Note that the models can be very large. For example, the word2vec-google-news-300
models is 1.6GB and provides 300-dimensional vectors for each word.
>>> wv = downloader.load('word2vec-google-news-300')
# [=======-------------------------------------------] 15.5% 258.5/1662.8MB downloaded
The vectorizer implementation uses the known structure as the others. Its implementation is quite straight forward: It will process a documents token list and output a vector with numeric values for each word for which a vector representation exists.
class Word2VecEncoder(SciKitTransformer):
def __init__(self, vocab):
self.vocab = vocab
self.vector_lookup = downloader.load('word2vec-google-news-300')
def word_vector(self, tokens):
vec = np.array([])
for token in tokens:
if token in self.vocab:
if token in self.vector_lookup:
print(f'Add {token}')
vec = np.append(self.vector_lookup[token], vec)
return vec
Here is an example output.
encoder = Word2VecEncoder(corpus.vocab())
sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.word_vector(sent1)
print(vec1)
# [ 0.01989746 0.24707031 -0.23632812 ... -0.24707031 0.05249023
# 0.19824219]
print(vec1.shape)
# (3000,)
sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.word_vector(sent2)
print(vec2)
# [-0.11767578 -0.13769531 -0.140625 ... -0.03295898 -0.01733398
# 0.13476562]
print(vec2.shape)
# (4500,)
As you can see, the vectors are 3000 and 4500 values for just two sentences. The result is document-specific matrix where each column represents the documents token appearing as is, and the number of columns are the number of words contained in the column.
Conclusion
This article showed how to implement text vectorization methods from scratch. It showed the implementation of one-hot-encoding, counter encoding, frequency encoding with TfIdf, and word vectors with Word2Vec. It also showed concrete examples of the resulting vectors when applied to sentences from the Wikipedia article about artificial intelligence.