Skip to content

NLP: Text Vectorization Methods from Scratch

By Sebastian Günther

Posted in Nlp, Python, Word_vectors

NLP projects work with text, but text cannot be used by machine learning algorithms unless transformed into a numerical representation. This representation is typically called a vector, and it can be applied to any reasonable unit of a text: individual tokens, n-grams, sentences, paragraphs, or even whole documents.

In statistical NLP on whole corpuses, different techniques for vectorization have been applied, such as one-hot, count, or frequency encoding. In neural NLP, word vectors, also called word embeddings, are dominant. Pre-trained vectors as well as learned vector representations in complex neural networks can be used.

This article explains and shows Python implementation for all the mentioned vectorization techniques: one-hot-encoding, counter encoding (bag-of-words), term frequency, and finally word vectors.

The technical context of this article is Python v3.11 and several additional libraries: gensim v4.3.1, pandas v2.0.1, numpy v1.26.1, nltk v3.8.1 and scikit-learn v1.2.2. All examples should work with newer library versions too.

Requirements and Used Python Libraries

Be sure to read and run the requirements of my previous article in order to have a Jupyter Notebook to run all code examples.

For this article, the following libraries are needed:

  • Collections
    • The Counter object for counting the number of tokens in a document
  • Gensim
    • The downloader object allows loading several pre-trained word vectors
  • Pandas
    • DataFrame objects to store the text, tokens, and vectors
  • Numpy
    • Serveral methods for creating and working with arrays
  • NLTK
    • PlaintextCorpusReader for a traversable object that gives access to documents, provides tokenization methods, and computes statistics about all files
    • sent_tokenizer and word_tokenizer for generating tokens
    • The stopwords list for token reduction
  • SciKitLearn
    • Pipeline object to implement the chain of processing steps
    • BaseEstimator and TransformerMixin to build custom classes that represent Pipeline steps

All examples require these imports and base classes:

import numpy as np
import re

from copy import deepcopy
from collections import Counter
from gensim import downloader
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from time import time

class SciKitTransformer(BaseEstimator, TransformerMixin):
  def fit(self, X=None, y=None):
    return self

  def transform(self, X=None):
    return self

Basic Example

Based upon earlier articles, the NLTK PlaintextCorpusReader will be reused.

Here is the updated version of the WikipediaCorpus class with an additional filter() method - it reduces the vocabulary to text only, without any stopwords.

class WikipediaCorpus(PlaintextCorpusReader):
    def __init__(self, root_path):
        PlaintextCorpusReader.__init__(self, root_path, r'.*[0-9].txt')

    def filter(self, word):
        #only keep letters, numbers, and sentence delimiter
        word = re.sub('[\(\)\.,;:+\-—"]', '', word)
        #remove multiple whitespace
        word = re.sub(r'\s+', '', word)
        if not word in stopwords.words("english"):
            return word.lower()
        return ''

    def vocab(self):
        return sorted(set([self.filter(word) for word in corpus.words()]))

    def max_words(self):
        max = 0
        for doc in self.fileids():
            l = len(self.words(doc))
            max = l if l > max else max
        return max

    def describe(self, fileids=None, categories=None):
        started = time()

        return {
            'files': len(self.fileids()),
            'paras': len(self.paras()),
            'sents': len(self.sents()),
            'words': len(self.words()),
            'vocab': len(self.vocab()),
            'max_words': self.max_words(),
            'time': time()-started
            }

To keep the example vectors in this article brief and understandable, the corpus consists of the first three sentences from the Wikipedia article about machine learning.

_Source: [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)_

Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans or by other animals.

Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.

As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.

Using the corpus class to parse these sentences gives the following statistics: A vocabulary of 49 words, and the total number of 113 words. That’s large and small enough to keep the following explanations brief.

corpus = WikipediaCorpus('ai_sentences')

print(corpus.fileids())
# ['sent1.txt', 'sent2.txt', 'sent3.txt']

print(corpus.describe())
# {'files': 3, 'paras': 3, 'sents': 3, 'words': 91, 'vocab': 40, 'max_words': 32, 'time': 0.01642608642578125}

print(corpus.vocab())
# ['', 'ai', 'animals', 'artificial', 'as', 'become', 'capable', 'computer', 'considered', ..., 'well']

One-Hot Encoding

A one-hot encoding represents the relationship which words occur in a specific document based on the total vocabulary of all documents. The implementation therefore requires these steps:

  • Compute the total ordered vocabulary of all documents
  • Iterate each document and mark which words are occurring

The following implementation builds a vocab_dict object filled with default float values 0.0, and then sets these values to 1.0 for each token that appears in the sentence.

class OneHotEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab_dict = dict.fromkeys(vocab, 0.0)

    def one_hot_vector(self, tokens):
        vec_dict = deepcopy(self.vocab_dict)
        for token in tokens:
            if token in self.vocab_dict:
                vec_dict[token] = 1.0
        vec = [v for v in vec_dict.values()]
        return np.array(vec)

Here are the one-hot vectors for the first two sentences:

encoder = OneHotEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.one_hot_vector(sent1)

print(vec1)
# [0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.
# 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

print(vec1.shape)
# (40,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.one_hot_vector(sent2)

print(vec2)
# [0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1.
# 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 1.]

print(vec2.shape)
# (40,)

Counter Encoding

Counter encoding is an intermediate form for creating vectors. Based on the complete ordered vocabulary of all documents, the number and occurrence of all words in a document is determined. This number is typically scaled, e.g. by the length of the document.

Here is a counter encoding implementation in Python. As before, it builds a vocab_dict object filled with default float values 0.0, and for each document, it sets a value of number(word)/len(document).

from collections import Counter

class CountEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab = dict.fromkeys(vocab, 0.0)

    def count_vector(self, tokens):
        vec_dict = deepcopy(self.vocab)
        token_vec = Counter(tokens)
        doc_length = len(tokens)
        for token, count in token_vec.items():
            if token in self.vocab:
                vec_dict[token] = count/doc_length
        vec = [v for v in vec_dict.values()]
        return np.array(vec)

Using the counter encoding yields these results:

encoder = CountEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.count_vector(sent1)

print(vec1)
# [0.         0.         0.03571429 0.         0.03571429 0.
# 0.         0.         0.         0.         0.         0.03571429
# 0.         0.         0.         0.03571429 0.         0.
# 0.03571429 0.         0.         0.07142857 0.         0.
# 0.03571429 0.         0.         0.         0.03571429 0.
# 0.         0.         0.         0.         0.         0.03571429
# 0.         0.         0.         0.        ]

print(vec1.shape)
# (40,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.count_vector(sent2)

print(vec2)
# [0.         0.         0.         0.         0.06896552 0.
#  0.         0.03448276 0.         0.         0.         0.
#  0.03448276 0.         0.         0.         0.03448276 0.
#  0.         0.         0.03448276 0.         0.         0.03448276
#  0.         0.03448276 0.03448276 0.         0.         0.
#  0.         0.03448276 0.         0.         0.03448276 0.
#  0.03448276 0.03448276 0.03448276 0.03448276]

print(vec2.shape)
# (40,)

Term Frequency Encoding

The first two encodings lead to the problem that very rare terms do not have enough weight to be of significance when used with machine learning algorithm. Especially to counter this problem, the Term Frequency, Term Indirect Frequency metric balances rare terms in a large corpora of documents. The detailed mathematics can be studied in the TfIdf Wikipedia article - here is the essential summary:

  • TF, term frequency, is the number of times a term appears inside a document, divided by the total length of the document, in pseudo code word_occurences_in_doc/doc_len
  • IDF, indirect document frequency, is the logarithm of the number of documents containing a word divided by the total number of documents in the corpus, in pseudo code log(number_of_docs/number_of_docs_containing_word)

The implementation is quite complex, structured along these considerations:

  1. The encoder receives the corpus vocabulary as a list, and it receives a dictionary object of the form {document_name: [tokens]} (otherwise this implementation would be too tightly coupled to the corpus object)
  2. During initialization, a word_frequency dictionary is created that includes the total number of how often a term appears in all documents
  3. The TfIdf method determines the total number of documents as number_of_docs, and the documents length as doc_len. It then creates a Counter for all words in the documents, and then for each word that is contained in the vocabulary, calculates the TfIdf value
  4. All values are converted to, and returned as, a Numpy array

Here is the implementation:

class TfIdfEncoder(SciKitTransformer):
    def __init__(self, doc_arr, vocab):
        self.doc_arr = doc_arr

        self.vocab = vocab
        self.word_frequency = self._word_frequency()

    def _word_frequency(self):
        word_frequency = dict.fromkeys(self.vocab, 0.0)

        for doc_name in self.doc_arr:
            doc_words = Counter([word for word in self.doc_arr[doc_name]])
            for word, _ in doc_words.items():
                if word in self.vocab:
                    word_frequency[word] += 1.0

        return word_frequency

    def TfIdf_vector(self, doc_name):
        if not doc_name in self.doc_arr:
            print(f'Document "{doc_name}" not found.')
            return

        number_of_docs = len(self.doc_arr)
        doc_len = len(self.doc_arr[doc_name])

        doc_words = Counter([word for word in self.doc_arr[doc_name]])
        TfIdf_vec = dict.fromkeys(self.vocab, 0.0)

        for word, word_count in doc_words.items():
            if word in self.vocab:
                tf = word_count/doc_len
                idf = np.log(number_of_docs/self.word_frequency[word])
                idf = 1 if idf == 0 else idf
                TfIdf_vec[word] = tf * idf

        vec = [v for v in TfIdf_vec.values()]
        return np.array(vec)

For our example with only three sentences, the vectors adequately represent documents, but their full potential is only realized for large campus.

doc_list = [doc for doc in corpus.fileids()]
words_list = [corpus.words(doc) for doc in [doc for doc in corpus.fileids()]]
doc_arr = dict(zip(doc_list, words_list))

encoder = TfIdfEncoder(doc_arr, corpus.vocab())

vec1 = encoder.TfIdf_vector('sent1.txt')

print(vec1)
# [0.         0.         0.03433163 0.         0.03125    0.
#  0.         0.         0.         0.         0.03433163 0.03433163
#  0.         0.         0.         0.03433163 0.         0.
#  0.03433163 0.03433163 0.         0.03801235 0.         0.
#  0.01267078 0.         0.         0.         0.03433163 0.03433163
#  0.         0.         0.         0.         0.         0.03433163
#  0.         0.         0.         0.        ]

print(vec1.shape)
# (40,)

vec2 = encoder.TfIdf_vector('sent2.txt')

print(vec2)
# [0.         0.         0.         0.         0.06896552 0.
# 0.         0.03788318 0.         0.         0.         0.
# 0.03788318 0.         0.         0.         0.03788318 0.
# 0.         0.         0.03788318 0.         0.         0.03788318
# 0.         0.03788318 0.03788318 0.         0.         0.
# 0.         0.03788318 0.         0.         0.03788318 0.
# 0.01398156 0.03788318 0.03788318 0.03788318]

print(vec2.shape)
# (40,)

WordVectors

The final encoding type is word vectors. In essence, each word is represented with an n-dimensional vector. This vector represents fine-grained relationships between the words, and it enables vector arithmetic’s to enable comparison and composition of vectors, such as to fulfill the vector algebra of king + women = queen.

Word vectors provided tremendous and surprising value for large scale natural language processing tasks. The three main word vector implementations are the original Word2Vec, FastText, and Glove.

Word2Vec was the first model, trained on news articles, and used different n-grams sizes to capture the meaning of words from the surrounding context. FastText uses a similar continuous n-gram approach, but it considers not only the actual context of a word within the training data, but other context as well. This improves the representation of sparse words and handles unknown words that were nor present during training. Glove considers the total corpus to compute a word-word co-occurrence matrix from the training data, and builds a probabilistic model about the likelihood of any word occurrences from the sampled data.

Word vectors represent the structures that occurred in their training data. If this data is of sufficient size and close to the texts of the corpus, pretrained vectors can be used. Otherwise they need to be trained from the campus.

In the following implementation, the Gensim library will be used to load pretrained Word2Vec vectors and apply them to the corpus. To use one of the pretrained models, you need to download its model with a Gensim helper. Note that the models can be very large. For example, the word2vec-google-news-300 models is 1.6GB and provides 300-dimensional vectors for each word.

>>> wv = downloader.load('word2vec-google-news-300')
# [=======-------------------------------------------] 15.5% 258.5/1662.8MB downloaded

The vectorizer implementation uses the known structure as the others. Its implementation is quite straight forward: It will process a documents token list and output a vector with numeric values for each word for which a vector representation exists.

class Word2VecEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab = vocab
        self.vector_lookup = downloader.load('word2vec-google-news-300')

    def word_vector(self, tokens):
        vec = np.array([])
        for token in tokens:
            if token in self.vocab:
                if token in self.vector_lookup:
                    print(f'Add {token}')
                    vec = np.append(self.vector_lookup[token], vec)

        return vec

Here is an example output.

encoder = Word2VecEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.word_vector(sent1)

print(vec1)
# [ 0.01989746  0.24707031 -0.23632812 ... -0.24707031  0.05249023
#  0.19824219]

print(vec1.shape)
# (3000,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.word_vector(sent2)

print(vec2)
# [-0.11767578 -0.13769531 -0.140625   ... -0.03295898 -0.01733398
#  0.13476562]

print(vec2.shape)
# (4500,)

As you can see, the vectors are 3000 and 4500 values for just two sentences. The result is document-specific matrix where each column represents the documents token appearing as is, and the number of columns are the number of words contained in the column.

Conclusion

This article showed how to implement text vectorization methods from scratch. It showed the implementation of one-hot-encoding, counter encoding, frequency encoding with TfIdf, and word vectors with Word2Vec. It also showed concrete examples of the resulting vectors when applied to sentences from the Wikipedia article about artificial intelligence.