The Arduino Nano 33 BLE Sense is an Arduino-compatible microcontroller board with a fleet of onboard sensors, including a microphone and sensors for light and temperature. It can be programmed with Arduino C and MicroPython to read and write data. This article introduces this unique board and details how to use its digital and analog pins and which functions the board supports.
The Tinker Board is a Single Board Computer from Asus. Its form factor, number of GPIO Pins, and general hardware capabilities resemble those of the Raspberry Pi. An important difference is the ARM Mali T764 GPU that contains Tensor Units tailored towards running Machine Learning applications.
spaCy is a powerful NLP library that performs many NLP tasks in its default configuration, including tokenization, lemmatization, and part-of-speech tagging. These steps can be extended with a text classification task, for which the training data is provided as preprocessed text together with the expected categories as dictionary objects. Both multi-label and single-label classification are supported.
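To make the idea of "expected categories as dictionary objects" concrete, here is a minimal sketch of what such training data could look like. It uses plain Python only. The `cats` key follows the convention that, as far as I know, spaCy's text categorizer uses for annotations; the labels, the example texts, and the `make_cats` helper are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical categories, chosen only for this example.
LABELS: List[str] = ["sports", "politics", "science"]


def make_cats(labels: Iterable[str], active: Iterable[str]) -> Dict[str, float]:
    """Map every known label to 1.0 if it applies to the text, else 0.0."""
    active_set = set(active)
    return {label: 1.0 if label in active_set else 0.0 for label in labels}


TrainingExample = Tuple[str, Dict[str, Dict[str, float]]]

# Single-label: exactly one category is active per text.
single_label: List[TrainingExample] = [
    ("The team won the final match.", {"cats": make_cats(LABELS, ["sports"])}),
]

# Multi-label: any number of categories may be active per text.
multi_label: List[TrainingExample] = [
    (
        "Parliament funds a new space telescope.",
        {"cats": make_cats(LABELS, ["politics", "science"])},
    ),
]
```

The only difference between the two modes in this sketch is how many entries of the dictionary are set to 1.0 for a given text.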
spaCy is a powerful NLP library in which many NLP tasks like tokenization, lemmatization, part-of-speech tagging, and named entity recognition are provided out of the box with pretrained models. All of these tasks are wrapped by a pipeline object, an internal abstraction of different functions that are applied step by step to a given text. This pipeline can be both customized and extended with self-written functions.
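The pipeline idea, functions applied step by step to a text and extendable with self-written ones, can be sketched in a few lines of plain Python. This is not the library's API: the `Pipeline` class, the dictionary-based document, and the three component functions are all hypothetical stand-ins that only illustrate the concept.

```python
from typing import Any, Callable, Dict, Iterable, List

Doc = Dict[str, Any]              # hypothetical document container
Component = Callable[[Doc], Doc]  # each step receives and returns the document


def tokenize(doc: Doc) -> Doc:
    """Split the raw text on whitespace."""
    doc["tokens"] = doc["text"].split()
    return doc


def lowercase(doc: Doc) -> Doc:
    """Normalize all tokens to lower case."""
    doc["tokens"] = [token.lower() for token in doc["tokens"]]
    return doc


def count_tokens(doc: Doc) -> Doc:
    """A self-written extension step that adds a token count."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


class Pipeline:
    """Applies its components in order to a given text."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components: List[Component] = list(components)

    def add(self, component: Component) -> None:
        """Extend the pipeline with an additional step at the end."""
        self.components.append(component)

    def __call__(self, text: str) -> Doc:
        doc: Doc = {"text": text}
        for component in self.components:
            doc = component(doc)
        return doc


nlp = Pipeline([tokenize, lowercase])  # the default steps
nlp.add(count_tokens)                  # the custom extension
result = nlp("Pipelines Wrap Tasks")
```

Because every step has the same signature, reordering, removing, or adding steps does not require any change to the steps themselves.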
Wikipedia is a rich source of information and knowledge. Conveniently structured into articles with categories and links to other articles, it also forms a network of related documents. My NLP project downloads, processes, and applies machine learning algorithms on Wikipedia articles.
In any NLP project, text data needs to be vectorized before it can be used by machine learning algorithms. Different methods exist, starting from simple one-hot or count encodings, and continuing with term frequency metrics and word embeddings.
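The simpler encodings can be illustrated without any library. The following is a minimal sketch under stated assumptions: the two toy documents are invented, all function names are hypothetical, and "one-hot" is applied per document here, so it is really a binary presence vector over the vocabulary. Word embeddings are left out because they require a trained model.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token a fixed vector index (alphabetical order)."""
    tokens = sorted({token for doc in docs for token in doc})
    return {token: index for index, token in enumerate(tokens)}


def one_hot_vector(doc: List[str], vocab: Dict[str, int]) -> List[int]:
    """1 if the token occurs in the document at all, else 0."""
    vector = [0] * len(vocab)
    for token in doc:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


def count_vector(doc: List[str], vocab: Dict[str, int]) -> List[int]:
    """Number of occurrences of each vocabulary token in the document."""
    vector = [0] * len(vocab)
    for token, count in Counter(doc).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


def term_frequency(doc: List[str], vocab: Dict[str, int]) -> List[float]:
    """Counts divided by document length, so long documents do not dominate."""
    total = len(doc)
    return [count / total if total else 0.0 for count in count_vector(doc, vocab)]


docs = [["the", "cat", "sat"], ["the", "cat", "and", "the", "dog"]]
vocab = build_vocabulary(docs)  # and=0, cat=1, dog=2, sat=3, the=4

print(one_hot_vector(docs[0], vocab))  # [0, 1, 0, 1, 1]
print(count_vector(docs[1], vocab))    # [1, 1, 1, 0, 2]
print(term_frequency(docs[1], vocab))  # [0.2, 0.2, 0.2, 0.0, 0.4]
```

Each method keeps the same vector layout and only changes what is stored per position, which is why they are often presented as a progression.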
Scikit-learn is an extensive library for machine learning projects, including several classification algorithms, methods for training and metrics collection, and tools for preprocessing input data. In every NLP project, text needs to be vectorized in order to be processed by machine learning algorithms. Vectorization methods include one-hot encoding, count encoding, frequency encoding, and word vectors or word embeddings. Several of these methods are available in scikit-learn as well.
NLP projects work with text, but text cannot be used by machine learning algorithms unless transformed into a numerical representation. This representation is typically called a vector, and it can be applied to any reasonable unit of a text: individual tokens, n-grams, sentences, paragraphs, or even whole documents.
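As a small illustration of choosing a different unit of text, the sketch below extracts n-grams from a token list and counts them, which is one simple way to obtain a numerical representation at the n-gram level. The `ngrams` helper and the sample sentence are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous runs of n tokens, in order of appearance."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "text needs a numerical representation".split()

bigrams = ngrams(tokens, 2)
# Counting the units gives a sparse numerical representation of the text.
bigram_counts = Counter(bigrams)
```

The same counting step works unchanged for single tokens (`n = 1`) or longer n-grams; only the unit being counted differs.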
My NLP project downloads, processes, and applies machine learning algorithms on Wikipedia articles. In my last article, the project's outline was shown and its foundation established. First, a Wikipedia crawler object searches articles by their name, extracts the title, categories, content, and related pages, and stores each article as a plaintext file. Second, a corpus object processes the complete set of articles, allows convenient access to individual files, and provides global data like the number of individual tokens.