Wikipedia Article Crawler & Clustering: KMeans

Wikipedia is a rich source of information and knowledge. Conveniently structured into articles with categories and links to other articles, it also forms a network of related documents. My NLP project downloads and processes Wikipedia articles and applies machine learning algorithms to them.

NLP: Text Vectorization Methods with SciKit Learn

SciKit Learn is an extensive library for machine learning projects. It includes several classification algorithms, methods for training and metrics collection, and tools for preprocessing input data. In every NLP project, text needs to be vectorized before machine learning algorithms can process it. Common vectorization methods are one-hot encoding, count encoding, frequency encoding, and word vectors or word embeddings. Several of these methods are available in SciKit Learn as well.

NLP: Text Vectorization Methods from Scratch

NLP projects work with text, but text cannot be used by machine learning algorithms unless transformed into a numerical representation. This representation is typically called a vector, and it can be applied to any reasonable unit of a text: individual tokens, n-grams, sentences, paragraphs, or even whole documents.
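
To make the idea of a numerical representation concrete, here is a minimal sketch in plain Python that vectorizes whole documents. The function names (`tokenize`, `build_vocabulary`, `one_hot_vector`, `count_vector`, `frequency_vector`) and the whitespace tokenizer are assumptions chosen for this illustration, not the implementation from the post.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_vocabulary(documents):
    """Map every distinct token to a fixed column index, in sorted order."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


def one_hot_vector(text, vocabulary):
    """1 where the vocabulary token occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if token in present else 0 for token in vocabulary]


def count_vector(text, vocabulary):
    """Number of occurrences of each vocabulary token in the text."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


def frequency_vector(text, vocabulary):
    """Occurrences of each vocabulary token divided by the text length."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[token] / total if total else 0.0 for token in vocabulary]


# Hypothetical two-document corpus
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)

print(one_hot_vector(docs[0], vocab))
print(count_vector(docs[1], vocab))
print(frequency_vector(docs[1], vocab))
```

The same functions work for other units of text, such as sentences or paragraphs, as long as the vocabulary is built from the same kind of unit.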

NLP Project: Wikipedia Article Crawler & Classification - Corpus Transformation Pipeline

My NLP project downloads and processes Wikipedia articles and applies machine learning algorithms to them. In my last article, the project's outline was shown and its foundation established. First, a Wikipedia crawler object that searches articles by their name, extracts title, categories, content, and related pages, and stores each article as a plaintext file. Second, a corpus object that processes the complete set of articles, allows convenient access to individual files, and provides global data like the number of individual tokens.

NLP Project: Wikipedia Article Crawler & Classification - Corpus Reader

Natural Language Processing is a fascinating area of machine learning and artificial intelligence. This blog post starts a concrete NLP project about working with Wikipedia articles for clustering, classification, and knowledge extraction. The inspiration, and the general approach, stems from the book [Applied Text Analysis with Python](https://www.goodreads.com/book/show/32758032-applied-text-analysis-with-python).

Python NLP Library: Flair

Flair is a modern NLP library. From text processing to document semantics, all core NLP tasks are supported. Flair uses transformer neural network models for several tasks, and it incorporates other Python libraries, which makes it possible to choose specific models. Its clear API and data structures that annotate text, as well as its multi-language support, make it a good candidate for NLP projects.

Python NLP Library: Spacy

Spacy is a sophisticated NLP library that offers differently trained models for a variety of NLP tasks. From tokenization to part-of-speech tagging to entity recognition, Spacy produces well-designed Python data structures and powerful visualizations too. On top of that, different language models can be loaded and fine-tuned to accommodate NLP tasks in specific domains. Finally, Spacy provides a powerful pipeline object, which makes it easy to mix built-in and custom tokenizers, parsers, taggers, and other components to create language models that support all desired NLP tasks.

Python NLP Library: NLTK

NLTK is a sophisticated library. Continuously developed since 2009, it supports all classical NLP tasks, from tokenization, stemming, and part-of-speech tagging to semantic indexing and dependency parsing. It also has a rich set of additional features, such as built-in corpora, different models for its NLP tasks, and integration with SciKit Learn and other Python libraries.