NLP Project: Wikipedia Article Crawler & Classification - Corpus Reader
Natural Language Processing is a fascinating area of machine learning and artificial intelligence. This blog post starts a concrete NLP project about working with Wikipedia articles for clustering, classification, and knowledge extraction. The inspiration, and the general approach, stems from the book Applied Text Analysis with Python.
Over the course of the next articles, I will show how to implement a Wikipedia article crawler, how to collect articles into a corpus, how to apply text preprocessing, tokenization, encoding, and vectorization, and finally how to apply machine learning algorithms for clustering and classification.
The technical context of this article is Python v3.11 and several additional libraries, most importantly nltk v3.8.1 and wikipedia-api v0.6.0. All examples should work with newer versions too.
Project Outline
The project's goal is to download, process, and apply machine learning algorithms to Wikipedia articles. First, selected articles from Wikipedia are downloaded and stored. Second, a corpus is generated: the totality of all text documents. Third, each document's text is preprocessed, e.g. by removing stop words and symbols, and then tokenized. Fourth, the tokenized text is transformed into a vector to obtain a numerical representation. Finally, different machine learning algorithms are applied.
In this first article, steps one and two are explained.
Prerequisites
I like to work in a Jupyter Notebook and use the excellent dependency manager Poetry. Run the following commands in a project folder of your choice to install all required dependencies and to start the Jupyter notebook in your browser.
# Complete the interactive project creation
poetry init
# Add core dependencies
poetry add nltk@^3.8.1 jupyterlab@^4.0.0 scikit-learn@^1.2.2 wikipedia-api@^0.5.8 matplotlib@^3.7.1 numpy@^1.24.3 pandas@^2.0.1
# Download the required NLTK data
poetry run python3 -c "import nltk; \
nltk.download('punkt'); \
nltk.download('averaged_perceptron_tagger'); \
nltk.download('reuters'); \
nltk.download('stopwords');"
# Start JupyterLab
poetry run jupyter lab
A fresh Jupyter Notebook should open in your browser.
Python Libraries
In this blog post, the following Python libraries will be used:
- wikipedia-api: provides Page objects that represent Wikipedia articles with their title, text, categories, and related pages.
- NLTK: provides the PlaintextCorpusReader, a traversable object that gives access to documents, provides tokenization methods, and computes statistics about all files, as well as the sent_tokenizer and word_tokenizer for generating tokens (a short usage sketch follows this list).
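As a quick illustration of the tokenizers mentioned above, here is a minimal sketch using NLTK's sent_tokenize and word_tokenize functions, which rely on the punkt data downloaded during setup:
import nltk

sample = "Machine learning is a field of artificial intelligence. It studies algorithms that learn from data."

# Split the text into sentences, then each sentence into word tokens
sentences = nltk.sent_tokenize(sample)
tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

print(sentences)
# ['Machine learning is a field of artificial intelligence.', 'It studies algorithms that learn from data.']

print(tokens[0][:5])
# ['Machine', 'learning', 'is', 'a', 'field']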
Part 1: Wikipedia Article Crawler
The project starts with the creation of a custom Wikipedia crawler. Although we could work with Wikipedia corpus datasets from various sources, such as the built-in corpora in NLTK, the custom crawler provides the best control over the file format, the content, and how up to date the content is.
Downloading and processing raw HTML can be time consuming, especially when we also need to determine related links and categories from it. A very handy library comes to the rescue: wikipedia-api does all of this heavy lifting for us. Based on it, let's develop the core features in a stepwise manner.
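Before wrapping the library in a class, here is a minimal sketch of how wikipedia-api can be used on its own; the constructor call mirrors the one used in the WikipediaReader class below (newer library versions may additionally expect a user_agent argument):
import wikipediaapi as wiki_api

# Connect to the English Wikipedia and request plain-text extracts
wiki = wiki_api.Wikipedia(language='en', extract_format=wiki_api.ExtractFormat.WIKI)

page = wiki.page('Machine_learning')
if page.exists():
    print(page.title)            # the article title
    print(page.text[:100])       # the beginning of the plain-text content
    print(len(page.links))       # number of linked articles
    print(len(page.categories))  # number of categories, including meta-categories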
First, we create a base class that defines its own Wikipedia object and determines where to store the articles.
import os
import re
import wikipediaapi as wiki_api

class WikipediaReader():
    def __init__(self, dir = "articles"):
        self.pages = set()
        self.article_path = os.path.join("./", dir)
        self.wiki = wiki_api.Wikipedia(
            language = 'en',
            extract_format = wiki_api.ExtractFormat.WIKI)
        try:
            os.mkdir(self.article_path)
        except Exception as e:
            pass
This also defines pages, a set of page objects that the crawler has visited. This page object is tremendously helpful because it gives access to an article's title, text, categories, and links to other pages.
Second, we need a helper method that receives an article name and, if the article exists, adds a new page object to the set. We need to wrap the call in a try/except block because some articles containing special characters, such as Tomasz Imieliński, cannot be processed correctly. Also, there are several meta-articles that we do not need to store.
def add_article(self, article):
    try:
        page = self.wiki.page(self._get_page_title(article))
        if page.exists():
            self.pages.add(page)
            return page
    except Exception as e:
        print(e)
Third, we want to extract the categories of an article. Each Wikipedia article defines categories both in visible sections at the bottom of a page and in metadata that is not rendered as HTML. Therefore, the initial list of categories might look confusing. Take a look at this example:
wr = WikipediaReader()
wr.add_article("Machine Learning")
ml = wr.list().pop()
print(ml.categories)
# {'Category:All articles with unsourced statements': Category:All articles with unsourced statements (id: ??, ns: 14),
# 'Category:Articles with GND identifiers': Category:Articles with GND identifiers (id: ??, ns: 14),
# 'Category:Articles with J9U identifiers': Category:Articles with J9U identifiers (id: ??, ns: 14),
# 'Category:Articles with LCCN identifiers': Category:Articles with LCCN identifiers (id: ??, ns: 14),
# 'Category:Articles with NDL identifiers': Category:Articles with NDL identifiers (id: ??, ns: 14),
# 'Category:Articles with NKC identifiers': Category:Articles with NKC identifiers (id: ??, ns: 14),
# 'Category:Articles with short description': Category:Articles with short description (id: ??, ns: 14),
# 'Category:Articles with unsourced statements from May 2022': Category:Articles with unsourced statements from May 2022 (id: ??, ns: 14),
# 'Category:Commons category link from Wikidata': Category:Commons category link from Wikidata (id: ??, ns: 14),
# 'Category:Cybernetics': Category:Cybernetics (id: ??, ns: 14),
# 'Category:Learning': Category:Learning (id: ??, ns: 14),
# 'Category:Machine learning': Category:Machine learning (id: ??, ns: 14),
# 'Category:Short description is different from Wikidata': Category:Short description is different from Wikidata (id: ??, ns: 14),
# 'Category:Webarchive template wayback links': Category:Webarchive template wayback links (id: ??, ns: 14)}
Therefore, we do not store these special categories at all, but filter them out with several substring checks.
def get_categories(self, title):
    page = self.add_article(title)
    if page:
        if (list(page.categories.keys())) and (len(list(page.categories.keys())) > 0):
            categories = [c.replace('Category:', '').lower() for c in list(page.categories.keys())
                if c.lower().find('articles') == -1
                and c.lower().find('pages') == -1
                and c.lower().find('wikipedia') == -1
                and c.lower().find('cs1') == -1
                and c.lower().find('webarchive') == -1
                and c.lower().find('dmy dates') == -1
                and c.lower().find('short description') == -1
                and c.lower().find('commons category') == -1
            ]
            return dict.fromkeys(categories, 1)
    return {}
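Applying this method to the machine learning article keeps only the topical categories. Based on the category list shown above (which may change over time), the filtered result looks like this:
wr = WikipediaReader()
print(wr.get_categories("Machine Learning"))
# {'cybernetics': 1, 'learning': 1, 'machine learning': 1}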
Fourth, we now define the crawl method. It is a customizable breadth-first search that starts from an article, gets all related pages, adds these to the set of page objects, and then processes them again until either the total number of articles is reached or the depth level is exhausted. To be honest: I only ever crawled 1000 articles with it.
def crawl_pages(self, article, depth = 3, total_number = 1000):
    print(f'Crawl {total_number} :: {article}')
    page = self.add_article(article)
    childs = set()
    if page:
        # Add all directly linked articles until the total limit is reached
        for child in page.links.keys():
            if len(self.pages) < total_number:
                print(f'Add article {len(self.pages)}/{total_number} {child}')
                self.add_article(child)
                childs.add(child)
    depth -= 1
    if depth > 0:
        # Recurse into the collected child articles until the depth or the limit is reached
        for child in sorted(childs):
            if len(self.pages) < total_number:
                self.crawl_pages(child, depth, total_number)
Let's start crawling the machine learning articles:
reader = WikipediaReader()
reader.crawl_pages("Machine Learning")
print(reader.list())
# Crawl 1000 :: Machine Learning
# Add article 1/1000 AAAI Conference on Artificial Intelligence
# Add article 2/1000 ACM Computing Classification System
# Add article 3/1000 ACM Computing Surveys
# Add article 4/1000 ADALINE
# Add article 5/1000 AI boom
# Add article 6/1000 AI control problem
# Add article 7/1000 AI safety
# Add article 8/1000 AI takeover
# Add article 9/1000 AI winter
Finally, when a set of page objects is available, we extract their text content and store it in files whose names are cleaned-up versions of the article titles. A caveat: file names need to retain the capitalization of the article name, or else we cannot get the page object again, because a search with a lower-cased article name does not return results.
def process(self, update=False):
    for page in self.pages:
        filename = re.sub(r'\s+', '_', f'{page.title}')
        filename = re.sub(r'[\(\):]', '', filename)
        file_path = os.path.join(self.article_path, f'{filename}.txt')
        if update or not os.path.exists(file_path):
            print(f'Downloading {page.title} ...')
            content = page.text
            with open(file_path, 'w') as file:
                file.write(content)
        else:
            print(f'Not updating {page.title} ...')
Here is the complete source code of the WikipediaReader class.
import os
import re
import wikipediaapi as wiki_api

class WikipediaReader():
    def __init__(self, dir = "articles"):
        self.pages = set()
        self.article_path = os.path.join("./", dir)
        self.wiki = wiki_api.Wikipedia(
            language = 'en',
            extract_format = wiki_api.ExtractFormat.WIKI)
        try:
            os.mkdir(self.article_path)
        except Exception as e:
            pass

    def _get_page_title(self, article):
        return re.sub(r'\s+', '_', article)

    def add_article(self, article):
        try:
            page = self.wiki.page(self._get_page_title(article))
            if page.exists():
                self.pages.add(page)
                return page
        except Exception as e:
            print(e)

    def list(self):
        return self.pages

    def process(self, update=False):
        for page in self.pages:
            filename = re.sub(r'\s+', '_', f'{page.title}')
            filename = re.sub(r'[\(\):]', '', filename)
            file_path = os.path.join(self.article_path, f'{filename}.txt')
            if update or not os.path.exists(file_path):
                print(f'Downloading {page.title} ...')
                content = page.text
                with open(file_path, 'w') as file:
                    file.write(content)
            else:
                print(f'Not updating {page.title} ...')

    def crawl_pages(self, article, depth = 3, total_number = 1000):
        print(f'Crawl {total_number} :: {article}')
        page = self.add_article(article)
        childs = set()
        if page:
            # Add all directly linked articles until the total limit is reached
            for child in page.links.keys():
                if len(self.pages) < total_number:
                    print(f'Add article {len(self.pages)}/{total_number} {child}')
                    self.add_article(child)
                    childs.add(child)
        depth -= 1
        if depth > 0:
            # Recurse into the collected child articles until the depth or the limit is reached
            for child in sorted(childs):
                if len(self.pages) < total_number:
                    self.crawl_pages(child, depth, total_number)

    def get_categories(self, title):
        page = self.add_article(title)
        if page:
            if (list(page.categories.keys())) and (len(list(page.categories.keys())) > 0):
                categories = [c.replace('Category:', '').lower() for c in list(page.categories.keys())
                    if c.lower().find('articles') == -1
                    and c.lower().find('pages') == -1
                    and c.lower().find('wikipedia') == -1
                    and c.lower().find('cs1') == -1
                    and c.lower().find('webarchive') == -1
                    and c.lower().find('dmy dates') == -1
                    and c.lower().find('short description') == -1
                    and c.lower().find('commons category') == -1
                ]
                return dict.fromkeys(categories, 1)
        return {}
Let’s use the Wikipedia crawler to download articles related to machine learning.
reader = WikipediaReader()
reader.crawl_pages("Machine Learning")
reader.process()
# Downloading The Register ...
# Not updating Bank ...
# Not updating Boosting (machine learning) ...
# Not updating Ian Goodfellow ...
# Downloading Statistical model ...
# Not updating Self-driving car ...
# Not updating Behaviorism ...
# Not updating Statistical classification ...
# Downloading Search algorithm ...
# Downloading Support vector machine ...
# Not updating Deep learning speech synthesis ...
# Not updating Expert system ...
Part 2: Wikipedia Corpus
All articles are downloaded as text files into the articles folder. To provide an abstraction over all these individual files, the NLTK library provides different corpus reader objects. Such an object not only provides quick access to individual files, but can also generate statistical information, such as the vocabulary, the total number of individual tokens, or the document with the most words.
Let's use the PlaintextCorpusReader class as a starting point and just initialize it so that it points to the articles:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from time import time

class WikipediaCorpus(PlaintextCorpusReader):
    pass

corpus = WikipediaCorpus('articles', r'.*\.txt')
print(corpus.fileids())
# ['2001_A_Space_Odyssey.txt',
# '2001_A_Space_Odyssey_film.txt',
# '2001_A_Space_Odyssey_novel.txt',
# '3D_optical_data_storage.txt',
# 'A*_search_algorithm.txt',
# 'A.I._Artificial_Intelligence.txt',
# 'AAAI_Conference_on_Artificial_Intelligence.txt',
# 'ACM_Computing_Classification_System.txt',
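The built-in methods of the corpus reader already give access to the content of individual files, for example the raw text, the word tokens, and the sentences. Here is a small sketch using the first file id from the listing above:
# Pick a single document via its file id
doc = corpus.fileids()[0]

print(len(corpus.raw(doc)))    # length of the raw text in characters
print(corpus.words(doc)[:10])  # the first word tokens of the document
print(len(corpus.sents(doc)))  # number of sentences in the document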
OK, this is good enough. Let's extend it with two methods to compute the vocabulary and the maximum number of words. For the vocabulary, we will use the NLTK helper class FreqDist, a dictionary-like object that counts all word occurrences. The vocab method feeds it all texts via the helper self.words(), in which all characters other than letters, digits, and basic punctuation are removed.
def vocab(self):
    return nltk.FreqDist(re.sub(r'[^A-Za-z0-9,;\.]+', ' ', word).lower() for word in self.words())
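Since FreqDist behaves like a dictionary of counts, the vocabulary can be inspected directly once the method is part of the WikipediaCorpus class (as in the full listing below), for example to list the most frequent tokens:
vocab = corpus.vocab()

print(len(vocab))            # number of distinct tokens in the corpus
print(vocab.most_common(5))  # the most frequent tokens and their counts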
To get the maximum number of words in any document, we traverse all documents with fileids(), determine the length of words(doc), and record the highest value:
def max_words(self):
    max_len = 0
    for doc in self.fileids():
        l = len(self.words(doc))
        max_len = l if l > max_len else max_len
    return max_len
Finally, let's add a describe method for generating statistical information (this idea also stems from the above-mentioned book Applied Text Analysis with Python).
This method starts a timer to record how long the corpus processing takes, and then uses the built-in methods of the corpus reader object and the methods just created to compute the number of files, paragraphs, sentences, and words, the vocabulary size, and the maximum number of words inside a document.
def describe(self, fileids=None, categories=None):
    started = time()
    return {
        'files': len(self.fileids()),
        'paras': len(self.paras()),
        'sents': len(self.sents()),
        'words': len(self.words()),
        'vocab': len(self.vocab()),
        'max_words': self.max_words(),
        'time': time() - started
    }
Here is the final WikipediaCorpus class:
import re
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from time import time

class WikipediaCorpus(PlaintextCorpusReader):
    def vocab(self):
        return nltk.FreqDist(re.sub(r'[^A-Za-z0-9,;\.]+', ' ', word).lower() for word in self.words())

    def max_words(self):
        max_len = 0
        for doc in self.fileids():
            l = len(self.words(doc))
            max_len = l if l > max_len else max_len
        return max_len

    def describe(self, fileids=None, categories=None):
        started = time()
        return {
            'files': len(self.fileids()),
            'paras': len(self.paras()),
            'sents': len(self.sents()),
            'words': len(self.words()),
            'vocab': len(self.vocab()),
            'max_words': self.max_words(),
            'time': time() - started
        }
At the time of writing, after crawling Wikipedia articles about artificial intelligence and machine learning, the following statistics could be obtained:
corpus = WikipediaCorpus('articles', r'.*\.txt')
corpus.describe()
{'files': 1163,
'paras': 96049,
'sents': 238961,
'words': 4665118,
'vocab': 92367,
'max_words': 46528,
'time': 32.60307598114014}
Conclusion
This article is the starting point of an NLP project to download, process, and apply machine learning algorithms to Wikipedia articles. Two aspects were covered. First, the creation of the WikipediaReader class, which finds articles by name and can extract their title, content, categories, and mentioned links. The crawler is controlled with two variables: the total number of crawled articles and the crawl depth. Second, the WikipediaCorpus, an extension of the NLTK PlaintextCorpusReader. This object provides convenient access to individual files, sentences, and words, as well as corpus-wide data like the number of files or the vocabulary, i.e. the number of unique tokens. The next article continues with building a text processing pipeline.