Building a Question-Answer System with Machine Learning Algorithms

For a very long time, I have been reading articles, papers, and lots of blog posts about artificial intelligence and machine learning. The recent advances in neural networks are especially fascinating, such as the GPT-3.5 model, which produces human-like text. To understand the state of the art in natural language processing with neural networks, I want to design a question-answer system that parses, understands, and answers a question about a specific topic, for example the content of a book. With this far-reaching goal in mind, I started this blog series to cover all relevant knowledge and engineering areas: machine learning, natural language processing, expert systems, and neural networks.

This article briefly summarizes the main features of a question-answer system as I envision it and explains the knowledge areas that could contribute to a solution. As we will see, however, the system will most likely need to combine approaches and programs from several of these domains to fulfill the requirements. So, let’s start this journey.

Requirements for a Question-Answer System

The question-answer system aims to understand a given question, infer its meaning, investigate the available data, and provide a natural-language answer.

The given question is processed in several stages. First, the system decides whether the question is syntactically and semantically correct. If not, it will ask the user for refinements. Second, the question is transformed into a knowledge representation. Third, this knowledge representation and the question intent are used to query an internal knowledge database. Finally, an answer is formulated, regardless of whether the question intent can be answered with the available data.
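The four stages can be sketched as a tiny pipeline. This is only an illustration under strong assumptions: the names `KNOWLEDGE`, `Query`, `is_well_formed`, `to_query`, and `answer` are hypothetical, the "knowledge database" is a one-entry dictionary, and the parsing stage recognizes a single question pattern. A real system would replace each stand-in with an actual NLP or knowledge-base component.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy knowledge database: (subject, relation) -> object.
KNOWLEDGE = {("moby dick", "author"): "Herman Melville"}


@dataclass
class Query:
    """Minimal stand-in for a knowledge representation of a question."""
    subject: str
    relation: str


def is_well_formed(question: str) -> bool:
    # Stage 1 (stand-in): accept only non-empty text that ends with "?".
    text = question.strip()
    return len(text) > 1 and text.endswith("?")


def to_query(question: str) -> Optional[Query]:
    # Stage 2 (stand-in): recognize one pattern, "Who wrote <title>?".
    text = question.strip().rstrip("?").lower()
    prefix = "who wrote "
    if text.startswith(prefix):
        return Query(subject=text[len(prefix):].strip(), relation="author")
    return None


def answer(question: str) -> str:
    if not is_well_formed(question):
        return "Please rephrase your question."
    query = to_query(question)
    if query is None:
        return "Please rephrase your question."
    # Stage 3: query the knowledge database with the extracted intent.
    fact = KNOWLEDGE.get((query.subject, query.relation))
    # Stage 4: formulate an answer whether or not the data suffices.
    if fact is None:
        return "I cannot answer that with the available data."
    return f"{fact} wrote {query.subject.title()}."


print(answer("Who wrote Moby Dick?"))
```

Note that every path through `answer` returns a sentence, which mirrors the requirement that the system always responds, even when it has to ask for a refinement or admit missing data.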

The system needs to be grounded in a particular knowledge area. At the time of writing, a very appealing idea is that the system understands books from a public source, e.g. from the Project Gutenberg website. Once the system is loaded, the user selects one of the processed books, and a terminal opens. In this terminal, the user enters a natural-language question.

Now, let’s consider the relevant technology areas with which to construct such a system.

Technologies for a Question-Answer System

At the time of writing this article, I assume the following knowledge areas to be most relevant for the design and implementation of the question-answer system. I’m very sure that my understanding will change over time, both as I get more familiar with the knowledge areas and as I explore how to build a concrete system. That being said, let’s consider the areas individually.

Machine Learning

Machine learning is concerned with designing algorithms that process structured or unstructured data in order to recognize patterns and features and to make predictions. The kind of prediction depends on the task. Regression computes a continuous value, e.g. forecasting temperatures. Classification assigns data to one of several predefined categories, e.g. deciding whether a text has a positive or a negative sentiment. Feature learning discovers attributes that describe the data, which may be mutually exclusive or combinable, e.g. recognizing different shapes inside a picture.
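To make the difference between regression and classification concrete, here is a minimal sketch using only the Python standard library. All data is made up for illustration, and the helper names `fit_line`, `train`, and `classify` are hypothetical. The classifier is a deliberately naive word-count vote, not a method this series commits to.

```python
from collections import Counter
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Regression: predict a continuous value (made-up temperatures per day).
days = [1, 2, 3, 4]
temps = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(days, temps)
forecast_day_5 = slope * 5 + intercept

# Classification: predict one label from a fixed set (made-up examples).
TRAIN = [
    ("a great and fascinating book", "positive"),
    ("good story great ending", "positive"),
    ("a boring and awful book", "negative"),
    ("bad story boring ending", "negative"),
]


def train(examples):
    # Count how often each word occurs under each label.
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def classify(counts, text):
    # Pick the label whose training texts share the most words with `text`.
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)


sentiment_model = train(TRAIN)
print(forecast_day_5)
print(classify(sentiment_model, "a fascinating story"))
```

The regression output is a number on a continuous scale, while the classifier can only ever return one of the labels it saw during training.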

It is crucial to distinguish machine learning from artificial intelligence. Following the Wikipedia articles on both topics, artificial intelligence is the original computer science discipline that researches how intelligence can be built or represented by machines. This includes hardware design and software solutions like algorithms and programs. AI research is a general discipline that develops sophisticated models to represent behavior, knowledge, reasoning, and inference. Machine learning is a specific subbranch concerned with finding patterns in data and making predictions.

A machine learning discipline that has produced phenomenal progress is deep learning, the application of artificial neural networks. The decisive advantage of neural networks over other machine learning approaches is their ability to learn representations and patterns in the data by themselves. In many application areas, especially image classification, text classification, and machine translation, neural networks have become the dominant approach.

Natural Language Processing

Natural Language Processing, or NLP for short, is a computer science discipline that dates back to the 1950s. Following the explanation on Wikipedia, three different eras can be distinguished. The symbolic era tried to create complex rule systems that govern language modelling. Then, the statistical era emerged when computers became more widespread, which resulted in bigger data sets and the discovery of machine learning algorithms that could process language. Finally, the current era is called neural NLP, which is rooted in the aforementioned deep learning. This approach achieved state-of-the-art performance in areas such as machine translation and language modelling, that is, applying statistics to find the most plausible sequence of words that represents a correct sentence in a language.
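The idea of language modelling, scoring how plausible a sequence of words is, can be illustrated with a bigram model. This is a sketch of the older statistical approach rather than of neural NLP, and it assumes a made-up three-sentence corpus, whitespace tokenization, and no smoothing. The names `CORPUS`, `train_bigrams`, and `sentence_probability` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up miniature corpus, purely for illustration.
CORPUS = ["the whale swims", "the whale dives", "the ship sails"]


def tokenize(sentence):
    # Add start and end markers so sentence boundaries are modelled too.
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows


def sentence_probability(follows, sentence):
    """Multiply the estimated probabilities P(next word | previous word)."""
    tokens = tokenize(sentence)
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(follows[prev].values())
        if total == 0:
            return 0.0  # the previous word never occurred in the corpus
        prob *= follows[prev][nxt] / total
    return prob


model = train_bigrams(CORPUS)
print(sentence_probability(model, "the whale swims"))
print(sentence_probability(model, "swims the whale"))
```

A word order seen in the corpus receives a positive probability, while a scrambled order receives zero. Without smoothing, any unseen word pair zeroes out the whole sentence, which is one of the limitations that later approaches address.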

Semantic Web

At the beginning of the internet era, it became apparent that, over time, huge amounts of data would be accessible for automatic processing. Yet how this content should be structured for human consumption and for easy digestion by algorithms was, and still is, an open question. The HTML standard defines the layout of documents, but the text inside, with its appearance on the screen and its meaning, is only accessible to human observers. To cope with this, early proposals for the semantic web were about structuring data with the help of XML documents. Their structure is machine-readable, and by adding grammar-like rules in the form of schemas, the meaning could be made accessible. The semantic web then progressed toward forming ontologies, graph structures that represent knowledge areas by describing entities, attributes, and their relationships. With the help of graph query languages, concrete knowledge can be obtained from these. The semantic web is an active research area that has created several standards over the years, such as OWL, the Web Ontology Language for describing knowledge, and SPARQL, a query language.
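A knowledge graph of entities, attributes, and relationships can be approximated with subject-predicate-object triples. The following sketch is plain Python, not OWL or SPARQL: the `TRIPLES` data, the predicate names, and the `match` helper are hypothetical and only imitate the basic idea of querying a graph with patterns that contain wildcards.

```python
# Hypothetical knowledge graph as (subject, predicate, object) triples.
TRIPLES = {
    ("Moby Dick", "type", "Book"),
    ("Moby Dick", "author", "Herman Melville"),
    ("Herman Melville", "type", "Person"),
    ("Ishmael", "appearsIn", "Moby Dick"),
}


def match(pattern, triples=TRIPLES):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    s, p, o = pattern
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Who is the author of which entity?"
print(match((None, "author", None)))

# "What is known about Moby Dick?"
print(match(("Moby Dick", None, None)))
```

A real graph query language adds variables that are shared across several patterns, so that results from one pattern constrain the next. This sketch only matches a single pattern at a time.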

Project Overview and Next Steps

To start the project, I will first learn about the knowledge areas in more detail to understand how they can contribute concretely. Three development phases follow, in which the system will be built iteratively. The coarse structure is as follows:

Machine Learning Essentials: From my current point of view, machine learning is a very successful approach to process unstructured data, find hidden patterns, and produce correct predictions. Therefore, the first part of the project will explore the current landscape of machine learning as well as concrete algorithms. Given the tremendous success of neural networks, I plan to focus on this area and start with concrete examples to see how far they can contribute to a QA system.
Natural Language Processing: In this area, I expect insights into the abstract representation of languages and want to learn about algorithms that help to extract meaning from concrete texts. To be precise, I want to understand how language itself, and how knowledge formulated with language, is represented. Overall, I expect this area to show how to design the interface to the QA system, that is, recognizing what is being asked (or if something is asked at all!) and formulating an answer.
Semantic Web: The final research area should provide insight into how machine-readable knowledge representations are formulated. The particular question is how generic such a representation can be while still representing texts with various contents. Also, given a generalizable yet specific knowledge base, how can new information or knowledge be generated? For the question-answer system design, this area should provide insight into how to design the knowledge database that is queried to find an answer.

With the three knowledge areas completed, I will begin the concrete system design.

QA System Explorative Prototype: The first prototype targets the core of the QA system: data processing, information and knowledge extraction, and conceptualizing and querying the knowledge representation. Building on the research phases above, I might already have a very concrete architecture at hand and can see which technologies and algorithms contribute to the system design. The prototype also needs to make hard restrictions, e.g. only processing texts of a specific kind.
QA System Interface Design: The second phase will build a prototype that provides a convenient interface to ask a question and get an answer. This will be either a command line interface running in a shell or a web application that needs to be connected to the core component.
QA System Mark 1: With the interface in place, I want to publish the system and allow user interaction. Providing such a system is the initial goal of my project, and from here, new versions and features might evolve. It would be great to allow users to upload their own texts, process the data in a timely manner, and then start new interactive QA sessions. But I can only speculate how far the system will evolve.

Conclusion

I will start a new project about machine learning, natural language processing, and knowledge representation. This article gave a brief overview of the system I want to design and outlined the project phases. In particular, the system should be an interactive question-answer system that is initialized with pre-selected texts. From these, the system should generate an abstract knowledge representation. Users can ask questions, and the system infers knowledge and synthesizes an answer, e.g. by quoting relevant parts of the book or giving a well-formulated answer. The project consists of six phases. The first three phases will investigate the state of the art in the relevant scientific fields of machine learning, natural language processing, and the semantic web. The three following phases will create a prototype, an interface, and the final system. I’m looking forward to this challenge and what can be learned from it.