Project Overview
Knowledge base question answering (KB-QA) has attracted considerable interest and has been studied extensively in both academia and industry. The goal of KB-QA is to construct a question-answering (QA) system that parses a question posed by the user and leverages its supporting knowledge base (KB) to answer it. The performance of a QA system therefore relies heavily on the correctness, completeness and structuredness of its underlying KB.
A knowledge base can be categorized as either a curated KB or an open KB according to the way it is constructed. Curated KBs are commonly abstracted as semantic networks composed manually by their community members, while open KBs store semantic assertions, often triples of the form ⟨subject, relation, object⟩, derived directly from unstructured online resources using open information extraction (OIE) techniques. In general, researchers favor curated KBs for building KB-QA systems by virtue of their high-precision, low-redundancy knowledge. Most KB-QA systems today are based on curated KBs such as Freebase and DBpedia. Nevertheless, given the sheer volume of collective human knowledge, manual maintenance of massive curated KBs can be prohibitively expensive and difficult to scale. QA systems built on curated KBs inevitably suffer from low recall because the factual knowledge they hold is inherently incomplete.
Some researchers have turned to open KBs as a more comprehensive source of knowledge, but the performance (accuracy, efficiency and robustness) of open KBs on question answering is significantly limited by a common problem called entity ambiguity. Entity ambiguity refers to the problem of linking and grouping the different surface forms (names) of the same real-world entity, as well as identifying which of several entities a given surface form refers to. Because an entity commonly has several different names, and a name commonly refers to several different entities, an open KB contains a tremendous amount of noisy, inconsistent and redundant information. As a result, it can be extremely difficult to match the correct assertion in an open KB without performing thorough entity disambiguation. This process of disambiguating named entities is called entity resolution, and it is essential for an open KB to be usable for QA.
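To make the two sides of entity ambiguity concrete, the following minimal sketch uses a handful of made-up assertions. The triples, the entity identifiers (E_barack, E_michelle) and the hand-labelled name-to-entity mapping are all illustrative assumptions, not data from any real open KB.

```python
# Toy open-KB assertions (hypothetical, for illustration only).
triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "was born in", "Honolulu"),
    ("President Obama", "is married to", "Michelle Obama"),
    ("Michelle Obama", "was born in", "Chicago"),
]

# Hand-labelled mapping from a surface name to the entities it may denote.
# In a real open KB this mapping is unknown; recovering it is the task.
name_to_entities = {
    "Barack Obama": {"E_barack"},
    "President Obama": {"E_barack"},
    "Obama": {"E_barack", "E_michelle"},
    "Michelle Obama": {"E_michelle"},
}


def names_of(entity_id):
    """All surface names of one entity (one entity, many names)."""
    return sorted(n for n, es in name_to_entities.items() if entity_id in es)


def ambiguous_names():
    """Surface names that may denote more than one entity (one name, many entities)."""
    return sorted(n for n, es in name_to_entities.items() if len(es) > 1)


def exact_match(subject, relation):
    """Naive lookup that compares surface strings only."""
    return [o for s, r, o in triples if s == subject and r == relation]


print(names_of("E_barack"))
print(ambiguous_names())
print(exact_match("President Obama", "was born in"))
```

The last lookup returns an empty list even though the birthplace is stored twice under other names, which is the kind of mismatch that entity resolution is meant to remove.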
Entity resolution can make an open KB more precise, structured and complete, and thus better suited to QA tasks. This project investigates how entity resolution can be done by canonicalization (i.e., mapping each name to a canonical form) and how canonicalization enables more effective question answering through redundancy detection, KB compression and knowledge population. In this project plan, we present our agenda for building a new QA system named COQA, short for canonicalized open question answering, which is built upon a canonicalized open KB.
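As a rough illustration of the idea, and not the COQA design itself, the sketch below canonicalizes a toy open KB with a hand-written alias table. The table ALIASES, the sample triples and the helper names are assumptions made for this example; in practice the mapping would have to be learned or inferred rather than written by hand.

```python
# Hypothetical alias table: surface name -> canonical form.
ALIASES = {
    "NYC": "New York City",
    "The Big Apple": "New York City",
    "USA": "United States",
}

# Toy open-KB assertions (made up for illustration).
RAW_TRIPLES = [
    ("NYC", "is located in", "USA"),
    ("New York City", "is located in", "United States"),
    ("New York City", "has borough", "Brooklyn"),
    ("The Big Apple", "has borough", "Brooklyn"),
    ("NYC", "has borough", "Queens"),
]


def canonical(name):
    """Map a name to its canonical form; unknown names are left unchanged."""
    return ALIASES.get(name, name)


def canonicalize(triples):
    """Rewrite subjects and objects, dropping assertions that become duplicates."""
    seen = set()
    result = []
    for subj, rel, obj in triples:
        triple = (canonical(subj), rel, canonical(obj))
        if triple not in seen:  # redundancy detection
            seen.add(triple)
            result.append(triple)
    return result


def answer(kb, subject, relation):
    """Look up objects after canonicalizing the subject named in the question."""
    subject = canonical(subject)
    return [o for s, r, o in kb if s == subject and r == relation]


kb = canonicalize(RAW_TRIPLES)
print(len(RAW_TRIPLES), "->", len(kb))
print(answer(kb, "NYC", "has borough"))
```

In this toy case the five raw assertions compress to three, and facts that were originally scattered across different names become reachable from any alias of the entity.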