Feasibility analysis

Despite the progress already made in KB-QA and canonicalization, several challenging problems remain unsolved, many of which were discussed in Section 3. Question-answering on a canonicalized KB is also a novel topic for which we found no previous implementations, so the project faces both theoretical and practical challenges. The main challenges are summarized below.

Risks

Time taken to process large KBs

From entity clustering to training ranking models, a large knowledge base with millions of assertions puts severe pressure on running time and may render theoretically effective algorithms impractical. Fortunately, a recently proposed algorithm achieves orders-of-magnitude speedups for entity clustering, shortening the canonicalization process from months to minutes. To handle massive open KBs in practice, our team will need to devise similarly efficient techniques for every algorithm that processes a large KB.
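
One common way to avoid comparing every pair of noun phrases is blocking: only phrases that share a token are compared at all. The sketch below illustrates that general idea and is not the algorithm referred to above. The function names, the token-overlap (Jaccard) similarity, and the 0.5 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(phrases):
    """Map each lowercase token to the set of phrases that contain it."""
    blocks = defaultdict(set)
    for phrase in phrases:
        for token in phrase.lower().split():
            blocks[token].add(phrase)
    return blocks


def jaccard(a, b):
    """Token-level Jaccard similarity between two phrases."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def cluster(phrases, threshold=0.5):
    """Cluster phrases, comparing only pairs that share at least one token."""
    parent = {p: p for p in phrases}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for block in token_blocks(phrases).values():
        for a, b in combinations(sorted(block), 2):
            if jaccard(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in phrases:
        groups[find(p)].add(p)
    return list(groups.values())


phrases = ["Barack Obama", "Obama", "President Barack Obama",
           "New York", "New York City"]
clusters = cluster(phrases)
```

Pairs with no token in common, such as "Obama" and "New York", are never compared, which is where the savings come from. A very frequent token can still produce a large block, so a real system would need additional pruning.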

Ground truth for training ranking model

To train the ranking model that selects the answer from candidate assertions, we need a collection of question-answer pairs as labeled training data. Since such datasets are not readily available for open KBs, we may tackle the problem in two ways. The first is to transform and leverage question-answer datasets that are based on curated KBs; SimpleQuestions and WebQuestions are two commonly used KB-QA benchmarks. The second is to apply learning methods to open KBs to generate natural language questions from facts, so that we can produce large-scale question-answer corpora for a specific KB. For training the individual component models, we can refer to the documentation of existing OQA systems.
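
To make the second option concrete, the sketch below turns one open-KB assertion into two question-answer pairs. It uses fixed hand-written templates as a stand-in for the learned question generator described above, so it shows only the shape of the data. The `Assertion` and `QAPair` types and the template wording are assumptions.

```python
from typing import List, NamedTuple


class Assertion(NamedTuple):
    subject: str
    relation: str
    obj: str


class QAPair(NamedTuple):
    question: str
    answer: str


def generate_qa_pairs(assertion: Assertion) -> List[QAPair]:
    """Produce one question per argument slot of the assertion.

    Hypothetical templates: a learned generator would replace these
    with fluent, varied questions.
    """
    s, r, o = assertion
    return [
        # Hide the subject and ask for it.
        QAPair(question=f"Who or what {r} {o}?", answer=s),
        # Hide the object and ask for it.
        QAPair(question=f"{s} {r} what?", answer=o),
    ]


pairs = generate_qa_pairs(Assertion("Barack Obama", "was born in", "Honolulu"))
```

Each generated pair keeps a pointer back to the assertion it came from, which gives the ranking model a known correct candidate to train against.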

Implementation of automatic populater

An automatic populater links assertions together and thereby discovers new relations. A successfully canonicalized KB groups all mentions of the same entity into the same cluster. These clustered entities may then be related to one another to form new relations. On top of that, we hope to apply natural language processing to the clustered assertions, to understand their semantics and derive new relationships from them. Several NLP toolkits, such as Stanford CoreNLP and AllenNLP, are available for us to build this reasoning step on.
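
A minimal sketch of the linking step follows, assuming canonicalization has already produced a mapping from surface forms to canonical entity names. The names `canonicalize`, `link_assertions`, and `canonical_of` are hypothetical, and the two-hop join is only one simple way to propose candidate relations. It does not perform the semantic analysis described above.

```python
from collections import defaultdict


def canonicalize(assertions, canonical_of):
    """Rewrite subjects and objects to their canonical entity names."""
    return {
        (canonical_of.get(s, s), r, canonical_of.get(o, o))
        for s, r, o in assertions
    }


def link_assertions(assertions):
    """Join assertions whose object is the subject of another assertion.

    Returns candidate triples (subject, (relation1, relation2), object).
    These are unverified candidates, not confirmed facts.
    """
    by_subject = defaultdict(list)
    for s, r, o in assertions:
        by_subject[s].append((r, o))

    candidates = set()
    for s, r1, middle in assertions:
        for r2, o in by_subject.get(middle, []):
            if o != s:  # skip trivial round trips
                candidates.add((s, (r1, r2), o))
    return candidates


raw = [
    ("Obama", "was born in", "Honolulu"),
    ("Honolulu, Hawaii", "is a city in", "United States"),
]
canonical_of = {"Obama": "Barack Obama", "Honolulu, Hawaii": "Honolulu"}

linked_raw = link_assertions(set(raw))
linked_canonical = link_assertions(canonicalize(raw, canonical_of))
```

On the raw assertions the join finds nothing, because "Honolulu" and "Honolulu, Hawaii" do not match as strings. After canonicalization the two assertions connect, which is the effect the populater relies on.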

Accuracy-efficiency trade-off

Canonicalization makes question-answering more efficient by compressing the knowledge base. However, as demonstrated in Section 3, in removing redundant data canonicalization may also discard valuable information and hurt the accuracy of question-answering tasks. Accuracy and efficiency therefore need to be balanced carefully to achieve satisfactory overall performance.
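
One way to make the efficiency side of this trade-off measurable is to report how much a given clustering shrinks the KB and how many originally distinct assertions it folds together. The sketch below is an illustrative assumption and not a metric defined in this proposal. The function name and the `canonical_of` mapping are hypothetical.

```python
def compression_report(assertions, canonical_of):
    """Return (compression ratio, number of assertions collapsed).

    ratio     = canonical assertions / original distinct assertions
    collapsed = distinct original assertions merged into another one
    """
    original = set(assertions)
    merged = {}
    for s, r, o in original:
        key = (canonical_of.get(s, s), r, canonical_of.get(o, o))
        merged.setdefault(key, set()).add((s, r, o))

    ratio = len(merged) / len(original) if original else 1.0
    collapsed = sum(len(group) - 1 for group in merged.values())
    return ratio, collapsed


assertions = [
    ("Obama", "was born in", "Honolulu"),
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "visited", "Paris"),
]
ratio, collapsed = compression_report(assertions, {"Obama": "Barack Obama"})
```

A lower ratio means a smaller KB to search. Whether the collapsed assertions were true duplicates or distinct facts merged by mistake still has to be checked against question-answering accuracy.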

Limitations

Given the inexhaustible variety of natural language and knowledge representations, robustness is an inherently difficult challenge: a slight change in wording or query formulation can easily confuse the system. Many researchers have developed techniques such as question decomposition and rule mining to improve robustness. In this project, high robustness will not be our primary goal until we approach the final stage of development.

Overall assessment

Since both KB-QA and canonicalization are well-researched areas, there are quite a few existing projects with comprehensive implementation details to refer to. These projects will significantly ease the setup of our preliminary architecture and provide performance benchmarks for subsequent implementations. Hence, we do not expect building a working prototype of COQA to be particularly difficult.

In theory, canonicalization alone, if implemented correctly, could give a substantial boost to the efficiency (and accuracy) of question-answering. Solving the theoretical and practical problems described above would yield further improvements in the performance of our product. These challenges correspond to mutually independent components, each of which could readily be replaced with a "baseline approach" if needed. They also appear addressable with the methodologies described above, which supports a considerable level of feasibility for our product.
