Problem Statement

The ultimate objective of our project is to improve the performance of QA systems by using a canonicalized KB. Across the major steps involved in building COQA, our team aims to overcome the following five hurdles.

Accuracy and efficiency of entity clustering


Entity clustering, the first step of canonicalization, requires a pairwise function that quantifies the similarity between two candidate entities. Intuitively, a strong similarity function should reflect both string similarity and contextual similarity. The current state-of-the-art similarity function, however, is based purely on word frequency and string matching. Given the ubiquity of synonyms with dissimilar spellings (e.g. "New York City" and "Big Apple", "Xiamen" and "Amoy"), it is reasonable to believe that similarity functions with more sophisticated context-reading capabilities could boost the accuracy of entity clustering. Efficiency is another major concern, as the clustering process has proven to be extremely time-consuming.
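The idea of combining string similarity with contextual similarity can be sketched as follows. This is a minimal illustration and not the similarity function the project will adopt: the function names, the weight alpha, and the use of Jaccard overlap over (relation, object) pairs as the "context" are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_similarity(ctx_a: set, ctx_b: set) -> float:
    """Jaccard overlap of the contexts in which two phrases appear.

    A context here is assumed to be a set of (relation, object) pairs
    taken from the assertions that mention the phrase.
    """
    union = ctx_a | ctx_b
    if not union:
        return 0.0
    return len(ctx_a & ctx_b) / len(union)


def combined_similarity(a: str, b: str, contexts: dict, alpha: float = 0.5) -> float:
    """Weighted mix of string and contextual similarity (alpha is hypothetical)."""
    s = string_similarity(a, b)
    c = context_similarity(contexts.get(a, set()), contexts.get(b, set()))
    return alpha * s + (1 - alpha) * c


# Illustrative contexts only; a real system would extract these from the KB.
contexts = {
    "Xiamen": {("is a city in", "Fujian"), ("is located in", "China")},
    "Amoy": {("is a city in", "Fujian"), ("is located in", "China")},
    "Xian": {("is a city in", "Shaanxi"), ("is located in", "China")},
}

for other in ("Amoy", "Xian"):
    print(other, round(combined_similarity("Xiamen", other, contexts), 3))
```

With these toy contexts, "Amoy" is spelled less like "Xiamen" than "Xian" is, yet it scores higher once shared context is taken into account, which is the behaviour a purely string-based function cannot provide.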

Information loss due to canonicalization


Given a cluster of synonymous phrases, the state-of-the-art canonicalization process either selects a representative phrase to replace the other phrases in the cluster or uses a coding scheme to assign every entity a unique code. This suffices when the sole purpose is deduplication or KB compression, but it can substantially damage the completeness of the KB and thereby compromise the accuracy of the QA system it supports. The canonicalization process has to be redesigned with question-answering objectives in mind.
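To make the information-loss concern concrete, the sketch below picks a representative phrase for each cluster but also keeps the discarded surface forms in an alias index, so a question phrased with any synonym can still be resolved. Retaining aliases is only one possible mitigation, assumed here for illustration; the function name, the mention counts, and the tie-breaking rule are hypothetical.

```python
def canonicalize(clusters, mention_counts):
    """Choose one representative per cluster while retaining every alias.

    clusters: iterable of sets of synonymous phrases.
    mention_counts: phrase -> number of occurrences (hypothetical statistic).
    Returns (alias_to_canonical, canonical_to_aliases).
    """
    alias_to_canonical = {}
    canonical_to_aliases = {}
    for cluster in clusters:
        # Most frequently mentioned phrase wins; sorting first makes ties
        # deterministic (alphabetically earliest phrase is kept).
        representative = max(sorted(cluster), key=lambda p: mention_counts.get(p, 0))
        canonical_to_aliases[representative] = set(cluster)
        for phrase in cluster:
            alias_to_canonical[phrase] = representative
    return alias_to_canonical, canonical_to_aliases


clusters = [{"New York City", "Big Apple", "NYC"}]
counts = {"New York City": 10, "NYC": 7, "Big Apple": 3}
alias_to_canonical, canonical_to_aliases = canonicalize(clusters, counts)

# A question that mentions "Big Apple" can still reach the canonical entity.
print(alias_to_canonical["Big Apple"])
```

If only the representative were stored, a lookup for "Big Apple" would fail after canonicalization; the alias index is what preserves that path.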

Ranking models on canonicalized KB


The ranking models that current QA systems use to find matching entities in a KB are believed to be poorly trained. Our team will propose new machine-learned models and train them so that they adapt effectively to a canonicalized KB. To accomplish this, we first need to design a feature vector whose components can effectively distinguish between different entities. The trained model can then be applied to this feature vector to produce a representative score for every entity mention.
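As a rough illustration of the feature-vector-then-score pipeline, the sketch below builds a two-component feature vector for a candidate entity and scores it with a linear model. The two features, the hand-set weights, and all names are assumptions for the example; the actual features and the learned model remain to be designed.

```python
from difflib import SequenceMatcher


def feature_vector(mention, candidate, aliases, question_relations, candidate_relations):
    """Build a small, hypothetical feature vector for one candidate entity.

    Feature 0: best string match between the mention and any known alias.
    Feature 1: fraction of the question's relations the candidate has in the KB.
    """
    names = set(aliases) | {candidate}
    best_alias_match = max(
        SequenceMatcher(None, mention.lower(), name.lower()).ratio() for name in names
    )
    if question_relations:
        relation_overlap = len(question_relations & candidate_relations) / len(question_relations)
    else:
        relation_overlap = 0.0
    return [best_alias_match, relation_overlap]


def score(features, weights, bias=0.0):
    """Linear scoring function; a trained model would supply the weights."""
    return bias + sum(w * f for w, f in zip(weights, features))


weights = [1.0, 1.0]  # placeholder values, not learned
candidates = {
    "New York City": ({"Big Apple", "NYC"}, {"is located in", "has mayor"}),
    "Apple Inc.": ({"Apple"}, {"was founded by"}),
}
ranked = sorted(
    candidates,
    key=lambda c: score(
        feature_vector("big apple", c, candidates[c][0], {"has mayor"}, candidates[c][1]),
        weights,
    ),
    reverse=True,
)
print(ranked)
```

The alias feature depends on the canonicalized KB keeping its synonym clusters, which is why the ranking step and the canonicalization step have to be designed together.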

Knowledge population


By clustering synonymous entities, canonicalization makes it possible to detect new relationships that can be derived by combining several existing assertions. Consider a KB that contains ("Mary", "marries", "John") and ("Mary", "is the mother of", "Henry"). If the user asks "who is the father of Henry", current QA systems will not be able to answer, because no single stored assertion states that relationship. If this can be solved, the QA system will be able to answer much more complex questions.
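The example above can be sketched as a small inference step over canonicalized triples. This is an illustration under strong assumptions: the alias "Mary S." is invented to show why canonicalization must come first, and the single hand-written rule (the spouse of a child's mother is taken to be the father) is a simplification that does not hold in general.

```python
def canonicalize_triples(triples, alias_to_canonical):
    """Rewrite subjects and objects to their canonical form."""
    return {
        (alias_to_canonical.get(s, s), r, alias_to_canonical.get(o, o))
        for s, r, o in triples
    }


def infer_fathers(triples):
    """Derive (X, "is the father of", C) from
    (M, "marries", X) and (M, "is the mother of", C).

    Only the subject-to-object direction of "marries" is followed.
    """
    spouses = {}
    for s, r, o in triples:
        if r == "marries":
            spouses.setdefault(s, set()).add(o)
    inferred = set()
    for s, r, o in triples:
        if r == "is the mother of":
            for spouse in spouses.get(s, ()):
                inferred.add((spouse, "is the father of", o))
    return inferred


raw = {
    ("Mary", "marries", "John"),
    ("Mary S.", "is the mother of", "Henry"),  # hypothetical alias of "Mary"
}
aliases = {"Mary S.": "Mary"}

print(infer_fathers(raw))                                  # nothing can be derived
print(infer_fathers(canonicalize_triples(raw, aliases)))   # derivation succeeds
```

Before canonicalization the two assertions refer to different strings and cannot be joined; once both resolve to the same entity, the new assertion follows and can be added to the KB.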

Improvement of parsing techniques


The current parsing technique also needs improvement, as the parsed query sometimes misrepresents the user's question. A major problem lies in the parsing of constraints. For example, if the user asks "who is the former president of the US", the parser may fail to recognize the word "former" as a significant constraint, so the QA system may ultimately return "Donald Trump" instead of "Barack Obama". However, this will not be our main focus, as current parsing techniques already perform quite well and the problems described above do not always occur.
