Methodologies

Process of building COQA


  • Find a medium-sized open KB suitable for a final-year project; around one million assertions should be the right size.

  • Adapt the current canonicalization code, which is written mainly in Java, to build a canonicalized KB. The main area we aspire to improve is the similarity function: we will plug various contextual factors into it and implement several approximation methods, in the hope of making the clustering process both accurate and efficient (see the first sketch after this list).

  • Build a QA system on top of the canonicalized KB. Current QA systems need to be adjusted before they can operate on a canonicalized KB, whose feature vectors are substantially different. Various machine learning methods and fine-tuning will also be investigated, in the hope of improving the system's ability to select the most suitable matching assertion (see the second sketch after this list). This part will probably be written in Python.

  • Implement an automatic populator that extends the KB with new assertions, based on links among existing assertions as well as natural language processing to handle sentence semantics.

  • Investigate different parsing methods to improve the probability of correctly parsing a posed question (see the third sketch after this list).
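
To illustrate the similarity-function idea from the canonicalization step, here is a minimal sketch in Python (the existing canonicalization code is in Java; this is an illustration, not that code). It combines string similarity between two noun-phrase mentions with the overlap of the contexts they occur in; the weights and the token_jaccard helper are hypothetical placeholders, not part of the existing system.

    from difflib import SequenceMatcher

    def token_jaccard(a, b):
        """Jaccard overlap between two context sets (hypothetical helper)."""
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def mention_similarity(m1, m2, w_string=0.6, w_context=0.4):
        """Weighted similarity between two noun-phrase mentions for clustering.

        Each mention is a dict with a surface 'name' and the set of
        'contexts' (e.g. relation phrases) it co-occurs with; the
        weights are illustrative and would be tuned experimentally.
        """
        string_sim = SequenceMatcher(None, m1["name"].lower(),
                                     m2["name"].lower()).ratio()
        context_sim = token_jaccard(m1["contexts"], m2["contexts"])
        return w_string * string_sim + w_context * context_sim

    # Two mentions of the same entity sharing a relation context.
    barack = {"name": "Barack Obama", "contexts": {"born in", "president of"}}
    obama = {"name": "Obama", "contexts": {"president of", "married to"}}
    print(mention_similarity(barack, obama))

To keep clustering over roughly a million assertions tractable, a cheap blocking step (for example, comparing only mentions that share at least one token) could be layered on top of this function; such blocking is one kind of approximation method we intend to explore.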
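For the QA step, the second sketch shows one way the answer-ranking stage could work: extract simple features for each (question, candidate assertion) pair and score the candidates with a logistic regression model. The three-feature design and the use of scikit-learn are assumptions for illustration only; real feature vectors over a canonicalized KB would be richer.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(question_tokens, assertion):
        """Toy feature vector for a (question, assertion) pair (illustrative)."""
        subj, rel, obj = assertion
        a_tokens = set(subj.split()) | set(rel.split()) | set(obj.split())
        q_tokens = set(question_tokens)
        overlap = len(q_tokens & a_tokens)
        return [overlap, overlap / max(len(q_tokens), 1), len(a_tokens)]

    # Hypothetical training data: feature vectors labelled 1 if the
    # assertion correctly answers its question, 0 otherwise.
    X = np.array([[3, 0.75, 5], [0, 0.0, 4], [2, 0.5, 6], [1, 0.25, 3]])
    y = np.array([1, 0, 1, 0])
    model = LogisticRegression().fit(X, y)

    # Rank candidate assertions for one question by predicted probability.
    q = ["who", "is", "the", "president", "of", "france"]
    candidates = [("emmanuel macron", "is president of", "france"),
                  ("paris", "is capital of", "france")]
    best = max(candidates,
               key=lambda a: model.predict_proba([features(q, a)])[0, 1])
    print(best)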
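For question parsing, the third sketch is a rule-based baseline that maps a simple question onto a (subject, relation, object) query with a free variable '?x'; the patterns are hypothetical and would be replaced or augmented by whichever parsing methods we settle on.

    import re

    # Hypothetical patterns mapping question shapes to query templates.
    PATTERNS = [
        (re.compile(r"^who is the (?P<rel>.+) of (?P<obj>.+)\?$", re.I),
         lambda m: ("?x", "is {} of".format(m.group("rel")), m.group("obj"))),
        (re.compile(r"^where was (?P<subj>.+) born\?$", re.I),
         lambda m: (m.group("subj"), "was born in", "?x")),
    ]

    def parse_question(question):
        """Return a (subject, relation, object) query, or None if nothing fits."""
        for pattern, build in PATTERNS:
            m = pattern.match(question.strip())
            if m:
                return build(m)
        return None

    print(parse_question("Who is the president of France?"))
    # -> ('?x', 'is president of', 'France')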

Evaluation metrics

As stated above, canonicalization is applied mainly to improve the performance of the QA system. To measure the degree of improvement, we need evaluation metrics for comparing the performance of OQA with that of our product, COQA. We will mainly use the following metrics.

  • Recall reflects the overall ability of our QA system to answer questions across different domains. High recall requires breadth of knowledge in the underlying KB, which is where existing QA systems over curated KBs consistently underperform. It also requires COQA to parse questions correctly into representative queries.

  • Precision considers all questions answered by the QA system and measures how many of those answers are correct. Precision is the main weakness of an uncanonicalized open KB compared with curated KBs in question-answering tasks, due to the large amount of noisy data a raw open KB typically contains. In this project, precision will reflect the effectiveness of our canonicalization approach. High precision also depends heavily on correctly training the machine learning model that ranks results.

  • F1 score is the harmonic mean of precision and recall (F1 = 2 × precision × recall / (precision + recall)), ranging from 0 to 1, and seeks a balance between the two. Essentially, one of the goals we set out to achieve with a canonicalized open KB (instead of a curated KB) is to balance more data (completeness, favoring recall) against better data (correctness, favoring precision). In theory, canonicalization should improve both recall and precision up to a certain level; beyond that level, higher recall may come at the expense of lower precision, and vice versa. We will study this recall/precision trade-off carefully in order to reach a reasonable compromise between the two (a sketch of the metric computation follows this list).
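
To make these metrics concrete, the sketch below computes them over per-question outcomes, under one common convention consistent with the descriptions above: precision counts only answered questions, while recall counts all questions. The sample numbers are purely illustrative.

    def qa_metrics(results):
        """Precision, recall and F1 over a list of per-question outcomes.

        Each outcome is 'correct', 'wrong' or 'unanswered'. Assumed
        convention: precision = correct / answered,
        recall = correct / total questions.
        """
        total = len(results)
        answered = sum(1 for r in results if r != "unanswered")
        correct = sum(1 for r in results if r == "correct")
        precision = correct / answered if answered else 0.0
        recall = correct / total if total else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Illustrative run: 10 questions, 8 answered, 6 of them correctly.
    outcomes = ["correct"] * 6 + ["wrong"] * 2 + ["unanswered"] * 2
    print(qa_metrics(outcomes))  # precision 0.75, recall 0.6, F1 ~ 0.667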

Our team will first focus on achieving high recall, to make sure our system can answer a broad range of questions. We will then work on improving precision by refining every step from question parsing to result ranking. Together, the improvements in recall and precision should yield a substantial gain in F1 score.