Humans have long sought to understand heredity, the passing of traits to one’s offspring. We now know that heredity is closely tied to the genome. To learn more about the genome, we need an accurate method for variant calling, that is, identifying the differences between an individual's genome and a reference genome. This project uses an artificial intelligence approach to call variants from data generated by single-molecule sequencing, a newer technology that makes larger variants discoverable but lacks the accuracy needed for calling short variants.


This project works with data from Oxford Nanopore Technologies (ONT) and focuses on single nucleotide polymorphisms (SNPs, where one base in the genome is changed to another) and short indels (insertions and deletions of length < 16).
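To make the two variant classes concrete, here is a minimal sketch that labels a reference/alternate allele pair as a SNP or a short indel. The function name `classify_variant`, the returned labels, and the VCF-style allele representation are assumptions made for illustration; they are not taken from the project's code. Only the length cutoff (< 16) comes from the text above.

```python
def classify_variant(ref: str, alt: str, max_indel_len: int = 15) -> str:
    """Label a REF/ALT allele pair (hypothetical helper, not the project's code).

    max_indel_len=15 mirrors the "length < 16" cutoff stated in the text.
    """
    if len(ref) == 1 and len(alt) == 1:
        # A single base replaced by a different single base is a SNP.
        return "SNP" if ref != alt else "no_variant"

    length_diff = len(alt) - len(ref)
    if length_diff == 0:
        # Same-length multi-base substitutions are outside the project's scope.
        return "other"

    kind = "insertion" if length_diff > 0 else "deletion"
    if abs(length_diff) <= max_indel_len:
        return kind
    return "long_" + kind


if __name__ == "__main__":
    print(classify_variant("A", "G"))     # single-base substitution
    print(classify_variant("A", "ATT"))   # two inserted bases
    print(classify_variant("ACGT", "A"))  # three deleted bases
```

Under these assumptions, anything labelled `long_insertion`, `long_deletion`, or `other` falls outside the scope described above.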






F-1 Score
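
The F-1 score is the harmonic mean of precision and recall. The sketch below computes it from counts of true positive, false positive, and false negative variant calls. The function name `f1_score` and the counts in the usage example are illustrative assumptions, not results from this project.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F-1 from call counts (illustrative helper, not the project's code)."""
    # Precision: fraction of called variants that are real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of real variants that were called.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts, chosen only to show the arithmetic.
    print(f1_score(tp=90, fp=10, fn=10))
```

Because it is a harmonic mean, the score is pulled toward the lower of precision and recall, so a caller cannot score well by being strong on only one of them.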

Clair - An RNN-based neural network for calling short variants

Clair is an RNN-based neural network that builds on the previous work Clairvoyante but uses RNN-LSTM layers as its major component. It reaches an 88.74% F-1 score when calling variants on the Nanopore WGS Consortium data for the sample NA12878.

t-SNE analysis

Using the t-SNE algorithm for dimension reduction, we can visualize the results and observe that 1) the performance of the network on this dataset is highly constrained by the poor read quality in homopolymer regions, and 2) the output of the network may have flaws, which can serve as a direction for future work.
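
A homopolymer region is a run of the same base repeated consecutively. As a rough illustration of how such regions could be located in a sequence, the sketch below reports every run at or above a minimum length. The function name `homopolymer_runs` and the threshold `min_len=4` are assumptions chosen for the example; the project does not state the definition it used.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 4) -> list[tuple[int, str, int]]:
    """Return (start, base, length) for each run of identical bases.

    min_len=4 is an assumed threshold for illustration only.
    """
    runs = []
    pos = 0
    # groupby yields one group per maximal run of consecutive equal bases.
    for base, group in groupby(seq):
        run_len = len(list(group))
        if run_len >= min_len:
            runs.append((pos, base, run_len))
        pos += run_len
    return runs


if __name__ == "__main__":
    print(homopolymer_runs("ACGTTTTTACAAAAG"))
    # [(3, 'T', 5), (10, 'A', 4)]
```

Positions flagged this way could then be compared against the network's errors to check how many of them fall inside homopolymer regions.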