DNA Indexing and Alignment Tools

HKU Computer Science Department - HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory

2BWT Libray
SOAP3 Family
Other
Our Works

2BWT (Bi-directinal BWT) is a full-text index of DNA that can support efficient pattern matching via backward search, forward search, or a mix of backward and forward search (i.e., switch of search direction is allowed). Typically, it takes tens of minutes to build the 2BWT for the human genome (~3 billion bp), and a few microseconds to search the occurrences of a pattern with a few hundred characters.

Abstract

In the context of pattern matching over DNA, it has been known that Burrows-Wheeler transform (BWT) is a very compact (if not the most compact) full-text index that can support pattern matching. The original BWT allows searching of a pattern in one direction, namely, from the end to the beginning (i.e., backward search). 2BWT is a tricky extension of BWT that can support searching in both directions (forward and backward search) and allow switching from forward to backward search or vice versa in the course of pattern matching. This would enable one to start matching a pattern in the middle, then extend the search in either direction and switch the direction occasionally. A seemingly trivial solution is to keep two BWTs, one for the original text and one for its reversal. Then forward search can be done via a backward search on the reversal of the pattern based on the latter BWT. However, this solution cannot allow interleaving the backward and forward search. And it deals with two separate searching space (the suffix arrays of the text and its reverse). 2BWT provides a tricky implementation that solves the above technical problems, while only scarifying the speed of forward search slightly (~1.1 times slower than backward search). 2BWT is particularly useful for approximate pattern matching with very few mismatches. In particular, the short read alignment tools SOAP2 and SOAP3 are based on 2BWT. (Please refer to external link for web link to BGI - SOAP software.)

Download Package

The source code of 2BWT package(index builder, library) is available here for download. The software has been tested on a 64-bit x86 Linux environment (Ubuntu). All source code is freely available under the GNU GPL license agreement. You can view a copy of the license agreement here.

Note: 2BWT-Library requires a 64-bit platform

Download Readme v1.0.0 (Revision 1) - 2011-06-21
Download Source Code v1.0.0 (Revision 1) - 2011-06-21
Download Sample Data -- Extracted from NCBI human genome

Installation

1. Un-tar the source code package

% gunzip 2bwt-lib-v1.0.0-x84-64bit.tar.gz
% tar -xvvf 2bwt-lib-v1.0.0-x84-64bit.tar

2. Navigate into the directory with Makefile present

% 2bwt-lib-v1.0.0
% make

Then, you should see a 2bwt-builder program being built. 2bwt-builder is the program which builds 2BWT index from FASTA formatted reference sequence. For usage of this program, please proceed to the next section.

Usage Guide

% ./2bwt-builder <ref sequence>

<ref sequence> is the path to your FASTA formatted DNA sequences. e.g.

% ./2bwt-builder ncbi.genome.fa

In this example, 2bwt-builder will build up a 2BWT index on the sequence ncbi.genome.fa. The output index will be named ncbi.genome.fa.index.*. There should be 16 files with this prefix created.

NOTE: 2bwt-builder operates based on certain number of restrictions and assumptions. See 2bwt-lib-v1.0.0-readme.txt to view them.

NOTE: What is in the index?
*.bwt -- The BWTs of the reference sequence.
*.fmv -- The auxiliary data structure to support operations on the BWTs.
*.pac -- The reference sequence in 2-bit per character format.
*amb/*.ann -- The information of the different chromosomes/sub-sequence of reference sequence. Also, the position and length of each removed long ambiguous region(N-region).
*.sa -- The Suffix Array
*.lkt -- Lookup table for fast retrieval of SA ranges of any length-k pattern where k is pre-defined in index building time.

Development with 2BWT package

Though 2bwt-builder is the only runnable program built from the compilation, there are also a handy 2BWT Library that is built together which allows developer to implement different applications with 2BWT package rapidly. Below is a list of functions provided by the 2BWT Library(2BWT-Interface.h) and some brief description of them.

Index loading/unloading

BWTLoad2BWT(indexFilePrefix,saFileNameExtension)

It loads the 2BWT index from disk into memory

BWTFree2BWT(idx2BWT)

It unloads the 2BWT index from memory

Pattern pre-processing

BWTConvertPattern(idx2BWT,patternSource,patternLength,patternDestination)

In order for 2BWT to be able to process the character on your pattern, you might first convert it from DNA character into a coding scheme recognised by 2BWT. This function converts human readable DNA pattern into a 2BWT recognised one. For all BWT search function below, all characters passing in and out them are expected to be "converted" characters.

Pattern matching - incremental by character

BWTSARangeInitial(idx2BWT,c,saIndexLeft,saIndexRight)

BWT search for the SA range of the first character c. Initial values of saIndexLeft and saIndexRight will be overwritten.

BWTSARangeBackward(idx2BWT,c,saIndexLeft,saIndexRight)

Incremental BWT backward search for the SA range of cP where P is a pattern having SA range (saIndexLeft,saIndexRight). Initial values of saIndexLeft and saIndexRight will be overwritten.

BWTSARangeBackward_Bidirection(idx2BWT,c,saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight)

Incremental Bi-directional BWT backward search for the SA ranges of cP where P is a pattern having SA range (saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight). Initial values of saIndexLeft, saIndexRight, rev_saIndexLeft and rev_saIndexRight will be overwritten.

BWTSARangeForward_Bidirection(idx2BWT,c,saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight)

Incremental Bi-directional BWT forward search for the SA range of Pc where P is a pattern having SA range (saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight). Initial values of saIndexLeft, saIndexRight, rev_saIndexLeft and rev_saIndexRight will be overwritten.

BWTAllSARangesBackward(idx2BWT,saIndexLeft,saIndexRight,resultSaIndexesLeft,resultSaIndexesRight)

Incremental BWT backward search for all SA ranges of cP where P is a pattern having SA range (saIndexLeft,saIndexRight) for all character c in alphabet set.

BWTAllSARangesBackward_Bidirection(idx2BWT,c, saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight, resultSaIndexesLeft,resultSaIndexesRight,rev_resultSaIndexesLeft,rev_resultSaIndexesRight)

Incremental Bi-directional BWT backward search for all SA ranges of cP where P is a pattern having SA range (saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight) for all character c in alphabet set.

BWTAllSARangesForward_Bidirection(idx2BWT,c, saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight, resultSaIndexesLeft,resultSaIndexesRight,rev_resultSaIndexesLeft,rev_resultSaIndexesRight)

Incremental Bi-directional BWT forware search for all SA ranges of Pc where P is a pattern having SA range (saIndexLeft,saIndexRight,rev_saIndexLeft,rev_saIndexRight) for all character c in alphabet set. Position reporting

Position reporting

BWTRetrievePositionFromSAIndex(idx2BWT,saIndex,sequenceId,offset)

Function to output an occurrence from an index within a SA range. The occurring position pointed by an SA index will be output and represented by a sequenceId:offset format which means the position is on the sequenceId-th sequence in the original FASTA reference sequence starting at the offest-th character on the sequence.

Sample Code that uses 2BWT package

Sample Program

Developer can find sample program that contains example in the 2bwt-lib package. Please see 2bwt-lib-v1.0.0-readme.txt for detail.

BGI: SOAP2 Package

The short read alignment tools developed by BGI, Short Oligonucleotide Analysis Package, SOAP2 are based on 2BWT.

Go to page SOAP Family.

The advancement of next generation sequencing technologies has made it feasible for researchers to consider many highthroughput biological applications. A core step of these applications is to align an enormous number of short reads (of length tens to hundreds) to a reference genome. The SOAP aligners detailed below allow users to build a 2BWT index for any genome (in FASTA format ) no longer than 4 billion nucleotides in length and to aligns short reads (in FASTA/FASTQ format ) of length 35-200bp with up to 2 to 4 mismatches/indels. The index and alignment algorithms were developed by the algorithms research group of the University of Hong Kong (T.W. Lam, C.M. Liu, Alan Tam, Simon Wong, Thomas Wong, Edward Wu and S.M. Yiu), in collaboration with BGI.

SOAP3-DP

Go to web-site http://www.cs.hku.hk/2bwt-tools/soap3-dp/

SOAP3

Go to web-site http://www.cs.hku.hk/2bwt-tools/soap3/

SOAP2.5

TBR

SOAP2

SOAP2 was developed in early 2008 and was currently maintained by BGI. It supports alignment with 2 mismatches or 2 indels.

Go to web-site http://soap.genomics.org.cn/soapaligner.html

BWT-SW

A local alignment tool for searching nucleotide sequences.. For more detail please visit the full site for BWT-SW.

Go to web-site http://www.cs.hku.hk/~ckwong3/bwtsw/

GNU GPL license v3

All software available in this website is freely available under the GNU GPL license v3 agreement. You can view a copy of the license agreement here. The basic idea behind is that we request everyone that downloads and re-distribute our software to pass-on the GNU GPL license v3.

The 2BWT index and search algorithms used by SOAP2/SOAP3/SOAP3-CPU/SOAP3-dp were developed by the algorithms reserach group of the University of Hong Kong (T.W. Lam, C.M. Liu, Ruibang Luo, Alan Tam, Simon Wong, Thomas Wong, Edward Wu and S.M. Yiu).