Whole Genome Alignment with Noise

This page describes how to use the software and the formats of its input and output files.

How to Use the Software

To run a program, type the name of its script, e.g., ./msp.sh or ./msp_mincard.sh. The program then prompts you for the information it needs, including the input/output files and the parameter settings. Each prompt is self-explanatory, so the steps are easy to follow.

You will be prompted for the following information:

  • Path of the input files
    Notes: The software currently does not support wildcards or pathname expansion, so typing seq*.txt will not match seq1.txt or seq2.txt. Instead, a File Not Found error is reported.


  • Option (yes/no) to invoke the graphical plotting utility.

  • Parameters of the programs, including
    • Minimum length of maximal matched substrings
    • Maximum gap between adjacent MSPs in clusters
    • Minimum size of clusters
    • Noise level of clusters
    • Display regions for plotting the graphical output
    • Labels and titles in the graphical output
    • Name of the output files
    Notes: Some parameters have default values. To accept a default, just press ENTER.
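
Because the input-file prompt does not expand wildcards, one workaround is to expand a pattern yourself beforehand and then type the resulting paths at the prompt. The following is a minimal sketch in Python, not part of the software; the helper name expand_pattern is hypothetical.

```python
import glob


def expand_pattern(pattern):
    """Return the sorted list of existing paths that match a shell-style pattern."""
    # glob.glob performs the pathname expansion that the prompts do not.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # "seq*.txt" is the example pattern from the note above.
    for path in expand_pattern("seq*.txt"):
        print(path)
```

Each printed path can then be entered at the input-file prompt exactly as shown.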

Input and Output Formats

  • Input Format
  • The basic input is a sequence file. Each file should contain only uppercase ACGT characters, with no whitespace or newline characters. In addition, DO NOT add any header to the file.

    Notes: A pre-processing step removes all non-ACGT characters from the file, including whitespace and newline characters. Nevertheless, violating the format may affect program execution.

    Sample Sequence File
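
A sequence file in the expected form can be prepared ahead of time rather than relying on the pre-processing step. The sketch below is an illustration only, not the software's own pre-processing. It assumes that header lines, if present, start with '>' (as in FASTA files) and that lowercase bases should be converted to uppercase; the function name clean_sequence is hypothetical.

```python
def clean_sequence(text):
    """Return only the uppercase A/C/G/T characters of a raw sequence text."""
    kept = []
    for line in text.splitlines():
        # Assumption: a line starting with '>' is a header and is dropped whole,
        # so that letters inside the header do not leak into the sequence.
        if line.startswith(">"):
            continue
        # Assumption: lowercase bases are meant to be kept, so uppercase first.
        kept.extend(ch for ch in line.upper() if ch in "ACGT")
    return "".join(kept)


if __name__ == "__main__":
    import sys

    # Usage: python clean_sequence.py raw_input.txt > cleaned_sequence.txt
    with open(sys.argv[1]) as handle:
        sys.stdout.write(clean_sequence(handle.read()))
```

The output is a single line of ACGT characters with no trailing newline, which matches the format described above.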

  • Output Format


    • Maximal Substring Pairs (MSP)


    • The MSP output is primarily a 5-column list. Each line in the file represents an MSP, and each of the five columns corresponds to an attribute of that MSP. Lines have the format s1 s2 e1 e2 sign, where
      • s1 and e1 are the start and end positions of the MSP in sequence 1,
      • s2 and e2 are the start and end positions of the MSP in sequence 2, and
      • sign is the sign of the MSP.
      N.B.: The file should be sorted by s1.

      The file begins with four header lines in the following format:

      # <Sequence 1>
      # <Length 1>
      # <Sequence 2>
      # <Length 2>

      <Sequence 1> and <Sequence 2> are the temporary names of the input sequences, and <Length 1> and <Length 2> are their lengths. The information in the header lines is used to generate the graphical output.

      Sample MSP File
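
The MSP format described above can be read with a short script. The following is a minimal sketch, not the software's own reader. It assumes that columns are separated by whitespace, that positions are integers, and that the angle brackets in the header description are placeholders rather than literal characters. The sign column is kept as an opaque string because its encoding is not specified here. The names MSP, parse_msp, and is_sorted_by_s1 are hypothetical.

```python
from typing import NamedTuple


class MSP(NamedTuple):
    s1: int    # start position in sequence 1
    s2: int    # start position in sequence 2
    e1: int    # end position in sequence 1
    e2: int    # end position in sequence 2
    sign: str  # kept as text; its encoding is not specified


def parse_msp(lines):
    """Parse the four '#' header lines and the 5-column MSP records."""
    header, msps = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            header.append(line[1:].strip())
            continue
        s1, s2, e1, e2, sign = line.split()
        msps.append(MSP(int(s1), int(s2), int(e1), int(e2), sign))
    if len(header) != 4:
        raise ValueError("expected exactly four header lines")
    name1, len1, name2, len2 = header
    info = {"seq1": name1, "len1": int(len1), "seq2": name2, "len2": int(len2)}
    return info, msps


def is_sorted_by_s1(msps):
    """Check the requirement that records are ordered by s1."""
    return all(a.s1 <= b.s1 for a, b in zip(msps, msps[1:]))
```

Calling parse_msp on an open file object returns the header information and the list of records; is_sorted_by_s1 can then confirm the ordering requirement before the file is used further.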

    • Clusters


    • Like the MSP output file, the cluster file is primarily a 5-column list and begins with 4 header lines that have the same meaning.

      Clusters are delimited by the '#' sign. Within a cluster, each line corresponds to an MSP and contains 5 attributes in the format s1 s2 sign d1 d2, where

      • s1 and s2 are the start positions of the matched substrings in sequence 1 and 2, respectively.
      • sign is the sign of the MSP.
      • d1 and d2 are the gap distances, i.e., the distances between the end of the previous match and the start of the current match.
        Notes: For the first MSP in a cluster, d1 and d2 will be replaced by '-'.

      Sample Cluster File
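
The cluster format can be read in a similar way. The sketch below is an illustration under stated assumptions, not the software's own reader. It assumes that the first four lines starting with '#' are the header lines, that any later line starting with '#' is a cluster delimiter, and that columns are separated by whitespace. A '-' in the d1 or d2 column is mapped to None. The names parse_clusters and parse_gap are hypothetical.

```python
def parse_gap(token):
    """Convert a gap column to an int, or None for the '-' placeholder."""
    return None if token == "-" else int(token)


def parse_clusters(lines):
    """Return (header, clusters); each cluster is a list of 5-tuples."""
    header, clusters, current = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if len(header) < 4:
                # Assumption: the first four '#' lines are the header.
                header.append(line[1:].strip())
            else:
                # Assumption: later '#' lines separate clusters.
                if current:
                    clusters.append(current)
                current = []
            continue
        s1, s2, sign, d1, d2 = line.split()
        current.append((int(s1), int(s2), sign, parse_gap(d1), parse_gap(d2)))
    if current:
        clusters.append(current)
    return header, clusters
```

Because empty clusters are discarded, the sketch behaves the same whether the delimiter appears before each cluster, after each cluster, or both.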