This page describes how to use the software and the input/output formats.
To execute the programs, you just need to type the name of the script, e.g., ./msp.sh
or ./msp_mincard.sh.
After that, you will be prompted to give some information to the program, including
specifying the input/output files or setting the parameters.
Each prompt is self-explanatory, so you can follow the steps easily.
You will be prompted for the following information:
- Path of the input files
Notes: The software does not currently support wildcards or pathname expansion, so
typing seq*.txt will not match seq1.txt or seq2.txt. Instead, a File Not Found error is reported.
- Option (yes/no) to invoke the graphical plotting utility.
- Parameters of the programs, including
- Minimum length of maximal matched substrings
- Maximum gap between adjacent MSPs in clusters
- Minimum size of clusters
- Noise level of clusters
- Display regions for plotting the graphical output
- Labels and titles in the graphical output
- Name of the output files
Notes: We set default values for some parameters.
To choose the default setting, just press ENTER.
- Input Format
The basic input is a sequence file. Each file should contain
only uppercase ACGT, with no white space or newline characters. In addition,
DO NOT add any headers to it.
Notes: A pre-processing step removes
all non-ACGT characters from the file, including white space and
newline characters. Violating the format may affect program execution.
Sample Sequence File
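To check a file against the format above before running the programs, a minimal Python sketch along the following lines may help. It is not the software's own pre-processing step. The function names clean_sequence and looks_clean are hypothetical, and the sketch assumes that lowercase letters count as non-ACGT characters and are dropped.

```python
import re

def clean_sequence(text: str) -> str:
    """Return text with every character other than uppercase A, C, G, T removed.

    Assumption: lowercase letters are treated as non-ACGT and dropped.
    """
    return re.sub(r"[^ACGT]", "", text)

def looks_clean(text: str) -> bool:
    """True if text already consists only of uppercase A, C, G, T."""
    return re.fullmatch(r"[ACGT]*", text) is not None
```

Note that stripping characters this way would not fully remove a header line: any uppercase A, C, G, or T inside the header text would be kept as sequence data, which is one reason to leave headers out of the file entirely.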
- Output Format
- Maximal Substring Pairs (MSP)
The MSP output is primarily a 5-column list. Each line in the file represents an
MSP, and each of the five columns corresponds to an attribute of that
MSP. Lines are in the format s1 s2 e1 e2 sign,
where
- s1 and e1
are the start and end positions of the MSP in sequence 1,
- s2 and e2
are the start and end positions of the MSP in sequence 2, and
- sign is the sign of the MSP.
N.B.: The file should be sorted by s1.
At the beginning of the file, there are four header lines, which
are in the following format
# <Sequence 1>
# <Length 1>
# <Sequence 2>
# <Length 2>
<Sequence 1> and <Sequence 2>
are the temporary names of the input sequences, and <Length 1> and
<Length 2> are their lengths.
The information in the header lines is used
to generate the graphical output.
Sample MSP File
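The layout described above (four header lines, then one MSP per line) can be read with a short Python sketch such as the one below. It is an illustration only. The names MSP, parse_msp, and is_sorted_by_s1 are hypothetical, columns are assumed to be separated by white space, and the sign is kept as a string because its exact encoding is not specified here.

```python
from typing import NamedTuple

class MSP(NamedTuple):
    s1: int    # start position in sequence 1
    s2: int    # start position in sequence 2
    e1: int    # end position in sequence 1
    e2: int    # end position in sequence 2
    sign: str  # kept as text; the encoding of the sign is an assumption

def parse_msp(lines):
    """Parse MSP output lines into ((name1, len1, name2, len2), [MSP, ...])."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    # The first four lines are headers of the form "# <value>".
    header = [ln.lstrip("#").strip() for ln in lines[:4]]
    info = (header[0], int(header[1]), header[2], int(header[3]))
    msps = []
    for ln in lines[4:]:
        s1, s2, e1, e2, sign = ln.split()
        msps.append(MSP(int(s1), int(s2), int(e1), int(e2), sign))
    return info, msps

def is_sorted_by_s1(msps):
    """Check the requirement that the file be sorted by s1."""
    return all(a.s1 <= b.s1 for a, b in zip(msps, msps[1:]))
```

Passing an open file object to parse_msp works as well, since the function only iterates over lines.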
- Clusters
Similar to the MSP output file, the cluster file is primarily
a 5-column list and begins with 4 header lines that have the same
interpretation.
Each individual cluster is delimited by the '#' sign. Within a cluster,
each line corresponds to an MSP and contains 5 attributes in the format
s1 s2 sign d1 d2, where
- s1 and s2
are the start positions of the matched substrings in sequence 1
and 2, respectively.
- sign is the sign of the MSP.
- d1 and d2
are the distances between the previous match's end and the current
match's start (gap distance).
Notes: For the first MSP in a cluster, d1
and d2 will be replaced by '-'.
Sample Cluster File
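A cluster file in the layout described above could be read with a sketch like the following. It rests on several assumptions: the first four lines are always the header lines, any later line starting with '#' is a cluster delimiter, and columns are separated by white space. The function name parse_clusters is hypothetical, and a '-' in d1 or d2 is mapped to None.

```python
def parse_clusters(lines):
    """Parse cluster file lines into (header, clusters).

    header   : list of the four header values as strings
    clusters : list of clusters, each a list of (s1, s2, sign, d1, d2)
               tuples, with d1/d2 set to None where the file has '-'
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    header = [ln.lstrip("#").strip() for ln in lines[:4]]
    clusters, current = [], []
    for ln in lines[4:]:
        if ln.startswith("#"):
            # Assumed delimiter line: close the cluster collected so far.
            if current:
                clusters.append(current)
                current = []
            continue
        s1, s2, sign, d1, d2 = ln.split()
        current.append((
            int(s1),
            int(s2),
            sign,
            None if d1 == "-" else int(d1),
            None if d2 == "-" else int(d2),
        ))
    if current:
        clusters.append(current)
    return header, clusters
```

Because fields are read by position, a '-' in the sign column is not confused with the '-' placeholder used for d1 and d2.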