Fast and sensitive alignment of large genomic sequences.

Michael Brudno 1 and Burkhard Morgenstern 2
1 110 Gates Building, Computer Science Department, Stanford University, Stanford, CA 94305
2 International Graduate School in Bioinformatics and Genome Research, Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany.
Email: [email protected].Stanford.EDU, [email protected].Uni-Bielefeld.DE
Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB’02)
0-7695-1653-X/02 $17.00 © 2002 IEEE
Abstract
Comparative analysis of syntenic genome sequences can be used to identify functional sites such as exons and regulatory elements. Here, the first step is to align two or several evolutionarily related sequences and, in recent years, a number of computer programs have been developed for alignment of large genomic sequences. Some of these programs are extremely fast, but often time-efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast heuristic identifies a chain of strong sequence similarities that serve as anchor points. In a second step, regions between these anchor points are aligned using a slower but more sensitive method.
We present CHAOS, a novel algorithm for rapid identification of chains of local sequence similarities among large genomic sequences. Similarities identified by CHAOS are used as anchor points to improve the running time of the DIALIGN alignment program. Systematic test runs show that this method can reduce the running time of DIALIGN by more than 93% while affecting the quality of the resulting alignments by only 1%.
The source code for CHAOS is available at http://www.stanford.edu/~brudno/chaos/
An integrated program package containing CHAOS and DIALIGN is available at http://bibiserv.techfak.uni-bielefeld.de/dialign/
1. Introduction
Cross-species sequence comparison is playing an increasingly important role in genome analysis and annotation. The functional parts of the genome are under selective pressure, and therefore evolve more slowly than non-functional parts, where random mutations can be tolerated without affecting the evolutionary fitness of the organism. Consequently, islands of local sequence conservation often correspond to functional elements. Comparative sequence analysis has been used for a variety of purposes, e.g. for gene prediction [10, 3, 4, 20, 33, 30, 27] and identification of regulatory elements [15, 17, 23, 11, 12, 23, 13]. One major advantage of these approaches is that they are based on simple measurement of sequence similarity and require little additional information about the elements to be detected. While more traditional statistical methods need large sets of training data to construct species-specific models of genes or regulatory elements, the comparative methods essentially depend on the availability of syntenic sequences at an appropriate evolutionary distance, making them effective for analysis of newly sequenced genomes, when little training data is available.
If syntenic sequences are to be compared, the most straightforward approach is to compile a list of sequence similarities identified by a local alignment method such as BLAST [2]; these similarities can then be ranked according to some quality measure, e.g. by their statistical significance [18]. One problem with this method is that a stringent cut-off criterion needs to be applied in order to eliminate spurious similarities, but such a threshold would also exclude many weak but biologically important similarities.
It is well known that in eukaryotes large-scale genome rearrangements are relatively rare events during evolution; therefore the relative order of genes remains conserved across fairly large evolutionary distances. This fact can be used to reduce the noise of false positive similarities in comparative genome analysis. Rather than considering all local sequence similarities above some (arbitrary) threshold value, one can search for chains of similarities, i.e. for sets of similarities that respect the relative order within the sequences under study. This combinatorial constraint can considerably reduce the noise of false positive similarities without preventing weak but biologically important similarities from being detected. Consequently, a new challenge in sequence analysis is construction of global alignments of large genomic sequences.
Recently a number of alignment algorithms have been proposed that combine local and global alignment features by returning ordered chains of local similarities. Since these algorithms must be able to deal with large sequences, a trade-off between sensitivity and speed is necessary. Some approaches for genomic alignment use suffix-tree or hashing algorithms to identify pairs of k-mers of a certain minimum length (and, possibly, a maximum number of mismatches) [8, 22, 21]. These approaches are extremely time-efficient but are most effective at aligning sequences from closely related genomes, e.g. from different strains of a bacterium [8]. Other approaches are more sensitive and have been successfully used to compare more distantly related organisms but these methods are far more time-consuming and are currently restricted to sequences of a few hundred kb in length [19, 25, 27]. A third approach is used in the PipMaker [31] set of tools, where a local alignment program implementing a gapped BLAST algorithm (BLASTZ) is used.
Another possible way of combining speed and sensitivity for genomic alignment is to (i) use a fast heuristic to identify a chain of high-scoring sequence similarities that can be used as anchor points for the final alignment, and then (ii) use a more sensitive method to align the regions that are left over between these anchor points. Such an approach has been proposed, for example, by Batzoglou et al. [4]. These authors developed a computer program called GLASS that aligns genomic sequences based on matching k-mers. A recursive procedure is used where the minimum length for matching k-mers is decreased at every level of the recursion.
Generally, if sequences of length $L_1$ and $L_2$, respectively, are to be aligned, a first heuristic procedure would define anchor points $x_0 = 1, x_1, \ldots, x_N = L_1$ in the first sequence and $y_0 = 1, y_1, \ldots, y_N = L_2$ in the second sequence. This way, the search space for the final alignment procedure is reduced to $\sum_{i=1}^{N} (x_i - x_{i-1}) \times (y_i - y_{i-1})$ compared to $L_1 \times L_2$ for the non-anchored procedure. The idea is that strong local sequence similarities identified by fast heuristics can be expected to be part of an optimal alignment for any reasonable objective function. Thus, if these similarities are used as anchor points for the final alignment procedure the resulting alignment would still be optimal or near-optimal with respect to the objective function employed in the second step. This approach is related to divide-and-conquer strategies for sequence alignment [16, 32].
Obviously, the denser a chain of anchor points $(x_0, y_0), \ldots, (x_N, y_N)$ is, the higher is the reduction of the search space and the gain in speed for the final procedure; on the other hand, too many anchor points could overly restrict the search space and result in decreased alignment quality. The main challenge in the anchored-alignment approach is therefore to find a trade-off between speed and alignment quality, i.e. to locate anchor points that are as dense as possible while still leading to optimal or near-optimal alignments.
In this paper, we present a fast local alignment tool called CHAOS (CHAins Of SCores). For a pair of input sequences, CHAOS returns a chain of local sequence alignments that can be used as anchor points to reduce the search space and improve the running time of any sensitive global alignment procedure. Herein, we show how these anchor points can be used to speed up the DIALIGN alignment program [24]. Systematic test runs demonstrate that this way the running time of DIALIGN can be reduced by 1-2 orders of magnitude while the quality of the resulting alignments is only minimally affected. Moreover, the relative improvement in speed increases with the length of the input sequences, making our approach particularly effective for alignment of large genomic sequences.
2. Algorithm
The CHAOS algorithm works by chaining together pairs of similar regions, one region from each of the two input DNA sequences; we call such pairs of regions seeds. More precisely, a seed is a pair of words of length k with at least n identical base pairs (bp). A seed $s^{(1)}$ can be chained to another seed $s^{(2)}$ whenever (i) the indices of $s^{(1)}$ in both sequences are higher than the indices of $s^{(2)}$, and (ii) $s^{(1)}$ and $s^{(2)}$ are "near" each other, with "near" defined by both a distance and a gap criterion as illustrated in Figure 1. The final score of a chain is the total number of matching bp in it. The default parameters used by CHAOS are words of length 7, with no degeneracy, distance and gap criteria of 20 and 5 bp respectively, and a score cutoff of 25.
The seeds are located using a simplified version of the Aho-Corasick [1] algorithm. A variation on the trie data structure [9] which we call a threaded trie (T-trie) is used to store the k-mers of one sequence. A trie is a tree for storing strings in which there is one node for every common prefix. A node which corresponds to the word $W_1 \ldots W_p$ would have as its parent a node that corresponds to $W_1 \ldots W_{p-1}$. In a trie that contains all of the k-mers of some string, each leaf is at height k, and in each leaf we store all of the locations where this k-mer occurs in the indexed sequence.
A T-trie differs from a regular trie in that a node that corresponds to the string $W_1 \ldots W_p$ will also have a back pointer to the node which corresponds to $W_2 \ldots W_p$. We start by inserting into the T-trie all of the k-mers of one of the sequences, which we will call the database. Then we do a "walk" using the other, query sequence, where we start by making the root of the T-trie our current node, and for every letter of the query:
1. if the current node has a child corresponding to this letter we make this child our current node, and return any seeds stored in it.
2. otherwise make the node pointed to by our back pointer our current node, and return to step 1.
As an illustration of why this method works well in practice, assume that all of the possible k-mers are present in the database (which is most likely the case). Then, finding the k-mers that correspond to the next letter of the query requires only two pointer operations: the first is to follow a back pointer from the level-k node which is our current node, the second to follow a down pointer from the resulting node to the appropriate child. To allow degeneracy we permit multiple current nodes, which correspond to the possible degenerate words.
Let D be the maximum distance between two adjacent seeds. The seeds generated while examining the last D base pairs of the query sequence (Figure 1) are stored in a skip list, a probabilistic data structure that allows for fast searches and easy in-order traversal of its elements [29]. The seeds are ordered by the difference of their indices in the two sequences (diagonal number). For each seed s found at the current location we do a search in the skip list for previously stored seeds which have diagonal numbers within the permitted gap criterion of the diagonal number of s. We thus find the possible previous seeds with which s can be chained. The highest scoring chain is picked, and this chain can be further extended by future seeds. In order to enforce the distance criterion we then remove from the skip list all seeds which were generated D base pairs before the position at which the new seeds are, and insert the new seeds into the skip list.
CHAOS can be used as a stand-alone program for local sequence alignment or as a pre-processing step to find anchor points for any global alignment procedure. In this case, once all of the local alignments are identified, we use an algorithm based on the longest increasing subsequence problem [14] to find the highest scoring monotonically increasing subsequence of alignments. In the present study we used such chains of local alignments identified by CHAOS as anchor points for DIALIGN.
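The chaining rule and the selection of the highest scoring chain can be illustrated with a minimal Python sketch. It is not the CHAOS implementation: it uses a plain quadratic dynamic program instead of a skip list or a longest-increasing-subsequence method, it ignores overlaps between seeds, and the names `Seed`, `can_chain` and `best_chain` are hypothetical. The reading of the distance criterion (measured along the query) and of the gap criterion (difference of diagonal numbers) is an assumption based on the description above.

```python
from typing import List, NamedTuple


class Seed(NamedTuple):
    x: int      # start index of the seed in the query sequence
    y: int      # start index of the seed in the database sequence
    score: int  # number of matching bp in the seed


def can_chain(prev: Seed, cur: Seed, distance: int = 20, gap: int = 5) -> bool:
    """Return True if `cur` may extend a chain that ends in `prev`.

    Assumed reading of the two criteria: `prev` must lie before `cur` in
    both sequences, at most `distance` bp back in the query, and the two
    diagonal numbers (x - y) may differ by at most `gap`.
    """
    if not (prev.x < cur.x and prev.y < cur.y):
        return False
    near = cur.x - prev.x <= distance
    same_band = abs((cur.x - cur.y) - (prev.x - prev.y)) <= gap
    return near and same_band


def best_chain(seeds: List[Seed], distance: int = 20, gap: int = 5) -> List[Seed]:
    """Highest scoring chain of seeds, found by an O(n^2) dynamic program."""
    order = sorted(seeds)  # sorted by x, so predecessors come first
    if not order:
        return []
    best = [s.score for s in order]  # best chain score ending at each seed
    back = [-1] * len(order)         # predecessor index, -1 for chain start
    for j, cur in enumerate(order):
        for i in range(j):
            if can_chain(order[i], cur, distance, gap) and best[i] + cur.score > best[j]:
                best[j] = best[i] + cur.score
                back[j] = i
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(order[end])
        end = back[end]
    return chain[::-1]


if __name__ == "__main__":
    demo = [Seed(0, 0, 7), Seed(10, 11, 7), Seed(12, 40, 7),
            Seed(25, 26, 7), Seed(100, 100, 7)]
    print(best_chain(demo))
```

In the demo, the seed at (12, 40) lies on a distant diagonal and the seed at (100, 100) is further than the distance cutoff from every other seed, so neither joins the three-seed chain.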
[Figure 1 diagram; recoverable labels: "2x gap cutoff", "distance cutoff", "seed", "query"]
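The threaded-trie walk of Section 2 can be sketched in Python as follows. This is a simplified illustration for exact k-mer matches only (no degeneracy); the names `Node`, `build_ttrie` and `find_seeds` are hypothetical. As an assumption, when the node for $W_2 \ldots W_p$ does not exist in the trie, the back pointer falls back to the longest proper suffix that does exist, as in the standard Aho-Corasick construction.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.back: Optional["Node"] = None  # node for the word minus its first letter
        self.positions: List[int] = []      # database start positions (leaves only)


def build_ttrie(database: str, k: int) -> Node:
    """Insert every k-mer of `database`, then set back pointers breadth-first."""
    root = Node()
    for i in range(len(database) - k + 1):
        node = root
        for ch in database[i:i + k]:
            node = node.children.setdefault(ch, Node())
        node.positions.append(i)
    queue = deque()
    for child in root.children.values():
        child.back = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            target = node.back
            # Fall back to shorter suffixes until one can be extended by ch.
            while target is not root and ch not in target.children:
                target = target.back
            child.back = target.children.get(ch, root)
            queue.append(child)
    return root


def find_seeds(root: Node, query: str, k: int) -> List[Tuple[int, int]]:
    """Walk the query through the trie; return (query_start, database_start) pairs."""
    seeds = []
    node = root
    for q_end, ch in enumerate(query):
        # Step 2 of the walk: follow back pointers until a child for ch exists.
        while node is not root and ch not in node.children:
            node = node.back
        # Step 1 of the walk: descend to the child and report seeds stored there.
        node = node.children.get(ch, root)
        for db_pos in node.positions:
            seeds.append((q_end - k + 1, db_pos))
    return seeds


if __name__ == "__main__":
    trie = build_ttrie("ACGTACGA", 3)
    print(find_seeds(trie, "TTACGT", 3))
```

Once the walk has reached a leaf, each further query letter costs one back-pointer step and one child step whenever the next k-mer is present, which is the two-pointer behaviour described in the text.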
3. Results
To test our method, we used a set of 42 sequence pairs from human and mouse as compiled by Jareborg et al. [17]; they vary in length between less than 6 kb and more than 227 kb, with an average length of 38 kb. These sequences were used in a recent paper for a systematic comparison of five different software tools for genomic alignment [27]; it was shown that DIALIGN was the most specific of these methods though it was considerably slower than the other methods. In the present paper, we do not repeat the comparison of these different alignment methods. Instead, we investigate how the running time of DIALIGN is improved by the anchoring procedure described above and how this affects the quality of the output alignments. First, we applied CHAOS to our data in order to obtain chains of anchor points. Next, we aligned the sequence pairs with DIALIGN, first without anchoring and then using the anchor points identified by CHAOS. We measured the running time for aligning our test sequences with and without anchoring and compared the quality of the resulting alignments. DIALIGN was run with the translation option where local similarity among DNA sequences is compared at the peptide level, see [26].
Figure 1: The figure shows a matrix representation of sequence alignment. The seed shown can be chained to any seed which lies inside the search box. All seeds located less than distance bp from the current location are stored in a skip list, in which we do a range query for seeds located within a gap cutoff from the diagonal on which the current seed is located. The seeds located in the grey areas cannot be chained to in order to make the algorithm independent of sequence order. (Diagram labels: search box, location in query, range of search.)
When CHAOS is run with default parameters the density of the returned anchor points was, on average, 2.1 anchor points per kb; the total CPU time for running CHAOS on all the 42 sequence pairs was 94 s on a SUN Ultra Enterprise 420 with a 450 MHz CPU. The total CPU time for aligning the 42 sequence pairs with the standard version of DIALIGN (i.e. without anchor points) was 179,001 s, i.e. more than two days. By contrast, running DIALIGN using the anchor points found by CHAOS with the default cut-off value took only 11,391 s of CPU time. Thus, the combined running time for the anchored alignment procedure (i.e. the time necessary for finding the anchor points plus the running time of the final alignment procedure using these anchor points) was only 6.4% of the original running time of DIALIGN.
If CHAOS is run with decreased cut-off parameters,
larger chains of anchor points are found. Thereby the
search space for DIALIGN is further decreased and
the running time is improved accordingly. On the
other hand, this option leads to slightly decreased
quality of the final alignments as shown in Table 1.
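The 6.4% figure for the default cut-off follows directly from the three CPU times reported above. The short Python check below only reproduces that arithmetic; the variable names are illustrative.

```python
# CPU times in seconds, as reported in the text for the default cut-off.
chaos_s = 94                 # CHAOS on all 42 sequence pairs
anchored_dialign_s = 11_391  # DIALIGN using the CHAOS anchor points
plain_dialign_s = 179_001    # DIALIGN without anchor points

combined_s = chaos_s + anchored_dialign_s
fraction = combined_s / plain_dialign_s

print(combined_s)                 # total time of the anchored procedure
print(round(100 * fraction, 1))   # percentage of the non-anchored running time
```

The combined time is 11,485 s, which corresponds to the 6.4% quoted in the text and to a reduction in running time of more than 93%.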
To compare the quality of the anchored alignments
to the non-anchored ones, we used two different measures of alignment quality. First, we considered the
numerical alignment score that is employed by the
DIALIGN program. DIALIGN constructs alignments
by assembling pairs of un-gapped segments, so-called
fragments. Each possible fragment is given a quality score and, for pairwise alignment, the program
identifies a chain of fragments with maximum total
score; the scores of the fragments are defined based
on probabilistic considerations as explained in [24].
We considered the total sum of scores of the alignments produced with and without anchoring option,
respectively. For the non-anchored DIALIGN alignments, the sum of scores was 54,214. If CHAOS was
applied with the default cut-off in order to anchor
the DIALIGN alignment, the total sum of scores was
53,658, i.e. the numerical alignment score was reduced by only 1.02% while the running time was reduced by more than 93%.
Next, we tried to compare the biological quality of the returned alignments. The 42 sequence pairs contain a total of 77 known gene pairs and we investigated to what extent the alignments with and without anchoring option were able to identify protein-coding regions. We compared the different alignments at the nucleotide level, i.e. a nucleotide that is part of a selected fragment is considered a true positive (TP) if it is also part of an annotated exon and a false positive (FP) if it is not; true and false negatives (TN and FN) are defined accordingly. We used the usual measures for prediction accuracy, namely sensitivity $= TP/(TP + FN)$, specificity $= TP/(TP + FP)$, and approximate correlation $= 0.5 \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN} \right) - 1$.
The sensitivity of DIALIGN without anchoring option was 83.68%, the specificity was 40.06%, and
the approximate correlation was 57.31%. Note that
DIALIGN is a general-purpose alignment program
and does not attempt to find exact exon boundaries.
Therefore, these results roughly reflect the ability of
DIALIGN to identify biologically functional regions
but they cannot be compared with specialized gene-finding programs. With the new anchoring option,
the sensitivity of DIALIGN was 83.44%, the specificity was 40.53% while the approximate correlation
was 57.51%. Here, the specificity and approximate correlation of the anchored alignments were even slightly better than for the non-anchored ones, though the difference in quality was
minimal.
By comparison, for CHAOS alone (i.e.
without using DIALIGN in the second step), sensitivity was 50%, specificity was 45% and approximate
correlation was only 44%.
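The three accuracy measures defined above can be computed from nucleotide-level counts as in the following Python sketch. The function name `accuracy` and the counts in the example are made up for illustration; they are not data from this study.

```python
from typing import Tuple


def accuracy(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, approximate correlation).

    Specificity is TP / (TP + FP), following the definition used in the text.
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    approx_corr = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                         + tn / (tn + fp) + tn / (tn + fn)) - 1
    return sensitivity, specificity, approx_corr


if __name__ == "__main__":
    # Hypothetical counts: 100 exon nucleotides, 900 non-exon nucleotides.
    sn, sp, ac = accuracy(tp=80, fp=120, tn=780, fn=20)
    print(f"Sn={sn:.2f} Sp={sp:.2f} AC={ac:.2f}")
```

A perfect prediction (no false positives or false negatives) gives an approximate correlation of 1, since each of the four ratios equals 1.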
Finally, we wanted to know how the relative improvement in program running time that we achieved
by our anchoring method depends on the length of
the input sequences. The main benefit of reduced
running time of DIALIGN is that this way the program becomes applicable to genomic sequences that
are currently beyond its scope, so we wanted to estimate the behavior of the running time for very long
sequences. Let d be the average distance between two anchor points and let L be the maximum sequence length. If the anchors are evenly spaced over the length of the sequences, one has approximately $L/d$ anchor points while the search space between any two adjacent anchor points is $d^2$. Thus, the search space for the anchored alignment procedure would be $L/d \times d^2 = L \times d$ compared to $L^2$ for the non-anchored algorithm.
The running time for our algorithm would therefore grow linearly rather than quadratically with the
sequence length, so the relative improvement in running time that is achieved by using anchor points
grows with the sequence length. In reality, it is difficult to estimate the distribution of distances between
anchor points since this depends, of course, on the sequences being compared. Nevertheless, for our data
we could confirm that the improvement in running
time was larger for longer sequences (Figure 2).
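The search-space argument can be checked numerically with a small Python sketch. The helper `anchored_search_space` is a hypothetical name; it sums the areas of the rectangles between consecutive anchor points, and the example assumes evenly spaced anchors on the main diagonal with 0-based coordinates so that the arithmetic is exact.

```python
from typing import Sequence


def anchored_search_space(xs: Sequence[int], ys: Sequence[int]) -> int:
    """Sum of (x_i - x_{i-1}) * (y_i - y_{i-1}) over consecutive anchor points."""
    return sum((x1 - x0) * (y1 - y0)
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


if __name__ == "__main__":
    L = 100_000   # sequence length (illustrative value)
    d = 500       # distance between adjacent anchor points (illustrative value)
    anchors = list(range(0, L + 1, d))   # evenly spaced, same in both sequences

    anchored = anchored_search_space(anchors, anchors)
    full = L * L
    print(anchored, full, anchored / full)
```

With these values the anchored search space equals L x d, and its ratio to the non-anchored search space is d / L, so the relative saving grows with the sequence length when the anchor density stays fixed.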
[Figure 2: scatter plot, one point per sequence pair; x-axis: sequence length (0 to 250000); y-axis: relative running time (0 to 0.9)]
Figure 2: Relative improvement in program running time for 42 pairs of genomic sequences from human and mouse of different length. Each point represents one sequence pair. The x-axis is the medium sequence length of sequence pairs while the y-axis is the relative running time of the anchored alignment procedure compared to the non-anchored procedure.
program  cut-off  anc./kb  CPU (s)   %CPU   score    % score  Sn  Sp  AC
D        -        -        179,001   100.0  54,214   100.0    83  40  57
C+D      35       1.4      14,334    8.0    53,839   99.3     83  40  57
C+D      30       1.7      11,717    6.5    53,820   99.2     83  40  57
C+D      25       2.1      11,485    6.4    53,654   98.9     83  40  57
C+D      20       2.8      8,964     5.0    53,642   98.9     83  40  57
C+D      15       4.2      7,404     4.1    53,208   98.1     82  41  57
C+D      10       6.5      6,696     3.7    52,684   97.1     82  41  57
C        35       -        92        -      -        -        43  48  42
C        30       -        92        -      -        -        46  47  43
C        25       -        94        -      -        -        50  45  44
C        20       -        93        -      -        -        53  43  44
C        15       -        93        -      -        -        57  39  43
C        10       -        94        -      -        -        60  34  42
Table 1: Total CPU time and alignment quality for DIALIGN (D), DIALIGN anchored with CHAOS (C+D) and CHAOS (C) applied to a set of 42 pairs of genomic sequences from human and mouse [17]. CHAOS was run with varying cut-off parameters. Lower cut-off values for CHAOS produced higher numbers of anchor points resulting in a decreased search space for the final DIALIGN alignment procedure, thus leading to improved running time but slightly decreased alignment quality. The average number of anchor points per kilobase is shown (anc./kb). Score is the total numerical score of all produced DIALIGN alignments, i.e. the sum of the scores of the segment pairs the alignments consist of. As a rough measure of the biological quality of the produced alignments, we compared local sequence similarities identified by DIALIGN and CHAOS to known protein-coding regions. Here, Sn, Sp and AC are sensitivity, specificity and approximate correlation, respectively. For the D and C+D results, DIALIGN was evaluated by comparing all segment pairs contained in the alignment to annotated exons.
4. Discussion
CHAOS is a novel heuristic local alignment program for large genomic sequences. Most of the methods for heuristic local alignment, such as BLAST [2] and FASTA [28], were developed when the bulk of available sequences were proteins. It has been shown that such algorithms are not as efficient in aligning non-coding sequences [5]. With the new availability of genomic sequences it is appropriate to refine the algorithms used for local alignment so that they more closely reflect the fashion in which the genomic sequences are conserved. Unlike other fast algorithms for genomic alignment, CHAOS does not depend on long exact matches, does not require extensive ungapped homology, and allows mismatches in seeds, all of which are important when comparing distantly related organisms or non-coding regions, where conservation is generally much poorer than in coding areas.
Some previous algorithms for anchored global alignment have worked by first identifying very strong local similarities among the input sequences and adding weaker similarities later. The problem with this approach is that one high-scoring spurious match can lead to a wrong output alignment where many weaker but biologically important homologies may be missed. By contrast, CHAOS searches for the highest scoring chain of local alignments. This way, a numerically high-scoring but biologically wrong local alignment can be counterbalanced by several weaker local alignments if the total score of these alignments exceeds the score of the one wrong alignment.
We demonstrate that the chains of local alignments returned by CHAOS can be used to anchor the DIALIGN alignment procedure, significantly improving the alignment speed while only minimally affecting the quality of the resulting alignments. To compare the quality of the anchored and non-anchored alignments, we compared the numerical scores of the resulting alignments as well as their biological quality. The numerical scores that we used are the scores for alignment quality as employed by DIALIGN. That is, we measured the sum of the scores of the segment pairs an alignment is composed of. The biological alignment quality was measured by comparing (local) alignments to existing protein-coding exons. This is, admittedly, a very rough measure of alignment quality as non-coding functional elements are completely ignored. However, in the absence of reliable benchmark data for genomic sequence alignment, comparing identified sequence similarities to known exons gives us an effective approximation for assessing the performance of different alignment methods.
Several studies have shown that DIALIGN is highly efficient in identifying coding regions and regulatory elements in genomic sequences [11, 13, 27, 6, 7] and a new gene-prediction program called AGenDA (Alignment-based Gene Detection Algorithm) has recently been developed that takes DIALIGN alignments as input information [30]. However, DIALIGN used to be far slower than other programs for genomic alignment so its applicability was limited to sequences of a few hundred kilobases. With the new anchoring option and anchor points created by CHAOS, DIALIGN as well as other slow but sensitive alignment programs can be applied to a much larger range of sequences.
Acknowledgements
M.B. would like to thank Serafim Batzoglou, Inna Dubchak, and Lior Pachter for valuable conversations during this study. The authors also gratefully acknowledge Serafim Batzoglou's comments on the manuscript.
References
[1] A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Comm. ACM, 18:333-340, 1975.
[2] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.
[3] V. Bafna and D. H. Huson. The conserved exon method for gene finding. In R. Altmann, T. Bailey, P. Bourne, M. Gribskov, T. Lengauer, I. Shindyalov, L. T. Eyck, and H. Weissig, editors, Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA, 2000. AAAI Press.
[4] S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger, and E. S. Lander. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research, 10(7):950-958, 2000.
[5] C. M. Bergman and M. Kreitman. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Research, 11:1335-1345, 2001.
[6] M. Blanchette, B. Schwikowski, and M. Tompa. Algorithms for phylogenetic footprinting. Journal of Computational Biology, 9(2):211-23, 2002.
[7] M. Blanchette and M. Tompa. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Research, 12(5):739-48, 2002.
[8] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic Acids Res., 27(11):2369-2376, 1999.
[9] E. Fredkin. Trie memory. Comm. ACM, 3:490-500, 1960.
[10] M. S. Gelfand, A. A. Mironov, and P. A. Pevzner. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA, 93(17):9061-9066, 1996.
[11] B. Gottgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn, S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and A. Green. Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnology, 18:181-186, 2000.
[12] B. Gottgens, L. Barton, M. Chapman, A. Sinclair, B. Knudsen, D. Grafham, J. Gilbert, J. Rogers, D. Bentley, and A. Green. Transcriptional regulation of the Stem Cell Leukemia gene (SCL): Comparative analysis of five vertebrate SCL loci. Genome Research, 12:749-759, 2002.
[13] B. Gottgens, J. Gilbert, L. Barton, D. Grafham, J. Rogers, D. Bentley, and A. Green. Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res., 11:87-97, 2001.
[14] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997.
[15] R. Hardison, J. L. Slightom, D. L. Gumucio, M. Goodman, N. Stojanovic, and W. Miller. Locus control regions of mammalian β-globin gene clusters: combining phylogenetic analyses and experimental results to gain functional insights. Gene, 205:73-94, 1998.
[16] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18:314-343, 1975.
[17] N. Jareborg, E. Birney, and R. Durbin. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Research, 9:815-824, 1999.
[18] S. Karlin and S. F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90:5873-5877, 1993.
[19] W. J. Kent and A. M. Zahler. Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Research, 10(8):1115-1125, 2000.
[20] I. Korf, P. Flicek, D. Duan, and M. R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics, 17:S140-S148, 2001.
[21] S. Kurtz, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Computation and visualization of degenerate repeats in complete genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pages 228-238, Menlo Park, CA, 2000. AAAI Press.
[22] S. Kurtz and C. Schleiermacher. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5):426-427, 1999.
[23] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136-140, 2000.
[24] B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15:211-218, 1999.
[25] B. Morgenstern. A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Applied Mathematics Letters, 15:11-16, 2002.
[26] B. Morgenstern, A. W. M. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93:12098-12103, 1996.
[27] B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer, A. Dress, and H.-W. Mewes. Exon discovery by genomic sequence alignment. Bioinformatics, in press.
[28] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85:2444-2448, 1988.
[29] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Comm. ACM, 33:668-676, 1990.
[30] O. Rinner and B. Morgenstern. AGenDA: Gene prediction by comparative sequence analysis. In Silico Biology, 2:0018, 2002.
[31] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. PipMaker - a web server for aligning two genomic DNA sequences. Genome Research, 10:577-586, 2000.
[32] J. Stoye. Multiple sequence alignment with the divide-and-conquer method. Gene, 211:GC45-GC56, 1998.
[33] T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds, and R. Guigó. SGP-1: Prediction and validation of homologous genes based on sequence alignments. Genome Research, 11:1574-1583, 2001.