Michael Brudno¹ and Burkhard Morgenstern²

¹ 110 Gates Building, Computer Science Department, Stanford University, Stanford, CA 94305
² International Graduate School in Bioinformatics and Genome Research, Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany

Email: [email protected] (Stanford.EDU), [email protected] (Uni-Bielefeld.DE)

Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB’02) 0-7695-1653-X/02 $17.00 © 2002 IEEE

Abstract

Comparative analysis of syntenic genome sequences can be used to identify functional sites such as exons and regulatory elements. Here, the first step is to align two or several evolutionarily related sequences and, in recent years, a number of computer programs have been developed for alignment of large genomic sequences. Some of these programs are extremely fast, but often time-efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast heuristic identifies a chain of strong sequence similarities that serve as anchor points. In a second step, regions between these anchor points are aligned using a slower but more sensitive method. We present CHAOS, a novel algorithm for rapid identification of chains of local sequence similarities among large genomic sequences. Similarities identified by CHAOS are used as anchor points to improve the running time of the DIALIGN alignment program. Systematic test runs show that this method can reduce the running time of DIALIGN by more than 93% while affecting the quality of the resulting alignments by only 1%. The source code for CHAOS is available at http://www.stanford.edu/~brudno/chaos/ An integrated program package containing CHAOS and DIALIGN is available at http://bibiserv.techfak.uni-bielefeld.de/dialign/

1. Introduction

Cross-species sequence comparison is playing an increasingly important role in genome analysis and annotation. The functional parts of the genome are under selective pressure, and therefore evolve more slowly than non-functional parts, where random mutations can be tolerated without affecting the evolutionary fitness of the organism. Consequently, islands of local sequence conservation often correspond to functional elements. Comparative sequence analysis has been used for a variety of purposes, e.g. for gene prediction [10, 3, 4, 20, 33, 30, 27] and identification of regulatory elements [15, 17, 23, 11, 12, 13]. One major advantage of these approaches is that they are based on simple measurement of sequence similarity and require little additional information about the elements to be detected. While more traditional statistical methods need large sets of training data to construct species-specific models of genes or regulatory elements, the comparative methods essentially depend on the availability of syntenic sequences at an appropriate evolutionary distance, making them effective for analysis of newly sequenced genomes, when little training data is available.

If syntenic sequences are to be compared, the most straightforward approach is to compile a list of sequence similarities identified by a local alignment method such as BLAST; these similarities can then be ranked according to some quality measure, e.g. by their statistical significance. One problem with this method is that a stringent cut-off criterion needs to be applied in order to eliminate spurious similarities, but such a threshold would also exclude many weak but biologically important similarities.

It is well known that in eukaryotes large-scale genome rearrangements are relatively rare events during evolution; therefore the relative order of genes remains conserved across fairly large evolutionary distances. This fact can be used to reduce the noise of false positive similarities in comparative genome analysis.
Rather than considering all local sequence similarities above some (arbitrary) threshold value, one can search for chains of similarities, i.e. for sets of similarities that respect the relative order within the sequences under study. This combinatorial constraint can considerably reduce the noise of false positive similarities without preventing weak but biologically important similarities from being detected. Consequently, a new challenge in sequence analysis is construction of global alignments of large genomic sequences.

Recently a number of alignment algorithms have been proposed that combine local and global alignment features by returning ordered chains of local similarities. Since these algorithms must be able to deal with large sequences, a trade-off between sensitivity and speed is necessary. Some approaches for genomic alignment use suffix-tree or hashing algorithms to identify pairs of k-mers of a certain minimum length (and, possibly, a maximum number of mismatches) [8, 22, 21]. These approaches are extremely time-efficient but are most effective at aligning sequences from closely related genomes, e.g. from different strains of a bacterium. Other approaches are more sensitive and have been successfully used to compare more distantly related organisms, but these methods are far more time-consuming and are currently restricted to sequences of a few hundred kb in length [19, 25, 27]. A third approach is used in the PipMaker set of tools, where a local alignment program implementing a gapped BLAST algorithm (BLASTZ) is used.

Another possible way of combining speed and sensitivity for genomic alignment is to (i) use a fast heuristic to identify a chain of high-scoring sequence similarities that can be used as anchor points for the final alignment, and then (ii) use a more sensitive method to align the regions that are left over between these anchor points. Such an approach has been proposed, for example, by Batzoglou et al. These authors developed a computer program called GLASS that aligns genomic sequences based on matching k-mers. A recursive procedure is used where the minimum length for matching k-mers is decreased at every level of the recursion.

Generally, if sequences of length $L_1$ and $L_2$, respectively, are to be aligned, a first heuristic procedure would define anchor points $x_0 = 1, x_1, \ldots, x_N = L_1$ in the first sequence and $y_0 = 1, y_1, \ldots, y_N = L_2$ in the second sequence. This way, the search space for the final alignment procedure is reduced to $\sum_{i=1}^{N} (x_i - x_{i-1}) \times (y_i - y_{i-1})$ compared to $L_1 \times L_2$ for the non-anchored procedure. The idea is that strong local sequence similarities identified by fast heuristics can be expected to be part of an optimal alignment for any reasonable objective function. Thus, if these similarities are used as anchor points for the final alignment procedure, the resulting alignment would still be optimal or near-optimal with respect to the objective function employed in the second step. This approach is related to divide-and-conquer strategies for sequence alignment [16, 32].

Obviously, the denser a chain of anchor points $(x_0, y_0), \ldots, (x_N, y_N)$ is, the higher is the reduction of the search space and the gain in speed for the final procedure. On the other hand, too many anchor points could overly restrict the search space and result in decreased alignment quality. The main challenge in the anchored-alignment approach is therefore to find a trade-off between speed and alignment quality, that is, to locate anchor points that are as dense as possible while still leading to optimal or near-optimal alignments.

In this paper, we present a fast local alignment tool called CHAOS (CHAins Of SCores). For a pair of input sequences, CHAOS returns a chain of local sequence alignments that can be used as anchor points to reduce the search space and improve the running time of any sensitive global alignment procedure. Herein, we show how these anchor points can be used to speed up the DIALIGN alignment program. Systematic test runs demonstrate that this way the running time of DIALIGN can be reduced by 1 to 2 orders of magnitude while the quality of the resulting alignments is only minimally affected. Moreover, the relative improvement in speed increases with the length of the input sequences, making our approach particularly effective for alignment of large genomic sequences.

2. Algorithm

The CHAOS algorithm works by chaining together pairs of similar regions, one region from each of the two input DNA sequences; we call such pairs of regions seeds. More precisely, a seed is a pair of words of length k with at least n identical base pairs (bp). A seed $S^{(1)}$ can be chained to another seed $S^{(2)}$ whenever (i) the indices of $S^{(1)}$ in both sequences are higher than the indices of $S^{(2)}$, and (ii) $S^{(1)}$ and $S^{(2)}$ are "near" each other, with "near" defined by both a distance and a gap criterion as illustrated in Figure 1. The final score of a chain is the total number of matching bp in it. The default parameters used by CHAOS are words of length 7, with no degeneracy, a distance and gap criterion of 20 and 5 bp respectively, and a score cutoff of 25.

The seeds are located using a simplified version of the Aho-Corasick [1] algorithm. A variation on the trie data structure, which we call a threaded trie (T-trie), is used to store the k-mers of one sequence. A trie is a tree for storing strings in which there is one node for every common prefix. A node which corresponds to the word $W_1 \ldots W_p$ would have as its parent a node that corresponds to $W_1 \ldots W_{p-1}$. In a trie that contains all of the k-mers of some string, each leaf is at height k, and in each leaf we store all of the locations where this k-mer occurs in the indexed sequence.

A T-trie differs from a regular trie in that a node that corresponds to the string $W_1 \ldots W_p$ will also have a back pointer to the node which corresponds to $W_2 \ldots W_p$. We start by inserting into the T-trie all of the k-mers of one of the sequences, which we will call the database. Then we do a "walk" using the other, query sequence, where we start by making the root of the T-trie our current node, and for every letter of the query:

1. if the current node has a child corresponding to this letter we make this child our current node, and return any seeds stored in it.
2. otherwise make the node pointed to by our back pointer our current node, and return to step 1.

As an illustration of why this method works well in practice, assume that all of the possible k-mers are present in the database (which is most likely the case). Then, finding the k-mers that correspond to the next letter of the query requires only two pointer operations: the first is to follow a back pointer from the k level node which is our current node, the second to follow a down pointer from the resulting node to the appropriate child. To allow degeneracy we permit multiple current nodes, which correspond to the possible degenerate words.

Let D be the maximum distance between two adjacent seeds. The seeds generated while examining the last D base pairs of the query sequence (Figure 1) are stored in a skip list, a probabilistic data structure that allows for fast searches and easy in-order traversal of its elements. The seeds are ordered by the difference of their indices in the two sequences (diagonal number). For each seed s found at the current location, we do a search in the skip list for previously stored seeds which have diagonal numbers within the permitted gap criterion of the diagonal number of s. We thus find the possible previous seeds with which s can be chained. The highest scoring chain is picked, and this chain can be further extended by future seeds. In order to enforce the distance criterion we then remove from the skip list all seeds which were generated at least D base pairs before the position at which the new seeds are, and insert the new seeds into the skip list.

CHAOS can be used as a stand-alone program for local sequence alignment or as a pre-processing step to find anchor points for any global alignment procedure. In this case, once all of the local alignments are identified, we use an algorithm based on the longest increasing subsequence problem to find the highest scoring monotonically increasing subsequence of alignments. In the present study we used such chains of local alignments identified by CHAOS as anchor points for DIALIGN.

3. Results

To test our method, we used a set of 42 sequence pairs from human and mouse as compiled by Jareborg et al.; they vary in length between less than 6 kb and more than 227 kb, with an average length of 38 kb. These sequences were used in a recent paper for a systematic comparison of five different software tools for genomic alignment; it was shown that DIALIGN was the most specific of these methods, though it was considerably slower than the other methods. In the present paper, we do not repeat the comparison of these different alignment methods. Instead, we investigate how the running time of DIALIGN is improved by the anchoring procedure described above and how this affects the quality of the output alignments.
First, we applied CHAOS to our data in order to obtain chains of anchor points. Next, we aligned the sequence pairs with DIALIGN, first without anchoring and then using the anchor points identified by CHAOS. We measured the running time for aligning our test sequences with and without anchoring and compared the quality of the resulting alignments. DIALIGN was run with the translation option, where local similarity among DNA sequences is compared at the peptide level.

When CHAOS was run with default parameters, the density of the returned anchor points was, on average, 2.1 anchor points per kb; the total CPU time for running CHAOS on all the 42 sequence pairs was 94 s on a SUN Ultra Enterprise 420 with a 450 MHz CPU. The total CPU time for aligning the 42 sequence pairs with the standard version of DIALIGN (i.e. without anchor points) was 179,001 s, i.e. more than two days. By contrast, running DIALIGN using the anchor points found by CHAOS with the default cut-off value took only 11,391 s of CPU time. Thus, the combined running time for the anchored alignment procedure (i.e. the time necessary for finding the anchor points plus the running time of the final alignment procedure using these anchor points) was only 6.4% of the original running time of DIALIGN.

Figure 1: The figure shows a matrix representation of sequence alignment. The seed shown can be chained to any seed which lies inside the search box. All seeds located less than distance bp from the current location are stored in a skip list, in which we do a range query for seeds located within a gap cutoff from the diagonal on which the current seed is located. The seeds located in the grey areas cannot be chained to, in order to make the algorithm independent of sequence order. (Labels in the diagram: seed, query, distance cutoff, 2x gap cutoff, search box, location in query, range of search.)
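The chaining rule illustrated in Figure 1 can be sketched in a few lines of Python. This is only a sketch under several assumptions that are not taken from the paper: seeds are exact k-mer matches given as (database position, query position) pairs, the distance criterion is measured from the end of the previous seed to the start of the next one in both sequences, overlapping seeds are not chained, and a plain quadratic scan over earlier seeds stands in for the skip list. The function name `best_chain` is hypothetical; the default values mirror the stated CHAOS defaults (k = 7, distance 20, gap 5).

```python
def best_chain(seeds, k=7, distance=20, gap=5):
    """Return (score, chain) for the highest-scoring chain of exact k-mer seeds.

    Each seed is a (database_pos, query_pos) pair; the score of a chain is the
    number of matching bp, here k per seed because seeds may not overlap.
    """
    order = sorted(seeds, key=lambda s: (s[1], s[0]))   # by query position
    if not order:
        return 0, []
    score = [k] * len(order)      # best chain score ending at each seed
    prev = [-1] * len(order)      # predecessor index in that best chain
    for i, (db, q) in enumerate(order):
        for j in range(i):
            pdb, pq = order[j]
            dq = q - (pq + k)     # unmatched bp between the seeds in the query
            ddb = db - (pdb + k)  # unmatched bp between the seeds in the database
            near = 0 <= dq <= distance and 0 <= ddb <= distance
            same_band = abs((db - q) - (pdb - pq)) <= gap   # diagonal shift
            if near and same_band and score[j] + k > score[i]:
                score[i] = score[j] + k
                prev[i] = j
    end = max(range(len(order)), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:              # walk the predecessor links back to the start
        chain.append(order[end])
        end = prev[end]
    return best, chain[::-1]
```

For example, `best_chain([(0, 0), (10, 9), (20, 21), (100, 30), (12, 200)])` links the first three seeds, while the last two are rejected by the distance criterion. A caller would then compare the returned score with a cutoff before accepting the chain.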
If CHAOS is run with decreased cut-off parameters, larger chains of anchor points are found. Thereby the search space for DIALIGN is further decreased and the running time is improved accordingly. On the other hand, this option leads to slightly decreased quality of the final alignments, as shown in Table 1.

To compare the quality of the anchored alignments to the non-anchored ones, we used two different measures of alignment quality. First, we considered the numerical alignment score that is employed by the DIALIGN program. DIALIGN constructs alignments by assembling pairs of un-gapped segments, so-called fragments. Each possible fragment is given a quality score and, for pairwise alignment, the program identifies a chain of fragments with maximum total score; the scores of the fragments are defined based on probabilistic considerations. We considered the total sum of scores of the alignments produced with and without the anchoring option, respectively. For the non-anchored DIALIGN alignments, the sum of scores was 54,214. If CHAOS was applied with the default cut-off in order to anchor the DIALIGN alignment, the total sum of scores was 53,658, i.e. the numerical alignment score was reduced by only 1.02% while the running time was reduced by more than 93%.

Next, we tried to compare the biological quality of the returned alignments. The 42 sequence pairs contain a total of 77 known gene pairs, and we investigated to what extent the alignments with and without the anchoring option were able to identify protein-coding regions. We compared the different alignments at the nucleotide level, i.e. a nucleotide that is part of a selected fragment is considered a true positive (TP) if it is also part of an annotated exon and a false positive (FP) if it is not; true and false negatives (TN and FN) are defined accordingly.
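The nucleotide-level evaluation can be sketched as follows, together with the sensitivity, specificity and approximate correlation measures reported in this evaluation. This is a minimal illustration, not the evaluation code behind the reported numbers: the function names and the representation of aligned and exonic positions as Python sets are assumptions, and the sketch assumes every denominator is non-zero.

```python
def confusion_counts(aligned, exonic, length):
    """Count TP, FP, FN, TN over the positions 0 .. length-1 of one sequence.

    `aligned` holds positions covered by selected fragments, `exonic` holds
    positions inside annotated exons (both as sets of integers).
    """
    tp = len(aligned & exonic)    # aligned and exonic
    fp = len(aligned - exonic)    # aligned but not exonic
    fn = len(exonic - aligned)    # exonic but not aligned
    tn = length - tp - fp - fn    # neither aligned nor exonic
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity, approximate correlation)."""
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    approx_corr = 0.5 * (sensitivity + specificity
                         + tn / (tn + fp) + tn / (tn + fn)) - 1
    return sensitivity, specificity, approx_corr
```

With a 20 bp toy sequence, `confusion_counts(set(range(5, 15)), set(range(10, 18)), 20)` yields the four counts, which can be passed directly to `accuracy`.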
We used the usual measures for prediction accuracy, namely sensitivity $= TP/(TP + FN)$, specificity $= TP/(TP + FP)$, and approximate correlation $= 0.5\,\bigl(TP/(TP + FN) + TP/(TP + FP) + TN/(TN + FP) + TN/(TN + FN)\bigr) - 1$. The sensitivity of DIALIGN without the anchoring option was 83.68%, the specificity was 40.06%, and the approximate correlation was 57.31%. Note that DIALIGN is a general-purpose alignment program and does not attempt to find exact exon boundaries. Therefore, these results roughly reflect the ability of DIALIGN to identify biologically functional regions, but they cannot be compared with specialized gene-finding programs. With the new anchoring option, the sensitivity of DIALIGN was 83.44%, the specificity was 40.53%, and the approximate correlation was 57.51%. Here, the results for the anchored alignments were even slightly better than for the non-anchored ones, though the difference in quality was minimal. By comparison, for CHAOS alone (i.e. without using DIALIGN in the second step), sensitivity was 50%, specificity was 45% and approximate correlation was only 44%.

Finally, we wanted to know how the relative improvement in program running time that we achieved by our anchoring method depends on the length of the input sequences. The main benefit of reduced running time of DIALIGN is that this way the program becomes applicable to genomic sequences that are currently beyond its scope, so we wanted to estimate the behavior of the running time for very long sequences. Let $d$ be the average distance between two anchor points and let $L$ be the maximum sequence length. If the anchors are evenly spaced over the length of the sequences, one has approximately $L/d$ anchor points while the search space between any two adjacent anchor points is $d^2$.
Thus, the search space for the anchored alignment procedure would be $L/d \times d^2 = L \times d$ compared to $L^2$ for the non-anchored algorithm. The running time for our algorithm would therefore grow linearly rather than quadratically with the sequence length, so the relative improvement in running time that is achieved by using anchor points grows with the sequence length. In reality, it is difficult to estimate the distribution of distances between anchor points since this depends, of course, on the sequences being compared. Nevertheless, for our data we could confirm that the improvement in running time was larger for longer sequences (Figure 2).

[Figure 2: scatter plot; x-axis: sequence length (0 to 250,000); y-axis: relative running time (0 to 0.9).]

Figure 2: Relative improvement in program running time for 42 pairs of genomic sequences from human and mouse of different length. Each point represents one sequence pair. The x-axis is the average sequence length of each sequence pair, while the y-axis is the relative running time of the anchored alignment procedure compared to the non-anchored procedure.

CHAOS-DIALIGN results

program  cut-off  anc./kb  CPU (s)   %CPU   score    % score  Sn  Sp  AC
D        -        -        179,001   100.0  54,214   100.0    83  40  57
C+D      35       1.4       14,334     8.0  53,839    99.3    83  40  57
C+D      30       1.7       11,717     6.5  53,820    99.2    83  40  57
C+D      25       2.1       11,485     6.4  53,654    98.9    83  40  57
C+D      20       2.8        8,964     5.0  53,642    98.9    83  40  57
C+D      15       4.2        7,404     4.1  53,208    98.1    82  41  57
C+D      10       6.5        6,696     3.7  52,684    97.1    82  41  57
C        35       -             92     -    -         -       43  48  42
C        30       -             92     -    -         -       46  47  43
C        25       -             94     -    -         -       50  45  44
C        20       -             93     -    -         -       53  43  44
C        15       -             93     -    -         -       57  39  43
C        10       -             94     -    -         -       60  34  42

Table 1: Total CPU time and alignment quality for DIALIGN (D), DIALIGN anchored with CHAOS (C+D) and CHAOS (C) applied to a set of 42 pairs of genomic sequences from human and mouse. CHAOS was run with varying cut-off parameters.
Lower cut-off values for CHAOS produced higher numbers of anchor points, resulting in a decreased search space for the final DIALIGN alignment procedure and thus leading to improved running time but slightly decreased alignment quality. The average number of anchor points per kilobase is shown (anc./kb). Score is the total numerical score of all produced DIALIGN alignments, i.e. the sum of the scores of the segment pairs the alignments consist of. As a rough measure of the biological quality of the produced alignments, we compared local sequence similarities identified by DIALIGN and CHAOS to known protein-coding regions. Here, Sn, Sp and AC are sensitivity, specificity and approximate correlation, respectively. For the D and C+D results, DIALIGN was evaluated by comparing all segment pairs contained in the alignment to annotated exons.

4. Discussion

CHAOS is a novel heuristic local alignment program for large genomic sequences. Most of the methods for heuristic local alignment, such as BLAST and FASTA, were developed when the bulk of available sequences were proteins. It has been shown that such algorithms are not as efficient in aligning non-coding sequences. With the new availability of genomic sequences it is appropriate to refine the algorithms used for local alignment so that they more closely reflect the fashion in which genomic sequences are conserved. Unlike other fast algorithms for genomic alignment, CHAOS does not depend on long exact matches, does not require extensive ungapped homology, and allows mismatches in seeds, all of which are important when comparing distantly related organisms or non-coding regions, where conservation is generally much poorer than in coding areas.

Some previous algorithms for anchored global alignment have worked by first identifying very strong local similarities among the input sequences and adding weaker similarities later. The problem with this approach is that one high-scoring spurious match can lead to a wrong output alignment where many weaker but biologically important homologies may be missed. By contrast, CHAOS searches for the highest scoring chain of local alignments. This way, a numerically high-scoring but biologically wrong local alignment can be counterbalanced by several weaker local alignments if the total score of these alignments exceeds the score of the one wrong alignment.

We demonstrate that the chains of local alignments returned by CHAOS can be used to anchor the DIALIGN alignment procedure, significantly improving the alignment speed without affecting the quality of the resulting alignments. To compare the quality of the anchored and non-anchored alignments, we compared the numerical scores of the resulting alignments as well as their biological quality. The numerical scores that we used are the scores for alignment quality as employed by DIALIGN. That is, we measured the sum of the scores of the segment pairs an alignment is composed of. The biological alignment quality was measured by comparing (local) alignments to existing protein-coding exons. This is, admittedly, a very rough measure of alignment quality as non-coding functional elements are completely ignored. However, in the absence of reliable benchmark data for genomic sequence alignment, comparing identified sequence similarities to known exons gives us an effective approximation for assessing the performance of different alignment methods.

Several studies have shown that DIALIGN is highly efficient in identifying coding regions and regulatory elements in genomic sequences [11, 13, 27, 6, 7], and a new gene-prediction program called AGenDA (Alignment-based Gene Detection Algorithm) has recently been developed that takes DIALIGN alignments as input information. However, DIALIGN used to be far slower than other programs for genomic alignment, so its applicability was limited to sequences of a few hundred kilobases. With the new anchoring option and anchor points created by CHAOS, DIALIGN as well as other slow but sensitive alignment programs can be applied to a much larger range of sequences.

Acknowledgements

M.B. would like to thank Serafim Batzoglou, Inna Dubchak, and Lior Pachter for valuable conversations during this study. The authors also gratefully acknowledge Serafim Batzoglou's comments on the manuscript.

References

[1] A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Comm. ACM, 18:333-340, 1975.
[2] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.
[3] V. Bafna and D. H. Huson. The conserved exon method for gene finding. In R. Altmann, T. Bailey, P. Bourne, M. Gribskov, T. Lengauer, I. Shindyalov, L. T. Eyck, and H. Weissig, editors, Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA, 2000. AAAI Press.
[4] S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger, and E. S. Lander. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research, 10(7):950-958, 2000.
[5] C. M. Bergman and M. Kreitman. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Research, 11:1335-1345, 2001.
[6] M. Blanchette, B. Schwikowski, and M. Tompa. Algorithms for phylogenetic footprinting. Journal of Computational Biology, 9(2):211-23, 2002.
[7] M. Blanchette and M. Tompa. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Research, 12(5):739-48, 2002.
[8] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic Acids Res., 27(11):2369-2376, 1999.
[9] E. Fredkin. Trie memory. Comm. ACM, 3:490-500, 1960.
[10] M. S. Gelfand, A. A. Mironov, and P. A. Pevzner. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA, 93(17):9061-9066, 1996.
[11] B. Göttgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn, S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and A. Green. Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnology, 18:181-186, 2000.
[12] B. Göttgens, L. Barton, M. Chapman, A. Sinclair, B. Knudsen, D. Grafham, J. Gilbert, J. Rogers, D. Bentley, and A. Green. Transcriptional regulation of the Stem Cell Leukemia gene (SCL): comparative analysis of five vertebrate SCL loci. Genome Research, 12:749-759, 2002.
[13] B. Göttgens, J. Gilbert, L. Barton, D. Grafham, J. Rogers, D. Bentley, and A. Green. Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res., 11:87-97, 2001.
[14] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997.
[15] R. Hardison, J. L. Slightom, D. L. Gumucio, M. Goodman, N. Stojanovic, and W. Miller. Locus control regions of mammalian β-globin gene clusters: combining phylogenetic analyses and experimental results to gain functional insights. Gene, 205:73-94, 1998.
[16] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18:314-343, 1975.
[17] N. Jareborg, E. Birney, and R. Durbin. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Research, 9:815-824, 1999.
[18] S. Karlin and S. F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90:5873-5877, 1993.
[19] W. J. Kent and A. M. Zahler. Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Research, 10(8):1115-1125, 2000.
[20] I. Korf, P. Flicek, D. Duan, and M. R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics, 17:S140-S148, 2001.
[21] S. Kurtz, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Computation and visualization of degenerate repeats in complete genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pages 228-238, Menlo Park, CA, 2000. AAAI Press.
[22] S. Kurtz and C. Schleiermacher. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5):426-427, 1999.
[23] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136-140, 2000.
[24] B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15:211-218, 1999.
[25] B. Morgenstern. A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Applied Mathematics Letters, 15:11-16, 2002.
[26] B. Morgenstern, A. W. M. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93:12098-12103, 1996.
[27] B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer, A. Dress, and H.-W. Mewes. Exon discovery by genomic sequence alignment. Bioinformatics, in press.
[28] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85:2444-2448, 1988.
[29] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Comm. ACM, 33:668-676, 1990.
[30] O. Rinner and B. Morgenstern. AGenDA: Gene prediction by comparative sequence analysis. In Silico Biology, 2:0018, 2002.
[31] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. PipMaker - a web server for aligning two genomic DNA sequences. Genome Research, 10:577-586, 2000.
[32] J. Stoye. Multiple sequence alignment with the divide-and-conquer method. Gene, 211:GC45-GC56, 1998.
[33] T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds, and R. Guigó. SGP-1: Prediction and validation of homologous genes based on sequence alignments. Genome Research, 11:1574-1583, 2001.