EFFICIENT CONSTRUCTION OF ACCURATE MULTIPLE ALIGNMENTS AND LARGE-SCALE PHYLOGENIES by Travis John Wheeler A Dissertation Submitted to the Faculty of the DEPARTMENT OF COMPUTER SCIENCE In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY In the Graduate College THE UNIVERSITY OF ARIZONA 2009 2 THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Travis John Wheeler entitled Efficient Construction of Accurate Multiple Alignments and Large-Scale Phylogenies and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy. Date: 08/18/09 John D. Kececioglu Date: 08/18/09 Michael J. Sanderson Date: 08/18/09 Alon Efrat Date: 08/18/09 David R. Maddison Date: 08/18/09 Bongki Moon Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College. We hereby certify that we have read this dissertation prepared under our direction and recommend that it be accepted as fulfilling the dissertation requirement. Date: 08/18/09 Dissertation Director: John D. Kececioglu Date: 08/18/09 Dissertation Director: Michael J. Sanderson 3 STATEMENT BY AUTHOR This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author. SIGNED: Travis John Wheeler 4 ACKNOWLEDGEMENTS I owe a debt of gratitude to the many inspirational and supportive people who have enabled this work, a few of whom are mentioned here. I grew up surrounded by books, and by parents who not only preached life-long learning, but live it. Both entered teaching roles in academics late in life, and probably won’t leave until a visit from the Grim Reaper. That dedication is in my blood, and I’m thankful for it. None of this would have been possible without their nurturing support and encouragement. I also had the good fortune to have what you might call a second tier of parental figures, in the parents of my close friends Sam Mordka and Ephron Rosenzweig. I spent a great deal of time at their homes in my formative years, and Maurice Mordka and Michael Rosenzweig both provided significant encouragement and important guidance. While an undergraduate English major, I had the good fortune of taking a lifechanging course from Stephen Zegura. He certainly doesn’t remember me, but his enthusiastic and brilliant introduction to genetics in that class, and the two others I later took with him, paved the way for the path I have followed since. I worked for a couple years as an undergraduate research assistant with Yaron Ziv, then a PhD student in Michael Rosenzweig’s lab. Like many in Michael’s lab, Yaron had entered graduate studies after several years out of the academic system. Together, they taught me a fair amount about the science of ecology, but a lot more about life - especially life as an older PhD student (which is what I became, after going almost 10 years between earning my bachelor’s and starting my PhD). After my undergraduate degree, I went to work at Intuit. Nicki Graham and Stephanie Yablonski were instrumental in allowing me to teach myself computer programming on the company’s dime (sure, I was producing results . . . but computer programming felt more like fun than work). While at Intuit, I met my peerless wife, Karen Tempkin, whose love and support have made me a better person. Her return to school for her MBA was the final reminder that it was time to go back to academia, and it was clear that the field should be at the intersection of old and new joys, genetics (now genomics) and computer programming. David Maddison was kind enough to hire me to design and develop a databasedriven upgrade to the Tree of Life Web Project (ToL, http://tolweb.org) while I took undergraduate courses in computer science to pave the way for my eventual graduate college admission. He and Katja Schultz introduced me to much of what I know about phylogenetics. David has continued his encouragement and support since I left the ToL, now serving on my PhD committee. 5 Before entering the CS PhD program, I was lucky enough to sit in on a course on computational biology, taught by John Kececioglu. The topic was enthralling, and John’s excellent lectures made it accessible to a newcomer to the field. I joined his research group’s weekly discussions, and knew I’d found a home. Since then, John has always been available whenever I’ve needed him, and he’s been a constant source of guidance, encouragement, and information. He is the deepest thinker I’ve known, and I can only hope that some of that has rubbed off on me. The first part of this dissertation describes work done under his advisement, and it’s good qualities are a testament to his excellent guidance. During the course of my studies, I became increasingly interested in methods of statistical inference and machine learning, and was welcomed into the discussions of Kobus Barnard’s research group. I am thankful for the inviting attitude of the group, and for the insight shared. About three years ago, Michael Sanderson became my co-advisor. We’ve had innumerable enlightening conversations, covering a wide range of topics. One of those topics led to the work described in the second part of the dissertation. Mike has been an invigorating presence in my life, and I thank him for that. In a more mundane matter, I also thank him for allowing me access to his computing cluster; the experiments given in the paper are computationally demanding, and were to a great extent only possible because of his cluster. I’ve made a great many friends in John’s and Mike’s research groups, the CS department, and also through the IGERT-in-genomics fellowships that have funded almost all of my graduate studies. I’m sure I’ll unfairly leave someone off this list, but I’ll always appreciate my stimulating, enlightening, and entertaining discussions with Dean Starrett, Eagu Kim, Somu Perianayagam, Jeff Good, [SC]eline Hayden, Matt Dean, Joel Wertheim, Patrick Degnan, David Hearn, Karen Cranston, and Darren Boss. Finally, I thank Misawa Katoh for assistance with re-engineering MAFFT, specifically for the experiments described in Section 4.6, and Morgan Price for many recent discussions involving fast phylogeny inference in NINJA and FastTree. During the course of my studies, I have been supported by the NSF through three separate grants: (1) a PhD Fellowship from the University of Arizona NSF IGERT Genomics Initiative Grant DGE-0114420 (3.5 years), (2) Grant DBI0317498 (1 year), and (3) a PhD Fellowship from the University of Arizona NSF IGERT Comparative Genomics Initiative Grant DGE-0654435 (1 year). 6 This work is dedicated to three generations of family: To my mother, Kathy Abramowitz, and father, Norty Wheeler: I didn’t need to look very far for examples of commitment to lifelong learning and contribution to the greater good. Their love and nurturing were everything I could have hoped for. To my wife, Karen Tempkin: it’s possible that I could have made this academic journey without her in my life, but I’d have been a much lonelier, poorer, and grumpier person without her love, support, and supreme patience. I also wouldn’t have a third generation to include in this dedication. To my children, Elliot and Allison Wheeler: they didn’t exactly improve the quality of my research . . . but that’s not the point, is it? It’s pretty hard to take yourself too seriously when your kids think your greatest talents are tickling and making fart noises. 7 TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . 15 1.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.2. Introduction to sequence alignment . . . . . . . . . . . . . . . . 18 1.3. Introduction to phylogeny inference . . . . . . . . . . . . . . . . 22 PART 1 ACCURATE MULTIPLE ALIGNMENT WITH THE FORM- AND-POLISH STRATEGY . . . . . . . . . . . . . . . . . . . . . . . . 26 CHAPTER 2: BEST-OF-BREED METHODS FOR FORM-AND-POLISH ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.1. Survey of methods and tools . . . . . . . . . . . . . . . . . . . . 28 2.2. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4. Merge tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.1. Grouping sequences . . . . . . . . . . . . . . . . . . . . 32 2.4.2. Measuring distances . . . . . . . . . . . . . . . . . . . . 36 2.5. Merging alignments . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6. Sequence-pair weights . . . . . . . . . . . . . . . . . . . . . . . 40 2.7. Polishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 8 TABLE OF CONTENTS — Continued 2.8. Alignment consistency . . . . . . . . . . . . . . . . . . . . . . . 47 2.9. Parameter advisor . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.10. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 CHAPTER 3: WEIGHTS FOR SEQUENCE-PAIRS RELATED BY A TREE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.1. Prior approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.2. Influence weights . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.3. Estimation of edge lengths . . . . . . . . . . . . . . . . . . . . . 74 3.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 CHAPTER 4: IMPROVING MULTIPLE ALIGNMENT THROUGH PAIRWISE ALIGNMENT CONSISTENCY . . . . . . . . . . . . . . . . 78 4.1. Prior approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.2. Suboptimality as a measure of support for alignment features . . 87 4.2.1. Modified substitution score . . . . . . . . . . . . . . . . 88 4.2.2. Modified gap extension scores . . . . . . . . . . . . . . 90 4.2.3. Modified per-gap scores . . . . . . . . . . . . . . . . . . 92 4.2.4. Support from multiple parameter choices . . . . . . . . 93 4.3. Alternate definitions for consistency-modified costs . . . . . . . 94 4.3.1. Imbalanced suboptimality blend variants . . . . . . . . 95 4.3.2. Balanced suboptimality blend variants . . . . . . . . . 95 9 TABLE OF CONTENTS — Continued 4.4. Easing the computational burden . . . . . . . . . . . . . . . . . 4.4.1. Computing consistency from C of fixed size . . . . . . 96 97 4.4.2. Computing consistency using a band around optimal alignments . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.4.3. Consistency only for small subtrees . . . . . . . . . . . 101 4.5. Incorporating consistency-modified costs into the algorithm for aligning alignments . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.5.1. Bound pruning . . . . . . . . . . . . . . . . . . . . . . 105 4.5.2. Dominance pruning . . . . . . . . . . . . . . . . . . . . 107 4.6. Experimental results . . . . . . . . . . . . . . . . . . . . . . . . 108 PART 2 LARGE-SCALE NEIGHBOR-JOINING PHYLOGENIES . . . . 113 CHAPTER 5: BACKGROUND ON NEIGHBOR-JOINING . . . . . . . . 114 5.1. Canonical neighbor-joining algorithm . . . . . . . . . . . . . . . 115 5.2. Properties of neighbor-joining . . . . . . . . . . . . . . . . . . . 116 5.3. Prior methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 CHAPTER 6: DATA STRUCTURES AND ALGORITHMS FOR SCALING UP NEIGHBOR-JOINING . . . . . . . . . . . . . . . . . . . 121 6.1. Restricting search of the distance matrix . . . . . . . . . . . . . 121 6.1.1. The d-filter . . . . . . . . . . . . . . . . . . . . . . . . 121 6.1.2. Overcoming memory limits . . . . . . . . . . . . . . . 124 6.2. Candidate handling . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.3. Algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . . 128 10 TABLE OF CONTENTS — Continued CHAPTER 7: EXPERIMENTAL VERIFICATION OF METHODS FOR SCALING UP NEIGHBOR-JOINING . . . . . . . . . . . . . . . . . . . 130 7.1. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 CHAPTER 8: FUTURE WORK AND CONCLUSIONS . . . . . . . . . . 140 8.1. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 8.2. Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . 142 APPENDIX A: CONSISTENCY GRAPHS . . . . . . . . . . . . . . . . . . 146 A.0.1. Subpath pairs consistent with substitution . . . . . . . 146 A.0.2. Subpath pairs consistent with gap extension . . . . . . 148 A.0.3. Subpath pairs consistent with gap-open and gap-close. 152 APPENDIX B: ON JAVA VERSUS C++ FOR SCIENTIFIC COMPUTING 173 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 11 LIST OF TABLES Table 2.1. Clustering methods for constructing the merge tree. . . . . . . 33 Table 2.2. Comparison of clustering methods. . . . . . . . . . . . . . . . . 35 Table 2.3. Comparison of distance methods. . . . . . . . . . . . . . . . . . 37 Table 2.4. Comparison of alignment methods. . . . . . . . . . . . . . . . . 40 Table 2.5. Comparison of weighting methods. . . . . . . . . . . . . . . . . 43 Table 2.6. Comparison of weighting methods on BAliBase refs 2 and 3. . 43 Table 2.7. Comparison of polishing methods. . . . . . . . . . . . . . . . . 47 Table 2.8. Comparison of consistency methods. . . . . . . . . . . . . . . . 49 Table 2.9. Comparison of parameter advisor methods. . . . . . . . . . . . 54 Table 2.10. Assessing the accuracy gains by selecting best-of-breed methods selected at each stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Table 2.11. Comparison of the accuracy of Opal to commonly-used tools on protein benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Table 2.12. Comparison of the accuracy of Opal to commonly-used tools on the RNA benchmark, BRAliBASE. . . . . . . . . . . . . . . . . . . . . . 58 Table 4.1. Comparison of consistency methods. All results are without polishing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Table 4.2. Accuracy of sub-tree consistency in MAFFT. . . . . . . . . . . . 110 Table 4.3. Impact of consistency methods in MAFFT. . . . . . . . . . . . . 112 Table 4.4. Impact of using consistency-modified scores in ProbCons. . . . 112 12 LIST OF FIGURES Figure 1.1. Descent with modification (substitutions and indels). . . . . . 16 Figure 1.2. Path in a pairwise alignment dynamic programming table. . . 21 Figure 2.1. Aligning alignments is a merging of two alignments . . . . . . 38 Figure 3.1. Oversampled groups can dominate sum-of-pairs score. . . . . 60 Figure 3.2. Anomalous results of covariance weights. . . . . . . . . . . . . 61 Figure 3.3. Anomalous results of division weights. . . . . . . . . . . . . . 65 Figure 3.4. Anomalous results of division weights. . . . . . . . . . . . . . 66 Figure 3.5. Calculating influence weights. . . . . . . . . . . . . . . . . . . 67 Figure 3.6. Graphical description of values used in influence calculation . 69 Figure 3.7. Effective number of sequences in a subtree. . . . . . . . . . . 70 Figure 3.8. Splitting wx between the subtrees of x. . . . . . . . . . . . . 71 Figure 3.9. Altschul [3] suggests that the weight wxy should approach 0. . 72 Figure 4.1. Features in dynamic programming table for alignment (A ∼ B). 81 Figure 4.2. Subpaths consistent with (ai ∼ bj ) . . . . . . . . . . . . . . . 89 Figure 4.3. Subpaths consistent with ai extending a gap after bj . . . . . 91 Figure 4.4. Subpaths consistent with ai opening a gap immediately after bj 93 Figure 4.5. Windows used to compute suboptimality. . . . . . . . . . . . 99 Figure 4.6. Window range is the intersection of windows. . . . . . . . . . 100 Figure 4.7. Using consistency only on nodes near leaves. . . . . . . . . . . 102 Figure 4.10. Stitching shapes for bound pruning. . . . . . . . . . . . . . . 106 Figure 4.11. Extra gap-open and gap-close costs in dominance pruning. . . 108 Figure 7.1. Candidates viewed during tree-building with and without filters 133 Figure 7.2. Difficult case case for d-filtering. . . . . . . . . . . . . . . . . 134 Figure 7.3. Performance of NINJA on 113 medium-sized Pfam inputs. . . . 136 Figure 7.4. Performance of NINJA on large Pfam inputs. . . . . . . . . . . 137 13 LIST OF FIGURES — Continued Figure A.1. Subpaths consistent with (ai ∼ bj ). . . . . . . . . . . . . . . . 147 Figure A.2. Subpaths consistent with ai extending a gap (1). . . . . . . . 148 Figure A.3. Subpaths consistent with ai extending a gap (2). . . . . . . . 149 Figure A.4. Subpaths consistent with ai extending a gap (3). . . . . . . . 150 Figure A.5. Subpaths consistent with ai extending a gap (4). . . . . . . . 151 Figure A.6. Subpaths consistent with ai opening a gap (1). . . . . . . . . 152 Figure A.7. Subpaths consistent with ai opening a gap (2). . . . . . . . . 153 Figure A.8. Subpaths consistent with a1 opening a gap (3). . . . . . . . . 154 Figure A.9. Subpaths consistent with ai opening a gap (4). . . . . . . . . 155 Figure A.10. Subpaths consistent with ai opening a gap (5). . . . . . . . . 156 Figure A.11. Subpaths consistent with ai opening a gap (6). . . . . . . . . 157 Figure A.12. Subpaths consistent with ai opening a gap (7). . . . . . . . . 158 Figure A.13. Subpaths consistent with ai opening a gap (8). . . . . . . . . 159 Figure A.14. Subpaths consistent with ai closing a gap (1). . . . . . . . . 160 Figure A.15. Subpaths consistent with ai closing a gap (2). . . . . . . . . . 161 Figure A.16. Subpaths consistent with am closing a gap (3). . . . . . . . . 162 Figure A.17. Subpaths consistent with ai closing a gap (4). . . . . . . . . . 163 Figure A.18. Subpaths consistent with am closing a gap (5). . . . . . . . . 164 Figure A.19. Subpaths consistent with am closing a gap (6). . . . . . . . . 165 Figure A.20. Subpaths consistent with am closing a gap (7). . . . . . . . . 166 Figure A.21. Subpaths consistent with ai closing a gap (8). . . . . . . . . . 167 Figure A.22. Decomposing the (A ∼ C) part of figure A.13 (1) . . . . . . . 170 Figure A.23. Decomposing the (A ∼ C) part of figure A.13 (3) . . . . . . . 171 Figure A.24. Decomposing the (A ∼ C) part of figure A.13 (2) . . . . . . . 171 Figure A.25. Decomposing the (A ∼ C) part of figure A.13 (4) . . . . . . . 172 14 ABSTRACT A central focus of computational biology is to organize and make use of vast stores of molecular sequence data. Two of the most studied and fundamental problems in the field are sequence alignment and phylogeny inference. The problem of multiple sequence alignment is to take a set of DNA, RNA, or protein sequences and identify related segments of these sequences. Perhaps the most common use of alignments of multiple sequences is as input for methods designed to infer a phylogeny, or tree describing the evolutionary history of the sequences. The two problems are circularly related: standard phylogeny inference methods take a multiple sequence alignment as input, while computation of a rudimentary phylogeny is a step in the standard multiple sequence alignment method. Efficient computation of high-quality alignments, and of high-quality phylogenies based on those alignments, are both open problems in the field of computational biology. The first part of the dissertation gives details of my efforts to identify a best-of-breed method for each stage of the standard form-and-polish heuristic for aligning multiple sequences; the result of these efforts is a tool, called Opal, that achieves state-of-the-art 84.7% accuracy on the BAliBASE alignment benchmark. The second part of the dissertation describes a new algorithm that dramatically increases the speed and scalability of a common method for phylogeny inference called neighbor-joining; this algorithm is implemented in a new tool, called NINJA, which is more than an order of magnitude faster than a very fast implementation of the canonical algorithm, for example building a tree on 218,000 sequences in under 6 days using a single processor computer. 15 CHAPTER 1 INTRODUCTION Molecular sequencing technology has ushered in a new era in biology, one in which gene function and structure can be predicted based only on sequence (though confirmation of predictions still depends on experimentation), and in which historical relationship of sequences and species can be inferred based on fairly objective sequence-based methods rather than subjective classification schemes. The task of sequence alignment is to take a set of molecular sequences (DNA, RNA, or protein) and identify regions with functional, structural, or evolutionary relationship. This is typically done by organizing the sequences into a matrix in which each row corresponds to one sequence, and the sequences are spread out so that related residues (amino acids or nucleotides) share a column. Alignments can be used, for example, to: • identify conserved regions (conservation suggests that the region encodes some function); • estimate molecular structure (much better predictions of structure are possible from a high-quality alignment of multiple sequences than from a single sequence); • find signals of natural selection (for example based on Ka/Ks ratio, which may indicate positive or purifying selection [57]); • make statements of evolutionary ancestry, where the statement may be something like “sequence A shares common ancestry with sequence B”, or “the phylogeny (family tree) of this set of sequences is . . . ”. 16 Sequence alignment + cc g!a acacct acat acat acgt c agt a gt a!g acacct aca--t a-g--t g-g--t ggt Figure 1.1: Descent with modification, in the form of substitutions, insertions, and deletions. The evolutionary history of the sequences induces an alignment of the extant sequences. Take as an example a single long-dead organism that is ancestor to a collection of extant organisms, and consider a single gene in that organism. If we could track the mutations that were endured by copies of that gene in the course of the tree-like process of descent leading to present-day sequences, we would find that substitutions, insertions, and deletions occurring in organisms at various historical points would be passed on to their descendants. This evolutionary process can be seen to induce a multiple sequence alignment in which each column contains characters that share descent based on (perhaps faulty) replication from the same ancestral character, while positions in sequences are shifted one way or another to account for historical insertions and deletions (see Figure 1.1). Of course, such a history is generally unknowable, but is the aim of much of computational biology: recovering the alignment without knowing the ancestral sequences or true phylogeny, and recovering the true phylogeny without knowing the true alignment. This dissertation covers these two closely related topics, exploring a wide range 17 of methods applied to aligning multiple sequences, as well as an approach that dramatically increases the speed and scalability of a common method for phylogeny inference. The majority of methods applied to phylogeny and alignment treat the two problems as essentially independent. Phylogeny methods typically take a single alignment as input, and treat it as if it were a true statement of homology of characters, despite clear evidence that uncertainty in alignment correlates with problematic tree inference [85]. Meanwhile, the standard method of aligning multiple sequences makes use of a quickly estimated phylogeny (guide tree) in an internal step, despite concerns that choice of guide tree may bias analyses made based on the alignment [70]. Though groundbreaking progress has been made in the area of simultaneously inferring both alignment and phylogeny [91], the computational complexity of such methods is such that only several sequences can be compared. In the genomics era, in which alignments of hundreds, or even tens of thousands, of sequences may be the subject of analysis, it is still of value to consider alignment and phylogeny as stages in a process, as is done in this dissertation. 1.1. Overview The remainder of this chapter will provide a very basic introduction to the topics of sequence alignment and phylogeny inference. The following chapters are broken into two parts. Part 1 (Chapters 2 through 4) will relate to the problem of multiple sequence alignment. Chapter 2 gives an overview of the standard multiple sequence alignment framework, describing the stages used in what we call the form-and-polish method. It also presents an investigation of new and established methods at each stage of the form-and-polish method, and identifies best-of-breed methods for each. Chapter 3 gives details on a promising new method for one of the stages, involving the weighting of sequence pairs used to overcome the dominance by over-represented groups of the standard sum-of-pairs score. Finally, Chapter 4 provides in-depth 18 discussion of a new method for using alignment consistency to avoid errors that are a common by-product of the standard method of alignment construction. Part 2 (Chapters 5 through 7) relates to a new algorithm that dramatically increases the speed and scalability of a common method for phylogeny inference, called neighbor-joining. Chapter 5 describes the canonical neighbor-joining algorithm, and places it in context of related work. Chapter 6 give details of the new algorithm, which reduces run time via a two-staged filtering approach, and overcomes memory limitations by employing data structures that are externalmemory-efficient. Chapter 7 presents experimental results supporting the success of this new algorithm, highlighted by the result of producing a tree for an input of 218,000 sequences in under 6 days on a single computer. Finally, we conclude with summary remarks and possible future directions of research. 1.2. Introduction to sequence alignment When aligning sequences, it is common to consider three kinds of mutation: substitution (in which one nucleotide is replaced with another, perhaps as the result of a faulty replication), insertion (in which a stretch of characters is added into a sequence), and deletion (in which a stretch of characters is removed). Other mutations, such as reversal or recombination, are much more rare, require more complex representative techniques, and add computational complexity, and are thus not considered in standard sequence alignment methods. Mutations are subject to selection, with the result that they are relatively less likely to be observed in regions of functional importance, and of the those that are observed in functional regions, most have minimal functional impact, for example replacing one amino acid with another one bearing similar biochemical properties. The task of sequence alignment is to take a set of molecular sequences and identify related regions. A sequence alignment is typically represented as a matrix, 19 in which each row corresponds to one sequence, and the characters of each sequence are shifted left and right so that related residues share a column. Characters in a column need not be identical, just homologous (i.e. the result of a substitution). The way sequences are spread out is by placing one or more spacer characters (typically ‘-’, which is effectively the null character) between characters in the sequence, as in the induced alignment in Figure 1.1. These spacer characters are often called gap characters. A maximal run of consecutive spacer characters in a row is called a gap (alternatively an indel ), and is presumed to be the result of a single mutational event, either an insertion or deletion (or possibly an artifact of incomplete sequencing, in the case of terminal gaps). Finding a biologically meaningful alignment depends on assigning scores to substitutions, and costs to gaps. In protein sequence alignment, the substitution scores typically come from a 20×20 matrix (e.g. BLOSUM [51] or VTML [79]), in which the score of a substitution from residue a to residue b may be viewed as p(ab) the logarithm of the odds ratio (σ(a, b) = log p(a)p(b) ), based on frequencies in an alignment database of observed alignments (where p(a) is the observed frequency of character a among all sequences in the database, and p(ab) is the frequency of observing an a aligned with a b in the database). In these matrices, identities have highest (positive) score, while substitutions between amino acids with very different biochemical properties have lowest (and negative) score. The substitution matrix for nucleotides is a simpler 4×4 matrix. Gaps are typically assigned an affine cost, in which a gap incurs a length-dependent cost, with cost of λ for each character in the gap, and a per-gap cost γ (so the total cost of a gap of length ` is γ + `λ). In this scoring scheme, the goal is to line characters up into an alignment A so as to maximize the sum of the induced substitution scores, minus the induced gap costs. This is the scoring scheme used in most related work. It is a trivial matter to convert substitution scores to costs, so that identities have low cost and substitutions have high cost, thus turning the task into an 20 equivalent cost-minimization problem. This is the scoring scheme used in this work, so that the cost of a pairwise alignment is cost(A) := cγ + `λ + X σ(a, b), (1.1) a→b where the number of gaps is c, the total length of those gaps is `, and the summation is over the cost of all substitutions (a ∼ b). The matter of maximizing scores or minimizing costs is mostly a semantic issue, but one that informs the terms used in describing other tools (score) and our models (cost). The technique of dynamic programming can be used to find a pairwise alignment with optimal cost. The key to using this technique is the observation that an optimal alignment of a pair of prefixes ends in a column (involving either a substitution or a gap), and if that final column is removed, the remaining alignment must be an optimal alignment of the shortened prefixes. Under the condition that the per-gap cost γ is 0, this leads to the following recurrence for computing the optimal cost, C(i, j) of an alignment of prefixes (a1 . . . ai ) and (b1 . . . bj ) of sequences A and B: C(i − 1, j − 1) + σ(ai , bj ), C(i, j) = min C(i − 1, j) + λ, C(i, j − 1) + λ . (1.2) (The recurrence in the case γ > 0 is a bit more complex, requiring that optimal costs must be kept at each cell for each of the 3 possible columns that might end a prefix alignment. Details are unnecessary for the purposes of this introduction, but can be found in [50].) The optimal cost of aligning the full sequences A and B, of length m and n respectively, is then the value C(m, n). This value may be found by filling in an m × n table, where the ith row corresponds to the ith position of A, and the j th 21 B a c a t t a c A t A B ac-tct acat-t c t Figure 1.2: Path in a pairwise alignment dynamic programming table. A diagonal edge leading into cell (i, j) corresponds to a substitution column containing ai and bj . A vertical edge leading into cell (i, j) corresponds to a column containing ai in a gap between bj and bj+1 . A horizontal edge leading into cell (i, j) corresponds to a column containing bj in a gap between ai and ai+1 . column to the j th position of B, so that the cost at a cell (i, j) corresponds to the cost of aligning prefixes (a1 . . . ai ) and (b1 . . . bj ). This dynamic programming table is filled out in row-major order (row by row, from upper left to lower right), so that the three values required by the recurrence in equation 1.2 are always available when the value at a new cell is computed. Thus, a single pass through the table is sufficient to find the cost of the optimal alignment. Since constant time is required to compare the 3 values at each cell, this gives a O(nm) algorithm for finding the optimal cost. An alignment corresponds to a path through the alignment graph, as in Figure 1.2. A path (and corresponding alignment) with the optimal cost can be found by the standard dynamic programming method of backtracking through this table [50]. Generalization of this approach to finding good alignments of multiple sequences requires generalization of the scoring method. In principle, if we knew the phylogeny relating sequences, and the ancestral sequences labelling internal nodes, it would be 22 reasonable to seek an alignment of extant and ancestral sequences that minimize the sum of pairwise alignment costs over all edges in the phylogeny. However, the phylogeny and internal labellings are in general unknown. Even given a fixed phylogeny, the problem of labeling internal nodes so as to produce a minimal cost summing over all edges (often called Tree alignment [96]) is NP complete [114]. For this reason, the standard practice is to optimize the sum of the alignment costs over all pairwise alignments induced by the multiple alignment. This scoring scheme has the odd trait of effectively treating each sequence as if it evolved from all the others, but has proven to be useful. With this sum-of-pairs scoring scheme in place, the dynamic programming method for two-sequence alignment can be generalized to aligning multiple sequences by filling in a k-dimensional table for k sequences. The exponential size of this table makes exact alignment infeasible for more than several sequences, even with clever methods for limiting the visited space in that table [71, 44], so heuristics are used in practice. An overview of the standard heuristic for aligning multiple sequences, which we call the form-and-polish approach, are given in Chapter 2, with detailed investigation of the stages of that method described in Chapters 2 through 4. 1.3. Introduction to phylogeny inference The diversity of life, and the DNA that encodes its function, arose through a branching pattern of evolution: a copy of the DNA of an organism is passed on to that organism’s offspring, possibly with mutation. Over time, this descent with modification [21] can be seen to induce a tree-like relationship on a family of presentday sequences. The task of molecular phylogenetic inference is to take as input a set of sequences, usually already aligned, and create a tree that recovers their ancestral history. This tree may be what is called a gene tree, which defines the ancestral 23 history of descendants of a single gene (or perhaps a concatenated set of genes), or a species tree, which defines the history of the species containing those genes. Due to incomplete lineage sorting [73, 74], which may cause incongruence between the trees of different genes sequenced from the same individuals, these may not agree. In this work, discussion of phylogeny inference relates to gene tree construction. Such a tree is valuable both in its own right as a statement of the relationship of entities, and also as a framework for understanding a wide range biological processes (e.g. rates of evolution, selective pressures) and for addressing biodiversity issues. Inferred trees have two components: the topology (branching order) and edge lengths. Trees are typically binary, and most commonly-used inference methods produce unrooted trees. A standard method of establishing a root for an unrooted tree is to include a known outgroup (a sequence that is known not to break the monophyly of the remaining sequences): the root is then placed along the edge connecting that group to the rest of the tree. A variety of methods are applied to the problem of phylogeny inference, with the notable ones including Parsimony [33], Maximum-likelihood [34], Bayesian [117, 56], and distance based methods Minimum Evolution [94, 22] and Neighbor-joining [95]. The first three in this list all assign a value to a tree that depends on possible (unknown) ancestral sequences, then seek trees with optimal value. The number of possible topologies grows super-exponentially in the number of sequences [35], leading parsimony and maximum-likelihood (ML) to use local-search heuristics to try to find good trees, while Bayesian phylogeny inference employs a MarkovChain Monte-Carlo approach to sample from the probability landscape. In all three methods, positions are usually assumed to be independent, and gap characters are treated either as an extra character or as missing data. Parsimony aims to find a tree that minimizes the number of changes over the tree that are required to attain the observed sequences; an efficient dynamic programming approach (linear in the size of the input alignment) allows computation of optimal labellings for internal 24 nodes for a fixed tree. ML seeks to find a tree that has highest likelihood under a probabilistic model of sequence evolution; the likelihood for a fixed tree is computed by marginalizing over all possible internal node labellings, and is computed with a linear time dynamic-programming algorithm. Bayesian inference is also model based, but samples from the space of possible trees, rather than seeking a single peak in the probability space. Distance-based methods do not consider internal node labels. Rather, they begin by computing distances between all pairs of sequences, and build trees based only on these pairwise distances. Under the minimum evolution (ME) framework, edge lengths for a fixed topology are those that minimize the sum of the squares of the differences between the input pairwise distances and those induced by the tree’s edge lengths. The minimum-evolution tree is the tree with topology minimizing the sum of these edge lengths, and ME methods use local search heuristics to try to find a tree near this minimum. Neighbor-joining (NJ) is a greedy heuristic for finding the balanced minimum evolution tree [40], is proven statistically consistent (guarantees the correct tree given a sufficiently long alignment), and is guaranteed to produce the correct tree even when the input pairwise distances are somewhat noisy (see Chapter 5 for details). The largest published tree inferred by the Parsimony method contains 73,060 taxa [42], while ML has currently achieved limits of around 50,000 sequences in testing (pers. comm. Mike Sanderson, Karen Cranston). Both methods require weeks of computation time to build such trees. Bayesian inference can only handle inputs on the order of hundreds of sequences. There is little hope that these methods will scale to the point of rapidly building trees for inputs of more than 200,000 sequences in the near future. Since trees of this size are now possible given the available sequence data, methods that can scale up are desirable. The canonical NJ algorithm is faster than Parsimony or ML, but its O(k 3 ) run time and O(k 2 ) space requirements for k sequences, make it unfit for such large inputs. In Part 2, we 25 describe a new algorithm for computing a neighbor-joining tree that improves over the run time of the canonical algorithm by more than an order of magnitude. This new method reduces run time by using a filtering mechanism to dramatically reduce the number of computations done for each iteration of the tree formation process, and uses external-memory efficient data structures to overcome space constraints. As an example, it produces a tree with 50,000 sequences in about 9 hours using a single processor on a system with 4GB of RAM, and a tree with 218,000 sequences in fewer than 6 days on a similar system with 12 GB RAM. 26 PART 1 ACCURATE MULTIPLE ALIGNMENT WITH THE FORM-AND-POLISH STRATEGY 27 CHAPTER 2 BEST-OF-BREED METHODS FOR FORM-AND-POLISH ALIGNMENT All widely-used tools for multiple sequence alignment at essence seek an alignment that minimizes the sum-of-pairs cost: the weighted sum of the costs of all pairwise alignments induced by the multiple alignment. Optimal multiple alignment with sum-of-pairs scoring is NP-complete [114] which motivates the use of good heuristics. Most commonly-used tools use a heuristic called progressive alignment [36] which has two steps: (1) construct a binary merge tree whose leaves are the input sequences and whose internal nodes arrange the sequences into groups, and (2) merge these groups leaf-to-root over the tree by combining the alignments at the two children of a node into one alignment at their parent to form successively larger alignments at internal nodes. When merging the groups at the two children of a node into one group at their parent, the two sets of sequences in the groups are usually combined by aligning alignments [45, 63] or aligning profiles that compactly represent alignments [46, 64]. When this merging process reaches the root, the last pair of clusters is merged at the root of the tree, and it has formed an alignment of all the input sequences. This hierarchical approach is greedy: the alignment between two sequences in a group is not altered when that group is merged with another group. Consequently, errors made in early merges remain in the final alignment, and may lead to further misalignment in later merges. One approach to correcting such errors is to apply a second stage we call polishing [9], which refines the alignment by repeatedly splitting its sequences into subsets and realigning their induced subalignments. Another strategy is to reduce early errors using an approach called consistency [84, 25], 28 which tries to avoid making such errors in the first place. Consistency approaches assign position-specific substitution scores for a pair of sequences A and B and a pair of positions i and j that depend on the support for the substitution between i and j from the pairwise alignments of sequences A and B to all other sequences C. The combination of all these steps makes up the core of what we call the formand-polish strategy for multiple alignment. In total, this strategy consists of seven stages: 1. Choose alignment parameters. 2. Estimate distances between pairs of sequences. 3. Construct a merge tree. 4. Compute weights for sequence pairs. 5. Compute consistency-modified scores. 6. Merge alignments over the merge tree from stage 3. 7. Polish the alignment formed at the root of the merge tree in stage 6. Form-and-polish alignment methods make use of these stages, with stages 1, 2, 3, and 6 being obligatory, while stages 4, 5, and 7 are optional. The stages will be described in detail below. The order of presentation will differ to simplify exposition. 2.1. Survey of methods and tools Current tools use a profusion of methods for each stage. The most widely-used multiple alignment tool, ClustalW [110], relies on a complex scoring scheme in which substitution scores and gap penalties are adjusted according to features of the aligned sequences, including divergence, length, hydrophobicity of amino acids, and proximity of neighboring gaps. 29 PRRP/N [48] improved accuracy using polishing, in which an alignment is repeatedly split into subalignments, which are then realigned. T-Coffee [83] and its predecessor Coffee [84] introduced alignment consistency, with resulting increase in accuracy. T-Coffee assigns position-specific substitution scores based on a mixture of support provided by optimal global pairwise alignments and a small set of high scoring local pairwise alignments. Early versions of MAFFT [60] substantially increased alignment speed without sacrificing ClustalW-like accuracy through a host of ideas including scoring system modifications, use of the Fast Fourier Transform to speed up profile alignment and a two-stage strategy for constructing an alignment that first builds a draft alignment using a merge tree based on distance estimates from k-mer frequencies, and then obtains a pre-polished alignment based on a merge tree built using distances derived from the pairwise alignments induced by the draft alignment. Muscle [29] further improved speed with a variety of algorithm-engineering approaches, and matched T-Coffee’s accuracy by employing reduced terminal gap costs and a measure called log-expectation to score alignments of profiles. ProbCons [25] introduced probabilistic consistency, which assigns positionspecific substitution scores based on a measure of expected accuracy derived from a hidden Markov model using a three-way consistency transform. It also offers a column reliability score that estimates the likelihood that each column represents a correct alignment of residues. Recent versions of MAFFT [59] incorporate a heuristic for T-Coffee-like consistency, resulting in a substantial improvement in accuracy. The results are slower than the fastest version of MAFFT, but still quite fast. Details of the consistency methods of T-Coffee, MAFFT, and ProbCons are provided in Chapter 4. 30 2.2. Overview It can be difficult to determine which methods are contributing to the accuracy of these various alignment tools, and should be in a best-of-breed tool. In this chapter, we carefully study the impact of the standard methods for each stage of the form-and-polish method of multiple sequence alignment, and offer several new methods that substantially improve accuracy. The greatest gains in accuracy come from two simple new ideas: (i) estimating distances between sequences for merge tree construction by normalized alignment costs, and (ii) polishing the final alignment using 3-partitions of the input sequences induced by cutting pairs of edges in the merge tree. By combining the best methods for each of the stages of the form-and-polish strategy, we obtain a new tool, which we call Opal, whose accuracy matches the state-of-the-art as measured on the standard benchmark datasets. In the next section, we describe our experimental methods. In Section 2.4, we study methods for constructing the merge tree. Section 2.5 considers methods for merging alignments. Section 2.6 assesses the effect of weighted sum-of-pairs. Section 2.7 explores methods for polishing alignments. Section 2.8 inspects the value of alignment consistency. Section 2.9 examines the impact of parameter choice, and finally Section 2.10 compares the combined approach to current tools. It is reasonable to ask if the order in which stages are addressed might impact the results. We considered many combinations of the methods for each stage, and the best-of-breed choices are consistent under any ordering. The only impact of the order appears to be on the scale of improvements attributed to a method. For example, if polishing were presented earlier, its gains would be larger, while the gains seen in other methods would be smaller. 31 2.3. Methodology The standard practice for evaluating multiple alignment tools is to use benchmark datasets of reference alignments that are usually based on structural alignment of proteins. When comparing methods for a stage, and comparing alignment tools, we evaluate accuracy by measuring the recovery of reference alignments from three standard suites of protein alignment benchmarks: BAliBASE3.0 [109], SABmark1.65 [113], and PALI2.5 [7]. These suites, which are used by many comparative studies, of course represent only a sample of the types of inputs biologists face. BAliBASE is a collection of 218 reference alignments based on structural alignments with manually-arranged gaps, exhibiting a variety of phylogenetic and structural characteristics. We limited our tests to the 163 alignments with no more than 40 sequences, as our focus is measuring accuracy and not speed. The SABmark benchmark contains 627 alignments, each containing at most 25 sequences, that cover the entire known fold space found in the SCOP[80] classification of protein families. Each benchmark is a collection of pairwise structural alignments that are not necessarily consistent with one multiple alignment. PALI contains 1655 alignments of all SCOP families constructed by structural multiple alignment without hand curation. We used a subset of 102 alignments consisting of all reference alignments with at least 7 sequences that have nontrivial gap structure. In our experiments the measure of accuracy is mainly the so-called SPS score [6]: the percentage of pairs of aligned positions from the reference alignment that were correctly recovered by the computed alignment. In some cases we also report the so-called TC score [6]: the percentage of columns from the reference alignment that were completely recovered; this score is appropriate for BAliBASE and PALI, but not SABmark, since there is no column to be recovered (only possibly inconsistent pairwise alignments). 32 For BAliBASE and PALI, both scores are measured on their core blocks: those columns in the reference alignment that are deemed reliable by the benchmark (typically through strong support from a structural alignment). SABmark does not provide core blocks, so accuracy is measured against all columns of the benchmark alignment pairs, regardless of their true reliability. This fact partially accounts for the much lower recovery rates seen for SABmark than for BAliBASE and PALI, though the divergent nature of the SABmark alignments is certainly responsible for much of the difference. We use the above suites of benchmarks in experiments to determine the best method for each of the eight identified stages of the form-and-polish strategy for multiple alignment. While this may appear to ignore interactions between methods for different stages, results not shown here show the best method for a given stage is independent of the choices at other stages. The best method is also generally independent of which suite of benchmarks is considered. 2.4. Merge tree Constructing a merge tree involves (1) grouping sequences hierarchically, based on (2) a measure of distance between sequences. We study these two aspects in turn. 2.4.1. Grouping sequences Progressive alignment tools construct the merge tree using one of a number of similar sequential clustering algorithms, in which the tree is built in a stepwise manner. In the general step, a distance (see Section 2.4.2) is known for each pair of sequence clusters, and the two closest clusters are merged into a new cluster, which defines an internal node of a binary tree. For a cluster ab that is the merge of clusters a and b, new distances dab|c to all other clusters c are calculated as a simple combination of dac , dbc , and possibly dab . Merging continues until one group remains, which is the root of the tree. The tree may be rerooted at a different 33 Table 2.1: Clustering methods for constructing the merge tree. Method NJ UPGMA Cluster ab minimization criterion P (dab − c (dac + dbc ))/(n − 2) Updated distance from ab to c (dac + dbc − dab )/2 dab (dac + dbc )/2 MST dab min{dac , dbc } MST+UPGMA dab α min{dac , dbc }+ DAD dab Cost of aligning ab to c (1−α)(dac +dbc )/2 node, for instance to balance root-to-leaf path lengths, or based on integration of a known outgroup. Beginning with distance da|b (see Section 2.4.2) between each pair a,b, the various methods discussed choose the pair to merge, then calculate a distance for the new cluster to all others. We study five clustering schemes, one new, as shown in Table 2.1. Neighbor-joining Neighbor-joining [95, 105] is the method used by ClustalW and T-Coffee. It is a heuristic for construction of a minimum evolution tree [40], and is generally regarded as among the best of the fast distance-based methods at producing the true evolutionary tree for the sequences. Let G be the current set of groups during the merging process, and da|b be the current distance between a pair a, b of groups. neighbor-joining merges the groups a and b that minimize (|G| − 2) da|b − X da|c + db|c . (2.1) c ∈ G−{a,b} The new distance dab|c between the merged group ab and all other groups c is dab|c := da|c + db|c − da|b . 2 (2.2) 34 The canonical algorithm for neighbor-joining takes O(k 3 ) time for k sequences. See Part 2 for significant discussion of this method. UPGMA and MST The unweighted-pair group-method with arithmetic-mean or UPGMA [101], and minimum spanning tree or MST [20], are simpler approaches that run in O(k 2 ) time. Both merge the pair a, b of groups with minimum distance da|b , but differ in how they define the distance dab|c from the merged group ab to all other groups c. UPGMA sets dab|c to the average 12 (da|c +db|c ), while MST uses min{da|c , db|c }. We also considered the mixture UPGMA + MST used by MAFFT, which for 0 ≤ α ≤ 1, sets dab|c to the convex combination 1 dab|c = α min da|c , db|c + (1−α) da|c + db|c . 2 (2.3) DAD A shortcoming of the above methods is that distances are derived from sequences only at initialization. When group ab is formed, the new distances dab|c are calculated from original sequence distances, which ignores the constraints on sequence pairs across groups ab and c imposed by the alignments for these groups. We evaluated several new methods that take such constraints into account. The best of these, which we call dynamic alignment distance or DAD, computes dab|c by aligning the current alignments for ab and c to obtain an alignment A for group abc, and then taking for dab|c the minimum of the distances measured on the pairwise alignments in A induced by sequence pairs across groups ab and c (analogous to MST). The distance measure on pairwise alignments can be any of those from Section 2.4.2. We also considered variants that use the average of the distances of the induced pairwise alignments between ab and c (analogous to UPGMA), and that add in the average distances between abc and all other groups d, but the above method performed the best. DAD does not perform as well as the seemingly less-informed methods neighborjoining, UPGMA, and MST. While there is more information in the aligned sequences, 35 Table 2.2: Comparison of clustering methods. Tree method BAliBASE SABmark PALI average MST 79.4 44.1 79.8 67.8 UPGMA + MST 79.2 44.0 80.2 67.8 UPGMA 78.0 42.7 80.5 67.1 neighbor-joining 77.4 42.1 77.2 65.6 DAD 78.2 43.5 73.0 64.9 alignments against large groups are more constrained than those against small groups, so larger groups tend to have higher distances. This causes smaller groups to be preferentially merged next, which leads to a balanced merge tree even when this is undesirable. Comparison Table 2.2 shows a comparison of the accuracy of these methods on three suites of benchmarks in terms of their SPS score. As our baseline for the other stages, we measure the initial sequence distances using percent identity over a compressed alphabet (Section 2.4.2), merge alignments using pessimistic gap counts (Section 2.5), and use unweighted sum-of-pairs (Section 2.6), no polishing (Section 2.7), no consistency (Section 2.8), and default parameters (Section 2.9). For UPGMA + MST we use α = 0.9, as in MAFFT. The best accuracies are in bold. The multiple alignment literature [29, 25] suggests that using a merge tree that is similar to the correct evolutionary tree is less important for alignment accuracy than using a tree that groups similar sequences first. Our results agree with this suggestion: generally the methods based on minimum spanning trees outperform the others. For the merge tree method in the rest of the chapter, we chose the simpler method MST. 36 2.4.2. Measuring distances The standard measure of distance between two sequences is based on percent identity: the percentage of matched positions in an optimal pairwise alignment that are identities. (The actual distance used is the complement of percent identity.) Many tools modify percent identity by the Kimura correction for multiple substitutions at a locus [65], and measure it over a compressed alphabet that groups amino acids with similar characteristics into equivalence classes, which we call compressed identity. MAFFT and Muscle compute a draft alignment from distances based on k-mer frequencies on the way to their initial alignment based on percent identity. Percent identity is a coarse measure of similarity, while alignment cost is a more refined measure that can be obtained with no overhead. We therefore tested a new distance measure, which we call normalized alignment cost, that simply normalizes the cost of an optimal pairwise alignment by dividing by the average length of the two sequences (where alignment cost uses affine gap penalties, as discussed in Section 2.5). In a sense this generalizes percent identity and compressed identity to the full spectrum of substitutions while also taking into account gaps. The distance measure used in ProbCons, expected accuracy, is similar in concept to normalized cost. It cannot be tested here, because it depends on the hidden Markov model foundation of ProbCons. Comparison Table 2.3 shows a comparison the accuracy of these three distance methods. We use an MST merge tree, pessimistic gap counts, no weighting, no polishing, no consistency, and default parameters. Identity measures have Kimura correction. Normalized costs give a striking boost in accuracy, that persists even after polishing (not shown). Results in the rest of the chapter depend on normalized costs as the distance measure. 37 Table 2.3: Comparison of distance methods. 2.5. Distance method BAliBASE SABmark PALI average normalized cost 81.6 48.2 83.0 70.9 compressed identity 79.4 44.1 79.8 67.8 percent identity 78.5 43.5 79.9 67.3 Merging alignments Merging two multiple alignments of disjoint groups of sequences into one alignment of all the sequences, which has been called group-to-group alignment [45, 46] and aligning alignments [64, 63], is central to both forming an initial multiple alignment and polishing a final alignment. When forming an initial multiple alignment, the merge tree T is processed from the leaves to the root. Each leaf is labeled by an input sequence, which may be viewed as a trivial alignment. As each internal node v is processed, the alignments A and B labeling the children of v are combined into one alignment C of all the sequences at the leaves of the subtree for v. During the polishing stage, the final alignment is repeatedly split into subalignments A and B that are recombined into an updated alignment C. Clearly the quality of the final alignment strongly depends on how subalignments are merged in both stages. When merging alignments A and B to form C, we take as our objective to optimize the sum-of-pairs score [17] of alignment C, which is the sum of the scores of all induced pairwise alignments in C. all commonly-used tools today. This is the objective used in The sum-of-pairs score may be weighted, as discussed in Section 2.6. When merging A and B to form C, the columns within subalignments A and B are preserved within C, as pictured in Figure 2.1. Computing an optimal C given A and B using sum-of-pairs scoring has 38 33 !"#"$"%"& #"$"&"!"% #"#"&"%"% ! # # #"&"&"% !"#"%"% # $ # # ! $ & & & # % ! % & % & % % % % ! # # # $ # $ & & % ! % & % % # ! & # & & & % % % !"#"$"%"& #"$"&"!"% #"#"&"%"% #"&"'"&"% !"#"'"%"% Figure 2.1: Aligning alignments is a merging of two alignments where the columns Figure 3.2 Aligning a merging twocolumns alignments where the columns onea of one alignment arealignments aligned asisunits with ofthe of the other, resultingof in alignment are aligned as units with the columns of the other, resulting in a new multiple new multiple alignment. (Reproduced with permission from [103]) alignment. been called the problem aligning and even was without first considered by optimal multiple alignmentsofunder this alignments objective is [64], NP-hard considering gaps [50, 42, Scores 71, 72],ofand been demonstrated practical using only for a limited number Gotoh [45]. thehas alignments are usuallyasevaluated affine gap penalties, of sequences. where a gap of length ` in a pairwise alignment has cost γ + `λ, for constants γ and gap of length ` is a maximal run of either ` insertions or deletions.) The 3.1.2λ. (A Progressive alignment constants γ and λ are called the gap open and extension penalties, and may have A widely used alternative to finding optimal multiple alignments is progressive align- different values for terminal or internal gaps. When computing an optimal merge C ment [19], a heuristic where a multiple alignment is progressively built following a binary by dynamic programming, the essential difficulty is in correctly counting the total merge-tree in bottom-up fashion. Leaves of the merge-tree are the input sequences and number incurred in thealignment. induced pairwise C. the root of is gaps the output multiple Internal alignments nodes of theofmerge-tree are interWe study twoalignments basic waysofofthe computing C byofaligning alignments: using mediate multiple sequencesthe at merge the leaves the corresponding subtree, exact counts, which yields a merge that has optimal and and aregap formed by merging their two children (which are eithersum-of-pairs sequences or score; intermediate alignments). two alignments merged, the columns of yield each alignment are using pessimisticWhen gap counts, which isare a fast heuristic that may a suboptimal treated as units, with the columns of one alignment to be aligned against the columns of merge. the other (Figure 3.2). Exact computing an for optimal merge Cofofk sequences, two alignments A The counts appeal ofSurprisingly, progressive alignment is that an alignment we need only perform alignments, each onisa pair of sequences[72]. (of columns). There is however, and B with k−1 affine gap penalties NP-complete While this shows there thelikely inevitable trade-off. Notice that the induced pairwise alignments an the alignment is no algorithm that computes an optimal merge and iswithin fast in worst are not changed when it is merged with another alignment. As a result, errors introduced case, nevertheless Kececioglu and Starrett developed an exact algorithm [63] that early in the merging process persist to the final alignment and can adversely affect later computes an optimal merge and is remarkably fast in practice. To optimally align merges. For example when the process begins, the merging is on a pair of sequences 39 two alignments, each having k sequences and n columns, their algorithm takes worst-case time O(5k n2 ). Extensive experiments, however, show empirically that it runs in O(k 2 n2 ) time on biological data [63]. In this study, we compute an optimal merge C with exact gap counts, using an implementation of the algorithm of [63]. Pessimistic counts Another approach to computing the merge C is to avoid the difficulty of determining exact gap counts by instead using an approximation called pessimistic gap counts [64]. Pessimistic counts were introduced under the name quasi-natural gap costs by Altschul [2]. This approximation overestimates the true number of gaps by assuming, in cases where the number of gaps started by a multiple alignment column is not determined by the preceding column, that the number of gaps started attains its largest possible value. The benefit of pessimistic gap counts is that the merge of two alignments that is best under this estimate can always be found efficiently: with an alphabet of size |Σ|, computing the best pessimistic merge of two alignments, each having k sequences and n columns, takes worst-case time O(n2 · min (|Σ|, k)) with O(nk) preprocessing (see Chapter 8 in [103]). Most multiple alignment tools use some version of profile alignment [46, 64, 60, 29], to merge alignments. It is entirely possible that the pessimistic heuristic, which is also a profile method, is outperforming other profile methods, though we do not study that here. Comparison Table 2.4 compares the accuracy (SPS score) of exact and pessimistic gap counts when merging alignments leaf-to-root on the tree. We use an MST merge tree with normalized costs, no weighting, no polishing, no consistency, and default parameters. Exact gap counts are consistently superior to pessimistic gap counts, though 40 Table 2.4: Comparison of alignment methods. Merge method BAliBASE SABmark PALI average exact 82.4 48.4 84.0 71.6 pessimistic 81.6 48.2 83.0 70.9 only slightly. Our experiments (not shown) indicate that exact gap counts continue to slightly outperform pessimistic counts after polishing. We use exact counts in the rest of the chapter, but note that the performance of the pessimistic heuristic is encouraging, and suggests that it is a reasonable choice for inputs so large that even the surprisingly fast exact-counts algorithm is too slow to use in practice. 2.6. Sequence-pair weights Sum-of-pairs scoring of multiple alignments is a potentially biased scoring measure: if the input sequences are not independent but instead over-sample some groups compared to others, the higher number of pairwise alignments to an over-sampled group can dominate the alignment score. This greater contribution of an oversampled group to the score will tend to drive the multiple alignment toward improving the pairwise alignments to such groups at the price of worsening the pairwise alignments to under-sampled groups, thus degrading the overall quality of the alignment. Several schemes have been proposed to correct for this bias by nonuniformly weighting the pairwise alignment scores in the sum-of-pairs measure. We study three such schemes. The first is new, and the other two are the schemes most widely used by current alignment software. Brief descriptions are given here; full details of the methods are provided in Chapter 3. All these schemes assign weights to pairs of input sequences on the basis of a tree T whose leaves correspond to the sequences, and that has edge lengths `e for 41 all edges e ∈ T . Each scheme assigns a weight wij to a pair i, j of leaves in T that depends on both the topology of T and its edge lengths. Suppose that in a multiple alignment A of the sequences, the score of the induced pairwise alignment of the sequences associated with leaves i and j is sij . The weighted sum-of-pairs score of alignment A using these weights is X wij sij . (2.4) i,j Influence weights Our new method of assigning a weight wij to a pair i, j of leaves depends on quantifying the influence of one leaf on another. Informally, the influence of j on i, ω(i, j), is computed by rerooting the tree at i, then assigning a weight of 1 to i and distributing that weight among the remaining leaves in a manner that depends on shape of T and edge lengths. The function ω is not necessarily symmetric, but we can define a symmetric weight on a pair of leaves by the geometric mean of their influences: wij := p ω(i, j) ω(j, i) . (2.5) We call the wij obtained in this way, influence weights. Influence weights have several nice properties. For k sequences, the weights wij can be computed in time O(k 2 ), given tree T and its edge lengths, which is optimal. They also avoid the anomalous behavior exhibited by the other two methods described below, as discussed in chapter 3. Covariance weights Perhaps the best-known weights for sum-of-pairs multiple alignment are those of Altschul et al. [3]. Of the two weighting schemes they suggest, their second scheme, which they call rationale 2 weights, has the more rigorous basis and is the one that has been more widely adopted. We call their second scheme, covariance weights. Covariance weights depend on the extent to which paths between two different pairs of sequences share internal edges (their covariance). In principle, they are 42 computed with a formula that inverts a matrix of these O(k 4 ) covariances, but a run time of O(k 2 ) is achieved using an involved algorithm described in [4]. This scheme is not directly used in common software; instead Gotoh’s 3-way method [47] of approximating covariance weights is normally used, for example in weighting sequences during the polishing stages of MAFFT and Muscle. Division weights The program ClustalW introduced a simple weighting scheme that has been incorporated into other alignment software as well, such as in the construction stages of MAFFT and Muscle. We call ClustalW’s scheme, division weights. Briefly, division weights are computed by rooting the tree, then dividing the length of each edge in the tree among the leaves under that edge, finally computing leaf-pair weights as the product of the collected leaf totals. The wij for k sequences can be computed in time O(k 2 ). Comparison Table 2.5 compares the accuracy of the weighting methods across the benchmarks. We use an MST merge tree with normalized costs, exact gap counts, no polishing, no consistency, and default parameters. (Using pessimistic gap counts and polishing we find the same ranking of weighting methods.) Covariance weights are computed exactly using matrix inversion, based on the original method of [3] (i.e. not Gotoh’s approximation). The edge lengths `e on the MST tree are computed from normalized costs dij by fitting edge lengths to T so as to minimize the sum of the differences between path lengths `ij and the distances dij . These optimally-fitted edge lengths are efficiently computed by solving a linear program. Details of this linear program are given in Chapter 3. On BAliBASE and PALI, we also report the TC score as the second measure of quality, since it is less distorted by overrepresentation of groups than the SPS score. Surprisingly, and in contrast to what has generally been suggested in the 43 Table 2.5: Comparison of weighting methods. Weighting method BAliBASE SABmark PALI average uniform 82.4 / 53.6 48.4 84.0 / 57.3 71.6 / 55.5 influence 82.2 / 53.3 48.4 84.1 / 57.7 71.6 / 55.5 division 82.2 / 53.4 48.2 84.3 / 57.3 71.6 / 55.4 covariance 82.1 / 53.2 48.4 83.5 / 57.5 71.3 / 55.4 Table 2.6: Comparison of weighting methods on BAliBase refs 2 and 3. Weighting method BAliBASE refs 2 and 3 (SPS/TC) influence 83.3 / 30.7 uniform 82.8 / 31.4 division 82.5 / 30.8 covariance 81.5 / 31.5 literature [47, 29], unweighted sum-of-pairs (the uniform row of the table) performs as well as all three weighting schemes. Even for inputs with the largest numbers of sequences, where overrepresentation is more likely, weighting continues to give no benefit under both measures of quality. This behavior might be due to the fact that most benchmark alignments do not contain strongly overrepresented groups. BAliBASE references 2 and 3 do provide a set of inputs that are designed such that they contain somewhat oversampled groups. Table 2.6 compares recovery of the weighting methods on the 28 inputs we used from these two references. Here, the SPS and TC numbers show disagreement regarding the best performing method. Since uniform weighting is most consistently successful, even on these inputs manifesting overrepresented groups, we use uniform 44 weighting for the remaining tests in this chapter. 2.7. Polishing The progressive alignment method forms an alignment at each internal node by aligning the columns of the alignments at that node’s children, keeping those columns intact. The result is that errors made in early stages can disrupt the quality of the final alignment. Responses to this problem in the literature come in two forms: (1) avoid errors in the first place (consistency-based schemes [84, 25]), and (2) fix errors after they’ve been made (polishing, aka refinement [9]; see [52] for review of methods). We consider polishing methods in this section, and consistency in the next. To our knowledge, all previously implemented techniques perform polishing by splitting sequences into two groups, resulting in two induced alignments of subsets A and B of the sequences. The subalignments A and B induced on these groups are realigned, without altering the columns of A or B. Realignment of A and B is done as when merging alignments in Section 2.5 by aligning profiles [46, 64] or aligning alignments [45, 63]. The resulting alignment is retained if its score improves. This process is a form of local search, and is repeated for a fixed number of iterations or until there is no improvement. Though 2-partitions may be formed in a number of ways, the methods in common tools either randomly split sequences into two groups [25], or perform tree dependent restricted partitioning [52], which considers only those partitions that can be formed by cutting an edge on the merge tree. Tree-based partitioning may be iterated exhaustively over the edges of the tree [29], or edges may be repeatedly cut at random [59]. Tools using random choices tend to do many iterations: ProbCons by default does 100 iterations, and MAFFT does 1000. We considered several alternatives, and present results of the three best performers (with combinations) here: 45 Exhaustive 2-cut We implemented a tree-based method we call exhaustive 2-cut that cuts tree edges and realigns until there is no improvement. Since a tree with k leaves has O(k) edges, if there is an edge whose cut gives improvement this finds it within a linear number of realignments. Rather than scanning tree edges in a fixed order [29], we dynamically order the edges e by a measure Φ(e) of their potential for improvement. For edge e, let P (e) be the set of pairs of sequences that are on opposite sides of the partition given by cutting e, let cij be the cost of the pairwise alignment induced on sequences i and j in the current multiple alignment, and let dij be the cost of their optimal pairwise alignment. We use the potential X Φ(e) := (i,j) ∈ P (e) cij −dij |P (e)| . (2.6) These potentials are updated after several cuts alter the alignment (a default of five). This approach yields a slight speedup in convergence over the exhaustive 2-cut method used in Muscle [29] while attaining the same accuracy. Random 3-cut Consider the situation where two sequences in a large alignment are misaligned. The 2-cut method cannot separate these two sequences from the rest of the input and realign them without interference from all other sequences. This can be easily accomplished, however, by instead partitioning into three groups. We examined a variety of methods for three-way partitioning, both tree-based and not. We report the best one here, which we call random 3-cut. To our knowledge, this is the first time 3-cut polishing has been implemented. This method partitions the sequences by cutting two tree edges selected at random. (A tree with k leaves has O(k 2 ) such cuts, so an exhaustive approach would be slow.) The resulting groups a, b, and c are merged in two steps by realigning a and b to form group ab, and then realigning c with ab. We consider the three merge orders ab|c, ac|b, bc|a, and retain the best of the three if it gives 46 improvement. This process is repeated until a time or iteration limit is reached. Most edges are near the leaves, so this method tends to split off two relatively small groups of sequences, enabling repair of errors between small groups followed by integration into the rest of the alignment. In contrast to 2-cut, this method in essence alters the merge tree. We also considered 3-cut variants that reattach tree edges to reflect the merges of the three groups, or that rebuild the tree on the affected paths up to the root, but surprisingly none gave better quality than the above method that does not change the tree. On-the-fly An attractive idea is to polish subalignments as they are formed [106], rather than waiting for a complete alignment. This allows errors to be fixed before causing further misalignment. We implemented a version we call on-the-fly polishing: when an internal node v is created during merge tree construction, iterate over edges that are grandchildren or children of v, at each iteration cutting and realigning. This is repeated until no improvement is seen in one sweep through nearby offspring edges. This is similar to 2-cut, but scans a limited set of edges and operates while forming the initial alignment. Combined We also consider the method of polishing both during and after alignment construction. Our method is similar to the tree-based iterative algorithm of [52], but with polishing at internal nodes restricted in depth, and post-polishing possibly using tree-based 3-cuts, rather than 2-cuts. We call this method k-cut + on-the-fly. Comparison Table 2.7 compares the accuracy of polishing methods using MST trees with normalized costs, exact gap counts, no weighting, no consistency, and default parameters. For 3-cut we did 60 iterations; 2-cut and on-the-fly iterated until no improvement. These results suggest that the post-polishing methods tend to converge to 47 Table 2.7: Comparison of polishing methods. Polishing method BAliBASE SABmark PALI average 3-cut + on-the-fly 84.3 50.2 84.6 73.1 3-cut 84.2 49.7 84.8 72.9 2-cut 84.4 49.8 84.7 72.9 2-cut + on-the-fly 83.6 50.0 84.5 72.7 on-the-fly 83.3 49.6 84.4 72.4 none 82.4 48.4 84.0 71.6 roughly the same alignments, and that depth-restricted on-the-fly polishing, though a much faster (not shown) way to gain improvement, is not capable of reaching this same convergence point on its own. It is worth noting that, for inputs with 30-40 sequences, the run time for exhaustive 2-cut is generally twice that of the 3-cut method. Also, using on-the-fly in conjunction with with exhaustive 2-cut slightly speeds up convergence, since it performs much of the same polishing work before the full compliment of sequences has been added. In the rest of the chapter we use 3-cut + on-the-fly polishing. 2.8. Alignment consistency Polishing, described in the last section, attempts to overcome the inherent error propagation of the progressive alignment method by cleaning errors after-the-fact. An alternate approach is to try to avoid errors in the first place, using a method known as alignment consistency [84, 25]. In the standard alignment scheme, two sequences are aligned such that the alignment optimizes, over all possible alignments, the sum of the substitution and gap scores. These scores are position-independent: the substitution score is based solely on the pair of characters being aligned (as from BLOSUM substitution score 48 matrices [51]), and the per-gap and gap-extension costs are fixed (the γ and λ values of Section 2.5, possibly with different such fixed costs for internal and terminal gaps). Two alignments are aligned in an attempt to optimize the sum of the scores of induced pairwise alignments, with the constraint that the columns of each alignment remain intact. In the alignment consistency framework, costs are modified in a per-position manner. Typically [84, 59, 25], this involves computing a consistency-modified substitution score for each position-pair (ai , bj ). These modified position-pair scores are based on the amount of support found in other sequences C for the alignment of position ai with position bj . Details of how this support is calculated for position pairs differ from tool to tool, and are given detailed treatment in Chapter 4. In the consistency framework, gap-related costs in these methods are usually ignored, so that only modified substitution costs are used. The argument for doing this is that gap costs are implicitly a part of the process of computing the modified position-pair costs, but this approach may lead to less reasonable gap structure than is typically recovered with an affine gap penalty. Our new method depends not only on modified substitution scores, but also on modified per-gap and and gap-extension costs. As before, two alignments are aligned with the goal of optimizing the sum of scores of induced pairwise alignments, where the scores are now based on modified substitution scores. Comparison We considered a wide variety of novel approaches for calculating modified alignment feature costs. Details are provided in Chapter 4. Results of the best-performing method are given in Table 2.8. We use an MST merge tree with normalized costs, exact gap counts, no weighting, and default baseline parameters. Results are given both with and without polishing, since consistency and polishing are alternate approaches for solving the same problem, and the results of 49 Table 2.8: Comparison of consistency methods. Consistency method BAliBASE SABmark PALI average consistency, no polish 84.0 / 56.1 49.6 84.5 / 59.8 72.7 / 58.0 consistency, polish 83.6 / 56.2 49.9 83.7 / 57.2 72.4 / 56.7 no consistency, no polish 82.5 / 52.7 48.5 83.8 / 57.1 71.6 / 54.9 no consistency, polish 83.9 / 56.8 50.1 84.3 / 57.9 72.8 / 57.4 one impact the efficacy of the other. (The astute reader will notice that non-consistency numbers disagree slightly with those of the prior section. This is due to a minor change in the set of alignments used as input to the alternatives in this section). These results show that the alignment consistency approach (without polishing) overcomes early errors of the progressive alignment method about as well as postpolishing does. Intriguingly, however, a mixture of the two approaches results in a worsening of the average alignment recovery. Reasons for this worsening of quality are discussed in Chapter 4. Because consistency does not clearly improve results, and runs slower than noconsistency with polishing (even with speedup methods described in Chapter 4), the no-consistency option is used for the rest of the chapter. 2.9. Parameter advisor Selecting gap penalties involves a good deal of guesswork [111] and most practitioners simply use the default values provided by modern software. Fortunately there are now well-designed suites of protein alignment benchmarks, and the sequences in these benchmarks presumably represent the kinds of protein inputs a tool will see in the wild, which suggests that parameters optimized on those benchmarks should be reasonably good choices (though this is likely not the case for DNA, where only 50 coding-DNA- [18] and -RNA- [38] benchmarks are available). We look at the effect of parameter choice on accuracy from three perspectives: finding the best default input-independent values, determining how well a perfect input-dependent choice given by an oracle can perform, and designing an advisor that makes a good choice. Default parameters To select default parameters for aligning proteins with Opal, we trained primarily on BAliBASE. Based on results from doing inverse parametric sequence alignment using the tool InverseAlign [62], we fixed the substitution matrix at BLOSUM62 [51] and identified a reasonable seed value for the gap open and extension penalties. We then enumerated a range of parameters around this seed, including variants with reduced terminal gap costs. In total we examined 774 parameter choices. To quickly filter out poor parameter choices, we aligned all BAliBASE inputs with a fast version of Opal (MST with normalized cost, pessimistic gap counts, no weights, and no polishing), and identified the 100 choices with the best average recovery. We then selected a parameter choice from this set that had high recovery on all three benchmark suites using a more accurate version of Opal (same as above, but using exact gap counts and on-the-fly polishing). From this set of parameters we selected as our default the choice that had the highest recovery over all three suites of benchmarks. With BLOSUM62 transformed to a cost matrix in the range [0,88], our default parameter choice has internal-gap open and extension penalties γI and λI , and terminal-gap open and extension penalties γT and λT , of (γI , λI , γT , λT ) = (60, 38, 15, 36). A reduced terminal-gap open penalty is common, typically half the internal-gap open penalty [29], though a reduced terminal-gap extension penalty is not. This parameter choice was used in all results described in prior sections. 51 Parameter oracle While the default performs well overall, it severely underperforms other choices on many benchmark alignments. (The default has accuracy more than 10% worse than the optimal choice on about 15% of the inputs). This leads us to ask what accuracy could be achieved if we had an oracle that could identify the best parameter choice for each input. We consider results with an oracle for purposes of comparison. Parameter advisor Though a true oracle is unattainable, it is possible to design an advisor that can choose an input-dependent parameter value that improves alignment quality. We are aware of only one other attempt at implementing a method to automatically choose parameters, called MULTICLUSTAL [118]. Our tests show that, for combinations of 3 parameters, Multiclustal’s advisor performs poorly, typically choosing the worst of the 3 on nearly 50% of inputs. We considered a variety of novel advisor methods, and describe two of them here. Naive Bayes advisor Multiple sequence alignments present a broad range of features that may be useful in picking parameters for an input. ClustalW uses one of these, percent identity in optimal pairwise alignments, as part of its scoremodification strategy. In addition to percent identity, we considered features related to spacer density (number of gap characters) and gap-start density in computed multiple alignments. Our approach is to ask, for each parameter x among a small set of alternative parameter choices P, “what is the range of observed feature values in alignments generated by x that have high accuracy?”. The idea is that some parameters may work better with alignments that are, for example, more gappy than average, and that observing such traits may give support for using one parameter over another. 52 The approach is then to identify, for a new input, which parameter results in an alignment with feature values most like those for which the parameter has worked well on training data. We fit a Gaussian, Gxj , for each feature j to the set of observed values of that feature over all inputs S for which x is good (i.e. x results in an alignment with accuracy no more than 2% worse than the best parameter choice). Similar Gaussians, Bxj , are determined for bad inputs (x results in alignments at least 10% worse than the best parameter). Distributions were trained using alignments from BAliBASE for parameter choices in P. After determining these distributions, a quality value is computed for parameter choice x on input S as follows. Let Ax be the alignment of S under x, and let f (j, Ax ) be the value for feature j of Ax . Let G(x, j, S) be the area under the curve Gxj of an interval sized around f (j, Ax ), and likewise B(x, j, S) under the bad Gaussian Bxj . Let g be the number of training inputs for which the parameter is good, and likewise b for for bad, so g/b is a prior belief of the quality of an alignment. Then the quality of a parameter x on an input S is Q g j G(x, j, S) Q Q(x, S) = . b j B(x, j, S) (2.7) Given a new input S, the parameter choice that this advisor selects is p := argmax Q(x, S). x∈P This approach bears a strong resemblance to a common discrimination technique called a naive Bayes classifier [26]. Note that this approach makes an assumption that the various feature ranges are independent. Modification to this approach to account for possible covariance led to no improvement in advisor success. Core column count advisor A simpler approach works as follows. Define a core column to be a column in a multiple alignment where at least a fraction α of its 53 rows have letters from the same character class in the compressed alphabet. (In the experiments, we use α = 0.9, and used the compressed alphabet of [59].) Given a set S of input sequences to align and a candidate parameter choice x, let Ax be the alignment of S that results from using parameter choice x, and let f (Ax ) be the number of core columns in Ax . The parameter choice that this advisor selects based on core columns is p := argmax f (Ax ). x∈P In other words, it selects the parameter choice that gives an alignment with the most core columns (ties are broken in favor of shorter alignments). Both of the above advisors perform similarly. We present results for the core column approach. The output alignment using either advisor is Ap . Comparison Table 2.9 compares the hypothetical accuracy of the oracle to that achieved using default parameters and the advisor. We picked a set U of twelve parameter choices that performed well on average and cover the domain of reasonable gap open and extension values, and applied the oracle to set U. The advisor was applied to a smaller set P ⊆ U of four parameter choices, choosing p = (γI , λI , γT , λT ) from the set P = (56, 38, 7, 36), (58, 37, 7, 35), (64, 37, 8, 37), (64, 38, 32, 36) . (An alignment Ax must be computed by the advisor for each x ∈ P, so a small set P is preferable.) We also considered running the oracle on P. 54 Table 2.9: Comparison of parameter advisor methods. Parameter method BAli SAB PALI average oracle on U 87.0 54.4 87.1 76.1 oracle on P 86.2 52.9 86.2 75.1 advisor on P 84.7 50.5 84.9 73.4 default 84.4 50.0 84.5 73.0 Given a parameter choice, alignments were computed using the best methods from prior stages: MST trees with normalized costs, exact gap counts, no weighting, no consistency, and on-the-fly + 3-cut polishing. The oracle clearly provides a large boost in recovery, and offers an intriguing target for further research. The improvement of the advisor over the default is modest, and might conceivably be the result of just fortuitous choices that exploit the variation in accuracy within set P. A closer look, however, reveals that the advisor’s performance is much better than random. For a given subset S ⊆ U, we can compare the accuracy of the advisor on S to that of the single best choice from S. Our advisor outperforms the best choice for 60% of all subsets S ⊆ U tested. It also outperforms the mean accuracy of the x ∈ S for 94% of all subsets. (If the advisor’s parameter selection were random, we would expect it to beat the average accuracy 50% of the time.) By contrast the method in [118] outperforms the best choice from S for only 1% of the subsets, and the mean of S for less than 3%. When comparing against other tools in the next section, we consider performance both with and without an advisor. 55 Table 2.10: Assessing the accuracy gains by selecting best-of-breed methods selected at each stage. 2.10. Stage BAli SAB PALI average (baseline) 78.0 42.7 80.5 67.1 tree +1.4 +1.4 -0.7 +0.7 distance +2.2 +4.1 +3.2 +3.1 merge +0.8 +0.2 +1.0 +0.7 polish +1.9 +1.8 +0.6 +1.5 parameters +0.4 +0.3 +0.3 +0.3 (combined) 84.7 50.5 84.9 73.4 Discussion The prior sections have examined seven stages in the form-and-polish strategy, and identified the best method for each stage. In Table 2.10 we summarize for all stages the net improvement in alignment quality gained by the best method over a standard method. The baseline methods in the table are a UPGMA merge tree, percent identity for distances, pessimistic gap counts to merge subalignments, unweighted sum-ofpairs, no polishing, no consistency, and default parameters. We omit the stages of choosing weights and consistency which gave no improvement. These results represent the outcome of a careful study of methods for each stage of the form-and-polish strategy for multiple alignment. This includes new methods for estimating distances, merging alignments, weighting pairs of sequences, polishing the alignment, employing alignment consistency, and choosing parameters. A new merging method that optimally aligns alignments yields only a minor 56 improvement over an approximate merging heuristic. The largest gains in quality come from new methods for estimating distances by normalized alignment costs, and polishing by 3-cuts on the merge tree; together these two boost recovery by more than 4%. The weighting and consistency methods did not result in improved accuracy as implemented, but are promising; details are given in the next two chapters. The best method for a stage is generally the same across all suites of benchmarks, suggesting that what we have singled out as best has not been overfitted to the data. This best combination of methods yields a new tool we call Opal. In brief, the best-performing variant of Opal uses an MST merge tree, normalized costs for distances, exact gap counts to merge subalignments, unweighted sum-of-pairs, no consistency, and both on-the-fly and 3-cut polishing. A slight boost in accuracy can be gained using the parameter advisor. In Table 2.11, we compare the accuracy of Opal to other commonly-used tools: ProbCons [25], MAFFT [59], Muscle [29], T-Coffee [83], and ClustalW [110]. On BAliBASE and PALI, the second quality measure we report is the percentage of reference columns completely recovered, the TC score [6]. Both the most accurate and baseline versions of Opal are shown. All other tools were run at their highest accuracy (for MAFFT this was their L-INS-i variant). We highlight two observations: (1) The results show that even the baseline version of Opal strongly outperforms ClustalW, which uses essentially identical choices at each stage. This suggests that the pessimistic heuristic for aligning alignments works much better than the heuristic used in ClustalW. (2) By combining the best methods at each stage, Opal attains accuracy on par with the state-of-theart (namely ProbCons and MAFFT), even without using the alignment consistency method that is responsible for much of the success of those tools. Though the experiments presented in this chapter were performed on protein 60.4 57.9 58.2 54.3 52.1 48.0 41.6 85.1 84.5 84.7 84.3 80.1 80.2 78.0 73.2 MAFFT ProbCons Opal with advisor Opal with default parameters T-Coffee Muscle Opal baseline ClustalW 44.0 42.7 45.6 46.7 74.5 80.5 81.2 81.4 84.6 84.9 50.5 50.2 84.8 84.3 SPS 44.3 50.2 55.5 55.0 58.5 59.6 60.0 60.3 TC PALI 50.1 49.2 SPS SABmark 63.9 67.1 69.0 69.4 73.1 73.4 73.1 72.9 SPS 43.0 49.1 53.8 54.7 58.4 58.7 59.0 60.4 TC Average Table 2.11: Comparison of the accuracy of Opal to commonly-used tools on protein benchmarks. For each tool and for each suite of benchmarks, the table reports the average accuracy of the tool across all benchmarks in the suite. The last columns report the average accuracy of a tool across all suites. Accuracy is measured by the SPS score, and where applicable, the TC score [6]. The SPS score is the percentage of pairs of aligned positions from the reference alignment that were correctly recovered; the TC score is the percentage of columns from the reference alignment that were completely recovered. The performance of Opal is shown both using an advisor method for choosing input-dependent values for gap penalties for scoring alignments, and using input-independent default values. The highest accuracies for a suite are in bold. 57.9 TC SPS Tool BAliBASE 57 58 Table 2.12: Comparison of the accuracy of Opal to commonly-used tools on the RNA benchmark, BRAliBASE. Tool BRAliBASE (SPS/TC) ProbCons 86.9 / 76.8 Opal 86.8 / 76.7 MAFFT 85.8 / 74.8 Muscle 85.1 / 74.3 ClustalW 82.2 / 71.0 benchmarks, Opal parameters have been learned for DNA and RNA alignments in a manner similar to that in Section 2.9. Table 2.12 compares the accuracy of Opal to that of other tools on the BRAliBASE benchmark, which consists of mostly RNA alignments. Opal was developed as a testbed for methods, with a focus on accuracy, but has been somewhat optimized for speed. On inputs of 40 sequences, its running time when using the exact algorithm for aligning alignments is about two orders of magnitude slower than ClustalW and Muscle, and about the same as the slowest tool, T-Coffee; the pessimistic heuristic for aligning alignments gives about an order of magnitude speedup. Over all benchmarks, the median run time for Opal with the exact algorithm, was less than 10 seconds, which was on an input of 20 sequences of length near 250. 59 CHAPTER 3 WEIGHTS FOR SEQUENCE-PAIRS RELATED BY A TREE Sum-of-pairs scoring of multiple alignments is problematic when applied to sequences that have evolved on an evolutionary tree. The implicit assumption of sum-of-pairs is independence of sequences, but biological sequences are related under a tree-like evolutionary process, so the common case is that one sequence will tend to be more similar to some sequences than to others, and sequences often cluster into groups within the evolutionary tree describing their relationship. If the input sequences oversample some groups, the result of optimizing sum-of-pairs scoring will be to improve the pairwise alignments to such groups at the price of worsening the pairwise alignments to other groups. An example of this is seen in Figure 3.1. Overview Several schemes have been proposed to correct for this bias by nonuniformly weighting the pairwise alignment scores in the sum-of-pairs measure. The two schemes that are most widely used by current alignment software are described in Section 3.1. A new scheme, which we call influence weights, is easy to implement, overcomes the anomalies observed for the first two, and is the subject Section 3.2. All these schemes assign weights to pairs of input sequences on the basis of a tree T whose leaves correspond to the sequences, and that has edge lengths `e for all edges e ∈ T . Each scheme assigns a weight wij to a pair i, j of leaves in T . We now discuss details of the calculation of these weights. 60 Weighting sequence pairs A 20 seqs B 2 seqs C 50 seqs Weighting sequence pairs (a) 40 A-B pair alignments A (20) B 100 B-C pair alignments 44 (2) 1000 A-C pair alignments C (50) (b) Figure 3.1: Overrepresented groups can dominate sum-of-pairs scoring function. a) Sequences related by a tree, with two overrepresented groups. b) Sum-of-pairs scoring will drive towards small improvements in alignments between sequences in groups A and C, at the expense of possibly large reduction in quality in alignments between sequences in group B and all others. Weighting anomalies 61 Covariance weights u v 2d d 2w z y x w 47 Figure 3.2: The covariance weight method gives the anomalous result that the ratio of weights wxz : wyz depends entirely on the ratio of edge lengths `vx : `vy , even when lengths `vx and `vy are very small compared to lengths `uv = `uz See text for validation. 3.1. Prior approaches Covariance weights Perhaps the best-known weights for sum-of-pairs multiple alignment are those of Altschul et al. [3]. Of the two weighting schemes they suggest, their second scheme, which they call rationale 2 weights, has the more rigorous basis. We call this scheme covariance weights. Let `(p) be the length of path p between two leaves in a tree, and let `(pq) be the length of the overlapping portion of two paths p and q. Altschul et al. argue that a correlation coefficient between two pairs of sequences should be defined as ρpq = (`(pq))2 . `(p)`(q) (3.1) For k sequences, if these O(k 4 ) covariance are stored in a matrix M , weights for the pairs can then be computed by taking row sums from M −1 . This would take O(k 4 ) 62 time just to create such a matrix, but Altschul described an involved algorithm [4] that avoids matrix inversion, and can compute weights in O(k 2 ) time. Covariance weights have the nice property of being independent of root placement, but can exhibit counterintuitive behavior. For example, consider the tree in Figure 3.2. Suppose edge lengths `vx and `vy are very small compared to `uv and `uz . Under this condition, nodes x and y are essentially identical from the perspective of z, so it is reasonable that wxz and wyz should be roughly equal. But under covariance weights, if `vx = c `vy , then wyz ≈ c wxz even for arbitrarily long `uv + `uz . An example is provided here: let `uv = 998, `uz = 1000, `vx = 2, and `vy = 1. Then M = 1 22 12 3·2000 3·1999 22 2000·3 1 19982 1999·2000 12 1999·3 19982 2000·1999 1 1 0.00067 0.00017 = 0.00067 1 0.99850 0.00017 0.99850 1 , M −1 −0.16689 0.16647 1.00008 = −0.16689 333.61137 −333.11093 0.16647 −333.11093 333.61123 . 63 And the resulting pair weights are: wxy = 0.99967 , wxz = 0.33356 , wyz = 0.66678 . Covariance weights are not used in practice for sequence alignment, but an approximate method for computing them, called 3-way weights is. See below for details. 3-way weights Aligning two length-n alignments of k sequences takes time O(n2 k 2 ) when the substitution costs for aligning a column of one alignment with a column of another alignment are computed naively. Gotoh described significant speedups using what he called generalized profile operations [46, 103], which give an Ω(k) reduction in run time, to O(n2 k). This speedup is employed by the fastest alignment tools, Muscle [29] and MAFFT [59], and in Opal [115]. However, these operations can be shown [47] to work for the weighted sum-of-pairs scoring function only if each pairwise weight wxy for sequences x and y is of the form wxy = wx · wy (3.2) (i.e. pair weights are the product of constant per-sequence weights). Covariance weights do not obey this property, but Gotoh described a method [47] of approximating covariance weights that does. This method is often called 3-way weights. Weights for each edge are computed based on a decomposition of the tree into overlapping three-way trees. When aligning profiles formed from two 64 subtrees (as is done in both progressive alignment [36] or tree-based polishing [52]) a weight wx can be efficiently computed for each sequence x as a product of the weights of the edges from that sequence to the the root of its subtree, satisfying equation 3.2. The edge weights are computed as follows. For a three-way (unrooted) tree with edges A, B, and C, with respective edge lengths a, b, and c, edge weights are assigned values s bc(a + b)(a + c) a(b + c)F S s ac(b + a)(b + c) b(a + c)F S s ab(c + a)(c + b) c(a + b)F S wA = wB = wC = where S := (ab + bc + ac), and F is a hand-tunable parameter. In the full tree, a leaf edge takes its weight directly from the single 3-way tree that contains the edge. The weight of an internal edge is computed as the product of the weights for that edge found in the two 3-way subtrees containing that edge. This method depends on a hand-tuned parameter, and provides no guarantee that it approximates covariance weights, though it found similar pairwise weights in experiments on a very limited range of topologies [47]. To the extent that the 3-way method does approximate covariance weights, it suffers from the same anomalous behavior as covariance weights do. 3-way weights are used in the polishing stages of MAFFT and Muscle. Division weights The program ClustalW introduced a simple weighting scheme that has been incorporated into other alignment software as well, such as in the construction Weighting anomalies 65 Division weights u v x 2w d d w z y 2w 48 Figure 3.3: The division weight method gives the anomalous result that both wxz and wyz converge to a value twice as large as wxy when lengths `vx and `vy are very small compared to lengths `uv = `uz . See text for derivation. stages of MAFFT and Muscle. We call ClustalW’s scheme, division weights. In this scheme, each leaf i of a rooted tree T is first assigned a weight wi as follows. The length `v of an edge from child v to parent u is equally divided among the kv leaves in the subtree rooted at v, and the portion `v /kv that leaf i gets is accumulated in wi . Letting pi be the path from leaf i to the root, wi = X `v /kv . (3.3) v∈pi The weight for pair i, j is defined to be wij := wi wj . It is easy to see that the wij for k sequences can be computed in time O(k 2 ), and that the weights satisfy the requirements for equation 3.2. Division weights are dependent on the placement of the root, which may be problematic, as correct root placement is often not known for a set of sequences to be aligned, and the tree construction methods used by alignment tools in the guide Weighting anomalies 66 Division weights 2 u v 2d d y w1 x z w2 > w1 Figure 3.4: The division weight method gives more weight to more divergent sequences. When sibling edges vx and vy have different lengths, the alignment of the less divergent pair of sequences, z and y is more reliable, but receives 48less weight under this scheme. tree construction stage (Section 2.4) compute unrooted trees. Division weights can also exhibit counterintuitive behavior. Consider the tree in Figure 3.3. Suppose lengths `vx and `vy are very small compared to lengths `uv ≈ `uz . The similarity of x and y is high, so pair weight wxy should be much higher than pair weights wxz and wyz , which correspond to unreliable alignments. But under division weights, leaf weight wz will be roughly twice wx and wy since `uv is split between wx and wy while `uz is given entirely to wz . The result is that wxz = wyz ≈ 2wxy . By using division weights only in the construction stage (not in polishing), Muscle and MAFFT avoid the trouble this causes, as the weights between groups only come into play after the weights within groups have already played their role in the alignment of those groups. This sort of error would wreak havoc with alignments if, for example, the edge between x and v were cut in a tree polishing step (see Section 2.7). Note that ClustalW does not perform polishing. For another anomaly, see Figure 3.4. When `vx > `vy , weight wyz should be Influence weights 67 B A weights Influence C j i (a) i 50 A’ A” B C j (b) 51 Figure 3.5: Calculating influence weights. (a) Sequences i and j in the context of a rooted tree. (b) The tree is re-rooted at i, and the influence of j on i is computed. greater than wxz , since sequence y is more similar to z, and the alignment of y to z is therefore more reliable than the alignment of x to z. But division weight method sets wxz > wyz . 3.2. Influence weights Here we describe a new weighting method that is easy to compute and overcomes the anomalous behavior observed for the other methods. One way to assign a weight wij to a pair i, j of leaves is to quantify the influence of one leaf on another on the basis of the shape of T and its edge lengths. Suppose we have a measure ω(i, j) of the influence of leaf j on leaf i in T where function ω is not necessarily symmetric. Then we can define a symmetric weight on a pair of 68 leaves by the geometric mean of their influences: wij := p ω(i, j) ω(j, i). (3.4) We call the wij obtained in this way influence weights. Our influence function ω is nonnegative, and for every leaf i it satisfies X ω(i, j) = 1. (3.5) j : j6=i As a consequence, the resulting weights satisfy 0 ≤ wij ≤ 1. Influence definition To determine the influence ω(i, j) of leaf j on leaf i, we carry out the following recursive process. Imagine making i the new root of T , and letting T hang from root i. We denote this rerooted tree with root i by Ti . Starting from root i, we process Ti top-down, splitting the total mass of weight 1 from equation 3.5 among the descendants of i. The new root i has exactly one child (which was originally the parent of i in T ), and this child receives the entire mass of weight 1 passed down from its parent i. In general, if a descendant x receives mass wx from its parent, this mass is split among its two children y and z (possibly unequally) so that wx = wy + wz . (3.6) After completing this top-down splitting process, the amount of the original mass 1 at the root i that ends up at leaf j is taken as the influence of leaf j on root i, (see figure 3.5b): ω(i, j) := wj . (3.7) The key is determining how to split the mass at a parent among its two children on the basis of the edge lengths of T . We do such a split as follows. 69 Influence weights i x y Hi(y) z Hx(y) T(y) L(y) Figure 3.6: Graphical description of values used in influence calculation For nodes v and w of T , denote the path in T between v and w by Pvw . For a node y, let T (y) be the subtree of T rooted at node y, and let Tx (y) be T (y) together with the path Pxy . Denote the length of path Pxy by `xy and the set of leaves in T (y) by L(y). The total size of Tx (y) is X Sx (y) := `xy + `e . (3.8) e ∈ T (v) The average height (see Figure 3.6) of Tx (y) is X 1 Hx (y) := `xy + `iy . L(y) i ∈ L(v) We call the effective number of leaves in Tx (y), Sx (y) Hx (y) , Hx (y) 6= 0; Nx (y) := 1, otherwise. (3.9) (3.10) The effective number of leaves satisfies 1 ≤ Nx (y) ≤ L(v). See Figure 3.7 for extremal examples of the effective number of leaves. 70 Influence weights Influence weights i i Sx(y) = l (x,y) + ! l (e) (size) e T(y) Hx(y) = avg path length from x to L(y) (height) x Nx(y) = z Nx(y) ! 1 Sx(y) Hx(y) x (effective #ysequences) z l Sx(y) = (x,y) + ! l (e) e T(y) (siz Hx(y) = avg path length from x to L(y) (heig Nx(y) = Sx(y) Hx(y) (effective # sequence Nx(y) ! k y k seqs k seqs (a) (b) 55 54 Figure 3.7: Effective number of sequences in a subtree depends on the apparent independence of sequences from the perspective of the root of the subtree. a) When k sequences appear to be nearly identical, Nx (z) ≈ 1. b) When k sequences appear to be independent, Nx (z) ≈ k. In tree Ti , we split weight wx = wy + wz between the two children y and z of x according to the ratio, wy Nx (y) Hi (z) = wz Nx (z) Hi (y) (3.11) (where when Hi (y) = 0, we assign node y all the weight). For example, with children that have identical average heights and effective numbers of leaves Nx (y) = 1 and Nx (z) = 3, child y gets 1/4 of wx and child z gets 3/4. See Figure 3.8 for further examples. Splitting the weight wi = 1 top-down over Ti according to these ratios fully specifies the leaf weights wj , and hence the influence function ω(i, j) and the weights wij . For k sequences, influence weights wij can be computed in time O(k 2 ), which is optimal. They produce reasonable weights in the cases that division and covariance weights produce anomalous results. For example, in the covariance anomaly of Figure 3.2, influence weights converge to wyz ≈ wxz as `vz grows. In the division 71 Influence weights Influence weights i i l l (e) T(y) Sx(y) = (x,y) + ! wx Hi(y) ! Hi(z) 7 w y 8 x x 1 w 8 x e (size) wx Hx(y) = avg path length from x toxL(y) (height) 2 1 w w 3 x y 3 x z Sx(y) Nx(y) = (effective # sequences) Hx(y) Hi(y) = 5 Hi(z) = 10 Split wx = wy + wz according to the ratio: wy z Nx(y) ! 7 wz Nx(z) ! 1 (a) = Nx(y) Hi(z) Nx(z) Hi(y) l Sx(y) = (x,y) + ! l (e) e T(y) (s Hx(y) = avg path length from x to L(y) (hei Nx(y) = Sx(y) (effective # sequenc Hx(y) Split wx = wy + wz according to the ratio: wy wz = Nx(y) Hi(z) Nx(z) Hi(y) Nx(y) ! Nx(z) 57 56 (b) Figure 3.8: Splitting wx , the the portion of i’s weight that has been distributed to x, between the subtrees of x. a) When heights are equal, the subtree with more effective sequences gets a larger portion of the weight. b) When effective sequence counts are equal, the subtree with lower average height to the root gets a larger portion of the weight √ weight anomaly of Figure 3.3, influence weights converge to wxz = wyz ≈ wxy / 2`vz as `vz grows, which matches the expectation that the alignment of similar sequences x and y should be more reliable than the other alignments. And in the division weight anomaly of Figure 3.4, influence weights appropriately assign greater weight to a pair of sequences with higher similarity. Influence weights disagree with both division and covariance weights in another situation. Consider the unrooted tree in Figure 3.9 . Altschul suggests [3] that, as `vz approaches 0, wxy should go to 0. At first glance this seems reasonable, since in this extreme case sequence z is effectively the ancestral sequence relating x and y, so those two sequences should align to each other through z. The covariance weighting scheme achieves this goal, as do division weights (so long as the root is placed approximately at z). After further consideration, however, it appears that a non-zero wxy weight is Weighting anomalies 72 Covariance weights x d ~0 z v d y Figure 3.9: Altschul [3] suggests that the weight wxy should be 0 as length `yz goes to 0. This may not be appropriate, as it will fail to sensibly align regions resulting from convergent evolution. more appropriate. Consider the case where the sequences are: x := ACTG , y := ACTG , z := ACG . In other words, there were independent insertions of a T between the C and G of the ancestral sequence. This is a simple form of convergent mutation, and though the Ts are not homologous in terms of being replicated from the same ancestral character, they are at the very least similar, and may be a sort of latent homology (e.g. sequence x was primed for insertion of the T). If the wxy is zero, then the weighted score of the multiple alignment of all three sequences won’t be affected by the way those Ts are aligned (in the same column or two separate columns), so no alignment will be preferred. In contrast, if wxy is non-zero, then the weighted score will favor an alignment placing the insertions in the same column. 49 73 Influence weights assign a non-negligible value is assigned to wxy . This is due to Hi (z) Hi (y) the component of the recursive distribution of influences. The concrete values are: ω(z, x) = 0.5 , ω(z, y) = 0.5 , ω(x, y) = 0.33 , ω(x, z) = 0.67 , ω(y, x) = 0.33 , ω(y, z) = 0.67 , so √ 0.5 · 0.67 = 0.58 , √ = 0.5 · 0.67 = 0.58 , √ = 0.5 · 0.33 = 0.41 . wxz = wyz wxy These weights seem reasonable, as they assign most of the weight to good alignments of x through z to y, but will still reasonably resolve instances of convergent evolution. Efficiently computing weights Pairwise weights can be computed by hanging the tree by each leaf i in order, and for each such rehanging, computing the influence of each other leaf j on i by distributing i weight across the tree as described in the previous section. If the average height, Hx (y), and effective number of leaves, Nx (y), are pre-computed then distribution of weights for each root will take O(k) time for k sequences, resulting in an overall Θ(k 2 ) computation. 74 In an unrooted binary tree T , every internal node has three relatives, which map in some way to the parent and two children in a rooted tree, with the mapping dependent on how the tree is rooted. We aim to store for each internal node x the Hx (y) and Sx (y) values for each relative y as thought x were the root of T and those relatives were all its children. Then, for any arbitrary hanging of T , we will have the Hx (y) and Sx (y) values for the two children y of x induced by that hanging. These values can be computed in an O(k) preprocessing step, in which we begin by hanging T by an arbitrary leaf. A simple bottom-up pass over T allows computation of the Hx (y) and Sz (y) values for the two children y of every node x induced by that hanging, but does not assign the values for the third relative (currently the parent). The Sz (y) values of the parent for a fixed node is simply the total branch-length of the tree minus the sizes of the two other relatives. Hz (y) values can be gathered via a second top-down sweep over T , in which weighted heights to the remainder of the tree above that node are collected based on values computed in the first (bottom-up) pass. It appears to be not possible to apply these weights to Gotoh’s generalized profile operations, since the weights are not of the form wij = ω i · ω j , and there is no clear way to make them behave in a way analogous to the method of 3-way weights for profile operations [47]. This is because of the Hi (x) factors in formula 3.11. This suggests an avenue of future work on the problem. 3.3. Estimation of edge lengths All the methods described in this chapter for computing pair weights depend on a tree with known edge lengths, a requirement that is largely ignored in papers describing the application of weights to sequence alignment. The neighbor-joining method, used for guide tree construction (Section 2.4) in ClustalW, generates a tree with accompanying edge lengths, but the guide tree methods that have been shown to produce better alignment results, UPGMA and MST [30, 59, 115], generate 75 a topology only, with no edge lengths. When guide trees are so constructed, a pair weight method requires that edge lengths be computed. In the tests described in Section 2.6, we estimated edge lengths by minimizing the L1 −norm between the observed pairwise distances and the path lengths induced by the tree’s edge lengths. The L1 −norm is a measure of distance between vectors that is often called manhattan distance, specifically the sum of differences in vector values. In this case it is the sum over all leaf-pair paths of the differences between observed and tree-induced path lengths. Formally, given a tree topology T and pairwise distances dij , let `(e) ≥ 0 be the estimated length of edge e, and let `ij P be the length of the path pij from leaf i to leaf j: `ij = e∈pij `(e). The L1 -norm can be found using a simple linear program, with the objective function minimizing P i,j |dij − `ij | . The L1 −norm is not, however, the only method for estimating edge lengths. For example, the Minimum Evolution method of phylogeny inference is based on least-squares inference of edge lengths, which is essentially the L2 −norm (often called Euclidean distance, the square root of the sum of squared differences). P Under least-squares, edge lengths are found that minimize i,j (dij − `ij )2 . This method for fitting a linear function to data is known as ordinary least squares (OLS); its application to edge length estimation was suggested by Cavalli-Sforza and Edwards [19], and followed by Altschul [3] and Gotoh [47]. It works under the implicit assumption that errors in observed values are uncorrelated and have equal variance. OLS is known to be a maximum-likelihood estimator of a function when these conditions hold and errors are normally distributed. Furthermore, the GaussMarkov theorem indicates that OLS has the minimum variance among all linear estimators when errors have expected value of zero, regardless of the distribution [1]. Since variance of the distance estimate for a pair of sequences is known to rise as sequence distance rises [39], the assumption of equal variance among pairwise distances is unfounded; lengths can be found under weighted least squares (WLS), 76 which accounts for the per-pair variances. Also, pairwise distances are, of course, correlated since the sequences are related by a tree; generalized least squares (GLS) accounts for this by incorporating a covariance matrix. Bryant and Waddell [14] described algorithms for computation of edge lengths under each of these criteria, OLS, WLS, and GLS with respective run times of O(k 2 ), O(k 3 ), and O(k 4 ) for k sequences. Variances and covariances for pairwise distances under the Jukes Cantor model of sequence evolution [58] can be estimated using the method of Bulmer [16, 39]. 3.4. Discussion We have presented a new method for computing non-uniform weights for sequences related by a tree. It is easy to implement, fast to evaluate, and avoids the anomalies of current approaches. Surprisingly, and in contrast to what has generally been suggested in the literature [47, 29], unweighted sum-of-pairs performs as well as all weighting schemes that we tested in our experiments on using weighted sum-ofpairs scores for multiple sequence alignment (see Section 2.6). Even for the small number of inputs from BAliBASE that do present some level of overrepresentation, the accuracy gains from using these weights are limited at best. An important point, which to our knowledge has not been mentioned in prior studies of multiple alignment methods, is that the standard measure of recovery is itself inherently biased. The SPS recovery score is measured uniformly over all induced pairwise alignments. So if a weighting method alters an alignment by correcting for over-represented groups, it is entirely possible that the corrected alignment worsens between a few highly over-represented groups while improving between several under-represented groups, and in the end has decreased percent recovery under SPS. Most other measures of alignment accuracy also sum a measure of pairwise accuracy over all pairs, and thus suffer the bias. (The TC score measures accuracy as the percent of reference columns that are exactly recovered by the 77 computed alignment; thus, it doesn’t suffer the same problem, but is highly sensitive to small errors, and thus seems not applicable to alignments of a large number of sequences.) While weighting methods attempt to correct for precisely this bias, it is far from clear how to measure recovery more objectively, since measuring recovery using a weighting method’s own weights is not objective either. It is possible that the choice of L1 -norm edge lengths had a negative impact on the quality of pair weights, leading to the observation (Section 2.6) that pair weights did not impact alignment quality. This is worth further consideration, since other work ([29, 59]) suggests a slight positive impact from weights, though they do not describe how edge lengths were computed for their trees. Finally, we note that this weighting scheme is general, and may be applied to other fields in which entities are related by a tree. For example, they could be used in estimating an average character trait value for a family of organisms, or in improving homology search against sequence families, or in other tree-based measures such as a “balanced” average subtree distance with application to the least-squares framework for edge length estimation [24]) (see Section 8.2 for more details of this idea). 78 CHAPTER 4 IMPROVING MULTIPLE ALIGNMENT THROUGH PAIRWISE ALIGNMENT CONSISTENCY The standard heuristic for building multiple sequence alignments, called progressive alignment [36] and described in detail in Chapter 2, merges sequences into increasingly large alignments in an order determined by a binary tree relating the input sequences. Each leaf of the tree stores a single sequence from the input (this sequence can be thought of as a degenerate alignment), and each internal node stores an alignment of the sequences belonging to the leaves in the subtree under that node. The alignment at an internal node is generated by merging (aligning) the alignments of the node’s two children. When merging alignments, columns of one alignment are aligned with columns of the other; the columns of each child alignment remain unaltered in the resulting alignment. In this process, an alignment at an internal node is formed based only on information contained in the sequences being aligned, without regard for how well the resulting alignment will align with other sequences in the final alignment formed at the root of the tree. The progressive alignment heuristic can be contrasted with computing an optimal alignment of multiple sequences, e.g. [71]. By definition, optimal alignment of multiple sequences under the standard sum-of-pairs scoring function (which is NP-hard [114]) induces pairwise alignments for each pair of sequences A and B that are optimal with respect to how the alignment (A ∼ B) constrains alignments of A and B with all other sequences C - no other alignment of A and B could improve the sum of pairwise scores over all pairs in the alignment. In the standard progressive strategy, an optimal alignment of sequences in a subtree is computed based on local information contained in those sequences, and may prove to be suboptimal in the 79 larger context of the full alignment - resulting in a final alignment in which some (or all) induced pairwise alignments (A ∼ B) could be realigned in a way that would improve the overall alignment score. As mentioned in Chapter 2, one approach to resolving this suboptimality is to modify the alignment after the fact, in a method known as polishing (see Section 2.7). Another approach is to avoid making the errors in the first place using a technique based on consistency of alignments. The idea of using alignment consistency is to bridge the divide between the progressive and exact approaches. The core idea is to make alignment decisions at each internal node of the progressive alignment guide tree based on global information, specifically information gleaned from other sequences in the input. Practically, this means modifying the function used to score an alignment generated for the sequences belonging to a subtree in a way that reflects the support given by sequences outside that subtree for particular alignment decisions. Use of consistency has been shown to improve recovery of reference alignments by several percent on average [83, 25]. All the consistency-based methods described below share the common strategy of modifying scores for aligning pairs of sequences, then aligning alignments so as to optimize the sum of (modified) pairwise alignment scores. For this reason, description of methods below will focus on how scores are modified for alignments of a pair of sequences A and B. Under standard scoring conventions, the score for aligning two sequences is comprised of substitution scores and gap penalties. Substitution scores are typically derived from a matrix (e.g. BLOSUM [51]) that gives a position-independent score for substituting one letter with another, while affine gap penalties (see Section 2.9) are composed of a per-gap cost, γ, and a length-dependent cost, `λ for a gap of length `. The gap costs are position independent, and chosen so that the tool performs well on benchmarks. 80 In the consistency framework, position-dependent scores are computed. A substitution of the ith position of A with the j th position of B can be represented as (ai ∼ bj ). The position-dependent score D(ai , bj ) of (ai ∼ bj ) depends on the support given by other sequences for that particular substitution. Gap penalties are not used in prior implementations of this framework, so that the score of an optimal pairwise alignment of prefixes (a1 . . . ai ) of A and (b1 . . . bj ) of B can be computed by the recurrence in equation 1.2, but with D(ai , bj ) replacing σ(ai , bi ), and λ = 0. This recurrence can be generalized to aligning two alignments, A and B, by replacing the D(ai , bj ) score with a function that sums over the scores of all pairs of characters between the two columns i and j. The two-sequence variant finds a maximum weight trace [61] on two sequences, and the two-alignment variant finds a constrained maximum weight trace of all the sequences in the two sets, in which the columns of the two merged alignments are required to remain intact. As described in Section 1.2, an alignment of two sequences can be viewed as a path in a 2-dimensional dynamic programming table. It is useful to consider features of such paths, as they will play a role in devising a more complete model of alignment consistency. A feature can be thought of as a subpath in the graph. A diagonal edge in a path (as in Figure 4.1a) corresponds to a substitution feature. A vertical or horizontal edge in a path (as in Figures 4.1b and c) corresponds to a gap extension feature, while the subpaths in Figures 4.1d through k correspond to gap-open and gap-close features. Note that in prior alignment consistency methods, a substitution score (the diagonal edge in Figure 4.1a) is incorporated, but all other features (gap-related costs, Figures 4.1 b through k) have zero cost. The argument supporting this gap-cost-free approach is that gap costs are considered in computing the modified substitution scores, so they need not be considered in the final alignment procedure &'(" 81 &'()*+,-."/"012" &'()*+,-."/"012" %" %" #" $" %" #" #" $" $" !" !" !" (a) substitution &'()*+,-."/")*01"""$" %" %" #" $" (b) vertical gap ex(c) horizontal gap &'()*+,-."/")*01"""$" &'()*+,-."/")*01""""%" &'()*+,-."/")*01"""%" tension extension %" #" #" $" !" #" $" !" $" %" !" !" (d) vertical gap open, (e) vertical gap open, (f) horizontal gap (g) horizontal gap &'()*+,-."/"0.)12"""%" &'()*+,-."/"0.)12"""$" general &'()*+,-."/"0.)12"""$" case special %&'()*+,-"."/-(01"""2" case open, general case open, special case %" 2" #" $" %" #" $" $" !" !" %" #" #" $" !" !" (h) vertical gap close, general case (i) vertical gap close, special case (j) horizontal gap close, general case (k) horizontal gap close, special case Figure 4.1: Features in dynamic programming table for alignment (A ∼ B). (a) Substitution of ai with bj . (b) Placing ai somewhere in a gap between bj and bj+1 . (c) Placing bj somewhere in a gap between ai and ai+1 . (d and e) The general (d) and special (e) cases of opening a gap with ai immediately following bj . (f and g) The general (f) and special (g) cases of opening a gap with bj immediately following ai . (h and i) The general (h) and special (i) cases of ai ending a gap that immediately follows bj . (j and k) The general (j) and special (k) cases of bj ending a gap that immediately follows ai . 82 based on those modified scores. However, it is possible that conflicting signals from a set of different sequences could lead to an alignment (A ∼ B) that opens gaps in a way that is universally poor, for example opening many small gaps rather than a few large gaps. Such gap patterns are restricted in standard alignment methods by use of the common affine gap cost function, in which each run of gap characters incurs a per-gap penalty (γ, which corresponds to the open and close costs pictured in Figures 4.1d through k) and a per-unit-length penalty (λ, Figures 4.1b and c). For this reason, it seems worthwhile to consider a consistency-based score modification method that includes modified gap costs. Our new method does this. 4.1. Prior approaches The idea of using consistency to improve alignments has a long history. Gotoh [44] described a method that identifies regions in optimal pairwise alignments that are consistent with each other, and uses them as anchor points in an algorithm for exact alignment of multiple sequences. It is noteworthy that the method does not simply consider a single optimal alignment for each pair of sequences, but rather the set of alignments for each pair that have optimal score. Unfortunately, as the number of sequences grows, it becomes increasingly unlikely that all sequence pairs will agree with any anchor point, severely limiting the scale of inputs the method can handle. The methods described below employ consistency in a progressive alignment framework. The primary difference between these methods is how each measures support for an aligned pair. T-Coffee Notredame et al. merged the ideas of consistency and progressive alignment in their landmark work on tools Coffee [84] and T-Coffee [83]. This amounts to a redefining of the term consistency as applied to alignment: rather than using consistent segments of pairwise alignments as mechanisms for speeding exact alignment, these tools use consistency of pairwise alignments to improve the 83 quality of alignments built with the progressive framework. They take as input a set of sequences and a set of pairwise alignments of those sequences, and define the quality of a multiple alignment of those sequences as the extent to which that multiple alignment is consistent with the set of pairwise alignments. There is no requirement that all pairwise alignments be consistent with any part of the resulting multiple alignment, just that the multiple alignment try to agree as much as possible with those pairwise alignments. Where T-Coffee and other consistency methods differ is in their definitions of such agreement. The approach of T-Coffee is to compute substitution scores for all pairs of positions in all pairs of input sequences, then perform alignments based on those position-pair substitution scores. The position-pair scores of T-Coffee are calculated as follows. For each pair of sequences (A, B), an optimal global alignment is calculated using ClustalW [110], and the 10 top-scoring non-overlapping local alignments from Lalign [55] are gathered. Every aligned position pair (ai ∼ bj ) found in one of these alignments contributes a value to the position-pair score S(ai , bj ), where the value contributed is the percent sequence identity of the alignment containing that pair. The motivation behind using percent identity as a weight is that more similar sequences should contribute more information to the final alignment. Scores are contributed additively, so that if a pair is found in more than one pairwise alignment (i.e. both a global and local alignment), it will have a score that is the sum of the percent identities of the containing alignments. If a pair is not found in any alignment, then its score is 0. This describes how to calculate an initial score for each pair. The set of initial scores is called the primary library. The notion of consistency is employed in developing what is termed the extended library. The approach is to quantify the support from other sequences C for aligning a position pair (ai ∼ bj ) by tallying the positions ch that are found to align to both ai and bj in the input alignments. These can be thought of positions such that ai aligns through ch to bj , (ai ∼ ch ∼ bj ). Specifically, the consistency-based position- 84 pair score is D(ai , bj ) = S(ai , bj ) + X C min {S(ai , ch ), S(bj , ch )}. (4.1) For k sequences of length n, T-Coffee requires time O(k 2 n2 ) and space O(k 2 n) to construct and store the primary library, and worst case time O(k 3 n2 ) to complete the extended library (though in practice runtime is reported to be roughly O(k 3 n) because of the sparseness of the score matrices in the primary library). Results given in the T-Coffee paper (and reiterated in Sections 2.10 and 4.6) show that T-Coffee gives a several percent average increase in recovery over ClustalW, affirming the benefits of incorporating local alignment and triple-based consistency. As mentioned earlier, one concern about this approach is that the alignment method depends only on substitution scores, with no accounting for gaps. Another significant drawback of this approach is that only a single optimal global alignment for each pair is used to establish pair scores. This is despite the fact that when sequence C is eventually merged with the alignment (A ∼ B), it will likely do so with an alignment that disagrees with the both optimal alignments (A ∼ C) and (B ∼ C). It seems preferable to allow C to contribute to the modified (A ∼ B) scores by expressing the extent of suboptimality required of alignments (A ∼ C) and (B ∼ C) in order to support the alignment of position-pair (ai ∼ bj ). MAFFT Katoh et al. developed a method [59] that allows MAFFT to compute its analog to the extended library in time quadratic in the number of sequence, avoiding the cubic overhead of T-Coffee. Rather than explicitly measure support for aligning positions (ai ∼ bj ) based on triples (ai ∼ ch ∼ bj ), MAFFT’s approach measures support based on how often ai and bj are part of some optimal alignment. The importance score of a pair of positions is computed such that high scores are assigned to pairs in which both positions are frequently involved in (possibly independent) high-scoring gap-free segments in pairwise alignment with 85 other sequences. Position-pair scores are calculated based on a linear mixture of BLOSUM-derived substitution scores and these importance scores. This approach avoids explicitly considering triples of strings, and results in a runtime of O(k 2 n2 ) for calculation of consistency-modified position-pair scores. While this method gives a significant speedup in practice, the role of consistency has been changed to an indirect one: rather than finding an alignment that will be consistent with respect to other sequences, it finds an alignment that aligns positions that are consistently (read: frequently) paired off in other alignments. Even with this indirect approach, results given in the paper show a several percent improvement in recovery, across a variety of benchmarks, over not using this consistency approach (see Section 4.6). Aside from the indirect method of computing consistency modified scores MAFFT’s approach is essentially the same as that of T-Coffee. It bases its consistency-modified scores on global alignments computed with its FFT alignment algorithm, and optionally includes local alignments from the tool fasta34. Like T-Coffee, MAFFT does not account for sub-optimal global alignments in assigning scores, and does not incorporate gap costs in the final alignment cost scheme. ProbCons Do et al. introduced an idea they call probabilistic consistency, in a tool named ProbCons [25]. Their approach depends on an established method [28] of computing, for a pair of sequences A and B, the probability that each position ai aligns with each position bj . Letting z ∗ be the true alignment, and Z the set of all possible alignments, their method defines the posterior probability of residue ai aligning to residue bj as p(ai ∼ bj ∈ z ∗ |A, B) = X z∈Z P (z|A, B) 1(ai ∼ bj ∈ z) (4.2) where 1(expr ) is the indicator function that returns 1 if expr is true, and 0 otherwise. P (z|A, B) represents the probability of emitting alignment z under a pairwise HMM (see [28]), and given sequences A and B. 86 Two sequences can be aligned in a way that maximizes the sum of these probabilities, in effect finding an alignment of maximum expected accuracy [78, 28]. (It should be noted that this may disagree with the most probable alignment under the pair-HMM, found using the so-called Viterbi algorithm, which is equivalent to the standard dynamic programming algorithm.) This pairwise alignment method can be trivially extended to aligning multiple sequences: the positionpair probabilities are simply another form of position-pair score, so progressive alignment can be applied with sum-of-pairwise scores maximized at each internal node. The notion of consistency is applied to this framework via a consistency transformation based on the probability, for each other sequence C in input C, of the pairs that induce an aligned triple (ai ∼ ch ∼ bj ). This is computed as p0 (ai ∼ bj ) = 1 XX p(ai ∼ ch ) p(bj ∼ ch ). |C| C∈C c (4.3) h These consistency-transformed probabilities replace the initial ones in computing an alignment with maximum expected accuracy. This approach represents a key advance in sequence alignment. In the other methods described, only a single optimal global alignment is used for each pair of characters, along with a small number of local alignments. These do not allow any input from pairwise alignments that are suboptimal under the scoring function, or for that matter, any alternate alignments with optimal score. In the scheme of ProbCons, alignment probabilities for each pair of positions (ai , bj ) are summed over all ways of aligning A and B. Thus, in the consistency transform, all pairwise alignments, optimal and suboptimal, contribute to the support (score) of pair (ai ∼ bj ), with their level of support dependent precisely on the probability of that alignment under the model. The time to compute the initial p(ai ∼ bj ) values over all pairs of sequences is O(k 2 n2 ). The worst-case time to then calculate the transformed probabilities is 87 O(k 3 n3 ): for each triple of sequences, each pair (ai , bj ) requires consideration of all positions ch . But since most values in the probability matrices are typically near zero, the transformation is computed efficiently using sparse matrix multiplication by ignoring all entries smaller than a threshold ω. Thus, the runtime is cubic in k with a smaller practical factor of n. The paper reports a 5% improvement in benchmark recovery over using pair probabilities without consistency, in agreement with results in Section 4.6. An additional benefit of the method is that it also offers a column reliability score that estimates the probability that each column represents a correct alignment of residues. A variety of tools based on the ProbCons’ formulation of probabilistic consistency have been developed in the past few years [88, 89, 93, 97, 86, 11], with various aims of improving speed, accuracy (sensitivity), or specificity. All follow the same central dogma of determining consistency-based scores using only information about aligned pairs, with no regard for gap costs. The two recent tools out of Pachter’s group, AMAP [97] and FSA [11], do somewhat account for gaps in the progressive alignment step, but this is only after gap-ignorant consistency modification is performed, and still ignores the per-gap penalties that allow affine gap scoring to prefer long gap stretches over many short gaps. 4.2. Suboptimality as a measure of support for alignment features In the equations above, a position ch provides support for the alignment of ai with bj only if it is aligned with both. In a sense, this doesn’t so much consider the consistency of alignments of C with A and B as it does the constraint they impose. For example, the following pairwise alignments of C with A and B are consistent with (ai ∼ bj ), in the sense that they do not restrict that alignment from occurring: ··· ai−1 ai ch − ··· and ··· ch − bj−1 bj ··· 88 Consistent, but non-constraining, alignments may provide more flexibility for identifying support of one sequence for alignment of another. But because of the high level of freedom such alignments have, they incur extra computational burden, so we follow in the footsteps of the methods above, and consider only alignments involving other sequences C that constrain the alignment of A and B to have particular features, though we continue to describe the approach as consistencybased, in agreement with standard nomenclature. Our approach departs from other methods in two ways: 1. It explicitly accounts for gaps. Rather than compute only consistency- modified substitution scores and align with no-cost gap model, we compute consistency-modified costs for all features of an alignment under affine gap costs: substitution, gap-open, gap-extension, and gap-close costs. 2. Rather than ask “which optimal alignments support this alignment feature?”, our method asks “what is the minimum suboptimality required of other alignments in order for those alignments to support this feature?”. The first question is the one asked by the consistency methods of T-Coffee and MAFFT. The latter is similar to that asked by ProbCons, though in fact, ProbCons goes further by effectively soliciting support from all alignments, not just the least suboptimal. 4.2.1. Modified substitution score We begin by describing a simple method of computing the support by a sequence C for a pairwise substitution, (ai ∼ bj ), a diagonal edge in the dynamic programing matrix. Similar modified costs for other edges are described later, and a more complete discussion of modification equations follows in the next section. The extent of agreement that position ch has with substitution (ai ∼ bj ) (Figure 4.1a), can be determined by computing the level of suboptimality required )*+" 89 %" %" #" $" #" (" !" &" '" Figure 4.2: Subpaths of alignment graphs of (A ∼ C) and (B ∼ C) that are consistent with (ai ∼ bj ) (figure 4.1a). The alignment that agrees with these subpaths is: ··· ai ch · · · bj of alignments (A ∼ C) and (B ∼ C) to be consistent with the substitution. Since our definition of consistency requires constraint, the only way ch can be consistent with (ai ∼ bj ) is if ch is aligned with both: (ai ∼ ch ∼ bj ). This can be represented as a pair of alignment subpaths, one for (A ∼ C) and one for (B ∼ C), as in Figure 4.2. Let T (A, C) be the optimal (minimum) cost of aligning A and C, and let Tih (A, C) be the cost of the optimal alignment of those sequences, under the constraint that the alignment is forced to go through the diagonal edge (ai ∼ ch ). Then the suboptimality of using that edge is defined as Sih (A, C) := Tih (A, C) − T (A, C) . T (A, C) (4.4) Sjh (B, C) is defined similarly. The suboptimality required of C in order to support 90 (ai ∼ bj ) can be defined by picking the best position ch : n o sub(ai , bj , C) := min max {Sih (A, C) , Sjh (B, C)} . h (4.5) Suppose the values of Sih (A, C) and Sjh (B, C) are capped at ∆ (this is done for reasons of computational efficiency, as described later). Then one way of modifying the cost of substitution (ai ∼ bj ) given a single other sequence C is to define the modified cost as: sub(ai , bj , C) D(ai , bj ) := σ(ai , bj ) 1 + F · , ∆ (4.6) where σ(x, y) is the cost of substitution x for y, as from a substitution cost matrix (e.g. BLOSUM62 transformed to a cost matrix). Under this definition, an aligned pair (ai ∼ bj ) with consistent subpaths that are optimal (sub(ai , bj , C) = 0) will have cost σ(ai , bj ), while an aligned pair (ai0 ∼ bj 0 ) with maximally suboptimal consistent subpaths will have that cost multiplied by (F + 1). Equation 4.6 can be extended to computing modifications based on a set of other sequences C by taking the average suboptimality over the sequences C ∈ C: X sub(ai , bj , C) C∈C . (4.7) D(ai , bj ) := σ(ai , bj ) 1 + F · |C| · ∆ The consistency definitions provided here are fairly simple. Somewhat more complicated (and more successful) methods are discussed in Section 4.3. 4.2.2. Modified gap extension scores The modified cost described above is a substitution cost, which corresponds to a diagonal edge in the dynamic programming table of alignment (A ∼ B), as in Figure 4.1a. Computing modified costs for other features is done in a similar fashion. For example, a vertical gap extension is shown in Figure 4.1b, and corresponds to character ai being part of a run of characters from A that align 91 )*+,-."$" %" %" #" $" #" (" !" &" '" Figure 4.3: Subpaths of alignment graphs of (A ∼ C) and (B ∼ C) that are consistent with ai extending a gap after bj (Figure 4.1b). The alignment that agrees with these graphs is: ··· ◦ ai ◦ ch · · · bj − (where ◦ represents freedom to either include or not include a character from the associated sequence ) to a gap in B between bj and bj+1 . A modified vertical gap extension cost can be based on the level of suboptimality required for alignments (A ∼ C) and (B ∼ C) to be consistent with the edge of (A ∼ B) shown in Figure 4.1b. One such consistent pair of alignment subpaths involving C is shown in Figure 4.3. An exhaustive list of such subpath-pairs, consistent with both vertical and horizontal (Figure 4.1c) gap extensions, is given in appendix Section A.0.2. Let T (B, C) be defined as before, and let Tj≺h (B, C) be the cost of the optimal alignment of (B ∼ C) when that alignment is forced to go through the horizontal edge leading to cell (j, h) of the dynamic programming table (the right half of Figure 4.3). Then the suboptimality of that edge is defined as Sj≺h (B, C) := Tj≺h (B, C) − T (B, C) , T (B, C) (4.8) and the suboptimality required of C in order to support the edge of (A ∼ B) shown 92 in Figure 4.1b can be defined by picking the best position ch , for the best subpath pair p among all consistent subpath pairs P: ( ) n o ext(ai , bj , C) := min min max {Sih (A, C) , Sj≺h (B, C))} . pP h (4.9) With the cap ∆ on the value of S(·) suboptimality values, the modified vertical gap extension cost due to a set of other sequences C can be defined as: avgC∈C {ext(ai , bj , C)} V (ai , bj ) = λ 1 + F · , ∆ (4.10) where λ is the unmodified cost of a gap extension. 4.2.3. Modified per-gap scores Per-gap costs can be modified in a similar way. In our model, the typical pergap cost (γ) is divided into two halves, the gap-open cost and the gap-close cost, each with value γ/2. This is done to ensure that an alignment under the modified cost scheme will have the same score if its columns are reversed. Here, we give an example of computing modified vertical gap-open costs (Figure 4.1d). A vertical gap-open corresponds to ai being the first character in a run of characters from A that are aligned to a gap between positions bj and bj+1 . Horizontal gap open costs (Figure 4.1f), and both vertical (Figure 4.1h) and horizontal (Figure 4.1j) gap-close costs are computed similarly. One pair of subpaths involving C that is consistent with the horizontal gap open of Figure 4.1f is shown in Figure 4.4. An exhaustive list of such graph-pairs, consistent with both horizontal and vertical gap boundaries, is given in appendix Section A.0.3. Let T (B, C) be defined as before, and let Tj|≺h (B, C) be the cost of the optimal alignment (B ∼ C) when that alignment is forced to open a gap with ch being the first entry of the gap following bj (the right half of Figure 4.4). Then the suboptimality of that path is Sj|≺h (B, C) = Tj|≺h (B, C) − T (B, C) , T (B, C) (4.11) 93 )*+,"$" %" %" #" $" #" (" !" &" '" Figure 4.4: Subpaths of alignment graphs of (A ∼ C) and (B ∼ C) that are consistent with ai opening a gap immediately after bj (figure 4.1d). The alignment that agrees with these subpaths is: ··· ◦ ai ch−1 ch · · · bj − (where ◦ represents freedom to either include or not include a character from the associated sequence ) and the suboptimality required of C in order to support the gap open in (A ∼ B) shown in Figure 4.1f can be defined by picking the best position ch , for the best subpath pair p among all consistent subpath pairs P : ( ) n o open(ai , bj , C) = min min max {Sih (A, C) + Sj|≺h (B, C)} . p∈P h (4.12) With the cap ∆ on the value of S(·), the modified vertical gap open cost due to a set of other sequences C can be defined as: γ avgC∈C {open(ai , bj , C)} Vo (ai , bj ) = 1+F · , 2 ∆ (4.13) where γ/2 is the unmodified gap open cost. 4.2.4. Support from multiple parameter choices This section has described a method for computing consistency-modified costs based on the suboptimality information found in a single alignment for each pair 94 of sequences. In their MCoffee paper, Wallace et al. [112] described a method that develops consistency-modified scores based on multiple alignments generated by multiple software tools. A similar idea can be harnessed internally, generating pairwise alignments with multiple parameter choices, and computing alignment feature support from that set of pairwise alignments. The idea is based on the fact that we don’t know which parameters are optimal for a given input, but hope that the relatively good parameters will tend to agree, and overwhelm the input of the few poor parameters. This is related to the underlying motivation of the elision method [116]. We tested this idea with a wide range of parameter combinations, using up to 10 parameter choices, but found slight degradation in average alignment accuracy relative to simply using the single parameter choice that performs best. 4.3. Alternate definitions for consistency-modified costs In the previous section, simple equations (4.7, 4.10, and 4.13) were given for computing modified alignment costs based on graph-derived suboptimalities. The equations have explanatory value, but did not perform well in our tests. Alternate equations are given in this section. The focus of this section will be on computing modified substitution costs. Equations for computing modified gap costs are not given, but are easy to derive, essentially by replacing the σ(ai , bj ) values with either λ or γ/2 values, as in the previous section. We call the various equations consistency blends, because they all aim to blend unmodified costs for features in an (A ∼ B) alignment with the extent of suboptimality required of pairwise alignments involving other sequences C in order to be consistent with those features. 95 4.3.1. Imbalanced suboptimality blend variants The equations from the previous section are what we call the imbalanced maximum suboptimality blend. For each sequence C ∈ C, the suboptimality sub(ai , bj , C) comes from the position ch giving the smallest maximum of Sih (A, C) and Sjh (B, C) (see equation 4.5). These suboptimalities are averaged over all sequences C, normalized by the maximum suboptimality allowed (∆), and blended with the unmodified cost of the (A ∼ B) feature to give the modified feature cost. A similar modification can be called the imbalanced average suboptimality blend. Under this method, for each sequence C ∈ C, the suboptimality comes from the position ch giving the smallest average of Sih (A, C) and Sjh (B, C). Sih (A, C) + Sjh (B, C) sub(ai , bj , C) := min . h 2 (4.14) The modified feature cost is computed as before. One problem with these methods is that, when computing a modified cost for a feature in the alignment graph of (A ∼ B), they fail to account for how suboptimal that feature of the (A ∼ B) alignment is in its own (unmodified) alignment graph. This represents an imbalance, where suboptimality is considered only for features in (A, C) and (B, C) alignments, but not (A, B) alignments. This can be resolved with a slightly more complicated equation, which imposes a balance between internal and external suboptimality. 4.3.2. Balanced suboptimality blend variants Let Sij (A, B) be defined as Sih (A, C) was defined in the previous section, and capped at a maximum value of ∆. One natural way to insert self-suboptimality into the equation is by blending it into the numerator of equation 4.7: (1 − α) · Sij (A, B) + α · avgC∈C {sub(ai , bj , C)} D(ai , bj ) = σ(ai , bj ) 1 + F · , ∆ (4.15) 96 where 0 ≤ α ≤ 1 is a blending coefficient. Note that this equation now features two tuning coefficients, F and α. A singleparameter variant that works equally well in our tests is Sij (A, B) + F · avgC∈C {sub(ai , bj , C)} D(ai , bj ) = σ(ai , bj ) 1 + . ∆ · (1 + F ) (4.16) If all suboptimalities are zero, the modified cost of the feature will be σ(ai , bj ), while maximal suboptimality will cause the modified feature cost to be doubled. We call this equation the balanced suboptimality blend, and it can be based on either the maximum or average suboptimality for each pair (A ∼ C) and (B ∼ C) (Equations 4.5 and 4.14, respectively). 4.4. Easing the computational burden The methods described above for computing consistency-modified feature costs are computationally demanding. In the default form, run time to compute consistency costs will be an impractical O(k 3 n3 ): for each pair of sequences, (A, B), every other sequence C is considered, and for each such (A, C, B) triple, all (ai , ch , bj ) positiontriples are viewed. This is much more than the practical run times of the related tools: T-Coffee and ProbCons require time more like O(k 3 n2 ), and MAFFT’s takes time O(k 2 n2 ), as discussed in Section 4.1. Also, in the simple method described in the previous sections, the entire alignment matrix is stored for each pairwise alignment, in order to compute suboptimality values. This consumes default space O(k 2 n2 ), which is greater than the O(k 2 n) space used by MAFFT and T-Coffee to store a single optimal alignment of each pair. It is also greater than that used by ProbCons, which in principle could require O(n2 ) space per alignment to store posterior probabilities for all position pairs, but in practice consumes less since it maintains a sparse matrix holding only non-negligible values. The practical run time for computing consistency costs can be reduced to O(k 2 n2 ) by using two orthogonal ideas: (1) compute consistency-modified scores 97 based on a fixed-sized set of sequences C, and (2) limit the number of positions ch that are viewed in searching for ch with minimum suboptimality. Further speed up is possible if consistency-modified costs are only used for the few alignments performed near the tips of the guide tree (see Section 2.4) in the progressive alignment method. As implemented for our experiments, alignment with consistency is quite slow even with these speedups, taking slightly longer to complete (with no polishing) than standard alignment with polishing takes. 4.4.1. Computing consistency from C of fixed size In the methods of T-Coffee and ProbCons, per-position scores for sequence-pair (A, B) are derived by looking at all other sequences C. This results in a time bound of O(k 3 ) for constant n. By restricting the number of sequences C that are used to compute consistency-based costs for (A ∼ B), this bound is reduced. We do this by creating a subset of C for each pair (A, B), CAB . Then, for example, equation 4.7 can be altered to be avgC∈CAB {sub(ai , bj , C)} D(ai , bj ) = σ(ai , bj ) 1 + F · , ∆ (4.17) with other equations altered similarly. This is a seemingly reasonable thing to do, since the sequences most distant from A and B can probably not contribute much information about the alignment of (A ∼ B) that isn’t found in more similar sequences. In other words, sequences C that are similar to sequences A and B are more likely to provide accurate support for features in the alignment (A ∼ B). Based on this observation, our approach is to compute pairwise distances for all sequences, with d(A, B) equal to the normalized cost from Section 2.4. Then for each sequence-pair (A, B), the X sequences from {C − {A, B}} with smallest d(A, C) + d(B, C) are assigned to CAB . In our tests, we found X = 5 to result in accuracy as high as larger values of X. 98 Note that it is also possible to assign weights to the contributions of various sequences to the consistency equations. Since sequences that are more similar to A and B are more likely to provide accurate signal to their alignment, we could apply higher weights to the suboptimality values of more similar sequences. This was not found to improve accuracy. Creating a strict limit to the number of utilized sequences amounts to a step function for weighting sequences: the closest X sequences all get weight 1, and others get weight 0. 4.4.2. Computing consistency using a band around optimal alignments For a pair of sequences (A, B) and a third sequence C, the default time to compute consistency-modified feature costs is O(n3 ) for sequences of length n: for each position-pair (ai , bj ), all positions ch are considered. But the practical runtime can be reduced by considering a restricted set of positions in C for alignments (A ∼ C) and (B ∼ C), specifically the positions that are likely to be involved with the least suboptimal subpath pairs. One way to achieve this is to consider, for a position-pair (ai , bj ), only those positions ch that fall inside a band around optimal alignments of (A ∼ C) and (B ∼ C). This is sensible, since the support, as in equation 4.5, comes from the position ch with lowest suboptimality, and positions (ai , ch ) that are far from an optimal alignment (A ∼ C) are unlikely to provide the lowest suboptimality support levels. Because there can be more than one optimal scoring alignment (A ∼ C), the leftmost and rightmost optimal paths in the dynamic programming table are determined. Then, for each row i in that table (corresponding to position ai ), a left-to-right window is established that extends a few cells to the left of the leftmost optimal alignment and a few cells to the right of the rightmost optimal alignment. Note that the window width may be different for each row, depending on the 99 %# $# !"# Figure 4.5: Size of windows in alignment (A ∼ C) that will be used to compute suboptimality measures to alignment (A ∼ B). Window width may vary for positions ai , depending on the horizontal range of left-most and right-most optimal alignments. flexibility of optimal alignments involving each position ai (see Figure 4.5). One approach to specifying what “a few cells to the left” means is to extend the window a fixed number of cells w outside the left- and right-most optimal alignments. An alternative, which we found to be more successful, is to extend the window as far as possible to include all cells (ai , ch ) such that the optimal alignment constrained to align (ai ∼ ch ) has cost no more than a fixed value ∆ higher than the optimal alignment of (A ∼ C). Note that this is the ∆ used as a suboptimality limit in previous sections. In our experiments, we found ∆ = 150 to result in high accuracy while limiting computation. With substitution costs ranging between 0 and 88, and the cost of 98 for a gap of length one (see Section 2.9), this allows for a little flexibility in alignment suboptimality, but does not include alignments that 100 #" !" $" '" (" %" &" Figure 4.6: The range used in consistency is the intersection of (A ∼ C) and (B ∼ C) windows. require multiple clearly erroneous features. Note that the range of good h-values for the ith row of (A ∼ C) may disagree with the range of good h-values for the j th row of (B ∼ C). We found that it was sufficient to search for the lowest suboptimality only the intersection of those ranges (see Figure 4.6); if the ranges did not intersect, then the suboptimality was defined as ∆. In addition to reducing run time, space utilization is reduced as well, based on the combination of restricting CAB and using suboptimality windows. The default method would require (possibly infeasible) space O(k 2 n2 ) to store the full dynamic programming table for all pairwise alignments. This can be dramatically reduced by recomputing alignments (A ∼ C) as they are needed for calculating consistencybased costs for other alignments (A ∼ B). For each pair (A, B), compute and store the alignments (A ∼ B) and for a small set of C, (A ∼ C) and (B ∼ C), storing 101 only the portions of those tables falling inside the suboptimality windows. When X neighboring sequences are used to compute consistency, this will require an Xfold increase in the number of pairwise alignments performed over the default, but result in a large storage savings, requiring O(n) space (for fixed X) in practice. 4.4.3. Consistency only for small subtrees Even with the speedups described above, this method is quite slow for inputs with more than a dozen sequences. Run time can be reduced substantially if modified costs are computed for only a fraction of the pairs of sequences. We achieve this by restricting consistency-based alignment to only those nodes of the progressive alignment merge tree (see Section 2.4) that involve a small number of sequences. In other words, consistency-modified costs are used for nodes close to the leaves, and standard costs for deeper internal nodes. See Figure 4.7 for an example of the resulting reduction in computation. We should also note that the fast profile alignment methods described by Gotoh and Starrett [45, 103] can not be applied when position-pair costs are used, so the aligning alignments step will tend to grow quadratically with number of sequences, rather than linearly. By limiting the size of alignments in which consistency is used, the fast profile methods can be used for the large alignments formed at deep internal nodes. This form of tree-restricted consistency is necessary to achieve acceptable speed, but also seems well-motivated. The purpose of using consistency modified costs is to avoid making errors early in the progressive alignment process. Once alignments being merged have reached sufficient size, there should be enough signal in those alignments to overcome the need for external influence on alignment costs. When no polishing is used, we found that there was effectively no benefit in our scheme to using consistency on subtrees larger than 6 sequences. However, our experiments 102 !"#$%&'#()+/(&*+%,#,(-%*.& !"#$%&'#()&*+%,#,(-%*.& Figure 4.7: Number of pairwise alignments performed is reduced by limiting consistency to nodes near leaves. For example, suppose each pictured subtree contains 6 sequences. Then if all pairs are computed, the total number of pairs is 24 ∗ 23/2 = 276, but if pairs are only computed within subtrees, then 6 · 5/2 = 15 pairs are computed for each subtree, for a total of 60. If the number of sequences is doubled to 48, but the same threshold retained (thus doubling the number of size-6 subtrees), then the number of computed pairwise alignments will be reduced from 1128 to 120. (Section 4.6) show that restricting consistency to subtrees results in conflicts with polishing. 4.5. Incorporating consistency-modified costs into the algorithm for aligning alignments It is a straightforward exercise to incorporate consistency-based costs into the algorithm for aligning two sequences, and into the pessimistic heuristic for aligning alignments (see Section 2.5 and [64]), replacing the matrix-derived substitution costs and fixed gap costs λ and γ with the modified variants described in Sections 4.2 and 4.3. Modifications of the algorithm for aligning alignments optimally (with exact gap counts, see Section 2.5 and [63, 103]) are only slightly more complicated. Here we discuss details of that algorithm only in the depth required to see the modifications necessary to incorporate consistency. See Section 2.5 and [64, 63, Dominance and hydrophobicity B A B (a) Flat: substitution ------ (b) Overhang: vertical gap A B (c) Underhang: horizontal gap ------ A B A B ------ 103 ------ A B A B A in a two-sequence alignment Figure 4.8: Shapes ------ B 103] for more details. ------ A The algorithm for exactly aligning alignments follows a standard dynamic B programming approach, and can be described as it relates to aligning two sequences. In aligning two sequences under affine gap costs, a subproblem (i, j) corresponds to prefixes (a1 . . . ai ) and (b1 . . . bj ) of sequences A and B, and a shape that represents 73 the relative ordering of the final characters of the two prefixes. The flat shape (Figure 4.8a) indicates that ai is aligned to bj (a substitution). The overhanging shape (Figure 4.8b) corresponds to a run of characters from A ending in ai aligning to a gap between positions bj and bj+1 ; an overhang is a run of vertical edges in the dynamic programming table. The underhanging shape (Figure 4.8c) similarly corresponds to a run of characters from B in a (horizontal) gap. A direct result of the dynamic programming recurrence is that, when filling in a matrix with the costs of solutions to these subproblems from the upper-left portion of the matrix to lower-right portion is that, for each of the three shapes, only the cost of the cheapest path (alignment of prefixes) ending in that shape in that cell need be retained. This is effectively a pruning method: rather than keep track of all alignments of prefixes (a1 . . . ai ) and (b1 . . . bj ), only one for each shape is tracked. In the problem of aligning two alignments, the alignments A and B are viewed as sequences of columns, and a subproblem consists of prefixes of the two alignments, along with the shape leading to those prefixes. As in pairwise Algorithm We solve Aligning Alignments by dynamic programming. •! •! Inputs A and B viewed as sequences of columns. Subproblem consists of prefixes i and j, and a shape s. 104 i i A s A B B j j 61 Figure 4.9: Multiple sequence shape. alignment, shapes describe the relative order of occurrence of the most-recent letter in each sequence. Multi-sequence shapes are more complicated, as sequences within an alignment may have gaps relative to other sequences in the alignment. See Figure 4.9 for an example. The difficulty in aligning alignments optimally is in correctly computing the per-gap cost of an alignment as it is extended in the dynamic programming matrix. Using standard costs, this amounts to correctly counting the number of gaps opened by a new column in the alignment. Shapes are the key to determining whether a column starts a gap in a pair of rows: for example, if a new column appends a letter from A against a gap in B, the effect is to create an overhang. If that column is appended to a shape that is not already an overhang (Figure 4.8b), then a new gap (a, −) has been opened. As in the two-sequence method, a sort of rudimentary pruning is possible: many paths may result in the same shape at a given cell in the dynamic programming matrix, but it is simple to ensure that each shape retained in that cell represents the cheapest path leading to that cell and ending at that shape. But more effective 105 pruning of shapes is possible, as well; if it can be shown that a shape s can not possibly lead to an optimal alignment, then that shape may be pruned immediately. Two such improved pruning techniques are described in the work of Kececioglu and Starrett [63, 103], bound pruning and dominance pruning. These require special attention in terms of how they incorporate consistency-modified costs. The descriptions below present the issues in terms of two-sequence shapes induced by a multi-sequence shape; this is natural, as alignment costs are based on sum-ofpairwise costs. 4.5.1. Bound pruning The bound pruning method uses heuristic alignments to establish bounds to support shape pruning. First, a heuristic alignment is performed on the reversed alignments A and B. Think of this as building an alignment from the bottom-right up to the top-left on the un-reversed alignments. The heuristic used is the optimistic gapcount heuristic [63], in which per-gap costs are counted only when it is clear that a new gap has been formed. At each cell (ai , bj ) in the dynamic programming table, these (optimistic) costs provide a lower bound on the cost of aligning the suffixes (ai . . . am ) and (bj . . . bn ). These suffix-alignment lower bounds can be used to establish cost bounds on extending alignments of prefixes (in the forward direction). The true cost of this heuristic alignment is then computed. Since it is a feasible alignment, this represents an upper bound on the optimal cost of an alignment (in practice, a pessimistic heuristic alignment is also performed, since this tends to have lower cost). In the optimal alignment algorithm, the general step is to consider a shape s in cell (i, j), where the cost of the cheapest alignment at that cell ending in that shape is c(i, j, s). The idea is to find a lower bound on the cost of extending s into a complete alignment; if that lower bound exceeds the previously established upper Bound pruning and consistency No overcounting 106 No overc add ! close add ! clos No overcounting (a) no additional cost (b) additional gap-close cost ! close add ! close add ! open add ! ope (c) additional gap-open cost ! close for bound pruning. When considering cost of Figure 4.10: Stitching shapes extending a shape from the left with a shape from the right, additional costs may open be incurred. add ! bound (from the optimistic or pessimistic heuristic), then s may be discarded. At cell (i, j), there are three possible ways to begin the extension: right, down, and diagonally down/right. The reverse alignment described above establishes a lower-bound on the cost of extension in each direction, so a lower-bound on total cost can be established by adding the cost of s (which is the exact cost of the optimal alignment of prefixes leading to that cell with that shape) to the lower bound on cost of extension in each direction. In the normal (non-consistency) scoring scheme, the per-gap cost was implemented as a single gap-open cost γ. Thus, when an overhang in shape s is extended into a similarly overhanging shape from the reverse alignment (Figure 4.10a), both shapes had already incurred a per-gap cost. This required that the bound be corrected by reducing the summed costs based on possible gap-open overcharges. This issue it avoided by splitting the per-gap cost into gap-open and gap-close costs, each γ/2. With these split-gap costs, no overcounting is possible, either in the standard or consistency-modified scoring schemes. With these split costs comes a new requirement: to keep the bound as tight as possible, all possible gap-close 95 107 (Figure 4.10b) and gap-open (Figure 4.10c) costs are now added to the lower-bound on the cost of extending a shape. 4.5.2. Dominance pruning If two paths to the same cell end in the same shape, then the one with lower cost can be said to dominate the other: because the two shapes will incur identical costs for any possible extension, there is no way that the more expensive path to that shape can lead to an alignment with a cheaper cost. A dominated path can then be discarded while still guaranteeing that an alignment with optimal cost will be found. This notion can be generalized to comparing different shapes. For two shapes in a cell, s and t, if there is no possible path through the remainder of the alignment graph that will cause the cost of t extended through that path to be less than the cost of s extended through that path, then s can be said to dominate t, and t can be discarded. It is not enough that c(i, j, s) < c(i, j, t), because it is possible that some path will force s to incur a new cost that t does not incur (s opens a gap that t already has open, or s closes a gap that isn’t open in t), thus causing the cost of extending s to rise relative to the cost of extending t. A bound on potential excess gap-open costs for s can be computed by counting the number of pairs for which t has an open gap that s does not have (Figure 4.11a). A similar accounting for excess gap-close costs can be made, by counting the pairs for which s has an open gap where t does not (Figure 4.11b). Let B be the sum, over all pairs of sequences, of all such possible gap-open or gap-close costs incurred by s and not t (Figures 4.11a and 4.11b). Then s dominates t if c(i, j, s) + B ≤ c(i, j, t). Thus, dominance is an easily-tested sufficient condition for t to be no better than s. Dominance and hydrophobicity Dominance and hydroph 108 a s t - s a t ------ (a) Extra gap-open cost. a ------ a (b) Extra gap-close cost. Figure 4.11: Extra gap-open and gap-close costs in dominance pruning. For two shapes s and t, each pair of sequences with induced pairwise shapes shown here cause the cost of s to grow, relative to the cost of t. In (a) the cost growth is the cost of opening a gap. In (b) the cost growth is the cost of closing a gap. 4.6. Experimental results 73 The accuracies of the blend methods described in the Section 4.3 are compared in Table 4.1. The methods were implemented in Opal, and experiments were performed as in Chapter 2 (using benchmarks BAliBASE, PALI, and SABmark; measuring accuracy with the SPS and TC scores). Inputs with no more than 50 sequences were used, due to excessive run time for the algorithm as implemented. A range of blend factors (the F value) were tested, and results shown are under the best F for each method. Consistency was computed only on subtrees of 6 or fewer sequences, with larger subtrees aligned under standard scoring. Consistency support was gathered from 5 nearest neighbors for each pair (A, B), and ∆ = 150 was used in all cases. The methods are compared to default alignment in Opal, with no polishing. The balanced maximum suboptimality method was found to perform best. Results presented in Section 2.8 indicate that the gains were roughly the same as gains from polishing, and consistency and polishing did not work well together. 109 Table 4.1: Comparison of consistency methods. All results are without polishing. Consistency method BAliBASE SABmark PALI average balanced maximum (F=1.9) 84.3 / 56.7 49.9 84.7 / 59.6 73.0 / 58.2 balanced average (F=3.0) 83.9 / 56.4 49.2 84.8 / 60.0 72.6 / 58.2 imbalanced maximum (F=1.0) 83.1 / 55.3 45.7 84.3 / 58.7 71.0 / 57.0 imbalanced average (F=1.0) 82.9 / 55.1 45.1 84.2 / 59.1 70.7 / 57.1 no consistency 82.5 / 52.7 48.5 83.8 / 57.1 71.6 / 54.9 One valid concern about these tests is that the observed success of basing consistency-modification on 5 related sequences may be due to the constraint placed on the number of sequences in tested inputs. We addressed this concern by using a similar methodology with MAFFT at the core. Because MAFFT’s method is much faster, all inputs from PALI and BAliBASE with more than 60 sequences could be tested (SABmark has no inputs of that size). After computing a guide tree, all subtrees with no more than X sequences were aligned with MAFFT’s local-alignment-based consistency method (L-INS-1), while all internal nodes with subtrees containing more than X sequences were aligned using MAFFT’s profile alignment with unmodified parameters. Table 4.2 shows accuracy results over a range of values for X. Recovery as measured by SPS leveled off at X = 15, though it is interesting to note that the TC recovery values continued to grow with X, and even X = 40 didn’t attain TC values as high as using consistency for all inputs. The results described so far are for consistency without polishing. In Section 2.8, it was reported that accuracy was reduced when polishing was applied to a consistency-based alignment. It is instructive to understand if this is true for other methods as well. Tables 4.3 and 4.4 show the recovery gains due to consistency and polishing on the same inputs as in Table 4.1, for MAFFT and ProbCons respectively. 110 Table 4.2: Accuracy of method using MAFFT’s consistency approach (linsi) on small subtrees of the guide tree, and unmodified profile alignment on deep internal nodes. No polishing was used. Results shown for the 26 inputs from BAliBASE and PALI containing at least 60 sequences. which internal-node alignments performed with consistency SPS TC no consistency (fftns1) 90.8 40.6 subtrees ≤ 5 sequences 90.5 39.5 subtrees ≤ 6 sequences 91.4 41.2 subtrees ≤ 7 sequences 91.4 41.2 subtrees ≤ 8 sequences 90.8 40.8 subtrees ≤ 9 sequences 91.0 42.6 subtrees ≤ 10 sequences 91.0 43.3 subtrees ≤ 11 sequences 91.5 41.9 subtrees ≤ 12 sequences 91.6 43.5 subtrees ≤ 13 sequences 91.7 44.5 subtrees ≤ 14 sequences 92.1 42.4 subtrees ≤ 15 sequences 92.1 44.4 subtrees ≤ 16 sequences 91.9 44.9 subtrees ≤ 17 sequences 91.9 45.0 subtrees ≤ 18 sequences 91.9 45.1 subtrees ≤ 19 sequences 91.9 44.5 subtrees ≤ 20 sequences 92.1 43.5 subtrees ≤ 25 sequences 92.5 48.5 subtrees ≤ 30 sequences 91.2 49.7 subtrees ≤ 35 sequences 91.5 51.3 subtrees ≤ 40 sequences 91.3 49.2 all (lins1) 92.1 54.4 111 Both tools experience additive gains in accuracy by combining consistency and polishing, with ProbCons getting a small (< 1%) gain in both SPS and TC score over just consistency, while MAFFT’s gains were 2% and 3% in SPS and TC respectively. To understand why these tools see gains in combining the methods, while Opal experiences a reduction in quality, it is important to remember how the Opal model works: consistency-modified costs are used when building alignments in small subtrees of the full guide tree, but standard (unmodified costs) are used for both deep internal nodes and for polishing. This partitioning of consistency was used because the computational burden of Opal’s model was excessive for larger trees. The conjecture that this approach would still work well was rooted in the idea that using consistency on small trees would help avoid errors when information was sparse, but that information contained by larger alignments would be sufficient to overcome the need for consistency. However, it appears that this is likely the cause of the poor relationship between consistency and polishing. An analogous idea is to build a full alignment with MAFFT using consistency, then to apply MAFFT’s standard-parameter (non-consistency) polishing to that alignment. When that was done, accuracy was worse than polish-free consistency-based alignment (results not shown). This, along with similar results in our model, suggests that polishing and consistency can cooperate effectively only if the parameters used to polish are the same ones used in computing an initial consistency-based alignment. When the polishing parameters differ from the initial-alignment parameters, the gains of consistency may be broken by polishing without getting sufficient return in the form of polishing-based improvement. 112 Table 4.3: Impact of using global- and local-sequence alignments in computing consistency-modified scores in MAFFT. MAFFT variant BAliBASE SABmark PALI average no consistency, no polish 77.8 / 47.4 42.0 79.2 / 50.2 66.3 / 48.8 with consistency, no polish 82.8 / 55.4 44.1 82.8 / 56.1 69.9 / 55.8 with consistency, with polish 84.6 / 59.2 46.2 84.0 / 59.2 71.6 / 59.2 Table 4.4: Impact of using consistency-modified scores in ProbCons. ProbCons variant BAliBASE SABmark PALI average no consistency, no polish 79.7 / 49.1 44.3 81.7 / 52.8 68.6 / 50.1 with consistency, no polish 83.0 / 56.4 46.7 84.3 / 59.2 71.3 / 57.8 with consistency, with polish 83.7 / 57.1 47.1 84.5 / 59.3 71.8 / 58.2 113 PART 2 LARGE-SCALE NEIGHBOR-JOINING PHYLOGENIES 114 CHAPTER 5 BACKGROUND ON NEIGHBOR-JOINING The neighbor-joining method of Saitou and Nei [95] is a widely-used method for constructing phylogenetic trees, owing its popularity to good speed, generally good accuracy [81], and proven statistical consistency (informally: neighbor-joining reconstructs the correct tree given a sufficiently long sequence alignment) [35]. In particular, it has been highlighted as a valuable method for inferring very large phylogenies [108], due to its combination of speed and accuracy. Neighbor-joining is a hierarchical clustering algorithm. It begins with a distance matrix, where dij is the observed distance between clusters i and j, and initially each of the n input sequences forms its own cluster. Neighbor-joining repeatedly joins a pair of clusters that are closest under a measure, qij , that is related to the dij values. The canonical algorithm [105] finds the minimum qij at each iteration by scanning through the entire current distance matrix, requiring Θ(r2 ) work per iteration, where r is the number of remaining clusters. The result is a Θ(n3 ) run time, using Θ(n2 ) space. Thus, while neighbor-joining is quite fast for n in the hundreds or thousands, both time and space balloon for inputs of tens of thousands of sequences. As a frame of reference, there are 8 families in Pfam [37] containing more than 50,000 sequences, and 3 families in Rfam [49] with more than 100,000 sequences, and since the number of sequences in GenBank is growing exponentially [41], these numbers will certainly increase. Phylogenies of such size are applicable, for example, to large-scale questions in comparative biology (e.g. [100]). In this part of the dissertation, a new algorithm is described that produces a correct neighbor-joining phylogeny, and achieves increased speed by restricting its 115 search for the smallest qij at each iteration to a small portion of the quadratic-sized distance matrix. The key innovations of this algorithm are (1) introduction of a search-space filtering scheme that is shown to be consistently effective even in the face of difficult inputs, and (2) inclusion of data structures that efficiently use disk storage as external memory in order to overcome input size limits. The result is a statistically consistent phylogeny inference tool that is roughly an order of magnitude faster than a very fast implementation of the canonical algorithm, QuickTree (for example, calculating a neighbor-joining tree for 60,000 sequences in less than a day on a desktop computer), and is scalable to hundreds of thousands of sequences. The algorithm is implemented in a tool called NINJA, freely available at http://nimbletwist.com/software/ninja. 5.1. Canonical neighbor-joining algorithm Neighbor-joining [95, 105] is a hierarchical clustering algorithm. It begins with a distance matrix, D, where dij is the observed distance between clusters i and j, and initially each sequence forms its own cluster. Neighbor-joining forms an unrooted tree by repeatedly joining pairs of clusters until a single cluster remains. At each iteration, the pair of clusters merged are those that are closest under a transformed distance measure qij = (r − 2) dij − ti − tj , (5.1) where r is the number of clusters remaining at the time of the merge, and ti is the total distance of cluster i to all other clusters: X ti = dik . (5.2) k When the {i, j} pair with minimum qij is found, clusters i and j are joined by a new parental node, which now represents cluster ij, and the length of the edge from i to the parent node is 1 bi = 2 dij ti − tj + (r − 2) ! , (5.3) 116 while the length of the edge from j to the parent node ! 1 tj − ti bi = dij + . 2 (r − 2) (5.4) D is updated by inactivating both the rows and columns corresponding to clusters i and j, then adding a new row and column containing the distances to all remaining clusters for the newly formed cluster ij. The new distance dij|k between the cluster ij and each other cluster k is dij|k = di|k + dj|k − di|j . 2 (5.5) There are n-1 merges, and in the canonical algorithm each iteration takes time Θ(r2 ) to scan all of D. This results in an overall running time of Θ(n3 ). Note that it is possible for one of each pair of branch lengths to gain a negative length based on formulas 5.3 and 5.4; this is problematic since the length of an edge represents the extent of sequence evolution occurring along that edge, which should be non-negative. It has been suggested that the branch with negative length should be set to length zero, and the sibling branch shortened an equal amount. This does not alter the topology of the tree (Kuhner and Felsenstein, 1994). Also note that it is possible that more than one pair will share the same minimum q value. It appears to be common to make an arbitrary choice, though it is certainly possible to choose randomly (an option in [98]), and repeat analysis of the input multiple times to measure robustness of resulting trees. 5.2. Properties of neighbor-joining Numerous studies have demonstrated neighbor-joining’s efficacy in recovering accurate phylogenies [67, 81, 104, 69], and in particular, Tamura et al. [108] observed very little reduction in accuracy as the number of sequences grows into the thousands. (It should be noted that accuracy results in phylogenetics depend on simulation studies, since the true evolutionary history or real sequences is unknowable.) 117 Neighbor-joining takes as input a matrix of estimated pairwise distances, and constructs a tree such that the pairwise distances induced by that tree are close to the input distances. Input distances are estimated from observed sequence data, usually based on a probabilistic model of sequence evolution. These models are statistically consistent in the sense that they converge on the true distance with sufficiently long sequences and a correct model of sequence evolution [35]. Neighbor-joining, which is known to correctly recover trees when pairwise distances are in agreement with the true tree (these are known as additive, or tree-like, distances) [39] is statistically consistent as well. In practice, the model of sequence evolution may be wrong, and distances are calculated on finite-length sequence, meaning computed distances are subject to statistical error. It is therefore important to understand how neighbor-joining deals with erroneous estimated pairwise distances. Consider the true tree under which the sequences evolved. The edge lengths of that tree induce a distance between each pair of sequences: the sum of the length of edges on the path between the leaves corresponding to those two sequences. Call these the true-tree-induced distances. The estimated pairwise distances may differ from these true distances, so we desire a method that is robust to large errors. Atteson [5] showed that neighbor-joining will recover the correct tree if the largest error among all pairwise distance estimates is at most 1/2 the length of the shortest edge in the true tree. Distances that are close in this way to the true-tree-induced distances are called “nearly-additive”. The ratio of allowed disagreement from true-tree-induced distances to minimum edge length is called the reconstruction radius; neighbor-joining has a radius of 1/2. This reconstruction radius has been shown to be optimal for distance methods [31]. The reconstruction radius indicates robustness in the face of distance estimation error, but seems to be of little practical value for moderate size real-world problems, where a near-zero-length branch is quite likely to be found somewhere in the true tree. A more valuable result from Mihaescu et al. [77] is the related notion of an 118 edge radius, wherein neighbor-joining is guaranteed to correctly recover every edge with length at least 4x as large as the largest distance error. This is shown in the same work to be optimal. Thus, even though distance estimation error may cause neighbor-joining to fail in finding the correct tree, only short edges will be missed. (It should be noted that these short edges are not guaranteed to be missed, and are often correctly recovered despite the lack of guarantee [77]). Gascuel [39] showed that formula 5.5 can be replaced with a formula that differentially weights the contributions of di|k and dj|k to dij|k , while still retaining neighbor-joining’s desirable properties. His approach is to assign weights that are dependent on sampling variance of distance estimates, such that the distance with higher variance (which is thus a less reliable estimate of the true distance) is down-weighted. He also shows that sampling variance can be reasonably approximated when distances are estimated from aligned sequences. This approach was implemented in a tool called BioNJ, which reportedly improved the accuracy of tree estimation from sequence data, on simulated data. Bryant [13] showed that the selection criterion (formula 5.1) is the unique such criterion satisfying the requirements of being linear, unaffected by order of input, statistically consistent, and based solely on distance data. 5.3. Prior methods QuickTree [54] is a very efficient implementation of the canonical neighbor-joining algorithm. Due to low data-structure overhead, it is able to compute trees up to nearly 40,000 sequences before running out of memory on a system with 4GB of RAM. QuickJoin [76, 75], RapidNJ [99], and the bucket-based method of [119] all produce correct neighbor-joining trees, reducing run time by finding the globally smallest qij without looking at the entire matrix in each iteration. While all methods still suffer from worst-case running time of O(n3 ), they offer substantial speed improvements in practice. Unfortunately, the memory overhead of the employed 119 data structures reduces the number of sequences for which a tree can be computed. For example, on a system with 4GB RAM, RapidNJ scales to 13,000 sequences, and QuickJoin scales to 8000. No implementation of [119] was found. The focus of this work is on exact neighbor-joining tools, but we briefly mention other methods for completeness. Alternate methods for phylogeny inference include parsimony, maximum-likelihood, Bayesian, and minimum evolution. Parsimony and maximum-likelihood (ML) use local-search heuristics to seek good trees, while the Bayesian method samples from the probability landscape. Bayesian methods (e.g.[92, 27, 107]) are unable to handle inputs on the scale of thousands of sequences. Recent advances in parsimony [43] and ML [102] methods have greatly improved the size of problems that can be handled by these methods, and the speed with which trees can be inferred. The largest published tree inferred by either of those methods contains 73,060 taxa [42], which reportedly took 2.5 months of processor time. Relaxed [32] and fast [31] neighbor-joining are neighbor-joining heuristics that improve speed by choosing the pair to merge from an incomplete subset of all pairs. Both will return the same tree as neighbor-joining when distances are additive. Fast neighbor-joining shares the same reconstruction radius with exact neighbor-joining [77], and runs in time O(n2 ), but was observed to be slightly less accurate than neighbor-joining [31], suggesting a lack of robustness when that radius is violated. An implementation of the relaxed neighbor joining heuristic, ClearCut [98], is nearly an order of magnitude faster than NINJA on very large inputs, but also shows reduced accuracy. Minimum-evolution (ME) methods offer an alternate fast approach. Under the ME framework, edge lengths for a fixed topology are those that minimize the sum of the squares of the differences between the input pairwise distances and those induced by the tree’s edge lengths. The minimum-evolution tree is the tree with topology minimizing the sum of edge lengths. Saitou and Nei [95] described their 120 neighbor-joining method as working “under the principle of minimum evolution”, and it has been proven to be a greedy heuristic for finding the balanced minimum evolution tree [40]. Minimum-evolution with the SPR [53] local-search improvement method has proven reconstruction radius of 1/3 [10], and consistency (though not a reconstruction radius > 0) has been conjectured for ME with the NNI [22] local-search improvement method. A recent implementation of a minimum-evolution heuristic with NNI, FastTree [90], is notable for constructing accurate trees on datasets of the scale discussed here, with speed more than 10-fold greater than that achieved by NINJA on very large inputs. 121 CHAPTER 6 DATA STRUCTURES AND ALGORITHMS FOR SCALING UP NEIGHBOR-JOINING In this chapter, we describe a two-tiered filtering regime that dramatically reduces the number of distance values that are viewed during the search for the minimum qij value at each iteration, relative to a complete scan of the distance matrix. In addition, the method overcomes memory constraints seen in earlier filtering-based work by incorporating external-memory-efficient data structures into the algorithm. 6.1. 6.1.1. Restricting search of the distance matrix The d-filter A valid filter must retain the standard neighbor-joining optimization criterion at each iteration: merge a pair {i, j} with smallest qij . To avoid scanning the entire distance matrix D, the pairs can be organized in a way that makes it possible to view only a few values before reaching a bound that ensures that the smallest qij has been found. To achieve this we use a bound that represents a slight improvement to that used in RapidNJ [99]. In that work, ({i, j}, dij ) triples are grouped into sets, sorted in order of increasing dij , with one set for each cluster. Thus, when there are r remaining clusters, each cluster i has a related set Si containing r−1 triples, storing the distances of i to all other clusters j, sorted by dij . Then, for each cluster i, Si is scanned in order of increasing dij . The value of qij is calculated (equation 5.1) for each visited entry, and kept as qmin if it is the smallest yet seen. 122 To limit the number of triples viewed in each set, a second value is calculated for each visited triple, a lower bound on q-values among the unvisited triples in the current set Si : qbound = (r − 2) dij − ti − tmax , where tmax = maxk {tk }. In a single iteration, tmax is constant, and for a fixed set Si , ti is constant and the sorted dij values are by construction non-decreasing. Thus, if qbound ≥ qmin , no unvisited entries in Si can improve qmin , and the scan is stopped. After this bounded scan of all sets, it is guaranteed that the correct qmin has been found. This is the approach of RapidNJ. Improving the d-filter While this method is correct, and provides dramatic speed gains [99], it can be improved. First, observe that the bound is dependent on tmax , which may be very loose (see fig. 7.2a). One way to provide tighter bounds is to abandon the idea of creating one list per cluster. Instead, the interval (tmin , tmax ) is divided into evenly spaced disjoint bins, where each bin Bx covers the interval [Txmin , Txmax ). For X bins, then, the size of the interval between min and max values will be (tmax − tmin )/X (the default number of bins is 30). Each cluster i is associated with the bin Bx for which Txmin ≤ ti < Txmax . Adopt the notation that cluster i’s bin B(i) = x. Note that bins may contain differing numbers of clusters. Then create a set S{x,y} for each bin-pair {Bx , By }. Now, instead of placing ({i, j}, dij ) triples into per-cluster sets as before, place them in per-bin-pair sets S{B(i),B(j)} , still sorting triples within a set by increasing dij . To find qmin , traverse the sets, scanning through each as before, but now calculating the bound based on current triple ({i, j}, dij ), taken from set S{x,y} , as qbound = (r − 2) dij − Txmax − Tymax . (6.1) This improves the filter because, for an unvisited pair {i0 , j 0 } from the same set S{x,y} , setting ρ = (r − 2) di0 j 0 , ρ − Txmax − Tymax will usually be a tighter bound on qi0 j 0 than is ρ − ti0 − tmax . 123 Updating data structures After merging clusters i and j, the rows and columns associated with those clusters are inactivated in D, and a new row and column are added for the merged cluster ij. Entries in the sets also require update. The new cluster, ij, is associated with the bin B(ij) = argminx {Txmax > tij )}. Triples ({ij, k}, dij|k ) for distances to each remaining cluster are added to the appropriate sets, S{B(ij),B(k)} . Triples for the removed clusters i and j are removed from sets in a lazy fashion: i and j are marked as retired, and when triples involving either i or j are encountered while scanning sets in future iterations, they are removed. While this method provides tighter qbound values than the method of keeping one set per cluster, these bounds will tend to loosen over time. Before any merges are performed, the intervals of these sets are non-overlapping, but because the change in tk after a merge may be different for each cluster k, this non-overlapping property is no longer guaranteed to hold after a merge is performed. The result is a loosening of the value Txmax as a bound for ti for an arbitrary cluster (i.e. the difference between the true value and the bound may be greater than (tmax − tmin )/X). The loosening of the bound grows as iterations pass, though it is still tighter than the per-cluster bound until the set ranges overlap almost completely. It may seem appealing to move a cluster to a new bin when that bin could provide a tighter bound, but doing so would incur substantial work to take all corresponding triples out of the various bin-pair sets. The strategy taken by NINJA is to occasionally rebuild the sets from scratch. For a constant K > 1 (the default is K = 2), the sets are rebuilt after r/K merges have been performed since the last rebuild, where r is the number of clusters remaining at the time of that prior rebuild. Overall runtime for these set constructions is dominated by the time of the first construction, O(n2 log n). This process of regularly rebuilding data structures resembles that used in [76]. 124 6.1.2. Overcoming memory limits The size of D is quadratic in the number of sequences, as is the size of the sets of triples described above. If these structures grow to exceed available RAM, an application may either abort or store the structures to external storage (i.e. disk). Data in RAM may be accessed in random patterns without concerns about speed, but reading and writing data on disks is done in large chunks (called disk blocks) of several kilobytes. The dramatic difference in latency between disk and RAM access (on the order of 106 -fold difference [87]) means that it takes effectively the same amount of time to access a single value from disk as it takes to access all the values stored in a disk block. This necessitates I/O-efficient algorithms if external storage is to be used. We describe methods for efficiently handling both the sets S{x,y} and the distance matrix. Bin-pair sets in external memory The set of triples associated with each bin pair set S{x,y} has been described as a sorted list. In fact, in order to allow fast insertion of triples for new clusters, such a list would likely be implemented as a data structure such as a binary search tree [66]. Binary search trees have poor I/O behavior when stored to disk, but could be easily replaced by a B-tree [8] or B+ tree, which allow for logarithmic number of disk I/Os for both insertions and reads. However, since only a small portion of the entries in a set are accessed, the effort of keeping a totally ordered data structure is unnecessary. A min-heap [20] provides the tools necessary to scan through increasing dij , with less overhead since it only need keep a partial order. NINJA implements an external memory array heap [12], keyed on dij . This heap structure can store more triples than would fill a 1TB hard drive while maintaining a memory footprint smaller than 2MB, and guarantees an amortized number of I/O operations for insert and extract-min operations that is logarithmic in the number of inserted triples. One heap is used for each set S{x,y} . 125 Distance matrix in external memory Though the heaps are used to identify the cluster pair {i, j} to merge, the distance matrix D should still be maintained. After a merge, dij|k is calculated for every cluster k. From equation 5.5, we see that we must view dik and djk for every k, which is more efficiently done by traversing the rows and columns for i and j in D than by scanning through the heaps. Since neighbor-joining assumes a symmetric D, an efficient way to store D for in-memory use is to keep only its upper-right triangle: distances for cluster i are spread across row i and column i, such that all reside in the upper triangle. When a pair of clusters {i, j} is merged, a new row and column are said to be added, but no additional space is actually required: the distances of the new cluster ij to all remaining clusters k can be stored in the cells previously belonging to one of the retired clusters, say i, so dij|k is stored in the cell where dik was stored. Clusters i and j are noted as retired, and the mapping of cluster ij’s stored location is simple. However, when D is stored to disk, this approach will lead to poor disk paging behavior, because values for cluster i are split between row i (which can be accessed efficiently from disk, with many consecutive values per disk block), and column i (which will be spread across the disk, with typically one value per disk block). Therefore, a modification is required. For an input of n sequences, a file F stores a matrix with 2n columns and n rows. The full initial D (i.e. both the upper and lower triangles) is stored to F , filling the first n columns of each row. When a merge is performed, and new distances are calculated, the values dik and djk can be gathered by sweeping through rows i and j, allowing the number of distance values that fit in a disk page to be gathered at the cost of a single disk access. The mapping for the storage location of the new dij|k values will be different for rows and columns: if ij is formed as the result of the pth merge, then it will map to the row in F where i was stored, but will fill a new column n + p − 1. Newly calculated distances are not immediately stored to disk, instead waiting until enough values have been calculated to allow for efficient disk I/O. Suppose b distance values fit in 126 a disk block: then dij s for new clusters are appended to a b x n in-memory matrix M until all b columns of that matrix are full. At that time, each row of M is appended to the same row in F (requiring one disk I/O per row), and each column is translated and written into the mapped row in F (requiring up to dn/be I/Os). 6.2. Candidate handling Due to the nature of heaps, all viewed ({i, j}, dij ) triples are removed from their containing heaps during the search for qmin ; call these the candidates. The d-filter method described in Section 6.1 dramatically reduces the number of candidates viewed in most cases, but inputs with relationships like those seen in Figure 7.2a reduce the efficacy of d-filtering, for reasons described in Section 7.1. Examples of the impact on run time are given in Table 7.2c. Below, a second level of filtering, called the q-filter, is described. It works by sequestering candidates passing the d-filter, and organizing them in a way that allows a new bound to limit the number of those candidates that are viewed in each iteration. q-filter on a candidate heap Let qij (p), r(p), and ti (p) correspond to the values of qij , r, and ti at a fixed previous iteration p. And let δ i (p) = (r−2)ti (p)−(r(p)−2)ti . Then it is easy to show that, for the current iteration, qij = (r − 2) qij (p) + δ i (p) + δ j (p) . r(p) − 2 (6.2) Suppose all candidates on hand at iteration p are stored as ({i, j}, qij (p)) triples in a candidate set, sorted according to their qij (p) values. Assign the current r and ti as r(p) and ti (p) for that set. Since relative q-values change by small amounts from one iteration to the next, the {i, j} pair with the smallest qij at a future iteration is likely to be near the front of this sorted list. It can be found by initializing qmin to ∞, then scanning candidates in order of increasing qij (p), updating qmin when an entry with a smaller qij is found. 127 Let S be the set of all clusters with at least one representative in the candidate set, and define ∆max (p) := max {δ i (p) + δ j (p)} . i,j ∈ S i6=j (6.3) Then scanning of this sorted list may be stopped when an element is found with (r − 2) qij (p) + ∆max (p) ≥ qmin . r(p) − 2 (6.4) The candidate set can be large enough to exceed memory for very large inputs, and because only a partial order is required, NINJA stores the contents of the candidate set in an external memory heap array, as described for the d-bound binpairs in Section 6.1. The heap formed from such candidates is called a candidate heap. Candidate heap chain Adding a new candidate to a candidate heap created in a previous iteration pa (with associated r(pa ) and ti (pa ) values) is problematic: (1) if the candidate involves a cluster j that was formed after pa , then qij (pa ) and tj (pa ) are undefined, and (2) even if both clusters existed before pa , the candidate would need to be stored on the heap with a back-calculated qij (pa ) (and thus looser than necessary bounds) to retain sensible δ-values. NINJA overcomes this problem by keeping a chain of candidate heaps. At initiation, there are no candidates. In each iteration, newly gathered candidates from the d-filter are placed in a single candidate pool. When the size of that pool exceeds a threshold (default is 50,000; it should be fairly large because of the overhead required to form an external memory array heap), a candidate heap is created and populated with the triples in the pool, and the pool is then emptied. As more candidates are gathered, they are again stored in the pool, until it exceeds threshold, at which time a second candidate heap is formed, filled from the candidate pool, and linked to the first. This is repeated until the tree is complete. 128 This results in a chain of candidate heaps. The chain is destroyed when bin-pair heaps are rebuilt (Section 6.1.1). At each iteration, these heaps are scanned for elements with small qij by removing triples until the bound (6.4) is reached. Those viewed triples with qij > qmin are placed in the candidate pool, rather than being returned to their source candidate heap, because the δ-bound usually gets looser, so they’d almost always just be pulled back off their original heap on the next iteration. When a candidate heap drops below a certain size (default = 60% of original size), it is liquidated, and all triples placed in the candidate pool. 6.3. Algorithm overview At each iteration pa , NINJA follows this process, tracking qmin at each step: 1. Scan all candidates in the pool, keeping the one with smallest qij . 2. Sweep through the candidate heap chain, for each heap removing triples until reaching the bound (6.4), and placing those triples in the candidate pool. Possibly liquidate heaps in the chain if they become too empty. 3. Sweep through the bin-pair heaps, for each heap removing triples until reaching bound (6.1), and placing those triples in the candidate pool. 4. If the size of the candidate pool exceeds threshold, move all candidates into a new heap, storing qij (pa ) for each candidate, and ti (pa )-values and r(pa ) for the heap. Append this heap to the candidate heap chain. 5. Having found the qij with minimum value, merge clusters i and j, update the bin-pair heaps and the in-memory part of the distance matrix M with entries for new cluster ij, and possibly write out to the on-disk distance matrix D. Also occasionally liquidate the candidate heap chain and rebuild the bin-pair heaps (see Section 6.1.1). 129 Steps 1 and 2 typically identify a pair with good qij value, because they start with a set of previously filtered candidates. This improves the efficiency of the bin-heap search, since the d-filter bound will be more effective if the starting qmin is close to the true qmin . 130 CHAPTER 7 EXPERIMENTAL VERIFICATION OF METHODS FOR SCALING UP NEIGHBOR-JOINING To assess the effectiveness of the two-tiered filtering algorithm, it has been implemented in an application called NINJA. Three variants were used in various tests in the results shown below. The default variant, NINJA, stores the distance matrix on disk, and uses both the d-filter described in Section 6.1 and the qfilter described in Section 6.2, both implemented with external-memory array heaps [12]. The variant labeled NINJA-d-filter is identical to NINJA, except that it implements only the d-filter, not the q-filter. The variant labeled NINJA-InMem also uses only the d-filter, but does so with in-memory data structures - keeping the distance matrix entirely in memory, and using a binary heap in place of the external-memory array heap. NINJA-InMem makes it possible to directly assess the impact of external-memory components of the algorithm. On a machine with 4GB RAM, NINJA-InMem is only able to compute neighbor-joining trees on inputs of fewer than about 7000 sequences, due to overhead memory use. For comparison purposes, we tested two tools that similarly avoid viewing the entire distance matrix at each iteration, QuickJoin and RapidNJ, and a very fast implementation of the canonical algorithm, QuickTree. To our knowledge, these are the fastest available tools that implement exact neighbor-joining. Both of the former tools are unable to handle inputs of more than 13,000 sequences on a machine with 4GB of RAM, but an experimental external-memory version of RapidNJ, called RapidDiskNJ, has been released. A version of RapidDiskNJ downloaded on 04/24/09 was used as a reference for large inputs. Note that the filter used in RapidNJ and RapidDiskNJ is almost equivalent to that used in NINJA-d-filter 131 and NINJA-InMem. This analysis considers only tools that produce exact neighborjoining trees. It is worth noting that ClearCut and FastTree, both of which implement neighbor-joining heuristics, are both faster than NINJA. QuickTree was implemented in C, QuickJoin and both RapidNJ variants were implemented in C++, and the NINJA variants were implemented in Java (see appendix B). Environment Experiments were run on a bank of 8 identical dedicated systems running CentOS 4.5 (kernel 2.6.9-55), with 64 bit 2.33 GHz Xeon processors, 4 GB allocated RAM, and 500 GB 7200 RPM SATA hard drives. NINJA used roughly 60 GB of disk space for the largest inputs. The “real time” output from the standard time tool was used to measure run time. Data Pfam [37] families were used as sample input for the tools. Each protein domain family was preprocessed to remove duplicate sequences, and all 415 families with more than 2000 unique sequences were used. Distances calculated with QuickTree, with Kimura’s correction [65] for multiple substitutions at a locus, were used as input to all tools. 7.1. Experiments Effect of filters At each iteration, the canonical algorithm scans all r(r−1)/2 cells in the distance matrix, where r is the number of remaining clusters. It thus views Θ(n3 ) cells (candidates) over the course of building a complete neighbor-joining tree on n sequences. Figure 7.1 shows the often dramatic reduction in number of candidates passing the d-filter, relative to this total count of cells. It also highlights instances where the d-filter is mostly ineffective, and shows that more consistent success is achieved when the q-filter is used in conjunction with the d-filter. It is important to note that Figures 7.1, 7.3, and 7.4 are all log-log plots. Thus, the roughly linear growth observed in all plots corresponds to polynomial growth of 132 both candidates and run time, with the polynomial exponent visible in the log-log slope. Inputs causing bad d-filtration Figure 7.2a shows an example of the kind of input that makes the d-filter fairly ineffective. It contains large clusters of very closely related sequences, and a few relatively long branches. Contrast this to the more evenly-distributed sequences seen in Figure 7.2b, for which the d-filter is quite effective. The reason for the computational difficulty of trees like the one in Figure 7.2a is that the clusters on the very long branches have very large t values relative to the t values for most clusters, while the clusters for the tree in Figure 7.2b will all have fairly similar t values. With RapidNJ’s bound, which depends on tmax , d-filtering on 7.2a is immediately inefficient because of this t-value discrepancy. NINJA’s bound (equation 6.1) starts off relatively tight, but the d-filtering becomes inefficient as the range of t-values within a bin grows. This can happen dramatically when clusters along one long branch, which thus begin in a high t-value bin, are merged (with corresponding relative reduction in t values), while other clusters sharing the same bin are not merged, and thus retain relatively high t values. Table 7.2c shows the effect that these differing tree forms have on both number of viewed candidates and run time. Focus on the results for family Cytochrom B N, an input with structure like that shown in Figure 7.2a: d-filtering only reduces the number of candidates by a factor of 10, much less effective filtering than the 10,000fold reduction seen in family WD40, an input with structure much like that seen in Figure 7.2b. Because of the extra overhead of their algorithms, the resulting run times for both RapidDiskNJ and NINJA-d-filter are much worse than that of QuickTree. By applying the q-filter, NINJA achieves a further 100-fold reduction in candidates viewed for Cytochrom B N, along with a large reduction in run time. 133 1.E+14 slope 1.E+13 Unﬁltered cells 3.0 total # viewed candidates (log scale) 2.8 d‐ﬁlter candidates 1.E+12 2.4 d+q‐ﬁlter candidates 1.E+11 1.E+10 1.E+09 1.E+08 1.E+07 1.E+06 1.E+05 1.E+03 1.E+04 1.E+05 # sequences (log scale) Figure 7.1: Number of candidates viewed during tree-building with and without filters, for all 415 Pfam alignments with more than 2000 non-duplicate sequences. Data points are placed on a log-log plot: the slope on such a plot gives the exponent of growth. The canonical algorithm treats all cells as candidates, and the corresponding count of unfiltered cells shows the expected slope of 3 for Θ(n3 ) number of cells. The d-filter often reduces the number of viewed candidates by more than 3 orders of magnitude, but is less effective for some inputs. Addition of the q-filter results in more consistent filtering success across all inputs, and an observed growth rate in number of viewed cells of roughly O(n2.4 ). 134 (a) Flu M1, 707 sequences. Example of a topology for which filtering is ineffective. (b) QRPTase N, 707 sequences. A topology for which filtering is effective. Number candidates All cells RuBisCO_large PPR 17,490 18,961 9E+11 1E+12 5E+10 2E+08 1E+09 2E+08 64 155 124 20 331 21 25 20 Cytochrom_B_N WD40 33,789 33,327 6E+12 6E+12 6E+11 2E+08 7E+09 2E+08 539 756 1,678 52 6,092 121 146 110 RVT_1 ABC_tran 56,822 53,116 3E+13 2E+13 2E+12 2E+08 1E+10 2E+08 n/a n/a >18,500 159 >18,500 554 717 530 Pfam ID d-filter only d+q filters Run time (minutes) Sequence number QuickTree RapidDiskNJ NINJA d-filter NINJA d+q filters (c) Impact of d- and q-filtering at various input sizes. Figure 7.2: Trees (a) and (b) are both of 707 sequences, and represent approximately equal evolutionary distance between the most divergent pair of sequences. They are mid-point-rooted, and images were created using FigTree (http://tree.bio.ed.ac.uk/software/figtree/). Pfam datasets with relationships like those shown in (a), with many closely related sequences and a few relatively long branches, cause d-filtering to be ineffective. In Table (c), the first of each pair has topology similar to (a), and shows poor d-filtering; the second has topology similar to (b), and shows good d-filtering. Run times for RapidNJ and NINJA-d-filter are very slow when d-filtering is ineffective, but the additional q-filter used by NINJA results in much better filtering, and therefore improves runtimes even for these hard cases. On a system with 4GB RAM, QuickTree crashes on all Pfam families with more than 37,000 sequences. RapidDiskNJ and NINJA-d-filter both took longer than 13 days to compute a tree for RVT 1. 135 Comparison to other tools Figures 7.3 and 7.4 compare NINJA variants to other neighbor-joining tools. They focus on inputs of more than 2000 sequences, since smaller inputs are solved by the canonical algorithm (implemented in QuickTree) in under 10 seconds. The orders-of-magnitude reduction in viewed candidates seen in Figure 7.1 does not translate to a similar reduction in run-time because the underlying data structures required to gain this filtering advantage incur a great deal of overhead relative to the simple scanning of a matrix. In addition, the large-scale applications (NINJA and RapidDiskNJ) incur a constant-factor overhead from disk accesses. Those factors are mitigated by using algorithms with good disk-paging behavior, but are nevertheless present. Figure 7.3 shows run times for a random sample of 113 medium-sized (2000-7000 sequences) inputs from Pfam. A sample is shown, rather than the entire dataset, to improve visibility of the chart, and agrees with trends for the full set of similarlysized inputs. Note that QuickTree’s run time grows with a slope of 2.9 on a log-log plot, essentially what is expected of a Θ(n3 ) algorithm. QuickJoin and RapidNJ are in-memory versions of competitor algorithms - both show a reduction in runtime, and a growth rate that is slightly more than quadratic. This is in agreement with results from [99]. Results for NINJA-InMem and NINJA are presented to show their relative performance to each other and the other tools. Both show a roughly quadratic run-time growth on this data set. NINJA-InMem is slightly faster than the fastest other tool, RapidNJ. Since the two tools use essentially the same bounding method for their d-filter methods, this difference is likely explained by the tighter bounds generated by the bin-pair approach of NINJA. Figure 7.4 shows run times for all inputs from Pfam with more than 7000 sequences. Results are given for the variant of each tool that best handles these large inputs: QuickTree, RapidDiskNJ, and NINJA. Only NINJA successfully computed 136 1.E+03 slope QuickTree 2.9 QuickJoin 2.1 wall seconds (log scale) 1.9 NINJA 1.E+02 RapidNJ 2.3 NINJA‐InMem 1.9 1.E+01 1.E+00 2.E+03 2000 4000 7000 # sequences (log scale) Figure 7.3: Performance of NINJA and NINJA-InMem compared to that of QuickTree, QuickJoin, and RapidNJ on a random sample of 113 medium-sized (2000 to 7000 sequences) Pfam inputs. neighbor-joining trees for all inputs; QuickTree crashed on all inputs with more than 37,000 sequences, while RapidDiskNJ failed to complete within 13 days on the two largest inputs. QuickTree continues to exhibit the expected slope (3.0) on a log-log plot for a O(n3 ) algorithm. Interestingly, both RapidDiskNJ and NINJA also show a similar cubic slope for these larger inputs, in conflict with the lower rate of growth observed for smaller inputs in Figure 7.3 and [99]. Inspection of the data suggests that this is due to an increased frequency in these larger datasets of the sort of difficult inputs characterized by Figure 7.2a. Note that the number of 137 1.E+04 slope QuickTree 2.9 1 RapidDiskNJ 3.3 2 NINJA 3.0 wall minutes (log scale) 1.E+03 1.E+02 1.E+01 1.E+00 1.E‐01 7.E+03 7000 20,000 60,000 # sequences (log scale) Figure 7.4: Performance of NINJA compared to that of QuickTree and RapidDiskNJ on all large (7000 to 60,000 sequences) Pfam inputs. (1) On a system with 4GB RAM, QuickTree crashes on inputs with more than 37,000 sequences. (2) RapidDiskNJ failed to complete within 13 days for the two largest inputs; the uncertain times-to-completion are represented with the arrowed circles in the upper right corner. The slope for RapidDiskNJ, which shows that its run time is growing faster than n3 , does not include these two points. 138 viewed candidates was observed in Figure 7.1 as growing with a power of 2.4. The logarithmic overhead of heap data structures is responsible for the observation that run time grows faster than the number of candidates. While the inputs from Pfam are large, the largest contains fewer than 60,000 unique sequences. To assess NINJA’s ability to build trees on truly large datasets, the GreenGenes collection of 218,348 16S ribosomal RNAs was used as input (http://www.microbesonline.org/fasttree/downloads/Large16S.tar.gz). On a system with the same specs as earlier, but with 12GB allocated RAM, NINJA computed the tree in 134 hours (5.6 days). Interestingly, this represents a roughly quadratic growth in run time over the 10 hours required to build a tree for almost 60,000 sequences. It is estimated that QuickTree would take more than 80 days to compute the same tree, if a machine with sufficient memory (more than 500GB RAM) were available. 7.2. Discussion We have presented a new tool, NINJA, that builds a tree under the traditional optimization criteria of neighbor-joining, with the associated guarantee of statistical consistency and optimal edge radius. NINJA speeds up neighbor-joining by employing a two-tiered filtering regime, which greatly reduces the number of viewed candidates in each iteration relative to the complete scan of the distance matrix that is employed in the canonical algorithm. NINJA also overcomes memory constraints seen in earlier filtering-based work by incorporating external-memory-efficient data structures into the algorithm, specifically the external memory array heap [12] and simple on-disk storage of the distance matrix. The latter structure can be trivially co-opted by any neighbor-joining tool to overcome memory constraints due to the size of the distance matrix. Run time growth rate is somewhat unclear. The results of Figure 7.4 suggest a cubic rate of growth, which represents no asymptotic improvement over the 139 canonical algorithm. However, the single huge 16S RNA tree does not follow this trend. Further examination, perhaps on simulated alignments, might clear this up. In any case, NINJA represents a major advance in the scalability of neighbor-joining, and makes it feasible to construct trees for inputs with well over 100,000 sequences in a matter of a small number of days of computation on a modern desktop. The accuracy of NINJA is not discussed in this paper, as accuracy of any exact neighbor-joining tool is expected to be the same. That said, there is a clear dependency on correctly estimated distances, motivating more attention to this topic. BioNJ [39] implements a method of minimizing the variance of distances as the neighbor-joining tree is formed, with demonstrated accuracy benefits relative to canonical neighbor-joining; it is a straightforward exercise to incorporate these methods into NINJAś algorithm. In addition, a number of methods have been suggested for improving distance estimation, including Bayesian inference of evolutionary rates [82], and simultaneous estimation of all pairwise distances under a likelihood framework [108]. 140 CHAPTER 8 FUTURE WORK AND CONCLUSIONS The focus of this dissertation has been spread over two classic, and closely related, topics in molecular sequence analysis: multiple sequence alignment and phylogeny inference. I have explored a variety of approaches applied to the various stages of the form-and-polish heuristic for aligning multiple sequences, and described an approach that dramatically increases the speed and scalability of the neighbor-joining method for phylogeny inference. Next I summarize my work and end by suggesting a few related research questions. 8.1. Summary Part 1 (Chapters 2 through 4) of this work investigated methods for the phases of the fom-and-polish method of sequence alignment, while Part 2 (Chapters 6 through 8) described a new algorithm for the neighbor-joining method of phylogeny inference. Chapter 2 introduced the standard form-and-polish heuristic for multiple sequence alignment, delineating what we call the stages of that method. A profusion of approaches have been applied in the literature to those various stages, with little discussion of their relative efficacy, and that chapter presented a careful study of methods for each stage, identifying best-of-breed methods for each. The methods investigated included both established and novel ones, including new methods for estimating distances, weighting pairs of sequences, polishing the alignment, employing alignment consistency, and choosing parameters. We showed that the largest gains in quality come from new methods for estimating distances by 141 normalized alignment costs, and polishing by 3-cuts on the merge tree. We also found that the exact gap-count algorithm for aligning alignments [63] yields only a minor improvement over an approximate merging heuristic. The outcome of this investigation is an alignment tool called Opal which, with a careful choice of methods for each stage, achieves the same accuracy as state-of-the-art tools like ProbCons and MAFFT that in addition use various forms of consistency. Chapter 3 presented details of a new method for assigning weights to pairs of sequences. The purpose of devising sequence pair weights is to reduce the undue influence of overrepresented groups on the final outcome of an alignment; without weighting, the sum-of-pairs scoring function may lead to small improvements to the quality of alignments involving overrepresented groups while greatly reducing the quality of alignments involving underrepresented groups. This new method avoids anomalous behavior exhibited by two other standard methods. Also, despite conflicting results in the literature, we found that the tested methods of pairweighting provided no significant benefit to alignment accuracy across benchmarks (though our new method gave encouraging results on a small dataset expected to suffer from overrepresented groups). Chapter 4 gave details of a new method for using alignment consistency to avoid the errors that often occur in the early stages of the progressive alignment method. The approach depends on modifying the function used to score an alignment generated at a node of the alignment merge tree. The function is modified in a way that reflects the support given for particular alignment features in a pair of sequences A and B by other sequences C. Our results show that our new consistency model works well, but does not provide accuracy gains when used in conjunction with polishing, because of conflicts between the local application of consistencybased modified costs and polishing based on un-modified costs. The canonical algorithm for neighbor-joining was given in Chapter 5, and Chapter 6 developed a new algorithm that finds a exact neighbor-joining tree in 142 much less time, while overcoming previous memory-requirements. The method depends on a two-tiered filtering process, which dramatically restricts the number of distance values that are viewed at each iteration of the neighbor-joining method. The filters depend on external-memory-efficient priority queues, which, in conjunction with efficient on-disk storage of the distance matrix, allow the algorithm to handle inputs of effectively unlimited size (bounded only by disk size). Chapter 7 presents the results of experiments showing that an implementation of this algorithm reduces run time by roughly an order of magnitude on benchmark protein alignments. Specifically, the implementation is able to compute a tree for over 200,000 sequences in less than 6 days, while the canonical algorithm is predicted to take more than 80 days if a machine with sufficient RAM (roughly 500GB) were available. 8.2. Future directions The work described here suggests a number of attractive directions of further research. Sequence pair weights that permit fast profile alignment In Section 2.6 and Chapter 3, we described a new method for weighting sequences pairs to overcome the bias induced by overrepresented groups. While the method avoids the anomalous behavior of both division- and covariance-weights, and gives encouraging results on a small sample of alignments likely experiencing overrepresented groups, it can not be used by optimized profile alignment subroutines. It would be of value to modify the weighting method to allow application to fast profile alignment methods. Unbiased measure of alignment accuracy We noted in Section 2.6 that assessment of sequence pair weighting methods is inherently difficult, as the unweighted recovery measure is itself biased. Devising 143 an unbiased recovery measure that takes overrepresentation into account appears to be a challenging but valuable challenge. Alignment parameter advisor Our experiments (Section 2.9) with an oracle for choosing alignment parameter values suggest that large gains in recovery may be possible by an input-dependent choice of parameters. We reported modest success with both our naive Bayes classifier and core-column-count advisors, but a great deal of room for advancement remains. One interesting direction to take this work is into the realm of probabilistic alignment (such as ProbCons). A full-blown expectation-maximization approach would be too slow in practice, but an advisor that selects from a finite number of candidate parameterizations might offer improved average alignment quality. Local alignments in suboptimal-alignment-based consistency Much of the success of the consistency methods of T-Coffee and MAFFT comes from their inclusion of local alignment information into the consistency framework. In both tools, if local alignment support is removed from the consistency modification method, accuracy values are several percent worse (result not shown). ProbCons and Opal do not incorporate local alignments, but perform very well because of their use of suboptimal alignments in estimating support for alignment features. Merging the use of local alignments and posterior alignment probabilities seems like a natural step in the advancement of consistency-based sequence alignment. Speeding up suboptimality-based consistency to permit polishing Chapter 4 showed that polishing and consistency failed to work well together in the scheme tested in Opal, because polishing using a different scoring scheme than consistency, altering the alignment in ways that eliminated the gains made by consistency. Consistency-modifed costs weren’t used in that scheme because 144 computing those costs was too slow to be used on more than tens of sequences. Algorithm engineering to substantially speed up suboptimality-based consistency might allow it to be applied along the full tree, and in conjunction with polishing. Results in MAFFT and ProbCons suggest that this will produce improved alignment accuracy. Probabilistic consistency with posterior probabilities on gaps The probabilistic consistency method of ProbCons depends on posterior probabilities for only substitutions, but there is no reason posteriors can’t be computed for other transitions in a pair-HMM. By computing these posteriors, and applying the constrained subpaths in Chapter 4 and appendix A, it may be possible to gain improved results from the probabilistic consistency framework. Faster methods for optimizing branch length for minimum evolution The minimum evolution (ME) framework (described in Chapter 5) assigns edge lengths to a tree that minimize the squared difference between observed pairwise sequence distances and those induced by the tree. While a fair bit of work has gone into developing fast methods for computing least-squares edge lengths [14, 68], there may be room for improvement. An algorithm is developed [14] for computing ordinary least squares (OLS) in O(k 2 ) time for k leaves, but other variants WLS and GLS (which weight the components of the sun-of-squares based on variance and covariance of distance estimates) run in time O(k 3 ) and O(k 4 ) respectively. Speeding up WLS and GLS branch length computation may be value since it appears that ME local search may improve phylogeny recovery, relative to an initial neighbor-joining tree (Morgan Price, personal communication). 145 Sequence pair weighting applied to WLS branch lengths Desper and Gascuel have presented very good phylogeny inference accuracy in results based on a variant of minimum evolution called balanced minimum evolution (BME) [22, 23]. At its heart, the approach depends on a computation of the average distance between two subtrees in which, for two subtrees A and B, where S B = B1 B2 , the average distance between A and B is defined recursively, in the general case being D(A, B) = 1 (D(A, B1 ) + D(A, B2 )) . 2 The result is to avoid undue influence on weights by overrepresented groups. The equation is dependent only on topology, and is completely ignorant of branch lengths. As was shown in Chapter 3, specifically in Figure 3.7, identical topologies can correspond to wildly different levels of sequence independence. Replacing the topology-only equation with one based on our sequence weighting scheme might result in an improvement in the accuracy of phylogenies computed in the BME framework. Clearly there are many potentially fruitful directions in which the work of this dissertation can be extended. I look forward to exploring them. 146 APPENDIX A CONSISTENCY GRAPHS In this appendix, an exhaustive listing is given of the sub-paths in alignment graphs for (A ∼ C) and (B ∼ C) that are consistent with each feature in the alignment graph of (A ∼ B) (see Figure 4.1). This appendix extends the description in Section 4.2, and agrees with that definition of consistency, in that it lists alignment subpaths that constrain particular features in the alignment of sequences A and B. A.0.1. Subpath pairs consistent with substitution Only one subpath pair is consistent with a substitution column (ai ∼ bj ) (Figure 4.1a). That subpath pair is shown in Figure 4.2, and given again here. 147 )*+" %" %" #" $" #" (" !" &" '" Figure A.1: Alignment subpaths of A ∼ C and B ∼ C that are consistent with (ai ∼ bj ). The alignment that agrees with these subpaths is: ai ··· ch · · · bj 148 A.0.2. Subpath pairs consistent with gap extension Figures A.2 through A.5 present four subpath pairs that are consistent with ai extending a gap between bj and bj+1 . This is a vertical edge in the pairwise alignment graph of (A ∼ B) (Figure 4.1b). Creating subpath pairs for horizontal )*+,-."$" gap extension is a trivial matter of flipping the graphs on a diagonal axis. %" %" #" $" #" (" !" '" &" Figure A.2: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai extending a gap between bj and bj+1 . The alignment that agrees with these subpaths is: ··· ◦ ai ◦ ch · · · bj − (where ◦ represents freedom to either include or not include character from the associated sequence ) 149 )*+,-."(" %" %" #" $" #" (" !" '" &" Figure A.3: Alignment subpaths of A ∼ C and B ∼ C that are consistent with ai extending a gap between bj and bj+1 . The alignment that agrees with these subpaths is: ◦ ai ◦ ··· - ch − ch+1 · · · ◦ ◦ ↑ ↑ bj % bj+1 The arrows indicate that the subpaths correspond to alignments of B with C such that bi aligns either with or before ch , and bi+1 aligns either with or after ch+1 . 150 )*+,-."%" #" %" #" %" '" $" (" !" &" Figure A.4: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai extending a terminal gap before the first character of B. The alignment that agrees with these subpaths is: ··· ai ◦ − ··· c1 · · · ◦ ↑ % b1 The arrows indicate that the subpaths correspond to alignments of B with C such that b1 aligns either with or after c1 , and c1 aligns after ai . 151 )*+,-."/" %" %" #" $" #" (" !" &" '" Figure A.5: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai extending a terminal gap after the last character of B. The alignment that agrees with these subpaths is: ◦ ··· cp · · · ai − ··· ◦ - ↑ bn The arrows indicate that the subpaths correspond to alignments of B with C such that bn aligns either with or before cp , and cp aligns before ai . 152 A.0.3. Subpath pairs consistent with gap-open and gap-close. Figures A.6 through A.13 present eight subpath pairs that are consistent with ai opening a gap immediately after bj . This is a vertical gap open in the pairwise alignment graph of (A ∼ B). Figures A.14 through A.21 presnt another eight subpath pairs that are consistent with vertical gap closure. Creating subpath pairs for horizontal gap boundaries is a trivial matter of flipping the graphs on a diagonal )*+,"$" axis. %" %" #" $" #" (" !" &" '" Figure A.6: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai opening a gap immediately after bj (general case). The alignment that agrees with these subpaths is: ◦ ··· ai ch−1 ch · · · bj − 153 )*+,"%""-#+,"!./"0'.12"13#345" %" %" #" $" #" (" &" !" '" Figure A.7: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai opening a gap immediately after bj (case: i > 1, 0 < j < n, 0 < h < p). The alignment that agrees with these subpaths is: ai−1 ai ··· ch bj − · · · ch+1 · · · ◦ ↑ % bj+1 The arrows indicate that the subpaths correspond to alignments of B with C such that bj+1 aligns either with or after ch+1 . 154 )*+,"(""-*+.!/0"./1+"2#+,"'3345""6/07+"2!00"8,09":+"71+;"!<"!"33"=" %" %" #" #" '" $" (" !" &" Figure A.8: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with a1 opening a gap before the first character of B (case: j = 0, i = 1). The alignment that agrees with these subpaths is: ai ··· ch · · · ◦ b1 ··· 155 )*+,"-".*+/!01"/02+"3!4#"'5567"89#95:;""<#55:"011=3+>?" %" %" #" $" #" (" &" !" '" Figure A.9: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai opening a gap immediately after bn (case: j = n, 0 < h ≤ p). The alignment that agrees with these subpaths is: ai−1 ai ··· ch bn − ··· 156 )*+,"-"."/*+0!12"013+"4!5#"'6678"9:#:6;<""=#66;"122>[email protected]" %" %" #" #" (" !" $" &" '" Figure A.10: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with a1 opening a gap immediately after bn (case: i = 1, j = n, 0 < h ≤ p). The alignment that agrees with these subpaths is: a1 ··· ch − · · · bn 157 )*+,"-"."/*+0!12"013+"4!5#"!"66"78"'9:;":<#<=>" %" %" #" #" (" !" $" &" '" Figure A.11: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with a1 opening a gap immediately after bj (case: i = 1, j > 0, 0 < h < p). The alignment that agrees with these subpaths is: a1 ··· ch − · · · ch+1 · · · bj ◦ ↑ % bj+1 The arrows indicate that the subpaths correspond to alignments of B with C such that bj+1 aligns either with or after ch+1 . 158 )*+,"-"./*+0!12"013+"4!5#"!"66"78"'6698"96:#:;"".#669"122<4+=>>" %" %" #" #" '" !" $" &" (" Figure A.12: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with a1 opening a gap before the first character of B (case: i = 1, j = 0, 0 ≤ h < p). The alignment that agrees with these subpaths is: a1 ··· ch − · · · ch+1 · · · ◦ ↑ % b1 The arrows indicate that the subpaths correspond to alignments of B with C such that b1 aligns either with or after ch+1 . 159 +,-."/" #$" &" &" #" %" #$" #" )" *"*"*"" '" !" *"*"*"" (" Figure A.13: Complex alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai opening a gap between bj and bj+1 . The alignment that agrees with these subpaths is: ◦ ··· ch0 − − − ai · · · ch · · · bj − − − − In this complex graph pair, ai is the first character of A to appear after a gap that goes back to at least ch0 , and bj is the last character of B to appear before a gap that goes until at least ch . This special case is valid only for h0 ≤ h − 2. See text for proof that an optimal pair (h0 , h) can be found in linear time for fixed i and j. 160 %)*+,"$" %" %" #" $" #" (" !" &" '" Figure A.14: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai closing a gap between bj and bj+1 (general case). The alignment that agrees with these subpaths is: ai ··· ◦ bj ··· ch ch+1 · · · − bj+1 161 $'()*"$""+!*,"-./"01.23"24!456" $" $" !" #" !" &" -" %" 1" Figure A.15: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai closing a gap between bj and bj+1 (case: i > 1, j > 0, 0 < h < p). The alignment that agrees with these subpaths is: ai ai+1 ··· ch · · · ◦ - − ch+1 · · · bj+1 ↑ bj The arrows indicate that the subpaths correspond to alignments of B with C such that bj aligns either with or before ch . 162 %)*+,"(""+-,.!/)"./+,0"'1123""45)6"/.78/))6"8+,9":#,5"!11;" %" %" #" $" #" (" &" !" '" Figure A.16: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with am closing a terminal gap after the last character of B (case: i = m, j = n). The alignment that agrees with these subpaths is: am ··· ◦ bn ··· ch · · · 163 $'()*"+"",-*./0'".0)*"1/2!"34456"/75"8/9:;""594!9<"8!445"(=>" $" $" !" !" 3" #" &" /" %" Figure A.17: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai closing a terminal gap before b1 (case: j = 0, 0 < i < m, 0 ≤ h < p). The alignment that agrees with these subpaths is: ai ai+1 ··· − ch+1 · · · b1 164 $'()*"+"",-*./0'".0)*"1/2!"34456"/4478""594!9:";!445"(<=" $" $" !" !" 3" #" &" %" /" Figure A.18: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with am closing a terminal gap before the first character of B (case: i = m, j = 0, 0 ≤ h < p). The alignment that agrees with these subpaths is: am ··· − ch+1 · · · b1 165 $'()*"+""",-*./0'".0)*"1/2!"/3345"6789"!7:;"<69!=>?" $" $" !" #" !" &" %" 6" /" Figure A.19: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with am closing a gap between bj and bj+1 (case: i = m, 0 < j < n, 0 < h < p). The alignment that agrees with these subpaths is: am ··· ch · · · ◦ - − ch+1 · · · bj+1 ↑ bj The arrows indicate that the subpaths correspond to alignments of B with C such that bj aligns either with or before ch . 166 $'()*"+""",-*./0'".0)*"1/2!"/"33"45"63375"89!93:";!33:"0''(1*<=" $" $" !" #" !" &" %" /" 6" Figure A.20: Alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with am closing a terminal gap after bn (case: i = m, j = n, 0 < h ≤ p). The alignment that agrees with these subpaths is: am ··· ch · · · − ··· ◦ - ↑ bn The arrows indicate that the subpaths correspond to alignments of B with C such that bn aligns either with or before ch . 167 &-./0"1" &" #$" &" #+," %" !" #$" #" )" *"*"*"" *"*"*"" '" (" Figure A.21: Complex alignment subpaths of (A ∼ C) and (B ∼ C) that are consistent with ai closing a gap between bj and bj+1 . The alignment that agrees with these subpaths is: ai − − − ··· ch0 · · · − − − − − ◦ ch−1 ch · · · − bj In this complex graph pair, ai is the last character of A to appear before a gap starting at ch0 +1 and going to ch−1 , and bj is the first character of B to appear after a gap going back to at least ch0 . This special case is valid only for that h0 ≤ h − 2. See text for proof that an optimal pair (h0 , h) can be found in linear time for fixed i and j. 168 The reader should remember that these graphs are used in the computation of modified costs by finding, for each position-pair (i, j), the values of h0 and h that result in minimum suboptimality. Most of the subpaths above involve only a single value, h, so scanning for the best h requires linear time for each (i, j). This leads to O(n3 ) run time for computing suboptimality modifiers from one sequence C for alignment (A ∼ B), when all three sequences have length n. However, two of the subpaths (Figures A.13 and A.21) involve a pair of values, h0 and h. For a fixed pair (i, j), naively finding the pair (h0 , h) that gives smallest suboptimality would require quadratic time, leading to an overall O(n4 ) algorithm. Fortunately, it is possible to find the optimal pair (h0 , h) for a fixed (i, j) in a single linear-time scan through C. This is achieved by keeping two pointers to positions in sequence C. Call these the leading and trailing pointers, which point to positions f and g, and serve as candidates for h0 and h respectively. The leading pointer is advanced one position at a time, and the trailing pointer makes occasional (possibly multi-step) advances. When the leading pointer reaches the end of C, the process is complete; this is a linear process since the leading pointer moves one step at a time and neither pointer ever backtracks. We now provide details to show how the optimal pair (h0 , h) is found. The approach will be described in the context of the average suboptimality described in 4.3. In that context, the pair of constrained alignments with lowest cost will also be the pair with lowest suboptimality. Modifying this approach to use the maximum suboptimality criterion is straightforward. For a fixed i and j, we will maintain the invariant that, for a pointer value g, f is the position in C that results in the lowest average cost of constrained alignments (A ∼ C) and (B ∼ C) among values ≤ g − 2. Initialize f = 1 and g = 3; f satisfies the above condition trivially. In the general case, we seek to advance g to g 0 , and find the value f 0 that gives lowest cost among all values ≤ g 0 − 2. It will be shown (below) that for g 0 = g + 1, f gives lowest cost among positions ≤ g 0 − 3. Thus, 169 for g 0 , the optimal f 0 will be either f or g − 2. This means that a fixed amount of work is performed for each advancement of the leading pointer g. For fixed i and j, when all values g (each with corresponding f ) have been tested, the pair (f, g) with lowest cost gives the (h0 , h) with lowest suboptimality. This is a linear scan per (i, j) pair, resulting in cubic run time to compute suboptimalities over all (i, j). Theorem A.1 (Linear time to compute best (h0 , h) pair for suboptimality graph in Figure A.13). For fixed position g, let f be the position giving lowest cost constrained alignments (A ∼ C) and (B ∼ C) among values ≤ g − 2. Then for position g 0 = g + 1, f is the position giving lowest constrained cost among values ≤ g 0 − 3. Proof. For position pair (f, g), the two parts of the path-pair in Figure A.13 can be decomposed into subpaths that will be used in this proof. Specifically, the cost of the optimal alignment (A ∼ C) forced to go through positions f and g can be broken into the sum L + M + N + P , as shown in Figure A.22. The cost for the similar alignment for (B ∼ C) can be broken into into the sum W + X + Y + Z, as in Figure A.24. Similar decompositions can be made for position pair (f, g 0 ), as in Figures A.23 and A.25. Let C(f, g) be the sum of (A ∼ C) and (B ∼ C) costs going through f and g of Figures A.22 and A.24, and let C(f, g 0 ) be the similar sum going through f and g 0 . It is clear that C(f, g 0 ) = C(f, g) − N − P − Z + N 0 + P 0 + Z 0 + 2λ for any f ≤ g − 2. Thus, when g 0 replaces g, all positions f ≤ g − 2 will be subject to constant change in cost, so the relative order of costs for positions ≤ g − 2 is unchanged when g is replaced by g 0 . This ensures that the f ≤ g − 2 that gives lowest cost alignments for subpaths ending at g will also give the lowest cost alignments for subpaths ending at g 0 = g + 1, among values ≤ g 0 − 3. 170 Bound pruning C L M A N P f g Figure A.22: Decomposing the (A ∼ C) part of figure A.13 (1). Used in proof of linear-time computation of optimal (f, g) pair. 79 Bound pruning 171 C L M A !" N’ P’ f g g' Figure A.23: Decomposing the (A ∼ C) part of figure A.13 (1). Used in proof of linear-time computation of optimal (f, g) pair. Bound pruning C W B X Y Z f g Figure A.24: Decomposing the (B ∼ C) part of figure A.13 (1). Used in proof of linear-time computation of optimal (f, g) pair. 81 172 Bound pruning C W B X Y !" Z’ f g g' Figure A.25: Decomposing the (B ∼ C) part of figure A.13 (1). Used in proof of linear-time computation of optimal (f, g) pair. 82 173 APPENDIX B ON JAVA VERSUS C++ FOR SCIENTIFIC COMPUTING In this appendix, I briefly describe my personal experience with developing complex scientific applications in Java. I start by giving my motivations for moving from C++ to Java in the first place. I then give a sense of how Java has performed in my two scientific computing applications, and reach the conclusion that (1) Java is an excellent language for prototyping scientific algorithms and gives acceptable speed if used carefully, but (2) C++ appears to give better (faster) performance, especially in I/O-intensive applications, and is thus preferable for polished software release. Based on personal observation, the implementation languages of choice in computational biology appear to be C and C++. Java seems to have much less penetrance in the market of serious scientific computing, likely because it is viewed as being too slow or restrictive for such use. The original implementation of Opal [115] was in C, and depended on an existing C implementation of the algorithms mentioned in Section 2.5 for aligning alignments [63] . As I began development related to the consistency methods described in Chapter 4, it was clear that a great deal of refactoring of code would be required to incorporate the modified cost schemes, and the flexibility of an objectoriented programming language would make it much easier to test the wide range of ideas we had in mind. Because I preferred the ease of development, debugging, and portability that Java offer, I decided to investigate reimplementing Opal with Java. The results of Bull et al. [15] and a variety of benchmark tests posted to the web (e.g. http://www.idiom.com/∼zilla/Computer/javaCbenchmark.html and http://kano.net/javabench/) were compelling enough to convince me to 174 reimplement the code for aligning alignments (AlignAlign) in Java to compare performance. AlignAlign is an interesting test case, in that the C implementation requires a fairly complex memory pool management scheme in order to efficiently deal with creating and discarding the shapes described in Section 4.5. It is also likely an ideal candidate for porting to Java, since input files can be handled much more easily with Java’s regular expression engine than with C, and its core computation effectively consists of a big sweep through a 2-dimensional array. In the Java implementation, I was able to do away with pool management, leaving the job to the Java virtual machine (JVM) garbage collector. With otherwise essentially identical algorithms, the Java implementation ran about 10% slower than a heavily optimized C implementation, on average over a wide range of inputs and several computers. It should be noted that this required careful avoidance of creating unnecessary objects inside core loops of the program, as object creation in Java is slow. It seemed worthwhile to pay a 10% performance penalty for improved ease of implementation and debugging. A side benefit of implementing in pure Java is that it was a simple matter to plug Opal into Mesquite (http://mesquiteproject.org), a well-regarded software suite for evolutionary biology. NINJA was also developed in Java, and requires quite a bit of low-level diskI/O code. In this case, it appears that a C implementation might provide a substantial speed improvement. NINJA is roughly as fast as RapidNJ (written in C++) on large inputs of more than 1000 sequences, but RapidNJ is much faster for small inputs. In fact, on inputs of a few hundred sequences RapidNJ completes in less time than NINJA requires just to read in the quadratic-sized distance matrix, despite significant effort invested in optimizing NINJA’s disk-reading functions. This suggests that NINJA’s success on larger inputs is due to algorithmic success, and that porting NINJA to C might result in a modest speed improvement. In addition to the slow disk-I/O in java, other issues have come to light that 175 suggest limitations to the value of Java in scientific computing. First, on a 32-bit machine, JVMs will allow allocation of only 2GB RAM, even if more is available on the machine; this, for example, may not be sufficient to store the dynamic programming tables for aligning very long sequences. Second, even on 64-bit machines, memory allocation appears to be a problem on machines with huge RAM; Morgan Price (pers. comm.) reports that his JVM crashed repeatedly when attempting to allocate more than 60GB on a machine that has 128GB available - this was not a NINJA error. In addition, Java appears to be prone to dropping buffered data when processing data from a Unix pipe, though I’ve not seen this documented elsewhere. A final problem is that the portability of Java applications depends on the availability of a JVM; while appropriate JVMs are available for download, end users may find themselves unable to install the JVM on a server over which they have no control. Furthermore, some more recent benchmarking results released to the web have been less supportive of Java’s asserted near-equality to the C variants (e.g. http://www.stefankrause.net/wp/?p=9) Having developed two serious scientific computing tools in Java, I believe that it is an excellent test-bed for algorithms, providing a relatively simple programming environment with a reasonable speed. However, any tool that is likely to be used heavily by the public, and for which a premium will be placed on performance, should probably be ported to C++ in the end. This should be a fairly simple port, given the similarities of the languages. 176 REFERENCES [1] A. Aitken, On least squares and linear combination of observations, Proceedings of the Royal Society of Edinburgh. Section A. Mathematics, 55 (1934), p. 42. [2] S. Altschul, Gap costs for multiple sequence alignment, Journal of theoretical biology, 138 (1989), pp. 297–309. [3] S. Altschul, R. Carroll, and D. Lipman, Weights for data related by a tree., Journal of Molecular Biology, 207 (1989), pp. 647–653. [4] S. F. Altschul, Leaf pairs and tree dissections, SIAM Journal on Discrete Mathematics, 2 (1989), pp. 293–299. [5] K. Atteson, The performance of neighbor-joining methods of phylogenetic reconstruction, Algorithmica, 25 (1999), pp. 251–278. [6] A. Bahr, J. Thompson, J. Thierry, and O. Poch, BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations, Nucleic Acids Research, 29 (2001), pp. 323–326. [7] S. Balaji, S. Sujatha, S. Kumar, and N. Srinivasan, Pali—a database of phylogeny and alignment of homologous protein structures, Nucleic Acids Research, 29 (2001), pp. 61–65. [8] R. Bayer and E. McCreight, Organization and maintenance of large ordered indexes, Acta informatica, 1 (1972), pp. 173–189. [9] M. Berger and P. Munson, A novel randomized iterative strategy for aligning multiple protein sequences, CABIOS, 7 (1991), pp. 479–484. [10] M. Bordewich, O. Gascuel, K. Huber, and V. Moulton, Consistency of topological moves based on the balanced minimum evolution principle of phylogenetic inference, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6 (2009), pp. 110–117. [11] R. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey, I. Holmes, and L. Pachter, Fast statistical alignment, PLoS Computational Biology, 5 (2009), p. e1000392. 177 [12] K. Brengel, A. Crauser, P. Ferragina, and U. Meyer, An experimental study of priority queues in external memory, Proceedings of the 3rd International Workshop on Algorithm Engineering, (1999), pp. 345–359. [13] D. Bryant, On the uniqueness of the selection criterion in neighbor-joining, Journal of Classification, 22 (2005), pp. 3–15. [14] D. Bryant and P. Waddell, Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees, Molecular Biology and Evolution, 15 (1998), pp. 1346–1359. [15] J. Bull, L. Smith, L. Pottage, and R. Freeman, Benchmarking java against c and fortran for scientific applications, Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande, (2001), pp. 97–105. [16] M. Bulmer, Use of the method of generalized least squares in reconstructing phylogenies from sequence data, Molecular Biology and Evolution, 8 (1991), pp. 868–883. [17] H. Carrillo and D. Lipman, The multiple sequence alignment problem in biology, SIAM Journal on Applied Mathematics, 48 (1988), pp. 1073–1082. [18] H. Carroll, W. Beckstead, T. O’Connor, and M. Ebbert, Dna reference alignment benchmarks based on tertiary structure of encoded proteins, Bioinformatics, 23 (2007), pp. 2648–2649. [19] L. Cavalli-Sforza and A. Edwards, Phylogenetic analysis models and estimation procedures, American Journal of Human Genetics, 19 (1967), pp. 233–257. [20] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to algorithms, 2001. [21] C. Darwin, On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life, New York: D. Appleton, (1859). [22] R. Desper and O. Gascuel, Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle, J Comput Biol, 9 (2002), pp. 687–705. [23] , Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting, Molecular Biology and Evolution, 21 (2004), pp. 587–598. 178 [24] , The minimum evolution distance-based approach to phylogenetic inference, In: Gascuel O, editor. Mathematics of evolution & phylogeny. Oxford, UK: Oxford University Press, (2005), pp. 1–32. [25] C. Do, M. Mahabhashyam, M. Brudno, and S. Batzoglou, Probcons: Probabilistic consistency-based multiple sequence alignment, Genome Research, 15 (2005), pp. 330–340. [26] P. Domingos and M. Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Machine learning, (1997). [27] A. Drummond and A. Rambaut, Beast: Bayesian evolutionary analysis by sampling trees, BMC Evol Biol, (2007). [28] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: Probabilistic models of proteins and nucleic acids, books.google.com, (1998). [29] R. Edgar, Muscle: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics, 5 (2004), p. 113. [30] R. Edgar, Muscle: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, (2004). [31] I. Elias and J. Lagergren, Fast neighbor joining, Proc. of the 32nd International Colloquium on Automata, Languages and Programming (ICALP’05, (2005), pp. 1263–1274. [32] J. Evans, L. Sheneman, and J. Foster, Relaxed neighbor joining: a fast distance-based phylogenetic tree construction method, Journal of Molecular Evolution, (2006). [33] J. Farris, V. Albert, M. Kallersjo, and D. Lipscomb, Parsimony jackknifing outperforms neighbor-joining, Cladistics, (1996). [34] J. Felsenstein, Numerical methods for inferring evolutionary trees, Quarterly Review of Biology, (1982), pp. 379–404. [35] J. Felsenstein, Inferring phylogenies, (2004). [36] D. Feng and R. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, Journal of Molecular Evolution, 25 (1987), pp. 351–360. 179 [37] R. D. Finn, J. Tate, J. Mistry, P. C. Coggill, S. J. Sammut, H. Hotz, G. Ceric, K. Forslund, S. R. Eddy, E. L. L. Sonnhammer, and A. Bateman, The pfam protein families database, Nucleic Acids Research, 36 (2007), pp. D281–D288. [38] P. P. Gardner, A benchmark of multiple sequence alignment programs upon structural rnas, Nucleic Acids Research, 33 (2005), pp. 2433–2439. [39] O. Gascuel, BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data, Molecular Biology and Evolution, (1997). [40] O. Gascuel and M. Steel, Neighbor-joining revealed, Molecular Biology and Evolution, 23 (2006), pp. 1997–2000. [41] N. Goldman and Z. Yang, Introduction. statistical and computational challenges in molecular phylogenetics and evolution, Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 363 (2008), pp. 3889–3892. [42] P. A. Goloboff, S. A. Catalano, J. M. Mirande, C. A. Szumik, J. S. Arias, M. Kallersjo, and J. S. Farris, Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups, Cladistics, 25 (2009), pp. 211–230. [43] P. A. Goloboff, J. S. Farris, and K. C. Nixon, TNT, a free program for phylogenetic analysis, Cladistics, 24 (2008), pp. 774–786. [44] O. Gotoh, Consistency of optimal sequence alignments, Bulletin of Mathematical Biology, 52 (1990), pp. 509–525. [45] , Optimal alignment between groups of sequences and its application to multiple sequence alignment, CABIOS, 9 (1993), pp. 361–370. [46] , Further improvement in methods of group-to-group sequence alignment with generalized profile . . . , CABIOS, 10 (1994), pp. 379–387. [47] , A weighting system and aigorithm for aligning many phylogenetically related sequences, CABIOS, 11 (1995), pp. 543–551. [48] , Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, Journal of Molecular Biology, 264 (1996), pp. 823–838. [49] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman, Rfam: annotating non-coding RNAs in complete genomes, Nucleic Acids Research, 33 (2005), pp. D121–D124. 180 [50] D. Gusfield, Algorithms on strings, trees, and sequences: computer science and computational biology, (1997). [51] S. Henikoff and J. Henikoff, Amino acid substitution matrices from protein blocks, Proceedings of the National Academy of Sciences, 89 (1992), pp. 10915–10919. [52] M. Hirosawa, Y. Totoki, M. Hoshida, and M. Ishikawa, Comprehensive study on iterative algorithms of multiple sequence alignment, CABIOS, 11 (1995), pp. 13–18. [53] W. Hordijk and O. Gascuel, Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood, Bioinformatics, 21 (2005), pp. 4338–4347. [54] K. Howe, A. Bateman, and R. Durbin, QuickTree: building huge neighbour-joining trees of protein sequences, Bioinformatics, 18 (2002), pp. 1546–1547. [55] X. Huang and W. Miller, A time-efficient linear-space local similarity algorithm, Advances in Applied Mathematics, 12 (1991), pp. 337–357. [56] J. Huelsenbeck and F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees, Bioinformatics, 17 (2001), pp. 754–755. [57] L. Hurst, The Ka/Ks ratio: diagnosing the form of sequence evolution, TRENDS in Genetics, 18 (2002), pp. 486–487. [58] T. Jukes and C. Cantor, Evolution of protein molecules, Mammalian protein metabolism, 3 (1969), pp. 21–132. [59] K. Katoh, K. Kuma, H. Toh, and T. Miyata, MAFFT version 5: improvement in accuracy of multiple sequence alignment, Nucleic Acids Research, 33 (2005), pp. 511–518. [60] K. Katoh, K. Misawa, K. Kuma, and T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast fourier transform, Nucleic Acids Research, 30 (2002), p. 3059. [61] J. Kececioglu, The maximum weight trace problem in multiple sequence alignment, Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, (1993), pp. 106–119. [62] J. Kececioglu and E. Kim, Simple and fast inverse alignment, Proceedings of the 10th ACM Conference on Research in Computational Molecular Biology (RECOMB), 3909 (2006), pp. 441–455. 181 [63] J. Kececioglu and D. Starrett, Aligning alignments exactly, Proceedings of the 8th ACM Conference on Research in Computational Molecular Biology (RECOMB 2004), (2004), pp. 85–96. [64] J. Kececioglu and W. Zhang, Aligning alignments, Combinatorial Pattern Matching: 9th Annual Symposium, (1998), pp. 189–208. [65] M. Kimura, The neutral theory of molecular evolution, Cambridge University Press, (1983). [66] D. Knuth, The art of computer programming, vol. 3: Sorting and searching, Addison-Wesley, (1973). [67] M. Kuhner and J. Felsenstein, A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates, Molecular Biology and Evolution, 11 (1994), pp. 459–468. [68] S. Kumar, A stepwise algorithm for finding minimum evolution trees, Molecular Biology and Evolution, 13 (1996), pp. 584–593. [69] S. Kumar and S. Gadagkar, Efficiency of the neighbor-joining method in reconstructing deep and shallow evolutionary relationships in large phylogenies, Journal of Molecular Evolution, 51 (2000), pp. 544–553. [70] J. Lake, The order of sequence alignment can bias the selection of tree topology, Molecular Biology and Evolution, 8 (1991), pp. 378–385. [71] D. Lipman, S. Altschul, and J. Kececioglu, A tool for multiple sequence alignment, Proceedings of the National Academy of Sciences, 86 (1989), pp. 4412–4415. [72] B. Ma, Z. Wang, and K. Zhang, Alignment between two multiple alignments, Combinatorial pattern matching: 14th annual symposium, CPM, (2003), pp. 254–265. [73] W. Maddison, Gene trees in species trees, Systematic Biology, 46 (1997), pp. 523–536. [74] W. Maddison and L. Knowles, Inferring phylogeny despite incomplete lineage sorting, Systematic Biology, 55 (2006), pp. 21–30. [75] T. Mailund, G. Brodal, R. Fagerberg, C. Pedersen, and D. Phillips, Recrafting the neighbor-joining method, BMC Bioinformatics, 7 (2006). 182 [76] T. Mailund and C. Pedersen, QuickJoin-fast neighbour-joining tree reconstruction, Bioinformatics, 20 (2004), pp. 3261–3262. [77] R. Mihaescu, D. Levy, and L. Pachter, Why neighbor-joining works, Algorithmica, 54 (2009), pp. 1–24. [78] S. Miyazawa, A reliable sequence alignment method based on probabilities of residue correspondences, Protein Engineering, 8 (1995), pp. 999–1009. [79] T. Muller, R. Spang, and M. Vingron, Estimating amino acid substitution models: a comparison of Dayhoff ’s estimator, the resolvent approach and a maximum likelihood method, Molecular Biology and Evolution, 19 (2002), pp. 8–13. [80] A. Murzin, S. Brenner, T. Hubbard, and C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures, Journal of Molecular Biology, 247 (1995), pp. 536–540. [81] L. Nakhleh, B. Moret, U. Roshan, K. John, and T. Warnow, The accuracy of fast phylogenetic methods for large datasets, Proc. 7th Pacific Symp. Biocomputing PSB 2002, (2002), pp. 211–222. [82] M. Ninio, E. Privman, T. Pupko, and N. Friedman, Phylogeny reconstruction: increasing the accuracy of pairwise distance estimation using Bayesian inference of evolutionary rates, Bioinformatics, 23 (2007), pp. e136– e141. [83] C. Notredame, D. Higgins, and J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology, 302 (2000), pp. 205–217. [84] C. Notredame, L. Holm, and D. Higgins, Coffee: an objective function for multiple sequence alignments, Bioinformatics, 14 (1998), pp. 407–422. [85] T. Ogden and M. Rosenberg, Multiple sequence alignment accuracy and phylogenetic inference, Systematic Biology, 55 (2006), pp. 314–328. [86] B. Paten, J. Herrero, K. Beal, and E. Birney, Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment, Bioinformatics, 25 (2008), pp. 295–301. [87] D. Patterson, Latency lags bandwith, Commun Acm, 47 (2004), pp. 71–75. [88] J. Pei and N. Grishin, MUMMALS: multiple sequence alignment improved by using hidden markov models with local structural information, Nucleic Acids Research, 34 (2006), pp. 4364–4374. 183 [89] J. Pei and N. V. Grishin, PROMALS: towards accurate multiple sequence alignments of distantly related proteins, Bioinformatics, 23 (2007), pp. 802– 808. [90] M. N. Price, P. S. Dehal, and A. P. Arkin, FastTree: Computing large minimum evolution trees with profiles instead of a distance matrix, Molecular Biology and Evolution, 26 (2009), pp. 1641–1650. [91] B. Redelings and M. Suchard, Joint bayesian estimation of alignment and phylogeny, Systematic Biology, 54 (2005), pp. 401–418. [92] F. Ronquist and J. Huelsenbeck, MrBayes 3: Bayesian phylogenetic inference under mixed models, Bioinformatics, 19 (2003), pp. 1572–1574. [93] U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22 (2006), pp. 2715–2721. [94] A. Rzhetsky and M. Nei, Theoretical foundation of the minimumevolution method of phylogenetic inference, Molecular Biology and Evolution, 10 (1993), pp. 1073–1095. [95] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution, 4 (1987), pp. 406–425. [96] D. Sankoff, Minimal mutation trees of sequences, SIAM Journal on Applied Mathematics, 28 (1975), pp. 35–42. [97] A. S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics, 23 (2007), pp. e24–e29. [98] L. Sheneman, J. Evans, and J. Foster, Clearcut: a fast implementation of relaxed neighbor joining, Bioinformatics, 22 (2006), pp. 2823–2824. [99] M. Simonsen, T. Mailund, and C. Pedersen, Rapid neighbourjoining, Proceedings of the 8th international Workshop on Algorithms in Bioinformatics, (2008), pp. 113–122. [100] S. A. Smith, J. M. Beaulieu, and M. J. Donoghue, Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches, BMC Evol Biol, 9 (2009), pp. 1–12. [101] P. Sneath and R. Sokal, Numerical taxonomy: the principles and practice of numerical classification, Springer, (1973). 184 [102] A. Stamatakis, RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models, Bioinformatics, 22 (2006), pp. 2688–2690. [103] D. Starrett, Optimal alignment of multiple sequences, Dissertation, (2008). [104] K. St.John, T. Warnow, B. Moret, and L. Vawter, Performance study of phylogenetic methods: (unweighted) quartet methods and neighborjoining, J Algorithm, 48 (2003), pp. 173–193. [105] J. Studier and K. Keppler, A note on the neighbor-joining algorithm of Saitou and Nei, Molecular Biology and Evolution, 5 (1988), pp. 729–731. [106] S. Subbiah and S. Harrison, A method for multiple sequence alignment with gaps, Journal of Molecular Biology, 209 (1989), pp. 539–548. [107] M. Suchard and B. Redelings, BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny, Bioinformatics, 22 (2006), pp. 2047– 2048. [108] K. Tamura, M. Nei, and S. Kumar, Prospects for inferring very large phylogenies by using the neighbor-joining method, Proceedings of the National Academy of Sciences, 101 (2004), pp. 11030–11035. [109] J. Thompson, F. Plewniak, and O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Research, 27 (1999), pp. 2682–2690. [110] J. D. Thompson, D. G. Higgins, and T. J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research, 22 (1994), pp. 4673–4680. [111] M. Vingron and M. Waterman, Sequence alignment and penalty choice. review of concepts, case studies and implications., Journal of Molecular Biology, 235 (1994), pp. 1–12. [112] I. Wallace, OO’Sullivan, D. Higgins, and C. Notredame, M-Coffee: combining multiple sequence alignment methods with T-Coffee, Nucleic Acids Research, 34 (2006), pp. 1692–1699. [113] I. V. Walle, I. Lasters, and L. Wyns, SABmark–a benchmark for sequence alignment that covers the entire known fold space, Bioinformatics, 21 (2005), pp. 1267–1268. 185 [114] L. Wang and T. Jiang, On the complexity of multiple sequence alignment., J Comput Biol, 1 (1994), pp. 337–348. [115] T. Wheeler and J. Kececioglu, Multiple alignment by aligning alignments, Bioinformatics, 23 (2007), pp. i559–i568. [116] W. C. Wheeler, J. Gatesy, and R. DeSalle, Elision: a method for accommodating multiple molecular sequence alignments with alignmentambiguous sites, Molecular Phylogenetics and Evolution, 4 (1995), pp. 1–9. [117] Z. Yang and B. Rannala, Bayesian phylogenetic inference using DNA sequences: A Markov chain monte carlo method, Molecular Biology and Evolution, 14 (1997), pp. 717–724. [118] J. Yuan, A. Amend, J. Borkowski, R. DeMarco, W. Bailey, Y. Liu, G. Xie, and R. Blevins, MULTICLUSTAL: a systematic method for surveying Clustal W alignment parameters, Bioinformatics, 15 (1999), pp. 862–863. [119] L. Zaslavsky and T. A. Tatusova, Accelerating the neighbor-joining algorithm using the adaptive bucket data structure, Lecture Notes in Computer Science, 4983 (2008), pp. 122–133.

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement