Multiple Sequence Alignment by Iterated Local Search
HS-IKI-MD-04-001
Jens Auer

Submitted by Jens Auer to the University of Skövde as a dissertation towards the degree of Master of Science with a major in computer science.

September 2004

I certify that all material in this dissertation which is not my own work has been identified and that no material is included for which a degree has already been conferred upon me.

Jens Auer

Högskolan i Skövde
Kommunikation och information

Master Dissertation
Multiple Sequence Alignment by Iterated Local Search
Jens Auer
August 29, 2004
supervised by Björn Olsson

I would like to thank all people involved in this thesis, especially my supervisor Björn Olsson for his support and comments, which helped me very much to complete this thesis project. I also would like to thank Thomas Stützle for providing me with some of the literature used in this thesis.

Abstract

The development of good tools and algorithms to analyse genome and protein data in the form of amino acid sequences coded as strings is one of the major subjects of Bioinformatics. Multiple Sequence Alignment (MSA) is one of the basic yet most important techniques for analysing data of more than two sequences. From a computer scientist's point of view, MSA is a combinatorial optimisation problem where strings have to be arranged in tabular form to optimise an objective function. The difficulty with MSA is that it is NP-complete, and thus impossible to solve optimally in reasonable time under the assumption that P ≠ NP. The ability to tackle NP-hard problems has been greatly extended by the introduction of Metaheuristics (see Blum & Roli (2003) for a summary of most Metaheuristics), general problem-independent optimisation algorithms extending the hill-climbing local search approach to escape local minima. One of these algorithms is Iterated Local Search (ILS) (Lourenço et al., 2002; Stützle, 1999a, p.
25), a recent, easy-to-implement but powerful algorithm with results comparable or superior to other state-of-the-art methods for many combinatorial optimisation problems, among them TSP and QAP. ILS iteratively samples local minima by modifying the current local minimum and restarting a local search procedure on this modified solution. This thesis will show how ILS can be implemented for MSA. After that, ILS will be evaluated and compared to other MSA algorithms using BAliBASE (Thompson et al., 1999), a set of manually refined alignments used in most recent publications of algorithms and in at least two MSA algorithm surveys. The runtime behaviour will be evaluated using runtime distributions. The quality of alignments produced by ILS is at least as good as the best algorithms available and significantly superior to previously published Metaheuristics for MSA, Tabu Search and a Genetic Algorithm (SAGA). On average, ILS performed best in five out of eight test cases, second for one test set and third for the remaining two. A drawback of all iterative methods for MSA is the long runtime needed to produce good alignments. ILS needs considerably less runtime than Tabu Search and SAGA, but cannot compete with progressive or consistency-based methods, e.g. ClustalW or T-COFFEE.

keywords: Combinatorial Optimisation, Metaheuristics, Multiple Sequence Alignment, Iterated Local Search

Contents

1 Introduction
  1.1 Motivation
  1.2 Aim and Objectives
2 Background
  2.1 Multiple Sequence Alignment
    2.1.1 Definition
    2.1.2 Alignment Scores
    2.1.3 Complexity
  2.2 Metaheuristics and Combinatorial Optimisation
    2.2.1 Combinatorial Optimisation
    2.2.2 Heuristic algorithms
    2.2.3 Metaheuristics
  2.3 Iterated Local Search
    2.3.1 GenerateInitialSolution
    2.3.2 EmbeddedHeuristic
    2.3.3 Perturbation
    2.3.4 Acceptance criterion
    2.3.5 Termination condition
  2.4 Related Work
    2.4.1 Simulated Annealing
    2.4.2 SAGA
    2.4.3 Tabu Search
3 Algorithm
  3.1 Construction heuristic
  3.2 Local Search
    3.2.1 Single exchange local search
    3.2.2 Block exchange local search
    3.2.3 Comparison
  3.3 Perturbation
    3.3.1 Random regap perturbation
    3.3.2 Block move perturbation
    3.3.3 Pairwise alignment perturbation
    3.3.4 Re-aligning of complete regions
    3.3.5 Comparison
  3.4 Acceptance criterion
  3.5 Termination condition
  3.6 Iterated Local Search for Multiple Sequence Alignment
  3.7 Implementation
4 Method and Materials
  4.1 Run-time distributions
  4.2 The BAliBASE database of multiple alignments
    4.2.1 Statistical Evaluation
5 Results
  5.1 ILS components
    5.1.1 Construction heuristic
    5.1.2 Local search
    5.1.3 Perturbation
    5.1.4 Acceptance Criterion
  5.2 BAliBASE evaluation of ILS
  5.3 Runtime behaviour of ILS
6 Conclusions and Discussion
  6.1 Conclusions
  6.2 Discussion
A Abbreviations

List of Figures

2.1 Example Multiple Sequence Alignment
2.2 Structure of the hevein protein
2.3 Pictorial view of ILS
3.1 Move operation of the single exchange local search
3.2 Local Search step with block exchange
3.3 Random regap perturbation
3.4 Perturbation using a random block move
3.5 Perturbation using optimal pairwise alignments
3.6 Block-alignment perturbation
3.7 COFFEE scores for ILS on reference set gal4 ref1
5.1 Boxplot of the average BSPS scores of all 144 alignments in the database
5.2 Boxplot of the average CS scores of all 144 alignments in the database
5.3 Boxplots for each reference set showing BSPS-scores
5.4 Boxplots for each reference set showing CS-scores
5.5 Solution quality trade-off for 1bbt3 ref1
5.6 Solution quality trade-off for 1pamA ref1
5.7 Runlength-distribution for 1r69 ref1
5.8 Runlength-distribution for 1pamA ref1
6.1 Percentage deviation of optimised alignments from reference alignments

List of Tables

2.1 Classification of MSA algorithms
5.1 Comparison of construction heuristics
5.2 Comparison of COFFEE-scores for both local searches
5.3 Runtime comparison for both local searches
5.4 Results of different perturbations
5.5 Runtimes of different perturbations
5.6 Percentage improvements in COFFEE-score for different acceptance criteria
5.7 Runtime for different acceptance criteria
5.8 Average SP-scores for BAliBASE
5.9 Average column scores for BAliBASE
5.10 Statistical significance of the differences in BSPS- and CS-score
5.11 Runtime for BAliBASE

List of Algorithms

2.1 Iterated Local Search
2.2 BETTER acceptance criterion
2.3 RESTART acceptance criterion
3.1 Single exchange local search
3.2 Block exchange local search
3.3 Random regap perturbation
3.4 Random block move perturbation
3.5 Pairwise alignment perturbation
3.6 Realign region perturbation
3.7 Iterated Local Search for MSA

1 Introduction

The development of good tools and algorithms to analyse genome and protein data is one of the major subjects of Bioinformatics. This data is typically collected in the form of biological sequences, i.e. strings of symbols representing amino acids or nucleotides. From the comparison of these sequences, information about evolutionary development, conserved regions or motifs and structural commonalities can be acquired. This provides the basis for deducing biological functions of proteins or characterising protein families. One central technique to compare a set of sequences is Multiple Sequence Alignment (MSA). This work presents a new algorithm for amino acid sequence alignment, where the goal is to arrange a set of sequences representing protein data in a tabular form such that a scoring function is optimised.
The problem of Multiple Sequence Alignment is hard to solve; for most scoring functions it has been proven to be NP-hard. Considering computational and biological problems, Notredame (2002) describes three major questions: Which sequences should be aligned? How is the alignment scored? How should the alignment be optimised w.r.t. the objective function? This thesis focuses on the third point, summarised as follows by Notredame (2002): computing optimal alignments is very time-consuming; it has even been proven NP-hard for the most commonly used objective functions, which makes it intractable for exact algorithms when more than a small set of sequences is to be aligned. To handle larger sets, heuristics must be used, which leaves the question of the reliability of these heuristics and their capability to produce near-optimal results.

1.1 Motivation

MSA has been a field of research for a long time and many algorithms have been proposed. Most algorithms proposed for MSA rely on a step-by-step alignment procedure where one sequence at a time is added until all sequences are aligned. In most cases, the sequences are added to the alignment in a precomputed order. This class of algorithms, with ClustalW (Thompson et al., 1994) being the most prominent member, suffers from the usual problem of greedy approaches, i.e. being unable to alter decisions made in an earlier phase even if doing so would produce an improvement. An initial study has shown that a simple local search is able to improve the results of ClustalW, especially for the interesting case of sequence similarity below 25%, where a local search can improve the alignment by 8% (see table 5.2 on page 42). This approach is very likely to become stuck in a local minimum and could therefore be improved by using algorithms able to escape local minima. To overcome the problems of local minima, several stochastic schemes have been proposed in the context of combinatorial optimisation.
These schemes are often described as Metaheuristics, since they describe a general algorithmic framework which is independent of any specific optimisation problem. Established Metaheuristics are simulated annealing, tabu search, guided local search, ant colony optimisation and genetic/evolutionary algorithms. Metaheuristics are an active research field in AI and are used in many different domains, both in research and industrial applications (see Blum & Roli (2003) for an overview of Metaheuristics in Combinatorial Optimisation). Even though MSA is of such remarkable importance, only three cases of contemporary Metaheuristic applications to MSA were found in the literature search for this project. Kim et al. (1994) used simulated annealing, Notredame & Higgins (1996) implemented a genetic algorithm and, recently, Riaz et al. (2004) studied Tabu Search. The results of Notredame and Riaz are comparable with other state-of-the-art techniques and encourage further investigation of other Metaheuristics. In this thesis, Iterated Local Search (ILS) (Lourenço et al., 2002) is implemented and evaluated as an algorithm for MSA. Starting from a local minimum generated by running an embedded heuristic procedure on a heuristically computed initial alignment, ILS iteratively performs the following steps:

1. Perturbation: Modify the current solution to escape the local minimum.
2. EmbeddedHeuristic: Run the embedded heuristic on the perturbed solution to reach the next local minimum.
3. Acceptance: Select either the solution before or the solution after running the embedded heuristic as the starting solution for the next iteration.

Section 2.3 describes ILS as a Metaheuristic in detail, and section 3.6 presents the algorithm implemented in this project for MSA.
The idea behind ILS, the search in the reduced space of local minima, can be found in many algorithms in combinatorial optimisation under different names, including iterated descent, large-step Markov chains, iterated Lin-Kernighan and chained local optimisation. Despite the long history of the basic idea, ILS is a relatively new Metaheuristic and has not been used for MSA. Lourenço et al. (2001) give a brief introduction to ILS. ILS has shown very good results on many combinatorial optimisation problems, e.g. the travelling salesman problem (Stützle & Hoos, 2002), where it constitutes one of the best algorithms currently available, and the quadratic assignment problem (Stützle, 1999b), one of the hardest problems known. Other applications include scheduling problems of different kinds (den Besten et al., 2001; Stützle, 1998) and graph colouring (Chiarandini & Stützle, 2002; Paquete & Stützle, 2002), where ILS can compete with the best algorithms available. Lourenço et al. (2002) describe these and more applications and give a good overview of ILS. Compared to previously applied Metaheuristics, ILS has several advantages. It is a modular algorithm with very simple basic steps and few parameters to adjust, as opposed to e.g. genetic algorithms. Another benefit which ILS shares with other Metaheuristics is its independence from the objective function to be optimised.

1.2 Aim and Objectives

The aim of this dissertation project is to implement and evaluate ILS (Lourenço et al., 2002) for MSA. The algorithm is evaluated using a reference set of alignments, provided by the BAliBASE database of multiple sequence alignments (Bahr et al., 2001; Thompson et al., 1999). ILS is compared to established state-of-the-art algorithms for MSA w.r.t. biological validity, by measuring the conformance of the computed alignments with those found in the database, and w.r.t. computational properties, by using runtime distributions (Hoos & Stützle, 1998).
To fulfil the aim, the following objectives have to be considered:

- Choose a good construction heuristic by testing several alternatives.
- Construct a good local search procedure which is able to improve an alignment very fast but also reaches high solution quality. This involves testing different neighbourhood structures and design decisions such as pivoting strategies or neighbourhood pruning techniques.
- Implement a perturbation procedure which fits the local search procedure used and helps to find high-quality solutions.
- Choose an acceptance criterion which balances intensification and diversification to explore interesting regions of the search space efficiently.
- Determine a cut-off criterion to end the ILS.

Chapter 3 presents several alternatives for each of these points. Chapter 5 compares the alternatives, taking into account the intended use as a component of the ILS. Considering the results, the complete ILS is presented in algorithm 3.7 on page 35 in chapter 3. Altogether, three construction heuristics, two local search procedures (with two pivoting strategies), four perturbation strategies and three acceptance criteria are tested considering the intended use as part of an ILS algorithm.

2 Background

This chapter introduces the basic concepts needed in this thesis and provides an overview of previous related work.

2.1 Multiple Sequence Alignment

MSA is a fundamental tool in molecular biology and bioinformatics. It has a variety of applications, e.g. finding characteristic motifs and conserved regions in protein families, analysing evolutionary relations between proteins and secondary/tertiary structure prediction. The following section shows concrete examples of MSA in widely known problems of molecular biology. A common way to represent the evolution of a set of proteins is to arrange them in a phylogenetic tree. The first step in computing the tree is to measure all pairwise distances of the given set of sequences and derive a distance matrix from this.
The distances can be gathered from an MSA by comparing each pair of aligned sequences in the alignment. From this distance matrix, a phylogenetic tree can be computed, e.g. by UPGMA (Sneath & Sokal, 1973) or Neighbour Joining (Saitou & Nei, 1987). MSA is a basic tool when grouping proteins into functionally related families. From an MSA, it is easy to construct a profile or a consensus, which is a pseudo-sequence representing the complete alignment. The consensus or the profile in turn can be used to identify conserved regions or motifs. MSA can also be used when deciding whether a protein belongs to a family or not, by aligning the query sequence together with known proteins belonging to the family. MSA can also be used as a starting point for secondary or tertiary structure prediction. When aligned together with related sequences, the conserved motifs can give a hint of secondary or tertiary structure, since the structure is usually conserved during evolution, but the actual sequence is likely to vary in certain parts. Figure 2.1 illustrates this with the cysteines (printed yellow) aligned together. These amino acids influence the structure of a protein by forming disulfide bridges. Figures 2.1 and 2.2 show an example MSA of a family of related sequences together with a picture of the three-dimensional tertiary structure of one of the aligned proteins, hevein. As defined later, all sequences in the MSA are adjusted to the same length by inserting gaps (symbolised by dots in this example). Similar amino acids are aligned together in the same column and regions of high amino acid similarity can be identified. Most important are the columns of cysteines printed in black, since these parts of the proteins influence the tertiary structure by forming bonds inside the proteins. These bonds are printed green in figure 2.2.

Figure 2.1: Example Multiple Sequence Alignment of N-acetylglucosamine-binding proteins.
The alignment shows eight columns of cysteines, which form disulfide bridges and are an essential part of the protein's structure. The dots symbolise gaps. Taken from Reinert et al. (2003).

Figure 2.2: Structure of one of the proteins from figure 2.1. The disulfide bridges formed by the cysteines are printed in green. Taken from Reinert et al. (2003).

2.1.1 Definition

This section defines the domain, i.e. the MSA problem in formal terms, so that it can be used to describe the algorithm precisely. We first formalise the concept of a biological sequence.

Definition 2.1 (Alphabet and Sequence) Let Σ be an alphabet, i.e. a finite set of symbols, with Σ ≠ ∅. A sequence over the alphabet Σ is a finite string of symbols from Σ. The length of a sequence s, denoted |s|, is the number of symbols in the sequence. For a sequence s, s[i] denotes the ith symbol in this sequence.

In molecular biology, the alphabet Σ usually consists of the set of twenty-three amino acid symbols found in proteins, extended with a special symbol as a placeholder for any amino acid (usually denoted as "X" or "-"). We can now define the multiple alignment of a set of sequences. We first define a variation of multiple alignment where columns consisting completely of gap symbols are allowed, and restrict this in the final definition of multiple alignment.

Definition 2.2 (Pseudo Multiple Alignment) Let Σ be an alphabet as defined in definition 2.1, and S = {s1, ..., sk} a set of k sequences over this alphabet. A pseudo multiple alignment of S is a set of sequences S′ = {s′1, ..., s′k} over the alphabet Σ′ = Σ ∪ {−}, where "−" is the gap symbol, such that:

- All sequences in S′ have the same length: ∀ s′i, s′j ∈ S′ : |s′i| = |s′j|.
- Every s′i can be reduced to the corresponding sequence si ∈ S by removing all gap symbols from s′i.
- The order of symbols in every sequence s′i ∈ S′ is the same as in the corresponding sequence si.

The number of sequences in the alignment is |S′|; the number of columns is defined as the length of the sequences in S′.
PMSA(S) denotes the set of all possible pseudo multiple alignments of the sequences S.

The set of all possible pseudo alignments PMSA(S) is an infinite set, since the number of columns consisting only of gaps is not restricted, which also implies that there is no bound on the length of the sequences in a pseudo alignment. A Multiple Sequence Alignment is a pseudo alignment with the restriction that no column may consist only of gap characters.

Definition 2.3 (Multiple Sequence Alignment) Let n be the number of sequences in the alignment and k the number of columns. A set S′ of sequences is a multiple sequence alignment iff it is a pseudo alignment and fulfils the following condition: ∀ i ≤ k ∃ j ≤ n : s′j[i] ≠ −. MSA(S) denotes the set of all possible Multiple Sequence Alignments of the sequences S.

Despite the infinite set of pseudo alignments, the number of Multiple Sequence Alignments is restricted by the lengths and the number of the sequences, and is hence finite, as is the length of the alignment. When only two sequences are aligned, the MSA is called a pairwise alignment. Pairwise alignment is not discussed in this thesis, because pairwise alignments can be computed efficiently by dynamic programming (Needleman & Wunsch, 1970). In order to define an optimal multiple alignment, a quality measure must be defined which assigns a value to an alignment. The measure is called a scoring or Objective function (OF), because it is the mathematical goal to be optimised. The term OF will be preferred in this thesis because the MSA problem is tackled as a Combinatorial Optimisation (CO) problem.

Definition 2.4 (Scoring Function, Objective Function) Let Σ′ be defined as in definition 2.2. For a number of sequences n, a function

f : (Σ′*)ⁿ → R

is called an Objective function (OF). Σ′* is defined as the set of all words obtained by concatenation of symbols from the alphabet Σ′.
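The conditions of definitions 2.2 and 2.3 translate directly into a small validity check. The sketch below is illustrative only: it represents an alignment as a list of equal-length strings and assumes "-" as the gap symbol.

```python
GAP = "-"

def is_multiple_alignment(rows, sequences):
    """Check definitions 2.2 and 2.3 for a candidate alignment.

    rows      -- the aligned sequences s'_1, ..., s'_k (with gaps)
    sequences -- the original sequences s_1, ..., s_k (without gaps)
    """
    if len(rows) != len(sequences) or not rows:
        return False
    length = len(rows[0])
    # All aligned sequences must have the same length.
    if any(len(r) != length for r in rows):
        return False
    # Removing all gaps must recover the original sequences, preserving order.
    if any(r.replace(GAP, "") != s for r, s in zip(rows, sequences)):
        return False
    # No column may consist only of gap symbols (definition 2.3).
    return all(any(r[i] != GAP for r in rows) for i in range(length))
```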
Typical OFs are the (weighted) Sum of Pairs (SP) score (Altschul, 1989), which scores the MSA by scoring all pairs of amino acids induced by the columns of the alignment, and the Consistency based Objective Function For alignmEnt Evaluation (COFFEE) (Notredame et al., 1998), which compares the alignment to a library of reference alignments. OFs are discussed in more detail in section 2.1.2. Finally, an optimal alignment can now be defined. The term optimal can mean either a minimum or maximum value, depending on the objective function used. As we use a scoring function (as opposed to a cost function) in our alignment algorithm, where better alignments receive a higher score, we define optimality by maximisation.¹

Definition 2.5 (Optimal Multiple Sequence Alignment) For a set of sequences S = {s1, ..., sk} over an alphabet Σ and an OF f, S′ = {s′1, ..., s′k} is an optimal multiple sequence alignment iff:

- S′ ∈ MSA(S)
- ∀ S* ∈ MSA(S) : S* ≠ S′ → f(S*) ≤ f(S′)

The second point states that f(S′) is a maximum value of the function f on the set of all alignments over S. This definition includes the possibility of more than one optimal MSA.

¹ This doesn't cause any loss of generality, since a minimisation problem can easily be transformed into a maximisation problem.

2.1.2 Alignment Scores

To estimate the quality of an alignment, each alignment is assigned a value based on an OF, where a higher value corresponds to higher quality. The OF measures the quality of an alignment and should reflect how close this alignment is to the optimal biological alignment.

The most widely used scoring function is the Sum-of-Pairs (SP) score (Altschul, 1989), a direct extension of the scoring methods used in pairwise alignment. For each pair of sequences, the similarities between the aligned pairs of amino acids are computed, in most cases using a scoring matrix. The pairwise scores are then added for each aligned pair of residues in every pair of sequences.
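That description can be implemented directly. In the sketch below, the unweighted SP score sums a substitution-matrix entry over every residue pair induced by each column; the gap handling (gap-gap pairs ignored, residue-gap pairs given a flat penalty) and the penalty value are illustrative assumptions, not the exact scheme of any particular MSA program.

```python
from itertools import combinations

def sp_score(rows, substitution, gap_penalty=-4, gap="-"):
    """Unweighted sum-of-pairs score of an alignment given as a list of rows."""
    total = 0
    for a, b in combinations(rows, 2):      # every pair of aligned sequences
        for x, y in zip(a, b):              # every column of that induced pair
            if x == gap and y == gap:
                continue                    # gap-gap pairs contribute nothing
            elif x == gap or y == gap:
                total += gap_penalty        # residue aligned against a gap
            else:
                total += substitution[(x, y)]
    return total
```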
SP scores are used in many variations in MSA algorithms, e.g. using weights for the pairs of sequences to prevent sets of similar sequences from dominating the whole alignment, or different methods of assigning penalties to gaps in an alignment. The dependence on a scoring matrix is a major drawback of SP scores, since these scoring matrices are very general and created by statistical analysis of large sets of alignments. The degree to which they match the properties of the sequences one is to align varies. Morgenstern et al. (1998) argue that scoring matrices fail to produce biologically reasonable alignments when the sequences have a low global similarity and share only local similarities. Notredame (2002) identifies the scoring of gaps when using the SP score as a major source of concern.

COFFEE is a consistency-based approach to scoring an MSA. Instead of measuring the similarity of each pair of amino acids induced by an alignment, COFFEE (Notredame et al., 1998) measures the correspondence between each induced pairwise alignment and a library of computed pairwise alignments. Each pair in the library is assigned a weight that is a function of its quality; the weight is equal to the percentage identity of the aligned sequences. The score of an alignment is computed from the scores of all pairs of induced alignments, i.e. all pairs of aligned sequences in this multiple alignment. For each pair of residues aligned in both the library and the induced alignment, the weight is added to the score.
To normalise the score, it is divided by the maximal value, which is attained when all pairs in the multiple alignment are found in the library alignments.² Formalised, the COFFEE score for an alignment A can be written as:

Definition 2.6 (COFFEE score for an MSA)

COFFEE(A) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot SCORE(A_{i,j})}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot LEN(A_{i,j})}

with:
- N the number of sequences in the alignment,
- A_{i,j} the pairwise alignment of sequences A_i and A_j induced by A,
- W_{i,j} the weight associated with that alignment,
- LEN(A_{i,j}) the length of this alignment, and
- SCORE(A_{i,j}) the number of aligned pairs of residues found in both A_{i,j} and the library alignment.

² For the optimal score to be 1, the alignments in the library must be totally consistent, such that each residue is aligned to the same residue in every pairwise alignment.

Using a library of alignments computed from the set of sequences under analysis prevents the scoring scheme from being too general and takes the properties of the sequences into account. Notredame et al. (1998) used a library of pairwise alignments created by ClustalW, but Notredame et al. (2000) suggest a library containing both global alignments from ClustalW and local alignments, and showed that this actually leads to an increase in the quality of the computed alignments. When considering COFFEE as the OF for a local search, it is a great benefit that, when performing one step, one can restrict the computational effort to those parts which have been changed by the procedure, e.g. to the two columns where a gap and a residue have been exchanged. One problem of both COFFEE and SP scoring is their assumption of uniform and time-invariant substitution probabilities for all positions in an alignment. The substitution probability depends on structural and functional properties of the proteins and can vary from zero for some regions to complete variability in others.
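Definition 2.6 can be turned into a small computable sketch. The representation below is an assumption made for illustration: the library stores, for each pair of sequences, the set of residue-index pairs that its pairwise alignment aligns, and LEN is taken as the number of non-gap-gap columns of the induced pairwise alignment.

```python
from itertools import combinations

def induced_pairs(a, b, gap="-"):
    """Residue-index pairs and length of the pairwise alignment induced by rows a, b."""
    pairs, pi, pj, length = set(), 0, 0, 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue                # gap-gap columns vanish in the induced alignment
        length += 1
        if x != gap and y != gap:
            pairs.add((pi, pj))     # residue pi of a aligned with residue pj of b
        if x != gap:
            pi += 1
        if y != gap:
            pj += 1
    return pairs, length

def coffee_score(msa, library, weights, gap="-"):
    """COFFEE score: library-consistent pairs over total length, both weighted."""
    num = den = 0.0
    for i, j in combinations(range(len(msa)), 2):
        pairs, length = induced_pairs(msa[i], msa[j], gap)
        num += weights[(i, j)] * len(pairs & library[(i, j)])
        den += weights[(i, j)] * length
    return num / den
```

Because each sequence pair contributes independently, a local search step that touches only a few columns can re-score just the affected pairs, which is the efficiency benefit mentioned above.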
norMD (Thompson et al., 2001) tries to reflect this position dependence by using a column-based approach, where the probabilities of substitutions, and thus the scores for residue pairs, are column-dependent. The norMD score is computed by measuring mean distances in a continuous space imposed by each sequence in the alignment. norMD incorporates affine gap costs and a similarity measure for the sequences which is based on a hash score derived from dot plots of each pair of sequences. Finally, the score is normalised to a value between zero and one. The strength of this method is its independence from the number and length of the sequences in the alignment, and the assignment of column-based scores which can be used to find well or badly aligned regions inside an alignment. A major drawback which prohibits the usage of norMD as a scoring function for the ILS is its complexity. Due to the handling of gap costs and the computation of the hash scores during scoring, the whole alignment has to be processed completely, which leads to very large run-times. Taking this into account, COFFEE is preferred as the OF for this project.

2.1.3 Complexity

The alignment of only two sequences is a common special case of MSA and efficient algorithms have been proposed, e.g. by Needleman & Wunsch (1970) or Gotoh (1982). These algorithms use dynamic programming to efficiently compute all possible alignments; their asymptotic complexity is O(l²), where l is the length of the longest sequence, also needing O(l²) space (the space complexity can be reduced to linear by a method proposed by Hirschberg (1975)). Generalising the special case to more than two sequences is easily possible, but has the drawback of exponential growth of complexity with the number of sequences: the complexity of dynamic programming with n sequences of maximum length l is O(lⁿ). Considering this, only small sets of sequences can be aligned with dynamic programming.
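To make the O(l²) bound concrete, here is a minimal sketch of the Needleman-Wunsch recurrence with linear gap costs; the match, mismatch and gap values are arbitrary illustrative choices, and only the optimal score (not the traceback) is computed.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global pairwise alignment score in O(len(a) * len(b)) time and space."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap               # prefix of a aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap               # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            table[i][j] = max(diag,
                              table[i-1][j] + gap,   # gap in b
                              table[i][j-1] + gap)   # gap in a
    return table[-1][-1]
```

Extending the same table to n sequences requires one dimension per sequence, which is exactly the O(lⁿ) blow-up discussed above.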
Computing alignments for more than two sequences has been shown to be NP-hard by Wang & Jiang (1996) for metric scoring matrices and by Just (1999) for a more general class of scoring matrices, which gives strong reasons to believe that there will be no efficient algorithm to solve this problem exactly (unless P = NP). Even if Just's proof cannot be directly transferred to MSA with COFFEE as OF, it gives strong reason to believe that computing optimal COFFEE alignments is computationally hard and needs advanced techniques.

2.2 Metaheuristics and Combinatorial Optimisation

This section explains and defines two key terms of this thesis, namely Metaheuristic and Combinatorial Optimisation (CO).

2.2.1 Combinatorial Optimisation

The problem of optimisation can generally be formulated as optimising a function f(X), where X is a set of variables.³ Usually the possible set of values is restricted by constraints, expressed by inequalities such as g(X) ≤ c, c ∈ R, where g(X) is a general function. Optimisation problems can be divided into two categories: those where the variables are from a continuous domain, and those with a discrete domain; the latter is called combinatorial. According to Papadimitriou & Steiglitz (1982), CO is the search for an object from a finite (or countably infinite) set, typically an integer, set, permutation or graph, whereas in the continuous case, the solutions are usually a set of real numbers or even a function. Both cases are tackled with very different methods, and we focus only on CO problems. Blum & Roli (2003) define a CO problem as follows:

Definition 2.7 (Combinatorial Optimisation problem) A CO problem P = (S, f) is defined by:
- a set of variables X = {x1, ..., xn}
- variable domains D1, ..., Dn
- constraints among variables
- an objective function f to be optimised, where f : D1 × ... × Dn → R

³ In this context, "optimising" must be seen in a very broad view as the search for a best assignment of values to the variables in X.
S is the set of all feasible assignments to the variables:

S = { s = {(x1, v1), ..., (xn, vn)} | vi ∈ Di, s satisfies all constraints }

S is usually called the search space or solution space, and the elements of S are feasible solutions (also called candidate solutions). To conform with our definition of optimal MSA (see definition 2.5), we again focus on maximisation problems solely. To solve an optimisation problem means to find the solution s* ∈ S with maximum value, that is f(s*) ≥ f(s) ∀s ∈ S. The solution s* is called a globally optimal solution. If there is more than one globally optimal solution, the set S* ⊆ S will be called the set of globally optimal solutions.

2.2.2 Heuristic algorithms

When tackling CO problems, one generally distinguishes between complete and approximate algorithms. Complete algorithms are guaranteed to find an optimal solution for every instance of a CO problem in finite time. Among the most widely used are the Branch & Bound algorithms, which are also called A* in AI. To guarantee optimal solutions is often not possible, since many CO problems are NP-hard, which implies that no polynomial time algorithms exist if we assume P ≠ NP. In approximate algorithms, the guarantee of finding optimal solutions is relaxed to finding good solutions close to the optimal ones. Approximate algorithms are often called heuristic algorithms, especially in Artificial Intelligence and Operations Research. The word heuristic has Greek roots, with to find or to discover as its original meaning. Reeves (1993) defines a heuristic as:

Definition 2.8 (Heuristic) A heuristic is a technique seeking good (i. e. near-optimal) solutions at a reasonable computational cost without being able to guarantee either feasibility or optimality, or even in many cases to state how close to optimality a particular feasible solution is.
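To make definition 2.7 concrete, the following sketch enumerates the feasible set S of a tiny, purely illustrative CO instance and picks a globally optimal solution. Exhaustive enumeration is the simplest complete algorithm, and its cost grows exponentially with the number of variables, which is why heuristics are needed at all.

```python
from itertools import product

# Toy CO instance in the sense of Definition 2.7 (illustrative values):
# variables x1, x2 with domains {0, 1, 2}, constraint x1 + x2 <= 3,
# objective f(x1, x2) = x1 * x2 to be maximised.

def solve_by_enumeration(domains, feasible, f):
    # S: all assignments satisfying the constraints
    S = [s for s in product(*domains) if feasible(s)]
    return max(S, key=f)        # a globally optimal solution s*

best = solve_by_enumeration(
    [range(3), range(3)],
    feasible=lambda s: s[0] + s[1] <= 3,
    f=lambda s: s[0] * s[1],
)
```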
Heuristics usually incorporate problem specific knowledge beyond the knowledge found in the problem definition to reach a good trade-off between solution quality and computational complexity. The following presentation of heuristic methods follows those by Blum & Roli (2003) and Stützle (1999a). When considering heuristic methods, Stützle (1999a) distinguishes between constructive and local search methods. Constructive methods generate solutions from an initially empty solution by adding components until a complete solution is reached; most prominent of these is the greedy approach, which always adds the component which gives the best improvement in terms of the OF. Constructive methods are usually among the fastest approximation algorithms found, but the computed solutions are of inferior quality compared to local search methods. Many MSA algorithms are constructive algorithms, among them ClustalW, PileUp and MultAlign. Local search methods instead perform an iterative search. They start from an initial solution and try to improve the current solution by modifying it to a solution with a better score. More formally, a local search exchanges the current solution with a better solution which can be computed from the current solution by applying a modification operator to it. The set of all solutions computable from the current solution s is called the neighbourhood of the current solution s. The neighbourhood structure is often visualised as a graph on top of the set of all solutions, where an edge between two solutions indicates that these two solutions are neighboured, i. e. that one solution can be computed from the other by using the modification operator on it. The neighbourhood is defined according to Blum & Roli (2003); Stützle (1999a) as:

Definition 2.9 (Neighbourhood) A neighbourhood structure is a function N : S → 2^S that assigns to every s ∈ S a set of neighbours N(s) ⊆ S. N(s) is called the neighbourhood of s.

The iterative process goes on until no better solution in the neighbourhood of the current solution can be found. This final solution of the local search is called a locally maximal solution:

Definition 2.10 (Local Optimum) A locally maximal solution (or local maximum) w. r. t. a neighbourhood structure N is a solution s such that ∀s' ∈ N(s): f(s') ≤ f(s). s is called a strict local maximum iff ∀s' ∈ N(s): f(s') < f(s).

A neighbourhood where every local optimum is also a global optimum is called an exact neighbourhood. Exact neighbourhoods often have exponential size, so that searching the neighbourhood requires exponential time in the worst case, which makes them unusable in practise.

2.2.3 Metaheuristics

One of the main features of heuristics as described in the last section is that they incorporate problem specific knowledge. This makes them specific for the problem and prohibits the transfer of one successful heuristic from one problem to another. In recent years, much effort was spent on the development of general heuristic methods which are applicable to a wide range of CO problems. These general-purpose methods are now called metaheuristics (Glover, 1986) (some earlier publications use the term "modern heuristics", e. g. Reeves (1993)). The Metaheuristics Network defines metaheuristic as:

Definition 2.11 (Metaheuristic) A metaheuristic is a set of concepts that can be used to define heuristic methods that can be applied to a wide set of different problems. In other words, a metaheuristic can be seen as a general algorithmic framework which can be applied to different optimization problems with relatively few modifications to make them adapted to a specific problem.⁴

⁴ Metaheuristics Network web site: http://www.metaheuristics.net, last accessed Sun, 25th April 2004

(Blum & Roli, 2003, p.
270f) compare many definitions of metaheuristics and summarise the main properties which characterise a metaheuristic as:
- Metaheuristics are strategies that "guide" the search process.
- The goal is to efficiently explore the search space in order to find (near-)optimal solutions.
- Techniques which constitute metaheuristic algorithms range from simple local search procedures to complex learning processes.
- Metaheuristic algorithms are approximate and usually non-deterministic.
- They may incorporate mechanisms to avoid getting trapped in confined areas of the search space.
- The basic concepts of metaheuristics permit an abstract level description.
- Metaheuristics are not problem specific.
- Metaheuristics may make use of domain-specific knowledge in the form of heuristics that are controlled by the upper level strategy.
- Today's more advanced metaheuristics use search experience (embodied in some form of memory) to guide the search.

Many metaheuristics have been proposed during the last years and are well known today, e. g. Simulated Annealing (SA) or Genetic Algorithms (GA). Blum & Roli (2003) give an overview of the most successful metaheuristics and show several ways of categorising them, depending on which aspect of the algorithm is in focus. Metaheuristics differ in the number of feasible solutions examined in each step (population-based vs. single point search), in their use of the objective function (dynamic vs. static objective function), in the number of neighbourhood structures used during the search (one vs. various neighbourhoods), in the usage of memory (memory usage vs. memory-less methods) and in the domain from which the underlying idea stems (nature-inspired vs. non-nature inspired). Common metaheuristics are (in alphabetical order): ant algorithms, evolutionary algorithms, Greedy Randomised Adaptive Search Procedure (GRASP), Guided Local Search, Iterated Local Search (ILS), Simulated Annealing (SA), Tabu Search (TS) and Variable Neighbourhood Search (VNS).
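Most of these metaheuristics build on the simple iterative improvement local search of section 2.2.2. Under illustrative assumptions (an explicit neighbourhood function and a toy objective), that basic building block can be sketched as:

```python
# Strict iterative improvement (hill climbing) over a neighbourhood
# function N, stopping at a local maximum in the sense of Definition 2.10.
# All names and the toy problem below are illustrative.

def local_search(s, neighbours, f):
    improved = True
    while improved:
        improved = False
        for s2 in neighbours(s):
            if f(s2) > f(s):          # strict improvement step
                s, improved = s2, True
                break
    return s                          # no neighbour is better: local maximum

# Toy example: maximise f(x) = -(x - 5)^2 over the integers,
# with neighbourhood N(x) = {x - 1, x + 1}.
result = local_search(0, lambda x: [x - 1, x + 1], lambda x: -(x - 5) ** 2)
```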
Section 2.4 on page 18 shows examples of metaheuristics for MSA.

2.3 Iterated Local Search

A major problem with local search algorithms is the probability of getting trapped in a local optimum. One possibility to prevent this is to let the local search accept worsening steps, as done in SA or Tabu Search (TS). Another straightforward way is to modify the current solution and start a new local search from this modified solution. The modification should be done in such a way that it leads to a solution distant enough that the local search cannot easily undo the changes and return to the local optimum.

[Figure 2.3: Pictorial view of ILS (objective value plotted over the solution space S). After being trapped in the local optimum s, the current solution is modified, which leads to a new point in the search space, marked with s'. The next run of the local search leads to the new local optimum s'', which in turn can be modified again.]

ILS (Lourenco et al., 2002; Stützle, 1999a, p. 25) is a systematic application of this idea, where an embedded heuristic search is applied repeatedly to a modified solution computed before. This results in a random walk of solutions locally optimal w. r. t. the embedded heuristic. ILS is a very general framework which includes other metaheuristics such as iterated Lin-Kernighan, variable neighbourhood search or large-step Markov chains. ILS is expected to perform much better than simply restarting the embedded heuristic from a randomly computed position in the search space (Lourenco et al., 2002). Lourenco et al. (2002) point out that random restarts of local search algorithms have a distribution that becomes arbitrarily peaked about the mean when the instance size goes to infinity. Most local optima found have a value close to the mean local optimum, and only few optima are found with significantly better values.
The probability to find a local optimum with a value close to the mean local optimum found is very large, whereas the probability to find a significantly better local optimum converges to zero with increasing instance size. They conclude that it is very unlikely to find a solution even slightly better than the mean local optimum for large problem instances and that a biased sampling of the space of local optima could lead to better solutions. ILS uses the idea of a random walk in the space of local optima. From the current solution s, a new starting point s' for the local search is computed by modifying s using a perturbation procedure (s' is usually not a local optimum, but belongs to the set S of feasible solutions). The embedded heuristic is then run on s', which leads to a new local optimum s''. If s'' passes an acceptance test, e. g. yields an improvement compared to the starting point s, it becomes the next element of the walk, otherwise s'' is discarded and the walk restarts with s. Figure 2.3 shows one step of the random walk. The final solution of the algorithm is the best solution found during this walk.

Algorithm 2.1 An algorithmic presentation of ILS.
procedure IteratedLocalSearch
    s0 ← GenerateInitialSolution
    s ← LocalSearch(s0)
    s_best ← s
    while termination condition not met do
        s' ← Perturbation(s, history)
        s'' ← EmbeddedHeuristic(s')
        if f(s'') > f(s_best) then
            s_best ← s''
        end if
        s ← AcceptanceCriterion(s, s'', history)
    end while
    return s_best
end procedure

This process is formalised in algorithm 2.1. The algorithm consists of the components GenerateInitialSolution, LocalSearch, Perturbation, AcceptanceCriterion and the termination condition. In order to reach high performance and solution quality, Perturbation and AcceptanceCriterion can make use of the history of the search, e. g. by emphasising diversification or intensification based on the period of time passed since the last improvement was found.
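Under illustrative assumptions (all components passed in as functions, a fixed iteration count as termination condition, and the same toy objective as before), algorithm 2.1 can be sketched as:

```python
import random

# Generic ILS skeleton following Algorithm 2.1. This is a sketch, not the
# thesis implementation; all component names are parameters.

def iterated_local_search(generate_initial, local_search, perturb, accept,
                          f, iterations):
    s = local_search(generate_initial())
    s_best = s
    for _ in range(iterations):       # termination: fixed iteration count
        s2 = perturb(s)               # new starting point near s
        s2 = local_search(s2)         # embedded heuristic -> new local optimum
        if f(s2) > f(s_best):
            s_best = s2
        s = accept(s, s2)             # e.g. BETTER: keep the better of the two
    return s_best

# Toy usage: maximise f(x) = -(x - 5)^2 over the integers.
f = lambda x: -(x - 5) ** 2

def ls(x):                            # trivial hill climber for the toy problem
    while f(x + 1) > f(x):
        x += 1
    while f(x - 1) > f(x):
        x -= 1
    return x

best = iterated_local_search(
    generate_initial=lambda: 0,
    local_search=ls,
    perturb=lambda x: x + random.randint(-3, 3),
    accept=lambda s, s2: s2 if f(s2) > f(s) else s,   # BETTER criterion
    f=f, iterations=20)
```

The BETTER criterion appears here as the accept parameter; swapping in another acceptance function changes the balance between intensification and diversification without touching the rest of the loop.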
The acceptance criterion acts as a counterbalance to the perturbation procedure by filtering the new local optima. The following paragraphs describe each component of the algorithm.

2.3.1 GenerateInitialSolution

GenerateInitialSolution is a construction heuristic which computes the starting point for the first run of the embedded heuristic. The construction heuristic used should be as fast as possible but produce good starting points for the embedded heuristic; standard choices are either random solutions or greedy heuristics. In most cases, greedy heuristics are preferred since the embedded heuristic reaches a local optimum in fewer steps than when starting from random solutions. It is worth noting that the best greedy heuristic may not provide the best starting point for a local search. Lourenco et al. (2002) give an example where using one of the best construction heuristics known for the Travelling Salesman Problem (TSP) leads to worse results than using a random starting solution as input for a local search procedure. Another point to consider is the time a greedy heuristic needs. The influence of GenerateInitialSolution should decrease with the run-time of the ILS. Short runs depend heavily on good starting points, whereas in the long run, the starting point gets less important.

2.3.2 EmbeddedHeuristic

In most cases, the embedded heuristic is an iterative improvement local search procedure, but any other heuristic can be used, since the embedded heuristic can mostly be treated as a black box. Lourenco & Zwijnenburg (1996) used a TS in their application to the job-shop scheduling problem. However, there are points to consider when optimising the performance of an ILS. The local search should be selected with respect to the perturbation used, where it should not be able to easily undo the changes of the perturbation. A trade-off between speed and solution quality of the embedded heuristic must be found.
Usually a heuristic with better results is preferable, but if the heuristic needs too much time, the ILS is unable to examine enough local optima to find high-quality solutions. Finally, one has to consider the interrelation between the embedded heuristic and the perturbation when optimising an ILS. In many cases, the embedded heuristic uses some tricks to prune the search space and speed up the search process, e. g. by using don't look bits (Bentley, 1992) in TSP algorithms (Stützle, 1999a, p. 78). The highest performance is reached when these tricks are implemented w. r. t. the perturbation procedure. In the example of don't look bits for TSP, the don't look bits are saved for the next run of the local search and only reset for the cities where the perturbation induced a change. This gains a large advantage in the performance of the ILS.

2.3.3 Perturbation

The perturbation procedure modifies a local optimum to become a new starting point for the next heuristic search. Perturbations are often indeterministic to avoid cycling. A general aspect in choosing perturbations is that they must not be easily undone by the embedded heuristic, since the search would then fall back to the last visited local optimum. To find a good perturbation, properties of the problem instance can be considered as well as problem specific knowledge. Most perturbations are rather simple, e. g. in TSP algorithms, small "double-bridge moves" are frequently used, which remove four edges and add four new edges. While in the case of TSP a simple perturbation procedure leads to good results, other problems might need more complex procedures, e. g. optimising a part of the current instance or solving a related instance with relaxed constraints. The most important property of the perturbation is its strength. Perturbation strength can be informally defined as the number of components in a solution that get changed by a perturbation.
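The double-bridge move mentioned above can be sketched as follows. The tour representation (a list of city indices) is an illustrative assumption, not tied to any particular TSP code.

```python
import random

# Double-bridge perturbation for a TSP tour: cut the tour at three random
# points and reconnect the four segments A, B, C, D as A + C + B + D.
# This removes four edges and adds four new ones, and a 2-opt style local
# search cannot easily undo it in a single step.

def double_bridge(tour):
    n = len(tour)                                  # needs n >= 4
    i, j, k = sorted(random.sample(range(1, n), 3))
    # segments: A = tour[:i], B = tour[i:j], C = tour[j:k], D = tour[k:]
    return tour[:i] + tour[j:k] + tour[i:j] + tour[k:]
```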
Since the number of changed components provides a measure for the distance between two solutions, the perturbation strength can be seen as the distance of the perturbed solution to the original solution. The strength of the perturbation should be large enough to search efficiently in the space of local optima, but small enough to prevent random restart behaviour. The size of the perturbation depends heavily on the problem and on the particular instance. The strength of a perturbation can be fixed, where the distance between s and s' is constant, or adaptive, where the strength changes over time, e. g. gets larger when no improvements were found, to emphasise diversification. To adapt the perturbation strength, the search history can be used.

2.3.4 Acceptance criterion

The acceptance criterion determines whether a solution from the embedded heuristic is used as a starting point for the next iteration of ILS. The acceptance criterion provides a balance between intensification and diversification of the search. The two simplest cases are a criterion which accepts only improvements, and thus emphasises intensification, and the opposite criterion, which accepts every new solution and provides a strong diversification. The first criterion is defined as BETTER (algorithm 2.2), the second one as ALWAYS.

Algorithm 2.2 BETTER acceptance criterion
if f(s'') > f(s) then
    s ← s''
end if

In between these two extremes, there are several possibilities for acceptance criteria. It is possible to prefer improvements, but sometimes accept non-improving steps, e. g. by using a criterion similar to the one used in SA. Other choices could make use of the search history, similar to some implementations of TS. A limited use of the history is to restart the ILS algorithm from a new initial solution when it seems that no improvement will be found in the current region of the search space.
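One way to realise such a history-based restart is to keep the acceptance criterion stateful, so that it can track when the last improvement happened. The following is an illustrative sketch (the names, and holding the state in a small class, are assumptions of this sketch):

```python
# Stateful acceptance criterion: behave like BETTER, but restart from a
# freshly generated solution after i_t consecutive non-improving iterations.
# All names are illustrative.

class RestartCriterion:
    def __init__(self, generate_initial, i_t):
        self.generate_initial = generate_initial
        self.i_t = i_t          # tolerated iterations without improvement
        self.i = 0              # current iteration counter
        self.i_last = 0         # iteration of the last improving step

    def accept(self, s, s2, f):
        self.i += 1
        if f(s2) > f(s):        # BETTER part: accept the improvement
            self.i_last = self.i
            return s2
        if self.i - self.i_last > self.i_t:
            self.i_last = self.i
            return self.generate_initial()   # restart the walk
        return s
```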
A simple way would be to restart if the search has not been able to traverse to a better local optimum for a certain number of steps. The RESTART criterion implements this approach: let i be the current number of iterations of the ILS and i_last the iteration where the last improving step was made. The RESTART acceptance criterion is defined in algorithm 2.3.

Algorithm 2.3 RESTART acceptance criterion
if f(s'') > f(s) then
    s ← s''
else if f(s'') ≤ f(s) ∧ i − i_last > i_T then
    s ← GenerateInitialSolution
end if

It conforms to the BETTER criterion by accepting a local optimum as starting point for the next iteration only if it is better than the previous starting point. Furthermore, if there has been no improvement for a number i_T of iterations, the search is restarted with a newly generated solution. The new starting point can be generated in different ways. The simplest way is to use the construction heuristic to generate a completely new initial solution, but other approaches considering the search history or the former starting point are also possible.

2.3.5 Termination condition

The termination condition ends the algorithm. There are no restrictions on how to choose the condition. Common choices include a fixed number of iterations, a number of iterations depending on some properties of the problem instance, a fixed amount of consumed run-time or (if possible) an estimation of the solution quality.

2.4 Related Work

This chapter presents previously published algorithms and summarises the most important results found during the literature study done in the initial phase of this project. We start with a classification of MSA algorithms, briefly describing the most prominent algorithms for each class. Applications of Metaheuristics that were found are described in detail later in this chapter. Notredame (2002) reviews most available algorithms and classifies them into four classes:

Progressive algorithms add sequences in a step-by-step manner according to a pre-computed order.
They use an extension of the dynamic programming algorithm used in pairwise alignment (Needleman & Wunsch, 1970) to add a sequence to an alignment. This class includes the most widely used algorithms: ClustalW, which can be seen as a standard method for multiple alignments, PileUp and MultAlign. All progressive algorithms start by choosing an order in which the sequences will be added to the alignment, mostly implemented by computing an approximative phylogenetic tree from the similarities of the sequences. The sequences are then added according to this order. This process is similar to the construction heuristic used in ILS. The drawbacks of progressive algorithms are the inability to correct gaps placed in early steps of the alignment, the domination of a large set of similar sequences and the imposed order on the sequences. The most prominent algorithm for MSA, ClustalW (Thompson et al., 1994), aligns the sequences according to a guidance tree computed by neighbour joining. It uses many heuristic adjustments to compensate for the drawbacks of pure progressive strategies, among them automatic choice of substitution matrices for different degrees of sequence identity, local gap penalties and gap penalty adjustment.

Exact algorithms rely on a generalisation of the algorithm for pairwise alignment to more than two sequences. The original pairwise alignment algorithm (Needleman & Wunsch, 1970) uses dynamic programming to find an optimal path in a two-dimensional lattice which corresponds to the optimal pairwise alignment. By extending this scheme to an n-dimensional space, n sequences can be optimally aligned. Given that the time needed to compute the optimal alignment is O(lⁿ) for n sequences of maximal length l, this approach is limited to small sets of short sequences. This approach is implemented in the MSA program (Lipman et al., 1989), which in turn is a heuristic implementation of an algorithm proposed by Carrillo & Lipman (1988).
DCA (Stoye et al., 1997) increases the number of sequences which can be aligned by MSA by using a divide-and-conquer approach to cut the sequences into smaller fractions, which in turn can be aligned with MSA and later reassembled. Still, the maximum number of sequences which can be handled remains relatively small.

Iterative algorithms are based on the idea that a solution can be computed by starting from a sub-optimal solution which is then modified until it reaches the optimum. Iterative algorithms can be further classified into stochastic and deterministic algorithms. The first subclass includes Hidden Markov Models, GA, SA, TS and the ILS presented in this thesis. Most algorithms in the second subclass are based on dynamic programming. SAGA (a Genetic Algorithm), SA and TS are described in detail at the end of this chapter. One of the best MSA algorithms available, Prrp, uses a deterministic iterative strategy. Prrp uses a doubly nested loop strategy to simultaneously optimise weights assigned to the pairs of sequences and the weighted SP-score. The inner loop re-aligns subgroups inside the alignment until no further improvement can be made. The outer loop in turn computes approximative weights for all pairs of sequences based on the current alignment. The iteration ends when the weights have converged.

Consistency based algorithms try to produce a MSA that corresponds most closely with all pairwise alignments. By using the pairwise alignments, consistency based algorithms are independent of a specific scoring matrix and make use of position dependent scores. This bypasses many of the problems of progressive algorithms and resembles the enhancements of ClustalW as an inherent feature of the algorithm. Since this class includes all algorithms which make use of a consistency based objective function, it includes SAGA, T-COFFEE, DiAlign and the ILS. T-COFFEE is a progressive algorithm using the COFFEE function instead of a substitution matrix when computing the alignment.
It uses an extended library of global and local alignments, which are merged into a scoring matrix by a heuristic method called library extension. The library extension process assigns a weight to each pair of residues present in the library alignments. The weight is larger for pairs in an alignment which are consistent with the other library alignments. DiAlign is based on gap-free segments of residue pairs in the pairwise alignments induced by the sequences inside a MSA. It starts with computing all pairwise alignments and identifying all diagonals (diagonals represent matches of residues, and hence a diagonal is a gap-free segment). The diagonals are then sorted according to a probabilistic weight and their overlap with other diagonals. The latter property should emphasise motifs occurring in more than two sequences. The MSA is formed from this list by adding the first consistent diagonal not added yet. When the list is completely processed, the alignment is completed by connecting the diagonals by introducing gaps into the alignment.

Name         Classification                  Reference
MSA          Exact                           Lipman et al. (1989)
DCA          Exact                           Stoye et al. (1997)
ClustalW     Progressive                     Thompson et al. (1994)
Multalin     Progressive                     Corpet (1988)
DiAlign      Consistency-based               Morgenstern et al. (1998)
DiAlign2     Consistency-based               Morgenstern (1999)
T-COFFEE     Consistency-based/Progressive   Notredame et al. (2000)
Prrp         Iterative                       Gotoh (1996)
SAGA         Iterative                       Notredame & Higgins (1996)
Tabu Search  Iterative                       Riaz et al. (2004)

Table 2.1: MSA algorithms and their classification, taken from Notredame (2002). Please refer to the text for a short description of the algorithms and the classifications.

Table 2.1 shows the most popular algorithms together with classification and references. The remainder of this chapter briefly describes already implemented Metaheuristics in more detail. More detailed information can be found in the references.
2.4.1 Simulated Annealing

The first modern Metaheuristic applied to MSA is Simulated Annealing (Kim et al., 1994), which was used to optimise the SP objective function with affine and natural gap costs. The neighbourhood structure used is built up from a move operation which swaps a block of consecutive gaps with an adjacent block of symbols in one of the sequences. All parameters for the swap operation (the size and position of the gap block, whether symbols to the left or right are swapped, and how many symbols are swapped) are determined randomly. No gaps are inserted or deleted during the search. SA tries to explore this neighbourhood structure by iteratively traversing from one solution to a better scoring solution, but also accepting steps to solutions with lower scores with a certain probability. This gives SA the possibility to escape local optima. SA is one of the few Metaheuristics with mathematically proven convergence to an optimal solution. In practise, SA is too slow to be used as a complete alignment algorithm, but it can be used as an alignment improver to correct and improve alignments computed by other methods.

2.4.2 SAGA

SAGA (Sequence Alignment by Genetic Algorithm) is a GA for MSA proposed by Notredame & Higgins (1996). It was used to optimise alignments for the weighted SP score with different gap penalties, and later extended to use the COFFEE objective function as well (Notredame et al., 1998). SAGA starts with a population of random alignments generated by adding a random number of gaps to the start and end of each sequence, ensuring that all sequences have the same length afterwards. To compute the next generation, 50% of the new population is filled with the 50% fittest alignments from the previous generation (the fitness value corresponds to the score of the alignment).
The remaining 50% are created by modifying randomly chosen alignments from the previous generation, where the probability of being chosen depends on the fitness value. SAGA uses 22 different operators to modify an alignment: two crossover operators which generate a new alignment from two parents, two operators which insert blocks of gaps into the alignment, a family of sixteen operators shifting blocks of gaps inside the alignment to new positions, an operator which tries to find small profiles by finding the best small gap-free substring of a sequence in all sequences, and an operator optimising a pattern of gaps in a small region of the alignment by either exhaustive search or GA, depending on the size of the region and the number of gaps to be placed. Which operator is applied is chosen stochastically, with a probability depending on the quality of the offspring an operator has generated in the last ten iterations. SAGA is among the best algorithms for MSA regarding alignment quality, but needs a very long time to produce good quality alignments, possibly due to using randomly generated alignments as the initial population.

2.4.3 Tabu Search

Tabu Search (TS) (Glover, 1986; Glover & Laguna, 1997) is a general Metaheuristic which explores a neighbourhood by traversing from one solution to the neighbour with the best score, even if this neighbour has a lower score. To prevent cycling, TS uses a form of memory to declare recent moves as forbidden for a number of iterations. In some cases, a move to a solution declared tabu may actually lead to an improvement. In this case, aspiration criteria are used which allow forbidden moves under some circumstances, e. g. if the new solution would be better than all previously found solutions. The memory can also be used to balance between diversification and intensification. Riaz et al. (2004) implemented a TS for MSA, using the COFFEE objective function.
They propose two different move strategies: a single block move, which is the same move as used in the BlockMove local search in ILS, and a move strategy moving a rectangle of gaps which spans more than one sequence to a new position. The second move strategy is implemented in their TS algorithm since it has been found to give shorter runtimes than performing a series of block moves. To balance between intensification and diversification, Riaz et al. (2004) divide the alignment into well- and badly-aligned regions and focus the TS on a badly aligned region until it has been improved enough to no longer be classified as badly aligned. The TS is then focused on the next badly aligned region until no further badly aligned regions are found. TS was tested on the BAliBASE database (version 1.0), and was found to be comparable to other algorithms in terms of solution quality, but it needs a very long time to produce good alignments. It is pointed out that the initial alignment from which the TS is started greatly influences the solution quality when optimising larger instances.

3 Algorithm

This chapter describes the ILS algorithm for MSA and motivates the decisions made when designing the components of ILS (section 2.3). For each of the components, several alternatives will be proposed and compared to find the best alternative to use in the ILS. The results are all presented in chapter 5. The decisions made are motivated by preliminary tests on subsets of the BAliBASE database.

3.1 Construction heuristic

Three variants of construction heuristics have been implemented and evaluated to see if they can serve as a good starting point for a local search. The first heuristic, Random, builds the initial alignment by randomly inserting a fixed percentage of gaps into the sequences, filling the shorter sequences with additional gaps. The second heuristic, AlignRandom, generates the initial alignment by successively aligning a sequence to an alignment in random order.
It starts by aligning two sequences, and adds the next sequence to this alignment until all sequences have been added. The alignment method is similar to standard dynamic programming and is described by Gotoh (1993).[1] The third heuristic, NeighbourJoining, is a greedy heuristic which aligns the sequences step by step according to a guide tree, which is computed using the neighbour joining method (Saitou & Nei, 1987); the alignment is done with the same procedure as in the second heuristic. Neighbour joining is implemented by the Quickjoin library, which uses a heuristic speed-up technique to implement the algorithm as described by Saitou (Brodal et al., 2003).[2] The heuristics will be referred to as Random, AlignRandom and NeighbourJoining in the following.

Table 5.1 on page 41 compares the quality of the alignments computed by each construction heuristic for both the COFFEE score and the BAliBASE Sum of Pairs Score (BSPS). NeighbourJoining clearly achieved the highest scores in both cases. Considering that the runtime for NeighbourJoining and AlignRandom is nearly the same and the runtime of the whole ILS benefits from a starting solution near a local optimum, NeighbourJoining is selected as the construction heuristic used in the implementation. This decision is further motivated by the assumption that good solutions are clustered together in the search space, which is one reason for the good performance of ILS.

[1] Gotoh presents four algorithms, differing in estimated runtime and quality of the alignments produced. In this implementation, Algorithm A is used.
[2] Quickjoin is available from http://www.daimi.au.dk/mailund/quick-join.html

The neighbour joining heuristic has the drawback of being deterministic, so that each search starts from the same initial solution, which makes it unsuited for the RESTART acceptance criterion.
When used in conjunction with RESTART, the first initial solution is computed by NeighbourJoining, whereas the new initial solutions for restarts are computed by AlignRandom.

3.2 Local Search

The embedded heuristic, which will be a local search procedure in this project, has great influence on the performance of ILS. Given that great impact, we will design the other components around the local search procedure. To find a good local search regarding both runtime and solution quality, two local search procedures have been implemented and tested on a random subset of the BAliBASE database. The subset consists of a quarter of all sequences of each reference set, randomly chosen. The two local search procedures each examine a different neighbourhood structure by strict iterative improvement, going from one solution to another only if the new solution is better than the former.

3.2.1 Single exchange local search

The first local search procedure exchanges one symbol with an adjacent gap symbol until no further improvement can be made. The neighbourhood structure used in the local search is defined as follows:

Definition 3.1 (Neighbourhood structure for single exchange local search) A pseudo multiple sequence alignment A is contained in the neighbourhood of another pseudo multiple alignment B iff alignment A can be transformed into alignment B by exchanging a gap with a symbol, preserving the order of the symbols in each sequence.

The size of the neighbourhood is O(N² · L), where N is the number of sequences in the alignment and L is the length of the longest sequence in the set S of sequences. The worst case is an alignment with a maximum number of gaps, since each gap can be replaced by two residues (except gaps at the beginning or end of a sequence, which can only be replaced by one residue), which gives two (one) neighbours. This worst-case alignment consists of columns which are built from N − 1 gaps and only one non-gap symbol.
Each symbol from the sequences constitutes one column, so the alignment has Σ_{s∈S} |s| columns. The number of gaps is then given by (N − 1) · Σ_{s∈S} |s|. For L being the length of the longest sequence, it holds that L ≥ |s| for all s ∈ S, and from that Σ_{s∈S} |s| ≤ |S| · L = N · L. Altogether, this gives (N − 1) · N · L = N² · L − N · L = O(N² · L).

The local search procedure explores this neighbourhood with a combination of greedy and random steps. The local search traverses every block of gaps, i.e. a consecutive sequence of gaps, and searches for an improvement within this block. Within a block, every possible move is evaluated and the best one taken. If the best move yields an improvement, it is executed and the local search is restarted. If no improvement was found in this block, the local search is continued with the next block. The blocks and the gaps in each block are traversed in random order.

    ERT−−−QYPDI−−−−PK        ERT−−−QYPDI−−−−PK
    HF−−−N−RY−ITS−−PK   -->  HFN−−−−RY−ITS−−PK
    KQQ−−−−KYLSA−KSPK        KQQ−−−−KYLSA−KSPK
    −−−KLVN−−E−AKNL−−        −−−KLVN−−E−AKNL−−

Figure 3.1: Move operation of the single exchange local search. A gap inside a block of consecutive gaps is swapped with one of the symbols adjacent to this block.

One final point of variation inside the local search worth considering is the question of which move should be performed during each iteration. Algorithm 3.1 shows the complete local search algorithm in pseudo code. The algorithm examines all gap blocks and the adjacent symbols to see if swapping a symbol with one of the gaps inside the block yields an improvement, until either an improvement is found or all gaps have been examined. Depending on the parameter pivot, which can be set to either "Best" or "First", the local search selects the neighbour which yields the largest improvement possible or the first neighbour found to yield an improvement. The algorithm uses an auxiliary function swap which swaps two positions inside an alignment.
The function ExchangeMove evaluates a move in the neighbourhood structure by computing the scores of the current alignment and of the neighbour reached by exchanging one symbol and a gap. The difference between these two scores is returned. The function ExamineBlock searches for the best improvement w.r.t. a block of consecutive gaps by evaluating all possible moves obtained by considering one of the symbols to the right or left of the block and the gaps inside the block. It returns the value and the gap position of the best improving move found, and zero if no improvement can be made. The local search is implemented in procedure SingleExchange. It examines all blocks of gaps as long as an improving move in the neighbourhood can be made. To do this, it first evaluates all moves induced by a block and the symbol to the left of this block (lines 24-25). If an improvement was found, or if pivoting is set to "Best", the same is done for the symbol to the right and the same block (lines 26-34). If the examination of a block found an improvement and pivoting is set to "First", the for-loop is quit in line 36. When the loop terminates, the best improving move found is executed (lines 39-40).

Figure 3.1 shows a possible move during the local search. First, a block of gaps is selected. For both adjacent residues, the score of exchanging them with any gap in the selected block is computed, and the best combination is eventually executed as the move for this iteration. The estimated runtime of one iteration of the local search procedure is O(N² · L). In the worst case, the local search has to evaluate the whole neighbourhood of a solution, which corresponds to evaluating every possible move, since each neighbour can be reached by performing exactly one move.

Algorithm 3.1 Single exchange local search.
 1: function ExchangeMove(Alignment A, OF f, gapPos, symbolPos)
 2:     delta ← f(A)
 3:     A' ← A
 4:     swap(A', gapPos, symbolPos)            ▷ swap two positions in an alignment
 5:     delta ← delta − f(A')
 6:     return delta
 7: end function
 8:
 9: function ExamineBlock(Alignment A, OF f, Block G, symbolPos)
10:     bestDelta ← 0
11:     for all gaps g ∈ G do                  ▷ find the best improvement for this block
12:         delta ← ExchangeMove(A, f, g, symbolPos)
13:         if delta < bestDelta then
14:             bestDelta ← delta
15:             gapPosition ← g
16:         end if
17:     end for
18:     return (bestDelta, gapPosition)        ▷ bestDelta = 0 if no improvement found
19: end function
20:
21: procedure SingleExchange(Alignment A, Objective Function f, pivot)
22:     repeat
23:         for each block G of consecutive gaps do
24:             symbolPos ← symbol left of G
25:             (bestDelta, gapPos) ← ExamineBlock(A, f, G, symbolPos)
26:             if bestDelta < 0 ∨ pivot = "Best" then
27:                 rightPos ← symbol right of G
28:                 (rightDelta, rightGap) ← ExamineBlock(A, f, G, rightPos)
29:                 if rightDelta < bestDelta then
30:                     bestDelta ← rightDelta
31:                     symbolPos ← rightPos
32:                     gapPos ← rightGap
33:                 end if
34:             end if
35:             if pivot = "First" ∧ bestDelta < 0 then
36:                 break                      ▷ quit the for loop
37:             end if
38:         end for
39:         if bestDelta < 0 then              ▷ if improvement found
40:             swap(A, gapPos, symbolPos)     ▷ perform the found move
41:         end if
42:     until no improvement was found in the last iteration
43: end procedure

To speed up the computation, the local search uses don't look bits to prune the search space. Don't look bits are a matrix of boolean values, which are initially set to false. A symbol next to a gap block is only considered for a move if the corresponding don't look bit is false. Whenever the local search evaluates a symbol adjacent to a block of gaps and finds that no improvement can be made by exchanging this symbol with one of the gaps in the adjacent block, the don't look bit is set to true; when a move is actually executed, the don't look bits for the two exchanged positions are set to false.
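To make these two primitives concrete, the following C++ sketch shows the swap operation and a don't look bit store. The names (Alignment, swapPositions, DontLookBits) are illustrative, not those of the actual implementation, and the objective-function evaluation producing a move's delta is omitted:

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch of the swap primitive from Algorithm 3.1: exchange a gap inside a
// gap block with a symbol adjacent to that block, in one alignment row.
using Alignment = std::vector<std::string>;

void swapPositions(Alignment& a, int seq, int gapPos, int symbolPos) {
    std::swap(a[seq][gapPos], a[seq][symbolPos]);
}

// Don't look bits: one boolean per alignment cell, initially false. A symbol
// is skipped while its bit is true; the bit is set when examining the symbol
// yields no improving exchange, and cleared when a move changes the position.
struct DontLookBits {
    std::vector<std::vector<bool>> bits;
    explicit DontLookBits(const Alignment& a) {
        for (const std::string& row : a)
            bits.emplace_back(row.size(), false);
    }
    bool skip(int seq, int pos) const { return bits[seq][pos]; }
    void markUnpromising(int seq, int pos) { bits[seq][pos] = true; }
    void clear(int seq, int pos) { bits[seq][pos] = false; }
};
```

Note that the caller is responsible for choosing a gap inside a block and a symbol directly adjacent to that block, as in figure 3.1; only then does the swap preserve the residue order required by Definition 3.1.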
3.2.2 Block exchange local search

The second local search procedure is similar to the first one but tries to exchange whole blocks of gaps instead of one symbol in each step. Riaz et al. (2004) favoured this local search in their TS implementation[3] and state that it is superior in runtime and solution quality to the single exchange local search. The neighbourhood structure for the block exchange local search is defined as:

Definition 3.2 (Neighbourhood structure for block exchange local search) A pseudo multiple sequence alignment A is contained in the neighbourhood of another pseudo multiple sequence alignment B iff alignment A can be transformed into alignment B by exchanging a block of gaps with a block of non-gap symbols in one sequence, preserving the order of the symbols in each sequence.

In the worst case, the neighbourhood structure is equivalent to the neighbourhood structure of the single exchange local search, since a maximal number of gap blocks is given when each block is of length one. This gives the same worst-case size of the neighbourhood, O(N² · L), but on average the neighbourhood can be expected to be much smaller.

    ERTQYP−−−DI−−−−PK        ERT−−−QYPDI−−−−PK
    HFN−−−−RY−ITS−−PK   -->  HFN−−−−RY−ITS−−PK
    KQQ−−−−KYLSA−KSPK        KQQ−−−−KYLSA−KSPK
    −−−KLVN−−E−AKNL−−        −−−KLVN−−E−AKNL−−

Figure 3.2: Move operation for the block exchange local search. A randomly selected block of gaps is exchanged with a block of symbols of equal size to its right.

Algorithm 3.2 describes the local search procedure and figure 3.2 presents one step in graphical form. It uses the same technique to distinguish between best-improvement and first-improvement pivoting as SingleExchange (algorithm 3.1). Swapping of blocks of symbols is done by an auxiliary function swapBlock which swaps two blocks inside an alignment.

[3] TS traverses to the best solution in the neighbourhood of the current solution, a process similar to the local search procedures used here.
Algorithm 3.2 Block exchange local search.
 1: function BlockMove(Alignment A, OF f, Block gaps, Block symbols)
 2:     delta ← f(A)
 3:     A' ← A
 4:     swapBlock(A', gaps, symbols)
 5:     delta ← delta − f(A')
 6:     return delta
 7: end function
 8:
 9: procedure BlockExchange(Alignment A, OF f, pivot)
10:     repeat
11:         bestDelta ← 0
12:         for each block G of consecutive gaps do
13:             leftSymbols ← maximal block of at most |G| consecutive symbols to the left
14:             if BlockMove(A, f, G, leftSymbols) < bestDelta then
15:                 bestDelta ← BlockMove(A, f, G, leftSymbols)
16:                 symbolBlock ← leftSymbols
17:                 gapBlock ← G
18:             end if
19:             if pivot = "Best" ∨ bestDelta = 0 then
20:                 rightSymbols ← maximal block of at most |G| consecutive symbols to the right
21:                 if BlockMove(A, f, G, rightSymbols) < bestDelta then
22:                     bestDelta ← BlockMove(A, f, G, rightSymbols)
23:                     symbolBlock ← rightSymbols
24:                     gapBlock ← G
25:                 end if
26:             end if
27:             if bestDelta < 0 ∧ pivot = "First" then
28:                 break                      ▷ end the for-loop
29:             end if
30:         end for
31:
32:         if bestDelta < 0 then
33:             swapBlock(A, gapBlock, symbolBlock)
34:         end if
35:     until no improvement found in last iteration
36: end procedure

The local search is implemented in procedure BlockExchange. It iteratively examines all blocks of gaps until no further improvement can be made. It starts by examining the effect of swapping a block of symbols to the left with the current gap block (lines 13-18). The non-gap block is of maximal length, meaning that the local search tries to swap as many symbols found to the left of the current gap block as possible. The size of the swapped block is the minimum of the number of consecutive non-gap symbols to the left and the size of the gap block. If this did not yield an improvement, or pivoting is set to best improvement, the same is done with the maximal block of non-gaps to the right (lines 19-26). In both cases, the gap and the non-gap blocks are saved if the move is better than the best move found so far (which is always the case for the first improvement found).
Lines 27-29 terminate the loop if an improving move was found and pivoting is set to first improvement. The best move found is executed in lines 32-34.

A further extension would be to consider not only swapping blocks of symbols, but moving the gaps to a completely new position in the sequence. This approach has the drawback of an extremely large neighbourhood, since there are N − 1 new positions for a gap block of length one, which would give O(N · N² · L) as the worst-case neighbourhood size. Considering this and the large computational cost of evaluating each step, this approach is not pursued here.

3.2.3 Comparison

Both local search procedures are tested by running them on a random part of the BAliBASE database. One fourth of each subset is randomly selected. For each selected test case, ClustalW is used to compute an initial alignment, which is used as input for the local search procedures. Table 5.2 on page 42 and table 5.3 on page 42 present the results as percentage improvements in COFFEE score for each local search procedure with first- and best-improvement pivoting. The tables show that the SingleExchange local search performs better than the BlockExchange local search, with improvements up to three times larger in some cases. There is no big difference between the two pivoting schemes. Comparing the runtimes of the algorithms shown in table 5.3, SingleExchange performs much better than BlockExchange, the only exception being reference set four, where BlockExchange is the fastest algorithm in all cases. When focusing on the pivoting scheme used, it turns out that first improvement is much faster than best improvement, by up to a factor of six for reference set 3. Considering the runtime and the improvement made by the local search, SingleExchange with a first-improvement pivoting scheme will be used in the ILS as the embedded heuristic. Since ILS samples local minima, both factors have to be considered to select the local search procedure.
The local search should be able to produce as many local minima as possible in the given time, but the quality of the local minima must also be taken into account. SingleExchange is superior to BlockExchange in both respects, and thus seems to be a good basis for the following design decisions.

3.3 Perturbation

Four perturbation procedures are proposed and evaluated on a small subset of BAliBASE. All perturbations are tested using a preliminary version of the ILS, using SingleExchange and the Better acceptance criterion, running for 100 iterations. Due to the large runtime of one of the perturbations, the test runs are restricted to a subset of reference set 1, which consists of relatively small sets of sequences.

3.3.1 Random regap perturbation

    ERT−−−QYPDI−−−−PK                     ERT−−−QYPDI−−−−PK
    HFN−−−−RY−ITS−−PK   shuffle gaps -->  HFN−−−−RYIT−−SP−K
    KQQ−−−−KYLSA−KSPK                     KQQ−−−−KYLSA−KSPK
    −−−KLVN−−E−AKNL−−                     −−−KLVN−−E−AKNL−−

Figure 3.3: Random regap perturbation. A subsequence of one of the sequences inside the alignment is randomly selected, and the gaps inside that subsequence are moved to new random positions inside the block.

Algorithm 3.3 Random regap perturbation.
procedure Regap(Alignment A, Integer length)
    select a sequence Si ∈ A
    select a substring si of Si
    n ← countGaps(si)
    s'i ← removeGaps(si)
    add n gaps at random positions to s'i
    substitute si with s'i
end procedure

The first perturbation, shown in algorithm 3.3 and exemplified in figure 3.3, simply selects a random subpart of one of the sequences in the alignment and moves the gaps to random positions inside the subsequence. The number of gaps remains constant. In order to prevent the ILS from cycling, all decisions made in the perturbation procedures are made randomly, with the length of the selected subsequence as the only exception. But since the SingleExchange local search is very similar to this perturbation, it is likely that the changes will be undone in the next local search step.
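The core of the regap step can be sketched as follows; regap is a hypothetical helper operating on one selected subsequence, and std::mt19937 stands in for the random number generator of the actual implementation:

```cpp
#include <random>
#include <string>

// Sketch of the core of Algorithm 3.3: strip the gaps from a selected
// subsequence and re-insert the same number of gaps at random positions,
// so that length, gap count and residue order are all preserved.
std::string regap(const std::string& segment, std::mt19937& rng) {
    std::string residues;
    int gaps = 0;
    for (char c : segment) {
        if (c == '-') ++gaps;
        else residues.push_back(c);
    }
    std::string result = residues;
    for (int g = 0; g < gaps; ++g) {
        // a gap may be re-inserted at any position, including both ends
        std::uniform_int_distribution<int> pos(0, static_cast<int>(result.size()));
        result.insert(result.begin() + pos(rng), '-');
    }
    return result;
}
```

Because only the gap positions change, the perturbed subsequence keeps its length, its gap count and the order of its residues, as required by the definition of a pseudo alignment.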
3.3.2 Block move perturbation

The BlockMove perturbation is a move in another neighbourhood of the current alignment. It selects a block of consecutive gaps and moves that block to a new position inside the same sequence (see algorithm 3.4 and figure 3.4). Moving a block to a new position can be seen as executing a move in a neighbourhood of higher order, which is a generalisation of the neighbourhood used by BlockExchange. Since moves in another neighbourhood are often proposed in the ILS literature as successful perturbations, this seems to be a promising approach. Initial experiments with this perturbation indicated that the distance by which a block can be moved should be restricted to keep the modified alignment close enough to the former alignment. After running small tests (data not shown here), 1/10 of the alignment length seems to be a good upper limit for the block move distance.

    ERT−−−QYP−−DI−−PK                   ERT−−−QYP−−DI−−PK
    HFN−−−−RY−ITS−−PK   move block -->  HFNRY−ITS−−−−−−PK
    KQQ−−−−KYLSAK−SPK                   KQQ−−−−KYLSAK−SPK
    −−−KLVN−−E−AKNL−−                   −−−KLVN−−E−AKNL−−

Figure 3.4: Block move perturbation. A block of gaps is randomly selected and then shifted to a new randomly determined position.

Algorithm 3.4 Random block move perturbation.
procedure BlockMove(Alignment A)
    select a sequence Si ∈ A
    select a block of gaps B in Si
    move B to a new position in Si
end procedure

3.3.3 Pairwise alignment perturbation

    ERT−−−QYPDI−−−−PK                           ERT−−−QYP−−DI−−PK
    HFN−−−−RY−ITS−−PK   pairwise alignment -->  HFN−−−−RY−ITS−−PK
    KQQ−−−−KYLSA−KSPK   of −−DI and LSAK        KQQ−−−−KYLSAK−SPK
    −−−KLVN−−E−AKNL−−                           −−−KLVN−−E−AKNL−−

Figure 3.5: Pairwise alignment perturbation with perturbation strength = 5. Two sequences are selected. Within these two sequences, two substrings are aligned and then pasted back into the MSA.

This perturbation procedure relies on computing the optimal pairwise alignment of two subsequences of the alignment and fitting this alignment back into the MSA. Figure 3.5 illustrates the perturbation process, which is described by algorithm 3.5.
This perturbation tries to make use of a biologically justified method to change the alignment. It first selects two random subsequences from two randomly chosen sequences in the alignment. These subsequences are then aligned (with gaps in the subsequences removed before aligning) using optimal pairwise alignment with affine gap costs (Gotoh, 1982) and copied back to their original places. If the length of the pairwise alignment differs from the length of the original subsequences, gaps are added either to the pairwise alignment at random positions (if the pairwise alignment is shorter) or to the multiple alignment, to gain enough space for the pairwise alignment (if it is longer).

Algorithm 3.5 Pairwise alignment perturbation.
procedure PairwisePerturbation(Alignment A)
    select two sequences S1 and S2
    select two substrings s1 of S1 and s2 of S2
    s'1 ← removeGaps(s1)
    s'2 ← removeGaps(s2)
    compute the optimal pairwise alignment PA of s'1 and s'2
    if |PA| < |s1| then
        add |s1| − |PA| gaps to each sequence in PA
    else if |PA| > |s1| then
        add |PA| − |s1| gaps to the alignment A at random positions
    end if
    copy PA back into A
end procedure

3.3.4 Re-aligning of complete regions

    ERT−−−QYP−−DI−−PK                    ERT−−−QY−PD−−−I−−PK
    HFN−−−−RY−ITS−−PK   AlignRandom -->  HFN−−−−RY−−IT−S−−PK
    KQQ−−−−KYLSAK−SPK                    KQQ−−−−KY−LS−AK−SPK
    −−−KLVN−−E−AKNL−−                    −−−KLVN−−−E−−AKNL−−

Figure 3.6: Blockwise realign perturbation. A region of the alignment is selected by a sliding window analysis to find unreliably aligned regions. This part is then aligned with the AlignRandom procedure.
Algorithm 3.6 Realign region perturbation.
procedure Realign(Alignment A)
    perform a sliding window analysis with norMD to find a badly aligned region [i, j)
    for all sequences s ∈ A do
        remove all gaps between symbols s[i] and s[j]
    end for
    compute an MSA of the region between i and j using AlignRandom
    copy the computed alignment back into A, replacing the region between i and j
end procedure

The most sophisticated perturbation tries to make use of biological knowledge by performing a sliding window analysis using norMD to find unreliably aligned regions, similar to the procedure used in Rascal (Thompson et al., 2003). After performing the sliding window analysis, all blocks of columns with scores below 0.5 are considered to be unreliably aligned, as proposed in Thompson et al. (2001). One of these blocks is randomly selected and aligned using AlignRandom (see section 3.1). Since the computed alignment can be longer or shorter than the original block, the perturbed alignment can in turn grow or shrink. This perturbation combines biologically justified methods and random decisions to apply a guided perturbation to the alignment in regions where further improvement is needed. It is expected to be able to guide the local search effectively and produce good results, but the expected runtime is very high due to the additional complexity introduced by running norMD each time on the complete alignment. Thompson et al. (2001) do not give an estimate of the runtime of the norMD procedure, but since it considers each pair of sequences it is at least quadratic.

3.3.5 Comparison

The results of a comparison of the performance of a preliminary version of an ILS with the different perturbation procedures are presented in tables 5.4 and 5.5 on page 43. The ILS used in this comparison uses the SingleExchange local search with a first-improvement pivoting strategy and the BETTER acceptance criterion. The runtime has been restricted to 100 iterations.
As BlockMove shows the largest percentage improvement with acceptable runtime, we will use BlockMove with a restricted maximal distance of 1/10 of the alignment's length as the perturbation procedure.

3.4 Acceptance criterion

The final component of the ILS to consider is the acceptance criterion. Most publications propose the BETTER criterion to be superior w.r.t. solution quality to the ALWAYS criterion. Additionally, we examine the Restart criterion (Stützle & Hoos, 2002), which prevents the ILS from stagnating by starting with a new initial solution generated with AlignRandom when no improving steps could be made for a number of steps. The results of this comparison can be found in tables 5.6 and 5.7 on page 44. Evaluation is again done by running a preliminary ILS ten times on subsets of reference set 1 with 1/4 of the sequences, measuring the COFFEE scores and runtime. Tables 5.6 and 5.7 show the results. The best improvements are achieved by an ILS using the Restart criterion. This could possibly indicate that the initial alignments produced by NeighbourJoining lie in regions of the search space far away from the globally optimal solution, and that it is very hard to traverse from one region of good solutions to another.

3.5 Termination condition

The algorithm is terminated after a fixed number of iterations, depending on the length of the longest sequence in the alignment. Since it is not possible to guarantee the optimality of this algorithm,[4] a cut-off has to be made. Experimental investigation has shown that ILS begins to stagnate after a certain amount of time, so terminating the search when the stagnation begins seems to be a good termination condition.

Figure 3.7: COFFEE scores for ILS on reference set gal4 ref. The algorithm was run for 12000 iterations; the longest sequence consists of 395 residues.
The average similarity between the sequences is 14%. Figure 3.7 shows an example run of the ILS algorithm on the test set "gal4 ref1". This test set consists of five sequences, the shortest consisting of 335 residues, the longest of 395. Several other tests (data not shown) indicate that twenty times the length of the longest sequence is a good threshold for the maximum number of iterations (7900 in this case).

3.6 Iterated Local Search for Multiple Sequence Alignment

We finally present the complete ILS algorithm for multiple sequence alignment. The algorithm is described in pseudo code in algorithm 3.7. To improve the runtime needed by the ILS, we preset the don't look bits used in the local search procedure by performing a sliding window analysis with norMD on the initial alignment and setting the don't look bits for well aligned regions.[5] This also helps the ILS to keep the well aligned parts of the initial alignment during the search. The don't look bits are furthermore reused in each iteration by using the don't look bits from the last iteration (with regions changed by the perturbation set to false) as starting don't look bits for the next iteration, so that the local search can focus on perturbed parts of the alignment. When a restart is done, the don't look bits are preset again.

[4] If optimality were to be guaranteed, the algorithm would have to enumerate all possible alignments to find the optimal one and would thus need exponential runtime.
[5] A threshold of 0.8 is used to indicate a well aligned region.

Algorithm 3.7 The complete ILS.
function ILS(Objective Function f)
    a ← initial alignment computed with NeighbourJoining
    perform a sliding window analysis with norMD
    set don't look bits to true in well aligned regions
    SingleExchange(a, f, "First")
    best ← a
    for 20 · (length of the longest sequence) iterations do
        a' ← a
        BlockMove(a')
        SingleExchange(a', f, "First")
        if f(a') > f(best) then
            best ← a'                      ▷ new best alignment found
        end if
        if f(a') > f(a) then               ▷ BETTER acceptance: keep improving solution
            a ← a'
        end if
        if no improvement steps made for |a| iterations then   ▷ Restart
            a ← new alignment computed with AlignRandom
            preset don't look bits with norMD
            BlockMove(a)
            if f(a) > f(best) then
                best ← a
            end if
        end if
    end for
    remove all columns with no symbols from best
    return best
end function

3.7 Implementation

The algorithm is implemented in C++ and can be compiled with any ISO C++ compiler. The implementation uses the seqio library for sequence input and output. Seqio is available from http://www.cs.ucdavis.edu/gusfield/seqio.html. Since the algorithm is inherently stochastic in nature, a good random number generator is crucial. In our implementation, the Mersenne twister implementation of Agner Fog (http://www.agner.org/random/) is used. The neighbour joining algorithm used in the construction heuristic NeighbourJoining is implemented in the library quick-join (Brodal et al., 2003). Patched versions of ClustalW and lalign are used to generate the COFFEE library. The programs have been modified to simplify parsing their output and to circumvent the creation of unnecessary temporary output files.

4 Method and Materials

Two points must be considered to evaluate the performance of the ILS algorithm. The first point is the performance of ILS as an optimisation algorithm, optimising the alignment according to the objective function used. The second point is the biological validity of the computed alignment. The performance is measured using run-time distributions, and the biological evaluation is done with the BAliBASE database of reference alignments, which has been used in most recent publications of MSA algorithms.

4.1 Run-time distributions

Stützle (1999a) and Hoos & Stützle (1998) propose an empirical evaluation method based on the distribution of solution quality and run-time, since Metaheuristics, and especially ILS, are probabilistic algorithms.
Instead of benchmarking an algorithm on test cases and computing averages and other simple statistical measures, the runtime behaviour of an algorithm is examined by taking samples of the run-time distribution (RTD) on specific instances. An RTD describes the runtime behaviour as a two-dimensional random variable (C, T), where C is the random variable describing the solution cost and T is the random variable describing the run-time on a specific problem instance. Instead of measuring the run-time in terms of CPU time consumed, it is often preferable to use machine-independent measures such as operation counts. This also minimises the influence of other environmental properties, e.g. the influence of the compiler or programming language. An RTD using such a machine-independent measure is called a run-length distribution (RLD). We will use the number of executed local search steps as a performance measure when analysing the runtime behaviour of ILS, since it provides a good measure of the actual runtime needed. The random variables describe the behaviour of the algorithm for a given instance, but are in general not available a priori. Knowledge of these random variables can be used to analyse and compare algorithms. To estimate the distribution of the random variables, samples are taken by simply running the algorithm a number of times on an instance and collecting data. In each run, the solution quality and the computation time to obtain it are reported every time a new best solution is found. Let k be the number of runs of the algorithm, f(s_j) the quality of the best solution found in run j, and rt(j) the run-time of the j-th run. The estimated RTD to reach a solution with quality c is defined as:

Definition 4.1 (Runtime distribution)

    G_c(t) = |{ j | rt(j) ≤ t ∧ f(s_j) ≤ c }| / k

If optimal solution qualities are available, the required quality is often set to a value or percentage worse than the optimal value, e.g. to 5% worse.
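Definition 4.1 maps directly onto code. The sketch below (illustrative names; costs follow the minimisation convention of the definition) estimates G_c(t) from the recorded results of k runs:

```cpp
#include <vector>

// Empirical RTD estimator following Definition 4.1: the fraction of the k
// runs that reached a solution of cost at most c within time t.
struct Run {
    double cost;     // f(s_j): quality of the best solution found in run j
    double runtime;  // rt(j): run-time of run j
};

double estimatedRTD(const std::vector<Run>& runs, double c, double t) {
    int hits = 0;
    for (const Run& r : runs)
        if (r.runtime <= t && r.cost <= c) ++hits;
    return static_cast<double>(hits) / static_cast<double>(runs.size());
}
```

For a maximised objective such as the COFFEE score, the condition r.cost <= c simply becomes a score ≥ c test; the runtime t can equally be a machine-independent operation count, giving an RLD estimate instead.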
However, since there are no known mathematically optimal reference solutions for MSA, we will use the best value found during benchmarking as the reference value and compute RLDs relative to this value. Stützle & Hoos (2002) show how RTDs are used to analyse the behaviour of ILS for the TSP. The authors concluded from the estimated RTDs that the ILS suffers from stagnation and proposed two improvements, which actually led to an improvement in solution quality and performance.

4.2 The BAliBASE database of multiple alignments

The BAliBASE (Bahr et al., 2001; Thomson et al., 1999) database of benchmark[1] alignments is a database of reference alignments to evaluate the biological validity of MSA algorithms. The database consists of 167 manually constructed alignments of real sequences based on three-dimensional structural superposition.[2] In each alignment, regions that can be aligned reliably by structural superposition are defined as core blocks. BAliBASE has been used in two published surveys of MSA algorithms (Notredame, 2002; Thompson et al., 1999) and is frequently used for evaluation in recent publications of MSA algorithms (Notredame et al., 2000; Riaz et al., 2004). It can be seen as a standard benchmark for the biological evaluation of MSA algorithms. Since version one of the BAliBASE database was used in all publications found, we will restrict ourselves to the same test set, which matches reference sets 1-5 of BAliBASE 2.0 (reference sets 4 and 5 formed a single reference set in BAliBASE 1.0). In total, 144 alignments are used for evaluation. The database is structured into five reference sets, each representing one commonly faced problem for MSA algorithms. Reference 1 consists of small sets of equidistant sequences of similar length. The lengths range from below 100 residues up to 900. Each pair of sequences in this set has a similar percentage identity, and no large extensions or insertions have been introduced.
The set can be structured into three sub-groups of similar identity: identity below 25%, between 20–40%, and above 35%. Reference 2 contains alignments of a family of sequences, i.e. closely related sequences with identity above 25%, together with at most three distantly related sequences of the same family. Distantly related refers to sequences with less than 25% similarity. Reference 3 contains alignments of equidistant groups of sequences from divergent families. Each pair of sequences from two different groups has a percentage identity below 25%. Reference 4 contains sequences with large N/C-terminal extensions. Reference 5 contains sequences with internal insertions.

¹ Available online at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/
² Test set seven, which consists of transmembrane sequences, is not based on three-dimensional structural superposition due to the lack of known three-dimensional structures and the variability of automatic prediction methods.

Comparing a test alignment with the alignment contained in the database is done with the bali_score program³. The program computes two different scores for each alignment: the BSPS, which shows to what extent the residue pairs in the alignment are correctly aligned, and the Column Score (CS), which measures the ability to align entire columns correctly. Scoring is restricted to core blocks, which are reliable areas of the alignment where no ambiguity from the structural alignment exists. The BSPS counts the number of residue pairs aligned in the test alignment that also occur in the reference alignment. For a test alignment A with N sequences and M columns and a reference alignment R, the score S_i for the i-th column is defined as:

Definition 4.2

    S_i = Sum_{j=1}^{N} Sum_{k=1, k != j}^{N} p_i(j, k)

where p_i(j, k) = 1 if the residue pair (A_j[i], A_k[i]) is also found in the reference alignment, and p_i(j, k) = 0 otherwise.
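The pair counting behind Definition 4.2, together with the pair-ratio and column-ratio scores computed by the bali_score program, can be sketched as follows. This is a simplified illustration under our own assumptions: it ignores core-block annotation and treats all columns alike, and all function names are hypothetical.

```python
def residue_columns(alignment):
    """Map (sequence index, residue index) -> column index for a gapped
    alignment given as a list of equal-length strings."""
    cols = {}
    for s, row in enumerate(alignment):
        r = 0
        for c, ch in enumerate(row):
            if ch != '-':
                cols[(s, r)] = c
                r += 1
    return cols

def aligned_pairs(alignment):
    """All residue pairs placed in the same column."""
    by_col = {}
    for res, c in residue_columns(alignment).items():
        by_col.setdefault(c, []).append(res)
    pairs = set()
    for members in by_col.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs

def bsps(test, ref):
    """Correctly aligned residue pairs divided by the pairs in the reference."""
    ref_pairs = aligned_pairs(ref)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)

def column_score(test, ref):
    """Fraction of test columns whose residues all share one reference column."""
    ref_cols = residue_columns(ref)
    by_col = {}
    for res, c in residue_columns(test).items():
        by_col.setdefault(c, []).append(res)
    full = sum(1 for members in by_col.values()
               if len(members) > 1 and len({ref_cols[r] for r in members}) == 1)
    return full / len(test[0])

print(bsps(["AB-", "-AB"], ["AB", "AB"]),        # 0.0: no pair preserved
      column_score(["AB", "AB"], ["AB", "AB"]))  # 1.0: identical alignment
```

The shifted toy alignment illustrates the difference between the two measures discussed in the text: a single misplaced sequence destroys whole columns while individual residue pairs may still survive.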
With S_max defined as the number of aligned residue pairs in the reference alignment, the BSPS for the test alignment A is then:

Definition 4.3 (BAliBASE Sum-of-pairs Score (BSPS))

    BSPS(A) = ( Sum_{i=1}^{M} S_i ) / S_max

The CS is defined similarly to the BSPS, but counts only columns that are completely identical to the reference alignment:

Definition 4.4 (BAliBASE Column Score (CS))

    CS = ( Sum_{i=1}^{M} C_i ) / M

where C_i = 1 if all residues in the i-th column are aligned as in the reference alignment, and C_i = 0 otherwise. M is the number of columns in the test alignment. The column score is a much stronger measure of alignment quality, since it takes only complete columns into account. In a situation where one sequence is aligned completely inconsistently with the reference alignment, the CS would be zero, whereas the BSPS can still be rather high, since it takes each individual pair of amino acids into account.

³ Available at the BAliBASE web-page: http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/bali_score.c

4.2.1 Statistical Evaluation

When comparing the performance of different algorithms, it is important to establish whether the differences between two algorithms are statistically significant by using statistical tests. In this thesis, the Wilcoxon signed-rank test is used to calculate a p-value associated with the differences in the results of two algorithms. This p-value expresses the probability that the differences are due to chance, with lower p-values expressing higher significance. The Wilcoxon signed-rank test is also used by Notredame et al. (2000) and Gotoh (1996); Thompson et al. (1999) used a Friedman test instead.

5 Results

This chapter consists of three parts: the first part contains the results gathered to compare the alternatives for each of the components of ILS proposed in chapter 3. The second part contains the results of the evaluation of ILS on the complete BAliBASE database. The results are analysed using the methodology described in chapter 4.
The third section analyses the runtime behaviour by exploiting solution-quality trade-offs and run-length distributions.

5.1 ILS components

For each component of ILS, chapter 3 proposed several alternatives. This section contains the results of the comparisons of the alternatives proposed for each component. These results were used to motivate and explain the decisions made in chapter 3 and to select the components of the final ILS algorithm.

5.1.1 Construction heuristic

Reference Set    Random            AlignRandom       NeighbourJoining
                 COFFEE   BSPS     COFFEE   BSPS     COFFEE   BSPS
ref1 test1       0.132    0.0588   0.582    0.255    0.667    0.3
ref1 test2       0.131    0.0478   0.784    0.493    0.836    0.524
ref1 test3       0.119    0.0374   0.844    0.598    0.883    0.623

Table 5.1: Comparison of construction heuristics. Each heuristic was run 10 times on each set of sequences. The first column for each heuristic shows the average COFFEE score, the second column the average BSPS score. The test sets and the BSPS score are described in section 4.2.

Table 5.1 shows the quality of the alignments computed by each construction heuristic. The best results are obtained using NeighbourJoining, AlignRandom performs second best, and Random yields the lowest results.

5.1.2 Local search

The two local search variants presented in section 3.2 are compared by running them on a randomly chosen part of the BAliBASE database. From each reference set inside the database, one fourth of the test cases are randomly selected and given to ClustalW to produce an initial alignment, which is then optimised by the local search procedures.
                  SingleExchange                 BlockExchange
Reference set     global        combined         global        combined
                  first  best   first  best     first  best   first  best
< 25% identity    8.5    8.4    6.6    6.5      2.4    2.5    2.1    2.1
20–40% identity   1.9    1.9    1.6    1.7      0.9    0.9    1.2    1.2
> 35% identity    1.3    1.3    1.2    1.2      0.4    0.4    0.5    0.5
reference 2       2.2    2.1    2.0    1.9      1.2    1.1    1.1    1.1
reference 3       1.4    1.4    1.4    1.3      0.9    0.8    0.6    0.6
reference 4       3.4    3.6    2.8    2.6      1.8    1.8    1.3    1.3
reference 5       2.4    2.5    1.7    1.8      1.3    1.3    1.1    1.1
Average           3.01   3.03   2.47   2.43     1.27   1.26   1.13   1.13

Table 5.2: Percentage improvement for both local searches on ClustalW-generated alignments, averaged over ten runs. See section 5.1.2 for details.

                  SingleExchange                   BlockExchange
Reference set     global         combined          global         combined
                  first  best    first  best      first  best    first  best
< 25% identity    0.003  0.008   0.003  0.013     0.002  0.006   0.003  0.012
20–40% identity   0.002  0.003   0.002  0.004     0.002  0.006   0.003  0.007
> 35% identity    0.002  0.005   0.002  0.007     0.003  0.007   0.004  0.012
reference 2       0.027  0.172   0.036  0.247     0.099  0.875   0.172  1.384
reference 3       0.028  0.136   0.043  0.218     0.079  0.587   0.160  1.183
reference 4       0.021  0.067   0.014  0.064     0.006  0.021   0.006  0.018
reference 5       0.006  0.025   0.007  0.023     0.007  0.023   0.009  0.027
Average           0.01   0.06    0.02   0.08      0.03   0.22    0.05   0.38

Table 5.3: Runtimes (in seconds) for both local searches on ClustalW-generated alignments, averaged over ten runs. See section 5.1.2 for details.

The results are presented as average percentage improvements in COFFEE score for ten runs of the local search on the alignments produced by ClustalW. Reference set 1 is split into three different sets according to the average percentage identity between the sequences. Table 5.2 shows the improvement in percent over the alignment computed with ClustalW. For each local search, the results of using a library of only global alignments generated by ClustalW (column "global") and of using a library of both global and local alignments (column "combined") as the reference library for the COFFEE-OF are shown.
The first column for each library gives the results for first-improvement pivoting, the second column for best-improvement. Each local search was run ten times.

5.1.3 Perturbation

To test the perturbation procedures, a preliminary version of the ILS has been implemented. The ILS uses the SingleExchange local search procedure with first-improvement pivoting and the BETTER acceptance criterion. It is run ten times for 100 iterations on 1/4 of the sequences of each subset of reference set 1, due to the long runtime needed when using Realign. Table 5.4 shows the average improvement made in percent w.r.t. the initial alignment computed by NeighbourJoining. The table shows two values for each perturbation: the first corresponds to using a COFFEE function with a library of only global alignments, the second to using a library of global and local alignments. Table 5.5 gives the average runtimes needed by each ILS in seconds, again showing results for the global-only and the global-plus-local COFFEE library.

Reference Set     Regap          BlockMove      Pairwise       Realign
                  global comb.   global comb.   global comb.   global comb.
< 25% identity    31.1   16.3    33.4   17.1    29.0   15.9    26.5   15.3
20–40% identity   7.4    4.1     8.3    4.4     4.9    2.4     5.3    3.5
> 35% identity    2.0    1.3     2.5    1.7     1.1    0.8     1.8    1.3
Average           13.5   7.23    14.73  7.73    11.67  6.37    11.2   6.7

Table 5.4: Percentage improvements in COFFEE scores over the initial alignment for 100 iterations using different perturbations and SingleExchange. The results for a library of only global alignments are shown in the first column of each perturbation, and the results using a combined library of global and local alignments in the second column.

Reference set     Regap          BlockMove      Pairwise       Realign
                  global comb.   global comb.   global comb.   global comb.
< 25% identity    0.07   0.08    0.23   0.25    0.10   0.11    8.64   8.61
20–40% identity   0.06   0.07    0.21   0.25    0.22   0.24    9.00   9.03
> 35% identity    0.04   0.06    0.22   0.28    0.09   0.10    5.45   5.48
Average           0.06   0.07    0.22   0.26    0.14   0.15    7.7    7.71

Table 5.5: Runtimes in seconds of ILS using different perturbations, running for 100 iterations.
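The preliminary ILS used in these comparisons can be pictured as a generic skeleton in which the construction heuristic, local search, perturbation and acceptance criterion are pluggable components. The sketch below is our own illustration on a toy one-dimensional objective, not the thesis implementation; the BETTER, ALWAYS and RESTART criteria follow the descriptions in section 3.4.

```python
import random

def iterated_local_search(start, local_search, perturb, score, iterations,
                          criterion="BETTER", restart_after=50, new_start=None):
    # Generic ILS skeleton: improve a starting solution with local search,
    # then repeatedly perturb the current local optimum and re-run the
    # local search, keeping track of the best solution seen so far.
    current = local_search(start)
    best = current
    stagnant = 0  # iterations since the last improvement of `best`
    for _ in range(iterations):
        candidate = local_search(perturb(current))
        if score(candidate) > score(best):
            best, stagnant = candidate, 0
        else:
            stagnant += 1
        if criterion == "ALWAYS":
            current = candidate           # accept every new local optimum
        elif score(candidate) >= score(current):
            current = candidate           # BETTER and RESTART accept improvements
        if criterion == "RESTART" and stagnant >= restart_after and new_start:
            current = local_search(new_start())  # restart after stagnation
            stagnant = 0
    return best

# Toy demonstration on a one-dimensional objective; the thesis' operators
# (SingleExchange, Regap, AlignRandom, ...) would take the place of these:
def f(x):
    return -abs(x - 7)

def hill_climb(x):
    while f(x + 1) > f(x) or f(x - 1) > f(x):
        x = max(x - 1, x + 1, key=f)
    return x

random.seed(1)
best = iterated_local_search(0, hill_climb, lambda x: x + random.randint(-3, 3),
                             f, iterations=20, criterion="RESTART",
                             restart_after=5, new_start=lambda: random.randint(0, 20))
print(best, f(best))  # reaches the optimum x = 7
```

The acceptance criterion is the only part varied in the next section; everything else stays fixed, which is what makes the comparison there meaningful.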
5.1.4 Acceptance Criterion

The comparison of the three acceptance criteria proposed in section 3.4 is done in the same way as the comparison of the perturbation procedures in the preceding section, but this time the perturbation procedure of the preliminary ILS is fixed and the acceptance criterion is varied. The results indicate that an ILS using the RESTART criterion achieves the best results. Not surprisingly, the runtime for RESTART is the highest runtime measured, since it has to compute new initial alignments using the AlignRandom heuristic.

Test set          ALWAYS         BETTER         RESTART
                  global comb.   global comb.   global comb.
< 25% identity    33.7   20.8    44.7   32.7    49.3   36.4
20–40% identity   11.8   6.0     14.8   9.0     14.9   9.3
> 35% identity    1.8    1.0     2.2    1.4     2.5    1.6
Average           15.77  9.27    20.57  14.37   22.23  15.77

Table 5.6: Percentage improvement of the COFFEE score for random subsets of each test case. For each acceptance criterion, the first column is computed using a library of global alignments, the second using global and local alignments.

Test set          ALWAYS         BETTER         RESTART
                  global comb.   global comb.   global comb.
< 25% identity    7.02   8.19    6.11   6.90    6.94   7.89
20–40% identity   4.75   5.82    4.98   5.94    7.69   7.90
> 35% identity    2.04   2.68    2.32   2.89    3.90   4.54

Table 5.7: Runtimes of the three evaluated acceptance criteria. For each acceptance criterion, the first column is computed using a library of global alignments, the second using global and local alignments.

The runtime is higher for sequences with high similarity, because the initial alignments computed by NeighbourJoining already have a high COFFEE score and it is hard for the ILS to achieve further improvements.

5.2 BAliBASE evaluation of ILS

This chapter evaluates the performance of ILS for MSA using the methodology described in chapter 4. ILS is compared with ClustalW (Thompson et al., 1994), PRRP¹ (Gotoh, 1996), DIALIGN2 (Morgenstern, 1999), MULTALIN (Corpet, 1988), T-COFFEE (Notredame et al., 2000) and SAGA (Notredame & Higgins, 1996). The values for the Tabu Search are taken from Riaz et al.
(2004), since the available program failed to produce an alignment in most cases. Unfortunately, that publication does not report the CS, which will therefore be omitted. All other programs are run with default parameters on every sequence set in the BAliBASE database, and the BSPS and CS scores are computed with the bali_score program taken from the BAliBASE web-page². Each algorithm except ILS was run only once on each alignment; ILS was run ten times each.³

¹ PRRP is now merged together with PRRN into a single program.
² http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/bali_score.c
³ Deterministic algorithms are run only once, but for non-deterministic algorithms it would be more appropriate to run them several times. However, when SAGA was run on the first reference set, it showed exactly the same results for each run.

The results for the complete database are visualised as boxplots for both BSPS and CS in figures 5.1 and 5.2. Boxplots for each reference set, together with the subsets of reference set 1, are shown in figure 5.3 for BSPS scores and figure 5.4 for CS scores. The figures confirm the good performance of ILS. The boxplot for ILS is always among the highest boxes, and its median always indicates one of the highest values of all algorithms. Surprisingly, SAGA could not keep up with the other algorithms and showed very poor performance in the experiments, differing greatly from previously published values. To gain a picture of the overall performance of ILS, average values for each reference set have been computed, as shown in tables 5.8 and 5.9. As tables 5.8 and 5.9 confirm, ILS achieves the highest average score for the complete database. Its standard deviation lies in the middle of the measured algorithms. For the BSPS scores (table 5.8), ILS performs best in four cases, among them the case of very low sequence identity, which is usually regarded as the most difficult case (the "twilight zone"). In one case, it is only 0.01 away from the best algorithm.
Comparing the rankings of each algorithm over the different reference sets, ILS achieves the lowest sum of ranks together with T-COFFEE. In cases where ILS does not perform best, it is very close to the best algorithm. The worst rank encountered for ILS is 3. For reference set 5, it scores second, but only 0.01 points away from the best algorithm. Looking at the column scores (table 5.9), which are considered more discriminating than the BSPS scores, leads to similar conclusions: ILS performs best for reference set 1 and is only 0.01 points worse than the best algorithm on set 2. Looking at the ranking of the algorithms, ILS achieves the best average ranking, indicated by the lowest sum of ranks. It is always among the top three algorithms when not being the best. Following the methodology explained in section 4.2, a Wilcoxon signed-rank test is performed to establish the significance of the differences when comparing ILS to the other alignment algorithms. Since ILS achieved the highest average score, the alternative hypothesis is a one-sided hypothesis stating that ILS performs better than the other tested algorithm. Table 5.10 shows the results of these tests. Summarising the results presented in the table, ILS performs significantly better than SAGA, MULTALIN and DIALIGN2, and at least as well as PRRN and T-COFFEE. For ClustalW, the tests showed a significant difference for BSPS scores, but not for CS scores, so it can be said that ILS performs at least as well as ClustalW, with some evidence that it performs even better. Finally, the runtimes needed by the algorithms are examined in table 5.11. TS has not been measured, since it does not succeed in producing an alignment in many cases; when it did succeed, its runtime lay between those of ILS and SAGA. This is in line with the runtime published by Riaz et al. (2004). Comparing pure runtimes is dangerous, since the environment (such as programming language, compiler etc.) greatly influences the measured results.
To minimise these effects, all algorithms are run on the same machine (a multi-processor machine with 12 SUN UltraSparc processors, 400 MHz each; none of the programs uses multi-processing) and compiled with the same compiler using the same flags. The resulting runtimes are shown in table 5.11. Compared to the other iterative algorithms, the runtime of ILS is much better.

Figure 5.1: Boxplot of the average BSPS scores of all 144 alignments in the database (algorithms: PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, TS, ILS).

Figure 5.2: Boxplot of the average CS scores of all 144 alignments in the database (algorithms: PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, ILS).

Figure 5.3: Boxplots for each reference set showing BSPS scores, with panels (a) Set 1, < 25%; (b) Set 1, 20–35%; (c) Set 1, > 35%; (d) Set 1 complete; (e) Set 2; (f) Set 3; (g) Set 4; (h) Set 5. The results are presented in the following order: PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, ILS.
Figure 5.4: Boxplots for each reference set showing CS scores, with panels (a) Set 1, < 25%; (b) Set 1, 20–35%; (c) Set 1, > 35%; (d) Set 1 complete; (e) Set 2; (f) Set 3; (g) Set 4; (h) Set 5. The results are presented in the following order: PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, ILS.

Reference Set    PRRN      ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA     TS       ILS
Set 1, < 25%     0.64(2)   0.63(3)   0.50(4)   0.46(6)   0.63(3)   0.19(6)  0.47(5)  0.65(1)
Set 1, 20–35%    0.94(2)   0.92(3)   0.89(5)   0.91(4)   0.94(2)   0.3(7)   0.88(6)  0.95(1)
Set 1, > 35%     0.97(1)   0.96(2)   0.95(3)   0.96(2)   0.97(1)   0.36(5)  0.93(4)  0.97(1)
Set 1            0.85(2)   0.84(3)   0.78(3)   0.78(3)   0.85(2)   0.28(5)  0.76(4)  0.86(1)
Set 2            0.93(1)   0.93(1)   0.89(2)   0.89(2)   0.93(1)   0.47(3)  0.89(2)  0.93(1)
Set 3            0.83(1)   0.75(4)   0.68(7)   0.70(6)   0.78(2)   0.35(8)  0.72(5)  0.77(3)
Set 4            0.75(6)   0.78(4)   0.83(2)   0.58(7)   0.84(1)   0.24(8)  0.77(5)  0.79(3)
Set 5            0.93(4)   0.86(6)   0.94(3)   0.78(7)   0.96(1)   0.08(8)  0.90(5)  0.95(2)
Ranksum          19        26        29        37        13        50       36       13
Average          0.86(1)   0.83(2)   0.81(3)   0.75(4)   0.86(1)   0.23(5)  0.81(3)  0.86(1)
W/Average        0.87(1)   0.85(2)   0.82(3)   0.78(5)   0.87(1)   0.26(6)  0.80(4)  0.87(1)
Stdev            0.11(2)   0.11(2)   0.15(3)   0.17(4)   0.11(2)   0.12(5)  0.08(1)  0.11(2)

Table 5.8: Average BSPS scores. Scoring is done with core block annotation; the BSPS values are computed by running the algorithms and scoring the computed alignments with the bali_score program.
The small numbers indicate the rank of each algorithm, from 1 (best) to 8 (worst).

Reference Set    PRRN      ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA     ILS
Set 1, < 25%     0.46(2)   0.45(3)   0.32(5)   0.27(6)   0.41(4)   0.04(7)  0.48(1)
Set 1, 20–35%    0.91(2)   0.87(4)   0.82(6)   0.85(5)   0.90(3)   0.05(7)  0.92(1)
Set 1, > 35%     0.95(2)   0.95(2)   0.92(3)   0.95(2)   0.96(1)   0.14(3)  0.95(2)
Set 1            0.77(2)   0.76(3)   0.69(4)   0.69(4)   0.76(3)   0.08(5)  0.78(1)
Set 2            0.60(1)   0.58(3)   0.38(6)   0.42(5)   0.57(4)   0.01(7)  0.59(2)
Set 3            0.63(1)   0.46(4)   0.35(6)   0.37(5)   0.51(2)   0.01(7)  0.49(3)
Set 4            0.46(5)   0.51(4)   0.68(1)   0.21(6)   0.65(2)   0.00(7)  0.52(3)
Set 5            0.84(3)   0.64(4)   0.84(3)   0.51(5)   0.9(1)    0.19(6)  0.86(2)
Ranksum          18        27        34        38        20        49       15
Average          0.69(2)   0.64(3)   0.62(4)   0.51(6)   0.70(1)   0.09(7)  0.69(2)
W/Average        0.72(1)   0.68(2)   0.64(3)   0.57(4)   0.72(1)   0.08(5)  0.72(1)
Stdev            0.19(2)   0.19(2)   0.24(4)   0.27(5)   0.20(3)   0.07(1)  0.20(3)

Table 5.9: Average CS scores. Scoring is done with core block annotation; the CS values are computed by running the algorithms and scoring the computed alignments with the bali_score program. The small numbers indicate the rank of each algorithm, from 1 (best) to 7 (worst).

Score   PRRP   ClustalW   DIALIGN2   MULTALIN   T-COFFEE   SAGA
BSPS    n.s.   *          **         ***        n.s.       ***
CS      n.s.   n.s.       *          ***        n.s.       ***

Table 5.10: Statistical significance of the differences in BSPS and CS scores. The table gives the significance level of a Wilcoxon signed-rank test between ILS and each other alignment algorithm: *: P < 0.05, **: P < 0.01, ***: P < 0.001, n.s.: P > 0.05.
Reference Set    PRRN     ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA      ILS
Set 1, < 25%     6.38     0.86      1.69      0.14      3.03      537.33    26.98
Set 1, 20–35%    2.92     1.56      2.61      0.21      3.71      735.69    38.27
Set 1, > 35%     1.41     1.96      3.71      0.23      4.82      921.88    46.43
Set 1            3.57     1.46      2.67      0.19      3.85      731.63    37.22
Set 2            26.88    7.8       33.17     0.69      74.76     8671.22   282.67
Set 3            66.82    10.71     58.97     0.6       68.95     25946.5   488.07
Set 4            118.03   3.17      9.7       0.53      29.71     7629.59   161.7
Set 5            35.7     3.23      10.58     0.39      23.26     4242.74   113.98
Average          36.88    4.18      17.21     0.4       29.75     6322.35   165.44
W/Average        27.05    3.59      13.66     0.36      24.94     5999      133.74
Stdev            41.22    3.53      20.43     0.21      29.77     8632.06   162.3

Table 5.11: Average runtimes in seconds for each reference set.

Whereas SAGA and Tabu Search can take extremely long runtimes (up to more than one hour), ILS computes an alignment in less than 38 seconds for small sets of sequences (up to five sequences, reference set 1). For large sequence sets, as in reference set 3 (sets of more than twenty sequences with at most 511 amino acids), the runtime is in the order of several minutes. This qualifies ILS for normal use. From the times given for the subsets of reference set 1, it can be seen that the average identity clearly influences the runtime, with lower identity corresponding to shorter runtime.

5.3 Runtime behaviour of ILS

This section analyses the capability of ILS to optimise an alignment towards a high value of the COFFEE objective function. ILS was run 500 times on reference set 1, with the number of iterations set to 20 · lmax, where lmax is the length of the longest sequence. A RESTART acceptance criterion was used, which starts from a new alignment after lmax non-improving iterations (see sections 2.3.4 and 3.4 for details). A first approach to analysing the runtime behaviour of ILS is to analyse the trade-off between solution quality and computation time.
A common way to visualise the relationship between runtime and solution quality is to plot the development of the average solution quality found versus runtime, as shown in figure 5.5.

Figure 5.5: Solution quality trade-off for 1bbt3_ref1. The x-axis gives the runtime, measured in performed local search steps; the y-axis shows the percentage deviation from the best value found, averaged over 500 runs.

The figure shows that the ILS quickly improves the solution quality to close to the best value found in 500 iterations, but starts stagnating after approximately one fourth of the runtime. This picture is typical for the runtime behaviour of ILS; other solution quality trade-offs look similar, e.g. figure 5.6. However, the point at which stagnation begins varies depending on the particular instance. Looking at figure 5.6, the stagnation behaviour starts close to the end of the runtime. A more powerful approach to describing and analysing the runtime behaviour of Metaheuristics are RTDs, as described in section 4.1. We will now use this approach to examine the behaviour of the ILS. Figures 5.7 and 5.8 show the RLDs for the sequence sets 1r69_ref1 and 1pamA_ref1, two sets of four sequences with low identity (< 25%). 1r69 consists of small sequences of up to 78 amino acids, 1pamA of long sequences of at most 572 residues. The figures show probability distributions for reaching a solution quality of 10%, 7.5%, 5%, 3% and 2% deviation from the best value found in all 500 runs, together with a plot of an exponential distribution. The reason why the exponential distribution is plotted will be explained later.
Both figures show that the ILS can easily find a solution no more than 3% away from the maximum in a short time and with very high probability (the measured runtime for this instance was only 1.54 seconds), but when a solution no more than 2% worse than the maximum should be reached, the probability of reaching it in the given time becomes very low. Another conclusion drawn from the figures is that ILS needs an initialisation time

Figure 5.6: Solution quality trade-off for 1pamA_ref1. For explanation, refer to section 5.3 and figure 5.5.

t_i before any solution within the required boundaries is found. Considering the enormous number of possible alignments for a set of sequences, it becomes clear that the probability of finding an optimal solution right at the beginning must be very low. The probability of finding a sufficient solution at the beginning clearly depends on the size of the sequence set, the sequence lengths, and a factor which can be called the hardness of the set. The probability will be considerably higher for sets which have many local optima close to the global optimum, but lower for sets where the global optimum is peaked and the local optima are of much lower quality (the corresponding fitness landscape would look like a flat valley with small hills and one extremely steep mountain). The RLD can be approximately described by an exponential distribution (verified with a chi-square test). When the actual RLD corresponds to an exponential distribution, the use of RESTART to improve the performance can be justified from a theoretical point of view rather than by arguing only with measured performances: in this case, the probability of finding the desired solution when running an algorithm once for a long time t is equal to the probability of running the same algorithm p times, each with runtime t/p (Stützle, 1999a).
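For an exponential RLD this equality can be checked directly; the sketch below is our own illustration, and the scale and parameter values in it are arbitrary assumptions chosen only for demonstration.

```python
import math

# Exponential RLD: G(t) = 1 - exp(-t/m), with m an instance-specific scale
# (here an arbitrary value measured in local search steps).
m, t, p = 1000.0, 3000.0, 4

# Probability of success within a single run of length t:
single = 1 - math.exp(-t / m)

# Probability that at least one of p independent runs of length t/p succeeds:
multi = 1 - math.exp(-(t / p) / m) ** p

print(abs(single - multi) < 1e-12)  # True: restarts neither help nor hurt
```

The equality holds because exp(-(t/p)/m) raised to the p-th power is exactly exp(-t/m); it is the memorylessness of the exponential distribution that makes restarts cost-neutral.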
However, using restarts can change the steepness and thus yield an improvement if the empirical distribution without restarts is less steep than an exponential distribution; this behaviour is explained with an example by Stützle & Hoos (2002) for an ILS algorithm on the TSP.

Figure 5.7: Run-length distribution for 1r69_ref1. The x-axis is the CPU time measured in local search steps and the y-axis the probability of finding a solution of the required quality (curves for 10%, 7.5%, 5%, 3% and 2% deviation, together with an exponential distribution). Please refer to the text for details.

Figure 5.8: Run-length distribution for 1pamA_ref1. The x-axis is the CPU time measured in local search steps and the y-axis the probability of finding a solution of the required quality (curves for 90%, 91%, 92% and 93%, together with an exponential distribution). Please refer to the text for details.

6 Conclusions and Discussion

The intention of this thesis was to transfer a contemporary optimisation method from Computer Science to a problem from the Bioinformatics domain. The method selected is ILS, a fast and high-performing optimisation technique successfully used on many CO problems before.

6.1 Conclusions

The last chapter has shown that ILS is quite capable of optimising an alignment w.r.t. the COFFEE OF. On the other hand, ILS has problems producing biologically viable alignments in some cases, expressed by low scores when comparing the computed alignments to the reference alignments in the database. One factor largely influencing the biological viability is the ability of the OF to assign higher scores to biologically more reliable alignments, a key property of a good scoring function for MSA and a major problem in the design of every OF. Figure 6.1 shows the percentage deviation of the COFFEE scores of the optimised alignments computed by ILS from the COFFEE scores of the reference alignments found in the database.
Figure 6.1: Percentage deviation of optimised alignments from reference alignments. Each bar shows the difference of the optimised alignment's score from the reference score in percent. Positive values indicate that the computed alignment's score is higher than the reference alignment's, negative values the opposite.

On average, alignments computed by ILS show a score 6% above the score of the reference alignments; in some extreme cases, ILS even produces alignments scoring twice as high as the reference alignments. Since the reference alignments are manually refined and derived from structural properties, and thus very close to the biological optimum, this seems to be a major weakness of the COFFEE OF and should probably be a focus of future research. Looking at the results in chapter 5, ILS is not proposed as an alternative to the standard MSA method, ClustalW, which seems to be challenged more by T-COFFEE according to the results measured during this project. But when the standard methods fail to produce meaningful alignments, the current method of choice is SAGA. Since ILS produces better alignments in most cases, and needs considerably less time to do so, ILS is proposed as a replacement for SAGA. The exponential shape of the empirical RLDs gives strong evidence that the actual RLD of ILS is an exponential distribution. This gives a theoretical justification for the RESTART acceptance criterion and indicates that ILS can benefit from parallelisation.

6.2 Discussion

ILS has been shown to be a high-scoring and very reliable method for MSA on all test cases given by the BAliBASE database, always achieving a rank among the top three algorithms. One point to mention in this discussion is that the measured BSPS scores do differ from the values published by Thompson et al.
(1999), especially in the case of SAGA, which achieved much better scores in that study. Nevertheless, this does not falsify the conclusion that ILS is reliable and high-scoring. Comparing the published results of SAGA to ILS, ILS still achieves higher average scores for the complete database; the published results for SAGA are higher in only two out of eight cases. Due to its stochastic nature, ILS is likely to produce slightly different alignments each time it is run. These alignments can be used in cases where all algorithms fail to produce reliable alignments, to be manually refined, or to investigate whether there are similarities among all high-scoring alignments. The latter may lead to a method to detect conserved regions or motifs in a set of sequences. The major drawback of ILS is the long run-time compared to the progressive algorithms. This is an intrinsic property induced by the nature of iterative algorithms, and it is shared, with even longer run-times, by SAGA and TS. The run-time could be improved by running ILS several times in parallel; recalling that the run-time follows an exponential distribution, this would yield a near-optimal speedup.

A Abbreviations

BSPS BAliBASE Sum of Pairs Score
CO Combinatorial Optimisation
COFFEE Consistency based Objective Function For alignmEnt Evaluation
CS Column Score
GA Genetic Algorithm
ILS Iterated Local Search
MSA Multiple Sequence Alignment
OF Objective Function
RLD Run-length distribution
RTD Run-time distribution
SA Simulated Annealing
SP Sum of Pairs
TS Tabu Search
TSP Travelling Salesman Problem

Bibliography

Altschul, S. F. (1989) Gap costs for multiple sequence alignment. Journal of Theoretical Biology, 138 (3), 297–309.

Bahr, A., Thompson, J. D., Thierry, J.-C. & Poch, O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Research, 29 (1), 323–326.

Bentley, J. L. (1992) Fast algorithms for geometric traveling salesman problems.
ORSA Journal on Computing, 4, 387–411.

Blum, C. & Roli, A. (2003) Metaheuristics in combinatorial optimization: overview and conceptual comparison. ACM Computing Surveys, 35 (3), 268–308.

Brodal, G. S., Fagerberg, R., Mailund, T., Pedersen, C. N. & Phillips, D. (2003). Speeding up neighbour-joining tree construction. Technical report, University of Aarhus. ALCOMFT-TR-03-102, ALCOM-FT.

Carrillo, H. & Lipman, D. (1988) The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48 (5), 1073–1082.

Chiarandini, M. & Stützle, T. (2002) An application of iterated local search to graph coloring. In Proceedings of the Computational Symposium on Graph Coloring and its Generalizations, (Johnson, D. S., Mehrotra, A. & Trick, M., eds), pp. 112–125, Discrete Applied Mathematics, Ithaca, New York, USA.

Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16 (22).

den Besten, M., Stützle, T. & Dorigo, M. (2001) Design of iterated local search algorithms: an example application to the single machine total weighted tardiness problem. Lecture Notes in Computer Science, 2037, 441–451.

Glover, F. (1986) Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13 (5), 533–549.

Glover, F. & Laguna, M. (1997) Tabu Search. Kluwer Academic Publishers, London.

Gotoh, O. (1982) An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.

Gotoh, O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Computer Applications in Bioscience, 9 (3), 361–370.

Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignment by iterative refinements as assessed by reference to structural alignments. Journal of Molecular Biology, 264, 823–838.

Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences.
Communications of the ACM, 18 (6), 341–343.

Hoos, H. H. & Stützle, T. (1998) Evaluating Las Vegas algorithms: pitfalls and remedies. In Proceedings of UAI-98, pp. 238–245.

Just, W. (1999) Computational complexity of multiple sequence alignment with SP-score. Journal of Computational Biology, 8 (6), 615–623.

Kim, J., Pramanik, S. & Chung, M. J. (1994) Multiple sequence alignment using simulated annealing. Computer Applications in the Biosciences, 10 (4), 419–426.

Lipman, D., Altschul, S. & Kececioglu, J. (1989) A tool for multiple sequence alignment. Proc. National Academy of Sciences USA, 86 (12), 4412–4415.

Lourenço, H. & Zwijnenburg, M. (1996) Combining the Large-Step Optimization with Tabu-Search: Application to the Job-Shop Scheduling Problem. Kluwer Academic Publishers, Norwell, MA, pp. 219–236.

Lourenço, H. R. D., Martin, O. & Stützle, T. (2001) A beginner's introduction to iterated local search. In Proceedings of the 4th Metaheuristics International Conference – MIC 2001, pp. 1–6, Porto, Portugal.

Lourenço, H. R. D., Martin, O. & Stützle, T. (2002) Iterated local search. In Handbook of Metaheuristics, (Glover, F. & Kochenberger, G., eds), Kluwer Academic Publishers, Norwell, MA, pp. 321–353.

Morgenstern, B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218.

Morgenstern, B., Frech, K., Dress, A. & Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14 (3), 290–294.

Needleman, S. B. & Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino-acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Notredame, C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144.

Notredame, C. & Higgins, D. G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24 (8), 1515–1524.

Notredame, C., Higgins, D. G. & Heringa, J.
(2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302, 205–217.

Notredame, C., Holm, L. & Higgins, D. G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14 (5), 407–422.

Papadimitriou, C. H. & Steiglitz, K. (1982) Combinatorial Optimization (Algorithms and Complexity). Prentice-Hall, New Jersey.

Paquete, L. & Stützle, T. (2002) An experimental investigation of iterated local search for coloring graphs. In Applications of Evolutionary Computing, (Cagnoni, S., Gottlieb, J., Hart, E., Middendorf, M. & Raidl, G. R., eds), vol. 2279, Springer Verlag, pp. 122–131.

Reeves, C. R. (1993) Modern Heuristic Techniques for Combinatorial Problems. John Wiley & Sons, Inc.

Reinert, K., Cordes, F., Huson, D., Lutz, H., Steinke, T., Schliep, A. & Vingron, M. (winter term 2003) Algorithmische Bioinformatik. Lecture notes at FU Berlin.

Riaz, T., Li, K.-B. & Wang, Y. (2004) Multiple sequence alignment using tabu search. In Second Asia-Pacific Bioinformatics Conference (APBC2004), (Chen, Y.-P. P., ed.), vol. 29 of CRPIT, pp. 223–232, Australian Computer Society Inc., Dunedin, New Zealand.

Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4 (4), 406–425.

Sneath, P. H. A. & Sokal, R. R. (1973) Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman, San Francisco.

Stoye, J., Moulton, V. & Dress, A. W. M. (1997) DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Computer Applications in the Biosciences, 13 (6), 625–626.

Stützle, T. (1998) Applying iterated local search to the permutation flow shop problem. AIDA-98-04, Darmstadt University of Technology, Computer Science Department, Intellectics Group.

Stützle, T.
(1999a) Local Search Algorithms for Combinatorial Problems – Analysis, Improvements, and New Applications. Infix, Sankt Augustin. Also PhD thesis, Darmstadt University of Technology, Computer Science Department, 1998.

Stützle, T. (1999b) Iterated local search for the quadratic assignment problem. AIDA-99-03, Darmstadt University of Technology, Computer Science Department, Intellectics Group.

Stützle, T. & Hoos, H. (2002) Analyzing the run-time behaviour of iterated local search for the TSP. In Essays and Surveys on Metaheuristics, Kluwer Academic Publishers, pp. 589–612.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22 (22), 4673–4680.

Thompson, J. D., Plewniak, F. & Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research, 27 (13), 2682–2690.

Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J.-C. & Poch, O. (2001) Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology, 314 (4), 937–951.

Thompson, J. D., Thierry, J.-C. & Poch, O. (2003) RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics, 19 (9), 1155–1161.

Thomson, J. D., Plewniak, F. & Poch, O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15 (1), 87–88.

Wang, L. & Jiang, T. (1996) On the complexity of multiple sequence alignment. Journal of Computational Biology, 1 (4), 337–348.
