Multiple Sequence Alignment by Iterated Local
Search
HS-IKI-MD-04-001
Jens Auer
Submitted by Jens Auer to the University of Skövde as a dissertation towards
the degree of Master of Science with a major in computer science.
September 2004
I certify that all material in this dissertation which is not my own work has been
identified and that no material is included for which a degree has already been
conferred upon me.
Jens Auer
Högskolan i Skövde
Kommunikation och information
Master Dissertation
Multiple Sequence Alignment by Iterated
Local Search
Jens Auer
August 29, 2004
supervised by Björn Olsson
I would like to thank all people involved in this thesis, especially
my supervisor Björn Olsson for his support and comments which
helped me very much to complete this thesis project. I would also
like to thank Thomas Stützle for providing me with some of the
literature used in this thesis.
Abstract
The development of good tools and algorithms to analyse genome and protein data in the form of amino acid sequences coded as strings is one of the major subjects of Bioinformatics. Multiple Sequence Alignment (MSA) is one of the basic yet most important techniques to analyse data of more than two sequences. From a computer scientist's point of view, MSA is merely a combinatorial optimisation problem where strings have to be arranged in a tabular form to optimise an objective function. The difficulty with MSA is that it is NP-complete, and thus impossible to solve optimally in reasonable time under the assumption that P ≠ NP.
The ability to tackle NP-hard problems has been greatly extended by the introduction of Metaheuristics (see Blum & Roli (2003) for a summary of most Metaheuristics), general problem-independent optimisation algorithms extending the hill-climbing local search approach to escape local minima. One of these algorithms is Iterated Local Search (ILS) (Lourenço et al., 2002; Stützle, 1999a, p. 25), a recent, easy-to-implement but powerful algorithm with results comparable or superior to other state-of-the-art methods for many combinatorial optimisation problems, among them the TSP and QAP. ILS iteratively samples local minima by modifying the current local minimum and restarting a local search procedure on this modified solution.
This thesis will show how ILS can be implemented for MSA. After that, ILS will be evaluated and compared to other MSA algorithms using BAliBASE (Thompson et al., 1999), a set of manually refined alignments used in most recent publications of algorithms and in at least two MSA algorithm surveys. The runtime behaviour will be evaluated using runtime-distributions.
The quality of alignments produced by ILS is at least as good as that of the best algorithms available and significantly superior to previously published Metaheuristics for MSA, Tabu Search and a Genetic Algorithm (SAGA). On average, ILS performed best in five out of eight test cases, second for one test set and third for the remaining two.
A drawback of all iterative methods for MSA is the long runtime needed to produce good alignments. ILS needs considerably less runtime than Tabu Search and SAGA, but cannot compete with progressive or consistency-based methods, e.g. ClustalW or T-COFFEE.
keywords: Combinatorial Optimisation, Metaheuristics, Multiple Sequence
Alignment, Iterated Local Search
Contents

1 Introduction
    1.1 Motivation
    1.2 Aim and Objectives

2 Background
    2.1 Multiple Sequence Alignment
        2.1.1 Definition
        2.1.2 Alignment Scores
        2.1.3 Complexity
    2.2 Metaheuristics and Combinatorial Optimisation
        2.2.1 Combinatorial Optimisation
        2.2.2 Heuristic algorithms
        2.2.3 Metaheuristics
    2.3 Iterated Local Search
        2.3.1 GenerateInitialSolution
        2.3.2 EmbeddedHeuristic
        2.3.3 Perturbation
        2.3.4 Acceptance criterion
        2.3.5 Termination condition
    2.4 Related Work
        2.4.1 Simulated Annealing
        2.4.2 SAGA
        2.4.3 Tabu Search

3 Algorithm
    3.1 Construction heuristic
    3.2 Local Search
        3.2.1 Single exchange local search
        3.2.2 Block exchange local search
        3.2.3 Comparison
    3.3 Perturbation
        3.3.1 Random regap perturbation
        3.3.2 Block move perturbation
        3.3.3 Pairwise alignment perturbation
        3.3.4 Re-aligning of complete regions
        3.3.5 Comparison
    3.4 Acceptance criterion
    3.5 Termination condition
    3.6 Iterated Local Search for Multiple Sequence Alignment
    3.7 Implementation

4 Method and Materials
    4.1 Run-time distributions
    4.2 The BAliBASE database of multiple alignments
        4.2.1 Statistical Evaluation

5 Results
    5.1 ILS components
        5.1.1 Construction heuristic
        5.1.2 Local search
        5.1.3 Perturbation
        5.1.4 Acceptance Criterion
    5.2 BAliBASE evaluation of ILS
    5.3 Runtime-behaviour of ILS

6 Conclusions and Discussion
    6.1 Conclusions
    6.2 Discussion

A Abbreviations
List of Figures

2.1 Example Multiple Sequence Alignment
2.2 Structure of the hevein protein
2.3 Pictorial view of ILS
3.1 Move operation of the single exchange local search
3.2 Local Search step with block exchange
3.3 Random regap perturbation
3.4 Perturbation using a random block move
3.5 Perturbation using optimal pairwise alignments
3.6 Block-alignment perturbation
3.7 COFFEE scores for ILS on reference set gal4 ref1
5.1 Boxplot of the average BSPS scores of all 144 alignments in the database
5.2 Boxplot of the average CS scores of all 144 alignments in the database
5.3 Boxplots for each reference set showing BSPS-scores
5.4 Boxplots for each reference set showing CS-scores
5.5 Solution quality trade-off for 1bbt3 ref1
5.6 Solution quality trade-off for 1pamA ref1
5.7 Runlength-distribution for 1r69 ref1
5.8 Runlength-distribution for 1pamA ref1
6.1 Percentage derivation of optimised alignments from reference alignments
List of Tables

2.1 Classification of MSA algorithms
5.1 Comparison of construction heuristics
5.2 Comparison of COFFEE-scores for both local searches
5.3 Runtime comparison for both local searches
5.4 Results of different perturbations
5.5 Runtimes of different perturbations
5.6 Percentage improvements in COFFEE-score for different acceptance criteria
5.7 Runtime for different acceptance criteria
5.8 Average SP-scores for BAliBASE
5.9 Average column scores for BAliBASE
5.10 Statistical significance of the differences in BSPS- and CS-score
5.11 Runtime for BAliBASE
List of Algorithms

2.1 Iterated Local Search
2.2 BETTER acceptance criterion
2.3 RESTART acceptance criterion
3.1 Single exchange local search
3.2 Block exchange local search
3.3 Random regap perturbation
3.4 Random block move perturbation
3.5 Pairwise alignment perturbation
3.6 Realign region perturbation
3.7 Iterated Local Search for MSA
1 Introduction
The development of good tools and algorithms to analyse genome and protein data is
one of the major subjects of Bioinformatics. This data is typically collected in the form
of biological sequences, i. e. strings of symbols representing amino acids or nucleotides.
From the comparison of these sequences, information about evolutionary development,
conserved regions or motifs and structural commonalities can be acquired. This provides
the basis for deducing biological functions of proteins or characterising protein families.
One central technique to compare a set of sequences is Multiple Sequence Alignment (MSA). This work presents a new algorithm for amino acid sequence alignment, where the goal is to arrange a set of sequences representing protein data in a tabular form such that a scoring function is optimised. The problem of Multiple Sequence Alignment is hard to solve; for most scoring functions it has been proven to be NP-hard. Considering computational and biological problems, Notredame (2002) describes three major problems:
- Which sequences should be aligned?
- How is the alignment scored?
- How should the alignment be optimised with respect to the objective function?
This thesis focuses on the third point, summarised as follows by Notredame (2002):
Computing optimal alignments is very time-consuming; it has even been proven to be NP-hard for the most commonly used objective functions, which makes it intractable for exact algorithms when more than a small set of sequences is to be aligned. To handle larger sets, heuristics must be used, which leaves the question of the reliability of these heuristics and their capability to produce near-optimal results.
1.1 Motivation
MSA has been a field of research for a long time and many algorithms have been proposed. Most algorithms proposed for MSA rely on a step-by-step alignment procedure where one sequence at a time is added until all sequences are aligned. In most cases, the sequences are added to the alignment in a precomputed order. This class of algorithms, with ClustalW (Thompson et al., 1994) being the most prominent member, suffers from the usual problem of the greedy approach, i.e. being unable to alter decisions made in an earlier phase even if doing so would produce an improvement. An initial study has shown that a simple local search is able to improve the results of ClustalW, especially for the interesting case of sequence similarity below 25%, where a local search can improve the alignment by 8% (see table 5.2). This approach is very likely to become stuck in a local minimum and could therefore be improved by using algorithms able to escape a local minimum.
To overcome the problems of local minima, several stochastic schemes have been proposed in the context of combinatorial optimisation. These schemes are often described as Metaheuristics, since they describe a general algorithmic framework which is independent of a specific optimisation problem. Established Metaheuristics are simulated annealing, tabu search, guided local search, ant colony optimisation and genetic/evolutionary algorithms. Metaheuristics are an active research field in AI and are used in many different domains, both in research and industrial applications (see Blum & Roli (2003) for an overview of Metaheuristics in Combinatorial Optimisation).
Even though MSA is of such remarkable importance, only three applications of contemporary Metaheuristics to MSA were found in the literature search for this project. Kim et al. (1994) used simulated annealing, Notredame & Higgins (1996) implemented a genetic algorithm and recently, Riaz et al. (2004) studied Tabu Search. The results of Notredame and Riaz are comparable with other state-of-the-art techniques and encourage further investigation of other Metaheuristics.
In this thesis, Iterated Local Search (ILS) (Lourenço et al., 2002) is implemented and evaluated as an algorithm for MSA. Starting from a local minimum generated by running an embedded heuristic procedure on a heuristically computed initial alignment, ILS iteratively performs the following steps:
1. Perturbation: Modify the current solution to escape the local minimum.
2. EmbeddedHeuristic: Run the embedded heuristic on the perturbed solution to reach the next local minimum.
3. Acceptance: Select either the solution before or after running the embedded heuristic as the starting solution for the next iteration.
Section 2.3 describes ILS as a Metaheuristic in detail, and section 3.6 presents the algorithm implemented in this project for MSA. The idea behind ILS, the search in the reduced space of local minima, can be found in many algorithms in combinatorial optimisation under different names, including iterated descent, large-step Markov chains, iterated Lin-Kernighan and chained local optimisation. Despite the long history of the basic idea, ILS is a relatively new Metaheuristic and has not been used for MSA. Lourenço et al. (2001) give a brief introduction to ILS.
ILS has shown very good results on many combinatorial optimisation problems, e.g. the travelling salesman problem (Stützle & Hoos, 2002), where it constitutes one of the best algorithms currently available, and the quadratic assignment problem (Stützle, 1999b), one of the hardest problems known. Other applications include scheduling problems of different kinds (den Besten et al., 2001; Stützle, 1998) and graph colouring (Chiarandini & Stützle, 2002; Paquete & Stützle, 2002), where ILS can compete with the best algorithms available. Lourenço et al. (2002) describe these and more applications and give a good overview of ILS.
Compared to previously applied Metaheuristics, ILS has several advantages. It is a modular algorithm with very simple basic steps and few parameters to adjust, as opposed to e.g. genetic algorithms. Another benefit which ILS shares with other Metaheuristics is its independence from the objective function to be optimised.
1.2 Aim and Objectives
The aim of this dissertation project is to implement and evaluate ILS (Lourenço et al., 2002) for MSA. The algorithm is evaluated using a reference set of alignments, provided by the BAliBASE database of multiple sequence alignments (Bahr et al., 2001; Thompson et al., 1999). ILS is compared to established state-of-the-art algorithms for MSA w.r.t. biological validity, by measuring the conformance of the computed alignments with those found in the database, and w.r.t. computational properties, by using runtime distributions (Hoos & Stützle, 1998).
To fulfil the aim, the following objectives have to be considered:
- Choose a good construction heuristic by testing several alternatives.
- Construct a good local search procedure which is able to improve an alignment very fast but also reaches high solution quality. This involves testing different neighbourhood structures and design decisions such as pivoting strategies or neighbourhood pruning techniques.
- Implement a perturbation procedure which fits the local search procedure used and helps to find high-quality solutions.
- Choose an acceptance criterion which balances intensification and diversification to explore interesting regions of the search space efficiently.
- Determine a cut-off to end the ILS.
Chapter 3 presents several alternatives for each of these points. Chapter 5 compares the alternatives, taking into account the intended use as a component of the ILS. Considering the results, the complete ILS is presented in algorithm 3.7 in chapter 3.
Altogether, three construction heuristics, two local search procedures (with two pivoting strategies), four perturbation strategies and three acceptance criteria are tested, considering the intended use as part of an ILS algorithm.
2 Background
This chapter introduces the basic concepts needed in this thesis and provides an overview of previous related work.
2.1 Multiple Sequence Alignment
MSA is a fundamental tool in molecular biology and bioinformatics. It has a variety of applications, e.g. finding characteristic motifs and conserved regions in protein families, analysing evolutionary relations between proteins and secondary/tertiary structure prediction. The following section shows concrete examples of MSA in widely known problems of molecular biology.
A common way to represent the evolution of a set of proteins is to arrange them in a phylogenetic tree. The first step in computing the tree is to measure all pairwise distances of the given set of sequences and derive a distance matrix from this. The distances can be gathered from an MSA by comparing each pair of aligned sequences in the alignment. From this distance matrix, a phylogenetic tree can be computed, e.g. by UPGMA (Sneath & Sokal, 1973) or Neighbour Joining (Saitou & Nei, 1987).
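As an illustration of this first step, a distance matrix can be derived from an MSA by counting mismatching columns between each pair of aligned sequences. The following sketch is illustrative only (the simple identity-based distance and all function names are assumptions, not code from this thesis):

    # Sketch: derive a distance matrix from an MSA. The distance of two
    # aligned sequences is the fraction of columns in which they differ;
    # columns where both sequences have a gap are skipped.
    def pairwise_distance(s1, s2, gap="-"):
        pairs = [(a, b) for a, b in zip(s1, s2) if not (a == gap and b == gap)]
        return sum(1 for a, b in pairs if a != b) / len(pairs)

    def distance_matrix(alignment):
        n = len(alignment)
        d = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d[i][j] = d[j][i] = pairwise_distance(alignment[i], alignment[j])
        return d

    # The resulting matrix can be fed to e.g. UPGMA or Neighbour Joining.
    print(distance_matrix(["GC-TA", "GCCTA", "G--TA"]))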
MSA is a basic tool when grouping proteins into functionally related families. From an MSA, it is easy to construct a profile or a consensus, which is a pseudo-sequence representing the complete alignment. The consensus or the profile in turn can be used to identify conserved regions or motifs. MSA can also be used when deciding whether a protein belongs to a family, by aligning the query sequence together with known proteins belonging to the family.
MSA can also be used as a starting point for secondary or tertiary structure prediction. When aligned together with related sequences, the conserved motifs can give a hint of secondary or tertiary structure, since the structure is usually conserved during evolution, but the actual sequence is likely to vary in certain parts. Figure 2.1 illustrates this with the cysteines (printed yellow) aligned together. These amino acids influence the structure of a protein by forming disulfide bridges.
Figures 2.1 and 2.2 show an example MSA of a family of related sequences together with a picture of the three-dimensional tertiary structure of one of the aligned proteins, hevein.
As defined later, all sequences in the MSA are adjusted to the same length by inserting gaps (symbolised by dots in this example). Similar amino acids are aligned together in the same column, and regions of high amino acid similarity can be identified. Most important are the black-printed columns of cysteines, since these parts of the proteins influence the tertiary structure by forming bonds inside the proteins. These bonds are printed green in figure 2.2.
Figure 2.1: Example Multiple Sequence Alignment of N-acetylglucosamine-binding proteins. The alignment shows eight columns of cysteines, which form disulfide bridges and are an essential part of the protein's structure. The dots symbolise gaps. Taken from Reinert et al. (2003).
Figure 2.2: Structure of one of the proteins from figure 2.1. The disulfide bridges formed by the cysteines are printed in green. Taken from Reinert et al. (2003).
2.1.1 Definition
This section defines the domain, i.e. the MSA problem, in formal terms, so that it can be used to describe the algorithm precisely. We first formalise the concept of a biological sequence.
Definition 2.1 (Alphabet and Sequence) Let Σ be an alphabet, i.e. a finite set of symbols, with Σ ≠ ∅. A sequence over the alphabet Σ is a finite string of symbols from Σ. The length of a sequence s, denoted |s|, is the number of symbols in the sequence. For a sequence s, s[i] denotes the i-th symbol in this sequence.
In molecular biology, the alphabet Σ usually consists of the set of twenty-three amino acids found in proteins, extended with a special symbol as a placeholder for any amino acid (usually denoted as "X" or "-").
We can now define the multiple alignment of a set of sequences. We first define a variation of multiple alignment, where columns consisting completely of gap symbols are allowed, and restrict this in the final definition of multiple alignment.
Definition 2.2 (Pseudo Multiple Alignment) Let Σ be an alphabet as defined in definition 2.1, and S = {s_1, ..., s_k} a set of k sequences over this alphabet. A pseudo multiple alignment is a set of sequences S' = {s'_1, ..., s'_k} over the alphabet Σ' = Σ ∪ {-}, where:
- k ≥ 2
- All sequences in S' have the same length: ∀ s'_i, s'_j ∈ S' : |s'_i| = |s'_j|
- ∀ i ≤ k, s_i ∈ S: s'_i can be reduced to the corresponding sequence s_i in S by removing all gap symbols from s'_i
- The order of symbols in every sequence s'_i in S' is the same as in the corresponding sequence s_i.
The number of sequences in the alignment is |S'|; the number of columns is defined as the length of the sequences in S'. PMSA(S) is the set of all possible pseudo alignments of the sequences S.
The set of all possible alignments PMSA(S) is an infinite set, since the number of columns consisting only of gaps is not restricted, which also implies that there is no bound on the length of the sequences in a pseudo alignment.
A Multiple Sequence Alignment is a pseudo alignment with the restriction that no column may consist only of gap characters.
Definition 2.3 (Multiple Sequence Alignment) Let n be the number of sequences in the alignment and k the number of columns. A set S' of sequences is a multiple alignment iff it is a pseudo alignment and it fulfils the following condition:
∀ i ≤ k ∃ j ≤ n : s'_j[i] ≠ -
MSA(S) denotes the set of all possible Multiple Sequence Alignments of the sequences S.
Despite the infinite set of pseudo alignments, the number of Multiple Sequence Alignments is restricted by the lengths and the number of sequences and is hence finite, as is the length of the alignment.
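To make definitions 2.2 and 2.3 concrete, the following sketch (added here for illustration; not part of the thesis) checks whether a set of gapped strings is a valid multiple sequence alignment of a given set of sequences:

    # Sketch: validate definitions 2.2 and 2.3 for gapped strings.
    GAP = "-"

    def is_msa(aligned, originals):
        # Definition 2.2: at least two sequences, all of equal length,
        # each reducible to its original sequence by removing all gaps
        # (which also preserves the order of the remaining symbols).
        if len(aligned) < 2 or len(aligned) != len(originals):
            return False
        width = len(aligned[0])
        if any(len(row) != width for row in aligned):
            return False
        if any(row.replace(GAP, "") != orig
               for row, orig in zip(aligned, originals)):
            return False
        # Definition 2.3: no column may consist only of gap symbols.
        return all(any(row[i] != GAP for row in aligned) for i in range(width))

    print(is_msa(["GC-TA", "G-CTA"], ["GCTA", "GCTA"]))  # True
    print(is_msa(["GC-TA", "GC-TA"], ["GCTA", "GCTA"]))  # False: all-gap column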
When only two sequences are aligned, the MSA is called a pairwise alignment. Pairwise alignment is not discussed in this thesis, because it can be computed efficiently by dynamic programming (Needleman & Wunsch, 1970).
In order to define an optimal multiple alignment, a quality measure must be defined which assigns a value to an alignment. The measure is called a scoring or Objective function (OF), because it is the mathematical goal to be optimised. The term OF will be preferred in this thesis because the MSA problem is tackled as a Combinatorial Optimisation (CO) problem.
Definition 2.4 (Scoring Function, Objective Function) Let Σ' be defined as in definition 2.2. For a number of sequences n, a function
f : (Σ'*)^n → ℝ
is called an Objective function (OF). Σ'* is defined as the set of all words obtained by concatenation of symbols from the alphabet Σ'.
Typical OFs are the (weighted) Sum of Pairs (SP) (Altschul, 1989), which scores the MSA by scoring all pairs of amino acids induced by the columns of the alignment, and the Consistency based Objective Function For alignmEnt Evaluation (COFFEE) (Notredame et al., 1998), which compares the alignment to a library of reference alignments. OFs are discussed in more detail in section 2.1.2.
Finally, an optimal alignment can now be defined. The term optimal can mean either a minimum or a maximum value, depending on the objective function used. As we use a scoring function (as opposed to a cost function) in our alignment algorithm, where better alignments receive a higher score, we define optimality by maximisation. (This causes no loss of generality, since a minimisation problem can easily be transformed into a maximisation problem.)
Definition 2.5 (Optimal Multiple Sequence Alignment) For a set of sequences S = {s_1, ..., s_k} over an alphabet Σ and an OF f, S' = {s'_1, ..., s'_k} is an optimal multiple sequence alignment iff:
- S' ∈ MSA(S)
- ∀ S* ∈ MSA(S) : S* ≠ S' → f(S*) ≤ f(S')
The second point states that f(S') is a maximum value of the function f on the set of all alignments over S. This definition includes the possibility of more than one optimal MSA.
2.1.2 Alignment Scores
To estimate the quality of an alignment, each alignment is assigned a value based on an OF, where a higher value corresponds to better quality. The OF measures the quality of an alignment and should reflect how close this alignment is to the optimal biological alignment.
The most widely used scoring function is the Sum-of-Pairs (SP) score (Altschul, 1989), a direct extension of the scoring methods used in pairwise alignment. For each pair of sequences, the similarity between the aligned pairs of amino acids is computed, in most cases by using a scoring matrix. The pairwise scores are then added over each aligned pair of residues in every pair of sequences.
SP scores are used in many variations in MSA algorithms, e.g. using weights for the pairs of sequences to prevent sets of similar sequences from dominating the whole alignment, or different methods of assigning penalties to gaps in an alignment.
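As an illustration (a toy sketch, not thesis code; the match/mismatch values and the linear gap penalty are assumptions, where real scorers would use e.g. a substitution matrix and affine gap costs), an unweighted SP scorer can be written as:

    # Sketch: unweighted Sum-of-Pairs score with toy substitution scores
    # and a linear gap penalty.
    from itertools import combinations

    GAP = "-"

    def sp_score(alignment, match=2, mismatch=-1, gap_penalty=-2):
        score = 0
        for s1, s2 in combinations(alignment, 2):  # every pair of sequences
            for a, b in zip(s1, s2):               # every column
                if a == GAP and b == GAP:
                    continue                       # gap-gap pairs score nothing
                elif a == GAP or b == GAP:
                    score += gap_penalty
                else:
                    score += match if a == b else mismatch
        return score

    print(sp_score(["GC-TA", "G-CTA", "GCCTA"]))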
The dependence on a scoring matrix is a major drawback of SP scores, since these scoring matrices are very general and created by statistical analysis of large sets of alignments. The degree to which they match the properties of the sequences one is to align varies. Morgenstern et al. (1998) argue that scoring matrices fail to produce biologically reasonable alignments when the sequences have a low global similarity and share only local similarities. Notredame (2002) identifies the scoring of gaps when using the SP score as a major source of concern.
COFFEE is a consistency-based approach to scoring an MSA. Instead of measuring the similarity of each pair of amino acids induced by an alignment, COFFEE (Notredame et al., 1998) measures the correspondence between each induced pairwise alignment and a library of computed pairwise alignments. Each pair in the library is assigned a weight that is a function of its quality; the weight is equal to the percentage identity of the aligned sequences. The score of an alignment is computed over all pairs of induced alignments, i.e. all pairs of aligned sequences in this multiple alignment. For each pair of residues aligned in both the library and the induced alignment, the weight is added to the score. To normalise the score, it is divided by the maximal value, obtained when all pairs in the multiple alignment are found in the library alignments.
Formalised, the COFFEE score for an alignment A can be formulated as:
Definition 2.6 (COFFEE score for an MSA)

COFFEE(A) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot SCORE(A_{i,j})}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot LEN(A_{i,j})}

with A_{i,j} the pairwise alignment obtained by taking sequences i and j from A, W_{i,j} the weight associated with that alignment, LEN(A_{i,j}) the length of this alignment, and SCORE(A_{i,j}) the number of aligned pairs of residues found both in A_{i,j} and the library alignment. (For the optimal score to be 1, the alignments in the library must be totally consistent, such that each residue is aligned to the same residue in every pairwise alignment.)
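A direct transcription of definition 2.6 might look as follows (an illustrative sketch, not the thesis implementation; representing each alignment as a set of residue-index pairs is an assumption, and LEN is simplified to the number of residue pairs in the induced alignment):

    # Sketch: COFFEE score following definition 2.6. Every pairwise
    # alignment is represented as a set of residue-index pairs, i.e.
    # (position in sequence i, position in sequence j) for each aligned
    # pair of residues.
    def coffee(induced, library, weights):
        # induced, library: dict mapping (i, j) -> set of residue pairs
        # weights: dict mapping (i, j) -> percentage identity W_ij
        numerator = 0.0
        denominator = 0.0
        for pair, aligned in induced.items():
            w = weights[pair]
            numerator += w * len(aligned & library[pair])  # SCORE(A_ij)
            denominator += w * len(aligned)                # simplified LEN(A_ij)
        return numerator / denominator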
Using a library of alignments computed from the set of sequences under analysis prevents the scoring scheme from being too general and takes the properties of the sequences into account. Notredame et al. (1998) used a library of pairwise alignments created by ClustalW, but Notredame et al. (2000) suggest a library containing both global alignments from ClustalW and local alignments, and showed that this actually led to an increase in the quality of the computed alignments.
When considering COFFEE as the OF for a local search, it is a great benefit that one can restrict the computational effort to those parts which have been changed by the procedure when performing one step, e.g. to the two columns where a gap and a residue have been exchanged.
One of the problems of both COFFEE and SP scoring is their assumption of uniform and time-invariant substitution probabilities for all positions in an alignment. The substitution probability depends on structural and functional properties of the proteins and can vary from zero for some regions to complete variability in others.
norMD (Thompson et al., 2001) tries to reflect this by using a column-based approach, where the probabilities of substitutions, and thus the scores for residue pairs, are column-dependent.
The norMD score is computed by measuring mean distances in a continuous space imposed by each sequence in the alignment. norMD incorporates affine gap costs and a similarity measure for the sequences which is based on a hash score derived from dot plots of each pair of sequences. Finally, the score is normalised to a value between zero and one.
The strength of this method is its independence from the number and length of the sequences in the alignment, and the assignment of column-based scores which can be used to find well or badly aligned regions inside an alignment.
A major drawback which prohibits the use of norMD as a scoring function for the ILS is its complexity. Due to the handling of gap costs and the computation of the hash scores during scoring, the whole alignment has to be processed completely, which leads to very large run-times. Taking this into account, COFFEE is preferred as the OF for this project.
2.1.3 Complexity
The alignment of only two sequences is a common special case of MSA, and efficient algorithms have been proposed, e.g. by Needleman & Wunsch (1970) or Gotoh (1982). These algorithms use dynamic programming to efficiently compute all possible alignments; their asymptotic complexity is O(l²), where l is the length of the longest sequence, needing also O(l²) space (the space complexity can be reduced to linear space consumption by a method proposed by Hirschberg (1975)). Generalising the special case to more than two sequences is easily possible, but has the drawback of exponential growth of complexity with the number of sequences: the complexity for dynamic programming with n sequences of maximum length l is O(lⁿ). Considering this, only small sets of sequences can be aligned with dynamic programming.
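For reference, the quadratic dynamic-programming scheme for the pairwise case can be sketched as follows (a minimal global-alignment scorer with toy parameters and a linear gap penalty; the full Needleman & Wunsch algorithm additionally traces back through the matrix to recover the alignment itself):

    # Sketch: O(l^2) dynamic programming for pairwise global alignment.
    # m[i][j] holds the best score for aligning the first i symbols of a
    # with the first j symbols of b.
    def nw_score(a, b, match=2, mismatch=-1, gap=-2):
        m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            m[i][0] = i * gap
        for j in range(1, len(b) + 1):
            m[0][j] = j * gap
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                m[i][j] = max(diag, m[i - 1][j] + gap, m[i][j - 1] + gap)
        return m[len(a)][len(b)]

    print(nw_score("GCATGCU", "GATTACA"))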
Computing alignments for more than two sequences has been shown to be NP-hard by Wang & Jiang (1996) for metric scoring matrices, and by Just (1999) for a more general class of scoring matrices, which gives strong reasons to believe that there will be no efficient algorithm to solve this problem exactly (unless P = NP).
Even if Just's proof cannot be directly transferred to MSA with COFFEE as OF, it gives strong reason to believe that computing optimal COFFEE alignments is computationally hard and needs advanced techniques.
2.2 Metaheuristics and Combinatorial Optimisation
This section explains and defines two key terms of this thesis, namely Metaheuristic and Combinatorial Optimisation (CO).
2.2.1 Combinatorial Optimisation
The problem of optimisation can generally be formulated as optimising a function f(X), where X is a set of variables. (In this context, "optimising" must be seen in a very broad view as the search for a best assignment of values to the variables in X.) Usually the possible set of values is restricted by constraints, expressed by inequalities such as g(X) ≤ c, c ∈ ℝ, where g(X) is a general function. Optimisation problems can be divided into two categories: those where the variables are from a continuous domain, and those with a discrete domain; the latter is called combinatorial. According to Papadimitriou & Steiglitz (1982), CO is the search in a finite (or countably infinite) set for an object, typically an integer, a set, a permutation or a graph, whereas in the continuous case, the solutions are usually a set of real numbers or even a function. Both cases are tackled with very different methods, and we focus only on CO problems.
Blum & Roli (2003) define a CO problem as follows:
Definition 2.7 (Combinatorial Optimisation problem) A CO problem P = (S, f) is defined by:
- a set of variables X = {x_1, ..., x_n}
- variable domains D_1, ..., D_n
- constraints among variables
- an objective function f to be optimised, where f : D_1 × ... × D_n → ℝ
S is the set of all feasible assignments to the variables:
S = {s = {(x_1, v_1), ..., (x_n, v_n)} | v_i ∈ D_i, s satisfies all constraints}
S is usually called the search space or solution space, and the elements of S are feasible solutions (also called candidate solutions).
To conform with our definition of optimal MSA (see definition 2.5), we again focus solely on maximisation problems. To solve an optimisation problem means to find the solution s* ∈ S with maximum value, that is f(s*) ≥ f(s) ∀ s ∈ S. The solution s* is called a globally optimal solution. If there is more than one globally optimal solution, the set S* ⊆ S will be called the set of globally optimal solutions.
2.2.2 Heuristic algorithms
When tackling CO problems, one generally distinguishes between complete and approximate algorithms. Complete algorithms are guaranteed to find an optimal solution for every instance of a CO problem in finite time. Among the most widely used are the Branch & Bound algorithms, which are also called A* in AI. Guaranteeing optimal solutions is often not possible, since many CO problems are NP-hard, which implies that no polynomial-time algorithms exist if we assume P ≠ NP. Instead, the guarantee of finding optimal solutions is relaxed to finding good solutions close to the optimal ones.
Approximate algorithms are often called heuristic algorithms, especially in Artificial Intelligence and Operations Research. The word heuristic has Greek roots, with "to find" or "to discover" as its original meaning. Reeves (1993) defines a heuristic as:
Definition 2.8 (Heuristic) A heuristic is a technique seeking good (i.e. near-optimal) solutions at a reasonable computational cost without being able to guarantee either feasibility or optimality, or even in many cases to state how close to optimality a particular feasible solution is.
Heuristics usually incorporate problem-specific knowledge beyond the knowledge found in the problem definition to reach a good trade-off between solution quality and computational complexity. The following presentation of heuristic methods follows those by Blum & Roli (2003) and Stützle (1999a).
When considering heuristic methods, Stützle (1999a) distinguishes between constructive and local search methods. Constructive methods generate solutions from an initially empty solution by adding components until a complete solution is reached; the most prominent is the greedy approach, which always adds the component giving the best improvement in terms of the OF. Constructive methods are usually among the fastest approximation algorithms, but the computed solutions are of inferior quality compared to local search methods. Many MSA algorithms are constructive algorithms, among them ClustalW, PileUp and MultAlign.
Local search methods instead perform an iterative search. They start from an initial solution and try to improve the current solution by modifying it into a solution with a better score. More formally, a local search exchanges the current solution with a better solution which can be computed from the current solution by applying a modification operator to it. The set of all solutions computable from the current solution s is called the neighbourhood of s. The neighbourhood structure is often visualised as a graph on top of the set of all solutions, where an edge between two solutions indicates that they are neighbours, i.e. that one solution can be computed from the other by applying the modification operator. The neighbourhood is defined according to Blum & Roli (2003) and Stützle (1999a) as:
Definition 2.9 (Neighbourhood) A neighbourhood structure is a function N : S → 2^S that assigns to every s ∈ S a set of neighbours N(s) ⊆ S. N(s) is called the neighbourhood of s.
The iterative process goes on until no better solution in the neighbourhood of the current solution can be found. This final solution of the local search is called a locally maximal solution:
Definition 2.10 (Local Optimum) A locally maximal solution (or local maximum) w.r.t. a neighbourhood structure N is a solution s such that ∀ s' ∈ N(s) : f(s') ≤ f(s). s is called a strict local maximum iff ∀ s' ∈ N(s) : f(s') < f(s).
A neighbourhood where every local optimum is also a global optimum is called an exact neighbourhood. Exact neighbourhoods often have exponential size, so that searching the neighbourhood requires exponential time in the worst case, which makes them unusable in practice.
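Expressed as code, iterative best-improvement local search takes the following generic form (an abstract sketch using the maximisation convention of definition 2.10; the objective f and the neighbourhood function are parameters):

    # Sketch: best-improvement local search over an abstract neighbourhood.
    # f: objective function to maximise; neighbours(s): returns N(s).
    def local_search(s, f, neighbours):
        while True:
            candidates = list(neighbours(s))
            best = max(candidates, key=f) if candidates else None
            if best is None or f(best) <= f(s):
                return s     # s is a local maximum w.r.t. N
            s = best         # move to the better neighbour

A first-improvement variant would instead move to the first improving neighbour found; such pivoting strategies are among the design decisions tested in chapter 3.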
2.2.3 Metaheuristics
One of the main features of heuristics as described in the last section is that they incorporate problem-specific knowledge. This makes them specific to the problem and prohibits the transfer of a successful heuristic from one problem to another.
In recent years, much effort has been spent on the development of general heuristic methods which are applicable to a wide range of CO problems. These general-purpose methods are now called metaheuristics (Glover, 1986) (some earlier publications use the term "modern heuristics", e.g. Reeves (1993)).
The Metaheuristics Network defines metaheuristic as:
Definition 2.11 (Metaheuristic) A metaheuristic is a set of concepts that can be used to define heuristic methods that can be applied to a wide set of different problems. In other words, a metaheuristic can be seen as a general algorithmic framework which can be applied to different optimization problems with relatively few modifications to make them adapted to a specific problem. (Metaheuristics Network web site: http://www.metaheuristics.net, last accessed 25th April 2004)
Blum & Roli (2003, p. 270f) compare many definitions of metaheuristics and summarise the main properties which characterise a metaheuristic as follows:
- Metaheuristics are strategies that "guide" the search process.
- The goal is to efficiently explore the search space in order to find (near-)optimal solutions.
- Techniques which constitute metaheuristic algorithms range from simple local search procedures to complex learning processes.
- Metaheuristic algorithms are approximate and usually non-deterministic.
- They may incorporate mechanisms to avoid getting trapped in confined areas of the search space.
- The basic concepts of metaheuristics permit an abstract-level description.
- Metaheuristics are not problem-specific.
- Metaheuristics may make use of domain-specific knowledge in the form of heuristics that are controlled by the upper-level strategy.
- Today's more advanced metaheuristics use search experience (embodied in some form of memory) to guide the search.
Many metaheuristics have been proposed during the last years and are well known today, e.g. Simulated Annealing (SA) or Genetic Algorithms (GA). Blum & Roli (2003) give an overview of the most successful metaheuristics and show several ways of categorising them, depending on which aspect of the algorithm is in focus. Metaheuristics differ in the number of feasible solutions examined in each step (population-based vs. single-point search), in their use of the objective function (dynamic vs. static objective function), in the number of neighbourhood structures used during the search (one vs. various neighbourhoods), in the usage of memory (memory usage vs. memory-less methods) and in the domain from which the underlying idea stems (nature-inspired vs. non-nature-inspired).
Common metaheuristics are (in alphabetical order): ant algorithms, evolutionary algorithms, Greedy Randomised Adaptive Search Procedure (GRASP), Guided Local Search, Iterated Local Search (ILS), Simulated Annealing (SA), Tabu Search (TS) and Variable Neighbourhood Search (VNS). Section 2.4 shows examples of metaheuristics for MSA.
2.3 Iterated Local Search
A major problem with local search algorithms is the probability of getting trapped in a local optimum. One possibility to prevent this is to let the local search accept worsening steps, as done in SA or Tabu Search (TS). Another straightforward way is to modify the current solution and start a new local search from this modified solution. The modification should be done in such a way that it leads to a solution distant enough that the local search cannot easily undo the changes and return to the local optimum.
Figure 2.3: Pictorial view of ILS. After being trapped in the local optimum s, the current solution is modified, which leads to a new point in the search space, marked s'. The next run of the local search leads to the new local optimum s'', which in turn can be modified again.
ILS (Lourenço et al., 2002; Stützle, 1999a, p. 25) is a systematic application of this idea, where an embedded heuristic search is applied repeatedly to a previously computed, modified solution. This results in a random walk over solutions that are locally optimal w.r.t. the embedded heuristic. ILS is a very general framework which includes other metaheuristics such as iterated Lin-Kernighan, variable neighbourhood search and large-step Markov chains.
ILS is expected to perform much better than simply restarting the embedded heuristic from a randomly computed position in the search space (Lourenço et al., 2002). Lourenço et al. (2002) point out that random restarts of local search algorithms have a distribution that becomes arbitrarily peaked about the mean when the instance size goes to infinity. Most local optima found have a value close to the mean local optimum, and only few optima are found with significantly better values. The probability of finding a local optimum with a value close to the mean is very large, whereas the probability of finding a significantly better local optimum converges to zero with increasing instance size. They conclude that it is very unlikely to find a solution even slightly better than the mean local optimum for large problem instances, and that a biased sampling of the space of local optima could lead to better solutions.
ILS uses the idea of a random walk in the space of local optima. From the current solution s, a new starting point s' for the local search is computed by modifying s using a perturbation procedure (s' is usually not a local optimum, but belongs to the set S of feasible solutions). The embedded heuristic is then run on s', which leads to a new local optimum s''. If s'' passes an acceptance test, e.g. yields an improvement compared to the starting point s, it becomes the next element of the walk; otherwise s'' is discarded and the walk restarts with s. Figure 2.3 shows one step of the random walk. The final solution of the algorithm is the best solution found during this walk.
Algorithm 2.1 An algorithmic presentation of ILS.
procedure IteratedLocalSearch
    s0 ← GenerateInitialSolution
    s ← LocalSearch(s0)
    sbest ← s
    while termination condition not met do
        s' ← Perturbation(s, history)
        s'' ← EmbeddedHeuristic(s')
        if f(s'') > f(sbest) then
            sbest ← s''
        end if
        s ← AcceptanceCriterion(s, s'', history)
    end while
    return sbest
end procedure
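A runnable transcription of algorithm 2.1 (a minimal sketch; the four components are supplied as functions, the names mirror the pseudocode rather than any concrete implementation from this thesis, and a fixed iteration count stands in for the termination condition):

    # Sketch: the ILS loop of algorithm 2.1 with pluggable components.
    def iterated_local_search(f, generate_initial_solution, local_search,
                              perturbation, acceptance_criterion,
                              iterations=1000):
        history = []                           # optional search memory
        s = local_search(generate_initial_solution())
        s_best = s
        for _ in range(iterations):            # termination condition
            s_perturbed = perturbation(s, history)
            s_new = local_search(s_perturbed)  # embedded heuristic
            if f(s_new) > f(s_best):
                s_best = s_new
            history.append(f(s_new))
            s = acceptance_criterion(s, s_new, history)
        return s_best

Here the embedded heuristic is instantiated with a local search, which is the common case described in section 2.3.2.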
This process is formalised in algorithm 2.1. The algorithm consists of the components GenerateInitialSolution, LocalSearch, Perturbation, AcceptanceCriterion and the termination condition. In order to reach high performance and solution quality, the Perturbation and AcceptanceCriterion can make use of the history of the search, e.g. by emphasising diversification or intensification based on the time passed since the last improvement was found. The acceptance criterion acts as a counterbalance to the perturbation procedure by filtering the new local optima. The following paragraphs describe each component of the algorithm.
2.3.1 GenerateInitialSolution
GenerateInitialSolution is a construction heuristic which computes the starting point for the first run of the embedded heuristic. The construction heuristic used should be as fast as possible but produce good starting points for the embedded heuristic; standard choices are either random solutions or greedy heuristics. In most cases, greedy heuristics are preferred, since the embedded heuristic then reaches a local optimum in fewer steps than when starting from random solutions. It is worth noting that the best greedy heuristic may not give the best starting point for a local search. Lourenço et al. (2002) give an example where using one of the best construction heuristics known for the Travelling Salesman Problem (TSP) leads to worse results than using a random starting solution as input for a local search procedure. Another point to consider is the time a greedy heuristic needs. The influence of GenerateInitialSolution should decrease with the run-time of the ILS: short runs depend heavily on good starting points, whereas in the long run the starting point becomes less important.
2.3.2 EmbeddedHeuristic
In most cases, the embedded heuristic is an iterative improvement local search procedure, but any other heuristic can be used, since the embedded heuristic can mostly be treated as a black box. Lourenço & Zwijnenburg (1996) used a TS in their application to the job-shop scheduling problem.
However, there are points to consider when optimising the performance of an ILS. The local search should be selected with respect to the perturbation used: it should not be able to easily undo the changes of the perturbation.
A trade-off between speed and solution quality of the embedded heuristic must be found. Usually a heuristic with better results is preferable, but if the heuristic needs too much time, the ILS is unable to examine enough local optima to find high-quality solutions.
Finally, one has to consider the interrelation between the embedded heuristic and the perturbation when optimising an ILS. In many cases, the embedded heuristic uses some tricks to prune the search space and speed up the search process, e.g. by using don't look bits (Bentley, 1992) in TSP algorithms (Stützle, 1999a, p. 78). The highest performance is reached when these tricks are implemented w.r.t. the perturbation procedure. In the example of don't look bits for TSP, the don't look bits are saved for the next run of the local search and only reset for the cities where the perturbation induced a change. This yields a large performance advantage for the ILS.
2.3.3 Perturbation
The perturbation procedure modifies a local optimum to become a new starting point for the next heuristic search. Perturbations are often non-deterministic to avoid cycling. A general aspect in choosing perturbations is that they must not be easily undone by the embedded heuristic, since the search would then fall back to the last visited local optimum. To find a good perturbation, properties of the problem instance can be considered as well as problem-specific knowledge.
Most perturbations are rather simple; e.g. in TSP algorithms, small "double-bridge moves" are frequently used, which remove four edges and add four new edges. While in the case of TSP a simple perturbation procedure leads to good results, other problems might need more complex procedures, e.g. optimising a part of the current instance or solving a related instance with relaxed constraints.
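As a concrete example of such a simple perturbation, the double-bridge move can be sketched as follows (an illustration of the standard move on a tour represented as a list of cities; the random choice of cut points is an assumption):

    # Sketch: double-bridge perturbation for a TSP tour. The tour is cut
    # into four segments A B C D at three random points and reconnected
    # as A C B D, which removes four edges and adds four new ones.
    import random

    def double_bridge(tour):
        i, j, k = sorted(random.sample(range(1, len(tour)), 3))
        return tour[:i] + tour[j:k] + tour[i:j] + tour[k:]

    print(double_bridge(list(range(10))))

A local search based on simple edge exchanges cannot easily undo this move, which is exactly the property demanded above.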
The most important property of the perturbation is its strength. Perturbation strength can be informally defined as the number of components in a solution that get changed by a perturbation. Since the number of changed components provides a measure for the distance between two solutions, the perturbation strength can be seen as the distance of the perturbed solution to the original solution. The strength of the perturbation should be large enough to search efficiently in the space of local optima, but small enough to prevent random-restart behaviour. The size of the perturbation depends heavily on the problem and on the particular instance.
The strength of a perturbation can be fixed, where the distance between s and s' is constant, or adaptive, where the strength changes over time, e.g. becomes larger when no improvements have been found, to emphasise diversification. To adapt the perturbation strength, the search history can be used.
2.3.4 Acceptance criterion
The acceptance criterion determines whether a solution from the embedded heuristic is used as the starting point for the next iteration of ILS. The acceptance criterion provides a balance between intensification and diversification of the search. The two simplest cases are a criterion which accepts only improvements, and thus emphasises intensification, and the opposite criterion, which accepts every new solution and provides strong diversification. The first criterion is defined as BETTER (algorithm 2.2), the second one as ALWAYS.
Algorithm 2.2 BETTER acceptance criterion
if f(s'') > f(s) then
    s ← s''
end if
In between these two extremes, there are several possibilities for acceptance criteria. It is possible to prefer improvements, but sometimes accept non-improving steps, e.g. by using a criterion similar to the one used in SA. Other choices could make use of the search history, similar to some implementations of TS. A limited use of the history is to restart the ILS algorithm from a new initial solution when it seems that no improvement will be found in the current region of the search space. A simple way would be to restart if the search is not able to traverse to a better local optimum for a certain number of steps. The RESTART criterion implements this approach:
Let i be the current number of iterations of the ILS and i_last the iteration where the last improving step was made. The RESTART acceptance criterion is defined in algorithm 2.3.
Algorithm 2.3 RESTART acceptance criterion
if f(s'') > f(s) then
    s ← s''
else if f(s'') ≤ f(s) ∧ i − i_last > i_T then
    s ← GenerateInitialSolution
end if
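Both criteria fit the acceptance_criterion slot of the ILS sketch in section 2.3 (again illustrative; the closure-based wiring and the counter state are assumptions):

    # Sketch: BETTER and RESTART acceptance criteria as functions.
    def better(f):
        def accept(s, s_new, history):
            return s_new if f(s_new) > f(s) else s
        return accept

    def restart(f, generate_initial_solution, i_threshold):
        state = {"i": 0, "i_last": 0}          # iteration counters i, i_last
        def accept(s, s_new, history):
            state["i"] += 1
            if f(s_new) > f(s):
                state["i_last"] = state["i"]   # improving step found
                return s_new
            if state["i"] - state["i_last"] > i_threshold:
                state["i_last"] = state["i"]   # no improvement for i_T steps
                return generate_initial_solution()
            return s
        return accept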
It conforms to the BETTER criterion by accepting a local optimum as starting point for the next iteration only if it is better than the previous starting point. Furthermore, if there has been no improvement for a number i_T of iterations, the search is restarted with a newly generated solution. The new starting point can be generated in different ways. The simplest way is to use the construction heuristic to generate a completely new initial solution, but other approaches considering the search history or the former starting point are also possible.
2.3.5 Termination condition
The termination condition ends the algorithm. There are no restrictions on how to choose the condition. Common choices include a fixed number of iterations, a number of iterations depending on some properties of the problem instance, a fixed amount of consumed run-time or (if possible) an estimation of the solution quality.
2.4 Related Work
This chapter presents previously published algorithms and summarises the most important results found during the literature study done in the initial phase of this project. We start with a classification of MSA algorithms, briefly describing the most prominent algorithms for each class. The applications of Metaheuristics that were found are described in detail later in this chapter. Notredame (2002) reviews most available algorithms and classifies them into four classes:
Progressive algorithms add sequences in a step-by-step manner according to a precomputed order. They use an extension of the dynamic programming algorithm used in pairwise alignment (Needleman & Wunsch, 1970) to add a sequence to an alignment. This class includes the most widely used algorithms: ClustalW, which can be seen as a standard method for multiple alignments, PileUp and MultAlign.
All progressive algorithms start by choosing an order in which the sequences will be added to the alignment, mostly implemented by computing an approximate phylogenetic tree from the similarities of the sequences. The sequences are then added according to this order. This process is similar to the construction heuristic used in ILS. The drawbacks of progressive algorithms are the inability to correct gaps placed in early steps of the alignment, the domination of a large set of similar sequences and the order imposed on the sequences.
The most prominent algorithm for MSA, ClustalW (Thompson et al., 1994), aligns the sequences according to a guide tree computed by neighbour joining. It uses many heuristic adjustments to compensate for the drawbacks of pure progressive strategies, among them automatic choice of substitution matrices for different degrees of sequence identity, local gap penalties and gap penalty adjustment.
Exact algorithms rely on a generalisation of the algorithm for pairwise alignment to more than two sequences. The original pairwise alignment algorithm (Needleman & Wunsch, 1970) uses dynamic programming to find an optimal path in a two-dimensional lattice which corresponds to the optimal pairwise alignment. By extending this scheme to an n-dimensional space, n sequences can be optimally aligned. Given that the time needed to compute the optimal alignment is O(lⁿ) for n sequences of maximal length l, this approach is limited to small sets of short sequences.
This approach is implemented in the MSA program (Lipman et al., 1989), which in turn is a heuristic implementation of an algorithm proposed by Carrillo & Lipman (1988). DCA (Stoye et al., 1997) increases the number of sequences which can be aligned by MSA by using a divide-and-conquer approach to cut the sequences into smaller fractions, which in turn can be aligned with MSA and later reassembled. Still, the maximum number of sequences which can be handled remains relatively small.
Iterative algorithms are based on the idea that a solution can be computed by starting from a sub-optimal solution which is then modified until it reaches the optimum. Iterative algorithms can be further classified into stochastic and deterministic algorithms. The first subclass includes Hidden Markov Models, GA, SA, TS and the ILS presented in this thesis. Most algorithms in the second subclass are based on dynamic programming. SAGA (a Genetic Algorithm), SA and TS are described in detail at the end of this chapter.
One of the best MSA algorithms available, Prrp, uses a deterministic iterative strategy. Prrp uses a doubly nested loop strategy to optimise weights assigned to the pairs of sequences and the weighted SP score simultaneously. The inner loop re-aligns subgroups inside the alignment until no further improvement can be made. The outer loop in turn computes approximate weights for all pairs of sequences based on the current alignment. The iteration ends when the weights have converged.
Consistency based algorithms try to produce an MSA that agrees as much as possible with all
pairwise alignments. By using the pairwise alignments, consistency-based algorithms are
independent of a specific scoring matrix and make use of position-dependent scores. This
bypasses many of the problems of progressive algorithms and provides the enhancements
of ClustalW as an inherent feature of the algorithm.
Since this class includes all algorithms which make use of a consistency-based objective
function, it includes SAGA, T-COFFEE, DiAlign and the ILS.
T-COFFEE is a progressive algorithm using the COFFEE function instead of a substitution
matrix when computing the alignment. It uses an extended library of global and
local alignments, which are merged into a scoring matrix by a heuristic method called
library extension. The library extension process assigns a weight to each pair of residues
present in the library alignments. The weight is larger for pairs in an alignment which
are consistent with the other library alignments.
DiAlign is based on gap-free segments of residue pairs in the pairwise alignments induced
by the sequences inside an MSA. It starts by computing all pairwise alignments
and identifying all diagonals (diagonals represent matches of residues; hence, a diagonal
is a gap-free segment). The diagonals are then sorted according to a probabilistic
weight and their overlap with other diagonals. The latter property emphasises motifs
occurring in more than two sequences. The MSA is formed from this list by repeatedly
adding the first consistent diagonal not added yet. When the list is completely processed,
the alignment is completed by connecting the diagonals by introducing gaps into the alignment.
Name         Classification                  Reference
MSA          Exact                           Lipman et al. (1989)
DCA          Exact                           Stoye et al. (1997)
ClustalW     Progressive                     Thompson et al. (1994)
Multalin     Progressive                     Corpet (1988)
DiAlign      Consistency-based               Morgenstern et al. (1998)
DiAlign2     Consistency-based               Morgenstern (1999)
T-COFFEE     Consistency-based/Progressive   Notredame et al. (2000)
Prrp         Iterative                       Gotoh (1996)
SAGA         Iterative                       Notredame & Higgins (1996)
Tabu Search  Iterative                       Riaz et al. (2004)

Table 2.1: MSA algorithms and their classification, taken from Notredame (2002). Please
refer to the text for a short description of the algorithms and the classifications.
Table 2.1 shows the most popular algorithms together with their classification and references. The remainder of this chapter describes the Metaheuristics already applied to MSA in more detail; further information can be found in the references.
2.4.1 Simulated Annealing
The first modern Metaheuristic applied to MSA is Simulated Annealing (Kim et al.,
1994), which was used to optimise the SP objective function with affine and natural gap
costs. The neighbourhood structure used is built from a move operation which swaps
a block of consecutive gaps with an adjacent block of symbols in one of the sequences.
All parameters of the swap operation, i.e. size and position of the gap block,
whether symbols to the left or right are swapped and how many symbols are swapped,
are determined randomly. No gaps are inserted or deleted during the search.
SA explores this neighbourhood structure by iteratively moving from one solution
to a better scoring solution, but also accepting steps to solutions with lower scores
with a certain probability. This gives SA the possibility to escape local minima.
SA is one of the few Metaheuristics with mathematically proven convergence to
an optimal solution. In practice, SA is too slow to be used as a complete alignment
algorithm, but it can be used as an alignment improver to correct and improve alignments
computed by other methods.
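The acceptance step of SA can be illustrated by a minimal C++ sketch of the Metropolis criterion; the interface is hypothetical and the cooling schedule is omitted, so this is a sketch of the principle rather than the implementation of Kim et al. (1994).

#include <cmath>
#include <random>

// Metropolis acceptance rule: improving moves are always taken, worsening
// moves with a probability that shrinks as the temperature is lowered.
// 'delta' is the score change of the proposed move (positive = better).
bool acceptMove(double delta, double temperature, std::mt19937& rng) {
    if (delta >= 0.0) return true;                  // improvement: accept
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(delta / temperature);  // worse: accept sometimes
}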
2.4.2 SAGA
SAGA (Sequence Alignment by Genetic Algorithm) is a GA for MSA proposed by
Notredame & Higgins (1996). It was used to optimise alignments for the weighted
SP score with different gap penalties, and was later extended to use the COFFEE objective
function as well (Notredame et al., 1998).
SAGA starts with a population of random alignments generated by adding a random
number of gaps to the start and end of each sequence, ensuring that all sequences have
the same length afterwards.
To compute the next generation, 50% of the new population is filled with the 50%
fittest alignments (the fitness value corresponds to the score of the alignment) from the
previous generation. The remaining 50% are created by modifying randomly chosen
alignments from the previous generation, where the probability of being chosen depends
on the fitness value. SAGA uses 22 different operators to modify an alignment: two
crossover operators which generate a new alignment from two parents, two operators
which insert blocks of gaps into the alignment, a family of sixteen operators shifting
blocks of gaps inside the alignment to new positions, an operator which tries to find
small profiles by finding the best small gap-free substring of a sequence in all sequences,
and an operator optimising a pattern of gaps in a small region of the alignment by either
exhaustive search or a GA, depending on the size of the region and the number of gaps
to be placed.
Which operator is applied is chosen stochastically, with a probability depending on
the quality of the offspring the operator has generated in the last ten iterations.
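This dynamic operator scheduling can be illustrated by a fitness-proportional (roulette-wheel) selection over operator credits; the following C++ sketch is a hypothetical illustration of the principle, not SAGA's actual implementation.

#include <numeric>
#include <random>
#include <vector>

// Chooses an operator index with probability proportional to its credit,
// where each credit would reflect the quality of the offspring the
// operator produced in the recent past (e.g. the last ten iterations).
int chooseOperator(const std::vector<double>& credits, std::mt19937& rng) {
    double total = std::accumulate(credits.begin(), credits.end(), 0.0);
    std::uniform_real_distribution<double> u(0.0, total);
    double r = u(rng), acc = 0.0;
    for (size_t i = 0; i < credits.size(); ++i) {
        acc += credits[i];
        if (r <= acc) return static_cast<int>(i);
    }
    return static_cast<int>(credits.size()) - 1;  // numerical safety net
}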
SAGA is among the best algorithms for MSA regarding alignment quality, but it needs
a very long time to produce good alignments, possibly due to using randomly generated
alignments as the initial population.
2.4.3 Tabu Search
Tabu Search (TS) (Glover, 1986; Glover & Laguna, 1997) is a general Metaheuristic
which explores a neighbourhood by moving from one solution to the neighbour with
the best score, even if this neighbour has a lower score than the current solution. To
prevent cycling, TS uses a form of memory to declare recent moves forbidden for a
number of iterations. In some cases, a move declared tabu may actually lead to an
improvement; aspiration criteria therefore allow forbidden moves under some circumstances,
e.g. if the new solution would be better than all previously found solutions. The memory
can also be used to balance between diversification and intensification.
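The short-term memory can be illustrated by a minimal C++ sketch in which recently performed moves stay tabu for a fixed number of iterations; the move encoding, the tenure value and the interface are assumptions made for this sketch, and a real implementation would additionally check aspiration criteria before rejecting a move.

#include <algorithm>
#include <deque>

// A fixed-length recency memory: a move identifier stays tabu until it
// has been pushed out by 'tenure' more recent moves.
struct TabuList {
    std::deque<long> recent;   // identifiers of recently performed moves
    std::size_t tenure = 7;    // illustrative tabu tenure

    bool isTabu(long moveId) const {
        return std::find(recent.begin(), recent.end(), moveId) != recent.end();
    }
    void record(long moveId) {
        recent.push_back(moveId);
        if (recent.size() > tenure) recent.pop_front();
    }
};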
Riaz et al. (2004) implemented a TS for MSA, using the COFFEE objective function.
They propose two different move strategies: a single block move, which is the same move
as used in the BlockMove local search in ILS, and a move strategy moving a rectangle of
gaps which spans more than one sequence to a new position. The second move
strategy is implemented in their TS algorithm since it has been found to give shorter
runtimes than performing a series of block moves.
To balance between intensification and diversification, Riaz et al. (2004) divide the
alignment into well and badly aligned regions and focus the TS on a badly aligned region
until it has been improved enough to no longer be classified as badly aligned. The TS is then
focused on the next badly aligned region until no further badly aligned regions are found.
TS was tested on the BAliBASE database (version 1.0) and found to be comparable
to other algorithms in terms of solution quality, but it needs a very long time to produce
good alignments. It is pointed out that the initial alignment from which the TS is started
greatly influences the solution quality when optimising larger instances.
3 Algorithm
This chapter describes the ILS algorithm for MSA and motivates the decisions made
when designing the components of ILS (section 2.3). For each of the components, several
alternatives are proposed and compared to find the best alternative to use in the
ILS. The results are all presented in chapter 5. The decisions made are motivated by
preliminary tests on subsets of the BAliBASE database.
3.1 Construction heuristic
Three variants of construction heuristics have been implemented and evaluated to see if
they can serve as a good starting point for a local search.
The first heuristic, Random, builds the initial alignment by randomly inserting a fixed
percentage of gaps into the sequences, filling the shorter sequences with additional gaps.
The second heuristic, AlignRandom, generates the initial alignment by successively
aligning a sequence to an alignment in random order. It starts by aligning two sequences,
and adds the next sequence to this alignment until all sequences have been added. The
alignment method is similar to standard dynamic programming and is described by
Gotoh (1993).¹
The third heuristic, NeighbourJoining, is a greedy heuristic which aligns the sequences
step by step according to a guidance tree, which is computed using the neighbour joining
method (Saitou & Nei, 1987); the alignment is done with the same procedure as in the
second heuristic. Neighbour joining is implemented by the Quickjoin library, which
uses a heuristic speed-up technique to implement the algorithm as described by Saitou
(Brodal et al., 2003).²
The heuristics will be referred to as Random, AlignRandom and NeighbourJoining in
the following.
Table 5.1 on page 41 compares the quality of the alignments computed by each construction heuristic for both the COFFEE score and the BAliBASE Sum of Pairs Score (BSPS). NeighbourJoining clearly achieved the highest scores in both cases.
Considering that the run-time for NeighbourJoining and AlignRandom is nearly the
same and that the runtime of the whole ILS benefits from a starting solution near a
local optimum, NeighbourJoining is selected as the construction heuristic used in the
implementation. This decision is further motivated by the assumption that good solutions
are clustered together in the search space, which is one reason for the good performance
of ILS.
¹ Gotoh presents four algorithms, differing in estimated run-time and quality of the alignments produced. In this implementation, Algorithm A is used.
² Quickjoin is available from http://www.daimi.au.dk/mailund/quick-join.html
The neighbour joining heuristic has the drawback of being deterministic, so that each
search starts from the same initial solution, which makes it unsuited for the RESTART
acceptance criterion. When used in conjunction with RESTART, the first initial solution
is computed by NeighbourJoining, whereas new initial solutions for restarts are
computed by AlignRandom.
3.2 Local Search
The embedded heuristic, which in this project is a local search procedure, has great
influence on the performance of ILS. Given that great impact, the other components are
designed around the local search procedure.
To find a good local search regarding both runtime and solution quality, two local
search procedures have been implemented and tested on a random subset of the BAliBASE
database. The subset consists of a quarter of the test cases of each reference set,
randomly chosen.
The two local search procedures each examine a different neighbourhood structure by
strict iterative improvement, going from one solution to another only if the new solution
is better than the former.
3.2.1 Single exchange local search
The first local search procedure exchanges one symbol with an adjacent gap symbol until
no further improvement can be made. The neighbourhood structure used in the local
search is defined as follows:
Definition 3.1 (Neighbourhood structure for single exchange local search) A
pseudo multiple sequence alignment A is contained in the neighbourhood of another
pseudo multiple alignment B iff alignment A can be transformed into alignment B by
exchanging a gap with a symbol, preserving the order of the symbols in each sequence.
The size of the neighbourhood is $O(N^2 \cdot L)$, where $N$ is the number of sequences in the
alignment and $L$ is the length of the longest sequence in the set $S$ of sequences. The worst
case is an alignment with a maximum number of gaps, since each gap can be replaced
by two residues (except gaps at the beginning or end of a sequence, which can only be
replaced by one residue), which gives two (one) neighbours. This worst case alignment
consists of columns which are built from $N - 1$ gaps and only one non-gap symbol.
Each symbol from the sequences constitutes one column, so the alignment has $\sum_{s \in S} |s|$
columns. The number of gaps is then given by $(N - 1) \cdot \sum_{s \in S} |s|$. For $L$ being the length of
the longest sequence, it holds that $L \ge |s| \; \forall s \in S$, and from that $\sum_{s \in S} |s| \le |S| \cdot L = N \cdot L$.
Altogether, this gives $(N - 1) \cdot N \cdot L = N^2 \cdot L - N \cdot L = O(N^2 \cdot L)$.
The local search procedure explores this neighbourhood by a combination of greedy
and random strategies. The local search traverses every block of gaps, i.e. a maximal
consecutive run of gaps, and searches for an improvement within this block. Within a
block, every possible move is evaluated and the best one is taken.
Figure 3.1: Move operation of the single exchange local search. A gap inside a block of
consecutive gaps is swapped with one of the symbols adjacent to this block.
If the best move yields an improvement, it is executed and the local search is restarted.
If no improvement was found in this block, the local search continues with the next
block. The blocks and the gaps in each block are traversed in random order.
One final point of variation inside the local search worth considering is the question
which move should be performed in each iteration.
Algorithm 3.1 shows the complete local search in pseudo code. The algorithm examines
all gap blocks and the adjacent symbols, checking whether swapping a symbol with
one of the gaps inside the block yields an improvement, until either an improvement
is found or all gaps have been examined. Depending on the parameter pivot, which can
be set to either "Best" or "First", the local search selects the neighbour which yields the
largest improvement possible or the first neighbour found to yield an improvement. The
algorithm uses an auxiliary function swap which swaps two positions inside an alignment.
The function ExchangeMove evaluates a move in the neighbourhood structure by
computing the scores of the current alignment and of the neighbour reached by exchanging
one symbol and a gap. The difference between these two scores is returned.
The function ExamineBlock searches for the best improvement w. r. t. a block of consecutive
gaps by evaluating all possible moves obtained by considering one of the symbols to the
right or left of the block and the gaps inside the block. It returns the value and the gap
position of the best improving move found, and zero if no improvement can be made.
The local search is implemented in procedure SingleExchange. It examines all blocks
of gaps as long as an improving move in the neighbourhood can be made. To do
this, it first evaluates all moves induced by a block and the symbol to the left of this block
(lines 24-25). If an improvement was found, or if pivoting is set to "Best", the same is
done for the symbol to the right of the block (lines 26-34). If the examination of a
block found an improvement and pivoting is set to "First", the for-loop is exited in line 36.
When the loop terminates, the best improving move found is executed (lines 39-40).
Figure 3.1 shows a possible move during the local search. First, a block of gaps is
selected. For both adjacent residues, the score of exchanging them with any gap in the
selected block is computed, and the best combination is eventually executed as the move
for this iteration.
The estimated run-time of one iteration of the local search procedure is $O(N^2 \cdot L)$. In
the worst case, the local search has to evaluate the whole neighbourhood of a solution,
which amounts to evaluating every possible move, since each neighbour can be reached
by performing exactly one move.
Algorithm 3.1 Single exchange local search.
 1: function ExchangeMove(Alignment A, OF f, gapPos, symbolPos)
 2:   delta ← f(A)
 3:   A′ ← A
 4:   swap(A′, gapPos, symbolPos)          ▷ swap two positions in an alignment
 5:   delta ← delta − f(A′)
 6:   return delta
 7: end function
 8:
 9: function ExamineBlock(Alignment A, OF f, Block G, symbolPos)
10:   bestDelta ← 0
11:   for all gaps g ∈ G do                ▷ find the best improvement for this block
12:     delta ← ExchangeMove(A, f, g, symbolPos)
13:     if delta < bestDelta then
14:       bestDelta ← delta
15:       gapPosition ← g
16:     end if
17:   end for
18:   return (bestDelta, gapPosition)      ▷ bestDelta = 0 if no improvement found
19: end function
20:
21: procedure SingleExchange(Alignment A, Objective Function f, pivot)
22:   repeat
23:     for each block G of consecutive gaps do
24:       symbolPos ← symbol left of G
25:       (bestDelta, gapPos) ← ExamineBlock(A, f, G, symbolPos)
26:       if bestDelta < 0 ∨ pivot = "Best" then
27:         rightPos ← symbol right of G
28:         (rightDelta, rightGap) ← ExamineBlock(A, f, G, rightPos)
29:         if rightDelta < bestDelta then
30:           bestDelta ← rightDelta
31:           symbolPos ← rightPos
32:           gapPos ← rightGap
33:         end if
34:       end if
35:       if pivot = "First" ∧ bestDelta < 0 then
36:         break                          ▷ quit the for loop
37:       end if
38:     end for
39:     if bestDelta < 0 then              ▷ if improvement found
40:       swap(A, gapPos, symbolPos)       ▷ perform the found move
41:     end if
42:   until no improvement was found in the last iteration
43: end procedure
To speed up the computation, the local search uses don't look bits to prune the search
space. Don't look bits are a matrix of boolean values, which are initially set to false.
A symbol next to a gap block is only considered for a move if the corresponding don't
look bit is false. Whenever the local search evaluates a symbol adjacent to a block of
gaps and finds that no improvement can be made by exchanging this symbol with one of
the gaps in the adjacent block, the don't look bit is set to true; when a move is actually
executed, the don't look bits for the two exchanged positions are set to false.
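A minimal C++ sketch of this bookkeeping is given below; the interface is hypothetical, and the actual implementation may organise the bits differently.

#include <cstddef>
#include <vector>

// One boolean flag per alignment position; true means "don't look",
// i.e. the position is skipped when searching for improving moves.
struct DontLookBits {
    std::vector<std::vector<bool>> bit;

    DontLookBits(std::size_t rows, std::size_t cols)
        : bit(rows, std::vector<bool>(cols, false)) {}

    bool shouldExamine(std::size_t seq, std::size_t pos) const {
        return !bit[seq][pos];
    }
    void markUseless(std::size_t seq, std::size_t pos) { bit[seq][pos] = true; }
    void resetAfterMove(std::size_t seq, std::size_t gapPos, std::size_t symPos) {
        bit[seq][gapPos] = false;   // both swapped positions become
        bit[seq][symPos] = false;   // candidates again
    }
};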
3.2.2 Block exchange local search
The second local search procedure is similar to the first one but tries to exchange whole
blocks of gaps instead of one symbol per step. Riaz et al. (2004) favoured this local
search in their TS implementation³ and state that it is superior in runtime and solution
quality to the single exchange local search.
The neighbourhood structure for the block exchange local search is defined as:
Definition 3.2 (Neighbourhood structure for block exchange local search) A
pseudo multiple sequence alignment A is contained in the neighbourhood of another
pseudo multiple sequence alignment B iff alignment A can be transformed into alignment
B by exchanging a block of gaps with a block of non-gap symbols in one sequence,
preserving the order of the symbols in each sequence.
In the worst case, the neighbourhood structure is equivalent to the neighbourhood
structure of the single exchange local search, since the maximal number of gap blocks is
reached when each block is of length one. This gives the same worst-case size of the
neighbourhood, $O(N^2 \cdot L)$, but on average the neighbourhood can be expected to be much
smaller.
Figure 3.2: Move operation of the block exchange local search. A randomly selected
block of gaps is exchanged with a block of symbols of equal size to its right.
Algorithm 3.2 describes the local search procedure, and figure 3.2 presents one step
in graphical form. It uses the same technique to distinguish between best-improvement
and first-improvement pivoting as SingleExchange (algorithm 3.1). Swapping blocks
of symbols is done by an auxiliary function swapBlock which swaps two blocks inside an
alignment.
³ TS moves to the best solution in the neighbourhood of the current solution, a process similar to the local search procedures used here.
Algorithm 3.2 Block exchange local search
 1: function BlockMove(Alignment A, OF f, Block gaps, Block symbols)
 2:   delta ← f(A)
 3:   A′ ← A
 4:   swapBlock(A′, gaps, symbols)
 5:   delta ← delta − f(A′)
 6:   return delta
 7: end function
 8:
 9: procedure BlockExchange(Alignment A, OF f, pivot)
10:   repeat
11:     bestDelta ← 0
12:     for each block G of consecutive gaps do
13:       leftSymbols ← maximal block of at most |G| consecutive symbols to the left
14:       if BlockMove(A, f, G, leftSymbols) < bestDelta then
15:         bestDelta ← BlockMove(A, f, G, leftSymbols)
16:         symbolBlock ← leftSymbols
17:         gapBlock ← G
18:       end if
19:       if pivot = "Best" ∨ bestDelta = 0 then
20:         rightSymbols ← maximal block of at most |G| consecutive symbols to the right
21:         if BlockMove(A, f, G, rightSymbols) < bestDelta then
22:           bestDelta ← BlockMove(A, f, G, rightSymbols)
23:           symbolBlock ← rightSymbols
24:           gapBlock ← G
25:         end if
26:       end if
27:       if bestDelta < 0 ∧ pivot = "First" then
28:         break                          ▷ end the for-loop
29:       end if
30:     end for
31:
32:     if bestDelta < 0 then
33:       swapBlock(A, gapBlock, symbolBlock)
34:     end if
35:   until no improvement found in last iteration
36: end procedure
The local search is implemented in procedure BlockExchange. It iteratively examines
all blocks of gaps until no further improvement can be made. It starts by examining
the effect of swapping a block of symbols to the left with the current gap block (lines
13-18). The non-gap block is of maximal length, meaning that the local search tries to
swap as many symbols found to the left of the current gap block as possible. The size
of the swapped block is the minimum of the number of consecutive non-gap symbols to
the left and the size of the gap block. If this did not yield an improvement, or if pivoting
is set to best improvement, the same is done with the maximal block of non-gaps to
the right (lines 19-26). In both cases, the gap and the non-gap blocks are saved if the
move is better than the best move found so far (which is always the case for the first
improvement found).
Lines 27-29 terminate the loop if an improving move was found and pivoting is set to
first improvement. The best move found is executed in lines 32-34.
A further extension would be to consider not only swapping blocks of symbols, but
moving the gaps to a completely new position in the sequence. This approach has the
drawback of an extremely large neighbourhood, since there are $N - 1$ new positions for a
gap block of length one, which would give $O(N \cdot N^2 \cdot L)$ as the worst case neighbourhood
size. Considering this and the large computational cost of evaluating each step, this
approach is not considered here.
3.2.3 Comparison
Both local search procedures are tested by running them on a random part of the
BAliBASE database: one fourth of each subset is randomly selected. For each selected
test case, ClustalW is used to compute an initial alignment, which is used as input for
the local search procedures. Table 5.2 on page 42 and table 5.3 on page 42 present the
results as percentage improvements in COFFEE score for each local search procedure
with first- and best-improvement pivoting. The tables show that the SingleExchange local
search performs better than the BlockExchange local search, with improvements up to
three times larger in some cases. There is no big difference between the two pivoting
schemes.
Comparing the runtimes of the algorithms shown in table 5.3, SingleExchange performs
much better than BlockExchange, the only exception being reference set four,
where BlockExchange is the fastest algorithm in all cases. Focusing on the pivoting
scheme used, it turns out that first-improvement is much faster than best-improvement,
up to a factor of six for reference set 3.
Considering the runtime and the improvement made by the local search, SingleExchange
with a first-improvement pivoting scheme will be used in the ILS as the embedded
heuristic. Since ILS samples local minima, both factors have to be considered when
selecting the local search procedure: the local search should be able to produce as many
local minima as possible in the given time, but the quality of the local minima must also
be taken into account. SingleExchange is superior to BlockExchange in both respects,
and thus seems to be a good basis for the following design decisions.
3.3 Perturbation
Four perturbation procedures are proposed and evaluated on a small subset of BAliBASE.
All perturbations are tested using a preliminary version of the ILS with SingleExchange
and the BETTER acceptance criterion, running for 100 iterations. Due to the large
runtime of one of the perturbations, the test runs are restricted to a subset of reference
set 1, which consists of relatively small sets of sequences.
3.3.1 Random regap perturbation
Figure 3.3: Random regap perturbation. A subsequence of one of the sequences inside
the alignment is randomly selected and the gaps inside that sequence are
moved to new random positions inside the block.
Algorithm 3.3 Random regap perturbation
procedure Regap(Alignment A, Integer length)
  select a sequence Si ∈ A
  select a substring si ⊆ Si
  n ← countGaps(si)
  s′i ← removeGaps(si)
  add n gaps at random positions to s′i
  substitute si with s′i
end procedure
The first perturbation, shown in algorithm 3.3 and exemplified in figure 3.3, simply
selects a random substring of one of the sequences in the alignment and moves the
gaps to random positions inside the substring. The number of gaps remains constant.
In order to prevent the ILS from cycling, all decisions made in the perturbation procedures
are made randomly, with the length of the selected subsequence as the only
exception. However, since the SingleExchange local search is very similar to this perturbation,
it is likely that the changes will be undone in the next local search step.
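For illustration, the gap-shuffling step of Regap on a single selected substring could look as follows in C++; the gap character '-' and the function name are assumptions made for this sketch.

#include <cstddef>
#include <random>
#include <string>

// Moves the gaps of 'segment' to new random positions: the residues keep
// their order and the number of gaps stays constant.
std::string regap(const std::string& segment, std::mt19937& rng) {
    std::string residues;
    std::size_t gaps = 0;
    for (char c : segment) {                 // strip the gaps, count them
        if (c == '-') ++gaps;
        else residues.push_back(c);
    }
    std::string result = residues;
    for (std::size_t g = 0; g < gaps; ++g) { // re-insert at random positions
        std::uniform_int_distribution<std::size_t> pos(0, result.size());
        result.insert(result.begin() + pos(rng), '-');
    }
    return result;
}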
3.3.2 Block move perturbation
The BlockMove perturbation is a move in another neighbourhood of the current alignment.
It selects a block of consecutive gaps and moves that block to a new position
inside the same sequence (see algorithm 3.4 and figure 3.4). Moving a block to a new
position can be seen as a move in a neighbourhood of higher order, which is a
generalisation of the neighbourhood used by BlockExchange.
Figure 3.4: Block move perturbation. A block of gaps is randomly selected and then
shifted to a new randomly determined position.
Algorithm 3.4 Random block move perturbation
procedure BlockMove(Alignment A)
  select a sequence Si ∈ A
  select a block of gaps B ⊆ Si
  move B to a new position in Si
end procedure
Since moves in another neighbourhood are often proposed in the ILS literature as
successful perturbations, this seems to be a promising approach.
Initial experiments with this perturbation indicated that the distance by which a block
can be moved should be restricted to keep the modified alignment close enough to the
former alignment. After small tests (data not shown here), 1/10 of the alignment
length appears to be a good upper limit for the block move distance.
3.3.3 Pairwise alignment perturbation
Figure 3.5: Pairwise alignment perturbation with perturbation strength 5. Two sequences
are selected. Within these two sequences, two substrings are aligned
and then pasted back into the MSA.
This perturbation procedure relies on computing the optimal pairwise alignment of
two subsequences of the alignment and fitting this alignment back into the MSA. Figure
3.5 illustrates the perturbation process, which is described by algorithm 3.5. This
perturbation tries to make use of a biologically justified method to change the alignment.
It first selects two random subsequences from two randomly chosen sequences in the
alignment. These subsequences are then aligned with each other (with gaps in the
subsequences removed before aligning) using optimal pairwise alignment with affine gap
costs (Gotoh, 1982) and copied back to their original places. If the length of the pairwise
alignment differs from the length of the original subsequences, gaps are added either to
the pairwise alignment at random positions (if the pairwise alignment is shorter) or to
the multiple alignment to gain enough space for the pairwise alignment (if it is longer).
Algorithm 3.5 Pairwise alignment perturbation
procedure PairwisePerturbation(Alignment A)
  select two sequences S1 and S2
  select two substrings s1 ⊆ S1 and s2 ⊆ S2
  s′1 ← removeGaps(s1)
  s′2 ← removeGaps(s2)
  compute the optimal pairwise alignment PA of s′1 and s′2
  if |PA| < |s1| then
    add |s1| − |PA| gaps to each sequence in PA
  else if |PA| > |s1| then
    add |PA| − |s1| gaps to the alignment at random positions
  end if
  copy PA back into A
end procedure
3.3.4 Re-aligning of complete regions
Figure 3.6: Blockwise realign perturbation. A region of the alignment is selected by a
sliding window analysis to find unreliably aligned regions. This part is then
aligned with the AlignRandom procedure.
Algorithm 3.6 Realign region perturbation
procedure Realign(Alignment A)
  perform a sliding window analysis with norMD to find a badly aligned region [i, j)
  for all sequences s ∈ A do
    remove all gaps between symbols s[i] and s[j]
  end for
  compute an MSA of the region between i and j using AlignRandom
  copy the computed alignment back into A, replacing the region between i and j
end procedure
The most sophisticated perturbation tries to make use of biological knowledge by
performing a sliding window analysis using norMD to find unreliably aligned regions,
similar to the procedure used in Rascal (Thompson et al., 2003). After performing the
sliding window analysis, all blocks of columns with scores below 0.5 are considered to be
unreliably aligned, as proposed in Thompson et al. (2001). One of these blocks is randomly
selected and aligned using AlignRandom (see section 3.1). Since the computed
alignment can be longer or shorter than the original block, the perturbed alignment can
in turn grow or shrink.
This perturbation combines biologically justified methods and random decisions to
apply a guided perturbation to the alignment in regions where further improvement is
needed. It is expected to be able to guide the local search effectively and produce good
results, but the expected runtime is very high due to the additional complexity introduced
by running norMD each time on the complete alignment. Thompson et al. (2001) do not
give an estimate of the complexity of the norMD procedure, but since it considers each
pair of sequences, it is at least quadratic in the number of sequences.
3.3.5 Comparison
The results of a comparison of the performance of a preliminary version of the ILS with
the different perturbation procedures are presented in tables 5.4 and 5.5 on page 43. The
ILS used in this comparison uses the SingleExchange local search with a first-improvement
pivoting strategy and the BETTER acceptance criterion. The runtime has been restricted
to 100 iterations.
As BlockMove shows the largest percentage improvement with acceptable runtime,
BlockMove with a maximal move distance of 1/10 of the alignment's length will be used
as the perturbation procedure.
3.4 Acceptance criterion
The final component of the ILS to consider is the acceptance criterion. Most publications
report the BETTER criterion to be superior to the ALWAYS criterion w. r. t. solution
quality. Additionally, we examine the RESTART criterion (Stützle & Hoos, 2002), which
prevents the ILS from stagnating by starting from a new initial solution generated with
AlignRandom when no improving steps could be made for a number of iterations.
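The three criteria can be summarised in a small C++ sketch; the interface and the stagnation bookkeeping are hypothetical, with the counter corresponding to the restart condition described above (higher objective values are better, as for COFFEE).

// Returns true if the candidate solution should replace the current one;
// for RESTART, 'restartRequested' signals that a new initial solution
// should be generated (e.g. with AlignRandom).
enum class Criterion { ALWAYS, BETTER, RESTART };

bool accept(Criterion c, double current, double candidate,
            int& stagnation, int maxStagnation, bool& restartRequested) {
    restartRequested = false;
    if (c == Criterion::ALWAYS) return true;               // always move on
    if (candidate > current) { stagnation = 0; return true; }
    ++stagnation;                                          // no improvement made
    if (c == Criterion::RESTART && stagnation >= maxStagnation)
        restartRequested = true;
    return false;                                          // keep current solution
}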
The results of this comparison can be found in tables 5.6 and 5.7 on page 44. Evaluation
is again done by running a preliminary ILS ten times on subsets of reference set 1 with
1/4 of the sequences, measuring the COFFEE scores and runtimes.
The best improvements are achieved by an ILS using the RESTART criterion. This
could indicate that the initial alignments produced by NeighbourJoining lie in regions
of the search space far away from the globally optimal solution, and that it is very hard
to traverse from one region of good solutions to another.
3.5 Termination condition
The algorithm is terminated after a fixed number of iterations, depending on the length
of the longest sequence in the alignment. Since it is not possible to guarantee the
optimality of the solution⁴, a cut-off has to be made. Experimental investigation has
shown that ILS begins to stagnate after a certain amount of time, so terminating the
search when the stagnation begins seems to be a good termination condition.
Figure 3.7: COFFEE score versus iterations for ILS on the test set gal4_ref1. The
algorithm was run for 12000 iterations; the longest sequence consists of 395
residues. The average similarity between the sequences is 14%.
Figure 3.7 shows an example run of the ILS algorithm on the test set gal4_ref1. This
test set consists of five sequences, the shortest consisting of 335 residues, the longest
of 395. Several other tests (data not shown) indicate that twenty times the length of
the longest sequence is a good threshold for the maximum number of iterations (7900
in this case).
3.6 Iterated Local Search for Multiple Sequence Alignment
We finally present the complete ILS algorithm for multiple sequence alignment, described
in pseudo code in algorithm 3.7. To improve the runtime of the ILS, we preset the don't
look bits used in the local search procedure by performing a sliding window analysis
with norMD on the initial alignment and setting the don't look bits for well-aligned
regions.⁵ This also helps the ILS to keep the well-aligned parts of the initial alignment
during the search. The don't look bits are furthermore reused across iterations: the don't
look bits from the last iteration (with regions changed by the perturbation set to false)
serve as the starting don't look bits for the next iteration, so that the local search can
focus on the perturbed parts of the alignment. When a restart is done, the don't look
bits are preset again.
⁴ If optimality should be guaranteed, the algorithm would have to enumerate all possible alignments to find the optimal one and would thus need exponential runtime.
⁵ A threshold of 0.8 is used to indicate a well-aligned region.
Algorithm 3.7 The complete ILS.
function ILS(Objective Function f)
  a ← initial alignment computed by NeighbourJoining
  perform sliding window analysis with norMD
  set don't look bits to true in well aligned regions
  SingleExchange(a, f, "First")
  best ← a                                      ▷ save alignment
  for 20 × (length of the longest sequence) iterations do
    a′ ← a
    BlockMove(a)
    SingleExchange(a, f, "First")
    if f(a) > f(best) then                      ▷ new best alignment found
      best ← a
    end if
    if f(a′) > f(a) then                        ▷ no improvement step made
      a ← a′
    end if
    if no improvement steps made for |a| iterations then      ▷ Restart
      a ← new alignment computed by AlignRandom
      preset don't look bits with norMD
      BlockMove(a)
      if f(a) > f(best) then
        best ← a
      end if
    end if
  end for
  remove all columns with no symbols from best
  return best
end function
3.7 Implementation
The algorithm is implemented in C++ and can be compiled with any ISO C++ compiler.
The implementation uses the seqio library for sequence input and output; seqio is
available from http://www.cs.ucdavis.edu/gusfield/seqio.html.
Since the algorithm is inherently stochastic in nature, a good random number generator
is crucial. In our implementation, the Mersenne twister implementation of Agner
Fog (http://www.agner.org/random/) is used.
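For readers reproducing the setup with a modern toolchain, the C++ standard library has since gained an equivalent Mersenne twister in <random>; a minimal sketch (not the implementation actually used in this thesis):

#include <cstddef>
#include <random>

std::mt19937 rng(42);   // fixed seed gives reproducible runs

// e.g. choosing a random sequence index out of N sequences:
std::size_t randomSequence(std::size_t N) {
    std::uniform_int_distribution<std::size_t> d(0, N - 1);
    return d(rng);
}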
The neighbour joining algorithm used in the construction heuristic NeighbourJoining
is implemented by the quick-join library (Brodal et al., 2003).
Patched versions of ClustalW and lalign are used to generate the COFFEE library.
The programs have been modified to simplify parsing their output and to avoid the
creation of unnecessary temporary output files.
4 Method and Materials
Two points must be considered to evaluate the performance of the ILS algorithm. The
first is the performance of ILS as an optimisation algorithm, i.e. how well it optimises
the alignment according to the objective function used. The second is the biological
validity of the computed alignment. The optimisation performance is measured using
run-time distributions, and the biological evaluation is done with the BAliBASE database
of reference alignments, which has been used in most recent publications of MSA algorithms.
4.1 Run-time distributions
Stützle (1999a) and Hoos & Stützle (1998) propose an empirical evaluation method based
on the distribution of solution quality and run-time, since Metaheuristics, and especially
ILS, are probabilistic algorithms. Instead of benchmarking an algorithm on test cases
and computing averages and other simple statistics, the runtime behaviour of
an algorithm is examined by taking samples of the run-time distribution (RTD) on
specific instances. An RTD describes the runtime behaviour as a two-dimensional random
variable $(C, T)$, where $C$ is the random variable describing the solution cost and $T$ is
the random variable describing the run-time on a specific problem instance. Instead of
measuring the run-time in terms of CPU time consumed, it is often preferable to use
machine-independent measures such as operation counts. This also minimises the influence
of other environmental factors, e.g. the influence of the compiler or programming
language. An RTD using such a machine-independent measure is called a run-length
distribution (RLD). We will use the number of executed local search steps as the performance
measure when analysing the runtime behaviour of ILS, since it provides a good
measure of the actual runtime needed.
The random variables describe the behaviour of the algorithm for a given instance,
but are in general not available a priori. Knowledge of these random variables can be
used to analyse and compare algorithms. To estimate the distribution of the random
variables, samples are taken by simply running the algorithm a number of times on an
instance and collecting data. In each run, the solution quality and the computation time
needed to obtain it are recorded every time a new best solution is found. Let $k$ be the
number of runs of the algorithm, $f(s_j)$ the quality of the best solution found so far and
$rt(j)$ the run-time of the $j$th run. The estimated RTD to reach a solution of quality $c$
is defined as:
Definition 4.1 (Runtime Distribution)
$$ G_c(t) = \frac{\left|\{\, j \mid rt(j) \le t \wedge f(s_j) \ge c \,\}\right|}{k} $$
If optimal solution qualities are available, the required quality is often set to a value
a certain percentage worse than the optimum, e.g. 5% worse. However, since no
mathematically optimal reference solutions are known for MSA, we will use
the best value found during benchmarking as the reference value and compute RLDs
relative to this value.
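Estimating the empirical RTD (or RLD) from the recorded runs then amounts to counting the runs that reached the required quality within the time bound; a minimal C++ sketch, assuming one (run-time, best quality) record per run and a maximised quality measure:

#include <cstddef>
#include <vector>

// Empirical G_c(t) of definition 4.1: the fraction of the k runs that
// reached quality at least c within time (or local search steps) t.
double estimateG(const std::vector<double>& runtime,
                 const std::vector<double>& quality,
                 double t, double c) {
    std::size_t hits = 0;
    for (std::size_t j = 0; j < runtime.size(); ++j)
        if (runtime[j] <= t && quality[j] >= c)
            ++hits;
    return static_cast<double>(hits) / runtime.size();
}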
Stützle & Hoos (2002) show how RTDs can be used to analyse the behaviour of ILS
for the TSP. The authors concluded from the estimated RTDs that the ILS suffers from
stagnation and proposed two improvements, which indeed led to improved solution
quality and performance.
4.2 The BAliBASE database of multiple alignments
The BAliBASE (Bahr et al., 2001; Thomson et al., 1999) database of benchmark¹ alignments
is a database of reference alignments for evaluating the biological validity of MSA
algorithms. The database consists of 167 manually constructed alignments of real sequences
based on three-dimensional structural superposition.² In each alignment, regions
that can be aligned reliably by structural superposition are defined as core blocks.
BAliBASE has been used in two published surveys of MSA algorithms (Notredame,
2002; Thompson et al., 1999) and is used frequently for evaluation in recent publications
of MSA algorithms (Notredame et al., 2000; Riaz et al., 2004). It can be seen as a
standard benchmark for the biological evaluation of MSA algorithms. Since version one of
the BAliBASE database was used in all publications found, we will restrict ourselves to
the same test set, which matches reference sets 1-5 of BAliBASE 2.0 (reference sets 4
and 5 formed a single reference set in BAliBASE 1.0). In total, 144 alignments are
used for evaluation.
The database is structured into five reference sets, each representing one commonly
faced problem for MSA algorithms.
Reference 1 consists of small sets of equidistant sequences of similar length. The lengths
range from below 100 residues up to 900. Each pair of sequences in a set
has a similar percentage identity, and no large extensions or insertions have been
introduced. The set can be divided into three sub-groups of similar identity, with
identity below 25%, between 20-40% and above 35%.
Reference 2 contains alignments of a family of sequences, i.e. closely related sequences
with identity above 25%, together with at most three distantly related sequences of the
same family. Distantly related refers to sequences with less than 25% identity.
Reference 3 contains alignments of equidistant groups of sequences from divergent families.
Each pair of sequences from two different groups has a percentage identity
below 25%.
¹ Available online at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/
² Test set seven, which consists of transmembrane sequences, is not based on three-dimensional structural superposition due to the lack of known three-dimensional structures and the variability of automatic prediction methods.
Reference 4 contains sequences with large N/C-terminal extensions.
Reference 5 contains sequences with internal insertions.
Comparing a test alignment with the alignment contained in the database is done
with the bali_score program.³ The program computes two different scores for each
alignment: the BSPS, which shows to what extent the residue pairs in the alignment
are correctly aligned, and the Column Score (CS), which measures the ability to align
complete columns correctly. Scoring is restricted to core blocks, which are reliable areas
of the alignment where no ambiguity in the structural alignment exists.
The BSPS counts the number of residue pairs aligned in the test alignment that also
occur in the reference alignment. For a test alignment A with N sequences and M
columns and reference alignment R, the score $S_i$ for the $i$th column is defined as
Definition 4.2
$$ S_i = \sum_{j=1}^{N} \sum_{k=1,\, k \ne j}^{N} p_i(j, k) $$
where $p_i(x, y) = 1$ if the residue pair $(A_x[i], A_y[i])$ is also found in the reference alignment,
and $p_i(x, y) = 0$ otherwise. With $S_{max}$ defined as the number of aligned residue pairs in
the reference alignment, the BSPS for the test alignment A is then:
Definition 4.3 (BAliBASE Sum-of-Pairs Score (BSPS))
$$ BSPS(A) = \frac{\sum_{i=1}^{M} S_i}{S_{max}} $$
The CS is defined similarly to the BSPS, but counts only columns that are completely
identical to the reference alignment:
Definition 4.4 (BAliBASE Column Score (CS))
$$ CS = \frac{\sum_{i=1}^{M} C_i}{M} $$
where $C_i = 1$ if all residues in the $i$th column are aligned as in the reference alignment,
and $C_i = 0$ otherwise; $M$ is the number of columns in the test alignment.
The column score is a much stronger measure of alignment quality, since it only takes
complete columns into account. In a situation where one sequence is aligned completely
inconsistently with the reference alignment, the CS would be zero, whereas the BSPS
can still be rather high, since it takes each individual pair of amino acids into account.
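A minimal C++ sketch of the column score is given below, under the simplifying assumption that test and reference alignment have identical dimensions and column numbering; the real bali_score program additionally restricts scoring to the annotated core blocks.

#include <cstddef>
#include <string>
#include <vector>

using Alignment = std::vector<std::string>;  // rows = sequences, '-' = gap

// Counts the columns of the test alignment that are identical to the
// corresponding reference columns (C_i = 1) and divides by M.
double columnScore(const Alignment& test, const Alignment& ref) {
    const std::size_t M = test.at(0).size();
    std::size_t identical = 0;
    for (std::size_t i = 0; i < M; ++i) {
        bool same = true;
        for (std::size_t s = 0; s < test.size(); ++s)
            if (test[s][i] != ref[s][i]) { same = false; break; }
        if (same) ++identical;
    }
    return static_cast<double>(identical) / M;
}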
³ bali_score.c is available at the BAliBASE web-page: http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/bali_score.c
4.2.1 Statistical Evaluation
When comparing the performance of different algorithms, it is important to establish
whether the differences between two algorithms are statistically significant by using
statistical tests. In this thesis, the Wilcoxon signed-rank test is used to calculate a p-value
associated with the differences in the results of two algorithms. This p-value expresses
the probability that the differences are due to chance, with lower p-values expressing
higher significance. The Wilcoxon signed-rank test is also used by Notredame et al.
(2000) and Gotoh (1996); Thompson et al. (1999) used a Friedman test instead.
5 Results
This chapter consists of three parts. The first part contains the results gathered to
compare the alternatives for each of the components of ILS proposed in chapter 3.
The second part contains the results of the evaluation of ILS on the complete BAliBASE
database; the results are analysed using the methodology described in chapter 4.
The third part analyses the runtime behaviour by examining solution-quality trade-offs
and run-length distributions.
5.1 ILS components
For each component of ILS, chapter 3 proposed several alternatives. This section contains
the results of the comparisons of the alternatives proposed for each component.
These results were used to motivate and explain the design decisions and to select the
components of the final ILS algorithm.
5.1.1 Construction heuristic
                 Random           AlignRandom      NeighbourJoining
Reference Set    COFFEE  BSPS     COFFEE  BSPS     COFFEE  BSPS
ref1 test1       0.132   0.0588   0.582   0.255    0.667   0.3
ref1 test2       0.131   0.0478   0.784   0.493    0.836   0.524
ref1 test3       0.119   0.0374   0.844   0.598    0.883   0.623

Table 5.1: Comparison of construction heuristics. Each heuristic was run 10 times on
each set of sequences. The first column for each heuristic shows the average
COFFEE score, the second column the average BSPS score. The test sets
and the BSPS score are described in section 4.2.
Table 5.1 shows the quality of the alignments computed by each construction heuristic.
The best results are computed by NeighbourJoining, AlignRandom comes second and
Random shows the lowest results.
5.1.2 Local search
The two local search variants presented in section 3.2 are compared by running them on
a randomly chosen part of the BAliBASE database. From each reference set inside the
database, one fourth of the test cases are randomly selected and given to ClustalW to
produce an initial alignment, which is then optimised by the local search procedures.
                      SingleExchange                 BlockExchange
                  global         combined         global         combined
Reference set    first   best   first   best     first   best   first   best
< 25% identity    8.5     8.4    6.6     6.5      2.4     2.5    2.1     2.1
20-40% identity   1.9     1.9    1.6     1.7      0.9     0.9    1.2     1.2
> 35% identity    1.3     1.3    1.2     1.2      0.4     0.4    0.5     0.5
reference 2       2.2     2.1    2.0     1.9      1.2     1.1    1.1     1.1
reference 3       1.4     1.4    1.4     1.3      0.9     0.8    0.6     0.6
reference 4       3.4     3.6    2.8     2.6      1.8     1.8    1.3     1.3
reference 5       2.4     2.5    1.7     1.8      1.3     1.3    1.1     1.1
Average           3.01    3.03   2.47    2.43     1.27    1.26   1.13    1.13

Table 5.2: Percentage improvement for both local searches on ClustalW-generated alignments,
averaged over ten runs. For each library (global/combined), the first
column gives first-improvement pivoting, the second best-improvement. See
section 5.1.2 for details.
                      SingleExchange                 BlockExchange
                  global         combined         global         combined
Reference set    first   best   first   best     first   best   first   best
< 25% identity   0.003   0.008  0.003   0.013    0.002   0.006  0.003   0.012
20-40% identity  0.002   0.003  0.002   0.004    0.002   0.006  0.003   0.007
> 35% identity   0.002   0.005  0.002   0.007    0.003   0.007  0.004   0.012
reference 2      0.027   0.172  0.036   0.247    0.099   0.875  0.172   1.384
reference 3      0.028   0.136  0.043   0.218    0.079   0.587  0.160   1.183
reference 4      0.021   0.067  0.014   0.064    0.006   0.021  0.006   0.018
reference 5      0.006   0.025  0.007   0.023    0.007   0.023  0.009   0.027
Average          0.01    0.06   0.02    0.08     0.03    0.22   0.05    0.38

Table 5.3: Runtimes (in seconds) for both local searches on ClustalW-generated alignments,
averaged over ten runs. Column layout as in table 5.2. See section 5.1.2
for details.
The results are presented as average percentage improvements in COFFEE score over
ten runs of the local search on the alignments produced by ClustalW. Reference set 1 is
split into three different sets according to the average percentage identity between the
sequences. Table 5.2 shows the improvement in percent over the alignment computed
with ClustalW. For each local search, results are given for a reference library of only
global alignments generated by ClustalW (column "global") and for a library of both
global and local alignments (column "combined") as used by the COFFEE objective
function. The first column for each library gives the results for first-improvement pivoting,
the second column for best-improvement. Each local search was run ten times.
5.1.3 Perturbation
To test the perturbation procedures, a preliminary version of the ILS has been implemented.
The ILS uses the SingleExchange local search procedure with first-improvement
pivoting and the BETTER acceptance criterion. It is run ten times for 100 iterations on
1/4 of the sequence sets of each subset of reference set 1, due to the large runtime needed
when using Realign. Table 5.4 shows the average improvement in percent w. r. t.
the initial alignment computed by NeighbourJoining. The table shows two values for each
perturbation; the first corresponds to using a COFFEE function with a library of only
global alignments, the second to using a library of global and local alignments. Table 5.5
gives the average runtimes needed by each ILS in seconds, again showing results for the
global-only and the global-plus-local COFFEE library.
Reference Set      Regap          BlockMove      Pairwise       Realign
< 25% identity    31.1   16.3    33.4   17.1    29.0   15.9    26.5   15.3
20-40% identity    7.4    4.1     8.3    4.4     4.9    2.4     5.3    3.5
> 35% identity     2.0    1.3     2.5    1.7     1.1    0.8     1.8    1.3
Average           13.5    7.23   14.73   7.73   11.67   6.37   11.2    6.7

Table 5.4: Percentage improvements in COFFEE score over the initial alignment after
100 iterations using the different perturbations and SingleExchange. For each
perturbation, the first column shows the results for a library of only global
alignments, the second for a combined library of global and local alignments.
Reference set      Regap          BlockMove      Pairwise       Realign
< 25% identity    0.07   0.08    0.23   0.25    0.10   0.11    8.64   8.61
20-40% identity   0.06   0.07    0.21   0.25    0.22   0.24    9.00   9.03
> 35% identity    0.04   0.06    0.22   0.28    0.09   0.10    5.45   5.48
Average           0.06   0.07    0.22   0.26    0.14   0.15    7.7    7.71

Table 5.5: Runtimes of the ILS using the different perturbations, running for 100 iterations.
The runtimes are measured in seconds. Column layout as in table 5.4.
5.1.4 Acceptance Criterion
The comparison of the three acceptance criteria proposed in section 3.4 is done in the
same way as the comparison of the perturbation procedures in the preceding section, but
this time the perturbation procedure of the preliminary ILS is fixed and the acceptance
criterion is varied.
The results indicate that an ILS using the RESTART criterion achieves the highest
improvements.
Test set           ALWAYS         BETTER         RESTART
< 25% identity    33.7   20.8    44.7   32.7    49.3   36.4
20-40% identity   11.8    6.0    14.8    9.0    14.9    9.3
> 35% identity     1.8    1.0     2.2    1.4     2.5    1.6
Average           15.77   9.27   20.57  14.37   22.23  15.77

Table 5.6: Percentage improvement of the COFFEE score for random subsets of each
test case. For each acceptance criterion, the first column is computed using a
library of global alignments, the second using global and local alignments.
Test set     ALWAYS         BETTER         RESTART
< 25%       7.02   8.19    6.11   6.90    6.94   7.89
20-40%      4.75   5.82    4.98   5.94    7.69   7.90
> 35%       2.04   2.68    2.32   2.89    3.90   4.54

Table 5.7: Runtime of the three evaluated acceptance criteria. For each acceptance criterion,
the first column is computed using a library of global alignments, the
second using global and local alignments.
Not surprisingly, the runtime for RESTART is the highest measured, since it has to
compute new initial alignments using the AlignRandom heuristic. The runtime is higher
for sequences with high similarity because the initial alignments computed by NeighbourJoining
already have a high COFFEE score, and it is hard for the ILS to make
further improvements.
5.2 BAliBASE evaluation of ILS
This section evaluates the performance of ILS for MSA using the methodology described
in chapter 4. ILS is compared with ClustalW (Thompson et al., 1994), PRRP¹ (Gotoh,
1996), DIALIGN2 (Morgenstern, 1999), MULTALIN (Corpet, 1988), T-COFFEE
(Notredame et al., 2000) and SAGA (Notredame & Higgins, 1996). The values for the
Tabu Search are taken from Riaz et al. (2004), since the available program failed to
produce an alignment in most cases. Unfortunately, that publication does not report
the CS, which therefore has to be omitted. All other programs are run with default
parameters on every sequence set in the BAliBASE database, and the BSPS and CS
scores are computed with the bali_score program taken from the BAliBASE web-page².
Each algorithm except ILS was run only once on each alignment; ILS was run ten times
each.³

¹ PRRP has since been merged together with PRRN into a single program.
² http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/bali_score.c
³ Deterministic algorithms are run only once; for non-deterministic algorithms, it would be more appropriate to run them several times. However, when SAGA was run on the first reference set, it showed exactly the same results in each run.
44
The results for the complete database are visualised as boxplots for both BSPS and
CS in figures 5.1 and 5.2. Boxplots for each reference set, together with the subsets
of reference set 1, are shown in figure 5.3 for BSPS scores and figure 5.4 for CS scores.
The figures confirm the good performance of ILS. The boxplot for ILS is always among
the highest boxes, and the median always indicates one of the highest values of all
algorithms. Surprisingly, SAGA could not keep up with the other algorithms and showed
very poor performance in the experiments, differing greatly from previously published
values.
To gain a picture of the overall performance of ILS, average values for each reference
set have been computed, as shown in tables 5.8 and 5.9.
As tables 5.8 and 5.9 confirm, ILS achieves the highest average score for the complete
database. The standard deviation lies in the middle of those of the measured algorithms.
For the BSPS scores (table 5.8), ILS performs best in four cases, among them the
case of very low sequence identity, which is usually regarded as the most difficult case
(the "twilight zone"). In one case, it is only 0.01 away from the best algorithm. Comparing
the rankings of each algorithm over the different reference sets, ILS achieves the lowest
sum of ranks together with T-COFFEE. In cases where ILS does not perform best, it is
very close to the best algorithm; the worst rank encountered for ILS is 3. For reference
set 5, it scores second, but only 0.01 points away from the best algorithm.
Looking at the column scores (table 5.9), which are considered to be more discriminating
than the BSPS scores, leads to similar conclusions: ILS performs best for reference
set 1 and is only 0.01 points worse than the best algorithm in set 2. Looking at the
ranking of the algorithms, ILS achieves the best average ranking, indicated by the lowest
sum of ranks. It is always among the top three algorithms when not being the best.
Following the methodology explained in section 4.2, a Wilcoxon signed-rank test is
done to establish the significance of the differences when comparing ILS to the other
alignment algorithms. Since ILS achieved the highest average score, the alternative
hypothesis is a one-sided hypothesis stating that ILS performs better than the other tested
algorithm. Table 5.10 shows the results of these tests. Summarising the results presented
in the table, ILS performs significantly better than SAGA, MULTALIN and DIALIGN2,
and at least as well as PRRN and T-COFFEE. For ClustalW, the tests showed a
significant difference for BSPS scores, but not for CS scores, so it can be said that ILS
performs at least as well as ClustalW, with some evidence that it performs even better.
Finally, the runtimes needed by the algorithms are examined in table 5.11. TS has
not been measured, since it did not succeed in producing an alignment in many cases;
when it succeeded, the runtime lay between those of ILS and SAGA, which is in line with
the runtimes published by Riaz et al. (2004). Comparing raw runtimes is dangerous,
since the environment (programming language, compiler etc.) greatly influences the
measured results. To minimise these effects, all algorithms are run on the same machine
(a multi-processor machine with 12 SUN UltraSparc9 processors, 400 MHz each; none
of the programs uses multi-processing) and compiled with the same compiler using the
same flags. The resulting runtimes are shown in table 5.11. Compared only to the
iterative algorithms, the runtime for ILS is much better. Whereas
Figure 5.1: Boxplot of the average BSPS scores of all 144 alignments in the database
(PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, TS, ILS).

Figure 5.2: Boxplot of the average CS scores of all 144 alignments in the database
(PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, ILS).
0.2
0.2
0.4
0.4
0.6
0.6
0.8
0.8
1.0
1.0
1.0
0.8
0.6
0.4
0.2
PRRN
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
PRRN
(a) Set 1, < 25%
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
PRRN
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
(c) Set 1, > 35%
0.0
0.2
0.2
0.2
0.4
0.4
0.4
0.6
0.6
0.6
0.8
0.8
0.8
1.0
1.0
1.0
(b) Set 1, 20{35%
PRRN
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
PRRN
Clustalw
Multalin
T.Coffee
SAGA
ILS
PRRN
Clustalw
(e) Set 2
Dialign2
Multalin
T.Coffee
SAGA
ILS
(f) Set 3
0.2
0.2
0.4
0.4
0.6
0.6
0.8
0.8
1.0
1.0
(d) Set 1 complete
Dialign2
PRRN
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
PRRN
(g) Set 4
Clustalw
Dialign2
Multalin
T.Coffee
SAGA
ILS
(h) Set 5
Figure 5.3: Boxplots for each reference set showing BSPS-scores.The results are presented in the following order: PRRN, ClustalW, DIALIGN2, MULTALIN,
T{COFFEE, SAGA, ILS.
Figure 5.4: Boxplots of CS-scores for each reference set: (a) set 1, < 25% identity; (b) set 1, 20–35%; (c) set 1, > 35%; (d) set 1 complete; (e) set 2; (f) set 3; (g) set 4; (h) set 5. In each panel the results are presented in the following order: PRRN, ClustalW, DIALIGN2, MULTALIN, T-COFFEE, SAGA, ILS.
Reference Set    PRRN     ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA     TS       ILS
Set 1, < 25%     0.64(2)  0.63(3)   0.50(4)   0.46(6)   0.63(3)   0.19(6)  0.47(5)  0.65(1)
Set 1, 20–35%    0.94(2)  0.92(3)   0.89(5)   0.91(4)   0.94(2)   0.30(7)  0.88(6)  0.95(1)
Set 1, > 35%     0.97(1)  0.96(2)   0.95(3)   0.96(2)   0.97(1)   0.36(5)  0.93(4)  0.97(1)
Set 1            0.85(2)  0.84(3)   0.78(3)   0.78(3)   0.85(2)   0.28(5)  0.76(4)  0.86(1)
Set 2            0.93(1)  0.93(1)   0.89(2)   0.89(2)   0.93(1)   0.47(3)  0.89(2)  0.93(1)
Set 3            0.83(1)  0.75(4)   0.68(7)   0.70(6)   0.78(2)   0.35(8)  0.72(5)  0.77(3)
Set 4            0.75(6)  0.78(4)   0.83(2)   0.58(7)   0.84(1)   0.24(8)  0.77(5)  0.79(3)
Set 5            0.93(4)  0.86(6)   0.94(3)   0.78(7)   0.96(1)   0.08(8)  0.90(5)  0.95(2)
Ranksum          19       26        29        37        13        50       36       13
Average          0.86(1)  0.83(2)   0.81(3)   0.75(4)   0.86(1)   0.23(5)  0.81(3)  0.86(1)
W/Average        0.87(1)  0.85(2)   0.82(3)   0.78(5)   0.87(1)   0.26(6)  0.80(4)  0.87(1)
Stdev            0.11(2)  0.11(2)   0.15(3)   0.17(4)   0.11(2)   0.12(5)  0.08(1)  0.11(2)
Table 5.8: Average BSPS-scores. Scoring is done with core block annotation; the BSPS-scores
are computed by running the algorithms and scoring each computed alignment
with the bali_score program. The small numbers indicate the rank of each
algorithm, from 1 (best) to 8 (worst).
Reference Set    PRRN     ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA     ILS
Set 1, < 25%     0.46(2)  0.45(3)   0.32(5)   0.27(6)   0.41(4)   0.04(7)  0.48(1)
Set 1, 20–35%    0.91(2)  0.87(4)   0.82(6)   0.85(5)   0.90(3)   0.05(7)  0.92(1)
Set 1, > 35%     0.95(2)  0.95(2)   0.92(3)   0.95(2)   0.96(1)   0.14(3)  0.95(2)
Set 1            0.77(2)  0.76(3)   0.69(4)   0.69(4)   0.76(3)   0.08(5)  0.78(1)
Set 2            0.60(1)  0.58(3)   0.38(6)   0.42(5)   0.57(4)   0.01(7)  0.59(2)
Set 3            0.63(1)  0.46(4)   0.35(6)   0.37(5)   0.51(2)   0.01(7)  0.49(3)
Set 4            0.46(5)  0.51(4)   0.68(1)   0.21(6)   0.65(2)   0.00(7)  0.52(3)
Set 5            0.84(3)  0.64(4)   0.84(3)   0.51(5)   0.90(1)   0.19(6)  0.86(2)
Ranksum          18       27        34        38        20        49       15
Average          0.69(2)  0.64(3)   0.62(4)   0.51(6)   0.70(1)   0.09(7)  0.69(2)
W/Average        0.72(1)  0.68(2)   0.64(3)   0.57(4)   0.72(1)   0.08(5)  0.72(1)
Stdev            0.19(2)  0.19(2)   0.24(4)   0.27(5)   0.20(3)   0.07(1)  0.20(3)
Table 5.9: Average CS-scores. Scoring is done with core block annotation; the CS-scores are
computed by running the algorithms and scoring each computed alignment
with the bali_score program. The small numbers indicate the rank of each
algorithm, from 1 (best) to 8 (worst).
Score  PRRN  ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA
BSPS   n.s.  *         **        ***       n.s.      ***
CS     n.s.  n.s.      *         ***       n.s.      ***
Table 5.10: Statistical significance of the differences in BSPS- and CS-scores. The table
gives the significance level (p-value) of a Wilcoxon signed-rank test comparing ILS with
each other alignment algorithm: *: p < 0.05, **: p < 0.01, ***: p < 0.001,
n.s.: p > 0.05.
Reference Set    PRRN     ClustalW  DIALIGN2  MULTALIN  T-COFFEE  SAGA      ILS
Set 1, < 25%     6.38     0.86      1.69      0.14      3.03      537.33    26.98
Set 1, 20–35%    2.92     1.56      2.61      0.21      3.71      735.69    38.27
Set 1, > 35%     1.41     1.96      3.71      0.23      4.82      921.88    46.43
Set 1            3.57     1.46      2.67      0.19      3.85      731.63    37.22
Set 2            26.88    7.80      33.17     0.69      74.76     8671.22   282.67
Set 3            66.82    10.71     58.97     0.60      68.95     25946.50  488.07
Set 4            118.03   3.17      9.70      0.53      29.71     7629.59   161.70
Set 5            35.70    3.23      10.58     0.39      23.26     4242.74   113.98
Average          36.88    4.18      17.21     0.40      29.75     6322.35   165.44
W/Average        27.05    3.59      13.66     0.36      24.94     5999.00   133.74
Stdev            41.22    3.53      20.43     0.21      29.77     8632.06   162.30
Table 5.11: Average runtimes. Each value is the average runtime in seconds for the specified reference set.
Whereas SAGA and Tabu Search can take extremely long (more than one hour in some cases),
ILS computes an alignment in less than 38 seconds for small sets of sequences (up to five
sequences, reference set 1). For large sequence sets, as in reference set 3 (sets of more
than twenty sequences of up to 511 amino acids), the runtime is in the order of
several minutes. This qualifies ILS for everyday use. From the times given for the subsets
of reference set 1, it can be seen that the average identity clearly influences the runtime,
with lower identity corresponding to shorter runtime.
5.3 Runtime-behaviour of ILS
This section analyses the capability of ILS to optimise an alignment towards a high value
of the COFFEE objective function. ILS was run 500 times on reference set 1, with the number
of iterations set to 20 · 2 · l_max, where l_max is the length of the longest sequence. A
RESTART acceptance criterion was used, which starts from a new alignment after
l_max non-improving iterations (see sections 2.3.4 and 3.4 for details); a minimal sketch of this loop is given below.
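The following sketch shows one plausible shape of this loop in Python; local_search, perturb, random_alignment and coffee_score are assumed stand-ins for the components described in chapters 2 and 3, not code from the actual implementation.

    # Minimal ILS skeleton with a RESTART acceptance criterion (sketch only).
    def ils_restart(seqs, l_max, iterations,
                    local_search, perturb, random_alignment, coffee_score):
        current = local_search(random_alignment(seqs))
        best = current
        stagnation = 0                        # consecutive non-improving steps
        for _ in range(iterations):
            candidate = local_search(perturb(current))
            if coffee_score(candidate) > coffee_score(current):
                current, stagnation = candidate, 0   # accept improvements only
            else:
                stagnation += 1
            if coffee_score(current) > coffee_score(best):
                best = current
            if stagnation >= l_max:           # RESTART from a fresh alignment
                current = local_search(random_alignment(seqs))
                stagnation = 0
        return best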
A first approach to analysing the runtime behaviour of ILS is to examine the trade-off between solution quality and computation time. A common way to visualise the
relationship between runtime and solution quality is to plot the development of the average
solution quality found versus runtime, as shown in figure 5.5.

Figure 5.5: Solution quality trade-off for 1bbt3_ref1. The x-axis gives the runtime, measured in performed local search steps; the y-axis shows the percentage deviation from the best value found, averaged over 500 runs.
The figure shows that ILS quickly improves the solution quality to close to the best value found over the 500 runs,
but starts stagnating after approximately one fourth of the runtime. This picture is
typical of the runtime behaviour of ILS; other solution quality trade-offs look similar,
e.g. figure 5.6. However, the point at which stagnation begins varies depending on the
particular instance: in figure 5.6, for example, stagnation starts close to the end of the runtime.
A more powerful approach to describing and analysing the runtime behaviour of Metaheuristics is the use of RTDs, as described in section 4.1. We will now use this approach to
examine the behaviour of ILS.
Figures 5.7 and 5.8 show the RLDs for sequence sets 1r69_ref1 and 1pamA_ref1, two
sets of four sequences with low identity (< 25%). 1r69 consists of small sequences
of up to 78 amino acids, 1pamA of long sequences of up to 572 amino acids. The figures show
probability distributions for reaching a solution quality of 10%, 7.5%, 5%, 3% and 2%
deviation from the best value found in all 500 runs, together with a plot of an exponential
distribution; the reason for plotting the exponential distribution is explained
later. Both figures show that ILS can easily find a solution no more than 3% away
from the maximum in a short time with very high probability (the measured runtime for
this instance was only 1.54 seconds), but the probability of reaching a solution within 2%
of the maximum in the given time is very low. A sketch of how such an empirical RLD can be computed follows.
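A rough sketch of how such an empirical RLD can be computed; run_traces is a hypothetical data structure holding, for each of the 500 runs, the (step, best score so far) pairs recorded during the search.

    # Sketch: empirical run-length distribution from per-run search traces.
    def empirical_rld(run_traces, best_known, deviation, steps):
        """Estimate P(a solution within `deviation` of best_known is found
        within s local search steps) for each s in `steps`."""
        target = best_known * (1.0 - deviation)       # e.g. 0.03 for "3%"
        # first step at which each run reached the target (None = never)
        hits = [next((s for s, score in trace if score >= target), None)
                for trace in run_traces]
        return [sum(1 for h in hits if h is not None and h <= s) / len(hits)
                for s in steps]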
Another conclusion drawn from the figures is that ILS needs an initialisation time
t_i before any solution within the required bounds is found.

Figure 5.6: Solution quality trade-off for 1pamA_ref1. For an explanation, refer to section 5.3 and figure 5.5.

Considering the enormous
number of possible alignments for a set of sequences, it becomes clear that the probability
to find an optimal solution right at the beginning must be very low. The probability of
finding a sufficient solution at the beginning clearly depends on the size of the sequence
set, the lengths of the sequences, and on a factor which can be called the hardness of the set. The
probability will be considerably higher for sets which have many local optima close to
the global optimum, but lower for sets where the global optimum is peaked and the quality
of the local optima is much lower (the corresponding fitness landscape would look like a flat
valley with small hills and one extremely steep mountain).
The RLD can be approximately described by an exponential distribution (verified
with a χ²-test). When the actual RLD corresponds to an exponential distribution, the
use of RESTART to improve the performance can be justified from a theoretical point of view
rather than by arguing only with measured performances: in this case, the probability
of finding the desired solution when running an algorithm once for a long time t is
equal to the probability of finding it when running the same algorithm p times with runtime t/p each (Stützle,
1999a). Moreover, using restarts can change the steepness of the distribution and thus yield an improvement
if the empirical distribution without restarts is less steep than an exponential
distribution; this behaviour is explained by Stützle & Hoos (2002), using an ILS
algorithm for the TSP as an example.
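A quick numeric check of this identity, assuming an exactly exponential RLD; the rate used below is an arbitrary made-up value.

    # Exponential RLD: one run of length t succeeds exactly as often as
    # p independent runs of length t/p each. The rate lam is an assumption.
    import math

    lam, t, p = 1e-4, 30000, 5
    one_long_run = 1 - math.exp(-lam * t)
    p_short_runs = 1 - math.exp(-lam * t / p) ** p
    print(one_long_run, p_short_runs)  # both 0.9502...: identical success rates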
Figure 5.7: Run-length distribution for 1r69_ref1, showing curves for 10%, 7.5%, 5%, 3% and 2% deviation from the best value found, together with an exponential distribution. The x-axis is the CPU time measured in local search steps; the y-axis is the probability of finding a solution of the required quality. Please refer to the text for details.
Figure 5.8: Run-length distribution for 1pamA_ref1, showing curves for solution qualities of 90%, 91%, 92% and 93%, together with an exponential distribution. The x-axis is the CPU time measured in local search steps; the y-axis is the probability of finding a solution of the required quality. Please refer to the text for details.
6 Conclusions and Discussion
The intention of this thesis was to transfer a contemporary optimisation method from Computer Science to a problem from the Bioinformatics domain. The method selected is
ILS, a fast and high-performing optimisation technique successfully used on many CO
problems before.
6.1 Conclusions
The last chapter has shown that ILS is quite capable of optimising an alignment w.r.t.
the COFFEE OF. On the other hand, ILS has problems producing biologically viable
alignments in some cases, expressed by low scores when the computed alignments are compared
to the reference alignments in the database. One factor largely influencing biological
viability is the ability of the OF to assign higher scores to biologically more reliable
alignments, a key property of a good scoring function for MSA and a major problem in the
design of every OF. Figure 6.1 shows the percentage deviation of the optimised alignments
computed by ILS from the COFFEE score of the reference alignments
in the database. On average, alignments computed by ILS score 6%
above the reference alignments; in some extreme cases ILS even produces
alignments scoring twice as high as the reference alignments. Since the reference
alignments are manually refined and derived from structural properties, and thus very close
to the biological optimum, this seems to be a major weakness of the COFFEE OF and
should probably be a focus of future research.
Looking at the results in chapter 5, ILS is not proposed as an alternative to the
standard MSA method, ClustalW, which seems to be challenged more by T-COFFEE
according to the results measured during this project. But when the standard methods
fail to produce meaningful alignments, the current method of choice is SAGA. Since ILS
produces better alignments in most cases, and needs considerably less time to do so, ILS
is proposed as a replacement for SAGA.
The exponential shape of the empirical RLDs gives strong evidence that the actual
RLD of ILS is an exponential distribution. This gives a theoretical justification for the
RESTART acceptance criterion and indicates that ILS can benefit from parallelisation.
6.2 Discussion
ILS has been shown to be a high-scoring and very reliable method for MSA on all test
cases given by the BAliBASE database, always achieving a rank among the top three
algorithms.
54
1.0
0.8
0.6
0.4
0.2
0.0
Percentage derivation of ILS from reference alignment (COFFEE)
Sequence set
Figure 6.1: Percentage derivation of optimised alignments from reference alignments.
Each bar shows the dierence of the optimised alignment score from the
reference score in percent. Positive values indicate that the computed alignment's score is higher than the reference alignment's, negative values the
opposite.
One point to mention in this discussion is that the measured BSPS-scores differ from the values published by Thompson et al. (1999), especially in the
case of SAGA, which achieved much better scores in that study. Nevertheless, this does
not falsify the conclusion that ILS is reliable and high-scoring: comparing the published
results of SAGA to ILS, ILS still achieves higher average scores for the complete database,
and the published results for SAGA are higher in only two out of eight cases.
Due to its stochastic nature, ILS is likely to produce slightly different alignments each
time it is run. These alignments can be used in cases where all algorithms fail to produce
reliable alignments, as starting points for manual refinement, or to investigate whether there are similarities
among all high-scoring alignments. The latter may lead to a method for detecting conserved
regions or motifs in a set of sequences.
The major drawback of ILS is its long runtime compared to the progressive
algorithms. This is intrinsic to iterative algorithms, and it is shared, with even longer
runtimes, by SAGA and TS. The runtime could be improved by running ILS several times
in parallel: recalling that the runtime follows an exponential distribution, this would
yield a near-optimal speedup, as the worked equation below shows.
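Stated as a worked equation under the exponential-RTD assumption (a sketch of the standard argument, with λ the rate of the fitted exponential and p the number of parallel runs):

    P_{\mathrm{par}}(t, p) = 1 - \left(e^{-\lambda t}\right)^{p}
                           = 1 - e^{-\lambda p t}
                           = P_{\mathrm{seq}}(p\,t)

That is, p processors each running for time t succeed with the same probability as a single processor running for time p·t.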
A Abbreviations
BSPS BAliBASE Sum of Pairs Score
CO Combinatorial Optimisation
COFFEE Consistency based Objective Function For alignmEnt Evaluation
CS Column Score
GA Genetic Algorithm
ILS Iterated Local Search
MSA Multiple Sequence Alignment
OF Objective function
RLD Run-length distribution
RTD Run-time distribution
SA Simulated Annealing
SP Sum of Pairs
TS Tabu Search
TSP Travelling Salesman Problem
Bibliography
Altschul, S. F. (1989) Gap costs for multiple sequence alignment. Journal of Theoretical
Biology, 138 (3), 297–309.
Bahr, A., Thompson, J. D., Thierry, J.-C. & Poch, O. (2001) BAliBASE (Benchmark
Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Research, 29 (1), 323–326.
Bentley, J. L. (1992) Fast algorithms for geometric traveling salesman problems. ORSA
Journal on Computing, 4, 387–411.
Blum, C. & Roli, A. (2003) Metaheuristics in combinatorial optimization: overview and
conceptual comparison. ACM Computing Surveys, 35 (3), 268–308.
Brodal, G. S., Fagerberg, R., Mailund, T., Pedersen, C. N. & Phillips, D. (2003) Speeding up neighbour-joining tree construction. Technical Report ALCOMFT-TR-03-102,
ALCOM-FT, University of Aarhus.
Carrillo, H. & Lipman, D. (1988) The multiple sequence alignment problem in biology.
SIAM Journal on Applied Mathematics, 48 (5), 1073–1082.
Chiarandini, M. & Stützle, T. (2002) An application of iterated local search to graph
coloring. In Proceedings of the Computational Symposium on Graph Coloring and its
Generalizations (Johnson, D. S., Mehrotra, A. & Trick, M., eds), pp. 112–125, Discrete
Applied Mathematics, Ithaca, New York, USA.
Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic
Acids Research, 16 (22).
den Besten, M., Stützle, T. & Dorigo, M. (2001) Design of iterated local search algorithms: an example application to the single machine total weighted tardiness problem.
Lecture Notes in Computer Science, 2037, 441–451.
Glover, F. (1986) Future paths for integer programming and links to artificial intelligence.
Computers and Operations Research, 13 (5), 533–549.
Glover, F. & Laguna, M. (1997) Tabu Search. Kluwer Academic Publishers, London.
Gotoh, O. (1982) An improved algorithm for matching biological sequences. Journal of
Molecular Biology, 162, 705–708.
Gotoh, O. (1993) Optimal alignment between groups of sequences and its application to
multiple sequence alignment. Computer Applications in the Biosciences, 9 (3), 361–370.
Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence
alignment by iterative refinement as assessed by reference to structural alignments.
Journal of Molecular Biology, 264, 823–838.
Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18 (6), 341–343.
Hoos, H. H. & Stützle, T. (1998) Evaluating Las Vegas algorithms: pitfalls and remedies.
In Proceedings of UAI-98, pp. 238–245.
Just, W. (1999) Computational complexity of multiple sequence alignment with SP-score.
Journal of Computational Biology, 8 (6), 615–623.
Kim, J., Pramanik, S. & Chung, M. J. (1994) Multiple sequence alignment using simulated annealing. Computer Applications in the Biosciences, 10 (4), 419–426.
Lipman, D., Altschul, S. & Kececioglu, J. (1989) A tool for multiple sequence alignment.
Proceedings of the National Academy of Sciences USA, 86 (12), 4412–4415.
Lourenco, H. & Zwijnenburg, M. (1996) Combining the large-step optimization with
tabu-search: application to the job-shop scheduling problem. Kluwer Academic Publishers,
Norwell, MA, pp. 219–236.
Lourenco, H. R. D., Martin, O. & Stützle, T. (2001) A beginner's introduction to iterated
local search. In Proceedings of the 4th Metaheuristics International Conference – MIC
2001, pp. 1–6, Porto, Portugal.
Lourenco, H. R. D., Martin, O. & Stützle, T. (2002) Iterated local search. In Handbook
of Metaheuristics (Glover, F. & Kochenberger, G., eds), Kluwer Academic Publishers,
Norwell, MA, pp. 321–353.
Morgenstern, B. (1999) DIALIGN2: improvement of the segment-to-segment approach
to multiple sequence alignment. Bioinformatics, 15, 211–218.
Morgenstern, B., Frech, K., Dress, A. & Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14 (3), 290–294.
Needleman, S. B. & Wunsch, C. D. (1970) A general method applicable to the search for
similarities in the amino-acid sequence of two proteins. Journal of Molecular Biology,
48, 443–453.
Notredame, C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144.
Notredame, C. & Higgins, D. G. (1996) SAGA: sequence alignment by genetic algorithm.
Nucleic Acids Research, 24 (8), 1515–1524.
Notredame, C., Higgins, D. G. & Heringa, J. (2000) T-Coffee: a novel method for
fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302,
205–217.
Notredame, C., Holm, L. & Higgins, D. G. (1998) COFFEE: an objective function for
multiple sequence alignments. Bioinformatics, 14 (5), 407–422.
Papadimitriou, C. H. & Steiglitz, K. (1982) Combinatorial Optimization: Algorithms
and Complexity. Prentice-Hall, New Jersey.
Paquete, L. & Stützle, T. (2002) An experimental investigation of iterated local search for
coloring graphs. In Applications of Evolutionary Computing (Cagnoni, S., Gottlieb,
J., Hart, E., Middendorf, M. & Raidl, G. R., eds), vol. 2279, Springer Verlag, pp.
122–131.
Reeves, C. R. (1993) Modern Heuristic Techniques for Combinatorial Problems. John
Wiley & Sons, Inc.
Reinert, K., Cordes, F., Huson, D., Lutz, H., Steinke, T., Schliep, A. & Vingron, M.
(2003) Algorithmische Bioinformatik (Algorithmic Bioinformatics). Lecture notes, FU Berlin, winter term 2003.
Riaz, T., Li, K.-B. & Wang, Y. (2004) Multiple sequence alignment using tabu search. In
Second Asia-Pacific Bioinformatics Conference (APBC2004) (Chen, Y.-P. P., ed.),
vol. 29 of CRPIT, pp. 223–232, Australian Computer Society Inc., Dunedin, New
Zealand.
Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4 (4), 406–425.
Sneath, P. H. A. & Sokal, R. R. (1973) Numerical Taxonomy: The Principles and
Practice of Numerical Classification. W. H. Freeman, San Francisco.
Stoye, J., Moulton, V. & Dress, A. W. M. (1997) DCA: an efficient implementation of the
divide-and-conquer approach to simultaneous multiple sequence alignment. Computer
Applications in the Biosciences, 13 (6), 625–626.
Stützle, T. (1998) Applying iterated local search to the permutation flow shop problem.
Technical Report AIDA-98-04, Darmstadt University of Technology, Computer Science
Department, Intellectics Group.
Stützle, T. (1999a) Local Search Algorithms for Combinatorial Problems – Analysis,
Improvements, and New Applications. Infix, Sankt Augustin. Also PhD thesis, Darmstadt University of Technology, Computer Science Department, 1998.
Stützle, T. (1999b) Iterated local search for the quadratic assignment problem. Technical
Report AIDA-99-03, Darmstadt University of Technology, Computer Science Department,
Intellectics Group.
Stützle, T. & Hoos, H. (2002) Analyzing the run-time behaviour of iterated local search
for the TSP. In Essays and Surveys on Metaheuristics, Kluwer Academic Publishers,
pp. 589–612.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22
(22), 4673–4680.
Thompson, J. D., Plewniak, F. & Poch, O. (1999) A comprehensive comparison of
multiple sequence alignment programs. Nucleic Acids Research, 27 (13), 2682–2690.
Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J.-C. & Poch, O. (2001) Towards
a reliable objective function for multiple sequence alignments. Journal of Molecular
Biology, 314 (4), 937–951.
Thompson, J. D., Thierry, J.-C. & Poch, O. (2003) RASCAL: rapid scanning and correction
of multiple sequence alignments. Bioinformatics, 19 (9), 1155–1161.
Thomson, J. D., Plewniak, F. & Poch, O. (1999) BAliBASE: a benchmark alignment
database for the evaluation of multiple alignment programs. Bioinformatics, 15 (1),
87–88.
Wang, L. & Jiang, T. (1996) On the complexity of multiple sequence alignment. Journal
of Computational Biology, 1 (4), 337–348.