ProbCons: Probabilistic consistency-based multiple alignment of amino acid sequences

Downloaded from www.genome.org on November 19, 2007 - Published by Cold Spring Harbor Laboratory Press
ProbCons: Probabilistic consistency-based multiple sequence
alignment
Chuong B. Do, Mahathi S.P. Mahabhashyam, Michael Brudno and Serafim Batzoglou
Genome Res. 2005 15: 330-340
Access the most recent version at doi:10.1101/gr.2821705
Supplementary data: "Supplemental Research Data" http://www.genome.org/cgi/content/full/15/2/330/DC1
References: This article cites 59 articles, 32 of which can be accessed free at: http://www.genome.org/cgi/content/full/15/2/330#References
Article cited in: http://www.genome.org/cgi/content/full/15/2/330#otherarticles
© 2005 Cold Spring Harbor Laboratory Press
Resource
ProbCons: Probabilistic consistency-based multiple
sequence alignment
Chuong B. Do,1 Mahathi S.P. Mahabhashyam,1 Michael Brudno,1 and Serafim Batzoglou1,2
1 Department of Computer Science, Stanford University, Stanford, California 94305, USA
To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence
alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because
of not only the high computational cost but also the lack of proper objective functions for measuring alignment
quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence
comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on
probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the
BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant
improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web
resource.
[Supplemental material is available online at www.genome.org. Source code and executables are available as public
domain software at http://probcons.stanford.edu.]
Given a set of biological sequences, a multiple alignment provides a way of identifying and visualizing patterns of sequence
conservation by organizing homologous positions across different sequences in columns. As sequence similarity often implies
divergence from a common ancestor or functional similarity, sequence comparisons facilitate evolutionary and phylogenetic
studies (Phillips et al. 2000; Castillo-Davis et al. 2004) and isolation of the most relevant regions (Attwood 2002) for a variety of
biological analyses. In particular, conserved amino acid stretches
in proteins are strong indicators of preserved three-dimensional
structural domains, so protein alignments have been widely used
in aiding structure prediction (Rost and Sander 1994; Jones 1999)
and characterization of protein families (Sonnhammer et al.
1998; Johnson and Church 1999; Bateman et al. 2004). However,
when sequence identity falls below 30%, called the “twilight
zone” of protein alignments, the accuracies of most automatic
sequence alignment methods drop considerably (Rost 1999;
Thompson et al. 1999b). As a result, alignment quality is often
the limiting factor in biological analyses of amino acid sequences
(Jaroszewski et al. 2002).
The problem of alignment construction consists of defining
either explicitly or implicitly an objective function for assessing
alignment quality and employing an efficient algorithm to find
the optimal, or a near optimal, alignment according to the objective function. Two-sequence alignments are usually evaluated
by addition of match/mismatch scores for aligned pairs of positions
and affine gap penalties for unaligned amino acids (Needleman and
Wunsch 1970; Smith and Waterman 1981). Quantitatively, scores
for aligned residues are given by log-odds (Altschul 1991) substitution matrices such as PAM (Dayhoff et al. 1978), GONNET (Gonnet
et al. 1992), or BLOSUM (Henikoff and Henikoff 1992). Estimation
of appropriate gap penalties, however, is often regarded as a “black
art” based on trial and error (Vingron and Waterman 1994). For two sequences of length L, an optimal alignment
according to this metric may be computed in O(L^2) time (Gotoh 1982) and O(L) space (Myers and Miller 1988)
via dynamic programming.
2 Corresponding author.
E-mail [email protected]; fax (650) 725-1449.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2821705.
Pair-hidden Markov models (HMMs) provide an alternative
formulation of the sequence alignment problem in which alignment generation is directly modeled as a first-order Markov process involving state emissions and transitions. In this approach,
model parameters obtain an intuitive probabilistic interpretation
and can be trained on real data using standard supervised or
unsupervised likelihood-based methods. The Viterbi (1967)
algorithm computes the highest probability alignment of two
input sequences according to an alignment pair-HMM. In the
standard three-state pair-HMM for alignment, the Viterbi algorithm may be viewed as an instantiation of the Needleman–
Wunsch algorithm in which alignment parameters are determined by a log-odds transformation of the HMM scoring scheme
(Durbin et al. 1998).
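As a point of reference for the dynamic-programming formulation described above, the following is a minimal sketch of global alignment scoring with match/mismatch scores and affine gap penalties. The function name and the toy scoring values (match, mismatch, gap_open, gap_extend) are illustrative assumptions, not PAM, GONNET, or BLOSUM scores, and the full O(L^2) tables are kept for readability rather than using the O(L)-space refinement.

```python
def affine_global_score(x, y, match=2, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Optimal global alignment score; a gap of length k costs gap_open + (k-1)*gap_extend."""
    neg_inf = float("-inf")
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]; X: x[i-1] aligned to a gap; Y: y[j-1] aligned to a gap.
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("GATTACA", "GATTACA"))  # seven matches
print(affine_global_score("AAA", "A"))            # one match plus a gap of length two
```

Keeping three tables, one per alignment state, is what makes the affine penalty computable in quadratic time; the same three-state structure reappears in the pair-HMM view.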
Since they specify a conditional probability distribution
over the space of all suboptimal alignments, pair-HMMs also allow the computation of the posterior probability, P(xi ∼ yj ∈ a* | x,
y), that particular positions xi and yj of two sequences x and y,
respectively, will be matched in an alignment a* generated by the
model. Running the Needleman–Wunsch algorithm with these
posterior probabilities as substitution scores and no gap penalties
gives rise to the maximum expected accuracy alignment method
(see Methods), also known as optimal accuracy alignment (Holmes
and Durbin 1998).
In the general case of multiple sequence comparisons, theoretically sound and biologically motivated scoring methods are
not straightforward to devise. In practice, ad hoc sum-of-pairs
schemes (Carrillo and Lipman 1988), which combine the projected pairwise log-odds scores for all pairs of sequences in the
alignment, and their weighted variants (Altschul et al. 1989) are
commonly used. Unfortunately, direct application of dynamic
programming is too inefficient for alignment of more than a few
sequences. Instead, a variety of heuristic strategies have been proposed, including genetic algorithms (Notredame and Higgins
1996), simulated annealing (Kim et al. 1994), alignment to a profile HMM (Krogh et al. 1994; Eddy 1995), or greedy
assemblage of multiple segment-to-segment comparisons (Morgenstern et al. 1996). By far, the most popular heuristic strategies
involve tree-based progressive alignment (Feng and Doolittle 1987) in which groups of sequences are assembled into a complete
multiple alignment via several pairwise alignment steps. As with any hierarchical approach, however, errors at early stages in
the alignment not only propagate to the final alignment but also may increase the likelihood of misalignment due to incorrect
conservation signals. Post-processing steps such as iterative refinement (Gotoh 1996) alleviate some of the errors made during
progressive alignment.
15:330–340 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05; www.genome.org
Consistency-based schemes take the alternative view that
“prevention is the best medicine.” Note that for any multiple
alignment, the induced pairwise alignments are necessarily consistent—that is, given a multiple alignment containing three sequences x, y, and z, if position xi aligns with position zk and
position zk aligns with yj in the projected x–z and z–y alignments,
then xi must align with yj in the projected x–y alignment. Consistency-based techniques apply this principle in reverse, using
evidence from intermediate sequences to guide the pairwise
alignment of x and y, such as needed during the steps of a progressive alignment. By adjusting the score for an xi ∼ yj residue
pairing according to support from some position zk that aligns to
both xi and yj in the respective x–z and y–z pairwise comparisons,
consistency-based objective functions incorporate multiple sequence information in scoring pairwise alignments.
Gotoh (1990) first introduced consistency to identify anchor points for reducing the search space of a multiple alignment. A mathematically elegant reformulation of consistency in
terms of boolean matrix multiplication was later given by Vingron and Argos (1991) and implemented in the program MALI,
which builds multiple alignments from dot matrices (Vingron
and Argos 1989). An alternative formulation of consistency was
employed in the DIALIGN tool, which finds ungapped local
alignments via segment-to-segment comparisons, determines
new weights for these alignments using consistency, and assembles them into a multiple alignment by a greedy selection
procedure (Morgenstern et al. 1996).
More recently, Notredame et al. (1998) introduced COFFEE,
a new consistency-based objective function for scoring residue
pairs in a pairwise alignment. In this approach, an alignment
library is computed by merging consistent CLUSTALW (Thompson et al. 1994) global and LALIGN (Huang and Miller 1991) local
pairwise alignments to form three-way alignments, which are
assigned percent identity weights. Then, the score for aligning xi
to yj is defined to be the sum of the weights of all alignments in
the library containing that aligned residue pair. The program
T-Coffee (Notredame et al. 2000), which implements multiple
sequence alignment under this objective function using progressive maximum weight trace computations (Kececioglu 1993), has
demonstrated superior accuracy on the BAliBASE test suite
(Thompson et al. 1999a) over competing methods, including
CLUSTALW, DIALIGN, and PRRP (Gotoh 1996).
In this article, we introduce probabilistic consistency, a novel
modification of the traditional sum-of-pairs scoring system that
incorporates HMM-derived posterior probabilities and three-way
alignment consistency. We discuss the theoretical motivations
behind the probabilistic consistency scoring system and demonstrate its applicability with ProbCons, a protein progressive
multiple alignment tool based on this technique. To assess the utility
of our methods, we compared ProbCons to several current leading alignment tools including Align-m (Van Walle et al. 2004),
CLUSTALW, DIALIGN, MAFFT (Katoh et al. 2002), MUSCLE
(Edgar 2004), and T-Coffee on the BAliBASE, SABmark (Van
Walle et al. 2004), and PREFAB (Edgar 2004) benchmark alignment databases, using commonly accepted accuracy measures for
validating alignment quality. In this comparison, ProbCons
shows a clear statistically significant improvement in accuracy
over all other alignment tools in every benchmark test, while
maintaining practical running times. Moreover, all parameters
for the program are derived through unsupervised training methods without making any manual adjustments. ProbCons is publicly available as a Web resource. Source code and executables are
available as public domain software at http://probcons.stanford.edu.
Results
Algorithm overview
Fundamentally, ProbCons is a pair-hidden Markov model-based
progressive alignment algorithm that primarily differs from most
typical approaches in its use of maximum expected accuracy rather
than Viterbi alignment, and of the probabilistic consistency transformation to incorporate multiple sequence conservation information during pairwise alignment. ProbCons uses the HMM
shown in Figure 1 to specify the probability distribution over all
alignments between a pair of sequences. Emission probabilities,
which correspond to traditional substitution scores, are based on
the BLOSUM62 matrix (Henikoff and Henikoff 1992). Transition
probabilities, which correspond to gap penalties, are trained with
unsupervised expectation maximization (EM).
ProbCons algorithm
Given m sequences, S = {s(1), …, s(m)}:
Step 1: Computation of posterior-probability matrices
For every pair of sequences x, y ∈ S and all i ∈ {1, …, |x|}, j ∈ {1, …,
|y|}, compute the matrix Pxy, where Pxy(i, j) = P(xi ∼ yj ∈ a* | x, y) is
the probability that letters xi and yj are paired in a*, an alignment
of x and y generated by the model.
Figure 1. Basic pair-HMM for sequence alignment between two sequences, x and y. State M emits two letters, one from each sequence, and
corresponds to the two letters being aligned together. State Ix emits a
letter in sequence x that is aligned to a gap, and similarly state Iy emits a
letter in sequence y that is aligned to a gap. Finding the most likely
alignment according to this model by using the Viterbi algorithm corresponds to applying Needleman–Wunsch with appropriate parameters.
The logarithm of the emission probability function p(.,.) at M corresponds
to a substitution scoring matrix, while affine gap penalty parameters can
be derived from the transition probabilities δ and ε (Durbin et al. 1998).
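The posterior matrix of Step 1 can be obtained from the forward and backward algorithms on the three-state pair-HMM of Figure 1. The sketch below is one minimal way to do this and rests on several assumptions that are not taken from the text: the start of the alignment is treated as if it were state M, any state may end the alignment, the emission functions are toy placeholders rather than BLOSUM62-derived values, and the default delta and eps are illustrative numbers rather than the trained parameters.

```python
def emit_match(a, b):
    # Toy joint emission over 20 amino acids: 60% of the mass on identical pairs (placeholder).
    return 0.6 / 20 if a == b else 0.4 / 380


def emit_gap(a):
    # Toy uniform background emission for the insert states (placeholder).
    return 1.0 / 20


def posterior_match_matrix(x, y, delta=0.02, eps=0.8):
    """Return P[i][j] = P(x_i ~ y_j | x, y) under a three-state pair-HMM (0-based indices)."""
    n, m = len(x), len(y)

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] is the probability of emitting x[:i], y[:j] and ending in that state.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0  # assumption: the start behaves like state M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = emit_match(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = emit_gap(x[i - 1]) * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = emit_gap(y[j - 1]) * (delta * fM[i][j - 1] + eps * fY[i][j - 1])

    # Backward pass: b*[i][j] is the probability of emitting the remaining letters from that state.
    bM, bX, bY = zeros(), zeros(), zeros()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = emit_match(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            down = emit_gap(x[i]) * bX[i + 1][j] if i < n else 0.0
            right = emit_gap(y[j]) * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * diag + delta * (down + right)
            bX[i][j] = (1 - eps) * diag + eps * down
            bY[i][j] = (1 - eps) * diag + eps * right

    total = fM[n][m] + fX[n][m] + fY[n][m]
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)] for i in range(1, n + 1)]
```

For long sequences a practical implementation would work in log space or rescale each row to avoid underflow; that is omitted here to keep the recurrences visible.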
Step 2: Computation of expected accuracies
Define the expected accuracy of a pairwise alignment a between
x and y to be the expected number of correctly aligned pairs of
letters, divided by the length of the shorter sequence:
\[ E_{a^*}(\mathrm{accuracy}(a, a^*) \mid x, y) = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y). \]
For each pair of sequences x, y ∈ S, compute the alignment a that
maximizes expected accuracy by dynamic programming, and set
E(x, y) = Ea*(accuracy(a, a*) | x, y).
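The dynamic program of Step 2 can be sketched as follows, assuming the posterior matrix is given as a nested list with one row per position of x. The function name and the return convention (a list of 0-based matched index pairs plus the expected accuracy) are choices made for this illustration only.

```python
def max_expected_accuracy(P):
    """Align two sequences by maximizing the sum of posterior match probabilities (no gap penalties)."""
    n = len(P)
    m = len(P[0]) if n else 0
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],  # pair x_i with y_j
                          A[i - 1][j],                        # leave x_i unpaired
                          A[i][j - 1])                        # leave y_j unpaired
    # Traceback; gap moves are tried first so that pairs adding nothing are not reported.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    accuracy = A[n][m] / min(n, m) if n and m else 0.0
    return pairs, accuracy


pairs, acc = max_expected_accuracy([[0.9, 0.05], [0.05, 0.8]])
print(pairs, acc)
```

Because unpaired letters cost nothing, the recurrence is the Needleman–Wunsch recurrence with posterior probabilities as substitution scores and zero gap penalties, as described in the text.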
Step 3: Probabilistic consistency transformation
Reestimate the match quality scores P(xi ∼ yj ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates
similarity of x and y to other sequences from S into the x–y pairwise comparison:
\[ P'(x_i \sim y_j \in a^* \mid x, y) \leftarrow \frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y). \]
In matrix form, the transformation may be written as
\[ P'_{xy} \leftarrow \frac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}. \]
Since most values in the Pxz and Pzy matrices will be near zero, the
transformation is computed efficiently using sparse matrix multiplication by ignoring all entries smaller than a threshold ω. This
step may be repeated as many times as desired.
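A minimal sketch of the matrix form of the transformation is given below. It assumes that posterior matrices are stored as nested lists in a dictionary keyed by ordered pairs of sequence names, that P_xx is treated as the identity matrix (so the terms z = x and z = y each contribute P_xy), and that entries below a cutoff are skipped; the cutoff value 0.01 and all function names are illustrative, not values taken from the text.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def sparse_product(A, B, cutoff):
    """Multiply A by B, skipping every entry smaller than cutoff."""
    out = [[0.0] * len(B[0]) for _ in range(len(A))]
    for i, row in enumerate(A):
        for k, a in enumerate(row):
            if a < cutoff:
                continue
            for j, b in enumerate(B[k]):
                if b >= cutoff:
                    out[i][j] += a * b
    return out


def consistency_transform(post, names, cutoff=0.01):
    """post[(x, y)] holds P_xy for each pair with x before y in names; returns the transformed dict."""
    def get(a, b):
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    transformed = {}
    for index, x in enumerate(names):
        for y in names[index + 1:]:
            # Assumption: P_xx and P_yy are identity, so z = x and z = y each add P_xy.
            acc = [[2.0 * value for value in row] for row in get(x, y)]
            for z in names:
                if z == x or z == y:
                    continue
                product = sparse_product(get(x, z), get(z, y), cutoff)
                for i, row in enumerate(product):
                    for j, value in enumerate(row):
                        acc[i][j] += value
            transformed[(x, y)] = [[value / len(names) for value in row] for row in acc]
    return transformed
```

Calling the function again on its own output corresponds to repeating the step.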
Step 4: Computation of guide tree
Construct a guide tree for S through hierarchical clustering. As a
measure of similarity between two sequences x and y use E(x, y)
as computed in Step 2. Define the similarity of two clusters by a
weighted average of the pairwise similarities between sequences
of the clusters.
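The clustering in Step 4 can be illustrated with a greedy agglomerative sketch. The weighting used here, a cluster-size-weighted average in the style of UPGMA, is an assumption made for the example; the text above only states that a weighted average is used. The function name and the nested-tuple tree representation are likewise illustrative.

```python
def build_guide_tree(names, similarity):
    """similarity[frozenset((a, b))] = E(a, b). Returns a nested tuple describing the merge order."""
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    sim = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sim[(i, j)] = similarity[frozenset((names[i], names[j]))]
    next_id = len(names)
    while len(trees) > 1:
        a, b = max(sim, key=sim.get)  # most similar pair of clusters
        for c in [c for c in trees if c != a and c != b]:
            sa = sim.pop((min(a, c), max(a, c)))
            sb = sim.pop((min(b, c), max(b, c)))
            # Assumed weighting: average weighted by cluster size.
            sim[(c, next_id)] = (sizes[a] * sa + sizes[b] * sb) / (sizes[a] + sizes[b])
        del sim[(a, b)]
        trees[next_id] = (trees.pop(a), trees.pop(b))
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return next(iter(trees.values()))


E = {frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.4, frozenset(("B", "C")): 0.5}
print(build_guide_tree(["A", "B", "C"], E))
```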
Step 5: Progressive alignment
Align sequence groups hierarchically according to the order
specified in the guide tree. Alignments are scored using a sum-of-pairs scoring function in which aligned residues are assigned
the transformed match quality scores P′(xi ∼ yj ∈ a* | x, y) and
gap penalties are set to zero.
Post-processing step: Iterative refinement
Randomly partition alignment into two groups of sequences and
realign. This step may be repeated as many times as desired.
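The partitioning half of the refinement step might look like the sketch below, which assumes the alignment is held as a dictionary from sequence name to gapped string with "-" as the gap character. Both function names are hypothetical, and the realignment of the two groups is deliberately left out, since it would reuse the progressive alignment machinery of Step 5.

```python
import random


def strip_gap_columns(rows, names):
    """Project the alignment onto the given sequences and drop columns that are all gaps."""
    width = len(rows[names[0]])
    keep = [c for c in range(width) if any(rows[name][c] != "-" for name in names)]
    return {name: "".join(rows[name][c] for c in keep) for name in names}


def split_alignment(rows, rng=random):
    """Randomly partition the sequences into two non-empty groups and project each group."""
    names = list(rows)
    if len(names) < 2:
        raise ValueError("need at least two sequences to partition")
    while True:
        group_a = [name for name in names if rng.random() < 0.5]
        if 0 < len(group_a) < len(names):
            break
    group_b = [name for name in names if name not in group_a]
    return strip_gap_columns(rows, group_a), strip_gap_columns(rows, group_b)


left, right = split_alignment({"s1": "AC-G", "s2": "A--G", "s3": "ACTG"}, random.Random(1))
print(left, right)
```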
In addition to the steps shown, we also experimented with
the generation of automatic column reliability annotations for
the alignment based on the posterior matrix formulation above
(see Methods).
Testing methodology
To test the empirical performance of ProbCons, we used three
different multiple alignment benchmarking suites, including
BAliBASE 2.01 (Thompson et al. 1999a), PREFAB 3.0 (Edgar
2004), and SABmark 1.63 (Van Walle et al. 2004). Tests were
performed on a 3.3-GHz Pentium IV with 2 GB RAM.
The BAliBASE 2.01 benchmark alignment database is a collection of 141 reference protein alignments, consisting of structural
alignments from the FSSP (Holm and Sander 1994) and HOMSTRAD (Mizuguchi et al. 1998) databases and hand-constructed
alignments from the literature. The database is organized into five reference sets: Reference 1 consists of a few
equidistant sequences of similar length; Reference 2, families of
closely related sequences with up to three distant “orphan” sequences; Reference 3, equidistant divergent families; Reference 4,
sequences with large N/C-terminal extensions; and Reference 5,
sequences with large internal insertions. Test alignments are
scored with respect to BAliBASE core blocks, regions for which
reliable alignments are known to exist.
The PREFAB 3.0 database is an automatically generated database consisting of 1932 alignments averaging 49 sequences of
length 240. Each test consists of a pair of protein sequences
supplemented with homologs found through PSI-BLAST
(Altschul et al. 1997) queries over the NCBI nonredundant protein sequence database (Pruitt et al. 2003). The accuracy of a
multiple sequence alignment is then evaluated with respect to
the pairwise structural alignments of the original two protein
sequences using the consensus of FSSP and CE alignments. Note
that the pairwise structural alignments in PREFAB only cover
some regions of the sequences; we treated these like BAliBASE
core blocks.
The SABmark 1.63 database consists of two sets of consensus
regions based on SOFI (Boutonnet et al. 1995) and CE (Shindyalov and Bourne 1998) structural alignments of sequences from
the ASTRAL (Brenner et al. 2000) database. The “Twilight Zone”
set contains 1994 domains sorted into 236 subsets representing
SCOP folds (Murzin et al. 1995), where each subset contains sequences sharing no more than 25% identity. The “Superfamily”
set contains 3645 domains sorted into 462 subsets representing
SCOP superfamilies, where each subset contains sequences
sharing no more than 50% identity. Unlike BAliBASE, SABmark
uses all-pairs pairwise reference structural alignments for evaluating multiple alignment quality.
While no universally accepted accuracy measure exists for
protein alignments, we chose to score each alignment according
to the original benchmarking measures proposed for its respective database. In the BAliBASE data set, we scored alignments
according to the sum-of-pairs score (SP), defined as the number of
correctly aligned residue pairs found in the test alignment divided by the total number of aligned residue pairs in core blocks
of the reference alignment (Thompson et al. 1999b). Additionally, we measured the column score (CS), defined as the number of
correctly aligned columns found in the test alignment divided by
the total number of aligned columns in core blocks of the reference alignment. On the PREFAB alignments, we measured the
quality (Q) score (Edgar 2004), which is equivalent to the SP
score. Finally, for SABmark, we used the developer (fD) score,
which is also equivalent to the SP score (where all residues in the
reference alignment are treated as being in core blocks), and the
modeler (fM) score, defined as the number of correctly aligned
residue pairs found in the test alignment divided by the total
number of aligned residue pairs in the test alignment (Sauder et
al. 2000). For each type of scoring metric used, we averaged the
scores per multiple alignment (or average score per subset in the
case of SABmark) over all multiple alignment tests in the database.
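The pair-based and column-based measures defined above can be computed as in the following sketch. It assumes alignments are given as lists of equal-length gapped strings in the same sequence order, with "-" as the gap character, and it treats every reference column as a core block, which is a simplification of the BAliBASE and PREFAB protocols. The function names are illustrative.

```python
def residue_pairs_and_columns(alignment):
    """Return the set of aligned residue pairs and the set of columns with at least two residues."""
    next_index = [0] * len(alignment)
    pairs, columns = set(), set()
    for c in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[c] != "-":
                column.append((s, next_index[s]))  # (sequence number, residue number)
                next_index[s] += 1
        if len(column) >= 2:
            columns.add(tuple(column))
        for a in range(len(column)):
            for b in range(a + 1, len(column)):
                pairs.add((column[a], column[b]))
    return pairs, columns


def alignment_scores(test, reference):
    """SP (equivalently Q or fD), fM, and CS; assumes both alignments contain aligned pairs."""
    test_pairs, test_columns = residue_pairs_and_columns(test)
    ref_pairs, ref_columns = residue_pairs_and_columns(reference)
    correct = len(test_pairs & ref_pairs)
    return {
        "SP": correct / len(ref_pairs),    # correct pairs / pairs in the reference
        "fM": correct / len(test_pairs),   # correct pairs / pairs in the test alignment
        "CS": len(test_columns & ref_columns) / len(ref_columns),
    }


print(alignment_scores(["ACG-", "ACTG"], ["AC-G", "ACTG"]))
```

SP and fM share a numerator and differ only in the denominator, which is why the former behaves like a sensitivity and the latter like a specificity.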
Comparison to other aligners
We compared the results of ProbCons on the above databases to those of six leading multiple alignment systems: (1) CLUSTALW
1.83 (Thompson et al. 1994), currently the most popular progressive alignment method; (2) DIALIGN 2.2.1 (Morgenstern et al.
1998), a local aligner using segment-based homology; (3) T-Coffee 1.37 (Notredame et al. 2000), a heuristic consistency-based
aligner that combines global and local alignments; (4) MAFFT 3.88 (Katoh et al. 2002), a set of six scripts for performing
multiple alignment with a variety of iterative refinement techniques; (5) MUSCLE 3.3 (Edgar 2004), a new aligner reporting the
best published results on BAliBASE to date; and (6) Align-m 1.0 (Van Walle et al. 2004), a consistency-based method for
computing all-pairs pairwise alignments of multiple sequences. Of the six scripts comprising the MAFFT alignment utilities, we
chose to test nw-ns-i, the most accurate script. For Align-m 1.0, we used the parameter settings picked for testing the program
in Van Walle et al. (2004). All other programs were run with default parameters.
Emission probabilities for the ProbCons HMM were adapted from the BLOSUM62 scoring matrix (Henikoff and Henikoff
1992). The default transition parameters of ProbCons were trained via unsupervised Expectation-Maximization (EM) on
unaligned sequences from the BAliBASE benchmark database; thus, the tests on the PREFAB and SABmark databases provide
external validation of the results shown on BAliBASE. The default options for the ProbCons program included applying two
iterations of the consistency transformation and 100 rounds of iterative refinement for every alignment. We also experimented
with a modified version of ProbCons (ProbCons-ext) in which the HMM model was extended to include an extra pair of insertion
states (I′x and I′y) to model long or terminal insertions.
The results of testing on the BAliBASE benchmark alignments database are shown in Table 1. To assess the significance of
the differences in overall SP and CS scores, we performed a Friedman rank test for all pairs of programs; these results are
summarized in Table 2. A typical BAliBASE alignment and its corresponding plot of column reliability are shown in Figure 2.
The correlation between predicted and actual column reliability scores as shown in the diagram demonstrates the ability of
pairwise posterior matrices to predict the expected proportion of correctly aligned residue pairs per column.
With the exception of Reference 4, ProbCons achieves the strongest performance in both SP and CS scores in all references.
Reference 4 sequences are marked by long N/C-terminal extensions in which local alignment methods tend to be more
successful, suggesting that incorporation of a local alignment probabilistic model into ProbCons might improve its performance
on such sequences. Alternatively, we found that extending the HMM model with an extra pair of insertion states (ProbCons-ext)
did improve BAliBASE performance in Reference 4; however, this addition roughly doubled the running time, with variable
performance benefit in the other databases.3
Table 1. Performance of aligners on the BAliBASE benchmark alignments database
Aligner        Ref 1 (82)    Ref 2 (23)    Ref 3 (12)    Ref 4 (12)    Ref 5 (12)    Overall (141) Time
               SP     CS     SP     CS     SP     CS     SP     CS     SP     CS     SP     CS     (mm:ss)
Align-m        76.6   n/a    88.4   n/a    68.4   n/a    91.1   n/a    91.7   n/a    80.4   n/a    19:25
DIALIGN        81.1   70.9   89.3   35.9   68.4   34.4   89.7   76.2   94.0   84.3   83.2   63.7   2:53
CLUSTALW       86.1   77.3   93.2   56.8   75.3   46.0   83.4   52.2   85.9   63.8   86.1   68.0   1:07
MAFFT          86.7   78.1   92.4   50.2   78.8   50.4   91.6   72.7   96.3   85.9   88.2   71.4   1:18
T-Coffee       86.6   77.4   93.4   56.1   78.5   48.7   91.8   73.0   95.8   90.3   88.3   72.2   21:31
MUSCLE         88.7   80.8   93.5   56.3   82.5   56.4   87.6   60.9   96.8   90.2   89.6   73.9   1:05
ProbCons       90.1   82.6   94.4   61.3   84.1   61.3   90.1   72.3   97.9   91.9   91.0   77.2   5:32
ProbCons-ext   90.0   82.5   94.2   59.1   84.3   61.1   93.8   81.0   98.1   92.2   91.2   77.6   8:02
Columns show the average sum-of-pairs (SP) and column scores (CS) achieved by each aligner for each of the five BAliBASE references. All scores have
been multiplied by 100. The number of alignments in each reference is given in parentheses. Overall numbers for the entire database are reported in
addition to the total running time of each aligner for all 141 alignments. The best results in each column are shown in bold.
Table 2. Significance test for differences in BAliBASE performance
Align-M
Align-M
DIALIGN
CLUSTALW
MAFFT
T-Coffee
MUSCLE
ProbCons
ProbCons-ext
-(0.61)
-8.2 × 10^-6
-<10^-10
-<10^-10
-<10^-10
-<10^-10
-<10^-10
-1.9 × 10^-5
-<10^-10
-<10^-10
-<10^-10
-<10^-10
-<10^-10
DIALIGN
CLUSTALW
MAFFT
-3
-3
+2.4 × 10
-9
+1.2 × 10
-1.0 × 10
+1.0 × 10
-3
T-Coffee
+<10^-10
+8.4 × 10^-6
MUSCLE
-10
+1.9 × 10
-8
-10
+<10
-10
+<10
-10
ProbCons
ProbCons-ext
+<10
+<10
+<10
-10
-5
-3.0 × 10
-(0.65)
-(0.92)
+9.6 × 10
-6
+1.6 × 10
-7
+8.3 × 10
-6
-8
-4.9 × 10
-5
-10
-6.1 × 10
-<10^-10
-1.7 × 10
-9
-2.6 × 10
-4.9 × 10^-8
-7.0 × 10^-3
-1.5 × 10^-6
-8.4 × 10^-6
-3
-6.6 × 10^-3
+1.7 × 10
-3
+1.9 × 10
-6
+0.012
+3.2 × 10
-5
+(0.092)
-3.0 × 10
+0.043
-(0.088)
Entries show the p-value indicating the significance of a difference in performance between two alignment methods as measured using a Friedman rank
test. Nonitalicized values above the diagonal were calculated using SP scores on all alignments, whereas italicized values were computed using CS scores.
(+) Method on the left had lower average rank (better performance); (-) Method on the left had higher average rank (worse performance); parentheses
denote (nonsignificant) p-values >0.05.
Figure 2. Column reliability plot for 1csy_ref1 from BAliBASE, Reference 1. The red line and solid regions indicate the predicted and actual
proportion of correct pairwise matches at each alignment position, respectively. All column reliability values have been multiplied by 100. Below, the actual ProbCons alignment is shown with core block residues
highlighted in green. Note that only pairwise matches in core block
regions of the BAliBASE alignment are considered correct when computing the “actual” proportion of correct pairwise matches; however, some
residues outside of the core block regions may also be alignable. Thus,
regions in which predicted homology exceeds actual homology do not
necessarily indicate overprediction of homology by the aligner.
The results of testing six of the methods on the PREFAB database are shown in Table 3. Results for the Align-m program
are omitted, since the program failed to complete all alignments in the PREFAB database. Again, ProbCons and ProbCons-ext
demonstrate a strong lead in SP score although their running times are longer than those of the other aligners except for
T-Coffee. This is due to the computation of all-pairs pairwise posterior probability matrices in the first step of the algorithm;
other schemes for formulating probabilistic consistency that avoid this need for a quadratic number of initial alignments may
be possible. The significance results for these values are given in Table 4.4
3 Previous results on the BAliBASE 2.01 benchmark alignments database reported in an abstract (Do et al. 2004), which correspond to the ProbCons-ext
program, differ slightly from those shown in the text. These small differences
are attributable to (1) a change in the methods used for extracting BAliBASE
core blocks as suggested by Robert C. Edgar (pers. comm.), and (2) minor
changes in the HMM model and training procedure for the current version of
ProbCons.
Table 3. Performance of aligners on the PREFAB protein reference alignment benchmark
Aligner        Overall (1927)   Time
DIALIGN        57.2             12 h, 25 min
CLUSTALW       58.9             2 h, 57 min
T-Coffee       63.6             144 h, 51 min
MUSCLE         64.8             3 h, 11 min
MAFFT          64.8             2 h, 36 min
ProbCons       66.9             19 h, 41 min
ProbCons-ext   68.0             37 h, 46 min
Entries show the average Q (equivalent to SP) score achieved by each aligner
on all 1927 alignments of the PREFAB database. All scores have been
multiplied by 100. Running times for programs over the entire database
are given for each program in hours and minutes. The best results in each
column are shown in bold.
The results of testing of the SABmark benchmark alignment
database are shown in Table 5. Many of the same trends as found
in the BAliBASE alignments are seen in SABmark, with the difference between ProbCons and the next best aligner in terms of
fD (SP) scores even more exaggerated. It should be noted, however, that while the Align-m aligner lags far behind in SP score5
(which may be thought of as a measure of sensitivity), its fM
scores, which are the proportion of correctly predicted amino
acid matches among all predicted matches (and which may be
regarded as a measure of specificity) are the highest. Due to this
disparity, it is difficult to make a precise quantitative statement
regarding the relative performance of Align-m compared to the
other methods without characterizing the sensitivity/specificity
trade-off of each method, such as performed in a ROC analysis
(Metz 1978).6 Nevertheless, compared to all other aligners, ProbCons demonstrates significantly higher fD and fM scores overall,
as seen in Table 6.
Comparison of ProbCons variants
To understand the features of ProbCons that give it a strong
increase in performance, we compared several ProbCons variants
on the “Twilight Zone” set from the SABmark alignment database. In particular, we examined the effects of four main algorithmic changes: (1) using the Viterbi algorithm to compute the
highest probability alignment, instead of the highest expected
accuracy alignment that is computed by ProbCons; (2) using the
posterior probability matrices generated by ProbCons to produce
all-pairs pairwise alignments instead of full multiple alignments;
(3) varying the number of applications of consistency transformation applied before alignment; and (4) omitting the application of iterative refinement to optimize the alignment with respect to the sum-of-pairs probabilistic consistency metric. In this
article, we have omitted a full comparison of expected accuracy guide tree construction to more popular methods such as
neighbor-joining or UPGMA, though preliminary results indicate ProbCons to be relatively insensitive to tree topology
(R.C. Edgar, pers. comm.).
4 The results for the nw-ns-i script from MAFFT on the PREFAB database given
in Edgar (2004) contain an editing error (R.C. Edgar, pers. comm.); the values
shown here are correct. Interestingly, although MAFFT achieves a slightly
higher overall average SP score than MUSCLE, a Friedman rank test indicates that MUSCLE consistently produces better alignments than MAFFT (see
Table 4).
5 The numbers reported for the Align-m aligner are similar to those given in
Edgar (2004), but differ from the results reported in Van Walle et al. (2004).
The primary reason for this difference is that the averages in the latter study
were computed across all SABmark pairwise alignments; this fails to account
for dependencies within each subset, so the weight of each subset scales
quadratically with the number of sequences present. We avoid this by averaging pairwise alignment scores within each subset before averaging all subset
scores.
6 While a ROC analysis would better characterize aligner performance, properly
defining sensitivity and specificity measures for alignment accuracy involves
subtle issues regarding the alignability of particular positions in sequences.
Furthermore, the appropriate manner for adjusting program parameters so as
to observe the sensitivity/specificity trade-off for the expected accuracy alignment algorithm is also an open problem. We leave these questions for future
work.
Table 4. Significance test for differences in PREFAB performance
           CLUSTALW        T-Coffee        MUSCLE          MAFFT           ProbCons        ProbCons-ext
DIALIGN    -1.06 × 10^-9   -<10^-10        -<10^-10        -<10^-10        -<10^-10        -<10^-10
CLUSTALW                   -<10^-10        -<10^-10        -<10^-10        -<10^-10        -<10^-10
T-Coffee                                   -<10^-10        -<10^-10        -<10^-10        -<10^-10
MUSCLE                                                     +2.3 × 10^-9    -<10^-10        -<10^-10
MAFFT                                                                      -<10^-10        -<10^-10
ProbCons                                                                                   -0.031
Entries show the p-value indicating the significance of a difference in performance between two alignment methods as measured using a Friedman rank
test. Values were calculated using Q (SP) scores on all alignments. (+) Method on the left had lower average rank (better performance); (-) Method
on the left had higher average rank (worse performance); parentheses denote (nonsignificant) p-values >0.05.
The results for each of these tests are shown in Table 7. Note
that we tested the Viterbi algorithm only on pairwise alignments,
as the HMM used in the ProbCons algorithm is strictly for pairwise comparisons; properly extending it to handle progressive
profile alignment is beyond the scope of this study. As seen by a
comparison of the first two rows of the table, alignments that
optimize expected accuracy were significantly more accurate
than Viterbi alignments.
The numbers also show that pairwise methods (rows 2–4)
tend to generate alignments with slightly higher fD (SP) scores
and slightly lower fM scores than their multiple alignment counterparts (rows 5–7). However, a stronger trend is that in both the
pairwise and multiple alignment cases, iterated applications of
consistency lead to simultaneous improvements in fD and fM,
thus showing that consistency does help incorporate multiple sequence information into pairwise alignments. Using 100
rounds of iterative refinement helps optimize the alignment, as
reflected in the difference between rows 5 and 8 of the table.
Employing both iterated consistency and iterative refinement
thus gives the default parameter settings for the ProbCons program (row 9).
Table 5. Performance of aligners on the SABmark sequence and structure alignment benchmark

| Aligner | Superfamily (462) fD | Superfamily (462) fM | Twilight Zone (236) fD | Twilight Zone (236) fM | Overall (698) fD | Overall (698) fM | Time (mm:ss) |
|---|---|---|---|---|---|---|---|
| Align-m | 44.4 | 58.9 | 17.1 | 43.0 | 35.2 | 53.5 | 56:44 |
| DIALIGN | 50.3 | 42.5 | 22.5 | 19.2 | 41.0 | 34.6 | 8:28 |
| CLUSTALW | 53.7 | 38.7 | 24.8 | 15.2 | 43.9 | 30.8 | 2:16 |
| MAFFT | 54.1 | 40.0 | 24.8 | 16.0 | 44.2 | 31.9 | 7:33 |
| T-Coffee | 55.4 | 41.8 | 26.4 | 18.0 | 45.6 | 33.7 | 59:10 |
| MUSCLE | 55.9 | 40.1 | 27.6 | 17.5 | 46.4 | 33.0 | 20:42 |
| ProbCons | 59.9 | 45.0 | 32.1 | 21.7 | 50.5 | 37.1 | 17:20 |
| ProbCons-ext | 59.9 | 45.3 | 32.0 | 22.1 | 50.5 | 37.5 | 23:10 |

Columns show the average developer (fD) score (equivalent to sum-of-pairs [SP] score) and modeler (fM) score achieved by each aligner for the
“Superfamily” and “Twilight Zone” sets in the SABmark database. All
scores have been multiplied by 100. The number of sequences in each set
is given in parentheses. Overall numbers for the entire database are reported in addition to the total running time of each aligner for all 698
alignments. The best results in each column are shown in bold.

Interestingly, computing multiple alignments using the expected accuracy criterion alone generates significantly more accurate alignments in terms of both fD and fM scores than those
produced by current leading alignment methods. To check the
validity of this claim, we applied the expected accuracy criterion
for multiple alignment to the entire SABmark database, achieving an fD score of 0.479 and an fM score of 0.355, again significantly better than all other methods except for the full ProbCons
method itself. Therefore expected accuracy alignments give better sensitivity in terms of predicting true matches and better specificity in terms of predicting a higher proportion of true matches.
This observation suggests that posterior-based approaches are a
powerful general approach for improving alignment accuracy.
Additionally, among the added features, using the probabilistic
consistency transformation provided the largest accuracy improvement.
Discussion
Though the problem of protein multiple sequence alignment is
hardly new, the computation of high accuracy multiple sequence
alignments is still an open problem. In this article, we presented
ProbCons, a practical tool for protein multiple sequence alignment, which has demonstrated dramatic improvements in alignment accuracy over several leading methods on the BAliBASE,
PREFAB, and SABmark benchmark alignment databases while
maintaining competitive running times.
Despite its strong performance on empirical tests, the ProbCons algorithm uses an extremely simple model of sequence
similarity (a three-state pair-HMM) and makes no attempt to incorporate biological knowledge such as position-specific gap
scoring, rigorous evolutionary tree construction, and other features used by aligners such as CLUSTALW. ProbCons does
not use protein-specific alignment information other than the
amino acid alphabet and the BLOSUM emission probability matrices. Replacing these with equivalent values for nucleotides
may give a DNA alignment procedure with improved accuracy
over standard Needleman–Wunsch-based aligners. In addition,
the parameters used in the model are transparent, and include
the probability δ of transition from the match/mismatch state to
the insertion states (corresponding to a gap-open penalty), the
probability ε of self-transition in an insertion state (corresponding to a gap-extend penalty), and the initial probability π_insert of
starting with an insertion. Since all training for the program was
done automatically on unaligned sequences using Expectation–
Maximization without human guidance, it is thus possible to
retrain ProbCons on specific sequence types to obtain parameters
that would be more appropriate for particular alignment tasks.
Our results in comparing different variations of ProbCons
indicate that the two main features that contribute to its accu-
Table 6. Significance test for differences in SABmark performance
Align-M
Align-M
DIALIGN
CLUSTALW
MAFFT
T-Coffee
MUSCLE
ProbCons
ProbCons-ext
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ<10
ⳮ10
ⳮ<10ⳮ10
ⳮ<10
ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ10
ⳮ<10ⳮ10
ⳮ10
ⳮ<10ⳮ10
ⳮ10
ⳮ<10
DIALIGN
ⳮ<10
ⳮ10
ⳮ10
CLUSTALW
ⳮ<10
ⳮ<10
MAFFT
ⳮ<10ⳮ10
ⳮ<10ⳮ10
ⳮ10
ⳮ<10
ⳮ0.02
ⳮ<10
ⳮ6
ⳮ0.01
ⳮ7.5 ⳯ 10
ⳮ1.5 ⳯ 10ⳮ5
+(0.083)
ⳮ<10ⳮ10
ⳮ2.5 ⳯ 10
ⳮ3
MUSCLE
ⳮ10
ⳮ<10
ⳮ1.2 ⳯ 10
ⳮ7
ProbCons
ⳮ<10ⳮ10
+<10ⳮ10
+<10ⳮ10
+<10ⳮ10
+<10ⳮ10
+<10ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ10
ⳮ<10
T-Coffee
ⳮ<10
ProbCons-ext
+<10
+<10
ⳮ10
ⳮ<10
+<10
+<10
ⳮ10
+<10
ⳮ10
+1.2 ⳯ 10
+<10
ⳮ0.052
ⳮ4
ⳮ1.5 ⳯ 10
+<10
ⳮ<10
ⳮ5
ⳮ<10
+<10
+6.4 ⳯ 10ⳮ4
+(0.31)
Entries show the p-value indicating the significance of a difference in performance between two alignment methods as measured using a Friedman rank
test. Nonitalicized values above the diagonal were calculated using fD (SP) scores on all alignments, whereas italicized values were computed using fM
scores. (+) Method on the left had lower average rank (better performance); (−) Method on the left had higher average rank (worse performance);
parentheses denote (nonsignificant) p-values >0.05.
racy are the use of maximum expected accuracy as an objective
function and the application of the probabilistic consistency
transformation. The methodology employed in developing the
ProbCons algorithm is straightforward and widely applicable: (1)
specify an appropriate quality measure and (2) maximize its expected value according to the probability distribution given by
the model. For example, the accuracy measure used in this article
maximizes the expected number of correct matches in an alignment; if one is concerned about overprediction of matches, one
may use an alternative objective function that penalizes overprediction of matches and, provided it is easily decomposable, derive the corresponding optimization algorithm. Exploring this
framework provides a novel and exciting direction for future
work in pursuing even higher accuracy alignment approaches.
The principles employed, however, are not unique to sequence alignment alone. As an example, consider the related
problem of motif finding among a set of divergent sequences.
Consistency-based approaches have previously been applied to motif-finding tasks with strong empirical results (Heger et al. 2003). A more principled algorithm based on probabilistic consistency may further increase the sensitivity of motif detection methods. Comparative gene finding and RNA or protein structural prediction methods may also benefit from a probabilistic consistency-based approach.

Table 7. Performance of ProbCons variants on SABmark “Twilight Zone” set

| | Algorithm | c | ir | Output | fD | fM | Time (mm:ss) |
|---|---|---|---|---|---|---|---|
| 1. | Viterbi | 0 | 0 | Pairwise | 27.5 | 17.2 | 0:42 |
| 2. | Posterior | 0 | 0 | Pairwise | 29.6 | 18.5 | 2:54 |
| 3. | Posterior | 1 | 0 | Pairwise | 32.5 | 20.4 | 3:15 |
| 4. | Posterior | 2 | 0 | Pairwise | 33.2 | 21.0 | 3:47 |
| 5. | Posterior | 0 | 0 | Multiple | 29.1 | 19.8 | 2:57 |
| 6. | Posterior | 1 | 0 | Multiple | 30.9 | 20.8 | 3:17 |
| 7. | Posterior | 2 | 0 | Multiple | 31.5 | 21.3 | 3:50 |
| 8. | Posterior | 0 | 100 | Multiple | 30.6 | 20.8 | 4:14 |
| 9. | Posterior | 2 | 100 | Multiple | 32.1 | 21.7 | 5:50 |

The first column indicates whether the Viterbi algorithm (highest probability alignment) or posterior decoding (maximal expected accuracy
alignment) was used. The next two columns indicate c, the number of
iterations of the consistency transformation used, and ir, the number of
rounds of iterative refinement used as post-processing. The fourth column indicates whether the ProbCons was set to generate all-pairs pairwise alignments or consistent multiple alignments. The next two columns
show the average developer (fD) score (equivalent to sum-of-pairs [SP]
score) and modeler (fM) score achieved by each aligner for the “Twilight
Zone” set in the SABmark database. The last column gives the total
running time for each method over all 236 alignments. All scores have
been multiplied by 100. Note that the last row corresponds to the parameter settings that are the default in the ProbCons program. The best
results in each column are shown in bold.
Methods
The ProbCons algorithm works by (1) computing posterior-probability matrices, (2) computing expected accuracies for each
pairwise comparison, (3) applying the probabilistic consistency
transformation, (4) computing an expected accuracy guide tree,
and (5) performing progressive alignment. As a default, we also
perform iterative refinement as a post-processing step. In the
subsections that follow, we consider each of these steps in greater
detail, describe the EM training procedure used to obtain parameters for the ProbCons HMM, and present a novel technique for
estimating column reliability scores based on the alignment scoring matrices.
1. Posterior probability matrices
Let x and y be two proteins represented as character strings in
which xi is the ith amino acid of x. Consider the pair-HMM given
in Figure 1, where A is the space of all possible x–y alignments. An
alignment a corresponds uniquely to a sequence of state-emission pairs, ⟨s1, o1⟩, …, ⟨sn, on⟩. The probability of a is given by

$$P(a \mid x, y) = \pi(s_1) \left( \prod_{i=1}^{n-1} \alpha(s_i \to s_{i+1}) \right) \left( \prod_{i=1}^{n} \beta(o_i \mid s_i) \right),$$

where π(s) is the initial probability of starting in state s, α(si → si+1)
is the transition probability from si to si+1, and β(oi | si) is the emission probability for either a single letter or aligned residue pair oi in
the state si.
In the derivation which follows, let a* be the (unknown)
alignment from A that most nearly represents the “true” biological alignment of x and y. Ideally, we wish to determine a* based
on the sequence information in x and y alone. To do this we use
the distribution P(A | x, y) to represent our beliefs regarding a*,
i.e., we assume that P(a | x, y) is the probability that an alignment
a is equal to a*.
Let the notation xi ∼ yj ∈ a denote the event that two positions xi and yj are matched in an alignment a. Formally, the
posterior probability of xi ∼ yj ∈ a* is

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\},$$

where the common indicator notation 1{condition} is used to define a function that evaluates to 1 whenever condition is true and
0 otherwise. Then, the posterior probability matrix Pxy for the
alignment of x and y is a table of P(xi ∼ yj ∈ a* | x, y) values for
1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|. The ProbCons algorithm begins by calculating these posterior probability matrices using a modification
of the Forward and Backward algorithms for computing posterior
probabilities in pair-HMMs as described in Durbin et al. (1998).
This computation step takes time O(m^2 L^2), where m is the number of sequences and L is the length of each sequence.
2. Maximal expected accuracy alignment
Most alignment schemes build an “optimal” pairwise alignment
by finding the highest probability alignment using the Viterbi
algorithm. In this approach, one computes arg maxa P(a | x, y),
which may be alternatively written as arg maxa Ea*[1{a = a*} | x,
y]; that is, the Viterbi algorithm finds the alignment whose probability of being exactly equal to a* is optimal. When the odds of
recovering the exact correct alignment are low but partially correct
alignments are still useful, this is not necessarily the best choice.
In this work, we explore an alternative strategy that finds
the alignment a that does not maximize the probability of a = a*
but rather tries to guarantee high accuracy for a, which we define
with respect to the alignment a* as

$$\mathrm{accuracy}(a, a^*) = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} \mathbf{1}\{x_i \sim y_j \in a^*\}.$$

During the alignment process, however, a* is not known, so we
instead maximize the expected accuracy of the reported alignment.
Computing this quantity is straightforward since

$$\begin{aligned}
E_{a^*}(\mathrm{accuracy}(a, a^*) \mid x, y)
&= \sum_{\tilde{a} \in A} P(\tilde{a} \mid x, y)\, \frac{\sum_{x_i \sim y_j \in a} \mathbf{1}\{x_i \sim y_j \in \tilde{a}\}}{\min\{|x|, |y|\}} \\
&= \frac{\sum_{x_i \sim y_j \in a} \left( \sum_{\tilde{a} \in A} P(\tilde{a} \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in \tilde{a}\} \right)}{\min\{|x|, |y|\}} \\
&= \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y).
\end{aligned}$$

Using this decomposition, we compute the maximal expected
accuracy alignment by a simple variant of the Needleman–
Wunsch algorithm, where all match/mismatch scores are given
by the posterior probability terms for corresponding letters and
gap penalties are set to zero. This form of alignment bears strong
resemblance to the problem of finding the maximum weight
trace of a matrix (Kececioglu 1993), and a similar scheme is used
to compute final progressive alignments in the T-Coffee program.
3. Probabilistic consistency transformation
In the previous section, we described a method for performing
pairwise sequence alignment of two sequences x and y based on
computing P(xi ∼ yj ∈ a* | x,y) values for all positions in x and y,
and subsequently using these posterior probabilities as match/
mismatch scores in a Needleman–Wunsch-like alignment procedure. In this section we introduce probabilistic consistency, a
method for obtaining more accurate substitution scores when a
third homologous sequence z is available.
One way to use sequence z is to generalize the pair-HMM
given in Figure 1 to a triple-HMM that parameterizes a conditional distribution over three-sequence alignments of x, y, and z,
and similarly generalize the previous formulas for expected accuracy to handle three-way alignments. Such an approach, however, leads to impractical O(L^3) algorithms for computing posterior matrices of sequences of length L. Here, we follow a heuristic
approach that allows us to derive an algorithm with an approximately O(L^2) running time.
For a sequence z, let z(k,k+1) denote the interletter regions (or
gaps) between amino acids k and k + 1 of z for 0 ≤ k ≤ |z| (where
z(0,1) and z(|z|,|z|+1) denote the gaps at the beginning and ends of z).
Generalizing our notation for posterior probabilities of matches,
an alternative estimate for the quality of an xi ∼ yj match is given
by marginalized probability,

$$P(x_i \sim y_j \in a^* \mid x, y, z) = \sum_{z_k} P(x_i \sim y_j \sim z_k \in a^* \mid x, y, z) + \sum_{z_{(k,k+1)}} P(x_i \sim y_j \sim z_{(k,k+1)} \in a^* \mid x, y, z),$$

where a* now refers to a three-sequence alignment of x, y, and z.
We refer to the concept of re-estimating pairwise alignment
match quality scores based on three-sequence information as
probabilistic consistency.
As stated, computing P(xi ∼ yj ∈ a* | x, y, z) values for each
xi–yj pair requires O(L^3) time for the Forward and Backward algorithms (given an appropriate three-sequence HMM); to avoid
this, we simplify the computation as follows. First, we heuristically ignore the second summation over gaps in z to get

$$\sum_{z_k} P(x_i \sim y_j \sim z_k \in a^* \mid x, y, z).$$
Second, we change the inner condition to an equivalent expression,

$$\sum_{z_k} P\big((x_i \sim z_k \in a^*) \wedge (z_k \sim y_j \in a^*) \mid x, y, z\big)$$
Then, we use the chain rule to factorize each inner term of the
summation to obtain

$$\sum_{z_k} P(x_i \sim z_k \in a^* \mid x, y, z)\, P(z_k \sim y_j \in a^* \mid x, y, z, x_i \sim z_k \in a^*)$$
Finally, we make heuristic independence assumptions to get

$$\sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y).$$
This latter expression still requires O(L^3) time to be computed.
Now, however, we transform the Pxz and Pzy matrices into sparse
matrices by discarding all values smaller than a threshold ω (by
default, ω = 0.01). For alignable sequences, posterior probability
alignment matrices tend to be sparse, with most entries near
zero, so this step is justified. This effectively reduces the probabilistic consistency re-estimation step to sparse matrix multiplication; therefore, Pxy is re-estimated in time O(c^2 L), where c is the
average number of nonzero elements per row (typically 1 ≤ c ≤ 5
in practice).
With the procedure described above, we can align two sequences given information from a third sequence. To align two
sequences x and y given a set of sequences, S, we would ideally
like to estimate P(xi ∼ yj ∈ a* | S). In practice, we use the following heuristic decomposition:

$$\frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y),$$
where we set P(xi ∼ xj | x) to 1 if i = j and 0 otherwise.
In this sense, the approximate probabilistic consistency calculation may be viewed as a transformation that, given a set of
all-pairs pairwise match quality scores, produces a new set of
all-pairs pairwise match quality scores that have been adjusted to
account for a single intermediate sequence. By iterated applications of the transformation, then, we can informally approximate
the effect of accounting for more than one intermediate sequence at a time. As a default, ProbCons uses two iterated applications, which works well in practice (see Results).
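As a rough illustration of the transformation described in this section, the sketch below applies the thresholded sparse-matrix form to a toy input. It is not the ProbCons source code: the dictionary-of-rows representation, the function names, and the toy probabilities are all assumptions made for the example, and the threshold is applied only to the input matrices.

```python
def sparsify(dense, omega=0.01):
    """Keep only entries >= omega; each row becomes a {column: probability} dict."""
    return [{j: p for j, p in enumerate(row) if p >= omega} for row in dense]


def sparse_product(p_xz, p_zy):
    """result[i][j] = sum over k of p_xz[i][k] * p_zy[k][j], visiting nonzeros only."""
    result = []
    for row in p_xz:
        out = {}
        for k, p_ik in row.items():
            for j, p_kj in p_zy[k].items():
                out[j] = out.get(j, 0.0) + p_ik * p_kj
        result.append(out)
    return result


def consistency_transform(posteriors, names):
    """Apply the transformation once to every ordered pair (x, y) with x != y.

    posteriors[(x, y)] is the sparse matrix P_xy; both orders must be present.
    """
    updated = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            acc = [dict() for _ in posteriors[(x, y)]]
            for z in names:
                if z in (x, y):
                    # P_xx and P_yy are identity matrices, so the term is P_xy itself.
                    term = posteriors[(x, y)]
                else:
                    term = sparse_product(posteriors[(x, z)], posteriors[(z, y)])
                for i, row in enumerate(term):
                    for j, p in row.items():
                        acc[i][j] = acc[i].get(j, 0.0) + p
            updated[(x, y)] = [
                {j: p / len(names) for j, p in row.items()} for row in acc
            ]
    return updated


# Toy input: three sequences of length 1 with hypothetical match probabilities.
dense = {("x", "y"): [[0.5]], ("x", "z"): [[1.0]], ("z", "y"): [[1.0]]}
posteriors = {}
for (a, b), matrix in dense.items():
    posteriors[(a, b)] = sparsify(matrix)
    posteriors[(b, a)] = sparsify([list(col) for col in zip(*matrix)])  # transpose

once = consistency_transform(posteriors, ["x", "y", "z"])
twice = consistency_transform(once, ["x", "y", "z"])  # two iterated applications
```

In the toy input, the strong x–z and z–y matches raise the x–y score from 0.5 to 2/3 after one application, which is the intended effect of routing evidence through an intermediate sequence.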
In the derivations above, it is clear that several unjustified
assumptions were needed in order to obtain an efficiently computable form for probabilistic consistency. In the first step, the
simplification of not considering gapped positions in a sequence
z is problematic. In the fourth step, the independence assumptions required for the transformation clearly do not hold for sets
of related sequences. Furthermore, the decomposition of
P(xi ∼ yj ∈ a* | S) into an average over the different intermediate
sequences in S is also not well grounded. Nevertheless, these
methods work well in practice.
As a sanity check, ignoring gapped positions in the first
simplification hurts only when xi is aligned to yj through a gap in
z; for reliably alignable regions in which all sequences are present, this has little effect. Averaging P(xi ∼ yj ∈ a* | z) values in
the final step can be interpreted as a linear regression-like
method for predicting P(xi ∼ yj ∈ a* | S) where all inputs are given
identical weight. Finally, to assess the reasonableness of the independence assumptions used in deriving the factorized form of
probabilistic consistency, we implemented a version of ProbCons
using the full O(L^3) consistency algorithm. Because this algorithm is slow, we tested it only on a set of 74 alignments with
at most five sequences and length at most 100 residues from the
Twilight Zone subset of SABmark. The full O(L^3) consistency algorithm achieved an average fD score of 0.431 compared to 0.403
when no iterations of approximate probabilistic consistency were
used, 0.422 when one iteration was used, and 0.427 when two
iterations were used. In contrast to the other methods that completed all tests in under 2 sec, however, the O(L^3) method took
nearly 10 min to finish. We decided not to support the O(L^3)
version because it is inherently much slower even in the smallest
examples, while it provides only modest improvements on the
Twilight Zone alignments where we tested it.
4. Guide tree computation
Most progressive multiple sequence alignment programs use evolutionary distances estimated from pairwise alignments or k-mer
statistics to build an approximate evolutionary tree via neighbor
joining (Saitou and Nei 1987) or UPGMA (Sneath and Sokal
1973). In contrast, ProbCons does not attempt to build an evolutionarily correct tree but rather uses a greedy heuristic method
reminiscent of UPGMA to construct a tree with high expected
alignment reliability.
Given a set S of sequences to be aligned, denote the expected
accuracy for aligning any two sequences x and y as E(x, y). Initially, each sequence is placed in its own cluster. Then, the two
clusters x and y with the highest expected accuracy are merged to
form a new cluster xy; we then define the expected accuracy of
aligning xy with any other cluster z as E(x, y)(E(x, z) + E(y, z))/2.
This process is repeated until only a single cluster remains.
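The greedy clustering just described can be sketched as follows. This is an illustrative reading of the text rather than the ProbCons implementation: clusters are represented as nested tuples, ties are broken by iteration order, and the accuracy values are hypothetical.

```python
from itertools import combinations


def build_guide_tree(names, pair_accuracy):
    """Greedily merge the two clusters with the highest expected accuracy.

    pair_accuracy maps frozenset({a, b}) -> E(a, b) for every pair of sequences.
    Returns the tree as nested tuples of sequence names.
    """
    clusters = list(names)
    accuracy = dict(pair_accuracy)
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=lambda p: accuracy[frozenset(p)])
        merged = (x, y)
        e_xy = accuracy[frozenset((x, y))]
        rest = [c for c in clusters if c != x and c != y]
        for z in rest:
            # E(xy, z) = E(x, y) * (E(x, z) + E(y, z)) / 2, as defined in the text.
            e_xz = accuracy[frozenset((x, z))]
            e_yz = accuracy[frozenset((y, z))]
            accuracy[frozenset((merged, z))] = e_xy * (e_xz + e_yz) / 2
        clusters = rest + [merged]
    return clusters[0]


# Hypothetical expected accuracies for four sequences.
pair_accuracy = {frozenset(p): 0.3 for p in combinations("abcd", 2)}
pair_accuracy[frozenset("ab")] = 0.9
pair_accuracy[frozenset("cd")] = 0.8
tree = build_guide_tree(list("abcd"), pair_accuracy)
```

Here a and b are merged first, then c and d, and finally the two resulting clusters, because multiplying by E(x, y) lowers the score of any merge that involves an already merged cluster.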
Like UPGMA, the guide-tree computation procedure used
here relies on modified arithmetic averaging to estimate the “distance” of newly created clusters to other clusters. However, the
important distinction is that the computation here has the goal
of finding clusters that can be reliably aligned, i.e., have high
expected accuracy, rather than ones that may appear evolutionarily closer.
5. Progressive alignment
The final progressive alignment step in ProbCons is a routine
extension of maximal expected accuracy alignment to an unweighted sum-of-pairs model. Since the alignments within each
group are fixed, we may ignore matches between sequences in
each group. Thus, for each progressive alignment step, we run a
profile–profile Needleman–Wunsch alignment procedure in
which the score for matching a column containing n1 non-gap
letters to one with n2 non-gap letters is computed by summing
n1n2 values from the corresponding pairwise posterior matrices.
Note that no gap penalties are used in this final step, thus greatly
simplifying the task of profile–profile alignment.
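A minimal sketch of this gap-penalty-free dynamic program is given below. The column representation (a dictionary from sequence name to residue index, with None for a gap), the function names, and the preference for gaps when scores tie are assumptions of the example. Pairwise maximal expected accuracy alignment is the special case in which each group contains a single sequence.

```python
def column_score(col1, col2, posteriors):
    """Sum posterior match probabilities over all non-gap residue pairs (n1 * n2 terms)."""
    total = 0.0
    for x, i in col1.items():
        for y, j in col2.items():
            if i is not None and j is not None:
                total += posteriors[(x, y)][i][j]
    return total


def align_profiles(cols1, cols2, posteriors):
    """Needleman-Wunsch-style alignment of two column lists with zero gap penalties.

    Returns the optimal score and the list of matched (column1, column2) index pairs.
    """
    n, m = len(cols1), len(cols2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1], posteriors)
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])
    # Trace back, taking a match only when skipping a column would lose score.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical posterior matrix for two sequences of length 2.
posteriors = {("x", "y"): [[0.9, 0.0], [0.1, 0.8]]}
cols_x = [{"x": 0}, {"x": 1}]
cols_y = [{"y": 0}, {"y": 1}]
best, matched = align_profiles(cols_x, cols_y, posteriors)
```

Because unmatched columns cost nothing, the recurrence needs only a single score table rather than the separate gap states used with affine gap penalties.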
Post-processing: Iterative refinement
While incorporating consistency helps to reduce the chances of
errors during the hierarchical merging of groups of sequences,
the progressive alignment procedure still does not produce optimal alignments with respect to the sum-of-pairs probabilistic
consistency objective function. To improve the alignment, we
employ a randomized iterative improvement strategy (Berger and
Munson 1991).
In this approach, the sequences of the existing multiple
alignment are randomly partitioned into two groups of possibly
unequal size by randomly assigning each sequence to one of the
two groups to be realigned. Subsequently, the same dynamic programming procedure used for progressive alignment is employed
to realign the two projected alignments. This refinement procedure can be iterated either for a fixed number of iterations or
until convergence; for simplicity, only the former of these options is implemented in ProbCons, where 100 rounds of iterative
refinement are applied in the default setting. Because gap penalties are not used during each realignment step, the sum-of-pairs
alignment score is guaranteed to increase monotonically.
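The partition-and-realign loop can be sketched as below. The alignment is represented as a dictionary of gapped strings, and the realignment step is passed in as a callable; `pad_merge` is only a hypothetical stand-in for the profile–profile procedure, and redrawing a partition when one group is empty is an assumption of this sketch.

```python
import random


def project(alignment, names):
    """Restrict an alignment to `names` and drop columns that became all-gap."""
    rows = [alignment[n] for n in names]
    keep = [c for c in range(len(rows[0])) if any(r[c] != "-" for r in rows)]
    return {n: "".join(alignment[n][c] for c in keep) for n in names}


def random_partition(names, rng):
    """Assign each sequence to one of two groups at random; redraw if a group is empty."""
    while True:
        left = [n for n in names if rng.random() < 0.5]
        right = [n for n in names if n not in left]
        if left and right:
            return left, right


def refine(alignment, realign, rounds=100, seed=0):
    """Run a fixed number of rounds of randomized iterative refinement."""
    names = sorted(alignment)
    if len(names) < 2:
        return alignment
    rng = random.Random(seed)
    for _ in range(rounds):
        left, right = random_partition(names, rng)
        alignment = realign(project(alignment, left), project(alignment, right))
    return alignment


def pad_merge(group1, group2):
    """Hypothetical placeholder realigner: left-justify and pad to a common width."""
    merged = {**group1, **group2}
    width = max(len(s) for s in merged.values())
    return {n: s.ljust(width, "-") for n, s in merged.items()}


aln = {"x": "AC-G-", "y": "A--G-", "z": "ACTG-"}
refined = refine(aln, pad_merge, rounds=5)
```

In a real setting, `realign` would be the same dynamic programming procedure used for progressive alignment, applied to the two projected groups.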
Unsupervised EM training
The ProbCons approach to alignment is simple in that the only
parameters in the program are the ones specific to the HMM used
to model the distribution over alignments. If one keeps the emission probabilities fixed, the HMM in Figure 1 is completely specified by three parameters, which fully determine the initial state
and transition probabilities: the initial insertion probability π_insert,
the insertion start probability δ, and the insertion extension probability ε. To train ProbCons via Expectation–Maximization (EM),
then, we applied 20 iterations of the Baum–Welch algorithm on
unaligned BAliBASE sequences, starting from random initial parameters. The resulting parameters (δ = 0.019931, ε = 0.79433,
π_insert = 0.19598) were used as the default for the program. The
low number of parameters for the probabilistic model here distinguishes ProbCons from profile-HMM approaches (Durbin et
al. 1998), which have a much richer alignment model but consequently face a tougher training task.
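To make concrete how three numbers can determine all initial and transition probabilities, the sketch below builds them for a symmetric three-state pair-HMM. The topology is an assumption of this example (one match state M, two insertion states Ix and Iy, and no direct transitions between the insertion states); the state names and the function are ours.

```python
def pair_hmm_parameters(delta=0.019931, epsilon=0.79433, pi_insert=0.19598):
    """Derive initial and transition probabilities from the three free parameters.

    Assumed topology: M is the match/mismatch state; Ix and Iy emit a letter from
    one sequence only; an insertion state can only extend itself or return to M.
    """
    initial = {"M": 1 - 2 * pi_insert, "Ix": pi_insert, "Iy": pi_insert}
    transition = {
        "M": {"M": 1 - 2 * delta, "Ix": delta, "Iy": delta},
        "Ix": {"M": 1 - epsilon, "Ix": epsilon, "Iy": 0.0},
        "Iy": {"M": 1 - epsilon, "Ix": 0.0, "Iy": epsilon},
    }
    return initial, transition


initial, transition = pair_hmm_parameters()
```

Under this assumed topology, retraining on a different sequence type only changes the three arguments; every other probability follows from them.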
Estimating column reliability
Many applications that make use of protein sequence alignments
need the ability to assess which parts of an alignment are likely to
be correct. Previous approaches to quantifying alignment quality
have included using suboptimal alignments to locate reliable regions of alignments (Vingron and Argos 1990; Chao et al. 1993)
or using a fuzzy “winner-takes-most” version of Needleman–
Wunsch dynamic programming in order to “predict” the probability that a pair of residues are correctly aligned (Schlosshauer
and Ohlsson 2002). It is clear that both of these approaches deal
with many of the questions answered by match posterior probabilities (Miyazawa 1995, Kschischo and Lässing 2000), which
represent the likelihood that specific pairs of residues are aligned.
In the multiple alignment case, one possible generalization
is to estimate the expected proportion of correct pairwise
matches in each column of the alignment. Given a set C of the
aligned residues in a particular column, this expected proportion
of correct pairwise matches ψ(C) is given by

$$\psi(C) = \binom{|C|}{2}^{-1} \sum_{\substack{x_i, y_j \in C \\ x \neq y}} P(x_i \sim y_j \in a^* \mid S),$$

which we approximate using the pairwise posterior matrices calculated in Step 1. Though this is certainly not the only possible
measure of column reliability based on posterior probabilities, we
leave extensions of this method as future work.
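A minimal sketch of this column score follows, using pairwise posterior matrices as the approximation. The column representation (sequence name to residue index, gaps omitted), the key ordering of the posterior dictionary, and the return value for columns with fewer than two residues are assumptions of the example.

```python
from itertools import combinations


def column_reliability(column, posteriors):
    """Average pairwise posterior match probability over all residue pairs in a column.

    column maps sequence name -> residue index for the non-gap residues only.
    posteriors[(x, y)][i][j] holds P(x_i ~ y_j) with names in sorted order (x < y).
    Returns None when the column has fewer than two residues, where the score is undefined.
    """
    pairs = list(combinations(sorted(column), 2))
    if not pairs:
        return None
    total = sum(posteriors[(x, y)][column[x]][column[y]] for x, y in pairs)
    return total / len(pairs)  # len(pairs) equals |C| choose 2


# Hypothetical posterior matrices for three short sequences.
posteriors = {
    ("x", "y"): [[0.1, 0.9]],
    ("x", "z"): [[0.6]],
    ("y", "z"): [[0.0], [0.3]],
}
score = column_reliability({"x": 0, "y": 1, "z": 0}, posteriors)
```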
Acknowledgments
The authors thank Arend Sidow and Robert Edgar for useful discussions and Sandhya Kunnatur for help in program development. C.B.D. was partly supported by a Siebel Fellowship. M.B.
was partly supported by an NSF Graduate Fellowship. Work in
the Batzoglou laboratory is supported in part by NSF grant EF0312459, NIH grant U01-HG003162, the NSF CAREER Award,
and the Alfred P. Sloan Fellowship.
References
Altschul, S.F. 1991. Amino acid substitution matrices from an
information theoretic perspective. J. Mol. Biol. 219: 555–565.
Altschul, S.F., Carroll, R.J., and Lipman, D.J. 1989. Weights for data
related by a tree. J. Mol. Biol. 207: 647–653.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller,
W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs. Nucleic Acids Res.
25: 3389–3402.
Attwood, T.K. 2002. The PRINTS database: A resource for identification
of protein families. Brief. Bioinform. 3: 252–263.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V.,
Griffiths-Jones, S., Khanna, A., Moxon, M.M., Sonnhammer, E.L.,
Studholme, D.J., et al. 2004. The Pfam protein families database.
Nucleic Acids Res. 32: D138–D141.
Berger, M.P. and Munson, P.J. 1991. A novel randomized iterative
strategy for aligning multiple protein sequences. Comput. Appl.
Biosci. 7: 479–484.
Boutonnet, N.S., Rooman, M.J., Ochagavia, M.E., Richelle, J., and
Wodak, S.J. 1995. Optimal protein structure alignments by multiple
linkage clustering: Application to distantly related proteins. Protein
Eng. 8: 647–662.
Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium
for protein structure and sequence analysis. Nucleic Acids Res.
28: 254–256.
Carrillo, H. and Lipman, D. 1988. The multiple sequence alignment
problem in biology. SIAM J. Appl. Math. 48: 1073–1082.
Castillo-Davis, C.I., Kondrashov, F.A., Hartl, D.L., and Kulathinal, R.J.
2004. The functional genomic distribution of protein divergence in
two animal phyla: Coevolution, genomic conflict, and constraint.
Genome Res. 14: 802–811.
Chao, K.-M., Hardison, R.C., and Miller, W. 1993. Locating
well-conserved regions within a pairwise alignment. Comput. Appl.
Biosci. 9: 387–396.
Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of
evolutionary change in proteins. In Atlas of protein sequence and
structure, vol. 5, Suppl. 3, pp. 345–352. National Biomedical Research
Foundation, Washington, D.C.
Do, C.B., Brudno, M., and Batzoglou, S. 2004. ProbCons: Probabilistic
consistency-based multiple alignment of amino acid sequences. In
Proceedings of the Thirteenth National Conference on Artificial
Intelligence, pp. 703–708. AAAI Press, San Jose, CA.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological
sequence analysis. Cambridge University Press, Cambridge, UK.
Eddy, S.R. 1995. Multiple alignment using hidden Markov models. In
Proceedings of the Third International Conference on Intelligent Systems
in Molecular Biology, pp. 114–120. AAAI Press, Cambridge, UK.
Edgar, R.C. 2004. MUSCLE: Multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res. 32: 1792–1797.
Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a
prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351–360.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive
matching of the entire protein sequence database. Science
256: 1443–1445.
Gotoh, O. 1982. An improved algorithm for matching biological
sequences. J. Mol. Biol. 162: 705–708.
———. 1990. Consistency of optimal sequence alignments. Bull. Math.
Biol. 52: 509–525.
———. 1996. Significant improvement in accuracy of multiple protein
sequence alignments by iterative refinement as assessed by reference
to structural alignments. J. Mol. Biol. 264: 823–838.
Heger, A., Lappe, M., and Holm, L. 2003. Accurate detection of very
sparse sequence motifs. In Proceedings of the Seventh Annual
International Conference on Computational Molecular Biology, pp.
139–147. ACM Press, Berlin, Germany.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices
from protein blocks. Proc. Nat. Acad. Sci. 89: 10915–10919.
Holm, L. and Sander, C. 1994. The FSSP database of structurally aligned
protein fold families. Nucleic Acids Res. 22: 3600–3609.
Holmes, I. and Durbin, R. 1998. Dynamic programming alignment
accuracy. J. Comput. Biol. 5: 493–504.
Huang, X. and Miller, W. 1991. A time-efficient, linear space local
similarity algorithm. Adv. Appl. Math. 12: 337–357.
Jaroszewski, L., Li, W., and Godzik, A. 2002. In search for more accurate
alignments in the twilight zone. Protein Sci. 11: 1702–1713.
Johnson, J.M. and Church, G.M. 1999. Alignment and structure
prediction of divergent protein families: Periplasmic and outer
membrane proteins of bacterial efflux pumps. J. Mol. Biol.
287: 695–715.
Jones, D.T. 1999. Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Katoh, K., Misasa, K., Kuma, K., and Miyata, T. 2002. MAFFT: A novel
method for rapid multiple sequence alignment based on fast Fourier
transform. Nucleic Acids Res. 30: 3059–3066.
Kececioglu, J. 1993. The maximum weight trace problem in multiple
sequence alignment. In Proceedings of the Fourth Symposium on
Combinatorial Pattern Matching, Springer-Verlag Lecture Notes in
Computer Science, vol. 684, pp. 106–119. London.
Kim, J., Pramanik, S., and Chung, M.J. 1994. Multiple sequence
alignment using simulated annealing. Comput. Appl. Biosci.
10: 419–426.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. 1994.
Hidden Markov models in computational biology: Applications to
protein modeling. J. Mol. Biol. 235: 1501–1531.
Kschischo, M. and Lässig, M. 2000. Finite-temperature sequence
alignment. Pac. Symp. Biocomput. 5: 621–632.
Metz, C.E. 1978. Basic principles of ROC analysis. Semin. Nucl. Med.
8: 283–298.
Mizuguchi, K., Deane, C.M., Blundell, T.L., and Overington, J.P. 1998.
HOMSTRAD: A database of protein structure alignments for
homologous families. Protein Sci. 7: 2469–2471.
Miyazawa, S. 1995. A reliable sequence alignment method based on
probabilities of residue correspondences. Protein Eng. 8: 999–1009.
Morgenstern, B., Dress, A., and Werner, T. 1996. Multiple DNA and
protein sequence alignment based on segment-to-segment
comparison. Proc. Natl. Acad. Sci. 93: 12098–12103.
Morgenstern, B., Frech, K., Dress, A., and Werner, T. 1998. DIALIGN:
Finding local similarities by multiple sequence alignment.
Bioinformatics 14: 290–294.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol. 247: 536–540.
Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space.
Comput. Appl. Biosci. 4: 11–17.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable
to the search for similarities in the amino acid sequence of two
proteins. J. Mol. Biol. 48: 443–453.
Notredame, C. and Higgins, D.G. 1996. SAGA: Sequence alignment by
genetic algorithm. Nucleic Acids Res. 24: 1515–1524.
Notredame, C., Holm, L., and Higgins, D.G. 1998. COFFEE: An objective
function for multiple sequence alignments. Bioinformatics
14: 407–422.
Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel
method for multiple sequence alignments. J. Mol. Biol.
302: 205–217.
Phillips, A., Janies, D., and Wheeler, W. 2000. Multiple sequence
alignments in phylogenetic analysis. Mol. Phylogenet. Evol.
16: 317–330.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. 2003. NCBI Reference
Sequence project: Update and current status. Nucleic Acids Res.
31: 34–37.
Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng.
12: 85–94.
Rost, B. and Sander, C. 1994. Combining evolutionary information and
neural networks to predict protein secondary structure. Proteins
19: 55–77.
Saitou, N. and Nei, M. 1987. The neighbor-joining method: A new
method for reconstructing phylogenetic trees. Mol. Biol. Evol.
4: 406–425.
Sauder, J.M., Arthur, J.W., and Dunbrack Jr., R.L. 2000. Large-scale
comparison of protein sequence alignment algorithms with structure
alignments. Proteins 40: 6–22.
Schlosshauer, M. and Ohlsson, M. 2002. A novel approach to local
reliability of sequence alignments. Bioinformatics 18: 847–854.
Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path.
Protein Eng. 11: 739–747.
Smith, T.F. and Waterman, M.S. 1981. Identification of common
molecular subsequences. J. Mol. Biol. 147: 195–197.
Sneath, P.H.A. and Sokal, R.R. 1973. Numerical Taxonomy. Freeman, San
Francisco, CA.
Sonnhammer, E.L.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R.
1998. Pfam: Multiple sequence alignments and HMM-profiles of
protein domains. Nucleic Acids Res. 26: 320–322.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W:
Improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res.
22: 4673–4680.
Thompson, J.D., Plewniak, F., and Poch, O. 1999a. BAliBASE: A
benchmark alignment database for the evaluation of multiple
alignment programs. Bioinformatics 15: 87–88.
———. 1999b. A comprehensive comparison of multiple sequence
alignment programs. Nucleic Acids Res. 27: 2682–2690.
Van Walle, I., Lasters, I., and Wyns, L. 2004. Align-m—A new algorithm
for multiple alignment of highly divergent sequences. Bioinformatics
20: 1428–1435.
Vingron, M. and Argos, P. 1989. A fast and sensitive multiple sequence
alignment algorithm. Comput. Appl. Biosci. 5: 115–121.
———. 1990. Determination of reliable regions in protein sequence
alignments. Protein Eng. 3: 565–569.
———. 1991. Motif recognition and alignment for many sequences by
comparison of dot matrices. J. Mol. Biol. 218: 34–43.
Vingron, M. and Waterman, M.S. 1994. Sequence alignment and
penalty choice: Review of concepts, case studies and implications. J.
Mol. Biol. 235: 1–12.
Viterbi, A.J. 1967. Error bounds for convolutional codes and an
asymptotically optimal decoding algorithm. IEEE Trans. Inf. Theory
IT-13: 260–269.
Web site references
http://probcons.stanford.edu; ProbCons alignment tool.
Received May 24, 2004; accepted in revised form November 29, 2004.