MINIMUM-RESIDUAL METHODS
FOR SPARSE LEAST-SQUARES
USING GOLUB-KAHAN BIDIAGONALIZATION
A DISSERTATION
SUBMITTED TO THE INSTITUTE FOR
COMPUTATIONAL AND MATHEMATICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
David Chin-lung Fong
December 2011
© 2011 by Chin Lung Fong. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This dissertation is online at: http://purl.stanford.edu/sd504kj0427
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Michael Saunders, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Margot Gerritsen
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Walter Murray
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
ABSTRACT
For 30 years, LSQR and its theoretical equivalent CGLS have been the
standard iterative solvers for large rectangular systems Ax = b and
least-squares problems min kAx − bk. They are analytically equivalent
to symmetric CG on the normal equation ATAx = ATb, and they reduce
krk k monotonically, where rk = b − Axk is the k-th residual vector. The
techniques pioneered in the development of LSQR allow better algorithms to be developed for a wider range of problems.
We derive LSMR, an algorithm that is similar to LSQR but exhibits
better convergence properties. LSMR is equivalent to applying MINRES to the normal equation, so that the error kx∗ − xk k, the residual krk k, and the residual of the normal equation kATrk k all decrease
monotonically. In practice we observe that the Stewart backward error
kATrk k/krk k is usually monotonic and very close to optimal. LSMR has
essentially the same computational cost per iteration as LSQR, but the
Stewart backward error is always smaller. Thus if iterations need to be
terminated early, it is safer to use LSMR.
LSQR and LSMR are based on Golub-Kahan bidiagonalization. Following the analysis of LSMR, we leverage the techniques used there to
construct algorithm AMRES for negatively-damped least-squares systems (ATA − δ²I)x = ATb, again using Golub-Kahan bidiagonalization. Such problems arise in total least-squares, Rayleigh quotient iteration (RQI), and Curtis-Reid scaling for rectangular sparse matrices.
Our solver AMRES provides a stable method for these problems. AMRES allows caching and reuse of the Golub-Kahan vectors across RQIs,
and can be used to compute any of the singular vectors of A, given a
reasonable estimate of the singular vector or an accurate estimate of the
singular value.
ACKNOWLEDGEMENTS
First and foremost, I am extremely fortunate to have met Michael Saunders during my second research rotation in ICME. He designed the optimal research project for me from the very beginning, and patiently
guided me through every necessary step. He devotes an enormous
amount of time and consideration to his students. Michael is the best
advisor that any graduate student could ever have.
Our work would not be possible without standing on the shoulders
of a giant like Chris Paige, who designed MINRES and LSQR together
with Michael 30 years ago. Chris gave lots of helpful comments on
reorthogonalization and various aspects of this work. I am thankful
for his support.
David Titley-Peloquin is one of the world experts in analyzing iterative methods like LSQR and LSMR. He provided us with great insights
on various convergence properties of both algorithms. My understanding of LSMR would be left incomplete without his ideas.
Jon Claerbout has been the most enthusiastic supporter of the LSQR
algorithm, and is constantly looking for applications of LSMR in geophysics. His idea of “computational success” aligned with our goal in
designing LSMR and gave us strong motivation to develop this algorithm.
James Nagy’s enthusiasm in experimenting with LSMR in his latest
research also gave us much confidence in further developing and polishing our work. I feel truly excited to see his work on applying LSMR
to image processing.
AMRES would not be possible without the work that Per Christian
Hansen and Michael started. The algorithm and some of the applications are built on top of their insights.
I would like to thank Margot Gerritsen and Walter Murray for serving on both my reading and oral defense committees. They provided
many valuable suggestions that enhanced the completeness of the experimental results. I would also like to thank Eric Darve for chairing
my oral defense and Peter Kitanidis for serving on my oral defense
committee and giving me objective feedback on the algorithm from a
user’s perspective.
I am also grateful to the team of ICME staff, especially Indira Choudhury and Brian Tempero. Indira helped me get through all the paperwork, and Brian assembled and maintained all the ICME computers.
His support allowed us to focus fully on number-crunching.
I have met lots of wonderful friends here at Stanford. It is hard to acknowledge all of them here. Some of them are Santiago Akle, Michael
Chan, Felix Yu, and Ka Wai Tsang.
Over the past three years, I have been deeply grateful for the love
that Amy Lu has brought to me. Her smile and encouragement enabled
me to go through all the tough times in graduate school.
I thank my parents for creating the best possible learning environment for me over the years. With their support, I feel extremely privileged to be able to fully devote myself to this intellectual venture.
I would like to acknowledge my funding, the Stanford Graduate
Fellowship (SGF). For the first half of my SGF, I was supported by the
Hong Kong Alumni Scholarship with generous contributions from Dr
Franklin and Mr Lee. For the second half of my SGF, I was supported
by the Office of Technology Licensing Fellowship with generous contributions from Professor Charles H. Kruger and Ms Katherine Ku.
CONTENTS

Abstract
Acknowledgements
1  Introduction
   1.1  Problem statement
   1.2  Basic iterative methods
   1.3  Krylov subspace methods for square Ax = b
        1.3.1  The Lanczos process
        1.3.2  Unsymmetric Lanczos
        1.3.3  Transpose-free methods
        1.3.4  The Arnoldi process
        1.3.5  Induced dimension reduction
   1.4  Krylov subspace methods for rectangular Ax = b
        1.4.1  The Golub-Kahan process
   1.5  Overview
2  A tale of two algorithms
   2.1  Monotonicity of norms
        2.1.1  Properties of CG
        2.1.2  Properties of CR and MINRES
   2.2  Backward error analysis
        2.2.1  Stopping rule
        2.2.2  Monotonic backward errors
        2.2.3  Other convergence measures
   2.3  Numerical results
        2.3.1  Positive-definite systems
        2.3.2  Indefinite systems
   2.4  Summary
3  LSMR
   3.1  Derivation of LSMR
        3.1.1  The Golub-Kahan process
        3.1.2  Using Golub-Kahan to solve the normal equation
        3.1.3  Two QR factorizations
        3.1.4  Recurrence for xk
        3.1.5  Recurrence for Wk and W̄k
        3.1.6  The two rotations
        3.1.7  Speeding up forward substitution
        3.1.8  Algorithm LSMR
   3.2  Norms and stopping rules
        3.2.1  Stopping criteria
        3.2.2  Practical stopping criteria
        3.2.3  Computing ‖rk‖
        3.2.4  Computing ‖ATrk‖
        3.2.5  Computing ‖xk‖
        3.2.6  Estimates of ‖A‖ and cond(A)
   3.3  LSMR Properties
        3.3.1  Monotonicity of norms
        3.3.2  Characteristics of the solution on singular systems
        3.3.3  Backward error
   3.4  Complexity
   3.5  Regularized least squares
        3.5.1  Effects on ‖r̄k‖
        3.5.2  Pseudo-code for regularized LSMR
   3.6  Proof of Lemma 3.2.1
4  LSMR experiments
   4.1  Least-squares problems
        4.1.1  Backward error for least-squares
        4.1.2  Numerical results
        4.1.3  Effects of preconditioning
        4.1.4  Why does ‖E2‖ for LSQR lag behind LSMR?
   4.2  Square systems
   4.3  Underdetermined systems
   4.4  Reorthogonalization
5  AMRES
   5.1  Derivation of AMRES
        5.1.1  Least-squares subsystem
        5.1.2  QR factorization
        5.1.3  Updating Wkv
        5.1.4  Algorithm AMRES
   5.2  Stopping rules
   5.3  Estimate of norms
        5.3.1  Computing ‖r̂k‖
        5.3.2  Computing ‖Âr̂k‖
        5.3.3  Computing ‖x̂‖
        5.3.4  Estimates of ‖Â‖ and cond(Â)
   5.4  Complexity
6  AMRES applications
   6.1  Curtis-Reid scaling
        6.1.1  Curtis-Reid scaling using CGA
        6.1.2  Curtis-Reid scaling using AMRES
        6.1.3  Comparison of CGA and AMRES
   6.2  Rayleigh quotient iteration
        6.2.1  Stable inner iterations for RQI
        6.2.2  Speeding up the inner iterations of RQI
   6.3  Singular vector computation
        6.3.1  AMRESR
        6.3.2  AMRESR experiments
   6.4  Almost singular systems
7  Conclusions and future directions
   7.1  Contributions
        7.1.1  MINRES
        7.1.2  LSMR
        7.1.3  AMRES
   7.2  Future directions
        7.2.1  Conjecture
        7.2.2  Partial reorthogonalization
        7.2.3  Efficient optimal backward error estimates
        7.2.4  SYMMLQ-based least-squares solver
Bibliography
LIST OF TABLES

1.1  Notation
3.1  Storage and computational cost for various least-squares methods
4.1  Effects of diagonal preconditioning on LPnetlib matrices and convergence of LSQR and LSMR on min ‖Ax − b‖
4.2  Relationship between CG, MINRES, CRAIG, LSQR and LSMR
5.1  Storage and computational cost for AMRES
7.1  Comparison of CG and MINRES properties on an spd system
7.2  Comparison of LSQR and LSMR properties
LIST OF FIGURES

2.1  Distribution of condition number for matrices used for CG vs MINRES comparison
2.2  Backward and forward errors for CG and MINRES (1)
2.3  Backward and forward errors for CG and MINRES (2)
2.4  Solution norms for CG and MINRES (1)
2.5  Solution norms for CG and MINRES (2)
2.6  Non-monotonic backward error for MINRES on indefinite system
2.7  Solution norms for MINRES on indefinite systems (1)
2.8  Solution norms for MINRES on indefinite systems (2)
4.1  ‖rk‖ for LSMR and LSQR
4.2  ‖E2‖ for LSMR and LSQR
4.3  ‖E1‖, ‖E2‖, and μ̃(xk) for LSQR and LSMR
4.4  ‖E1‖, ‖E2‖, and μ̃(xk) for LSMR
4.5  μ̃(xk) for LSQR and LSMR
4.6  μ̃(xk) for LSQR and LSMR
4.7  ‖x∗ − x‖ for LSMR and LSQR
4.8  Convergence of ‖E2‖ for two problems in NYPA group
4.9  Distribution of condition number for LPnetlib matrices
4.10 Convergence of LSQR and LSMR with increasingly good preconditioners
4.11 LSMR and LSQR solving two square nonsingular systems
4.12 Backward errors of LSQR, LSMR and MINRES on underdetermined systems
4.13 LSMR with and without reorthogonalization of Vk and/or Uk
4.14 LSMR with reorthogonalized Vk and restarting
4.15 LSMR with local reorthogonalization of Vk
6.1  Convergence of Curtis-Reid scaling using CGA and AMRES
6.2  Improving an approximate singular value and singular vector using RQI-AMRES and RQI-MINRES (1)
6.3  Improving an approximate singular value and singular vector using RQI-AMRES and RQI-MINRES (2)
6.4  Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRESOrtho
6.5  Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRESOrtho for linear operators of varying cost
6.6  Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRESOrtho for different singular values
6.7  Convergence of RQI-AMRES and svds
6.8  Singular vector computation using AMRESR and MINRES
7.1  Flowchart on choosing iterative solvers
LIST OF ALGORITHMS

1.1  Lanczos process Tridiag(A, b)
1.2  Algorithm CG
1.3  Algorithm CR
1.4  Unsymmetric Lanczos process
1.5  Arnoldi process
1.6  Golub-Kahan process Bidiag(A, b)
1.7  Algorithm CGLS
3.1  Algorithm LSMR
3.2  Computing ‖rk‖ in LSMR
3.3  Algorithm CRLS
3.4  Regularized LSMR (1)
3.5  Regularized LSMR (2)
5.1  Algorithm AMRES
6.1  Rayleigh quotient iteration (RQI) for square A
6.2  RQI for singular vectors for square or rectangular A
6.3  Stable RQI for singular vectors
6.4  Modified RQI for singular vectors
6.5  Singular vector computation via residual vector
6.6  Algorithm AMRESR
LIST OF MATLAB CODE

4.1  Approximate optimal backward error
4.2  Right diagonal preconditioning
4.3  Generating preconditioners by perturbation of QR
4.4  Criteria for selecting square systems
4.5  Diagonal preconditioning
4.6  Left diagonal preconditioning
6.1  Least-squares problem for Curtis-Reid scaling
6.2  Normal equation for Curtis-Reid scaling
6.3  CGA for Curtis-Reid scaling
6.4  Generate linear operator with known singular values
6.5  Cached Golub-Kahan process
6.6  Error measure for singular vector
1 INTRODUCTION
The quest for the solution of linear equations is a long journey. The earliest known work dates to 263 AD [64]. The book Jiuzhang Suanshu (Nine Chapters of the Mathematical Art) was published in ancient China with a chapter dedicated to the solution of linear equations.¹ The modern study of linear equations was picked up again by Newton, who wrote unpublished notes in 1670 on solving systems of equations by the systematic elimination of variables [33].
¹Procedures for solving systems of three linear equations in three variables were discussed.
Cramer's Rule was published in 1750 [16] after Leibniz laid the groundwork for determinants in 1693 [7]. In 1809, Gauss invented the method
of least squares by solving the normal equation for an over-determined
system for his study of celestial orbits. Subsequently, in 1826, he extended his method to find the minimum-norm solution for underdetermined systems, which proved to be very popular among cartographers
[33].
For linear systems with dense matrices, Cholesky factorization, LU
factorization and QR factorization are the popular methods for finding
solutions. These methods require access to the elements of the matrix.
We are interested in the solution of linear systems when the matrix
is large and sparse. In such circumstances, direct methods like the ones
mentioned above are not practical because of memory constraints. We
also allow the matrix to be a linear operator defined by a procedure for
computing matrix-vector products. We focus our study on the class of
iterative methods, which usually require only a small amount of auxiliary storage beyond the storage for the problem itself.
1.1 PROBLEM STATEMENT
We consider the problem of solving a system of linear equations. In
matrix notation we write
Ax = b,
(1.1)
where A is an m × n real matrix that is typically large and sparse, or is
available only as a linear operator, b is a real vector of length m, and x
is a real vector of length n.
We call an x that satisfies (1.1) a solution of the problem. If such
an x does not exist, we have an inconsistent system. If the system is
inconsistent, we look for an optimal x for the following least-squares
problem instead:
    min_x ‖Ax − b‖₂.    (1.2)

We denote the exact solution to the above problems by x∗, and ℓ denotes the number of iterations that any iterative method takes to converge to this solution. That is, we have a sequence of approximate solutions x0, x1, x2, . . . , xℓ, with x0 = 0 and xℓ = x∗. In Sections 1.2 and 1.3 we review a number of methods for solving (1.1) when A is square
(m = n). In Section 1.4 we review methods that handle the general case
when A is rectangular, which is also the main focus of this thesis.
1.2 BASIC ITERATIVE METHODS
Jacobi iteration, Gauss-Seidel iteration [31, p510] and successive overrelaxation (SOR) [88] are three early iterative methods for linear equations. These methods have the common advantage of minimal memory requirement compared with the Krylov subspace methods that we
focus on hereafter. However, unlike Krylov subspace methods, these
methods will not converge to the exact solution in a finite number of
iterations even with exact arithmetic, and they are applicable to only
narrow classes of matrices (e.g., diagonally dominant matrices). They
also require explicit access to the nonzeros of A.
1.3 KRYLOV SUBSPACE METHODS FOR SQUARE Ax = b
Sections 1.3 and 1.4 describe a number of methods that can regard A as an operator; i.e., only matrix-vector multiplication with A (and sometimes AT) is needed, but not direct access to the elements of A.² Section 1.3 focuses on algorithms for the case when A is square. Section 1.4 focuses on algorithms that handle both rectangular and square A. Krylov subspaces of increasing dimensions are generated by the matrix-vector products, and an optimal solution within each subspace is found at each iteration of the methods (where the measure of optimality differs with each method).
²These methods are also known as matrix-free iterative methods.
Algorithm 1.1 Lanczos process Tridiag(A, b)
1: β1 v1 = b   (i.e. β1 = ‖b‖2, v1 = b/β1)
2: for k = 1, 2, . . . do
3:   w = Avk
4:   αk = vkT w
5:   βk+1 vk+1 = w − αk vk − βk vk−1
6: end for
1.3.1 THE LANCZOS PROCESS
In this section, we focus on symmetric linear systems. The Lanczos
process [40] takes a symmetric matrix A and a vector b, and generates
a sequence of Lanczos vectors vk and scalars αk , βk for k = 1, 2, . . .
as shown in Algorithm 1.1. The process can be summarized in matrix form as

    A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T = V_{k+1} H_k,    (1.3)

with V_k = (v_1  v_2  · · ·  v_k) and

    T_k = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ & & \beta_k & \alpha_k \end{pmatrix}, \qquad
    H_k = \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix}.
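For concreteness, a minimal MATLAB sketch of the Lanczos process Tridiag(A, b) follows (our illustration, not part of the original algorithm statement; it stores all Lanczos vectors and performs no reorthogonalization, so orthogonality of Vk is lost in finite precision):

    % Sketch of Algorithm 1.1: Lanczos process Tridiag(A,b).
    % Returns V and tridiagonal T with A*V = V*T + beta_{k+1}*v_{k+1}*e_k'.
    function [V, T] = lanczos_tridiag(A, b, kmax)
      n = length(b);
      V = zeros(n, kmax);  alpha = zeros(kmax,1);  beta = zeros(kmax+1,1);
      beta(1) = norm(b);  V(:,1) = b / beta(1);
      vprev = zeros(n,1);
      for k = 1:kmax
          w = A * V(:,k);                          % only matrix-vector products with A
          alpha(k) = V(:,k)' * w;
          w = w - alpha(k) * V(:,k) - beta(k) * vprev;
          beta(k+1) = norm(w);
          if beta(k+1) == 0, break, end            % Krylov subspace exhausted (k = l)
          vprev = V(:,k);
          if k < kmax, V(:,k+1) = w / beta(k+1); end
      end
      V = V(:,1:k);
      T = diag(alpha(1:k));
      if k > 1, T = T + diag(beta(2:k),1) + diag(beta(2:k),-1); end
    end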
An important property of the Lanczos vectors in Vk is that they lie
in the Krylov subspace Kk (A, b) = span{b, Ab, A2 b, . . . , Ak−1 b}. At iteration k, we look for an approximate solution xk = Vk yk (which lies in
the Krylov subspace). The associated residual vector is
rk = b − Axk = β1 v1 − AVk yk = Vk+1 (β1 e1 − Hk yk ).
By choosing yk in various ways to make rk small, we arrive at different
iterative methods for solving the linear system. Since Vk is theoretically orthonormal, we can achieve this by solving various subproblems
to make
Hk yk ≈ β 1 e 1 .
(1.4)
Three particular choices of subproblem lead to three established
methods (CG, MINRES and SYMMLQ) [52]. Each method has a different minimization property that suggests a particular factorization of
Hk . Certain auxiliary quantities can be updated efficiently without the
need for yk itself (which in general is completely different from yk+1). With exact arithmetic, the Lanczos process terminates with k = ℓ for some ℓ ≤ n. To ensure that the approximations xk = Vk yk improve by some measure as k increases toward ℓ, the Krylov solvers minimize some convex function within the expanding Krylov subspaces [27].

Algorithm 1.2 Algorithm CG
1: r0 = b, p1 = r0, ρ0 = r0T r0
2: for k = 1, 2, . . . do
3:   qk = Apk
4:   αk = ρk−1 / pkT qk
5:   xk = xk−1 + αk pk
6:   rk = rk−1 − αk qk
7:   ρk = rkT rk
8:   βk = ρk /ρk−1
9:   pk+1 = rk + βk pk
10: end for
CG
CG was introduced in 1952 by Hestenes and Stiefel [38] for solving
Ax = b when A is symmetric positive definite (spd). The quadratic
form φ(x) ≡ ½ xTAx − bTx is bounded below, and its unique minimizer solves Ax = b. CG iterations are characterized by minimizing the quadratic form within each Krylov subspace [27], [46, §2.4], [84, §§8.8–8.9]:

    x_k = V_k y_k,  where  y_k = arg min_y φ(V_k y).    (1.5)

With b = Ax∗ and 2φ(xk) = xkTAxk − 2x∗TAxk, this is equivalent to minimizing (x∗ − xk)TA(x∗ − xk), the square of the energy norm ‖x∗ − xk‖A of the error, within each Krylov subspace. A version of CG adapted from van der Vorst [79, p42] is shown in Algorithm 1.2.
CG has an equivalent Lanczos formulation [52]. It works by deleting
the last row of (1.4) and defining yk by the subproblem Tk yk = β1 e1 .
If A is positive definite, so is each Tk , and the natural approach is to
employ the Cholesky factorization Tk = Lk Dk LTk . We define Wk and
zk from the lower triangular systems
    L_k W_k^T = V_k^T,    L_k D_k z_k = β_1 e_1.

It then follows that zk = LkT yk and xk = Vk yk = Wk LkT yk = Wk zk, where the elements of Wk and zk do not change when k increases. Simple recursions follow. In particular, x_k = x_{k−1} + ζ_k w_k, where z_k = \begin{pmatrix} z_{k−1} \\ ζ_k \end{pmatrix} and W_k = \begin{pmatrix} W_{k−1} & w_k \end{pmatrix}. This formulation requires one more
When A is not spd, the minimization in (1.5) is unbounded below,
and the Cholesky factorization of Tk might fail or be numerically unstable. Thus, CG cannot be recommended in this case.
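The following MATLAB function is a direct transcription of Algorithm 1.2 (our sketch; the function name and the simple residual-based stopping test are added for illustration and are not part of the algorithm as stated):

    % Sketch of Algorithm 1.2 (CG) for an spd system Ax = b.
    function x = cg_sketch(A, b, tol, maxit)
      x = zeros(length(b),1);  r = b;  p = r;  rho = r'*r;
      for k = 1:maxit
          q     = A*p;
          alpha = rho / (p'*q);
          x     = x + alpha*p;
          r     = r - alpha*q;
          if norm(r) <= tol*norm(b), break, end    % simple residual test, e.g. tol = 1e-8
          rhonew = r'*r;
          beta  = rhonew / rho;   rho = rhonew;
          p     = r + beta*p;
      end
    end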
MINRES
MINRES [52] is characterized by the following minimization:

    x_k = V_k y_k,  where  y_k = arg min_y ‖b − A V_k y‖.    (1.6)
Thus, MINRES minimizes krk k within the kth Krylov subspace. Since
this minimization is well-defined regardless of the definiteness of A,
MINRES is applicable to both positive definite and indefinite systems.
From (1.4), the minimization is equivalent to
min ‖Hk yk − β1 e1‖2. Now it is natural to use the QR factorization

    Q_k \begin{pmatrix} H_k & \beta_1 e_1 \end{pmatrix} = \begin{pmatrix} R_k & z_k \\ 0 & \bar{\zeta}_{k+1} \end{pmatrix},

from which we have Rk yk = zk. We define Wk from the lower triangular system RkT WkT = VkT and then xk = Vk yk = Wk Rk yk = Wk zk = xk−1 + ζk wk as before. (The Cholesky factor Lk is lower bidiagonal but RkT is lower tridiagonal, so MINRES needs slightly more work and storage than CG.)
Stiefel's Conjugate Residual method (CR) [72] for spd systems also minimizes ‖rk‖ in the same Krylov subspace. Thus, CR and MINRES must generate the same iterates on spd systems. We will use the two algorithms interchangeably in the spd case to prove a number of properties in Chapter 2. CR is shown in Algorithm 1.3.

Algorithm 1.3 Algorithm CR
1: x0 = 0, r0 = b, s0 = Ar0, ρ0 = r0Ts0, p0 = r0, q0 = s0
2: for k = 1, 2, . . . do
3:   (qk−1 = Apk−1 holds, but is not explicitly computed)
4:   αk = ρk−1 / ‖qk−1‖²
5:   xk = xk−1 + αk pk−1
6:   rk = rk−1 − αk qk−1
7:   sk = Ark
8:   ρk = rkTsk
9:   βk = ρk /ρk−1
10:  pk = rk + βk pk−1
11:  qk = sk + βk qk−1
12: end for

Note that MINRES is reliable for any symmetric matrix A, whereas CR was designed for positive definite systems. For example it will fail
on the following nonsingular but indefinite system:

    \begin{pmatrix} 0 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{pmatrix} x = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
In this case, r1 and s1 are nonzero, but ρ1 = 0 and CR fails after 1
iteration. Luenberger extended CR to indefinite systems [43; 44]. The
extension relies on testing whether αk = 0 and switching to a different
update rule. In practice it is difficult to judge whether αk should be
treated as zero. MINRES is free of such a decision (except when A is
singular and Ax = b is inconsistent, in which case MINRES-QLP [14; 13]
is recommended).
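The breakdown is easy to verify numerically. The following MATLAB fragment carries out one iteration of Algorithm 1.3 on the system above (our illustration; it simply reproduces the quantities r1, s1, and ρ1 named in the text):

    % One step of CR (Algorithm 1.3) on the 3x3 indefinite example.
    A = [0 1 1; 1 2 1; 1 1 1];   b = [0; 0; 1];
    r0 = b;  s0 = A*r0;  rho0 = r0'*s0;  p0 = r0;  q0 = s0;
    alpha1 = rho0 / norm(q0)^2;
    r1 = r0 - alpha1*q0;         % nonzero
    s1 = A*r1;                   % nonzero
    rho1 = r1'*s1                % zero (up to roundoff), so CR fails after 1 iteration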
Since the pros and cons of CG and MINRES are central to the design
of the two new algorithms in this thesis (LSMR and AMRES), a more
in-depth discussion of their properties is given in Chapter 2.
SYMMLQ
SYMMLQ [52] computes the minimum 2-norm solution of an underdetermined subproblem obtained by deleting the last two rows of (1.4):

    min ‖yk‖  s.t.  H_{k−1}^T y_k = β_1 e_1.

This is solved using the LQ factorization H_{k−1}^T Q_{k−1}^T = (L_{k−1}  0). A benefit is that xk is computed as steps along a set of theoretically orthogonal directions (the columns of Vk Q_{k−1}^T).
Algorithm 1.4 Unsymmetric Lanczos process
1: β1 = ‖b‖2, v1 = b/β1, δ1 = 0, v0 = w0 = 0, w1 = v1
2: for k = 1, 2, . . . do
3:   αk = wkT Avk
4:   δk+1 vk+1 = Avk − αk vk − βk vk−1
5:   w̄k+1 = ATwk − αk wk − δk wk−1
6:   βk+1 = w̄k+1T vk+1
7:   wk+1 = w̄k+1 /βk+1
8: end for
1.3.2 UNSYMMETRIC LANCZOS

The symmetric Lanczos process from Section 1.3.1 transforms a symmetric matrix A into a symmetric tridiagonal matrix Tk, and generates a set of orthonormal³ vectors vk using a 3-term recurrence. If A is not symmetric, there are two other popular strategies, each of which sacrifices some properties of the symmetric Lanczos process.
³Orthogonality holds only under exact arithmetic. In finite precision, orthogonality is quickly lost.

If we don't enforce a short-term recurrence, we arrive at the Arnoldi process presented in Section 1.3.4. If we relax the orthogonality requirement, we arrive at the unsymmetric Lanczos process⁴, the basis of BiCG and QMR. The unsymmetric Lanczos process is shown in Algorithm 1.4.
⁴The version of unsymmetric Lanczos presented here is adapted from [35].
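Before turning to the matrix form, here is a minimal MATLAB sketch of Algorithm 1.4 (our illustration; it omits the breakdown checks that a practical implementation of unsymmetric Lanczos requires, e.g. β_{k+1} ≈ 0 while w̄_{k+1} ≠ 0):

    % Sketch of Algorithm 1.4: unsymmetric Lanczos, building biorthogonal v_k, w_k.
    function [V, W] = unsym_lanczos(A, b, kmax)
      n = length(b);
      beta1 = norm(b);  v = b / beta1;  w = v;
      vprev = zeros(n,1);  wprev = zeros(n,1);
      beta = 0;  delta = 0;                 % beta_k and delta_k from the previous step
      V = zeros(n,kmax);  W = zeros(n,kmax);
      for k = 1:kmax
          V(:,k) = v;  W(:,k) = w;
          alpha = w' * (A * v);
          vbar  = A * v  - alpha * v - beta  * vprev;
          wbar  = A' * w - alpha * w - delta * wprev;
          delta = norm(vbar);               % delta_{k+1}, so ||v_{k+1}|| = 1
          vnew  = vbar / delta;
          beta  = wbar' * vnew;             % beta_{k+1}, so w_{k+1}'*v_{k+1} = 1
          wnew  = wbar / beta;
          vprev = v;  wprev = w;  v = vnew;  w = wnew;
      end
    end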
The scalars δk+1 and βk+1 are chosen so that ‖vk+1‖ = 1 and v_{k+1}^T w_{k+1} = 1. In matrix terms, if we define

    T_k = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \delta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ & & \delta_k & \alpha_k \end{pmatrix}, \qquad
    H_k = \begin{pmatrix} T_k \\ \delta_{k+1} e_k^T \end{pmatrix}, \qquad
    \bar{H}_k = \begin{pmatrix} T_k^T \\ \beta_{k+1} e_k^T \end{pmatrix},

then we have the relations

    A V_k = V_{k+1} H_k   and   A^T W_k = W_{k+1} \bar{H}_k.    (1.7)

As with symmetric Lanczos, by defining xk = Vk yk to search for some optimal solution within the Krylov subspace, we can write

    r_k = b − A x_k = β_1 v_1 − A V_k y_k = V_{k+1}(β_1 e_1 − H_k y_k).

For unsymmetric Lanczos, the columns of Vk are not orthogonal even
in exact arithmetic. However, we pretend that they are orthogonal,
and using similar ideas from Lanczos CG and MINRES, we arrive at the
following two algorithms by solving
Hk yk ≈ β 1 e 1 .
(1.8)
BICG
BiCG [23] is an extension of CG to the unsymmetric Lanczos process.
As with CG, BiCG can be derived by deleting the last row of (1.8), and
solving the resultant square system with LU decomposition [79].
QMR
QMR [26], the quasi-minimum residual method, is the MINRES analog
for the unsymmetric Lanczos process. It is derived by solving the least-squares subproblem min ‖Hk yk − β1 e1‖ at every iteration with QR decomposition. Since the columns of Vk are not orthogonal, QMR doesn't
give a minimum residual solution for the original problem in the corresponding Krylov subspace, but the residual norm does tend to decrease.
1.3.3 TRANSPOSE-FREE METHODS
One disadvantage of BiCG and QMR is that matrix-vector multiplication by AT is needed. A number of algorithms have been proposed to
remove this multiplication. These algorithms are based on the fact that
in BiCG, the residual vector lies in the Krylov subspace and can be written as
rk = Pk (A)b,
where Pk (A) is a polynomial of A of degree k. With the choice
rk = Qk (A)Pk (A)b,
where Qk (A) is some other polynomial of degree k, all the coefficients
needed for the update at every iteration can be computed without using the multiplication by AT [79].
CGS [66] is the extension of BiCG with Qk (A) ≡ Pk (A). CGS has been
shown to exhibit irregular convergence behavior. To achieve smoother
convergence, BiCGStab [78] was designed with some optimal polynomial Qk (A) that minimizes the residual at each iteration.
Algorithm 1.5 Arnoldi process
1: βv1 = b   (i.e. β = ‖b‖2, v1 = b/β)
2: for k = 1, 2, . . . , n do
3:   w = Avk
4:   for i = 1, 2, . . . , k do
5:     hik = wT vi
6:     w = w − hik vi
7:   end for
8:   βk+1 vk+1 = w
9: end for
1.3.4 THE ARNOLDI PROCESS
Another variant of the Lanczos process for an unsymmetric matrix A
is the Arnoldi process. Compared with unsymmetric Lanczos, which
preserves the tridiagonal property of Hk and loses the orthogonality
among columns of Vk , the Arnoldi process transforms A into an upper
Hessenberg matrix Hk with an orthogonal transformation Vk. A short-term recurrence is no longer available for the Arnoldi process. All the
Arnoldi vectors must be kept to generate the next vector, as shown in
Algorithm 1.5.
The process can be summarized by

    A V_k = V_{k+1} H_k,    (1.9)

with

    H_k = \begin{pmatrix}
      h_{11} & h_{12} & \cdots & \cdots & h_{1k} \\
      \beta_2 & h_{22} & \cdots & \cdots & h_{2k} \\
             & \beta_3 & \ddots &        & h_{3k} \\
             &         & \ddots & \ddots & \vdots \\
             &         &        & \beta_k & h_{kk} \\
             &         &        &         & \beta_{k+1}
    \end{pmatrix}.
As with symmetric Lanczos, this allows us to write
rk = b − Axk = β1 v1 − AVk yk = Vk+1 (β1 e1 − Hk yk ),
and our goal is again to find approximate solutions to
    Hk yk ≈ β1 e1.    (1.10)
Note that at the k-th iteration, the amount of memory needed to
store Hk and Vk is O(k² + kn). Since iterative methods primarily focus on matrices that are large and sparse, the storage cost will soon
overwhelm other costs and render the computation infeasible. Most
Arnoldi-based methods adopt a strategy of restarting to handle this issue, trading storage cost for slower convergence.
FOM
FOM [62] is the CG analogue for the Arnoldi process, with yk defined
by deleting the last row of (1.10) and solving the truncated system.
GMRES
GMRES [63] is the MINRES counterpart for the Arnoldi process, with yk
defined by the least-squares subproblem min kHk yk − β1 e1 k2 . Like the
methods we study (CG, MINRES, LSQR, LSMR, AMRES), GMRES does not
break down, but it might require significantly more storage.
1.3.5 INDUCED DIMENSION REDUCTION
Induced dimension reduction (IDR) is a class of transpose-free methods
that generate residuals in a sequence of nested subspaces of decreasing
dimension. The original IDR [85] was proposed by Wesseling and Sonneveld in 1980. It converges after at most 2n matrix-vector multiplications under exact arithmetic. Theoretically, this is the same complexity
as the unsymmetric Lanczos methods and the transpose-free methods.
In 2008, Sonneveld and van Gijzen published IDR(s) [67], an improvement over IDR that takes advantage of extra memory available. The
memory required increases linearly with s, while the maximum number of matrix-vector multiplications needed becomes n + n/s.
We note that in some informal experiments on square unsymmetric
systems Ax = b arising from a convection-diffusion-reaction problem
involving several parameters [81], IDR(s) performed significantly better than LSQR or LSMR for some values of the parameters, but for certain other parameter values the reverse was true [82]. In this sense the
solvers complement each other.
Algorithm 1.6 Golub-Kahan process Bidiag(A, b)
1: β1 u1 = b, α1 v1 = ATu1
2: for k = 1, 2, . . . do
3:   βk+1 uk+1 = Avk − αk uk
4:   αk+1 vk+1 = ATuk+1 − βk+1 vk
5: end for
1.4 KRYLOV SUBSPACE METHODS FOR RECTANGULAR Ax = b
In this section, we introduce a number of Krylov subspace methods
for the matrix equation Ax = b, where A is an m-by-n square or rectangular matrix. When m > n, we solve the least-squares problem
min ‖Ax − b‖2. When m < n, we find the minimum 2-norm solution min_{Ax=b} ‖x‖2. For any m and n, if Ax = b is inconsistent, we solve the problem min ‖x‖ s.t. x = arg min ‖Ax − b‖.
1.4.1 THE GOLUB-KAHAN PROCESS

In the dense case, we can construct orthogonal matrices U and V to transform (b  A) to upper bidiagonal form as follows:

    U^T \begin{pmatrix} b & AV \end{pmatrix} = \begin{pmatrix} \beta_1 e_1 & B \end{pmatrix}
      = \begin{pmatrix}
          \times & \times &        &        \\
                 & \times & \ddots &        \\
                 &        & \ddots & \times \\
                 &        &        & \times
        \end{pmatrix},
where B is a lower bidiagonal matrix. For sparse matrices or linear
operators, Golub and Kahan [29] gave an iterative version of the bidiagonalization as shown in Algorithm 1.6.
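A minimal MATLAB sketch of Algorithm 1.6 follows (our illustration; A is taken to be a sparse matrix here, whereas an operator version would replace the products A*v and A'*u by calls to user-supplied routines):

    % Sketch of Algorithm 1.6: Golub-Kahan bidiagonalization Bidiag(A,b),
    % giving A*V = U*B with B lower bidiagonal of size (kmax+1) x kmax.
    function [U, V, B] = golub_kahan(A, b, kmax)
      [m, n] = size(A);
      U = zeros(m, kmax+1);  V = zeros(n, kmax);
      alpha = zeros(kmax+1,1);  beta = zeros(kmax+1,1);
      beta(1) = norm(b);            U(:,1) = b / beta(1);
      w = A' * U(:,1);
      alpha(1) = norm(w);           V(:,1) = w / alpha(1);
      for k = 1:kmax
          u = A * V(:,k) - alpha(k) * U(:,k);
          beta(k+1) = norm(u);      U(:,k+1) = u / beta(k+1);
          if k < kmax
              w = A' * U(:,k+1) - beta(k+1) * V(:,k);
              alpha(k+1) = norm(w); V(:,k+1) = w / alpha(k+1);
          end
      end
      B = zeros(kmax+1, kmax);
      for k = 1:kmax
          B(k,k) = alpha(k);  B(k+1,k) = beta(k+1);   % alpha_k on diagonal, beta_{k+1} below
      end
    end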
After k steps, we have A V_k = U_{k+1} B_k and A^T U_{k+1} = V_{k+1} L_{k+1}^T, where

    B_k = \begin{pmatrix}
      \alpha_1 &          &         &          \\
      \beta_2  & \alpha_2 &         &          \\
               & \ddots   & \ddots  &          \\
               &          & \beta_k & \alpha_k \\
               &          &         & \beta_{k+1}
    \end{pmatrix}, \qquad
    L_{k+1} = \begin{pmatrix} B_k & \alpha_{k+1} e_{k+1} \end{pmatrix}.
This is equivalent to what would be generated by the symmetric Lanczos process with matrix ATA and starting vector ATb. The Lanczos
vectors Vk are the same, and the Lanczos tridiagonal matrix satisfies
Tk = BkT Bk . With xk = Vk yk , the residual vector rk can be written as
rk = b − AVk yk = β1 u1 − Uk+1 Bk yk = Uk+1 (β1 e1 − Bk yk ),
and our goal is to find an approximate solution to
    Bk yk ≈ β1 e1.    (1.11)
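To verify the equivalence with symmetric Lanczos on the normal equation (a short derivation we include for completeness, using only the two relations A V_k = U_{k+1} B_k and A^T U_{k+1} = V_{k+1} L_{k+1}^T above):

    A^T A V_k = A^T U_{k+1} B_k = V_{k+1} L_{k+1}^T B_k
              = V_{k+1} \begin{pmatrix} B_k^T B_k \\ \alpha_{k+1}\beta_{k+1} e_k^T \end{pmatrix}
              = V_k (B_k^T B_k) + \alpha_{k+1}\beta_{k+1}\, v_{k+1} e_k^T,

which has the same form as the Lanczos relation (1.3) applied to A^T A, with T_k = B_k^T B_k and next off-diagonal element α_{k+1}β_{k+1}.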
CRAIG
CRAIG [22; 53] is defined by deleting the last row from (1.11), so that
yk satisfies Lk yk = β1 e1 at each iteration. It is an efficient and reliable
method for consistent square or rectangular systems Ax = b, and it
is known to minimize the error norm kx∗ − xk k within each Krylov
subspace [51].
LSQR
LSQR [53] is derived by solving min krk k ≡ min kβ1 e1 − Bk yk k at each
iteration. Since we are minimizing over a larger Krylov subspace at
each iteration, this immediately implies that ‖rk‖ is monotonic for
LSQR.
LSMR
LSMR is derived by minimizing ‖ATrk‖ within each Krylov subspace.
LSMR is a major focus in this thesis. It solves linear systems Ax = b and
least-squares problems min kAx − bk2 , with A being sparse or a linear
operator. It is analytically equivalent to applying MINRES to the normal
equation ATAx = ATb, so that the quantities kATrk k are monotonically
decreasing. We have proved that krk k also decreases monotonically. As
we will see in Theorem 4.1.1, this means that a certain backward error
measure (the Stewart backward error kATrk k/krk k) is always smaller
for LSMR than for LSQR. Hence it is safer to terminate LSMR early.
CGLS
LSQR has an equivalent formulation named CGLS [38; 53], which doesn't involve the computation of Golub-Kahan vectors. See Algorithm 1.7.⁵
⁵This version of CGLS is adapted from [53].
Algorithm 1.7 Algorithm CGLS
1: r0 = b, s0 = ATb, p1 = s0
2: ρ0 = ‖s0‖², x0 = 0
3: for k = 1, 2, . . . do
4:   qk = Apk
5:   αk = ρk−1 / ‖qk‖²
6:   xk = xk−1 + αk pk
7:   rk = rk−1 − αk qk
8:   sk = ATrk
9:   ρk = ‖sk‖²
10:  βk = ρk /ρk−1
11:  pk+1 = sk + βk pk
12: end for
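A direct MATLAB transcription of Algorithm 1.7 (our sketch; the stopping test on ‖ATrk‖ is added for illustration and is not part of the algorithm as stated):

    % Sketch of Algorithm 1.7 (CGLS) for min ||Ax - b||_2.
    function x = cgls_sketch(A, b, tol, maxit)
      x = zeros(size(A,2),1);
      r = b;  s = A' * b;  p = s;  rho = norm(s)^2;
      for k = 1:maxit
          q     = A * p;
          alpha = rho / norm(q)^2;
          x     = x + alpha * p;
          r     = r - alpha * q;
          s     = A' * r;
          rhonew = norm(s)^2;
          if sqrt(rhonew) <= tol, break, end      % ||A'r_k|| small: near a least-squares solution
          beta  = rhonew / rho;   rho = rhonew;
          p     = s + beta * p;
      end
    end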
1.5 OVERVIEW
Chapter 2 compares the performance of CG and MINRES on various
symmetric problems Ax = b. The results suggest that MINRES is a superior algorithm even for solving positive definite linear systems. This provides motivation for LSMR, the first algorithm developed in this thesis, to be based on MINRES. Chapter 3 presents the mathematical background for constructing LSMR, the derivation of a computationally efficient algorithm, and the stopping criteria. Chapter 4 describes a number of numerical experiments designed to compare the
performance of LSQR and LSMR in solving overdetermined, consistent,
or underdetermined linear systems. Chapter 5 derives an iterative algorithm AMRES for solving the negatively damped least-squares problem. Chapter 6 focuses on applications of AMRES to Curtis-Reid scaling, improving approximate singular vectors, and computing singular
vectors when the singular value is known. Chapter 7 summarizes the
contributions of this thesis and gives a summary of interesting problems available for future research.
The notation used in this thesis is summarized in Table 1.1.
Table 1.1 Notation

A                   matrix, sparse matrix or linear operator
Aij                 the element of matrix A in the i-th row and j-th column
b, p, r, t, u, v, x, y, . . .   vectors
k                   subscript index for iteration number, e.g. xk is the approximate solution generated at the k-th iteration of an iterative solver such as MINRES. In Chapters 3 to 6, k represents the number of Golub-Kahan bidiagonalization iterations.
q                   subscript index for RQI iteration number, used in Section 6.2
ck, sk              non-identity elements \begin{pmatrix} c_k & s_k \\ -s_k & c_k \end{pmatrix} in a Givens rotation matrix
Bk                  bidiagonal matrix generated at the k-th step of Golub-Kahan bidiagonalization
Greek letters       scalars
‖·‖                 vector 2-norm or the induced matrix 2-norm
‖·‖F                Frobenius norm
‖x‖A                energy norm of vector x with respect to a positive definite matrix A: √(xTAx)
cond(A)             condition number of A
ek                  k-th column of an identity matrix
1                   a vector with all entries being 1
R(A)                range of matrix A
N(A)                null space of matrix A
A ≻ 0               A is symmetric positive definite
x∗                  the unique solution to a nonsingular square system Ax = b, or more generally the pseudoinverse solution of a rectangular system Ax ≈ b
2 A TALE OF TWO ALGORITHMS
The conjugate gradient method (CG) [38] and the minimum residual
method (MINRES) [52] are both Krylov subspace methods for the iterative solution of symmetric linear equations Ax = b. CG is commonly
used when the matrix A is positive definite, while MINRES is generally
reserved for indefinite systems [79, p85]. We reexamine this wisdom
from the point of view of early termination on positive-definite systems. This also serves as the rationale for why MINRES is chosen as the
basis for the development of LSMR.
In this Chapter, we study the application of CG and MINRES to real
symmetric positive-definite (spd) systems Ax = b, where A is of dimension n × n. The unique solution is denoted by x∗ . The initial approximate solution is x0 ≡ 0, and rk ≡ b − Axk is the residual vector
for an approximation xk within the kth Krylov subspace.
From Section 1.3.1, we know that CG and MINRES use the same information Vk+1 and Hk to compute solution estimates xk = Vk yk within
the Krylov subspace Kk (A, b) ≡ span{b, Ab, A2 b, . . . , Ak−1 b} (for each
k). It is commonly thought that the number of iterations required will
be similar for each method, and hence CG should be preferable on spd
systems because it requires less storage and fewer floating-point operations per iteration. This view is justified if an accurate solution
is required (stopping tolerance τ close to machine precision ǫ). We
show that with looser stopping tolerances, MINRES is sure to terminate
sooner than CG when the stopping rule is based on the backward error
for xk , and by numerical examples we illustrate that the difference in
iteration numbers can be substantial.
Section 2.1 describes a number of monotonic convergence properties that make CG and MINRES favorable iterative solvers for linear systems. Section 2.2 introduces the concept of backward error and how it
is used in designing stopping rules for iterative solvers, and the reason
why MINRES is more favorable for applications where backward error
is important. Section 2.3 compares experimentally the behavior of CG
and MINRES in terms of energy norm error, backward error, residual
norm, and solution norm.
2.1 MONOTONICITY OF NORMS
In designing iterative solvers for linear equations, we often gauge convergence by computing some norms from the current iterate xk. These norms¹ include ‖xk − x∗‖, ‖xk − x∗‖A, and ‖rk‖. An iterative method might sometimes be stopped by an iteration limit or a time limit. It is then highly desirable that some or all of the above norms converge monotonically.
¹For any vector x, the energy norm with respect to an spd matrix A is defined as ‖x‖A = √(xTAx).
It is also desirable to have monotonic convergence for kxk k. First,
some applications such as trust-region methods [68] depend on that
property. Second, when convergence is measured by backward error
(Section 2.2), monotonicity in kxk k (together with monotonicity in krk k)
gives monotonic convergence in backward error. More generally, if
kxk k is monotonic, there cannot be catastrophic cancellation error in
stepping from xk to xk+1 .
2.1.1 PROPERTIES OF CG
A number of monotonicity properties have been found by various authors. We summarize them here for easy reference.
Theorem 2.1.1. [68, Thm 2.1] For CG on an spd system Ax = b, kxk k is
strictly increasing.
Theorem 2.1.2. [38, Thm 4:3] For CG on an spd system Ax = b, kx∗ −xk kA
is strictly decreasing.
Theorem 2.1.3. [38, Thm 6:3] For CG on an spd system Ax = b, kx∗ − xk k
is strictly decreasing.
krk k is not monotonic for CG. Examples are shown in Figure 2.4.
2.1.2 PROPERTIES OF CR AND MINRES
Here we prove a number of monotonicity properties for CR and MINRES on an spd system Ax = b. Some known properties are also included for completeness. Relations from Algorithm 1.3 (CR) are used
extensively in the proofs. Termination of CR occurs when rk = 0 for
some index k = ℓ ≤ n (⇒ ρℓ = βℓ = 0, rℓ = sℓ = pℓ = qℓ = 0, xℓ = x∗ ,
where Ax∗ = b). Note: This ℓ is the same ℓ at which the Lanczos process
theoretically terminates for the given A and b.
Theorem 2.1.4. The following properties hold for Algorithm CR:
(a) qiT qj = 0   (0 ≤ i, j ≤ ℓ − 1, i ≠ j)
(b) riT qj = 0   (0 ≤ i, j ≤ ℓ − 1, i ≥ j + 1)
(c) ri ≠ 0 ⇒ pi ≠ 0   (0 ≤ i ≤ ℓ − 1)
Proof. Given in [44, Theorem 1].
Theorem 2.1.5. The following properties hold for Algorithm CR on an spd
system Ax = b:
(a) αi > 0   (i = 1, . . . , ℓ)
(b) βi > 0   (i = 1, . . . , ℓ − 1),   βℓ = 0
(c) piT qj > 0   (0 ≤ i, j ≤ ℓ − 1)
(d) piT pj > 0   (0 ≤ i, j ≤ ℓ − 1)
(e) xiT pj > 0   (1 ≤ i ≤ ℓ, 0 ≤ j ≤ ℓ − 1)
(f) riT pj > 0   (0 ≤ i, j ≤ ℓ − 1)
Proof. (a) Here we use the fact that A is spd. Since ri ≠ 0 for 0 ≤ i ≤ ℓ − 1, we have for 1 ≤ i ≤ ℓ,

    ρ_{i−1} = r_{i−1}^T s_{i−1} = r_{i−1}^T A r_{i−1} > 0   (A ≻ 0)    (2.1)
    α_i = ρ_{i−1} / ‖q_{i−1}‖² > 0,

where q_{i−1} ≠ 0 follows from q_{i−1} = A p_{i−1} and Theorem 2.1.4 (c).
(b) For 1 ≤ i ≤ ℓ − 1, we have
βi = ρi /ρi−1 > 0,
(by (2.1))
and rℓ = 0 implies βℓ = 0.
(c) For any 0 ≤ i, j ≤ ℓ − 1, we have
Case I: i = j

    piT qi = piT A pi > 0   (A ≻ 0)

where pi ≠ 0 from Theorem 2.1.4 (c). Next, we prove the cases where i ≠ j by induction.
Case II: i − j = k > 0

    piT qj = piT q_{i−k} = riT q_{i−k} + βi p_{i−1}^T q_{i−k}
           = βi p_{i−1}^T q_{i−k}   (by Thm 2.1.4 (b))
           > 0,

where βi > 0 by (b)² and p_{i−1}^T q_{i−k} > 0 by induction as (i − 1) − (i − k) = k − 1 < k.
²Note that i − j = k > 0 implies i ≥ 1.
Case III: j − i = k > 0
pTi qj = pTi qi+k = pTi Api+k
= pTi A(ri+k + βi+k pi+k−1 )
= qiT ri+k + βi+k pTi qi+k−1
(by Thm 2.1.4 (b))
= βi+k pTi qi+k−1
> 0,
where βi+k = βj > 0 by (b) and pTi qi+k−1 > 0 by induction as
(i + k − 1) − i = k − 1 < k.
(d) Define P ≡ span{p0 , p1 , . . . , pℓ−1 } and Q ≡ span{q0 , . . . , qℓ−1 } at
termination. By construction, P = span{b, Ab, . . . , Aℓ−1 b} and Q =
span{Ab, . . . , Aℓ b} (since qi = Api ). Again by construction, xℓ ∈ P,
and since rℓ = 0 we have b = Axℓ ⇒ b ∈ Q. We see that P ⊆ Q. By
Theorem 2.1.4(a), {qi /‖qi‖}_{i=0}^{ℓ−1} forms an orthonormal basis for Q. If we project pi ∈ P ⊆ Q onto this basis, we have

    p_i = \sum_{k=0}^{\ell-1} \frac{p_i^T q_k}{q_k^T q_k}\, q_k,

where all coordinates are positive from (c). Similarly for any other pj. Therefore piT pj > 0 for any 0 ≤ i, j < ℓ.
(e) By construction,

    x_i = x_{i−1} + α_i p_{i−1} = · · · = \sum_{k=1}^{i} α_k p_{k−1}   (x_0 = 0).

Therefore xiT pj > 0 by (d) and (a).
(f) Note that any ri can be expressed as a sum of qi:

    r_i = r_{i+1} + α_{i+1} q_i = · · · = r_ℓ + α_ℓ q_{ℓ−1} + · · · + α_{i+1} q_i = α_ℓ q_{ℓ−1} + · · · + α_{i+1} q_i.

Thus we have

    riT pj = (α_ℓ q_{ℓ−1} + · · · + α_{i+1} q_i)^T p_j > 0,

where the inequality follows from (a) and (c).
We are now able to prove our main theorem about the monotonic
increase of kxk k for CR and MINRES. A similar result was proved for
CG by Steihaug [68].
Theorem 2.1.6. For CR (and hence MINRES) on an spd system Ax = b, kxk k
is strictly increasing.
Proof. ‖xi‖² − ‖xi−1‖² = 2αi x_{i−1}^T p_{i−1} + αi² p_{i−1}^T p_{i−1} > 0, where the last inequality follows from Theorem 2.1.5 (a), (d) and (e). Therefore ‖xi‖ > ‖xi−1‖.
The following theorem is a direct consequence of Hestenes and Stiefel [38, Thm 7:5]. However, the second half of that theorem, ‖x∗ − x_{k−1}^{CG}‖ > ‖x∗ − x_k^{MINRES}‖, rarely holds in machine arithmetic. We give here an alternative proof that does not depend on CG.
Theorem 2.1.7. For CR (and hence MINRES) on an spd system Ax = b, the
error kx∗ − xk k is strictly decreasing.
Proof. From the update rule for xk , we can express x∗ as
    x∗ = x_ℓ = x_{ℓ−1} + α_ℓ p_{ℓ−1} = · · ·
             = x_k + α_{k+1} p_k + · · · + α_ℓ p_{ℓ−1}    (2.2)
             = x_{k−1} + α_k p_{k−1} + α_{k+1} p_k + · · · + α_ℓ p_{ℓ−1}.    (2.3)
Using the last two equalities above, we can write
    ‖x∗ − x_{k−1}‖² − ‖x∗ − x_k‖²
    = (x_ℓ − x_{k−1})^T (x_ℓ − x_{k−1}) − (x_ℓ − x_k)^T (x_ℓ − x_k)
    = 2α_k p_{k−1}^T (α_{k+1} p_k + · · · + α_ℓ p_{ℓ−1}) + α_k² p_{k−1}^T p_{k−1}
> 0,
where the last inequality follows from Theorem 2.1.5 (a), (d).
The following theorem is given in [38, Thm 7:4]. We give an alternative proof here.
Theorem 2.1.8. For CR (and hence MINRES) on an spd system Ax = b, the
energy norm error kx∗ − xk kA is strictly decreasing.
Proof. From (2.2) and (2.3) we can write
    ‖x_ℓ − x_{k−1}‖²_A − ‖x_ℓ − x_k‖²_A
    = (x_ℓ − x_{k−1})^T A (x_ℓ − x_{k−1}) − (x_ℓ − x_k)^T A (x_ℓ − x_k)
    = 2α_k p_{k−1}^T A (α_{k+1} p_k + · · · + α_ℓ p_{ℓ−1}) + α_k² p_{k−1}^T A p_{k−1}
    = 2α_k q_{k−1}^T (α_{k+1} p_k + · · · + α_ℓ p_{ℓ−1}) + α_k² q_{k−1}^T p_{k−1}
    > 0,

where the last inequality follows from Theorem 2.1.5 (a), (c).
The following theorem is available from [52] and [38, Thm 7:2], and
is the characterizing property of MINRES. We include it here for completeness.
Theorem 2.1.9. For MINRES on any system Ax = b, krk k is decreasing.
Proof. This follows immediately from (1.6).
2.2 BACKWARD ERROR ANALYSIS
“The data frequently contains uncertainties due to measurements, previous computations, or errors committed in storing numbers on the computer. If the backward error is no
larger than these uncertainties then the computed solution
can hardly be criticized — it may be the solution we are
seeking, for all we know.” Nicholas J. Higham, Accuracy
and Stability of Numerical Algorithms (2002)
For many physical problems requiring numerical solution, we are
given inexact or uncertain input data. Examples include model estimation in geophysics [45], system identification in control theory [41], and
super-resolution imaging [80]. For these problems, it is not justifiable
to seek a solution beyond the accuracy of the data [19]. Computation
time may be wasted in the extra iterations without yielding a more desirable answer [3]. Also, rounding errors are introduced during computation. Both errors in the original data and rounding errors can be
analyzed in a common framework by applying the Wilkinson principle,
which considers any computed solution to be the exact solution of a
nearby problem [10; 25]. The measure of “nearby” should match the error in the input data. The design of stopping rules from this viewpoint
is an important part of backward error analysis [4; 39; 48; 61].
For a consistent linear system Ax = b, there may be uncertainty in
A and/or b. From now on we think of xk coming from the kth iteration
of one of the iterative solvers. Following Titley-Peloquin [75] we say
that xk is an acceptable solution if and only if there exist perturbations E
and f satisfying
    (A + E) x_k = b + f,    \frac{\|E\|}{\|A\|} \le \alpha,    \frac{\|f\|}{\|b\|} \le \beta    (2.4)
for some tolerances α ≥ 0, β ≥ 0 that reflect the (preferably known)
accuracy of the data. We are naturally interested in minimizing the
size of E and f . If we define the optimization problem
    \min_{\xi, E, f} \; \xi    s.t.    (A + E) x_k = b + f,    \frac{\|E\|}{\|A\|} \le \alpha\xi,    \frac{\|f\|}{\|b\|} \le \beta\xi
to have optimal solution ξk , Ek , fk (all functions of xk , α, and β), we
see that xk is an acceptable solution if and only if ξk ≤ 1. We call ξk the
normwise relative backward error (NRBE) for xk .
With rk = b − Axk , the optimal solution ξk , Ek , fk is shown in [75]
to be given by
    \phi_k = \frac{\beta\|b\|}{\alpha\|A\|\|x_k\| + \beta\|b\|},  \qquad  \xi_k = \frac{\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|},    (2.5)

    E_k = \frac{1 - \phi_k}{\|x_k\|^2}\, r_k x_k^T,  \qquad  f_k = -\phi_k r_k.    (2.6)
(See [39, p12] for the case β = 0 and [39, §7.1 and p336] for the case
α = β.)
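For reference, the quantities in (2.5)–(2.6) and the acceptance test ξk ≤ 1 can be evaluated in MATLAB as follows (our sketch; normest is used to estimate ‖A‖, and the explicit perturbation E would never be formed for large problems):

    % Normwise relative backward error (NRBE) for an approximate solution xk,
    % following (2.5)-(2.6).  Sketch only.
    function [xi, E, f] = nrbe(A, b, xk, alpha, beta)
      rk   = b - A*xk;
      nrmA = normest(A);                           % estimate of ||A||
      d    = alpha*nrmA*norm(xk) + beta*norm(b);
      phi  = beta*norm(b) / d;
      xi   = norm(rk) / d;                         % xk is acceptable iff xi <= 1
      E    = ((1 - phi) / norm(xk)^2) * (rk * xk');    % optimal perturbation to A
      f    = -phi * rk;                                % optimal perturbation to b
    end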
2.2.1 STOPPING RULE
For general tolerances α and β, the condition ξk ≤ 1 for xk to be an
acceptable solution becomes
krk k ≤ αkAkkxk k + βkbk,
(2.7)
the stopping rule used in LSQR for consistent systems [53, p54, rule S1].
2.2.2 MONOTONIC BACKWARD ERRORS
Of interest is the size of the perturbations to A and b for which xk is an
exact solution of Ax = b. From (2.5)–(2.6), the perturbations have the
following norms:
    \|E_k\| = (1 - \phi_k)\frac{\|r_k\|}{\|x_k\|} = \frac{\alpha\|A\|\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|},    (2.8)

    \|f_k\| = \phi_k\|r_k\| = \frac{\beta\|b\|\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|}.    (2.9)
Since kxk k is monotonically increasing for CG and MINRES (when A
is spd), we see from (2.5) that φk is monotonically decreasing for both
solvers. Since krk k is monotonically decreasing for MINRES (but not for
CG), we have the following result.
Theorem 2.2.1. Suppose α > 0 and β > 0 in (2.4). For CR and MINRES
(but not CG), the relative backward errors kEk k/kAk and kfk k/kbk decrease
monotonically.
Proof. This follows from (2.8)–(2.9) with kxk k increasing for both solvers
and krk k decreasing for CR and MINRES but not for CG.
2.2.3 OTHER CONVERGENCE MEASURES
Error kx∗ −xk k and energy norm error kx∗ −xk kA are two possible measures of convergence. In trust-region methods [68], some eigenvalue
problems [74], finite element approximations [1], and some other applications [46; 84], it is desirable to minimize kx∗ − xk kA , which makes
CG a sensible algorithm to use.
We should note that since x∗ is not known, neither kx∗ − xk k nor
kx∗ − xk kA can be computed directly from the CG algorithm. However,
bounds and estimates have been derived for kx∗ −xk kA [9; 30] and they
can be used for stopping rules based on the energy norm error.
An alternative stopping criterion is derived for MINRES by Calvetti
et al. [8] based on an L-curve defined by krk k and cond(Hk ).
2.3 NUMERICAL RESULTS
Here we compare the convergence of CG and MINRES on various spd
systems Ax = b and some associated indefinite systems (A − δI)x = b.
The test examples are drawn from the University of Florida Sparse
Matrix Collection (Davis [18]). We experimented with all 26 cases for
which A is real spd and b is supplied. We compute the condition number for each test matrix by finding the largest and smallest eigenvalue
using eigs(A,1,’LM’) and eigs(A,1,’SM’) respectively. For this test
set, the condition numbers range from 1.7E+03 to 3.1E+13.
Since A is spd, we applied diagonal preconditioning by redefining
A and b as follows: d = diag(A), D = diag(1./sqrt(d)), A ← DAD,
b ← Db, b ← b/kbk. Thus in the figures below we have diag(A) = I
and kbk = 1. With this preconditioning, the condition numbers range
from 1.2E+01 to 2.2E+11. The distribution of condition number of the
test set matrices before and after preconditioning is shown in Figure
2.1.
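In MATLAB, this diagonal preconditioning amounts to the following commands (a sketch of the operations just listed; spdiags is used here so that D remains sparse, which is equivalent to the text's diag(1./sqrt(d)) for dense D):

    % Diagonal scaling applied before running CG and MINRES:
    % A <- D*A*D and b <- D*b with D = diag(1./sqrt(diag(A))), then normalize b.
    d = full(diag(A));
    D = spdiags(1 ./ sqrt(d), 0, length(d), length(d));
    A = D * A * D;            % now diag(A) = I
    b = D * b;
    b = b / norm(b);          % now ||b|| = 1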
The stopping rule used for CG and MINRES was (2.7) with α = 0 and β = 10⁻⁸ (that is, ‖rk‖ ≤ 10⁻⁸‖b‖ = 10⁻⁸).
2.3.1 POSITIVE-DEFINITE SYSTEMS
In defining backward errors, we assume for simplicity that α > 0 and
β = 0 in (2.4)–(2.6), even though it doesn’t match the choice α = 0 and
β = 10−8 in the stopping rule. This gives φk = 0 and kEk k = krk k/kxk k
in (2.8). Thus, as in Theorem 2.2.1, we expect kEk k to decrease monotonically for CR and MINRES but not for CG.
We also compute kx∗ − xk k and kx∗ − xk kA at each iteration for both
algorithms, where x∗ is obtained by MATLAB's backslash function A\b,
which uses a sparse Cholesky factorization of A [12].
In Figures 2.2 and 2.3, we plot ‖rk‖/‖xk‖, ‖x∗ − xk‖, and ‖x∗ − xk‖A for CG and MINRES for four different problems. For CG, the plots confirm the properties in Theorems 2.1.2 and 2.1.3 that ‖x∗ − xk‖ and ‖x∗ − xk‖A are monotonic. For MINRES, the plots confirm the properties in Theorems 2.2.1, 2.1.8, and 2.1.7 that ‖rk‖/‖xk‖, ‖x∗ − xk‖, and ‖x∗ − xk‖A are monotonic.
[Figure 2.1 (plot of cond(A) against matrix ID for the original and preconditioned test matrices) appears here.]
Figure 2.1: Distribution of condition number for matrices used for CG vs MINRES comparison, before and after diagonal preconditioning.
Figure 2.2 (left) shows problem Schenk_AFE_af_shell8 with A of
size 504855×504855 and cond(A) = 2.7E+05. From the plot of backward
errors krk k/kxk k, we see that both CG and MINRES converge quickly at
the early iterations. Then the backward error of MINRES plateaus at
about iteration 80, and the backward error of CG stays about 1 order of
magnitude behind MINRES. A similar phenomenon of fast convergence
at early iterations followed by slow convergence is also observed in the
energy norm error and 2-norm error plots.
Figure 2.2 (right) shows problem Cannizzo_sts4098 with A of size
19779×19779 and cond(A) = 6.7E+03. MINRES converges slightly faster
in terms of backward error, while CG converges slightly faster in terms
of energy norm error and 2-norm error.
Figure 2.3 (left) shows problem Simon_raefsky4 with A of size 19779×
19779 and cond(A) = 2.2E+11. Because of the high condition number,
both algorithms hit the 5n iteration limit that we set. We see that the
backward error for MINRES converges faster than for CG as expected.
For the energy norm error, CG is able to decrease over 5 orders of magnitude while MINRES plateaus after a 2 orders of magnitude decrease.
25
2.3. NUMERICAL RESULTS
Name:Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=13
Name:Schenk_AFE_af_shell8, Dim:504855x504855, nnz:17579155, id=11
0
−1
CG
MINRES
CG
MINRES
−1
−2
−2
−3
−3
−4
log(||r||/||x||)
log(||r||/||x||)
−4
−5
−6
−5
−6
−7
−7
−8
−8
−9
−10
−9
0
200
400
600
800
iteration count
1000
1200
1400
−10
0
50
100
150
200
250
iteration count
300
350
400
450
Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=13
Schenk_AFE_af_shell8, Dim:504855x504855, nnz:17579155, id=11
1
1
0
0
−1
−1
−2
log||xk − x*||A
log||xk − x*||A
−2
−3
−4
−4
−5
−5
−6
−6
−7
−3
−7
CG
MINRES
0
200
400
600
800
iteration count
1000
1200
−8
1400
CG
MINRES
0
50
100
150
200
250
iteration count
300
350
400
450
Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=13
Schenk_AFE_af_shell8, Dim:504855x504855, nnz:17579155, id=11
2
2
1
1
0
0
log||xk − x*||
log||xk − x*||
−1
−1
−2
−2
−3
−4
−3
−5
−4
−5
−6
CG
MINRES
0
200
400
600
800
iteration count
1000
1200
1400
−7
CG
MINRES
0
50
100
150
200
250
iteration count
300
350
400
450
Figure 2.2: Comparison of backward and forward errors for CG and MINRES solving two
spd systems Ax = b.
Top: The values of log10 (krk k/kxk k) are plotted against iteration number k. These values
define log10 (kEk k) when the stopping tolerances in (2.7) are α > 0 and β = 0. Middle: The
values of log10 kxk − x∗ kA are plotted against iteration number k. This is the quantity that
CG minimizes at each iteration. Bottom: The values of log10 kxk − x∗ k.
Left: Problem Schenk_AFE_af_shell8, with n = 504855 and cond(A) = 2.7E+05.
Right: Cannizzo_sts4098, with n = 19779 and cond(A) = 6.7E+03.
26
CHAPTER 2. A TALE OF TWO ALGORITHMS
For both the energy norm error and 2-norm error, MINRES reaches a
lower point than CG for some iterations. This must be due to numerical error in CG and MINRES (a result of loss of orthogonality in Vk ).
Figure 2.3 (right) shows problem BenElechi_BenElechi1 with A of
size 245874 × 245874 and cond(A) = 1.8E+09. The backward error of
MINRES stays ahead of CG by 2 orders of magnitude for most iterations.
Around iteration 32000, the backward error of both algorithms goes
down rapidly and CG catches up with MINRES. Both algorithms exhibit
a plateau on energy norm error for the first 20000 iterations. The error
norms for CG start decreasing around iteration 20000 and decreases
even faster after iteration 30000.
Figure 2.4 shows krk k and kxk k for CG and MINRES on two typical
spd examples. We see that kxk k is monotonically increasing for both
solvers, and the kxk k values rise fairly rapidly to their limiting value
kx∗ k, with a moderate delay for MINRES.
Figure 2.5 shows krk k and kxk k for CG and MINRES on two spd
examples in which the residual decrease and the solution norm increase are somewhat slower than typical. The rise of kxk k for MINRES is
rather more delayed. In the second case, if the stopping tolerance were
β = 10−6 rather than β = 10−8 , the final MINRES kxk k (k ≈ 10000)
would be less than half the exact value kx∗ k. It will be of future interest to evaluate this effect within the context of trust-region methods for
optimization.
W HY DOES krk k FOR CG LAG BEHIND MINRES ?
It is commonly thought that even though MINRES is known to minimize krk k at each iteration, the cumulative minimum of krk k for CG
should approximately match that of MINRES. That is,
min kriCG k ≈ krkMINRES k.
0≤i≤k
However, in Figure 2.2 and 2.3 we see that krk k for MINRES is often smaller than for CG by 1 or 2 orders of magnitude. This phenomenon can be explained by the following relations between krkCG k
and krkMINRES k [76; 35, Lemma 5.4.1]:
krkMINRES k
.
krkCG k = q
MINRES 2
1 − krkMINRES k2 /krk−1
k
(2.10)
27
2.3. NUMERICAL RESULTS
Name:Simon_raefsky4, Dim:19779x19779, nnz:1316789, id=7
Name:BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=22
2
0
CG
MINRES
−2
−2
−4
−3
−6
−4
−8
−5
−10
−6
−12
−7
−14
−8
−16
−9
−18
CG
MINRES
−1
log(||r||/||x||)
log(||r||/||x||)
0
0
1
2
3
4
5
iteration count
6
7
8
9
−10
10
0
0.5
1
4
x 10
Simon_raefsky4, Dim:19779x19779, nnz:1316789, id=7
1.5
2
iteration count
2.5
3
3.5
4
x 10
BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=22
5
1
0
4
−1
3
*
log||xk − x ||A
log||xk − x*||A
−2
2
−3
−4
1
−5
0
−6
−1
CG
MINRES
0
1
2
3
4
5
iteration count
6
7
8
9
−7
10
CG
MINRES
0
0.5
1
4
x 10
Simon_raefsky4, Dim:19779x19779, nnz:1316789, id=7
1.5
2
iteration count
2.5
3
3.5
4
x 10
BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=22
10
2
1
9
0
*
log||xk − x ||
k
log||x − x*||
8
7
−1
−2
6
−3
5
−4
CG
MINRES
4
0
1
2
3
4
5
iteration count
6
7
8
9
10
4
x 10
−5
CG
MINRES
0
0.5
1
1.5
2
iteration count
2.5
3
3.5
4
x 10
Figure 2.3: Comparison of backward and forward errors for CG and MINRES solving two
spd systems Ax = b.
Top: The values of log10 (krk k/kxk k) are plotted against iteration number k. These values
define log10 (kEk k) when the stopping tolerances in (2.7) are α > 0 and β = 0. Middle: The
values of log10 kxk − x∗ kA are plotted against iteration number k. This is the quantity that
CG minimizes at each iteration. Bottom: The values of log10 kxk − x∗ k.
Left: Problem Simon_raefsky4, with n = 19779 and cond(A) = 2.2E+11.
Right: BenElechi_BenElechi1, with n = 245874 and cond(A) = 1.8E+09.
28
CHAPTER 2. A TALE OF TWO ALGORITHMS
Name:Simon_olafu, Dim:16146x16146, nnz:1015156, id=6
Name:Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=13
2
1
CG
MINRES
CG
MINRES
0
0
−1
−2
−2
log||r||
log||r||
−3
−4
−4
−5
−6
−6
−7
−8
−8
−10
0
0.5
5
18
x 10
1
1.5
iteration count
2
2.5
−9
3
0
50
100
150
4
x 10
Name:Simon_olafu, Dim:16146x16146, nnz:1015156, id=6
200
250
iteration count
300
350
400
450
Name:Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=13
25
16
20
14
12
15
||x||
||x||
10
8
10
6
4
5
2
CG
MINRES
0
0
0.5
1
1.5
iteration count
2
CG
MINRES
2.5
3
4
x 10
0
0
50
100
150
200
250
iteration count
300
350
400
450
Figure 2.4: Comparison of residual and solution norms for CG and MINRES solving two spd
systems Ax = b with n = 16146 and 4098.
Top: The values of log10 krk k are plotted against iteration number k. Bottom: The values
of kxk k are plotted against k. The solution norms grow somewhat faster for CG than for
MINRES. Both reach the limiting value kx∗ k significantly before xk is close to x.
29
2.3. NUMERICAL RESULTS
Name:BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=22
Name:Schmid_thermal1, Dim:82654x82654, nnz:574458, id=14
0
0
CG
MINRES
CG
MINRES
−2
−2
−3
−3
−4
−4
log||r||
log||r||
−1
−1
−5
−5
−6
−6
−7
−7
−8
−8
−9
−9
0
200
400
600
800
iteration count
1000
1200
1400
0
0.5
1
1.5
2
iteration count
2.5
3
3.5
4
x 10
Name:BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=22
Name:Schmid_thermal1, Dim:82654x82654, nnz:574458, id=14
80
25
70
20
60
50
||x||
||x||
15
10
40
30
20
5
10
CG
MINRES
CG
MINRES
0
0
0
200
400
600
800
iteration count
1000
1200
1400
0
0.5
1
1.5
2
iteration count
2.5
3
3.5
4
x 10
Figure 2.5: Comparison of residual and solution norms for CG and MINRES solving two spd
systems Ax = b with n = 82654 and 245874. Sometimes the solution norms take longer to
reach the limiting value kxk.
Top: The values of log10 krk k are plotted against iteration number k. Bottom: The values of
kxk k are plotted against k. Again the solution norms grow faster for CG.
30
CHAPTER 2. A TALE OF TWO ALGORITHMS
From (2.10), one can infer that if krkMINRES k decreases a lot between
iterations k−1 and k, then krkCG k would be roughly the same as krkMINRES k.
The converse also holds, in that krkCG k will be much larger than krkMINRES k
if MINRES is almost stalling at iteration k (i.e., krkMINRES k did not decrease much relative to the previous iteration). The above analysis was
pointed out by Titley-Peloquin [76] in comparing LSQR and LSMR. We
repeat the analysis here for CG vs MINRES and extend it to demonstrate
why there is a lag in general for large problems.
With a residual-based stopping rule, CG and MINRES stop when
krk k ≤ βkr0 k.
When CG and MINRES stop at iteration ℓ, we have
ℓ
Y
krℓ k
krk k
=
≈ β.
krk−1 k
kr0 k
k=1
MINRES
Thus on average, krkMINRES k/krk−1
k will be closer to 1 if ℓ is large.
This means that for systems that take more iterations to converge, CG
will lag behind MINRES more (a bigger gap between krkCG k and krkMINRES k).
2.3.2
I NDEFINITE SYSTEMS
A key part of Steihaug’s trust-region method for large-scale unconstrained optimization [68] (see also [15]) is his proof that when CG is
applied to a symmetric (possibly indefinite) system Ax = b, the solution norms kx1 k, . . . , kxk k are strictly increasing as long as pTjApj > 0
for all iterations 1 ≤ j ≤ k. (We are using the notation in Algorithm
1.3.)
From our proof of Theorem 2.1.5, we see that the same property
holds for CR and MINRES as long as both pTjApj > 0 and rjTArj > 0
for all iterations 1 ≤ j ≤ k. Since MINRES might be a useful solver
in the trust-region context, it is of interest now to offer some empirical
results about the behavior of kxk k when MINRES is applied to indefinite
systems.
First, on the nonsingular indefinite system

 

2 1 1
0

 

x
=
1 0 1
1 ,
1 1 2
1
(2.11)
2.3. NUMERICAL RESULTS
1.6
3
1.4
2.5
1.2
||r || / ||x ||
2
1.5
k
k
||x ||
k
1
0.8
1
0.6
0.5
0.4
0.2
1
1.5
2
Iteration number
2.5
3
0
1
1.5
2
Iteration number
2.5
3
Figure 2.6: For MINRES on indefinite problem (2.11), kxk k and the backward error krk k/kxk k are both slightly non-monotonic.
MINRES gives non-monotonic solution norms, as shown in the left plot
of Figure 2.6. The decrease in kxk k implies that the backward errors
krk k/kxk k may not be monotonic, as illustrated in the right plot.
More generally, we can gain an impression of the behavior of kxk k
by recalling from Choi et al. [14] the connection between MINRES and
M
MINRES-QLP. Both methods compute the iterates xM
k = Vk yk in (1.6)
from the subproblems
ykM = arg min kHk y − β1 e1 k
y∈Rk
and possibly
Tℓ yℓM = β1 e1 ,
where k = ℓ is the last iteration. When A is nonsingular or Ax = b
is consistent (which we now assume), ykM is uniquely defined for each
k ≤ ℓ and the methods compute the same iterates xM
k (but by different numerical methods). In fact they both compute the expanding QR
factorizations
#
"
i
h
R k tk
,
Q k Hk β 1 e 1 =
0 φk
(with Rk upper tridiagonal) and MINRES-QLP also computes the orthogonal factorizations Rk Pk = Lk (with Lk lower tridiagonal), from
which the kth solution estimate is defined by Wk = Vk Pk , Lk uk = tk ,
and xM
k = Wk uk . As shown in [14, §5.3], the construction of these quantities is such that the first k − 3 columns of Wk are the same as in Wk−1 ,
and the first k − 3 elements of uk are the same as in uk−1 . Since Wk
has orthonormal columns, kxM
k k = kuk k, where the first k − 2 elements
of uk are unaltered by later iterations. As shown in [14, §6.5], it means
that certain quantities can be cheaply updated to give norm estimates
31
32
CHAPTER 2. A TALE OF TWO ALGORITHMS
in the form
χ2 ← χ2 + µ̂2k−2 ,
2
2
2
2
kxM
k k = χ + µ̃k−1 + µ̄k ,
where it is clear that χ2 increases monotonically. Although the last two
2
terms are of unpredictable size, kxM
k k tends to be dominated by the
monotonic term χ2 and we can expect that kxM
k k will be approximately
monotonic as k increases from 1 to ℓ.
Experimentally we find that for most MINRES iterations on an indefinite problem, kxk k does increase. To obtain indefinite examples that
were sensibly scaled, we used the four spd (A, b) cases in Figures 2.4–
2.5, applied diagonal scaling as before, and solved (A − δI)x = b with
δ = 0.5 and where A and b are now scaled (so that diag(A) = I). The
number of iterations increased significantly but was limited to n.
Figure 2.7 shows log10 krk k and kxk k for the first two cases (where
A is the spd matrices in Figure 2.4). The values of kxk k are essentially
monotonic. The backward errors krk k/kxk k (not shown) were even
closer to being monotonic.
Figure 2.7 shows the values of kxk k for the first two cases (MINRES
applied to (A − δI)x = b, where A is the spd matrices used in Figure 2.4 and δ = 0.5 is large enough to make the systems indefinite).
The number of iterations increased significantly but again was limited
to n. These are typical examples in which kxk k is monotonic as in the
spd case.
Figure 2.8 shows kxk k and log10 krk k for the second two cases (where
A is the spd matrices in Figure 2.5). The left example reveals a definite period of decrease in kxk k. Nevertheless, during the n = 82654
iterations, kxk k increased 83% of the time and the backward errors
krk k/kxk k decreased 91% of the time. The right example is more like
those in Figure 2.8. During n = 245874 iterations, kxk k increased 83%
of the time, the backward errors krk k/kxk k decreased 91% of the time,
and any nonmonotonicity was very slight.
2.4
S UMMARY
Our experimental results here provide empirical evidence that MINRES
can often stop much sooner than CG on spd systems when the stopping
rule is based on backward error norms krk k/kxk k (or the more general
norms in (2.8)–(2.9)). On the other hand, CG generates iterates xk with
smaller kx∗ − xk kA and kx∗ − xk k, and is recommended in applications
33
2.4. SUMMARY
Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=39
Simon_olafu, Dim:16146x16146, nnz:1015156, id=32
0
0
−0.5
−0.5
−1
−1
log||r||
log||r||
−1.5
−2
−1.5
−2.5
−2
−3
−3.5
0
2000
4000
6000
8000
10000
iteration count
12000
14000
16000
−2.5
18000
Simon_olafu, Dim:16146x16146, nnz:1015156, id=32
0
500
1000
1500
2000
2500
iteration count
3000
3500
4000
4500
Cannizzo_sts4098, Dim:4098x4098, nnz:72356, id=39
9
15
8
7
6
10
||x||
||x||
5
4
3
5
2
1
0
0
2000
4000
6000
8000
10000
iteration count
12000
14000
16000
18000
0
0
500
1000
1500
2000
2500
iteration count
3000
3500
4000
4500
Figure 2.7: Residual norms and solution norms when MINRES is applied to two indefinite
systems (A − δI)x = b, where A is the spd matrices used in Figure 2.4 (n = 16146 and 4098)
and δ = 0.5 is large enough to make the systems indefinite.
Top: The values of log10 krk k are plotted against iteration number k for the first n iterations.
Bottom left: The values of kxk k are plotted against k. During the n = 16146 iterations, kxk k
increased 83% of the time and the backward errors krk k/kxk k (not shown here) decreased
96% of the time.
Bottom right: During the n = 4098 iterations, kxk k increased 90% of the time and the backward errors krk k/kxk k (not shown here) decreased 98% of the time.
34
CHAPTER 2. A TALE OF TWO ALGORITHMS
Schmid_thermal1, Dim:82654x82654, nnz:574458, id=40
BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=48
0
0
−0.5
−0.5
−1
−1
log||r||
log||r||
−1.5
−1.5
−2
−2
−2.5
−2.5
−3
−3.5
0
1
2
3
4
5
iteration count
6
7
8
−3
9
0
350
350
300
300
250
250
||x||
||x||
400
200
150
100
100
50
50
2
3
4
5
iteration count
6
7
2
2.5
5
x 10
200
150
1
1.5
BenElechi_BenElechi1, Dim:245874x245874, nnz:13150496, id=48
400
0
1
iteration count
Schmid_thermal1, Dim:82654x82654, nnz:574458, id=40
0
0.5
4
x 10
8
9
4
x 10
0
0
0.5
1
1.5
iteration count
2
2.5
5
x 10
Figure 2.8: Residual norms and solution norms when MINRES is applied to two indefinite
systems (A−δI)x = b, where A is the spd matrices used in Figure 2.5 (n = 82654 and 245874)
and δ = 0.5 is large enough to make the systems indefinite.
Top: The values of log10 krk k are plotted against iteration number k for the first n iterations.
Bottom left: The values of kxk k are plotted against k. There is a mild but clear decrease
in kxk k over an interval of about 1000 iterations. During the n = 82654 iterations, kxk k
increased 83% of the time and the backward errors krk k/kxk k (not shown here) decreased
91% of the time.
Bottom right: The solution norms and backward errors are essentially monotonic. During
the n = 245874 iterations, kxk k increased 88% of the time and the backward errors krk k/kxk k
(not shown here) decreased 95% of the time.
2.4. SUMMARY
where these quantities should be minimized.
For full-rank least-squares problems min kAx − bk, the solver LSQR
[53; 54] is equivalent to CG on the (spd) normal equation ATAx = ATb.
This suggests that a MINRES-based algorithm for least-squares may
share the same advantage as MINRES for symmetric systems, especially
in the case of early termination. LSMR is designed on such a basis. The
derivation of LSMR is presented in Chapter 3. Numerical experiments
in Chapter 4 show that LSMR has more good properties than our original expectation.
Theorem 2.1.5 shows that MINRES shares a known property of CG:
that kxk k increases monotonically when A is spd. This implies that
kxk k is monotonic for LSMR (as conjectured in [24]), and suggests that
MINRES might be a useful alternative to CG in the context of trustregion methods for optimization.
35
3
LSMR
We present a numerical method called LSMR for computing a solution
x to the following problems:
Unsymmetric equations:
Linear least squares (LS):
Regularized least squares:
minimize kxk2 subject to Ax = b
minimize kAx
−!bk2
!
A
b x−
minimize λI
0 2
where A ∈ Rm×n , b ∈ Rm , and λ ≥ 0, with m ≤ n or m ≥ n. The matrix
A is used as an operator for which products of the form Av and ATu
can be computed for various v and u. (If A is symmetric or Hermitian
and λ = 0, MINRES [52] or MINRES-QLP [14] are applicable.)
LSMR is similar in style to the well known method LSQR [53; 54] in
being based on the Golub-Kahan bidiagonalization of A [29]. LSQR is
equivalent to the conjugate-gradient (CG) method applied to the normal equation (ATA + λ2 I)x = ATb. It has the property of reducing krk k
monotonically, where rk = b − Axk is the residual for the approximate
solution xk . (For simplicity, we are letting λ = 0.) In contrast, LSMR is
equivalent to MINRES applied to the normal equation, so that the quantities kATrk k are monotonically decreasing. We have also proved that
krk k is monotonically decreasing, and in practice it is never very far
behind the corresponding value for LSQR. Hence, although LSQR and
LSMR ultimately converge to similar points, it is safer to use LSMR in
situations where the solver must be terminated early.
Stopping conditions are typically based on backward error: the norm
of some perturbation to A for which the current iterate xk solves the
perturbed problem exactly. Experiments on many sparse LS test problems show that for LSMR, a certain cheaply computable backward error
for each xk is close to the optimal (smallest possible) backward error.
This is an unexpected but highly desirable advantage.
O VERVIEW
Section 3.1 derives the basic LSMR algorithm with λ = 0. Section 3.2
derives various norms and stopping criteria. Section 3.3.2 discusses
36
3.1. DERIVATION OF LSMR
singular systems. Section 3.4 compares the complexity of LSQR and
LSMR. Section 3.5 derives the LSMR algorithm with λ ≥ 0. Section 3.6
proves one of the main lemmas.
3.1
D ERIVATION OF LSMR
We begin with the Golub-Kahan
process Bidiag(A, b) [29], an iterative procedure for transforming b A to upper-bidiagonal form β1 e1 Bk .
The process was introduced in 1.4.1. Since it is central to the development of LSMR, we will restate it here in more detail.
3.1.1
T HE G OLUB -K AHAN PROCESS
1. Set β1 u1 = b (that is, β1 = kbk, u1 = b/β1 ) and α1 v1 = ATu1 .
2. For k = 1, 2, . . . , set
βk+1 uk+1 = Avk − αk uk
(3.1)
αk+1 vk+1 = ATuk+1 − βk+1 vk .
After k steps, we have
AVk = Uk+1 Bk
and
ATUk+1 = Vk+1 LTk+1 ,
where we define Vk = v1 v2 . . .
and


α1



 β2 α2




.. ..
Bk = 
,
.
.



βk
αk 


βk+1
v k , U k = u1
Lk+1 = Bk
u2
...
uk ,
αk+1 ek+1 .
Now consider
T
T
A AVk = A Uk+1 Bk =
Vk+1 LTk+1 Bk
!
= Vk+1
BkT
αk+1 eTk+1
= Vk+1
BkTBk
αk+1 βk+1 eTk
Bk
!
.
(3.2)
37
38
CHAPTER 3. LSMR
This is equivalent to what would be generated by the symmetric Lancthis reason we de- zos process with matrix ATA and starting vector ATb,1 and the columns
fine β̄k ≡ αk βk below.
of Vk lie in the Krylov subspace Kk (ATA, ATb).
1 For
3.1.2
U SING G OLUB -K AHAN TO SOLVE THE NORMAL EQUATION
Krylov subspace methods for solving linear equations form solution
estimates xk = Vk yk for some yk , where the columns of Vk are an exand panding set of theoretically independent vectors.2
2 In this case, V
k
also Uk are theoretically orthonormal.
For the equation ATAx = ATb, any solution x has the property of
minimizing krk, where r = b − Ax is the corresponding residual vector. Thus, in the development of LSQR it was natural to choose yk to
minimize krk k at each stage. Since
rk = b − AVk yk = β1 u1 − Uk+1 Bk yk = Uk+1 (β1 e1 − Bk yk ),
(3.3)
where Uk+1 is theoretically orthonormal, the subproblem minyk kβ1 e1 −
Bk yk k easily arose. In contrast, for LSMR we wish to minimize kATrk k.
Let β̄k ≡ αk βk for all k. Since ATrk = ATb−ATAxk = β1 α1 v1 −ATAVk yk ,
from (3.2) we have
T
A rk = β̄1 v1 − Vk+1
BkTBk
αk+1 βk+1 eTk
!
yk = Vk+1
BkTBk
β̄k+1 eTk
β̄1 e1 −
!
yk
!
and we are led to the subproblem
min kA rk k = min β̄1 e1 −
yk yk
T
BkTBk
β̄k+1 eTk
!
yk .
(3.4)
Efficient solution of this LS subproblem is the heart of algorithm LSMR.
3.1.3
T WO QR FACTORIZATIONS
As in LSQR, we form the QR factorization
Qk+1 Bk =
Rk
0
!
,

ρ1



Rk = 


θ2
ρ2
..
.
..
.




.

θk 
ρk
(3.5)
,
3.1. DERIVATION OF LSMR
If we define tk = Rk yk and solve RkTqk = β̄k+1 ek , we have qk =
(β̄k+1 /ρk )ek = ϕk ek with ρk = (Rk )kk and ϕk ≡ β̄k+1 /ρk . Then we perform a second QR factorization


ρ̄1 θ̄2


!
!
.


ρ̄2 . .
R̄k
zk
RkT
β̄1 e1


,
R̄
=
=
Q̄k+1

 . (3.6)
k
T
.

0 ζ̄k+1
ϕk e k
0
. . θ̄ 

k
ρ̄k
Combining what we have with (3.4) gives
! ! RkTRk
RkT
T
y
min kA rk k = min β̄1 e1 −
=
min
β̄
e
−
t
k
1 1
k
yk
yk tk qkT Rk
ϕk eTk
! !
z
R̄k
k
= min tk . (3.7)
−
tk ζ̄k+1
0
The subproblem is solved by choosing tk from R̄k tk = zk .3
3.1.4
R ECURRENCE FOR xk
Let Wk and W̄k be computed by forward substitution from RkT WkT =
VkT and R̄kT W̄kT = WkT . Then from xk = Vk yk , Rk yk = tk , and R̄k tk =
zk , we have x0 ≡ 0 and
xk = Wk Rk yk = Wk tk = W̄k R̄k tk = W̄k zk = xk−1 + ζk w̄k .
3.1.5
R ECURRENCE FOR Wk AND W̄k
If we write
Vk = v 1 v 2 · · · v k ,
W̄k = w̄1 w̄2 · · · w̄k ,
Wk = w 1 w2 · · · wk ,
T
zk = ζ 1 ζ 2 · · · ζ k ,
an important fact is that when k increases to k + 1, all quantities remain
the same except for one additional term.
The first QR factorization proceeds as follows. At iteration k we
construct a plane rotation operating on rows l and l + 1:

Il−1


Pl = 

cl
−sl

sl
cl
Ik−l


.

39
3 Since every element of
tk changes in each iteration,
it is never constructed explicitly. Instead, the recurrences derived in the following sections are used.
40
CHAPTER 3. LSMR
Now if Qk+1 = Pk . . . P2 P1 , we have
Bk
Qk+1 Bk+1 = Qk+1
Qk+2 Bk+1

Rk

= Pk+1  0
αk+1 ek+1
βk+2
!

θk+1 ek

ᾱk+1  ,
βk+2

θk+1 ek

ρk+1  ,
0

Rk

= 0
 
Rk
θk+1 ek
 
ᾱk+1  =  0
0
βk+2
and we see that θk+1 = sk αk+1 = (βk+1 /ρk )αk+1 = β̄k+1 /ρk = ϕk .
Therefore we can write θk+1 instead of ϕk .
For the second QR factorization, if Q̄k+1 = P̄k . . . P̄2 P̄1 we know
that
!
!
RkT
R̄k
,
=
Q̄k+1
0
θk+1 eTk
and so
Q̄k+2
T
Rk+1
θk+2 eTk+1
!

R̄k

= P̄k+1 
 
θ̄k+1 ek
R̄k
 
c̄k ρk+1  = 
θk+2

θ̄k+1 ek

ρ̄k+1  . (3.8)
0
T
T
T
By considering the last row of the matrix equation Rk+1
Wk+1
= Vk+1
T
T
T
and the last row of R̄k+1
W̄k+1
= Wk+1
we obtain equations that define
wk+1 and w̄k+1 :
T
T
= vk+1
,
θk+1 wkT + ρk+1 wk+1
T
T
θ̄k+1 w̄kT + ρ̄k+1 w̄k+1
= wk+1
.
3.1.6
T HE TWO ROTATIONS
To summarize, the rotations Pk and P̄k have the following effects on
our computation:
c̄k
−s̄k
!
!
α¯k
ck sk
=
βk+1 αk+1
−sk ck
!
!
ζ̄k
c̄k−1 ρk
s̄k
=
θk+1
ρk+1
c̄k
!
ρk
0
θk+1
ᾱk+1
ρ̄k
0
θ̄k+1
c̄k ρk+1
ζk
ζ̄k+1
!
.
3.1. DERIVATION OF LSMR
Algorithm 3.1 Algorithm LSMR
1: (Initialize)
β1 u1 = b α1 v1 = ATu1
c̄0 = 1
s̄0 = 0
2:
3:
ᾱ1 = α1
h1 = v 1
ζ̄1 = α1 β1
h̄0 = 0
ρ0 = 1 ρ̄0 = 1
x0 = 0
for k = 1, 2, 3 . . . do
(Continue the bidiagonalization)
βk+1 uk+1 = Avk − αk uk ,
αk+1 vk+1 = ATuk+1 − βk+1 vk
(Construct and apply rotation Pk )
4:
2
ρk = ᾱk2 + βk+1
θk+1 = sk αk+1
12
ck = ᾱk /ρk
sk = βk+1 /ρk
(3.9)
(3.10)
ᾱk+1 = ck αk+1
(Construct and apply rotation P̄k )
5:
θ̄k = s̄k−1 ρk
2
ρ̄k = (c̄k−1 ρk )2 + θk+1
c̄k = c̄k−1 ρk /ρ̄k
s̄k = θk+1 /ρ̄k
ζk = c̄k ζ̄k
ζ̄k+1 = −s̄k ζ̄k
12
(3.11)
(3.12)
(Update h, h̄ x)
6:
h̄k = hk − (θ̄k ρk /(ρk−1 ρ̄k−1 ))h̄k−1
xk = xk−1 + (ζk /(ρk ρ̄k ))h̄k
hk+1 = vk+1 − (θk+1 /ρk )hk
7:
end for
3.1.7
S PEEDING UP FORWARD SUBSTITUTION
The forward substitutions for computing w and w̄ can be made more
efficient if we define hk = ρk wk and h̄k = ρk ρ̄k w̄k . We then obtain the
updates described in part 6 of the pseudo-code below.
3.1.8
A LGORITHM LSMR
Algorithm 3.1 summarizes the main steps of LSMR for solving Ax ≈ b,
excluding the norms and stopping rules developed later.
41
42
CHAPTER 3. LSMR
3.2
N ORMS AND STOPPING RULES
For any numerical computation, it is impossible to obtain a result with
less relative uncertainty than the input data. Thus, in solving linear
systems with iterative methods, it is important to know when we have
arrived at a solution with the best level of accuracy permitted, in order
to terminate before we waste computational effort.
Sections 3.2.1 and 3.2.2 discusses various criteria used in stopping
LSMR at the appropriate number of iterations. Sections 3.2.3, 3.2.4, 3.2.5
and 3.2.6 derive efficient ways to compute krk k, kAT rk k, kxk k and estimate kAk and cond(A) for the stopping criteria to be implemented
efficiently. All quantities require O(1) computation each iteration.
3.2.1
S TOPPING CRITERIA
With exact arithmetic, the Golub-Kahan process terminates when either
αk+1 = 0 or βk+1 = 0. For certain data b, this could happen in practice
when k is small (but is unlikely later because of rounding error). We
show that LSMR will have solved the problem at that point and should
therefore terminate.
When αk+1 = 0, with the expression of kAT rk k from section 3.2.4,
we have
−1
kAT rk k = |ζ̄k+1 | = |s̄k ζ̄k | = |θk+1 ρ̄−1
k ζ̄k | = |sk αk+1 ρ̄k ζ̄k | = 0,
where (3.12), (3.11), (3.10) are used. Thus, a least-squares solution has
been obtained.
When βk+1 = 0, we have
sk = βk+1 ρ−1
k = 0.
(from (3.9))
(3.13)
β̈k+1 = −sk β̈k = 0.
(from (3.19), (3.13))
β̃k − s̃k (−1)k s(k) ck+1 β1
β̇k = c̃−1
(from (3.30))
k
(3.14)
= c̃−1
k β̃k
=
=
ρ̇−1
k ρ̃k β̃k
ρ̇−1
k ρ̃k τ̃k
= τ̇k .
(from (3.13))
(from (3.20))
(from Lemma 3.2.1)
(from (3.26), (3.27))
(3.15)
By using (3.21) (derived in Section 3.2.3), we conclude that krk k = 0
from (3.15) and (3.14), so the system is consistent and Axk = b.
3.2. NORMS AND STOPPING RULES
3.2.2
P RACTICAL STOPPING CRITERIA
For LSMR we use the same stopping rules as LSQR [53], involving dimensionless quantities ATOL, BTOL, CONLIM:
S1: Stop if krk k ≤ BTOLkbk + ATOLkAkkxk k
S2: Stop if kAT rk k ≤ ATOLkAkkrk k
S3: Stop if cond(A) ≥ CONLIM
S1 applies to consistent systems, allowing for uncertainty in A and b
[39, Theorem 7.1]. S2 applies to inconsistent systems and comes from
Stewart’s backward error estimate kE2 k assuming uncertainty in A; see
Section 4.1.1. S3 applies to any system.
3.2.3
C OMPUTING krk k
We transform R̄kT to upper-bidiagonal form using a third QR factorizaek = Q
e k R̄T with Q
e k = Pek−1 . . . Pe1 . This amounts to one addition: R
k
tional rotation per iteration. Now let
e k tk ,
t̃k = Q
!
ek
Q
b̃k =
1
(3.16)
Qk+1 e1 β1 .
Then from (3.3), rk = Uk+1 (β1 e1 − Bk yk ) gives
rk = Uk+1
β1 e1 −
QTk+1
= Uk+1
β1 e1 − QTk+1
= Uk+1
QTk+1
= Uk+1 QTk+1
eT
Q
k
eT
Q
k
1
!
! !
Rk
yk
0
!!
tk
0
!
1
b̃k −
QTk+1
b̃k −
t̃k
0
!!
e T t̃k
Q
k
0
!!
.
Therefore, assuming orthogonality of Uk+1 , we have
krk k = b̃k −
t̃k
0
!
.
(3.17)
43
44
CHAPTER 3. LSMR
Algorithm 3.2 Computing krk k in LSMR
1: (Initialize)
β̈1 = β1
2:
3:
β̇0 = 0
ρ̇0 = 1
τ̃−1 = 0
β̈k+1 = −sk β̈k
c̃k−1 = ρ̇k−1 /ρ̃k−1
12
θ̃k = s̃k−1 ρ̄k
β̃k−1 = c̃k−1 β̇k−1 + s̃k−1 β̂k
ρ̇k = c̃k−1 ρ̄k
β̇k = −s̃k−1 β̇k−1 + c̃k−1 β̂k
(Update t̃k by forward substitution)
τ̇k = (ζk − θ̃k τ̃k−1 )/ρ̇k
(Form krk k)
2
γ = (β̇k − τ̇k )2 + β̈k+1
,
7:
(3.20)
s̃k−1 = θ̄k /ρ̃k−1
τ̃k−1 = (ζk−1 − θ̃k−1 τ̃k−2 )/ρ̃k−1
6:
(3.19)
(If k ≥ 2, construct and apply rotation Pek−1 )
ρ̃k−1 = ρ̇2k−1 + θ̄k2
5:
ζ0 = 0
for the kth iteration do
(Apply rotation Pk )
β̂k = ck β̈k
4:
θ̃0 = 0
krk k =
√
γ
(3.21)
end for
The vectors b̃k and t̃k can be written in the form
T
b̃k = β̃1 · · · β̃k−1 β̇k β̈k+1
T
t̃k = τ̃1 · · · τ̃k−1 τ̇k .
(3.18)
eT t̃k = zk .
The vector t̃k is computed by forward substitution from R
k
Lemma 3.2.1. In (3.17)–(3.18), β̃i = τ̃i for i = 1, . . . , k − 1.
Proof. Section 3.6 proves the lemma by induction.
Using this lemma we can estimate krk k from just the last two elements of b̃k and the last element of t̃k , as shown in (3.21).
Algorithm 3.2 summarizes how krk k may be obtained from quantities arising from the first and third QR factorizations.
3.2. NORMS AND STOPPING RULES
3.2.4
C OMPUTING kATrk k
From (3.7), we have kATrk k = |ζ̄k+1 |.
3.2.5
C OMPUTING kxk k
From Section 3.1.4 we have xk = Vk Rk−1 R̄k−1 zk . From the third QR
e k R̄T = R
ek in Section 3.2.3 and a fourth QR factorization
factorization Q
k
T
e
Q̂k (Qk Rk ) = R̂k we can write
e Tk z̃k
xk = Vk Rk−1 R̄k−1 zk = Vk Rk−1 R̄k−1 R̄k Q
T
T
eT Q
e
= Vk Rk−1 Q
k k Rk Q̂k ẑk = Vk Q̂k ẑk ,
eT z̃k = zk and
where z̃k and ẑk are defined by forward substitutions R
k
T
R̂k ẑk = z̃k . Assuming orthogonality of Vk we arrive at the estimate
ek and the bottom 2 × 2
kxk k = kẑk k. Since only the last diagonal of R
part of R̂k change each iteration, this estimate of kxk k can again be
updated cheaply. The pseudo-code, omitted here, can be derived as in
Section 3.2.3.
3.2.6
E STIMATES OF kAk AND COND(A)
It is known that the singular values of Bk are interlaced by those of A
and are bounded above and below by the largest and smallest nonzero
singular values of A [53]. Therefore we can estimate kAk and cond(A)
by kBk k and cond(Bk ) respectively. Considering the Frobenius norm
of Bk , we have the recurrence relation
2
kBk+1 k2F = kBk k2F + αk2 + βk+1
.
From (3.5)–(3.6) and (3.8), we can show that the following QLP factorization [71] holds:
Qk+1 Bk Q̄Tk
=
T
R̄k−1
θ̄k eTk−1
c̄k−1 ρk
!
(the same as R̄kT except for the last diagonal). Since the singular values of Bk are approximated by the diagonal elements of that lowerbidiagonal matrix [71], and since the diagonals are all positive, we
can estimate cond(A) by the ratio of the largest and smallest values
in {ρ̄1 , . . . , ρ̄k−1 , c̄k−1 ρk }. Those values can be updated cheaply.
45
46
CHAPTER 3. LSMR
3.3
LSMR P ROPERTIES
With x∗ denoting the pseudoinverse solution of min kAx − bk, we have
the following theorems on the norms of various quantities for LSMR.
3.3.1
M ONOTONICITY OF NORMS
A number of monotonic properties for LSQR follow directly from the
corresponding properties for CG in Section 2.1.1. We list them here
from completeness.
Theorem 3.3.1. kxk k is strictly increasing for LSQR.
Proof. LSQR on min kAx − bk is equivalent to CG on ATAx = ATb. By
Theorem 2.1.1, kxk k is strictly increasing.
Theorem 3.3.2. kx∗ − xk k is strictly decreasing for LSQR.
Proof. This follows from Theorem 2.1.2 for CG.
Theorem 3.3.3. kx∗ − xk kATA = kA(x∗ − xk )k = kr∗ − rk k is strictly
decreasing for LSQR.
Proof. This follows from Theorem 2.1.3 for CG.
We also have the characterizing property for LSQR [53].
Theorem 3.3.4. krk k is strictly decreasing for LSQR.
Next, we prove a number of monotonic properties for LSMR. We
would like to emphasize that LSMR has all the above monotonic properties that LSQR enjoys. In addition, kATrk k is monotonic for LSMR.
This gives LSMR a much smoother convergence behavior in terms of
the Stewart backward error, as shown in Figure 4.2.
Theorem 3.3.5. kATrk k is monotonically decreasing for LSMR.
Proof. From Section 3.2.4 and (3.12), kATrk k = |ζ̄k+1 | = |s̄k ||ζ̄k | ≤ |ζ̄k | =
kATrk−1 k.
Theorem 3.3.6. kxk k is strictly increasing for LSMR on min kAx − bk when
A has full column rank.
Proof. LSMR on min kAx − bk is equivalent to MINRES on ATAx = ATb.
When A has full column rank, ATA is symmetric positive definite. By
Theorem 2.1.6, kxk k is strictly increasing.
3.3. LSMR PROPERTIES
Algorithm 3.3 Algorithm CRLS
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
x0 = 0, r̄0 = ATb, s0 = ATAr̄0 , ρ0 = r̄0Ts0 , p0 = r̄0 , q0 = s0
for k = 1, 2, . . . do
αk = ρk−1 /kqk−1 k2
xk = xk−1 + αk pk−1
r̄k = r̄k−1 − αk qk−1
sk = ATAr̄k
ρk = r̄kTsk
βk = ρk /ρk−1
pk = r̄k + βk pk−1
qk = sk + βk qk−1
end for
Theorem 3.3.7. The error kx∗ − xk k is strictly decreasing for LSMR on
min kAx − bk when A has full column rank.
Proof. This follows directly from Theorem 2.1.7 for MINRES.
Theorem 3.3.8. kx∗ − xk kATA = kr∗ − rk k is strictly decreasing for LSMR
on min kAx − bk when A has full column rank.
Proof. This follows directly from Theorem 2.1.8 for MINRES.
Since LSMR is equivalent to MINRES on the normal equation, and CR
is a non-Lanczos equivalent of MINRES, we can apply CR to the normal
equation to derive a non-Lanczos equivalent of LSMR, which we will
call CRLS. We start from CR (Algorithm 1.3) and apply the substitutions
A → ATA, b → ATb. Since rk in CR would correspond to ATrk in CRLS,
we rename rk to r̄k in the algorithm to avoid confusion. With these
substitutions, we arrive at Algorithm 3.3.
In CRLS, the residual rk = b − Axk is not computed. However, we
know that r0 = b0 and that rk satisfies the recurrence relation:
rk = b − Axk = b − Axk−1 − αk Apk−1 = rk−1 − αk Apk−1 .
(3.22)
Also, as mentioned, we define
r̄k = ATrk .
(3.23)
Let ℓ denote the iteration at which LSMR terminates; i.e. ATrℓ = r̄ℓ = 0
and xℓ = x∗ . Then Theorem 2.1.4 for CR translates to the following.
47
48
CHAPTER 3. LSMR
Theorem 3.3.9. These properties hold for Algorithm CRLS:
(a) qiT qj = 0
(b)
riT qj
=0
(0 ≤ i, j ≤ ℓ − 1, i 6= j)
(0 ≤ i, j ≤ ℓ − 1, i ≥ j + 1)
(c) r̄i 6= 0 ⇒ pi 6= 0
(0 ≤ i ≤ ℓ − 1)
Also, Theorem 2.1.5 for CR translates to the following.
Theorem 3.3.10. These properties hold for Algorithm CRLS on a least-squares
system min kAx − bk when A has full column rank:
(a) αi > 0
(i = 1, . . . , ℓ)
(b) βi > 0
βℓ = 0
(i = 1, . . . , ℓ − 1)
(c) pTi qj > 0
(0 ≤ i, j ≤ ℓ − 1)
(d) pTi pj > 0
(0 ≤ i, j ≤ ℓ − 1)
(e)
xTi pj
>0
(f) r̄iT pj > 0
(1 ≤ i ≤ ℓ, 0 ≤ j ≤ ℓ − 1)
(0 ≤ i ≤ ℓ − 1, 0 ≤ j ≤ ℓ − 1)
Theorem 3.3.11. For CRLS (and hence LSMR) on min kAx − bk when A has
full column rank, krk k is strictly decreasing.
Proof.
T
krk−1 k2 − krk k2 = rk−1
rk−1 − rkT rk
= 2αk rkT Apk−1 + αk2 pTk−1 AT Apk−1
T
T
= 2αk (A rk ) pk−1 +
> 2αk (ATrk )T pk−1
αk2 kApk−1 k2
by (3.22)
by Thm 3.3.9 (c)
and Thm 3.3.10 (a)
=
2αk r̄kT pk−1
by (3.23)
≥ 0,
where the last inequality follows from Theorem 3.3.10 (a) and (f), and
is strict except at k = ℓ. The strict inequality on line 4 requires the fact
that A has full column rank. Therefore we have krk−1 k > krk k.
3.3.2
C HARACTERISTICS OF THE SOLUTION ON SINGULAR SYSTEMS
The least-squares problem min kAx − bk has a unique solution when A
has full column rank. If A does not have full column rank, infinitely
3.3. LSMR PROPERTIES
many points x give the minimum value of kAx − bk. In particular, the
normal equation ATAx = ATb is singular but consistent. We show that
LSQR and LSMR both give the minimum-norm LS solution. That is, they
both solve the optimization problem min kxk2 such that ATAx = ATb.
Let N(A) and R(A) denote the nullspace and range of a matrix A.
Lemma 3.3.12. If A ∈ Rm×n and p ∈ Rn satisfy ATAp = 0, then p ∈ N(A).
Proof. ATAp = 0 ⇒ pTATAp = 0 ⇒ (Ap)TAp = 0 ⇒ Ap = 0.
Theorem 3.3.13. LSQR returns the minimum-norm solution.
= ATb, and any other
Proof. The final LSQR solution satisfies ATAxLSQR
ℓ
, the difference
solution xb satisfies ATAb
x = ATb. With p = xb − xLSQR
ℓ
between the two normal equations gives ATAp = 0, so that Ap = 0 by
Lemma 3.3.12. From α1 v1 = ATu1 and αk+1 vk+1 = ATuk+1 − βk+1 vk
(3.1), we have v1 , . . . , vℓ ∈ R(AT ). With Ap = 0, this implies pT Vℓ = 0,
so that
kb
xk22 − kxLSQR
k22 = kxLSQR
+ pk22 − kxLSQR
k22 = pTp + 2pT xLSQR
ℓ
ℓ
ℓ
ℓ
= pTp + 2pT Vℓ yℓLSQR
= pTp ≥ 0.
Corollary 3.3.14. LSMR returns the minimum-norm solution.
Proof. At convergence, αℓ+1 = 0 or βℓ+1 = 0. Thus β̄ℓ+1 = αℓ+1 βℓ+1 =
0, which means equation (3.4) becomes min kβ̄1 e1 − BℓTBℓ yℓ k and hence
BℓTBℓ yℓ = β̄1 e1 , since Bℓ has full rank. This is the normal equation for
min kBℓ yℓ − β1 e1 k, the same LS subproblem solved by LSQR. We conclude that at convergence, yℓ = yℓLSQR and thus xℓ = Vℓ yℓ = Vℓ yℓLSQR =
, and Theorem 3.3.13 applies.
xLSQR
ℓ
3.3.3
B ACKWARD ERROR
For completeness, we state a final desirable result about LSMR. The
Stewart backward error kATrk k/krk k for LSMR is always less than or
equal to that for LSQR. See Chapter 4, Theorem 4.1.1 for details.
49
50
CHAPTER 3. LSMR
Table 3.1 Storage and computational cost for various least-squares
methods
Storage
Work
m
n
m n
LSMR
LSQR
MINRES on ATAx = ATb
3.4
p = Av, u
p = Av, u
p = Av
x, v = ATu, h, h̄
3
x, v = ATu, w
3
x, v1 , v2 = ATp, w1 , w2 , w3
6
5
8
C OMPLEXITY
We compare the storage requirement and computational complexity
for LSMR and LSQR on Ax ≈ b and MINRES on the normal equation
ATAx = ATb. In Table 3.1, we list the vector storage needed (excluding
storage for A and b). Recall that A is m × n and for LS systems m
may be considerably larger than n. Av denotes the working storage for
matrix-vector products. Work represents the number of floating-point
multiplications required at each iteration.
From Table 3.1, we see that LSMR requires storage of one extra vector, and also n more scalar floating point multiplication when compared to LSQR. This difference is negligible compared to the cost of
performing Av and ATu multiplication for most problems. Thus, the
computational and storage complexity of LSMR is comparable to LSQR.
3.5
R EGULARIZED LEAST SQUARES
In this section we extend LSMR to the regularized LS problem
!
A
min x−
λI
where λ is a nonnegative scalar. If Ā =
T
T
2
Ā r̄k = A rk − λ xk = Vk+1
= Vk+1
β̄1 e1 −
β̄1 e1 −
!
b ,
0 (3.24)
2
A
λI
and r̄k = ( 0b ) − Āxk , then
BkTBk
β̄k+1 eTk
RkTRk
β̄k+1 eTk
!
!
yk − λ
yk
!
2
yk
0
!!
3.5. REGULARIZED LEAST SQUARES
and the rest of the main algorithm follows the same as in the unregularized case. In the last equality, Rk is defined by the QR factorization
Bk
λI
Q2k+1
!
Rk
0
=
!
,
Q2k+1 ≡ Pk P̂k . . . P2 P̂2 P1 P̂1 ,
where P̂l is a rotation operating on rows l and l + k + 1. The effects of
P̂1 and P1 are illustrated here:


α1

 β2

P̂1 


λ
3.5.1

α̂1
 
α2   β2
 

β3 
=
 
 0
λ

ρ1
 
α2  
 

β3 
=
 
 
λ


α̂1

 β2

P1 





α2 

β3 
,


λ

θ2

ᾱ2 

β3 
.


λ
E FFECTS ON kr̄k k
Introduction of regularization changes the residual norm as follows:
r̄k =
b
0
!
A
λI
−
!
xk =
u1
0
=
u1
0
=
=
=
=
with b̃k =
ek
Q
1
!
!
β1 −
AVk
λVk
!
yk
Uk+1 Bk
β1 −
λVk
!
Uk+1
Vk
Uk+1
Vk
Uk+1
Vk
Uk+1
Vk
!
!
!
!
yk
!
e1 β1 −
Bk
λI
e1 β1 −
QT2k+1
e1 β1 − QT2k+1
eT
Q
k
QT2k+1
1
yk
!
!
! !
Rk
yk
0
!!
tk
0
b̃k −
Q2k+1 e1 β1 , where we adopt the notation
b̃k = β̃1
···
β̃k−1
β̇k
β̈k+1
β̌1
···
β̌k
We conclude that
2
kr̄k k2 = β̌12 + · · · + β̌k2 + (β̇k − τk )2 + β̈k+1
.
T
.
t̃k
0
!!
51
52
CHAPTER 3. LSMR
The effect of regularization on the rotations is summarized as
ck
−sk
3.5.2
sk
ck
ĉk
−ŝk
!
ŝk
ĉk
α̂k
βk+1
!
β¨k
ᾱk
λ
β́k
αk+1
!
!
=
=
!
α̂k
β́k
β̌k
ρk
θk+1
ᾱk+1
β̂k
β̈k+1
!
.
P SEUDO - CODE FOR REGULARIZED LSMR
Algorithm 3.4 summarizes LSMR for solving the regularized problem
(3.24) with given λ. Our M ATLAB implementation is based on these
steps.
3.6
P ROOF OF L EMMA 3.2.1
The effects of the rotations Pk and Pek−1 can be summarized as

c̃k s̃k
−s̃k c̃k
!
ck sk
−sk ck
!


e
Rk = 


β̈k
0
ρ̇k−1
β̇k−1
θ̄k ρ̄k β̂k
!
!
ρ̃1
θ̃2
..
.
..
.
ρ̃k−1
!
=
β̂k
β̈k+1
=
ρ̃k−1 θ̃k
0
ρ̇k



,

θ̃k 
ρ̇k
,
β̃k−1
β̇k
!
,
where β̈1 = β1 , ρ̇1 = ρ̄1 , β̇1 = β̂1 and where ck , sk are defined in section 3.1.6.
We define s(k) = s1 . . . sk and s̄(k)
= s̄1 . . . s̄k . Then from (3.18) and
eT t̃k = zk = Ik 0 Q̄k+1 ek+1 β̄1 . Expanding this and
(3.6) we have R
k
(3.16) gives



ekT t̃k = 
R



c̄1

−s̄1 c̄2

 β̄1 ,
..


.
(−1)k+1 s̄(k−1) c̄k
b̃k =
ek
Q

c1



!
−s1 c2




..
 β1 ,

.

1 
(−1)k+1 s(k−1) c 

k
(−1)k+2 s(k)

3.6. PROOF OF LEMMA 3.2.1
Algorithm 3.4 Regularized LSMR (1)
1: (Initialize)
α1 v1 = ATu1
ᾱ1 = α1
ζ̄1 = α1 β1
ρ0 = 1
c̄0 = 1
s̄0 = 0
β̈1 = β1
β̇0 = 0
ρ̇0 = 1 τ̃−1 = 0
θ̃0 = 0
ζ0 = 0
d0 = 0
h1 = v 1
h̄0 = 0
β 1 u1 = b
2:
3:
αk+1 vk+1 = ATuk+1 − βk+1 vk
(Construct rotation P̂k )
α̂k = ᾱk2 + λ2
5:
12
ĉk = ᾱk /α̂k
θk+1 = sk αk+1
12
ck = α̂k /ρk
sk = βk+1 /ρk
ᾱk+1 = ck αk+1
(Construct and apply rotation P̄k )
θ̄k = s̄k−1 ρk
c̄k = c̄k−1 ρk /ρ̄k
ζk = c̄k ζ̄k
7:
ŝk = λ/α̂k
(Construct and apply rotation Pk )
2
ρk = α̂k2 + βk+1
6:
x0 = 0
for k = 1, 2, 3, . . . do
(Continue the bidiagonalization)
βk+1 uk+1 = Avk − αk uk
4:
ρ̄0 = 1
2
ρ̄k = (c̄k−1 ρk )2 + θk+1
s̄k = θk+1 /ρ̄k
12
ζ̄k+1 = −s̄k ζ̄k
(Update h̄, x, h)
h̄k = hk − (θ̄k ρk /(ρk−1 ρ̄k−1 ))h̄k−1
xk = xk−1 + (ζk /(ρk ρ̄k ))h̄k
hk+1 = vk+1 − (θk+1 /ρk )hk
8:
(Apply rotation P̂k , Pk )
β́k = ĉk β̈k
9:
10:
β̌k = −ŝk β̈k
β̈k+1 = −sk β́k
if k ≥ 2 then
(Construct and apply rotation Pek−1 )
ρ̃k−1 = ρ̇2k−1 + θ̄k2
c̃k−1 = ρ̇k−1 /ρ̃k−1
21
θ̃k = s̃k−1 ρ̄k
β̃k−1 = c̃k−1 β̇k−1 + s̃k−1 β̂k
11:
β̂k = ck β́k
end if
s̃k−1 = θ̄k /ρ̃k−1
ρ̇k = c̃k−1 ρ̄k
β̇k = −s̃k−1 β̇k−1 + c̃k−1 β̂k
53
54
CHAPTER 3. LSMR
Algorithm 3.5 Regularized LSMR (2)
12:
(Update t̃k by forward substitution)
τ̃k−1 = (ζk−1 − θ̃k−1 τ̃k−2 )/ρ̃k−1
13:
(Compute kr̄k k)
dk = dk−1 + β̌k2
14:
τ̇k = (ζk − θ̃k τ̃k−1 )/ρ̇k
2
γ = dk + (β̇k − τ̇k )2 + β̈k+1
kr̄k k =
√
γ
(Compute kĀTr̄k k, kxk k, estimate kĀk, cond(Ā))
kĀTr̄k k = |ζ̄k+1 | (section 3.2.4)
Compute kxk k (section 3.2.5)
Estimate σmax (Bk ), σmin (Bk ) and hence kĀk, cond(Ā) (section 3.2.6)
15:
16:
Terminate if any of the stopping criteria in Section 3.2.2 are satisfied.
end for
and we see that
τ̃1 = ρ̃−1
1 c̄1 β̄1
(3.25)
k (k−2)
τ̃k−1 = ρ̃−1
c̄k−1 β̄1 − θ̃k−1 τ̃k−2 )
k−1 ((−1) s̄
τ̇k =
k+1 (k−1)
ρ̇−1
s̄
c̄k β̄1
k ((−1)
− θ̃k τ̃k−1 ).
β̇1 = β̂1 = c1 β1
β̇k = −s̃k−1 β̇k−1 + c̃k−1 (−1)k−1 s(k−1) ck β1
k (k)
β̃k = c̃k β̇k + s̃k (−1) s
ck+1 β1 .
(3.26)
(3.27)
(3.28)
(3.29)
(3.30)
We want to show by induction that τ̃i = β̃i for all i. When i = 1,
β̃1 = c̃1 c1 β1 −s̃1 s1 c2 β1 =
β1
β1 α1 ρ21
β̄1 ρ1
β̄1
(c1 ρ̄1 −θ̄2 s1 c2 ) =
=
=
c̄1 = τ̃1
ρ̃1
ρ̃1 ρ1 ρ̄1
ρ̃1 ρ̄1
ρ̃1
where the third equality follows from the two lines below:
c1 α2
α2
α1
1
= ρ̄1 − θ̄2 s1
=
(ρ̄1 − θ̄2 s1 α2 )
ρ2
ρ2
ρ1
ρ2
θ2
ρ̄21 − θ22
ρ21 + θ22 − θ22
1
1
=
.
ρ̄1 − θ̄2 s1 α2 = ρ̄1 − (s̄1 ρ2 )θ2 = ρ̄1 − θ2 =
ρ2
ρ2
ρ̄1
ρ̄1
ρ̄1
c1 ρ̄1 − θ̄2 s1 c2 = c1 ρ̄1 − θ̄2 s1
3.6. PROOF OF LEMMA 3.2.1
Suppose τ̃k−1 = β̃k−1 . We consider the expression
c̄k−1 ρk (k−1)
(s
ck )c̄k−1 ρk β1
ρ̄k
θ2
θk
θ2 · · · θk α1 ρ1 · · · ρk−1
ρk β1 = c̄k · · ·
β̄1
= c̄k
ρ1 · · · ρk ρ̄1 · · · ρ̄k−1
ρ̄1
ρ̄k−1
2
2
s(k−1) ck ρ̄−1
k c̄k−1 ρk β1 =
= c̄k s̄1 · · · s̄k−1 β̄1 = c̄k s̄(k−1) β̄1 .
(3.31)
(−1)k+1 s̄(k−1) c̄k β̄1 − θ̃k τ̃k−1
Applying the induction hypothesis on τ̃k = ρ̃−1
k
gives
k+1 (k−1)
k (k−1)
(−1)
s̄
c̄
β̄
−
θ̃
τ̃k = ρ̃−1
c̃
β̇
+
s̃
(−1)
s
c
β
k
1
k
k−1
k−1
k−1
k
1
k
k+1 −1
= ρ̃−1
ρ̃k s̄(k−1) c̄k β̄1 − θ̃k s̃k−1 s(k−1) ck β1
k θ̃k c̃k−1 β̇k−1 + (−1)
k+1 −1 (k−1)
= ρ̃−1
ρ̃k s
β1 ρ̇k c̃k−1 ck − θ̄k+1 sk ck+1
k (ρ̄k s̃k−1 )c̃k−1 β̇k−1 + (−1)
= c̃k s̃k−1 β̇k−1 + (−1)k+1 s(k−1) β1 (c̃k c̃k−1 ck − s̃k sk ck+1 )
= c̃k −s̃k−1 β̇k−1 + c̃k−1 (−1)k+1 s(k−1) ck β1 + s̃k (−1)k+1 s(k) ck+1 β1
= c̃k β̇k + s̃k (−1)k+1 s(k) ck+1 β1 = β̃k
with the second equality obtained by the induction hypothesis, and the
fourth from
s̄(k−1) c̄k β̄1 − θ̃k s̃k−1 s(k−1) ck β1
2
2
(k−1)
= s(k−1) ck ρ̄−1
c k β1
k c̄k−1 ρk β1 − (s̃k−1 ρ̄k )s̃k−1 s
ck 2
c̄
ρ2 − s̃2k−1 ρ̄2k
= s(k−1) β1
ρ̄k k−1 k
= s(k−1) β1 ρ̇k c̃k−1 ck − θ̄k+1 sk ck+1 ,
where the first equality follows from (3.31) and the last from
2
c̄2k−1 ρ2k − s̃2k−1 ρ̄2k = ρ̄2k − θk+1
− s̃2k−1 ρ̄2k
2
2
= ρ̄2k (1 − s̃2k−1 ) − θk+1
= ρ̄2k c̃2k−1 − θk+1
,
ck 2 2
ρ̄ c̃
= ρ̄k c̃2k−1 ck = ρ̇k c̃k−1 ck ,
ρ̄k k k−1
θk+1
θk+1 ρk+1
ck
ck 2
θk+1 =
θk+1 ck =
sk αk+1
ρ̄k
ρ̄k
ρ̄k
ρk+1
= θ̄k+1 sk ck+1 .
By induction, we know that τ̃i = β̃i for i = 1, 2, . . . . From (3.18), we see
that at iteration k, the first k − 1 elements of b̃k and t̃k are equal.
55
4
LSMR EXPERIMENTS
In this chapter, we perform numerous experiments comparing the convergence of LSQR and LSMR. We discuss overdetermined systems first,
then some square examples, followed by underdetermined systems.
We also explore ways to speed up convergence using extra memory by
reorthogonalization.
4.1
4.1.1
L EAST- SQUARES PROBLEMS
B ACKWARD ERROR FOR LEAST- SQUARES
For inconsistent problems with uncertainty in A (but not b), let x be any
approximate solution. The normwise backward error for x measures the
perturbation to A that would make x an exact LS solution:
µ(x) ≡ min kEk s.t. (A + E)T (A + E)x = (A + E)T b.
E
(4.1)
It is known to be the smallest singular value of a certain m × (n + m)
matrix C; see Waldén et al. [83] and Higham [39, pp392–393]:
µ(x) = σmin (C),
h
C≡ A
krk
kxk
I−
rr T
krk2
i
.
(4.2)
Since it is generally too expensive to evaluate µ(x), we need to find
approximations.
A PPROXIMATE BACKWARD ERRORS E1 AND E2
In 1975, Stewart [69] discussed a particular backward error estimate
that we will call E1 . Let x∗ and r∗ = b−Ax∗ be the exact LS solution and
residual. Stewart showed that an approximate solution x with residual
r = b − Ax is the exact LS solution of the perturbed problem min kb −
(A + E1 )xk, where E1 is the rank-one matrix
E1 =
exT
,
kxk2
kE1 k =
56
kek
,
kxk
e ≡ r − r∗ ,
(4.3)
4.1. LEAST-SQUARES PROBLEMS
with krk2 = kr∗ k2 + kek2 . Soon after, Stewart [70] gave a further important result that can be used within any LS solver. The approximate
x and a certain vector r̃ = b − (A + E2 )x are the exact solution and
residual of the perturbed LS problem min kb − (A + E2 )xk, where
E2 = −
rrTA
,
krk2
kE2 k =
kATrk
,
krk
r = b − Ax.
(4.4)
LSQR and LSMR both compute kE2 k for each iterate xk because the
current krk k and kATrk k can be accurately estimated at almost no cost.
An added feature is that for both solvers, r̃ = b − (A + E2 )xk = rk
because E2 xk = 0 (assuming orthogonality of Vk ). That is, xk and rk
are theoretically exact for the perturbed LS problem (A + E2 )x ≈ b.
Stopping rule S2 (section 3.2.2) requires kE2 k ≤ ATOLkAk. Hence
the following property gives LSMR an advantage over LSQR for stopping early.
Theorem 4.1.1. kE2LSMR k ≤ kE2LSQR k.
Proof. This follows from kATrkLSMR k ≤ kATrkLSQR k and krkLSMR k ≥ krkLSQR k.
A PPROXIMATE OPTIMAL BACKWARD ERROR µ
e(x)
Various authors have derived expressions for a quantity µ
e(x) that has
proved to be a very accurate approximation to µ(x) in (4.1) when x is
at least moderately close to the exact solution xb. Grcar, Saunders, and
Su [73; 34] show that µ
e(x) can be obtained from a full-rank LS problem
as follows:

K=
A
krk
kxk I

,
 
r
v =  ,
0
min kKy−vk,
y
µ
e(x) = kKyk/kxk,
(4.5)
and give M ATLAB Code 4.1 for computing the “economy size” sparse
QR factorization K = QR and c ≡ QTv (for which kck = kKyk) and
thence µ
e(x). In our experiments we use this script to compute µ
e(xk ) for
each LSQR and LSMR iterate xk . We refer to this as the optimal backward
error for xk because it is provably very close to the true µ(xk ) [32].
57
58
CHAPTER 4. LSMR EXPERIMENTS
M ATLAB Code 4.1 Approximate optimal backward error
[m,n]
= size(A);
2
r
= b - A*x;
3
normx
= norm(x);
4
eta
= norm(r)/normx;
5
p
= colamd(A);
6
K
= [A(:,p); eta*speye(n)];
7
v
= [r; zeros(n,1)];
8
[c,R]
= qr(K,v,0);
9
mutilde = norm(c)/normx;
1
D ATA
For test examples, we have drawn from the University of Florida Sparse
Matrix Collection (Davis [18]). Matrices from the LPnetlib group and
the NYPA group are used for our numerical experiments.
The LPnetlib group provides data for 138 linear programming problems of widely varying origin, structure, and size. The constraint matrix and objective function may be used to define a sparse LS problem
min kAx−bk. Each example was downloaded in M ATLAB format, and a
sparse matrix A and dense vector b were extracted from the data structure via A = (Problem.A)’ and b = Problem.c (where ’ denotes transpose). Five examples had b = 0, and a further six gave ATb = 0. The
remaining 127 problems had up to 243000 rows, 10000 columns, and
1.4M
nonzeros
in A. Diagonal scaling was applied to the columns of
h
i
A b to give a scaled problem min kAx − bk in which the columns of
A (and also b) have unit 2-norm. LSQR and LSMR were run on each of
the 127 scaled problems with stopping tolerance ATOL = 10−8 , gener} and {xLSMR
ating sequences of approximate solutions {xLSQR
}. The itk
k
eration indices k are omitted below. The associated residual vectors are
denoted by r without ambiguity, and x∗ is the solution to the LS problem, or the minimum-norm solution to the LS problem if the system
is singular. This set of artificially created least-squares test problems
provides a wide variety of size and structure for evaluation of the two
algorithms. They should be indicative of what we could expect when
using iterative methods to estimate the dual variables if the linear programs were modified to have a nonlinear objective function (such as
P
the negative entropy function xj log xj ).
The NYPA group provides data for 8 rank-deficient least-squares problems from the New York Power Authority. Each problem provides a
4.1. LEAST-SQUARES PROBLEMS
M ATLAB Code 4.2 Right diagonal preconditioning
1
% scale the column norms to 1
2
cnorms = sqrt(sum(A.*A,1));
3
D = diag(sparse(1./cnorms));
4
A = A*D;
matrix Problem.A and a right-hand side vector Problem.b. Two of the
problems are underdetermined. For the remaining 6 problems we compared the convergence of LSQR and LSMR on min kAx − bk with stopping tolerance ATOL = 10−8 . This set of problems contains matrices
with condition number ranging from 3.1E+02 to 5.8E+11.
P RECONDITIONING
For this set of test problems, we apply a right diagonal preconditioning
that scales the columns of A to unit 2-norm. (For least-squares systems,
a left preconditioner will alter the least-squares solution.) The preconditioning is implemented with M ATLAB Code 4.2.
4.1.2
N UMERICAL RESULTS
Observations for the LPnetlib group:
1. krLSQR k is monotonic by design. krLSMR k is also monotonic (as
predicted by Theorem 3.3.11) and nearly as small as krLSQR k for
all iterations on almost all problems. Figure 4.1 shows a typical
example and a rare case.
2. kxk is monotonic for LSQR (Theorem 3.3.1) and for LSMR (Theorem 3.3.6). With krk monotonic for LSQR and for LSMR, kE1 k
in (4.3) is likely to appear monotonic for both solvers. Although
kE1 k is not normally available for each iteration, it provides a
benchmark for kE2 k.
3. kE2LSQR k is not monotonic, but kE2LSMR k appears monotonic almost always. Figure 4.2 shows a typical case. The sole exception
for this observation is also shown.
4. Note that Benbow [5] has given numerical results comparing a
generalized form of LSQR with application of MINRES to the corresponding normal equation. The curves in [5, Figure 3] show the
irregular and smooth behavior of LSQR and MINRES respectively
59
60
CHAPTER 4. LSMR EXPERIMENTS
Name:lp greenbeb, Dim:5598x2392, nnz:31070, id=631
Name:lp woodw, Dim:8418x1098, nnz:37487, id=702
1
1
LSQR
LSMR
LSQR
LSMR
0.9
0.998
0.996
0.7
||r||
||r||
0.8
0.994
0.6
0.992
0.5
0.99
0.4
0
50
100
150
iteration count
200
250
300
0.988
0
10
20
30
40
50
iteration count
60
70
80
90
Figure 4.1: For most iterations, krLSMR k is monotonic and nearly as small as krLSQR k. Left: A
typical case (problem lp_greenbeb). Right: A rare case (problem lp_woodw). LSMR’s residual
norm is significantly larger than LSQR’s during early iterations.
in terms of kATrk k. Those curves are effectively a preview of the
left-hand plots in Figure 4.2 (where LSMR serves as our more reliable implementation of MINRES).
5. kE1LSQR k ≤ kE2LSQR k often, but not so for LSMR. Some examples
e(xk ), the accurate estimate
are shown on Figure 4.3 along with µ
(4.5) of the optimal backward error for each point xk .
6. kE2LSMR k ≈ µ
e(xLSMR ) almost always. Figure 4.4 shows a typical
example and a rare case. In all such “rare” cases, kE1LSMR k ≈
µ
e(xLSMR ) instead!
7. µ
e(xLSQR ) is not always monotonic. µ
e(xLSMR ) does seem to be
monotonic. Figure 4.5 gives examples.
8. µ
e(xLSMR ) ≤ µ
e(xLSQR ) almost always. Figure 4.6 gives examples.
9. The errors kx∗ − xLSQR k and kx∗ − xLSMR k decrease monotonically
(Theorem 3.3.2 and 3.3.7), with the LSQR error typically smaller
than for LSMR. Figure 4.7 gives examples. This is one property
for which LSQR seems more desirable (and it has been suggested
[57] that for LS problems, LSQR could be terminated when rule S2
would terminate LSMR).
61
4.1. LEAST-SQUARES PROBLEMS
Name:lp sc205, Dim:317x205, nnz:665, id=665
Name:lp pilot ja, Dim:2267x940, nnz:14977, id=657
0
0
LSQR
LSMR
LSQR
LSMR
−1
−1
−2
−2
log(E2)
log(E2)
−3
−3
−4
−5
−4
−6
−5
−7
−6
0
100
200
300
400
500
600
iteration count
700
800
900
1000
−8
0
20
40
60
iteration count
80
100
120
Figure 4.2: For most iterations, kE2LSMR k appears to be monotonic (but kE2LSQR k is not). Left:
A typical case (problem lp_pilot_ja). LSMR is likely to terminate much sooner than LSMR
(see Theorem 4.1.1). Right: Sole exception (problem lp_sc205) at iterations 54–67. The exception remains even if Uk and/or Vk are reorthogonalized.
For every problem in the NYPA group, both solvers satisfied the stopping condition in fewer than 2n iterations. Much greater fluctuations
are observed in kE2LSQR k than kE2LSMR k. Figure 4.8 shows the convergence of kE2 k for two problems. Maragal_5 has the largest condition
number in the group, while Maragal_7 has the largest dimensions. kE2LSMR k
converges with small fluctuations, while kE2LSQR k fluctuates by as much
as 5 orders of magnitude.
We should note that when cond(A) ≥ 108 , we cannnot expect any
solver to compute a solution with more than about 1 digit of accuracy.
The results for problem Maragal_5 are therefore a little difficult to interpret, but they illustrate the fortunate fact that LSQR and LSMR’s estimates of kATrk k/krk k do converge toward zero (really kAkǫ), even if
the computed vectors ATrk are unlikely to become so small.
4.1.3
E FFECTS OF PRECONDITIONING
The numerical results in the LPnetlib test set are generated with every
matrix A diagonally preconditioned (i.e., the column norms are scaled
to be 1). Before preconditioning, the condition numbers range from
2.9E+00 to 7.2E+12. With preconditioning, they range from 2.7E+00 to
3.4E+08. The condition numbers before and after preconditioning are
shown in Figure 4.9.
62
CHAPTER 4. LSMR EXPERIMENTS
Name:lp pilot, Dim:4860x1441, nnz:44375, id=654
Name:lp cre a, Dim:7248x3516, nnz:18168, id=609
0
E2
E1
Optimal
0
−1
−1
log(Backward Error for LSQR)
−2
log(Backward Error for LSQR)
E2
E1
Optimal
−3
−4
−5
−2
−3
−4
−5
−6
−6
−7
−8
0
200
400
600
800
iteration count
1000
1200
1400
−7
1600
0
300
iteration count
400
500
0
E2
E1
Optimal
600
E2
E1
Optimal
−1
−1
log(Backward Error for LSMR)
−2
log(Backward Error for LSMR)
200
Name:lp pilot, Dim:4860x1441, nnz:44375, id=654
Name:lp cre a, Dim:7248x3516, nnz:18168, id=609
0
100
−3
−4
−5
−2
−3
−4
−5
−6
−6
−7
−8
0
200
400
600
800
iteration count
1000
1200
1400
1600
−7
0
100
200
300
iteration count
400
500
600
Figure 4.3: kE1 k, kE2 k, and µ
e(xk ) for LSQR (top figures) and LSMR (bottom figures). Top left:
A typical case. kE1LSQR k is close to the optimal backward error, but the computable kE2LSQR k
is not. Top right: A rare case in which kE2LSQR k is close to optimal. Bottom left: kE1LSMR k and
kE2LSMR k are often both close to the optimal backward error. Bottom right: kE1LSMR k is far
from optimal, but the computable kE2LSMR k is almost always close (too close to distinguish
in the plot!). Problems lp_cre_a (left) and lp_pilot (right).
63
4.1. LEAST-SQUARES PROBLEMS
Name:lp ship12l, Dim:5533x1151, nnz:16276, id=688
Name:lp ken 11, Dim:21349x14694, nnz:49058, id=638
0
0
E2
E1
Optimal
−1
−2
log(Backward Error for LSMR)
−2
log(Backward Error for LSMR)
E2
E1
Optimal
−1
−3
−4
−5
−3
−4
−5
−6
−6
−7
−7
−8
−8
0
50
100
150
200
−9
250
0
10
20
30
40
50
iteration count
iteration count
60
70
80
90
Figure 4.4: Again, kE2LSMR k ≈ µ
e(xLSMR ) almost always (the computable backward error estimate is essentially optimal). Left: A typical case (problem lp_ken_11). Right: A rare case
(problem lp_ship12l). Here, kE1LSMR k ≈ µ
e(xLSMR )!
Name:lp cre c, Dim:6411x3068, nnz:15977, id=611
Name:lp maros, Dim:1966x846, nnz:10137, id=642
0
0
LSQR
LSMR
LSQR
LSMR
−0.5
−1
−1
−2
log(Optimal Backward Error)
log(Optimal Backward Error)
−1.5
−2
−2.5
−3
−3
−4
−5
−3.5
−6
−4
−7
−4.5
−5
0
100
200
300
400
500
iteration count
600
700
800
900
−8
0
200
400
600
800
iteration count
1000
1200
1400
1600
Figure 4.5: µ
e(xLSMR ) seems to be always monotonic, but µ
e(xLSQR ) is usually not. Left: A
typical case for both LSQR and LSMR (problem lp_maros). Right: A rare case for LSQR, typical
for LSMR (problem lp_cre_c).
Figure 4.6: µ̃(x^LSMR) ≤ µ̃(x^LSQR) almost always. Left: a typical case (problem lp_pilot). Right: a rare case (problem lp_standgub).
Figure 4.7: The errors ‖x* − x^LSQR‖ and ‖x* − x^LSMR‖ seem to decrease monotonically, with LSQR's errors smaller than LSMR's. Left: a nonsingular LS system (problem lp_ship12l). Right: a singular system (problem lp_pds_02). LSQR and LSMR both converge to the minimum-norm LS solution.
Figure 4.8: Convergence of ‖E2‖ for two problems in the NYPA group using LSQR and LSMR.
Upper: problem Maragal_5. Left: no preconditioning applied, cond(A) = 5.8E+11; if the iteration limit had been n iterations, the final LSQR point would be very poor. Right: right diagonal preconditioning applied, cond(A) = 2.6E+12.
Lower: problem Maragal_7. Left: no preconditioning applied, cond(A) = 1.4E+03. Right: right diagonal preconditioning applied, cond(A) = 4.2E+02.
The peaks for LSQR (where it would be undesirable for LSQR to terminate) correspond to plateaus for LSMR where ‖E2‖ remains the smallest value so far, except for slight increases near the end of the LSQR peaks.
Figure 4.9: Distribution of condition number for the LPnetlib matrices (cond(A) on a log scale against matrix ID, original and preconditioned). Diagonal preconditioning reduces the condition number in 117 out of 127 cases.
To illustrate the effect of this preconditioning on the convergence
speed of LSQR and LSMR, we solve each problem min kAx − bk from
the LPnetlib set using the two algorithms and summarize the results
in Table 4.1. Both algorithms use stopping rule S2 in Section 3.2.2 with
ATOL=1E-8, or a limit of 10n iterations.
In Table 4.1, we see that for systems that converge quickly, the advantage gained by using LSMR compared with LSQR is relatively small.
For example, lp_osa_60 (Row 127) is a 243246 × 10280 matrix. LSQR
converges in 124 iterations while LSMR converges in 122 iterations. In
contrast, for systems that take many iterations to converge, LSMR usually converges much faster than LSQR. For example, lp_cre_d (Row
122) is a 73948 × 8926 matrix. LSQR takes 19496 iterations, while LSMR
takes 9259 iterations (53% fewer).
Table 4.1: Effects of diagonal preconditioning on LPnetlib
matrices†† and convergence of LSQR and LSMR on min kAx − bk
ID name
1 720 lpi_itest6
2 706 lpi_bgprtr
3 597 lp_afiro
4 731 lpi_woodinfe
5 667 lp_sc50b
6 666 lp_sc50a
7 714 lpi_forest6
8 636 lp_kb2
9 664 lp_sc105
10 596 lp_adlittle
11 713 lpi_ex73a
12 669 lp_scagr7
13 712 lpi_ex72a
14 695 lp_stocfor1
15 603 lp_blend
16 707 lpi_box1
17 665 lp_sc205
18 663 lp_recipe
19 682 lp_share2b
20 700 lp_vtp_base
21 641 lp_lotfi
22 681 lp_share1b ‡
23 724 lpi_mondou2
24 711 lpi_cplex2
25 606 lp_bore3d
26 673 lp_scorpion
27 709 lpi_chemcom
28 729 lpi_refinery
29 727 lpi_qual
30 730 lpi_vol1
31 703 lpi_bgdbg1
32 668 lp_scagr25
33 678 lp_sctap1
34 608 lp_capri
35 607 lp_brandy
36 635 lp_israel
37 629 lp_gfrd_pnc
38 601 lp_bandm
39 621 lp_etamacro
40 704 lpi_bgetam
41 728 lpi_reactor
42 634 lp_grow7
43 670 lp_scfxm1
44 623 lp_finnis
45 620 lp_e226
46 598 lp_agg
47 725 lpi_pang
48 692 lp_standata
49 674 lp_scrs8
50 693 lp_standgub
51 602 lp_beaconfd
52 683 lp_shell
53 694 lp_standmps
54 691 lp_stair
55 617 lp_degen2
56 685 lp_ship04s
57 699 lp_tuff
58 599 lp_agg2
59 600 lp_agg3
60 655 lp_pilot4
61 726 lpi_pilot4i
62 671 lp_scfxm2
63 604 lp_bnl1
64 632 lp_grow15
65 653 lp_perold
66 684 lp_ship04l
67 622 lp_fffff800
Continued on next page. . .
m
17
40
51
89
78
78
131
68
163
138
211
185
215
165
114
261
317
204
162
346
366
253
604
378
334
466
744
464
464
464
629
671
660
482
303
316
1160
472
816
816
808
301
600
1064
472
615
741
1274
1275
1383
295
1777
1274
614
757
1506
628
758
758
1123
1123
1200
1586
645
1506
2166
1028
n
11
20
27
35
50
50
66
43
105
56
193
129
197
117
74
231
205
91
96
198
153
117
312
224
233
388
288
323
323
323
348
471
300
271
220
174
616
305
400
400
318
140
330
497
223
488
361
359
490
361
173
536
467
356
444
402
333
516
516
410
410
660
643
300
625
402
524
nnz σi = 0∗
29
0
70
0
102
0
140
0
148
0
160
0
246
0
313
0
340
0
424
0
457
5
465
0
467
5
501
0
522
0
651
17
665
0
687
0
777
0
1051
0
1136
0
1179
0
1208
1
1215
1
1448
2
1534
30
1590
0
1626
0
1646
0
1646
0
1662
0
1725
0
1872
0
1896
0
2202
27
2443
0
2445
0
2494
0
2537
0
2537
0
2591
0
2612
0
2732
0
2760
0
2768
0
2862
0
2933
0
3230
0
3288
0
3338
1
3408
0
3558
1
3878
0
4003
0
4201
2
4400
42
4561
31
4740
0
4756
0
5264
0
5264
0
5469
0
5532
1
5620
0
6148
0
6380
42
6401
0
Original A
Cond(A) kLSQR† kLSMR†
1.5E+02
5
5
1.2E+03
21
21
1.1E+01
22
22
2.9E+00
17
17
1.7E+01
41
41
1.2E+01
38
38
3.1E+00
21
21
5.1E+04
150
147
3.7E+01
68
68
4.6E+02
61
61
4.5E+01
99
96
6.0E+01
80
80
5.6E+01
101
98
7.3E+03
107
105
9.2E+02
186
186
4.6E+01
28
28
1.3E+02
122
122
2.1E+04
4
4
1.2E+04
516
510
3.6E+07
557
556
6.6E+05
149
146
1.0E+05
1170
1170
4.2E+01
151
147
4.6E+02
105
102
4.5E+04
782
681
7.4E+01
159
152
6.9E+00
40
39
5.1E+04
2684
1811
4.8E+04
2689
1828
4.8E+04
2689
1828
9.5E+00
44
43
5.9E+01
155
147
1.7E+02
364
338
8.1E+03
1194
1051
6.4E+03
665
539
4.8E+03
351
325
1.6E+05
84
73
3.8E+03
2155
1854
6.0E+04
568
186
5.3E+04
536
186
1.4E+05
85
85
5.2E+00
31
30
2.4E+04
1368
1231
1.1E+03
328
327
9.1E+03
591
555
6.2E+02
159
154
2.9E+05
350
247
2.2E+03
144
141
9.4E+04
1911
1803
2.2E+03
144
141
4.4E+03
254
254
4.2E+01
88
85
4.2E+03
286
255
5.7E+01
122
115
9.9E+01
264
250
1.1E+02
103
100
1.1E+06
1021
1013
5.9E+02
184
175
5.9E+02
219
208
4.2E+05
1945
1379
4.2E+05
2081
1357
2.4E+04
2154
1575
3.4E+03
1394
1253
5.6E+00
35
35
5.1E+05
5922
3173
1.1E+02
77
76
1.5E+10
2064
1161
Diagonally preconditioned A
Cond(A) kLSQR† kLSMR†
8.4E+01
7
7
1.2E+01
19
19
4.9E+00
21
21
2.8E+00
17
17
9.1E+00
36
36
7.6E+00
34
34
2.8E+00
19
19
7.8E+02
128
128
2.2E+01
58
58
2.5E+01
39
39
3.5E+01
85
84
1.7E+01
60
59
4.5E+01
89
88
1.8E+03
263
238
1.1E+02
118
118
2.1E+01
24
24
7.6E+01
102
101
5.4E+02
4
4
7.5E+02
331
328
1.9E+04
1370
1312
1.4E+03
386
386
6.2E+02
482
427
2.5E+01
119
116
9.9E+01
81
80
1.2E+02
265
263
3.4E+01
116
115
5.0E+00
35
35
6.2E+01
113
113
6.3E+01
121
120
6.3E+01
121
120
1.1E+01
53
52
1.7E+01
97
95
1.8E+02
334
327
3.8E+02
453
451
1.0E+02
208
208
9.2E+03
782
720
3.7E+04
270
210
4.6E+01
196
187
9.2E+01
171
162
9.2E+01
171
162
5.0E+03
151
149
4.4E+00
28
28
1.4E+03
547
470
7.7E+01
279
275
3.0E+03
504
437
1.1E+01
35
35
3.7E+01
125
111
6.6E+02
140
139
1.4E+02
356
338
6.6E+02
140
139
2.1E+01
64
63
1.1E+01
43
42
6.6E+02
201
201
3.4E+01
95
94
3.5E+01
151
146
5.1E+01
75
75
8.2E+02
648
642
5.2E+00
31
31
5.1E+00
32
31
1.2E+02
195
190
1.2E+02
195
191
3.1E+03
975
834
1.4E+02
285
278
4.9E+00
33
32
4.6E+02
706
619
6.1E+01
67
67
1.2E+06
5240
5240
68
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
CHAPTER 4. LSMR EXPERIMENTS
ID
628
687
662
679
690
672
633
689
637
658
696
680
625
642
594
614
710
686
659
624
657
605
611
688
649
609
717
708
613
595
618
630
631
718
615
619
702
616
660
654
638
627
650
705
701
697
656
661
639
716
651
626
645
652
612
610
646
640
647
648
∗
††
†
‡
¶
name
lp_ganges
lp_ship08s
lp_qap8
lp_sctap2
lp_sierra
lp_scfxm3
lp_grow22
lp_ship12s
lp_ken_07
lp_pilot_we
lp_stocfor2
lp_sctap3
lp_fit1p
lp_maros ‡
lp_25fv47
lp_czprob
lpi_cplex1
lp_ship08l
lp_pilotnov ¶
lp_fit1d
lp_pilot_ja
lp_bnl2
lp_cre_c
lp_ship12l
lp_pds_02
lp_cre_a
lpi_gran ‡
lpi_ceria3d
lp_cycle ‡
lp_80bau3b
lp_degen3
lp_greenbea
lp_greenbeb
lpi_greenbea
lp_d2q06c ‡
lp_dfl001
lp_woodw
lp_d6cube
lp_qap12
lp_pilot
lp_ken_11
lp_fit2p
lp_pds_06
lpi_bgindy
lp_wood1p
lp_stocfor3
lp_pilot87
lp_qap15
lp_ken_13
lpi_gosh
lp_pds_10
lp_fit2d
lp_osa_07
lp_pds_20
lp_cre_d
lp_cre_b
lp_osa_14
lp_ken_18
lp_osa_30
lp_osa_60
m
1706
2467
1632
2500
2735
1800
946
2869
3602
2928
3045
3340
1677
1966
1876
3562
5224
4363
2446
1049
2267
4486
6411
5533
7716
7248
2525
4400
3371
12061
2604
5598
5598
5596
5831
12230
8418
6184
8856
4860
21349
13525
29351
10880
2595
23541
6680
22275
42659
13455
49932
10524
25067
108175
73948
77137
54797
154699
104374
243246
n
1309
778
912
1090
1227
990
440
1151
2426
722
2157
1480
627
846
821
929
3005
778
975
24
940
2324
3068
1151
2953
3516
2658
3576
1903
2262
1503
2392
2392
2393
2171
6071
1098
415
3192
1441
14694
3000
9881
2671
244
16675
2030
6330
28632
3792
16558
25
1118
33874
8926
9648
2337
105127
4350
10280
nnz σi = 0∗
6937
0
7194
66
7296
170
7334
0
8001
10
8206
0
8252
0
8284
109
8404
49
9265
0
9357
0
9734
0
9868
0
10137
0
10705
1
10708
0
10947
0
12882
66
13331
0
13427
0
14977
0
14996
0
15977
87
16276
109
16571
11
18168
93
20111
586
21178
0
21234
28
23264
0
25432
2
31070
3
31070
3
31074
3
33081
0
35632
13
37487
0
37704
11
38304
398
44375
0
49058
121
50284
0
63220
11
66266
0
70216
1
72721
0
74949
0
94950
632
97246
169
99953
2
107605
11
129042
0
144812
0
232647
87
246614
2458
260785
2416
317097
0
358171
324
604488
0
1408073
0
Original A
Cond(A) kLSQR† kLSMR†
2.1E+04
219
216
1.6E+02
169
169
1.9E+01
8
8
1.8E+02
639
585
5.0E+09
87
74
2.4E+04
2085
1644
5.7E+00
39
38
8.7E+01
135
134
1.3E+02
168
168
5.3E+05
5900
3503
2.8E+04
430
407
1.8E+02
683
618
6.8E+03
81
81
1.9E+06
8460
7934
3.3E+03
5443
4403
8.8E+03
114
110
1.7E+04
89
79
1.6E+02
123
123
3.6E+09
164
343
4.7E+03
61
61
2.5E+08
7424
950
7.8E+03
1906
1333
1.6E+04
20109
12067
1.1E+02
106
104
4.0E+02
129
124
2.1E+04
20196
11219
7.2E+12
26580
20159
7.3E+02
57
56
1.5E+07
19030
19030
5.7E+02
119
111
8.3E+02
1019
969
4.4E+03
2342
2062
4.4E+03
2342
2062
4.4E+03
2148
1860
1.4E+05
21710
15553
3.5E+02
937
848
4.7E+04
557
553
1.1E+03
174
169
3.9E+01
8
8
2.7E+03
1392
1094
4.6E+02
498
491
4.7E+03
73
73
5.4E+01
197
183
6.7E+02
366
358
1.6E+04
53
53
4.5E+05
832
801
8.1E+03
896
751
5.5E+01
8
8
4.5E+02
471
462
5.6E+04
3236
1138
5.6E+02
223
208
1.7E+03
55
55
1.9E+03
105
105
7.3E+01
323
283
9.9E+03
19496
9259
6.8E+03
14761
7720
9.9E+02
120
120
1.2E+03
999
957
6.0E+03
116
115
2.1E+03
124
122
Diagonally preconditioned A
Cond(A) kLSQR† kLSMR†
3.3E+03
161
160
4.9E+01
116
116
6.6E+01
8
8
1.7E+02
450
415
1.0E+05
146
146
1.4E+03
1121
994
5.0E+00
35
35
4.2E+01
89
88
3.8E+01
98
98
6.1E+02
442
246
2.6E+03
1546
1421
1.7E+02
503
465
1.9E+04
500
427
1.8E+04
6074
3886
2.0E+02
702
571
2.9E+01
29
29
1.7E+02
53
53
6.5E+01
103
103
1.4E+03
1622
1180
2.4E+01
28
28
1.5E+03
1653
1272
2.6E+02
452
390
4.6E+02
1553
1333
5.9E+01
82
81
1.2E+01
69
67
4.9E+02
1591
1375
3.4E+08
22413
11257
2.3E+02
224
213
2.7E+04
2911
2349
6.9E+00
43
42
2.5E+02
448
414
4.3E+01
277
251
4.3E+01
277
251
4.3E+01
239
221
4.8E+02
1825
1548
1.0E+02
363
353
3.3E+01
81
81
2.5E+01
52
52
3.9E+01
8
8
5.0E+02
592
484
7.8E+01
220
207
5.0E+04
3276
1796
1.7E+01
100
97
1.1E+03
377
356
1.4E+01
25
25
3.6E+03
3442
3096
5.7E+02
297
170
5.6E+01
8
8
7.4E+01
205
204
4.2E+03
3629
1379
1.8E+01
120
115
2.8E+01
29
29
2.4E+03
72
72
3.3E+01
177
165
2.7E+02
1218
1069
1.9E+02
1112
966
8.8E+02
73
73
1.6E+02
422
398
1.2E+03
77
77
2.3E+04
82
82
∗  Number of columns of A that are not independent.
††  We use A = problem.A'; b = problem.c; to construct the least-squares problem min ‖Ax − b‖.
†  The number of iterations that LSQR or LSMR takes to converge with a tolerance of 10⁻⁸.
‡  For problems lp_maros, lpi_gran, and lp_d2q06c, LSQR hits the 10n iteration limit without preconditioning. For problems lp_share1b and lp_cycle, both LSQR and LSMR hit the 10n iteration limit without preconditioning. Thus the number of iterations these five problems take to converge does not represent the relative improvement provided by LSMR.
¶  Problem lp_pilotnov is compatible (‖r_k‖ → 0). Therefore LSQR exhibits faster convergence than LSMR. More examples of compatible systems are given in Sections 4.2 and 4.3.
MATLAB Code 4.3 Generating preconditioners by perturbation of QR
% delta is the chosen standard deviation of Gaussian noise
randn('state',1);
R = qr(A);
[I J S] = find(R);
Sp = S.*(1+delta*randn(length(S),1));
M = sparse(I,J,Sp);
Diagonal preconditioning almost always reduces the condition number of A. For most of the examples, it also reduces the number of iterations that LSQR and LSMR need to converge. With diagonal preconditioning, the condition number of lp_cre_d is reduced from 9.9E+03 to 2.7E+02; the number of iterations for LSQR drops to 1218 and that for LSMR to 1069 (12% fewer than LSQR). Since the preconditioned system needs fewer iterations, there is less advantage in using LSMR in this case. (This phenomenon can be explained by (4.7) in the next section.)
To further illustrate the effect of preconditioning, we construct a
sequence of increasingly better preconditioners and investigate their
effect on the convergence of LSQR and LSMR. The preconditioners are
constructed by first performing a sparse QR factorization A = QR, and
then adding Gaussian random noise to the nonzeros of R. For a given
noise level δ, we use M ATLAB Code 4.3 to generate the preconditioner.
Figure 4.10 illustrates the convergence of LSQR and LSMR on problem lp_d2q06c (cond(A) = 1.4E+05) with a number of preconditioners.
We have a total of 5 options:
• No preconditioner
• Diagonal preconditioner from M ATLAB Code 4.2
• Preconditioner from M ATLAB Code 4.3 with δ = 0.1
• Preconditioner from M ATLAB Code 4.3 with δ = 0.01
• Preconditioner from M ATLAB Code 4.3 with δ = 0.001
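A right preconditioner M produced by any of these options can be used with LSQR or LSMR through operator handles. The following is our own sketch (MATLAB's built-in lsqr stands in for the thesis solvers, and M is assumed square, n-by-n, and nonsingular, e.g. the leading n-by-n block of the perturbed R):

function x = precond_lsqr_sketch(A, b, M, tol, maxit)
% Right-preconditioned least-squares solve (our own sketch):
%   min || A*inv(M)*y - b ||,  then  x = M \ y.
% MATLAB's built-in lsqr is used here purely for illustration.
  afun = @(z,flag) apply_op(z, flag, A, M);
  y = lsqr(afun, b, tol, maxit);
  x = M\y;                       % recover the solution of the original problem
end

function w = apply_op(z, flag, A, M)
  if strcmp(flag,'notransp')
    w = A*(M\z);                 % (A*M^{-1})*z
  else
    w = M'\(A'*z);               % (A*M^{-1})'*z
  end
end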
From the plots in Figure 4.10, we see that when no preconditioner is
applied, both algorithms exhibit very slow convergence and LSQR hits
the 10n iteration limit. The backward error for LSQR lags behind LSMR
by at least 1 order of magnitude at the beginning, and the gaps widen
to 2 orders of magnitude toward 10n iterations. The backward error for
LSQR fluctuates significantly across all iterations.
Figure 4.10: Convergence of LSQR and LSMR with increasingly good preconditioners on problem lp_d2q06c (5831x2171, nnz 33081). log‖E2‖ is plotted against iteration number. LSMR shows an advantage until the preconditioner is almost exact. Top: no preconditioner and diagonal preconditioner. Bottom: exact preconditioner perturbed with noise levels of 10%, 1%, and 0.1%.
When diagonal preconditioning is applied, both algorithms take
less than n iterations to converge. The backward errors for LSQR lag
behind LSMR by 1 order of magnitude. There is also much less fluctuation in the LSQR backward error compared to the unpreconditioned
case.
For the increasingly better preconditioners constructed with δ = 0.1,
0.01 and 0.001, we see that the number of iterations to convergence
decreases rapidly. With better preconditioners, we also see that the
gap between the backward errors for LSQR and LSMR becomes smaller.
With an almost perfect preconditioner (δ = 0.001), the backward error
for LSQR becomes almost the same as that for LSMR at each iteration.
This phenomenon can be explained by (4.7) in the next section.
4.1.4  WHY DOES ‖E2‖ FOR LSQR LAG BEHIND LSMR?
David Titley-Peloquin, in joint work with Serge Gratton and Pavel Jiranek [76], has performed extensive analysis of the convergence behavior of LSQR and LSMR for least-squares problems. These results are unpublished at the time of writing. We summarize two key insights from their work to give a more complete picture of how the two algorithms behave.
The residuals ‖r_k^LSQR‖, ‖r_k^LSMR‖ and the normal-equation residuals ‖Aᵀr_k^LSQR‖, ‖Aᵀr_k^LSMR‖ satisfy the following relations [76], [35, Lemma 5.4.1]:
\[
\|r_k^{\mathrm{LSMR}}\|^2 = \|r_k^{\mathrm{LSQR}}\|^2
  + \sum_{j=0}^{k-1} \frac{\|A^T r_k^{\mathrm{LSMR}}\|^4}{\|A^T r_j^{\mathrm{LSMR}}\|^4}
  \bigl(\|r_j^{\mathrm{LSQR}}\|^2 - \|r_{j+1}^{\mathrm{LSQR}}\|^2\bigr), \tag{4.6}
\]
\[
\|A^T r_k^{\mathrm{LSQR}}\| =
  \frac{\|A^T r_k^{\mathrm{LSMR}}\|}
       {\sqrt{1 - \|A^T r_k^{\mathrm{LSMR}}\|^2 / \|A^T r_{k-1}^{\mathrm{LSMR}}\|^2}}. \tag{4.7}
\]
From (4.6), one can infer that ‖r_k^LSMR‖ is much larger than ‖r_k^LSQR‖ only if both
• ‖r_{j+1}^LSQR‖² ≪ ‖r_j^LSQR‖² for some j < k, and
• ‖Aᵀr_j^LSMR‖⁴ ≈ ‖Aᵀr_{j+1}^LSMR‖⁴ ≈ · · · ≈ ‖Aᵀr_k^LSMR‖⁴
hold, which is very unlikely in view of the fourth power [76]. This explains our observation in Figure 4.1 that ‖r_k^LSMR‖ rarely lags behind ‖r_k^LSQR‖. In cases where ‖r_k^LSMR‖ lags behind in the early iterations, it catches up very quickly.
From (4.7), one can infer that if kATrkLSMR k decreases a lot between
iterations k − 1 and k, then kATrkLSQR k would be roughly the same as
kATrkLSMR k. The converse also holds, in that kATrkLSQR k will be much
larger than kATrkLSMR k if LSMR is almost stalling at iteration k (i.e.,
kATrkLSMR k did not decrease much relative to the previous iteration)
[76]. This explains the peaks and plateaus observed in Figure 4.8.
We have a further insight about the difference between LSQR and LSMR on least-squares problems that take many iterations. Both solvers stop when
\[
\|A^T r_k\| \le \mathrm{ATOL}\,\|A\|\,\|r_k\|.
\]
MATLAB Code 4.4 Criteria for selecting square systems
ids = find(index.nrows > 100000        & ...
           index.nrows < 200000        & ...
           index.nrows == index.ncols  & ...
           index.isReal == 1           & ...
           index.posdef == 0           & ...
           index.numerical_symmetry < 1);
Since ‖r*‖ is often O(‖r₀‖) for least-squares problems, and it is also safe to assume ‖Aᵀr₀‖/(‖A‖‖r₀‖) = O(1), we know that they will stop at iteration l, where
\[
\prod_{k=1}^{l} \frac{\|A^T r_k\|}{\|A^T r_{k-1}\|}
  = \frac{\|A^T r_l\|}{\|A^T r_0\|} \approx O(\mathrm{ATOL}).
\]
Thus on average, ‖Aᵀr_k^LSMR‖/‖Aᵀr_{k−1}^LSMR‖ will be closer to 1 if l is large. This means that the larger l is (in absolute terms), the more LSQR will lag behind LSMR (a bigger gap between ‖Aᵀr_k^LSQR‖ and ‖Aᵀr_k^LSMR‖).
4.2  SQUARE SYSTEMS
Since LSQR and LSMR are applicable to consistent systems, it is of interest to compare them on an unbiased test set. We used the search
facility of Davis [18] to select a set of square real linear systems Ax = b.
With index = UFget, the criteria in M ATLAB Code 4.4 returned a list of
42 examples. Testing isfield(UFget(id),’b’) left 26 cases for which b
was supplied.
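The b-availability test can be scripted as follows (our own sketch around the UFget interface; ids comes from MATLAB Code 4.4 above):

% Keep only the selected square systems that come with a right-hand side b
% (a sketch of the filtering described in the text).
index = UFget;
keep  = [];
for id = ids(:)'
  if isfield(UFget(id), 'b'), keep(end+1) = id; end %#ok<AGROW>
end
ids = keep;        % the text reports 26 such problems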
PRECONDITIONING
For each linear system, diagonal scaling was first applied to the rows of [A  b] and then to its columns using MATLAB Code 4.5, giving a scaled problem Ax = b in which the columns of [A  b] have unit 2-norm.
In spite of the scaling, most examples required more than n iterations of LSQR or LSMR to reduce ‖r_k‖ satisfactorily (rule S1 in Section 3.2.2 with ATOL = BTOL = 10⁻⁸). To simulate better preconditioning, we chose two cases that required about n/5 and n/10 iterations. Figure 4.11 (left) shows both solvers reducing ‖r_k‖ monotonically but with plateaus that are prolonged for LSMR. With loose stopping tolerances, LSQR could terminate somewhat sooner.
MATLAB Code 4.5 Diagonal preconditioning
% scale the row norms to 1
rnorms = sqrt(sum(A.*A,2));
D = diag(sparse(1./rnorms));
A = D*A;
b = D*b;
% scale the column norms to 1
cnorms = sqrt(sum(A.*A,1));
D = diag(sparse(1./cnorms));
A = A*D;
% scale the 2 norm of b to 1
bnorm = norm(b);
if bnorm ~= 0
  b = b./bnorm;
end
Figure 4.11 (right) shows ‖Aᵀr_k‖ for each solver. The plateaus for LSMR correspond to LSQR gaining ground with ‖r_k‖, but falling significantly backward by the ‖Aᵀr_k‖ measure.
COMPARISON WITH IDR(s) ON SQUARE SYSTEMS
Again we mention that on certain square parameterized systems, the solvers IDR(s) and LSQR or LSMR complement each other [81; 82] (see Section 1.3.5).
4.3  UNDERDETERMINED SYSTEMS
In this section, we study the convergence of LSQR and LSMR when applied to an underdetermined system Ax = b. As shown in Section 3.3.2,
LSQR and LSMR converge to the minimum-norm solution for a singular
system (rank(A) < n). The solution solves minAx=b kxk2 .
As a comparison, we also apply MINRES directly to the equation
AATy = b and take x = ATy as the solution. This avoids multiplication
by ATA in the Lanczos process, where ATA is a highly singular operator
because A has more columns than rows. It is also useful to note that this
application of MINRES is mathematically equivalent to applying LSQR
to Ax = b.
Theorem 4.3.1. In exact arithmetic, applying MINRES to AATy = b and
setting xk = ATyk generates the same iterates as applying LSQR to Ax = b.
Figure 4.11: LSQR and LSMR solving two square nonsingular systems Ax = b: problems Hamm/hcircuit (105676x105676, nnz 513072, top) and IBM_EDA/trans5 (116835x116835, nnz 749800, bottom). Left: log₁₀‖r_k‖ for both solvers, with prolonged plateaus for LSMR. Right: log₁₀‖Aᵀr_k‖ (preferable for LSMR).
Table 4.2 Relationship between CG, MINRES, CRAIG, LSQR and LSMR
CRAIG  ≡  CG on AAᵀy = b, x = Aᵀy
LSQR   ≡  CG on AᵀAx = Aᵀb
       ≡  MINRES on AAᵀy = b, x = Aᵀy
LSMR   ≡  MINRES on AᵀAx = Aᵀb
Proof. It suffices to show that the two methods solve the same subproblem at every iteration. Let x_k^MINRES and x_k^LSQR be the iterates generated by MINRES and LSQR respectively. Then
\[
x_k^{\mathrm{MINRES}}
  = A^T \arg\min_{y \in \mathcal{K}_k(AA^T,\,b)} \|b - AA^T y\|
  = \arg\min_{x \in \mathcal{K}_k(A^TA,\,A^Tb)} \|b - Ax\|
  = x_k^{\mathrm{LSQR}}.
\]
The first and third equalities come from the subproblems that MINRES and LSQR solve. The second follows from the mapping
\[
f : \mathcal{K}_k(AA^T, b) \to \mathcal{K}_k(A^TA, A^Tb), \qquad f(y) = A^T y,
\]
and from the fact that any point x ∈ K_k(AᵀA, Aᵀb) can be written as
\[
x = \gamma_0 A^T b + \sum_{i=1}^{k} \gamma_i (A^TA)^i (A^T b)
\]
for some scalars {γ_i}_{i=0}^k. Then the point
\[
y = \gamma_0 b + \sum_{i=1}^{k} \gamma_i (AA^T)^i b
\]
is a preimage of x under f.
This relationship, as well as some other well-known ones between CG, MINRES, CRAIG, LSQR, and LSMR, is summarized in Table 4.2.
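The equivalence is easy to check numerically on a small consistent system; the following is our own sketch, using MATLAB's built-in lsqr and minres rather than the thesis implementations:

% Numerical check of Theorem 4.3.1 on a small underdetermined system
% (our own sketch; the built-in lsqr/minres stand in for the thesis codes).
randn('state',1);
m = 50; n = 80;
A = sprandn(m, n, 0.2);  b = randn(m,1);
tol = 1e-10;  maxit = 200;

x_lsqr = lsqr(A, b, tol, maxit);          % LSQR on Ax = b
afun   = @(y) A*(A'*y);                   % operator for A*A'
y      = minres(afun, b, tol, maxit);     % MINRES on A*A'*y = b
x_min  = A'*y;                            % x = A'*y

fprintf('relative difference = %.2e\n', norm(x_lsqr - x_min)/norm(x_min));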
BACKWARD ERROR FOR UNDERDETERMINED SYSTEMS
A linear system Ax = b is ill-posed if A is m-by-n and m < n, because the system has an infinite number of solutions (or none). One way to define a unique solution for such a system is to choose the solution
MATLAB Code 4.6 Left diagonal preconditioning
% scale the row norms to 1
rnorms = sqrt(full(sum(A.*A,2)));
rnorms = rnorms + (rnorms == 0); % avoid division by 0
D = diag(sparse(1./rnorms));
A = D*A;
b = D*b;
x with minimum 2-norm. That is, we want to solve the optimization problem
\[
\min_{Ax=b} \|x\|_2 .
\]
For any approximate solution x to this problem, the normwise backward error is defined as the norm of the minimum perturbation to A such that x solves the perturbed optimization problem:
\[
\eta(x) = \min_E \|E\| \quad \text{s.t.} \quad x = \arg\min_{(A+E)x=b} \|x\| .
\]
Sun and Sun [39] have shown that this value is given by
\[
\eta(x) = \sqrt{\frac{\|r\|_2^2}{\|x\|_2^2} + \sigma_{\min}^2(B)}, \qquad
B = A\left(I - \frac{xx^T}{\|x\|_2^2}\right).
\]
Since it is computationally prohibitive to compute the minimum singular value at every iteration, we use ‖r‖/‖x‖ as an approximate backward error in the following analysis.¹
¹ Note that this estimate is a lower bound on the true backward error. In contrast, the estimates E1 and E2 for the backward error in least-squares problems are upper bounds.
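For a small problem the Sun–Sun backward error can be evaluated directly and compared with the cheap estimate ‖r‖/‖x‖; a sketch of ours follows (σ_min is taken here as the smallest of the min(m, n) singular values, which is an assumption on our part):

% Compare the Sun-Sun backward error eta(x) with the estimate ||r||/||x||
% for an approximate minimum-norm solution x (our own sketch).
randn('state',1);
m = 40; n = 60;
A = randn(m,n);  b = randn(m,1);
x = A'*((A*A')\b) + 1e-6*randn(n,1);    % near the minimum-norm solution
r = b - A*x;
B   = A*(eye(n) - (x*x')/norm(x)^2);
eta = sqrt(norm(r)^2/norm(x)^2 + min(svd(B))^2);
est = norm(r)/norm(x);                  % cheap lower bound used in the text
fprintf('eta = %.3e   estimate = %.3e\n', eta, est);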
PRECONDITIONING
For underdetermined systems, a right preconditioner on A would alter the minimum-norm solution. Therefore, only left preconditioners are applicable. In the following experiments, we apply a left diagonal preconditioning to A by scaling the rows of A to unit 2-norm; see MATLAB Code 4.6.
DATA
For testing underdetermined systems, we use sparse matrices from the LPnetlib group (the same set of data as in Section 4.1). Each example was downloaded in MATLAB format, and a sparse matrix A and dense vector b were extracted from the data structure via A = Problem.A and b = Problem.b. Then we solve the underdetermined linear system min_{Ax=b} ‖x‖₂ with both LSQR and LSMR. MINRES is also used, with the change of variable described above that makes it equivalent to LSQR.
NUMERICAL RESULTS
The experimental results showed that LSMR converges almost as quickly
as LSQR for underdetermined systems. The approximate backward errors for four different problems are shown in Figure 4.12. In only a
few cases, LSMR lags behind LSQR for a number of iterations. Thus we
conclude that LSMR and LSQR are equally good for finding minimum
2-norm solutions for underdetermined systems.
The experimental results also confirmed our earlier derivation that
MINRES on AATy = b and x = ATy is equivalent to LSQR on Ax =
b. MINRES exhibits the same convergence behavior as LSQR, except
in cases where they both take more than m iterations to converge. In
these cases, the effect of increased condition number of AAT kicks in
and slows down MINRES in the later iterations.
4.4  REORTHOGONALIZATION
It is well known that Krylov-subspace methods can take arbitrarily
many iterations because of loss of orthogonality. For the Golub-Kahan
bidiagonalization, we have two sets of vectors Uk and Vk . As an experiment, we implemented the following options in LSMR:
1. No reorthogonalization.
2. Reorthogonalize Vk (i.e. reorthogonalize vk with respect to Vk−1 ).
3. Reorthogonalize Uk (i.e. reorthogonalize uk with respect to Uk−1 ).
4. Both 2 and 3.
Each option was tested on all of the over-determined test problems
with fewer than 16K nonzeros. Figure 4.13 shows an “easy” case in
which all options converge equally well (convergence before significant loss of orthogonality), and an extreme case in which reorthogonalization makes a large difference.
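Option 2, for example, amounts to a modified Gram–Schmidt sweep of each new v_k against the stored columns of V_{k−1}; a minimal sketch of that single step (ours, written for a generic Golub–Kahan iteration rather than the exact LSMR code):

% Reorthogonalize the new Golub-Kahan vector v against all previous ones
% (option 2).  V holds v_1..v_{k-1} as columns; this is only a sketch.
for j = 1:size(V,2)
  v = v - (V(:,j)'*v)*V(:,j);     % modified Gram-Schmidt sweep
end
alpha = norm(v);
v     = v/alpha;
V     = [V, v];                   % store for later iterations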
Unexpectedly, options 2, 3, and 4 proved to be indistinguishable in all cases. To look closer, we forced LSMR to take n iterations. Option 2 (with Vk orthonormal to machine precision ε) was found to be keeping
[Figure 4.12 appears here: log(‖r‖/‖x‖) against iteration count for LSQR, MINRES, and LSMR on problems lp_pds_10, lp_standgub, lp_israel, and lp_fit1p.]
Figure 4.12: The backward errors ‖r_k‖/‖x_k‖ for LSQR, LSMR and MINRES on four different underdetermined linear systems, solving for the minimum 2-norm solution. Upper left: the backward errors for all three methods converge at a similar rate; most of our test cases exhibit similar convergence behavior, showing that LSMR and LSQR perform equally well for underdetermined systems. Upper right: a rare case where LSMR lags behind LSQR significantly for some iterations; this plot also confirms our earlier derivation that this special version of MINRES is theoretically equivalent to LSQR, as shown by the almost identical convergence behavior. Lower left: an example where all three algorithms take more than m iterations; since MINRES works with the operator AAᵀ, the effect of numerical error is greater and MINRES converges more slowly than LSQR towards the end of the computation. Lower right: another example showing MINRES lagging behind LSQR because of greater numerical error.
Figure 4.13: LSMR with and without reorthogonalization of Vk and/or Uk (options NoOrtho, OrthoU, OrthoV, OrthoUV). Left: an easy case where all options perform similarly (problem lp_ship12l). Right: a helpful case (problem lpi_gran).
Uk orthonormal to at least O(√ε). Option 3 (with Uk orthonormal) was not quite as effective, but it kept Vk orthonormal to at least O(√ε) up to the point where LSMR would terminate when ATOL = √ε.
This effect of one-sided reorthogonalization has also been pointed out in [65].
Note that for square or rectangular A with exact arithmetic, LSMR
is equivalent to MINRES on the normal equation (and hence to CR [44]
and GMRES [63] on the same equation). Reorthogonalization makes the
equivalence essentially true in practice. We now focus on reorthogonalizing Vk but not Uk .
Other authors have presented numerical results involving reorthogonalization. For example, on some randomly generated LS problems
of increasing condition number, Hayami et al. [37] compare their BAGMRES method with an implementation of CGLS (equivalent to LSQR
[53]) in which Vk is reorthogonalized, and find that the methods require
essentially the same number of iterations. The preconditioner chosen
for BA-GMRES made that method equivalent to GMRES on ATAx = ATb.
Thus, GMRES without reorthogonalization was seen to converge essentially as well as CGLS or LSQR with reorthogonalization of Vk (option
2 above). This coincides with the analysis by Paige et al. [55], who
conclude that MGS-GMRES does not need reorthogonalization of the
Arnoldi vectors Vk .
Figure 4.14: LSMR with reorthogonalized Vk and restarting. Restart(l) with l = 5, 10, 50 is slower than standard LSMR with or without reorthogonalization. NoOrtho is LSMR without reorthogonalization; Restart5, Restart10, and Restart50 are LSMR with Vk reorthogonalized and restarting every 5, 10, or 50 iterations; FullOrtho is LSMR with Vk reorthogonalized and no restarting. Problems lp_maros and lp_cre_c.
RESTARTING
To conserve storage, a simple approach is to restart the algorithm every l steps, as with GMRES(l) [63]. To be precise, we set r_l = b − Ax_l, solve min ‖A∆x − r_l‖, update x_l ← x_l + ∆x, and repeat the same process until convergence. Our numerical test in Figure 4.14 shows that restarting LSMR even with full reorthogonalization (of Vk) may lead to stagnation. In general, convergence with restarting is much slower than LSMR without reorthogonalization. Restarting does not seem useful in general.
LOCAL REORTHOGONALIZATION
Here we reorthogonalize each new vk with respect to the previous l vectors, where l is a specified parameter. Figure 4.15 shows that l = 5 has little effect, but partial speedup was achieved with l = 10 and 50 in the two chosen cases. There is evidence of a useful storage-time tradeoff. It should be emphasized that the potential speedup depends strongly on the computational cost of the products Av and Aᵀu. If these are cheap, local reorthogonalization may not be worthwhile.
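A sketch of the buffer update for local reorthogonalization (ours, not the thesis's LSMR code):

% Local reorthogonalization of v against the previous l vectors only.
% Vbuf is an n-by-(at most l) buffer of the most recent v's.
for j = 1:size(Vbuf,2)
  v = v - (Vbuf(:,j)'*v)*Vbuf(:,j);
end
v    = v/norm(v);
Vbuf = [Vbuf, v];
if size(Vbuf,2) > l
  Vbuf(:,1) = [];                 % discard the oldest vector
end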
Figure 4.15: LSMR with local reorthogonalization of Vk. NoOrtho is LSMR without reorthogonalization; Local5, Local10, and Local50 are LSMR with each new vk reorthogonalized against the previous 5, 10, or 50 vectors; FullOrtho is LSMR with Vk fully reorthogonalized and no restarting. Local(l) with l = 5, 10, 50 shows reduced iteration counts as l increases. Problems lp_fit1p and lp_bnl2.
5  AMRES
In this chapter we describe an efficient and stable iterative algorithm for computing the vector x in the augmented system
\[
\begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix}
\begin{pmatrix} s \\ x \end{pmatrix}
=
\begin{pmatrix} b \\ 0 \end{pmatrix},
\tag{5.1}
\]
where A is a rectangular matrix, γ and δ are any scalars, and we define
\[
\hat A = \begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix}, \qquad
\hat x = \begin{pmatrix} s \\ x \end{pmatrix}, \qquad
\hat b = \begin{pmatrix} b \\ 0 \end{pmatrix}.
\tag{5.2}
\]
Our algorithm is called AMRES, for Augmented-system Minimum RESidual method. It is derived by formally applying MINRES [52] to the augmented system (5.1), but is more economical because it is based on the Golub-Kahan bidiagonalization process [29] and it computes estimates of just x (excluding s).
Note that Â contains the two scaled identity matrices γI and δI in its diagonal blocks. When γ and δ have opposite sign (e.g., γ = σ, δ = −σ), (5.1) is equivalent to a damped least-squares problem
\[
\min_x \left\| \begin{pmatrix} A \\ \sigma I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2
\;\equiv\; (A^TA + \sigma^2 I)x = A^Tb
\]
(also known as a Tikhonov regularization problem). CGLS, LSQR or LSMR may then be applied. They require less work and storage per iteration than AMRES, but the number of iterations and the numerical reliability of all three algorithms would be similar (especially for LSMR).
AMRES is intended for the case where γ and δ have the same sign (e.g., γ = δ = σ). An important application is the solution of "negatively damped normal equations" of the form
\[
(A^TA - \sigma^2 I)x = A^Tb,
\tag{5.3}
\]
which arise from both ordinary and regularized total least-squares (TLS) problems, Rayleigh quotient iteration (RQI), and Curtis-Reid scaling. These equations do not lend themselves to an equivalent least-squares formulation. Hence, there is a need for algorithms specially tailored to the problem. We note in passing that there is a formal least-squares formulation for negative shifts,
\[
\min_x \left\| \begin{pmatrix} A \\ i\sigma I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2,
\qquad i = \sqrt{-1},
\]
but this doesn't lead to a convenient numerical algorithm for large-scale problems.
For small enough values of σ 2 , the matrix ATA − σ 2 I is positive
definite (if A has full column rank), and this is indeed the case with ordinary TLS problems. However, for regularized TLS problems (where
σ 2 plays the role of a regularization parameter that is not known a priori and must be found by solving system (5.3) for a range of σ-values),
there is no guarantee that ATA−σ 2 I will be positive definite. With a minor extension, CGLS may be applied with any shift [6] (and this could
often be the most efficient algorithm), but when the shifted matrix is
indefinite, stability cannot be guaranteed.
AMRES is reliable for any shift, even if ATA − σ 2 I is indefinite. Also,
if σ happens to be a singular value of A (extreme or interior), AMRES
may be used to compute a corresponding singular vector in the manner
of inverse iteration,1 as described in detail in Section 6.3.
5.1  DERIVATION OF AMRES
If the Lanczos process Tridiag(Â, b̂) were applied to the matrix and rhs in (5.1), we would have
\[
\hat T_{2k} =
\begin{pmatrix}
\gamma & \alpha_1 & & & & \\
\alpha_1 & \delta & \beta_2 & & & \\
 & \beta_2 & \gamma & \alpha_2 & & \\
 & & \alpha_2 & \ddots & \ddots & \\
 & & & \ddots & \gamma & \alpha_k \\
 & & & & \alpha_k & \delta
\end{pmatrix},
\qquad
\hat V_{2k} =
\begin{pmatrix}
u_1 & & u_2 & & \cdots & u_k & \\
 & v_1 & & v_2 & \cdots & & v_k
\end{pmatrix},
\tag{5.4}
\]
¹ Actually, by solving a singular least-squares problem, as in Choi [13].
and then T̂_{2k+1}, V̂_{2k+1} in the obvious way. Because of the structure of Â and b̂, the scalars α_k, β_k and vectors u_k, v_k are independent of γ and δ, and it is more efficient to generate the same quantities by the Golub-Kahan process Bidiag(A, b). To solve (5.1), MINRES would solve the subproblems
\[
\min \bigl\| \hat H_k y_k - \beta_1 e_1 \bigr\|, \qquad
\hat H_k =
\begin{cases}
\begin{pmatrix} \hat T_k \\ \alpha_{(k+1)/2}\, e_k^T \end{pmatrix} & (k \text{ odd}), \\[8pt]
\begin{pmatrix} \hat T_k \\ \beta_{k/2+1}\, e_k^T \end{pmatrix} & (k \text{ even}).
\end{cases}
\]
5.1.1  LEAST-SQUARES SUBSYSTEM
If γ = δ = σ, a singular value of A, the matrix Â is singular. In general we wish to solve min_{x̂} ‖Âx̂ − b̂‖, where x̂ may not be unique. We define the k-th estimate of x̂ to be x̂_k = V̂_k ŷ_k, and then
\[
\hat r_k \equiv \hat b - \hat A \hat x_k = \hat b - \hat A \hat V_k \hat y_k
  = \hat V_{k+1}\bigl(\beta_1 e_1 - \hat H_k \hat y_k\bigr).
\tag{5.5}
\]
To minimize the residual r̂_k, we perform a QR factorization on Ĥ_k:
\[
\min \|\hat r_k\| = \min \bigl\|\beta_1 e_1 - \hat H_k \hat y_k\bigr\|
= \min \left\| Q_{k+1}^T\!\left(\begin{pmatrix} z_k \\ \bar\zeta_{k+1}\end{pmatrix}
   - \begin{pmatrix} R_k \\ 0 \end{pmatrix}\hat y_k\right)\right\|
= \min \left\| \begin{pmatrix} z_k \\ \bar\zeta_{k+1}\end{pmatrix}
   - \begin{pmatrix} R_k \\ 0 \end{pmatrix}\hat y_k\right\|,
\tag{5.6, 5.7}
\]
where
\[
z_k^T = (\zeta_1, \zeta_2, \ldots, \zeta_k), \qquad
Q_{k+1} = P_k \cdots P_2 P_1,
\tag{5.8}
\]
with Q_{k+1} a product of plane rotations. At iteration k, P_l denotes a plane rotation on rows l and l + 1 (with k ≥ l):
\[
P_l =
\begin{pmatrix}
I_{l-1} & & & \\
 & c_l & s_l & \\
 & -s_l & c_l & \\
 & & & I_{k-l}
\end{pmatrix}.
\]
5.1. DERIVATION OF AMRES
The solution ŷk to (5.7) satisfies Rk ŷk = zk . As in MINRES, to allow a
cheap iterative update to xbk we define Ŵk to be the solution of
RkT ŴkT = V̂kT ,
(5.9)
xbk = V̂k ŷk = Ŵk Rk ŷk = Ŵk zk .
(5.10)
which gives us
Since we are interested in the lower half of xbk = ( xskk ), we write Ŵk as
Ŵk =
Wku
Wkv
!
,
Wku = w1u
w2u
···
wku ,
Wkv = w1v
w2v
···
Then (5.10) can be simplified as xk = Wkv zk = xk−1 + ζk wkv .
5.1.2  QR FACTORIZATION
The effects of the first two rotations in (5.8) are shown here:
\[
\begin{pmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{pmatrix}
\begin{pmatrix} \gamma & \alpha_1 & \\ \alpha_1 & \delta & \beta_2 \end{pmatrix}
=
\begin{pmatrix} \rho_1 & \theta_1 & \eta_1 \\ & \bar\rho_2 & \bar\theta_2 \end{pmatrix},
\tag{5.11}
\]
\[
\begin{pmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{pmatrix}
\begin{pmatrix} \bar\rho_2 & \bar\theta_2 & \\ \beta_2 & \gamma & \alpha_2 \end{pmatrix}
=
\begin{pmatrix} \rho_2 & \theta_2 & \eta_2 \\ & \bar\rho_3 & \bar\theta_3 \end{pmatrix}.
\tag{5.12}
\]
Later rotations are shown in Algorithm AMRES.
5.1.3  UPDATING W_k^v
Since we are only interested in W_k^v, we can extract it from (5.9) to get
\[
R_k^T (W_k^v)^T =
\begin{cases}
\begin{pmatrix} 0 & v_1 & 0 & v_2 & \cdots & v_{\frac{k-1}{2}} & 0 \end{pmatrix}^T & (k \text{ odd}) \\[6pt]
\begin{pmatrix} 0 & v_1 & 0 & v_2 & \cdots & 0 & v_{\frac{k}{2}} \end{pmatrix}^T & (k \text{ even})
\end{cases}
\]
where the structure of the right-hand side comes from (5.4). Since R_k^T is lower tridiagonal, the system can be solved by the following recurrences:
\[
w_1^v = 0, \qquad w_2^v = v_1/\rho_2,
\]
\[
w_k^v =
\begin{cases}
\bigl(-\eta_{k-2} w_{k-2}^v - \theta_{k-1} w_{k-1}^v\bigr)/\rho_k & (k \text{ odd}) \\[6pt]
\bigl(v_{k/2} - \eta_{k-2} w_{k-2}^v - \theta_{k-1} w_{k-1}^v\bigr)/\rho_k & (k \text{ even}).
\end{cases}
\]
If we define h_k = ρ_k w_k^v, then we arrive at the update rules in step 7 of Algorithm 5.1, which saves 2n floating-point multiplications per step of the Golub-Kahan bidiagonalization compared with updating w_k^v directly.
5.1.4  ALGORITHM AMRES
Algorithm 5.1 summarizes the main steps of AMRES, excluding the norm estimates and stopping rules developed later. As usual, β₁u₁ = b is shorthand for β₁ = ‖b‖, u₁ = b/β₁ (and similarly for all α_k, β_k).
5.2  STOPPING RULES
Stopping rules analogous to those in LSQR [53] are used for AMRES. Three dimensionless quantities are needed: ATOL, BTOL, CONLIM. The first stopping rule applies to compatible systems, the second rule applies to incompatible systems, and the third rule applies to both.
S1: Stop if ‖r_k‖ ≤ BTOL·‖b‖ + ATOL·‖A‖·‖x_k‖
S2: Stop if ‖Âᵀr̂_k‖ ≤ ATOL·‖Â‖·‖r̂_k‖
S3: Stop if cond(A) ≥ CONLIM
5.3  ESTIMATE OF NORMS
5.3.1  COMPUTING ‖r̂_k‖
From (5.7), it is obvious that ‖r̂_k‖ = |ζ̄_{k+1}|.
5.3.2  COMPUTING ‖Âr̂_k‖
Starting from (5.5) and using (5.6), we have
\[
\hat A \hat r_k
 = \hat A \hat V_{k+1}\bigl(\beta_1 e_1 - \hat H_k \hat y_k\bigr)
 = \hat V_{k+2} H_{k+1}\bigl(\beta_1 e_1 - \hat H_k \hat y_k\bigr)
 = \hat V_{k+2} H_{k+1} Q_{k+1}^T\!\left(\begin{pmatrix} z_k \\ \hat\zeta_{k+1}\end{pmatrix}
   - \begin{pmatrix} R_k \\ 0 \end{pmatrix}\hat y_k\right)
\]
\[
 = \hat V_{k+2} H_{k+1} Q_{k+1}^T \begin{pmatrix} 0 \\ \hat\zeta_{k+1}\end{pmatrix}
 = \hat V_{k+2}
   \begin{pmatrix} R_k^T & 0 \\ \theta_k e_k^T & \bar\rho_{k+1} \\ \eta_k e_k^T & \bar\theta_{k+1}\end{pmatrix}
   \begin{pmatrix} 0 \\ \hat\zeta_{k+1}\end{pmatrix}
 = \hat\zeta_{k+1}\bigl(\bar\rho_{k+1}\hat v_{k+1} + \bar\theta_{k+1}\hat v_{k+2}\bigr).
\]
Therefore ‖Âr̂_k‖ = |ζ̂_{k+1}| ‖(ρ̄_{k+1}, θ̄_{k+1})‖.
Algorithm 5.1 Algorithm AMRES
1: (Initialize)
   β₁u₁ = b,  α₁v₁ = Aᵀu₁,  ζ̄₁ = β₁,  ρ̄₁ = γ,  θ̄₁ = α₁,  θ₀ = 0,
   η₋₁ = η₀ = 0,  w^v₋₁ = w^v₀ = 0,  x₀ = 0
2: for l = 1, 2, 3, . . . do
3:   (Continue the bidiagonalization)
     β_{l+1}u_{l+1} = Av_l − α_l u_l,   α_{l+1}v_{l+1} = Aᵀu_{l+1} − β_{l+1}v_l
4:   for k = 2l − 1, 2l do
5:     (Set up temporary variables)
       λ = δ,  α = α_l,  β = β_{l+1}      (k odd)
       λ = γ,  α = β_{l+1},  β = α_{l+1}      (k even)
6:     (Construct and apply rotation P_k)
       ρ_k = (ρ̄_k² + α²)^{1/2},   c_k = ρ̄_k/ρ_k,   s_k = α/ρ_k
       θ_k = c_k θ̄_k + s_k λ,      ρ̄_{k+1} = −s_k θ̄_k + c_k λ
       η_k = s_k β,                 θ̄_{k+1} = c_k β
       ζ_k = c_k ζ̄_k,               ζ̄_{k+1} = −s_k ζ̄_k
7:     (Update estimates of x)
       h_k = −(η_{k−2}/ρ_{k−2})h_{k−2} − (θ_{k−1}/ρ_{k−1})h_{k−1}      (k odd)
       h_k = v_l − (η_{k−2}/ρ_{k−2})h_{k−2} − (θ_{k−1}/ρ_{k−1})h_{k−1}      (k even)
       x_k = x_{k−1} + (ζ_k/ρ_k)h_k
8:   end for
9: end for
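For concreteness, the recurrences above can be rendered in MATLAB roughly as follows. This is our own minimal sketch of Algorithm 5.1 (no stopping rules, norm estimates, or breakdown guards, and not the thesis's production code); Aop and Atop are assumed function handles returning Av and Aᵀu.

function x = amres_sketch(Aop, Atop, b, gamma, delta, maxit)
% Minimal sketch of Algorithm 5.1 (AMRES): MINRES on the augmented
% system (5.1) via Golub-Kahan bidiagonalization.  Our own rendering.
  beta  = norm(b);   u = b/beta;                    % beta_1 u_1 = b
  v     = Atop(u);   alpha = norm(v);  v = v/alpha; % alpha_1 v_1 = A'u_1
  n     = length(v);
  zetabar = beta;  rhobar = gamma;  thetabar = alpha;
  eta2 = 0;  eta1 = 0;              % eta_{k-2}, eta_{k-1}
  th1  = 0;                         % theta_{k-1}
  rho2 = 1;  rho1 = 1;              % rho_{k-2}, rho_{k-1} (unused while h's are 0)
  h2 = zeros(n,1);  h1 = zeros(n,1);
  x  = zeros(n,1);
  for l = 1:maxit
    % continue the Golub-Kahan bidiagonalization
    u    = Aop(v)  - alpha*u;  beta     = norm(u);    u    = u/beta;
    vnew = Atop(u) - beta*v;   alphanew = norm(vnew); vnew = vnew/alphanew;
    for k = [2*l-1, 2*l]
      if mod(k,2) == 1                                % k odd
        lambda = delta;  a = alpha;  bb = beta;       vterm = zeros(n,1);
      else                                            % k even
        lambda = gamma;  a = beta;   bb = alphanew;   vterm = v;
      end
      % construct and apply the rotation P_k
      rho = sqrt(rhobar^2 + a^2);  c = rhobar/rho;  s = a/rho;
      theta    = c*thetabar + s*lambda;
      rhobarp  = -s*thetabar + c*lambda;
      eta      = s*bb;
      thetabar = c*bb;
      zeta     = c*zetabar;
      zetabar  = -s*zetabar;
      rhobar   = rhobarp;
      % update the estimate of x (step 7)
      h = vterm - (eta2/rho2)*h2 - (th1/rho1)*h1;
      x = x + (zeta/rho)*h;
      % shift histories
      h2 = h1;      h1 = h;
      eta2 = eta1;  eta1 = eta;
      th1  = theta;
      rho2 = rho1;  rho1 = rho;
    end
    alpha = alphanew;  v = vnew;
  end
end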
5.3.3  COMPUTING ‖x̂_k‖
From (5.7) and (5.10), we know that x̂_k = V̂_k ŷ_k and z_k = R_k ŷ_k. If we do a QR factorization Q̃_k R_k^T = R̃_k and let t_k = Q̃_k ŷ_k, we can write
\[
z_k = R_k \hat y_k = R_k \tilde Q_k^T \tilde Q_k \hat y_k = \tilde R_k^T t_k,
\tag{5.13}
\]
giving x̂_k = V̂_k Q̃_k^T t_k and hence ‖x̂_k‖ = ‖t_k‖, where ‖t_k‖ is derived next.
For the QR factorizations (5.8) and (5.13), we write R_k and R̃_k as
\[
R_k =
\begin{pmatrix}
\rho_1 & \theta_1 & \eta_1 & & \\
 & \rho_2 & \theta_2 & \ddots & \\
 & & \ddots & \ddots & \eta_{k-2} \\
 & & & \rho_{k-1} & \theta_{k-1} \\
 & & & & \rho_k
\end{pmatrix},
\qquad
\tilde R_k =
\begin{pmatrix}
\tilde\rho_1^{(4)} & \tilde\theta_1^{(2)} & \tilde\eta_1^{(1)} & & \\
 & \tilde\rho_2^{(4)} & \tilde\theta_2^{(2)} & \ddots & \\
 & & \ddots & \ddots & \tilde\eta_{k-2}^{(1)} \\
 & & & \tilde\rho_{k-1}^{(3)} & \tilde\theta_{k-1}^{(1)} \\
 & & & & \tilde\rho_k^{(2)}
\end{pmatrix},
\]
where the upper index denotes the number of times that element has changed during the decompositions. Also Q̃_k = (P̃_k P̂_k) · · · (P̃₃ P̂₃) P̃₂, where P̂_k and P̃_k are constructed by changing the (k−2 : k)×(k−2 : k) submatrix of I_k to
\[
\begin{pmatrix} \tilde c_k^{(1)} & & \tilde s_k^{(1)} \\ & 1 & \\ -\tilde s_k^{(1)} & & \tilde c_k^{(1)} \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} 1 & & \\ & \tilde c_k^{(2)} & \tilde s_k^{(2)} \\ & -\tilde s_k^{(2)} & \tilde c_k^{(2)} \end{pmatrix}
\]
respectively. The effects of these two rotations can be summarized as
\[
\begin{pmatrix} \tilde c_k^{(1)} & \tilde s_k^{(1)} \\ -\tilde s_k^{(1)} & \tilde c_k^{(1)} \end{pmatrix}
\begin{pmatrix} \tilde\rho_{k-2}^{(3)} & \tilde\theta_{k-2}^{(1)} & \\ \eta_{k-2} & \theta_{k-1} & \rho_k \end{pmatrix}
=
\begin{pmatrix} \tilde\rho_{k-2}^{(4)} & \tilde\theta_{k-2}^{(2)} & \tilde\eta_{k-2}^{(1)} \\ & \dot\theta_{k-1} & \tilde\rho_k^{(1)} \end{pmatrix},
\]
\[
\begin{pmatrix} \tilde c_k^{(2)} & \tilde s_k^{(2)} \\ -\tilde s_k^{(2)} & \tilde c_k^{(2)} \end{pmatrix}
\begin{pmatrix} \tilde\rho_{k-1}^{(2)} & \dot\theta_{k-1} \\ & \tilde\rho_k^{(1)} \end{pmatrix}
=
\begin{pmatrix} \tilde\rho_{k-1}^{(3)} & \tilde\theta_{k-1}^{(1)} \\ & \tilde\rho_k^{(2)} \end{pmatrix}.
\]
Let t_k = (τ₁⁽³⁾, . . . , τ_{k−2}⁽³⁾, τ_{k−1}⁽²⁾, τ_k⁽¹⁾)ᵀ. We solve for t_k by
\[
\tau_{k-2}^{(3)} = \bigl(\zeta_{k-2} - \tilde\eta_{k-4}^{(1)}\tau_{k-4}^{(3)} - \tilde\theta_{k-3}^{(2)}\tau_{k-3}^{(3)}\bigr)/\tilde\rho_{k-2}^{(4)},
\]
\[
\tau_{k-1}^{(2)} = \bigl(\zeta_{k-1} - \tilde\eta_{k-3}^{(1)}\tau_{k-3}^{(3)} - \tilde\theta_{k-2}^{(2)}\tau_{k-2}^{(3)}\bigr)/\tilde\rho_{k-1}^{(3)},
\]
\[
\tau_k^{(1)} = \bigl(\zeta_k - \tilde\eta_{k-2}^{(1)}\tau_{k-2}^{(3)} - \tilde\theta_{k-1}^{(1)}\tau_{k-1}^{(2)}\bigr)/\tilde\rho_k^{(2)},
\]
with our estimate of ‖x̂_k‖ obtained from
\[
\|\hat x_k\| = \|t_k\| = \bigl\|(\tau_1^{(3)}, \ldots, \tau_{k-2}^{(3)}, \tau_{k-1}^{(2)}, \tau_k^{(1)})\bigr\|.
\]
5.3.4  ESTIMATES OF ‖Â‖ AND COND(Â)
Using Lemma 2.32 from Choi [13], we have the following estimates of ‖Â‖ at the l-th iteration:
\[
\|\hat A\| \ge \max_{1\le i\le l}\bigl(\alpha_i^2 + \delta^2 + \beta_{i+1}^2\bigr)^{1/2},
\tag{5.14}
\]
\[
\|\hat A\| \ge \max_{2\le i\le l}\bigl(\beta_i^2 + \gamma^2 + \alpha_i^2\bigr)^{1/2}.
\tag{5.15}
\]
Stewart [71] showed that the minimum and maximum singular values of Ĥ_k bound the diagonal elements of R̃_k. Since the extreme singular values of Ĥ_k estimate those of Â, we have an approximation (which is also a lower bound) for the condition number of Â:
\[
\mathrm{cond}(\hat A) \approx
\frac{\max\bigl(\tilde\rho_1^{(4)}, \ldots, \tilde\rho_{k-2}^{(4)}, \tilde\rho_{k-1}^{(3)}, \tilde\rho_k^{(2)}\bigr)}
     {\min\bigl(\tilde\rho_1^{(4)}, \ldots, \tilde\rho_{k-2}^{(4)}, \tilde\rho_{k-1}^{(3)}, \tilde\rho_k^{(2)}\bigr)}.
\tag{5.16}
\]
5.4  COMPLEXITY
The storage and computational cost at each iteration of AMRES are shown in Table 5.1.
Table 5.1 Storage and computational cost for AMRES
            Storage                          Work
            m        n                       m    n
AMRES       Av, u    x, v, h_{k-1}, h_k      3    9
plus the cost to compute the products Av and Aᵀu.
6  AMRES APPLICATIONS
In this chapter we discuss a number of applications for AMRES. Section
6.1 describes its use for Curtis-Reid scaling, a commonly used method
for scaling sparse rectangular matrices such as those arising in linear
programming problems. Section 6.2 applies AMRES to Rayleigh quotient iteration for computing singular vectors, and describes a modified
version of RQI to allow reuse of the Golub-Kahan vectors. Section 6.3
describes a modified version of AMRES that allows singular vectors to
be computed more quickly and reliably when the corresponding singular value is known.
6.1  CURTIS-REID SCALING
In linear programming problems, the constraint matrix is often preprocessed by a scaling procedure before the application of a solver. The goal of such scaling is to reduce the range of magnitudes of the nonzero matrix elements. This may improve the numerical behavior of the solver and reduce the number of iterations to convergence [21].
Scaling procedures multiply a given matrix A by a diagonal matrix on each side:
\[
\bar A = RAC, \qquad R = \operatorname{diag}(\beta^{-r_i}), \qquad C = \operatorname{diag}(\beta^{-c_j}),
\]
where A is m-by-n, β > 1 is an arbitrary base, and the vectors r = (r₁ r₂ . . . r_m)ᵀ and c = (c₁ c₂ . . . c_n)ᵀ represent the scaling factors in log-scale. One popular scaling objective is to make each nonzero element of the scaled matrix Ā approximately 1 in absolute value, which translates to the following system of equations for r and c:
\[
|\bar a_{ij}| \approx 1 \;\Rightarrow\; \beta^{-r_i}|a_{ij}|\beta^{-c_j} \approx 1
\;\Rightarrow\; -r_i + \log_\beta|a_{ij}| - c_j \approx 0 .
\]
Following the above objective, two methods of matrix-scaling have
MATLAB Code 6.1 Least-squares problem for Curtis-Reid scaling
[m n] = size(A);
[I J S] = find(A);
z = length(I);
M = sparse([1:z 1:z]', [I; J+m], ones(2*z,1));
d = log(abs(S));
been proposed in 1962 and 1971:
\[
\min_{r_i,c_j}\ \max_{a_{ij}\ne 0}\ \bigl|\log_\beta|a_{ij}| - r_i - c_j\bigr|
\qquad \text{Fulkerson \& Wolfe 1962 [28]}
\tag{6.1}
\]
\[
\min_{r_i,c_j}\ \sum_{a_{ij}\ne 0} \bigl(\log_\beta|a_{ij}| - r_i - c_j\bigr)^2
\qquad \text{Hamming 1971 [36]}
\tag{6.2}
\]
6.1.1  CURTIS-REID SCALING USING CGA
The closed-form solution proposed by Hamming is restricted to a dense matrix A. Curtis and Reid extended Hamming's method to general A and designed a specialized version of CG (which we will call CGA) that allows the scaling to work efficiently for sparse matrices [17]. Experiments by Tomlin [77] indicate that Curtis-Reid scaling is superior to that of Fulkerson and Wolfe. We now describe the key ideas in Curtis-Reid scaling for matrices.
Let z be the number of nonzero elements in A. Equation (6.2) can be written in matrix notation as
\[
\min_{r,c}\ \left\| M \begin{pmatrix} r \\ c \end{pmatrix} - d \right\|,
\tag{6.3}
\]
where M is a z-by-(m + n) matrix and d is a z-vector. Each row of M corresponds to a nonzero A_ij: the entire row is 0 except the i-th and (j + m)-th columns, which are 1. The corresponding element of d is log|A_ij|. M and d are defined by MATLAB Code 6.1.¹
The normal equation for the least-squares problem (6.3),
\[
M^T M \begin{pmatrix} r \\ c \end{pmatrix} = M^T d,
\]
can be written as
\[
\begin{pmatrix} D_1 & F \\ F^T & D_2 \end{pmatrix}
\begin{pmatrix} r \\ c \end{pmatrix}
=
\begin{pmatrix} s \\ t \end{pmatrix},
\tag{6.4}
\]
¹ In practice, matrix M is never constructed because it is more efficient to work with the normal equation directly, as described below.
MATLAB Code 6.2 Normal equation for Curtis-Reid scaling
F = (abs(A)>0);
D1 = diag(sparse(sum(F,2)));   % sparse diagonal matrix
D2 = diag(sparse(sum(F,1)));   % sparse diagonal matrix
abslog = @(x) log(abs(x));
B = spfun(abslog,A);
s = sum(B,2);
t = sum(B,1)';
where F has the same dimensions and sparsity pattern as A, with every nonzero entry changed to 1, while D₁ and D₂ are diagonal matrices whose diagonal elements are the number of nonzeros in each row and column of A respectively. s and t are defined by
\[
B_{ij} = \begin{cases} \log|A_{ij}| & (A_{ij} \ne 0) \\ 0 & (A_{ij} = 0) \end{cases},
\qquad s_i = \sum_{j=1}^{n} B_{ij},
\qquad t_j = \sum_{i=1}^{m} B_{ij}.
\]
The construction of these submatrices is shown in MATLAB Code 6.2.
Equation (6.4) is a consistent positive semidefinite system. Any solution of this system suffices for scaling purposes, since all solutions produce the same scaling.²
² It is obvious that (1; −1) (a vector of ones above a vector of minus ones) is a null vector, which follows directly from the definitions of F, D₁ and D₂. If r and c solve (6.4), then r + α1 and c − α1 are also solutions for any α. This corresponds to the fact that RAC and (β^α R)A(β^{−α} C) give the same scaled Ā.
³ A matrix A is said to have Property A if there exist permutations P₁, P₂ such that
\[
P_1^T A P_2 = \begin{pmatrix} D_1 & E \\ F & D_2 \end{pmatrix},
\]
where D₁ and D₂ are diagonal matrices [87].
Reid [60] developed a specialized version of CG for matrices with Property A.³ It takes advantage of the sparsity pattern that appears when CG is applied to such matrices, to reduce both storage and work. Curtis and Reid [17] applied CGA to (6.4) to find the scaling for matrix A. To compare the performance of CGA on (6.4) against AMRES, we implemented the CGA algorithm as described in [17]; the implementation is shown in MATLAB Code 6.3.
6.1.2  CURTIS-REID SCALING USING AMRES
In this section, we discuss how to transform equation (6.4) so that it can be solved by AMRES.
First, we apply a symmetric diagonal preconditioning with diag(D₁^{1/2}, D₂^{1/2}) as the preconditioner. Then, with the definitions
\[
\bar r = D_1^{1/2} r, \quad \bar c = D_2^{1/2} c, \quad
\bar s = D_1^{-1/2} s, \quad \bar t = D_2^{-1/2} t, \quad
C = D_1^{-1/2} F D_2^{-1/2},
\]
MATLAB Code 6.3 CGA for Curtis-Reid scaling. This is an implementation of the algorithm described in [17].
function [cscale, rscale] = crscaleCGA(A, tol)

% CRSCALE implements the Curtis-Reid scaling algorithm.
% [cscale, rscale, normrv, normArv, normAv, normxv, condAv] = crscale(A, tol);
%
% Use of the scales:
%
% If C = diag(sparse(cscale)), R = diag(sparse(rscale)),
% Cinv = diag(sparse(1./cscale)), Rinv = diag(sparse(1./rscale)),
% then Cinv*A*Rinv should have nonzeros that are closer to 1 in absolute value.
%
% To apply the scales to a linear program,
% min c'x st Ax = b, l <= x <= u,
% we need to define "barred" quantities by the following relations:
% A = R Abar C, b = R bbar, C cbar = c,
% C l = lbar, C u = ubar, C x = xbar.
% This gives the scaled problem
% min cbar'xbar st Abar xbar = bbar, lbar <= xbar <= ubar.
%
% 03 Jun 2011: First version.
%              David Fong and Michael Saunders, ICME, Stanford University.

E = (abs(A)>0);              % sparse if A is sparse
rowSumE = sum(E,2);
colSumE = sum(E,1);
Minv = diag(sparse(1./(rowSumE + (rowSumE == 0))));
Ninv = diag(sparse(1./(colSumE + (colSumE == 0))));
[m n] = size(A);
abslog = @(x) log(abs(x));
Abar = spfun(abslog,A);
% make sure sigma and tau are of correct size even if some entries of A are 1
sigma = zeros(m,1) + sum(Abar,2);
tau   = zeros(n,1) + sum(Abar,1)';

itnlimit = 100;
r1 = zeros(length(sigma),1);
r2 = tau - E'*(Minv*sigma);
c2 = zeros(length(tau),1);
c = c2; e1 = 0; e = 0; q2 = 1;
s2 = r2'*(Ninv*r2);

for t = 1:itnlimit
  rm1 = r1; r = r2; em2 = e; em1 = e1;
  q = q2; s = s2; cm2 = c; c = c2;
  r1 = -(E*(Ninv*r) + em1 * rm1)/q;
  s1 = r1'*(Minv*r1);
  e = q * s1/s; q1 = 1 - e;
  c2 = c + (Ninv*r + em1*em2*(c-cm2))/(q*q1);
  cv(:,t+1) = c2;
  if s1 < 1e-10; break; end

  r2 = -(E'*(Minv*r1) + e * r)/q1;
  s2 = r2'*(Ninv*r2);
  e1 = q1*s2/s1; q2 = 1 - e1;
end

c      = c2;
rho    = Minv*(sigma - E*c);
gamma  = c;
rscale = exp(rho);
cscale = exp(gamma);
rmax   = max(rscale);
cmax   = max(cscale);
s      = sqrt(rmax/cmax);
cscale = cscale*s;
rscale = rscale/s;

end % function crscale
(6.4) becomes
\[
\begin{pmatrix} I & C \\ C^T & I \end{pmatrix}
\begin{pmatrix} \bar r \\ \bar c \end{pmatrix}
=
\begin{pmatrix} \bar s \\ \bar t \end{pmatrix},
\]
and with ĉ = c̄ − t̄, we get the form required by AMRES:
\[
\begin{pmatrix} I & C \\ C^T & I \end{pmatrix}
\begin{pmatrix} \bar r \\ \hat c \end{pmatrix}
=
\begin{pmatrix} \bar s - C\bar t \\ 0 \end{pmatrix}.
\]
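Putting Section 6.1.2 together in code (our own sketch; amres here is a hypothetical solver for augmented systems of the form (5.1) that returns the lower-half solution, and is not a built-in function):

% Curtis-Reid scaling via AMRES (our own sketch).
F  = (abs(A) > 0);
D1 = sum(F,2);   D2 = sum(F,1)';             % row / column nonzero counts
B  = spfun(@(x) log(abs(x)), A);
s  = sum(B,2);   t  = sum(B,1)';
d1 = 1./sqrt(D1 + (D1==0));  d2 = 1./sqrt(D2 + (D2==0));
C    = spdiags(d1,0,length(d1),length(d1)) * F * spdiags(d2,0,length(d2),length(d2));
sbar = d1.*s;    tbar = d2.*t;
rhs  = sbar - C*tbar;                        % right-hand side b of the AMRES system
chat = amres(C, rhs, 1, 1);                  % hypothetical call: (matrix, b, gamma, delta)
c = d2.*(chat + tbar);                       % undo the shift and diagonal preconditioning
r = (s - F*c)./(D1 + (D1==0));               % recover r from D1*r + F*c = s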
6.1.3  COMPARISON OF CGA AND AMRES
We now apply CGA and AMRES as described in the previous two sections to find the scaling vectors for Curtis-Reid scaling of some matrices from the University of Florida Sparse Matrix Collection (Davis [18]). At the k-th iteration, we compute the estimates r^(k), c^(k) from the two algorithms and use them to evaluate the objective function f(r, c) from the least-squares problem (6.2):
\[
f_k = f(r^{(k)}, c^{(k)}) = \sum_{a_{ij}\ne 0}
\bigl(\log_e|a_{ij}| - r_i^{(k)} - c_j^{(k)}\bigr)^2.
\tag{6.5}
\]
We take the same set of problems as in Section 4.1. The LPnetlib group includes data for 138 linear programming problems. Each example was downloaded in MATLAB format, and a sparse matrix A was extracted from the data structure via A = Problem.A.
In Figure 6.1, we plot log₁₀(f_k − f*) against iteration number k for each algorithm, where f*, the optimal objective value, is taken as the minimum objective value attained by running CGA or AMRES for some large number K of iterations. From the results, we see that CGA and AMRES exhibit similar convergence behavior for many matrices. There are also several cases where AMRES converges rapidly in the early iterations, which is important because scaling vectors do not need to be computed to high precision.
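The objective (6.5) is cheap to evaluate from the nonzero triplet form of A; a short sketch (ours, reusing the [I J S] triplet from MATLAB Code 6.1):

% Evaluate the Curtis-Reid objective (6.5) for given scaling vectors r, c.
[I, J, S] = find(A);
fk = sum((log(abs(S)) - r(I) - c(J)).^2);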
6.2  RAYLEIGH QUOTIENT ITERATION
In this section, we focus on algorithms that improve an approximate
singular value and singular vector. The algorithms are based on Rayleigh
quotient iteration. In the next section, we focus on finding singular vectors when the corresponding singular value is known.
Rayleigh quotient iteration (RQI) is an iterative procedure developed by John William Strutt, third Baron Rayleigh, for his study on
[Figure 6.1 appears here: log(f(x) − f*) against iteration count for CurtisReid-CG and CurtisReid-AMRES on problems lpi_pilot4i, lp_grow22, lp_scfxm3, and lp_sctap2.]
Figure 6.1: Convergence of CGA and AMRES applied to the Curtis-Reid scaling problem. At each iteration, we plot the difference between the current and optimal values of the objective function in (6.5). Upper left and lower left: typical cases where CGA and AMRES exhibit similar convergence behavior; there is no obvious advantage of choosing one algorithm over the other in these cases. Upper right: a less common case where AMRES converges much faster than CGA in the earlier iterations. This is important because scaling parameters do not need to be computed to high accuracy for most applications, so AMRES could terminate early in this case. Similar convergence behavior was observed for several other problems. Lower right: in a few cases such as this, CGA converges faster than AMRES during the later iterations.
Algorithm 6.1 Rayleigh quotient iteration (RQI) for square A
1: Given initial eigenvalue and eigenvector estimates ρ₀, x₀
2: for q = 1, 2, . . . do
3:   Solve (A − ρ_{q−1}I)t = x_{q−1}
4:   x_q = t/‖t‖
5:   ρ_q = x_qᵀ A x_q
6: end for

Algorithm 6.2 RQI for singular vectors for square or rectangular A
1: Given initial singular value and right singular vector estimates ρ₀, x₀
2: for q = 1, 2, . . . do
3:   Solve for t in
     \[
     (A^TA - \rho_{q-1}^2 I)\,t = x_{q-1}
     \tag{6.6}
     \]
4:   x_q = t/‖t‖
5:   ρ_q = ‖Ax_q‖
6: end for
the theory of sound [59]. The procedure is summarized in Algorithm
6.1. It improves an approximate eigenvector for a square matrix.
It has been proved by Parlett and Kahan [56] that RQI converges
for almost all starting eigenvalue and eigenvector guesses, although in
general it is unpredictable to which eigenpair RQI will converge. Ostrowski [49] showed that for a symmetric matrix A, RQI exhibits local cubic convergence. It has been shown that MINRES is preferable to
SYMMLQ as the solver within RQI [20; 86]. Thus, it is natural to use
AMRES, a MINRES-based solver, for extending RQI to compute singular
vectors.
In this section, we focus on improving singular vector estimates for
a general rectangular matrix A. RQI can be adapted to this task as in
Algorithm 6.2. We will refer to the iterations in line 2 as outer iterations,
and the iterations inside the solver for line 3 as inner iterations. The
system in line 3 becomes increasingly ill-conditioned as the singular
value estimate ρq converges to a singular value.
We continue our investigation by improving both the stability and
the speed of Algorithm 6.2. Section 6.2.1 focuses on improving stability
by solving an augmented system using AMRES. Section 6.2.2 improves
the speed by a modification to RQI that allows AMRES to reuse precomputed Golub-Kahan vectors in subsequent iterations.
Algorithm 6.3 Stable RQI for singular vectors
1: Given initial singular value and right singular vector estimates ρ₀, x₀
2: for q = 1, 2, . . . do
3:   Solve for t̄ in
     \[
     \begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix}
     \begin{pmatrix} s \\ \bar t \end{pmatrix}
     =
     \begin{pmatrix} A x_{q-1} \\ 0 \end{pmatrix}
     \]
4:   t = t̄ − x_{q−1}
5:   x_q = t/‖t‖
6:   ρ_q = ‖Ax_q‖
7: end for
6.2.1
S TABLE INNER ITERATIONS FOR RQI
Compared with Algorithm 6.2, a more stable alternative would be to
solve a larger augmented system. Note that line 3 in 6.2 is equivalent
to
! !
!
−ρq−1 I
A
s
0
=
.
AT
−ρq−1 I
t
xq−1
If we applied MINRES, we would get a more stable algorithm with
twice the computational cost. A better alternative would be to convert
it to a form that could be readily solved by AMRES. Since the solution
will be normalized as in line 4, we are free to scale the right-hand side:
\begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix} \begin{pmatrix} s \\ t \end{pmatrix} = \begin{pmatrix} 0 \\ \rho_{q-1} x_{q-1} \end{pmatrix}.
We introduce the shift variable t̄ = t + xq−1 to obtain the system
\begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix} \begin{pmatrix} s \\ \bar{t} \end{pmatrix} = \begin{pmatrix} A x_{q-1} \\ 0 \end{pmatrix}, \qquad (6.7)
which is suitable for AMRES. This system has a smaller condition number compared with (6.6). Thus, we enjoy both the numerical stability of
the larger augmented system and the lower computational cost of the
smaller system. We summarize this approach in Algorithm 6.3.
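To see why, write the scaled system row-wise as $-\rho_{q-1} s + A t = 0$ and $A^T s - \rho_{q-1} t = \rho_{q-1} x_{q-1}$. Substituting $t = \bar{t} - x_{q-1}$ gives
\[
  -\rho_{q-1} s + A\bar{t} = A x_{q-1}, \qquad A^T s - \rho_{q-1}\bar{t} = 0,
\]
which is exactly (6.7).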
TEST DATA
We describe a procedure to construct a linear operator A with known singular values and singular vectors. This method is adapted from [53].
1. For any m ≥ n and C ≥ 1, pick vectors satisfying
y^{(i)} \in \mathbb{R}^m, \quad \|y^{(i)}\| = 1 \quad (1 \le i \le C), \qquad z \in \mathbb{R}^n, \quad \|z\| = 1,
for constructing Householder reflectors. The parameter C represents the number of Householder reflectors that are used to construct the left singular vectors of A. This allows us to vary the cost
of Av and ATu multiplications for experiments in Section 6.2.2.
2. Pick a vector d ∈ Rn whose elements are the singular values
of A. We have two different options for choosing them. In the
non-clustered option, adjacent singular values are separated by a
constant gap and form an arithmetic sequence. In the clustered
option, the smaller singular values cluster near the smallest one
and form a geometric sequence. The two options are as follows:

   d_j = \sigma_n + (\sigma_1 - \sigma_n)\,\tfrac{n-j}{n-1}            (non-clustered)

   d_j = \sigma_n \left(\tfrac{\sigma_1}{\sigma_n}\right)^{\tfrac{n-j}{n-1}}   (clustered),
where σ1 and σn are the largest and smallest singular values of A
and we have d1 = σ1 and dn = σn .
3. Define Y^{(i)} = I − 2 y^{(i)} (y^{(i)})^T and Z = Z^T = I − 2 z z^T, where D = diag(d) and

   A = \Big( \prod_{i=1}^{C} Y^{(i)} \Big) \begin{pmatrix} D \\ 0 \end{pmatrix} Z.
This procedure is shown in MATLAB Code 6.4.
In each of the following experiments, we pick a singular value σ that
we would like to converge to, and extract the corresponding column v
from Z as the singular vector. Then we pick a random unit vector w
and a scalar δ to control the size of the perturbation that we want to
introduce into v. The approximate right singular vector for input to
RQI-AMRES and RQI-MINRES is then (v + δw). Since RQI is known to always converge to some singular vector, though not necessarily the one we want, our experiments are meaningful only if δ is quite small.
MATLAB Code 6.4 Generate linear operator with known singular values

function [Afun, sigma, v] ...
    = getLinearOperator(m, n, sigmaMax, sigmaMin, p, cost, clustered)
% p: output the p-th largest singular value as sigma
%    and the corresponding right singular vector as v
randn('state',1);
y = randn(m,cost);
for i = 1:cost
  y(:,i) = y(:,i)/norm(y(:,i));   % normalize every column of y
end
z = randn(n,1); z = z/norm(z);
if clustered
  d = sigmaMin * (sigmaMax/sigmaMin).^(((n-1):-1:0)'/(n-1));
else
  d = sigmaMin + (sigmaMax-sigmaMin).*(((n-1):-1:0)'/(n-1));
end
ep = zeros(n,1); ep(p) = 1; v = ep - 2*(z'*ep)*z;
sigma = d(p); Afun = @A;

  function w = A(x,trans)
    w = x;
    if trans == 1
      w = w - 2*(z'*w)*z;
      w = [d.*w; zeros(m-n,1)];
      for i = 1:cost; w = w - 2*(y(:,i)'*w)*y(:,i); end
    elseif trans == 2
      for i = cost:-1:1; w = w - 2*(y(:,i)'*w)*y(:,i); end
      w = d.*w(1:n);
      w = w - 2*(z'*w)*z;
    end
  end
end
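As a usage sketch (an illustration only, not part of the original code), the returned operator can be checked against the returned singular pair, since A v = σ u and A^T u = σ v should hold to rounding error:

    [Afun, sigma, v] = getLinearOperator(500, 300, 1, 1e-10, 120, 1, true);
    Av = Afun(v,1);                 % should have norm sigma
    u  = Av/norm(Av);
    disp([abs(norm(Av) - sigma), norm(Afun(u,2) - sigma*v)])   % both near zero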
NAMING OF ALGORITHMS
By applying different linear system solvers in line 3 of Algorithms 6.2
and 6.3, we obtain different versions of RQI for singular vector computation. We refer to these versions by the following names:
• RQI-MINRES: Apply MINRES to line 3 of Algorithm 6.2.
• RQI-AMRES: Apply AMRES to line 3 of Algorithm 6.3.
• MRQI-AMRES: Replace line 3 of Algorithm 6.3 by

  \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x_{q+1} \end{pmatrix} = \begin{pmatrix} u \\ 0 \end{pmatrix}, \qquad (6.8)

  where u is a left singular vector estimate, or u = Av if v is a right singular vector estimate. AMRES is applied to this linear system. Note that a further difference here from Algorithm 6.3 is that u is the same for all q. (Details of MRQI-AMRES are given in Section 6.2.2.)
NUMERICAL RESULTS
We ran both RQI-AMRES and RQI-MINRES to improve given approximate right singular vectors of our constructed linear operators. The accuracy of the new singular value estimate ρq as generated by both algorithms is plotted against the cumulative number of inner iterations (i.e. the number of Golub-Kahan bidiagonalization steps). (For the iterates ρq from outer iterations that we want to converge to σ, we use the relative error |ρq − σ|/σ as the accuracy measure.)
When the singular values of A are not clustered, we found that RQI-AMRES and RQI-MINRES exhibit very similar convergence behavior as shown in Figure 6.2, with RQI-AMRES showing improved accuracy for small singular values.
When the singular values of A are clustered, we found that RQI-AMRES converges faster than RQI-MINRES as shown in Figure 6.3.
We have performed the same experiments on larger matrices and
obtained similar results.
6.2.2  SPEEDING UP THE INNER ITERATIONS OF RQI
In the previous section, we applied AMRES to RQI to obtain a stable
method for refining an approximate singular vector. We now explore
a modification to RQI itself that allows AMRES to achieve significant
speedup. To illustrate the rationale behind the modification, we first
revisit some properties of the AMRES algorithm.
AMRES solves linear systems of the form
\begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}
using the Golub-Kahan process Bidiag(A, b), which is independent of γ
and δ. Thus, if we solve a sequence of linear systems of the above form
with different γ, δ but A, b kept constant, any computational work in
Bidiag(A, b) can be cached and reused. In Algorithm 6.3, γ = δ = −ρq
and b = Axq−1 . We now propose a modified version of RQI in which
the right-hand side is kept constant at b = Ax0 and only ρq is updated
each outer iteration. We summarize this modification in Algorithm 6.4,
[Plots: log|(ρ_k − σ)/σ| versus number of inner iterations for RQI-AMRES and RQI-MINRES; four panels rqi(500,300,{1,120,280,300},...).]
Figure 6.2: Improving an approximate singular value and singular vector for a matrix with non-clustered singular values using RQI-AMRES and RQI-MINRES. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 500, n = 300, C = 1, σ1 = 1, σn = 10^{-10} and non-clustered singular values. The approximate right singular vector v + δw is formed with δ = 0.1.
Upper Left: Computing the largest singular value σ1. Upper Right: Computing a singular value σ120 near the middle of the spectrum. Lower Left: Computing a singular value σ280 near the low end of the spectrum. Lower Right: Computing the smallest singular value σ300. Here we see that the iterates ρq generated by RQI-MINRES fail to converge to precisions higher than 10^{-8}, while RQI-AMRES achieves higher precision with more iterations.
[Plots: log|(ρ_k − σ)/σ| versus number of inner iterations for RQI-AMRES and RQI-MINRES; four panels rqi(500,300,{2,100,120,280},...).]
Figure 6.3: Improving an approximate singular value and singular vector for a matrix with clustered singular values using RQI-AMRES and RQI-MINRES. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 500, n = 300, C = 1, σ1 = 1, σn = 10^{-10} and clustered singular values. For the two upper plots, the approximate right singular vector v + δw is formed with δ = 0.1. For the two lower plots, δ = 0.001.
Upper Left: Computing the second largest singular value σ2. We see that RQI-AMRES starts to converge faster after the 1st outer iteration. The plot for σ1 is not shown here as σ1 is well separated from the rest of the spectrum, and both algorithms converge in 1 outer iteration, which is a fairly uninteresting plot. Upper Right: Computing a singular value σ100 near the middle of the spectrum. RQI-AMRES converges faster than RQI-MINRES. Lower Left: Computing a singular value σ120 near the low end of the spectrum. Lower Right: Computing a singular value σ280 within the highly clustered region. RQI-AMRES and RQI-MINRES converge but to the wrong σj.
Algorithm 6.4 Modified RQI for singular vectors
1: Given initial singular value and right singular vector estimates ρ_0, x_0
2: for q = 1, 2, . . . do
3:    Solve for (s, t̄) in
          \begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix} \begin{pmatrix} s \\ \bar{t} \end{pmatrix} = \begin{pmatrix} A x_0 \\ 0 \end{pmatrix}
4:    t = t̄ − x_0
5:    x_q = t/‖t‖
6:    ρ_q = ‖A x_q‖
7: end for
which we refer to as MRQI-AMRES when AMRES is used to solve the
system on line 3.
CACHED GOLUB-KAHAN PROCESS
We implement a GolubKahan class (see MATLAB Code 6.5) to represent the process Bidiag(A, b). The constructor is invoked by

    gk = GolubKahan(A,b)

with A being a matrix or a MATLAB function handle that gives Ax = A(x,1) and A^T y = A(y,2). The scalars αk, βk and vector vk can be retrieved by calling

    [v_k alpha_k beta_k] = gk.getIteration(k)
These values will be retrieved from the cache if they have been computed already. Otherwise, getIteration() computes them directly and
caches the results for later use.
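As a usage sketch (an illustration only, not part of MATLAB Code 6.5), the cached process can be exercised as follows; a repeated request for the same k is answered from the cache rather than recomputed:

    [Afun, sigma, v] = getLinearOperator(500, 300, 1, 1e-10, 120, 1, true);
    b  = Afun(v,1);                            % right-hand side b = A*v
    gk = GolubKahan(Afun, b);                  % no reorthogonalization
    [v1, alpha1, beta1] = gk.getIteration(1);  % computed and cached
    [v2, alpha2, beta2] = gk.getIteration(2);  % one more bidiagonalization step
    [w,  a2,     b2   ] = gk.getIteration(2);  % served from the cache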
REORTHOGONALIZATION
In Section 4.4, reorthogonalization is proposed as a method to speed up LSMR. The major obstacle is the high cost of storing all the v vectors generated by the Golub-Kahan process. For MRQI-AMRES, all such vectors must be stored for reuse in the next MRQI iteration. Therefore the only extra cost to perform reorthogonalization would be the computing cost of modified Gram-Schmidt (MGS). MGS is implemented as in MATLAB Code 6.5. It will be turned on when the constructor is called as GolubKahan(A,b,1). (GolubKahan(A,b) and GolubKahan(A,b,0) are the same: they return a GolubKahan object that does not perform reorthogonalization.) We refer to MRQI-AMRES with reorthogonalization as MRQI-AMRES-Ortho in the following plots.
MATLAB Code 6.5 Cached Golub-Kahan process

classdef GolubKahan < handle
% A cached implementation of Golub-Kahan bidiagonalization
  properties
    V; m; n; A; Afun; u; alphas; betas; reortho; tol = 1e-15;
    kCached = 1;     % number of cached iterations
    kAllocated = 1;  % initial size to allocate
  end

  methods
    function o = GolubKahan(A, b, reortho)
    % @param reortho=1 means perform reorthogonalization on o.V
      if nargin < 3 || isempty(reortho), reortho = 0; end
      o.A = A; o.Afun = A; o.reortho = reortho;
      if isa(A, 'numeric'); o.Afun = @o.AfunPrivate; end
      o.betas  = zeros(o.kAllocated,1); o.betas(1) = norm(b);
      o.alphas = zeros(o.kAllocated,1); o.m = length(b);
      if o.betas(1) > o.tol
        o.u = b/o.betas(1);
        v   = o.Afun(o.u,2);
        o.n = length(v);
        o.V = zeros(o.n,o.kAllocated);
        o.alphas(1) = norm(v);
        if o.alphas(1) > o.tol; o.V(:,1) = v/norm(o.alphas(1)); end
      end
    end

    function [v alpha beta] = getIteration(o,k)
    % @return v_k alpha_k beta_k for the k-th iteration of Golub-Kahan
      while o.kCached < k
        % doubling space for cache
        if o.kCached == o.kAllocated
          o.V      = [o.V zeros(o.n, o.kAllocated)];
          o.alphas = [o.alphas; zeros(o.kAllocated,1)];
          o.betas  = [o.betas;  zeros(o.kAllocated,1)];
          o.kAllocated = o.kAllocated * 2;
        end

        % bidiagonalization
        o.u  = o.Afun(o.V(:,o.kCached),1) - o.alphas(o.kCached)*o.u;
        beta = norm(o.u); alpha = 0;
        if beta > o.tol
          o.u = o.u/beta;
          v   = o.Afun(o.u,2) - beta*o.V(:,o.kCached);
          if o.reortho == 1
            for i = 1:o.kCached; vi = o.V(:,i); v = v - (v'*vi)*vi; end
          end
          alpha = norm(v);
          if alpha > o.tol; v = v/alpha; o.V(:,o.kCached+1) = v; end
        end
        o.kCached = o.kCached + 1;
        o.alphas(o.kCached) = alpha; o.betas(o.kCached) = beta;
      end
      v = o.V(:,k); alpha = o.alphas(k); beta = o.betas(k);
    end

    function out = AfunPrivate(o,x,trans)
      if trans == 1; out = o.A*x; else out = o.A'*x; end
    end
  end
end
NUMERICAL RESULTS
We ran RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho to improve given approximate right singular vectors of our constructed linear operators. The accuracy of each new singular value estimate ρq generated by the three algorithms at each outer iteration is plotted against the cumulative time (in seconds) used. (As before, for the iterates ρq that we want to converge to σ, we use the relative error |ρq − σ|/σ as the accuracy measure.)
In Figure 6.4, we see that MRQI-AMRES takes more time than RQI-AMRES for the early outer iterations because of the cost of allocating memory and caching the Golub-Kahan vectors. For subsequent iterations, more cached vectors are used and less time is needed for Golub-Kahan bidiagonalization. MRQI-AMRES overtakes RQI-AMRES starting
from the third outer iteration. In the same figure, we see that reorthogonalization significantly improves the convergence rate of MRQI-AMRES.
With orthogonality preserved, MRQI-AMRES-Ortho converges faster than
RQI-AMRES even at the first outer iteration.
In Figure 6.5, we compare the performance of the three algorithms
when the linear operator has different multiplication cost. For operators with very low cost, RQI-AMRES converges faster than the cached
methods. The extra time used for caching outweighs the benefits of
reusing cached results. As the linear operator gets more expensive,
MRQI-AMRES and MRQI-AMRES-Ortho outperform RQI-AMRES, as fewer
expensive matrix-vector products are needed in the subsequent RQIs.
In Figure 6.6, we compare the performance of the three algorithms
for refining singular vectors corresponding to various singular values.
For the large singular values (which are well separated from the rest
of the spectrum), RQI converges in very few iterations and there is not
much benefit from caching. For the smaller (and more clustered) singular values, caching saves the multiplication time in subsequent RQIs
and therefore MRQI-AMRES and MRQI-AMRES-Ortho converge faster than
RQI-AMRES. As the singular values become more clustered, the inner
solver for the linear system suffers increasingly from loss of orthogonality. In this situation, reorthogonalization greatly reduces the number of Golub-Kahan steps and therefore MRQI-AMRES-Ortho converges
much faster than MRQI-AMRES.
Next, we compare RQI-AMRES with the MATLAB svds function. svds performs singular vector computation by calling eigs on the augmented matrix \bigl(\begin{smallmatrix} 0 & A \\ A^T & 0 \end{smallmatrix}\bigr), and eigs in turn calls the Fortran library ARPACK to perform symmetric Lanczos iterations.
[Plot: log|(ρ_k − σ)/σ| versus time (s) for RQI-AMRES, MRQI-AMRES, and MRQI-AMRES-Ortho; panel rqi(5000,3000,500,10,10,1,12,12,1,12,1).]
Figure 6.4: Improving an approximate singular value and singular vector for a matrix using RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 5000, n = 3000, C = 10, σ1 = 1, σn = 10^{-10} and clustered singular values. We perturbed the right singular vector v corresponding to the singular value σ500 to be the starting approximate singular vector. MRQI-AMRES lags behind RQI-AMRES at the early iterations because of the extra cost in allocating memory to store the Golub-Kahan vectors, but it quickly catches up and converges faster when the cached vectors are reused in the subsequent iterations. MRQI-AMRES with reorthogonalization converges much faster than without reorthogonalization. In this case, reorthogonalization greatly reduces the number of inner iterations for AMRES, and the subsequent outer iterations are basically free.
[Plots: log|(ρ_k − σ)/σ| versus time (s) for RQI-AMRES, MRQI-AMRES, and MRQI-AMRES-Ortho; five panels labeled C = 1, 5, 10, 20, 50.]
Figure 6.5: Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for linear operators of varying cost. The matrix A is constructed as in Figure 6.4 except for the cost C, which is shown below each plot. C increases from left to right. When C = 1, the time required for caching the Golub-Kahan vectors outweighs the benefits of being able to reuse them later. When the cost C goes up, the benefit of caching Golub-Kahan vectors outweighs the time for saving them. Thus MRQI-AMRES converges faster than RQI-AMRES.
[Plots: log|(ρ_k − σ)/σ| versus time (s) for RQI-AMRES, MRQI-AMRES, and MRQI-AMRES-Ortho; five panels labeled σ1, σ50, σ300, σ500, σ700.]
Figure 6.6: Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for different singular values. The matrix A is constructed as in Figure 6.4.
σ1, σ50: For the larger singular values, both MRQI-AMRES and MRQI-AMRES-Ortho converge slower than RQI-AMRES as convergence for RQI is very fast and the extra time for caching the Golub-Kahan vectors outweighs the benefits of saving them.
σ300, σ500: For the smaller and more clustered singular values, RQI takes more iterations to converge and the caching effect of MRQI-AMRES and MRQI-AMRES-Ortho makes them converge faster than RQI-AMRES.
σ700: As the singular value gets even smaller and less separated, RQI is less likely to converge to the singular value we want. In this example, RQI-AMRES converged to a different singular value. Also, clustered singular values lead to greater loss of orthogonality, and therefore MRQI-AMRES-Ortho converges much faster than MRQI-AMRES.
svds does not accept a linear operator (function handle) as input. We slightly modified svds to allow an operator to be passed to eigs, which finds the eigenvectors corresponding to the largest eigenvalues for a given linear operator. (eigs can find eigenvalues of a matrix B close to any given λ, but if B is an operator, the shift-and-invert operator (B − λI)^{-1} must be provided.)
Figure 6.7 shows the convergence of RQI-AMRES and svds computing three different singular values near the large end of the spectrum.
When only the largest singular value is needed, the algorithms converge at about the same rate. When any other singular value is needed,
svds has to compute all singular values (and singular vectors) from the
largest to the one required. Thus svds takes significantly more time
compared to RQI-AMRES.
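For reference (an illustrative call only, for an explicit matrix A), asking svds for the 20th largest singular value means requesting the 20 largest ones:

    s = svds(A, 20);     % computes the 20 largest singular values
    sigma20 = s(end);    % the one actually wanted

whereas RQI-AMRES refines an estimate of σ20 directly.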
6.3  SINGULAR VECTOR COMPUTATION
In the previous section, we explored algorithms to refine an approximate singular vector. We now focus on finding a singular vector corresponding to a known singular value.
Suppose σ is a singular value of A. If a nonzero \begin{pmatrix} u \\ v \end{pmatrix} satisfies

\begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad (6.9)
then u and v would be the corresponding singular vectors.
[Plots: log|(ρ_k − σ)/σ| versus time (s) for RQI-AMRES and svds; three panels labeled σ1, σ5, σ20.]
Figure 6.7: Convergence of RQI-AMRES and svds for the largest singular values. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 10000, n = 6000, C = 1, σ1 = 1, σn = 10^{-10} and clustered singular values. The approximate right singular vector v + δw is formed with δ = 10^{-1/2} ≈ 0.316.
σ1: For the largest singular value and corresponding singular vector, RQI-AMRES and svds take similar times.
σ5: For the fifth largest singular value and corresponding singular vector, svds (and hence ARPACK) has to compute all five singular values (from the largest to fifth largest), while RQI-AMRES computes the fifth largest singular value directly. Therefore, RQI-AMRES takes less time than svds.
σ20: As svds needs to compute the largest 20 singular values, it takes significantly more time than RQI-AMRES.
The inverse iteration approach to finding the null vector is to take a random vector
b and solve the system
\begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}. \qquad (6.10)
Whether a direct or iterative solver is used, we expect the solution (s, x) to be very large, but the normalized vectors u = s/‖s‖, v = x/‖x‖ will
satisfy (6.9) accurately and hence be good approximations to the left
and right singular vectors of A.
As noted in Choi [13], it is not necessary to run an iterative solver
until the solution norm becomes very large. If one could stop the solver
appropriately at a least-squares solution, then the residual vector of
that solution would be a null vector of the matrix. Here we apply this
idea to singular vector computation.
For the problem min kAx − bk, if x is a solution, we know that the
residual vector r = b − Ax satisfies ATr = 0. Therefore r is a null vector
for AT .
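As a small illustration of this fact (dense backslash is used purely for exposition; A and b are assumed given explicitly):

    x = A \ b;           % least-squares solution of min ||Ax - b||
    r = b - A*x;         % its residual
    disp(norm(A'*r))     % essentially zero: r is a null vector of A'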
Algorithm 6.5 Singular vector computation via residual vector
1: Given matrix A and singular value σ
2: Find a least-squares solution (s, x) of the singular system:
3:      \min_{s,x} \left\| \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} \right\|
4: Compute \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix}
5: Output u, v as the left and right singular vectors
Thus, if s and x solve the singular least-squares problem

\min_{s,x} \left\| \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} \right\|, \qquad (6.11)

then the corresponding residual vector

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} \qquad (6.12)
would satisfy (6.9). This gives us Av = σu and ATu = σv as required.
Therefore u, v are left and right singular vectors corresponding to σ.
We summarize this procedure as Algorithm 6.5.
The key to this algorithm lies in finding the residual of the least-squares system accurately and efficiently. An obvious choice would be
applying MINRES to (6.11) and computing the residual from the solution returned by MINRES.
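As a rough illustration of this MINRES-based approach (a sketch only, assuming A is available as an explicit matrix and m, n, sigma are given):

    K   = @(w) [ -sigma*w(1:m) + A*w(m+1:end) ;  A'*w(1:m) - sigma*w(m+1:end) ];
    b   = randn(m,1);
    rhs = [b; zeros(n,1)];
    sx  = minres(K, rhs, 1e-10, 2*(m+n));   % approximate solution of (6.11)
    res = rhs - K(sx);                      % residual, as in (6.12)
    u   = res(1:m)/norm(res(1:m));
    v   = res(m+1:end)/norm(res(m+1:end));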
AMRES solves the same problem (6.11) with half the computational
cost by computing x only. However, this doesn’t allow us to construct
the residual from x. We therefore developed a specialized version of
AMRES, called AMRESR, that directly computes the v part of the residual vector (6.12) without computing x or s. The u part can be recovered
from v using σu = Av.
If σ is a repeated singular value, the same procedure can be followed using a new random b that is first orthogonalized with respect
to v.
6.3.1  AMRESR
To derive an iterative update for \hat{r}_k, we begin with (5.5):

\hat{r}_k = \hat{V}_{k+1} (\beta_1 e_1 - \hat{H}_k \hat{y}_k)
          = \hat{V}_{k+1} Q_{k+1}^T \left( \begin{pmatrix} z_k \\ \bar{\zeta}_{k+1} \end{pmatrix} - \begin{pmatrix} R_k \\ 0 \end{pmatrix} \hat{y}_k \right)
          = \hat{V}_{k+1} Q_{k+1}^T (\bar{\zeta}_{k+1} e_{k+1}).

By defining p_k = \hat{V}_{k+1} Q_{k+1}^T e_{k+1}, we continue with

p_k = \hat{V}_{k+1} Q_{k+1}^T e_{k+1}
    = \hat{V}_{k+1} \begin{pmatrix} Q_k^T & \\ & 1 \end{pmatrix}
      \begin{pmatrix} I_{k-1} & & \\ & c_k & -s_k \\ & s_k & c_k \end{pmatrix} e_{k+1}
    = \begin{pmatrix} \hat{V}_k Q_k^T & \hat{v}_{k+1} \end{pmatrix}
      \begin{pmatrix} I_{k-1} & & \\ & c_k & -s_k \\ & s_k & c_k \end{pmatrix} e_{k+1}
    = -s_k \hat{V}_k Q_k^T e_k + c_k \hat{v}_{k+1}
    = -s_k p_{k-1} + c_k \hat{v}_{k+1}.

With \hat{r}_k \equiv \begin{pmatrix} r_k^u \\ r_k^v \end{pmatrix} and p_k \equiv \begin{pmatrix} p_k^u \\ p_k^v \end{pmatrix}, we can write an update rule for p_k^v:

p_0^v = 0, \qquad
p_k^v = \begin{cases} -s_k p_{k-1}^v + c_k v_{(k+1)/2} & (k \text{ odd}) \\ -s_k p_{k-1}^v & (k \text{ even}) \end{cases}

where the separation of the two cases follows from (5.4), and

r_k^v = \bar{\zeta}_{k+1} p_k^v \qquad (6.13)

can be used to output r_k^v when AMRESR terminates. With the above recurrence relations for p_k^v, we summarize AMRESR in Algorithm 6.6.
6.3.2  AMRESR EXPERIMENTS
We compare the convergence of AMRESR versus MINRES to find singular vectors corresponding to a known singular value.
Algorithm 6.6 Algorithm AMRESR
1: (Initialize)
   β_1 u_1 = b,   α_1 v_1 = A^T u_1,   ζ̄_1 = β_1,   ρ̄_1 = γ,   θ̄_1 = α_1,   p_0^v = 0
2: for l = 1, 2, 3, . . . do
3:    (Continue the bidiagonalization)
      β_{l+1} u_{l+1} = A v_l − α_l u_l
      α_{l+1} v_{l+1} = A^T u_{l+1} − β_{l+1} v_l
4:    for k = 2l − 1, 2l do
5:       (Setup temporary variables)
         λ = δ,  α = α_l,       β = β_{l+1}    (k odd)
         λ = γ,  α = β_{l+1},   β = α_{l+1}    (k even)
6:       (Construct and apply rotation Q_{k+1,k})
         ρ_k = (ρ̄_k^2 + α^2)^{1/2}
         c_k = ρ̄_k/ρ_k              s_k = α/ρ_k
         θ_k = c_k θ̄_k + s_k λ      ρ̄_{k+1} = −s_k θ̄_k + c_k λ
         η_k = s_k β                θ̄_{k+1} = c_k β
         ζ_k = c_k ζ̄_k              ζ̄_{k+1} = −s_k ζ̄_k
7:       (Update estimate of p)
         p_k^v = −s_k p_{k−1}^v + c_k v_{(k+1)/2}    (k odd)
         p_k^v = −s_k p_{k−1}^v                       (k even)
8:    end for
9: end for
10: (Compute r for output)
    r_k^v = ζ̄_{k+1} p_k^v
MATLAB Code 6.6 Error measure for singular vector

function err = singularVectorError(v)
  v   = v/norm(v);
  Av  = Afun(v,1);                   % Afun(v,1) = A*v
  u   = Av/norm(Av);
  err = norm(Afun(u,2) - sigma*v);   % Afun(u,2) = A'*u
end
TEST DATA
We use the method of Section 6.2.1 to construct a linear operator.
METHODS
We compare two methods for computing a singular vector corresponding to a known singular value σ, where σ is taken as one of the d_i above. We first generate a random m-vector b using randn('state',1) and b = randn(m,1). Then we apply one of the following methods:
1. Run AMRESR on (6.10). For each k in Algorithm 6.6, compute r_k^v by (6.13) at each iteration and take v_k = r_k^v as the estimate of the right singular vector corresponding to σ.
2. Inverse iteration: run MINRES on (A^T A − σ^2 I)x = A^T b (see the sketch after this list). For the l-th MINRES iteration (l = 1, 2, 3, . . .), set k = 2l and take v_k = x_l as the estimate of the right singular vector. (This makes k the number of matrix-vector products for each method.)
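A minimal sketch of the inverse-iteration variant, assuming the function handle Afun of MATLAB Code 6.4 and the b and σ defined above:

    op  = @(x) Afun(Afun(x,1),2) - sigma^2*x;   % (A'*A - sigma^2*I)*x
    rhs = Afun(b,2);                            % A'*b
    xl  = minres(op, rhs, 1e-12, 1000);
    vk  = xl/norm(xl);                          % right singular vector estimate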
MEASURE OF CONVERGENCE
For the v_k from each method, we measure convergence by ‖A^T u_k − σ v_k‖ (with u_k = A v_k/‖A v_k‖) as shown in MATLAB Code 6.6. This measure
requires two matrix-vector products, which is the same complexity as
each Golub-Kahan step. Thus it is too expensive to be used as a practical stopping rule. It is used solely for the plots to analyze the convergence behavior. In practice, the stopping rule would be based on the
condition estimate (5.16).
NUMERICAL RESULTS
We ran the experiments described above on rectangular matrices of size
24000 × 18000. Our observations are as follows:
1. AMRESR always converges faster than MINRES. AMRESR reaches a minimum error of about 10^{-8} for all test examples, and then
starts to diverge. In contrast, MINRES is able to converge to singular vectors of higher precision than AMRESR. Thus, AMRESR is
more suitable for applications where high precision of the singular vectors is not required.
2. The gain in convergence speed depends on the position of the
singular value within the spectrum. From Figure 6.8, there is
more improvement for singular vectors near either end of the
spectrum, and less improvement for those in the middle of the
spectrum.
6.4  ALMOST SINGULAR SYSTEMS
Björck [6] has derived an algorithm based on Rayleigh-quotient iteration for computing (ordinary) TLS solutions. The algorithm involves
repeated solution of positive definite systems (ATA − σ 2 I)x = ATb by
means of a modified/extended version of CGLS (called CGTLS) that is
able to incorporate the shift. A key step in CGTLS is computation of
\delta_k = \|p_k\|_2^2 - \sigma^2 \|q_k\|_2^2.
Clearly we must have σ < σmin (A) for the system to be positive definite. However, this condition cannot be guaranteed and a heuristic is
adopted to repeat the computation with a smaller value of σ [6, (4.5)].
Another drawback of CGTLS is that it depends on a complete Cholesky factor of A^T A. This is computed by a sparse QR of A, which is not always practical when A is large. Since AMRES handles indefinite systems and does not depend on a sparse factorization, it is applicable in
more situations.
[Plots: ‖A^T u − σv‖ versus number of A^T u and Av products for AMRESR, AMRESR (continued), and MINRES; four panels amresrPlots(24000,18000,{1,100,9000,18000},...).]
Figure 6.8: Convergence of AMRESR and MINRES for computing singular vectors corresponding to a known singular value as described in Section 6.3.2 for a 24000 × 18000 matrix. The blue (solid) curve represents where AMRESR would have stopped if there were a suitable limit on the condition number. The green (dash) curve shows how AMRESR will diverge if it continues beyond this stopping criterion. The red (dash-dot) curve represents the convergence behavior of MINRES. The linear operator is constructed using the method of Section 6.2.1.
Upper Left: Computing the singular vector corresponding to the largest singular value σ1. Upper Right: Computing the singular vector corresponding to the 100th largest singular value σ100. Lower Left: Computing the singular vector corresponding to a singular value in the middle of the spectrum, σ9000. Lower Right: Computing the singular vector corresponding to the smallest singular value σ18000.
7  CONCLUSIONS AND FUTURE DIRECTIONS
7.1  CONTRIBUTIONS
The main contributions of this thesis involve three areas: providing a
better understanding of the popular MINRES algorithm; development
and analysis of LSMR for least-squares problems; and development and
applications of AMRES for the negatively-damped least-squares problem. We summarize our findings for each of these areas in the next
three sections.
Chapters 1 to 4 discussed a number of iterative solvers for linear
equations. The flowchart in Figure 7.1 helps decide which methods to
use for a particular problem.
7.1.1  MINRES
In Chapter 2, we proved a number of properties for MINRES applied
to a positive definite system. Table 7.1 compares these properties with
known ones for CG. MINRES has a number of monotonic properties that
can make it more favorable, especially when the iterative algorithm
needs to be terminated early.
In addition, our experimental results show that MINRES converges
faster than CG in terms of backward error, often by as much as 2 orders
of magnitude (Figure 2.2). On the other hand, CG converges somewhat
faster than MINRES in terms of both kxk − x∗ kA and kxk − x∗ k (same
figure).
Table 7.1 Comparison of CG and MINRES properties on an spd system

                       CG                        MINRES
  ‖x_k‖                ր [68, Thm 2.1]           ր (Thm 2.1.6)
  ‖x^* − x_k‖          ց [38, Thm 4:3]           ց (Thm 2.1.7) [38, Thm 7:5]
  ‖x^* − x_k‖_A        ց [38, Thm 6:3]           ց (Thm 2.1.8) [38, Thm 7:4]
  ‖r_k‖                not monotonic             ց [52] [38, Thm 7:2]
  ‖r_k‖/‖x_k‖          not monotonic             ց (Thm 2.2.1)

  ր monotonically increasing    ց monotonically decreasing
[Figure 7.1: Flowchart on choosing iterative solvers. Depending on whether A is square, symmetric, positive definite, or singular, whether A^T is available, whether the data are noisy, whether A is tall or fat, and whether ‖x_k − x^*‖, ‖x_k − x^*‖_A, ‖r_k‖, or ‖r_k‖/‖x_k‖ is to be minimized, the chart recommends CG, MINRES, MINRES-QLP, GMRES, BiCG, QMR, BiCGStab, IDR(s), LSQR, or LSMR.]
Table 7.2 Comparison of LSQR and LSMR properties

                       LSQR                 LSMR
  ‖x_k‖                ր (Thm 3.3.1)        ր (Thm 3.3.6)
  ‖x_k − x^*‖          ց (Thm 3.3.2)        ց (Thm 3.3.7)
  ‖A^T r_k‖            not monotonic        ց (Thm 3.3.5)
  ‖r_k‖                ց (Thm 3.3.4)        ց (Thm 3.3.11)

  x_k converges to the minimum-norm x^* for singular systems
  ‖E_2^{LSQR}‖ ≥ ‖E_2^{LSMR}‖
  ր monotonically increasing    ց monotonically decreasing
7.1.2  LSMR
We presented LSMR, an iterative algorithm for square or rectangular
systems, along with details of its implementation, theoretical properties, and experimental results to suggest that it has advantages over
the widely adopted LSQR algorithm.
As in LSQR, theoretical and practical stopping criteria are provided
for solving Ax = b and min kAx − bk with optional Tikhonov regularization. Formulae for computing krk k, kATrk k, kxk k and estimating
kAk and cond(A) in O(1) work per iteration are available.
For least-squares problems, the Stewart backward error estimate
kE2 k (4.4) is computable in O(1) work and is proved to be at least as
small as that of LSQR. In practice, kE2 k for LSMR is smaller than that of
LSQR by 1 to 2 orders of magnitude, depending on the condition of the
given system. This often allows LSMR to terminate significantly sooner
than LSQR.
In addition, kE2 k seems experimentally to be very close to the optimal backward error µ(xk ) at each LSMR iterate xk (Section 4.1.1). This
allows LSMR to be stopped reliably using rules based on kE2 k.
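As a small illustration of such a rule (a sketch only, forming the products explicitly for clarity; inside LSMR the quantities ‖A^T r_k‖, ‖r_k‖, and an estimate of ‖A‖ are maintained in O(1) work per iteration, so no extra products with A are needed):

    atol = 1e-8;                     % user-chosen tolerance (assumed)
    r    = b - A*x;                  % current residual
    E2   = norm(A'*r)/norm(r);       % Stewart backward error estimate
    if E2 <= atol*norm(A,'fro')
        % stop: x is an acceptable least-squares solution
    end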
MATLAB, Python, and Fortran 90 implementations of LSMR are available from [42]. They all allow local reorthogonalization of Vk.
7.1.3  AMRES
We developed AMRES, an iterative algorithm for solving the augmented system

\begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix},

where γδ > 0. It is equivalent to MINRES on the same system and is reliable even when \bigl(\begin{smallmatrix} \gamma I & A \\ A^T & \delta I \end{smallmatrix}\bigr) is indefinite or singular. It is based on
the Golub-Kahan bidiagonalization of A and requires half the computational cost compared to MINRES. AMRES is applicable to Curtis-Reid
scaling; improving approximate singular vectors; and computing singular vectors from known singular values.
For Curtis-Reid scaling, AMRES sometimes exhibits faster convergence compared to the specialized version of CG proposed by Curtis
and Reid. However, both algorithms exhibit similar performance on
many test matrices.
Using AMRES as the solver for inner iterations, we have developed
Rayleigh quotient-based algorithms (RQI-AMRES, MRQI-AMRES, MRQI-AMRES-Ortho) that refine a given approximate singular vector to an accurate one. They converge faster than the MATLAB svds function for
singular vectors corresponding to interior singular values. When the
singular values are clustered, or the linear operator A is expensive,
MRQI-AMRES and MRQI-AMRES-Ortho are more stable and converge
much faster than RQI-AMRES or direct use of MINRES within RQI.
For a given singular value, we developed AMRESR, a modified version of AMRES, to compute the corresponding singular vectors to a precision of O(√ε). Our algorithm converges faster than inverse iteration.
If singular vectors of higher precision are needed, inverse iteration is
preferred.
7.2  FUTURE DIRECTIONS
Several ideas are closely related to the research in this thesis, but their potential has not yet been fully explored. We summarize these directions below as pointers for future research.
7.2.1  CONJECTURE
From our experiments, we conjecture that the optimal backward error
µ(xk ) (4.1) and its approximate µ̃(xk ) (4.5) decrease monotonically for
LSMR.
7.2.2  PARTIAL REORTHOGONALIZATION
Larsen [58] uses partial reorthogonalization of both Vk and Uk within
his PROPACK software for computing a set of singular values and vectors for a sparse rectangular matrix A. This involves a smaller computational cost compared to full reorthogonalization, as each vk or uk
does not need to be orthogonalized against each of the previous vi's or ui's. Similar techniques might prove helpful within LSMR and
AMRES.
7.2.3  EFFICIENT OPTIMAL BACKWARD ERROR ESTIMATES
Both LSQR and LSMR stop when kE2 k, an upper bound for the optimal
backward error that’s computable in O(1) work per iteration, is sufficiently small. Ideally, an iterative algorithm for least-squares should
use the optimal backward error µ(xk ) itself in the stopping rule. However, direct computation using (4.2) is more expensive than solving the
least-squares problem itself. An accurate approximation is given in
(4.5). This cheaper estimate involves solving a least-squares problem
at each iteration of any iterative solver, and thus is still not practical for
use as a stopping rule. It would be of great interest if a provably accurate estimate of the optimal backward error could be found in O(n)
work at each iteration, as it would allow iterative solvers to stop precisely at the iteration where the desired accuracy has been attained.
More precise stopping rules have been derived recently by Arioli
and Gratton [2] and Titley-Peloquin et al. [11; 50; 75]. The rules allow
for uncertainty in both A and b, and may prove to be useful for LSQR,
LSMR, and least-squares methods in general.
7.2.4  SYMMLQ-BASED LEAST-SQUARES SOLVER
LSMR is derived to be a method mathematically equivalent to MINRES
on the normal equation, and thus LSMR minimizes kATrk k for xk in the
Krylov subspace Kk (ATA, ATb).
For a symmetric system Ax = b, SYMMLQ [52] solves
\min_{x_k \in A\mathcal{K}_k(A,b)} \|x_k - x^*\|
at the k-th iteration [47], [27, p65]. Thus if we derive an algorithm for
the least-squares problem that is mathematically equivalent to SYMMLQ
on the normal equation but uses Golub-Kahan bidiagonalization, this
new algorithm will minimize kxk − x∗ k for xk in the Krylov subspace
ATAKk (ATA, ATb). This may produce smaller errors compared to LSQR
or LSMR, and may be desirable in applications where the algorithm has
to be terminated early with a smaller error (rather than a smaller backward error).
BIBLIOGRAPHY
[1] M. Arioli. A stopping criterion for the conjugate gradient algorithm in a finite element method framework. Numerische Mathematik, 97(1):1–24, 2004. doi:10.1007/s00211-003-0500-y.
[2] M. Arioli and S. Gratton. Least-squares problems, normal equations, and stopping criteria for the conjugate gradient method. Technical Report RAL-TR-2008-008, Rutherford Appleton Laboratory, Oxfordshire, UK, 2008.
[3] M. Arioli, I. Duff, and D. Ruiz. Stopping criteria for iterative solvers. SIAM J. Matrix Anal. Appl., 13:138, 1992.
doi:10.1137/0613012.
[4] M. Arioli, E. Noulard, and A. Russo. Stopping criteria for iterative methods: applications to PDE’s. Calcolo, 38(2):
97–112, 2001. doi:10.1007/s100920170006.
[5] S. J. Benbow. Solving generalized least-squares problems with LSQR. SIAM J. Matrix Anal. Appl., 21(1):166–177,
1999. doi:10.1137/S0895479897321830.
[6] Å. Björck, P. Heggernes, and P. Matstoms. Methods for large scale total least squares problems. SIAM J. Matrix
Anal. Appl., 22(2), 2000. doi:10.1137/S0895479899355414.
[7] F. Cajori. A History of Mathematics. The MacMillan Company, New York, second edition, 1919.
[8] D. Calvetti, B. Lewis, and L. Reichel. An L-curve for the MINRES method. In Franklin T. Luk, editor, Advanced Signal Processing Algorithms, Architectures, and Implementations X, volume 4116, pages 385–395. SPIE, 2000.
doi:10.1117/12.406517.
[9] D. Calvetti, S. Morigi, L. Reichel, and F. Sgallari. Computable error bounds and estimates for the conjugate gradient
method. Numerical Algorithms, 25:75–88, 2000. doi:10.1023/A:1016661024093.
[10] F. Chaitin-Chatelin and V. Frayssé. Lectures on Finite Precision Computations. SIAM, Philadelphia, 1996. doi:10.1137/1.9780898719673.
[11] X.-W. Chang, C. C. Paige, and D. Titley-Péloquin. Stopping criteria for the iterative solution of linear least squares
problems. SIAM J. Matrix Anal. Appl., 31(2):831–852, 2009. doi:10.1137/080724071.
[12] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam.
Algorithm 887: CHOLMOD, Supernodal
Sparse Cholesky Factorization and Update/Downdate. ACM Trans. Math. Softw., 35:22:1–22:14, October 2008.
doi:10.1145/1391989.1391995.
[13] S.-C. Choi. Iterative Methods for Singular Linear Equations and Least-Squares Problems. PhD thesis, Stanford University,
Stanford, CA, 2006.
[14] S.-C. Choi, C. C. Paige, and M. A. Saunders. MINRES-QLP: a Krylov subspace method for indefinite or singular
symmetric systems. SIAM J. Sci. Comput., 33(4):00–00, 2011. doi:10.1137/100787921.
[15] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-region Methods, volume 1. SIAM, Philadelphia, 2000. doi:10.1137/1.9780898719857.
[16] G. Cramer. Introduction à l'Analyse des Lignes Courbes Algébriques. 1750.
[17] A. R. Curtis and J. K. Reid. On the automatic scaling of matrices for Gaussian elimination. J. Inst. Maths. Applics.,
10:118–124, 1972. doi:10.1093/imamat/10.1.118.
[18] T. A. Davis. University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices.
[19] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK Users’ Guide. SIAM, Philadelphia, 1979.
doi:10.1137/1.9781611971811.
[20] F. A. Dul. MINRES and MINERR are better than SYMMLQ in eigenpair computations. SIAM J. Sci. Comput., 19(6):
1767–1782, 1998. doi:10.1137/S106482759528226X.
[21] J. M. Elble and N. V. Sahinidis. Scaling linear optimization problems prior to application of the simplex method.
Computational Optimization and Applications, pages 1–27, 2011. doi:10.1007/s10589-011-9420-4.
[22] D. K. Faddeev and V. N. Faddeeva. Computational Methods of Linear Algebra. Freeman, London, 1963. doi:10.1007/BF01086544.
[23] R. Fletcher. Conjugate gradient methods for indefinite systems. In G. Watson, editor, Numerical Analysis, volume
506 of Lecture Notes in Mathematics, pages 73–89. Springer, Berlin / Heidelberg, 1976. doi:10.1007/BFb0080116.
[24] D. C.-L. Fong and M. A. Saunders. LSMR: An iterative algorithm for sparse least-squares problems. SIAM J. Sci.
Comput., 33:2950–2971, 2011. doi:10.1137/10079687X.
[25] V. Frayssé. The power of backward error analysis. Technical Report TH/PA/00/65, CERFACS, 2000.
[26] R. W. Freund and N. M. Nachtigal. QMR: a quasi-minimal residual method for non-Hermitian linear systems.
Numerische Mathematik, 60:315–339, 1991. doi:10.1007/BF01385726.
[27] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. Acta Numerica, 1:57–100, 1992.
doi:10.1017/S0962492900002245.
[28] D. R. Fulkerson and P. Wolfe. An algorithm for scaling matrices. SIAM Review, 4(2):142–146, 1962. doi:10.1137/1004032.
[29] G. H. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. J. of the Society for
Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965. doi:10.1137/0702016.
[30] G. H. Golub and G. A. Meurant. Matrices, moments and quadrature II; How to compute the norm of the error in
iterative methods. BIT Numerical Mathematics, 37:687–705, 1997. doi:10.1007/BF02510247.
[31] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, third edition,
1996.
[32] S. Gratton, P. Jiránek, and D. Titley-Péloquin. On the accuracy of the Karlson-Waldén estimate of the backward
error for linear least squares problems. Submitted for publication (SIMAX), 2011.
[33] J. F. Grcar. Mathematicians of Gaussian elimination. Notices of the AMS, 58(6):782–792, 2011.
[34] J. F. Grcar, M. A. Saunders, and Z. Su. Estimates of optimal backward perturbations for linear least squares problems. Report SOL 2007-1, Department of Management Science and Engineering, Stanford University, Stanford, CA,
2007. 21 pp.
[35] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997. doi:10.1137/1.9781611970937.
[36] R. W. Hamming. Introduction to Applied Numerical Analysis. McGraw-Hill computer science series. McGraw-Hill,
1971.
[37] K. Hayami, J.-F. Yin, and T. Ito. GMRES methods for least squares problems. SIAM J. Matrix Anal. Appl., 31(5):
2400–2430, 2010. doi:10.1137/070696313.
[38] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the
National Bureau of Standards, 49(6):409–436, 1952.
[39] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, second edition, 2002. doi:10.1137/1.9780898718027.
[40] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral
operators. J. Res. Nat. Bur. Standards, 45:255–282, 1950.
[41] L. Ljung. System Identification: Theory for the User. Prentice Hall, second edition, January 1999.
[42] LSMR. Software for linear systems and least squares. http://www.stanford.edu/group/SOL/software.html.
[43] D. G. Luenberger. Hyperbolic pairs in the method of conjugate gradients. SIAM J. Appl. Math., 17:1263–1267, 1969.
doi:10.1137/0117118.
[44] D. G. Luenberger. The conjugate residual method for constrained minimization problems. SIAM J. Numer. Anal., 7
(3):390–398, 1970. doi:10.1137/0707032.
[45] W. Menke. Geophysical Data Analysis: Discrete Inverse Theory, volume 45. Academic Press, San Diego, 1989.
[46] G. A. Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, volume 19
of Software, Environments, and Tools. SIAM, Philadelphia, 2006.
[47] J. Modersitzki, H. A. Van der Vorst, and G. L. G. Sleijpen. Differences in the effects of rounding errors
in Krylov solvers for symmetric indefinite linear systems. SIAM J. Matrix Anal. Appl., 22(3):726–751, 2001.
doi:10.1137/S0895479897323087.
[48] W. Oettli and W. Prager. Compatibility of approximate solution of linear equations with given error bounds for
coefficients and right-hand sides. Numerische Mathematik, 6:405–409, 1964. doi:10.1007/BF01386090.
[49] A. M. Ostrowski. On the convergence of the Rayleigh quotient iteration for the computation of the characteristic
roots and vectors. I. Archive for Rational Mechanics and Analysis, 1(1):233–241, 1957. doi:10.1007/BF00298007.
[50] P. Jiránek and D. Titley-Péloquin. Estimating the backward error in LSQR. SIAM J. Matrix Anal. Appl., 31(4):2055–
2074, 2010. doi:10.1137/090770655.
[51] C. C. Paige. Bidiagonalization of matrices and solution of linear equations. SIAM J. Numer. Anal., 11(1):197–209,
1974. doi:10.1137/0711019.
[52] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12
(4):617–629, 1975. doi:10.1137/0712047.
[53] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM
Trans. Math. Softw., 8(1):43–71, 1982. doi:10.1145/355984.355989.
[54] C. C. Paige and M. A. Saunders. Algorithm 583; LSQR: Sparse linear equations and least-squares problems. ACM
Trans. Math. Softw., 8(2):195–209, 1982. doi:10.1145/355993.356000.
[55] C. C. Paige, M. Rozloznik, and Z. Strakos. Modified Gram-Schmidt (MGS), least squares, and backward stability of
MGS-GMRES. SIAM J. Matrix Anal. Appl., 28(1):264–284, 2006. doi:10.1137/050630416.
[56] B. N. Parlett and W. Kahan. On the convergence of a practical QR algorithm. Information Processing, 68:114–118,
1969.
[57] V. Pereyra, 2010. Private communication.
[58] PROPACK. Software for SVD of sparse matrices. http://soi.stanford.edu/~rmunk/PROPACK/.
[59] B. J. W. S. Rayleigh. The Theory of Sound, volume 2. Macmillan, 1896.
[60] J. K. Reid. The use of conjugate gradients for systems of linear equations possessing “Property A”. SIAM J. Numer.
Anal., 9(2):325–332, 1972. doi:10.1137/0709032.
[61] J. L. Rigal and J. Gaches. On the compatibility of a given solution with the data of a linear system. J. ACM, 14(3):
543–548, 1967. doi:10.1145/321406.321416.
[62] Y. Saad. Krylov subspace methods for solving large unsymmetric linear systems. Mathematics of Computation, 37
(155):pp. 105–126, 1981. doi:10.2307/2007504.
[63] Y. Saad and M. H. Schultz. GMRES: a generalized minimum residual algorithm for solving nonsymmetric linear
systems. SIAM J. Sci. and Statist. Comput., 7(3):856–869, 1986. doi:10.1137/0907058.
[64] K. Shen, J. N. Crossley, A. W. C. Lun, and H. Liu. The Nine Chapters on the Mathematical Art: Companion and Commentary. Oxford University Press, New York, 1999.
[65] H. D. Simon and H. Zha. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Comput., 21(6):2257–2274, 2000. doi:10.1137/S1064827597327309.
[66] P. Sonneveld. CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 10:36–52,
1989. doi:10.1137/0910004.
[67] P. Sonneveld and M. B. van Gijzen. IDR(s): A family of simple and fast algorithms for solving large nonsymmetric
systems of linear equations. SIAM J. Sci. Comput., 31:1035–1062, 2008. doi:10.1137/070685804.
[68] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal., 20
(3):626–637, 1983. doi:10.1137/0720042.
[69] G. W. Stewart. An inverse perturbation theorem for the linear least squares problem. SIGNUM Newsletter, 10:39–40,
1975. doi:10.1145/1053197.1053199.
[70] G. W. Stewart. Research, development and LINPACK. In J. R. Rice, editor, Mathematical Software III, pages 1–14.
Academic Press, New York, 1977.
[71] G. W. Stewart. The QLP approximation to the singular value decomposition. SIAM J. Sci. Comput., 20(4):1336–1348,
1999. ISSN 1064-8275. doi:10.1137/S1064827597319519.
[72] E. Stiefel. Relaxationsmethoden bester strategie zur lösung linearer gleichungssysteme. Comm. Math. Helv., 29:
157–179, 1955. doi:10.1007/BF02564277.
[73] Z. Su. Computational Methods for Least Squares Problems and Clinical Trials. PhD thesis, SCCM, Stanford University,
Stanford, CA, 2005.
[74] Y. Sun. The Filter Algorithm for Solving Large-Scale Eigenproblems from Accelerator Simulations. PhD thesis, Stanford
University, Stanford, CA, 2003.
[75] D. Titley-Péloquin. Backward Perturbation Analysis of Least Squares Problems. PhD thesis, School of Computer Science,
McGill University, Montreal, PQ, 2010.
[76] D. Titley-Peloquin. Convergence of backward error bounds in LSQR and LSMR. Private communication, 2011.
[77] J. A. Tomlin. On scaling linear programming problems. In Computational Practice in Mathematical Programming, volume 4 of Mathematical Programming Studies, pages 146–166. Springer, Berlin / Heidelberg, 1975.
doi:10.1007/BFb0120718.
[78] H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric
linear systems. SIAM J. Sci. Comput., 13(2):631–644, 1992. doi:10.1137/0913035.
[79] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems. Cambridge University Press, Cambridge, first
edition, 2003. doi:10.1017/CBO9780511615115.
[80] S. J. van der Walt. Super-resolution Imaging. PhD thesis, Stellenbosch University, Cape Town, South Africa, 2010.
[81] M. B. van Gijzen. IDR(s) website. http://ta.twi.tudelft.nl/nw/users/gijzen/IDR.html.
[82] M. B. van Gijzen. Private communication, Dec 2010.
[83] B. Waldén, R. Karlson, and J.-G. Sun. Optimal backward perturbation bounds for the linear least squares problem.
Numerical Linear Algebra with Applications, 2(3):271–286, 1995. doi:10.1002/nla.1680020308.
[84] D. S. Watkins. Fundamentals of Matrix Computations. Pure and Applied Mathematics. Wiley, Hoboken, NJ, third
edition, 2010.
[85] P. Wesseling and P. Sonneveld. Numerical experiments with a multiple grid and a preconditioned Lanczos type
method. In Reimund Rautmann, editor, Approximation Methods for Navier-Stokes Problems, volume 771 of Lecture
Notes in Mathematics, pages 543–562. Springer, Berlin / Heidelberg, 1980. doi:10.1007/BFb0086930.
[86] F. Xue and H. C. Elman. Convergence analysis of iterative solvers in inexact Rayleigh quotient iteration. SIAM J.
Matrix Anal. Appl., 31(3):877–899, 2009. doi:10.1137/080712908.
[87] D. Young. Iterative methods for solving partial difference equations of elliptic type. Trans. Amer. Math. Soc, 76(92):
111, 1954. doi:10.2307/1990745.
[88] D. M. Young. Iterative Methods for Solving Partial Difference Equations of Elliptic Type. PhD thesis, Harvard University,
Cambridge, MA, 1950.