MINIMUM-RESIDUAL METHODS FOR SPARSE LEAST-SQUARES USING GOLUB-KAHAN BIDIAGONALIZATION

A DISSERTATION SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL AND MATHEMATICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

David Chin-lung Fong
December 2011

© 2011 by Chin Lung Fong. All Rights Reserved. Re-distributed by Stanford University under license with the author. This dissertation is online at: http://purl.stanford.edu/sd504kj0427

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Michael Saunders, Primary Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Margot Gerritsen)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Walter Murray)

Approved for the Stanford University Committee on Graduate Studies. (Patricia J. Gumport, Vice Provost Graduate Education)

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

ABSTRACT

For 30 years, LSQR and its theoretical equivalent CGLS have been the standard iterative solvers for large rectangular systems Ax = b and least-squares problems min ‖Ax − b‖. They are analytically equivalent to symmetric CG on the normal equation A^T A x = A^T b, and they reduce ‖r_k‖ monotonically, where r_k = b − A x_k is the k-th residual vector. The techniques pioneered in the development of LSQR allow better algorithms to be developed for a wider range of problems.

We derive LSMR, an algorithm that is similar to LSQR but exhibits better convergence properties. LSMR is equivalent to applying MINRES to the normal equation, so that the error ‖x∗ − x_k‖, the residual ‖r_k‖, and the residual of the normal equation ‖A^T r_k‖ all decrease monotonically. In practice we observe that the Stewart backward error ‖A^T r_k‖/‖r_k‖ is usually monotonic and very close to optimal. LSMR has essentially the same computational cost per iteration as LSQR, but the Stewart backward error is always smaller. Thus if iterations need to be terminated early, it is safer to use LSMR. LSQR and LSMR are based on Golub-Kahan bidiagonalization.

Following the analysis of LSMR, we leverage the techniques used there to construct algorithm AMRES for negatively-damped least-squares systems (A^T A − δ² I)x = A^T b, again using Golub-Kahan bidiagonalization. Such problems arise in total least-squares, Rayleigh quotient iteration (RQI), and Curtis-Reid scaling for rectangular sparse matrices. Our solver AMRES provides a stable method for these problems. AMRES allows caching and reuse of the Golub-Kahan vectors across RQIs, and can be used to compute any of the singular vectors of A, given a reasonable estimate of the singular vector or an accurate estimate of the singular value.

ACKNOWLEDGEMENTS

First and foremost, I am extremely fortunate to have met Michael Saunders during my second research rotation in ICME. He designed the optimal research project for me from the very beginning, and patiently guided me through every necessary step. He devotes an enormous amount of time and consideration to his students.
Michael is the best advisor that any graduate student could ever have.

Our work would not be possible without standing on the shoulders of a giant like Chris Paige, who designed MINRES and LSQR together with Michael 30 years ago. Chris gave lots of helpful comments on reorthogonalization and various aspects of this work. I am thankful for his support.

David Titley-Peloquin is one of the world experts in analyzing iterative methods like LSQR and LSMR. He provided us with great insights on various convergence properties of both algorithms. My understanding of LSMR would be left incomplete without his ideas.

Jon Claerbout has been the most enthusiastic supporter of the LSQR algorithm, and is constantly looking for applications of LSMR in geophysics. His idea of "computational success" aligned with our goal in designing LSMR and gave us strong motivation to develop this algorithm.

James Nagy's enthusiasm in experimenting with LSMR in his latest research also gave us much confidence in further developing and polishing our work. I feel truly excited to see his work on applying LSMR to image processing.

AMRES would not be possible without the work that Per Christian Hansen and Michael started. The algorithm and some of the applications are built on top of their insights.

I would like to thank Margot Gerritsen and Walter Murray for serving on both my reading and oral defense committees. They provided many valuable suggestions that enhanced the completeness of the experimental results. I would also like to thank Eric Darve for chairing my oral defense and Peter Kitanidis for serving on my oral defense committee and giving me objective feedback on the algorithm from a user's perspective.

I am also grateful to the team of ICME staff, especially Indira Choudhury and Brian Tempero. Indira helped me get through all the paperwork, and Brian assembled and maintained all the ICME computers. His support allowed us to focus fully on number-crunching.

I have met lots of wonderful friends here at Stanford. It is hard to acknowledge all of them here. Some of them are Santiago Akle, Michael Chan, Felix Yu, and Ka Wai Tsang.

Over the past three years, I have been deeply grateful for the love that Amy Lu has brought to me. Her smile and encouragement enabled me to go through all the tough times in graduate school.

I thank my parents for creating the best possible learning environment for me over the years. With their support, I feel extremely privileged to be able to fully devote myself to this intellectual venture.

I would like to acknowledge my funding, the Stanford Graduate Fellowship (SGF). For the first half of my SGF, I was supported by the Hong Kong Alumni Scholarship with generous contributions from Dr Franklin and Mr Lee. For the second half of my SGF, I was supported by the Office of Technology Licensing Fellowship with generous contributions from Professor Charles H. Kruger and Ms Katherine Ku.

CONTENTS

Abstract
Acknowledgements

1 Introduction
  1.1 Problem statement
  1.2 Basic iterative methods
  1.3 Krylov subspace methods for square Ax = b
    1.3.1 The Lanczos process
    1.3.2 Unsymmetric Lanczos
    1.3.3 Transpose-free methods
    1.3.4 The Arnoldi process
    1.3.5 Induced dimension reduction
  1.4 Krylov subspace methods for rectangular Ax = b
    1.4.1 The Golub-Kahan process
  1.5 Overview
2 A tale of two algorithms
  2.1 Monotonicity of norms
    2.1.1 Properties of CG
    2.1.2 Properties of CR and MINRES
  2.2 Backward error analysis
    2.2.1 Stopping rule
    2.2.2 Monotonic backward errors
    2.2.3 Other convergence measures
  2.3 Numerical results
    2.3.1 Positive-definite systems
    2.3.2 Indefinite systems
  2.4 Summary

3 LSMR
  3.1 Derivation of LSMR
    3.1.1 The Golub-Kahan process
    3.1.2 Using Golub-Kahan to solve the normal equation
    3.1.3 Two QR factorizations
    3.1.4 Recurrence for x_k
    3.1.5 Recurrence for W_k and W̄_k
    3.1.6 The two rotations
    3.1.7 Speeding up forward substitution
    3.1.8 Algorithm LSMR
  3.2 Norms and stopping rules
    3.2.1 Stopping criteria
    3.2.2 Practical stopping criteria
    3.2.3 Computing ‖r_k‖
    3.2.4 Computing ‖A^T r_k‖
    3.2.5 Computing ‖x_k‖
    3.2.6 Estimates of ‖A‖ and cond(A)
  3.3 LSMR properties
    3.3.1 Monotonicity of norms
    3.3.2 Characteristics of the solution on singular systems
    3.3.3 Backward error
  3.4 Complexity
  3.5 Regularized least squares
    3.5.1 Effects on ‖r̄_k‖
    3.5.2 Pseudo-code for regularized LSMR
  3.6 Proof of Lemma 3.2.1

4 LSMR experiments
  4.1 Least-squares problems
    4.1.1 Backward error for least-squares
    4.1.2 Numerical results
    4.1.3 Effects of preconditioning
    4.1.4 Why does ‖E_2‖ for LSQR lag behind LSMR?
  4.2 Square systems
  4.3 Underdetermined systems
  4.4 Reorthogonalization

5 AMRES
  5.1 Derivation of AMRES
    5.1.1 Least-squares subsystem
    5.1.2 QR factorization
    5.1.3 Updating W_k^v
    5.1.4 Algorithm AMRES
  5.2 Stopping rules
  5.3 Estimate of norms
    5.3.1 Computing ‖r̂_k‖
    5.3.2 Computing ‖Â r̂_k‖
    5.3.3 Computing ‖x̂_k‖
    5.3.4 Estimates of ‖Â‖ and cond(Â)
  5.4 Complexity

6 AMRES applications
  6.1 Curtis-Reid scaling
    6.1.1 Curtis-Reid scaling using CGA
    6.1.2 Curtis-Reid scaling using AMRES
    6.1.3 Comparison of CGA and AMRES
  6.2 Rayleigh quotient iteration
    6.2.1 Stable inner iterations for RQI
    6.2.2 Speeding up the inner iterations of RQI
  6.3 Singular vector computation
    6.3.1 AMRESR
    6.3.2 AMRESR experiments
  6.4 Almost singular systems

7 Conclusions and future directions
  7.1 Contributions
    7.1.1 MINRES
    7.1.2 LSMR
    7.1.3 AMRES
  7.2 Future directions
    7.2.1 Conjecture
    7.2.2 Partial reorthogonalization
    7.2.3 Efficient optimal backward error estimates
    7.2.4 SYMMLQ-based least-squares solver

Bibliography

LIST OF TABLES
  1.1 Notation
  3.1 Storage and computational cost for various least-squares methods
  4.1 Effects of diagonal preconditioning on LPnetlib matrices and convergence of LSQR and LSMR on min ‖Ax − b‖
  4.2 Relationship between CG, MINRES, CRAIG, LSQR and LSMR
  5.1 Storage and computational cost for AMRES
  7.1 Comparison of CG and MINRES properties on an spd system
  7.2 Comparison of LSQR and LSMR properties

LIST OF FIGURES
  2.1 Distribution of condition number for matrices used for CG vs MINRES comparison
  2.2 Backward and forward errors for CG and MINRES (1)
  2.3 Backward and forward errors for CG and MINRES (2)
  2.4 Solution norms for CG and MINRES (1)
  2.5 Solution norms for CG and MINRES (2)
  2.6 Non-monotonic backward error for MINRES on indefinite system
  2.7 Solution norms for MINRES on indefinite systems (1)
  2.8 Solution norms for MINRES on indefinite systems (2)
  4.1 ‖r_k‖ for LSMR and LSQR
  4.2 ‖E_2‖ for LSMR and LSQR
  4.3 ‖E_1‖, ‖E_2‖, and μ̃(x_k) for LSQR and LSMR
  4.4 ‖E_1‖, ‖E_2‖, and μ̃(x_k) for LSMR
  4.5 μ̃(x_k) for LSQR and LSMR
  4.6 μ̃(x_k) for LSQR and LSMR
  4.7 ‖x∗ − x‖ for LSMR and LSQR
  4.8 Convergence of ‖E_2‖ for two problems in NYPA group
  4.9 Distribution of condition number for LPnetlib matrices
  4.10 Convergence of LSQR and LSMR with increasingly good preconditioners
  4.11 LSMR and LSQR solving two square nonsingular systems
  4.12 Backward errors of LSQR, LSMR and MINRES on underdetermined systems
  4.13 LSMR with and without reorthogonalization of V_k and/or U_k
  4.14 LSMR with reorthogonalized V_k and restarting
  4.15 LSMR with local reorthogonalization of V_k
  6.1 Convergence of Curtis-Reid scaling using CGA and AMRES
  6.2 Improving an approximate singular value and singular vector using RQI-AMRES and RQI-MINRES (1)
  6.3 Improving an approximate singular value and singular vector using RQI-AMRES and RQI-MINRES (2)
  6.4 Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho
  6.5 Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for linear operators of varying cost
  6.6 Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for different singular values
  6.7 Convergence of RQI-AMRES and svds
  6.8 Singular vector computation using AMRESR and MINRES
  7.1 Flowchart on choosing iterative solvers

LIST OF ALGORITHMS
  1.1 Lanczos process Tridiag(A, b)
  1.2 Algorithm CG
  1.3 Algorithm CR
  1.4 Unsymmetric Lanczos process
  1.5 Arnoldi process
  1.6 Golub-Kahan process Bidiag(A, b)
  1.7 Algorithm CGLS
  3.1 Algorithm LSMR
  3.2 Computing ‖r_k‖ in LSMR
  3.3 Algorithm CRLS
  3.4 Regularized LSMR (1)
  3.5 Regularized LSMR (2)
  5.1 Algorithm AMRES
  6.1 Rayleigh quotient iteration (RQI) for square A
  6.2 RQI for singular vectors for square or rectangular A
  6.3 Stable RQI for singular vectors
  6.4 Modified RQI for singular vectors
  6.5 Singular vector computation via residual vector
  6.6 Algorithm AMRESR

LIST OF MATLAB CODE
  4.1 Approximate optimal backward error
  4.2 Right diagonal preconditioning
  4.3 Generating preconditioners by perturbation of QR
  4.4 Criteria for selecting square systems
  4.5 Diagonal preconditioning
  4.6 Left diagonal preconditioning
  6.1 Least-squares problem for Curtis-Reid scaling
  6.2 Normal equation for Curtis-Reid scaling
  6.3 CGA for Curtis-Reid scaling
  6.4 Generate linear operator with known singular values
  6.5 Cached Golub-Kahan process
  6.6 Error measure for singular vector

1 INTRODUCTION

The quest for the solution of linear equations is a long journey. The earliest known work is in 263 AD [64]. The book Jiuzhang Suanshu (Nine Chapters of the Mathematical Art) was published in ancient China with a chapter dedicated to the solution of linear equations.[1] The modern study of linear equations was picked up again by Newton, who wrote unpublished notes in 1670 on solving systems of equations by the systematic elimination of variables [33]. Cramer's Rule was published in 1750 [16] after Leibniz laid the groundwork for determinants in 1693 [7].

[1] Procedures for solving systems of three linear equations in three variables were discussed.

In 1809, Gauss invented the method of least squares by solving the normal equation for an over-determined system for his study of celestial orbits. Subsequently, in 1826, he extended his method to find the minimum-norm solution for underdetermined systems, which proved to be very popular among cartographers [33].

For linear systems with dense matrices, Cholesky factorization, LU factorization and QR factorization are the popular methods for finding solutions. These methods require access to the elements of the matrix. We are interested in the solution of linear systems when the matrix is large and sparse. In such circumstances, direct methods like the ones mentioned above are not practical because of memory constraints. We also allow the matrix to be a linear operator defined by a procedure for computing matrix-vector products. We focus our study on the class of iterative methods, which usually require only a small amount of auxiliary storage beyond the storage for the problem itself.

1.1 PROBLEM STATEMENT

We consider the problem of solving a system of linear equations. In matrix notation we write

    Ax = b,    (1.1)

where A is an m × n real matrix that is typically large and sparse, or is available only as a linear operator, b is a real vector of length m, and x is a real vector of length n.

We call an x that satisfies (1.1) a solution of the problem. If such an x does not exist, we have an inconsistent system. If the system is inconsistent, we look for an optimal x for the following least-squares problem instead:

    \min_x \|Ax - b\|_2.    (1.2)

We denote the exact solution to the above problems by x∗, and ℓ denotes the number of iterations that any iterative method takes to converge to this solution. That is, we have a sequence of approximate solutions x_0, x_1, x_2, ..., x_ℓ, with x_0 = 0 and x_ℓ = x∗.

In Sections 1.2 and 1.3 we review a number of methods for solving (1.1) when A is square (m = n). In Section 1.4 we review methods that handle the general case when A is rectangular, which is also the main focus of this thesis.

1.2 BASIC ITERATIVE METHODS

Jacobi iteration, Gauss-Seidel iteration [31, p510] and successive over-relaxation (SOR) [88] are three early iterative methods for linear equations. These methods have the common advantage of minimal memory requirements compared with the Krylov subspace methods that we focus on hereafter. However, unlike Krylov subspace methods, these methods will not converge to the exact solution in a finite number of iterations even with exact arithmetic, and they are applicable to only narrow classes of matrices (e.g., diagonally dominant matrices). They also require explicit access to the nonzeros of A.
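As a concrete illustration of these classical methods, the following MATLAB sketch implements Jacobi iteration (our own minimal example, not code from the thesis; the stopping test and iteration limit are arbitrary choices):

    % Jacobi iteration for Ax = b: a minimal sketch.  Assumes diag(A) is
    % nonzero; converges, e.g., when A is strictly diagonally dominant.
    function x = jacobi_sketch(A, b, tol, maxit)
      d = diag(A);                        % Jacobi needs explicit access to diag(A)
      x = zeros(size(b));
      for k = 1:maxit
        x = (b - A*x + d.*x) ./ d;        % x_{k+1} = D^{-1}(b - (A - D)x_k)
        if norm(b - A*x) <= tol*norm(b)   % stop on relative residual
          break
        end
      end
    end

Note that the update needs the diagonal of A explicitly, in line with the remark above, and convergence depends on the spectral radius of D^{-1}(A − D) rather than being guaranteed in a finite number of iterations.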
1.3 KRYLOV SUBSPACE METHODS FOR SQUARE Ax = b

Sections 1.3 and 1.4 describe a number of methods that can regard A as an operator; i.e., only matrix-vector multiplication with A (and sometimes A^T) is needed, but not direct access to the elements of A.[2] Section 1.3 focuses on algorithms for the case when A is square. Section 1.4 focuses on algorithms that handle both rectangular and square A. Krylov subspaces of increasing dimensions are generated by the matrix-vector products, and an optimal solution within each subspace is found at each iteration of the methods (where the measure of optimality differs with each method).

[2] These methods are also known as matrix-free iterative methods.

Algorithm 1.1 Lanczos process Tridiag(A, b)
1: β_1 v_1 = b  (i.e. β_1 = ‖b‖_2, v_1 = b/β_1)
2: for k = 1, 2, ... do
3:   w = A v_k
4:   α_k = v_k^T w
5:   β_{k+1} v_{k+1} = w − α_k v_k − β_k v_{k−1}
6: end for

1.3.1 THE LANCZOS PROCESS

In this section, we focus on symmetric linear systems. The Lanczos process [40] takes a symmetric matrix A and a vector b, and generates a sequence of Lanczos vectors v_k and scalars α_k, β_k for k = 1, 2, ... as shown in Algorithm 1.1. The process can be summarized in matrix form as

    A V_k = V_k T_k + β_{k+1} v_{k+1} e_k^T = V_{k+1} H_k    (1.3)

with V_k = (v_1  v_2  ⋯  v_k) and

    T_k = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ & & \beta_k & \alpha_k \end{pmatrix}, \qquad
    H_k = \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix}.

An important property of the Lanczos vectors in V_k is that they lie in the Krylov subspace K_k(A, b) = span{b, Ab, A²b, ..., A^{k−1}b}. At iteration k, we look for an approximate solution x_k = V_k y_k (which lies in the Krylov subspace). The associated residual vector is

    r_k = b − A x_k = β_1 v_1 − A V_k y_k = V_{k+1}(β_1 e_1 − H_k y_k).

By choosing y_k in various ways to make r_k small, we arrive at different iterative methods for solving the linear system. Since V_k is theoretically orthonormal, we can achieve this by solving various subproblems to make

    H_k y_k ≈ β_1 e_1.    (1.4)

Three particular choices of subproblem lead to three established methods (CG, MINRES and SYMMLQ) [52]. Each method has a different minimization property that suggests a particular factorization of H_k. Certain auxiliary quantities can be updated efficiently without the need for y_k itself (which in general is completely different from y_{k+1}). With exact arithmetic, the Lanczos process terminates with k = ℓ for some ℓ ≤ n. To ensure that the approximations x_k = V_k y_k improve by some measure as k increases toward ℓ, the Krylov solvers minimize some convex function within the expanding Krylov subspaces [27].

Algorithm 1.2 Algorithm CG
1: x_0 = 0, r_0 = b, p_1 = r_0, ρ_0 = r_0^T r_0
2: for k = 1, 2, ... do
3:   q_k = A p_k
4:   α_k = ρ_{k−1} / p_k^T q_k
5:   x_k = x_{k−1} + α_k p_k
6:   r_k = r_{k−1} − α_k q_k
7:   ρ_k = r_k^T r_k
8:   β_k = ρ_k / ρ_{k−1}
9:   p_{k+1} = r_k + β_k p_k
10: end for

CG

CG was introduced in 1952 by Hestenes and Stiefel [38] for solving Ax = b when A is symmetric positive definite (spd). The quadratic form φ(x) ≡ ½ x^T A x − b^T x is bounded below, and its unique minimizer solves Ax = b. CG iterations are characterized by minimizing the quadratic form within each Krylov subspace [27], [46, §2.4], [84, §§8.8–8.9]:

    x_k = V_k y_k,  where  y_k = arg min_y φ(V_k y).    (1.5)

With b = A x∗ and 2φ(x_k) = x_k^T A x_k − 2 x∗^T A x_k, this is equivalent to minimizing the function ‖x∗ − x_k‖_A ≡ ((x∗ − x_k)^T A (x∗ − x_k))^{1/2}, known as the energy norm of the error, within each Krylov subspace. A version of CG adapted from van der Vorst [79, p42] is shown in Algorithm 1.2.

CG has an equivalent Lanczos formulation [52].
It works by deleting the last row of (1.4) and defining y_k by the subproblem T_k y_k = β_1 e_1. If A is positive definite, so is each T_k, and the natural approach is to employ the Cholesky factorization T_k = L_k D_k L_k^T. We define W_k and z_k from the lower triangular systems

    L_k W_k^T = V_k^T,    L_k D_k z_k = β_1 e_1.

It then follows that z_k = L_k^T y_k and x_k = V_k y_k = W_k L_k^T y_k = W_k z_k, where the elements of W_k and z_k do not change when k increases. Simple recursions follow. In particular, x_k = x_{k−1} + ζ_k w_k, where z_k = \begin{pmatrix} z_{k-1} \\ \zeta_k \end{pmatrix} and W_k = (W_{k−1}  w_k). This formulation requires one more n-vector than the non-Lanczos formulation.

When A is not spd, the minimization in (1.5) is unbounded below, and the Cholesky factorization of T_k might fail or be numerically unstable. Thus, CG cannot be recommended in this case.

MINRES

MINRES [52] is characterized by the following minimization:

    x_k = V_k y_k,  where  y_k = arg min_y ‖b − A V_k y‖.    (1.6)

Thus, MINRES minimizes ‖r_k‖ within the kth Krylov subspace. Since this minimization is well-defined regardless of the definiteness of A, MINRES is applicable to both positive definite and indefinite systems. From (1.4), the minimization is equivalent to min ‖H_k y_k − β_1 e_1‖_2. Now it is natural to use the QR factorization

    Q_k (H_k  β_1 e_1) = \begin{pmatrix} R_k & z_k \\ 0 & \bar\zeta_{k+1} \end{pmatrix},

from which we have R_k y_k = z_k. We define W_k from the lower triangular system R_k^T W_k^T = V_k^T, and then x_k = V_k y_k = W_k R_k y_k = W_k z_k = x_{k−1} + ζ_k w_k as before. (The Cholesky factor L_k is lower bidiagonal but R_k^T is lower tridiagonal, so MINRES needs slightly more work and storage than CG.)

Algorithm 1.3 Algorithm CR
1: x_0 = 0, r_0 = b, s_0 = A r_0, ρ_0 = r_0^T s_0, p_0 = r_0, q_0 = s_0
2: for k = 1, 2, ... do
3:   (q_{k−1} = A p_{k−1} holds, but is not explicitly computed)
4:   α_k = ρ_{k−1} / ‖q_{k−1}‖²
5:   x_k = x_{k−1} + α_k p_{k−1}
6:   r_k = r_{k−1} − α_k q_{k−1}
7:   s_k = A r_k
8:   ρ_k = r_k^T s_k
9:   β_k = ρ_k / ρ_{k−1}
10:  p_k = r_k + β_k p_{k−1}
11:  q_k = s_k + β_k q_{k−1}
12: end for

Stiefel's Conjugate Residual method (CR) [72] for spd systems also minimizes ‖r_k‖ in the same Krylov subspace. Thus, CR and MINRES must generate the same iterates on spd systems. We will use the two algorithms interchangeably in the spd case to prove a number of properties in Chapter 2. CR is shown in Algorithm 1.3.

Note that MINRES is reliable for any symmetric matrix A, whereas CR was designed for positive definite systems. For example it will fail on the following nonsingular but indefinite system:

    \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 0 \end{pmatrix} x = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.

In this case, r_1 and s_1 are nonzero, but ρ_1 = 0 and CR fails after 1 iteration. Luenberger extended CR to indefinite systems [43; 44]. The extension relies on testing whether α_k = 0 and switching to a different update rule. In practice it is difficult to judge whether α_k should be treated as zero. MINRES is free of such a decision (except when A is singular and Ax = b is inconsistent, in which case MINRES-QLP [14; 13] is recommended).

Since the pros and cons of CG and MINRES are central to the design of the two new algorithms in this thesis (LSMR and AMRES), a more in-depth discussion of their properties is given in Chapter 2.

SYMMLQ

SYMMLQ [52] solves the minimum 2-norm solution of an underdetermined subproblem obtained by deleting the last 2 rows of (1.4):

    min ‖y_k‖  s.t.  H_{k−1}^T y_k = β_1 e_1.

This is solved using the LQ factorization H_{k−1}^T Q_{k−1}^T = (L_{k−1}  0). A benefit is that x_k is computed as steps along a set of theoretically orthogonal directions (the columns of V_k Q_{k−1}^T).
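All three methods of this section are driven by Algorithm 1.1. The following MATLAB sketch of the Lanczos process (our own illustration; no reorthogonalization, so in finite precision the columns of V quickly lose orthogonality, as noted in Section 1.3.2) produces V_k and the tridiagonal T_k of (1.3):

    % Lanczos process Tridiag(A,b) for symmetric A: a minimal sketch.
    function [V, T] = lanczos_sketch(A, b, k)
      V = b/norm(b);                       % v_1 = b/beta_1
      alpha = zeros(k,1);  beta = zeros(k,1);
      vold = zeros(size(b));  betaold = 0;
      for j = 1:k
        w = A*V(:,j) - betaold*vold;       % A v_j - beta_j v_{j-1}
        alpha(j) = V(:,j)'*w;
        w = w - alpha(j)*V(:,j);
        beta(j) = norm(w);                 % beta_{j+1}
        if beta(j) == 0, break, end        % early termination (k = ell)
        vold = V(:,j);  betaold = beta(j);
        V(:,j+1) = w/beta(j);
      end
      T = diag(alpha(1:j)) + diag(beta(1:j-1),1) + diag(beta(1:j-1),-1);
    end

CG, MINRES and SYMMLQ then differ only in which subproblem derived from (1.4) they solve to turn T_k into y_k.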
1.3.2 UNSYMMETRIC LANCZOS

The symmetric Lanczos process from Section 1.3.1 transforms a symmetric matrix A into a symmetric tridiagonal matrix T_k, and generates a set of orthonormal[3] vectors v_k using a 3-term recurrence. If A is not symmetric, there are two other popular strategies, each of which sacrifices some properties of the symmetric Lanczos process. If we don't enforce a short-term recurrence, we arrive at the Arnoldi process presented in Section 1.3.4. If we relax the orthogonality requirement, we arrive at the unsymmetric Lanczos process[4], the basis of BiCG and QMR. The unsymmetric Lanczos process is shown in Algorithm 1.4.

[3] Orthogonality holds only under exact arithmetic. In finite precision, orthogonality is quickly lost.
[4] The version of unsymmetric Lanczos presented here is adapted from [35].

Algorithm 1.4 Unsymmetric Lanczos process
1: β_1 = ‖b‖_2, v_1 = b/β_1, δ_1 = 0, v_0 = w_0 = 0, w_1 = v_1
2: for k = 1, 2, ... do
3:   α_k = w_k^T A v_k
4:   δ_{k+1} v_{k+1} = A v_k − α_k v_k − β_k v_{k−1}
5:   w̄_{k+1} = A^T w_k − α_k w_k − δ_k w_{k−1}
6:   β_{k+1} = w̄_{k+1}^T v_{k+1}
7:   w_{k+1} = w̄_{k+1} / β_{k+1}
8: end for

The scalars δ_{k+1} and β_{k+1} are chosen so that ‖v_{k+1}‖ = 1 and v_{k+1}^T w_{k+1} = 1. In matrix terms, if we define

    T_k = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \delta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ & & \delta_k & \alpha_k \end{pmatrix}, \qquad
    H_k = \begin{pmatrix} T_k \\ \delta_{k+1} e_k^T \end{pmatrix}, \qquad
    \bar H_k = \begin{pmatrix} T_k^T \\ \beta_{k+1} e_k^T \end{pmatrix},

then we have the relations

    A V_k = V_{k+1} H_k,    A^T W_k = W_{k+1} \bar H_k.    (1.7)

As with symmetric Lanczos, by defining x_k = V_k y_k to search for some optimal solution within the Krylov subspace, we can write

    r_k = b − A x_k = β_1 v_1 − A V_k y_k = V_{k+1}(β_1 e_1 − H_k y_k).

For unsymmetric Lanczos, the columns of V_k are not orthogonal even in exact arithmetic. However, we pretend that they are orthogonal, and using ideas similar to those behind Lanczos-based CG and MINRES, we arrive at the following two algorithms by solving

    H_k y_k ≈ β_1 e_1.    (1.8)

BiCG

BiCG [23] is an extension of CG to the unsymmetric Lanczos process. As with CG, BiCG can be derived by deleting the last row of (1.8), and solving the resultant square system with LU decomposition [79].

QMR

QMR [26], the quasi-minimum residual method, is the MINRES analog for the unsymmetric Lanczos process. It is derived by solving the least-squares subproblem min ‖H_k y_k − β_1 e_1‖ at every iteration with QR decomposition. Since the columns of V_k are not orthogonal, QMR doesn't give a minimum residual solution for the original problem in the corresponding Krylov subspace, but the residual norm does tend to decrease.

1.3.3 TRANSPOSE-FREE METHODS

One disadvantage of BiCG and QMR is that matrix-vector multiplication by A^T is needed. A number of algorithms have been proposed to remove this multiplication. These algorithms are based on the fact that in BiCG, the residual vector lies in the Krylov subspace and can be written as r_k = P_k(A)b, where P_k(A) is a polynomial of A of degree k. With the choice r_k = Q_k(A)P_k(A)b, where Q_k(A) is some other polynomial of degree k, all the coefficients needed for the update at every iteration can be computed without using the multiplication by A^T [79].

CGS [66] is the extension of BiCG with Q_k(A) ≡ P_k(A). CGS has been shown to exhibit irregular convergence behavior. To achieve smoother convergence, BiCGStab [78] was designed with some optimal polynomial Q_k(A) that minimizes the residual at each iteration.

Algorithm 1.5 Arnoldi process
1: β v_1 = b  (i.e. β = ‖b‖_2, v_1 = b/β)
2: for k = 1, 2, ..., n do
3:   w = A v_k
4:   for i = 1, 2, ..., k do
5:     h_{ik} = w^T v_i
6:     w = w − h_{ik} v_i
7:   end for
8:   β_{k+1} v_{k+1} = w
9: end for
1.3.4 THE ARNOLDI PROCESS

Another variant of the Lanczos process for an unsymmetric matrix A is the Arnoldi process. Compared with unsymmetric Lanczos, which preserves the tridiagonal property of H_k and loses the orthogonality among the columns of V_k, the Arnoldi process transforms A into an upper Hessenberg matrix H_k with an orthogonal transformation V_k. A short-term recurrence is no longer available for the Arnoldi process. All the Arnoldi vectors must be kept to generate the next vector, as shown in Algorithm 1.5. The process can be summarized by

    A V_k = V_{k+1} H_k,    (1.9)

with

    H_k = \begin{pmatrix}
      h_{11} & h_{12} & \cdots & h_{1k} \\
      \beta_2 & h_{22} & \cdots & h_{2k} \\
             & \beta_3 & \ddots & h_{3k} \\
             &        & \ddots & \vdots \\
             &        &        & h_{kk} \\
             &        &        & \beta_{k+1}
    \end{pmatrix}.

As with symmetric Lanczos, this allows us to write

    r_k = b − A x_k = β_1 v_1 − A V_k y_k = V_{k+1}(β_1 e_1 − H_k y_k),

and our goal is again to find approximate solutions to

    H_k y_k ≈ β_1 e_1.    (1.10)

Note that at the k-th iteration, the amount of memory needed to store H_k and V_k is O(k² + kn). Since iterative methods primarily focus on matrices that are large and sparse, the storage cost will soon overwhelm other costs and render the computation infeasible. Most Arnoldi-based methods adopt a strategy of restarting to handle this issue, trading storage cost for slower convergence.

FOM

FOM [62] is the CG analogue for the Arnoldi process, with y_k defined by deleting the last row of (1.10) and solving the truncated system.

GMRES

GMRES [63] is the MINRES counterpart for the Arnoldi process, with y_k defined by the least-squares subproblem min ‖H_k y_k − β_1 e_1‖_2. Like the methods we study (CG, MINRES, LSQR, LSMR, AMRES), GMRES does not break down, but it might require significantly more storage.

1.3.5 INDUCED DIMENSION REDUCTION

Induced dimension reduction (IDR) is a class of transpose-free methods that generate residuals in a sequence of nested subspaces of decreasing dimension. The original IDR [85] was proposed by Wesseling and Sonneveld in 1980. It converges after at most 2n matrix-vector multiplications under exact arithmetic. Theoretically, this is the same complexity as the unsymmetric Lanczos methods and the transpose-free methods. In 2008, Sonneveld and van Gijzen published IDR(s) [67], an improvement over IDR that takes advantage of extra memory available. The memory required increases linearly with s, while the maximum number of matrix-vector multiplications needed becomes n + n/s.

We note that in some informal experiments on square unsymmetric systems Ax = b arising from a convection-diffusion-reaction problem involving several parameters [81], IDR(s) performed significantly better than LSQR or LSMR for some values of the parameters, but for certain other parameter values the reverse was true [82]. In this sense the solvers complement each other.

1.4 KRYLOV SUBSPACE METHODS FOR RECTANGULAR Ax = b

In this section, we introduce a number of Krylov subspace methods for the matrix equation Ax = b, where A is an m-by-n square or rectangular matrix. When m > n, we solve the least-squares problem min ‖Ax − b‖_2. When m < n, we find the minimum 2-norm solution \min_{Ax=b} ‖x‖_2. For any m and n, if Ax = b is inconsistent, we solve the problem

    min ‖x‖  s.t.  x = arg min ‖Ax − b‖.
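For a small dense A, these problems can be checked against MATLAB's pseudoinverse; the data below is a made-up overdetermined example, used only to fix ideas:

    % Toy check of the least-squares problem of Section 1.4 (hypothetical data).
    A = [1 0; 0 1; 1 1];      % m > n: overdetermined
    b = [1; 2; 4];
    x = pinv(A)*b;            % pseudoinverse (minimum-norm least-squares) solution
    r = b - A*x;
    disp(norm(A'*r))          % ~0, since A'r = 0 characterizes a least-squares solution

The iterative methods below compute such an x for large sparse A using only products with A and A^T.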
1.4.1 THE GOLUB-KAHAN PROCESS

In the dense case, we can construct orthogonal matrices U and V to transform (b  A) to lower bidiagonal form:

    U^T \begin{pmatrix} b & AV \end{pmatrix} = \begin{pmatrix} \beta_1 e_1 & B \end{pmatrix},

where B is a lower bidiagonal matrix. For sparse matrices or linear operators, Golub and Kahan [29] gave an iterative version of the bidiagonalization, as shown in Algorithm 1.6.

Algorithm 1.6 Golub-Kahan process Bidiag(A, b)
1: β_1 u_1 = b, α_1 v_1 = A^T u_1
2: for k = 1, 2, ... do
3:   β_{k+1} u_{k+1} = A v_k − α_k u_k
4:   α_{k+1} v_{k+1} = A^T u_{k+1} − β_{k+1} v_k
5: end for

After k steps, we have

    A V_k = U_{k+1} B_k  and  A^T U_{k+1} = V_{k+1} L_{k+1}^T,

where

    B_k = \begin{pmatrix} \alpha_1 & & & \\ \beta_2 & \alpha_2 & & \\ & \ddots & \ddots & \\ & & \beta_k & \alpha_k \\ & & & \beta_{k+1} \end{pmatrix}, \qquad
    L_{k+1} = \begin{pmatrix} B_k & \alpha_{k+1} e_{k+1} \end{pmatrix}.

This is equivalent to what would be generated by the symmetric Lanczos process with matrix A^T A and starting vector A^T b. The Lanczos vectors V_k are the same, and the Lanczos tridiagonal matrix satisfies T_k = B_k^T B_k. With x_k = V_k y_k, the residual vector r_k can be written as

    r_k = b − A V_k y_k = β_1 u_1 − U_{k+1} B_k y_k = U_{k+1}(β_1 e_1 − B_k y_k),

and our goal is to find an approximate solution to

    B_k y_k ≈ β_1 e_1.    (1.11)

CRAIG

CRAIG [22; 53] is defined by deleting the last row from (1.11), so that y_k satisfies L_k y_k = β_1 e_1 at each iteration. It is an efficient and reliable method for consistent square or rectangular systems Ax = b, and it is known to minimize the error norm ‖x∗ − x_k‖ within each Krylov subspace [51].

LSQR

LSQR [53] is derived by solving min ‖r_k‖ ≡ min ‖β_1 e_1 − B_k y_k‖ at each iteration. Since we are minimizing over a larger Krylov subspace at each iteration, this immediately implies that ‖r_k‖ is monotonic for LSQR.

LSMR

LSMR is derived by minimizing ‖A^T r_k‖ within each Krylov subspace. LSMR is a major focus of this thesis. It solves linear systems Ax = b and least-squares problems min ‖Ax − b‖_2, with A being sparse or a linear operator. It is analytically equivalent to applying MINRES to the normal equation A^T A x = A^T b, so that the quantities ‖A^T r_k‖ are monotonically decreasing. We have proved that ‖r_k‖ also decreases monotonically. As we will see in Theorem 4.1.1, this means that a certain backward error measure (the Stewart backward error ‖A^T r_k‖/‖r_k‖) is always smaller for LSMR than for LSQR. Hence it is safer to terminate LSMR early.
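All of the solvers above are driven by Algorithm 1.6. The following MATLAB sketch of the Golub-Kahan process (our own illustration; no reorthogonalization) returns U_{k+1}, V_{k+1} and B_k, and can be used to verify the identity A V_k = U_{k+1} B_k numerically:

    % Golub-Kahan process Bidiag(A,b): a minimal sketch.
    function [U, V, B] = bidiag_sketch(A, b, k)
      beta = norm(b);   u = b/beta;
      v = A'*u;  alpha = norm(v);  v = v/alpha;
      U = u;  V = v;  alphas = alpha;  betas = [];
      for j = 1:k
        u = A*v - alpha*u;   beta  = norm(u);  u = u/beta;    % beta_{j+1}
        v = A'*u - beta*v;   alpha = norm(v);  v = v/alpha;   % alpha_{j+1}
        U = [U u];  V = [V v];
        betas = [betas; beta];  alphas = [alphas; alpha];
      end
      % B is the (k+1) x k lower bidiagonal B_k of Section 1.4.1
      B = [diag(alphas(1:k)); zeros(1,k)] + [zeros(1,k); diag(betas)];
    end

For example, after [U,V,B] = bidiag_sketch(A,b,5), the quantity norm(A*V(:,1:5) - U*B) should be at the level of machine precision.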
Chapter 7 summarizes the contributions of this thesis and gives a summary of interesting problems available for future research. The notation used in this thesis is summarized in Table 1.1. 13 14 CHAPTER 1. INTRODUCTION Table 1.1 Notation A Aij matrix, sparse matrix or linear operator the element of matrix A in i-th row and j-th column b, p, r, t, u, v, x, y, . . . vectors k subscript index for iteration number. E.g. xk is the approximate solution generated at the k-th iteration of an iterative solver such as MINRES. In Chapters 3 to 6, k represents the number of Golub-Kahan bidiagonalization iterations. q subscript index for RQI iteration number, used in Section 6.2. c k sk c k , sk non-identity elements ( −s ) in a Givens rotak ck tion matrix Bk bidiagonal matrix generated at the k-th step of Golub-Kahan bidiagonalization. Greek letters scalars k·k vector 2-norm or the induced matrix 2-norm. k · kF Frobenius norm kxkA energy norm of vector x with repect to positive √ definite matrix A: xTAx cond(A) condition number of A ek k-th column of an identity matrix 1 a vector with all entries being 1. R(A) range of matrix A. N(A) null space of matrix A. A≻0 A is symmetric positive definite. x∗ the unique solution to a nonsingular square system Ax = b, or more generally the pseudoinverse solution of a rectangular system Ax ≈ b. 2 A TALE OF TWO ALGORITHMS The conjugate gradient method (CG) [38] and the minimum residual method (MINRES) [52] are both Krylov subspace methods for the iterative solution of symmetric linear equations Ax = b. CG is commonly used when the matrix A is positive definite, while MINRES is generally reserved for indefinite systems [79, p85]. We reexamine this wisdom from the point of view of early termination on positive-definite systems. This also serves as the rationale for why MINRES is chosen as the basis for the development of LSMR. In this Chapter, we study the application of CG and MINRES to real symmetric positive-definite (spd) systems Ax = b, where A is of dimension n × n. The unique solution is denoted by x∗ . The initial approximate solution is x0 ≡ 0, and rk ≡ b − Axk is the residual vector for an approximation xk within the kth Krylov subspace. From Section 1.3.1, we know that CG and MINRES use the same information Vk+1 and Hk to compute solution estimates xk = Vk yk within the Krylov subspace Kk (A, b) ≡ span{b, Ab, A2 b, . . . , Ak−1 b} (for each k). It is commonly thought that the number of iterations required will be similar for each method, and hence CG should be preferable on spd systems because it requires less storage and fewer floating-point operations per iteration. This view is justified if an accurate solution is required (stopping tolerance τ close to machine precision ǫ). We show that with looser stopping tolerances, MINRES is sure to terminate sooner than CG when the stopping rule is based on the backward error for xk , and by numerical examples we illustrate that the difference in iteration numbers can be substantial. Section 2.1 describes a number of monotonic convergence properties that make CG and MINRES favorable iterative solvers for linear systems. Section 2.2 introduces the concept of backward error and how it is used in designing stopping rules for iterative solvers, and the reason why MINRES is more favorable for applications where backward error is important. Section 2.3 compares experimentally the behavior of CG and MINRES in terms of energy norm error, backward error, residual norm, and solution norm. 15 16 CHAPTER 2. 
2.1 MONOTONICITY OF NORMS

In designing iterative solvers for linear equations, we often gauge convergence by computing some norms from the current iterate x_k. These norms[1] include ‖x_k − x∗‖, ‖x_k − x∗‖_A, ‖r_k‖. An iterative method might sometimes be stopped by an iteration limit or a time limit. It is then highly desirable that some or all of the above norms converge monotonically.

[1] For any vector x, the energy norm with respect to an spd matrix A is defined as ‖x‖_A = \sqrt{x^T A x}.

It is also desirable to have monotonic convergence for ‖x_k‖. First, some applications such as trust-region methods [68] depend on that property. Second, when convergence is measured by backward error (Section 2.2), monotonicity in ‖x_k‖ (together with monotonicity in ‖r_k‖) gives monotonic convergence in backward error. More generally, if ‖x_k‖ is monotonic, there cannot be catastrophic cancellation error in stepping from x_k to x_{k+1}.

2.1.1 PROPERTIES OF CG

A number of monotonicity properties have been found by various authors. We summarize them here for easy reference.

Theorem 2.1.1. [68, Thm 2.1] For CG on an spd system Ax = b, ‖x_k‖ is strictly increasing.

Theorem 2.1.2. [38, Thm 4:3] For CG on an spd system Ax = b, ‖x∗ − x_k‖_A is strictly decreasing.

Theorem 2.1.3. [38, Thm 6:3] For CG on an spd system Ax = b, ‖x∗ − x_k‖ is strictly decreasing.

‖r_k‖ is not monotonic for CG. Examples are shown in Figure 2.4.

2.1.2 PROPERTIES OF CR AND MINRES

Here we prove a number of monotonicity properties for CR and MINRES on an spd system Ax = b. Some known properties are also included for completeness. Relations from Algorithm 1.3 (CR) are used extensively in the proofs. Termination of CR occurs when r_k = 0 for some index k = ℓ ≤ n (⇒ ρ_ℓ = β_ℓ = 0, r_ℓ = s_ℓ = p_ℓ = q_ℓ = 0, x_ℓ = x∗, where A x∗ = b). Note: This ℓ is the same ℓ at which the Lanczos process theoretically terminates for the given A and b.

Theorem 2.1.4. The following properties hold for Algorithm CR:
(a) q_i^T q_j = 0  (0 ≤ i, j ≤ ℓ−1, i ≠ j)
(b) r_i^T q_j = 0  (0 ≤ i, j ≤ ℓ−1, i ≥ j+1)
(c) r_i ≠ 0 ⇒ p_i ≠ 0  (0 ≤ i ≤ ℓ−1)

Proof. Given in [44, Theorem 1].

Theorem 2.1.5. The following properties hold for Algorithm CR on an spd system Ax = b:
(a) α_i > 0  (i = 1, ..., ℓ)
(b) β_i > 0  (i = 1, ..., ℓ−1), β_ℓ = 0
(c) p_i^T q_j > 0  (0 ≤ i, j ≤ ℓ−1)
(d) p_i^T p_j > 0  (0 ≤ i, j ≤ ℓ−1)
(e) x_i^T p_j > 0  (1 ≤ i ≤ ℓ, 0 ≤ j ≤ ℓ−1)
(f) r_i^T p_j > 0  (0 ≤ i, j ≤ ℓ−1)

Proof. (a) Here we use the fact that A is spd. Since r_i ≠ 0 for 0 ≤ i ≤ ℓ−1, we have for 1 ≤ i ≤ ℓ,

    ρ_{i−1} = r_{i−1}^T s_{i−1} = r_{i−1}^T A r_{i−1} > 0  (A ≻ 0),    (2.1)
    α_i = ρ_{i−1} / ‖q_{i−1}‖² > 0,

where q_{i−1} ≠ 0 follows from q_{i−1} = A p_{i−1} and Theorem 2.1.4 (c).

(b) For 1 ≤ i ≤ ℓ−1, we have β_i = ρ_i / ρ_{i−1} > 0 (by (2.1)), and r_ℓ = 0 implies β_ℓ = 0.

(c) For any 0 ≤ i, j ≤ ℓ−1, we have:

Case I: i = j.  p_i^T q_i = p_i^T A p_i > 0  (A ≻ 0), where p_i ≠ 0 from Theorem 2.1.4 (c). Next, we prove the cases where i ≠ j by induction.

Case II: i − j = k > 0.

    p_i^T q_j = p_i^T q_{i−k} = r_i^T q_{i−k} + β_i p_{i−1}^T q_{i−k}  (since p_i = r_i + β_i p_{i−1})
             = β_i p_{i−1}^T q_{i−k}  (by Thm 2.1.4 (b))
             > 0,

where β_i > 0 by (b)[2] and p_{i−1}^T q_{i−k} > 0 by induction, as (i−1) − (i−k) = k−1 < k.

[2] Note that i − j = k > 0 implies i ≥ 1.

Case III: j − i = k > 0.

    p_i^T q_j = p_i^T q_{i+k} = p_i^T A p_{i+k} = p_i^T A (r_{i+k} + β_{i+k} p_{i+k−1})
             = q_i^T r_{i+k} + β_{i+k} p_i^T q_{i+k−1}
             = β_{i+k} p_i^T q_{i+k−1}  (by Thm 2.1.4 (b))
             > 0,

where β_{i+k} = β_j > 0 by (b) and p_i^T q_{i+k−1} > 0 by induction, as (i+k−1) − i = k−1 < k.
(d) Define 𝒫 ≡ span{p_0, p_1, ..., p_{ℓ−1}} and 𝒬 ≡ span{q_0, ..., q_{ℓ−1}} at termination. By construction, 𝒫 = span{b, Ab, ..., A^{ℓ−1}b} and 𝒬 = span{Ab, ..., A^ℓ b} (since q_i = A p_i). Again by construction, x_ℓ ∈ 𝒫, and since r_ℓ = 0 we have b = A x_ℓ ⇒ b ∈ 𝒬. We see that 𝒫 ⊆ 𝒬. By Theorem 2.1.4 (a), {q_i/‖q_i‖}_{i=0}^{ℓ−1} forms an orthonormal basis for 𝒬. If we project p_i ∈ 𝒫 ⊆ 𝒬 onto this basis, we have

    p_i = \sum_{k=0}^{\ell-1} \frac{p_i^T q_k}{q_k^T q_k} q_k,

where all coordinates are positive from (c). Similarly for any other p_j. Therefore p_i^T p_j > 0 for any 0 ≤ i, j < ℓ.

(e) By construction,

    x_i = x_{i−1} + α_i p_{i−1} = ⋯ = \sum_{k=1}^{i} α_k p_{k−1}  (x_0 = 0).

Therefore x_i^T p_j > 0 by (d) and (a).

(f) Note that any r_i can be expressed as a sum of the q_i:

    r_i = r_{i+1} + α_{i+1} q_i = ⋯ = r_ℓ + α_ℓ q_{ℓ−1} + ⋯ + α_{i+1} q_i = α_ℓ q_{ℓ−1} + ⋯ + α_{i+1} q_i.

Thus we have r_i^T p_j = (α_ℓ q_{ℓ−1} + ⋯ + α_{i+1} q_i)^T p_j > 0, where the inequality follows from (a) and (c).

We are now able to prove our main theorem about the monotonic increase of ‖x_k‖ for CR and MINRES. A similar result was proved for CG by Steihaug [68].

Theorem 2.1.6. For CR (and hence MINRES) on an spd system Ax = b, ‖x_k‖ is strictly increasing.

Proof.

    ‖x_i‖² − ‖x_{i−1}‖² = 2 α_i x_{i−1}^T p_{i−1} + α_i² p_{i−1}^T p_{i−1} > 0,

where the last inequality follows from Theorem 2.1.5 (a), (d) and (e). Therefore ‖x_i‖ > ‖x_{i−1}‖.

The following theorem is a direct consequence of Hestenes and Stiefel [38, Thm 7:5]. However, the second half of that theorem, ‖x∗ − x_{k−1}^{CG}‖ > ‖x∗ − x_k^{MINRES}‖, rarely holds in machine arithmetic. We give here an alternative proof that does not depend on CG.

Theorem 2.1.7. For CR (and hence MINRES) on an spd system Ax = b, the error ‖x∗ − x_k‖ is strictly decreasing.

Proof. From the update rule for x_k, we can express x∗ as

    x∗ = x_ℓ = x_{ℓ−1} + α_ℓ p_{ℓ−1} = ⋯
       = x_k + α_{k+1} p_k + ⋯ + α_ℓ p_{ℓ−1}    (2.2)
       = x_{k−1} + α_k p_{k−1} + α_{k+1} p_k + ⋯ + α_ℓ p_{ℓ−1}.    (2.3)

Using the last two equalities above, we can write

    ‖x∗ − x_{k−1}‖² − ‖x∗ − x_k‖²
      = (x_ℓ − x_{k−1})^T (x_ℓ − x_{k−1}) − (x_ℓ − x_k)^T (x_ℓ − x_k)
      = 2 α_k p_{k−1}^T (α_{k+1} p_k + ⋯ + α_ℓ p_{ℓ−1}) + α_k² p_{k−1}^T p_{k−1} > 0,

where the last inequality follows from Theorem 2.1.5 (a), (d).

The following theorem is given in [38, Thm 7:4]. We give an alternative proof here.

Theorem 2.1.8. For CR (and hence MINRES) on an spd system Ax = b, the energy norm error ‖x∗ − x_k‖_A is strictly decreasing.

Proof. From (2.2) and (2.3) we can write

    ‖x_ℓ − x_{k−1}‖_A² − ‖x_ℓ − x_k‖_A²
      = (x_ℓ − x_{k−1})^T A (x_ℓ − x_{k−1}) − (x_ℓ − x_k)^T A (x_ℓ − x_k)
      = 2 α_k p_{k−1}^T A (α_{k+1} p_k + ⋯ + α_ℓ p_{ℓ−1}) + α_k² p_{k−1}^T A p_{k−1}
      = 2 α_k q_{k−1}^T (α_{k+1} p_k + ⋯ + α_ℓ p_{ℓ−1}) + α_k² q_{k−1}^T p_{k−1} > 0,

where the last inequality follows from Theorem 2.1.5 (a), (c).

The following theorem is available from [52] and [38, Thm 7:2], and is the characterizing property of MINRES. We include it here for completeness.

Theorem 2.1.9. For MINRES on any system Ax = b, ‖r_k‖ is decreasing.

Proof. This follows immediately from (1.6).

2.2 BACKWARD ERROR ANALYSIS

"The data frequently contains uncertainties due to measurements, previous computations, or errors committed in storing numbers on the computer. If the backward error is no larger than these uncertainties then the computed solution can hardly be criticized — it may be the solution we are seeking, for all we know."
    Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms (2002)

For many physical problems requiring numerical solution, we are given inexact or uncertain input data.
Examples include model estimation in geophysics [45], system identification in control theory [41], and super-resolution imaging [80]. For these problems, it is not justifiable to seek a solution beyond the accuracy of the data [19]. Computation time may be wasted in the extra iterations without yielding a more desirable answer [3]. Also, rounding errors are introduced during computation. Both errors in the original data and rounding errors can be analyzed in a common framework by applying the Wilkinson principle, which considers any computed solution to be the exact solution of a nearby problem [10; 25]. The measure of "nearby" should match the error in the input data. The design of stopping rules from this viewpoint is an important part of backward error analysis [4; 39; 48; 61].

For a consistent linear system Ax = b, there may be uncertainty in A and/or b. From now on we think of x_k as coming from the kth iteration of one of the iterative solvers. Following Titley-Peloquin [75] we say that x_k is an acceptable solution if and only if there exist perturbations E and f satisfying

    (A + E) x_k = b + f, \qquad \frac{\|E\|}{\|A\|} \le \alpha, \qquad \frac{\|f\|}{\|b\|} \le \beta    (2.4)

for some tolerances α ≥ 0, β ≥ 0 that reflect the (preferably known) accuracy of the data. We are naturally interested in minimizing the size of E and f. If we define the optimization problem

    \min_{\xi, E, f} \xi \quad \text{s.t.} \quad (A + E) x_k = b + f, \quad \frac{\|E\|}{\|A\|} \le \alpha\xi, \quad \frac{\|f\|}{\|b\|} \le \beta\xi

to have optimal solution ξ_k, E_k, f_k (all functions of x_k, α, and β), we see that x_k is an acceptable solution if and only if ξ_k ≤ 1. We call ξ_k the normwise relative backward error (NRBE) for x_k.

With r_k = b − A x_k, the optimal solution ξ_k, E_k, f_k is shown in [75] to be given by

    \phi_k = \frac{\beta\|b\|}{\alpha\|A\|\|x_k\| + \beta\|b\|}, \qquad \xi_k = \frac{\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|},    (2.5)

    E_k = (1 - \phi_k)\,\frac{r_k x_k^T}{\|x_k\|^2}, \qquad f_k = -\phi_k r_k.    (2.6)

(See [39, p12] for the case β = 0 and [39, §7.1 and p336] for the case α = β.)

2.2.1 STOPPING RULE

For general tolerances α and β, the condition ξ_k ≤ 1 for x_k to be an acceptable solution becomes

    ‖r_k‖ ≤ α‖A‖‖x_k‖ + β‖b‖,    (2.7)

the stopping rule used in LSQR for consistent systems [53, p54, rule S1].

2.2.2 MONOTONIC BACKWARD ERRORS

Of interest is the size of the perturbations to A and b for which x_k is an exact solution of Ax = b. From (2.5)–(2.6), the perturbations have the following norms:

    \|E_k\| = (1 - \phi_k)\frac{\|r_k\|}{\|x_k\|} = \frac{\alpha\|A\|\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|},    (2.8)

    \|f_k\| = \phi_k\|r_k\| = \frac{\beta\|b\|\|r_k\|}{\alpha\|A\|\|x_k\| + \beta\|b\|}.    (2.9)

Since ‖x_k‖ is monotonically increasing for CG and MINRES (when A is spd), we see from (2.5) that φ_k is monotonically decreasing for both solvers. Since ‖r_k‖ is monotonically decreasing for MINRES (but not for CG), we have the following result.

Theorem 2.2.1. Suppose α > 0 and β > 0 in (2.4). For CR and MINRES (but not CG), the relative backward errors ‖E_k‖/‖A‖ and ‖f_k‖/‖b‖ decrease monotonically.

Proof. This follows from (2.8)–(2.9) with ‖x_k‖ increasing for both solvers and ‖r_k‖ decreasing for CR and MINRES but not for CG.

2.2.3 OTHER CONVERGENCE MEASURES

The error ‖x∗ − x_k‖ and the energy norm error ‖x∗ − x_k‖_A are two possible measures of convergence. In trust-region methods [68], some eigenvalue problems [74], finite element approximations [1], and some other applications [46; 84], it is desirable to minimize ‖x∗ − x_k‖_A, which makes CG a sensible algorithm to use. We should note that since x∗ is not known, neither ‖x∗ − x_k‖ nor ‖x∗ − x_k‖_A can be computed directly from the CG algorithm.
However, bounds and estimates have been derived for ‖x∗ − x_k‖_A [9; 30], and they can be used for stopping rules based on the energy norm error. An alternative stopping criterion is derived for MINRES by Calvetti et al. [8], based on an L-curve defined by ‖r_k‖ and cond(H_k).

2.3 NUMERICAL RESULTS

Here we compare the convergence of CG and MINRES on various spd systems Ax = b and some associated indefinite systems (A − δI)x = b. The test examples are drawn from the University of Florida Sparse Matrix Collection (Davis [18]). We experimented with all 26 cases for which A is real spd and b is supplied. We compute the condition number for each test matrix by finding the largest and smallest eigenvalues using eigs(A,1,'LM') and eigs(A,1,'SM') respectively. For this test set, the condition numbers range from 1.7E+03 to 3.1E+13.

Since A is spd, we applied diagonal preconditioning by redefining A and b as follows: d = diag(A), D = diag(1./sqrt(d)), A ← DAD, b ← Db, b ← b/‖b‖. Thus in the figures below we have diag(A) = I and ‖b‖ = 1. With this preconditioning, the condition numbers range from 1.2E+01 to 2.2E+11. The distribution of condition numbers of the test set matrices before and after preconditioning is shown in Figure 2.1.

[Figure 2.1: Distribution of condition number for matrices used for the CG vs MINRES comparison, before and after diagonal preconditioning (cond(A) plotted against Matrix ID).]

The stopping rule used for CG and MINRES was (2.7) with α = 0 and β = 10⁻⁸ (that is, ‖r_k‖ ≤ 10⁻⁸‖b‖ = 10⁻⁸).

2.3.1 POSITIVE-DEFINITE SYSTEMS

In defining backward errors, we assume for simplicity that α > 0 and β = 0 in (2.4)–(2.6), even though it doesn't match the choice α = 0 and β = 10⁻⁸ in the stopping rule. This gives φ_k = 0 and ‖E_k‖ = ‖r_k‖/‖x_k‖ in (2.8). Thus, as in Theorem 2.2.1, we expect ‖E_k‖ to decrease monotonically for CR and MINRES but not for CG. We also compute ‖x∗ − x_k‖ and ‖x∗ − x_k‖_A at each iteration for both algorithms, where x∗ is obtained by MATLAB's backslash function A\b, which uses a sparse Cholesky factorization of A [12].

In Figures 2.2 and 2.3, we plot ‖r_k‖/‖x_k‖, ‖x∗ − x_k‖, and ‖x∗ − x_k‖_A for CG and MINRES for four different problems. For CG, the plots confirm the properties in Theorems 2.1.2 and 2.1.3 that ‖x∗ − x_k‖ and ‖x∗ − x_k‖_A are monotonic. For MINRES, the plots confirm the properties in Theorems 2.2.1, 2.1.8, and 2.1.7 that ‖r_k‖/‖x_k‖, ‖x∗ − x_k‖, and ‖x∗ − x_k‖_A are monotonic.

Figure 2.2 (left) shows problem Schenk_AFE_af_shell8 with A of size 504855 × 504855 and cond(A) = 2.7E+05. From the plot of backward errors ‖r_k‖/‖x_k‖, we see that both CG and MINRES converge quickly at the early iterations. Then the backward error of MINRES plateaus at about iteration 80, and the backward error of CG stays about 1 order of magnitude behind MINRES. A similar phenomenon of fast convergence at early iterations followed by slow convergence is also observed in the energy norm error and 2-norm error plots.

Figure 2.2 (right) shows problem Cannizzo_sts4098 with A of size 4098 × 4098 and cond(A) = 6.7E+03. MINRES converges slightly faster in terms of backward error, while CG converges slightly faster in terms of energy norm error and 2-norm error.

Figure 2.3 (left) shows problem Simon_raefsky4 with A of size 19779 × 19779 and cond(A) = 2.2E+11.
Because of the high condition number, both algorithms hit the 5n iteration limit that we set. We see that the backward error for MINRES converges faster than for CG, as expected. For the energy norm error, CG is able to decrease over 5 orders of magnitude, while MINRES plateaus after a decrease of 2 orders of magnitude.

[Figure 2.2: Comparison of backward and forward errors for CG and MINRES solving two spd systems Ax = b. Top: the values of log10(‖r_k‖/‖x_k‖) plotted against iteration number k; these values define log10(‖E_k‖) when the stopping tolerances in (2.7) are α > 0 and β = 0. Middle: the values of log10 ‖x_k − x∗‖_A, the quantity that CG minimizes at each iteration. Bottom: the values of log10 ‖x_k − x∗‖. Left: problem Schenk_AFE_af_shell8, with n = 504855 and cond(A) = 2.7E+05. Right: Cannizzo_sts4098, with n = 4098 and cond(A) = 6.7E+03.]

For both the energy norm error and the 2-norm error, MINRES reaches a lower point than CG for some iterations. This must be due to numerical error in CG and MINRES (a result of loss of orthogonality in V_k).

Figure 2.3 (right) shows problem BenElechi_BenElechi1 with A of size 245874 × 245874 and cond(A) = 1.8E+09. The backward error of MINRES stays ahead of CG by 2 orders of magnitude for most iterations. Around iteration 32000, the backward error of both algorithms goes down rapidly and CG catches up with MINRES. Both algorithms exhibit a plateau in energy norm error for the first 20000 iterations. The error norms for CG start decreasing around iteration 20000 and decrease even faster after iteration 30000.

Figure 2.4 shows ‖r_k‖ and ‖x_k‖ for CG and MINRES on two typical spd examples. We see that ‖x_k‖ is monotonically increasing for both solvers, and the ‖x_k‖ values rise fairly rapidly to their limiting value ‖x∗‖, with a moderate delay for MINRES.

Figure 2.5 shows ‖r_k‖ and ‖x_k‖ for CG and MINRES on two spd examples in which the residual decrease and the solution norm increase are somewhat slower than typical. The rise of ‖x_k‖ for MINRES is rather more delayed. In the second case, if the stopping tolerance were β = 10⁻⁶ rather than β = 10⁻⁸, the final MINRES ‖x_k‖ (k ≈ 10000) would be less than half the exact value ‖x∗‖. It will be of future interest to evaluate this effect within the context of trust-region methods for optimization.

WHY DOES ‖r_k‖ FOR CG LAG BEHIND MINRES?
WHY DOES ‖r_k‖ FOR CG LAG BEHIND MINRES?

It is commonly thought that even though MINRES is known to minimize ‖r_k‖ at each iteration, the cumulative minimum of ‖r_k‖ for CG should approximately match that of MINRES. That is,

    min_{0≤i≤k} ‖r_i^CG‖ ≈ ‖r_k^MINRES‖.

However, in Figures 2.2 and 2.3 we see that ‖r_k‖ for MINRES is often smaller than for CG by 1 or 2 orders of magnitude. This phenomenon can be explained by the following relation between ‖r_k^CG‖ and ‖r_k^MINRES‖ [76; 35, Lemma 5.4.1]:

    ‖r_k^CG‖ = ‖r_k^MINRES‖ / sqrt( 1 − ‖r_k^MINRES‖² / ‖r_{k−1}^MINRES‖² ).        (2.10)
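Relation (2.10) is easy to probe numerically. The following sketch (ours, not part of the original experiments; it uses MATLAB's built-in pcg and minres on a small random spd system, and assumes MINRES does not stagnate exactly) compares the CG residual norms against the values predicted by (2.10) from the MINRES residual norms:

    % Illustrate (2.10): CG residual norms predicted from MINRES residual norms
    n = 100;
    A = sprandsym(n,0.2,1e-3,1);           % random spd matrix, cond(A) ~ 1e3
    b = randn(n,1);  b = b/norm(b);
    [~,~,~,~,rcg] = pcg(A,b,1e-10,n);      % rcg(k+1) = ||r_k|| for CG
    [~,~,~,~,rmr] = minres(A,b,1e-10,n);   % rmr(k+1) = ||r_k|| for MINRES
    k = 2:min(length(rcg),length(rmr));
    pred = rmr(k)./sqrt(1 - (rmr(k)./rmr(k-1)).^2);   % right-hand side of (2.10)
    semilogy(k-1,rcg(k),'o', k-1,pred,'.');           % the two curves should agree
    legend('CG ||r_k||','predicted from MINRES via (2.10)');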
[Figure 2.3: Comparison of backward and forward errors for CG and MINRES solving two spd systems Ax = b. Top: log₁₀(‖r_k‖/‖x_k‖) against iteration number k; these values define log₁₀‖E_k‖ when the stopping tolerances in (2.7) are α > 0 and β = 0. Middle: log₁₀‖x_k − x∗‖_A, the quantity that CG minimizes at each iteration. Bottom: log₁₀‖x_k − x∗‖. Left: problem Simon_raefsky4, with n = 19779 and cond(A) = 2.2E+11. Right: problem BenElechi_BenElechi1, with n = 245874 and cond(A) = 1.8E+09.]

[Figure 2.4: Comparison of residual and solution norms for CG and MINRES solving two spd systems Ax = b with n = 16146 and 4098. Top: log₁₀‖r_k‖ against iteration number k. Bottom: ‖x_k‖ against k. The solution norms grow somewhat faster for CG than for MINRES. Both reach the limiting value ‖x∗‖ significantly before x_k is close to x∗.]

[Figure 2.5: Comparison of residual and solution norms for CG and MINRES solving two spd systems Ax = b with n = 82654 and 245874. Sometimes the solution norms take longer to reach the limiting value ‖x∗‖. Top: log₁₀‖r_k‖ against iteration number k. Bottom: ‖x_k‖ against k. Again the solution norms grow faster for CG.]

From (2.10), one can infer that if ‖r_k^MINRES‖ decreases substantially between iterations k−1 and k, then ‖r_k^CG‖ will be roughly the same as ‖r_k^MINRES‖. The converse also holds, in that ‖r_k^CG‖ will be much larger than ‖r_k^MINRES‖ if MINRES is almost stalling at iteration k (i.e., ‖r_k^MINRES‖ did not decrease much relative to the previous iteration).

The above analysis was pointed out by Titley-Peloquin [76] in comparing LSQR and LSMR. We repeat the analysis here for CG vs MINRES and extend it to demonstrate why there is a lag in general for large problems. With a residual-based stopping rule, CG and MINRES stop when ‖r_k‖ ≤ β‖r₀‖. When CG and MINRES stop at iteration ℓ, we have

    ∏_{k=1}^{ℓ} ‖r_k‖/‖r_{k−1}‖ = ‖r_ℓ‖/‖r₀‖ ≈ β.

Thus on average, ‖r_k^MINRES‖/‖r_{k−1}^MINRES‖ will be closer to 1 if ℓ is large. This means that for systems that take more iterations to converge, CG will lag behind MINRES more (a bigger gap between ‖r_k^CG‖ and ‖r_k^MINRES‖).

2.3.2 INDEFINITE SYSTEMS

A key part of Steihaug's trust-region method for large-scale unconstrained optimization [68] (see also [15]) is his proof that when CG is applied to a symmetric (possibly indefinite) system Ax = b, the solution norms ‖x₁‖, ..., ‖x_k‖ are strictly increasing as long as p_j^T A p_j > 0 for all iterations 1 ≤ j ≤ k. (We are using the notation of Algorithm 1.3.) From our proof of Theorem 2.1.5, we see that the same property holds for CR and MINRES as long as both p_j^T A p_j > 0 and r_j^T A r_j > 0 for all iterations 1 ≤ j ≤ k.

Since MINRES might be a useful solver in the trust-region context, it is of interest to offer some empirical results about the behavior of ‖x_k‖ when MINRES is applied to indefinite systems. First, on the nonsingular indefinite system

    [ 2  1  1 ]       [ 0 ]
    [ 1  0  1 ] x  =  [ 1 ] ,        (2.11)
    [ 1  1  2 ]       [ 1 ]

MINRES gives non-monotonic solution norms, as shown in the left plot of Figure 2.6. The decrease in ‖x_k‖ implies that the backward errors ‖r_k‖/‖x_k‖ may not be monotonic, as illustrated in the right plot.

[Figure 2.6: For MINRES on indefinite problem (2.11), ‖x_k‖ and the backward error ‖r_k‖/‖x_k‖ are both slightly non-monotonic.]
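This small example is easy to reproduce. The sketch below (ours, for illustration; it uses MATLAB's built-in minres with a tolerance tight enough that the k-th call returns the k-th iterate) prints ‖x_k‖ and ‖r_k‖/‖x_k‖ for k = 1, 2, 3:

    % MINRES on the 3x3 indefinite system (2.11): ||x_k|| need not increase
    A = [2 1 1; 1 0 1; 1 1 2];
    b = [0; 1; 1];
    for k = 1:3
        [xk,flag] = minres(A,b,1e-14,k);   % k-th MINRES iterate
        fprintf('k = %d  ||x_k|| = %.4f  ||r_k||/||x_k|| = %.4f\n', ...
                k, norm(xk), norm(b-A*xk)/norm(xk));
    end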
More generally, we can gain an impression of the behavior of ‖x_k‖ by recalling from Choi et al. [14] the connection between MINRES and MINRES-QLP. Both methods compute the iterates x_k^M = V_k y_k^M in (1.6) from the subproblems

    y_k^M = arg min_{y∈R^k} ‖H_k y − β₁e₁‖,   and possibly   T_ℓ y_ℓ^M = β₁e₁,

where k = ℓ is the last iteration. When A is nonsingular or Ax = b is consistent (which we now assume), y_k^M is uniquely defined for each k ≤ ℓ and the methods compute the same iterates x_k^M (but by different numerical methods). In fact they both compute the expanding QR factorizations

    Q_k [ H_k  β₁e₁ ] = [ R_k  t_k ; 0  φ_k ]

(with R_k upper tridiagonal), and MINRES-QLP also computes the orthogonal factorizations R_k P_k = L_k (with L_k lower tridiagonal), from which the k-th solution estimate is defined by W_k = V_k P_k, L_k u_k = t_k, and x_k^M = W_k u_k. As shown in [14, §5.3], the construction of these quantities is such that the first k − 3 columns of W_k are the same as in W_{k−1}, and the first k − 3 elements of u_k are the same as in u_{k−1}. Since W_k has orthonormal columns, ‖x_k^M‖ = ‖u_k‖, where the first k − 2 elements of u_k are unaltered by later iterations. As shown in [14, §6.5], this means that certain quantities can be cheaply updated to give norm estimates of the form

    χ² ← χ² + μ̂²_{k−2},        ‖x_k^M‖² = χ² + μ̃²_{k−1} + μ̄²_k,

where it is clear that χ² increases monotonically. Although the last two terms are of unpredictable size, ‖x_k^M‖² tends to be dominated by the monotonic term χ², and we can expect that ‖x_k^M‖ will be approximately monotonic as k increases from 1 to ℓ.

Experimentally we find that for most MINRES iterations on an indefinite problem, ‖x_k‖ does increase. To obtain indefinite examples that were sensibly scaled, we used the four spd (A, b) cases in Figures 2.4–2.5, applied diagonal scaling as before, and solved (A − δI)x = b with δ = 0.5, where A and b are now the scaled data (so that diag(A) = I). The number of iterations increased significantly but was limited to n.

Figure 2.7 shows log₁₀‖r_k‖ and ‖x_k‖ for the first two cases (MINRES applied to (A − δI)x = b, where A is the spd matrices used in Figure 2.4 and δ = 0.5 is large enough to make the systems indefinite). These are typical examples in which ‖x_k‖ is essentially monotonic, as in the spd case. The backward errors ‖r_k‖/‖x_k‖ (not shown) were even closer to being monotonic.

Figure 2.8 shows log₁₀‖r_k‖ and ‖x_k‖ for the second two cases (where A is the spd matrices in Figure 2.5). The left example reveals a definite period of decrease in ‖x_k‖. Nevertheless, during the n = 82654 iterations, ‖x_k‖ increased 83% of the time and the backward errors ‖r_k‖/‖x_k‖ decreased 91% of the time. The right example is more like those in Figure 2.7. During the n = 245874 iterations, ‖x_k‖ increased 88% of the time, the backward errors ‖r_k‖/‖x_k‖ decreased 95% of the time, and any nonmonotonicity was very slight.
[Figure 2.7: Residual norms and solution norms when MINRES is applied to two indefinite systems (A − δI)x = b, where A is the spd matrices used in Figure 2.4 (n = 16146 and 4098) and δ = 0.5 is large enough to make the systems indefinite. Top: log₁₀‖r_k‖ against iteration number k for the first n iterations. Bottom left: ‖x_k‖ against k. During the n = 16146 iterations, ‖x_k‖ increased 83% of the time and the backward errors ‖r_k‖/‖x_k‖ (not shown here) decreased 96% of the time. Bottom right: During the n = 4098 iterations, ‖x_k‖ increased 90% of the time and the backward errors (not shown here) decreased 98% of the time.]

[Figure 2.8: Residual norms and solution norms when MINRES is applied to two indefinite systems (A − δI)x = b, where A is the spd matrices used in Figure 2.5 (n = 82654 and 245874) and δ = 0.5 is large enough to make the systems indefinite. Top: log₁₀‖r_k‖ against iteration number k for the first n iterations. Bottom left: ‖x_k‖ against k. There is a mild but clear decrease in ‖x_k‖ over an interval of about 1000 iterations. During the n = 82654 iterations, ‖x_k‖ increased 83% of the time and the backward errors ‖r_k‖/‖x_k‖ (not shown here) decreased 91% of the time. Bottom right: The solution norms and backward errors are essentially monotonic. During the n = 245874 iterations, ‖x_k‖ increased 88% of the time and the backward errors (not shown here) decreased 95% of the time.]

2.4 SUMMARY

Our experimental results here provide empirical evidence that MINRES can often stop much sooner than CG on spd systems when the stopping rule is based on the backward error norms ‖r_k‖/‖x_k‖ (or the more general norms in (2.8)–(2.9)). On the other hand, CG generates iterates x_k with smaller ‖x∗ − x_k‖_A and ‖x∗ − x_k‖, and is recommended in applications where these quantities should be minimized.

For full-rank least-squares problems min ‖Ax − b‖, the solver LSQR [53; 54] is equivalent to CG on the (spd) normal equation A^TAx = A^Tb. This suggests that a MINRES-based algorithm for least squares may share the same advantages as MINRES for symmetric systems, especially in the case of early termination. LSMR is designed on this basis. The derivation of LSMR is presented in Chapter 3. Numerical experiments in Chapter 4 show that LSMR has more desirable properties than we originally expected.

Theorem 2.1.5 shows that MINRES shares a known property of CG: ‖x_k‖ increases monotonically when A is spd.
This implies that ‖x_k‖ is monotonic for LSMR (as conjectured in [24]), and suggests that MINRES might be a useful alternative to CG in the context of trust-region methods for optimization.

3 LSMR

We present a numerical method called LSMR for computing a solution x to the following problems:

    Unsymmetric equations:       minimize ‖x‖₂ subject to Ax = b
    Linear least squares (LS):   minimize ‖Ax − b‖₂
    Regularized least squares:   minimize ‖ [A; λI] x − [b; 0] ‖₂

where A ∈ R^{m×n}, b ∈ R^m, and λ ≥ 0, with m ≤ n or m ≥ n. The matrix A is used as an operator for which products of the form Av and A^Tu can be computed for various v and u. (If A is symmetric or Hermitian and λ = 0, MINRES [52] or MINRES-QLP [14] are applicable.)

LSMR is similar in style to the well-known method LSQR [53; 54] in being based on the Golub-Kahan bidiagonalization of A [29]. LSQR is equivalent to the conjugate-gradient (CG) method applied to the normal equation (A^TA + λ²I)x = A^Tb. It has the property of reducing ‖r_k‖ monotonically, where r_k = b − Ax_k is the residual for the approximate solution x_k. (For simplicity, we are letting λ = 0.) In contrast, LSMR is equivalent to MINRES applied to the normal equation, so that the quantities ‖A^Tr_k‖ are monotonically decreasing. We have also proved that ‖r_k‖ is monotonically decreasing, and in practice it is never very far behind the corresponding value for LSQR. Hence, although LSQR and LSMR ultimately converge to similar points, it is safer to use LSMR in situations where the solver must be terminated early.

Stopping conditions are typically based on backward error: the norm of some perturbation to A for which the current iterate x_k solves the perturbed problem exactly. Experiments on many sparse LS test problems show that for LSMR, a certain cheaply computable backward error for each x_k is close to the optimal (smallest possible) backward error. This is an unexpected but highly desirable advantage.

OVERVIEW

Section 3.1 derives the basic LSMR algorithm with λ = 0. Section 3.2 derives various norms and stopping criteria. Section 3.3.2 discusses singular systems. Section 3.4 compares the complexity of LSQR and LSMR. Section 3.5 derives the LSMR algorithm with λ ≥ 0. Section 3.6 proves one of the main lemmas.

3.1 DERIVATION OF LSMR

We begin with the Golub-Kahan process Bidiag(A, b) [29], an iterative procedure for transforming [b  A] to upper-bidiagonal form [β₁e₁  B_k]. The process was introduced in Section 1.4.1. Since it is central to the development of LSMR, we restate it here in more detail.

3.1.1 THE GOLUB-KAHAN PROCESS

1. Set β₁u₁ = b (that is, β₁ = ‖b‖, u₁ = b/β₁) and α₁v₁ = A^Tu₁.
2. For k = 1, 2, ..., set

       β_{k+1}u_{k+1} = Av_k − α_k u_k,
       α_{k+1}v_{k+1} = A^Tu_{k+1} − β_{k+1}v_k.        (3.1)

After k steps, we have AV_k = U_{k+1}B_k and A^TU_{k+1} = V_{k+1}L_{k+1}^T, where we define V_k = [v₁ v₂ ⋯ v_k], U_k = [u₁ u₂ ⋯ u_k], and

    B_k = [ α₁
            β₂  α₂
                β₃  ⋱
                    ⋱  α_k
                       β_{k+1} ] ,        L_{k+1} = [ B_k   α_{k+1}e_{k+1} ].

Now consider

    A^TAV_k = A^TU_{k+1}B_k = V_{k+1}L_{k+1}^T B_k
            = V_{k+1} [ B_k^T ; α_{k+1}e_{k+1}^T ] B_k
            = V_{k+1} [ B_k^TB_k ; α_{k+1}β_{k+1}e_k^T ].        (3.2)

This is equivalent to what would be generated by the symmetric Lanczos process with matrix A^TA and starting vector A^Tb (for this reason we define β̄_k ≡ α_k β_k below), and the columns of V_k lie in the Krylov subspace K_k(A^TA, A^Tb).
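As a concrete reference, the two steps above translate directly into MATLAB. The sketch below is ours, for illustration only (A may be dense or sparse; reorthogonalization is omitted, and we assume no breakdown, i.e., no α or β becomes zero):

    function [U,V,alpha,beta] = bidiagGK(A,b,K)
    % K steps of Golub-Kahan bidiagonalization Bidiag(A,b).
    % alpha and beta hold the diagonals of B_K; U(:,k) = u_k, V(:,k) = v_k.
    [m,n]    = size(A);
    U        = zeros(m,K+1);   V    = zeros(n,K+1);
    alpha    = zeros(K+1,1);   beta = zeros(K+1,1);
    beta(1)  = norm(b);        U(:,1) = b/beta(1);
    v        = A'*U(:,1);
    alpha(1) = norm(v);        V(:,1) = v/alpha(1);
    for k = 1:K
        u          = A*V(:,k) - alpha(k)*U(:,k);
        beta(k+1)  = norm(u);  U(:,k+1) = u/beta(k+1);
        v          = A'*U(:,k+1) - beta(k+1)*V(:,k);
        alpha(k+1) = norm(v);  V(:,k+1) = v/alpha(k+1);
    end
    end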
3.1.2 USING GOLUB-KAHAN TO SOLVE THE NORMAL EQUATION

Krylov subspace methods for solving linear equations form solution estimates x_k = V_k y_k for some y_k, where the columns of V_k are an expanding set of theoretically independent vectors. (In this case, V_k and U_k are theoretically orthonormal.)

For the equation A^TAx = A^Tb, any solution x has the property of minimizing ‖r‖, where r = b − Ax is the corresponding residual vector. Thus, in the development of LSQR it was natural to choose y_k to minimize ‖r_k‖ at each stage. Since

    r_k = b − AV_k y_k = β₁u₁ − U_{k+1}B_k y_k = U_{k+1}(β₁e₁ − B_k y_k),        (3.3)

where U_{k+1} is theoretically orthonormal, the subproblem min_{y_k} ‖β₁e₁ − B_k y_k‖ easily arose.

In contrast, for LSMR we wish to minimize ‖A^Tr_k‖. Let β̄_k ≡ α_k β_k for all k. Since A^Tr_k = A^Tb − A^TAx_k = β₁α₁v₁ − A^TAV_k y_k, from (3.2) we have

    A^Tr_k = β̄₁v₁ − V_{k+1} [ B_k^TB_k ; α_{k+1}β_{k+1}e_k^T ] y_k = V_{k+1}( β̄₁e₁ − [ B_k^TB_k ; β̄_{k+1}e_k^T ] y_k ),

and we are led to the subproblem

    min_{y_k} ‖A^Tr_k‖ = min_{y_k} ‖ β̄₁e₁ − [ B_k^TB_k ; β̄_{k+1}e_k^T ] y_k ‖.        (3.4)

Efficient solution of this LS subproblem is the heart of algorithm LSMR.

3.1.3 TWO QR FACTORIZATIONS

As in LSQR, we form the QR factorization

    Q_{k+1}B_k = [ R_k ; 0 ],        R_k = [ ρ₁  θ₂
                                                 ρ₂  ⋱
                                                     ⋱  θ_k
                                                        ρ_k ].        (3.5)

If we define t_k = R_k y_k and solve R_k^T q_k = β̄_{k+1}e_k, we have q_k = (β̄_{k+1}/ρ_k)e_k = φ_k e_k with ρ_k = (R_k)_{kk} and φ_k ≡ β̄_{k+1}/ρ_k. Then we perform a second QR factorization

    Q̄_{k+1} [ R_k^T  β̄₁e₁ ; φ_k e_k^T  0 ] = [ R̄_k  z_k ; 0  ζ̄_{k+1} ],
        R̄_k = [ ρ̄₁  θ̄₂
                    ρ̄₂  ⋱
                        ⋱  θ̄_k
                           ρ̄_k ].        (3.6)

Combining what we have with (3.4) gives

    min_{y_k} ‖A^Tr_k‖ = min_{y_k} ‖ β̄₁e₁ − [ R_k^TR_k ; q_k^TR_k ] y_k ‖
                       = min_{t_k} ‖ β̄₁e₁ − [ R_k^T ; φ_k e_k^T ] t_k ‖
                       = min_{t_k} ‖ [ z_k ; ζ̄_{k+1} ] − [ R̄_k ; 0 ] t_k ‖.        (3.7)

The subproblem is solved by choosing t_k from R̄_k t_k = z_k. (Since every element of t_k changes in each iteration, t_k is never constructed explicitly. Instead, the recurrences derived in the following sections are used.)

3.1.4 RECURRENCE FOR x_k

Let W_k and W̄_k be computed by forward substitution from R_k^T W_k^T = V_k^T and R̄_k^T W̄_k^T = W_k^T. Then from x_k = V_k y_k, R_k y_k = t_k, and R̄_k t_k = z_k, we have x₀ ≡ 0 and

    x_k = W_k R_k y_k = W_k t_k = W̄_k R̄_k t_k = W̄_k z_k = x_{k−1} + ζ_k w̄_k.

3.1.5 RECURRENCE FOR W_k AND W̄_k

If we write V_k = [v₁ v₂ ⋯ v_k], W_k = [w₁ w₂ ⋯ w_k], W̄_k = [w̄₁ w̄₂ ⋯ w̄_k], and z_k = [ζ₁ ζ₂ ⋯ ζ_k]^T, an important fact is that when k increases to k + 1, all quantities remain the same except for one additional term.

The first QR factorization proceeds as follows. At iteration k we construct a plane rotation operating on rows l and l + 1:

    P_l = blkdiag( I_{l−1},  [ c_l  s_l ; −s_l  c_l ],  I_{k−l} ).

Now if Q_{k+1} = P_k ⋯ P₂P₁, we have

    Q_{k+2}B_{k+1} = P_{k+1} [ Q_{k+1}  ; 1 ] [ B_k  α_{k+1}e_{k+1} ; 0  β_{k+2} ]
                   = P_{k+1} [ R_k  θ_{k+1}e_k ; 0  ᾱ_{k+1} ; 0  β_{k+2} ]
                   = [ R_k  θ_{k+1}e_k ; 0  ρ_{k+1} ; 0  0 ],

and we see that θ_{k+1} = s_k α_{k+1} = (β_{k+1}/ρ_k)α_{k+1} = β̄_{k+1}/ρ_k = φ_k. Therefore we can write θ_{k+1} instead of φ_k.

For the second QR factorization, if Q̄_{k+1} = P̄_k ⋯ P̄₂P̄₁ we know that

    Q̄_{k+1} [ R_k^T ; θ_{k+1}e_k^T ] = [ R̄_k ; 0 ],

and so

    Q̄_{k+2} [ R_{k+1}^T ; θ_{k+2}e_{k+1}^T ] = P̄_{k+1} [ R̄_k  θ̄_{k+1}e_k ; 0  c̄_k ρ_{k+1} ; 0  θ_{k+2} ] = [ R̄_k  θ̄_{k+1}e_k ; 0  ρ̄_{k+1} ; 0  0 ].        (3.8)

By considering the last row of the matrix equation R_{k+1}^T W_{k+1}^T = V_{k+1}^T and the last row of R̄_{k+1}^T W̄_{k+1}^T = W_{k+1}^T, we obtain equations that define w_{k+1} and w̄_{k+1}:

    θ_{k+1}w_k^T + ρ_{k+1}w_{k+1}^T = v_{k+1}^T,
    θ̄_{k+1}w̄_k^T + ρ̄_{k+1}w̄_{k+1}^T = w_{k+1}^T.

3.1.6 THE TWO ROTATIONS

To summarize, the rotations P_k and P̄_k have the following effects on our computation:

    [ c_k   s_k ] [ ᾱ_k      0       ]   [ ρ_k  θ_{k+1} ]
    [ −s_k  c_k ] [ β_{k+1}  α_{k+1} ] = [ 0    ᾱ_{k+1} ] ,

    [ c̄_k   s̄_k ] [ c̄_{k−1}ρ_k  0        ζ̄_k ]   [ ρ̄_k  θ̄_{k+1}       ζ_k     ]
    [ −s̄_k  c̄_k ] [ θ_{k+1}     ρ_{k+1}  0   ] = [ 0    c̄_k ρ_{k+1}  ζ̄_{k+1} ] .
3.1.7 SPEEDING UP FORWARD SUBSTITUTION

The forward substitutions for computing w_k and w̄_k can be made more efficient if we define h_k = ρ_k w_k and h̄_k = ρ_k ρ̄_k w̄_k. We then obtain the updates described in part 6 of Algorithm 3.1 below.

3.1.8 ALGORITHM LSMR

Algorithm 3.1 summarizes the main steps of LSMR for solving Ax ≈ b, excluding the norms and stopping rules developed later.

Algorithm 3.1 Algorithm LSMR
1: (Initialize)
     β₁u₁ = b     α₁v₁ = A^Tu₁     ᾱ₁ = α₁     ζ̄₁ = α₁β₁
     ρ₀ = 1       ρ̄₀ = 1           c̄₀ = 1      s̄₀ = 0
     h₁ = v₁      h̄₀ = 0           x₀ = 0
2: for k = 1, 2, 3, ... do
3:   (Continue the bidiagonalization)
       β_{k+1}u_{k+1} = Av_k − α_k u_k        α_{k+1}v_{k+1} = A^Tu_{k+1} − β_{k+1}v_k
4:   (Construct and apply rotation P_k)
       ρ_k = (ᾱ_k² + β²_{k+1})^{1/2}        (3.9)
       c_k = ᾱ_k/ρ_k        s_k = β_{k+1}/ρ_k        (3.10)
       θ_{k+1} = s_k α_{k+1}        ᾱ_{k+1} = c_k α_{k+1}
5:   (Construct and apply rotation P̄_k)
       θ̄_k = s̄_{k−1}ρ_k        ρ̄_k = ((c̄_{k−1}ρ_k)² + θ²_{k+1})^{1/2}        (3.11)
       c̄_k = c̄_{k−1}ρ_k/ρ̄_k        s̄_k = θ_{k+1}/ρ̄_k        (3.12)
       ζ_k = c̄_k ζ̄_k        ζ̄_{k+1} = −s̄_k ζ̄_k
6:   (Update h, h̄, x)
       h̄_k = h_k − (θ̄_k ρ_k/(ρ_{k−1}ρ̄_{k−1}))h̄_{k−1}
       x_k = x_{k−1} + (ζ_k/(ρ_k ρ̄_k))h̄_k
       h_{k+1} = v_{k+1} − (θ_{k+1}/ρ_k)h_k
7: end for
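For reference, Algorithm 3.1 is short enough to transcribe directly into MATLAB. The sketch below is a minimal, fixed-iteration illustration (ours, not the released LSMR implementation; it omits the stopping rules of Section 3.2 and assumes no breakdown in the bidiagonalization):

    function x = lsmr_basic(A,b,K)
    % Minimal LSMR (Algorithm 3.1): K iterations, lambda = 0, no stopping tests.
    beta  = norm(b);        u = b/beta;
    v     = A'*u;           alpha = norm(v);   v = v/alpha;
    alphabar = alpha;       zetabar = alpha*beta;
    rho   = 1;  rhobar = 1; cbar = 1;  sbar = 0;
    h     = v;  hbar = zeros(size(v));  x = zeros(size(v));
    for k = 1:K
        % Continue the bidiagonalization
        u = A*v - alpha*u;     beta  = norm(u);   u = u/beta;
        v = A'*u - beta*v;     alpha = norm(v);   v = v/alpha;
        % Construct and apply rotation P_k
        rhoold = rho;
        rho    = norm([alphabar beta]);            % (3.9)
        c      = alphabar/rho;  s = beta/rho;      % (3.10)
        theta  = s*alpha;       alphabar = c*alpha;
        % Construct and apply rotation Pbar_k
        rhobarold = rhobar;
        thetabar  = sbar*rho;
        rhobar    = norm([cbar*rho theta]);        % (3.11)
        cbar      = cbar*rho/rhobar;               % (3.12)
        sbar      = theta/rhobar;
        zeta      = cbar*zetabar;  zetabar = -sbar*zetabar;
        % Update h, hbar, x
        hbar = h - (thetabar*rho/(rhoold*rhobarold))*hbar;
        x    = x + (zeta/(rho*rhobar))*hbar;
        h    = v - (theta/rho)*h;
    end
    end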
3.2 NORMS AND STOPPING RULES

For any numerical computation, it is impossible to obtain a result with less relative uncertainty than the input data. Thus, in solving linear systems with iterative methods, it is important to know when we have arrived at a solution with the best level of accuracy permitted, in order to terminate before we waste computational effort. Sections 3.2.1 and 3.2.2 discuss various criteria used to stop LSMR after an appropriate number of iterations. Sections 3.2.3, 3.2.4, 3.2.5, and 3.2.6 derive efficient ways to compute ‖r_k‖, ‖A^Tr_k‖, ‖x_k‖ and to estimate ‖A‖ and cond(A), so that the stopping criteria can be implemented efficiently. All quantities require O(1) computation at each iteration.

3.2.1 STOPPING CRITERIA

With exact arithmetic, the Golub-Kahan process terminates when either α_{k+1} = 0 or β_{k+1} = 0. For certain data b, this could happen in practice when k is small (but is unlikely later because of rounding error). We show that LSMR will have solved the problem at that point and should therefore terminate.

When α_{k+1} = 0, with the expression for ‖A^Tr_k‖ from Section 3.2.4 we have

    ‖A^Tr_k‖ = |ζ̄_{k+1}| = |s̄_k ζ̄_k| = |θ_{k+1}ρ̄_k^{−1}ζ̄_k| = |s_k α_{k+1}ρ̄_k^{−1}ζ̄_k| = 0,

where (3.12), (3.11), and (3.10) are used. Thus, a least-squares solution has been obtained.

When β_{k+1} = 0, we have

    s_k = β_{k+1}ρ_k^{−1} = 0,        (from (3.9))        (3.13)
    β̈_{k+1} = −s_k β̈_k = 0,          (from (3.19), (3.13))        (3.14)

and

    β̇_k = c̃_k^{−1}( β̃_k − s̃_k(−1)^k s^{(k)}c_{k+1}β₁ )    (from (3.30))
        = c̃_k^{−1}β̃_k                                      (from (3.13))
        = ρ̇_k^{−1}ρ̃_k β̃_k                                  (from (3.20))
        = ρ̇_k^{−1}ρ̃_k τ̃_k                                  (from Lemma 3.2.1)
        = τ̇_k.                                              (from (3.26), (3.27))        (3.15)

By using (3.21) (derived in Section 3.2.3), we conclude from (3.15) and (3.14) that ‖r_k‖ = 0, so the system is consistent and Ax_k = b.

3.2.2 PRACTICAL STOPPING CRITERIA

For LSMR we use the same stopping rules as LSQR [53], involving dimensionless quantities ATOL, BTOL, CONLIM:

    S1: Stop if ‖r_k‖ ≤ BTOL‖b‖ + ATOL‖A‖‖x_k‖
    S2: Stop if ‖A^Tr_k‖ ≤ ATOL‖A‖‖r_k‖
    S3: Stop if cond(A) ≥ CONLIM

S1 applies to consistent systems, allowing for uncertainty in A and b [39, Theorem 7.1]. S2 applies to inconsistent systems and comes from Stewart's backward error estimate ‖E₂‖ assuming uncertainty in A; see Section 4.1.1. S3 applies to any system.

3.2.3 COMPUTING ‖r_k‖

We transform R̄_k^T to upper-bidiagonal form R̃_k = Q̃_k R̄_k^T using a third QR factorization with Q̃_k = P̃_{k−1} ⋯ P̃₁. This amounts to one additional rotation per iteration. Now let

    t̃_k = Q̃_k t_k,        b̃_k = [ Q̃_k  ; 1 ] Q_{k+1}e₁β₁.        (3.16)

Then from (3.3), r_k = U_{k+1}(β₁e₁ − B_k y_k) gives

    r_k = U_{k+1}( β₁e₁ − Q_{k+1}^T [ R_k ; 0 ] y_k )
        = U_{k+1}( β₁e₁ − Q_{k+1}^T [ t_k ; 0 ] )
        = U_{k+1}Q_{k+1}^T [ Q̃_k  ; 1 ]^T ( b̃_k − [ t̃_k ; 0 ] ).

Therefore, assuming orthogonality of U_{k+1}, we have

    ‖r_k‖ = ‖ b̃_k − [ t̃_k ; 0 ] ‖.        (3.17)

Algorithm 3.2 Computing ‖r_k‖ in LSMR
1: (Initialize)
     β̈₁ = β₁     β̇₀ = 0     ρ̇₀ = 1     τ̃₋₁ = 0     θ̃₀ = 0     ζ₀ = 0
2: for the kth iteration do
3:   (Apply rotation P_k)
       β̂_k = c_k β̈_k        β̈_{k+1} = −s_k β̈_k        (3.19)
4:   (If k ≥ 2, construct and apply rotation P̃_{k−1})
       ρ̃_{k−1} = (ρ̇²_{k−1} + θ̄_k²)^{1/2}
       c̃_{k−1} = ρ̇_{k−1}/ρ̃_{k−1}        s̃_{k−1} = θ̄_k/ρ̃_{k−1}        (3.20)
       θ̃_k = s̃_{k−1}ρ̄_k        ρ̇_k = c̃_{k−1}ρ̄_k
       β̃_{k−1} = c̃_{k−1}β̇_{k−1} + s̃_{k−1}β̂_k        β̇_k = −s̃_{k−1}β̇_{k−1} + c̃_{k−1}β̂_k
5:   (Update t̃_k by forward substitution)
       τ̃_{k−1} = (ζ_{k−1} − θ̃_{k−1}τ̃_{k−2})/ρ̃_{k−1}        τ̇_k = (ζ_k − θ̃_k τ̃_{k−1})/ρ̇_k
6:   (Form ‖r_k‖)
       γ = (β̇_k − τ̇_k)² + β̈²_{k+1},        ‖r_k‖ = √γ        (3.21)
7: end for

The vectors b̃_k and t̃_k can be written in the form

    b̃_k = [ β̃₁ ⋯ β̃_{k−1}  β̇_k  β̈_{k+1} ]^T,        t̃_k = [ τ̃₁ ⋯ τ̃_{k−1}  τ̇_k ]^T.        (3.18)

The vector t̃_k is computed by forward substitution from R̃_k^T t̃_k = z_k.

Lemma 3.2.1. In (3.17)–(3.18), β̃_i = τ̃_i for i = 1, ..., k − 1.

Proof. Section 3.6 proves the lemma by induction.

Using this lemma we can estimate ‖r_k‖ from just the last two elements of b̃_k and the last element of t̃_k, as shown in (3.21). Algorithm 3.2 summarizes how ‖r_k‖ may be obtained from quantities arising from the first and third QR factorizations.

3.2.4 COMPUTING ‖A^Tr_k‖

From (3.7), we have ‖A^Tr_k‖ = |ζ̄_{k+1}|.

3.2.5 COMPUTING ‖x_k‖

From Section 3.1.4 we have x_k = V_k R_k^{−1}R̄_k^{−1}z_k. From the third QR factorization R̃_k = Q̃_k R̄_k^T in Section 3.2.3 and a fourth QR factorization Q̂_k(Q̃_k R_k^T) = R̂_k, we can write

    x_k = V_k R_k^{−1}R̄_k^{−1}z_k = V_k R_k^{−1}Q̃_k^T z̃_k = V_k Q̂_k^T ẑ_k,

where z̃_k and ẑ_k are defined by the forward substitutions R̃_k^T z̃_k = z_k and R̂_k^T ẑ_k = z̃_k. Assuming orthogonality of V_k, we arrive at the estimate ‖x_k‖ = ‖ẑ_k‖. Since only the last diagonal of R̃_k and the bottom 2×2 part of R̂_k change each iteration, this estimate of ‖x_k‖ can again be updated cheaply. The pseudo-code, omitted here, can be derived as in Section 3.2.3.

3.2.6 ESTIMATES OF ‖A‖ AND cond(A)

It is known that the singular values of B_k are interlaced by those of A and are bounded above and below by the largest and smallest nonzero singular values of A [53]. Therefore we can estimate ‖A‖ and cond(A) by ‖B_k‖ and cond(B_k) respectively. Considering the Frobenius norm of B_k, we have the recurrence relation

    ‖B_{k+1}‖²_F = ‖B_k‖²_F + α_k² + β²_{k+1}.

From (3.5)–(3.6) and (3.8), we can show that the following QLP factorization [71] holds:

    Q_{k+1}B_k Q̄_k^T = [ R̄^T_{k−1}  ; θ̄_k e^T_{k−1}  c̄_{k−1}ρ_k ]

(the same as R̄_k^T except for the last diagonal). Since the singular values of B_k are approximated by the diagonal elements of that lower-bidiagonal matrix [71], and since the diagonals are all positive, we can estimate cond(A) by the ratio of the largest and smallest values in {ρ̄₁, ..., ρ̄_{k−1}, c̄_{k−1}ρ_k}. Those values can be updated cheaply.
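Both estimates are one-line updates inside the main loop. A minimal sketch (ours; variable names follow Algorithm 3.1, with maxrhobar and minrhobar assumed to be initialized before the loop and maintained across iterations, for k ≥ 2):

    % Inside the LSMR loop: cheap estimates of ||A|| and cond(A)
    normA2    = normA2 + alpha^2 + beta^2;    % Frobenius-norm recurrence for ||B_k||
    normA     = sqrt(normA2);                 % estimate of ||A||
    maxrhobar = max(maxrhobar, rhobarold);    % track extremes of {rhobar_1,...,rhobar_{k-1}}
    minrhobar = min(minrhobar, rhobarold);
    condA     = max(maxrhobar, cbar*rho) / min(minrhobar, cbar*rho);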
3.3 LSMR PROPERTIES

With x∗ denoting the pseudoinverse solution of min ‖Ax − b‖, we have the following theorems on the norms of various quantities for LSMR.

3.3.1 MONOTONICITY OF NORMS

A number of monotonic properties of LSQR follow directly from the corresponding properties of CG in Section 2.1.1. We list them here for completeness.

Theorem 3.3.1. ‖x_k‖ is strictly increasing for LSQR.
Proof. LSQR on min ‖Ax − b‖ is equivalent to CG on A^TAx = A^Tb. By Theorem 2.1.1, ‖x_k‖ is strictly increasing.

Theorem 3.3.2. ‖x∗ − x_k‖ is strictly decreasing for LSQR.
Proof. This follows from Theorem 2.1.2 for CG.

Theorem 3.3.3. ‖x∗ − x_k‖_{A^TA} = ‖A(x∗ − x_k)‖ = ‖r∗ − r_k‖ is strictly decreasing for LSQR.
Proof. This follows from Theorem 2.1.3 for CG.

We also have the characterizing property of LSQR [53].

Theorem 3.3.4. ‖r_k‖ is strictly decreasing for LSQR.

Next, we prove a number of monotonic properties of LSMR. We emphasize that LSMR has all of the above monotonic properties that LSQR enjoys. In addition, ‖A^Tr_k‖ is monotonic for LSMR. This gives LSMR a much smoother convergence behavior in terms of the Stewart backward error, as shown in Figure 4.2.

Theorem 3.3.5. ‖A^Tr_k‖ is monotonically decreasing for LSMR.
Proof. From Section 3.2.4 and (3.12), ‖A^Tr_k‖ = |ζ̄_{k+1}| = |s̄_k||ζ̄_k| ≤ |ζ̄_k| = ‖A^Tr_{k−1}‖.

Theorem 3.3.6. ‖x_k‖ is strictly increasing for LSMR on min ‖Ax − b‖ when A has full column rank.
Proof. LSMR on min ‖Ax − b‖ is equivalent to MINRES on A^TAx = A^Tb. When A has full column rank, A^TA is symmetric positive definite. By Theorem 2.1.6, ‖x_k‖ is strictly increasing.

Theorem 3.3.7. The error ‖x∗ − x_k‖ is strictly decreasing for LSMR on min ‖Ax − b‖ when A has full column rank.
Proof. This follows directly from Theorem 2.1.7 for MINRES.

Theorem 3.3.8. ‖x∗ − x_k‖_{A^TA} = ‖r∗ − r_k‖ is strictly decreasing for LSMR on min ‖Ax − b‖ when A has full column rank.
Proof. This follows directly from Theorem 2.1.8 for MINRES.

Since LSMR is equivalent to MINRES on the normal equation, and CR is a non-Lanczos equivalent of MINRES, we can apply CR to the normal equation to derive a non-Lanczos equivalent of LSMR, which we will call CRLS. We start from CR (Algorithm 1.3) and apply the substitutions A → A^TA, b → A^Tb. Since r_k in CR would correspond to A^Tr_k in CRLS, we rename r_k to r̄_k in the algorithm to avoid confusion. With these substitutions, we arrive at Algorithm 3.3.

Algorithm 3.3 Algorithm CRLS
1:  x₀ = 0, r̄₀ = A^Tb, s₀ = A^TAr̄₀, ρ₀ = r̄₀^Ts₀, p₀ = r̄₀, q₀ = s₀
2:  for k = 1, 2, ... do
3:    α_k = ρ_{k−1}/‖q_{k−1}‖²
4:    x_k = x_{k−1} + α_k p_{k−1}
5:    r̄_k = r̄_{k−1} − α_k q_{k−1}
6:    s_k = A^TAr̄_k
7:    ρ_k = r̄_k^Ts_k
8:    β_k = ρ_k/ρ_{k−1}
9:    p_k = r̄_k + β_k p_{k−1}
10:   q_k = s_k + β_k q_{k−1}
11: end for
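Algorithm 3.3 translates directly into MATLAB. The sketch below is a minimal illustration (ours; it runs a fixed number of iterations and forms the products A^TAr̄ as two matrix-vector products):

    function x = crls(A,b,K)
    % Minimal CRLS (Algorithm 3.3): CR applied to the normal equation.
    x    = zeros(size(A,2),1);
    rbar = A'*b;                  % rbar_0 = A'b (corresponds to A'r_k in LSMR)
    s    = A'*(A*rbar);           % s_0 = A'A rbar_0
    rho  = rbar'*s;
    p    = rbar;  q = s;
    for k = 1:K
        alpha  = rho/norm(q)^2;
        x      = x + alpha*p;
        rbar   = rbar - alpha*q;
        s      = A'*(A*rbar);
        rhonew = rbar'*s;
        beta   = rhonew/rho;  rho = rhonew;
        p      = rbar + beta*p;
        q      = s + beta*q;
    end
    end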
In CRLS, the residual r_k = b − Ax_k is not computed. However, we know that r₀ = b and that r_k satisfies the recurrence

    r_k = b − Ax_k = b − Ax_{k−1} − α_k Ap_{k−1} = r_{k−1} − α_k Ap_{k−1}.        (3.22)

Also, as mentioned, we define

    r̄_k = A^Tr_k.        (3.23)

Let ℓ denote the iteration at which LSMR terminates; i.e., A^Tr_ℓ = r̄_ℓ = 0 and x_ℓ = x∗. Then Theorem 2.1.4 for CR translates to the following.

Theorem 3.3.9. These properties hold for Algorithm CRLS:
(a) q_i^Tq_j = 0            (0 ≤ i, j ≤ ℓ − 1, i ≠ j)
(b) r_i^Tq_j = 0            (0 ≤ i, j ≤ ℓ − 1, i ≥ j + 1)
(c) r̄_i ≠ 0 ⇒ p_i ≠ 0      (0 ≤ i ≤ ℓ − 1)

Also, Theorem 2.1.5 for CR translates to the following.

Theorem 3.3.10. These properties hold for Algorithm CRLS on a least-squares system min ‖Ax − b‖ when A has full column rank:
(a) α_i > 0                 (i = 1, ..., ℓ)
(b) β_i > 0, β_ℓ = 0        (i = 1, ..., ℓ − 1)
(c) p_i^Tq_j > 0            (0 ≤ i, j ≤ ℓ − 1)
(d) p_i^Tp_j > 0            (0 ≤ i, j ≤ ℓ − 1)
(e) x_i^Tp_j > 0            (1 ≤ i ≤ ℓ, 0 ≤ j ≤ ℓ − 1)
(f) r̄_i^Tp_j > 0            (0 ≤ i ≤ ℓ − 1, 0 ≤ j ≤ ℓ − 1)

Theorem 3.3.11. For CRLS (and hence LSMR) on min ‖Ax − b‖ when A has full column rank, ‖r_k‖ is strictly decreasing.
Proof.

    ‖r_{k−1}‖² − ‖r_k‖² = r_{k−1}^Tr_{k−1} − r_k^Tr_k
        = 2α_k r_k^TAp_{k−1} + α_k²p_{k−1}^TA^TAp_{k−1}        (by (3.22))
        = 2α_k(A^Tr_k)^Tp_{k−1} + α_k²‖Ap_{k−1}‖²
        > 2α_k(A^Tr_k)^Tp_{k−1}                                 (by Thm 3.3.9(c) and Thm 3.3.10(a))
        = 2α_k r̄_k^Tp_{k−1}                                    (by (3.23))
        ≥ 0,

where the last inequality follows from Theorem 3.3.10 (a) and (f), and is strict except at k = ℓ. The strict inequality on the third line (dropping the term α_k²‖Ap_{k−1}‖² > 0) requires the fact that A has full column rank. Therefore ‖r_{k−1}‖ > ‖r_k‖.

3.3.2 CHARACTERISTICS OF THE SOLUTION ON SINGULAR SYSTEMS

The least-squares problem min ‖Ax − b‖ has a unique solution when A has full column rank. If A does not have full column rank, infinitely many points x give the minimum value of ‖Ax − b‖. In particular, the normal equation A^TAx = A^Tb is singular but consistent. We show that LSQR and LSMR both give the minimum-norm LS solution; that is, they both solve the optimization problem min ‖x‖₂ such that A^TAx = A^Tb.

Let N(A) and R(A) denote the nullspace and range of a matrix A.

Lemma 3.3.12. If A ∈ R^{m×n} and p ∈ R^n satisfy A^TAp = 0, then p ∈ N(A).
Proof. A^TAp = 0 ⇒ p^TA^TAp = 0 ⇒ (Ap)^TAp = 0 ⇒ Ap = 0.

Theorem 3.3.13. LSQR returns the minimum-norm solution.
Proof. The final LSQR solution satisfies A^TAx_ℓ^{LSQR} = A^Tb, and any other solution x̂ satisfies A^TAx̂ = A^Tb. With p = x̂ − x_ℓ^{LSQR}, the difference between the two normal equations gives A^TAp = 0, so that Ap = 0 by Lemma 3.3.12. From α₁v₁ = A^Tu₁ and α_{k+1}v_{k+1} = A^Tu_{k+1} − β_{k+1}v_k (3.1), we have v₁, ..., v_ℓ ∈ R(A^T). With Ap = 0, this implies p^TV_ℓ = 0, so that

    ‖x̂‖₂² − ‖x_ℓ^{LSQR}‖₂² = ‖x_ℓ^{LSQR} + p‖₂² − ‖x_ℓ^{LSQR}‖₂²
        = p^Tp + 2p^Tx_ℓ^{LSQR} = p^Tp + 2p^TV_ℓ y_ℓ^{LSQR} = p^Tp ≥ 0.

Corollary 3.3.14. LSMR returns the minimum-norm solution.
Proof. At convergence, α_{ℓ+1} = 0 or β_{ℓ+1} = 0. Thus β̄_{ℓ+1} = α_{ℓ+1}β_{ℓ+1} = 0, which means equation (3.4) becomes min ‖β̄₁e₁ − B_ℓ^TB_ℓ y_ℓ‖ and hence B_ℓ^TB_ℓ y_ℓ = β̄₁e₁, since B_ℓ has full rank. This is the normal equation for min ‖B_ℓ y_ℓ − β₁e₁‖, the same LS subproblem solved by LSQR. We conclude that at convergence y_ℓ = y_ℓ^{LSQR}, and thus x_ℓ = V_ℓ y_ℓ = V_ℓ y_ℓ^{LSQR} = x_ℓ^{LSQR}, and Theorem 3.3.13 applies.

3.3.3 BACKWARD ERROR

For completeness, we state a final desirable result about LSMR. The Stewart backward error ‖A^Tr_k‖/‖r_k‖ for LSMR is always less than or equal to that for LSQR. See Chapter 4, Theorem 4.1.1 for details.

Table 3.1 Storage and computational cost for various least-squares methods

    Method                      Storage (m-vectors)   Storage (n-vectors)             Work (m)   Work (n)
    LSMR                        p = Av, u             x, v = A^Tu, h, h̄               3          6
    LSQR                        p = Av, u             x, v = A^Tu, w                  3          5
    MINRES on A^TAx = A^Tb      p = Av                x, v₁, v₂ = A^Tp, w₁, w₂, w₃               8

3.4 COMPLEXITY

We compare the storage requirements and computational complexity of LSMR and LSQR on Ax ≈ b, and of MINRES on the normal equation A^TAx = A^Tb. In Table 3.1 we list the vector storage needed (excluding storage for A and b). Recall that A is m × n, and for LS systems m may be considerably larger than n. Av denotes the working storage for matrix-vector products. Work represents the number of floating-point multiplications required at each iteration.

From Table 3.1, we see that LSMR requires storage for one extra n-vector, and n more scalar floating-point multiplications per iteration, compared with LSQR.
This difference is negligible compared with the cost of the products Av and A^Tu for most problems. Thus the computational and storage complexity of LSMR is comparable to that of LSQR.

3.5 REGULARIZED LEAST SQUARES

In this section we extend LSMR to the regularized LS problem

    min ‖ [ A ; λI ] x − [ b ; 0 ] ‖₂,        (3.24)

where λ is a nonnegative scalar. If Ā = [ A ; λI ] and r̄_k = [ b ; 0 ] − Āx_k, then

    Ā^Tr̄_k = A^Tr_k − λ²x_k
           = V_{k+1}( β̄₁e₁ − [ B_k^TB_k ; β̄_{k+1}e_k^T ] y_k − λ² [ y_k ; 0 ] )
           = V_{k+1}( β̄₁e₁ − [ R_k^TR_k ; β̄_{k+1}e_k^T ] y_k ),

and the rest of the main algorithm follows as in the unregularized case. In the last equality, R_k is defined by the QR factorization

    Q_{2k+1} [ B_k ; λI ] = [ R_k ; 0 ],        Q_{2k+1} ≡ P_k P̂_k ⋯ P₂P̂₂P₁P̂₁,

where P̂_l is a rotation operating on rows l and l + k + 1. The effects of P̂₁ and P₁ are illustrated here (for k = 2):

    [ α₁      ]         [ α̂₁      ]         [ ρ₁  θ₂ ]
    [ β₂  α₂ ]   P̂₁   [ β₂  α₂ ]   P₁   [     ᾱ₂ ]
    [     β₃ ]  −−→   [     β₃ ]  −−→   [     β₃ ]
    [ λ       ]         [ 0       ]         [ 0       ]
    [      λ ]         [      λ ]         [      λ ]

3.5.1 EFFECTS ON ‖r̄_k‖

Introduction of regularization changes the residual norm as follows:

    r̄_k = [ b ; 0 ] − [ A ; λI ] x_k = [ u₁β₁ ; 0 ] − [ AV_k ; λV_k ] y_k
        = [ U_{k+1}  ; V_k ] ( e₁β₁ − [ B_k ; λI ] y_k )
        = [ U_{k+1}  ; V_k ] ( e₁β₁ − Q_{2k+1}^T [ R_k ; 0 ] y_k )
        = [ U_{k+1}  ; V_k ] Q_{2k+1}^T ( Q_{2k+1}e₁β₁ − [ t_k ; 0 ] )
        = [ U_{k+1}  ; V_k ] Q_{2k+1}^T ( b̃_k − [ t̃_k ; 0 ] ),

with b̃_k = [ Q̃_k  ; I ] Q_{2k+1}e₁β₁, where we adopt the notation

    b̃_k = [ β̃₁ ⋯ β̃_{k−1}  β̇_k  β̈_{k+1}  β̌₁ ⋯ β̌_k ]^T.

We conclude that

    ‖r̄_k‖² = β̌₁² + ⋯ + β̌_k² + (β̇_k − τ̇_k)² + β̈²_{k+1}.

The effect of regularization on the rotations is summarized as

    [ ĉ_k   ŝ_k ] [ ᾱ_k  β̈_k ]   [ α̂_k  β́_k ]
    [ −ŝ_k  ĉ_k ] [ λ    0   ] = [ 0    β̌_k ] ,

    [ c_k   s_k ] [ α̂_k      0        β́_k ]   [ ρ_k  θ_{k+1}  β̂_k     ]
    [ −s_k  c_k ] [ β_{k+1}  α_{k+1}  0   ] = [ 0    ᾱ_{k+1}  β̈_{k+1} ] .
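As a quick numerical sanity check of formulation (3.24) (our illustration, using dense MATLAB backslash rather than LSMR): the damped solution can be computed either from the stacked system or from the regularized normal equation, and the two should agree to rounding error:

    % Regularized LS: stacked formulation vs normal equation (A'A + lambda^2 I)
    m = 50;  n = 20;  lambda = 0.1;
    A = randn(m,n);  b = randn(m,1);
    x1 = [A; lambda*eye(n)] \ [b; zeros(n,1)];   % minimize ||(A; lambda I)x - (b; 0)||
    x2 = (A'*A + lambda^2*eye(n)) \ (A'*b);      % equivalent normal equation
    fprintf('difference = %.2e\n', norm(x1-x2)); % ~ machine precision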
3.5.2 PSEUDO-CODE FOR REGULARIZED LSMR

Algorithms 3.4 and 3.5 summarize LSMR for solving the regularized problem (3.24) with given λ. Our MATLAB implementation is based on these steps.

Algorithm 3.4 Regularized LSMR (1)
1: (Initialize)
     β₁u₁ = b     α₁v₁ = A^Tu₁     ᾱ₁ = α₁     ζ̄₁ = α₁β₁
     ρ₀ = 1       ρ̄₀ = 1           c̄₀ = 1      s̄₀ = 0
     β̈₁ = β₁      β̇₀ = 0           ρ̇₀ = 1      τ̃₋₁ = 0
     θ̃₀ = 0       ζ₀ = 0           d₀ = 0
     h₁ = v₁      h̄₀ = 0           x₀ = 0
2: for k = 1, 2, 3, ... do
3:   (Continue the bidiagonalization)
       β_{k+1}u_{k+1} = Av_k − α_k u_k        α_{k+1}v_{k+1} = A^Tu_{k+1} − β_{k+1}v_k
4:   (Construct rotation P̂_k)
       α̂_k = (ᾱ_k² + λ²)^{1/2}        ĉ_k = ᾱ_k/α̂_k        ŝ_k = λ/α̂_k
5:   (Construct and apply rotation P_k)
       ρ_k = (α̂_k² + β²_{k+1})^{1/2}        c_k = α̂_k/ρ_k        s_k = β_{k+1}/ρ_k
       θ_{k+1} = s_k α_{k+1}        ᾱ_{k+1} = c_k α_{k+1}
6:   (Construct and apply rotation P̄_k)
       θ̄_k = s̄_{k−1}ρ_k        ρ̄_k = ((c̄_{k−1}ρ_k)² + θ²_{k+1})^{1/2}
       c̄_k = c̄_{k−1}ρ_k/ρ̄_k        s̄_k = θ_{k+1}/ρ̄_k
       ζ_k = c̄_k ζ̄_k        ζ̄_{k+1} = −s̄_k ζ̄_k
7:   (Update h̄, x, h)
       h̄_k = h_k − (θ̄_k ρ_k/(ρ_{k−1}ρ̄_{k−1}))h̄_{k−1}
       x_k = x_{k−1} + (ζ_k/(ρ_k ρ̄_k))h̄_k
       h_{k+1} = v_{k+1} − (θ_{k+1}/ρ_k)h_k
8:   (Apply rotations P̂_k, P_k)
       β́_k = ĉ_k β̈_k        β̌_k = −ŝ_k β̈_k        β̂_k = c_k β́_k        β̈_{k+1} = −s_k β́_k
9:   if k ≥ 2 then
10:    (Construct and apply rotation P̃_{k−1})
         ρ̃_{k−1} = (ρ̇²_{k−1} + θ̄_k²)^{1/2}        c̃_{k−1} = ρ̇_{k−1}/ρ̃_{k−1}        s̃_{k−1} = θ̄_k/ρ̃_{k−1}
         θ̃_k = s̃_{k−1}ρ̄_k        ρ̇_k = c̃_{k−1}ρ̄_k
         β̃_{k−1} = c̃_{k−1}β̇_{k−1} + s̃_{k−1}β̂_k        β̇_k = −s̃_{k−1}β̇_{k−1} + c̃_{k−1}β̂_k
11:  end if

Algorithm 3.5 Regularized LSMR (2)
12:  (Update t̃_k by forward substitution)
       τ̃_{k−1} = (ζ_{k−1} − θ̃_{k−1}τ̃_{k−2})/ρ̃_{k−1}        τ̇_k = (ζ_k − θ̃_k τ̃_{k−1})/ρ̇_k
13:  (Compute ‖r̄_k‖)
       d_k = d_{k−1} + β̌_k²        γ = d_k + (β̇_k − τ̇_k)² + β̈²_{k+1}        ‖r̄_k‖ = √γ
14:  (Compute ‖Ā^Tr̄_k‖, ‖x_k‖; estimate ‖Ā‖, cond(Ā))
       ‖Ā^Tr̄_k‖ = |ζ̄_{k+1}|   (Section 3.2.4)
       Compute ‖x_k‖   (Section 3.2.5)
       Estimate σ_max(B_k), σ_min(B_k) and hence ‖Ā‖, cond(Ā)   (Section 3.2.6)
15:  Terminate if any of the stopping criteria in Section 3.2.2 are satisfied.
16: end for

3.6 PROOF OF LEMMA 3.2.1

The effects of the rotations P_k and P̃_{k−1} can be summarized as

    [ c_k   s_k ] [ β̈_k ]   [ β̂_k     ]
    [ −s_k  c_k ] [ 0   ] = [ β̈_{k+1} ] ,

    [ c̃_{k−1}   s̃_{k−1} ] [ ρ̇_{k−1}  0    β̇_{k−1} ]   [ ρ̃_{k−1}  θ̃_k  β̃_{k−1} ]
    [ −s̃_{k−1}  c̃_{k−1} ] [ θ̄_k     ρ̄_k  β̂_k     ] = [ 0        ρ̇_k  β̇_k     ] ,

with

    R̃_k = [ ρ̃₁  θ̃₂
                ρ̃₂  ⋱
                    ⋱  θ̃_k
                       ρ̇_k ] ,

where β̈₁ = β₁, ρ̇₁ = ρ̄₁, β̇₁ = β̂₁, and where c_k, s_k are defined in Section 3.1.6. We define s^{(k)} = s₁ ⋯ s_k and s̄^{(k)} = s̄₁ ⋯ s̄_k. Then from (3.18) and (3.6) we have R̃_k^T t̃_k = z_k = [ I_k  0 ] Q̄_{k+1}e₁β̄₁. Expanding this and (3.16) gives

    R̃_k^T t̃_k = ( c̄₁,  −s̄₁c̄₂,  ⋯,  (−1)^{k+1}s̄^{(k−1)}c̄_k )^T β̄₁,
    b̃_k = [ Q̃_k  ; 1 ] ( c₁,  −s₁c₂,  ⋯,  (−1)^{k+1}s^{(k−1)}c_k,  (−1)^{k+2}s^{(k)} )^T β₁,

and we see that

    τ̃₁ = ρ̃₁^{−1} c̄₁ β̄₁,        (3.25)
    τ̃_{k−1} = ρ̃_{k−1}^{−1} ( (−1)^k s̄^{(k−2)} c̄_{k−1} β̄₁ − θ̃_{k−1} τ̃_{k−2} ),        (3.26)
    τ̇_k = ρ̇_k^{−1} ( (−1)^{k+1} s̄^{(k−1)} c̄_k β̄₁ − θ̃_k τ̃_{k−1} ),        (3.27)
    β̇₁ = β̂₁ = c₁ β₁,        (3.28)
    β̇_k = −s̃_{k−1} β̇_{k−1} + c̃_{k−1} (−1)^{k−1} s^{(k−1)} c_k β₁,        (3.29)
    β̃_k = c̃_k β̇_k + s̃_k (−1)^k s^{(k)} c_{k+1} β₁.        (3.30)

We want to show by induction that τ̃_i = β̃_i for all i. When i = 1,

    β̃₁ = c̃₁c₁β₁ − s̃₁s₁c₂β₁ = (β₁/ρ̃₁)(c₁ρ̄₁ − θ̄₂s₁c₂) = (α₁ρ₁/(ρ̄₁ρ̃₁))β₁ = (ρ₁/ρ̄₁)(β̄₁/ρ̃₁) = c̄₁β̄₁/ρ̃₁ = τ̃₁,

where the third equality follows from the two lines below:

    c₁ρ̄₁ − θ̄₂s₁c₂ = c₁ρ̄₁ − θ̄₂s₁(c₁α₂/ρ₂) = (α₁/ρ₁)( ρ̄₁ − θ̄₂s₁α₂/ρ₂ ),
    ρ̄₁ − θ̄₂s₁α₂/ρ₂ = ρ̄₁ − (s̄₁ρ₂)θ₂/ρ₂ = ρ̄₁ − θ₂²/ρ̄₁ = (ρ̄₁² − θ₂²)/ρ̄₁ = (ρ₁² + θ₂² − θ₂²)/ρ̄₁ = ρ₁²/ρ̄₁.

Suppose τ̃_{k−1} = β̃_{k−1}. We first record the identity

    s^{(k−1)} c_k ρ̄_k^{−1} c̄²_{k−1} ρ_k² β₁ = c̄_k s̄^{(k−1)} β̄₁,        (3.31)

which can be verified by expanding c_k = (α₁⋯α_k)/(ρ₁⋯ρ_k), c̄_{k−1} = (ρ₁⋯ρ_{k−1})/(ρ̄₁⋯ρ̄_{k−1}), s^{(k−1)} = (θ₂⋯θ_k)/(α₂⋯α_k), and s̄^{(k−1)} = (θ₂⋯θ_k)/(ρ̄₁⋯ρ̄_{k−1}). Applying the induction hypothesis to τ̃_k = ρ̃_k^{−1}( (−1)^{k+1}s̄^{(k−1)}c̄_kβ̄₁ − θ̃_kτ̃_{k−1} ) gives

    τ̃_k = ρ̃_k^{−1}( (−1)^{k+1}s̄^{(k−1)}c̄_kβ̄₁ − θ̃_k( c̃_{k−1}β̇_{k−1} + s̃_{k−1}(−1)^{k−1}s^{(k−1)}c_kβ₁ ) )
        = ρ̃_k^{−1}( −(ρ̄_k s̃_{k−1})c̃_{k−1}β̇_{k−1} + (−1)^{k+1}( s̄^{(k−1)}c̄_kβ̄₁ − θ̃_k s̃_{k−1}s^{(k−1)}c_kβ₁ ) )
        = ρ̃_k^{−1}( −(ρ̄_k s̃_{k−1})c̃_{k−1}β̇_{k−1} + (−1)^{k+1}s^{(k−1)}β₁( ρ̇_k c̃_{k−1}c_k − θ̄_{k+1}s_kc_{k+1} ) )
        = −c̃_k s̃_{k−1}β̇_{k−1} + (−1)^{k+1}s^{(k−1)}β₁( c̃_k c̃_{k−1}c_k − s̃_k s_k c_{k+1} )
        = c̃_k( −s̃_{k−1}β̇_{k−1} + c̃_{k−1}(−1)^{k+1}s^{(k−1)}c_kβ₁ ) + s̃_k(−1)^{k+2}s^{(k)}c_{k+1}β₁
        = c̃_k β̇_k + s̃_k(−1)^k s^{(k)}c_{k+1}β₁ = β̃_k,

with the first equality obtained by the induction hypothesis and (3.30) at k−1, and the third from

    s̄^{(k−1)}c̄_kβ̄₁ − θ̃_k s̃_{k−1}s^{(k−1)}c_kβ₁
        = s^{(k−1)}c_kρ̄_k^{−1}c̄²_{k−1}ρ_k²β₁ − (s̃_{k−1}ρ̄_k)s̃_{k−1}s^{(k−1)}c_kβ₁        (by (3.31) and θ̃_k = s̃_{k−1}ρ̄_k)
        = s^{(k−1)}c_kβ₁ρ̄_k^{−1}( c̄²_{k−1}ρ_k² − s̃²_{k−1}ρ̄_k² )
        = s^{(k−1)}β₁( ρ̇_k c̃_{k−1}c_k − θ̄_{k+1}s_kc_{k+1} ),

where the last step uses

    c̄²_{k−1}ρ_k² − s̃²_{k−1}ρ̄_k² = ρ̄_k² − θ²_{k+1} − s̃²_{k−1}ρ̄_k² = ρ̄_k²(1 − s̃²_{k−1}) − θ²_{k+1} = ρ̄_k²c̃²_{k−1} − θ²_{k+1},
    (c_k/ρ̄_k) ρ̄_k²c̃²_{k−1} = ρ̄_k c̃²_{k−1}c_k = ρ̇_k c̃_{k−1}c_k,
    (c_k/ρ̄_k) θ²_{k+1} = (θ_{k+1}ρ_{k+1}/ρ̄_k)(s_kα_{k+1})(c_k/ρ_{k+1}) = θ̄_{k+1}s_kc_{k+1}.

By induction, τ̃_i = β̃_i for i = 1, 2, .... From (3.18), we see that at iteration k, the first k − 1 elements of b̃_k and t̃_k are equal.
4 LSMR EXPERIMENTS

In this chapter we perform numerous experiments comparing the convergence of LSQR and LSMR. We discuss overdetermined systems first, then some square examples, followed by underdetermined systems. We also explore ways to speed up convergence by using extra memory for reorthogonalization.

4.1 LEAST-SQUARES PROBLEMS

4.1.1 BACKWARD ERROR FOR LEAST-SQUARES

For inconsistent problems with uncertainty in A (but not b), let x be any approximate solution. The normwise backward error for x measures the perturbation to A that would make x an exact LS solution:

    μ(x) ≡ min_E ‖E‖   s.t.   (A + E)^T(A + E)x = (A + E)^Tb.        (4.1)

It is known to be the smallest singular value of a certain m × (n + m) matrix C; see Waldén et al. [83] and Higham [39, pp. 392–393]:

    μ(x) = σ_min(C),        C ≡ [ A    (‖r‖/‖x‖)( I − rr^T/‖r‖² ) ].        (4.2)

Since it is generally too expensive to evaluate μ(x), we need to find approximations.

APPROXIMATE BACKWARD ERRORS E₁ AND E₂

In 1975, Stewart [69] discussed a particular backward error estimate that we will call E₁. Let x∗ and r∗ = b − Ax∗ be the exact LS solution and residual. Stewart showed that an approximate solution x with residual r = b − Ax is the exact LS solution of the perturbed problem min ‖b − (A + E₁)x‖, where E₁ is the rank-one matrix

    E₁ = ex^T/‖x‖²,        ‖E₁‖ = ‖e‖/‖x‖,        e ≡ r − r∗,        (4.3)

with ‖r‖² = ‖r∗‖² + ‖e‖². Soon after, Stewart [70] gave a further important result that can be used within any LS solver. The approximate x and a certain vector r̃ = b − (A + E₂)x are the exact solution and residual of the perturbed LS problem min ‖b − (A + E₂)x‖, where

    E₂ = −rr^TA/‖r‖²,        ‖E₂‖ = ‖A^Tr‖/‖r‖,        r = b − Ax.        (4.4)

LSQR and LSMR both compute ‖E₂‖ for each iterate x_k because the current ‖r_k‖ and ‖A^Tr_k‖ can be accurately estimated at almost no cost. An added feature is that for both solvers, r̃ = b − (A + E₂)x_k = r_k because E₂x_k = 0 (assuming orthogonality of V_k). That is, x_k and r_k are theoretically exact for the perturbed LS problem (A + E₂)x ≈ b. Stopping rule S2 (Section 3.2.2) requires ‖E₂‖ ≤ ATOL‖A‖. Hence the following property gives LSMR an advantage over LSQR for stopping early.

Theorem 4.1.1. ‖E₂^LSMR‖ ≤ ‖E₂^LSQR‖.
Proof. This follows from ‖A^Tr_k^LSMR‖ ≤ ‖A^Tr_k^LSQR‖ and ‖r_k^LSMR‖ ≥ ‖r_k^LSQR‖.

APPROXIMATE OPTIMAL BACKWARD ERROR μ̃(x)

Various authors have derived expressions for a quantity μ̃(x) that has proved to be a very accurate approximation to μ(x) in (4.1) when x is at least moderately close to the exact solution x̂. Grcar, Saunders, and Su [73; 34] show that μ̃(x) can be obtained from a full-rank LS problem as follows:

    K = [ A ; (‖r‖/‖x‖)I ],    v = [ r ; 0 ],    min_y ‖Ky − v‖,    μ̃(x) = ‖Ky‖/‖x‖,        (4.5)

and give MATLAB Code 4.1 for computing the "economy size" sparse QR factorization K = QR and c ≡ Q^Tv (for which ‖c‖ = ‖Ky‖), and thence μ̃(x). In our experiments we use this script to compute μ̃(x_k) for each LSQR and LSMR iterate x_k. We refer to this as the optimal backward error for x_k because it is provably very close to the true μ(x_k) [32].

MATLAB Code 4.1 Approximate optimal backward error
    [m,n]   = size(A);
    r       = b - A*x;
    normx   = norm(x);
    eta     = norm(r)/normx;
    p       = colamd(A);
    K       = [A(:,p); eta*speye(n)];
    v       = [r; zeros(n,1)];
    [c,R]   = qr(K,v,0);
    mutilde = norm(c)/normx;
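As a usage illustration (ours, not from the original experiments; the fill-reducing colamd permutation is omitted for brevity), the script can be exercised on a small random inconsistent problem and compared against the Stewart estimate ‖E₂‖ = ‖A^Tr‖/‖r‖ from (4.4):

    % Exercise MATLAB Code 4.1 on a small random inconsistent problem
    A = sprandn(200,50,0.1);  b = randn(200,1);
    x = (A'*A) \ (A'*b) + 1e-6*randn(50,1);   % near-optimal point, slightly perturbed
    r = b - A*x;  normx = norm(x);
    K = [A; (norm(r)/normx)*speye(50)];       % K and v as in (4.5)
    v = [r; zeros(50,1)];
    [c,R] = qr(K,v,0);                        % c = Q'v from the economy QR of K
    mutilde = norm(c)/normx;                  % approximate optimal backward error
    E2 = norm(A'*r)/norm(r);                  % Stewart estimate (4.4)
    fprintf('mutilde = %.3e, E2 = %.3e\n', mutilde, E2);
    % mutilde approximates the smallest possible backward error,
    % so it should not exceed E2 by more than rounding.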
DATA

For test examples, we have drawn from the University of Florida Sparse Matrix Collection (Davis [18]). Matrices from the LPnetlib group and the NYPA group are used for our numerical experiments.

The LPnetlib group provides data for 138 linear programming problems of widely varying origin, structure, and size. The constraint matrix and objective function may be used to define a sparse LS problem min ‖Ax − b‖. Each example was downloaded in MATLAB format, and a sparse matrix A and dense vector b were extracted from the data structure via A = (Problem.A)' and b = Problem.c (where ' denotes transpose). Five examples had b = 0, and a further six gave A^Tb = 0. The remaining 127 problems had up to 243000 rows, 10000 columns, and 1.4M nonzeros in A. Diagonal scaling was applied to the columns of [A  b] to give a scaled problem min ‖Ax − b‖ in which the columns of A (and also b) have unit 2-norm. LSQR and LSMR were run on each of the 127 scaled problems with stopping tolerance ATOL = 10⁻⁸, generating sequences of approximate solutions {x_k^LSQR} and {x_k^LSMR}. The iteration indices k are omitted below. The associated residual vectors are denoted by r without ambiguity, and x∗ is the solution of the LS problem, or the minimum-norm solution if the system is singular. This set of artificially created least-squares test problems provides a wide variety of size and structure for evaluating the two algorithms. They should be indicative of what we could expect when using iterative methods to estimate the dual variables if the linear programs were modified to have a nonlinear objective function (such as the negative entropy function Σ x_j log x_j).

The NYPA group provides data for 8 rank-deficient least-squares problems from the New York Power Authority. Each problem provides a matrix Problem.A and a right-hand side vector Problem.b. Two of the problems are underdetermined. For the remaining 6 problems we compared the convergence of LSQR and LSMR on min ‖Ax − b‖ with stopping tolerance ATOL = 10⁻⁸. This set of problems contains matrices with condition numbers ranging from 3.1E+02 to 5.8E+11.

PRECONDITIONING

For this set of test problems, we apply a right diagonal preconditioning that scales the columns of A to unit 2-norm. (For least-squares systems, a left preconditioner would alter the least-squares solution.) The preconditioning is implemented with MATLAB Code 4.2.

MATLAB Code 4.2 Right diagonal preconditioning
    % scale the column norms to 1
    cnorms = sqrt(sum(A.*A,1));
    D = diag(sparse(1./cnorms));
    A = A*D;
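One detail worth making explicit (our note): with a right preconditioner the solver works on min ‖(AD)y − b‖, and the solution of the original problem must be recovered as x = Dy. A minimal sketch, with MATLAB's built-in lsqr standing in for LSQR/LSMR:

    % Right diagonal preconditioning: solve min ||(A*D)y - b||, then x = D*y
    cnorms = sqrt(sum(A.*A,1));
    D = diag(sparse(1./cnorms));
    y = lsqr(A*D, b, 1e-8, 1000);   % any LS solver may be used here
    x = D*y;                        % solution of the original problem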
4.1.2 NUMERICAL RESULTS

Observations for the LPnetlib group:

1. ‖r^LSQR‖ is monotonic by design. ‖r^LSMR‖ is also monotonic (as predicted by Theorem 3.3.11) and nearly as small as ‖r^LSQR‖ for all iterations on almost all problems. Figure 4.1 shows a typical example and a rare case.

2. ‖x‖ is monotonic for LSQR (Theorem 3.3.1) and for LSMR (Theorem 3.3.6). With ‖r‖ monotonic for both solvers, ‖E₁‖ in (4.3) is likely to appear monotonic for both. Although ‖E₁‖ is not normally available for each iteration, it provides a benchmark for ‖E₂‖.

3. ‖E₂^LSQR‖ is not monotonic, but ‖E₂^LSMR‖ appears monotonic almost always. Figure 4.2 shows a typical case. The sole exception to this observation is also shown.

4. Note that Benbow [5] has given numerical results comparing a generalized form of LSQR with application of MINRES to the corresponding normal equation. The curves in [5, Figure 3] show the irregular and smooth behavior of LSQR and MINRES respectively in terms of ‖A^Tr_k‖. Those curves are effectively a preview of the left-hand plots in Figure 4.2 (where LSMR serves as our more reliable implementation of MINRES).

[Figure 4.1: For most iterations, ‖r^LSMR‖ is monotonic and nearly as small as ‖r^LSQR‖. Left: A typical case (problem lp_greenbeb). Right: A rare case (problem lp_woodw); LSMR's residual norm is significantly larger than LSQR's during early iterations.]

5. ‖E₁^LSQR‖ ≤ ‖E₂^LSQR‖ often, but the same does not hold for LSMR. Some examples are shown in Figure 4.3, along with μ̃(x_k), the accurate estimate (4.5) of the optimal backward error for each point x_k.

6. ‖E₂^LSMR‖ ≈ μ̃(x^LSMR) almost always. Figure 4.4 shows a typical example and a rare case. In all such "rare" cases, ‖E₁^LSMR‖ ≈ μ̃(x^LSMR) instead!

7. μ̃(x^LSQR) is not always monotonic. μ̃(x^LSMR) does seem to be monotonic. Figure 4.5 gives examples.

8. μ̃(x^LSMR) ≤ μ̃(x^LSQR) almost always. Figure 4.6 gives examples.

9. The errors ‖x∗ − x^LSQR‖ and ‖x∗ − x^LSMR‖ decrease monotonically (Theorems 3.3.2 and 3.3.7), with the LSQR error typically smaller than for LSMR. Figure 4.7 gives examples. This is one property for which LSQR seems more desirable (and it has been suggested [57] that for LS problems, LSQR could be terminated when rule S2 would terminate LSMR).

[Figure 4.2: For most iterations, ‖E₂^LSMR‖ appears to be monotonic (but ‖E₂^LSQR‖ is not). Left: A typical case (problem lp_pilot_ja); LSMR is likely to terminate much sooner than LSQR (see Theorem 4.1.1). Right: Sole exception (problem lp_sc205) at iterations 54–67; the exception remains even if U_k and/or V_k are reorthogonalized.]

For every problem in the NYPA group, both solvers satisfied the stopping condition in fewer than 2n iterations. Much greater fluctuations are observed in ‖E₂^LSQR‖ than in ‖E₂^LSMR‖. Figure 4.8 shows the convergence of ‖E₂‖ for two problems: Maragal_5 has the largest condition number in the group, while Maragal_7 has the largest dimensions. ‖E₂^LSMR‖ converges with small fluctuations, while ‖E₂^LSQR‖ fluctuates by as much as 5 orders of magnitude. We should note that when cond(A) ≥ 10⁸, we cannot expect any solver to compute a solution with more than about 1 digit of accuracy. The results for problem Maragal_5 are therefore a little difficult to interpret, but they illustrate the fortunate fact that LSQR's and LSMR's estimates of ‖A^Tr_k‖/‖r_k‖ do converge toward zero (really ‖A‖ε), even if the computed vectors A^Tr_k are unlikely to become so small.

4.1.3 EFFECTS OF PRECONDITIONING

The numerical results for the LPnetlib test set are generated with every matrix A diagonally preconditioned (i.e., the column norms are scaled to 1). Before preconditioning, the condition numbers range from 2.9E+00 to 7.2E+12. With preconditioning, they range from 2.7E+00 to 3.4E+08. The condition numbers before and after preconditioning are shown in Figure 4.9.
[Figure 4.3: ‖E₁‖, ‖E₂‖, and μ̃(x_k) for LSQR (top) and LSMR (bottom). Top left: A typical case; ‖E₁^LSQR‖ is close to the optimal backward error, but the computable ‖E₂^LSQR‖ is not. Top right: A rare case in which ‖E₂^LSQR‖ is close to optimal. Bottom left: ‖E₁^LSMR‖ and ‖E₂^LSMR‖ are often both close to the optimal backward error. Bottom right: ‖E₁^LSMR‖ is far from optimal, but the computable ‖E₂^LSMR‖ is almost always close (too close to distinguish in the plot!). Problems lp_cre_a (left) and lp_pilot (right).]

[Figure 4.4: Again, ‖E₂^LSMR‖ ≈ μ̃(x^LSMR) almost always (the computable backward error estimate is essentially optimal). Left: A typical case (problem lp_ken_11). Right: A rare case (problem lp_ship12l); here ‖E₁^LSMR‖ ≈ μ̃(x^LSMR)!]

[Figure 4.5: μ̃(x^LSMR) seems to be always monotonic, but μ̃(x^LSQR) is usually not. Left: A typical case for both LSQR and LSMR (problem lp_maros). Right: A rare case for LSQR, typical for LSMR (problem lp_cre_c).]

[Figure 4.6: μ̃(x^LSMR) ≤ μ̃(x^LSQR) almost always. Left: A typical case (problem lp_pilot). Right: A rare case (problem lp_standgub).]

[Figure 4.7: The errors ‖x∗ − x^LSQR‖ and ‖x∗ − x^LSMR‖ seem to decrease monotonically, with LSQR's errors smaller than LSMR's. Left: A nonsingular LS system (problem lp_ship12l). Right: A singular system (problem lp_pds_02); LSQR and LSMR both converge to the minimum-norm LS solution.]
[Figure 4.8: Convergence of ‖E₂‖ for two problems in the NYPA group using LSQR and LSMR. Upper: Problem Maragal_5. Left: No preconditioning applied; cond(A) = 5.8E+11; if the iteration limit had been n iterations, the final LSQR point would be very poor. Right: Right diagonal preconditioning applied; cond(A) = 2.6E+12. Lower: Problem Maragal_7. Left: No preconditioning applied; cond(A) = 1.4E+03. Right: Right diagonal preconditioning applied; cond(A) = 4.2E+02. The peaks for LSQR (where it would be undesirable for LSQR to terminate) correspond to plateaus for LSMR where ‖E₂‖ remains the smallest value so far, except for slight increases near the end of the LSQR peaks.]

[Figure 4.9: Distribution of condition numbers for the LPnetlib matrices. Diagonal preconditioning reduces the condition number in 117 out of 127 cases.]

To illustrate the effect of this preconditioning on the convergence speed of LSQR and LSMR, we solve each problem min ‖Ax − b‖ from the LPnetlib set using the two algorithms and summarize the results in Table 4.1. Both algorithms use stopping rule S2 in Section 3.2.2 with ATOL = 1E-8, or a limit of 10n iterations.

In Table 4.1, we see that for systems that converge quickly, the advantage gained by using LSMR over LSQR is relatively small. For example, lp_osa_60 (Row 127) is a 243246 × 10280 matrix; LSQR converges in 124 iterations while LSMR converges in 122. In contrast, for systems that take many iterations to converge, LSMR usually converges much faster than LSQR. For example, lp_cre_d (Row 122) is a 73948 × 8926 matrix; LSQR takes 19496 iterations, while LSMR takes 9259 (53% fewer).
Table 4.1: Effects of diagonal preconditioning on LPnetlib matrices†† and convergence of LSQR and LSMR on min ‖Ax − b‖. For each matrix, the first group of three columns gives cond(A) and the iteration counts kLSQR† and kLSMR† for the original A; the second group gives the same quantities for the diagonally preconditioned A.

Row  ID  name           m       n      nnz  σi=0∗ | cond(A)  kLSQR  kLSMR | cond(A)  kLSQR  kLSMR
  1  720 lpi_itest6        17      11      29    0 | 1.5E+02      5      5 | 8.4E+01      7      7
  2  706 lpi_bgprtr        40      20      70    0 | 1.2E+03     21     21 | 1.2E+01     19     19
  3  597 lp_afiro          51      27     102    0 | 1.1E+01     22     22 | 4.9E+00     21     21
  4  731 lpi_woodinfe      89      35     140    0 | 2.9E+00     17     17 | 2.8E+00     17     17
  5  667 lp_sc50b          78      50     148    0 | 1.7E+01     41     41 | 9.1E+00     36     36
  6  666 lp_sc50a          78      50     160    0 | 1.2E+01     38     38 | 7.6E+00     34     34
  7  714 lpi_forest6      131      66     246    0 | 3.1E+00     21     21 | 2.8E+00     19     19
  8  636 lp_kb2            68      43     313    0 | 5.1E+04    150    147 | 7.8E+02    128    128
  9  664 lp_sc105         163     105     340    0 | 3.7E+01     68     68 | 2.2E+01     58     58
 10  596 lp_adlittle      138      56     424    0 | 4.6E+02     61     61 | 2.5E+01     39     39
 11  713 lpi_ex73a        211     193     457    5 | 4.5E+01     99     96 | 3.5E+01     85     84
 12  669 lp_scagr7        185     129     465    0 | 6.0E+01     80     80 | 1.7E+01     60     59
 13  712 lpi_ex72a        215     197     467    5 | 5.6E+01    101     98 | 4.5E+01     89     88
 14  695 lp_stocfor1      165     117     501    0 | 7.3E+03    107    105 | 1.8E+03    263    238
 15  603 lp_blend         114      74     522    0 | 9.2E+02    186    186 | 1.1E+02    118    118
 16  707 lpi_box1         261     231     651   17 | 4.6E+01     28     28 | 2.1E+01     24     24
 17  665 lp_sc205         317     205     665    0 | 1.3E+02    122    122 | 7.6E+01    102    101
 18  663 lp_recipe        204      91     687    0 | 2.1E+04      4      4 | 5.4E+02      4      4
 19  682 lp_share2b       162      96     777    0 | 1.2E+04    516    510 | 7.5E+02    331    328
 20  700 lp_vtp_base      346     198    1051    0 | 3.6E+07    557    556 | 1.9E+04   1370   1312
 21  641 lp_lotfi         366     153    1136    0 | 6.6E+05    149    146 | 1.4E+03    386    386
 22  681 lp_share1b‡      253     117    1179    0 | 1.0E+05   1170   1170 | 6.2E+02    482    427
 23  724 lpi_mondou2      604     312    1208    1 | 4.2E+01    151    147 | 2.5E+01    119    116
 24  711 lpi_cplex2       378     224    1215    1 | 4.6E+02    105    102 | 9.9E+01     81     80
 25  606 lp_bore3d        334     233    1448    2 | 4.5E+04    782    681 | 1.2E+02    265    263
 26  673 lp_scorpion      466     388    1534   30 | 7.4E+01    159    152 | 3.4E+01    116    115
 27  709 lpi_chemcom      744     288    1590    0 | 6.9E+00     40     39 | 5.0E+00     35     35
 28  729 lpi_refinery     464     323    1626    0 | 5.1E+04   2684   1811 | 6.2E+01    113    113
 29  727 lpi_qual         464     323    1646    0 | 4.8E+04   2689   1828 | 6.3E+01    121    120
 30  730 lpi_vol1         464     323    1646    0 | 4.8E+04   2689   1828 | 6.3E+01    121    120
 31  703 lpi_bgdbg1       629     348    1662    0 | 9.5E+00     44     43 | 1.1E+01     53     52
 32  668 lp_scagr25       671     471    1725    0 | 5.9E+01    155    147 | 1.7E+01     97     95
 33  678 lp_sctap1        660     300    1872    0 | 1.7E+02    364    338 | 1.8E+02    334    327
 34  608 lp_capri         482     271    1896    0 | 8.1E+03   1194   1051 | 3.8E+02    453    451
 35  607 lp_brandy        303     220    2202   27 | 6.4E+03    665    539 | 1.0E+02    208    208
 36  635 lp_israel        316     174    2443    0 | 4.8E+03    351    325 | 9.2E+03    782    720
 37  629 lp_gfrd_pnc     1160     616    2445    0 | 1.6E+05     84     73 | 3.7E+04    270    210
 38  601 lp_bandm         472     305    2494    0 | 3.8E+03   2155   1854 | 4.6E+01    196    187
 39  621 lp_etamacro      816     400    2537    0 | 6.0E+04    568    186 | 9.2E+01    171    162
 40  704 lpi_bgetam       816     400    2537    0 | 5.3E+04    536    186 | 9.2E+01    171    162
 41  728 lpi_reactor      808     318    2591    0 | 1.4E+05     85     85 | 5.0E+03    151    149
 42  634 lp_grow7         301     140    2612    0 | 5.2E+00     31     30 | 4.4E+00     28     28
 43  670 lp_scfxm1        600     330    2732    0 | 2.4E+04   1368   1231 | 1.4E+03    547    470
 44  623 lp_finnis       1064     497    2760    0 | 1.1E+03    328    327 | 7.7E+01    279    275
 45  620 lp_e226          472     223    2768    0 | 9.1E+03    591    555 | 3.0E+03    504    437
 46  598 lp_agg           615     488    2862    0 | 6.2E+02    159    154 | 1.1E+01     35     35
 47  725 lpi_pang         741     361    2933    0 | 2.9E+05    350    247 | 3.7E+01    125    111
 48  692 lp_standata     1274     359    3230    0 | 2.2E+03    144    141 | 6.6E+02    140    139
 49  674 lp_scrs8        1275     490    3288    0 | 9.4E+04   1911   1803 | 1.4E+02    356    338
 50  693 lp_standgub     1383     361    3338    1 | 2.2E+03    144    141 | 6.6E+02    140    139
 51  602 lp_beaconfd      295     173    3408    0 | 4.4E+03    254    254 | 2.1E+01     64     63
 52  683 lp_shell        1777     536    3558    1 | 4.2E+01     88     85 | 1.1E+01     43     42
 53  694 lp_standmps     1274     467    3878    0 | 4.2E+03    286    255 | 6.6E+02    201    201
 54  691 lp_stair         614     356    4003    0 | 5.7E+01    122    115 | 3.4E+01     95     94
 55  617 lp_degen2        757     444    4201    2 | 9.9E+01    264    250 | 3.5E+01    151    146
 56  685 lp_ship04s      1506     402    4400   42 | 1.1E+02    103    100 | 5.1E+01     75     75
 57  699 lp_tuff          628     333    4561   31 | 1.1E+06   1021   1013 | 8.2E+02    648    642
 58  599 lp_agg2          758     516    4740    0 | 5.9E+02    184    175 | 5.2E+00     31     31
 59  600 lp_agg3          758     516    4756    0 | 5.9E+02    219    208 | 5.1E+00     32     31
 60  655 lp_pilot4       1123     410    5264    0 | 4.2E+05   1945   1379 | 1.2E+02    195    190
 61  726 lpi_pilot4i     1123     410    5264    0 | 4.2E+05   2081   1357 | 1.2E+02    195    191
 62  671 lp_scfxm2       1200     660    5469    0 | 2.4E+04   2154   1575 | 3.1E+03    975    834
 63  604 lp_bnl1         1586     643    5532    1 | 3.4E+03   1394   1253 | 1.4E+02    285    278
 64  632 lp_grow15        645     300    5620    0 | 5.6E+00     35     35 | 4.9E+00     33     32
 65  653 lp_perold       1506     625    6148    0 | 5.1E+05   5922   3173 | 4.6E+02    706    619
 66  684 lp_ship04l      2166     402    6380   42 | 1.1E+02     77     76 | 6.1E+01     67     67
 67  622 lp_fffff800     1028     524    6401    0 | 1.5E+10   2064   1161 | 1.2E+06   5240   5240
 68  628 lp_ganges       1706    1309    6937    0 | 2.1E+04    219    216 | 3.3E+03    161    160
 69  687 lp_ship08s      2467     778    7194   66 | 1.6E+02    169    169 | 4.9E+01    116    116
 70  662 lp_qap8         1632     912    7296  170 | 1.9E+01      8      8 | 6.6E+01      8      8
 71  679 lp_sctap2       2500    1090    7334    0 | 1.8E+02    639    585 | 1.7E+02    450    415
 72  690 lp_sierra       2735    1227    8001   10 | 5.0E+09     87     74 | 1.0E+05    146    146
 73  672 lp_scfxm3       1800     990    8206    0 | 2.4E+04   2085   1644 | 1.4E+03   1121    994
 74  633 lp_grow22        946     440    8252    0 | 5.7E+00     39     38 | 5.0E+00     35     35
 75  689 lp_ship12s      2869    1151    8284  109 | 8.7E+01    135    134 | 4.2E+01     89     88
 76  637 lp_ken_07       3602    2426    8404   49 | 1.3E+02    168    168 | 3.8E+01     98     98
 77  658 lp_pilot_we     2928     722    9265    0 | 5.3E+05   5900   3503 | 6.1E+02    442    246
 78  696 lp_stocfor2     3045    2157    9357    0 | 2.8E+04    430    407 | 2.6E+03   1546   1421
 79  680 lp_sctap3       3340    1480    9734    0 | 1.8E+02    683    618 | 1.7E+02    503    465
 80  625 lp_fit1p        1677     627    9868    0 | 6.8E+03     81     81 | 1.9E+04    500    427
 81  642 lp_maros‡       1966     846   10137    0 | 1.9E+06   8460   7934 | 1.8E+04   6074   3886
 82  594 lp_25fv47       1876     821   10705    1 | 3.3E+03   5443   4403 | 2.0E+02    702    571
 83  614 lp_czprob       3562     929   10708    0 | 8.8E+03    114    110 | 2.9E+01     29     29
 84  710 lpi_cplex1      5224    3005   10947    0 | 1.7E+04     89     79 | 1.7E+02     53     53
 85  686 lp_ship08l      4363     778   12882   66 | 1.6E+02    123    123 | 6.5E+01    103    103
 86  659 lp_pilotnov¶    2446     975   13331    0 | 3.6E+09    164    343 | 1.4E+03   1622   1180
 87  624 lp_fit1d        1049      24   13427    0 | 4.7E+03     61     61 | 2.4E+01     28     28
 88  657 lp_pilot_ja     2267     940   14977    0 | 2.5E+08   7424    950 | 1.5E+03   1653   1272
 89  605 lp_bnl2         4486    2324   14996    0 | 7.8E+03   1906   1333 | 2.6E+02    452    390
 90  611 lp_cre_c        6411    3068   15977   87 | 1.6E+04  20109  12067 | 4.6E+02   1553   1333
 91  688 lp_ship12l      5533    1151   16276  109 | 1.1E+02    106    104 | 5.9E+01     82     81
 92  649 lp_pds_02       7716    2953   16571   11 | 4.0E+02    129    124 | 1.2E+01     69     67
 93  609 lp_cre_a        7248    3516   18168   93 | 2.1E+04  20196  11219 | 4.9E+02   1591   1375
 94  717 lpi_gran‡       2525    2658   20111  586 | 7.2E+12  26580  20159 | 3.4E+08  22413  11257
 95  708 lpi_ceria3d     4400    3576   21178    0 | 7.3E+02     57     56 | 2.3E+02    224    213
 96  613 lp_cycle‡       3371    1903   21234   28 | 1.5E+07  19030  19030 | 2.7E+04   2911   2349
 97  595 lp_80bau3b     12061    2262   23264    0 | 5.7E+02    119    111 | 6.9E+00     43     42
 98  618 lp_degen3       2604    1503   25432    2 | 8.3E+02   1019    969 | 2.5E+02    448    414
 99  630 lp_greenbea     5598    2392   31070    3 | 4.4E+03   2342   2062 | 4.3E+01    277    251
100  631 lp_greenbeb     5598    2392   31070    3 | 4.4E+03   2342   2062 | 4.3E+01    277    251
101  718 lpi_greenbea    5596    2393   31074    3 | 4.4E+03   2148   1860 | 4.3E+01    239    221
102  615 lp_d2q06c‡      5831    2171   33081    0 | 1.4E+05  21710  15553 | 4.8E+02   1825   1548
103  619 lp_dfl001      12230    6071   35632   13 | 3.5E+02    937    848 | 1.0E+02    363    353
104  702 lp_woodw        8418    1098   37487    0 | 4.7E+04    557    553 | 3.3E+01     81     81
105  616 lp_d6cube       6184     415   37704   11 | 1.1E+03    174    169 | 2.5E+01     52     52
106  660 lp_qap12        8856    3192   38304  398 | 3.9E+01      8      8 | 3.9E+01      8      8
107  654 lp_pilot        4860    1441   44375    0 | 2.7E+03   1392   1094 | 5.0E+02    592    484
108  638 lp_ken_11      21349   14694   49058  121 | 4.6E+02    498    491 | 7.8E+01    220    207
109  627 lp_fit2p       13525    3000   50284    0 | 4.7E+03     73     73 | 5.0E+04   3276   1796
110  650 lp_pds_06      29351    9881   63220   11 | 5.4E+01    197    183 | 1.7E+01    100     97
111  705 lpi_bgindy     10880    2671   66266    0 | 6.7E+02    366    358 | 1.1E+03    377    356
112  701 lp_wood1p       2595     244   70216    1 | 1.6E+04     53     53 | 1.4E+01     25     25
113  697 lp_stocfor3    23541   16675   72721    0 | 4.5E+05    832    801 | 3.6E+03   3442   3096
114  656 lp_pilot87      6680    2030   74949    0 | 8.1E+03    896    751 | 5.7E+02    297    170
115  661 lp_qap15       22275    6330   94950  632 | 5.5E+01      8      8 | 5.6E+01      8      8
116  639 lp_ken_13      42659   28632   97246  169 | 4.5E+02    471    462 | 7.4E+01    205    204
117  716 lpi_gosh       13455    3792   99953    2 | 5.6E+04   3236   1138 | 4.2E+03   3629   1379
118  651 lp_pds_10      49932   16558  107605   11 | 5.6E+02    223    208 | 1.8E+01    120    115
119  626 lp_fit2d       10524      25  129042    0 | 1.7E+03     55     55 | 2.8E+01     29     29
120  645 lp_osa_07      25067    1118  144812    0 | 1.9E+03    105    105 | 2.4E+03     72     72
121  652 lp_pds_20     108175   33874  232647   87 | 7.3E+01    323    283 | 3.3E+01    177    165
122  612 lp_cre_d       73948    8926  246614 2458 | 9.9E+03  19496   9259 | 2.7E+02   1218   1069
123  610 lp_cre_b       77137    9648  260785 2416 | 6.8E+03  14761   7720 | 1.9E+02   1112    966
124  646 lp_osa_14      54797    2337  317097    0 | 9.9E+02    120    120 | 8.8E+02     73     73
125  640 lp_ken_18     154699  105127  358171  324 | 1.2E+03    999    957 | 1.6E+02    422    398
126  647 lp_osa_30     104374    4350  604488    0 | 6.0E+03    116    115 | 1.2E+03     77     77
127  648 lp_osa_60     243246   10280 1408073    0 | 2.1E+03    124    122 | 2.3E+04     82     82

∗  Number of columns in A that are not independent.
†† We are using A = problem.A'; b = problem.c; to construct the least-squares problem min ‖Ax − b‖.
†  Number of iterations that LSQR or LSMR takes to converge with a tolerance of 10⁻⁸.
‡  For problems lp_maros, lpi_gran, and lp_d2q06c, LSQR hits the 10n iteration limit without preconditioning; for problems lp_share1b and lp_cycle, both LSQR and LSMR hit the 10n iteration limit without preconditioning. Thus the iteration counts for these five problems do not represent the relative improvement provided by LSMR.
¶  Problem lp_pilotnov is compatible (‖rk‖ → 0). Therefore LSQR exhibits faster convergence than LSMR. More examples of compatible systems are given in Sections 4.2 and 4.3.

Diagonal preconditioning almost always reduces the condition number of A. For most of the examples, it also reduces the number of iterations for LSQR and LSMR to converge. With diagonal preconditioning, the condition number of lp_cre_d is reduced from 9.9E+03 to 2.7E+02. The number of iterations for LSQR to converge is reduced to 1218, and that for LSMR to 1069 (12% less than that of LSQR). Since the preconditioned system needs fewer iterations, there is less advantage in using LSMR in this case. (This phenomenon can be explained by (4.7) in the next section.)

To further illustrate the effect of preconditioning, we construct a sequence of increasingly better preconditioners and investigate their effect on the convergence of LSQR and LSMR. The preconditioners are constructed by first performing a sparse QR factorization A = QR, and then adding Gaussian random noise to the nonzeros of R. For a given noise level δ, we use MATLAB Code 4.3 to generate the preconditioner.

MATLAB Code 4.3  Generating preconditioners by perturbation of QR

    % delta is the chosen standard deviation of Gaussian noise
    randn('state',1);
    R = qr(A);
    [I J S] = find(R);
    Sp = S.*(1 + delta*randn(length(S),1));
    M = sparse(I,J,Sp);
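The factor M returned by Code 4.3 is upper triangular, so applying it as a right preconditioner costs one triangular solve per product. A minimal sketch (the helper name rightprec and the solver call are ours, purely illustrative; the change of variables is the standard one): solve min ‖(A M⁻¹)y − b‖, then recover x = M⁻¹y.

    % Sketch: use M from MATLAB Code 4.3 as a right preconditioner.
    AMfun = @(y,mode) rightprec(A, M, y, mode);  % preconditioned operator
    % y = lsmr(AMfun, b, ...);                   % solve in the y variables
    % x = M \ y;                                 % unscale the solution

    function z = rightprec(A, M, y, mode)
      if mode == 1, z = A*(M\y);                 % z = (A M^{-1}) y
      else,         z = M'\(A'*y);               % z = M^{-T} A^T y
      end
    end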
Figure 4.10 illustrates the convergence of LSQR and LSMR on problem lp_d2q06c (cond(A) = 1.4E+05) with a number of preconditioners. We have a total of 5 options:

• No preconditioner
• Diagonal preconditioner from MATLAB Code 4.2
• Preconditioner from MATLAB Code 4.3 with δ = 0.1
• Preconditioner from MATLAB Code 4.3 with δ = 0.01
• Preconditioner from MATLAB Code 4.3 with δ = 0.001

From the plots in Figure 4.10, we see that when no preconditioner is applied, both algorithms exhibit very slow convergence and LSQR hits the 10n iteration limit. The backward error for LSQR lags behind LSMR by at least one order of magnitude at the beginning, and the gap widens to two orders of magnitude toward 10n iterations. The backward error for LSQR fluctuates significantly across all iterations.

[Figure 4.10: Convergence of LSQR and LSMR on problem lp_d2q06c (5831 × 2171, nnz = 33081) with increasingly good preconditioners. log‖E2‖ is plotted against iteration number. LSMR shows an advantage until the preconditioner is almost exact. Top: no preconditioner and diagonal preconditioner. Bottom: exact preconditioner with noise levels of 10%, 1%, and 0.1%.]

When diagonal preconditioning is applied, both algorithms take fewer than n iterations to converge. The backward errors for LSQR lag behind LSMR by one order of magnitude. There is also much less fluctuation in the LSQR backward error compared with the unpreconditioned case.

For the increasingly better preconditioners constructed with δ = 0.1, 0.01, and 0.001, we see that the number of iterations to convergence decreases rapidly. With better preconditioners, we also see that the gap between the backward errors for LSQR and LSMR becomes smaller. With an almost perfect preconditioner (δ = 0.001), the backward error for LSQR becomes almost the same as that for LSMR at each iteration. This phenomenon can be explained by (4.7) in the next section.

4.1.4 WHY DOES ‖E2‖ FOR LSQR LAG BEHIND LSMR?

David Titley-Peloquin, in joint work with Serge Gratton and Pavel Jiranek [76], has performed extensive analysis of the convergence behavior of LSQR and LSMR for least-squares problems. These results are unpublished at the time of writing. We summarize two key insights from their work to provide a more complete picture of how these two algorithms perform.

The residuals ‖r_k^LSQR‖, ‖r_k^LSMR‖ and the normal-equation residuals ‖AᵀR‖, specifically ‖Aᵀr_k^LSQR‖ and ‖Aᵀr_k^LSMR‖, satisfy the following relations [76], [35, Lemma 5.4.1]:

\[
\|r_k^{LSMR}\|^2 = \|r_k^{LSQR}\|^2 + \sum_{j=0}^{k-1} \frac{\|A^T r_k^{LSMR}\|^4}{\|A^T r_j^{LSMR}\|^4}\,\bigl(\|r_j^{LSQR}\|^2 - \|r_{j+1}^{LSQR}\|^2\bigr) \tag{4.6}
\]
\[
\|A^T r_k^{LSQR}\| = \frac{\|A^T r_k^{LSMR}\|}{\sqrt{1 - \|A^T r_k^{LSMR}\|^2 / \|A^T r_{k-1}^{LSMR}\|^2}} \tag{4.7}
\]

From (4.6), one can infer that ‖r_k^LSMR‖ is much larger than ‖r_k^LSQR‖ only if both

• ‖r_{j+1}^LSQR‖² ≪ ‖r_j^LSQR‖² for some j < k, and
• ‖Aᵀr_j^LSMR‖⁴ ≈ ‖Aᵀr_{j+1}^LSMR‖⁴ ≈ ··· ≈ ‖Aᵀr_k^LSMR‖⁴

happen, which is very unlikely in view of the fourth power [76]. This explains our observation in Figure 4.1 that ‖r_k^LSMR‖ rarely lags behind ‖r_k^LSQR‖. In cases where ‖r_k^LSMR‖ lags behind in the early iterations, it catches up very quickly.

From (4.7), one can infer that if ‖Aᵀr_k^LSMR‖ decreases a lot between iterations k−1 and k, then ‖Aᵀr_k^LSQR‖ is roughly the same as ‖Aᵀr_k^LSMR‖. The converse also holds, in that ‖Aᵀr_k^LSQR‖ will be much larger than ‖Aᵀr_k^LSMR‖ if LSMR is almost stalling at iteration k (i.e., ‖Aᵀr_k^LSMR‖ did not decrease much relative to the previous iteration) [76]. This explains the peaks and plateaus observed in Figure 4.8.

We have a further insight about the difference between LSQR and LSMR on least-squares problems that take many iterations. Both solvers stop when ‖Aᵀr_k‖ ≤ ATOL·‖A‖‖r_k‖. Since ‖r_*‖ is often O(‖r_0‖) for least-squares problems, and it is also safe to assume ‖Aᵀr_0‖/(‖A‖‖r_0‖) = O(1), we know that they will stop at an iteration l where

\[
\prod_{k=1}^{l} \frac{\|A^T r_k\|}{\|A^T r_{k-1}\|} = \frac{\|A^T r_l\|}{\|A^T r_0\|} \approx O(\mathrm{ATOL}).
\]

Thus on average, ‖Aᵀr_k^LSMR‖/‖Aᵀr_{k−1}^LSMR‖ will be closer to 1 if l is large. (For example, with ATOL = 10⁻⁸, a problem that stops at l = 100 has an average ratio of 10^(−8/100) ≈ 0.83, whereas one that stops at l = 10000 has an average ratio of 10^(−8/10000) ≈ 0.998.) This means that the larger l is, the more LSQR will lag behind LSMR (a bigger gap between ‖Aᵀr_k^LSQR‖ and ‖Aᵀr_k^LSMR‖).

4.2 SQUARE SYSTEMS

Since LSQR and LSMR are applicable to consistent systems, it is of interest to compare them on an unbiased test set. We used the search facility of Davis [18] to select a set of square real linear systems Ax = b. With index = UFget, the criteria in MATLAB Code 4.4 returned a list of 42 examples. Testing isfield(UFget(id),'b') left 26 cases for which b was supplied.

MATLAB Code 4.4  Criteria for selecting square systems

    ids = find(index.nrows > 100000            & ...
               index.nrows < 200000            & ...
               index.nrows == index.ncols      & ...
               index.isReal == 1               & ...
               index.posdef == 0               & ...
               index.numerical_symmetry < 1);

PRECONDITIONING  For each linear system, diagonal scaling was first applied to the rows of [A b] and then to its columns using MATLAB Code 4.5, to give a scaled problem Ax = b in which the columns of [A b] have unit 2-norm.

MATLAB Code 4.5  Diagonal preconditioning

    % scale the row norms to 1
    rnorms = sqrt(sum(A.*A,2));
    D = diag(sparse(1./rnorms));
    A = D*A;
    b = D*b;
    % scale the column norms to 1
    cnorms = sqrt(sum(A.*A,1));
    D = diag(sparse(1./cnorms));
    A = A*D;
    % scale the 2-norm of b to 1
    bnorm = norm(b);
    if bnorm ~= 0
        b = b./bnorm;
    end

In spite of the scaling, most examples required more than n iterations of LSQR or LSMR to reduce ‖r_k‖ satisfactorily (rule S1 in Section 3.2.2 with ATOL = BTOL = 10⁻⁸). To simulate better preconditioning, we chose two cases that required about n/5 and n/10 iterations. Figure 4.11 (left) shows both solvers reducing ‖r_k‖ monotonically but with plateaus that are prolonged for LSMR. With loose stopping tolerances, LSQR could terminate somewhat sooner. Figure 4.11 (right) shows ‖Aᵀr_k‖ for each solver. The plateaus for LSMR correspond to LSQR gaining ground with ‖r_k‖, but falling significantly backward by the ‖Aᵀr_k‖ measure.
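The selection-and-scaling pipeline above assembles into a short loop; a minimal sketch (diagscale is a hypothetical helper wrapping MATLAB Code 4.5, and the solver calls are schematic as before):

    % Sketch of the square-system experiments (Section 4.2).
    index = UFget;
    ids = find(index.nrows > 100000 & index.nrows < 200000 & ...
               index.nrows == index.ncols & index.isReal == 1 & ...
               index.posdef == 0 & index.numerical_symmetry < 1);
    for id = ids(:)'
      Problem = UFget(id);
      if ~isfield(Problem,'b'), continue; end    % keep the 26 cases with b
      [A,b] = diagscale(Problem.A, Problem.b);   % rows, columns, then ||b|| = 1
      % run LSQR and LSMR with rule S1, ATOL = BTOL = 1e-8, and compare
    end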
[Figure 4.11: LSQR and LSMR solving two square nonsingular systems Ax = b: problems Hamm/hcircuit (105676 × 105676, nnz = 513072, top) and IBM_EDA/trans5 (116835 × 116835, nnz = 749800, bottom). Left: log₁₀‖r_k‖ for both solvers, with prolonged plateaus for LSMR. Right: log₁₀‖Aᵀr_k‖ (preferable for LSMR).]

COMPARISON WITH IDR(s) ON SQUARE SYSTEMS  Again we mention that on certain square parameterized systems, the solvers IDR(s) and LSQR or LSMR complement each other [81; 82] (see Section 1.3.5).

4.3 UNDERDETERMINED SYSTEMS

In this section, we study the convergence of LSQR and LSMR when applied to an underdetermined system Ax = b. As shown in Section 3.3.2, LSQR and LSMR converge to the minimum-norm solution for a singular system (rank(A) < n). The solution solves min_{Ax=b} ‖x‖₂. As a comparison, we also apply MINRES directly to the equation AAᵀy = b and take x = Aᵀy as the solution. This avoids multiplication by AᵀA in the Lanczos process, where AᵀA is a highly singular operator because A has more columns than rows. It is also useful to note that this application of MINRES is mathematically equivalent to applying LSQR to Ax = b.

Theorem 4.3.1. In exact arithmetic, applying MINRES to AAᵀy = b and setting x_k = Aᵀy_k generates the same iterates as applying LSQR to Ax = b.

Proof. It suffices to show that the two methods solve the same subproblem at every iteration. Let x_k^MINRES and x_k^LSQR be the iterates generated by MINRES and LSQR respectively. Then

\[
x_k^{MINRES} = A^T \operatorname*{argmin}_{y \in \mathcal{K}_k(AA^T\!,\;b)} \bigl\|b - AA^T y\bigr\|
= \operatorname*{argmin}_{x \in \mathcal{K}_k(A^T\!A,\;A^T b)} \|b - Ax\| = x_k^{LSQR}.
\]

The first and third equalities come from the subproblems that MINRES and LSQR solve. The second equality follows from the mapping from points in K_k(AAᵀ, b) to K_k(AᵀA, Aᵀb),

f : K_k(AAᵀ, b) → K_k(AᵀA, Aᵀb), f(y) = Aᵀy,

and from the fact that any point x ∈ K_k(AᵀA, Aᵀb) can be written as

\[
x = \gamma_0 A^T b + \sum_{i=1}^{k} \gamma_i (A^T A)^i (A^T b)
\]

for some scalars {γ_i}. Then the point y = γ₀b + Σ_{i=1}^{k} γ_i (AAᵀ)^i b is a preimage of x under f. ∎

This relationship, as well as some other well-known ones between CG, MINRES, CRAIG, LSQR, and LSMR, is summarized in Table 4.2.

Table 4.2: Relationship between CG, MINRES, CRAIG, LSQR, and LSMR

CRAIG ≡ CG on AAᵀy = b, x = Aᵀy
LSQR  ≡ CG on AᵀAx = Aᵀb  ≡ MINRES on AAᵀy = b, x = Aᵀy
LSMR  ≡ MINRES on AᵀAx = Aᵀb
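Theorem 4.3.1 is easy to check numerically with MATLAB's built-in minres and lsqr on a small consistent system; a minimal sketch (sizes and tolerances arbitrary, and both solvers will warn that they stopped at the iteration cap, which is intended):

    % Numerical check of Theorem 4.3.1 on a small consistent system.
    rng(0); m = 50; n = 80;
    A = randn(m,n);  b = A*randn(n,1);         % consistent by construction
    k = 20;                                    % stop both methods after k iterations
    y  = minres(@(v) A*(A'*v), b, 1e-14, k);   % MINRES on (A A') y = b
    x1 = A'*y;                                 % x = A' y
    x2 = lsqr(A, b, 1e-14, k);                 % LSQR on A x = b
    disp(norm(x1 - x2)/norm(x2))               % small: same iterate up to rounding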
BACKWARD ERROR FOR UNDERDETERMINED SYSTEMS  A linear system Ax = b is ill-posed if A is m-by-n with m < n, because the system has an infinite number of solutions (or none). One way to define a unique solution for such a system is to choose the solution x with minimum 2-norm; that is, we solve the optimization problem min_{Ax=b} ‖x‖₂.

For any approximate solution x of the above problem, the normwise backward error is defined as the norm of the minimum perturbation to A such that x solves the perturbed optimization problem:

\[
\eta(x) = \min_{E} \|E\| \quad \text{s.t.} \quad x = \operatorname*{argmin}_{(A+E)x=b} \|x\|.
\]

Sun and Sun [39] have shown that this value is given by

\[
\eta(x) = \sqrt{\frac{\|r\|_2^2}{\|x\|_2^2} + \sigma_{\min}^2(B)}, \qquad
B = A\left(I - \frac{xx^T}{\|x\|_2^2}\right).
\]

Since it is computationally prohibitive to compute the minimum singular value at every iteration, we use ‖r‖/‖x‖ as an approximate backward error in the following analysis.¹ [footnote 1: Note that this estimate is a lower bound on the true backward error. In contrast, the estimates E1 and E2 for the backward error in least-squares problems are upper bounds.]

PRECONDITIONING  For underdetermined systems, a right preconditioner on A would alter the minimum-norm solution. Therefore, only left preconditioners are applicable. In the following experiments, we apply a left diagonal preconditioning to A by scaling the rows of A to unit 2-norm; see MATLAB Code 4.6.

MATLAB Code 4.6  Left diagonal preconditioning

    % scale the row norms to 1
    rnorms = sqrt(full(sum(A.*A,2)));
    rnorms = rnorms + (rnorms == 0);  % avoid division by 0
    D = diag(sparse(1./rnorms));
    A = D*A;
    b = D*b;

DATA  For testing underdetermined systems, we use sparse matrices from the LPnetlib group (the same set of data as in Section 4.1). Each example was downloaded in MATLAB format, and a sparse matrix A and dense vector b were extracted from the data structure via A = Problem.A and b = Problem.b. We then solve the underdetermined linear system min_{Ax=b} ‖x‖₂ with both LSQR and LSMR. MINRES is also used, with the change of variable that makes it equivalent to LSQR as described above.

NUMERICAL RESULTS  The experimental results show that LSMR converges almost as quickly as LSQR for underdetermined systems. The approximate backward errors for four different problems are shown in Figure 4.12. Only in a few cases does LSMR lag behind LSQR for a number of iterations. Thus we conclude that LSMR and LSQR are equally good for finding minimum 2-norm solutions of underdetermined systems.

[Figure 4.12: The backward errors ‖r_k‖/‖x_k‖ for LSQR, LSMR, and MINRES on four underdetermined linear systems (problems lp_pds_10, lp_standgub, lp_israel, and lp_fit1p, used in transposed form) when finding the minimum 2-norm solution. Upper left: the backward errors for all three methods converge at a similar rate; most of our test cases exhibit similar behavior, showing that LSMR and LSQR perform equally well for underdetermined systems. Upper right: a rare case where LSMR lags behind LSQR significantly for some iterations; this plot also confirms our earlier derivation that this special version of MINRES is theoretically equivalent to LSQR, as shown by the almost identical convergence behavior. Lower left: an example where all three algorithms take more than m iterations; since MINRES works with the operator AAᵀ, the effect of numerical error is greater and MINRES converges more slowly than LSQR toward the end of the computation. Lower right: another example showing that MINRES lags behind LSQR because of greater numerical error.]

The experimental results also confirm our earlier derivation that MINRES on AAᵀy = b with x = Aᵀy is equivalent to LSQR on Ax = b. MINRES exhibits the same convergence behavior as LSQR, except in cases where they both take more than m iterations to converge. In those cases, the effect of the increased condition number of AAᵀ kicks in and slows down MINRES in the later iterations.

4.4 REORTHOGONALIZATION

It is well known that Krylov-subspace methods can take arbitrarily many iterations because of loss of orthogonality. For the Golub-Kahan bidiagonalization, we have two sets of vectors U_k and V_k. As an experiment, we implemented the following options in LSMR (a sketch of option 2 follows the list):

1. No reorthogonalization.
2. Reorthogonalize V_k (i.e., reorthogonalize v_k with respect to V_{k−1}).
3. Reorthogonalize U_k (i.e., reorthogonalize u_k with respect to U_{k−1}).
4. Both 2 and 3.
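As a concrete illustration of option 2, the following is a minimal modified-Gram-Schmidt sketch of the v-update inside the Golub-Kahan loop; V holds the previously accepted v_1, ..., v_{k−1} as columns (the variable names are ours, not those of the LSMR implementation):

    % Option 2 (sketch): reorthogonalize the new v against V_{k-1} by MGS.
    % Here v is the raw vector A'*u - beta*v_prev from the bidiagonalization.
    for i = 1:k-1
        vi = V(:,i);
        v  = v - (vi'*v)*vi;     % remove the component along v_i
    end
    alpha = norm(v);
    v = v/alpha;
    V(:,k) = v;                  % store for future reorthogonalization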
Each option was tested on all of the overdetermined test problems with fewer than 16K nonzeros. Figure 4.13 shows an "easy" case in which all options converge equally well (convergence before significant loss of orthogonality), and an extreme case in which reorthogonalization makes a large difference. Unexpectedly, options 2, 3, and 4 proved to be indistinguishable in all cases. To look closer, we forced LSMR to take n iterations. Option 2 (with V_k orthonormal to machine precision ε) was found to be keeping U_k orthonormal to at least O(√ε). Option 3 (with U_k orthonormal) was not quite as effective, but it kept V_k orthonormal to at least O(√ε) up to the point where LSMR would terminate when ATOL = √ε. This effect of one-sided reorthogonalization has also been pointed out in [65].

[Figure 4.13: LSMR with and without reorthogonalization of V_k and/or U_k. Left: an easy case where all options perform similarly (problem lp_ship12l). Right: a helpful case (problem lpi_gran).]

Note that for square or rectangular A with exact arithmetic, LSMR is equivalent to MINRES on the normal equation (and hence to CR [44] and GMRES [63] on the same equation). Reorthogonalization makes the equivalence essentially true in practice.

We now focus on reorthogonalizing V_k but not U_k. Other authors have presented numerical results involving reorthogonalization. For example, on some randomly generated LS problems of increasing condition number, Hayami et al. [37] compare their BA-GMRES method with an implementation of CGLS (equivalent to LSQR [53]) in which V_k is reorthogonalized, and find that the methods require essentially the same number of iterations. The preconditioner chosen for BA-GMRES made that method equivalent to GMRES on AᵀAx = Aᵀb. Thus, GMRES without reorthogonalization was seen to converge essentially as well as CGLS or LSQR with reorthogonalization of V_k (option 2 above). This coincides with the analysis of Paige et al. [55], who conclude that MGS-GMRES does not need reorthogonalization of the Arnoldi vectors V_k.
[Figure 4.14: LSMR with reorthogonalized V_k and restarting, for problems lp_maros and lp_cre_c. NoOrtho represents LSMR without reorthogonalization. Restart5, Restart10, and Restart50 represent LSMR with V_k reorthogonalized and with restarting every 5, 10, or 50 iterations. FullOrtho represents LSMR with V_k reorthogonalized and without restarting. Restart(l) with l = 5, 10, 50 is slower than standard LSMR with or without reorthogonalization.]

RESTARTING  To conserve storage, a simple approach is to restart the algorithm every l steps, as with GMRES(l) [63]. To be precise, we set r_l = b − Ax_l, solve min ‖A Δx − r_l‖, update x_l ← x_l + Δx, and repeat the same process until convergence. Our numerical tests in Figure 4.14 show that restarting LSMR even with full reorthogonalization (of V_k) may lead to stagnation. In general, convergence with restarting is much slower than LSMR without reorthogonalization. Restarting does not seem useful in general.

LOCAL REORTHOGONALIZATION  Here we reorthogonalize each new v_k with respect to the previous l vectors, where l is a specified parameter. Figure 4.15 shows that l = 5 has little effect, but partial speedup was achieved with l = 10 and 50 in the two chosen cases. There is evidence of a useful storage-time tradeoff. It should be emphasized that the potential speedup depends strongly on the computational cost of Av and Aᵀu. If these are cheap, local reorthogonalization may not be worthwhile.

[Figure 4.15: LSMR with local reorthogonalization of V_k, for problems lp_fit1p and lp_bnl2. NoOrtho represents LSMR without reorthogonalization. Local5, Local10, and Local50 represent LSMR with local reorthogonalization of each v_k with respect to the previous 5, 10, or 50 vectors. FullOrtho represents LSMR with reorthogonalized V_k without restarting. Local(l) with l = 5, 10, 50 illustrates reduced iterations as l increases.]
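Local reorthogonalization is a small change to the MGS sketch given for option 2 above: only a window of the last l vectors is stored and orthogonalized against (again with our own variable names):

    % Local reorthogonalization (sketch): keep a buffer W holding the last
    % min(k-1,l) vectors v_{k-l},...,v_{k-1} and apply MGS against it only.
    for i = 1:min(k-1, l)
        wi = W(:,i);
        v  = v - (wi'*v)*wi;       % remove components along stored vectors
    end
    alpha = norm(v);  v = v/alpha;
    W(:, mod(k-1,l)+1) = v;        % overwrite the slot holding v_{k-l}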
5 AMRES

In this chapter we describe an efficient and stable iterative algorithm for computing the vector x in the augmented system

\[
\begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix}
\begin{pmatrix} s \\ x \end{pmatrix} =
\begin{pmatrix} b \\ 0 \end{pmatrix}, \tag{5.1}
\]

where A is a rectangular matrix and γ, δ are any scalars, and we define

\[
\hat{A} = \begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix}, \qquad
\hat{x} = \begin{pmatrix} s \\ x \end{pmatrix}, \qquad
\hat{b} = \begin{pmatrix} b \\ 0 \end{pmatrix}. \tag{5.2}
\]

Our algorithm is called AMRES, for Augmented-system Minimum RESidual method. It is derived by formally applying MINRES [52] to the augmented system (5.1), but is more economical because it is based on the Golub-Kahan bidiagonalization process [29] and it computes estimates of just x (excluding s). Note that Â includes two scaled identity matrices, γI in the (1,1)-block and δI in the (2,2)-block. When γ and δ have opposite sign (e.g., γ = σ, δ = −σ), (5.1) is equivalent to a damped least-squares problem

\[
\min \left\| \begin{pmatrix} A \\ \sigma I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|
\;\equiv\; (A^TA + \sigma^2 I)x = A^Tb
\]

(also known as a Tikhonov regularization problem). CGLS, LSQR, or LSMR may then be applied. They require less work and storage per iteration than AMRES, but the number of iterations and the numerical reliability of all three algorithms would be similar (especially for LSMR).

AMRES is intended for the case where γ and δ have the same sign (e.g., γ = δ = σ). An important application is the solution of "negatively damped normal equations" of the form

\[
(A^TA - \sigma^2 I)x = A^Tb, \tag{5.3}
\]

which arise in both ordinary and regularized total least-squares (TLS) problems, Rayleigh quotient iteration (RQI), and Curtis-Reid scaling for rectangular sparse matrices. These equations do not lend themselves to an equivalent least-squares formulation. Hence, there is a need for algorithms specially tailored to the problem. We note in passing that there is a formal least-squares formulation for negative shifts,

\[
\min \left\| \begin{pmatrix} A \\ i\sigma I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|, \qquad i = \sqrt{-1},
\]

but this doesn't lead to a convenient numerical algorithm for large-scale problems.

For small enough values of σ², the matrix AᵀA − σ²I is positive definite (if A has full column rank), and this is indeed the case for ordinary TLS problems. However, for regularized TLS problems (where σ² plays the role of a regularization parameter that is not known a priori and must be found by solving system (5.3) for a range of σ-values), there is no guarantee that AᵀA − σ²I will be positive definite. With a minor extension, CGLS may be applied with any shift [6] (and this could often be the most efficient algorithm), but when the shifted matrix is indefinite, stability cannot be guaranteed. AMRES is reliable for any shift, even if AᵀA − σ²I is indefinite. Also, if σ happens to be a singular value of A (extreme or interior), AMRES may be used to compute a corresponding singular vector in the manner of inverse iteration,¹ as described in detail in Section 6.3. [footnote 1: Actually, by solving a singular least-squares problem, as in Choi [13].]
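The block elimination behind these equivalences is easy to confirm on a dense example. A minimal sketch (dense backslash only, purely illustrative): with γ = δ = σ, eliminating s from (5.1) gives exactly the negatively damped equation (5.3).

    % Check: the x-part of the augmented system (5.1) with gamma = delta = sigma
    % satisfies the negatively damped normal equations (5.3).
    rng(1); m = 30; n = 20; sigma = 0.5;
    A = randn(m,n); b = randn(m,1);
    Ahat = [sigma*eye(m), A; A', sigma*eye(n)];
    xb   = Ahat \ [b; zeros(n,1)];
    x    = xb(m+1:end);                            % lower half of (s; x)
    disp(norm((A'*A - sigma^2*eye(n))*x - A'*b))   % ~ 0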
5.1 DERIVATION OF AMRES

If the Lanczos process Tridiag(Â, b̂) were applied to the matrix and rhs in (5.1), we would have

\[
\hat{T}_{2k} = \begin{pmatrix}
\gamma & \alpha_1 \\
\alpha_1 & \delta & \beta_2 \\
& \beta_2 & \gamma & \alpha_2 \\
& & \ddots & \ddots & \ddots \\
& & & \alpha_k & \delta
\end{pmatrix}, \qquad
\hat{V}_{2k} = \begin{pmatrix}
u_1 & & u_2 & & \cdots & u_k & \\
& v_1 & & v_2 & \cdots & & v_k
\end{pmatrix}, \tag{5.4}
\]

and then T̂_{2k+1}, V̂_{2k+1} in the obvious way. Because of the structure of Â and b̂, the scalars α_k, β_k and vectors u_k, v_k are independent of γ and δ, and it is more efficient to generate the same quantities by the Golub-Kahan process Bidiag(A, b). To solve (5.1), MINRES would solve the subproblems

\[
\min \bigl\| \hat{H}_k y_k - \beta_1 e_1 \bigr\|, \qquad
\hat{H}_k = \begin{cases}
\begin{pmatrix} \hat{T}_k \\ \alpha_{(k+1)/2}\, e_k^T \end{pmatrix} & (k \text{ odd}) \\[4pt]
\begin{pmatrix} \hat{T}_k \\ \beta_{k/2+1}\, e_k^T \end{pmatrix} & (k \text{ even}).
\end{cases}
\]

5.1.1 LEAST-SQUARES SUBSYSTEM

If γ = δ = σ, a singular value of A, the matrix Â is singular. In general we wish to solve min_x̂ ‖Âx̂ − b̂‖, where x̂ may not be unique. We define the k-th estimate of x̂ to be x̂_k = V̂_k ŷ_k, and then

\[
\hat{r}_k \equiv \hat{b} - \hat{A}\hat{x}_k = \hat{b} - \hat{A}\hat{V}_k \hat{y}_k
= \hat{V}_{k+1}\bigl(\beta_1 e_1 - \hat{H}_k \hat{y}_k\bigr). \tag{5.5}
\]

To minimize the residual r̂_k, we perform a QR factorization on Ĥ_k:

\[
\min \|\hat{r}_k\| = \min \bigl\|\beta_1 e_1 - \hat{H}_k \hat{y}_k\bigr\|
= \min \left\| Q_{k+1}\,\beta_1 e_1 - \begin{pmatrix} R_k \\ 0 \end{pmatrix} \hat{y}_k \right\| \tag{5.6}
\]
\[
= \min \left\| \begin{pmatrix} z_k \\ \bar{\zeta}_{k+1} \end{pmatrix} - \begin{pmatrix} R_k \\ 0 \end{pmatrix} \hat{y}_k \right\|, \tag{5.7}
\]

where

\[
z_k^T = (\zeta_1, \zeta_2, \ldots, \zeta_k), \qquad Q_{k+1} = P_k \cdots P_2 P_1, \tag{5.8}
\]

with Q_{k+1} being a product of plane rotations. At iteration k, P_l denotes a plane rotation on rows l and l+1 (with k ≥ l):

\[
P_l = \begin{pmatrix} I_{l-1} & & \\ & \begin{matrix} c_l & s_l \\ -s_l & c_l \end{matrix} & \\ & & I_{k-l} \end{pmatrix}.
\]

The solution ŷ_k of (5.7) satisfies R_k ŷ_k = z_k. As in MINRES, to allow a cheap iterative update of x̂_k we define Ŵ_k to be the solution of

R_kᵀ Ŵ_kᵀ = V̂_kᵀ, (5.9)

which gives us

x̂_k = V̂_k ŷ_k = Ŵ_k R_k ŷ_k = Ŵ_k z_k. (5.10)

Since we are interested in the lower half of x̂_k = (s_k; x_k), we write Ŵ_k as

\[
\hat{W}_k = \begin{pmatrix} W_k^u \\ W_k^v \end{pmatrix}, \qquad
W_k^u = \begin{pmatrix} w_1^u & w_2^u & \cdots & w_k^u \end{pmatrix}, \qquad
W_k^v = \begin{pmatrix} w_1^v & w_2^v & \cdots & w_k^v \end{pmatrix}.
\]

Then (5.10) can be simplified to

x_k = W_k^v z_k = x_{k−1} + ζ_k w_k^v.

5.1.2 QR FACTORIZATION

The effects of the first two rotations in (5.8) are shown here:

\[
\begin{pmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{pmatrix}
\begin{pmatrix} \gamma & \alpha_1 & \\ \alpha_1 & \delta & \beta_2 \end{pmatrix}
= \begin{pmatrix} \rho_1 & \theta_1 & \eta_1 \\ & \bar{\rho}_2 & \bar{\theta}_2 \end{pmatrix}, \tag{5.11}
\]
\[
\begin{pmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{pmatrix}
\begin{pmatrix} \bar{\rho}_2 & \bar{\theta}_2 & \\ \beta_2 & \gamma & \alpha_2 \end{pmatrix}
= \begin{pmatrix} \rho_2 & \theta_2 & \eta_2 \\ & \bar{\rho}_3 & \bar{\theta}_3 \end{pmatrix}. \tag{5.12}
\]

Later rotations are shown in Algorithm AMRES.

5.1.3 UPDATING W_k^v

Since we are only interested in W_k^v, we can extract it from (5.9) to get

\[
R_k^T (W_k^v)^T = \begin{cases}
\begin{pmatrix} 0 & v_1 & 0 & v_2 & \cdots & v_{(k-1)/2} & 0 \end{pmatrix}^T & (k \text{ odd}) \\
\begin{pmatrix} 0 & v_1 & 0 & v_2 & \cdots & 0 & v_{k/2} \end{pmatrix}^T & (k \text{ even}),
\end{cases}
\]

where the structure of the right-hand side comes from (5.4). Since R_kᵀ is lower tridiagonal, the system can be solved by the recurrences

w_1^v = 0, w_2^v = v_1/ρ_2,

\[
w_k^v = \begin{cases}
\bigl(-\eta_{k-2} w_{k-2}^v - \theta_{k-1} w_{k-1}^v\bigr)/\rho_k & (k \text{ odd}) \\
\bigl(v_{k/2} - \eta_{k-2} w_{k-2}^v - \theta_{k-1} w_{k-1}^v\bigr)/\rho_k & (k \text{ even}).
\end{cases}
\]

If we define h_k = ρ_k w_k^v, then we arrive at the update rules in step 7 of Algorithm 5.1, which saves 2n floating-point multiplications per step of Golub-Kahan bidiagonalization compared with updating w_k^v.

5.1.4 ALGORITHM AMRES

Algorithm 5.1 summarizes the main steps of AMRES, excluding the norm estimates and stopping rules developed later. As usual, β₁u₁ = b is shorthand for β₁ = ‖b‖, u₁ = b/β₁ (and similarly for all α_k, β_k).

Algorithm 5.1  Algorithm AMRES
1: (Initialize)
   β₁u₁ = b, α₁v₁ = Aᵀu₁, ζ̄₁ = β₁, ρ̄₁ = γ, θ̄₁ = α₁,
   η₋₁ = η₀ = 0, θ₀ = 0, h₋₁ = h₀ = 0 (equivalently w^v₋₁ = w^v₀ = 0), x₀ = 0
2: for l = 1, 2, 3, ... do
3:   (Continue the bidiagonalization)
     β_{l+1} u_{l+1} = A v_l − α_l u_l
     α_{l+1} v_{l+1} = Aᵀ u_{l+1} − β_{l+1} v_l
4:   for k = 2l−1, 2l do
5:     (Set up temporary variables)
       λ = δ, α = α_l, β = β_{l+1}  (k odd);   λ = γ, α = β_{l+1}, β = α_{l+1}  (k even)
6:     (Construct and apply rotation P_k)
       ρ_k = (ρ̄_k² + α²)^{1/2},  c_k = ρ̄_k/ρ_k,  s_k = α/ρ_k
       θ_k = c_k θ̄_k + s_k λ,   ρ̄_{k+1} = −s_k θ̄_k + c_k λ
       η_k = s_k β,             θ̄_{k+1} = c_k β
       ζ_k = c_k ζ̄_k,           ζ̄_{k+1} = −s_k ζ̄_k
7:     (Update estimates of x)
       h_k = −(η_{k−2}/ρ_{k−2}) h_{k−2} − (θ_{k−1}/ρ_{k−1}) h_{k−1}  (k odd)
       h_k = v_l − (η_{k−2}/ρ_{k−2}) h_{k−2} − (θ_{k−1}/ρ_{k−1}) h_{k−1}  (k even)
       x_k = x_{k−1} + (ζ_k/ρ_k) h_k
8:   end for
9: end for

5.2 STOPPING RULES

Stopping rules analogous to those in LSQR [53] are used for AMRES. Three dimensionless quantities are needed: ATOL, BTOL, CONLIM. The first stopping rule applies to compatible systems, the second rule applies to incompatible systems, and the third rule applies to both.

S1: Stop if ‖r_k‖ ≤ BTOL·‖b‖ + ATOL·‖A‖·‖x_k‖
S2: Stop if ‖Âᵀr̂_k‖ ≤ ATOL·‖Â‖·‖r̂_k‖
S3: Stop if cond(Â) ≥ CONLIM
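Algorithm 5.1 is short enough to transcribe directly. The following dense MATLAB sketch is our own transcription (no stopping rules, norm estimates, or breakdown guards; it assumes all α_k, β_k > 0); for γ = δ it can be checked against the x-part of the backslash solve shown after (5.3).

    function x = amres_sketch(A, b, gamma, delta, maxl)
    % Minimal dense sketch of Algorithm 5.1; variable names follow the text.
      beta1 = norm(b);  u = b/beta1;
      v = A'*u;  alpha = norm(v);  v = v/alpha;
      zetabar = beta1;  rhobar = gamma;  thetabar = alpha;
      n = size(A,2);  x = zeros(n,1);
      h = zeros(n,2);                 % h(:,1) = h_{k-2}, h(:,2) = h_{k-1}
      rho = [1 1]; eta = [0 0]; theta = [0 0];   % histories at k-2, k-1
      for l = 1:maxl
        % continue the bidiagonalization
        unew = A*v - alpha*u;   beta_next  = norm(unew); unew = unew/beta_next;
        vnew = A'*unew - beta_next*v; alpha_next = norm(vnew); vnew = vnew/alpha_next;
        for k = [2*l-1, 2*l]
          if mod(k,2) == 1, lam = delta; a = alpha;     bb = beta_next;
          else,             lam = gamma; a = beta_next; bb = alpha_next; end
          rhok = sqrt(rhobar^2 + a^2);  c = rhobar/rhok;  s = a/rhok;
          thetak = c*thetabar + s*lam;  rhobar   = -s*thetabar + c*lam;
          etak   = s*bb;                thetabar = c*bb;
          zetak  = c*zetabar;           zetabar  = -s*zetabar;
          if mod(k,2) == 1, hk = zeros(n,1); else, hk = v; end
          hk = hk - (eta(1)/rho(1))*h(:,1) - (theta(2)/rho(2))*h(:,2);
          x = x + (zetak/rhok)*hk;
          h = [h(:,2), hk];  rho = [rho(2), rhok];
          eta = [eta(2), etak];  theta = [theta(2), thetak];
        end
        u = unew;  v = vnew;  alpha = alpha_next;
      end
    end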
5.3 ESTIMATES OF NORMS

5.3.1 COMPUTING ‖r̂_k‖

From (5.7), it is obvious that ‖r̂_k‖ = |ζ̄_{k+1}|.

5.3.2 COMPUTING ‖Âr̂_k‖

Starting from (5.5) and using (5.6), together with ÂV̂_{k+1} = V̂_{k+2}H_{k+1} and the structure of H_{k+1}Q_{k+1}ᵀ, we have

\[
\hat{A}\hat{r}_k = \hat{A}\hat{V}_{k+1}\bigl(\beta_1 e_1 - \hat{H}_k\hat{y}_k\bigr)
= \hat{V}_{k+2}H_{k+1}\bigl(\beta_1 e_1 - \hat{H}_k\hat{y}_k\bigr)
= \hat{V}_{k+2}H_{k+1}Q_{k+1}^T \begin{pmatrix} 0 \\ \bar{\zeta}_{k+1} \end{pmatrix}
= \bar{\zeta}_{k+1}\bigl(\bar{\rho}_{k+1}\hat{v}_{k+1} + \bar{\theta}_{k+1}\hat{v}_{k+2}\bigr).
\]

Therefore

\[
\|\hat{A}\hat{r}_k\| = |\bar{\zeta}_{k+1}|\,\bigl(\bar{\rho}_{k+1}^2 + \bar{\theta}_{k+1}^2\bigr)^{1/2}.
\]

5.3.3 COMPUTING ‖x̂_k‖

From (5.7) and (5.10), we know that x̂_k = V̂_k ŷ_k and z_k = R_k ŷ_k. If we perform a QR factorization Q̃_k R_kᵀ = R̃_k and let t_k = Q̃_k ŷ_k, we can write

z_k = R_k ŷ_k = R_k Q̃_kᵀ Q̃_k ŷ_k = R̃_kᵀ t_k, (5.13)

giving x̂_k = V̂_k Q̃_kᵀ t_k and hence ‖x̂_k‖ = ‖t_k‖, where t_k is derived next. For the QR factorizations (5.8) and (5.13), we write R_k and R̃_k as

\[
R_k = \begin{pmatrix}
\rho_1 & \theta_1 & \eta_1 \\
& \rho_2 & \theta_2 & \eta_2 \\
& & \ddots & \ddots & \ddots \\
& & & \rho_{k-2} & \theta_{k-2} & \eta_{k-2} \\
& & & & \rho_{k-1} & \theta_{k-1} \\
& & & & & \rho_k
\end{pmatrix}, \quad
\tilde{R}_k = \begin{pmatrix}
\tilde\rho_1^{(4)} & \tilde\theta_1^{(2)} & \tilde\eta_1^{(1)} \\
& \tilde\rho_2^{(4)} & \tilde\theta_2^{(2)} & \tilde\eta_2^{(1)} \\
& & \ddots & \ddots & \ddots \\
& & & \tilde\rho_{k-2}^{(4)} & \tilde\theta_{k-2}^{(2)} & \tilde\eta_{k-2}^{(1)} \\
& & & & \tilde\rho_{k-1}^{(3)} & \tilde\theta_{k-1}^{(1)} \\
& & & & & \tilde\rho_k^{(2)}
\end{pmatrix},
\]

where the upper index denotes the number of times that element has changed during the decompositions. Also, Q̃_k = (P̃_k P̂_k) ··· (P̃_3 P̂_3) P̃_2, where P̂_k and P̃_k are constructed by changing the (k−2 : k) × (k−2 : k) submatrix of I_k to

\[
\begin{pmatrix} \tilde c_k^{(1)} & & \tilde s_k^{(1)} \\ & 1 & \\ -\tilde s_k^{(1)} & & \tilde c_k^{(1)} \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} 1 & & \\ & \tilde c_k^{(2)} & \tilde s_k^{(2)} \\ & -\tilde s_k^{(2)} & \tilde c_k^{(2)} \end{pmatrix}
\]

respectively. The effects of these two rotations can be summarized as

\[
\begin{pmatrix} \tilde c_k^{(1)} & \tilde s_k^{(1)} \\ -\tilde s_k^{(1)} & \tilde c_k^{(1)} \end{pmatrix}
\begin{pmatrix} \tilde\rho_{k-2}^{(3)} & \tilde\theta_{k-2}^{(1)} & \\ \eta_{k-2} & \theta_{k-1} & \rho_k \end{pmatrix}
= \begin{pmatrix} \tilde\rho_{k-2}^{(4)} & \tilde\theta_{k-2}^{(2)} & \tilde\eta_{k-2}^{(1)} \\ 0 & \dot\theta_{k-1} & \tilde\rho_k^{(1)} \end{pmatrix},
\]
\[
\begin{pmatrix} \tilde c_k^{(2)} & \tilde s_k^{(2)} \\ -\tilde s_k^{(2)} & \tilde c_k^{(2)} \end{pmatrix}
\begin{pmatrix} \tilde\rho_{k-1}^{(2)} & \\ \dot\theta_{k-1} & \tilde\rho_k^{(1)} \end{pmatrix}
= \begin{pmatrix} \tilde\rho_{k-1}^{(3)} & \tilde\theta_{k-1}^{(1)} \\ 0 & \tilde\rho_k^{(2)} \end{pmatrix}.
\]

Let t_k = (τ_1^{(3)}, ..., τ_{k−2}^{(3)}, τ_{k−1}^{(2)}, τ_k^{(1)})ᵀ. We solve for t_k by

\[
\tau_{k-2}^{(3)} = \bigl(\zeta_{k-2} - \tilde\eta_{k-4}^{(1)}\tau_{k-4}^{(3)} - \tilde\theta_{k-3}^{(2)}\tau_{k-3}^{(3)}\bigr)/\tilde\rho_{k-2}^{(4)}, \quad
\tau_{k-1}^{(2)} = \bigl(\zeta_{k-1} - \tilde\eta_{k-3}^{(1)}\tau_{k-3}^{(3)} - \tilde\theta_{k-2}^{(2)}\tau_{k-2}^{(3)}\bigr)/\tilde\rho_{k-1}^{(3)}, \quad
\tau_k^{(1)} = \bigl(\zeta_k - \tilde\eta_{k-2}^{(1)}\tau_{k-2}^{(3)} - \tilde\theta_{k-1}^{(1)}\tau_{k-1}^{(2)}\bigr)/\tilde\rho_k^{(2)},
\]

with our estimate of ‖x̂_k‖ obtained from

‖x̂_k‖ = ‖t_k‖ = ‖(τ_1^{(3)}, ..., τ_{k−2}^{(3)}, τ_{k−1}^{(2)}, τ_k^{(1)})‖.

5.3.4 ESTIMATES OF ‖Â‖ AND COND(Â)

Using Lemma 2.32 from Choi [13], we have the following estimates of ‖Â‖ at the l-th iteration:

\[
\|\hat{A}\| \geq \max_{1\leq i\leq l}\,\bigl(\alpha_i^2 + \delta^2 + \beta_{i+1}^2\bigr)^{1/2}, \tag{5.14}
\]
\[
\|\hat{A}\| \geq \max_{2\leq i\leq l}\,\bigl(\beta_i^2 + \gamma^2 + \alpha_i^2\bigr)^{1/2}. \tag{5.15}
\]

Stewart [71] showed that the minimum and maximum singular values of Ĥ_k bound the diagonal elements of R̃_k. Since the extreme singular values of Ĥ_k estimate those of Â, we have an approximation (which is also a lower bound) for the condition number of Â:

\[
\mathrm{cond}(\hat{A}) \approx \frac{\max\bigl(\tilde\rho_1^{(4)}, \ldots, \tilde\rho_{k-2}^{(4)}, \tilde\rho_{k-1}^{(3)}, \tilde\rho_k^{(2)}\bigr)}{\min\bigl(\tilde\rho_1^{(4)}, \ldots, \tilde\rho_{k-2}^{(4)}, \tilde\rho_{k-1}^{(3)}, \tilde\rho_k^{(2)}\bigr)}. \tag{5.16}
\]

5.4 COMPLEXITY

The storage and computational cost per iteration of AMRES are shown in Table 5.1.

Table 5.1: Storage and computational cost for AMRES

            Storage (m)   Storage (n)            Work (m)   Work (n)
AMRES       Av, u         x, v, h_{k−1}, h_k     3          9

plus the cost of computing the products Av and Aᵀu.

6 AMRES APPLICATIONS

In this chapter we discuss a number of applications of AMRES. Section 6.1 describes its use for Curtis-Reid scaling, a commonly used method for scaling sparse rectangular matrices such as those arising in linear programming problems. Section 6.2 applies AMRES to Rayleigh quotient iteration for computing singular vectors, and describes a modified version of RQI that allows reuse of the Golub-Kahan vectors. Section 6.3 describes a modified version of AMRES that allows singular vectors to be computed more quickly and reliably when the corresponding singular value is known.

6.1 CURTIS-REID SCALING

In linear programming problems, the constraint matrix is often preprocessed by a scaling procedure before the application of a solver. The goal of such scaling is to reduce the range of magnitudes of the nonzero matrix elements. This may improve the numerical behavior of the solver and reduce the number of iterations to convergence [21]. Scaling procedures multiply a given matrix A by a diagonal matrix on each side:

Ā = RAC, R = diag(β^{−r_i}), C = diag(β^{−c_j}),

where A is m-by-n, β > 1 is an arbitrary base, and the vectors r = (r₁, r₂, ..., r_m)ᵀ and c = (c₁, c₂, ..., c_n)ᵀ represent the scaling factors in log-scale. One popular scaling objective is to make each nonzero element of the scaled matrix Ā approximately 1 in absolute value, which translates into the following system of equations for r and c:

|ā_ij| ≈ 1  ⇒  β^{−r_i} |a_ij| β^{−c_j} ≈ 1  ⇒  −r_i + log_β|a_ij| − c_j ≈ 0.
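Given scaling vectors r and c in log-scale, applying the scales is a one-liner; a minimal sketch (base β = e here, matching the natural log used later in MATLAB Code 6.1):

    % Apply log-scale scaling vectors r (m-by-1) and c (n-by-1) to A.
    R    = diag(sparse(exp(-r)));   % R = diag(beta.^-r) with beta = e
    C    = diag(sparse(exp(-c)));
    Abar = R*A*C;                   % scaled matrix; nonzeros should be ~1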
Following the above objective, two methods of matrix scaling were proposed in 1962 and 1971:

\[
\min_{r_i,c_j}\ \max_{a_{ij}\neq 0}\ \bigl|\log_\beta|a_{ij}| - r_i - c_j\bigr| \qquad \text{Fulkerson \& Wolfe 1962 [28]} \tag{6.1}
\]
\[
\min_{r_i,c_j}\ \sum_{a_{ij}\neq 0} \bigl(\log_\beta|a_{ij}| - r_i - c_j\bigr)^2 \qquad \text{Hamming 1971 [36]} \tag{6.2}
\]

6.1.1 CURTIS-REID SCALING USING CGA

The closed-form solution proposed by Hamming is restricted to a dense matrix A. Curtis and Reid extended Hamming's method to general A and designed a specialized version of CG (which we will call CGA) that allows the scaling to work efficiently for sparse matrices [17]. Experiments by Tomlin [77] indicate that Curtis-Reid scaling is superior to that of Fulkerson and Wolfe. We now describe the key ideas of Curtis-Reid scaling for matrices.

Let z be the number of nonzero elements in A. Equation (6.2) can be written in matrix notation as

\[
\min_{r,c}\ \left\| M \begin{pmatrix} r \\ c \end{pmatrix} - d \right\|, \tag{6.3}
\]

where M is a z-by-(m+n) matrix and d is a z-vector. Each row of M corresponds to a nonzero A_ij: the entire row is 0 except for the i-th and (j+m)-th columns, which are 1. The corresponding element of d is log|A_ij|. M and d are defined by MATLAB Code 6.1.¹ [footnote 1: In practice, matrix M is never constructed, because it is more efficient to work with the normal equation directly, as described below.]

MATLAB Code 6.1  Least-squares problem for Curtis-Reid scaling

    [m n] = size(A);
    [I J S] = find(A);
    z = length(I);
    M = sparse([1:z 1:z]', [I; J+m], ones(2*z,1));
    d = log(abs(S));

The normal equation for the least-squares problem (6.3), MᵀM (r; c) = Mᵀd, can be written as

\[
\begin{pmatrix} D_1 & F \\ F^T & D_2 \end{pmatrix}
\begin{pmatrix} r \\ c \end{pmatrix} =
\begin{pmatrix} s \\ t \end{pmatrix}, \tag{6.4}
\]

where F has the same dimensions and sparsity pattern as A with every nonzero entry changed to 1, while D₁ and D₂ are diagonal matrices whose diagonal elements are the numbers of nonzeros in each row and column of A respectively. The vectors s and t are defined by

B_ij = log|A_ij| (A_ij ≠ 0), B_ij = 0 (A_ij = 0), s_i = Σ_{j=1}^{n} B_ij, t_j = Σ_{i=1}^{m} B_ij.

The construction of these matrices is shown in MATLAB Code 6.2.

MATLAB Code 6.2  Normal equation for Curtis-Reid scaling

    F  = (abs(A)>0);
    D1 = diag(sparse(sum(F,2)));   % sparse diagonal matrix
    D2 = diag(sparse(sum(F,1)));   % sparse diagonal matrix

    abslog = @(x) log(abs(x));
    B = spfun(abslog,A);
    s = sum(B,2);
    t = sum(B,1)';

Equation (6.4) is a consistent positive semidefinite system. Any solution of this system suffices for scaling purposes, as all solutions produce the same scaling.² [footnote 2: It is obvious that (1; −1) is a null vector, which follows directly from the definitions of F, D₁, and D₂. If r and c solve (6.4), then r + α1 and c − α1 are also solutions for any α. This corresponds to the fact that RAC and (β^α R)A(β^{−α} C) give the same scaled Ā.]

Reid [60] developed a specialized version of CG for matrices with Property A.³ [footnote 3: A matrix A is said to have Property A if there exist permutations P₁, P₂ such that P₁ᵀAP₂ = (D₁, E; F, D₂), where D₁ and D₂ are diagonal matrices [87].] It takes advantage of the sparsity pattern that appears when CG is applied to matrices with Property A to reduce both storage and work. Curtis and Reid [17] applied CGA to (6.4) to find the scaling for matrix A. To compare the performance of CGA on (6.4) against AMRES, we implemented the CGA algorithm as described in [17]; the implementation is shown in MATLAB Code 6.3.
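The identification of MᵀM and Mᵀd with the blocks in (6.4) can be sanity-checked directly from Codes 6.1 and 6.2; a minimal sketch:

    % Check that M'*M = [D1 F; F' D2] and M'*d = [s; t] on a small example.
    A = sprandn(8, 5, 0.4);
    [m n] = size(A); [I J S] = find(A); z = length(I);
    M = sparse([1:z 1:z]', [I; J+m], ones(2*z,1));   % MATLAB Code 6.1
    d = log(abs(S));
    F  = (abs(A)>0);                                 % MATLAB Code 6.2
    D1 = diag(sparse(sum(F,2))); D2 = diag(sparse(sum(F,1)));
    B  = spfun(@(x) log(abs(x)), A);
    s  = sum(B,2); t = sum(B,1)';
    disp(norm(full(M'*M - [D1 F; F' D2])))           % ~ 0
    disp(norm(M'*d - [s; t]))                        % ~ 0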
Then, with the definitions 1 r̄ = D12 r, 1 c̄ = D22 c, −1 s̄ = D1 2 s, −1 t̄ = D2 2 t, −1 −1 C = D 1 2 F D2 2 , 6.1. CURTIS-REID SCALING M ATLAB Code 6.3 CGA for Curtis-Reid scaling. This is an implementation of the algorithm described in [17]. 1 function [cscale, rscale] = crscaleCGA(A, tol) 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 % CRSCALE implements the Curtis-Reid scaling algorithm. % [cscale, rscale, normrv, normArv, normAv, normxv, condAv] = crscale(A, tol); % % Use of the scales: % % If C = diag(sparse(cscale)), R = diag(sparse(rscale)), % Cinv = diag(sparse(1./cscale)), Rinv = diag(sparse(1./rscale)), % then Cinv*A*Rinv should have nonzeros that are closer to 1 in absolute value. % % To apply the scales to a linear program, % min c’x st Ax = b, l <= x <= u, % we need to define "barred" quantities by the following relations: % A = R Abar C, b = R bbar, C cbar = c, % C l = lbar, C u = ubar, C x = xbar. % This gives the scaled problem % min cbar’xbar st Abar xbar = bbar, lbar <= xbar <= ubar. % % % 03 Jun 2011: First version. % David Fong and Michael Saunders, ICME, Stanford University. 23 24 25 26 27 28 29 30 31 32 33 34 E = (abs(A)>0); % sparse if A is sparse rowSumE = sum(E,2); colSumE = sum(E,1); Minv = diag(sparse(1./(rowSumE + (rowSumE == 0)))); Ninv = diag(sparse(1./(colSumE + (colSumE == 0)))); [m n] = size(A); abslog = @(x) log(abs(x)); Abar = spfun(abslog,A); % make sure sigma and tau are of correct size even if some entries of A are 1 sigma = zeros(m,1) + sum(Abar,2); tau = zeros(n,1) + sum(Abar,1)’; 35 36 37 38 39 40 41 itnlimit = 100; r1 = zeros(length(sigma),1); r2 = tau - E’*(Minv*sigma); c2 = zeros(length(tau),1); c = c2; e1 = 0; e = 0; q2 = 1; s2 = r2’*(Ninv*r2); 42 43 44 45 46 47 48 49 50 51 for t = 1:itnlimit rm1 = r1; r = r2; em2 = e; em1 = e1; q = q2; s = s2; cm2 = c; c = c2; r1 = -(E*(Ninv*r) + em1 * rm1)/q; s1 = r1’*(Minv*r1); e = q * s1/s; q1 = 1 - e; c2 = c + (Ninv*r + em1*em2*(c-cm2))/(q*q1); cv(:,t+1) = c2; if s1 < 1e-10; break; end 52 53 54 55 56 r2 = -(E’*(Minv*r1) + e * r)/q1; s2 = r2’*(Ninv*r2); e1 = q1*s2/s1; q2 = 1 - e1; end 57 58 59 60 61 62 63 64 65 c = c2; rho = Minv*(sigma - E*c); gamma = c; rscale = exp(rho); cscale rmax = max(rscale); cmax s = sqrt(rmax/cmax); cscale = cscale*s; rscale end % function crscale = exp(gamma); = max(cscale); = rscale/s; 93 94 CHAPTER 6. AMRES APPLICATIONS (6.4) becomes I CT C I ! r̄ c̄ ! = ! s̄ , t̄ and with ĉ = c̄ − t̄, we get the form required by AMRES: I CT 6.1.3 C I ! r̄ ĉ ! = ! s̄ − C t̄ . 0 C OMPARISON OF CGA AND AMRES We now apply CGA and AMRES as described in the previous two sections to find the scaling vectors for Curtis-Reid scaling for some matrices from the University of Florida Sparse Matrix Collection (Davis [18]). At the k-th iteration, we compute the estimate r(k) , c(k) from the two algorithms, and use them to evaluate the objective function f (r, c) from the least-squares problem in (6.2): fk = f (r(k) , c(k) ) = X aij 6=0 (k) (loge |aij | − ri (k) − cj )2 . (6.5) We take the same set of problems as in Section 4.1. The LPnetlib group includes data for 138 linear programming problems. Each example was downloaded in M ATLAB format, and a sparse matrix A was extracted from the data structure via A = Problem.A. In Figure 6.1, we plot log10 (fk − f ∗ ) against iteration number k for each algorithm. f ∗ , the optimal objective value, is taken as the minimum objective value attained by running CGA or AMRES after some K iterations. 
[Figure 6.1: Convergence of CGA and AMRES applied to the Curtis-Reid scaling problem (LPnetlib examples lpi_pilot4i, lp_grow22, lp_scfxm3, and lp_sctap2). At each iteration we plot the difference between the current and optimal values of the objective function in (6.5). Upper left and lower left: typical cases where CGA and AMRES exhibit similar convergence behavior; there is no obvious advantage of choosing one algorithm over the other in these cases. Upper right: a less common case where AMRES converges much faster than CGA in the earlier iterations. This is important because scaling parameters don't need to be computed to high accuracy for most applications, so AMRES could terminate early in this case. Similar convergence behavior was observed for several other problems. Lower right: in a few cases such as this, CGA converges faster than AMRES during the later iterations.]

From the results, we see that CGA and AMRES exhibit similar convergence behavior for many matrices. There are also several cases where AMRES converges rapidly in the early iterations, which is important because scaling vectors don't need to be computed to high precision.

6.2 RAYLEIGH QUOTIENT ITERATION

In this section, we focus on algorithms that improve an approximate singular value and singular vector. The algorithms are based on Rayleigh quotient iteration. In the next section, we focus on finding singular vectors when the corresponding singular value is known.

Rayleigh quotient iteration (RQI) is an iterative procedure developed by John William Strutt, third Baron Rayleigh, for his study of the theory of sound [59]. The procedure, which improves an approximate eigenvector of a square matrix, is summarized in Algorithm 6.1. Parlett and Kahan [56] proved that RQI converges for almost all starting eigenvalue and eigenvector guesses, although in general it is unpredictable to which eigenpair RQI will converge. Ostrowski [49] showed that for a symmetric matrix A, RQI exhibits local cubic convergence.

Algorithm 6.1  Rayleigh quotient iteration (RQI) for square A
1: Given initial eigenvalue and eigenvector estimates ρ₀, x₀
2: for q = 1, 2, ... do
3:   Solve (A − ρ_{q−1}I) t = x_{q−1}
4:   x_q = t/‖t‖
5:   ρ_q = x_qᵀ A x_q
6: end for

It has been shown that MINRES is preferable to SYMMLQ as the solver within RQI [20; 86]. Thus, it is natural to use AMRES, a MINRES-based solver, for extending RQI to compute singular vectors. In this section, we focus on improving singular vector estimates for a general rectangular matrix A. RQI can be adapted to this task as in Algorithm 6.2.

Algorithm 6.2  RQI for singular vectors, for square or rectangular A
1: Given initial singular value and right singular vector estimates ρ₀, x₀
2: for q = 1, 2, ... do
3:   Solve for t in
       (AᵀA − ρ²_{q−1} I) t = x_{q−1}   (6.6)
4:   x_q = t/‖t‖
5:   ρ_q = ‖A x_q‖
6: end for
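For intuition, Algorithm 6.2 can be prototyped on a small dense matrix with a direct inner solve; a minimal sketch only (the point of this chapter is to replace the backslash below with a stable iterative solver such as AMRES):

    % Dense prototype of Algorithm 6.2 (illustrative only).
    rng(2); m = 40; n = 25;
    A = randn(m,n);
    [~,~,V] = svd(A);
    x = V(:,end) + 0.1*randn(n,1);  x = x/norm(x);   % perturbed singular vector
    rho = norm(A*x);                                 % initial estimate
    for q = 1:5
        t = (A'*A - rho^2*eye(n)) \ x;               % inner solve (6.6)
        x = t/norm(t);
        rho = norm(A*x);                             % converges to some sigma_i
    end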
We will refer to the iterations in line 2 as outer iterations, and the iterations inside the solver on line 3 as inner iterations. The system on line 3 becomes increasingly ill-conditioned as the singular value estimate ρ_q converges to a singular value. We continue our investigation by improving both the stability and the speed of Algorithm 6.2. Section 6.2.1 focuses on improving stability by solving an augmented system using AMRES. Section 6.2.2 improves the speed by a modification of RQI that allows AMRES to reuse precomputed Golub-Kahan vectors in subsequent iterations.

6.2.1 STABLE INNER ITERATIONS FOR RQI

Compared with Algorithm 6.2, a more stable alternative is to solve a larger augmented system. Note that line 3 of Algorithm 6.2 is equivalent to

\[
\begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix}
\begin{pmatrix} s \\ t \end{pmatrix} =
\begin{pmatrix} 0 \\ x_{q-1} \end{pmatrix}.
\]

If we applied MINRES, we would get a more stable algorithm with twice the computational cost. A better alternative is to convert the system to a form that can be solved readily by AMRES. Since the solution will be normalized as on line 4, we are free to scale the right-hand side:

\[
\begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix}
\begin{pmatrix} s \\ t \end{pmatrix} =
\begin{pmatrix} 0 \\ \rho_{q-1} x_{q-1} \end{pmatrix}.
\]

We introduce the shifted variable t̄ = t + x_{q−1} to obtain the system

\[
\begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix}
\begin{pmatrix} s \\ \bar{t} \end{pmatrix} =
\begin{pmatrix} A x_{q-1} \\ 0 \end{pmatrix}, \tag{6.7}
\]

which is suitable for AMRES. This system has a smaller condition number than (6.6). Thus we enjoy both the numerical stability of the larger augmented system and the lower computational cost of the smaller system. We summarize this approach in Algorithm 6.3.

Algorithm 6.3  Stable RQI for singular vectors
1: Given initial singular value and right singular vector estimates ρ₀, x₀
2: for q = 1, 2, ... do
3:   Solve for (s, t̄) in
       ( −ρ_{q−1}I  A ; Aᵀ  −ρ_{q−1}I ) ( s ; t̄ ) = ( A x_{q−1} ; 0 )
4:   t = t̄ − x_{q−1}
5:   x_q = t/‖t‖
6:   ρ_q = ‖A x_q‖
7: end for

TEST DATA  We describe a procedure for constructing a linear operator A with known singular values and singular vectors. The method is adapted from [53].

1. For any m ≥ n and C ≥ 1, pick vectors

   y^(i) ∈ R^m, ‖y^(i)‖ = 1 (1 ≤ i ≤ C), and z ∈ R^n, ‖z‖ = 1,

   for constructing Householder reflectors. The parameter C is the number of Householder reflectors used to construct the left singular vectors of A. This allows us to vary the cost of the Av and Aᵀu multiplications for the experiments in Section 6.2.2.

2. Pick a vector d ∈ R^n whose elements are the singular values of A. We have two options for choosing them. In the non-clustered option, adjacent singular values are separated by a constant gap and form an arithmetic sequence. In the clustered option, the smaller singular values cluster near the smallest one and form a geometric sequence:

   d_j = σ_n + (σ₁ − σ_n)(n−j)/(n−1)   (non-clustered),
   d_j = σ_n (σ₁/σ_n)^{(n−j)/(n−1)}    (clustered),

   where σ₁ and σ_n are the largest and smallest singular values of A, so that d₁ = σ₁ and d_n = σ_n.

3. Define Y^(i) = I − 2y^(i)y^(i)ᵀ, Z = Zᵀ = I − 2zzᵀ, D = diag(d), and

   A = ( Π_{i=1}^{C} Y^(i) ) ( D ; 0 ) Z.

This procedure is implemented in MATLAB Code 6.4. In each of the following experiments, we pick a singular value σ that we would like to converge to, and extract the corresponding column v of Z as the singular vector. Then we pick a random unit vector w and a scalar δ to control the size of the perturbation that we want to introduce into v. The approximate right singular vector used as input to RQI-AMRES and RQI-MINRES is then v + δw.
Since RQI is known to always converge to some singular vector, our experiments are meaningful only if δ is quite small.

MATLAB Code 6.4  Generate linear operator with known singular values

    function [Afun, sigma, v] ...
             = getLinearOperator(m,n,sigmaMax,sigmaMin,p,cost,clustered)
    % p: output the p-th largest singular value as sigma
    %    and the corresponding right singular vector as v
    randn('state',1);
    y = randn(m,cost);
    for i = 1:cost
        y(:,i) = y(:,i)/norm(y(:,i));   % normalize every column of y
    end
    z = randn(n,1);  z = z/norm(z);
    if clustered
        d = sigmaMin * (sigmaMax/sigmaMin).^(((n-1):-1:0)'/(n-1));
    else
        d = sigmaMin + (sigmaMax-sigmaMin).*(((n-1):-1:0)'/(n-1));
    end
    ep = zeros(n,1);  ep(p) = 1;
    v = ep - 2*(z'*ep)*z;
    sigma = d(p);                       % p-th largest singular value
    Afun = @A;

        function w = A(x,trans)
            w = x;
            if trans == 1
                w = w - 2*(z'*w)*z;
                w = [d.*w; zeros(m-n,1)];
                for i = 1:cost;  w = w - 2*(y(:,i)'*w)*y(:,i);  end
            elseif trans == 2
                for i = cost:-1:1;  w = w - 2*(y(:,i)'*w)*y(:,i);  end
                w = d.*w(1:n);
                w = w - 2*(z'*w)*z;
            end
        end
    end

NAMING OF ALGORITHMS  By applying different linear-system solvers on line 3 of Algorithms 6.2 and 6.3, we obtain different versions of RQI for singular vector computation. We refer to these versions by the following names:

• RQI-MINRES: apply MINRES to line 3 of Algorithm 6.2.
• RQI-AMRES: apply AMRES to line 3 of Algorithm 6.3.
• MRQI-AMRES:⁴ replace line 3 of Algorithm 6.3 by

  ( −σI  A ; Aᵀ  −σI ) ( s ; x_{q+1} ) = ( u ; 0 ),  (6.8)

  where u is a left singular vector estimate, or u = Av if v is a right singular vector estimate, and apply AMRES to this linear system. Note that a further difference here from Algorithm 6.3 is that u is the same for all q. [footnote 4: Details of MRQI-AMRES are given in Section 6.2.2.]

NUMERICAL RESULTS  We ran both RQI-AMRES and RQI-MINRES to improve given approximate right singular vectors of our constructed linear operators. The accuracy⁵ of the singular value estimate ρ_q generated by both algorithms is plotted against the cumulative number of inner iterations (i.e., the number of Golub-Kahan bidiagonalization steps). [footnote 5: For the iterates ρ_q from outer iterations that we want to converge to σ, we use the relative error |ρ_q − σ|/σ as the accuracy measure.]

When the singular values of A are not clustered, we found that RQI-AMRES and RQI-MINRES exhibit very similar convergence behavior, as shown in Figure 6.2, with RQI-AMRES showing improved accuracy for small singular values. When the singular values of A are clustered, we found that RQI-AMRES converges faster than RQI-MINRES, as shown in Figure 6.3. We have performed the same experiments on larger matrices and obtained similar results.

6.2.2 SPEEDING UP THE INNER ITERATIONS OF RQI

In the previous section, we applied AMRES to RQI to obtain a stable method for refining an approximate singular vector. We now explore a modification of RQI itself that allows AMRES to achieve significant speedup. To illustrate the rationale behind the modification, we first revisit some properties of the AMRES algorithm. AMRES solves linear systems of the form

( γI  A ; Aᵀ  δI ) ( s ; x ) = ( b ; 0 )

using the Golub-Kahan process Bidiag(A, b), which is independent of γ and δ. Thus, if we solve a sequence of linear systems of the above form with different γ, δ but with A and b kept constant, any computational work in Bidiag(A, b) can be cached and reused. In Algorithm 6.3, γ = δ = −ρ_q and b = A x_{q−1}. We now propose a modified version of RQI in which the right-hand side is kept constant at b = A x₀ and only ρ_q is updated at each outer iteration. We summarize this modification in Algorithm 6.4, which we refer to as MRQI-AMRES when AMRES is used to solve the system on line 3.
We now propose a modified version of RQI in which the right-hand side is kept constant at b = Ax0 and only ρq is updated each outer iteration. We summarize this modification in Algorithm 6.4, 101 6.2. RAYLEIGH QUOTIENT ITERATION rqi(500,300,1,1,10,1,12,12,0,12) −2 rqi(500,300,120,1,10,1,12,12,0,12) −2 RQI−AMRES RQI−MINRES RQI−AMRES RQI−MINRES −3 −4 −4 −6 log||(ρk − σ)/σ|| log||(ρk − σ)/σ|| −5 −6 −7 −8 −10 −8 −12 −9 −14 −10 −11 0 50 100 150 Number of Inner Iterations 200 −16 250 0 100 rqi(500,300,280,1,10,1,12,12,0,12) 200 300 400 500 Number of Inner Iterations 600 700 800 rqi(500,300,300,1,10,1,12,16,0,12) 0 10 RQI−AMRES RQI−MINRES RQI−AMRES RQI−MINRES −2 5 −4 log||(ρ − σ)/σ|| −8 0 k log||(ρk − σ)/σ|| −6 −10 −5 −12 −14 −10 −16 −18 0 200 400 600 800 Number of Inner Iterations 1000 1200 1400 −15 0 1000 2000 3000 4000 Number of Inner Iterations 5000 6000 Figure 6.2: Improving an approximate singular value and singular vector for a matrix with non-clustered singular values using RQI-AMRES and RQI-MINRES. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 500, n = 300, C = 1, σ1 = 1, σn = 10−10 and non-clustered singular values. The approximate right singular vector v + δw is formed with δ = 0.1. Upper Left: Computing the largest singular value σ1 . Upper Right: Computing a singular value σ120 near the middle of the spectrum. Lower Left: Computing a singular value σ280 near the low end of the spectrum. Lower Right: Computing the smallest singular value σ300 . Here we see that the iterates ρq generated by RQI-MINRES fail to converge to precisions higher than 10−8 , while RQI-AMRES achieves higher precision with more iterations. 102 CHAPTER 6. AMRES APPLICATIONS rqi(500,300,2,1,10,1,12,12,1,12) −2 rqi(500,300,100,1,10,1,12,12,1,12) 2 RQI−AMRES RQI−MINRES RQI−AMRES RQI−MINRES 0 −4 −2 −6 log||(ρk − σ)/σ|| log||(ρk − σ)/σ|| −4 −8 −10 −6 −8 −10 −12 −12 −14 −16 −14 0 5 10 15 20 25 Number of Inner Iterations 30 35 −16 40 0 2000 4000 6000 8000 Number of Inner Iterations 12000 rqi(500,300,280,1,10,3,16,16,1,8) rqi(500,300,120,1,10,3,12,12,1,8) 6 0 RQI−AMRES RQI−MINRES RQI−AMRES RQI−MINRES −1 10000 5 −2 4 −4 log||(ρk − σ)/σ|| log||(ρk − σ)/σ|| −3 −5 3 2 −6 −7 1 −8 0 −9 −10 0 1000 2000 3000 4000 5000 Number of Inner Iterations 6000 7000 8000 −1 0 0.5 1 1.5 Number of Inner Iterations 2 2.5 3 4 x 10 Figure 6.3: Improving an approximate singular value and singular vector for a matrix with clustered singular values using RQI-AMRES and RQI-MINRES. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 500, n = 300, C = 1, σ1 = 1, σn = 10−10 and clustered singular values. For the two upper plots, the approximate right singular vector v + δw is formed with δ = 0.1. For the two lower plots, δ = 0.001. Upper Left: Computing the second largest singular value σ2 . We see that RQI-AMRES starts to converge faster after the 1st outer iteration. The plot for σ1 is not shown here as σ1 is well separated from the rest of the spectrum, and both algorithms converge in 1 outer iteration, which is a fairly uninteresting plot. Upper Right: Computing a singular value σ100 near the middle of the spectrum. RQI-AMRES converges faster than RQI-MINRES. Lower Left: Computing a singular value σ120 near the low end of the spectrum. Lower Right: Computing a singular value σ280 within the highly clustered region. RQI-AMRES and RQI-MINRES converge but to the wrong σj . 6.2. 
Algorithm 6.4 Modified RQI for singular vectors
1: Given initial singular value and right singular vector estimates ρ_0, x_0
2: for q = 1, 2, . . . do
3:   Solve for t̄ in \begin{pmatrix} -\rho_{q-1} I & A \\ A^T & -\rho_{q-1} I \end{pmatrix} \begin{pmatrix} s \\ \bar t \end{pmatrix} = \begin{pmatrix} Ax_0 \\ 0 \end{pmatrix}
4:   t = t̄ − x_0
5:   x_q = t/‖t‖
6:   ρ_q = ‖Ax_q‖
7: end for

We refer to this method as MRQI-AMRES when AMRES is used to solve the system on line 3.

CACHED GOLUB-KAHAN PROCESS We implement a GolubKahan class (see MATLAB Code 6.5) to represent the process Bidiag(A, b). The constructor is invoked by

gk = GolubKahan(A,b)

with A being a matrix or a MATLAB function handle that gives Ax = A(x,1) and A^Ty = A(y,2). The scalars α_k, β_k and vector v_k can be retrieved by calling

[v_k, alpha_k, beta_k] = gk.getIteration(k)

These values will be retrieved from the cache if they have been computed already. Otherwise, getIteration() computes them directly and caches the results for later use.

REORTHOGONALIZATION In Section 4.4, reorthogonalization was proposed as a method to speed up LSMR. The major obstacle there is the high cost of storing all the v vectors generated by the Golub-Kahan process. For MRQI-AMRES, all such vectors must be stored anyway for reuse in the next MRQI iteration, so the only extra cost of reorthogonalization is the computing cost of modified Gram-Schmidt (MGS). MGS is implemented as in MATLAB Code 6.5. It is turned on when the constructor is called as GolubKahan(A,b,1). (GolubKahan(A,b) and GolubKahan(A,b,0) are the same: both return a GolubKahan object that does not perform reorthogonalization.) We refer to MRQI-AMRES with reorthogonalization as MRQI-AMRES-Ortho in the following plots.

MATLAB Code 6.5 Cached Golub-Kahan process

classdef GolubKahan < handle
  % A cached implementation of Golub-Kahan bidiagonalization
  properties
    V; m; n; A; Afun; u; alphas; betas; reortho;
    tol = 1e-15;
    kCached = 1;      % number of cached iterations
    kAllocated = 1;   % initial size to allocate
  end

  methods
    function o = GolubKahan(A, b, reortho)
      % @param reortho=1 means perform reorthogonalization on o.V
      if nargin < 3 || isempty(reortho), reortho = 0; end
      o.A = A; o.Afun = A; o.reortho = reortho;
      if isa(A, 'numeric'); o.Afun = @o.AfunPrivate; end
      o.betas  = zeros(o.kAllocated,1); o.betas(1) = norm(b);
      o.alphas = zeros(o.kAllocated,1);
      o.m = length(b);
      if o.betas(1) > o.tol
        o.u = b/o.betas(1);
        v = o.Afun(o.u,2); o.n = length(v);
        o.V = zeros(o.n,o.kAllocated);
        o.alphas(1) = norm(v);
        if o.alphas(1) > o.tol; o.V(:,1) = v/o.alphas(1); end
      end
    end

    function [v, alpha, beta] = getIteration(o,k)
      % @return v_k, alpha_k, beta_k for the k-th iteration of Golub-Kahan
      while o.kCached < k
        if o.kCached == o.kAllocated   % double the space for the cache
          o.V      = [o.V zeros(o.n,o.kAllocated)];
          o.alphas = [o.alphas; zeros(o.kAllocated,1)];
          o.betas  = [o.betas;  zeros(o.kAllocated,1)];
          o.kAllocated = o.kAllocated*2;
        end
        % bidiagonalization
        o.u = o.Afun(o.V(:,o.kCached),1) - o.alphas(o.kCached)*o.u;
        beta = norm(o.u); alpha = 0;
        if beta > o.tol
          o.u = o.u/beta;
          v = o.Afun(o.u,2) - beta*o.V(:,o.kCached);
          if o.reortho == 1
            for i = 1:o.kCached; vi = o.V(:,i); v = v - (v'*vi)*vi; end
          end
          alpha = norm(v);
          if alpha > o.tol; v = v/alpha; o.V(:,o.kCached+1) = v; end
        end
        o.kCached = o.kCached + 1;
        o.alphas(o.kCached) = alpha;
        o.betas(o.kCached)  = beta;
      end
      v = o.V(:,k); alpha = o.alphas(k); beta = o.betas(k);
    end

    function out = AfunPrivate(o,x,trans)
      if trans == 1; out = o.A*x;
      else           out = o.A'*x;
      end
    end
  end
end
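As a small usage illustration (ours, not from the thesis), the second loop below retrieves every quantity from the cache, so no further products with A are performed; this is exactly the work that MRQI-AMRES avoids in its later outer iterations:

% Sketch: build a GolubKahan object once and reuse it across two
% simulated outer iterations.
randn('state',1);
A = randn(200,100); b = randn(200,1);
gk = GolubKahan(A, b, 1);                 % reorthogonalization turned on
for k = 1:50
  [v, alpha, beta] = gk.getIteration(k);  % computed and cached
end
for k = 1:50
  [v, alpha, beta] = gk.getIteration(k);  % served from the cache
end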
NUMERICAL RESULTS We ran RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho to improve given approximate right singular vectors of our constructed linear operators. The accuracy of each new singular value estimate ρ_q generated by the three algorithms at each outer iteration (again measured by the relative error |ρ_q − σ|/σ) is plotted against the cumulative time (in seconds) used.

In Figure 6.4, we see that MRQI-AMRES takes more time than RQI-AMRES for the early outer iterations because of the cost of allocating memory and caching the Golub-Kahan vectors. For subsequent iterations, more cached vectors are used and less time is needed for Golub-Kahan bidiagonalization. MRQI-AMRES overtakes RQI-AMRES starting from the third outer iteration. In the same figure, we see that reorthogonalization significantly improves the convergence rate of MRQI-AMRES. With orthogonality preserved, MRQI-AMRES-Ortho converges faster than RQI-AMRES even at the first outer iteration.

In Figure 6.5, we compare the performance of the three algorithms when the linear operator has different multiplication cost. For operators with very low cost, RQI-AMRES converges faster than the cached methods: the extra time used for caching outweighs the benefits of reusing cached results. As the linear operator gets more expensive, MRQI-AMRES and MRQI-AMRES-Ortho outperform RQI-AMRES, as fewer expensive matrix-vector products are needed in the subsequent RQIs.

In Figure 6.6, we compare the performance of the three algorithms for refining singular vectors corresponding to various singular values. For the large singular values (which are well separated from the rest of the spectrum), RQI converges in very few iterations and there is not much benefit from caching. For the smaller (and more clustered) singular values, caching saves the multiplication time in subsequent RQIs, and therefore MRQI-AMRES and MRQI-AMRES-Ortho converge faster than RQI-AMRES. As the singular values become more clustered, the inner solver for the linear system suffers increasingly from loss of orthogonality. In this situation, reorthogonalization greatly reduces the number of Golub-Kahan steps, and therefore MRQI-AMRES-Ortho converges much faster than MRQI-AMRES.

[Figure 6.4 here: one panel, titled rqi(5000,3000,500,10,10,1,12,12,1,12,1), plotting log|(ρ_q − σ)/σ| against time in seconds for RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho.]

Figure 6.4: Improving an approximate singular value and singular vector for a matrix using RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 5000, n = 3000, C = 10, σ_1 = 1, σ_n = 10^{−10} and clustered singular values. We perturbed the right singular vector v corresponding to the singular value σ_500 to be the starting approximate singular vector.
MRQI-AMRES lags behind RQI-AMRES at the early iterations because of the extra cost of allocating memory to store the Golub-Kahan vectors, but it quickly catches up and converges faster when the cached vectors are reused in the subsequent iterations. MRQI-AMRES with reorthogonalization converges much faster than without reorthogonalization. In this case, reorthogonalization greatly reduces the number of inner iterations for AMRES, and the subsequent outer iterations are basically free.

[Figure 6.5 here: five panels, titled rqi(5000,3000,500,C,10,1,12,12,1,12,1) for C = 1, 5, 10, 20, 50, each plotting log|(ρ_q − σ)/σ| against time in seconds for RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho.]

Figure 6.5: Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for linear operators of varying cost. The matrix A is constructed as in Figure 6.4 except for the cost C, which is shown below each plot. C increases from left to right. When C = 1, the time required for caching the Golub-Kahan vectors outweighs the benefits of being able to reuse them later. When the cost C goes up, the benefit of caching Golub-Kahan vectors outweighs the time for saving them. Thus MRQI-AMRES converges faster than RQI-AMRES.

[Figure 6.6 here: five panels, titled rqi(5000,3000,p,10,10,1,12,12,1,12,1) for p = 1, 50, 300, 500, 700, with the same axes as Figure 6.5.]

Figure 6.6: Convergence of RQI-AMRES, MRQI-AMRES and MRQI-AMRES-Ortho for different singular values. The matrix A is constructed as in Figure 6.4. σ_1, σ_50: For the larger singular values, both MRQI-AMRES and MRQI-AMRES-Ortho converge slower than RQI-AMRES, as convergence for RQI is very fast and the extra time for caching the Golub-Kahan vectors outweighs the benefits of saving them. σ_300, σ_500: For the smaller and more clustered singular values, RQI takes more iterations to converge, and the caching effect of MRQI-AMRES and MRQI-AMRES-Ortho makes them converge faster than RQI-AMRES. σ_700: As the singular value gets even smaller and less separated, RQI is less likely to converge to the singular value we want. In this example, RQI-AMRES converged to a different singular value.
Also, clustered singular values lead to greater loss of orthogonality, and therefore MRQI-AMRES-Ortho converges much faster than MRQI-AMRES.

Next, we compare RQI-AMRES with the MATLAB svds function. svds performs singular vector computation by calling eigs on the augmented matrix \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}, and eigs in turn calls the Fortran library ARPACK to perform symmetric Lanczos iterations. svds does not accept a linear operator (function handle) as input. We slightly modified svds to allow an operator to be passed to eigs, which finds the eigenvectors corresponding to the largest eigenvalues for a given linear operator. (eigs can find eigenvalues of a matrix B close to any given λ, but if B is an operator, the shift-and-invert operator (B − λI)^{−1} must be provided.)

Figure 6.7 shows the convergence of RQI-AMRES and svds computing three different singular values near the large end of the spectrum. When only the largest singular value is needed, the algorithms converge at about the same rate. When any other singular value is needed, svds has to compute all singular values (and singular vectors) from the largest down to the one required. Thus svds takes significantly more time compared to RQI-AMRES.

[Figure 6.7 here: three panels, titled rqi(10000,6000,p,1,10,0.5,12,12,1,12,2) for p = 1, 5, 20, each plotting log|(ρ_q − σ)/σ| against time in seconds for RQI-AMRES and svds.]

Figure 6.7: Convergence of RQI-AMRES and svds for the largest singular values. The matrix A is constructed by the procedure of Section 6.2.1 with parameters m = 10000, n = 6000, C = 1, σ_1 = 1, σ_n = 10^{−10} and clustered singular values. The approximate right singular vector v + δw is formed with δ = 10^{−1/2} ≈ 0.316. σ_1: For the largest singular value and corresponding singular vector, RQI-AMRES and svds take similar times. σ_5: For the fifth largest singular value and corresponding singular vector, svds (and hence ARPACK) has to compute all five singular values (from the largest to the fifth largest), while RQI-AMRES computes the fifth largest singular value directly. Therefore, RQI-AMRES takes less time than svds. σ_20: As svds needs to compute the largest 20 singular values, it takes significantly more time than RQI-AMRES.

6.3 SINGULAR VECTOR COMPUTATION

In the previous section, we explored algorithms to refine an approximate singular vector. We now focus on finding a singular vector corresponding to a known singular value.

Suppose σ is a singular value of A. If a nonzero \begin{pmatrix} u \\ v \end{pmatrix} satisfies

\begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix} =
\begin{pmatrix} 0 \\ 0 \end{pmatrix},    (6.9)

then u and v would be the corresponding singular vectors. The inverse iteration approach to finding the null vector is to take a random vector b and solve the system

\begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix}
\begin{pmatrix} s \\ x \end{pmatrix} =
\begin{pmatrix} b \\ 0 \end{pmatrix}.    (6.10)

Whether a direct or iterative solver is used, we expect the solution vectors s and x to be very large, but the normalized vectors u = s/‖s‖, v = x/‖x‖ will satisfy (6.9) accurately and hence be good approximations to the left and right singular vectors of A.

As noted in Choi [13], it is not necessary to run an iterative solver until the solution norm becomes very large. If one could stop the solver appropriately at a least-squares solution, then the residual vector of that solution would be a null vector of the matrix. Here we apply this idea to singular vector computation. For the problem min ‖Ax − b‖, if x is a solution, we know that the residual vector r = b − Ax satisfies A^Tr = 0. Therefore r is a null vector for A^T.
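The following small verification (ours, not from the thesis; it uses a dense matrix with a known SVD and pinv in place of an iterative solver) illustrates the claim: the least-squares residual of the singular system lies in the null space of the augmented matrix, and its normalized blocks satisfy (6.9):

% Sketch: build A with a known singular value sigma, solve the singular
% least-squares problem, and check the residual blocks.
randn('state',1); rand('state',1);
m = 30; n = 20;
[P,Rp] = qr(randn(m)); Uex = P(:,1:n);
[Qv,Rq] = qr(randn(n));
sig = sort(rand(n,1),'descend');
A = Uex*diag(sig)*Qv';              % A = U*diag(sig)*V'
sigma = sig(5);                     % a known singular value
M = [-sigma*eye(m) A; A' -sigma*eye(n)];
b = [randn(m,1); zeros(n,1)];
y = pinv(M)*b;                      % a least-squares solution
r = b - M*y;                        % residual lies in null(M)
u = r(1:m); v = r(m+1:end);
u = u/norm(u); v = v/norm(v);
disp(norm(A*v - sigma*u))           % both ~ machine precision
disp(norm(A'*u - sigma*v))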
Algorithm 6.5 Singular vector computation via residual vector
1: Given matrix A and singular value σ
2: Find a least-squares solution (s, x) of the singular system:
3:   min_{s,x} \left\| \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} \right\|
4: Compute \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix}
5: Output u, v as the left and right singular vectors

Thus, if s and x solve the singular least-squares problem

min_{s,x} \left\| \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix} \right\|,    (6.11)

then the corresponding residual vector

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ x \end{pmatrix}    (6.12)

would satisfy (6.9). This gives us Av = σu and A^Tu = σv as required. Therefore u, v are left and right singular vectors corresponding to σ. We summarize this procedure as Algorithm 6.5.

The key to this algorithm lies in finding the residual of the least-squares system accurately and efficiently. An obvious choice would be to apply MINRES to (6.11) and compute the residual from the solution returned by MINRES. AMRES solves the same problem (6.11) with half the computational cost by computing x only; however, this does not allow us to construct the residual from x. We therefore developed a specialized version of AMRES, called AMRESR, that directly computes the v part of the residual vector (6.12) without computing x or s. The u part can be recovered from v using σu = Av. If σ is a repeated singular value, the same procedure can be followed using a new random b that is first orthogonalized with respect to v.

6.3.1 AMRESR

To derive an iterative update for \hat r_k, we begin with (5.5):

\hat r_k = \hat V_{k+1}\bigl(\beta_1 e_1 - \hat H_k \hat y_k\bigr)
         = \hat V_{k+1} Q_{k+1}^T \left( \begin{pmatrix} z_k \\ \bar\zeta_{k+1} \end{pmatrix} - \begin{pmatrix} R_k \\ 0 \end{pmatrix} \hat y_k \right)
         = \hat V_{k+1} Q_{k+1}^T \bigl(\bar\zeta_{k+1} e_{k+1}\bigr).

By defining p_k = \hat V_{k+1} Q_{k+1}^T e_{k+1}, we continue with

p_k = \hat V_{k+1} Q_{k+1}^T e_{k+1}
    = \hat V_{k+1} \begin{pmatrix} Q_k^T & \\ & 1 \end{pmatrix}
      \begin{pmatrix} I_{k-1} & & \\ & c_k & -s_k \\ & s_k & c_k \end{pmatrix} e_{k+1}
    = -s_k \hat V_k Q_k^T e_k + c_k \hat v_{k+1}
    = -s_k p_{k-1} + c_k \hat v_{k+1}.

With \hat r_k ≡ \begin{pmatrix} r_k^u \\ r_k^v \end{pmatrix} and p_k ≡ \begin{pmatrix} p_k^u \\ p_k^v \end{pmatrix}, we can write an update rule for p_k^v: with p_0^v = 0,

p_k^v = \begin{cases} -s_k p_{k-1}^v + c_k v_{(k+1)/2} & (k \text{ odd}) \\ -s_k p_{k-1}^v & (k \text{ even}), \end{cases}

where the separation into two cases follows from (5.4), and

r_k^v = \bar\zeta_{k+1} p_k^v    (6.13)

can be used to output r_k^v when AMRESR terminates. With the above recurrence relations for p_k^v, we summarize AMRESR in Algorithm 6.6.

Algorithm 6.6 Algorithm AMRESR
1: (Initialize)
   β_1 u_1 = b,  α_1 v_1 = A^T u_1,  ζ̄_1 = β_1,  ρ̄_1 = γ,  θ̄_1 = α_1,  p_0^v = 0
2: for l = 1, 2, 3, . . . do
3:   (Continue the bidiagonalization)
     β_{l+1} u_{l+1} = A v_l − α_l u_l
     α_{l+1} v_{l+1} = A^T u_{l+1} − β_{l+1} v_l
4:   for k = 2l − 1, 2l do
5:     (Set up temporary variables)
       λ = δ,  α = α_l,  β = β_{l+1}    (k odd)
       λ = γ,  α = β_{l+1},  β = α_{l+1}    (k even)
6:     (Construct and apply rotation Q_{k+1,k})
       ρ_k = (ρ̄_k^2 + α^2)^{1/2}
       c_k = ρ̄_k/ρ_k        s_k = α/ρ_k
       θ_k = c_k θ̄_k + s_k λ    ρ̄_{k+1} = −s_k θ̄_k + c_k λ
       η_k = s_k β          θ̄_{k+1} = c_k β
       ζ_k = c_k ζ̄_k        ζ̄_{k+1} = −s_k ζ̄_k
7:     (Update estimate of p)
       p_k^v = −s_k p_{k−1}^v + c_k v_{(k+1)/2}    (k odd)
       p_k^v = −s_k p_{k−1}^v    (k even)
8:   end for
9: end for
10: (Compute r for output)
    r_k^v = ζ̄_{k+1} p_k^v

6.3.2 AMRESR EXPERIMENTS

We compare the convergence of AMRESR versus MINRES for finding singular vectors corresponding to a known singular value.

MATLAB Code 6.6 Error measure for singular vector

function err = singularVectorError(v)
  % Afun and sigma are assumed to be in scope, e.g. as variables
  % of an enclosing experiment function.
  v  = v/norm(v);
  Av = Afun(v,1);                    % Afun(v,1) = A*v
  u  = Av/norm(Av);
  err = norm(Afun(u,2) - sigma*v);   % Afun(u,2) = A'*u
end

TEST DATA We use the method of Section 6.2.1 to construct a linear operator.

METHODS We compare the following methods for computing a singular vector corresponding to a known singular value σ, where σ is taken as one of the d_i above.
We first generate a random m-vector b using randn('state',1) and b = randn(m,1). Then we apply one of the following methods:

1. Run AMRESR on (6.10). For each k in Algorithm 6.6, compute r_k^v by (6.13) at each iteration and take v_k = r_k^v as the estimate of the right singular vector corresponding to σ.

2. Inverse iteration: run MINRES on (A^TA − σ^2 I)x = A^Tb. For the l-th MINRES iteration (l = 1, 2, 3, . . . ), set k = 2l and take v_k = x_l as the estimate of the right singular vector. (This makes k the number of matrix-vector products for each method.)

MEASURE OF CONVERGENCE For the v_k from each method, we measure convergence by ‖A^Tu_k − σv_k‖ (with u_k = Av_k/‖Av_k‖), as shown in MATLAB Code 6.6. This measure requires two matrix-vector products, which is the same complexity as each Golub-Kahan step. Thus it is too expensive to be used as a practical stopping rule. It is used solely for the plots to analyze the convergence behavior. In practice, the stopping rule would be based on the condition estimate (5.16).

NUMERICAL RESULTS We ran the experiments described above on rectangular matrices of size 24000 × 18000. Our observations are as follows:

1. AMRESR always converges faster than MINRES. AMRESR reaches a minimum error of about 10^{−8} for all test examples, and then starts to diverge. In contrast, MINRES is able to converge to singular vectors of higher precision than AMRESR. Thus, AMRESR is more suitable for applications where high precision of the singular vectors is not required.

2. The gain in convergence speed depends on the position of the singular value within the spectrum. From Figure 6.8, there is more improvement for singular vectors near either end of the spectrum, and less improvement for those in the middle of the spectrum.

6.4 ALMOST SINGULAR SYSTEMS

Björck [6] has derived an algorithm based on Rayleigh-quotient iteration for computing (ordinary) TLS solutions. The algorithm involves repeated solution of positive definite systems (A^TA − σ^2 I)x = A^Tb by means of a modified/extended version of CGLS (called CGTLS) that is able to incorporate the shift. A key step in CGTLS is computation of δ_k = ‖p_k‖_2^2 − σ^2 ‖q_k‖_2^2. Clearly we must have σ < σ_min(A) for the system to be positive definite. However, this condition cannot be guaranteed, and a heuristic is adopted to repeat the computation with a smaller value of σ [6, (4.5)]. Another drawback of CGTLS is that it depends on a complete Cholesky factor of A^TA. This is computed by a sparse QR of A, which is not always practical when A is large. Since AMRES handles indefinite systems and does not depend on a sparse factorization, it is applicable in more situations.
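A small illustration (ours, not from the thesis) of why the positive-definiteness requirement is restrictive: as soon as σ exceeds σ_min(A), the shifted matrix A^TA − σ²I is indefinite, which is fatal for a CG-type method such as CGTLS but not for a MINRES-based method such as AMRES:

% Sketch: for sigma > sigma_min(A), A'A - sigma^2 I is indefinite.
randn('state',1);
m = 40; n = 25;
A = randn(m,n);
s = svd(A);
sigma = 1.5*s(end);        % a shift above sigma_min(A)
B = A'*A - sigma^2*eye(n);
e = eig(B);
fprintf('min eig = %g, max eig = %g\n', min(e), max(e));
% min(e) < 0 < max(e): B is indefinite, so CG-based CGTLS may break down.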
[Figure 6.8 here: four panels, titled amresrPlots(24000,18000,p,...) for p = 1, 100, 9000, 18000, each plotting ‖A^Tu − σv‖ against the number of products A^Tu and Av, for AMRESR, AMRESR continued past its stopping point, and MINRES.]

Figure 6.8: Convergence of AMRESR and MINRES for computing singular vectors corresponding to a known singular value as described in Section 6.3.2 for a 24000 × 18000 matrix. The blue (solid) curve represents where AMRESR would have stopped if there were a suitable limit on the condition number. The green (dash) curve shows how AMRESR will diverge if it continues beyond this stopping criterion. The red (dash-dot) curve represents the convergence behavior of MINRES. The linear operator is constructed using the method of Section 6.2.1. Upper Left: Computing the singular vector corresponding to the largest singular value σ_1. Upper Right: Computing the singular vector corresponding to the 100th largest singular value σ_100. Lower Left: Computing the singular vector corresponding to a singular value in the middle of the spectrum, σ_9000. Lower Right: Computing the singular vector corresponding to the smallest singular value, σ_18000.

7 CONCLUSIONS AND FUTURE DIRECTIONS

7.1 CONTRIBUTIONS

The main contributions of this thesis involve three areas: providing a better understanding of the popular MINRES algorithm; development and analysis of LSMR for least-squares problems; and development and applications of AMRES for the negatively-damped least-squares problem. We summarize our findings for each of these areas in the next three sections. Chapters 1 to 4 discussed a number of iterative solvers for linear equations. The flowchart in Figure 7.1 helps decide which methods to use for a particular problem.

7.1.1 MINRES

In Chapter 2, we proved a number of properties for MINRES applied to a positive definite system. Table 7.1 compares these properties with known ones for CG. MINRES has a number of monotonic properties that can make it more favorable, especially when the iterative algorithm needs to be terminated early. In addition, our experimental results show that MINRES converges faster than CG in terms of backward error, often by as much as 2 orders of magnitude (Figure 2.2). On the other hand, CG converges somewhat faster than MINRES in terms of both ‖x_k − x_∗‖_A and ‖x_k − x_∗‖ (same figure).

Table 7.1 Comparison of CG and MINRES properties on an spd system

                  CG                    MINRES
‖x_k‖            ր [68, Thm 2.1]       ր (Thm 2.1.6)
‖x_∗ − x_k‖      ց [38, Thm 4:3]       ց (Thm 2.1.7) [38, Thm 7:5]
‖x_∗ − x_k‖_A    ց [38, Thm 6:3]       ց (Thm 2.1.8) [38, Thm 7:4]
‖r_k‖            not monotonic         ց [52] [38, Thm 7:2]
‖r_k‖/‖x_k‖      not monotonic         ց (Thm 2.2.1)

ր monotonically increasing    ց monotonically decreasing

[Figure 7.1 here: a flowchart for choosing an iterative solver for Ax = b, branching on whether A is square, symmetric, positive definite, or singular, whether A is tall or fat, whether A^T is available, whether the data is noisy, whether min ‖x‖ is needed, and whether ‖x_k − x_∗‖, ‖x_k − x_∗‖_A, ‖r_k‖, or ‖r_k‖/‖x_k‖ is to be minimized; the leaves are CG, MINRES, MINRES-QLP, GMRES, BiCG, QMR, BiCGStab, IDR(s), LSQR, and LSMR.]

Figure 7.1: Flowchart on choosing iterative solvers

Table 7.2 Comparison of LSQR and LSMR properties

                  LSQR              LSMR
‖x_k‖            ր (Thm 3.3.1)     ր (Thm 3.3.6)
‖x_k − x_∗‖      ց (Thm 3.3.2)     ց (Thm 3.3.7)
‖A^Tr_k‖         not monotonic     ց (Thm 3.3.5)
‖r_k‖            ց (Thm 3.3.4)     ց (Thm 3.3.11)
x_k converges to the minimum-norm x_∗ for singular systems
‖E_2^{LSQR}‖ ≥ ‖E_2^{LSMR}‖

ր monotonically increasing    ց monotonically decreasing

7.1.2 LSMR

We presented LSMR, an iterative algorithm for square or rectangular systems, along with details of its implementation, theoretical properties, and experimental results suggesting that it has advantages over the widely adopted LSQR algorithm.
As in LSQR, theoretical and practical stopping criteria are provided for solving Ax = b and min ‖Ax − b‖ with optional Tikhonov regularization. Formulae for computing ‖r_k‖, ‖A^Tr_k‖, ‖x_k‖ and for estimating ‖A‖ and cond(A) in O(1) work per iteration are available. For least-squares problems, the Stewart backward error estimate ‖E_2‖ (4.4) is computable in O(1) work and is proved to be at least as small as that of LSQR. In practice, ‖E_2‖ for LSMR is smaller than that of LSQR by 1 to 2 orders of magnitude, depending on the condition of the given system. This often allows LSMR to terminate significantly sooner than LSQR. In addition, ‖E_2‖ seems experimentally to be very close to the optimal backward error µ(x_k) at each LSMR iterate x_k (Section 4.1.1). This allows LSMR to be stopped reliably using rules based on ‖E_2‖.

MATLAB, Python, and Fortran 90 implementations of LSMR are available from [42]. They all allow local reorthogonalization of V_k.

7.1.3 AMRES

We developed AMRES, an iterative algorithm for solving the augmented system

\begin{pmatrix} \gamma I & A \\ A^T & \delta I \end{pmatrix}
\begin{pmatrix} s \\ x \end{pmatrix} =
\begin{pmatrix} b \\ 0 \end{pmatrix},

where γδ > 0. It is equivalent to MINRES on the same system and is reliable even when the augmented matrix is indefinite or singular. It is based on the Golub-Kahan bidiagonalization of A and requires half the computational cost of MINRES. AMRES is applicable to Curtis-Reid scaling, improving approximate singular vectors, and computing singular vectors from known singular values.

For Curtis-Reid scaling, AMRES sometimes exhibits faster convergence compared to the specialized version of CG proposed by Curtis and Reid. However, both algorithms exhibit similar performance on many test matrices.

Using AMRES as the solver for inner iterations, we have developed Rayleigh quotient-based algorithms (RQI-AMRES, MRQI-AMRES, MRQI-AMRES-Ortho) that refine a given approximate singular vector to an accurate one. They converge faster than the MATLAB svds function for singular vectors corresponding to interior singular values. When the singular values are clustered, or the linear operator A is expensive, MRQI-AMRES and MRQI-AMRES-Ortho are more stable and converge much faster than RQI-AMRES or direct use of MINRES within RQI.

For a given singular value, we developed AMRESR, a modified version of AMRES, to compute the corresponding singular vectors to a precision of O(√ε). Our algorithm converges faster than inverse iteration. If singular vectors of higher precision are needed, inverse iteration is preferred.

7.2 FUTURE DIRECTIONS

Several ideas are highly related to the research in this thesis, but their potential has not been fully explored yet. We summarize these directions below as pointers for future research.

7.2.1 CONJECTURE

From our experiments, we conjecture that the optimal backward error µ(x_k) (4.1) and its approximation µ̃(x_k) (4.5) decrease monotonically for LSMR.

7.2.2 PARTIAL REORTHOGONALIZATION

Larsen [58] uses partial reorthogonalization of both V_k and U_k within his PROPACK software for computing a set of singular values and vectors for a sparse rectangular matrix A. This involves a smaller computational cost compared to full reorthogonalization, as each v_k or u_k does not need to perform an inner product with each of the previous v_i's or u_i's. Similar techniques might prove helpful within LSMR and AMRES.
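A minimal sketch (ours; the window size and storage layout are assumptions, not PROPACK's actual scheme) of the kind of local reorthogonalization mentioned above, where each new v is reorthogonalized only against a sliding window of recent vectors:

function v = localReorth(v, Vrecent)
% Sketch of local reorthogonalization: Vrecent holds the last few
% (say p) Golub-Kahan vectors as columns; v is reorthogonalized
% against these only, bounding both storage and work per iteration.
for i = 1:size(Vrecent,2)
  vi = Vrecent(:,i);
  v = v - (vi'*v)*vi;      % one modified Gram-Schmidt step
end
v = v/norm(v);
end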
7.2.3 EFFICIENT OPTIMAL BACKWARD ERROR ESTIMATES

Both LSQR and LSMR stop when ‖E_2‖, an upper bound for the optimal backward error that is computable in O(1) work per iteration, is sufficiently small. Ideally, an iterative algorithm for least squares should use the optimal backward error µ(x_k) itself in the stopping rule. However, direct computation using (4.2) is more expensive than solving the least-squares problem itself. An accurate approximation is given in (4.5). This cheaper estimate involves solving a least-squares problem at each iteration of any iterative solver, and thus is still not practical for use as a stopping rule. It would be of great interest if a provably accurate estimate of the optimal backward error could be found in O(n) work at each iteration, as it would allow iterative solvers to stop precisely at the iteration where the desired accuracy has been attained. More precise stopping rules have been derived recently by Arioli and Gratton [2] and Titley-Péloquin et al. [11; 50; 75]. The rules allow for uncertainty in both A and b, and may prove to be useful for LSQR, LSMR, and least-squares methods in general.

7.2.4 SYMMLQ-BASED LEAST-SQUARES SOLVER

LSMR is derived to be mathematically equivalent to MINRES on the normal equation, and thus LSMR minimizes ‖A^Tr_k‖ for x_k in the Krylov subspace K_k(A^TA, A^Tb). For a symmetric system Ax = b, SYMMLQ [52] solves min_{x_k ∈ A·K_k(A,b)} ‖x_k − x_∗‖ at the k-th iteration [47], [27, p. 65]. Thus if we derive an algorithm for the least-squares problem that is mathematically equivalent to SYMMLQ on the normal equation but uses Golub-Kahan bidiagonalization, this new algorithm will minimize ‖x_k − x_∗‖ for x_k in the Krylov subspace A^TA·K_k(A^TA, A^Tb). This may produce smaller errors compared to LSQR or LSMR, and may be desirable in applications where the algorithm has to be terminated early with a smaller error (rather than a smaller backward error).

BIBLIOGRAPHY

[1] M. Arioli. A stopping criterion for the conjugate gradient algorithm in a finite element method framework. Numerische Mathematik, 97(1):1–24, 2004. doi:10.1007/s00211-003-0500-y.
[2] M. Arioli and S. Gratton. Least-squares problems, normal equations, and stopping criteria for the conjugate gradient method. Technical Report RAL-TR-2008-008, Rutherford Appleton Laboratory, Oxfordshire, UK, 2008.
[3] M. Arioli, I. Duff, and D. Ruiz. Stopping criteria for iterative solvers. SIAM J. Matrix Anal. Appl., 13:138, 1992. doi:10.1137/0613012.
[4] M. Arioli, E. Noulard, and A. Russo. Stopping criteria for iterative methods: applications to PDE's. Calcolo, 38(2):97–112, 2001. doi:10.1007/s100920170006.
[5] S. J. Benbow. Solving generalized least-squares problems with LSQR. SIAM J. Matrix Anal. Appl., 21(1):166–177, 1999. doi:10.1137/S0895479897321830.
[6] Å. Björck, P. Heggernes, and P. Matstoms. Methods for large scale total least squares problems. SIAM J. Matrix Anal. Appl., 22(2), 2000. doi:10.1137/S0895479899355414.
[7] F. Cajori. A History of Mathematics. The MacMillan Company, New York, second edition, 1919.
[8] D. Calvetti, B. Lewis, and L. Reichel. An L-curve for the MINRES method. In Franklin T. Luk, editor, Advanced Signal Processing Algorithms, Architectures, and Implementations X, volume 4116, pages 385–395. SPIE, 2000. doi:10.1117/12.406517.
[9] D. Calvetti, S. Morigi, L. Reichel, and F. Sgallari. Computable error bounds and estimates for the conjugate gradient method. Numerical Algorithms, 25:75–88, 2000. doi:10.1023/A:1016661024093.
[10] F. Chaitin-Chatelin and V. Frayssé. Lectures on Finite Precision Computations. SIAM, Philadelphia, 1996. doi:10.1137/1.9780898719673.
[11] X.-W. Chang, C. C. Paige, and D. Titley-Péloquin. Stopping criteria for the iterative solution of linear least squares problems. SIAM J. Matrix Anal. Appl., 31(2):831–852, 2009. doi:10.1137/080724071.
[12] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Trans. Math. Softw., 35:22:1–22:14, October 2008. doi:10.1145/1391989.1391995.
[13] S.-C. Choi. Iterative Methods for Singular Linear Equations and Least-Squares Problems. PhD thesis, Stanford University, Stanford, CA, 2006.
[14] S.-C. Choi, C. C. Paige, and M. A. Saunders. MINRES-QLP: a Krylov subspace method for indefinite or singular symmetric systems. SIAM J. Sci. Comput., 33(4):00–00, 2011. doi:10.1137/100787921.
[15] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-region Methods, volume 1. SIAM, Philadelphia, 2000. doi:10.1137/1.9780898719857.
[16] G. Cramer. Introduction à l'Analyse des Lignes Courbes Algébriques. 1750.
[17] A. R. Curtis and J. K. Reid. On the automatic scaling of matrices for Gaussian elimination. J. Inst. Maths. Applics., 10:118–124, 1972. doi:10.1093/imamat/10.1.118.
[18] T. A. Davis. University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices.
[19] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK Users' Guide. SIAM, Philadelphia, 1979. doi:10.1137/1.9781611971811.
[20] F. A. Dul. MINRES and MINERR are better than SYMMLQ in eigenpair computations. SIAM J. Sci. Comput., 19(6):1767–1782, 1998. doi:10.1137/S106482759528226X.
[21] J. M. Elble and N. V. Sahinidis. Scaling linear optimization problems prior to application of the simplex method. Computational Optimization and Applications, pages 1–27, 2011. doi:10.1007/s10589-011-9420-4.
[22] D. K. Faddeev and V. N. Faddeeva. Computational Methods of Linear Algebra. Freeman, London, 1963. doi:10.1007/BF01086544.
[23] R. Fletcher. Conjugate gradient methods for indefinite systems. In G. Watson, editor, Numerical Analysis, volume 506 of Lecture Notes in Mathematics, pages 73–89. Springer, Berlin / Heidelberg, 1976. doi:10.1007/BFb0080116.
[24] D. C.-L. Fong and M. A. Saunders. LSMR: An iterative algorithm for sparse least-squares problems. SIAM J. Sci. Comput., 33:2950–2971, 2011. doi:10.1137/10079687X.
[25] V. Frayssé. The power of backward error analysis. Technical Report TH/PA/00/65, CERFACS, 2000.
[26] R. W. Freund and N. M. Nachtigal. QMR: a quasi-minimal residual method for non-Hermitian linear systems. Numerische Mathematik, 60:315–339, 1991. doi:10.1007/BF01385726.
[27] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. Acta Numerica, 1:57–100, 1992. doi:10.1017/S0962492900002245.
[28] D. R. Fulkerson and P. Wolfe. An algorithm for scaling matrices. SIAM Review, 4(2):142–146, 1962. doi:10.1137/1004032.
[29] G. H. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. J. of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965. doi:10.1137/0702016.
[30] G. H. Golub and G. A. Meurant. Matrices, moments and quadrature II: How to compute the norm of the error in iterative methods. BIT Numerical Mathematics, 37:687–705, 1997. doi:10.1007/BF02510247.
[31] G. H. Golub and C. F. Van Loan. Matrix Computations.
The Johns Hopkins University Press, Baltimore, third edition, 1996.
[32] S. Gratton, P. Jiránek, and D. Titley-Péloquin. On the accuracy of the Karlson-Waldén estimate of the backward error for linear least squares problems. Submitted for publication (SIMAX), 2011.
[33] J. F. Grcar. Mathematicians of Gaussian elimination. Notices of the AMS, 58(6):782–792, 2011.
[34] J. F. Grcar, M. A. Saunders, and Z. Su. Estimates of optimal backward perturbations for linear least squares problems. Report SOL 2007-1, Department of Management Science and Engineering, Stanford University, Stanford, CA, 2007. 21 pp.
[35] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997. doi:10.1137/1.9781611970937.
[36] R. W. Hamming. Introduction to Applied Numerical Analysis. McGraw-Hill computer science series. McGraw-Hill, 1971.
[37] K. Hayami, J.-F. Yin, and T. Ito. GMRES methods for least squares problems. SIAM J. Matrix Anal. Appl., 31(5):2400–2430, 2010. doi:10.1137/070696313.
[38] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
[39] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, second edition, 2002. doi:10.1137/1.9780898718027.
[40] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Nat. Bur. Standards, 45:255–282, 1950.
[41] L. Ljung. System Identification: Theory for the User. Prentice Hall, second edition, January 1999.
[42] LSMR. Software for linear systems and least squares. http://www.stanford.edu/group/SOL/software.html.
[43] D. G. Luenberger. Hyperbolic pairs in the method of conjugate gradients. SIAM J. Appl. Math., 17:1263–1267, 1969. doi:10.1137/0117118.
[44] D. G. Luenberger. The conjugate residual method for constrained minimization problems. SIAM J. Numer. Anal., 7(3):390–398, 1970. doi:10.1137/0707032.
[45] W. Menke. Geophysical Data Analysis: Discrete Inverse Theory, volume 45. Academic Press, San Diego, 1989.
[46] G. A. Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, volume 19 of Software, Environments, and Tools. SIAM, Philadelphia, 2006.
[47] J. Modersitzki, H. A. Van der Vorst, and G. L. G. Sleijpen. Differences in the effects of rounding errors in Krylov solvers for symmetric indefinite linear systems. SIAM J. Matrix Anal. Appl., 22(3):726–751, 2001. doi:10.1137/S0895479897323087.
[48] W. Oettli and W. Prager. Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides. Numerische Mathematik, 6:405–409, 1964. doi:10.1007/BF01386090.
[49] A. M. Ostrowski. On the convergence of the Rayleigh quotient iteration for the computation of the characteristic roots and vectors. I. Archive for Rational Mechanics and Analysis, 1(1):233–241, 1957. doi:10.1007/BF00298007.
[50] P. Jiránek and D. Titley-Péloquin. Estimating the backward error in LSQR. SIAM J. Matrix Anal. Appl., 31(4):2055–2074, 2010. doi:10.1137/090770655.
[51] C. C. Paige. Bidiagonalization of matrices and solution of linear equations. SIAM J. Numer. Anal., 11(1):197–209, 1974. doi:10.1137/0711019.
[52] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12(4):617–629, 1975. doi:10.1137/0712047.
[53] C. C. Paige and M. A. Saunders.
LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8(1):43–71, 1982. doi:10.1145/355984.355989.
[54] C. C. Paige and M. A. Saunders. Algorithm 583; LSQR: Sparse linear equations and least-squares problems. ACM Trans. Math. Softw., 8(2):195–209, 1982. doi:10.1145/355993.356000.
[55] C. C. Paige, M. Rozloznik, and Z. Strakos. Modified Gram-Schmidt (MGS), least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl., 28(1):264–284, 2006. doi:10.1137/050630416.
[56] B. N. Parlett and W. Kahan. On the convergence of a practical QR algorithm. Information Processing, 68:114–118, 1969.
[57] V. Pereyra, 2010. Private communication.
[58] PROPACK. Software for SVD of sparse matrices. http://soi.stanford.edu/~rmunk/PROPACK/.
[59] B. J. W. S. Rayleigh. The Theory of Sound, volume 2. Macmillan, 1896.
[60] J. K. Reid. The use of conjugate gradients for systems of linear equations possessing "Property A". SIAM J. Numer. Anal., 9(2):325–332, 1972. doi:10.1137/0709032.
[61] J. L. Rigal and J. Gaches. On the compatibility of a given solution with the data of a linear system. J. ACM, 14(3):543–548, 1967. doi:10.1145/321406.321416.
[62] Y. Saad. Krylov subspace methods for solving large unsymmetric linear systems. Mathematics of Computation, 37(155):105–126, 1981. doi:10.2307/2007504.
[63] Y. Saad and M. H. Schultz. GMRES: a generalized minimum residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. and Statist. Comput., 7(3):856–869, 1986. doi:10.1137/0907058.
[64] K. Shen, J. N. Crossley, A. W. C. Lun, and H. Liu. The Nine Chapters on the Mathematical Art: Companion and Commentary. Oxford University Press, New York, 1999.
[65] H. D. Simon and H. Zha. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Comput., 21(6):2257–2274, 2000. doi:10.1137/S1064827597327309.
[66] P. Sonneveld. CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 10:36–52, 1989. doi:10.1137/0910004.
[67] P. Sonneveld and M. B. van Gijzen. IDR(s): A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations. SIAM J. Sci. Comput., 31:1035–1062, 2008. doi:10.1137/070685804.
[68] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal., 20(3):626–637, 1983. doi:10.1137/0720042.
[69] G. W. Stewart. An inverse perturbation theorem for the linear least squares problem. SIGNUM Newsletter, 10:39–40, 1975. doi:10.1145/1053197.1053199.
[70] G. W. Stewart. Research, development and LINPACK. In J. R. Rice, editor, Mathematical Software III, pages 1–14. Academic Press, New York, 1977.
[71] G. W. Stewart. The QLP approximation to the singular value decomposition. SIAM J. Sci. Comput., 20(4):1336–1348, 1999. doi:10.1137/S1064827597319519.
[72] E. Stiefel. Relaxationsmethoden bester Strategie zur Lösung linearer Gleichungssysteme. Comm. Math. Helv., 29:157–179, 1955. doi:10.1007/BF02564277.
[73] Z. Su. Computational Methods for Least Squares Problems and Clinical Trials. PhD thesis, SCCM, Stanford University, Stanford, CA, 2005.
[74] Y. Sun. The Filter Algorithm for Solving Large-Scale Eigenproblems from Accelerator Simulations. PhD thesis, Stanford University, Stanford, CA, 2003.
[75] D. Titley-Péloquin. Backward Perturbation Analysis of Least Squares Problems. PhD thesis, School of Computer Science, McGill University, Montreal, PQ, 2010.
[76] D. Titley-Péloquin. Convergence of backward error bounds in LSQR and LSMR. Private communication, 2011.
[77] J. A. Tomlin. On scaling linear programming problems. In Computational Practice in Mathematical Programming, volume 4 of Mathematical Programming Studies, pages 146–166. Springer, Berlin / Heidelberg, 1975. doi:10.1007/BFb0120718.
[78] H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Comput., 13(2):631–644, 1992. doi:10.1137/0913035.
[79] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems. Cambridge University Press, Cambridge, first edition, 2003. doi:10.1017/CBO9780511615115.
[80] S. J. van der Walt. Super-resolution Imaging. PhD thesis, Stellenbosch University, Cape Town, South Africa, 2010.
[81] M. B. van Gijzen. IDR(s) website. http://ta.twi.tudelft.nl/nw/users/gijzen/IDR.html.
[82] M. B. van Gijzen. Private communication, Dec 2010.
[83] B. Waldén, R. Karlson, and J.-G. Sun. Optimal backward perturbation bounds for the linear least squares problem. Numerical Linear Algebra with Applications, 2(3):271–286, 1995. doi:10.1002/nla.1680020308.
[84] D. S. Watkins. Fundamentals of Matrix Computations. Pure and Applied Mathematics. Wiley, Hoboken, NJ, third edition, 2010.
[85] P. Wesseling and P. Sonneveld. Numerical experiments with a multiple grid and a preconditioned Lanczos type method. In Reimund Rautmann, editor, Approximation Methods for Navier-Stokes Problems, volume 771 of Lecture Notes in Mathematics, pages 543–562. Springer, Berlin / Heidelberg, 1980. doi:10.1007/BFb0086930.
[86] F. Xue and H. C. Elman. Convergence analysis of iterative solvers in inexact Rayleigh quotient iteration. SIAM J. Matrix Anal. Appl., 31(3):877–899, 2009. doi:10.1137/080712908.
[87] D. Young. Iterative methods for solving partial difference equations of elliptic type. Trans. Amer. Math. Soc., 76(92):111, 1954. doi:10.2307/1990745.
[88] D. M. Young. Iterative Methods for Solving Partial Difference Equations of Elliptic Type. PhD thesis, Harvard University, Cambridge, MA, 1950.


