Introduction to Numerical Analysis

Numerical analysis is an increasingly important link between pure mathematics and its application in science and technology. This textbook provides an introduction to the justification and development of constructive methods that provide sufficiently accurate approximations to the solution of numerical problems, and the analysis of the influence that errors in data, finite-precision calculations, and approximation formulas have on results, problem formulation, and the choice of method. It also serves as an introduction to scientific programming in MATLAB, including many simple and difficult, theoretical and computational exercises. A unique feature of this book is the consistent development of interval analysis as a tool for rigorous computation and computer-assisted proofs, alongside the traditional material.

Arnold Neumaier is a Professor of Computational Mathematics at the University of Vienna. He has written two books and published numerous articles on the subjects of combinatorics, interval analysis, optimization, and statistics. He has held teaching positions at the University of Freiburg, Germany, and the University of Wisconsin, Madison, and he has worked on the technical staff at AT&T Bell Laboratories. In addition, Professor Neumaier maintains extensive Web pages on public domain software for numerical analysis, optimization, and statistics.

Introduction to Numerical Analysis

ARNOLD NEUMAIER
University of Vienna

CAMBRIDGE UNIVERSITY PRESS

PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, VIC 3166, Australia
Ruiz de Alarcon 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org

(c) Cambridge University Press 2001

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2001
Printed in the United Kingdom at the University Press, Cambridge
System LaTeX2e    Typeface Times Roman 10/13 pt. [TB]

A catalog record for this book is available from the British Library.

Library of Congress Cataloging in Publication Data
Neumaier, A.
Introduction to numerical analysis / Arnold Neumaier.
p. cm.
Includes bibliographical references and index.
ISBN 0-521-33323-7 - ISBN 0-521-33610-4 (pb.)
1. Numerical analysis. I. Title.
QA297 .N48 2001
519.4 - dc21    00-066713

ISBN 0 521 33323 7 hardback
ISBN 0 521 33610 4 paperback

Contents

Preface    page vii
1. The Numerical Evaluation of Expressions    1
   1.1 Arithmetic Expressions and Automatic Differentiation    2
   1.2 Numbers, Operations, and Elementary Functions    14
   1.3 Numerical Stability    23
   1.4 Error Propagation and Condition    33
   1.5 Interval Arithmetic    38
   1.6 Exercises    53
2. Linear Systems of Equations    61
   2.1 Gaussian Elimination    63
   2.2 Variations on a Theme    73
   2.3 Rounding Errors, Equilibration, and Pivot Search    82
   2.4 Vector and Matrix Norms    94
   2.5 Condition Numbers and Data Perturbations    99
   2.6 Iterative Refinement    106
   2.7 Error Bounds for Solutions of Linear Systems    112
   2.8 Exercises    119
3. Interpolation and Numerical Differentiation    130
   3.1 Interpolation by Polynomials    131
   3.2 Extrapolation and Numerical Differentiation    145
   3.3 Cubic Splines    153
   3.4 Approximation by Splines    165
   3.5 Radial Basis Functions    170
   3.6 Exercises    182
4. Numerical Integration    179
   4.1 The Accuracy of Quadrature Formulas    179
   4.2 Gaussian Quadrature Formulas    187
   4.3 The Trapezoidal Rule    196
   4.4 Adaptive Integration    203
   4.5 Solving Ordinary Differential Equations    210
   4.6 Step Size and Order Control    219
   4.7 Exercises    225
5. Univariate Nonlinear Equations    233
   5.1 The Secant Method    234
   5.2 Bisection Methods    241
   5.3 Spectral Bisection Methods for Eigenvalues    250
   5.4 Convergence Order    256
   5.5 Error Analysis    265
   5.6 Complex Zeros    273
   5.7 Methods Using Derivative Information    281
   5.8 Exercises    293
6. Systems of Nonlinear Equations    301
   6.1 Preliminaries    302
   6.2 Newton's Method and Its Variants    311
   6.3 Error Analysis    322
   6.4 Further Techniques for Nonlinear Systems    329
   6.5 Exercises    340
References    345
Index    351

Preface

They ... explained it so that the people could understand it.
Good News Bible, Nehemiah 8:8

Since the introduction of the computer, numerical analysis has developed into an increasingly important connecting link between pure mathematics and its application in science and technology. Its independence as a mathematical discipline depends, above all, on two things: the justification and development of constructive methods that provide sufficiently accurate approximations to the solution of problems, and the analysis of the influence that errors in data, finite-precision calculations, and approximation formulas have on results, problem formulation, and the choice of method. This book provides an introduction to these themes.

A novel feature of this book is the consistent development of interval analysis as a tool for rigorous computation and computer-assisted proofs. Apart from this, most of the material treated can be found in typical textbooks on numerical analysis; but even then, the proofs may be shorter and the perspective different from those found elsewhere. Some of the material on nonlinear equations presented here previously appeared only in specialized books or in journal articles.

Readers are expected to have a background knowledge of matrix algebra and calculus of several real variables, and to know just enough about topological concepts to understand that sequences in a compact subset of $\mathbb{R}^n$ have a convergent subsequence. In a few places, elements of complex analysis are used.

The book is based on course lectures in numerical analysis that the author gave repeatedly at the University of Freiburg (Germany) and the University of Vienna (Austria). Many simple and difficult theoretical and computational exercises help the reader gain experience and deepen the understanding of the techniques presented. The material is a little more than can be covered in a European winter term, but it should be easy to make suitable selections.

The presentation is in a rigorous mathematical style. However, the theoretical results are usually motivated and discussed in a more leisurely manner so that many proofs can be omitted without impairing the understanding of the algorithms. Notation is almost standard, with a bias towards MATLAB (see also the index). The abbreviation "iff" frequently stands for "if and only if."
The first chapter introduces elementary features of numerical computation: floating point numbers, rounding errors, stability and condition, elements of programming (in MATLAB), automatic differentiation, and interval arithmetic. Chapter 2 is a thorough treatment of Gaussian elimination, including its variants such as the Cholesky factorization. Chapters 3 through 5 provide the tools for studying univariate functions - interpolation (with polynomials, cubic splines, and radial basis functions), integration (Gaussian formulas, Romberg and adaptive integration, and an introduction to multistep formulas for ordinary differential equations), and zero finding (traditional and less traditional methods ensuring global and fast local convergence, complex zeros, and spectral bisection for definite eigenvalue problems). Finally, Chapter 6 discusses Newton's method and its many variants for systems of nonlinear equations, concentrating on methods for which global convergence can be proved.

In a second course, I usually cover numerical data analysis (least squares and orthogonal factorization, the singular value decomposition and regularization, and the fast Fourier transform), unconstrained optimization, the eigenvalue problem, and differential equations. Therefore, this book contains no (or only a rudimentary) treatment of these topics; it is planned that they will be covered in a companion volume.

I want to thank Doris Norbert, Andreas Schäfer, and Wolfgang Enger for their help in preparing the German lecture notes; Michael Wolfe and Baker Kearfott for their English translation, out of which the present text grew; Carl de Boor, Baker Kearfott, Weldon Lodwick, Günter Mayer, Jiří Rohn, and Siegfried Rump for their comments on various versions of the manuscript; and Stefan Dallwig and Waltraud Huyer for computing the tables and figures and proofreading. I also want to thank God for giving me love and power to understand and use the mathematical concepts on which His creation is based, and for the discipline and perseverance to write this book.

I hope that working through the material presented here gives my readers insight and understanding, encourages them to become more deeply interested in the joys and frustrations of numerical computations, and helps them apply mathematics more efficiently and more reliably to practical problems.

Vienna, January 2001
Arnold Neumaier

1
The Numerical Evaluation of Expressions

In this chapter, we introduce the reader on a very elementary level to some basic considerations in numerical analysis. We look at the evaluation of arithmetic expressions and their derivatives, and show some simple ways to save on the number of operations and storage locations needed to do computations. We demonstrate how finite precision arithmetic differs from computation with ideal real numbers, and give some ideas about how to recognize pitfalls in numerical computations and how to avoid the associated numerical instability. We look at the influence of data errors on the results of a computation, and how to quantify this influence using the concept of a condition number. Finally we show how, using interval arithmetic, it is possible to obtain mathematically correct results and error estimates although computations are done with limited precision only.
We present some simple algorithms in a pseudo-MATLAB formulation very close to the numerical MATrix LABoratory language MATLAB (and some - those printed in typewriter font - in true MATLAB) to facilitate getting used to this excellent platform for experimenting with numerical algorithms. MATLAB is very easy to learn once you get the idea of it, and the online help is usually sufficient to expand your knowledge. We explain many MATLAB conventions on their first use (see also the index); for unexplained MATLAB features you may try the online help facility. Type help at the MATLAB prompt to find a list of available directories with MATLAB functions; add the directory name after help to get information about the available functions, or type the function name (without the ending .m) after help to get information about the use of a particular function. This should give enough information in most cases. Apart from that, you may refer to the MATLAB manuals (for example, [58,59]). If you need more help, see, for example, Hanselman and Littlefield [38] for a comprehensive introduction.

For those who cannot afford MATLAB, there are public domain variants that can be downloaded from the internet, SCILAB [87] and OCTAVE [74]. (The syntax and the capabilities are slightly different, so the MATLAB explanations given here may not be fully compatible.) Here we use the MATLAB notation 1:n for a list of consecutive integers from 1 to n.

1.1 Arithmetic Expressions and Automatic Differentiation

Mathematical formulas occur in many fields of interest in mathematics and its applications (e.g., statistics, banking, astronomy).

1.1.1 Examples.

(i) The absolute value $|x + iy|$ of the complex number $x + iy$ is given by $|x + iy| = \sqrt{x^2 + y^2}$.

(ii) A capital sum $K_0$ invested at $p\%$ per annum for $n$ years accumulates to a sum $K$ given by $K = K_0(1 + p/100)^n$.

(iii) The altitude $h$ of a star with spherical coordinates $(\psi, \delta, t)$ is given by $h = \arcsin(\sin\psi\sin\delta + \cos\psi\cos\delta\cos t)$.

(iv) The solutions $x_1, x_2$ of the quadratic equation $ax^2 + bx + c = 0$ are given by
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

(v) The standard deviation $\sigma$ of a sequence of real numbers $x_1, \dots, x_n$ is given by
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1:n}\Big(x_i - \frac{1}{n}\sum_{i=1:n} x_i\Big)^2}.$$

(vi) The area $e$ under the normal distribution curve $e^{-t^2/2}/\sqrt{2\pi}$ between $t = -x$ and $t = x$ is given by
$$e = \frac{1}{\sqrt{2\pi}}\int_{-x}^{x} e^{-t^2/2}\,dt.$$

With the exception of the last formula, which contains an integral, only elementary arithmetic operations and elementary functions occur in the examples. Such formulas are called arithmetic expressions. Arithmetic expressions may be defined recursively with the aid of variables $x_1, \dots, x_n$, the binary operations + (addition), - (subtraction), * (multiplication), / (division), ^ (exponentiation), forming the set

O = {+, -, *, /, ^},

and with certain unary elementary functions in the set

J = {+, -, sin, cos, exp, log, sqrt, abs, ...}.

The set of elementary functions is not specified precisely here, and should be regarded as consisting of the set of unary continuous functions that are defined on the computer used.

1.1.2 Definition. The set $A = A(x_1, \dots, x_n)$ of arithmetic expressions in $x_1, \dots, x_n$ is defined by the rules

(E1) $\mathbb{R} \subseteq A$,
(E2) $x_i \in A$ $(i = 1, \dots, n)$,
(E3) $g, h \in A$, $\circ \in O \Rightarrow (g \circ h) \in A$,
(E4) $g \in A$, $\varphi \in J \Rightarrow \varphi(g) \in A$,
(E5) $A(x_1, \dots, x_n)$ is the smallest set that satisfies (E1)-(E4). (This rule excludes objects not created by the rules (E1)-(E4).)

Unnecessary parentheses may be deleted in accordance with standard rules.
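As a quick illustration (a sketch added here; the sample values are arbitrary), the formulas of Example 1.1.1 can be typed almost verbatim at the MATLAB prompt. For (vi), MATLAB's built-in error function erf can be used, since the area equals erf(x/sqrt(2)):

K0 = 1000; p = 5; n = 10;
K = K0*(1 + p/100)^n     % compound interest, formula (ii): 1628.89...
x = 1.96;
e = erf(x/sqrt(2))       % area in formula (vi): approximately 0.95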
1.1.3 Example. The solution

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$$

of a quadratic equation is an arithmetic expression in a, b, and c because we can write it as follows:

(-b + sqrt(b^2 - 4*a*c))/(2*a) ∈ A(a, b, c).

Evaluation of an arithmetic expression means replacing the variables with numbers. This is possible on any machine for which the operations in O and the elementary functions in J are realized.

Differentiation of Expressions

For many numerical problems it is useful and sometimes absolutely essential to know how to calculate the value of the derivative of a function f at given points. If no routine for the calculation of f'(x) is available, then one can determine approximations for f'(x) by means of numerical differentiation (see Section 3.2). However, because of the higher accuracy attainable, it is usually better to derive a routine for calculating f'(x) from the routine for calculating f(x). If, in particular, f is given by an arithmetic expression, then an arithmetic expression for f'(x) can be obtained by analytic (symbolic) differentiation using the rules of calculus. We describe several useful variants of this process, confining ourselves here to the case of a single variable x; the case of several variables is treated in Section 6.1. We shall see that, for all applications in which a closed formula for the derivative of f is not needed but one must be able to evaluate f'(x) at arbitrary points, a recursive form of generating this value simultaneously with the value of f(x) is the most useful way to proceed.

(a) The Construction of a Closed Expression for f'

This is the traditional way in which the formula for the derivative is calculated by hand and the expression that is obtained is then programmed as a function subroutine. However, several disadvantages outweigh the advantage of having a closed expression.

(i) Algebraic errors can occur in the calculation and simplification of formulas for derivatives by hand; this is particularly likely when long and complicated formulas for the derivative result. Therefore, a correctness test is necessary; this can be implemented, for example, as a comparison with a sequence of values obtained by numerical differentiation (cf. Exercise 8).

(ii) Often, especially when both f(x) and f'(x) must be calculated, certain subexpressions appear several times, and their recalculation adds needlessly to the running time of the program. Thus, in the example

$$f(x) = e^x/(1 + \sin x),$$
$$f'(x) = \frac{e^x(1 + \sin x) - e^x\cos x}{(1 + \sin x)^2} = e^x(1 + \sin x - \cos x)/(1 + \sin x)^2,$$

the expression $1 + \sin x$ occurs three times. The susceptibility to error is considerably reduced if one automates the calculation and simplification of closed formulas for derivatives using symbolic algebra (i.e., the processing of syntactically organized strings of symbols). Examples of high-quality packages that accomplish this are MAPLE and MATHEMATICA.

(b) The Construction of a Recursive Program for f and f'

In order to avoid the repeated calculation of subexpressions, it is advantageous to give up the closed form and calculate repeatedly occurring subexpressions using auxiliary variables.
In the preceding example, one obtains the program segment

f1 = exp x;
f2 = 1 + sin x;
f = f1/f2;
f' = f1*(f2 - cos x)/f2^2;

in which one can represent the last expression for f' more briefly in the form f' = f*(f2 - cos x)/f2 or f' = f*(1 - cos x/f2). One can even arrange to have this transformation done automatically by programs for symbolic manipulation and dispense with closed intermediate expressions. Thus, in decomposing the expression for f recursively into its constituent parts, a sequence of assignments of the form

f = -g,  f = g ∘ h,  f = φ(g)

is obtained. One can differentiate this sequence of equations, taking into account the fact that the subexpressions f, g, and h are themselves functions of x and bearing in mind that a constant has the derivative 0 and that the variable x has the derivative 1. In addition, the well-known rules for differentiation (Table 1.1) are used.

Table 1.1. Differentiation rules for expressions

f         f'
±g        ±g'
g ± h     g' ± h'
g * h     g' * h + g * h'
g / h     (g' - f * h')/h        (f = g/h; see main text)
g^2       2 * g * g'
g^h       f * (h' * log(g) + h * g'/g)   (if g > 0)
sqrt(g)   g'/(2 * f)             (f = sqrt(g))
exp(g)    f * g'                 (f = exp(g))
log(g)    g'/g
abs(g)    sign(g) * g'           (if g ≠ 0)
φ(g)      φ'(g) * g'

The unusual form of the quotient rule results from

$$f = g/h \;\Rightarrow\; f' = (g'h - gh')/h^2 = (g' - (g/h)h')/h = (g' - fh')/h$$

and is advantageous for machine calculation, saving two multiplications.

In the preceding example, one obtains

f1 = exp x;       f1' = f1;
f2 = 1 + sin x;   f2' = cos x;
f = f1/f2;        f' = (f1' - f*f2')/f2;

and from this, by means of a little simplification and substitution,

f1 = exp x;
f2 = 1 + sin x;
f = f1/f2;
f' = (f1 - f*cos x)/f2;

with the same computational cost as the old, intuitively derived recursion.

(c) Computing with Differential Numbers

The compilers of all programming languages can do a syntax analysis of arbitrary expressions, but usually the result of the analysis is not accessible to the programs. However, in programming languages that provide user-definable data types, operators, and functions for objects of these types, the compiler's syntax analysis can be utilized. This possibility exists, for example, in the programming languages MATLAB 5, FORTRAN 90, ADA, C++, and in the PASCAL extension PASCAL-XSC, and leads to automatic differentiation methods.

In order to understand how this can be done, we observe that Table 1.1 constructs, for each operation ∘ ∈ O, from two pairs of numerical values (g, g') and (h, h'), a third pair, namely (f, f'). We may regard this pair simply as the result of the operation (g, g') ∘ (h, h'). Similarly, one finds in Table 1.1 for each elementary function φ ∈ J a new pair (f, f') from the pair (g, g'), and we regard this as a definition of the value φ((g, g')). The analogy with complex numbers is obvious and motivates our definition of differential numbers as pairs of real numbers. Formally, a differential number is a pair (f, f') of real numbers.
We use the generic form df to denote such a differential number, and regard similarly the variables h and h' as the components of the differential number dh, so that dh = (h, h'), and so on. We now define, in correspondence with Table 1.1, operations and elementary functions for differential numbers:

±dg := (±g, ±g');
dg ± dh := (g ± h, g' ± h');
dg * dh := (g*h, g'*h + g*h');
dg / dh := (f, (g' - f*h')/h)  with f = g/h  (if h ≠ 0);
dg^n := (k*g, n*k*g')  with k = g^(n-1)  (if 1 ≤ n ∈ ℝ);
dg^n := (f, n*f*g'/g)  with f = g^n  (if 1 > n ∈ ℝ, g > 0);
dg^dh := (f, f*(h'*k + h*g'/g))  with k = log(g), f = exp(h*k)  (if g > 0);
sqrt(dg) := (f, g'/(2*f))  with f = sqrt(g)  (if g > 0);
exp(dg) := (f, f*g')  with f = exp(g);
log(dg) := (log(g), g'/g)  (if g > 0);
abs(dg) := (abs(g), sign(g)*g')  (if g ≠ 0);
φ(dg) := (φ(g), φ'(g)*g')  for other φ ∈ J for which φ(g), φ'(g) exist.

The next result follows directly from the definitions.

1.1.4 Proposition. Let q, r be real functions of one variable, differentiable at $x_0 \in \mathbb{R}$, and let dq = (q(x_0), q'(x_0)), dr = (r(x_0), r'(x_0)).

(i) If dq ∘ dr is defined (for ∘ ∈ O), then the function p given by p(x) := q(x) ∘ r(x) is defined and differentiable at $x_0$, and (p(x_0), p'(x_0)) = dq ∘ dr.

(ii) If φ(dq) is defined (for φ ∈ J), then the function p given by p(x) := φ(q(x)) is defined and differentiable at $x_0$, and (p(x_0), p'(x_0)) = φ(dq).

The reader interested in algebra easily verifies that the set of differential numbers with the operations +, -, * forms a commutative and associative ring with null element 0 = (0, 0) and identity element 1 = (1, 0). The ring has zero divisors; for example, (0, 1) * (0, 1) = (0, 0) = 0. The differential numbers of the form (a, 0) form a subring that is isomorphic to the ring of real numbers. We may call differential numbers of the form (a, 0) constants and identify them with the real numbers a. For operations with constants, the formulas simplify to

dg ± a = (g ± a, g'),
a ± dh = (a ± h, ±h'),
dg * a = (g*a, g'*a),
a * dh = (a*h, a*h'),
dg / a = (g/a, g'/a),
a / dh = (f, -f*h'/h)  with f = a/h.

The arithmetic for differential numbers can be programmed without any difficulty in any of the programming languages mentioned previously. With their help, we can compute the values f(x_0) and f'(x_0) for any arithmetic expression f and at any point $x_0$. Indeed, we need only initialize the independent variable as the differential number dx = (x_0, 1) and substitute this into f.

1.1.5 Theorem. Let f be an arithmetic expression in the variable x, and let the differential number f(dx) that results from inserting dx = (x_0, 1) into f be defined. Then f is defined and differentiable at $x_0$, and (f(x_0), f'(x_0)) = f(dx).

Proof. Recursive application of Proposition 1.1.4.

The conclusion is that, using differential numbers, one can calculate the derivative of any arithmetic expression, at any admissible point, without knowing an expression for f'!

1.1.6 Example. Suppose we want to find the value and the derivative of

$$f(x) = \frac{(x-1)(x+3)}{x+2}$$

at the point $x_0 = 3$. The classical method (a) starts from an explicit formula for the derivative; for example,

$$f'(x) = \frac{x^2 + 4x + 7}{(x+2)^2},$$

and finds by substitution that f(3) = 12/5 = 2.4, f'(3) = 28/25 = 1.12.

Substituting the differential number dx = (3, 1) into the expression for f gives

f(dx) = ((3,1) - 1)*((3,1) + 3)/((3,1) + 2)
      = (2,1)*(6,1)/(5,1) = (12,8)/(5,1)
      = (2.4, (8 - 2.4*1)/5) = (2.4, 1.12),

so (f(3), f'(3)) = f(dx) = (2.4, 1.12), without knowing an expression for f'. In the same way, substituting dx = (3, 1) into the equivalent expression

$$f(x) = x - \frac{3}{x+2},$$

we obtain, in spite of completely different intermediate results, the same final result

f(dx) = (3,1) - (3,0)/((3,1) + 2) = (3,1) - (3,0)/(5,1)
      = (3,1) - (0.6, (0 - 0.6*1)/5) = (3,1) - (0.6, -0.12)
      = (2.4, 1.12).
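The following small sketch (added here for illustration; a serious implementation such as INTLAB would overload the operators instead, and the helper names dadd, dsub, dmul, ddiv are hypothetical) mimics differential-number arithmetic with row vectors [f f'] and reproduces the result of Example 1.1.6:

dadd = @(dg,dh) [dg(1)+dh(1), dg(2)+dh(2)];
dsub = @(dg,dh) [dg(1)-dh(1), dg(2)-dh(2)];
dmul = @(dg,dh) [dg(1)*dh(1), dg(2)*dh(1)+dg(1)*dh(2)];
ddiv = @(dg,dh) [dg(1)/dh(1), (dg(2)-(dg(1)/dh(1))*dh(2))/dh(1)];
dx = [3 1];                          % the independent variable at x0 = 3
c1 = [1 0]; c2 = [2 0]; c3 = [3 0];  % constants (a,0)
f = ddiv(dmul(dsub(dx,c1),dadd(dx,c3)),dadd(dx,c2))
% returns [2.4 1.12], i.e., f(3) = 2.4 and f'(3) = 1.12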
Finally, we note that differential numbers can be generalized without difficulty to compute derivatives of higher order. For the computation of f''(x), ..., f^(n)(x) from an expression f containing N operations or functions, the number of operations and function calls needed grows like a small multiple of n^2 N. The formalism handles differentiation with respect to any parameter. Therefore, it is also possible to compute the partial derivatives ∂f(x)/∂x_k of a function that is given by an expression f(x) in several variables. Of course, one need not redo the function value part of the calculation when calculating the partial derivatives. In addition to this "forward" mode of automatic differentiation, there is also a "reverse" or "backward" mode that may further increase the efficiency of calculating partial derivatives (see Section 6.1).

A MATLAB implementation of automatic differentiation is available in the INTLAB toolbox by Rump [85]. An in-depth discussion of all aspects of automatic differentiation and its applications is given in Griewank and Corliss [33] and Griewank [32].

The Horner Scheme

A polynomial of degree at most n is a function of the form

$$f(x) = a_0x^n + a_1x^{n-1} + \dots + a_n \tag{1.1}$$

with given coefficients $a_0, \dots, a_n$. In order to evaluate (1.1) for a given value of x, one does not proceed naively by forming each power $x^2, x^3, \dots, x^n$, but one uses the following scheme. We define

$$f_i := a_0x^i + a_1x^{i-1} + \dots + a_i \qquad (i = 0, 1, \dots, n).$$

Obviously, $f_n = f(x)$. The advantage of the $f_i$ lies in their recursive definition

$$f_0 = a_0, \quad f_i = f_{i-1}x + a_i \quad (i = 1, \dots, n), \quad f(x) = f_n.$$

This is the Horner scheme for calculating the value of a polynomial. A simple count shows that only 2n operations are needed, whereas evaluating (1.1) directly requires at least 3n - 1 operations. The derivative of a polynomial can be formed recursively by differentiation of the recursion:

$$f_0' = 0, \quad f_i' = f_{i-1}'x + f_{i-1} \quad (i = 1, \dots, n), \quad f'(x) = f_n'.$$

This is the Horner scheme for calculating the first derivative of a polynomial. Analogous results hold for the higher derivatives (see Exercise 4). In MATLAB notation, we get f = f(x) and g = f'(x) as follows:

1.1.7 Algorithm: Horner Scheme

% computes f=f(x) and g=f'(x) for
% f(x)=a(1)*x^n+a(2)*x^(n-1)+...+a(n)*x+a(n+1)
f=a(1); g=0;
for i=1:n,
  g=g*x+f;
  f=f*x+a(i+1);
end;

Note the shift in the index numbering, needed because in MATLAB the vector indices start with index 1, in contrast to C, in which vector indices start with 0, and to FORTRAN, which allows the user to specify the starting index. The reader unfamiliar with programming should also note that a statement such as g = g*x + f is not an equation in the mathematical sense, but an assignment. In the right side, g refers to the contents of the associated storage location before evaluation of the right side; after its evaluation, the contents of g are replaced by the result of the calculation. Thus, the same variable takes different values at different times, and this allows one to make most efficient use of the capacity of the computer, saving storage and index calculations. Note also that the update of f must come after the update of g because the formula for updating g requires the old value of f, which would be no longer available after the update of f.
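For instance (sample coefficients chosen here for illustration), applying Algorithm 1.1.7 to f(x) = x^3 - 2x + 5 at x = 2 gives:

a = [1 0 -2 5]; n = 3; x = 2;   % f(x) = x^3 - 2x + 5
f = a(1); g = 0;
for i = 1:n,
  g = g*x + f;
  f = f*x + a(i+1);
end;
% now f = 9 = f(2), and g = 10 = f'(2) since f'(x) = 3x^2 - 2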
1.1.8 Example. We evaluate the polynomial f(x) = (1 - x)^6 at 101 equidistant points in the range 0.995 ≤ x ≤ 1.005 by calculating (1 - x)^6 directly and by using the Horner scheme for the equivalent expanded polynomial expression 1 - 6x + 15x^2 - 20x^3 + 15x^4 - 6x^5 + x^6. The results are shown in Figure 1.1.

Figure 1.1. Two ways of evaluating f(x) = (1 - x)^6.

The effect of (simulated) machine precision on the evaluation of the polynomial p(x) = 1 - 6x + 15x^2 - 20x^3 + 15x^4 - 6x^5 + x^6 without using the Horner scheme is demonstrated in Figure 1.2. In MATLAB, the figure can be drawn easily with the following commands. (Text is handled as an array of characters, and concatenated by placing the pieces, separated by blanks or commas, within square brackets. num2str transforms a number into a string denoting that number, in some standard format. The number of rows or columns of a matrix can be found with size. The period . before an operation denotes componentwise operations. The % sign indicates that the remainder of the line is a comment and does not affect the computations. The actual plot is produced by plot, and saved as a postscript file using print.)

1.1.9 Algorithm: Draw Figure 1.1

x=(9950:10050)/10000;
disp(['number of evaluation points: ',num2str(size(x,2))]);
y=(1-x).^6;
% a compact way of writing the Horner scheme:
z=(((((x-6).*x+15).*x-20).*x+15).*x-6).*x+1;
plot(x,[y;z]);          % display graph on screen
print -deps horner.ps   % save figure in file horner.ps

The figures illustrate a typical problem in numerical analysis. The simple expression (1 - x)^6 produces the expected curve. However, for the expanded expression, monotonicity of f is destroyed through effects of finite precision arithmetic, and instead of a single minimum of zero at x = 1, we obtain more than 30 changes of sign in its neighborhood. This shows that we must look more closely at the evaluation of arithmetic expressions on a computer and at the rounding errors that create the observed inaccuracies.

1.2 Numbers, Operations, and Elementary Functions

Machine numbers of the following types (where each x denotes a digit) can be read and printed by any computer:

integer   ±xxxx                  (base 10 integer)
real      ±xx.xxx_10±xxx         (base 10 floating point number)
real      ±xxxx.xx               (base 10 fixed point number)

Integer numbers consist of a sign and a finite sequence of digits, and base 10 real floating point numbers consist of a sign, digits before the decimal point, a decimal point, digits after the decimal point, and an exponent with a sign. In order to avoid subscripts, computers usually print the letter d or e in place of the basis 10. The actual number of digits depends on the particular computer. Fixed point numbers have no exponent part.
Many programming languages also provide complex numbers, consisting of a floating point real part and an imaginary part. In the MATLAB environment, there is no distinction among integers, reals, and complex numbers. The appropriate internal representation is determined automatically; however, double precision calculation (see later this section) is used throughout. In the following, we look in more detail at real floating point numbers.

Floating Point Numbers

For base 10 floating point numbers, the decimal point can be in any position whatever, and as input, this is always allowed. For the output, two standard forms are customary, in which, for nonzero numbers, the decimal point is placed either

(i) before the first nonzero digit: ±.xxx_10±xxx, or
(ii) after the first nonzero digit: ±x.xx_10±xxx.

In (i), the x immediately to the right of the decimal point represents a nonzero decimal digit. In (ii), the x immediately to the left of the decimal point represents a nonzero decimal digit. In the following, we use the normalization (i). The sequence of digits before the exponent part is referred to as the mantissa of the number.

Internally, floating point numbers often have a different base B in which, typically, B ∈ {2, 8, 16, ..., 2^31, ...}. A huge base such as B = 2^31 or any other large number is used, for example, in the multi-precision arithmetic package of Brent available through the electronic repository of mathematical software NETLIB [67]. (This valuable repository contains much other useful numerical software, too, and should be explored by the serious student.)

1.2.1 Example. The general (normalized) internal floating point number format has the form

± .xxxxxxxxxxx  *  B ^ (±xxxxxxxxxxxxxxx)
  (mantissa of length L)  (base)  (exponent in [-E, F])

where L is the mantissa length, B the base, and [-E, F] the exponent range of the number format used.

The arithmetic of personal computers (and of many workstations) conforms to the so-called IEEE floating point standard [45] for binary arithmetic (base B = 2). Here the single precision format (also referred to as real or real*4) uses 32 bits per number: 24 for the mantissa and 8 for the exponent. Because the sign of the mantissa takes 1 bit, the mantissa length is L = 23, and the exponents lie between E = -126 and F = 127. (The exponents -127 and 128 are reserved for floating point exceptions such as improper numbers like ±∞ and NaN; the latter ("not a number") signifies results of nonsensical calculations such as 0/0.) The single precision format allows one to represent numbers of absolute value between approximately 10^-38 and 10^38, with an accuracy of about seven decimals. (There are also some even tinier numbers with less accuracy that cannot be normalized with the given exponent range. Such numbers are called denormalized. In particular, zero is a denormalized number.)

The double precision format (also referred to as double or real*8) uses 64 bits per number: 53 for the mantissa and 11 for the exponent. Because the sign of the mantissa takes 1 bit, the mantissa length is L = 52, and the exponents lie between E = -1022 and F = 1023. The double precision format allows one to represent numbers of absolute value between approximately 10^-308 and 10^308, with an accuracy of about 16 decimals. (Again, there are also denormalized numbers.)

MATLAB works generally with double precision numbers. In particular, the selective use of double precision in predominantly single precision calculations cannot be illustrated with MATLAB, and we comment on such problems in FORTRAN terminology.
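The IEEE double precision quantities just described can be inspected directly from the MATLAB prompt; the following little experiment (added here as an illustration) shows the extreme representable numbers and two floating point exceptions:

realmax        % about 1.7977e308, the largest finite double
realmin        % about 2.2251e-308, the smallest normalized double
realmin/2^52   % about 4.9407e-324, the smallest denormalized number
1/0            % Inf, an improper number
0/0            % NaN, "not a number"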
Ignoring the sign for a moment, we consider machine numbers of the form $.x_1 \dots x_L \cdot B^e$ with mantissa length L, base B, digits $x_1, \dots, x_L$ and exponent $e \in \mathbb{Z}$. By definition, such a number is assigned the value

$$.x_1 \dots x_L \cdot B^e = \sum_{i=1:L} x_iB^{e-i} = x_1B^{e-1} + x_2B^{e-2} + \dots + x_LB^{e-L};$$

that is, the number is essentially the value of a polynomial at the point B. Therefore, the Horner scheme is a suitable method for converting from decimal numbers to base B numbers and conversely.

Rounding

It is obvious that between any two distinct machine numbers there are many real numbers that cannot be represented. In these cases, rounding must be used; that is, a "nearby" machine number must be found. Two rounding modes are most prevalent: optimal rounding and rounding by chopping.

(i) Optimal Rounding. In the case of optimal rounding, the closest machine number is chosen. Ties can be broken arbitrarily, but in case of ties, the closest number with $x_L$ even is most suitable on statistical grounds. (Indeed, in a long summation, one expects the last digit to have random parity; hence the errors tend to balance out.)

(ii) Rounding by Chopping. In the case of rounding by chopping, the digits after the Lth place are omitted. This kind of rounding is easier to realize than optimal rounding, but it is slightly less accurate.

We call a rounding correct if no machine number lies between a number x and its rounded value x̃. Both optimal rounding and rounding by chopping are correct roundings. For example, rounding with B = 10, L = 3 gives

x                     Optimal             Chopping
x1 = .123456 10^5     x̃1 = .123 10^5     x̃1 = .123 10^5
x2 = .567890 10^5     x̃2 = .568 10^5     x̃2 = .567 10^5
x3 = .123500 10^5     x̃3 = .124 10^5     x̃3 = .123 10^5
x4 = .234500 10^5     x̃4 = .234 10^5     x̃4 = .234 10^5

As measures of accuracy we use the absolute error |x̃ - x| and the relative error |x̃ - x|/|x| of an approximation x̃ for x. For general floating point numbers of highly varying magnitude, the relative error is the more important measure.

1.2.2 Proposition. Under the hypothesis of an unrestricted exponent set, a correctly rounded value x̃ of $x \in \mathbb{R}\setminus\{0\}$ satisfies

$$\frac{|x - \tilde{x}|}{|x|} < \varepsilon := B^{1-L},$$

and an optimally rounded value x̃ of $x \in \mathbb{R}\setminus\{0\}$ satisfies

$$\frac{|x - \tilde{x}|}{|x|} < \varepsilon := \tfrac{1}{2}B^{1-L}.$$

Thus, the relative error of the rounded value is bounded by a small machine-dependent number ε. The number ε is referred to as the machine precision of the computer. In MATLAB, the machine precision is available in the variable eps (unless eps is overwritten by something else).

Proof. We give the proof for the case of correct rounding; the case of optimal rounding is similar and left as Exercise 6. Suppose, without loss of generality, that x > 0 (the case x < 0 is analogous), where x is a real number that has, in general, an infinite base B representation $x = .x_1x_2x_3\ldots \cdot B^e$ with $x_1 \ne 0$. Set $x' := .x_1 \dots x_L \cdot B^e$. Then $x \in [x', x' + B^{e-L}]$, and $x' \le \tilde{x} \le x' + B^{e-L}$ because the rounding is correct, so that $|x - \tilde{x}| \le B^{e-L}$. Since $x_1 \ge 1$ implies $|x| \ge B^{e-1}$, the assertion follows.

This proposition is a basis for all rigorous error analysis of numerical methods; see in particular Higham [44] and Stummel and Hainer [92]. In this book, we usually keep such analyses on a more heuristic level (which is enough to expose the pitfalls in most numerical calculations); a fully quantitative treatment is only given for Gaussian elimination (see Section 2.1).
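In IEEE double precision (B = 2, L = 52), the bound of Proposition 1.2.2 for optimal rounding is ε = (1/2)·2^(1-52) = 2^-52, which is exactly MATLAB's eps. A small experiment (added here as an illustration) makes the rounding visible:

eps                  % 2.2204e-16 = 2^-52, the machine precision
1 + eps/2 == 1       % true: 1 + eps/2 is rounded back to 1
(0.1 + 0.2) - 0.3    % about 5.6e-17: the decimal inputs and the sum
                     % are all rounded to nearby binary numbers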
Overflow and Underflow

The preceding result is valid only for an unrestricted set of exponents. In practice, however, the exponent range is finite, and there are problems when |x| is too large or too small to be representable by a normalized number within the exponent range. In this case, we speak of overflow or underflow, respectively. The behavior resulting from overflow depends on the programming language and its implementation. Often, it results in an appropriate error message; alternatively, it may result in an improper number ±∞. (Some old compilers even set the result to zero, without warning!) In case of underflow, numbers that are too small are either rounded to denormalized numbers (gradual underflow) or replaced by zero; this is, in general, reasonable, but leads to much larger relative errors, and occasionally can therefore be dangerous. Overflow and underflow can usually be avoided by suitable scaling of the problem; hence we assume in later discussions that the exponent range is unrestricted and Proposition 1.2.2 is generally valid.

Operations and Elementary Functions

After the consideration of the representation of numbers on a computer, we look at the machine results of binary operations and elementary functions. For correctly rounded results of the binary operations ∘ ∈ O, we have

$$\frac{|x \mathbin{\tilde{\circ}} y - x \circ y|}{|x \circ y|} < \varepsilon,$$

where $x \mathbin{\tilde{\circ}} y$ represents the computed value of $x \circ y$ and ε is the machine precision. In the following, the validity of this property is assumed for ∘ ∈ {+, -, *, /} (but not for the power); it is satisfied by most modern computers as long as neither overflow nor underflow occurs. For example, the IEEE floating point standard requires the results of these operations (and the square root) to be even optimally rounded.

The power x^y (except for the square, y = 2) is usually slightly less accurate because it is computed as a composite expression. For small integers y, x^y is computed by repeated multiplication, and otherwise from exp(y log x). For theoretical analysis, we suppose that the relative error of the power is bounded by Cε for a constant not much larger than 1; this holds in practice except for extreme values of the operands.

For the calculation of the elementary functions φ ∈ J, one must use approximative methods. For a thorough treatment of the approximations of standard functions, see Hart [41]. We treat only two typical examples here, namely the square root and the exponential function. The following three steps are usually employed for the realization of the elementary function φ(x) on a computer:

(i) argument reduction to a number $x_0$ in a standard range, followed by
(ii) approximation of φ(x_0), and
(iii) a subsequent result adaptation.

The argument reduction (i) serves mainly to reduce the amount of work needed in the approximation step (ii), and the result adaptation (iii) corrects the result of (ii) to account for the argument reduction. To demonstrate the ideas, we look in more detail at the computation of the square root and the exponential function. Suppose we want to compute $\sqrt{x}$ = sqrt(x) with $x = mB^e$, where $m \in [1/B, 1)$ is the mantissa, B is the base, and e is the exponent of x.

Argument Reduction and Result Adaptation for sqrt(x)

We may express $\sqrt{x}$ in the form

$$\sqrt{x} = \sqrt{x_0}\,B^s$$

with $x_0 = m$, $s = e/2$ if e is even, and $x_0 = m/B$, $s = (e+1)/2$ if e is odd. This is the argument reduction. Because $x_0 \in [B^{-2}, 1]$, it is sufficient to determine the value of the square root in the interval $[B^{-2}, 1]$ to obtain a result which may then be adapted to give the required value $\sqrt{x}$ by multiplying the result $\sqrt{x_0} \in [B^{-1}, 1]$ (the mantissa) by $B^s$.

Approximation of sqrt(x)

(a) The simplest but inefficient possibility is the expansion in a power series about 1. The radius of convergence of the series

$$\sqrt{1-z} = 1 - \tfrac{1}{2}z - \tfrac{1}{8}z^2 - \tfrac{1}{16}z^3 - \tfrac{5}{128}z^4 - \dots$$

is 1. The series converges sufficiently rapidly only for x ≈ 1 (z ≈ 0), but converges very slowly for $x_0 = 0.01$ (z = 0.99), say (an essential value when B = 10), and is therefore useless for practical purposes. The cause of the slow convergence is the vertical tangent to the graph of $x^{1/2}$ at x = 0.
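In base B = 2, the argument reduction just described can be tried out with MATLAB's two-output log2, which decomposes x = m·2^e with m ∈ [0.5, 1); the following sketch (added here as an illustration) performs the reduction and the result adaptation:

x = 1234.5678;
[m,e] = log2(x);          % x = m*2^e with m in [0.5,1)
if mod(e,2) == 0,
  x0 = m;   s = e/2;      % even exponent
else
  x0 = m/2; s = (e+1)/2;  % odd exponent: x0 in [0.25,0.5)
end;
w = sqrt(x0)*2^s;         % result adaptation
w - sqrt(x)               % 0 (up to rounding error)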
The radius of convergence of the series rrrr; vI - z = 1 1- 1,1 15 2" z - 8" z~ - 16 z - 128 z4 -'" is 1. The series converges sufficiently rapidly only for x ;::::; I (z ;::::; 0), but converges very slowly for Xo = O.01(z = 0.99), say (an essential value when B = 10), and is therefore useless for practical purposes. The cause of the slow convergence is the vertical tangent to the graph of x 1/2 at x = O. 20 1.2 Numbers, Operations, and Elementary Functions The Numerical Evaluation of Expressions (b) Another typical way to approximate function values is offered by an iterative method. Here, iteration ofthe same process over and over again successively improves an approximation until a desired accuracy is reached. In order to find W = -J"X for a given value of x > 0, let Wo be an initial estimate of w; for example, we may take Wo = I when x E [B- 2 , 1]). Note that w2 = x and so W = x /w. To find a suitable iteration formula, we define Wo := x /wo. If Wo = wo, then we are finished (w = wo); otherwise, we have Wo = w 2/wo = w(w/wo) < w (or> w) if Wo > w or Wo < w, respectively. This suggests that the arithmetic mean WI = (wo + wo)/2 might be a suitable new estimate for w. Repeating the process leads to the following iterative method, already known to Babylonian mathematicians. hence 8i + I = {(8i WHI If Wo = -J"X, = iili =..;x (2.2) (i:::: 0). (i:::: 0); The absolute error 8i := ui, - ..;x < and the relative error e, := (Wi - = < ... < W2 < WI· «> 0), Ci) si >: 0). (2.3) Here we use the convention that a product has higher priority than division if it is written by juxtaposition, and lower priority if it is written with explicit multiplication sign. Thus, Proof With W := -J"X, we have (for all i Wi = 8i + W = (l > 0) + sdw, + W)2 + w 2 - 2w(8 i + W)}/ 2Wi #- -J"X, then 80 = Wo - w #- 0 and 8i + I 8i + I = 8; /2Wi ::::: 8i12 < 8i > O. So, (i > 0) whence -J"X < proposition. Wi+1 < Wi and Wi < Wi+I < -J"X (i > 0). This proves the D For B = 10, Xo E [0.01, 1] is a consequence of argument reduction. Clearly, Xo = 0.01 is the most unfavorable case; hence seven iterations (i.e., 21 operations) always suffice to calculate -J"X accurate to twelve decimal -J"X) /-J"X satisfies s; /2(1 + 2WWi) / 2Wi The results displayed in Table 1.2 were obtained. One observes a rapid increase in the number of correct digits as predicted by (2.3). This is so-called quadratic convergence, and the iteration is in fact a special case of Newton 's method (cf. Chapter 5). -J"X satisfies 8i+l = 8;/2wi SHI ui, w 1.2.4 Example. The iteration of Proposition 1.2.3 was executed on a pocket calculator with B = 10 and L = 12 for x = 0.01, starting with Wo := 1. otherwise, WI < W2 < ... < ib, < + x/wi)/2 - and o< then Wi W = (Wi (2.1) 0), + wi)/2 := (Wi - = 8f!2wi:::: 0 If now Wo G >: = Wi+I = (wf +x - 1.2.3 Proposition. Let x > O. For arbitrary Wo > 0, define Wi := X/Wi 21 Table 1.2. Calculation of -vl0.01 Wi o 1 1 0.505 . 0.2624 . 0.1502 . 0.1084 . 0.1003 . 0.1000005 ... 0.1 2 3 4 5 6 7 22 digits. The argument reduction is important because for very large or very small Xo, the convergence is initially rather slow (for large e, we have Ei+1 ~ e, /2). (c) Another frequent approximation method uses rational functions. Assuming a rational approximation of JX of the form Approximation of exp(x) As for the square root, there are various possibilities for the approximation. Because the power series for the exponential function converges very rapidl y, only the power series approximation is considered. 
(c) Another frequent approximation method uses rational functions. Assuming a rational approximation of $\sqrt{x}$ of the form

$$w^*(x) = t_2x + t_1 + \frac{t_0}{x + s_0},$$

optimal coefficients for the interval [0.01, 1] can be determined so that

$$\sup_{x \in [0.01,\,1]} |\sqrt{x} - w^*(x)|$$

is minimal. The values of the coefficients (from the approximation SQRT0231 of Hart [41]) are

t_2 = 0.5881229,  t_1 = 0.467975327625,  t_0 = -0.0409162391674,  s_0 = 0.0999998.

For every $x_0 \in [0.01, 1]$, the relative error of $w^*$ is less than 0.02 (see Hart [41], p. 94). (For other functions or higher accuracy, one would use more complicated rational functions in the form of so-called continued fractions.) With three additional iterations as under (b) with initial value $w^*$, we obtain a relative error of $10^{-15}$, requiring a total of fourteen operations only. In binary arithmetic, even fewer operations suffice because the argument reduction reduces the interval in which a good approximation is needed to [0.25, 1].

As our second example, we consider the computation of $e^x$ = exp(x) with $x = mB^e$, where $m \in [1/B, 1)$ is the mantissa, B is the base, and e is the exponent of x.

Argument Reduction and Result Adaptation for exp(x)

We can write exp(x) = exp(x_0)·B^s with $s = \lfloor x/\log B + 1/2 \rfloor$ and $x_0 = x - s\log B$. This gives an integral exponent s and a reduced argument $x_0$ with $|x_0| \le \tfrac{1}{2}\log B$. The constant log B must be stored in higher precision so that the result adaptation does not cause undue rounding errors.

Approximation of exp(x)

As for the square root, there are various possibilities for the approximation. Because the power series for the exponential function converges very rapidly, only the power series approximation is considered here. We may write the partial sum as

$$p(x) = \sum_{k=0:n} x^k/k!$$

and have the approximation

$$e^{x_0} = p(x_0) + \frac{x_0^{n+1}}{(n+1)!}e^{\xi}, \qquad |\xi| \le |x_0|.$$

Thus the error satisfies

$$|e^{x_0} - p(x_0)| \le \frac{|x_0|^{n+1}}{(n+1)!}e^{|x_0|} \le \frac{(\log(B)/2)^{n+1}}{(n+1)!}\sqrt{B}$$

for $|x_0| \le \log(B)/2$, and an optimal value of n corresponding to a given accuracy can be determined; for example, for B = 10, n = 16, the absolute error is $< 10^{-13}$.
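For B = 2, the complete recipe - argument reduction, truncated series, result adaptation - fits in a few lines. The following sketch (added here as an illustration; it ignores the advice above to store log B in higher precision, since double precision log(2) is good enough for a demonstration) agrees with the built-in exp to about machine precision:

x = 12.345; n = 16;
s = floor(x/log(2) + 0.5);   % integral exponent s
x0 = x - s*log(2);           % reduced argument, |x0| <= log(2)/2
p = 1;
for k = n:-1:1,
  p = 1 + p*x0/k;            % Horner-like evaluation of sum x0^k/k!
end;
p*2^s - exp(x)               % tiny difference: result adaptation works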
Estimating the accuracy of the latter is the subject of an error analysis that should accompany any extended calculation, at least in a qualitative way. Because in what follows inexact numbers occur frequently, we introduce y (x « y) the notation ~ for approximately equal to. We shall also write x when x, yare positive and x t y is much greater than 1 (and, respectively, much smaller than 1). These are qualitative notions only; it must be decided from the context or appLication what "approximately" or "much" means quantitatively. There are two reasons for the increasing loss of accuracy in (i) and (ii): 1. There is a large difference in order ofmagnitude between the numbers to be added or subtracted, respectively (e.g., 10 1O+l/3).lfx = mlOe, X = mlO e. with Ixl > lxi, and k := e - e - 1, then x is too small to have any influence on the first k places of the sum or difference of x and x, respectively. Thus, information on .'i' is in the less significant digits of the intermediate results only, and the last k places of x are completely lost in the rounding process. 2. The subtraction of numbers of almost equal size leads to the cancellation of leading digits and promotes the wrong low-order digits in the intermediate results to a much more significant position. This leads to a drastic magnification of the relative error. In case (i), the result of the subtraction already has order unity. so that relative errors and absolute errors have the same magnitude. In case (ii), the result of the subtraction has small absolute (but large relative) error, and the division by the small number x 2 leads to a result with large absolute error. (iii) f(x) := sirr' x/(l - cos? x). 24 » 1.3.1 Example. All of the following examples were computed on a pocket calculator with B = 10 and L = 12. (Calculators with different rounding characteristics probably give similar but not identical results.) (i) f(x) := (x + 1/3) - (x - 1/3). x 1 103 106 109 10 10 lOll lex) 0.666666666 0.666666663 0.666663 . . . 0.663 .. 0.33 . 0 . . x Because f (x) = 2/3 for all x, there is an increasing loss of accuracy for increasing x. (ii) f(x) := «3 + x 2/3) - (3 - x 2/3))/x 2 . x 10- 1 10- 2 10- 3 10-4 10- 5 10- 6 f(x) 0.666666667 0.6666667 f(x) 0.D1° 0.001° 0.0005° 0.0004° 0.0003° 1.0007 1.0879 1.2692 2.3166 . . . . ERROR Because f (x) = 1 for all x not a multiple of n , there is a drastic decrease in accuracy for decreasing x due to division by the small number 1 - cos? .r , and ultimately an error occurs due to division by a computed zero. (iv) f(x) := sinx/Jl - sin 2 x. . . 0.66667 . . . 0.667 . 0.675 0 25 . Because f (x) = 2/3 for all x ::j: O. there is an increasing loss of accuracy for decreasing x. x f(x) tanx 89.9° 89.95) 89.99° 572.9588 ... 1145.87 ... 5735.39 572.9572 ... 1145.91 ... 5729.57 ... The correct digits are underlined. We have f (x) = tan x for all .r, and 26 The Numerical Evaluation of Expressions 1.3 Numerical Stability here the loss of accuracy is due to the fact that f (x) has a pole at x = 90°, together with the cancellation in the denominator. (v) f (x) : = eX'/3 - 1. For comparison, we give together with f (x) also the result of the first two terms of the power series expansion of f (x). 
(vi) The formula

$$\sigma_n = \sqrt{\Big(\sum_{i=1:n} x_i^2 - ns_n^2\Big)\Big/(n-1)}, \qquad s_n = \frac{1}{n}\sum_{i=1:n} x_i,$$

gives the standard deviation of a sequence of data $x_1, \dots, x_n$ in a way as implemented in many pocket calculators. (The alternative formula mentioned in Example 1.1.1 is much more stable but does not allow the input of many data with little storage!) For $x_i := x$ where x is a constant, $\sigma_n = 0$, but on the pocket calculator, the corresponding keys yield the results shown in Table 1.3. The zero in the table arises because the calculator checks whether the computed argument of the square root is negative (which cannot happen in exact arithmetic) and, if so, sets $\sigma_n$ equal to zero; but if the argument is positive, the positive result is purely due to roundoff.

Table 1.3. Standard deviation of x_i := x where x is a constant

x          sigma_10      sigma_20
100/3      1.204 10^-3   8.062 10^-4
1000/29    1.131 10^-3   0

We conclude from the examples that the quality of a computed function value depends strongly on the expression used. To prevent, if possible, the magnification of old errors, one should avoid the following:

• cancellation; that is, subtraction of numbers of almost equal size, which leads to a loss of leading digits and thereby to magnification of the relative error;
• division by a very small number, leading to magnification of the absolute error;
• multiplication by a number of very large magnitude, leading to magnification of the absolute error.

In the formulation and analysis of algorithms, one must look out for these problems to see whether one can avoid them by a suitable reformulation. In general, one can say that caution is appropriate in any computation in which some intermediate result has a much larger or smaller magnitude than a final result. Then a more detailed investigation must reveal whether this really leads to numerical instability, or whether the further course of the computation reduces the effect to a harmless level. Note that, in the case of cancellation, the difference is usually calculated without error; the instability comes from the magnification of old errors. This is shown in the following example.

1.3.2 Example. Subtraction of two numbers with B = 10 and with L = 7 and L = 6, respectively:

(i) L = 7

1st number:   0.2789014 10^3    (no rounding error)
2nd number:   0.2788876 10^3    (no rounding error)
Difference:   0.0000138 10^3    (no rounding error)
Normalized:   0.1380000 10^-1

(ii) L = 6

1st number:   0.278901 10^3     (optimally rounded, relative error < 2 10^-6)
2nd number:   0.278888 10^3     (optimally rounded, relative error < 2 10^-6)
Difference:   0.000013 10^3     (no rounding error)
Normalized:   0.130000 10^-1    (relative error > 5 10^-2)

Although there is no rounding error in the subtraction, the relative error in the final result is about 25,000 times bigger than both the initial errors.

1.3.3 Example. In spite of division by a small number, the following formulas are stable:

(i) f(x) := sin x/x if x ≠ 0, and f(x) := 1 if x = 0, is stable for x ≈ 0:

x          f(x)
0.1        0.9983...
0.01       0.999983...
0.001      0.99999983...
0.0001     0.999999998...
0.00001    1

The reason for the stable behavior is that in both numerator and denominator only one operation (in our usage of the term) is performed; stability follows from the form of the relative error, which is more or less independent of the way in which the elementary functions are implemented. The same behavior is visible in the next case.

(ii) f(x) := (x - 1)/ln x if x ≠ 1, and f(x) := 1 if x = 1, is stable for x ≈ 1:

x             f(x)
1.001         1.00049...
1.0001        1.000049...
1.00001       1.000005...
1.00000001    1.00000000100

Stabilizing Unstable Expressions

There is no general theory for stabilizing unstable expressions, but some useful recipes can be gleaned from the following examples.

(i) In the equality

$$\sqrt{x+1} - \sqrt{x} = \frac{(x+1) - x}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}},$$

the left side is unstable and the right side is stable for x ≫ 1.

(ii) In the equalities

$$1 - \cos x = \frac{\sin^2 x}{1 + \cos x} = 2\sin^2\frac{x}{2},$$

the left side is unstable and both of the right sides are stable for x ≈ 0. In both cases, the difference of two functions is rearranged so that there is a difference in the numerator that can be simplified analytically. This difference is thus evaluated without rounding error.

(iii) The expression $e^x - 1$ is unstable for x ≈ 0. We substitute $y := e^x$, whence $e^x - 1 = y - 1$, which is still unstable. For x ≠ 0,

$$e^x - 1 = (y - 1)\frac{x}{x} = \frac{y - 1}{\ln y}\,x,$$

since ln y = x; see Example 1.3.3(ii). Therefore we can obtain a stable expression as follows:

f(x) = x  if y = 1;   f(x) = x(y - 1)/ln y  if y ≠ 1,   where y = e^x.

(iv) The expression $e^x - 1 - x$ is unstable for x ≈ 0. Using the power series expansion $e^x = \sum_{k \ge 0} x^k/k!$, we obtain

f(x) = e^x - 1 - x  if |x| > c;   f(x) = x^2/2 + x^3/6  otherwise,

in which a suitable threshold c is determined by means of an error estimate: We expect the first formula to have an error of $\varepsilon = B^{1-L}$, and the second, truncated after the terms shown, to have an error of $\approx x^4/24 \le c^4/24$; hence, the natural threshold in this case is $c = \sqrt[4]{24\varepsilon}$, which makes the worst case absolute error minimal (≈ ε), and gives a worst case relative error of $\approx \varepsilon/(c^2/2) \approx \sqrt{\varepsilon/6}$.

Finally, we consider the stabilization of two widely used unstable formulas.

The Solution of a Quadratic Equation

The quadratic equation

$$ax^2 + bx + c = 0 \qquad (a \ne 0)$$

has the two solutions

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

If $b^2 \gg 4ac$, then this expression is unstable when the sign before the square root is the same as the sign of b (the expression obtained by choosing the other sign before the square root is stable). On multiplying both the numerator and the denominator by the algebraic conjugate $-b \mp \sqrt{b^2 - 4ac}$, the numerator simplifies to $b^2 - (b^2 - 4ac) = 4ac$, leading to the alternative formula

$$x_{1,2} = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}}.$$

Now the hitherto unstable solution has become stable, but the hitherto stable solution has become unstable. Therefore, it is sensible to use a combination of both. For real b, the rule

$$q := -(b + \mathrm{sign}(b)\sqrt{b^2 - 4ac})/2, \qquad x_1 := q/a, \qquad x_2 := c/q$$

calculates both solutions by evaluating a stable expression. The computational cost remains the same. (For complex b this formula does not work, because sign(b) = b/|b| ≠ ±1; cf. Exercise 12.)

Mean Value and Standard Deviation

The mean value $s_n$ of $x_1, \dots, x_n$ is defined by

$$s_n = \frac{1}{n}\sum_{i=1:n} x_i,$$

and the standard deviation $\sigma_n$ by

$$\sigma_n = \sqrt{t_n/n} \quad\text{or, sometimes,}\quad \sigma_n = \sqrt{t_n/(n-1)},$$

with

$$t_n = \sum_{i=1:n} (x_i - s_n)^2;$$

cf. Example 1.1.1. This formula for the standard deviation is impractical because all the $x_i$ must be stored. This can be avoided by a transformation to a recursive form in which each $x_i$ is needed only until $x_{i+1}$ is read. Expanding the square gives

$$t_n = \sum_{i=1:n} x_i^2 - 2s_n\sum_{i=1:n} x_i + ns_n^2 = \sum_{i=1:n} x_i^2 - ns_n^2.$$

These are the formulas for calculating the standard deviation stated in most statistics books. Their advantage is that $t_n$ can be computed recursively without storing the $x_i$, and their great disadvantage lies in their instability due to cancellation, because (for data with small variance) $\sum x_i^2$ and $ns_n^2$ usually have several common leading digits; cf. Example 1.3.1(vi).

The transformation into a stable expression is inspired by the observation that the numbers $s_n$ and $t_n$ are expected to change very little due to the addition of new data values $x_n$; therefore, we consider the differences $s_n - s_{n-1}$ and $t_n - t_{n-1}$. We have

$$s_n - s_{n-1} = \frac{1}{n}\Big(\sum_{i=1:n} x_i - ns_{n-1}\Big) = \frac{1}{n}\big((n-1)s_{n-1} + x_n - ns_{n-1}\big) = \frac{\delta_n}{n},$$

where $\delta_n := x_n - s_{n-1}$, and

$$t_n - t_{n-1} = \Big(\sum_{i=1:n} x_i^2 - ns_n^2\Big) - \Big(\sum_{i=1:n-1} x_i^2 - (n-1)s_{n-1}^2\Big) = \delta_n\big((x_n - s_{n-1}) - (s_n - s_{n-1})\big) = \delta_n\Big(\delta_n - \frac{\delta_n}{n}\Big) = \delta_n(x_n - s_n).$$

From these results, we obtain a new recursion for the calculation of $s_n$ and $t_n$: $s_1 = x_1$, $t_1 = 0$, and for i ≥ 2

$$\delta_i = x_i - s_{i-1}, \qquad s_i = s_{i-1} + \delta_i/i, \qquad t_i = t_{i-1} + \delta_i(x_i - s_i).$$

The differences $x_i - s_{i-1}$ and $x_i - s_i$, which have still to be calculated, are now harmless: the cancellation can bring about no great magnification of relative errors because the difference $(x_i - s_i)$ is multiplied by a small number $\delta_i$ and is then added to the (as a sum of many small positive terms, generally larger) number $t_{i-1}$. The new recursion can be programmed with very little auxiliary storage, and in an interactive mode there is no need for storage of old values of $x_i$, as the following MATLAB program shows.

1.3.4 Algorithm: Stable Mean and Standard Deviation

i=1; s=input('first number?>');
t=0;
x=input('next number? (press return for end of list)>');
while ~isempty(x),
  i=i+1;
  delta=x-s;
  s=s+delta/i;
  t=t+delta*(x-s);
  x=input('next number? (press return for end of list)>');
end;
disp(['mean: ',num2str(s)]);
sigma1=sqrt(t/i); sigma2=sqrt(t/(i-1));
disp('standard deviation');
disp(['  of sample:                 ',num2str(sigma1)]);
disp(['  estimate of distribution:  ',num2str(sigma2)]);

(~, & and | code the logical "not," "and," and "or." The code also gives an example of how MATLAB allows one to read and print numbers and text.)

1.4 Error Propagation and Condition

In virtually all numerical calculations many small errors are made. Hence, one is interested in how these errors influence further calculations. As already seen, the final error depends strongly on the formulas that are used. Among other things, especially for unstable methods, this is because of the finite number of digits used by the computer. However, in many calculations, already the input data are known only approximately; for example, when they are determined by measurement. Then even an exact calculation gives an inaccurate answer, and in some cases, as in the following example, the relative error grows strongly even when exact arithmetic is used, and hence independently of the method that is used to calculate the result.

1.4.1 Example. Let

$$f(x) := \frac{1}{1-x}.$$

If x := 0.999, then f(x) = 1000. An analytical error analysis shows that for $\tilde{x} = 0.999 + \varepsilon$, with ε small,

$$f(\tilde{x}) = \frac{1000}{1 - 1000\varepsilon} = 1000(1 + 10^3\varepsilon + 10^6\varepsilon^2 + \dots).$$

Hence, for the relative errors of $\tilde{x}$ and $f(\tilde{x})$, we have

$$\frac{|\tilde{x} - x|}{|x|} \approx \varepsilon \quad\text{and}\quad \frac{|f(\tilde{x}) - f(x)|}{|f(x)|} \approx 1000\varepsilon,$$

respectively; that is, independently of the method of calculating f, the relative error is magnified by a large factor (here about 1000). One calls such problems ill-conditioned. (Again, this is only a qualitative notion without very precise meaning.) For an ill-conditioned problem with inexact initial data, all numerical methods must give rise to large relative errors.

A quantitative measure for the condition of differentiable expressions is the condition number κ, the (asymptotic) magnification factor of the relative error. For a concise discussion of the asymptotic behavior of quantities, it is useful to recall the Landau symbols o and O. In order to characterize the asymptotic behavior of the functions f and g in the neighborhood of $r^* \in \mathbb{R} \cup \{\infty\}$ (a value usually apparent from the context), one writes

f(r) = o(g(r))  when  f(r)/g(r) → 0  as r → r*,

and

f(r) = O(g(r))  when  f(r)/g(r) remains bounded  as r → r*.

In particular,

f(r) = o(g(r)), g(r) bounded as r → r*  ⇒  lim_{r→r*} f(r) = 0,

and

f(r) = O(g(r)), lim_{r→r*} g(r) = 0  ⇒  lim_{r→r*} f(r) = 0.

(One reads the symbols as "small oh of" and "big oh of," respectively.)

Now let $\tilde{x}$ be an approximation to x with relative error $\varepsilon = (\tilde{x} - x)/x$, so that $\tilde{x} = (1 + \varepsilon)x$. We expand $f(\tilde{x})$ in a Taylor series about x, to obtain

$$f(\tilde{x}) = f(x + \varepsilon x) = f(x) + \varepsilon x f'(x) + O(\varepsilon^2) = f(x)\Big(1 + \frac{xf'(x)}{f(x)}\varepsilon + O(\varepsilon^2)\Big).$$

We therefore have for the relative error of f,

$$\frac{|f(\tilde{x}) - f(x)|}{|f(x)|} = \kappa|\varepsilon| + O(\varepsilon^2), \tag{4.1}$$

with the (relative) condition number

$$\kappa = \left|\frac{xf'(x)}{f(x)}\right|.$$

Assuming exact calculation, we can therefore predict approximately the relative error of f at the point $\tilde{x}$ by calculating κ. Neglecting terms of higher order, we have

$$\frac{|f(\tilde{x}) - f(x)|}{|f(x)|} \approx \kappa\,\frac{|\tilde{x} - x|}{|x|}.$$

1.4.2 Examples.

(i) In the previous example f(x) = 1/(1 - x), we have at x = 0.999 the values $f(x) \approx 10^3$ and $f'(x) \approx 10^6$; hence, $\kappa \approx 0.999 \cdot 10^6/10^3 \approx 10^3$ predicts the inflation factor correctly.

(ii) The function defined by $f(x) := x^5 - 2x^3$ has the derivative $f'(x) = 5x^4 - 6x^2$. Suppose that $|\tilde{x} - 2| \le r := 10^{-3}$, so that the relative error of $\tilde{x}$ is about r/2. We estimate the error $|f(\tilde{x}) - f(2)|$ by means of the condition number: For x = 2,

$$\kappa = \left|\frac{xf'(x)}{f(x)}\right| = \frac{2 \cdot 56}{16} = 7,$$

whence

$$\frac{|f(\tilde{x}) - f(2)|}{|f(2)|} \approx \kappa\frac{r}{2} = 7 \cdot 0.0005 = 0.0035$$

and

$$|f(\tilde{x}) - f(2)| \approx 0.0035|f(2)| = 0.0035 \cdot 16 = 0.056.$$

In a qualitative condition analysis, one speaks of error damping or of error magnification, depending on whether the condition number satisfies κ < 1 or κ > 1. Ill-conditioned problems are characterized by having very large condition numbers, κ ≫ 1. From (4.1), it is easy to see when the evaluation of a function of a single variable is ill conditioned. We distinguish several cases.

CASE 1: If $f(x^*) = 0$ and $f'(x^*) \ne 0$ (these conditions characterize a simple zero $x^*$), then κ approaches infinity as $x \to x^*$; that is, f is ill conditioned in the neighborhood of a simple zero $x^* \ne 0$.

CASE 2: If there is a positive integer m such that

$$f(x) = (x - x^*)^m g(x) \quad\text{with } g(x^*) \ne 0, \tag{4.2}$$

we say that $x^*$ is a zero of f of order m because the factor in front of g(x) can be viewed as the limiting, confluent case $x_i \to x^*$ of a product of m distinct linear factors $(x - x_i)$; if (4.2) holds with a negative integer m = -s < 0, then $x^*$ is referred to as a pole of f of
Hence, for the relative errors of x and Ix -xl Ixl -- f (.l:) ~ we have Laale and respectively; that is, independently of the method of calculating I, the relative error is magnified by a large factor (here about 1000). One calls such problems ill-conditioned. (Again, this is only a qualitative notion without very precise meaning.) For an ill-conditioned problem with inexact initial data, all numerical methods must give rise to large relative errors, A quantitative measure for the condition of differentiable expressions is the condition number K, the (asymptotic) magnification factor of the relative error. For a concise discussion of the asymptotic behavior of quantities, it is useful 1.4 Error Propagation and Condition The Numerical Evaluation of Expressions 34 to recall the Landau symbols 0 and O. In order to characterize the asymptotic behavior of the functions f and g in the neighborhood of r" E lR U {oo} (a value usually apparent from the context), one writes fer) = o(g(r» fer) when - - ---+ 0 g(r) r. as r ---+ 35 we have If(x) - f(x)1 Ix -xl ~K---. [x] If(x)1 1.4.2 Examples. (i) In the previous example f(x) = I/O-x), we have atx = .999, the values f(x) ~ 103 and f'(x) ~ 106 ; hence, K ~ .999 106 /1000 ~ 103 predicts and * fer) = f(r). when - - remams bounded as r g(r) O(g(r» ---+ * the inflation factor correctly. (ii) The function defined by f(x) := x 5 - 2x 3 has the derivative f'(x) = 5x 4 -6x 2 • Suppose thalli -21 .::: r := 10- 3 , so thatthe relative errorofi is r /2. We estimate the error If (x) - f (2) I by means of the condition number: For x = 2, r . In particular, fer) = o(g(r», ve- rlim -e-r" g(r) bounded fer) = 0, f'(Xl!=7 K=lxf(x) , and fer) = O(g(r», lim g(r) r-e-r" = O:::::} lim fer) r~r* = o. whence If(x) - f(2)1 (One reads the symbols as "small oh of" and "big oh 0[," respectively.) Now let x be an approximation to x with relative error E = (x - x) / x , so that x = (l + E)X. We expand f(x) in a Taylor series about x, to obtain f(x) = = + EX) f(x) + Exf'(x) + 0(E 2 ) f(x = f(x) ( xf'(x) 1+ -E + f(x) 0(E 2 ) ) = Ixf'(X) f(x) . 0.0035If(2)[ f(x) Ixf'(X) I. f(x) 0.0035, = 0.056. f (x*) = 0 and f' (x*) oF 0 (these conditions characterize a simple zero .r"), then K approaches infinity as x ---+ x*; that is, f is ill conditioned in the neighborhood of a simple zero x* oF O. CASE 2: If there is a positive integer m such that with the (relative) condition number = ~ 7 2r = CASE 1: If IIEI + 0(E 2 ) = KIEI + 0(E 2 ) , K and IfCt) - f(2)1 ~ In a qualitative condition analysis, one speaks of error damping or of error magnification, depending on whether the condition number satisfies K < 1 or K > 1.1ll-conditioned problems are characterized by having very large condition numbers, K » 1. From (4.1), it is easy to see when the evaluation of a function of a single variable is ill conditioned. We distinguish several cases. We therefore have for the relative error of f, If(x) - f(x)1 If(x)1 If(2)1 (4.1) Assuming exact calculation, we can therefore predict approximately the relative error of f at the point x by calculating K. Neglecting terms of higher order, = (x - x*)mg(x) with g(x*) oF 0, (4.2) we say that x* is a zero of f of order m because the factor in front of g(x) can be viewed as the limiting, confluent case Xi ---+ x* of a product of m distinct linear factors (x - Xi); if (4.2) holds with a negative integer m = -s < 0, then x* is referred to as a pole of f of 36 The Numerical Evaluation of Expressions 1.4 Error Propagation and Condition 37 (ii) Condition: We have orde r s. 
In both cases, f'ex) = m(x - x*)m-Ig(x) + (x - x*)mg,(x), f'ex) = so that _x- 2 2v"x- 1 _ 1 v"r l = K g'(x) I =Ixl' -m- + x -x* 1 x-x * -x- I K-+ I +0(1). = +1 +1 Ix . j'(x) f(x) I 1 -2~' if x* =I- 0, Iml ifx*=O. For m = 1, this agrees with what we obtained in Case 1. In general, we see that in the neighborhood of zeros or poles x* =I- 0, the condition number is large and is nearly inversely proportional to the relative distance of x from x*. (This explains the bad behavior near the multiple zero in Example 1.1.8.) However, if the zero or pole is at x* = 0, the condition number is approximately equal to the order of the pole or zero, so that f is well conditioned near such zeros and poles. CASE 3: If t' (x) becomes infinite for x* -+ 0 then f is ill-conditioned at x". For example, the function defined for x > 1 by f (x) = 1 + ,J:X=l has the condition number K = K Therefore, as x -+ x*, 00 Iv"r 1 and therefore, for the condition number K, g(x) I-I 2v"x- 1 + 1 1 - v"r l 2x 2 v"r 1 - IXf'(X) I f(x) = Iml' - _x- 2 Thus, for x ;:::; 0, K ;:::; 4 so that f is well conditioned. However, for x -+ 1, we have K -+ 00 so that f is ill-conditioned. So, the formula for f (x) is unstable but well-conditioned for x ;:::; 0, whereas it is stable but ill conditioned for x ;:::; 1. We see from this that the two ideas have nothing to do with each other: the condition of a problem makes a statement about the magnification of initial errors through exact calculation, whereas the stability of a formula refers to the influence of rounding errors due to inexact calculation because of finite precision arithmetic. It is not difficult to derive a formula similar to (4.1) for the condition number of the evaluation of an expression f in several variables Xl, ... , x n . Let Xi = Xi (1 + Ci) and ICi I ::: C (i = 1, ... , n). Then one obtains the multidimensional Taylor series X --------:===2(x - I + ,J:X=l) and K -+ 00 as x -+ 1. We now demonstrate with an example the important fact that condition and stability are completely different notions. We make use of the derivative f I (x) = ( a aXI 1.4.3 Example. Let f(x) be defined by f(x):=Jrl-I-Jrl+l to obtain the relative error (O<x<l). (i) Numerical stability: For x;:::; 0, we have X-I - 1 ;:::; X-I + 1; hence, there is cancellation and the expression is unstable; however, the expression is stable for x ;:::; 1. a) -f(x), ... , -f(x) aX n , 39 The Numerical Evaluation of Expressions 1.5 Interval Arithmetic where the multiplication dot denotes the scalar product of two vectors. The resulting error propagation formula for functions of several variables is that contains this number, and critical computations are performed with intervals instead of approximate numbers. The operations of interval arithmetic are defined in such a way that the resulting interval always contains the true result that would be obtained by using exact inputs and exact calculations; by careful rounding, this property is assured, even when all calculations are done with finite precision only. In the context of interval calculations, it is more natural to express errors as absolute errors, and we replace a number x; which has an absolute error sr. with the interval [x - r, x + r]. In the following, M is the space of real numbers or a space of real vectors or matrices, with component-wise inequalities. 38 If(x) - f(x)1 -----If(x)1 <~ K max 1<i :» Ix; - x;1 , Ix;1 with the (relative) condition number K = Ixl· Ir (x) I If(x)I' looking just as in the one-dimensional case. 
Again, one expects bad condition only near zeros of f or near singularities of In the preceding discussion, the higher-order terms in e were neglected in the calculation of the condition number K. This means that the error magnification ratio is given closely by the condition number only when the input error e is sufficiently small. Often, however, it is difficult to tell whether a specified accuracy in the inputs is small enough. Using an additional tool discussed in the next section, it is possible to determine rigorous bounds for the propagation error for a given bound on the input error. r. 1.5.1 Definition. The symbol x := [!., x] := {x E M I!. ~ x ~ x} denotes a (closed and bounded) interval in M with lower bound x upper bound x E M, !. ~ x, and E M and denotes the set of intervals over M (of interval vectors if M is a space of vectors, of interval matrices if M is a space of matrices). The midpoint of x, 1.5 Interval Arithmetic Interval arithmetic is a kind of automatic, rigorous error analysis. It serves as an essential tool for mathematically rigorous computer-assisted proofs that involve finite precision calculations. The most conspicuous application is the recent solution by Hales [37] of Kepler's more than 300 year old conjecture that the face-centered cubic lattice is the densest packing of equal spheres in Euclidean three-dimensional space. For other highlights (which require, of course, additional machinery from the applications in question), see Eckmann, Koch, and Wittwer [23], Hass, Hutchings, and Schlafli [42], Mischaikow and Mrozek [62], and Neumaier and Rage [73]. Many algorithms of computational geometry depend on the correct evaluation of the sign of certain numbers for correct performance, and if these are tiny, the correct sign depends on rigorous error estimates (see, e.g., Mehlhorn and Naher [61]). Independent of rounding issues, interval arithmetic is an important tool in global optimization because of its ability to provide global information over wide intervals, such as bounds on ranges or derivatives, or Lipschitz constants (see Hansen [39] and Kearfott [49]). The propagation of errors is accounted for in interval arithmetic by taking as given not a particular inexact number, but an interval of machine numbers midx:= I x := -(x + x) 2- and the radius of x, 1 radx := -(x - x) 2: 0, 2 - allow one to convert the interval representation of an inaccurate number to the absolute error representation, x EX¢::=} Ix - xl ~ radx. Intervals with zero radius (called point intervals) contain a single point only, and we always use the identification [x,x]=x of point intervals with the element of M they contain. [x] := sup{lxllx EX} = sup{x, -!.} 40 The Numerical Evaluation of Expressions 1.51/lterval Arithmetic defines the absolute value of x. For a bounded subset S of M, In particular, OS := [inf S, sup S] is called the interval hull of S; for example, IJ{I, 2} also define the relation x + Y = Of!. + I, x + y}, = IJ{2, I} = [1,2]. We x- y (iii) For monotone cp x E Of!. - y, x - I}' J, cp(x) = O{cpc!.) , cpU)}· Proof The proof follows immediately from the monotonicity of 1.5.2 Remarks. (i) For x E liJR., [x] = max {x , -!.}. However, the maximum may be undefined in the componentwise order of JR.n, and, for interval vectors, the supremum takes the componentwise maximum. (MATLAB, however, uses max for the componentwise maximum.) (ii) The relation :s is not an order relation on liM because reflexivity fails. 
x = We now extend the definition of operations and elementary functions to intervals over M = R For 0 EO, we define X 0 y := Ofx 0 Y I oX E X. and tp, D [!.2, x2 ] if x ~ 0, l-x 2 , x 2 ) if x [0,max{!.2,x 2}] if D e x. I :s 0, Note that the standard functions have well-known extrema, only finitely many in each interval, so that all operations and elementary functions can be calculated for intervals in finitely many steps. Thus we can get an enclosure valid simultaneously for all selections of inaccurate values from the intervals. Y E Y}, J, we define cp(x) := O{cp(x) \ x 0 For nonmonotone elementary functions and for powers with even integral exponents, one must also look at the values of the local extrema within the interval; thus, for example, 2 E = :s y :{:::::::} x :s y on liM, and other relations are defined in a similar way. and for cp 41 EX} when the right side is defined; that is, excluding cases such as (1,2]/(0, I] or ~. (There are extensions of interval arithmetic that even give values to such expressions; we do not discuss these here.) Thus, in both cases, the result of the operation is the tightest interval that contains all possible results with inaccurate operands selected from the corresponding interval operands. 1.5.4 Remarks. (i) Intervals are just a new type of numbers similar to complex numbers (but with different properties). However, it is often convenient to think of an interval as a single inaccurately known real number known to lie within that interval. In this context, it is useful to write narrow intervals in an abbreviated form, in which lower and upper bounds are written over each other and identical figures are given only once. For example, 1.5.3 Theorem. 17.4~~m = [17.458548. 17.463751]. (i) For all intervals, -x (ii) For 0 E {+, -, *, [, "]. X 0 = [-x, -!.]. if x 0 Y is defined for all x EX. Y E y = Of!. 0 2" !. 0 y. X 0 2'. x 0 y}. y, we have (ii) Let f(x) := x-x. Then for x = [1,2], f(x) = [-1.1], and for x := ~l - r, 1 + r], f(x) = [-2r,2r]. Interval arithmetic has no memory; It does not "notice" that in x - x the same inaccurately known number occurs twice, and thus calculates the result as if coming from two different inaccurately known numbers both lying in x. 42 The Numerical Evaluation of Expressions 1.5/nterval Arithmetic (iii) Many rules for calculating with real numbers are no longer valid for intervals. As we have seen, we usually have modules FORTRAN-XSC [95] and (public domain) INTERVAL..ARITHMETIC [50]. The following examples illustrate interval operations and outward rounding; we use B = 10, L = 2, and optimal outward rounding. x- x i= 0, [1.1, 1.2] + [- 2.1, 0.2] and another example is a(b + e) i= ab + ae, except in special cases. So one must be careful in theoretical arguments involving interval arithmetic. [1.1,1.2] - [-2.1,0.2] [1.1, 1.2] * [-2.1,0.2] 43 = [-1.0, 1.4], = [0.9,3.3], = [-2.52,0.24], rounded: [-2.6,0.24], [1.1, 1.2]/[-2.1,0.2] not defined, [-2.1,0.2]/[1.1,1.2] = [-21/11,2/11], rounded: [-2.0,0.19], = [-21/11,2/11], rounded: [-2.0,0.19], = [1.21, 1.44], rounded: [1.2, 1.5], [-2.1,0.2]/[1.1. 1000] Basic books on interval arithmetic and associated algorithms are those by Moore [63] (introductory), Alefeld and Herzberger [4] (intermediate), and Neumaier [70] (advanced). Outward Rounding Let x = [:!:., x] E [JR. If :!:. and x are not machine numbers, then they must be rounded. In what follows, let x = L~., x] be the rounded interval corresponding to x. In order that all elements in x should also lie in x,:!:. 
must be rounded downward and x must be rounded upward; this is called outward rounding. Then, x S; X. The outward rounding is called optimal when x := n{y = [~, y] I y ;2 x, ~"v are machine numbers}. Note that for optimally rounded intervals, the bounds are usually not optimally rounded in the sense used for real numbers, but only correctly rounded. Indeed, the lower bound must be rounded to the next smaller machine number even when the closest machine number is larger, and similarly, the upper bound must be rounded to the next larger machine number. This directed rounding, which is necessary for optimal outward rounding, is available for the results of +, -, *, / and the square root on all processors conforming to the IEEE standard for floating point arithmetic. defined in [45]; for the power and other elementary functions, the IEEE standard prescribes nothing and one may have to be content with a suboptimal outward rounding, which encloses the exact result, but not necessarily in an optimal way. For the common programming languages, there are extensions in which rigorously outward rounded interval arithmetic is easily available: the (public domain) INTLAB toolbox [85J for MATLAB, the PASCAL extension PASCAL-XSC [52], the C extension C-XSC [51], and the FORTAN 90 [-1.2. _1.1]2 [-2.1.0.2]2 = [0,4.41], rounded: [0, 4.5J. J[1.0, 1.5J = [1.0, 1.22 ...], rounded: [1.0, 1.3]. 1.5.5 Proposition. For optimal outward rounding and unbounded exponent range, x s; x· [l where e = B 1- - e. I + c] L. x:: : x when correctly rounded. Proof We have x = Lf, .r], in which j; :s :!:. and It follows from the correctness of the rounding that There are three cases to consider: x > 0, CASE I: x » ing, ° ° E (i.e., :!:. > 0). By hypothesis, :!:. - i x, and x < 0. x :::: :!:. > 0, and by outward round- :s c:!:. and x - x :s eX, - c) and x:s.t (1 + c). whence i :::: :!:.(1 So x = [i, x] S; [,!,(l - c), i(l as asserted. + c)J = [:!:., i][l - e, I + c], The Numerical Evaluation of Expressions 44 CASE 2: ° ° E x (i.e.,!. :s :s x). By hypothesis, K :s !. we can evaluate the absolute values to obtain -K +!. :s -s!. and 1.5 Interval Arithmetic ° :s :s x :s X, whence 45 (iii) Suppose that in the evaluation of f (Zl, ... , zn), the elementary functions cp E J and the power are evaluated only on intervals in which they are differentiable. Then, for small A x- x :s sx, r := max radrx.), giving i we have So rad j'(x}, ... , x,) x= = O(r). [K, x] S; [!.(l + e), x(l + c)] S; [!., xHl - s, 1 + s]. CASE 3: (x < 0, i.e., x< 0) follows from Case 1 by multiplying with -1. D We refer to this asymptotic result by saying that naive interval evaluation has a linear approximation order. Interval Evaluation of Expressions Before we prove the theorem, we illustrate the statements with some examples. In arithmetic expressions, one can replace the variables with intervals and evaluate the resulting expressions using interval arithmetic. Different expressions for the same function may produce different interval results; although this already holds for real arguments in finite precision arithmetic, it is much more pronounced here and persists even in exact interval arithmetic. The simplest example is the fact that x - x is generally not zero, but only an enclosure of it, with a radius of the order of the radius of the operands. We now generalize this observation to the interval evaluation of arbitrary expressions. 1.5.7 Example. 
We mentioned previously that expressions equivalent for real arguments may behave differently for interval arguments. We look closer at the example f (x) = x 2 , which always gives the range. Indeed, the square is an elementary operation, defined such that this holds. What happens with the equivalent expression x * x? 1.5.6 Theorem. Suppose that the arithmetic expression f(z], ... , Zn) E A(ZI, ... , Zn) can be evaluated at Z], ... , z" E ITR and let Xl S; Zl, ... .x, S; Zn· Then: (i) If f(x) := x * x, and x := [-r, r], with 0< r < 1, then f(x) = [-r, r] . [-r, r] = [_r 2 , r 2 ], but the range of values of f is only {f (x) I x E x} = 2 [0, r ]. The reason for the overestimation of the range of values of f is as follows. In the multiplication x * x = O{X * .~ I x E x, E x}, each x is multiplied with every element of x, but in the calculation of the range of values, x is multiplied only with itself, giving a tighter result. Again, this shows the memory-less nature of interval analysis. r, + r], with 0 :s r :s ~,then (ii) If f(x) := x * x and x := x [! - 1 (i) f can be evaluated at x., ... , x, and f(x], ... , x n) S; f(z], ... , zn) (inclusion isotonicity), (ii) {f(z[, ... , Zn) I Zi E z.} S; f(z], ... , zn) (range inclusion). In (ii), equality holds ifeach variable occurs only once in f (as, e.g., the single variable z in the expression Iogtsint I - Z3»). that is, the range of values of f is not overestimated, in spite of the fact that the condition from Theorem 1.5.6(ii) that each variable should occur only once is not satisfied (the condition is therefore sufficient, but not necessary). For the radius of f(x), we have rad f(x) = O(r) as asserted in (iii) of Theorem 1.5.6. (iii) If f(x) := -IX, Z := [s, 1] and x = [s, e + 2r] (r :s (l - s)/2), then 46 The Numerical Evaluation ofExpressions 1.5 Interval Arithmetic 47 Furthermore, radx = rand 1 rad jX = -his 2 + 2r -.jE) = r r ~ ~ r;; e + 2r + .jE 2y e = for r -+ O. However. the constant hidden in the Landau symbol depends on the choice of z and becomes unbounded as s ~ O. (iv) If f(x) := v1x. z:= [0. I], and x := [~r. ~r], with 0 < r < ~, then v'x = f(x) O(r) = {rp(g} I g < (rp(g) Ig E g(x)} E g(z)} = f(z). (ii) For x = [z. ZJ. (ii) follows immediately from (i). If each variable in f occurs only once, then the same holds for g. and by the inductive hypothesis. (g(z) I E z} = g(z). From this, it follows that z [~vr. ~vr] = (f(i) I.r EX}. (f(z) I zE z) = (rp(g(z» I z E z} = (rp(g) I g E g(z») = f(z). which is to be expected from (ii), and radx = r, rad 1 v'x = '2vr i= O(r), that is, the radius of v'x is considerably greater than is expected from the radius of x, leading to a loss of significant digits. This is due to the fact that vIx is not differentiable at x = 0; see statement (iii). Proof of Theorem 1.5.6. Because arithmetic expressions are recursively defined, we proceed by induction on the number of subexpressions. Clearly, it suffices to show (a) that the theorem is valid for constants and variables, and (b) that if the theorem is valid for the expressions g and h, then it is also valid for -g, g 0 h, and rp(g). Now (a) is clear; hence we assume that the theorem is valid for g and h in place of f. We show that Theorem 1.5.6 is valid for rp(g) with rp E J. We combine the interval arguments to interval vectors (iii) Set g := g(x). Because over the compact set g, the continuous function rp attains its minimum and maximum. we have for suitable gI, g2 E g. Thus by the mean value theorem, M = sup Irp'(OI. 
SEg(Z) Then radf(x) The proof for the case reader. 1 :s '2M1g2 -gIl :s Mradg = f = g 0 h with 0 E O(r). 0 is similar and is left to the 0 (i) By definition, and because rp is continuous and g(z) is compact and connected, The Mean Value Form f (z) = rp(g(z» = O{rp(g) I g E g(z») = {rp(g) I g E g(z»). In particular. rpU?) is defined for all g E g(z). Therefore, because by the inductivehypothesisg(x) <; g(z), f(x) isalsodefined,and f(x) = rp(g(x». The evaluation off(XI, ... , x n ) gives, in general, only an enclosure for the range of values of f. The naive method, in which real numbers are replaced with intervals, often gives pessimistic bounds, especially for intervals of large radius. However, for many numerical problems, an appropriate reformulation of standard solution methods produces interval methods that provide bounds that are provably realistic for narrow intervals. For a thorough treatment, see Neumaier [70]; in this book, we look at only a few basic techniques for achieving that. For function evaluation, realistic bounds can be obtained for narrow input intervals using the mean value form, a rigorous version of the linearization method discussed in the previous section. 1.5.8 Theorem. Let f(x) = [tx«, ... , x n ) E .A(Xi, ... , x n ) , and suppose that for the evaluation of f' (X) in Z E ][IRn, the elementary functions and the power are evaluated only on intervals in which they are differentiable. 1fx S; z, then w* := D{f(x) I x EX} S; w := f(x) + f'(x)(x - x), 50 40 30 20 10 rad w* ::: 2 rad f' (x) rad x. L-----:::~~:::::::..------ o (5.1) 1.8 and the overestimation is bounded by o ::: rad w - 49 1.5 1nterval Arithmetic The Numerical Evaluation of Expressions 48 2 1.9 2.1 2.2 2.3 Figure 1.3. Enclosure by the mean value form. (5.2) 1.5.9 Remarks. (i) Let x be inexactly known with absolute error j; r and let x = [x - r, x + r]. Statement (5.1) says that the range of values w* as well as the required value f(x) of f at x lie in the interval w. Instead of only an approximate bound on the error in the calculation of f (x) as r ~ 0 (as in Section 1.4), we now have a strict bound on the range of values of f for any fixed value of the radius. Statement (5.2) says how large the overestimation is. For r := max, rad Xi tending to zero, the overestimation is 0 (r 2 ) ; that is, the same order of magnitude as the neglected terms in the calculation of the condition number. The order of approximation is therefore quadratic if we evaluate f at the midpoint x = mid x and correct it with f'(x)(x - x). This is in contrast with the linear order of approximation for the interval evaluation f(x). (ii) The expression f(x) + f'(x)(x - x) is called the mean value form of f; for a geometrical interpretation, see Figure 1.3. The number computation of the range. Note that q = 1 - O(r) approaches 1 as the radius of the arguments gets small (cf. Figure 1.4), unless the range w* is only O(r 2 ) (which may happen close to a local minimum). (iii) For the evaluation of the mean value form in finite precision arithmetic, it is important that the rounding errors in the evaluation of f (x) are taken into account by calculating with a point interval [x, x] instead of with x. However, x itself need not be the exact midpoint of the interval because, as the proof shows, the theorem is valid for any x E x instead of x. 30 20 10 o 2rad f'(x) radX) q -_ max ( 0, 1 - ------'-----'--radw -10 is a computable quality factor for the enclosure given by the mean value form. Because one can rewrite (5.2) as -20 q . 
rad w ::: rad w* ::: rad w, a q close to 1 shows that very little overestimation occurred in the 0.25 Figure 1.4. Enclosure of form. 0.5 f 0.75 (x) = x 5 - 2x 1 3 1.25 + 10 sin 5x 1.5 1.75 2 over [0, 2] with the mean value 50 The Numerical Evaluation of Expressions Proof of Theorem 1.5.8. We choose x g(t) := fei get) is well defined because the line x mean value theorem f(x) = g(1) = g(O) S; z and we define EX + t(x - Because f + t(x - iiJ 5 ib" x) lies in the interval x. By the = f(x) + f'(n(i 51 (x) E w*, we concl ude for t E [0, 1]; x» + g'(r) 1.5 Interval Arithmetic + «(' - f) rad x. Similarly, ~:::: ~* - (c - f)radx, - x) and therefore with ~ = x + r(x - .\') for some r E [0, 1]. From this, we obtain f(x) E f(x) + f'(x)(x - x) =w for all x E x, whence w* S; w. This proves (5.1). For the overestimation bound (5.2), we first give a heuristic argument. We have + f'(~)(x - f(x) = f(x) i) E f(x) + f'(x)(x - i). If one assumes that ~ and x range through the whole interval x, then because of the overestimate of x by x and of .f'(~) by f'(x) of OCr), one obtains an overestimate of O(r 2 ) for the product. A rigorous proof is a little more technical, and we treat only the onedimensional case; the general case is similar, but requires more detailed arguments. We write c:=f'(x), p:=c(x-i) 2radw = iiJ - w 5 ib" - ~* + 2((' - f)radx = 2 rad w* + 4 rad c rad x. This proves the inequality in (5.2), and the asymptotic bound 0 (r 2 ) follows 0 from Theorem 1.5.6(iii). 1.5.10 Examples. We show that the mean value form gives rigorous bounds on the propagation error for inaccurate input data that are realistic when the inaccuracies are small. (i) We continue the discussion of Example I.4.2(ii), where we obtained for f(x) = x 5 - 2x 3 and Ix - 21 5 10- 3 the approximate estimate If(x) - 16/ ~ 0.056. Because the absolute error of x is r = 10-3, an inaccurate x lies in the interval [2 - r, 2 + r]. For the calculations, we use finite precision interval arithmetic with B = 10 and L = 5 and optimal outward rounding. (a) We bound the error using the mean value form, with .f'(x) = 5x 4 From so that w=f(x)+p. w = f(2) By definition of the product of intervals, jj = Now cd for some c E C, dE X- = [15.943, 16.057], i. i) for some ~ + jj = E f(i) + (c - f(x) - .f'(n(x - x) .f'(n)(x - i) c)(x - i). If(·\') - f(2)/ 50.057. EX. Hence = f(x) + (c - + r])[-r, r] we obtain for the absolute error = f(i) + f'(~)(x - iiJ = f(x) f'([2 - r, 2 6x 2 • = 16 + [55.815, 56.189][-0.001,0.001] x := .\' + d E x, and f(x) + - + c(x - .\') These are safe bounds, realistic because of Example I.4.2(ii). (b) To show the superiority of the mean value form, we also bound the error by means of naive interval evaluation. Here, Wo = f([2 - r, 2 + r]) = [15.896, 16.105] 52 The Numerical Evaluation ofExpressions whence 53 1.6 Exercises In this case, the interval evaluation provides better results. namely If(x) - f(2)1 ::::: 0.105. These are safe bounds, but there is overestimation by a factor of about 2. (c) For the same example, we use naive interval arithmetic on the equivalent expression hex) = x 3(x 2 - 2): WI = h([2 - r, 2 + r]) = [15.944, 16.057] whence If(x) - f(2)1 ::::: 0.057. The result here is the same as for the mean value form (and with higher precision calculation it would be even sharper than the latter); but this is due to special circumstances. In all operations, lower bounds are combined with lower bounds only, and upper bounds with upper bounds only. 
It is not difficult to see that in such a situation, always the exact range is obtained. (For intervals around 1, we no longer have this nice property.) (ii) Let f(x) := x 2 . Then f'(x) = 2x. For x = [-r, r], we have w* = [0, r 2 ] and w = f(O) + f'([-r, = 2[ -r, r][-r, r] = 2[-r 2 , r 2 ] . r])([-r, r] - 0) From this we obtain, for the radii of w* and w, rad w* = r 2 /2 and rad w = 2r 2 ; that is, the mean value fonn gives rise to an overestimation of the radius of w* by a factor of 4. The cause of this is that the range of values w* is itself of size O(r 2 ) . (iii) Let f(x) := l/x. Then f'(x) = -1/x 2 . For the wide interval x = [1,2] (where the asymptotic result of Theorem 1.5.8 is no longer relevant), one has w* = 1], and [!, w = f(1.5) + f'([l, 2])([1, 2] - 1 1 1.5) [1 7] = -1.5 + - [-0.5 , 0.5] = -6'6- . [1,4] wo = f([1, 2]) = [1\] = [~, 1], which, by Theorem 1.5.6, is optimal. When r is not small, it is still true that the mean value form and the interval evaluation provide strict error bounds, but it is no longer possible to make general statements about the size of the overestimation. There may be significant overestimation in the mean value form, and it is not uncommon that for wide intervals, the naive interval evaluation gives better enclosures than the mean value form; a striking example of this is f (x) = sin x. When the intervals are too wide, both the naive interval evaluation and the mean value form may even break down completely. Indeed, overestimation in intermediate results may lead to a forbidden operation, as in the following example. 1.5.11 Example. The harmless function f(x) = 1/(1 - x + x 2 ) cannot be evaluated at x = [0, 1]. The evaluation of the denominator (whose range is [0.75,1]) gives the enclosure 1 - [0,1] + [0,1] = [0,2], and the subsequent division is undefined. The derivative has the same problem, so that the mean value form is undefined. too. Using as denominator (x - 1) * x + 1 reduces the enclosure to [0, I]. but the problem persists. In this particular example, one can remedy the problem by rewriting f as f(x) = 1/(0.75 + (x - 0.5)2) and even gets the exact range (by Theorem 1.5.6(ii», but in more sophisticated examples of the same kind, there is no easy way out. 1.6 Exercises 1. (a) Use MATLAB to plot the functions defined by the following three arithmetical expressions. (i) f(x) := ~ for -1 < x::::: 1 and for Ixl ::::: 10- 15 , d2x- - (ii) f(x):= Jx x ::::: 2· 108 , + l/x-Jx -l/x for 1::::: x::::: lOandfor2·10 7 ::::: (iii) f(x):= (tanx-sinx)/xforO < x::::: 1 and for 10- 8 ::::: x::::: 10-7 • Observe the strong inaccuracy for the second choices. (b) Use MATLAB to print, for the above three functions, the sentences: 54 The Numerical Evaluation of Expressions The answer for f(x) accurate to 6 leading digits is y The short format answer for f(x) is Show that the polynomials gj(x) := y L f/j)x n-i (j = 0, ... , n) i=j:n The long format answer for f(x) is satisfy (a) gj(x) y but with x and y replaced by the numerical values x = 0.1111 and y f(O.llll). (You must switch the printing mode for the results; use the MATLAB functions disp, num2str, format. See also fprintf and sprintf for more flexible printing commands.) 2. Using differential numbers, evaluate the following functions and their derivatives at the points XI = 2 and Xl = 5. Give the intermediate results for (a) and (b) to (at least) three decimal places. For (c), evaluate as a check an expression for f'(x) at x, and Xl. 
(You may work by hand, using a pocket calculator, or use the automatic differentiation facilities of INTLAB.) (a) f(x) := 1~~2' (b) f(x) := eXjtl). (b) go(x) = gj(Z) + gj+l(x)(x - z)(j = 0, ... , n - 1), = Lk<' gk(Z)(X - Z)k + gj(x)(x - zF(j = 1, ... , n), J _ (j+l) _ fl1l(z) ._ (c) f n+ 1 - gj(z) - -ji- (j - 1, ... , n). Hint: For (a), use the preceding exercise; for (c), use the Taylor expansion of f(x). The relevance ofthe complete Homer scheme lies in (c), which shows that (z) / j! of the scheme may serve to compute scaled higher derivatives polynomials. 5. (a) A binary fraction of the form r» with a v E {O, I} (v = 1, ... , k) may be represented as (x + x-I (C) f( x ) .' - X2(x2+4) x+I' 3. Suppose that the polynomial 2+3)(x-l) f(x) L a O.5 r = = 55 1.6 Exercises v v v=O:k L ai x n-i with ao = O. Transform the binary fraction i=O:n is evaluated at the point z using the Homer scheme (0.101 100011h fo := ao, fi:=fi-lz+ai (i=I, ... ,n), fez) := fn. into a decimal number using the Homer scheme. (b) Using the complete Homer scheme (see the previous exercise), calculate, for the polynomial p(x) Show that .L n-i-I fix I=O:n-1 = !f'(Z) f(x)- fez) x-z if x = Z, otherwise. 4. Let f(x) = aox n + ajx n- 1 + ... + an be a polynomial of degree n. The complete Horner scheme is given by f?) :=ai (i=O, ,n), a0 (j=O, ,n), f j(j ) '.- f?) := f/~iz + J;~~I) (i = j 8x l + 5x - 1, the values p(x), p'(x), and p"(x) for x = 2.5 and for x = 4/3. (Use MATLAB or hand calculation. For hand calculation, devise a scheme to arrange the intermediate results so that performing and checking the calculations is easy.) 6. Assuming an unbounded exponent range, prove that, on a computer with an L-digit mantissa and base B, the relative error for optimal rounding is 1 L • bounded by e = 7. Let Wo, x E JR with Wo, x > and for i = 0, 1,2, ... &B ° - ui, + 1, ... , n + 1). = 4x 4 - Wi+l :=x / := n-I Wi «n - ' 1)Wi + w;)/n. 56 The Numerical Evaluation ofExpressions 1.6 Exercises (a) Show that, for Wo =I=- ;:,IX, WI .::IX < < W2 < ... < Wi < ... < ... < ui, < ... < W2 < WI and that the sequences [u»} and {Wi} converge to W := ;:,IX. (b) For x := 4, Wo := 2, n := 3,4 calculate, in each case, Wi, Wi (i ::s 7) and print the values of i , Wi, and Wi in long format. (c) For x := 4, Wo := 2, n := 25,50, 75, 100, calculate, in each case, Wi, Wi (i ::s n) and print the values of i , un . and Wi in long format. Use the results to argue that the iteration is unsuitable for the practical calculation of ;:,IX for large n. 8. (a) Analyze the behavior of the quotient fd(a)/fs of the derivative I, = f'ex) and the forward difference quotient fd(a) = (f(x +a)f(x»/a in exact arithmetic, and in finite precision arithmetic. What do you expect for a ---* O?What happens actually? (b) Write a MATLAB program that tries to discover mistakes in the programming of derivatives. This can be done by computing a table of quotients fd(a)/fs for a number of values of a = 10- i ; say, i = -10: 10, and checking whether the behavior predicted in (a) actually occurs. Test with some correct and some incorrect pairs of expressions for function and derivative. (c) How can one reduce the multidimensional case (checking for errors in the gradient) to the previous situation? 9. (How not to calculate eX.) The value eX for given x E 1R is to be estimated from the series expansion XV L~' v=O:oo One calculates, for this purpose, the sums XV v=O:n V. for n = 0,1,2, .... Obtain a value of n such that ISn(x) - eXI < 10- 16 with x = -20 using a remainder estimate. 
Calculate u = Sn ( - 20) with MATLAB and compare the values so obtained with the optimally rounded value obtained with the MATLAB command v=exp(-20); disp(v-u); How is the bad result to be explained? 10. (a) Rearrange the arithmetic expressions from Exercise 1 so that the smallest possible loss of precision results if x « 1 for (i), x » 1 for (ii), and o < IxI « 1 for (iii). (b) Calculate f(x) from (i) with x := 10- 3 , B = 10, and L = 1,2,4,8 by hand, and compare the resulting values with those obtained from the rearranged expression. 11. The solutions XI and X2 of a quadratic equation of the form ax' with a =I=- + bx + c = 0 0 are given by Xu = (-b ± Jb 2 - 4ac)/2a. (6.1) Determine, from (6.1), the smaller solution of the equation x = (l-ax)2, (a> 0). Why is the formula obtained unstable for small values of a? Derive a better formula and find five distinct values of a that make the difference in stability between the two formulas evident. 12. Find a stable formula for both solutions of a quadratic equation with complex coefficients. 13. Cardano'sformulas. (a) Prove that all cubic equations x 3 + ax' + bx + c = 0 with real coefficients a, b, c may be reduced by the substitution x s to the form = Z - Z3 L , sn(x):= 57 + 3q z - 2r = O. (6.2) In the following assume q3 + r 2 > O. (b) Using the formula z := p - q / p, derive a quadratic equation in p3, and deduce from this an expression for a real solution of (6.2). (c) Show that there is exactly one real solution. What are the complex solutions? (d) Show that both solutions of the quadratic equation in (b) give the same real solution. (e) Determine the real solution in part (b) for r := ±1000 and q := 0.01, choosing for p in tum both solutions of the quadratic equation. For which one is the expression more numerically stable? 58 The Numerical Evaluation of Expressions 1.6 Exercises « Iq I, the real solution is small compared with q; its calculation from the difference p - q I p is therefore numerically unstable. Find a stable version. (g) What happens for q3 + r 2 < O? 14. The standard deviation an = Jtnln can be calculated for given data Xi (i = 1, ... , n) either with (f) For Ir I or with the recursion in Section 1.3. Calculate Xi := (499999899 + i)/3000 a20l for all x E IIIR whose end points are machine numbers. Prove this for at least one of the formulas (6.3) or (6.4). (b) Suppose that B = 10. For which formula is (6.5) false? Give a counterexample, and prove that (6.5) holds for the other formula. (c) Explain purpose and details of the MATLAB code x=input('enter positive x>'); w=max(1,x);wold=2*w; while w<wold wold=~; w=w+(x!w-w)*O.5; end; for the values (i = I, ... ,201) using both methods and compare the two results with the exact value obtained by analytic summation. Hint: Use Li=l:n i = n(n + 1)/2 and Li=l:n;2 = n(n + 1)(2n + 1)/6. 15. A spring with a suspended mass m oscillates about its position of rest. Hooke's law K (r) = - D . x (t) is assumed. When the force K a (t) := A . cos(w"t) acts on the mass m, the equation of motion is What is the limit of w in exact arithmetic? Why does the loop terminate in finite precision arithmetic after finitely many steps? 17. (a) Find intervals x, y, Z E IIIR for which the distributive law x . (y w6 := Dim. It has the solution x(t) = - Aim 2 Wa - 2 W cos(w"t). o For which W a is this formula ill-conditioned (t > 0 and Wo > 0 fixed)? 16. The midpoint x of an interval x E ITIR can be calculated from x:= (x +!.)/2 + z) = x .y +x .z is violated. 
(b) Show that, for all x, y, Z E ITIR, the subdistributive law x . (y with 59 + z) S; x . y +x .Z holds. 18. A camera has a lens with focal length f = 20 mm and a film of thickness 2!'1b. The distance between the lens and the object is g, and the distance between the lens and the film is boo All objects at a distance g are focused sharply if the lens equation I I I f b g -=-+- (6.3) is satisfied for some b E b = [b o - Sb, bo + !'1b]. From the lens equation, one obtains, for a given f and b := bo ± /sb, or from (6.4) I These formulas give an in general inexact value with base B and optimal rounding. (a) For B = 2, XEX x on an L-digit machine g=---- Ilf - lib b·f b-F A perturbation calculation gives (6.5) g ~ g(bo) ± g' (b o) . Ab. 60 The Numerical Evaluation of Expressions Similarly, one obtains, for a given f and g with 's - 1 b = -1/-f---1-/ g = gol ::s tsg, a g·f g - j" 2 Linear Systems of Equations . he film be in order that an object at a given distance is (a) How thick must t _ [3650 3652], [3650,4400], [6000, 10000], ,_ b d l::..b = rad b from sharply focused? For g [10000, 20000](in millimeters), calculate bo - an b using the formulas 1 b 1= l/f - l/g g·f and b2 = - - . g- f Explain the bad results for b 2 , and the good results.for b l: b f . h 1 focused with a gIven 0, or a (b) At which distances are objects s arp Y , . h f 1s film thickness of 0.02 mm? Calculate the interval g usmg t e ormu a 1 gl := l/f -lib b·f and g2:= b-~ with bo = 20.1,20.05,20.03,20.02,20.01 mm, and using the approximation g3:= g(bo) + (-1, 1]· g'(bo) ·l::..b. d Which of the three results is Compare the results from gl, g2, an g3· . b? increasingly in errorfor decreasmg o: ._ (1 _ x)" in . for bounding the range of values of p (x) .19. Wnte a program the interval x:= 1.5 + [-1, 1]/lO i (i = 1, ... ,8) (a) by evaluating p(x) on the interval x, . . n (b) by using the Homer scheme at z = x on the eqUIvalentexpresslO 6 Po(x) = x - 6 x 5 + 15x 4 _ 20x 3 + 15x 2 - 6x + 1, (c) by using Homer's scheme for the derivative and the mean value form po(x) + p~(x)(x - Interpret the results. (You are allowed to x). ~se unrounded metic if you cannot access directed roundmg.) interval arith- In this chapter, we discuss the solution of systems of n linear equations in n variables. At some stage, most of the more advanced problems in scientific calculations require the solution of linear systems; often these systems are very large and their solution is the most time-consuming part of the computation. Therefore, the efficient solution of linear systems and the analysis of the quality of the solution and its dependence on data and rounding errors is one of the central topics in numerical analysis. A fairly complete reference compendium of numerical linear algebra is Golub and van Loan [31]; see also Demmel (17] and Stewart [89]. An exhaustive source for overdetermined linear systems (that we treat in passing only) is Bjorck [8]. We begin in Section 2.1 by introducing the triangular factorization as a compact form of the well-known Gaussian elimination method for solving linear systems, and show in Section 2.2 how to exploit symmetry with the L D L T and the Cholesky factorization to reduce work and storage costs. We only touch the savings possible for sparse matrices with many zeros by looking at the simple banded case. In Section 2.3, we show how to avoid numerical instability by means of pivoting. 
In order to assess the closeness of a matrix to singularity and the Worst case sensitivity to input errors, we look in Section 2.4 at norms and in Section 2.5 at condition numbers. Section 2.6 discusses iterative refinement as a method to increase the accuracy of the computed solution and shows that except for very ill-conditioned systems, one can indeed expect from iterative refinement solutions of nearly optimal accuracy. Finally, Section 2.7 discusses various techniques for a realistic error analysis of the solution of linear systems, inclUding interval techniques for the rigorous enclosure of the solution set of linear systems with uncertain coefficients. The solution of linear systems of equations lies at the heart of modem computational practice. Therefore, a lot of effort has gone into the development 61 62 Linear Systems of Equations 2.1 Gaussian Elimination of efficient and reliable implementations of appropriate solution methods. For matrices of moderate size, the state of the art is embodied in the LAPACK [6] subroutine collection. LAPACK routines are freely available from NETLIB [67], an electronic distribution service for public domain numerical software. If nothing else is said, everything in this chapter remains valid if the set C of complex numbers is replaced with the set lR of real numbers. We use the notations lRmxn and cmxn to denote the space of real and complex matrices, respectively, with m rows and n columns; A ik denotes the (i, k)-entry of A. Absolute values and inequalities between vectors or between matrices are interpreted componentwise. The symbol A T denotes the transposed matrix with and A ( : , k}, respectively. The dimensions of an m x n matrix A are found from [m, n] =size (A). The matrices 0 and J of size m x n are created by zeros (m , n ) and ones (m , n): the unit matrix of dimension n by eye (n) . Operations on matrices are written in MATLAB as A+B, A-B, A*B (=A B), inv (A) J (=A- ) , A\B (= A-I B), and A/B (= AB- I). The componentwise product and quotient of (vectors and) matrices are written as A. *B and A. /B, respectively. However, in our pseudo-MATLAB algorithms, we continue to use the notation introduced before for the sake of readability. (AT)ik = A ki, Suppose that n equations in n unknowns with (real or) complex coefficents are given, 2.1 Gaussian Elimination and A H denotes the conjugate transposed matrix with AIIXI A 21XI Here, eX = a - ib denotes the complex conjugate of a complex number = a + ib(a, b E lR), and not the upper bound of an interval. It is also convenient to write 63 + A 12X2 + + A n X2 + + Alnxn = + A 211Xll = bj , b2 , C( in which Aik, b, E C for i, k = 1, ... , n. We can write this linear system in compact matrix notation as Ax = b with A = (A ik) E cnxll and x = (Xi), b = (bi ) E C". The following theorem is known from linear algebra. and A matrix A is called symmetric if AT = A, and Hermitian if A H = A. For A E lRmxn we have A H = AT, and then symmetry and Hermiticity are synonymous. The ith row of A is denoted by Ai:, and the kth column by A:k. The matrix 2.1.1 Theorem. Suppose that A E cnxn and that b E C". The system of equations Ax = b has a unique solution x E cn if and only if A is nonsingular (i.e., if det A i= 0, equivalently, if rank A = n]. In this case, the solution x is given by x = A -I b. Moreover, the i th component ofthe solution is given by Cramer's rule »,». Lk=l:n(adj det A(i) Xi = - - - = ==='----'---det A detA i=l, ... 
,n, Where the matrix A (i) is obtained from A by replacing the ith column with b, and the adjoint matrix adj A is formed from suitably signed subdeterminants of A. with entries Iii = 1 and I ik = 0 (i i= k), is called the unit matrix; its size is usually inferred from the context. The columns of I are the unit vectors e(k) = h (i = 1, ... , n) of the space C". If A is a nonsingular matrix, then A -I A = AA -I = I. The symbols J and e denote matrices and column vectors, all of whose entries are I. In MATLAB, the conjugate transpose of a matrix A is denoted by A) (and the transpose by A. ) ); the ith row and kth column of A are obtained as A (i , : ) From a theoretical point of view, this theorem tells everything about the SOlution. From a practical point of view, however, solving linear systems of equations is a complex and diverse subject; the actual solution process is influenced by many considerations involving speed, storage, accuracy, and structure. In particular, we see that the numerical calculation of the solution using A-I or Cramer's rule cannot be recommended for n > 3. The computational cost Linear Systems ofEquations 2.1 Gaussian Elimination in both cases is considerably greater than is necessary; moreover, Cramer's rule is numerically less stable than the other methods. Solution methods for linear systems fall into two categories: direct methods, which provide the answer with finitely many operations; and iterative methods. which provide better and better approximations with increasing work. For systems with low or moderate dimensions and for large systems with a band structure, the most efficient algorithms for solving linear systems are direct, generally variants of Gaussian elimination, and we restrict our discussion to these. For sparse systems, see, for example, Duff et al. [22] (direct methods) and Weiss [97] (iterative methods). Gaussian elimination is a systematic reduction process of general linear systems to those with a triangular coefficient matrix. Although Gaussian elimination currently is generally organized in the form of a triangular factorization. more or less as described as follows, the basic idea is already known from school. distinct from zero, and to solve, instead of Ax = b, the triangular system Rx = y, where y is the transformed right side. Because the latter is achieved very easily, this is a very useful strategy. For the compact form of Gaussian elimination, we need two types of triangular matrices. An upper triangular matrix has the form 64 x x L~ (~ 2x +4y + z = 3, -x + y - 3z = 5, Subtracting the first of these from the second gives -z = 5. It remains to solve a system of equations of triangular form: x+y+z=l, 2y - z = 1 -z = 5 and we obtain in reverse order z = -5, y = -2, x = 8. The purpose of this elimination method is to transform the coefficient matrix A to a matrix R in which only the coefficients in the upper right triangle are x x where possible nonzeros, indicated by crosses, appear only in the upper right triangle, that is, where R i k = 0 (i > k). (The letter R stands for right; there is also a tradition of using the letter U for upper triangular matrices, coding upper.) A lower triangular matrix has the form x+y+z=l, 2y-z=l, 2y - 2z = 6. x x x 2.1.2 Example. In order to solve a system of equations such as the following: one begins by rearranging the system such that the coefficients of x in all of the equations save one (e.g., the first) are zero. 
We achieve this, for example, by subtracting twice the first row from the second row, and adding the first row to the third. We obtain the new equations x x 65 x x x x x x x x x J where possible nonzeros appear only in the lower left triangle, that is, where L i k = 0 (i < k). (The letter L stands for left or lower.) A matrix that is both lower and upper triangular, that is, that has the form D = (Dll 0 ) =: Diag(D ll •... , Dnn), o o.; is called a diagonal matrix. For a diagonal matrix D, we have D i k = 0 (i # k). Triangular Systems of Equations For an upper triangular matrix R Yi = L k=l~ E RikXk IRnxn, we have Rx = y if and only if = s;», + L RikXk. k=i+l~ The following pseudo-MATLAB algorithm solves Rx = y by substituting the already calculated components Xk (k > i) into the ith equation and solving for Xi, for i = n, n - 1, ... ,LIt also checks whether one of the divisions is forbidden, and then leaves the loop. (In MATLAB, == denotes the logical 66 Linear Systems of Equations comparison for equality.) This happens iff one of the Ri, vanishes, and because (by induction and development of the determinant by the first column) 2.1.5 Algorithm: Solve a Lower Triangular System ier = 0; for i = 1 : n, if L(i, i) == 0, ier = 1; break; end; y(i) = (b(i) - Lii, I : i - I ) * y(l : i - 1))/ Lti, i); det R = R ll R n · .. R nn , this is the case iff R is singular. 2.1.3 Algorithm: Solve an Upper Triangular System (slow) ier = 0; for i = n : -1 : 1, if Ri, == 0, ier = 1; break; end; Xi = (Yi L RikXk)/Rii; k=i+l:n end; % ier = 0: R is nonsingular and x = R- 1Y % ier = I: R is singular end; % ier = 0: L is nonsingular and y % ier = 1 : L is singular 2.1.4 Algorithm: Solve an Upper Triangular System ier=O; x=zeros(n,l); % ensures that x will be a column for i=n:-1:1, if R(i,i)==O, ier=l; break; end; x(i)=(y(i)-R(i,i+l:n)*x(i+l:n))/R(i,i); end; % ier=O: R is nonsingular and x=R\y % ier=l: R is singular In the future, we shall frequently use the summation symbol for the sake of visual clarity; it is understood that in programming, such sums are to be converted to vectorized statements whenever possible. For a lower triangular matrix L E ~n xn, we have Ly = b if and only if = L LikYk = LiiYi + L k=l:n = L -I b In case the diagonal entries satisfy Li, = I for all i , L is called a unit lower triangular matrix, and the control structure simplifies because L is automatically nonsingular and no divisions are needed. Solving a triangular system of linear equations by either forward or back substitution requires Li=l:n (2i - I) = n 2 operations (and a few less for a unit diagonal). Note that MATLAB recognizes whether a matrix A is triangular; if it is, it calculates A\ b in the previously mentioned cheap way. It is possible to speed up this program by vectorizing the sum using MATLAB's subarray facilities. To display this, we switch from pseudo-MATLAB to true MATLAB. bi 67 2.1 Gaussian Elimination LikYk. k=l:i-l The following algorithm solves Ly = b by substituting the already calculated components Yk (k < i) into the i th equation and solving for Yi for i = 1, ... , n. The Triangular Factorization A factorization A = LR into the product of a nonsingular lower triangular matrix L and an upper triangular matrix R is called a triangular factorization (or l-R-factorization or LV -factorization) of A; it is called normalized if L ii = I for i = I, ... 
, n, The normalization is no substantial restriction because if Lis nonsingular, then LR = (LD- 1 )(DR) is a normalized factorization for the choice D = Diag(L), where Diag(A) := Diag(A 11 , .•• , Ann). denotes the diagonal part of a matrix A E en xn. If a triangular factorization exists (we show later how to achieve this by pivoting), one can solve Ax = b (and also get a simple formula for the determinant det A = det L . det R) as follows. 2.1.6 Algorithm: Solve Ax = b without Pivoting STEP I: Calculate a normalized triangular factorization LR of A. STEP 2: Solve Ly = b by forward elimination. STEP 3: Solve Rx = y by back substitution. Then Ax = LRx = Ly = b and det A = R 11R22 •• · R nn . 68 Linear Systems of Equations 69 2.1 Gaussian Elimination This algorithm cannot always work because not every matrix has a triangular factorization. Lt« = 0 for k A ik = > i and R ik L j=l:n 2.1.7 Example. Let LijRjk = = 0 for k L < i, it follows that LijRjk (i,k=I, ... ,n). (1.2) j=l:min(i,k) Therefore, fork ~ i, fork < i. If a triangular factorization A = LR existed, then L 11R II and L2I R II =J: O. This is, however, impossible. = 0, L II R 12 =J: 0, The following theorem gives a complete and constructive answer to the question when a triangular factorization exists. However, this does not yet guarantee numerical stability, and in Section 2.3, we incorporate pivoting steps to improve the stability properties. e nxn has at most one normalized triangular factorization. The triangular factorization exists iff all leading square submatrices A (m) of A, defined by 2.1.8 Theorem. A nonsingular matrix A E A ll A(m):= ( A~I A:.lm) (m = 1, ... , n), Amm are nonsingular. 1n this case, the triangular factorization can be calculated recursively from Rik := A ik - L j=l:i-1 L ki := (Aki - . L LijR jk LkjRji)/Rii (k~i)'} i=I, ... ,n. (1.1) (k > i), 1=1:1-1 Solving the first equation for Rik and the second for Li; gives equations (1.1). Note that we swapped the indices of L so that there is one outer loop only. (ii) By (i), A has a nonsingular triangular factorization iff (1.2) holds with L ii , Ri, =J: 0 (i = 1, ... , n). This implies that for all m, A(m) = is» R(m) with Li~) = Li, =J: 0 (i = 1, ... ,m) and = ii =J: 0 (i = 1, ... ,m). Therefore, A (m) is nonsingular for m = 1, ... , n. Conversely, if this holds, then a simple induction argument shows that (1.2) holds, and Li, = L i:) =J: 0, Rii = R;;) =J: 0 (i = 1, ... , n). (iii) All of the unknown coefficients Rik(i S k) and LkiCi < k) can be calculated recursively from (1.1) and are thereby uniquely determined. The factorization is unique because by (i) each factorization satisfies (1.1). Rir) R o For a few classes of matrices arising frequently in practice, the triangular factorization always exists. A square matrix A is called an H-matrix if there are diagonal matrices D and D' such that III - DAD'lloo < 1 (see Section 2.4 for properties of matrix norms). We recall that a square matrix A is called positive definite if x H Ax > 0 for all x =J: O. In particular, Ax =J: 0 if x =J: 0; i.e., positive definite matrices are nonsingular. The matrix A is called positive semidefinite if x H Ax ~ 0 for all x. Here, the statements ex > 0 or ex ~ 0 include the statement that ex is real. Note that a nonsymmetric matrix A is positive (semi)definite iff its Hermitian part A sym := !(A + A H ) has this property. 2.1.9 Corollary. 1f A is positive definite or is an H-matrix, then the triangular factorization of A exists. Proof. Proof. 
We check that the submatrices A (i) (1 SiS n) are nonsingular. (i) Suppose that a triangular factorization exists. We show that the equations (1.1) are satisfied. Because A = L R is nonsingular, 0 =J: det A = det L . det R = R ll ... R nn, so R ii =J: 0 (i = 1, ... , n). Because L ii = l , (i) Suppose that A E en xn is positive definite. For x E C (1 sis n) with x =J: 0, we define a vector y E en according to y j = x j (1 s j S i) and 70 Linear Systems ofEquations Yj =0 (i < j :::: n). Then, x'! A (i) x = yH Ay > O. The submatrices A (i) (l :::: i :::: n) are consequently positive definite, and are therefore non- singular. (ii) Suppose that A E e l i Xli is an H-matrix; without loss of generality, let A be scaled so that III - A 1100 < 1. Because III - A (i) 1100 :::: III - A 1100 < I, each A (i) (l :::: i :::: n) is an H-matrix and is therefore nonsingular. 0 The recursion (1.1) for the calculation of the triangular factorization goes back to Crout. Several alternatives for it exist that compute exactly the same intermediate results but in different orders that may have (dis)advantages on different machines. They are discussed, for example, in Golub and van Loan [31]; there, one can also find a proof of the equivalence of the factorization approach and the reduction process to upper triangular form, outlined in the introductory example. In anticipation of the need for incorporation of row interchanges when no triangular factorization exists, the following arrangement for the ith recursion step of (1.1) is sensible. One divides the ith step into two stages. In stage (a), Lkj R ji (k = i + 1, ... , n) Ri i is found, and the analogous expressions A ki in the numerators of Li, are calculated as well. In stage (b), the calculation of the Rik (k = i + 1, , n) is done according to (I.1), and the calculation of the Li, (k = i + 1, , II) is completed by dividing the expressions calculated in stage (a) by R i i . This method of calculating the coefficients Ri, has the advantage that, when Ri, = 0, this is immediately known, and unnecessary additional calculations are avoided. The coefficients A ik (k = i, ... , n) and A ki (k = i + I, ... , n) in (1.1) can be replaced with the corresponding elements of Rand L, respectively, because they are no longer needed. Thus, we need no additional storage locations for the matrices Rand L. The matrix elements stored in the array A at the conclusion of the ith step correspond to the entries of L, R, and the original A, according to the diagram in Figure 2.1. After n steps, A is completely overwritten with the 2.1 Gaussian Elimination 71 elements of Rand L. The calculation of the triangular factorization therefore proceeds as follows. 2.1.10 Algorithm: Triangular Factorization without Pivoting for i = 1 : n for k = i ; n % find Ri i and numerators of A ki = A ki L AkjA j i; Lki, k > i j=Li-l end if A ii == 0, errort 'no factorization exists'); end; for k = i + 1 ; n % find Rib k > i A ik = Aik L AijA j k; j=l:i-l % complete A ki L = Lki, k > i AkilA i i; end end The triangular factorization LR of A is obtainable as follows from the elements stored in A after execution of the algorithm: I A ik Li, = I o fork<i, for k = i, for k > i, for k ::: i, for k < i. 2.1.11 Example. Let the matrix A be given by -4 r1l R L A 1 -1 i i Figure 2.1. The matrices L, R, and A after the ith step. 
3 STEP 1: In the first stage (a), nothing happens; in the second stage (b), nothing is subtracted from the first row above the diagonal, and the first column is divided by All = 2 below the diagonal. In particular, the first row of R is equal to the first row of A. In the following diagrams, coefficients that result from the subtraction of inner products are marked Sand 72 Linear Systems ofEquations coefficients that result from division are marked with D. (a) 4 Step I: ( 42'5 _4 5 (b) -4 ~) ~D 2 1 1 -1 3 1 3 I 25 ( -2 D I D 45 _4 5 2 I -I I I 3 and n STEPS 2 and 3: In stage (a), an inner product is subtracted from the second and third columns, both on and below the diagonal; in stage (b), an inner product is subtracted from the second and third rows above the diagonal, and the second and third columns are divided below the diagonal by the diagonal element. R ~ (~ 4 -6 -4 0 9 2: ~~) 0 0 -9 9 4 -6 5 -4 1 -1 3 95 -3 5 f)( -~ 4 -6 3D -2 -4 lD 3 Step 3: (-~ I 2 95 -1 2 (a) 4 -6 3 -2 -n (b) -4 9 95 2 55 2 -~) ( ~ 3 -2 1 4 -6 3 -2 -4 I 1 2 9 9 2 5D 9 -~ ) 55 2 I STEP 4: In the last step, only Ann remains to be calculated in accordance with (a). (a) Step 4: . R Note that the symmetry of A is reflected in the course of the calculation. Each column below the diagonal, calculated in stage (a), is identical with the row above the diagonal, calculated in stage (b). This implies that for symmetric matrices the computational cost can be nearly halved (cf. Algorithm 2.2.4). Finding the triangular factorization of a matrix A (-~ 5 2: E JRn xn takes (b) (a) Step 2: 73 2.2 Variations on a Theme (-~ The triangular factorization of A have 4 -6 -4 9 3 9 -2 2 I 5 9 2 = -i ) 85 -9 LR has now been carried out; we o o I 5 9 L (i i=l:n 2 1)4(n - i) = _n 3 + 0(n 2 ) 3 operations. Because of the possible division by zero or a small number, the solution of linear systems with a triangular factorization may not be possible or may be numerically unstable. Later, we show that by 0(n 2 ) further operations for implicit scaling and column pivoting, the numerical stability can be improved significantly. The asymptotic cost is unaffected. Once a factorization of A is available, solving a linear system with any right side takes only 2n 2 operations because we must solve two triangular systems of equations; thus, the bulk of the work is done in the factorization. It is a very gratifying fact that for the solution of many linear systems with the same coefficient matrix, only the cheap part of the computation must be done repeatedly. A small number of operations is not the only measure of efficiency of algorithms. Especially for larger problems, further gains can be achieved by making use of vectorization or parallelization capabilities of modem computers. In many cases, this requires a rearrangement of the order of computations that allow bigger pieces of the data to be handled in a uniform fashion or that reduce the amount of communication between fast and slow storage media. A thorough discussion of these points is given in Golub and van Loan [3 n 2.2 Variations on a Theme For special classes of linear systems, Gaussian elimination can be speeded up by exploiting the structure of the coefficient matrices. 74 2.2 Variations on a Theme Band Matrices The factorization of a (2m + I)-band matrices takes ~m2n + O(mn) operations. For small m, the total cost is 0 (n), the same order of magnitude as that for solving a factored band system. 
Hence, the advantage of reusing a factorization to solve a second system with the same matrix is not as big as for dense systems. For the discretization of boundary value problems with ordinary differential equations, we always obtain band matrices with small bandwidth m. For discretized partial differential equations the bandwidth is generally large, but most of the entries within the band vanish. To exploit this, one must resort to more general methods for (large and) sparse matrices, that is, matrices containing a high proportion of zeros among its entries (see, e.g., Duff, Erisman and Reid [22]. To explore MATLAB's sparse matrix facilities, start with help sparfun. A (2m + I)-band matrix is a square matrix A with Aik = 0 for Ii - kl > m; in the special case m = 1, such a matrix is called tridiagonal. When Gaussian elimination is applied to band matrices, the band form is preserved. It is therefore sufficient for calculation on a computer to store and calculate only the elements of the bands (of which there are less than 2mn + n) instead of all the n 2 elements. Gaussian elimination (without pivot search) for a system of linear equations with a (2m + 1)-band matrix A consists of the following two algorithms. 2.2.1 Algorithm: Factorization of (2m + l j-Band Matrices ier = 0; for i = 1 : n, p = max(i - m, 1); q = min(i + m, n); A ii = A ii - L Symmetry AijA j i; j=p:i-l if A i i = 0, ier = 1; break; end; for k = i + 1 : q, A ki = L A ki - AkjA j i; j=p:i-l end; for k = i + 1 : q, = A ik - A ik L = As seen in Example 2.1.11, the symmetry of a matrix is reflected in its triangular factorization L R. For this reason, it is sufficient to calculate only the lower triangular matrix L. 2.2.3 Proposition. If the Hermitian matrix A has a nonsingular normalized triangular factorization A = LR, then A = LDL H where D = Diag(R) is a real diagonal matrix. AijA j k; j=p:i-l A ki 75 Linear Systems ofEquations Proof Let D := Diag(R). Then A ki/ A i i; end; end; 2.2.2 Algorithm: Banded Forward and Back Solve ier = 0; for i = 1 : n, p = max(i - m, 1); ~ i: L- Li.v u r i» YI· = b I j=p:i-l end; for i = n : - 1 : 1, if R i i = 0, ier = 1; break; end; q = min(i + m, n); Xi = (Yi - L j=i+l:q end; RijXj)/ R ii· Because of the uniqueness of the normalized triangular factorization, it follows from this that R = D HL Hand Diag(R) = D H. Hence, D = D H, so that D is real and A = LR = LDL H, as asserted. 0 The following variant of Algorithm 2.1.10 for real symmetric matrices (so that LDL H = LDL T ) exploits symmetry, using only *n 3 + 0(n 2 ) operations, . half as much as the non symmetric version. 2.2.4 Algorithm: Modified L D L T Factorization of a real Symmetric Matrix A ier = 0; <5 = sqrt(eps) for i = I : n, * norm(A, inf); 76 Linear Systems ofEquations piv = L A ii - 2.2 Variations on a Theme can take square roots and obtain, with D I / l := Diag(,J15i"i), L := L oD I / 2 , the 0 relation A = LoDL{f = LoDI/2(Dl/2)H L{f = LL H. AijA ji; j=J:i-l if abs(piv) > 8, A ii = piv: else ier = 1; Au = 8; end for k = i + 1 : n, Aik=A ik- L AijA jk; j=l:i-l A k; = Aid Au; end end %Now A contains the nontrivial elements of L, D and DL T % if ier > 0, iterative refinement is advisable Note that, as for the standard triangular factorization, the L D L H factorization can be numerically unstable when a divisor becomes tiny, and it fails to exist when a division by zero occurs. 
We simply replaced tiny pivots by a small threshold value (using the machine accuracy eps in MATLAB) - small enough to cause little perturbation, and large enough to eliminate stability problems. For full accuracy in this case, it is advisable to improve the solutions of linear systems by iterative refinement (see Algorithm 2.6.1). More careful remedies (diagonal pivoting and the use of 2 x 2 diagonal blocks in D) are possible (see Golub and van Loan [31]). So far, the uniqueness of the triangular factorization has been attained through the normalization L;; = l(i = I, ... , n). However, for Hermitian positive definite matrices, the factors can also be normalized so that A = LL H. This is called a Cholesky factorization of A (the Ch is pronounced as a K, cf. [96]), and L is referred to as the Cholesky factor of A. The conjugate transpose R = L H is also referred to as the Cholesky factor; then the Cholesky factorization takes the form A = R H R. (MATLAB's chol uses this version.) If A is not positive definite, then it turns out that the square root of some pivot element p S 0 must be determined, resulting in a failure. However, it is sometimes meaningful in this case to construct instead a so-called modified Cholesky factorization. This is obtained by replacing such pivot elements and, to achieve numerical stability, all p < 8, by a small positive number 8. A suitable value is 8 = vic II A II, where e is the machine accuracy. (In various implementations of modified Cholesky factorizations, the choice of this threshold differs and may vary from row to row. For details of a method especially adapted to optimization, see Schnabel and Eskow [86].) The following algorithm results directly from the equation A = LL T for real, symmetric A; by careful indexing, it is ensured that the modified Cholesky factorization is overwritten over the upper triangular part of A, and the strict lower triangle remains unchanged. 2.2.6 Algorithm: Modified Cholesky Factorization of a Real Symmetric Matrix A 8 = eiIAII; if 8 == 0, def = -1; return; end; def= 1; for i = 1 : n, ~ = A'i - L A;j; j=l:i-l if ~ < 8, def = 0; ~ Ai;=~; for k = i + 1 : n , A;k = (A;k - . 2.2.5 Theorem. If A is Hermitian positive definite, thenAhas a unique Cholesky factorization A = L L H in which the diagonal elements Li, (i = I, ... , n) are real and positive. e 77 en Proof. For x E i (l :s i :s n) with x =1= 0, we define a vector y E by y j = Xj (1:s j :s i) and v, = 0 (i < i z: n). Then, x" A(i)x = v" Ay > O. The submatrices A (i) (1 :s i :s n) are consequently positive definite and are therefore nonsingular. Corollary 2.1.9 now implies that there exists a triangular factorization A = LoR o = LoDL{f. Because A is positive definite, we have for each diagonal entry D;i = (e(i)H De U) = x HAx > 0 for x = L OH e(i). Thus we L = 8; end; Aj;Ajk)/A u; }=l:t-l end; end; % def = 1: A is positive definite; no modification applied % def = 0: A is numerically not positive definite % def = -1: A is the zero matrix It can be shown that the Cholesky factorization is always numerically stable (see, e.g., Higham [44]). The computational cost of the Cholesky factorization is essentially (i.e., apart from some square roots) the same as that for the L D L T factorization. 
Because 78 79 Linear Systems of Equations 2.2 Variations on a Theme in Algorithm 2.2.6 only the upper triangle of A is used and changed, one could halve the storage location requirement by storing the strict upper triangle of A in a vector oflength n (n - 1)/2 and the diagonal of A (or its inverse) in a vector of length n, and adapting the corresponding index calculations. Alternatively, one can keep a copy of the diagonal of the original A in a vector a; then, one has both A and its Cholesky factor available for later use. This is useful for iterative refinement (see Section 2.6). Therefore, Ilrll~ > Ilr*II~, and equality holds precisely when r" = r. In this case, however, Normal Equations An overdetermined linear system, with more equations than variables, is solvable only in special cases. However, approximation problems frequently lead to the problem of finding a vector x such that Ax ~ b for some vector b, where the dimension of b is larger than that of x. A typical case is that the data vector is supposed to be generated by a stochastic model y = Ax + E, where E is a noise vector with well-defined statistical properties. If the components ofcare uncorrelared, random variables with mean zero and constant variance, the Gauss-Markov theorem (see, e.g., Bjorck [8]) asserts that the best linear unbiased estimator (BLUE) for the parameter vector x can be found by minimizing the 2-norm II y - Ax 112 of the residual. Because then Ily - Axll~ = IIEII~ = l::>? is also minimal, this way of finding a vector x such that Ax ~ y is called the method of least squares, and any vector x minimizing Ily - Axl12 is called a least squares solution of Ax ~ y. The least squares solution can be found by solving the so-called normal equations (2.1 ). 2.2.7 Theorem. Suppose that A E em xn , b E em, and A HA is nonsingular. Then lib - Ax 112 assumes its minimum value exactly at the solution x* of the normal equations A HA(x -x*) = AH(r - r") = 0; and because of the regularity of AHA, the minimum is attained just for x* = x. 0 Thus, the normal equations are obtained from the approximate equations by multiplying with the transposed coefficient matrix. The coefficient matrix A H A of the normal equations is Hermitian because (A HA)H = A H (AH)H = A HA, and positive semidefinite because x HA HAx = II Ax II ~ 2: 0 for all x. If A has rank n, then AHA is positive definite because x H A HAx = 0 implies Ax = 0, and hence x = O. In cases where the coefficient matrix AHA is well-conditioned (such as for the approximation of functions by splines discussed in Section 3.4), the normal equations can be used directly to obtain least squares solutions, by fonning a Cholesky factorization AHA = LL H, and solving the corresponding triangular systems of equations. However, frequently the normal equations are much more ill-conditioned than the underlying minimization problem; the condition number essentially squares. Thus for a problem in which half of the significant figures would be lost because of the condition of the least squares problem, one would obtain no significant figures by solving the normal equations. 2.2.8 Example. In curve fitting, the approximation function is often represented as a linear combination of basis functions. When these basis functions have a similar behavior, the resulting equations are usually ill conditioned. For example, let us try to compute with B = 10, L = 5, and optimal rounding the best straight line through the three points (Si' td = (3.32, 4.32), (3.33,4.33), (3.34,4.34). 
Of course, the true solution is the line s = t + 1. The formula s = x\t + X2 leads to the overdetermined system Ax = b, where (2.1) A= Proof We putr := b - Ax andr*:= b- Ax*. Then (2.1) implies that A Hr" = 0, whence also (r" -r)H r* = (A(x -X*))H r* = (x _X*)H AHr* = O. It b= 4 .32 ) 4.33 . ( 4.34 For the normal equations (2.1), one obtains follows from this that IIrll~ = rHr 3.32 3.33 ( 3.34 + r*H(r* = r*Hr" + (r* - - r) = r*Hr* - (r" - r)H r r)H (r* - r) = IIr* II~ + IIr* - rll~. H A A ~ (33.267 9.99 9.99) 3' AHb ~ (43.257). 12.99 80 Linear Systems of Equations 2.2 Variations on a Theme The Cholesky factorization gives A H A ~ L L T with L ~ (5.7678 1.7320 81 squares solution of Ax ~ b by x=A \ b, so that the casual user need not to know the details. For strongly ill-conditioned least squares problems, even the solution of (2.2) produces meaningless results. Remarkably, it is still possible to get sensible solutions of Ax ~ b, provided one has some qualitative information about the desired solution. The key to a good solution is called regularization and involves the so-called singular-value decomposition (SVD) of A, a factorization into a product A = U E VH of two unitary matrices U and V and a diagonal matrix ~. For details about the SVD, we refer again to the books mentioned in the introduction to this chapter; for details about regularization, refer to Hansen [40] and Neumaier [71]. 0 ) 0.014142 and solving the system of equations gives and from this one obtains the (completely false) "solution" Indeed, a computation of the condition number condg, (A H A) ~ 3.1 . 106 (see Section 2.5) reveals that with the accuracy used for the computations, one could have expected no useful results. Note that we should have fared much better by representing the line as a linear combination s = XI (t - 3.33) + Xz: one moral of the example is that one should select good basis functions before fitting data. (A good choice is usually one in which different basis functions differ in the number and distribution of the zeros within the interval of interest.) The right way to solve moderately ill-conditioned least squares problems is by means of a so-called orthogonal factorization (or Q R -factorization) of A into a product A = Q R of a unitary matrix Q and an upper triangular matrix R. Here, a matrix Q is called unitary if QH Q = t . A unitary matrix with real entries is called orthogonal, and in this case, QT Q = L, From an orthogonal factorization, the least squares solution of Ax ~ y can be computed by solving the triangular system (2.2) Indeed, because QH Q = t, A H A = R H QH QR = R H R, so that R is the (transposed) Cholesky factor of A H, and the solution x* of (2.2) satisfies the normal equations because Matrix Inversion Occasionally, one must compute a matrix inverse explicitly. (However, as we shall see, it is inadvisable to solve Ax = b by computing first A -I and then x=A- 1bl) To avoid later repetition, we discuss immediately the numerically stable case using a permuted factorization, obtained by multiplying the matrix by a nonsingular (permutation) matrix P (see Algorithm 2.3.7). From a factorization P A = LR, it is not difficult to compute an explicit inverse of A. If we introduce the matrices R := R- I and A := R- I L-I = RL -I, we can split the computation of the inverse matrix A-I = R- I L -I P = AP into three steps. 2.2.9 Algorithm: Matrix Inversion STEP STEP STEP STEP 1: Compute a permuted triangular factorization P A = L R (see Algorithm 2.3.7). 
2: Determine R from the equation RR = l . 3: Determine A from the equation AL = k 4; Form the product A -I = AP. The details for how to proceed in steps 2 and 3 can be obtained as in Theorem 2.1.8 by expressing the matrix equations in components, RiiR i i For the stable computation of orthogonal factorizations, we refer to the books mentioned in the introduction to this chapter. MATLAB provides the orthogonal factorization via the routine qr, but it automatically produces a stable least L k j Rj k j=i:k = 1, = 0 ifi<k, 82 Linear Systems oj Equations 2.3 Rounding Errors, Equilibration, and Pivot Search and solving for the unknown variables in an appropriate order (see Exercise 14). For the calculation of an explicit inverse A-I, we need 2n 3 + 0 (n 2 ) operations (see Exercise 14); that is, three times as many as for the triangular factorization. For banded matrices, where the inverse is full, this ratio is even more pessimistic. In practice, the computation ofthe inverse should be avoided whenever possible. The reason is that, given a factorization, the solution of the resulting two triangular systems of equations takes only 2n 2 + O(n) operations. This is essentially the same amount as needed for the multiplication A-I b when an explicit inverse is available. However, because the inverse is much more expensive than the factorization, there is no incentive to compute it at all, except in circumstances where the entries of A -1 are of independent interest. In other applications where matrix inverses are used, it is usually possible (and profitable) to avoid the explicit inverse, too. If rounding errors arise from other arithmetic operations in addition to multiplication, one finds similar expressions. Instability always arises if the magnitude of a term L ij R jk is very large compared with the magnitudes of the elements of A. In particular, if a divisor element R;; has a very small magnitude, then the magnitudes of the corresponding elements Li, (k > i) are large, causing instability. For a few classes of matrices (positive definite matrices and H-matrices; cf. Corollary 2.1.9), it can be shown that no instability can arise. For general matrices, we must ensure stability by avoiding not only vanishing, but also small divisors R;;. This is achieved by row (or sometimes column) permutations. One calls the row permuted into the ith row the ith pivoting row and the resulting divisors Ri, the pivot elements; the search for a suitable divisor is called the pivot search. The most popular pivot search is column pivoting, also called partial pivoting, which usually achieves stability with only 0(1/2) additional comparisons. (The stability of Gaussian elimination for balanced matrices may be further improved by so-called complete pivoting, which allows row and column permutations. Here, the element of greatest magnitude in the remainder of the matrix becomes the pivot element. However, the cost is considerably higher; 0(n 3 ) comparisons are necessary.) In column pivoting, one searches the remainder of the column under consideration for the element of greatest magnitude as the pivot element. Because in the calculation of the L ki (k > i) the pivot element is a divisor, it follows that, with column pivoting, ILki I::: 1. This implies that 2.3 Rounding Errors, Equilibration, and Pivot Search We now consider techniques that ensure that the solution of linear systems is computed in a numerically stable way. 
Rounding Errors and Equilibration In finite precision arithmetic, the triangular factorization is subject to rounding errors. We first present a short and very simplified argument that serves to identify the main source of errors; a detailed error analysis is given at the end of this section (Theorem 2.3.9). For simplicity, we suppose first that a rounding error appears only in a single multiplication LijR j k; say, for i = p, j = q, k = r. In this case, we have Rpr = Apr - L LpjR j r - LpqRqrO- E:pr) j=l:p-1 Hq = Apr L + LpqRqrE:pr LpjRjr> j=l:p-1 where IE: pr I ::: B 1- L. If we define the matrix A obtained by changing in A the (p, r)-entry to r» the triangular factorization LR (in which k., := R i k for (i, k) # (p, that we obtain instead of L R may be interpreted as the exact triangular factorization of the matrix A. If IL pq R qr I » Apr; then this corresponds to a large relative perturbation IApr - A I" 1/1 A rr I of Apr' and the algorithm is unstable. IRikl::: IAikl + L 83 [Rjkl, j=l:i-1 so that by an easy induction argument. all Ri, are bounded by the expression K II maxi,k IAik I, where the factor K II can, in unusual cases, be exponential in n, but typically varies only slowly with n. Therefore, if A is balanced (i.e., if all nonzero entries of A have a similar magnitude), then we typically have numerical stability. To balance a matrix with entries of widely different magnitudes, one usually Uses some form of scaling. In compact notation, scaling of a matrix A corresponds to the multiplication of A by some diagonal matrix D; left multiplication by D corresponds to row scaling (i.e., multiplication of each row by the corresponding diagonal element), and right multiplication by D corresponds to column scaling (i.e., multiplication of each column by the corresponding diagonal element). For example, if _(ac b)d , _(e0 J0) A- D- 84 2.3 Rounding Errors, Equilibration, and Pivot Search Linear Systems of Equations 85 scaling matrices then ea DA= ( fc eb) fd ' AD = (ea ec D b ffd ) . = diag(1, 10- 15, 10- 20 , 1O-2S ) , D' = diag(1 O-S, 10- 20 , 1O-2S , 10- 30 ) If one uses row scaling only, the most natural strategy for balancing A is row equilibration (i.e., scaling each row such that the absolute values of its entries sum to 1). This is achieved for the choice and the permutation matrix (cf. below) o o o o o o 1 1 (3.1) An optimality property for this choice among all row scalings (but not among two-sided scalings) is proved in Theorem 2.5.4. (Note that (3.1) is well defined unless the ith row of A vanishes identically; in this case, we may take w.l.o.g. Di, = 1.) In the following MATLAB program, the vector d contains the diagonal entries of D. we get the strongly diagonally dominant matrix 2.3.1 Algorithm: Row Equilibration which is perfectly scaled for Gaussian elimination. (For a general method to scale and permute a matrix to so-called I -matrix form, where all diagonal entries are 1 and the other entries have absolute value of at most I, see Olschowka and Neumaier [76]). d=ones(n,l); for i=l:n, dd=sum(abs(A(i,:)); i f dd>O, d(i)=l/dd; A(i,:)=d(i)*A(i,:); end; end; 2.3.2 Example. Rice [82] described the difficulties in scaling the matrix A = ~0+1O ( = 1O- IS 10~s 1 IO- S 10- 40 10- 30 ( 1O- I S 10- 40 1O- IS I IO- S S IO- 30 ) 10~O-SS ' Row Interchanges Although it usually works well, scaling by row equilibration only is sometimes not the most appropriate way to balance a matrix. 
1 10+ 20 P DAD' 1 10+20 10+20 1 10+40 10+ 10 1 10+40 lO+ s0 ~0+40) lO+s0 We now consider what to do when no triangular factorization exists because some divisor Ri, is zero. One usually proceeds by what is called column pivoting (i.e., performing suitable row permutations). In linear algebra, permutations are written in compact notation by means of permutation matrices. A matrix P is called a permutation matrix if, in each row and column of P, the number I occurs exactly once and all other elements are equal to O. The multiplication of a matrix A with P on the left corresponds to a permutation of rows; multiplication with P on the right corresponds to a permutation of columns. For example, the permutation matrix . I leaves A to a well-behaved matrix; row equilibration and column equilibration emphasize very different components of the matrix, and it is not at all clear which elements should be regarded as large or small. However, using the E C 2 x 2 unchanged; the only other 2 x 2 permutation matrix is P2 = (~ ~), 86 2.3 Rounding Errors, Equilibration, and Pivot Search Linear Systems of Equations and we have (4a) ) u 4 -4 -6 9 1 2.3.3 Example. We replace, in the matrix A of Example 2.1.11, the element = -1 with A 33 = -11/2. The calculation is then unchanged up to and including Step (2b), and we obtain the following diagrams. (2b) (-~ 4 -6 3 -4 9 11 -2 -Z- 1 3 2 -~) ( ~ 3 -2 1 1 2 3 o 1 5S 2 (3b) Permutation 4 -4 4 -4 -6 9 -6 9 5 I 1 2 3 -2 -~) ( ~ o 2 3 -2 2 3 -2 5 2 aD -~ ) 15 2 3 2 is overwritten over A. What happens if all elements in the remaining column are equal to zero? To construct an example for this case, we replace the elements A 33 = -1 and A 43 = 3 with A 33 = -11/2 and A 43 = 1/2 and obtain the following diagrams (-~ In the next step, division by the element R 33 = A33 should occur, among other things. In this example, however, A 33 = 0 (i.e., the division cannot be performed). This makes an additional step necessary in the previously mentioned algorithm: we must exchange the third row with the fourth row. This corresponds to a row permutation of the initial matrix A. If one forms the triangular factorization of the matrix P A, then one obtains just the corresponding permuted diagrams for A, including Step (3a). Because the third diagonal element A 33 is different from zero, one can continue the process. 5S After Step (4a), the triangular factorization of P A with (3a) 4 -4 -6 9 _~ OS 2 -~ 5 2 -2 A symmetric permutation of rows and columns that moves diagonal entries to diagonal entries is given by replacing A with P ApT; they may be useful, too, because they preserve symmetry. Permutation matrices form a group (i.e., products and inverses of permutation matrices are again permutation matrices). This allows one to store permutation matrices in terms of a product representation by transpositions, which are simple permutations that only interchange two rows (or columns), by storing information about the indices involved. A 33 87 (2b) -4 -6 9 4 3 -2 11 -ZI 1 2 2 ( (3a) -4 -6 9 3 05 -2 -~) ~ 3 I 4 -2 1 1 2 05 -~) Again, we obtain a zero on the diagonal; but this cannot be removed from the diagonal by an exchange of the third and fourth rows. If we simply continue, then the division in Step (3b) leads to L43 = 0/0 (i.e., any value can be taken for L 43). For simplicity, we set L43 = 0 and continue the recursion. 
~N ( ~~ ~1 =i -~ -i,) (-~ =i -~ -i) 1 2 OD 1 I 1 2 0 15 2 Indeed, we can carry out the factorization process formally, obtaining, however, R33 = 0 (i.e., R, and therefore A, is singular). It is easy to see that the permuted triangular factorization can be constructed for arbitrary matrices A in exactly the same fashion as illustrated in the examples. Therefore, the following theorem holds. 88 Linear Systems of Equations 2.3 Rounding Errors, Equilibration, and Pivot Search 2.3.4 Theorem. For each n x n matrix A there is at least one permutation nonnalization) and DRD'. Thus, R scales as A. Hence, in the ith step, the pivot row index j would be selected by the condition matrix P such that the permuted matrix P A has a normalized triangular factorization P A = L R. Noting that det P det A = det(P A) = det(LR) = det L det R = det Rand using the permuted system PAx = Pb in place of Ax = b, we obtain the following final version of Algorithm 2.1.6. One sees that in this condition, the D;i cancel and only the row scaling affects the choice of the pivot row. Writing d k := D kk , we find the selection rule Choose j :::: i such that djA ji 2.3.5 Algorithm: Solve Ax = b with Pivoting STEP I: Calculate a permuted normalized triangular factorization P A = L R. STEP 2: Solve Ly = Pb. STEP 3: If R is nonsingular, solve Rx = y. The resulting vector x is the solution of Ax = b. Moreover, det A = det Rj detP. 2.3.6 Remarks. (i) If several systems of equations with the same coefficient matrix A are to be solved, then only steps 2 and 3 must be repeated. Thus we save a considerable amount of computing time. (ii) P is never formed explicitly, but is represented as a product of single row exchanges (transpositions); the number of the row exchanged with the ith row in the ith step is stored as the ith component of a pivot vector. (iii) As a product of many factors, det R is prone to overflow or underflow, unless special precautions are taken to represent it in the form a . M k for some big number M, and updating k occasionally. The determinant det P is either + 1 or -1, depending on whether an even or an odd number of exchanges were carried out. (iv) log [det A I = log [det RI is needed in many statistical applications involving maximum likelihood estimation. Pivot Search with Implicit Scaling As we have seen, scaling is important to obtain an equilibrated matrix for which column pivoting may be applied sensibly. In practice, it is customary to avoid explicit scaling. Instead, one uses for the actual factorization the original matrix and performs the scaling only implicitly to determine the pivot element of largest absolute value. If we would start the factorization with the scaled matrix DAD' in place of A, the factors Land R would scale to DLD- (because of J 89 = max{dkA ki I k :::: i}. This selection rule can now be used together with the unsealed A, and one obtains the following algorithm. 
2.3.7 Algorithm: Permuted Triangular Factorization with Implicit Row Equilibration and Column Pivoting % find scale factors d=ones(n,l); for i=l:n, dd=sum(abs(A(i,:»; if dd>O, d(i)=l/dd; end; end; %main loop for i=l:n, % find possible pivot elements for k=i:n, A(k,i)=A(k.i)- A(k,l:i-l)*A(l:i-l.i); end; % find and save index of pivoting row [val,j]=max(d(i:n) .*abs(A(i:n,i»);j=i-l+j;p(i)=j; % interchange rows i and j if j >i, A([i,j], :)=A([j,i],:); d(j)=d(i); % d(i) no longer needed end; for k=i+l:n, %find R(i,k) A(i,k)=A(i,k)-A(i,l:i-l)*A(l:i-l,k); % complete L(k,i) if A(i,i) -= 0, A(k,i)=A(k,i)/A(i,i); end; end; end; 90 Linear Systems of Equations 2.3 Rounding Errors, Equilibration, and Pivot Search 91 Note that is necessary to store the pivot indices in a vector p because this information is needed for the solution of a triangular system Ly = Pb; the original right side b must be permuted as well. Land Rand Sne .::: 1, then 2.3.8 Algorithm: Solving a Factored Linear System Usually (after partial pivoting), the entries of ILIIRI have the same order of magnitude as A; then, the errors in Gaussian elimination with partial pivoting are comparable to those made in forming products Ax. However (cf. Exercise 11), occasionally unreasonable pivot growth may occur such that ILII RI has much bigger entries than A. In such cases, either complete pivoting (involving row and column interchanges to select the absolutely largest pivot element) or an alternative factorization such as the QR-factorization, which are provably stable without exception, should be used. To prove the theorem. we must know how the expressions of the form % forward elimination for i = I : n, j = Pi; if j > i, b([i, j]) = b([j, i]); end; Yi = b, L AikYk; k=1:i-1 end; % back substitution ier = 0; for i = II : -I : 1, if A ii == 0, ier = 1; break; end; Xi = (Yi L A'kXd/A i,; Ib - Axl.::: 5n£IL/lRllxl. (3.2) k='+l:n end; % ier = 0: A is nonsingular and x % ier = I: A is singular = A -I b that Occur in the recurrence for the triangular factorization and the solution of the triangular systems are computed. We assume that the rounding errors in these expressions are equivalent to those obtained if these expressions are calculated with the following algorithm. Rounding Error Analysis for Gaussian Elimination Rounding errors imply that any computed approximate solution x is in general inaccurate. Already, rounding the components of the exact x* gives x with Xk = x;(1 + Ed, I£kl .::: 8 (8 the machine precision), and the residual b - Ai, instead of being zero, can be as large as the upper bound in 2.3.10 Algorithm: Computation of (3.2) So = c; for j Sj Ib - Axl = IA(x* -x)l.::: IAllx* -xl'::: £IAllx*l· We now show that in the execution of Gaussian elimination, the residual usually is not much bigger than the unavoidable bound just derived. Theorem 2.5.6 then implies that rounding errors have the same effect as a small perturbation of the right side b. We may assume w.l.o.g. that the matrix is already permuted in such a way that no further permutations are needed in the pivot search. 2.3.9 Theorem. Let i be an approximation to the solution of the system of linear equations Ax* = b with A E e nxn and bEen, calculated in finiteprecision arithmetic by means of Gaussian elimination without pivoting. Let e be the machine precision. If the computed triangular matrices are denoted by == 1 : m - 1, == Sj_1 - ajb j; end; bm = Sm-I/a m; Proof of Theorem 2.3.9. We proceed in 5 steps. STEP I: Let e, := i£/(1 - i£)(i < 5n). 
Then because for the positive sign, I.B1=111+a( 1 +1])I':::£+£i(I+£)= (i+1)£ . -<£+1 [ , 1 - 1£ 92 Linear Systems ofEquations 2.3 Rounding Errors. Equilibration, and Pivot Search and for the negative sign, a - 17 1 Si+S - - <--= - 1131 - l 1+ 17 - 1- S C = STEP 4: Applying (3.5) to the recursive formulae of Crout (without pivoting) gives, because e, ~ e; for i S n, 2 (i+1)s-is <Si+l· 1 - (i + l)s + ie? - STEP 2: We show that with the initial data iu ; hi (i calculated values Sj satisfy the relation L aih,(l + ai) + Sj(l + .Bj) (j = 1, ... , m - Sn = 0, I, ... , m - 1) J-1./ J-I./ IAH - j'f:, I'jR;;/ <0'. ,'f:, II"IIR"I (3.4) if i ~ k, so for the computed triangular factors Land for suitable a, and .Bi with magnitude ~Si' This clearly holds for j = 0 with.Bo := O. Suppose that (3.4) holds for some j < m - 1. Because the rounding is correct, Sj(l IAik- ~. LiJ?jk! ~ ~.ILijIIRjkl, 1) and C, the i=l:j Solving for Sj and multiplying by (l + .Bj) = Sj+1 (l = S;+I (l with laj+II,I.B;+11 obtain Similarly, for the computed solutions Ly = b, Rx = y, From the inductive hypothesis (3.4), we y, i R of A, of the triangular systems Ib' - ,I;, I" Y, I<0 e; ,'f:, II" 11M + .Bj) gives, because of (3.3), + .Bj)(l + v)-I + aj+Jhj+ 1(l + .Bj)(l + J.L) + .BJ+z) + aj+1b j+ 1 (l + aj+l) ~S;+I' 93 IY, - ,'f:. hi'l ,'f:. IR"lIi,l, <0£. and we obtain Ib - LYls snlLIIY!. 151 - Ril ~ snlRllil. L a,bi(l + ai) + Sj(l + .Bj) i=l:j = L ai b,(l+ai)+s;+I(l+.Bj+l). C= STEP 5: Using steps 3 and 4, we can estimate the residual. We have i=l:j+1 Ib - Ail STEP3: From b m = (sm-Jlam)(l + 17), 1171 ~ s, and (3.4), it follows that = Ib - L y - L(Ri - y) + (LR - A)il Sib - LYI + ILllRi - yl + ILR - Allil ~ sn(ILllyl + ILllRllil + ILIIRllil) ~ sn(ILlly - Ril + 3ILIIRllxl) L aih,(l +ai)+ambm(l+.Bm-z)(l +17)-1 = L a,b,(l + ai), c= i=l:m--l S sn(sn '=I:m + 3)ILIIRII.tl. By hypothesis, Sne < I, so e; = ns/(l - ns) ~ ~ns < 1, whence the theorem is proved. whence, by (3.3), lam I ~ Sm; we therefore obtain (3.5) Rounding error analyses like these are available for almost all numerical algorithms that are widely used in practice. A comprehensive treatment of error analyses for numerical linear algebra is in Higham [44J; see also Stummel 94 Linear Systems of Equations and Hainer [92]. Generally, these analyses confirm rigorously what can be argued much simpler without rigor, using the rules of thumb given in Chapter 1. Therefore, later chapters discuss stability in this more elementary way only. 2.4 Vector and Matrix Norms because, for example, [x ] - lIyll = Ilx ± Y =f yll -llyll ::: IIx ± YII + IlylI - Ilyll = IIx±yll· 2.4 Vector and Matrix Norms Estimating the effect of a small residual on the accuracy of the solution is a problem independent of rounding errors and can be treated in terms of what is called perturbation theory. For this, we need new tools; in particular, the concept of a norm. A mapping II . II : en ---+ lR is called a (vector) norm if, for all x, y E en: A metric in en is defined through d (x, y) = Ilx- Y II. (Restricted to real 3-space ~3, the "ball" consisting of all x with Ilx-xoll::: e is, for II· liz, 11·1100, and 11./1, a sphere-shaped, a cube-shaped, and an octahedral-shaped s-neighborhood of xo, respectively.) The matrix norm belonging to a vector norm is defined by ° (i) [x II ~ with equality iff x = 0, (ii) Ilaxll = lalllxll fora E e, (iii) Ilx 95 IIAII := sup{IIAxll I [x] = I}. 
+ yll ::: IIxll + lIyll· The relations The most important vector norms are the Euclidean norm IIxliz := )1; Ix;[Z = ../xHx, IIAxll:::IIAllllxll for all A Ee x n , IIABII::: IIAIIIIBIl IlaA11 = lalliAl1 the maximum norm for all A, for all A E XEe n , BE e e"x", xn a E , e, 11/11 = 1 Ilx1100:= max Ix;[, ;=I"",n can be proved easily from the definition. The following matrix norms belong to the vector norms II . liz, II . 1100, and 11·111. and the sum norm IIxlll:= L IXil· i=l:n IIAllz = sup{vXHAHAx IxHx = I} = jmaximal eigenvalue of A H A; In a finite-dimensional vector space, all norms are equivalent, in the sense that any s-neighborhood contains a suitable £' -neighborhood and is contained in a suitable £/1 -neighborhood (cf. Exercise 16). However, for practical purposes, different norms may yield quite different measures of "size." The previously mentioned norms are appropriate if the objects they measure have components of roughly the same order of magnitude. The norm of a sum or difference of vectors can be bounded from below as follows: [x ± yll ~ is called the spectral norm; by definition, for a unitary matrix A, IIAliz IIA-111 z = 1. II A 1100 is called the mw sum norm, and [x] - lIyll IIAII] and Ilx ± yll ~ Ilyll - IIxll ~ max [ ~ IA;, II; ~ I, ... , n ) =max{~IAikllk= I, ... ,n} the column sum norm. We shall derive only the formula for the row sum norm. 96 Linear Systems of Equations 2.4 Vector and Matrix Norms We have 97 and therefore II A 1100 = sup{IIAxll oo Illxlloo = l} IIAII~ = (AX)H (Ax) = L IAix1 2 :::: IIAi:II~ = L L IAikl2. i.k: = sup max ILAikXkl Ixkl::o I :::: max I k L IAikl, x k I and equality holds because the maximum is attained for Xk = IAikl/ Aik. Obviously, the calculation of the spectral norm is more difficult than the calculation of the row sum norm or the column sum norm, but there are some useful, easily computable upper bounds. 2.4.1 Proposition. If A E Equality holds when equality holds in the Cauchy-Schwarz inequality (i.e., when for all i the vectors (AiJ T and are linearly dependent). This is equivalent to saying that the rank of the matrix A is :::: 1. Thus A may be written as an outer product e n x n , then (iii) If Ai are the eigenvalues of the Hermitian matrix A, then A HA = A 2 has the eigenvalues A~. Let Amax be the eigenvalue of A of maximum absolute value. Then, by (4.1), (i) IIA 112 :::: JII A IIIIIA 1100, with equality, e.g., when A is diagonal. (ii) IIAII2:::: IIAIIF := JLi.k IAik12; the equality sign holds if and only if A = xyH with X, Y E en (i.e., the rank of A is at most 1). IIAIIF is called the Frobenius norm (or Schur norm) of A. (iii) If A is Hermitian then IIAII2:::: IIAllforeverymatrixnorm 11·11. Proof We use the fact that for any matrix norm II . II and for each eigenvalue A of a matrix A we have IAI :::: II A II, because for a corresponding eigenvector x we have IAI = IIAxIl/llxll = II Ax II/IIx II :s IIAII· IIAI12 = 2.4.2 Proposition. If II I - A II :s fJ < 1, then A is nonsingular and IIA-III:::: 1/(1- fJ)· Proof Suppose that Ax = O. Then IIxll = jjx - Axil = IIU - A)xll:::: III - AII·llxll + 11/11 :::: IIA-III·III - All + 11/11 :::: fJ11A-111 + I that IIA-III:s 1/(1- fJ). IIAI12 = s f3l1xll. Because fJ < 1 we conclude that [x II = 0, hence x = O. Thus Ax = 0 has only the trivial solution, so A is nonsingular. Furthermore, it follows from IIA-III :::: IIA-' -III (ii) By definition, o IAmaxl:::: IIAII. Norms may serve to prove the nonsingularity of matrices close to the identity. 
(4.1) (i) Let Amax be the eigenvalue of A H A of maximum absolute value. Then jA~ax = o sup J(Ax)H(Ax) IIxliFI M-Matrices and H-Matrices x and the supremum is attained for some x = because of the compactness of the unit ball. By the Cauchy-Schwarz inequality, T~ere are important classes of matrices which satisfy a generalization of the Cnterion in Proposition 2.4.2. We call A an Hrmatrix if diagonal matrices D I d D 2 exist such that III - D,AD 2 11 oo < 1. By the proposition, an H-matrix IS nonsingular. Diagonally dominant matrices are matrices in which the weak :m 98 Linear Systems ofEquations 2.5 Condition Numbers and Data Perturbations row sum criterion 99 2.4.3 Proposition. A vector norm is monotone iff, for all diagonal matrices IAiil~LIAikl fori=l, ... .n D, the corresponding matrix norm satisfies (4.2) kefi IIDII = max{IDiilli = I, ... , n}. holds. Diagonally dominant matrices are H-matrices if strict inequality holds in (4.2) for all i, (To see this, use D, = Diag(A)-1 and D2 = I.) In fact, strict inequality in (4.2) for one i suffices if, in addition, A happens to be "irreducible"; see, e.g., Varga [94]. M-matrices are H-matrices with the sign distribution A ii ~ 0, A ik :::: 0 for i 1= k; there are many other equivalent definitions of M-matrices. Each M-matrix A is nonsingular and satisfies A -I ~ 0 (see Exercise 19). This inequality mirrors the fact that M-matrices arise naturally as discretization of inverse positive elliptic differential operators; but they arise also in other context (e.g., input-output models in economics). Monotone Norms The absolute value of a vector or a matrix is defined by replacing all components with their absolute value; for example, (4.3) Proof Let II . II be a monotone vector norm. Because IDii IlIe(i) II = II Diie(i) II = II De(i) II :::: IIDlllle(i)II, we have IDiil:::: IIDII, hence IIDII ~ max IDiil =:a. Because IDxl ::::alxl for all x, monotony implies IIDxl1 = IIIDxlll:::: Ilalxlll = allxll, hence IIDII = sup IIDxll/llxll ::::a. Thus IIDII = a. and (4.3) holds. Conversely, suppose (4.3) holds for all diagonal D. We consider the special diagonal matrix D = Diagtsigmx.), .... sign(xn », where signet) = I x / Ix / ] if x 1= O. if x = o. Then IIDII=IID-'II=I and x=Dlxl, hence IIxll=IID/xlll :::: IIDlllllxlll= IIlxlli = IID-'xll:::: IID-'lllIxll = IIxll. Thus we must have equality throughout; hence IIxll = IIlxlll. Now suppose Ixl::: y. We take D = Diag(lx,I/Yl,"" Ixnl/y,,) and find IIDII :::: I, Ixl = Dy, hence IlIxlll = IIDyl1 :::: IIDIiIIYIl :::: Ilyli. Hence, the norm is monotone. 0 2.5 Condition Numbers and Data Perturbations For x, Y E en and A, B E enxlI, one easily sees that Ix ± yl :::: Ixl + [v]. IAxl:::: /Allxl, IA ± BI:::: IAI + IBI, IABI:::: IAIIBI. A norm is called monotone if Ixl::::y => Ilxll = Illxlll:::: lIyll· All vector norms considered above are monotone. The analogous result IAI:::: B => IIAII = IliA/II:::: IIBII holds for the matrix norms II . 111 and II . 1100, but for II . 112 we only have IAI:::: B => IIAI12:::: IIIAII12::: IIBI12. As is well known, a matrix A is nonsingular if and only if det A 1= O. Although one can compute the determinant efficiently by a triangular factorization, checking whether it is zero is difficult to do numerically. Indeed, rounding errors in the numerical calculation ofdet A imply that the calculated value of the determinant is almost always different from zero, even when the exact value is det A = O. 
Thus the unique solvability of Ax = b would be predicted from an inacCuratedeterminant even when in reality no solution or infinitely many solutions exist. The size of the determinant is not even an appropriate measure for closeness to singularity! For example, det(a A) = a" det A. so that multiplying a 50 x 50 matrix by a harmless factor of 3 increases the determinant by a factor of 3n > 1023. For the analysis of the numerical solution of a system of linear algebraic equations, however, it is important to know "how far from singular" a given matrix is. We can then estimate how much error is allowed in order to still Obtain a useful result (i.e., in order to ensure that no singular matrix is found). A Useful measure is the condition number of A; it depends on the norm used 100 Linear Systems of Equations 2.5 Condition Numbers and Data Perturbations and is defined by cond(A):= IIA- 101 Note that (5.3) specifies a componentwise relative error of at most 8; in particular, zeros are unperturbed. 111·IIAII for nonsingular matrices A; we add the norm index (e.g., cond., (A)) if we refer to a definite norm, For singular matrices, the condition number is set to 00. Unlike for the determinant. Proof Because (5.3) is scaling invariant, it suffices to consider the case where D, = D 2 = I. Because B is singular, then so is A -I B, so, by Proposition 2.4.2, III - A-I BII :::: 1, and we obtain 1.::: III-A- I BIi cond(aA) = cond(A) for a E <C. = IIA- 1(A The condition number indeed deserves its name because, as we show now, it is a measure of the sensitivity of the solution x* of a linear system Ax* = b to changes in the right side b. 2.5.1 Proposition. IfAx* = b then, for arbitrary x, (5.1) and - B)II.::: IIA-IIIIIB - All .::: IIA- 11l1181AIII =8cond(A), o whence 8:::: l/cond(A). In particular, a matrix that has a small condition number cannot be close to a singular matrix. The converse is not true because the condition number depends strongly on the scaling of the matrix, in a way illustrated by the following example. 2.5.3 Example. If IIx* - x II < cond(A) lib - Ax II . 11- II b II -----::-11 x---::-* (5.2) then Proof (5.1) follows from I_(-1 1) A-1 _ _ - 0.995 and (5.2) then from ..... I -0.005 . IIAlloo =2, IIA-111 00 =2/0.995, and condoo(A);::;; 4. Now IIbll = IIAx*II'::: IIAII·lIx*lI· o 200) The condition number also gives a lower bound on the distance from singularity (For a related upper bound. see Rump [83]): (5.3) ' whence (DA)-I 2.5.2 Proposition. If A is nonsingular, and B is a singular matrix with IB - AI.::: 81AI, I = _1_ (-1 199 I 200) -I ' then II DA 1100 = 201, II (DA)-liloo = 201/199, and condoo(DA) ;::;; 200. So the condition number is increased by a factor of about 50 through scaling. for all nonsingular diagonal matrices DI. D 2 • The following theorem (due to van der Sluis [93]) shows that equilibration generally improves the condition number; in particular, the condition number of the equilibrated matrix gives a better bound in Proposition 2.5.2. 103 Linear Systems of Equations 2.5 Condition Numbers and Data Perturbations 2.5.4 Theorem. If D runs through the set of nonsingular diagonal matrices, index i, we obtain a lower bound s, for II A -I 1100' If d is an arbitrary vector with Idl = e, where e is the all-one vector, then 102 then condoo(DA) is minimalfor Di, = 1/;; IAikl (i = I, ... , n), that is, (5.4) if LI(DA)ikl=1 fori=l, ... ,n. k Proof. Without loss of generality, let A be scaled so that Lk IA ik 1= I for all i. 
If D is an arbitrary nonsingular diagonal matrix, then IIDAlloo = max {IDiil L I IAikl} = max IDiil = IIDlloo; k I The calculation of I(A-Id)il for all i is less expensive than the calculation of all s., because only a single linear system must be solved, and not all the I'". The index of the absolutely largest component of A -Id would therefore seem to be a cheap and good substitute for the optimal index i o. Among the possibilities for d, the choice d := sign(A -T e) (interpreted componentwise) proves to be especially favorable. For this choice, we even obtain the exact value Sio = IIA-11l00 whenever A-I has columnwise constant sign! Indeed, the i th component a, of a := A- T e then has the sign of the i th column of A-I, and because d, = signrc.), the ith component of A-Id satisfies the relation so condoo(A) = IIA- IIINIIAlloo = II(DA)-I Dlloo :s II(DA)-llIooIIDlloo = II(DA)-llIooIIDAlloo = condoo(DA). D 2.5.5 Algorithm: Condition Estimation I: Solve AT a = e and set d = sign(a). Solve Ac = d and find i" such that ICi·1 = max, ICi I. STEP 3: Solve ATf = e U' ) and determine s = If ITe. STEP To compute the condition number, one must usually have the inverse of A explicitly, which is unduly expensive in many cases. If A is an H-matrix, we can use Proposition 2.4.2 to get an upper bound that is often (but not always) quite reasonable. If one is content with an approximate value, one may proceed as follows. using the row-sum norm IIA -11100' Let s, be the sum of the magnitudes of the elements in the i th row of A -I . Then IIA- llloo=max{sili=l, ... ,n}= sio for at least one index io. The ith row of A -I is If a triangular factorization P A = LR is given, then fU) is calculated as the solution of the transposed system We solve R T y = eU), L T ::: = y, and set fU) = p T Z. If we were able to guess the correct index i = io, then we would have simply s, = II A -III 00; for an arbitrary STEP 2: Then II A -I II 00 2: s and often equality holds. Thus s serves as an estimate of IIA-11l00. Indeed, for random matrices A with uniformly distributed A ik E [-1,1], equality holds in 60-80% of the cases and II A -I 1100 :s 3s in more than 99% of the cases. (However, for specially constructed matrices, the estimate may be arbitrarily bad.) Given an LR factorization, only O(n 2 ) additional operations are necessary, so the estimation effort is small compared to the factorization work. Data Perturbations Often the coefficients of a system of equations are only inexactly known and We are interested in the influence of this inexactness on the solution. In view of Proposition 2.5.1, we discuss only the influence on the residuals, which is Somewhat easier to determine. The following basic result is due to Oettli and Prager [75]. 2.5 Condition Numbers and Data Perturbations Linear Systems of Equations 104 2.5.6 Theorem. Let A E ([Il XIl and b, x E ([Il. Then for arbitrary nonnegative matrices .6.A E JR.1l XIl and nonnegative vectors Sb E JR.1l, the following statements are equivalent: en (i) There exists A E ([nxll and h E Ih - bl :S s». (ii) The residual satisfies the inequality with Ax "'=h, IA - AI:S .6.A, and Ib - Axl:s.6.b + .6.Alxl. Proof. It follows from (i) that Ib - Axl = Ib - h + (A - A)xl :S Ib - hi + IA - AI·lxl:S.6.b + .6.Alx/: that is, (ii) holds. Conversely, if (ii) holds, then we define qi as the ith component of the residual divided by the ith component of the right side of (ii); that is, By assumption. Iqi I :S 1. 
We define A E ([n XII and h E ([II by A ik := A ik + qi(.6.A)iklsign(xk), (i, k:::: 1, ... , n), hi := b, - qi(.6.b)i, (i = 1, ... , n), with the complex sign sign(x) = I Xl lx l ifx # 0, 1 if x =0. (However, note that MATLAB evaluates sign(O) as 0.) Then IA - AI:s .6.A, Ih - bl :S Sb, and for i = 1, ... ,11, (h - AX)i = b, = (b i - qi(.6.b)i - L(A ik + qi(AA)ik! sign(xd)xk L AikXk) - qi (.6.b)i i- L(.6.A)ik IXk (i) The implication (ii) =} (i) in Theorem 2.5.6 says that an approximation x to the solution x of Ax' = b with small residual can be represented as the solution of a slightly perturbed system Xt' = h. (This argument is a typical backward error analysis: Instead of the more difficult task of showing how data perturbations affect the solutions, one analyzes whether the approximations obtained can be interpreted as an exact solution of a nearby system.) In particular, if the coefficients of the system Ax = h have a relative error of at most E per component, then IA - AI:S EIAI and Ih - bl:s Elbl. An approximation x to x* = A -lb can then be regarded as acceptable if the relation Ib - Ax I :S E(lbl + IA I. IxI) holds (i.e., if E :::: EO), where (5.5) (Here the nsturz) interpretation for % is zero.) co is a scaJing invariant quality factor for computed approximations to systems of linear equations. It is always between 0 and 1, and it is small iff the error in x can be explained by small relative errors in the original data. (ii) If we exchange the quantities distinguished by - and *, we can interpret the implication (i) =} (ii) as follows. The solution x* of a system of linear equations Ax* = b approximates the solution x of a slightly perturbed system Ax = h with A ~ A and h ~ b in the sense that the residual h - Ax* remains small. As seen in Proposition 2.5.1, the absolute error x * - x can be a factor of II A -I II greater than the residual, and the relative error can be a factor of cond(A) greater than the relative residual. As the condition number for function evaluation in Section lA, the number cond(A) is therefore interpretable as the maximal magnification factor of the relative error. As an application of Theorem 2.5.6, we prove an easily calculable upper bound on the distance to a singular matrix. # 0, and let ILk AikXkl 8:= max . i Lk IAikikl =0, so Ai = h. Thus (i) is proved. 2.5.7 Remarks. 2.5.8 Proposition. Let x I) o 105 Then there is a singular matrix A with IA - AI :S 81AI. (5.6) 106 Linear Systems of Equations 2.6 Iterative Refinement Proof We have IAxl ::c8IAI·I·'i'I. By Theorem 2.5.6 with b=b.b=O, there exists A with Ax = 0 and IA - A I ::c 81 A I. Because ~'i' =I- 0, it follows that A is singular. 0 To make 8 small, a sensible choice for Rx=e', x* = Xl = I t L - t I '",,:::}'J correct wrong 8h x is a solution of (le;I= .. · =1<1=1) 107 = r-------- (5.7) kl where P A = LR is a normalized triangular factorization that has been determined by using a column pivot search, and the signs of the e; are chosen during back substitution such that the IXi I are as large as possible. If A is close to a singular matrix, then IA.'i' I remains bounded, whereas, typically the solution of (5.7) determined in this way becomes very large and 8 becomes tiny. L-------,I,I 2.6 Iterative Refinement The accuracy of the solution that has been calculated with the aid of Gaussian elimination can be improved by using the method of iterative refinement. The equality P A = LR is only approximate because of rounding error. 
It follows from this that the calculated "solution" x differs, in general, from the "exact" solution x* = A-I b. The error vector 8* := x* - x satisfies A8* = Ax* - Ax =b - Ax; therefore 8* is the solution of the system of equations Figure2.2. Iterative refinement (shaded digits are inaccurate). STEP 2: Solve LR8' = Pr' for 8' and put x' = x'-t + 8'. STEP 3: If I > 1 and 118'1100 ~ ~ 118/-11100: stop with x* = x' as best approximation found. Otherwise. update I = I + I. STEP 4: Calculate = b - Ax'-I (if possible in double precision) and continue with step 2. r' A8* =b - Ax. (6.1) Thus we can calculate the error vector by solving a system of equations with the same matrix, so that no new triangular factorization is necessary. The right side i' := b - Ax is called the residual corresponding to the approximation x; the residual corresponding to the exact solution x* is zero. In an attempt to obtain a better approximation than x, we solve the system of equations (6.1) for 8* and obtain x* =x + 8*. Because instead of the absolute error 8* only an approximation 8 can be calculated, we repeat this method with :=x +8 until 11811 does not become appreciably smaller - by a factor of 2, say. This serves as a convenient termination criterion for the iteration. x Figure 2.2 illustrates how the iterative refinement algorithm improves the accuracy of a calculated solution. The solution x = Xl, calculated by means of Gaussian elimination, typically agrees with some leading digits of x" (cf. the first two rows of the figure). Similarly, the calculated error 8 = 8 1 agrees with some leading digits of the true error 8'*; because 8 is several orders of magnitudes smaller than x*, the new error 82* is much smaller than 8 1", and so on. In the particular example drawn, the calculated error oJ satisfies II oJ II 00 ~ ~ 118 21\ 00 and hence is not appreciably smaller than 82 ; we have reached the limiting accuracy. In this way, we can obtain nearly L valid digits for the calculated solution ~'i'*, although the initial approximation was rather inaccurate. 2.6.2 Remarks. 2.6.1 Algorithm: Iterative Refinement .'i'*1100 in Algorithm 2.6.1 almost always has an order of magnitude 118'1100, where I is the last index used; but this cannot be proved (i) The error [x" - STEP 1: Compute a permuted triangular factorization P A l xO = 0, r = b, 1=1. = L R, and initialize because one has no control over the rounding error. 108 2.6 Iterative Refinement Linear Systems of Equations (ii) For the implementation of iterative refinement, we need both A and the factorization LR. Because A is overwritten by LR, a copy of A must be made before computing the factorization. (iii) Because the size of the error 8* is determined by the residual through (6.1), it is essential to calculate the residual as accurately as possible. In MATLAB, double precision accuracy is used for all calculations, and nothing special must be done. However, when working in single precision, one can exploit the fact that most computers construct, for the multiplication of two single precision (L-digit) numbers the exact (2L-digit) product, and then round this to L digits. For example, in FORTRAN77, the instructions double precision rr 100 rr =b(i) do 100 k= 1,n rr = rr - dprod(A(i, k), x(k)) r(i) =rr produce the double precision value of the ith component of the residual r = b - Ax, with A and b stored in single precision only. The final rounding of rr to a single precision number stored in rei) is harmless because it has a small relative error. 2.6.3 Examples. 
(i) The system Ax = b with A= 372 -573 ( 377 241 63 -484 -613) 511 , 107 b= ( 210) -281 170 is to be solved. Single precision Gaussian elimination with iterative refinement using double precision residuals gives xO Xl xZ x3 17.46114540 17.46091557 17.46099734 17.46101689 16.99352741 16.99329805 16.99337959 16.99339914 16.93472481 16.93449497 16.93457675 16.93459606. Comparing X Z with x 3 , one would expect that x 3 is accurate to six places. The quality index (5.5) has the value 80 = .61 10 - 8 and indicates that 109 x3 can be regarded as the exact solution of a system Ax3 = b in which A and b are obtained from A and b by rounding to the next machine number. Now let us consider the scaled system (10- 3 A)x = 1O-3b. The given data are now no longer exactly representable in the (base 2) computer, and the resulting data errors have the following effect on the computed iterates. Xl Xz 17.46083331 17.46096158 17.46093845 17.46093869 16.99321566 16.99334383 16.99332047 16.99332070 16.93441272 16.93454123 16.93451762 16.93451786 One sees that, compared with the unsealed system, the seventh places of the last iterations are affected; we see also that the agreement of x3 and x4 up to the eighth place indicates a higher accuracy than is actually availabk. 1f one supposes that the coefficients of A and b are accurate only to three of the eight given figures, one must reckon with a loss of another 8 - 3 = 5 figures in the result; then, only 6 - 5 = 1 figures are correct. This indicates a nearly singular problem; we therefore estimate the distance to singularity from Proposition 2.5.8 and see that our prognosis of the accuracy was too optimistic: With the given choice of x, we obtain a singular matrix within a relative error 8 = 0.002. From this, it follows that an absolute error of 0.5 in all of the coefficients < 250 (of the original problem) leads to a possibly singular matrix, and one must doubt whether the absolute accuracy of 0.5 in the coefficients guarantees a nonsingular problem. A more precise inspection of the data, which is still possible in this 3 x 3 problem, shows that if one replaces A z! with -572.5 and A 23 with 510.5 then A becomes singular, since the sum of the columns becomes zero. Therefore the given data are actually too imprecise to furnish sensible results; the relative error of this particular singular matrix is < 0.5/511 < 10- 4 , a factor of about 20 smaller than the upper bound from Proposition 2.5.8. (ii) The system Ax = b with A= 372 -573 ( 377 241 63 -484 -125) 182 , 437 b= (155) -946 38 2.6 Iterative Refinement Linear Systems ofEquations 110 is to be solved. Gaussian elimination with iterative refinement gives xO -0.255 228 35 2.81151304 3.42103759 proof From x* = A -I b it follows that IIx* - il-l - 8'1100 Xl = IIA- ' (P' - A8' +b - Ail-l - P')lloo ::: IIA-'lloo(llr' - A8'II00 + lib - Ai'-I - 1"11(0) -0.255 228 37 2.81151304 3.421 03753, so that we expect that x 1 is accurate to eight places. If we assume that the coefficients (measured in millimetres) are accurate only to within an absolute tolerance of 0.5, we expect that 8 - 5 = 3 places of x I are still reliable. The calculated bound 80 = 0.76 from Proposition 2.5.8 indicates no particular problem. 111 ::: IIA- 11100(EoIl8/1100 ::: ~11811100 + s.) by (i) and (iii), + IIA-'llooE r by (iv). Now let I be the index such that i'-1 =i*. Because the termination criterion must have been passed, we have 118'-11100::: 2118 11100' Therefore + (i l- 1 + 81- 1 - ii-I) 1100 ::: IIx* - i l - I - 8'-11100 + Elli l- 11100. 
by (ii) ::: ~1I81-11100 + IIA- 11100E + Elli*lloo, [x" - i* 1100 = II (x* - il-l - 8'-1) r Limiting Accuracy of Iterative Refinement For singular or almost singular matrices, iterative refinement does not converge. To prove convergence and obtain a formula for the limiting accuracy of the iterative refinement algorithm 2.6.1 in the presence of rounding error, we suppose that the approximations (denoted by -) satisfy for I = 0, I, ... the following inequalities (i) - (iv): (i) (ii) (iii) (iv) 111'1 - A8'II00 :::Eo118'1I00 (this is a test for the calculation of 81; by Theorem 2.3.9, this is certainly satisfied with EO = 5nE II L 1100 II R11(0); Ili'-1 + 8' - ilil oo ::: E Iii' 1100 (rounding error due to the addition of the correction); lib - Ail-I - 1'11100 ::: e, (rounding error due to the calculation of the residual); IIA-11100Eo::: ~ (a quantitative nonsingularity requirement). Assuming that and using (i), requirement (iv) is approximately equivalent to the requirement 20nE condoo(A) < 1. 2.6.4 Theorem. Suppose that (i) - (iv) holdfor I =0, 1, ... and let i* be the approximation for x* computed by iterative refinement. Then the error satisfies the relation IIx* - i* 1100:::: 5E* where E* = IIA -lllooEr + Elli* 1100' whence 1 -I Ilx* - i*lIoo::: 2:118 1100 + E*. (6.2) From this, it follows that 118 11100 = lI(x* - i*) - (x* - ii-I - 81)1100 + [x" - i l - I - 8'1100 ::: !1l811100 + E* + ~11811100 + E* by (i). ::: [x" - i*lloo So 118 11100 ::: 8E*, and substituting this into (6.2), Ilx* - i* II ::: 5E*. 0 2.6.5 Remarks. (i) The factor E* that occurs in the limiting accuracy estimate results from putting together the unavoidable rounding error E Ili* 1100 and the error II A -1 II ooE r caused by the inexact residual. The double precision calculation of the residual that has already been mentioned is therefore important for the reduction of e; because II A-III 00 can be comparatively large. (ii) In the proof, the final 81 is smaller than 8E* in norm, so that (as already mentioned) 118 11100 and [x" - i*lloo have the same order of magnitude generally. However, no lower bound for 118 11100 can be obtained from the previous analysis, so that it is conceivable that 118 11100 is very much smaller than [x" - i*lloo' (iii) Under the same hypotheses, 112 2.7 Error Bounds for Solutions ofLinear Systems Linear Systems ofEquations for Gaussian elimination without iterative refinement if x 1 = 80. For wellconditioned matrices (cond(A) ~ IIA-11100 for equilibrated matrices), Gaussian elimination without iterative refinement also gives good results; for (not too) ill-conditioned matrices, only the order of magnitude is correct. However, even then, if we obtain only one valid decimal without iterative refinement, then iterative refinement with sufficiently accurate calculation of the residuals provides good results. Of course, we then need about as many iterative steps as valid decimals. (Why") From Theorem 2.3.9, we may deduce the heuristic but useful rule of thumb that for the computed solution x one must expect a loss of accuracy of approximately loglo cond(A) decimal places (relative to the machine precision). This follows from the theorem if we make the additional assumption that because then Ilx - x*11 If we omit the factor 5n in the approximate equality in order to compensate for the overestimation of the "worst-ease-analysis," then there remains a loss oflog lO cond(A) decimal places. 2.7.1 Example. In order to illustrate this kind of effect. we consider an almost singular equilibrated matrix A. 
We obtain such a matrix if, for example, we by approximately 0.1 % to A:= (6:~~~ ~:~~i)· change the singular matrix (Changing the matrix by a smaller relative error gives similar but more pronounced results.) The solution of the system of equations Ax* = b with the right side b := @ is x* = For the approximate solution x = C'~~:), we calculate the residual - _ (-0.002) dth 0* _ (-0.001) B '-1 (25025 -249.75) h r - -0.002 an e error o - -0.001' ecause A = -249.75 250.25' we ave IIA -11100 = 500. Hence the bound (5.1) gives 118* 1100 :S 1, and the error in (5.1) (and in (5.2)) is overestimated by a factor of 1000. The totally wrong approximation x = @ has a residual i = (-~:~~;) of the same order, but the error is 8* = (-i), and this time both bounds (5.1) and (5.2) are tight, without any overestimation. C:) G). We see from this example that for nearly singular matrices we cannot conclude from a small residual that the error is small. However, for equilibrated matrices, one can conclude from a large residual that the absolute error has at least the same order of magnitude, 111'11 = IIA8*11 ::: IIAII·118*11 ~ 118*11, and it may be much larger when the matrix is ill-conditioned (which, for equilibrated matrices, is the same as saying that II A-I II is large). = IIA-1(b - Ax)11 ::: IIA- '115neIIILIIRlllllili ~ Sne cond(A) IlxII. 2.7 Error Bounds for Solutions of Linear Systems Proposition 2.5.1 gives (fairly rough) error bounds for an approximation x to the solution of a linear system Ax = b. The bounds cannot be improved in general (e.g., inequality (5.l) is best possible because IIA- III = sup{IIA- 11'11 I 111'11 = I}). In particular, harmless looking approximations with a tiny residual can give rise to large errors if cond(A) is large (or rather, if A is nearly singular). 113 Realistic Error Bounds As seen previously, the use of II A -I II is often too crude for obtaining realistic error bounds. In particular the bound (5.1) is usually much too bad if x is determined by iterative refinement. To get error bounds that can be assessed as to how realistic they are, one must invest more work. An approximate but generally quite reliable estimate of the error (and at the same time an improved approximation) can be obtained with one step of iterative refinement, from the order of magnitude 118 0 II of the first correction 80 (cf. Figure 2.2). Rigorous and realistic error bounds can be obtained from a good approximation C for A -I (called a preconditioner and computed. e.g., using Algorithm 2.2.9), as follows. Because C A ~ /, the norm 11/ - C A II is small and the following theorem, due to Aird and Lynch [3], is applicable. 2.7.2 Theorem. /fll/- CAli ::: fJ < I then A is nonsingular; and for an arbitrary approximation x of x * = A -I b, IIC(b - Ax) II II * _ IIC(b - Ax)11 - - - - - < x -xll<-"--------'--"- l+fJ - - l-fJ . (7.1) Proof By Proposition 2.4.2, the matrix C A is nonsingular. Because a =1= det(C A) = det C det A, the matrix A is nonsingular, too. From A8* = 1', one 2.7 Error Boundsfor Solutions of Linear Systems Linear Systems of Equations 114 115 5 ,------.~---,--------,-------,---, finds that 8* = 0 + (l I-- - CA)8*, I I 118*11 S 11011 + III - C A 11118* I s 11 011 + ,8118* II, 118*11 ::: 11011 - III - CAli 118*11 ::: 11 011- ,8118*11, 3 I I I I I I I whence the assertion follows. I o I I I I '"x 2.7.3 Remarks. I I I (i) The error bound (7.1) is realistic for small ,8 because the actual error is overestimated by a factor of at most q:= (l + {3)/(l - ,8); e.g., q < 1.23 for,8 < 0.1. 
(ii) In order to reduce the effects of rounding error one should calculate the residual b - Ax in double precision. Indeed, inaccuracies in C or ,8 affect the bound (7.1) much less than errors in the residual. (iii) To evaluate the bound (7.1), we need 2n 3 + 0 (n 2 ) operations for the calculation of LR and C, 4n 3 operations for the multiplication C A (an operation real 0 interval involves two operations with reals l), and 0 (n 2 ) for the remainder of the calculation. This means that altogether, 6n 3 + 0 (n 2 ) operations are necessary for the calculation of a realistic bound for the error of x. Comparing this with the ~n3 + 0(n 2 ) operations needed for the calculation of x by Gaussian elimination, we see that the calculation of realistic and guaranteed error bounds requires about nine times the cost for solving the system approximatively. Systems ofInterval Equations For many calculations with inexact data, the data are known to lie in specified intervals. In this case, we can use interval arithmetic to compute error bounds simultaneously with the solution instead of estimating the error from an analysis of the error propogation. In the following, we work with a fixed interval matrix A E lllR" X" and a fixed interval vector b E lllR". We want to find a good enclosure for the solution of Ax* = b, where it is only known that A E A and h E b. Clearly, x* lies in the solution set 1; (A, b):= {x*IAx* =b for some A E A, bE b). The solution set is typically star-shaped. The calculation of the solution set is I -1 I I I I I -3 I I I IL _ -5 L -_ _~_ _~ ~_ ___'_ _ ___:'--_____= -1 3 5 -3 -5 x1 Figure 2.3. Star-shaped solution set. quite expensive; for systems of dimension n, it can be shown that the star has up to 2" spikes. We content ourselves with the computation of the interval hull D1;(A, b) of the solution set. 2.7.4 Example. The linear system of interval equations with [2,2] A:= ( [-1,2] [-2,.1]) [2.4] , b'= ([-2,2]) . [-2,2] has the solution set drawn in Figure 2.3. The hull ofthe solution set is 01; (A, b)= ([ -4, 4], [-4, 4]) T, drawn with dashed lines. (To construct the exact solution set is a nontrivial task, even for 2 x 2 systems!) Methods for determining the hull of the solution set are available (see, e.g., Neumaier [70, Chapter 6]) but must still in the worst case have an exponential effort. Indeed, the problem belongs to the class of NP-hard problems (Garey and Johnson [27]) for which it is a long-standing conjecture that no polynomial algorithms exists. However, there are several methods that give at least a reasonable inclusion of the solution set with a cost of 0(n 3 ) . 116 Linear Systems of Equations 2.7 Error Bounds for Solutions of Linear Systems Krawczyk's Method (i) In the first example, when all coefficients are treated as point intervals, To obtain a realistic enclosure, we imitate the error estimation of Aird and Lynch (Theorem 2.7.2). Let the preconditioner C E JRnxn be an approximation of the inverse ofmid(A). The solution.c" ofAx* = h satisfies x* = ci + (I - C A).\'*. If we know an interval vector Xl with x* E Xl, then also x* E Cb + (I - CA)x' and we can improve the inclusion of x*: f3 = .000050545 < I. ex = 51.392014, and Xl 2.7.5 Algorithm: Krawczyk's Method for Equations If we denote by xl the box Xafter I iterations, then by construction, x* E x' for alII ~ 0, and XO ;:> x' ;:> x 2 .... 2.7.6 Example. 
In order to make rigorous the error estimates computed in Examples 2.6.3(i)-(ii), we apply Krawczyk's method to the systems of linear equations discussed there. (To enable a comparison with the results in Examples 2.6.3, the calculations were performed in single precision.) 2 17.4~~m 17.46b~~6 16.993g~~ = .006260 0"2 =x 4 17.46b~~6 16.993g~~ 16. 934 = .000891 X 16.99~m 3 m 16.934~~~ = .000891 O".J One can see that x 3 (and already x 2 ) have five correct figures. If one again supposes that there is an uncertainly of 0.5 in each coefficient, then all of the coefficients are genuine intervals; All = [371.5,372.5], and so on. Now, in Krawczyk's method, f3 = 3.8025804 > I suggesting numerical singularity of the matrix. No enclosure can be calculated. (ii) In the second example, with point intervals we get the values f3 = .000000309, ex = 6.4877797. and Xl Solving Linear Interval ier=O; a=C*b; E=!-C*A; % To save space, b and A could be overwritten with a and E. f3 = max, Lk IE l k I; % must be rounded upwards if f3 ~ I, ier = I; return; end; ex = lIali oo / (I - (3); x=([-ex,exJ, ... , [-ex,ex])T; O"old =inf; 0" = Lk rad (Xk); fac= (I + (3)/2; while 0" «fac * O"old, x=(a+Ex)nx; O"old =0"; 0" = Lk radtx.); end; X 16.93~6~6 0", How can we find an initial interval vector XO with x* E xo? By Theorem 2.7.2 (for xO = 0), IIX* 1100 ::; II Chll 00/(1-(3) if II! -c A1100 ::; f3 <1. Because IICh 1100 ::; II Cb]00' we define, as the initial interval, xO := ([ -ex, ex], ... , [-ex, ex]) T with ex:= IICblloo/(I - (3). As seen from the following examples, it is worthwhile to terminate the iteration when the sum 0"1 of the radii of the components of xl is no longer rapidly decreasing. We thus obtain the following method, originally due to Krawczyk [53]. 117 X 2=X 3 - 0.255228~ - 0.255228~ 2.8115qj 2.81151~~ 3.42103~~ 0"1 = .00000125 0"2 3.421037j = .00000031 with six correct figures for x 2 . With interval coefficients corresponding to an uncertainty of 0.5 in each coefficient, we find f3 = .01166754, ex = 6.5725773, and (outwardly rounded at the fourth figure after the decimal point) Xl - 0.2~g~ 8741 2 '7489 x2 - 0.2~~g 2.~~~j 3·~~b~ 0"1=.1681 3·j~n 0"2 = .0622 x3 - 0.2~i 8344 2 '7887 4505 3 '3916 0"3 = .0613 x4 644 - 02 . 461 2.~~~ 3.j~?~ 0"4=.0613 with just under two correct figures for x 4 (and already for x 2 ) . The safe inclusions of Krawczyk's method are therefore about one decimal place more pessimistic here than the precision that was estimated for Gaussian elimination with iterative refinement. . We mention without proof the following statement about the quality of the limit x oo , which obviously exists. A proof can be found in Neumaier [70]. 118 Linear Systems ofEquations 2.7.7 Theorem. The limit proximation property: X OO 2.8 Exercises of Krawczyk's iteration has the quadratic ap- radA, rad b = 0(£) =? 0 S rad x'? - radDI;(A, b) 119 5,---------.---,..------,-----,-------, = 0(£2). 3 Thus if the radii of input data A and b are of order of magnitude 0 (s), then the difference between the radius of the solution of Krawczyk's iteration X OO and the radius of the hull of the solution set is of order of magnitude 0(£2). The method therefore provides realistic bounds if A and b have small radii (compare with the discussion of the mean value form in Remark 1.5.9). Even sharper results can be obtained with 0(n 3 ) operations by a more involved method discovered by Hansen and Bliek; see Neumaier [72]. 
N x -1 -3 Interval Gaussian Elimination For special classes of matrices, especially for M-matrices, for diagonally dominant matrices, and for tridiagonal matrices, realistic error bounds can also be calculated without using Krawczyk's method (and so without knowing an approximate inverse), by performing Gaussian elimination in interval arithmetic. The recursions for the triangular factorization and for the solution of the triangular systems of equations are simply calculated using interval arithmetic. Because of the inclusion property of interval arithmetic (Theorem 1.5.6), this gives an inclusion of the solution set. For matrices of the special form mentioned, the quality of the inclusion is very good - in the case of M-matrices A, one even obtains for many right sides b the precise hull DI;(A, b): 2.7.8 Theorem. If A is an M-matrix and b :::: 0 or b S 0 or 0 E b then interval Gaussian elimination gives the interval hull I] I; (A, b) of the solution set. The proof, which can be found in Neumaier [70], is based on the fact that in these cases the smallest and the largest elements of the hull belong to the solution set. 2.7.9 Example. The interval system of linear equations with [2,4] A:= ( [-1,0] [-2,0]) [2,4]' b.= ([-2,2]) . [-2,2] satisfies the hypotheses of Theorem 2.20. The solution set (filled) and its interval hull (dashed) are drawn in Figure 2.4. -~5:-------:----~---~-----::-------;5 -3 -1 3 x1 Figure 2.4. Solution set (filled) and interval hull of solution. Note that, although interval Gaussian elimination can, in principle, be tried for arbitrary interval systems of linear equations, it can be recommended only for the classes of matrices mentioned previously (and certain other matrices, e.g., 2 x 2 matrices). In many other cases, the radii of the intervals can become so large in the course of the calculation (they may grow exponentially fast with increasing dimension) that at some stage, all candidates for pivot elements contain zero. 2.8 Exercises 1. The two systems of linear equations Ax = b(/), I = 1, 2 (8.1) with k.- (~ 0 I are given. 2 4 2 2 -1 o ~) 5 4' 3 4 l~) v» .= . ( 35' 30 b(2) .= . (=~) 0 -3 Linear Systems of Equations 120 2.8 Exercises (a) Calculate, by hand, the triangular factorization A = LR; then calculate the exact solutions x(l) (I = I, 2) of (8.1) by solving the corresponding triangular systems. (b) Calculate the errors x(l) - x(l) ofthe approximations x(l) = A \b(l) computed with the standard solution x =A\ b of Ax = b in MATLAB. 2. (a) Use MATLAB to solve the systems of equations Ax = b with the coefficient matrices from Exercises 1,9,14,26,31, and 32 and right side b:= Ae. How accurate is the solution in each case? (b) If you work in a FORTRAN or C environment, use a program library such as LAPACK to do the same. 3. Given a triangular factorization A = LR of the matrix A E en xn. Let B = A + auv T with a E e and u, v E en. Show that the system of linear equations Bx = b (b E en) can be solved with only 0 (n 2 ) additional operations. Hint: Rewrite in terms of solutions of two linear systems with matrix A. 4. Let A be an N 2 x N 2-matrix and b an N 2 -vector of the form D -I -1 D 5. 6. 7. o -1 k-[ -[ ° 8. D where the N x N -matrix D is given by 4 -I -1 0 4-1 9. How many different techniques for reordering of sparse matrices are available in MATLAB? Use the MATLAB command spy to visualize the nonzero pattern of the matrix A before and after reordering of the rows and columns. 
Compare computational effort and run time for solving the system in various reorderings to the results obtained from a calculation with matrix A stored as a full matrix. Hints: Have a look at the Sparse item in the demo menu Matrices. You may also try help sparfun, help sparse, or help full. Show that a square matrix A has a Cholesky factorization (possibly with singular L) if and only if it is positive semidefinite. For the data (Xi, Yi) = (-2,0.5), (-1,0.5), (0, 2), (I, 3.5), (2, 3.5), determine a straight line f (x) = a + f3x such that Li (Yi - f (Xi»2 is minimal. (a) Let P be a permutation matrix. Show that if A is Hermitian positive definite or is an H-matrix, then the same holds for P A P'': (b) A monomial matrix is a matrix A such that exactly one element in each row and column is not equal to zero. Show that A is square, and one can express each monomial matrix A as the product of a nonsingular diagonal matrix and a permutation matrix in both forms A = P D or A=D'P. (a) Show that one can realize steps 2 and 3 in Algorithm 2.2.9 for matrix inversion using ~n3 + 0(n 2 ) operations only. (Assume P = I, the unit matrix.) (b) Use the algorithm to calculate (by hand) the inverse of the matrix of Exercise 1, starting with the factorization computed there. (a) Show that the matrix D'-1 ° -1 4 [ is the N x N unit matrix, and the N -vectors b(l) and b(2) are given by b(l):= (2, I, I, ... , 2)T and b(2):= (I, 0, 0, ... ,0, l)T, with N E (4. 8, 12, 16, 20, ... }. Matrices such as A result when discretizing boundary value problems for elliptical partial differential equations. Solve the system of equations Ax = b using the sparse matrix features of MATLAB. Make a list of run times for different N. Suitable values depend on your hardware. 121 A- G!:) is invertible but has no triangular factorization. (b) Give a permutation matrix P such that P A has a nonsingular triangular factorization L R. 10. Show that column pivoting guarantees that the entries of L have absolute value y 1, but those of R need not be bounded. Hint: Consider matrices with A ii = A in = 1, A i k = - f3 if i > k, and A i k = 0 otherwise. 2.8 Exercises Linear Systems of Equations 22 (a) Find an explicit formula for the triangular factorization of the block matrix I 0 0 I -B I 0 0 (b) Write a MATLAB program that realizes the algorithm (make use of Exercise 8), and apply it to compute the inverse of the matrix -B A= Compare with MATLAB's built in routine inv. 15. For i = 1, ... , n let .i:U) be the computed solution (in finite precision arithmetic) obtained by Gaussian elimination from the system 0 0 -8 I with unit diagonal blocks and subdiagonal blocks - BE dxd lR . (b) Show that column pivoting produces no row interchanges if all entries of B have absolute value < I. I (c) Show that choosing for B a matrix with constant entries f3 (d- < f3 < I), some entries of the upper triangular factor R grow exponentially with the dimension of A. (d) Check the validity of (a) and (b) with a MATLAB program, using MATLAB's lu. Note that matrices of a similar form arise naturally in multiple shooting methods for 2-point boundary value problems (see Wright [99]). 12. Write a MATLAB program for Gaussian elimination with column pivoting for tridiagonal matrices. Show that storage and work count can be kept of the order O(n). 13. How does pivoting affect the band structure? (a) For the permuted triangular factorization of a (2m + I)-band matrix, show that L has m + 1 nonzero bands and R has 2m + I nonzero bands. 
(b) Asymptotically, how many operations are required to compute the permuted triangular factorization of a (2m + I)-band matrix? (c) How many more operations (in %) are required for the solution of an additional system of linear equations with the same coefficient matrix and a different right side, with or without pivoting? (Two cases; keep only the leading term in the operation counts I) Huw does this compare with the relative cost of solving a dense system with an additional right side? (d) Write an algorithm that solves a linear system with a banded matrix in a numerically stable way. Assume that the square matrix A is stored in a rectangular array AB such that Aik = AB(i, m + I + i - k) if Ik - i I :s m, and A ik = 0 otherwise. 14. (a) In step 4 of Algorithm 2.2.9, the interchanges making up the permutation matrix must be applied to the columns, and in reverse order. Why? 123 in which eUJ is the i th unit vector. (a) Show that the computed approximation C to the inverse A -I calculated in finite precision arithmetic as in Exercise 14coincides with the matrix (i(l), ... , i(n»T. (b) Derive from Theorem 2.3.9 an upper bound for the residual II - tAl. 16. (a) Show that all norms on (:" are equivalent; that is, if II . II and II . II' are two arbitrary norms on C n , then there are always constants 0 < c, d E lR such that for all x E C" cllxll:s Ilxll' :sdllxll· Hint: It suffices to prove this for II . II = II . II 00' Now use that continuous functions on the cube surface Ilx II 00 = 1 attain their extrema (b) ':hat are the optimal such constants (largest possible c, sm~llest posSIble d) for pairs of p-norms (p = 1, 2, oo)? 17. Show that the matrix norm belonging to an arbitrary vector norm has the following properties for all a E C: (a) (b) (c) (d) (e) (f) IIAxll:s IIAII ·llxll IjAIl ::: 0; IIAII =0 -¢:=::::} A =0 IlaAII=lal'IIAil IIA + BII :s IIAII + IIBII IIABII:s IIAII·IIBII IAI:s B ::::} IIAII :s II IAI II :s IIBII if the norm is monotone (g) The implication IAI:S B =? IIAII = IIIA/II:s IIBI/ holds for the matrix norms II· 111 and II . 1100, but not in general for II . 112. 18. Prove the following bounds for the inverse of a matrix A: (a) If II Ax 1\ ::': y IIx II for all x E C" with y > 0 and II . II an arbitrary vector norm then A-I exists and II A -1 II :5 y -I for the matrix norm belonging to the vector norm II . II· 1112:5 (b) If x H Ax ::': Yllxll~ for all x E C" with y > 0, then IIAy-I. (c) If the denominator on the right side of the following three expressions is positive, then IIA- I \\oo :51/min (IAiil- L IAikl). bpi ! IIA- IIII :51/min (IAkklk IIA -1112:5 1/ min (Aii ! Li# IAikl), World Wide Web (WWW). The distributors of MATLAB maintain a page http://www.mathworks.com/support/ftp/ with links to user-contributed software in MATLAB. (a) Find out what is available about linear algebra, and find a routine for the L D L T factorization. Generate a random normalized lower triangular matrix L and a diagonal matrix D with diagonal entries of both signs, and check whether the routine reconstructs Land D from the product LDL T • (b) If you are interested what one can do about linear systems with very ill-conditioned matrices, get the regularization tools, described in Hansen [40], from http://www.imm.dtu.dk/ pch/Regutools/regutools.html ~2 Lki-i IAik + A ki I). It takes a while to explore this interesting package! 22. Let x* be the solution of the system of linear equations 1 3X I H Hints: For arbitrary x, y E C", Ix Y I:5 [x 11211 y 112· For arbitrary u, 2 2 luvi :5 ~ lul + ~ Iv1 . 
19. A matrix A is called an M-matrix if Aii ::': 0, A i k :5 0 125 2.8 Exercises Linear Systems of Equations 124 VEe. I 4X 1 + 0.333xl III - D 1AD211 I 10 7 X2 = 21' 29 280 X 2 = 99 (8.2) 280' (a) Calculate the condition number of the coefficient matrix for the rowsum norm. (b) An approximation to the system (8.2) may be expressed in the form for i =1= k; and one of the following equivalent conditions holds: (i) There are diagonal matrices D 1 and D2 such that + < 1 (i.e., A is an H-matrix). (ii) There is a vector u > 0 with Au > O. (iii) Ax ::': 0 =} x ::': O. (iv) A is nonsingular and A-I::,: O. Show that for matrices with the above sign distribution, these conditions are indeed equivalent. 20. (a) Let A E (["Xli be nonpositive outside the diagonal, Aik :5 0 for i -=I k, Then A is an M-matrix iff there are vectors u , v > 0 such that Au ::': v. (b) If A' and A" are M-matrices and A' :5 A :5 A" (componentwise) then A is an M-matrix. Hint: Use Exercise 19. Show first that C = (A,,)-l B is an M-matrix; then use B- 1 = (A"C)-I. 21. A lot of high-quality public domain software is freely available on the + O.143x2 = 0.477, O.250xI + 0.104x2 = 0.353 (8.3) by representing the coefficients Aik with three places after the decimal point. Let the solution of (8.3) be x. Calculate the relative error Ilx* x II / IIx* II of the exact solutions x* and x of (8.2) and (8.3), respectively (calculating by hand with rational numbers). 23. Use MATLAB's cond to compute the condition numbers of the matrices A with entries Aik = fk(Xi) (k = 1, ... , nand i = 1, ... , m; Xi = (i - 1)/ (m - 1) with m = 10 and n = 5, 10.20) in the following polynomial bases: (a) fk(X)=x k- l, (b) fk(X) = (x - 1/2)k-l, (c) fk (x) = Tk(2x - 1), where Tk(x) is the kth Chebyshev polynomial (see the proof of Proposition 3.1.14), (d) fk(X) = nj':~+~k.j mn(x - Xj/2). 126 Plot in each case some of the basis functions. (Only the bases with small condition numbers are suitable for use in least squares data fitting.) 24. A matrix Q is called unitary if QH Q = I. Show that cond2(Q) = I, cond 2(A) = cond 2(QA) = cond 2(AQ) =cond2(QH AQ) 127 2.8 Exercises Linear Systems of Equations 28. The two-point boundary value problem 1/ k(k - 1) -y(x)+(I_x)2 Y(x)=0, y(O)=I, y(1)=O can be approximated by the linear tridiagonal system of linear equations Ty=bwith 2+ql for all unitary matrices Q E e nxnand any invertible matrix A E en XII. 25. (a) Show that for the matrix norm corresponding to any monotone norm, - 1 0 -I 2 +q2 , T= b = eO) IIAII ::: max IAiil. -1 0 and, if A is triangular, \A\ cond(A) ::: max --''- . Akk (b) Show that, for the spectral norm, cond2(AT) =cond 2(A). (c) Show that the condition number of a positive definite matrix A with Cholesky factorization A = L L T satisfies 26. (a) Calculate (by hand) an estimate c of cond(A) from Algorithm 2.5.5 for A:= (-~ -~ ~) -3 7-1 Is this estimated value exact. that is, is c = II A II 00 1\ A -III 00 ? (b) Give a 2 x 2 matrix A for which the estimated value c is not exact, that is, for which s < IIA-liloo' 27. (a) Does the MATLAB operation x = A\b for solving Ax = b include iterative refinement? Find out by implementing an iterative refinement method based on this operation, and test it with Hilbert matrices A with entries Aik = 1/(i + k - 1) and all-one right side b = e. (b) For Hilbert matrices of dimension n = 5,10,15,20, after how many steps of iterative refinement is the termination criterion satisfied? 
(c) Confirm that Hilbert matrices of dimension n are increasingly ill conditioned for increasing n by calculating their condition number for the dimensions n = 5, 10, 15,20. How does this explain the results of (b)? -1 2+qll with qi =k(k - I)/(n + I - i)2. Then y(i/(n + 1» ~ Yi. Solve this system for k = 0.5 with n = 25 -I for several s, using Exercise ] 2 and (at least) one step of iterative refinement to obtain an approximation ys) for the solution vector. Print the approximate values at the points x = 0.25,0.5, and 0.75 with and without iterative refinement, and compare with the solution y (x) = (I - x)k of the boundary value problem. Then do the same for k = 4. 29. (a) Write a MATLAB program for the iterative refinement of approximate solutions of least squares problems. Update the approximate solution x' in the Ith step of the method by xl+ 1 = xl + sf, where sf be the solution of R T Rs l = AT (b - Ax') and L is a Cholesky factor of the normal equations (found with MATLAB's chol). (b) Give a qualitative argument for the limit accuracy that can be obtained by this way of iterative refinement. 30. The.following technique may be used to assess heuristically the accuracy of linear or nonlinear systems solvers or eigenvalues calculation routines ~itho~t doing any formal error analysis. (It does not work for problems involving approximations other than rounding errors, as in numerical differentiation, integration, or solving differential equations.) S~I~e the systems (k A)x(k) = kb for k = I, 3, 5 numerically, and derive heuristic error estimates [x" - xI ~ ilx for the solution x* ofAx* = b where ' 128 Linear Systems of Equations 2.8 Exercises Compare with the true error for a system with known x*, constructed by choosing random integral A and x* and b = Ax*. Explain! How reliable is the heuristics? 31. Given the system of linear equations Ax = b with 0.051 A:= ( (a) Give a matrix -0.15~ -0.59~)' -0.153 -0.737 -0.598 b:= (33 . -0.299 4280) 1420, ( -2850 IA - AI :s: colAI, and Ib - bl :s: colbl, with smallest possible relative error co. (b) Use x to determine a singular matrix A close to A, and compute the "estimated singularity distance" (5.6). 32. (a) Using the built in solver verifylss of INTLAB to solve linear interval equations with the four symmetric interval coefficient matrices A defined by either 36.1 -63.4 33.1 A".- -75.2 75.8 -27.4 14.7 -88.5 21.6 -22.4 83.3 symm. 56.9 -14.0 15.2 -58.3 36.3 -39.0 15.4 44.1 -17.0 69.4 or A~ c d c c [: c c d (a) 1.25 (b) (c) 125 -3.34 334 18.9 -1.26 1) A and a vector b such that Ax = b where x:= 129 ~l with dimension n = 16 and c, d chosen from the following: Interpret the matrices either as thin interval matrices or as interval matrices obtained by treating each number as only accurate to the number of digits shown. Use as right side b = [15.65, l5.75]e when the matrix is thin, and b = e = (1, ... , 1)T otherwise. Is the system solvable in each of the eight cases? (b) Implement Krawczyk's algorithm in INTLAB and compare the accuracy resulting in each iteration with that of the built-in solver. 3.1 1nterpolation by Polynomials 3 Interpolation and Numerical Differentiation 131 3.1 Interpolation by Polynomials The simplest class of interpolating functions are the polynomials. 3.1.1 Theorem. (Lagrange Interpolation Formulas). For any pairwise distinct points xo . . . . • Xn, there is a unique polynomial Pn of degree :::n that interpolates f (x) on Xo, ... , X n . 
it can be represented explicitly as L PII(X) = f(Xi)Li(x), (1.1) i=O:n where In this chapter, we discuss the problem of finding a "nice" function of a single variable that has given values at specified points. This is the so-called interpolation problem. It is important both as a theoretical tool for the derivation and analysis of other numerical algorithms (e.g., finding zeros of functions, numerical integration, solving differential equations) and as a means to approximate functions known only at a finite set of points. Because a continuous function is not uniquely defined by a finite number of function values, one must specify in advance a class of interpolation functions with good approximation properties. The simplest class, polynomials, has considerable theoretical importance; they can approximate any continuous functions in a bounded interval with arbitrarily small error. However, they often perform poorly when used to match many function values at specific (e.g., equally spaced) points over a wide interval because they tend to produce large spurious oscillations near the ends of the interval in which the data are given. Hence polynomials are used only over narrow intervals, and large intervals are split into smaller ones on which polynomials perform well. This results in interpolation by piecewise polynomials, and indeed, the most widely used interpolation schemes today are based on so-called splines - that is, piecewise polynomials with strong smoothness properties. In Section 3.1, we discuss the basic properties of polynomial interpolation, including a discussion of its limitations. Section 3.2 then treats the important special case of extrapolation to the limit, and applies it to numerical differentiation, the problem of finding values of the derivative of a function for which one has only a routine computing function values. Section 3.3 treats piecewise polynomial interpolation in the simplest important case, that of cubic splines, and discusses their excellent approximation properties. Finally, Section 3.5 relates splines to another class of important interpolation functions, so-called radial basis functions. 130 (1.2) Pn is called the interpolation polynomial to f at Xo, ... , XII' and the Li(x) are referred to as Lagrange polynomials. Proof By definition, the L, are polynomials of degree n with L (x ) I } = {oI ' fifi. =I- i.. 1 1= J. From this it follows that (1.1) is a polynomial of degree s» satisfying PII (x j) = f(Xj) for j =0, ... , n. For the proof of uniqueness, we suppose that p is an arbitrary polynomial of ~egree sn with p(Xj) = f(xj) for j =0, ... , n. Then the polynomial p« _ p IS of degree <n and has (at least) the n + I pairwise distinct zeros Xo, ... , x". Therefore, P« (x) - p(x) is divisible by the product (x - xo) ..... (x - x,,). However, the degree of the divisor is »n; hence, this is possible only if Pn(x)p(x) is identically zero (i.e., if p = PII)' 0 . Although this result solves the problem completely, there are many situations III which a different representation of the interpolation polynomial is more Useful. Note that the interpolation polynomial is uniquely determined by the data, no matter in which form the polynomial is expressed. Hence one can pick the form of the interpolation polynomial according to other considerations, such as ease of use, computational cost, or numerical stability. 
132 3.1 Interpolation by Polynomials Interpolation and Numerical Differentiation 133 gives very inaccurate values at the interpolating points: Linear and Quadratic Interpolation To motivate the alternative approach we first look at linear and quadratic interpolation, the case of polynomials of degrees 1 and 2. The linear interpolation problem is the case n = I and consists in finding for given f(xo) and f(xI) a linear polynomial PI with PI (xo) = f(xo) and PI (XI) = f(xI). The solution is well known from school: the straight line through the x 6000 6001 PI(x) 0.33333 0.30000 -0.66667 -0.70000 PI(X) points (xo, Yo), (XI, YI) is given by the equation Y - Yo X - Xo YI - Yo XI - Xo and, in the limiting case XI ---+ Xo, the line through Xo with given slope Y~ by Y - Yo I --=YoX - Xo The linear interpolation polynomial has therefore the representation PI (x) = f(xo) + f[xo, xJJ(x - xo), (1.3) From linear algebra, we are used to considering any two bases of a vector space as equivalent. However, for numerical purposes, the example shows significant differences. In particular, we conclude that expressing polynomials in the power basis 1, x, x 2 , ... must be avoided to ensure numerical stability. To find t~e qU~dratic interpolation polynomial P2 (a parabola, or, in a limiting c.ase, a.straight l.me) to f at pairwise distinct points Xo, XI, X2, we modify the linear interpolation formula by a correction term that vanishes for X = Xo and X=XI: P2(X) := f(xo) where if XI "lxo, (1.4) ifxI=xo. We call f[xo, xJJ a first-order divided difference of [, This notation for the slope, and the associated form (1.3) proves to be very useful. In particular, (1.3) is often a much more appropriate form to express linear polynomials numerically than using the representation p(x) = mx + b in the power basis. xo) + f[xo, XI, X2](X - xo)(x - XI). Because X2 "I Xo, XI, the coefficient at our disposal, suggestively written as f[xo, XI, X2], can be determined from the interpolation requirement P2(X2) = f(X2)' We find the second-order divided difference '- f[xo, X2] - f[xo, xJJ f[ Xo, XI, X2 ] .X2 - XI if X2 "I XI, Xo. (1.6) 3.1.3 Exam~le. Let f(O) = 1, f(1) = 3, f(3) = 2. The parabola interpolating at 0, 1 and 3 IS calculated from the divided differences 3.1.2 Example. (With optimal rounding, B = 10, L = 5). The straight line through the points (6000, ~) and (6001, -~) may be represented, according to (1.3), by + f[xo, xJJ(x - f(xo)=l, 3- 1 f[xo, xJJ = - - = 2 1- 0 ' f[xo, X2] 2- 1 1 3-0 3 = -- = - and ~ PI (x) ~ 0.33333 + 0.33333 + 0.66667 (x _ 6000) 6000 _ 6001 ~ 0.33333 - 1.0000(x - 6000). f[xo, XI, X2] (1.5) In this form, the evaluation of PI (x) is stable and reproduces the function values at Xo and XI. However, if we convert (1.5) to standard form PI (x) = mx + b, the resulting formula PI(X) ~ -1.0000x + 6000.3 ! - 2 5 3-1 6 = _3_ _ = - - to 5 5 17 P2(X) = 1 + 2x - -x(x - 1) = - _x 2 + - x 6 6 6 + 1 . For st a bili ihty reasons, the first of the two expressions for P (x) is in fi it pre" 2 ,I me ClSlon calculations, preferable to the second expression in the power basis. 134 3.1 Interpolation by Polynomials Interpolation and Numerical Differentiation Note that the divided difference notation (1.6), used with the variable x in place of X2, allows us to write the linear interpolation formula (1.3) as an identity 135 is n-i times continuously differentiable in x and satisfies gi(X) =gi(Xi) + (x - Xi)gi+l(X). 
(1.10) We proceed by induction, starting with ga(x) = f[x] = f (x) for i = 0, where this follows from our discussion of the linear case. Hence suppose that (1.10) is true with i-I in place of i. Then gi is n - i > 0 times continuously differentiable and by adding the error term 1 1 gi(X) =g;(Xi) + (x - Xi) Newton Interpolation Formulas To obtain the cubic interpolating polynomial, we may correct the quadratic interpolation formula with a correction term that vanishes for x = xa, XI. X2: by repetition of this process, we can find interpolating polynomials of any degree in a similar way. This yields the interpolation formulas of Newton. A generalization due to Hermite extends these formulas to the case where repetitions of some of the x) are allowed; thus we do not assume that the x J are pairwise distinct. 3.1.4 Theorem. (Newton Interpolation Formulas). Let D ~ IR. be a closed interval, let f : D ---+ IR. be an (n + 1)-times continuously differentiable function, and let Xa. XI, ... , Xn E D. Then: (i) There are uniquely determined continuousfunctions f[xa, ... , Xi, ·]:D ---+ IR. (i = 0, 1. .... n) such that f[xa, ... , Xi-], x] = f[xa, ... , Xi-I, Xi]+ f[Xa, ... , Xi, X](X-Xi) (1.7) for all XED. (ii) For i = 0, 1, ... .n the polynomial Pi(X) := f[xa] + f[xa, xd(x - xa) (x - Xi-I) interpolates the function f(x) at the points xa, ... , Xi (iii) We have f(x) = Pi(X) + f[xa, E (1.8) D. ... , Xi, x](x - Xa)'" (x - Xi). gi(X):= f[xa, .. ·, Xi-I, X] xi))dt. (1.11) Thus. the function gi+1 defined by 1 1 gi+I(X)=f[xa. ,,,,Xi,X]:= g;(Xi +t(x -xi))dt is still n -i -1 times continuously differentiable. and (1.10) holds for i. Equation (1.9) is also proved by induction; the case i = 0 was already treated. If we suppose that (1.9) holds with i - I in place of i, we can use (1.7) and the definition (3.1.8) to obtain f(x) Pi-I (x) + (f[xa, ... , Xi-I, x;] + f[xa .... , Xi, X](X - x;)) x (x - Xa) ... (x - xi-d = Pi(X) + f[xa, ... , Xi, x](x - Xa)'" (x - Xi). = Thus (1.9) holds generally. Finally, substitution of x = X) into (1.9) gives the interpolation property !(X) = p;(x)) for j =0, ... , i. o Vn := f[xa, , x,,] Vi := f[xa, , Xi] + (x - Xi )Vi+l (i = n - 1, n - 2, ... , 0) gives rise to the expression (1.9) Vi = f[xa, + f[xa, Proof To prove (1.7), it suffices to show that + t(x - Equation (1.8) is called the Newton form of the interpolation polynomial at Xa, ... , Xi. To evaluate an interpolation polynomial in Newton form, it is advisable to use a Homer-like scheme. The recurrence + . + f[xa, ... ,Xi-I, X;](X- Xa) g;(Xi from which Va follows. , x;] + f[xa, ... , Xi+I](X - Xi) + ... x"l(x - Xi)'" (X - X,,-1) = Pn(x), A corresponding pseudo-MATLAB program is as 136 Interpolation and Numerical Differentiation 3.1.5 Algorithm: Evaluation of the Newton Form % Assume di = f[xo . . . . , Xi] v=dn ; for i = n - 1 : -1 : 0, v = d; + (x - Xi) * v; end; % Now v is the interpolated approximation to 3.1 Interpolation by Polynomials 3.1.7 Example. For the function f defined by 1 f(x) := - - , S-X SEre, one can give closed formulas for all divided differences, namely f (x) (Because MATLAB has no indices < I, a true MATLAB code should interpolate instead XI, ... , Xn by a polynomial of degree Sn - 1 and replace the zeros in this pseudo code with l s.) For the determination of the coefficients required in the Newton form, it is useful to have a general interpretation of the f[xo, ... , x;] as higher order divided differences, a symmetry property for divided differences, and a formula for a limiting case. 
These generalize the fact that the slope of a line through two points of a curve is independent of the ordering of these points and becomes the tangent slope in the limit when these points merge. f[ Xo, ... , x;] 1 = -;-----;-:-----;-:---;-:-__ (s - xo)(s - XI)'" (1.12) (s - Xi)' Indeed. this holds for i = O. If (1.12) is valid for some i :::: 0 then, for Xi + I #- Xi, l[xo, ... , xil - /[xo, ... , Xi-I, Xi+l] ff xo, ...• Xi, Xi_q ] = ------=-----...:...:...::..:: Xi -Xi+1 1 1 ($-xo)"'(S-Xi) ($ xo)···(s Xi = 3.1.6 Proposition. Xi-l)($ Xi+l) -Xi+1 (s - Xi+]) - (s - Xi) (s - Xo)" '(s 1 0) We have the divided difference representation f[xo, ... , Xi, x] = 137 Xi_I)(S - Xi)(S - Xi+I)(Xi - Xi+l) (s - xo) ... (s - Xi+]) , f[xo.· ..• Xi-I, x] - f[xo • . . . , Xi-I, X;] .::.....::-=------'----.::....-~----=------ X -Xi and by continuity, (1.12) holds in general. forx#-xi' (ii) The derivative with respect to the last argument is d - f[xo, ... , Xi-I, x] = f[xo, ... , Xi-I, X, xl dx (iii) The divided differences are symmetric functions of their arguments, that is, for an arbitrary permutation JT of the indices 0, 1, ... , i, we have f[xo, Xl, •.• , Using the proposition, the coefficients of the Newton form can be calculated recursively from the function values f[xd = f(Xk) if Xo, ... , Xn are pairwise distinct: Once the values l[xo, ... , x;] for i < k are already known, then one obtains l[xo, ... , Xk-I, xd from /[X . ] _ f[xo, ... , Xi-I, xd - l[xo, ... , x;] Xk Xk - Xi ' 0, ... , X, , i = 0, ... , k - 1. x;] = j[XJTO, XJTl, ... ,XJTi]. Proof (i) follows directly from (1.7) and (ii) as limiting case from (i)..To prove (iii), we note that the polynomial Pi (x) that interpolates f at the pOInts Xo, ... , Xi has the highest coefficient l[xo, ... ,x;]. A permutation JT of the indices gives the same interpolation polynomial, but the representation (1.9) yields as highest coefficient f[xlfo, ... ,x JT ;] in place of f[xo, ... , x;]. Hence o these must be the same. To c~nstruct an interpolation polynomial of degree n in this way, one must use o (n-) operations; each evaluation in the Homer-like form then only takes 0 (n) further operations. In practice, n is rarely larger than 5 or 6 and often only 2 or3Th is that t at imterpo Iati anon polynomials are not flexible enough to . e reason IS approximate typical functions with high accuracy, except over short intervals Where a small degree is usually sufficient. Moreover, the more accurate spline Interpolation (see Section 2.3) only needs O(n) operations. 3./ Interpolation by Polynomials Interpolation and Numerical Differentiation 138 The Interpolation Error To investigate the accuracy of polynomial interpolation we look at the interpolation errors (1.13) f(X) - Pi(X) = f[xo.···, Xi, x](x - xo)'" (X - Xi)' from (1.11) i times with respect to integration it follows that g(i)(~)= for some ~, 139 and using the mean value theorem of 1 tif(i+I)(xo+t(~ -Xo»dt=j<i+I)(XO+T(~ 1 T E 1 1 -xo» tidt [0, 1]. Therefore We first consider estimates for the divided difference term in terms of higher derivatives at some point in the interval hull 0{ xu, ... , Xi, x}. 3.1.8 Theorem. Suppose that the function f : D S; iR - lR. is n + 1 times continuously differentiable. (i) If xi; ... ,Xn ED, f[x 0, 0::: i::: n, then .r<i)(~) ... , with x I·]-- -~ ., with I . C " E O{.\O, ... , xd· (ii) The error ofthe interpolation polynomial PI! to f at xo, ... ,Xn E (1.14) D can be whence (i) is true in general. Assertion (ii) follows from (i) and (1.9). 3.1.9 Corollary. Ifx, XO•••. 
, X n E D [a, b], then written as (1.15) where (1.16) qn(X) := (x - xo)'" (x - x n) II(x) - Pn(x)l::: Ilf(n+l) II (n + l)~ Iq(x)l· (1.17) This bounds the interpolation error as a product of a smoothness factor Ilpn+ lJ ll cx) (n + I)! depending on f only and a factor Iq(x)1 depending on the interpolation points only. and ~ E Dlxo, ... , X n , x} depends on the interpolation points and on x. Hermite Interpolation Proof We prove (i) by induction on n. For n = 0, (i) is obvious. We therefore assume that the assertion is true for some n :::: O. If f: D -+ lR. IS n + .z .' h . d . h pothes l S times continuously differentiable, then we can use t e III uctrve y with g(x) := f[xu, x] in place of f and find, for all i ::: n, g(i)(O f[xo, Xl, ... , xi+il = g[Xl, ... , Xi-LJJ = -i-!- For Xu = XI = and we find ... = Xi =X, O(xo, ... , x,} = {x}, hence'; = x in relation (1.14), . f(i)(x) f[x, x , ... , x] (I + 1 arguments) = - - . it (1.18) One can use this property for the solution of the so-called Hermite interpolation Problem to find a polynomial matching both the function values and one or for some'; E O(Xl, ... , x;+I1. Differentiating the integral representation Inore derivative values at the interpolation points. In the most important case We Want . f,or pairwise " disti . points zo, z 1, .•. , Zm, a polynomial isnnct interpolation P of degree -::::'2m + 1 such that ( 1.19) Interpolation and Numerical Differentiation 140 3.1 Interpolation by Polynomials To solve this problem one doubles each interpolation point, that is, one sets X2j := X2j+l := Zj from which it easily follows that the interpolation requirement (1.19) is satisfied. Similarly, in the general Hermite interpolation problem, an interpolation point Zj at which the function value and the values of the first k j derivatives shall be matched has to be replaced by k j + 1 identical interpolation points Xij = ... = Xi]+kj = Zj; then, again, the Newton interpolation polynomial gives the solution of minimal degree. A very special case of this is the situation where the function value and the first n derivative values shall be matched at a single point Xo; then (1.15), (1.18) and Theorem 3.1.4 give the representation f(xo) +f +f I (xo)(x - Xo) + ... + f(n) (xo) n! n (x - Xo) (n+1) (c ) (n + " I)! (x - )n+l X is = P2m+ I. f(x) - p(x) = f[zo, ZO, ... , Zm, Zm, x](x - ZO)2 ... (x - Zm)2, = ,Xk~l, xd (j =0, ... , m), and calculates the corresponding Newton interpolation polynomial p The error formula (1.13) now takes the form f(x) The corresponding recursion for the dik := f[Xi, Xi+l, ... 141 0 for some ~ E D{x, xo}; and we see that the well-known Taylor formula with remainder term is a special case of Hermite interpolation. In actual practice, the confluent interpolation points cause problems in the evaluation of the divided difference formulas because some of the denominators Xi - Xk in the divided differences may become zero. However, by using the permutation symmetry, the calculation can be arranged so that everything works well. We simply compute the divided differences column by column in the following scheme. otherwise, because by construction the case x, = Xk occurs only when Xi = Xi+1 = ... = Xk, and then the (k - i)th derivative of f at Xi is known. Convergence Under suitable (but for applications often too restricted) conditions, it can be shown that when the number of interpolation points increases indefinitely, the corresponding sequence of interpolation polynomials converges to the function being interpolated. 
In view of the close relation between Newton interpolation formulas and the Taylor expansion, we suppose that the function f has an absolutely convergent power series in the interval of interest. Then f can be viewed as a complex analytic function in an open region D of the complex plane containing that interval. (A comprehensive exposition of complex analysis from a computational point of view is in Henrici [43].) We restrict the discussion to convex D because then the results obtained so far hold nearly without change (and with nearly identical proofs). We only need to replace interval hulls (I]S) by closed convex hulls (Conv S), and the statements (1.14), (1.15), and (1.17) by f(i)(~) I ~ f[xo, ... ,x;l E Conv { -I-'!If(x) - Pn(x)1 :::: suP. ~EConv{xo,.... x,} E } Conv{xo, ... ,xd , f(n+1) (0 I ( + 1)' Iqn(x)l, I n . and If(x) - Pn(x)l:::: II r:» I + l)~ Iq(x)l, (n where the oo-norm is taken over a convex domain containing all arguments. The complex point of view allows us to use tools from complex analysis that lead to a simple closed expression for divided differences. In the following, D[c; r] := {~ E C II ~ - c] :::: r} denotes the closed complex disk with center c E C and radius r > 0, Positively oriented boundary, and int D its interior. aD its 142 3.1 Interpolation by Polynomials Interpolation and Numerical Differentiation 3.1.10 Proposition. Let f be a function analytic in the open set Do ~ C. If the disk D[c; r] is contained in Do, then f[xo, ... , x n] = -1. 2m 1 3D f(s)ds for Xo, ... , Xn (s - Xo) ... (s - Xn) E int D. (1.20) Lst i 1 po1ation error is bounded according to f(s) - ds for all x E int K. = /f(x) - Pn(x)1 = If[xo, ... , x n, x]llx - xol .. ·Ix - Xn / 3K S - X Hence suppose that (1.20) is valid for n - 1 instead of n then, for (cf. Example 3.1.7), f[xo, ... , x n] ~ C containing the disk D[c; r]. For an infinite sequence of complex numbers x) E D[c; p] (j == 0, 1,2, ...), let Pn be the polynomial of degree Sn that interpolates the junction at xo, Xl, ... , Xn. Ifr > 3p, then for X E D[c; p], the sequence Pn(x) converges uniformly to f(x). 3.1.12 Theorem. Let f be afunction analytic in an open set D proof IfxED[c;p]thenlx-xjl S 2pforallj=0,1,2, ... ,sotheinter- Proof For n == 0, the formula is just Cauchy's well-known integral formula 1 f(x)=-. 143 X n-l For r > 3 p , I r~p I < 1, so that If (x) - Pn (x) I is majorized by a sequence that does not depend on x and converges to zero. This proves uniform convergence on D[c; p]. 0 f[xo, ... , xn-JJ - f[xo, ... , Xn-2, x n] "---=----"-------"--------=-------"----"-- Xn-I - Xn J(s) J(s) 1 [ (s~xO)···(s-xn_ll (S-XO)"'(S-X n_2)(S-Xn) ds = 2rri 13K Xn-I - Xn 1 [ f(s)ds 2rri 13K (s - xo) ... (s - x n) . This formula remains valid in the limit Xn therefore in general. -+ Xn-I. 3.1.11 Corollary. lflf(~)1 S M for all ~ = 0, ... , n then E So (1.20) holds for nand 0 D[c; r] and Ix) - c] S p < r for all j If[xo, ... , xn]1 S Mr (r - p)n Mr n+1 Mr ( 2 P )n+1 < (2p) =-- -- (r - p )n+2 (r - p) r - p i= X n The hypothesis r > 3p can be weakened a little by using more detailed considerations and more complex shapes in place of disks. However, a well-known example due to Runge shows that one cannot dispense with some condition of this sort. 3.1.13 Example. Interpolation of the function f(x) .- I/O + x 2 ) at more and more equidistant interpolating points in the interval [-5,5] leads to divergence of the interpolating polynomials Pn(x) for real x with Ix I ::: 3.64, and convergence for Ixi S 3.63. 
We illustrate this in Figure 3.1; for a proof, see, for example, Isaacson and Keller [46, Section 6.3.4]. +1 . 5,------_-------, Proof By the previous proposition, If[xo,··., xn]1 = -112rril 11 3K f(s)ds I (s - xo)··· (s - x n ) < _1 [ If(s)lldsl < M [Idsl - 2rr 13K Is - xol" ·Is - xnl - 2rr(r - p)n+! 13K = (Here, 0 M Mr ·2rrr= . 2rr(r - p)n+1 (r _ p)n+1 J Idsl denotes the unoriented line integral.) 0 The fact that the number M is independent of n can be used to derive the desired statement about the convergence of interpolating polynomials as n-+ 00. -5 "---5 ~ o _______J 5 o Figure 3.1. Equidistant polynomial interpolation in many points may be poor. 5 144 Interpolation and Numerical Differentiation 3.2 Extrapolation and Numerical Differentiation The hypothesis of the theorem is not satisfied because j has poles at ±i, although j is analytic in a large open and convex region containing the interpolation interval [-5,5]. For such functions, the derivatives Ij(n+1)(~) I do not vary too much, and the error behavior (1.15) of the interpolation polynomial is mainly governed by the term q.; (x) = (x - xo) ... (x - x n). The divergent behavior finds its explanation by observing that, when the interpolation points are equidistant and their number increases, this term develops huge spikes near the interval boundaries. This can be seen from the table n 16 32 64 128 256 1.9· 10' 3.1· 103 4.4 . 107 4.6· 1016 1.9· 1034 145 Proof We introduce the Chebyshev polynomials Tn, defined recursively by To(x):=I, T,(x):=x, T2(x)=2x 2-1, Tn+l(x):= 2xTn(x) - Tn-I(X) (n = 1, 2, ... ). (1.22) Clearly, the Tn are of degree n and, for n > 0, they have highest coefficient 2n - 1• Using the addition theorem cos(a + fl) + cos(a - fl) = 2 cos a cos fl, a simple induction proof yields the relation Tn(x) = cos(n arccos x) for all n. 0.23) This shows that the polynomial Tn + I (x) has all Chebyshev points as zeros, and hence Tn+! (x) = 2nqn (x). Moreover, the extrema of Tn+1 (x) are ±1, attained between consecutive zeros. This implies the assertion. 0 For the interpolation error in interpolation at Chebyshev points, one can prove the following result. of quotients 3.1.15 Theorem. The interpolation polynomial Pn(x) interpolating an arbitrary s times continuously differentiable junctions f: [-1, 1] ~ lR in the Chebyshev points (1.21) satisfies Interpolation in Chebyshev Points Although high-degree polynomial interpolation in equidistant points often leads to strong oscillations near the boundaries of the interpolation intervals destroying convergence, the situation is better when the interpolation points are more closely spaced there. Because interpolation on arbitrary intervals can be reduced to that on [-1, 1] through a linear transformation of variables, we restrict the discussion to interpolation in the interval [-1, 1], where the formulas are simplest. The example just discussed suggests that a choice of the x j that keeps the term qn(x) = (x - xo) ... (x - x n) of a uniform magnitude between interpolation points is more likely to produce useful interpolation polynomials. One can achieve this by interpolating at the Chebyshev points, defined by Xj :=cos 2) + 1 2n+ 2 lr (j=O, ... ,n). 3.1.14 Proposition. For the Chebyshev points (1.21), 1 Iqn(x)l:::: 2n for x E [-1, 1], and this value is attained between any two interpolation points. (1.21) Ij(x) - Pn(x)1 = 0 (~s ). Proof See. for example, Conte and de Boor [12, Section 6.1]. 
o This reference also shows that interpolation at the so-called expanded Chebyshev points, which, adapted to a general interval [a, b], are given by a +b +a -b ( cos u-+- l1r ) - 2 2 2n+2 / lr) ( cos - - 2n+2 is even slightly better. 3.2 Extrapolation and Numerical Differentiation Extrapolation to the Limit Extrapolation refers to the use of interpolation at points xo, ... , x; for the approximation of a function j at a point x ¢ O{xo, ... , x n }. Because interpolation polynomials of low degree are generally rather inaccurate, and those of high degree are often bad already near the boundary of the interpolation interval and this behavior becomes worse outside the interval, extrapolation cannot be recommended in practice. 147 Interpolation and Numerical Differentiation 3.2 Extrapolation and Numerical Differentiation A very important exception is the case in which the interpolating points Xj (j =0,1,2, ...) form a sequence converging to zero, and the value of fat x = 0 is sought. The reason is that in this particular case, the usually offending term qn (X) behaves exceptionally well, The assertion (2.3) is clearly true for k = O. Hence suppose that (2.3) holds for k - 1 in place of k. In (2.2), the factor (x - Xi) vanishes for x = Xi, and the fraction vanishes for x =Xi-I, Xi-2, ... , Xi-HI; therefore, 146 (2.1) Moreover, for x and converges very rapidly to zero even when the x j converge quite slowly. 3.2.1 Example. For the sequence x j := lowing results: 2 n 4 8 16 Because the value at the single point x = 0 is the only one of interest in extrapolation, the Newton interpolation polynomial is not the most common way to compute this value. Instead, one generally uses the extrapolation formulas of Neville, which give simultaneously the extrapolated values of many interpolation polynomials at a single argument. 3.2.2 Theorem. The polynomials defined by + (x,. _ x )Pi.k-I(X) - Pi-I,k-I(X) Xi-k - Xi fork= 1, ... , i, (2.2) are the interpolation polynomials for f at Xi<k.» .•• , Xi. Proof Obviously Pik(X) is of degree ~k. We show by induction on k the interpolation property - k, i - k = Pi,k-I (Xi-k) - (Pi,k-I (Xi-k) - Pi-I,k-I (Xi-k)) = Pi-l.k-l (Xi-k) = f(Xi-k), The values Pik (0) are, for increasing k, successively better extrapolation approximations to f (0), until a stage is reached where the limitations of polynomial interpolation (or rounding errors) increase the error again. In order to have a natural stopping criterion, one monitors the values 8i := IPu (0) - Pi,; -I (0) I, and stops the extrapolation if 8i is no longer decreasing. Then one accepts Pu (0) as the best approximation for f (0), and has the value of 8i as a natural estimate for the error f (0) - Pu (0). (Of course, this is not a rigorous bound; the true error is unknown and might be larger.) In the algorithmic formulation, we use only the Xi for i > 0 in accord with the lackof zero indices in MATLAB. We store Pik (0) - f (x;) in Pi» I-b overwriting old numbers no longer needed; the subtraction gives a slight improvement in final accuracy because the intermediate quantities are then smaller. 3.2.3 Algorithm: Neville Extrapolation to Zero i = 1; PI =0; fold=f(xl); while 1, p;o(X) = f(Xi), =i one obtains so that (2.3) is valid for k and hence holds in general. Thus Pik(X) is the uniquely determined polynomial of degree <k which D interpolates the function f (x) at Xi<k» •.• , Xi. 
The Extrapolation Formulas of Neville Pik(XI) = f(XI) for I Pik(Xi-k) fJ (j = 0, 1,2, ...), one finds the fol- Thus the extrapolation to the limit from values at a given sequence (usually x j = xO / N, for some slowly growing divergent sequence N j ) can be expected to give excellent results. It is a very valuable technique for getting function values at a point (usually x = 0) when the function becomes more and more difficult to evaluate as the argument approaches this point. () . (x)·-· . - Pl.k-I X Plk = Xi-«, + 1, ... , i. (2.3) i = i + 1; Pi = 0; fest = f(Xi); df = fest - fold; for j = i - I : - 1:1, P] = PHI end, + (Pj+l - Pj + df) * xi/(Xj - Xi); fold = fest; 80ld = 8; 8 = abs(p2 - PI); if i > 2 & 8 2: 8old, break; end, end, fest = fest + PI; % best estimate If(O) - fest I ~ 8 Interpolation and Numerical Differentiation 3.2 Extrapolation and Numerical Differentiation For the frequent case where Xi = x /qi (i = 1, 2, ...), the factor Xi/(Xi-k -Xi) in the formula for Pik becomes l/(qk -1), and the algorithms takes the following form. if f is (s + 2) times differentiable atx.1f f is expensive to evaluate, a fixed p(h) is used as approximation to rex), with an error of O(h). However, because for small h divided differences suffer from severe numerical instability due to cancellation, the accuracy achievable in finite precision arithmetic is rather low. However, if one can afford to spend several function values for the computation of the derivative, higher accuracy can be achieved as follows. Because p(h) is continuous at h = 0, and (2.4) shows that it behaves locally like a polynomial, it is natural to calculate the derivative rex) = p(O) by means of extrapolation: For a suitable sequence of numbers hi =/= converging to 0, one calculates the interpolation polynomial Pi using the values p(h j) = f[x, x + hj] (j =0, ... , i). Then the Pi (0) can be expected to be increasingly accurate approximations to rex), with an error of O(h o ' " h n ) . Another divided difference, the central difference quotient 148 3.2.4 Algorithm: Neville Extrapolation to Zero for Xi = X/ qi i=l;PI=O; fold = f(xI); while 1, i = i + 1; Pi fest = 0; ° Q = 1; = f(Xi); df = fest - fold; for j = i - I : - 1 : 1, Q= Q »«. Pj = PHI + (PHI - P] + df)/(Q - 1); end, fold = fest; Oold = 0; 0= abs(p2 - PI); if i > 2 & 0 2: OOId' break; end, end, f[x - h, x =: p(h In practice, one often has functions f not given by arithmetical expressions and therefore not easily differentiable by automatic methods. However, derivatives are needed (or at least very useful) in many applications (e.g., to solve nonlinear algebraic or differential equations, or to find the extreme values of a function). In such cases, the common remedy is to resort to numerical differentiation. In the simplest case one approximates, for fixed x, the required value t" (x) through a forward difference quotient f(x + h) - f(x), h =/=0 h at a suitable value of h. By Taylor expansion, we find the error expansion = f'ex) + L.f (i+l)(X) , hi + O(h'+I) + 1). i = 1:5 (I + h) - f(x - h) + L i=l:s Numerical Differentiation p(h) = f(x f'ex) We now demonstrate the power of the method with numerical differentiation; other important applications include numerical integration (see Section 4.4) and the solution of differential equations. + h] = + h] 2h fest = fest + PI; % best estimate If(O) - fest I .::: 0 p(h) := f[x, x 149 (2.4) 2 i>» (x) h 2i + O(h 2s+2) (2i + I)! (2.5) ) approximates the derivative f' (x) with a smaller error of 0 (h 2 ) . 
Because in the asymptotic expansion only even powers of h occur, it is sensible to consider this expression as a function of h 2 instead of h. Because now h 2 plays the role of the previous h in the extrapolation method, the extrapolation error term is now O(h6 ... h~). Thus h need not be chosen as small as for forward differences to reduce the truncation error to the same level. Hence central differences suffer from less cancellation and therefore lead to better results. The price to pay is that the calculation of the values p(h]) = f[x - h i» x + h j] for j = 0, ... , n requires 2n + 2 function evaluations f (x ± h j), whereas for the calculation of f[x, x + h j] for j = 0, ... , n, only n + 2 function evaluations (f (x + h j) and f (x)) are necessary. 3.2.5 Example. We want to find the value of the derivative of the function given by f(x) = sinx at x = 1. Of course, the exact value is r(l) = cos 1 = 0.540302305 868 .... The following table lists the central difference quotients p(h~) := f[x - hi, X + hi] for a number of values of hi > 0. Moreover, the values Pi (0) calculated by extrapolation are given for i = 1, 2. Correct digits are underlined. We used a pocket calculator with B = 10, L = 12, and optimal rounding, so that the working precision is E; = 10- 11 ; but only 10 significant digits can be displayed. ! Interpolation and Numerical Differentiation 3.2 Extrapolation and Numerical Differentiation 151 error o 1 2 3 4 5 6 7 8 0.04 0.02 0.01 0.5.10- 2 10- 4 0.5.10- 4 10- 5 10- 6 10- 7 0.540 1582369 0.5402662865 0.540293301 1 0.540 300054 8 0.5403023150 0.5403022800 0.5403021500 0.540 304 500 0 0.540 295 0000 - - - discretization error rounding error total error - 0.5403023030 0.540302305 9 The accuracy of the central difference quotient first increases and attains its optimum for h ~ 10-4 as h decreases. Smaller values than h = 10- 4 give less accurate results, because the calculation of flx - h, x + h] is hampered by cancellation. The increasing cancellation can be seen from the trailing zeros at small h. The extrapolated value is correct to 10 places already for fairly large hi; it is remarkable (but typical) that it is more accurate than the best attainable value for any of the central differences f[x - hi, X + hd. To explain this behavior, we now tum to a stability analysis. Figure 3.2. T~e nAum~ri.ca! ~ifferentiation error in dependence on the step size h. The optimal step srze h mmurnzmg the total error is close to the point wherediscretization errorand rounding error are equaL consists of a contribution O( T,) due to errors in function evaluations and a 2 contribution 0 (h ) , the discretization error due to the finite difference approximation, The qualitative shape of the error curve is given in Figure 3.2. By a similar argument one finds for a method that, in exact arithmetic, approximates f'(x) with a discretization error of O(h') a total error 8(h) (including errors in function evaluations) of The Optimal Step Size We want to estimate the order of magnitude of h that is optimal for the calculation of difference quotients and the resulting accuracy achievable for thc approximation of I' (x). In realistic circumstances, function values at arguments near x can be evaluated only with a relative accuracy of E, say, so that lex ± h) = I(x ± h)(1 + O(E)) (2.6) is calculated instead of I (x ± h). 
In the most favorable circumstances, E is the machine precision; usually E is larger, Inserting (2.6) into the definition (2.5) of the divided difference gives (2.8) Limiting Accuracy Under the natural assumption that the hidden constants in the Landau symbols do not change much for h in the relevant range, we can analyze the behavior of (2.8) by replacing it with 8(h) i[x - h, x + h] = I[x - h, x + h] + 0 Because f'(x) = f[x - h, x total error f'ex) - + h] + 0(1f'''(x)jh j[x - h, x + h] = 0 2 ) (If(X)I~) . by (2.5), we find that the (If(X)I~) + 0(1f"'(X)lh 2 ) (2.7) E =0- h + bh' where 0 and b are positive real constants. By setting the derivative equal to zero, we find h = (as / sb) 1/(S+ I) as the value for which 8 (h) is minimal and t~e minimal value is s;in = (s + l)bh' = O(E,/(s+I)). The optimal h is only a lIttle smaller than the value of h where the discretization error term matches the function evaluation error term, which is h = (aE/h)l/(s+l) and gives a total error of the same magnitude. 152 Interpolation and Numerical Differentiation 3.3 Cubic Splines If we take into account the dependence of a and b on f, we find that the optimal magnitude of h is =0 h opt (I f(.<+I) f(x) 1 (x) 1 / (.< + 1) (2.9) £1/(H1)) . Because in practice, f(H 1) (x) is not available, one cannot use the optimal value h op t of h; a simpler heuristic substitute is h=I f[x + f(x) ho, x - + h) := 6f[x - Zh, x - 11, x. x - 2f(x) hZ 2 " + f(x - h) 2f IZi+ ZJ( X ) h2i + 0(h 2.<+2) ,"",' 1=1.s (2i +2)! - f(x - h» if h h3 # 0; z then f!ll(x) = p(O) and p(h ) = f!ll(x) + O(h z). The optimal It is now of order l 5 0(e / ) , giving an error oforder 0(8 2/ 5). Again, extrapolation improves on this. 3.3 Cubic Splines As shown, polynomial interpolation has good approximation properties on narrow intervals but may be poor on wide intervals. This suggests the use of piecewise polynomial functions to keep the advantages and overcome the problems associated with polynomial interpolation. Piecewise Linear Interpolation To set the stage, we first look at the simple case of piecewise linear interpolation. if h (i) A grid on [a, b] is a set fl = {XI, ... ,xn } satisfying a=xI <XZ < '" <xn- ! <xn=b. (3.1) XI, ... 'X n are called the nodes of fl, and h := max{IXj+l - Xiii j = I, " . , n - I} is called the mesh size of <:1. A grid <:1 = {XI, ...• x } is n called equispaced if #- 0; xi=a+(i-l)h then f"(x) = p(O), and the asymptotic expansion p(h )=f (x) +,~ + h, x + 2h] 2h» - (l(x + h) 3.3.1 Definition. For the approximation of higher derivatives one has to use higher order divided differences. The experience with the first derivative suggests to use also ~entral differences with arguments symmetric around the point where the derivanvc is to be approximated; then again odd powers of h cancel. and we can use extrapolation for h 2 ---+ O. To approximate second derivatives f" (x) one can use f(x ) ~(l(x + 2h) - f(x - Higher Derivatives + h) = 2 hol This value behaves correctly under scaling of f and x and still gives total errors of the optimal order O(es/(s+l», but now with a suboptimal ~idden factor.. In particular, for the approximation of I' (x) by the central ~Ifference ~uot1ent f[ - h + h] cf. (2.5). we have s = 2 in (2.8), and the optimal magmtude of x ,x , . 2/3' . his 0(e l / 3 ) achieving the minimal total error of magmtude O(e ). This IS .. corroborated, in the above example, where e = 2:1 10- II'IS the mac hime precision. 2 Extrapolation of p(h ) using three values f[x - h o: x +.hol: f[x - h I. X + hIl, and f[x - h i. 
X + h 2 ] with hi = h/2' gives a discretization error of 0(h 6 ) , hence 0(e l / 7 ) as optimal magnitude for h and nearly fu.n .accura~y 0(e 6 / 7 ) as optimal magnitude for the minimal total error. Again, this IS consistent with the example. . Note that if we use for the calculation the forward difference quotient f[x, x + h), cf. (2.4) we only have s = 1 in (2.8), and the optimal choice h = 0(e I/2) only gives the much inferior accuracy 0(eI/ 2 ) . p(h 2 ) := 2f[x - h , x, x holds for sufficiently often differentiable f. Because of the division by h2, the total error now behaves like O(f;,) + 0(h 2 ) , giving an optimal error of l 2 I /4 0(e / ) for h of order 0(e ) . With quadratic extrapolation, the error looks 6 like 0 (f;,) + 0 (h ) , giving an improved optimal error of 0(e 3 / 4 ) for h of order l 8 0(e / ) . Note that the step size h must be chosen now much larger, reflecting the fact that otherwise cancellation is much more severe. To approximate third derivatives 1'''(x) one uses similarly p(h lel/(.<+I). 153 .. the mesh size is then h (II) Af . = (b. - fori=I ....• n; a)/(n - 1). unction p : [a, b) ---+ JR IS called piecewise linear over the grid fl if it is COntinuousand agrees on each interval [x,, xi+Il with a linear polynomial. 3.3 Cubic Splines Interpolation and Numerical Differentiation 154 f (iii) On the space of continuous, real-valued functions JlOOO ~ 32 subintervals between any two already existing grid points. This means 32 times as much work; cubic polynomial interpolation would already be of order 0 (h 4 ) , reducing the work factor to a more reasonable factor of 11000 ~ 5.6. Thus, in many problems, piecewise linear interpolation is either too inaccurate or too slow to give a satisfying accuracy. However, this can be remedied by using piecewise polynomials of higher degree with sufficient smoothness. defined on the interval [a, b], we define the norms Ilflloo := sup{[f(x)11 x E [a, bl} and IIfllz = Jib 155 f(x)Z dx. Splines For piecewise linear interpolation, it is most natural to choose the grid 1:1 as the set of interpolation points; then the interpolation condition (3.2) automatically ensures continuity in [a, b], and we find the unique interpolant Sex) = f(x)) + !lx), x)+d(x - xi) for all x E [x), xHI]. We see immediately that an arbitrarily accurate approximation is possible when the data points are sufficiently closely spaced ( i.e., if the mesh size h is sufficiently small). If, in addition, we assume sufficient smoothness of f, then we can bound the achieved accuracy as follows. 3.3.2 Theorem. Let Sex) be a piecewise linear interpolating function on the grid 1:1 = {Xl, ... , x n } with mesh size h over [a, b]. If the function f (x) to be interpolated is twice continuously differentiable, then If(x) - S(x)\ S Proof For x E [x), x) + tl, hZ gllf /I 1100 for all x E [a, b]. the relation We now add smoothness requirements that define a class of piecewise polynomials with excellent approximation properties. A spline of order k over a grid 1:1 is a k - 2 times continuously differentiable function S: [a, b] ---+ lR. such that the (k - 2)nd derivative S(k-2) is piecewise linear (over 1:1). A spline S of order k is piecewise a polynomial of degree at most k - 1; indeed, S must agree between two adjacent nodes with a polynomial of degree at most k because the (k - 2)nd derivative is linear. In particular, splines of order 4 are piecewise cubic polynomials; they are called cubic splines. Splines have excellent approximation properties. 
Because cubic splines are satisfactory for many interpolation problems, we restrict to these and refer for higher order splines to de Boor [IS]. (For splines in several variables, see de Boor [16].) Important examples of splines are the so-called B-splines. A cubic basis , Xn }, is a cubic spline Sl(x) spline, short a B-spline over the grid 1:1 = {Xl, defined over the extended grid x-z < X-I < < Xn+3 with the property that (for some 1=0,1, ... , n + I) S/(J;/) > 0 and S/(x) =0 for x rf. (x/-z, x/+z). Cubic basis splines exist on every grid and are determined up to a constant factor by the (extended) grid 1:1 and the index I; see Section 3.5. 3.3.3 Example. For the equispaced case, one easily checks that the functions defined by implies the bound 1 1 If(x) - S(x)1 S 2111" 1I 00 1(x - x))(x o S xHl)1 /I Where (x ,J+ I _xJ)z . S 211f 1100~ 0 h; 111"1100' . I I it i necessary In order to increase the accuracy by three d ecrma p aces, I IS to decrease h Z by a factor of 1000, which requires the introduction of at least B(x):= ~(2 - Ixl)3 - (l - Ixl)3 for Ixl S 1, ~ (2 - Ix1)3 for I < Ixl < 2, o for Ixl ::: 2 I are B-splines on the equispaced grid with xi = Xo + lh (cf. Figure 3.3). (3.3) 156 Interpolation and Numerical Differentiation 3.3 Cubic Splines 3.3.4 Theorem. Let ~ 1.35 157 = {XI •... , x n } be a grid over [a, b] with spacings h, = X,'+I - X,' (I' . -- I , ... ,n - I) . Thefunction S given by (3.5)for x E [x). Xj+tJ is a cubic spline, interpolating Ion the grid, precisely when there are numbers m) (j = I, ... , n) so that (3.6) (j = 2, ... , n - I), (3.7) where h· q) = h -0.5 -1 ---'-0.5 o 0.5 2 1.5 Figure 3.3. Cubic Bvsplines for an equally spaced grid, h = ~h ) [0, I] E !. 1 " m)=6 S (x) For interpolation by cubic splines, it is again most natural to choose the grid ~ as the set of interpolation points. The interpolation condition (3.2) again ensures continuity in [a, h], but differentiability conditions must be enforced by suitable constraints. We represent S in [x). x)+d as a cubic Hermite interpolation polynomial. Using the abbreviations (3) = f[Xj, Xj+I], Y) = f[Xj. Xj+I, Xj], OJ = f[Xj, Xj+I, »i- Xj+l], J' S'(X) = (3) + y)(2x - S"(x) = + 15)(x - x)2(x - + y)(x X)+I) x)(x - for all x Xj+tJ. - X))2), 6m j = 2 YJ - 28) h ) + 4o)h} for x -+ x) +0 in S"(x), for x -+ x)+! - 0 in S"(x); but this is equivalent to (3.6). X)+l) E [x), . Xj - X)+l) + OJ (2(x - x))(x - X)+I) + (x 2y) + 28)(3x - Zx , - Xj+I). 6m)+1 = 2Yj x) J+ Thus we obtain the following continuity condition on S": (3.4) + (3j(x - forj=l, ... .n. Formula (3.5) gives the following derivatives for x E [x' x· I]' we find SeX) = I(xj) (j = 2, ... , n - 1). Proof Because S (x) is piecewise cubic, S" (x) is piecewise linear and if we kn " , ew S (x), then we would know Sex). Because, however, S"(x) is still unknown, we define ..L---._ _~'____ _~ - L _ ._ _--"-- )-1 (3.5) To get (3.7), we derive continuity conditions on S' as x -+ xJ X)+l -0: x -.. Because the derivatives of S at x j and x)+ I are not specified. the constants Yj and OJ are still undetermined and at our disposal, and must be determined by imposing the condition that a cubic spline is twice continuously differentiable at the nodes (elsewhere, this is the case automatically). The calculation of the corresponding coefficients Yi and 8) is very cheap but a little involved because. as we show, a tridiagonal system of linear equations must be solved. S'(X))={3) - y)h) S'(xj+»=/3j =/3) - (2m) +m)+l)h), + y)h j +o)h;=/3J + tm , +2mj+dh). 
We can therefore formulate the continuity condition on S' as +0 and Interpolation and Numerical Differentiation 158 3.3 Cubic Splines this is equivalent to Similarly, the free node condition at equation . . . ] _ f[xj, Xj+tl- f[Xj-l, Xj] = {3j - {3j-1 f[Xj-I,Xj,Xi+ 1 .. h· +h· Xj+1 - Xj-l j-l j (2mj n- z) hll-2) ( 2 + -h- mn-I + mj+dh j + (mj-l + 2mj)hi - 1 + (I h - -hn~l n-] h j_ 1 +h j o for j = 2, ... , n - 1, and this simplifies to (3.7). m2 -ml m3 - mz X2 - XI X3 - Xz f[X n - 2 , XIl_I, x n ] f[X n-2, Xn-I, x ll] I - h2 / hi A ii = { 21 - .. _ A 1.1+1- A .. 1,1-1 (hi + h 2)m2 - otherwise, {2 + h Plugging this into (3.7) for j = 2 gives hi 1, for i = otherwise, = {2+h ll- 2/ h ll_1 I - qj 0 I '2 hl(h l +h 2)f[XI,x2,X3] = himl +2h 1(h j +h2)mz+h lh zm3 + (21z l + h 2)(h l + hz)l1'lz hz)(h l + hz)ml hDml + (2h l 2/ qj fori=n, otherwise, if Ik - i I > 1. The computational cost to obtain the m, and, therefore, the spline coefficients is only of order 0 (n). If we have an equispaced grid, then we get the following matrix: hzml hi = (hi = (hi - = 1, for i = n, for i hll- 2/ hll_ 1 A ik = 0 This leads to the condition (X3 - xI)m2 - (x, - x2)mj m., = f[X n-2, Xn-l, x ll]. (3.9) f[XI, X2, X3] f[XI, X2, X3] f[ x2, X3, X4] CASE 1: A simple requirement is the free node condition for X2 and Xn- ! . In this case, we demand that Sex) be given by the same cubic polyno- mial in each of the two neighboring sets of intervals [XI, X2], [X2, X3] and [Xn- 2, Xn-l], [Xn-I, x ll]; thus X2 and Xn-l are, in a certain sense, "artificial" nodes of the spline. As a consequence, S'" cannot have jumps at X2 or Xn-I. Since Sill shouldn't have a jump at X2, we must have 01 = 02, so gives the determining XIl-I From (3.7) and these two additional equations, we obtain the tridiagonal linear system Am = d with Because f[Xj-l, Xj, xJ+tl can be computed from the known function values, the previous equation gives n - 2 defining equations for the n unknowns m lWe can pose various additional conditions to obtain the two missing equations. We now examine four different possibilities. m3= 159 3 2 I '2 0 I '2 2 I '2 A= I '2 + h 2)(h l + h 2)m 2, 0 so (3.8) 2 I '2 I '2 2 '2 3 0 I In this important special case, the solution of the linear system Am = b simplifies a little; ma and mll_1 may be determined directly, then m-; to fIln-Z may be determined by Gaussian elimination, and fill and m.; 160 may be determined by substitution. Because of diagonal dominance. pivoting is unnecessary. CASE 2: Another possibility to determine two additional equations for the m j is to choose two additional points, Xo = XI - ho and X n+ I = x; + hl/. near a and b and demand that S interpolate the function f at these points. For Xo, we obtain the additional condition Similarly, we obtain as additional condition for x n + I; This leads to a tridiagonal linear system analogous to that in Case 1 but with another right side d and other boundary values for A. This linear system can also be solved without pivoting, provided that h1 Xo X,,+I - Xn A has the following form: x x 0 x x x 0 x x 0 0 0 x 0 0 0 x x 0 0 x x Here, II I - ~ A 1100 .s ~,so that we may again apply Gaussian elimination without pivoting; however, because of the additional comer elements outside the band, nonzero elements are produced in the last rows and columns of the factorization. Thus the expense increases a little, but remains of order O(n). 
CASE 4: If the derivative of f at the end points XI and X n is known (or can reliably be estimated), we can set Xo = Xl and Xn+1 = X n; to obtain m, and m-: we then compute the divided differences f[XI, XI. X2] and f[Xn-l, X n, xn] from these derivative values. This corresponds to the requirement that S'(X) = f'(x) for x E {XI. xn}' The spline so obtained is called a complete spline interpolant. so XI - 161 3.3 Cubic Splines Interpolation and Numerical Differentiation In order to give a bound for the error of approximation prove the following. 111- Slloo' we first 3.3.5 Lemma. Suppose f is twice continuously differentiable in [a. b). Then, for every complete spline interpolant for I on [a, b], IIS"lIoo .s 311f"1100. (3.10) h,,-l ! Proof For a complete spline interpolant. I 1- A II .s ~; therefore, Proposition 2.4.2 implies because then (3. 11) ~ecause f[x. x', x"] = 41"(~) for some ~ E O{x, x', x"}, we have lid II 00 < 211 I" 1100' Because S" is piecewise linear. we have and A is an H-matrix. If I is periodic with period in [a. b], we can construct S to be a CASE 3: periodic spline with Xo = Xn-l, X,,+ 1 = X2, and m I = m n . In this case, we obtain a linear system for m 1, ... , m,,_1 from (3.7), whose matrix IIS"lIoo = i=1:1l max S"(Xi) =6 max Imil=61lmll00 i=1:11 = 611A- Id11 00 .s 611A- llloolldll 00 .s 6·1 . ~. 111"1100' o 163 3.3 Cubic Splines Interpolation and Numerical D(fferentiation 162 The following error bound for the complete spline interpolant results, as well as similar bounds for the errors in derivatives. 3.3.6 Theorem. Suppose f isfour times continuously differentiable in [a, b]. Then the bounds I(x - Xj)(x - xj+I)ls h 2/4 then imply Ilf - Sll oo = Ilell oo h2 1 s 4' 2:llelllloo s h4 16 Ilf(4) 1100' For the derivative of e, we obtain e'(x) = (e[x, xj+Jl- e[xj, xj+1D - (e[x, xj+IJ - e[x, xD = (x - xj)e[x, Xj, Xj+Jl- (Xj+l - x)e[x, x, xj+IJ, so (3.12) implies holdfor every complete spline interpolant with mesh size h. o Proof. For the proof, we utilize the cubic spline S with S"(Xj) = !"(Xj) for j = 1, ... , n, 3.3.7 Remarks. S'(Xj) = f'(xj) for j =1 (i) Comparison with Hermite interpolation shows that the approximation error has the same order (0(h 4 ) ) . In addition (and this can also be proved for Hermite interpolation), the first and second derivatives are also approximated with orders 0(h 3 ) and 0(h 2 ) . (The present simple proof does not yield optimal coefficients in the error bounds; see de Boor [15] for best estimates.) (ii) For equispaced grids (i.e., Xi+l - Xi = h for each i), it is possible to show that the first derivative is also approximated with order 0(h 4 ) at the grid points. Computation of the complete spline interpolant for a function f is therefore also useful for numerical differentiation of f. (iii) Theorem 3.3.6 holds without change for periodic splines in place of complete splines. It also holds for splines defined by the free node condition and splines defined through additional interpolation points, when the error expressions are multiplied by the (usually small) constant factor ~(1+3I1A-lIl00). and j =n, where the existence of S follows by integrating the piecewise linear function through !"(Xj) twice. Lemma 3.3.5 implies II!" - S"lloo s II!" - S"lloo + II(S - S)"lloo s II!" - S"lIoo + 311(S - f)"1I =411!" - S"lloo. 
00 Because S" is the piecewise linear interpolant to implies the bounds IIi" - S"lloo f" at the Xi, Theorem 3.3.2 h2 s gll(fIl)lIlloo, so (3.12) To obtain the corresponding inequalities for Because e(x j) = 0, the relationship e(x) then follows for x E = (x f and f', we set e S. An Optimality Property of Cubic Splines Wenow prove an important optimality theorem for complete spline interpolants. - Xj)(x - xj+de[xj, Xj+l, [Xj, Xj+l], j = 1, ... , n - f - xl 1. Formula (3.12) and 3.3.8 Theorem. Suppose f is twice continuously differentiable in [a, bl, and suppose Sex) is a complete spline interpolant for f on the grid a = XI < ... < Interpolation and Numerical Differentiation 164 IIS"II~ = II!"II~ - II!" - S"II~ ::: II!"II~. K(x) = Proof To prove the statement, we compute the expression e := 1I1"1I~ - III" - S"II~ - IIS"II~ = l 21 b (f"(x)2 - (f"(x) - S"(x))2 - S"Cd) dx b (f"(x) - S"(x))S"(x) dx = 2 "2: lX' i =2:n (I"(x) - S"(x») S"(x)dx. Xl-l Because S" (x) is differentiable when x E (Xi -1, Xi), we obtain e =2"2: ((fI(X) - S/(X))S"(X{:_L - L~1 (f'(X) - 165 (ii) The 2-norm of the second derivative is an (approximate) measure for the total curvature of the function. This is because the curvature of a function f(x) is x" =b. Then = 3.4 Approximation by Splines S/(X))S"/(X) dX) I"(x) /I+ f'(x)2 Therefore, if the derivative of the function is small or almost constant, then the curvature is approximately equal or proportional to the second derivative of the function, respectively. Thus, we can approximately interpret the theorem in the following geometric way: the complete spline interpolant has a total curvature that is at most as large as that of the original function. The theorem furthermore says that the complete spline interpolant is approximately optimal over all interpolating functions in the sense of minimal total curvature and therefore is more graphically esthetic. (iii) The same proof produces analogous optimality for periodic splines and the so-called natural splines, which are characterized by the condition S"(xd = S"(x,,) = O. The latter are of course appropriate only for the approximation of functions f with f" (x[) = f" (x,,) = O. 1=2:" by integration by parts. Because the expression (f' (x) - S' (x) ).S"(x) .is continuous on all of [a, b], and because S'" (x) exists and is constant m the interval [Xi-I, x;], we furthermore obtain E~2 [(['(Xl - S'(X))S"(X)I: - ,~" ([(xl - S(xll S"'(X)[.]. Because f' and S' agree by assumption on the boundary, the first expression vanishes. The second expression also vanishes because f and S agree at the 0 nodes x j (j = 1, ... , n). Therefore, e = 0, and the assertion follows. 3.3.9 Remarks. . 0f (i) The optimality theorem states that the 2-norm of the second deri envatIv~ the complete cubic spline interpolant is minimal over all twice contlllUously differentiable functions that take on specified values at .xI, .... ' x" and whose derivatives take on specified values at Xl and x,.. ThIS optImalr ity property gave rise to the name "spline," borrowed from the name.fo a bendable metal strip used by engineers in drafting. If such a phySIcal 1. rve spline were forced to go through the data points, then the resu tm~ cu would be characterized by a minimum stress energy that is approxImated by f\lI"II~. 3.4 Approximation by Splines In practice, we frequently know function values only approximately because they come from experimental measurements or expensive simulations. 
Because oftheir minimal curvature property, splines are ideal for fitting noisy data (Xi, Yi) (i = I, ... , m) lying approximately on a smooth curve Y = f (x) of unknown parametric form. To fi nd a suitable functional dependence, one represents f (x) as a cubic spline Sex). If we interpolated noisy data with a node at each value we would obtain a very unsmooth curve reflecting the noise introduced by the inaccuracies of the function values. Therefore, one approximates the function by a spline with a few nodes only. Approximation by Interpolation A simple and useful method is the following: We choose a few of the data points as nodes and consider the spline interpolating compute the error interval f at these points. We then e=D{S(x) - f(x) I x E M} corresponding to the set M of data points that were ignored. If the maximum 166 Interpolation and Numerical Differentiation 3.4 Approximation by Splines error is then larger than a previously specified accuracy bound, then we choose as additional nodes those two data points at which the maximum positive and negative error occurred, and interpolate again. We then compute the maximal error again, and so on, until the approximation is sufficiently accurate. The computational expense for this technique is 0 (/lm), where m is the total number of data points and n is the maximum number of data points that are chosen to be nodes. A problem with all approximation methods is the treatment of outliers, that is, particularly erroneous data points that are better ignored because they should not make any contribution to the interpolating function. We can eliminate isolated outliers for linearly ordered data points XI < X2 < ... < XIII as follows. We write £, = Sex,) - f(x,) and, for i < m, because the expression (4.1) is then approximately equal to the integral (f(x) - S(x»2 dx = Ilf - SII~ (see the trapezoidal rule in Section 4.3). With this choice of weights, we approximately minimize the 2-norm of the error function. The choice of the number of nodes is more delicate because using too few nodes gives a poor approximation, whereas using too many produces wiggly curves fitting not only the information in the data but also their noise. A suitable way is to proceed stepwise, essentially as explained in the approximation by interpolation. Minimization of (4.1) proceeds easiest when we represent the spline Sex) as a linear combination of B-splines. The fact that these functions have small compact support with little overlap ensures that the matrix of the associated least squares problem is sparse and well conditioned. In the following, we allow the nodes XI < X2 < ... < xm ofthe spline to differ from the data points Xi. Usually, much fewer nodes are used than data points are available. To minimize (4.1), we assume that S is of the form if I£i I :s l£i+11 otherwise. We then take e = method. 0 {ei I i = 1, ... , m - I} as error interval in the previous 167 J n Sex) := I'>,S,(X), Approximation by Least Squares A more sophisticated method that better reflects the stochastic nature of the noise uses the method of least squares. Here, one picks as approximating function the spline that minimizes the expression '=1 where the z: are parameters to be determined and where the S, are B-splines over a fixed grid ~. Thus, we seek the minimum of m m L Wi(Yi - S(x;»2, h(z) = (4.1) L Wi(Yi i=1 i=1 m S(Xi»2 = L Fi(z)2 = IIF(z)II~, ;=1 with where the Wi are appropriately chosen weights. 
Regarding the choice of weights, we note that function values should make a contribution to the approximating function in proportion to how precise they are; thus the weights depend, among other things, on the accuracy of the function values. In the following, we assume that the function values are all equally precise; then a simple choice would be to set each ui, to I. However, the distance between two adjacent data points should be reflected in the weights, too, because individual data points should provide a smaller contribution in regions where they are close together than in regions where they are far apart. A reasonable choice is, for example, (X2 - xd/2 Wi·- { (Xi+1 - xi-d/2 (X m - x m - l )/ 2 fori=l, for i = 2, ... m - I, for i =m, Fi(z)=..jWi(Yi - S(Xi»=..jWi (Yi - tZ,SI(Xi»). '=1 Written in matrix form, the previous equation is F(z) = D(y - Bz), where 168 3.4 Approximation by Splines Interpolation and Numerical Differentiation We therefore need to solve the linear least squares problem IIDy - DBzli z = min! (4.2) Here, selection of the basis splines leads to a matrix B with a simple, sparse form, in that there are at most four nonzero elements in each row of B. For example, when n = 7, this leads to the following form, when there are precisely 1,4,2, 1, 2, I data points in the intervals [Xi, .Ki+l], i = 1, 2, 3,4,5,6, respectively: x x x x x B= o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 0 169 the right side also takes 0 (m) operations. Because the solution of the 7-band system takes O(n) operations and n < m, a total of Oem) operations are required to solve (4.4). A comparison with approximation by interpolation shows that a factor of O(n) is saved. Nonetheless, the least squares method is faster only beyond a certain n because the constant associated with the expression for the cost is now larger than before. The least squares method can also be used for interpolation; then, in contrast to the case already considered, the data points Yi and the nodes Xi do not need to correspond. According to de Boor [15], I for i = 1, m otherwise is a nearly optimal choice of nodes. Parametric Splines x x x To solve problem (4.2), we may use the normal equations (DB)T DBz = (DBl Dy (4.3) (Theorem 2.2.7) because the matrix for this special problem is generally wellconditioned. Formula (4.3) is equivalent to A case that often occurs in practice (e.g., when modeling forms such as airplane wings) is that in which we seek a smooth curve that connects the pairs (Xi, Yi) and (Xi+l, Yi+I). Here, the Xi (i = I, ... , m) are in general not necessarily distinct. We cannot approximate such curves by a function f with f(Xi) ~ Yi. To solve this problem, we introduce an artificial parameter s and view the pairs (x, y) as a function of s. This function can then be approximated by two splines. We obtain good results when we take the following arclength approximation for s: (4.4) where W = Diaguo}, ... , wn ) . Because the element Then, s, is the length of the piecewise linear curve which connects (Xl, Yl) with (Xi, Yi) through the points (xz, Yz), ... , (Xi-I, Yi-d.lfwe now specify splines S, and Sy corresponding to a grid on [0, sm] with SxCSi) ~ Xi and Sy(Si) ~ Yi (i I, ... , m), then the curve (x, y) = (SxCs), Sy(s)) is a smooth curve which approximates the given point set. = vanishes whenever Sj and Sk do not overlap, the matrix BTW B is a 7-band matrix. Moreover, B T W B is positive definite, so that a solution of (4.4) via a Cholesky factorization succeeds without problems. 
Because B contains a total of at most 4m nonzero elements (and therefore on average 4m/n nonzero elements per column) and (by symmetry) only four bands must be computed, the computation of the left side of the normal equations requires at most 4n . 4m / n . 3 = 0 (m) operations. Similarly, the formation of 3.4.1 Remarks. (i) The constructed curve is invariant under orthogonal transformations (ro- tations and reflections). (ii) The parametric curve can be easily graphed, since we may simply graph X = SxCs) and Y = Sy(s). 170 Interpolation and Numerical Differentiation (iii) If we are looking for a closed curve then X m+I = can approximate with periodic splines. XI 3.5 Radial Basis Functions and Ym+I = YI, and we Radial basis functions are real-valued functions defined on lRd that are radially symmetric with respect to some center Xi, and hence have the form rp(lIx -Xi 112) for a suitable univariate function rp : lR+ ----+ lR. Most analyzed and used are radial basis functions with rp(r) = Jr 2 + c2 J~~-~-~: -2.5 3.5 Radial Basis Functions rp(r)=r 2Iogr 2 (thin plate spline) rp(r) = e- c' ,,2 (Gaussian) 2 1 Strong interpolation and approximation theorems are known for the interpolation and approximation by linear combinations of radial basis functions; they are based on the fact that good radial basis functions grow with IlxII, but their higher order divided differences decay (see, e.g., Light [56]). The multivariate analysis is quite involved; here, we only relate univariate radial basis functions to splines, to motivate that we may indeed expect good approximation properties. -1 -0.5 -1.5 -1 -0.5 0 0.5 1.5 2 2.5 -1.5 -1 -0.5 0 0.5 1.5 2 2.5 ir ~ -~t -2 0 0.5 1 1.5 2 ~ 2.5 J (e) I -2.5 1 -1.5 (b) -2.5 (multiquadric) -2 171 2 J Figure 3.4. Linear B-splines as linear combinations of simple radial basis functions (a) <l>i, (b) <1>;+1.;, (c) Bi . are unbounded and wedge-shaped, the differenced functions 3.5.1 Theorem. The space of all piecewise linear functions with possible breakpoints (nodes) at an infinite grid . . . < Xi < Xi + I < space of all linear combinations of the functions <l>i (i = 1, agrees with the , n) defined by <l>i(X) :=Ix - xii· are constant outside [x;, x;+tJ, and the twice differenced functions Proof All piecewise linear functions over the grid ... < Xi < Xi+1 < ... have a representation as a linear combination S(X) = L S(xi)Bi(x) of the linear B-splines if Xi-l :sx :SXi, ifx;:s x:s otherwise. Now, whereas the functions <l>i (x) := Ix - xii B; (x) = <l>i+l.i - <1>;';-1 2 are the linear B-splines over the grid. In particular, B; (x) and therefore all piecewise linear functions are linear combinations of <1>; (x). Conversely, because each <1>; (x) is piecewise linear, all their linear combinations are piecewise linear. Thus approximation by piecewise linear functions is equivalent to ap0 proximation by linear combinations of [x - Xi I. X;+I, More generally, higher order splines can be written as linear combinations of functions of the form <I>;(x)=rp(x -Xi) 172 Interpolation and Numerical Differentiation 3.6 Exercises centered at Xi where for order s splines, rp(x) = ! IXI S if s is odd, xlx!S-J if s is even. 5. Given n + 1 points xo, ... 'X n and an analytical function / : JR ~ R the matrices Sn(f), A E lR(n~l)x(n+l) are defined by Sn(f):= S with (5.1) . Sik Indeed, it is not difficult to see that, in generalization of the linear case, one obtains B-splines of order s over an arbitrary grid by (s + l)-fold repeated differencing. .= (f[Xi, .... xd fori:::: k. 1. 
Fit (by hand calculations) a parabola of the form g(x) = ax 2 + bx + c through the points (50, ~), (51, ~), (52. Give all quantities and intermediate results to 5 decimal places. Compare the results obtained by using the Newton interpolation formula with those that obtained from the power form at the points x = 50,50.5,51,51.5,52. 2. (a) Write MATLAB programs that calculate and evaluate the interpolation polynomial for a MATLAB function / at n pairwise distinct interpolation points Xl , ...• X n . (b) Interpolate the functions /1 (x):= eX/tO and h(x):= I/O + x 2 ) at the equidistant interpolation points Xi := - 5 + lOi/n, i =0, ... , n. For /1 and hand n = 1,4, 8, 16, plot / and p; and calculate an estimate of the maximum error (i.k=O ..... n) o for i > k and A_C 3.6 Exercises Xn-l ¥). max IPn(x) - /(x)1 XEI by evaluating the interpolation polynomial Pn at 10 I equidistant points in the intervals (i) 1= [-1,1] and (ii) 1=[-5.5]. (c) For both functions and both intervals, determine the value of n where the estimated error becomes least. 3. Let the function / : JR ~ JR be sufficiently differentiable. (a) Show that divided differences satisfy ~ /[Xo, Xl.·.·, XII] = dx; Prove for a, xo, ... 'X n E R and arbitrary n times differentiable functions i.« (a) Sn(a)=al. Sn(f±g)=Sn(f)±SIl(g), SIl(x'j)=ASIl(f). (b) If P is a polynomial then Sn (p) = = peA). (c) Sn(f . g) SIl(f) . Sn(g). (d) If h:= fg then h[Xi, ... ,xd= L j[Xi, ... ,Xj]g[Xj, ... ,Xk]. j=i:k (e) Specialize (d) to get a formula for the kth derivative (f g )(k). Hint: Use induction to get (b) from (a); deduce (c) from (b) and interpolation. 6. Let d, = f[xo, .... Xi] and Xo 0 /[XO, XI,···, Xv-I, Xv, Xl', Xv+I, ... , X n] for v e O, ... ,n. k . d / [Xo, Xl • . . . , X" ] . (b) Find and prove an analogous expression for dX; 4. Show that the Lagrange polynomials can be written as -do -dl XI A= Xn_1 where q(x) = (x - xo)··· (x - x n ) · 173 0 D= -dll - 2 dnx n - d ll - I [ ] Show that Pn(x) = det(Dx - A) is the Newton interpolation polynomial for f at Xo • . . . , Xn . 7. Let (xo,YO,Yb):=(-I,-3, 19), (Xl,YI,YD:=(O,O,-I), a~d (X2,Y2, y~):= (1, 3,19). Calculate the Hermite interpolation polynomial p(x) of 10. The Chebyshev points 2j degree 5 with p(x v ) = Yv and p'(x v ) = y~ , 1J = 0,1,2. . 8. Let P« (x) be the polynomial of degree ~n that interpolates the sufficiently smooth function f (x) at the pairwise distinct points Xo, ... , x". (a) Show that for k ~ n the kth derivative of the error function f(x) - P«(x) has at least n - k + I zeros in the smallest interval [a, b1 containing x·J := cos 2(n sup Iq,,(;)I, where q,,(x):= (x - xo)··· (x - x,,). Show this for n = I, and express the points Xo, x I in terms of square roots. 11. Write a MATLAB program that calculates the derivative !,(x) of a given differentiable function f at the point x E R For this purpose use the central difference quotient (6.1) (c) Show that the relation (6, I) holds even when certain interpolation points coincide. (d) Show that, for the Hermite interpolation polynomial p" (x) that coincides with the values of the function f and its first derivative at the f [x - h . x+h]:=f(x+hi)-f(x-hi) " 1 Zh, for hi := 2- i h o, i = 0, I, 2.... , where h o > 0 is given. Let P" be the interpolation polynomial of degree <n satisfying points xo, ... ,x"' the remainder formula _ f(2,,+2)(;) 2 f(x)-p,,(x)= (2n+2)!q,,(x) with; E O{xo, ... ,x"' x} and c, given by q,,(x) := (x -xo) ... (x -x,,) 1J = (J' = 0 , ... , n) sE[-I,lj that f(k) (;) f[xO,Xl, ... 
,Xkl=~k-!-' + I IT + I) minimize the expression all xv· (b) Use (a) to prove for arbitrary kEN the existence of some; E[a, b1such holds for arbitrary x E R 9. Let D be a closed disk with fixed center c E C and radius r. Let 175 3.6 Exercises Interpolation and Numerical Differentiation 174 Xv E D, Choose the degree n for which Ip,,+I(O) - p,,(O)1 :::: Ip,,(O) - p,,_I(O)1 0, ... , k be given points, and let max p:= holds for the first time. Then ~ (p" (0) + P,,-l (0)) is an estimate of f' (x) and ~ IP" (0) - P,,-I (0) I is a bound for the error. Determine in this wayan approximation and an estimated error for the derivative of f(x) := eX at x = I using in succession, the starting values ho := 1,0.1,0.01,0.001, and compare the results with the true error. 12. Suppose, for the linear function f (x) := a + bx ; with a. b =I 0, the first derivative f' (0) = b is estimated from [x, - c], v =O, ...• k Then for every analytic function f :D S; C ---+ C, r If[xo, Xl,"" Xk II ~ (r _ p)k+! sUPlf(;)I=:S(r). sED' Now, let f(x):= e", x E C (a) Give an explicit expression for S(r). (b) For which value r of r is S(r) minimal (with c, xo, Xl,"" Xk fixed)? (c) How big is the overestimation factor over qk. S,_(.:..,r.:...) _ := If[XO,XI,···,Xk 'II for Xo = XI = ... = Xk = c and k = 1,2,4,8, 16, 32? E C 8 _ f(h) - f(-h) h 2h ~n binary floating point arithmetic with mantissa length L and correct roundmg. Let a and b be given binary floating point numbers and let h be a power of 2, so that multiplication with h and division by 2h can be performed ex~ctly, Give a bound for the relative error I(8 h - f' (0)) / f' (0) I of the value 8h calculated for 8h . How does this bound behave as h ---+ O? Interpolation and Numerical Differentiation 3.6 Exercises 13. A lot of high quality public domain software is freely available on the World Wide Web (WWW). The NETLIB site at Hint: The MATLAB functions subp.l.o t, title, text, xlabel, and ylabel may be useful. Try also set (gca , )x.l im ' , [-6,6] ), and experiment with set and get. (c) The plot suggests a conjecture; can you formulate and prove it? 16. For given grid points XI < Xz < ... < Xn and function values f(xj), let S (x) be the cubic spline with 176 http://www.netlib.org/ contains a rich collection of mathematical software, papers, and databases. (a) Search the NETLIB repository for the key word "interpolation" and explore some of the links provided. (b) Get a suitable program for spline interpolation, and try it out on some simple examples. (To get the program work in a MATLAB environment, you may need to learn something about how to make mexfiles that allow you to access FORTRAN or C programs from within MATLAB.) 14. Let Sf be the spline function corresponding to the grid Xl < Xz < ... < X" that interpolates f at the points xo, XI, ... , X n + 1· Show that the mapping f ~ Sf is linear, that is, that 15. For an extended grid X-Z < X-I < Xo < ... < X" < X,,+I < X,,+Z < X,,+3, define functions Bk (k = I, ... , n) by for Xk-Z < x ::::: Xk-I, Clk.k-Z(X - Xk_z)3 + Clk,k-I (x - Xk_l)3 x)3 + Clk.k+1 (Xk+1 - x)3 Clk.k-Z(x - Xk_z)3 for Xk-I < x ::::: Xko Clk.k+2(Xk+Z - for Xk < x::::: Xk+l, for Xk+l < x < Xk+Z, otherwise, Clk.k+Z (Xk+Z - x)3 o where Xk+Z - Xk-Z TI (Xi-Xj)' li-kl::SZ i# (a) Show that these formulas define cubic splines, the B-splines over the grid. 
(b) Extend the grids (-5, -2, -1,0, 1,2,5) and (-3, -2, -1,0, 1,2,3) by three arbitrary points on both sides and plot the nine resulting B-splines (as defined previously) and their sum on the range of the extended grids, using the plotting features of MATLAB. The figure should consist of two subplots, one for each grid, and be self-explaining; so do not forget the labeling of the axes and a plot title. seX) = Clj + {3j(x - Xj) +8 j (x 177 + Yj(x -Xj)z(x -Xj+d - Xj)(x - Xj+l) for X E [Xj,Xj+I], satisfying S(Xj) = f(xj), j = I, ... , n and the free node condition at Xz and Xn-l. (a) Write a MATLAB subroutine that computes the coefficients a j, {3j, Yj, and 8j (j = 1. ... , n - 1) from the data points by solving the linear tridiagonal system for the m j (cf. Theorem 3.3.5). (b) Write a MATLAB program that evaluates the spline function Sex) at an arbitrary point x. The program should only do three multiplications per evaluation. Hint: Imitate Homer's method. (c) Does your program work correctly when you interpolate the function f(x):= coshx, x E [-2,2], at the equidistant points -2 = XI < ... < Xll = 2? 17. Let S be a cubic spline with m j := S"(Xj) for j = I, ... , n. S is called a natural spline, provided ml = ni; = O. (a) Show that the natural spline that interpolates a function g minimizes s" for all twice continuously differentiable interpolants f. (b) Natural splines are not very appropriate for approximating functions because near the boundary, the error of the approximation is typically O(h z) instead of the expected O(h 3 ) . To demonstrate this, determine the natural spline interpolant for the parabola f (x) = x Z over the grid L1 = {-h, 0, h} and determine the maximum error in the interval [-h, h]. (However, natural splines are the right choice if it is known that the function is C Z and linear on both sides outside the interpolation interval.) 18. Quadratic splines. Suppose that the function values f (Xi), i = I, ... , n, are known atthe equidistant points Xi = a+(i -I)h, where h = (b-a)/(n -I). Show that there exists a unique quadratic (i.e. order 2) spline function Sex) over the grid xb < < x~, where x; = a + (i - !)h, i = 0, ... , n that interpolates f at XI, ,xn · Interpolation and Numerical Differentiation 178 Hint: Formulate a system of equations for the unknown coefficients of S in each subinterval [X!_l' xJ 19. Dilation equations. Dilation equations relate splines at two different scales. This provides the starting point for the multiresolution analysis of sound or images by means of so-called wavelets. (For details, see, e.g., Daubeches [13], De Vore and Lucier [21].) (a) Show that 4 Numerical Integration if Ixl ::: 1, ifl<lxl<3, if Ixl ::: 3 is a quadratic (B-)spline. (b) Prove that B (x) satisfies for all x B(x) E 1 3 3 = 4B(2x - 3) + 4B(2x - 1) 1 + 4B(2x + I) + 4B(2x + 3). (c) Derive a similar dilation equation for the cubic B-spline in (3.3). 20. (a) Show, for cP as in (5.1), that every linear combination Sex) = L (XkCP(X - Xk) (6.2) i = l:n is k - 1 times differentiable, and S(k-l) (x) This chapter treats numerical methods for the approximation of one-dimensional w(x)f(x) dx. After a discussion definite integrals of the form J,:' f(x) dx or of general accuracy and convergence results, we consider the highly accurate Gaussian quadrature rules, most suitable for smooth functions, possibly with known endpoint singularities. Based on the trapezoidal rule, we then derive adaptive step size methods for the integration of difficult integrands. 
The final two sections show how the methods for integration can be extended to multistep methods for solving initial value problems for systems of coupled ordinary differential equations. However, the latter is a vast subject, and we barely scratch the surface. A thorough treatment of all aspects of numerical integration of univariate functions is in Davis and Rabinowitz [14]. The integration of functions of several variables is treated in Stroud [91], Engels [25], and Krommer and Ueberhuber [54]. The solution of ordinary differential equations is covered thoroughly by Hairer et a1. [35,36]. f: IR the dilation equation is piecewise linear with nodes at Xl,···, x.: (b) Write the cubic B-spline in (3.3) as a linear combination (6.2). Check by plotting the graph for both expressions in MATLAB. Hint: First match the second derivatives, then determine the free parameters to force compact support. 4.1 The Accuracy of Quadrature Formulas In this section, we look at general approximation properties of formulas that use a finite number of values of a function f to approximate a definite integral of the form I(f) := l b f(x)dx. (1.1) Because the integral (1.1) is linear in I, it is desirable that the approximating formulas have the same property. We therefore only consider formulas of the following form. 179 180 Numerical Integration 4.J The Accuracy of Quadrature Formulas 4.1.1 Definition. A formula of the form QU):= L Normalized Quadrature Formulas ad(xj) (1.2) j=O:N is called an (N + I)-point quadrature rule with the nodes Xo, ... , XN and Corresponding weights ao, ... , aN. In general, it is often useful to use a linear transformation to transform the interval of integration to the simplest possible interval. We demonstrate this for various weight functions, namely: CASE I: The trivial weight function co (x) = 1: One usually normalizes to [a, b] = [-1, 1]. If The nodes usually lie in the interval of integration [a, b] because the integrand is possibly undefined outside of [a, b]; for example, when f(x) == j l -I .J(x - a)(b - x). In order for the formulas to remain numerically stable for large N, we require that all weights be nonnegative; there is then no cancellation when we evaluate Q for functions of constant sign. Indeed, the maximum amplification factor for the roundoff error is given by L laj II L aj for function values of approximately constant magnitude, and this quotient is I precisely whenever the aj are all nonnegative. In order to obtain good approximation properties, we further require that the quadrature rule integrates exactly all f in a particular class, where this class must be meaningfully chosen. In the simplest case, the interval of integration is finite and the integrand f (x) is continuous in [a, b]. In a formally only slightly more general case, the integrand can be divided into the product of a simple but possibly singular factor w(x), the so-called weight function, and a continuous factor f(x). In this case, instead of (1.1), the integral has the form (n JUJU) = l b w(x)f(x) dx. f(x)dx = La;f(xi) holds for polynomials of degree y n, then the substitution t := c + xh results in the formula C+h ! f(t)dt c-h = h La;f(c+xi h) for polynomials f of degree y n. CASE 2: The algebraic weight w(x) = x'" (a > -I): One usually normalizes to [a, b] = [0, I]. 
If i 1 x'" f(x)dx = Laif(Xi) holds for polynomials of degree y n, then the substitution t := xh results in the formula (1.3) Numerical evaluation of (1.3) is then still done approximately with a formula of the form (1.2), where the weights a j are now dependent on the weight function w (x). 4.1.2 Examples. Apart from the most important trivial weight function co (x) = 1, frequently used weight functions are as follows: (i) (ii) (iii) (iv) 181 w(x) = x'" with a > -I for [a, b] = [0, I], w(x) := I/~ for [a, b] = [-I, 1], w(x) := e- x for [a, b] = [0,00], w(x) := e- x 2 for [a, b] = [-00,00]. The last two examples show how an appropriate choice of the weight function makes numerical integration over an infinite interval possible with only a finite number of function values. for polynomials f of degree x n. CASE 3: The exponential weightw(x) = e- Ax (A > 0): One usually normalizes to [a, b] = [0, 00], A = 1. If holds for polynomials of degree y; 17, then the substitution t := c A-I X results in the formula for polynomials f of degree z». + 182 4.1 The Accuracy of Quadrature Formulas Numerical Integration Approximation Order Because continuous functions can be approximated arbitrarily closely by polynomials, it is reasonable to require that the quadrature formula (1.2) reproduces the integral (1.3) exactly for low-degree polynomials. proof By assumption Q has order n ~n we have 183 + 1 so, for all polynomials P» of degree It follows that 4.1.3 Definition. A quadrature rule I I b Q(f) := I>~d(xj) (1.4) b w(x)f(x)dx = + Q(Pn) w(x)(f(x) - Pn(x» d x + b j has approximation order (or simply order) n function w(x) if w(x)(f(x) - Pn(x»dx + 1 with =I respect to the weight o L CljPn(Xj). j=O:n With this we have b I w(x)f(x) dx In particular, when == Q(f) for all polynomials f of degree <n. (1.5) f is constant we have the following. = lIb w(x) (f(x) - Pn(X» dx 4.1.4 Corollary. A quadrature rule (1.4) has a positive approximation order precisely when it satisfies the consistency condition b LClj = I w(x)dx = lIb w(x)f(x) dx - Q(f)1 (1.6) 1(1). } + j~n Clj(Pn(Xj) - f(Xj»1 2: Ib'W(X)",f-Pn"oodx+ L ICljlllf-Pnlloo o j=O:n b = ( I Iw(x)1 + Ilf - Pnlloo· j~n ICljl) In particular, if w (x) and all a j are nonnegative, then (1.6) implies We first consider the accuracy of quadrature rules, and investigate when we can ensure the convergence of a sequence of quadrature rules to the integral (1.1). The following two theorems give information in an important case. 4.1.5 Theorem. Suppose Q is a quadrature rule with nodes in [a, b] and nonnegative weights. If Q has the order n + 1 with respect to the weightfunction w(x), then the bound I b Iw(x)1 + j~n ICljl = I b w(x) + j~n Clj = 21(1). 0 4.1.6 Theorem. Let Qz (l = 1,2, ...) be a sequence ofquadrature rules with nodes in the bounded interval [a, b], nonnegative weights, having order ti, + I with respect to some nonnegative weight function w(x). If n, ~ 00 as I ~ 00, then b 10r w(x)f(x) dx = lim Qz(f) 1--+00 for every continuous function f: [a, b] ~ C. holds for every continuous function f : [a, b] ~ C and all polynomials Pn of degree 2: n. In particular, for nonnegative weight functions, lIb w(x)f(x)dx - Q(f)!2: 21 (1) lip" - flloo· (1.7) Proof By the Weierstrass approximation theorem, there is a sequence of polynomials Pk(x) (k = 0, I, 2, ...) of degree k with lim Ilpk - k-::HXJ fII00 = O. 
For sufficiently large I we have n, ::: k, so that Theorem 4.1.5 may be applied to each of these polynomials Pk; we thus obtain lim 1--+00 lib w(x)f(x) dx - QI(f)\ = Because n o (i) Theorems of approximation theory imply that actually for m-times continuously differentiable functions; the approximation error (1.7) therefore diminishes at least as rapidly. (ii) Instead of requiring nonnegative weights it suffices to assume that the sums L la; I remain bounded. In practice we are interested in the converse of the previous proposition because we start with a normalized quadrature rule and we desire a bound for the error of integration over the original interval. We obtain this with the following theorem. 4.1.9 Theorem. Suppose Q(f) = La; f(x;) is a normalized quadrature rule over the interval [-1, 1] with nonnegative weights a; and with order n + 1 with respect to w(x) = 1. Then the error bound f i C h + f(t)dt-hLa;f(c+x;h) I :::: l c-h 16 ,(h - )n+2 Ilf(n+1) 1100 (n+l). 2 holds for every N + 1 times continuously differentiable function f on [c - h, c + h], where the maximum norm is over [c - h, c + h]. 4.1.8 Proposition. If the relationship f(t) dt k > 0, we obtain the relationship in the limit h -* O. The normalized quadrature rule (1.9) is therefore exact for all x k with k :::: n. It is thus also exact for linear combinations of x k (k :::: n), and therefore for all polynomials of degree j; n, that is, it has order n + 1. 0 For the most important case, that of constant weight w (x) = 1, we now prove some statements about the accuracy oftransformed quadrature rules over small intervals. C+h + 1- o. a 4.1.7 Remarks. c-h 185 4.1 The Accuracy of Quadrature Formulas Numerical Integration 184 = h I>;f(c + x;h) + O(h n+2) (h -* 0) (1.8) holds for all (n + I)-times continuously differentiable functions f in [c - h, c + h], then the corresponding quadrature rule Proof Let Pn(x) denote the polynomial of degree j; n that interpolates g(x) := fCc +xh) at the Chebychev points. Theorem 3.1.8 and Proposition 3.1.14 then imply (1.9) for (n + I)-times continuously differentiable functions f in [-1, 1] has order n + 1 with respect to w(x) = 1. over the interval [-1, 1]. This and application of Theorem 4.1.5 to Q(g) give \f:;'h f(t) dt - Proof Suppose k :::: nand f (t) := (t - c/. Then (1.8) holds, and we obtain h k+ J _ a;f(c + x;h) = II - +1 This implies 1 / t k dt I _ (_I)k+1 = k + I = La;x~ + O(h n+ 1 n h + I)! ( 2" ) 4h < 00 - n+ J - I Ilg(n+J) I 2n (n + I)! 00 (n+J) Ilf 1100' r: k - (n g(x) dx - Q(g) Ilg - p II 16 _ - _I J < 4h . (_h)k+1 ___----'-_ = h La;(x;h)k + O(h n+2). k I li 1 hL ). o .If we wish to compute the error bounds explicitly, we can bound II I) II 00 interval arithmetic by computing If(n+J)([c - h, C + hm, provided an With 4.2 Gaussian Quadrature Formulas Numerical Integration 186 arithmetic expression for f(n+l) is available or f(n+l) is computed recursively via automatic differentiation from an arithmetic expression for f· However, the error of integration can already be bounded in terms of the nth derivative; in particular we have the following. As in the proof of Proposition 4.1.9, we finally obtain f I C h + 1(t)dt- h L c-h 16 (h)n+1 4.1.10 Theorem. If the quadrature rule Q(f) =S n! = L a.] 
(x.), normalized on [ -1, 1], has nonnegative weights a j and order n + 1 with respect to w (x) = I, then Proof Let Pn-l (x) denote the polynomial of degree n - I, which interpolates g(x):= fCc + xh) at the Chebychev points, and let q(x) := 21-nTnl~~ ~e the normalized Chebychev polynomial of degree 11. Then Iq(x)1 =S 2 for x E [-1,1]. With a suitable constant y, we now define + yq(x). Pn(x):= Pn-I(X) g (x ) - Pn-l (x) g(n) (0 . h" = --q(x) = -n! f 11\ (n) (c + ~h)q(x) for some s E [-I, 1], so Ig(x) - Pn(x)\ l = ~~ f(n)(c + ~h) =S \~~ f(n)([c - h, c y\ Iq(x)1 + h]) - y\. 2 n. 1 - 2: a ;[ (c + x;h ) I radf(n)([c-h,c+h]). o 4.1.11 Remarks. (i) The bound for the error is still of order O(h n +2 ) because the radius of the enclosure for the nth derivative is always O(h). (ii) Under the assumptions of the theorem, the transformed quadrature rule for the interval [c - h, c + h] also has order n + 1 because the nth derivative of a polynomial of degree 11 is constant and the radius of a constant function is zero, whence the error is zero. (iii) We can avoid recursive differentiation if we use the Cauchy integral theorem to bound the higher order derivatives. Indeed, if f is analytic in the complex disk D[c; 1'] with radius r > h centered at c, and if 11(01 =S M for ~ E D[c; 1'], then Corollary 3.1.11 implies the bound (n We then have 187 1 II (n+l) I + 1)! 1 00 Mr =S (I' _ h)n+2' and the bound in Proposition 4.1.9 becomes 16M rqn+2, with q = hi (21' 2h). For sharper bounds along similar lines, see Eiermann [24] and Petras [80]. In order to apply these theorems in practice, we must know how to determine the 2n + 2 parameters a j and x j in the quadrature rule (1.4) in order to obtain a predetermined approximation accuracy. In the next section, we integrate interpolating polynomials to determine, for arbitrary weights, so-called interpolatory quadrature rules, among them the Gaussian quadrature rules, which have particularly high order. . . .... I We now choose y in such a way that the expression on the nght side IS mllllma . ([c - h, c + h]). Then we get This is the case when y = ~ mid 4.2 Gaussian Quadrature Formulas 2 (h)n Ilg-Pnlloo=Sn! 2: rad(t(n)([c-h,c+h])). When we require that all polynomials of degree =S n are integrated exactly, we Obtaina class of (n + 1)-point quadrature formulas that are sufficiently accurate for many purposes. In this vein, we have the following. r: Numerical Integration 4.2 Gaussian Quadrature Formulas 4.2.1 Definition. An (n + I)-point quadrature formula Q(f) is an interpolatory quadrature formula for the weight function w (x) provided called the Newton-Cotes quadrature rules of designed order n + I. If n is even, the true order is n + 2. Indeed, a symmetry argument shows that (2x -a -bt+ 1 (and hence any polynomial of degree n + 1) is integrated exactly to zero when 11 is even. For n = 1, we obtain the simple trapezoidal rule 188 Q(f) = l b w(x)f(x) dx for all polynomials f of degree y n, that is, when Q(f) has order n + I with TI(f) respect to w(x). of order The following theorem gives information concerning existence and unique- Il + 1 = 2, for 4.2.2 Theorem. To each weight w (x) and n + 1 arbitrarily preassigned pa~r. diIS tiIn ct nodes Xo , ... , x n, there is exactly one quadrature formula with wIse of order 11 + 2 L aJ(xj) aj := l L(x) = ] n i#j x -Xi. x·] - x·I (x) is a polynomial of degree j- n. Then Lagrange's interpo- = L Lj(x)f(xj) j=O:n implies, with o the desired representation. 
The interpolatory quadrature formulas corresponding to the constant weight = 1 and equidistant nodes Xj = a + J'h n (.J -- 0 , ... , n) are function w(x) + 32f Ca: b) + 12f C; b) (2.1) of order n + 2 = 6. If b - a = O(h), Theorem 4.1.9 implies that the error is O(h 3 ) for the simple trapezoidal rule, O(h 5 ) for the simple Simpson rule, and o (h 7) for the simple Milne rule. In the section on adaptive integration, we meet these rules again as the first members of infinite sequences TN(f), SN(f), and MN(f) oflow order quadrature rules. Unfortunately, the Newton-Cotes formulas have negative weights a j when n is large, more precisely for = 8 and 10. For example, laj 1/ j ~ lOll for Il = 40, which leads to a tremendous amplification of the roundoff error introduced during computation of the f (x j ). We must therefore choose the nodes carefully if we want to maintain nonnegative weights. As already shown in Section 3.1, many functions can be interpolated better when the nodes are chosen closer together near the boundary, rather than equidistant. We also use this principle here because in general the better we interpolate a function, the better its integral is approximated. As an example, we treat normalized quadrature formulas in with respect to w(x) = 1 On the interval [-1, 1]. Here, an advantageous distribution of the nodes is given by the roots of the Chebychev polynomial Tn(x) = cos(n . arccos(x)) (see Section 3.1). However, the extreme values of the Chebychev polynomial, n lation formula (see Theorem 3.1.1) f(x) f (a) w(x)Lj(x)dx, with the Lagrange polynomials f (7 +32 f(a:3b)+7 f(b)) b (/ Proof Suppose simple Simpson rule (also called Kepler's = 4, and for 11 = 4 the simple Milne rule M I (f) = b ~ a j=O:n with = 2 the b-a ( f(a)+4f (a+b) SI(f)=-6-2- +f(b) ) + 1, namely Q(f) = b-a = -2-(f(a) + feb)~ barrel rule) ness of interpolatory quadrature formulas. order n n 189 n::: L La 190 Numerical Integration 4.2 Gaussian Quadrature Formulas that is, of approximation n Xj = cos(rrj/n) l for j = 0, ... , n n 1( ~ 1- "COS(2rrjk/n») L. 4k2 _ I j for j = 1, ... , n - I, 2 b k=1:LII/2j jb (f, g) := a (2.2) + q(x)r(x) b w(x)j(x)g(x) dx I (2.3) for some polynomial Pll of degree s» (the interpolating polynomial at the no~e~ xo, ... , XII) and some polynomial rex) of degree ~k - I. Thus Q(j) has or e (2.4) b w(x) dx = ILo < 00, w(x) > 0 for x E (a, b) (2.5) hold the ( )' " d . . , n·,' IS a posrtive efinite scalar product III the space of real continuous bounded functions in the open interval (a, b). A sequence of polynomials Pi (x) (i = 0, 1, 2, ...) of degree .1' hi' a :~ys em oJ.ort ogona polynomials over [a, b) with respect to the Ight functIOn w(x) If the condition We' f(x) = PII(X) I ~..z.4 Definition. IS called t f of degree n + k has the form D for t~ese frequently occurring integrals. Whenever the weight function co (x) is COntllluous in (a, b) and the conditions j b w(x)q(x)r(x) dx = 0 We need some additional notation for this. We introduce the abbreviation obeys the relationship Proof Every polynomial W(X)PII(X) dx Equation (2.2) places k conditions on the n + 1 nodes. We may therefore hope to satisfy these conditions with k = n + I with an appropriate choice of nodes. The order of approximation 2n + 2 so obtained would thus then be twice as large as before. As we shall now show, for nonnegative weights this is indeed possible, in a unique way. qt x) := (x - xo)" . (x - xn) for i = 0, ... , k - 1. 1 for all polynomials r(x) of degree ~ k - 1, that is, whenever (2.2) holds. 
4.2.3 Proposition. An interpolatory quadrature formula with nodes xo, .... x" has order n + 1 + k precisely whenever the polynomial w(x)q(x)xidx = 0 i~ aiPII(xi) = w(x)j(x) d x = b If we consider the difference between the left and right sides, we see that this is the case precisely when for j =0, n. Because a j :::. all' all a j are positive for this choice of nodes, so the quadrature formula with these weights a j is numerically stable. In practice, in order to avoid too many additional cosine evaluations, one tabulates the x j and a j for various values of n (e.g., for n = 1,2.4. 8, 16•... ). Until now we have specified the quadrature formulas in such a way that the x. were somehow fixed beforehand and the weights a j were then computed. ./ve have thus not made use of the fact that we can also choose the nodes in order to get a higher order with the same number of nodes. In order to find o~ti.mal nodes, we prove the following proposition, which points out the conditions under which an order higher than n + I can be achieved. I =Q(j) =2:: ail(xi), holds for all j of the form (2.3). k=I:Ln/2j 1 L 4k2I) _ 1 = n(2Ln/2j + I) w(x)j(x) d x or equivalently II (_ 1 2 b 1=0:11 have proved in practice to be more appropriate nodes for integration. The Corresponding interpolatory quadrature formulas are called Clenshaw-Curtis formulas. We give without proof explicit formulas for their weights: ~ 191 + 1 + k precisely whenever the relationship . (p"Pj}=O is satisfied. fori=O,l, ... ,j_l 193 4.2 Gaussian Quadrature Formulas Numerical Integration 192 To every weight function w(x) satisfying (2.5) that is continuous in (a, h) there corresponds a system of orthogonal polynomials that is unique to within normalization. Such a system may be determined, for example, from the sequence I, x , ... , x" with Gram-Schmidt orthogonalization. Based on the following theorem, we can obtain quadrature formulas with optimal order from such systems of orthogonal polynomials. Pn+I (x), that is, the roots of Pn+1 (x) correspond to those of q (x) and hence to the nodes. By Theorem 4.2.2, the quadrature formula is thus unique. We now determine the sign of the weights and the location of the nodes. For this, we again use the Lagrange polynomial Lk (x), and choose the two special polynomials 4.2.5 Theorem. Suppose w(x) is a weight function which is continuous in of degree x 2n + I for f in (2.8). Because Lk (Xj) (a, h) and which satisfies (2.5). Then for each n ::: 0 there is exactly one quadrature formula ak L GnU) := aJ!(x)) + 2. Its weights Xk (2.6) lim GII(f) = l w(x)f(x) dx = 1 -Gil (h) ak f(x) - b L k(x)2 w(x) dx ::: 0 = t xL k(x)2 w(x) dx ~b? J" Lk(X)-W(x) dx E la, h[. L L j(x)2 f(xj) = g(X)Pn+l(X) )=O:n function w(x). In particular, the relationship n----+-OO l = (ii) Conversely, if the Xi are the roots of Pn+l (x), the weights are given by (2.6), and f is a polynomial of degree j; 2n + 1, then are nonnegative and its nodes x) all lie in the interval ]a, h[. The nodes are the roots ofthe (n + 1)st orthogonal polynomial Pn+l(x) with respect to the weight b Gn(!I) and )=O:n with approximation order 2n = = 8k)' we obtain (2.7) Q holds for all continuous functions f in [a, b]. The uniquely determined quadrature formula Gil (f) is called the (11 + I)-point Gaussian quadrature rule for the weight function w(x) in [a, h]. Proof (i) Suppose Gil satisfies for every polynomial f of degree ::::2n + 1. By Proposition 4.2.3, (q, Xi) ::::: othen holds for i = 0, ... , n. Thus q (x) is a polynomial of. 
degree n .+ ) that I to is orthogonal to all polynomials of degree r- n, so q (x) IS proportJOna for some polynomial g of degree x n, since the left side has each x) as a root. Hence l b f(x)dx - GnU) = = l l b (f(X) - j~' L J(X)2 f(X j)) dx b g(X)PII+1 (x) dx =0 by (2.2). Thus G n has indeed the approximation order 2n Statement (2.7) now follows from Theorem 4.1.6. + 2. D Gaussian quadrature rules often give excellent results. However, despite the good order of approximation, they are not universally applicable. To apply them, it is essential (unlike for formulas with equidistant weights) to have a functiIOn given . . (or a program) because the nodes Xi are in by an expression general irrational. ~abulated nodes and weights for Gaussian quadrature rules for many different WeIght functions can be found, for example, in Abramowitz and Stegun [I]. 4.2 Gaussian Quadrature Formulas Numerical Integration 194 4.2.6 Example. Suppose w(x) = I and [a, b] = [-I, 1]. The corresponding polynomials are (up to a constant normalization factor) the Legendre polynomials. Orthogonalization of I, x , ... , x n gives the sequence 195 A 3-term recurrence relation like (2.9) is typical for orthogonal polynomials: 4.2.7 Proposition. Let Pk(X) be polynomials ofdegree k = 0, 1, ... with highest coefficient 1. orthogonal with respect to the weightfunction W(x). Then, with Po(x) = I, PI(X) = x, pz(x) =x and, continuing recursively for arbitrary Z - 1 3' P3(X) =x 3 3 P-l (x) - S-x, = 0, Po(x) = I, we have (2.10) i, with the constants (2.9) aj = For n ::: 2, we obtain the following nodes and weights. For n = 0, b, = Xo = 0, forn ao = 2, f f 7_I(x)dx/ W(X) XP W(X)P7_ I(x)dx/ ti > W(X)P7-1 (x) d x w(x)p7_z(x)dx 1), (j::: 2). Proof We fix j and expand the polynomial Xpj-l (x) of degree j as a linear combination = 1, and for n f f Xo = vfi73, ao = Xl = vfi73, a1 = 1, 1, Xpj-l (x) = L aIPI(x). I=O:j Taking inner products with Pk (x) and using the orthogonality, we find (x Pj -1, Pk) = adpb Pk), hence = 2, Xo = -J3j5, ao = 5/9, Xl = 0, al = 8/9, Xz = J3j5, az = 5/9. By comparing the highest coefficients, we find aj = I, hence (2.11) If we substitute z := c +hx (and add error terms from [I]), we obtain for n = I ° the formula By definition ofthe inner product, (XPj-l, Pk) = (Pj-l, xpd = for k < j-2 Since P J -I is orthogonal to all polynomials of degree < j - 1. Thus ak = for k < j - 2, and we find which requires one function evaluation less than Simpson's rule, but has an error term of the same order 0 (h 5 ) . For n = 2, we obtain the formula XPj-l (x) = Pj(x) ° + aj-IPj-l (x) + aj-ZPj-z(x). This gives (2.10) with f C+ h c-h f(z) dz = h 9(5f(c - hJ3j5) h 7 + 315o f (6) + 8f(c) + 5f(c + hJ3j5» (~), aj bj = aj-l = = aj-Z = (XPj_l, Pj-z)/(Pj-Z, Pj-z) = (Pj-l, Pj-I)/(Pj-2, Pj-z) with the same number of function evaluations as Simpson's rule, but an error term of a higher order 0 (h 7 ) . (XPj_l, PJ-I)/(Pj-l, Pj-I), because (Xp/-l, Pj-2) = (XP/-2, Pi-l) = (Pj_l, Pj-I) by (2.11). D Numerical Integration 4.3 The Trapezoidal Rule If the coefficients of the three-term recurrence relation are known, one can compute the nodes and weights for the corresponding Gaussian quadrature rules in a numerically stable way by solving a tridiagonal eigenvalue problem (cf. Exercise 13). (This is preferable to forming the Pj explicitly and finding their zeros, which is much less stable.) for the integral of f over the subinterval [Xi -1, Xi]. 
By summing over all subintervals, we obtain the following approximation for the integral over the entire interval [a, b]: 196 197 4.3 The Trapezoidal Rule Gaussian rules have error terms that depend on high derivatives of f, and hence are not suitable for the integration of functions with little smoothness. However, even nondifferentiable continuous functions may always be integrated by an approximation with Riemann sums. In this section, we refine this approach. We assume that we know the values f (x j) (j = 0, ... , N) of a function f(x) on a grid a = Xo < Xl < ... < XN = b. When we review the analytical definition of the integral in terms of Riemann sums, we see that it is possible to approximate the integral by summing the areas of rectangles with sides of lengths Xi - Xi-I and f(Xi-d or f(Xi) (Figure 4.1). We also immediately see that we obtain a better approximation if we replace the rectangles with trapezoids obtained by joining a point (Xi-I, f (Xi-1» with the next point (Xi, f (Xi»; that is, we replace f (x) by a piecewise linear function through the data points (Xi, f (Xi» and compute the area underneath this piecewise linear function in terms of a sum of trapezoidal areas. Because the area of a trapezoid is given by the product of its base and average height, we obtain the approximation l X, ((Xi) . f(x) dx ~ (Xi - Xi-I) . . After combining terms with the same index i, we obtain the trapezoidal rule over the grid a = Xo < XI < '" < XN = b: T(f) : = Xo f(xo) 2 XN - XN-I XI - + 2 + L i=1:N-l Xi+1 - Xi_1 f(Xi) 2 !(XN). (3.1) For equispaced nodes Xi = a + i . h N obtained by subdividing the interval [a, b] into N equal parts oflength h N : = (b - a) / N, the trapezoidal rule simplifies to the equidistant trapezoidal rule (3.2) with hN=(b-a)/N, xi=a+ih N. + !(Xi-l) 2 Xl-I Figure 4.1. The integral as a sum of trapezoidal areas. 4.3.1 Remark. Usually, the nodes in an equidistant formula are not stored but are computed while summing TN (f). However, whenever N is somewhat larger, the nodes are not computed accurately with the formula Xi = Xi-I + h because rounding errors accumulate. For this reason, it is strongly recommended that the slightly more expensive formula Xi = a + ih be used. (In most cases, the dominant cost is that for getting the function values, anyway.) Roundoff errors can be further reduced if we sum the terms defining TN (f) in higher precision; in most cases, the additional expense required for this is negligible compared with the cost for function evaluations. Clearly, the order of approximation of the trapezoidal rule is only n + 1 = 2, so Theorem 4.1.6 does not apply. Nevertheless, convergence to the integral follows because continuous functions may be approximated arbitrarily closely by piecewise linear functions. Indeed, for twice continuously differentiable functions, an error bound can be derived, using the following lemma. 4.3 The Trapezoidal Rule Numerical Integration 198 4.3.2 Lemma. If f is twice continuously differentiable in [Xi -I, Xi]' then there is a ~i E [Xi-I, xil such that Proof From Lemma 4.3.2, the error is I l b f(x) dx - T(f)1 1- L = a Proof We use divided differences. For X E [Xi-I, Xi], we have = f(x) dx 1,x'\ (f(Xi-l) + j[Xi-l, Xi](X - (X - X l)h 2 j[Xi_I,Xi,X](X -Xi_I)(X -xi)dx. Xi-l The mean value theorem for integrals states that 1 b f(x)g(x) dx = f(O I" a g(x) dx for some ~ E [a, b], provided f and g are continuous in [a, b] and g has a constant sign on [a, b]. 
Applying this to the last integral with g(x) := (x - Xi-l)2 - (Xi - Xi-I)(x Xi-l) = (X - Xi_I)(X - xt> :s 0, we obtain, after slight modification, for suitable ~:' ~i E o [a, b]. The following theorem for the trapezoidal rule results. I f (X) d X - T (f ) I :s h2 11f"ll oo = 12 (b - a)IIf"lI oo' 0 4.3.5 Lemma. There exist numbers Y2j with Yo degree j such that the relationship g(l) b I~- In order to increase the accuracy with which the trapezoidal rule approximates an integral by a factor of 100, h must be decreased to h /10. This requires approximately 10 times as many nodes and therefore approximately 10 times as many function evaluations. That is too expensive in practice. Fortunately, significant speed-ups are possible if the nodes are equidistant. The key is the observation that, for equispaced grids, we can replace the norm bound in Theorem 4.3.3 by an asymptotic expansion. This fact is very important because it allows the use of extrapolation techniques to improve the accuracy (see Section 4.4). By Theorem 4.3.3, TN(f), given by the equidistant trapezoidal formula (3.2), has an error approximately proportional to h~. More precisely, we show here that it behaves like a polynomial in h~, provided f has sufficiently many derivatives. To this end, we express the individual trapezoidal areas ~ (f (Xi) + f (Xi -I» in integral form. Because we may use a linear transformation to get Xi-I = -I, Xi = I, h = 2, we first prove the following. 4.3.3 Theorem. For the trapezoidal rule on a grid with mesh size h, the bound I I f"(~i)1 4.3.4 Remark. The formula in Theorem 4.3.3 indeed gives the correct order 0(11 2 ) , but is otherwise fairly rough because it does not take account of variations in f" in the interval [a, b]. As we see from the first expression in the proof, we may allow large steps Xi - Xi-I everywhere where I f"l is small; however, in order to maintain a small error, we must choose closely spaced nodes at those places where I f"l is large. This already points to adaptive methods, which shall be considered in the next section. Xi-l) Xi-l)(X - Xi» dx (Xi - Xi_I)2 = f(Xi-I>(Xi - Xi-I> + f[Xi-l, xil 2 i 12 l=l:N + j[Xi-l, Xi, x](x - + (Xi - Xi-I)3 i=I:N :SL 1.~1 199 + g(-l) = 2 -I h12(b-a)IIf"lloo holds for all twice continuously differentiable functions f defined on [a, b ]. /1 ( L I and polynomials p j (x) of n.s?" (x) + pm+l(X)g(m+lJ(X») dx (3.3) j=O:Lm/2J holds for m 2: 0 and all (m g:[-I,I]~C. = + I)-times continuously differentiable functions In order to prove (3.5), we assume that (3.5) is valid for odd m < 2k; this is certainly the case when k = O. We then plug the function g(x) := x m for m = 2k + 1 into (3.4) and obtain Proof We define Im(g) LC~~-, :~ yjgU}(x) + pm(x)g,m}(x)) dx for suitable constants Yj and polynomials Pm (x), and determine these such that the conditions L; (g) = g(1) Ym + g(-l) =0 However, the integral can be computed directly: _ (3.4) for m :::: 1, Im(g) - (3.5) for odd m :::: I PI(X):= x, = L I ( =m! ( Yj '-0'.m -I J- j=~_/j m. , . ,x (m - }). m-j + Pm(x)m.,) l-(-l)m+l-j dx ) (m+I-j)! +2Ym . Because m is odd, the even terms in the sum vanish; but the odd terms also vanish because of the induction hypothesis. Therefore, m! ·2Ym = 0, so that (3.5) holds in general. D we find !](g) I -I follow, which imply the assertion. By choosing Yo:= 1, 201 4.3 The Trapezoidal Rule Numerical Integration 200 ill (g(x) + xg'(x» dx 4.3.6 Remark. 
We may use the same argument as for (3.5) to obtain = I -I I and (3.4) holds for m Yj := -I 2 II I (xg(x»' dx = xg(x)I_1 = g(l) + g( -1), = YZi + I)! = -1(2k)! fork> 0 - 1. Also, with Pj(x) dx, Pj+I(X):= IX (Yj - pj(z»dz for i >: I, for m = 2k; from this, the YZi (related to the so-called Bernoulli numbers) may be computed recursively. However, we do not need their numerical values. -I -I From (3.3), we deduce the following representation of the error term of the equidistant trapezoidal rule. we find so that, for (m L i=O:k (2k - 2i + I )-times continuously differentiable functions 4.3.7 Theorem. (Euler-Macl.aurln Summation Formula). Iff is an (m g, 1> TN(f) = 1 a = ill = [II (P;Il+1 (x)g(m)(x) + Pm+1 (x)g(m)' (x») dx (Pm+l(X)g(m)(x»)' dx =0. Thus, (3.4) is valid for all m :::: 1. = Pm+l(x)g(m)(x)I~1 + 1)- times continuously differentiable function, then f(x) dx + L Cjh~ + rm(f), i> 1: Lm/ZJ where the constants, independent of N, are CJ' = 2- ZjYz.I·(f(Zj-I)(b) - f(2 j-I)(a») (.} = I , ... and where the remainder term rm(f) is bounded by , Lm /2J) , (3.6) for some constant C m depending only on m. In particular; rm(f) polynomials f of degree s: m. o for Proof In order to use Lemma 4.3.5, we substitute + (i + = I, ... , N we obtain By summing (3.7) over i TN(f) x;(t) := a 203 4.4 Adaptive Integration Numerical Integration 202 -l b f(x) dx j L t-I) h , -2- 2:h ) 2 ( Y2j(J(2 j-l)(b) - f(2 j-ll(a)) + rm(f), j=I:Lm/2J where the remainder term and set g;(t) := f(x;(t)). We then have lemma implies X;-l = x;(-I), X; = x;(I), so the is bounded by Irm(f)1 :::: ~ L 2C mh m+ 11If(m+ltXl = C mh m+ 1(b - a)ll/m+ltXl' D j=l:N The Euler-MacLaurin summation formula now allows us to combine the trapezoidal rule with extrapolation procedures and thus introduce very effective quadrature rules. 4.4 Adaptive Integration Ifwe compute a sequence of values TN(f) for N = No, 2No, 4No, ... , then we can obtain better quadrature rules by taking appropriate linear combinations. To see this, we consider TN(f) = h; (f(a) + feb) + 2 ;=f;-l f (a + ih N ) ), where the remainder term 2N(f) = h: (f(a) + feb) + 2 ;=f;-1 f(a + ih N) T +2;~N f(a + (i - ~)hN)) is bounded by = ~ ( TN(f) + h ;~N f (a + ( - ~) h N) ) N Hence, if we know TN(f), then to compute T2N (f) it is sufficient to compute the expression with constant C; = I 2m + 2 /1 -L IPm+l (t)1 dt. (4.1) 204 Numerical Integration 4.4 Adaptive Integration with N additional function evaluations, and we thus obtain we find for the simple Simpson rule S(f) applied to [Xi-I, x;J, (4.2) The expression R N (f) is called the rectangle rule. The linear combination IXi + f(b) + 2 = l f[yo, Yl, YI, Yz, y](y - Yo)(y - y])Z(y - yz) dy, where Yo:= Xi-I, Yl := (Xi-I +xi)/2, and yz := Xi. Because (y - YO)(Yz . y]) (y - yz) ::::: 0 for y E [Xi-I, X;], we may apply the mean value theorem for integrals to obtain (4.3) is called the (composite) Simpson rule because it generalizes the simple Simpson rule, which is obtained as SI (f). Because I (f) := Jrr X,-l i=f;-I f(a + ih N) +4i~N f(a + (- DhN)) f(x) d x - S(f) X,-~l 4TzN(f) - TN(f) TN (f) + 2R N(f) SN(f) := = 3 3 = h6N (f(a) fXi f(x)dx - S(f) Xj-l = f[yO, YI, YI, Yz, n IXi (y - YO)(Y - y])Z(y - yZ) dy X,_I b f(x) dx = lim TN(f), Ns-s co a 205 = ( Xi+12- 4(h; = -is we also have Xi)5 .1 1 f[Yo,Yl,Yl,Yz,n _ltZ(tZ-l)dt )5 f[yo, YI, YI, Yz, n lim RN(f) = lim SN(f) = I(f). 
N~oo N~oo However, although TN (f) and RN (f) approximate the integral I (f) with an error of order O(h~), the error in Simpson's rule SN(f) has the more advantageous order O(h~). This is due to the fact that the linear combination (4.3) exactly eliminates the term clh~ in the Euler-MacLaurin summation formula (Theorem 4.3.7). We give another proof that also provides the following error bound. 4.4.1 Theorem. Suppose f is afour-times continuously differentiablefunction on [a, b]. Then the error in Simpson's rule SN(f) is bounded by I (hN)5 = -90 2 f (4) (~) for suitable ~!, ~ E [Xi-I, x;J. By summation over all subintervals [Xi-I, X;], i = I, ... , N, we then obtain II,r f(X)dX-SN(f)( :::::N(h2 h~ N)5 1If (4)1100 = (b-a)llf4 ) 11 90 2880 00' The linear combination 16SzN - SN MN := - - .,--::-- 15 Proof We proceed similarly to the derivation of the error bound for the trapezoidal rule, and first compute a bound for the error on a subinterval [Xi-I' xiJ· Because the equation SN(f) = I(f) holds for f(x) = Xi (i = 0,1,2,3), Simpson's rule is exact for all polynomials of degree :::::3. By Theorem 3.1.4, 0 (4.4) is c~lled the (composite) Milne rule because it generalizes the simple Milne rule, WhIch is obtained as M I (f). The Milne rule similarly eliminates the factors Cj and Cz in the Euler-MacLaurin summation formula (Theorem 4.3.7), and thus has an error of order O(h~). We may continue this elimination procedure; with Numerical Integration 206 a new notation we obtain the following Romberg triangle. with k To,o = TN T1,o = T2N TI,l T2,O = T4N T2,1 T3,O = TS N T3,1 = = = T2 .2 = MN 4.4.3 Remarks. + 2)-times continuously differentiable on To,o := TN(f), I Ti,o := T2iN(f) = 2:(Ti- I,O + R 2i-1N(f)) (i = 1,2, ...) and Ti,k:=Ti,k-l+ then T;,k(f) Ti,k-l - Ti - 1.k 4k - 1 1 jork=I, ... .L, = Ti,k has order 2k + 2 with respect to w(x) = (45) . 1, and (4.6) holds for each i :::: 0 and k :::: m. Proof We show by induction on k that constants ejk) (j > k) exist such that t., = I a b j(x)dx (k) 2j 2m+2 h2'N + O(h 2iN ). + j=fu:m e j By Theorem 4.3.7, this is valid for k = O. It follows for k > 0 from Ti,k-I - Ti-I,k-l Ti,k = T;,k-l + 4k _ I 4 kTi,k_l - T;-I,k-l 4k - I b = 1 a jex)dx "" (k-I) + j=fu:m e j (j>k). Formula (4.6) follows. Proposition 4.1.8 then shows that the order of T;,k is 2k + 2. 0 T3,2 = M2N We now formulate this notation precisely and bound the error. 4.4.2 Theorem. Suppose j is (2m [a, b]. Ijwe set 2' (k)._ (k-I)4 - 2 ] e j .-ej 4k-1 SN S2N S4N 207 4.4 Adaptive Integration 4 k 22 ] 2] ( 2 +2) 4k _ 1 h 2' N + 0 h 2':'N (i) To compute Tn,n, an exponential number 2n of function evaluations are required; the resulting error is of order O(h 2n+2 ) . For errors of the same order, Gaussian quadrature rules only require n + I function evaluations. Thus the method is useful only for reasonably small values of n. (ii) The Romberg triangle is equivalent to the Neville extrapolation formulas (Algorithm 3.2.4 for q = 2) applied to the extrapolation of (Xl, ft) = (h~, TN(f)), N = 21 to X = O. The equivalence is explained by the EulerMacLaurin summation formula (Theorem 4.3.7): TN(f) is approximately a polynomial Pn of degree n in h~, and the value of this polynomial Pn at zero then approximates the integral. (iii) One can use Neville extrapolation also to compute an approximation to the integral f j (x) dx from TNI (x) for arbitrary sequences of N, in place of N[ = 21• The Bulirsch sequence N, = 1,2,3,4,6, ... , 2i , 3 . 2i - l , ... 
is a recommended alternative choice that makes good use of old function values and often needs considerably fewer function values to achieve the same accuracy. A further improvement is possible by using rational extrapolation formulas (see, e.g., Stoer and Bulirsch [90]) instead of Neville's formula because the former often converge more rapidly. (iv) It can be shown that the weights of T;,k are all positive and the bound lIb j (x) dx - t., I :::: Y2k+2h~~:/ II j(2n+2) 1100 (n = minCk, m)) holds. Thus, for k :::: m, the kth column of the Romberg triangle converges to I (f) as i ---+ 00 with order h~~:/. By Theorem 4.1.6, the T;,k even converge to I (f) as i + k ---+ 00 for arbitrary continuous functions, and indeed, the better that the functions can be approximated by polynomials, the better the convergence. In practice, we wish to be able to estimate the error approximately without computation of higher order derivatives. For this purpose, the sizes of the last Numerical Integration 208 correction Ei,k := 1 Ii,k-I - Ii.k I may be chosen as an error estimate for Ii.k in the Romberg triangle. This is reasonable if we can guarantee that, under appropriate assumptions, the error II (f) - Ti,k I is actually bounded by Eik, By Theorem 4.4.2, we have II (f) - Ii,k I = O(h 2k+ 2 ) with h = 2- i (b - a) and 2k 2k Eik = O(h ) because l/(f) - Ti .k - 1 1 = O(h ) . We thus see that the relation II (f) - Ii,k I :::: Eik is usually true for sufficiently large i. However, two arguments show that we cannot unconditionally trust this error estimate. When roundoff errors result in Ii,k = T;,k-I, we have Eik = 0 even though the error is almost certainly larger. Moreover, the previous argument holds only asymptotically as i _ 00. However, we really need an error estimate for small i. The previously mentioned bound then does not necessarily hold, as we see from the following example of a rapidly decreasing function. fd 4.4.4 Example. Ifwe integrate/:= (l + ~ e) dx exactly, we obtain I = 1.02000 .... With numerical integration with the Romberg triangle (with N = I), a desired relative error smaller than 0.5%, and computation with four significant decimal places, we obtain: 25x To,o = 1.250 TI.o T2.0 = 1.125 Tl,I = 1.083 = 1.063 hi = 1.042 T2 ,2 = 1.039, where e-, = Ih2-T2,11 = 0.003, so I ~ 1.039 ±0.OO3. However, comparison with the exact value gives II - T2,21 ~ 0.019; the actual error is therefore more than six times larger than the estimated error. Thus, we need to choose a reasonable security factor, say 10, and estimate I (f) ~ t., ± 10IIi,k-l - Ti,kl· (4.7) Adaptive Integration by Step Size Control The Romberg procedure converges only slowly whenever the function to be integrated has problems in the sense that higher derivatives do not exist or are large at least one point (small m in Theorem 4.4.2). We may ameliorate this with a more appropriate subdivision in which we partition the original interval into subintervals, selecting smaller subintervals where the function has such problems. However, in practice we do not usually know these points beforehand; we thus need a step size control, that is, a procedure that, during the computation subdivides recursively those intervals that contribute too large an error in the integration. We obtain such a procedure if we work with a small part of the Romberg triangle, Ii.k with k SiS n (and N = I), to compute Tn,n (n = 2 or 3). If the resulting approximation is not accurate enough, we then halve the original interval and compute the sub integrals separately. 
The Tu with k :::: i :::: n - I for the resulting subintervals can be computed easily from the already computed function values, and the only additional function values needed to compute the Tn .k are those involved in R N (f), N = 2"-1. This can be used in the implementation to increase the effectiveness. To avoid infinite loops, we may prescribe a lower bound h min for the length of the subintervals. If we fall below this length, we accept Tn,n as the value of the subintegral and we issue a warning that the desired precision has not been attained. To avoid exceeding a prescribed error bound ~ for the entire integral I [a, bl = f(x) dx, it is appropriate to require that every subintegral It" J: IL~., xl = f(x) dx has an error of at most ~(.\: - :5...)/(b - a); that is, we assign to each sub integral that portion of the error corresponding to the ratio of the length of the subinterval to the total length. However, this uniform error strategy has certain disadvantages. This is because in practice, many functions to be integrated behave poorly primarily on the boundary; for example, when the derivative does not exist there (as with Vi cosx for x = 0). The subintegrals on the boundary are then especially inaccurate, and it is advisable to allow a larger error for these in order to avoid having to subdivide too often. This can be achieved when we assign the maximum error per interval with a boundary dominant error strategy. For example, we may proceed as follows: If the maximum allowed error in the interval [:5..., ."l is ~o, we allow errors ~I and ~2 for the intervals [:5..., xl and [x, xl produced from the subdivision, where x< ~I = ~~o, ~2 ~1 = ~~o, ~2 = ~~o if a < :5..., x = b, ~l = ~~o, ~2 = ~~o otherwise. = ~~o if:5... = a, b, If we use the portion of the Romberg triangle to T"./!, I + 2n s function evaluations are required for s accepted subintervals. If we process the intervals from left to right. then due to the splitting procedure, up to 10g«b - a)/ hm;n) incompletely processed intervals may occur, and for each of them, 2/!-1 already available function values, the upper interval bound, and the maximum error need to be stored (cf. Figure 4.2). Thus, (2 n - 1 + 2) log«b - a)/ h min ) storage locations are necessary to store information that is not immediately needed. More sophisticated integration routines use variable order quadrature formulas Numerical Integration 210 1 1 6 • 6/3 • • • 6/2 • L 1~/.\5 L~!9j 1 • 6/2 . .......•... .....•.. 6/6 J 6/2 ....•.. 6/6 J 6/2 .......... 1 • • • 1 4.5 Solving Ordinary Differential Equations ......•.. 211 by making the system autonomous with the additional function Yo = t and the additional differential equation (and initial condition) yb ..... j = 1, Yo(to) = to· (In practice, this transformation is not explicitly applied, but the method is transformed to correspond to it.) Similarly, higher order differential equations can be reduced to the standard form (5.1); for example, the problem .1 x" ....1 = G(t, x), x(to) = xo, x'(to) = vo is equivalent to (5.1) with Figure4.2. The boundary dominant error strategy, n = 2. (e.g., a variable portion of the Romberg triangle), and determine the optimal order adaptively by an order control similar to that discussed in Section 4.5. The boundary dominant error strategy can also be applied whenever the integrand possesses singularities (or singular derivatives) at known points in the interior. 
In this case, we subdivide the original interval a priori according to the singularities, then we treat each subinterval separately. 4.5 Solving Ordinary Differential Equations In this section, we treat (for reason of space quite superficially) numerical approximation methods for the solution of initial value problems for systems of coupled ordinary differential equations. To simplify the presentation, we limit ourselves to autonomous problems of the form y;(t) = F;(Yl(t)""'YIl(t)), Yi(tO)=Y? (i=l, ... ,n), (5.1) or, in short vector notation, Y' = F(y), y(to) = l. However, this does not significantly limit applicability of the results. In particular, al1 methods for solving (5.1) can be generalized in a straightforward way to nonautonomous systems y' = Ftt , y), y(to) = yO We pose the problem more precisely as follows. Suppose a continuous function F: D S; ]R1l -+ ]R1l, a point Yo E int(D), and a real interval [a, b) are given. We seek a continuously differentiable function y: [a, b) -+ ]R1l with y'(t) = F(y(t), yea) = yo. Each such function is called a solution to the initial value problem y' = F(y), yea) = yO in the interval [a, b]. Peano's theorem provides rather general conditions for the existence of a solution. It asserts that (5.1) has a solution in [a, a + h) for sufficiently smal1 h and that this solution can be continued until it reaches the boundary of D (i.e., until either t -+ 00 or Ily(t)11 -+ 00, or y(t) approaches the boundary of D as t increases). Because most differential equations cannot be solved explicitly, numerical approximation methods are essential. Peano's continuation property is the basis for the development of such methods: From knowledge of the previously computed solution, we compute a new solution point a short distance h > 0 away, and we repeat such integration steps until the upper limit of the interval [a, b] is reached. One traditionally divides numerical methods for the initial value problem (5.1) into two classes: one-step methods and multistep (or multivalue) methods. One-step methods (also called Runge-Kutta methods) are memoryless and only make use of the most recently computed solution point to compute a new solution point, whereas multistep methods retain some memory of the history by storing, using, and updating a matrix containing old and new information. The successful numerical solution of initial value problems requires a thorough knowledge of the propagation of rounding error and discretization 213 Numerical Integration 4.5 Solving Ordinary Differential Equations error. In this chapter, we give only an elementary introduction to multistep methods (which are closely related to interpolation and integration techniques), and treat the numerical issues by means of examples rather than analysis. For an in-depth treatment of the whole subject, we refer the reader to Hairer et al. [35,36]. The method does converge because ail ---+ e as h ---+ O. The following table shows what happens for larger step sizes. 212 hA a h): Euler's Method a The prototypical one-step method is the Euler's method. In it, we first replace y' by a divided difference to obtain ,---y(-'--.:t1,----)_---=-y_(t..::..:..-o) t) - to ~ I () v to • = F ( 0) Y or, with hi := tl - to, .1 2.59 1 2.00 10 1.27 100 1.05 1000 1.0 I -.01 2.73 -.1 2.87 -.5 4.00 -.9 12.92 -.99 104.76 -1 00 Once h); is no longer tiny, a is no longer close to e ~ 2.718. Moreover, for Ii). < -1, the "approximations" l + I = (l + hA) i are violently oscillating because I + h). 
is negative, and bear no resemblance to the desired solution. :s h.): :s 1, In conclusion, we have qualitatively acceptable solutions for but good approximations only for IhAI « 1. For A 0 it could be expected that the step size h must be very small, in view of the growth behavior of e" . However, it is surprising that a very small step size is also necessary for A « 0, even though the solution is negligibly small past a certain t-value! For negative values of A (for example, when A = -105 and h = 0.02, see Figure 4.3), an -! » Continuing in the same way with variable step size hi, we obtain the formulas ti+1 .01 2.71 = t, + hi, yi+1 = i + hiF(i) for i = 0, ... , N - 1, and we expect to have y(ti) ~ i for i = 1, ... , N. To define an approximating function y(t) for the solution of the initial value problem (5.1), we then simply interpolate these N + 1 points; for example, with methods from Chapter 3. One can show that, for constant step sizes hi = h, Euler's method has a global error of order 0 (h) on intervals of length 0 (l), and thus converges to the solution as h ---+ O. However, as the following example shows, sometimes very small step sizes are needed to get an acceptable solution. 0.08 0.06 \ \ \ \ 4.5.1 Example. Application of Euler's method with constant step size hi = h to the problem y'=Ay, leads to the approximation \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y(O)=l \ \ \ \ \ \ \ -, - -, - -, - -, - -, - -, - -, - -, - / / / / / / / / / / / / / / / / / / / / I I I I I I I I I I I I -0.06 -0.08 instead of the exact solution -0.1 ':----'---J:-:-:-----'---=-'-~~-LL____L---L---L---LLL__L--"---__"___~_l...__l...__l._l_LLL_L_J o 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Figure 4.3. Divergence of Euler's method for the stiff equation Vi -105 and step size h = 0.02. . = -A v, with A = . 214 oscillating sequence occurs that increases in amplitude with t , and the sequence has nothing at all to do with the solution. The reason for the poor behavior of Euler's method for A « 0 is the very steep direction field of the corresponding differential equation in the immediate vicinity of the solution. Such differential equations are called stiff; to solve such equations effectively, special care must be taken. To improve the accuracy of Euler's method, we may consider the Taylor series expansion yet + h) = yet) 215 4.5 Solving Ordinary Differential Equations Numerical Integration Table 4.1. Coefficients for Adams-Bashforth of order p s 130 0 1 1 2 12 e, fh 131 1 3 -2: 2: 23 5 16 55 3 = s+1 -12 12 59 37 -24 24 24 h2 + hy'(t) + _y"(t) + 0(h 3 ) 2 h quadrature formula (see Section 4.2). We consider only the equidistant case 2 = y(t) + hF(y(t)) + 2:F'(y(t»F(y(t») + 0(h 3 ) . This suggests the method t; b-a = a + ih, h = t:I. + 1 points ti-s, ... , tt as data points, then If one uses the last s y(ti+l) = y(ti) j 1i + h +. F(y(t)) dt I, which can be shown to have a global error of 0(h 2 ) . We may similarly take more terms in the series to obtain formulas of arbitrarily high order. A disadvantage of this Taylor series method is that, for F (y) E IRn, we must compute the entries of the Jacobian F' (y) (or even higher derivatives); usually, this additional cost is unwarranted. However, if combined with automatic differentiation, an advantage is an easy rigorous error analysis with interval methods (Moore [63], Lohner [57], Neumaier [70a]). 
Multistep Methods To derive methods with increased order that avoid the use of derivative information, we assume that the values have already been computed for j ::: i. If we integrate the differential equation, y(tH]) = y(li) = y(ti) + h L + 0(h s+ 2 ) {3jF(y(ti-j) j=O:s for certain constants {3j depending on s, The constants {3 j may be computed by integrating the Lagrange polynomials; they are given for s ::: 3 in Table 4.1. They differ from those derived in Section 4.2 because now the interpolation points are not chosen from [ti, ti+IJ, but lie to the left of ti. By dropping the error term, one obtains the following explicit Adams-Bashfortli multistep method of order s + I: ti+] = t, -s- h, yi+l = y' +h L {3jF(/-j). j=O:s The cost per step involves one function evaluation only. The global error can be proved to be 0 (h 5 + ] ); as for numerical integration, one loses one power of h for the global estimates because one has to sum over (b - a)/ h steps. If one includes the point to be computed in the set of interpolation points, one finds implicit methods with an order that is higher by I. Indeed, Ii+ l + F(y(t»dt, / t, y(ti+l) = y(td +h L {3jF(y(ti-j» + 0(h 5+ 3 ) j=-1:5 we know on the right side the function values F (y (t j» ~ F (yj) = f j approximately and can therefore approximate the integral with an interpolatory With s-dependent constants f3j given for s ::: 3 in Table 4.2. One obtains the 216 Numerical Integration Table 4.2. Coefficients for Adams-Moulton formulas of order p S fJ-1 fJo 0 I 2: 2: 1 12 12 2 9 24 J9 24 251 646 5 3 720 4.5 Solving Ordinary Differential Equations = s +2 fJl fJ2 fJ3 older values drastically. s': J, j > 0, the convergence and stability behavior may change 4.5.2 Example. The Simpson rule gives I 1 8 -12 y(ti+l) - y(ti-d -24 24 264 t: ti- h F(y(t» dt 2h = (5 (F(y(ti+d) 106 -720 720 = J 5 217 720 + 4F(y(ti» + F(y(ti_l») + O(h 5 ) , hence suggests the implicit multistep formula following implicit Adams-Moulton multi-step method of order s ti+1 + 2, = ti + h, i+1= v' + h (is_I F(yi+l) + J~/J F(yi- J»), /+1 = /-1 + ~(F(/+I) + 4F(/) + F(y/-I». For the model problem y' in which the desired function value also appears on the right side. Instead of solving this nonlinear system of equations, one replaces yi+1 on the right side by the result of the corresponding explicit formula; the order, smaller by 1, is restored when one multiplies by h. One thus computes a predictor yi+1 with an explicit method and then obtains the value yi+ J by using an implicit method as a corrector, with y+1 on the right side. For example, one obtains with s = 3 the fifth order Adams-Moulion predictorcorrector method: (5.2) = Ay, yeO) = 1, we find /+1 = yi-I + ~h (yi+1 + 4 y i + yi-I). The general solution of this difference equation is Y i = YIZIi + Y2Z2i (5.3) Where ti+1 = ti +h; % predictor step yi+1 = yi + :P4(55f i - 59f i- 1 + 37 r" = t- . 2 - 9f'-3); ZI } _ 2Ah Z2 F(yi+I); = i + 7~0(251t+1 + 646fi 3 - Ah - - {I + Ah + O«Ah)2) -1 + ¥- + O«Ah)2) ~ e'"; ~ _e-Ah/3 are the solutions of the quadratic equation % corrector step yi+1 ± J9 + 3A 2h2 _ - 264fi-1 + 106fi-2 - 19f i-3); fi+1 = F(yi+I); , Depending on the implementation, one or several corrector . iterations may be performed. Compared with the explicit method, the cost per step increased to two function evaluations (if only one corrector step is taken). It can be shown that the methods of Adams-Bashforth and ofAdams-Mou Iton converge with convergence order s+ I (i.e., global error O(hs+ 1 and s+2 (i.e.. 
global error O(hs+ 2», respectively. However, a formal analysis is important here because, for general multistep methods where yi+I depends linearly on » G' 0 I tven y and y , we can solve the equations YI to find YI - Z2 YI = - - - , ZI - Z2 Y2 + Y2 = Yo = 1, YIZl + Y2Z2 = YI ZI - YI = ---. ZI - Z2 Now, if A < 0 then the solution yet) = e" decays for t ---+ 00; but unless Yl = ZI (an unlikely situation), Y2 f=- 0 and (5.3) explodes exponentially, even for tiny values of sh, 218 4.6 Step Size and Order Control Solving Stiff Systems /+1 is not computed as a sum of yi and a linear combination of the previous s function values F (yi - j), but as a sum of the function value at the new point Yi+ 1 and a linear combination of the previous s values yi -I. For stiff differential equations, both the Adams-Bashforth methods and the Adams-Moulton methods suffer from the same instability as Euler's method so that tiny step sizes are required. However, the BDF-formulas have (for s ::: 5 only) better stability properties, and are suitable (and indeed, the most popular choice) for integrating stiff differential equations. However, when using the BDF-formulas in predictor-corrector mode, one loses this stability property; therefore, one must find yi+l directly from (5.4) by solving a nonlinear system of equations (e.g., using the methods of Chapter 6). The increased work per step is more than made good for by the ability to use step sizes that reflect the behavior of the solution instead of being forced by the approximation method to tiny steps. Properties of and implementation details for the BDF-formulas are discussed in Gear [29]. A survey of other methods for stiff problems can be found in Byrne and Hindmarsh [10]. Of considerable practical importance for stiff problems are additional im.plicit multistep methods obtained by numerical differentiation. Here we approxIm~te g(T) := yeti - Th) by an interpolating polynomial Ps(~) of degree s + 1 with data points (T, g(T» = (l, yi-l) (l = -1,0, .... s). ThIS gives g(T) ~ Ps+I(T):= L Lj(T)yi-J J=-l:s where T L)(T) = -l n~ I#} is a Lagrange polynomial of degree s } + 1; of course, for n > I, we interpret Ps+I (t) as a vector of polynomials. We then have . F(y'+I) ~ F(y(ti+l» = It is natural to fix ~h v" 219 Numerical Integration " LJ 1, . ~ 1 , = y'(ti+l) = kg (-1) ~ ,:/s+1 (-1) ' L'(-l)yi-J. J 4.6 Step Size and Order Control J=-l:s implicitly through the condition i-)y +hf3I F(yi+I).' Y i+1 = "LJ a j (5.4) )=O:s (_I)anda'=-f3_ IL'(l)areeasilycomputed t h e coe ffici clen ts f3 -I -IlL' -1 J ) from the definition. They are gathered for s ::: 5 in Table 4.3. . The formulas obtained in this way (by Gear [29]) are called backwards differentiation formulas, or BDF-formulas; they have order s + 1. Note that here For solving differential equations accurately, step sizes and order must be chosen adaptively. In our discussion of numerical integration, we have already seen the relevance of an adaptive choice of step size; but it is even more important in the present context. To see this, we consider the following. 4.6.1 Example. The initial value problem y'=l, y(0)=3 has the solution jor BDFfiormulas of order P = s Tabl e 4 .3 . Coefficients JJ' s f3 0 1 1 2 3 4 5 \ 2 ao only defined for t < 1/3. Euler's method with constant step size h gives 1 4 1 -3 18 TT TT -TT 12 48 6 3 y(t) = 1 - 3t' as al 3 3 +1 9 36 25 25 -25 60 137 60 147 300 137 360 147 300 -137 450 - \47 l 2 = 3. yi+1 = / + h(/)2 for i ~ O. 
TT 3 25 16 -25 200 75 -137 225 - 147 m 400 147 12 137 72 147 10 -147 AsTable4.4withresultsforconstantstepsizesh = 0.1 andh = 0.01 shows, the method with step size h = 0.1 continues far beyond the singularity, without any noticeable effect in the approximate solution, whereas with step size h = 0.0 I the pole is recognized, albeit too late. If one were to choose h still smaller, then 220 Numerical Integration Table 4.4. 4.6 Step Size and Order Control Results for constant step sizes using Euler's method yi .vi t, yeti) h=O.l h =0.01 0 0.1 0.2 0.3 0.4 0.47 0.5 0.6 0.7 3 4.29 7.5 30 3 3.9 5.42 8.36 15.34 3 4.22 7.06 19.48 3.35.10 6 overflow I 38.91 190.27 3810.34 221 In each integration step, one can make several choices because both the order p of the method and the step size h can be chosen arbitrarily. In principle on~ coul.d change p and ~ in each step, but it has been found in practice tha~ doing this produces spunous instability that may accumulate t . 0 gross errors ill the computed solution. It is therefore recommended that bef h . . ore c angmg to another step SIze and a possibly new order p, one has done at least p steps with constant order and step size. Local Error Estimation To derive error estimates we must have a precise concept of the order f method. We say that a multistep method of the form 0 a ' " ' CXjy'-j ' . +h L.J ' " ' f3jF(yi- j); y i+1 = L.J one would obtain the position of the pole correspondingly more accurately. In general, one needs small h near singularities, whereas, in order to keep the cost small, one wishes to compute everywhere else with larger step sizes. j=O:S (6.1) j=O:s has order p if, for every polynomial q(x) of degree <», q(l)= L(CXN(-})+f3jq'(-j». (6.2) j=O:s 4.6.2 Example. The next example shows that an error analysis must also be done along with the step size control. The initial value problem y' = -y - 2t t». yeO) = 1 has the solution y(t) = vT=2(, defined only for t :S ~. Euler's method, carried through with step size h 0.001, gives the results shown in Table 4.5. Surprisingly, the method can be continued without problems past the singularity at t = 0.5. Only a weak hint that the solution might no longer exist is given, by a sudden increase in the size of the increments yi+ I - / after t = 0.525. Exactly the same behavior results for smaller step sizes. except that the large increments occur closer to 0.5. = We conclude that a reliable automatic method for solving initial value problems must include an estimate for the error in addition to the step size control, and must be able to recognize the implications of that estimate. Table 4.5. Results for y' = -y - 2t/y, yeO) = 1, using Euler's method =================----o 0.100 0.200 0.300 0.400 0.499 0.500 0.510 0.520 0.525 0.530 n.600 ----------~----------------- y(ti) 1 0.894 0.775 0.632 0.447 0.045 0.000 yi 1 0.894 0.774 0.632 0.447 0.056 0.038 0.042 -0.023 -0.001 0.777 0.614 ============================== This is justified by the following result for constant step size h. ~.6.3 The~rem. Ifthe multistep method (6.1) has order p, then for all (p +2)times continuously differentiable functions y and all h E JR, yet + h) = L CXjy(t - }h) + 11 j=O:s L f3jy'(t - }h) + Eh(t) (6.3) = 0(h P+ 1) (6.4) j=O:s with an error term Eh(t) = C h P + 1 [ h p Y t , t - , ... , t - (p + l)h] + 0(h p+ 2 ) and c, = 1 + L(-j)P(jCXj - (p+ l)f3j). j=O:s ProOf By Theorem 3.1.8, Y rt , t - h , ... , t - (p + 1)hJ = Y (p+1)( (p t + r h) + I)! ( _ y P+I)(t) - (p + 1)1 + O(h). 
223 4.6 Step Size and Order Control Numerical Integration 222 Taylor expansion of yet) and y'(z) gives (for bounded j) If the largest of these values for a is not too close to I, the step size is changed hi = ah and the corresponding order p is used for the next step. To guarantee accuracy. one must relocate the needed t, _ j to their new equidistant positions ti- j = t, - j hi and find approximations to ytt, - j hi) by interpolating the already computed values. (Take enough points to keep the interpolation error of the same order as the discretization error!) Initially, one starts any multistep method with the Euler method, which has order 1 and needs no past information. To avoid problems with stability when step sizes change too rapidly (the known convergence results apply only to constant or slowly changing step sizes), it is advisable to restart a multistep method when a « I and truncate large factors a to a suitable threshold value. to vct + jh) . (k) ( ) = " L k=O:p+! 2 ~hk / + 0(h P+ ) k' . P+ = qhU) + h 1 jP+1 y[t, t - h, ... , t - (p .(k+l)Ct) Y '( t 'h) +J = "L } k! k=O:p = h-lq~(j) hkJ.k + O(h P+ + (p + I + I)h] + 0(h P+2 ) ) l)h PjPy[t, t - h, ... , t - (p + I)h] + O(h P + I) with a polynomial qh(X) of degree at most p. Using (6.3) as a definition for E (t), plugging in the expressions for yet + j h) and y' (r + j h) and using (6.2), h we find (6.4). 0 If the last step was of order p ; it is natural to perform a step of order p E {p _ I, p , P + I} next. Assuming that one has already performed p steps of step size h, the theorem gives a local discretization error of the form /+1 _ yCti+l) = CphP+1y[ti,"" ti-p-l] + 0(h P+ 2 ). with yi+1 = yCti+l), where yct) is the Taylor expansion ofy(t) of order p. at t.. The divided differences are easily computable from the y' - J • If the step SIze h remains the same, the order p with the smallest value of is the natural choice for the order in the next step. However, if the new step size is changed to hi = edt (where a is still arbitrary), 1 the discretization error in the next step will be approximately a P+ w p ' For thiS error to be at most as large as the maximum allowable contribution 8 to the global error, we must have aP+lw p :s 8. Including empirically determined safety factors, one arrives at a recommended value of a given by if P = p - 1, if P p, if P = p + I. = The safety factors for a change of order are taken smaller than those for order P to combat potential instability. One may also argue for multiplying instead by suitable safety factors inside the parentheses, thus making them p dependent. The Global Error As in all numerical methods, it is important to assess the effect of the local approximation errors on the accuracy of the computed solution. Instead of a formal analysis, we simply present some heuristic that covers the principal pitfalls that must be addressed by good software. The local error is the sum of the rounding errors r, made in the evaluation of (6.1) and the approximation error 0(h P+ 1 ) (for an order p method) that gives the difference to YCti+I). We assume that rounding errors in computing function values are O(SF) and the rounding error in computing the sum is O(s). Then the total local error is O(hs F) + O(s) + 0(h P+ 1). For simplicity, let us further assume that the local errors accumulate additively to global errors. (This assumption is often right and often wrong.) If we use constant step size h over the whole interval [a, b], the number of steps needed is n = (b - a)/ h. 
The global error therefore becomes yn _ yCtn) = O(n(hsF + S + h P+ I » = 0 (SF + ~ + h P) . (6.5) Although our derivation was heuristic only and based on assumptions that are often violated, the resulting equation (6.5) can be shown to hold rigorously ~or all methods that are what is called asymptotically stable. In particular, this lllcludes the Adams-Bashforth methods and Adams-Moulton methods of all orders, and the BDF methods of order up to six. In exact arithmetic, (6.5) shows that the global error remains within a desired bound ~ when O(~llp) steps are taken and F is a p times continuously differentiable function. (Most practical problems need much more flexibility, but 224 Numerical Integration 4.7 Exercises it is still an unsolved problem whether there is an efficient control of step size and order such that the same statement holds.) In finite precision arithmetic, (6.5) exhibits the same qualitative dependency ofthe total error on the step size h as in the error analysis for numerical differentiation (Section 3.2, Figure 3.2). In particular, the attainable accuracy deteriorates as the step sizes tend to zero. However, as shown later. there are also marked differences between these situations because the cure here is much simpler as in the case of numerical differentiation. As established in Section 3.2, there is an optimal step size h opt = 0 (E I/(P+l». The total error then has magnitude In particular, one sees that not only is a better accuracy attainable with a larger order, but one may also proceed with a larger step size and hence with a smaller total number of steps, thus reducing the total work. We also see that to get optimal accuracy in the final results, one should calculate the linear combination (6.1) with an accuracy E « Ejf+I)/p. In particular, if F is computed to near machine precision, then (6.1) should be calculated in higher precision to make full use of the accuracy provided by F. In FORTRAN, if F is computed in single precision, one would get the desired accuracy of the Yi by storing double precision versions yy(s - j) of the recent Yi- j and proceed with the computation of (6.1) according to the pseudo code Fsum = LfjF(y(i - j) yy(s + 1) = LCtjYY(S - j) +dprod(h, Fsum) y(i + 1) = real(yy(s + 1» yy(l : s) = yy(2 : S + 1) (The same analysis and advice is also relevant for adaptive numerical integration, where the number of function evaluations can be large and higher precision may be appropriate in the addition of the contributions to the integral.) A further error, neglected until now, occurs in computing the t, . At each step, we actually compute t;+1 = t, + h, with rounding errors that accumulate significantly in a long number of short steps. This can lead to substantial systematic errors when the numher of steps becomes large. In order to keep the roundoff error small, it is important to recompute the actual step size h., = t;+1 - t, used, from the new (approximately computed) argument ti+1 = t, hi; because of cancellation that typically occurs in the computation of t;+l - t., the step size hi computed in this way is usually found exactly (cf. Section 1.3). This is one of the few cases in which cancellation is advantageous in numerical computations. + 225 4.7 Exercises 1. Prove the error estimate l a b f(x) dx = (b - a)f (a-2+ b) + (b -24a)3 r(~), ~ E [a, b] for the r~ctangle rule (also called midpoint rule) if f is twice continuously differentiable functions f(x) on the interval [a, b]. 2. 
Because the available function values 1(x j) = f (x j) + E j (j = 0, ... , N) are general~y contaminated by roundoff errors, one can ask for weights aj chosen III such a way that the influence of roundoff errors is small. Under appropriate assumptions (distribution of the E i with mean zero and standard deviation (J), the root mean square error of'a quadrature formula Q(f) = Lj=O:N aj/(Xj) is given by rN = (J jL CY;' j=O:N (a) Show that r» for a quadrature formula of order > 1 is minimal when all weights CY j are equal. (b) A quadrature formula in which all weights are equal and that is exact for polynomials p(x) of degree <n is called a Chebychev quadratureformula. Determine (for w(x) = 1) the nodes and weights of the Cheby~hev formula for n = 1, 2, and 3. (Remark: The equations for . the weights and points have real solutions only for n ::e 7 and n = 9.) 3. G~ven a closed interval [a, b] and nodes x) := a + jh, j = 0, 1,2,3 WIth h := (b - a)/3, de~ve a quadrature formula Q(f) = Li=O'3 CYi f(Xi) that matches l(f) = fa f(x) dx exactly for polynomials of degree <3. Prove :he error formula l(f) - Q(f) = toh5 f(4)(0, ~ E [a, b] for f E C [a, b]. Is Q(f) also exact for polynomials of degree 47 4. Let Qk(f) (k = 1,2, ...) be a normalized quadrature formula with nonnegative we.ights and order k + I. Show that if f(x) can be expanded in a power series about with radius of convergence r > 1, then there is a constant C, (f) independent of k with ° I t ~1 f(x)dx - Qk(f)/::e Cr(f) (r - . l)k+1 Find Cr(f) for f(x) = 1/(2 - x 2 ) , using Remark 4.1.11, and determine r SUch that C (f) is minimal. l Ja f (x) dx of a 5. In order to approximately compute the definite integral function f E C 6[0, 1], the following formulas may be applied (h := lin): Jor f(x) dx = h (2.I 1 12 J"(~l) 2 f(O) 227 4.7 Exercises Numerical Integration 226 + V=~-1 f(vll) + 2.1 f(l) - 11 ~~~ctions i, where R(f) dep~nds only on the f(xj), then R(f) :::: . 2N If (b) - f (a) I· (Thus, 111 this sense, the trapezoidal rule is optimal.) 7. Wnte a MATLAB program to compute an approximation to the integral f (x) dx with the composite Simpson rule; use it to compute !!. = J: Jd 1~~2 with an absolute error ::::10-5 . How many equidistant point: are sufficient? (Assume that there is no roundoff error.) Jcos x dx with an abso8. Use MATLAB to compute the integral I = lute error ::::0.00005; use Simpson's rule with step size control. 9. Let f be a continuously differentiable function on [0, b l. (a) Show that J;/2 (composite trapezoidal rule; cf. (3.1», " +4L.."f (2V -21)1l) ' l,=l:n I b-a E (composite Simpson rule; cf. (3.1», (composite Milne rule; cf. (4.4» with~; E [0, n i = 1,2,3. Let f(x) := I a b f(x)dx f([a, bD + b ) + [-1, 1l-4b-a rad (f' ([a , b])) } . n { f ( a. -2- (b) Use (a) to write a MATLAB program for the adaptive integration of a giv~n continuous function f where interval extensions for f and t' are available, (Use the range of slopes for l' when the derivative does not exist over some interval, and split intervals recursively until an accuracy specification is met.) (c) Compute enclosures for Itl ~I- dx JI __1_ dx and (n12 a v'l+x' ' -I vT=X' Ja JXcosx dx with an error of at most 5 x 10-;-1 (i = 2,3,4); give the enclosures and the required number of function evaluations. 10. (a) Show by induction that the Vandermonde determinant x 6 .5 . (a) For each formula, determine the smallest n such that the absolute value of the remainder term is :::: 10- 4 . 
How many function evaluations would be necessary to complete the numerical evaluation with this n? (b) Using the formula that seems most appropriate to you, compute an approximation to l x 6 .5dx to the accuracy 1O-4(with MATLAB). Com- xa det x5 Ja pare with the exact value of Jd x 6 .5 dx. 6. Let a = xa < XI < ... < XN = b be an equidistant grid. (a) Show that if f (x) is monotone in [a, b], then the composite trapezoidal rule satisfies If" \ does not vanish for pairwise distinct Xj E JR, j = 0, , n. (b) Use part (a) to show that for any grid a :::: Xa < Xl < < xn < b there exist uniquely determined numbers aa, ai, ... , an so that - , b a f(x)dx - TN(f) :::: --If(b) - f(a)l· a 2N (b) Show that if Q(f) is an arbitrary quadrature formula on the above subf(x) dx - Q(f)1 :::: R(f) for all monotone division of [a, b] and I J: for every polynomial p(x) of degree n. 229 Numerical Integration 4.7 Exercises 11. (a) Show thatthe Chebychev polynomials T; (x) (defined in (3.1.22» are orthogonal over [-1, 1] with respect to the weight function co (x) := (e) If all b j are positive (which is the case for positive weight functions), find a nonsingular diagonal matrix D such that A~+I = D- I A n+1 Dis symmetric. Confirm that also Pn+I(X) = det(xl - A~+I)' Why does this give another proof of the fact that all nodes are real? (f) Show that in terms of eigenvectors w A of A~+I of2-norm I, CA = (wt)2. Use this to compute with MATLAB's eig nodes and weights for the 12th order Gaussian quadrature formula for f ~ I j (x) dx. 14. A quadrature formula Q(n := Li=O:N a;f(xd with Xi E [a, b] (i = 0.... , N) is called closed provided Xo = a and XN = band ao, aN i= O. Show that, if there is an error bound of the form 228 (l - x 2 )- L (b) Show that T; (x) = Re(x + i.JI=X2)" for x E [-1, 1]. 12. Determine for N = 0, 1,2 the weights a j and arguments x j (j = 0, ... , N) of the Gaussian quadrature formula over the interval [0, I] with weight function os (x) := x -1/2. (Find the roots x j of the orthogonal polynomial P3(X) with MATLAB's root.) 13. (a) Show that the characteristic polynomials P j (x) = det(x I - A j) of the tridiagonal matrices lib ° with E(f) := Li=ON f3;f(Xi) for all exponential functions j = eAX (A E JR), then Q(f) must be a closed quadrature formula and 1f3i1 ~ laiI for i =0. N. 15. Let Dk(f) := Li=O:N f3 i(k) j(Xi) with Xo < Xl < ... < XN. Suppose that satisfy the three-term recurrence relation k (7.1) started with P_I(X) = 0, Po(x) = 1. (b) Show that for any eigenvalue A of An+l, u) LA(x) = c;l LV) Pj_1 (x). j D (f) = 1 k! j (k) (Sk(f» for some Sk(f) E [Xo, xJl holds whenever j is a polynomial of degree :::N. (a) Show that D N (f) = f[xo • . . . • XN], and give explicit formulas for the f3 i( N ) in terms of the Xi. (b) Show that f3?) = for i ::: N < k, i.e., that derivatives of order higher than N cannot be estimated with N + I function values. 16. (a) Write a program for approximate computation of the definite integral j(x) dx via Romberg integration. Use hi := (b - a)/2 as initial step size. Halve the step size until the estimated absolute error is smaller than 10-6 , but at most 10 times. Compute and print the values in the "Romberg triangle" as well as the estimated error of the best value. (b) Test the program on the three functions (i) j(x) := sinx, a := 0, b := it , (ii) j(x) := vfxl3, a := 0, b := I, and (iii) j(x) := (l + l Scos'' x)-1/2, a := 0, b:= Jr/2. 17. 
Suppose that the approximation F(h) of the integral I, obtained with step size h, satisfies ° = Pj-l(A), define eigenvectors u A of An+l and (c) Show that the j(X)dX\ ::: IE(f)1 VA of A;'+I' where CA = f: L bi->: b (v))2 j j are the Lagrange polynomials for interpolation in the eigenvalues of A n+l. (Hint: Obtain an identity by computing (vA)T An+l u'' in two ways.) (d) Deduce from (a)-(c) that the n + I nodes of a Gaussian quadrature formula corresponding to orthogonal polynomials with three-term recurrence relation (7.1) are given by the eigenvalues x j = A of An + I. with corresponding weights aj = c;l f w(x)dx. (Hint: Use the orthogonality relations and po(x) = 1.) I = F(h) + ah!" + bh"? + ch!" + ... with PI < P2 < P3 < .... 4.7 Exercises Numerical Integration 230 21. Consider an initial value problem of the form Give real numbers a, f3, and y such that 1= aF(h) + f3F (~) + yF (~) + O(h P3 ) holds, to obtain a formula of order h P3• 18. To approximate the integral JD f tx , y) dxdy, D yea) = a. y' = Ftt , y), Suppose that the function F (t, z) is continuously differentiable in [a, b] x R Suppose that c 1R2 , via a formula of F(t,z) 2:0, the form L 231 (7.2) Av.t(xv, Yv) v=I:N and Ft(t,z):SO, Fy(t,z):sO for every t E [a, b] and Z E R Show that the approximations obtained from Euler's piecewise linear method satisfy with nodes (xv, Yv) and coefficients A v, (v = 1, ... , N) we need to determine 3N parameters. The formula (7.2) is said to have order m + I i2:y(ti), i for y(ti) i=O,I, .... 22. Compute (by hand or with a pocket calculator) an approximation to y(OA) to the solution of the initial value problem provided yl/(t) = -yet), for all q, r 2: 0 with q + r = 0, 1, ... , m. (a) What is the one-point formula that integrates all linear functions exactly (i.e., has order 2)? (b) For integration over the unit square D:= {(x, y) 10:S x :s I,O:s y :s I}, yO := yeO), ZO := v'(O), '+1 . yeO) = 1. := 2 . with Euler's method with step size h. 1 4' [0 L}') (c) How small must h be for the relative error to be :s:;: . 10- ill , . (d) Perform 10 steps with the step size determined in (c). 20. The function y(t) = t 2 /4 is a solution to the initial value problem yeO) = y' = F(t, y), + h) = " L v=O:N-1 t and h > O. Verify this yea) = l (7.3) may be solved via a Taylor expansion. To do this, one expands y(t) about a to obtain yea o. However, Euler's method gives yi = 0 for all behavior numerically, and explain it. (h = 0.1). Interpret the rule explained in (b) geometrically. 23. When .t is complex analytic, an initial value problem of the form (a) Write down a solution to the initial value problem. (b) Write down a closed expression for the approximate values yi obtained =,;y, h . i -- h.:' + 2F(ti' y'), Zi+1 := Zi + hF(ti, y"); (i = 0, ...,3); i corresponding A v , v = 0, ... ,5. 19. Consider the initial value problem y' y'(O) = 0 by (a) rewriting the differential equation as a system of first order differential equations and using Euler's method with step size h = 0.1; (b) using the following recursion rules derived from a truncated Taylor expansion: show that there is an integration formula of the form (7.2) with nodes (0,0), (0, 1), (1/2, 1/2), (1,0), (1,1), and order 3, and determine the y' = 2y, yeO) = I, y(V) (a) -'_ _ h" v! + remainder; . function values and derivatives at a are computable by repeated differentiation of (7.3). NegLecting the remainder term, one thus obtains an Numerical Integration 232 approximation for yea + h). 
Expansion of y about a +h 5 gives an ap- proximation at the point a + 2h, and so on. (a) Perform this procedure for F(t, y) := t - y2, a := 0 and yO := 1. In these computations, develop y to the fourth derivative (inclusive) and choose h = 1/2, 1/4, and 1/8 as step sizes. What is the approximate Univariate Nonlinear Equations value at the point t = I? (b) Use an explicit form of the remainder term and interval arithmetic to compute an enclosure for y (1). 24. To check the effect of different precisions on the global error discussed at the end of Section 4.6, modify Euler's method for solving initial value problems in MATLAB as follows: Reduce the accuracy of F h to single precision by multiplying the standard MATLAB result with 1+le-8*randn (randn produces at different calls normally distribbuted numbers with mean zero and standard deviation 1). Perform the multiplication of F with h and the addition to yi (a) in double precision (i.e., normal MATLAB operations); (b) in simulated single precision (i.e., multiplying after each operation with 1+le-8*randn). Compute an approximate solution at the point t = 10 for the initial value 4 3 problem y' = -O.ly, yeO) = 1, using constant step sizes h := 10- , 10- and 10- 5 . Comment on the results. 25. (a) Verify that the Adams-Bashforth formulas, the Adams-Moulton formulas, and the BDF formulas given in the text have the claimed order. (b) For the orders for which the coefficients of these formulas are given in the text, find the corresponding error constants C p- (Hint: Use Example 3.1.7.) In th~s chapt~r we treat the problem of finding a zero of a real or complex function I of one variable (a univariate function), that is, a number x* such that I(x*) = O. This problem appears in various contexts; we give here some typical examples: (i) Solutions of the equation p(x) = q (x) in one unknown are zeros of func- .. tions I su~~ as (among many others) 1:= p - q or 1:= plq - 1. (ii) Intenor rruruma or maxima of continuously differentiable functions I are zeros of the derivative 1'. (iii) Singular points x* of a continuous function I for which II (x) I ---+ 00 as x ---+ x* (in particular, poles of f) are zeros of the inverted function ._ { lII(x), 0, g (x ) .- if x if x t= x*, = x*, continuously completed at x = x*. (iv) The eigenvalues of a matrix A are zeros of the characteristic polynomial I(A) := det(AI - A). Because of its importance, we look at this problem more closely (see Section 5.3). (v) Shooting methods for the solution of boundary value problems also lead to the problem of determining zeros. The boundary-value problem y"(t) = g(t, yet), let)), yea) = Ya, y(b) y: [a, b] ---+ lR = Yb can often be solved by solving the corresponding initial-value problem y~/(t) = get, ys(t), y;(t)), y,: [a, y,(a) = Ya, y;(a) = s 233 b] ~ lR 234 Univariate Nonlinear Equations for various values of the parameter s. The solution y = Ys of the initialvalue problem is a solution of the boundary-value problem iff s is a zero of the function defined by f(s) := ys(b) - Yb· We see that, depending on the source of the problem, a single evaluation of f may be cheap or expensive. In particular, in examples (iv) and (v), function values have a considerable cost. Thus one would like to determine a zero with the fewest possible number of function evaluations. We also see that it may be a nontrivial task to provide formulas for the derivative of f. 
Hence methods that do not use derivative information (such as the secant method) are generally preferred to methods that require derivatives (such as Newton's method). Section 5.1 discusses the secant method and shows its local superlinear convergence and its global difficulties. Section 5.2 introduces globally convergent bisection methods, based on preserving sign changes, and shows how bisection methods can accomodate the locally fast secant method without sacrificing their good global properties. For computing eigenvalues of real symmetric matrices and other definite eigenvalue problems, bisection methods are refined in Section 5.3 to produce all eigenvalues in a given interval. In Section 5.4, we show that the secant method has a convergence order of :::::;1.618, and derive faster methods by Opitz and Muller with a convergence order approaching 2. Section 5.5 considers the accuracy of approximate zeros computable in finite precision arithmetic, the stability problems involved in deflation (a technique for finding several zeros), and shows how to find all zeros in an interval rigorously by employing interval techniques. Complex zeros are treated in Section 5.6: Local methods can be safeguarded by a spiral search, and complex analysis provides tools (such as Rouche's theorem) for ascertaining the exact number of zeros of analytic functions. Finally, Section 5.7 explores methods that use derivative information, and in particular Newton's method. Although it is quadratically convergent, it needs more information per step than the secant method and hence is generally slower. However, as a simple prototype for the multivariate Newton method for systems of equations (see Section 6.2), we discuss its properties in some detail - in particular, the need for damping to prevent divergence or cycling. 5.1 The Secant Method 235 approximations of a zero the zero is likely to be close to the intersection of the interpolating line with the x-axis. We therefore interpolate the function f in the neighborhood of the zero x* at two points Xi and Xi _I through the linear function If the slope f[Xi, Xi-IJ does not vanish, we expect that the zero (1.1) of PI represents a better approximation for x* than Xi and Xi-I (Figure 5.1). The iteration according to (1.1) is known as the secant method, and several variants of it are known as regula falsi methods. For actual calculations, one 3 2 o( - - - - - - - - j - - f - - - - - . - - - - - - - - - - - - - ----I -1 -2 -3 5.1 The Secant Method Smooth functions are locally almost linear; in a neighborhood of a simple zero they are well approximated by a straight line. Therefore, if we have two o 0.5 1.5 2 2.5 Figure 5.1. A secant step for f 3 3.5 (x) = x 3 - 8x 2 + 20x 4 - 13. 4.5 5 2 Table 5.1. Results of the secant method in [1,2] for x - 2 1 2 3 4 5 Xi signet) Xi 6 7 8 9 10 + 2 1·33333333333333 1.40000000000000 1.41463414634146 + 237 5.1 The Secant Method Univariate Nonlinear Equations 236 1.41421143847487 1.41421356205732 1.41421356237310 1.41421356237310 1.41421356237309 signet) + + for some convergence factor q E (0, 1). Obviously, each Q-linearly convergent sequence is also R-linearly convergent (with c = Ix *- Xio Iq -i O ) , but the converse is not necessarily true. For zero-finding methods, one generally expects a convergence speed that is faster than linear. We say the sequence Xi (i = 1,2,3, ...) 
with limit x* converges Q-superlinearly if IXi+1 - x*1 ::: qi IXi - x"] for i :::: 0, with lim qi 1-+00 = 0, and Q-quadratically if a relation of the form may use one of the equivalent formulas Xi -Xi-I 1 - f(Xi-dlf(Xi) -x------ 1 = Xi-J/(Xi) - xi!(xi-d. f(Xi) - f(Xi-l) , of these, the first two are less prone to rounding error than the last one. (Why?) 5.1.1 Example. We consider f(x) := x 2-2withx* = .j2 = 1.414213562 ... with starting points XI := 1. X2 := 2. Iteration according to (1.1) gives the results recorded in Table 5.1. After eight function evaluations, one has 10 valid digits of the zero. Convergence Speed For the comparison of various methods for determining zeros, it is useful to have more precise notions of the rate of convergence. A sequence Xi (i = 1,2,3, ...) is said to be (at least) linearly convergent to x* if there is a number q E (0, 1) and a number c > 0 such that [x" - Xi I ::: cq' for all i :::: 1; holds. Obviously, each quadratically convergent sequence is superlinearly convergent. To differentiate further between superlinear convergence speeds we introduce later (see Section 5.4) the concept of convergence order. These definitions carryover to vector sequences in jRn if one replaces the absolute values with a norm. Because all norms in jRn are equivalent, the definitions do not depend on the choice of the norm. We shall apply the convergence concepts to arbitrary iterative methods, and call a method locally linearly (superlinearly, quadratically) convergent to x* if, for all initial iterates sufficiently close to x*, the method generates a sequence that converges to x* linearly, superlinearly, or quadratically, respectively. Here the attribute local refers to the fact that these measures of speed only apply to an (often small) neighborhood of x*. In contrast, global convergence refers to convergence from arbitrary starting points. Convergence Speed of the Secant Method The speed that the secant method displayed in Example 5.1.1 is quite typical. Indeed, usually the convergence of the secant method (1.1) is locally Q-superlinear. The condition is that the limit is a simple zero of f. Here, a zero x* of f is called a simple zero if f is continuously differentiable at x* and f'(x') #- O. the best possible q is called the convergence factor of the sequence. Clearly, this implies that x* is the limit. More precisely, one speaks in this situation of R-linear convergence as distinct from Q-linear convergence, characterized by 5.1.2 Theorem. Let the function f be twice continuously differentiable in a neighborhood of the simple zero x" and let c:= ~fl/(x*)If'(x*). Then the sequence defined by (1.1) converges to x* for all starting values XI, x2 (Xl #- X2) the stronger condition sufficiently close to x*, and [x" - Xi+11 ::: qlx* - Xii for all i :::: io :::: 1 Xi+! - x" = Ci(Xi - X*)(Xi_1 - x*) (1.2) 239 5.1 The Secant Method Univariate Nonlinear Equations 238 with lim c, i-s-oo = C. In particular, near simple zeros, the secant method is locally Q-superlinearll' convergent. Proof By the Newton interpolating formula X 0= ((x*) = l(xi) + .f[Xi, xi-d(x* - Xi) + .f[Xi, Xi-I, x*l(x* - Xi)(X* - Xi-d. In a sufficiently small neighborhood of x*, the slope .f[Xi, Xi- d is close to f' (x") i- 0, and we can "solve" the linear part for x*. This gives f(x,) .f[x;,X;-I,X*]( *_ ,)(x*-x. ) X X, ,-I .f[x"x,-d f[Xi,Xi-d x* = X, = 5.1.3 Corollary. 
If the iterates X; of the secant method converge to a simple zero x* with f" (x") i- 0, then for large i, the signs of x, - x* and f (x;) in the secant method are periodic with period 3. where (1.3) So (1.2) holds. . For ~, in a sufficiently small closed ball B[x*; r] around x", the quotient r Ci(~' n := .f[~, s', x*]/.f[~, n remains bounded; suppose that ICi (~, ~') I ~ c. For starting values XI, X2 in the ball B[x*; ro]' where ro = miner, 1/2c), all iterates remain in this ball. Indeed, assuming that x"], IXi-l - IXi - IXHI - . due t'lOn, B y III IXi _x" - I <_ x*1 ~ ro, we find by (1.2) that X* ')2-i . . .ro , 11 Xi I~ 2 so that x·, converges to x*. Therefore, ;r:~ c, With q; := ICi I IXi-l - x* I it follows from (1.2) that IXHI - Proof Let s, = signrx, - x"). Because lim c, = c i- 0 by assumption, there is an index i o such that for i ::: io, the factors c, in 0.2) have the sign of c. Let s = sign(c). Then (1.2) implies SHI = SSiSi_1 and hence S;+2 = SS;+IS; SS5i5i-ISi = 5i-l, giving period 3. 0 As Table 5.1 shows, this periodic behavior is indeed observed locally, until (after convergence) rounding errors spoil the pattern. In spite of the good local behavior of the secant method, global convergence is not guaranteed, as Figure 5.2 illustrates. The method may even break down completely if f[x;, Xi _ d = 0 for some i because then the next iterate is undefined. I X * ~ rD· _f[x*,x*,x*]=~f"(X*)=C. .f[x*, x*] 2 f'(x*) . Figure 5.2. Nonconvergence of the secant method. Now Q-superlinear convergence follows from lim qi = O. This proves the 0 theorem. c;(x* - Xi)(X* - xi-d, Xi+1 - i-1 x"] = q; Ix; - x "]. Multiple Zeros and Zero Clusters It is easy to show that x* is a simple zero of f if and only if f has the representation f(x) = (x -x*)g(x) in a neighborhood of x", in which g is continuous and g(x*) i- O. More generally, x* is called an m-fold zero of f if in a neighborhood of x* the relation f(x) = (x - x*)mg(x), holds for some continuous function g. g(x*) i- 0 ( 1.4) 5.2 Bisection Methods Univariate Nonlinear Equations 240 Table 5.2. For multiple zeros (m> 1), the secant method is only locally linearly 241 The safeguarded root secant method convergent. 5.1.4 Example. Suppose that we want to determine the double zero x* = a of the parabola f (x) := x 2. Iteration with (l.I) for Xl = I and X2 = ~ gives the sequence I 2 3 4 Xi = I I 1 1 3'5'S'13' 5 .... X34= , l.081O-7. 6 7 o 1.20000000000000 1.74522604659078 1.20000000000000 1.05644633886449 1.00526353445128 1.00014620797161 1.00000038426329 1.20000000000000 1.05644633886449 1.00526353445128 J .00014620797161 1.00000038426329 I .00000000002636 0.09760000000000 0.09760000000000 0.00674222757113 0.00005570200768 0.00000004275979 0.00000000000030 o (The denominators n, form the Fibonacci sequence ni+1 = n, + l1i-I.) The convergence is Q-linear with convergence factor q = (j5 - 1)/2 ~ 0.618. If one perturbs a function f(x) with an m-fold zero x* slightly. then the perturbed function f (x) has up to m distinct zeros, a so-calle~ zero cluste:, in the immediate neighborhood of x". (If all these zeros are SImple, there IS an odd number of them if m is odd, and an even number if m is even.) In particular, for m = 2, perturbation produces two closely neighboring ~eros (or none) from a double zero. From the numerical point of view, the behavior of closely neighboring zeros and zero clusters is similar to that of multiple zeros; in particular. 
most zero finding methods converge only slowly towards such zeros, until one finds separating points in between that produce sign changes. If an even number of simple zeros x* = xi, ... , x 2k' well separated from the other zeros, lie in the immediate neighborhood of some point, it is difficult to find a sign change for f because then the product of the linear factors x is positive for all x outside of the hull of the x;*. One must therefore hit.bY chance into the cluster in order to find a sign change. In this case, finding a sIgn change is therefore essentially the same as finding an "average solution," and is therefore almost as difficult as finding a zero itself. In practice, zeros of multiplicity m > 2 appear only very rarely. For this reason, we consider only the case of a double zero. In a neighborhood of the double zero x", f(x) has the sign of g(x) in (104); so one does not recognize such a zero through a sign change. However, x* is a simple zero of the xt However, because x* is not known, we simply choose the negative sign. We call the iteration by means of the formula Xi -Xi-I Xi + 1 = Xi - -:-I-_-y'r7 (7x==i-==I"")=;::/f7(7X='7 ) i I ( 1.5) the root secant method. One can prove (see Exercise 4) that this choice guarantees locally superlinear convergence provided one safeguards the original formula (1.5) by implicitly relabeling the Xi after each iteration such that If(Xi+dl :::: II(Xi)l· 5.1.5 Example. The function f (x) = x 4 - 2x 3 + 2x 2 - 2x + 1 has a double root at x* = 1. Starting with Xj = 0, X2 = 1.2, the safeguarded root secant method produces the results shown in Table 5.2. Note that in the second step, X2 and X3 were interchanged because the new function value was bigger than the so far best function value. Convergence is indeed superlinear; it occurs, as Figure 5.3 shows, from the convex side of vlf(x) I· The final accuracy is not extremely high because multiple roots are highly sensitive to rounding errors (see Section 5.5). Of course, for a general f, it may happen that although initially there is no sign change, a sign change is found during the iteration (1.5). We consider this as an asset because it allows us to continue with a globally convergent bracketing method, as discussed in the following section. function hex) := (x - x*)jlg(x)1 = sign(x - x*)/lf(x)l, 5.2 Bisection Methods and the secant method applied to h gives Xi+l = Xi Xi - Xi - Xi-I l-h(xi_I)/h(xi) The sign of the root is positive when x* Xi-l = Xi - E I] [Xi, Xi- tl and negative otherwise. I ±.Jf(Xi-IJ/f(Xi) A different principle for determining a zero assumes that two points a and b are known at which I has different signs (i.e., for which f (a)f (b) :::: 0). We then speak of a sign change in I and refer to the pair a, b as a bracket. If f is continuous in O{a, b} and has there a sign change, then I has at least one zero x* E D{a, b} (i.e., some zero is bracketed by a and b). % counts number of function evaluations i=2; if f==O, x=a; else while 1, x=(a+b)/2; if abs (b-a) <=deltaO, break; end; i=i +1; f=f(x) ; if f>=O, b=x; end; if f<=O, a=x; end; end; end; % f has a zero xstar with Ixstar-xl<=deltaO/2 2.5 ,---_~---,---_----,-~,---------,----.---------,---,-------r 2 3 1.5 0.5 2 0 -, <, <, -0.5 .c ---'---- 0 0.2 0.4 06 0.8 "-----"' 1.2 243 5.2 Bisection Methods Univariate Nonlinear Equations 242 1.4 1.6 18 2 Figure 5.3. The safeguarded root secant method (Jlf(x)1 against x). 
Ifwe evaluate f at some point x between a and b and look at its sign, one finds a sign change in one of the two intervals resulting from splitting the interval at x. We call this process bisection. Repeated bisection at suitably chosen splitting points produces a sequence of increasingly narrow intervals, and by a good choice of the splitting points we obtain a sequence of intervals that converges to a zero of f. An advantage of such a bisection method is the fact that at each iteration we have (at least in exact arithmetic) an interval in which the zero must lie, and we can use this information to know when to stop. In particular. if we choose as splitting point always the midpoint ofthe current interval, the lengths of the intervals are halved at every step, and we get the following algorithm. (0 and 1 are the MATLAB values forfalse and true; while 1 starts an infinite loop that is broken by the break in the stopping test.) 5.2.1 Algorithm: Midpoint Bisection to Find a Zero of f in f=f(a) ; f2=f (b) ; if f*f2>0, error('no sign change'); end; if f>O, x=a;a=b;b=x; end; % now f(a)<=O<=f(b) [a, b] 5.2.2 Example. For finding the zero x* = .j2 = 1.414213562 ... of f(x) := x 2 - 2 in the interval [1,2], Algorithm 5.2.1 gives the results in Table 5.3. Within 10 iterations we get an increase of accuracy of about 3 decimal places, corresponding to a reduction of the initial interval by a factor 2 10 = 1024. The accuracy attained at X9 is temporarily lost since [x" - XIQ I > [r" - x91. (This may happen with R-linear convergence, but not with Q-linear convergence. Why?) A comparison with Example 5.1.1 shows that the secant method is much more effective than the midpoint bisection method, which would need 33 function evaluations to reach the same accuracy. This is due to the superlinear convergence speed of the secant method. In general, if Xi, a., and b, denote the values of x, a, and b after the ith function evaluation, then the maximal error in Xi satisfies the relation * 1 Ix -xil:s 2Ibi-I-ai-11 1 = 2i_2Ib-al fori> 2. Table 5.3. The bisection method for f(x)=x 2 - 2 in [I , 2] Xi 3 4 5 1-5 6 lA375 lA0625 7 1.25 1.375 Xi 8 9 10 11 12 L421875 1.4140625 1.41796875 1.416015625 .L.±!50390625 (2.1) 5.2 Bisection Methods Univariate Nonlinear Equations 244 One can force convergence and sign change in the secant method, given and X2 such that f(xdf(X2) < 0 if one iterates according to 5.2.3 Proposition. The midpoint bisection method terminates after t all Ib - = 2 + II logz -8- Xi+1 0 = Xi - f(Xi)/f[Xi, if f(Xi-df(Xi+l) < 0, function evaluations. (Here, IX l denotes the smallest integer 2:x.) Proof Termination occurs for the smallest i 2: 2 for which From (2. I), one obtains i = t. Ib i -I - a, -Ii 245 :s 80 . 0 5.2.4 Remarks. (i) Rounding error makes the error estimate (2.1) invalid; the final Xi is a good approximation to x* in the sense that close to Xi, there is a sign change of the computed version of f· (ii) Bisection methods make essential use of the continuity of f. The methods are therefore unsuitable for the determination of zeros of rational functions, which may have poles in the interval of interest. The following proposition is an immediate consequence of (2. I). 5.2.5 Proposition. The midpoint bisection method converges globally and R -linearly, with convergence factor q = !. Secant Bisection Method Midpoint bisection is guaranteed to find a zero (when a sign change is known), but it is slow. The secant method is locally fast but may fail globally. 
To get the best of both worlds, we must combine both methods. Indeed, we can preserve the global convergence of a bisection method if we choose arbitrary bisection points, restricted only by the requirement that the interval widths shrink by at least a fixed factor within a fixed number of iterations. The convergence behavior of the midpoint bisection method may be drastically accelerated if we manage to place the bisection point close to and on alternating sides of the zero. The secant method is locally able to find points very close to the zero, but as Example 5.1.1 shows, the iterates may fall on the same side of the zero twice in a row. XI Xi-I]; Xi = Xi-I; end; This forces f(Xi)f(Xi+l) < 0, but in many cases, the lengths of the resulting O{Xi. xi-d do not approach zero because the Xi converge for large i from one side only, so that the stopping criterion IXi - Xi-JI :s 80 is no longer useful. Moreover, the convergence is in such a case only linear. At the cost of more elaborate programming one can, however, ensure global and locally superlinear convergence. It is a general feature of nonlinear problems that there are no longer straightforward, simple algorithms for their solution, and that sophisticated safeguards are needed to make prototypical algorithms robust. The complexity of the following program gives an impression of the difference between a numerical prototype method that works on some examples and a piece of quality software that is robust and generates appropriate warnings. 5.2.6 Algorithm: Secant Bisection Method with Sign Change Search f=feval(func,x);nf=l; if f==O, ier=O; return; end; f2=feval(func,x2);nf=2; if f2==0, x=x2; ier=O; return; end; %find sign change while f*f2>0, % place best point in position 2 if abs(f)<abs(f2), xl=x2;fl=f2;x2=x;f2=f; else xl=x;fl=f; end; % safeguarded root secant formula x=x2-(x2-xl)/(1-max(sqrt(fl/f2) ,2»; if x==x2 I nf==nfmax, % double zero or no sign change found x=x2;ier=1; return; end; f=feval(func,x);nf=nf+l; if f==O, ier=O; return; end; end; 5.2 Bisection Methods Univariate Nonlinear Equations 246 % we ier=-l; have a sign change f*f2<0 slow=O; while nf<nfmax, % compute new point xx and accuracy if slow==O, % standard secant step if abs(f)<abs(f2), xx=x-(x-x2)*f/(f-f2); acc=abs(xx-x); else xx=x2-(x2-x)*f2/(f2-f); acc=abs(xx-x2); end; else if slow==l, % safeguarded secant extrapolation step i f £1*f2>0, quot=max(f1/f2,2*(x-x1)/(x-x2»; xx=x2-(x2-x1)/(1-quot); acc=abs(xx-x2); else % f1*f>0 quot=max(f1/f,2*(x2-x1)/(x2-x»; xx=x-(x-x1)/(1-quot); acc=abs(xx-x); end; else % safeguarded geometric mean step if x*x2>0, xx=x*sqrt(x2/x); % preserves the sign! else if x==O, xx=0.1*x2; else if x2==0, xx=O.l*x; else xx=O; end; acc=max(abs(xx-[x,x2]»; end; % stopping tests if acc<=O, x=xx;ier=-l; return; end; ff=feval(func,xx);nf=nf+1; if ff==O, x=xx;ier=O; return; end; % compute reduction factor and update bracket 247 if f2*ff<0, redfac=(x2-xx)/(x2-x);x1=x;f1=f;x=xx;f=ff; else redfac=(x-xx)/(x-x2);x1=x2;f1=f2;x2=xx;f2=ff; end; % force two consecutive mean steps if nonlocal (slow=2) if redfac>0.7 I slow==2, slow=slow+1; else slow=O; end; end; The algorithm is supplied with two starting approximations -x and X2. If these do not yet produce a sign change, the safeguarded root secant method is applied in the hope that it either locates a double root or that it finds a sign change. (An additional safeguard was added to avoid huge steps when two function values are nearly equal.) 
If neither happens, the iteration stops after a predetermined number nfmax of function evaluations were used. This is needed because without a sign change, there is no global convergence guarantee. The function of which zero is to be found must be defined by a function routine whose name is passed in the string func. For example, if func=' myfunc' , then the expression f=feval (f unc , x) is interpreted by MATLAB as f=myfunc (x). The algorithm returns in x a zero if ier=O, an approximate zero within the achievable accuracy if ier=-l, and the number with the absolutely smallest function value found in case of failure ier=1. In the last case, it might well be, however, that a good approximation to a double root has been found. The variable slow counts the number of consecutive times the length of the bracket IJ{x, X2} has not been reduced by a factor of at least 1/,/2; if this happened twice in a row, two mean bisection steps are performed in order to speed up the width reduction. However, these mean bisections do not use the arithmetic mean as bisection point (as in midpoint bisection) because for initial intervals rangi ng over many orders of magni tude, the use of the geometric mean is more appropriate. (For intervals containing zero, the geometric mean is not well defined and an ad hoc recipe is used.) If only a single iteration was slow, it is assumed that this is due to a secant step that fell on the wrong side of the zero. Therefore, the most recent point x and the last point with a function value of the same sign as f (x) are used to pe:form a linear extrapolation step. However, because this can lead to very poor POints, even outside the current bracket (cf. Figure 5.2), a safeguard is applied that restricts the step in a sensible way (cf. Exercise 7). 248 Univariate Nonlinear Equations The secant bisection method is a robust and efficient method that (for four times continuously differentiable functions) converges locally superlinearly toward simple and double zeros. The local superlinear and monotone convergence in a neighborhood of double zeros is due to the behavior of the root secant method (cf. Exercise 4); the local superlinear convergence to simple zeros follows because the local sign pattern in the secant method (see Corollary 5.1.3) ensures that locally, the secant bisection method usually reduces to the secant method: In a step with slow=O, a sign change occurs, and because the sign pattern has period 3. it must locally alternate as either -. +.+.-. +. +.-. +. +... or +. -, -. +. -. -. +.-, - .... Thus, locally. two consecutive slow steps are forbidden. Moreover, locally, after a slow step. the safeguard in the extrapolation formula can be shown to be inactive (cf. Exercise 7). Of course, global convergence is guaranteed only when a sign change is found (after all, not all functions have zeros). In this case, the method performs globally at most twice as slow as the midpoint bisection method (why?); but if the function is sufficiently smooth, then much fewer function values are needed once the neighborhood of superlinear convergence is reached. 5.2.7 Example. For the determination of a zero of f(x) := I - lOx + O.Ole X (Figure 5.4) with the initial values Xl := 5. X2 := 20, the secant bisection method gives the results in Table 5.4. A sign change is immediate. Sixteen function evaluations were necessary to attain (and check) the full accuracy of 16 decimal digits. Close to the zero, the sign pattern and the superlinear convergence of the pure secant method is noticeable. 
Practical experiments on the computer show that generally about la-IS function evaluations suffice to finda zero. The number is smaller if x and.e, are not far from a zero, and maybe largerif(as in the example) f(x) or f(X2) is huge. However, because ofthe superlinear convergence. the number offunction evaluations is nearly independent of the accuracy requested, unless the latter is so high that rounding errors produce erratic sign changes. Thus nfmax=20 is a good choice to prevent useless function evaluations in case of nonconvergence. (To find several zeros. one generally applies a technique called deflation; see Section 5.5.) Unfortunately there are also situations in which the sign change is of no use, namely if the continuity of f in l«. b1is not guaranteed. Thus, for example, it can happen that for rational functions, a bisection method finds instead of a zero. a pole with a sign change (and then with slow convergence). In this case, one must 5.2 Bisection Methods 249 Table 5.4. Locating the zero of f(x) := I - lOx + O.Olex with the secant bisection method 1 2 3 4 5 6 7 8 9 IO 11 12 13 14 15 16 Xi f(Xi) 5.0000000000000000 20.0000000000000000 5.000 146910843469 J 5.0002938188092605 10.0002938144929130 7.0713794514850772 8.0179789309415899 8.5868413839981272 8.9020135225054009 9.4351868431760817 9.1647237200757186 9.0986134854694392 9.1053724953793740 9.1056030246107298 9.1056021203890367 9.1056021205058109 -47.51586840R9742350 4851452.9540979033000000 -47.5171194663684280 -47.5183704672215440 121.3264462602397100 -57.9360785853060920 -48.8294182041005270 -31.2618672757999010 -14.5526201672795000 31.8611779028851880 4.8935773887754692 -0.5572882149440375 -0.0183804999496431 0.0000723790792989 -0.0000000093485397 -0.0000000000000568 slow 0 0 1 2 3 0 0 1 2 3 0 0 1 0 0 I 2001--,---___r--,-----_,--~--,------r--,-----_,-~ 150 100 50 °K~-------+---------___J -50 o 2 4 6 8 10 12 Figure 5.4. Secant bisection method for f(x) 14 =I- 16 18 lOx + O.OleX • 20 250 Univariate Nonlinear Equations safeguard the secant method instead by damping the steps taken. The details are similar to that for the damped Newton method discussed in Section 5.7. 5.3 Spectral Bisection Methods/or Eigenvalues A:= ( 5.3 Spectral Bisection Methods for Eigenvalues Many univariate zero-finding problems arise in the context of eigenvalue calculations. Special techniques are available for eigenvalues, and in this section we discuss elementary techniques related to those for general zero-finding problems. Techniques based on similarity transformations are often superior but cannot be treated here. The eigenvalues of a matrix A are defined as the zeros of its characteristic polynomial Ie';..) = det(AI- A). The eigenvalues of a matrix pencil (A, B) (the traditional name for a pair of square matrices) are defined as the zeros of its characteristic polynomial I(A) = det(AB - A). And the zeros of I(A) := det G(A) are the eigenvalues of the nonlinear eigenvalue problem associated with a parameter matrix G(A), that is, a matrix dependent on the scalar parameter A. (This includes the ordinary eigenvalue problem with G(A) = AI - A and the general1inear eigenvalue problem with G(A) = AB - A.) In each case, Ais a possibly complex parameter. To compute eigenvalues, one may apply any zerofinder to the characteristic polynomial. 
If only the real eigenvalues are wanted, we can use, for example, the secant bisection method; if complex eigenvalues are also sought, one would use instead a spiral method discussed in Section 5.6.lt is important to calculate the value det G(A/) at an approximation Al to the eigenvalue Afrom a triangular factorization of G (AI) and not from the coefficients of an explicit characteristic polynomial. Indeed, eigenvalues are often - and for Hermitian matrices alwayswell-conditioned, whereas computing zeros from explicit polynomials often gives rise to severe numerical instability (see Example 5.5.1). One needs about 10-15 function evaluations per eigenvalue. For full n x n-matrices, the cost for one function evaluation is ~ n 3 + 0 (n 2 ) . Hence the total cost to calculate s eigenvalues is 0(m 3 ) operations. For linear eigenvalue problems, all n eigenvalues can be calculated by transformation methods, in 0 (n 3 ) operations using the QR -algorithm for eigenvalues of matrices and the QZ -algorithm for matrix pencils (see, e.g., Golub and van Loan [31], Parlett [79]). Thus these methods are preferred for dense matrices. If, however, G(A) is banded with a narrow band, then these algorithmS require 0(n 2 ) operations whereas the cost for the required factorization- is only O(n). Thus methods based on finding zeros of the determinant (or poles of suitable rational functions; see Section 5.4) are superior when only a few of the eigenvalues are required. 251 5.3.1 Example. The matrix ! H-~J -2 0 I (3.1) 0 has the eigenvalues ±J5 ~ ±2.23606797749979, b0 th 0 f WhiIC h h ave multi-. . . plicity We.compute one of them as a zero of I (A) := det(AI - A) usin the secant bisection method (Algorithm 5.2.6). g 2: With the starting values 3 and 6 we obtain the double eigenvalue X _ 2.2360~797:49979 with full accuracy after nine evaluations of the determ: nant using triangular factorizations. If we use .instead the explicit characteristic polynomial/ (x) = x 4 _ 10x 2 + 25, we obtain after 10 steps i = 2.2360679866043878 . h i ' . deci . . ,Wit on y eight valid eCll~al digits. As shown in Section 5.5, this corresponds to the limit accuracy o (y' c) expected for double zeros. The surprising order. 0 (c) accuracy of the results when the determinant is c~mputed from a matrtx factorization is typical for (nondefective) multi Ie eigenvalues, and ~an be explained as follows: The determinant is calculatedPas atroduct of the dla~onal factors in R (and a sign from the permutations), and c ose to a nondefec~lVe m-fold eigenvalue m of these factors have order O(e); therefore.the Ul.lavoldable error in f(x) is only of order O(e"') for x ~ x* and the error III x* IS O«em)l/m) = O(e). Definite Eigenvalue Problems The parameter matrix G . D c ([ ~ ([II X II • • • .IS called defini te III a real interval [' 1J' !::.,,, C D If for A E [A A] G(')' H .. uousl -. '. - " " IS ermitian and (componentwise) contin..y differentiable and the (componentwise) derivative G'(A) '- ->LG(A) . pOSitIVe definite. .- dJ.. IS 5..3.2 Proposition. The linear parameter matrix G(A) = AB _ A . d ,f; . if and onl ' if B d .. . IS efinue all A are Hermitian and B IS positive definite In this ) Case, all the ei I J' . mat . tgenva ues are real. In particular, all eigenvalues ofa Hermitian fix are real. (In JR.) ProOf TheparametermatrixG(A) =AB-AI"sH iti if d . ' errmtran I an only If Band A H .. are ermitran. The derivative G'(A) - B i ". d fini . 
- s positive e nite If and only if 5.3 Spectral Bisection Methods for Eigenvalues Univariate Nonlinear Equations 252 5.3.4 Theorem. The number peA) ofnonnegative eigenvalues ofan Hermitian B is positive definite. This proves the first assertion. For each eigenpair (A, x), H H 0= x H G(A)X = AXH Bx - x H Ax. By hypothesis, x Bx > 0 and x Ax is real; from this it follows that A = x HAxjx H matrix A is invariant under congruence transformations; that is, if A o, Al E C"?" are Hermitian and if S is a nonsingular n x n matrix with AI = SAOS H, then p(Ad = p(Ao). Bx is also real. Therefore all of the eigenvalues are real. In particular, this applies with B = I to the eigenvalues of a Hermitian matrix A. r« Let oo be a complex number of modulus 1 with oo i= sign a for every eigenvalue a of S. Then S, := t S - (l-t)aol is nonsingularfor all t E [0, 1], and the matrices At := StAoStH are Hermitian for all t E [0, 1] and have the same rank r := rank A o. (Because SoAoS/j = (-ao)Ao( -aD) = Ao and SIAoS[i = SAOS H = A I, the notation is consistent.) If now At has 7f t positive and VI negative eigenvalues, it follows that n, + VI = r is constant. Because the eigenvalues of a matrix depend continuously on the coefficients, the integers n, and V t must be constant themselves. In particular, p(AI) = 7f) + n - r = o 7fo + n - r = p(A o). 0 Quadratic eigenvalue problems are often definite after dividing the parameter matrix by A: = AB+C-A-IA is definite in (-00,0) and (0, (0) if B, C, and A are Hermitian and A, Bare positive definite. All eigenvalues of the parameter matrix are then real. 5.3.3 Proposition. TheparametermatrixoftheformG(A) Proof Because B, C, and A are Hermitian, G(A) is also Hermitian. Furthermore, G'(A) = B + A- 2 A and x H G'(A)X = x" Bx + A-2 xH Ax > 0 if x i= 0 and A E JR\{O}. Consequently, G(A) is definite in (-00,0) and in (0, (0). For We note two important consequences of Sylvester's theorem. E en xn is Hermitian and that a E JR. Then pea I - A) is the number of eigenvalues of A with A :::: a. 5.3.5 Proposition. Suppose that A each eigenpair (A, x), 0= x" G(A)X = AXHBx + x" Cx - A-I x" Ax. Proof The eigenvalues of a I - A are just the numbers a - Ain which A runs through the eigenvalues of A. o H Because the discriminant (x H Cx)2 + 4(x Bx)(x Ax) of the corresponding quadratic equation in A is real and positive, all eigenvalues are real. 0 H 5.3.6 Proposition. Suppose that A, B For definite eigenvalue problems, the bisection method can be improved to give a method for the determination of all real eigenvalues in a given inter.v.al. For the derivation of the method, we need some theory about Hermlt1an matrices. For this purpose we arrange the eigenvalues of an Hermitian mat~ix, which by Proposition 5.3.2are all real, in order from smallest to largest, cou~tlllg multiple eigenvalues with their multiplicity in the characteristic polynomial We denote with p (A) the number of nonnegative eigenvalues of A. If the matrix A depends continuously on a parameter i, then the coefficients of the characteristic polynomial det(Al - A) also vary continuously with t, Because the zeros of a polynomial depend continuously on the coefficients, the eigenvalues A.i (A) depend continuously on t. Fundamental for spectral bisection methods is the inertia theorem Sylvester. We use it in the following form. 253 .f OJ E e n x n are Hermitian and that A - B is positive definite. Then Ai(A) > Ai(B) In particular, if A for i = 1, ... , n. (3.2) has rank r, then p(A) :::: pCB) +n - r. 
(3.3) Proof Because A - B is positive definite there exists a Cholesky factorization A - B = LL H. Suppose now that a := Ai(B) is an m-fold eigenvalue of Band that }..IS the largest index with a = Ai (B). Then i > j - m and p(a 1- B) = i: andforC := L -I(a I -B)L -H we have p(C) = p(LCL H ) = p(a I-B) = j. ~owever, the matrix C has the m-fold eigenvalue zero; therefore p(C - l) :::: J -m. whence p(a I - A) = pea 1- B-LL H ) = p(L(C -l)L H ) = p(C ~~.s j - m. Because ~ > j - m (see previous mention), Ai (A) > a = Ai (B). IS proves the assertion (3.2), and (3.3) follows immediately. 0 Univariate Nonlinear Equations 5.3 Spectral Bisection Methods for Eigenvalues In order to be able to make practical use o~ .Proposit~on 5.3.5, we need: simple method for evaluating p(B) for Hermitian matnces ~. If an LDL factorization of B exists (cf. Proposition 2.2.3 then we obtain p(B) as the H number of nonnegative diagonal elements of D; then p(B) = p(LDL ) = p(D) and the eigenvalues of D are just the diagonal elements. to switch to a superJinear zero finder, once an interval containing only one desired eigenvalue has been found. The spectral bisection method is especially effective if triangular factorizations are cheap (e.g., for Hermitian band matrices with small bandwidth) and only a fraction of the eigenvalues (less than about 1Do/c) are wanted. Problems with the execution of the spectral bisection method arise only if the L D L H factorization without pivoting does not exist or is numerically unstable. Algorithm 2.2.4 simply perturbs the pivot element to compute the factorization; for the computation of p(B), this seems acceptable as it corresponds to a perturbation of A of similar size, and eigenvalues of Hermitian matrices are not sensitive to small perturbations. Alternatively, it is recommended by Parlett [79] to simply repeat the function evaluation with a slightly changed value of the argument. One may also employ in place of the L D L H factorization a so-called Bunch-Parlett factorization with block-diagonal D and symmetric pivoting (see Golub and van Loan [31 D, which is numerically stable. The bisection method demonstrated can also be adopted to solve general definite eigenvalue problems. Almost everything remains the same; however, in place of the inertia theorem of Sylvester, the following theorem is used. 254 5.3.7 Example. We are looking for the smallest eigenvalue A* in the interval . II x II . h di It' [0.5,2.5] of the symmetric tridiagonal matnx T E 1Ft Wit iagona en nes I, 2, ... , II and subdiagonal entries 1. (The tridiagonal case can be ~andled very efficiently without computing the entire factorization; see Ex~rclse II). We initialize the bisection procedure by computing the number of eigenvalues less than 0.5 and 2.5, respectively. We find that two eigenvalues are less than 0.5 and four eigenvalues are less than 2.5. In particular, we see that the third eigenvalue A3 is the smallest one in [0.5,2.5]. Table .5:5 shows the result of further bisection steps and the resulting intervals contammg A3· (The 1.0 + 8. III the fourth trial indicates that for A = I, the LD L H factorization does not exist, and a perturbation must be used.) . A useful property of this spectral bisection method is that here a~ e~genva~ue with a particular number can be found. We also obtain additional information about the positions of the other eigenvalues that can subsequently be used. to determine them. 
For example, we can read off Table 5.5 that [-00,0.5] contams the two eigenvalues Al and A2, [0.5, 1.5] and [1.5,2.5] contain one each (A3 and A4), and [2.5, 00] contains the remaining seven eigenvalues Ako k = 5, .... I L The spectral bisection method is globally convergent, but the converge~ce is . f n.j d ble only linear. Therefore, to reduce the number of evaluations 0 p, It IS a VIsa Table 5.5. \ 2 3 4 5 6 7 8 9 10 11 cn xn 5.3.8 Theorem. Suppose that G : D S; C --r is definite in [b, A] S; D n 1Ft. Then the numbers p... := p(G(b», P := p(G(A» satisfy E :::: p, and exactly p - p... real eigenvalues lie in the interval [b, A]. (Here an eigenvalue A is counted with a multiplicity given by the dimension ofthe null space ofG (A).) Proof Let gAA) := x'' G(A)X for x a aA gAA) = x Example for spectral bisection U/ Pi = p(Ui I - T) 0.5 2.5 1.5 1.0 + 8 0.75 0.875 0.9375 0.96875 0.953\25 0.9609375 0.96484375 2 4 3 3 2 2 2 3 2 2 3 A3 E [0.5,2.5] [0.5. 1.5] [0.5, \.0] [0.75, 1.0] [0.875, 1.0] [0.9375. 1.0] [0.9375,0.96875] [0.953\25.0.96875] [0.9609375,0.96875] [0.9609375,0.964843;5] 255 H E cn\{o}. Then , G (A)X > 0 that is, gx is strictly monotone increasing in [b, - for A E [b, A], A] and for We deduce that G(A2) - G(Al) is positive definite for Proposition 5.3.6, we obtain b:::: Al < A2 :::: A. From 257 5.4 Convergence Order Univariate Nonlinear Equations 256 one similarly obtains the equivalent relation If A2 is an m-fold eigenvalue of the parameter matrix G(A), it follows that e, 2: ifJ Because of the continuous dependence of the eigenvalues Ai (G (A» on A,equality holds in this inequality if no eigenvalue of the parameter matrix G(A) lies between AI and A2. This proves the theorem. 0 +y where 13 := -logJO q, y := -logJO C. (4.2) 1, For example, the midpoint bisection method has q = hence 13 ~ 0.301 > 0.3. After any 10 iterations, 10fJ > 3 further decimal places are ensured. Because (4.1) implies (4.2), we see again that Q-linear convergence implies R-linear convergence. For Q-quadratic convergence, 5.4 Convergence Order To be able to compare the convergence speed of different zero finders, we introduce the concept of convergence order. For a sequence of numbers Xi that converge to x", we define the quantities ei w.e have the equivalent relation ei+l 2: 'le, - y, where y := 10gJO Co. Recursl~ely, one finds e;. - y 2: 2,-k (ek - y). Thus, if we define 13 := (e j - y) /2 j WIth the smallest} such that e j > y, we obtain := -loglO IXi - x"]. e, 2: i fJ Then IXi - x* I = lO- e , , and with s, := Led with 13 > 0, so that the bound for the number of valid digits grows exponentially. In general, we say that a sequence Xl, X2, X3, •.. with limit x* converges with (R- )convergellce order of at least K > I or K = 1 if the numbers ed := max{s E Z Is::: +Y we find that lO- s , -I < Therefore Xi has at least ei IXi - x" I ::: lO- s ei := -logJOlxi - x*1 , correct digits after the decimal point. The growth satisfy pattern of e; characterizes the convergence speed. For Q-linear convergence, IXi+1 - .r"] ::: q\Xi - x"] + fJ, where fJ := - log\o q > ° by taking logarithms, and therefore also ei+k 2: kfJ (4.1) + e.. Thus in k steps, one gains at least LkfJ J decimal places. For R-linear convergence, Ix; - x"] ::: c«, q E for all i 2: 1, e, 2: fJi + Y for all i 2: 1, or with 0< q < 1, we find the equivalent relation ei+1 2: e, +Y e, 2: f3K i lO, l[ respectively, for suitable fJ > 0, Y E lit Thus Q-. 
and R-linearly convergent sequences have convergence order 1, and Q-quadratIcally convergent sequences have convergence order 2. To determine the convergence order of methods like the secant method we need the following. ' 5.4.1 Lemma. Let Po, PI, ... , Ps be nonnegative numbers such that P := Po + PI + ... + Ps - 1 > 0, Univariate Nonlinear Equations 258 5.4 Convergence Order 5.4.2 Corollary. For simple zeros. the secant method has convergence order K = (l + ../5)/2 ~ 1.618 .... and let K be a positive solution of the equation (4.3) Proof By Theorem 5.1.2, the secant iterates satisfy the relation If e, is a sequence diverging to +00 such that IXi+l - x*1 ::: clxi for some number a E IR then there are f3 > 0 and y e, :::: f3K i +y E - x*llxi-l - x"], where i: = sup c.. Taking logarithms gives IR such that for all i :::: 1. with a = -log 1Oc. Lemma 5.4.1 now implies that Proof We choose io :::: s large enough such that e, + a / p > 0 for all i :::: io - s, and put e, :::: f3K i f3 := min{K-i(ei 259 + a l p) I i = +Y with f3 > 0 and y E IR io - s, ... , io}· for the positive solution K = (I + ../5) /2 of the equation Then f3 > 0, and we claim that e, :::: f3K i - al p for all i :::: io - s . By construction, (4.4) holds for i = io - s, ... , io· Assuming that it is valid for i - s, i - s + 1, ... , i (i :::: io) we have ei+l :::: POei + Plei-l + ... + psei-s + a :::: PO(f3Ki - a/ p) + Pl (f3K i- 1 - a/ p) = f3K i- S(POK S = f3K i+l - + ... + Ps) - + ... + Ps(f3K i- S - a/ p) +a (Po + ... + Ps - p)a/p a/ p. R y :=min{-a/p,el-f3K,e2-f3K , ... eio-S-] -pK +Y iO-s-l} For this function, the divided differences a h[xi-s, ... ,x;J = - - - - - - - - (x* - Xi-s)'" (x* - Xi) for all i > 1. o This proves the assertion. .. K, and this solution satisfies 1 <K < l+max{po,pl,""Ps}' were computed in Example 3.1.7. By taking quotients of successive terms, we find eal One can show (see Exercise 14) that (4.3) has exactly one POSItIve r solution It is possible to increase the convergence order of the secant method by using not only information from the current and the previous iterate, but also information from earlier iterates. An interesting class of zero-finding methods due to Opitz [77] (and later rediscovered by Larkin [55]) is based on approximating the poles ofthe function h := 1/f by looking at the special case hex) = - - . x* -x then one obtains e, :::: f3K i The Method of Opitz a So by induction, (4.4) is valid for all i :::: i o - s. If one chooses 2 o (4.4) * X =Xi + h[xi-s, ... ,xi-tl . h[Xi-s, ... , Xi] In the general case, we may consider this formula as providing approximations 260 Univariate Nonlinear Equations to x*. For, given XI, ... , 5.4 Convergence Order Xs+J this leads to the iterative method of Opitz: h[xi-s,.·., Xi-JJ Xi+l := Xi + ---~- h[xi-s, ... , x;] (fori> s). 261 obtain h[xi-s, ... , x;] = (Os) a Ci-s" 'Ci + g;-s i, ' and In practice, X2, ... ,Xs+I are generated for s > I by the corresponding lower order Opitz formulas. C'1+1 -- x* - x i+l =x • -x' - h[xi-s, - .. ', Xi-Il ---=-:: I h[xi_s. . , . , xiJ ' whence 5.4.3 Theorem. For functions of the form hex) = Ps-J(x) x -x* _ in which Ps-l is a polynomial of degree S~ - I, the Opitz formula (Os) gives the exact pole x" in a single step. - o gi-s.iCi - gi-s,i-1 a + gi-s,iC'i-s ... e, Ci -s ... Ci - - - - - - - - ' - _ In the neighborhood ofthe pole x' the quantities Proof Indeed, by polynomial division one obtains a hex) = - x* - X + Ps-2(X) (s)._ gi-s,ici - gi--s,i-I c i . - ----'---..:::..:...---=..:.:--...:..a + gi-s,ici-s ... 
Ci with a = -Ps-l(X*) and the polynomial Ps-2 of degree :SS - 2 does not appear in sufficiently higher order divided differences. So the original argument for hex) = I/(x - x") applies again. 0 remain bounded. As in the proof for the secant method, one deduces the locally superlinear convergence and that lim c(s) = c(s) '= -g(s-I)(x*) 1 • (s - l)!a . i ..... oo 5.4.4 Theorem. Let x' be a simple pole ofh and let h(x)(x' - x) be at least s times continuously differentiable in a neighborhood ofx'. Then the sequence Xi defined by the method ofOpitz converges to x' for all initial values XI, ... ,X,+I sufficiently close to x*, and Taking logarithms of the relation ICi+ll :s: C!Ci-s I ... lcd, C = sup Ic?) I, i::-:1 the convergence order follows from Lemma 5.4.1. where ciS) converges to a constantfor i - f 00. In particular, the method (Os) is superlinearly convergent; the convergence order is the positive solution K s = K of K s + 1 = K S + K s - 1 + ... + 1. o For the determination of a zero of I, one applies the method of Opitz to the function h := Ilf. For s = 1, one obtains again the secant method. For ~ = 2, one can show that the method is equivalent to that obtained by hyperbolic Interpolation, Proof We write a hex) = - x* -x Xi+l + g(x) = Xi - f(Xi) f[Xi, Xi-JJ - f(Xi-l)f[Xi, Xi-I, Xi-2]1 f[Xi-l, Xi-2]' (4.5) by rewriting the divided differences (see Exercise 15). where g is an (s - I)-times continuously differentiable function. With the abbreviations Cj := x' - Xj (j = 1,2, ... ) and gi-s,i := g[Xi-s, ... , x;], we The.convergence order of (q,) is monotone increasing with s (see Table 5.6). In particular, (0 2 ) and (0 3 ) are superior to the secant method (0 ) . The values 1 262 Univariate Nonlinear Equations 5.4 Convergence Order Table 5.6. Convergence order of (Os) for different s Proof Because the matrix pencil (A, B) is nondefective, there is abasis u', ... , u" E C" of C'' consisting of eigenvectors, Au i = AiBui. If we represent B-Ib as a linear combination B- 1b = a.u", where tx, E C, then S Ks 1 2 3 4 5 1.61803 1.83929 1.92756 1.96595 1.98358 S 6 7 8 9 10 263 L «, 1.99196 1.99603 1.99803 1.99902 1.99951 s = 2 and s = 3 are useful in practice; locally, they give a saving of about 20% of the function evaluations. For a robust algorithm, the Opitz methods must be combined with a bisection method, analogously to the secant bisection method. Apart from replacing in Algorithm 5.2.6 the secant steps by steps computed from (4.5) once enough function values are available, and storing and updating the required divided differences, an asymptotic analysis of the local sign change pattern (that now has period s + 2) reveals that to preserve superlinear convergence, one may use mean steps only for slow> s. Eigenvalues as Poles Calculating multiple eigenvalues as in Section 5.3 via the determinant is generally slow because the superiinear convergence of zerofinding methods is lost. This slowdown can be avoided for Hermitian matrices (and more generally for nondefective matrix pencils, i.e., when a basis consisting of eigenvectors exists) if one calculates the eigenvalue as a pole of a suitable function. To find the eigenvalues of the parameter matrix G (A), one may consider the function for suitable a, b E C"\ {OJ. Obviously, each pole of h is an eigenvalue of G(A). 
Typically, each (simple or multiple) eigenvalue of G(A) is a simple pole of h: however, eigenvalues can be lost by bad choices of a and b, and for defective problems (and only for such problems), where the eigenvectors don't span the full space, multiple poles can occur. 5.4.5 Proposition. Suppose that a, b E cn\{o}, that A, B E cnxn, and that the matrix B is nonsingular. If G(A) := AB - A is nondefective, then the function h(A) := aTG(A)-lb has no multiple poles. satisfies the relation G(A)X(A) = L ~(ABui - Au i) A -Ai = LCliBui = B(B-1b) = b. From this, it follows that that is, h has only simple poles. o We illustrate this in the following examples. 5.4.6 Examples. (i) For G(A) := (A - Ao)l, the determinant f(A) := detG(A) = (A - Ao)n has the n-fold zero Ao. However, h(A) := aT G(A)-Ib = aT bft). - AD) has the simple pole AD iff aT b =1= O. (ii) For the defective parameter matrix G(A)._(A-l .- 0 -I ) A- I ' n = 2, f(A) := det G(A) = (A _1)2 has the double zero A= I, and h(A) := aTG(A)-lb = aTbl(A - 1) + a,b 2/(A - 1)2 has A = I as a double pole if b2 =1= O. Thus the transformation of zeros to poles brings no advantage for defective problems. a, To determine the poles of h by the method of Opitz, one function evaluation is necessary in each iteration. For the calculation of h(AI), we solve the system of equations G(AI)X I = band findh(AI) = aT x', Thus, as for the calculation of the determinant, one triangular factorization per function evaluation is necessary, but instead of taking the product of the diagonal terms, we must solve two triangular systems. Univariate Nonlinear Equations 5.5 Error Analysis Table 5.7. Evaluations of the function h to find the double eigenvalue of the denominator is as large as possible (i.e., the correction is as small as possible). For real calculations, therefore, ± = signto»); for complex calculations, ± = signrke o» Req + 1m Wi Imq), where q denotes the square root in the denominator. One can show that the method of Muller has convergence order K ~ 1.84, the solution of K 3 = K Z + K + I for simple zeros of a three times continuously differentiable function; for double zeros it still converges superIinearly with convergence order K ~ 1.23, the solution of 2K 3 = K Z + K + 1. A disadvantage is that the method may produce complex iterates from real starting points. To avoid this, it is advisable to replace the square root of a negative number by zero. (However, for finding complex roots, this may be an asset; d. Section 5.6.) 264 1.10714285714286 0.25000000000000 -0.03875000000000 0.00100257895440 -0.00002874150660 -0.00000004079291 -0.00000000118853 -0.00000000003299 0.00000000000005 -0.00000000000000 0.00000000000000 6.00000000000000 3.00000000000000 2.12500000000000 2.23897058823529 2.23598478789267 2.23606785942768 2.23606797405966 2.23606797740430 2.23606797749993 2.23606797749979 2.23606797749979 1 2 3 4 5 6 7 8 9 10 11 265 5.5 Error Analysis 5.4.7 Example. We repeat the calculation ofthe two double eigenvalues ofthe matrix (3.1) of Example 5.3.1, this time as the poles of h(A) := aT (AI - A)-Ib, where a = b = (1,1,1,1)7. With starting values XI = 6 and Xz = 3, we do one Opitz step (OJ) and further Opitz steps (Oz), and find results as shown in Table 5.7. After II evaluations of the function h, one of the double eigenvalues is found with full accuracy. Muller's Method The method of Muller [65] is based on quadratic interpolation. 
From 0= f(x*) ~ f(Xi) = Limiting Accuracy Suppose that the computed approximation 1(x) for Il(x) - f(x)1 x*1 = m Xi) + f[Xi, Xi-I, X,-Z](X* - Xi)(X* - xi-d f(Xi) + Wi(X* - Xi) + f[Xi, Xi-I, Xi-Z](X* (x) satisfies s8 for all X near a zero x*. Because only I (x) is available in actual computation, the methods considered determine a zero x* of lex); and for such a zero, the true function values only satisfy If(i*)1 s 8. Now suppose that x* is an m-fold zero of f. Then f (x) = (x - x*)m g(x) with g(x*) i- 0, and it follows that Ix* - + f[Xi, xi-d(x* - f Ig(X*) f(X*) 1< - m _0Ig(x*) I = O( ~). (5.1 ) For simple zeros, we have more specifically - Xi)Z, _ - f(X*) -_ f[-* X,X *] '" g (X- *) -_ f(x*) X* - X* "0 where r. X,*) that is, the absolute error satisfies to a first approximation we deduce that x* _ X* (4.6) I ~< I - [) If'(x*)] From this, we draw some qualitative conclusions: To ensure that Xi+ 1 is the zero closest to Xi of the parabola interpolating in Xi, Xi-I. Xi-2, the sign ofthe square root must be chosen such thatthe magnitude (i) For very smalllf'(x*)I, that is, for very flat functions, the absolute error in x* is strongly magnified. In this case, x* is ill-conditioned. Univariate Nonlinear Equations 5.5 Error Analysis (ii) In particular, the absolute error in i* is magnified for multiple zeros; because of (5.1), one finds that the number of correct places is only about the mth part of the mantissa length that is being used. (iii) A double zero cannot be numerically distinguished from two simple zeros that lie only about 0 (-05) apart. given in a less explicit form. In particular, this applies to the characteristic polynomial of a matrix whose zeros are the eigenvalues, where, as we have seen in Example 5.3.1, much better accuracy is achievable by using matrix factorizations. 266 These remarks are valid independently of the method used. Deflation To compute several (or all) of the zeros of a function f, the standard technique is called deflation and proceeds in analogy to polynomial division, as follows. If one already knows some zeros x{, ... , x;(s ::': I) with corresponding multiplicities m I, ... , m.: then one may find other zeros by setting 5.5.1 Example. Consider the polynomial f(x) := (x - l)(x - 2)··· (x - 20) of degree 20 in the standard from f(x) = x 20 - 210X 19 + ... + 20! The coefficient ofx 15 is (checkwithMATLAB's function poly!) -1672280820. Therefore, a relative error of 8 in this coefficient produces an absolute perturbation of = \/(x) - f(x)1 1672280820x 15 8 in the function value at x. For the derivative at a zero x* = 1,2, ... , 20, If'(x*)1 = n [x" - kl = (x* - 1)!(20 - x")! k=1:20 ki'x* so that the limiting accuracy for the calculation of the zero x* 1672280820· 16 IS! 4! 15 8 ~ 6.14. 10 13 8 267 = 16 is about ~ 0.014 for double precision (machine precision 8 ~ 2.22· 10- 16 ) . Thus about two digits of relative accuracy can be expected for this zero. MATLAB's roots function for computing all zeros of a polynomial produces for this polynomial the approximation 16.00304396718421 in place of 16, with a relative error of 0.00019; and if we multiply the coefficient of x 15 by 1 + 8, roots finds the approximation 15.96922528457543 with a relative error of 0.0019. The slightly higher accuracy obtained is due to our worst-case analysis. 
This example is quite typical for the sensitivity of zeros of high degree polynomials in standard form, and explains why, for computing zeros, o~e should avoid the transformation to the standard form when a polynomial IS f(x) g (x) : = -=,-----::.-------(x -xjr1) n (5.2) j=l:s and seeking solutions of g(x) = O. By definition of a multiple zero, g(x) converges to a number i- 0 as x ~ xj. Therefore, the solutions x{, ... , x; are "divided out" in this way, and one cannot converge to the same zero again. Although the numerically calculated value of x* is in general subject to error, the so-called implicit deflation, which evaluates g(x) directly by formula (5.2), is harmless from a stability point of view because even when the xj are inaccurate, the other zeros of f are still exactly zeros of g. The only possible problems arise in a tiny neighborhood about the true and the calculated xj, and because there is no further zero, this is usually irrelevant. Note that when function evaluation is expensive, old function values not very close to a deflated zero can be used again to check for sign changes of the deflated function! Warning. For polynomials in the power form, it seems natural to perform the division explicitly by each linear factor x - x* using the Homer scheme, because this again produces a polynomial g of lower degree instead of the expression f(x)/(x - x"). Unfortunately, this explicit deflation may lead to completely incorrect results even for exactly calculated zeros. The reason is that for polynomials with zeros that are stable under small relative changes in the coefficients but very sensitive to small absolute changes, already the deflation of the absolutely largest zero ruins the quality of the other zeros. We show this only by an example; for a more complete analysis of errors in deflation, see Wilkinson [98]. Univariate Nonlinear Equations 5.5 Error Analysis 5.5.2 Example. Unlike in the previous example, small relative errors in the coefficients of the polynomial Once the existence test applies, one can refine the box containing the solution as follows. By the mean value theorem, 268 r I(x) := (x - l)(x - = x 20 _ (2 _ r 1 19 ) .•. )x 19 (x - r 19 + ... + r ) I(x) cause only small perturbations in the zeros; thus all zeros are well conditioned. Indeed, the MATLAB polynomial root finder roots gives the zeros with a maximal absolute error of about 6.10- 15 . However, explicit deflation of the first zero x* = 1 gives a polynomial g(x) with coefficients differing from those of g(x) := (x - 2- 1 ) ... (x - 2- 19 ) by 0 (s), The tiny constantterm is completely altered, and the computed zeros of g(x) are found to be 0.49999999999390 instead of 0.5, 0.25000106540985 instead of 0.25, and the next largest roots are already complex, 0.13465904883151 ± 0.01783932277158i. 5.5.3 Remark. For finding all (real or complex) zeros of polynomials, the fastest and most reliable method is perhaps that of Jenkins and Traub [47,48J; it finds all zeros of a polynomial of degree n in 0 (n 2 ) operations by reformulating it as an eigenvalue problem. The MATLAB 5 version of roots also proceeds that way, but it takes no advantage of the sparsity in the resulting matrix, and hence requires 0 (n 3 ) operations. The Interval Newton Method For rigorous bounds on the zeros of a function, interval arithmetic may be used. We suppose that the function I is given by an arithmetical expression, that I is twice continuously differentiable in the interval X E IT lR, and that 0 rt f' (x). 
The latter (holds in reasonably large neighborhoods of simple zeros, and) implies that I is monotone in x and has there at most one zero x*. Rigorous existence and nonexistence tests may be based on sign changes: => > 0 => :s 0 => I(x) - I(x*) = I/(~)(X - x*) 190 for ~ o rt I(x) = 269 there is no zero in x. there is no zero in x. there is a unique zero in x. 01 f'(x), I (!) I (x) 01 f'(x), I (!) I (x) Note that I must be evaluated at the thin intervals! = [:!.,:!.] and x = [x, x] in order that rounding errors in the evaluation of I are correctly accounted for. These tests may be used to find all real zeros in an interval X by splitting the interval recursively until one of the conditions mentioned is satisfied. After sufficiently many subdivisions, the only unverified parts are near multiple zeros. E O{x, x*}. If x* E x then * _ I(x) _ x = x - -- with x E I(x) Ex - j'(~) --. (5.3) j'(x) x. Therefore, the intersection xn (x _ I(X») j'(x) x := Xi also contains the zero. Iteration with the choice so-called interval Newton method XI Xi+l = x, = Xi n (Xi - ;/~~i~)' = mid Xi leads to the = 1,2,3, ... i (5.4) Xi is deliberately written in boldface to emphasize that again, in order to correctly account for rounding errors, Xi must be a thin interval containing an approximation to the midpoint of Xi. A useful stopping criterion is Xi+l = Xi; because Xi+ 1 S; Xi and finite precision arithmetic, this holds after finitely many steps. Because of (5.3), a zero x* cannot be "lost" by the iteration (5.4); that is, x* E X implies that x* E Xi for all i. However, it may happen that the iteration (5.4) stops with an empty intersection = 0; then, of course, there was no zero in x. As shown in Figure 5.5, the radius rad Xi is at least halved in each iteration due to the choice x = xi. Hence, if the iteration does not stop, then rad Xi converges to zero, and the sequence Xi converges (in exact arithmetic) to a thin interval X oo = x oo . Taking the limit in the iteration formula, we find Xi I(x ) x 00 =x00 - -oo- , j'(x oo) so that I(x oo ) = O. Therefore, the limit X oo = x* is the unique zero of I in x. 271 5.5 Error Analysis Univariate Nonlinear Equations 270 5.5.5 Example. We determine the zero x* = .j2 = 1.41421356 ... of f(x) := 1 - 3/(x 2 + 1) with the interval Newton method. We have f' (x) with Xl := [1,3], the first iteration gives f(Id = f([2, 2]) I Figure 5.5. The interval Newton method. f 5.5.4 Theorem. If 0 ¢ f'(x), then: X2 (i) The iteration (5.4) stops after finitely many steps with empty only Xi = 0 if and if f has no zero in x. (ii) The function f has a (unique) zero x* in x if and only if Iimhoo Xi = x*. In this case, rad Xi + I :s 2 rad Xi , (5.5) In particular, the radii converge quadratically to zero. Proof Only the quadratic convergence (5.5) remains to be proved. By the mean value theorem, f(I i ) = f'(~)(Ii - x*) with ~ E O{Ii • x"]. Therefore = v Xi - (V Xi - X = [1,3] + 1)2. Starting 3+ 1 = 1 - [5,5] 3 = [25' 52] ' 1 - [2,2]2 6· [1,3] [2,10]2 = n ([2,2] - [6,18] [4,100] [~' = [3 9] 50' 2 ' n/[:O'~]) 14 86] = [1,3] n [ -3' 45 = [1, 1.91111 ...]. ") f'(~) Because interval arithmetic is generally slower than real arithmetic, the splitting process mentioned previously can be speeded up by first locating as many approximate zeros as one can find by the secant bisection method, say. 
Then one picks narrow intervals containing the approximate zeros, but wide enough that the sign is detectable without ambiguity by the interval evaluation f at each end point; if the interval evaluation at some end point contains zero, the sign is undetermined, and one must widen the interval adaptively by moving the Table 5.8. Interval Newton methodfor f(x) = 1 - 3/(x 2 + 1) in Xl = [1,3] r(Xi) , and (5.6) I 2 3 4 5 6 7 Now IIi - x* I :s rad Xi, f' is bounded on x, and by Theorem 1.5.6(iii), Insertion into (5.6) gives (5.5). = 6x / (x 2 Further iteration (with eight-digit precision) gives the results in Table 5.8, with termination because X7 = X6. An optimal inclusion of .j2 with respect to a machine precision of eight decimal places is attained. 1 and Xi+l (x.) = = o -. x· Xi 1.0000000 1.0000000 1-3183203 lA08559 I 1.4142104 1.4142135 1.4142135 3.0000000 1-9111112 lA422849 1.4194270 1.4142167 1.4142136 1.4142136 Univariate Nonlinear Equations 5.6 Complex Zeros corresponding end point. The verification procedure can then be restricted to the complements of the already verified intervals. If the components of G(A) are given by arithmetical expressions and s is a small interval containing S, then the interval evaluation Go(s) contains the matrix Go(s) ~ I. One can therefore expect that Go(s) is diagonally dominant, or is at least an H-matrix; this can easily be checked in each case. Then for t E S, GoU) E Go(s) is nonsingular and f is continuous in s. Therefore, any sign change in s encloses a zero of f and hence an eigenvalue of G(A). Note that to account for rounding errors, the evaluation of f (A) for finding its sign must be done with a thin interval lA, A] in place of A. Once a sign change is verified, we may reduce the interval s to that defined by the bracket obtained, and obtain the corresponding eigenvector by solving the system of linear interval equations 272 Error Bounds for Simple Eigenvalues and Associated Eigenvectors Due to the special structure of eigenvalue problems, it is frequently possible to improve on the error bounds for general functions by using more linear algebra. Here, we consider only the case of simple real eigenvalues of a parameter matrix G(A). If A* is a simple eigenvalue, then we can find a matrix C and vectors a, b such that the parameter matrix Go (A) := CG(A) =1= 0 + ba" Bx = b is nonsingular in a neighborhood of A*. The following result permits the treatment of eigenvalue problems as the problem of determining a zero of a continuous function. 5,5.6 Proposition. Let GO(A) := CG(A) + ba'! where C is nonsingular. IfA* is a zero of with BE B = 273 Go(s) by Krawczyk's method with the matrix (DiaglD- 1 as preconditioner. If during the calculation no sign change is found in s, or if Go(s) is not diagonally dominant or is not at least an H-matrix, this is an indication that s was a poor eigenvalue approximation, s was chosen too narrow, or there is a multiple eigenvalue (or several close eigenvalues) near S. 5.6 Complex Zeros and x := GO(A*)-lb, then A* is an eigenvalue OfG(A), and x is an associated eigenvector, that is, G(A *)x = O. Proof If f (A*) = 0 and x = GO(A *)-1 b, then a H x = f (A*)+ I = 1. Therefore, we have CG(A*)x = GO(A *)x - ba'' x = b - b = 0; because C is nonsingular, it follows that G(A *)x = O. 0 To apply this, we first calculate an approximation s to the unknown (simple) eigenvalue A*. 
Then, to find suitable values for a, band C, we modify a normalized triangular factorization LR of G(s) by replacing the diagonal element Ri, of R of least modulus with IIRlloo. With the upper triangular matrix R ' so obtained, we have R = R' - ye(i) (e(i))H where y = R;i - Rii, so that In this section, we consider the problem of finding complex zeros x* E D of a function f that is analytic in an open and bounded set D S; <C and continuous in its closure D, In contrast to the situation for D C R the number of zeros (counting their multiplicity) is no longer affected by small perturbations in f, which makes the determination of multiple zeros a much better behaved problem. In particular, there are simple, globally convergent algorithms based on a modified form of damping. The facts underlying such an algorithm are given in the following theorem. 5.6.1 Theorem. Let Xo ED and If(x)1 > If(xo)! for all x E oD, Then f has a zero in D. Moreover, if f(xo) =1= 0, then every neighborhood ofxo contains a point XI with. If(xj)1 < If(xo)l. Proof Without loss of generality, we may assume that D is connected. For all a with Xo + a in a suitable neighborhood of Xo, the power series Thus if we put f(xo + a) then Go(s) ~ I. is convergent. = , f(xo) + f (xo)a + ... + (xo) n , a + ... n. f(n) If all derivatives are zero, then f(x) = f(xo) in a neighborhood of xo, so f is a constant in D. This remains valid in b, contradicting the assumption. Therefore, there is a smallest n with f(n) (xo) =/= 0, and we have (6.1) with gn(a) = f(n) (xo) n! +a r-: (n 1) (xo) + I)! + .... In particular, gn (0) =/= O. Taking the square of the absolute value in (6.1) gives If(xo + a)1 2 = If(xo)1 2 + 2Re angn(a)f(xo) + laI 2nlgn(a)12 = If(xo)1 2 + 2Re angn(O)f(xo) + O(laI 2n) < If(xo)1 2 if a is small enough and Spiral Search Using (6.3), we could compute a suitable a if we knew the nth derivative of f. However, the computation of even the first derivative is usually unnecessarily expensive, and we instead look for a method that works without derivatives. The key is the observation that when a =/= 0 makes a full revolution around zero, the left side of (6.2) alternates in sign in 2n adjacent sectors of angle n / n. Therefore, if we use a trial correction a and are in the wrong sector, we may correct for it by decreasing a along a spiral toward zero. A natural choice, originally proposed by Bauhuber [7], is to try the reduced corrections q':« (k = 0, I, ...), where q is a complex number of absolute value Iq I < I. The angle of q in polar coordinates must be chosen such that repeated rotation around this angle guarantees that from arbitrary starting points, we land soon in a sector with the correct sign, at least in the most frequent case that n is small. The condition (6.2) that tells us whether we are in a good sector reduces to Reqkn y < 0 with a constant y that depends on the problem. This is equivalent to kn arg(q) Re an s- (0) f (xo) < O. To satisfy (6.2), we choose a small E: (6.2) > 0 and find that for the choice (6.3) valid unless f(xo) = O. Because a; -+ 0 if E: -+ 0, we have If(xo + aE)1 < If (xo) I for sufficiently small E:. This proves the second part. Now If I attains its minimum on the compact set o, that is, there is some x* E jj such that for all x E b, (6.4) and the boundary condition implies that actually x* E D. If f(x*) =/= 0, we may use the previous argument with x* in place of xo, and find a contradiction to (6.4). Therefore, f(x*) = O. 
0 + cp E ]2Irr, (21 + I)rr[ for some I E Z, (6.5) where cp = arg(y) - n /2. If we allow for n = 1,2, 3, 4 at most 2, 2, 3, resp. 5 consecutive choices in a bad sector, independent of the choice of tp; this restricts the angle to a narrow range, arg(q) E ±]¥arr, ~rr[ (cf. Exercise 22). The simplest complex number with an angle in this range is 6i - I; therefore, a value q If(x*)1 ::: If(x)1 275 5.6 Complex Zeros Univariate Nonlinear Equations 274 = A(6i - I) with A E [0.05,0.15], say, may be used. This rotates a in each trial by a good angle and shrinks it by a factor of about 0.3041 - 0.9124, depending on the choice of A. Figure 5.6 displays a particular instance of the rotation and shrinking pattern for A = 0.13. Suitable starting corrections a can be found by a secant step (1.1), a hyperbolic interpolation step (4.5), or a Muller step (4.6). The Muller step is most useful when there are very close zeros, but a complex square root must be computed. Rounding errors may cause trouble when f is extremely flat near some point, which typically happens when the above n gets large. (It is easy to construct artificial examples for this.) If the initial correction is within the flat part, the new function value may agree with the old one within their accuracy, and the spiral search never leaves the flat region. The remedy is to expand the 276 Univariate Nonlinear Equations 5.6 Complex Zeros 3r-~---,--------,-------,.--~-r-,------,----~-~---, o + + x = Xo; f = If(xo)l; fok = e * f; q = 0.13 (6i - I); qll = I -"JE; qgl while f > j;'b 2 * + + + 277 5.6.2 Algorithm: Spiral Search for Complex Zeros = 1+"JE; ~ompute a nonzero correction a to x (e.g. a secant or Muller step); OI-------------~"*"'=-,,_,__----------___1 -1 a + -3 -3 + + (l + IxI + IXoldl)P; a * f, % flat step = q * a; jiat = 0; end; -2 -1 0 2 3 Figure 5.6. Spiral search for a good sign in case n = 3. The spiraling factor is q = O.13(6i - I). The circles mark the initial a and corresponding rotated and shrinked values. The third trial value gives the required sign; by design, this is the worst possible case for n = 3. if k end; end; == kmax, return; end; The algorithm stops when f a sufficient decrease. size of the correction when the new function value is too close initially to the old one. Another safeguard is needed to avoid initial step sizes that are too large. One possiblility is to enforce For p = 2, this still allows "local" quadratic convergence to infinite zeros (of rational functions, say). If this is undesirable, one may take p = 1. We formulate the resulting spiral search, using a particular choice of constants that specify the conditions when to accept a new point, when to spiral, when to expand and when to stop. e is the machine precision; kmax and pare control parameters that may be set to fixed values. Xo and an initial a must be provided as input. (In the absence of other information, one may start, e.g., with Xo laI :s fnew = [f(x - a)l; if fnew < ql1 * f, % good step x = x - a; f = fnew; break; end; ifjiat & fnew < qgl a = lO*a; else % long step -2 + If necessary, rescale to enforce jiat= I; for k = I : kmax , = 0, a = 1.) :s fok or when kmax changes of a did not result in Rigorous Error Bounds For constructing rigorous error bounds for complex zeros, we use the following tool from complex analysis. 5.6.3 Theorem. Let f and g be analytic in the interior ofa disk D and nonzero and continuous on the boundary D. 
If a fez) Re g(z) > 0 for all ZE aD (6.6) a~d each root is counted according to its multiplicity, then f and g have precisely the same number a/zeros in D. Proof This is a refinement of Rouche'» theorem, which assumes the stronger 5.6 Complex Zeros Univariate Nonlinear Equations 278 For polynomials, one may use in place of the kth derivative the divided difference used in the proof, computable by means of the first k steps of the complete Homer scheme. With this change, 8 can usually be chosen somewhat smaller. The theorem may be used as an a posteriori test for existence, starting with an approximate root Z computed by standard numerical methods. It provides a rigorous existence test, multiplicity count, and enclosure for the root or the root cluster near Z. The successful application requires that we "guess" the right k and a suitable 8. Because 8 should be kept small, a reasonable procedure is the following: Let 15 k be the positive real root of the polynomial condition If(z)1 > If(z) - g(z)1 for all z aD. E Rouche's theorem in Henrici [43, Theorem 4.10b] uses However, th e proo f 0 f . 0 in fact only the weaker assumption (6.6). A simple consequence is the following result (slightly sharper than Neumaier [69]). 5.6.4 Theorem. Let f be analytic in the disk D[z; r] -s: {z 279 Eel [z - zl ::: r}. and let 0 < 8 < r. If R f(k)(Z) e f(k) (.-) > z then "" Z:: I=O:k-l k! IRe f(l)(z) \ 81- k for all z E D[z; 8], l! j<k)(z) f has precisely k roots in (6.7) then (6.7) forces 8 :;: 15 k. If 15 k is small, then f(k)(Z) = f(k)(Z) + 0(8) so that it is sufficient to take 8 only slightly bigger than 15 k • In practice, it is usually sufficient to calculate 15 k to a relative precision of 10% only and choose 8 = 28k ; because k: is unknown, one tries k = 1, 2, ... , until one succeeds or a limit on k is attained. If there is a k-fold zero z* and the remaining roots are far away, then Taylor's formula gives D[z; 8], where each root is counted according to its multiplicity. Proof We apply Theorem 5.6.3 with f(k)(Z) _k g(z) = -k-!-(z - z) to D = D[z; 8]. Hermite interpolation at Z gives fez) = PH (z) Thus + f[z, ... , z, z](z - z-)k and with Pi-: (z) =L f(l) (z) - I -l-'-(z - z), l<k f[z, ... , Z, Z = Re rik ) e-) ~ 15 k ~ = -'-k-!- Iz - z*Ij(~ - so that we get with our choice 8 . for some ~ in D, For Z E aD, we have Iz f (z) Re - () g z ] zl = 8, hence (n "" k! f(k) f(k)(;;) ~ + c: l' l<k' an overestimation of roughly 3k. 5.6.5 Example. Consider f (x) = x 4 - 2x 3 - x 2 + 2x + I that has a double root x* = ~ (l + -)5) ~ 1.618033989. To verify the double root, we must choose f(l) (z) (z _ Z)l-k Re f(k)(Z) k = 2. In this case, a suitable guess for the radius 8 is f(k)(n "" k! \ ~\ 81- k > 0 :;: Re f(k) (z) l! Re f(k) (z) f:t by our hypothesis (6.7), and Theorem 5.6,3 applies. = 28k 1) < 1.5klz - z"], Where o p = 21 Re I"(z) f'(z) I , - q - 21 Re I"(z) fez) I . Univariate Nonlinear Equations 280 Table 5.9. 5.7 Methods Using Derivative Information z Approximations to the root and corresponding enclosures Proof For f(x) ;= f'(x) Ie x ) 8 (upward rounded) 8/lx* - zl 6.77 . 10- 1 8.91 . 10- 2 3.92.10- 2 1.65. 10- 4 5.44.10- 8 948.10- 3 1.49.10- 1 3.69.10- 1 Enclosure not guaranteed 4.94 4.88 4.83 4.83 4.82 4.66 Enclosure not guaranteed 1.5 1.6 1.61 1.618 1.618034 1.62 1.65 1.7 281 ao(x - ~I)'" (x - ~n), 1 d = "dlogl/(x)1 == x If x* is the zero closest to 1 --+"'+--. x - ~I X - ~n (6.9) x, then Ix* -xl:s I~J -xl for j == 1, ... .n, Therefore, f' (X) ! < ! 
I(x) n - -Ix---x-*I and The condition guaranteeing two roots in the disk inf{Re Ix-x*l<n/I(X)!. - I F(z) z 1"(2) Dlz; oJ is E D[Z;8 J} > Thi s ~roves (6.8). A slight generalization of the argument gives the more general assertion (see Exercise 21). 0 (p+q!o)!o, and this can be verified by means of complex interval arithmetic. For various approximations to the root, we find the enclosures Ix* - 21 shown in Table 5.9. Of course, the test does not guarantee the existence of a double root, but only that of two (possibly coinciding) roots x* with Ix* - 21 :s 0. z f'(x) :s Much more information about complex zeros can be found in Henrici [43]. ° 5.7 Methods Using Derivative Information Newton's Method As Error Bounds for Polynomial Zeros The following result, valid for arbitrary x, can be used to verify the accuracy of approximations i to simple, real, or complex zeros of polynomials. If rigorous results are required, the error term must be evaluated in interval arithmetic, using as argument the thin interval [i, i]. 5.6.6 Theorem. If f is a polynomial of degree nand f' (.r) t= 0, x E C, ther there is at least one zero of f in each disk in the complex plane that contains j and x - n f,i~)). In particular, there is a zero x* with __ * < Ix The bound is best possible as x I- lex) = I f(x) I n f'(x) . (x - x")" shows. t . I th . I - j -+ X,, e s~cant s ope f[Xi, Xi-I] approaches the slope f'(Xi) of the tangent to f at the pomt Xi. In this limiting case, formula (1.1) yields the formula for Newton's method Xi+1 . I(Xi) ,==Xi - - f'(Xi)' i-I 2 3 - " , .... (7.1) 5.7.1 Example. To compare with the bisection method and with the secant :ethod, ~he zero x* = ../i = 1.414215362 ... of the function I(x) == x 2 _ 2 in iproxlmated by Newlon's method. Starting with XI :== I, we get the results able 5.10. After five function evaluations and five derivative evaluations one has 10 valid digits of x*. ., (6.8 ThIn th~ e~ampl.e, th~ Newton sequence converges faster than the secant method. at this IS typical IS a consequence of the local Q_q d ti the N ewton met h a d . . ua ra IC convergence of 282 Table 5.11. A comparison ofwork versus order ofaccuracy Table 5.10. Results of Newton's method for x 2 - 2 1 1.5 1.416666667 1.414215686 1.4.l4213562 1.414213562 1 2 3 4 5 6 Xi+1 - X * = ci, ( Xi. - Xl X (7 . 2) *)2 lim c'I = c. . '--+00 In particular, Newton's method is Q-quadratieally convergent to a simple zero, and its convergence order is 2. Proof Formula (7.2) is proved just as the corresponding asserti~n for.the secant method. Convergence for initial values sufficiently close to x ag~m follows from this. With Co := sup Ie; I < 00, one obtains from (7.2) the relation X* I ::: Co IXi - Newton method 1 2 3 8 8 82 8 x *\2 from which the Q-quadratic convergence is apparent. By the results in Sectio~ 5.4, the convergence order is 2. Comparison with the Secant Method By comparing Theorems 5.7.2 and 5.1.2, we see that locally, Newton's metho,d converges in fewer iterations than the secant method. However, each stephI: more expensive. If the cost for a derivative evaluation is about the same as t a . one Newton step for a function evaluation, it is more appropnate to compare with two secant steps. By Theorem 5.1.2, we have for the latter x" = [Ci+ICi(Xi-l - ~ 8i Secant method 8i+1 8i 8i-l ~ 8 8 84 8 8 8 8 16 2 2 3 85 8 8 8 8 8 13 21 34 55 and because the term in square brackets tends to zero as i -+ 00, two secant steps are locally faster than one Newton step. 
Therefore, the secant method is more efficient when function values and derivatives are equally costly. We illustrate the behaviour in Table 5.11, where the asymptotic order of accuracy after n = 1,2, ... function evaluations is displayed for both Newton's method and the secant method. In some cases, derivatives are much cheaper than function values when computed together with the latter; in this case, Newton's method may be faster. We use the convergence orders 2 of Newton's method and (l + ./5)/2 ~ 1.618 of the secant method to compare the asymptotic costs in this case. Let e and c' denote the cost of calculating f (Xi) and f' (Xi), respectively. As soon as f is not as simple as in our demonstration examples, the cost for the other operations is negligible, so that the cost of calculating Xi+1 (i :::: 1) is essentially c + c' for Newton's method and e for the secant method. The cost of s Newton steps, namely s(c + c'), is equivalent to that of s(l + c' /c), function evaluations, and the number of correct digits multiplies by a factor 2s • With the same cost, sO + c' /e) secant steps may be performed, giving a gain in the number of accurate digits by a factor 1.618,1(1 +c' (c). Therefore, the secant method is locally more efficient than the Newton method if 1.618 s ( l + c' / c ) > 2 s , that is, if c' Xi+2 - ~ 8i+l sufficiently close to x and with \Xi+l - Function evaluation 2 3 2 3 2 3 5.7.2 Theorem. Let the function f be twice continuously differentiable in a . * d I '- 1 f"( *)/f'(\*) Then the ".* neighborhood of the simple zero x ,an et C.- 2:'~ sequence defined by (7.1) converges to x* for all 283 5.7 Methods Using Derivative Information Univariate Nonlinear Equations X*)](Xi - X*)2, C > 5.7 Methods Using Derivative Information Univariate Nonlinear Equations 284 Therefore, locally, Newton's method is preferable to the secant method only when the cost of calculating the derivative is at most 44% of the cost of a 285 Table 5.13. Results ofNewton's methodfor f(x) = 1 - lOx with Xl = 20 + O.Olex function evaluation. Xi Global Behavior of Newton's Method As for the secant method, the global behavior of Newton's method must be assessed independent of the local convergence speed. It can be shown that Newton's method converges for all starting points in some dense subset oflR if f is a polynomial of degree n with real zeros si, .... ';n only. The argument is essentially that if not, it sooner or la~er ~enerates s~m~ Xi > x* := max(';l •... , ';n) (or Xi < min(.;l •. ··' ';n), which is treated similarly). 1 2 3 4 5 6 7 8 9 Xi 20.0000000000000000 19.0000389558837600 18.0001392428132970 17.0003965996736000 16.0010546324046890 15.0027299459509860 14.0069725232866720 13.0176392772467740 12.0441649793488760 10 11 12 13 14 15 16 17 18 11.1088835456740740 10.2610830282869120 9.5930334887471229 9.2146744950274755 9.1119744961101219 9.1056248937564668 9.1056021207975100 9.1056021205058109 9.1056021205058109 Then. !,(xd 1 with a convergence factor of 1- ~ = sets in.) n 0< - - - < - - < - - - , Xi - x* - f(xi) - Xi - x* so that Xi+l = Xi - f(Xi)/!'(Xi) satisfies the relation X,' - X * > Xi - Later in this section we discuss a modified Newton method that overcomes this slow convergence, at least for polynomials. Slowness of a different kind is observed in the next example. X· -x* Xi+l 2: -I -n - , that is, o :s Xi+l - x" :s (1 - ~ ) 5.7.4 Example. The function (Xi - x"). Thus Newton's method converges monotonically for all starting points out~ide the hull of the set of zeros. 
with global convergence factor of at least 1 - n' For large n, a sequence of n Newton steps therefore decreases [x - ~* I by at ctor (1 - l)n < 1. ~ 0.37. For f(x) = (x - x")", where this bound t f 1easaa n e . initiall is asymptotically achieved, convergence is very slow; the same holds nnua Y for general polynomials if the starting point is so far away from all zeros that these "look like a single cluster." 2 5.7.3 Example. For Xl = 100, Newton's method applied to f(x) = .x - 2 yields the results in Table 5.12. The sequence initially converges only 1mearly, Table 5.12. Results o.f Newton's method for x 2 Xi 100 !. (After iteration 7, quadratic convergence 50.01 3 25.02 4 12.55 2 5 6.36 - 2 with XI = 1O~ 6 7 3.34 1.97 -------= f(x) := 1 - lOx + O.Olex has already been considered in Example 5.2.7 for the treatment of the secant bisection method. Table 5.13 shows the convergence behavior of the Newton method for Xl := 20. It takes a long time for the locally quadratic convergence to be noticeable. For the attainment of full accuracy, 17 evaluations of f and f' are necessary. The secant bisection method with Xl = 5, X2 = 20 needs about the same number of function evaluations but no derivative evaluations to achieve the same accuracy. On nonconvex problems, small perturbations of the starting point may strongly influence the global behavior of Newton's method, if some intermediate iterate gets close to a stationary point. 5.7.5 Example. We demonstrate this with the function f(x) := 1 - 2/(x 2 + 1) 5.7 Methods Using Derivative Information Univariate Nonlinear Equations 286 Table 5.15. 287 Newton's methodsfor f(x) = x - I - 2/x with two different starting points Xl 0.75 Xj 0.5 1 0.25 oL-----~~~+-~f------~l -0.25 -0.5 -0.75 -2 3 -1 Figure 5.7. Graph of f(x) = 1 - 2/(x 2 + I). displayed in Figure 5.7, which has zeros at + I and .-1. Table 5.l~ shows that three close initial values may result in completely different behavior. The next example shows that the neighborhood where Newton's .method converges to a given zero may be very asymmetric, and very large m some direction. 5.7.6 Example. For the function f(x) := x-I - 2/x Table 5.14. Newton's method for f(x) = 1- 2/(x for different starting points Xl Xi 1 2 3 4 5 6 7 8 9 10 11 1.999500 0.126031 2.109171 -0.118015 -2.235974 0.446951 0.983975 0.999874 1.000000 1.000000 1.000000 2 = 1.999720 0.125577 2.115887 -0.134153 -1.997092 -0.130986 -2.039028 -0.042251 -5.959273 46.906603 - 25754.409557 {- ±oo 1.999970 0.125062 2.123583 -0.152822 -1.787816 -0.499058 -0.968928 -0.999533 -1.000000 -1.000000 -1.000000 1000.0000000000000000 1.0039979920039741 1.6702074291915645 1.9772917776064771 1.9999127426320478 1.9999999987309516 2.0000000000000000 0.0010000000000000 0.0020004989997505 0.0040029909876475 0.0080139297363667 ODI6059455313883 1 0.0322437057559748 0.0649734647479344 0.1317795480072139 0.2698985122099006 0.5559697617074131 1.0969550112472573 1.7454226483923749 1.9871575101344159 1.9999722751335729 1.9999999998718863 2.0000000000000000 with zeros -I and +2 and a pole at X = 0, Newton's method converges for all values Xl -I O. As Table 5.15 shows, the convergence is very slow for starting values Ix! I « I because for tiny Xi we have only Xi+1 ~ 2Xi. Very large starting values, however, give a good approximation to a zero in a few Newton steps. + 1) Xi Xi 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 The Damped Newton Method If the Newton method diverges, then one can often obtain convergence in spite of this through damping. 
The idea of damping is that, instead of a full Newton step, Pi = - f(Xi)/!'(Xi), only a partial step, is taken, in which the damping factor (Xi is chosen by successive halving of an initial trial value (Xi = 1 until If (Xi+ I) I < I f (Xi) I. This ensures that the values If (Xi) I are monotonically decreasing and converge (because they are bounded below by zero). Often, but of course not always, the limiting value is zero. 288 Univariate Nonlinear Equations 5.7 Methods Using Derivative Information 5.7.7 Example. The function 100 f(x) := 10x 5 - 36x 3 1 - - -.---- - -,-- _ ---,-__---,__---,,--_-----, + 90x 50 has the derivative f' (X) 289 := 50x 4 - 108x 2 + 90 = 50(x 2 - , ,, 1.08)2 + 31.68 > 0, .; o so that f is monotone in R The unique real zero of ! is x* = O. Starting from XI =6 (Table 5.16), Newton's method damped in this sense accepts always the first trial value ex = I for the damping factor, but the sequence of iterates alternates for! ~ 8, and has the limit points + 1 and -1, and I! (± 1) I = 64, cf. Figure 5.8. The (mis- )convergence is unusually slow because the improvement factor If(Xi+l) III! (Xi) I gets closer and closer to I. for some fixed positive q < 1. (One usually takes a small but not tiny value, such as q = 0.1.) Because the full Newton step taken for a, = 1 gives local quadratic convergence, one tries in tum cti = 1, ~, ~, ... until (7.3) is satisfied. For monotone functions, this indeed guarantees global convergence. (We analyze this in detail in the multidimensional case in Section 6.2.) Because the step length is now no longer a sign of closeness to the zero, one stops the iteration when I! (Xi) I no longer decreases significantly; for example, - - - ,; -50 - - ; , - - -~C!J ,," - ; '*- - - _ _ _ _ _ X ; -100 In order to ensure the convergence of the sequence I! (Xi) I to zero, clearly one must require a little more thanjust I!(Xi+l) I < I/(Xi) I. In practice, one demands that (7.3) - ... , -150 ':---~--____=:;;";;----:;------:::':_------L---J -1.5 1 05 - - . a 0.5 1 ~~:~re 5.8. Oscillations of simplydampedNewton iteratesfor f(x) 1.5 = lOx 5 _ 36x 3 + when If(xi)1 s 1!(Xi+l)l(l +../i) with the machine precision e. A Modified Newton Method Table 5.16. 1 2 3 4 5 6 7 8 9 10 Simple damping does not help in all situations Xi f(Xi) 6.000000 4.843907 3.926470 3.198309 2.615756 2.133620 1.685220 1.067196 -1.009015 1.008081 70524.000000 23011.672462 7506.913694 2456.653982 815.686552 284.525312 115.294770 66.134592 -64.287891 64.258095 21 22 23 24 25 26 27 28 29 30 Xi f(Xi) -1.003877 1.003706 -1.003550 1.003407 -1.003276 1.003154 -1.003041 1.002936 -1.002839 1.002747 -64.123953 64.118499 -64.113517 64.108946 -64.104738 64.100850 -64.097746 64.093896 -64.090774 64.087856 - ~e promise~ after Example 5.7.3 a modification of the Newton method that aor polyno~11Ials, also converges rapidly for large starting values. Let f be polynomial of degree n; by the fundamental theorem of algebra there are c ( f whi ' exactly n zeros c '" 51, ... , ."''' some 0 which may coincide), and lex) = ao(x _ ~l) (x - ~,,). For a SImple zero x* of ! we have rex) d I - ( ) = -d 10gl!(x)1 = - - + " .f X X X - x* L _1_ srFX' Where C E C\{x*}, X - c 5j R::: _1_ x* X - n-l + --, X - C whence x" R::: x;+l := Xi _ f(x;) !'(x;) - Xi-C .!'-::.If(x)· I (7.4) Univariate Nonlinear Equations 5.7 Methods Using Derivative Information The formula (7.4) defines the modified Newton method. 
Because the additional term in the denominator of the correction term vanishes as Xi ---+ x*, (7.4) is still locally Q-quadratically convergent. Because Table 5.17. Results of the modified Newton methodfor f(x) = x 2 - 2 with XI = 100 290 291 Xi 1 2 100.000000000000000 2.867949443501232 1.536378083259546 1.415956475989498 1.414213947729941 1.414213562373114 1.414213562373095 1.414213562373095 3 - x. _ --="X-;i~--:-C_ _ - L....- I = ( 4 5 6 7 "'~+I Xi-~j L /~'-c _~J. Xi J=I:n 8 + C) / ( .L /~'-c _~J. + I ) , J=I:n I I it follows that Xi+l approaches the value L(~j - c) + c as Xi ~ 00; in contrast Xi + O(l) ---+ 00 as to the ordinary Newton method, for which Xi+l = (l Xi ---+ 00, the modified Newton method provides for very large starting values in only one step a reasonable guess of the magnitude of the zero (unless c is chosen badly). Moreover, the modified Newton method has nice monotonicity properties when the polynomial has only real zeros. *) Therefore, 0.::: Xi+l (i) If all zeros of f lie between c and Xl, then the modified Newton sequence converges and is monotonically decreasing. (ii) If no zero lies between c and XI, then the modified Newton sequence con- verges monotonically. x*) (1 _ Xi - = (Xi *) - X x* C + n(x* ) - c) (n - l)(x' - c) Xi - x* + n(x* - c) (n - I)(x* - c), (I - ~) By induction, it follows that x* :5 Xi+1 :5 inequality implies that Xi converges to x*. (ii) Similarly proved. Xi (Xi - x*)) . for all i > I, and the last 0 (I;;;; I, l;e I) Proof (i) Let x* be the largest zero of f, and suppose that holds for i = 1). Then we have f'(Xi) n- I I 0< - - - < - - - - - Xi-X' - (Xi - Xi - .::: min 5.7.8 Theorem. Let f be a real polynomial with real zeros only, and suppose that XI > C. x* .::: - f(Xi) Xi-C and I Xi-X* x;-c Xi > x* (this certainly = ---Xi-Xi+l = 3 for the 5.7.9 Example. Exercise ]4 gives the bound J + max 2 absolute values of the zeros of the polynomial f (x) ;= x - 2. Therefore we may choose c ;= -3. With the starting point Xl := 100, the results of the modified Newton method are given in Table 5.17. Halley's Method In the limiting case Xi _ J ~ Xi (j = 1, ... , k), the Opitz formula (Ok) derived in Section 5.4 takes the form 5.8 Exercises Univariate Nonlinear Equations 292 5.8 Exercises Then Ix* - clx* - xilk+l, Xi+11 ::: that is, the method (Hk) has convergence order k + I. The derivatives of h = 11f are calculated from hi = - f' h" = 2f'2 - ff" p' P etc. For k = 1, one again obtains Newton's method applied to f, with convergence order 293 K = 2; the case k = 2 leads to Halley's method 1. What is .the maximal length of a cable suspended between two poles of ~qual height separated by a distance 2d = 100 m if the height h by which It sags may not exceed 10 m? Can you guarantee that the result is accurate to 1 em? ~int: The fo~ofthecableis described by a catenary y(x) := a cosh(xla). FIrst determme the arc length between the poles in terms of a. Then obtain a by finding a zero of a suitable equation. 2. Ho.w deeply will a tree trunk in the form of a circular cylinder of radius r (10 em) and density PH (in g/cm') be submerged in water (with density Pw = I)? The area of the cross-section that is under water is given by r:1 F = l(a - sin a). f(Xi) Xi+1 := Xi - f -r-(-xi-l!-"-(x-i) , I (Xi) - with cubic convergence (K = 3) to simple zeros. It can be shown that, for polynomials with only real roots, Halley's method converges globally and monotonically from arbitrary starting points. 
Note that k derivatives of f are needed for the determination of the values h(k-I) (Xi), h(k) (x.) in (Hd. If for j = 0, ... , k these derivatives require a similar computational cost, then a step with (Hk ) is about as expensive as k + 1 secant steps. A method where each step consists of k+ 1 secant steps has convergence order (1.618 .. . )k+1 »k + 1; therefore, the secant method is generally much more efficient than any (Hk ) · Other variants of the method of Opitz that use derivatives and old function values can be based on partially confluent formulas such as In this particular case, the evaluation of h and hi is necessary in each iteration, and we have [x" - Xi+ll ::: c[x* - Xi [2 ... Ix* - Xi-s [2 and the convergence order is bounded by 3. Thus, one step of this variant is already asymptotically less efficient than two steps of (0 2) (Ki ~ 3.382 > 3). A host of other methods for zeros of polynomials (and other univariate functions) is known. The interested reader is referred to the bibliography by McNamee [60]. An extensive numerical comparison of many zerofinders is in Nerinckx and Haegemans [66]. (8.1) 2j'(x,) The mass of the displaced water is equal to the mass of the tree trunk whence ' (8.2) Using (8.1) and (8.2), establish an equation f(a) = 0 for the angle a. For r = 30 and PH = 3/4, determine graphically an approximation ao for the zero o " of f(a). Improve the approximation using the secant method. Use the solution to calculate the depth of immersion. 3. Show that for polynomials of degree n with only real zeros, the secant method. started with two points above the largest zero converges monotonically, with global convergence factor of at least 4. Show that for three times continuously differentiable functi~ns f, the root secant method converges locally monotonically and superlinearly toward double zeros x* of f. Hint: ~ou may assume that !"(x*) > 0 (why?). Distinguish two cases depending on whether or not Xl and X2 are on different sides of x*. 5. Forthedeterminationofthezerox* = ,J2of f(x):= x 2-2in the interval [0, 2], use the following variant of the secant method: R % Set starting values = 0: Xo while 1 '!'O = 2; i = 0; X = Xi - f(Xi)/f[Xi, ,!.i] if !(,!.i)!(X) > 0 10. Let A E C"X" be a symmetric tridiagonal matrix with diagonal elements a., i = I, ... , nand subdiagonal elements fJi' i = 1, ... , n - 1. (a) Show that for X E C" with Xi i- 0 for i = 1, ... , n; (A, x) is an eigenpair of A if and only if the quotients Yi = Xi -1/ Xi, i = 1, ... , n satisfy the recurrence relations else :!:.i+1 = :!:.i; Xi+l = x; end; 1= 295 5.8 Exercises Univariate Nonlinear Equations 294 i + 1; end List i.s. and Xi for i = 1,2, ... ,6 to five decimal places, and perform iteration until [x -,J21 ::; 10- 9 . Interpret the results! 2 6. (a) Show that the cubic polynomial f(x) = ax 3 + bx + ex + d (a i- 0) has a sign change at the pair (Xl, X2) with XI = 0, _ maxt]c], b, b X2 = \ + c, b + c + d)/a if ad s; 0, + c, b + c + dv]« otherwise. fJn-l Yn = A- q(t) := t - al - ---------'---'----------;:---- t - a2 - bisection method. (b) Show that in a sufficiently narrow bracket, this safeguard never becomes active. First show that it suffices without loss of generality to consider the case Xl < X < x* < X2. Then show that q = max(f(xI)/f(x), 2(X2XI)/(X2 - x» satisfies q > 2 and (x - XI)/(l- q) < - min(x - Xl, X2 - X). Interpret the behavior of Xll ew = X - (x - Xl)/(l - q) for the cases If(x)\ « f(xI) and If(x)\» f(x)), and compare with the Hint: original secant formula. 
S (a) Determine the zero of f(x) := 1nx + e' - 100x in the interval [1, 1.0] • b" . with using the original secant method and the secant IsectlOnversion . Xl := 1 and X2 := 10. For both methods, list i and Xi (i = 1,2, ...) untIl the error is < 00 = 10- 9 . Interpret the results! b f . d it te with the (b) Choose Xl := 6 and Xz := 7 for the a ove unction an 1 era -9 L' h 1 of Xi original secant method until IXi - xi-II::; 10 . 1St t eva ues and the corresponding values of i. . 9. Which zero of sin X is found by the secant bisection method started WIth Xl = 1 and X2 = 8? ---"---L t - Analyze the Horner form, noting that IX21 ::: 1. (b) Based on (a), write a MATLAB program that computes all zeros of a Hint: quadratic equation directly. (c) Can one use Cardano's formulas (see Exercise 1.13) to calculate all zeros of x 3 - 7x + 6 = 0, using real arithmetic only? 7. (a) Explain the safeguard used in the linear extrapolation step of the secant an' (b) Show that each zero of the continued fraction _ mint-e]«], b, b cubic polynomial. Hint: After finding one zero, one finds the other two by solving a for i = n - 1, ... ,1, fJi-1 Yi = A - a, - fJi/Yi+l, a n-l _ _ f3;;-1 t-a n is an eigenvalue of A. (c) When is the converse true in (b)? 11. Let A be a Hermitian tridiagonal matrix. (a) Suppose that a I - A has an LDL H factorization. Derive recurrence formulas for the elements of the diagonal matrix D. (You may proceed directly from the equation a I - A = LDL H , or use Exercise 10.) (b) Show that the elements of L can be eliminated from the recurrence derived in (a) to yield an algorithm that determines det(a I - A) with a minimal amount of storage. (c) Use (b) and the spectral bisection method to calculate bounds for the two largest eigenvalues of the 21 x 21 matrix A with A i k := II - i I, ' I 0, if k = i iflk-il=1. otherwise Stop as soon as two decimal places are guaranteed. 12. Let G(A) : C ---* C 3 x 3 with 2A2 + 2A + 2 -l1A 2+A+9 2A2 + 2A + 3 be given. (a) Why are all eigenvalues of the parameter matrix G (A) real? (b) G(A) has six eigenvalues in the interval [-2,2]. Find them by spectral bisection. 13. Let f : 1R ~ 1R be continuously differentiable, and let x* be a simple zero of f. Show for an arbitrary sequence Xi (i = 0, 1,2, ...) converging to x*: (a) If, for i = 1,2, ... , IXi+1 - x"] S qlxi - x*1 (b) If . Xi+1 -x* lim Xi - x* =q with Iql 16. Let f be p + 1 times continuously differentiable and suppose that the iterates x j of Muller's method converge to a zero x* of f with multiplicity P = 1 or P = 2. Show that there is a constant c > 0 such that for j = 3,4, .... Deduce that that the (R-)convergence order of Muller's method is the solution of K3 = K2 + K + 1 for convergence to simple zeros, and the solution of 2K3 = K2 + K + 1 for convergence to double zeros. 17. The polynomial PI(X) := ~X4 - 12x3 + 36x 2 - 48x + 24 has the 4-fold zero 2, and P2(X) := x 4 - IOx3 + 35x 2 - SOx + 24 has the zeros 1, 2, 3, 4. What is the effect on the zeros if the constant term is changed by adding E: = ±0.0024? Find the change exactly for Pi (x), and numerically to six decimal places for P2(X). 18. (a) Show that for a factored polynomial with q < qo S 1, then there exists an index i o such that i-s-co 297 5.8 Exercises Univariate Nonlinear Equations 296 < ql S 1, f(x) = ao then there exists an index i I such that n (x - Xi)'ni i=l:n we have 14. 
(a) Show that equation (4.3) has exactly one positive real solution K, and that this solution satisfies 1 < K < K = 1 + max{po, PI,···, Ps}· Hint: Show first that (pOK S + PIKs - 1 + ... + Ps)/K s+1 - 1 is positive for K S 1 and negative for K = K, and its derivative is negative for all K > O. (b) Let p(x) = asx" + alx ll - 1 + .. , + all-Ix + a" be a polynomial of degree n, and let q* be the uniquely determined positive solution of Show that 1~1l I s q* s I + 1l=1:" max -all I ao I for all zeros ~Il of p(z). Hint: Find a positive lower bound for If(OI if I~I > q*. 15. (a) Show that formula (4.5) is indeed equivalent to the Opitz formula with k = 2 for h = 1/f· (b) Show that for linear functions and hyperbolas f(x) (cx + d), formula (4.5) gives the zero in a single step. = (ax + b)/ I' (x) f (x) = 1 i~ X - Xi . (b) If f is a polynomial with real roots only, then the zeros of f are the only finite local extrema of the absolute value of the Newton correction ll(x) = - f(x)/!,(x). Hint: Show first that ll(x)-I is monotone between any two adjacent zeros of f. (c) Based on (b) and deflation, devise a damped Newton method that is guaranteed to find all roots of a polynomial with real roots only. (d) Show that for f(x) = (x - 1)(x 2 + 3), Ill(x)1 has a local minimum at x =-1. 19. (a) Let x~, ... , x;(j < n) be known zeros of f(x). Show that for the calculation of the zero X;+I of f(x) with implicit deflation, Newton's method is equivalent with the iteration Xi+1 := Xi - f(Xi ) / (r(X i) - L f~i).). x k=l:j Xl k (b) To see that explicit deflation is numerically unstable, use Newton's method (with starting value Xl = 10 for each zero) to calculate both 5.8 Exercises Univariate Nonlinear Equations 298 7 6 with implicit and explicit deflation all zeros of 1 (x) := x - 7x + 11x 5 + 15x 4 - 34x 3 - 14x 2 + 20x + 8, and compare the results with the exact values 1, 1 ± y'2, 1 ±~, I ± J"S. 20. Assume that in Exercise 2 you only know that r E [29.5, 30.5], Ix* -x*1 < n ·lpn(.T*) PH E [0.74,0.76]. k 21. (a) Prove Theorem 5.6.6. Hint: Multiply (6.9) by a complex number, take its real part, and conclude that there must be some zero with Re(c / (x - ~i» O. (b) Let x' = What is the best error bound for lx' - x" I? 22. (a) Let k, = k 2 = 2, k 3 = 3, k 4 = 5. Show that for arg(q) E ± lMirr, ~rr[, condition (6.5) cannot be violated for all k = 0, ... , k n . (b) In which sense is this choice of the k; best possible? 23. (a) Write a MATLAB program that calculates a zero of an analytic function 1 (x) using Muller's method. If for two iterates x j, x j+j, ::: ;;W). (where 8 is an input tolerance), perform one more iteration and terminate. If x j = x j + j, terminate the iteration immediately. Test the 2 program with l(x) = x 4 + x 3 + 2x + X + 1. (b) Incorporate into the MATLAB program a spiral search and find the zeros again. Experiment with various choices for the contraction and rotation factor q. Compare with the use of the secant formula instead of Muller's. 24. (a) Determine, for n = of ~onjugate complex zeros (in MATLAB, conj gives the complex conjugate), remove them by implicit deflation and begin again with the above starting points. (b) Calculate, for each of the estimates of the zeros, the error bound xt Starting with an initial interval that contains a*, iterate with the interval Newton method until no further improvement is obtained and calculate an interval for the depth of immersion. (You are allowed to use unrounded interval arithmetic if you have no access to directed rounding.) 
x- 299 12, 16,20, all zeros x;, ... , x~, of the truncated exponential series x" Pn(x) := 'Lv! " -. v=O:n (The zeros are simple and pairwise conjugate complex.) Use the programs of Exercise 23 with xo:= - 6, Xl := - 5, X2:= _ 4 as starting values and 8 = 10- 4 as tolerance. On finding a pair k - p~(x*) I. Print for each zero the number of iteration required, the last iterate that was calculated, and the error bound. (c) Plot the positions of the zeros you have found. What do you expect to see for n ---+ oo? 25. For the functions 11 (x) := x 4 -7x 2 + 2x + 2 and hex) := x 2 - 1999x + I ~OOOOO, perform six steps of Newton's method with Xo = 3 for 11 (x) and with Xo = 365 for 12 (x). Interpret the results. 26. Let x* be a zero of the twice continuously differentiable function 1 : lR ---+ R and suppose that 1 is strictly monotonically increasing and convex for x ::: x*. Show the following: (a) If XI > X2 > x*, then the sequence defined by the secant method started with x I and X2 is well defined and converges monotonically to x*. (b) If YI > x*, then the sequence Yi defined by Newton's method Yi+1 = Yi - 1 (Yi) /1' (Yi) is well defined and converges monotonically to x*. (c) If Xl = Yl > x* then x* < X2i < Yi for all i > I; that is, two steps of the secant method always give (with about the same computational cost) a better approximation than one Newton step. Hint: The error can be represented in terms of x* and divided differences. 27. Investigate the convergence behavior of Newton's method for the polynomial p(x) := x 5 - IOx3 + 69x. (a) ~how that Newton's method converges with any starting point Ixol < )ffi. (b) Show that g (x) = x - ;,i~ is monotone on the interval x := [~ffi, v'3]. Deduce that g(x) maps x into -x and -x into x. What does this imply for Newton's method started with Ixol EX? 28. Let p(x) = aox n + alx n- I + ... + an-Ix + an, ao > 0, be a polynomial of degree n with a real zero ~* for which 1;* ::: Re s, for all zeros I;v of p(x). Univariate Nonlinear Equations 300 6 (a) Show that for zo > ~* the iterative method z) + 1 : = z) p(z) ) - -----,-------( ) p z) Systems of Nonlinear Equations p(z) W)+l := z) - n-----,-------() p z) generates two sequences that converge to ~* with W) :s ~* :s z) for j = 1,2, 3, ... and {z)} is monotone decreasing. Hint: Use Exercise l8(a). (b) Using Exercise 14, show that the above hypotheses are valid for the polynomial n n-l 1 px=x-x -x n-2 - ... - . () Calculate ~* with the method from part (a) for n = 2,3,4,5, 10 and Zo:= 2. In this chapter, we treat methods for finding a zero x* E D (i.e., a point x* E D with F(x*) = 0) of a continuously differentiable function F : D S; jRn ~ jRn. Such problems arise in many applications, such as the analysis of nonlinear electronic circuits, chemical equilibrium problems, or chemical process design. Nonlinear systems of equations must also frequently be solved as subtasks in solving problems involving differential equations. For example, the solution of initial value problems for stiff ordinary differential equations requires implicit methods (see Section 4.5), which in each time step solve a nonlinear system, and nonlinear boundary value problems for ordinary differential equations are often solved by multiple shooting methods where large banded nonlinear systems must be solved to ensure the correct matching of pieces of the solution. Nonlinear partial differential equations are reduced by finite element methods to solving sequences of huge structured nonlinear systems. 
Because most physical processes, whether in satellite orbit calculations, weather forecasting, oil recovery, or electronic chip design, can be described by nonlinear differential equations, the efficient solution of systems of nonlinear equations is of considerable practical importance. Stationary points of scalar multivariate function f : D ~ lR n ~ lR lead to a nonlinear system of equations for the gradient, V f (x) = O. The most important case, finding the extrema of such functions, occurs frequently in industrial applications where some practical objective such as profit, quality, weight, cost, or loss must be maximized or minimized, but also in thermodynamical or mechanical applications where some energy functional is to be minimized. (To see how the additional structure of these optimization problems can be exploited, consult Fletcher [26], Gill et al. [30] and Nocedal and Wright [73a].) After the discussion of some auxiliary results in Section 6.1, we discuss the multivariate version of Newton's method in Section 6.2. Emphasis is on obtaining a robust damped version that can be proved to converge under fairly general 301 6.1 Preliminaries Systems of Nonlinear Equations 302 303 The Jacobian conditions. Section 6.3 discusses problems of error analysis and interval techniques for the rigorous enclosure of solutions. Finally, some more specialized methods and results are discussed in Section 6.4. A basic reference for solving nonlinear systems of equations and unconstrained optimization problems is Dennis and Schnabel [18]. See also Allgower and Georg [5] for parametrized or strongly nonlinear systems of equations. The matrix A E jRlIl X" is called the Jacobian matrix (or derivative) of F : D S; jR" ~ jRlIl at the point Xo E D if one writes A = F' (x"). It follows immediately from the definition that 6.1 Preliminaries We recall here some ideas from multidimensional analysis on norm inequalities and local expansions of functions, discuss the automatic differentiation of multivariate expressions, and state two fixed-point theorems that are needed later. In the following, int D denotes the interior of a set D, and aD its boundary. , Integration Let g : [a, b] ~ jR" be a vector-valued function that is continuous on [a, b] The Riemann integral g(t) dt is defined as the limit f: r gU) dt:= fa taken over all partitions a and if F' is Lipschitz continuous, the error term is of order O(llx - xOI1 2 ) . Because all norms in jR" are equivalent, the derivative does not depend on the norm used. The function F is called (continuously) differentiable in D if the Jacobian F' (x) exists for all xED (and is continuous there). If F is continuously differentiable in D, then F (X)ik c a = -F;(x) (i OXk = 1, ... .m, k = 1, ... , n). 1ft For example, for m = n = 2 and F(x) = (~~i:;), we have L lim (t;+1 - ti)g(t;) max(ti+l-ti)~O;=0:" = to < t1 < .. , < t,,+1 = b of the interval [a, b], provided this limit exists. 6.1.1 Lemma. For every norm in jR", When programming applications, it is important that derivatives are programmed correctly. Correctness may be checked as in Exercise 1.8. We generalize the concept of a divided difference from the univariate case (Section 3.1) to multivariate vector-valued functions by defining the multivariate slope if F is differentiable on the line xxo. if both sides are defined. 6.1.2 Lemma. Let D S; jR" he convex and let F : D ~ jR" be continuously differentiable in D. Proof This follows from the inequality (i) If x, xO E D then by going to the limit. 
o (This is a strong form ofthe mean value theorem.) (ii) IfIIF'(x) II :s Lforall xED, then II F[x, xO]1I 305 6.1 Preliminaries Systems of Nonlinear Equations 304 and :s L for all x , xO E D whence (ii) follows. The assertion (iii) follows similarly from and IIF(x) (This is a weak form of the mean value theorem.) (iii) If F' is Lipschitz continuous in D, that is, the relation IIF'(x) - F'(y) II :s yllx - yll F(xo) - F'(xo)(x - = 111 :s 1 II F' (xO + t (x :s 1 YllxO + t(x for all x, y ED xO)11 1 1 (F' (xO + t(x - xu)) - F' (xo)) (x - xu) dt - xu)) - F' (x") IIllx - I xOII dt 1 holds for some Y E~, then for all x, xO E D; =Yllx-xOI1 2 - xu) - Y 1° 1 xOllllx - x011 dt tdt=-llx-xOIl 2 . o 2 in particular Reverse Automatic Differentiation (This is the truncated Taylor expansion with remainder term.) Proof Because D is convex, the function defined by is continuously differentiable in [0, 1]. Therefore, 1 1 F(x) - F(xo) = g(l) - g(O) = g'(t) dt t oo F'(xo+t(x-xo))(x-xo)dt=F[x,x ](x-x). = 10 For many nonlinear problems, and in particular for those discussed in this chapter, the numerical techniques work better if derivatives of the functions involved are available. Because numerical differentiation is often inaccurate (see Section 3.2), it is advisable to provide programs that compute derivatives analytically. The basic principles were discussed already in Section 1.1 for the univariate case. Here, we look briefly at the multivariate case. For more details, see Griewank and Corliss [33]. Consider the calculation of y = F (x) where x E ~n , and y E ~m is a vector defined in terms of x by arithmetic expressions. We may introduce an auxilary vector z E ~ N containing all intermediate results from the calculation of y, with y retrievable from suitable coordinates of z, say y = Pr with a (0, 1) matrix P that has exactly one 1 in each row. Then we get each Zi with a single operation from one or two Zj (j < i) or x i- Therefore, we may write the computation as This proves (i). Now, by Lemma 6.1.1, IIF[x, z F'(xO + t(x - xu»~ dt IIF'(xO + t(x - x°))ll dt 1 xO]11 = \11 :s 1 1 : : £1 Ldt = L, 1\ = 0.1) H(x, z), where Hi (x, z) depends on one or two of the z j (j < i) or x j only. In particular, the partial derivatives HxCx, z) with respect to x and Hz (x, z) with respect to z are extremely sparse; moreover, Hz (x, z) is strictly lower triangular. Now z depends on x, and its derivative can be found by differentiating (1.1), giving 8z -8 x = HxCx, z) az + Hz(x, z)-. ax 306 6.1 Preliminaries Systems ofNonlinear Equations 6.1.3 Example. To compute Bringing the partial derivatives to the left and solving for them, we find -az = ax Because F'(x) (1- Hz(x, z)) -1 HAx, z). (1.2) we need the intermediate expressions = ay/ax = paz/ax, we find z\ = (1.3) Z3 Note that the matrix to be inverted is unit lower triangular, so that one can compute (1.2) by solving = Z4 Z6 by forward substitution: with proper programming, the sparsity can be fully exploited and the computation efficiently arranged. For one-dimensional x, the resulting formulas are equivalent to the approach via differential numbers discussed in Section 1.1. This way of calculating F' (x) is termed forward automatic differentiation. In many cases, especially when m « n, the formula (1.3) can be evaluated more efficiently. Indeed, we can transpose the formula and find Hz(x, z) Now we find K:= (I - Hz(x, z))~r p T by solving with back substitution the associated linear system = K HAx, z). 
= 0 0 1 0 0 0 0 0 0 1 0 0 0 0 As for any scalar-valued function, K (1.6) simply becomes (1.7) k2 - k3 - k4 This way of proceeding is called reverse or backward automatic differentiation because we solve for the components of K in reverse order. An apparent disadvantage of this way of proceeding is that H; (x, z) must be stored fully while it can be computed on the fly while solving (1.4). However, if m « n the advantage in speed is dramatic because (1.4) contains on the right side n columns whereas (1.6) has only m columns. So the number of triangular solves is cut down by a factor of m / n « 1. 0 0 0 I/x4 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 Z6 0 Z4 k5 Thus, - 0 0 0 0 0 0 0 0 0 0 0 0 = k is now a vector, and the linear system = 0, k3 = 0, (1/x4)k 4 = 0, Z6k7 = 0, Z6k6 = 0, k, - k3 the so-called adjoint equation with a unit upper triangular matrix, and obtain F'(x) X4, to get f(x) = Z7. We may write the system (1.8) as 2 = H (x, z) and have y = P'; with the 1 x 7-matrix P = (0 0 0 0 0 0 1). The Jacobian of H with respect to z is the strictly lower triangular matrix (1.5) (1.6) (1.8) Z3/ X4, = Xl = e Z5 , Z5 (1.4) xf, = X2 X3, = zr + Z2, Z2 T 307 k6 - Z4k7 = 0, k7 = 1. Systems ofNonlinear Equations 308 6.1 Preliminaries and we get the components of the gradient from (1.7) as af(x) g·=--=k T Banach's Fixed-Point Theorem The mapping G : D S; JRn ~ JRn is called a contraction in K S; D if G maps K into itself and if there exists fJ E JR with 0 :s fJ < 1 such that o Hi», z) ax j J 309 dXj IIG(x) - G(Ylll :S fJllx - a Most components of H / ax j vanish, and we end up with YII for all x, Y E K; f3 gl = k 1 · 2Xl is called a contraction factor of G (in K). In particular, every contraction in K is Lipschitz continuous. +ks , g2 = k2 X 3 , g3 = k 2X 2 , g4 = k4 ( -Z3/ xJ) - k s = -k4 z4 / X 4 - k s· (I.10) 6.1.4 Theorem. Let the mapping G : D S; lien ~ JRn be a contraction in the closed. convex set K ~ D. Then G has exactly one fixed point x* E K and the iteration x'+1 := G(x') converges for all xO E K at least linearly to x*. The relations . The whole gradient is computed from (1.9) and (LlO) with only 11 operations, and because no exponential is involved, computing function value and gradient takes less than twice the work for computing the function value alone. This is not untypical for gradient computations in the reverse mode. Ilxl+ l -x*ll:s ,8llx l -x*lI, and II xl+1 - x'il There are several good automatic differentiation programs (such as ADIFOR [2], ADOL-C [34]) based on the formula (1.3) that exploit the forward approach, the backward approach, or a mixture of both. They usually take a program for calculation of F (x) as input and return a program that calculates F (x) and F ' (x) simultaneously. Fixed-Point Theorems (1.13) ---- < I+P hold, in which - IIx l + 1 - x'il II x' - x* II < -~-- - (1.14) l-fJ f3 denotes a contraction factor of G. Proof Let xO E K be arbitrary. Because G maps the set K into itself, defined for all I 2: 0 and for I 2: 1, xl is so by induction, Related to the equation F(x*) = 0 is the fixed point equation x* = G(x*) (Lll) Therefore for I < m, IIx and the corresponding fixed point iteration l - x m II :S Ilx l :S (pi X I + 1 := G(x l ) (l = 0,1,2, ...). (Ll2) x'+111 + '" + [[xIII-I + ... + ,8m-l)lIx l pm - fJ' = The solutions of (I.Il) are called fixed points of G. 
For application to the problem of determining zeros of F, one sets G (x) = x - C F (x 1with a suitable C E jR" "", There are several important fixed-point theorems giving sufficient conditions for the existence of at least one fixed point. They are useful in proving the existence of zeros in specified regions of space, and relevant to a rigorous error analysis. - ,8 -1 _ xm I _ xOIl IIx l-xOIl, so IIx - x II ~ 0 as I, m ~ 00. Therefore x' is a Cauchy sequence and has a limit x*. Because G is continuous, l m x* = lim xl = lim G(x'-I) 1--";-00 l"""-HX; = G(x*), that is, x* is a fixed point of G. Moreover, because K is closed, x* E K. 6.2 Newton's Method and Its Variants Systems of Nonlinear Equations 310 (ii) (Fixed point theorem by Brouwer) Suppose that the mapping G : D S; JRn ~ JRn is continuous and that K S; Dis nonempty, convex, and compact. If G maps the set K into itself, then G has at least one fixed point in K. An arbitrary fixed point x' E K satisfies Ilx' - x*11 = IIG(x') - G(x*)11 ~ ,Bllx' - x*ll, and we conclude x' = x* because ,B < 1. Therefore x* is the only fixed point of G in K. The inequality (1.13) now follows from l Ilx I+1 - x*11 = IIG(x l) - G(x*)11 ~ ,Bllx - x*lI, and with this result, (1.14) follows from (1 - ,B)IIx l - x* II l ~ IIx i - x* I - Ilxl+1 - x* II ~ Ilx + - xlii l ~ I xl - x*11 + Ilx I+1 - x*11 ~ (1 + ,B)IIx - x*ll· 311 Proof For n = 1 the theorem easily follows from the intermediate value theorem. For the general case, we refer to Ortega and Rheinboldt [78] or Neumaier [70] for a proof of (i), and deduce here only (ii) from (i). If xO E int K and "A > 1 then, assuming G(x) = xO + "A (x - xO), the convexity of K implies that 1 D Therefore, G(x) #xo + "A(x - xo) for each x E JK,"A > 1, and by part (i), G has a fixed point in K. D 6.1.5 Remarks. (i) We know an error bound of the same kind as (1.14) already from Theorem 2.7.2. As seen there, (1.14) is a realistic bound if,B does not lie too close to 1. (ii) Two difficulties in the application of the Banach fixed-po~nt theorem li.e in the proof of the property that G maps K into itself and III the determination of a contraction factor ,B < 1. By Lemma 6.1.2(ii), one can choose f3 := sup{ I G'(x) III x E K}. If this supremum is difficult to determ.ine, o~e can determine a simpler upper bound with the help of interval arithmetic Note that Brouwer's fixed-point theorem weakens the hypotheses of Banach's fixed-point theorem. but also gives a weaker conclusion. That the fixed point is no longer uniquely determined can be seen, for example, by choosing the identity for G. 6.2 Newton's Method and Its Variants The multivariate extension of Newton's method, started at some xO E JRn, is given by (see Section 6.3). (2.1) where pl is the solution of the system of linear equations The Fixed-Point Theorems by Brouwer and Leray-Schauder (2.2) Two other important fixed point theorems guarantee the existence of at least one fixed point, but uniqueness can no longer be guaranteed. The solution pi of (2.2) is called the Newton correction or the Newton direction. We emphasize that although the notation involving the inverse of 6.1.6 Theorem. the Jacobian is very useful for purposes of analysis, in practice the matrix F' (x') -I is never calculated explicitly because as seen in Section 2.2, pI can be determined more efficiently by solving (2.2), for example, using Gaussian elimination. Newton's method is the basis of most methods for determining zeros in JRn; many other methods can be considered as variants of this method. 
To show this, we characterize an arbitrary sequence x' (l = 0, 1, 2, ... ) that converges quadratically to a regular zero. Here, a zero x* E int D is called a (i) (Fixed-point theorem by Leray and Schauder) Suppose. that the mapping G: D S; JRn ~ JRn is continuous and that K S; D IS compact. If xO E intK and G(x) #xo + "A(x - x") for all x then G has at least one fixed point in K. E JK, "A> 1, 312 Systems of Nonlinear Equations 6.2 Newton's Method and Its Variants regular zero of F : D ~ JR.n --* JR.n if F (x*) = 0, if F is differentiable in a neigh- borhood of x* and F' (x*) is Lipschitz continuous and nonsingular in such a neighborhood. From now on, we frequently omit the iteration index I and simply write x for xl and i for x l +1 (and similarly p for p', etc.). IIi - = O(IIF(i)11) = O(IIF(x)1I 2) == O(lIx _ x*1I r:= F(x) == F(x) F(x) F(x) = r x) + O(llx - xf), - x*) + O(lIx - x*11 2 ) . Iii - xll::s IIx - == (2.3) = O(IIx - x*II), + O(lIx - xI1 2 ) . (2.5) If (ii) now holds, then by (2.4) Because F(x*) = 0 and F'(x*) is bounded, it follows that IIF(x)1I + F'(x)(.i - and Proof We remark first that by Lemma 6.1.2(iii), + F'(x*)(x + F'(x)p and remark that by Lemma 6.1.2(iii), (i) IIi -x*II == O(IIx _x*II 2), (ii) IIF(x)11 = O(IIF(x)II 2), (iii) IIF(x) + F'(x)pll = O(IIF(x)1I 2). F(x) = F(x*) x*1I2). The.refore, (i) and (ii) are equivalent. To show that (ii) and (iii) are equiv I t we mtroduce the abbreviation a en , 6.2.1 Theorem. Let x" E D bea regular zero ofthefunction F : D ~ JR." --* JR.". Suppose that the sequence Xl (l = 0, 1,2, ...) converges to x* and that pi := x l + ] - x', Then the following statements are equivalent: 313 and if (ii) holds, then by (2.4) and (2.3), x*II + IIx - x*1I + O(IIF(x)ll) O(II F(x)II) = O(IIF(x)II), and therefore by (2.5) and (ii) /lrll ::s and for an appropriate c > 0 (x)11+ O(III - IIF x/l 2) = O(/lF(x)1I2) so that (iii) holds. Conversely, if (iii) holds, then Ilx - x*11 OCllx - x*11 2 ))11 ::s II F' (x*)-III(II F(x) II + cflx - x* /12). = IIF'(x*)-I(F(x) - For large I, 1IF'(x*)-' /I /Ix - x*11 :::: Ilrll == O(IIF(x)11 2 ) ie, so III - I xII = lip II = IIF'(x)-I(FCx) - r)/I :::: IIF'(x) IIx-x*ll:::: IIF'(x*)-IIlIIF(X)/I+:2llx-x*II, 111(IIF(x)11 + Ilrll) == O(l)O(II F(x)11) = and solving for Ilx - x* II, we find Ilx -x*lI:::: OClix 211F'(x*)-1 -x*11) O(II F(x)II). By (2.5) and (2.6) we have 1111 F(x)II = O(IIF(x)II)· = OCllx (2.4) IIF(i)lI:::: Ilrll + O(/li _xIl 2 ) == O(IIFCx)112), so that (ii) holds. Therefore, (ii) and (iii) are also equivalent. If now (i) holds, then by (2.3) and (2.4), IIF(x)II = (2.6) and _x*II 2) = OCIIF(x)1I 2 ) , Because condition (iii) is obviously satisfied if p Newton correction, we find the [ollowing. == _ F' (x) -1 F (x) o is the 6.2 Newton's Method and Its Variants Systems of Nonlinear Equations 314 6.2.2 Corollary. Newton's methodfor a system ofequations is locally quadrat- because p ically convergent if it converges to a regular zero. = _A- 1 F(x) IIF(x) = 3x? + XIX3 + x~ - 3x~ xi + X2 + 1 - 3x~ with Jacobian F -6x) F'(x):= ( 2XI: X3 until rounding errors start to dominate. '( ) F(x + hu) - If a program calculating F' (x) is not available, one can approximate the partial derivatives by numerical differentiation to obtain an approximation At for I I d . F'(xl). If one then determines pi from the equation At p = - F(x ) an agam sets XI + 1 := xl + pi, then one speaks of a discretized Newton method. In order to obtain quadratic convergence, the approximation must satisfy Table 6.1. 
Simple Newton iteration IIF(xl)111 xl \ F(x+hu)-F(x-hu) 2h (2.7) If the dimension n is large but F' (x) is sparse, that is, each component of F depends only on a few variables, one can make use of this in the calculation of A. If one knows, for example, that F'(x) is tridiagonal, then one obtains from (2.7) with 0.7500000 0.8083333 0.8018015 0.8016347 0.8016345 0.8016345 0.8016345 1.88e 3.40e 1.0ge 1.57e 2.84e 5.55e 5.00e - u = = = (1,0,0, 1,0,0, 1,0,0, ) (0, 1,0,0, 1,0,0, 1,0, ) (0,0, 1,0,0, 1,0,0, I, ) approximations for the columns 1,4,7, 2,5,8, 3,6,9, IIA - F'(x)11 = OClIF(x)II); 2 3 4 5 6 ~ or h u The Discretized Newton Method 0.5000000 0.6000000 0.5856949 0.585290\ 0.5852896 0.5852896 0.5852896 F(x) xu~---~-~---':'" u 0.2500000 0.3583333 0.3386243 0.3379180 0.3379\7\ 0.3379171 0.3379171 O(IIF(x)11 2 ) . in which h I- 0 is chosen suitably and u E]RII runs through the unit vectors e(l) , e(2), ... , e(lI) in succession. With the starting value xU := (i, &, ~) T , the Newton method gives the results displayed in Table 6.1. Only the seven leading digits are displayed; however, full accuracy is obtained, and the local quadratic convergence is clearly visible 0 = The matrix A must therefore approximate F'(x) increasingly well as IIF(x)11 decreases. In the case that different components of F (x) cannot be calculated one by one, one may determine the approximation for F'(x) columnwise from ) x? ( - F'(x))plI ::: IIA - F'(x)lllIpll x~ - F(x):= O(IIF(x)II), it follows that + F'(x)plI = II (A 6.2.3 Example. Consider the function 315 01 02 03 06 12 16 16 Ilpllll 2.67e - 01 4.05e - 02 1.28e - 03 1.53e - 06 2.47e-12 6.52e - 16 4.32e - 16 . , and of F'(x), respectively (why?). In this case, only four evaluations of F are required for the calculation of A by the forward and six evaluations of F for the central difference quotient. Damping Under special monotonicity and convexity conditions (see Section 6.4), Newton's method can be shown to converge in its simple form. However, as in the univariate case, in general some sort of damping is necessary to get a robust algorithm. Systems ofNonlinear Equations 6.2 Newton's Method and Its Variants Table 6.2. Divergence with undamped Newton method improbable that the next step has a large damping factor. Therefore, one uses in the next iteration an adaptive initial step size, computed, for example, by 316 x/ 0 1 2 3 4 5 6 2.00000 -18.15888 -8.37101 -3.55249 -1.20146 -0.00041 0.04510 2.00000 -10.57944 -5.22873 -2.71907 -1.47283 0.09490 -1865.23122 21 22 23 24 0.13650 0.33360 0.19666 0.15654 0.15806 -4.06082 -1.49748 1.03237 IIF(x/)111 lip/III 5.le+00 6.8e + 02 1.7e + 02 4.le + 01 l.le + 01 2.4e + 01 3.5e + 06 3.3e + 01 1.5e + 01 7.3e + 00 3.6e + 00 2.8e + 00 1.ge + 03 9.3e + 02 + 00 + 01 + 00 + 00 4.4e + 00 2.7e +00 2.6e + 00 2.4e + 00 6.2e 2.0e 7.3e 6.6e eX = min(1, 4a o ld ) 317 (2.9) or the more cautious strategy _ {a a = min(1,2a) if a decreased in the previous step, otherwise. (2.10) Because in high dimensions, the computation of the exact Newton direction may be expensive, we allow for approximate search directions p by requiring only a bound IIF(x) + F'(x)pll :::: q'IIF(x)1I (2.11) on the residuals. For F : D <; lH. ~ lH. n the following algorithm results, with fixed numbers q, q' satisfying n 6.2.4 Example. The function Xl + 3ln IXII - F(x):= ( 2 2x l - XIXZ - 5Xl xi ) +1 0< q < 1, 0 < q' < 1 - q. 
has the Jacobian (2.12) 6.2.5 Algorithm: Damped Approximate Newton Method j 1 F , (x) = (1+3 X 4XI - X2 - 5 STEP 1: Choose a starting value For the starting vector xO := (2, 2l, the Newton method gives the results displayed in Table 6.2, without apparent convergence. As in the one-dimensional case, damping consists in taking only a partial step x :=x -s- ap, with a damping factor a chosen such that IIF(x + ap)11 < (1- qa)\IF(x)11 xO E D and compute II F (x") II. Set 1:= 0; a=1. STEP 2: Determine p such that (2.11) holds, for example by solving F' (x) P = -F(x). STEP 3: Update a by (2.9) or (2.10). STEP 4: Compute x = x + ap. If x = x, stop. STEP 5: If x if- int D or IIF(x)11 ::: (l - qa)IIF(x)lI, then replace a with aj2 and return to Step 4. Otherwise, replace I with I + 1 and return to Step 2. 6.2.6 Example. We reconsider the determination of a zero of (2.8) for some fixed q, 0 < q < 1. It is important to note that requiring II F(x + ap) II < II F(x)11 is not enough (see Example 5.7.7). Because ex = I corresponds to a full Newton step, which gives quadratic convergence. one first tests whether (2.8) is satisfied. If this is not the case, then one halves ex until (2.8) holds; we shall show that this is possible under very weak conditions. If in step I a smaller damping factor was used, then it is from Example 6.2.4 with the same starting point x O = (2, 2) T for which the ordinary Newton method diverged. The damped Newton method (with q = 0.5, q' = 0.4) converges for II· II = II . II, to the solution Iiml---+oo x' = (1.3734784, -1.5249648) T, with both strategies for choosing the initial step size (Table 6.3). The use of (2.9) takes fewer iterations and hence Jacobian 6.2 Newton's Method and Its Variants Systems ofNonlinear Equations 318 Table 6.3. With damping (2.9) IIF(x')111 IIp'III at nf 12 13 14 5.08e + 00 4.14e + 00 3.31e + 00 2.6ge + 00 2.31e + 00 2.21e + 00 2.16e + 00 2.12e + 00 2.0ge + 00 2.04e + 00 1.88e + 00 8.5ge - 01 3.93e - 02 8.lle -05 3.92e - 10 3.27e + 01 1.96e + 00 1.95e +00 2.1ge + 00 3.8ge + 00 5.44e + 00 6.70e + 00 7.76e + 00 6.91e + 00 4.7ge + 00 1.l6e + 00 2.55e - 01 l.04e - 02 2.32e - 05 1.04e - 10 6.25e - 02 2.50e - 01 2.50e - 01 2.50e - 01 6.25e - 02 3.12e - 02 3.12e - 02 3.12e - 02 3.12e - 02 1.25e - 01 5.00e - 01 1.00e + 00 l.OOe + 00 1.00e + 00 1.00e +00 6 8 12 16 22 27 31 35 39 41 43 45 47 49 51 0 1 2 5.08e + 00 4.14e+00 3.8ge + 00 3.27e + 01 1.96e + 00 l.94e + 00 6.25e - 02 6.25e - 02 1.25e - 01 6 8 10 13 14 15 16 1.25e + 00 6.00e - 01 3.30e - 02 l.78e - 04 2.91e - 09 4.05e - 01 l.36e-01 l.34e - 02 5.64e - 05 6.65e - 10 5.00e - 01 l.OOe + 00 1.00e + 00 1.00e + 00 l.OOe + 00 37 39 41 43 45 0 I 2 3 4 5 6 7 8 9 10 II With damping (2.10) 17 Damped Newton method, with damping 2.9 Table 6.4. 
Damped Newton method with I-norm evaluations, whereas (2.10) takes fewer function evaluations (displayed in the 319 With 2-norm IIF(x')112 11/112 0 1 2 3 5.00e + 00 2.9ge + 00 2.36e + 00 1.91e + 00 2.38e + 01 1.60e + 00 1.72e + 00 2.12e+00 6.25e 2.50e 2.50e 2.50e - 02 01 01 01 6 8 12 16 12 13 14 15 16 1.l6e + 00 2.62e - 01 1.27e - 02 2.3ge - 05 5.43e ~ 11 3.34e 9.75e 4.67e 7.37e 1.2ge 01 02 03 06 11 1.00e 1.00e 1.00e 1.00e l.OOe + 00 + 00 + 00 + 00 + 00 50 52 54 56 58 - nf a' With oo-norm 0 1 2 3 4 5 6 22 23 24 25 (XI IIF(xl)lloo Ilpllloo 5.00e + 00 4.36e + 00 3.36e + 00 3.28e + 00 3.26e + 00 3.25e + 00 3.25e +00 2.02e + 01 4.5ge + 00 6.30e + 00 1.04e + 01 1.40e + 01 l.83e + 01 3.02e + 01 1.25e 2.50e 3.12e 7.81e 3.91e 3.91e 9.77e - + 00 + 00 + 00 + 00 2.8ge + 04 5.40e + 04 9.07e + 04 1.27e + 05 1.86e 4.66e l.16e 5.82e - 3.24e 3.24e 3.24e 3.24e nf 01 01 02 03 03 03 04 09 10 10 11 5 8 15 21 26 30 36 119 125 131 136 column labeled nf). Similarly, the 2-norm gives convergence for both strategies, but not as fast as for the l-norm cf. Table 6.4. However, for the maximum norm (and (2.9», 22) the iterates stall near x 22 = (-0.39215, -0.20505l, although II F(X 1100 ~ 4 3.24. The size of the correction Il p 22 lloo ~ 2· 10 is a sign that the Jacobian 22 matrix is very ill-conditioned near x • There is a nearby singularity of the Jacobian, and our convergence analysis does not apply. The other initial step strategy also leads to nonconvergence with the maximum norm. 6.2.7 Proposition. Let the function F : D S; JRn ---+ JRn be continuously differentiable. Suppose that x E int D, that F'(x) is nonsingular, and that IIF(x) + F'(x)pll :::: q'IIF(x)11 #0. (2.13) If (2.12) holds then Ilpll :::: (q' + 1)IIF'(x)-IIlIIF(x)1I (2.14) Convergence We now investigate the feasibility and convergence of the damped Newton method. First, we show that the loop in Steps 4-5 is finite. and (2.8) holds for all sufficiently small a > O. If .c" is a regular zero and IIx - x' II is sufficiently small, then (2.8) holds even for 0 < a :::: 1. 6.2 Newton's Method and Its Variants Systems ofNonlinear Equations 320 6.2.8 Theorem. Suppose that the function F : ]R" ---+ ]R" is uniquely invertible, and F and its inverse function F- 1 are continuously differentiable. Then the sequence x' (l = 0, 1, 2, ...) computed by the damped Newton method either terminates with xl = x* or converges to x*, where x* is the unique zero of F. Proof From (2.13) we find Ilpll ~ IIF'(x)-IIIIIF'(x)pll ~ IIF'(x)-lll(IIF(x) ~ (q' IIF(x + F'(x)pll + IIF(x)ll) + l)IIF'(x)-IIIIIF(x)ll; so (2.14) holds. For 0 < a ~ 1 and x + ap)11 = + ap ED. it follows 321 Proof Because F- 1 is continuously differentiable, from (2.13) that + F'(x)ap + o(allpll)1I = 11(1 - a)F(x) + a(F(x) + F'(x)p)1I + o(allplD ~ (1- a)IIF(x)1I + a\\F(x) + F'(x)pll + o(allplD ~ (1 - (1 - q')a)IIF(x)11 + o(allpll)· IIF(x) However, for sufficiently small a. we have o(a lipiD < (l - q - q'va II F(x) II, so that (2.8) holds. If x* is now a regular zero, then F'(x)-I is defined and bounded as x ---+ x*; therefore, (2.14) implies exists; therefore, Step 2 can always be performed. By Proposition 6.2.7, I is increased after finitely many steps so that - unless by chance F(x /) = 0 and we are done - the sequence x' (I = 0, 1,2, ...) is well defined. 
It is also bounded because IIF(x/)11 ~ IIF(xo)11 and {x E]R" IIIF(x)11 ~ IIF(xo)11l is bounded as the image of the bounded closed set {y E ]R" I II y II ~ II F (xo) II} under the continuous mapping F- 1 • Therefore, the sequence x' has at least one limit point, that is. there is a convergent subsequence x': (v = 0, 1,2, ...) whose limit we denote with x*. Because of the monotonicity of IIF(x /)lI. we have Ilpll = O(IIF(x)II)· IIF(x*)1I = infIIF(xl)ll. 10:.0 Because II F (x) II ---+ 0 as x ---+ x*, one obtains for sufficiently small [x - x* II the relations Similarly, (2.14) implies that the pi, are bounded; so there exists a limit point p* of the sequence pi, (v = 0, 1,2, ...). By deleting appropriate terms of the sequence if necessary, we may assume that p* = lim, ~ 00 pi, . We now suppose that F(x*) -=1= 0 and derive a contradiction. By Proposition 6.2.7 there exists an a* E {1, ~, ~, ~, .,. } with x +ap E D and o(allplI) = o(aIIF(x)ll) < (1 - q - q')aIlF(x)11 for 0 < a ~ 1. Therefore, (2.8) now holds in this extended range. IIF(x* + o" p*)II < (1 - qa*)IIF(x*)II; 0 It follows that if F' (x) is nonsingular in D, the damped Newton method never cycles. By (2.8). a is changed only a finite number of times in each iteration. Moreover, in the case of convergence to a regular zero, for sufficiently large I, the first a is accepted immediately; and because of Step 3, one takes locally only undamped steps (a = 1). In particular, locally quadratic convergence is retained if F' is Lipschitz continuous and p is chosen so that IIF(x) + F'(x)pll = O(IIF(x)11 2 ). Therefore, for all sufficiently large I = I; we have By construction of the a. it follows that a :::: a*. In particular. a := lim inf ai, :::: a" > O. If we now take the limit in IIF(x*)1I ~ IIF(i)11 = IIF(x + ap)11 < (1 - qa)IIF(x)ll, one obtains the contradiction In an important special case, the global convergence of the damped Newton method can be shown. IIF(x*)1I ::::: (1- qa)IIF(x*)11 < IIF(x*)II. 322 Systems of Nonlinear Equations 6.3 Error Analysis Therefore, II F(x*) II = 0; that is, x* is the (uniquely determined) zero of F. In particular, there is only one limit point, that is, limhoo xl = x*. 0 323 i the approximate bound 6.2.9 Remarks. (i) In general, some hypothesis like that in Theorem 6.2.8 is necessary because the damped Newton method may otherwise converge slowly toward the set of points where the Jacobian is singular. Whether this happens depends, of course, a lot on the starting point x", and cannot happen if the level set {x E D IIIF(x)11 ::: IIF(xo)111 is compact and F'(x) is invertible and bounded in this level set. (ii) Alternative stopping criteria used in practice are (with some error tolerance 8> 0) II F(i) II ::: 8 Therefore, t~e limiting accuracy achievable is of the order of II F' (x*)-11I8 F, and the maxln:al error in F(i) can be magnified by afactorofup to II F'(x*)-III. In the following, we make this approximate argument more rigorous. Norm-wise Error Bounds We deduce from Banach's fixed-point theorem a constructive existence theorem for zeros and use it for giving explicit and rigorous error bounds. 6.3.1 Theorem. Suppose that the function F .. DelFt" . I . . _ -+ m,,· m. IS continuous y differentiable and that B := {x E 1Ft" Illx - xO II 8} lies in D. s or Ilx - xii::: 811xll. (i) (iii) If the components of F(xo) or F' (x") have very different orders of magnitude. then it is sensible to minimize the scaled function II C F (x) II instead of II F (x) II, where C is an appropriate diagonal matrix. 
A useful possibility is to choose C such that C F' (x") is equilibrated; if necessary, one can modify C during the computation. (iv) The principal work for a Newton step consists of the determination of the matrix A as a suitable approximation to F' (x), and the solution of the system of linear equations Ap = - F(x). Compared with this, several evaluations of the function F usually playa subordinate role. If there is a matrix C E lFt"x" with III-CF'(x)lI:::fJ<l for all x e B, and then F has exactly one zero x* in B, and 80/(1 + fJ) ::: Ilxo - x*1I ::: 80/(1 - fJ) ::: 8. (ii) Moreover, if fJ < 1/3 and 80 ::: 8/3, then Newton's method startedfrom xO cannot break down and converges to x*. 6.3 Error Analysis Limiting Accuracy The relation F(x) = F[i, x*](i - x") (3.1) derived in Lemma 6.1.2 gives IIi -x*11 = IIF[x,x*r'F(.t)II::: IIF[x,x*rlll·IIF(x)ll. If X' is the computed approximation to x *, then one can expect only that II F(X') II ::: 8F for some bound 8F of the accuracy of function values that depends on F and on the way in which F is evaluated. This gives for the error in Proof Because II I - C F' (x) II < 1, C and F' (x) are nonsingular for all x E B. Ifwe define G: D -+ 1Ft" by G(x):= x -C F(x), then F(x*) = 0 {:} C F(x*) = o{:} ~* = G(x*). Thus we must show that G has exactly one fixed point in B. ~or this ,purpose, we prove that G has the contraction property. First, G' (x) = - C F (x ), so IIG'(x)11 ::: fJ < 1 for x E B, and IIG(x) - G(y)11 ::: fJllx _ yll for all .r , y E B. In particular, for x E B , IIG(x) -xoll S IIG(x) - G(xo)11 + IIG(x o) -xoll s fJllx - xOIl + IIC F(x°)ll s fJ8 + 8(l - fJ) S 8; 6.3 Error Analysis Systems of Nonlinear Equations 324 therefore, G (x) E B for x E and f3 can be determined from f3 := II 1- C F'(x) 1100 by means of interval arithmetic. (ii) The best choice of the matrix C is an approximation to F' (x") -I or (mid F'(X))--l. Indeed, for C = F'(xO)-1 ,xo -+ x* and s -+ 0, thecontraction factor f3 approaches zero. Therefore, one can regard f3 as a measure of the nearness of xO to x* relative to the degree of nonlinearity of F. (iii) In an implementation, one must choose e and co appropriately. Ideally, e should be as a small as possible because f3 generally becomes larger as B becomes wider. However, because co ~ cO - (3) must be guaranteed, one cannot choose e too small. A useful way to proceed is as follows: One determines the bounds B. So G is a contraction with contraction factor f3, and (i) follows from the Banach fixed-point theorem. For the proof of (ii), we consider the function G : D -+ JRn defined by _ --I G(x) :=x - A in which x E B and A:= r(x) whence for x E F(x) are fixed. Then III - CAli ~ f3, I I II(CA)- II ~ 1- f3' B, IIG'(x)1I = III - A-I F'(x)11 = II(CA)-IU - C F'(x) - ~ II(CA)-III(III I < --(f3 - 1-f3 U- CF'(x)11 CAm + III - CAli) and then sets e := 2eol (1 - (30). Then one tests the relation 2f3 + (3) = 325 -_. III-CF'(x)lloo=f3~ 1~f30 1-f3 forx:=[xo-ee,xo+ee], (3.3) It follows from this that - (x) - G - (y) II ~ I 2f3 II G _ f3 Ilx - y II f or x, y E B. (3.2) We apply this inequality to Newton's method, from which the required inequality follows. Generally, if (3.3) fails, the zero is either very ill conditioned or xO was not close to a zero. 6.3.3 Example. To show that XI + 1 :=x l _ F'(XI)-1 F(x l). For l = 0, (i) and the hypothesis give IIxo - x* II ~ col (1 - (3) ~ ~eo ~ !e. If l now Ilx l - x* II ~ 1. e for some l :::: 0, then Ilxl - xO II ~ Ilx - x* II + Ilxo- x* II ~ 2 . e, so xl E B. 
We can therefore apply (3.2) With x = x = x I , y = x * to 0 btai tam Ilxl + 1 - x*1I ~ 1 ! 2 F(x) = (x? F'(x) = 1, Ilxl-x*1I ~ (2f310-f3))lllxo-x*II-+0 es l Thus Newton's method is well-defined and convergent. 8xI + 5) I has a zero in the box x = G~::l), we apply Theorem 6.3.1 with the maximum norm and xO = (~:~). E = 0.5 (so that B = x). We have f3llxl - x*11 because X I + 1 = G(x l ) . Because f3 < it follows that 2f31(1 - (3) < I and 1 IIx l + - x*1I ~ &e; from this it follows by induction that + 5xI + 8X2 - xi + 5X2 - (2X-8+5 8) + I 2X2 5 F'(x) = ([5,7] -8 ' and for r-» co. C:=(midF'(x))-1 = (0.06 0.08 o -0.08) 0.06 ' we find 6.3.2 Remarks. (i) The theorem is useful to provide rigorous a posteriori error bounds for approximate zeros found by other means. For the maximum norm, B = [xu - ee, xO + ee] F(xo) = ( 1- CF'(x) =: x E JIJRn, CF(x o) = (0.125) 1.75), -0.25 = [-1 0.125 ' , 1] (0.06 0.08 0.08) 0.06 . Systems of Nonlinear Equations 326 Thus EO = 0.125, f3 = 0.14, and Eo :::: E(l - f3). Thus there is a unique zero x* 327 6.3 Error Analysis for x, x* E x, where the Krawczyk operator K : TIJR(n x JR(n ~ TIJR(n is defined by in x, and we have 01"5 < 0 . 109 < -'--1.14 - Ilx* - xOll ao :::: 0.125 - - < 0.146. 0.86 In articular x* E ([0.354.0.646J). Better approximations would have given shar[0.354.0.646] p , per bounds. K(x,x):=x - CF(x) - (/- CFf(x»)(x -x). Therefore, one may use in place of the interval Newton iteration (3.4) the Krawczyk iteration XI+I:=K(x',xl)nx' The Interval Newton Method In order to obtain simultaneous componentwise bounds that take rounding error into account, one uses an interval version of Newton's method. We suppose for this purpose that x is an interval that contains the required zero x*, and we calculate rex). Then F'(x):::: F!(~):::: F'(x) for all ~ E X and F[x, x*] = fo1 F' (x* + t (x - x") dt E r (x) for all x EX. Therefore, (3.1) gives x* = x - A-I F(x) EX - D~(F!(x). F(x», where ~ (r (x), F (x» denotes the hull of the solution set of the linear interval equation Az = F(x) (A E F'(x), x E x). This gives rise to the interval Newton (3.5) withx'EX', again with a guarantee of not losing any zero contained in x", Note that because ofthe lack ofthe distributive law for intervals, the definition of K cannot be "simplified"tox + C(r(x)(x - x) - F(x»; this would produce poor results only, because the radius would be bigger than that of x. It is important to remember that in order to take rounding error into account, one must use in the interval Newton iteration or the Krawczyk iteration in place of x' the thin interval Xl = [xl, x']. The Krawczyk operator can also be used to prove the existence of a zero in x. 6.3.4 Theorem. Suppose that x EX. (i] If F has a zero x* E x then x* E K(x, x) n x. (ii) If K(x, x) n x = 0 then F has no zero in x. (iii) If either iteration By our derivation, each Xl contains any zero in x''. . In each step, a system oflinear equations must be solved. In practice, on~ uses in place of the hull a more easily computable enclosure, using an approxImate midpoint inverse C as preconditioner. . Following Krawczyk, one can proceed in a slightly simpler way, also using C but saving the solution of the linear interval system. We rewrite (3.1) as C F(x) = C F[x, x*](x - x*) «i». x) s;:: (3.6) intx or the matrix C in (3.5) is nonsingular and (3.7) K(x,x)S;::x then F has a unique zero x* EX. and observe that x* = x - C F(x) - (l - C F[x, x*])(x - x"), where I - C F[x, x*] can be expected to be small by the construction of C. 
From this it follows that Proof (i) holds by construction, and (ii) is an immediate consequence of (i). (iii) Suppose first that (3.6) holds. Because B := 1- C s;:: I - C F'(x) =: B, we have rex) IBI radx = IBI rad(x - x) = rad(B(x - x) x* E K(x, x) :'S rad(B(x - x)) = rad K(x, x) < radx 328 6.4 Further Techniques for Nonlinear Systems Systems ofNonlinear Equations ones. Because high order polynomial systems usually have a number of real or complex zeros that is, exponential in the dimension, continuation methods for finding all zeros are limited to low dimensions. by (3.6). For D = Diag(radx), we therefore have " \\1 - D-ICF'(x)Dlloo 329 IBlik rad x, = IID-I BDlloo = max L. --d---'k ra X, I = max (IB\radx); < 1. ; rad x, 6.4 Further Techniques for Nonlinear Systems Thus C P(x) is an H-matrix, hence nonsingular, and as a factor, Cis nonsingular. Thus it suffices to consider the case where C is nonsingular and (3.7) holds. In this case, K (x, x) is (componentwise) a centered form for G (x) := x - C F (x); therefore, G(x) E K (x, x) s; x for all x EX. By Brouwer's fixed-point theore~, G has a unique fixed point x* E x, that is, x* = x* - C F(x*). Because C IS nonsingular, it follows that x* is the unique zero of F in x. 0 If Xo is a narrow (but not too narrow) box symmetric around a good approximation of a well-conditioned zero, Theorem 6.3.4 usually allows one to verify existence in that box, and by refining it with Krawczyk's iteration, we may get very narrow enclosures. . For large and sparse problems, the approximate inverse C is full, which makes Krawczyk's method prohibitively expensive. A modification that respects spar- An Affine Invariant Newton Method The nonlinear systems arising in solving differential equations, for example, for stiff initial value problems by backward differentiation formulas (Section 4.5), have quite often very ill-conditioned Jacobians. As a consequence, II F (x) II has steep and narrow curved valleys and a damped Newton method based on minimizing II F (x) II has difficulties because it must follow these valleys with very short steps. For such problems, the natural merit function for measuring the deviation from a zero is a norm IIll(x) II ofthe Newton correction ll(x) = - F'(x)-I F(x). Near a zero, this gives the distance to it almost exactly because of the local quadratic convergence of Newton's method. Moreover, it is unaffected by the condition of F' (x) because it is affine invariant, that is, any linear transformation p(x):= C F(x) with nonsingular C yields the same ll(x). Indeed, the matrix C cancels out because sity is given in Rump [84]. Finding All Zeros If one has found a zero ~ of F and conjectures that another zero x* exists, then just as in the univariate case, one can attempt to determine it by deflation. For this one needs a damped Newton method that minimizes II F(x) 1I/IIx - ~ II instead ~f II F (x) II. However, unlike in dimension 1, this function is .no 10ng~r smooth near ~. Therefore, deflation in higher dimension tends to give erratic results, and repeated deflation often finds several but rarely all the zeros. All zeros in a specified box can be found by interval methods. In .anal~g~ .to the univariate case, interval methods for finding all zeros scan a given initial box x E [JR." by exhaustive covering with subboxes that contain no zero, are ti~y and guaranteed to contain a unique regular zero, or are tiny and.likel y to cont~lll a nonregular zero. 
This is done by recursively splitting x until correspondmg tests apply, such as those based on the Krawczyk operator. For details, see Hansen [39], Kearfott [49], and Neumaier [70]. Continuation methods, discussed in the next section, can also find all zeros of polynomial systems, at least with high probability (in theory with probability one). However, they cannot guarantee to find real zeros before complex However, the affinely equivalent function p(x) = F'(xO)-I F(x) has wellconditioned P'(r") = I. Therefore, minimizing Illl (x) I instead of I F (x) II improves the geometry of the valleys and allows one to proceed in larger steps in ill-conditioned problems. Unfortunately, Illl (x) I may increase along the Newton path, defined by the family of solutions x*(t) of F(x) = (1 - t)F(xo), (t E [0,1]) along which a method based on decreasing II F(x) II would roughly proceed (look at infinitesimal steps to see this). Hence, this variant may get stuck in a local minimum of Illl (x) I in many situations where methods based on decreasing I F (x) I converge to a zero. A common remedy is to test for a sufficient decrease by for some positive q < I, so that the matrix is kept fixed during the determination of a good step size (see, e.g., Deuflhard [19,201, who also discusses other implementation details). Unfortunately, this means that the merit function changes 330 Systems of Nonlinear Equations 6.4 Further Techniques for Nonlinear Systems at each step, making a global convergence proof along the old lines impossible. Indeed, the method can cycle for innocuous problems, where methods that use II F(x) II have no problems (cf, Exercise 9). However, a global convergence proof can be given if we mix the two approaches, using an acceptance test similar to (4.1) in regions where II t. (x) II increases, and switching back to the merit function II t.(x) II when it has moved sufficiently far over the region of increase. The algorithm presented next is geared toward large-scale applications where it is often inconvenient to solve exactly for the Newton step. We therefore assume that in each step we have a (possibly crude) approximation M(x) to F' (x) whose structure is simple enough that a factorization of M (x) is available. Often, M(x) = LR, where Land R are approximate triangular factors of F' (x), obtained by ignoring terms in the factorization that would destroy sparsity. One speaks in this case of incomplete triangular factorizations, and calls M (x) the preconditioner. The case M (x) = F' (x) is, of course, admissible and produces an affine invariant algorithm. We now define the modified Newton correction t..(x) = -M(X)-l F(x). 331 or STEP4: Setx'+I:=x'+a/p',I 1+1 If 4 .:= . (.6) was violated, goto Step 2, else goto Step 1. We first show that Step 3 can always be completed, at least when F is defined everywhere. 6.4.2 Proposition. Suppose that F(x' a> (1- qdlfk (hen either (4.7) holds for Oil = + apl) is dejinedfor 0 ::s a ::s «.I] with fkl ::s fk ::s fkz (4.8) a, or the function dejined by (4.2) (4.9) We use an arbitrary but fixed norm that should be a natural measure for changes in x. The l-norm is suitable if all components of x have about the same magnitude. The algorithm depends on some constants that must satisfy the restrictions o< qo < qt < q: < 1, 0 < fkl < fk2 < 1 - qQ. (4.3) hasa-eroaE[O -] L ,01. <. . n porticuian Step 3 can be completed if a> (1 _ )/ . 
ql fkz· Proof In general, we have 1 IIM- F(x + ap)II + aF'(x)p)II + o(a) 1 = 110 - a)M- F(x) + OIM- (F(x) + F'(x)p)II + o(a) ::s 11 - alllM- 1 F(x) II + lalllM-1(F(x) + r(x)p) II = IIM-1(F(x) 1 6.4.1 Algorithm: Affine Invariant Modified Newton Method +o(a). STEP 0: Start with 1:= 0, k:= - 1, ILl = 00. STEP 1: Calculate a factorization of M(x') and t..(x l) = -M(x')-l F(x l ) . If By (4.5) and (4.8), we conclude that for sufficiently small a > 0, (4.4) setk:=k+l,Mk:=M(x') and ?h:=IIMx')II. STEP 2: Calculate an approximate Newton direction pi such that STEP 3: Find IIMkIF(x' + ap') II II M - 1 F(x l) II k a, such that either (4.6) ::s 1- a + aqo + o(a) < I - all, (4.10) 6.4 Further Techniques for Nonlinear Systems Systems ofNonlinear Equations 332 Note that with M = (Ml + Mz)/2, approximate zeros of if! also satisfy (4.7), so that one may adapt any root finding method based on sign changes to check each iterate for (4.6) and (4.7), and quit as soon as one of the two conditions holds. In order to preserve fast local convergence, one should first try at = 1. (One may also use more sophisticated initial values for az as long as the choice a, = 1 results locally.) If the first guess does not work and if!(az) > 0, one already has a sign change using the artificial value if!(0) = M- 1 + qo < 0 that (cf, (4.10» is an upper bound to the true range of limiting values of if!(a ) as a -+ O. If, however, !p(at) < 0, increase az by repeated doubling (or so) until (4.6) or (4.7) holds or a sign change is found. Hence, k increases without bound. By Steps 1 and 2, Ok+1 < q20k, hence s, ::: q~oo. Therefore, (4.11) If I M k I remains bounded (and by continuity, this holds in particular when Zk remains bounded), then the right side of (4.11) converges to zero and (i) holds. If not, then (ii) holds. 0 In particular, in an important case where all difficulties are absent, we have the following global convergence theorem. 6.4.5 Theorem. If F is continuously differentiable in lRn with globally nonsingular F' (x) and bounded M (x), then 6.4.3 Remarks. (i) One must be able to store two factorizations, those of M k and M (x"). (ii) In Step 2, p' = ~ (x') works if II I - M (x') -I F' (x') II ::: qo, that is, if M (x) is a sufficiently good approximation of F' (x). In practice, one may want to choose qo adaptively, matching it to the accuracy of x' . lim F(l) = k-s co l Quasi-Newton Methods It is possible to generalize the univariate secant method to higher dimensions; the resulting methods are called quasi-Newton methods. Frequently, they perform ties are classified in the following. 6.4.4 Proposition. Denote by o. Proof In Proposition 6.4.4, (ii) contradicts the boundedness of M, (iii) the nonsingularity of F', and (iv) cannot hold because F is defined everywhere. The only alternative left is (i). 0 In general, all sorts of misbehavior may occur besides regular convergence (and, as in the univariate case, this must be the case for any algorithm wit.h~~t additional conditions that guarantee the existence of a solution). The pOSSIbIlI- the values of xl at each completion ofStep I. Then one of the following holds.' (i) F(l) -+ 0 as k -+ 00. (ii) For some subsequence, 333 llzk" II and IIM(zk")11 -+ 00. (iii) F'(x') is undefined or singular for some I. (iv) At some iteration, F(x z +a/) is not definedfor large a. 
In general, all sorts of misbehavior may occur besides regular convergence (and, as in the univariate case, this must be the case for any algorithm without additional conditions that guarantee the existence of a solution). The possibilities are classified in the following.

6.4.4 Proposition. Denote by $z^k$ the values of $x^l$ at each completion of Step 1. Then one of the following holds:
(i) $F(x^l) \to 0$ as $l \to \infty$.
(ii) For some subsequence, $\|z^k\| \to \infty$ and $\|M(z^k)\| \to \infty$.
(iii) $F'(x^l)$ is undefined or singular for some $l$.
(iv) At some iteration, $F(x^l + \alpha p^l)$ is not defined for large $\alpha$, and Step 3 cannot be completed (cf. Proposition 6.4.2).

Proof. Suppose neither (iii) nor (iv) holds. Then Step 2 can be satisfied by choosing $p^l = -F'(x^l)^{-1}F(x^l)$, and Step 3 can be completed by assumption. If $k$ remains bounded, then for $k$ fixed at its largest value, (4.6) will always be violated, and therefore (4.7) must hold. Equation (4.7) implies that $\|M_k^{-1}F(x^l)\|$ is monotone decreasing, and because (4.6) is violated, it must converge to a positive limit. Going to the limit in (4.7) therefore gives $\alpha_l \to 0$. However, now (4.10) with $\mu = \mu_2$ gives a contradiction. Hence, $k$ increases without bound. By Steps 1 and 2, $\delta_{k+1} < q_2\delta_k$, hence $\delta_k \le q_2^k\delta_0$. Therefore,

$\|F(x^l)\| \le \|M_k\|\,\|M_k^{-1}F(x^l)\| \le \|M_k\|\,q_2^k\delta_0.$  (4.11)

If $\|M_k\|$ remains bounded (and by continuity, this holds in particular when the $z^k$ remain bounded), then the right side of (4.11) converges to zero and (i) holds. If not, then (ii) holds. □

In particular, in an important case where all difficulties are absent, we have the following global convergence theorem.

6.4.5 Theorem. If $F$ is continuously differentiable in $\mathbb{R}^n$ with globally nonsingular $F'(x)$ and bounded $M(x)$, then

$\lim_{l\to\infty} F(x^l) = 0.$

Proof. In Proposition 6.4.4, (ii) contradicts the boundedness of $M$, (iii) the nonsingularity of $F'$, and (iv) cannot hold because $F$ is defined everywhere. The only alternative left is (i). □

Quasi-Newton Methods

It is possible to generalize the univariate secant method to higher dimensions; the resulting methods are called quasi-Newton methods. Frequently, they perform well, although they are generally regarded as somewhat less robust because, due to the approximations involved, it is not guaranteed that $\|F(x + \alpha p)\|$ decreases along a computed quasi-Newton direction $p$. (Indeed, to guarantee that, one needs derivative information.)

To motivate the method, we write the univariate secant method in the form $x_{l+1} = x_l - c_l f(x_l)$, where $c_l = 1/f[x_l, x_{l-1}]$. Noting that $c_l(f(x_l) - f(x_{l-1})) = x_l - x_{l-1}$, we see that a natural multivariate generalization is to use

$x^{l+1} = x^l - C_l F(x^l),$

where $C_l$ is a square matrix satisfying the secant condition $C_l y_l = s_l$, with

$y_l := F(x^l) - F(x^{l-1}), \quad s_l := x^l - x^{l-1}.$

Note that no linear system must be solved, so that for problems with a dense Jacobian, the linear algebra work is reduced from $O(n^3)$ operations per step to $O(n^2)$.

The secant condition can easily be enforced by defining

$C_l = C_{l-1} + (s_l - C_{l-1}y_l)u_l^T$

with an arbitrary vector $u_l$ such that $u_l^Ty_l = 1 \ne 0$. Among the many possibilities, the choice regarded as best for well-scaled $x$ is $u_l = C_{l-1}^Ts_l/(s_l^TC_{l-1}y_l)$. This defines Broyden's method and has reasonably good local convergence properties (see Dennis and Schnabel [18] and Gay [28]).

Of course, one loses quadratic convergence for such methods, and global convergence proofs are available only under stringent conditions. However, one can ensure local superlinear convergence with convergence order at least $\sqrt[n]{2}$. Therefore, the number of significant digits doubles about every $n$ iterations. This implies that, for problems with dense Jacobians, quasi-Newton methods are locally not much faster than a discrete Newton method, which uses $n$ function values to get a Jacobian approximation. For sufficiently sparse problems, much fewer function values suffice, and discrete Newton methods are usually faster. However, far from a solution, the convergence speed is much less predictable. One again needs a damping strategy, and hence uses the search direction $p^l = -C_lF(x^l)$ in place of the Newton direction.
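The following MATLAB function is a minimal sketch of such a quasi-Newton iteration. The finite-difference initialization of C and the particular rank-one choice of $u_l$ given above are used as defaults, and the loop is undamped, so the sketch is only locally convergent:

function x = broyden(F, x, maxit, tol)
% Quasi-Newton (Broyden) iteration: a minimal, undamped sketch.
% The initial C is the inverse of a finite-difference Jacobian at x0.
n = numel(x); h = 1e-7; Fx = F(x);
J = zeros(n);                          % finite-difference Jacobian
for k = 1:n
    e = zeros(n,1); e(k) = h;
    J(:,k) = (F(x+e) - Fx)/h;
end
C = inv(J);                            % approximate inverse Jacobian
for l = 1:maxit
    s = -C*Fx;                         % quasi-Newton step -C*F(x)
    x = x + s;
    Fnew = F(x); y = Fnew - Fx; Fx = Fnew;
    if norm(Fx) <= tol, break; end
    u = (C'*s)/(s'*(C*y));             % choice with u'*y = 1
    C = C + (s - C*y)*u';              % secant update: now C*y = s
end
end

In practice, one combines the update with the damping strategy just mentioned, replacing the full step s by a damped step $\alpha_l s$.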
Continuation Methods

For functions that satisfy only weak smoothness conditions, there are special algorithms that are based on topological concepts: homotopies (that generate a smooth path to a solution) and simplicial decompositions (that generate piecewise linear paths). An in-depth discussion is given in Allgower and Georg [5]; see also Morgan [64] and Zangwill and Garcia [100]. Here we only outline the basic idea of homotopy methods.

For $t \in [0, 1]$ one considers curves $x = x(t)$ (called homotopies) for which $x(0) = x^0$ is a known point and $x(1) = x^*$ is a zero of $F$. Such curves are implicitly defined by the solution set of $F(x, t) = 0$, with an augmented expression $F(x, t)$ that satisfies $F(x^0, 0) = 0$ and $F(x, 1) = F(x)$ for all $x$. The basic examples are

$F(x, t) = F(x) - (1 - t)F(x^0)$

and

$F(x, t) = tF(x) - (1 - t)F'(x^0)(x - x^0).$

Both choices are natural in that they are affine invariant; that is, the curves on which the zeros lie are not affected by linear transformations of $x$ or $F$.

The task of solving $F(x) = 0$ is now replaced by tracing the curve from the known starting point at $t = 0$ to the point given by $t = 1$. Because the curve guides the way to go, this may be the easiest way to arrive at a solution for many hard zero-finding problems. However, it is possible that the path defined by $F(x, t) = 0$ never arrives at $t = 1$, or turns back, or bifurcates into several paths, and all this must be handled by a robust code. Existence of the path can often be guaranteed by a fixed-point theorem.

The path-following problem is the special case $m = 1$, $E = [0, 1]$ of the more general problem of solving a parametrized system of equations

$F(x, y) = 0$

for $x$ in dependence on a vector $y$ of parameters from a set $E \subseteq \mathbb{R}^m$, where $F: D \times E \to \mathbb{R}^n$, $x \in D \subseteq \mathbb{R}^n$. This system of equations determines an $m$-dimensional manifold of solutions, and often a functional dependence $H: E \to D$ such that $F(H(y), y) = 0$ is sought. (Of course, in general, the manifold need not have a one-to-one projection onto $E$, and this function may have to be pieced together from several solution branches.) The well-known theorem on implicit functions gives sufficient conditions for the existence of $H$ in a sufficiently small neighborhood of a solution. A global statement about the existence of $H$ is made in the following theorem, whose proof (see Neumaier [68]) depends on a detailed topological analysis of the situation. As in the fixed-point theorem of Leray and Schauder, a boundary condition is the main thing to be established.

6.4.6 Theorem. Let $D \subseteq \mathbb{R}^n$ be closed, let $E \subseteq \mathbb{R}^m$ be simply connected, and let $F: D \times E \to \mathbb{R}^n$ be such that

(i) the derivative $\partial_1 F(x, y)$ of $F$ with respect to $x$ exists in $D \times E$ and is continuous and nonsingular there;
(ii) there exist $x^0 \in \operatorname{int} D$ and $y^0 \in E$ such that

$F(x, y) \ne tF(x^0, y^0)$ for all $x \in \partial D$, $y \in E$, $t \in [0, 1]$.

Then there is a continuous function $H: E \to D$ with $F(H(y), y) = 0$ for $y \in E$.

To trace the curve determined by

$F(x(t), t) = 0$  (4.12)

there are a variety of continuation methods. The smooth continuation methods assume that $F$ is continuously differentiable and are based on the differential equation

$F_x(x(t), t)\,\dot x(t) + F_t(x(t), t) = 0$  (4.13)

obtained by differentiating (4.12) using the chain rule. This is an implicit differential equation, but as long as $F_x$ is nonsingular, we can solve for $\dot x(t)$:

$\dot x(t) = -F_x(x(t), t)^{-1}F_t(x(t), t).$  (4.14)

(A closer analysis of the singular case leads to turning points and bifurcation points; see, e.g., Chow and Hale [11] and Seydel [88]. Now, (4.14) can be solved by methods for ordinary differential equations. Alternatively, one may solve (4.13) by methods for differential-algebraic equations; cf. Brenan, Campbell, and Petzold [9].) If $F_x$ is well conditioned, (4.14) is easy to solve. However, for the application to solving systems of equations by a homotopy, the well-conditioned case is less relevant because one solves such problems easily with damped Newton methods.
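For the first basic homotopy, $F(x, t) = F(x) - (1 - t)F(x^0)$, one has $F_t(x, t) = F(x^0)$, so (4.14) reads $\dot x(t) = -F'(x(t))^{-1}F(x^0)$. The following MATLAB sketch traces this curve with ode45 and polishes the endpoint with a few Newton steps; the test system and the finite-difference Jacobian helper fdjac are illustrative assumptions, not taken from the text:

% Trace the homotopy F(x,t) = F(x) - (1-t)*F(x0) = 0 via (4.14),
% then polish with Newton's method at t = 1. Illustrative sketch.
F   = @(x) [x(1)^2 + x(2)^2 - 4; exp(x(1)) + x(2) - 1];
x0  = [1; 1];
Fx0 = F(x0);
xdot = @(t,x) -fdjac(F,x) \ Fx0;        % differential equation (4.14)
[~, X] = ode45(xdot, [0 1], x0);
x = X(end,:).';                         % predictor at t = 1
for i = 1:5                             % Newton corrector for F(x) = 0
    x = x - fdjac(F,x) \ F(x);
end

function J = fdjac(F, x)                % forward-difference Jacobian
    n = numel(x); h = 1e-7; Fx = F(x); J = zeros(n);
    for k = 1:n
        e = zeros(n,1); e(k) = h;
        J(:,k) = (F(x+e) - Fx)/h;
    end
end

A robust code would additionally monitor the residual along the curve and detect turning points, as discussed next.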
In the ill-conditioned case, (4.14) is stiff, but advantage can be taken of good predictors computed by extrapolating from past values. Due to rounding errors and discretization errors in the solution of the differential equation, a numerical solution based on (4.14) alone tends to wander away from the manifold. To avoid this, one must monitor the residual size $\|F(x(t), t)\|$ of the approximate solution curve $x(t)$, and reduce the step size if the residual size is deemed to be too large. For implementation details see, for example, Rheinboldt [81].

Global Monotone Convergence of Newton's Method

Newton's method is globally and monotonically convergent in an important special case, namely when the derivative $F'(x)$ is an M-matrix and, in addition, either $F$ is convex or $F'$ is isotone in $x$. (M-matrices were defined in Section 2.4; see also Exercises 2.19 and 2.20.) The assumptions generalize the monotonicity and convexity assumptions in the univariate case; they are, for example, met in systems of equations resulting from the discretization of certain nonlinear ordinary or partial differential equations.

6.4.7 Theorem. Let $F: D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable. Let $F'(x)$ be an M-matrix in a convex subset $D_0 \subseteq D$, and suppose that there are vectors $u, v > 0$ such that

$F'(x)u \ge v$ for all $x \in D_0$.  (4.15)

Suppose that $x^0 \in D$, that $F(x^0) \ge 0$, and that $[x^0 - u\,w^TF(x^0),\, x^0] \subseteq D_0$ for a vector $w$ with $w_iv_i \ge 1$ $(i = 1, \dots, n)$. Then:

(i) $F$ has exactly one zero $x^*$ in $D_0$.
(ii) Each sequence of the form

$x^{l+1} := x^l - A_l^{-1}F(x^l)$  (4.16)

in which the nonsingular matrices $A_l$ with $A_l^{-1} \ge 0$ remain bounded as $l \to \infty$ and satisfy the ("subgradient") relations

$F(x) \ge F(x^l) + A_l(x - x^l)$ for all $x \in D_0$ with $x \le x^l$  (4.17)

remains in $D_0$ and converges monotonically (componentwise decreasing) to $x^*$. Also, the error estimate

$x^* \in [z^l, x^l]$, where $z^l := x^l - u\,w^TF(x^l)$,  (4.18)

holds.

Proof. We proceed in several steps.

STEP 1: For all $x, y \in D_0$, the slope matrix $A = F[x, y]$ with

$F(x) - F(y) = A(x - y)$  (4.19)

is an M-matrix with $Au \ge v$. Indeed, $F(x) - F(y) = A(x - y)$ holds by Lemma 6.1.2. Because $F'(x)$ is an M-matrix for $x \in D_0$,

$A_{ik} = \int_0^1 F'_{ik}(y + t(x - y))\,dt \le 0$ for $i \ne k$,

and by (4.15),

$Au = \int_0^1 F'(y + t(x - y))u\,dt \ge \int_0^1 v\,dt = v;$

in particular (Exercise 2.20), $A$ is an M-matrix.

STEP 2: We use induction to show that

$F(x^l) \ge 0$  (4.20)

and

$z^k \le x^{l+1} \le x^l$ for all $k \le l$  (4.21)

for all $l \ge 0$. Because this holds for $l = 0$, we assume that (4.20) and (4.21) hold for an index $l$ and for all smaller indices. Then

$x^{l+1} \le x^l$  (4.22)

because $A_l^{-1} \ge 0$. If now $x^k \in D_0$ and $k \le l$, then $z^k \le x^k$ because $F(x^k) \ge 0$, and by Step 1 (applied with $x = z^k$, $y = x^k$),

$F(z^k) = F(x^k) + A(z^k - x^k) = F(x^k) - Au\,w^TF(x^k) \le F(x^k) - v\,w^TF(x^k) \le F(x^k) - F(x^k) = 0,$

since $w_iv_i \ge 1$. Therefore, (4.21) and (4.17) imply

$0 \ge F(z^k) \ge F(x^l) + A_l(z^k - x^l) = A_l(x^l - x^{l+1}) + A_l(z^k - x^l) = A_l(z^k - x^{l+1}).$

Because $A_l^{-1} \ge 0$, it follows that $0 \ge z^k - x^{l+1}$, so $z^k \le x^{l+1}$. In particular, $x^{l+1} \in [z^0, x^0] \subseteq D_0$, whence by (4.17) and (4.16),

$F(x^{l+1}) \ge F(x^l) + A_l(x^{l+1} - x^l) = 0.$
Therefore, (4.20) and (4.21) hold with $l + 1$ replacing $l$, whence they hold in general. Thus (4.22) also holds in general.

STEP 3: The limit $x^* = \lim_{l\to\infty} x^l$ exists and is a zero of $F$. Because the sequence $x^l$ is monotone decreasing and is bounded below by $z^0$, it has a limit $x^*$. From (4.16) it follows that $F(x^l) = A_l(x^l - x^{l+1})$, and because of the boundedness of the $A_l$, it follows on taking the limit that $F(x^*) = 0$.

STEP 4: The point $x^*$ is the unique zero of $F$ in $D_0$. Indeed, if $y^*$ is an arbitrary zero of $F$ in $D_0$, then $0 = F(x^*) - F(y^*) = A(x^* - y^*)$, where $A$ is an M-matrix. Multiplication with $A^{-1}$ gives $y^* = x^*$.

STEP 5: The error estimate (4.18) holds: Because of the monotonically decreasing convergence of the $x^l$ to $x^*$, we have $x^* \le x^l$; and because of (4.21), we have $z^l \le x^*$. Therefore, $x^* \in [z^l, x^l]$. □

6.4.8 Remarks.

(i) If $A_l = F'(x^l)$ is an M-matrix, then $A_l^{-1} \ge 0$ (see Exercise 2.19).
(ii) If $A_l^{-1} \ge 0$ and $F'(x) \le A_l$ for all $x \in D_0$ such that $x \le x^l$, then (4.17) holds, because then $A \le A_l$ for the matrix in (4.19) with $y = x^l$, and $F(x) = F(x^l) + A(x - x^l) \ge F(x^l) + A_l(x - x^l)$ because $x \le x^l$.
(iii) If, in particular, $F'(x)$ $(x \in D_0)$ is an M-matrix with components that are increasing in each variable, then (4.17) holds with $A_l = F'(x^l)$. If $F'(x)$ is an M-matrix but its components are not monotone, one can often find bounds for the range of values of $F'(x)$ for $x$ in a rectangular box $[x, x^l] \subseteq D_0$ by means of interval arithmetic, and hence find an appropriate $A_l$ in (4.17).
(iv) If $F$ is convex in $D_0$, that is, if the relation

$F(\lambda x + (1 - \lambda)y) \le \lambda F(x) + (1 - \lambda)F(y)$ for $0 \le \lambda \le 1$

holds for all $x, y \in D_0$, then

$F(x) \ge F(y) + (F(y + \lambda(x - y)) - F(y))/\lambda,$

and as $\lambda \to 0$, the inequality

$F(x) \ge F(y) + F'(y)(x - y)$ for all $x, y \in D_0$  (4.23)

is obtained. If, in addition, $F'(x^l)$ is an M-matrix, then again, (4.17) holds with $A_l = F'(x^l)$.
(v) The hypothesis $F(x^0) \ge 0$ may be fairly easily satisfied as follows:
(a) If (4.15) holds, and both $x$ and $x^0 := x + u\,w^T|F(x)|$ lie in $D_0$, then

$F(x^0) = F(x) + A(x^0 - x) = F(x) + Au\,w^T|F(x)| \ge F(x) + v\,w^T|F(x)| \ge F(x) + |F(x)| \ge 0.$

(b) If $F$ is convex and both $x$ and $x^0 := x - F'(x)^{-1}F(x)$ lie in $D_0$, then $F(x^0) \ge F(x) + F'(x)(x^0 - x) = 0$ because of (4.23).
(vi) If $x^*$ lies in the interior of $D_0$, then the hypothesis of (4.18) is satisfied for sufficiently large $l$. From the enclosure (4.18), the estimate $|x^l - x^*| \le u\,w^TF(x^l)$ is obtained, which permits a simple analysis of the error.
(vii) In particular, in the special case $D_0 = \mathbb{R}^n$, it follows from the previous remarks that the Newton method is convergent for all starting values if $F$ is convex in the whole of $\mathbb{R}^n$ and $F'(x)$ is bounded above and below by M-matrices. The last hypothesis may be replaced with the requirement that $F'(x)^{-1} \ge 0$ for $x \in \mathbb{R}^n$, as can be seen by looking more closely at the arguments used in the previous proof.
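Under the hypotheses of Theorem 6.4.7, the iteration (4.16) with $A_l = F'(x^l)$ and the enclosure (4.18) are easy to realize. The following MATLAB fragment is a minimal sketch; it takes the vectors u and w as input and does not verify the M-matrix, subgradient, or sign hypotheses:

function [x, z] = monotoneNewton(F, dF, x, u, w, tol, maxit)
% Monotone Newton iteration (4.16) with A_l = F'(x^l), for F with
% M-matrix Jacobian and F(x0) >= 0. Returns the final iterate x and
% the lower bound z, so that x* lies in [z, x] by (4.18).
% Minimal sketch: the hypotheses of Theorem 6.4.7 are not checked.
for l = 1:maxit
    Fx = F(x);
    z  = x - u*(w'*Fx);        % z^l = x^l - u*w'*F(x^l), cf. (4.18)
    x  = x - dF(x) \ Fx;       % Newton step (4.16); x decreases monotonically
    if max(abs(F(x))) <= tol, break; end
end
end

By (4.18), the zero $x^*$ lies componentwise between the returned bounds, and $|x^l - x^*| \le u\,w^TF(x^l)$ as in Remark (vi).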
6.5 Exercises

1. Write a MATLAB program that determines a zero of a function $F: \mathbb{R}^2 \to \mathbb{R}^2$ (a) by Newton's method, and (b) by the damped Newton method (with the 1-norm). In both cases, terminate the iteration when $\|x^{l+1} - x^l\|_1 \le 10^{-6}$. Use

$F(x, y) := \begin{pmatrix} \exp(x^2 + y^2) - 3 \\ x + y - \sin(3(x + y)) \end{pmatrix}$

as a test function. For (a), use the starting values

$x^0 = \binom{0.5}{0}, \binom{0}{-1}, \binom{-0.5}{0.5}, \binom{1}{1}, \binom{10}{-10}, \binom{1}{-0.5}, \binom{1}{0}, \binom{0.5}{-1}, \binom{2}{1.5}, \binom{0.08}{-0.08}, \binom{0.07}{-0.07}.$

Which ones of these are reasonable? For (b), use the starting values $x^0 = \binom{0}{1}$ and $\binom{2}{0.5}$.

2. Let $(x_1(s), y_1(s))$ and $(x_2(t), y_2(t))$ be parametric representations of two twice continuously differentiable curves in the real plane that intersect at $(x^*, y^*)$ at an angle $\varphi^* \ne 0$.
(a) Find and justify a locally quadratically convergent iterative method for the calculation of the point of intersection $(x^*, y^*)$. The iteration formula should depend only on the quantities $s, t, x_i, y_i$ and the derivatives $x_i', y_i'$.
(b) Test it on the pair of curves $(s, s^2)$ $(s \in \mathbb{R})$ and $(t^2, t)$ $(t \in \mathbb{R})$.

3. The surface area $M$ and the volume $V$ of a bucket are given by the formulas

$M = \pi(sx + sy + x^2), \quad V = \frac{\pi}{3}(x^2 + xy + y^2)h,$

where $x$, $y$, $h$, and $s$ are shown in Figure 6.1. Find the bucket that has the greatest volume $V^*$ for the given surface area $M = \pi$. The radii of this bucket are denoted by $x^*$ and $y^*$.

Figure 6.1. Model of a bucket.

(a) Show that these radii are zeros of the function $F$ given by

$F(x, y) = \begin{pmatrix} (1 - 4x^2)(x + 2y - 2y^3) + 4xy^2(x^2 - 1) + 3xy^4 \\ 2x(1 - x^2)^2 + y(1 - 2x^2 + 4x^4) + 6xy^2(x^2 - y^2) - 3y^5 \end{pmatrix}.$

(b) Write a MATLAB program that computes the optimal radii using Newton's method. Choose as starting values $(x^0, y^0)$ those radii that give the greatest volume under the additional condition $x = y$ (cylindrical bucket; what is its volume?). Determine $x^*$, $y^*$ and the corresponding volume $V^*$ and height $h^*$. What is the percentage gain compared with the cylindrical bucket?

4. A pendulum of length $L$ oscillates with a period of $T = 2$ (seconds). How big is its maximum angular displacement? It satisfies the boundary value problem

$\varphi''(t) + \frac{g}{L}\sin\varphi(t) = 0, \quad \varphi(0) = \varphi(1) = 0;$

here, $t$ is the time and $g$ is the gravitational acceleration. In order to solve this equation by discretization, one replaces $\varphi''(t)$ with

$\frac{\varphi(t - h) - 2\varphi(t) + \varphi(t + h)}{h^2}$

and obtains approximate values $x_i \approx \varphi(ih)$ for $i = 0, \dots, n$ and $h := 1/n$ from the following system of nonlinear equations:

$f_i(x) := x_{i-1} - 2x_i + x_{i+1} + h^2\gamma\sin x_i = 0,$  (5.1)

where $\gamma := g/L$, $i = 1, \dots, n - 1$, and $x_0 = x_n = 0$.
(a) Solve (5.1) by Newton's method for $\gamma = 39.775354$ and initial vector with components $x_i^{(0)} := 12h^2i(n - i)$, taking into account the tridiagonal form of $F'(x)$. (Former exercises can be used.)
(b) Solve (5.1) with the discretized Newton method. You should use only three directional derivatives per iteration. Compare the results of (a) and (b) for different step sizes ($n = 20, 40, 100$). Plot the results, and list for each iteration the angular displacement $x_{n/2} \approx \varphi(\frac{1}{2})$ (which is the maximum) in degrees. (The exact maximum angular displacement is $160°$.)

5. To improve an approximation $y^l(x)$ to a solution $y(x)$ of the nonlinear boundary-value problem

$y'(x) = F(x, y(x)), \quad r(y(a), y(b)) = 0,$  (5.2)

where $F: \mathbb{R}^2 \to \mathbb{R}$ and $r: \mathbb{R}^2 \to \mathbb{R}$ are given, one can calculate an improved approximation $y^{l+1} := y^l + \delta y^l$, where the correction $\delta y^l$ solves a linearized problem (5.3) and where the constant $c$ is determined such that

$\partial_1 r(y^l(a), y^l(b))\,\delta y^l(a) + \partial_2 r(y^l(a), y^l(b))\,\delta y^l(b) = -r(y^l(a), y^l(b))$  (5.4)

($\partial_i$ denotes the derivative with respect to the $i$th argument).
(a) Find an explicit expression for $c$.
(b) Show that the solution of (5.3) is equivalent to a Newton step for an appropriate zero-finding problem in a space of continuously differentiable functions.

6. (a) Prove the fixed-point theorem of Leray-Schauder for the special Banach space $\mathbb{B} := \mathbb{R}$.
(b) Prove the fixed-point theorem of Brouwer for $\mathbb{B} := \mathbb{R}$ without using (a).

7. Let the mapping $G: D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ be a contraction in $K \subseteq D$ with fixed point $x^* \in K$ and contraction factor $\beta < 1$. Instead of the iterates $x^l$ defined by the fixed-point iteration $x^{l+1} := G(x^l)$, the inaccurate values $\tilde x^l$ are actually calculated because of rounding errors. Suppose that the error is bounded according to

$\|\tilde x^{l+1} - G(\tilde x^l)\| \le \delta.$

The iteration begins with the index $l = l_0$ (with $\tilde x^{l_0} = x^0 \in K$) and is terminated with the first value $l = l_1$ at which the correction $\|\tilde x^{l+1} - \tilde x^l\|$ no longer decreases. Show that for the error in $\tilde x := \tilde x^{l_1}$, a relation of the form $\|\tilde x - x^*\| \le C\delta$ holds, and find an explicit value for the constant $C$.

8. Using Theorem 6.3.1, give an a posteriori error estimate for one of the approximate zeros $\tilde x$ found in Exercise 1. Take rounding errors into account by using interval arithmetic. Why must one evaluate $F(\tilde x)$ with the point interval $x = [\tilde x, \tilde x]$?
9. Let

$F(x) := \begin{pmatrix} x_1^2 - x_1x_2 + x_1 - 1 \\ x_2^2 - x_1 \end{pmatrix}.$

(a) Show that $F'(x)$ is nonsingular in the region $D = \{x \in \mathbb{R}^2 \mid x_2 < x_1^2 + 1\}$, and that $F$ has there a unique zero at $x^* = \binom{1}{1}$.
(b) Show that the Newton paths with starting points in $D$ remain in $D$ and end at $x^*$.
(c) Modify a damped Newton method that uses $\|F(x)\|$ as merit function by basing it on (4.1) instead, and compare for various random starting points its behavior on $D$ with that of the standard damped Newton method, and with an implementation of Algorithm 6.4.1.
(d) Has the norm of the Newton correction a nonzero local minimum in $D$?

10. Let $F: D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable. Let $F'(x)$ be an M-matrix, and let the vectors $u, v > 0$ be such that $F'(x)u \ge v$ for all $x \in D$. Also let $w \in \mathbb{R}^n$ be the vector with $w_i = v_i^{-1}$ $(i = 1, \dots, n)$. Show that for arbitrary $x \in D$:
(a) If $D$ contains all $\tilde x \in \mathbb{R}^n$ such that $|\tilde x - x| \le u \cdot w^T|F(x)|$, then there is exactly one zero $x^*$ of $F$ such that $|x^* - x| \le u \cdot w^T|F(x)|$.
(b) If, in addition, $D$ is convex, then $x^*$ is the unique zero of $F$ in $D$.

11. (a) Let

$F(x) := \begin{pmatrix} x_1^3 + 2x_1 - x_2 \\ x_2^3 + 3x_2 - x_1 \end{pmatrix}.$

Show that Newton's method for the calculation of a solution of $F(x) = c$ converges for all $c \in \mathbb{R}^2$ and all starting vectors $x^0 \in \mathbb{R}^2$.
(b) Let

$F(x) := \begin{pmatrix} \sin x_1 + 3x_1 - x_2 \\ \sin x_2 + 3x_2 - x_1 \end{pmatrix}.$

Show that Newton's method for $F(x) = 0$ with the starting value $x^0 = \binom{\pi}{\pi}$ converges, but not to a zero of $F$.
(c) For $F$ given in (a) and (b), use Exercise 10 to determine a priori estimates for the solutions of $F(x) = c$ with $c$ arbitrary.

References

The number(s) at the end of each reference give the page number(s) where the reference is cited.

[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover, New York, 1985. [193, 194]
[2] ADIFOR, Automatic Differentiation of Fortran, WWW-document, http://www-unix.mcs.anl.gov/autodiff/ADIFOR/ [308]
[3] T. J. Aird and R. E. Lynch, Computable accurate upper and lower error bounds for approximate solutions of linear algebraic systems, ACM Trans. Math. Software 1 (1975), 217-231. [113]
[4] G. Alefeld and J. Herzberger, Introduction to Interval Computations, Academic Press, New York, 1983. [42]
[5] E. L. Allgower and K. Georg, Numerical Continuation Methods: An Introduction, Springer, Berlin, 1990. [302, 334]
[6] E. Anderson et al., LAPACK Users' Guide, 3rd ed., SIAM, Philadelphia, 1999. Available online: http://www.netlib.org/lapack/lug/ [62]
[7] F. Bauhuber, Direkte Verfahren zur Berechnung der Nullstellen von Polynomen, Computing 5 (1970), 97-118. [275]
[8] A. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996. [61, 78]
[9] K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations, Classics in Applied Mathematics 14, SIAM, Philadelphia, 1996. [336]
[10] G. D. Byrne and A. C. Hindmarsh, Stiff ODE solvers: A review of current and coming attractions, J. Comput. Phys. 70 (1987), 1-62. [219]
[11] S.-N. Chow and J. K. Hale, Methods of Bifurcation Theory, Springer, New York, 1982. [336]
[12] S. D. Conte and C. de Boor, Elementary Numerical Analysis, 3rd ed., McGraw-Hill, Auckland, 1981. [145]
[13] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992. [178]
[14] P. J. Davis and P. Rabinowitz, Methods of Numerical Integration, 2nd ed., Academic Press, Orlando, 1984. [179]
[15] C. de Boor, A Practical Guide to Splines, 2nd ed., Springer, New York, 2000. [155, 163, 169]
[16] C. de Boor, Multivariate piecewise polynomials, pp. 65-109 in: Acta Numerica 1993 (A. Iserles, ed.), Cambridge University Press, Cambridge, 1993. [155]
[17] J. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997. [61]
[18] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983. [302, 334]
[19] P. Deuflhard, A modified Newton method for the solution of ill-conditioned systems of non-linear equations with application to multiple shooting, Numer. Math. 22 (1974), 289-315. [329]
[20] P. Deuflhard, Global inexact Newton methods for very large scale nonlinear problems, IMPACT Comput. Sci. Eng. 3 (1991), 366-393. [329]
[21] R. A. DeVore and B. L. Lucier, Wavelets, pp. 1-56 in: Acta Numerica 1992 (A. Iserles, ed.), Cambridge University Press, Cambridge, 1992. [178]
[22] I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, Oxford, 1986. [64, 75]
[23] J.-P. Eckmann, H. Koch, and P. Wittwer, A computer-assisted proof of universality for area-preserving maps, Amer. Math. Soc. Memoir 289, AMS, Providence, 1984. [38]
[24] M. C. Eiermann, Automatic, guaranteed integration of analytic functions, BIT 29 (1989), 270-282. [187]
[25] H. Engels, Numerical Quadrature and Cubature, Academic Press, London, 1980. [179]
[26] R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, New York, 1987. [301]
[27] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco, 1979. [115]
[28] D. M. Gay, Some convergence properties of Broyden's method, SIAM J. Numer. Anal. 16 (1979), 623-630. [334]
[29] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations, Prentice-Hall, Englewood Cliffs, NJ, 1971. [218, 219]
[30] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, London, 1981. [301]
[31] G. H. Golub and C. F. van Loan, Matrix Computations, 2nd ed., Johns Hopkins University Press, Baltimore, 1989. [61, 70, 73, 76, 250, 255]
[32] A. Griewank, Evaluating Derivatives: Principles and Techniques of Automatic Differentiation, SIAM, Philadelphia, 2000. [10]
[33] A. Griewank and G. F. Corliss, Automatic Differentiation of Algorithms, SIAM, Philadelphia, 1991. [10, 305]
[34] A. Griewank, D. Juedes, H. Mitev, J. Utke, O. Vogel, and A. Walther, ADOL-C: A package for the automatic differentiation of algorithms written in C/C++, ACM Trans. Math. Software 22 (1996), 131-167. http://www.math.tu-dresden.de/wir/project/adolc/ [308]
[35] E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations, Vol. 1, Springer, Berlin, 1987. [179, 212]
[36] E. Hairer and G. Wanner, Solving Ordinary Differential Equations, Vol. 2, Springer, Berlin, 1991. [179, 212]
[37] T. C. Hales, The Kepler Conjecture, Manuscript (1998), math.MG/9811071. http://www.math.lsa.umich.edu/~hales/countdown/ [38]
[38] D. C. Hanselman and B. Littlefield, Mastering MATLAB 5: A Comprehensive Tutorial and Reference, Prentice-Hall, Englewood Cliffs, NJ, 1998. [1]
[39] E. Hansen, Global Optimization Using Interval Analysis, Dekker, New York, 1992. [38, 328]
[40] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, 1997. [81, 125]
[41] J. F. Hart, Computer Approximations, Krieger, Huntington, NY, 1978. [19, 22]
[42] J. Hass, M. Hutchings, and R. Schlafly, The double bubble conjecture, Electron. Res. Announc. Amer. Math. Soc. 1 (1995), 98-102. http://math.ucdavis.edu/~hass/bubbles.html [38]
[43] P. Henrici, Applied and Computational Complex Analysis, Vol. 1: Power Series, Integration, Conformal Mapping, Location of Zeros, Wiley, New York, 1974. [141, 278, 281]
[44] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996. [17, 77, 93]
[45] IEEE Computer Society, IEEE standard for binary floating-point arithmetic, IEEE Std 754-1985 (1985). [15, 42]
[46] E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Dover, New York, 1994. [143]
[47] M. A. Jenkins, Algorithm 493: Zeros of a real polynomial, ACM Trans. Math. Software 1 (1975), 178-189. [268]
[48] M. A. Jenkins and J. F. Traub, A three-stage variable-shift iteration for polynomial zeros and its relation to generalized Rayleigh iteration, Numer. Math. 14 (1970), 252-263. [268]
[49] R. B. Kearfott, Rigorous Global Search: Continuous Problems, Kluwer, Dordrecht, 1996. [38, 328]
[50] R. B. Kearfott, Algorithm 763: INTERVAL_ARITHMETIC: A Fortran 90 module for an interval data type, ACM Trans. Math. Software 22 (1996), 385-392. [43]
[51] R. Klatte, U. Kulisch, C. Lawo, M. Rauch, and A. Wiethoff, C-XSC, a C++ Class Library for Extended Scientific Computing, Springer, Berlin, 1993. [42]
[52] R. Klatte, U. Kulisch, M. Neaga, D. Ratz, and C. Ullrich, PASCAL-XSC: Language Reference with Examples, Springer, Berlin, 1992. [42]
[53] R. Krawczyk, Newton-Algorithmen zur Bestimmung von Nullstellen mit Fehlerschranken, Computing 4 (1969), 187-201. [116]
[54] A. R. Krommer and C. W. Ueberhuber, Computational Integration, SIAM, Philadelphia, 1998. [179]
[55] F. M. Larkin, Root finding by divided differences, Numer. Math. 37 (1981), 93-104. [259]
[56] W. Light (ed.), Advances in Numerical Analysis: Wavelets, Subdivision Algorithms, and Radial Basis Functions, Clarendon Press, Oxford, 1992. [170]
[57] R. Lohner, Enclosing the solutions of ordinary initial- and boundary-value problems, pp. 255-286 in: E. Kaucher et al. (eds.), Computerarithmetic, Teubner, Stuttgart, 1987. [214]
[58] MathWorks, Student Edition of MATLAB Version 5 for Windows, Prentice-Hall, Englewood Cliffs, NJ, 1997. [1]
[59] MathWorks, MATLAB online manuals (in PDF), WWW-document, http://www.mathworks.com/access/helpdesk/help/fulldocset.shtml [1]
[60] J. M. McNamee, A bibliography on roots of polynomials, J. Comput. Appl. Math. 47 (1993), 391-394. Available online at http://www.elsevier.com/homepage/sac/cam/mcnamee [292]
[61] K. Mehlhorn and S. Näher, The LEDA Platform of Combinatorial and Geometric Computing, Cambridge University Press, Cambridge, 1999. [38]
[62] K. Mischaikow and M. Mrozek, Chaos in the Lorenz equations: A computer assisted proof. Part II: Details, Math. Comput. 67 (1998), 1023-1046. [38]
[63] R. E. Moore, Methods and Applications of Interval Analysis, SIAM, Philadelphia, 1981. [42, 214]
[64] A. Morgan, Solving Polynomial Systems Using Continuation for Engineering and Scientific Problems, Prentice-Hall, Englewood Cliffs, NJ, 1987. [334]
[65] D. E. Muller, A method for solving algebraic equations using an automatic computer, Math. Tables Aids Comput. 10 (1956), 208-215. [264]
[66] D. Nerinckx and A. Haegemans, A comparison of non-linear equation solvers, J. Comput. Appl. Math. 2 (1976), 145-148. [292]
[67] NETLIB, A repository of mathematical software, papers, and databases. http://www.netlib.org/ [15, 62]
[68] A. Neumaier, Existence regions and error bounds for implicit and inverse functions, Z. Angew. Math. Mech. 65 (1985), 49-55. [335]
[69] A. Neumaier, An existence test for root clusters and multiple roots, Z. Angew. Math. Mech. 68 (1988), 256-257. [278]
[70] A. Neumaier, Interval Methods for Systems of Equations, Cambridge University Press, Cambridge, 1990. [42, 48, 115, 117, 118, 311, 328]
[70a] A. Neumaier, Global, rigorous and realistic bounds for the solution of dissipative differential equations, Computing 52 (1994), 315-336. [214]
[71] A. Neumaier, Solving ill-conditioned and singular linear systems: A tutorial on regularization, SIAM Rev. 40 (1998), 636-666. [81]
[72] A. Neumaier, A simple derivation of the Hansen-Bliek-Rohn-Ning-Kearfott enclosure for linear interval equations, Reliab. Comput. 5 (1999), 131-136. Erratum, Reliab. Comput. 6 (2000), 227. [118]
[73] A. Neumaier and T. Rage, Rigorous chaos verification in discrete dynamical systems, Physica D 67 (1993), 327-346. [38]
[73a] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, Berlin, 1999. [301]
[74] GNU OCTAVE, A high-level interactive language for numerical computations. http://www.che.wisc.edu/octave [2]
[75] W. Oettli and W. Prager, Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides, Numer. Math. 6 (1964), 405-409. [103]
[76] M. Olschowka and A. Neumaier, A new pivoting strategy for Gaussian elimination, Linear Algebra Appl. 240 (1996), 131-151. [85]
[77] G. Opitz, Gleichungsauflösung mittels einer speziellen Interpolation, Z. Angew. Math. Mech. 38 (1958), 276-277. [259]
[78] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970. [311]
[79] B. N. Parlett, The Symmetric Eigenvalue Problem, Prentice-Hall, Englewood Cliffs, NJ, 1990. (Reprinted by SIAM, Philadelphia, 1998.) [250, 255]
[80] K. Petras, Gaussian versus optimal integration of analytic functions, Constr. Approx. 14 (1998), 231-245. [187]
[81] W. C. Rheinboldt, Numerical Analysis of Parametrized Nonlinear Equations, Wiley, New York, 1986. [336]
[82] J. R. Rice, Matrix Computation and Mathematical Software, McGraw-Hill, New York, 1981. [84]
[83] S. M. Rump, Verification methods for dense and sparse systems of equations, pp. 63-136 in: J. Herzberger (ed.), Topics in Validated Computations, Studies in Computational Mathematics, Elsevier, Amsterdam, 1994. [100]
[84] S. M. Rump, Improved iteration schemes for validation algorithms for dense and sparse non-linear systems, Computing 57 (1996), 77-84. [328]
[85] S. M. Rump, INTLAB: INTerval LABoratory, pp. 77-104 in: Developments in Reliable Computing (T. Csendes, ed.), Kluwer, Dordrecht, 1999. http://www.ti3.tu-harburg.de/rump/intlab/index.html [10, 42]
[86] R. B. Schnabel and E. Eskow, A new modified Cholesky factorization, SIAM J. Sci. Stat. Comput. 11 (1990), 1136-1158. [77]
[87] SCILAB home page. http://www-rocq.inria.fr/scilab/scilab.html [2]
[88] R. Seydel, Practical Bifurcation and Stability Analysis, Springer, New York, 1994. [336]
[89] G. W. Stewart, Matrix Algorithms, SIAM, Philadelphia, 1998. [61]
[90] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer, Berlin, 1987. [207]
[91] A. H. Stroud, Approximate Calculation of Multiple Integrals, Prentice-Hall, Englewood Cliffs, NJ, 1971. [179]
[92] F. Stummel and K. Hainer, Introduction to Numerical Analysis, Longwood, Stuttgart, 1982. [17, 94]
[93] A. van der Sluis, Condition numbers and equilibration of matrices, Numer. Math. 14 (1969), 14-23. [101]
[94] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962. [98]
[95] W. V. Walter, FORTRAN-XSC: A portable Fortran 90 module library for accurate and reliable scientific computing, pp. 265-285 in: R. Albrecht et al. (eds.), Validation Numerics: Theory and Applications, Computing Supplementum 9, Springer, Wien, 1993. [43]
[96] J. Waldvogel, Pronunciation of Cholesky, Lanczos and Euler, NA-Digest v90n10 (1990). http://www.netlib.org/cgi-bin/mfs/02/90/v90n10.html#2 [64]
[97] R. Weiss, Parameter-Free Iterative Linear Solvers, Akademie-Verlag, Berlin, 1996. [64]
[98] J. H. Wilkinson, Rounding Errors in Algebraic Processes, Dover Reprints, New York, 1994. [267]
[99] S. J. Wright, A collection of problems for which Gaussian elimination with partial pivoting is unstable, SIAM J. Sci. Statist. Comput. 14 (1993), 231-238. [122]
[100] W. I. Zangwill and C. B. Garcia, Pathways to Solutions, Fixed Points, and Equilibria, Prentice-Hall, Englewood Cliffs, NJ, 1981. [334]

Index

(O_i), Opitz method, 260
A:k = kth column of A, 62
Ai: = ith row of A, 62
D[c; r], closed disk in complex plane, 141
df, differential number, 7
e, all-one vector, 62
e^(k), unit vectors, 62
f[x0, x1], divided difference, 132
f[x0, ..., xi], divided difference, 134
I, unit matrix, 62
I(f), (weighted) integral of f, 180
J, all-one matrix, 62
L_i(x), Lagrange polynomial, 131
O(.), Landau symbol, 34
o(.), Landau symbol, 34
Q(f), quadrature formula, 180
QR-algorithm, 250
QZ-algorithm, 250
R_N(f), rectangle rule, 204
T_N(f), equidistant trapezoidal rule, 197
T_{i,k}, entry of Romberg triangle, 206
x >> y, much greater than, 24
x << y, much smaller than, 24
≈, approximately equal to, 24
C^{m×n}, 62
x̌, midpoint of x, 39
Conv S, closed convex hull, 141
int D, interior of D, 141
mid x, midpoint of x, 39
rad x, radius of x, 39
w(x), weight function, 180
∂D, boundary of D, 141
R^{m×n}, 62
□S, interval hull, 40
Σ(A, b), solution set of linear interval system, 114
A(x1, ..., xn), set of arithmetic expressions, 3
absolute value, 40, 98
accuracy, 23
algorithm: QR, 250; QZ, 250
Algorithm: Affine Invariant Modified Newton Method, 330; Damped Approximate Newton Method, 317
analytic, 4
approximation, 19: of closed curves, 170; by cubic splines, 165; expansion in a power series, 19; by interpolation, 165; iterative methods, 20; by least squares, 166; order linear, 45; rational, 22
argument reduction, 19
arithmetic: interval, 38
arithmetic expression, 2, 3: definition, 3; evaluation, 4; interval, 44; stabilizing, 28
assignment, 11
automatic differentiation, 2: backward, 306; reverse, 305, 306
base, 15
best linear unbiased estimator (BLUE), 78
bisection: midpoint bisection, 242; secant bisection, 244; spectral, 250, 254
bracket, 241
Bulirsch sequence, 207
cancellation, 25
Cardano's formulas, 57
characteristic polynomial, 250
Cholesky: factor, 76; factorization, 76; modified, 77
Clenshaw-Curtis formulas, 190
column pivoting, 83, 85
condition, 23, 33: condition number, 34, 38; ill-conditioned, 33, 35; matrix condition number, 99; number, 99
continuation method, 334, 336
continued fraction, 22
convergence: factor, 236; global, 237; local, 237; order, 256, 257; Q-linear, 236; Q-quadratic, 237; Q-superlinear, 237; R-linear, 236
correct rounding, 16
Cramer's rule, 63
data perturbations, 99
defective, 262
deflation, 267: explicit, 267; implicit, 267
derivative: checking correctness, 56; multivariate, 303
determinant, 67, 88
difference quotient: central, 149; forward, 148
differential equation: Adams-Bashforth method, 215; Adams-Moulton method, 216; backward differentiation formula (BDF), 218; boundary-value problem, 233; Euler's method, 212; global error, 223; local error estimation, 221; multi-step method, 211, 214; multi-value method, 211; numerical solution of, 210; one-step method, 211; predictor-corrector method, 216; rigorous solution of, 214; Runge-Kutta method, 211; shooting method, 122, 233; step size control, 219; stiff, 214, 218; Taylor series method, 214
differential number, 7: constant, 8
differentiation: analytical, 4; automatic, 7, 10; forward, 10; numerical, 148 (discretization error, 151; higher derivatives, 152; optimal step size, 150; rounding error analysis, 150)
discretization error in numerical differentiation, 151
divided difference: confluent, 136; first-order, 132; general, 134; second-order, 133
double precision format, 15
eigenvalue problem: definite, 251; general linear, 250; nonlinear, 250; Opitz methods, 262; QR-algorithm, 250; QZ-algorithm, 250; rigorous error bounds, 272; spectral bisection, 250
equilibration, 84
error: damping, 35; magnification, 35
error analysis: for arithmetic expressions, 26; for iterative refinement, 110; for linear systems, 99, 112; for multivariate zeros, 322; reducing integration errors, 224; for triangular factorizations, 82, 90; for univariate zeros, 265
error bounds: a posteriori, 324; for complex zeros, 277; for nonlinear systems, 323; for polynomial zeros, 280; for simple eigenvalues, 272
Euler-MacLaurin summation formula, 201
exponent range, 15
extrapolation, 145: Neville formulas, 146
factorization: Cholesky, 76 (modified, 77); LDL^H, 75; modified LDL^T, 75; LR, LU, 67; QR, 80; orthogonal, 80; SVD, 81; triangular, 67 (incomplete, 330)
fixed point, 308
floating point: IEEE standard, 15; number, 14
function: elementary, 3; piecewise cubic, 155; piecewise linear, 153 (in terms of radial basis functions, 170); piecewise polynomial, see spline, 155; radial basis function, 170; spline, see spline, 155
Gaussian elimination, 63: rounding error analysis, 90
global optimization, 38
gradual underflow, 18
grid, 153
H-matrix, 69, 97
Halley's method, 291
Horner scheme, 10: complete, 54; for Newton form, 135
IEEE floating point standard, 15
iff, if and only if, vii
ill-conditioned, see condition, 33
inclusion isotonicity, 44
integral, 302
integration: adaptive, 203; error estimation, 207; multivariate, 179; numerical, 179; quadrature formula, see quadrature formula, 179; Romberg's method, 206; stepsize control, 208
interpolation: by cubic splines, 153, 156; by polynomials, 131 (convergence, 141; divergence, 143; interpolation error, 138; Lagrange formulas, 131; Newton formulas, 134); Hermite, 139; in Chebyshev points, 144; in expanded Chebyshev points, 145; linear, 132; piecewise linear, 153; quadratic, 133
interval, 39: arithmetic, 38; arithmetic expression, 44; eigenvalue enclosure, 272; evaluation, mean value form, 47; hull, □S, 40; Krawczyk's method, 116; Krawczyk iteration, 327; Krawczyk operator, 327; linear system, 114 (Gaussian elimination, 118; solution set, Σ(A, b), 114); midpoint, mid x, 39; Newton's method, multivariate, 326; Newton method, univariate, 268; point interval, 39; radius, rad x, 39; set of intervals, 39
INTLAB, 10: verifylss, 128
iterative method, 20
iterative refinement, 106: limiting accuracy, 110
Jacobian matrix, 303
Kepler's barrel rule, 189
Krawczyk iteration, 327
Krawczyk operator, 327
Landau symbols, 34
Lapack, 62
Larkin method, 259
least squares: method, 78; solution, 78
limiting accuracy: for iterative refinement, 110; for multivariate zeros, 322; for numerical differentiation, 151; for univariate zeros, 265
linear system: error bounds, 112; interval equations, 114 (Gaussian elimination, 118; Krawczyk's method, 116); iterative refinement, 106; overdetermined, 61, 78; quality factor, 105; regularization, 81; rounding errors, 82; sparse, 64; triangular, 65
Lipschitz continuous, 304
M-matrix, 98, 124
machine precision, 17
mantissa length, 15
MATLAB, 1: :n, 3; ==, 65; A\b, 63, 67, 81; A', 62; A(:,k), 63; A(i,:), 62; A.', 62; A.*B, 63; A./B, 63; A/B, 63; chol, 76, 127; cond, 125; conj, 299; disp, 32; eig, 229; eps, 17; eye, 63; false, 242; feval, 247; fprintf, 54; full, 121; get, 177; help, 1; input, 32; inv, 63; lu, 122; max, 40; mex, 176; NaN, 15; num2str, 11; ones, 63; plot, 12; poly, 266; print, 12; qr, 80; randn, 232; roots, 266, 268; set, 177; sign, 104; size, 11, 63; sparse, 121; sprintf, 54; spy, 121; subplot, 177; text, 177; title, 177; true, 242; xlabel, 177; ylabel, 177; zeros, 63; %, 12; &, 32; ~, 32; |, 32; conjugate transpose, A', 62; figures, 176; online help, 1; passing function names, 247; precision, 15, 17; pseudo-, 1; public domain variants (OCTAVE, 2; SCILAB, 2); sparse matrices, 75; subarray, 66; transpose, A.', 62; vector indexing, 11
matrix: A:k = kth column of A, 62; Ai: = ith row of A, 62; absolute value, 62, 98; all-one matrix, J, 62; banded, 74; condition number, 99; conjugate transposed, A^H, 62; diagonal, 65; diagonally dominant, 97; factorization, see factorization, 67; H-matrix, 69, 97, 124; Hermitian, 62; Hilbert, 126; inequalities, 62; inverse, 81; inverse conjugate transposed, A^{-H}, 62; inverse transposed, A^{-T}, 62; Jacobian, 303; M-matrix, 98; monomial, 121; norm, see norm, 94; notation, 62; orthogonal, 80; parameter matrix, 250; pencil, 250; permutation, 85 (symmetric, 86); positive definite, 69; sparse, 75; symmetric, 62; transposed, A^T, 62; triangular, 65; tridiagonal, 74; unitary, 80, 126; unit matrix, I, 62
mean value form, 30, 48
mesh size, 153
Milne rule, 189
Muller's method, 264
NETLIB, 62
Newton's method, 281: discretized, 314; global monotone convergence, 336; multivariate, 311 (affine invariant, 329; convergence, 318; damped, 315; error analysis, 322; interval, 326; local convergence, 314); univariate (damped, 287; global behavior, 284; modified, 289; vs. secant method, 282)
Newton path, 329
Newton step, 287
nodes, 153, 180
nonlinear system, 301: continuation method, 334; error bounds, 323; finding all zeros, 328; quasi-Newton method, 333
norm, 94: ∞-norm, 94, 95; 1-norm, 94, 95; 2-norm, 94, 95; column sum, 95; Euclidean, 94; matrix, 95; maximum, 94; monotone, 98; row sum, 95; spectral, 95; sum, 94
normal equations, 78
number: integer, 14; real, 14; fixed point, 14; floating point, 14
numerical differentiation: limiting accuracy, 151
numerical integration: rigorous, 185
numerical stability, 23
OCTAVE, 2
operation, 3
Opitz method, 259
optimization problems, 301
orthogonal polynomials, 191: 3-term recurrence relation, 195
outward rounding, 42
overflow, 18
parallelization, 73
parameter matrix, 250
partial pivoting, 83
PASCAL-XSC, 42
permutation matrix, 85: symmetric, 86
perturbation theory, 94
pivot elements, 83
pivoting: column, 85; complete, 91; pivot search, 83, 88
point interval, 39
pole: Opitz methods, 259
polynomial: Cardano's formulas, 57; characteristic, 250; Chebyshev, 145; interpolation, 131; Lagrange, 131; Legendre, 194; Newton form, 135; orthogonal, 191 (3-term recurrence relation, 195)
positive definite, 69
positive semidefinite, 69
precision, 23: machine precision, 17
preconditioner, 113, 330
quadratic convergence, 21
quadrature formula: Clenshaw-Curtis, 190; closed, 229; Gaussian, 187, 192; interpolatory, 188; Milne rule, 189, 205, 226; Newton-Cotes, 189; normalized, 181; order, 182; rectangle rule, 204; Simpson rule, 189, 204, 226; trapezoidal rule, 189, 196, 226 (equidistant, 197)
Newton's method, 282 sign change, 241 Simpson rule, 189 single precision, 15 singular-value decomposition, 8 I slope multivariate, 303 spectral bisection, 254 spline, 155 Spline Approximation, 165 spline approximation error, 162 B-spline, 155, 176 in terms ofradial basis functions, 172 basis spline, 155 complete, 161 cubic, 155 free node condition, 158 interpolation, 153, 156 optimality of complete splines, 163 parametric, 169 periodic, 160 stahility,37 numerical, 23 stabilizing expressions, 28 standard deviation, 2, 26, 30 subdistributive law, 59 Index Theorem of Aird and Lynch, 113 Euler-MacLaurin summation formula, 201 fixed point theorem by Banach, 309 by Brouwer, 311 by Leray and Schauder, 310 Gauss-Markov, 78 inertia theorem of Sylvester, 252 of Oettli and Prager, 104 of Rouche, 277 of van der Sluis, 102 trapezoidal rule, 189, 197 triangular factorization, 67 incomplete, 330 matrix, 65 underflow, 18 vector absolute value, 62, 98 all-one vector, e, 62 inequalities, 62 norm, see norm, 94 unit vectors, e(k) , 62 vectorization, 73 weight function, 180 weights, 180 zero, 233 bracket, 241 cluster, 239 complex, 273 rigorous error bounds, 277 spiral search. 275 deflation, 267 finding all zeros, 268, 328 hyperbolic interpolation, 261 interval Newton method, 268 limiting accuracy, 265 method of Larkin, 259 method of Opitz, 259 Muller's Method, 264 multiple, 239 multivariate, 30 I limiting accuracy, 322 polynomial, 280 Cardano's formulas, 57 sign change, 241 simple, 239 univariate HaUey's method, 291 Newton's method. 281 N U ME RICAL ANALYS IS i. ~n increuingly importan t link between puce math ematics and ils appJic~tion in science and to:chnology. This textbook provido:s an int roduction 10 the justificalion and IkvdopmeOl of constructive methods du. t proeide sufficiendy 2C'C1Ilaie approximations 10 the solution of numerical problems. and the an:alysis of the influence INt erron in data, fin it~precision cakulations. and approDmarion fOfll'lulas haw on results, problem form ulation, and the dIoice of method. It a1so servo as an introduction 10 scientifi<: programming in MATU.S, including many simple and difficult . thtorttical and computaOoD;lI ex~ . A unique feature of mis boo k is the consequent devdopmen t o(in terval aD;llysis as a wol fur rigorous computiltion and ccmpu eer-assiseed proofs. along with th e rradinonal material. Arnold Neumaier i. a Professor of Computational Mathematics al the Uni_eroiry of Wien. He has wrin cn two boc ks . nd published numerous artid es an the subjects a fro mbinalOries. Interval analysis, "ptimi.. tl"n. ~nd . tatistics. He has held teaching po,iti"n s at the Univt'l'> iry of F.eiburg. Gennany. a nd the Uni'e .. iry of Wisconsin. Madison , and he has wor ked a n the technica l 5IafT at AT&T IIdl Labor-atories. In addition. Professo. Neumaie. maim ains e~tensive Web pages on ptIblicdom ain softwa'" for numerial a""lysis. oplimi... ,ioo, and ILLltistiCS.
