Introduction to Numerical Analysis
Numerical analysis is an increasingly important link between pure mathematics and its application in science and technology. This textbook provides an
introduction to the justification and development of constructive methods that
provide sufficiently accurate approximations to the solution of numerical problems, and the analysis of the influence that errors in data, finite-precision calculations, and approximation formulas have on results, problem formulation,
and the choice of method. It also serves as an introduction to scientific programming in MATLAB, including many simple and difficult, theoretical and
computational exercises.
A unique feature of this book is the consistent development of interval
analysis as a tool for rigorous computation and computer-assisted proofs, along
with the traditional material.
Arnold Neumaier is a Professor of Computational Mathematics at the University
of Wien. He has written two books and published numerous articles on the
subjects of combinatorics, interval analysis, optimization, and statistics. He
has held teaching positions at the University of Freiburg, Germany, and the
University of Wisconsin, Madison, and he has worked on the technical staff at
AT&T Bell Laboratories. In addition, Professor Neumaier maintains extensive
Web pages on public domain software for numerical analysis, optimization, and
statistics.
Introduction to
Numerical Analysis
ARNOLD NEUMAIER
University of Wien

CAMBRIDGE UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, VIC 3166, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa

http://www.cambridge.org

© Cambridge University Press 2001

This book is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.

First published 2001

Printed in the United Kingdom at the University Press, Cambridge

System LaTeX 2e
Typeface Times Roman 10/13 pt.

A catalog record for this book is available from the British Library.

Library of Congress Cataloging in Publication Data
Neumaier, A.
Introduction to numerical analysis / Arnold Neumaier.
p. cm.
Includes bibliographical references and index.
ISBN 0-521-33323-7 - ISBN 0-521-33610-4 (pb.)
1. Numerical analysis. I. Title.
QA297 .N48 2001
519.4 - dc21   00-066713

ISBN 0 521 33323 7 hardback
ISBN 0 521 33610 4 paperback
Contents

Preface

1. The Numerical Evaluation of Expressions
   1.1 Arithmetic Expressions and Automatic Differentiation
   1.2 Numbers, Operations, and Elementary Functions
   1.3 Numerical Stability
   1.4 Error Propagation and Condition
   1.5 Interval Arithmetic
   1.6 Exercises
2. Linear Systems of Equations
   2.1 Gaussian Elimination
   2.2 Variations on a Theme
   2.3 Rounding Errors, Equilibration, and Pivot Search
   2.4 Vector and Matrix Norms
   2.5 Condition Numbers and Data Perturbations
   2.6 Iterative Refinement
   2.7 Error Bounds for Solutions of Linear Systems
   2.8 Exercises
3. Interpolation and Numerical Differentiation
   3.1 Interpolation by Polynomials
   3.2 Extrapolation and Numerical Differentiation
   3.3 Cubic Splines
   3.4 Approximation by Splines
   3.5 Radial Basis Functions
   3.6 Exercises
4. Numerical Integration
   4.1 The Accuracy of Quadrature Formulas
   4.2 Gaussian Quadrature Formulas
   4.3 The Trapezoidal Rule
   4.4 Adaptive Integration
   4.5 Solving Ordinary Differential Equations
   4.6 Step Size and Order Control
   4.7 Exercises
5. Univariate Nonlinear Equations
   5.1 The Secant Method
   5.2 Bisection Methods
   5.3 Spectral Bisection Methods for Eigenvalues
   5.4 Convergence Order
   5.5 Error Analysis
   5.6 Complex Zeros
   5.7 Methods Using Derivative Information
   5.8 Exercises
6. Systems of Nonlinear Equations
   6.1 Preliminaries
   6.2 Newton's Method and Its Variants
   6.3 Error Analysis
   6.4 Further Techniques for Nonlinear Systems
   6.5 Exercises

References
Index
Preface
They ...explained it so that the people could understand it.
Good News Bible, Nehemiah 8:8
Since the introduction of the computer, numerical analysis has developed into an
increasingly important connecting link between pure mathematics and its application in science and technology. Its independence as a mathematical discipline
depends, above all, on two things: the justification and development of constructive methods that provide sufficiently accurate approximations to the solution
of problems, and the analysis of the influence that errors in data, finite-precision
calculations, and approximation formulas have on results, problem formulation,
and the choice of method. This book provides an introduction to these themes.
A novel feature of this book is the consistent development of interval analysis as a tool for rigorous computation and computer-assisted proofs. Apart
from this, most of the material treated can be found in typical textbooks on
numerical analysis; but even then, proofs may be shorter than and the perspective may be different from those elsewhere. Some of the material on nonlinear
equations presented here previously appeared only in specialized books or in
journal articles.
Readers are expected to have a background knowledge of matrix algebra and
calculus of several real variables, and to know just enough about topological
concepts to understand that sequences in a compact subset in JRn have a convergent subsequence. In a few places, elements of complex analysis are used.
The book is based on course lectures in numerical analysis that the author
gave repeatedly at the University of Freiburg (Germany) and the University
of Vienna (Austria). Many simple and difficult theoretical and computational
exercises help the reader gain experience and deepen the understanding of the
techniques presented. The material is a little more than can be covered in a
European winter term, but it should be easy to make suitable selections.
The presentation is in a rigorous mathematical style. However, the theoretical results are usually motivated and discussed in a more leisurely manner so
that many proofs can be omitted without impairing the understanding of the
algorithms. Notation is almost standard, with a bias towards MATLAB (see
also the index). The abbreviation "iff" frequently stands for "if and only if."
The first chapter introduces elementary features of numerical computation:
floating point numbers, rounding errors, stability and condition, elements of
programming (in MATLAB), automatic differentiation, and interval arithmetic.
Chapter 2 is a thorough treatment of Gaussian elimination, including its variants
such as the Cholesky factorization. Chapters 3 through 5 provide the tools for
studying univariate functions - interpolation (with polynomials, cubic splines,
and radial basis functions), integration (Gaussian formulas, Romberg and adaptive integration, and an introduction to multistep formulas for ordinary differential equations), and zero finding (traditional and less traditional methods
ensuring global and fast local convergence, complex zeros, and spectral bisection for definite eigenvalue problems). Finally, Chapter 6 discusses Newton's
method and its many variants for systems of nonlinear equations, concentrating
on methods for which global convergence can be proved.
In a second course, I usually cover numerical data analysis (least squares and
orthogonal factorization, the singular value decomposition and regularization,
and the fast Fourier transform), unconstrained optimization, the eigenvalue
problem, and differential equations. Therefore, this book contains no (or only
a rudimentary) treatment of these topics; it is planned that they will be covered
in a companion volume.
I want to thank Doris Norbert, Andreas Schafer, and Wolfgang Enger for their
help in preparing the German lecture notes; Michael Wolfe and Baker Kearfott
for their English translation, out of which the present text grew; Carl de Boor,
Baker Kearfott, Weldon Lodwick, Günter Mayer, Jiří Rohn, and Siegfried Rump
for their comments on various versions of the manuscript; and Stefan Dallwig
and Waltraud Huyer for computing the tables and figures and proofreading. I
also want to thank God for giving me love and power to understand and use the
mathematical concepts on which His creation is based, and for the discipline
and perseverance to write this book.
I hope that working through the material presented here gives my readers
insight and understanding, encourages them to become more deeply interested
in the joys and frustrations of numerical computations, and helps them apply
mathematics more efficiently and more reliably to practical problems.
Vienna, January 2001
Arnold Neumaier
1
The Numerical Evaluation of Expressions
In this chapter, we introduce the reader on a very elementary level to some basic
considerations in numerical analysis. We look at the evaluation of arithmetic
expressions and their derivatives, and show some simple ways to save on the
number of operations and storage locations needed to do computations.
We demonstrate how finite precision arithmetic differs from computation
with ideal real numbers, and give some ideas about how to recognize pitfalls in
numerical computations and how to avoid the associated numerical instability.
We look at the influence of data errors on the results of a computation, and
how to quantify this influence using the concept of a condition number. Finally
we show how, using interval arithmetic, it is possible to obtain mathematically
correct results and error estimates although computations are done with limited
precision only.
We present some simple algorithms in a pseudo-MATLAB® formulation very close to the numerical MATrix LABoratory language MATLAB (and some, those printed in typewriter font, in true MATLAB) to facilitate getting
used to this excellent platform for experimenting with numerical algorithms.
MATLAB is very easy to learn once you get the idea of it, and the online help
is usually sufficient to expand your knowledge. We explain many MATLAB
conventions on their first use (see also the index); for unexplained MATLAB
features you may try the online help facility. Type help at the MATLAB prompt
to find a list of available directories with MATLAB functions; add the directory
name after help to get information about the available functions, or type the
function name (without the ending .m) after help to get information about the
use of a particular function. This should give enough information, in most cases.
Apart from that, you may refer to the MATLAB manuals (for example, [58,59]).
If you need more help, see, for example, Hanselman and Littlefield [38] for a
comprehensive introduction.
For those who cannot afford MATLAB, there are public domain variants that
can be downloaded from the internet, SCILAB [87] and OCTAVE [74]. (The
syntax and the capabilities are slightly different, so the MATLAB explanations
given here may not be fully compatible.)
Here we use the MATLAB notation 1 : n for a list of consecutive integers
from 1 to n.
1.1 Arithmetic Expressions and Automatic Differentiation

Mathematical formulas occur in many fields of interest in mathematics and its applications (e.g., statistics, banking, astronomy).

1.1.1 Examples.

(i) The absolute value |x + iy| of the complex number x + iy is given by

   |x + iy| = √(x² + y²).

(ii) A capital sum K₀ invested at p% per annum for n years accumulates to a sum K given by

   K = K₀(1 + p/100)ⁿ.

(iii) The altitude h of a star with spherical coordinates (ψ, δ, t) is given by

   h = arcsin(sin ψ sin δ + cos ψ cos δ cos t).

(iv) The solutions x₁, x₂ of the quadratic equation ax² + bx + c = 0 are given by

   x₁,₂ = (−b ± √(b² − 4ac)) / (2a).

(v) The standard deviation σ of a sequence of real numbers x₁, ..., xₙ is given by

   σ = √( (1/n) Σ_{i=1:n} ( xᵢ − (1/n) Σ_{i=1:n} xᵢ )² ).

(vi) The area e under the normal distribution curve e^(−t²/2)/√(2π) between t = −x and t = x is given by

   e = (1/√(2π)) ∫_{−x}^{x} e^(−t²/2) dt.

With the exception of the last formula, which contains an integral, only elementary arithmetic operations and elementary functions occur in the examples. Such formulas are called arithmetic expressions. Arithmetic expressions may be defined recursively with the aid of variables x₁, ..., xₙ, the binary operations + (addition), − (subtraction), * (multiplication), / (division), ^ (exponentiation), forming the set

   O = {+, −, *, /, ^},

and with certain unary elementary functions in the set

   J = {+, −, sin, cos, exp, log, sqrt, abs, ...}.

The set of elementary functions is not specified precisely here, and should be regarded as consisting of the set of unary continuous functions that are defined on the computer used.

1.1.2 Definition. The set A = A(x₁, ..., xₙ) of arithmetic expressions in x₁, ..., xₙ is defined by the rules

(E1) ℝ ⊆ A,
(E2) xᵢ ∈ A (i = 1, ..., n),
(E3) g, h ∈ A, ∘ ∈ O ⟹ (g ∘ h) ∈ A,
(E4) g ∈ A, φ ∈ J ⟹ φ(g) ∈ A,
(E5) A(x₁, ..., xₙ) is the smallest set that satisfies (E1)-(E4). (This rule excludes objects not created by the rules (E1)-(E4).)

Unnecessary parentheses may be deleted in accordance with standard rules.

1.1.3 Example. The solution

   x = (−b + √(b² − 4ac)) / (2a)

of a quadratic equation is an arithmetic expression in a, b, and c because we can write it as follows:

   (−b + sqrt(b^2 − 4*a*c))/(2*a) ∈ A(a, b, c).

Evaluation of an arithmetic expression means replacing the variables with numbers. This is possible on any machine for which the operations in O and the elementary functions in J are realized.

Differentiation of Expressions

For many numerical problems it is useful and sometimes absolutely essential to know how to calculate the value of the derivative of a function f at given points. If no routine for the calculation of f'(x) is available, then one can determine approximations for f'(x) by means of numerical differentiation (see Section 3.2). However, because of the higher accuracy attainable, it is usually better to derive a routine for calculating f'(x) from the routine for calculating f(x). If, in particular, f is given by an arithmetic expression, then an arithmetic expression for f'(x) can be obtained by analytic (symbolic) differentiation using the rules of calculus. We describe several useful variants of this process, confining ourselves here to the case of a single variable x; the case of several variables is treated in Section 6.1. We shall see that, for all applications in which a closed formula for the derivative of f is not needed but one must be able to evaluate f'(x) at arbitrary points, a recursive form of generating this value simultaneously with the value of f(x) is the most useful way to proceed.

(a) The Construction of a Closed Expression for f'

This is the traditional way in which the formula for the derivative is calculated by hand and the expression that is obtained is then programmed as a function subroutine. However, several disadvantages outweigh the advantage of having a closed expression.

(i) Algebraic errors can occur in the calculation and simplification of formulas for derivatives by hand; this is particularly likely when long and complicated formulas for the derivative result. Therefore, a correctness test is necessary; this can be implemented, for example, as a comparison with a sequence of values obtained by numerical differentiation (cf. Exercise 8).

(ii) Often, especially when both f(x) and f'(x) must be calculated, certain subexpressions appear several times, and their recalculation adds needlessly to the running time of the program. Thus, in the example

   f(x) = e^x / (1 + sin x),
   f'(x) = (e^x (1 + sin x) − e^x cos x) / (1 + sin x)²
         = e^x (1 + sin x − cos x) / (1 + sin x)²,

the expression 1 + sin x occurs three times.

The susceptibility to error is considerably reduced if one automates the calculation and simplification of closed formulas for derivatives using symbolic algebra (i.e., the processing of syntactically organized strings of symbols). Examples of high-quality packages that accomplish this are MAPLE and MATHEMATICA.

(b) The Construction of a Recursive Program for f and f'

In order to avoid the repeated calculation of subexpressions, it is advantageous to give up the closed form and calculate repeatedly occurring subexpressions using auxiliary variables. In the preceding example, one obtains the program segment

   f1 = exp x;
   f2 = 1 + sin x;
   f = f1/f2;
   f' = f1*(f2 − cos x)/f2^2;

in which one can represent the last expression for f' more briefly in the form f' = f*(f2 − cos x)/f2 or f' = f*(1 − cos x/f2). One can even arrange to have this transformation done automatically by programs for symbolic manipulation and dispense with closed intermediate expressions. Thus, in decomposing the expression for f recursively into its constituent parts, a sequence of assignments of the form

   f = −g,   f = g ∘ h,   f = φ(g)

is obtained. One can differentiate this sequence of equations, taking into account the fact that the subexpressions f, g, and h are themselves functions of x and bearing in mind that a constant has the derivative 0 and that the variable x has the derivative 1. In addition, the well-known rules for differentiation (Table 1.1) are used.
Table 1.1. Differentiation rules for expressions f

   f          f'
   ±g         ±g'
   g ± h      g' ± h'
   g * h      g'*h + g*h'
   g/h        (g' − f*h')/h   (see main text)
   g^2        2*g*g'
   g^h        f*(h'*log(g) + h*g'/g)
   sqrt(g)    g'/(2*f)   if g > 0
   exp(g)     f*g'
   log(g)     g'/g
   abs(g)     sign(g)*g'
   φ(g)       φ'(g)*g'

The unusual form of the quotient rule results from

   f = g/h  ⟹  f' = (g'h − gh')/h² = (g' − (g/h)h')/h = (g' − fh')/h

and is advantageous for machine calculation, saving two multiplications. In the preceding example, one obtains

   f1 = exp x;        f1' = f1;
   f2 = 1 + sin x;    f2' = cos x;
   f  = f1/f2;        f'  = (f1' − f*f2')/f2;

and from this, by means of a little simplification and substitution,

   f1 = exp x;
   f2 = 1 + sin x;
   f  = f1/f2;
   f' = (f1 − f*cos x)/f2;

with the same computational cost as the old, intuitively derived recursion.

(c) Computing with Differential Numbers

The compilers of all programming languages can do a syntax analysis of arbitrary expressions, but usually the result of the analysis is not accessible to the programs. However, in programming languages that provide user-definable data types, operators, and functions for objects of these types, the compiler's syntax analysis can be utilized. This possibility exists, for example, in the programming languages MATLAB 5, FORTRAN 90, ADA, C++, and in the PASCAL extension PASCAL-XSC, and leads to automatic differentiation methods.

In order to understand how this can be done, we observe that Table 1.1 constructs, for each operation ∘ ∈ O, from two pairs of numerical values (g, g') and (h, h'), a third value, namely (f, f'). We may regard this pair simply as the result of the operation (g, g')∘(h, h'). Similarly, one finds in Table 1.1 for each elementary function φ ∈ J a new pair (f, f') from the pair (g, g'), and we regard this as a definition of the value φ((g, g')). The analogy with complex numbers is obvious and motivates our definition of differential numbers as pairs of real numbers.

Formally, a differential number is a pair (f, f') of real numbers. We use the generic form df to denote such a differential number, and regard similarly the variables h and h' as the components of the differential number dh, so that dh = (h, h'), and so on. We now define, in correspondence with Table 1.1, operations and elementary functions for differential numbers.

   ±dg := (±g, ±g');
   dg ± dh := (g ± h, g' ± h');
   dg * dh := (g*h, g'*h + g*h');
   dg / dh := (f, (g' − f*h')/h)      with f = g/h   (if h ≠ 0);
   dg^n := (k*g, n*k*g')              with k = g^(n−1)   (if 1 ≤ n ∈ ℝ);
   dg^n := (f, n*f*g'/g)              with f = g^n   (if 1 > n ∈ ℝ, g > 0);
   dg^dh := (f, f*(h'*k + h*g'/g))    with k = log(g), f = exp(h*k)   (if g > 0);
   sqrt(dg) := (f, g'/(2*f))          with f = sqrt(g)   (if g > 0);
   exp(dg) := (f, f*g')               with f = exp(g);
   log(dg) := (log(g), g'/g)          (if g > 0);
   abs(dg) := (abs(g), sign(g)*g')    (if g ≠ 0);
   φ(dg) := (φ(g), φ'(g)*g')          for other φ ∈ J for which φ(g), φ'(g) exist.

The next result follows directly from the definitions.
1.1.4 Proposition. Let q, r be real functions of one variable, differentiable at x₀ ∈ ℝ, and let

   dq = (q(x₀), q'(x₀)),   dr = (r(x₀), r'(x₀)).

(i) If dq ∘ dr is defined (for ∘ ∈ O), then the function p given by

   p(x) := q(x) ∘ r(x)

is defined and differentiable at x₀, and

   (p(x₀), p'(x₀)) = dq ∘ dr.

(ii) If φ(dq) is defined (for φ ∈ J), then the function p given by

   p(x) := φ(q(x))

is defined and differentiable at x₀, and

   (p(x₀), p'(x₀)) = φ(dq).

The reader interested in algebra easily verifies that the set of differential numbers with the operations +, −, * forms a commutative and associative ring with null element 0 = (0,0) and identity element 1 = (1,0). The ring has zero divisors; for example, (0, 1) * (0, 1) = (0, 0) = 0. The differential numbers of the form (a, 0) form a subring that is isomorphic to the ring of real numbers.

We may call differential numbers of the form (a, 0) constants and identify them with the real numbers a. For operations with constants, the formulas simplify:

   dg ± a = (g ± a, g'),       a ± dh = (a ± h, ±h'),
   dg * a = (g*a, g'*a),       a * dh = (a*h, a*h'),
   dg / a = (g/a, g'/a),       a / dh = (f, −f*h'/h)   with f = a/h.

The arithmetic for differential numbers can be programmed without any difficulty in any of the programming languages mentioned previously. With their help, we can compute the values f(x₀) and f'(x₀) for any arithmetic expression f and at any point x₀. Indeed, we need only initialize the independent variable as the differential number dx = (x₀, 1) and substitute this into f.

1.1.5 Theorem. Let f be an arithmetic expression in the variable x, and let the differential number f(dx) that results from inserting dx = (x₀, 1) into f be defined. Then f is defined and is differentiable at x₀, and

   (f(x₀), f'(x₀)) = f(dx).

Proof. Recursive application of Proposition 1.1.4. □

The conclusion is that, using differential numbers, one can calculate the derivative of any arithmetic expression, at any admissible point, without knowing an expression for f'!

1.1.6 Example. Suppose we want to find the value and the derivative of

   f(x) = (x − 1)(x + 3) / (x + 2)

at the point x₀ = 3. The classical method (a) starts from an explicit formula for the derivative; for example,

   f'(x) = (x² + 4x + 7) / (x + 2)²,

and finds by substitution that

   f(3) = 12/5 = 2.4,   f'(3) = 28/25 = 1.12.

Substituting the differential number dx = (3, 1) into the expression for f gives

   (f(3), f'(3)) = f(dx) = ((3,1) − 1) * ((3,1) + 3) / ((3,1) + 2)
                 = (2,1) * (6,1) / (5,1)
                 = (12, 8) / (5, 1)
                 = (2.4, (8 − 2.4 * 1)/5) = (2.4, 1.12),

without knowing an expression for f'. In the same way, substituting dx = (3, 1) into the equivalent expression

   f(x) = x − 3/(x + 2),

we obtain, in spite of completely different intermediate results, the same final result

   (f(3), f'(3)) = f(dx) = (3, 1) − (3,0)/((3,1) + 2)
                 = (3, 1) − (3,0)/(5,1)
                 = (3, 1) − (0.6, (0 − 0.6 * 1)/5)
                 = (3, 1) − (0.6, −0.12)
                 = (2.4, 1.12).
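To make the mechanism concrete, the following small sketch (not from the book; the names ad_demo, dadd, dsub, dmul, ddiv are invented for this illustration) carries out the computation of Example 1.1.6 in MATLAB, representing each differential number as a row vector [value, derivative]:

   function ad_demo
   % evaluate f(x)=(x-1)*(x+3)/(x+2) and f'(x) at x0=3 with differential numbers
   dx = [3 1];                              % dx = (x0,1)
   df = ddiv(dmul(dsub(dx,[1 0]),dadd(dx,[3 0])),dadd(dx,[2 0]));
   disp(df)                                 % prints 2.4  1.12, as in Example 1.1.6
   function r = dadd(a,b)
   r = [a(1)+b(1), a(2)+b(2)];              % rule for dg+dh
   function r = dsub(a,b)
   r = [a(1)-b(1), a(2)-b(2)];              % rule for dg-dh
   function r = dmul(a,b)
   r = [a(1)*b(1), a(2)*b(1)+a(1)*b(2)];    % rule for dg*dh
   function r = ddiv(a,b)
   f = a(1)/b(1);
   r = [f, (a(2)-f*b(2))/b(1)];             % quotient rule (g'-f*h')/h

With operator overloading (available in MATLAB 5 and later) the same effect can be obtained without rewriting the expression for f; the INTLAB toolbox mentioned below provides such overloaded operations.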
Finally, we note that differential numbers can be generalized without difficulty
to compute derivatives of higher order. For the computation of f''(x), ...,
f⁽ⁿ⁾(x) from an expression f containing N operations or functions, the number
of operations and function calls needed grows like a small multiple of n²N.
The formalism handles differentiation with respect to any parameter. Therefore, it is also possible to compute the partial derivatives ∂f(x)/∂xₖ of a function
that is given by an expression f (x) in several variables. Of course, one need
not redo the function value part of the calculation when calculating the partial derivatives. In addition to this "forward" mode of automatic differentiation,
there is also a "reverse" or "backward" mode that may further increase the
efficiency of calculating partial derivatives (see Section 6.1).
A MATLAB implementation of automatic differentiation is available in the
INTLAB toolbox by Rump [85]. An in-depth discussion of all aspects of automatic differentiation and its applications is given in Griewank and Corliss [33]
and Griewank [32].
The Horner Scheme

A polynomial of degree at most n is a function of the form

   f(x) = a₀xⁿ + a₁xⁿ⁻¹ + ··· + aₙ₋₁x + aₙ     (1.1)

with given coefficients a₀, ..., aₙ. In order to evaluate (1.1) for a given value of x, one does not proceed naively by forming each power x², x³, ..., xⁿ, but one uses the following scheme. We define

   fᵢ := a₀xⁱ + a₁xⁱ⁻¹ + ··· + aᵢ     (i = 0, 1, ..., n).

Obviously, fₙ = f(x). The advantage of fᵢ lies in its recursive definition

   f₀ = a₀,
   fᵢ = fᵢ₋₁x + aᵢ   (i = 1, ..., n),
   f(x) = fₙ.

This is the Horner scheme for calculating the value of a polynomial. A simple count shows that only 2n operations are needed, whereas evaluating (1.1) directly requires at least 3n − 1 operations.

The derivative of a polynomial can be formed recursively by differentiation of the recursion:

   f₀' = 0,
   fᵢ' = fᵢ₋₁'x + fᵢ₋₁   (i = 1, ..., n),
   f'(x) = fₙ'.

This is the Horner scheme for calculating the first derivative of a polynomial. Analogous results hold for the higher derivatives (see Exercise 4).

In MATLAB notation, we get f = f(x) and g = f'(x) as follows:

1.1.7 Algorithm: Horner Scheme

% computes f=f(x) and g=f'(x) for
% f(x)=a(1)*x^n+a(2)*x^(n-1)+...+a(n)*x+a(n+1)
f=a(1); g=0;
for i=1:n,
  g=g*x+f;
  f=f*x+a(i+1);
end;
Note the shift in the index numbering, needed because in MATLAB the vector
indices start with index 1, in contrast to C, in which vector indices start with 0,
and to FORTRAN, which allows the user to specify the starting index.

The reader unfamiliar with programming should also note that a statement
such as g = g * x + f is not an equation in the mathematical sense, but an
assignment. In the right side, g refers to the contents of the associated storage
location before evaluation of the right side; after its evaluation, the contents of
g are replaced by the result of the calculation. Thus, the same variable takes
different values at different times, and this allows one to make the most efficient use
of the capacity of the computer, saving storage and index calculations. Note
also that the update of f must come after the update of g because the formula
for updating g requires the old value of f, which would no longer be available
after the update of f.
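As a small check (not part of the book's text) of Algorithm 1.1.7, the polynomial x³ − 2x + 5 with coefficient vector a = [1 0 -2 5] gives f(2) = 9 and f'(2) = 10; MATLAB's polyval can be used to confirm the value:

   a=[1 0 -2 5]; x=2; n=length(a)-1;  % coefficients of x^3-2x+5, highest power first
   f=a(1); g=0;
   for i=1:n,
     g=g*x+f;
     f=f*x+a(i+1);
   end;
   disp([f g])                        % expected: 9  10
   disp(polyval(a,x))                 % independent check of the value: 9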
1.1.8 Example. We evaluate the polynomial f(x) = (1 − x)⁶ at 101 equidistant points in the range 0.995 ≤ x ≤ 1.005 by calculating (1 − x)⁶ directly and by using the Horner scheme for the equivalent expanded polynomial expression 1 − 6x + 15x² − 20x³ + 15x⁴ − 6x⁵ + x⁶. The results are shown in Figure 1.1. The effect of (simulated) machine precision on the evaluation of the polynomial p(x) = 1 − 6x + 15x² − 20x³ + 15x⁴ − 6x⁵ + x⁶ without using the Horner scheme is demonstrated in Figure 1.2.
Figure 1.1. Two ways of evaluating f(x) = (1 − x)⁶.

In MATLAB, the figure can be drawn easily with the following commands. (Text is handled as an array of characters, and concatenated by placing the pieces, separated by blanks or commas, within square brackets. num2str transforms a number into a string denoting that number, in some standard format. The number of rows or columns of a matrix can be found with size. The period before an operation denotes componentwise operations. The % sign indicates that the remainder of the line is a comment and does not affect the computations. The actual plot is produced by plot, and saved as a postscript file using print.)
1.1.9 Algorithm: Draw Figure 1.1

x=(9950:10050)/10000;
disp(['number of evaluation points: ',num2str(size(x,2))]);
y=(1-x).^6;
% a compact way of writing the Horner scheme:
z=((((((x-6).*x+15).*x-20).*x+15).*x-6).*x+1);
plot(x,[y;z]); % display graph on screen
print -deps horner.ps % save figure in file horner.ps
The figures illustrate a typical problem in numerical analysis. The simple expression (1 − x)⁶ produces the expected curve. However, for the expanded expression, monotonicity of f is destroyed through effects of finite precision arithmetic, and instead of a single minimum of zero at x = 1, we obtain more than 30 changes of sign in its neighborhood. This shows that we must look more closely at the evaluation of arithmetic expressions on a computer and at the rounding errors that create the observed inaccuracies.
1.2 Numbers, Operations, and Elementary Functions
Machine numbers of the following types (where each x denotes a digit) can be
read and printed by any computer:
   integer   ±xxxx               (base 10 integer)
   real      ±xx.xxx · 10^(±xxx) (base 10 floating point number)
   real      ±xxxx.xx            (base 10 fixed point number)
Integer numbers consist of a sign and a finite sequence of digits, and base 10
real floating point numbers consist of a sign, digits before the decimal point, a
decimal point, digits after the decimal point, and an exponent with a sign. In
order to avoid subscripts, computers usually print the letter d or e in place of
the base 10. The actual number of digits depends on the particular computer.
Fixed point numbers have no exponent part.
Many programming languages also provide complex numbers, consisting of
a floating point real part and an imaginary part. In the MATLAB environment,
there is no distinction among integers, reals, and complex numbers. The appropriate internal representation is determined automatically; however, double
precision calculation (see later this section) is used throughout.
In the following, we look in more detail at real floating point numbers.
Floating Point Numbers
For base 10 floating point numbers, the decimal point can be in any position
whatever, and as input, this is always allowed. For the output, two standard forms
are customary, in which, for nonzero numbers, the decimal point is placed either
(i) before the first nonzero digit: ±.xxx · 10^(±xxx), or
(ii) after the first nonzero digit: ±x.xx · 10^(±xxx).
In (i), the x immediately to the right of the decimal point represents a nonzero
decimal digit. In (ii), the x immediately to the left of the decimal point represents
a nonzero decimal digit. In the following, we use the normalization (i). The
sequence of digits before the exponent part is referred to as the mantissa of the
number.
Internally, floating point numbers often have a different base B in which,
typically, B ∈ {2, 8, 16, ..., 2³¹, ...}. A huge base such as B = 2³¹ or any other
large number is used, for example, in the multi-precision arithmetic package of
Brent, available through the electronic repository of mathematical software
NETLIB [67]. (This valuable repository contains much other useful numerical
software, too, and should be explored by the serious student.)
1.2.1 Example. The general (normalized) internal floating point number format has the form

   ±.xxx...x · B^(±xx...x)
    (mantissa of length L)   (exponent ∈ [−E, F])
where L is the mantissa length, B the base, and [−E, F] the exponent range of the number format used. The arithmetic of personal computers (and of many workstations) conforms to the so-called IEEE floating point standard [45] for binary arithmetic (base B = 2).

Here the single precision format (also referred to as real or real*4) uses 32 bits per number: 24 for the mantissa and 8 for the exponent. Because the sign of the mantissa takes 1 bit, the mantissa length is L = 23, and the exponents lie between −E = −126 and F = 127. (The exponents −127 and 128 are reserved for floating point exceptions such as improper numbers like ±∞ and NaN; the latter ("not a number") signifies results of nonsensical calculations such as 0/0.) The single precision format allows one to represent numbers of absolute value between approximately 10^−38 and 10^38, with an accuracy of about seven decimals. (There are also some even tinier numbers with less accuracy that cannot be normalized with the given exponent range. Such numbers are called denormalized. In particular, zero is a denormalized number.)

The double precision format (also referred to as double or real*8) uses 64 bits per number: 53 for the mantissa and 11 for the exponent. Because the sign of the mantissa takes 1 bit, the mantissa length is L = 52, and the exponents lie between −E = −1022 and F = 1023. The double precision format allows one to represent numbers of absolute value between approximately 10^−308 and 10^308, with an accuracy of about 16 decimals. (Again, there are also denormalized numbers.)
MATLAB works generally with double precision numbers. In particular, the
selective use of double precision in predominantly single precision calculations
cannot be illustrated with MATLAB, and we comment on such problems using
FORTRAN language.
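In MATLAB versions newer than the one assumed in the text, numbers can also be stored in single precision; as a small illustration (not from the book) of the two accuracy levels:

   disp(eps('single'))           % 2^(-23), about 1.2e-7: machine precision of the single format
   disp(eps)                     % 2^(-52), about 2.2e-16: machine precision of double precision
   disp(double(single(pi))-pi)   % error committed by storing pi in single precision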
Ignoring the sign for a moment, we consider machine numbers of the form .x₁...x_L · B^e with mantissa length L, base B, digits x₁, ..., x_L, and exponent e ∈ ℤ. By definition, such a number is assigned the value

   Σ_{i=1:L} xᵢ B^(e−i) = x₁B^(e−1) + x₂B^(e−2) + ··· + x_L B^(e−L);

that is, the number is essentially the value of a polynomial at the point B. Therefore, the Horner scheme is a suitable method for converting from decimal numbers to base B numbers and conversely.

Rounding

It is obvious that between any two distinct machine numbers there are many real numbers that cannot be represented. In these cases, rounding must be used; that is, a "nearby" machine number must be found. Two rounding modes are most prevailing: optimal rounding and rounding by chopping.

(i) Optimal Rounding. In the case of optimal rounding, the closest machine number is chosen. Ties can be broken arbitrarily, but in case of ties, the closest number with x_L even is most suitable on statistical grounds. (Indeed, in a long summation, one expects the last digit to have random parity; hence the errors tend to balance out.)

(ii) Rounding by Chopping. In the case of rounding by chopping, the digits after the Lth place are omitted. This kind of rounding is easier to realize than optimal rounding, but it is slightly less accurate.

We call a rounding correct if no machine number lies between a number x and its rounded value x̃. Both optimal rounding and rounding by chopping are correct roundings. For example, rounding with B = 10, L = 3 gives

                         Optimal            Chopping
   x₁ = .123456·10⁵      x̃₁ = .123·10⁵      x̃₁ = .123·10⁵
   x₂ = .567890·10⁵      x̃₂ = .568·10⁵      x̃₂ = .567·10⁵
   x₃ = .123500·10⁵      x̃₃ = .124·10⁵      x̃₃ = .123·10⁵
   x₄ = .234500·10⁵      x̃₄ = .234·10⁵      x̃₄ = .234·10⁵

As measures of accuracy we use the absolute error |x − x̃| and the relative error |x − x̃|/|x| of an approximation x̃ for x. For general floating point numbers of highly varying magnitude, the relative error is the more important measure.

1.2.2 Proposition. Under the hypothesis of an unrestricted exponent set, a correctly rounded value x̃ of x ∈ ℝ\{0} satisfies

   |x − x̃| / |x| ≤ ε := B^(1−L),

and an optimally rounded value x̃ of x ∈ ℝ\{0} satisfies

   |x − x̃| / |x| ≤ ε := ½ B^(1−L).

Thus, the relative error of the rounded value is bounded by a small machine-dependent number ε.

The number ε is referred to as the machine precision of the computer. In MATLAB, the machine precision is available in the variable eps (unless eps is overwritten by something else).

Proof. We give the proof for the case of correct rounding; the case of optimal rounding is similar and left as Exercise 6. Suppose, without loss of generality, that x > 0 (the case x < 0 is analogous), where x is a real number that has, in general, an infinite base B representation. Suppose, without loss of generality, that

   x = .x₁x₂x₃... · B^e   with x₁ ≠ 0,

and set

   x' := .x₁...x_L · B^e.

Then x ∈ [x', x' + B^(e−L)] and x' ≤ x̃ ≤ x' + B^(e−L) because the rounding is correct, and |x| ≥ x' ≥ B^(e−1) because x₁ ≠ 0. So

   |x − x̃| ≤ B^(e−L) ≤ B^(1−L)|x|,

and the assertion follows. □
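A short illustration (not from the book) of this bound in MATLAB, whose double precision arithmetic uses B = 2 and L = 52 with optimal rounding:

   disp(eps)            % 2^(-52), about 2.22e-16: the bound (1/2)*B^(1-L) of Proposition 1.2.2
   disp(1+eps > 1)      % 1 (true): eps is the gap between 1 and the next machine number
   disp(1+eps/2 == 1)   % 1 (true): a relative perturbation below eps is rounded away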
This proposition is a basis for all rigorous error analysis of numerical methods; see in particular Higham [44] and Stummel and Hainer [92]. In this book, we usually keep such analyses on a more heuristic level (which is enough to expose the pitfalls in most numerical calculations); a fully quantitative treatment is only given for Gaussian elimination (see Section 2.1).

Overflow and Underflow

The preceding result is valid only for an unrestricted set of exponents. In practice, however, the exponent range is finite, and there are problems when |x| is too large or too small to be representable by a normalized number within the exponent range. In this case, we speak of overflow or underflow, respectively.

The behavior resulting from overflow depends on the programming language and its implementation. Often, it results in an appropriate error message; alternatively, it may result in an improper number ±∞. (Some old compilers even set the result to zero, without warning!)

In case of underflow, numbers that are too small are either rounded to denormalized numbers (gradual underflow) or replaced by zero; this is, in general, reasonable, but leads to much larger relative errors, and occasionally can therefore be dangerous.

Overflow and underflow can usually be avoided by suitable scaling of the problem; hence we assume in later discussions that the exponent range is unrestricted and Proposition 1.2.2 is generally valid.

Operations and Elementary Functions

After the consideration of the representation of numbers on a computer, we look at the machine results of binary operations and elementary functions. For correctly rounded results of the binary operations ∘ ∈ O, we have

   |x õ y − x ∘ y| / |x ∘ y| ≤ ε,

where x õ y represents the computed value of x ∘ y and ε is the machine precision. In the following, the validity of this property is assumed for ∘ ∈ {+, −, *, /} (but not for the power); it is satisfied by most modern computers as long as neither overflow nor underflow occurs. For example, the IEEE floating point standard requires the results of these operations (and the square root) to be even optimally rounded.

The power x^y (except for the square, y = 2) is usually slightly less accurate because it is computed as a composite expression. For small integer y, x^y is computed by repeated multiplication, and otherwise from exp(y log x). For theoretical analysis, we suppose that the relative error of the power is bounded by Cε for a constant not much larger than 1; this holds in practice except for extreme values of the operands.

For the calculation of the elementary functions φ ∈ J, one must use approximative methods. For a thorough treatment of the approximations of standard functions, see Hart [41]. We treat only two typical examples here, namely the square root and the exponential function.

The following three steps are usually employed for the realization of the elementary function φ(x) on a computer: (i) argument reduction to a number x₀ in a standard range, followed by (ii) approximation of φ(x₀), and (iii) a subsequent result adaptation.

The argument reduction (i) serves mainly to reduce the amount of work needed in the approximation step (ii), and the result adaptation (iii) corrects the result of (ii) to account for the argument reduction. To demonstrate the ideas, we look in more detail at the computation of the square root and the exponential function.

Suppose we want to compute √x = sqrt(x) with x = m·B^e, where m ∈ [1/B, 1[ is the mantissa, B is the base, and e is the exponent of x.

Argument Reduction and Result Adaptation for √x

We may express √x in the form

   √x = (√x₀) · B^s

with x₀ = m, s = e/2 if e is even, and x₀ = m/B, s = (e + 1)/2 if e is odd. This is the argument reduction. Because x₀ ∈ [B⁻², 1], it is sufficient to determine the value of the square root in the interval [B⁻², 1] to obtain a result which may then be adapted to give the required value √x by multiplying the result √x₀ ∈ [B⁻¹, 1] (the mantissa) by B^s.
Approximation of √x

(a) The simplest but inefficient possibility is the expansion in a power series about 1. The radius of convergence of the series

   √(1 − z) = 1 − (1/2)z − (1/8)z² − (1/16)z³ − (5/128)z⁴ − ···

is 1. The series converges sufficiently rapidly only for x ≈ 1 (z ≈ 0), but converges very slowly for x₀ = 0.01 (z = 0.99), say (an essential value when B = 10), and is therefore useless for practical purposes. The cause of the slow convergence is the vertical tangent to the graph of x^(1/2) at x = 0.
(b) Another typical way to approximate function values is offered by an iterative method. Here, iteration of the same process over and over again successively improves an approximation until a desired accuracy is reached.

In order to find w̄ = √x for a given value of x > 0, let w₀ be an initial estimate of w̄; for example, we may take w₀ = 1 when x ∈ [B⁻², 1]. Note that w̄² = x and so w̄ = x/w̄. To find a suitable iteration formula, we define w̃₀ := x/w₀. If w̃₀ = w₀, then we are finished (w̄ = w₀); otherwise, we have w̃₀ = w̄²/w₀ = w̄(w̄/w₀) < w̄ (or > w̄) if w₀ > w̄ or w₀ < w̄, respectively. This suggests that the arithmetic mean w₁ = (w₀ + w̃₀)/2 might be a suitable new estimate for w̄. Repeating the process leads to the following iterative method, already known to Babylonian mathematicians.

1.2.3 Proposition. Let x > 0. For arbitrary w₀ > 0, define

   w̃ᵢ := x/wᵢ,   wᵢ₊₁ := (wᵢ + w̃ᵢ)/2   (i ≥ 0).     (2.1)

If w₀ = √x, then wᵢ = w̃ᵢ = √x (i ≥ 0); otherwise,

   w̃₁ < w̃₂ < ··· < w̃ᵢ < √x < wᵢ < ··· < w₂ < w₁.

The absolute error εᵢ := wᵢ − √x satisfies

   εᵢ₊₁ = εᵢ²/2wᵢ   (i ≥ 0),     (2.2)

and the relative error eᵢ := (wᵢ − √x)/√x satisfies

   eᵢ₊₁ = eᵢ²/2(1 + eᵢ)   (i ≥ 0).     (2.3)

Here we use the convention that a product has higher priority than division if it is written by juxtaposition, and lower priority if it is written with explicit multiplication sign. Thus, εᵢ²/2wᵢ denotes εᵢ²/(2wᵢ).

Proof. With w̄ := √x, we have (for all i ≥ 0)

   wᵢ = εᵢ + w̄ = (1 + eᵢ)w̄,

and

   εᵢ₊₁ = wᵢ₊₁ − w̄ = (wᵢ + x/wᵢ)/2 − w̄ = (wᵢ² + x − 2w̄wᵢ)/2wᵢ
        = {(εᵢ + w̄)² + w̄² − 2w̄(εᵢ + w̄)}/2wᵢ = εᵢ²/2wᵢ ≥ 0,

hence

   eᵢ₊₁ = εᵢ₊₁/w̄ = εᵢ²/2wᵢw̄ = eᵢ²/2(1 + eᵢ).

If now w₀ ≠ √x, then ε₀ = w₀ − w̄ ≠ 0 and εᵢ₊₁ > 0. So

   0 < εᵢ₊₁ = εᵢ²/2wᵢ ≤ εᵢ/2 < εᵢ   (i > 0),

whence √x < wᵢ₊₁ < wᵢ and w̃ᵢ < w̃ᵢ₊₁ < √x (i > 0). This proves the proposition. □

1.2.4 Example. The iteration of Proposition 1.2.3 was executed on a pocket calculator with B = 10 and L = 12 for x = 0.01, starting with w₀ := 1. The results displayed in Table 1.2 were obtained. One observes a rapid increase in the number of correct digits as predicted by (2.3). This is so-called quadratic convergence, and the iteration is in fact a special case of Newton's method (cf. Chapter 5).

Table 1.2. Calculation of √0.01

   i    wᵢ
   0    1
   1    0.505...
   2    0.2624...
   3    0.1502...
   4    0.1084...
   5    0.1003...
   6    0.1000005...
   7    0.1

For B = 10, x₀ ∈ [0.01, 1] is a consequence of argument reduction. Clearly, x₀ = 0.01 is the most unfavorable case; hence seven iterations (i.e., 21 operations) always suffice to calculate √x accurate to twelve decimal digits. The argument reduction is important because for very large or very small x₀, the convergence is initially rather slow (for large eᵢ, we have eᵢ₊₁ ≈ eᵢ/2).
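A minimal sketch (not from the book) of the iteration (2.1) in MATLAB, reproducing Table 1.2 in double precision:

   x=0.01; w=1;            % w0 = 1, as in Example 1.2.4
   for i=1:7,
     w=(w+x/w)/2;          % iteration (2.1)
     disp([i w])           % converges quadratically to sqrt(0.01) = 0.1
   end;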
(c) Another frequent approximation method uses rational functions. Assuming a rational approximation of √x of the form

   w*(x) = t₂x + t₁ + t₀/(x + s₀),

optimal coefficients for the interval [0.01, 1] can be determined so that

   sup_{x ∈ [0.01, 1]} |√x − w*(x)|

is minimal. The values of the coefficients (from the approximation SQRT0231 of Hart [41]) are

   t₂ = 0.588...1229,
   t₁ = 0.467 975 327 625,
   t₀ = −0.040 916 239 1674,
   s₀ = 0.099 999 8.

For every x₀ ∈ [0.01, 1], the relative error of w* is less than 0.02 (see Hart [41], p. 94). (For other functions or higher accuracy, one would use more complicated rational functions in the form of so-called continued fractions.) With three additional iterations as under (b) with initial value w*, we obtain a relative error of 10⁻¹⁵, requiring a total of fourteen operations only. In binary arithmetic, even fewer operations suffice because the argument reduction reduces the interval in which a good approximation is needed to [0.25, 1].

As our second example, we consider the computation of e^x = exp(x) with x = m·B^e, where m ∈ [1/B, 1[ is the mantissa, B is the base, and e is the exponent of x.

Argument Reduction and Result Adaptation for exp(x)

We can write exp(x) = exp(x₀) · B^s with s = ⌊x/log B + 1/2⌋ and x₀ = x − s·log B. This gives an integral exponent s and a reduced argument x₀ with |x₀| ≤ ½ log B. The constant log B must be stored in higher precision so that the result adaptation does not cause undue rounding errors.

Approximation of exp(x)

As for the square root, there are various possibilities for the approximation. Because the power series for the exponential function converges very rapidly, only the power series approximation is considered. We may write the partial sum as

   p(x) = Σ_{k=0:n} x^k/k!

and have the approximation

   e^x = p(x) + (x^(n+1)/(n+1)!) e^ξ,

where |ξ| ≤ |x|. Thus the error satisfies

   |e^(x₀) − p(x₀)| ≤ (|x₀|^(n+1)/(n+1)!) e^|x₀| ≤ ((log(B)/2)^(n+1)/(n+1)!) √B   for |x₀| ≤ log(B)/2,

and an optimal value of n corresponding to a given accuracy can be determined; for example, for B = 10, n = 16, the absolute error is < 10⁻¹³.
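A minimal sketch (not from the book) of this argument reduction and of the truncated series with n = 16, for B = 10; the variable names s, x0, p are ad hoc:

   x=12.3; B=10; n=16;
   s=floor(x/log(B)+1/2);     % integral exponent for the result adaptation
   x0=x-s*log(B);             % reduced argument, |x0| <= log(B)/2
   p=1; term=1;
   for k=1:n,
     term=term*x0/k;          % x0^k/k!
     p=p+term;                % partial sum p(x0)
   end;
   disp([p*B^s, exp(x)])      % adapted result versus MATLAB's built-in exp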
1.3 Numerical Stability

As a result of the finite number of digits used by the computer, numbers are
stored inexactly and operations are carried out inexactly. The relative error is
negligibly small for a single operation, but not necessarily for a sequence of two
or more operations. For the example of the evaluation of the polynomial p(x) = (1 − x)⁶ in the neighborhood of x = 1, Section 1.1 shows the different results obtained if the polynomial is evaluated (a) by using the formula (1 − x)⁶ or (b) by using the equivalent formula 1 − 6x + 15x² − 20x³ + 15x⁴ − 6x⁵ + x⁶. In
what follows, behavior such as (a) is called numerically more stable and behavior such as (b) is called numerically more unstable. These are qualitative notions
that are not defined exactly. Numerical instability means that the error in the
result is considerably greater than one would expect from small errors in the
input. This expected error is quantified exactly through the notion of condition
in Section 1.4. The most frequent causes of instability are illustrated by means
of examples.
In the following, we always distinguish between accuracy, which is related to
the number of correct digits in some approximate quantity, and precision, which
is defined as the accuracy with which single operations with (or storage of)
exact numbers are performed. Thus we may speak of single or double precision
numbers or calculations; however, this says nothing at all about the quality
of the numbers involved in relation to their intended meaning. Estimating the
accuracy of the latter is the subject of an error analysis that should accompany
any extended calculation, at least in a qualitative way.
Because in what follows inexact numbers occur frequently, we introduce the notation ≈ for approximately equal to. We shall also write x ≫ y (x ≪ y) when x, y are positive and x/y is much greater than 1 (and, respectively, much smaller than 1). These are qualitative notions only; it must be decided from the context or application what "approximately" or "much" means quantitatively.
1.3.1 Example. All of the following examples were computed on a pocket calculator with B = 10 and L = 12. (Calculators with different rounding characteristics probably give similar but not identical results.)

(i) f(x) := (x + 1/3) − (x − 1/3).

   x        f(x)
   1        0.666666666...
   10³      0.666666663...
   10⁶      0.666663...
   10⁹      0.663...
   10¹⁰     0.33...
   10¹¹     0

Because f(x) = 2/3 for all x, there is an increasing loss of accuracy for increasing x.

(ii) f(x) := ((3 + x²/3) − (3 − x²/3))/x².

   x        f(x)
   10⁻¹     0.666666667
   10⁻²     0.6666667...
   10⁻³     0.66667...
   10⁻⁴     0.667...
   10⁻⁵     0.675
   10⁻⁶     0

Because f(x) = 2/3 for all x ≠ 0, there is an increasing loss of accuracy for decreasing x.

There are two reasons for the increasing loss of accuracy in (i) and (ii):

1. There is a large difference in order of magnitude between the numbers to be added or subtracted, respectively (e.g., 10¹⁰ + 1/3). If x = m·10^e and x' = m'·10^e' with |x| > |x'|, and k := e − e' − 1, then x' is too small to have any influence on the first k places of the sum or difference of x and x', respectively. Thus, information on x' is in the less significant digits of the intermediate results only, and the last k places of x' are completely lost in the rounding process.
2. The subtraction of numbers of almost equal size leads to the cancellation of leading digits and promotes the wrong low-order digits in the intermediate results to a much more significant position. This leads to a drastic magnification of the relative error.

In case (i), the result of the subtraction already has order unity, so that relative errors and absolute errors have the same magnitude. In case (ii), the result of the subtraction has small absolute (but large relative) error, and the division by the small number x² leads to a result with large absolute error.

(iii) f(x) := sin²x/(1 − cos²x).

   x          f(x)
   0.01°      1.0007
   0.001°     1.0879
   0.0005°    1.2692
   0.0004°    2.3166
   0.0003°    ERROR

Because f(x) = 1 for all x not a multiple of π, there is a drastic decrease in accuracy for decreasing x due to division by the small number 1 − cos²x, and ultimately an error occurs due to division by a computed zero.

(iv) f(x) := sin x/√(1 − sin²x).

   x         f(x)            tan x
   89.9°     572.9588...     572.9572...
   89.95°    1145.87...      1145.91...
   89.99°    5735.39...      5729.57...

The correct digits are underlined. We have f(x) = tan x for all x, and here the loss of accuracy is due to the fact that f(x) has a pole at x = 90°, together with the cancellation in the denominator.

(v) f(x) := e^(x²/3) − 1. For comparison, we give together with f(x) also the result of the first two terms of the power series expansion of f(x).

   x        f(x)              x²/3 + x⁴/18        f(x), correctly rounded
   0.1      3.338895·10⁻³     3.338888889·10⁻³    3.33889506688·10⁻³
   0.01     3.333388·10⁻⁵     3.333388889·10⁻⁵    3.33338888951·10⁻⁵
   0.001    3.333300·10⁻⁷     3.333333889·10⁻⁷    3.33333388889·10⁻⁷
   0.0001   3.330000·10⁻⁹     3.333333339·10⁻⁹    3.33333333889·10⁻⁹

Although the absolute errors of f(x) appear acceptable, the relative errors grow as x approaches zero (cancellation!). However, the truncated power series gives results with increasing relative accuracy.

(vi) The formula

   σₙ = √( (Σ_{i=1:n} xᵢ² − (Σ_{i=1:n} xᵢ)²/n) / n )

gives the standard deviation of a sequence of data x₁, ..., xₙ in a way as implemented in many pocket calculators. (The alternative formula mentioned in Example 1.1.1 is much more stable but does not allow the input of many data with little storage!) For xᵢ := x where x is a constant, σₙ = 0, but on the pocket calculator, the corresponding keys yield the results shown in Table 1.3. The zero in the table arises because the calculator checks whether the computed argument of the square root is negative (which cannot happen in exact arithmetic) and, if so, sets σₙ equal to zero, but if the argument is positive, the positive result is purely due to round off.

Table 1.3. Standard deviation of xᵢ := x where x is a constant

   x          σ₁₀            σ₂₀
   100/3      1.204·10⁻³     1.131·10⁻³
   1000/29    8.062·10⁻⁴     0

We conclude from the examples that the quality of a computed function value depends strongly on the expression used. To prevent, if possible, the magnification of old errors, one should avoid the following:

• cancellation; that is, subtraction of numbers of almost equal size that leads to a loss of leading digits and thereby to magnification of the relative error;
• division by a very small number leading to magnification of the absolute error;
• multiplication by a number of very large magnitude, leading to magnification of the absolute error.

In the formulation and analysis of algorithms, one must look out for these problems to see whether one can avoid them by a suitable reformulation. In general, one can say that caution is appropriate in any computation in which some intermediate result has a much larger or smaller magnitude than a final result. Then a more detailed investigation must reveal whether this really leads to numerical instability, or whether the further course of the computation reduces the effect to a harmless level.

Note that, in the case of cancellation, the difference is usually calculated without error; the instability comes from the magnification of old errors. This is shown in the following example.

1.3.2 Example. Subtraction of two numbers with B = 10 and with L = 7 and L = 6, respectively:

(i) L = 7

   1st number:    0.2789014·10³
   2nd number:    0.2788876·10³
   Difference:    0.0000138·10³     (no rounding error)
   Normalized:    0.1380000·10⁻¹

(ii) L = 6

                   Optimally rounded    Relative error
   1st number:     0.278901·10³         < 2·10⁻⁶
   2nd number:     0.278888·10³         < 2·10⁻⁶
   Difference:     0.000013·10³         no rounding error
   Normalized:     0.130000·10⁻¹        > 5·10⁻²

Although there is no rounding error in the subtraction, the relative error in the final result is about 25,000 times bigger than both the initial errors.
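The same effects appear in double precision at correspondingly larger arguments; a minimal sketch (not from the book) of Example 1.3.1(i) in MATLAB:

   for x=10.^(0:3:18),
     disp([x, (x+1/3)-(x-1/3)])   % exact value is 2/3 for every x
   end;

For x = 10¹⁵ the term 1/3 affects only the last few bits of x ± 1/3, and for still larger x it is lost completely, so the computed difference degrades to 0.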
1.3.3 Example. In spite of division by a small number, the following formulas are stable:

(i) f(x) := { sin(x)/x  if x ≠ 0;  1  if x = 0 }  is stable for x ≈ 0:

   x          f(x)
   0.1        0.9983...
   0.01       0.999983...
   0.001      0.99999983...
   0.0001     0.999999998...
   0.00001    1

The reason for the stable behavior is that in both numerator and denominator only one operation (in our usage of the term) is performed; stability follows from the form of the relative error that is more or less independent of the way in which the elementary functions are implemented. The same behavior is visible in the next case.

(ii) f(x) := { (x − 1)/ln x  if x ≠ 1;  1  if x = 1 }  is stable for x ≈ 1:

   x               f(x)
   1.001           1.00049...
   1.0001          1.000049...
   1.00001         1.000005...
   1.000000001     1.000000001

Stabilizing Unstable Expressions

There is no general theory for stabilizing unstable expressions, but some useful recipes can be gleaned from the following examples.

(i) In the equality

   √(x+1) − √x = 1/(√(x+1) + √x),

the left side is unstable and the right side is stable for x ≫ 1.

(ii) In the equalities

   1 − cos x = sin²x/(1 + cos x) = 2 sin²(x/2),

the left side is unstable and both of the right sides are stable for x ≈ 0.

In both cases, the difference of two functions is rearranged so that there is a difference in the numerator that can be simplified analytically. This difference is thus evaluated without rounding error. For example,

   √(x+1) − √x = ((x + 1) − x)/(√(x+1) + √x) = 1/(√(x+1) + √x).

(iii) The expression e^x − 1 is unstable for x ≈ 0. We substitute y := e^x, whence e^x − 1 = y − 1, which is still unstable. For x ≠ 0,

   e^x − 1 = (y − 1)x/x = ((y − 1)/ln y)·x;

see Example 1.3.3(ii). Therefore we can obtain a stable expression as follows:

   f(x) = { x  if y = 1;  ((y − 1)/ln y)·x  if y ≠ 1 },   where y = e^x.

(iv) The expression e^x − 1 − x is unstable for x ≈ 0. Using the power series expansion

   e^x = Σ_{k=0}^∞ x^k/k!,

we obtain

   e^x − 1 − x = { e^x − 1 − x  if |x| > c;  x²/2 + x³/6  otherwise },

in which a suitable threshold c is determined by means of an error estimate: We expect the first formula to have an error of ε = B^(1−L), and the second, truncated after the terms shown, to have an error of ≈ x⁴/24 ≤ c⁴/24; hence, the natural threshold in this case is c = (24ε)^(1/4), which makes the worst case absolute error minimal (≈ε), and gives a worst case relative error of ≈ ε/(c²/2) ≈ √(ε/6).
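A minimal sketch (not from the book) of recipe (iii) in MATLAB, compared with the naive expression and with the built-in function expm1, which computes e^x − 1 stably:

   x=1e-12;
   y=exp(x);
   if y==1, f=x; else f=(y-1)*x/log(y); end   % stable formula of (iii)
   disp([exp(x)-1, f, expm1(x)])              % naive, stabilized, and built-in result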
The Numerical Evaluation of Expressions
30
1.3 Numerical Stability
Finally, we consider the stabilization of two widely used unstable formulas.
with
tn =
The Solution of a Quadratic Equation
ax 2 + bx
+c =
(a
0
(Xi -
Sn)2,
cf. Example 1.1.1. This formula for the standard deviation is impractical because
all the Xi must be stored. This can be avoided by a transformation to a recursive
form in which each Xi is needed only until Xi+1 is read.
-I 0)
has the two solutions
-b ± Jb 2
2a
-
4ac
t;
If b2 » 4ac, then this expression is unstable when the sign before the square
root is the same as the sign of b (the expression obtained by choosing the other
sign before the square root is stable).
On multiplying both the numerator and the denominator by the algebraic
conjugate -b =f Jb 2 - 4ac, the numerator simplifies to b 2 - (b 2 -4ac) = 4ac,
leading to the alternative formula
XIO
,.
=
2c
-b =f Jb 2
-
4ac
.
Now the hitherto unstable solution has become stable, but the hitherto stable
solution has become unstable. Therefore, it is sensible to use a combination of
both. For real b, the rule
q := -(b
+ sign(b)Jb 2 -
XI := q f a,
=
L xf - 2s L
n
i=l~
Xi
Sn-!
=
(n -
l)sn_1
an
of
XI, ... , X n
+ Xn
n
Xz:= c jq
and the standard deviation
L1
i=l~
xl
Sn -
Mean Value and Standard Deviation
+ S,~
i=l~
cf. Example 1.3.1(vi). These are the formulas for calculating the standard deviation stated in most statistics books. Their advantage is that t; can be computed
recursively without storing the Xi, and their greater disadvantage lies in their
and
instability due to cancellation, because (for data with small variance) L
ns~ have several common leading digits.
The transformation into a stable expression is inspired by the observation
that the numbers s; and In are expected to change very little due to the addition
of new data values X,; therefore, we consider the differences Sn - Sn-l and
tn - tn-I. We have
4ac )/2,
calculates both solutions by evaluating a stable expression. The computational
cost remains the same. (For complex b this formula does not work, because
sign(b) = b/lbl -I ±l; cf. Exercise 12.)
Sn
L
i=l:n
The quadratic equation
The mean value
defined by
31
n
-
Sn-I
n
where 8/l := X/l - S/l-!, and
In - In_I
=
(L xf - ns,~) L xf - (,
,=1:n
are usually
,=I:n-I
(n -
l)SI~_l)
1
or, sometimes,
Sn
=-
an
=
LXi,
n i=l:n
Jtn/n
=
8/l) = 8n
8/l ( 8n - -;;
= 81l (x n - sn).
«xn -
SIl-I) - (SII - Sn-I»
32
The Numerical Evaluation of Expressions
1.4 Error Propagation and Condition
From these results, we obtain a new recursion for the calculation of Sn and tn :
1.4 Error Propagation and Condition
SI
= Xl,
t[
= 0,
and for i ::: 2
Oi =
s, =
t, =
Xi Si-l
ti-l
Si-I,
+ o;/i,
+ Oi(Xi
-
Si).
The differences Xi - s, -1 and Xi - s., which have still to be calculated, are now
harmless: the cancellation can bring about no great magnification of relative
errors because the difference (Xi - Si) is multiplied by a small number Oi and
is then added to the (as a sum of many small positive terms generally larger)
number t.:«.
The new recursion can be programmed with very little auxiliary storage, and
in an interactive mode there is no need for storage of old values of Xi, as the
following MATLAB program shows.
33
In virtually all numerical calculations many small errors are made. Hence, one
is interested in how these errors influence further calculations. As already seen,
the final error depends strongly on the formulas that are used. Among other
things, especially for unstable methods, this is because of the finite number of
digits used by the computer.
However, in many calculations, already the input data are known only approximately; for example, when they are determined by measurement. Then
even an exact calculation gives an inaccurate answer, and in some cases, as in
the following example, the relative error grows strongly even when exact arithmetic is used, and hence independently of the method that is used to calculate
the result.
1.4.1 Example. Let
1
f(x):= - - .
I-x
If X := 0.999, then f(x) = 1000. An analytical error analysis shows that for
x = 0.999 + e, with e small,
1.3.4 Algorithm: Stable Mean and Standard Deviation
1000
f(x) = I - 1000e
i=l; s=input('first number?>');
t=O;
x=input('next number? (press return for end of list»');
while -isempty(x),
i=i+l ;
delta=x-s;
s=s+delta/i;
t=t+delta*(x-s);
x=input('next number? (press return for end of list»');
end;
disp( ['mean: ' ,num2str(s)]);
sigmal=sqrt(t/i);sigma2=sqrt(t/(i-l));
disp('standard deviation');
disp ( [,
of sample: ',num2str (sigma!)]) ;
disp(['
estimate of distribution: ',num2str(sigma2)]);
C,
& and I code the logical "not," "and," and "or." The code also gives an
example of how MATLAB allows one to read and print numbers and text.)
= 1000(1
+ 103e + 106e 2 + ...).
Hence, for the relative errors of x and
Ix -xl
Ixl
--
f
(.l:)
~
we have
Laale
and
respectively; that is, independently of the method of calculating I, the relative
error is magnified by a large factor (here about 1000).
One calls such problems ill-conditioned. (Again, this is only a qualitative
notion without very precise meaning.) For an ill-conditioned problem with inexact initial data, all numerical methods must give rise to large relative errors,
A quantitative measure for the condition of differentiable expressions is the
condition number K, the (asymptotic) magnification factor of the relative error.
For a concise discussion of the asymptotic behavior of quantities, it is useful
1.4 Error Propagation and Condition
The Numerical Evaluation of Expressions
34
to recall the Landau symbols 0 and O. In order to characterize the asymptotic behavior of the functions f and g in the neighborhood of r" E lR U {oo}
(a value usually apparent from the context), one writes
fer)
= o(g(r»
fer)
when - - ---+ 0
g(r)
r.
as r ---+
35
we have
If(x) - f(x)1
Ix -xl
~K---.
[x]
If(x)1
1.4.2 Examples.
(i) In the previous example f(x) = I/O-x), we have atx = .999, the values
f(x) ~ 103 and f'(x) ~ 106 ; hence, K ~ .999 106 /1000 ~ 103 predicts
and
*
fer)
=
f(r).
when - - remams bounded as r
g(r)
O(g(r»
---+
*
the inflation factor correctly.
(ii) The function defined by f(x) := x 5 - 2x 3 has the derivative f'(x) =
5x 4 -6x 2 • Suppose thalli -21 .::: r := 10- 3 , so thatthe relative errorofi is
r /2. We estimate the error If (x) - f (2) I by means of the condition number:
For x = 2,
r .
In particular,
fer)
= o(g(r»,
ve- rlim
-e-r"
g(r) bounded
fer)
= 0,
f'(Xl!=7
K=lxf(x)
,
and
fer)
=
O(g(r»,
lim g(r)
r-e-r"
= O:::::}
lim fer)
r~r*
= o.
whence
If(x) - f(2)1
(One reads the symbols as "small oh of" and "big oh 0[," respectively.)
Now let x be an approximation to x with relative error E = (x - x) / x , so
that x = (l + E)X. We expand f(x) in a Taylor series about x, to obtain
f(x)
=
=
+ EX)
f(x) + Exf'(x) + 0(E 2 )
f(x
= f(x) (
xf'(x)
1+ -E +
f(x)
0(E 2 )
)
=
Ixf'(X)
f(x)
.
0.0035If(2)[
f(x)
Ixf'(X) I.
f(x)
0.0035,
= 0.056.
f (x*) = 0 and f' (x*) oF 0 (these conditions characterize a simple
zero .r"), then K approaches infinity as x ---+ x*; that is, f is ill conditioned in the neighborhood of a simple zero x* oF O.
CASE 2: If there is a positive integer m such that
with the (relative) condition number
=
~
7
2r =
CASE 1: If
IIEI + 0(E 2 )
= KIEI + 0(E 2 ) ,
K
and IfCt) - f(2)1
~
In a qualitative condition analysis, one speaks of error damping or of error
magnification, depending on whether the condition number satisfies K < 1 or
K > 1.1ll-conditioned problems are characterized by having very large condition
numbers, K » 1.
From (4.1), it is easy to see when the evaluation of a function of a single
variable is ill conditioned. We distinguish several cases.
We therefore have for the relative error of f,
If(x) - f(x)1
If(x)1
If(2)1
(4.1)
Assuming exact calculation, we can therefore predict approximately the relative
error of f at the point x by calculating K. Neglecting terms of higher order,
=
(x - x*)mg(x)
with g(x*) oF 0,
(4.2)
we say that x* is a zero of f of order m because the factor in front
of g(x) can be viewed as the limiting, confluent case Xi ---+ x* of a
product of m distinct linear factors (x - Xi); if (4.2) holds with a
negative integer m = -s < 0, then x* is referred to as a pole of f of
36
The Numerical Evaluation of Expressions
1.4 Error Propagation and Condition
37
(ii) Condition: We have
orde r s. In both cases,
f'ex) = m(x - x*)m-Ig(x)
+ (x
- x*)mg,(x),
f'ex)
=
so that
_x- 2
2v"x- 1 _ 1
v"r l
=
K
g'(x) I
=Ixl' -m- + x -x*
1
x-x *
-x-
I
K-+
I
+0(1).
=
+1
+1
Ix
. j'(x)
f(x)
I
1
-2~'
if x* =I- 0,
Iml ifx*=O.
For m = 1, this agrees with what we obtained in Case 1. In general, we
see that in the neighborhood of zeros or poles x* =I- 0, the condition
number is large and is nearly inversely proportional to the relative
distance of x from x*. (This explains the bad behavior near the multiple
zero in Example 1.1.8.) However, if the zero or pole is at x* = 0, the
condition number is approximately equal to the order of the pole or
zero, so that f is well conditioned near such zeros and poles.
CASE 3: If t' (x) becomes infinite for x* -+ 0 then f is ill-conditioned at x".
For example, the function defined for x > 1 by f (x) = 1 + ,J:X=l
has the condition number
K
=
K
Therefore, as x -+ x*,
00
Iv"r 1
and therefore, for the condition number K,
g(x)
I-I
2v"x- 1 + 1
1 - v"r l
2x 2 v"r 1 -
IXf'(X) I
f(x)
= Iml'
-
_x- 2
Thus, for x ;:::; 0, K ;:::; 4 so that f is well conditioned. However, for
x -+ 1, we have K -+ 00 so that f is ill-conditioned. So, the formula for
f (x) is unstable but well-conditioned for x ;:::; 0, whereas it is stable but ill
conditioned for x ;:::; 1.
We see from this that the two ideas have nothing to do with each other: the
condition of a problem makes a statement about the magnification of initial
errors through exact calculation, whereas the stability of a formula refers to
the influence of rounding errors due to inexact calculation because of finite
precision arithmetic.
It is not difficult to derive a formula similar to (4.1) for the condition number
of the evaluation of an expression f in several variables Xl, ... , x n . Let Xi =
Xi (1 + Ci) and ICi I ::: C (i = 1, ... , n). Then one obtains the multidimensional
Taylor series
X
--------:===2(x - I
+ ,J:X=l)
and K -+ 00 as x -+ 1.
We now demonstrate with an example the important fact that condition and stability are completely different notions.
We make use of the derivative
f I (x)
=
(
a
aXI
1.4.3 Example. Let f(x) be defined by
f(x):=Jrl-I-Jrl+l
to obtain the relative error
(O<x<l).
(i) Numerical stability: For x;:::; 0, we have X-I - 1 ;:::; X-I + 1; hence, there
is cancellation and the expression is unstable; however, the expression is
stable for x ;:::; 1.
a)
-f(x), ... , -f(x)
aX n
,
39
The Numerical Evaluation of Expressions
1.5 Interval Arithmetic
where the multiplication dot denotes the scalar product of two vectors. The
resulting error propagation formula for functions of several variables is
that contains this number, and critical computations are performed with intervals instead of approximate numbers. The operations of interval arithmetic are
defined in such a way that the resulting interval always contains the true result
that would be obtained by using exact inputs and exact calculations; by careful
rounding, this property is assured, even when all calculations are done with
finite precision only.
In the context of interval calculations, it is more natural to express errors as
absolute errors, and we replace a number x; which has an absolute error sr.
with the interval [x - r, x + r]. In the following, M is the space of real numbers
or a space of real vectors or matrices, with component-wise inequalities.
38
If(x) - f(x)1
-----If(x)1
<~
K
max
1<i :»
Ix; - x;1
,
Ix;1
with the (relative) condition number
K
= Ixl·
Ir
(x) I
If(x)I'
looking just as in the one-dimensional case. Again, one expects bad condition
only near zeros of f or near singularities of
In the preceding discussion, the higher-order terms in e were neglected in the
calculation of the condition number K. This means that the error magnification
ratio is given closely by the condition number only when the input error e
is sufficiently small. Often, however, it is difficult to tell whether a specified
accuracy in the inputs is small enough. Using an additional tool discussed in
the next section, it is possible to determine rigorous bounds for the propagation
error for a given bound on the input error.
r.
1.5.1 Definition. The symbol
x := [!.,
x] := {x E M I!.
~
x
~
x}
denotes a (closed and bounded) interval in M with lower bound x
upper bound x E M, !. ~ x, and
E
M and
denotes the set of intervals over M (of interval vectors if M is a space of vectors,
of interval matrices if M is a space of matrices). The midpoint of x,
1.5 Interval Arithmetic
Interval arithmetic is a kind of automatic, rigorous error analysis. It serves
as an essential tool for mathematically rigorous computer-assisted proofs that
involve finite precision calculations. The most conspicuous application is the
recent solution by Hales [37] of Kepler's more than 300 year old conjecture
that the face-centered cubic lattice is the densest packing of equal spheres
in Euclidean three-dimensional space. For other highlights (which require, of
course, additional machinery from the applications in question), see Eckmann,
Koch, and Wittwer [23], Hass, Hutchings, and Schlafli [42], Mischaikow and
Mrozek [62], and Neumaier and Rage [73]. Many algorithms of computational
geometry depend on the correct evaluation of the sign of certain numbers for
correct performance, and if these are tiny, the correct sign depends on rigorous
error estimates (see, e.g., Mehlhorn and Naher [61]).
Independent of rounding issues, interval arithmetic is an important tool in
global optimization because of its ability to provide global information over
wide intervals, such as bounds on ranges or derivatives, or Lipschitz constants
(see Hansen [39] and Kearfott [49]).
The propagation of errors is accounted for in interval arithmetic by taking
as given not a particular inexact number, but an interval of machine numbers
midx:=
I
x := -(x
+ x)
2-
and the radius of x,
1
radx := -(x - x) 2: 0,
2
-
allow one to convert the interval representation of an inaccurate number to the
absolute error representation,
x EX¢::=} Ix - xl ~ radx.
Intervals with zero radius (called point intervals) contain a single point only,
and we always use the identification
[x,x]=x
of point intervals with the element of M they contain.
[x] := sup{lxllx
EX}
= sup{x, -!.}
40
The Numerical Evaluation of Expressions
1.51/lterval Arithmetic
defines the absolute value of x. For a bounded subset S of M,
In particular,
OS := [inf S, sup S]
is called the interval hull of S; for example, IJ{I, 2}
also define the relation
x + Y = Of!. + I, x + y},
= IJ{2, I} =
[1,2]. We
x- y
(iii) For monotone cp
x
E
Of!. -
y, x -
I}'
J,
cp(x) = O{cpc!.) , cpU)}·
Proof The proof follows immediately from the monotonicity of
1.5.2 Remarks.
(i) For x E liJR., [x] = max {x , -!.}. However, the maximum may be undefined
in the componentwise order of JR.n, and, for interval vectors, the supremum
takes the componentwise maximum. (MATLAB, however, uses max for the
componentwise maximum.)
(ii) The relation :s is not an order relation on liM because reflexivity fails.
x =
We now extend the definition of operations and elementary functions to
intervals over M = R For 0 EO, we define
X 0
y := Ofx
0
Y I oX
E X.
and tp,
D
[!.2, x2 ]
if x ~ 0,
l-x 2 , x 2 )
if x
[0,max{!.2,x 2}]
if D e x.
I
:s 0,
Note that the standard functions have well-known extrema, only finitely many
in each interval, so that all operations and elementary functions can be calculated for intervals in finitely many steps. Thus we can get an enclosure valid
simultaneously for all selections of inaccurate values from the intervals.
Y E Y},
J, we define
cp(x) := O{cp(x) \ x
0
For nonmonotone elementary functions and for powers with even integral exponents, one must also look at the values of the local extrema within the interval;
thus, for example,
2
E
=
:s y :{:::::::} x :s y
on liM, and other relations are defined in a similar way.
and for cp
41
EX}
when the right side is defined; that is, excluding cases such as (1,2]/(0, I]
or ~. (There are extensions of interval arithmetic that even give values to such expressions; we do not discuss these here.) Thus, in both cases,
the result of the operation is the tightest interval that contains all possible
results with inaccurate operands selected from the corresponding interval
operands.
1.5.4 Remarks.
(i) Intervals are just a new type of numbers similar to complex numbers (but
with different properties). However, it is often convenient to think of an
interval as a single inaccurately known real number known to lie within
that interval. In this context, it is useful to write narrow intervals in an
abbreviated form, in which lower and upper bounds are written over each
other and identical figures are given only once. For example,
1.5.3 Theorem.
17.4~~m = [17.458548. 17.463751].
(i) For all intervals,
-x
(ii) For
0
E
{+, -, *, [, "].
X 0
=
[-x, -!.].
if x 0 Y is defined for all x EX. Y E
y = Of!. 0
2" !.
0
y. X 0 2'. x 0 y}.
y, we have
(ii) Let f(x) := x-x. Then for x = [1,2], f(x) = [-1.1], and for x :=
~l - r, 1 + r], f(x) = [-2r,2r]. Interval arithmetic has no memory;
It does not "notice" that in x - x the same inaccurately known number
occurs twice, and thus calculates the result as if coming from two different
inaccurately known numbers both lying in x.
42
The Numerical Evaluation of Expressions
1.5/nterval Arithmetic
(iii) Many rules for calculating with real numbers are no longer valid for intervals. As we have seen, we usually have
modules FORTRAN-XSC [95] and (public domain) INTERVAL..ARITHMETIC [50].
The following examples illustrate interval operations and outward rounding;
we use B = 10, L = 2, and optimal outward rounding.
x- x
i=
0,
[1.1, 1.2] + [- 2.1, 0.2]
and another example is
a(b
+ e) i= ab + ae,
except in special cases. So one must be careful in theoretical arguments
involving interval arithmetic.
[1.1,1.2] - [-2.1,0.2]
[1.1, 1.2] * [-2.1,0.2]
43
= [-1.0, 1.4],
= [0.9,3.3],
= [-2.52,0.24], rounded: [-2.6,0.24],
[1.1, 1.2]/[-2.1,0.2] not defined,
[-2.1,0.2]/[1.1,1.2]
= [-21/11,2/11], rounded: [-2.0,0.19],
= [-21/11,2/11], rounded: [-2.0,0.19],
= [1.21, 1.44], rounded: [1.2, 1.5],
[-2.1,0.2]/[1.1. 1000]
Basic books on interval arithmetic and associated algorithms are those by
Moore [63] (introductory), Alefeld and Herzberger [4] (intermediate), and
Neumaier [70] (advanced).
Outward Rounding
Let x = [:!:., x] E [JR. If :!:. and x are not machine numbers, then they must be
rounded. In what follows, let x = L~., x] be the rounded interval corresponding
to x. In order that all elements in x should also lie in x,:!:. must be rounded
downward and x must be rounded upward; this is called outward rounding.
Then, x S; X. The outward rounding is called optimal when
x := n{y
=
[~,
y] I y ;2 x,
~"v
are machine numbers}.
Note that for optimally rounded intervals, the bounds are usually not optimally
rounded in the sense used for real numbers, but only correctly rounded. Indeed,
the lower bound must be rounded to the next smaller machine number even
when the closest machine number is larger, and similarly, the upper bound
must be rounded to the next larger machine number. This directed rounding,
which is necessary for optimal outward rounding, is available for the results
of +, -, *, / and the square root on all processors conforming to the IEEE
standard for floating point arithmetic. defined in [45]; for the power and other
elementary functions, the IEEE standard prescribes nothing and one may have
to be content with a suboptimal outward rounding, which encloses the exact
result, but not necessarily in an optimal way.
For the common programming languages, there are extensions in which
rigorously outward rounded interval arithmetic is easily available: the (public domain) INTLAB toolbox [85J for MATLAB, the PASCAL extension
PASCAL-XSC [52], the C extension C-XSC [51], and the FORTAN 90
[-1.2. _1.1]2
[-2.1.0.2]2 = [0,4.41], rounded: [0, 4.5J.
J[1.0, 1.5J = [1.0, 1.22 ...], rounded: [1.0, 1.3].
1.5.5 Proposition. For optimal outward rounding and unbounded exponent
range,
x s; x· [l
where e = B 1-
- e. I
+ c]
L.
x:: : x when correctly rounded.
Proof We have x = Lf, .r], in which j; :s :!:. and
It follows from the correctness of the rounding that
There are three cases to consider: x > 0,
CASE I: x »
ing,
°
°
E
(i.e., :!:. > 0). By hypothesis,
:!:. -
i
x, and x < 0.
x ::::
:!:. > 0, and by outward round-
:s c:!:.
and
x - x :s eX,
- c)
and
x:s.t (1 + c).
whence
i :::: :!:.(1
So
x = [i, x] S; [,!,(l - c), i(l
as asserted.
+ c)J =
[:!:., i][l - e, I
+ c],
The Numerical Evaluation of Expressions
44
CASE 2:
°
°
E x (i.e.,!. :s
:s x). By hypothesis, K :s !.
we can evaluate the absolute values to obtain
-K +!. :s
-s!.
and
1.5 Interval Arithmetic
°
:s :s x :s X, whence
45
(iii) Suppose that in the evaluation of f (Zl, ... , zn), the elementary functions
cp E J and the power are evaluated only on intervals in which they are
differentiable. Then, for small
A
x- x :s sx,
r := max radrx.),
giving
i
we have
So
rad j'(x}, ... , x,)
x=
=
O(r).
[K, x] S; [!.(l + e), x(l + c)] S; [!., xHl - s, 1 + s].
CASE 3: (x < 0, i.e.,
x<
0) follows from Case 1 by multiplying with -1.
D
We refer to this asymptotic result by saying that naive interval evaluation
has a linear approximation order.
Interval Evaluation of Expressions
Before we prove the theorem, we illustrate the statements with some
examples.
In arithmetic expressions, one can replace the variables with intervals and evaluate the resulting expressions using interval arithmetic. Different expressions
for the same function may produce different interval results; although this already holds for real arguments in finite precision arithmetic, it is much more
pronounced here and persists even in exact interval arithmetic. The simplest
example is the fact that x - x is generally not zero, but only an enclosure of it,
with a radius of the order of the radius of the operands. We now generalize this
observation to the interval evaluation of arbitrary expressions.
1.5.7 Example. We mentioned previously that expressions equivalent for real
arguments may behave differently for interval arguments. We look closer at
the example f (x) = x 2 , which always gives the range. Indeed, the square is
an elementary operation, defined such that this holds. What happens with the
equivalent expression x * x?
1.5.6 Theorem. Suppose that the arithmetic expression f(z], ... , Zn) E A(ZI,
... , Zn) can be evaluated at Z], ... , z" E ITR and let
Xl
S; Zl, ...
.x,
S; Zn·
Then:
(i) If f(x) := x * x, and x := [-r, r], with 0< r < 1, then f(x) = [-r, r] .
[-r, r] = [_r 2 , r 2 ], but the range of values of f is only {f (x) I x E x} =
2
[0, r ]. The reason for the overestimation of the range of values of f is as
follows. In the multiplication x * x = O{X * .~ I x E x, E x}, each x is
multiplied with every element of x, but in the calculation of the range of
values, x is multiplied only with itself, giving a tighter result. Again, this
shows the memory-less nature of interval analysis.
r, + r], with 0 :s r :s ~,then
(ii) If f(x) := x * x and x :=
x
[! - 1
(i) f can be evaluated at x., ... , x, and
f(x], ... , x n) S; f(z], ... , zn)
(inclusion isotonicity),
(ii)
{f(z[, ... , Zn) I Zi
E z.} S;
f(z], ... , zn) (range inclusion).
In (ii), equality holds ifeach variable occurs only once in f (as, e.g., the
single variable z in the expression Iogtsint I - Z3»).
that is, the range of values of f is not overestimated, in spite of the fact
that the condition from Theorem 1.5.6(ii) that each variable should occur
only once is not satisfied (the condition is therefore sufficient, but not
necessary). For the radius of f(x), we have rad f(x) = O(r) as asserted
in (iii) of Theorem 1.5.6.
(iii) If f(x) := -IX, Z := [s, 1] and x = [s, e + 2r] (r :s (l - s)/2), then
46
The Numerical Evaluation ofExpressions
1.5 Interval Arithmetic
47
Furthermore,
radx = rand
1
rad jX = -his
2
+ 2r -.jE) =
r
r
~
~
r;;
e + 2r + .jE
2y e
=
for r -+ O. However. the constant hidden in the Landau symbol depends
on the choice of z and becomes unbounded as s ~ O.
(iv) If f(x) := v1x. z:= [0. I], and x := [~r. ~r], with 0 < r < ~, then
v'x =
f(x)
O(r)
=
{rp(g} I g
<
(rp(g)
Ig
E
g(x)}
E g(z)}
=
f(z).
(ii) For x = [z. ZJ. (ii) follows immediately from (i). If each variable in f occurs only once, then the same holds for g. and by the inductive hypothesis.
(g(z) I E z} = g(z). From this, it follows that
z
[~vr. ~vr] = (f(i) I.r EX}.
(f(z)
I zE
z)
= (rp(g(z» I z E z}
= (rp(g) I g E g(z») =
f(z).
which is to be expected from (ii), and
radx
= r,
rad
1
v'x = '2vr i=
O(r),
that is, the radius of v'x is considerably greater than is expected from the
radius of x, leading to a loss of significant digits. This is due to the fact
that vIx is not differentiable at x = 0; see statement (iii).
Proof of Theorem 1.5.6. Because arithmetic expressions are recursively defined, we proceed by induction on the number of subexpressions. Clearly, it
suffices to show (a) that the theorem is valid for constants and variables, and
(b) that if the theorem is valid for the expressions g and h, then it is also valid
for -g, g 0 h, and rp(g). Now (a) is clear; hence we assume that the theorem is
valid for g and h in place of f.
We show that Theorem 1.5.6 is valid for rp(g) with rp E J. We combine the
interval arguments to interval vectors
(iii) Set g := g(x). Because over the compact set g, the continuous function rp
attains its minimum and maximum. we have
for suitable
gI, g2
E
g. Thus by the mean value theorem,
M
=
sup Irp'(OI.
SEg(Z)
Then
radf(x)
The proof for the case
reader.
1
:s '2M1g2 -gIl :s Mradg =
f =
g
0
h with
0
E
O(r).
0 is similar and is left to the
0
(i) By definition, and because rp is continuous and g(z) is compact and
connected,
The Mean Value Form
f (z) = rp(g(z»
= O{rp(g) I g E g(z»)
= {rp(g) I g E g(z»).
In particular. rpU?) is defined for all g E g(z). Therefore, because by the inductivehypothesisg(x) <; g(z), f(x) isalsodefined,and f(x) = rp(g(x».
The evaluation off(XI, ... , x n ) gives, in general, only an enclosure for the
range of values of f. The naive method, in which real numbers are replaced
with intervals, often gives pessimistic bounds, especially for intervals of large
radius. However, for many numerical problems, an appropriate reformulation
of standard solution methods produces interval methods that provide bounds
that are provably realistic for narrow intervals. For a thorough treatment, see
Neumaier [70]; in this book, we look at only a few basic techniques for achieving
that.
For function evaluation, realistic bounds can be obtained for narrow input
intervals using the mean value form, a rigorous version of the linearization
method discussed in the previous section.
1.5.8 Theorem. Let f(x) = [tx«, ... , x n ) E .A(Xi, ... , x n ) , and suppose
that for the evaluation of f' (X) in Z E ][IRn, the elementary functions and the
power are evaluated only on intervals in which they are differentiable. 1fx S; z,
then
w* :=
D{f(x) I x EX} S; w := f(x) + f'(x)(x - x),
50
40
30
20
10
rad w* ::: 2 rad f' (x) rad x.
L-----:::~~:::::::..------
o
(5.1)
1.8
and the overestimation is bounded by
o ::: rad w -
49
1.5 1nterval Arithmetic
The Numerical Evaluation of Expressions
48
2
1.9
2.1
2.2
2.3
Figure 1.3. Enclosure by the mean value form.
(5.2)
1.5.9 Remarks.
(i) Let x be inexactly known with absolute error j; r and let x = [x - r, x + r].
Statement (5.1) says that the range of values w* as well as the required
value f(x) of f at x lie in the interval w. Instead of only an approximate
bound on the error in the calculation of f (x) as r ~ 0 (as in Section 1.4),
we now have a strict bound on the range of values of f for any fixed value
of the radius.
Statement (5.2) says how large the overestimation is. For r := max, rad Xi
tending to zero, the overestimation is 0 (r 2 ) ; that is, the same order of magnitude as the neglected terms in the calculation of the condition number.
The order of approximation is therefore quadratic if we evaluate f at the
midpoint x = mid x and correct it with f'(x)(x - x). This is in contrast
with the linear order of approximation for the interval evaluation f(x).
(ii) The expression f(x) + f'(x)(x - x) is called the mean value form of f;
for a geometrical interpretation, see Figure 1.3. The number
computation of the range. Note that q = 1 - O(r) approaches 1 as the
radius of the arguments gets small (cf. Figure 1.4), unless the range w* is
only O(r 2 ) (which may happen close to a local minimum).
(iii) For the evaluation of the mean value form in finite precision arithmetic, it
is important that the rounding errors in the evaluation of f (x) are taken
into account by calculating with a point interval [x, x] instead of with x.
However, x itself need not be the exact midpoint of the interval because,
as the proof shows, the theorem is valid for any x E x instead of x.
30
20
10
o
2rad f'(x) radX)
q -_ max ( 0, 1 - ------'-----'--radw
-10
is a computable quality factor for the enclosure given by the mean value
form. Because one can rewrite (5.2) as
-20
q . rad w ::: rad w* ::: rad w,
a q close to 1 shows that very little overestimation occurred in the
0.25
Figure 1.4. Enclosure of
form.
0.5
f
0.75
(x) = x
5
-
2x
1
3
1.25
+ 10 sin 5x
1.5
1.75
2
over [0, 2] with the mean value
50
The Numerical Evaluation of Expressions
Proof of Theorem 1.5.8. We choose x
g(t) := fei
get) is well defined because the line x
mean value theorem
f(x) = g(1) = g(O)
S; z and we define
EX
+ t(x -
Because
f
+ t(x -
iiJ 5 ib"
x) lies in the interval x. By the
= f(x)
+ f'(n(i
51
(x) E w*, we concl ude
for t E [0, 1];
x»
+ g'(r)
1.5 Interval Arithmetic
+ «(' - f) rad x.
Similarly,
~:::: ~* - (c -
f)radx,
- x)
and therefore
with
~
=
x + r(x -
.\') for some r E [0, 1]. From this, we obtain
f(x) E f(x)
+ f'(x)(x -
x)
=w
for all x E x, whence w* S; w. This proves (5.1).
For the overestimation bound (5.2), we first give a heuristic argument. We
have
+ f'(~)(x -
f(x) = f(x)
i) E f(x)
+ f'(x)(x -
i).
If one assumes that ~ and x range through the whole interval x, then because
of the overestimate of x by x and of .f'(~) by f'(x) of OCr), one obtains an
overestimate of O(r 2 ) for the product.
A rigorous proof is a little more technical, and we treat only the onedimensional case; the general case is similar, but requires more detailed arguments. We write
c:=f'(x),
p:=c(x-i)
2radw = iiJ - w
5 ib" - ~* + 2((' - f)radx
= 2 rad w* + 4 rad c rad x.
This proves the inequality in (5.2), and the asymptotic bound 0 (r 2 ) follows
0
from Theorem 1.5.6(iii).
1.5.10 Examples. We show that the mean value form gives rigorous bounds on
the propagation error for inaccurate input data that are realistic when the inaccuracies are small.
(i) We continue the discussion of Example I.4.2(ii), where we obtained
for f(x) = x 5 - 2x 3 and Ix - 21 5 10- 3 the approximate estimate
If(x) - 16/ ~ 0.056. Because the absolute error of x is r = 10-3, an
inaccurate x lies in the interval [2 - r, 2 + r]. For the calculations, we use
finite precision interval arithmetic with B = 10 and L = 5 and optimal
outward rounding.
(a) We bound the error using the mean value form, with .f'(x) = 5x 4
From
so that
w=f(x)+p.
w = f(2)
By definition of the product of intervals,
jj =
Now
cd for some c E
C,
dE
X-
= [15.943, 16.057],
i.
i)
for some ~
+ jj =
E f(i)
+ (c -
f(x) - .f'(n(x - x)
.f'(n)(x - i)
c)(x - i).
If(·\') - f(2)/ 50.057.
EX.
Hence
= f(x) + (c -
+ r])[-r, r]
we obtain for the absolute error
= f(i) + f'(~)(x -
iiJ = f(x)
f'([2 - r, 2
6x 2 •
= 16 + [55.815, 56.189][-0.001,0.001]
x := .\' + d E x, and
f(x)
+
-
+ c(x -
.\')
These are safe bounds, realistic because of Example I.4.2(ii).
(b) To show the superiority of the mean value form, we also bound the
error by means of naive interval evaluation. Here,
Wo
= f([2 - r, 2
+ r])
= [15.896, 16.105]
52
The Numerical Evaluation ofExpressions
whence
53
1.6 Exercises
In this case, the interval evaluation provides better results. namely
If(x) - f(2)1 ::::: 0.105.
These are safe bounds, but there is overestimation by a factor of
about 2.
(c) For the same example, we use naive interval arithmetic on the equivalent expression hex) = x 3(x 2 - 2):
WI
= h([2 -
r, 2 + r])
=
[15.944, 16.057]
whence
If(x) - f(2)1 ::::: 0.057.
The result here is the same as for the mean value form (and with
higher precision calculation it would be even sharper than the latter);
but this is due to special circumstances. In all operations, lower bounds
are combined with lower bounds only, and upper bounds with upper
bounds only. It is not difficult to see that in such a situation, always
the exact range is obtained. (For intervals around 1, we no longer have
this nice property.)
(ii) Let f(x) := x 2 . Then f'(x) = 2x. For x = [-r, r], we have w* = [0, r 2 ]
and
w
= f(O) + f'([-r,
= 2[ -r, r][-r, r]
= 2[-r 2 , r 2 ] .
r])([-r, r] - 0)
From this we obtain, for the radii of w* and w, rad w* = r 2 /2 and rad w =
2r 2 ; that is, the mean value fonn gives rise to an overestimation of the radius
of w* by a factor of 4. The cause of this is that the range of values w* is
itself of size O(r 2 ) .
(iii) Let f(x) := l/x. Then f'(x) = -1/x 2 . For the wide interval x = [1,2]
(where the asymptotic result of Theorem 1.5.8 is no longer relevant), one
has w* =
1], and
[!,
w = f(1.5)
+ f'([l, 2])([1, 2] -
1 1
1.5)
[1 7]
= -1.5 + - [-0.5 , 0.5] = -6'6- .
[1,4]
wo = f([1, 2]) =
[1\]
=
[~, 1],
which, by Theorem 1.5.6, is optimal.
When r is not small, it is still true that the mean value form and the interval
evaluation provide strict error bounds, but it is no longer possible to make
general statements about the size of the overestimation. There may be significant
overestimation in the mean value form, and it is not uncommon that for wide
intervals, the naive interval evaluation gives better enclosures than the mean
value form; a striking example of this is f (x) = sin x.
When the intervals are too wide, both the naive interval evaluation and the
mean value form may even break down completely. Indeed, overestimation
in intermediate results may lead to a forbidden operation, as in the following
example.
1.5.11 Example. The harmless function f(x) = 1/(1 - x + x 2 ) cannot be
evaluated at x = [0, 1]. The evaluation of the denominator (whose range is
[0.75,1]) gives the enclosure 1 - [0,1] + [0,1] = [0,2], and the subsequent
division is undefined. The derivative has the same problem, so that the mean
value form is undefined. too. Using as denominator (x - 1) * x + 1 reduces the
enclosure to [0, I]. but the problem persists.
In this particular example, one can remedy the problem by rewriting f as
f(x) = 1/(0.75 + (x - 0.5)2) and even gets the exact range (by Theorem
1.5.6(ii», but in more sophisticated examples of the same kind, there is no easy
way out.
1.6 Exercises
1. (a) Use MATLAB to plot the functions defined by the following three
arithmetical expressions.
(i) f(x) :=
~ for -1 < x::::: 1 and for Ixl ::::: 10- 15 ,
d2x- -
(ii) f(x):= Jx
x ::::: 2· 108 ,
+ l/x-Jx -l/x for 1::::: x:::::
lOandfor2·10 7
:::::
(iii) f(x):= (tanx-sinx)/xforO < x::::: 1 and for 10- 8 ::::: x::::: 10-7 •
Observe the strong inaccuracy for the second choices.
(b) Use MATLAB to print, for the above three functions, the sentences:
54
The Numerical Evaluation of Expressions
The answer for f(x) accurate to 6 leading digits is y
The short format answer for f(x) is
Show that the polynomials
gj(x) :=
y
L
f/j)x n-i (j = 0, ... , n)
i=j:n
The long format answer for f(x) is
satisfy
(a) gj(x)
y
but with x and y replaced by the numerical values x = 0.1111 and y
f(O.llll). (You must switch the printing mode for the results; use the MATLAB functions disp, num2str, format. See also fprintf and sprintf
for more flexible printing commands.)
2. Using differential numbers, evaluate the following functions and their
derivatives at the points XI = 2 and Xl = 5. Give the intermediate results
for (a) and (b) to (at least) three decimal places. For (c), evaluate as a check
an expression for f'(x) at x, and Xl. (You may work by hand, using a pocket
calculator, or use the automatic differentiation facilities of INTLAB.)
(a) f(x) := 1~~2'
(b) f(x) :=
eXjtl).
(b) go(x)
= gj(Z) + gj+l(x)(x - z)(j = 0, ... , n - 1),
= Lk<' gk(Z)(X - Z)k + gj(x)(x - zF(j = 1, ... , n),
J _
(j+l) _
fl1l(z)
._
(c) f n+ 1 - gj(z) - -ji- (j - 1, ... , n).
Hint: For (a), use the preceding exercise; for (c), use the Taylor expansion
of f(x).
The relevance ofthe complete Homer scheme lies in (c), which shows that
(z) / j! of
the scheme may serve to compute scaled higher derivatives
polynomials.
5. (a) A binary fraction of the form
r»
with a v
E
{O, I} (v = 1, ... , k) may be represented as
(x
+ x-I
(C) f( x ) .' - X2(x2+4)
x+I'
3. Suppose that the polynomial
2+3)(x-l)
f(x)
L a O.5
r =
=
55
1.6 Exercises
v
v
v=O:k
L ai x n-i
with ao = O. Transform the binary fraction
i=O:n
is evaluated at the point z using the Homer scheme
(0.101 100011h
fo := ao,
fi:=fi-lz+ai
(i=I, ... ,n),
fez) := fn.
into a decimal number using the Homer scheme.
(b) Using the complete Homer scheme (see the previous exercise), calculate, for the polynomial
p(x)
Show that
.L
n-i-I
fix
I=O:n-1
=
!f'(Z)
f(x)- fez)
x-z
if x
= Z,
otherwise.
4. Let
f(x)
= aox n + ajx n- 1 + ... + an
be a polynomial of degree n. The complete Horner scheme is given by
f?) :=ai (i=O,
,n),
a0 (j=O,
,n),
f j(j )
'.-
f?) := f/~iz + J;~~I) (i = j
8x l
+ 5x
- 1,
the values p(x), p'(x), and p"(x) for x = 2.5 and for x = 4/3. (Use
MATLAB or hand calculation. For hand calculation, devise a scheme
to arrange the intermediate results so that performing and checking the
calculations is easy.)
6. Assuming an unbounded exponent range, prove that, on a computer with
an L-digit mantissa and base B, the relative error for optimal rounding is
1 L
•
bounded by e =
7. Let Wo, x E JR with Wo, x > and for i = 0, 1,2, ...
&B
°
-
ui,
+ 1, ... , n + 1).
= 4x 4 -
Wi+l
:=x /
:=
n-I
Wi
«n -
'
1)Wi
+ w;)/n.
56
The Numerical Evaluation ofExpressions
1.6 Exercises
(a) Show that, for Wo =I=- ;:,IX,
WI
.::IX <
< W2 < ... < Wi < ... <
... <
ui,
< ... < W2 <
WI
and that the sequences [u»} and {Wi} converge to W := ;:,IX.
(b) For x := 4, Wo := 2, n := 3,4 calculate, in each case, Wi, Wi (i ::s 7)
and print the values of i , Wi, and Wi in long format.
(c) For x := 4, Wo := 2, n := 25,50, 75, 100, calculate, in each case,
Wi, Wi (i ::s n) and print the values of i , un . and Wi in long format.
Use the results to argue that the iteration is unsuitable for the practical
calculation of ;:,IX for large n.
8. (a) Analyze the behavior of the quotient fd(a)/fs of the derivative
I, = f'ex) and the forward difference quotient fd(a) = (f(x +a)f(x»/a in exact arithmetic, and in finite precision arithmetic. What
do you expect for a ---* O?What happens actually?
(b) Write a MATLAB program that tries to discover mistakes in the programming of derivatives. This can be done by computing a table of quotients fd(a)/fs for a number of values of a = 10- i ; say, i = -10: 10,
and checking whether the behavior predicted in (a) actually occurs. Test
with some correct and some incorrect pairs of expressions for function
and derivative.
(c) How can one reduce the multidimensional case (checking for errors in
the gradient) to the previous situation?
9. (How not to calculate eX.) The value eX for given x E 1R is to be estimated
from the series expansion
XV
L~'
v=O:oo
One calculates, for this purpose, the sums
XV
v=O:n V.
for n = 0,1,2, .... Obtain a value of n such that ISn(x) - eXI < 10- 16
with x = -20 using a remainder estimate. Calculate u = Sn ( - 20) with
MATLAB and compare the values so obtained with the optimally rounded
value obtained with the MATLAB command
v=exp(-20); disp(v-u);
How is the bad result to be explained?
10. (a) Rearrange the arithmetic expressions from Exercise 1 so that the smallest possible loss of precision results if x « 1 for (i), x » 1 for (ii), and
o < IxI « 1 for (iii).
(b) Calculate f(x) from (i) with x := 10- 3 , B = 10, and L = 1,2,4,8
by hand, and compare the resulting values with those obtained from
the rearranged expression.
11. The solutions XI and X2 of a quadratic equation of the form
ax'
with a
=I=-
+ bx + c = 0
0 are given by
Xu
= (-b ± Jb 2 -
4ac)/2a.
(6.1)
Determine, from (6.1), the smaller solution of the equation
x
=
(l-ax)2, (a> 0).
Why is the formula obtained unstable for small values of a? Derive a better formula and find five distinct values of a that make the difference in
stability between the two formulas evident.
12. Find a stable formula for both solutions of a quadratic equation with complex coefficients.
13. Cardano'sformulas.
(a) Prove that all cubic equations
x 3 + ax'
+ bx + c = 0
with real coefficients a, b, c may be reduced by the substitution x
s to the form
=
Z -
Z3
L ,
sn(x):=
57
+ 3q z -
2r
= O.
(6.2)
In the following assume q3 + r 2 > O.
(b) Using the formula z := p - q / p, derive a quadratic equation in p3,
and deduce from this an expression for a real solution of (6.2).
(c) Show that there is exactly one real solution. What are the complex
solutions?
(d) Show that both solutions of the quadratic equation in (b) give the same
real solution.
(e) Determine the real solution in part (b) for r := ±1000 and q := 0.01,
choosing for p in tum both solutions of the quadratic equation. For
which one is the expression more numerically stable?
58
The Numerical Evaluation of Expressions
1.6 Exercises
« Iq I, the real solution is small compared with q; its calculation
from the difference p - q I p is therefore numerically unstable. Find a
stable version.
(g) What happens for q3 + r 2 < O?
14. The standard deviation an = Jtnln can be calculated for given data
Xi (i = 1, ... , n) either with
(f) For Ir I
or with the recursion in Section 1.3. Calculate
Xi
:= (499999899
+ i)/3000
a20l
for all x E IIIR whose end points are machine numbers. Prove this for
at least one of the formulas (6.3) or (6.4).
(b) Suppose that B = 10. For which formula is (6.5) false? Give a counterexample, and prove that (6.5) holds for the other formula.
(c) Explain purpose and details of the MATLAB code
x=input('enter positive x>');
w=max(1,x);wold=2*w;
while w<wold
wold=~; w=w+(x!w-w)*O.5;
end;
for the values
(i = I, ... ,201)
using both methods and compare the two results with the exact value obtained by analytic summation.
Hint: Use Li=l:n i = n(n + 1)/2 and Li=l:n;2 = n(n + 1)(2n + 1)/6.
15. A spring with a suspended mass m oscillates about its position of rest.
Hooke's law K (r) = - D . x (t) is assumed. When the force K a (t) := A .
cos(w"t) acts on the mass m, the equation of motion is
What is the limit of w in exact arithmetic? Why does the loop terminate
in finite precision arithmetic after finitely many steps?
17. (a) Find intervals x, y, Z E IIIR for which the distributive law
x . (y
w6 := Dim. It has the solution
x(t) = -
Aim
2
Wa -
2
W
cos(w"t).
o
For which W a is this formula ill-conditioned (t > 0 and Wo > 0 fixed)?
16. The midpoint x of an interval x E ITIR can be calculated from
x:= (x +!.)/2
+ z)
= x .y
+x .z
is violated.
(b) Show that, for all x, y,
Z E
ITIR, the subdistributive law
x . (y
with
59
+ z)
S; x . y
+x .Z
holds.
18. A camera has a lens with focal length f = 20 mm and a film of thickness
2!'1b. The distance between the lens and the object is g, and the distance
between the lens and the film is boo All objects at a distance g are focused
sharply if the lens equation
I
I
I
f
b
g
-=-+-
(6.3)
is satisfied for some b E b = [b o - Sb, bo + !'1b]. From the lens equation,
one obtains, for a given f and b := bo ± /sb,
or from
(6.4)
I
These formulas give an in general inexact value
with base B and optimal rounding.
(a) For B = 2,
XEX
x on an L-digit machine
g=----
Ilf -
lib
b·f
b-F
A perturbation calculation gives
(6.5)
g
~
g(bo) ± g' (b o) . Ab.
60
The Numerical Evaluation of Expressions
Similarly, one obtains, for a given
f and g with 's -
1
b
= -1/-f---1-/ g =
gol
::s tsg,
a
g·f
g - j"
2
Linear Systems of Equations
.
he film be in order that an object at a given distance is
(a) How thick must t
_ [3650 3652], [3650,4400], [6000, 10000],
,_ b d l::..b = rad b from
sharply focused? For g [10000, 20000](in millimeters), calculate bo - an
b using the formulas
1
b 1= l/f - l/g
g·f
and b2 = - - .
g- f
Explain the bad results for b 2 , and the good results.for b l:
b f
.
h 1 focused with a gIven 0, or a
(b) At which distances are objects s arp Y
,
.
h f
1s
film thickness of 0.02 mm? Calculate the interval g usmg t e ormu a
1
gl := l/f -lib
b·f
and
g2:=
b-~
with bo = 20.1,20.05,20.03,20.02,20.01 mm, and using the approximation
g3:= g(bo)
+ (-1, 1]· g'(bo)
·l::..b.
d
Which of the three results is
Compare the results from gl, g2, an g3·
. b?
increasingly in errorfor decreasmg o:
._ (1 _ x)" in
.
for bounding the range of values of p (x) .19. Wnte a program
the interval
x:= 1.5 + [-1, 1]/lO
i
(i = 1, ... ,8)
(a) by evaluating p(x) on the interval x,
.
.
n
(b) by using the Homer scheme at z = x on the eqUIvalentexpresslO
6
Po(x) = x -
6
x
5
+ 15x 4
_
20x 3
+ 15x 2 -
6x
+ 1,
(c) by using Homer's scheme for the derivative and the mean value form
po(x)
+ p~(x)(x -
Interpret the results. (You are allowed to
x).
~se unrounded
metic if you cannot access directed roundmg.)
interval arith-
In this chapter, we discuss the solution of systems of n linear equations in n
variables. At some stage, most of the more advanced problems in scientific
calculations require the solution of linear systems; often these systems are very
large and their solution is the most time-consuming part of the computation.
Therefore, the efficient solution of linear systems and the analysis of the quality
of the solution and its dependence on data and rounding errors is one of the
central topics in numerical analysis. A fairly complete reference compendium
of numerical linear algebra is Golub and van Loan [31]; see also Demmel (17]
and Stewart [89]. An exhaustive source for overdetermined linear systems (that
we treat in passing only) is Bjorck [8].
We begin in Section 2.1 by introducing the triangular factorization as a compact form of the well-known Gaussian elimination method for solving linear
systems, and show in Section 2.2 how to exploit symmetry with the L D L T and
the Cholesky factorization to reduce work and storage costs. We only touch the
savings possible for sparse matrices with many zeros by looking at the simple
banded case.
In Section 2.3, we show how to avoid numerical instability by means of
pivoting. In order to assess the closeness of a matrix to singularity and the
Worst case sensitivity to input errors, we look in Section 2.4 at norms and in
Section 2.5 at condition numbers. Section 2.6 discusses iterative refinement as
a method to increase the accuracy of the computed solution and shows that
except for very ill-conditioned systems, one can indeed expect from iterative
refinement solutions of nearly optimal accuracy. Finally, Section 2.7 discusses
various techniques for a realistic error analysis of the solution of linear systems,
inclUding interval techniques for the rigorous enclosure of the solution set of
linear systems with uncertain coefficients.
The solution of linear systems of equations lies at the heart of modem computational practice. Therefore, a lot of effort has gone into the development
61
62
Linear Systems of Equations
2.1 Gaussian Elimination
of efficient and reliable implementations of appropriate solution methods. For
matrices of moderate size, the state of the art is embodied in the LAPACK [6]
subroutine collection. LAPACK routines are freely available from NETLIB
[67], an electronic distribution service for public domain numerical software.
If nothing else is said, everything in this chapter remains valid if the set C of
complex numbers is replaced with the set lR of real numbers.
We use the notations lRmxn and cmxn to denote the space of real and complex
matrices, respectively, with m rows and n columns; A ik denotes the (i, k)-entry
of A. Absolute values and inequalities between vectors or between matrices
are interpreted componentwise.
The symbol A T denotes the transposed matrix with
and A ( : , k}, respectively. The dimensions of an m x n matrix A are found
from [m, n] =size (A). The matrices 0 and J of size m x n are created by
zeros (m , n ) and ones (m , n): the unit matrix of dimension n by eye (n) . Operations on matrices are written in MATLAB as A+B, A-B, A*B (=A B), inv (A)
J
(=A- ) , A\B (= A-I B), and A/B (= AB- I). The componentwise product and
quotient of (vectors and) matrices are written as A. *B and A. /B, respectively.
However, in our pseudo-MATLAB algorithms, we continue to use the notation
introduced before for the sake of readability.
(AT)ik = A ki,
Suppose that n equations in n unknowns with (real or) complex coefficents are
given,
2.1 Gaussian Elimination
and A H denotes the conjugate transposed matrix with
AIIXI
A 21XI
Here, eX = a - ib denotes the complex conjugate of a complex number
= a + ib(a, b E lR), and not the upper bound of an interval. It is also
convenient to write
63
+ A 12X2 +
+ A n X2 +
+ Alnxn =
+ A 211Xll =
bj ,
b2 ,
C(
in which Aik, b, E C for i, k = 1, ... , n. We can write this linear system in
compact matrix notation as Ax = b with A = (A ik) E cnxll and x = (Xi),
b = (bi ) E C". The following theorem is known from linear algebra.
and
A matrix A is called symmetric if AT = A, and Hermitian if A H = A. For
A E lRmxn we have A H = AT, and then symmetry and Hermiticity are synonymous.
The ith row of A is denoted by Ai:, and the kth column by A:k. The matrix
2.1.1 Theorem. Suppose that A E cnxn and that b E C". The system of equations Ax = b has a unique solution x E cn if and only if A is nonsingular (i.e.,
if det A i= 0, equivalently, if rank A = n]. In this case, the solution x is given
by x = A -I b. Moreover, the i th component ofthe solution is given by Cramer's
rule
»,».
Lk=l:n(adj
det A(i)
Xi = - - - = ==='----'---det A
detA
i=l, ... ,n,
Where the matrix A (i) is obtained from A by replacing the ith column with b,
and the adjoint matrix adj A is formed from suitably signed subdeterminants
of A.
with entries Iii = 1 and I ik = 0 (i i= k), is called the unit matrix; its size
is usually inferred from the context. The columns of I are the unit vectors
e(k) = h (i = 1, ... , n) of the space C". If A is a nonsingular matrix, then
A -I A = AA -I = I. The symbols J and e denote matrices and column vectors,
all of whose entries are I.
In MATLAB, the conjugate transpose of a matrix A is denoted by A) (and
the transpose by A. ) ); the ith row and kth column of A are obtained as A (i , : )
From a theoretical point of view, this theorem tells everything about the
SOlution. From a practical point of view, however, solving linear systems of
equations is a complex and diverse subject; the actual solution process is influenced by many considerations involving speed, storage, accuracy, and structure.
In particular, we see that the numerical calculation of the solution using A-I
or Cramer's rule cannot be recommended for n > 3. The computational cost
Linear Systems ofEquations
2.1 Gaussian Elimination
in both cases is considerably greater than is necessary; moreover, Cramer's rule
is numerically less stable than the other methods.
Solution methods for linear systems fall into two categories: direct methods,
which provide the answer with finitely many operations; and iterative methods.
which provide better and better approximations with increasing work. For systems with low or moderate dimensions and for large systems with a band
structure, the most efficient algorithms for solving linear systems are direct,
generally variants of Gaussian elimination, and we restrict our discussion to
these. For sparse systems, see, for example, Duff et al. [22] (direct methods)
and Weiss [97] (iterative methods).
Gaussian elimination is a systematic reduction process of general linear systems to those with a triangular coefficient matrix. Although Gaussian elimination currently is generally organized in the form of a triangular factorization.
more or less as described as follows, the basic idea is already known from school.
distinct from zero, and to solve, instead of Ax = b, the triangular system
Rx = y, where y is the transformed right side. Because the latter is achieved
very easily, this is a very useful strategy.
For the compact form of Gaussian elimination, we need two types of triangular matrices. An upper triangular matrix has the form
64
x
x
L~ (~
2x +4y + z = 3,
-x + y - 3z = 5,
Subtracting the first of these from the second gives
-z = 5.
It remains to solve a system of equations of triangular form:
x+y+z=l,
2y - z = 1
-z = 5
and we obtain in reverse order z
= -5, y = -2, x = 8.
The purpose of this elimination method is to transform the coefficient matrix
A to a matrix R in which only the coefficients in the upper right triangle are
x
x
where possible nonzeros, indicated by crosses, appear only in the upper right
triangle, that is, where R i k = 0 (i > k). (The letter R stands for right; there
is also a tradition of using the letter U for upper triangular matrices, coding
upper.) A lower triangular matrix has the form
x+y+z=l,
2y-z=l,
2y - 2z = 6.
x
x
x
2.1.2 Example. In order to solve a system of equations such as the following:
one begins by rearranging the system such that the coefficients of x in all of the
equations save one (e.g., the first) are zero. We achieve this, for example, by
subtracting twice the first row from the second row, and adding the first row to
the third. We obtain the new equations
x
x
65
x
x
x
x
x
x
x
x
x
J
where possible nonzeros appear only in the lower left triangle, that is, where
L i k = 0 (i < k). (The letter L stands for left or lower.)
A matrix that is both lower and upper triangular, that is, that has the form
D = (Dll
0 ) =: Diag(D ll •... , Dnn),
o
o.;
is called a diagonal matrix. For a diagonal matrix D, we have D i k = 0 (i
# k).
Triangular Systems of Equations
For an upper triangular matrix R
Yi =
L
k=l~
E
RikXk
IRnxn, we have Rx = y if and only if
=
s;», +
L
RikXk.
k=i+l~
The following pseudo-MATLAB algorithm solves Rx = y by substituting the
already calculated components Xk (k > i) into the ith equation and solving
for Xi, for i = n, n - 1, ... ,LIt also checks whether one of the divisions
is forbidden, and then leaves the loop. (In MATLAB, == denotes the logical
66
Linear Systems of Equations
comparison for equality.) This happens iff one of the Ri, vanishes, and because
(by induction and development of the determinant by the first column)
2.1.5 Algorithm: Solve a Lower Triangular System
ier = 0;
for i = 1 : n,
if L(i, i) == 0, ier = 1; break; end;
y(i) = (b(i) - Lii, I : i - I ) * y(l : i - 1))/ Lti, i);
det R = R ll R n · .. R nn ,
this is the case iff R is singular.
2.1.3 Algorithm: Solve an Upper Triangular System (slow)
ier = 0;
for i = n : -1 : 1,
if Ri, == 0, ier = 1; break; end;
Xi = (Yi L
RikXk)/Rii;
k=i+l:n
end;
% ier = 0: R is nonsingular and x = R- 1Y
% ier = I: R is singular
end;
% ier = 0: L is nonsingular and y
% ier = 1 : L is singular
2.1.4 Algorithm: Solve an Upper Triangular System
ier=O;
x=zeros(n,l); % ensures that x will be a column
for i=n:-1:1,
if R(i,i)==O, ier=l; break; end;
x(i)=(y(i)-R(i,i+l:n)*x(i+l:n))/R(i,i);
end;
% ier=O: R is nonsingular and x=R\y
% ier=l: R is singular
In the future, we shall frequently use the summation symbol for the sake
of visual clarity; it is understood that in programming, such sums are to be
converted to vectorized statements whenever possible.
For a lower triangular matrix L E ~n xn, we have Ly = b if and only if
=
L LikYk = LiiYi + L
k=l:n
=
L -I b
In case the diagonal entries satisfy Li, = I for all i , L is called a unit lower triangular matrix, and the control structure simplifies because L is automatically
nonsingular and no divisions are needed.
Solving a triangular system of linear equations by either forward or back
substitution requires Li=l:n (2i - I) = n 2 operations (and a few less for a unit
diagonal). Note that MATLAB recognizes whether a matrix A is triangular; if
it is, it calculates A\ b in the previously mentioned cheap way.
It is possible to speed up this program by vectorizing the sum using MATLAB's
subarray facilities. To display this, we switch from pseudo-MATLAB to true
MATLAB.
bi
67
2.1 Gaussian Elimination
LikYk.
k=l:i-l
The following algorithm solves Ly = b by substituting the already calculated
components Yk (k < i) into the i th equation and solving for Yi for i = 1, ... , n.
The Triangular Factorization
A factorization A = LR into the product of a nonsingular lower triangular matrix L and an upper triangular matrix R is called a triangular factorization
(or l-R-factorization or LV -factorization) of A; it is called normalized if
L ii = I for i = I, ... , n, The normalization is no substantial restriction because
if Lis nonsingular, then LR = (LD- 1 )(DR) is a normalized factorization for
the choice D = Diag(L), where
Diag(A) := Diag(A 11 ,
.•• ,
Ann).
denotes the diagonal part of a matrix A E en xn. If a triangular factorization
exists (we show later how to achieve this by pivoting), one can solve Ax = b
(and also get a simple formula for the determinant det A = det L . det R) as
follows.
2.1.6 Algorithm: Solve Ax
= b without Pivoting
STEP I: Calculate a normalized triangular factorization LR of A.
STEP 2: Solve Ly = b by forward elimination.
STEP 3: Solve Rx = y by back substitution.
Then Ax
=
LRx
=
Ly
= b and det A =
R 11R22
•• ·
R nn .
68
Linear Systems of Equations
69
2.1 Gaussian Elimination
This algorithm cannot always work because not every matrix has a triangular
factorization.
Lt«
= 0 for k
A ik =
> i and R ik
L
j=l:n
2.1.7 Example. Let
LijRjk
=
= 0 for k
L
< i, it follows that
LijRjk
(i,k=I, ... ,n).
(1.2)
j=l:min(i,k)
Therefore,
fork
~
i,
fork < i.
If a triangular factorization A = LR existed, then L 11R II
and L2I R II =J: O. This is, however, impossible.
=
0, L II R 12 =J: 0,
The following theorem gives a complete and constructive answer to the question when a triangular factorization exists. However, this does not yet guarantee
numerical stability, and in Section 2.3, we incorporate pivoting steps to improve
the stability properties.
e nxn has at most one normalized triangular factorization. The triangular factorization exists iff all leading
square submatrices A (m) of A, defined by
2.1.8 Theorem. A nonsingular matrix A
E
A ll
A(m):=
(
A~I
A:.lm)
(m
=
1, ... , n),
Amm
are nonsingular. 1n this case, the triangular factorization can be calculated
recursively from
Rik := A ik -
L
j=l:i-1
L ki := (Aki - .
L
LijR jk
LkjRji)/Rii
(k~i)'}
i=I, ... ,n.
(1.1)
(k > i),
1=1:1-1
Solving the first equation for Rik and the second for Li; gives equations
(1.1). Note that we swapped the indices of L so that there is one outer loop
only.
(ii) By (i), A has a nonsingular triangular factorization iff (1.2) holds with L ii ,
Ri, =J: 0 (i = 1, ... , n). This implies that for all m, A(m) = is» R(m) with
Li~) = Li, =J: 0 (i = 1, ... ,m) and
= ii =J: 0 (i = 1, ... ,m).
Therefore, A (m) is nonsingular for m = 1, ... , n. Conversely, if this holds,
then a simple induction argument shows that (1.2) holds, and Li, = L i:) =J:
0, Rii = R;;) =J: 0 (i = 1, ... , n).
(iii) All of the unknown coefficients Rik(i S k) and LkiCi < k) can be calculated recursively from (1.1) and are thereby uniquely determined. The
factorization is unique because by (i) each factorization satisfies (1.1).
Rir) R
o
For a few classes of matrices arising frequently in practice, the triangular
factorization always exists. A square matrix A is called an H-matrix if there are
diagonal matrices D and D' such that III - DAD'lloo < 1 (see Section 2.4 for
properties of matrix norms).
We recall that a square matrix A is called positive definite if x H Ax > 0 for
all x =J: O. In particular, Ax =J: 0 if x =J: 0; i.e., positive definite matrices are
nonsingular. The matrix A is called positive semidefinite if x H Ax ~ 0 for all x.
Here, the statements ex > 0 or ex ~ 0 include the statement that ex is real. Note
that a nonsymmetric matrix A is positive (semi)definite iff its Hermitian part
A sym := !(A + A H ) has this property.
2.1.9 Corollary. 1f A is positive definite or is an H-matrix, then the triangular
factorization of A exists.
Proof.
Proof. We check that the submatrices A (i) (1 SiS n) are nonsingular.
(i) Suppose that a triangular factorization exists. We show that the equations (1.1) are satisfied. Because A = L R is nonsingular, 0 =J: det A =
det L . det R = R ll ... R nn, so R ii =J: 0 (i = 1, ... , n). Because L ii = l ,
(i) Suppose that A E en xn is positive definite. For x E C (1 sis n) with
x =J: 0, we define a vector y E en according to y j = x j (1 s j S i) and
70
Linear Systems ofEquations
Yj
=0
(i < j :::: n). Then, x'! A (i) x
=
yH Ay > O. The submatrices A (i)
(l :::: i :::: n) are consequently positive definite, and are therefore non-
singular.
(ii) Suppose that A E e l i Xli is an H-matrix; without loss of generality, let A be
scaled so that III - A 1100 < 1. Because III - A (i) 1100 :::: III - A 1100 < I, each
A (i) (l :::: i :::: n) is an H-matrix and is therefore nonsingular.
0
The recursion (1.1) for the calculation of the triangular factorization goes
back to Crout. Several alternatives for it exist that compute exactly the same
intermediate results but in different orders that may have (dis)advantages on
different machines. They are discussed, for example, in Golub and van Loan
[31]; there, one can also find a proof of the equivalence of the factorization
approach and the reduction process to upper triangular form, outlined in the
introductory example.
In anticipation of the need for incorporation of row interchanges when no
triangular factorization exists, the following arrangement for the ith recursion
step of (1.1) is sensible. One divides the ith step into two stages. In stage (a),
Lkj R ji (k = i + 1, ... , n)
Ri i is found, and the analogous expressions A ki in the numerators of Li, are calculated as well. In stage (b), the calculation of
the Rik (k = i + 1,
, n) is done according to (I.1), and the calculation of
the Li, (k = i + 1,
, II) is completed by dividing the expressions calculated
in stage (a) by R i i . This method of calculating the coefficients Ri, has the
advantage that, when Ri, = 0, this is immediately known, and unnecessary
additional calculations are avoided.
The coefficients A ik (k = i, ... , n) and A ki (k = i + I, ... , n) in (1.1) can
be replaced with the corresponding elements of Rand L, respectively, because
they are no longer needed. Thus, we need no additional storage locations for the
matrices Rand L. The matrix elements stored in the array A at the conclusion
of the ith step correspond to the entries of L, R, and the original A, according
to the diagram in Figure 2.1. After n steps, A is completely overwritten with the
2.1 Gaussian Elimination
71
elements of Rand L. The calculation of the triangular factorization therefore
proceeds as follows.
2.1.10 Algorithm: Triangular Factorization without Pivoting
for i = 1 : n
for k = i ; n
% find Ri i and numerators of
A ki = A ki L AkjA j i;
Lki,
k > i
j=Li-l
end
if A ii == 0, errort 'no factorization exists'); end;
for k = i + 1 ; n
% find Rib k > i
A ik = Aik L AijA j k;
j=l:i-l
% complete
A ki
L
=
Lki,
k > i
AkilA i i;
end
end
The triangular factorization LR of A is obtainable as follows from the elements
stored in A after execution of the algorithm:
I
A ik
Li, =
I
o
fork<i,
for k = i,
for k > i,
for k ::: i,
for k < i.
2.1.11 Example. Let the matrix A be given by
-4
r1l
R
L
A
1
-1
i
i
Figure 2.1. The matrices L, R, and A after the ith step.
3
STEP 1: In the first stage (a), nothing happens; in the second stage (b), nothing
is subtracted from the first row above the diagonal, and the first column
is divided by All = 2 below the diagonal. In particular, the first row of
R is equal to the first row of A. In the following diagrams, coefficients
that result from the subtraction of inner products are marked Sand
72
Linear Systems ofEquations
coefficients that result from division are marked with D.
(a)
4
Step I:
(
42'5
_4 5
(b)
-4
~) ~D
2
1
1 -1 3
1
3 I
25
(
-2
D
I
D
45
_4 5
2
I
-I
I
I
3
and
n
STEPS 2 and 3: In stage (a), an inner product is subtracted from the second
and third columns, both on and below the diagonal; in stage (b), an
inner product is subtracted from the second and third rows above the
diagonal, and the second and third columns are divided below the
diagonal by the diagonal element.
R
~ (~
4
-6
-4
0
9
2:
~~)
0
0
-9
9
4
-6 5
-4
1
-1
3
95
-3 5
f)( -~
4
-6
3D
-2
-4
lD
3
Step 3:
(-~
I
2
95
-1
2
(a)
4
-6
3
-2
-n
(b)
-4
9
95
2
55
2
-~) ( ~
3
-2
1
4
-6
3
-2
-4
I
1
2
9
9
2
5D
9
-~ )
55
2
I
STEP 4: In the last step, only Ann remains to be calculated in accordance
with (a).
(a)
Step 4:
.
R
Note that the symmetry of A is reflected in the course of the calculation.
Each column below the diagonal, calculated in stage (a), is identical
with the row above the diagonal, calculated in stage (b). This implies
that for symmetric matrices the computational cost can be nearly halved
(cf. Algorithm 2.2.4).
Finding the triangular factorization of a matrix A
(-~
5
2:
E
JRn xn takes
(b)
(a)
Step 2:
73
2.2 Variations on a Theme
(-~
The triangular factorization of A
have
4
-6
-4
9
3
9
-2
2
I
5
9
2
=
-i )
85
-9
LR has now been carried out; we
o
o
I
5
9
L (i i=l:n
2
1)4(n - i) = _n 3 + 0(n 2 )
3
operations.
Because of the possible division by zero or a small number, the solution
of linear systems with a triangular factorization may not be possible or may
be numerically unstable. Later, we show that by 0(n 2 ) further operations for
implicit scaling and column pivoting, the numerical stability can be improved
significantly. The asymptotic cost is unaffected.
Once a factorization of A is available, solving a linear system with any right
side takes only 2n 2 operations because we must solve two triangular systems
of equations; thus, the bulk of the work is done in the factorization. It is a very
gratifying fact that for the solution of many linear systems with the same coefficient matrix, only the cheap part of the computation must be done repeatedly.
A small number of operations is not the only measure of efficiency of algorithms. Especially for larger problems, further gains can be achieved by
making use of vectorization or parallelization capabilities of modem computers. In many cases, this requires a rearrangement of the order of computations
that allow bigger pieces of the data to be handled in a uniform fashion or that
reduce the amount of communication between fast and slow storage media. A
thorough discussion of these points is given in Golub and van Loan [3
n
2.2 Variations on a Theme
For special classes of linear systems, Gaussian elimination can be speeded up
by exploiting the structure of the coefficient matrices.
74
2.2 Variations on a Theme
Band Matrices
The factorization of a (2m + I)-band matrices takes ~m2n + O(mn) operations. For small m, the total cost is 0 (n), the same order of magnitude as that for
solving a factored band system. Hence, the advantage of reusing a factorization
to solve a second system with the same matrix is not as big as for dense systems.
For the discretization of boundary value problems with ordinary differential
equations, we always obtain band matrices with small bandwidth m. For discretized partial differential equations the bandwidth is generally large, but most
of the entries within the band vanish. To exploit this, one must resort to more general methods for (large and) sparse matrices, that is, matrices containing a high
proportion of zeros among its entries (see, e.g., Duff, Erisman and Reid [22].
To explore MATLAB's sparse matrix facilities, start with help sparfun.
A (2m + I)-band matrix is a square matrix A with Aik = 0 for Ii - kl > m;
in the special case m = 1, such a matrix is called tridiagonal. When Gaussian
elimination is applied to band matrices, the band form is preserved. It is therefore sufficient for calculation on a computer to store and calculate only the
elements of the bands (of which there are less than 2mn + n) instead of all the
n 2 elements.
Gaussian elimination (without pivot search) for a system of linear equations
with a (2m + 1)-band matrix A consists of the following two algorithms.
2.2.1 Algorithm: Factorization of (2m
+ l j-Band Matrices
ier = 0;
for i = 1 : n,
p = max(i - m, 1);
q = min(i + m, n);
A ii
=
A ii -
L
Symmetry
AijA j i;
j=p:i-l
if A i i = 0, ier = 1; break; end;
for k = i + 1 : q,
A ki
=
L
A ki -
AkjA j i;
j=p:i-l
end;
for k
= i + 1 : q,
= A ik -
A ik
L
=
As seen in Example 2.1.11, the symmetry of a matrix is reflected in its triangular
factorization L R. For this reason, it is sufficient to calculate only the lower
triangular matrix L.
2.2.3 Proposition. If the Hermitian matrix A has a nonsingular normalized
triangular factorization A = LR, then A = LDL H where D = Diag(R) is a
real diagonal matrix.
AijA j k;
j=p:i-l
A ki
75
Linear Systems ofEquations
Proof Let D := Diag(R). Then
A ki/ A i i;
end;
end;
2.2.2 Algorithm: Banded Forward and Back Solve
ier = 0;
for i = 1 : n,
p = max(i - m, 1);
~
i:
L- Li.v
u r i»
YI· = b I j=p:i-l
end;
for i = n : - 1 : 1,
if R i i = 0, ier = 1; break; end;
q = min(i + m, n);
Xi
=
(Yi -
L
j=i+l:q
end;
RijXj)/ R ii·
Because of the uniqueness of the normalized triangular factorization, it follows
from this that R = D HL Hand Diag(R) = D H. Hence, D = D H, so that D is
real and A = LR = LDL H, as asserted.
0
The following variant of Algorithm 2.1.10 for real symmetric matrices (so
that LDL H = LDL T ) exploits symmetry, using only *n 3 + 0(n 2 ) operations,
.
half as much as the non symmetric version.
2.2.4 Algorithm: Modified L D L T Factorization of a real Symmetric
Matrix A
ier = 0; <5 = sqrt(eps)
for i = I : n,
* norm(A, inf);
76
Linear Systems ofEquations
piv
=
L
A ii -
2.2 Variations on a Theme
can take square roots and obtain, with D I / l := Diag(,J15i"i), L := L oD I / 2 , the
0
relation A = LoDL{f = LoDI/2(Dl/2)H L{f = LL H.
AijA ji;
j=J:i-l
if abs(piv) > 8, A ii = piv:
else ier = 1; Au = 8;
end
for k = i + 1 : n,
Aik=A ik- L AijA jk;
j=l:i-l
A k;
=
Aid Au;
end
end
%Now A contains the nontrivial elements of L, D and DL T
% if ier > 0, iterative refinement is advisable
Note that, as for the standard triangular factorization, the L D L H factorization
can be numerically unstable when a divisor becomes tiny, and it fails to exist
when a division by zero occurs. We simply replaced tiny pivots by a small
threshold value (using the machine accuracy eps in MATLAB) - small enough
to cause little perturbation, and large enough to eliminate stability problems.
For full accuracy in this case, it is advisable to improve the solutions of linear
systems by iterative refinement (see Algorithm 2.6.1). More careful remedies
(diagonal pivoting and the use of 2 x 2 diagonal blocks in D) are possible (see
Golub and van Loan [31]).
So far, the uniqueness of the triangular factorization has been attained through
the normalization L;; = l(i = I, ... , n). However, for Hermitian positive definite matrices, the factors can also be normalized so that A = LL H. This is
called a Cholesky factorization of A (the Ch is pronounced as a K, cf. [96]), and
L is referred to as the Cholesky factor of A. The conjugate transpose R = L H
is also referred to as the Cholesky factor; then the Cholesky factorization takes
the form A = R H R. (MATLAB's chol uses this version.)
If A is not positive definite, then it turns out that the square root of some
pivot element p S 0 must be determined, resulting in a failure. However, it is
sometimes meaningful in this case to construct instead a so-called modified
Cholesky factorization. This is obtained by replacing such pivot elements and,
to achieve numerical stability, all p < 8, by a small positive number 8. A
suitable value is 8 = vic II A II, where e is the machine accuracy. (In various
implementations of modified Cholesky factorizations, the choice of this threshold differs and may vary from row to row. For details of a method especially
adapted to optimization, see Schnabel and Eskow [86].) The following algorithm results directly from the equation A = LL T for real, symmetric A; by
careful indexing, it is ensured that the modified Cholesky factorization is overwritten over the upper triangular part of A, and the strict lower triangle remains
unchanged.
2.2.6 Algorithm: Modified Cholesky Factorization of a Real Symmetric
Matrix A
8 = eiIAII;
if 8 == 0, def = -1; return; end;
def= 1;
for i = 1 : n,
~
=
A'i -
L
A;j;
j=l:i-l
if ~ < 8, def = 0; ~
Ai;=~;
for k = i + 1 : n ,
A;k = (A;k - .
2.2.5 Theorem. If A is Hermitian positive definite, thenAhas a unique Cholesky
factorization A = L L H in which the diagonal elements Li, (i = I, ... , n) are
real and positive.
e
77
en
Proof. For x E i (l :s i :s n) with x =1= 0, we define a vector y E
by y j =
Xj (1:s j :s i) and v, = 0 (i < i z: n). Then, x" A(i)x = v" Ay > O. The
submatrices A (i) (1 :s i :s n) are consequently positive definite and are therefore
nonsingular. Corollary 2.1.9 now implies that there exists a triangular factorization A = LoR o = LoDL{f. Because A is positive definite, we have for each
diagonal entry D;i = (e(i)H De U) = x HAx > 0 for x = L OH e(i). Thus we
L
= 8; end;
Aj;Ajk)/A u;
}=l:t-l
end;
end;
% def = 1: A is positive definite; no modification applied
% def = 0: A is numerically not positive definite
% def = -1: A is the zero matrix
It can be shown that the Cholesky factorization is always numerically stable
(see, e.g., Higham [44]).
The computational cost of the Cholesky factorization is essentially (i.e., apart
from some square roots) the same as that for the L D L T factorization. Because
78
79
Linear Systems of Equations
2.2 Variations on a Theme
in Algorithm 2.2.6 only the upper triangle of A is used and changed, one could
halve the storage location requirement by storing the strict upper triangle of A
in a vector oflength n (n - 1)/2 and the diagonal of A (or its inverse) in a vector
of length n, and adapting the corresponding index calculations. Alternatively,
one can keep a copy of the diagonal of the original A in a vector a; then, one has
both A and its Cholesky factor available for later use. This is useful for iterative
refinement (see Section 2.6).
Therefore, Ilrll~ > Ilr*II~, and equality holds precisely when r" = r. In this
case, however,
Normal Equations
An overdetermined linear system, with more equations than variables, is solvable only in special cases. However, approximation problems frequently lead
to the problem of finding a vector x such that Ax ~ b for some vector b, where
the dimension of b is larger than that of x.
A typical case is that the data vector is supposed to be generated by a stochastic
model y = Ax + E, where E is a noise vector with well-defined statistical properties. If the components ofcare uncorrelared, random variables with mean zero
and constant variance, the Gauss-Markov theorem (see, e.g., Bjorck [8]) asserts
that the best linear unbiased estimator (BLUE) for the parameter vector x can
be found by minimizing the 2-norm II y - Ax 112 of the residual. Because then
Ily -
Axll~
=
IIEII~
=
l::>?
is also minimal, this way of finding a vector x such that Ax ~ y is called the
method of least squares, and any vector x minimizing Ily - Axl12 is called a
least squares solution of Ax ~ y. The least squares solution can be found by
solving the so-called normal equations (2.1 ).
2.2.7 Theorem. Suppose that A E em xn , b E em, and A HA is nonsingular.
Then lib - Ax 112 assumes its minimum value exactly at the solution x* of the
normal equations
A HA(x -x*) = AH(r - r") = 0;
and because of the regularity of AHA, the minimum is attained just for
x* = x.
0
Thus, the normal equations are obtained from the approximate equations by
multiplying with the transposed coefficient matrix. The coefficient matrix A H A
of the normal equations is Hermitian because (A HA)H = A H (AH)H = A HA,
and positive semidefinite because x HA HAx = II Ax II ~ 2: 0 for all x. If A has
rank n, then AHA is positive definite because x H A HAx = 0 implies Ax = 0,
and hence x = O.
In cases where the coefficient matrix AHA is well-conditioned (such as for
the approximation of functions by splines discussed in Section 3.4), the normal
equations can be used directly to obtain least squares solutions, by fonning a
Cholesky factorization AHA = LL H, and solving the corresponding triangular systems of equations. However, frequently the normal equations are much
more ill-conditioned than the underlying minimization problem; the condition
number essentially squares. Thus for a problem in which half of the significant
figures would be lost because of the condition of the least squares problem, one
would obtain no significant figures by solving the normal equations.
2.2.8 Example. In curve fitting, the approximation function is often represented as a linear combination of basis functions. When these basis functions
have a similar behavior, the resulting equations are usually ill conditioned.
For example, let us try to compute with B = 10, L = 5, and optimal rounding
the best straight line through the three points (Si' td = (3.32, 4.32), (3.33,4.33),
(3.34,4.34). Of course, the true solution is the line s = t + 1. The formula
s = x\t + X2 leads to the overdetermined system Ax = b, where
(2.1)
A=
Proof We putr := b - Ax andr*:= b- Ax*. Then (2.1) implies that A Hr" =
0, whence also (r" -r)H r* = (A(x -X*))H r* = (x _X*)H AHr* = O. It
b=
4 .32 )
4.33 .
(
4.34
For the normal equations (2.1), one obtains
follows from this that
IIrll~ = rHr
3.32
3.33
(
3.34
+ r*H(r*
= r*Hr" + (r* -
- r) = r*Hr* - (r" - r)H r
r)H (r* - r) = IIr* II~ + IIr* - rll~.
H
A A ~
(33.267
9.99
9.99)
3'
AHb
~
(43.257).
12.99
80
Linear Systems of Equations
2.2 Variations on a Theme
The Cholesky factorization gives A H A ~ L L T with
L
~
(5.7678
1.7320
81
squares solution of Ax ~ b by x=A \ b, so that the casual user need not to know
the details.
For strongly ill-conditioned least squares problems, even the solution of (2.2)
produces meaningless results. Remarkably, it is still possible to get sensible
solutions of Ax ~ b, provided one has some qualitative information about the
desired solution. The key to a good solution is called regularization and involves
the so-called singular-value decomposition (SVD) of A, a factorization into a
product A = U E VH of two unitary matrices U and V and a diagonal matrix
~. For details about the SVD, we refer again to the books mentioned in the
introduction to this chapter; for details about regularization, refer to Hansen [40]
and Neumaier [71].
0
)
0.014142
and solving the system of equations gives
and from this one obtains the (completely false) "solution"
Indeed, a computation of the condition number condg, (A H A) ~ 3.1 . 106 (see
Section 2.5) reveals that with the accuracy used for the computations, one could
have expected no useful results.
Note that we should have fared much better by representing the line as a
linear combination s = XI (t - 3.33) + Xz: one moral of the example is that one
should select good basis functions before fitting data. (A good choice is usually
one in which different basis functions differ in the number and distribution of
the zeros within the interval of interest.)
The right way to solve moderately ill-conditioned least squares problems is
by means of a so-called orthogonal factorization (or Q R -factorization) of A
into a product A = Q R of a unitary matrix Q and an upper triangular matrix
R. Here, a matrix Q is called unitary if QH Q = t . A unitary matrix with real
entries is called orthogonal, and in this case, QT Q = L,
From an orthogonal factorization, the least squares solution of Ax ~ y can
be computed by solving the triangular system
(2.2)
Indeed, because QH Q = t, A H A = R H QH QR = R H R, so that R is the (transposed) Cholesky factor of A H, and the solution x* of (2.2) satisfies the normal
equations because
Matrix Inversion
Occasionally, one must compute a matrix inverse explicitly. (However, as we
shall see, it is inadvisable to solve Ax = b by computing first A -I and then
x=A- 1bl)
To avoid later repetition, we discuss immediately the numerically stable
case using a permuted factorization, obtained by multiplying the matrix by a
nonsingular (permutation) matrix P (see Algorithm 2.3.7). From a factorization
P A = LR, it is not difficult to compute an explicit inverse of A.
If we introduce the matrices R := R- I and A := R- I L-I = RL -I, we can
split the computation of the inverse matrix A-I = R- I L -I P = AP into three
steps.
2.2.9 Algorithm: Matrix Inversion
STEP
STEP
STEP
STEP
1: Compute a permuted triangular factorization P A = L R (see Algorithm
2.3.7).
2: Determine R from the equation RR = l .
3: Determine A from the equation AL = k
4; Form the product A -I = AP.
The details for how to proceed in steps 2 and 3 can be obtained as in Theorem
2.1.8 by expressing the matrix equations in components,
RiiR i i
For the stable computation of orthogonal factorizations, we refer to the books
mentioned in the introduction to this chapter. MATLAB provides the orthogonal
factorization via the routine qr, but it automatically produces a stable least
L k j Rj k
j=i:k
=
1,
= 0 ifi<k,
82
Linear Systems oj Equations
2.3 Rounding Errors, Equilibration, and Pivot Search
and solving for the unknown variables in an appropriate order (see Exercise 14).
For the calculation of an explicit inverse A-I, we need 2n 3 + 0 (n 2 ) operations
(see Exercise 14); that is, three times as many as for the triangular factorization.
For banded matrices, where the inverse is full, this ratio is even more pessimistic.
In practice, the computation ofthe inverse should be avoided whenever possible. The reason is that, given a factorization, the solution of the resulting
two triangular systems of equations takes only 2n 2 + O(n) operations. This
is essentially the same amount as needed for the multiplication A-I b when an
explicit inverse is available. However, because the inverse is much more expensive than the factorization, there is no incentive to compute it at all, except in
circumstances where the entries of A -1 are of independent interest. In other applications where matrix inverses are used, it is usually possible (and profitable)
to avoid the explicit inverse, too.
If rounding errors arise from other arithmetic operations in addition to multiplication, one finds similar expressions. Instability always arises if the magnitude of a term L ij R jk is very large compared with the magnitudes of the
elements of A. In particular, if a divisor element R;; has a very small magnitude, then the magnitudes of the corresponding elements Li, (k > i) are large,
causing instability. For a few classes of matrices (positive definite matrices and
H-matrices; cf. Corollary 2.1.9), it can be shown that no instability can arise.
For general matrices, we must ensure stability by avoiding not only vanishing,
but also small divisors R;;. This is achieved by row (or sometimes column)
permutations.
One calls the row permuted into the ith row the ith pivoting row and the
resulting divisors Ri, the pivot elements; the search for a suitable divisor is called
the pivot search. The most popular pivot search is column pivoting, also called
partial pivoting, which usually achieves stability with only 0(1/2) additional
comparisons. (The stability of Gaussian elimination for balanced matrices may
be further improved by so-called complete pivoting, which allows row and
column permutations. Here, the element of greatest magnitude in the remainder
of the matrix becomes the pivot element. However, the cost is considerably
higher; 0(n 3 ) comparisons are necessary.)
In column pivoting, one searches the remainder of the column under consideration for the element of greatest magnitude as the pivot element. Because in
the calculation of the L ki (k > i) the pivot element is a divisor, it follows that,
with column pivoting, ILki I::: 1. This implies that
2.3 Rounding Errors, Equilibration, and Pivot Search
We now consider techniques that ensure that the solution of linear systems is
computed in a numerically stable way.
Rounding Errors and Equilibration
In finite precision arithmetic, the triangular factorization is subject to rounding
errors. We first present a short and very simplified argument that serves to
identify the main source of errors; a detailed error analysis is given at the end
of this section (Theorem 2.3.9).
For simplicity, we suppose first that a rounding error appears only in a single
multiplication LijR j k; say, for i = p, j = q, k = r. In this case, we have
Rpr = Apr
-
L
LpjR j r -
LpqRqrO- E:pr)
j=l:p-1
Hq
=
Apr
L
+ LpqRqrE:pr
LpjRjr>
j=l:p-1
where IE: pr I :::
B 1- L.
If we define the matrix
A obtained by changing in A the
(p, r)-entry to
r»
the triangular factorization LR (in which k., := R i k for (i, k) # (p,
that
we obtain instead of L R may be interpreted as the exact triangular factorization
of the matrix A. If IL pq R qr I » Apr; then this corresponds to a large relative
perturbation IApr - A I" 1/1 A rr I of Apr' and the algorithm is unstable.
IRikl::: IAikl +
L
83
[Rjkl,
j=l:i-1
so that by an easy induction argument. all Ri, are bounded by the expression
K II maxi,k IAik I, where the factor K II can, in unusual cases, be exponential in
n, but typically varies only slowly with n. Therefore, if A is balanced (i.e.,
if all nonzero entries of A have a similar magnitude), then we typically have
numerical stability.
To balance a matrix with entries of widely different magnitudes, one usually
Uses some form of scaling. In compact notation, scaling of a matrix A corresponds to the multiplication of A by some diagonal matrix D; left multiplication
by D corresponds to row scaling (i.e., multiplication of each row by the corresponding diagonal element), and right multiplication by D corresponds to
column scaling (i.e., multiplication of each column by the corresponding diagonal element). For example, if
_(ac b)d , _(e0 J0)
A-
D-
84
2.3 Rounding Errors, Equilibration, and Pivot Search
Linear Systems of Equations
85
scaling matrices
then
ea
DA= (
fc
eb)
fd '
AD
=
(ea
ec
D
b
ffd )
.
=
diag(1, 10- 15, 10- 20 , 1O-2S ) ,
D' = diag(1 O-S, 10- 20 , 1O-2S , 10- 30 )
If one uses row scaling only, the most natural strategy for balancing A is row
equilibration (i.e., scaling each row such that the absolute values of its entries
sum to 1). This is achieved for the choice
and the permutation matrix (cf. below)
o
o o
o o
o 1
1
(3.1)
An optimality property for this choice among all row scalings (but not among
two-sided scalings) is proved in Theorem 2.5.4. (Note that (3.1) is well defined unless the ith row of A vanishes identically; in this case, we may take
w.l.o.g. Di, = 1.) In the following MATLAB program, the vector d contains
the diagonal entries of D.
we get the strongly diagonally dominant matrix
2.3.1 Algorithm: Row Equilibration
which is perfectly scaled for Gaussian elimination. (For a general method to
scale and permute a matrix to so-called I -matrix form, where all diagonal
entries are 1 and the other entries have absolute value of at most I, see Olschowka
and Neumaier [76]).
d=ones(n,l);
for i=l:n,
dd=sum(abs(A(i,:));
i f dd>O,
d(i)=l/dd;
A(i,:)=d(i)*A(i,:);
end;
end;
2.3.2 Example. Rice [82] described the difficulties in scaling the matrix
A
= ~0+1O
(
=
1O- IS
10~s
1
IO- S
10- 40
10- 30
( 1O- I S
10- 40
1O- IS
I
IO- S
S
IO- 30 )
10~O-SS
'
Row Interchanges
Although it usually works well, scaling by row equilibration only is sometimes
not the most appropriate way to balance a matrix.
1
10+ 20
P DAD'
1
10+20
10+20
1
10+40
10+ 10
1
10+40
lO+ s0
~0+40)
lO+s0
We now consider what to do when no triangular factorization exists because
some divisor Ri, is zero. One usually proceeds by what is called column pivoting
(i.e., performing suitable row permutations).
In linear algebra, permutations are written in compact notation by means of
permutation matrices. A matrix P is called a permutation matrix if, in each row
and column of P, the number I occurs exactly once and all other elements are
equal to O. The multiplication of a matrix A with P on the left corresponds
to a permutation of rows; multiplication with P on the right corresponds to a
permutation of columns. For example, the permutation matrix
.
I
leaves A
to a well-behaved matrix; row equilibration and column equilibration
emphasize very different components of the matrix, and it is not at all clear
which elements should be regarded as large or small. However, using the
E
C 2 x 2 unchanged; the only other 2 x 2 permutation matrix is
P2
=
(~ ~),
86
2.3 Rounding Errors, Equilibration, and Pivot Search
Linear Systems of Equations
and we have
(4a)
)
u
4
-4
-6
9
1
2.3.3 Example. We replace, in the matrix A of Example 2.1.11, the element
=
-1 with A 33 = -11/2. The calculation is then unchanged up to and
including Step (2b), and we obtain the following diagrams.
(2b)
(-~
4
-6
3
-4
9
11
-2
-Z-
1
3
2
-~) ( ~
3
-2
1
1
2
3
o
1
5S
2
(3b)
Permutation
4
-4
4
-4
-6
9
-6
9
5
I
1
2
3
-2
-~) ( ~
o
2
3
-2
2
3
-2
5
2
aD
-~ )
15
2
3
2
is overwritten over A.
What happens if all elements in the remaining column are equal to zero?
To construct an example for this case, we replace the elements A 33 = -1 and
A 43 = 3 with A 33 = -11/2 and A 43 = 1/2 and obtain the following diagrams
(-~
In the next step, division by the element R 33 = A33 should occur, among
other things. In this example, however, A 33 = 0 (i.e., the division cannot be
performed). This makes an additional step necessary in the previously mentioned algorithm: we must exchange the third row with the fourth row. This
corresponds to a row permutation of the initial matrix A.
If one forms the triangular factorization of the matrix P A, then one obtains
just the corresponding permuted diagrams for A, including Step (3a). Because
the third diagonal element A 33 is different from zero, one can continue the
process.
5S
After Step (4a), the triangular factorization of P A with
(3a)
4 -4
-6
9
_~
OS
2
-~
5
2
-2
A symmetric permutation of rows and columns that moves diagonal entries to
diagonal entries is given by replacing A with P ApT; they may be useful, too,
because they preserve symmetry. Permutation matrices form a group (i.e., products and inverses of permutation matrices are again permutation matrices). This
allows one to store permutation matrices in terms of a product representation by
transpositions, which are simple permutations that only interchange two rows
(or columns), by storing information about the indices involved.
A 33
87
(2b)
-4
-6
9
4
3
-2
11
-ZI
1
2
2
(
(3a)
-4
-6
9
3
05
-2
-~) ~
3
I
4
-2
1
1
2
05
-~)
Again, we obtain a zero on the diagonal; but this cannot be removed from the
diagonal by an exchange of the third and fourth rows. If we simply continue,
then the division in Step (3b) leads to L43 = 0/0 (i.e., any value can be taken
for L 43). For simplicity, we set L43 = 0 and continue the recursion.
~N
(
~~
~1 =i -~ -i,) (-~ =i -~ -i)
1
2
OD
1
I
1
2
0
15
2
Indeed, we can carry out the factorization process formally, obtaining, however,
R33 = 0 (i.e., R, and therefore A, is singular).
It is easy to see that the permuted triangular factorization can be constructed
for arbitrary matrices A in exactly the same fashion as illustrated in the examples. Therefore, the following theorem holds.
88
Linear Systems of Equations
2.3 Rounding Errors, Equilibration, and Pivot Search
2.3.4 Theorem. For each n x n matrix A there is at least one permutation
nonnalization) and DRD'. Thus, R scales as A. Hence, in the ith step, the pivot
row index j would be selected by the condition
matrix P such that the permuted matrix P A has a normalized triangular factorization P A = L R.
Noting that det P det A = det(P A) = det(LR) = det L det R = det Rand
using the permuted system PAx = Pb in place of Ax = b, we obtain the following final version of Algorithm 2.1.6.
One sees that in this condition, the D;i cancel and only the row scaling affects
the choice of the pivot row. Writing d k := D kk , we find the selection rule
Choose j :::: i such that djA ji
2.3.5 Algorithm: Solve Ax
= b with Pivoting
STEP I: Calculate a permuted normalized triangular factorization P A = L R.
STEP 2: Solve Ly = Pb.
STEP 3: If R is nonsingular, solve Rx = y.
The resulting vector x is the solution of Ax = b. Moreover, det A = det Rj
detP.
2.3.6 Remarks.
(i) If several systems of equations with the same coefficient matrix A are
to be solved, then only steps 2 and 3 must be repeated. Thus we save a
considerable amount of computing time.
(ii) P is never formed explicitly, but is represented as a product of single row
exchanges (transpositions); the number of the row exchanged with the ith
row in the ith step is stored as the ith component of a pivot vector.
(iii) As a product of many factors, det R is prone to overflow or underflow,
unless special precautions are taken to represent it in the form a . M k for
some big number M, and updating k occasionally. The determinant det P
is either + 1 or -1, depending on whether an even or an odd number of
exchanges were carried out.
(iv) log [det A I = log [det RI is needed in many statistical applications involving maximum likelihood estimation.
Pivot Search with Implicit Scaling
As we have seen, scaling is important to obtain an equilibrated matrix for
which column pivoting may be applied sensibly. In practice, it is customary to
avoid explicit scaling. Instead, one uses for the actual factorization the original
matrix and performs the scaling only implicitly to determine the pivot element of
largest absolute value. If we would start the factorization with the scaled matrix
DAD' in place of A, the factors Land R would scale to DLD- (because of
J
89
= max{dkA ki I k
:::: i}.
This selection rule can now be used together with the unsealed A, and one
obtains the following algorithm.
2.3.7 Algorithm: Permuted Triangular Factorization with Implicit Row
Equilibration and Column Pivoting
% find scale factors
d=ones(n,l);
for i=l:n,
dd=sum(abs(A(i,:»;
if dd>O, d(i)=l/dd; end;
end;
%main loop
for i=l:n,
% find possible pivot elements
for k=i:n,
A(k,i)=A(k.i)- A(k,l:i-l)*A(l:i-l.i);
end;
% find and save index of pivoting row
[val,j]=max(d(i:n) .*abs(A(i:n,i»);j=i-l+j;p(i)=j;
% interchange rows i and j
if j >i,
A([i,j], :)=A([j,i],:);
d(j)=d(i); % d(i) no longer needed
end;
for k=i+l:n,
%find R(i,k)
A(i,k)=A(i,k)-A(i,l:i-l)*A(l:i-l,k);
% complete L(k,i)
if A(i,i) -= 0, A(k,i)=A(k,i)/A(i,i); end;
end;
end;
90
Linear Systems of Equations
2.3 Rounding Errors, Equilibration, and Pivot Search
91
Note that is necessary to store the pivot indices in a vector p because this
information is needed for the solution of a triangular system Ly = Pb; the
original right side b must be permuted as well.
Land Rand Sne .::: 1, then
2.3.8 Algorithm: Solving a Factored Linear System
Usually (after partial pivoting), the entries of ILIIRI have the same order of
magnitude as A; then, the errors in Gaussian elimination with partial pivoting are
comparable to those made in forming products Ax. However (cf. Exercise 11),
occasionally unreasonable pivot growth may occur such that ILII RI has much
bigger entries than A. In such cases, either complete pivoting (involving row
and column interchanges to select the absolutely largest pivot element) or an
alternative factorization such as the QR-factorization, which are provably stable
without exception, should be used.
To prove the theorem. we must know how the expressions of the form
% forward elimination
for i = I : n,
j = Pi; if j > i, b([i, j]) = b([j, i]); end;
Yi = b, L AikYk;
k=1:i-1
end;
% back substitution
ier = 0;
for i = II : -I : 1,
if A ii == 0, ier = 1; break; end;
Xi = (Yi L A'kXd/A i,;
Ib - Axl.::: 5n£IL/lRllxl.
(3.2)
k='+l:n
end;
% ier = 0: A is nonsingular and x
% ier = I: A is singular
= A -I b
that Occur in the recurrence for the triangular factorization and the solution
of the triangular systems are computed. We assume that the rounding errors
in these expressions are equivalent to those obtained if these expressions are
calculated with the following algorithm.
Rounding Error Analysis for Gaussian Elimination
Rounding errors imply that any computed approximate solution x is in general
inaccurate. Already, rounding the components of the exact x* gives x with
Xk = x;(1 + Ed, I£kl .::: 8 (8 the machine precision), and the residual b - Ai,
instead of being zero, can be as large as the upper bound in
2.3.10 Algorithm: Computation of (3.2)
So = c;
for j
Sj
Ib - Axl = IA(x* -x)l.::: IAllx* -xl'::: £IAllx*l·
We now show that in the execution of Gaussian elimination, the residual usually
is not much bigger than the unavoidable bound just derived. Theorem 2.5.6 then
implies that rounding errors have the same effect as a small perturbation of the
right side b. We may assume w.l.o.g. that the matrix is already permuted in such
a way that no further permutations are needed in the pivot search.
2.3.9 Theorem. Let i be an approximation to the solution of the system of
linear equations Ax* = b with A E e nxn and bEen, calculated in finiteprecision arithmetic by means of Gaussian elimination without pivoting. Let e
be the machine precision. If the computed triangular matrices are denoted by
== 1 : m - 1,
== Sj_1 - ajb j;
end;
bm =
Sm-I/a m;
Proof of Theorem 2.3.9. We proceed in 5 steps.
STEP I: Let
e,
:= i£/(1 - i£)(i <
5n). Then
because for the positive sign,
I.B1=111+a( 1 +1])I':::£+£i(I+£)=
(i+1)£
. -<£+1
[ ,
1 - 1£
92
Linear Systems ofEquations
2.3 Rounding Errors. Equilibration, and Pivot Search
and for the negative sign,
a - 17 1 Si+S
- - <--=
-
1131 -
l
1+
17 -
1-
S
C
=
STEP 4: Applying (3.5) to the recursive formulae of Crout (without pivoting)
gives, because e, ~ e; for i S n,
2
(i+1)s-is
<Si+l·
1 - (i + l)s + ie? -
STEP 2: We show that with the initial data iu ; hi (i
calculated values Sj satisfy the relation
L aih,(l + ai) + Sj(l + .Bj)
(j
= 1, ... , m -
Sn
= 0, I, ... , m -
1)
J-1./
J-I./
IAH - j'f:, I'jR;;/ <0'. ,'f:, II"IIR"I
(3.4)
if i ~ k, so for the computed triangular factors Land
for suitable a, and .Bi with magnitude ~Si' This clearly holds for j = 0
with.Bo := O. Suppose that (3.4) holds for some j < m - 1. Because
the rounding is correct,
Sj(l
IAik- ~. LiJ?jk! ~ ~.ILijIIRjkl,
1) and C, the
i=l:j
Solving for Sj and multiplying by (l
+ .Bj)
= Sj+1 (l
= S;+I (l
with laj+II,I.B;+11
obtain
Similarly, for the computed solutions
Ly = b, Rx = y,
From the inductive hypothesis (3.4), we
y, i
R of A,
of the triangular systems
Ib' - ,I;, I" Y, I<0 e; ,'f:, II" 11M
+ .Bj) gives, because of (3.3),
+ .Bj)(l + v)-I + aj+Jhj+ 1(l + .Bj)(l + J.L)
+ .BJ+z) + aj+1b j+ 1 (l + aj+l)
~S;+I'
93
IY, -
,'f:. hi'l ,'f:. IR"lIi,l,
<0£.
and we obtain
Ib - LYls snlLIIY!.
151 - Ril ~ snlRllil.
L a,bi(l + ai) + Sj(l + .Bj)
i=l:j
= L ai b,(l+ai)+s;+I(l+.Bj+l).
C=
STEP 5: Using steps 3 and 4, we can estimate the residual. We have
i=l:j+1
Ib - Ail
STEP3: From b m = (sm-Jlam)(l
+ 17), 1171
~ s, and (3.4), it follows that
=
Ib - L y - L(Ri - y) + (LR - A)il
Sib -
LYI + ILllRi - yl + ILR - Allil
~ sn(ILllyl + ILllRllil + ILIIRllil)
~ sn(ILlly - Ril + 3ILIIRllxl)
L aih,(l +ai)+ambm(l+.Bm-z)(l +17)-1
= L a,b,(l + ai),
c=
i=l:m--l
S sn(sn
'=I:m
+ 3)ILIIRII.tl.
By hypothesis, Sne < I, so e; = ns/(l - ns) ~ ~ns < 1, whence the
theorem is proved.
whence, by (3.3), lam I ~ Sm; we therefore obtain
(3.5)
Rounding error analyses like these are available for almost all numerical
algorithms that are widely used in practice. A comprehensive treatment of error
analyses for numerical linear algebra is in Higham [44J; see also Stummel
94
Linear Systems of Equations
and Hainer [92]. Generally, these analyses confirm rigorously what can be
argued much simpler without rigor, using the rules of thumb given in Chapter 1.
Therefore, later chapters discuss stability in this more elementary way only.
2.4 Vector and Matrix Norms
because, for example,
[x ] - lIyll = Ilx ± Y =f yll -llyll
::: IIx ± YII + IlylI - Ilyll
= IIx±yll·
2.4 Vector and Matrix Norms
Estimating the effect of a small residual on the accuracy of the solution is a
problem independent of rounding errors and can be treated in terms of what
is called perturbation theory. For this, we need new tools; in particular, the
concept of a norm.
A mapping II . II : en ---+ lR is called a (vector) norm if, for all x, y E en:
A metric in en is defined through d (x,
y) = Ilx- Y II. (Restricted to real 3-space
~3, the "ball" consisting of all x with Ilx-xoll::: e is, for II· liz, 11·1100, and 11./1,
a sphere-shaped, a cube-shaped, and an octahedral-shaped s-neighborhood of
xo, respectively.)
The matrix norm belonging to a vector norm is defined by
°
(i) [x II ~ with equality iff x = 0,
(ii) Ilaxll = lalllxll fora E e,
(iii) Ilx
95
IIAII := sup{IIAxll I [x] = I}.
+ yll ::: IIxll + lIyll·
The relations
The most important vector norms are the Euclidean norm
IIxliz :=
)1;
Ix;[Z = ../xHx,
IIAxll:::IIAllllxll
for all A Ee x n ,
IIABII::: IIAIIIIBIl
IlaA11 = lalliAl1
the maximum norm
for all A,
for all A E
XEe n ,
BE e
e"x",
xn
a E
,
e,
11/11 = 1
Ilx1100:= max Ix;[,
;=I"",n
can be proved easily from the definition.
The following matrix norms belong to the vector norms II . liz, II . 1100, and
11·111.
and the sum norm
IIxlll:=
L
IXil·
i=l:n
IIAllz = sup{vXHAHAx IxHx = I}
= jmaximal eigenvalue of A H A;
In a finite-dimensional vector space, all norms are equivalent, in the sense that
any s-neighborhood contains a suitable £' -neighborhood and is contained in a
suitable £/1 -neighborhood (cf. Exercise 16). However, for practical purposes,
different norms may yield quite different measures of "size." The previously
mentioned norms are appropriate if the objects they measure have components
of roughly the same order of magnitude.
The norm of a sum or difference of vectors can be bounded from below as
follows:
[x ± yll
~
is called the spectral norm; by definition, for a unitary matrix A, IIAliz
IIA-111 z = 1.
II A 1100
is called the mw sum norm, and
[x] - lIyll
IIAII]
and
Ilx ± yll ~ Ilyll - IIxll
~ max [ ~ IA;, II; ~ I, ... , n )
=max{~IAikllk=
I, ... ,n}
the column sum norm. We shall derive only the formula for the row sum norm.
96
Linear Systems of Equations
2.4 Vector and Matrix Norms
We have
97
and therefore
II A 1100
= sup{IIAxll oo Illxlloo = l}
IIAII~
= (AX)H (Ax) = L
IAix1
2
::::
IIAi:II~ = L
L
IAikl2.
i.k:
=
sup max ILAikXkl
Ixkl::o I
:::: max
I
k
L IAikl,
x
k
I
and equality holds because the maximum is attained for Xk = IAikl/ Aik.
Obviously, the calculation of the spectral norm is more difficult than the
calculation of the row sum norm or the column sum norm, but there are some
useful, easily computable upper bounds.
2.4.1 Proposition. If A
E
Equality holds when equality holds in the Cauchy-Schwarz inequality
(i.e., when for all i the vectors (AiJ T and are linearly dependent). This
is equivalent to saying that the rank of the matrix A is :::: 1. Thus A may
be written as an outer product
e n x n , then
(iii) If Ai are the eigenvalues of the Hermitian matrix A, then A HA = A 2
has the eigenvalues A~. Let Amax be the eigenvalue of A of maximum
absolute value. Then, by (4.1),
(i) IIA 112 :::: JII A IIIIIA 1100, with equality, e.g., when A is diagonal.
(ii) IIAII2:::: IIAIIF :=
JLi.k IAik12; the equality sign holds if and only if
A = xyH with X, Y E en (i.e., the rank of A is at most 1). IIAIIF is
called the Frobenius norm (or Schur norm) of A.
(iii) If A is Hermitian then IIAII2:::: IIAllforeverymatrixnorm 11·11.
Proof We use the fact that for any matrix norm II . II and for each eigenvalue
A of a matrix A we have IAI :::: II A II, because for a corresponding eigenvector x
we have
IAI
= IIAxIl/llxll = II Ax II/IIx II :s IIAII·
IIAI12 =
2.4.2 Proposition. If II I - A II :s fJ < 1, then A is nonsingular and
IIA-III:::: 1/(1- fJ)·
Proof Suppose that Ax = O. Then
IIxll = jjx - Axil = IIU - A)xll:::: III - AII·llxll
+ 11/11 :::: IIA-III·III -
All
+ 11/11 :::: fJ11A-111 + I
that IIA-III:s 1/(1- fJ).
IIAI12 =
s f3l1xll.
Because fJ < 1 we conclude that [x II = 0, hence x = O. Thus Ax = 0 has only
the trivial solution, so A is nonsingular. Furthermore, it follows from
IIA-III :::: IIA-' -III
(ii) By definition,
o
IAmaxl:::: IIAII.
Norms may serve to prove the nonsingularity of matrices close to the identity.
(4.1)
(i) Let Amax be the eigenvalue of A H A of maximum absolute value. Then
jA~ax =
o
sup J(Ax)H(Ax)
IIxliFI
M-Matrices and H-Matrices
x
and the supremum is attained for some x = because of the compactness
of the unit ball. By the Cauchy-Schwarz inequality,
T~ere are important classes of matrices which satisfy a generalization of the
Cnterion in Proposition 2.4.2. We call A an Hrmatrix if diagonal matrices D I
d D 2 exist such that III - D,AD 2 11 oo < 1. By the proposition, an H-matrix
IS nonsingular. Diagonally dominant matrices are matrices in which the weak
:m
98
Linear Systems ofEquations
2.5 Condition Numbers and Data Perturbations
row sum criterion
99
2.4.3 Proposition. A vector norm is monotone iff, for all diagonal matrices
IAiil~LIAikl
fori=l, ... .n
D, the corresponding matrix norm satisfies
(4.2)
kefi
IIDII = max{IDiilli = I, ... , n}.
holds. Diagonally dominant matrices are H-matrices if strict inequality holds
in (4.2) for all i, (To see this, use D, = Diag(A)-1 and D2 = I.) In fact, strict
inequality in (4.2) for one i suffices if, in addition, A happens to be "irreducible";
see, e.g., Varga [94].
M-matrices are H-matrices with the sign distribution
A ii
~
0,
A ik :::: 0
for i
1= k;
there are many other equivalent definitions of M-matrices. Each M-matrix A is
nonsingular and satisfies A -I ~ 0 (see Exercise 19). This inequality mirrors the
fact that M-matrices arise naturally as discretization of inverse positive elliptic
differential operators; but they arise also in other context (e.g., input-output
models in economics).
Monotone Norms
The absolute value of a vector or a matrix is defined by replacing all components
with their absolute value; for example,
(4.3)
Proof Let II . II be a monotone vector norm. Because IDii IlIe(i) II = II Diie(i) II =
II De(i) II :::: IIDlllle(i)II, we have IDiil:::: IIDII, hence IIDII ~ max IDiil =:a.
Because IDxl ::::alxl for all x, monotony implies IIDxl1 = IIIDxlll:::: Ilalxlll =
allxll, hence IIDII = sup IIDxll/llxll ::::a. Thus IIDII = a. and (4.3) holds.
Conversely, suppose (4.3) holds for all diagonal D. We consider the special
diagonal matrix D = Diagtsigmx.), .... sign(xn », where
signet) =
I
x / Ix /
]
if x
1= O.
if x =
o.
Then IIDII=IID-'II=I and x=Dlxl, hence IIxll=IID/xlll :::: IIDlllllxlll=
IIlxlli = IID-'xll:::: IID-'lllIxll = IIxll. Thus we must have equality throughout; hence IIxll = IIlxlll. Now suppose Ixl::: y. We take D = Diag(lx,I/Yl,""
Ixnl/y,,) and find IIDII :::: I, Ixl = Dy, hence IlIxlll = IIDyl1 :::: IIDIiIIYIl :::: Ilyli.
Hence, the norm is monotone.
0
2.5 Condition Numbers and Data Perturbations
For x, Y
E
en
and A, B
E
enxlI, one easily sees that
Ix ± yl :::: Ixl
+
[v].
IAxl:::: /Allxl,
IA ± BI:::: IAI + IBI, IABI:::: IAIIBI.
A norm is called monotone if
Ixl::::y
=> Ilxll = Illxlll:::: lIyll·
All vector norms considered above are monotone. The analogous result
IAI:::: B
=> IIAII = IliA/II:::: IIBII
holds for the matrix norms II . 111 and II . 1100, but for II . 112 we only have
IAI:::: B
=> IIAI12:::: IIIAII12::: IIBI12.
As is well known, a matrix A is nonsingular if and only if det A 1= O. Although
one can compute the determinant efficiently by a triangular factorization, checking whether it is zero is difficult to do numerically. Indeed, rounding errors in the
numerical calculation ofdet A imply that the calculated value of the determinant
is almost always different from zero, even when the exact value is det A = O.
Thus the unique solvability of Ax = b would be predicted from an inacCuratedeterminant even when in reality no solution or infinitely many solutions
exist.
The size of the determinant is not even an appropriate measure for closeness
to singularity! For example, det(a A) = a" det A. so that multiplying a 50 x 50
matrix by a harmless factor of 3 increases the determinant by a factor of
3n > 1023.
For the analysis of the numerical solution of a system of linear algebraic
equations, however, it is important to know "how far from singular" a given
matrix is. We can then estimate how much error is allowed in order to still
Obtain a useful result (i.e., in order to ensure that no singular matrix is found).
A Useful measure is the condition number of A; it depends on the norm used
100
Linear Systems of Equations
2.5 Condition Numbers and Data Perturbations
and is defined by
cond(A):= IIA-
101
Note that (5.3) specifies a componentwise relative error of at most 8; in particular, zeros are unperturbed.
111·IIAII
for nonsingular matrices A; we add the norm index (e.g., cond., (A)) if we refer
to a definite norm, For singular matrices, the condition number is set to 00.
Unlike for the determinant.
Proof Because (5.3) is scaling invariant, it suffices to consider the case where
D, = D 2 = I. Because B is singular, then so is A -I B, so, by Proposition 2.4.2,
III - A-I BII :::: 1, and we obtain
1.::: III-A- I BIi
cond(aA)
= cond(A)
for a
E
<C.
= IIA- 1(A
The condition number indeed deserves its name because, as we show now, it
is a measure of the sensitivity of the solution x* of a linear system Ax* = b to
changes in the right side b.
2.5.1 Proposition. IfAx*
= b then, for arbitrary x,
(5.1)
and
- B)II.::: IIA-IIIIIB - All
.::: IIA- 11l1181AIII =8cond(A),
o
whence 8:::: l/cond(A).
In particular, a matrix that has a small condition number cannot be close to a
singular matrix. The converse is not true because the condition number depends
strongly on the scaling of the matrix, in a way illustrated by the following
example.
2.5.3 Example. If
IIx* - x II < cond(A) lib - Ax II .
11- II b II
-----::-11
x---::-*
(5.2)
then
Proof (5.1) follows from
I_(-1
1)
A-1 _ _
- 0.995
and (5.2) then from
.....
I
-0.005
.
IIAlloo =2, IIA-111 00 =2/0.995, and condoo(A);::;; 4. Now
IIbll
= IIAx*II':::
IIAII·lIx*lI·
o
200)
The condition number also gives a lower bound on the distance from singularity
(For a related upper bound. see Rump [83]):
(5.3)
'
whence
(DA)-I
2.5.2 Proposition. If A is nonsingular, and B is a singular matrix with
IB - AI.::: 81AI,
I
=
_1_ (-1
199
I
200)
-I
'
then
II DA 1100 = 201, II (DA)-liloo = 201/199, and condoo(DA) ;::;; 200. So the condition number is increased by a factor of about 50 through scaling.
for all nonsingular diagonal matrices DI. D 2 •
The following theorem (due to van der Sluis [93]) shows that equilibration
generally improves the condition number; in particular, the condition number
of the equilibrated matrix gives a better bound in Proposition 2.5.2.
103
Linear Systems of Equations
2.5 Condition Numbers and Data Perturbations
2.5.4 Theorem. If D runs through the set of nonsingular diagonal matrices,
index i, we obtain a lower bound s, for II A -I 1100' If d is an arbitrary vector with
Idl = e, where e is the all-one vector, then
102
then condoo(DA) is minimalfor
Di, = 1/;; IAikl (i = I, ... , n),
that is,
(5.4)
if
LI(DA)ikl=1
fori=l, ... ,n.
k
Proof. Without loss of generality, let A be scaled so that Lk IA ik 1= I for all i.
If D is an arbitrary nonsingular diagonal matrix, then
IIDAlloo = max {IDiil L
I
IAikl} = max IDiil = IIDlloo;
k
I
The calculation of I(A-Id)il for all i is less expensive than the calculation of
all s., because only a single linear system must be solved, and not all the I'".
The index of the absolutely largest component of A -Id would therefore seem
to be a cheap and good substitute for the optimal index i o.
Among the possibilities for d, the choice d := sign(A -T e) (interpreted componentwise) proves to be especially favorable. For this choice, we even obtain
the exact value Sio = IIA-11l00 whenever A-I has columnwise constant sign!
Indeed, the i th component a, of a := A- T e then has the sign of the i th column
of A-I, and because d, = signrc.), the ith component of A-Id satisfies the
relation
so
condoo(A)
= IIA- IIINIIAlloo = II(DA)-I Dlloo
:s II(DA)-llIooIIDlloo = II(DA)-llIooIIDAlloo
= condoo(DA).
D
2.5.5 Algorithm: Condition Estimation
I: Solve AT a = e and set d = sign(a).
Solve Ac = d and find i" such that ICi·1 = max, ICi I.
STEP 3: Solve ATf = e U' ) and determine s = If ITe.
STEP
To compute the condition number, one must usually have the inverse of A
explicitly, which is unduly expensive in many cases. If A is an H-matrix, we
can use Proposition 2.4.2 to get an upper bound that is often (but not always)
quite reasonable.
If one is content with an approximate value, one may proceed as follows.
using the row-sum norm IIA -11100' Let s, be the sum of the magnitudes of the
elements in the i th row of A -I . Then
IIA- llloo=max{sili=l, ... ,n}= sio
for at least one index io. The ith row of A -I is
If a triangular factorization P A = LR is given, then fU) is calculated as the
solution of the transposed system
We solve R T y = eU), L T ::: = y, and set fU) = p T Z. If we were able to guess the
correct index i = io, then we would have simply s, = II A -III 00; for an arbitrary
STEP 2:
Then II A -I II 00 2: s and often equality holds. Thus s serves as an estimate of
IIA-11l00.
Indeed, for random matrices A with uniformly distributed A ik E [-1,1],
equality holds in 60-80% of the cases and II A -I 1100 :s 3s in more than 99% of
the cases. (However, for specially constructed matrices, the estimate may be arbitrarily bad.) Given an LR factorization, only O(n 2 ) additional operations are
necessary, so the estimation effort is small compared to the factorization work.
Data Perturbations
Often the coefficients of a system of equations are only inexactly known and
We are interested in the influence of this inexactness on the solution. In view
of Proposition 2.5.1, we discuss only the influence on the residuals, which is
Somewhat easier to determine. The following basic result is due to Oettli and
Prager [75].
2.5 Condition Numbers and Data Perturbations
Linear Systems of Equations
104
2.5.6 Theorem. Let A
E ([Il XIl and b, x E ([Il. Then for arbitrary nonnegative
matrices .6.A E JR.1l XIl and nonnegative vectors Sb E JR.1l, the following statements
are equivalent:
en
(i) There exists A E
([nxll and h E
Ih - bl :S s».
(ii) The residual satisfies the inequality
with Ax "'=h, IA - AI:S .6.A, and
Ib - Axl:s.6.b + .6.Alxl.
Proof. It follows from (i) that
Ib - Axl = Ib - h
+ (A
- A)xl :S Ib - hi
+ IA -
AI·lxl:S.6.b
+ .6.Alx/:
that is, (ii) holds.
Conversely, if (ii) holds, then we define qi as the ith component of the residual
divided by the ith component of the right side of (ii); that is,
By assumption. Iqi I :S 1. We define A
E ([n
XII
and h
E
([II
by
A ik := A ik + qi(.6.A)iklsign(xk), (i, k:::: 1, ... , n),
hi := b, - qi(.6.b)i, (i = 1, ... , n),
with the complex sign
sign(x) =
I
Xl lx l ifx # 0,
1
if x =0.
(However, note that MATLAB evaluates sign(O) as 0.) Then IA - AI:s .6.A,
Ih - bl :S Sb, and for i = 1, ... ,11,
(h - AX)i
= b, = (b i -
qi(.6.b)i - L(A ik + qi(AA)ik! sign(xd)xk
L
AikXk) - qi (.6.b)i i- L(.6.A)ik IXk
(i) The implication (ii) =} (i) in Theorem 2.5.6 says that an approximation
x to the solution x of Ax' = b with small residual can be represented
as the solution of a slightly perturbed system Xt' = h. (This argument is
a typical backward error analysis: Instead of the more difficult task of
showing how data perturbations affect the solutions, one analyzes whether
the approximations obtained can be interpreted as an exact solution of a
nearby system.)
In particular, if the coefficients of the system Ax = h have a relative
error of at most E per component, then IA - AI:S EIAI and Ih - bl:s Elbl.
An approximation x to x* = A -lb can then be regarded as acceptable if the
relation Ib - Ax I :S E(lbl + IA I. IxI) holds (i.e., if E :::: EO), where
(5.5)
(Here the nsturz) interpretation for % is zero.) co is a scaJing invariant
quality factor for computed approximations to systems of linear equations.
It is always between 0 and 1, and it is small iff the error in x can be explained
by small relative errors in the original data.
(ii) If we exchange the quantities distinguished by - and *, we can interpret
the implication (i) =} (ii) as follows. The solution x* of a system of linear
equations Ax* = b approximates the solution x of a slightly perturbed system Ax = h with A ~ A and h ~ b in the sense that the residual h - Ax*
remains small. As seen in Proposition 2.5.1, the absolute error x * - x can
be a factor of II A -I II greater than the residual, and the relative error can
be a factor of cond(A) greater than the relative residual. As the condition number for function evaluation in Section lA, the number cond(A) is
therefore interpretable as the maximal magnification factor of the relative
error.
As an application of Theorem 2.5.6, we prove an easily calculable upper
bound on the distance to a singular matrix.
# 0, and let
ILk AikXkl
8:= max
.
i Lk IAikikl
=0,
so Ai = h. Thus (i) is proved.
2.5.7 Remarks.
2.5.8 Proposition. Let x
I)
o
105
Then there is a singular matrix A with IA - AI :S 81AI.
(5.6)
106
Linear Systems of Equations
2.6 Iterative Refinement
Proof We have IAxl ::c8IAI·I·'i'I. By Theorem 2.5.6 with b=b.b=O, there
exists A with Ax = 0 and IA - A I ::c 81 A I. Because ~'i' =I- 0, it follows that A is
singular.
0
To make 8 small, a sensible choice for
Rx=e',
x* =
Xl =
I
t L - t
I '",,:::}'J
correct wrong
8h
x is a solution of
(le;I= .. · =1<1=1)
107
=
r--------
(5.7)
kl
where P A = LR is a normalized triangular factorization that has been determined by using a column pivot search, and the signs of the e; are chosen during
back substitution such that the IXi I are as large as possible. If A is close to a
singular matrix, then IA.'i' I remains bounded, whereas, typically the solution of
(5.7) determined in this way becomes very large and 8 becomes tiny.
L-------,I,I
2.6 Iterative Refinement
The accuracy of the solution that has been calculated with the aid of Gaussian
elimination can be improved by using the method of iterative refinement.
The equality P A = LR is only approximate because of rounding error. It
follows from this that the calculated "solution" x differs, in general, from the
"exact" solution x* = A-I b. The error vector 8* := x* - x satisfies
A8*
= Ax*
- Ax
=b -
Ax;
therefore 8* is the solution of the system of equations
Figure2.2. Iterative refinement (shaded digits are inaccurate).
STEP 2: Solve LR8' = Pr' for 8' and put x' = x'-t + 8'.
STEP 3: If I > 1 and 118'1100 ~ ~ 118/-11100: stop with x* = x' as best approximation found. Otherwise. update I = I + I.
STEP 4: Calculate = b - Ax'-I (if possible in double precision) and continue
with step 2.
r'
A8* =b - Ax.
(6.1)
Thus we can calculate the error vector by solving a system of equations with the
same matrix, so that no new triangular factorization is necessary. The right side
i' := b - Ax is called the residual corresponding to the approximation x; the
residual corresponding to the exact solution x* is zero. In an attempt to obtain a
better approximation than x, we solve the system of equations (6.1) for 8* and
obtain x* =x + 8*.
Because instead of the absolute error 8* only an approximation 8 can be
calculated, we repeat this method with :=x +8 until 11811 does not become
appreciably smaller - by a factor of 2, say. This serves as a convenient termination criterion for the iteration.
x
Figure 2.2 illustrates how the iterative refinement algorithm improves the
accuracy of a calculated solution. The solution x = Xl, calculated by means of
Gaussian elimination, typically agrees with some leading digits of x" (cf. the first
two rows of the figure). Similarly, the calculated error 8 = 8 1 agrees with some
leading digits of the true error 8'*; because 8 is several orders of magnitudes
smaller than x*, the new error 82* is much smaller than 8 1", and so on. In the
particular example drawn, the calculated error oJ satisfies II oJ II 00 ~ ~ 118 21\ 00
and hence is not appreciably smaller than 82 ; we have reached the limiting
accuracy. In this way, we can obtain nearly L valid digits for the calculated
solution ~'i'*, although the initial approximation was rather inaccurate.
2.6.2 Remarks.
2.6.1 Algorithm: Iterative Refinement
.'i'*1100 in Algorithm 2.6.1 almost always has an order of
magnitude 118'1100, where I is the last index used; but this cannot be proved
(i) The error [x" -
STEP 1: Compute a permuted triangular factorization P A
l
xO = 0, r = b, 1=1.
= L R,
and initialize
because one has no control over the rounding error.
108
2.6 Iterative Refinement
Linear Systems of Equations
(ii) For the implementation of iterative refinement, we need both A and the
factorization LR. Because A is overwritten by LR, a copy of A must be
made before computing the factorization.
(iii) Because the size of the error 8* is determined by the residual through
(6.1), it is essential to calculate the residual as accurately as possible. In
MATLAB, double precision accuracy is used for all calculations, and nothing special must be done. However, when working in single precision, one
can exploit the fact that most computers construct, for the multiplication of
two single precision (L-digit) numbers the exact (2L-digit) product, and
then round this to L digits. For example, in FORTRAN77, the instructions
double precision rr
100
rr =b(i)
do 100 k= 1,n
rr = rr - dprod(A(i, k), x(k))
r(i) =rr
produce the double precision value of the ith component of the residual
r = b - Ax, with A and b stored in single precision only. The final rounding
of rr to a single precision number stored in rei) is harmless because it has
a small relative error.
2.6.3 Examples.
(i) The system Ax = b with
A=
372
-573
(
377
241
63
-484
-613)
511 ,
107
b=
(
210)
-281
170
is to be solved. Single precision Gaussian elimination with iterative refinement using double precision residuals gives
xO
Xl
xZ
x3
17.46114540 17.46091557 17.46099734 17.46101689
16.99352741 16.99329805 16.99337959 16.99339914
16.93472481 16.93449497 16.93457675 16.93459606.
Comparing X Z with x 3 , one would expect that x 3 is accurate to six places.
The quality index (5.5) has the value 80 = .61 10 - 8 and indicates that
109
x3
can be regarded as the exact solution of a system Ax3 = b in which
A and b are obtained from A and b by rounding to the next machine
number.
Now let us consider the scaled system (10- 3 A)x = 1O-3b. The given
data are now no longer exactly representable in the (base 2) computer,
and the resulting data errors have the following effect on the computed
iterates.
Xl
Xz
17.46083331 17.46096158 17.46093845 17.46093869
16.99321566 16.99334383 16.99332047 16.99332070
16.93441272 16.93454123 16.93451762 16.93451786
One sees that, compared with the unsealed system, the seventh places of
the last iterations are affected; we see also that the agreement of x3 and
x4 up to the eighth place indicates a higher accuracy than is actually availabk. 1f one supposes that the coefficients of A and b are accurate only to
three of the eight given figures, one must reckon with a loss of another
8 - 3 = 5 figures in the result; then, only 6 - 5 = 1 figures are correct.
This indicates a nearly singular problem; we therefore estimate the distance to singularity from Proposition 2.5.8 and see that our prognosis of
the accuracy was too optimistic: With the given choice of x, we obtain a
singular matrix within a relative error 8 = 0.002. From this, it follows that
an absolute error of 0.5 in all of the coefficients < 250 (of the original
problem) leads to a possibly singular matrix, and one must doubt whether
the absolute accuracy of 0.5 in the coefficients guarantees a nonsingular
problem.
A more precise inspection of the data, which is still possible in this
3 x 3 problem, shows that if one replaces A z! with -572.5 and A 23
with 510.5 then A becomes singular, since the sum of the columns becomes zero. Therefore the given data are actually too imprecise to furnish sensible results; the relative error of this particular singular matrix is
< 0.5/511 < 10- 4 , a factor of about 20 smaller than the upper bound from
Proposition 2.5.8.
(ii) The system Ax = b with
A=
372
-573
(
377
241
63
-484
-125)
182 ,
437
b=
(155)
-946
38
2.6 Iterative Refinement
Linear Systems ofEquations
110
is to be solved. Gaussian elimination with iterative refinement gives
xO
-0.255 228 35
2.81151304
3.42103759
proof From x*
= A -I b it follows that
IIx* - il-l - 8'1100
Xl
= IIA- ' (P' -
A8'
+b -
Ail-l - P')lloo
::: IIA-'lloo(llr' - A8'II00 + lib - Ai'-I - 1"11(0)
-0.255 228 37
2.81151304
3.421 03753,
so that we expect that x 1 is accurate to eight places. If we assume that
the coefficients (measured in millimetres) are accurate only to within an
absolute tolerance of 0.5, we expect that 8 - 5 = 3 places of x I are still
reliable. The calculated bound 80 = 0.76 from Proposition 2.5.8 indicates
no particular problem.
111
::: IIA- 11100(EoIl8/1100
::: ~11811100
+ s.) by (i) and (iii),
+ IIA-'llooE r
by (iv).
Now let I be the index such that i'-1 =i*. Because the termination criterion
must have been passed, we have 118'-11100::: 2118 11100' Therefore
+ (i l- 1 + 81- 1 - ii-I) 1100
::: IIx* - i l - I - 8'-11100 + Elli l- 11100. by (ii)
::: ~1I81-11100 + IIA- 11100E + Elli*lloo,
[x" - i* 1100 = II (x* - il-l - 8'-1)
r
Limiting Accuracy of Iterative Refinement
For singular or almost singular matrices, iterative refinement does not converge.
To prove convergence and obtain a formula for the limiting accuracy of the iterative refinement algorithm 2.6.1 in the presence of rounding error, we suppose
that the approximations (denoted by -) satisfy for I = 0, I, ... the following
inequalities (i) - (iv):
(i)
(ii)
(iii)
(iv)
111'1 - A8'II00 :::Eo118'1I00 (this is a test for the calculation of 81; by
Theorem 2.3.9, this is certainly satisfied with EO = 5nE II L 1100 II R11(0);
Ili'-1 + 8' - ilil oo ::: E Iii' 1100 (rounding error due to the addition of the
correction);
lib - Ail-I - 1'11100 ::: e, (rounding error due to the calculation of the
residual);
IIA-11100Eo::: ~ (a quantitative nonsingularity requirement).
Assuming that
and using (i), requirement (iv) is approximately equivalent to the requirement
20nE condoo(A) < 1.
2.6.4 Theorem. Suppose that (i) - (iv) holdfor I =0, 1, ... and let i* be the
approximation for x* computed by iterative refinement. Then the error satisfies
the relation
IIx* - i* 1100:::: 5E*
where E* = IIA -lllooEr
+ Elli* 1100'
whence
1
-I
Ilx* - i*lIoo::: 2:118 1100
+ E*.
(6.2)
From this, it follows that
118 11100
= lI(x* -
i*) - (x* - ii-I - 81)1100
+ [x" - i l - I - 8'1100
::: !1l811100 + E* + ~11811100 + E* by (i).
::: [x" - i*lloo
So 118 11100 ::: 8E*, and substituting this into (6.2), Ilx* - i* II ::: 5E*.
0
2.6.5 Remarks.
(i) The factor E* that occurs in the limiting accuracy estimate results from
putting together the unavoidable rounding error E Ili* 1100 and the error
II A -1 II ooE r caused by the inexact residual. The double precision calculation
of the residual that has already been mentioned is therefore important for
the reduction of e; because II A-III 00 can be comparatively large.
(ii) In the proof, the final 81 is smaller than 8E* in norm, so that (as already
mentioned) 118 11100 and [x" - i*lloo have the same order of magnitude
generally. However, no lower bound for 118 11100 can be obtained from the
previous analysis, so that it is conceivable that 118 11100 is very much smaller
than [x" - i*lloo'
(iii) Under the same hypotheses,
112
2.7 Error Bounds for Solutions ofLinear Systems
Linear Systems ofEquations
for Gaussian elimination without iterative refinement if x 1 = 80. For wellconditioned matrices (cond(A) ~ IIA-11100 for equilibrated matrices),
Gaussian elimination without iterative refinement also gives good results;
for (not too) ill-conditioned matrices, only the order of magnitude is correct. However, even then, if we obtain only one valid decimal without
iterative refinement, then iterative refinement with sufficiently accurate
calculation of the residuals provides good results. Of course, we then need
about as many iterative steps as valid decimals. (Why")
From Theorem 2.3.9, we may deduce the heuristic but useful rule of thumb
that for the computed solution x one must expect a loss of accuracy of approximately loglo cond(A) decimal places (relative to the machine precision). This
follows from the theorem if we make the additional assumption that
because then
Ilx - x*11
If we omit the factor 5n in the approximate equality in order to compensate
for the overestimation of the "worst-ease-analysis," then there remains a loss
oflog lO cond(A) decimal places.
2.7.1 Example. In order to illustrate this kind of effect. we consider an almost
singular equilibrated matrix A. We obtain such a matrix if, for example, we
by approximately 0.1 % to A:= (6:~~~ ~:~~i)·
change the singular matrix
(Changing the matrix by a smaller relative error gives similar but more pronounced results.)
The solution of the system of equations Ax* = b with the right side b := @
is x* =
For the approximate solution x = C'~~:), we calculate the residual
- _ (-0.002) dth
0* _ (-0.001) B
'-1 (25025 -249.75)
h
r - -0.002 an
e error o - -0.001' ecause A = -249.75 250.25' we ave
IIA -11100 = 500. Hence the bound (5.1) gives 118* 1100 :S 1, and the error in (5.1)
(and in (5.2)) is overestimated by a factor of 1000. The totally wrong approximation x = @ has a residual i = (-~:~~;) of the same order, but the error is
8* = (-i), and this time both bounds (5.1) and (5.2) are tight, without any overestimation.
C:)
G).
We see from this example that for nearly singular matrices we cannot conclude from a small residual that the error is small. However, for equilibrated
matrices, one can conclude from a large residual that the absolute error has at
least the same order of magnitude,
111'11
= IIA8*11 ::: IIAII·118*11 ~ 118*11,
and it may be much larger when the matrix is ill-conditioned (which, for equilibrated matrices, is the same as saying that II A-I II is large).
= IIA-1(b - Ax)11
::: IIA- '115neIIILIIRlllllili
~ Sne cond(A) IlxII.
2.7 Error Bounds for Solutions of Linear Systems
Proposition 2.5.1 gives (fairly rough) error bounds for an approximation x to the
solution of a linear system Ax = b. The bounds cannot be improved in general
(e.g., inequality (5.l) is best possible because IIA- III = sup{IIA- 11'11 I 111'11 = I}).
In particular, harmless looking approximations with a tiny residual can give rise
to large errors if cond(A) is large (or rather, if A is nearly singular).
113
Realistic Error Bounds
As seen previously, the use of II A -I II is often too crude for obtaining realistic
error bounds. In particular the bound (5.1) is usually much too bad if x is
determined by iterative refinement. To get error bounds that can be assessed as
to how realistic they are, one must invest more work.
An approximate but generally quite reliable estimate of the error (and at
the same time an improved approximation) can be obtained with one step of
iterative refinement, from the order of magnitude 118 0 II of the first correction 80
(cf. Figure 2.2).
Rigorous and realistic error bounds can be obtained from a good approximation C for A -I (called a preconditioner and computed. e.g., using Algorithm
2.2.9), as follows. Because C A ~ /, the norm 11/ - C A II is small and the following theorem, due to Aird and Lynch [3], is applicable.
2.7.2 Theorem. /fll/- CAli ::: fJ < I then A is nonsingular; and for an arbitrary approximation x of x * = A -I b,
IIC(b - Ax) II
II * _
IIC(b - Ax)11
- - - - - < x -xll<-"--------'--"-
l+fJ
-
-
l-fJ
.
(7.1)
Proof By Proposition 2.4.2, the matrix C A is nonsingular. Because a =1=
det(C A) = det C det A, the matrix A is nonsingular, too. From A8* = 1', one
2.7 Error Boundsfor Solutions of Linear Systems
Linear Systems of Equations
114
115
5 ,------.~---,--------,-------,---,
finds that
8* = 0
+ (l
I--
- CA)8*,
I
I
118*11 S 11011 + III - C A 11118* I s 11 011 + ,8118* II,
118*11 ::: 11011 - III - CAli 118*11 ::: 11 011- ,8118*11,
3
I
I
I
I
I
I
I
whence the assertion follows.
I
o
I
I
I
I
'"x
2.7.3 Remarks.
I
I
I
(i) The error bound (7.1) is realistic for small ,8 because the actual error is
overestimated by a factor of at most q:= (l + {3)/(l - ,8); e.g., q < 1.23
for,8 < 0.1.
(ii) In order to reduce the effects of rounding error one should calculate the
residual b - Ax in double precision. Indeed, inaccuracies in C or ,8 affect
the bound (7.1) much less than errors in the residual.
(iii) To evaluate the bound (7.1), we need 2n 3 + 0 (n 2 ) operations for the calculation of LR and C, 4n 3 operations for the multiplication C A (an operation real 0 interval involves two operations with reals l), and 0 (n 2 ) for
the remainder of the calculation. This means that altogether, 6n 3 + 0 (n 2 )
operations are necessary for the calculation of a realistic bound for the
error of x. Comparing this with the ~n3 + 0(n 2 ) operations needed for
the calculation of x by Gaussian elimination, we see that the calculation
of realistic and guaranteed error bounds requires about nine times the cost
for solving the system approximatively.
Systems ofInterval Equations
For many calculations with inexact data, the data are known to lie in specified
intervals. In this case, we can use interval arithmetic to compute error bounds
simultaneously with the solution instead of estimating the error from an analysis
of the error propogation.
In the following, we work with a fixed interval matrix A E lllR" X" and a fixed
interval vector b E lllR". We want to find a good enclosure for the solution of
Ax* = b, where it is only known that A E A and h E b. Clearly, x* lies in the
solution set
1; (A, b):=
{x*IAx* =b for some A E A, bE b).
The solution set is typically star-shaped. The calculation of the solution set is
I
-1
I
I
I
I
I
-3
I
I
I
IL
_
-5 L -_ _~_ _~ ~_ ___'_ _ ___:'--_____=
-1
3
5
-3
-5
x1
Figure 2.3. Star-shaped solution set.
quite expensive; for systems of dimension n, it can be shown that the star has
up to 2" spikes. We content ourselves with the computation of the interval hull
D1;(A, b) of the solution set.
2.7.4 Example. The linear system of interval equations with
[2,2]
A:= ( [-1,2]
[-2,.1])
[2.4]
,
b'= ([-2,2])
.
[-2,2]
has the solution set drawn in Figure 2.3. The hull ofthe solution set is 01; (A, b)=
([ -4, 4], [-4, 4]) T, drawn with dashed lines. (To construct the exact solution
set is a nontrivial task, even for 2 x 2 systems!)
Methods for determining the hull of the solution set are available (see, e.g.,
Neumaier [70, Chapter 6]) but must still in the worst case have an exponential
effort. Indeed, the problem belongs to the class of NP-hard problems (Garey
and Johnson [27]) for which it is a long-standing conjecture that no polynomial algorithms exists. However, there are several methods that give at least a
reasonable inclusion of the solution set with a cost of 0(n 3 ) .
116
Linear Systems of Equations
2.7 Error Bounds for Solutions of Linear Systems
Krawczyk's Method
(i) In the first example, when all coefficients are treated as point intervals,
To obtain a realistic enclosure, we imitate the error estimation of Aird and Lynch
(Theorem 2.7.2). Let the preconditioner C E JRnxn be an approximation of the
inverse ofmid(A). The solution.c" ofAx* = h satisfies x* = ci + (I - C A).\'*.
If we know an interval vector Xl with x* E Xl, then also x* E Cb + (I - CA)x'
and we can improve the inclusion of x*:
f3 = .000050545 < I. ex = 51.392014, and
Xl
2.7.5 Algorithm: Krawczyk's Method for
Equations
If we denote by xl the box Xafter I iterations, then by construction, x* E x' for
alII ~ 0, and XO ;:> x' ;:> x 2 ....
2.7.6 Example. In order to make rigorous the error estimates computed in
Examples 2.6.3(i)-(ii), we apply Krawczyk's method to the systems of linear equations discussed there. (To enable a comparison with the results in
Examples 2.6.3, the calculations were performed in single precision.)
2
17.4~~m
17.46b~~6
16.993g~~
= .006260
0"2
=x 4
17.46b~~6
16.993g~~
16. 934
= .000891
X
16.99~m
3
m
16.934~~~
= .000891
O".J
One can see that x 3 (and already x 2 ) have five correct figures. If one again
supposes that there is an uncertainly of 0.5 in each coefficient, then all
of the coefficients are genuine intervals; All = [371.5,372.5], and so on.
Now, in Krawczyk's method, f3 = 3.8025804 > I suggesting numerical
singularity of the matrix. No enclosure can be calculated.
(ii) In the second example, with point intervals we get the values f3 =
.000000309, ex = 6.4877797. and
Xl
Solving Linear Interval
ier=O;
a=C*b;
E=!-C*A;
% To save space, b and A could be overwritten with a and E.
f3 = max, Lk IE l k I; % must be rounded upwards
if f3 ~ I, ier = I; return; end;
ex = lIali oo / (I - (3);
x=([-ex,exJ, ... , [-ex,ex])T;
O"old =inf; 0" = Lk rad (Xk); fac= (I + (3)/2;
while 0" «fac * O"old,
x=(a+Ex)nx;
O"old =0"; 0" = Lk radtx.);
end;
X
16.93~6~6
0",
How can we find an initial interval vector XO with x* E xo? By Theorem 2.7.2
(for xO = 0), IIX* 1100 ::; II Chll 00/(1-(3) if II! -c A1100 ::; f3 <1. Because IICh 1100 ::;
II Cb]00' we define, as the initial interval, xO := ([ -ex, ex], ... , [-ex, ex]) T with
ex:= IICblloo/(I - (3).
As seen from the following examples, it is worthwhile to terminate the iteration when the sum 0"1 of the radii of the components of xl is no longer rapidly decreasing. We thus obtain the following method, originally due to Krawczyk [53].
117
X
2=X 3
- 0.255228~
- 0.255228~
2.8115qj
2.81151~~
3.42103~~
0"1 = .00000125
0"2
3.421037j
= .00000031
with six correct figures for x 2 . With interval coefficients corresponding
to an uncertainty of 0.5 in each coefficient, we find f3 = .01166754,
ex = 6.5725773, and (outwardly rounded at the fourth figure after the
decimal point)
Xl
- 0.2~g~
8741
2 '7489
x2
- 0.2~~g
2.~~~j
3·~~b~
0"1=.1681
3·j~n
0"2
= .0622
x3
- 0.2~i
8344
2 '7887
4505
3 '3916
0"3 = .0613
x4
644
- 02
. 461
2.~~~
3.j~?~
0"4=.0613
with just under two correct figures for x 4 (and already for x 2 ) . The safe
inclusions of Krawczyk's method are therefore about one decimal place
more pessimistic here than the precision that was estimated for Gaussian
elimination with iterative refinement.
. We mention without proof the following statement about the quality of the
limit x oo , which obviously exists. A proof can be found in Neumaier [70].
118
Linear Systems ofEquations
2.7.7 Theorem. The limit
proximation property:
X
OO
2.8 Exercises
of Krawczyk's iteration has the quadratic ap-
radA, rad b = 0(£) =? 0 S rad x'? - radDI;(A, b)
119
5,---------.---,..------,-----,-------,
= 0(£2).
3
Thus if the radii of input data A and b are of order of magnitude 0 (s), then the
difference between the radius of the solution of Krawczyk's iteration X OO and
the radius of the hull of the solution set is of order of magnitude 0(£2). The
method therefore provides realistic bounds if A and b have small radii (compare
with the discussion of the mean value form in Remark 1.5.9).
Even sharper results can be obtained with 0(n 3 ) operations by a more involved method discovered by Hansen and Bliek; see Neumaier [72].
N
x
-1
-3
Interval Gaussian Elimination
For special classes of matrices, especially for M-matrices, for diagonally dominant matrices, and for tridiagonal matrices, realistic error bounds can also be
calculated without using Krawczyk's method (and so without knowing an approximate inverse), by performing Gaussian elimination in interval arithmetic.
The recursions for the triangular factorization and for the solution of the triangular systems of equations are simply calculated using interval arithmetic.
Because of the inclusion property of interval arithmetic (Theorem 1.5.6), this
gives an inclusion of the solution set. For matrices of the special form mentioned, the quality of the inclusion is very good - in the case of M-matrices A,
one even obtains for many right sides b the precise hull DI;(A, b):
2.7.8 Theorem. If A is an M-matrix and b :::: 0 or b S 0 or 0 E b then interval
Gaussian elimination gives the interval hull I] I; (A, b) of the solution set.
The proof, which can be found in Neumaier [70], is based on the fact that
in these cases the smallest and the largest elements of the hull belong to the
solution set.
2.7.9 Example. The interval system of linear equations with
[2,4]
A:= ( [-1,0]
[-2,0])
[2,4]'
b.= ([-2,2])
.
[-2,2]
satisfies the hypotheses of Theorem 2.20. The solution set (filled) and its interval hull (dashed) are drawn in Figure 2.4.
-~5:-------:----~---~-----::-------;5
-3
-1
3
x1
Figure 2.4. Solution set (filled) and interval hull of solution.
Note that, although interval Gaussian elimination can, in principle, be tried
for arbitrary interval systems of linear equations, it can be recommended only
for the classes of matrices mentioned previously (and certain other matrices,
e.g., 2 x 2 matrices). In many other cases, the radii of the intervals can become
so large in the course of the calculation (they may grow exponentially fast
with increasing dimension) that at some stage, all candidates for pivot elements
contain zero.
2.8 Exercises
1. The two systems of linear equations
Ax
= b(/),
I = 1, 2
(8.1)
with
k.-
(~
0
I
are given.
2
4
2
2
-1
o
~)
5
4'
3
4
l~)
v» .=
. ( 35'
30
b(2)
.=
.
(=~)
0
-3
Linear Systems of Equations
120
2.8 Exercises
(a) Calculate, by hand, the triangular factorization A = LR; then calculate
the exact solutions x(l) (I = I, 2) of (8.1) by solving the corresponding
triangular systems.
(b) Calculate the errors x(l) - x(l) ofthe approximations x(l) = A \b(l) computed with the standard solution x =A\ b of Ax = b in MATLAB.
2. (a) Use MATLAB to solve the systems of equations Ax = b with the coefficient matrices from Exercises 1,9,14,26,31, and 32 and right side
b:= Ae. How accurate is the solution in each case?
(b) If you work in a FORTRAN or C environment, use a program library
such as LAPACK to do the same.
3. Given a triangular factorization A = LR of the matrix A E en xn. Let B =
A + auv T with a E e and u, v E en. Show that the system of linear equations Bx = b (b E en) can be solved with only 0 (n 2 ) additional operations.
Hint: Rewrite in terms of solutions of two linear systems with matrix A.
4. Let A be an N 2 x N 2-matrix and b an N 2 -vector of the form
D
-I
-1
D
5.
6.
7.
o
-1
k-[
-[
°
8.
D
where the N x N -matrix D is given by
4
-I
-1
0
4-1
9.
How many different techniques for reordering of sparse matrices are
available in MATLAB? Use the MATLAB command spy to visualize the
nonzero pattern of the matrix A before and after reordering of the rows and
columns.
Compare computational effort and run time for solving the system in
various reorderings to the results obtained from a calculation with matrix
A stored as a full matrix.
Hints: Have a look at the Sparse item in the demo menu Matrices. You
may also try help sparfun, help sparse, or help full.
Show that a square matrix A has a Cholesky factorization (possibly with
singular L) if and only if it is positive semidefinite.
For the data (Xi, Yi) = (-2,0.5), (-1,0.5), (0, 2), (I, 3.5), (2, 3.5), determine a straight line f (x) = a + f3x such that Li (Yi - f (Xi»2 is
minimal.
(a) Let P be a permutation matrix. Show that if A is Hermitian positive
definite or is an H-matrix, then the same holds for P A P'':
(b) A monomial matrix is a matrix A such that exactly one element in each
row and column is not equal to zero. Show that A is square, and one
can express each monomial matrix A as the product of a nonsingular
diagonal matrix and a permutation matrix in both forms A = P D or
A=D'P.
(a) Show that one can realize steps 2 and 3 in Algorithm 2.2.9 for matrix
inversion using ~n3 + 0(n 2 ) operations only. (Assume P = I, the unit
matrix.)
(b) Use the algorithm to calculate (by hand) the inverse of the matrix of
Exercise 1, starting with the factorization computed there.
(a) Show that the matrix
D'-1
°
-1
4
[ is the N x N unit matrix, and the N -vectors b(l) and b(2) are given
by b(l):= (2, I, I, ... , 2)T and b(2):= (I, 0, 0, ... ,0, l)T, with N E (4. 8,
12, 16, 20, ... }. Matrices such as A result when discretizing boundary value
problems for elliptical partial differential equations.
Solve the system of equations Ax = b using the sparse matrix features of
MATLAB. Make a list of run times for different N. Suitable values depend
on your hardware.
121
A-
G!:)
is invertible but has no triangular factorization.
(b) Give a permutation matrix P such that P A has a nonsingular triangular
factorization L R.
10. Show that column pivoting guarantees that the entries of L have absolute
value y 1, but those of R need not be bounded.
Hint: Consider matrices with A ii = A in = 1, A i k = - f3 if i > k, and
A i k = 0 otherwise.
2.8 Exercises
Linear Systems of Equations
22
(a) Find an explicit formula for the triangular factorization of the block
matrix
I
0
0
I
-B
I
0
0
(b) Write a MATLAB program that realizes the algorithm (make use of
Exercise 8), and apply it to compute the inverse of the matrix
-B
A=
Compare with MATLAB's built in routine inv.
15. For i = 1, ... , n let .i:U) be the computed solution (in finite precision arithmetic) obtained by Gaussian elimination from the system
0
0
-8
I
with unit diagonal blocks and subdiagonal blocks -
BE
dxd
lR
.
(b) Show that column pivoting produces no row interchanges if all entries
of B have absolute value < I.
I
(c) Show that choosing for B a matrix with constant entries f3 (d- < f3 < I),
some entries of the upper triangular factor R grow exponentially with
the dimension of A.
(d) Check the validity of (a) and (b) with a MATLAB program, using
MATLAB's lu.
Note that matrices of a similar form arise naturally in multiple shooting
methods for 2-point boundary value problems (see Wright [99]).
12. Write a MATLAB program for Gaussian elimination with column pivoting
for tridiagonal matrices. Show that storage and work count can be kept of
the order O(n).
13. How does pivoting affect the band structure?
(a) For the permuted triangular factorization of a (2m + I)-band matrix,
show that L has m + 1 nonzero bands and R has 2m + I nonzero bands.
(b) Asymptotically, how many operations are required to compute the permuted triangular factorization of a (2m + I)-band matrix?
(c) How many more operations (in %) are required for the solution of an
additional system of linear equations with the same coefficient matrix
and a different right side, with or without pivoting? (Two cases; keep
only the leading term in the operation counts I) Huw does this compare
with the relative cost of solving a dense system with an additional right
side?
(d) Write an algorithm that solves a linear system with a banded matrix in
a numerically stable way. Assume that the square matrix A is stored
in a rectangular array AB such that Aik = AB(i, m + I + i - k) if
Ik - i I :s m, and A ik = 0 otherwise.
14. (a) In step 4 of Algorithm 2.2.9, the interchanges making up the permutation matrix must be applied to the columns, and in reverse order.
Why?
123
in which eUJ is the i th unit vector.
(a) Show that the computed approximation C to the inverse A -I calculated
in finite precision arithmetic as in Exercise 14coincides with the matrix
(i(l), ... , i(n»T.
(b) Derive from Theorem 2.3.9 an upper bound for the residual II - tAl.
16. (a) Show that all norms on (:" are equivalent; that is, if II . II and II . II' are two
arbitrary norms on C n , then there are always constants 0 < c, d E lR such
that for all x E C"
cllxll:s Ilxll' :sdllxll·
Hint: It suffices to prove this for II . II = II . II 00' Now use that continuous
functions on the cube surface Ilx II 00 = 1 attain their extrema
(b) ':hat are the optimal such constants (largest possible c, sm~llest posSIble d) for pairs of p-norms (p = 1, 2, oo)?
17. Show that the matrix norm belonging to an arbitrary vector norm has the
following properties for all a E C:
(a)
(b)
(c)
(d)
(e)
(f)
IIAxll:s IIAII ·llxll
IjAIl ::: 0; IIAII =0 -¢:=::::} A =0
IlaAII=lal'IIAil
IIA + BII :s IIAII + IIBII
IIABII:s IIAII·IIBII
IAI:s B ::::} IIAII :s II IAI II :s IIBII if the norm is monotone
(g) The implication
IAI:S B =? IIAII
=
IIIA/II:s IIBI/
holds for the matrix norms II· 111 and II . 1100, but not in general for II . 112.
18. Prove the following bounds for the inverse of a matrix A:
(a) If II Ax 1\ ::': y IIx II for all x E C" with y > 0 and II . II an arbitrary vector
norm then A-I exists and II A -1 II :5 y -I for the matrix norm belonging
to the vector norm II . II·
1112:5
(b) If x H Ax ::': Yllxll~ for all x E C" with y > 0, then IIAy-I.
(c) If the denominator on the right side of the following three expressions
is positive, then
IIA- I \\oo :51/min (IAiil-
L
IAikl).
bpi
!
IIA- IIII :51/min (IAkklk
IIA -1112:5 1/ min (Aii !
Li# IAikl),
World Wide Web (WWW). The distributors of MATLAB maintain a page
http://www.mathworks.com/support/ftp/
with links to user-contributed software in MATLAB.
(a) Find out what is available about linear algebra, and find a routine for the
L D L T factorization. Generate a random normalized lower triangular
matrix L and a diagonal matrix D with diagonal entries of both signs,
and check whether the routine reconstructs Land D from the product
LDL T •
(b) If you are interested what one can do about linear systems with very
ill-conditioned matrices, get the regularization tools, described in
Hansen [40], from
http://www.imm.dtu.dk/ pch/Regutools/regutools.html
~2 Lki-i IAik + A ki I).
It takes a while to explore this interesting package!
22. Let x* be the solution of the system of linear equations
1
3X I
H
Hints: For arbitrary x, y E C", Ix Y I:5 [x 11211 y 112· For arbitrary u,
2
2
luvi :5 ~ lul + ~ Iv1 .
19. A matrix A is called an M-matrix if
Aii
::':
0, A i k
:5
0
125
2.8 Exercises
Linear Systems of Equations
124
VEe.
I
4X 1 +
0.333xl
III - D 1AD211
I
10
7 X2 = 21'
29
280 X 2
=
99
(8.2)
280'
(a) Calculate the condition number of the coefficient matrix for the rowsum norm.
(b) An approximation to the system (8.2) may be expressed in the form
for i =1= k;
and one of the following equivalent conditions holds:
(i) There are diagonal matrices D 1 and D2 such that
+
< 1
(i.e., A is an H-matrix).
(ii) There is a vector u > 0 with Au > O.
(iii) Ax ::': 0 =} x ::': O.
(iv) A is nonsingular and A-I::,: O.
Show that for matrices with the above sign distribution, these conditions
are indeed equivalent.
20. (a) Let A E (["Xli be nonpositive outside the diagonal, Aik :5 0 for i -=I k,
Then A is an M-matrix iff there are vectors u , v > 0 such that Au ::': v.
(b) If A' and A" are M-matrices and A' :5 A :5 A" (componentwise) then
A is an M-matrix.
Hint: Use Exercise 19. Show first that C = (A,,)-l B is an M-matrix; then
use B- 1 = (A"C)-I.
21. A lot of high-quality public domain software is freely available on the
+ O.143x2 = 0.477,
O.250xI + 0.104x2 = 0.353
(8.3)
by representing the coefficients Aik with three places after the decimal
point. Let the solution of (8.3) be x. Calculate the relative error Ilx* x II / IIx* II of the exact solutions x* and x of (8.2) and (8.3), respectively
(calculating by hand with rational numbers).
23. Use MATLAB's cond to compute the condition numbers of the matrices
A with entries Aik = fk(Xi) (k = 1, ... , nand i = 1, ... , m; Xi = (i - 1)/
(m - 1) with m = 10 and n = 5, 10.20) in the following polynomial bases:
(a) fk(X)=x k- l,
(b) fk(X) = (x - 1/2)k-l,
(c) fk (x) = Tk(2x - 1), where Tk(x) is the kth Chebyshev polynomial (see
the proof of Proposition 3.1.14),
(d) fk(X) = nj':~+~k.j mn(x - Xj/2).
126
Plot in each case some of the basis functions. (Only the bases with small
condition numbers are suitable for use in least squares data fitting.)
24. A matrix Q is called unitary if QH Q = I. Show that
cond2(Q) = I,
cond 2(A) = cond 2(QA) = cond 2(AQ) =cond2(QH AQ)
127
2.8 Exercises
Linear Systems of Equations
28. The two-point boundary value problem
1/
k(k - 1)
-y(x)+(I_x)2 Y(x)=0,
y(O)=I,
y(1)=O
can be approximated by the linear tridiagonal system of linear equations
Ty=bwith
2+ql
for all unitary matrices Q E e nxnand any invertible matrix A E en XII.
25. (a) Show that for the matrix norm corresponding to any monotone norm,
- 1
0
-I
2 +q2
,
T=
b = eO)
IIAII ::: max IAiil.
-1
0
and, if A is triangular,
\A\
cond(A) ::: max --''- .
Akk
(b) Show that, for the spectral norm, cond2(AT) =cond 2(A).
(c) Show that the condition number of a positive definite matrix A with
Cholesky factorization A = L L T satisfies
26. (a) Calculate (by hand) an estimate c of cond(A) from Algorithm 2.5.5 for
A:=
(-~ -~ ~)
-3
7-1
Is this estimated value exact. that is, is c = II A II 00 1\ A -III 00 ?
(b) Give a 2 x 2 matrix A for which the estimated value c is not exact, that
is, for which s < IIA-liloo'
27. (a) Does the MATLAB operation x = A\b for solving Ax = b include iterative refinement? Find out by implementing an iterative refinement
method based on this operation, and test it with Hilbert matrices A with
entries Aik = 1/(i + k - 1) and all-one right side b = e.
(b) For Hilbert matrices of dimension n = 5,10,15,20, after how many
steps of iterative refinement is the termination criterion satisfied?
(c) Confirm that Hilbert matrices of dimension n are increasingly ill conditioned for increasing n by calculating their condition number for the
dimensions n = 5, 10, 15,20. How does this explain the results of (b)?
-1
2+qll
with qi =k(k - I)/(n + I - i)2. Then y(i/(n + 1» ~ Yi.
Solve this system for k = 0.5 with n = 25 -I for several s, using Exercise
] 2 and (at least) one step of iterative refinement to obtain an approximation ys) for the solution vector. Print the approximate values at the points
x = 0.25,0.5, and 0.75 with and without iterative refinement, and compare
with the solution y (x) = (I - x)k of the boundary value problem. Then do
the same for k = 4.
29. (a) Write a MATLAB program for the iterative refinement of approximate
solutions of least squares problems. Update the approximate solution
x' in the Ith step of the method by xl+ 1 = xl + sf, where sf be the
solution of R T Rs l = AT (b - Ax') and L is a Cholesky factor of the
normal equations (found with MATLAB's chol).
(b) Give a qualitative argument for the limit accuracy that can be obtained
by this way of iterative refinement.
30. The.following technique may be used to assess heuristically the accuracy
of linear or nonlinear systems solvers or eigenvalues calculation routines
~itho~t doing any formal error analysis. (It does not work for problems
involving approximations other than rounding errors, as in numerical differentiation, integration, or solving differential equations.)
S~I~e the systems (k A)x(k) = kb for k = I, 3, 5 numerically, and derive
heuristic error estimates [x" - xI ~ ilx for the solution x* ofAx* = b
where
'
128
Linear Systems of Equations
2.8 Exercises
Compare with the true error for a system with known x*, constructed by
choosing random integral A and x* and b = Ax*. Explain! How reliable is
the heuristics?
31. Given the system of linear equations Ax = b with
0.051
A:=
(
(a) Give a matrix
-0.15~
-0.59~)'
-0.153
-0.737
-0.598
b:= (33
.
-0.299
4280)
1420,
(
-2850
IA - AI :s: colAI, and Ib - bl :s: colbl, with smallest possible relative
error co.
(b) Use x to determine a singular matrix A close to A, and compute the
"estimated singularity distance" (5.6).
32. (a) Using the built in solver verifylss of INTLAB to solve linear interval equations with the four symmetric interval coefficient matrices A
defined by either
36.1
-63.4
33.1
A".- -75.2
75.8
-27.4
14.7
-88.5
21.6
-22.4
83.3
symm.
56.9
-14.0
15.2
-58.3
36.3
-39.0
15.4
44.1
-17.0
69.4
or
A~
c
d
c
c
[:
c
c
d
(a)
1.25
(b)
(c)
125
-3.34
334
18.9
-1.26
1)
A and a vector b such that Ax = b where
x:=
129
~l
with dimension n = 16 and c, d chosen from the following:
Interpret the matrices either as thin interval matrices or as interval
matrices obtained by treating each number as only accurate to the
number of digits shown. Use as right side b = [15.65, l5.75]e when
the matrix is thin, and b = e = (1, ... , 1)T otherwise.
Is the system solvable in each of the eight cases?
(b) Implement Krawczyk's algorithm in INTLAB and compare the accuracy resulting in each iteration with that of the built-in solver.
3.1 1nterpolation by Polynomials
3
Interpolation and Numerical Differentiation
131
3.1 Interpolation by Polynomials
The simplest class of interpolating functions are the polynomials.
3.1.1 Theorem. (Lagrange Interpolation Formulas). For any pairwise distinct points xo . . . . • Xn, there is a unique polynomial Pn of degree :::n that
interpolates f (x) on Xo, ... , X n . it can be represented explicitly as
L
PII(X) =
f(Xi)Li(x),
(1.1)
i=O:n
where
In this chapter, we discuss the problem of finding a "nice" function of a single
variable that has given values at specified points. This is the so-called interpolation problem. It is important both as a theoretical tool for the derivation and
analysis of other numerical algorithms (e.g., finding zeros of functions, numerical integration, solving differential equations) and as a means to approximate
functions known only at a finite set of points.
Because a continuous function is not uniquely defined by a finite number
of function values, one must specify in advance a class of interpolation functions with good approximation properties. The simplest class, polynomials,
has considerable theoretical importance; they can approximate any continuous functions in a bounded interval with arbitrarily small error. However, they
often perform poorly when used to match many function values at specific
(e.g., equally spaced) points over a wide interval because they tend to produce
large spurious oscillations near the ends of the interval in which the data are
given. Hence polynomials are used only over narrow intervals, and large intervals are split into smaller ones on which polynomials perform well. This results
in interpolation by piecewise polynomials, and indeed, the most widely used
interpolation schemes today are based on so-called splines - that is, piecewise
polynomials with strong smoothness properties.
In Section 3.1, we discuss the basic properties of polynomial interpolation,
including a discussion of its limitations. Section 3.2 then treats the important
special case of extrapolation to the limit, and applies it to numerical differentiation, the problem of finding values of the derivative of a function for which one
has only a routine computing function values. Section 3.3 treats piecewise polynomial interpolation in the simplest important case, that of cubic splines, and
discusses their excellent approximation properties. Finally, Section 3.5 relates
splines to another class of important interpolation functions, so-called radial
basis functions.
130
(1.2)
Pn is called the interpolation polynomial to f at Xo, ... , XII' and the Li(x) are
referred to as Lagrange polynomials.
Proof By definition, the L, are polynomials of degree n with
L (x )
I
}
=
{oI ' fifi. =I- i..
1 1=
J.
From this it follows that (1.1) is a polynomial of degree s» satisfying PII (x j) =
f(Xj) for j =0, ... , n.
For the proof of uniqueness, we suppose that p is an arbitrary polynomial of
~egree sn with p(Xj) = f(xj) for j =0, ... , n. Then the polynomial
p« _ p
IS of degree <n and has (at least) the n + I pairwise distinct zeros Xo, ... , x".
Therefore, P« (x) - p(x) is divisible by the product (x - xo) ..... (x - x,,).
However, the degree of the divisor is »n; hence, this is possible only if Pn(x)p(x) is identically zero (i.e., if p = PII)'
0
. Although this result solves the problem completely, there are many situations
III which a different representation of the interpolation polynomial is more
Useful. Note that the interpolation polynomial is uniquely determined by the
data, no matter in which form the polynomial is expressed. Hence one can pick
the form of the interpolation polynomial according to other considerations, such
as ease of use, computational cost, or numerical stability.
132
3.1 Interpolation by Polynomials
Interpolation and Numerical Differentiation
133
gives very inaccurate values at the interpolating points:
Linear and Quadratic Interpolation
To motivate the alternative approach we first look at linear and quadratic interpolation, the case of polynomials of degrees 1 and 2.
The linear interpolation problem is the case n = I and consists in finding for
given f(xo) and f(xI) a linear polynomial PI with PI (xo) = f(xo) and PI (XI) =
f(xI). The solution is well known from school: the straight line through the
x
6000
6001
PI(x)
0.33333
0.30000
-0.66667
-0.70000
PI(X)
points (xo, Yo), (XI, YI) is given by the equation
Y - Yo
X - Xo
YI - Yo
XI - Xo
and, in the limiting case XI ---+ Xo, the line through Xo with given slope Y~ by
Y - Yo
I
--=YoX - Xo
The linear interpolation polynomial has therefore the representation
PI (x) = f(xo)
+ f[xo, xJJ(x
- xo),
(1.3)
From linear algebra, we are used to considering any two bases of a vector space
as equivalent. However, for numerical purposes, the example shows significant
differences. In particular, we conclude that expressing polynomials in the power
basis 1, x, x 2 , ... must be avoided to ensure numerical stability.
To find t~e qU~dratic interpolation polynomial P2 (a parabola, or, in a limiting
c.ase, a.straight l.me) to f at pairwise distinct points Xo, XI, X2, we modify the
linear interpolation formula by a correction term that vanishes for X = Xo and
X=XI:
P2(X) := f(xo)
where
if
XI
"lxo,
(1.4)
ifxI=xo.
We call f[xo, xJJ a first-order divided difference of [, This notation for the
slope, and the associated form (1.3) proves to be very useful. In particular, (1.3)
is often a much more appropriate form to express linear polynomials numerically than using the representation p(x) = mx + b in the power basis.
xo)
+ f[xo, XI, X2](X
- xo)(x - XI).
Because X2 "I Xo, XI, the coefficient at our disposal, suggestively written as
f[xo, XI, X2], can be determined from the interpolation requirement P2(X2) =
f(X2)' We find the second-order divided difference
'- f[xo, X2] - f[xo, xJJ
f[ Xo, XI, X2 ] .X2 - XI
if X2
"I XI, Xo.
(1.6)
3.1.3 Exam~le. Let f(O) = 1, f(1) = 3, f(3) = 2. The parabola interpolating
at 0, 1 and 3 IS calculated from the divided differences
3.1.2 Example. (With optimal rounding, B = 10, L = 5). The straight line
through the points (6000, ~) and (6001, -~) may be represented, according
to (1.3), by
+ f[xo, xJJ(x -
f(xo)=l,
3- 1
f[xo, xJJ = - - = 2
1- 0
'
f[xo, X2]
2- 1
1
3-0
3
= -- = -
and
~
PI (x) ~ 0.33333
+
0.33333 + 0.66667 (x _ 6000)
6000 _ 6001
~ 0.33333 - 1.0000(x - 6000).
f[xo, XI, X2]
(1.5)
In this form, the evaluation of PI (x) is stable and reproduces the function values
at Xo and XI. However, if we convert (1.5) to standard form PI (x) = mx + b,
the resulting formula
PI(X) ~
-1.0000x
+ 6000.3
! - 2
5
3-1
6
= _3_ _ = - -
to
5
5
17
P2(X) = 1 + 2x - -x(x - 1) = - _x 2 + - x
6
6
6
+
1
.
For st a bili
ihty reasons, the first of the two expressions for P (x) is in fi it
pre"
2
,I
me
ClSlon calculations, preferable to the second expression in the power basis.
134
3.1 Interpolation by Polynomials
Interpolation and Numerical Differentiation
Note that the divided difference notation (1.6), used with the variable x in
place of X2, allows us to write the linear interpolation formula (1.3) as an identity
135
is n-i times continuously differentiable in x and satisfies
gi(X) =gi(Xi)
+ (x
- Xi)gi+l(X).
(1.10)
We proceed by induction, starting with ga(x) = f[x] = f (x) for i = 0, where
this follows from our discussion of the linear case. Hence suppose that (1.10)
is true with i-I in place of i. Then gi is n - i > 0 times continuously differentiable and
by adding the error term
1
1
gi(X) =g;(Xi)
+ (x
- Xi)
Newton Interpolation Formulas
To obtain the cubic interpolating polynomial, we may correct the quadratic
interpolation formula with a correction term that vanishes for x = xa, XI. X2:
by repetition of this process, we can find interpolating polynomials of any
degree in a similar way. This yields the interpolation formulas of Newton.
A generalization due to Hermite extends these formulas to the case where
repetitions of some of the x) are allowed; thus we do not assume that the x J are
pairwise distinct.
3.1.4 Theorem. (Newton Interpolation Formulas). Let D ~ IR. be a closed
interval, let f : D ---+ IR. be an (n + 1)-times continuously differentiable function,
and let Xa. XI, ... , Xn
E D.
Then:
(i) There are uniquely determined continuousfunctions f[xa, ... , Xi, ·]:D ---+
IR. (i = 0, 1. .... n) such that
f[xa, ... , Xi-], x]
= f[xa,
... , Xi-I, Xi]+ f[Xa, ... , Xi, X](X-Xi)
(1.7)
for all XED.
(ii) For i = 0, 1, ... .n the polynomial
Pi(X) := f[xa]
+ f[xa, xd(x -
xa)
(x - Xi-I)
interpolates the function f(x) at the points xa, ... , Xi
(iii) We have
f(x)
= Pi(X) + f[xa,
E
(1.8)
D.
... , Xi, x](x - Xa)'" (x - Xi).
gi(X):= f[xa, .. ·, Xi-I, X]
xi))dt.
(1.11)
Thus. the function gi+1 defined by
1
1
gi+I(X)=f[xa. ,,,,Xi,X]:=
g;(Xi +t(x -xi))dt
is still n -i -1 times continuously differentiable. and (1.10) holds for i. Equation
(1.9) is also proved by induction; the case i = 0 was already treated. If we
suppose that (1.9) holds with i - I in place of i, we can use (1.7) and the
definition (3.1.8) to obtain
f(x)
Pi-I (x) + (f[xa, ... , Xi-I, x;] + f[xa .... , Xi, X](X - x;))
x (x - Xa) ... (x - xi-d
= Pi(X) + f[xa, ... , Xi, x](x - Xa)'" (x - Xi).
=
Thus (1.9) holds generally.
Finally, substitution of x = X) into (1.9) gives the interpolation property
!(X) = p;(x)) for j =0, ... , i.
o
Vn := f[xa,
, x,,]
Vi := f[xa,
, Xi] + (x - Xi )Vi+l
(i = n - 1, n - 2, ... , 0)
gives rise to the expression
(1.9)
Vi
=
f[xa,
+ f[xa,
Proof To prove (1.7), it suffices to show that
+ t(x -
Equation (1.8) is called the Newton form of the interpolation polynomial
at Xa, ... , Xi. To evaluate an interpolation polynomial in Newton form, it is
advisable to use a Homer-like scheme. The recurrence
+ .
+ f[xa, ... ,Xi-I, X;](X- Xa)
g;(Xi
from which Va
follows.
, x;]
+ f[xa,
... , Xi+I](X - Xi)
+ ...
x"l(x - Xi)'" (X - X,,-1)
= Pn(x), A corresponding pseudo-MATLAB program is as
136
Interpolation and Numerical Differentiation
3.1.5 Algorithm: Evaluation of the Newton Form
% Assume di = f[xo . . . . , Xi]
v=dn ;
for i = n - 1 : -1 : 0,
v = d; + (x - Xi) * v;
end;
% Now v is the interpolated approximation to
3.1 Interpolation by Polynomials
3.1.7 Example. For the function
f
defined by
1
f(x) := - - ,
S-X
SEre,
one can give closed formulas for all divided differences, namely
f
(x)
(Because MATLAB has no indices < I, a true MATLAB code should interpolate
instead XI, ... , Xn by a polynomial of degree Sn - 1 and replace the zeros in
this pseudo code with l s.)
For the determination of the coefficients required in the Newton form, it
is useful to have a general interpretation of the f[xo, ... , x;] as higher order
divided differences, a symmetry property for divided differences, and a formula
for a limiting case. These generalize the fact that the slope of a line through two
points of a curve is independent of the ordering of these points and becomes
the tangent slope in the limit when these points merge.
f[ Xo, ... , x;]
1
= -;-----;-:-----;-:---;-:-__
(s -
xo)(s -
XI)'"
(1.12)
(s - Xi)'
Indeed. this holds for i = O. If (1.12) is valid for some i :::: 0 then, for Xi + I
#- Xi,
l[xo,
... , xil - /[xo, ... , Xi-I,
Xi+l]
ff xo, ...• Xi, Xi_q ] = ------=-----...:...:...::..::
Xi -Xi+1
1
1
($-xo)"'(S-Xi)
($ xo)···(s
Xi
=
3.1.6 Proposition.
Xi-l)($
Xi+l)
-Xi+1
(s - Xi+]) - (s - Xi)
(s - Xo)" '(s 1
0) We have the divided difference representation
f[xo, ... , Xi, x] =
137
Xi_I)(S -
Xi)(S - Xi+I)(Xi -
Xi+l)
(s - xo) ... (s - Xi+]) ,
f[xo.· ..• Xi-I, x] - f[xo • . . . , Xi-I, X;]
.::.....::-=------'----.::....-~----=------­
X -Xi
and by continuity, (1.12) holds in general.
forx#-xi'
(ii) The derivative with respect to the last argument is
d
- f[xo, ... , Xi-I, x] = f[xo, ... , Xi-I, X, xl
dx
(iii) The divided differences are symmetric functions of their arguments, that
is, for an arbitrary permutation JT of the indices 0, 1, ... , i, we have
f[xo,
Xl, •.• ,
Using the proposition, the coefficients of the Newton form can be calculated
recursively from the function values f[xd = f(Xk) if Xo, ... , Xn are pairwise
distinct: Once the values l[xo, ... , x;] for i < k are already known, then one
obtains l[xo, ... , Xk-I, xd from
/[X
. ] _ f[xo, ... , Xi-I, xd - l[xo, ... , x;]
Xk Xk - Xi
'
0, ... , X, ,
i = 0, ... , k - 1.
x;] = j[XJTO, XJTl, ... ,XJTi].
Proof (i) follows directly from (1.7) and (ii) as limiting case from (i)..To
prove (iii), we note that the polynomial Pi (x) that interpolates f at the pOInts
Xo, ... , Xi has the highest coefficient l[xo, ... ,x;]. A permutation JT of the
indices gives the same interpolation polynomial, but the representation (1.9)
yields as highest coefficient f[xlfo, ... ,x JT ;] in place of f[xo, ... , x;]. Hence
o
these must be the same.
To c~nstruct an interpolation polynomial of degree n in this way, one must use
o (n-) operations; each evaluation in the Homer-like form then only takes 0 (n)
further operations. In practice, n is rarely larger than 5 or 6 and often only 2
or3Th
is that
t at imterpo Iati
anon polynomials are not flexible enough to
. e reason IS
approximate typical functions with high accuracy, except over short intervals
Where a small degree is usually sufficient. Moreover, the more accurate spline
Interpolation (see Section 2.3) only needs O(n) operations.
3./ Interpolation by Polynomials
Interpolation and Numerical Differentiation
138
The Interpolation Error
To investigate the accuracy of polynomial interpolation we look at the interpolation errors
(1.13)
f(X) - Pi(X) = f[xo.···, Xi, x](x - xo)'" (X - Xi)'
from (1.11) i times with respect to
integration it follows that
g(i)(~)=
for some
~,
139
and using the mean value theorem of
1 tif(i+I)(xo+t(~ -Xo»dt=j<i+I)(XO+T(~ 1
T E
1
1
-xo»
tidt
[0, 1]. Therefore
We first consider estimates for the divided difference term in terms of higher
derivatives at some point in the interval hull 0{ xu, ... , Xi, x}.
3.1.8 Theorem. Suppose that the function f : D S; iR - lR. is n + 1 times continuously differentiable.
(i) If xi; ...
,Xn
ED,
f[x 0,
0::: i::: n,
then
.r<i)(~)
... ,
with
x I·]-- -~
.,
with
I .
C
"
E
O{.\O, ... , xd·
(ii) The error ofthe interpolation polynomial PI! to f at xo, ...
,Xn E
(1.14)
D can be
whence (i) is true in general.
Assertion (ii) follows from (i) and (1.9).
3.1.9 Corollary. Ifx,
XO•••. , X n E
D
[a, b], then
written as
(1.15)
where
(1.16)
qn(X) := (x - xo)'" (x - x n)
II(x) - Pn(x)l:::
Ilf(n+l) II
(n + l)~ Iq(x)l·
(1.17)
This bounds the interpolation error as a product of a smoothness factor
Ilpn+ lJ ll cx) (n + I)! depending on f only and a factor Iq(x)1 depending on
the interpolation points only.
and ~ E Dlxo, ... , X n , x} depends on the interpolation points and on x.
Hermite Interpolation
Proof We prove (i) by induction on n. For n = 0, (i) is obvious. We therefore assume that the assertion is true for some n :::: O. If f: D -+ lR. IS n + .z
.'
h . d . h pothes l S
times continuously differentiable, then we can use t e III uctrve y
with g(x) := f[xu, x] in place of f and find, for all i ::: n,
g(i)(O
f[xo, Xl, ... , xi+il = g[Xl, ... , Xi-LJJ = -i-!-
For Xu = XI =
and we find
... =
Xi =X, O(xo, ... , x,}
= {x}, hence'; = x in relation (1.14),
.
f(i)(x)
f[x, x , ... , x] (I + 1 arguments) = - - .
it
(1.18)
One can use this property for the solution of the so-called Hermite interpolation
Problem to find a polynomial matching both the function values and one or
for some'; E O(Xl, ... , x;+I1.
Differentiating the integral representation
Inore derivative values at the interpolation points. In the most important case
We Want . f,or pairwise
"
disti
. points zo, z 1, .•. , Zm, a polynomial
isnnct interpolation
P of degree -::::'2m
+ 1 such that
( 1.19)
Interpolation and Numerical Differentiation
140
3.1 Interpolation by Polynomials
To solve this problem one doubles each interpolation point, that is, one sets
X2j := X2j+l := Zj
from which it easily follows that the interpolation requirement (1.19) is satisfied.
Similarly, in the general Hermite interpolation problem, an interpolation
point Zj at which the function value and the values of the first k j derivatives
shall be matched has to be replaced by k j + 1 identical interpolation points
Xij = ... = Xi]+kj = Zj; then, again, the Newton interpolation polynomial gives
the solution of minimal degree. A very special case of this is the situation where
the function value and the first n derivative values shall be matched at a single
point Xo; then (1.15), (1.18) and Theorem 3.1.4 give the representation
f(xo)
+f
+f
I
(xo)(x - Xo) +
... +
f(n) (xo)
n!
n
(x - Xo)
(n+1) (c )
(n
+
"
I)!
(x -
)n+l
X
is
= P2m+ I.
f(x) - p(x) = f[zo, ZO, ... , Zm, Zm, x](x - ZO)2 ... (x - Zm)2,
=
,Xk~l, xd
(j =0, ... , m),
and calculates the corresponding Newton interpolation polynomial p
The error formula (1.13) now takes the form
f(x)
The corresponding recursion for the dik := f[Xi, Xi+l, ...
141
0
for some ~ E D{x, xo}; and we see that the well-known Taylor formula with
remainder term is a special case of Hermite interpolation.
In actual practice, the confluent interpolation points cause problems in the
evaluation of the divided difference formulas because some of the denominators
Xi - Xk in the divided differences may become zero. However, by using the
permutation symmetry, the calculation can be arranged so that everything works
well. We simply compute the divided differences column by column in the
following scheme.
otherwise,
because by construction the case x, = Xk occurs only when Xi = Xi+1 = ... = Xk,
and then the (k - i)th derivative of f at Xi is known.
Convergence
Under suitable (but for applications often too restricted) conditions, it can be
shown that when the number of interpolation points increases indefinitely, the
corresponding sequence of interpolation polynomials converges to the function
being interpolated.
In view of the close relation between Newton interpolation formulas and the
Taylor expansion, we suppose that the function f has an absolutely convergent
power series in the interval of interest. Then f can be viewed as a complex
analytic function in an open region D of the complex plane containing that interval. (A comprehensive exposition of complex analysis from a computational
point of view is in Henrici [43].)
We restrict the discussion to convex D because then the results obtained
so far hold nearly without change (and with nearly identical proofs). We only
need to replace interval hulls (I]S) by closed convex hulls (Conv S), and the
statements (1.14), (1.15), and (1.17) by
f(i)(~) I ~
f[xo, ... ,x;l E Conv { -I-'!If(x) - Pn(x)1 ::::
suP.
~EConv{xo,.... x,}
E
}
Conv{xo, ... ,xd ,
f(n+1) (0 I
( + 1)' Iqn(x)l,
I n
.
and
If(x) - Pn(x)l::::
II
r:»
I
+ l)~ Iq(x)l,
(n
where the oo-norm is taken over a convex domain containing all arguments.
The complex point of view allows us to use tools from complex analysis that
lead to a simple closed expression for divided differences. In the following,
D[c; r] :=
{~ E
C II ~ - c] :::: r}
denotes the closed complex disk with center c E C and radius r > 0,
Positively oriented boundary, and int D its interior.
aD
its
142
3.1 Interpolation by Polynomials
Interpolation and Numerical Differentiation
3.1.10 Proposition. Let f be a function analytic in the open set Do
~
C. If
the disk D[c; r] is contained in Do, then
f[xo, ... , x n] =
-1.
2m
1
3D
f(s)ds
for Xo, ... , Xn
(s - Xo) ... (s - Xn)
E
int D.
(1.20)
Lst i
1
po1ation error is bounded according to
f(s)
- ds
for all x
E
int K.
=
/f(x) - Pn(x)1 = If[xo, ... , x n, x]llx - xol .. ·Ix - Xn /
3K S - X
Hence suppose that (1.20) is valid for n - 1 instead of n then, for
(cf. Example 3.1.7),
f[xo, ... , x n]
~ C containing
the disk D[c; r]. For an infinite sequence of complex numbers x) E D[c; p]
(j == 0, 1,2, ...), let Pn be the polynomial of degree Sn that interpolates the
junction at xo, Xl, ... , Xn. Ifr > 3p, then for X E D[c; p], the sequence Pn(x)
converges uniformly to f(x).
3.1.12 Theorem. Let f be afunction analytic in an open set D
proof IfxED[c;p]thenlx-xjl S 2pforallj=0,1,2, ... ,sotheinter-
Proof For n == 0, the formula is just Cauchy's well-known integral formula
1
f(x)=-.
143
X n-l
For r > 3 p , I r~p I < 1, so that If
(x) - Pn (x) I is majorized by a sequence that
does not depend on x and converges to zero. This proves uniform convergence
on D[c; p].
0
f[xo, ... , xn-JJ - f[xo, ... , Xn-2, x n]
"---=----"-------"--------=-------"----"--
Xn-I - Xn
J(s)
J(s)
1 [
(s~xO)···(s-xn_ll
(S-XO)"'(S-X n_2)(S-Xn) ds
= 2rri 13K
Xn-I - Xn
1 [
f(s)ds
2rri 13K (s - xo) ... (s - x n) .
This formula remains valid in the limit Xn
therefore in general.
-+ Xn-I.
3.1.11 Corollary. lflf(~)1 S M for all ~
= 0, ... , n then
E
So (1.20) holds for nand
0
D[c; r] and Ix) - c] S p < r for
all j
If[xo, ... , xn]1 S
Mr
(r - p)n
Mr
n+1
Mr ( 2 P )n+1
<
(2p)
=-- -- (r - p )n+2
(r - p) r - p
i= X n
The hypothesis r > 3p can be weakened a little by using more detailed considerations and more complex shapes in place of disks. However, a well-known
example due to Runge shows that one cannot dispense with some condition of
this sort.
3.1.13 Example. Interpolation of the function f(x) .- I/O + x 2 ) at more
and more equidistant interpolating points in the interval [-5,5] leads to divergence of the interpolating polynomials Pn(x) for real x with Ix I ::: 3.64, and
convergence for Ixi S 3.63. We illustrate this in Figure 3.1; for a proof, see,
for example, Isaacson and Keller [46, Section 6.3.4].
+1 .
5,------_-------,
Proof By the previous proposition,
If[xo,··., xn]1
= -112rril
11
3K
f(s)ds
I
(s - xo)··· (s - x n )
< _1 [
If(s)lldsl
<
M
[Idsl
- 2rr 13K Is - xol" ·Is - xnl - 2rr(r - p)n+! 13K
=
(Here,
0
M
Mr
·2rrr=
.
2rr(r - p)n+1
(r _ p)n+1
J Idsl denotes the unoriented line integral.)
0
The fact that the number M is independent of n can be used to derive the
desired statement about the convergence of interpolating polynomials as n-+ 00.
-5 "---5
~
o
_______J
5
o
Figure 3.1. Equidistant polynomial interpolation in many points may be poor.
5
144
Interpolation and Numerical Differentiation
3.2 Extrapolation and Numerical Differentiation
The hypothesis of the theorem is not satisfied because j has poles at ±i,
although j is analytic in a large open and convex region containing the interpolation interval [-5,5]. For such functions, the derivatives Ij(n+1)(~) I do not
vary too much, and the error behavior (1.15) of the interpolation polynomial
is mainly governed by the term q.; (x) = (x - xo) ... (x - x n). The divergent
behavior finds its explanation by observing that, when the interpolation points
are equidistant and their number increases, this term develops huge spikes near
the interval boundaries. This can be seen from the table
n
16
32
64
128
256
1.9· 10'
3.1· 103
4.4 . 107
4.6· 1016
1.9· 1034
145
Proof We introduce the Chebyshev polynomials Tn, defined recursively by
To(x):=I, T,(x):=x,
T2(x)=2x 2-1,
Tn+l(x):= 2xTn(x) - Tn-I(X) (n = 1, 2, ... ).
(1.22)
Clearly, the Tn are of degree n and, for n > 0, they have highest coefficient
2n - 1• Using the addition theorem cos(a + fl) + cos(a - fl) = 2 cos a cos fl, a
simple induction proof yields the relation
Tn(x)
=
cos(n arccos x)
for all n.
0.23)
This shows that the polynomial Tn + I (x) has all Chebyshev points as zeros, and
hence Tn+! (x) = 2nqn (x). Moreover, the extrema of Tn+1 (x) are ±1, attained
between consecutive zeros. This implies the assertion.
0
For the interpolation error in interpolation at Chebyshev points, one can prove
the following result.
of quotients
3.1.15 Theorem. The interpolation polynomial Pn(x) interpolating an arbitrary s times continuously differentiable junctions f: [-1, 1] ~ lR in the
Chebyshev points (1.21) satisfies
Interpolation in Chebyshev Points
Although high-degree polynomial interpolation in equidistant points often leads
to strong oscillations near the boundaries of the interpolation intervals destroying convergence, the situation is better when the interpolation points are more
closely spaced there. Because interpolation on arbitrary intervals can be reduced to that on [-1, 1] through a linear transformation of variables, we restrict
the discussion to interpolation in the interval [-1, 1], where the formulas are
simplest.
The example just discussed suggests that a choice of the x j that keeps the
term qn(x) = (x - xo) ... (x - x n) of a uniform magnitude between interpolation
points is more likely to produce useful interpolation polynomials. One can
achieve this by interpolating at the Chebyshev points, defined by
Xj
:=cos
2) + 1
2n+ 2
lr
(j=O, ... ,n).
3.1.14 Proposition. For the Chebyshev points (1.21),
1
Iqn(x)l:::: 2n
for x
E [-1, 1],
and this value is attained between any two interpolation points.
(1.21)
Ij(x) - Pn(x)1 = 0
(~s
).
Proof See. for example, Conte and de Boor [12, Section 6.1].
o
This reference also shows that interpolation at the so-called expanded
Chebyshev points, which, adapted to a general interval [a, b], are given by
a +b +a -b ( cos u-+- l1r )
-
2
2
2n+2
/
lr)
( cos - - 2n+2
is even slightly better.
3.2 Extrapolation and Numerical Differentiation
Extrapolation to the Limit
Extrapolation refers to the use of interpolation at points xo, ... , x; for the approximation of a function j at a point x ¢ O{xo, ... , x n }. Because interpolation
polynomials of low degree are generally rather inaccurate, and those of high
degree are often bad already near the boundary of the interpolation interval
and this behavior becomes worse outside the interval, extrapolation cannot be
recommended in practice.
147
Interpolation and Numerical Differentiation
3.2 Extrapolation and Numerical Differentiation
A very important exception is the case in which the interpolating points
Xj (j =0,1,2, ...) form a sequence converging to zero, and the value of fat
x = 0 is sought. The reason is that in this particular case, the usually offending
term qn (X) behaves exceptionally well,
The assertion (2.3) is clearly true for k = O. Hence suppose that (2.3) holds
for k - 1 in place of k. In (2.2), the factor (x - Xi) vanishes for x = Xi, and the
fraction vanishes for x =Xi-I, Xi-2, ... , Xi-HI; therefore,
146
(2.1)
Moreover, for x
and converges very rapidly to zero even when the x j converge quite slowly.
3.2.1 Example. For the sequence x j :=
lowing results:
2
n
4
8
16
Because the value at the single point x = 0 is the only one of interest in extrapolation, the Newton interpolation polynomial is not the most common way to
compute this value. Instead, one generally uses the extrapolation formulas of
Neville, which give simultaneously the extrapolated values of many interpolation polynomials at a single argument.
3.2.2 Theorem. The polynomials defined by
+ (x,. _ x )Pi.k-I(X) -
Pi-I,k-I(X)
Xi-k - Xi
fork= 1, ... , i,
(2.2)
are the interpolation polynomials for f at Xi<k.» .•• , Xi.
Proof Obviously Pik(X) is of degree ~k. We show by induction on k the
interpolation property
- k, i - k
= Pi,k-I (Xi-k) - (Pi,k-I (Xi-k) - Pi-I,k-I (Xi-k))
= Pi-l.k-l (Xi-k) = f(Xi-k),
The values Pik (0) are, for increasing k, successively better extrapolation approximations to f (0), until a stage is reached where the limitations of polynomial interpolation (or rounding errors) increase the error again. In order to have
a natural stopping criterion, one monitors the values 8i := IPu (0) - Pi,; -I (0) I,
and stops the extrapolation if 8i is no longer decreasing. Then one accepts Pu (0)
as the best approximation for f (0), and has the value of 8i as a natural estimate
for the error f (0) - Pu (0). (Of course, this is not a rigorous bound; the true
error is unknown and might be larger.)
In the algorithmic formulation, we use only the Xi for i > 0 in accord with the
lackof zero indices in MATLAB. We store Pik (0) - f (x;) in Pi» I-b overwriting
old numbers no longer needed; the subtraction gives a slight improvement in
final accuracy because the intermediate quantities are then smaller.
3.2.3 Algorithm: Neville Extrapolation to Zero
i = 1; PI =0;
fold=f(xl);
while 1,
p;o(X) = f(Xi),
=i
one obtains
so that (2.3) is valid for k and hence holds in general.
Thus Pik(X) is the uniquely determined polynomial of degree <k which
D
interpolates the function f (x) at Xi<k» •.• , Xi.
The Extrapolation Formulas of Neville
Pik(XI) = f(XI) for I
Pik(Xi-k)
fJ (j = 0, 1,2, ...), one finds the fol-
Thus the extrapolation to the limit from values at a given sequence (usually
x j = xO / N, for some slowly growing divergent sequence N j ) can be expected
to give excellent results. It is a very valuable technique for getting function
values at a point (usually x = 0) when the function becomes more and more
difficult to evaluate as the argument approaches this point.
()
. (x)·-·
. - Pl.k-I X
Plk
= Xi-«,
+ 1, ... , i.
(2.3)
i = i + 1; Pi = 0;
fest = f(Xi); df = fest - fold;
for j = i - I : - 1:1,
P] = PHI
end,
+ (Pj+l
- Pj
+ df) * xi/(Xj
- Xi);
fold = fest; 80ld = 8;
8 = abs(p2 - PI); if i > 2 & 8 2: 8old, break; end,
end,
fest = fest + PI;
% best estimate If(O) - fest I ~ 8
Interpolation and Numerical Differentiation
3.2 Extrapolation and Numerical Differentiation
For the frequent case where Xi = x /qi (i = 1, 2, ...), the factor Xi/(Xi-k -Xi) in
the formula for Pik becomes l/(qk -1), and the algorithms takes the following
form.
if f is (s + 2) times differentiable atx.1f f is expensive to evaluate, a fixed p(h)
is used as approximation to rex), with an error of O(h). However, because
for small h divided differences suffer from severe numerical instability due to
cancellation, the accuracy achievable in finite precision arithmetic is rather low.
However, if one can afford to spend several function values for the computation of the derivative, higher accuracy can be achieved as follows. Because
p(h) is continuous at h = 0, and (2.4) shows that it behaves locally like a
polynomial, it is natural to calculate the derivative rex) = p(O) by means of
extrapolation: For a suitable sequence of numbers hi =/= converging to 0, one
calculates the interpolation polynomial Pi using the values p(h j) = f[x, x +
hj] (j =0, ... , i). Then the Pi (0) can be expected to be increasingly accurate
approximations to rex), with an error of O(h o ' " h n ) .
Another divided difference, the central difference quotient
148
3.2.4 Algorithm: Neville Extrapolation to Zero for Xi = X/ qi
i=l;PI=O;
fold = f(xI);
while 1,
i = i + 1; Pi
fest
= 0;
°
Q = 1;
= f(Xi); df = fest -
fold;
for j = i - I : - 1 : 1,
Q= Q »«.
Pj = PHI + (PHI - P] + df)/(Q - 1);
end,
fold = fest; Oold = 0;
0= abs(p2 - PI); if i > 2 & 0 2: OOId' break; end,
end,
f[x - h, x
=: p(h
In practice, one often has functions f not given by arithmetical expressions and
therefore not easily differentiable by automatic methods. However, derivatives
are needed (or at least very useful) in many applications (e.g., to solve nonlinear
algebraic or differential equations, or to find the extreme values of a function).
In such cases, the common remedy is to resort to numerical differentiation.
In the simplest case one approximates, for fixed x, the required value t" (x)
through a forward difference quotient
f(x
+ h) -
f(x),
h
=/=0
h
at a suitable value of h. By Taylor expansion, we find the error expansion
= f'ex) +
L.f (i+l)(X)
, hi + O(h'+I)
+ 1).
i = 1:5
(I
+ h) -
f(x - h)
+
L
i=l:s
Numerical Differentiation
p(h)
= f(x
f'ex)
We now demonstrate the power of the method with numerical differentiation;
other important applications include numerical integration (see Section 4.4) and
the solution of differential equations.
+ h] =
+ h]
2h
fest = fest + PI;
% best estimate If(O) - fest I .::: 0
p(h) := f[x, x
149
(2.4)
2
i>» (x) h 2i + O(h 2s+2)
(2i
+
I)!
(2.5)
)
approximates the derivative f' (x) with a smaller error of 0 (h 2 ) . Because in the
asymptotic expansion only even powers of h occur, it is sensible to consider this
expression as a function of h 2 instead of h. Because now h 2 plays the role of
the previous h in the extrapolation method, the extrapolation error term is now
O(h6 ... h~). Thus h need not be chosen as small as for forward differences to
reduce the truncation error to the same level. Hence central differences suffer
from less cancellation and therefore lead to better results. The price to pay is
that the calculation of the values p(h]) = f[x - h i» x + h j] for j = 0, ... , n
requires 2n + 2 function evaluations f (x ± h j), whereas for the calculation of
f[x, x + h j] for j = 0, ... , n, only n + 2 function evaluations (f (x + h j) and
f (x)) are necessary.
3.2.5 Example. We want to find the value of the derivative of the function
given by f(x) = sinx at x = 1. Of course, the exact value is r(l) = cos 1 =
0.540302305 868 .... The following table lists the central difference quotients
p(h~) := f[x - hi, X + hi] for a number of values of hi > 0. Moreover, the
values Pi (0) calculated by extrapolation are given for i = 1, 2. Correct digits
are underlined. We used a pocket calculator with B = 10, L = 12, and optimal
rounding, so that the working precision is E; = 10- 11 ; but only 10 significant
digits can be displayed.
!
Interpolation and Numerical Differentiation
3.2 Extrapolation and Numerical Differentiation
151
error
o
1
2
3
4
5
6
7
8
0.04
0.02
0.01
0.5.10- 2
10- 4
0.5.10- 4
10- 5
10- 6
10- 7
0.540 1582369
0.5402662865
0.540293301 1
0.540 300054 8
0.5403023150
0.5403022800
0.5403021500
0.540 304 500 0
0.540 295 0000
- - - discretization error
rounding error
total error
-
0.5403023030
0.540302305 9
The accuracy of the central difference quotient first increases and attains its
optimum for h ~ 10-4 as h decreases. Smaller values than h = 10- 4 give less
accurate results, because the calculation of flx - h, x + h] is hampered by
cancellation. The increasing cancellation can be seen from the trailing zeros at
small h.
The extrapolated value is correct to 10 places already for fairly large hi; it is
remarkable (but typical) that it is more accurate than the best attainable value
for any of the central differences f[x - hi, X + hd. To explain this behavior,
we now tum to a stability analysis.
Figure 3.2. T~e nAum~ri.ca! ~ifferentiation error in dependence on the step size h. The
optimal step srze h mmurnzmg the total error is close to the point wherediscretization
errorand rounding error are equaL
consists of a contribution O( T,) due to errors in function evaluations and a
2
contribution 0 (h ) , the discretization error due to the finite difference approximation, The qualitative shape of the error curve is given in Figure 3.2.
By a similar argument one finds for a method that, in exact arithmetic, approximates f'(x) with a discretization error of O(h') a total error 8(h) (including
errors in function evaluations) of
The Optimal Step Size
We want to estimate the order of magnitude of h that is optimal for the calculation of difference quotients and the resulting accuracy achievable for thc
approximation of I' (x). In realistic circumstances, function values at arguments
near x can be evaluated only with a relative accuracy of E, say, so that
lex ± h) = I(x ± h)(1 + O(E))
(2.6)
is calculated instead of I (x ± h). In the most favorable circumstances, E is the
machine precision; usually E is larger, Inserting (2.6) into the definition (2.5)
of the divided difference gives
(2.8)
Limiting Accuracy
Under the natural assumption that the hidden constants in the Landau symbols
do not change much for h in the relevant range, we can analyze the behavior of
(2.8) by replacing it with
8(h)
i[x -
h, x + h] = I[x - h, x + h] + 0
Because f'(x) = f[x - h, x
total error
f'ex) -
+ h] + 0(1f'''(x)jh
j[x - h, x + h] = 0
2
)
(If(X)I~) .
by (2.5), we find that the
(If(X)I~) + 0(1f"'(X)lh 2 )
(2.7)
E
=0-
h
+ bh'
where 0 and b are positive real constants. By setting the derivative equal to
zero, we find h = (as / sb) 1/(S+ I) as the value for which 8 (h) is minimal and
t~e minimal value is s;in = (s + l)bh' = O(E,/(s+I)). The optimal h is only a
lIttle smaller than the value of h where the discretization error term matches
the function evaluation error term, which is h = (aE/h)l/(s+l) and gives a total
error of the same magnitude.
152
Interpolation and Numerical Differentiation
3.3 Cubic Splines
If we take into account the dependence of a and b on f, we find that the
optimal magnitude of h is
=0
h
opt
(I f(.<+I)
f(x) 1
(x)
1
/ (.< + 1)
(2.9)
£1/(H1)) .
Because in practice, f(H 1) (x) is not available, one cannot use the optimal value
h op t of h; a simpler heuristic substitute is
h=I f[x + f(x)
ho, x -
+ h)
:= 6f[x - Zh, x - 11, x. x
- 2f(x)
hZ
2
"
+ f(x
- h)
2f IZi+ ZJ( X ) h2i + 0(h 2.<+2)
,"",'
1=1.s
(2i +2)!
- f(x - h»
if h
h3
#
0;
z
then f!ll(x) = p(O) and p(h ) = f!ll(x) + O(h z). The optimal It is now of order
l 5
0(e / ) , giving an error oforder 0(8 2/ 5). Again, extrapolation improves on this.
3.3 Cubic Splines
As shown, polynomial interpolation has good approximation properties on narrow intervals but may be poor on wide intervals. This suggests the use of piecewise polynomial functions to keep the advantages and overcome the problems
associated with polynomial interpolation.
Piecewise Linear Interpolation
To set the stage, we first look at the simple case of piecewise linear interpolation.
if h
(i) A grid on [a, b] is a set fl
= {XI,
... ,xn } satisfying
a=xI <XZ < '" <xn-
!
<xn=b.
(3.1)
XI, ... 'X n are called the nodes of fl, and h := max{IXj+l - Xiii j =
I, " . , n - I} is called the mesh size of <:1. A grid <:1 = {XI, ...• x } is
n
called equispaced if
#- 0;
xi=a+(i-l)h
then f"(x) = p(O), and the asymptotic expansion
p(h )=f (x) +,~
+ h, x + 2h]
2h» - (l(x + h)
3.3.1 Definition.
For the approximation of higher derivatives one has to use higher order divided
differences. The experience with the first derivative suggests to use also ~entral
differences with arguments symmetric around the point where the derivanvc
is to be approximated; then again odd powers of h cancel. and we can use
extrapolation for h 2 ---+ O.
To approximate second derivatives f" (x) one can use
f(x
)
~(l(x + 2h) - f(x -
Higher Derivatives
+ h) =
2
hol
This value behaves correctly under scaling of f and x and still gives total errors
of the optimal order O(es/(s+l», but now with a suboptimal ~idden factor..
In particular, for the approximation of I' (x) by the central ~Ifference ~uot1ent
f[ - h + h] cf. (2.5). we have s = 2 in (2.8), and the optimal magmtude of
x
,x
,
.
2/3' .
his 0(e l / 3 ) achieving the minimal total error of magmtude O(e ). This IS
..
corroborated, in the above example, where e = 2:1 10- II'IS the mac hime precision.
2
Extrapolation of p(h ) using three values f[x - h o: x +.hol: f[x - h I.
X + hIl, and f[x - h i. X + h 2 ] with hi = h/2' gives a discretization error of
0(h 6 ) , hence 0(e l / 7 ) as optimal magnitude for h and nearly fu.n .accura~y
0(e 6 / 7 ) as optimal magnitude for the minimal total error. Again, this IS consistent with the example.
.
Note that if we use for the calculation the forward difference quotient
f[x, x + h), cf. (2.4) we only have s = 1 in (2.8), and the optimal choice
h = 0(e I/2) only gives the much inferior accuracy 0(eI/ 2 ) .
p(h 2 ) := 2f[x - h , x, x
holds for sufficiently often differentiable f. Because of the division by h2,
the total error now behaves like O(f;,) + 0(h 2 ) , giving an optimal error of
l 2
I /4
0(e / ) for h of order 0(e ) . With quadratic extrapolation, the error looks
6
like 0 (f;,) + 0 (h ) , giving an improved optimal error of 0(e 3 / 4 ) for h of order
l 8
0(e / ) . Note that the step size h must be chosen now much larger, reflecting
the fact that otherwise cancellation is much more severe.
To approximate third derivatives 1'''(x) one uses similarly
p(h
lel/(.<+I).
153
.. the mesh size is then h
(II)
Af
.
= (b. -
fori=I ....• n;
a)/(n - 1).
unction p : [a, b) ---+ JR IS called piecewise linear over the grid fl if it is
COntinuousand agrees on each interval [x,, xi+Il with a linear polynomial.
3.3 Cubic Splines
Interpolation and Numerical Differentiation
154
f
(iii) On the space of continuous, real-valued functions
JlOOO ~ 32 subintervals between any two already existing grid points. This
means 32 times as much work; cubic polynomial interpolation would already
be of order 0 (h 4 ) , reducing the work factor to a more reasonable factor of
11000 ~ 5.6.
Thus, in many problems, piecewise linear interpolation is either too inaccurate or too slow to give a satisfying accuracy. However, this can be remedied
by using piecewise polynomials of higher degree with sufficient smoothness.
defined on the interval
[a, b], we define the norms
Ilflloo
:= sup{[f(x)11 x E [a, bl}
and
IIfllz =
Jib
155
f(x)Z dx.
Splines
For piecewise linear interpolation, it is most natural to choose the grid 1:1 as the
set of interpolation points; then the interpolation condition
(3.2)
automatically ensures continuity in [a, b], and we find the unique interpolant
Sex)
= f(x)) + !lx), x)+d(x -
xi)
for all x E [x), xHI].
We see immediately that an arbitrarily accurate approximation is possible when
the data points are sufficiently closely spaced ( i.e., if the mesh size h is sufficiently small). If, in addition, we assume sufficient smoothness of f, then we
can bound the achieved accuracy as follows.
3.3.2 Theorem. Let Sex) be a piecewise linear interpolating function on the
grid 1:1 = {Xl, ... , x n } with mesh size h over [a, b]. If the function f (x) to be
interpolated is twice continuously differentiable, then
If(x) - S(x)\ S
Proof For x
E [x), x) + tl,
hZ
gllf
/I
1100
for all x
E [a,
b].
the relation
We now add smoothness requirements that define a class of piecewise polynomials with excellent approximation properties.
A spline of order k over a grid 1:1 is a k - 2 times continuously differentiable
function S: [a, b] ---+ lR. such that the (k - 2)nd derivative S(k-2) is piecewise
linear (over 1:1). A spline S of order k is piecewise a polynomial of degree at most
k - 1; indeed, S must agree between two adjacent nodes with a polynomial of
degree at most k because the (k - 2)nd derivative is linear. In particular, splines
of order 4 are piecewise cubic polynomials; they are called cubic splines.
Splines have excellent approximation properties. Because cubic splines are
satisfactory for many interpolation problems, we restrict to these and refer
for higher order splines to de Boor [IS]. (For splines in several variables, see
de Boor [16].)
Important examples of splines are the so-called B-splines. A cubic basis
, Xn }, is a cubic spline Sl(x)
spline, short a B-spline over the grid 1:1 = {Xl,
defined over the extended grid x-z < X-I <
< Xn+3 with the property that
(for some 1=0,1, ... , n + I) S/(J;/) > 0 and S/(x) =0 for x rf. (x/-z, x/+z).
Cubic basis splines exist on every grid and are determined up to a constant
factor by the (extended) grid 1:1 and the index I; see Section 3.5.
3.3.3 Example. For the equispaced case, one easily checks that the functions
defined by
implies the bound
1
1
If(x) - S(x)1
S 2111" 1I 00 1(x - x))(x o
S
xHl)1
/I
Where
(x ,J+ I _xJ)z
.
S 211f 1100~
0
h; 111"1100'
. I I
it i necessary
In order to increase the accuracy by three d ecrma p aces, I IS
to decrease h Z by a factor of 1000, which requires the introduction of at least
B(x):=
~(2 - Ixl)3 - (l - Ixl)3
for Ixl S 1,
~ (2 - Ix1)3
for I < Ixl < 2,
o
for Ixl ::: 2
I
are B-splines on the equispaced grid with xi = Xo + lh (cf. Figure 3.3).
(3.3)
156
Interpolation and Numerical Differentiation
3.3 Cubic Splines
3.3.4 Theorem. Let ~
1.35
157
= {XI •... , x n } be a grid over [a, b] with spacings
h, =
X,'+I -
X,'
(I'
. --
I , ... ,n - I) .
Thefunction S given by (3.5)for x E [x). Xj+tJ is a cubic spline, interpolating
Ion the grid, precisely when there are numbers m) (j = I, ... , n) so that
(3.6)
(j = 2, ... , n - I),
(3.7)
where
h·
q) = h
-0.5
-1
---'-0.5
o
0.5
2
1.5
Figure 3.3. Cubic Bvsplines for an equally spaced grid, h =
~h
)
[0, I]
E
!.
1 "
m)=6 S (x)
For interpolation by cubic splines, it is again most natural to choose the grid ~
as the set of interpolation points. The interpolation condition (3.2) again ensures
continuity in [a, h], but differentiability conditions must be enforced by suitable constraints. We represent S in [x). x)+d as a cubic Hermite interpolation
polynomial. Using the abbreviations
(3)
= f[Xj, Xj+I],
Y) = f[Xj. Xj+I, Xj],
OJ
= f[Xj, Xj+I,
»i- Xj+l],
J'
S'(X)
= (3) + y)(2x -
S"(x) =
+ 15)(x - x)2(x -
+ y)(x X)+I)
x)(x -
for all x
Xj+tJ.
- X))2),
6m j = 2 YJ
- 28) h )
+ 4o)h}
for x -+ x)
+0
in S"(x),
for x -+ x)+! - 0 in S"(x);
but this is equivalent to (3.6).
X)+l)
E [x),
.
Xj - X)+l)
+ OJ (2(x - x))(x - X)+I) + (x
2y) + 28)(3x - Zx , - Xj+I).
6m)+1 = 2Yj
x)
J+
Thus we obtain the following continuity condition on S":
(3.4)
+ (3j(x -
forj=l, ... .n.
Formula (3.5) gives the following derivatives for x E [x' x· I]'
we find
SeX) = I(xj)
(j = 2, ... , n - 1).
Proof Because S (x) is piecewise cubic, S" (x) is piecewise linear and if we
kn
"
,
ew S (x), then we would know Sex). Because, however, S"(x) is still unknown, we define
..L---._ _~'____ _~
- L _ ._ _--"--
)-1
(3.5)
To get (3.7), we derive continuity conditions on S' as x
-+ xJ
X)+l -0:
x -..
Because the derivatives of S at x j and x)+ I are not specified. the constants Yj
and OJ are still undetermined and at our disposal, and must be determined by
imposing the condition that a cubic spline is twice continuously differentiable
at the nodes (elsewhere, this is the case automatically). The calculation of the
corresponding coefficients Yi and 8) is very cheap but a little involved because.
as we show, a tridiagonal system of linear equations must be solved.
S'(X))={3) - y)h)
S'(xj+»=/3j
=/3) -
(2m) +m)+l)h),
+ y)h j +o)h;=/3J + tm , +2mj+dh).
We can therefore formulate the continuity condition on S' as
+0
and
Interpolation and Numerical Differentiation
158
3.3 Cubic Splines
this is equivalent to
Similarly, the free node condition at
equation
.
. . ] _ f[xj, Xj+tl- f[Xj-l, Xj] = {3j - {3j-1
f[Xj-I,Xj,Xi+ 1 ..
h· +h·
Xj+1 - Xj-l
j-l
j
(2mj
n- z)
hll-2)
( 2 + -h- mn-I
+ mj+dh j + (mj-l + 2mj)hi - 1
+ (I h
- -hn~l
n-]
h j_ 1 +h j
o
for j = 2, ... , n - 1, and this simplifies to (3.7).
m2 -ml
m3 - mz
X2 - XI
X3 - Xz
f[X n - 2 , XIl_I, x n ]
f[X n-2, Xn-I, x ll]
I - h2 / hi
A ii
= { 21 -
.. _
A 1.1+1-
A ..
1,1-1
(hi
+ h 2)m2 -
otherwise,
{2 + h
Plugging this into (3.7) for j = 2 gives
hi
1,
for i =
otherwise,
= {2+h ll- 2/ h ll_1
I
- qj
0
I
'2
hl(h l +h 2)f[XI,x2,X3] = himl +2h 1(h j +h2)mz+h lh zm3
+ (21z l + h 2)(h l + hz)l1'lz
hz)(h l + hz)ml
hDml
+ (2h l
2/
qj
fori=n,
otherwise,
if Ik - i I > 1.
The computational cost to obtain the m, and, therefore, the spline
coefficients is only of order 0 (n). If we have an equispaced grid, then
we get the following matrix:
hzml
hi
= (hi = (hi -
= 1,
for i = n,
for i
hll- 2/ hll_ 1
A ik = 0
This leads to the condition
(X3 - xI)m2 - (x, - x2)mj
m., = f[X n-2, Xn-l, x ll]. (3.9)
f[XI, X2, X3]
f[XI, X2, X3]
f[ x2, X3, X4]
CASE 1: A simple requirement is the free node condition for X2 and Xn- ! . In
this case, we demand that Sex) be given by the same cubic polyno-
mial in each of the two neighboring sets of intervals [XI, X2], [X2, X3]
and [Xn- 2, Xn-l], [Xn-I, x ll]; thus X2 and Xn-l are, in a certain sense,
"artificial" nodes of the spline. As a consequence, S'" cannot have
jumps at X2 or Xn-I. Since Sill shouldn't have a jump at X2, we must
have 01 = 02, so
gives the determining
XIl-I
From (3.7) and these two additional equations, we obtain the tridiagonal linear system Am = d with
Because f[Xj-l, Xj, xJ+tl can be computed from the known function values,
the previous equation gives n - 2 defining equations for the n unknowns m lWe can pose various additional conditions to obtain the two missing equations.
We now examine four different possibilities.
m3=
159
3
2
I
'2
0
I
'2
2
I
'2
A=
I
'2
+ h 2)(h l + h 2)m 2,
0
so
(3.8)
2
I
'2
I
'2
2
'2
3
0
I
In this important special case, the solution of the linear system Am = b
simplifies a little; ma and mll_1 may be determined directly, then m-;
to fIln-Z may be determined by Gaussian elimination, and fill and m.;
160
may be determined by substitution. Because of diagonal dominance.
pivoting is unnecessary.
CASE 2: Another possibility to determine two additional equations for the m j
is to choose two additional points, Xo = XI - ho and X n+ I = x; + hl/.
near a and b and demand that S interpolate the function f at these
points. For Xo, we obtain the additional condition
Similarly, we obtain as additional condition for x n + I;
This leads to a tridiagonal linear system analogous to that in Case 1
but with another right side d and other boundary values for A. This
linear system can also be solved without pivoting, provided that
h1
Xo
X,,+I -
Xn
A has the following form:
x
x
0
x
x
x
0
x
x
0
0
0
x
0
0
0
x
x
0
0
x
x
Here, II I - ~ A 1100 .s ~,so that we may again apply Gaussian elimination without pivoting; however, because of the additional comer
elements outside the band, nonzero elements are produced in the last
rows and columns of the factorization. Thus the expense increases a
little, but remains of order O(n).
CASE 4: If the derivative of f at the end points XI and X n is known (or can
reliably be estimated), we can set Xo = Xl and Xn+1 = X n; to obtain
m, and m-: we then compute the divided differences f[XI, XI. X2]
and f[Xn-l, X n, xn] from these derivative values. This corresponds to
the requirement that S'(X) = f'(x) for x E {XI. xn}' The spline so
obtained is called a complete spline interpolant.
so
XI -
161
3.3 Cubic Splines
Interpolation and Numerical Differentiation
In order to give a bound for the error of approximation
prove the following.
111- Slloo'
we first
3.3.5 Lemma. Suppose f is twice continuously differentiable in [a. b). Then,
for every complete spline interpolant for I on [a, b],
IIS"lIoo .s 311f"1100.
(3.10)
h,,-l
!
Proof For a complete spline interpolant. I 1- A II
.s
~; therefore, Proposition
2.4.2 implies
because then
(3. 11)
~ecause f[x. x', x"] = 41"(~) for some ~ E O{x, x', x"}, we have lid II 00 <
211 I" 1100' Because S" is piecewise linear. we have
and A is an H-matrix.
If
I is periodic with period in [a. b], we can construct S to be a
CASE 3:
periodic spline with Xo = Xn-l, X,,+ 1 = X2, and m I = m n . In this case,
we obtain a linear system for m 1, ... , m,,_1 from (3.7), whose matrix
IIS"lIoo = i=1:1l
max S"(Xi) =6 max Imil=61lmll00
i=1:11
= 611A- Id11 00
.s 611A- llloolldll 00 .s 6·1 . ~. 111"1100'
o
163
3.3 Cubic Splines
Interpolation and Numerical D(fferentiation
162
The following error bound for the complete spline interpolant results, as well
as similar bounds for the errors in derivatives.
3.3.6 Theorem. Suppose f isfour times continuously differentiable in [a, b].
Then the bounds
I(x - Xj)(x - xj+I)ls h 2/4 then imply
Ilf - Sll oo = Ilell oo
h2
1
s 4' 2:llelllloo s
h4
16 Ilf(4)
1100'
For the derivative of e, we obtain
e'(x)
= (e[x, xj+Jl- e[xj, xj+1D - (e[x, xj+IJ - e[x, xD
= (x - xj)e[x, Xj, Xj+Jl- (Xj+l - x)e[x, x, xj+IJ,
so (3.12) implies
holdfor every complete spline interpolant with mesh size h.
o
Proof. For the proof, we utilize the cubic spline
S with
S"(Xj)
=
!"(Xj)
for j
= 1, ... , n,
3.3.7 Remarks.
S'(Xj)
= f'(xj)
for j
=1
(i) Comparison with Hermite interpolation shows that the approximation
error has the same order (0(h 4 ) ) . In addition (and this can also be proved
for Hermite interpolation), the first and second derivatives are also approximated with orders 0(h 3 ) and 0(h 2 ) . (The present simple proof does not
yield optimal coefficients in the error bounds; see de Boor [15] for best
estimates.)
(ii) For equispaced grids (i.e., Xi+l - Xi = h for each i), it is possible to show
that the first derivative is also approximated with order 0(h 4 ) at the grid
points. Computation of the complete spline interpolant for a function f is
therefore also useful for numerical differentiation of f.
(iii) Theorem 3.3.6 holds without change for periodic splines in place of complete splines. It also holds for splines defined by the free node condition and splines defined through additional interpolation points, when the
error expressions are multiplied by the (usually small) constant factor
~(1+3I1A-lIl00).
and j =n,
where the existence of S follows by integrating the piecewise linear function
through !"(Xj) twice. Lemma 3.3.5 implies
II!" - S"lloo
s II!" - S"lloo + II(S - S)"lloo
s II!" - S"lIoo + 311(S - f)"1I =411!" - S"lloo.
00
Because S" is the piecewise linear interpolant to
implies the bounds
IIi" - S"lloo
f"
at the
Xi,
Theorem 3.3.2
h2
s gll(fIl)lIlloo,
so
(3.12)
To obtain the corresponding inequalities for
Because e(x j) = 0, the relationship
e(x)
then follows for x
E
= (x
f and f', we set e
S.
An Optimality Property of Cubic Splines
Wenow prove an important optimality theorem for complete spline interpolants.
- Xj)(x - xj+de[xj, Xj+l,
[Xj, Xj+l], j = 1, ... , n -
f -
xl
1. Formula (3.12) and
3.3.8 Theorem. Suppose f is twice continuously differentiable in [a, bl, and
suppose Sex) is a complete spline interpolant for f on the grid a
= XI
< ... <
Interpolation and Numerical Differentiation
164
IIS"II~ = II!"II~ - II!" - S"II~ ::: II!"II~.
K(x) =
Proof To prove the statement, we compute the expression
e := 1I1"1I~ - III" - S"II~ - IIS"II~
=
l
21
b
(f"(x)2 - (f"(x) - S"(x))2 - S"Cd) dx
b
(f"(x) - S"(x))S"(x) dx
= 2 "2: lX'
i =2:n
(I"(x) - S"(x») S"(x)dx.
Xl-l
Because S" (x) is differentiable when x E (Xi -1, Xi), we obtain
e
=2"2:
((fI(X) - S/(X))S"(X{:_L -
L~1 (f'(X) -
165
(ii) The 2-norm of the second derivative is an (approximate) measure for the
total curvature of the function. This is because the curvature of a function
f(x) is
x" =b. Then
=
3.4 Approximation by Splines
S/(X))S"/(X) dX)
I"(x)
/I+ f'(x)2
Therefore, if the derivative of the function is small or almost constant, then
the curvature is approximately equal or proportional to the second derivative of the function, respectively. Thus, we can approximately interpret the
theorem in the following geometric way: the complete spline interpolant
has a total curvature that is at most as large as that of the original function. The theorem furthermore says that the complete spline interpolant
is approximately optimal over all interpolating functions in the sense of
minimal total curvature and therefore is more graphically esthetic.
(iii) The same proof produces analogous optimality for periodic splines and
the so-called natural splines, which are characterized by the condition
S"(xd = S"(x,,) = O. The latter are of course appropriate only for the approximation of functions f with f" (x[) = f" (x,,) = O.
1=2:"
by integration by parts. Because the expression (f' (x) - S' (x) ).S"(x) .is continuous on all of [a, b], and because S'" (x) exists and is constant m the interval
[Xi-I, x;], we furthermore obtain
E~2 [(['(Xl - S'(X))S"(X)I: - ,~" ([(xl - S(xll S"'(X)[.].
Because f' and S' agree by assumption on the boundary, the first expression
vanishes. The second expression also vanishes because f and S agree at the
0
nodes x j (j = 1, ... , n). Therefore, e = 0, and the assertion follows.
3.3.9 Remarks.
. 0f
(i) The optimality theorem states that the 2-norm of the second deri
envatIv~
the complete cubic spline interpolant is minimal over all twice contlllUously differentiable functions that take on specified values at .xI, .... ' x"
and whose derivatives take on specified values at Xl and x,.. ThIS optImalr
ity property gave rise to the name "spline," borrowed from the name.fo
a bendable metal strip used by engineers in drafting. If such a phySIcal
1.
rve
spline were forced to go through the data points, then the resu tm~ cu
would be characterized by a minimum stress energy that is approxImated
by f\lI"II~.
3.4 Approximation by Splines
In practice, we frequently know function values only approximately because
they come from experimental measurements or expensive simulations. Because
oftheir minimal curvature property, splines are ideal for fitting noisy data (Xi, Yi)
(i = I, ... , m) lying approximately on a smooth curve Y = f (x) of unknown
parametric form.
To fi nd a suitable functional dependence, one represents f (x) as a cubic spline
Sex). If we interpolated noisy data with a node at each value we would obtain
a very unsmooth curve reflecting the noise introduced by the inaccuracies of
the function values. Therefore, one approximates the function by a spline with
a few nodes only.
Approximation by Interpolation
A simple and useful method is the following: We choose a few of the data
points as nodes and consider the spline interpolating
compute the error interval
f
at these points. We then
e=D{S(x) - f(x) I x E M}
corresponding to the set M of data points that were ignored. If the maximum
166
Interpolation and Numerical Differentiation
3.4 Approximation by Splines
error is then larger than a previously specified accuracy bound, then we choose as
additional nodes those two data points at which the maximum positive and negative error occurred, and interpolate again. We then compute the maximal error
again, and so on, until the approximation is sufficiently accurate. The computational expense for this technique is 0 (/lm), where m is the total number of data
points and n is the maximum number of data points that are chosen to be nodes.
A problem with all approximation methods is the treatment of outliers, that is,
particularly erroneous data points that are better ignored because they should not
make any contribution to the interpolating function. We can eliminate isolated
outliers for linearly ordered data points XI < X2 < ... < XIII as follows. We write
£, = Sex,) - f(x,) and, for i < m,
because the expression (4.1) is then approximately equal to the integral
(f(x) - S(x»2 dx = Ilf - SII~ (see the trapezoidal rule in Section 4.3). With
this choice of weights, we approximately minimize the 2-norm of the error
function.
The choice of the number of nodes is more delicate because using too few
nodes gives a poor approximation, whereas using too many produces wiggly
curves fitting not only the information in the data but also their noise. A suitable
way is to proceed stepwise, essentially as explained in the approximation by
interpolation.
Minimization of (4.1) proceeds easiest when we represent the spline Sex)
as a linear combination of B-splines. The fact that these functions have small
compact support with little overlap ensures that the matrix of the associated
least squares problem is sparse and well conditioned.
In the following, we allow the nodes XI < X2 < ... < xm ofthe spline to differ
from the data points Xi. Usually, much fewer nodes are used than data points
are available.
To minimize (4.1), we assume that S is of the form
if I£i I :s l£i+11
otherwise.
We then take e =
method.
0 {ei I i =
1, ... , m - I} as error interval in the previous
167
J
n
Sex) := I'>,S,(X),
Approximation by Least Squares
A more sophisticated method that better reflects the stochastic nature of the noise
uses the method of least squares. Here, one picks as approximating function
the spline that minimizes the expression
'=1
where the z: are parameters to be determined and where the S, are B-splines
over a fixed grid ~. Thus, we seek the minimum of
m
m
L Wi(Yi -
S(x;»2,
h(z) =
(4.1)
L Wi(Yi i=1
i=1
m
S(Xi»2 =
L
Fi(z)2 =
IIF(z)II~,
;=1
with
where the Wi are appropriately chosen weights.
Regarding the choice of weights, we note that function values should make a
contribution to the approximating function in proportion to how precise they are;
thus the weights depend, among other things, on the accuracy of the function
values. In the following, we assume that the function values are all equally
precise; then a simple choice would be to set each ui, to I. However, the distance
between two adjacent data points should be reflected in the weights, too, because
individual data points should provide a smaller contribution in regions where
they are close together than in regions where they are far apart. A reasonable
choice is, for example,
(X2 - xd/2
Wi·-
{
(Xi+1 - xi-d/2
(X m
-
x m - l )/ 2
fori=l,
for i = 2, ... m - I,
for i =m,
Fi(z)=..jWi(Yi - S(Xi»=..jWi (Yi - tZ,SI(Xi»).
'=1
Written in matrix form, the previous equation is F(z) = D(y - Bz), where
168
3.4 Approximation by Splines
Interpolation and Numerical Differentiation
We therefore need to solve the linear least squares problem
IIDy - DBzli z = min!
(4.2)
Here, selection of the basis splines leads to a matrix B with a simple, sparse
form, in that there are at most four nonzero elements in each row of B. For
example, when n = 7, this leads to the following form, when there are precisely 1,4,2, 1, 2, I data points in the intervals [Xi, .Ki+l], i = 1, 2, 3,4,5,6,
respectively:
x
x
x
x
x
B=
o
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
0
169
the right side also takes 0 (m) operations. Because the solution of the 7-band
system takes O(n) operations and n < m, a total of Oem) operations are
required to solve (4.4).
A comparison with approximation by interpolation shows that a factor of
O(n) is saved. Nonetheless, the least squares method is faster only beyond a
certain n because the constant associated with the expression for the cost is now
larger than before.
The least squares method can also be used for interpolation; then, in contrast
to the case already considered, the data points Yi and the nodes Xi do not need
to correspond. According to de Boor [15], I
for i
= 1, m
otherwise
is a nearly optimal choice of nodes.
Parametric Splines
x
x
x
To solve problem (4.2), we may use the normal equations
(DB)T DBz = (DBl Dy
(4.3)
(Theorem 2.2.7) because the matrix for this special problem is generally wellconditioned. Formula (4.3) is equivalent to
A case that often occurs in practice (e.g., when modeling forms such as airplane
wings) is that in which we seek a smooth curve that connects the pairs (Xi, Yi)
and (Xi+l, Yi+I). Here, the Xi (i = I, ... , m) are in general not necessarily distinct. We cannot approximate such curves by a function f with f(Xi) ~ Yi. To
solve this problem, we introduce an artificial parameter s and view the pairs
(x, y) as a function of s. This function can then be approximated by two splines.
We obtain good results when we take the following arclength approximation
for s:
(4.4)
where W = Diaguo}, ... , wn ) . Because the element
Then, s, is the length of the piecewise linear curve which connects (Xl, Yl) with
(Xi, Yi) through the points (xz, Yz), ... , (Xi-I, Yi-d.lfwe now specify splines
S, and Sy corresponding to a grid on [0, sm] with SxCSi) ~ Xi and Sy(Si) ~ Yi
(i
I, ... , m), then the curve (x, y) = (SxCs), Sy(s)) is a smooth curve which
approximates the given point set.
=
vanishes whenever Sj and Sk do not overlap, the matrix BTW B is a 7-band
matrix. Moreover, B T W B is positive definite, so that a solution of (4.4) via a
Cholesky factorization succeeds without problems.
Because B contains a total of at most 4m nonzero elements (and therefore
on average 4m/n nonzero elements per column) and (by symmetry) only four
bands must be computed, the computation of the left side of the normal equations
requires at most 4n . 4m / n . 3 = 0 (m) operations. Similarly, the formation of
3.4.1 Remarks.
(i) The constructed curve is invariant under orthogonal transformations (ro-
tations and reflections).
(ii) The parametric curve can be easily graphed, since we may simply graph
X = SxCs) and Y = Sy(s).
170
Interpolation and Numerical Differentiation
(iii) If we are looking for a closed curve then X m+I =
can approximate with periodic splines.
XI
3.5 Radial Basis Functions
and Ym+I =
YI,
and we
Radial basis functions are real-valued functions defined on lRd that are radially
symmetric with respect to some center Xi, and hence have the form rp(lIx -Xi 112)
for a suitable univariate function rp : lR+ ----+ lR. Most analyzed and used are
radial basis functions with
rp(r) = Jr 2
+ c2
J~~-~-~:
-2.5
3.5 Radial Basis Functions
rp(r)=r 2Iogr 2
(thin plate spline)
rp(r) = e- c' ,,2
(Gaussian)
2
1
Strong interpolation and approximation theorems are known for the interpolation and approximation by linear combinations of radial basis functions; they
are based on the fact that good radial basis functions grow with IlxII, but their
higher order divided differences decay (see, e.g., Light [56]). The multivariate
analysis is quite involved; here, we only relate univariate radial basis functions to splines, to motivate that we may indeed expect good approximation
properties.
-1
-0.5
-1.5
-1
-0.5
0
0.5
1.5
2
2.5
-1.5
-1
-0.5
0
0.5
1.5
2
2.5
ir
~
-~t
-2
0
0.5
1
1.5
2
~
2.5
J
(e)
I
-2.5
1
-1.5
(b)
-2.5
(multiquadric)
-2
171
2
J
Figure 3.4. Linear B-splines as linear combinations of simple radial basis functions
(a) <l>i, (b) <1>;+1.;, (c) Bi .
are unbounded and wedge-shaped, the differenced functions
3.5.1 Theorem. The space of all piecewise linear functions with possible
breakpoints (nodes) at an infinite grid . . . < Xi < Xi + I <
space of all linear combinations of the functions <l>i (i = 1,
agrees with the
, n) defined by
<l>i(X) :=Ix - xii·
are constant outside [x;, x;+tJ, and the twice differenced functions
Proof All piecewise linear functions over the grid ... < Xi < Xi+1 < ... have
a representation as a linear combination
S(X) =
L S(xi)Bi(x)
of the linear B-splines
if
Xi-l
:sx :SXi,
ifx;:s x:s
otherwise.
Now, whereas the functions
<l>i (x)
:= Ix - xii
B; (x)
= <l>i+l.i -
<1>;';-1
2
are the linear B-splines over the grid. In particular, B; (x) and therefore all
piecewise linear functions are linear combinations of <1>; (x). Conversely, because each <1>; (x) is piecewise linear, all their linear combinations are piecewise
linear. Thus approximation by piecewise linear functions is equivalent to ap0
proximation by linear combinations of [x - Xi I.
X;+I,
More generally, higher order splines can be written as linear combinations
of functions of the form
<I>;(x)=rp(x -Xi)
172
Interpolation and Numerical Differentiation
3.6 Exercises
centered at Xi where for order s splines,
rp(x)
=
!
IXI
S
if s is odd,
xlx!S-J if s is even.
5. Given n + 1 points xo, ... 'X n and an analytical function / : JR ~ R the
matrices Sn(f), A E lR(n~l)x(n+l) are defined by Sn(f):= S with
(5.1)
.
Sik
Indeed, it is not difficult to see that, in generalization of the linear case, one
obtains B-splines of order s over an arbitrary grid by (s + l)-fold repeated
differencing.
.=
(f[Xi, .... xd fori:::: k.
1. Fit (by hand calculations) a parabola of the form g(x) = ax 2 + bx + c
through the points (50, ~), (51, ~), (52.
Give all quantities and intermediate results to 5 decimal places. Compare the results obtained by
using the Newton interpolation formula with those that obtained from the
power form at the points x = 50,50.5,51,51.5,52.
2. (a) Write MATLAB programs that calculate and evaluate the interpolation
polynomial for a MATLAB function / at n pairwise distinct interpolation points Xl , ...• X n .
(b) Interpolate the functions /1 (x):= eX/tO and h(x):= I/O + x 2 ) at the
equidistant interpolation points Xi := - 5 + lOi/n, i =0, ... , n.
For /1 and hand n = 1,4, 8, 16, plot / and p; and calculate an estimate of the maximum error
(i.k=O ..... n)
o
for i > k
and
A_C
3.6 Exercises
Xn-l
¥).
max IPn(x) - /(x)1
XEI
by evaluating the interpolation polynomial Pn at 10 I equidistant points
in the intervals (i) 1= [-1,1] and (ii) 1=[-5.5].
(c) For both functions and both intervals, determine the value of n where
the estimated error becomes least.
3. Let the function / : JR ~ JR be sufficiently differentiable.
(a) Show that divided differences satisfy
~ /[Xo, Xl.·.·, XII] =
dx;
Prove for a, xo, ... 'X n
E
R and arbitrary n times differentiable functions
i.«
(a) Sn(a)=al. Sn(f±g)=Sn(f)±SIl(g), SIl(x'j)=ASIl(f).
(b) If P is a polynomial then Sn (p)
=
= peA).
(c) Sn(f . g) SIl(f) . Sn(g).
(d) If h:= fg then
h[Xi, ... ,xd=
L
j[Xi, ... ,Xj]g[Xj, ... ,Xk].
j=i:k
(e) Specialize (d) to get a formula for the kth derivative (f g )(k).
Hint: Use induction to get (b) from (a); deduce (c) from (b) and interpolation.
6. Let d, = f[xo, .... Xi] and
Xo
0
/[XO, XI,···, Xv-I, Xv, Xl', Xv+I, ... , X n]
for v e O, ... ,n.
k
.
d / [Xo, Xl • . . . , X" ] .
(b) Find and prove an analogous expression
for dX;
4. Show that the Lagrange polynomials can be written as
-do
-dl
XI
A=
Xn_1
where q(x) = (x - xo)··· (x - x n ) ·
173
0
D=
-dll - 2
dnx n - d ll -
I
[ ]
Show that Pn(x) = det(Dx - A) is the Newton interpolation polynomial
for f at Xo • . . . , Xn .
7. Let (xo,YO,Yb):=(-I,-3, 19), (Xl,YI,YD:=(O,O,-I), a~d (X2,Y2,
y~):= (1, 3,19). Calculate the Hermite interpolation polynomial p(x) of
10. The Chebyshev points
2j
degree 5 with p(x v ) = Yv and p'(x v ) = y~ , 1J = 0,1,2.
.
8. Let P« (x) be the polynomial of degree ~n that interpolates the sufficiently
smooth function f (x) at the pairwise distinct points Xo, ... , x".
(a) Show that for k ~ n the kth derivative of the error function f(x) - P«(x)
has at least n - k + I zeros in the smallest interval [a, b1 containing
x·J := cos 2(n
sup
Iq,,(;)I,
where q,,(x):= (x - xo)··· (x - x,,).
Show this for n = I, and express the points Xo, x I in terms of square roots.
11. Write a MATLAB program that calculates the derivative !,(x) of a given
differentiable function f at the point x E R For this purpose use the central
difference quotient
(6.1)
(c) Show that the relation (6, I) holds even when certain interpolation points
coincide.
(d) Show that, for the Hermite interpolation polynomial p" (x) that coincides with the values of the function f and its first derivative at the
f [x - h . x+h]:=f(x+hi)-f(x-hi)
"
1
Zh,
for hi := 2- i h o, i = 0, I, 2.... , where h o > 0 is given. Let P" be the interpolation polynomial of degree <n satisfying
points xo, ... ,x"' the remainder formula
_
f(2,,+2)(;) 2
f(x)-p,,(x)= (2n+2)!q,,(x)
with; E O{xo, ... ,x"' x} and c, given by q,,(x) := (x -xo) ... (x -x,,)
1J =
(J' = 0 , ... , n)
sE[-I,lj
that
f(k) (;)
f[xO,Xl, ... ,Xkl=~k-!-'
+ I IT
+ I)
minimize the expression
all xv·
(b) Use (a) to prove for arbitrary kEN the existence of some; E[a, b1such
holds for arbitrary x E R
9. Let D be a closed disk with fixed center c E C and radius r. Let
175
3.6 Exercises
Interpolation and Numerical Differentiation
174
Xv
E D,
Choose the degree n for which
Ip,,+I(O) - p,,(O)1 :::: Ip,,(O) - p,,_I(O)1
0, ... , k be given points, and let
max
p:=
holds for the first time. Then ~ (p" (0) + P,,-l (0)) is an estimate of f' (x)
and ~ IP" (0) - P,,-I (0) I is a bound for the error.
Determine in this wayan approximation and an estimated error for the
derivative of f(x) := eX at x = I using in succession, the starting values
ho := 1,0.1,0.01,0.001, and compare the results with the true error.
12. Suppose, for the linear function f (x) := a + bx ; with a. b =I 0, the first
derivative f' (0) = b is estimated from
[x, - c],
v =O, ...• k
Then for every analytic function
f :D
S; C ---+ C,
r
If[xo, Xl,"" Xk II ~ (r _ p)k+!
sUPlf(;)I=:S(r).
sED'
Now, let f(x):= e", x E C
(a) Give an explicit expression for S(r).
(b) For which value r of r is S(r) minimal (with c, xo, Xl,"" Xk
fixed)?
(c) How big is the overestimation factor
over
qk.
S,_(.:..,r.:...)
_
:= If[XO,XI,···,Xk 'II
for Xo = XI = ... = Xk = c and k = 1,2,4,8, 16, 32?
E
C
8 _ f(h) - f(-h)
h 2h
~n binary floating point arithmetic with mantissa length L and correct roundmg. Let a and b be given binary floating point numbers and let h be a power
of 2, so that multiplication with h and division by 2h can be performed ex~ctly, Give a bound for the relative error I(8 h - f' (0)) / f' (0) I of the value
8h calculated for 8h . How does this bound behave as h ---+ O?
Interpolation and Numerical Differentiation
3.6 Exercises
13. A lot of high quality public domain software is freely available on the World
Wide Web (WWW). The NETLIB site at
Hint: The MATLAB functions subp.l.o t, title, text, xlabel, and
ylabel may be useful. Try also set (gca , )x.l im ' , [-6,6] ), and experiment with set and get.
(c) The plot suggests a conjecture; can you formulate and prove it?
16. For given grid points XI < Xz < ... < Xn and function values f(xj), let
S (x) be the cubic spline with
176
http://www.netlib.org/
contains a rich collection of mathematical software, papers, and databases.
(a) Search the NETLIB repository for the key word "interpolation" and
explore some of the links provided.
(b) Get a suitable program for spline interpolation, and try it out on some
simple examples. (To get the program work in a MATLAB environment, you may need to learn something about how to make mexfiles that allow you to access FORTRAN or C programs from within
MATLAB.)
14. Let Sf be the spline function corresponding to the grid Xl < Xz < ... < X"
that interpolates f at the points xo, XI, ... , X n + 1· Show that the mapping
f ~ Sf is linear, that is, that
15. For an extended grid X-Z < X-I < Xo < ... < X" < X,,+I < X,,+Z < X,,+3,
define functions Bk (k = I, ... , n) by
for Xk-Z < x ::::: Xk-I,
Clk.k-Z(X - Xk_z)3
+ Clk,k-I (x - Xk_l)3
x)3 + Clk.k+1 (Xk+1 - x)3
Clk.k-Z(x - Xk_z)3
for Xk-I < x ::::: Xko
Clk.k+2(Xk+Z -
for Xk < x::::: Xk+l,
for Xk+l < x < Xk+Z,
otherwise,
Clk.k+Z (Xk+Z - x)3
o
where
Xk+Z - Xk-Z
TI
(Xi-Xj)'
li-kl::SZ
i#
(a) Show that these formulas define cubic splines, the B-splines over the
grid.
(b) Extend the grids (-5, -2, -1,0, 1,2,5) and (-3, -2, -1,0, 1,2,3)
by three arbitrary points on both sides and plot the nine resulting
B-splines (as defined previously) and their sum on the range of the extended grids, using the plotting features of MATLAB. The figure should
consist of two subplots, one for each grid, and be self-explaining; so
do not forget the labeling of the axes and a plot title.
seX) = Clj + {3j(x - Xj)
+8 j (x
177
+ Yj(x
-Xj)z(x -Xj+d
- Xj)(x - Xj+l)
for X
E [Xj,Xj+I],
satisfying S(Xj) = f(xj), j = I, ... , n and the free node condition at Xz
and Xn-l.
(a) Write a MATLAB subroutine that computes the coefficients a j, {3j, Yj,
and 8j (j = 1. ... , n - 1) from the data points by solving the linear
tridiagonal system for the m j (cf. Theorem 3.3.5).
(b) Write a MATLAB program that evaluates the spline function Sex) at
an arbitrary point x. The program should only do three multiplications
per evaluation.
Hint: Imitate Homer's method.
(c) Does your program work correctly when you interpolate the function
f(x):= coshx,
x E [-2,2],
at the equidistant points -2 = XI < ... < Xll = 2?
17. Let S be a cubic spline with m j := S"(Xj) for j = I, ... , n. S is called a
natural spline, provided ml = ni; = O.
(a) Show that the natural spline that interpolates a function g minimizes
s" for all twice continuously differentiable interpolants f.
(b) Natural splines are not very appropriate for approximating functions
because near the boundary, the error of the approximation is typically O(h z) instead of the expected O(h 3 ) . To demonstrate this, determine the natural spline interpolant for the parabola f (x) = x Z over the
grid L1 = {-h, 0, h} and determine the maximum error in the interval
[-h, h]. (However, natural splines are the right choice if it is known
that the function is C Z and linear on both sides outside the interpolation
interval.)
18. Quadratic splines. Suppose that the function values f (Xi), i = I, ... , n, are
known atthe equidistant points Xi = a+(i -I)h, where h = (b-a)/(n -I).
Show that there exists a unique quadratic (i.e. order 2) spline function Sex)
over the grid xb <
< x~, where x; = a + (i - !)h, i = 0, ... , n that
interpolates f at XI,
,xn ·
Interpolation and Numerical Differentiation
178
Hint: Formulate a system of equations for the unknown coefficients of S
in each subinterval [X!_l' xJ
19. Dilation equations. Dilation equations relate splines at two different scales.
This provides the starting point for the multiresolution analysis of sound or
images by means of so-called wavelets. (For details, see, e.g., Daubeches
[13], De Vore and Lucier [21].)
(a) Show that
4
Numerical Integration
if Ixl ::: 1,
ifl<lxl<3,
if Ixl ::: 3
is a quadratic (B-)spline.
(b) Prove that B (x) satisfies for all x
B(x)
E
1 3 3
= 4B(2x
- 3)
+ 4B(2x -
1)
1
+ 4B(2x + I) + 4B(2x + 3).
(c) Derive a similar dilation equation for the cubic B-spline in (3.3).
20. (a) Show, for cP as in (5.1), that every linear combination
Sex) =
L
(XkCP(X -
Xk)
(6.2)
i = l:n
is k - 1 times differentiable, and
S(k-l) (x)
This chapter treats numerical methods for the approximation of one-dimensional
w(x)f(x) dx. After a discussion
definite integrals of the form J,:' f(x) dx or
of general accuracy and convergence results, we consider the highly accurate
Gaussian quadrature rules, most suitable for smooth functions, possibly with
known endpoint singularities. Based on the trapezoidal rule, we then derive
adaptive step size methods for the integration of difficult integrands.
The final two sections show how the methods for integration can be extended
to multistep methods for solving initial value problems for systems of coupled
ordinary differential equations. However, the latter is a vast subject, and we
barely scratch the surface.
A thorough treatment of all aspects of numerical integration of univariate
functions is in Davis and Rabinowitz [14]. The integration of functions of several
variables is treated in Stroud [91], Engels [25], and Krommer and Ueberhuber
[54]. The solution of ordinary differential equations is covered thoroughly by
Hairer et a1. [35,36].
f:
IR the dilation equation
is piecewise linear with
nodes at Xl,···, x.:
(b) Write the cubic B-spline in (3.3) as a linear combination (6.2). Check
by plotting the graph for both expressions in MATLAB.
Hint: First match the second derivatives, then determine the free parameters to force compact support.
4.1 The Accuracy of Quadrature Formulas
In this section, we look at general approximation properties of formulas that
use a finite number of values of a function f to approximate a definite integral
of the form
I(f) :=
l
b
f(x)dx.
(1.1)
Because the integral (1.1) is linear in I, it is desirable that the approximating
formulas have the same property. We therefore only consider formulas of the
following form.
179
180
Numerical Integration
4.J The Accuracy of Quadrature Formulas
4.1.1 Definition. A formula of the form
QU):=
L
Normalized Quadrature Formulas
ad(xj)
(1.2)
j=O:N
is called an (N + I)-point quadrature rule with the nodes Xo, ... , XN and Corresponding weights ao, ... , aN.
In general, it is often useful to use a linear transformation to transform the
interval of integration to the simplest possible interval. We demonstrate this for
various weight functions, namely:
CASE I: The trivial weight function co (x) = 1: One usually normalizes to
[a, b] = [-1, 1]. If
The nodes usually lie in the interval of integration [a, b] because the integrand is possibly undefined outside of [a, b]; for example, when f(x) ==
j
l
-I
.J(x - a)(b - x).
In order for the formulas to remain numerically stable for large N, we require
that all weights be nonnegative; there is then no cancellation when we evaluate
Q
for functions of constant sign. Indeed, the maximum amplification factor
for the roundoff error is given by L laj II L aj for function values of approximately constant magnitude, and this quotient is I precisely whenever the aj
are all nonnegative.
In order to obtain good approximation properties, we further require that the
quadrature rule integrates exactly all f in a particular class, where this class
must be meaningfully chosen. In the simplest case, the interval of integration is
finite and the integrand f (x) is continuous in [a, b]. In a formally only slightly
more general case, the integrand can be divided into the product of a simple but
possibly singular factor w(x), the so-called weight function, and a continuous
factor f(x). In this case, instead of (1.1), the integral has the form
(n
JUJU) =
l
b
w(x)f(x) dx.
f(x)dx
=
La;f(xi)
holds for polynomials of degree y n, then the substitution t := c + xh
results in the formula
C+h
!
f(t)dt
c-h
= h La;f(c+xi h)
for polynomials f of degree y n.
CASE 2: The algebraic weight w(x) = x'" (a > -I): One usually normalizes
to [a, b] = [0, I]. If
i
1
x'" f(x)dx
=
Laif(Xi)
holds for polynomials of degree y n, then the substitution t := xh
results in the formula
(1.3)
Numerical evaluation of (1.3) is then still done approximately with a formula of
the form (1.2), where the weights a j are now dependent on the weight function
w (x).
4.1.2 Examples. Apart from the most important trivial weight function co (x) =
1, frequently used weight functions are as follows:
(i)
(ii)
(iii)
(iv)
181
w(x) = x'" with a > -I for [a, b] = [0, I],
w(x) := I/~ for [a, b] = [-I, 1],
w(x) := e- x for [a, b] = [0,00],
w(x) := e- x 2 for [a, b] = [-00,00].
The last two examples show how an appropriate choice of the weight function
makes numerical integration over an infinite interval possible with only a finite
number of function values.
for polynomials f of degree x n.
CASE 3: The exponential weightw(x) = e- Ax (A > 0): One usually normalizes
to [a, b] = [0, 00], A = 1. If
holds for polynomials of degree y; 17, then the substitution t := c
A-I X results in the formula
for polynomials f of degree
z».
+
182
4.1 The Accuracy of Quadrature Formulas
Numerical Integration
Approximation Order
Because continuous functions can be approximated arbitrarily closely by polynomials, it is reasonable to require that the quadrature formula (1.2) reproduces
the integral (1.3) exactly for low-degree polynomials.
proof By assumption Q has order n
~n we have
183
+ 1 so, for all polynomials P» of degree
It follows that
4.1.3 Definition. A quadrature rule
I
I
b
Q(f) := I>~d(xj)
(1.4)
b
w(x)f(x)dx
=
+ Q(Pn)
w(x)(f(x) - Pn(x» d x
+
b
j
has approximation order (or simply order) n
function w(x) if
w(x)(f(x) - Pn(x»dx
+ 1 with
=I
respect to the weight
o
L CljPn(Xj).
j=O:n
With this we have
b
I
w(x)f(x) dx
In particular, when
== Q(f) for all polynomials f of degree
<n.
(1.5)
f is constant we have the following.
= lIb w(x) (f(x) - Pn(X» dx
4.1.4 Corollary. A quadrature rule (1.4) has a positive approximation order
precisely when it satisfies the consistency condition
b
LClj
=
I
w(x)dx
=
lIb w(x)f(x) dx - Q(f)1
(1.6)
1(1).
}
+ j~n Clj(Pn(Xj)
- f(Xj»1
2: Ib'W(X)",f-Pn"oodx+ L ICljlllf-Pnlloo
o
j=O:n
b
= ( I Iw(x)1 +
Ilf - Pnlloo·
j~n ICljl)
In particular, if w (x) and all a j are nonnegative, then (1.6) implies
We first consider the accuracy of quadrature rules, and investigate when we can
ensure the convergence of a sequence of quadrature rules to the integral (1.1).
The following two theorems give information in an important case.
4.1.5 Theorem. Suppose Q is a quadrature rule with nodes in [a, b] and
nonnegative weights. If Q has the order n + 1 with respect to the weightfunction
w(x), then the bound
I
b
Iw(x)1 +
j~n ICljl =
I
b
w(x)
+ j~n Clj
= 21(1).
0
4.1.6 Theorem. Let Qz (l = 1,2, ...) be a sequence ofquadrature rules with
nodes in the bounded interval [a, b], nonnegative weights, having order ti, + I
with respect to some nonnegative weight function w(x). If n, ~ 00 as I ~ 00,
then
b
10r
w(x)f(x) dx
=
lim Qz(f)
1--+00
for every continuous function f: [a, b] ~ C.
holds for every continuous function f : [a, b] ~ C and all polynomials Pn of
degree 2: n. In particular, for nonnegative weight functions,
lIb w(x)f(x)dx - Q(f)!2: 21 (1) lip" -
flloo·
(1.7)
Proof By the Weierstrass approximation theorem, there is a sequence of polynomials Pk(x) (k = 0, I, 2, ...) of degree k with
lim Ilpk -
k-::HXJ
fII00 = O.
For sufficiently large I we have n, ::: k, so that Theorem 4.1.5 may be applied
to each of these polynomials Pk; we thus obtain
lim
1--+00
lib
w(x)f(x) dx - QI(f)\ =
Because n
o
(i) Theorems of approximation theory imply that actually
for m-times continuously differentiable functions; the approximation error
(1.7) therefore diminishes at least as rapidly.
(ii) Instead of requiring nonnegative weights it suffices to assume that the sums
L la; I remain bounded.
In practice we are interested in the converse of the previous proposition
because we start with a normalized quadrature rule and we desire a bound
for the error of integration over the original interval. We obtain this with the
following theorem.
4.1.9 Theorem. Suppose Q(f) = La; f(x;) is a normalized quadrature rule
over the interval [-1, 1] with nonnegative weights a; and with order n + 1 with
respect to w(x) = 1. Then the error bound
f
i
C
h
+ f(t)dt-hLa;f(c+x;h) I ::::
l c-h
16 ,(h
- )n+2 Ilf(n+1) 1100
(n+l). 2
holds for every N + 1 times continuously differentiable function f on [c - h,
c + h], where the maximum norm is over [c - h, c + h].
4.1.8 Proposition. If the relationship
f(t) dt
k > 0, we obtain the relationship
in the limit h -* O. The normalized quadrature rule (1.9) is therefore exact for
all x k with k :::: n. It is thus also exact for linear combinations of x k (k :::: n),
and therefore for all polynomials of degree j; n, that is, it has order n + 1. 0
For the most important case, that of constant weight w (x) = 1, we now prove
some statements about the accuracy oftransformed quadrature rules over small
intervals.
C+h
+ 1-
o.
a
4.1.7 Remarks.
c-h
185
4.1 The Accuracy of Quadrature Formulas
Numerical Integration
184
= h I>;f(c + x;h) + O(h n+2)
(h -* 0)
(1.8)
holds for all (n + I)-times continuously differentiable functions f in [c - h,
c + h], then the corresponding quadrature rule
Proof Let Pn(x) denote the polynomial of degree j; n that interpolates g(x) :=
fCc +xh) at the Chebychev points. Theorem 3.1.8 and Proposition 3.1.14 then
imply
(1.9)
for (n + I)-times continuously differentiable functions f in [-1, 1] has order
n + 1 with respect to w(x) = 1.
over the interval [-1, 1]. This and application of Theorem 4.1.5 to Q(g) give
\f:;'h f(t) dt -
Proof Suppose k :::: nand f (t) := (t - c/. Then (1.8) holds, and we obtain
h k+ J
_
a;f(c
+ x;h) = II
-
+1
This implies
1
/
t k dt
I _ (_I)k+1
=
k
+
I
= La;x~ +
O(h
n+ 1
n
h
+ I)! ( 2" )
4h
< 00 -
n+ J
-
I
Ilg(n+J) I
2n (n
+ I)!
00
(n+J)
Ilf
1100'
r:
k
-
(n
g(x) dx - Q(g)
Ilg - p II
16
_
-
_I
J
< 4h .
(_h)k+1
___----'-_ = h La;(x;h)k + O(h n+2).
k
I li
1
hL
).
o
.If we wish to compute the error bounds explicitly, we can bound II
I) II 00
interval arithmetic by computing If(n+J)([c - h, C + hm, provided an
With
4.2 Gaussian Quadrature Formulas
Numerical Integration
186
arithmetic expression for f(n+l) is available or f(n+l) is computed recursively
via automatic differentiation from an arithmetic expression for f· However, the
error of integration can already be bounded in terms of the nth derivative; in
particular we have the following.
As in the proof of Proposition 4.1.9, we finally obtain
f
I
C
h
+ 1(t)dt- h L
c-h
16 (h)n+1
4.1.10 Theorem. If the quadrature rule Q(f)
=S n!
= L a.] (x.),
normalized on
[ -1, 1], has nonnegative weights a j and order n + 1 with respect to w (x) = I,
then
Proof Let Pn-l (x) denote the polynomial of degree n - I, which interpolates
g(x):= fCc + xh) at the Chebychev points, and let q(x) := 21-nTnl~~ ~e
the normalized Chebychev polynomial of degree 11. Then Iq(x)1 =S 2
for
x E [-1,1]. With a suitable constant y, we now define
+ yq(x).
Pn(x):= Pn-I(X)
g (x ) - Pn-l (x)
g(n) (0
.
h"
= --q(x)
= -n! f
11\
(n)
(c
+ ~h)q(x)
for some s E [-I, 1], so
Ig(x) - Pn(x)\
l
= ~~ f(n)(c + ~h) =S
\~~ f(n)([c -
h, c
y\ Iq(x)1
+ h]) -
y\. 2
n.
1
-
2:
a ;[ (c + x;h ) I
radf(n)([c-h,c+h]).
o
4.1.11 Remarks.
(i) The bound for the error is still of order O(h n +2 ) because the radius of the
enclosure for the nth derivative is always O(h).
(ii) Under the assumptions of the theorem, the transformed quadrature rule for
the interval [c - h, c + h] also has order n + 1 because the nth derivative of
a polynomial of degree 11 is constant and the radius of a constant function
is zero, whence the error is zero.
(iii) We can avoid recursive differentiation if we use the Cauchy integral theorem to bound the higher order derivatives. Indeed, if f is analytic in the
complex disk D[c; 1'] with radius r > h centered at c, and if 11(01 =S M
for ~ E D[c; 1'], then Corollary 3.1.11 implies the bound
(n
We then have
187
1
II (n+l) I
+ 1)! 1
00
Mr
=S (I' _ h)n+2'
and the bound in Proposition 4.1.9 becomes 16M rqn+2, with q = hi (21' 2h). For sharper bounds along similar lines, see Eiermann [24] and
Petras [80].
In order to apply these theorems in practice, we must know how to determine the 2n + 2 parameters a j and x j in the quadrature rule (1.4) in order to
obtain a predetermined approximation accuracy. In the next section, we integrate
interpolating polynomials to determine, for arbitrary weights, so-called interpolatory quadrature rules, among them the Gaussian quadrature rules, which
have particularly high order.
.
.
.... I
We now choose y in such a way that the expression on the nght side IS mllllma .
([c - h, c + h]). Then we get
This is the case when y = ~ mid
4.2 Gaussian Quadrature Formulas
2 (h)n
Ilg-Pnlloo=Sn!
2: rad(t(n)([c-h,c+h])).
When we require that all polynomials of degree =S n are integrated exactly, we
Obtaina class of (n + 1)-point quadrature formulas that are sufficiently accurate
for many purposes. In this vein, we have the following.
r:
Numerical Integration
4.2 Gaussian Quadrature Formulas
4.2.1 Definition. An (n + I)-point quadrature formula Q(f) is an interpolatory quadrature formula for the weight function w (x) provided
called the Newton-Cotes quadrature rules of designed order n + I. If n is even,
the true order is n + 2. Indeed, a symmetry argument shows that (2x -a -bt+ 1
(and hence any polynomial of degree n + 1) is integrated exactly to zero when
11 is even.
For n = 1, we obtain the simple trapezoidal rule
188
Q(f)
=
l
b
w(x)f(x) dx
for all polynomials f of degree y n, that is, when Q(f) has order n
+ I with
TI(f)
respect to w(x).
of order
The following theorem gives information concerning existence and unique-
Il
+ 1 = 2, for
4.2.2 Theorem. To each weight w (x) and n + 1 arbitrarily preassigned pa~r. diIS tiIn ct nodes Xo , ... , x n, there is exactly one quadrature formula with
wIse
of order 11 + 2
L
aJ(xj)
aj :=
l
L(x) =
]
n
i#j
x -Xi.
x·] - x·I
(x) is a polynomial of degree j- n. Then Lagrange's interpo-
=
L
Lj(x)f(xj)
j=O:n
implies, with
o
the desired representation.
The interpolatory quadrature formulas corresponding to the constant weight
= 1 and equidistant nodes Xj = a + J'h n (.J -- 0 , ... , n) are
function w(x)
+ 32f Ca: b) + 12f
C;
b)
(2.1)
of order n + 2 = 6. If b - a = O(h), Theorem 4.1.9 implies that the error is
O(h 3 ) for the simple trapezoidal rule, O(h 5 ) for the simple Simpson rule, and
o (h 7) for the simple Milne rule.
In the section on adaptive integration, we meet these rules again as the first
members of infinite sequences TN(f), SN(f), and MN(f) oflow order quadrature rules.
Unfortunately, the Newton-Cotes formulas have negative weights a j when n
is large, more precisely for = 8 and
10. For example,
laj 1/
j ~
lOll for Il = 40, which leads to a tremendous amplification of the roundoff
error introduced during computation of the f (x j ). We must therefore choose
the nodes carefully if we want to maintain nonnegative weights.
As already shown in Section 3.1, many functions can be interpolated
better when the nodes are chosen closer together near the boundary, rather
than equidistant. We also use this principle here because in general the better
we interpolate a function, the better its integral is approximated. As an example, we treat normalized quadrature formulas in with respect to w(x) = 1
On the interval [-1, 1]. Here, an advantageous distribution of the nodes is
given by the roots of the Chebychev polynomial Tn(x) = cos(n . arccos(x))
(see Section 3.1). However, the extreme values of the Chebychev polynomial,
n
lation formula (see Theorem 3.1.1)
f(x)
f (a)
w(x)Lj(x)dx,
with the Lagrange polynomials
f
(7
+32 f(a:3b)+7 f(b))
b
(/
Proof Suppose
simple Simpson rule (also called Kepler's
= 4, and for 11 = 4 the simple Milne rule
M I (f) = b ~ a
j=O:n
with
= 2 the
b-a ( f(a)+4f (a+b)
SI(f)=-6-2- +f(b) )
+ 1, namely
Q(f) =
b-a
= -2-(f(a) + feb)~
barrel rule)
ness of interpolatory quadrature formulas.
order n
n
189
n:::
L
La
190
Numerical Integration
4.2 Gaussian Quadrature Formulas
that is,
of approximation n
Xj
= cos(rrj/n)
l
for j = 0, ... , n
n
1(
~ 1-
"COS(2rrjk/n»)
L.
4k2 _ I
j
for j = 1, ... , n - I,
2
b
k=1:LII/2j
jb
(f, g) :=
a
(2.2)
+ q(x)r(x)
b
w(x)j(x)g(x) dx
I
(2.3)
for some polynomial Pll of degree s» (the interpolating polynomial at the no~e~
xo, ... , XII) and some polynomial rex) of degree ~k - I. Thus Q(j) has or e
(2.4)
b w(x)
dx = ILo < 00,
w(x) > 0
for x
E
(a, b)
(2.5)
hold the ( )'
"
d
.
.
,
n·,' IS a posrtive efinite scalar product III the space of real continuous bounded functions in the open interval (a, b).
A sequence of polynomials Pi (x) (i = 0, 1, 2, ...) of degree
.1'
hi'
a :~ys em oJ.ort ogona polynomials over [a, b) with respect to the
Ight functIOn w(x) If the condition
We'
f(x) = PII(X)
I
~..z.4
Definition.
IS called
t
f of degree n + k has the form
D
for t~ese frequently occurring integrals. Whenever the weight function co (x) is
COntllluous in (a, b) and the conditions
j
b
w(x)q(x)r(x) dx = 0
We need some additional notation for this. We introduce the abbreviation
obeys the relationship
Proof Every polynomial
W(X)PII(X) dx
Equation (2.2) places k conditions on the n + 1 nodes. We may therefore
hope to satisfy these conditions with k = n + I with an appropriate choice of
nodes. The order of approximation 2n + 2 so obtained would thus then be twice
as large as before. As we shall now show, for nonnegative weights this is indeed
possible, in a unique way.
qt x) := (x - xo)" . (x - xn)
for i = 0, ... , k - 1.
1
for all polynomials r(x) of degree ~ k - 1, that is, whenever (2.2) holds.
4.2.3 Proposition. An interpolatory quadrature formula with nodes xo, .... x"
has order n + 1 + k precisely whenever the polynomial
w(x)q(x)xidx = 0
i~ aiPII(xi) =
w(x)j(x) d x =
b
If we consider the difference between the left and right sides, we see that this
is the case precisely when
for j =0, n.
Because a j :::. all' all a j are positive for this choice of nodes, so the quadrature
formula with these weights a j is numerically stable. In practice, in order to avoid
too many additional cosine evaluations, one tabulates the x j and a j for various
values of n (e.g., for n = 1,2.4. 8, 16•... ).
Until now we have specified the quadrature formulas in such a way that the
x. were somehow fixed beforehand and the weights a j were then computed.
./ve have thus not made use of the fact that we can also choose the nodes in order
to get a higher order with the same number of nodes. In order to find o~ti.mal
nodes, we prove the following proposition, which points out the conditions
under which an order higher than n + I can be achieved.
I
=Q(j) =2:: ail(xi),
holds for all j of the form (2.3).
k=I:Ln/2j
1
L 4k2I)
_ 1 = n(2Ln/2j + I)
w(x)j(x) d x
or equivalently
II
(_
1 2
b
1=0:11
have proved in practice to be more appropriate nodes for integration. The Corresponding interpolatory quadrature formulas are called Clenshaw-Curtis formulas. We give without proof explicit formulas for their weights:
~
191
+ 1 + k precisely whenever the relationship
.
(p"Pj}=O
is satisfied.
fori=O,l, ... ,j_l
193
4.2 Gaussian Quadrature Formulas
Numerical Integration
192
To every weight function w(x) satisfying (2.5) that is continuous in (a, h)
there corresponds a system of orthogonal polynomials that is unique to within
normalization. Such a system may be determined, for example, from the sequence I, x , ... , x" with Gram-Schmidt orthogonalization.
Based on the following theorem, we can obtain quadrature formulas with
optimal order from such systems of orthogonal polynomials.
Pn+I (x), that is, the roots of Pn+1 (x) correspond to those of q (x) and hence
to the nodes. By Theorem 4.2.2, the quadrature formula is thus unique.
We now determine the sign of the weights and the location of the nodes.
For this, we again use the Lagrange polynomial Lk (x), and choose the two
special polynomials
4.2.5 Theorem. Suppose w(x) is a weight function which is continuous in
of degree x 2n
+ I for
f in (2.8). Because Lk (Xj)
(a, h) and which satisfies (2.5). Then for each n ::: 0 there is exactly one
quadrature formula
ak
L
GnU) :=
aJ!(x))
+ 2. Its weights
Xk
(2.6)
lim GII(f)
=
l
w(x)f(x) dx
=
1
-Gil (h)
ak
f(x) -
b
L k(x)2 w(x) dx ::: 0
=
t xL k(x)2 w(x) dx
~b?
J" Lk(X)-W(x) dx
E
la, h[.
L
L j(x)2 f(xj)
= g(X)Pn+l(X)
)=O:n
function w(x). In particular, the relationship
n----+-OO
l
=
(ii) Conversely, if the Xi are the roots of Pn+l (x), the weights are given by (2.6),
and f is a polynomial of degree j; 2n + 1, then
are nonnegative and its nodes x) all lie in the interval ]a, h[. The nodes are the
roots ofthe (n + 1)st orthogonal polynomial Pn+l(x) with respect to the weight
b
Gn(!I)
and
)=O:n
with approximation order 2n
=
= 8k)' we obtain
(2.7)
Q
holds for all continuous functions f in [a, b].
The uniquely determined quadrature formula Gil (f) is called the (11 + I)-point
Gaussian quadrature rule for the weight function w(x) in [a, h].
Proof
(i) Suppose Gil satisfies
for every polynomial f of degree ::::2n + 1. By Proposition 4.2.3, (q, Xi) :::::
othen holds for i = 0, ... , n. Thus q (x) is a polynomial of. degree n .+ ) that
I to
is orthogonal to all polynomials of degree r- n, so q (x) IS proportJOna
for some polynomial g of degree x n, since the left side has each x) as a
root. Hence
l
b
f(x)dx - GnU)
=
=
l
l
b
(f(X) -
j~' L
J(X)2
f(X j)) dx
b
g(X)PII+1 (x) dx
=0
by (2.2). Thus G n has indeed the approximation order 2n
Statement (2.7) now follows from Theorem 4.1.6.
+ 2.
D
Gaussian quadrature rules often give excellent results. However, despite the
good order of approximation, they are not universally applicable. To apply
them, it is essential (unlike for formulas with equidistant weights) to have a
functiIOn given
.
. (or a program) because the nodes Xi are in
by an expression
general irrational.
~abulated nodes and weights for Gaussian quadrature rules for many different
WeIght functions can be found, for example, in Abramowitz and Stegun [I].
4.2 Gaussian Quadrature Formulas
Numerical Integration
194
4.2.6 Example. Suppose w(x) = I and [a, b] = [-I, 1]. The corresponding
polynomials are (up to a constant normalization factor) the Legendre polynomials. Orthogonalization of I, x , ... , x n gives the sequence
195
A 3-term recurrence relation like (2.9) is typical for orthogonal polynomials:
4.2.7 Proposition. Let Pk(X) be polynomials ofdegree k = 0, 1, ... with highest coefficient 1. orthogonal with respect to the weightfunction W(x). Then, with
Po(x)
=
I,
PI(X)
= x,
pz(x)
=x
and, continuing recursively for arbitrary
Z
-
1
3'
P3(X)
=x
3
3
P-l (x)
- S-x,
= 0, Po(x) =
I, we have
(2.10)
i,
with the constants
(2.9)
aj =
For n ::: 2, we obtain the following nodes and weights. For n
= 0,
b, =
Xo = 0,
forn
ao
= 2,
f
f
7_I(x)dx/
W(X) XP
W(X)P7_ I(x)dx/
ti >
W(X)P7-1 (x) d x
w(x)p7_z(x)dx
1),
(j::: 2).
Proof We fix j and expand the polynomial Xpj-l (x) of degree j as a linear
combination
= 1,
and for n
f
f
Xo
= vfi73,
ao
=
Xl
= vfi73,
a1
= 1,
1,
Xpj-l (x)
=
L
aIPI(x).
I=O:j
Taking inner products with Pk (x) and using the orthogonality, we find (x Pj -1,
Pk) = adpb Pk), hence
= 2,
Xo
= -J3j5, ao = 5/9,
Xl = 0,
al = 8/9,
Xz =
J3j5,
az = 5/9.
By comparing the highest coefficients, we find aj
=
I, hence
(2.11)
If we substitute z := c +hx (and add error terms from [I]), we obtain for n = I
°
the formula
By definition ofthe inner product, (XPj-l, Pk) = (Pj-l, xpd = for k < j-2
Since P J -I is orthogonal to all polynomials of degree < j - 1. Thus ak = for
k < j - 2, and we find
which requires one function evaluation less than Simpson's rule, but has an
error term of the same order 0 (h 5 ) . For n = 2, we obtain the formula
XPj-l (x)
=
Pj(x)
°
+ aj-IPj-l (x) + aj-ZPj-z(x).
This gives (2.10) with
f
C+ h
c-h
f(z) dz
=
h
9(5f(c - hJ3j5)
h
7
+ 315o f
(6)
+ 8f(c) + 5f(c + hJ3j5»
(~),
aj
bj
= aj-l =
= aj-Z =
(XPj_l, Pj-z)/(Pj-Z, Pj-z)
= (Pj-l, Pj-I)/(Pj-2, Pj-z)
with the same number of function evaluations as Simpson's rule, but an error
term of a higher order 0 (h 7 ) .
(XPj_l, PJ-I)/(Pj-l, Pj-I),
because (Xp/-l, Pj-2)
=
(XP/-2, Pi-l)
=
(Pj_l, Pj-I) by (2.11).
D
Numerical Integration
4.3 The Trapezoidal Rule
If the coefficients of the three-term recurrence relation are known, one can
compute the nodes and weights for the corresponding Gaussian quadrature
rules in a numerically stable way by solving a tridiagonal eigenvalue problem
(cf. Exercise 13). (This is preferable to forming the Pj explicitly and finding
their zeros, which is much less stable.)
for the integral of f over the subinterval [Xi -1, Xi]. By summing over all subintervals, we obtain the following approximation for the integral over the entire
interval [a, b]:
196
197
4.3 The Trapezoidal Rule
Gaussian rules have error terms that depend on high derivatives of f, and hence
are not suitable for the integration of functions with little smoothness. However,
even nondifferentiable continuous functions may always be integrated by an
approximation with Riemann sums. In this section, we refine this approach.
We assume that we know the values f (x j) (j = 0, ... , N) of a function
f(x) on a grid a = Xo < Xl < ... < XN = b. When we review the analytical
definition of the integral in terms of Riemann sums, we see that it is possible
to approximate the integral by summing the areas of rectangles with sides of
lengths Xi - Xi-I and f(Xi-d or f(Xi) (Figure 4.1).
We also immediately see that we obtain a better approximation if we replace
the rectangles with trapezoids obtained by joining a point (Xi-I, f (Xi-1» with
the next point (Xi, f (Xi»; that is, we replace f (x) by a piecewise linear function through the data points (Xi, f (Xi» and compute the area underneath this
piecewise linear function in terms of a sum of trapezoidal areas. Because the
area of a trapezoid is given by the product of its base and average height, we
obtain the approximation
l
X,
((Xi)
. f(x) dx ~ (Xi - Xi-I) . .
After combining terms with the same index i, we obtain the trapezoidal rule
over the grid a = Xo < XI < '" < XN = b:
T(f) : =
Xo f(xo)
2
XN - XN-I
XI -
+
2
+
L
i=1:N-l
Xi+1 - Xi_1 f(Xi)
2
!(XN).
(3.1)
For equispaced nodes Xi = a + i . h N obtained by subdividing the interval [a, b]
into N equal parts oflength h N : = (b - a) / N, the trapezoidal rule simplifies
to the equidistant trapezoidal rule
(3.2)
with
hN=(b-a)/N,
xi=a+ih N.
+ !(Xi-l)
2
Xl-I
Figure 4.1. The integral as a sum of trapezoidal areas.
4.3.1 Remark. Usually, the nodes in an equidistant formula are not stored but
are computed while summing TN (f). However, whenever N is somewhat larger,
the nodes are not computed accurately with the formula Xi = Xi-I + h because
rounding errors accumulate. For this reason, it is strongly recommended that
the slightly more expensive formula Xi = a + ih be used. (In most cases, the
dominant cost is that for getting the function values, anyway.) Roundoff errors
can be further reduced if we sum the terms defining TN (f) in higher precision;
in most cases, the additional expense required for this is negligible compared
with the cost for function evaluations.
Clearly, the order of approximation of the trapezoidal rule is only n + 1 = 2,
so Theorem 4.1.6 does not apply. Nevertheless, convergence to the integral
follows because continuous functions may be approximated arbitrarily closely
by piecewise linear functions. Indeed, for twice continuously differentiable
functions, an error bound can be derived, using the following lemma.
4.3 The Trapezoidal Rule
Numerical Integration
198
4.3.2 Lemma. If f is twice continuously differentiable in [Xi -I, Xi]' then there
is a ~i E [Xi-I, xil such that
Proof From Lemma 4.3.2, the error is
I
l
b
f(x) dx - T(f)1
1- L
=
a
Proof We use divided differences. For X E [Xi-I, Xi], we have
=
f(x) dx
1,x'\ (f(Xi-l) + j[Xi-l, Xi](X -
(X - X l)h 2
j[Xi_I,Xi,X](X -Xi_I)(X -xi)dx.
Xi-l
The mean value theorem for integrals states that
1
b
f(x)g(x) dx = f(O
I"
a
g(x) dx
for some
~ E [a,
b],
provided f and g are continuous in [a, b] and g has a constant sign on [a, b].
Applying this to the last integral with g(x) := (x - Xi-l)2 - (Xi - Xi-I)(x Xi-l) = (X - Xi_I)(X - xt> :s 0, we obtain, after slight modification,
for suitable ~:' ~i
E
o
[a, b].
The following theorem for the trapezoidal rule results.
I
f (X) d X - T (f )
I :s
h2
11f"ll oo = 12 (b
-
a)IIf"lI oo'
0
4.3.5 Lemma. There exist numbers Y2j with Yo
degree j such that the relationship
g(l)
b
I~-
In order to increase the accuracy with which the trapezoidal rule approximates
an integral by a factor of 100, h must be decreased to h /10. This requires
approximately 10 times as many nodes and therefore approximately 10 times
as many function evaluations. That is too expensive in practice. Fortunately,
significant speed-ups are possible if the nodes are equidistant.
The key is the observation that, for equispaced grids, we can replace the
norm bound in Theorem 4.3.3 by an asymptotic expansion. This fact is very
important because it allows the use of extrapolation techniques to improve the
accuracy (see Section 4.4).
By Theorem 4.3.3, TN(f), given by the equidistant trapezoidal formula
(3.2), has an error approximately proportional to h~. More precisely, we show
here that it behaves like a polynomial in h~, provided f has sufficiently
many derivatives. To this end, we express the individual trapezoidal areas
~ (f (Xi) + f (Xi -I» in integral form. Because we may use a linear transformation to get Xi-I = -I, Xi = I, h = 2, we first prove the following.
4.3.3 Theorem. For the trapezoidal rule on a grid with mesh size h, the bound
I
I
f"(~i)1
4.3.4 Remark. The formula in Theorem 4.3.3 indeed gives the correct order
0(11 2 ) , but is otherwise fairly rough because it does not take account of variations in f" in the interval [a, b]. As we see from the first expression in the proof,
we may allow large steps Xi - Xi-I everywhere where I f"l is small; however, in
order to maintain a small error, we must choose closely spaced nodes at those
places where I f"l is large. This already points to adaptive methods, which shall
be considered in the next section.
Xi-l)
Xi-l)(X - Xi» dx
(Xi - Xi_I)2
= f(Xi-I>(Xi - Xi-I> + f[Xi-l, xil
2
i
12
l=l:N
+ j[Xi-l, Xi, x](x -
+
(Xi - Xi-I)3
i=I:N
:SL
1.~1
199
+ g(-l) =
2
-I
h12(b-a)IIf"lloo
holds for all twice continuously differentiable functions f defined on [a, b ].
/1 ( L
I and polynomials p j (x) of
n.s?" (x) + pm+l(X)g(m+lJ(X») dx
(3.3)
j=O:Lm/2J
holds for m 2: 0 and all (m
g:[-I,I]~C.
=
+
I)-times continuously differentiable functions
In order to prove (3.5), we assume that (3.5) is valid for odd m < 2k; this
is certainly the case when k = O. We then plug the function g(x) := x m for
m = 2k + 1 into (3.4) and obtain
Proof We define
Im(g)
LC~~-,
:~
yjgU}(x)
+ pm(x)g,m}(x)) dx
for suitable constants Yj and polynomials Pm (x), and determine these such that
the conditions
L; (g) = g(1)
Ym
+ g(-l)
=0
However, the integral can be computed directly:
_
(3.4)
for m :::: 1,
Im(g) -
(3.5)
for odd m :::: I
PI(X):= x,
=
L
I (
=m! (
Yj
'-0'.m -I
J-
j=~_/j
m.
,
. ,x
(m - }).
m-j
+ Pm(x)m.,)
l-(-l)m+l-j
dx
)
(m+I-j)! +2Ym
.
Because m is odd, the even terms in the sum vanish; but the odd terms also
vanish because of the induction hypothesis. Therefore, m! ·2Ym = 0, so that
(3.5) holds in general.
D
we find
!](g)
I
-I
follow, which imply the assertion. By choosing
Yo:= 1,
201
4.3 The Trapezoidal Rule
Numerical Integration
200
ill (g(x) + xg'(x» dx
4.3.6 Remark. We may use the same argument as for (3.5) to obtain
=
I
-I
I
and (3.4) holds for m
Yj := -I
2
II
I
(xg(x»' dx = xg(x)I_1 = g(l)
+ g( -1),
=
YZi
+ I)!
= -1(2k)!
fork> 0
-
1. Also, with
Pj(x) dx,
Pj+I(X):=
IX (Yj -
pj(z»dz
for i >: I,
for m = 2k; from this, the YZi (related to the so-called Bernoulli numbers) may
be computed recursively. However, we do not need their numerical values.
-I
-I
From (3.3), we deduce the following representation of the error term of the
equidistant trapezoidal rule.
we find
so that, for (m
L
i=O:k (2k - 2i
+ I )-times continuously differentiable functions
4.3.7 Theorem. (Euler-Macl.aurln Summation Formula). Iff is an (m
g,
1>
TN(f)
=
1
a
= ill
=
[II
(P;Il+1 (x)g(m)(x) + Pm+1 (x)g(m)' (x») dx
(Pm+l(X)g(m)(x»)' dx
=0.
Thus, (3.4) is valid for all m :::: 1.
= Pm+l(x)g(m)(x)I~1
+ 1)-
times continuously differentiable function, then
f(x) dx
+
L
Cjh~
+ rm(f),
i> 1: Lm/ZJ
where the constants, independent of N, are
CJ'
=
2- ZjYz.I·(f(Zj-I)(b) - f(2 j-I)(a»)
(.} = I , ...
and where the remainder term rm(f) is bounded by
, Lm /2J) ,
(3.6)
for some constant C m depending only on m. In particular; rm(f)
polynomials f of degree s: m.
o for
Proof In order to use Lemma 4.3.5, we substitute
+ (i +
= I, ... , N we obtain
By summing (3.7) over i
TN(f)
x;(t) := a
203
4.4 Adaptive Integration
Numerical Integration
202
-l
b
f(x) dx
j
L
t-I) h ,
-2-
2:h ) 2
(
Y2j(J(2 j-l)(b) - f(2 j-ll(a))
+ rm(f),
j=I:Lm/2J
where the remainder term
and set g;(t) := f(x;(t)). We then have
lemma implies
X;-l
= x;(-I),
X;
= x;(I), so the
is bounded by
Irm(f)1 ::::
~
L
2C mh m+ 11If(m+ltXl
=
C mh m+
1(b
- a)ll/m+ltXl'
D
j=l:N
The Euler-MacLaurin summation formula now allows us to combine the
trapezoidal rule with extrapolation procedures and thus introduce very effective
quadrature rules.
4.4 Adaptive Integration
Ifwe compute a sequence of values TN(f) for N = No, 2No, 4No, ... , then we
can obtain better quadrature rules by taking appropriate linear combinations.
To see this, we consider
TN(f)
=
h; (f(a)
+ feb) + 2
;=f;-l f (a + ih
N ) ),
where the remainder term
2N(f) = h: (f(a) + feb) + 2 ;=f;-1 f(a + ih N)
T
+2;~N f(a + (i - ~)hN))
is bounded by
=
~ ( TN(f) + h ;~N f (a + ( - ~) h N) )
N
Hence, if we know TN(f), then to compute T2N (f) it is sufficient to compute
the expression
with constant
C;
=
I
2m + 2
/1
-L
IPm+l (t)1 dt.
(4.1)
204
Numerical Integration
4.4 Adaptive Integration
with N additional function evaluations, and we thus obtain
we find for the simple Simpson rule S(f) applied to [Xi-I, x;J,
(4.2)
The expression R N (f) is called the rectangle rule. The linear combination
IXi
+ f(b) + 2
=
l
f[yo, Yl, YI, Yz, y](y - Yo)(y - y])Z(y - yz) dy,
where Yo:= Xi-I, Yl := (Xi-I +xi)/2, and yz := Xi. Because (y - YO)(Yz
.
y]) (y - yz) ::::: 0 for y E [Xi-I, X;], we may apply the mean value theorem for
integrals to obtain
(4.3)
is called the (composite) Simpson rule because it generalizes the simple Simpson
rule, which is obtained as SI (f). Because
I (f) :=
Jrr
X,-l
i=f;-I f(a + ih N)
+4i~N f(a + (- DhN))
f(x) d x - S(f)
X,-~l
4TzN(f) - TN(f)
TN (f) + 2R N(f)
SN(f) :=
=
3
3
= h6N (f(a)
fXi
f(x)dx - S(f)
Xj-l
= f[yO, YI, YI, Yz,
n IXi (y -
YO)(Y - y])Z(y - yZ) dy
X,_I
b
f(x) dx = lim TN(f),
Ns-s co
a
205
= ( Xi+12-
4(h;
= -is
we also have
Xi)5
.1
1
f[Yo,Yl,Yl,Yz,n _ltZ(tZ-l)dt
)5 f[yo, YI, YI, Yz, n
lim RN(f) = lim SN(f) = I(f).
N~oo
N~oo
However, although TN (f) and RN (f) approximate the integral I (f) with an
error of order O(h~), the error in Simpson's rule SN(f) has the more advantageous order O(h~). This is due to the fact that the linear combination (4.3)
exactly eliminates the term clh~ in the Euler-MacLaurin summation formula
(Theorem 4.3.7). We give another proof that also provides the following error
bound.
4.4.1 Theorem. Suppose f is afour-times continuously differentiablefunction
on [a, b]. Then the error in Simpson's rule SN(f) is bounded by
I (hN)5
= -90
2
f
(4)
(~)
for suitable ~!, ~ E [Xi-I, x;J. By summation over all subintervals [Xi-I, X;],
i = I, ... , N, we then obtain
II,r f(X)dX-SN(f)( :::::N(h2
h~
N)5 1If (4)1100 =
(b-a)llf4 ) 11
90
2880
00'
The linear combination
16SzN - SN
MN := - - .,--::-- 15
Proof We proceed similarly to the derivation of the error bound for the trapezoidal rule, and first compute a bound for the error on a subinterval [Xi-I' xiJ·
Because the equation SN(f) = I(f) holds for f(x) = Xi (i = 0,1,2,3),
Simpson's rule is exact for all polynomials of degree :::::3. By Theorem 3.1.4,
0
(4.4)
is c~lled the (composite) Milne rule because it generalizes the simple Milne rule,
WhIch is obtained as M I (f). The Milne rule similarly eliminates the factors Cj
and Cz in the Euler-MacLaurin summation formula (Theorem 4.3.7), and thus
has an error of order O(h~). We may continue this elimination procedure; with
Numerical Integration
206
a new notation we obtain the following Romberg triangle.
with
k
To,o = TN
T1,o = T2N TI,l
T2,O = T4N T2,1
T3,O = TS N T3,1
=
=
=
T2 .2
=
MN
4.4.3 Remarks.
+ 2)-times continuously differentiable on
To,o := TN(f),
I
Ti,o := T2iN(f) = 2:(Ti- I,O
+ R 2i-1N(f))
(i = 1,2, ...)
and
Ti,k:=Ti,k-l+
then T;,k(f)
Ti,k-l - Ti - 1.k 4k - 1
1
jork=I, ... .L,
= Ti,k has order 2k + 2 with respect to w(x) =
(45)
.
1, and
(4.6)
holds for each i :::: 0 and k :::: m.
Proof We show by induction on k that constants ejk) (j > k) exist such that
t., =
I
a
b
j(x)dx
(k) 2j
2m+2
h2'N + O(h 2iN ).
+ j=fu:m e j
By Theorem 4.3.7, this is valid for k
= O. It follows for k
> 0 from
Ti,k-I - Ti-I,k-l
Ti,k = T;,k-l +
4k _ I
4 kTi,k_l - T;-I,k-l
4k - I
b
=
1
a
jex)dx
""
(k-I)
+ j=fu:m e j
(j>k).
Formula (4.6) follows. Proposition 4.1.8 then shows that the order of T;,k is
2k + 2.
0
T3,2 = M2N
We now formulate this notation precisely and bound the error.
4.4.2 Theorem. Suppose j is (2m
[a, b]. Ijwe set
2'
(k)._ (k-I)4 - 2 ]
e j .-ej
4k-1
SN
S2N
S4N
207
4.4 Adaptive Integration
4 k 22 ]
2]
( 2 +2)
4k _ 1 h 2' N + 0 h 2':'N
(i) To compute Tn,n, an exponential number 2n of function evaluations are
required; the resulting error is of order O(h 2n+2 ) . For errors of the same
order, Gaussian quadrature rules only require n + I function evaluations.
Thus the method is useful only for reasonably small values of n.
(ii) The Romberg triangle is equivalent to the Neville extrapolation formulas
(Algorithm 3.2.4 for q = 2) applied to the extrapolation of (Xl, ft) =
(h~, TN(f)), N = 21 to X = O. The equivalence is explained by the EulerMacLaurin summation formula (Theorem 4.3.7): TN(f) is approximately
a polynomial Pn of degree n in h~, and the value of this polynomial Pn at
zero then approximates the integral.
(iii) One can use Neville extrapolation also to compute an approximation to
the integral f j (x) dx from TNI (x) for arbitrary sequences of N, in place
of N[ = 21• The Bulirsch sequence N, = 1,2,3,4,6, ... , 2i , 3 . 2i - l , ...
is a recommended alternative choice that makes good use of old function
values and often needs considerably fewer function values to achieve the
same accuracy.
A further improvement is possible by using rational extrapolation formulas (see, e.g., Stoer and Bulirsch [90]) instead of Neville's formula
because the former often converge more rapidly.
(iv) It can be shown that the weights of T;,k are all positive and the bound
lIb
j (x) dx -
t., I ::::
Y2k+2h~~:/ II j(2n+2) 1100
(n = minCk, m))
holds. Thus, for k :::: m, the kth column of the Romberg triangle converges
to I (f) as i ---+ 00 with order h~~:/. By Theorem 4.1.6, the T;,k even
converge to I (f) as i + k ---+ 00 for arbitrary continuous functions, and
indeed, the better that the functions can be approximated by polynomials,
the better the convergence.
In practice, we wish to be able to estimate the error approximately without
computation of higher order derivatives. For this purpose, the sizes of the last
Numerical Integration
208
correction Ei,k := 1 Ii,k-I - Ii.k I may be chosen as an error estimate for Ii.k in
the Romberg triangle. This is reasonable if we can guarantee that, under appropriate assumptions, the error II (f) - Ti,k I is actually bounded by Eik, By
Theorem 4.4.2, we have II (f) - Ii,k I = O(h 2k+ 2 ) with h = 2- i (b - a) and
2k
2k
Eik = O(h ) because l/(f) - Ti .k - 1 1 = O(h ) . We thus see that the relation
II (f) - Ii,k I :::: Eik is usually true for sufficiently large i.
However, two arguments show that we cannot unconditionally trust this error
estimate. When roundoff errors result in Ii,k = T;,k-I, we have Eik = 0 even
though the error is almost certainly larger. Moreover, the previous argument
holds only asymptotically as i _ 00. However, we really need an error estimate
for small i. The previously mentioned bound then does not necessarily hold, as
we see from the following example of a rapidly decreasing function.
fd
4.4.4 Example. Ifwe integrate/:=
(l + ~ e) dx exactly, we obtain I =
1.02000 .... With numerical integration with the Romberg triangle (with N = I),
a desired relative error smaller than 0.5%, and computation with four significant
decimal places, we obtain:
25x
To,o = 1.250
TI.o
T2.0
= 1.125 Tl,I = 1.083
= 1.063 hi = 1.042
T2 ,2 = 1.039,
where e-, = Ih2-T2,11 = 0.003, so I ~ 1.039 ±0.OO3. However, comparison
with the exact value gives II - T2,21 ~ 0.019; the actual error is therefore more
than six times larger than the estimated error.
Thus, we need to choose a reasonable security factor, say 10, and estimate
I (f) ~
t., ±
10IIi,k-l -
Ti,kl·
(4.7)
Adaptive Integration by Step Size Control
The Romberg procedure converges only slowly whenever the function to be
integrated has problems in the sense that higher derivatives do not exist or
are large at least one point (small m in Theorem 4.4.2). We may ameliorate
this with a more appropriate subdivision in which we partition the original
interval into subintervals, selecting smaller subintervals where the function
has such problems. However, in practice we do not usually know these points
beforehand; we thus need a step size control, that is, a procedure that, during
the computation subdivides recursively those intervals that contribute too large
an error in the integration.
We obtain such a procedure if we work with a small part of the Romberg
triangle, Ii.k with k SiS n (and N = I), to compute Tn,n (n = 2 or 3). If
the resulting approximation is not accurate enough, we then halve the original
interval and compute the sub integrals separately. The Tu with k :::: i :::: n - I for
the resulting subintervals can be computed easily from the already computed
function values, and the only additional function values needed to compute
the Tn .k are those involved in R N (f), N = 2"-1. This can be used in the
implementation to increase the effectiveness.
To avoid infinite loops, we may prescribe a lower bound h min for the length
of the subintervals. If we fall below this length, we accept Tn,n as the value
of the subintegral and we issue a warning that the desired precision has not
been attained. To avoid exceeding a prescribed error bound ~ for the entire
integral I [a, bl =
f(x) dx, it is appropriate to require that every subintegral
It"
J:
IL~., xl =
f(x) dx has an error of at most ~(.\: - :5...)/(b - a); that is, we
assign to each sub integral that portion of the error corresponding to the ratio
of the length of the subinterval to the total length. However, this uniform error
strategy has certain disadvantages.
This is because in practice, many functions to be integrated behave poorly
primarily on the boundary; for example, when the derivative does not exist
there (as with Vi cosx for x = 0). The subintegrals on the boundary are then
especially inaccurate, and it is advisable to allow a larger error for these in
order to avoid having to subdivide too often. This can be achieved when we
assign the maximum error per interval with a boundary dominant error strategy.
For example, we may proceed as follows: If the maximum allowed error in the
interval [:5..., ."l is ~o, we allow errors ~I and ~2 for the intervals [:5..., xl and
[x, xl produced from the subdivision, where
x<
~I = ~~o,
~2
~1 = ~~o,
~2 = ~~o
if a < :5..., x = b,
~l = ~~o,
~2 = ~~o
otherwise.
= ~~o
if:5... = a,
b,
If we use the portion of the Romberg triangle to T"./!, I + 2n s function
evaluations are required for s accepted subintervals. If we process the intervals
from left to right. then due to the splitting procedure, up to 10g«b - a)/ hm;n)
incompletely processed intervals may occur, and for each of them, 2/!-1 already
available function values, the upper interval bound, and the maximum error
need to be stored (cf. Figure 4.2). Thus, (2 n - 1 + 2) log«b - a)/ h min ) storage
locations are necessary to store information that is not immediately needed.
More sophisticated integration routines use variable order quadrature formulas
Numerical Integration
210
1
1
6
•
6/3
• • •
6/2
•
L
1~/.\5 L~!9j
1
•
6/2
. .......•...
.....•..
6/6
J
6/2
....•..
6/6
J
6/2
..........
1
•
•
•
1
4.5 Solving Ordinary Differential Equations
......•..
211
by making the system autonomous with the additional function Yo = t and the
additional differential equation (and initial condition)
yb
..... j
=
1,
Yo(to)
= to·
(In practice, this transformation is not explicitly applied, but the method is
transformed to correspond to it.) Similarly, higher order differential equations
can be reduced to the standard form (5.1); for example, the problem
.1
x"
....1
=
G(t, x),
x(to)
= xo,
x'(to)
=
vo
is equivalent to (5.1) with
Figure4.2. The boundary dominant error strategy, n = 2.
(e.g., a variable portion of the Romberg triangle), and determine the optimal
order adaptively by an order control similar to that discussed in Section 4.5.
The boundary dominant error strategy can also be applied whenever the
integrand possesses singularities (or singular derivatives) at known points in
the interior. In this case, we subdivide the original interval a priori according
to the singularities, then we treat each subinterval separately.
4.5 Solving Ordinary Differential Equations
In this section, we treat (for reason of space quite superficially) numerical
approximation methods for the solution of initial value problems for systems of
coupled ordinary differential equations. To simplify the presentation, we limit
ourselves to autonomous problems of the form
y;(t)
=
F;(Yl(t)""'YIl(t)),
Yi(tO)=Y?
(i=l, ... ,n),
(5.1)
or, in short vector notation,
Y' = F(y),
y(to) =
l.
However, this does not significantly limit applicability of the results. In particular, al1 methods for solving (5.1) can be generalized in a straightforward way
to nonautonomous systems
y' = Ftt , y),
y(to)
=
yO
We pose the problem more precisely as follows. Suppose a continuous function
F: D S; ]R1l -+ ]R1l, a point Yo E int(D), and a real interval [a, b) are given.
We seek a continuously differentiable function y: [a, b) -+ ]R1l with y'(t) =
F(y(t), yea) = yo. Each such function is called a solution to the initial value
problem y' = F(y), yea) = yO in the interval [a, b].
Peano's theorem provides rather general conditions for the existence of a
solution. It asserts that (5.1) has a solution in [a, a + h) for sufficiently smal1 h
and that this solution can be continued until it reaches the boundary of D (i.e.,
until either t -+ 00 or Ily(t)11 -+ 00, or y(t) approaches the boundary of D as
t increases).
Because most differential equations cannot be solved explicitly, numerical
approximation methods are essential. Peano's continuation property is the basis for the development of such methods: From knowledge of the previously
computed solution, we compute a new solution point a short distance h > 0
away, and we repeat such integration steps until the upper limit of the interval
[a, b] is reached.
One traditionally divides numerical methods for the initial value problem
(5.1) into two classes: one-step methods and multistep (or multivalue) methods.
One-step methods (also called Runge-Kutta methods) are memoryless and only
make use of the most recently computed solution point to compute a new
solution point, whereas multistep methods retain some memory of the history
by storing, using, and updating a matrix containing old and new information.
The successful numerical solution of initial value problems requires a
thorough knowledge of the propagation of rounding error and discretization
213
Numerical Integration
4.5 Solving Ordinary Differential Equations
error. In this chapter, we give only an elementary introduction to multistep
methods (which are closely related to interpolation and integration techniques),
and treat the numerical issues by means of examples rather than analysis.
For an in-depth treatment of the whole subject, we refer the reader to Hairer
et al. [35,36].
The method does converge because ail ---+ e as h ---+ O. The following table shows
what happens for larger step sizes.
212
hA
a
h):
Euler's Method
a
The prototypical one-step method is the Euler's method. In it, we first replace
y' by a divided difference to obtain
,---y(-'--.:t1,----)_---=-y_(t..::..:..-o)
t) - to
~
I ()
v to
•
=
F ( 0)
Y
or, with hi := tl - to,
.1
2.59
1
2.00
10
1.27
100
1.05
1000
1.0 I
-.01
2.73
-.1
2.87
-.5
4.00
-.9
12.92
-.99
104.76
-1
00
Once h); is no longer tiny, a is no longer close to e ~ 2.718. Moreover, for
Ii). < -1, the "approximations" l + I = (l + hA) i are violently oscillating
because I + h). is negative, and bear no resemblance to the desired solution.
:s h.): :s 1,
In conclusion, we have qualitatively acceptable solutions for
but good approximations only for IhAI « 1. For A
0 it could be expected
that the step size h must be very small, in view of the growth behavior of e" .
However, it is surprising that a very small step size is also necessary for A « 0,
even though the solution is negligibly small past a certain t-value! For negative
values of A (for example, when A = -105 and h = 0.02, see Figure 4.3), an
-!
»
Continuing in the same way with variable step size hi, we obtain the formulas
ti+1
.01
2.71
= t, + hi,
yi+1
= i + hiF(i)
for i = 0, ... , N - 1, and we expect to have y(ti) ~ i for i = 1, ... , N. To
define an approximating function y(t) for the solution of the initial value problem (5.1), we then simply interpolate these N + 1 points; for example, with
methods from Chapter 3.
One can show that, for constant step sizes hi = h, Euler's method has a
global error of order 0 (h) on intervals of length 0 (l), and thus converges to
the solution as h ---+ O. However, as the following example shows, sometimes
very small step sizes are needed to get an acceptable solution.
0.08
0.06
\ \ \
\
4.5.1 Example. Application of Euler's method with constant step size hi = h
to the problem
y'=Ay,
leads to the approximation
\ \
\ \ \
\ \ \ \ \ \ \ \
\ \ \ \ \ \ \ \
y(O)=l
\
\
\
\
\
\
\
-,
-
-,
-
-,
-
-,
-
-,
-
-,
-
-,
-
-,
-
/
/
/
/
/
/
/
/
/
/
/
/
/
/
/
/ / / / /
I
I
I I I I I
I I I I I
-0.06
-0.08
instead of the exact solution
-0.1 ':----'---J:-:-:-----'---=-'-~~-LL____L---L---L---LLL__L--"---__"___~_l...__l...__l._l_LLL_L_J
o
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Figure 4.3. Divergence of Euler's method for the stiff equation Vi
-105 and step size h = 0.02.
.
= -A v, with A =
.
214
oscillating sequence occurs that increases in amplitude with t , and the sequence
has nothing at all to do with the solution.
The reason for the poor behavior of Euler's method for A « 0 is the very
steep direction field of the corresponding differential equation in the immediate
vicinity of the solution. Such differential equations are called stiff; to solve such
equations effectively, special care must be taken. To improve the accuracy of
Euler's method, we may consider the Taylor series expansion
yet
+ h)
= yet)
215
4.5 Solving Ordinary Differential Equations
Numerical Integration
Table 4.1. Coefficients for Adams-Bashforth of order p
s
130
0
1
1
2
12
e,
fh
131
1
3
-2:
2:
23
5
16
55
3
= s+1
-12
12
59
37
-24
24
24
h2
+ hy'(t) + _y"(t) + 0(h 3 )
2
h
quadrature formula (see Section 4.2). We consider only the equidistant case
2
= y(t) + hF(y(t)) + 2:F'(y(t»F(y(t») + 0(h 3 ) .
This suggests the method
t;
b-a
= a + ih,
h
= t:I.
+ 1 points ti-s, ... , tt as data points, then
If one uses the last s
y(ti+l)
=
y(ti)
j
1i + h
+.
F(y(t)) dt
I,
which can be shown to have a global error of 0(h 2 ) . We may similarly take more
terms in the series to obtain formulas of arbitrarily high order. A disadvantage
of this Taylor series method is that, for F (y) E IRn, we must compute the
entries of the Jacobian F' (y) (or even higher derivatives); usually, this additional
cost is unwarranted. However, if combined with automatic differentiation, an
advantage is an easy rigorous error analysis with interval methods (Moore [63],
Lohner [57], Neumaier [70a]).
Multistep Methods
To derive methods with increased order that avoid the use of derivative information, we assume that the values
have already been computed for j ::: i. If we integrate the differential equation,
y(tH])
=
y(li)
= y(ti) + h
L
+ 0(h s+ 2 )
{3jF(y(ti-j)
j=O:s
for certain constants {3j depending on s,
The constants {3 j may be computed by integrating the Lagrange polynomials;
they are given for s ::: 3 in Table 4.1. They differ from those derived in
Section 4.2 because now the interpolation points are not chosen from [ti, ti+IJ,
but lie to the left of ti. By dropping the error term, one obtains the following
explicit Adams-Bashfortli multistep method of order s + I:
ti+]
= t, -s- h,
yi+l
=
y' +h
L
{3jF(/-j).
j=O:s
The cost per step involves one function evaluation only. The global error can
be proved to be 0 (h 5 + ] ); as for numerical integration, one loses one power of
h for the global estimates because one has to sum over (b - a)/ h steps.
If one includes the point to be computed in the set of interpolation points,
one finds implicit methods with an order that is higher by I. Indeed,
Ii+ l
+
F(y(t»dt,
/ t,
y(ti+l)
=
y(td
+h
L
{3jF(y(ti-j»
+ 0(h
5+ 3
)
j=-1:5
we know on the right side the function values F (y (t j» ~ F (yj) = f j approximately and can therefore approximate the integral with an interpolatory
With s-dependent constants f3j given for s ::: 3 in Table 4.2. One obtains the
216
Numerical Integration
Table 4.2.
Coefficients for Adams-Moulton formulas of order
p
S
fJ-1
fJo
0
I
2:
2:
1
12
12
2
9
24
J9
24
251
646
5
3
720
4.5 Solving Ordinary Differential Equations
= s +2
fJl
fJ2
fJ3
older values
drastically.
s': J, j
> 0, the convergence and stability behavior may change
4.5.2 Example. The Simpson rule gives
I
1
8
-12
y(ti+l) - y(ti-d
-24
24
264
t:
ti-
h
F(y(t» dt
2h
= (5 (F(y(ti+d)
106
-720
720
=
J
5
217
720
+ 4F(y(ti» + F(y(ti_l») + O(h 5 ) ,
hence suggests the implicit multistep formula
following implicit Adams-Moulton multi-step method of order s
ti+1
+ 2,
= ti + h, i+1= v' + h (is_I F(yi+l) + J~/J F(yi- J»),
/+1 = /-1
+ ~(F(/+I) + 4F(/) + F(y/-I».
For the model problem
y'
in which the desired function value also appears on the right side.
Instead of solving this nonlinear system of equations, one replaces yi+1 on the
right side by the result of the corresponding explicit formula; the order, smaller
by 1, is restored when one multiplies by h. One thus computes a predictor yi+1
with an explicit method and then obtains the value yi+ J by using an implicit
method as a corrector, with y+1 on the right side.
For example, one obtains with s = 3 the fifth order Adams-Moulion predictorcorrector method:
(5.2)
= Ay,
yeO)
=
1,
we find
/+1
= yi-I + ~h (yi+1 + 4 y i + yi-I).
The general solution of this difference equation is
Y i = YIZIi + Y2Z2i
(5.3)
Where
ti+1 = ti +h;
% predictor step
yi+1 = yi + :P4(55f i - 59f i- 1 + 37
r"
=
t-
.
2 - 9f'-3);
ZI } _ 2Ah
Z2
F(yi+I);
= i + 7~0(251t+1 + 646fi
3 - Ah
-
-
{I + Ah + O«Ah)2)
-1
+ ¥- + O«Ah)2)
~ e'";
~ _e-Ah/3
are the solutions of the quadratic equation
% corrector step
yi+1
± J9 + 3A 2h2 _
- 264fi-1
+ 106fi-2 -
19f i-3);
fi+1 = F(yi+I);
,
Depending on the implementation, one or several corrector .
iterations
may be
performed. Compared with the explicit method, the cost per step increased to
two function evaluations (if only one corrector step is taken).
It can be shown that the methods of Adams-Bashforth and ofAdams-Mou Iton
converge with convergence order s+ I (i.e., global error O(hs+ 1 and s+2 (i.e..
global error O(hs+ 2», respectively. However, a formal analysis is important
here because, for general multistep methods where yi+I depends linearly on
»
G'
0
I
tven y and y , we can solve the equations YI
to find
YI - Z2
YI = - - - ,
ZI - Z2
Y2
+ Y2 = Yo = 1, YIZl + Y2Z2 = YI
ZI - YI
= ---.
ZI -
Z2
Now, if A < 0 then the solution yet) = e" decays for t ---+ 00; but unless Yl = ZI
(an unlikely situation), Y2 f=- 0 and (5.3) explodes exponentially, even for tiny
values of sh,
218
4.6 Step Size and Order Control
Solving Stiff Systems
/+1 is not computed as a sum of yi and a linear combination of the previous
s function values F (yi - j), but as a sum of the function value at the new point
Yi+ 1 and a linear combination of the previous s values yi -I.
For stiff differential equations, both the Adams-Bashforth methods and the
Adams-Moulton methods suffer from the same instability as Euler's method so
that tiny step sizes are required. However, the BDF-formulas have (for s ::: 5
only) better stability properties, and are suitable (and indeed, the most popular
choice) for integrating stiff differential equations. However, when using the
BDF-formulas in predictor-corrector mode, one loses this stability property;
therefore, one must find yi+l directly from (5.4) by solving a nonlinear system
of equations (e.g., using the methods of Chapter 6). The increased work per
step is more than made good for by the ability to use step sizes that reflect the
behavior of the solution instead of being forced by the approximation method
to tiny steps.
Properties of and implementation details for the BDF-formulas are discussed
in Gear [29]. A survey of other methods for stiff problems can be found in Byrne
and Hindmarsh [10].
Of considerable practical importance for stiff problems are additional im.plicit
multistep methods obtained by numerical differentiation. Here we approxIm~te
g(T) := yeti - Th) by an interpolating polynomial Ps(~) of degree s + 1 with
data points (T, g(T» = (l, yi-l) (l = -1,0, .... s). ThIS gives
g(T)
~
Ps+I(T):=
L
Lj(T)yi-J
J=-l:s
where
T
L)(T) =
-l
n~
I#}
is a Lagrange polynomial of degree s
}
+ 1; of course, for n
> I, we interpret
Ps+I (t) as a vector of polynomials. We then have
.
F(y'+I) ~ F(y(ti+l»
=
It is natural to fix
~h
v"
219
Numerical Integration
"
LJ
1,
. ~ 1 ,
= y'(ti+l) = kg (-1) ~
,:/s+1
(-1)
'
L'(-l)yi-J.
J
4.6 Step Size and Order Control
J=-l:s
implicitly through the condition
i-)y
+hf3I F(yi+I).'
Y i+1 = "LJ a j
(5.4)
)=O:s
(_I)anda'=-f3_
IL'(l)areeasilycomputed
t h e coe ffici
clen ts f3 -I -IlL'
-1
J
)
from the definition. They are gathered for s ::: 5 in Table 4.3.
.
The formulas obtained in this way (by Gear [29]) are called backwards differentiation formulas, or BDF-formulas; they have order s + 1. Note that here
For solving differential equations accurately, step sizes and order must be chosen
adaptively. In our discussion of numerical integration, we have already seen the
relevance of an adaptive choice of step size; but it is even more important in the
present context. To see this, we consider the following.
4.6.1 Example. The initial value problem
y'=l,
y(0)=3
has the solution
jor BDFfiormulas of order P = s
Tabl e 4 .3 . Coefficients
JJ'
s
f3
0
1
1
2
3
4
5
\
2
ao
only defined for t < 1/3. Euler's method with constant step size h gives
1
4
1
-3
18
TT
TT
-TT
12
48
6
3
y(t) = 1 - 3t'
as
al
3
3
+1
9
36
25
25
-25
60
137
60
147
300
137
360
147
300
-137
450
- \47
l
2
= 3.
yi+1 = /
+ h(/)2
for i ~ O.
TT
3
25
16
-25
200
75
-137
225
- 147
m
400
147
12
137
72
147
10
-147
AsTable4.4withresultsforconstantstepsizesh = 0.1 andh = 0.01 shows, the
method with step size h = 0.1 continues far beyond the singularity, without any
noticeable effect in the approximate solution, whereas with step size h = 0.0 I
the pole is recognized, albeit too late. If one were to choose h still smaller, then
220
Numerical Integration
Table 4.4.
4.6 Step Size and Order Control
Results for constant step sizes
using Euler's method
yi
.vi
t,
yeti)
h=O.l
h =0.01
0
0.1
0.2
0.3
0.4
0.47
0.5
0.6
0.7
3
4.29
7.5
30
3
3.9
5.42
8.36
15.34
3
4.22
7.06
19.48
3.35.10 6
overflow
I
38.91
190.27
3810.34
221
In each integration step, one can make several choices because both the
order p of the method and the step size h can be chosen arbitrarily. In principle
on~ coul.d change p and ~ in each step, but it has been found in practice tha~
doing this produces spunous instability that may accumulate t
.
0 gross errors
ill the computed solution. It is therefore recommended that bef
h
.
.
ore c angmg to
another step SIze and a possibly new order p, one has done at least p steps with
constant order and step size.
Local Error Estimation
To derive error estimates we must have a precise concept of the order f
method. We say that a multistep method of the form
0 a
' " ' CXjy'-j
' . +h L.J
' " ' f3jF(yi- j);
y i+1 = L.J
one would obtain the position of the pole correspondingly more accurately. In
general, one needs small h near singularities, whereas, in order to keep the cost
small, one wishes to compute everywhere else with larger step sizes.
j=O:S
(6.1)
j=O:s
has order p if, for every polynomial q(x) of degree <»,
q(l)= L(CXN(-})+f3jq'(-j».
(6.2)
j=O:s
4.6.2 Example. The next example shows that an error analysis must also be
done along with the step size control. The initial value problem y' = -y - 2t t».
yeO) = 1 has the solution y(t) = vT=2(, defined only for t :S ~. Euler's
method, carried through with step size h
0.001, gives the results shown in
Table 4.5. Surprisingly, the method can be continued without problems past
the singularity at t = 0.5. Only a weak hint that the solution might no longer
exist is given, by a sudden increase in the size of the increments yi+ I - / after
t = 0.525. Exactly the same behavior results for smaller step sizes. except that
the large increments occur closer to 0.5.
=
We conclude that a reliable automatic method for solving initial value problems must include an estimate for the error in addition to the step size control,
and must be able to recognize the implications of that estimate.
Table 4.5.
Results for y' = -y - 2t/y, yeO) = 1, using Euler's method
=================----o 0.100 0.200 0.300 0.400 0.499 0.500 0.510 0.520 0.525 0.530 n.600
----------~-----------------
y(ti) 1 0.894 0.775 0.632 0.447 0.045 0.000 yi
1 0.894 0.774 0.632 0.447 0.056 0.038 0.042 -0.023 -0.001 0.777 0.614
==============================
This is justified by the following result for constant step size h.
~.6.3 The~rem. Ifthe multistep method (6.1) has order p, then for all (p +2)times continuously differentiable functions y and all h E JR,
yet
+ h)
= L
CXjy(t - }h)
+ 11
j=O:s
L
f3jy'(t - }h)
+ Eh(t)
(6.3)
= 0(h P+ 1)
(6.4)
j=O:s
with an error term
Eh(t) = C h P + 1 [
h
p
Y t , t - , ... , t - (p
+ l)h] + 0(h p+ 2 )
and
c,
= 1 + L(-j)P(jCXj - (p+ l)f3j).
j=O:s
ProOf By Theorem 3.1.8,
Y rt , t - h , ... , t - (p
+ 1)hJ
= Y
(p+1)(
(p
t
+ r h)
+ I)!
(
_ y P+I)(t)
- (p + 1)1
+ O(h).
223
4.6 Step Size and Order Control
Numerical Integration
222
Taylor expansion of yet) and y'(z) gives (for bounded j)
If the largest of these values for a is not too close to I, the step size is changed
hi = ah and the corresponding order p is used for the next step. To guarantee
accuracy. one must relocate the needed t, _ j to their new equidistant positions
ti- j = t, - j hi and find approximations to ytt, - j hi) by interpolating the
already computed values. (Take enough points to keep the interpolation error
of the same order as the discretization error!)
Initially, one starts any multistep method with the Euler method, which
has order 1 and needs no past information. To avoid problems with stability
when step sizes change too rapidly (the known convergence results apply only
to constant or slowly changing step sizes), it is advisable to restart a multistep method when a « I and truncate large factors a to a suitable threshold
value.
to
vct + jh)
.
(k) ( )
=
"
L
k=O:p+!
2
~hk
/ + 0(h P+ )
k'
.
P+
= qhU) + h 1 jP+1 y[t, t - h, ... , t - (p
.(k+l)Ct)
Y
'(
t
'h)
+J
= "L }
k!
k=O:p
= h-lq~(j)
hkJ.k + O(h P+
+ (p +
I
+ I)h] + 0(h P+2 )
)
l)h PjPy[t, t - h, ... , t - (p
+ I)h]
+ O(h P + I)
with a polynomial qh(X) of degree at most p. Using (6.3) as a definition for
E (t), plugging in the expressions for yet + j h) and y' (r + j h) and using (6.2),
h
we find (6.4).
0
If the last step was of order p ; it is natural to perform a step of order p E
{p _ I, p , P + I} next. Assuming that one has already performed p steps of
step size h, the theorem gives a local discretization error of the form
/+1 _ yCti+l) =
CphP+1y[ti,"" ti-p-l]
+ 0(h P+
2
).
with yi+1 = yCti+l), where yct) is the Taylor expansion ofy(t) of order p. at
t.. The divided differences are easily computable from the y' - J • If the step SIze
h remains the same, the order p with the smallest value of
is the natural choice for the order in the next step.
However, if the new step size is changed to hi = edt (where a is still arbitrary),
1
the discretization error in the next step will be approximately a P+ w p ' For thiS
error to be at most as large as the maximum allowable contribution 8 to the
global error, we must have aP+lw p :s 8. Including empirically determined
safety factors, one arrives at a recommended value of a given by
if P = p - 1,
if P p,
if P = p + I.
=
The safety factors for a change of order are taken smaller than those for order P
to combat potential instability. One may also argue for multiplying instead by
suitable safety factors inside the parentheses, thus making them p dependent.
The Global Error
As in all numerical methods, it is important to assess the effect of the local
approximation errors on the accuracy of the computed solution. Instead of a
formal analysis, we simply present some heuristic that covers the principal
pitfalls that must be addressed by good software.
The local error is the sum of the rounding errors r, made in the evaluation of
(6.1) and the approximation error 0(h P+ 1 ) (for an order p method) that gives
the difference to YCti+I). We assume that rounding errors in computing function
values are O(SF) and the rounding error in computing the sum is O(s). Then
the total local error is O(hs F) + O(s) + 0(h P+ 1).
For simplicity, let us further assume that the local errors accumulate additively
to global errors. (This assumption is often right and often wrong.) If we use
constant step size h over the whole interval [a, b], the number of steps needed
is n = (b - a)/ h. The global error therefore becomes
yn _ yCtn) = O(n(hsF
+ S + h P+ I »
= 0 (SF
+ ~ + h P)
.
(6.5)
Although our derivation was heuristic only and based on assumptions that are
often violated, the resulting equation (6.5) can be shown to hold rigorously
~or all methods that are what is called asymptotically stable. In particular, this
lllcludes the Adams-Bashforth methods and Adams-Moulton methods of all
orders, and the BDF methods of order up to six.
In exact arithmetic, (6.5) shows that the global error remains within a desired
bound ~ when O(~llp) steps are taken and F is a p times continuously differentiable function. (Most practical problems need much more flexibility, but
224
Numerical Integration
4.7 Exercises
it is still an unsolved problem whether there is an efficient control of step size
and order such that the same statement holds.)
In finite precision arithmetic, (6.5) exhibits the same qualitative dependency
ofthe total error on the step size h as in the error analysis for numerical differentiation (Section 3.2, Figure 3.2). In particular, the attainable accuracy deteriorates
as the step sizes tend to zero. However, as shown later. there are also marked
differences between these situations because the cure here is much simpler as
in the case of numerical differentiation.
As established in Section 3.2, there is an optimal step size h opt = 0 (E I/(P+l».
The total error then has magnitude
In particular, one sees that not only is a better accuracy attainable with a larger
order, but one may also proceed with a larger step size and hence with a smaller
total number of steps, thus reducing the total work. We also see that to get
optimal accuracy in the final results, one should calculate the linear combination
(6.1) with an accuracy E « Ejf+I)/p. In particular, if F is computed to near
machine precision, then (6.1) should be calculated in higher precision to make
full use of the accuracy provided by F.
In FORTRAN, if F is computed in single precision, one would get the desired
accuracy of the Yi by storing double precision versions yy(s - j) of the recent
Yi- j and proceed with the computation of (6.1) according to the pseudo code
Fsum = LfjF(y(i - j)
yy(s + 1) = LCtjYY(S - j) +dprod(h, Fsum)
y(i + 1) = real(yy(s + 1»
yy(l : s) = yy(2 : S + 1)
(The same analysis and advice is also relevant for adaptive numerical integration, where the number of function evaluations can be large and higher precision
may be appropriate in the addition of the contributions to the integral.)
A further error, neglected until now, occurs in computing the t, . At each step,
we actually compute t;+1 = t, + h, with rounding errors that accumulate significantly in a long number of short steps. This can lead to substantial systematic
errors when the numher of steps becomes large. In order to keep the roundoff
error small, it is important to recompute the actual step size h., = t;+1 - t, used,
from the new (approximately computed) argument ti+1 = t, hi; because of
cancellation that typically occurs in the computation of t;+l - t., the step size hi
computed in this way is usually found exactly (cf. Section 1.3). This is one of
the few cases in which cancellation is advantageous in numerical computations.
+
225
4.7 Exercises
1. Prove the error estimate
l
a
b
f(x) dx = (b - a)f
(a-2+ b) + (b -24a)3 r(~),
~
E [a, b]
for the r~ctangle rule (also called midpoint rule) if f is twice continuously
differentiable functions f(x) on the interval [a, b].
2. Because the available function values 1(x j) = f (x j) + E j (j = 0, ... , N)
are general~y contaminated by roundoff errors, one can ask for weights
aj chosen III such a way that the influence of roundoff errors is small.
Under appropriate assumptions (distribution of the E i with mean zero and
standard deviation (J), the root mean square error of'a quadrature formula
Q(f) = Lj=O:N aj/(Xj) is given by
rN =
(J
jL
CY;'
j=O:N
(a) Show that r» for a quadrature formula of order > 1 is minimal when
all weights CY j are equal.
(b) A quadrature formula in which all weights are equal and that is exact
for polynomials p(x) of degree <n is called a Chebychev quadratureformula. Determine (for w(x) = 1) the nodes and weights of the
Cheby~hev formula for n = 1, 2, and 3. (Remark: The equations for
. the weights and points have real solutions only for n ::e 7 and n = 9.)
3. G~ven a closed interval [a, b] and nodes x) := a + jh, j = 0, 1,2,3
WIth h := (b - a)/3, de~ve a quadrature formula Q(f) = Li=O'3 CYi f(Xi)
that matches l(f) = fa f(x) dx exactly for polynomials of degree <3.
Prove :he error formula l(f) - Q(f) = toh5 f(4)(0, ~ E [a, b] for
f E C [a, b]. Is Q(f) also exact for polynomials of degree 47
4. Let Qk(f) (k = 1,2, ...) be a normalized quadrature formula with nonnegative we.ights and order k + I. Show that if f(x) can be expanded in
a power series about with radius of convergence r > 1, then there is a
constant C, (f) independent of k with
°
I
t
~1
f(x)dx - Qk(f)/::e
Cr(f)
(r -
.
l)k+1
Find Cr(f) for f(x) = 1/(2 - x 2 ) , using Remark 4.1.11, and determine r
SUch that C (f) is minimal.
l
Ja
f (x) dx of a
5. In order to approximately compute the definite integral
function f E C 6[0, 1], the following formulas may be applied (h := lin):
Jor
f(x) dx = h
(2.I
1 12 J"(~l)
2
f(O)
227
4.7 Exercises
Numerical Integration
226
+ V=~-1 f(vll) +
2.1 f(l)
- 11
~~~ctions i, where R(f) dep~nds only on the f(xj), then R(f) ::::
. 2N If (b) - f (a) I· (Thus, 111 this sense, the trapezoidal rule is optimal.)
7. Wnte a MATLAB program to compute an approximation to the integral
f (x) dx with the composite Simpson rule; use it to compute !!. =
J:
Jd 1~~2
with an absolute error ::::10-5 . How many equidistant point: are
sufficient? (Assume that there is no roundoff error.)
Jcos x dx with an abso8. Use MATLAB to compute the integral I =
lute error ::::0.00005; use Simpson's rule with step size control.
9. Let f be a continuously differentiable function on [0, b l.
(a) Show that
J;/2
(composite trapezoidal rule; cf. (3.1»,
"
+4L.."f
(2V -21)1l)
'
l,=l:n
I
b-a
E
(composite Simpson rule; cf. (3.1»,
(composite Milne rule; cf. (4.4» with~; E [0,
n i = 1,2,3. Let f(x) :=
I
a
b
f(x)dx
f([a, bD
+ b ) + [-1, 1l-4b-a rad (f' ([a , b])) } .
n { f ( a. -2-
(b) Use (a) to write a MATLAB program for the adaptive integration of a
giv~n continuous function f where interval extensions for f and t' are
available, (Use the range of slopes for l' when the derivative does not
exist over some interval, and split intervals recursively until an accuracy
specification is met.)
(c) Compute enclosures for Itl ~I- dx JI __1_ dx and (n12
a v'l+x'
'
-I vT=X'
Ja
JXcosx dx with an error of at most 5 x 10-;-1 (i = 2,3,4); give
the enclosures and the required number of function evaluations.
10. (a) Show by induction that the Vandermonde determinant
x 6 .5 .
(a) For each formula, determine the smallest n such that the absolute value
of the remainder term is :::: 10- 4 . How many function evaluations would
be necessary to complete the numerical evaluation with this n?
(b) Using the formula that seems most appropriate to you, compute an approximation to l x 6 .5dx to the accuracy 1O-4(with MATLAB). Com-
xa
det
x5
Ja
pare with the exact value of
Jd x 6 .5 dx.
6. Let a = xa < XI < ... < XN = b be an equidistant grid.
(a) Show that if f (x) is monotone in [a, b], then the composite trapezoidal
rule satisfies
If"
\
does not vanish for pairwise distinct Xj E JR, j = 0,
, n.
(b) Use part (a) to show that for any grid a :::: Xa < Xl <
< xn < b
there exist uniquely determined numbers aa, ai, ... , an so that - ,
b a
f(x)dx - TN(f) :::: --If(b)
- f(a)l·
a
2N
(b) Show that if Q(f) is an arbitrary quadrature formula on the above subf(x) dx - Q(f)1 :::: R(f) for all monotone
division of [a, b] and I
J:
for every polynomial p(x) of degree n.
229
Numerical Integration
4.7 Exercises
11. (a) Show thatthe Chebychev polynomials T; (x) (defined in (3.1.22» are orthogonal over [-1, 1] with respect to the weight function co (x) :=
(e) If all b j are positive (which is the case for positive weight functions),
find a nonsingular diagonal matrix D such that A~+I = D- I A n+1 Dis
symmetric. Confirm that also Pn+I(X) = det(xl - A~+I)' Why does
this give another proof of the fact that all nodes are real?
(f) Show that in terms of eigenvectors w A of A~+I of2-norm I, CA = (wt)2.
Use this to compute with MATLAB's eig nodes and weights for the
12th order Gaussian quadrature formula for f ~ I j (x) dx.
14. A quadrature formula Q(n := Li=O:N a;f(xd with Xi E [a, b] (i =
0.... , N) is called closed provided Xo = a and XN = band ao, aN i= O.
Show that, if there is an error bound of the form
228
(l - x 2 )- L
(b) Show that T; (x) = Re(x + i.JI=X2)" for x E [-1, 1].
12. Determine for N = 0, 1,2 the weights a j and arguments x j (j = 0, ... , N)
of the Gaussian quadrature formula over the interval [0, I] with weight
function os (x) := x -1/2. (Find the roots x j of the orthogonal polynomial
P3(X) with MATLAB's root.)
13. (a) Show that the characteristic polynomials P j (x) = det(x I - A j) of the
tridiagonal matrices
lib
°
with E(f) := Li=ON f3;f(Xi) for all exponential functions j = eAX (A E
JR), then Q(f) must be a closed quadrature formula and 1f3i1 ~ laiI for
i =0. N.
15. Let Dk(f) := Li=O:N f3 i(k) j(Xi) with Xo < Xl < ... < XN. Suppose that
satisfy the three-term recurrence relation
k
(7.1)
started with P_I(X) = 0, Po(x) = 1.
(b) Show that for any eigenvalue A of An+l,
u)
LA(x)
= c;l LV) Pj_1 (x).
j
D (f)
=
1
k! j
(k)
(Sk(f»
for some Sk(f)
E [Xo, xJl
holds whenever j is a polynomial of degree :::N.
(a) Show that D N (f) = f[xo • . . . • XN], and give explicit formulas for the
f3 i( N ) in terms of the Xi.
(b) Show that f3?) = for i ::: N < k, i.e., that derivatives of order higher
than N cannot be estimated with N + I function values.
16. (a) Write a program for approximate computation of the definite integral
j(x) dx via Romberg integration. Use hi := (b - a)/2 as initial
step size. Halve the step size until the estimated absolute error is smaller
than 10-6 , but at most 10 times. Compute and print the values in the
"Romberg triangle" as well as the estimated error of the best value.
(b) Test the program on the three functions
(i) j(x) := sinx, a := 0, b := it ,
(ii) j(x) := vfxl3, a := 0, b := I, and
(iii) j(x) := (l + l Scos'' x)-1/2, a := 0, b:= Jr/2.
17. Suppose that the approximation F(h) of the integral I, obtained with step
size h, satisfies
°
= Pj-l(A),
define eigenvectors u A of An+l and
(c) Show that the
j(X)dX\ ::: IE(f)1
VA
of A;'+I'
where
CA
=
f:
L bi->: b (v))2
j
j
are the Lagrange polynomials for interpolation in the eigenvalues of
A n+l. (Hint: Obtain an identity by computing (vA)T An+l u'' in two
ways.)
(d) Deduce from (a)-(c) that the n + I nodes of a Gaussian quadrature
formula corresponding to orthogonal polynomials with three-term recurrence relation (7.1) are given by the eigenvalues x j = A of An + I.
with corresponding weights aj = c;l f w(x)dx. (Hint: Use the orthogonality relations and po(x) = 1.)
I
=
F(h)
+ ah!" + bh"? + ch!" + ...
with PI < P2 < P3 < ....
4.7 Exercises
Numerical Integration
230
21. Consider an initial value problem of the form
Give real numbers a, f3, and y such that
1= aF(h)
+ f3F (~) + yF (~) + O(h P3 )
holds, to obtain a formula of order h P3•
18. To approximate the integral JD f tx , y) dxdy, D
yea) = a.
y' = Ftt , y),
Suppose that the function F (t, z) is continuously differentiable in [a, b] x
R Suppose that
c 1R2 ,
via a formula of
F(t,z) 2:0,
the form
L
231
(7.2)
Av.t(xv, Yv)
v=I:N
and
Ft(t,z):SO,
Fy(t,z):sO
for every t E [a, b] and Z E R Show that the approximations
obtained from Euler's piecewise linear method satisfy
with nodes (xv, Yv) and coefficients A v, (v = 1, ... , N) we need to determine 3N parameters. The formula (7.2) is said to have order m + I
i2:y(ti),
i
for y(ti)
i=O,I, ....
22. Compute (by hand or with a pocket calculator) an approximation to y(OA)
to the solution of the initial value problem
provided
yl/(t) = -yet),
for all q, r 2: 0 with q + r = 0, 1, ... , m.
(a) What is the one-point formula that integrates all linear functions exactly
(i.e., has order 2)?
(b) For integration over the unit square
D:= {(x, y) 10:S x :s I,O:s y :s I},
yO := yeO),
ZO := v'(O),
'+1
.
yeO) = 1.
:=
2
.
with Euler's method with step size h.
1
4' [0 L}')
(c) How small must h be for the relative error to be :s:;: . 10- ill , .
(d) Perform 10 steps with the step size determined in (c).
20. The function y(t) = t 2 /4 is a solution to the initial value problem
yeO) =
y' = F(t, y),
+ h)
=
"
L
v=O:N-1
t
and h > O. Verify this
yea) =
l
(7.3)
may be solved via a Taylor expansion. To do this, one expands y(t) about
a to obtain
yea
o.
However, Euler's method gives yi = 0 for all
behavior numerically, and explain it.
(h = 0.1).
Interpret the rule explained in (b) geometrically.
23. When .t is complex analytic, an initial value problem of the form
(a) Write down a solution to the initial value problem.
(b) Write down a closed expression for the approximate values yi obtained
=,;y,
h
.
i -- h.:' + 2F(ti' y'),
Zi+1 := Zi + hF(ti, y");
(i = 0, ...,3);
i
corresponding A v , v = 0, ... ,5.
19. Consider the initial value problem
y'
y'(O) = 0
by
(a) rewriting the differential equation as a system of first order differential
equations and using Euler's method with step size h = 0.1;
(b) using the following recursion rules derived from a truncated Taylor
expansion:
show that there is an integration formula of the form (7.2) with nodes
(0,0), (0, 1), (1/2, 1/2), (1,0), (1,1), and order 3, and determine the
y' = 2y,
yeO) = I,
y(V) (a)
-'_ _ h"
v!
+ remainder;
.
function values and derivatives at a are computable by repeated differentiation of (7.3). NegLecting the remainder term, one thus obtains an
Numerical Integration
232
approximation for yea
+ h).
Expansion of y about a
+h
5
gives an ap-
proximation at the point a + 2h, and so on.
(a) Perform this procedure for F(t, y) := t - y2, a := 0 and yO := 1. In
these computations, develop y to the fourth derivative (inclusive) and
choose h = 1/2, 1/4, and 1/8 as step sizes. What is the approximate
Univariate Nonlinear Equations
value at the point t = I?
(b) Use an explicit form of the remainder term and interval arithmetic to
compute an enclosure for y (1).
24. To check the effect of different precisions on the global error discussed
at the end of Section 4.6, modify Euler's method for solving initial value
problems in MATLAB as follows:
Reduce the accuracy of F h to single precision by multiplying the standard
MATLAB result with 1+le-8*randn (randn produces at different calls
normally distribbuted numbers with mean zero and standard deviation 1).
Perform the multiplication of F with h and the addition to yi
(a) in double precision (i.e., normal MATLAB operations);
(b) in simulated single precision (i.e., multiplying after each operation with
1+le-8*randn).
Compute an approximate solution at the point t = 10 for the initial value
4
3
problem y' = -O.ly, yeO) = 1, using constant step sizes h := 10- , 10-
and 10- 5 . Comment on the results.
25. (a) Verify that the Adams-Bashforth formulas, the Adams-Moulton formulas, and the BDF formulas given in the text have the claimed order.
(b) For the orders for which the coefficients of these formulas are given
in the text, find the corresponding error constants C p- (Hint: Use
Example 3.1.7.)
In th~s chapt~r we treat the problem of finding a zero of a real or complex
function I of one variable (a univariate function), that is, a number x* such
that I(x*) = O. This problem appears in various contexts; we give here some
typical examples:
(i) Solutions of the equation p(x) = q (x) in one unknown are zeros of func-
.. tions I su~~ as (among many others) 1:= p - q or 1:= plq - 1.
(ii) Intenor rruruma or maxima of continuously differentiable functions I are
zeros of the derivative 1'.
(iii) Singular points x* of a continuous function I for which II (x) I ---+ 00 as
x ---+ x* (in particular, poles of f) are zeros of the inverted function
._ { lII(x),
0,
g (x ) .-
if x
if x
t= x*,
= x*,
continuously completed at x = x*.
(iv) The eigenvalues of a matrix A are zeros of the characteristic polynomial
I(A) := det(AI - A). Because of its importance, we look at this problem
more closely (see Section 5.3).
(v) Shooting methods for the solution of boundary value problems also lead
to the problem of determining zeros. The boundary-value problem
y"(t)
= g(t, yet), let)),
yea)
= Ya,
y(b)
y: [a, b] ---+ lR
= Yb
can often be solved by solving the corresponding initial-value problem
y~/(t)
= get, ys(t), y;(t)), y,: [a,
y,(a) = Ya, y;(a) = s
233
b] ~ lR
234
Univariate Nonlinear Equations
for various values of the parameter s. The solution y = Ys of the initialvalue problem is a solution of the boundary-value problem iff s is a zero
of the function defined by
f(s) := ys(b) - Yb·
We see that, depending on the source of the problem, a single evaluation of
f may be cheap or expensive. In particular, in examples (iv) and (v), function
values have a considerable cost. Thus one would like to determine a zero with
the fewest possible number of function evaluations. We also see that it may be
a nontrivial task to provide formulas for the derivative of f. Hence methods
that do not use derivative information (such as the secant method) are generally
preferred to methods that require derivatives (such as Newton's method).
Section 5.1 discusses the secant method and shows its local superlinear convergence and its global difficulties. Section 5.2 introduces globally convergent
bisection methods, based on preserving sign changes, and shows how bisection methods can accomodate the locally fast secant method without sacrificing
their good global properties. For computing eigenvalues of real symmetric matrices and other definite eigenvalue problems, bisection methods are refined in
Section 5.3 to produce all eigenvalues in a given interval.
In Section 5.4, we show that the secant method has a convergence order of
:::::;1.618, and derive faster methods by Opitz and Muller with a convergence
order approaching 2. Section 5.5 considers the accuracy of approximate zeros
computable in finite precision arithmetic, the stability problems involved in
deflation (a technique for finding several zeros), and shows how to find all
zeros in an interval rigorously by employing interval techniques.
Complex zeros are treated in Section 5.6: Local methods can be safeguarded
by a spiral search, and complex analysis provides tools (such as Rouche's
theorem) for ascertaining the exact number of zeros of analytic functions.
Finally, Section 5.7 explores methods that use derivative information, and in
particular Newton's method. Although it is quadratically convergent, it needs
more information per step than the secant method and hence is generally slower.
However, as a simple prototype for the multivariate Newton method for systems
of equations (see Section 6.2), we discuss its properties in some detail - in
particular, the need for damping to prevent divergence or cycling.
5.1 The Secant Method
235
approximations of a zero the zero is likely to be close to the intersection of the
interpolating line with the x-axis.
We therefore interpolate the function f in the neighborhood of the zero x*
at two points Xi and Xi _I through the linear function
If the slope
f[Xi, Xi-IJ
does not vanish, we expect that the zero
(1.1)
of PI represents a better approximation for x* than Xi and Xi-I (Figure 5.1).
The iteration according to (1.1) is known as the secant method, and several
variants of it are known as regula falsi methods. For actual calculations, one
3
2
o( - - - - - - - - j - - f - - - - - . - - - - - - - - - - - - -
----I
-1
-2
-3
5.1 The Secant Method
Smooth functions are locally almost linear; in a neighborhood of a simple
zero they are well approximated by a straight line. Therefore, if we have two
o
0.5
1.5
2
2.5
Figure 5.1. A secant step for f
3
3.5
(x) = x 3 - 8x 2
+ 20x
4
- 13.
4.5
5
2
Table 5.1. Results of the secant method in [1,2] for x - 2
1
2
3
4
5
Xi
signet)
Xi
6
7
8
9
10
+
2
1·33333333333333
1.40000000000000
1.41463414634146
+
237
5.1 The Secant Method
Univariate Nonlinear Equations
236
1.41421143847487
1.41421356205732
1.41421356237310
1.41421356237310
1.41421356237309
signet)
+
+
for some convergence factor q E (0, 1). Obviously, each Q-linearly convergent
sequence is also R-linearly convergent (with c = Ix *- Xio Iq -i O ) , but the converse
is not necessarily true.
For zero-finding methods, one generally expects a convergence speed that
is faster than linear. We say the sequence Xi (i = 1,2,3, ...) with limit x*
converges Q-superlinearly if
IXi+1 -
x*1 ::: qi IXi - x"]
for i :::: 0, with lim qi
1-+00
= 0,
and Q-quadratically if a relation of the form
may use one of the equivalent formulas
Xi -Xi-I
1 - f(Xi-dlf(Xi)
-x------
1
=
Xi-J/(Xi) - xi!(xi-d.
f(Xi) - f(Xi-l)
,
of these, the first two are less prone to rounding error than the last one. (Why?)
5.1.1 Example. We consider f(x) := x 2-2withx* = .j2 = 1.414213562 ...
with starting points XI := 1. X2 := 2. Iteration according to (1.1) gives the results recorded in Table 5.1. After eight function evaluations, one has 10 valid
digits of the zero.
Convergence Speed
For the comparison of various methods for determining zeros, it is useful to have
more precise notions of the rate of convergence. A sequence Xi (i = 1,2,3, ...)
is said to be (at least) linearly convergent to x* if there is a number q E (0, 1)
and a number c > 0 such that
[x" -
Xi
I :::
cq'
for all i :::: 1;
holds. Obviously, each quadratically convergent sequence is superlinearly convergent. To differentiate further between superlinear convergence speeds we
introduce later (see Section 5.4) the concept of convergence order.
These definitions carryover to vector sequences in jRn if one replaces the
absolute values with a norm. Because all norms in jRn are equivalent, the definitions do not depend on the choice of the norm.
We shall apply the convergence concepts to arbitrary iterative methods, and
call a method locally linearly (superlinearly, quadratically) convergent to x* if,
for all initial iterates sufficiently close to x*, the method generates a sequence
that converges to x* linearly, superlinearly, or quadratically, respectively. Here
the attribute local refers to the fact that these measures of speed only apply to
an (often small) neighborhood of x*. In contrast, global convergence refers to
convergence from arbitrary starting points.
Convergence Speed of the Secant Method
The speed that the secant method displayed in Example 5.1.1 is quite typical. Indeed, usually the convergence of the secant method (1.1) is locally
Q-superlinear. The condition is that the limit is a simple zero of f. Here, a
zero x* of f is called a simple zero if f is continuously differentiable at x* and
f'(x') #- O.
the best possible q is called the convergence factor of the sequence. Clearly,
this implies that x* is the limit. More precisely, one speaks in this situation of
R-linear convergence as distinct from Q-linear convergence, characterized by
5.1.2 Theorem. Let the function f be twice continuously differentiable in a
neighborhood of the simple zero x" and let c:= ~fl/(x*)If'(x*). Then the
sequence defined by (1.1) converges to x* for all starting values XI, x2 (Xl #- X2)
the stronger condition
sufficiently close to x*, and
[x" - Xi+11 ::: qlx* - Xii
for all i :::: io :::: 1
Xi+! - x"
= Ci(Xi -
X*)(Xi_1 - x*)
(1.2)
239
5.1 The Secant Method
Univariate Nonlinear Equations
238
with
lim c,
i-s-oo
= C.
In particular, near simple zeros, the secant method is locally Q-superlinearll'
convergent.
Proof By the Newton interpolating formula
X
0= ((x*)
= l(xi)
+ .f[Xi, xi-d(x* -
Xi)
+ .f[Xi, Xi-I, x*l(x* -
Xi)(X* - Xi-d.
In a sufficiently small neighborhood of x*, the slope .f[Xi, Xi- d is close to
f' (x") i- 0, and we can "solve" the linear part for x*. This gives
f(x,)
.f[x;,X;-I,X*]( *_ ,)(x*-x. )
X
X,
,-I
.f[x"x,-d
f[Xi,Xi-d
x* = X, =
5.1.3 Corollary. If the iterates X; of the secant method converge to a simple
zero x* with f" (x") i- 0, then for large i, the signs of x, - x* and f (x;) in the
secant method are periodic with period 3.
where
(1.3)
So (1.2) holds.
.
For ~, in a sufficiently small closed ball B[x*; r] around x", the quotient
r
Ci(~'
n := .f[~, s', x*]/.f[~, n
remains bounded; suppose that ICi (~, ~') I ~ c. For starting values XI, X2 in the
ball B[x*; ro]' where ro = miner, 1/2c), all iterates remain in this ball. Indeed,
assuming that
x"],
IXi-l -
IXi -
IXHI -
. due t'lOn,
B y III
IXi
_x"
-
I <_
x*1 ~ ro, we find by (1.2) that
X*
')2-i
.
. .ro
,
11 Xi I~ 2
so that x·, converges to x*. Therefore,
;r:~ c, With q; :=
ICi
I IXi-l
-
x* I it follows from (1.2) that
IXHI -
Proof Let s, = signrx, - x"). Because lim c, = c i- 0 by assumption, there is
an index i o such that for i ::: io, the factors c, in 0.2) have the sign of c. Let
s = sign(c). Then (1.2) implies SHI = SSiSi_1 and hence S;+2 = SS;+IS;
SS5i5i-ISi = 5i-l, giving period 3.
0
As Table 5.1 shows, this periodic behavior is indeed observed locally, until
(after convergence) rounding errors spoil the pattern.
In spite of the good local behavior of the secant method, global convergence is
not guaranteed, as Figure 5.2 illustrates. The method may even break down completely if f[x;, Xi _ d = 0 for some i because then the next iterate is undefined.
I
X * ~ rD·
_f[x*,x*,x*]=~f"(X*)=C.
.f[x*, x*]
2 f'(x*)
.
Figure 5.2. Nonconvergence of the secant method.
Now Q-superlinear convergence follows from lim qi = O. This proves the
0
theorem.
c;(x* - Xi)(X* - xi-d,
Xi+1 -
i-1
x"] = q; Ix; - x "].
Multiple Zeros and Zero Clusters
It is easy to show that x* is a simple zero of f if and only if f has the representation f(x) = (x -x*)g(x) in a neighborhood of x", in which g is continuous and
g(x*) i- O. More generally, x* is called an m-fold zero of f if in a neighborhood
of x* the relation
f(x) = (x - x*)mg(x),
holds for some continuous function g.
g(x*)
i- 0
( 1.4)
5.2 Bisection Methods
Univariate Nonlinear Equations
240
Table 5.2.
For multiple zeros (m> 1), the secant method is only locally linearly
241
The safeguarded root secant method
convergent.
5.1.4 Example. Suppose that we want to determine the double zero x* = a
of the parabola f (x) := x 2. Iteration with (l.I) for Xl = I and X2 = ~ gives the
sequence
I
2
3
4
Xi
=
I I 1 1
3'5'S'13'
5
.... X34=
,
l.081O-7.
6
7
o
1.20000000000000
1.74522604659078
1.20000000000000
1.05644633886449
1.00526353445128
1.00014620797161
1.00000038426329
1.20000000000000
1.05644633886449
1.00526353445128
J .00014620797161
1.00000038426329
I .00000000002636
0.09760000000000
0.09760000000000
0.00674222757113
0.00005570200768
0.00000004275979
0.00000000000030
o
(The denominators n, form the Fibonacci sequence ni+1 = n, + l1i-I.) The
convergence is Q-linear with convergence factor q = (j5 - 1)/2 ~ 0.618.
If one perturbs a function f(x) with an m-fold zero x* slightly. then the
perturbed function f (x) has up to m distinct zeros, a so-calle~ zero cluste:,
in the immediate neighborhood of x". (If all these zeros are SImple, there IS
an odd number of them if m is odd, and an even number if m is even.) In
particular, for m = 2, perturbation produces two closely neighboring ~eros
(or none) from a double zero. From the numerical point of view, the behavior of
closely neighboring zeros and zero clusters is similar to that of multiple zeros;
in particular. most zero finding methods converge only slowly towards such
zeros, until one finds separating points in between that produce sign changes.
If an even number of simple zeros x* = xi, ... , x 2k' well separated from the
other zeros, lie in the immediate neighborhood of some point, it is difficult to
find a sign change for f because then the product of the linear factors x is positive for all x outside of the hull of the x;*. One must therefore hit.bY
chance into the cluster in order to find a sign change. In this case, finding a sIgn
change is therefore essentially the same as finding an "average solution," and
is therefore almost as difficult as finding a zero itself.
In practice, zeros of multiplicity m > 2 appear only very rarely. For this
reason, we consider only the case of a double zero. In a neighborhood of
the double zero x", f(x) has the sign of g(x) in (104); so one does not recognize such a zero through a sign change. However, x* is a simple zero of the
xt
However, because x* is not known, we simply choose the negative sign. We call
the iteration by means of the formula
Xi -Xi-I
Xi + 1
= Xi - -:-I-_-y'r7 (7x==i-==I"")=;::/f7(7X='7
)
i
I
( 1.5)
the root secant method. One can prove (see Exercise 4) that this choice guarantees locally superlinear convergence provided one safeguards the original
formula (1.5) by implicitly relabeling the Xi after each iteration such that
If(Xi+dl :::: II(Xi)l·
5.1.5 Example. The function f (x) = x 4 - 2x 3 + 2x 2 - 2x + 1 has a double root
at x* = 1. Starting with Xj = 0, X2 = 1.2, the safeguarded root secant method
produces the results shown in Table 5.2. Note that in the second step, X2 and X3
were interchanged because the new function value was bigger than the so far
best function value. Convergence is indeed superlinear; it occurs, as Figure 5.3
shows, from the convex side of vlf(x) I· The final accuracy is not extremely
high because multiple roots are highly sensitive to rounding errors (see Section
5.5). Of course, for a general f, it may happen that although initially there is
no sign change, a sign change is found during the iteration (1.5). We consider
this as an asset because it allows us to continue with a globally convergent
bracketing method, as discussed in the following section.
function
hex) := (x - x*)jlg(x)1 = sign(x - x*)/lf(x)l,
5.2 Bisection Methods
and the secant method applied to h gives
Xi+l
=
Xi Xi -
Xi -
Xi-I
l-h(xi_I)/h(xi)
The sign of the root is positive when x*
Xi-l
=
Xi -
E
I] [Xi, Xi- tl and negative otherwise.
I ±.Jf(Xi-IJ/f(Xi)
A different principle for determining a zero assumes that two points a and b
are known at which I has different signs (i.e., for which f (a)f (b) :::: 0). We
then speak of a sign change in I and refer to the pair a, b as a bracket. If f is
continuous in O{a, b} and has there a sign change, then I has at least one zero
x* E D{a, b} (i.e., some zero is bracketed by a and b).
% counts number of function evaluations
i=2;
if f==O, x=a;
else
while 1,
x=(a+b)/2;
if abs (b-a) <=deltaO, break; end;
i=i +1; f=f(x) ;
if f>=O, b=x; end;
if f<=O, a=x; end;
end;
end;
% f has a zero xstar with Ixstar-xl<=deltaO/2
2.5 ,---_~---,---_----,-~,---------,----.---------,---,-------r
2
3
1.5
0.5
2
0
-,
<,
<,
-0.5
.c
---'----
0
0.2
0.4
06
0.8
"-----"'
1.2
243
5.2 Bisection Methods
Univariate Nonlinear Equations
242
1.4
1.6
18
2
Figure 5.3. The safeguarded root secant method (Jlf(x)1 against x).
Ifwe evaluate f at some point x between a and b and look at its sign, one finds
a sign change in one of the two intervals resulting from splitting the interval at x.
We call this process bisection. Repeated bisection at suitably chosen splitting
points produces a sequence of increasingly narrow intervals, and by a good
choice of the splitting points we obtain a sequence of intervals that converges
to a zero of f. An advantage of such a bisection method is the fact that at each
iteration we have (at least in exact arithmetic) an interval in which the zero must
lie, and we can use this information to know when to stop.
In particular. if we choose as splitting point always the midpoint ofthe current
interval, the lengths of the intervals are halved at every step, and we get the
following algorithm. (0 and 1 are the MATLAB values forfalse and true; while
1 starts an infinite loop that is broken by the break in the stopping test.)
5.2.1 Algorithm: Midpoint Bisection to Find a Zero of
f in
f=f(a) ; f2=f (b) ;
if f*f2>0, error('no sign change'); end;
if f>O, x=a;a=b;b=x; end;
% now f(a)<=O<=f(b)
[a, b]
5.2.2 Example. For finding the zero x* = .j2 = 1.414213562 ... of f(x) :=
x 2 - 2 in the interval [1,2], Algorithm 5.2.1 gives the results in Table 5.3.
Within 10 iterations we get an increase of accuracy of about 3 decimal places,
corresponding to a reduction of the initial interval by a factor 2 10 = 1024. The
accuracy attained at X9 is temporarily lost since [x" - XIQ I > [r" - x91. (This may
happen with R-linear convergence, but not with Q-linear convergence. Why?)
A comparison with Example 5.1.1 shows that the secant method is much
more effective than the midpoint bisection method, which would need 33 function evaluations to reach the same accuracy. This is due to the superlinear
convergence speed of the secant method.
In general, if Xi, a., and b, denote the values of x, a, and b after the ith
function evaluation, then the maximal error in Xi satisfies the relation
*
1
Ix -xil:s 2Ibi-I-ai-11
1
= 2i_2Ib-al fori> 2.
Table 5.3. The bisection method for
f(x)=x 2 - 2 in [I , 2]
Xi
3
4
5
1-5
6
lA375
lA0625
7
1.25
1.375
Xi
8
9
10
11
12
L421875
1.4140625
1.41796875
1.416015625
.L.±!50390625
(2.1)
5.2 Bisection Methods
Univariate Nonlinear Equations
244
One can force convergence and sign change in the secant method, given
and X2 such that f(xdf(X2) < 0 if one iterates according to
5.2.3 Proposition. The midpoint bisection method terminates after
t
all
Ib - = 2 + II logz -8-
Xi+1
0
= Xi -
f(Xi)/f[Xi,
if f(Xi-df(Xi+l) < 0,
function evaluations.
(Here,
IX l denotes the smallest integer 2:x.)
Proof Termination occurs for the smallest i 2: 2 for which
From (2. I), one obtains i = t.
Ib i -I
-
a, -Ii
245
:s 80 .
0
5.2.4 Remarks.
(i) Rounding error makes the error estimate (2.1) invalid; the final Xi is a good
approximation to x* in the sense that close to Xi, there is a sign change of
the computed version of f·
(ii) Bisection methods make essential use of the continuity of f. The methods
are therefore unsuitable for the determination of zeros of rational functions,
which may have poles in the interval of interest.
The following proposition is an immediate consequence of (2. I).
5.2.5 Proposition. The midpoint bisection method converges globally and
R -linearly, with convergence factor q =
!.
Secant Bisection Method
Midpoint bisection is guaranteed to find a zero (when a sign change is known),
but it is slow. The secant method is locally fast but may fail globally. To get the
best of both worlds, we must combine both methods. Indeed, we can preserve
the global convergence of a bisection method if we choose arbitrary bisection
points, restricted only by the requirement that the interval widths shrink by at
least a fixed factor within a fixed number of iterations.
The convergence behavior of the midpoint bisection method may be drastically accelerated if we manage to place the bisection point close to and on
alternating sides of the zero. The secant method is locally able to find points
very close to the zero, but as Example 5.1.1 shows, the iterates may fall on the
same side of the zero twice in a row.
XI
Xi-I];
Xi
= Xi-I;
end;
This forces f(Xi)f(Xi+l) < 0, but in many cases, the lengths of the resulting
O{Xi. xi-d do not approach zero because the Xi converge for large i from one
side only, so that the stopping criterion IXi - Xi-JI :s 80 is no longer useful.
Moreover, the convergence is in such a case only linear.
At the cost of more elaborate programming one can, however, ensure global
and locally superlinear convergence. It is a general feature of nonlinear problems that there are no longer straightforward, simple algorithms for their solution, and that sophisticated safeguards are needed to make prototypical algorithms robust. The complexity of the following program gives an impression of
the difference between a numerical prototype method that works on some examples and a piece of quality software that is robust and generates appropriate
warnings.
5.2.6 Algorithm: Secant Bisection Method with Sign Change Search
f=feval(func,x);nf=l;
if f==O, ier=O; return; end;
f2=feval(func,x2);nf=2;
if f2==0, x=x2; ier=O; return; end;
%find sign change
while f*f2>0,
% place best point in position 2
if abs(f)<abs(f2), xl=x2;fl=f2;x2=x;f2=f;
else
xl=x;fl=f;
end;
% safeguarded root secant formula
x=x2-(x2-xl)/(1-max(sqrt(fl/f2) ,2»;
if x==x2 I nf==nfmax,
% double zero or no sign change found
x=x2;ier=1; return;
end;
f=feval(func,x);nf=nf+l;
if f==O, ier=O; return; end;
end;
5.2 Bisection Methods
Univariate Nonlinear Equations
246
% we
ier=-l;
have a sign change f*f2<0
slow=O;
while nf<nfmax,
% compute new point xx and accuracy
if slow==O,
% standard secant step
if abs(f)<abs(f2),
xx=x-(x-x2)*f/(f-f2);
acc=abs(xx-x);
else
xx=x2-(x2-x)*f2/(f2-f);
acc=abs(xx-x2);
end;
else if slow==l,
% safeguarded secant extrapolation step
i f £1*f2>0,
quot=max(f1/f2,2*(x-x1)/(x-x2»;
xx=x2-(x2-x1)/(1-quot);
acc=abs(xx-x2);
else % f1*f>0
quot=max(f1/f,2*(x2-x1)/(x2-x»;
xx=x-(x-x1)/(1-quot);
acc=abs(xx-x);
end;
else
% safeguarded geometric mean step
if x*x2>0, xx=x*sqrt(x2/x); % preserves the sign!
else if x==O, xx=0.1*x2;
else if x2==0, xx=O.l*x;
else
xx=O;
end;
acc=max(abs(xx-[x,x2]»;
end;
% stopping
tests
if acc<=O, x=xx;ier=-l; return; end;
ff=feval(func,xx);nf=nf+1;
if ff==O, x=xx;ier=O; return; end;
% compute reduction factor and update bracket
247
if f2*ff<0, redfac=(x2-xx)/(x2-x);x1=x;f1=f;x=xx;f=ff;
else
redfac=(x-xx)/(x-x2);x1=x2;f1=f2;x2=xx;f2=ff;
end;
% force
two consecutive mean steps if nonlocal (slow=2)
if redfac>0.7 I slow==2, slow=slow+1;
else
slow=O;
end;
end;
The algorithm is supplied with two starting approximations -x and X2. If
these do not yet produce a sign change, the safeguarded root secant method
is applied in the hope that it either locates a double root or that it finds a
sign change. (An additional safeguard was added to avoid huge steps when
two function values are nearly equal.) If neither happens, the iteration stops
after a predetermined number nfmax of function evaluations were used. This
is needed because without a sign change, there is no global convergence
guarantee.
The function of which zero is to be found must be defined by a function routine whose name is passed in the string func. For example, if func=' myfunc' ,
then the expression f=feval (f unc , x) is interpreted by MATLAB as
f=myfunc (x). The algorithm returns in x a zero if ier=O, an approximate
zero within the achievable accuracy if ier=-l, and the number with the absolutely smallest function value found in case of failure ier=1. In the last case,
it might well be, however, that a good approximation to a double root has been
found.
The variable slow counts the number of consecutive times the length of
the bracket IJ{x, X2} has not been reduced by a factor of at least 1/,/2; if this
happened twice in a row, two mean bisection steps are performed in order to
speed up the width reduction. However, these mean bisections do not use the
arithmetic mean as bisection point (as in midpoint bisection) because for initial
intervals rangi ng over many orders of magni tude, the use of the geometric mean
is more appropriate. (For intervals containing zero, the geometric mean is not
well defined and an ad hoc recipe is used.)
If only a single iteration was slow, it is assumed that this is due to a secant
step that fell on the wrong side of the zero. Therefore, the most recent point x
and the last point with a function value of the same sign as f (x) are used to
pe:form a linear extrapolation step. However, because this can lead to very poor
POints, even outside the current bracket (cf. Figure 5.2), a safeguard is applied
that restricts the step in a sensible way (cf. Exercise 7).
248
Univariate Nonlinear Equations
The secant bisection method is a robust and efficient method that (for four
times continuously differentiable functions) converges locally superlinearly toward simple and double zeros.
The local superlinear and monotone convergence in a neighborhood of double
zeros is due to the behavior of the root secant method (cf. Exercise 4); the local
superlinear convergence to simple zeros follows because the local sign pattern
in the secant method (see Corollary 5.1.3) ensures that locally, the secant bisection method usually reduces to the secant method: In a step with slow=O, a sign
change occurs, and because the sign pattern has period 3. it must locally alternate
as either -. +.+.-. +. +.-. +. +... or +. -, -. +. -. -. +.-, - ....
Thus, locally. two consecutive slow steps are forbidden. Moreover, locally,
after a slow step. the safeguard in the extrapolation formula can be shown to be
inactive (cf. Exercise 7).
Of course, global convergence is guaranteed only when a sign change is
found (after all, not all functions have zeros). In this case, the method performs
globally at most twice as slow as the midpoint bisection method (why?); but if
the function is sufficiently smooth, then much fewer function values are needed
once the neighborhood of superlinear convergence is reached.
5.2.7 Example. For the determination of a zero of
f(x) :=
I - lOx + O.Ole X
(Figure 5.4) with the initial values Xl := 5. X2 := 20, the secant bisection method
gives the results in Table 5.4. A sign change is immediate. Sixteen function
evaluations were necessary to attain (and check) the full accuracy of 16 decimal
digits. Close to the zero, the sign pattern and the superlinear convergence of the
pure secant method is noticeable.
Practical experiments on the computer show that generally about la-IS function evaluations suffice to finda zero. The number is smaller if x and.e, are not far
from a zero, and maybe largerif(as in the example) f(x) or f(X2) is huge. However, because ofthe superlinear convergence. the number offunction evaluations
is nearly independent of the accuracy requested, unless the latter is so high that
rounding errors produce erratic sign changes. Thus nfmax=20 is a good choice
to prevent useless function evaluations in case of nonconvergence. (To find several zeros. one generally applies a technique called deflation; see Section 5.5.)
Unfortunately there are also situations in which the sign change is of no use,
namely if the continuity of f in l«. b1is not guaranteed. Thus, for example, it can
happen that for rational functions, a bisection method finds instead of a zero. a
pole with a sign change (and then with slow convergence). In this case, one must
5.2 Bisection Methods
249
Table 5.4. Locating the zero of f(x) := I - lOx + O.Olex with the secant
bisection method
1
2
3
4
5
6
7
8
9
IO
11
12
13
14
15
16
Xi
f(Xi)
5.0000000000000000
20.0000000000000000
5.000 146910843469 J
5.0002938188092605
10.0002938144929130
7.0713794514850772
8.0179789309415899
8.5868413839981272
8.9020135225054009
9.4351868431760817
9.1647237200757186
9.0986134854694392
9.1053724953793740
9.1056030246107298
9.1056021203890367
9.1056021205058109
-47.51586840R9742350
4851452.9540979033000000
-47.5171194663684280
-47.5183704672215440
121.3264462602397100
-57.9360785853060920
-48.8294182041005270
-31.2618672757999010
-14.5526201672795000
31.8611779028851880
4.8935773887754692
-0.5572882149440375
-0.0183804999496431
0.0000723790792989
-0.0000000093485397
-0.0000000000000568
slow
0
0
1
2
3
0
0
1
2
3
0
0
1
0
0
I
2001--,---___r--,-----_,--~--,------r--,-----_,-~
150
100
50
°K~-------+---------___J
-50
o
2
4
6
8
10
12
Figure 5.4. Secant bisection method for f(x)
14
=I-
16
18
lOx + O.OleX •
20
250
Univariate Nonlinear Equations
safeguard the secant method instead by damping the steps taken. The details
are similar to that for the damped Newton method discussed in Section 5.7.
5.3 Spectral Bisection Methods/or Eigenvalues
A:= (
5.3 Spectral Bisection Methods for Eigenvalues
Many univariate zero-finding problems arise in the context of eigenvalue calculations. Special techniques are available for eigenvalues, and in this section we
discuss elementary techniques related to those for general zero-finding problems. Techniques based on similarity transformations are often superior but
cannot be treated here.
The eigenvalues of a matrix A are defined as the zeros of its characteristic
polynomial Ie';..) = det(AI- A). The eigenvalues of a matrix pencil (A, B) (the
traditional name for a pair of square matrices) are defined as the zeros of its characteristic polynomial I(A) = det(AB - A). And the zeros of I(A) := det G(A)
are the eigenvalues of the nonlinear eigenvalue problem associated with a parameter matrix G(A), that is, a matrix dependent on the scalar parameter A.
(This includes the ordinary eigenvalue problem with G(A) = AI - A and the
general1inear eigenvalue problem with G(A) = AB - A.) In each case, Ais a
possibly complex parameter.
To compute eigenvalues, one may apply any zerofinder to the characteristic
polynomial. If only the real eigenvalues are wanted, we can use, for example,
the secant bisection method; if complex eigenvalues are also sought, one would
use instead a spiral method discussed in Section 5.6.lt is important to calculate
the value det G(A/) at an approximation Al to the eigenvalue Afrom a triangular
factorization of G (AI) and not from the coefficients of an explicit characteristic
polynomial. Indeed, eigenvalues are often - and for Hermitian matrices alwayswell-conditioned, whereas computing zeros from explicit polynomials often
gives rise to severe numerical instability (see Example 5.5.1).
One needs about 10-15 function evaluations per eigenvalue. For full n x
n-matrices, the cost for one function evaluation is ~ n 3 + 0 (n 2 ) . Hence the total
cost to calculate s eigenvalues is 0(m 3 ) operations.
For linear eigenvalue problems, all n eigenvalues can be calculated by transformation methods, in 0 (n 3 ) operations using the QR -algorithm for eigenvalues of matrices and the QZ -algorithm for matrix pencils (see, e.g., Golub and
van Loan [31], Parlett [79]). Thus these methods are preferred for dense matrices. If, however, G(A) is banded with a narrow band, then these algorithmS
require 0(n 2 ) operations whereas the cost for the required factorization- is
only O(n). Thus methods based on finding zeros of the determinant (or poles
of suitable rational functions; see Section 5.4) are superior when only a few of
the eigenvalues are required.
251
5.3.1 Example. The matrix
! H-~J
-2
0
I
(3.1)
0
has
the eigenvalues ±J5 ~ ±2.23606797749979, b0 th 0 f WhiIC h h ave multi-.
. .
plicity We.compute one of them as a zero of I (A) := det(AI - A) usin the
secant bisection method (Algorithm 5.2.6).
g
2:
With the starting values 3 and 6 we obtain the double eigenvalue X _
2.2360~797:49979 with full accuracy after nine evaluations of the determ:
nant using triangular factorizations.
If we use .instead the explicit characteristic polynomial/ (x) = x 4 _ 10x 2 +
25, we obtain after 10 steps i = 2.2360679866043878 . h i '
.
deci
. .
,Wit on y eight valid
eCll~al digits. As shown in Section 5.5, this corresponds to the limit accuracy
o (y' c) expected for double zeros.
The surprising order. 0 (c) accuracy of the results when the determinant is
c~mputed from a matrtx factorization is typical for (nondefective) multi Ie
eigenvalues, and ~an be explained as follows: The determinant is calculatedPas
atroduct of the dla~onal factors in R (and a sign from the permutations), and
c ose to a nondefec~lVe m-fold eigenvalue m of these factors have order O(e);
therefore.the Ul.lavoldable error in f(x) is only of order O(e"') for x ~ x* and
the error III x* IS O«em)l/m) = O(e).
Definite Eigenvalue Problems
The parameter matrix G . D c ([ ~ ([II X II •
•
•
.IS called defini te III a real interval
[' 1J'
!::.,,, C D If for A E [A A] G(')' H
..
uousl -.
'.
- " " IS ermitian and (componentwise) contin..y differentiable and the (componentwise) derivative G'(A) '- ->LG(A) .
pOSitIVe definite.
.- dJ..
IS
5..3.2 Proposition. The linear parameter matrix G(A) = AB _ A . d ,f; .
if and onl ' if B d
..
.
IS efinue
all A are Hermitian and B IS positive definite In this
)
Case, all the ei
I
J'
.
mat .
tgenva ues are real. In particular, all eigenvalues ofa Hermitian
fix are real.
(In JR.)
ProOf TheparametermatrixG(A) =AB-AI"sH
iti if d
.
' errmtran I an only If Band
A
H
..
are ermitran. The derivative G'(A) - B i
". d fini .
-
s positive e nite If and only if
5.3 Spectral Bisection Methods for Eigenvalues
Univariate Nonlinear Equations
252
5.3.4 Theorem. The number peA) ofnonnegative eigenvalues ofan Hermitian
B is positive definite. This proves the first assertion. For each eigenpair (A, x),
H
H
0= x H G(A)X = AXH Bx - x H Ax. By hypothesis, x Bx > 0 and x Ax is
real; from this it follows that A
= x HAxjx
H
matrix A is invariant under congruence transformations; that is, if A o, Al E
C"?" are Hermitian and if S is a nonsingular n x n matrix with AI = SAOS H,
then p(Ad = p(Ao).
Bx is also real. Therefore all of
the eigenvalues are real.
In particular, this applies with B = I to the eigenvalues of a Hermitian
matrix A.
r«
Let oo be a complex number of modulus 1 with oo i= sign a for every
eigenvalue a of S. Then S, := t S - (l-t)aol is nonsingularfor all t E [0, 1], and
the matrices At := StAoStH are Hermitian for all t E [0, 1] and have the same
rank r := rank A o. (Because SoAoS/j = (-ao)Ao( -aD) = Ao and SIAoS[i =
SAOS H = A I, the notation is consistent.) If now At has 7f t positive and VI
negative eigenvalues, it follows that n, + VI = r is constant. Because the
eigenvalues of a matrix depend continuously on the coefficients, the integers
n, and V t must be constant themselves. In particular, p(AI) = 7f) + n - r =
o
7fo + n - r = p(A o).
0
Quadratic eigenvalue problems are often definite after dividing the parameter
matrix by A:
= AB+C-A-IA
is definite in (-00,0) and (0, (0) if B, C, and A are Hermitian and A, Bare
positive definite. All eigenvalues of the parameter matrix are then real.
5.3.3 Proposition. TheparametermatrixoftheformG(A)
Proof Because B, C, and A are Hermitian, G(A) is also Hermitian. Furthermore, G'(A) = B + A- 2 A and x H G'(A)X = x" Bx + A-2 xH Ax > 0 if x i= 0
and A E JR\{O}. Consequently, G(A) is definite in (-00,0) and in (0, (0). For
We note two important consequences of Sylvester's theorem.
E en xn is Hermitian and that a E JR. Then
pea I - A) is the number of eigenvalues of A with A :::: a.
5.3.5 Proposition. Suppose that A
each eigenpair (A, x),
0= x" G(A)X
= AXHBx + x" Cx -
A-I x" Ax.
Proof The eigenvalues of a I - A are just the numbers a - Ain which A runs
through the eigenvalues of A.
o
H
Because the discriminant (x H Cx)2 + 4(x Bx)(x Ax) of the corresponding
quadratic equation in A is real and positive, all eigenvalues are real.
0
H
5.3.6 Proposition. Suppose that A, B
For definite eigenvalue problems, the bisection method can be improved to
give a method for the determination of all real eigenvalues in a given inter.v.al.
For the derivation of the method, we need some theory about Hermlt1an
matrices. For this purpose we arrange the eigenvalues of an Hermitian mat~ix,
which by Proposition 5.3.2are all real, in order from smallest to largest, cou~tlllg
multiple eigenvalues with their multiplicity in the characteristic polynomial
We denote with p (A) the number of nonnegative eigenvalues of A. If the matrix
A depends continuously on a parameter i, then the coefficients of the characteristic polynomial det(Al - A) also vary continuously with t, Because the zeros
of a polynomial depend continuously on the coefficients, the eigenvalues A.i (A)
depend continuously on t.
Fundamental for spectral bisection methods is the inertia theorem
Sylvester. We use it in the following form.
253
.f
OJ
E e n x n are Hermitian and that A - B
is positive definite. Then
Ai(A) > Ai(B)
In particular,
if A
for i = 1, ... , n.
(3.2)
has rank r, then
p(A) :::: pCB)
+n -
r.
(3.3)
Proof Because A - B is positive definite there exists a Cholesky factorization
A - B = LL H. Suppose now that a := Ai(B) is an m-fold eigenvalue of Band
that }..IS the largest index with a = Ai (B). Then i > j - m and p(a 1- B) = i:
andforC := L -I(a I -B)L -H we have p(C) = p(LCL H ) = p(a I-B) = j.
~owever, the matrix C has the m-fold eigenvalue zero; therefore p(C - l) ::::
J -m. whence p(a I - A) = pea 1- B-LL H ) = p(L(C -l)L H ) = p(C ~~.s j - m. Because ~ > j - m (see previous mention), Ai (A) > a = Ai (B).
IS proves the assertion (3.2), and (3.3) follows immediately.
0
Univariate Nonlinear Equations
5.3 Spectral Bisection Methods for Eigenvalues
In order to be able to make practical use o~ .Proposit~on 5.3.5, we need:
simple method for evaluating p(B) for Hermitian matnces ~. If an LDL
factorization of B exists (cf. Proposition 2.2.3 then we obtain p(B) as the
H
number of nonnegative diagonal elements of D; then p(B) = p(LDL ) =
p(D) and the eigenvalues of D are just the diagonal elements.
to switch to a superJinear zero finder, once an interval containing only one
desired eigenvalue has been found. The spectral bisection method is especially
effective if triangular factorizations are cheap (e.g., for Hermitian band matrices
with small bandwidth) and only a fraction of the eigenvalues (less than about
1Do/c) are wanted.
Problems with the execution of the spectral bisection method arise only if the
L D L H factorization without pivoting does not exist or is numerically unstable.
Algorithm 2.2.4 simply perturbs the pivot element to compute the factorization; for the computation of p(B), this seems acceptable as it corresponds to a
perturbation of A of similar size, and eigenvalues of Hermitian matrices are not
sensitive to small perturbations.
Alternatively, it is recommended by Parlett [79] to simply repeat the function
evaluation with a slightly changed value of the argument. One may also employ
in place of the L D L H factorization a so-called Bunch-Parlett factorization with
block-diagonal D and symmetric pivoting (see Golub and van Loan [31 D, which
is numerically stable.
The bisection method demonstrated can also be adopted to solve general
definite eigenvalue problems. Almost everything remains the same; however,
in place of the inertia theorem of Sylvester, the following theorem is used.
254
5.3.7 Example. We are looking for the smallest eigenvalue A* in the interval
.
II x II
. h di
It'
[0.5,2.5] of the symmetric tridiagonal matnx T E 1Ft
Wit iagona en nes
I, 2, ... , II and subdiagonal entries 1. (The tridiagonal case can be ~andled
very efficiently without computing the entire factorization; see Ex~rclse II).
We initialize the bisection procedure by computing the number of eigenvalues
less than 0.5 and 2.5, respectively. We find that two eigenvalues are less than
0.5 and four eigenvalues are less than 2.5. In particular, we see that the third
eigenvalue A3 is the smallest one in [0.5,2.5]. Table .5:5 shows the result of
further bisection steps and the resulting intervals contammg A3· (The 1.0 + 8. III
the fourth trial indicates that for A = I, the LD L H factorization does not exist,
and a perturbation must be used.)
.
A useful property of this spectral bisection method is that here a~ e~genva~ue
with a particular number can be found. We also obtain additional information
about the positions of the other eigenvalues that can subsequently be used. to
determine them. For example, we can read off Table 5.5 that [-00,0.5] contams
the two eigenvalues Al and A2, [0.5, 1.5] and [1.5,2.5] contain one each (A3 and
A4), and [2.5, 00] contains the remaining seven eigenvalues Ako k = 5, .... I L
The spectral bisection method is globally convergent, but the converge~ce is
.
f n.j
d
ble
only linear. Therefore, to reduce the number of evaluations
0 p, It IS a VIsa
Table 5.5.
\
2
3
4
5
6
7
8
9
10
11
cn
xn
5.3.8 Theorem. Suppose that G : D S; C --r
is definite in [b, A] S;
D n 1Ft. Then the numbers p... := p(G(b», P := p(G(A» satisfy E :::: p, and
exactly p - p... real eigenvalues lie in the interval [b, A]. (Here an eigenvalue A
is counted with a multiplicity given by the dimension ofthe null space ofG (A).)
Proof Let gAA) := x'' G(A)X for x
a
aA gAA) = x
Example for spectral bisection
U/
Pi = p(Ui I - T)
0.5
2.5
1.5
1.0 + 8
0.75
0.875
0.9375
0.96875
0.953\25
0.9609375
0.96484375
2
4
3
3
2
2
2
3
2
2
3
A3 E
[0.5,2.5]
[0.5. 1.5]
[0.5, \.0]
[0.75, 1.0]
[0.875, 1.0]
[0.9375. 1.0]
[0.9375,0.96875]
[0.953\25.0.96875]
[0.9609375,0.96875]
[0.9609375,0.964843;5]
255
H
E
cn\{o}. Then
,
G (A)X > 0
that is, gx is strictly monotone increasing in [b,
-
for A E [b, A],
A]
and
for
We deduce that G(A2) - G(Al) is positive definite for
Proposition 5.3.6, we obtain
b:::: Al
< A2 ::::
A. From
257
5.4 Convergence Order
Univariate Nonlinear Equations
256
one similarly obtains the equivalent relation
If A2 is an m-fold eigenvalue of the parameter matrix G(A), it follows that
e, 2: ifJ
Because of the continuous dependence of the eigenvalues Ai (G (A» on A,equality holds in this inequality if no eigenvalue of the parameter matrix G(A) lies
between AI and A2. This proves the theorem.
0
+y
where 13 := -logJO q, y := -logJO C.
(4.2)
1,
For example, the midpoint bisection method has q =
hence 13 ~ 0.301 >
0.3. After any 10 iterations, 10fJ > 3 further decimal places are ensured. Because
(4.1) implies (4.2), we see again that Q-linear convergence implies R-linear
convergence.
For Q-quadratic convergence,
5.4 Convergence Order
To be able to compare the convergence speed of different zero finders, we
introduce the concept of convergence order. For a sequence of numbers Xi that
converge to x", we define the quantities
ei
w.e have the equivalent relation ei+l 2: 'le, - y, where y := 10gJO Co. Recursl~ely, one finds e;. - y 2: 2,-k (ek - y). Thus, if we define 13 := (e j - y) /2 j
WIth the smallest} such that e j > y, we obtain
:= -loglO IXi - x"].
e, 2: i fJ
Then IXi - x* I = lO- e , , and with
s, := Led
with 13 > 0,
so that the bound for the number of valid digits grows exponentially.
In general, we say that a sequence Xl, X2, X3, •.. with limit x* converges with
(R- )convergellce order of at least K > I or K = 1 if the numbers
ed
:= max{s E Z Is:::
+Y
we find that
lO- s , -I <
Therefore
Xi
has at least
ei
IXi -
x" I ::: lO-
s
ei := -logJOlxi - x*1
,
correct digits after the decimal point. The growth
satisfy
pattern of e; characterizes the convergence speed.
For Q-linear convergence,
IXi+1 - .r"] ::: q\Xi - x"]
+ fJ,
where fJ := - log\o q >
°
by taking logarithms, and therefore also
ei+k 2: kfJ
(4.1)
+ e..
Thus in k steps, one gains at least LkfJ J decimal places. For R-linear convergence,
Ix; - x"] ::: c«,
q
E
for all i 2: 1,
e, 2: fJi + Y
for all i 2: 1,
or
with 0< q < 1,
we find the equivalent relation
ei+1 2: e,
+Y
e, 2: f3K i
lO, l[
respectively, for suitable fJ > 0, Y E lit
Thus Q-. and R-linearly convergent sequences have convergence order 1, and
Q-quadratIcally convergent sequences have convergence order 2.
To determine the convergence order of methods like the secant method we
need the following.
'
5.4.1 Lemma. Let Po, PI, ... , Ps be nonnegative numbers such that
P := Po + PI
+ ... + Ps -
1 > 0,
Univariate Nonlinear Equations
258
5.4 Convergence Order
5.4.2 Corollary. For simple zeros. the secant method has convergence order
K = (l + ../5)/2 ~ 1.618 ....
and let K be a positive solution of the equation
(4.3)
Proof By Theorem 5.1.2, the secant iterates satisfy the relation
If e, is a sequence diverging to +00 such that
IXi+l - x*1 ::: clxi
for some number a
E
IR then there are f3 > 0 and y
e, :::: f3K i
+y
E
- x*llxi-l - x"],
where i: = sup c.. Taking logarithms gives
IR such that
for all i :::: 1.
with a = -log 1Oc. Lemma 5.4.1 now implies that
Proof We choose io :::: s large enough such that e, + a / p > 0 for all i :::: io - s,
and put
e, :::: f3K i
f3 := min{K-i(ei
259
+ a l p) I i =
+Y
with f3 > 0 and y
E
IR
io - s, ... , io}·
for the positive solution K = (I
+ ../5) /2 of the equation
Then f3 > 0, and we claim that
e, :::: f3K i - al p
for all i :::: io - s . By construction, (4.4) holds for i = io - s, ... , io· Assuming
that it is valid for i - s, i - s + 1, ... , i (i :::: io) we have
ei+l :::: POei + Plei-l + ... + psei-s + a
:::: PO(f3Ki - a/ p) + Pl (f3K i- 1 - a/ p)
= f3K i- S(POK S
= f3K i+l
-
+ ... + Ps) -
+ ... + Ps(f3K i-
S
-
a/ p)
+a
(Po + ... + Ps - p)a/p
a/ p.
R
y :=min{-a/p,el-f3K,e2-f3K , ... eio-S-] -pK
+Y
iO-s-l}
For this function, the divided differences
a
h[xi-s, ... ,x;J = - - - - - - - - (x* - Xi-s)'" (x* - Xi)
for all i > 1.
o
This proves the assertion.
..
K,
and this solution satisfies
1 <K < l+max{po,pl,""Ps}'
were computed in Example 3.1.7. By taking quotients of successive terms, we
find
eal
One can show (see Exercise 14) that (4.3) has exactly one POSItIve r
solution
It is possible to increase the convergence order of the secant method by using not
only information from the current and the previous iterate, but also information
from earlier iterates.
An interesting class of zero-finding methods due to Opitz [77] (and later
rediscovered by Larkin [55]) is based on approximating the poles ofthe function
h := 1/f by looking at the special case
hex) = - - .
x* -x
then one obtains
e, :::: f3K i
The Method of Opitz
a
So by induction, (4.4) is valid for all i :::: i o - s. If one chooses
2
o
(4.4)
*
X =Xi
+ h[xi-s, ... ,xi-tl .
h[Xi-s, ... , Xi]
In the general case, we may consider this formula as providing approximations
260
Univariate Nonlinear Equations
to x*. For, given
XI, ... ,
5.4 Convergence Order
Xs+J this leads to the iterative method of Opitz:
h[xi-s,.·., Xi-JJ
Xi+l := Xi + ---~-­
h[xi-s, ... , x;]
(fori> s).
261
obtain
h[xi-s, ... , x;] =
(Os)
a
Ci-s" 'Ci
+ g;-s i,
'
and
In practice, X2, ... ,Xs+I are generated for s > I by the corresponding lower
order Opitz formulas.
C'1+1 --
x*
-
x i+l =x • -x' - h[xi-s,
- .. ', Xi-Il
---=-::
I
h[xi_s. . , . , xiJ '
whence
5.4.3 Theorem. For functions of the form
hex) = Ps-J(x)
x -x*
_
in which Ps-l is a polynomial of degree S~ - I, the Opitz formula (Os) gives
the exact pole x" in a single step.
-
o
gi-s.iCi - gi-s,i-1
a + gi-s,iC'i-s ... e,
Ci -s ... Ci - - - - - - - - ' - _
In the neighborhood ofthe pole x' the quantities
Proof Indeed, by polynomial division one obtains
a
hex) = - x* - X
+ Ps-2(X)
(s)._ gi-s,ici - gi--s,i-I
c i . - ----'---..:::..:...---=..:.:--...:..a + gi-s,ici-s ... Ci
with a = -Ps-l(X*)
and the polynomial Ps-2 of degree :SS - 2 does not appear in sufficiently higher
order divided differences. So the original argument for hex) = I/(x - x")
applies again.
0
remain bounded. As in the proof for the secant method, one deduces the locally
superlinear convergence and that
lim c(s) = c(s) '= -g(s-I)(x*)
1
•
(s - l)!a .
i ..... oo
5.4.4 Theorem. Let x' be a simple pole ofh and let h(x)(x' - x) be at least s
times continuously differentiable in a neighborhood ofx'. Then the sequence Xi
defined by the method ofOpitz converges to x' for all initial values XI, ... ,X,+I
sufficiently close to x*, and
Taking logarithms of the relation
ICi+ll
:s:
C!Ci-s
I ... lcd,
C = sup
Ic?) I,
i::-:1
the convergence order follows from Lemma 5.4.1.
where ciS) converges to a constantfor i - f 00. In particular, the method (Os) is
superlinearly convergent; the convergence order is the positive solution K s = K
of K s + 1 = K S + K s - 1 + ... + 1.
o
For the determination of a zero of I, one applies the method of Opitz to
the function h := Ilf. For s = 1, one obtains again the secant method. For
~ = 2, one can show that the method is equivalent to that obtained by hyperbolic
Interpolation,
Proof We write
a
hex) = - x* -x
Xi+l
+ g(x)
= Xi -
f(Xi)
f[Xi, Xi-JJ - f(Xi-l)f[Xi, Xi-I, Xi-2]1 f[Xi-l, Xi-2]'
(4.5)
by rewriting the divided differences (see Exercise 15).
where g is an (s - I)-times continuously differentiable function. With the
abbreviations Cj := x' - Xj (j = 1,2, ... ) and gi-s,i := g[Xi-s, ... , x;], we
The.convergence order of (q,) is monotone increasing with s (see Table 5.6).
In particular, (0 2 ) and (0 3 ) are superior to the secant method (0 ) . The values
1
262
Univariate Nonlinear Equations
5.4 Convergence Order
Table 5.6. Convergence order of (Os)
for different s
Proof Because the matrix pencil (A, B) is nondefective, there is abasis u', ... ,
u" E C" of C'' consisting of eigenvectors, Au i = AiBui. If we represent B-Ib
as a linear combination B- 1b =
a.u", where tx, E C, then
S
Ks
1
2
3
4
5
1.61803
1.83929
1.92756
1.96595
1.98358
S
6
7
8
9
10
263
L
«,
1.99196
1.99603
1.99803
1.99902
1.99951
s = 2 and s = 3 are useful in practice; locally, they give a saving of about
20% of the function evaluations. For a robust algorithm, the Opitz methods
must be combined with a bisection method, analogously to the secant bisection
method. Apart from replacing in Algorithm 5.2.6 the secant steps by steps
computed from (4.5) once enough function values are available, and storing
and updating the required divided differences, an asymptotic analysis of the
local sign change pattern (that now has period s + 2) reveals that to preserve
superlinear convergence, one may use mean steps only for slow> s.
Eigenvalues as Poles
Calculating multiple eigenvalues as in Section 5.3 via the determinant is generally slow because the superiinear convergence of zerofinding methods is lost.
This slowdown can be avoided for Hermitian matrices (and more generally for
nondefective matrix pencils, i.e., when a basis consisting of eigenvectors exists)
if one calculates the eigenvalue as a pole of a suitable function.
To find the eigenvalues of the parameter matrix G (A), one may consider the
function
for suitable a, b E C"\ {OJ. Obviously, each pole of h is an eigenvalue of G(A).
Typically, each (simple or multiple) eigenvalue of G(A) is a simple pole of h:
however, eigenvalues can be lost by bad choices of a and b, and for defective
problems (and only for such problems), where the eigenvectors don't span the
full space, multiple poles can occur.
5.4.5 Proposition. Suppose that a, b E cn\{o}, that A, B E cnxn, and that
the matrix B is nonsingular. If G(A) := AB - A is nondefective, then the
function h(A) := aTG(A)-lb has no multiple poles.
satisfies the relation
G(A)X(A) = L
~(ABui - Au i)
A -Ai
= LCliBui = B(B-1b) = b.
From this, it follows that
that is, h has only simple poles.
o
We illustrate this in the following examples.
5.4.6 Examples.
(i) For G(A) := (A - Ao)l, the determinant f(A) := detG(A) = (A - Ao)n
has the n-fold zero Ao. However, h(A) := aT G(A)-Ib = aT bft). - AD) has
the simple pole AD iff aT b =1= O.
(ii) For the defective parameter matrix
G(A)._(A-l
.-
0
-I )
A- I
'
n = 2, f(A) := det G(A) = (A _1)2 has the double zero A= I, and h(A) :=
aTG(A)-lb = aTbl(A - 1) + a,b 2/(A - 1)2 has A = I as a double pole
if b2 =1= O. Thus the transformation of zeros to poles brings no advantage
for defective problems.
a,
To determine the poles of h by the method of Opitz, one function evaluation is
necessary in each iteration. For the calculation of h(AI), we solve the system of
equations G(AI)X I = band findh(AI) = aT x', Thus, as for the calculation of the
determinant, one triangular factorization per function evaluation is necessary,
but instead of taking the product of the diagonal terms, we must solve two
triangular systems.
Univariate Nonlinear Equations
5.5 Error Analysis
Table 5.7. Evaluations of the function h to find the
double eigenvalue
of the denominator is as large as possible (i.e., the correction is as small as possible). For real calculations, therefore, ± = signto»); for complex calculations,
± = signrke o» Req + 1m Wi Imq), where q denotes the square root in the
denominator.
One can show that the method of Muller has convergence order K ~ 1.84,
the solution of K 3 = K Z + K + I for simple zeros of a three times continuously differentiable function; for double zeros it still converges superIinearly
with convergence order K ~ 1.23, the solution of 2K 3 = K Z + K + 1. A disadvantage is that the method may produce complex iterates from real starting
points. To avoid this, it is advisable to replace the square root of a negative
number by zero. (However, for finding complex roots, this may be an asset;
d. Section 5.6.)
264
1.10714285714286
0.25000000000000
-0.03875000000000
0.00100257895440
-0.00002874150660
-0.00000004079291
-0.00000000118853
-0.00000000003299
0.00000000000005
-0.00000000000000
0.00000000000000
6.00000000000000
3.00000000000000
2.12500000000000
2.23897058823529
2.23598478789267
2.23606785942768
2.23606797405966
2.23606797740430
2.23606797749993
2.23606797749979
2.23606797749979
1
2
3
4
5
6
7
8
9
10
11
265
5.5 Error Analysis
5.4.7 Example. We repeat the calculation ofthe two double eigenvalues ofthe
matrix (3.1) of Example 5.3.1, this time as the poles of h(A) := aT (AI - A)-Ib,
where a = b = (1,1,1,1)7. With starting values XI = 6 and Xz = 3, we do
one Opitz step (OJ) and further Opitz steps (Oz), and find results as shown in
Table 5.7. After II evaluations of the function h, one of the double eigenvalues
is found with full accuracy.
Muller's Method
The method of Muller [65] is based on quadratic interpolation. From
0= f(x*)
~ f(Xi)
=
Limiting Accuracy
Suppose that the computed approximation 1(x) for
Il(x) - f(x)1
x*1
=
m
Xi)
+ f[Xi, Xi-I, X,-Z](X* - Xi)(X* - xi-d
f(Xi) + Wi(X* - Xi) + f[Xi, Xi-I, Xi-Z](X*
(x) satisfies
s8
for all X near a zero x*. Because only I (x) is available in actual computation,
the methods considered determine a zero x* of lex); and for such a zero, the
true function values only satisfy If(i*)1 s 8.
Now suppose that x* is an m-fold zero of f. Then f (x) = (x - x*)m g(x)
with g(x*) i- 0, and it follows that
Ix* -
+ f[Xi, xi-d(x* -
f
Ig(X*)
f(X*) 1<
-
m
_0Ig(x*) I
= O( ~).
(5.1 )
For simple zeros, we have more specifically
- Xi)Z,
_ - f(X*) -_ f[-*
X,X *] '"
g (X- *) -_ f(x*)
X* - X*
"0
where
r. X,*)
that is, the absolute error satisfies to a first approximation
we deduce that
x* _ X*
(4.6)
I
~<
I -
[)
If'(x*)]
From this, we draw some qualitative conclusions:
To ensure that Xi+ 1 is the zero closest to Xi of the parabola interpolating in
Xi, Xi-I. Xi-2, the sign ofthe square root must be chosen such thatthe magnitude
(i) For very smalllf'(x*)I, that is, for very flat functions, the absolute error
in x* is strongly magnified. In this case, x* is ill-conditioned.
Univariate Nonlinear Equations
5.5 Error Analysis
(ii) In particular, the absolute error in i* is magnified for multiple zeros;
because of (5.1), one finds that the number of correct places is only about
the mth part of the mantissa length that is being used.
(iii) A double zero cannot be numerically distinguished from two simple zeros
that lie only about 0 (-05) apart.
given in a less explicit form. In particular, this applies to the characteristic
polynomial of a matrix whose zeros are the eigenvalues, where, as we have
seen in Example 5.3.1, much better accuracy is achievable by using matrix
factorizations.
266
These remarks are valid independently of the method used.
Deflation
To compute several (or all) of the zeros of a function f, the standard technique is called deflation and proceeds in analogy to polynomial division, as
follows.
If one already knows some zeros x{, ... , x;(s ::': I) with corresponding multiplicities m I, ... , m.: then one may find other zeros by setting
5.5.1 Example. Consider the polynomial
f(x) := (x - l)(x - 2)··· (x - 20)
of degree 20 in the standard from
f(x) = x 20
-
210X 19
+ ... + 20!
The coefficient ofx 15 is (checkwithMATLAB's function poly!) -1672280820.
Therefore, a relative error of 8 in this coefficient produces an absolute perturbation of
=
\/(x) - f(x)1
1672280820x
15
8
in the function value at x. For the derivative at a zero x* = 1,2, ... , 20,
If'(x*)1
=
n
[x" -
kl = (x* - 1)!(20 -
x")!
k=1:20
ki'x*
so that the limiting accuracy for the calculation of the zero x*
1672280820· 16
IS! 4!
15
8
~
6.14. 10 13 8
267
=
16 is about
~ 0.014
for double precision (machine precision 8 ~ 2.22· 10- 16 ) . Thus about two digits
of relative accuracy can be expected for this zero. MATLAB's roots function
for computing all zeros of a polynomial produces for this polynomial the approximation 16.00304396718421 in place of 16, with a relative error of 0.00019;
and if we multiply the coefficient of x 15 by 1 + 8, roots finds the approximation 15.96922528457543 with a relative error of 0.0019. The slightly higher
accuracy obtained is due to our worst-case analysis.
This example is quite typical for the sensitivity of zeros of high degree
polynomials in standard form, and explains why, for computing zeros, o~e
should avoid the transformation to the standard form when a polynomial IS
f(x)
g (x) : = -=,-----::.-------(x -xjr1)
n
(5.2)
j=l:s
and seeking solutions of g(x) = O. By definition of a multiple zero, g(x)
converges to a number i- 0 as x ~ xj. Therefore, the solutions x{, ... , x;
are "divided out" in this way, and one cannot converge to the same zero
again.
Although the numerically calculated value of x* is in general subject to
error, the so-called implicit deflation, which evaluates g(x) directly by formula
(5.2), is harmless from a stability point of view because even when the xj are
inaccurate, the other zeros of f are still exactly zeros of g. The only possible
problems arise in a tiny neighborhood about the true and the calculated xj, and
because there is no further zero, this is usually irrelevant.
Note that when function evaluation is expensive, old function values not
very close to a deflated zero can be used again to check for sign changes of the
deflated function!
Warning. For polynomials in the power form, it seems natural to perform
the division explicitly by each linear factor x - x* using the Homer scheme,
because this again produces a polynomial g of lower degree instead of the
expression f(x)/(x - x"). Unfortunately, this explicit deflation may lead to
completely incorrect results even for exactly calculated zeros. The reason is
that for polynomials with zeros that are stable under small relative changes
in the coefficients but very sensitive to small absolute changes, already the deflation of the absolutely largest zero ruins the quality of the other zeros. We show
this only by an example; for a more complete analysis of errors in deflation, see
Wilkinson [98].
Univariate Nonlinear Equations
5.5 Error Analysis
5.5.2 Example. Unlike in the previous example, small relative errors in the
coefficients of the polynomial
Once the existence test applies, one can refine the box containing the solution
as follows. By the mean value theorem,
268
r
I(x) := (x - l)(x -
=
x
20
_
(2 _
r
1
19
) .•.
)x
19
(x -
r
19
+ ... + r
)
I(x)
cause only small perturbations in the zeros; thus all zeros are well conditioned.
Indeed, the MATLAB polynomial root finder roots gives the zeros with a maximal absolute error of about 6.10- 15 . However, explicit deflation of the first zero
x* = 1 gives a polynomial g(x) with coefficients differing from those of
g(x) := (x - 2- 1 ) ... (x - 2- 19 ) by 0 (s), The tiny constantterm is completely
altered, and the computed zeros of g(x) are found to be 0.49999999999390
instead of 0.5, 0.25000106540985 instead of 0.25, and the next largest roots
are already complex, 0.13465904883151 ± 0.01783932277158i.
5.5.3 Remark. For finding all (real or complex) zeros of polynomials, the
fastest and most reliable method is perhaps that of Jenkins and Traub [47,48J; it
finds all zeros of a polynomial of degree n in 0 (n 2 ) operations by reformulating
it as an eigenvalue problem. The MATLAB 5 version of roots also proceeds
that way, but it takes no advantage of the sparsity in the resulting matrix, and
hence requires 0 (n 3 ) operations.
The Interval Newton Method
For rigorous bounds on the zeros of a function, interval arithmetic may be used.
We suppose that the function I is given by an arithmetical expression, that I is
twice continuously differentiable in the interval X E IT lR, and that 0 rt f' (x). The
latter (holds in reasonably large neighborhoods of simple zeros, and) implies
that I is monotone in x and has there at most one zero x*.
Rigorous existence and nonexistence tests may be based on sign changes:
=>
> 0 =>
:s 0 =>
I(x) - I(x*)
=
I/(~)(X - x*)
190
for ~
o rt I(x)
=
269
there is no zero in x.
there is no zero in x.
there is a unique zero in x.
01 f'(x), I (!) I (x)
01 f'(x), I (!) I (x)
Note that I must be evaluated at the thin intervals! = [:!.,:!.] and x = [x, x] in
order that rounding errors in the evaluation of I are correctly accounted for.
These tests may be used to find all real zeros in an interval X by splitting
the interval recursively until one of the conditions mentioned is satisfied. After sufficiently many subdivisions, the only unverified parts are near multiple
zeros.
E
O{x, x*}. If x* E x then
*
_
I(x)
_
x = x - --
with x
E
I(x)
Ex -
j'(~)
--.
(5.3)
j'(x)
x. Therefore, the intersection
xn
(x _
I(X»)
j'(x)
x := Xi
also contains the zero. Iteration with the choice
so-called interval Newton method
XI
Xi+l
=
x,
=
Xi
n
(Xi -
;/~~i~)'
= mid Xi leads to the
= 1,2,3, ...
i
(5.4)
Xi is deliberately written in boldface to emphasize that again, in order to correctly account for rounding errors, Xi must be a thin interval containing an
approximation to the midpoint of Xi. A useful stopping criterion is Xi+l = Xi;
because Xi+ 1 S; Xi and finite precision arithmetic, this holds after finitely many
steps.
Because of (5.3), a zero x* cannot be "lost" by the iteration (5.4); that is,
x* E X implies that
x* E
Xi
for all i.
However, it may happen that the iteration (5.4) stops with an empty intersection
= 0; then, of course, there was no zero in x.
As shown in Figure 5.5, the radius rad Xi is at least halved in each iteration
due to the choice x = xi. Hence, if the iteration does not stop, then rad Xi
converges to zero, and the sequence Xi converges (in exact arithmetic) to a thin
interval X oo = x oo . Taking the limit in the iteration formula, we find
Xi
I(x
)
x 00 =x00 - -oo- ,
j'(x
oo)
so that I(x oo )
= O. Therefore, the limit X oo = x*
is the unique zero of I in x.
271
5.5 Error Analysis
Univariate Nonlinear Equations
270
5.5.5 Example. We determine the zero x* = .j2 = 1.41421356 ... of
f(x) := 1 - 3/(x 2
+ 1)
with the interval Newton method. We have f' (x)
with Xl := [1,3], the first iteration gives
f(Id
=
f([2, 2])
I
Figure 5.5. The interval Newton method.
f
5.5.4 Theorem. If 0 ¢ f'(x), then:
X2
(i) The iteration (5.4) stops after finitely many steps with empty
only
Xi
= 0 if and
if f has no zero in x.
(ii) The function f has a (unique) zero x* in x
if and only if Iimhoo Xi = x*.
In this case,
rad Xi + I
:s 2 rad Xi ,
(5.5)
In particular, the radii converge quadratically to zero.
Proof Only the quadratic convergence (5.5) remains to be proved. By the mean
value theorem, f(I i ) = f'(~)(Ii - x*) with ~ E O{Ii • x"]. Therefore
=
v
Xi -
(V
Xi -
X
=
[1,3]
+ 1)2. Starting
3+ 1 = 1 - [5,5]
3 = [25' 52] '
1 - [2,2]2
6· [1,3]
[2,10]2
=
n ([2,2] -
[6,18]
[4,100]
[~'
=
[3 9]
50' 2
'
n/[:O'~])
14 86]
= [1,3] n [ -3'
45 =
[1, 1.91111 ...].
") f'(~)
Because interval arithmetic is generally slower than real arithmetic, the splitting process mentioned previously can be speeded up by first locating as many
approximate zeros as one can find by the secant bisection method, say. Then
one picks narrow intervals containing the approximate zeros, but wide enough
that the sign is detectable without ambiguity by the interval evaluation f at each
end point; if the interval evaluation at some end point contains zero, the sign
is undetermined, and one must widen the interval adaptively by moving the
Table 5.8. Interval Newton methodfor
f(x) = 1 - 3/(x 2 + 1) in Xl = [1,3]
r(Xi) ,
and
(5.6)
I
2
3
4
5
6
7
Now IIi - x* I :s rad Xi, f' is bounded on x, and by Theorem 1.5.6(iii),
Insertion into (5.6) gives (5.5).
=
6x / (x 2
Further iteration (with eight-digit precision) gives the results in Table 5.8, with
termination because X7 = X6. An optimal inclusion of .j2 with respect to a
machine precision of eight decimal places is attained.
1
and
Xi+l
(x.)
=
=
o
-.
x·
Xi
1.0000000
1.0000000
1-3183203
lA08559 I
1.4142104
1.4142135
1.4142135
3.0000000
1-9111112
lA422849
1.4194270
1.4142167
1.4142136
1.4142136
Univariate Nonlinear Equations
5.6 Complex Zeros
corresponding end point. The verification procedure can then be restricted to
the complements of the already verified intervals.
If the components of G(A) are given by arithmetical expressions and s is
a small interval containing S, then the interval evaluation Go(s) contains the
matrix Go(s) ~ I. One can therefore expect that Go(s) is diagonally dominant,
or is at least an H-matrix; this can easily be checked in each case. Then for
t E S, GoU) E Go(s) is nonsingular and f is continuous in s. Therefore, any
sign change in s encloses a zero of f and hence an eigenvalue of G(A). Note
that to account for rounding errors, the evaluation of f (A) for finding its sign
must be done with a thin interval lA, A] in place of A.
Once a sign change is verified, we may reduce the interval s to that defined
by the bracket obtained, and obtain the corresponding eigenvector by solving
the system of linear interval equations
272
Error Bounds for Simple Eigenvalues and Associated Eigenvectors
Due to the special structure of eigenvalue problems, it is frequently possible to
improve on the error bounds for general functions by using more linear algebra.
Here, we consider only the case of simple real eigenvalues of a parameter matrix
G(A).
If A* is a simple eigenvalue, then we can find a matrix C and vectors a, b
such that the parameter matrix
Go (A) := CG(A)
=1=
0
+ ba"
Bx = b
is nonsingular in a neighborhood of A*. The following result permits the treatment of eigenvalue problems as the problem of determining a zero of a continuous function.
5,5.6 Proposition. Let GO(A) := CG(A)
+ ba'! where C is nonsingular. IfA*
is a zero of
with
BE
B
=
273
Go(s)
by Krawczyk's method with the matrix (DiaglD- 1 as preconditioner.
If during the calculation no sign change is found in s, or if Go(s) is not
diagonally dominant or is not at least an H-matrix, this is an indication that s
was a poor eigenvalue approximation, s was chosen too narrow, or there is a
multiple eigenvalue (or several close eigenvalues) near S.
5.6 Complex Zeros
and x := GO(A*)-lb, then A* is an eigenvalue OfG(A), and x is an associated
eigenvector, that is, G(A *)x
= O.
Proof If f (A*) = 0 and x = GO(A *)-1 b, then a H x = f (A*)+ I = 1. Therefore,
we have CG(A*)x = GO(A *)x - ba'' x = b - b = 0; because C is nonsingular,
it follows that G(A *)x = O.
0
To apply this, we first calculate an approximation s to the unknown (simple)
eigenvalue A*. Then, to find suitable values for a, band C, we modify a normalized triangular factorization LR of G(s) by replacing the diagonal element
Ri, of R of least modulus with IIRlloo. With the upper triangular matrix R ' so
obtained, we have R = R' - ye(i) (e(i))H where y = R;i - Rii, so that
In this section, we consider the problem of finding complex zeros x* E D of a
function f that is analytic in an open and bounded set D S; <C and continuous
in its closure D, In contrast to the situation for D C R the number of zeros
(counting their multiplicity) is no longer affected by small perturbations in
f, which makes the determination of multiple zeros a much better behaved
problem. In particular, there are simple, globally convergent algorithms based
on a modified form of damping. The facts underlying such an algorithm are
given in the following theorem.
5.6.1 Theorem. Let Xo ED and If(x)1 > If(xo)! for all x E oD, Then f has
a zero in D. Moreover, if f(xo) =1= 0, then every neighborhood ofxo contains a
point XI with. If(xj)1 < If(xo)l.
Proof Without loss of generality, we may assume that D is connected. For all
a with Xo + a in a suitable neighborhood of Xo, the power series
Thus if we put
f(xo + a)
then Go(s)
~
I.
is convergent.
=
,
f(xo) + f (xo)a + ... +
(xo) n
, a + ...
n.
f(n)
If all derivatives are zero, then f(x) = f(xo) in a neighborhood of xo, so f
is a constant in D. This remains valid in b, contradicting the assumption.
Therefore, there is a smallest n with f(n) (xo) =/= 0, and we have
(6.1)
with
gn(a)
=
f(n) (xo)
n!
+a
r-:
(n
1) (xo)
+ I)!
+ ....
In particular, gn (0) =/= O. Taking the square of the absolute value in (6.1) gives
If(xo
+ a)1 2 = If(xo)1 2 + 2Re angn(a)f(xo) + laI 2nlgn(a)12
= If(xo)1 2 + 2Re angn(O)f(xo) + O(laI 2n)
< If(xo)1
2
if a is small enough and
Spiral Search
Using (6.3), we could compute a suitable a if we knew the nth derivative of f.
However, the computation of even the first derivative is usually unnecessarily
expensive, and we instead look for a method that works without derivatives.
The key is the observation that when a =/= 0 makes a full revolution around
zero, the left side of (6.2) alternates in sign in 2n adjacent sectors of angle n / n.
Therefore, if we use a trial correction a and are in the wrong sector, we may
correct for it by decreasing a along a spiral toward zero.
A natural choice, originally proposed by Bauhuber [7], is to try the reduced
corrections q':« (k = 0, I, ...), where q is a complex number of absolute value
Iq I < I. The angle of q in polar coordinates must be chosen such that repeated
rotation around this angle guarantees that from arbitrary starting points, we land
soon in a sector with the correct sign, at least in the most frequent case that
n is small. The condition (6.2) that tells us whether we are in a good sector
reduces to Reqkn y < 0 with a constant y that depends on the problem. This is
equivalent to
kn arg(q)
Re an s- (0) f (xo) < O.
To satisfy (6.2), we choose a small
E:
(6.2)
> 0 and find that
for the choice
(6.3)
valid unless f(xo) = O. Because a; -+ 0 if E: -+ 0, we have If(xo + aE)1 <
If (xo) I for sufficiently small E:. This proves the second part.
Now If I attains its minimum on the compact set o, that is, there is some
x* E jj such that
for all x
E
b,
(6.4)
and the boundary condition implies that actually x* E D. If f(x*) =/= 0, we may
use the previous argument with x* in place of xo, and find a contradiction to
(6.4). Therefore, f(x*) = O.
0
+ cp
E ]2Irr, (21
+ I)rr[
for some I
E
Z,
(6.5)
where cp = arg(y) - n /2. If we allow for n = 1,2, 3, 4 at most 2, 2, 3, resp.
5 consecutive choices in a bad sector, independent of the choice of tp; this
restricts the angle to a narrow range, arg(q) E ±]¥arr, ~rr[ (cf. Exercise 22).
The simplest complex number with an angle in this range is 6i - I; therefore,
a value
q
If(x*)1 ::: If(x)1
275
5.6 Complex Zeros
Univariate Nonlinear Equations
274
= A(6i
- I)
with A E [0.05,0.15],
say, may be used. This rotates a in each trial by a good angle and shrinks it by
a factor of about 0.3041 - 0.9124, depending on the choice of A. Figure 5.6
displays a particular instance of the rotation and shrinking pattern for A =
0.13. Suitable starting corrections a can be found by a secant step (1.1), a
hyperbolic interpolation step (4.5), or a Muller step (4.6). The Muller step is
most useful when there are very close zeros, but a complex square root must be
computed.
Rounding errors may cause trouble when f is extremely flat near some point,
which typically happens when the above n gets large. (It is easy to construct
artificial examples for this.) If the initial correction is within the flat part, the
new function value may agree with the old one within their accuracy, and
the spiral search never leaves the flat region. The remedy is to expand the
276
Univariate Nonlinear Equations
5.6 Complex Zeros
3r-~---,--------,-------,.--~-r-,------,----~-~---,
o
+
+
x = Xo; f = If(xo)l; fok = e * f;
q = 0.13 (6i - I); qll = I -"JE; qgl
while f > j;'b
2
*
+
+
+
277
5.6.2 Algorithm: Spiral Search for Complex Zeros
= 1+"JE;
~ompute a nonzero correction a to x (e.g. a secant or Muller step);
OI-------------~"*"'=-,,_,__----------___1
-1
a
+
-3
-3
+
+
(l
+ IxI + IXoldl)P;
a
* f,
% flat step
= q * a; jiat = 0;
end;
-2
-1
0
2
3
Figure 5.6. Spiral search for a good sign in case n = 3. The spiraling factor is q =
O.13(6i - I). The circles mark the initial a and corresponding rotated and shrinked
values. The third trial value gives the required sign; by design, this is the worst possible
case for n = 3.
if k
end;
end;
== kmax, return; end;
The algorithm stops when f
a sufficient decrease.
size of the correction when the new function value is too close initially to the
old one.
Another safeguard is needed to avoid initial step sizes that are too large. One
possiblility is to enforce
For p = 2, this still allows "local" quadratic convergence to infinite zeros (of
rational functions, say). If this is undesirable, one may take p = 1.
We formulate the resulting spiral search, using a particular choice of constants that specify the conditions when to accept a new point, when to spiral,
when to expand and when to stop. e is the machine precision; kmax and pare
control parameters that may be set to fixed values. Xo and an initial a must be
provided as input. (In the absence of other information, one may start, e.g., with
Xo
laI :s
fnew = [f(x - a)l;
if fnew < ql1 * f, % good step
x = x - a; f = fnew; break;
end;
ifjiat & fnew < qgl
a = lO*a;
else % long step
-2
+
If necessary, rescale to enforce
jiat= I;
for k = I : kmax ,
= 0, a = 1.)
:s fok or when kmax changes of a did not result in
Rigorous Error Bounds
For constructing rigorous error bounds for complex zeros, we use the following
tool from complex analysis.
5.6.3 Theorem. Let f and g be analytic in the interior ofa disk D and nonzero
and continuous on the boundary D. If
a
fez)
Re g(z) > 0
for all
ZE
aD
(6.6)
a~d each root is counted according to its multiplicity, then f and g have precisely the same number a/zeros in D.
Proof This is a refinement of Rouche'» theorem, which assumes the stronger
5.6 Complex Zeros
Univariate Nonlinear Equations
278
For polynomials, one may use in place of the kth derivative the divided
difference used in the proof, computable by means of the first k steps of the
complete Homer scheme. With this change, 8 can usually be chosen somewhat
smaller.
The theorem may be used as an a posteriori test for existence, starting with
an approximate root Z computed by standard numerical methods. It provides a
rigorous existence test, multiplicity count, and enclosure for the root or the root
cluster near Z. The successful application requires that we "guess" the right k
and a suitable 8. Because 8 should be kept small, a reasonable procedure is the
following: Let 15 k be the positive real root of the polynomial
condition
If(z)1 > If(z) - g(z)1
for all z
aD.
E
Rouche's theorem in Henrici [43, Theorem 4.10b] uses
However, th e proo f 0 f
.
0
in fact only the weaker assumption (6.6).
A simple consequence is the following result (slightly sharper than Neumaier
[69]).
5.6.4 Theorem. Let f be analytic in the disk D[z; r]
-s:
{z
279
Eel [z - zl ::: r}.
and let 0 < 8 < r. If
R f(k)(Z)
e f(k) (.-) >
z
then
""
Z::
I=O:k-l
k! IRe f(l)(z) \ 81- k for all z E D[z; 8],
l!
j<k)(z)
f has precisely k roots in
(6.7)
then (6.7) forces 8 :;: 15 k. If 15 k is small, then f(k)(Z) = f(k)(Z) + 0(8) so that
it is sufficient to take 8 only slightly bigger than 15 k • In practice, it is usually
sufficient to calculate 15 k to a relative precision of 10% only and choose 8 = 28k ;
because k: is unknown, one tries k = 1, 2, ... , until one succeeds or a limit on
k is attained.
If there is a k-fold zero z* and the remaining roots are far away, then Taylor's
formula gives
D[z; 8], where each root is counted according
to its multiplicity.
Proof We apply Theorem 5.6.3 with
f(k)(Z)
_k
g(z) = -k-!-(z - z)
to D
=
D[z; 8]. Hermite interpolation at Z gives
fez) = PH (z)
Thus
+ f[z, ... , z, z](z - z-)k
and
with
Pi-: (z)
=L
f(l) (z)
- I
-l-'-(z - z),
l<k
f[z, ... , Z, Z
= Re
rik ) e-)
~
15 k ~
= -'-k-!-
Iz - z*Ij(~ -
so that we get with our choice 8
.
for some ~ in D, For Z E aD, we have Iz f (z)
Re - ()
g z
]
zl = 8, hence
(n "" k!
f(k)
f(k)(;;)
~
+ c:
l'
l<k'
an overestimation of roughly 3k.
5.6.5 Example. Consider f (x) = x 4 - 2x 3 - x 2 + 2x + I that has a double root
x* = ~ (l + -)5) ~ 1.618033989. To verify the double root, we must choose
f(l) (z) (z _ Z)l-k
Re f(k)(Z)
k = 2. In this case, a suitable guess for the radius 8 is
f(k)(n
"" k! \ ~\ 81- k > 0
:;: Re f(k) (z) l! Re f(k) (z)
f:t
by our hypothesis (6.7), and Theorem 5.6,3 applies.
= 28k
1) < 1.5klz - z"],
Where
o
p
=
21 Re I"(z)
f'(z) I
,
-
q -
21 Re I"(z)
fez) I
.
Univariate Nonlinear Equations
280
Table 5.9.
5.7 Methods Using Derivative Information
z
Approximations to the root and corresponding
enclosures
Proof For f(x)
;=
f'(x)
Ie x )
8 (upward rounded)
8/lx* - zl
6.77 . 10- 1
8.91 . 10- 2
3.92.10- 2
1.65. 10- 4
5.44.10- 8
948.10- 3
1.49.10- 1
3.69.10- 1
Enclosure not guaranteed
4.94
4.88
4.83
4.83
4.82
4.66
Enclosure not guaranteed
1.5
1.6
1.61
1.618
1.618034
1.62
1.65
1.7
281
ao(x - ~I)'" (x - ~n),
1
d
= "dlogl/(x)1 ==
x
If x* is the zero closest to
1
--+"'+--.
x - ~I
X -
~n
(6.9)
x, then
Ix* -xl:s I~J -xl for j == 1, ... .n,
Therefore,
f' (X) ! <
! I(x)
n
- -Ix---x-*I
and
The condition guaranteeing two roots in the disk
inf{Re
Ix-x*l<n/I(X)!.
-
I
F(z) z
1"(2)
Dlz; oJ is
E D[Z;8 J} >
Thi s ~roves (6.8). A slight generalization of the argument gives the more general
assertion (see Exercise 21).
0
(p+q!o)!o,
and this can be verified by means of complex interval arithmetic. For various
approximations to the root, we find the enclosures Ix* - 21
shown in
Table 5.9. Of course, the test does not guarantee the existence of a double root,
but only that of two (possibly coinciding) roots x* with Ix* - 21 :s 0.
z
f'(x)
:s
Much more information about complex zeros can be found in Henrici [43].
°
5.7 Methods Using Derivative Information
Newton's Method
As
Error Bounds for Polynomial Zeros
The following result, valid for arbitrary x, can be used to verify the accuracy of
approximations i to simple, real, or complex zeros of polynomials. If rigorous
results are required, the error term must be evaluated in interval arithmetic,
using as argument the thin interval [i, i].
5.6.6 Theorem. If f is a polynomial of degree nand f' (.r) t= 0, x E C, ther
there is at least one zero of f in each disk in the complex plane that contains j
and x - n f,i~)). In particular, there is a zero x* with
__ * <
Ix
The bound is best possible as
x
I-
lex) =
I f(x) I
n f'(x) .
(x - x")" shows.
t
.
I
th
. I - j -+ X,,
e s~cant s ope f[Xi, Xi-I] approaches the slope f'(Xi) of the
tangent to f at the pomt Xi. In this limiting case, formula (1.1) yields the formula
for Newton's method
Xi+1
.
I(Xi)
,==Xi - - f'(Xi)'
i-I 2 3
- " , ....
(7.1)
5.7.1 Example. To compare with the bisection method and with the secant
:ethod, ~he zero x* = ../i = 1.414215362 ... of the function I(x) == x 2 _ 2
in iproxlmated by Newlon's method. Starting with XI :== I, we get the results
able 5.10. After five function evaluations and five derivative evaluations
one has 10 valid digits of x*.
.,
(6.8
ThIn th~ e~ampl.e, th~ Newton sequence converges faster than the secant method.
at this IS typical IS a consequence of the local Q_q d ti
the N ewton met h a d .
. ua ra IC convergence of
282
Table 5.11. A comparison ofwork versus order ofaccuracy
Table 5.10. Results of Newton's
method for x 2 - 2
1
1.5
1.416666667
1.414215686
1.4.l4213562
1.414213562
1
2
3
4
5
6
Xi+1 -
X
*
= ci, ( Xi.
-
Xl
X
(7 . 2)
*)2
lim c'I = c.
.
'--+00
In particular, Newton's method is Q-quadratieally convergent to a simple zero,
and its convergence order is 2.
Proof Formula (7.2) is proved just as the corresponding asserti~n for.the secant
method. Convergence for initial values sufficiently close to x ag~m follows
from this. With Co := sup Ie; I < 00, one obtains from (7.2) the relation
X*
I ::: Co IXi
-
Newton method
1
2
3
8
8
82
8
x *\2
from which the Q-quadratic convergence is apparent. By the results in Sectio~
5.4, the convergence order is 2.
Comparison with the Secant Method
By comparing Theorems 5.7.2 and 5.1.2, we see that locally, Newton's metho,d
converges in fewer iterations than the secant method. However, each stephI:
more expensive. If the cost for a derivative evaluation is about the same as t a
.
one Newton step
for a function evaluation, it is more appropnate to compare
with two secant steps. By Theorem 5.1.2, we have for the latter
x" =
[Ci+ICi(Xi-l -
~
8i
Secant method
8i+1
8i 8i-l
~
8
8
84
8
8
8
8
16
2
2
3
85
8
8
8
8
8
13
21
34
55
and because the term in square brackets tends to zero as i -+ 00, two secant
steps are locally faster than one Newton step. Therefore, the secant method
is more efficient when function values and derivatives are equally costly. We
illustrate the behaviour in Table 5.11, where the asymptotic order of accuracy
after n = 1,2, ... function evaluations is displayed for both Newton's method
and the secant method.
In some cases, derivatives are much cheaper than function values when computed together with the latter; in this case, Newton's method may be faster. We
use the convergence orders 2 of Newton's method and (l + ./5)/2 ~ 1.618 of
the secant method to compare the asymptotic costs in this case.
Let e and c' denote the cost of calculating f (Xi) and f' (Xi), respectively.
As soon as f is not as simple as in our demonstration examples, the cost for
the other operations is negligible, so that the cost of calculating Xi+1 (i :::: 1)
is essentially c + c' for Newton's method and e for the secant method. The
cost of s Newton steps, namely s(c + c'), is equivalent to that of s(l + c' /c),
function evaluations, and the number of correct digits multiplies by a factor 2s •
With the same cost, sO + c' /e) secant steps may be performed, giving a gain
in the number of accurate digits by a factor 1.618,1(1 +c' (c). Therefore, the secant
method is locally more efficient than the Newton method if
1.618 s ( l + c' / c ) > 2 s ,
that is, if
c'
Xi+2 -
~
8i+l
sufficiently close to x and
with
\Xi+l -
Function
evaluation
2
3
2
3
2
3
5.7.2 Theorem. Let the function f be twice continuously differentiable in a
.
*
d I
'- 1 f"( *)/f'(\*)
Then the
".*
neighborhood of the simple zero x ,an et C.- 2:'~
sequence defined by (7.1) converges to x* for all
283
5.7 Methods Using Derivative Information
Univariate Nonlinear Equations
X*)](Xi -
X*)2,
C
>
5.7 Methods Using Derivative Information
Univariate Nonlinear Equations
284
Therefore, locally, Newton's method is preferable to the secant method only
when the cost of calculating the derivative is at most 44% of the cost of a
285
Table 5.13. Results ofNewton's methodfor f(x) = 1 - lOx
with Xl = 20
+ O.Olex
function evaluation.
Xi
Global Behavior of Newton's Method
As for the secant method, the global behavior of Newton's method must be
assessed independent of the local convergence speed.
It can be shown that Newton's method converges for all starting points in some
dense subset oflR if f is a polynomial of degree n with real zeros si, .... ';n only.
The argument is essentially that if not, it sooner or la~er ~enerates s~m~ Xi >
x* := max(';l •... , ';n) (or Xi < min(.;l •. ··' ';n), which is treated similarly).
1
2
3
4
5
6
7
8
9
Xi
20.0000000000000000
19.0000389558837600
18.0001392428132970
17.0003965996736000
16.0010546324046890
15.0027299459509860
14.0069725232866720
13.0176392772467740
12.0441649793488760
10
11
12
13
14
15
16
17
18
11.1088835456740740
10.2610830282869120
9.5930334887471229
9.2146744950274755
9.1119744961101219
9.1056248937564668
9.1056021207975100
9.1056021205058109
9.1056021205058109
Then.
!,(xd
1
with a convergence factor of 1- ~ =
sets in.)
n
0< - - - < - - < - - - ,
Xi - x* - f(xi) - Xi - x*
so that
Xi+l
= Xi -
f(Xi)/!'(Xi) satisfies the relation
X,' -
X
* > Xi -
Later in this section we discuss a modified Newton method that overcomes
this slow convergence, at least for polynomials.
Slowness of a different kind is observed in the next example.
X· -x*
Xi+l
2: -I -n - ,
that is,
o :s Xi+l -
x"
:s
(1 - ~ )
5.7.4 Example. The function
(Xi - x").
Thus Newton's method converges monotonically for all starting points out~ide
the hull of the set of zeros. with global convergence factor of at least 1 - n'
For large n, a sequence of n Newton steps therefore decreases [x - ~* I by at
ctor (1 - l)n < 1. ~ 0.37. For f(x) = (x - x")", where this bound
t f
1easaa
n
e .
initiall
is asymptotically achieved, convergence is very slow; the same holds nnua Y
for general polynomials if the starting point is so far away from all zeros that
these "look like a single cluster."
2
5.7.3 Example. For Xl = 100, Newton's method applied to f(x) = .x - 2
yields the results in Table 5.12. The sequence initially converges only 1mearly,
Table 5.12. Results o.f Newton's method for x
2
Xi
100
!. (After iteration 7, quadratic convergence
50.01
3
25.02
4
12.55
2
5
6.36
-
2 with
XI
= 1O~
6
7
3.34
1.97
-------=
f(x) := 1 - lOx + O.Olex
has already been considered in Example 5.2.7 for the treatment of the secant
bisection method. Table 5.13 shows the convergence behavior of the Newton
method for Xl := 20. It takes a long time for the locally quadratic convergence
to be noticeable. For the attainment of full accuracy, 17 evaluations of f and f'
are necessary. The secant bisection method with Xl = 5, X2 = 20 needs about
the same number of function evaluations but no derivative evaluations to achieve
the same accuracy.
On nonconvex problems, small perturbations of the starting point may
strongly influence the global behavior of Newton's method, if some intermediate iterate gets close to a stationary point.
5.7.5 Example. We demonstrate this with the function
f(x) := 1 - 2/(x 2
+ 1)
5.7 Methods Using Derivative Information
Univariate Nonlinear Equations
286
Table 5.15.
287
Newton's methodsfor f(x) = x - I - 2/x with
two different starting points Xl
0.75
Xj
0.5
1
0.25
oL-----~~~+-~f------~l
-0.25
-0.5
-0.75
-2
3
-1
Figure 5.7. Graph of f(x) = 1 - 2/(x
2
+ I).
displayed in Figure 5.7, which has zeros at + I and .-1. Table 5.l~ shows that
three close initial values may result in completely different behavior.
The next example shows that the neighborhood where Newton's .method
converges to a given zero may be very asymmetric, and very large m some
direction.
5.7.6 Example. For the function
f(x) := x-I - 2/x
Table 5.14.
Newton's method for f(x) = 1- 2/(x
for different starting points Xl
Xi
1
2
3
4
5
6
7
8
9
10
11
1.999500
0.126031
2.109171
-0.118015
-2.235974
0.446951
0.983975
0.999874
1.000000
1.000000
1.000000
2
=
1.999720
0.125577
2.115887
-0.134153
-1.997092
-0.130986
-2.039028
-0.042251
-5.959273
46.906603
- 25754.409557
{-
±oo
1.999970
0.125062
2.123583
-0.152822
-1.787816
-0.499058
-0.968928
-0.999533
-1.000000
-1.000000
-1.000000
1000.0000000000000000
1.0039979920039741
1.6702074291915645
1.9772917776064771
1.9999127426320478
1.9999999987309516
2.0000000000000000
0.0010000000000000
0.0020004989997505
0.0040029909876475
0.0080139297363667
ODI6059455313883 1
0.0322437057559748
0.0649734647479344
0.1317795480072139
0.2698985122099006
0.5559697617074131
1.0969550112472573
1.7454226483923749
1.9871575101344159
1.9999722751335729
1.9999999998718863
2.0000000000000000
with zeros -I and +2 and a pole at X = 0, Newton's method converges for
all values Xl -I O. As Table 5.15 shows, the convergence is very slow for
starting values Ix! I « I because for tiny Xi we have only Xi+1 ~ 2Xi. Very large
starting values, however, give a good approximation to a zero in a few Newton
steps.
+ 1)
Xi
Xi
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
The Damped Newton Method
If the Newton method diverges, then one can often obtain convergence in spite
of this through damping. The idea of damping is that, instead of a full Newton
step,
Pi = - f(Xi)/!'(Xi),
only a partial step,
is taken, in which the damping factor (Xi is chosen by successive halving of an
initial trial value (Xi = 1 until If (Xi+ I) I < I f (Xi) I. This ensures that the values
If (Xi) I are monotonically decreasing and converge (because they are bounded
below by zero). Often, but of course not always, the limiting value is zero.
288
Univariate Nonlinear Equations
5.7 Methods Using Derivative Information
5.7.7 Example. The function
100
f(x) := 10x 5
-
36x 3
1 - - -.---- - -,-- _ ---,-__---,__---,,--_-----,
+ 90x
50
has the derivative
f' (X)
289
:= 50x 4
-
108x 2
+ 90 =
50(x 2
-
,
,,
1.08)2 + 31.68 > 0,
.;
o
so that f is monotone in R The unique real zero of ! is x* = O. Starting from
XI =6 (Table 5.16), Newton's method damped in this sense accepts always
the first trial value ex = I for the damping factor, but the sequence of iterates
alternates for! ~ 8, and has the limit points + 1 and -1, and I! (± 1) I = 64, cf.
Figure 5.8. The (mis- )convergence is unusually slow because the improvement
factor If(Xi+l) III! (Xi) I gets closer and closer to I.
for some fixed positive q < 1. (One usually takes a small but not tiny value, such
as q = 0.1.) Because the full Newton step taken for a, = 1 gives local quadratic
convergence, one tries in tum cti = 1, ~, ~, ... until (7.3) is satisfied. For
monotone functions, this indeed guarantees global convergence. (We analyze
this in detail in the multidimensional case in Section 6.2.)
Because the step length is now no longer a sign of closeness to the zero, one
stops the iteration when I! (Xi) I no longer decreases significantly; for example,
-
-
-
,;
-50
-
-
;
,
-
-
-~C!J
,,"
-
;
'*- - - _ _ _ _ _
X
;
-100
In order to ensure the convergence of the sequence I! (Xi) I to zero, clearly one
must require a little more thanjust I!(Xi+l) I < I/(Xi) I. In practice, one demands
that
(7.3)
-
... ,
-150
':---~--____=:;;";;----:;------:::':_------L---J
-1.5
1
05
-
- .
a
0.5
1
~~:~re 5.8. Oscillations of simplydampedNewton iteratesfor
f(x)
1.5
=
lOx 5
_
36x 3
+
when
If(xi)1
s
1!(Xi+l)l(l
+../i)
with the machine precision e.
A Modified Newton Method
Table 5.16.
1
2
3
4
5
6
7
8
9
10
Simple damping does not help in all situations
Xi
f(Xi)
6.000000
4.843907
3.926470
3.198309
2.615756
2.133620
1.685220
1.067196
-1.009015
1.008081
70524.000000
23011.672462
7506.913694
2456.653982
815.686552
284.525312
115.294770
66.134592
-64.287891
64.258095
21
22
23
24
25
26
27
28
29
30
Xi
f(Xi)
-1.003877
1.003706
-1.003550
1.003407
-1.003276
1.003154
-1.003041
1.002936
-1.002839
1.002747
-64.123953
64.118499
-64.113517
64.108946
-64.104738
64.100850
-64.097746
64.093896
-64.090774
64.087856
-
~e promise~ after Example 5.7.3 a modification of the Newton method that
aor polyno~11Ials, also converges rapidly for large starting values. Let f be
polynomial of degree n; by the fundamental theorem of algebra there are
c (
f whi
'
exactly n zeros c
'"
51, ... , ."''' some 0 which may coincide), and lex) = ao(x _
~l) (x - ~,,). For a SImple zero x* of ! we have
rex)
d
I
- ( ) = -d 10gl!(x)1 = - - + "
.f X
X
X - x*
L
_1_
srFX'
Where
C
E C\{x*},
X -
c
5j
R:::
_1_
x*
X -
n-l
+ --,
X - C
whence
x"
R::: x;+l
:= Xi _
f(x;)
!'(x;) - Xi-C
.!'-::.If(x)·
I
(7.4)
Univariate Nonlinear Equations
5.7 Methods Using Derivative Information
The formula (7.4) defines the modified Newton method. Because the additional
term in the denominator of the correction term vanishes as Xi ---+ x*, (7.4) is
still locally Q-quadratically convergent. Because
Table 5.17. Results of the modified Newton
methodfor f(x) = x 2 - 2 with XI = 100
290
291
Xi
1
2
100.000000000000000
2.867949443501232
1.536378083259546
1.415956475989498
1.414213947729941
1.414213562373114
1.414213562373095
1.414213562373095
3
- x. _
--="X-;i~--:-C_ _
-
L....-
I
= (
4
5
6
7
"'~+I
Xi-~j
L /~'-c
_~J.
Xi
J=I:n
8
+ C) / (
.L /~'-c
_~J. + I ) ,
J=I:n
I
I
it follows that Xi+l approaches the value L(~j - c) + c as Xi ~ 00; in contrast
Xi + O(l) ---+ 00 as
to the ordinary Newton method, for which Xi+l = (l Xi ---+ 00, the modified Newton method provides for very large starting values
in only one step a reasonable guess of the magnitude of the zero (unless c is
chosen badly). Moreover, the modified Newton method has nice monotonicity
properties when the polynomial has only real zeros.
*)
Therefore,
0.::: Xi+l
(i) If all zeros of f lie between c and Xl, then the modified Newton sequence
converges and is monotonically decreasing.
(ii) If no zero lies between c and XI, then the modified Newton sequence con-
verges monotonically.
x*)
(1 _
Xi -
= (Xi
*)
-
X
x*
C
+ n(x*
)
- c)
(n - l)(x' - c)
Xi -
x*
+ n(x* -
c)
(n - I)(x* - c), (I - ~)
By induction, it follows that x* :5 Xi+1 :5
inequality implies that Xi converges to x*.
(ii) Similarly proved.
Xi
(Xi -
x*)) .
for all i > I, and the last
0
(I;;;; I, l;e I)
Proof
(i) Let x* be the largest zero of f, and suppose that
holds for i = 1). Then we have
f'(Xi)
n- I
I
0< - - - < - - - - -
Xi-X' -
(Xi -
Xi -
.::: min
5.7.8 Theorem. Let f be a real polynomial with real zeros only, and suppose
that XI > C.
x* .:::
-
f(Xi)
Xi-C
and
I
Xi-X*
x;-c
Xi
> x* (this certainly
= ---Xi-Xi+l
= 3 for the
5.7.9 Example. Exercise ]4 gives the bound J + max
2
absolute values of the zeros of the polynomial f (x) ;= x - 2. Therefore we may
choose c ;= -3. With the starting point Xl := 100, the results of the modified
Newton method are given in Table 5.17.
Halley's Method
In the limiting case Xi _ J ~ Xi (j = 1, ... , k), the Opitz formula (Ok) derived
in Section 5.4 takes the form
5.8 Exercises
Univariate Nonlinear Equations
292
5.8 Exercises
Then
Ix* -
clx* - xilk+l,
Xi+11 :::
that is, the method (Hk) has convergence order k + I. The derivatives of h = 11f
are calculated from
hi = - f'
h"
= 2f'2 - ff"
p'
P
etc.
For k = 1, one again obtains Newton's method applied to f, with convergence
order
293
K
= 2; the case k = 2 leads to Halley's method
1. What is .the maximal length of a cable suspended between two poles of
~qual height separated by a distance 2d = 100 m if the height h by which
It sags may not exceed 10 m? Can you guarantee that the result is accurate
to 1 em?
~int: The fo~ofthecableis described by a catenary y(x) := a cosh(xla).
FIrst determme the arc length between the poles in terms of a. Then obtain
a by finding a zero of a suitable equation.
2. Ho.w deeply will a tree trunk in the form of a circular cylinder of radius
r (10 em) and density PH (in g/cm') be submerged in water (with density
Pw = I)? The area of the cross-section that is under water is given by
r:1
F = l(a - sin a).
f(Xi)
Xi+1 := Xi -
f
-r-(-xi-l!-"-(x-i) ,
I
(Xi) -
with cubic convergence (K = 3) to simple zeros. It can be shown that, for polynomials with only real roots, Halley's method converges globally and monotonically from arbitrary starting points.
Note that k derivatives of f are needed for the determination of the values h(k-I) (Xi), h(k) (x.) in (Hd. If for j = 0, ... , k these derivatives require a
similar computational cost, then a step with (Hk ) is about as expensive as k + 1
secant steps. A method where each step consists of k+ 1 secant steps has convergence order (1.618 .. . )k+1 »k + 1; therefore, the secant method is generally
much more efficient than any (Hk ) ·
Other variants of the method of Opitz that use derivatives and old function
values can be based on partially confluent formulas such as
In this particular case, the evaluation of h and hi is necessary in each iteration,
and we have
[x" - Xi+ll ::: c[x* - Xi [2 ... Ix* - Xi-s [2
and the convergence order is bounded by 3. Thus, one step of this variant is
already asymptotically less efficient than two steps of (0 2) (Ki ~ 3.382 > 3).
A host of other methods for zeros of polynomials (and other univariate
functions) is known. The interested reader is referred to the bibliography by
McNamee [60]. An extensive numerical comparison of many zerofinders is in
Nerinckx and Haegemans [66].
(8.1)
2j'(x,)
The mass of the displaced water is equal to the mass of the tree trunk
whence
'
(8.2)
Using (8.1) and (8.2), establish an equation f(a) = 0 for the angle a.
For r = 30 and PH = 3/4, determine graphically an approximation
ao for the zero o " of f(a). Improve the approximation using the secant
method. Use the solution to calculate the depth of immersion.
3. Show that for polynomials of degree n with only real zeros, the secant
method. started with two points above the largest zero converges monotonically, with global convergence factor of at least
4. Show that for three times continuously differentiable functi~ns f, the root
secant method converges locally monotonically and superlinearly toward
double zeros x* of f.
Hint: ~ou may assume that !"(x*) > 0 (why?). Distinguish two cases
depending on whether or not Xl and X2 are on different sides of x*.
5. Forthedeterminationofthezerox* = ,J2of f(x):= x 2-2in the interval
[0, 2], use the following variant of the secant method:
R
% Set starting values
= 0: Xo
while 1
'!'O
= 2; i = 0;
X = Xi - f(Xi)/f[Xi, ,!.i]
if !(,!.i)!(X) > 0
10. Let A E C"X" be a symmetric tridiagonal matrix with diagonal elements
a., i = I, ... , nand subdiagonal elements fJi' i = 1, ... , n - 1.
(a) Show that for X E C" with Xi i- 0 for i = 1, ... , n; (A, x) is an eigenpair
of A if and only if the quotients Yi = Xi -1/ Xi, i = 1, ... , n satisfy the
recurrence relations
else
:!:.i+1 = :!:.i; Xi+l = x;
end;
1=
295
5.8 Exercises
Univariate Nonlinear Equations
294
i + 1;
end
List i.s. and Xi for i = 1,2, ... ,6 to five decimal places, and perform
iteration until [x -,J21 ::; 10- 9 . Interpret the results!
2
6. (a) Show that the cubic polynomial f(x) = ax 3 + bx + ex + d (a i- 0)
has a sign change at the pair (Xl, X2) with XI = 0,
_ maxt]c], b, b
X2 =
\
+ c, b + c + d)/a if ad s; 0,
+ c, b + c + dv]« otherwise.
fJn-l Yn
= A-
q(t) := t -
al -
---------'---'----------;:----
t - a2 -
bisection method.
(b) Show that in a sufficiently narrow bracket, this safeguard never becomes
active.
First show that it suffices without loss of generality to consider the
case Xl < X < x* < X2. Then show that q = max(f(xI)/f(x), 2(X2XI)/(X2 - x» satisfies q > 2 and (x - XI)/(l- q) < - min(x - Xl,
X2 - X). Interpret the behavior of Xll ew = X - (x - Xl)/(l - q) for
the cases If(x)\ « f(xI) and If(x)\» f(x)), and compare with the
Hint:
original secant formula.
S (a) Determine the zero of f(x) := 1nx + e' - 100x in the interval [1, 1.0]
•
b"
. with
using the original secant method and the secant IsectlOnversion
.
Xl := 1 and X2 := 10. For both methods, list i and Xi (i = 1,2, ...) untIl
the error is < 00 = 10- 9 . Interpret the results!
b
f
.
d it te with the
(b) Choose Xl := 6 and Xz := 7 for the a ove unction an 1 era
-9 L'
h
1
of Xi
original secant method until IXi - xi-II::; 10 . 1St t eva ues
and the corresponding values of i.
.
9. Which zero of sin X is found by the secant bisection method started WIth
Xl
= 1 and X2 = 8?
---"---L
t -
Analyze the Horner form, noting that IX21 ::: 1.
(b) Based on (a), write a MATLAB program that computes all zeros of a
Hint:
quadratic equation directly.
(c) Can one use Cardano's formulas (see Exercise 1.13) to calculate all
zeros of x 3 - 7x + 6 = 0, using real arithmetic only?
7. (a) Explain the safeguard used in the linear extrapolation step of the secant
an'
(b) Show that each zero of the continued fraction
_ mint-e]«], b, b
cubic polynomial.
Hint: After finding one zero, one finds the other two by solving a
for i = n - 1, ... ,1,
fJi-1 Yi = A - a, - fJi/Yi+l,
a
n-l
_
_ f3;;-1
t-a n
is an eigenvalue of A.
(c) When is the converse true in (b)?
11. Let A be a Hermitian tridiagonal matrix.
(a) Suppose that a I - A has an LDL H factorization. Derive recurrence
formulas for the elements of the diagonal matrix D. (You may proceed
directly from the equation a I - A = LDL H , or use Exercise 10.)
(b) Show that the elements of L can be eliminated from the recurrence
derived in (a) to yield an algorithm that determines det(a I - A) with
a minimal amount of storage.
(c) Use (b) and the spectral bisection method to calculate bounds for the
two largest eigenvalues of the 21 x 21 matrix A with
A i k :=
II - i
I,
'
I
0,
if k = i
iflk-il=1.
otherwise
Stop as soon as two decimal places are guaranteed.
12. Let G(A) : C ---* C 3 x 3 with
2A2
+ 2A + 2
-l1A 2+A+9
2A2 + 2A + 3
be given.
(a) Why are all eigenvalues of the parameter matrix G (A) real?
(b) G(A) has six eigenvalues in the interval [-2,2]. Find them by spectral
bisection.
13. Let f : 1R ~ 1R be continuously differentiable, and let x* be a simple zero
of f. Show for an arbitrary sequence Xi (i = 0, 1,2, ...) converging to x*:
(a) If, for i = 1,2, ... ,
IXi+1 - x"] S qlxi - x*1
(b) If
. Xi+1 -x*
lim
Xi -
x*
=q
with
Iql
16. Let f be p + 1 times continuously differentiable and suppose that the
iterates x j of Muller's method converge to a zero x* of f with multiplicity
P = 1 or P = 2. Show that there is a constant c > 0 such that
for j = 3,4, .... Deduce that that the (R-)convergence order of Muller's
method is the solution of K3 = K2 + K + 1 for convergence to simple zeros,
and the solution of 2K3 = K2 + K + 1 for convergence to double zeros.
17. The polynomial PI(X) := ~X4 - 12x3 + 36x 2 - 48x + 24 has the 4-fold
zero 2, and P2(X) := x 4 - IOx3 + 35x 2 - SOx + 24 has the zeros 1, 2, 3, 4.
What is the effect on the zeros if the constant term is changed by adding
E: = ±0.0024? Find the change exactly for Pi (x), and numerically to six
decimal places for P2(X).
18. (a) Show that for a factored polynomial
with q < qo S 1,
then there exists an index i o such that
i-s-co
297
5.8 Exercises
Univariate Nonlinear Equations
296
< ql S 1,
f(x) = ao
then there exists an index i I such that
n
(x - Xi)'ni
i=l:n
we have
14. (a) Show that equation (4.3) has exactly one positive real solution K, and
that this solution satisfies 1 < K < K = 1 + max{po, PI,···, Ps}·
Hint: Show first that (pOK S + PIKs - 1 + ... + Ps)/K s+1 - 1 is positive
for K S 1 and negative for K = K, and its derivative is negative for all
K > O.
(b) Let p(x) = asx" + alx ll - 1 + .. , + all-Ix + a" be a polynomial of
degree n, and let q* be the uniquely determined positive solution of
Show that
1~1l I
s
q*
s I + 1l=1:"
max
-all
I ao
I
for all zeros ~Il of p(z).
Hint: Find a positive lower bound for If(OI if I~I > q*.
15. (a) Show that formula (4.5) is indeed equivalent to the Opitz formula with
k = 2 for h = 1/f·
(b) Show that for linear functions and hyperbolas f(x)
(cx + d), formula (4.5) gives the zero in a single step.
= (ax +
b)/
I' (x)
f (x) =
1
i~ X -
Xi .
(b) If f is a polynomial with real roots only, then the zeros of f are the
only finite local extrema of the absolute value of the Newton correction
ll(x) = - f(x)/!,(x).
Hint: Show first that ll(x)-I is monotone between any two adjacent
zeros of f.
(c) Based on (b) and deflation, devise a damped Newton method that is
guaranteed to find all roots of a polynomial with real roots only.
(d) Show that for f(x) = (x - 1)(x 2 + 3), Ill(x)1 has a local minimum at
x =-1.
19. (a) Let x~, ... , x;(j < n) be known zeros of f(x). Show that for the calculation of the zero X;+I of f(x) with implicit deflation, Newton's
method is equivalent with the iteration
Xi+1 := Xi - f(Xi ) / (r(X i) -
L f~i).).
x
k=l:j Xl
k
(b) To see that explicit deflation is numerically unstable, use Newton's
method (with starting value Xl = 10 for each zero) to calculate both
5.8 Exercises
Univariate Nonlinear Equations
298
7
6
with implicit and explicit deflation all zeros of 1 (x) := x - 7x +
11x 5 + 15x 4 - 34x 3 - 14x 2 + 20x + 8, and compare the results with
the exact values 1, 1 ± y'2, 1 ±~, I ± J"S.
20. Assume that in Exercise 2 you only know that
r E [29.5, 30.5],
Ix* -x*1 < n ·lpn(.T*)
PH E [0.74,0.76].
k
21. (a) Prove Theorem 5.6.6.
Hint: Multiply (6.9) by a complex number, take its real part, and conclude that there must be some zero with Re(c / (x - ~i» O.
(b) Let x' =
What is the best error bound for lx' - x" I?
22. (a) Let k, = k 2 = 2, k 3 = 3, k 4 = 5. Show that for arg(q) E ± lMirr, ~rr[,
condition (6.5) cannot be violated for all k = 0, ... , k n .
(b) In which sense is this choice of the k; best possible?
23. (a) Write a MATLAB program that calculates a zero of an analytic function
1 (x) using Muller's method. If for two iterates x j, x j+j,
:::
;;W).
(where 8 is an input tolerance), perform one more iteration and terminate. If x j = x j + j, terminate the iteration immediately. Test the
2
program with l(x) = x 4 + x 3 + 2x + X + 1.
(b) Incorporate into the MATLAB program a spiral search and find the
zeros again. Experiment with various choices for the contraction and
rotation factor q. Compare with the use of the secant formula instead
of Muller's.
24. (a) Determine, for n
=
of ~onjugate complex zeros (in MATLAB, conj gives the complex
conjugate), remove them by implicit deflation and begin again with the
above starting points.
(b) Calculate, for each of the estimates
of the zeros, the error bound
xt
Starting with an initial interval that contains a*, iterate with the interval
Newton method until no further improvement is obtained and calculate an
interval for the depth of immersion. (You are allowed to use unrounded
interval arithmetic if you have no access to directed rounding.)
x-
299
12, 16,20, all zeros x;, ... , x~, of the truncated
exponential series
x"
Pn(x) := 'Lv!
" -.
v=O:n
(The zeros are simple and pairwise conjugate complex.)
Use the programs of Exercise 23 with xo:= - 6, Xl := - 5, X2:=
_ 4 as starting values and 8 = 10- 4 as tolerance. On finding a pair
k
-
p~(x*)
I.
Print for each zero the number of iteration required, the last iterate that
was calculated, and the error bound.
(c) Plot the positions of the zeros you have found. What do you expect to
see for n ---+ oo?
25. For the functions 11 (x) := x 4 -7x 2 + 2x + 2 and hex) := x 2 - 1999x +
I ~OOOOO, perform six steps of Newton's method with Xo = 3 for 11 (x) and
with Xo = 365 for 12 (x). Interpret the results.
26. Let x* be a zero of the twice continuously differentiable function 1 : lR ---+
R and suppose that 1 is strictly monotonically increasing and convex for
x ::: x*. Show the following:
(a) If XI > X2 > x*, then the sequence defined by the secant method started
with x I and X2 is well defined and converges monotonically to x*.
(b) If YI > x*, then the sequence Yi defined by Newton's method Yi+1 =
Yi - 1 (Yi) /1' (Yi) is well defined and converges monotonically
to x*.
(c) If Xl = Yl > x* then x* < X2i < Yi for all i > I; that is, two steps of the
secant method always give (with about the same computational cost) a
better approximation than one Newton step.
Hint: The error can be represented in terms of x* and divided differences.
27. Investigate the convergence behavior of Newton's method for the polynomial p(x) := x 5 - IOx3 + 69x.
(a) ~how that Newton's method converges with any starting point Ixol <
)ffi.
(b) Show that g (x) = x - ;,i~ is monotone on the interval x := [~ffi,
v'3]. Deduce that g(x) maps x into -x and -x into x. What does this
imply for Newton's method started with Ixol EX?
28. Let p(x) = aox n + alx n- I + ... + an-Ix + an, ao > 0, be a polynomial
of degree n with a real zero ~* for which
1;* ::: Re s,
for all zeros I;v of p(x).
Univariate Nonlinear Equations
300
6
(a) Show that for zo > ~* the iterative method
z) + 1 :
= z)
p(z) )
-
-----,-------(
)
p z)
Systems of Nonlinear Equations
p(z)
W)+l
:= z) - n-----,-------()
p z)
generates two sequences that converge to ~* with
W)
:s ~* :s z)
for j
=
1,2, 3, ...
and {z)} is monotone decreasing.
Hint: Use Exercise l8(a).
(b) Using Exercise 14, show that the above hypotheses are valid for the
polynomial
n
n-l
1
px=x-x
-x n-2 - ... - .
()
Calculate ~* with the method from part (a) for n = 2,3,4,5, 10 and
Zo:= 2.
In this chapter, we treat methods for finding a zero x* E D (i.e., a point x* E D
with F(x*) = 0) of a continuously differentiable function F : D S; jRn ~ jRn.
Such problems arise in many applications, such as the analysis of nonlinear
electronic circuits, chemical equilibrium problems, or chemical process design.
Nonlinear systems of equations must also frequently be solved as subtasks in
solving problems involving differential equations. For example, the solution of
initial value problems for stiff ordinary differential equations requires implicit
methods (see Section 4.5), which in each time step solve a nonlinear system,
and nonlinear boundary value problems for ordinary differential equations are
often solved by multiple shooting methods where large banded nonlinear systems must be solved to ensure the correct matching of pieces of the solution.
Nonlinear partial differential equations are reduced by finite element methods to solving sequences of huge structured nonlinear systems. Because most
physical processes, whether in satellite orbit calculations, weather forecasting,
oil recovery, or electronic chip design, can be described by nonlinear differential equations, the efficient solution of systems of nonlinear equations is of
considerable practical importance.
Stationary points of scalar multivariate function f : D ~ lR n ~ lR lead to a
nonlinear system of equations for the gradient, V f (x) = O. The most important
case, finding the extrema of such functions, occurs frequently in industrial
applications where some practical objective such as profit, quality, weight,
cost, or loss must be maximized or minimized, but also in thermodynamical
or mechanical applications where some energy functional is to be minimized.
(To see how the additional structure of these optimization problems can be
exploited, consult Fletcher [26], Gill et al. [30] and Nocedal and Wright [73a].)
After the discussion of some auxiliary results in Section 6.1, we discuss the
multivariate version of Newton's method in Section 6.2. Emphasis is on obtaining a robust damped version that can be proved to converge under fairly general
301
6.1 Preliminaries
Systems of Nonlinear Equations
302
303
The Jacobian
conditions. Section 6.3 discusses problems of error analysis and interval techniques for the rigorous enclosure of solutions. Finally, some more specialized
methods and results are discussed in Section 6.4.
A basic reference for solving nonlinear systems of equations and unconstrained optimization problems is Dennis and Schnabel [18]. See also Allgower
and Georg [5] for parametrized or strongly nonlinear systems of equations.
The matrix A E jRlIl X" is called the Jacobian matrix (or derivative) of F : D S;
jR" ~ jRlIl at the point Xo E D if
one writes A = F' (x"). It follows immediately from the definition that
6.1 Preliminaries
We recall here some ideas from multidimensional analysis on norm inequalities
and local expansions of functions, discuss the automatic differentiation of multivariate expressions, and state two fixed-point theorems that are needed later.
In the following, int D denotes the interior of a set D, and aD its boundary.
,
Integration
Let g : [a, b] ~ jR" be a vector-valued function that is continuous on [a, b]
The Riemann integral
g(t) dt is defined as the limit
f:
r
gU) dt:=
fa
taken over all partitions a
and if F' is Lipschitz continuous, the error term is of order O(llx - xOI1 2 ) .
Because all norms in jR" are equivalent, the derivative does not depend on the
norm used. The function F is called (continuously) differentiable in D if the
Jacobian F' (x) exists for all xED (and is continuous there). If F is continuously
differentiable in D, then
F (X)ik
c
a
= -F;(x)
(i
OXk
= 1, ... .m, k = 1, ... , n).
1ft
For example, for m
= n = 2 and F(x) = (~~i:;), we have
L
lim
(t;+1 - ti)g(t;)
max(ti+l-ti)~O;=0:"
= to <
t1 < .. , < t,,+1 = b of the interval [a, b],
provided this limit exists.
6.1.1 Lemma. For every norm in jR",
When programming applications, it is important that derivatives are programmed correctly. Correctness may be checked as in Exercise 1.8.
We generalize the concept of a divided difference from the univariate case
(Section 3.1) to multivariate vector-valued functions by defining the multivariate slope
if F is differentiable on the line xxo.
if both sides are defined.
6.1.2 Lemma. Let D S; jR" he convex and let F : D ~ jR" be continuously
differentiable in D.
Proof This follows from the inequality
(i) If x, xO E D then
by going
to
the limit.
o
(This is a strong form ofthe mean value theorem.)
(ii) IfIIF'(x) II :s Lforall xED, then
II F[x, xO]1I
305
6.1 Preliminaries
Systems of Nonlinear Equations
304
and
:s L
for all x , xO E D
whence (ii) follows. The assertion (iii) follows similarly from
and
IIF(x) (This is a weak form of the mean value theorem.)
(iii) If F' is Lipschitz continuous in D, that is, the relation
IIF'(x) -
F'(y) II
:s yllx -
yll
F(xo) - F'(xo)(x -
=
111
:s
1
II F' (xO + t (x
:s
1
YllxO + t(x
for all x, y ED
xO)11
1
1
(F' (xO + t(x - xu)) - F' (xo)) (x - xu) dt
- xu)) - F' (x") IIllx
-
I
xOII dt
1
holds for some Y E~, then
for all x, xO E D;
=Yllx-xOI1 2
- xu) -
Y
1°
1
xOllllx - x011 dt
tdt=-llx-xOIl 2 .
o
2
in particular
Reverse Automatic Differentiation
(This is the truncated Taylor expansion with remainder term.)
Proof Because D is convex, the function defined by
is continuously differentiable in [0, 1]. Therefore,
1
1
F(x) - F(xo) = g(l) - g(O) =
g'(t) dt
t
oo
F'(xo+t(x-xo))(x-xo)dt=F[x,x ](x-x).
= 10
For many nonlinear problems, and in particular for those discussed in this
chapter, the numerical techniques work better if derivatives of the functions
involved are available. Because numerical differentiation is often inaccurate
(see Section 3.2), it is advisable to provide programs that compute derivatives
analytically. The basic principles were discussed already in Section 1.1 for the
univariate case. Here, we look briefly at the multivariate case. For more details,
see Griewank and Corliss [33].
Consider the calculation of y = F (x) where x E ~n , and y E ~m is a vector
defined in terms of x by arithmetic expressions. We may introduce an auxilary
vector z E ~ N containing all intermediate results from the calculation of y, with
y retrievable from suitable coordinates of z, say y = Pr with a (0, 1) matrix P
that has exactly one 1 in each row. Then we get each Zi with a single operation
from one or two Zj (j < i) or x i- Therefore, we may write the computation as
This proves (i). Now, by Lemma 6.1.1,
IIF[x,
z
F'(xO
+ t(x -
xu»~ dt
IIF'(xO
+ t(x -
x°))ll dt
1
xO]11 =
\11
:s
1
1
: : £1
Ldt
=
L,
1\
=
0.1)
H(x, z),
where Hi (x, z) depends on one or two of the z j (j < i) or x j only. In particular,
the partial derivatives HxCx, z) with respect to x and Hz (x, z) with respect to
z are extremely sparse; moreover, Hz (x, z) is strictly lower triangular. Now z
depends on x, and its derivative can be found by differentiating (1.1), giving
8z
-8
x
=
HxCx, z)
az
+ Hz(x, z)-.
ax
306
6.1 Preliminaries
Systems ofNonlinear Equations
6.1.3 Example. To compute
Bringing the partial derivatives to the left and solving for them, we find
-az =
ax
Because F'(x)
(1- Hz(x, z))
-1
HAx, z).
(1.2)
we need the intermediate expressions
= ay/ax = paz/ax, we find
z\ =
(1.3)
Z3
Note that the matrix to be inverted is unit lower triangular, so that one can
compute (1.2) by solving
=
Z4
Z6
by forward substitution: with proper programming, the sparsity can be fully
exploited and the computation efficiently arranged. For one-dimensional x,
the resulting formulas are equivalent to the approach via differential numbers
discussed in Section 1.1. This way of calculating F' (x) is termed forward automatic differentiation.
In many cases, especially when m « n, the formula (1.3) can be evaluated
more efficiently. Indeed, we can transpose the formula and find
Hz(x, z)
Now we find K:= (I - Hz(x, z))~r p T by solving with back substitution the
associated linear system
=
K HAx, z).
=
0
0
1
0
0
0
0
0
0
1
0
0
0
0
As for any scalar-valued function, K
(1.6) simply becomes
(1.7)
k2
-
k3
-
k4 This way of proceeding is called reverse or backward automatic differentiation because we solve for the components of K in reverse order. An apparent
disadvantage of this way of proceeding is that H; (x, z) must be stored fully
while it can be computed on the fly while solving (1.4). However, if m « n
the advantage in speed is dramatic because (1.4) contains on the right side n
columns whereas (1.6) has only m columns. So the number of triangular solves
is cut down by a factor of m / n « 1.
0
0
0
I/x4
0
0
0
0
0
0
0
0
0
26
0
0
0
0
0
0
Z6
0
Z4
k5
Thus,
-
0
0
0
0
0
0
0
0
0
0
0
0
= k is now a vector, and the linear system
= 0,
k3
= 0,
(1/x4)k 4 = 0,
Z6k7
= 0,
Z6k6
= 0,
k, - k3
the so-called adjoint equation with a unit upper triangular matrix, and obtain
F'(x)
X4,
to get f(x) = Z7. We may write the system (1.8) as 2 = H (x, z) and have y = P';
with the 1 x 7-matrix P = (0 0 0 0 0 0 1). The Jacobian of H with respect
to z is the strictly lower triangular matrix
(1.5)
(1.6)
(1.8)
Z3/ X4,
= Xl = e Z5 ,
Z5
(1.4)
xf,
= X2 X3,
= zr + Z2,
Z2
T
307
k6 - Z4k7
= 0,
k7
= 1.
Systems ofNonlinear Equations
308
6.1 Preliminaries
and we get the components of the gradient from (1.7) as
af(x)
g·=--=k
T
Banach's Fixed-Point Theorem
The mapping G : D S; JRn ~ JRn is called a contraction in K S; D if G maps K
into itself and if there exists fJ E JR with 0 :s fJ < 1 such that
o Hi», z)
ax j
J
309
dXj
IIG(x) - G(Ylll :S fJllx -
a
Most components of H / ax j vanish, and we end up with
YII
for all x, Y
E
K;
f3
gl = k 1 ·
2Xl
is called a contraction factor of G (in K). In particular, every contraction in
K is Lipschitz continuous.
+ks ,
g2
= k2 X 3 ,
g3
= k 2X 2 ,
g4
= k4 ( -Z3/ xJ) - k s = -k4 z4 / X 4 - k s·
(I.10)
6.1.4 Theorem. Let the mapping G : D S; lien ~ JRn be a contraction in the
closed. convex set K ~ D. Then G has exactly one fixed point x* E K and
the iteration x'+1 := G(x') converges for all xO E K at least linearly to x*. The
relations
.
The whole gradient is computed from (1.9) and (LlO) with only 11 operations,
and because no exponential is involved, computing function value and gradient
takes less than twice the work for computing the function value alone. This is
not untypical for gradient computations in the reverse mode.
Ilxl+ l -x*ll:s ,8llx l -x*lI,
and
II xl+1 - x'il
There are several good automatic differentiation programs (such as ADIFOR
[2], ADOL-C [34]) based on the formula (1.3) that exploit the forward approach,
the backward approach, or a mixture of both. They usually take a program for
calculation of F (x) as input and return a program that calculates F (x) and
F ' (x) simultaneously.
Fixed-Point Theorems
(1.13)
---- <
I+P
hold, in which
-
IIx l + 1 - x'il
II x' - x* II < -~--
-
(1.14)
l-fJ
f3 denotes a contraction factor of G.
Proof Let xO E K be arbitrary. Because G maps the set K into itself,
defined for all I 2: 0 and for I 2: 1,
xl
is
so by induction,
Related to the equation F(x*) = 0 is the fixed point equation
x* = G(x*)
(Lll)
Therefore for I < m,
IIx
and the corresponding fixed point iteration
l
-
x m II :S Ilx l
:S (pi
X
I
+ 1 := G(x l )
(l = 0,1,2, ...).
(Ll2)
x'+111
+ '" + [[xIII-I
+ ... + ,8m-l)lIx l
pm - fJ'
=
The solutions of (I.Il) are called fixed points of G. For application to the
problem of determining zeros of F, one sets G (x) = x - C F (x 1with a suitable
C E jR" "", There are several important fixed-point theorems giving sufficient
conditions for the existence of at least one fixed point. They are useful in proving
the existence of zeros in specified regions of space, and relevant to a rigorous
error analysis.
-
,8 -1
_
xm I
_ xOIl
IIx l-xOIl,
so IIx - x II ~ 0 as I, m ~ 00. Therefore x' is a Cauchy sequence and has a
limit x*. Because G is continuous,
l
m
x*
=
lim xl = lim G(x'-I)
1--";-00
l"""-HX;
=
G(x*),
that is, x* is a fixed point of G. Moreover, because K is closed, x*
E
K.
6.2 Newton's Method and Its Variants
Systems of Nonlinear Equations
310
(ii) (Fixed point theorem by Brouwer) Suppose that the mapping G : D S;
JRn ~ JRn is continuous and that K S; Dis nonempty, convex, and compact.
If G maps the set K into itself, then G has at least one fixed point in K.
An arbitrary fixed point x' E K satisfies
Ilx' - x*11 = IIG(x') - G(x*)11 ~ ,Bllx' - x*ll,
and we conclude x'
= x* because ,B <
1. Therefore x* is the only fixed point of
G in K. The inequality (1.13) now follows from
l
Ilx I+1 - x*11 = IIG(x l) - G(x*)11 ~ ,Bllx - x*lI,
and with this result, (1.14) follows from
(1 - ,B)IIx l - x* II
l
~ IIx i - x* I - Ilxl+1 - x* II ~ Ilx + - xlii
l
~ I xl - x*11 + Ilx I+1 - x*11 ~ (1 + ,B)IIx - x*ll·
311
Proof For n = 1 the theorem easily follows from the intermediate value
theorem. For the general case, we refer to Ortega and Rheinboldt [78] or
Neumaier [70] for a proof of (i), and deduce here only (ii) from (i). If xO E int K
and "A > 1 then, assuming G(x) = xO + "A (x - xO), the convexity of K implies
that
1
D
Therefore, G(x) #xo + "A(x - xo) for each x E JK,"A > 1, and by part (i), G
has a fixed point in K.
D
6.1.5 Remarks.
(i) We know an error bound of the same kind as (1.14) already from Theorem
2.7.2. As seen there, (1.14) is a realistic bound if,B does not lie too close
to 1.
(ii) Two difficulties in the application of the Banach fixed-po~nt theorem li.e
in the proof of the property that G maps K into itself and III the determination of a contraction factor ,B < 1. By Lemma 6.1.2(ii), one can choose
f3 := sup{ I G'(x) III x E K}. If this supremum is difficult to determ.ine, o~e
can determine a simpler upper bound with the help of interval arithmetic
Note that Brouwer's fixed-point theorem weakens the hypotheses of Banach's
fixed-point theorem. but also gives a weaker conclusion. That the fixed point
is no longer uniquely determined can be seen, for example, by choosing the
identity for G.
6.2 Newton's Method and Its Variants
The multivariate extension of Newton's method, started at some xO E JRn, is
given by
(see Section 6.3).
(2.1)
where pl is the solution of the system of linear equations
The Fixed-Point Theorems by Brouwer and Leray-Schauder
(2.2)
Two other important fixed point theorems guarantee the existence of at least
one fixed point, but uniqueness can no longer be guaranteed.
The solution pi of (2.2) is called the Newton correction or the Newton
direction. We emphasize that although the notation involving the inverse of
6.1.6 Theorem.
the Jacobian is very useful for purposes of analysis, in practice the matrix
F' (x') -I is never calculated explicitly because as seen in Section 2.2, pI can
be determined more efficiently by solving (2.2), for example, using Gaussian
elimination.
Newton's method is the basis of most methods for determining zeros in
JRn; many other methods can be considered as variants of this method. To
show this, we characterize an arbitrary sequence x' (l = 0, 1, 2, ... ) that converges quadratically to a regular zero. Here, a zero x* E int D is called a
(i) (Fixed-point theorem by Leray and Schauder) Suppose. that the mapping G: D S; JRn ~ JRn is continuous and that K S; D IS compact. If
xO
E
intK and
G(x) #xo
+ "A(x -
x")
for all x
then G has at least one fixed point in K.
E
JK, "A> 1,
312
Systems of Nonlinear Equations
6.2 Newton's Method and Its Variants
regular zero of F : D ~ JR.n --* JR.n if F (x*) = 0, if F is differentiable in a neigh-
borhood of x* and F' (x*) is Lipschitz continuous and nonsingular in such a
neighborhood.
From now on, we frequently omit the iteration index I and simply write x
for xl and i for x l +1 (and similarly p for p', etc.).
IIi -
= O(IIF(i)11) = O(IIF(x)1I 2) == O(lIx _
x*1I
r:= F(x)
==
F(x)
F(x)
F(x) = r
x)
+ O(llx -
xf),
- x*)
+ O(lIx - x*11 2 ) .
Iii - xll::s IIx -
==
(2.3)
= O(IIx - x*II),
+ O(lIx - xI1 2 ) .
(2.5)
If (ii) now holds, then by (2.4)
Because F(x*) = 0 and F'(x*) is bounded, it follows that
IIF(x)1I
+ F'(x)(.i -
and
Proof We remark first that by Lemma 6.1.2(iii),
+ F'(x*)(x
+ F'(x)p
and remark that by Lemma 6.1.2(iii),
(i) IIi -x*II == O(IIx _x*II 2),
(ii) IIF(x)11 = O(IIF(x)II 2),
(iii) IIF(x) + F'(x)pll = O(IIF(x)1I 2).
F(x) = F(x*)
x*1I2).
The.refore, (i) and (ii) are equivalent. To show that (ii) and (iii) are equiv I t
we mtroduce the abbreviation
a en ,
6.2.1 Theorem. Let x" E D bea regular zero ofthefunction F : D ~ JR." --* JR.".
Suppose that the sequence Xl (l = 0, 1,2, ...) converges to x* and that
pi := x l + ] - x', Then the following statements are equivalent:
313
and if (ii) holds, then by (2.4) and (2.3),
x*II
+ IIx - x*1I
+ O(IIF(x)ll)
O(II F(x)II)
= O(IIF(x)II),
and therefore by (2.5) and (ii)
/lrll ::s
and for an appropriate c > 0
(x)11+ O(III -
IIF
x/l 2)
=
O(/lF(x)1I2)
so that (iii) holds. Conversely, if (iii) holds, then
Ilx - x*11
OCllx - x*11 2 ))11
::s II F' (x*)-III(II F(x) II + cflx - x* /12).
= IIF'(x*)-I(F(x) -
For large I, 1IF'(x*)-' /I /Ix - x*11 ::::
Ilrll ==
O(IIF(x)11 2 )
ie, so
III -
I
xII = lip II = IIF'(x)-I(FCx) - r)/I
:::: IIF'(x)
IIx-x*ll:::: IIF'(x*)-IIlIIF(X)/I+:2llx-x*II,
111(IIF(x)11 + Ilrll)
== O(l)O(II F(x)11) =
and solving for Ilx - x* II, we find
Ilx
-x*lI::::
OClix
211F'(x*)-1
-x*11)
O(II F(x)II).
By (2.5) and (2.6) we have
1111 F(x)II = O(IIF(x)II)·
= OCllx
(2.4)
IIF(i)lI::::
Ilrll + O(/li _xIl 2 ) == O(IIFCx)112),
so that (ii) holds. Therefore, (ii) and (iii) are also equivalent.
If now (i) holds, then by (2.3) and (2.4),
IIF(x)II =
(2.6)
and
_x*II 2) =
OCIIF(x)1I 2 ) ,
Because condition (iii) is obviously satisfied if p
Newton correction, we find the [ollowing.
== _ F' (x) -1 F (x)
o
is the
6.2 Newton's Method and Its Variants
Systems of Nonlinear Equations
314
6.2.2 Corollary. Newton's methodfor a system ofequations is locally quadrat-
because p
ically convergent if it converges to a regular zero.
=
_A- 1 F(x)
IIF(x)
=
3x?
+ XIX3 + x~ - 3x~
xi + X2 + 1 - 3x~
with Jacobian
F
-6x)
F'(x):=
(
2XI: X3
until rounding errors start to dominate.
'( )
F(x
+ hu) -
If a program calculating F' (x) is not available, one can approximate the partial derivatives by numerical differentiation to obtain an approximation At for
I
I
d
.
F'(xl). If one then determines pi from the equation At p = - F(x ) an agam
sets XI + 1 := xl + pi, then one speaks of a discretized Newton method. In order
to obtain quadratic convergence, the approximation must satisfy
Table 6.1. Simple Newton iteration
IIF(xl)111
xl
\
F(x+hu)-F(x-hu)
2h
(2.7)
If the dimension n is large but F' (x) is sparse, that is, each component of F
depends only on a few variables, one can make use of this in the calculation of
A. If one knows, for example, that F'(x) is tridiagonal, then one obtains from
(2.7) with
0.7500000
0.8083333
0.8018015
0.8016347
0.8016345
0.8016345
0.8016345
1.88e 3.40e 1.0ge 1.57e 2.84e 5.55e 5.00e -
u
=
=
=
(1,0,0, 1,0,0, 1,0,0,
)
(0, 1,0,0, 1,0,0, 1,0,
)
(0,0, 1,0,0, 1,0,0, I,
)
approximations for the columns
1,4,7,
2,5,8,
3,6,9,
IIA - F'(x)11 = OClIF(x)II);
2
3
4
5
6
~
or
h
u
The Discretized Newton Method
0.5000000
0.6000000
0.5856949
0.585290\
0.5852896
0.5852896
0.5852896
F(x)
xu~---~-~---':'"
u
0.2500000
0.3583333
0.3386243
0.3379180
0.3379\7\
0.3379171
0.3379171
O(IIF(x)11 2 ) .
in which h I- 0 is chosen suitably and u E]RII runs through the unit vectors
e(l) , e(2), ... , e(lI) in succession.
With the starting value xU := (i, &, ~) T , the Newton method gives the results
displayed in Table 6.1. Only the seven leading digits are displayed; however,
full accuracy is obtained, and the local quadratic convergence is clearly visible
0
=
The matrix A must therefore approximate F'(x) increasingly well as IIF(x)11
decreases.
In the case that different components of F (x) cannot be calculated one by
one, one may determine the approximation for F'(x) columnwise from
)
x?
(
- F'(x))plI
::: IIA - F'(x)lllIpll
x~ -
F(x):=
O(IIF(x)II), it follows that
+ F'(x)plI = II (A
6.2.3 Example. Consider the function
315
01
02
03
06
12
16
16
Ilpllll
2.67e - 01
4.05e - 02
1.28e - 03
1.53e - 06
2.47e-12
6.52e - 16
4.32e - 16
.
,
and
of F'(x),
respectively (why?). In this case, only four evaluations of F are required for
the calculation of A by the forward and six evaluations of F for the central
difference quotient.
Damping
Under special monotonicity and convexity conditions (see Section 6.4),
Newton's method can be shown to converge in its simple form. However, as in
the univariate case, in general some sort of damping is necessary to get a robust
algorithm.
Systems ofNonlinear Equations
6.2 Newton's Method and Its Variants
Table 6.2. Divergence with undamped Newton method
improbable that the next step has a large damping factor. Therefore, one uses
in the next iteration an adaptive initial step size, computed, for example, by
316
x/
0
1
2
3
4
5
6
2.00000
-18.15888
-8.37101
-3.55249
-1.20146
-0.00041
0.04510
2.00000
-10.57944
-5.22873
-2.71907
-1.47283
0.09490
-1865.23122
21
22
23
24
0.13650
0.33360
0.19666
0.15654
0.15806
-4.06082
-1.49748
1.03237
IIF(x/)111
lip/III
5.le+00
6.8e + 02
1.7e + 02
4.le + 01
l.le + 01
2.4e + 01
3.5e + 06
3.3e + 01
1.5e + 01
7.3e + 00
3.6e + 00
2.8e + 00
1.ge + 03
9.3e + 02
+ 00
+ 01
+ 00
+ 00
4.4e + 00
2.7e +00
2.6e + 00
2.4e + 00
6.2e
2.0e
7.3e
6.6e
eX = min(1, 4a o ld )
317
(2.9)
or the more cautious strategy
_ {a
a =
min(1,2a)
if a decreased in the previous step,
otherwise.
(2.10)
Because in high dimensions, the computation of the exact Newton direction
may be expensive, we allow for approximate search directions p by requiring
only a bound
IIF(x)
+ F'(x)pll
:::: q'IIF(x)1I
(2.11)
on the residuals. For F : D <; lH. ~ lH. n the following algorithm results, with
fixed numbers q, q' satisfying
n
6.2.4 Example. The function
Xl
+ 3ln IXII -
F(x):= (
2
2x l - XIXZ - 5Xl
xi )
+1
0< q < 1, 0 < q' < 1 - q.
has the Jacobian
(2.12)
6.2.5 Algorithm: Damped Approximate Newton Method
j 1
F , (x) = (1+3 X
4XI - X2 - 5
STEP 1: Choose a starting value
For the starting vector xO := (2, 2l, the Newton method gives the results displayed in Table 6.2, without apparent convergence.
As in the one-dimensional case, damping consists in taking only a partial
step
x :=x -s- ap,
with a damping factor a chosen such that
IIF(x
+ ap)11 <
(1- qa)\IF(x)11
xO E
D and compute II F (x") II. Set 1:= 0;
a=1.
STEP 2: Determine p such that (2.11) holds, for example by solving F' (x) P =
-F(x).
STEP 3: Update a by (2.9) or (2.10).
STEP 4: Compute x = x + ap. If x = x, stop.
STEP 5: If x if- int D or IIF(x)11 ::: (l - qa)IIF(x)lI, then replace a with aj2
and return to Step 4. Otherwise, replace I with I + 1 and return to
Step 2.
6.2.6 Example. We reconsider the determination of a zero of
(2.8)
for some fixed q, 0 < q < 1. It is important to note that requiring II F(x + ap) II <
II F(x)11 is not enough (see Example 5.7.7).
Because ex = I corresponds to a full Newton step, which gives quadratic
convergence. one first tests whether (2.8) is satisfied. If this is not the case,
then one halves ex until (2.8) holds; we shall show that this is possible under
very weak conditions. If in step I a smaller damping factor was used, then it is
from Example 6.2.4 with the same starting point x O = (2, 2) T for which the ordinary Newton method diverged. The damped Newton method (with q = 0.5, q' =
0.4) converges for II· II = II . II, to the solution Iiml---+oo x' = (1.3734784,
-1.5249648) T, with both strategies for choosing the initial step size
(Table 6.3). The use of (2.9) takes fewer iterations and hence Jacobian
6.2 Newton's Method and Its Variants
Systems ofNonlinear Equations
318
Table 6.3.
With damping (2.9)
IIF(x')111
IIp'III
at
nf
12
13
14
5.08e + 00
4.14e + 00
3.31e + 00
2.6ge + 00
2.31e + 00
2.21e + 00
2.16e + 00
2.12e + 00
2.0ge + 00
2.04e + 00
1.88e + 00
8.5ge - 01
3.93e - 02
8.lle -05
3.92e - 10
3.27e + 01
1.96e + 00
1.95e +00
2.1ge + 00
3.8ge + 00
5.44e + 00
6.70e + 00
7.76e + 00
6.91e + 00
4.7ge + 00
1.l6e + 00
2.55e - 01
l.04e - 02
2.32e - 05
1.04e - 10
6.25e - 02
2.50e - 01
2.50e - 01
2.50e - 01
6.25e - 02
3.12e - 02
3.12e - 02
3.12e - 02
3.12e - 02
1.25e - 01
5.00e - 01
1.00e + 00
l.OOe + 00
1.00e + 00
1.00e +00
6
8
12
16
22
27
31
35
39
41
43
45
47
49
51
0
1
2
5.08e + 00
4.14e+00
3.8ge + 00
3.27e + 01
1.96e + 00
l.94e + 00
6.25e - 02
6.25e - 02
1.25e - 01
6
8
10
13
14
15
16
1.25e + 00
6.00e - 01
3.30e - 02
l.78e - 04
2.91e - 09
4.05e - 01
l.36e-01
l.34e - 02
5.64e - 05
6.65e - 10
5.00e - 01
l.OOe + 00
1.00e + 00
1.00e + 00
l.OOe + 00
37
39
41
43
45
0
I
2
3
4
5
6
7
8
9
10
II
With damping (2.10)
17
Damped Newton method, with damping 2.9
Table 6.4.
Damped Newton method with I-norm
evaluations, whereas (2.10) takes fewer function evaluations (displayed in the
319
With 2-norm
IIF(x')112
11/112
0
1
2
3
5.00e + 00
2.9ge + 00
2.36e + 00
1.91e + 00
2.38e + 01
1.60e + 00
1.72e + 00
2.12e+00
6.25e
2.50e
2.50e
2.50e
-
02
01
01
01
6
8
12
16
12
13
14
15
16
1.l6e + 00
2.62e - 01
1.27e - 02
2.3ge - 05
5.43e ~ 11
3.34e
9.75e
4.67e
7.37e
1.2ge
01
02
03
06
11
1.00e
1.00e
1.00e
1.00e
l.OOe
+ 00
+ 00
+ 00
+ 00
+ 00
50
52
54
56
58
-
nf
a'
With oo-norm
0
1
2
3
4
5
6
22
23
24
25
(XI
IIF(xl)lloo
Ilpllloo
5.00e + 00
4.36e + 00
3.36e + 00
3.28e + 00
3.26e + 00
3.25e + 00
3.25e +00
2.02e + 01
4.5ge + 00
6.30e + 00
1.04e + 01
1.40e + 01
l.83e + 01
3.02e + 01
1.25e
2.50e
3.12e
7.81e
3.91e
3.91e
9.77e
-
+ 00
+ 00
+ 00
+ 00
2.8ge + 04
5.40e + 04
9.07e + 04
1.27e + 05
1.86e
4.66e
l.16e
5.82e
-
3.24e
3.24e
3.24e
3.24e
nf
01
01
02
03
03
03
04
09
10
10
11
5
8
15
21
26
30
36
119
125
131
136
column labeled nf).
Similarly, the 2-norm gives convergence for both strategies, but not as fast
as for the l-norm cf. Table 6.4. However, for the maximum norm (and (2.9»,
22)
the iterates stall near x 22 = (-0.39215, -0.20505l, although II F(X 1100 ~
4
3.24. The size of the correction Il p 22 lloo ~ 2· 10 is a sign that the Jacobian
22
matrix is very ill-conditioned near x • There is a nearby singularity of the
Jacobian, and our convergence analysis does not apply. The other initial step
strategy also leads to nonconvergence with the maximum norm.
6.2.7 Proposition. Let the function F : D S; JRn ---+ JRn be continuously differentiable. Suppose that x E int D, that F'(x) is nonsingular, and that
IIF(x)
+ F'(x)pll
:::: q'IIF(x)11 #0.
(2.13)
If (2.12) holds then
Ilpll ::::
(q'
+ 1)IIF'(x)-IIlIIF(x)1I
(2.14)
Convergence
We now investigate the feasibility and convergence of the damped Newton
method. First, we show that the loop in Steps 4-5 is finite.
and (2.8) holds for all sufficiently small a > O. If .c" is a regular zero and
IIx - x' II is sufficiently small, then (2.8) holds even for 0 < a :::: 1.
6.2 Newton's Method and Its Variants
Systems ofNonlinear Equations
320
6.2.8 Theorem. Suppose that the function F : ]R" ---+ ]R" is uniquely invertible,
and F and its inverse function F- 1 are continuously differentiable. Then the
sequence x' (l = 0, 1, 2, ...) computed by the damped Newton method either
terminates with xl = x* or converges to x*, where x* is the unique zero of F.
Proof From (2.13) we find
Ilpll ~ IIF'(x)-IIIIIF'(x)pll ~ IIF'(x)-lll(IIF(x)
~ (q'
IIF(x
+ F'(x)pll + IIF(x)ll)
+ l)IIF'(x)-IIIIIF(x)ll;
so (2.14) holds. For 0 < a ~ 1 and x
+ ap)11 =
+ ap ED. it follows
321
Proof Because F- 1 is continuously differentiable,
from (2.13) that
+ F'(x)ap + o(allpll)1I
= 11(1 - a)F(x) + a(F(x) + F'(x)p)1I + o(allplD
~ (1- a)IIF(x)1I + a\\F(x) + F'(x)pll + o(allplD
~ (1 - (1 - q')a)IIF(x)11 + o(allpll)·
IIF(x)
However, for sufficiently small a. we have o(a lipiD < (l - q - q'va II F(x) II,
so that (2.8) holds.
If x* is now a regular zero, then F'(x)-I is defined and bounded as x ---+ x*;
therefore, (2.14) implies
exists; therefore, Step 2 can always be performed. By Proposition 6.2.7, I is
increased after finitely many steps so that - unless by chance F(x /) = 0 and we
are done - the sequence x' (I = 0, 1,2, ...) is well defined. It is also bounded
because IIF(x/)11 ~ IIF(xo)11 and {x E]R" IIIF(x)11 ~ IIF(xo)11l is bounded
as the image of the bounded closed set {y E ]R" I II y II ~ II F (xo) II} under the
continuous mapping F- 1 • Therefore, the sequence x' has at least one limit
point, that is. there is a convergent subsequence x': (v = 0, 1,2, ...) whose
limit we denote with x*. Because of the monotonicity of IIF(x /)lI. we have
Ilpll = O(IIF(x)II)·
IIF(x*)1I
= infIIF(xl)ll.
10:.0
Because II F (x) II ---+ 0 as x ---+ x*, one obtains for sufficiently small [x - x* II
the relations
Similarly, (2.14) implies that the pi, are bounded; so there exists a limit point
p* of the sequence pi, (v = 0, 1,2, ...). By deleting appropriate terms of the
sequence if necessary, we may assume that p* = lim, ~ 00 pi, .
We now suppose that F(x*) -=1= 0 and derive a contradiction. By Proposition
6.2.7 there exists an a* E {1, ~, ~, ~, .,. } with
x +ap E D
and
o(allplI) = o(aIIF(x)ll) < (1 - q - q')aIlF(x)11
for 0 < a ~ 1. Therefore, (2.8) now holds in this extended range.
IIF(x* + o" p*)II < (1 - qa*)IIF(x*)II;
0
It follows that if F' (x) is nonsingular in D, the damped Newton method never
cycles. By (2.8). a is changed only a finite number of times in each iteration.
Moreover, in the case of convergence to a regular zero, for sufficiently large I,
the first a is accepted immediately; and because of Step 3, one takes locally
only undamped steps (a = 1). In particular, locally quadratic convergence is
retained if F' is Lipschitz continuous and p is chosen so that
IIF(x)
+ F'(x)pll
= O(IIF(x)11
2
).
Therefore, for all sufficiently large I
= I;
we have
By construction of the a. it follows that a :::: a*. In particular. a := lim inf ai, ::::
a" > O. If we now take the limit in
IIF(x*)1I ~ IIF(i)11 = IIF(x
+ ap)11
< (1 - qa)IIF(x)ll,
one obtains the contradiction
In an important special case, the global convergence of the damped Newton
method can be shown.
IIF(x*)1I ::::: (1- qa)IIF(x*)11 < IIF(x*)II.
322
Systems of Nonlinear Equations
6.3 Error Analysis
Therefore, II F(x*) II = 0; that is, x* is the (uniquely determined) zero of F. In
particular, there is only one limit point, that is, limhoo xl = x*.
0
6.2.9 Remarks.
(i) In general, some hypothesis like that in Theorem 6.2.8 is necessary because
the damped Newton method may otherwise converge slowly toward the set of points
where the Jacobian is singular. Whether this happens depends, of course, a lot on
the starting point x⁰, and it cannot happen if the level set
{x ∈ D | ||F(x)|| ≤ ||F(x⁰)||} is compact and F'(x) is invertible and bounded in
this level set.
(ii) Alternative stopping criteria used in practice are (with some error
tolerance δ > 0)

    ||F(x̃)|| ≤ δ   or   ||x̃ − x|| ≤ δ||x||.

(iii) If the components of F(x⁰) or F'(x⁰) have very different orders of
magnitude, then it is sensible to minimize the scaled function ||CF(x)|| instead
of ||F(x)||, where C is an appropriate diagonal matrix. A useful possibility is
to choose C such that CF'(x⁰) is equilibrated; if necessary, one can modify C
during the computation.
(iv) The principal work for a Newton step consists of the determination of the
matrix A as a suitable approximation to F'(x), and the solution of the system of
linear equations Ap = −F(x). Compared with this, several evaluations of the
function F usually play a subordinate role.

6.3 Error Analysis

Limiting Accuracy

The relation

    F(x̃) = F[x̃, x*](x̃ − x*)    (3.1)

derived in Lemma 6.1.2 gives

    ||x̃ − x*|| = ||F[x̃, x*]⁻¹F(x̃)|| ≤ ||F[x̃, x*]⁻¹|| · ||F(x̃)||.

If x̃ is the computed approximation to x*, then one can expect only that
||F(x̃)|| ≤ δ_F for some bound δ_F of the accuracy of function values that
depends on F and on the way in which F is evaluated. This gives for the error in
x̃ the approximate bound

    ||x̃ − x*|| ≲ ||F'(x*)⁻¹|| δ_F.

Therefore, the limiting accuracy achievable is of the order of ||F'(x*)⁻¹|| δ_F,
and the maximal error in F(x̃) can be magnified by a factor of up to
||F'(x*)⁻¹||. In the following, we make this approximate argument more rigorous.

Norm-wise Error Bounds

We deduce from Banach's fixed-point theorem a constructive existence theorem for
zeros and use it for giving explicit and rigorous error bounds.

6.3.1 Theorem. Suppose that the function F : D ⊆ ℝⁿ → ℝⁿ is continuously
differentiable and that B := {x ∈ ℝⁿ | ||x − x⁰|| ≤ ε} lies in D.
(i) If there is a matrix C ∈ ℝⁿˣⁿ with

    ||I − CF'(x)|| ≤ β < 1   for all x ∈ B,

and

    ε₀ := ||CF(x⁰)|| ≤ ε(1 − β),

then F has exactly one zero x* in B, and

    ε₀/(1 + β) ≤ ||x⁰ − x*|| ≤ ε₀/(1 − β) ≤ ε.

(ii) Moreover, if β < 1/3 and ε₀ ≤ ε/3, then Newton's method started from x⁰
cannot break down and converges to x*.

Proof. Because ||I − CF'(x)|| < 1, C and F'(x) are nonsingular for all x ∈ B.
If we define G : D → ℝⁿ by G(x) := x − CF(x), then F(x*) = 0 ⟺ CF(x*) = 0 ⟺
x* = G(x*). Thus we must show that G has exactly one fixed point in B. For this
purpose, we prove that G has the contraction property. First,
G'(x) = I − CF'(x), so ||G'(x)|| ≤ β < 1 for x ∈ B, and
||G(x) − G(y)|| ≤ β||x − y|| for all x, y ∈ B. In particular, for x ∈ B,

    ||G(x) − x⁰|| ≤ ||G(x) − G(x⁰)|| + ||G(x⁰) − x⁰||
                 ≤ β||x − x⁰|| + ||CF(x⁰)||
                 ≤ βε + ε(1 − β) ≤ ε;

therefore, G(x) ∈ B for x ∈ B. So G is a contraction with contraction factor β,
and (i) follows from the Banach fixed-point theorem.
For the proof of (ii), we consider the function Ḡ : D → ℝⁿ defined by

    Ḡ(x) := x − Ā⁻¹F(x),

in which x̄ ∈ B and Ā := F'(x̄) are fixed. Then

    ||I − CĀ|| ≤ β,   ||(CĀ)⁻¹|| ≤ 1/(1 − β),

whence for x ∈ B,

    ||Ḡ'(x)|| = ||I − Ā⁻¹F'(x)||
              = ||(CĀ)⁻¹(I − CF'(x) − (I − CĀ))||
              ≤ ||(CĀ)⁻¹|| (||I − CF'(x)|| + ||I − CĀ||)
              ≤ (β + β)/(1 − β) = 2β/(1 − β).

It follows from this that

    ||Ḡ(x) − Ḡ(y)|| ≤ (2β/(1 − β)) ||x − y||   for x, y ∈ B.    (3.2)

We apply this inequality to Newton's method,

    x^{l+1} := x^l − F'(x^l)⁻¹F(x^l).

For l = 0, (i) and the hypothesis give ||x⁰ − x*|| ≤ ε₀/(1 − β) ≤ (3/2)ε₀ ≤ ε/2.
If now ||x^l − x*|| ≤ ε/2 for some l ≥ 0, then
||x^l − x⁰|| ≤ ||x^l − x*|| + ||x⁰ − x*|| ≤ ε, so x^l ∈ B. We can therefore apply
(3.2) with x̄ = x = x^l, y = x* to obtain

    ||x^{l+1} − x*|| ≤ (2β/(1 − β)) ||x^l − x*||

because x^{l+1} = Ḡ(x^l). Because β < 1/3, it follows that 2β/(1 − β) < 1 and
||x^{l+1} − x*|| ≤ ε/2; from this it follows by induction that

    ||x^l − x*|| ≤ (2β/(1 − β))^l ||x⁰ − x*|| → 0   as l → ∞.

Thus Newton's method is well-defined and convergent.  □

6.3.2 Remarks.
(i) The theorem is useful to provide rigorous a posteriori error bounds for
approximate zeros found by other means. For the maximum norm,

    B = [x⁰ − εe, x⁰ + εe],

and β can be determined from β := ||I − CF'(x)||∞ by means of interval
arithmetic.
(ii) The best choice of the matrix C is an approximation to F'(x⁰)⁻¹ or
(mid F'(x))⁻¹. Indeed, for C = F'(x⁰)⁻¹, x⁰ → x* and ε → 0, the contraction
factor β approaches zero. Therefore, one can regard β as a measure of the
nearness of x⁰ to x* relative to the degree of nonlinearity of F.
(iii) In an implementation, one must choose ε and ε₀ appropriately. Ideally, ε
should be as small as possible because β generally becomes larger as B becomes
wider. However, because ε₀ ≤ ε(1 − β) must be guaranteed, one cannot choose ε
too small. A useful way to proceed is as follows: One determines bounds
ε₀ ≥ ||CF(x⁰)||∞ and β₀ ≥ ||I − CF'(x⁰)||∞ and then sets ε := 2ε₀/(1 − β₀).
Then one tests the relation

    ||I − CF'(x)||∞ = β ≤ (1 + β₀)/2   for x := [x⁰ − εe, x⁰ + εe],    (3.3)

from which the required inequality follows. Generally, if (3.3) fails, the zero
is either very ill conditioned or x⁰ was not close to a zero.

6.3.3 Example. To show that

    F(x) = ( x₁² + 5x₁ + 8x₂ − 5 ; x₂² − 8x₁ + 5x₂ + 1 )

has a zero in the box x = ([0, 1]; [0, 1]), we apply Theorem 6.3.1 with the
maximum norm and x⁰ = (0.5; 0.5), ε = 0.5 (so that B = x). We have

    F'(x) = ( 2x₁ + 5   8 ; −8   2x₂ + 5 ),   F'(x) = ( [5, 7]   8 ; −8   [5, 7] ),

and for

    C := (mid F'(x))⁻¹ = ( 0.06  −0.08 ; 0.08  0.06 ),

we find

    F(x⁰) = ( 1.75 ; −0.25 ),   CF(x⁰) = ( 0.125 ; 0.125 ),

    I − CF'(x) = [−1, 1] ( 0.06  0.08 ; 0.08  0.06 ).

Thus ε₀ = 0.125, β = 0.14, and ε₀ ≤ ε(1 − β). Thus there is a unique zero x* in
x, and we have

    0.109 < 0.125/1.14 ≤ ||x* − x⁰||∞ ≤ 0.125/0.86 < 0.146.

In particular, x* ∈ ([0.354, 0.646]; [0.354, 0.646]). Better approximations
would have given sharper bounds.
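The quantities in this example can be checked on a computer with interval
arithmetic. The following MATLAB sketch does so assuming the INTLAB toolbox
(infsup, intval, mid, and mag are INTLAB routines); it uses the function F as
reconstructed above and prints the resulting error bounds.

    % Checking the hypotheses of Theorem 6.3.1 for Example 6.3.3 by interval
    % arithmetic (sketch; assumes the INTLAB toolbox is installed).
    xbox = [infsup(0,1); infsup(0,1)];           % the box x = ([0,1],[0,1])
    x0   = [0.5; 0.5];  epsr = 0.5;              % center x^0 and radius eps
    F    = @(x) [x(1)^2+5*x(1)+8*x(2)-5; x(2)^2-8*x(1)+5*x(2)+1];
    dF   = @(x) [2*x(1)+5, 8; -8, 2*x(2)+5];     % F'(x)
    C    = inv(mid(dF(xbox)));                   % preconditioner (mid F'(x))^-1
    B    = eye(2) - C*dF(xbox);                  % interval matrix I - C F'(x)
    beta = max(sum(mag(B), 2));                  % beta = ||I - C F'(x)||_inf
    eps0 = max(mag(C*F(intval(x0))));            % eps_0 = ||C F(x^0)||_inf
    if beta < 1 && eps0 <= epsr*(1 - beta)       % hypotheses of Theorem 6.3.1
      fprintf('unique zero; %.3f <= ||x0-x*||_inf <= %.3f\n', ...
              eps0/(1+beta), eps0/(1-beta));
    end

Without an interval toolbox the rounding errors in C, β, and ε₀ would not be
controlled, so the bounds would no longer be rigorous.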
The Interval Newton Method

In order to obtain simultaneous componentwise bounds that take rounding errors
into account, one uses an interval version of Newton's method. We suppose for
this purpose that x is an interval vector that contains the required zero x*,
and we calculate F'(x). Then F'(ξ) ∈ F'(x) for all ξ ∈ x, and
F[x, x*] = ∫₀¹ F'(x* + t(x − x*)) dt ∈ F'(x) for all x ∈ x. Therefore, (3.1)
gives

    x* = x − A⁻¹F(x) ∈ x − Δ(F'(x), F(x)),

where Δ(F'(x), F(x)) denotes the hull of the solution set of the linear interval
equation Az = F(x) (A ∈ F'(x), x ∈ x). This gives rise to the interval Newton
iteration

    x^{l+1} := (x^l − Δ(F'(x^l), F(x^l))) ∩ x^l   with x^l ∈ x^l.    (3.4)

By our derivation, each x^l contains any zero in x⁰.
In each step, a system of linear equations must be solved. In practice, one uses
in place of the hull a more easily computable enclosure, using an approximate
midpoint inverse C as preconditioner.
Following Krawczyk, one can proceed in a slightly simpler way, also using C but
saving the solution of the linear interval system. We rewrite (3.1) as

    CF(x) = CF[x, x*](x − x*)

and observe that

    x* = x − CF(x) − (I − CF[x, x*])(x − x*),

where I − CF[x, x*] can be expected to be small by the construction of C. From
this it follows that

    x* ∈ K(x, x)

for x, x* ∈ x, where the Krawczyk operator K : 𝕀ℝⁿ × ℝⁿ → 𝕀ℝⁿ is defined by

    K(x, x) := x − CF(x) − (I − CF'(x))(x − x).

Therefore, one may use in place of the interval Newton iteration (3.4) the
Krawczyk iteration

    x^{l+1} := K(x^l, x^l) ∩ x^l   with x^l ∈ x^l,    (3.5)

again with a guarantee of not losing any zero contained in x⁰.
Note that because of the lack of the distributive law for intervals, the
definition of K cannot be "simplified" to x + C(F'(x)(x − x) − F(x)); this would
produce only poor results, because the radius would be bigger than that of x.
It is important to remember that in order to take rounding errors into account,
one must use in the interval Newton iteration or the Krawczyk iteration in place
of x^l the thin interval x^l = [x^l, x^l].
The Krawczyk operator can also be used to prove the existence of a zero in x.

6.3.4 Theorem. Suppose that x ∈ x.
(i) If F has a zero x* ∈ x, then x* ∈ K(x, x) ∩ x.
(ii) If K(x, x) ∩ x = ∅, then F has no zero in x.
(iii) If either

    K(x, x) ⊆ int x    (3.6)

or the matrix C in (3.5) is nonsingular and

    K(x, x) ⊆ x,    (3.7)

then F has a unique zero x* ∈ x.

Proof. (i) holds by construction, and (ii) is an immediate consequence of (i).
(iii) Suppose first that (3.6) holds. For the interval matrix B := I − CF'(x)
and any point matrix B := I − CF'(x) with x ∈ x we have B ∈ B, hence

    |B| rad x = |B| rad(x − x) = rad(B(x − x)) ≤ rad(B(x − x)) = rad K(x, x) < rad x

by (3.6). For D = Diag(rad x), we therefore have

    ||I − D⁻¹CF'(x)D||∞ = ||D⁻¹BD||∞ = max_i Σ_k |B|_ik rad x_k / rad x_i
                        = max_i (|B| rad x)_i / rad x_i < 1.

Thus CF'(x) is an H-matrix, hence nonsingular, and as a factor, C is
nonsingular. Thus it suffices to consider the case where C is nonsingular and
(3.7) holds. In this case, K(x, x) is (componentwise) a centered form for
G(x) := x − CF(x); therefore, G(x) ∈ K(x, x) ⊆ x for all x ∈ x. By Brouwer's
fixed-point theorem, G has a unique fixed point x* ∈ x, that is,
x* = x* − CF(x*). Because C is nonsingular, it follows that x* is the unique
zero of F in x.  □
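In MATLAB with the INTLAB toolbox, one Krawczyk step can be sketched as follows;
the function handles F and dF are assumed to accept interval arguments, and the
explicit intersection via inf and sup stands in for whatever interval library
routine one prefers.

    % One Krawczyk step K(x,xbox) = x - C*F(x) - (I - C*F'(xbox))*(xbox - x),
    % followed by intersection with xbox (sketch; INTLAB assumed).
    function xnew = krawczyk_step(F, dF, xbox)
      x  = mid(xbox);                       % point of expansion
      C  = inv(mid(dF(xbox)));              % approximate midpoint inverse
      Fx = F(intval(x));                    % thin interval: encloses rounding
      K  = x - C*Fx - (eye(length(x)) - C*dF(xbox))*(xbox - x);
      lo = max(inf(K), inf(xbox));          % componentwise intersection
      hi = min(sup(K), sup(xbox));
      if any(lo > hi), xnew = []; return; end   % empty: no zero of F in xbox
      xnew = infsup(lo, hi);
    end

If the intersection is empty, the box is verified to contain no zero; if
K(x, x) ⊆ int x, Theorem 6.3.4(iii) verifies a unique zero in x.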
If x⁰ is a narrow (but not too narrow) box symmetric around a good approximation
of a well-conditioned zero, Theorem 6.3.4 usually allows one to verify existence
in that box, and by refining it with Krawczyk's iteration, we may get very
narrow enclosures.
For large and sparse problems, the approximate inverse C is full, which makes
Krawczyk's method prohibitively expensive. A modification that respects sparsity
is given in Rump [84].

Finding All Zeros

If one has found a zero x̂ of F and conjectures that another zero x* exists,
then just as in the univariate case, one can attempt to determine it by
deflation. For this one needs a damped Newton method that minimizes
||F(x)||/||x − x̂|| instead of ||F(x)||. However, unlike in dimension 1, this
function is no longer smooth near x̂. Therefore, deflation in higher dimensions
tends to give erratic results, and repeated deflation often finds several but
rarely all the zeros.
All zeros in a specified box can be found by interval methods. In analogy to the
univariate case, interval methods for finding all zeros scan a given initial box
x ∈ 𝕀ℝⁿ by exhaustive covering with subboxes that contain no zero, are tiny and
guaranteed to contain a unique regular zero, or are tiny and likely to contain a
nonregular zero. This is done by recursively splitting x until corresponding
tests apply, such as those based on the Krawczyk operator. For details, see
Hansen [39], Kearfott [49], and Neumaier [70].
Continuation methods, discussed in the next section, can also find all zeros of
polynomial systems, at least with high probability (in theory, with probability
one). However, they cannot guarantee to find real zeros before complex ones.
Because high order polynomial systems usually have a number of real or complex
zeros that is exponential in the dimension, continuation methods for finding all
zeros are limited to low dimensions.

An Affine Invariant Newton Method

The nonlinear systems arising in solving differential equations, for example,
for stiff initial value problems by backward differentiation formulas
(Section 4.5), quite often have very ill-conditioned Jacobians. As a
consequence, ||F(x)|| has steep and narrow curved valleys, and a damped Newton
method based on minimizing ||F(x)|| has difficulties because it must follow
these valleys with very short steps.
For such problems, the natural merit function for measuring the deviation from a
zero is a norm ||Δ(x)|| of the Newton correction Δ(x) = −F'(x)⁻¹F(x). Near a
zero, this gives the distance to it almost exactly because of the local
quadratic convergence of Newton's method. Moreover, it is unaffected by the
condition of F'(x) because it is affine invariant; that is, any linear
transformation F̃(x) := CF(x) with nonsingular C yields the same Δ(x). Indeed,
the matrix C cancels out because

    −F̃'(x)⁻¹F̃(x) = −(CF'(x))⁻¹CF(x) = −F'(x)⁻¹F(x) = Δ(x).

However, the affinely equivalent function F̃(x) = F'(x⁰)⁻¹F(x) has
well-conditioned F̃'(x⁰) = I. Therefore, minimizing ||Δ(x)|| instead of ||F(x)||
improves the geometry of the valleys and allows one to proceed in larger steps
in ill-conditioned problems.
Unfortunately, ||Δ(x)|| may increase along the Newton path, defined by the
family of solutions x*(t) of F(x) = (1 − t)F(x⁰) (t ∈ [0, 1]), along which a
method based on decreasing ||F(x)|| would roughly proceed (look at infinitesimal
steps to see this). Hence, this variant may get stuck in a local minimum of
||Δ(x)|| in many situations where methods based on decreasing ||F(x)|| converge
to a zero.
A common remedy is to test for a sufficient decrease by

    ||F'(x^l)⁻¹F(x^l + αp^l)|| ≤ (1 − qα)||F'(x^l)⁻¹F(x^l)||    (4.1)

for some positive q < 1, so that the matrix is kept fixed during the
determination of a good step size (see, e.g., Deuflhard [19, 20], who also
discusses other implementation details). Unfortunately, this means that the
merit function changes
at each step, making a global convergence proof along the old lines impossible.
Indeed, the method can cycle for innocuous problems where methods that use
||F(x)|| have no problems (cf. Exercise 9).
However, a global convergence proof can be given if we mix the two approaches,
using an acceptance test similar to (4.1) in regions where ||Δ(x)|| increases,
and switching back to the merit function ||Δ(x)|| when it has moved sufficiently
far over the region of increase.
The algorithm presented next is geared toward large-scale applications where it
is often inconvenient to solve exactly for the Newton step. We therefore assume
that in each step we have a (possibly crude) approximation M(x) to F'(x) whose
structure is simple enough that a factorization of M(x) is available. Often,
M(x) = LR, where L and R are approximate triangular factors of F'(x), obtained
by ignoring terms in the factorization that would destroy sparsity. One speaks
in this case of incomplete triangular factorizations, and calls M(x) the
preconditioner. The case M(x) = F'(x) is, of course, admissible and produces an
affine invariant algorithm.
We now define the modified Newton correction

    Δ̄(x) = −M(x)⁻¹F(x).    (4.2)
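As an illustration of this construction, the following MATLAB sketch computes
the modified Newton correction (4.2) from an incomplete LU factorization of a
sparse Jacobian. The function handles F and dF and the use of MATLAB's ilu are
assumptions of the sketch, not part of the algorithm specification.

    % Modified Newton correction (4.2) with an incomplete LU factorization of
    % a sparse Jacobian as the preconditioner M(x) = L*U (illustrative sketch).
    function delta = modified_newton_correction(F, dF, x)
      J = sparse(dF(x));          % Jacobian approximation; sparse for ilu
      [L, U] = ilu(J);            % incomplete factors: fill-in is dropped
      delta = -(U \ (L \ F(x)));  % modified Newton correction Delta_bar(x)
    end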
We use an arbitrary but fixed norm that should be a natural measure for changes
in x. The 1-norm is suitable if all components of x have about the same
magnitude. The algorithm depends on some constants that must satisfy the
restrictions

    0 < q₀ < q₁ < q₂ < 1,   0 < μ₁ < μ₂ < 1 − q₀.    (4.3)

6.4.1 Algorithm: Affine Invariant Modified Newton Method

STEP 0: Start with l := 0, k := −1, δ₋₁ := ∞.
STEP 1: Calculate a factorization of M(x^l) and Δ̄(x^l) = −M(x^l)⁻¹F(x^l).
        If (4.4) holds, set k := k + 1, M_k := M(x^l), and δ_k := ||Δ̄(x^l)||.
STEP 2: Calculate an approximate Newton direction p^l satisfying (4.5).
STEP 3: Find α_l such that either (4.6) or (4.7) holds.
STEP 4: Set x^{l+1} := x^l + α_l p^l, l := l + 1. If (4.6) was violated, go to
        Step 2, else go to Step 1.

We first show that Step 3 can always be completed, at least when F is defined
everywhere.

6.4.2 Proposition. Suppose that F(x^l + αp^l) is defined for 0 ≤ α ≤ ᾱ. If

    ᾱ > (1 − q₁)/μ_k   with μ₁ ≤ μ_k ≤ μ₂,    (4.8)

then either (4.7) holds for α_l = ᾱ, or the function φ defined by (4.9) has a
zero α_l ∈ [0, ᾱ]. In particular, Step 3 can be completed if

    ᾱ > (1 − q₁)/μ₂.

Proof. In general, we have

    ||M⁻¹F(x + αp)|| = ||M⁻¹(F(x) + αF'(x)p)|| + o(α)
                     = ||(1 − α)M⁻¹F(x) + αM⁻¹(F(x) + F'(x)p)|| + o(α)
                     ≤ |1 − α| ||M⁻¹F(x)|| + |α| ||M⁻¹(F(x) + F'(x)p)|| + o(α).

By (4.5) and (4.8), we conclude that for sufficiently small α > 0,

    ||M_k⁻¹F(x^l + αp^l)|| / ||M_k⁻¹F(x^l)|| ≤ 1 − α + αq₀ + o(α) < 1 − αμ.    (4.10)
Note that with μ = (μ₁ + μ₂)/2, approximate zeros of φ also satisfy (4.7), so
that one may adapt any root finding method based on sign changes to check each
iterate for (4.6) and (4.7), and quit as soon as one of the two conditions
holds. In order to preserve fast local convergence, one should first try α = 1.
(One may also use more sophisticated initial values as long as the choice
α_l = 1 results locally.) If the first guess does not work and φ(α) > 0, one
already has a sign change, using the artificial value φ(0) = μ − 1 + q₀ < 0
that (cf. (4.10)) is an upper bound to the true range of limiting values of
φ(α) as α → 0. If, however, φ(α) < 0, one increases α by repeated doubling (or
so) until (4.6) or (4.7) holds or a sign change is found.
6.4.3 Remarks.
(i) One must be able to store two factorizations, those of M_k and M(x^l).
(ii) In Step 2, p^l = Δ̄(x^l) works if ||I − M(x^l)⁻¹F'(x^l)|| ≤ q₀, that is, if
M(x) is a sufficiently good approximation of F'(x). In practice, one may want to
choose q₀ adaptively, matching it to the accuracy of x^l.

In general, all sorts of misbehavior may occur besides regular convergence (and,
as in the univariate case, this must be the case for any algorithm without
additional conditions that guarantee the existence of a solution). The
possibilities are classified in the following.

6.4.4 Proposition. Denote by z⁰, z¹, z², ... the values of x^l at each
completion of Step 1. Then one of the following holds:
(i) F(z^k) → 0 as k → ∞.
(ii) For some subsequence, ||z^{k_i}|| → ∞ and ||M(z^{k_i})|| → ∞.
(iii) F'(x^l) is undefined or singular for some l.
(iv) At some iteration, F(x^l + αp^l) is not defined for large α, and Step 3
cannot be completed (cf. Proposition 6.4.2).

Proof. Suppose neither (iii) nor (iv) holds. Then Step 2 can be satisfied by
choosing p^l = −F'(x^l)⁻¹F(x^l), and Step 3 can be completed by assumption. If k
remains bounded, then for k fixed at its largest value, (4.6) will always be
violated, and therefore (4.7) must hold. Equation (4.7) implies that
||M_k⁻¹F(x^l)|| is monotone decreasing, and because (4.6) is violated, it must
converge to a positive limit. Going to the limit in (4.7) therefore gives
α_l → 0. However, now (4.10) with μ = μ₂ gives a contradiction. Hence, k
increases without bound. By Steps 1 and 2, δ_{k+1} < q₂δ_k, hence δ_k ≤ q₂^k δ₀.
Therefore, the bound (4.11) follows. If ||M_k|| remains bounded (and by
continuity, this holds in particular when z^k remains bounded), then the right
side of (4.11) converges to zero and (i) holds. If not, then (ii) holds.  □

In particular, in an important case where all difficulties are absent, we have
the following global convergence theorem.

6.4.5 Theorem. If F is continuously differentiable in ℝⁿ with globally
nonsingular F'(x) and bounded M(x), then

    lim_{l→∞} F(x^l) = 0.

Proof. In Proposition 6.4.4, (ii) contradicts the boundedness of M, (iii) the
nonsingularity of F', and (iv) cannot hold because F is defined everywhere. The
only alternative left is (i).  □

Quasi-Newton Methods

It is possible to generalize the univariate secant method to higher dimensions;
the resulting methods are called quasi-Newton methods. Frequently, they perform
well, although they are generally regarded as somewhat less robust because, due
to the approximations involved, it is not guaranteed that ||F(x + αp)||
decreases along a computed quasi-Newton direction p. (Indeed, to guarantee that,
one needs derivative information.)
To motivate the method, we write the univariate secant method in the form
x_{l+1} = x_l − c_l f(x_l), where c_l = 1/f[x_l, x_{l−1}]. Noting that
c_l(f(x_l) − f(x_{l−1})) = x_l − x_{l−1}, we see that a natural multivariate
generalization is to use

    x_{l+1} = x_l − C_l F(x_l),

where C_l is a square matrix satisfying the secant condition

    C_l y_l = s_l,   with y_l := F(x_l) − F(x_{l−1}),  s_l := x_l − x_{l−1}.

Note that no linear system must be solved, so that for problems with a dense
Jacobian, the linear algebra work is reduced from O(n³) operations per step to
O(n²).
The secant condition can be easily enforced by defining

    C_l = C_{l−1} + (s_l − C_{l−1}y_l) u_l^T / (u_l^T y_l)

with an arbitrary vector u_l ≠ 0. Among the many possibilities, there is a
choice of u_l regarded as best for well-scaled x; it defines Broyden's method
and has reasonably good local convergence properties (see Dennis and Schnabel
[18] and Gay [28]). Of course, one loses quadratic convergence for such methods,
and global convergence proofs are available only under stringent conditions.
However, one can ensure local superlinear convergence with convergence order at
least ⁿ√2. Therefore, the number of significant digits doubles about every n
iterations. This implies that, for problems with dense Jacobians, quasi-Newton
methods are locally not much faster than a discrete Newton method, which uses n
function values to get a Jacobian approximation. For sufficiently sparse
problems, much fewer function values suffice, and discrete Newton methods are
usually faster.
However, far from a solution, the convergence speed is much less predictable.
One again needs a damping strategy, and hence uses the search direction
p_l = −C_l F(x_l) in place of the Newton direction.
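A minimal MATLAB sketch of such a quasi-Newton iteration is given below. The
update vector u_l = C_{l−1}^T s_l used here is the common "good Broyden" choice;
it is an assumption of this sketch and is not taken from the text above.

    % Quasi-Newton (Broyden-type) iteration updating an approximation C to
    % F'(x)^{-1} via the secant condition C*y = s (illustrative sketch).
    function x = broyden_sketch(F, x, C, tol, maxit)
      Fx = F(x);
      for l = 1:maxit
        s = -C*Fx;                       % quasi-Newton direction (full step)
        xnew = x + s;
        Fnew = F(xnew);
        y = Fnew - Fx;                   % y_l = F(x_l) - F(x_{l-1})
        u = C'*s;                        % assumed "good Broyden" update vector
        C = C + ((s - C*y)*u')/(u'*y);   % rank-one update enforcing C*y = s
        x = xnew;  Fx = Fnew;
        if norm(Fx) <= tol, return; end
      end
    end

The initial matrix C may be taken as the inverse of a finite difference
approximation to F'(x⁰); each step then costs only O(n²) operations, in
agreement with the remark above.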
Continuation Methods

For functions that satisfy only weak smoothness conditions there are special
algorithms that are based on topological concepts: homotopies (that generate a
smooth path to a solution) and simplicial decompositions (that generate
piecewise linear paths). An in-depth discussion is given in Allgower and Georg
[5]; see also Morgan [64] and Zangwill and Garcia [100]. Here we only outline
the basic idea of homotopy methods.
For t ∈ [0, 1] one considers curves x = x(t) (called homotopies) for which
x(0) = x⁰ is a known point and x(1) = x* is a zero of F. Such curves are
implicitly defined by the solution set of F̄(x, t) = 0, with an augmented
expression F̄(x, t) that satisfies F̄(x⁰, 0) = 0 and F̄(x, 1) = F(x) for all x.
The basic examples are

    F̄(x, t) = F(x) − (1 − t)F(x⁰)

and

    F̄(x, t) = tF(x) − (1 − t)F'(x⁰)(x − x⁰).

Both choices are natural in that they are affine invariant; that is, the curves
on which the zeros lie are not affected by linear transformations of x or F.
The task of solving F(x) = 0 is now replaced by tracing the curves from the
known starting point at t = 0 to the point given by t = 1. Because the curve
guides the way to go, this may be the easiest way to arrive at a solution for
many hard zerofinding problems. However, it is possible that the path defined by
F̄(x, t) = 0 never arrives at t = 1, or turns back, or bifurcates into several
paths, and all this must be handled by a robust code. Existence of the path can
often be guaranteed by a fixed-point theorem.
The path-following problem is the special case m = 1, E = [0, 1] of the more
general problem of solving a parametrized system of equations

    F(x, y) = 0

for x in dependence on a vector y of parameters from a set E ⊆ ℝᵐ, where
F : D × E → ℝⁿ, x ∈ D ⊆ ℝⁿ. This system of equations determines an
m-dimensional manifold of solutions, and often a functional dependence
H : E → D such that F(H(y), y) = 0 is sought. (Of course, in general, the
manifold need not have a one-to-one projection onto E, and this function may
have to be pieced together from several solution branches.)
The well-known theorem on implicit functions gives sufficient conditions for the
existence of H in a sufficiently small neighborhood of a solution. A global
statement about the existence of H is made in the following theorem, whose proof
(see Neumaier [68]) depends on a detailed topological analysis of the situation.
As in the fixed-point theorem of Leray and Schauder, a boundary condition is the
main thing to be established.

6.4.6 Theorem. Let D ⊆ ℝⁿ be closed, let E ⊆ ℝᵐ be simply connected, and let
F : D × E → ℝⁿ be such that
(i) the derivative ∂₁F(x, y) of F with respect to x exists in D × E and is
continuous and nonsingular there;
(ii) there exist x⁰ ∈ int D and y⁰ ∈ E such that

    F(x, y) ≠ tF(x⁰, y⁰)   for all x ∈ ∂D, y ∈ E, t ∈ [0, 1].

Then there is a continuous function H : E → D with F(H(y), y) = 0 for y ∈ E.

To trace the curve determined by

    F̄(x(t), t) = 0,    (4.12)
there are a variety of continuation methods. The smooth continuation methods
assume that F is continuously differentiable and are based on the differential
equation

    F̄_x(x(t), t)ẋ(t) + F̄_t(x(t), t) = 0    (4.13)

obtained by differentiating (4.12) using the chain rule. This is an implicit
differential equation, but as long as F̄_x is nonsingular, we can solve for
ẋ(t),

    ẋ(t) = −F̄_x(x(t), t)⁻¹F̄_t(x(t), t).    (4.14)

(A closer analysis of the singular case leads to turning points and bifurcation
points; see, e.g., Chow and Hale [11] and Seydel [88]. Now, (4.14) can be solved
by methods for ordinary differential equations. Alternatively, one may solve
(4.13) by methods for differential-algebraic equations; cf. Brenan, Campbell,
and Petzold [9].)
If F̄_x is well conditioned, (4.14) is easy to solve. However, for the
application to solving systems of equations by a homotopy, the well-conditioned
case is less relevant because one solves such problems easily with damped Newton
methods. In the ill-conditioned case, (4.14) is stiff, but advantage can be
taken of good predictors computed by extrapolating from past values.
Due to rounding errors and discretization errors in the solution of the
differential equation, a numerical solution based on (4.14) alone tends to
wander away from the manifold. To avoid this, one must monitor the residual size
||F̄(x(t), t)|| of the approximate solution curve x(t), and reduce the step size
if the residual size is deemed to be too large. For implementation details, see,
for example, Rheinboldt [81].
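For the first of the two basic homotopies, F̄(x, t) = F(x) − (1 − t)F(x⁰), one
has F̄_x = F'(x) and F̄_t = F(x⁰), so (4.14) becomes
ẋ(t) = −F'(x(t))⁻¹F(x⁰). The following MATLAB sketch traces this path with a
standard ODE solver; the use of ode45 and the final Newton correction are
assumptions of the illustration, not a recommendation from the text.

    % Homotopy path tracing for Fbar(x,t) = F(x) - (1-t)*F(x0) via (4.14).
    % F and dF are assumed function handles; x0 is the known starting point.
    Fx0 = F(x0);                              % Fbar_t is constant here
    rhs = @(t, x) -dF(x) \ Fx0;               % right-hand side of (4.14)
    [tt, xx] = ode45(rhs, [0 1], x0);         % follow the path from t=0 to t=1
    xstar = xx(end, :)';                      % approximate zero at t = 1
    xstar = xstar - dF(xstar) \ F(xstar);     % one Newton step reduces drift

In practice, as discussed above, one would also monitor the residual
||F̄(x(t), t)|| along the computed curve and reduce the step size when it grows.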
Global Monotone Convergence of Newton's Method

Newton's method is globally and monotonically convergent in an important special
case, namely when the derivative F'(x) is an M-matrix and, in addition, either F
is convex or F' is isotone in x. (M-matrices were defined in Section 2.4; see
also Exercises 2.19 and 2.20.) The assumptions generalize the monotonicity and
convexity assumptions in the univariate case; they are, for example, met in
systems of equations resulting from the discretization of certain nonlinear
ordinary or partial differential equations.

6.4.7 Theorem. Let F : D ⊆ ℝⁿ → ℝⁿ be continuously differentiable. Let F'(x) be
an M-matrix in a convex subset D₀ ⊆ D, and suppose that there are vectors
u, v > 0 such that

    F'(x)u ≥ v   for all x ∈ D₀.    (4.15)

Suppose that x⁰ ∈ D, that F(x⁰) ≥ 0, and that [z⁰, x⁰] ⊆ D₀ for
z⁰ := x⁰ − uw^T F(x⁰), where w is a vector with w_i v_i ≥ 1 (i = 1, ..., n).
Then:
(i) F has exactly one zero x* in D₀.
(ii) Each sequence of the form

    x^{l+1} := x^l − A_l⁻¹F(x^l)    (4.16)

in which the nonsingular matrices A_l remain bounded as l → ∞ and satisfy the
('subgradient') relations

    F(x) ≥ F(x^l) + A_l(x − x^l)   for all x ∈ D₀ with x ≤ x^l    (4.17)

remains in D₀ and converges monotonically to x*. Also, the error estimate

    x* ∈ [z^l, x^l]   if [z^l, x^l] ⊆ D₀, where z^l := x^l − uw^T F(x^l),    (4.18)

holds.

Proof. We proceed in several steps.
STEP 1: For all x, y ∈ D₀, the matrix A = F[x, y] is an M-matrix with

    F(x) − F(y) = A(x − y),   Au ≥ v.    (4.19)

Indeed, F(x) − F(y) = A(x − y) holds by Lemma 6.1.2. Because F'(x) is an
M-matrix for x ∈ D₀,

    A_ik = ∫₀¹ F'_ik(y + t(x − y)) dt ≤ 0

for i ≠ k, and by (4.15),

    Au = ∫₀¹ F'(y + t(x − y))u dt ≥ ∫₀¹ v dt = v;

in particular (Exercise 2.20), A is an M-matrix.
STEP 2: We use induction to show that (4.20) and (4.21) hold for all l ≥ 0.
Because this holds for l = 0, we assume that (4.20) and (4.21) hold for an index
l and for all smaller indices. Then (4.22) holds because A_l⁻¹ ≥ 0. If now
z^k ∈ D₀ and k ≤ l, then z^k ≤ x^k because F(x^k) ≥ 0, and by Step 1,

    F(z^k) = F(x^k) + A_k(z^k − x^k) = F(x^k) − A_k uw^T F(x^k)
           ≤ F(x^k) − vw^T F(x^k) ≤ 0.

Therefore, (4.21) and (4.17) imply

    0 ≥ F(z^l) ≥ F(x^l) + A_l(z^l − x^l)
      = A_l(x^l − x^{l+1}) + A_l(z^l − x^l) = A_l(z^l − x^{l+1}).

Because A_l⁻¹ ≥ 0, it follows that 0 ≥ z^l − x^{l+1}, so z^l ≤ x^{l+1}. In
particular, x^{l+1} ∈ [z⁰, x⁰] ⊆ D₀, whence by (4.17) and (4.16),

    F(x^{l+1}) ≥ F(x^l) + A_l(x^{l+1} − x^l) = 0.

Therefore, (4.20) and (4.21) hold with l + 1 replacing l, whence they hold in
general. Thus (4.22) also holds in general.
STEP 3: The limit x* = lim_{l→∞} x^l exists and is a zero of F. Because the
sequence x^l is monotone decreasing and is bounded below by z⁰, it has a limit
x*. From (4.16) it follows that F(x^l) = A_l(x^l − x^{l+1}), and because of the
boundedness of the A_l it follows on taking the limit that F(x*) = 0.
STEP 4: The point x* is the unique zero of F in D₀. Indeed, if y* is an
arbitrary zero of F in D₀, then 0 = F(x*) − F(y*) = A(x* − y*), where A is an
M-matrix. Multiplication with A⁻¹ gives y* = x*.
STEP 5: The error estimate (4.18) holds: Because of the monotonically decreasing
convergence of the x^l to x*, we have x* ≤ x^l; and because of (4.21), we have
z^l ≤ x*. Therefore, x* ∈ [z^l, x^l].  □

6.4.8 Remarks.
(i) If A_l = F'(x^l) is an M-matrix, then A_l⁻¹ ≥ 0 (see Exercise 2.19).
(ii) If A_l⁻¹ ≥ 0 and F'(x) ≤ A_l for all x ∈ D₀ such that x ≤ x^l, then (4.17)
holds, because then A ≤ A_l for the matrix in (4.19) with y = x^l, and
F(x) = F(x^l) + A(x − x^l) ≥ F(x^l) + A_l(x − x^l) because x ≤ x^l.
(iii) If, in particular, F'(x) (x ∈ D₀) is an M-matrix with components that are
increasing in each variable, then (4.17) holds with A_l = F'(x^l). If F'(x) is
an M-matrix, but its components are not monotone, one can often find bounds for
the range of values of F'(x) for x in a rectangular box [x, x^l] ⊆ D₀ by means
of interval arithmetic, and hence find an appropriate A_l in (4.17).
(iv) If F is convex in D₀, that is, if the relation

    F(λx + (1 − λ)y) ≤ λF(x) + (1 − λ)F(y)   for 0 ≤ λ ≤ 1

holds for all x, y ∈ D₀, then

    F(x) ≥ F(y) + (F(y + λ(x − y)) − F(y))/λ,

and as λ → 0, the inequality

    F(x) ≥ F(y) + F'(y)(x − y)   for all x, y ∈ D₀    (4.23)

is obtained. If, in addition, F'(x^l) is an M-matrix, then again (4.17) holds
with A_l = F'(x^l).
(v) The hypothesis F(x⁰) ≥ 0 may be fairly easily satisfied as follows:
(a) If (4.15) holds, and both x and x⁰ := x + uw^T|F(x)| lie in D₀, then
F(x⁰) = F(x) + A(x⁰ − x) = F(x) + Auw^T|F(x)| ≥ F(x) + vw^T|F(x)|
≥ F(x) + |F(x)| ≥ 0.
(b) If F is convex and both x and x⁰ := x − F'(x)⁻¹F(x) lie in D₀, then
F(x⁰) ≥ F(x) + F'(x)(x⁰ − x) = 0 because of (4.23).
(vi) If x* lies in the interior of D₀, then the hypothesis of (4.18) is
satisfied for sufficiently large l. From the enclosure (4.18), the estimate
|x^l − x*| ≤ uw^T F(x^l) is obtained, which permits a simple analysis of the
error.
(vii) In particular, in the special case D₀ = ℝⁿ it follows from the previous
remarks that the Newton method is convergent for all starting values if F is
convex in the whole of ℝⁿ and F'(x) is bounded above and below by M-matrices.
The last hypothesis may be replaced with the requirement that F'(x)⁻¹ ≥ 0 for
x ∈ ℝⁿ, as can be seen by looking more closely at the arguments used in the
previous proof.
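In the setting of Theorem 6.4.7, Remark (vi) yields a particularly cheap error
bound that can be monitored during the iteration. The following MATLAB sketch
assumes that F, F' (via dF), and vectors u, v with F'(x)u ≥ v > 0 and
F(x⁰) ≥ 0 are supplied; these are assumptions of the illustration.

    % Monotone Newton iteration under the hypotheses of Theorem 6.4.7, with
    % the componentwise error bound |x - x*| <= u*(w'*F(x)) from Remark (vi).
    function [x, errbound] = monotone_newton_sketch(F, dF, x0, u, v, tol, maxit)
      w = 1 ./ v;                         % vector with w_i*v_i = 1
      x = x0;
      for l = 1:maxit
        Fx = F(x);
        errbound = u * (w' * Fx);         % componentwise enclosure of |x - x*|
        if max(abs(errbound)) <= tol, return; end
        x = x - dF(x) \ Fx;               % Newton step; monotonically decreasing
      end
    end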
6.5 Exercises

1. Write a MATLAB program that determines a zero of a function F : ℝ² → ℝ²
   (a) by Newton's method, and
   (b) by the damped Newton method (with the 1-norm).
   In both cases, terminate the iteration when ||x^{l+1} − x^l||₁ ≤ 10⁻⁶. Use

       F(x, y) := ( exp(x² + y²) − 3 ; x + y − sin(3(x + y)) )

   as a test function. For (a), use the starting values x⁰ = ... . Which ones of
   these are reasonable? For (b), use the starting values x⁰ = ... and ... .

2. Let (x₁(s), y₁(s)) and (x₂(t), y₂(t)) be parametric representations of two
   twice continuously differentiable curves in the real plane that intersect at
   (x*, y*) at an angle φ* ≠ 0.
   (a) Find and justify a locally quadratically convergent iterative method for
   the calculation of the point of intersection (x*, y*). The iteration formula
   should depend only on the quantities s, t, x_i, y_i and the derivatives
   x_i', y_i'.
   (b) Test it on the pair of curves (s, s²) (s ∈ ℝ) and (t², t) (t ∈ ℝ).

3. The surface area M and the volume V of a bucket are given by the formulas

       M = π(sx + sy + x²),   V = (π/3)(x² + xy + y²)h,

   where x, y, h, and s are shown in Figure 6.1. Find the bucket that has the
   greatest volume V* for the given surface area M = π. The radii of this bucket
   are denoted by x* and y*.
   (a) Show that these radii are zeros of the function F given by

       F(x, y) = ( (1 − 4x²)(x + 2y − 2y³) + 4xy²(x² − 1) + 3xy⁴ ;
                   2x(1 − x²)² + y(1 − 2x² + 4x⁴) + 6xy²(x² − y²) − 3y⁵ ).

   (b) Write a MATLAB program that computes the optimal radii using Newton's
   method. Choose as starting values (x⁰, y⁰) those radii that give the greatest
   volume under the additional condition x = y (cylindrical bucket; what is its
   volume?). Determine x*, y* and the corresponding volume V* and height h*.
   What is the percentage gain compared with the cylindrical bucket?

   Figure 6.1. Model of a bucket.

4. A pendulum of length L oscillates with a period of T = 2 (seconds). How big
   is its maximum angular displacement? It satisfies the boundary value problem

       φ''(t) + (g/L) sin φ(t) = 0,   φ(0) = φ(1) = 0;

   here, t is the time and g is the gravitational acceleration. In order to
   solve this equation by discretization, one replaces φ''(t) with

       (φ(t − h) − 2φ(t) + φ(t + h)) / h²

   and obtains approximate values x_i ≈ φ(ih) for i = 0, ..., n and h := 1/n
   from the following system of nonlinear equations:

       f_i(x) := x_{i−1} − 2x_i + x_{i+1} + h²γ sin x_i = 0,    (5.1)

   where i = 1, ..., n − 1 and x₀ = x_n = 0.
   (a) Solve (5.1) by Newton's method for γ = 39.775354 and the initial vector
   with components x_i⁰ := 12h²i(n − i), taking into account the tridiagonal
   form of f'(x). (Former exercises can be used.)
   (b) Solve (5.1) with the discretized Newton method. You should only use three
   directional derivatives per iteration.
   Compare the results of (a) and (b) for different step sizes
   (n = 20, 40, 100). Plot the results and list for each iteration the angular
   displacement x_{n/2} ≈ φ(1/2) (which is the maximum) in degrees. (The exact
   maximum angular displacement is 160°.)
5. To improve an approximation y^l(x) to a solution y(x) of the nonlinear
   boundary-value problem

       y'(x) = F(x, y(x)),   r(y(a), y(b)) = 0,    (5.2)

   where F : ℝ² → ℝ and r : ℝ² → ℝ are given, one can calculate a correction
   δ^l from (5.3), where the constant c is determined such that (∂_i denotes the
   derivative with respect to the ith argument)

       ∂₁r(y^l(a), y^l(b)) δ^l(a) + ∂₂r(y^l(a), y^l(b)) δ^l(b) = −r(y^l(a), y^l(b)).    (5.4)

   (a) Find an explicit expression for c.
   (b) Show that the solution of (5.3) is equivalent to a Newton step for an
   appropriate zero-finding problem in a space of continuously differentiable
   functions.

6. (a) Prove the fixed-point theorem of Leray–Schauder for the special Banach
   space 𝔹 := ℝ.
   (b) Prove the fixed-point theorem of Brouwer for 𝔹 := ℝ without using (a).

7. Let the mapping G : D ⊆ ℝⁿ → ℝⁿ be a contraction in K ⊆ D with fixed point
   x* ∈ K and contraction factor β < 1. Instead of the iterates x^l defined by
   the fixed-point iteration x^{l+1} := G(x^l), inaccurate values x̃^l are
   actually calculated because of rounding errors. Suppose that the error is
   bounded in terms of a tolerance δ. The iteration begins with x̃⁰ ∈ K and is
   terminated with the first value of the index l = l₀ for which the stopping
   test is satisfied. Show that for the error in x̃ := x̃^{l₀}, a relation of the
   form

       ||x̃ − x*|| ≤ Cδ

   holds, and find an explicit value for the constant C.

8. Using Theorem 6.3.1, give an a posteriori error estimate for one of the
   approximate zeros x̃ found in Exercise 1. Take rounding errors into account
   by using interval arithmetic. Why must one evaluate F(x̃) with the point
   interval x = [x̃, x̃]?

9. Let

       F(x) = ( x₁² − x₁x₂ + x₁ − 1 ; x₂ − x₁ ).

   (a) Show that F'(x) is nonsingular in the region
   D = {x ∈ ℝ² | x₂ < x₁ + 1} and that F has there a unique zero at x* = (1; 1).
   (b) Show that the Newton paths with starting points in D remain in D and end
   at x*.
   (c) Modify a damped Newton method that uses ||F(x)|| as merit function by
   basing it on (4.1) instead, and compare for various random starting points
   its behavior on D with that of the standard damped Newton method, and with an
   implementation of Algorithm 6.4.1.
   (d) Has the norm of the Newton correction a nonzero local minimum in D?

10. Let F : D ⊆ ℝⁿ → ℝⁿ be continuously differentiable. Let F'(x) be an M-matrix
    and let the vectors u, v > 0 be such that F'(x)u ≥ v for all x ∈ D. Also let
    w ∈ ℝⁿ be the vector with w_i = v_i⁻¹ (i = 1, ..., n). Show that for
    arbitrary x ∈ D:
    (a) If D contains all x̄ ∈ ℝⁿ such that |x̄ − x| ≤ u · w^T|F(x)|, then there
    is exactly one zero x* of F such that |x* − x| ≤ u · w^T|F(x)|.
    (b) If, in addition, D is convex, then x* is the unique zero of F in D.

11. (a) Let

        F(x) := ( x₁³ + 2x₁ − x₂ ; x₂³ + 3x₂ − x₁ ).

    Show that Newton's method for the calculation of a solution of F(x) − c = 0
    converges for all c ∈ ℝ² and all starting vectors x⁰ ∈ ℝ².
    (b) Let

        F(x) := ( sin x₁ + 3x₁ − x₂ ; sin x₂ + 3x₂ − x₁ ).

    Show that Newton's method for F(x) = 0 with the starting value x⁰ = (π; π)
    converges, but not to a zero of F.
    (c) For F given in (a) and (b), use Exercise 10 to determine a priori
    estimates for the solutions of F(x) = c with c arbitrary.

References
The number(s) at the end of each reference give the page number(s) where the reference
is cited.
[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover,
New York, 1985. [193, 194]
[2] ADIFOR, Automatic Differentiation of Fortran, WWW-Document,
http://www-unix.mcs.anl.gov/autodiff/ADIFOR/ [308]
[3] T. J. Aird and R. E. Lynch, Computable accurate upper and lower error bounds
for approximate solutions of linear algebraic systems, ACM Trans. Math.
Software 1 (1975), 217-231. [113]
[4] G. Alefeld and J. Herzberger, Introduction to Interval Computations, Academic
Press, New York, 1983. [42]
[5] E. L. Allgower and K. Georg, Numerical Continuation Methods: An
Introduction, Springer, Berlin, 1990. [302, 334]
[6] E. Anderson et al., LAPACK Users' Guide, 3rd ed., SIAM, Philadelphia 1999.
Available online: http://www . netlib. org/lapack/lug/ [62]
[7] F. Bauhuber, Direkte Verfahren zur Berechnung der Nullstellen von Polynomen,
Computing 5 (1970), 97-118. [275]
[8] A. Bjorck, Numerical Methods for Least Squares Problems, SIAM,
Philadelphia, 1996. [61,78]
[9] K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of
Initial- Value Problems in Differential Algebraic Equations. Classics in Applied
Mathematics 14, SIAM, Philadelphia, 1996. [336]
[10] G. D. Byrne and A. C. Hindmarsh, Stiff ODE Solvers: A Review of Current and
Coming Attractions, J. Comput. Phys. 70 (1987), 1-62. [219]
[11] S.-N. Chow and J. K. Hale, Methods of Bifurcation Theory, Springer, New York,
1982. [336]
[12] S. D. Conte and C. de Boor, Elementary Numerical Analysis, 3rd ed.,
McGraw-Hill, Auckland, 1981. [145]
[13] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992. [178]
[14] P. J. Davis and P. Rabinowitz, Methods ofNumerical Integration, 2nd ed.,
Academic Press, Orlando, 1984. [179]
[15] C. de Boor, A Practical Guide to Splines, 2nd ed., Springer, New York, 2000.
[155,163,169]
[16] C. de Boor, Multivariate piecewise polynomials, pp. 65-109, in: Acta Numerica
1993 (A. Iserles, ed.), Cambridge University Press, Cambridge, 1993. [1551
345
346
References
References
[17) J. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997. [61)
[18) J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained
Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ,
1983. [302,334)
[19) P. Deuflhard, A modified Newton method for the solution of ill-conditioned
systems of non-linear equations with application to multiple shooting, Numer.
Math. 22 (1974), 289-315. [329)
[20) P. Deuflhard, Global inexact Newton methods for very large scale nonlinear
problems, IMPACT Compo Sci. Eng. 3 (1991), 366-393. [329)
[21) R. A. De Vore and B. L. Lucier, Wavelets, pp. 1-56, in: Acta Numerica 1992
(A. Iserles, ed.), Cambridge University Press, Cambridge, 1992. [178)
[22) I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices,
Oxford University Press, Oxford, 1986. [64, 75)
[23) J.-P. Eckmann, H. Koch, and P. Wittwer, A computer-assisted proof of
universality for area-preserving maps, Amer. Math. Soc. Memoir 289, AMS,
Providence (1984). [38)
(24) M. C. Eiermann, Automatic, guaranteed integration of analytic functions, BIT 29
(1989), 270-282. [187)
[25] H. Engels, Numerical Quadrature and Cubature, Academic Press, London,
1980. [179]
(26) R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, New York,
1987. [301)
[27] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the
Theory of NP-Completeness, Freeman, San Francisco, 1979. [115]
(28) D. M. Gay. Some convergence properties of Broyden's method, SIAM J. Numer:
Anal. 16 (1979), 623-630. [334)
[29) C. W. Gear, Numerical Initial Value Problems in Ordinary Differential
Equations, Prentice-Hall, Englewood Cliffs, NJ, 1971. [218,219]
(30) P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic
Press, London, 1981. [301)
[31) G. H. Golub and C. F. van Loan, Matrix Computations, 2nd ed., Johns Hopkins
University Press, Baltimore, 1989. [61,70,73,76,250,255)
[32) A. Griewank, Evaluating Derivatives: Principles and Techniques ofAutomatic
Differentiation, SIAM, Philadelphia, 2000. [10]
[33) A. Griewank and G. F. Corliss, Automatic Differentiation ofAlgorithms, SIAM,
Philadelphia, 1991. [10,305)
(34) A. Griewank, D. Juedes. H. Mitev, J. Utke, O. Vogel, and A. Walther, ADOL-C:
A package for the automatic differentiation of algorithms written in CjC++,
ACM Trans Math. Software 22 (1996),131-167.
http://www.math.tu-dresden.de/wir/project/adolc/[308]
[35] E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential
Equations, Vol. 1, Springer, Berlin, 1987. [179, 212]
[36) E. Hairer and G. Wanner, Solving Ordinary Differential Equations, Vol. 2,
Springer, Berlin, 1991. [179,212]
[37) T. C. Hales, The Kepler Conjecture, Manuscript (1998 l. math.MG/9811 071
http://www.math.lsa.umich.edu/-hales/countdown/ [38)
.
[38] D. C. Hanselman and B. Littlefield, Mastering MATLAB 5: A Comprehensive
Tutorial and Reference, Prentice-Hall, Englewood Cliffs, NJ, 1998. [1]
[39] E. Hansen, Global Optimization Using Interval Analysis, Dekker, New York,
1992. [38, 328)
(40) P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM.
Philadelphia, 1997. [81, 125)
[41] J. F. Hart, Computer Approximations, Krieger, Huntington, NY, 1978. [19, 22)
[42] J. Hass, M. Hutchings, and R. Schlafly, The double bubble conjecture, Electron.
Res. Announc. Amer. Math. Soc. 1 (1995), 98-102.
http://math.ucdavis.edu/-hass/bubbles.html [38]
(43) P. Henrici, Applied and Computational Complex Analysis, Vol. I. Power SeriesIntegration - Conformal Mapping - Location of Zeros, Wiley, New York, 1974.
[141,278,281]
[44] N. J. Higham, Accuracy and Stability ofNumerical Algorithms, SIAM,
Philadelphia, 1996. [17, 77, 93]
[45) IEEE Computer Society, IEEE standard for binary floating-point arithmetic,
IEEE Std754-1985, (1985). [15,42]
[46] E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Dover. New York
1994. [143]
,
347
[47] M. A. Jenkins, Algorithm 493: Zeros of a real polynomial, ACM Trans. Math.
Software 1 (1975),178-189. [268]
[48] M. A. Jenkins and J. F. Traub, A three-stage variable-shift iteration for
polynomial zeros and its relation to generalized Rayleigh iteration, Numer.
Math. 14 (1970), 252-263. [268]
[49] R. B. Kearfott, Rigorous Global Search: Continuous Problems, Kluwer,
Dordrecht, 1996. [38, 328]
[50) R. B. Kearfott, Algorithm 763: INTERVAL-ARlTHMETIC: A Fortran 90
module for an interval data type, ACM Trans. Math. Software 22 (1996),
385-392. (43)
[51] R. Klatte, U. Kulisch, C. Lawo, M. Rauch, and A. Wiethoff, C-XSC, a C++
Class Library for Extended Scientific Computing, Springer, Berlin, 1993. [42]
[52) R. Klatte, U. Kulisch, M. Neaga, D. Ratz, and C. Ullrich, PASCAL-XSC _
Language Reference with Examples, Springer, Berlin, 1992. [42]
[53] R. Krawczyk, Newton-Algorithmen zur Bestimmung von Nullstellen mit
Fehlerschranken, Computing 4 (1969), 187-20 I. [I 16]
[54] A. R. Krommer and C. W. Ueberhuber, Computational Integration, SIAM,
Philadelphia, 1998. [179)
[55] F. M. Larkin, Root finding by divided differences, Numer. Math. 37 (1981)
93-104. [259]
[56] W. Light (ed.), Advances in Numerical Analysis: Wavelets, Subdivision
Algorithms, and Radial Basis Functions, Clarendon Press, Oxford, 1992. (170)
[57) R. Lohner, Enclosing the solutions of ordinary initial- and boundary-value
problems, pp. 255-286, in: E. Kaucher et al., eds., Computerarithmetic,
Teubner, Stuttgart, 1987. [214]
[58) MathWorks, Student Edition of MATLAB Version 5for Windows, Prentice-Hall,
Englewood Cliffs, NJ, 1997. [1]
[59] MathWorks, Matlab online manuals (in PDF). WWW-document,
http://www.mathworks.com/access/helpdesk/help/fulldocset.shtml
[J]
[60] J. M. McNamee, A bibliography on roots of polynomials, J. Comput. Appl.
Math. 47 (1993), 391-394. Available online at
http://www.els evier.com/homepage/sac/cam/mcnamee[292]
[61) K. Mehlhorn and S. Naber, The LEDA Platform of Combinatorial and Geometric
Computing, Cambridge University Press, Cambridge, 1999. [38]
[62] K. Mischaikow and M. Mrozek, Chaos in the Lorenz equations: A computer assisted proof. Part II: Details, Math. Comput. 67 (1998), 1023-1046. [38]
[63] R. E. Moore, Methods and Applications of Interval Analysis, SIAM,
Philadelphia, 1981. [42,214]
348
References
References
[64] A. Morgan, Solving Polynomial Systems Using Continuation for Engineering
and Scientific Problems, Prentice-Hall, Englewood Cliffs, NJ, 1987. [334]
[65] D. E. Muller, A method for solving algebraic equations using an automatic
computer, Math. Tables Aids Compo 10 (1956), 208-215. [264]
[66] D. Nerinckx and A. Haegemans, A comparison of non-linear equation solvers, J.
Comput. Appl. Math. 2 (1976),145-148. [292]
[67] NETLlB, A repository of mathematical software, papers, and databases.
http://www.netlib.org/[15,62]
[68] A. Neumaier, Existence regions and error bounds for implicit and inverse
functions, Z. Angew. Math. Mech. 65 (1985), 49-55. [335]
[69] A. Neumaier, An existence test for root clusters and multiple roots, Z. Angew.
Math. Mech. 68 (1988), 256-257. [278]
[70] A. Neumaier, Interval Methods for Systems of Equations, Cambridge University
Press, Cambridge, 1990. [42,48, ns, 117, 118,311,328]
[70a] A. Neumaier, Global, rigorous and realistic bounds for the solution of dissipative
differential equations, Computing 52 (1994), 315-336. [214]
[71] A. Neumaier, Solving ill-conditioned and singular linear systems: A tutorial on
regularization, SIAM Rev. 40 (1998), 636-666. [81]
[72] A. Neumaier, A simple derivation of the Hansen-Bliek-Rohn-Ning-Kearfott
enclosure for linear interval equations, Rei. Comput. 5 (1999), 131-136.
Erratum, Rei. Comput. 6 (2000), 227. [118]
[73] A. Neumaier and T. Rage, Rigorous chaos verification in discrete dynamical
systems, Physica D 67 (1993), 327-346. [38]
[73a] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, Berlin, 1999.
[301]
[74] GNU OCTAVE. A high-level interactive language for numerical computations.
http://www.che.wisc.edu/octave [2]
[75] W. Oettli and W. Prager, Compatibility of approximate solution of linear
equations with given error bounds for coefficients and right-hand sides, Numer.
Math. 6 (1964), 405--409. [103]
[76] M. Olschowka and A. Neumaier, A new pivoting strategy for Gaussian
elimination, Linear Algebra Appl. 240 (1996), 131-151. [85]
[77] G. Opitz, Gleichungsauflosung mittels einer speziellen Interpolation, Z. Angew.
Math. Mech. 38 (1958), 276-277. [259]
[78] J. M. Ortega and W. C. RheinboJdt, Iterative Solution of Nonlinear Equations in
Several Variables, Academic Press, New York, 1970. [311]
[79] B. N. Parlett, The Symmetric Eigenvalue Problem, Prentice Hall, Englewood
Cliffs. NJ, 1990. (Reprinted by SIAM, Philadelphia, 1998.) [250, 255]
[80] K. Petras, Gaussian versus optimal integration of analytic functions, Const.
Approx. 14 (1998), 231-245. [187]
[81] W. C. Rheinboldt, Numerical Analysis of Parameterized Nonlinear Equations,
Wiley, New York, 1986. [336]
[82] J. R. Rice, Matrix Computation and Mathematical Software, McGraw-Hill.
New York, 1981. [84]
[83] S. M. Rump, Verification methods for dense and sparse systems of equations,
pp. 63-136, in: J. Herzberger (ed.), Topics in Validated Computations - Studies
in Computational Mathematics, Elsevier, Amsterdam, 1994. [100]
[84] S. M. Rump, Improved iteration schemes for validation algorithms for dense and
.
sparse non-linear systems, Computing 57 (1996), 77-84. [32.8]
[85] S. M. Rump, INTLAB - INTerval LABoratory, pp. 77-104, in: Developments in
Reliable Computing (T. Csendes, ed.), Kluwer, Dordrecht, 1999.
http://www.ti3.tu-harburg.de/rump/intlab/index.html [10, 42]
349
[86] R. B. Schnabel and E. Eskow, A new modified Cholesky factorization, SIAM J.
Sci. Stat. Comput. 11 (1990), 1136-1158. [77]
[87] SCILAB home page. http://www-rocq.inria.fr/scilab/scilab .html
[2]
[88] R. Seydel, Practical Bifurcation and Stability Analysis, Springer, New York,
1994. [336]
[89] G. W. Stewart, Matrix Algorithms, SIAM, Philadelphia, 1998. [61]
[90] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer. Berlin,
1987. [207]
[91] A. H. Stroud, Approximate Calculation of Multiple Integrals, Prentice-Hall,
Englewood Cliffs, NJ, 1971. [179]
[92] F. Stummel and K. Hainer, Introduction to Numerical Analysis, Longwood,
Stuttgart, 1982. [17, 94]
[93] A. van der Sluis, Condition numbers and equilibration of matrices, Numer. Math.
14 (1969),14--23. [101]
[94] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ.
1962. [98]
[95] W. V. Walter, FORTRAN-XSC: A portable Fortran 90 module library for
accurate and reliable scientific computing, pp. 265-285, in: R. Albrecht et al.
(eds.), Validation Numerics - Theory and Applications. Computing
Supplementum 9, Springer, Wien, 1993. [43]
[96] J. Waldvogel, Pronunciation of Cholesky, Lanczos and Euler, NA-Digest
v90nJO (1990).
http://www.netlib.org/cgi-bin/mfs/02/90/v90nl0.html#2[64]
[97] R. Weiss, Parameter-Free Iterative Linear Solvers, Akademie-Verlag, Berlin,
1996. [64]
[98] J. H. Wilkinson, Rounding Errors in Algebraic Processes, Dover Reprints, New
York, 1994. [267]
[99] S. J. Wright, A collection of problems for which Gaussian elimination with
partial pivoting is unstable, SIAM J. Sci. Statist. Comput. 14 (1993), 231-238.
[122]
[100] W. I. Zangwill and C. B. Garcia, Pathways to Solutions, Fixed Points, and
Equilibria, Prentice-Hall, Englewood Cliffs, NJ, 1981. [334]
Index
(0,), Opitz method, 260
A:k = kth column of A, 62
Ai: = ith row of A, 62
D[c; r], closed disk in complex plane,
141
d] , differential number, 7
e, all-one vector, 62
e(k), unit vectors, 62
f[xQ, Xl], divided difference, 132
f[xo, ... , Xi], divided difference, 134
I, unit matrix, 62
l(f), (weighted) integral of I, 180
J, all-one matrix, 62
Li(X), Lagrange polynomial, 131
0(·), Landau symbol, 34
0(-). Landau symbol, 34
QU), quadrature formula, 180
QR-algorithm, 250
QZ-algorithm, 250
R N (fl, rectangle rule, 204
TN (fl. equidistant trapezoidal rule,
197
Ti.k, entry of Romberg triangle, 206
x ≫ y, much greater than, 24
x ≪ y, much smaller than, 24
≈, approximately equal to, 24
C^(m×n), 62
x, midpoint of x, 39
Conv S, closed convex hull, 141
int D, interior of D, 141
mid x, midpoint of x, 39
rad x, radius of x, 39
w(x), weight function, 180
∂D, boundary of D, 141
IR m x n , 62
OS, interval hull, 40
L:(A, b), solution set oflinear interval system,
114
A(Xl, ... , x n), set of arithmetic expressions,
3
absolute value, 40, 98
accuracy, 23
algorithm
QR,250
QZ,250
Algorithm
Affine Invariant Modified Newton Method,
330
Damped Approximate Newton Method, 317
analytic, 4
approximation, 19
of closed curves, 170
by Cubic Splines, 165
expansion in a power series, 19
by interpolation, 165
iterative methods, 20
by least squares, 166
order
linear, 45
rational, 22
argument reduction, 19
arithmetic
interval, 38
arithmetic expression, 2, 3
definition, 3
evaluation, 4
interval, 44
stabilizing, 28
assignment, II
automatic differentiation, 2
backward, 306
reverse, 305, 306
base, 15
best linear unbiased estimator
(BLUE),78
bisection
midpoint bisection, 242
secant bisection, 244
spectral, 250, 254
351
352
bracket, 241
Bulirsch sequence, 207
cancellation, 25
Cardano's formulas, 57
characteristic polynomial, 2S0
Cholesky
factor, 76
factorization, 76
modified,77
Clenshaw-Curtis formulas, 190
column pivoting, 83, 85
condition, 23, 33
condition number, 34, 38
ill-conditioned, 33, 35
matrix condition number, 99
number, 99
continuation method, 334, 336
continued fraction, 22
convergence
factor, 236
global,237
local,237
order, 256, 257
Q-linear, 236
Q-quadratic,237
Qvsuperlinear, 237
R-linear,236
correct rounding, 16
Cramer's rule, 63
data perturbations, 99
defective, 262
deflation, 267
explicit, 267
implicit, 267
derivative
checking correctness, 56
multivariate, 303
determinant, 67, 88
difference quotient
central, 149
forward, 148
differential equation
Adams-Bashforth method, 215
Adams-Moulton method, 216
backwards differentiation formula
(BDF),218
boundary-value problem, 233
Euler's method, 212
global error, 223
local error estimation, 221
multi-step method, 211, 214
multi-value method, 211
numerical solution of, 210
one-step method, 211
predictor-corrector method, 216
Index
Index
rigorous solution of, 214
Runge-Kutta method, 211
shooting method, 122, 233
step size control, 219
stiff, 214, 218
Taylor series me\hod, 214
differential number, 7
constant, 8
differentiation
analytical, 4
automatic, 7, 10
forward, 10
numerical, 148
discretization error, lSI
higher derivatives, 152
optimal step size, ISO
rounding error analysis, ISO
discretization error
in numerical differentiation,
151
divided difference
confluent, 136
first-order, 132
general, 134
second-order, 133
double precision format, 15
eigenvalue problem
definite, 251
general linear, 250
nonlinear, 250
Opitz methods, 262
Q R-algorithm, 250
QZ-algorithm, 250
rigorous error bounds, 272
spectral bisection, 250
equilibration, 84
error
damping, 35
magnification, 35
error analysis
for arithmetic expressions, 26
for iterative refinement, 110
for linear systems, 99, 112
for multivariate zeros, 322
reducing integration errors, 224
for triangular factorizations, 82, 90
for univariate zeros, 265
error bounds
a posteriori, 324
for complex zeros, 277
for nonlinear systems, 323
for polynomial zeros, 280
for simple eigenvalues, 272
Euler-MacLaurin Summation
Formula, 201
exponent range, 15
extrapolation, 145
Neville formulas, 146
factorization
Cholesky, 76
modified, 77
LDL H,75
LR, LU, 67
QR,80
modified LDL T , 75
orthogonal, 80
SVD,81
triangular, 67
incomplete, 330
fixed point, 308
floating point
IEEE standard, 15
number, 14
function
elementary, 3
piecewise cubic, 155
piecewise linear, 153
in terms ofradial basis
functions, 170
piecewise polynomial, see spline,
155
radial basis function, 170
spline, see spline, 155
Gaussian elimination, 63
rounding error analysis, 90
global optimization, 38
gradual underflow, 18
grid, 153
H-matrix, 69, 97
Halley's Method, 291
Horner scheme, 10
complete, 54
for Newton form, 135
by polynomials, 131
convergence, 141
divergence, 143
interpolation error, 138
Lagrange formulas, 131
Newton formulas, 134
Hermite, 139
in Chebyshev Points, 144
in expanded Chebyshev points, 145
linear, 132
piecewise linear, 153
quadratic, 133
interval, 39
arithmetic, 38
arithmetic expression, 44
eigenvalue enclosure, 272
evaluation
mean value form, 47
hull, OS, 40
Krawczyk's method, 116
Krawczyk iteration, 327
Krawczyk operator, 327
linear system, 114
Gaussian elimination, 118
solution set, I: CA, b), 114
midpoint, mid x, 39
Newton's method
multivariate, 326
Newton method
univariate, 268
point interval, 39
radius, rad x, 39
set of intervals, 39
INTLAB,IO
INTLAB
verifylss, 128
iterative method, 20
iterative refinement, 106
limiting accuracy, 110
Jacobian matrix, 303
IEEE floating point standard, 15
iff, if and only if, vii
ill-conditioned, see condition, 33
inclusion isotonicity, 44
integral, 302
integration
adaptive, 203
error estimation, 207
multivariate, 179
numerical, 179
quadrature formula, see quadrature
formula, 179
Romberg's method, 206
stepsize control, 208
interpolation
by cubic splines, 153, t 56
Kepler's barrel rule, 189
Krawczyk iteration, 327
Krawczyk operator, 327
Landau symbols, 34
Lapack,62
Larkin method, 259
least squares
method, 78
solution, 78
limiting accuracy
for iterative refinement, 110
for multivariate zeros, 322
for numerical differentiation, 151
for univariate zeros, 265
353
354
linear system
error bounds, 112
interval equations, 114
Gaussian elimination, 118
Krawczyk's method, 116
iterative refinement, 106
overdetermined, 61, 78
quality factor, 105
regularization, 81
rounding errors, 82
sparse. 64
triangular, 65
Lipschitz continuous, 304
Index
Index
sparse, 121
sprintf,54
spy, 121
subplot, 177
text, 177
title, 177
true, 242
xlabel. 177
ylabel,177
zeros,63
%, 12
&:,32
-,32
1,32
M-matrix, 98, 124
machine precision, 17
mantissa length, 15
MATLAB, I
1 :n, 3
==,65
A\b,63,67,81
A',62
A(: ,k), 63
A(i, :),62
A.',62
A.*B,63
A ./B. 63
A/B,63
chol, 76, 127
cond, 125
conj,299
disp,32
eig,229
eps, 17
eye. 63
false, 242
feval, 247
fprintf,54
full, 121
get, 177
help, 1
input, 32
inv,63
lu, 122
max, 40
mex, 176
NaN, 15
num2str,11
ones, 63
plot, 12
poly, 266
print, 12
qr,80
randn.232
roots, 266, 268
set, 177
sign, 104
size, II. 63
conjugate transpose, A' , 62
figures, 176
online help, 1
passing function names, 247
precision, 15, 17
pseudo-, 1
public domain variants
OCTAVE, 2
SCILAB,2
sparse matrices, 75
subarray, 66
transpose, A. ' , 62
vector indexing, 11
matrix
A:k = kth column of A, 62
Ai: = ith row of A, 62
absolute value, 62, 98
all-one matrix, J, 62
banded, 74
condition number, 99
conjugate transposed, A H , 62
diagonal, 65
diagonally dominant, 97
factorization, see factorization, 67
H-matrix, 69, 97, 124
Hermitian, 62
Hilbert, 126
inequalities, 62
inverse, 81
conjugate transposed, A -H, 62
transposed, A - T • 62
Jacobian, 303
M-matrix, 98
monomial, 121
norm, see norm, 94
notation, 62
orthogonal, 80
parameter matrix, 250
pencil, 250
permutation, 85
symmetric, 86
positive definite, 69
sparse, 75
symmetric, 62
transposed, AT, 62
triangular, 65
tridiagonal, 74
unitary, 80, 126
unit matrix, I, 62
mean value form, 30, 48
mesh size, 153
Milne rule, 189
Muller's Method, 264
NETLIB,62
Newton's method, 281
discretized, 314
global monotone convergence, 336
multivariate, 31 I
affine invariant, 329
convergence, 318
damped, 315
error analysis, 322
multivariate
interval, 326
local convergence, 314
univariate
damped, 287
global behavior, 284
modified, 289
vs. secant method, 282
Newton path, 329
Newton step. 287
nodes, 153, 180
nonlinear system, 301
continuation method. 334
error bounds. 323
finding all zeros. 328
quasi-Newton method, 333
norm, 94
co-norm, 94, 95
I-norm, 94, 95
2-norm. 94, 95
column sum, 95
Euclidean. 94
matrix. 95
maximum, 94
monotone, 98
row sum, 95
spectral, 95
sum, 94
normal equations, 78
number
integer, 14
real, 14
fixed point, 14
floating point, 14
numerical differentiation
limiting accuracy, 151
numerical integration
rigorous, 185
numerical stability, 23
OCTAVE, 2
operation, 3
Opitz method, 259
optimization problems, 301
orthogonal polynomial
3-term recurrence relation, 195
orthogonal polynomials, 191
outward rounding, 42
overflow, 18
parallelization, 73
parameter matrix, 250
partial pivoting, 83
PASCAL-XSC, 42
permutation
matrix, 85
symmetric, 86
perturbation theory, 94
pivot elements, 83
pivoting
column, 85
complete, 91
pivot search, 83, 88
point interval, 39
pole
Opitz methods, 259
polynomial
Cardano's formulas, 57
characteristic, 250
Chebyshev, 145
interpolation, 131
Lagrange, 131
Legendre, 194
Newton form, 135
orthogonal, 191
3-term recurrence relation, 195
positive definite, 69
positive semidefinite, 69
precision, 23
machine precision, 17
preconditioner, 113, 330
quadratic convergence, 21
quadrature formula
Clenshaw-Curtis, 190
closed, 229
Gaussian, 187, 192
interpolatory, 188
Milne rule, 189, 205, 226
Newton-Cotes, 189
normalized, 181
order, 182
rectangle rule, 204
Simpson rule, 189, 204, 226
trapezoidal rule, 189, 196, 226
equidistant, 197
quality factor
linear system, 105
mean value form, 48
quasi-Newton method, 333
range inclusion, 44
rectangle rule, 204
regula falsi, 235
regularization, 81
result adaptation, 19
Romberg triangle, 206
root secant method, 241
rounding, 16
by chopping, 16
correct, 16
optimal, 16, 42
outward,42
row equilibration, 84
scaling, 83
implicit, 88
SCILAB, 2
search directions, 317
secant method, 234
convergence order, 259
convergence speed, 237
root secant, 241
secant bisection, 244
vs. Newton's method, 282
sign change, 241
Simpson rule, 189
single precision, 15
singular-value decomposition, 81
slope
multivariate, 303
spectral bisection, 254
spline, 155
approximation, 165
approximation error, 162
B-spline, 155, 176
in terms of radial basis functions, 172
basis spline, 155
complete, 161
cubic, 155
free node condition, 158
interpolation, 153, 156
optimality of complete splines, 163
parametric, 169
periodic, 160
stability, 37
numerical, 23
stabilizing expressions, 28
standard deviation, 2, 26, 30
subdistributive law, 59
Theorem
of Aird and Lynch, 113
Euler-MacLaurin summation formula, 201
fixed point theorem
by Banach, 309
by Brouwer, 311
by Leray and Schauder, 310
Gauss-Markov, 78
inertia theorem of Sylvester, 252
of Oettli and Prager, 104
of Rouché, 277
of van der Sluis, 102
trapezoidal rule, 189, 197
triangular
factorization, 67
incomplete, 330
matrix, 65
underflow, 18
vector
absolute value, 62, 98
all-one vector, e, 62
inequalities, 62
norm, see norm, 94
unit vectors, e^(k), 62
vectorization, 73
weight function, 180
weights, 180
zero, 233
bracket, 241
cluster, 239
complex, 273
rigorous error bounds, 277
spiral search, 275
deflation, 267
finding all zeros, 268, 328
hyperbolic interpolation, 261
interval Newton method, 268
limiting accuracy, 265
method of Larkin, 259
method of Opitz, 259
Muller's method, 264
multiple, 239
multivariate, 301
limiting accuracy, 322
polynomial, 280
Cardano's formulas, 57
sign change, 241
simple, 239
univariate
Halley's method, 291
Newton's method, 281