veldhuizen-05-05.

veldhuizen-05-05.
DELFT UNIVERSITY OF TECHNOLOGY
REPORT 05-05
Efficient Methods for Solving Advection Diffusion
Reaction Equations
S. van Veldhuizen
ISSN 1389-6520
Reports of the Department of Applied Mathematical Analysis
Delft 2005
c
Copyright
2005 by Delft Institute of Applied Mathematics Delft,
The Netherlands.
No part of the Journal may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from
Delft Institute of Applied Mathematics, Delft University of Technology, The
Netherlands.
Efficient Methods for Solving
Advection Diffusion Reaction Equations
-Literature StudyS. van Veldhuizen
August 23, 2005
2
Contents
Contents
2
Preface
6
I
9
Mathematical Model of CVD
1 Introduction
11
2 The Mathematical Model
2.1 The Transport Model . . . . . . . . . . . .
2.1.1 Transport Equations for Gas Species
2.1.2 Complete Mathematical Model . . .
2.2 Boundary Conditions . . . . . . . . . . . . .
13
13
15
20
20
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 Stif fness
25
3.1 Example of Stiff Equation . . . . . . . . . . . . . . . . . . . . 25
3.2 Stability Analysis for Euler Forward Method . . . . . . . . . 28
4 Test Problem
4.1 The Computational Grid . . . . . . . . . . . . . . . . . . . . .
4.2 Finite Volume Discretization of the General Transport Equation
4.3 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . .
4.4 Chemistry Properties . . . . . . . . . . . . . . . . . . . . . . .
4.4.1 Gas-Phase Reaction Model . . . . . . . . . . . . . . .
31
32
32
36
38
38
II
43
Time Integration Methods
1 Introduction
45
4
CONTENTS
2 Runge-Kutta Methods
47
2.1 The Stability Function . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Rosenbrock Methods . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Runge Kutta Chebyshev Methods . . . . . . . . . . . . . . . 53
2.3.1 First Order (Damped) Stability Polynomials and Schemes 55
2.3.2 Second Order Schemes . . . . . . . . . . . . . . . . . . 57
2.4 Some Remarks on Runge-Kutta Methods . . . . . . . . . . . 59
2.4.1 Properties of Implicit Runge-Kutta methods . . . . . 60
2.4.2 Diagonally Implicit Runge-Kutta methods . . . . . . . 62
2.4.3 The Order Reduction Phenomenon for RK-methods . 64
2.4.4 Order Reduction for Rosenbrock Methods . . . . . . . 67
3 Time Splitting
3.1 Operator Splitting . . . . . . . . . . . . . . . . . . . .
3.1.1 First Order Splitting of Linear ODE Problems
3.1.2 Nonlinear ODE problems . . . . . . . . . . . .
3.1.3 Second and Higher Order Splitting. . . . . . . .
3.2 Boundary Values, Stiff Terms and Steady State . . . .
3.2.1 Boundary Values . . . . . . . . . . . . . . . . .
3.2.2 Stiff Terms . . . . . . . . . . . . . . . . . . . .
3.2.3 Steady State . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
71
71
72
73
73
75
75
75
76
4 IMEX Methods
77
4.1 IMEX- θ Method . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 80
5 IMEX Runge-Kutta-Chebyshev Methods
5.1 Construction of the IMEX scheme . . . .
5.2 Stability . . . . . . . . . . . . . . . . . . .
5.3 Consistency . . . . . . . . . . . . . . . . .
5.4 Final Remarks . . . . . . . . . . . . . . .
6 Linear Multi-Step Methods
6.1 The Order Conditions . . .
6.2 Stability Properties . . . . .
6.2.1 Zero-Stability . . . .
6.2.2 The Stability Region
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
83
84
85
86
86
.
.
.
.
89
90
92
93
93
7 Multi Rate Runge Kutta Methods
97
7.1 Multi-Rate Runge-Kutta Methods . . . . . . . . . . . . . . . 98
7.2 Order Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.3 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
CONTENTS
III
5
Nonlinear Solvers
105
1 Introduction
107
2 Newton’s Method
2.1 Newton’s Method in One Variable . . . . . .
2.2 General Remarks on Newton’s Method . . . .
2.3 Convergence Properties . . . . . . . . . . . .
2.4 Criteria for Termination of the Iteration . . .
2.5 Inexact Newton Methods . . . . . . . . . . .
2.6 Global Convergence . . . . . . . . . . . . . .
2.7 Extension of Secant Method to n Dimensions
2.8 Failures . . . . . . . . . . . . . . . . . . . . .
2.8.1 Non-smooth Functions . . . . . . . . .
2.8.2 Slow Convergence . . . . . . . . . . .
2.8.3 No Convergence . . . . . . . . . . . .
2.8.4 Failure of the Line Search . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
109
109
109
112
113
114
115
117
119
119
119
120
120
3 Picard Iteration
121
3.1 One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.2 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . 123
3.3 Some Last Remarks . . . . . . . . . . . . . . . . . . . . . . . 124
IV
Linear Solvers
125
1 Introduction
127
2 Krylov Subspace Methods
2.1 Krylov Subspace Methods for Symmetric Matrices .
2.1.1 Arnoldi’s Method and the Lanczos Algorithm
2.1.2 The Conjugate Gradient Algorithm . . . . . .
2.2 Krylov Subspace Methods for General Matrices . . .
2.2.1 BiCG Type Methods . . . . . . . . . . . . . .
2.2.2 GMRES Methods . . . . . . . . . . . . . . .
2.2.3 The GMRES Algorithm . . . . . . . . . . . .
2.3 Stopcriterium . . . . . . . . . . . . . . . . . . . . . .
3 Precondition Techniques
3.1 Preconditioned Iterations . . . . . . . . . .
3.1.1 Preconditioned Conjugate Gradient
3.1.2 Preconditioned GMRES . . . . . . .
3.1.3 Preconditioned Bi-CGSTAB . . . . .
3.2 Preconditioned Techniques . . . . . . . . . .
3.2.1 Diagonal Scaling . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
131
132
132
134
136
137
141
142
143
.
.
.
.
.
.
145
145
145
147
149
150
151
6
CONTENTS
3.3
3.4
V
3.2.2 Incomplete LU Factorization . . . . . . . . . . . . . . 151
Incomplete Choleski Factorization . . . . . . . . . . . . . . . 153
Multigrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Concluding Remarks
155
1 Summary and Conclusions
157
1.1 Time Integration Methods . . . . . . . . . . . . . . . . . . . . 159
1.2 Nonlinear and Linear Solvers . . . . . . . . . . . . . . . . . . 159
2 Future Research
2.1 Test-problem and Time Integration
2.2 Stability . . . . . . . . . . . . . . .
2.3 (Non)-Linear Solvers . . . . . . . .
2.4 Extension to Three Dimensions . .
VI
Methods
. . . . . .
. . . . . .
. . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
161
161
161
162
162
Appendices
163
A Collocation Methods
165
B Padé Approximations
167
Bibliography
169
Preface
Aim of the Research Project
Chemical Vapor Deposition creates thin films of material on a substrate via
the use of chemical reactions. Reactive gases are fed into a reaction chamber
and these gases react in the gas phase or on a substrate and form a thin film
or a powder. In order to model this phenomenon the flow of gases is described by the incompressible Navier-Stokes equations, whereby the density
variations are taken into account by the perfect gas law. The temperature is governed by the heat equation, whereas the chemical species satisfy
the advection-diffusion-reaction equations. Since the reaction of chemical
species generates or costs energy, there is a coupling between the heat equation and the advection-diffusion-reaction equations. In classical CVD the
generated heat is rather small, so it is possible to decouple the heat equation. Our aim is to develop numerical methods , which are also applicable to
laminar combustion simulations, where there is a strong coupling between
the heat equation and the advection-diffusion-reaction equations.
Typically, in the advection-diffusion-reaction equations the time scales of
the slow and fast reaction terms differ order of magnitudes from each other
and from the time scales of diffusion and advection, leading to extremely stiff
systems. The purpose of this project is to develop robust and efficient solvers
for the heat/reaction system, where we firstly assume that the velocity is
given. Thereafter this solver should be integrated in a CFD package. For a
typical 3D problem a 50 × 50 × 50 grid with 50 chemical species are used.
Structure of the Report
This report, a summary of recent, relevant literature, is divided into five
parts. The first part consists of a general introduction to the C(hemical)
V(apor) D(eposition) process and the corresponding mathematical model
which describes that process. It appears that the system of PDEs describing
8
CONTENTS
CVD is a so-called stiff system. In this first part the notion of stiffness is
explained. We conclude with the finite volume discretization of the general
transport equation.
The second part of this report presents time integration methods that
can ‘handle’ stiff problems. Some ’old’ methods, which work fine for stiff
problems. as well as recently developed integration methods will be treated.
Since the PDEs describing CVD are of the nonlinear type, time integration methods will in general result in solving nonlinear equations. Therefore,
the topic of the third part is ‘nonlinear solvers’. We give a treatment of all
well known nonlinear solvers.
Newton-type nonlinear solvers need solutions of linear systems. The
latter is (evidently) the subject of Part IV. Since our aim is to solve three
dimensional problems, this will result in huge nonlinear systems. Thus also
huge linear systems have to be solved. The most attractive class of linear
solvers for three dimensional problems are iterative methods. In Part IV
we treat generally known iterative linear solvers for symmetric and general
matrices.
In Part V some conclusions are made and the scheme for future research
is presented.
Part I
Mathematical Model of CVD
Chapter
1
Introduction
Thin solid films are used as insulating and (semi)conducting layers in many
technological areas such as micro-electronics, optical devices and so on.
Therefore, deposition processes are required which produce solid films on
a wide variety of materials. Of course, these processes need to fulfill specific
requirements with regard to safety, economics, etc.
Chemical Vapor Deposition (CVD) is a process that synthesizes a thin
solid film from the gaseous phase by a chemical reaction on a solid material. The chemical reactions involved distinguish CVD from other physical
deposition processes, such as sputtering and evaporation.
A CVD system is a chemical reactor, wherein the material to be deposited is injected as a gas and which contains the substrates on which
deposition takes place. The energy to drive the chemical reaction is (usually) thermal energy. On the substrates surface reactions will take place
resulting in deposition of a thin film. Basically, the following nine steps,
taken from [16], occur in every CVD reaction process:
1. Convective and diffusive transport of reactants from the reactor inlet
to the reaction zone within the reactor chamber,
2. Chemical reactions in the gas phase, leading to a multitude of new
reactive species and byproducts,
3. Convective and diffusive transport of the initial reactants and the reaction products from the homogeneous reactions to the susceptor surface,
4. Adsorption or chemisorption of these species on the susceptor surface,
5. Surface diffusion of adsorbed species over the surface,
6. Heterogeneous surface reactions catalyzed by the surface, leading to
the formation of a solid film,
12
Introduction
7. Desorption of gaseous reaction products,
8. Diffusive transport of reaction products away from the surface,
9. Convective and/or diffusive transport of reaction products away from
the reaction zone to the outlet of the reactor.
In the case that a CVD process is considered to be fully heterogeneous, step
(2) in the above enumeration is does not take place. The above enumeration
is illustrated in Figure 1.1, taken from [14].
)*
main gas flow
"!
gas phase
reactions
#$
'( %&
desorption &
diffusion &
diffusion of
adsorption
of reactive
volatile surface
species
-.
++,
reaction products
surface diffusion &
reactions
−film growth
Figure 1.1: Schematic representation of basic steps in CVD (after Jensen,
see [14])
Chapter
2
The Mathematical Model
The mathematical model describing the CVD process consists of a set of partial differential equations, with appropriate boundary conditions, describing
the gas flow, transport of energy, transport of species and the chemical reactions in the reactor.
The gas mixture is assumed to behave as a continuum. This assumption
is valid when the mean free path length of the molecules is much smaller
than a characteristic dimension of the reactor. The Knudsen Number Kn is
defined as
ξ
(2.1)
Kn = ,
L
where ξ is the mean free path length of the molecules and L a typical characteristic dimension of the reactor. Thus, the gas-mixture behaves as a
continuum when Kn < 0.01. For pressures larger than 100 P a and typical
reactor dimensions larger than 0.01 m the continuum approach can be used
safely. See also [16] and [21, Chapter 4].
2.1
The Transport Model
The gas mixture in the reactor is assumed to behave as a continuum, ideal
and transparent gas1 behaving in accordance with Newton’s law of viscosity.
Furthermore, the gas flow is assumed to be laminar (low Reynolds number
flow). Since no large velocity gradients appear in CVD gas flows, viscous
heating due to dissipation will be neglected. Furthermore, we neglect the
effects of pressure variations in the energy equation.
The composition of the N component gas mixture is described in terms
1
By transparent we mean that the adsorption of heat radiation by the gas(es) will be
small.
14
The Mathematical Model
of the dimensionless mass fractions ω i =
N
X
ρi
ρ,
i = 1, . . . , N , with the property
ωi = 1.
i=1
P
The density ρ = N
i=1 ωi ρi of the gas mixture depends on the spatial variable
x, temperature T , the pressure P , time t, etc. Usually, in chemistry reaction
equilibria are expressed in terms of (dimensionless) molar fractions. The
molar fraction of species i is denoted by f i and is related to the mass fraction
ωi as
fi mi
ωi =
,
m
where mi is the molar mass of specie i and
m=
N
X
fi mi .
i=1
The mass averaged velocity v in an N component gas mixture is defined as
v=
N
X
ωi vi .
i=1
As a consequence, the diffusive mass fluxes j i (see Section 2.1.1) sum up to
zero, e.g.,
N
X
i=1
ji =
N
X
i=1
ρωi (vi − v) = ρ
N
X
i=1
ωi vi − ρv
N
X
i=1
ωi = ρv − ρv = 0.
The transport of mass, momentum and heat are described respectively
by the continuity equation, the Navier-Stokes equations and the transport
equation for thermal energy expressed in terms of temperature T . The
continuity equation is given as
∂ρ
= −∇ · (ρv),
∂t
(2.2)
kg
where ρ is the gas mixture density ( m
3 ) and v the mass averaged velocity
).
The
Navier-Stokes
equations
are
vector ( m
s
2
∂(ρv)
T
= −(∇ρv)·v+∇· µ ∇v + (∇v) − µ(∇ · v)I −∇P+ρg, (2.3)
∂t
3
2.1 The Transport Model
15
kg
), I the unit tensor and g the gravitational
where µ is the viscosity ( m·s
acceleration. The transport equation for thermal energy 2 is
cp
∂(ρT )
∂t
= −cp ∇ · (ρvT ) + ∇ · (λ∇T ) +
!
N
N
X
X
DTi ∇fi
Hi
+∇ · RT
+
∇ · ji
Mi fi
mi
i=1
−
K
N X
X
i=1
Hi νik Rkg ,
(2.4)
i=1 k=1
J
W
with cp specific heat ( mol·K
), λ the thermal conductivity ( m·K
) and R the
gas constant. Gas species i has a mole fraction f i , a molar mass mi , a
thermal diffusion coefficient DTi , a molar enthalpy Hi and a diffusive mass
flux ji . For a definition of diffusive flux we refer to Section 2.1.1. The
stoichiometric coefficient of the ith species in the k th gas-phase reaction
with net molar reaction rate Rkg is νik .
The third term on the right-hand side of (2.4) is due to the Dufour effect
(diffusion-thermo effect). The Dufour effect is the reversed process of the
Soret effect, which is the process of thermal diffusion. The fourth term on
the right represents the transport of heat associated with the inter-diffusion
of the chemical species. Both terms, the third and the fourth, are probably
not too important in CVD. The last term at the right-hand side of (2.4)
represents the heat production or consumption due to chemical reactions in
the gas mixture. In processes where the gas-phase reactions are neglected
this term will not be important. Also in processes where the reactants are
highly diluted in an inert carrier gas this term is negligible.
2.1.1
Transport Equations for Gas Species
We formulate the transport equations for the gas species in terms of mass
fractions and diffusive mass fluxes. The convective mass flux for the i th gas
species is ρωi v. The diffusive mass flux ji of the ith species is defined as
ji = ρωi (vi − v),
with vi the velocity vector of species i and v the mass averaged velocity.
See for instance [2]. Mass diffusion can be decomposed in concentration
T
diffusion jC
i and thermal diffusion ji :
T
ji = j C
i + ji .
2
Professor Kleijn remarked that the well-posedness of this equation is doubtful. Due
to the fact that the zero of the enthalpy can be chosen in a free way, the ‘meaning’ of this
equation comes into play. Choosing a not-suitable zero of the enthalpy causes undesired
large or small values of the enthalpy.
16
The Mathematical Model
See [16]. The first type of diffusion, j C
i , occurs as a result of a concentration
gradient in the system. We will refer to it as ordinary diffusion. Thermal
diffusion is the kind of diffusion resulting from a temperature gradient.
Ordinary Diffusion
Depending on the properties of the gas mixture, different approaches are
possible to describe ordinary diffusion. First, in a binary gas mixture, e.g.,
a gas mixture consisting of two species (N = 2), the ordinary diffusion mass
flux is given by Fick’s Law :
j1 = −ρD12 ∇ω1 ,
j2 = −ρD21 ∇ω2 ,
with Dij the diffusion coefficient for ordinary diffusion in the pair of gases 1
and 2. In for instance [2] is derived that for a binary mixture there is just
one ordinary diffusity,
D12 = D21 .
For a multi-component gas mixture there are two approaches, the StefanMaxwell equations and an alternative approximation derived by Wilke. The
Stefan-Maxwell equations are a general expression for the ordinary diffusion
fluxes jC
i in a multi-component gas mixture. In terms of mass fraction and
fluxes they are given as
N
mX 1
C
ωi jC
∇ωi + ωi ∇(ln m) =
j − ω j ji ,
ρ
mj Dij
j=1
i = 1, . . . , N − 1, (2.5)
with m the average mole mass of the mixture
m=
N
X
fi mi .
i=1
The coefficients Dij are the binary diffusion coefficients for gas species i
and j. In an N -component gas mixture there are N − 1 Stefan-Maxwell
equations (2.5) and the additional equation
N
X
jC
i = 0.
i=1
The following explicit expression for j C
i is taken from [16],
jC
i = ρDi − ρωi Di ∇(ln m) + mωi Di
N
X
j=1,j6=i
jC
i
,
mj Dij
(2.6)
2.1 The Transport Model
17
with Di an effective diffusion coefficient for species i
Di = PN
1
fj
j=1,j6=i Dij
.
An other approximate expression for the ordinary diffusion fluxes has
been derived by Wilke in the 1950’s. In Wilkes’ approach the diffusion of
species i is
0
jC
(2.7)
i = ρDi ∇ωi ,
with effective diffusion coefficient
D0i = (1 − fi ) PN
1
fj
j=1,j6=i Dij
.
In Wilke’s approach the diffusion is written in the form of Fick’s Law of
diffusion with an effective diffusion coefficient instead of a binary diffusion
coefficient.
Finally we remark that in the case of a binary gas mixture, the StefanMaxwell equations and the Wilke approach lead to Fick’s Law of binary
diffusion. In the case of a multi-component gas mixture both approaches
are identical if the gas species are highly diluted (f 1 , ω1 1). The following
remark is taken from [16]. When Wilke’s approach to the Stefan-Maxwell
equations is used to compute the diffusive fluxes, the set of transport equations (2.9) form an independent set, which is not consistent with
N
X
ωi = 1.
i=1
To fulfill this constraint one of the transport equations has to be replaced
by the constraint itself.
Thermal Diffusion
A homogeneous gas mixture will be separated due to the effect of thermal
diffusion (Soret effect) under the influence of a temperature gradient. This
so-called Soret effect is usually small in comparison with ordinary diffusion.
Only in systems, actually we mean CVD reactors, with large temperature
gradients thermal diffusion may be an important effect. For instance in coldwall CVD reactors we have large temperature gradients. Thermal diffusion
causes in general a movement of relatively heavy molecules to ‘colder’ regions
of the reactor chamber and a movement of light molecules to hotter parts.
The thermal diffusive mass flux is given by
jTi = −DTi ∇(ln T ).
(2.8)
18
The Mathematical Model
In (2.8) DTi is the multi-component thermal diffusion coefficient for species
i. In general it will be a function of the temperature T and the composition
of the gas mixture. As remarked in [16] the coefficient D Ti is not a function
of the pressure. An exact computation of D Ti can be found in [16, Appendix
F], where can be observed that indeed no pressure is used.
Finally, we remark that adiabatic systems do not have thermal diffusion.
The Species Concentration Equations
We assume that in the gas-phase there are K reversible
reactions. For the
mole
is
given.
The balance
k th reaction a net molar reaction rate R kg m
3 ·s
th
equation for the i gas species, i = 1, . . . , N , in terms of mass fractions and
diffusive mass fluxes is then given as
K
X
∂(ρωi )
νik Rkg .
= −∇ · (ρvωi ) − ∇ · ji + mi
∂t
(2.9)
k=1
From the above general PDE for species transport and chemical gas phase
reactions, the approximate Wilke and exact Stefan-Maxwell expressions for
ordinary diffusion and expression for thermal diffusion we have the following
PDEs for solving the species concentrations :
Using the Wilke approximation :
∂(ρωi )
∂t
= −∇ · (ρvωi ) + ∇ · (ρD0i ∇ωi ) + ∇ · (DTi ∇(ln T )) +
+mi
K
X
νik Rkg ,
(2.10)
k=1
Using the exact Stefan-Maxwell equations :
∂(ρωi )
∂t
= −∇ · (ρvωi ) + ∇ · (ρDi ∇ωi ) +

+∇ · (ρωi Di ∇(ln m)) − ∇ · mωi Di
+∇ · (DTi ∇(ln T )) + mi
K
X
νik Rkg .
N
X
j=1,j6=i
jC
i
mj Dij

+
(2.11)
k=1
Net Molar Reaction Rates
The last term on the right hand side of (2.9), (2.10) and (2.11) describes the
consumption and production of the i th species due to homogeneous reactions
2.1 The Transport Model
19
in the gas-phase. In this section we give expressions for the net molar
reaction rates Rkg . As before, we assume that there are K reversible gasphase reactions. Since different species can act as reactant and as product,
we use the general notation
N
X
i=1
g
kk,forward
k − νik kAi
N
X
g
kk,backward
i=1
kνik kAi
k = 1, . . . , K.
(2.12)
In (2.12) Ai , i = 1, . . . , N , represent the different gas species in the reactor
g
g
chamber, kk,forward
the forward reaction rate constant and k k,backward
the
backward reaction rate constant. By taking
νik > 0 for the products of the forward reaction
νik < 0 for the reactants of the forward reaction
and
kνik k = νik and k − νik k = 0
for νik ≥ 0
kνik k = 0
and k − νik k = |νik | for νik ≤ 0
equation (2.12) represents a general equilibrium reaction, with reactants
appearing at the left-hand side and products on the right-hand side. The
net reaction rate Rkg for the k th reaction is given as
Rkg = Rkg,forward − Rkg,backward =
g
= kk,forward
N
Y
i=1
k−νik k
ci
g
− kk,backward
N
Y
kνik k
ci
,
(2.13)
i=1
P fi 3
RT .
See for instance [16] and [21, Chapter 4].
with ci =
g
The forward reaction rate constants k k,forward
are fitted as a function of
the temperature T as follows :
g
kk,forward
(T ) = Ak T βk e
−Ek
RT
,
(2.14)
where Ak is the pre-exponential factor homogeneous reaction rate for the
k th reaction, βk the temperature coefficient and Ek the activation energy for
g
the k th reaction. Expressing the rate constant k k,forward
as done in (2.14)
is the (modified) Law of Arrhenius. See also [17]. The backward reaction
rate constants are self-consistent with the forward reaction rate constants.
Using thermo-chemistry and doing some calculations the following relation
appears
P N
g
i=1 νik
k
(T
)
RT
k,forward
g
kk,backward
(T ) =
,
Kkg
P0
3
Recall that the mole fraction of the ith gas species fi is related with the mass fraction
ωi in the following way
mi
ωi ,
fi =
m
with mi the molar mass of species i and m the average molar mass of the gas-mixture.
20
The Mathematical Model
with Kkg the reaction equilibrium constant given by
Kkg (T ) = e−
0 (T )−T ∆S 0 (T )
∆Hk
k
RT
N
X
∆Hk0 (T ) =
i=1
N
X
∆Sk0 (T ) =
, with
νik Hi0 (T ) and
νik Si0 (T ).
i=1
Remark 2.1. Typically, the pre-exponential factor homogeneous reaction
rate for the k th reaction Ak are of order 105 − 1029 . Due to these large
factors Ak the equations (2.10) and (2.11) become very stiff.
2.1.2
Complete Mathematical Model
To complete the set of equations we add the ideal gas law, i.e., we assumed
the gas mixture to behave as an ideal gas. In terms of of the mass density
ρ and molar mass m of the gas it is given as
P m = ρRT,
(2.15)
with P the pressure, R the gas constant and T the temperature.
As final result, in the CVD reactor the following coupled set of an algebraic equation and partial differential equations has to be solved. This set
consists of the following N +3+d, with d the dimension of space, equations :
1. Continuity equation (2.2),
2. Navier-Stokes equations (d equations, d = 1, 2, 3) (2.3),
3. Transport equation for thermal energy (2.4),
4. Transport equations for the ith gas species, i = 1, . . . , N (2.9) and
5. Ideal gas law (2.15).
The variables to be solved from this set of equations are the mass averaged
velocity vector v, the pressure P, the temperature T , the density ρ and the
mass fractions ωi for i = 1, . . . , N .
2.2
Boundary Conditions
In order to solve the system of PDE’s describing the CVD process appropriate boundary conditions have to be chosen. If we start with a simple
reactor chamber, consisting of an inflow and an outflow boundary, a number of solid walls and a reacting surface, then on each of these boundary we
have to prescribe a condition.
2.2 Boundary Conditions
21
To avoid misunderstanding, we will first give a definition of the normal
n. By the normal n we mean the unit normal vector pointing outward the
plane or boundary.
Solid Wall
On the solid, non reacting walls we apply the no slip and impermeability
conditions for the velocity vector,
v = 0.
On this wall we can have either a prescribed temperature T ,
T = Twall ,
or a zero temperature gradient for adiabatic walls,
n · ∇T = 0.
Furthermore, on the solid wall there is no mass diffusion,
T
n · ji = n · (jC
i + ji ) = 0.
For adiabatic walls and adiabatic systems the above boundary condition
reduces to
n · ∇ωi = 0.
Reacting Surface
On the reacting surface we have to prescribe the surface reactions. We
assume that on the surface S different surface reactions can take place.
There will be a net mass production P i at these surfaces. For ωi this net
mass production is given by
Pi = m i
S
X
σis RsS ,
s=1
where mi is the mole mass, σis the stoichiometric coefficient for the gaseous
and solid species for surface reaction s (s = 1, . . . , S) and R sS the reaction
rate of reaction s on the surface.
We assume the no-slip condition on the wafer surface. Since there is
mass production in normal direction the normal component of the velocity
will not be equal to zero. Now, we find for the velocity the conditions
n·v =
S
N
X
1X
σis RsS ,
mi
ρ
s=1
n × v = 0.
i=1
(2.16)
22
The Mathematical Model
on the reacting boundary.4 For the wafer temperature we have
T = Twafer .
On the wafer surface the net mass production of species i must be equal
to the normal direction of the total mass flux of species i. The total mass
flux of species i is given by
ρωi v + ji .
Therefore we have on the wafer surface the following boundary condition
n · (ρωi v + ji ) = mi
S
X
σis RsS .
s=1
Condition (2.16), n × v = 0, gives always two conditions if we consider
the three dimensional case. The normal on the reacting surface is known,
n = [n1 , n2 , n3 ]T and the velocity vector is the vector v = [v 1 , v2 , v3 ]T ,
with v1 , v2 and v3 unknowns. We compute n × v to derive the number of
conditions for v. Hence,


n2 v3 − n 3 v2
n × v =  n3 v1 − n 1 v3  .
n1 v2 − n 2 v1
Since ni is known and vi not, we are able to compute the three velocity components from the equation n × v = 0. The first equation gives v 3 = nn3 v2 2 and
the second v1 = nn1 n2 n3 v3 2 . Hence, the third equation is also satisfied. Therefore, from an outer product follows a condition for only two components of
the velocity vector.
Inflow
We take as boundary conditions in the inflow
n · v = vin
n × v = 0,
where the inflow velocity vin with prescribed inflow of each species, denoted
by Qi for species i, is given as
vin =
N
N
X
Tin P 0 10−3 1 X
−3 Tin
Q
=
5.66
·
10
Qi .5
i
T 0 Pin 60 Ain
Ain Pin
i=1
4
(2.17)
i=1
The outer product of two vectors u and v is defined as
u × v = (u2 v3 − u3 v2 )e(1) + (u3 v1 − u1 v3 )e(2) + (u1 v2 − u2 v1 )e(3) ,
with e(α) the unit vector in the xα direction. The outer product is not symmetric, hence,
u × v = −v × u.
5
We used in (2.17) that the standard pressure P 0 = 1.013 · 105 Pa and the standard
temperature T 0 = 298.15K.
2.2 Boundary Conditions
23
The temperature in the inflow is prescribed as
T = Tin .
In [16] the author also prescribes
n · (λ∇T ) = 0,
to prohibit a conductive heat flow through the inflow opening. After further
analysis it was clear that λ = 0 is taken on the boundary 6 .
Total mass flow of species i flowing into the reactor chamber must correspond to Qi in the following way, see [16] :
−3 Qi mi
T
n · ρωi v + jC
,
i + ji = 5.66 · 10
RAin
with R the universal gas constant. To satisfy (2.18) we put
T
n · jC
i + ji = 0,
and
n · (ρωi v) = 5.66 · 10−3
Qi mi
.
RAin
(2.18)
(2.19)
(2.20)
By taking the concentration- and thermal- diffusion coefficients on the boundary equal to zero, we satisfy (2.19). Elementary analysis of (2.20) gives that
the inlet concentrations should be taken fixed as
Outflow
mi Qi
.
ωi = PN
M
Q
j
j
j=1
(2.21)
In the outflow opening we assume zero gradients in the normal direction of
the outflow for the total mass flux vector, the heat flux and the diffusion
fluxes. Furthermore, we assume that the velocity at the outflow boundary is
in the normal direction. For the velocity we have the boundary conditions
n · (∇ρv) = 0,
n × v = 0.
For the heat and diffusion fluxes we get
n · (λ∇T ) = 0, and
T
n · (jC
i + ji ) = 0.
6
Is it possible to have a discontinuous thermal conductivity coefficient λ ?
24
The Mathematical Model
Chapter
3
Stiffness
The note of stiffness has already been mentioned in Chapter 2. In this
(short) chapter we pay some attention to this notion.
While the intuitive meaning of stiff is clear to all specialists, much controversy is going on about it’s correct ‘mathematical definition’. The most
pragmatical opinion is also historically the first one, Curtiss & Hirschfelder
[4] :
“Stiff equations are equations where certain implicit methods,
in particular BDF, perform better, usually tremendously better,
than explicit ones.”
A more recent opinion of Hundsdorfer & Verwer [13] :
“Stiffness has no mathematical definition. Instead it is an operational concept, indicating the class of problems for which implicit
methods perform (much) better than explicit methods.”
The eigenvalues of the Jacobian δf
δy play certainly a role in this decision,
but quantities such as the dimension of the system, the smoothness of the
solution or the integration interval are also important.
In this chapter we give first an appropriate example. Subsequently we
give a stability analysis for the Euler forward method, where the role of the
eigenvalues of the Jacobian become clear.
3.1
Example of Stiff Equation
As mentioned before, stiff equations belong to the class of ODEs for which
explicit methods do not work. There are many ‘standard’ examples of stiff
equations. Next, we treat an example coming from chemical reacting systems, taken from [26]. We selected this ‘standard’ example because of the
connectivity with the topic of this research.
26
Stiffness
Example 3.1. Consider the chemical species A, B and C. The Robertson
reaction system is then given as :
A
B+B
B+C
0.04
−→
B
(slow )
3·107
−→ C + B (very fast)
104
−→
A+C
(fast)
which leads to the equations
A : y10 = −0.04y1 +104 y2 y3 ,
y1 (0) = 1
B : y20 =
0.04y1 −104 y2 y3 −3 · 107 y22 , y2 (0) = 0
C : y30 =
3 · 107 y22 , y3 (0) = 0.
Introduce

y1
y =  y2  ,
y3

so that (3.1) can be written as
y 0 = f (y),
y(0) = [1, 0, 0]T ,
(3.1)
(3.2)
(3.3)
where the definition of f (y) is self-evident.
The ODE system will be solved using two methods :
1. Euler Forward, which is an explicit method,
yn+1 − yn
= f (yn ),
τ
(3.4)
2. Euler Backward, which is implicit,
yn+1 − yn
= f (yn+1 ).
τ
(3.5)
The numerical solutions obtained for y 2 (t) with Euler Forward and Backward
are displayed in Figure 3.1. We observe that the solution y 2 rapidly reaches
a quasi-stationary position in the vicinity of y 20 = 0, which in the beginning
is at 0.04 ≈ 3 · 107 y22 , hence y2 ≈ 3.65·−5 , and then very slowly goes back to
zero again.
It can be seen in Figure 3.1 that the implicit method, i.e. Euler Backward,
integrates (3.1) without problem. The explicit Euler Forward has violent
oscillations when the step-size τ is too large. We remark that the solution
of Euler Backward is of good quality.
Apparently, in the sense of Curtiss and Hirschfelder (and Hundsdorfer
and Verwer) the system of ODEs (3.1) is stif f.
3.1 Example of Stiff Equation
165
−5
Concentration of species B
x 10
1
27
Concentration of species B
x 10
6
5
0
4
−1
3
−2
2
1
−3
0
−4
−1
−5
−2
−6
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
−3
0
0.05
0.1
0.15
time t
−5
4
4
3.5
0.25
0.3
0.35
0.25
0.3
0.35
Concentration of species B
x 10
3.5
3
3
2.5
2.5
2
2
1.5
1.5
1
1
0.5
0
0.2
time t
−5
Concentration of species B
x 10
0.5
0
0.05
0.1
0.15
0.2
time t
0.25
0.3
0.35
0
0
0.05
0.1
0.15
0.2
time t
Figure 3.1: Numerical solution for y 2 (t) with Euler Forward with step-size
1
1
1
(upper-left), with τ = 1000
(upper-right), with τ = 1250
(lower-left)
τ = 50
1
and Euler Backward with steps-size τ = 50 (lower-right). Remark that the
axis in the upper-left figure has another scale, i.e., it is scaled by a factor
10165 .
28
3.2
Stiffness
Stability Analysis for Euler Forward Method
Let ϕ(t) be a smooth solution of y 0 = f (t, y). We linearize f in its neighborhood as follows
y 0 (t) = f (t, ϕ(t)) +
δf
(t, ϕ(t))(y(t) − ϕ(t)) + · · ·
δy
(3.6)
and introduce ȳ(t) = y(t) − ϕ(t) to obtain
ȳ 0 (t) = f (t, ϕ(t)) +
δf
(t, ϕ(t))ȳ(t) + · · · = J(t)ȳ(t) + · · ·
δy
(3.7)
As a first approximation, we consider the Jacobian J(t) as constant and
neglect the error terms. Omitting the bars we arrive at
y 0 = Jy.
(3.8)
If we now apply the Euler forward method to (3.8), we obtain
yn+1 = R(hJ)yn ,
(3.9)
R(z) = 1 + z.
(3.10)
with
To study the behavior of (3.9), we assume that J is diagonalizable with
eigenvectors v1 , . . . , vk . As a consequence, y0 can be written in this basis as
y0 =
k
X
αi vi .
(3.11)
i=1
If the λi are the eigenvalues of J, then using (3.11), (3.9) can be written as
ym =
k
X
(R(hλi ))m αi vi .
(3.12)
i=1
It is clear that ym remains bounded for m → ∞, if for all eigenvalues the
complex number z = hλi lies in the set
S = {z ∈ C : |R(z)| ≤ 1}.
(3.13)
This leads to the explanation of the results found in the example of the
previous section.
Example 3.1. (Continued) The Jacobian for the Robertson reaction system (3.1) is given by


−0.04
104 y3
104 y2
 0.04 −104 y3 − 6 · 107 y2 −104 y2  .
(3.14)
0
6 · 107 y2
0
3.2 Stability Analysis for Euler Forward Method
29
In the neighborhood of the equilibrium y 1 = 1, y2 = 0.0000365, y3 = 0 the
Jacobian is equal to


−0.04
0
0.365
 0.04 −2190 0.365  ,
(3.15)
0
2190
0
with eigenvalues
λ1 = 0,
λ2 = −0.405
and
λ3 = −2189.6.
(3.16)
The third eigenvalue, λ3 , produces stiffness. For stability for Euler Forward
5
2
= 5474
≈ 0.0009.
we need λ3 τ ≥ −2. Hence, τ ≤ 2189.6
The implicit Euler Backward method is unconditionally stable, thus there
are no restrictions on the step-size τ . This confirms the numerical observations.
30
Stiffness
Chapter
4
Test Problem
The mathematical model of the CVD process described in Chapter 2 consists of a number of coupled non-linear partial differential equations with
boundary conditions. In general, it is impossible to solve this set of PDEs
analytically. As mentioned in [16], for simple reactor configurations and a
drastically simplified transport model useful analytic solutions can be obtained. See for example [7, 18]. In this study we have a general model, which
can be applied to various reactor configurations and processes. Therefore,
we use numerical methods to find approximate solutions of the full set of
PDEs. The aim is to find the most efficient solution method.
The finite volume approach is the most widely used class of numerical
methods for computing heat and mass transfer in variable property fluid
flows with chemical reactions. Therefore, the finite volume approach will be
used in this study. A detailed description of the finite volume approach can
be found in [24] and numerous other publications.
The main goal of this project is to find an efficient solution method for
solving the advection-diffusion-reaction equations, which are also known in
this report as species equations. In the test problem we restrict ourselves to
solve a coupled system of species equations on a 2D computational grid. As
a first approximation we also neglect the surface reactions. This results in
having inflow, outflow and solid wall boundary conditions. Stiffness in this
system comes from the gas phase reactions.
In this chapter a 2D axisymmetric test problem will be described. The
chapter is organized as follows. We start with defining the computational
grid. Subsequently, we give the finite volume discretization of the species
equations. The last sections of this chapter are reserved for more details on
the chemical properties of the test model.
32
Test Problem
4.1
The Computational Grid
We restrict ourselves to the two-dimensional case. We use a cell-centered
non-uniform grid. There are two ways to arrange the unknowns on the grid :
1. colocated arrangement,
2. staggered arrangement.
Visualization of both arrangements is done on a cell-centered uniform grid
in Figure 4.1. When all discrete unknowns are located in the center of
a cell, the arrangement is called colocated. In the case that the pressure,
temperature and species mass fraction are located in the cell-centers and the
velocity components are located at the cell face centers, the arrangement is
called staggered.
In this test problem we use a colocated grid, because the CFD package
X-stream also uses a colocated grid. Recall that the solvers developed in
this project will be implemented in X-stream.
Figure 4.1: Two-dimensional cell-centered uniform grid with a colocated
arrangement of the unknowns (left) and a staggered arrangement of the
unknowns (right).
4.2
Finite Volume Discretization of the General
Transport Equation
The test problem we have in this chapter consists of N coupled species
concentration equations. Recall that N represents the number of gas species
in the reactor. With respect to the diffusion flux we assume the following :
• we neglect thermo-diffusion, thus we consider diffusion to exist of ordinary diffusion only,
• ordinary diffusion is described by the Wilke approach (Fick’s law).
4.2 Finite Volume Discretization of the General Transport Equation
33
The species concentration equations then take the form
K
X
∂(ρωi )
νik Rkg ,
= −∇ · (ρvωi ) + ∇ · (ρD0i ∇ωi ) + mi
∂t
(4.1)
k=1
where K is the number of reactions in the gas phase and D 0i the effective
diffusion coefficient. Equation (4.1) is valid for all species in the gas mixture.
Since the relation
N
X
ωi = 1,
(4.2)
i=1
holds, only (N −1) coupled species concentration equations have to be solved.
The general form of equation (4.1) is
∂(ρφ)
= −∇ · (ρvφ) + ∇ · (Γφ ∇φ) + Sφ ,
∂t
(4.3)
whereby
φ = ωi ,
Γφ = ρD0i
and
S φ = mi
K
X
νik Rkg .
(4.4)
k=1
The discretization of advection-diffusion-reaction equation (4.3) will be done
for its 2D axisymmetric (r, z) form, that is
∂(rρφ)
= −∇r,z · (rρvφ) + ∇r,z · (rΓφ ∇r,z φ) + Sφ ,
∂t
(4.5)
∂· ∂· T
, ∂z ) . Remark that the 2D Cartewhere ∇r,z is the operator ∇r,z (·) = ( ∂r
sian form (4.3) of (4.5) can be obtained by taking r = 1 in (4.5).
Next, we will compute the volume integral over the control volume surrounding the grid point P , see Figure 4.2.
Grid point P has four neighbors, indicated by N (orth), S(outh), E(ast)
and W (est), and the corresponding walls are indicated by n, s, e and w. For
the sake of clarity, we write in the remainder of this chapter
u
,
(4.6)
v=
v
where u is the velocity component in r-direction and v the velocity component in z-direction.
Integrating Equation (4.5) over the control volume ∆r∆z surrounding
P gives
ZZ
∂(rρφ)
dr dz =
(4.7)
∂t
∆r∆z
ZZ
− ∇r,z · (rρvφ) + ∇r,z · (rΓφ ∇r,z φ) + Sφ dr dz.
(4.8)
∆r∆z
34
Test Problem
N
∆z
vn
n
n
w
ue
W
uw
E
P
∆z
e
s
vs
S
∆r
Figure 4.2: Grid cells
By using the Gauß theorem1 we may write
Z
Z
Z
Z
∂(rρφ)
dr dz = − rρvφ dr + rρvφ dr − rρuφ dz + rρuφ dz +
∂t
n
s
e
w
∆r∆z
Z
Z
Z
Z
∂φ
∂φ
∂φ
∂φ
rΓφ
dr − rΓφ
dr + rΓφ
dz − rΓφ
dz +
∂z
∂z
∂r
∂r
n
s
e
w
ZZ
rSφ dr dz.(4.9)
ZZ
∆r∆z
1
Theorem (Divergence Theorem of Gauß). For any volume V ⊂ Rd with piecewise
smooth closed surface S and any differentiable vector field F we have
Z
∇ · F dV =
V
where n is the outward unit normal on S.
Z
S
F · n dS,
4.2 Finite Volume Discretization of the General Transport Equation
35
The remaining integrals are approximated as
d(rP ρP φP )
∆r∆z = −rn ρn vn φn ∆r + rs ρs vs φs ∆r − re ρe ve φe ∆z + rw ρw vw φw ∆z +
dt
∂φ ∂φ ∂φ ∂φ ∆r − rs Γφ,s
∆r + re Γφ,e
∆z − rw Γφ,w
∆z +
rn Γφ,n
∂z n
∂z s
∂r e
∂r w
rp Sφ,P ∆r∆z.
(4.10)
In the above formulation the time derivative will not be approximated
yet. The next part of this report is dedicated to the subject of time integration methods.
The densities ρ and the diffusion coefficients Γ φ at the cell walls are
approximated by the harmonic mean as
2ρP ρN
,
ρP + ρ N
2ΓP ΓN
,
Γn =
ΓP + Γ N
ρn =
and similar for the other cell walls. Because the density function is smooth
in the computational domain, its (harmonic mean) approximation can be
replaced by
1
(4.11)
ρn = (ρN + ρP ).
2
We remark that the latter is done in [16].
For the approximation of the value of φ and its first derivative on the
cell walls, we consider the following three methods :
φn = 12 (φN + φP )
∂φ φN −φP
∂z n = ∆zn
φP for vn ≥ 0
1st order upwind scheme : φn =
φ
N for vn < 0
φN −φP
∂φ ∂z n = ∆zn

φN
for P e∆n < −2

1
hybrid scheme :
φn =
(φ + φP ) for |P e∆n | ≤ 2
 2 N
φP
for P e∆n > 2
φN −φ
P
for |P e∆n | ≤ 2
∂φ ∆zn
∂z n =
0
for |P e∆n | > 2
central scheme :
where the cell Peclet number P e∆ is on the n-wall is defined as
P e∆n =
and similar for the other walls.
ρn vn ∆zn
,
Γφ,n
(4.12)
36
Test Problem
The central scheme has second order accuracy, but becomes unstable for
large |P e∆n |. It is generally known that for large |P e ∆n | the central scheme
leads to wiggles in the solution.
The upwind scheme damps these wiggles. The disadvantage is that in
turn for damping these wiggles one gets loss of accuracy, i.e. the upwind
scheme has only first order accuracy. Furthermore, the upwind scheme produces ‘large’ numerical diffusion.
The hybrid scheme combines the advantages of the central and the upwind scheme. For |P e∆ | > 2, it locally switches to the upwind scheme and
sets the diffusion contribution to zero.
Finally we remark that it is common in CVD processes that the Reynolds
number is low. Based on previous research, low Reynolds number imply low
Peclet numbers. This means that the using the hybrid scheme results usually
in a central differencing scheme.
4.3
Boundary Conditions
For every ADR-equation of the form (4.5) a set of discretized boundary
conditions is needed. Subsequently we will discretize the inflow, solid wall
and outflow boundary conditions. Recall that in the test-problem we do not
consider reacting surfaces.
Inflow
In Figure 4.3 an inflow opening normal to the z-direction has been illustrated. For the species concentration equations (4.5) we wish to impose
Figure 4.3: Inflow boundary condition
boundary conditions (2.19) and (2.21). The first is simply done by setting
the diffusion constant Γφ,n equal to zero. The second is satisfied by replacing
φn in (4.10) by φin , where φin is the prescribed inlet concentration.
4.3 Boundary Conditions
37
Solid Wall
In Figure 4.4 a solid wall is illustrated. Since there is assumed that there is
no thermal diffusion, we only have
n · ∇φ = 0,
(4.13)
on wall w. Substituting this condition into (4.9) gives a modification of the
integrals
R
• rρuφ dz changes into
w
Z
rρuφ dz = rρw uw φP ∆z
(4.14)
w
where we used that φW = φP (comes from the fact that ∂φ/∂r = 0)
and φw = 21 (φP + φW ),
R
w
rΓφ ∂φ
∂r dz becomes zero, because
∂φ
= 0.
∂r
(4.15)
Solid Wall
•
N
n
w
E
P
W
Virtual Point
e
s
S
Figure 4.4: Solid wall boundary condition
Outflow
The outflow boundary condition with respect to the concentrations, without
thermo diffusion, is n · ∇φ = 0. Remark that this b.c. is equal to the solid
wall boundary condition and therefore we refer to the previous subsection
for the discretization.
38
4.4
Test Problem
Chemistry Properties
In this section we present the details of the chemical model used for this test
problem. The gas mixture in the CVD reactor of this test problem consists
of 7 species, see Section 4.4.1. This model is a simplified model of a CVD
process that deposits silicon Si from silane SiH 4 . The process of depositing
silicon from silane is one of the most applied and studied CVD processes in
the IC industry.
In this problem we use the reactor configuration as given in Figure 4.5.
This reactor has one inflow boundary, a number of solid walls, one reacting surface and two outflow boundaries. The reactor is axisymmetric, and
hence, the computational domain results into the domain given in Figure
4.6. Furthermore, we assume that the gas flow in the reactor is also axisymmetric.
Inflow
10 cm.
30 cm.
substrate
Outflow
35 cm.
Figure 4.5: Reactor geometry
We conclude with some last remarks about this CVD process. From the
top of the reactor, the inflow boundary, a gas mixture enters the reactor.
This gas mixture consists of silane SiH 4 and the carrier gas helium He. When
the mixture enters the reactor it has a uniform temperature T in = 300 K
and a uniform velocity Uin . The inlet silane mole fraction is f in,SiH4 = 0.001,
the rest is helium. At a distance of 10 cm. below the inlet, a susceptor with
temperature Ts = 1000 K is placed. The susceptor has a diameter of 30 cm.
On the susceptor surface reactions take place leading to the deposition of
silicon. In the first numerical experiments, these surface reactions will be
omitted.
4.4.1
Gas-Phase Reaction Model
Due to gas phase reactions in the reactor the gas mixture consists, besides
helium and silane, of the following species :
4.4 Chemistry Properties
39
INFLOW
θ
z
r
OUTFLOW
SUSCEPTOR
Figure 4.6: Computational domain
SiH4
SiH2
Si2 H6
H2
He
H2 SiSiH2
Si3 H8
We use the following reaction mechanism :
G1 :
G2 :
G3 :
G4 :
G5 :
SiH4
Si2 H6
Si2 H6
SiH2 +Si2 H6
2Si2
SiH2 + H4
SiH4 + SiH2
H2 SiSiH2 + H2
Si3 H8
H2 SiSiH2
Recall from Chapter 2 that the forward reaction rate k forward is fitted according to the modified Law of Arrhenius as
g
kforward
(T ) = Ak T βk e
−Ek
RT
.
(4.16)
The backward reaction rates are self consistent with
PN
g
i=1 νik
kforward
(T ) RT
g
kbackward (T ) =
,
Kg
P0
where the reaction equilibrium K g is approximated by
K g (T ) = Aeq T βeq e
−Eeq
RT
.
(4.17)
In Table 4.1 the forward rate constants A k , βk and Ek are given. The fit
parameters for the gas phase equilibrium are given in Table 4.2.
40
Test Problem
Reaction
Ak
βk
Ek
G1
1.09 · 1025
−3.37
256000
−4.24
243000
0.00
236000
0.00
0
0.00
0
G2
G3
G4
G5
3.24 · 1029
7.94 · 1015
1.81 · 108
1.81 · 108
Table 4.1: Fit parameters for the forward reaction rates (4.16)
Reaction
Aeq
βeq
Eeq
G1
6.85 · 105
0.48
235000
−1.68
229000
0.00
187000
1.64
233000
0.00
272000
G2
G3
G4
G5
1.96 · 1012
3.70 · 107
1.36 · 10−12
2.00 · 10−7
Table 4.2: Fit parameters for the equilibrium constants (4.17)
4.4 Chemistry Properties
41
Other Data
Besides the chemical model of the reacting gases, also some other properties
of the gas mixture are needed. As mentioned before, the inlet temperature
of the mixture is 300 K, and the temperature at the wafer surface is 1000
K. The pressure in the reactor is 1 atm., which is equal to 1.013 · 10 5 Pa.
Other properties are given in Table 4.3. The diffusion coefficients, according
to Fick’s Law, are given in Table 4.4.
Density ρ(T )
Specific heat cp
Viscosity µ
Thermal conductivity λ
1.637 · 10−1 ·
300
T
5.163 · 103
1.990 · 10−5
1.547 · 10−1
T
300
T
300
[kg/m3 ]
[J/kg/K]
0.7
0.8
[kg/m/s]
[W/m/K]
Table 4.3: Gas mixture properties
SiH4
SiH2
H2 SiSiH2
Si2 H6
Si3 H8
H2
4.77 · 10−6
5.38 · 10−6
3.94 · 10−6
3.72 · 10−6
3.05 · 10−6
8.02 · 10−6
T 1.7
300
T 1.7
300
T 1.7
300
T 1.7
300
T 1.7
300
T 1.7
300
Table 4.4: Diffusion coefficients D0i , according to Fick’s Law
42
Test Problem
Part II
Time Integration Methods
Chapter
1
Introduction
In this part of the thesis we present time integration methods which are of
practical relevance for solving advection diffusion reaction problems. These
methods have been selected from recent literature concerning this problem,
e.g. [13, 20, 36, 37]
As seen in Part I the advection diffusion reaction equation is a PDE
with appropriate boundary conditions and initial values. To approximate
the solution u(x, t) of an ADR equation, or system of ADR equations, we
use the Method of Lines approach. The PDE is then first discretized in
space on a certain grid Ωh with mesh width h > 0 to yield a semi discrete
system
w0 (t) = F (t, w(t)), 0 < t ≤ T, w(0) given,
(1.1)
m
where w(t) = (wj (t))m
j=1 ∈ R , with m proportional to the number of grid
points in spatial direction. The next step is to integrate the ODE system
(1.1) with an appropriate time integration method. A number of methods
is treated in the following sections. Among others we will discuss
1. The class of Runge Kutta Methods including the Rosenbrock Methods,
2. Time Splitting Methods,
3. Runge Kutta Chebyshev Methods,
4. Multi-rate (Runge-Kutta) Schemes.
We consider the non autonomous initial value problem of the system of
ODEs
w0 (t) = F (t, w(t)), t > 0, w(0) = w0 ,
(1.2)
with given F : R × Rm → Rm and w0 ∈ Rm . The exact solution will be
approximated in the grid points tn = nτ , n = 0, 1, 2, . . . , N , where τ > 0
is the time step. We denote the numerical approximations by w n ≈ w(tn ).
46
Introduction
To keep the presentation of the methods short, we assume the time step τ
to be constant. However, we emphasize that in many applications variable
step sizes should be used to obtain efficient codes and solvers.
Chapter
2
Runge-Kutta Methods
Runge-Kutta Methods belong to the class of one-step methods, e.g. this is
the class of methods that step forward from computed approximations w n
at time tn to new approximations wn+1 at the forward time tn+1 using as
input only wn . To compute the new approximation w n+1 the Runge-Kutta
methods use a number of auxiliary intermediate approximations w ni ≈
w(tn + ci τ ), i = 1, 2, . . . , s, where s is the number of stages. These intermediate approximations serve to obtain a sufficiently level of accuracy
for the approximations wn at the grid points tn . The general form of a
Runge-Kutta method is
wni = wn + τ
wn+1 = wn + τ
s
X
j=1
s
X
αij F (tn + cj τ, wnj ),
bi F (tn + ci τ, wni ).
i = 1, 2, . . . , s,
(2.1)
i=1
P
The coefficients αij and bi define the particular method and ci = sj=1 αij .
Observe that a particular Runge-Kutta method is explicit if α ij = 0 for
all j ≥ i. All other particular methods are implicit. If α ij = 0 for all j > i,
then we will call this method diagonally implicit. Particular Runge-Kutta
methods can be represented in a so-called Butcher-array, e.g.,
c1
..
.
α11
..
.
···
α1s
..
.
cs
αs1
b1
···
···
αss
bs
The order p of a Runge-Kutta method is determined by its coefficients
αij , bi and ci . By making use of Taylor series, one can derive conditions in
48
Runge-Kutta Methods
order to make the method consistent of order p. These conditions are given
in Table 2.1 for p = 1, 2, 3 and 4. Table 2.1 is taken from [13].
order p
order conditions
1
bT e = 1
2
bT ce =
1
2
3
b T c2 =
1
3
4
b T c3 =
1
4
bT Ac2 =
bT Ac =
bT CAc =
1
12
bT A2 c =
1
6
1
8
1
24
Table 2.1: Order conditions of Runge-Kutta methods for p = 1, 2, 3, 4. The
vector b = (b1 , . . . , bs )T , c = (c1 , . . . , cs )T , e = (1, . . . , 1)T , the matrix A =
(αij )ij , C = diag(ci ) and ck = C k e.
Example 2.1. We give some Butcher arrays of simple explicit Runge-Kutta
methods.
0
0 0
1 1
1
1
1
2
2
At the left we have the Butcher array of the familiar Euler Forward method
of order one. At the right we have a method known as the second-order
explicit trapezoidal rule or modified Euler method, e.g.,
1
1
wn+1 = wn + τ F (tn , wn ) + τ F (tn + τ, w − n + τ F (tn , wn )).
2
2
An example of explicit method of order four is given by the Butcher array
0
1
2
1
2
1
1
2
1
2
0
0
1
1
6
1
3
1
3
1
6
This method is better known as the method of Runge-Kutta. Completely
2.1 The Stability Function
49
written out it is
wn1 = wn ,
1
wn2 = wn + τ F (tn , wn1 ),
2
1
1
wn3 = wn + τ F tn + τ, wn2 ,
2
2
1
wn4 = wn + τ F tn + τ, wn3 ,
2
1
1
F (tn , wn1 ) + F (tn + 1/2τ, wn2 )
wn+1 = wn + τ
6
3
1
1
+ F (tn + 1/2τ, wn3 ) + F (tn + τ, wn4 ) .
3
6
Taking ki = τ F (tn +ci τ, wni ), the above method is in the more familiar form
k1 = τ F (tn , wn ),
1
1
k2 = τ F tn + τ, wn + k1 ,
2
2
1
1
k3 = τ F tn + τ, wn + k2 ,
2
2
k4 = τ F (tn + τ, wn + k3 ),
1
wn+1 = wn + (k1 + 2k2 + 2k3 + k4 ) .
6
We will refer to it as the classical fourth order method.
2.1
The Stability Function
Consider the Dahlquist test equation w 0 (t) = λw(t), λ ∈ C. Applying a
Runge-Kutta method (2.1) to the test equation gives the recursion
wn+1 = R(z)wn ,
with R(z) the stability function and z = τ λ. This function can be found to
be
R(z) = 1 + zbT (I − zA)−1 e,
(2.2)
where e = (1, 1, . . . , 1)T . It follows from (2.2) that for explicit R-K methods
the stability function R(z) becomes a polynomial of degree ≤ s. For implicit
R-K methods it becomes a rational function with both denominator and
numerator polynomials of degree ≤ s.
Example 2.2. For every explicit R-K method of order p = s and s ≤ 4 the
stability function is given by the polynomial
1
1
R(z) = 1 + z + z 2 + · · · + z s .
2
s!
50
Runge-Kutta Methods
For s = 3 and 4 the stability regions S = {z ∈ C : |R(z)| ≤ 1} have been
plotted in Figure 2.1.
Third Order Runge−Kutta Method
−4
−3
−2
Fourth Order Runge−Kutta Method
−1
3
3
2
2
1
1
0
1
−4
−3
−2
−1
0
1
−1
−1
−2
−2
−3
−3
Figure 2.1: Stability regions S = {z ∈ C : |R(z)| ≤ 1 for s = 3 (left) and
s = 4 (right).
2.2
Rosenbrock Methods
Rosenbrock methods belong to the family of Runge-Kutta methods. They
are named after Rosenbrock, see [27], who was the first one that proposed
methods of this kind. In literature different forms have been used. Nowadays
Rosenbrock methods are understood to solve an autonomous, stiff ODE system w0 (t) = F (w(t)). We derive the class of Rosenbrock methods from the
diagonally implicit Runge-Kutta methods. Recall that an s-stage diagonally
implicit RK-method, with F = F (w(t)), is given as
wni = wn + τ
s
X
αij F (wnj ),
i = 1, 2, . . . , s,
j=1
wn+1 = wn + τ
s
X
bi F (wni )
(2.3)
i=1
with Butcher array as in Figure 2.2.
Rewriting (2.3) gives


i−1
X
ki = τ F wn +
αij kj + αii ki  ,
j=1
wn+1 = wn +
s
X
i=1
bi ki .
i = 1, . . . , s,
(2.4)
2.2 Rosenbrock Methods
51
c1
..
.
cs
α11
α21
..
.
α22
αs1
b1
···
···
..
.
αss
bs
Figure 2.2: Butcher array of an s-stage diagonally implicit RK-method
P
Now, the idea is not to solve ki from (2.4), but to linearize F wn + i−1
α
k
+
α
k
ij
j
ii
i
j=1
Pi−1
around x = wn + j=1 αij kj :
ki = τ F (x) + τ F 0 (x)αii ki .
(2.5)
This can be interpreted as applying one Newton iteration to (2.4) with
starting value ki = 0. Instead of continuing the Newton iteration we consider
(2.5) as a new class of methods.
Before defining the actual class of Rosenbrock methods, we give some
additional remarks to obtain more efficient methods. A lot of computational
advantage can be obtained by replacing the Jacobians F 0 (x) by the Jacobian
A = F 0 (wn ), in each time step only one Jacobian has to be computed.
Furthermore, we gain more freedom by introducing linear combinations of
the terms Akj into (2.5). The following class of methods can be defined :
Definition 2.3. An s-stage Rosenbrock method is given as
ki = τ F (wn +
i−1
X
αij kj ) + τ A
j=1
wn+1 = wn +
s
X
bi ki
i
X
γij kj ,
j=1
(2.6)
i=1
where A = An is the Jacobian F 0 (w(t)).
Definition 2.3 is taken from [13]. Again, the coefficients b ij , αij and
γij define a particular method and are selected to obtain a desired level of
consistency and stability.
Remark that computing an approximation w n+1 from wn , in each stage
i a linear system of algebraic equations with the matrix I − γ ii τ A has to be
solved. To save computing time for large dimension systems w 0 (t) = F (w(t))
the coefficients γii are taken constant, e.g., γii = γ. Then in every timestep the matrix (I − γii A) is the same. To solve these large systems LU
decomposition or (preconditioned) iterative methods could be used.
Finally we remark that taking A to the zero matrix in (2.6), this leads
to a standard explicit Runge-Kutta method. The final remark, taken from
52
Runge-Kutta Methods
[13], Rosenbrock methods have proven to be successful in many stiff ODE
and PDE problems, especially in the low to moderate accuracy range.
As can be found in [13] the definition of order of consistency is the same
as for Runge-Kutta. Define
βij = αij + γij ,
ci =
i−1
X
αij
and di =
i−1
X
βij .
j=1
j=1
In Table 2.2, taken from [13], the order conditions for s ≤ 4 and p ≤ 3 and
γii = γ = constant can be found. For Rosenbrock methods with constant
order p
order conditions
1
b1 + b2 + b3 + b4 = 1
2
b 1 d2 + b 3 d3 + b 4 d4 =
1
2
−γ
b2 c22 + b3 c23 + b4 d24 =
1
3
b3 β32 d2 + b4 (β42 d2 + β43 d3 ) =
1
6
3
− γ + γ2
Table 2.2: Order conditions of Rosenbrock methods with γ ii = γ for s ≤ 4
and p ≤ 3.
coefficients γii = γ the stability function has the following form,
R(z) =
P (z)
,
(1 − γz)s
where P (z) is a polynomial of degree ≤ s.
Example 2.4. By taking s = 1, b1 = 1, α11 = 1 and γ11 = γ, γ a free
parameter, in (2.6) we defined the one-stage method
wn+1 = wn + k1
k1 = τ F (wn ) + τ Ak1 .
Since β11 = 1 + γ and βij = ci = di = 0 for all i = 1, 2, 3, 4 and j = 2, 3, 4,
we only have a second order method if γ = 2.
The stability function is
R(z) =
Hence, this method is A-stable
1
1 + (1 − γ)z
.
1 − γz
for all γ ≥
1
2
and L-stable
(2.7)
2
for γ = 1.
1
A method is called A-stable if the stability region S = {z ∈ C : |R()z| ≤ 1} contains
the left half plane C− .
2
A method is called L-stable if the method is A-stable and R|∞| = 0.
2.3 Runge Kutta Chebyshev Methods
53
Example 2.5. We consider the 2-stage method
wn+1 = wn + b1 k1 + b2 k2
k1 = τ F (wn ) + γτ Ak1
k2 = τ F (wn + α21 k1 ) + γ21 τ Ak1 + γτ Ak2 ,
(2.8)
with coefficients
b1 = 1 − b 2 ,
α21 =
1
2b2
and
γ21 = −
γ
.
b2
Hence, this method is of order 2 for all γ and as long as b 2 6= 0. The stability
function is
1 + (1 − 2γ)z + (γ 2 − 2γ + 21 )z 2
.
(2.9)
R(z) =
(1 − γz)2
√
The method is A-stable for γ ≥ 41 and L-stable if γ = 1 ± 21 2.
2.3
Runge Kutta Chebyshev Methods
The family of Runge Kutta methods discussed in this section are explicit.
They will avoid solving algebraic systems, but posses extended real stability interval with a length proportional to s 2 , with s the number of stages.
Therefore the Runge Kutta Chebyshev (RKC) methods could be attractive
to solve moderate stiff systems. The principle goal of constructing Runge
Kutta methods is to achieve the highest order consistency with a given number of stages s. The methods discussed in this section use a few stages to
achieve a low order consistency and the additional stages are used to increase
the stability boundary β(s).
Definition 2.6. The stability boundary β(s) is the number β(s) such that
[−β(s), 0] is the largest segment of the negative real axis contained in the
stability region
S = {z ∈ C : |R(z)| ≤ 1} .
To construct the family of RKC methods we start with the explicit
Runge-Kutta methods which have the stability polynomial
R(z) = γ0 + γ1 z + · · · + γs z s .
(2.10)
In order to have first order consistency we take γ 0 = γ1 = 1.3 The following theorem states that every explicit Runge-Kutta method has as optimal
3
This can be verified by considering the test equation y 0 = λy. The local error of the
test equation satisfies
ez − R(z)
= O(τ p ).
(2.11)
τ
th
To achieve p order consistency the coefficients γi has to be chosen in such a way that
(2.11) satisfies for p.
54
Runge-Kutta Methods
stability boundary β(s) = 2s2 , thus the maximum value of β(s) is 2s 2 . The
upper boundary of β(s) is achieved if we take the shifted Chebyshev polynomials of the first kind as stability polynomial. The following theorem is
taken from [13].
Theorem 2.7. For any explicit, consistent Runge-Kutta method we have
β(s) ≤ 2s2 . The optimal stability polynomial is the shifted Chebyshev polynomial of the first kind
z
Ps (z) = Ts 1 + 2 ,
s
(2.12)
where the polynomials Ts (z)4 for z ∈ C are recursively defined as
T0 (z) = 1,
T1 (z) = z,
Tj (z) = 2zTj−1 (z) − Tj−2 (z)
2 ≤ j ≤ s.
(2.13)
If we take (2.12) as stability polynomial, then we achieve the optimum value
for β(s) equal to 2s2 .
Proof. Since Ps (z) = 1 + z + O(z 2 ) these polynomials give first order consistency and belong to the class of stability polynomials (2.10) that can be
generated by explicit Runge Kutta methods. By definition, it follows that
|Ts (x)| ≤ 1 for −1 ≤ x ≤ 1. Therefore, |Ps (x)| ≤ 1 for −2s2 ≤ x ≤ 0. As
known, on the interval [−1, 1], Ts (x) has s − 1 points of tangency with the
lines y = ±1. As a consequence, Ps (x) has also s − 1 tangential points with
these lines.
This property of the shifted Chebyshev polynomials of the first kind
determines it as the unique polynomial with largest stability boundary.
Suppose there exists a second stability polynomial of the class (2.10) with
β(s) ≥ 2ss . Since Ps (x) has s − 1 points of tangency with the lines y = ±1,
this second stability polynomial then intersects P s (x) at least s − 1 times for
x < 0, where intersection points with common tangent are counted double.
The difference polynomial has at least s − 1 negative roots, where roots of
multiplicity 2 are counted double. Since this second polynomial also belongs
to the class (2.10), the difference polynomial is of the form
x2 (γ̃2 + · · · + γ̃s xs−2 ).
This difference polynomial can have at most s − 2 roots on the negative real
axis, thus we have a contradiction.
4
On the real interval [−1, 1] the Chebyshev polynomials can also be defined as Ts (x) =
cos(s arccos x).
2.3 Runge Kutta Chebyshev Methods
2.3.1
55
First Order (Damped) Stability Polynomials and Schemes
In the previous section we proved that a stability polynomial that is generated by a first order explicit R-K method has as optimal stability boundary
2s2 . The class of shifted Chebyshev polynomials of the first kind seemed
to be the class of stability polynomials that achieve the stability boundary
2s2 . If one considers the stability regions of P s (z), then one observes that
there are interior points z ∈ (−β(s), 0) where |P s (z)| = 1. This means that
a small perturbation in imaginary direction near these points might cause
instability. To avoid that kind of situations, the polynomials are slightly
modified, so called damping.
Following [13], adopting the choice made by Guillo and Lago [9] (already
in 1961), the damped form of (2.12) reads
Ps (z) =
where ω0 = 1 +
ε
s2
Ts (ω0 + ω1 z)
,
Ts (ω0 )
(2.14)
and
ω1 =
Ts (ω0 )
.
Ts0 (ω0 )
(2.15)
The polynomials (2.14) satisfy Rs (z) = 1 + z + O(z 2 ) and thus generate a
first order method.5 The stability interval is then determined by the relation
Ts (ω0 + ω1 z) ≤ 1.
|Ps (z)| = Ts (ω0 )
Using Taylor series of Ts (z) near z = 0 we obtain that the stability interval
is determined by the relation
−ω0 ≤ ω0 + ω1 z ≤ ω0 ,
and thus it follows that
β(s) =
2ω0
4
≈ (2 − ε)s2 .
ω1
3
(2.16)
In Figure 2.3 the stability region of the first order shifted Chebyshev polynomial (with damping), for s = 5, is given. The next problem is to find R-K
methods that have stability polynomials like (2.14).
The Approach of Van Der Houwen & Sommeijer
To find explicit R-K methods with stability polynomials (2.14) we use the
idea of van der Houwen and Sommeijer, as we will now explain. Their
elegant idea was to use the scaled and damped Chebyshev polynomials of
5
The conditions for first order are Rs (0) = 1 and Rs0 (0) = 1.
56
Runge-Kutta Methods
First Order Chebyshev Polynomial
10
5
0
−5
−10
−60
−50
−40
−30
−20
−10
0
−10
0
First Order Damped Chebyshev Polynomial
10
5
0
−5
−10
−60
−50
−40
−30
−20
Figure 2.3: Stability region of P5 (z) without damping (upper) and with
damping (lower).
order p as stability polynomials to generate an R-K method of order p.
To derive the internal stages they use the three term recursion (2.13) and
the damped Chebyshev polynomials to define the stability functions of the
stability functions of the internal stages. Furthermore they made the ansatz
that
Rs (z) = as + bs Ts (ω0 + ω1 z).
We construct the first order explicit R-K method with stability polynomial (2.14). Then, ω0 and ω1 are already defined, see (2.15). To have a first
order method Rs (z) has to satisfy Rs (0) = 1 and Rs0 (0) = 1. It follows that
bs =
1
Ts (ω0 )
and as = 1 − bs Ts (ω0 ).
For the internal stages to be of order one, we also put
Rj (z) = aj + bj Tj (ω0 + ω1 z)
with
bj =
aj = 1 − bj Tj (ω0 ),
(2.17)
1
.
Tj (ω0 )
Define R0 (z) = a0 + b0 = 1 and imposing the recursion (2.13) where we use
that Rj (0) = 1 for all 0 ≤ j ≤ s then shows after some calculations
R0 (z) = 1,
R1 (z) = 1 + µ̃1 z,
Rj (z) = (1 − µj − νj ) + µj Rj−1 (z) + νj Rj−2 (z) + µ̃j Rj−1 (z)z + γ̃j z,
2.3 Runge Kutta Chebyshev Methods
57
for j = 2, . . . , s and where
µ̃1 = b1 ω1 ,
bj
2bj ω0
, νj = −
,
bj−1
bj−2
2bj ω1
µ̃j =
, γ̃j = −aj−1 µ̃j .
bj−1
µj =
(2.18)
From the above relation the RKC integration formula for the nonlinear problem w0 (t) = F (t, w(t)) can de derived by associating R j (z) with the intermediate approximation wnj and the occurrence of z with a function evaluation :
wn0 = wn ,
wn1 = wn + µ̃1 τ F (tn + c0 τ, wn0 ),
wnj
= (1 − µj − νj )wn + µj wn,j−1 + νj wn,j−2 +
+µ̃1 τ F (tn + cj−1 τ, wn,j−1 ) + γ̃j τ F (tn + c0 τ, wn0 ),
wn+1 = wns .
(2.19)
The above scheme obviously belongs to the class of explicit R-K methods.
The only coefficient that remains to be defined are c j for 1 ≤ j < s. Observe
that Rj (z) = ecj z + O(z 2 ) with
j2
Ts (ω0 ) Tj0 (ω0 )
≈ 2
cj = 0
Ts (ω0 ) Ts (ω0 )
s
cs = 1.
Remark 2.8. Comparing the s-stage RKC method with a straightforward
explicit method like Euler Forward we can conclude the following. The stability boundary of the s-stage RKC method equals β RKC = 2s2 , while for the
Euler forward method the stability boundary is β EF = 2. See Figures 2.3
and 2.4. To have stability for both methods, the time-step τ RKC for the RKC
method can be s2 as large as the time step of the Euler forward method, i.e.,
τRKC = s2 τEF .
The number of function evaluations for Euler forward is one per time
step. For the s-stage RKC method is the number of function evaluations per
time step equal to s.
We conclude that to integrate one time step with the s-stage RKC method
we need s function evaluations. To integrate the same time step with Euler
forward we need to integrate s2 smaller time steps τEF , because of the relation
τRKC = s2 τEF . Per time step Euler forward needs one function evaluation
per time step τEF . Integrating with RKC pays of with a factor s less function
evaluations.
2.3.2
Second Order Schemes
If we are looking for second order explicit R-K methods with s internal
stages, then we first have to find analytical expressions for second order
58
Runge-Kutta Methods
Euler Forward Stability Region
1
−3
−2
−1
00
1
−1
Figure 2.4: Stability region of Euler Forward
stability polynomials
1
R(z) = 1 + z + z 2 + γ3 z 3 + · · · + γs z s .
2
The free coefficients γi , i = 3, . . . , s have to be chosen such that β(s) is as
large as possible. A suitable approximate polynomial in analytical form was
given by Bakker [1] :
1
1
3z
1
2
Ts 1 + 2
,
(2.20)
−
Bs (z) = + 2 +
3 3s
3 3s2
s −1
with stability boundary β(s) ≈ 32 (s2 − 1). This polynomial generates about
80% of the optimal interval. The damped version of (2.20) reads ([12])
Bs (z) = 1 +
Ts00 (ω0 )
(Ts (ω0 + ω1 z) − Ts (ω0 ))
(Ts0 (ω0 ))2
with
ω0 = 1 +
ε
s2
ω1 =
(2.21)
Ts0 (ω0 )
.
Ts00 (ω0 )
The stability boundary is in the damped case equal to
2
2
β(s) = (s2 − 1)(1 − ε).
3
15
In Figure 2.5 the stability regions S = {|B 5 (z)| ≤ 1 : z ∈ C} are given in the
undamped as well as the damped case.
Using the method of van der Houwen and Sommeijer to find a second
order explicit R-K method which has stability polynomial (2.21) gives the
following method. Again we choose all the internal stages to have second
order consistency, thus
1
Rj (z) = 1 + bj ω1 Tj0 (ω0 )z + bj ω12 Tj0 (ω0 )z 2 + O(z 3 ),
2
2.4 Some Remarks on Runge-Kutta Methods
59
Second Order Chebyshev Polynomial
5
0
−5
−20
−18
−16
−14
−12
−10
−8
−6
−4
−2
0
2
−2
0
2
Second Order Damped Chebyshev Polynomial
5
0
−5
−20
−18
−16
−14
−12
−10
−8
−6
−4
Figure 2.5: Stability region of B5 (z) without damping (upper) and with
damping (lower).
has to match with
1
Rj (z) = 1 + cj z + (cj z)2 + O(z 3 ).
2
Elementary calculations yields the following method :
wn0 = wn ,
wn1 = wn + µ̃1 τ F (tn + c0 τ, wn0 ),
wnj
= (1 − µj − νj )wn + µj wn,j−1 + νj wn,j−2 +
(2.22)
+µ̃1 τ F (tn + cj−1 τ, wn,j−1 ) + γ̃j τ F (tn + c0 τ, wn0 ),
wn+1 = wns ,
with coefficients
ω0 = 1 +
Tj00 (ω0 )
,
bj = 0
(Tj (ω0 ))2
2.4
ε
,
s2
ω1 =
Ts0 (ω0 )
,
Ts00 (ω0 )
(2.23)
Ts0 (ω0 ) Tj00 (ω0 )
j2 − 1
cj = 00
≈ 2
,
0
Ts (ω0 ) Tj (ω0 )
s −1
µ̃1 = b1 ω1 ,
µj =
µ̃j =
2bj ω1
,
bj−1
2bj ω0
,
bj−1
νj = −
bj
bj−2
,
γ̃j = −aj−1 µ̃j .
(2.24)
(2.25)
(2.26)
Some Remarks on Runge-Kutta Methods
This part of Chapter 2 is reserved to mention some extra properties of
Runge-Kutta methods. The properties mainly apply to Implicit RK-methods.
60
Runge-Kutta Methods
2.4.1
Properties of Implicit Runge-Kutta methods
A-stability
As observed, some Runge-Kutta methods have a stability region equal to
the left half plane C− . For stiff problems such stability regions are favorable.
Definition 2.9 (Dahlquist 1963). A method with stability region S such
that,
S ⊃ C− = {z | Re z ≤ 0},
is called A-stable.
Implicit Runge-Kutta methods have stability polynomials of the form
R(z) =
P (z)
,
Q(z)
with deg(P (z)), deg(Q(z)) ≤ s Recall that s is the number of stages. The
following observation is easily observed by using the Minimum Modulus Theorem. An implicit Runge-Kutta method is A-stable if and only if
|R(iy)| ≤ 1
for all real y,
and
R(z) analitic
for all Re z ≤ 0.
Construction of Implicit Runge-Kutta methods
In this part we give basic conditions to derive classes of fully implicit RungeKutta methods having good stability properties. The construction of these
methods relies on the following assumptions, taken from [11] :
B(p) :
C(η) :
D(ζ) :
q−1
= 1q
i=1 bi ci
Ps
cq
q−1
= qi
j=1 αij cj
Ps
b
q−1
αij = qj (1 − cqj )
i=1 bi ci
Ps
q = 1, . . . , p,
i = 1, . . . , s,
q = 1, . . . , η,
j = 1, . . . , s,
q = 1, . . . , ζ.
Condition B(p) means that the quadrature formula (b i , ci ) is of order p.
The following fundamental theorem derived by Butcher gives the relation between the conditions stated above and the order of the Runge-Kutta method.
Theorem 2.10 (Butcher 1964). If the coefficients b i , ci , αij of a RungeKutta method satisfy B(p), C(η) and D(ζ) with p ≤ η + ζ + 1 and p ≤ 2η + 2,
then the method is of order p.
2.4 Some Remarks on Runge-Kutta Methods
61
Gauss Methods
Gauss methods are collocation methods 6 based on the Gauss-Legendre quadrature formulas, i.e., c1 , . . . , cs are the zeros of the shifted Legendre polynomial
of degree s
ds
(xs (x − 1)s ) .
dxs
The weights b1 , . . . , bs are chosen such that B(s) is satisfied. The following
theorem is taken from [11].
Theorem 2.11. The s-stage Gauss method is of order 2s. Its stability
function is the (s, s)- Padé approximation 7 and the method is A-stable.
Examples of Gauss methods are given in Figure 2.6.
1
2
1
2
1
1
2
1
2
−
+
√
3
√6
3
6
1
4√
3
1
+
4
6
1
2
1
4
−
1
4
1
2
√
3
6
Figure 2.6: Butcher tableaus of Gauss methods of order 2 (left) and 4 (right)
Radau Methods
Based on the Radau and Lobatto quadrature formulas other Runge-Kutta
methods can be constructed. Taking c 1 , . . . , cs as zeros of respectively
ds−1
xs (x − 1)s−1 ,
s−1
dx
ds−1
xs−1 (x − 1)s , and
s−1
dx
ds−2
xs−1 (x − 1)s−1 ,
s−2
dx
(2.27)
(2.28)
(2.29)
we call the method Radau left (2.27), Radau right (2.28) and Lobatto (2.29).
For all three methods the weights b1 , . . . , bs are chosen such that B(s) is
satisfied.
The s-stage Radau IA method is defined by the Radau left method. The
coefficients αij , i, j = 1, . . . , s are defined by condition D(s). Since all c j are
distinct and the bi 6= 0 this is uniquely possible. Figure 2.7 represents two
examples of this method.
The s-stage Radau IIA method is defined by the Radau right method and
the coefficients αij , i, j = 1, . . . , s are obtained by condition C(s). Theorem
II.7.7 of [10] implies that this results in the collocation method based on the
zeros of (2.28). Examples are given in Figure 2.8. The following theorem is
6
7
See Appendix A
See Appendix B
62
Runge-Kutta Methods
0
1
1
0
2
3
1
4
1
4
1
4
− 41
5
12
3
4
Figure 2.7: Butcher tableaus of Radau IA methods of order 1 (left) and 3
(right)
1
1
1
1
3
1
5
12
3
4
3
4
1
− 12
1
4
1
4
Figure 2.8: Butcher tableaus of Radau IIA methods of order 1 (left) and 3
(right)
taken from [11] and stated without proof.
Theorem 2.12. The s-stage Radau IA method and the s stage Radau IIA
method are of order 2s − 1. Their stability function is the (s − 1, s) (subdiagonal) Padé approximation. Also, both methods are A-stable.
For the Lobatto IIIA methods the coefficients α ij are defined by C(s)
and therefore it is a collocation method. Lobatto IIIB D(s) is used to define
the αij . Finally, for the Lobatto IIIC methods we put
αi1 = b1
for
i = 1, . . . , s,
and determine the remaining αij by C(s − 1). The following theorem, taken
without proof from [11], gives the order and stability properties of Lobatto
methods.
Theorem 2.13. The s-stage Lobatto IIIA, IIIB and IIIC methods are of
order 2s − 2. The stability function for the Lobatto IIIA and IIIB methods
is the diagonal (s − 1, s − 1)-Padé approximation. The Lobatto IIIC method
has as stability function the (s − 2, s)-Padé approximation. Finally, all the
Lobatto methods are A-stable.
In Table 2.3 we give a summary of these statements. The implicit methods Radau IA and IIA are widely used to solve stiff problems due to their
favorable stability properties and simplicity.
2.4.2
Diagonally Implicit Runge-Kutta methods
Fully implicit Runge-Kutta methods like Radau IA and IIA are often used
in stiff chemistry problems for their superior stability properties. A clearly
disadvantage of these methods is that the Runge-Kutta matrix A is full.
This means that the system of algebraic equations for w ni must be solved
simultaneously. In the situation that the number of stages s increases, this
2.4 Some Remarks on Runge-Kutta Methods
method
Gauss
63
assumptions
order
stability function
B(2s)
C(s)
D(s)
2s
(s, s)-Padé
Radau IA
B(2s − 1)
C(s − 1)
D(s)
2s-1
(s − 1, s)-Padé
Radau IIA
B(2s − 1)
C(s)
D(s − 1)
2s-1
(s − 1, s)-Padé
Lobatto IIIA
B(2s − 2)
C(s)
D(s − 2)
2s-2
(s − 1, s − 1)-Padé
Lobatto IIIB
B(2s − 2)
C(s − 2)
D(s)
2s-2
(s − 1, s − 1)-Padé
Lobatto IIIA
B(2s − 2)
C(s − 1)
D(s − 1)
2s-2
(s − 2, s)-Padé
Table 2.3: Implicit Runge-Kutta methods
can become very costly. To avoid these situations, one can take for A a
lower triangular matrix, see Figure 2.9. Then, the s approximations w ni can
be solved sub-sequently for i = 1, . . . , s. Such methods we call Diagonally
implicit Runge-Kutta methods, or shorter DIRK.
c1
c2
..
.
α11
α21
..
.
α22
cs
αs1
b1
αs2
b2
..
.
···
···
αss
bs
Figure 2.9: Lower Triangular matrix A
To solve each approximation wni , in general a Newton-type iteration with
a coefficient matrix I − τ aii F 0 will be needed. By taking the diagonal entries
of A equal, one may hope to use repeatedly the stored LU factorization of
I − τ aii F 0 . DIRK methods with this additional property are called Singly
Diagonally Implicit RK methods (SDIRK). The general structure of the
Runge-Kutta matrix belonging to the SDIRK is given in Figure 2.10. The
c1
c2
..
.
γ
α21
..
.
γ
cs
αs1
b1
αs2
b2
..
.
···
···
γ
bs
Figure 2.10: Lower Triangular matrix A
special structure of the SDIRK schemes makes it possible to simplify the
64
Runge-Kutta Methods
order conditions, see Table 2.4, taken from [11].
order
1
2
3
4
previous conditions
P
bj = 1
P
bj αjk = 12
P
bj αjk αjl = 13
P
bj αjk αjl = 16
P
bj αjk αjl αjm = 41
P
bj αjk αjl αjm = 81
P
1
bj αjk αjl αjm = 12
P
1
bj αjk αjl αjm = 24
simplified conditions
P
bj = 1
P
bj αjk = 12 − γ
P
bj αjk αjl = 13 − γ + γ 2
P
bj αjk αjl = 16 − γ + γ 2
P
bj αjk αjl αjm = 14 − γ + 23 γ 2 − γ 3
P
bj αjk αjl αjm = 18 − 56 γ + 23 γ 2 − γ 3
P
1
bj αjk αjl αjm = 12
− 32 γ + 23 γ 2 − γ 3
P
1
bj αjk αjl αjm = 24
− 21 γ + 23 γ 2 − γ 3
Table 2.4: Order conditions for SDIRK methods (simplified conditions) and
the ‘original’ conditions of RK methods (previous conditions)
The stability function of DIRK methods have the form
R(z) =
P (z)
,
(1 − α11 z)(1 − α22 z) · · · (1 − αss z)
(2.30)
where the numerator P (z) is of degree ≤ s. In the case of an SDIRK scheme
(2.30) changes into
P (z)
.
R(z) =
(1 − γz)s
For more information with respect to stability properties and order conditions of specific (S)DIRK methods we refer to [11].
The following theorem, also taken from [11], gives order barriers for
(S)DIRK methods with respect to the vector b.
Theorem 2.14.
6,
1. A DIRK method with all b i positive has order at most
2. A SDIRK method with all bi positive has order at most 4.
2.4.3
The Order Reduction Phenomenon for RK-methods
To study the accuracy of RK-methods applied to stiff equations Prothero
and Robinson [25] proposed to consider the following problem
y0
= λ(y − ϕ(x)) + ϕ0 (x)
,
(2.31)
y(x0 ) = ϕ(x0 )
2.4 Some Remarks on Runge-Kutta Methods
65
with λ ∈ C and Re λ ≤ 0. Note that the exact solution of (2.31) is given by
y(t) = ϕ(t). Applying a general RK-method to (2.31) yields
wni = wn + h
s
X
j=1
wn+1 = wn + h
s
X
j=1
αij (λ((wni − ϕ(x0 + cj h))) + ϕ0 (x0 + cj h)),
bj (λ((wni − ϕ(x0 + cj h))) + ϕ0 (x0 + cj h)). (2.32)
We assume that for this RK-method the conditions B(p) and C(q) hold.
Then, by replacing wni , wn and wn+1 by the exact solutions
wni
ϕ(x0 + ci h),
wn
ϕ(x0 ),
wn+1
ϕ(x0 + h),
and using Taylor expansions of ϕ(x0 + cj h), ϕ(x0 + h) and ϕ0 (x0 + cj h) near
x0 gives
ϕ(x0 + ci h) = ϕ(x0 ) + h
ϕ(x0 + h) = ϕ(x0 ) + h
s
X
j=1
s
X
αij ϕ0 (x0 + cj h) + ∆i,h (x0 ),
bj ϕ0 (x0 + cj h) + ∆0,h (x0 ),
(2.33)
j=1
where
∆0,h (x0 ) = O(hp+1 )
and ∆i,h (x0 ) = O(hq+1 ).
(2.34)
Eliminating internal stages and subtracting (2.33) from (2.32) yields for
n=0
w1 − ϕ(x0 + h) = R(z)(w0 − ϕ(x0 )) + δh (x0 ),
(2.35)
with R(z) the stability function (R(z) = 1 + zb T (I − zA)−1 e). Remark that
δh (x0 ) is the local error when w0 = ϕ(x0 ) in (2.35). It is given by
δh (x) = −zbT (I − zA)−1 ∆h (x) − ∆0,h (x),
(2.36)
where ∆h (x) = (∆1,h (x), . . . , ∆s,h (x))T . Replacing x0 by xn and so on, one
obtains instead of (2.35) the general recursion
wn+1 − ϕ(xn+1 ) = R(z)(wn − ϕ(xn )) + δh (xn ).
(2.37)
Applying the above recursion n times gives the following general expression
for the global error
wn+1 − ϕ(xn+1 ) = R(z)n+1 (w0 − ϕ(x0 )) +
n
X
j=0
R(z)n−j δh (xj ).
(2.38)
66
Runge-Kutta Methods
Remark 2.15. In the non-stiff theory we have z = O(h). Observe that in
the non-stiff case the global error (2.38) behaves like O(h p ).
In the stiff case we would like to have a time-step that is larger than
with |λ| 1. Therefore, the global error (2.38) is studied under the
assumption that simultaneously h → 0 and z = hλ → ∞. In Table 2.5
the results for this analysis are given for the s-stage Gauss methods, s-stage
Radau methods and s-stage Labotto methods of Section 2.4.1. Comparing
Table 2.5 with Table 2.3 we observe that for several methods we have order
reduction.
1
|λ| ,
Method

 s odd
Gauss
 s even
Radau IA
Radau IIA
Labotto IIIA
Labotto IIIB
Labotto IIIC

 s odd
 s even

 s odd
 s even
local error
O(hs+1 )
O(hs )
z −1 O(hs+1 )
z −1 O(hs+1 )
zO(hs+1 )
z −1 O(hs )
global error

 O(hs+1 )
 O(hs )
O(hs )
z −1 O(hs+1 )

 z −1 O(hs )
 z −1 O(hs+1 )

 zO(hs )
 zO(hs+1 )
z −1 O(hs )
Table 2.5: Error for (2.31) in the stiff case under the assumption that simultaneously h → 0 and z = hλ → ∞.
For the Gauss methods we verify the results obtained in Table 2.5. Since
the Runge-Kutta matrix A is invertible, the vector-matrix product −zb T (I−
zA)−1 is equal to
T
−zb (I − zA)
−1
T
=b A
−1
1
+O
.
z
Observing that for Gauss methods the condition C(η) holds for η = s yields
that q = s. Also recall that the construction of Gauss methods implies that
B(p) is satisfied for p = s. It follows that the local error δ h (x) (2.36) is of
order O(hs+1 ).
Denote en = yn − ϕ(xn ). We obtain from the recursion (2.37) and the
2.4 Some Remarks on Runge-Kutta Methods
67
fact that |R(z)| ≤ 1 for all z with Re z ≤ 0 that
|en+1 | ≤ |R(z)||en | + |δh (xn )|
= |R(z)||en | + Chs+1
≤ |en | + Chs+1
..
.
≤ |e0 | + nChs+1
= |e0 | + Chs = O(hs ).
In the last step we used that nh = O(1).
Since the stability function of a Gauss method is the (s, s) Padé approximation, we have for odd s R(∞) = −1. For odd s the global error estimate
can be improved using the partial summation
n
X
n
κn−j δ(xj ) =
X 1 − κn+1−j
1 − κn+1
δ(x0 ) +
(δ(xj ) − δ(xj−1 )).
1−κ
1−κ
j=1
j=0
The fact that (δ(xj ) − δ(xj−1 )) = O(hq+2 ) and the above partial summation
gives, when substituted in (2.38), the desired result. For the verification of
the results of the other methods we refer to [11].
2.4.4
Order Reduction for Rosenbrock Methods
Applying Rosenbrock methods (2.6) to the Prothero & Robinson test equation (2.31) gives by straightforward calculation that the global error ε n =
yn − y(xn ) satisfies
εn+1 = R(z)εn + δh (xn ),
(2.39)
where R(z) is the stability function. The local error δ h (xn ) in (2.39) is given
by
δh (x) = ϕ(x) − ϕ(x + h) + bT (I − zB)−1 ∆,
(2.40)
with B an s × s matrix with entries Bi,j = (αij + γij )ij . In (2.40) vector
b has as ith entry bi , i = 1, . . . , s. Finally, the vector ∆ is the s × 1 vector
with entries
∆i = z ϕ(x) − ϕ(x + αi h) − γi hϕ0 (x) + hϕ0 (x + αi h) + γi h2 ϕ00 (x),
for i = 1, . . . , s. The following result on order reduction for Rosenbrock
methods can be proven.
Theorem 2.16. The local error δh (x) of a Rosenbrock method applied to
the test equation of Prothero & Robinson (2.31) satisfies for h → 0 and
z = hλ → ∞


2
2
X
h
h
2
00
3
bi ωij αj − 1 ϕ (x) + O(h ) + O
δh (x) = 
,
(2.41)
2
z
i,j
68
Runge-Kutta Methods
where ωij are the entries of B−1 .
Proof. Using the Neumann series expansion
(I − E)−1 = I + E + E2 + · · ·
we obtain for
(I − zB)
−1
−1
B−1
=
− I (zB)
z
B−1
B−1 −1
−
(I −
) =
z
z
!
−1 2
B
B−1
B−1
+ ··· =
+
I+
−
z
z
z
−1 2
B−1
1
B
−
+O
−
.
z
z
z3
=
=
=
=
The product bT (I − zB)−1 ∆ can then be written as
T
b (I − zB)
−1
1
1
= − bT B−1 − 2 bT B−2 + O
z
z
1
z3
.
(2.42)
The first term on the right-hand side of (2.42) is with (B −1 )ij = ωij and
using Taylor expansions for ϕ(x + h) equal to
1X
(αj h)2 00
0
0
3
−
bi ωij z −αj hϕ (x) −
ϕ (x) − γj hϕ (x) + O(h ) +
z
2
i,j
+hϕ0 (x) + αj h2 ϕ00 (x) + γj h2 ϕ00 (x) + O(h3 ) .
Rearranging gives
X
0
bi ωij (αj + γj )hϕ (x) +
i,j
X
i,j
α2j h2 00
bi ωij
ϕ (x) + O(h3 ) +
2
1X
bi ωij hϕ0 (x) + αj h2 ϕ00 (x) + γj h2 ϕ00 (x) + O(h3 ) .
−
z
i,j
Pj−1
P
Recall that (B)ij = αij + γij , αj = k=1
αjk and γj = jk=1 γjk . Using
these properties one can derive
X X
bi
ωij (αj + γj ) = 1.8
i
j
8
Since Bij = αij + γij it follows with the definition of αj and γj that BI = [α1 +
T
γ
1 , . . . , αn + γn ] , where I is the vector consisting of ones only. Thus, the summation
P P
T
−1
BI = bT I = 1.
j ωij (αj + γj ) is equal to b B
i bi
2.4 Some Remarks on Runge-Kutta Methods
69
It follows that
X
i
−
bi
X
ωij (αj + γj )hϕ0 (x) = hϕ0 (x),
and,
j
h2 ϕ00 (x)
1X X
ωij (αj + γj )h2 ϕ00 (x) = −
bi
.
z
z
i
j
Thus, bT (I − zB)−1 ∆ is equal to
0
hϕ (x) +
X
i,j
α2j h2 00
1X
bi ωij hϕ0 (x) + O
bi ωij
ϕ (x) + O(h3 ) −
2
z
i,j
h2
z
. (2.43)
The second term on the right-hand side of (2.42), equal to
1 T B−2
b
,
z2
z2
can be rewritten, almost in the same way as for the first term on the righthand side of (2.42), as
2
h
1X
0
bi ωij hϕ (x) + O
.
(2.44)
z
z2
i,j
The third term on the right-hand side of (2.42) equals O z12 . Adding
up (2.43),
(2.44) and the last term on the right-hand side of (2.42) equal to
O z12 , and letting z → ∞ gives
−
bT (I − zB)−1 ∆ =
2
X
α2j h2 00
h
0
3
hϕ (x) +
bi ωij
ϕ (x) + O(h ) + O
.
2
z
i,j
Applying Taylor expansion of ϕ(x + h) near x in (2.40) gives the desired
result.
Remark 2.17. In the stiff case a Rosenbrock is only third order when
X
bi ωij α2j = 1,
i,j
is satisfied. Since none of the Rosenbrock methods defined in Section 2.2
satisfies this extra constraint, their order of convergence is only two for the
Prothero & Robinson test equation.
To satisfy this extra constraint, a convenient way is to require
αsi + γsi = bi ,
i = 1, . . . , s,
and
αs = 1.
(2.45)
This extra requirement implies even
δh (x) = O
h2
z
.
In that case the method yields asymptotically exact results for z → ∞.
70
Runge-Kutta Methods
Chapter
3
Time Splitting
In general it is inefficient (or even infeasible) to apply the same integration
formula to different parts of the advection diffusion reaction system in higher
dimensions,
ut + ∇ · (au) = ∇ · (D∇u) + f (u).
If, for example, the discretization of the advection (and diffusion) results
in a stiff system, then it calls for an implicit method to solve that part of
the equation. If on the other hand the reaction terms are not stiff, then
explicit methods are often more suitable for that part of the equation. Or,
it could also be the other way around. If we have a stiff reaction term, then
we would like to use implicit methods. If we used for spatial discretization
limiters, then explicit methods are often more suitable.
Solving the system using a simple implicit integration rule results into a
large system of nonlinear algebraic equations. Due to simultaneous coupling
over species and space the nonlinear system becomes too large to handle.
In such cases we want to have an appropriate form of splitting. The general
idea behind splitting is to split a complicated system into smaller parts,
which can be solved efficiently with suitable integration formulas, for the
sake of time stepping.
This chapter is organized as follows. We start with the explanation of the
concept of (first and second order) time splitting. The chapter is concluded
with remarks on the treatment of boundary conditions and stiffness.
3.1
Operator Splitting
The technique in this section is called Operator Splitting or Time Splitting.
In this section we threat on first order splitting. We focus more on concepts
rather than on actual methods.
72
3.1.1
Time Splitting
First Order Splitting of Linear ODE Problems
Consider the linear homogeneous ODE system
w0 (t) = Aw(t),
t>0
and
w(0) = w0 .
(3.1)
This system can be seen as a semi-discretization of a linear PDE. Assume
that we have for A a two term splitting, e.g.,
A = A 1 + A2 .
The exact solution of (3.1) is given by
w(tn+1 ) = eτ A w(tn ),
where τ = tn+1 − tn . Since A has a splitting one can use it to approximate
a solution of (3.1). The exact solution of (3.1) is then approximated by
wn+1 = eτ A2 eτ A1 wn ,
(3.2)
where wn is an approximation of w(tn ) and τ = tn+1 − tn . We have found
the simplest splitting, in which the two subproblems
d ∗
dt w (t)
d ∗∗
dt w (t)
= A1 w∗ (t) for tn < t ≤ tn+1 with w∗ (tn ) =
wn ,
∗∗
∗∗
∗
= A2 w (t) for tn < t ≤ tn+1 with w (tn ) = w (tn+1 ),
have to be solved one after each other starting from w n . To complete the
integration after one step take wn+1 = w∗∗ (tn+1 ).
The error introduced by splitting, called the splitting error, can be found
by inserting (3.2) into the exact solution. This gives
w(tn+1 ) = eτ A2 eτ A1 w(tn ) + τ ρn ,
with ρn the local truncation error. The fact that τ ρ n is the error introduced
per step starting from the exact solution it is the local splitting error. Since
1
eτ A = I + τ (A1 + A2 ) + τ 2 (A1 + A2 )2 + O(τ 3 ),
2
1
τ A2 τ A1
e
e
= I + τ (A1 + A2 ) + τ 2 (A21 + 2A1 A2 + A22 ) + O(τ 3 ),
2
ρn satisfies
1
1 τA
ρn =
e − eτ A2 eτ A1 w(tn ) = τ [A1 , A2 ]w(tn ) + O(τ 2 ).
τ
2
The commutator of A1 and A2 is defined as
[A1 , A2 ] = A1 A2 − A2 A1 .
It is obvious that the splitting defined in (3.2) has order one, unless A 1 and
A2 commute.
As can be found in [13], splitting appears to be stable provided that the
sub steps themselves are stable.
3.1 Operator Splitting
3.1.2
73
Nonlinear ODE problems
For general nonlinear ODE problems
w0 (t) = F (t, w(t)),
t>0
and w(0) = w0 .
(3.3)
with a two term splitting
F (t, w) = F1 (t, w) + F2 (t, w),
the first order linear splitting (3.2) is
d ∗
dt w (t)
d ∗∗
dt w (t)
= F1 (t, w∗ (t)) for tn < t ≤ tn+1 with w∗ (tn ) =
wn ,
= F2 (t, w∗∗ (t)) for tn < t ≤ tn+1 with w∗∗ (tn ) = w∗ (tn+1 ),
giving wn+1 = w∗∗ (tn+1 ) as the next approximation.
The nonlinear counterpart of the splitting error is again of order one. As
seen in [13], this error can be derived by Taylor expansions of w ∗ (tn+1 ) and
w∗∗ (tn+1 ) around t = tn . One then obtains that the local splitting error ρ n
is equal to
∂F2
1 ∂F1
F2 −
F1 (tn , w(tn )) + O(τ 2 ).
ρn = τ
2
∂w
∂w
If the bracketed term equals zero the splitting error is of order two.
3.1.3
Second and Higher Order Splitting.
In this section we treat second and higher order splitting methods for linear
and nonlinear problems. To get a second order splitting error the idea of
symmetry in splitting is proposed by Strang.
Linear ODE Problems
Consider the linear ODE system (3.1) and assume for A the two-term splitting A = A1 + A2 . The exact solution of (3.1) can be approximated by
(3.2). Interchanging the order of A 1 and A2 after each half time step will
lead to symmetry and to better accuracy. The solution of (3.1) is then
approximated by
1
1
1
1
1
1
wn+1 = e 2 τ A1 e 2 τ A2 e 2 τ A2 e 2 τ A1 wn = e 2 τ A1 eτ A2 e 2 τ A1 wn . (3.4)
The above splitting is proposed by Strang [33] and is therefore called Strang
Splitting.
The splitting error of Strang splitting has a formal consistency of order
two. By series expansion the local truncation error, or splitting error, is
found as
ρn =
1 2
τ ([A1 , [A1 , A2 ]] + 2[A2 , [A1 , A2 ]]) w(tn+ 1 ) + O(τ 4 ).
2
24
(3.5)
74
Time Splitting
Another second order symmetrical splitting of Strang is
1 τ A1 τ A2
e
e
+ e τ A2 eτ A1 wn ,
wn+1 =
2
with splitting error
ρn = −
(3.6)
1 2
τ ([A1 , [A1 , A2 ]] + [A2 , [A2 , A1 ]]) w( tn ) + O(τ 3 ).
12
The splitting (3.6) is more expensive than (3.4) 1 , but has as advantage that
the factors eτ A1 eτ A2 and eτ A2 eτ A1 can be computed parallel.
In the case of a multi-component splitting of A, e.g.,
A = A 1 + A2 + A3 ,
the symmetrical Strang splitting method (3.4) is just a repeated application
of itself. Hence, for the three-term splitting we obtain
1
1
1
1
wn+1 = e 2 τ A1 e 2 τ A2 eτ A3 e 2 τ A2 e 2 τ A1 wn ,
(3.7)
which has also a second order splitting error. In the same way the alternative
Strang splitting (3.6) can be generalized.
Nonlinear ODE Problems
The extension of the Strang splitting (3.4) is straightforward,
d ∗
w (t) = F1 (t, w∗ (t)) for tn < t ≤ tn+ 1
2
dt
with
w∗ (tn ) = wn ,
d ∗∗
w (t) = F2 (t, w∗∗ (t)) for tn < t ≤ tn+1
dt
with
w∗∗ (tn ) = w∗ (tn+ 1 ),
2
d ∗∗∗
w (t) = F1 (t, w∗∗∗ (t)) for tn+ 1 < t ≤ tn+1
2
dt
∗∗
with
w∗∗∗ (tn+ 1 ) = wn+1
,
2
w∗∗∗ (t
giving wn+1 =
n+1 ) as the next approximation. Again we will have
formal consistency of order two. This result can be obtained using Taylor
expansion under the condition that the ODE is autonomous. Referring to
[13] we state that for autonomous problems we also have a second order
splitting error.
1
This follows by writing (3.6) out as
„ «n “
”
“
”
1
e τ A 1 e τ A 2 + e τ A 2 e τ A 1 · · · e τ A 1 e τ A 2 + e τ A 2 e τ A 1 w0 .
wn =
2
3.2 Boundary Values, Stiff Terms and Steady State
75
Higher-Order Splittings
Examples of higher-order splittings are the fourth order splitting
wn+1 =
1
4
1
(S 1 τ )2 − Sτ
3 2
3
wn ,
(3.8)
1
where Sτ = e 2 τ A1 eτ A2 e 2 τ A1 is the second order Strang splitting operator.
Since the above splitting (3.8) has negative a negative weight, it is not known
for what kind of problems this scheme will be stable or not.
Another fourth order splitting scheme derived by Yoshida [39] and
Suzuki [34] reads
wn+1 = Sθτ S(1−2θ)τ Sθτ wn ,
(3.9)
√
with θ = (2− 3 2)−1 ≈ 1.35. Observe that 2−θ < 0, which means that a time
step with negative time has to be taken. As mentioned in [13], reversing
the time for diffusion or (stiff) reaction terms this will lead to ill-posedness.
Higher order splittings can be used for conservative problems where boundary conditions are not relevant, such as the Schrödinger equation, see [13].
3.2
3.2.1
Boundary Values, Stiff Terms and Steady State
Boundary Values
For PDE problems, where the boundary conditions are important, difficulties with splitting may occur. The boundary conditions for the PDE problem have a physical interpretation, while boundary conditions for the sub
steps are missing. These sub steps have sometimes a physical interpretation.
Therefore one may have to reconstruct boundary conditions for the specific
splitting under consideration.
General analysis of boundary conditions for splitting methods is, at
present, still lacking. The rule of thumb we will use is that the treatment of
the boundaries should coincide as much as possible with the scheme in the
interior of the domain.
3.2.2
Stiff Terms
Assume we have the ODE problem w 0 (t) = Aw(t), where A has a two-term
splitting A = A1 + A2 . Assume also that kA1 k is bounded and A2 has
an eigenvalue equal to − 1 , 1 > 0. Verwer et. al. [35, 32] showed
that the first order splitting with A as above, see Section 3.1, will retain its
first order accuracy. In [35, 32] also is shown that the second order Strang
splitting will in general give only an order one accuracy.
76
3.2.3
Time Splitting
Steady State
In the case we have to solve an advection-diffusion-reaction problem with
stiff reaction terms, the following behavior usually occurs. First there will
be a short time interval where concentrations rapidly change, the so called
transient phase. After the transient phase, there will be a steady state, e.g.
the reactions between the species are in balance.
If one solves an advection-diffusion-reaction system with time or operator
splitting, then the steady states are not returned exactly. This can easily
seen in the following linear ODE case. Suppose the advection-diffusionreaction system has been discretized in space. We then have the semi discrete
system w0 (t) = Aw(t), which we solve with time splitting, thus A = A 1 +A2 .
In the steady state the concentrations are in balance, thus w 0 (t) = 0. This
means that in the steady state yields A 1 + A2 = 0. Since the two term
splitting introduces an error of at least order one, this error will also be
returned in the steady state. Hence, this is in general the case for all splitting
methods.
Other time integration methods like Runge-Kutta methods, Runge-KuttaChebyshev methods, etc. return steady states exactly.
Chapter
4
IMEX Methods
Many advection-diffusion-reaction equations have a natural splitting into a
non-stiff or moderate stiff part and a stiff part. To solve the non-stiff or
moderate part explicit solvers are suitable. To solve the stiff part one would
use an implicit method. To split the non-stiff and stiff part time splitting is
an option, but has disadvantages like splitting errors, boundary conditions,
etc. Another disadvantage could be that the subproblem cannot be solved
by a multi-step method, i.e. it needs information from previous time steps.
In this part we consider IMEX methods, which are methods that are
a suitable mix of implicit and explicit methods. For instance, there exist
IMEX multi-step and IMEX Runge-Kutta methods. The main idea of an
IMEX method will be sketched in the next section, where we treat the
IMEX- θ method.
4.1
IMEX- θ Method
Suppose we have the nonlinear system or semi-discretization
w0 (t) = F (t, w(t)),
where F (t, w(t)) has the natural splitting
F (t, w(t)) = F0 (t, w(t)) + F1 (t, w(t)),
with F0 is a non-stiff and F1 stiff. In advection-diffusion-reaction systems
the non-stiff term is for instance advection and the stiff terms the discretized
diffusion and reactions. The non-stiff term is suitable for explicit time integration while the stiff terms are more suitable for implicit integration
methods.
An example of a method that combines explicit as well as implicit treatment of respectively the non-stiff term F 0 (t, w(t)) and stiff term F1 (t, w(t))
78
IMEX Methods
is the following one :
wn+1 = wn + τ (F0 (tn , wn ) + (1 − θ)F1 (tn , wn ) + θF1 (tn+1 , wn+1 )) , (4.1)
where the parameter θ ≥ 21 . This method is a combination of the Euler Forward method, which is explicit, and the implicit θ -method. Methods that
are mixtures of IM plicit and EX plicit methods are called IMEX methods.
The method given in (4.1) is called the IMEX- θ Method.
Truncation error
The truncation error of the IMEX- θ method can be derived by inserting
the exact solution of w 0 (t) = F (t, w(t)) into (4.1). We then obtain
w(tn+1 ) = w(tn ) + τ (F0 (tn , w(tn )) + (1 − θ)F1 (tn , wn )+
θF1 (tn+1 , w(tn+1 ))) + τ pn ,
(4.2)
where pn is the truncation error. Using Taylor series of w(t n+1 ), F1 (tn+1 , w(tn+1 ))
and F0 (tn+1 , w(tn+1 )) near tn , we obtain
pn =
1
− θ τ w00 (tn ) + θτ F00 (tn , w(tn )) + O(τ 2 ).
2
Stability
Consider the test equation
w0 (t) = λ0 w(t) + λ1 w(t),
and let zj = τ λj for j = 0, 1. Applying the IMEX- θ method to this test
equation gives
1 + z0 + (1 − θ)z1
R(z0 , z1 ) =
.
(4.3)
1 − θz1
Stability of the IMEX- θ method thus requires
|R(z0 , z1 )| ≤ 1.
(4.4)
To analyze the stability region (4.4) we have two starting points :
1. Assume the implicit part of the method to be stable, in fact A-stable,
and investigate the stability region of the explicit part,
2. Assume the explicit part of the method to be stable and investigate the
stability region of the implicit part.
4.1 IMEX- θ Method
79
Starting with the first point, we assume the implicit part of the IMEXθ method to be A-stable. Define the set
D0 = {z0 ∈ C : the IMEX scheme is stable ∀z1 ∈ C− },
where C− is the set
C− = {z ∈ C : Re z ≤ 0}.
The question is whether the set D0 is smaller, larger or equally shaped in
comparison with the stability region of Euler Forward. Using the maximum
modulus theorem and the assumption that the implicit part of the IMEX- θ
method gives that z1 can be replaced by z1 = i · r, r ∈ R. Substituting (4.3)
into (4.4) gives after some elementary calculations that z 0 = x0 + iy0 ∈ D0
if and only if for all r ∈ R yields
(2θ − 1)r 2 + 2(θ − 1)y0 r − (2x0 + x20 + y02 ) ≥ 0.
In the above inequality the left-hand side is a cubic function of r. For θ > 21
this function is larger or equal than zero when the discriminant is less or
equal than zero. We then obtain that z 0 = x0 + iy0 ∈ D0 iff
θ 2 y02 + (2θ − 1)(1 + x0 )2 ≤ 2θ − 1.
For θ = 21 the stability region of the IMEX method reduces to the line
segment [−2, 0], while for θ = 1 the stability region of Euler Forward is
obtained. See Figure 4.1.
1.5
1.5
1
1
0.5
0.5
0
0
−0.5
−0.5
−1
−1
−1.5
−3
−1.5
−2.5
−2
−1.5
−1
−0.5
0
0.5
1
−3
−2.5
−2
−1.5
−1
−0.5
0
0.5
1
Figure 4.1: Boundary of regions D0 (left) and D1 for θ = 0.5 (circles), 0.51
(-·-), 0.6 (- -) and 1 (solid)
The alternative is to assume that the explicit part of the IMEX method is
stable, which means that z0 is an element of the set S0 = {z0 : |1 + z0 | ≤ 1}.
The question now is, what is the set
D1 = {z1 ∈ C : the IMEX scheme is stable ∀z0 ∈ S0 }.
Some elementary calculations yield that z 1 ∈ D1 iff
1 + |(1 − θ)z1 | ≤ |1 − θz1 |,
80
IMEX Methods
is smaller, larger or equally shaped in comparison with the stability region
of Euler Backward. The boundary of the stability regions are for several
values of θ plotted in Figure 4.1. Hence, for θ = 12 we obtain as stability
region for the IMEX method the negative real axis. For θ = 1 we obtain as
stability region the stability region of Euler Backward.
We remark that for θ = 1 the IMEX method has favorable stability
properties. It could be seen as a form of time splitting where we first solve the
explicit part with Euler Forward and the implicit part with Euler Backward.
However, using this IMEX method we do not have errors as consequence of
• intermediate results that are inconsistent with the full equation,
• intermediate boundary conditions to solve these intermediate results.
If one uses this IMEX- θ method with θ = 12 , then one has to pay a little
more attention to stability. If, for instance, one has a system with complex
eigenvalues, then the IMEX method will not be stable for θ = 21 .
Steady State
If we are in steady state, which means that F (w) = 0, F is independent of
t, then the splitting of F (t, w(t)) becomes in steady state F (w) = F 0 (w) +
F1 (w), where F , F0 and F1 are independent of t.
If this steady state w is a stationary point of the IMEX-θ scheme, then
this means that the IMEX-θ scheme returns steady states exactly. Inserting
the steady state w into (4.1) gives
w = w + τ (F0 (w) + (1 − θ)F1 (w) + θF1 (w)) = w.
The conclusion is that steady states are returned exactly by the IMEX-θ
scheme.
4.2
Concluding Remarks
The general idea of IMEX methods is explained in Section 4.1 using the
IMEX- θ method. This idea can also be used in Multi-step methods, RungeKutta methods and so on. The IMEX- θ method is also the simplest example
of an IMEX R-K method. In the next example we give a more sophisticated
example of an IMEX R-K method.
Example 4.1. We combine the explicit and implicit trapezoidal rule in the
following way :
1
1
∗
∗
),
= wn + τ F0 (tn , wn ) + τ F1 (tn , wn ) + τ F1 (tn+1 , wn+1
wn+1
2
2
1
1
∗
wn+1 = wn + τ F (tn , wn ) + τ F (tn+1 , wn+1
).
2
2
4.2 Concluding Remarks
81
This method has second order consistency, which can be verified by simple
calculations. Applying this IMEX trapezoidal rule to the test equation y 0 =
(λ0 + λ1 )y we get
R(z0 , z1 ) = 1 +
1 + 12 z0
(z0 + z1 ),
1 − 12 z1
as stability function. If we take the limit for z 1 → −∞1 , then we have as
stability criterion
|R(z)| = |1 + z0 | ≤ 1.
Hence, this IMEX trapezoidal rule is not suitable for advection-diffusionreaction problems if advection is treated explicitly and discretized with higher
order differences. For more detailed information we refer to [13].
Another example of an IMEX R-K method is the subject of the next
chapter, titled IMEX Runge-Kutta Chebyshev methods. Since this class of
IMEX R-K methods recently is developed for solving advection-diffusionreaction equations, see [36], we devote the next chapter to it.
1
The implicit part is unconditionally stable.
82
IMEX Methods
Chapter
5
IMEX Runge-Kutta-Chebyshev
Methods
In this chapter the IMEX extension of Runge-Kutta-Chebyshev methods are
introduced. The IMEX extension proposed in [36, 37] is an RKC method
that also deals with highly stiff (reaction) terms. The highly stiff terms are
treated implicitly and the moderate stiff terms, for instance coming from
diffusion, are treated in an explicit way.
As starting point we have the Runge Kutta Chebyshev methods defined
in Section 2.3. Recall that the general form of a RKC method is
W0 = w n ,
W1 = wn + µ̃1 τ F (tn + c0 τ, W0 ),
Wj = (1 − µj − νj )wn + µj Wj−1 + νj Wj−2 +
+µ̃1 τ F (tn + cj−1 τ, Wj−1 ) + γ̃j τ F (tn + c0 τ, W0 ),
wn+1 = Wns .
(5.1)
Suppose we have a linear system
w0 (t) = F (t, w(t)),
(5.2)
where F (t, w(t)) can be split as
F (t, w(t)) = FE (t, w(t)) + FI (t, w(t)).
The term FI (t, w(t)) is supposed to be too stiff to be treated efficiently
by a Runge-Kutta-Chebyshev method (5.1). The term F E (t, w(t)) is the
moderate stiff term which can still be treated explicitly by (5.1).
84
IMEX Runge-Kutta-Chebyshev Methods
5.1
Construction of the IMEX scheme
The first stage of (5.1) becomes in the IMEX-RKC scheme
W1 = wn + µ̃1 τ FE (tn + c0 τ, W0 ) + µ̃1 τ FI (tn + c1 τ, W1 ),
(5.3)
with µ̃1 = b1 ω1 . Note that the highly stiff term is treated implicitly. The
other s − 1 subsequent stages of (5.1) will be modified in a similar way.
The stability function of the first stage of (5.1) is equal to
R1 (z0 , z1 ) =
1 + b 1 ω1 z0
,
1 − b 1 ω1 z1
(5.4)
when (5.3) is applied to the test equation w 0 (t) = λ0 w(t) + λ1 w(t) with
zj = λj τ . Following [36], we impose
b1 =
so that
R1 (z0 , z1 ) =
1
,
ω0
1 + b1 ωω01 z0
1−
ω1
ω0 z 1
.
(5.5)
Observe that the choice for b1 is different than in Section 2.3. This choice
for b1 enables the following ansatz
Ansatz 5.1. All stability functions R j (z0 , z1 ) from the internal stages j,
j = 1, . . . , s, of the IMEX-RKC scheme are taken of the form
Rj (z0 , z1 ) = aj + bj Tj
The coefficients b1 =
1
ω0 ,
ω0 + ω 1 z0
1 − ωω10 z1
!
,
aj = 1 − bj Tj (ω0 ).
(5.6)
b0 = b2 and bj for j ≥ 2 is taken as
bj =
Tj00 (ω0 )
.
(Tj0 (ω0 ))2
These stability functions of the IMEX-RKC scheme are an extension of
the stability functions of the RKC scheme. Taking z 1 = 0 in (5.6) we obtain
the internal stability functions (2.17).
The construction of the IMEX scheme is based on the following. Rewrite
(5.6) into
−aj
Rj (z0 , z1 )
ω0 + ω 1 z0
Tj (x) =
+
, x=
,
bj
bj
1 − ωω10 z1
5.2 Stability
85
and apply the recursion (2.13). Inserting x gives
(1 −
ω1
)Rj
ω0
bj
ω1
)+2
Rj−1 · (ω0 + ω1 z0 )
ω0
bj−1
bj
bj
ω1
−2
aj−1 (ω0 + ω1 z0 ) +
aj−2 (1 − z1 )
bj−1
bj−1
ω0
ω1
bj
Rj−2 · (1 −
z1 ).
−
bj−2
ω0
= aj (1 −
Identifying Rj with Wj and Rj z1 with τ FI (tn +cj τ, Wj ) and so on we obtain
Wj − µ̃1 τ FI,j
= (aj − µj aj−1 − νj aj−2 )W0 + µj Wj−1 + νj Wj−2 +
µ̃j τ FE,j−1 + γ̃j τ FE,0 − νj µ̃1 τ FI,j−2 − µ̃1 (aj − νj aj−2 )τ FI,0 ,
where FI,j = FI (tn + cj τ, Wj ) and so on. The coefficients µ̃j , γ̃j , . . . are
defined as in Section 2.3. Using aj − µj aj−1 − νj aj−2 = 1 − νj − µj we get
W0 = w n ,
W1 = wn + µ̃1 τ FE,0 + µ̃1 τ FI,1 ,
Wj = (1 − µj − νj )wn + µj Wj−1 + νj Wj−2 +
(5.7)
+µ̃j τ FE,j−1 + γ̃j τ FE,0 ,
[γ̃j − (1 − µj − νj )µ̃1 ]τ FI,0 − νj µ̃1 τ FI,j−2 + µ̃1 FI,j
wn+1 = Ws .
Remark 5.2. As already remarked before, the scheme (5.7) is an IMEX
extension of (5.1) in a natural way.
Using easy calculations there can be obtained that also this IMEX extension of RKC returns the steady state exactly.
Remark 5.3. In each stage the following system of nonlinear algebraic equations has to be solved
Wj − µ̃1 τ FI (tn + cj τ, Wj ) = Vj ,
with Vj known and Wj unknown. In the case that the implicit part of
F (t, w(t)) consists of the ‘stiff’ reaction only, then F I has no underlying
spatial grid connectivity.
5.2
Stability
Consider the test-equation
w0 (t) = λE w(t) + λI w(t),
and assume that both eigenvalues are real and non-positive. For many
practical cases this imposes no restriction.
86
IMEX Runge-Kutta-Chebyshev Methods
Applying the IMEX-RKC method on the above test-equation gives the
stability function
!
ω0 + ω 1 z0
,
Rs (z0 , z1 ) = as + bs Ts
1 − ωω10 z1
with z0 = λE τ and z1 = λI τ . For stability we require
|Rs (z0 , z1 )| ≤ 1.
Since we assumed that λI ∈ [−∞, 0] it follows that z1 ∈ [−∞, 0]. Therefore
ω + ω z 0
1 0
(5.8)
≤ |ω0 + ω1 z0 | .
1 − ωω10 z1 From (5.8) it follows easily that
|Rs (z0 , z1 )| ≤ 1,
as long as z0 ∈ [−β(s), 0]. The IMEX extension of the RKC scheme is
unconditionally stable for the implicit part and the stability condition for
the explicit part remains unchanged.
5.3
Consistency
The local truncation error of the RKC scheme is of order two. The IMEX
extension of the second order RKC scheme has the following local error,
ρn =
τ
Sn + O(τ 2 ),
s2 − 1
with s the number of stages and Sn = FI0 (w(tn ))F (w(tn )) (FI0 denotes the
Jacobian of FI ).
If the number of stages s is large, then the influence of the extra term
above does not harm. In the case the number of stages is relatively small,
there will be order reduction. This means that in that case we have only
order one consistency.
5.4
Final Remarks
In actual computations this IMEX extension of the RKC scheme is used
in the same manner as the RKC scheme. The only difference is that in
each stage a nonlinear algebraic system has to be solved. When solving this
nonlinear system is cheap, as e.g. with stiff chemical reactions giving rise to
small sized systems decoupled over the grid, the IMEX-RKC scheme (5.7)
is more efficient than the fully explicit form (5.1).
5.4 Final Remarks
87
Advection-diffusion-reaction problems that are advection dominated or
problems where advection is discretized with (higher order) upwind schemes
give rise to (more) complex eigenvalues of the Jacobians of the right-handside of (5.2). In that particular case one would like to increase the stability
region in complex direction. By increasing the damping parameter ε the
strip along the real line becomes wider. See Figure 5.1. A disadvantage is
Second Order Damped Chebyshev Polynomial
Second Order Chebyshev Polynomial
5
5
0
0
−5
−20
−18
−16
−14
−12
−10
−8
−6
−4
−2
0
2
−5
−20
−18
−16
−14
−12
−10
−8
−6
−4
−2
0
2
Figure 5.1: Stability regions for the second order polynomial P 5 with damp2
) (left) and large (ε = 10)
ing parameter ε small (ε = 13
that the stability boundary β(s) will become smaller. The result is that for
the moderate stiff terms like discretized diffusion, a smaller time-step must
be taken to maintain stability.
88
IMEX Runge-Kutta-Chebyshev Methods
Chapter
6
Linear Multi-Step Methods
One-step methods like Runge-Kutta methods compute w n+1 using only the
previous approximation wn . Linear multi-step methods on the other hand,
use also additional previous approximations w n−i , i = 0, . . . , k, with a fixed
integer k. This integer k is the total number of previous approximations
wn−i used to compute the next approximation w n+1 . Next, we give the
definition of an k stage linear multi-step method.
Definition 6.1. The linear k-step method is defined by the formula
k
X
j=0
αj wn+j = τ
k
X
βj F (tn+j , wn+j ),
n = 0, 1, . . . ,
(6.1)
j=0
which uses the k past values wn , . . . , wn+k−1 to compute wn+k . Remark that
the most advanced level is tn+k instead of tn+1 .
The method is explicit when βk = 0 and implicit otherwise. Furthermore,
we will assume that αk > 0.
The relation with the class of Runge-Kutta methods is described in the
following remark.
Remark 6.2. Taking k = 1 in Definition 6.1 gives a one-step method. In
this case of linear one-step methods there is some overlap with Runge-Kutta
methods. For instance, the θ method,
wn+1 = wn + τ ((1 − θ)F (tn , wn ) + θF (tn+1 , wn+1 )) ,
belongs to the class of Runge-Kutta methods. Its Butcher array is given
in Table 6.1. Obviously, the θ method also belongs to the class of linear
multi-step methods.
An advantage of linear multi-step methods with respect to Runge-Kutta
methods is the following. In every new approximation w n+k of explicit linear multi-step methods only one function evaluation is needed against s
90
Linear Multi-Step Methods
0
1
0
0
1−θ
0
1
θ
Table 6.1: Butcher array of the θ method
for explicit s stage Runge-Kutta methods. We remark that explicit linear
multi-step methods use more memory to store the previous (k − 2) function
evaluations.
A disadvantage of linear multi-step methods is that the first k−1 approximations can not be computed with the linear k-step scheme. To compute
the first (k − 1) approximations, one could use a
1. Runge-Kutta scheme,
2. use for the first step a linear 1-step method, for the second approximation a linear 2-step method, . . . and for the (k − 1) st approximation
a linear (k − 1)-step scheme.
6.1
The Order Conditions
A linear multi-step method is called consistent of order p if the local error
satisfies
y(tn+k ) − w n+k = O(τ p+1 ),
where y(tn+k ) is the exact solution and w n+k the approximation obtained
by inserting the past k exact values in (6.1).
To find order conditions for order p consistency we will do the following.
First we introduce the linear difference operator L, associated with (6.1), as
L(y, t, τ ) =
k
X
i=0
αi y(t + iτ ) − τ βi y 0 (t + iτ ) .
Here y(t) is some differentiable function defined on an interval that contains
tn , . . . , tn+k .
Lemma 6.3. Consider y′(t) = F(t, y) with F(t, y) continuously differentiable, and let y(t) be its exact solution. For the local error one has

    y(t_{n+k}) − w_{n+k} = ( α_k I − τ β_k (∂F/∂y)(t_{n+k}, η) )^{−1} L(y, t_n, τ).    (6.2)

In the case that F(t, y) is a scalar function, η is some value between y(t_{n+k}) and w_{n+k}. If F(t, y) is a vector valued function, (∂F/∂y)(t_{n+k}, η) is the Jacobian of F(t, y), whose rows are evaluated at possibly different values η_j lying on the segment between (y(t_{n+k}))_j and (w_{n+k})_j.
Proof. By definition, the ‘numerical’ solution w_{n+k} is obtained from the equation

    Σ_{j=0}^{k−1} ( α_j y(t_{n+j}) − τ β_j F(t_{n+j}, y(t_{n+j})) ) + α_k w_{n+k} − τ β_k F(t_{n+k}, w_{n+k}) = 0.

Inserting the operator L(y, t_n, τ) gives

    L(y, t_n, τ) = α_k ( y(t_{n+k}) − w_{n+k} ) − τ β_k ( F(t_{n+k}, y(t_{n+k})) − F(t_{n+k}, w_{n+k}) ).    (6.3)

Statement (6.2) in Lemma 6.3 follows from the mean value theorem applied to the right-hand side of (6.3).
Following [10], a multi-step method (6.1) is said to be of order p if one of the following conditions is satisfied:
1. for all sufficiently regular functions y(t) we have L(y, t, τ) = O(τ^{p+1}),
2. the local error of (6.1) is O(τ^{p+1}) for all sufficiently regular differential equations y′(t) = F(t, y(t)).
Next, observe that Lemma 6.3 implies that the above conditions 1. and
2. are equivalent. The following theorem gives the conditions for coefficients
αi and βi , i = 0, . . . , k, to get a multi-step scheme of order p.
Theorem 6.4. The multi-step method (6.1) is of order p if and only if

    Σ_{i=0}^{k} α_i = 0   and   Σ_{i=0}^{k} α_i i^j = j Σ_{i=0}^{k} β_i i^{j−1}   for j = 1, 2, . . . , p.    (6.4)
Proof. The method is of order p when L(y, t, τ) = O(τ^{p+1}). Since

    L(y, t, τ) = Σ_{i=0}^{k} ( α_i y(t + iτ) − τ β_i y′(t + iτ) ),

Taylor expansion of y(t + iτ) and y′(t + iτ) around t gives the conditions (6.4) by setting L(y, t, τ) = O(τ^{p+1}).
Example 6.5. Backward Differentiation Formulas, or BDF methods for short, are implicit and defined by

    β_k = 1   and   β_j = 0   for j = 0, . . . , k − 1,

with the α_j chosen such that the order is optimal, namely k. The 1-step BDF method is Backward Euler. The 2-step method is

    (3/2) w_{n+2} − 2 w_{n+1} + (1/2) w_n = τ F(t_{n+2}, w_{n+2}),    (6.5)
and the three step BDF is given by

    (11/6) w_{n+3} − 3 w_{n+2} + (3/2) w_{n+1} − (1/3) w_n = τ F(t_{n+3}, w_{n+3}).    (6.6)

In chemistry applications the BDF methods are among the most widely used methods to solve stiff chemical reaction equations, due to their favorable stability properties. However, for k > 6 the BDF methods are unstable, see [10, Theorem 3.4, page 329].
See also [13]. For more background on the stability properties of linear multi-step methods, we refer to the next section.
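To make the BDF formulas concrete, the following minimal Python sketch applies the 2-step method (6.5) to a scalar stiff test problem, starting with one Backward Euler step. The test problem y′ = λ(y − sin t) + cos t (exact solution sin t), the value of λ and the step count are our own illustrative choices, not taken from this report.

```python
import numpy as np

# Minimal sketch: BDF-2, formula (6.5), for the stiff scalar problem
# y'(t) = lam*(y - sin(t)) + cos(t), y(0) = 0, exact solution y(t) = sin(t).
# The problem is linear in y, so each implicit relation is solved in closed form.
lam = -1.0e4                       # stiff coefficient (illustrative choice)
T, N = 1.0, 100                    # end time and number of steps
tau = T / N
t = np.linspace(0.0, T, N + 1)
w = np.zeros(N + 1)                # w[n] approximates y(t[n])

# one Backward Euler step: w1 = w0 + tau*F(t1, w1), solved for w1
w[1] = (w[0] + tau * (-lam * np.sin(t[1]) + np.cos(t[1]))) / (1.0 - tau * lam)

# BDF-2 steps: (3/2) w_{n+2} - 2 w_{n+1} + (1/2) w_n = tau * F(t_{n+2}, w_{n+2})
for n in range(N - 1):
    rhs = 2.0 * w[n + 1] - 0.5 * w[n] + tau * (-lam * np.sin(t[n + 2]) + np.cos(t[n + 2]))
    w[n + 2] = rhs / (1.5 - tau * lam)

print("max error:", np.max(np.abs(w - np.sin(t))))
```

Despite τλ = −100, the step size is dictated only by accuracy, which illustrates why BDF methods are attractive for stiff reaction terms.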
6.2 Stability Properties

We study in this part the stability properties of multi-step methods. As usual we will use the familiar test equation y′(t) = λy(t), where λ ∈ C.
First, some properties of linear recursions are needed. Consider the linear recursion

    α_k y_{n+k} + α_{k−1} y_{n+k−1} + · · · + α_0 y_n = 0,    (6.7)

with α_i ∈ C. We define its characteristic polynomial as the polynomial π(X) of degree k,

    π(X) = Σ_{i=0}^{k} α_i X^i.
The following lemma, formulated without proof, gives the general solution of (6.7) in terms of the roots of its characteristic polynomial.

Lemma 6.6. Let χ_1, . . . , χ_l be the roots of π(X) of respective multiplicity m_1, . . . , m_l. If each root of π(X) has multiplicity 1, then l = k. The general solution of (6.7) is given by

    y_n = p_1(n) χ_1^n + · · · + p_l(n) χ_l^n,    (6.8)

where the p_i(n) are polynomials of degree m_i − 1.
From (6.8) it follows that boundedness of the sequence (y_n)_{n≥0} is guaranteed when the roots χ_1, . . . , χ_l lie in the closed unit disc and the roots on the unit circle are simple. These two conditions are called the root condition.
To derive stability properties we will use the above results. The coefficients of the linear multi-step method will be contained in the polynomials

    p(X) = Σ_{i=0}^{k} α_i X^i   and   σ(X) = Σ_{i=0}^{k} β_i X^i.    (6.9)
6.2.1 Zero-Stability

A typical multi-step requirement is the notion of zero-stability. Consider the trivial equation w′(t) = 0. Applying a linear k-step method to this equation gives, by substituting F(t, w(t)) = 0 in (6.1), the linear recursion

    Σ_{i=0}^{k} α_i w_{n+i} = 0.
The above linear recursion has characteristic polynomial p(X).

Definition 6.7. A linear multi-step method (6.1) is said to be zero-stable when its characteristic polynomial p(X) satisfies the root condition, i.e., the roots χ_i, i = 1, . . . , k, with multiple roots repeated, satisfy

    |χ_i| ≤ 1,   and   |χ_i| < 1 if χ_i is not simple.

In other words, Definition 6.7 states that if a linear multi-step method is not zero-stable, then the method fails to solve even the trivial equation w′(t) = 0 in a stable manner. We remark that this requirement holds trivially for consistent one-step methods.
The First Dahlquist Barrier

Next, we make some remarks on the maximal attainable order of linear multi-step methods. Firstly, due to the order conditions (6.4), counting the number of coefficients α_i and β_i shows that the maximal order of an implicit linear k-step method (i.e. β_k ≠ 0) is 2k; for an explicit linear k-step method (i.e. β_k = 0) it is 2k − 1. As mentioned above, a linear multi-step method only makes sense if it is also zero-stable. Zero-stability reduces the maximal attainable order for explicit methods to p = k, and for implicit methods to p = 2⌊(k + 2)/2⌋. This is better known as the first Dahlquist barrier. For a derivation of this result, we refer to [5, 10].
6.2.2 The Stability Region

Applying the linear k-step method (6.1) to the test equation w′(t) = λw(t), with λ ∈ C, gives

    Σ_{j=0}^{k} (α_j − z β_j) w_{n+j} = 0,    (6.10)

for n = 0, 1, . . ., where z = τλ. The linear recursion (6.10) has the characteristic polynomial

    ρ_z(X) = p(X) − zσ(X) = Σ_{j=0}^{k} (α_j − z β_j) X^j.
By definition, the multi-step method is stable if the sequence (w_n)_{n≥0} is bounded. Boundedness of (w_n)_{n≥0} is guaranteed when ρ_z(X) satisfies the root condition¹. Therefore, the stability region is the set S ⊂ C such that

    z ∈ S ⇔ ρ_z(X) satisfies the root condition.

Furthermore, zero-stability is guaranteed if and only if 0 ∈ S.
The characteristic polynomial ρ_z(X) has k roots. As is generally known², only one of these roots approximates the exact solution factor e^{λτ} = e^z up to O(z^{p+1}). This particular root is called the superior root. The other k − 1 roots are called the spurious roots. Spurious roots do not play a role in the accuracy of the method, but they can sometimes cause instability.
Determination of the Stability Region

To determine the stability region of a linear multi-step method we observe that on its boundary a root χ̃ of the characteristic polynomial ρ_z(X) satisfies |χ̃| = 1. Recall that

    z ∈ S ⇔ ρ_z(X) satisfies the root condition.

On the boundary of this region, the corresponding root lies on the unit circle in the χ-plane, i.e. χ = e^{iθ} with 0 ≤ θ ≤ 2π. In the z-plane this boundary ∂S can be found as follows. The parameters z and χ are related to each other by ρ_z(χ) ≡ p(χ) − zσ(χ) = 0. From this relation it follows that the map χ ↦ z is given by

    z = p(χ) / σ(χ).    (6.11)

The boundary ∂S is then obtained by inserting χ = e^{iθ}, with 0 ≤ θ ≤ 2π. Thus, the boundary ∂S of the stability region S is described by

    z = p(e^{iθ}) / σ(e^{iθ}),    0 ≤ θ ≤ 2π.    (6.12)

The curve (6.12) must be considered as a left oriented curve. The stability region S is the part to the left of the curve, as long as it is not empty.
¹ The root condition is, among others, given in Definition 6.7.
² This can be shown by inserting w(t) = e^{λt} and F(t, w) = λw into Σ_{j=0}^{k} α_j w(t_{n+j}) = τ Σ_{j=0}^{k} β_j F(t_{n+j}, w(t_{n+j})) + τ ρ_{n+k−1}, with ρ_{n+k−1} the defect of order p.
A-Stability

To use linear multi-step methods in, for instance, chemical applications, A-stability of the particular method is desirable. Define C̄ = C ∪ {∞}. We say that z = ∞ belongs to the stability region if the polynomial σ(χ) satisfies the root condition. This indeed makes sense, because the roots of ρ_z(χ) tend to the roots of σ(χ) for z → ∞. Therefore, a linear multi-step method is said to be A-stable if

    S ⊃ { z ∈ C̄ : Re z ≤ 0 or z = ∞ }.

Unlike for Runge-Kutta methods, there are not many A-stable linear multi-step methods. It can be derived that an A-stable linear multi-step method can be at most second order consistent. This fundamental result is called the second Dahlquist barrier. At first sight, this second Dahlquist barrier seems to be a disappointing result. However, by introducing the notion of A(α)-stability the problem is (partially) solved. A method is said to be A(α)-stable if

    S ⊃ { z ∈ C̄ : z = 0, ∞ or |arg(−z)| ≤ α }.

For many stiff ODE problems the A-stability property is not needed, since for these problems the eigenvalues with large modulus stay away from the imaginary axis.
Example 6.8. The BDF methods are A-stable for k = 1, 2. For 3 ≤ k ≤ 6 the BDF methods are A(α)-stable, with α depending on k, see Table 6.2.

    k   1    2    3    4    5    6
    α   90°  90°  86°  73°  51°  17°

Table 6.2: Values of α for given k.
In Figure 6.1 the stability regions for the BDF-2 method (6.5) and the
BDF-3 method (6.6) are given. Notice that the BDF-3 method is indeed not
A-stable, as can be seen in Figure 6.1. For the BDF-3 method we show how
the boundary ∂S and the stability region S can be found.
Recall that the three step BDF method is

    (11/6) w_{n+3} − 3 w_{n+2} + (3/2) w_{n+1} − (1/3) w_n = τ F(t_{n+3}, w_{n+3}).

Applying this method to the test equation gives the recursion

    (11/6) w_{n+3} − 3 w_{n+2} + (3/2) w_{n+1} − (1/3) w_n − z w_{n+3} = 0.    (6.13)

The characteristic polynomial corresponding to the recursion (6.13) is then

    ρ_z(X) = p(X) − zσ(X),
[Figure 6.1: Stability regions of the BDF-2 method (left) and the BDF-3 method (right).]
with

    p(X) = (11/6) X³ − 3 X² + (3/2) X − 1/3   and   σ(X) = X³.

The boundary ∂S of the stability region S of BDF-3 is described by

    z = p(e^{iθ}) / σ(e^{iθ}) = ( (11/6) e^{3iθ} − 3 e^{2iθ} + (3/2) e^{iθ} − 1/3 ) / e^{3iθ},    0 ≤ θ ≤ 2π.

One has to consider this as an oriented curve. The stability region is then the part to the left of it.
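As an illustration of the boundary-locus formula (6.12), the sketch below traces the BDF-3 boundary numerically and tests sample points against the root condition. The use of numpy/matplotlib, the helper names and the chosen test points are our own assumptions, not part of the report.

```python
import numpy as np
import matplotlib.pyplot as plt

# Coefficients of p(X) and sigma(X) for BDF-3 (highest degree first, as np.polyval expects)
p_coef = np.array([11.0 / 6.0, -3.0, 3.0 / 2.0, -1.0 / 3.0])
s_coef = np.array([1.0, 0.0, 0.0, 0.0])          # sigma(X) = X^3

# boundary locus z = p(e^{i*theta}) / sigma(e^{i*theta}), cf. (6.12)
theta = np.linspace(0.0, 2.0 * np.pi, 400)
chi = np.exp(1j * theta)
z = np.polyval(p_coef, chi) / np.polyval(s_coef, chi)

def in_stability_region(z0, tol=1e-12):
    """z0 is in S exactly when rho_z(X) = p(X) - z0*sigma(X) satisfies the root condition.
    (The 'simple root on the unit circle' subtlety is ignored here.)"""
    roots = np.roots(p_coef - z0 * s_coef)
    return bool(np.all(np.abs(roots) <= 1.0 + tol))

print(in_stability_region(-1.0 + 0.0j))   # True: the negative real axis lies in S for BDF-3
print(in_stability_region(0.1 + 0.0j))    # False: small positive real z is unstable

plt.plot(z.real, z.imag)
plt.axis("equal"); plt.title("Boundary locus of BDF-3"); plt.show()
```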
Chapter 7
Multi Rate Runge Kutta Methods

In this chapter we consider ODEs

    y′ = f(t, y),    (7.1)

where y ∈ R^n and f : R × R^n → R^n. We assume that the solution y and the right-hand side of (7.1) can be split up into rapidly (Active) and slowly (Latent) varying subsystems. By splitting up

    y(t) = ( y_A(t), y_L(t) )^T,   y_A(t) ∈ R^{n_A}, y_L(t) ∈ R^{n_L} and n_A + n_L = n,    (7.2)

the ODE system (7.1) can then be written as the split system

    y_A′(t) = f_A(y_A(t), y_L(t)),   y_A(t_0) = y_{A,0},
    y_L′(t) = f_L(y_A(t), y_L(t)),   y_L(t_0) = y_{L,0},    (7.3)

which for simplicity, but without loss of generality, is assumed to be autonomous.
Problems like (7.1) with a distinction between active and latent terms / subsystems arise frequently from discretization of PDEs of the advection-diffusion-reaction type, see e.g. [13]. Similar problems arise from applications in electronic circuits, multi-body dynamics and pneumatics.
With respect to splitting up the solution y(t) we have some remarks. The first remark is that the splitting is time dependent, that is, at time t + ∆t the splitting can differ from the splitting at time t. Secondly, it can also vary in space. For example, in a chemical reactor with a heated susceptor, only in the area above the susceptor will there be active components in the gas mixture; in all other parts of the reactor all components will, in general, be slow.
For more information with respect to the first remark we refer to [6]. With respect to the second remark, no literature could be found. However,
recently (February 2004) a research project started at the CWI, Amsterdam, which has as topic the contents of the above remarks. For the sake of completeness, this research project is within the cluster MAS3, theme Nonlinear Dynamics and Complex Systems.
In this chapter we introduce some basics on multi-rate strategies and
define a multi-rate Runge-Kutta approach.
7.1 Multi-Rate Runge-Kutta Methods
In the following we assume that the solution y(t) of the initial value problem
(7.1) is split into active components y A (t), hopefully a small subset of y(t),
and latent components yL (t). This results in a split system (7.3), which is
assumed to be autonomous.
The active components y_A are integrated with a small step-size h, whereas the latent components are integrated with the large step-size H. We remark
that the methods to integrate the subsystems can, but do not have to, be the
same. At the end of each macro-step H a synchronization of the micro-steps
h and macro-steps H is performed.
The following definition of a multi-rate Runge-Kutta method (MRK) for
the numerical solution of (7.3) is taken from [19].
Definition 7.1. Assume we have the split system (7.3) and that the active respectively latent part is integrated with a micro-step h respectively macro-step H. For each macro-step H and micro-step h we have the relation

    h = H / m.    (7.4)

The active components y_A are integrated by: for each λ = 0, 1, 2, . . . , m − 1,

    k_{A,i}^λ = f_A( y_{A,λ} + h Σ_{j=1}^{s} a_{ij} k_{A,j}^λ , Ỹ_{L,i}^λ ),   i = 1, 2, . . . , s,    (7.5)

    y_{A,λ+1} = y_{A,λ} + h Σ_{i=1}^{s} b_i k_{A,i}^λ ≈ y_A(t_0 + (λ + 1)h).    (7.6)

Here, Ỹ_{L,i}^λ ≈ y_L(t_0 + (λ + c_i)h) and c_i = Σ_{j=1}^{s} a_{ij}. More details on Ỹ_{L,i}^λ will follow in the next section.
The latent components y_L are integrated by

    k_{L,i} = f_L( Ỹ_{A,i} , y_{L,0} + H Σ_{j=1}^{s̄} ā_{ij} k_{L,j} ),   i = 1, 2, . . . , s̄,    (7.7)

    y_{L,1} = y_{L,0} + H Σ_{i=1}^{s̄} b̄_i k_{L,i} ≈ y_L(t_0 + H),    (7.8)

where Ỹ_{A,i} ≈ y_A(t_0 + c̄_i H) and c̄_i = Σ_{j=1}^{s̄} ā_{ij}. In the next section we give more detailed information on Ỹ_{A,i}.
As can be seen in the above definition, the coupling between the active and latent subsystem, and vice versa, is performed by the intermediate stage values Ỹ_{A,i} and Ỹ_{L,i}^λ. These intermediate values can be computed by interpolation or extrapolation according to the following strategies:
• Fastest first strategy:
  – Integrate the active components y_A with m steps of step-size h; y_L-values are obtained by extrapolation from the previous step,
  – Integrate the latent components y_L with one macro-step H; y_A-values are obtained by interpolation.
• Slowest first strategy:
  – Integrate the latent components y_L with one macro-step H; y_A-values are obtained by extrapolation from the previous step,
  – Integrate the active components y_A with m steps of step-size h; y_L-values are obtained by interpolation.
The extrapolation / interpolation strategies described above are natural choices, but these approaches inevitably turn the Runge-Kutta method into a two-step method. In [19] an approach for the stage values Ỹ_{A,i} and Ỹ_{L,i}^λ is given such that the multi-rate Runge-Kutta method is a one (macro) step method. This can be achieved, with no extra cost, by using the following formulas for the stage values:

    Ỹ_{L,i}^λ = y_{L,0} + h Σ_{j=1}^{s̄} ( γ_{ij} + η_j(λ) ) k_{L,j},   i = 1, 2, . . . , s,   λ = 0, 1, . . . , m − 1,    (7.9)

    Ỹ_{A,i} = y_{A,0} + H Σ_{j=1}^{s} γ̄_{ij} k_{A,j}^0,   i = 1, 2, . . . , s̄.    (7.10)

This strategy resembles the slowest first strategy in the sense that the y_A-values are obtained by extrapolation from the first micro-step.
Another idea to compute the ỸA,i -values is to use a lower order method.
We refer to [19] for more information behind this idea. We remark that the
method using this idea is called MRKII. The method given by (7.9) is called
MRKI.
7.2 Order Conditions

We remark that we have not yet mentioned any conditions on γ_{ij}, γ̄_{ij} and η_j(λ) to maintain the order of the given MRK methods. We follow reference [19] to give conditions to obtain a method of order two and of order three, respectively.
Suppose we are given two third order RK methods for the integration of the active and latent components of (7.3). These methods are joined into an MRK method by the coupling coefficients γ_{ij}, γ̄_{ij} and η_j(λ), which must be chosen carefully to retain the order of the method.
To ensure that MRKI is of order two, the following conditions have to be satisfied:

    Σ_{j=1}^{s̄} η_j(λ) = λ,   Σ_{j=1}^{s̄} γ_{ij} = c_i,   i = 1, 2, . . . , s,    (7.11)

and

    Σ_{j=1}^{s} γ̄_{ij} = c̄_i,   i = 1, 2, . . . , s̄.    (7.12)

Remark that no extra conditions are needed.
To be of order three the following conditions, in addition to the classical ones, have to be fulfilled:

    Σ_{i=1}^{s} Σ_{j=1}^{s̄} b_i γ_{ij} c̄_j = 1/(6m),   Σ_{i=1}^{s} Σ_{j=1}^{s̄} b_i η_j(λ) c̄_j = λ(λ + 1)/(2m),    (7.13)

and

    Σ_{i=1}^{s̄} Σ_{j=1}^{s} b̄_i γ̄_{ij} c_j = m/6.    (7.14)
Note that what has been said about MRKI applies to explicit as well as
implicit methods, and even a mix of both, like implicit for the latent part
and explicit for the active.
From this point onwards one can construct particular MRKI methods.
In [19] for example, one constructs a third order explicit MRK-method.
7.3 Stability

Following [20], we adopt the test problem

    (ẏ_A, ẏ_L)^T = A (y_A, y_L)^T,   A = [ α_{11}  α_{12} ; α_{21}  α_{22} ] ∈ R^{2×2},    (7.15)

because multi-rate formulas require at least two equations. Assuming that

    α_{11}, α_{22} < 0   and   γ = (α_{12} α_{21}) / (α_{11} α_{22}) < 1,    (7.16)

ensures that both eigenvalues of A have negative real parts. The parameter γ can be viewed as a measure for the coupling between the equations. Also a measure κ for the stiffness of the system is introduced, i.e.,

    κ = α_{22} / α_{11}.    (7.17)
A compound step of the multi-rate Runge-Kutta methods suggested in the previous section, applied to (7.15), can be expressed as

    ( y_{A,m}, y_{L,1} )^T = K ( y_{A,0}, y_{L,0} )^T,    (7.18)

where the matrix K depends on the step-sizes H and h, and on the coefficients of the method. Observe that the method is asymptotically stable if and only if both eigenvalues of K are within the unit disk.
The following scenario will be studied. For both parts, the active and the latent part of y(t), a method has been chosen to integrate the particular subsystem. The step-sizes H and h are chosen such that the uncoupled system (α_{12} = α_{21} = 0) is integrated stably. This means that the stability functions R_A(hα_{11}) and R_L(Hα_{22}) are both of absolute value less than one.
Remark that for the case α_{12} = α_{21} = 0 the eigenvalues of K are equal to R_L and R_A^m. Therefore, it seems convenient to express the matrix K in terms of R_A and R_L, rather than in terms of H and h. Now it is interesting to study the stability region, that is, the values of γ, R_A and R_L for which the eigenvalues of K are in the unit disk, and how this region varies with increasing m and stiffness ratio κ.
To demonstrate this idea, we use the following example taken from [20]. Suppose we have for the active part the Euler Forward scheme and for the latent part the Euler Backward scheme, i.e.,

    y_{A,λ+1} = y_{A,λ} + h α_{11} y_{A,λ},   λ = 0, 1, 2, . . . , m − 1,    (7.19)
    y_{L,1} = y_{L,0} + H α_{22} y_{L,1}.    (7.20)

Remark that for every step-size H the stability function satisfies |R_L| < 1, and that |R_A| < 1 if and only if −2 < α_{11} h < 0. The stability functions R_A and R_L are related through H = mh as

    R_L = 1 / ( 1 + m(1 − R_A)κ ).    (7.21)
Using this and a considerable amount of calculation, carried out in [20], one obtains a set of inequalities that determine the values of γ and R_A for which the eigenvalues of K are in the unit disk. The resulting stability regions for κ = 1 and κ = 100 are shown in Figure 7.1. We can conclude that the stability region becomes smaller with increasing m and increasing κ. In this case the coupling does not have much influence, but there are multi-rate methods where ‘strong’ coupling gives very small stability regions. A modification of the synchronization of the active and latent part gives a stability region as in Figure 7.2, which is taken from [20]. We observe that in the second case coupling and stiffness influence the range of step-sizes h that integrate (7.19) in a stable way.
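The following minimal sketch mimics one compound step (7.18) for this Forward Euler / Backward Euler pair (7.19)–(7.20). The crude coupling choice (freezing y_L during the micro-steps and using the new y_A in the macro-step) and the test matrix are our own illustrative assumptions, not the synchronization analyzed in [20].

```python
import numpy as np

def compound_step(A, yA0, yL0, H, m):
    """One compound multi-rate step for the 2x2 test problem (7.15)."""
    h = H / float(m)
    a11, a12, a21, a22 = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    yA = yA0
    for _ in range(m):                       # m Forward Euler micro-steps, cf. (7.19)
        yA = yA + h * (a11 * yA + a12 * yL0)
    # one Backward Euler macro-step, cf. (7.20), with the new active value in the coupling
    yL = (yL0 + H * a21 * yA) / (1.0 - H * a22)
    return yA, yL

# mildly stiff, weakly coupled example: alpha_11, alpha_22 < 0 and gamma < 1
A = np.array([[-1.0, 0.1],
              [0.1, -100.0]])
yA, yL = 1.0, 1.0
for _ in range(50):
    yA, yL = compound_step(A, yA, yL, H=0.5, m=10)
print(yA, yL)    # both decay towards zero when the chosen step-sizes are stable
```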
[Figure 7.1: Boundaries of the stability regions in the (R_A, γ)-plane for κ = 1 (top) and κ = 100 (bottom), for m = 1 (dashed) and m = 10 (solid). The methods are stable to the right of the boundaries and unstable to the left.]
[Figure 7.2: Boundaries of the stability regions for κ = 100 with the modified synchronization, for m = 1 (dashed) and m = 10 (solid). The methods are stable to the right of the boundaries and unstable to the left.]

Part III
Nonlinear Solvers

Chapter 1
Introduction
Solving a partial differential equation numerically gives, after spatial and time discretization, in general a (huge) system of nonlinear equations. There are several methods to solve the general nonlinear equation

    F(x) = 0,

where x ∈ R^n and F : R^n → R^m, m, n ∈ N. Well known are the fixed point iteration, the Newton iteration, Broyden's method, and so on. In this part we treat the most important ones. First we present the Newton iteration in one and more variables, and also Broyden's method. In Chapter 3 we discuss the Picard iteration.
Chapter 2
Newton’s Method
The contents of this chapter are based on references [15, 23]. We present general properties and facts of Newton's method, which we will also call the Newton iteration. This chapter has the following structure. First we present Newton's method in one and more variables together with some general properties. Then we continue with convergence properties and criteria to terminate the iteration. Furthermore, we pay some attention to modifications of the Newton iteration. Finally we conclude this chapter with possible failures of these nonlinear solvers.
2.1 Newton's Method in One Variable

We briefly discuss the Newton iteration that solves the nonlinear equation

    f(x) = 0,    (2.1)

where f(x) is a real-valued function of the variable x ∈ R. The iteration

    x_{k+1} = x_k − f(x_k) / f′(x_k),   k = 0, 1, 2, . . . ,    (2.2)

is known as the Newton iteration. This method can also be represented graphically, as is done in Figure 2.1. Properties and a derivation of this method will be given for the more general case in the next section.
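A minimal sketch of the scalar Newton iteration (2.2) is given below; the example equation x² − 2 = 0 and the tolerances are illustrative choices, not taken from the report.

```python
# Sketch of the scalar Newton iteration (2.2).
def newton_scalar(f, fprime, x0, tol=1e-12, maxit=50):
    x = x0
    for _ in range(maxit):
        fx = f(x)
        if abs(fx) <= tol:          # terminate on a small residual
            break
        x = x - fx / fprime(x)      # the Newton update (2.2)
    return x

root = newton_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)                         # converges to sqrt(2) ~ 1.4142135623730951
```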
2.2 General Remarks on Newton's Method

In order to solve n-dimensional systems F(x) = 0, with F : R^n → R^n and x ∈ R^n, the Newton iteration (2.2) can be generalized to

    x_{k+1} = x_k − F′(x_k)^{−1} F(x_k).    (2.3)
[Figure 2.1: Illustration of the Newton method.]
The importance of the above method rests on the fact that, under certain conditions on F, the error ‖x_{k+1} − x‖ (remark that x is a solution of F(x) = 0) can be estimated by the inequality

    ‖x_{k+1} − x‖ ≤ c ‖x_k − x‖².    (2.4)

Inequality (2.4) holds for some c ∈ R and a certain norm defined on R^n. The above inequality states that the (k + 1)-st error is proportional to the k-th error squared. We will not give an exact proof of this estimate, but only a sketch; for the exact proof we refer to [23, Chapter 10]. The sketch of the proof of inequality (2.4) starts with a sketch of the derivation of Newton's method.
Assume F : R^n → R^m and that F is Fréchet differentiable¹ at x_k. Then

    0 = F(x) = F(x_k) + F′(x_k)(x − x_k) + R(x − x_k),    (2.6)

where

    lim_{h→0} R(h)/‖h‖ = 0.

For a proof we refer to [23]. If x_k is in a neighborhood of x, then it is natural to neglect the term R(x − x_k). Furthermore, we assume that F′(x_k) is non-singular in a neighborhood of x, which implies that m = n.
The difference x − x_k can then be approximated by the solution of the system

    F′(x_k) h = −F(x_k),   with h = x − x_k.    (2.7)
As a new approximation of x we take

    x_{k+1} = x_k + h = x_k − F′(x_k)^{−1} F(x_k).    (2.8)

¹ The mapping F : D ⊂ R^n → R^m is Fréchet differentiable at x ∈ int D if there is an A ∈ L(R^n, R^m) (the space of linear mappings from R^n to R^m) such that

    lim_{h→0} ‖F(x + h) − F(x) − Ah‖ / ‖h‖ = 0.    (2.5)

The linear operator A is denoted by F′(x) and is called the Fréchet derivative of F at x.

If the second Fréchet derivative is bounded in a neighborhood of x and F′(x) is non-singular, then it can be shown that there exists an α such that

    ‖R(x − x_k)‖ ≤ α ‖x − x_k‖².    (2.9)

For a proof we refer to [23, Theorem 3.3.6]. Subtracting (2.8) from (2.6) gives

    F′(x_k)(x_{k+1} − x) = R(x − x_k).    (2.10)

From the above equality it follows that

    ‖x_{k+1} − x‖ ≤ ‖F′(x_k)^{−1}‖ ‖R(x − x_k)‖.    (2.11)

Using (2.9) we obtain inequality (2.4).
Newton’s method is theoretically attractive, but in practice there can
be some difficulties. In each step of the iteration a linear system needs to
be solved. Especially in practical applications the dimensions of the systems to solve can be up to one million or even larger. Another difficulty is
that in each step not only the n components of F (x k ) have to be evaluated,
but also the n2 entries of its Jacobian. In the case that the partial derivatives of each component have a simple functional form the Jacobian can be
computed exactly. In all other cases it is more desirable to avoid these computations. Almost all modifications of Newton iterations are constructed in
such a way to avoid the explicit computation of (partial) derivatives. An
example of such modification is to use instead of the exact Jacobian an approximate of the Jacobian. One can approximate the Jacobian for example
by finite differences. The price on has to pay for this modification is that the
convergence to a solution will become more slowly, i.e., more iterations are
needed to solve the problem. However, the overall cost of solving F (x) = 0
is usually significant less, because the computation of the Newton step is
less expensive.
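As a sketch of this modification, the following code runs the n-dimensional Newton iteration (2.3) with a forward-difference approximation of the Jacobian; the 2 × 2 example system and the difference step eps are our own illustrative assumptions.

```python
import numpy as np

def fd_jacobian(F, x, eps=1e-7):
    """Forward-difference approximation of the Jacobian of F at x."""
    n = x.size
    J = np.empty((n, n))
    Fx = F(x)
    for j in range(n):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (F(xp) - Fx) / eps      # j-th column: dF/dx_j
    return J

def newton(F, x0, tol=1e-10, maxit=30):
    x = x0.astype(float).copy()
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        x = x + np.linalg.solve(fd_jacobian(F, x), -Fx)   # Newton step (2.3)
    return x

# example: intersect the circle x^2 + y^2 = 4 with the line y = x
F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
print(newton(F, np.array([1.0, 2.0])))    # approximately [sqrt(2), sqrt(2)]
```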
Some last general remarks on Newton's method and iteration methods in general. For each iterative method that solves F(x) = 0, it is required that the method is norm reducing in the sense that

    ‖F(x_{k+1})‖ ≤ ‖F(x_k)‖,   k = 0, 1, 2, . . . ,    (2.12)

holds for some norm defined on R^n. In general the ‘standard Newton iteration’ (2.3) does not necessarily satisfy this requirement. This problem is solved by the simple modification

    x_{k+1} = x_k − ω_k F′(x_k)^{−1} F(x_k),    (2.13)

where the relaxation parameter ω_k is chosen such that (2.12) holds. More on this in Section 2.6.
Another general modification of Newton's method is to re-evaluate F′(x) only occasionally. The iteration scheme then becomes

    x_{k+1} = x_k − F′(x_{p(k)})^{−1} F(x_k),   k = 0, 1, 2, . . . ,    (2.14)

where p(k) is some integer less than or equal to k. For p(k) ≡ k we have the ‘standard Newton iteration’ (2.3) and for p(k) ≡ 0 the simplified Newton method or chord method. A disadvantage of the chord method is that we have linear convergence, i.e., ‖x_{k+1} − x‖ ≤ ζ‖x_k − x‖ for a ζ ∈ (0, 1).
2.3 Convergence Properties

In this section we give a measure of the rate of convergence of an iterative process. An example of an iterative process is, for instance, the method of Newton.

Definition 2.1. Let {x_n}_{n=0}^{∞} be a sequence in R^n that converges to a fixed x ∈ R^n. Further, let ‖·‖ be a norm defined on R^n. If there exist positive constants λ ∈ R and p ∈ [1, ∞) such that

    lim_{n→∞} ‖x_{n+1} − x‖ / ‖x_n − x‖^p = λ,    (2.15)

then we say that x_n converges to x with order p and asymptotic constant λ.
Whenever a process converges with order one, p = 1, we call the convergence q-linear. If a process converges with p = 1 and λ = 0, then we call such a process q-super-linearly convergent. Processes that converge with p = 2 are called q-quadratically convergent. We remark that q-quadratic convergence is a special case of q-super-linear convergence.² In general, every process that converges with order p greater than one is super-linearly convergent.
² Assume that our process converges quadratically. Recall that this also means that lim_{n→∞} ‖x_n − x‖ = 0. Then quadratic convergence implies that for n sufficiently large there exists a K > 0 such that ‖x_{n+1} − x‖ ≤ K‖x_n − x‖². It follows that

    lim_{n→∞} ‖x_{n+1} − x‖ / ‖x_n − x‖ ≤ lim_{n→∞} K ‖x_n − x‖² / ‖x_n − x‖ = 0.

As seen in the previous section, Newton's method converges quadratically. The result is summarized in the next theorem.

Theorem 2.2. Assume that F(x) = 0 has a solution, which we denote by x*. Further, assume that the Jacobian F′(x*) is nonsingular and Lipschitz continuous near x*. If x_0 is sufficiently near x*, then the Newton sequence exists, i.e. F′(x_n) is nonsingular for all n ≥ 0, and the sequence converges to x*. Furthermore, there exists a K > 0 such that

    ‖x_{n+1} − x*‖ ≤ K ‖x_n − x*‖²,    (2.16)

for n sufficiently large.
Usually in practice it is more efficient to approximate the Newton step in some way. One (obvious) way to do this is to approximate the Jacobian F′(x_n) in such a way that computation of the derivative is avoided. The secant method for scalar equations approximates the derivative using finite differences. It uses the two most recent iterates to form the difference quotient, i.e.,

    x_{n+1} = x_n − F(x_n)(x_n − x_{n−1}) / ( F(x_n) − F(x_{n−1}) ),    (2.17)

where x_n is the current iterate and x_{n−1} the previous one. As is generally known, if the secant method converges, then it converges with order ϕ, where ϕ is the golden ratio.³ We remark that the secant method must be initialized with two points; one way to do that is to let x_{−1} = 0.99 x_0 (taken from [15]). Finally, we remark that (2.17) cannot be extended directly to systems of nonlinear equations, because the denominator in the fraction would be a difference of vectors. There are many generalizations of the secant method to n dimensions; we will discuss one of them in Section 2.7.
2.4 Criteria for Termination of the Iteration

In practice we do not like to iterate forever. Therefore, an iterative method should be stopped if the approximate solution is accurate enough. A termination criterion is an essential part of an iterative method. A weak criterion is useless, while a too severe criterion makes the iterative method expensive or, even worse, it might never stop.
While one cannot know the error without knowing the solution, in most cases the norm of F(x_k) can be used as a reliable indicator of the rate of decay of the norm of the error ‖e_n‖ = ‖x − x_n‖ as the iteration progresses. Based on this heuristic, the iteration is terminated when

    ‖F(x_k)‖ ≤ τ_r ‖F(x_0)‖ + τ_a,    (2.18)

where τ_r is the relative error tolerance and τ_a the absolute error tolerance. Both tolerances are important. For instance, it is a poor idea to put τ_a = 0: in the case that the initial iterate is near the solution x, (2.18) is then impossible to satisfy.
³ The golden ratio ϕ is the positive root of the equation ϕ² − ϕ − 1 = 0, meaning that ϕ = (1 + √5)/2. The golden ratio also satisfies the recurrence relation ϕ^n = ϕ^{n−1} + ϕ^{n−2}; taking n = 1 gives the special case ϕ = 1 + ϕ^{−1}.

2.5 Inexact Newton Methods
In the Newton iteration there are two difficulties. The first one is to compute the Jacobian at the current iterate x_n. Secondly, computation of the solution of the equation

    F′(x_n) s = −F(x_n),    (2.19)

can be very hard, or even impossible. Instead of solving for the Newton step exactly, one could also approximate it using some iterative method. More on iterative methods can be found in Part IV.
Assume that the n-th iterate x_n is known and that the Jacobian F′(x_n) as well as F(x_n) are available. Solving (2.19) with an iterative method gives, for an initial start-vector s^0, a sequence of vectors {s^k}_{k≥0} that converges to the solution s of (2.19). After k iterations the residual r^k equals

    r^k = F′(x_n) s^k + F(x_n).    (2.20)

A good criterion to stop the iterative process that approximates the solution of (2.19) is

    ‖F′(x_n) s^k + F(x_n)‖ / ‖F(x_n)‖ ≤ η,   η ∈ R.    (2.21)

For more on stopping criteria for iterative methods we refer to Section 2.4 and Part IV. Criterion (2.21) rewritten gives the so-called inexact Newton condition

    ‖F′(x_n) s^k + F(x_n)‖ ≤ η ‖F(x_n)‖,    (2.22)
where the parameter η is called the forcing term. Remark that by choosing a smaller value for η, the inexact Newton method becomes more like Newton's method. Secondly, we remark that the forcing term η can be varied during the Newton iteration, so that for the n-th iterate we have η_n. In the following theorem, taken without proof from [15], we give some convergence results.

Theorem 2.3. Assume that F(x) = 0 has a solution x, that F′ is Lipschitz continuous near this solution, and that F′(x) is nonsingular. Then there are δ and η̄ such that, if x_0 ∈ B(δ) and {η_n} ⊂ [0, η̄], the inexact Newton iteration

    x_{n+1} = x_n + s_n,   where   ‖F′(x_n) s_n + F(x_n)‖ ≤ η_n ‖F(x_n)‖,    (2.23)

converges linearly to x. Moreover, if η_n → 0, then the convergence is q-super-linear.
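The sketch below illustrates the inexact Newton condition (2.22): the Newton system (2.19) is solved only approximately by a simple Richardson inner iteration until (2.22) holds. The inner solver, the forcing term value and the example system are our own illustrative choices, not prescribed by the text.

```python
import numpy as np

def inner_solve(J, rhs, eta, max_inner=500):
    """Crude Richardson iteration for J s = rhs, stopped by the condition (2.22)."""
    s = np.zeros_like(rhs)
    omega = 1.0 / np.linalg.norm(J, ord=np.inf)      # simple damping factor
    for _ in range(max_inner):
        if np.linalg.norm(J @ s - rhs) <= eta * np.linalg.norm(rhs):
            break
        s = s + omega * (rhs - J @ s)
    return s

def inexact_newton(F, J, x0, eta=0.1, tol=1e-10, maxit=50):
    x = x0.astype(float).copy()
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        x = x + inner_solve(J(x), -Fx, eta)          # approximate Newton step
    return x

# same illustrative 2x2 system as before (circle of radius 2 intersected with y = x)
F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [-1.0, 1.0]])
print(inexact_newton(F, J, np.array([1.0, 2.0])))
```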
2.6 Global Convergence

In this section we use an example taken from [15] to illustrate the theory. To illustrate that the Newton iteration only converges locally, we apply it to the function

    f(x) = arctan(x),    (2.24)

with initial iterate x_0 = 10. This initial value is too far from the root, so the local convergence theory is not valid. The Newton step can be computed easily and is equal to

    s = −f(x_0)/f′(x_0) ≈ −1.5/0.01 ≈ −150.    (2.25)
Hence, this computed Newton step is in the correct direction, but is far too large in magnitude. By continuing the iteration we obtain the following sequence of iterates:

    x_0 = 10,   x_1 = −138,   x_2 = 2.9 · 10^4,   x_3 = −1.5 · 10^9,   . . .    (2.26)

Since the computed Newton step s points in the correct direction, we can apply the following simple modification: we reduce the Newton step by half until ‖f(x)‖ has been reduced. We then get the convergence behavior shown in Figure 2.2.

[Figure 2.2: The absolute nonlinear residual on a log scale versus the number of nonlinear iterations. The problem arctan(x) = 0 is solved by a Newton iteration with initial iterate x_0 = 10; the Newton step is computed such that (2.28) is satisfied.]
We now return to the general case of F(x) = 0. In order to describe this artifice in a clear way we define the Newton direction as

    d = −F′(x_n)^{−1} F(x_n),    (2.27)

and the Newton step s as a positive scalar multiple of the Newton direction d. In the local convergence theory there is no difference between the Newton direction and the Newton step, i.e. s = d. The Newton step will be determined as follows. Find the smallest integer m such that

    ‖F(x_n + 2^{−m} d)‖ < (1 − α 2^{−m}) ‖F(x_n)‖,    (2.28)

and let the Newton step be s = 2^{−m} d. Condition (2.28) is called the sufficient decrease condition for ‖F‖. The parameter α in (2.28) is a small number, chosen in such a way that the sufficient decrease condition is satisfied as easily as possible; in [15], α = 10^{−4}.
Methods like the one described above are called line search methods. These methods search for a decrease in ‖F‖ along the segment [x_n, x_n + d]. Some problems can be solved more efficiently if the step-length reduction is more aggressive, for instance by factors of 1/10 instead of 1/2. To make this possible, we can do the following after two reductions by half. Build a quadratic polynomial model of

    ψ(λ) = ‖F(x_n + λd)‖²,    (2.29)

based on interpolation of ψ at the three most recent values of λ. The minimizer of the quadratic model then becomes the next value for λ. As a safeguard we require that the reduction in λ is at least a factor of 2 and at most a factor of 10. This proposed algorithm, taken from [15], generates a sequence of λ_m's with

    λ_0 = 1,    (2.30)
    λ_1 = 1/2,    (2.31)

and λ_m such that

    1/10 ≤ λ_m / λ_{m−1} ≤ 1/2   for all m ≥ 2.    (2.32)

The line search is terminated for the smallest m ≥ 0 such that

    ‖F(x_n + λ_m d)‖ < (1 − α λ_m) ‖F(x_n)‖    (2.33)

holds in a certain norm. For more details or other ways to implement a line search we refer to [15].
For completeness the above is sketched in Algorithm 2.1 below. Finally we remark that for smooth (vector) functions F there are only three possible scenarios for the iteration process of Algorithm 2.1:
1. the sequence {x_n}_{n≥0} converges to a solution of F(x) = 0,
2. the sequence {x_n}_{n≥0} becomes unbounded,
3. the Jacobian F′(x_n) becomes singular for a certain x_n.

Algorithm 2.1 Line Search
  Evaluate F(x); τ ← τ_r ‖F(x)‖ + τ_a.
  while ‖F(x)‖ > τ do
    Find d = −F′(x)^{−1} F(x).
    If no such d can be found, terminate with failure.
    λ = 1.
    while ‖F(x + λd)‖ > (1 − αλ) ‖F(x)‖ do
      λ ← σλ, where σ ∈ [1/10, 1/2] is computed by minimizing the quadratic model of ‖F(x + λd)‖².
    end while
    x ← x + λd.
  end while
The line search algorithm as presented here is the simplest way to find a root of F(x) when the initial iterate is far from a root. There are many alternatives to this line search algorithm that can sometimes overcome stagnation or, in the case of many solutions, find the solution that is appropriate to the physical problem. Three of these other methods are trust region globalization, pseudo-transient continuation and homotopy methods. These alternatives will not be discussed in this report; for more information we refer to [15].
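A minimal Python sketch of the globalized Newton iteration of Algorithm 2.1 is given below, restricted to the simple step-halving line search (2.28) (the quadratic-model safeguard (2.29)–(2.32) is omitted). It is applied to the arctan example of this section with α = 10^{−4} as in [15]; the function and tolerance values are otherwise our own choices.

```python
import numpy as np

def newton_linesearch(F, J, x0, alpha=1e-4, tol=1e-12, maxit=100):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        d = np.linalg.solve(np.atleast_2d(J(x)), -Fx)     # Newton direction (2.27)
        lam = 1.0
        # halve lambda until the sufficient decrease condition (2.28) holds
        while np.linalg.norm(F(x + lam * d)) >= (1.0 - alpha * lam) * np.linalg.norm(Fx):
            lam *= 0.5
            if lam < 1e-12:        # give up on this direction
                break
        x = x + lam * d
    return x

F = lambda x: np.arctan(x)
J = lambda x: 1.0 / (1.0 + x**2)
print(newton_linesearch(F, J, x0=10.0))   # converges to the root x = 0
```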
2.7 Extension of Secant Method to n Dimensions

For solving the scalar equation f(x) = 0 the secant method is given as

    x_{k+1} = x_k − f(x_k)(x_k − x_{k−1}) / ( f(x_k) − f(x_{k−1}) ),    (2.34)

where x_k is the current iterate and x_{k−1} the one before. As seen before, the secant method converges with order equal to the golden ratio.
One of the most famous extensions of the secant method to n dimensions is Broyden's method. We will briefly show how this method can be derived from the (n + 1)-point sequential secant method, which is derived in [23, Chapter 7]. We remark that this method itself is not used in practice; methods that are used in practice can be derived from it.
The (n + 1)-point sequential secant method is the n-dimensional extension of the scalar secant method and is given as

    x_{k+1} = x_k − A_k^{−1} F(x_k),   k = 0, 1, . . . ,    (2.35)

where

    A_k = Γ_k H_k^{−1},   Γ_k = [q^{k−1}, . . . , q^{k−n}],   H_k = [p^{k−1}, . . . , p^{k−n}],    (2.36)
with

    p^i = x_{i+1} − x_i   and   q^i = F(x_{i+1}) − F(x_i).    (2.37)

Remark that this method needs n + 1 initial values, e.g., x_0, . . . , x_{−n}. The matrix A_k is an approximation of the Jacobian in the k-th iteration. For the next iteration the approximation of the Jacobian is given by

    A_{k+1} = A_k − F(x_{k+1})(v^k)^T,    (2.38)

with v^k the first row of H_{k+1}^{−1}. For a derivation of this result, see [23, Theorem 7.3.1]. This result suggests that A_{k+1} is obtained from A_k by addition of a rank one matrix. Therefore, the more general form of (2.38) is

    A_{k+1} = A_k + u^k (v^k)^T,   for some u^k, v^k ∈ R^n.    (2.39)
Using the Sherman–Morrison formula⁴ gives

    A_{k+1}^{−1} = A_k^{−1} − ( A_k^{−1} u^k (v^k)^T A_k^{−1} ) / ( 1 + (v^k)^T A_k^{−1} u^k ),    (2.41)

where we assume the denominator to be unequal to zero and A_0 nonsingular. This is the first constraint on v^k and u^k, i.e.,

    1 + (v^k)^T A_k^{−1} u^k ≠ 0.    (2.42)
From now on it is possible to construct a wide range of methods. Next, we assume that

    (v^k)^T p^k ≠ 0   and   u^k = F(x_{k+1}) / ( (v^k)^T p^k ),   k ∈ N.    (2.43)

The first constraint then becomes, using p^k + A_k^{−1} F(x_{k+1}) = A_k^{−1} q^k,

    (v^k)^T A_k^{−1} q^k ≠ 0.    (2.44)

Then, (2.41) becomes

    A_{k+1}^{−1} = A_k^{−1} − ( A_k^{−1} F(x_{k+1}) (v^k)^T A_k^{−1} ) / ( (v^k)^T A_k^{−1} q^k ).    (2.45)
Broyden’s method [3] is a special case of the method defined by (2.43),
in which v k is taken equal to pk . We remark that (2.44) may not be equal
to zero. The corresponding algorithm, see [3], is given as Algorithm 2.2. In
Algorithm 2.2 the matrix A−1
k is represented by the matrix B k . Furthermore,
we remark that the update Bk+1 in State 8 of the Broyden algorithm is in
practice not computed in that way. For more information on the Broyden
algorithm we refer to [3, 15].
⁴ Sherman–Morrison formula: let the n × n matrix A be invertible and let u, v ∈ R^n. Then A + uv^T is invertible if and only if 1 + v^T A^{−1} u ≠ 0, and then

    (A + uv^T)^{−1} = A^{−1} − ( A^{−1} u v^T A^{−1} ) / ( 1 + v^T A^{−1} u ).    (2.40)

Algorithm 2.2 Broyden
 1: Obtain an initial estimate x_0 of the solution.
 2: Obtain an initial value of the iteration matrix B_0. For instance, this can be done by approximating the Jacobian at x_0 and inverting the result.
 3: Compute F_i = F(x_i), i = 0, 1, 2, . . .
 4: Compute p_i = −B_i F_i.
 5: Choose λ_i such that x_{i+1} = x_i + λ_i p_i implies ‖F_{i+1}‖ < ‖F_i‖.
 6: Test ‖F_{i+1}‖ for convergence.
 7: Compute y_i = F_{i+1} − F_i.
 8: Compute B_{i+1} = B_i − ( B_i F(x_{i+1}) (p_i)^T B_i ) / ( (p_i)^T B_i y_i ).
 9: Repeat from Step 4.
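The following sketch implements Algorithm 2.2 with λ_i = 1 (no line search) for brevity; B_0 is taken as the inverse of a forward-difference Jacobian, and the example system is our own illustrative choice.

```python
import numpy as np

def broyden(F, x0, tol=1e-10, maxit=50, eps=1e-7):
    x = x0.astype(float).copy()
    n = x.size
    # Step 2: B_0 = (forward-difference Jacobian at x_0)^{-1}
    J0 = np.empty((n, n))
    Fx = F(x)
    for j in range(n):
        xp = x.copy(); xp[j] += eps
        J0[:, j] = (F(xp) - Fx) / eps
    B = np.linalg.inv(J0)
    for _ in range(maxit):
        if np.linalg.norm(Fx) <= tol:
            break
        p = -B @ Fx                       # Step 4
        x = x + p                         # Step 5 with lambda_i = 1
        Fnew = F(x)
        y = Fnew - Fx                     # Step 7
        # Step 8: rank-one update of the inverse Jacobian approximation
        B = B - np.outer(B @ Fnew, p @ B) / (p @ B @ y)
        Fx = Fnew
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
print(broyden(F, np.array([1.0, 2.0])))   # approximately [sqrt(2), sqrt(2)]
```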
2.8 Failures

In this section we discuss possible failures of the nonlinear solvers presented in this chapter. Among others we treat possible causes for failure to converge.
In the case that the problem has no solution, any solver will have trouble finding one. For example, if f(x) = e^{−x}, then the Newton iterates tend to +∞ for any starting point, since f never vanishes.
2.8.1 Non-smooth Functions

All variations of the Newton iteration discussed in this chapter are intended to find roots of F(x) = 0 for which F′(x) is Lipschitz continuous. Solving problems with a discontinuous Jacobian causes unpredictable behavior of the codes. An example of a function with a non-smooth Jacobian is the vector norm.
2.8.2 Slow Convergence

If one has a nonlinear solver that converges, but not as fast as it should, then probably one of the following three operations is inaccurate:
• Computation of the Jacobian,
• Computation of the Jacobian-vector product,
• Linear solver.
The super-linear convergence which follows from the local convergence theory only holds if the correct linear system is solved accurately. In order to get the expected super-linear convergence, one could check the following things.
1. Check the computation of the Jacobian,
2. If an iterative linear solver is used, make sure that the termination criteria for the linear solver are set tight enough.
2.8.3 No Convergence

If the Newton iteration does not converge, then either the iteration becomes unbounded or the Jacobian becomes singular. We list possible causes of failure to converge. The first thing to check is whether the problem has a solution at all. In the case that the problem does have a solution, one of the following might cause the failure.
• Inaccurate function evaluation. If the error in the function evaluation
is higher than the machine roundoff, then the computed Newton direction, which is computed by a difference Jacobian, can be poor enough
for the iteration to fail.
• Singular Jacobian. If the Jacobian becomes (almost) singular, then
the step lengths will become zero. In the case that one terminates
the iteration when the step length approaches zero and does not check
whether F approaches zero, then one concludes incorrectly that the
problem has been solved.
• If the line search algorithm as presented in Section 2.6 fails for the
problem, one should use one of the alternatives presented in the same
section.
2.8.4 Failure of the Line Search

In the case that the Jacobian is not singular and the line search algorithm reduces the step size to an unacceptably small value, this means that
the computed Newton direction is poor. In order to let the line search algorithm perform better, the computation of the Newton direction must become
more accurate. Furthermore, we repeat that convergence of the line search
algorithm is based on an exact Jacobian. A difference approximation to a
Jacobian or Jacobian-vector product is usually, but not always, sufficient.
For more background on failures of Newton’s method and the line search
algorithm we refer to [15].
Chapter 3
Picard Iteration
Picard iteration is also known as fixed point iteration. We first give the
definition of a fixed point.
Definition 3.1. A fixed point is a point that does not change upon application of a map or a system of differential equations. For example, a point p
such that f (p) = p holds for a given map f : R → R is called a fixed point.
Points of an autonomous system of ordinary differential equations at which
    dx_1/dt = f_1(x_1, . . . , x_n) = 0
    dx_2/dt = f_2(x_1, . . . , x_n) = 0
        . . .
    dx_n/dt = f_n(x_1, . . . , x_n) = 0    (3.1)

are also known as fixed points.
In this chapter we restrict ourselves to finding a fixed point of a given
map F : R^n → R^n. First, we consider the scalar case, i.e., n = 1.
3.1 One Dimension

To find a solution of the scalar equation f(x) = x, or equivalently, to find a fixed point of the given map f : R → R, the following approximation can be used. Choose a start-value x_0 and determine x_n by x_n = f(x_{n−1}) for n ≥ 1. If the sequence converges to x and f is continuous, then

    x = lim_{n→∞} x_n = lim_{n→∞} f(x_{n−1}) = f( lim_{n→∞} x_{n−1} ) = f(x).    (3.2)

From Equation (3.2) it follows that x is a fixed point of f(x). This approximation of the fixed point is called the fixed point or Picard iteration.
The following theorems give sufficient conditions for existence and uniqueness of a fixed point, and convergence of the sequence {x n }n≥0 to a fixed
point. Both theorems are taken from [38].
[Figure 3.1: Graphical illustration of the fixed point method.]
Theorem 3.2. If f ∈ C[a, b] and f(x) ∈ [a, b] for all x ∈ [a, b], then f has a fixed point in [a, b]. Furthermore, if f′(x) exists for x ∈ [a, b] and there exists a positive constant k < 1 such that

    |f′(x)| ≤ k   for all x ∈ [a, b],    (3.3)

then the fixed point in [a, b] is unique.

Proof. See [38], page 86.

Theorem 3.3. Assume f ∈ C[a, b], f(x) ∈ [a, b] for x ∈ [a, b], and |f′(x)| ≤ k < 1 for x ∈ [a, b]. Then the fixed point iteration converges to the fixed point x for every start-value x_0 ∈ [a, b].

Proof. The assumptions imply that f has a unique fixed point x in [a, b]. Using the mean-value theorem we obtain

    |x_n − x| = |f(x_{n−1}) − f(x)| = |f′(ξ)| |x_{n−1} − x| ≤ k |x_{n−1} − x|.    (3.4)

By induction it follows that

    lim_{n→∞} |x_n − x| ≤ lim_{n→∞} k^n |x_0 − x| = 0.    (3.5)

We conclude that x_n converges to x.
The fixed point iteration is illustrated in Figure 3.1 for the map

    f(x) = 4 / (x² + 3),    (3.6)

with start-value x_0 = 0.
3.2 Higher Dimensions
We start with the definition of a contraction, or contraction mapping.
Definition 3.4. A mapping F : D ⊂ Rn → Rn is contractive on a set
D0 ⊂ D if there is an α < 1 such that kF (x) − F (y)k ≤ αkx − yk for all
x, y ∈ D0 . We call such a mapping F a contraction mapping, or shorter a
contraction on D0 .
Note that the contractive property as given in the above definition is
norm dependent. Clearly, a contraction is Lipschitz-continuous. The following theorem gives a basic result on the existence and uniqueness of a
fixed point of a contraction. This result is known as the Banach Fixed Point
Theorem.
Theorem 3.5. Let F be a contraction mapping from a closed subset D of
a Banach space S onto D. Then there exists a unique z ∈ D such that
F (z) = z.
Then, the following theorem presents a criterion for convergence of the Picard iteration in more dimensions.

Theorem 3.6. Assume that F is a contraction on a closed subset D of R^n. The fixed point iteration

    x_{k+1} = F(x_k),    (3.7)

converges to a fixed point x for every start-value x_0, if the spectral radius of the Jacobian of F(x) is less than one.
Proof. We obtain from the mean-value theorem that

    ‖x_{k+1} − x‖ = ‖F(x_k) − F(x)‖ = ‖J_F(ξ_{k+1})(x_k − x)‖,    (3.8)

where ξ_{k+1} ∈ (x_k, x). For any multiplicative norm ‖·‖, the norm of J_F(ξ_{k+1})(x_k − x) is less than or equal to

    ‖J_F(ξ_{k+1})‖ ‖x_k − x‖.    (3.9)

By induction we obtain

    ‖x_{k+1} − x‖ ≤ ‖J_F(ξ_{k+1})‖ ‖J_F(ξ_k)‖ · · · ‖J_F(ξ_1)‖ ‖x_0 − x‖.    (3.10)

In order to have convergence of the fixed point iteration, ‖J_F(ξ_i)‖ should be less than one for all i ∈ [1, k + 1]. Since for every multiplicative norm

    ρ(A) ≤ ‖A‖    (3.11)

holds, a necessary condition for convergence is ρ(J_F(x)) < 1 in a sphere with radius ‖x_0 − x‖ around x.
3.3 Some Last Remarks

In this section we give some last remarks concerning the fixed point iteration. To terminate the iteration it is sufficient to fulfill the condition

    ‖x_{k+1} − x_k‖ ≤ tol.    (3.12)

In the case that the Picard iteration does not converge, i.e., the spectral radius of the Jacobian becomes larger than one, a relaxation parameter ω can be used. The so-called Picard iteration scheme with relaxation parameter is

    x_{k+1} = x_k + ω (F(x_k) − x_k),   with ω > 0.    (3.13)
Part IV
Linear Solvers
Chapter 1
Introduction

In Part III we saw that solving nonlinear equations iteratively requires the solution of linear systems. For example, the standard Newton iteration

    x_{k+1} = x_k − F′(x_k)^{−1} F(x_k),    (1.1)

needs in each iteration the solution of the system

    F′(x_k) s_k = −F(x_k).    (1.2)
In this part we discuss methods which solve linear algebraic systems. Generally, we distinguish direct and iterative solution methods. In this chapter we first give some examples of direct solution methods; it will also become clear why we focus on iterative methods only. The remainder of this chapter presents the basic ideas behind iterative methods.
Examples of direct solution methods are the LU factorization and the Choleski factorization. The LU factorization is based on the following observation. Suppose we have the system

    Ax = b,    (1.3)

with A square (n × n) and nonsingular. It can be shown that there exist Gauß transformations M_1, . . . , M_{n−1} such that

    M_{n−1} · · · M_1 A = U,    (1.4)

with U an upper triangular matrix. Then

    A = (M_{n−1} · · · M_1)^{−1} U = LU,    (1.5)

with L = (M_{n−1} · · · M_1)^{−1}. It can be shown that L is a lower triangular matrix.
If the LU factorization of A is obtained, then the solution of Ax = b is easy to compute. The solution of LUx = b can be split into two parts, i.e., first the solution of the lower triangular system Ly = b and then the solution of the upper triangular system Ux = y. To compute an LU factorization of A takes 2n³/3 flops. Solving Ly = b and Ux = y requires 2n² flops.
In the case that A is symmetric positive definite (SPD), the factorization

    A = GG^T,    (1.6)

with G a lower triangular matrix, exists. This factorization is known as the Choleski factorization and costs n³/3 flops.
Problems arising from discretized PDEs lead in general to large sparse systems of equations. Direct solution methods can be impractical if A is large and sparse, because the factors L and U can be dense. This is especially the case for 3D problems. We then need a considerable amount of memory, and solving the triangular systems can take many floating point operations. Therefore, we restrict ourselves to iterative methods to solve linear systems. The amount of work to solve Ax = b iteratively depends on the number of iterations and the amount of work per iteration.
The basic idea behind solving Ax = b, with A invertible, iteratively is the following. Construct a splitting A = B + (A − B), such that B is "easily invertible". The original linear system can then be rewritten as

    Bx + (A − B)x = b.    (1.7)

By putting

    Bx^{i+1} + (A − B)x^i = b,    (1.8)

we introduce a sequence of approximate solutions {x^i}_{i=0}^{∞}. Equation (1.8) can be rewritten as

    x^{i+1} = (I − B^{−1}A) x^i + B^{−1} b,    (1.9)

or equivalently as

    x^{i+1} = Q x^i + s,    (1.10)

with Q = I − B^{−1}A and s = B^{−1}b. We call an iterative process consistent if the matrix I − Q is non-singular and (I − Q)^{−1}s = A^{−1}b. If the iteration process is consistent, then the equation (I − Q)x = s has the unique solution x ≡ lim_{i→∞} x^i. We call an iterative process convergent if the sequence x^0, x^1, x^2, . . . converges to a limit independent of the start-value x^0.
For example, the Jacobi method has the splitting A = D − C, with D a diagonal matrix and C a matrix that contains zeros on the main diagonal. By taking B equal to D, the Jacobi method reads

    D x^{i+1} = C x^i + b.    (1.11)

In the Gauß-Seidel method we also split up C = C_1 + C_2, with a lower triangular matrix C_1 and an upper triangular matrix C_2. Then, by taking B = D − C_1, the Gauß-Seidel method reads

    (D − C_1) x^{i+1} = C_2 x^i + b.    (1.12)
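As an illustration of the splitting iteration (1.8) with the Jacobi choice B = D, cf. (1.11), consider the following sketch; the small diagonally dominant test matrix and the tolerance are our own illustrative choices.

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, maxit=1000):
    """Basic Jacobi iteration D x^{i+1} = C x^i + b, cf. (1.11)."""
    D = np.diag(np.diag(A))          # B = D
    C = D - A                        # A = D - C, so C has zero diagonal
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(maxit):
        x_new = np.linalg.solve(D, C @ x + b)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])     # strictly diagonally dominant, so Jacobi converges
b = np.array([1.0, 2.0, 3.0])
print(jacobi(A, b), np.linalg.solve(A, b))   # the two results should agree
```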
In our application of CVD we obtain, after discretization and applying a time integration method, a huge nonlinear system to solve. The size of this nonlinear system depends on
1. the spatial dimension d (d = 1, 2, 3),
2. the number of chemical species,
3. whether the nonlinear equations are solved coupled or decoupled.
As mentioned above, to solve this nonlinear system linear systems have to be solved as well; these linear systems are huge and sparse. Therefore, iterative solution methods are more efficient than direct methods, especially in the three-dimensional case.
From the basic iterative methods described above a class of efficient solution methods can be derived: the class of so-called Krylov subspace methods, which will be discussed in the next chapter. We conclude with preconditioning techniques in Chapter 3, which are used to accelerate Krylov subspace methods.
Chapter 2
Krylov Subspace Methods
In Chapter 1 the basic iterative method was defined to approximate the solution of a linear system Ax = b iteratively by

    x^{i+1} = (I − B^{−1}A) x^i + B^{−1} b,    (2.1)

where B is an "easily invertible" matrix coming from the splitting A = B + (A − B). We also remark that the iteration process (2.1) needs a start approximation x^0. Rewriting (2.1) gives

    x^{i+1} = x^i + B^{−1}(b − Ax^i) = x^i + B^{−1} r^i,    (2.2)

where r^i = b − Ax^i is called the i-th residual. Using this notation, the first steps of the iteration process are

    x^1 = x^0 + B^{−1} r^0,
    x^2 = x^1 + B^{−1} r^1 = x^0 + B^{−1} r^0 + B^{−1}(b − A(x^0 + B^{−1} r^0)) = x^0 + (2B^{−1} − B^{−1}AB^{−1}) r^0,
    . . .

which implies that

    x^i ∈ x^0 + span{ B^{−1} r^0, (B^{−1}A) B^{−1} r^0, . . . , (B^{−1}A)^{i−1} B^{−1} r^0 }.    (2.3)

The subspace K^i(A, r^0) := span{r^0, Ar^0, . . . , A^{i−1} r^0} is called the Krylov space of dimension i corresponding to the matrix A and initial residual r^0. Each approximation x^i, i = 1, 2, . . ., that is computed by a basic iterative method is an element of x^0 + K^i(B^{−1}A, B^{−1}r^0).
This chapter is divided into two parts. In the first part the Conjugate
Gradients method will be treated, which is a Krylov Subspace method for
symmetric matrices. In the other part of this chapter we discuss Krylov
subspace methods for general matrices.
2.1 Krylov Subspace Methods for Symmetric Matrices

In this section we present Krylov subspace methods for solving

    Ax = b,    (2.4)

where A is a symmetric matrix. The CG method is the most prominent iterative method for solving linear systems (2.4) in the case that A is SPD. CG will be derived from the Lanczos Algorithm, which is presented in the next section.
2.1.1 Arnoldi's Method and the Lanczos Algorithm

The Arnoldi Algorithm is a procedure for building an orthonormal basis of the Krylov subspace K^i(A, r^0). It is given in Algorithm 2.1, which is taken from [28].

Algorithm 2.1 Arnoldi
  Choose a vector v_1 of norm 1,
  for j = 1, 2, . . . , m do
    Compute h_{ij} = (Av_j, v_i) for i = 1, 2, . . . , j,
    Compute w_j = Av_j − Σ_{i=1}^{j} h_{ij} v_i,
    h_{j+1,j} = ‖w_j‖_2,
    if h_{j+1,j} = 0 then
      Stop,
    end if
    v_{j+1} = w_j / h_{j+1,j}
  end for
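A minimal Python sketch of the Arnoldi procedure of Algorithm 2.1 (written in its modified Gram–Schmidt form) is given below; the random test matrix and the dimension are illustrative choices.

```python
import numpy as np

def arnoldi(A, v1, m):
    """Build an orthonormal basis of K^m(A, v1) and the Hessenberg coefficients h_ij."""
    n = v1.size
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = np.dot(w, V[:, i])     # h_ij = (A v_j, v_i)
            w = w - H[i, j] * V[:, i]        # subtract the projection (modified Gram-Schmidt)
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:               # breakdown: the Krylov subspace is invariant
            return V[:, :j + 1], H[:j + 1, :j]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
V, H = arnoldi(A, rng.standard_normal(6), m=4)
print(np.allclose(V.T @ V, np.eye(V.shape[1])))   # the columns are orthonormal
```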
A few simple properties of the algorithm are given in Propositions 2.1 and 2.2.

Proposition 2.1. Assume that Algorithm 2.1 does not stop before the m-th step. Then the vectors v_1, v_2, . . . , v_m form an orthonormal basis of the Krylov subspace

    K^m(A, r^0) = span{v_1, Av_1, . . . , A^{m−1} v_1}.    (2.5)

Proposition 2.2. Denote by V_m the n × m matrix with column vectors v_1, . . . , v_m, by H̄_m the (m + 1) × m Hessenberg matrix whose nonzero entries h_{ij} are defined by Algorithm 2.1, and by H_m the matrix obtained from H̄_m by deleting its last row. Then the following relations hold:

    A V_m = V_m H_m + w_m e_m^T    (2.6)
          = V_{m+1} H̄_m,    (2.7)
    V_m^T A V_m = H_m.    (2.8)

For the proofs of Propositions 2.1 and 2.2 we refer to [28].
In the case that the matrix A is symmetric the Hessenberg H m will become symmetric tridiagonal, see [28, Theorem 6.2]. The standard notation
used to describe the Lanczos Algorithm is obtained by setting
αj ≡ hjj ,
βj ≡ hj−1,j ,
(2.9)
and if Tm denotes the resulting Hm matrix, it is of the form

α1 β2
 β2 α1
β2


.
..



βm−1 αm−1 βm
βm
αm




.


(2.10)
This leads to the following variant of Arnoldi’s Algorithm (see Algorithm
2.1), which is called the Lanczos Algorithm. It is given as Algorithm 2.2.
Algorithm 2.2 Lanczos Algorithm
Choose an initial vector v1 of norm 1. Set β1 ≡ 0 and v0 ≡ 0,
for j = 1, 2, . . . , m do
wj = AVj − βj vj−1 ,
αj = (wj , vj ),
wj = w j − α j vj ,
βj+1 = kwj k2 ,
if βj+1 = 0 then
Stop
end if
vj+1 = wj /βj+1
end for
Algorithm 2.2 guarantees that, in exact arithmetic, the vectors v_i, i = 1, 2, ..., m, are orthogonal. In practice, when we do not compute in exact arithmetic, exact orthogonality of these vectors is only observed at the beginning of the process. At some point the v_i's start to lose their global orthogonality rapidly. For more information on this subject we refer to [28].
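The sketch below (an illustration added here, not taken from [28]) implements the three-term recurrence of Algorithm 2.2 and indicates how the loss of global orthogonality just mentioned can be monitored; the function name is an assumption.

import numpy as np

def lanczos(A, v1, m):
    # Algorithm 2.2: three-term recurrence for symmetric A; returns the Lanczos
    # vectors V and the entries alpha (diagonal) and beta (off-diagonal) of T_m.
    n = len(v1)
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m + 1)              # beta[0] plays the role of beta_1 = 0
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            w = w - beta[j] * V[:, j - 1]
        alpha[j] = np.dot(w, V[:, j])
        w = w - alpha[j] * V[:, j]
        beta[j + 1] = np.linalg.norm(w)
        if beta[j + 1] == 0.0:
            break
        V[:, j + 1] = w / beta[j + 1]
    return V, alpha, beta

# The loss of global orthogonality discussed above can be monitored through
# np.linalg.norm(V.T @ V - np.eye(V.shape[1])), which grows as m increases.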
2.1.2 The Conjugate Gradient Algorithm
As mentioned before, the Conjugate Gradient method, CG for short, is the best known iterative technique for solving sparse symmetric positive definite linear systems. Briefly described, CG is a realization of an orthogonal projection technique onto the Krylov subspace K^m(A, r_0), where r_0 is the initial residual.
We will derive the CG method from the Lanczos Method for Linear Systems. This Lanczos Method for Linear Systems approximates the solution of Ax = b, with A^T = A, by an orthogonal projection onto K^m(A, r_0). Given an initial guess x_0 to the linear system Ax = b and the Lanczos vectors v_1, ..., v_m together with the tridiagonal matrix T_m, the orthogonal projection is given as

    x_m = x_0 + V_m y_m,    y_m = T_m^{-1}(β e_1).        (2.11)
The actual algorithm is given as Algorithm 2.3.
Algorithm 2.3 Lanczos Method for Linear Systems
1: Compute r_0 = b - A x_0, β = ||r_0||_2 and v_1 = r_0/β,
2: for j = 1, 2, ..., m do
3:   w_j = A v_j - β_j v_{j-1},   (if j = 1 set β_1 v_0 ≡ 0)
4:   α_j = (w_j, v_j),
5:   w_j = w_j - α_j v_j,
6:   β_{j+1} = ||w_j||_2,
7:   if β_{j+1} = 0 then
8:     Set m = j and go to 12,
9:   end if
10:  v_{j+1} = w_j / β_{j+1},
11: end for
12: Set T_m = tridiag(β_i, α_i, β_{i+1}) and V_m = [v_1, ..., v_m],
13: Compute y_m = T_m^{-1}(β e_1) and x_m = x_0 + V_m y_m.
Note from Algorithm 2.3 that the approximate solution is given as

    x_m = x_0 + V_m y_m = x_0 + V_m U_m^{-1} L_m^{-1} (β e_1),        (2.12)

where we used the LU factorization T_m = L_m U_m. Letting

    P_m = V_m U_m^{-1}   and   z_m = L_m^{-1}(β e_1),        (2.13)
gives x_m = x_0 + P_m z_m. As remarked in [28], the last column of P_m, denoted by p_m, can be updated from the previous p_i's and v_m by

    p_m = (1/η_m) (v_m - β_m p_{m-1}),        (2.14)

with

    λ_m = β_m / η_{m-1}   and   η_m = α_m - λ_m β_m.        (2.15)
In addition, as shown in [28], we have

    z_m = [ z_{m-1} ; ζ_m ],   with ζ_m = -λ_m ζ_{m-1} and ζ_1 = β.        (2.16)
As a result, xm can be updated at each step as xm = xm−1 + ζm pm . In the
next proposition we give some properties of the Lanczos algorithm.
Proposition 2.3. Let r m = b − Axm , m = 0, 1, . . ., be the residual vectors
produced by Algorithm 2.3 and pm be the auxiliary vectors as given above.
Then,
1. Each residual vector r_m is such that r_m = σ_m v_{m+1}, where σ_m is a certain scalar. As a result, the residual vectors are orthogonal to each other.

2. The auxiliary vectors p_i form an A-conjugate set, i.e., (A p_i, p_j) = 0 for i ≠ j. (To ensure that (Ax, x) is an inner product, A must be SPD.)
Proof. For a proof we refer to [28].
From the above proposition we can derive a modified version of Algorithm 2.3 by imposing orthogonality and conjugacy conditions. This modified version of Algorithm 2.3 is the CG method.
First we remark that

    x_{j+1} = x_j + α_j p_j.        (2.17)
Therefore, the residual vectors must satisfy
r j+1 = r j − αj Apj .
(2.18)
To have the residual vectors orthogonal, it is necessary to impose (r_{j+1}, r_j) = (r_j - α_j A p_j, r_j) = 0. It follows that α_j must be equal to

    α_j = (r_j, r_j) / (A p_j, r_j).        (2.19)
Using Proposition 2.3 and Equation (2.14) we observe that the next search
direction pj+1 is a linear combination of r j+1 and pj . After rescaling the p
vectors appropriately, it follows that
pj+1 = r j+1 + βj pj .
(2.20)
A first consequence of the above relation is

    (A p_j, r_j) = (A p_j, p_j - β_{j-1} p_{j-1}) = (A p_j, p_j),        (2.21)
because A p_j is orthogonal to p_{j-1}. Remark that (2.19) becomes α_j = (r_j, r_j)/(A p_j, p_j). A second consequence of (2.20) is the following. Since p_{j+1} is orthogonal to A p_j, we can derive that β_j in (2.20) must be equal to

    β_j = - (r_{j+1}, A p_j)/(p_j, A p_j)
        = (1/α_j) · (r_{j+1}, r_{j+1} - r_j)/(A p_j, p_j)
        = (r_{j+1}, r_{j+1})/(r_j, r_j).        (2.22)
Putting these relations together gives Algorithm 2.4, i.e., the CG algorithm.
Algorithm 2.4 Conjugate Gradient
1: Compute r_0 = b - A x_0 and p_0 = r_0,
2: for j = 0, 1, ..., until convergence do
3:   α_j = (r_j, r_j)/(A p_j, p_j),
4:   x_{j+1} = x_j + α_j p_j,
5:   r_{j+1} = r_j - α_j A p_j,
6:   β_j = (r_{j+1}, r_{j+1})/(r_j, r_j),
7:   p_{j+1} = r_{j+1} + β_j p_j.
8: end for
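A minimal NumPy realization of Algorithm 2.4 is sketched below; the relative-residual stopping test, the tolerance and the function name are assumptions made for this example.

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxiter=1000):
    # Algorithm 2.4 for SPD A, with a relative residual stopping test.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rho = np.dot(r, r)
    nb = np.linalg.norm(b)
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol * nb:
            break
        Ap = A @ p
        alpha = rho / np.dot(Ap, p)
        x = x + alpha * p
        r = r - alpha * Ap
        rho_new = np.dot(r, r)
        beta = rho_new / rho
        p = r + beta * p
        rho = rho_new
    return x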
2.2 Krylov Subspace Methods for General Matrices
In the previous section we have discussed the CG method. This Krylov subspace method can only be used to solve (sparse) symmetric positive definite
linear systems. In this section we discuss the best known Krylov subspace
methods for general (real) sparse linear systems, i.e., we consider Krylov
subspace methods to solve Ax = b, with A an n × n nonsingular real matrix.
Considering the case of symmetric positive definite linear systems, we
have seen that CG has the properties :
• Krylov subspace based on A,
• optimality : the residual is minimized in a certain norm,
• short recurrences; only the results of one foregoing step are necessary,
work and memory do not increase for an increasing number of iterations.
In [8] it has been shown that it is impossible to obtain a Krylov subspace method for general matrices that satisfies all of the above properties. From this we conclude that a Krylov method for general matrices has either an optimality property but long recurrences, or short recurrences but no optimality. A method that has both an optimality property and short recurrences cannot be a Krylov subspace method.
It appears that there are essentially three different ways to solve nonsymmetric real linear systems :
1. Solve the normal equations AT Ax = AT b with CG,
2. Construct a basis for the Krylov subspace by a 3-term bi-orthogonality
relation,
3. Make all the residuals explicitly orthogonal in order to have an orthogonal basis for the Krylov subspace.
In this report only the last two classes will be treated. For more information
on the first class of methods we refer to [22, 28].
2.2.1 BiCG Type Methods
In the previous section we saw that the CG method is derived from the Lanczos Algorithm, i.e. Algorithm 2.2. It appears that there are various generalizations of the Lanczos Algorithm for general matrices. One of these generalizations, the Arnoldi procedure (Algorithm 2.1), has already been treated. Another extension of Lanczos to general matrices is the Lanczos Bi-orthogonalization Algorithm. It differs from Arnoldi's method in the sense that it relies on bi-orthogonal sequences instead of orthogonal sequences.
In this section we briefly present the Lanczos Bi-orthogonalization Algorithm and mention how the Bi-Conjugate Gradient (BiCG) method can be derived from it. Furthermore, we present the basic ideas behind Conjugate Gradient Squared (CGS) and Bi-Conjugate Gradient STABilized (Bi-CGSTAB). The latter algorithm, i.e. Bi-CGSTAB, is together with GMRES one of the best known iterative methods for general sparse real linear systems.
Lanczos Bi-Orthogonalization
The algorithm proposed by Lanczos for non-symmetric matrices builds a
pair of bi-orthogonal bases for the two subspaces

    K^m(A, v_1) = span{v_1, A v_1, ..., A^{m-1} v_1}        (2.23)

and

    K^m(A^T, w_1) = span{w_1, A^T w_1, ..., (A^T)^{m-1} w_1}.        (2.24)
The algorithm that achieves this is Algorithm 2.5. It can be proved that if Algorithm 2.5 does not break down before step m, then the vectors v_i and w_j, i, j = 1, 2, ..., m, form a bi-orthogonal system.
Comparing this algorithm with Arnoldi's method, we remark the following. From a practical point of view, the Lanczos Algorithm has an advantage over Arnoldi's method, because it requires only a few vectors of storage if no re-orthogonalization is performed.
Algorithm 2.5 Lanczos Bi-Orthogonalization Procedure
1: Choose two vectors v_1, w_1 such that (v_1, w_1) = 1,
2: Set β_1 = δ_1 = 0 and w_0 = v_0 = 0,
3: for j = 1, 2, ..., m do
4:   α_j = (A v_j, w_j),
5:   v̂_{j+1} = A v_j - α_j v_j - β_j v_{j-1},
6:   ŵ_{j+1} = A^T w_j - α_j w_j - δ_j w_{j-1},
7:   δ_{j+1} = |(v̂_{j+1}, ŵ_{j+1})|^{1/2},
8:   if δ_{j+1} = 0 then
9:     Stop.
10:  end if
11:  β_{j+1} = (v̂_{j+1}, ŵ_{j+1})/δ_{j+1},
12:  w_{j+1} = ŵ_{j+1}/β_{j+1},
13:  v_{j+1} = v̂_{j+1}/δ_{j+1},
14: end for
In particular, six vectors of length n and a tridiagonal matrix are needed for storage. This amount of storage is independent of m.
Bi-Conjugate Gradient
The Bi-Conjugate Gradient (BiCG) procedure can be derived from Algorithm 2.5 in exactly the same way as the CG Algorithm was derived from
Algorithm 2.2. Implicitly, the BiCG algorithm solves not only the original
system Ax = b, but also a dual linear system A T x∗ = b∗ . The dual system
is often ignored in the algorithm.
The BiCG procedure is a projection process onto
K m (A, v1 ) = span{v1 , Av1 , . . . , Am−1 v1 },
(2.25)
K m (AT , w1 ) = span{w1 , AT w1 , . . . , (AT )m−1 w1 }.
(2.26)
orthogonally to
The vector v1 is taken as usual v1 = r0 /kr0 k2 . The vector w1 is arbitrary,
provided that (v1 , w1 ) 6= 0. The algorithm is given as Algorithm 2.6
Conjugate Gradient Squared
Each step of BiCG requires a matrix-vector product with both A and A^T. Secondly, observe that the vectors p*_j and r*_j, which are generated using A^T, do not contribute directly to the solution. Given these observations, one might ask: is it possible to bypass the use of A^T and still generate iterates related to BiCG? The answer is given by the Conjugate Gradient Squared (CGS) method.
Algorithm 2.6 Bi-Conjugate Gradient
1: Compute r_0 = b - A x_0. Choose r*_0 such that (r_0, r*_0) ≠ 0,
2: Set p_0 = r_0 and p*_0 = r*_0,
3: for j = 0, 1, ..., until convergence do
4:   α_j = (r_j, r*_j)/(A p_j, p*_j),
5:   x_{j+1} = x_j + α_j p_j,
6:   r_{j+1} = r_j - α_j A p_j,
7:   r*_{j+1} = r*_j - α_j A^T p*_j,
8:   β_j = (r_{j+1}, r*_{j+1})/(r_j, r*_j),
9:   p_{j+1} = r_{j+1} + β_j p_j,
10:  p*_{j+1} = r*_{j+1} + β_j p*_j,
11: end for
The CGS method was developed in 1984 by Sonneveld [31]. In the BiCG method, the residual r_j can be regarded as the product of r_0 and a j-th degree polynomial R_j in A, i.e.,
rj = Rj (A)r0 ,
(2.27)
satisfying the constraint Rj (0) = 1. This same polynomial satisfies also
rj∗ = Rj (AT )r0∗ ,
(2.28)
by construction of the BiCG method. Similarly, the conjugate-direction
polynomial Pj (t) satisfies the following expressions
pj = Pj (A)r0
and p∗j = Pj (AT )r0∗ .
(2.29)
Also, note that the scalar α_j in BiCG is given as

    α_j = (r_j, r*_j)/(A p_j, p*_j)
        = (R_j(A) r_0, R_j(A^T) r*_0) / (A P_j(A) r_0, P_j(A^T) r*_0)
        = (R_j^2(A) r_0, r*_0) / (A P_j^2(A) r_0, r*_0),        (2.30)
which indicates that if it is possible to obtain a recursion for the vectors R_j^2(A) r_0 and P_j^2(A) r_0, then computing α_j and β_j causes no problem.
Expression (2.30) suggests that if R_j(A) reduces r_0 to a smaller vector r_j, then it might be advantageous to apply this 'contraction' operator twice and compute R_j^2(A) r_0. The iteration coefficients can still be recovered from these
vectors, and it turns out to be easy to find the corresponding approximations
for x. This approach is called the Conjugate Gradient Squared method. See
Algorithm 2.7. For a complete derivation we refer to [28].
Observe that Algorithm 2.7 contains no matrix-vector products with A^T. Instead, two matrix-vector products with A are performed in each step. In general, one should expect the resulting algorithm to converge twice as fast as BiCG. Therefore, what has been done is to replace the matrix-vector products with A^T by more useful work.
Algorithm 2.7 Conjugate Gradient Squared
1: Compute r_0 = b - A x_0. Choose r*_0 arbitrary,
2: Set p_0 = u_0 = r_0,
3: for j = 0, 1, ..., until convergence do
4:   α_j = (r_j, r*_0)/(A p_j, r*_0),
5:   q_j = u_j - α_j A p_j,
6:   x_{j+1} = x_j + α_j (u_j + q_j),
7:   r_{j+1} = r_j - α_j A (u_j + q_j),
8:   β_j = (r_{j+1}, r*_0)/(r_j, r*_0),
9:   u_{j+1} = r_{j+1} + β_j q_j,
10:  p_{j+1} = u_{j+1} + β_j (q_j + β_j p_j),
11: end for
A drawback of the CGS method is the following. Since the polynomials
are squared, rounding errors tend to have a more damaging effect than in
the standard BiCG algorithm.
Bi-CGSTAB
The CGS algorithm is based on squaring the residual polynomial, and,
in cases of irregular convergence, this may lead to substantial build-up of
rounding errors, or possibly even overflow. The Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) Algorithm is a variation of CGS that has a remedy for
this difficulty. During the development of Bi-CGSTAB one used a residual
of the form
rj = Qj (A)Rj (A)r0 ,
(2.31)
instead of r_j = R_j^2(A) r_0. In (2.31) the polynomial Q_j(A) is a new polynomial which is defined recursively at each step with the goal of 'stabilizing' or 'smoothing' the convergence behavior of the original algorithm. In particular, Q_j is defined recursively by
Qj+1 (t) = (1 − ωj t)Qj (t).
(2.32)
The above recursion is equivalent to
Qj+1 (t) = (1 − ω0 t)(1 − ω1 t) · · · (1 − ωj t),
(2.33)
where the scalars ωi , i = 0, 1, . . . , j have to be determined.
For the determination of the scalars ω i , i = 0, 1, . . . , j, and the complete
derivation of the Bi-CGSTAB algorithm we refer to [28], pp 216-219. The
Bi-CGSTAB algorithm is given as Algorithm 2.8.
Algorithm 2.8 Bi-CGSTAB
1: Compute r_0 = b - A x_0. Choose r*_0 arbitrary,
2: Set p_0 = r_0,
3: for j = 0, 1, ..., until convergence do
4:   α_j = (r_j, r*_0)/(A p_j, r*_0),
5:   s_j = r_j - α_j A p_j,
6:   ω_j = (A s_j, s_j)/(A s_j, A s_j),
7:   x_{j+1} = x_j + α_j p_j + ω_j s_j,
8:   r_{j+1} = s_j - ω_j A s_j,
9:   β_j = (r_{j+1}, r*_0)/(r_j, r*_0) × (α_j/ω_j),
10:  p_{j+1} = r_{j+1} + β_j (p_j - ω_j A p_j),
11: end for
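In practice one usually calls a library implementation rather than coding Algorithm 2.8 by hand. A minimal sketch using SciPy's sparse solver is given below; the three-point test matrix and right-hand side are assumptions made purely for illustration.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')  # test matrix
b = np.ones(n)

x, info = spla.bicgstab(A, b)   # info == 0 signals convergence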
2.2.2 GMRES Methods
The Generalized Minimum Residual Method (GMRES) is a projection method on the Krylov subspace K^m(A, v_1), with v_1 = r_0/||r_0||_2. This technique minimizes the residual norm over all vectors in x_0 + K^m(A, v_1). In order to do this, GMRES generates a sequence of orthogonal vectors with long recurrences due to the non-symmetry. GMRES uses a variant of Arnoldi's Algorithm (Algorithm 2.1) to find an orthonormal basis. We first give the idea behind the GMRES algorithm.
GMRES Derivation
Any vector x in x_0 + K^m(A, v_1) can be written as

    x = x_0 + V_m y,        (2.34)

where y is an m-vector. Define

    J(y) = ||b - A x||_2 = ||b - A(x_0 + V_m y)||_2,        (2.35)

and remark that

    b - A x = b - A(x_0 + V_m y)        (2.36)
            = r_0 - A V_m y             (2.37)
            = β v_1 - V_{m+1} H̄_m y     (2.38)
            = V_{m+1}(β e_1 - H̄_m y),   (2.39)

where we used Proposition 2.2, β = ||r_0||_2 and e_1 = (1, 0, ..., 0)^T. Since the column vectors of V_{m+1} are orthonormal, we obtain

    J(y) = ||b - A(x_0 + V_m y)||_2 = ||β e_1 - H̄_m y||_2.        (2.40)
The GMRES approximation is the vector of x_0 + K^m(A, v_1) which minimizes J(y). Using (2.40) this approximation can easily be obtained as

    x_m = x_0 + V_m y_m,   where   y_m = argmin_y ||β e_1 - H̄_m y||_2.        (2.41)

Observe that y_m is the solution of an (m + 1) × m least-squares problem, which is inexpensive to solve (for m typically small).
2.2.3 The GMRES Algorithm
Summarizing the above remarks and results gives the following algorithm.
Algorithm 2.9 GMRES
1: Compute r_0 = b - A x_0, β = ||r_0||_2 and v_1 = r_0/β,
2: Define the (m+1) × m matrix H̄_m = {h_{ij}}_{1≤i≤m+1, 1≤j≤m}. Set H̄_m = 0,
3: for j = 1, 2, ..., m do
4:   Compute w_j = A v_j,
5:   for i = 1, ..., j do
6:     h_{ij} = (w_j, v_i),
7:     w_j = w_j - h_{ij} v_i,
8:   end for
9:   h_{j+1,j} = ||w_j||_2,
10:  if h_{j+1,j} = 0 then
11:    Set m = j and go to 15,
12:  end if
13:  v_{j+1} = w_j / h_{j+1,j},
14: end for
15: Compute y_m = argmin_y ||β e_1 - H̄_m y||_2 and x_m = x_0 + V_m y_m.
We remark that in Algorithm 2.9 the lines 3 to 14 represent Arnoldi’s
Algorithm for finding an orthonormal basis of K m (A, v1 ).
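The following sketch illustrates Algorithm 2.9 in NumPy. It solves the small least-squares problem with a dense least-squares routine instead of the incremental Givens-rotation update used in production codes; that simplification and the function name are assumptions of this example.

import numpy as np

def gmres(A, b, x0, m):
    # Algorithm 2.9: Arnoldi basis plus a small (m+1) x m least-squares problem.
    n = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = np.dot(w, V[:, i])
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:          # 'lucky' breakdown: exact solution reached
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    g = np.zeros(m + 1)
    g[0] = beta
    y, *_ = np.linalg.lstsq(H[:m + 1, :m], g, rcond=None)
    return x0 + V[:, :m] @ y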
If Algorithm 2.9 is examined carefully, we observe that GMRES has a
breakdown when hj+1,j = 0 at a given step j. In this situation the next
Arnoldi vector cannot be generated. However, in this situation, the residual
vector is equal to zero and GMRES will deliver the exact solution at this
step. For a proof of this we refer to [28, Proposition 6.10].
A last remark concerning Algorithm 2.9 is the following. When the iteration number m is large, the GMRES algorithm becomes impractical due to lack of memory and increasing computational requirements. This follows
directly from the fact that during the Arnoldi steps (lines 3-14 of Algorithm
2.9) the number of vectors to be stored increases. To remedy this problem
one can think of restarting or truncating the algorithm. More information
on restarting and truncating can be found in [28].
2.3 Stopping Criterion
In most of the iterative methods given in this chapter the following statement
is included :
for j = 1, 2, . . . , until convergence do.
To specify the statement 'until convergence' various stopping criteria can be chosen. Which of them should be chosen for an accurate solution depends on the practical situation. We give two standard stopping criteria. Other criteria can be found in, for example, [22, 29].
• Criterion 1 : ||r_j||_2 ≤ ε.
  The main disadvantage of this stopping criterion is that it is not scaling invariant. By this we mean that if ||r_j||_2 ≤ ε, then this does not hold for ||100 r_j||_2, although the accuracy of x_j remains the same.

• Criterion 2 : ||r_j||_2 / ||b||_2 ≤ ε.
  In this criterion the norm of the residual is small in comparison with the norm of the right-hand side. Replacing ε by ε/K_2(A) gives that the relative error in x is less than ε, i.e.

      ||x - x_j||_2 / ||x||_2  ≤  K_2(A) · ||r_j||_2 / ||b||_2  ≤  ε.        (2.42)
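A small illustration of the two criteria (the tolerance value is an assumed example):

import numpy as np

def criterion1(A, x, b, eps=1e-8):
    # ||r||_2 <= eps : not scaling invariant
    return np.linalg.norm(b - A @ x) <= eps

def criterion2(A, x, b, eps=1e-8):
    # ||r||_2 / ||b||_2 <= eps : invariant under scaling A and b by the same factor
    return np.linalg.norm(b - A @ x) <= eps * np.linalg.norm(b)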
Chapter 3 Preconditioning Techniques
Krylov subspace methods for solving (sparse) linear systems are theoretically well-founded, but appear to suffer from slow convergence for problems arising in typical applications such as CFD. In this chapter we first describe preconditioning without being specific about the particular preconditioners used. In Section 3.2 we will give some widely used preconditioners such as, for example, ILU.
Efficiency and robustness of iterative methods can be improved by using preconditioning techniques. In one sentence, preconditioning can be
described as :
Transforming the original linear system into one which has the
same solution, but which is likely to be easier to solve with an
iterative solver.
In practical applications, the reliability of iterative solution methods depends more on the quality of the preconditioner than on the particular
Krylov subspace accelerator used.
3.1 Preconditioned Iterations
In this section we discuss the preconditioned versions of the Krylov subspace
methods of the previous chapter, using a generic preconditioner. We will
mainly follow Reference [28]. We start with the Preconditioned Conjugate
Gradient method.
3.1.1 Preconditioned Conjugate Gradient
Consider a linear system Ax = b, where A is symmetric positive definite. Furthermore, we assume that a preconditioner M is available, which approximates A in some yet-undefined sense. This preconditioner M is also assumed to be symmetric positive definite. From a practical point of view, the only requirement for M is that linear systems M x = b̃ are inexpensive to solve. This is a necessary requirement, because preconditioned algorithms will require a linear system solution of M x = b̃ at each step. Then, we solve
the left preconditioned system
M −1 Ax = M −1 b.
(3.1)
Note that in general this system is not symmetric. To preserve symmetry
we introduce the M -inner product.
The M -inner product is defined as
(x, y)M = (M x, y) = (x, M y).
(3.2)
By observing that

    (M^{-1} A x, y)_M = (A x, y) = (x, A y) = (x, M (M^{-1} A) y) = (x, M^{-1} A y)_M,        (3.3)
we see that M −1 A is self-adjoint for the M -inner product. As an alternative,
we replace the Euclidean inner product in the CG algorithm by the M -inner
product.
We will now rewrite the CG algorithm for the M-inner product. Therefore we distinguish the original residual r_j = b - A x_j and the preconditioned
residual zj = M −1 rj . For the preconditioned system we then obtain the
following sequence of parameters (compare with Algorithm 2.4) :
1. αj = (zj , zj )M /(M −1 Apj , pj )M ,
2. xj+1 = xj + αj pj ,
3. rj+1 = rj − αj Apj and zj+1 = M −1 rj+1 ,
4. βj = (zj+1 , zj+1 )M /(zj , zj )M ,
5. pj+1 = zj+1 + βj pj .
The M -inner products (zj , zj )M and (M −1 Apj , pj )M do not have to be computed explicitly, i.e.
(zj , zj )M = (rj , zj )
and (M −1 Apj , pj )M = (Apj , pj ).
(3.4)
Using the latter observation, we obtain the Preconditioned Conjugate Gradient Algorithm, given as Algorithm 3.1.
Algorithm 3.1 Preconditioned Conjugate Gradient
Compute r0 = b − Ax0 , z0 = M −1 r0 and p0 = z0 ,
for j = 0, 1, . . . , until convergence do
αj = (rj , zj )/(Apj , pj ),
xj+1 = xj + αj pj ,
rj+1 = rj − αj Apj ,
zj+1 = M −1 rj+1 ,
βj = (rj+1 , zj+1 )/(rj , zj ),
pj+1 = zj+1 + βj pj .
end for
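A sketch of Algorithm 3.1 in NumPy is given below. The preconditioner is passed as a callable Minv that applies M^{-1} to a vector; that interface, the tolerance and the function name are assumptions of this example.

import numpy as np

def preconditioned_cg(A, b, x0, Minv, tol=1e-10, maxiter=1000):
    # Algorithm 3.1; Minv is a function applying M^{-1} to a vector (z = M^{-1} r).
    x = x0.copy()
    r = b - A @ x
    z = Minv(r)
    p = z.copy()
    rz = np.dot(r, z)
    nb = np.linalg.norm(b)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol * nb:
            break
        Ap = A @ p
        alpha = rz / np.dot(Ap, p)
        x = x + alpha * p
        r = r - alpha * Ap
        z = Minv(r)
        rz_new = np.dot(r, z)
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Example: Jacobi (diagonal) preconditioning,
# x = preconditioned_cg(A, b, x0, lambda r: r / np.diag(A))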
Left, Right and Split Preconditioning
To transform the linear system Ax = b with a preconditioner M, such that the transformed system is 'easier' to solve, we have three options :
1. Left Preconditioning : Solve M −1 Ax = M −1 b,
2. Right Preconditioning : Solve AM −1 u = b with u = M x,
3. Split Preconditioning : If M is SPD, then there exists a matrix L such that M = LL^T. Solve L^{-1} A L^{-T} u = L^{-1} b with x = L^{-T} u.
The Left Preconditioning is already given in (3.1) and used to determine the Preconditioned CG Algorithm. Right Preconditioning has the
advantage that the original residual r i = b − Axi is computed implicitly by
b - A M^{-1} u_i = b - A x_i. An advantage of Split Preconditioning is that the transformed linear system remains symmetric (provided A is symmetric).
Finally, we remark that the three options given above can be used to
construct preconditioned iterative solution methods. In this report we give
preconditioned versions of
• (Left) Preconditioned CG,
• Left Preconditioned GMRES,
• Right Preconditioned GMRES, and
• (Left) Preconditioned Bi-CGSTAB.
For more information on preconditioned iterative methods we refer to [28, 29].
3.1.2 Preconditioned GMRES
As for the CG method, we again have three options for applying the preconditioning operation to GMRES, i.e. left, right and split preconditioning. In this section we treat left and right preconditioning of GMRES. There is a fundamental difference between the two, related to the residual that each variant works with.
Left Preconditioned GMRES
Left Preconditioned GMRES solves the left preconditioned linear system
M −1 Ax = M −1 b,
(3.5)
with A and M general matrices. The straightforward application of GMRES to (3.5) gives Algorithm 3.2, i.e., the Left Preconditioned GMRES
Algorithm.
Algorithm 3.2 Left Preconditioned GMRES
1: Compute r_0 = M^{-1}(b - A x_0), β = ||r_0||_2 and v_1 = r_0/β,
2: for j = 1, 2, ..., m do
3:   Compute w = M^{-1} A v_j,
4:   for i = 1, ..., j do
5:     h_{ij} = (w, v_i),
6:     w = w - h_{ij} v_i,
7:   end for
8:   h_{j+1,j} = ||w||_2 and v_{j+1} = w/h_{j+1,j},
9: end for
10: Define V_m = [v_1, ..., v_m] and H̄_m = (h_{ij})_{1≤i≤m+1, 1≤j≤m},
11: Compute y_m = argmin_y ||β e_1 - H̄_m y||_2 and x_m = x_0 + V_m y_m,
12: If satisfied Stop, else set x_0 = x_m and go to 1.
In Algorithm 3.2 an orthogonal basis is constructed for the left preconditioned Krylov subspace

    span{r_0, M^{-1} A r_0, ..., (M^{-1} A)^{m-1} r_0}.        (3.6)

Furthermore, we remark that in Algorithm 3.2 all residuals and their norms correspond to the preconditioned residual r_m = M^{-1}(b - A x_m). The 'original' residuals b - A x_m are not computed. In the case that the stopping criterion is based on the original residual, this can cause some difficulties. A possible solution is to compute the residuals b - A x_m explicitly.
Right Preconditioned GMRES
The Right Preconditioned GMRES Algorithm is based on solving

    A M^{-1} u = b   with   u = M x.        (3.7)

In this situation the new variable u never needs to be invoked explicitly. By a simple analysis of (3.7), as done in [28], one obtains that one preconditioning operation is needed at the end of the loop, instead of at the beginning as in the left preconditioned version. We then obtain Algorithm 3.3.
Algorithm 3.3 Right Preconditioned GMRES
1: Compute r_0 = b - A x_0, β = ||r_0||_2 and v_1 = r_0/β,
2: for j = 1, 2, ..., m do
3:   Compute w = A M^{-1} v_j,
4:   for i = 1, ..., j do
5:     h_{ij} = (w, v_i),
6:     w = w - h_{ij} v_i,
7:   end for
8:   h_{j+1,j} = ||w||_2 and v_{j+1} = w/h_{j+1,j},
9: end for
10: Define V_m = [v_1, ..., v_m] and H̄_m = (h_{ij})_{1≤i≤m+1, 1≤j≤m},
11: Compute y_m = argmin_y ||β e_1 - H̄_m y||_2 and x_m = x_0 + M^{-1} V_m y_m,
12: If satisfied Stop, else set x_0 = x_m and go to 1.
Algorithm 3.3 builds an orthogonal basis of the right preconditioned Krylov subspace

    span{r_0, A M^{-1} r_0, ..., (A M^{-1})^{m-1} r_0}.        (3.8)
Note that the residual norm is now relative to the initial system, since the
algorithm obtains the residual b − Ax m = b − AM −1 um , implicitly. This is
an essential difference with the left preconditioned GMRES algorithm.
By comparing left and right preconditioned GMRES the following proposition, taken from [28], can be proven.

Proposition 3.1. The approximate solution obtained by left or right preconditioned GMRES is of the form

    x_m = x_0 + s_{m-1}(M^{-1} A) z_0 = x_0 + M^{-1} s_{m-1}(A M^{-1}) r_0,        (3.9)

where z_0 = M^{-1} r_0 and s_{m-1} is a polynomial of degree (m-1). The polynomial s_{m-1} minimizes the residual norm ||b - A x_m||_2 in the right preconditioning case, and the preconditioned residual norm ||M^{-1}(b - A x_m)||_2 in the left preconditioning case.
In most practical situations, the difference in the convergence behavior
of the two approaches is not significant. The only exception is when M is
ill-conditioned which could lead to substantial differences.
3.1.3 Preconditioned Bi-CGSTAB
We restrict ourselves to giving the preconditioned Bi-CGSTAB algorithm; it is given as Algorithm 3.4.
Algorithm 3.4 Preconditioned Bi-CGSTAB
1: Compute r_0 = b - A x_0 and choose r̃_0 arbitrary,
2: Set p_0 = r_0,
3: for j = 0, 1, 2, ..., until convergence do
4:   p̃_j = M^{-1} p_j,
5:   α_j = (r_j, r̃_0)/(A p̃_j, r̃_0),
6:   s_j = r_j - α_j A p̃_j,
7:   s̃_j = M^{-1} s_j,
8:   v_j = A s̃_j,
9:   ω_j = (v_j, s_j)/(v_j, v_j),
10:  x_{j+1} = x_j + α_j p̃_j + ω_j s̃_j,
11:  r_{j+1} = s_j - ω_j v_j,
12:  β_j = (r_{j+1}, r̃_0)/(r_j, r̃_0) × (α_j/ω_j),
13:  p_{j+1} = r_{j+1} + β_j (p_j - ω_j A p̃_j),
14: end for
3.2 Preconditioning Techniques
In this section we give some of the well known preconditioning techniques. Among others we will treat the diagonal preconditioner and incomplete LU factorizations.
Finding a good preconditioner to solve a sparse linear system is often viewed as a combination of art and science. Roughly speaking, a preconditioner is an explicit or implicit modification of an original linear system which makes it "easier" to solve by a given iterative method. An example of an explicit form of preconditioning is to scale all rows of the system to make the diagonal elements equal to one. The resulting system can be solved by a Krylov subspace method and may require fewer iterations to converge than the original system. We remark that this is not guaranteed. As another example, solving the linear system
example, solving the linear system
M −1 Ax = M −1 b,
(3.10)
where M^{-1} is some complicated mapping that may involve FFT transforms, integral calculations, etc., may be another form of preconditioning. Here, it is unlikely that the matrices M^{-1} and M^{-1} A can be computed explicitly. Instead, the iterative processes operate with A and with M^{-1} whenever needed. In practice, the (explicit or implicit) preconditioning operation M^{-1} should be inexpensive to apply to an arbitrary vector.
In this report we only treat matrix based preconditioners. Analytic
preconditioners like AILU, SoV, . . ., are not suited for the linear systems
arising from our application.
3.2.1 Diagonal Scaling
We transform the original system Ax = b, with A SPD, to the transformed system

    Ã x̃ = b̃,        (3.11)

where Ã = P^{-1} A P^{-T}, x = P^{-T} x̃ and b̃ = P^{-1} b. The matrix M = P^{-1} P^{-T} is called the preconditioner.
A simple choice for P is a diagonal matrix with the elements

    P_{i,i} = sqrt(A_{i,i}),   for i = 1, 2, ..., n.        (3.12)

Then, it is easy to derive that

    Ã_{i,i} = P_{i,i}^{-1} A_{i,i} P_{i,i}^{-T} = 1,   for i = 1, 2, ..., n.        (3.13)
In [30] it has been shown that for this choice for P the condition number
of à is minimized. For this preconditioner it is advantageous to apply CG
to Ãx̃ = b̃, since à is easy to calculate.
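A small sketch of the scaling (3.11)-(3.13) for a dense SPD matrix is given below; the function name and the returned scaling vector are assumptions made for the example.

import numpy as np

def diagonally_scaled_system(A, b):
    # Symmetric diagonal scaling (3.11)-(3.12) with P = diag(sqrt(A_ii)).
    d = np.sqrt(np.diag(A))
    Pinv = 1.0 / d
    A_tilde = (Pinv[:, None] * A) * Pinv[None, :]   # P^{-1} A P^{-T}, unit diagonal
    b_tilde = Pinv * b
    return A_tilde, b_tilde, Pinv                   # recover x = Pinv * x_tilde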
3.2.2 Incomplete LU Factorization
One of the simplest ways to obtain a preconditioner is to perform an incomplete factorization of the original matrix A. This gives a factorization of the
form A = LU − R, where L and U have the same non-zero structure as the
lower and upper parts of the original matrix A respectively, and R is the
residual of the factorization.
In a practical implementation the ILU factorization depends on the implementation of the Gaussian elimination. The general ILU factorization is given in Algorithm 3.5. Here P is the zero pattern set such that

    P ⊂ {(i, j) : i ≠ j; 1 ≤ i, j ≤ n},        (3.14)

and a_{ij} are the elements of A.
Algorithm 3.5 General ILU Factorization (IKJ version)
1: for i = 2, ..., n do
2:   for k = 1, ..., i-1 and if (i, k) ∉ P do
3:     a_{ik} = a_{ik}/a_{kk},
4:     for j = k+1, ..., n and for (i, j) ∉ P do
5:       a_{ij} = a_{ij} - a_{ik} a_{kj},
6:     end for
7:   end for
8: end for
If we choose for P the empty set, i.e. P = ∅, then Algorithm 3.5 gives the complete LU decomposition of A.
Zero Fill-In ILU (ILU(0))
Denote by N Z(A) the set of pairs (i, j), 1 ≤ i, j ≤ n such that a ij 6= 0 and
by Z(A) the set of pairs (i, j), 1 ≤ i, j ≤ n such that a ij = 0.
The ILU technique with no fill-in, denoted by ILU(0), consists of taking
the zero pattern P to be precisely the zero pattern of A. This defines the
ILU factorization in general terms :
Choose any pair of L and U such that the elements of A − LU
are zero in the locations N Z(A).
These constraints do not define the ILU(0) factors in a unique way since
there are, in general, infinitely many pairs of L and U which satisfy these
requirements. However, the standard ILU(0) technique is defined constructively using Algorithm 3.5 with P = Z(A) and adding the requirement
Lii = 1. ILU(0) reduces Algorithm 3.5 to Algorithm 3.6. The accuracy
Algorithm 3.6 ILU(0)
1: for i = 2, ..., n do
2:   for k = 1, ..., i-1 and if (i, k) ∉ Z(A) do
3:     a_{ik} = a_{ik}/a_{kk},
4:     for j = k+1, k+2, ..., n and for (i, j) ∉ Z(A) do
5:       a_{ij} = a_{ij} - a_{ik} a_{kj},
6:     end for
7:   end for
8: end for
The accuracy of the ILU(0) factorization may be insufficient to yield an adequate rate of convergence, see for example [28, Section 10.3.2].
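For illustration, a dense-matrix version of Algorithm 3.6 is sketched below; a practical implementation would of course work with sparse storage, and the function name is an assumption.

import numpy as np

def ilu0(A):
    # Algorithm 3.6 (IKJ version) on a dense array; fill-in is restricted to the
    # nonzero pattern of A. The strict lower part of the result holds L (unit
    # diagonal omitted), the upper part holds U.
    a = np.array(A, dtype=float)
    nz = (a != 0)
    n = a.shape[0]
    for i in range(1, n):
        for k in range(i):
            if not nz[i, k]:
                continue
            a[i, k] = a[i, k] / a[k, k]
            for j in range(k + 1, n):
                if nz[i, j]:
                    a[i, j] = a[i, j] - a[i, k] * a[k, j]
    return a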
ILU(p)
In order to improve this convergence rate as well as the efficiency, more
accurate ILU factorizations can be performed by allowing some fill-ins. This
is better known as the ILU(p) method, where p is the level of fill-in. In this
section we explain ILU(p) in particular for p = 1.
The ILU(1) factorization results from taking P to be the zero pattern of the product L_0 U_0, where L_0 and U_0 are obtained via the ILU(0) factorization. By doing this, we actually consider a matrix with additional off-diagonal elements, which are zero in the original matrix A. The factors L_1 and U_1 of ILU(1) are obtained by applying ILU(0) to this matrix, i.e. the matrix L_0 U_0.
By induction, we can define ILU(p) for p = 2, 3, . . . Remark that for
increasing values of p we get a higher amount of fill-in in both L and U ,
which could be inefficient. Therefore, one should look for the optimal mix
between the amount of work and the convergence rate.
Other variants of ILU
There are several variants of ILU(p) which differ from the variant described above. For instance, we have the Modified ILU (MILU) factorization, which attempts to reduce the effect of dropping by compensating for the discarded entries. A popular strategy is to add up all the elements that have been dropped at the completion of the k-loop in Algorithm 3.5. Then this sum is subtracted from the diagonal entry in U. The compensation strategy described above is the MILU method.
Another variant of ILU with a more accurate factorization is ILUT, or ILU(tol). Roughly speaking, the ILUT method drops elements which are smaller than a specified value τ. For example, in [28], τ = 10^{-4}.
More details about MILU and ILUT can be found in [28, Chapter 10].
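As a practical note (an illustration, not part of the original text), SciPy offers a threshold-based incomplete LU in the spirit of ILUT; the sketch below wraps it as a preconditioner for GMRES. The test matrix, drop tolerance and fill factor are illustrative assumptions.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.diags([-1.2, 2.0, -0.8], [-1, 0, 1], shape=(n, n), format='csc')   # test matrix
b = np.ones(n)

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)         # threshold-based incomplete LU
M = spla.LinearOperator((n, n), matvec=ilu.solve)          # preconditioning operation M^{-1}
x, info = spla.gmres(A, b, M=M)                            # info == 0 signals convergence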
3.3 Incomplete Cholesky Factorization
In the case that the matrix A is SPD we are able to construct an incomplete Cholesky factorization, i.e. A = LL^T - R. The matrix L has the same non-zero structure as the lower part of A. This describes the IC(0) preconditioner. The IC(p) algorithm, for p ≥ 1, can be constructed in an analogous way as ILU(p) was constructed in the previous section. For more information about the incomplete Cholesky factorization we refer to [29].
3.4 Multigrid
The last remark on preconditioners is that also multigrid can be used as
a preconditioning technique. This can easily be done by choosing M −1 =
(I − Q)A−1 , where Q is the multigrid iteration operator. See for example
[22].
With a varying preconditioner, such as multigrid with a different cycle in each iteration, a Krylov subspace method in which the preconditioner can change from iteration to iteration is needed. The Flexible GMRES (FGMRES) method, or the GMRESR method, allows such a varying preconditioner. More information on multigrid and on multigrid as a preconditioner can be found in e.g. [22].
Part V
Concluding Remarks
Chapter 1 Summary and Conclusions
The process of CVD is mathematically described by :
1. continuity equation,
2. incompressible Navier-Stokes equations,
3. transport equation for thermal energy,
4. transport equation for gas species, or advection-diffusion-reaction equations,
5. ideal gas law.
The main purpose of this research project is to develop robust and efficient
solvers for the heat reaction system, i.e., the coupled system of the transport
equation for thermal energy and the advection-diffusion-reaction equations.
The first approximation is to solve the test problem, which consists of a
small number (say 10) of coupled advection-diffusion-reaction equations.
Using the Method of Lines approach, we first discretize in space using the finite volume approach, resulting in an ODE system

    y' = f(t, y),    y ∈ R^N,    f(t, y) : R^{N+1} → R^N.        (1.1)

In (1.1) the integer N is the number of grid cells in the finite volume discretization. The ODE system (1.1) is a so-called stiff system. The next step is to find a suitable time integration method. In general this method will contain implicit parts, or may even be fully implicit, which results in nonlinear systems. To solve nonlinear systems, solutions of linear systems are also needed. These remarks are summarized in Figure 1.1.
Figure 1.1: Schematic representation of the steps to solve the mathematical model of CVD: Mathematical Model → Spatial Discretization → Time Integration Method → Nonlinear Solver → Linear Solver.
1.1 Time Integration Methods
In the literature one can find many time integration methods that have been developed to integrate stiff systems. Nevertheless, the most successful ones in practical applications are usually the most well-known ones, such as
• Rosenbrock methods,
• BDF (implicit linear multi-step) methods,
• implicit Runge-Kutta methods, like the RADAU methods.
We remark that these methods are fully implicit, except Rosenbrock methods which are derived from DIRK methods. The idea behind Rosenbrock
methods is not to solve the nonlinear equations resulting from the implicit
scheme for the stage approximations, but to compute the intermediate approximations from the linearized nonlinear system. See Part II, Section 2.2.
More recent research in the area of solving advection-diffusion-reaction
equations is focused on “splitting up” the ODE system (1.1) into slow and
fast varying components. Splitting up an ODE system can be done in two
ways: splitting it up into two subsystems and solving them separately, or splitting the system into a part for explicit and a part for implicit treatment.
Examples of the first method are Operator Splitting and Multi-rate (Runge
Kutta) methods. IMEX extensions of Runge Kutta (Chebyshev) methods
are examples of the second class.
1.2 Nonlinear and Linear Solvers
The iterative methods studied for solving linear and nonlinear systems are all standard techniques, see Parts III and IV. The most important question is how to find, for a given time integration method, the most suitable (with respect to efficiency and robustness) nonlinear solver and, where needed, linear solver.
Chapter 2 Future Research
We formulate the research subjects for future work.
1. 2D Test-problem
2. Selection of time integration methods
3. Investigate how coupling and stiffness influence stability
4. Relation between time integration methods and (non)linear solvers
5. 3D Test-problem: the aim is to solve the CVD system with 50 species on a 50 × 50 × 50 grid
2.1 Test-problem and Time Integration Methods
The first two points of the above enumeration are connected to each other.
For the given test problem we will make a selection of time integration
methods and compare their efficiency and robustness for the CVD application. Of course, the selection will consist of both well known methods, such as Rosenbrock methods, and recently developed methods, such as IMEX RKC.
2.2 Stability
By the third point in the above enumeration we mean that for some methods, like for instance multi-rate Runge-Kutta methods, both the coupling between slow and fast components and the enormous variations in reaction rates (the latter produce stiffness) influence the stability. As shown in [20], a simple multi-rate extension of Backward Euler can have a very small stability region for one of the subsystems. For other multi-rate extensions we could not find any references on stability. To our knowledge a general theory on stability for multi-rate methods is not known. In case we apply such time integration methods, we also have to investigate the influence of coupling between subsystems and of stiffness on stability.
2.3 (Non)Linear Solvers
In order to solve the stiff ODE system resulting from spatial discretization,
some nonlinear and linear systems have to be solved. For this typical CVD
application we have to search for the optimal (non)linear solvers. By optimal
we mean efficient and robust.
2.4 Extension to Three Dimensions
As formulated in the Preface, the aim of this research project is to solve
the coupled heat / reaction system on a three dimensional spatial grid.
To have a realistic model for CVD and/or combustion one has to model
approximately 50 different chemical species. This results in a nonlinear
coupled PDE system of 49 coupled advection-diffusion-reaction equations
and one heat equation. The last step is then to integrate the developed
solver(s) in a CFD package.
Part VI
Appendices
Appendix A Collocation Methods
In this appendix we pay a little attention to collocation methods. Collocation is an old and universal concept in numerical analysis. Applied to an ordinary differential equation

    w'(t) = f(t, w(t)),        (A.1)

the idea is as follows. Search for a polynomial u(t) of degree s, with s a positive integer, whose derivative coincides at s given points with the vector field of the differential equation (A.1).
Definition A.1. For s a positive integer and c_1, ..., c_s real constants (typically between 0 and 1), the corresponding collocation polynomial u(x) of degree s is defined by

    u(x_n) = w_n                  (initial value)        (A.2)
    u'(x_n + c_i h) = f(x_n + c_i h, u(x_n + c_i h)),   i = 1, ..., s.        (A.3)

The numerical solution is then given by

    w_{n+1} = u(x_n + h).        (A.5)
If some ci ’s coincide, the collocation condition (A.3) will contain higher
derivatives and lead to multi-derivative methods, see [10]. We suppose all
ci ’s to be different.
Theorem A.2. The collocation method as defined in the above definition is equivalent to the s-stage Implicit Runge-Kutta method with

    α_{ij} = ∫_0^{c_i} L_j(t) dt,    b_j = ∫_0^1 L_j(t) dt,   and   L_j(t) = ∏_{k≠j} (t - c_k)/(c_j - c_k).        (A.6)
Proof. Put u'(x_n + c_i h) = k_i, so that

    u'(x_n + t h) = Σ_{j=1}^{s} k_j L_j(t).        (A.7)

Then integrate,

    u(x_n + c_i h) = w_n + h ∫_0^{c_i} u'(x_n + t h) dt,

and insert this into (A.3) together with (A.6). We then obtain an implicit Runge-Kutta scheme.
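As an illustration of Theorem A.2 (not part of the original text), the sketch below computes the Runge-Kutta coefficients α_{ij} and b_j from the Lagrange polynomials L_j for given nodes c_i; the function name is an assumption.

import numpy as np

def collocation_rk(c):
    # Butcher coefficients of the collocation method with nodes c (Theorem A.2):
    # alpha_ij = int_0^{c_i} L_j(t) dt and b_j = int_0^1 L_j(t) dt.
    s = len(c)
    alpha = np.zeros((s, s))
    b = np.zeros(s)
    for j in range(s):
        L = np.poly1d([1.0])                    # Lagrange basis polynomial L_j(t)
        for k in range(s):
            if k != j:
                L = L * np.poly1d([1.0, -c[k]]) / (c[j] - c[k])
        Lint = np.polyint(L)                    # antiderivative with zero constant
        b[j] = Lint(1.0)
        for i in range(s):
            alpha[i, j] = Lint(c[i])
    return alpha, b

# The Gauss nodes c = 1/2 -/+ sqrt(3)/6 reproduce the 2-stage Gauss method of order 4:
# collocation_rk(np.array([0.5 - np.sqrt(3)/6, 0.5 + np.sqrt(3)/6]))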
Appendix B Padé Approximations
Padé approximations are rational functions which, for a given degree of the numerator and the denominator, have the highest order of approximation. Their origin lies in the theory of continued fractions, and they played a fundamental role in Hermite's proof of the transcendence of e. These optimal approximations can be obtained for the exponential function e^z, as stated in the following theorem. Theorem B.1 is taken from [11].
Theorem B.1. The (k, j)-Padé approximation to e^z is given by

    R_{kj}(z) = P_{kj}(z) / Q_{kj}(z),        (B.1)

where

    P_{kj}(z) = 1 + k/(j+k) · z + [k(k-1)] / [(j+k)(j+k-1)] · z^2/2! + ...
                + [k(k-1)···1] / [(j+k)(j+k-1)···(j+1)] · z^k/k!,        (B.2)

and

    Q_{kj}(z) = 1 - j/(k+j) · z + [j(j-1)] / [(k+j)(k+j-1)] · z^2/2! - ...
                + (-1)^j [j(j-1)···1] / [(k+j)(k+j-1)···(k+1)] · z^j/j!  =  P_{jk}(-z).        (B.3)

The error e^z - R_{kj}(z) is then given by

    e^z - R_{kj}(z) = (-1)^j · [j! k!] / [(j+k)! (j+k+1)!] · z^{j+k+1} + O(z^{j+k+2}).        (B.4)

It is the unique rational approximation to e^z of order j + k such that the degrees of numerator and denominator are k and j, respectively.
In Table B.1 the first Padé approximations to e z are given.
          k = 0                  k = 1                                      k = 2
  j = 0   1                      1 + z                                      1 + z + z^2/2!
  j = 1   1/(1 - z)              (1 + (1/2)z)/(1 - (1/2)z)                  (1 + (2/3)z + (1/3)·z^2/2!)/(1 - (1/3)z)
  j = 2   1/(1 - z + z^2/2!)     (1 + (1/3)z)/(1 - (2/3)z + (1/3)·z^2/2!)   (1 + (1/2)z + (1/6)·z^2/2!)/(1 - (1/2)z + (1/6)·z^2/2!)

Table B.1: (k, j)-Padé approximations to e^z for k, j = 0, 1, 2
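The coefficients of P_kj and Q_kj in Theorem B.1 are easily generated; the small sketch below (the function name is an assumption) returns them in ascending powers of z.

from math import factorial

def pade_exp(k, j):
    # Coefficients of P_kj and Q_kj from Theorem B.1, in ascending powers of z.
    P = [factorial(k) * factorial(j + k - i) / (factorial(j + k) * factorial(i) * factorial(k - i))
         for i in range(k + 1)]
    Q = [(-1) ** i * factorial(j) * factorial(j + k - i) / (factorial(j + k) * factorial(i) * factorial(j - i))
         for i in range(j + 1)]
    return P, Q

# pade_exp(1, 1) returns ([1.0, 0.5], [1.0, -0.5]), i.e. the (1,1) entry of Table B.1.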
Bibliography
[1] M. Bakker, Analytical Aspects of a Minimax Problem(in Dutch), Technical Note TN62, Mathematical Center, Amsterdam
[2] R.B. Bird, W.E. Stewart and E.N. Lightfoot, Transport Phenomena,
John Wiley & Sons, (2002)
[3] C.G. Broyden, A New Method of Solving Nonlinear Simultaneous Equations, Computer Journal, 12, pp. 94-99, (1969)
[4] C.F. Curtiss and J.O. Hirschfelder, Integration of Stiff Equations, Proc.
Nat. Acad. Sci. 38, pp. 235-243, (1952)
[5] G. Dahlquist, Convergence and Stability in the Numerical Integration
of Ordinary Differential Equations, Math. Scand. 4, pp. 33-53, (1956)
[6] Ch. Engstler and Ch. Lubich, Multi-rate Extrapolation Methods for
Differential Equations with different time scales, Computing 58, pp.
173 - 185, (1997)
[7] F.C. Eversteijn, P.J.W. Severin, C.H.J. van den Brekel and H.L. Peek
A Stagnant Layer Model for the Epitaxial Growth of Silicon from Silane
in a Horizontal Reactor, J. Electrochem. Soc. 117, pp.925-931, (1970)
[8] V. Faber and T. Manteuffel, Necessary and sufficient conditions for the
existence of a conjugate gradient method, SIAM J. Num. Anal., 21, pp.
356-362, (1984)
[9] A. Guillo and B. Lago, Domaine de Stabilité Associé aux Formules d'Intégration Numérique d'Équations Différentielles à Pas Séparés et à Pas Liés. Recherche de Formules à Grand Rayon de Stabilité, Ier Congr. Assoc. Fran. Calcul, AFCAL, Grenoble, pp. 43-56, (1961)
[10] E. Hairer, S.P. Nørsett and G. Wanner, Solving Ordinary Differential Equations I : Nonstiff Problems, Springer Series in Computational
Mathematics, 8, Springer, Berlin, (1987)
[11] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II :
Stiff and Differential-Algebraic Problems, Second Edition, Springer Series in Computational Mathematics, 14, Springer, Berlin, (1996)
[12] P.J. van der Houwen and B.P. Sommeijer, On the Internal Stability of
Explicit, m-stage Runge-Kutta Methods for Large m Values, Z. Angew.
Math. Mech., vol 60, pp. 479-485
[13] W. Hundsdorfer and J.G. Verwer, Numerical Solution of TimeDependent Advection-Diffusion-Reaction Equations, Springer Series in
Computational Mathematics, 33, Springer, Berlin, (2003)
[14] K.F. Jensen, Modeling of Chemical Vapor Deposition Reactors, in Modeling of Chemical Vapor Deposition Reactors for Semiconductor Fabrication, Course notes, University of California Extension, Berkeley,
USA, (1988)
[15] C.T. Kelley, Solving Nonlinear Equations with Newton’s Method, Fundamentals of Algorithms, SIAM, (2003)
[16] C.R. Kleijn, Transport Phenomena in Chemical Vapor Deposition Reactors, PhD thesis, Delft University of Technology, (1991)
[17] C.R. Kleijn, Computational Modeling of Transport Phenomena and
Detailed Chemistry in Chemical Vapor Deposition- A Benchmark Solution, Thin Solid Films, 365, pp. 294-306, (2000)
[18] A.E.T. Kuiper, C.H.K. van den Brekel, J. de Groot and F.W. Veltkamp,
Modeling of Low Pressure CVD Processes, J. Electrochem. Soc. 129,
pp. 2288-2291, (1982)
[19] A. Kværnø and P. Rentrop, Low Order Multi-rate Runge-Kutta Methods in Electric Circuit Simulation, Preprint No. 99/1, IWRMM, Universität Karlsruhe (TH), (1999)
[20] A. Kværnø, Stability of Multi-rate Runge-Kutta Schemes, Int. J. Differ.
Equ. Appl., pp.97-105, (2000)
[21] M. Meyyapan, Computational Modeling in Semiconductor Processing,
Artech House, Boston, (1995)
[22] C.W. Oosterlee, H. Bijl, H.X. Lin, S.W. de Leeuw, J.B. Perot, C. Vuik,
P. Wesseling, Lecture Notes for Course “Computational Science and
Engineering”, Lecture Notes, Delft University of Technology, (2005)
[23] J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear
Equations in Several Variables, Reprint of the 1970 original, Classics in
Applied Mathematics 30, SIAM, Philadelphia, (2000)
[24] S.V. Patankar, Numerical Heat Transfer and Fluid Flow, Hemisphere
Publ. Corp., Washington, (1980)
[25] A. Prothero and A. Robinson, On the Stability and Accuracy of OneStep Methods for Solving Stiff Systems of Ordinary Differential Equations, Math. of Comput., 28, pp.145-162, (1974)
[26] H.H. Robertson, The Solution of a Set of Reaction Rate Equations,
In : J. Walsh ed. : Numer. Anal., an Introduction, Academ. Press, pp.
178-182, (1966)
[27] H.H. Rosenbrock, Some General Implicit Processes for the Numerical
Solution of Differential Equations, Comp. J. 5, pp. 329-330, (1963)
[28] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing,
Boston, (1996)
[29] A. Segal and C. Vuik, Computational Fluid Dynamics II, J.M. Burgerscentrum Course Notes, (2004)
[30] A. van der Sluis, Conditioning, Equilibration and Pivoting in Linear
Algebraic Systems, Numer. Math., 15, pp. 74-86, (1970)
[31] P. Sonneveld, CGS : A Fast Lanczos Type Solver for Non-symmetric
Linear Systems, SIAM J. Sci. Stat. Comp., 10, pp. 36-52, (1989)
[32] B. Sportisse, An Analysis of Operator Splitting in the Stiff Case, J.
Comput. Phys. 161, pp. 140-168, (2000)
[33] G. Strang, On the Construction and Comparison of Difference Schemes,
SIAM J. Numer. Anal. 5, pp.506-517, (1968)
[34] M. Suzuki, Fractal Decomposition of Exponential Operators with Applications to Many-Body Theories and Monte Carlo Simulations, Phys.
Lett. A 146, pp. 319-323, (1990)
[35] J.G. Verwer and B. Sportisse, A Note on Operator Splitting in a Stiff
Linear Case, Report MAS-R9830, CWI, Amsterdam, (1998)
[36] J.G. Verwer and B.P. Sommeijer, An Implicit-Explicit Runge-Kutta-Chebyshev Scheme for Diffusion-Reaction Equations, SIAM Journal on
Sci. Comp., 25, pp.1824-1835, (2004)
[37] J.G. Verwer, B.P. Sommeijer and W. Hundsdorfer, RKC Time-Stepping
for Advection-Diffusion-Reaction Problems, Journal of Comp. Physics,
201, pp. 61-79, (2004)
[38] C. Vuik, Numerieke Methoden voor Differentiaalvergelijkingen, Lecture
Notes, Delft University of Technology, (2000)
[39] H. Yoshida, Construction of Higher Order Symplectic Integrators, Phys.
Lett. A 150, pp. 262-268, (1990)