TECHNISCHE UNIVERSITÄT MÜNCHEN
Department of Informatics
Scientific Computing
A Framework for Parallel PDE Solvers on Multiscale Adaptive
Cartesian Grids
Tobias Weinzierl
Vollständiger Abdruck der von der Fakultät für Informatik der Technischen
Universität München zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften (Dr. rer. nat.)
genehmigten Dissertation.
Vorsitzender: Univ.-Prof. Bernd Brügge, Ph.D.
Prüfer der Dissertation:
1. Univ.-Prof. Dr. Hans-Joachim Bungartz
2. Univ.-Prof. Dr. Dr. h.c. Christoph Zenger
3. Univ.-Prof. David E. Keyes, Ph.D., Columbia University, New York/USA
Die Dissertation wurde am 23. März 2009 bei der Technischen Universität München
eingereicht und durch die Fakultät für Informatik am 23. Juni 2009 angenommen.
Abstract
In many fields of application in science and engineering, the grid-based numerical
simulation of partial differential equations leads to new scientific insights due to
increasing computing resources, increasing amounts of data, and increasing efficiency
of the algorithms used. All three of them facilitate more detailed models and more
reliable simulations. Yet, a growing code complexity accompanies this progress.
To tackle this complexity, more and more solvers for partial differential equations rely on frameworks. State-of-the-art frameworks have to support multiscale
algorithms for arbitrary dimensional problems with dynamic, i.e. changing, discretisations. Despite this flexibility of data and data access, the realisation has to have
low memory requirements, as the gap between computing power and memory bandwidth and access speed broadens. It furthermore has to exploit modern computer
architectures and has to scale on parallel computers, even if the data structures and
the data access pattern change permanently.
This thesis presents a framework tackling these challenges with adaptive Cartesian grids resulting from a generalised octree concept. They are traversed with
space-filling curves and a small yet fixed number of stacks acting as data containers.
Within this context, dynamically changing multiscale grids of arbitrary dimension
can be stored with a few bits per datum, whereas classical approaches often require several kilobytes and rely on more flexible data structures that entail a runtime overhead. The
thesis formalises the approach and reduces the implementation complexity—the algorithmic principle itself has been well-known for several years—from an exponential
to a linear number of containers in the spatial dimension of the problem. A modification of the grid traversal originally following a depth-first order then facilitates a
domain decomposition strategy with dynamic load-balancing.
The framework is named after the Italian mathematician Giuseppe Peano who
discovered the underlying space-filling curve. Its potential is demonstrated via a
geometric multigrid solver for the Poisson equation with low memory requirements,
good cache hit rates, a posteriori adaptive grids, and a dynamic load balancing—
a combination of characteristics rarely found. As the characteristics result from
the framework usage, the thesis paves the way to a multiscale computational fluid
dynamics application on instationary, hierarchical grids that uses Peano.
Zusammenfassung
Mit steigender Rechenleistung, steigendem Datenumfang und Datendetailfülle sowie
steigender Algorithmeneffizienz liefert die Simulation partieller Differentialgleichungen neue wissenschaftliche Erkenntnisse in vielen Anwendungsbereichen aus Naturwissenschaft und Technik. Dabei sei hier die numerische Simulation auf räumlichen
Diskretisierungen thematisiert. Alle vier Einflussfaktoren stoßen die Tür zu immer detaillierteren Modellen und immer verlässlicheren Simulationen auf, jedoch
begleitet eine beständig wachsende Quelltextkomplexität eben diesen Fortschritt.
Um genau diese Komplexität in den Griff zu bekommen, greifen mehr und mehr
Umsetzungen von Lösern zu partiellen Differentialgleichungen auf Frameworks, also
vorgefertigte Quelltextumgebungen respektive -ökosysteme, zurück. Kompetitive,
zeitgemäße Frameworks müssen heute Multiskalenalgorithmik für beliebig dimensionale Probleme mit sich ständig dynamisch ändernden räumlichen Diskretisierungen unterstützen. Trotz der geforderten Flexibilität in Daten und Datenzugriff
sollte die Realisierung jedoch geringe Speicheranforderungen aufweisen, da sich zwischen vorhandener Rechenleistung und Speicherbandbreite eine Kluft auftut, die sich
beständig weitet. Darüber hinaus muss sie moderne Rechnerarchitekturen sinnvoll, also ökonomisch und effizient, nutzen und sollte auch dann auf Parallelrechnern
skalieren, wenn sich Datenstrukturen und Datenzugriffsmuster permanent ändern.
Diese Arbeit präsentiert eine solche Umgebung, die eben angesprochene Herausforderungen mit adaptiven kartesischen Gittern angeht. Diese entstammen einem
verallgemeinerten Oktalbaumkonzept und werden mit raumfüllenden Kurven durchlaufen, wobei eine kleine, jedoch festgeschriebene Zahl an Stapeln als Daten-Container fungiert. In solch einer Umgebung lassen sich sich dynamisch ändernde Gitter
mit wenigen Bits pro Datum ablegen. Klassische Ansätze veranschlagen hierzu oftmals mehrere tausend Bytes und verlangen nach alternativen, flexiblen Datenstrukturen, die einen großen Laufzeit-Overhead nach sich ziehen. Die Arbeit formalisiert
zum einen den neuen Ansatz, zum anderen reduziert sie die Implementierungskomplexität—das algorithmische Prinzip ist seit einigen Jahren wohlbekannt—von einem
exponentiellen auf ein lineares Wachstum in der Raumdimension des Problems. Eine
Modifikation des Traversierungsprinzips, das originär einer einfachen Tiefensuche
nacheifert, erlaubt schlussendlich, eine Rechengebietszerlegungsstrategie mit einem
dynamischen Lastausgleich umzusetzen.
Das Framework ist nach dem italienischen Mathematiker Giuseppe Peano benannt, der die zugrundeliegende raumfüllende Kurve entdeckt hat. Das Potential der Implementierung wird anhand eines geometrischen Mehrgitterlösers für die Poisson-Gleichung dargelegt. Dieser weist in Folge dann einen sehr geringen Speicherbedarf
in Begleitung sehr guter Cache-Trefferraten auf, wobei auch dynamische Gitterverfeinerung und Lastbalancierung zur Anwendung kommen. Eine Kombination all
dieser Charakteristiken ist üblicherweise schwer zu finden. Da sie direkt auf die
Benutzung des Frameworks zurückzuführen ist, besteht die berechtigte Hoffnung,
dass dasselbe den Weg zu einer Multiskalen-Strömungsdynamikanwendung auf
instationären, hierarchischen Gittern ebnet.
Introduction
Numerical simulation of phenomena modelled by partial differential equations is
of utmost importance for new scientific insights and industrial innovation, and
the solution of the equations belongs to the grand challenges of many disciplines.
Simulation-driven quantum leaps can be found in many fields of application—from
astrophysics illustrating Nature's title page on June 2, 2005, via racing yachts winning the America's Cup in 2003 and 2007, to nanophysics predicting black
holes created by next generation particle colliders such as the Large Hadron Collider
at CERN. Consequently, more and more publications outline the significance of the
discipline—the PITAC Report on Computational Science released in 2005 by the
the President’s Information Technology Advisory Committee [5] is perhaps most
prominent—more and more supercomputers relocate the front line of processing
capacity, and more and more national and international supercomputing centers
and initiatives enter the spotlight of the public attention. Computational science
and engineering being the third pillar of research besides modelling and experiment
is an omnipresent term.
The foundation of this third pillar is the trinity of data, hardware, and algorithmics. It is held together by software acting as integrator and catalyst: software, on the one hand, brings the three disciplines together; on the other hand, it has to exploit synergy effects and make the whole more than the sum of its parts. Given the growing complexity of the individual disciplines and the frictional losses of a naive combination of single techniques and paradigms, this task is far from trivial. Data
becomes available in overwhelming quantity—installations such as the Large Hadron
Collider need a complete supercomputer network to record, interpret, and postprocess the measurements—requiring adaptive multi-resolution, high-dimensional,
and distributed data formats. Algorithms exhibit complicated multiscale behaviour
incorporating multiphysics models that couple equations describing different phenomena and referring to different spatial scales. Hardware, finally, becomes massively parallel and tailored for specialised code and single-purpose data streams.
An example illustrates the resulting difficulties: A high performance computer
such as SGI’s Altix system benefits from simple, homogeneous code processing data
sequentially without case distinctions and jumps in memory. Due to a data decomposition approach, such a code is usually deployed to multiple cores, each running the same execution stream; it exploits the processors' cache hierarchies due to the sequentiality of the data access, and it fills the vast number of registers
without an out-of-order logic due to the lack of case-distinctions. A multi-resolution
algorithm such as a multigrid solver though lacks such a homogeneity—the operators change from iteration to iteration, and the algorithm’s nature imposes non-local
memory accesses. Real-world data such as permeability coefficients of porous media
finally make standard multigrid algorithms break down, as the latter rely on an
isotropic, homogeneous multiscale behaviour of the underlying partial differential
equations. Simulation codes on supercomputers thus often perform at disappointing speed, lack sophisticated yet well-understood algorithms and algorithmic improvements, or handle dramatically simplified data. In the worst case, they suffer from a combination of all three.
Keeping all the details from all the disciplines in mind throughout the high performance software development cycle is almost impossible for single developers. At
least, it is economically unreasonable because of the time to be spent on sufficiently
elaborate implementations if they have to be built from scratch. Unfortunately,
even the software that is already available often lacks a sufficient level of quality, and
the PITAC report coins the phrase of a software crisis: [. . . ] today’s computational
science ecosystem is unbalanced, with a software base that is inadequate to keep pace
with and support evolving hardware and application needs.
Establishing such a software base and a reuse culture is an enormous challenge
covering every issue from the documentation of best practices and design patterns
over the programming to standardised interfaces to the black-box reuse of complete
code repositories. Two code reuse paradigms coexist and compete: Software is either
built bottom-up by a combination of individual toolboxes and black-box components
from a library—the user then integrates and composes the separate parts manually or
with domain specific high-level languages—or it integrates into frameworks providing
a sophisticated and elaborate source code and feature environment. Pros and cons
adhere to both approaches with respect to flexibility, extendability, exchangeability
of components, and applicability to different problems. Yet, it finally comes down
to the question whether the resulting code efficiently implements a sophisticated
algorithm on a piece of hardware with the given data sets, while the code features are
encapsulated and the implementation obeys the separation-of-concerns paradigm,
i.e. while the code preserves a high quality. In this thesis, I present the C++
framework Peano addressing selected facets of these challenges.
The Spacetree Paradigm in a Framework for Partial Differential Equations
The idea of spacetrees runs through every algorithmic and architectural design decision of the framework. Spacetrees start from a hypercube; a successive, recursive
refinement of this hypercube then yields a spatial discretisation corresponding to an
adaptive Cartesian grid. They look back on a long tradition at our group in Munich. Favoured for their simplicity and their multiscale spatial representation, they
have proven of great value for several proof-of-concept implementations of fluid dynamic and fluid-structure interaction codes, geometric multigrid algorithms, solvers
dynamically refining the grid where appropriate, and so forth.
It is common wisdom that spacetrees benefit from space-filling curves: With a
spacetree traversal following such a curve, one can encode the spacetree efficiently.
Efficiency hereby covers both the amount of memory required and the memory
access characteristics with respect to indirect addressing and memory/cache hierarchies. All data structures needed for such a traversal are stacks if the particular
space-filling curve is carefully chosen—a fact not obvious to our group before 2003
[35, 63]. Several prototypes confirmed this observation for different spatial dimensions, and several prototypes confirmed that parallelisation, geometric multigrid
algorithms, dynamic adaptivity, and so forth also fit perfectly to the spacetree-stack
combination—each a piece of software on its own.
This thesis picks up the idea of spacetrees traversed by space-filling curves. It formalises the discretisation principle and generalises as well as simplifies the traversal
and grid management algorithm. I end up with an efficient grid management that is
able to handle adaptive Cartesian grids, runs in parallel, supports dynamic adaptivity as well as multiscale algorithms, and so forth—I combine, integrate, and extend
many features of the individual prototypes available before. Since the traversal
acts as an algorithmic blueprint with concrete algorithms plugging into the traversal, this establishes a framework for sophisticated algorithms for challenging partial
differential equations.
In the public eye, scientific computing typically comes second behind the disciplines applying high performance computing, since few people are originally interested in methodological, (software-)aesthetic, or even runtime improvements. People
are interested in new insights or better understanding. Freely adapted from Richard
Hamming, the purpose of computing is neither runtime nor numbers but insight.
Alternatively, [. . . ] the ultimate purpose of computing is insight, not petaflop/s [. . . ]
[44, p. 2]. A framework’s worth consequently becomes apparent as soon as a scientific
or engineering code based upon the very framework yields significant results.
I demonstrate the applicability of my framework with a simple matrix-free, multiplicative, geometric multigrid solver for the Poisson equation. Matrix-free is the keyword here, as matrix-free methods exhibit a number of unique selling points that are especially important for many simulations. They are discussed in a moment. With no global matrices at hand, however, it is no longer trivial to apply
direct or iterative solvers or sophisticated preconditioners to improve the solver’s
convergence. Consequently, the solver of the linear equation system itself has to make up for this constraint. Fortunately, multiplicative multigrid algorithms already
are optimal solvers for elliptic problems and they play in the upper league for many
other problems [78]. While the Poisson solver demonstrates that it is possible to
implement such an optimal solver within the framework and that the solver adopts
all the framework’s properties, it is far from a real-world problem. For really interesting, real-world problems, I want to refer the reader to subsequent theses and
publications—namely [9] and [60].
Selected Challenges in High Performance Computing
This thesis tackles selected challenges high performance computing nowadays faces.
Because of the framework approach, the benefits resulting from each solution tackling a challenge carry over to every solver implemented.
From the data point of view, Peano comes along with very low memory requirements. Furthermore, it is not restricted by the available main memory. For dynamic,
arbitrarily adaptive grids, the memory-per-vertex ratio in many out-of-the-box solvers exceeds several hundred bytes and restricts the maximum size of simulation runs. Due to the stack-based containers and the spacetree discretisation, the
algorithm here gets along with less than one byte per degree of freedom. This byte
holds the complete adjacency, connectivity, and adaptivity information. With such
an approach, one can handle problems that are orders of magnitude bigger than many
conventional simulation runs, and the algorithm is not bandwidth-restricted, i.e. the
memory connection does not slow down the code.
Operating systems swap data to the hard disk whenever a code's memory requirements outrun the main memory. Yet, the application's performance then usually breaks down, and the maximum swap size is moreover restricted by the
operating system’s installation. In general, simulations have to get along with the
available memory. This constraint is particularly dominant for problems suffering from the curse of dimensionality or requiring almost regular and very fine grids. Direct numerical simulation of turbulent flows and high-dimensional problems arising in mathematical finance are popular examples. Due to the grid's stack-based persistence management (all data containers are stacks), the algorithm here is able to offer
a tailored file swapping strategy temporarily storing data subsets of arbitrary size
to disk without runtime penalties. As a result, solely the application’s instruction
stream and a fixed record buffer have to fit into the memory.
From the algorithmic point of view, Peano's Poisson solver implements a state-of-the-art multiplicative, geometric multigrid with a full approximation storage on
the adaptive Cartesian grid, i.e. it holds a solution representation on every grid
level. While the implementation follows standard multigrid text books, it highlights
four implementation advantages resulting from matrix-free methods—the system
matrices are never assembled, instead, the matrix-vector products associated with
the multigrid solver are evaluated on-the-fly throughout the grid traversal. First,
generating systems and spacetrees fit perfectly together. Switching from a nodal
or hierarchical basis to a hierarchical generating system simplifies the multigrid’s
arithmetics and yields the reference system to compute the coarse grid corrections
for free. Second, it circumvents the challenge of elaborating a fitting matrix
storage format. The set of linear equations resulting from arbitrary adaptive grids
even for simple partial differential equations with simple operator discretisations
comprises matrices exhibiting a complicated sparsity pattern, and the realisation or
usage of appropriate sparse matrix formats is laborious and error-prone. Third, if
the algorithm assembled system matrices, it would need additional memory to hold
these data structures. This memory is saved, i.e. solely the grid size determines the
algorithm’s need for memory. Finally, assembling a system matrix involves a certain
runtime overhead disappearing for matrix-free methods. The latter three arguments
gain weight if the grid and, hence, the matrices change permanently either due to a
dynamic grid refinement or a multiscale algorithm.
From the hardware point of view, Peano exploits modern cluster architectures
in three different ways. First, it provides uniform data access cost. To access data
stored in the processing unit's registers is orders of magnitude cheaper in terms of runtime
than fetching data from the main memory (or even the hard disk). Modern computer architectures address this fact and introduce a cascade of caches holding small sets
of copies from the main memory. Accessing these copies is significantly faster than
the access to the main memory, i.e. the runtime penalty for accessing non-register
values is reduced if the algorithm does not read from or write to the main memory
directly. As a result, the runtime per record access is not constant, but it depends
on the record’s location. Due to the space-filling curves and the stack-based data
containers, Peano’s traversal restricts to cache and register accesses. The number
of non-cache accesses, i.e. cache misses, is negligible, and the runtime per record is
constant. Peano is cache-oblivious. This algorithmic ingredient goes hand in hand
with the fact that the hard disk swapping does not thwart the simulation code: the
cost per record remains constant, even if the memory requirements exceed the main
memory.
Second, Peano provides a domain decomposition strategy deploying the spacetree
among multiple computing nodes. The resulting advantage is twofold: On the
one hand, the individual nodes run in parallel, i.e. the performance improves. On
the other hand, the individual nodes require less memory, i.e. if the experiment exceeded the (hard disk and/or main) memory of one computing node, several nodes in combination can cope with the problem. Throughout the decomposition, the
space-filling curve ensures that the subpartitions are quasi-optimal, i.e. the data
that is to be exchanged to couple the smaller problems on the individual nodes
is minimal relative to the workload, and this data exchange is straightforward to
implement, i.e. it does not induce an additional sorting or mapping step. As network
interconnections are bottlenecks of parallel computers, modest exchange package
sizes are essential for parallel codes. As the exchange process integrates smoothly
into the grid management without any sophisticated reordering or synchronisation,
the homogeneous data access costs are moreover preserved.
Third, Peano provides load balancing. It distributes the workload equally among
the computing nodes, i.e. it tries in a greedy fashion to assign each node the same
amount of computations. The balancing’s realisation is itself parallelised, i.e. the
balancing itself scales with the parallel nodes and does not become a bottleneck.
Such a property is essential for massive parallel architectures. Furthermore, the
balancing works on-the-fly, i.e. it permanently adapts and optimises the partitioning.
This is essential for dynamic adaptive algorithms.
The splitting into three different classes of features coincides with many publications, and the tackled challenges can be found in a vast body of literature. Peano joins this group. Low memory requirements run through the discussion of
advantages of many structured grid approaches including adaptive mesh refinement
methods with regular patches. Matrix-free methods, cache-oblivious algorithms,
parallelisation, and load balancing are popular subjects of attention in many multigrid
solver implementations. Two important aspects of realisations in high performance
computing though are not covered at all: On the one hand, the integration of legacy
or third-party code into the Peano framework is beyond the scope of the thesis. Although it is inconsistent to motivate a framework development with code reuse and
separation of concerns without reusing software, I skip this discussion and develop
most of the realisation’s ingredients from skratch. For tests, validation, and quick
prototyping, connecting to software not tailored to the spacetree world is surely reasonable and routine [60]. To exploit Peano’s full properties and to preserve Peano’s
advantages, most codes presumably have to be adapted. I assume that white-box reuse copying code blocks, algorithmic ideas, and best practices outshines black-box code reuse. Consequently, there is a need for reuse and tailoring patterns
and strategies, and there is also a need for a careful elaboration where third-party
component usage nevertheless does make sense. This reasoning is picked up in the
conclusion. On the other hand, many papers in scientific computing concentrate on
the development of more and more elaborate and sophisticated algorithms. They
derive schemes yielding qualitatively better results per atomic computing operation
or per record beyond the result improvements due to an increased computational
effort, i.e. beyond just scaling up the problem size. The term more science per flop
[44, p. 14] perhaps describes this aspect best. Obviously, such an endeavour is beyond the foundations of the pillar of scientific computing, and, hence, beyond the
scope of this work.
While each of the tackled challenges is an interesting subject of study itself, the
combination of all of them is the outstanding highlight of the framework. Because
of the strict separation of the individual packages and concerns, any solver programmed along the ideas of the Poisson solver carries over the same properties.
This is a promising insight making me believe that the framework can be the basis
of simulations tackling challenges not solved before.
Alternative Frameworks
This vision underlies many frameworks and libraries. While a well-grounded comparison and evaluation of different code packages is beyond the scope and volume of
this thesis, I nevertheless pick out three alternative projects here and classify Peano
with respect to them. Besides giving a better impression of what Peano's characteristics and unique selling points look like, such a classification also gives hints in the conclusion as to which features Peano is missing and what Peano can learn from other endeavours. For the time being, I concentrate on highlights, similarities, and common
approaches. The three non-commercial reference systems are chosen subjectively,
i.e. their selection is not exhaustive.
The Distributed and Unified Numerics Environment toolbox DUNE is a first subject of study [3, 4]. DUNE follows a strict separation-of-concerns approach due to the definition of C++ interfaces, i.e. it defines a toolbox as a set of interfaces and delivers several implementations of these interfaces. Programming against interfaces
allows the developer to exchange both individual (sub)algorithms and selected data
structures without modifying the complete code base. Especially interesting is the idea to replace subgrids by optimised, structured patches and to distribute the grid parts transparently. Both features are hidden from the user, as the algorithms' realisations rely solely on an iterator concept.
Peano and its extensions provide a similar concept—regular patches are optimised and the parallel realisation details are hidden from the algorithms—while Peano relies on a callback mechanism, i.e. the algorithm does not actively traverse the grid data structures, but the grid calls back the algorithm. My implementation
lacks the flexibility of unstructured grids. However, it provides arbitrary, dynamic
adaptivity—a feature that usually is difficult to provide with regular patches—it
provides a multiscale representation of the computational domain, and it provides a
fine granular load balancing, where the individual elements and not whole patches
are atomic work units. It would be interesting to quantify DUNE’s overhead due to
the flexibility, to evaluate the benefits arising from unstructured grids, and to elaborate whether using optimised regular patches as a black box can overtake Peano's
holistic, structured, and inherent hierarchical approach. Furthermore, DUNE’s flexibility facilitates legacy code reuse. While code reuse is laborious for Peano if the old
code does not fit to the underlying spacetree paradigm, it is nevertheless desirable
in many places, and reuse best-practices hereby are of great value. I pick up this
aspect in the conclusion.
Next, I want to mention the Adaptive Large-scale Parallel Simulations (ALPS)
library underlying [18, 19]. These papers employ adaptive mesh refinement methods
to study mantle convection, which is a multiscale problem in both time and space:
they solve a real-world problem on a petascale supercomputer. The underlying
computations scale on massive parallel environments due to the usage of octrees
and space-filling curves. The octrees here act as the key structure for data access, and refining, coarsening, and load decomposition are directly interwoven with this data structure.
All of Peano’s features rely on k-spacetrees—a generalisation of the octree concept—
and, thus, many paradigms and algorithmic approaches exhibit similarities (the initial setup of the load decomposition, e.g., follows a tree-based top-down approach in
both codes). I am sure that these similarities rest upon fundamental principles and
patterns of any “spacetree code”, and, hence, Peano can learn from ALPS. Nevertheless, exploiting the spacetree directly for the matrix-free PDE solver is a fundamental
idea of the Peano framework, while ALPS concentrates on the grid management and
its applications employ algebraic solvers. For applications where geometric multigrid
solvers can be applied, it would be interesting to compare Peano’s holistic approach
with applications built on top of ALPS, to quantify the overhead arising from the
algebraic approach, and to study the runtime implications of both approaches.
Finally, the PDE solver Hierarchical Hybrid Grids (HHG) [26, 29] shall be mentioned. It is a matrix-free multigrid solver for elliptic problems, i.e. the operations are embedded into the grid traversal. It starts with a coarse, conforming, unstructured mesh,
distributes and balances this mesh, and then refines the individual mesh elements.
Outstanding is the enormous number of unknowns the solver is able to handle in
combination with its high performance and good scalability. From my point of
view, this is in particular due to a unique and intelligent combination of multigrid
algorithms, efficient programming techniques, and realisation patterns rarely found
in one place.
Peano’s extensions also support patches of regular refined grids, and its regular
refined subdomains resemble HHG’s structured patches. Consequently, I am sure
that efficiency patterns and best practices from the hybrid code also can be applied
to Peano’s implementation. Such a tuning makes the code exploit current computer
architectures better. In turn, it allows one to compare against a highly optimised PDE solver implementation and to quantify the performance drawbacks resulting from Peano's flexibility due to the arbitrary dynamic refinement.
Thesis Structure
The thesis’ structure roughly follows the challenge enumeration. After a preamble,
I start with a description and formalisation of the spacetree idea in Chapter 2.
Besides the interplay of spacetrees, adaptive Cartesian grids, and the continuous computational domain, this chapter establishes a common language for the subsequent text, and it introduces the concept of traversal events. They act
as plug-in points for algorithms built on top of the framework. Three chapters
covering the traversal algorithm (Chapter 3), the multigrid solver (Chapter 4), and
the parallelisation (Chapter 5) follow. Wherever possible, they do not depend on
each other, as the separation-of-concerns idea prohibits a strong coupling of the
different subjects. Besides some properties resulting from the interplay of individual
chapters, these chapters can basically be read and understood in arbitrary order.
After some experiments integrating the individual ideas in Chapter 6, one short
chapter closes the thesis, draws some conclusions, and presents expectations and
preliminary results from further and future work. The text is framed by Chapter 1, "Peano in a Nutshell", acting as a preamble, as well as by an appendix.
The preamble outlines the thesis’ content from a programmer’s point of view—a
programmer that has to implement a multigrid solver within the framework.
The big Chapters 3, 4, and 5 exhibit a similar internal construction. An
introduction summarises the aim, rationale, and reasoning of the chapter’s content.
It outlines the basic ideas, compares them to alternative and older results, and
points out which subjects are not discussed deliberately. Finally, it describes the
chapter’s structure. Several sections built upon each other follow. Most outline
their content and insights at the beginning. It thus is possible to skip individual
parts of the thesis. Some experiments study the framework’s properties chapter by
chapter. The splitting of the experiments simplifies the identification of cause-and-effect chains. A short outlook finally gives extensions and links.
Acknowledgments
This work has been conducted over a period of four years at the Chair of Scientific
Computing in Computer Science of the Technische Universität München. Throughout this time, several research activities such as conferences, graduate workshops,
and research visits abroad were co-funded by the university’s International Graduate
School of Science and Engineering (IGSSE).
Special thanks are due to Prof. Dr. Hans-Joachim Bungartz and Prof. Dr. Dr. h.c.
Christoph Zenger. The idea to traverse a spacetree with the Peano space-filling curve
originates from Christoph Zenger, and he and his research group first elaborated
ideas, prototypes, and concepts underlying many of my thoughts written down here.
His personality, enthusiasm, and encouragement made me decide to do a doctorate.
My supervisor Hans-Joachim Bungartz then provided the inspiring environment and
ecosystem in his research group to evolve the algorithmic ideas further, to integrate
hitherto separated concepts into one code, and to continue to elaborate additional
ideas and visions. His mentoring and advice significantly shaped the thesis at hand.
Last but not least, his support and personality made me decide to continue research
beyond a doctor’s degree.
Although this text comprises my personal ideas, many persons were involved in the
development process, the elaboration of alternatives, the formulation and publication of the results, and the day-to-day business accompanying research at a university. Among them, I particularly want to thank Dr. Alexander Kallischko and
Dr. Miriam Mehl. Both volunteered to proof-read the whole manuscript, and the
text for sure profits to a great extent from their corrections, their critical analysis,
and their constructive annotations. Miriam Mehl also headed the computational
fluid dynamics group, giving my ideas and software developments both a home and
an application, and in turn inspiring the code’s evolution. It is difficult to fully appreciate
the significance of such a productive environment. She played a big part
in this environment. Among the research group, I especially want to thank Tobias
Neckel. Prior to all other codes, his research used my algorithmic ideas and framework for a real-world problem. He thus became the first person to suffer
from all the confusions, permanent modifications, and meanders that come along with
the development of a big piece of software. His insights and feedback particularly
shaped this work, and the fruitful cooperation with him was essential
for the success of the group’s research. I finally want to thank all the students and
colleagues who joined the research group later, as well as all the members of the
Chair of Scientific Computing in Computer Science in Munich.
Contents
1 Peano in a Nutshell

2 Adaptive Cartesian Grids and Spacetrees
   2.1 k-spacetree
   2.2 Adaptive Cartesian Grids and k-spacetrees
   2.3 Some Formalism
   2.4 Element-wise Grid Traversal
   2.5 k-spacetree Traversals
   2.6 Vertex-based Refinement and Information Transport
   2.7 Geometry Representation
   2.8 Traversal Events
   2.9 Experiments
   2.10 Outlook

3 Spacetree Traversal and Storage
   3.1 Peano Space-Filling Curve
   3.2 Deterministic Peano Traversal for k-spacetrees
   3.3 Stack-Based Containers
   3.4 Cache Efficiency
   3.5 Some Realisation Details
   3.6 Experiments
   3.7 Outlook

4 Full Approximation Scheme
   4.1 Hierarchical Generating Systems
   4.2 Stencils and Operators
   4.3 Multigrid Ingredients
   4.4 Traversal Events
   4.5 Extensions and Realisation
   4.6 Experiments
   4.7 Outlook

5 Parallelisation
   5.1 Parallel Spacetree Traversal
   5.2 Partitions Induced by Space-Filling Curves
   5.3 Work Partitioning and Load Balancing
   5.4 Node Pool Realisation
   5.5 Parallel Iterations and HTMG
   5.6 Experiments
   5.7 Outlook

6 Numerical Experiments
   6.1 Memory Requirements
   6.2 Horizontal Tree Cuts
   6.3 Simultaneous Coarse Grid Smoothing
   6.4 MFlops on Regular Grids
   6.5 MFlops on Adaptive Grids
   6.6 Outlook

7 Conclusion

A Helper Algorithms

B Hardware

C Notation
List of Relevant Publications
M. Langlotz, M. Mehl, T. Weinzierl, and C. Zenger. SkvG: Cache-Optimal Parallel
Solution of PDEs on High Performance Computers Using Space-Trees and
Space-Filling Curves. In A. Bode and F. Durst, editors. High Performance
Computing in Science and Engineering, Garching 2004, Springer-Verlag. pp.
71–82, 2005
M. Mehl, T. Weinzierl, and C. Zenger. A cache-oblivious self-adaptive full multigrid
method. In R. D. Falgout, editor. Numerical Linear Algebra with Applications,
volume 13(2-3), pp. 275–291, 2006
H.-J. Bungartz, M. Mehl, and T. Weinzierl. A Parallel Adaptive Cartesian PDE
Solver Using Space-Filling Curves. In W. E. Nagel, W. V. Walter, and W.
Lehner, editors. Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Lecture Notes in Computer Science, volume 4128, pp.
1064–1074, 2006
M. Brenk, H.-J. Bungartz, M. Mehl, I. L. Muntean, T. Neckel, and T. Weinzierl.
Numerical Simulation of Particle Transport in a Drift Ratchet. In C. Johnson,
D. E. Keyes, and U. Rüde, editors. SIAM Journal on Scientific Computing,
volume 30(6), pp. 2777–2798, 2008
M. Mehl, M. Brenk, I. L. Muntean, T. Neckel, and T. Weinzierl. A Modular and
Efficient Simulation Environment for Fluid-Structure Interactions with Large
Domain Deformation. In M. Papadrakakis and B. H. V. Topping, editors.
Proceedings of the Sixth International Conference on Engineering Computational Technology, Civil-Comp Press, Kippen, Stirlingshire, United Kingdom,
2008
I. L. Muntean, M. Mehl, T. Neckel, and T. Weinzierl. Concepts for Efficient Flow
Solvers Based on Adaptive Cartesian Grids. In S. Wagner, M. Steinmetz, A.
Bode, and M. Brehm, editors. High Performance Computing in Science and
Engineering, Springer-Verlag, pp. 535–550, 2008
M. Mehl, T. Neckel, and T. Weinzierl. Concepts for the Efficient Implementation
of Domain Decomposition Approaches for Fluid-Structure Interactions. In
U. Langer, M. Discacciati, D. E. Keyes, O. B. Widlund, and W. Zulehner,
editors. Domain Decomposition Methods in Science and Engineering XVII,
Lecture Notes in Computational Science and Engineering, volume 60, pp. 591–598, 2008
1 Peano in a Nutshell
In a Nutshell is a preamble. Outlining the content of the individual chapters, stating the important results, and highlighting the underlying vision and rationale, it
runs briefly through the whole thesis. Since the thesis establishes a framework for
grid-based solvers for partial differential equations (PDEs), its descriptions and explanations are often abstract, technical, and formal. I even outsource concrete
applications and demonstration challenges to the conclusion or refer to other theses and publications. This chapter in turn refrains from all the technical details. I
take up the position of a developer implementing a multigrid finite element solver
with d-linear shape functions for the Poisson equation, and I show how such a solver
can benefit from the framework Peano.
Figure 1.1: Peano solves a PDE for a given computational domain on an adaptive
Cartesian grid (left, cut through a spherical domain). Yet, it does not
hold the Cartesian grid in the memory, but it stores the spacetree instead
(right).
Before the finite element method breaks down the PDE into a linear equation
system, one has to define a discretised representation of the continuous domain—a
grid. Throughout the coding, the grid’s shape, its representation in the computer,
and the available operations on the grid are important. The shape, uniquely defined
by the grid construction process, determines the numerical properties of the solution.
The data structure storing the grid (co-)determines the memory requirements of the
implementation, as well as the efficiency of grid and data reads and writes. The
operations on the grid—the signature of the grid data structure—split up into two
groups: On the one hand, the Poisson solver reads, interprets, and writes back grid-assigned data such as the solution’s approximation. On the other hand, the solver
modifies the grid itself, i.e. it refines and coarsens it. The signature thus influences
the execution speed, too, as it freezes the set of operations an algorithm can invoke.
Furthermore, this set determines the set of tools available for the implementation.
Before I jump into a grid discussion, the concept of k-spacetrees is introduced
(Section 2.1): A single hypercube encloses the computational domain. This hypercube then is recursively refined until the resulting cubes approximate the computational domain with sufficient accuracy. Such a refinement process yields a tree data
structure standing for a cascade of smaller and smaller hypercubes. The cascade
in turn mirrors an adaptive Cartesian grid, i.e. k-spacetrees give special types of
adaptive Cartesian grids. k-spacetrees moreover define several adaptive Cartesian
grids with different spatial resolutions simultaneously (Figure 1.1). The principle
is simple: Each leaf’s hypercube defines a geometric element of the Cartesian grid.
All the leaves together tessellate the computational domain. As refined nodes also
represent a hypercube, different resolutions of the tessellation, i.e. different Cartesian grids, are given by different levels of the spacetree. Whereas the finest adaptive
Cartesian grid is used to apply the finite element method, the coarser grids are useful
to implement a multiscale solver.
As soon as a grid is at hand, the solver has to run through it to exploit its information and operate on the data—it traverses the grid. I show that a tree traversal is
a special case of an element-wise grid traversal (Section 2.4 and Section 2.5), i.e. as
soon as a tree traversal is found, I can implement an element-wise algorithm on
adaptive Cartesian grids. The depth-first traversal is of special interest, as it allows
one to encode the grid data structure with one bit per element: if the traversal runs top-down through the tree, it has to read one bit per hypercube. This bit determines
whether the cube is split up further. I end up with an element-wise algorithm on
an adaptive Cartesian grid that comes along with little memory spent on the grid
structure. Further, it is simple to enlarge or shorten k-spacetrees. Adding additional elements to the tree equals a grid refinement, removing equals a coarsening.
Consequently, I end up with a low-memory algorithm that also supports dynamic
adaptivity.
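The one-bit-per-cell idea can be sketched in a few lines. The following Python fragment is an illustrative toy of mine, not Peano’s storage code: a refined cell is a nested list of its k^d children, a leaf is an empty list, and the depth-first bit stream suffices to reconstruct the whole tree.

```python
def encode(cell, bits):
    """Depth-first serialisation: one bit per hypercube (1 = refined)."""
    refined = bool(cell)            # a cell is the list of its children; [] is a leaf
    bits.append(1 if refined else 0)
    if refined:
        for child in cell:
            encode(child, bits)

def decode(bits, k=3, d=2):
    """Reconstruct the (k, d)-spacetree from the depth-first bit stream."""
    stream = iter(bits)
    def build():
        return [build() for _ in range(k ** d)] if next(stream) else []
    return build()
```

Refining the grid corresponds to appending k^d fresh leaves to a cell; coarsening corresponds to clearing its child list. In both cases, the bit stream written by the next traversal reflects the change automatically.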
The PDE is defined on a computational domain. A mapping from the continuous domain to the grid is next presented (Section 2.7). It makes all spacetree
elements lying inside the domain represent the discretised domain—the continuous
domain shrinks to its discrete representation. While such a mapping approximates
an arbitrary domain just with first order, it is straightforward to reduce any discretisation error by refining the spacetree further: the approach accepts the low-order
boundary approximation, as a refinement, and hence an improvement, of the boundary
approximation is that cheap and simple.
Knowledge about the grid structure is important for the Poisson solver, the visualisation, the data postprocessing, as well as other algorithms. Yet, the actual storage
of the tree and the realisation of the traversal are not of interest to the Poisson solver
as long as all information is accessible on demand. While one could introduce an
access signature on the data structure, I prefer a callback mechanism (Section 2.8):
The grid traversal runs through the data structure. For each traversal transition, it
triggers an operation—an event—parametrised by the transition’s state, the current
spacetree element’s data, and its vertices. The algorithm plugs into these events,
reads, evaluates, and modifies the arguments. A plug-in implementing an error estimator, e.g., listens to the event triggered whenever a vertex is read for the very
last time throughout the traversal. At one point, it tells the vertex to refine all its
adjacent spacetree elements. In the subsequent iteration, the plug-in then receives
the events for the newly created vertices. The refinement realisation, the data structure extension, and the mapping from the continuous domain to the modified grid
are hidden from the plug-in, though. Besides the information hiding, such a mechanism offers additional advantages: a modification or exchange of the solver does not
affect the traversal or data structure realisation. At most, the mapping from events
to callback routines is altered, and, hence, the user can comfortably combine several
algorithms. In the experiments, e.g., I combine an error estimator, a PDE solver,
and a visualisation plug-in throughout one single grid traversal.
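The callback mechanism can be sketched as follows. The event names, the dictionary-based tree, and the duck-typed plug-in registration below are illustrative choices of mine, not Peano’s actual signatures:

```python
class Traversal:
    """Runs depth-first through the spacetree and fires events; a plug-in
    subscribes to an event simply by providing a method of that name."""
    def __init__(self, plugins):
        self.plugins = plugins

    def fire(self, event, *args):
        for plugin in self.plugins:
            handler = getattr(plugin, event, None)
            if handler:
                handler(*args)

    def run(self, cell, level=0):
        self.fire("enter_cell", cell, level)
        for child in cell.get("children", []):
            self.run(child, level + 1)
        self.fire("leave_cell", cell, level)

class LeafCounter:
    """A minimal plug-in: it listens to enter_cell only."""
    def __init__(self):
        self.leaves = 0
    def enter_cell(self, cell, level):
        if not cell.get("children"):
            self.leaves += 1
```

Several plug-ins (a solver, an error estimator, a visualisation component) can be handed to one `Traversal` instance and then receive the same events throughout one single run, which mirrors the combination described above.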
From here on, the chapters have to tackle two orthogonal questions: On the one hand,
they have to demonstrate that the plug-in mechanism is sufficient to implement a
sophisticated solver, i.e. the information hiding and encapsulation do not prohibit
the realisation. On the other hand, they have to discuss the actual storage format of
both the grid and data assigned to it in context with the efficient structure encoding.
I start with the second issue.
A grid consists of vertices, hyperplanes connecting these vertices, and geometric elements, i.e. volumes, in-between. k-spacetrees yield the geometric elements, but the
solver for the Poisson equation evaluates data assigned to the vertices. While the preceding
chapter specifies what the traversal looks like—it mirrors a depth-first order—and
how the grid’s structure is encoded, it neither gives the containers holding the encoding, nor does it refer to the vertex data. I concentrate on these questions and derive
a data container realisation going hand in hand with a sophisticated depth-first
traversal (Chapter 3). The realisation preserves the low-memory characteristics, it
does not restrict the dynamic adaptivity in any way, and it is efficient. The latter
aspect covers the algorithmic complexity (the data access is in linear time) and architectural principles, i.e. the realisation fits to and benefits from standard computer
architectures. Prior to that, an excursus: The Peano space-filling curve makes
the depth-first traversal on a k-spacetree based upon three-partitioning deterministic, as it orders the children of each refined hypercube (Section 3.1). Step by step,
the realisation benefits from the nice mathematical properties of Giuseppe Peano’s
curve (and, finally, the framework adopts the inventor’s name).
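The ordering the curve imposes can be sketched for d = 2. The Python fragment below is an illustrative construction of mine, not the framework’s traversal code (which operates on the tree, not on an index set): it generates the Peano order of the cells of a regular 3^n x 3^n grid, where the mirror flags realise the reflected motifs of the curve’s recursive definition. Consecutive cells are always face-connected, which is one of the nice mathematical properties the traversal exploits.

```python
def peano(n, fx=False, fy=False):
    """Yield the cells of a 3^n x 3^n grid in Peano order.
    fx/fy mirror the motif; the XOR rule below keeps the curve continuous."""
    if n == 0:
        yield (0, 0)
        return
    s = 3 ** (n - 1)
    for i in range(3):                                   # columns, left to right
        for j in (range(3) if i % 2 == 0 else range(2, -1, -1)):   # serpentine
            bi = 2 - i if fx else i                      # mirrored block position
            bj = 2 - j if fy else j
            for (x, y) in peano(n - 1, fx ^ (j % 2 == 1), fy ^ (i % 2 == 1)):
                yield (bi * s + x, bj * s + y)
```

For `n = 1` this produces the familiar boustrophedon motif (0,0), (0,1), (0,2), (1,2), (1,1), (1,0), (2,0), (2,1), (2,2); deeper recursion levels stitch mirrored copies of this motif together.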
Figure 1.2: Peano’s grid persistence management relies exclusively on stacks as data
containers. Throughout the grid traversal, it invokes events. The solver,
visualisation software, data preprocessing component, and so forth plug
into these events.
This process is tripartite (Section 3.3): First, I study the handling of the geometric
elements. If the depth-first traversal inverts the space-filling curve’s order after each
iteration, the inverted order again fits to Peano’s curve description, and two stacks
are sufficient to store all the geometric elements of the k-spacetree—the realisation
of an element container is straightforward. Second, I study the order of the vertices
before and after a tree traversal. Let a vertex a be used the last time before a vertex
b throughout an iteration. Due to the space-filling curve, b will be used for the
first time before a throughout the subsequent iteration, and two stacks are sufficient
to store all the vertices of the k-spacetree—the realisation of a vertex container is
straightforward, too. With these input and output containers, vertices finally have
to be managed throughout the iteration: They are read from the input stream the
first time they are used. They are stored to the output stream the last time they are
used. They have to be held in a container in-between these two actions. Again, the
space-filling curve allows the realisation to rely on a fixed number of stacks for
these temporary containers. The number is independent of the grid hierarchy, and,
in the end, a finite number of stacks is sufficient to hold all the data throughout the
traversals (Section 3.3 and Figure 1.2).
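A toy model illustrates the order inversion of the input and output stacks (Python, illustrative; the real containers additionally interleave the temporary stacks mentioned above):

```python
class VertexStacks:
    """Toy model of the stream handling: each traversal pops every vertex
    from the input stack and pushes it onto the output stack; afterwards
    the two roles are swapped.  Hence iteration n+1 touches the vertices
    in exactly the reverse order of iteration n."""
    def __init__(self, vertices):
        self.input, self.output = list(vertices), []

    def traverse(self, visit):
        while self.input:
            v = self.input.pop()        # read when used for the first time
            visit(v)
            self.output.append(v)       # written when used for the last time
        self.input, self.output = self.output, self.input

stacks = VertexStacks(["a", "b", "c", "d"])
first, second = [], []
stacks.traverse(first.append)
stacks.traverse(second.append)
```

After the two runs, `second` holds the vertices in the reverse order of `first`, which is exactly the inversion property stated above for vertices a and b.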
Space-filling curves and stacks are nice subjects of scientific study. With respect
to the solver, they have two especially interesting properties: Their implementation
comes along with almost no memory overhead, and their data access fits to modern
computer architectures: Each stack provides read and write access to only one single
position in memory, and this position shifts at most by one entry per data access.
Since the number of stacks is fixed, this yields an impressive cache hit rate, and the
runtime per vertex becomes independent of the problem size (Section 3.4).
Some realisation details—due to the strict stack-based realisation, one can even
handle problems that do not fit into the main memory anymore, and if algorithms
evaluate only information up to a given tree level, temporarily removing tree levels
below speeds up the traversal substantially—close this discussion. All the data
structures, signatures, and implementation concepts for the solver of the Poisson
equation are now available.
I hence switch from the realisation to the actual solver (Chapter 4). Due to the
inherent multi-resolution property of the k-spacetree, it is natural to focus on a
multigrid algorithm here. After some basic remarks on the finite element method,
three topics are discussed. The text argues how k-spacetrees represent a finite
function space, what an efficient solver for the PDE’s linear equation system looks
like, and how the traversal events are mapped to solver operations.
My finite element implementation relies on a d-linear approximation of the continuous function space. Each spacetree vertex yields one basis function of this approximation, and its support covers the 2^d adjacent geometric elements. This approach
ends up with a generating system instead of a basis, i.e. the representation of a
function is not unique (Section 4.1). To avoid confusion due to ambiguity, I
standardise the representations by convention, and, hence, discuss two variants: A
nodal scheme holding a function in different spatial representations simultaneously,
and a hierarchical scheme splitting up the function into contributions of different
frequency.
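The split into contributions of different frequency can be illustrated in one dimension. The sketch below (Python, illustrative; not Peano’s d-dimensional implementation) hierarchises nodal values on a cascade of tri-sected 1D grids: the surplus of a fine-grid node is its nodal value minus the linear interpolant of the next coarser level. For a linear function, all surpluses on the finer levels vanish, which illustrates how the hierarchical scheme separates frequencies.

```python
def hierarchise(levels):
    """Transform nodal values into hierarchical surpluses on a 1D grid
    refined by tri-section (3^l + 1 equidistant nodes on level l)."""
    surplus = [list(levels[0])]                 # coarsest level stays nodal
    for coarse, fine in zip(levels, levels[1:]):
        n_c, n_f = len(coarse) - 1, len(fine) - 1
        s = []
        for i, v in enumerate(fine):
            pos = i * n_c / n_f                 # node position in coarse indexing
            i0 = min(int(pos), n_c - 1)
            w = pos - i0
            s.append(v - ((1 - w) * coarse[i0] + w * coarse[i0 + 1]))
        surplus.append(s)
    return surplus
```

Running this on samples of u(x) = 2x + 1 over three levels yields surpluses of (numerically) zero on all levels but the coarsest.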
With the nodal scheme, one can evaluate the method’s matrix-vector products
element-wise on the grid: there is no need to set up any global system matrix
explicitly, and the resulting approach thus needs no memory in addition to the pure
grid data. Next, the matrices are discussed (Section 4.2), and, starting from a simple
Jacobi scheme, I then run through the idea of multiplicative geometric multigrid
solvers and full approximation schemes with k-spacetrees fitting to the geometric
approach, since they hold grids of different spatial resolution by construction. The
solution representation with different frequencies simplifies the restriction of the
right-hand side within the multigrid cycle. I thus switch regularly from a nodal
representation to the hierarchical scheme.
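The assembly-free element-wise evaluation can be illustrated in one dimension with linear elements (Python; a deliberately minimal sketch, not the d-linear operator of the framework): each element contributes its 2x2 local stiffness matrix (1/h) [[1, -1], [-1, 1]] to the result vector, and no global matrix is ever stored.

```python
def elementwise_laplace(u, h):
    """Apply the 1D FEM Laplacian A*u without assembling A: accumulate
    the local stiffness contribution of every element individually."""
    r = [0.0] * len(u)
    for e in range(len(u) - 1):            # element between nodes e and e+1
        r[e]     += (u[e] - u[e + 1]) / h
        r[e + 1] += (u[e + 1] - u[e]) / h
    return r
```

At every interior node the accumulated contributions add up to the familiar (-1, 2, -1)/h stencil, so the element-wise loop reproduces the global matrix-vector product exactly.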
The individual steps of the multigrid solver plug into the traversal events: the
grid and traversal management are encapsulated from the multigrid solver and vice
versa. I consequently formulate the solver steps in terms of operations triggered by
the single events (Section 4.4). As the algorithm does not introduce additional data
or data accesses, it ends up with all the nice properties of the pure traversal—low
memory requirements, good cache hit rates, arbitrary adaptivity, and so forth—
while it implements a state-of-the-art multiscale solver. Some realisation details
and experiments document this.
High performance computing nowadays is dominated by massively parallel computers. The thesis eventually introduces a parallelisation concept based upon a domain
decomposition (Chapter 5). It tackles four challenges: An (extended and modified)
traversal has to be able to handle a spacetree split into several parts. The realisation of this traversal has to preserve the original traversal’s nice properties. A
load balancing fitting to the traversal has to find a well-suited partitioning. And it
has to adapt this partitioning to a changing refinement structure if the Peano solver
employs an a posteriori refinement criterion.
First, I show that a depth-first traversal is inherently sequential, i.e. it does not
profit from a decomposition of the spacetree. Consequently, I weaken the depth-first
paradigm (Section 5.1) and make it a level-wise depth-first traversal preserving all
the nice properties, as the stack concept remains exactly the same, and the order of
the traversal events is not modified either.
Second, I discuss the traversal’s realisation and the vertex exchange. Both the domain decomposition and the exchange pattern benefit from space-filling curves (Section 5.2): As for the stack-based data management, the curve makes any reordering
of exchanged data unnecessary. Furthermore, partitions following the iterate exhibit
a small surface compared to their volume, i.e. a nice computation-communication
ratio: The partitions are quasi-optimal due to the Hölder continuity of the Peano
curve.
The parallel tree traversal postulates properties of a good partitioning. It becomes
obvious which partition layout leads to a well-balanced decomposition. Third, I use
these properties to derive an on-the-fly load balancing for k-spacetrees (Section 5.3).
All traversals permanently keep track of the computational load assigned to one
computing node. Furthermore, they observe how many additional work units could
be handled by the computing node without becoming a bottleneck for the overall
application. Whenever a computing node would become a bottleneck due to a
refinement, it takes additional nodes from a global node pool and deploys work to
them.
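The deploy-on-refinement idea reduces to a few lines in a toy model (Python; the capacity threshold, the node bookkeeping, and the halving strategy are illustrative assumptions of mine, not the thesis’ actual load balancing):

```python
class NodePool:
    """Hands out idle compute nodes on request; node 0 is the initial worker."""
    def __init__(self, size):
        self.idle = list(range(1, size + 1))
    def acquire(self):
        return self.idle.pop() if self.idle else None

def refine(workers, pool, node, new_cells, capacity=100):
    """Grow a worker's load; if it would become a bottleneck, fork off
    half of the load to a fresh node taken from the pool."""
    workers[node] = workers.get(node, 0) + new_cells
    if workers[node] > capacity:
        fresh = pool.acquire()
        if fresh is not None:
            workers[fresh] = workers[node] // 2
            workers[node] -= workers[fresh]
```

Repeatedly refining on one worker deploys work to the pool until the pool is exhausted; the total load is preserved throughout, which is the invariant any such scheme has to maintain.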
Some remarks on the node pool continue the discussion (Section 5.4). Finally, I
combine the solver’s actions and the data exchange (due to the exchange of data,
the mapping from events to solver operations has to be modified slightly) and end
up with a parallel multigrid solver with dynamic load balancing on permanently
changing grids.
2 Adaptive Cartesian Grids and
Spacetrees
Peano is a framework for grid-based methods for partial differential equations
(PDEs), i.e. it relies on a spatial discretisation of the computational domain—a
mesh or grid. This chapter defines the grids underlying the whole thesis: it defines
the grid structure and the grid generation process, as well as the interplay of the
grid and the differential equation’s continuous domain.
The spatial discretisation is based on k-spacetrees. They are a d-dimensional generalisation of the octree concept. While octrees rely on bi-partitioning, k-spacetrees
employ a k-partitioning: The computational domain is embedded into a hypercube. Then, the discretisation algorithm recursively cuts the cube into k^d equally
sized subcubes. While such a discretisation is equivalent to a subset of the class of
adaptive Cartesian grids, the efficiency of each algorithm in this thesis exploits the
particular spacetree’s structure.
Throughout the spacetree discussion, several alternative representations and interpretations are listed, but the list is neither complete nor representative. The
aim of this chapter instead is as follows: It establishes a uniform terminology and
a formalism, to make the text understandable and consistent without additional
literature. And, the chapter formally derives algorithmic restrictions and insights
for the k-spacetree.
My dissertation looks back on a long tradition of publications discussing challenges,
algorithmic ideas, and formalisms coming along with spacetrees. Nevertheless, the
underlying ideas are far from being exhausted. Compared to the direct predecessors
[35], [63], and [39] (this is the chronological order), this chapter picks up facts scattered among the theses, and it
harmonises the description. In return, I omit technical details, if they are not essential to understand the algorithmic ideas. Starting from a well-defined terminology,
the chapter furthermore states several properties of k-spacetrees or algorithms on
k-spacetrees, not written down explicitly before.
The chapter is organised as follows: I first introduce the concept of the k-spacetree
in the context of adaptive Cartesian grids with regularity constraints. They are two
alternative interpretations of one idea deducing a spatial discretisation for a given
domain. In Section 2.3, a formalism describing the grid constituents and their relationships is given. Such a formalism is essential for algorithms. All algorithms on
Cartesian grids require a well-suited traversal of the data structures. Section 2.4 introduces the fundamental ingredients of Peano’s element-wise traversals. They then
are transferred to the spacetree world in Section 2.5. The subsequent section analyses the traversal’s implications and constraints for for algorithms, and it establishes
a refinement concept for the adaptive Cartesian grids, i.e. it sets up the data flow
underlying the grid generation. Section 2.7 picks up the PDE challenge again, and
it discusses the mapping of the continuous domain to the spatial data structures.
Next, I formalise the interplay of the k-spacetree, the k-spacetree’s traversal, and
any algorithm making use of these traversals. Such a formalism is important for
work extending this thesis ([23, 51, 60], e.g.). A small number of experiments finally
highlights some of the k-spacetree’s properties, and an outlook closes the chapter.
2.1 k-spacetree
Algorithms are by definition executable by a machine. Ergo, they need a finite
representation of the computational domain Ω. The plenitude of existing geometry representation schemes makes it impossible to give a comprehensive overview:
Analytical descriptions compete with surface descriptions such as B-splines or triangulated meshes. Volume-based methods (voxel discretisations or space partitioning
data structures, e.g.) face constructive solid geometry models. Surface reconstruction algorithms on data obtained from measurements also might act as geometry
representation. A starting point for an overview is for example [68, 72]. Since numerical methods approximate the solution of a partial differential equation—they do not
derive an exact, point-wise solution—it is sufficient to approximate the continuous
domain. Yet, the geometric error must not pollute the approximation.
Grid-based methods are based on a spatial discretisation of the computational domain, i.e. Ω’s representation in the computer is cut into a finite set of primitives—a
grid or mesh. The term tessellation is a synonym not highlighting that the domain’s
boundaries typically are approximated, too. The term grid in turn usually circumscribes the tessellation of the computational domain together with the geometric
primitives arising from this discretisation, i.e. the vertices, the primitives, and so
forth. I use the terms grid, tessellation, and spatial discretisation as synonyms.
The result is Ωh. Formally, the transition Ω ↦ Ωh is a two-step cascade: It first
approximates Ω and then cuts it into pieces. Here, both steps are merged in a
spatial discretisation stemming from a regular recursive k-section (recursive decomposition is another synonym [67]). Four principles circumscribe the k-section:
—Terminology—
For the k-spacetree, the following terms are well-defined:
The computational domain is embedded into a hypercube. This hypercube
equals the root of the k-spacetree.
If a cube e1 is a child of a cube e2 , e2 is the parent of e1 . I write e1 ⊑child e2 .
If two cubes e1 and e2 share a common parent, e1 and e2 are siblings.
If two cubes e1 and e2 share at least one common point x ∈ R^d, they are
neighbours.
If the recursive k-section generates a cube e1 out of e2 with an arbitrary
positive number of recursion steps, e1 is a descendant of e2 . All children
of e2 are descendants. Not all descendants are children.
The height of a k-spacetree equals the recursion depth.
1. The algorithm embeds the computational domain into a hypercube (hypercubes in this thesis are closed). Since a
domain is a bounded, open subset of R^d, this is always possible.
2. For each coordinate axis xi , i ∈ {1, . . . , d}, there is a hyperplane with a normal
along xi . The algorithm cuts the hypercube into k equally sized parts along
each suitably translated hyperplane.
3. The algorithm ends up with k^d small hypercubes. They are disjoint, the sum
of their volumes equals the original cube’s volume, and all k^d cubes have the
same size.
4. The algorithm continues recursively for each new small hypercube if appropriate.
The resulting spatial discretisation is a k-spacetree: Each hypercube of the discretisation process corresponds to a vertex in the tree graph. The hypercube into
which the computational domain is embedded is the tree’s root. When a cube
is cut into k^d pieces, each new subcube is a child of the original cube, i.e. there is a
directed edge from the bigger to the smaller hypercube.
3
Hypercubes in this thesis are closed.
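The four principles above can be sketched in a few lines; the (offset, width) cell encoding, the nested-pair tree representation, and the refine callback are illustrative assumptions, not the data structures developed later in the thesis.

```python
from itertools import product

def k_section(cell, k):
    """Cut a hypercube into k^d equal subcubes (principles 2 and 3).
    A cell is a pair (offset, width): offset is a d-tuple, width a scalar."""
    offset, width = cell
    h = width / k
    return [(tuple(o + i * h for o, i in zip(offset, idx)), h)
            for idx in product(range(k), repeat=len(offset))]

def build_spacetree(cell, k, refine, level=0):
    """Recursive k-section (principle 4): a tree node is (cell, children)."""
    if refine(cell, level):
        children = [build_spacetree(c, k, refine, level + 1)
                    for c in k_section(cell, k)]
    else:
        children = []
    return (cell, children)

def count_nodes(tree):
    cell, children = tree
    return 1 + sum(count_nodes(c) for c in children)

# Regular refinement up to height 2 for d = 2, k = 3 (cf. Figure 2.1, left):
root = ((0.0, 0.0), 1.0)
tree = build_spacetree(root, k=3, refine=lambda cell, level: level < 2)
# 1 + 9 + 81 geometric elements in total
```

Swapping the refine callback for a geometry-dependent criterion yields the adaptive trees of Figure 2.2.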
11
2 Adaptive Cartesian Grids and Spacetrees
Figure 2.1: k-spacetree construction for d = 2 and k = 3 (left), and k-spacetree
construction for d = 3 and k = 2 (right) with accompanying tree graph.
Both trees at the bottom have height two.
Different recursive k-section variants are well-known under different names in
different disciplines. The term quadtree for d = 2 and k = 2 and the term octree
for d = 3 and k = 2 (Figure 2.1) are popular names in computer graphics and
grid generation (see for example [25, 58, 67, 72]). Refinement tree [56], Q-tree and
quadtrie [67] as well as region quadtree [68] are among the alternative names. If
the cuts are not equidistant or the number of cuts per dimension per cube does
not equal a constant k, data structures such as binary space partitioning trees,
k-d trees or point quadtrees [20, 67, 72] arise. Some authors generalise the concept
and replace the initial cube by another primitive such as a triangle ([2, 56], e.g.).
Others introduce different names such as edge quadtree, MX quadtree or PM quadtree
for different recursion termination criteria [68]. This list is neither complete, nor
are the examples representative.
I restrict my attention to a fixed number k of equidistant cuts per dimension. The
dimension d ≥ 2 is an arbitrary constant in this thesis. Whenever an algorithm
demands a special k, I write, e.g., (k = 3)-spacetree.
With a fixed Ω, each hypercube of the k-spacetree is either inside the computational domain or not. The k-section hence yields a spatial approximation of the
computational domain, as the surface of the inner hypercubes' union approximates
the boundary of Ω. The k-section also yields a discretisation of this approximation,
i.e. the k-spacetree gives an Ωh (Figure 2.2).
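This inside/outside classification can be sketched for the fine grid of a regularly refined (k = 3)-spacetree; the disc-shaped Ω and the cell-centre criterion below are illustrative assumptions (Section 2.7 discusses the actual interplay with the geometry).

```python
from itertools import product

def fine_grid(k, height, d=2):
    """Leaves of a regularly refined k-spacetree on the unit hypercube."""
    n = k ** height                     # cells per axis
    h = 1.0 / n
    return [(tuple(i * h for i in idx), h)
            for idx in product(range(n), repeat=d)]

def inner(cell, inside):
    """A cell counts towards the spatial approximation of Omega if its
    centre lies inside the domain (one of several possible criteria)."""
    offset, h = cell
    centre = tuple(o + h / 2 for o in offset)
    return inside(centre)

# Omega: open disc of radius 0.4 around (0.5, 0.5)
inside = lambda x: (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2 < 0.4 ** 2

omega_h = [c for c in fine_grid(k=3, height=3) if inner(c, inside)]
area = sum(h * h for _, h in omega_h)   # tends to pi * 0.4^2 as h -> 0
```

Refining the tree by one more level roughly halves the O(h) boundary error of this approximation.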
Figure 2.2: k-spacetree construction scheme for four different computational domains (d = 2, k = 3). Each illustration consists of four layers top-down:
layers one, two, and three correspond to the three initial recursion
steps, i.e. they show the spacetrees of height one, two, and three. The
bottom layer shows a significantly finer resolution of the domain.
2.2 Adaptive Cartesian Grids and k-spacetrees
Among the simplest grids in computational sciences is the Cartesian grid ([52], e.g.)
consisting of hypercuboids whose faces’ normals are parallel to a coordinate axis of
the Cartesian coordinate system. The Cartesian grid’s simplicity and structuredness make many geometric parameterisations and transformations in mathematical
expressions and algorithms trivial or obsolete. The same two properties also enable
implementations exhibiting high performance and low memory requirements on
today's hardware (e.g. [6, 22, 77]); their poor O(h) approximation of complex geometries is picked up, discussed, and (at least) softened in Section 2.7. This section
emphasises the obvious similarities of k-spacetrees and particular Cartesian grids,
albeit k-spacetrees afford more flexible and more sophisticated grids and exhibit
inherent grid hierarchy relationships. While the section recapitulates well-known
facts familiar to most readers, subsequent chapters exploit both the similarities and
the flexibility.

Figure 2.3: Adaptive Cartesian grid. This grid does not correspond to a k-spacetree,
as k is not invariant throughout the construction, and it is not equal
along the spatial directions.
A regular Cartesian grid is a Cartesian grid with equally sized non-overlapping
hypercuboids. Its construction resembles the recursive k-section in Section 2.1, but
it refrains from the recursion and may start from a cuboid instead of the original cube. A straightforward
extension of regular Cartesian grids is a non-overlapping adaptive Cartesian grid
(since all grids in this thesis are non-overlapping, I skip this qualifier from now on).
The corresponding grid generation process starts with a regular Cartesian grid.
If appropriate, it then replaces each hypercube with another Cartesian grid and
continues recursively.

The leaves of the k-spacetree yield a spatial discretisation of the unit hypercube.
This tessellation is the fine grid. The term does not incorporate the actual shape of
Ω and does not hold any geometric information. Thus, it does not yet give a spatial
discretisation Ωh of Ω according to Section 2.1; the elements of Ωh instead are a
subset of the fine grid. I examine the interplay of a spacetree's fine grid and the
geometry in Section 2.7. This interplay results in a fine grid for the computational
domain instead of a fine grid for the spacetree's root.

Albeit the leaves and the fine grid coincide, a k-spacetree holds information beyond the pure adaptive mesh:

Example 2.1. Let an equidistant Cartesian grid on a square with 8 × 8 geometric
elements be the subject of inspection. Such a grid also corresponds to the fine grid
of a (k = 8)-spacetree with height one. It furthermore corresponds to the fine grid
of a (k = 2)-spacetree with height three. Although all three approaches yield exactly
Figure 2.4: Fine grid of a two-dimensional (k = 2)-spacetree (left) and two Cartesian
grids extracted from the tree (right).
the same spatial discretisation, each of them exhibits a topology of its own and a
different number of geometric elements in total—8 · 8 for the Cartesian grid, and for
the spacetrees 8^2 + 1 or 8^2 + 4^2 + 2^2 + 1.
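The three element totals of Example 2.1 follow from a geometric sum over the levels; a quick sketch, assuming full (regular) refinement:

```python
def total_elements(k, height, d=2):
    """Geometric elements of a fully refined k-spacetree of given height:
    one root plus k^d children per refined element, summed over all levels."""
    return sum((k ** d) ** level for level in range(height + 1))

cartesian = 8 * 8                                # plain 8 x 8 Cartesian grid
# Both spacetrees share the same 8 x 8 fine grid ...
assert total_elements(k=8, height=1) == 8 ** 2 + 1
assert total_elements(k=2, height=3) == 8 ** 2 + 4 ** 2 + 2 ** 2 + 1
# ... but differ in the total number of geometric elements.
```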
While each k-spacetree’s fine grid equals an adaptive Cartesian grid, this does
not hold the other way round, as most adaptive Cartesian grids choose the number
of cuts along every hyperplane individually for each recursion step. These adaptive
Cartesian grids thus are less restrictive (Figure 2.3). On the other hand, such an
approach lacks the uniform hierarchy relations. A k-spacetree explicitly preserves
the hierarchy, and the father-child relations’ cardinalities are fixed and invariant.
This facilitates many efficient and elegant algorithms—geometric multigrid solvers
for example profit from the hierarchy.
In a k-spacetree, each hypercube has a level. If the construction needs at least ℓ
recursion steps to create the hypercube, ℓ is the level. The root element thus has
level zero. All the hypercubes of a given level in a k-spacetree have the same size.
Yet, they do not define a Cartesian grid without additional assumptions, as
they might not cover the whole domain, as their layout might not be a hypercube
itself, and as they might not be connected (Figure 2.4). In turn, each level of a
full k-spacetree (a tree in which all geometric elements not belonging to the maximum level are
refined) really yields a regular Cartesian grid.
2.3 Some Formalism
The preceding sections introduce the spatial discretisation underlying this thesis'
algorithms. Yet, they lack a rigorous formalism. Algorithms, however, demand
a finite, formal description, and they have to be executable by a
machine by definition [11, 12]. The following pages establish such a formalism, and I also use
them as an opportunity to switch from a geometry-tessellation point of view to a grid-centered language including vertices and vertex-element adjacency relationships.
New aspects beyond the discretisation are shifted to the subsequent Section 2.4,
whereas Appendix C holds, among other details, a table of the definitions given here.
The spatial discretisation Ωh consists of a set ET of geometric elements. All
geometric elements e ∈ ET are hypercubes. Thus, every geometric element has 2^d
vertices v1 , v2 , . . . , v_{2^d} ∈ VT . All normals of a hypercube's 2d hyperfaces are
parallel to a coordinate axis of the Cartesian coordinate system.
A k-spacetree T equals a four-tuple

    T = (ET , ⊑child ⊆ ET × ET , e0 ∈ ET , VT )

with a dedicated root element e0 and a partial order ⊑child giving the child-father
relationship. An element is a leaf, i.e. it belongs to the fine grid level, if it has no
children. Elements with children, i.e. elements e with {ei ∈ ET : ei ⊑child e} ≠ ∅,
are refined, and |{ei ∈ ET : ei ⊑child e}| = k^d. The predicate

    Prefined : ET → {⊤, ⊥},
    Prefined (e) = ⊤ if {ei ∈ ET : ei ⊑child e} ≠ ∅, ⊥ else

formalises this distinction.
Let

    level : ET → N0   with   level(e0 ) = 0                                  (2.1)

and

    ∀ei , ej ∈ ET , ei ⊑child ej : level(ei ) = level(ej ) + 1               (2.2)

return the level of a geometric element. The level of the root element equals zero
(2.1), and the level of a child equals the parent's level incremented by one (2.2).
The following properties result from the k-spacetree's construction algorithm:

    ∀ei , ej ∈ ET , ei ≠ ej , level(ei ) = level(ej ) : |ei ∩ ej | = 0,      (2.3)
    ∀ei , ej ∈ ET , ei ⊑child ej : ei ⊂ ej ,                                 (2.4)
    ∀ei ∈ ET , Prefined (ei ) : ⋃_{ej ⊑child ei} ej = ei .                   (2.5)
Different elements belonging to the same level share at most a submanifold, for
d = 2 a vertex or an edge (2.3). Each element besides the root is contained within
its father (2.4), and if all the children of one element are merged, the merged volume
equals the father (2.5).
vertex(e) with vertex : ET → VT^(2^d) ⊂ P(VT ) identifies the 2^d adjacent vertices of
an element e; because of the exponent 2^d, each image element has cardinality 2^d.
A vertex is a (d + 1)-tuple (x ∈ Rd , level ∈ N0 ) holding the vertex's
Figure 2.5: Part of a (k = 2)-spacetree with d = 2. Prefined (e1 ) but ¬Prefined (e2 ). e1
has k^d children. v1 and v2 are at the same position in space but belong
to different levels. Thus, they are not equal. v1 ∈ VT \ HT ; v2 ∈ HT is
a hanging node.
level and position in space. Two vertices are equal if and only if they coincide in
both space and level. The level level(v) of a vertex v of an element equals the
element’s level:
∀v ∈ VT , ∀e ∈ adjacent(v) : level(e) = level(v).
The set

    VT = ⋃_{e ∈ ET} vertex(e)
holds all vertices of the spacetree. This definition derives vertices from elements.
The counterpart
    adjacent : VT → P(ET )
delivers all the elements that are adjacent to a vertex on the same level. The
spacetree’s construction yields
    ∀v ∈ VT : 0 < |adjacent(v)| ≤ 2^d,
and this adjacency information allows for a classification of the vertices: a vertex
v with fewer than 2^d adjacent elements, i.e. |adjacent(v)| < 2^d, is a hanging node or
hanging vertex. Distinguishing hanging vertices from the other vertices is essential
throughout the thesis, and the set

    HT = {v ∈ VT : |adjacent(v)| < 2^d}                                      (2.6)
provides this separation.
The following convention is common for vertices in many applications such as
geometric ansatz spaces in Chapter 4, where it ensures the continuity of the numerical
solution: hanging vertices never hold information. As a result, there is no need to
store hanging nodes persistently.
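For a small (k = 2, d = 2) spacetree, the classification (2.6) can be checked mechanically. The sketch below refines only one child of the root and counts, for every vertex of level two, the level-two elements adjacent to it; exact rational arithmetic avoids floating-point comparisons. Note that with (2.6) taken literally, vertices on the boundary of the root cube also come out as hanging, since only the adjacency cardinality is inspected; the data layout is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

def children(cell, k=2):
    """k-section of a cell; a cell is (offset, width) with exact rationals."""
    offset, width = cell
    h = width / k
    return [(tuple(o + i * h for o, i in zip(offset, idx)), h)
            for idx in product(range(k), repeat=len(offset))]

def vertices(cell):
    """The 2^d vertices adjacent to a cell."""
    offset, h = cell
    return {tuple(o + i * h for o, i in zip(offset, idx))
            for idx in product((0, 1), repeat=len(offset))}

root = ((Fraction(0), Fraction(0)), Fraction(1))
level2 = children(children(root)[0])    # refine only the first level-1 child

# adjacent(v) restricted to level two: all level-2 elements touching v
adjacent = {}
for e in level2:
    for v in vertices(e):
        adjacent.setdefault(v, []).append(e)

# hanging vertices: fewer than 2^d = 4 adjacent elements on their level (2.6)
hanging = {v for v, adj in adjacent.items() if len(adj) < 2 ** 2}
```

The only non-hanging vertex here is the patch centre (1/4, 1/4); the vertex (1/2, 1/4) on the interface towards the unrefined sibling is the classic hanging node of Figure 2.5.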
2.4 Element-wise Grid Traversal
On the next pages, I derive a grid traversal idea fitting both an element-wise
traversal of the adaptive Cartesian grid and the k-spacetree. This traversal is the
basis of all algorithms implemented later on. In this context, the conclusions and
restrictions coming along with it (Definition 2.1) particularly shape the thesis.
A function on a k-spacetree maps the tree and data assigned to it to an image.
Iterative solvers, for example, map the data structure to another k-spacetree; plotters
write it to an output stream whose content is interpreted by a visualisation tool.
Without any assumptions on the preimage's impact, such algorithms have to read
each element of the spacetree at least once, i.e. they have to process each element
of ET and each element of VT . This is a grid traversal. The interplay of geometric
elements and vertices is essential in most algorithms. A traversal algorithm having
minimal computational complexity reads each vertex and each geometric element
once. If the connectivity information is of interest, the traversal algorithm has to
read each vertex-element relationship at least once.
Two different traversal types are vertex-wise and element-wise. Vertex-wise traversals define a total order on the vertices and run through this data stream. For each
vertex v, they also process the adjacent elements e ∈ adjacent(v). Thus, each vertex is read once, but elements are read multiple times. Element-wise algorithms in
contrast define a total order on the geometric elements and run through this data
stream. For each element e, they also process the adjacent vertices v ∈ vertex(e).
Thus, each element is read once, but vertices are read multiple times: for v ∈ VT \ HT
exactly 2^d times, for hanging nodes v ∈ HT fewer. In this thesis, all algorithms are
based upon an element-wise traversal. Both traversal types are found in scientific
computing: introductions to the finite element method usually start from a vertex
point of view ([8], e.g.), while finite element codes often switch to an element-wise
traversal to eliminate duplicate computations such as numerical integration on the
elements. Grid-based visualisation software typically prefers a vertex-wise point of
view, whereas an element-wise traversal fits better into the paradigm of volume
rendering. For spacetrees, algorithms stemming from graph theory prefer an
element-wise approach, as it fits standard tree search algorithms; other spacetree
formalisms such as Morton ordering [57] correspond to a vertex-wise interpretation
of the grid.

Definition 2.1. An element-wise k-spacetree traversal processes the k-spacetree T
in n ≥ |ET | steps. In each step, exactly one element e ∈ ET and the 2^d adjacent
vertices v ∈ vertex(e) are available to an algorithm built atop the traversal. In-between two steps, both the source and the destination data are available.

The element-wise k-spacetree traversal poses an important restriction on the speed
at which information is transported from one geometric element to another: without
further assumptions, it is not possible for an element-wise algorithm to directly evaluate or
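Definition 2.1 can be phrased as a tiny driver loop; the cell encoding and the callback interface are illustrative assumptions, not Peano's actual interface.

```python
from collections import Counter
from itertools import product

def element_wise_traversal(elements, handle):
    """Element-wise traversal in the sense of Definition 2.1: some total
    order on the elements (here: list order); each step hands one element
    and its 2^d adjacent vertices to the algorithm built atop (the callback)."""
    for cell in elements:
        offset, h = cell
        verts = [tuple(o + i * h for o, i in zip(offset, idx))
                 for idx in product((0, 1), repeat=len(offset))]
        handle(cell, verts)

# Count how often each vertex is touched on a regular 2 x 2 grid:
touched = Counter()
grid = [((x * 0.5, y * 0.5), 0.5) for x in range(2) for y in range(2)]
element_wise_traversal(grid, lambda e, vs: touched.update(vs))
# inner vertices are read 2^d = 4 times, corner vertices once
```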
even manipulate neighbour cells, as at most two neighbours’ data is available during
an element transition.
Example 2.2. The element-wise traversal grants an algorithm access to the current
element and its vertices. The two sketches below illustrate the resulting challenges
for an element A traversed before an element B:
if traversal(A) < traversal(B) and if an algorithm requires information from B
within A, this information is not available (left). Thus, information needed by neighbours has to be written to the element's vertices (B, right). This information is then
available to the neighbours in the subsequent iteration (A, right).
If a cell needs information from neighbouring cells, there has to be a function that
reconstructs the neighbour’s information. Such a function analyses the properties of
the vertices that are adjacent to both geometric elements and it uses the following
information:
1. For the reconstruction of the neighbours sharing a common hyperface, the 2^(d−1)
shared vertices and the element's data itself have to be sufficient.

2. For the reconstruction of the neighbours sharing a common hyperedge, the
element's state, the 2^(d−2) shared vertices, and the information obtained in the
previous step have to be sufficient.

...

d. For the reconstruction of the neighbours sharing only one common vertex, the
element's state, the 2^0 = 1 common vertices, i.e. the only common vertex, and the
information obtained in the previous steps have to be sufficient.
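The first of these reconstruction steps can be illustrated for d = 2: an element deposits its data on its 2^d vertices in one iteration, and in the next iteration the face neighbour's value is recovered from the 2^(d−1) = 2 shared vertices alone. The element names, the vertex dictionary, and the stored values are illustrative assumptions.

```python
from collections import defaultdict

# Two unit squares A and B sharing the edge x = 1 (d = 2).
elements = {"A": ((0.0, 0.0), 1.0), "B": ((1.0, 0.0), 1.0)}
values = {"A": 10.0, "B": 20.0}

def vertices(cell):
    (x, y), h = cell
    return [(x, y), (x + h, y), (x, y + h), (x + h, y + h)]

# Iteration n: every element writes its value to its 2^d vertices.
vertex_data = defaultdict(dict)
for name, cell in elements.items():
    for v in vertices(cell):
        vertex_data[v][name] = values[name]

# Iteration n+1: A reconstructs B's value from the 2^(d-1) = 2 shared
# vertices without ever touching B itself.
shared = [v for v in vertices(elements["A"]) if "B" in vertex_data[v]]
reconstructed = {vertex_data[v]["B"] for v in shared}
```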
Vertices act as an information transport medium: element-based information
propagates via the vertices (write from origin element to vertex; image element
then takes data from this vertex). Vertex-based information propagates from one
vertex to the neighbouring vertex. The maximum information propagation speed
the element-wise traversal guarantees, without further assumptions and knowledge on
the elements' and vertices' order and topology, thus equals one element per traversal.
For any algorithm mapping a k-spacetree's data to another k-spacetree, a computer
scientist is interested in whether it is possible to use the same data structure as origin
and as destination record. If this is possible, the application's demand for memory
halves. The answer depends on the algorithm's vertex manipulation and the
reconstruction function built atop.
If an algorithm needs, at any time, a neighbour element’s state in the preimage,
and if it manipulates the current traversal element’s state, it requires the state of
a vertex to remain invariant throughout the traversal. Yet, each element’s state
transition implies an update of the vertices, as the vertices transport information.
Thus, it is not possible to make the input data structure, i.e. the preimage’s vertices,
act as output data structure.
Instead of using two complete data structures, any algorithm violating this criterion
can be implemented using transaction variables: a vertex's property transition
is not performed immediately, but the algorithm writes the transition into a transaction variable. At the end of the traversal, all transactions are processed en bloc.
Alternatively, the very last write operation on a vertex precedes the transaction.
Such approaches are well-known from databases and persistence layers ([24], e.g.).
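The transaction-variable idea condenses into a few lines; the class and method names are illustrative assumptions.

```python
class Vertex:
    """Vertex whose visible state stays invariant throughout a traversal:
    writes go into a transaction slot and are committed en bloc at the end."""
    def __init__(self, state):
        self.state = state          # what every read during the traversal sees
        self._pending = None        # the transaction variable

    def write(self, new_state):
        self._pending = new_state   # record the transition, do not apply it

    def commit(self):               # run once in-between two traversals
        if self._pending is not None:
            self.state = self._pending
            self._pending = None

v = Vertex(state=0.0)
v.write(1.0)
assert v.state == 0.0    # preimage still readable: one structure is in and out
v.commit()
assert v.state == 1.0    # transition becomes visible in the next traversal
```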
If a vertex’s state changes exclusively in-between two iterations, there is a maximum guaranteed speed for the information to spread. For a fixed grid level, information can spread only one element per iteration: In one iteration, an element
e1 writes its information to the vertices. The updated information is available to
the neighbouring elements e2 with traversal(e2 ) < traversal(e1 ) in the subsequent
traversal. The other way round, this is not a restriction similar to a CFL condition
[8] for the whole spacetree, i.e. it does not imply that information can only be passed
from one element to a neighbouring element per iteration. If an algorithm traverses
from an geometric element e1 to its father element e2 , then to e2 ’s neighbour e3 ,
and then to a child e4 of cell e3 , there is no need for e1 and e4 to be neighbours,
i.e. information stemming from element e1 can influence a non-neighbouring element
e4 .
2.5 k-spacetree Traversals
The preceding section discusses element-wise traversals from a grid point of view. It
is an obvious idea to transfer the ideas to the tree world. A unified traversal concept
holding for both points of view allows each algorithm to pick out the formalism
leading to the simplest description, while Peano's algorithmic realisations always
benefit from the tree paradigm.
A tree traversal in graph theory is a process that reads each node of a tree once [47].
Each tree node in a k-spacetree equals a geometric element in the hierarchical grid.
Thus, a tree traversal yields an element-wise traversal according to Definition 2.1
if one makes the tree traversal algorithm also pass the corresponding grid vertices.
Several different criteria induce a classification of tree traversals. The distinction
of deterministic and nondeterministic traversals is of importance in this thesis: parallel
algorithms, for example, exhibit nondeterministic behaviour, i.e. sophisticated
synchronisation mechanisms are required if the code's correctness demands deterministic
algorithmic steps. The hierarchy's influence on the traversal order is also
essential for most algorithms:
Definition 2.2. A traversal preserves the child-father relationship ⊑child if

    e1 ⊑child e2 ⇒ traversal(e2 ) < traversal(e1 ).

A traversal preserves the inverse child-father relationship ⊒child if

    e1 ⊑child e2 ⇒ traversal(e1 ) < traversal(e2 ).
With binary trees, the preorder traversal [11, 12] preserves the child-father relationship's implication. With trees such as k-spacetrees, where a refined element
has more than two children, the standard depth-first and breadth-first orders'
traversal functions also fit the first definition above [11, 12]. Hereby, Definition
2.2 neither determines whether the traversal is deterministic, nor do the implications
prefer a depth-first or a breadth-first traversal.
Let an algorithm’s traversal preserve both relationships. Such an algorithm can
transfer information from fathers to their children within one traversal transition
according to the child-father relationship. Information hereby is transported topdown. For the inverse child-father relationship, the algorithm can transfer all the
children’s information to their father within one iteration. Information is transported bottom-up. The first principle corresponds to information inheritance within
tree algorithms. The second principle describes information analysis [46].
A major part of this thesis turns its attention to the efficient realisation of the
grid and the grid traversal. Depth-first traversals (later merged with the paradigm
of breadth-first) build the basis of these realisations, as their traversal management
requires only one data structure: a stack or a queue (Algorithm 2.1 or Algorithm
2.2; the actual realisation is a recursive code, i.e. the stack in Algorithm 2.1 is hidden
by the call stack). The names and semantics of the operations and data structures
in the pseudocode follow standard text books such as [11, 12]. A key ingredient for an efficient
realisation is the transformation of the abstract, nondeterministic traversal definition into a deterministic sequence. The nondeterminism results from the forall
statements in the pseudocode. Introducing an order for these statements makes the
traversal deterministic.
Algorithm 2.1 Depth-first traversal.
T = (ET , ⊑child ⊆ ET × ET , e0 ∈ ET , VT )
stack := (e0 ) when the algorithm starts up
procedure dfs(stack)
    ecurrent ← pop_stack()
    for all e ∈ ET : e ⊑child ecurrent do
        push_stack(e)
    end for
end procedure
Algorithm 2.2 Breadth-first traversal.
T = (ET , ⊑child ⊆ ET × ET , e0 ∈ ET , VT )
queue := (e0 ) when the algorithm starts up
procedure bfs(queue)
    ecurrent ← dequeue_queue()
    for all e ∈ ET : e ⊑child ecurrent do
        enqueue_queue(e)
    end for
end procedure
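Both pseudocode fragments translate directly into executable code once the forall statement is given a fixed order (here: list order), which makes the traversals deterministic. The dictionary-of-children tree encoding is an illustrative assumption.

```python
from collections import deque

def dfs(children, root):
    """Depth-first traversal with an explicit stack (Algorithm 2.1)."""
    order, stack = [], [root]
    while stack:
        e = stack.pop()
        order.append(e)
        for child in children.get(e, []):
            stack.append(child)
    return order

def bfs(children, root):
    """Breadth-first traversal with a queue (Algorithm 2.2)."""
    order, queue = [], deque([root])
    while queue:
        e = queue.popleft()
        order.append(e)
        for child in children.get(e, []):
            queue.append(child)
    return order

# (k = 2)-spacetree fragment for d = 1: a root with two children, one refined
tree = {"e0": ["e1", "e2"], "e1": ["e3", "e4"]}
```

Both orders preserve the child-father relationship of Definition 2.2: a father always precedes its children.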
From a hardware-near and implementation-aware point of view, a straightforward
implementation of the two traversals exhibits a subtle disaccord with a naive reading
of the term element-wise. Each refined geometric element in the spacetree is
read twice: once by the pop operation, once by the push (dequeue and enqueue,
respectively). And each operation has to move records within the memory and, thus,
accesses the data. I weaken the phrase "element is read once" in a minute and regard
both traversals as fitting the definition of element-wise. With this, the depth-first
traversal makes the following statements possible.

• Each element on the stack corresponds to an element in the adaptive Cartesian
grid. Let the stack's top element determine the current element of the element-wise traversal.
• For each refined geometric element, the traversal triggers k^d push-pop combinations. They correspond to a grid traversal transition from an element on a
given level ℓ into a subelement on level ℓ + 1 (a step-down transition). Both
the parent and the child data are available throughout this transition.

• Each pop operation corresponds to a transition from a geometric element into
a sibling or back to the father element (bottom-up transition). Both the source
and the destination element's data are available throughout the transition.
The latter transition agrees with the first item, as the stack's top element determines
the geometric element. Each refined element occurs twice within the sequence of
traversed geometric elements, and this conflicts with a radical interpretation of
the term traversal. I resolve this conflict by replacing the traversal function with two
access functions. Each of them alone defines a traversal in the traditional sense.
Definition 2.3. Let first : ET → N0 denote the first time an element is read. Let
second : ET → N0 denote the second time an element is read; first(e) = second(e) ⇔
¬Prefined (e), and all elements are read at most twice. A process with

    e1 ⊑child e2 ⇒ first(e2 ) < first(e1 )   and   second(e1 ) < second(e2 )

is a k-spacetree traversal; by transitivity, the same orderings hold whenever e1 is a
descendant of e2 or the other way round.
first hereby preserves the father-child relationship; second preserves the inverse
child-father relationship. A k-spacetree traversal defines an element-wise traversal
for all Cartesian grids of the k-spacetree, and the information access discussion from
Section 2.4 is valid for it. All traversals in this thesis are k-spacetree traversals.
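The two access functions can be computed with a recursive depth-first traversal: first fires on the step-down, second when the element is left for good, and for unrefined elements both coincide. The counter-based sketch below is an illustrative assumption, not Peano's realisation.

```python
def access_orders(children, root):
    """Compute 'first' and 'second' of Definition 2.3 for a recursive
    depth-first traversal over a dictionary-encoded spacetree."""
    first, second, clock = {}, {}, [0]

    def descend(e):
        first[e] = clock[0]
        kids = children.get(e, [])
        if not kids:                 # unrefined: read exactly once
            second[e] = first[e]
            clock[0] += 1
            return
        clock[0] += 1
        for c in kids:
            descend(c)
        second[e] = clock[0]         # final pop of the refined element
        clock[0] += 1

    descend(root)
    return first, second

tree = {"e0": ["e1", "e2"], "e1": ["e3", "e4"]}
first, second = access_orders(tree, "e0")

assert first["e0"] < first["e1"] < first["e3"]      # first: father before child
assert second["e3"] < second["e1"] < second["e0"]   # second: child before father
assert all(first[e] == second[e] for e in ("e2", "e3", "e4"))   # leaves
```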
2.6 Vertex-based Refinement and Information Transport

This section discusses the k-spacetree construction process and how the
spacetrees are encoded. Due to the information propagation property, the refinement
information is assigned to the vertices, and an element in turn is refined if and
only if at least one adjacent vertex holds a corresponding refinement flag. As a
result, the algorithm deduces hanging nodes on-the-fly from the refinement predicate
Prefined of coarser levels, i.e. no additional storage effort is spent on hanging vertices.
Furthermore, it turns out that the time at which the refinement predicate is set has to be
chosen carefully if the number of read operations for vertices v ∉ HT is to be
invariant. This property is essential for many algorithms relying on the fact that
all elements adjacent to a vertex are traversed. In the following, I introduce this
refinement flag for the vertices and derive both the elements' refinement and the
term hanging node from this predicate.
For many element-wise algorithms in this thesis, it is essential to be able to decide
whether a vertex v ∈ HT . According to (2.6), a vertex is a hanging vertex if the
number of adjacent elements is smaller than 2^d. As the number of its adjacent
elements depends on the refinement predicates on the level above, the vertex's state,
i.e. whether v ∈ HT or not, depends on the level above.
Example 2.3. Let vl = (x, l) and vl+1 = (x, l + 1) be two vertices at the same
position x ∈ Rd in space. The number of adjacent elements of vl+1 depends on the
state of the adjacent elements of vl : if one adjacent element of vl is not refined, the
number of adjacent elements of vl+1 is smaller than 2^d and vl+1 is hanging.
Let el and el+1 be two elements with el+1 ⊑child el and vl ∈ vertex(el ), vl+1 ∈
vertex(el+1 ). Throughout the top-down transition from el to el+1 , the refinement
state of el is available. The refinement states of the other elements that are adjacent
to vl are not available. To bridge that information gap, vl carries a variable
reconstructing the state of all the adjacent elements. It makes no sense to hold
refinement information redundantly; thus, exclusively the vertices hold refinement
information.
Definition 2.4. The k-spacetrees in this thesis are based upon an or-wise vertex
refinement criterion. There is a refinement predicate Prefined for each vertex with

    ∀e ∈ ET : Prefined (e) ⇔ ∃v ∈ vertex(e) : Prefined (v),   and            (2.7)
    ∀v ∈ HT : ¬Prefined (v),

i.e. every time the refinement predicate holds for a vertex, all the 2^d adjacent geometric elements are refined.
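Resolving (2.7) for a single element is a logical or over its vertex flags; the dictionary encoding of Prefined is an illustrative assumption.

```python
def refined(element_vertices, p_refined):
    """Or-wise vertex refinement criterion (2.7): an element is refined iff
    the refinement predicate holds for at least one of its 2^d vertices."""
    return any(p_refined[v] for v in element_vertices)

# d = 2: two elements sharing an edge; one flagged vertex refines both
p = {(0, 0): False, (1, 0): True, (2, 0): False,
     (0, 1): False, (1, 1): False, (2, 1): False}
left  = [(0, 0), (1, 0), (0, 1), (1, 1)]
right = [(1, 0), (2, 0), (1, 1), (2, 1)]
assert refined(left, p) and refined(right, p)   # both adjacent elements refine
```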
For many algorithms it is extremely useful to ensure that the algorithm processes
non-hanging vertices exactly 2^d times. As one can distinguish first and second
transitions, the term "processes" refers to one of these orders. The next theorem
formalises this constraint and relates it to the information propagation speed.
Theorem 2.1. Consider any dfs-type k-spacetree traversal. Any vertex v ∈ VT \ HT
is read exactly 2^d times, i.e. the traversal accesses every adjacent geometric element
of a non-hanging vertex, if and only if the refinement predicate Prefined (v) does not
change throughout the traversal.
Proof. If the grid does not change throughout the traversal, each vertex is processed
2^d times, as the k-spacetree traversal processes all elements of the k-spacetree once.
To show that this does not hold anymore if the grid changes, it is sufficient to give
one counterexample (see Figure 2.7).
Figure 2.6: If Prefined holds for a vertex, all the adjacent geometric elements are
refined. Within one element, all the vertices' refinement flags are combined
via a logical or to determine whether the element itself is refined.
Figure 2.7: Motivation for Theorem 2.1: (a) The k-spacetree traversal processes an
unrefined element ea . (b) It continues to sibling element eb . (c) Within
eb , the algorithm sets the refinement flag of an adjacent vertex. The
k-spacetree traversal will continue with the new elements e ⊑child eb , but
(d) it will never descend into the new elements e ⊑child ea .
Figure 2.8: The vertex's refinement state is synchronised with the traversal: any
algorithm may trigger a refinement at any time. The refinement-triggered
state is changed into refined in-between two iterations. The grid
structure thus is invariant throughout the traversal.
With the or-wise vertex refinement criterion and a fixed number of vertex accesses,
a vertex's refinement state has to remain invariant throughout the traversal.
As many algorithms want to change the grid's layout whenever they "feel" like it,
the vertices carry multiple refinement states with a transaction semantics. If an
algorithm wants to refine a vertex, it switches the vertex's state from unrefined
to refinement-triggered. The refinement predicate does not hold for the additional
state refinement-triggered. At the end of the traversal, the algorithm takes
all the vertices with state refinement-triggered and switches their state
from refinement-triggered to refined. The refinement predicate does hold for
refined. Instead of a postprocessing of all vertices, the transition is implemented
after a vertex is read for the last time. An analogous argumentation is valid for the
coarsening of refined vertices, i.e. for algorithms that change a vertex's state from
refined to unrefined, and the underlying state chart is given in Figure 2.8. The
outlined solution with a transaction variable is only one possible solution to this
problem. If setting the refinement predicate were allowed only at the first read, no
transaction variable would be required. I preferred, though, to keep the refinement
criterion and decision algorithms completely encapsulated from the grid management.
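The state chart of Figure 2.8 can be sketched as a small class; the state names follow the text, while the coarsening branch and all identifiers are illustrative assumptions.

```python
UNREFINED, REFINEMENT_TRIGGERED, REFINED, COARSENING_TRIGGERED = (
    "unrefined", "refinement-triggered", "refined", "coarsening-triggered")

class RefinementFlag:
    """Refinement state of a vertex: algorithms may trigger transitions at
    any time, but the predicate P_refined only changes in-between two
    traversals, so the grid structure stays invariant within one traversal."""
    def __init__(self):
        self.state = UNREFINED

    def p_refined(self):                 # the predicate the grid relies on
        return self.state in (REFINED, COARSENING_TRIGGERED)

    def refine(self):                    # may be called during a traversal
        if self.state == UNREFINED:
            self.state = REFINEMENT_TRIGGERED

    def coarsen(self):
        if self.state == REFINED:
            self.state = COARSENING_TRIGGERED

    def end_of_traversal(self):          # commit triggered transitions
        if self.state == REFINEMENT_TRIGGERED:
            self.state = REFINED
        elif self.state == COARSENING_TRIGGERED:
            self.state = UNREFINED

flag = RefinementFlag()
flag.refine()
assert not flag.p_refined()     # grid structure invariant during the traversal
flag.end_of_traversal()
assert flag.p_refined()         # refinement becomes visible next iteration
```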
Within an efficient implementation, the existential quantifier in (2.7) has to be resolved.
A brute-force search for all neighbour elements is ill-suited. There is an elegant
formulation based upon the definition of the father vertices (Algorithm 2.3). This
formulation accepts the parent's vertices in cell-wise lexicographic order and the
position of the vertex within the refined parent element. It then derives the state of
an element's vertex from these inputs.
Theorem 2.2. With Algorithm 2.3, a non-hanging node is given by

    vi ∈ VT \ HT ⇔ ⋁_{v ∈ father(Vparent, i)} Prefined (v).

Vparent denotes the 2^d vertices of the father cell. ⋁ denotes a logical or over all
the function's results and motivates the term or-wise refinement criterion.
Proof. An induction over the dimension d proves the theorem for any k.

• With d = 1, the father element is a line split up into k subelements.

  – For number = 0, father(V, 0) returns the leftmost vertex of the refined
    element. If the father's left vertex is not refined, the father's left neighbour
    element is not refined either. v0 is a hanging vertex.

  – For number = k, father(V, k) returns the rightmost vertex of the refined
    element. If the father's right vertex is not refined, the father's right
    neighbour element is not refined either. vk is a hanging vertex.

  – For number ∈ {1, . . . , k − 1}, both vertices adjacent to the coarse refined
    element are contained within father(V, number)'s result set. One of them
    holds the refinement predicate, as the element is refined. vnumber thus has
    two adjacent geometric elements and is not a hanging vertex.
• With d 7→ d + 1, the proof has to distinguish vd+1 ∈ {1, . . . , k − 1} from
vd+1 ∈ {0, k}:
– If vd+1 ∈ {0, k}, the vertex v belongs to a coarse element face with a
normal parallel to the xd -axis. parentno’s corresponding entry is set to
this face’s coordinate, i.e. 0 for the face nearer to the coordinate system’s
origin or 1 respectively. The decision and analysis of this face’s vertices
then reduces to a d-dimensional challenge.
– If vd+1 ∈ {1, . . . , k − 1}, the vertex v does not belong to a face with a
normal parallel to the xd+1 -axis. If the direct neighbour vertices along the
xd+1 -axis are not hanging, the vertex itself is not hanging, too. As this
arguing is to be repeated, the next two neighbours along the xd+1 -axis
that are element of the coarse element’s face with normal d + 1 determine
whether the vertex is hanging. If one of them is not hanging, the vertex
also is not hanging: it then has 2d+1 adjacent geometric elements. The
problem reduces to two d-dimensional challenges.
2 Adaptive Cartesian Grids and Spacetrees
Algorithm 2.3 Derive the father vertices, i.e. the vertices of the refined father
element that influence a vertex. The algorithm accepts the 2^d vertices adjacent to the father
element and the vertex's position within the k^d refinement patch.

father : P(V_T) × {0, . . . , k}^d → P(V_T)
father(V_parent, number) = V′_parent, with |V_parent| = 2^d and V′_parent ⊆ V_parent.

1: procedure father(V_parent, number)
2:   V′_parent ← ∅                           ⊲ Result set.
3:   parentno ← (0, . . . , 0)               ⊲ number and parentno are d-tuples.
4:   for i ∈ {0, . . . , d − 1} do           ⊲ Index i denotes the ith entry, C-style.
5:     if number_i = 0 then
6:       parentno_i ← 0
7:     else if number_i = k then
8:       parentno_i ← 1
9:     else
10:      newnumber1, newnumber2 ← number
11:      newnumber1_i ← 0
12:      newnumber2_i ← k
13:      V′_parent ← V′_parent ∪ father(V_parent, newnumber1)   ⊲ Recursive call.
14:      V′_parent ← V′_parent ∪ father(V_parent, newnumber2)   ⊲ Recursive call.
15:      return V′_parent
16:    end if
17:  end for
18:  V′_parent ← {v ∈ V_parent : number(v) = parentno}
19:                 ⊲ Result contains the vertex at the same position in the level above.
20:  return V′_parent                        ⊲ Recursion terminates.
21: end procedure
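The recursion of Algorithm 2.3 can be sketched in a few lines of Python. The function below is a hypothetical re-implementation that works on vertex positions instead of vertex objects: it maps a patch position number ∈ {0, . . . , k}^d to the set of father-cell vertex positions in {0, 1}^d that influence it.

```python
def father(number, k):
    """Father vertices (cf. Algorithm 2.3), sketched on positions only.

    number -- d-tuple in {0, ..., k}^d, a vertex position within a k^d patch
    Returns the set of father-cell vertex positions (d-tuples in {0, 1}^d)
    that influence this vertex.
    """
    for i, n in enumerate(number):
        if n not in (0, k):
            # Interior coordinate: recurse on both bounding hyperfaces.
            low = number[:i] + (0,) + number[i + 1:]
            high = number[:i] + (k,) + number[i + 1:]
            return father(low, k) | father(high, k)
    # Recursion terminates: vertex at the same position in the level above.
    return {tuple(0 if n == 0 else 1 for n in number)}
```

With k = 3 and d = 2, father((0, 0), 3) yields {(0, 0)}, while the patch-interior vertex (1, 2) is influenced by all four father vertices; Theorem 2.2 then marks a vertex as non-hanging iff the refinement predicate holds for at least one of the returned positions.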
—Lexicographic Enumeration—
The lexicographic enumeration defines an enumeration for all vertices of one
cell or all vertices of one refinement step. It assigns each vertex a d-tuple

number : v ∈ V_T ↦ {0, . . . , k}^d

and a linearised index

number_linearised : v ∈ V_T ↦ N_0   with   number_linearised(v) = Σ_{i=0}^{d−1} number_i(v) · (k + 1)^i.

Within a single cell, k = 1 gives a vertex enumeration. Otherwise, k equals
the k in k-spacetree, and number is the vertex's position within a k^d patch,
i.e. within one refined geometric element.
Within a k^d motif, the enumeration starts with the vertex nearest to the origin of the Cartesian coordinate system. It then enumerates all the vertices
along the x_1-axis. Afterwards, it continues with the x_2-axis, etc. In the two-dimensional case, the enumeration starts with the bottom left vertex, and it
enumerates from left to right and bottom-up.
Instead of working with the function number(v), this thesis often denotes the
index as a subscript, i.e. v_i ⇔ number(v_i) = i. If vertices of different levels are
involved, two subscripts are used: the first denotes the level, the second the
index.
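The linearisation above is a plain base-(k + 1) expansion, since each coordinate of a k^d patch ranges over {0, . . . , k}. A minimal sketch, assuming positions are given as d-tuples:

```python
def linearise(number, k):
    """Lexicographic linearisation: x1 runs fastest, then x2, and so on."""
    return sum(n * (k + 1) ** i for i, n in enumerate(number))
```

For d = 2 and k = 3, the bottom left vertex (0, 0) maps to 0 and the top right vertex (3, 3) maps to 15, i.e. the 16 vertices of a 3^2 patch are enumerated from left to right and bottom-up.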
2.7 Geometry Representation
Cartesian grids are not aligned with the computational domain, i.e. their cells cover
parts outside and inside the domain. Since a PDE is defined on a computational
domain, the domain's shape information has to be transferred to the spatial discretisation. Afterwards, one can solve the PDE numerically. This section discusses the
mapping from the continuous domain to the k-spacetree, emphasising the
identification of vertices that are inside the computational domain or on the domain's
boundary. For the boundary vertices, the PDE's boundary conditions determine the
vertices' state: there are inner vertices where the numerics determine the solution,
there are boundary vertices with prescribed properties, and there are ignored outer
vertices. At the end of the section, I present a simple geometric refinement criterion.
The spacetree construction already involves geometric information, since the root
element covers the computational domain. It is not inside the domain but covers the domain's
boundary. I continue this distinction into inner, outer and hybrid elements for
each k-spacetree element and end up with a marker-and-cell approach—[38] synonymously refer to it as "marker and cell technique". The union of the inner cells
then gives the computational domain. As this approach does not approximate the
computational domain exactly, it demands a mapping of the computational domain's boundary to the boundary points of the Cartesian grids, and it demands
a discussion of the accuracy.
Definition 2.5. A geometric element e ∈ E_T is inside the computational domain if
e ⊆ Ω̄, where Ω is open, e is closed, and Ω̄ denotes the closure of Ω. P_inside(e) and ¬P_outside(e) hold.
Otherwise, it is outside, i.e. ¬P_inside(e) and P_outside(e).
Definition 2.6. A vertex v ∈ V_T is
• outside the computational domain, if all 2^d adjacent elements e ∈ adjacent(v)
are outside. P_outside(v) holds.
• a boundary vertex, if at least one but not all of the 2^d adjacent elements
e ∈ adjacent(v) are outside. P_boundary(v) holds.
• inside the computational domain otherwise, i.e. if all 2^d adjacent elements
e ∈ adjacent(v) are inside. P_inside(v) holds.
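Definition 2.6 translates directly into a counting predicate. A minimal sketch, assuming the 2^d adjacent elements are passed as booleans stating whether each element is outside; the function name is hypothetical:

```python
def classify_vertex(adjacent_outside):
    """Classify a vertex according to Definition 2.6.

    adjacent_outside -- sequence of 2^d booleans, True iff the corresponding
                        adjacent element lies outside the computational domain
    """
    outside = sum(adjacent_outside)
    if outside == len(adjacent_outside):
        return "outside"      # Poutside(v)
    if outside >= 1:
        return "boundary"     # Pboundary(v)
    return "inside"           # Pinside(v)
```

For d = 2, a vertex with two of its four adjacent elements outside is a boundary vertex.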
These two classifications lead to a number of interesting statements on the approximation of the continuous computational domain. All the statements rest upon
the property that the discretisation shrinks the domain, i.e. the discrete computational domain is contained within the continuous one. Within the multiscale context, this proves of great value, although
alternative formulations of the inside/outside/boundary predicates are possible and
might make sense in another context.
PDEs are defined on the computational domain. Algorithms to solve a PDE on
a grid, algorithms to visualise a solution, and so forth thus perform a number of
fixed operations on elements and vertices inside or at the boundary of the discretised
computational domain. There is nothing to compute on elements outside the computational domain. Thus, it is reasonable to make the traversal check each element's
state before it triggers any user-defined operation. As k-spacetree traversals preserve
the child-father relationship, a check exhibiting that a geometric element is inside
the computational domain makes all the subsequent checks for the descendants'
state obsolete. For such a code optimisation, the next property is useful.
Corollary 2.1.

P_inside(e) ⇔ ∀ e_i ⊑child e : P_inside(e_i)   and   ∀ e_i ⊑child e : P_outside(e_i) ⇒ P_outside(e).

Proof. Both statements result directly from the fact that the union of all children
of a parent equals the parent's geometric element:

⋃_{e_i ⊑child e} e_i = e,

∀ e_i ⊑child e : e_i ⊆ Ω̄ ⇔ ⋃_{e_i ⊑child e} e_i ⊆ Ω̄ ⇔ e ⊆ Ω̄ ⇔ P_inside(e)

and

∃ e_i ⊑child e : P_outside(e_i) ⇒ ∃ e_i ⊑child e : e_i ⊈ Ω̄ ⇒ ⋃_{e_i ⊑child e} e_i ⊈ Ω̄ ⇒
⇒ ¬P_inside(e) ⇔ P_outside(e).
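Corollary 2.1 justifies pruning the geometry checks during the descent: once an element is known to be inside, none of its descendants has to be queried again. A sketch with a hypothetical spacetree node type, counting how many queries a pruned descent issues:

```python
class Cell:
    """Hypothetical spacetree node: an inside flag plus child cells."""
    def __init__(self, inside, children=()):
        self.inside = inside        # Pinside(e)
        self.children = children    # child cells e_i below e

def count_checks(cell, checks=None):
    """Count geometry queries for a descent that skips the descendants of
    inside cells (Corollary 2.1: Pinside(e) implies Pinside(e_i))."""
    if checks is None:
        checks = [0]
    checks[0] += 1                  # one query for this cell
    if not cell.inside:             # only non-inside cells need child checks
        for child in cell.children:
            count_checks(child, checks)
    return checks[0]
```

For a root crossing the boundary with one fully inside child (holding nine children of its own) and one outside child, the pruned descent issues three queries instead of twelve.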
A k-spacetree yields adaptive Cartesian grids for a given level and an arbitrary computational domain Ω. Let

Ω_{h,ℓ} = {e ∈ E_T : level(e) = ℓ ∧ P_inside(e)}   and
Ω^adaptive_{h,ℓ} = {e ∈ E_T : P_inside(e) ∧ (level(e) = ℓ ∨ (level(e) < ℓ ∧ ¬P_refined(e)))}.

Depending on the context, I also refer to Ω_{h,ℓ} as the grid of level ℓ. Ω^adaptive_{h,ℓ} defines
the adaptive Cartesian grid with maximum level ℓ. The computational fine grid or
fine grid of a k-spacetree is

Ω_h = {e ∈ E_T : P_inside(e) ∧ ¬P_refined(e)}.

I use the same symbol for fine grid, spatial discretisation and computational fine grid,
i.e. I do not distinguish between the spacetree's leaves and the fine grid restricted
to the inner geometric elements, as the context makes the semantics unambiguous.
Figure 2.9: The bigger the maximum level of the refinement criterion in (2.9), the
better the approximation of the computational domain becomes; here, it is a circle.
The domain's volume grows with increasing level, i.e. the discrete computational
domains of coarser resolutions are contained within the discrete computational
domains belonging to finer resolutions.
Theorem 2.3. Let ⊑ ∈ P(E_T) × P(E_T) denote that the elements of one discretisation
tessellate another discretisation, i.e. each primitive of the set on the right-hand side
fits completely into a primitive of the set on the left-hand side. Then

Ω^adaptive_{h,1} ⊑ Ω^adaptive_{h,2} ⊑ Ω^adaptive_{h,3} ⊑ . . . ⊑ Ω_h.   (2.8)

Proof. The statement Ω^adaptive_{h,ℓ} ⊑ Ω^adaptive_{h,ℓ+1} directly results from Corollary 2.1. The
final ⊑ relation results from

lim_{ℓ→∞} {e ∈ E_T : level(e) = ℓ ∨ (level(e) < ℓ ∧ ¬P_refined(e))} = {e ∈ E_T : ¬P_refined(e)}.
The spatial approximations of the different levels of a k-spacetree converge monotonically to the exact computational domain (Theorem 2.3), and the finer the grid
level, the better the approximation (Figure 2.9).
All this formalism affords the definition of a geometric adaptivity criterion. For
all vertices v ∈ V_T \ H_T fulfilling

level(v) ≤ ℓ ∧ P_boundary(v)   (2.9)

the algorithm triggers a refinement. This simple rule refines each geometric element
up to a given level ℓ, i.e. the formula yields an adaptive Cartesian grid with approximation order O(h) (Figure 2.10). The approximation of the adaptive Cartesian grid
Figure 2.10: The computational domain's boundary intersects a square (d = 2, left)
or a cube (d = 3, right). The maximum distance from a hypercube's vertex to the
boundary is dist ≤ √d · h/2. The spatial discretisation error is in O(h).
resulting from the refinement criterion above thus is in O(k^{−ℓ}), if the computational
domain is embedded into the unit hypercube.
Although the k-spacetree approximates the computational domain up to any precision, the discrete domain's boundary does not match the continuous boundary exactly, and a solver has to project the continuous boundary's conditions to the
computational grid's boundary vertices. In this thesis, I apply a simple projection:
For each boundary point, I search for the nearest point on ∂Ω and copy its state
and value, i.e. I map information along the distance vector to the boundary (Figure
2.10). A more sophisticated projection, e.g. one preserving the boundary data's L2-norm,
might be of great value for some applications.
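The nearest-point projection is easy to write down for simple shapes. A sketch under the assumption that ∂Ω is a sphere of radius r around c, so the nearest boundary point lies on the ray from the centre through the vertex; the function name and defaults are illustrative:

```python
import math

def project_to_sphere(x, c=(0.5, 0.5), r=0.25):
    """Nearest point on a sphere's surface for a boundary vertex x;
    boundary state and value are copied along this distance vector."""
    direction = [xi - ci for xi, ci in zip(x, c)]
    norm = math.sqrt(sum(di * di for di in direction))
    if norm == 0.0:
        # Degenerate case (vertex at the centre): pick an arbitrary point.
        return tuple(ci + (r if i == 0 else 0.0) for i, ci in enumerate(c))
    return tuple(ci + r * di / norm for ci, di in zip(c, direction))
```

The vertex (0.5, 0.9) outside the default circle projects onto (0.5, 0.75) on its surface.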
The two-step refinement cascade in Section 2.6 makes the k-spacetree's structure invariant throughout the traversal (Figure 2.8). refinement-triggered, an
intermediate state, encapsulates the refinement mechanism and its constraints, and it
enables any algorithm to trigger the refinement at any time. As a result, at most one
grid level is added per traversal. Another approach preserving a constant number of
vertex accesses for v ∉ H_T would be to allow a refinement only immediately before
the first time a vertex is read in a traversal. Criterion (2.9) depends exclusively
on the vertex's position, i.e. the spacetree construction can evaluate it whenever a
vertex is created, and it can switch the vertex's state immediately to refined. It
would fit into a modified spacetree construction without the intermediate state.
Both approaches come with pros and cons: If an algorithm requires a spacetree of height ℓ and does not modify the grid later on throughout the traversals, it
does not make sense to spend ℓ traversals on building up the grid. In turn, a multigrid
F-cycle [73] for a sufficiently smooth problem9 never requires more than one additional
level per traversal. Peano solely follows a one-level-per-traversal policy.
A PDE defines the type of the boundary's vertices and the fitting boundary values.
Thus, the numerical scheme is able to reconstruct the vertices' state in each iteration
on-the-fly, and there is no need to hold boundary vertices persistently. The codes
accompanying the forerunners of this thesis ([35, 39] and [63]) hence never store
boundary vertices. Experience with longer simulation runs and increasingly complex
applications reveals that the vertex construction process consumes more runtime
than the overhead resulting from explicitly stored boundary vertices. Furthermore,
the multiscale solvers in this thesis frequently modify the boundary's vertices on coarser levels. An on-the-fly reconstruction of these modifications is cumbersome, as
it demands the complete boundary's multilevel data. I thus decided to store
boundary vertices, too.
Examining the computational domain Ω = (0, 1)^d delivers another insight: Here,
the spacetree's root element e_0 equals the computational domain (e_0 = Ω̄), and
all the spacetree's boundary vertices are hanging vertices. According to Definition
2.3, hanging vertices hold no semantics, i.e. the whole boundary condition is to be
represented by e_0's vertices. This is neither desirable nor always possible. As soon as
the boundary vertices' state or value is not invariant in space, the hanging vertices
should hold information on their own. I thus refine e_0 once or twice until I am able
to embed Ω into an element surrounded by other (empty) elements. For k ≥ 3, one
refinement step is sufficient. This ensures that no boundary vertex is hanging.
2.8 Traversal Events
k-spacetrees and the element-wise k-spacetree traversals form the basis of the subsequent three chapters. The algorithms plug into the transitions of the traversal as
well as into the steps of the grid generation. From there, they evaluate the formulas to compute the PDE's solution, visualise data, postprocess results, update the
spacetree itself, set up initial solution guesses as well as boundary conditions, and so
forth. This section formalises the plug-in mechanism.
In the implementation, the plug-in mechanism equals a template method pattern
[27], i.e. the software defines a number of operations with a predefined signature.
The traversal then calls these operations at the right time, passing the required
arguments. I call these operations events.
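In Python, the template method pattern reduces to a base class with no-op event operations that a concrete algorithm overrides. A minimal sketch; the event names follow the text, while the class names are hypothetical and not Peano's API:

```python
class EventHandler:
    """Operations with predefined signatures; the traversal calls them at
    the right time (template method pattern). Defaults do nothing."""
    def beginTraversal(self): pass
    def endTraversal(self): pass
    def enterElement(self, element): pass
    def leaveElement(self, element): pass

class CellCounter(EventHandler):
    """Example algorithm: counts the elements the traversal enters."""
    def __init__(self):
        self.cells = 0
    def enterElement(self, element):
        self.cells += 1
```

The grid management component only sees the EventHandler interface; any algorithm implementing it can be plugged into the traversal.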
The events are split up into two groups. Events of the first group correspond to the
geometry management and the grid generation process. They are called throughout
the grid construction and listed in Table 2.1. Yet, as any algorithm is allowed to
update the grid structure at any time, the grid construction overlaps with other phases.
9 For singularities, this statement does not hold anymore.
Events of the second group correspond to the spacetree traversal. They map the
transitions within the hierarchical grid to operations typically belonging to the solver
or data postprocessing.
Table 2.1: Grid generation events.

createDegreeOfFreedom
  R^d × R × V_T ↦ V_T, (x, h, v) ↦ v′
  Factory method. The method is given a new vertex (hull) v. The user's implementation initialises this vertex and returns v′. Besides the vertex record, the operation accepts the position x and the mesh width h corresponding to the vertex's level.

createDegreeOfFreedom
  R^d × R × E_T ↦ E_T, (x, h, e) ↦ e′
  Factory method. The method is given a new geometric element (hull) e. The user's implementation initialises this element and returns e′. x's and h's semantics equal those of the other createDegreeOfFreedom operation.

isElementOutsideDomain
  R^d × R^d ↦ {⊤, ⊥}, (x, h) ↦ b
  Spatial query. The method is given a point x in space and a (hyper-)rectangular h-environment. It returns whether or not the element identified by the position and the surrounding environment is outside the computational domain. The grid uses this information for optimisation purposes according to Corollary 2.1.

isElementInsideDomain
  R^d × R^d ↦ {⊤, ⊥}, (x, h) ↦ b
  Spatial query. Counterpart of the event before. Only geometric elements completely inside the computational domain belong to the computational grid.

refine
  R^d × R^d ↦ {⊤, ⊥}, (x, h) ↦ b
  Spatial query. The method is given a point x in space and a (hyper-)rectangular h-environment. It returns whether the grid should refine here. The geometric refinement criterion, e.g., plugs into this event.
The transition events (Table 2.2) are split up into events corresponding to the
overall traversal, events corresponding to one single level, events representing an
inter-level transition, and operations managing the lifecycle of geometric elements
and vertices. All the events’ implementations are allowed to modify the elements’
and vertices’ state. They are not allowed to modify spatial data such as positions,
levels and mesh widths.
Before and after a traversal, the grid triggers the events beginTraversal and
endTraversal. For one single level, enterElement accepts an element, its 2^d vertices,
the position of vertex v_0 (footnote 10), the element's size and level. It is called before the traversal enters the element. leaveElement is the counterpart. touchVertexFirstTime
is invoked the first time a vertex v is read throughout the traversal. Besides the
position, the level and the corresponding mesh width h, the traversal also passes the
parent vertices and an integer vector p identifying the vertex v's discrete position
within a (k + 1)^d Cartesian grid corresponding to the geometric parent element. As
all the traversals in this thesis are based upon a depth-first traversal preserving the
child-father relationship, the value p is always available to the traversal implementation. touchVertexLastTime is the counterpart of touchVertexFirstTime. If it
is triggered, all of v's adjacent geometric elements have been traversed before.
The set of inter-level transition events comprises two operations: loadSubElement
precedes a top-down transition, storeSubElement corresponds to a bottom-up transition. Both operations accept a refined geometric element, the element's spatial
attributes (position, level, etc.) and its vertices. Furthermore, they get all the data
of the child element that is loaded or stored, respectively. The operations facilitate an
inter-level information transfer.
The operation createPersistentVertex mirrors the createDegreeOfFreedom
factory method for vertices. It is a redundant lifecycle management operation that
enables the programmer to implement a whole lifecycle algorithm within the grid
traversal event set. destroyPersistentVertex is the counterpart of
createPersistentVertex. Both operations are defined on V_T \ H_T. The operations
createTemporaryVertex and destroyTemporaryVertex in turn provide the same plug-in possibility for hanging
vertices v ∈ H_T.
The traversal events permit the interpretation of a traversal as an event sequence.
The event sequence corresponding to a depth-first traversal is given by Algorithm
2.4. For each algorithm implemented as a plain set of operations, its behaviour is
well-defined as a mapping from events to these operations. To give such a mapping
for an algorithm thus proves that the algorithm fits into the k-spacetree traversal
concept. The element-wise traversal poses restrictions on the information available
throughout the traversal. This information restriction is also formalised by the event
sequence, and a mapping proves that an algorithm's implementation can cope
with the information available. Besides the data passed to the events, each
mapping also has to consider the order of the events within the k-spacetree traversal.
The order's properties can be formalised as invariants. As they are obvious, I
decided not to write them down explicitly but to give some simple examples.
10 See the excursus on page 29 for a description of the enumeration and the position indexing.
Table 2.2: Grid traversal events.

beginTraversal: {⊤, ⊥} ↦ ∅
endTraversal: ∅ ↦ ∅
enterElement: V_T^{2^d} × E_T × R^d × R^d × N_0 ↦ V_T^{2^d} × E_T, (v, e, h, x, l) ↦ (v′, e′)
leaveElement: V_T^{2^d} × E_T × R^d × R^d × N_0 ↦ V_T^{2^d} × E_T, (v, e, h, x, l) ↦ (v′, e′)
touchVertexFirstTime: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T × V_T^{2^d}, (v, x, h, l, p, v_parent) ↦ (v′, v′_parent)
touchVertexLastTime: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T × V_T^{2^d}, (v, x, h, l, p, v_parent) ↦ (v′, v′_parent)
loadSubElement: V_T^{2^d} × E_T × R^d × R^d × N_0 × V_T^{2^d} × E_T × R^d ↦ V_T^{2^d} × E_T × V_T^{2^d} × E_T, (v_parent, e_parent, h_parent, x_parent, l_parent, v_child, e_child, x_child) ↦ (v′_parent, e′_parent, v′_child, e′_child)
storeSubElement: V_T^{2^d} × E_T × R^d × R^d × N_0 × V_T^{2^d} × E_T × R^d ↦ V_T^{2^d} × E_T × V_T^{2^d} × E_T, (v_parent, e_parent, h_parent, x_parent, l_parent, v_child, e_child, x_child) ↦ (v′_parent, e′_parent, v′_child, e′_child)
createPersistentVertex: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T, (v, x, h, l, p, v_parent) ↦ v′
destroyPersistentVertex: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T, (v, x, h, l, p, v_parent) ↦ v′
createTemporaryVertex: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T, (v, x, h, l, p, v_parent) ↦ v′
destroyTemporaryVertex: V_T × R^d × R^d × N_0 × {0, . . . , k}^d × V_T^{2^d} ↦ V_T, (v, x, h, l, p, v_parent) ↦ v′
Algorithm 2.4 The depth-first traversal triggers a well-defined set of events. Algorithm 2.1 defines the depth-first traversal using a stack; here, I prefer a recursive
formulation. Both are equivalent, as the stack in Algorithm 2.1 mirrors a call stack.
As I omit the events' arguments and technical details, touchVertexFirstTime and
touchVertexLastTime as well as the lifecycle and geometry management events do
not occur in the description. e_0 is the root node of the k-spacetree.

1: procedure traverseDfs
2:   trigger beginTraversal
3:   dfs(e_0)
4:   trigger endTraversal
5: end procedure
6: procedure dfs(e)
7:   trigger enterElement
8:   for e_i ⊑child e do
9:     trigger loadSubElement
10:    dfs(e_i)
11:    trigger storeSubElement
12:  end for
13:  trigger leaveElement
14: end procedure
• The event touchVertexLastTime ensures that leaveElement has been called
for each adjacent geometric element.
• enterElement is called before leaveElement.
• Both enterElement and leaveElement are triggered exactly once per element and traversal.
• The event enterElement ensures that for each vertex either
touchVertexFirstTime, createPersistentVertex or createTemporaryVertex
has been called before.
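Algorithm 2.4 and such order invariants can be checked mechanically. A sketch in which a spacetree element is encoded as a (possibly empty) list of children, an assumption made only for this illustration:

```python
def traverse_dfs(root):
    """Recursive depth-first traversal emitting the event sequence of
    Algorithm 2.4 (arguments and lifecycle events omitted)."""
    events = ["beginTraversal"]

    def dfs(element):
        events.append("enterElement")
        for child in element:          # element is a list of child elements
            events.append("loadSubElement")
            dfs(child)
            events.append("storeSubElement")
        events.append("leaveElement")

    dfs(root)
    events.append("endTraversal")
    return events

# Invariant check on a small tree of four elements: enterElement precedes
# leaveElement, and both occur exactly once per element.
seq = traverse_dfs([[], [[]]])
assert seq.count("enterElement") == seq.count("leaveElement") == 4
assert seq.index("enterElement") < seq.index("leaveElement")
```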
A light-weight component composition following the publish-subscribe idea of the
observer pattern [27] puts the event concept into the code: The grid management
and traversal component defines interfaces comprising the traversal and grid generation events. These interfaces are the only interaction points—a separated interface
[24]—visible from outside the component. Plug-ins implement them in their own
components and delegate the events to operations of their own.
On the one hand, the composition pattern enables the algorithm to exchange
the mapping of events to an algorithm's operations at runtime. A programmer is
thus able to exchange the algorithm throughout the computation (e.g. perform γ1 solver
traversals, plug in a plotter afterwards and stream the results to a visualisation
component, and perform γ2 data postprocessing traversals afterwards). On the other hand,
the mapping from events to operations has arbitrary cardinality, since
there is the opportunity to make one event trigger several algorithms' operations:
An additional event interface implementation just has to follow the multiple dispatch
paradigm [27] and delegate one event to several implementations. In this thesis, e.g.,
the parallelisation and the Poisson solver algorithms are completely independent of
each other. Yet, it is straightforward to combine them: I just plug both of them
into the event sequence, i.e. each event triggers the solver's operations as well as the
parallel algorithm's operations.
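The multiple dispatch combination can be sketched as a composite that forwards each event to several plugged-in handlers. The class names are illustrative, not Peano's API; handlers may implement only the events they care about:

```python
class CompositeHandler:
    """Delegates one event to several plugged-in algorithms, e.g. a solver
    and the parallelisation, without them knowing about each other."""
    def __init__(self, *handlers):
        self.handlers = handlers

    def __getattr__(self, event):
        # Called for any unknown attribute, i.e. any event name:
        # return a function that broadcasts the event to all handlers.
        def broadcast(*args, **kwargs):
            for handler in self.handlers:
                operation = getattr(handler, event, None)
                if operation is not None:
                    operation(*args, **kwargs)
        return broadcast

class Recorder:
    """Toy plug-in that logs every element it enters."""
    def __init__(self):
        self.log = []
    def enterElement(self, element):
        self.log.append(element)
```

A composite built from two recorders forwards one enterElement event to both of them, mirroring how solver and parallelisation operations are combined per event.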
Although Peano's mapping from events to operations is flexible and follows the
separation-of-concerns paradigm, it does not lead to a performance breakdown if it
is realised by static polymorphism—a pattern often found in object-oriented high-performance code ([4], e.g.). Here, the lookup table techniques typically employed for
virtual function tables are replaced by static binding due to generic programming.
The only consequence arising from static polymorphism is that all possible combinations
of plug-ins have to be available at compile time. In my experiments, this was
always the case. Nevertheless, Peano also offers a plug-in mechanism for dynamic
polymorphism.
2.9 Experiments
The following experiments analyse different spacetree properties for three out of
four different geometries from Figure 2.2. These experiments either fit exactly to
the spacetree’s structure (the cube equals the spacetree’s root element), exhibit a
complex surface not fitting to the spacetree elements (sphere), or hold faces not
aligned with the spacetree’s faces (L-shape).
All measurements are based upon a (k = 3)-partitioning. The L-shape's continuous
domain is (0, 1)^d \ [0, 1/2]^d, and the spacetree's hypercubes near x_i = 1/2 consequently
never fit exactly to the computational domain. As a result, the spacetree has to approximate the computational domain although all faces are aligned with the Cartesian coordinate system's axes. The sphere is embedded into the unit hypercube,
too, and its analytical volume thus equals (1/2)^d · π^{d/2} / Γ(d/2 + 1) with the gamma function Γ.
First, the volume of the continuous computational domain is contrasted with the
volume of the fine grid of the k-spacetree. The comparison is trivial for the hypercube as computational domain, since the hypercube equals the spacetree's root
element. Both domains' volumes are equal for spacetrees of arbitrary height and,
thus, a plot comparing both volumes would carry no information (see Tables
2.3 and 2.5 instead). If the L-shape with Ω = (0, 1)^d \ [0, 1/2]^d is embedded into the
(k = 3)-spacetree's root element,
Figure 2.11: Volume of the spatial discretisation of the L-shape from Figure 2.2. The
top plot shows the volume, the bottom plot the error, both over the number of inner
elements, for regular and adaptive grids with d = 2, . . . , 5 (the top plot additionally
shows the continuous domain's volume for comparison). Each subsequent tick
corresponds to one refinement step.
Figure 2.12: Experiment from Figure 2.11 with a hypersphere as computational domain (d = 2, 3, 4). Each subsequent tick corresponds to one refinement step.
Table 2.3: Hypercube domain with the cardinalities of elements and vertices, as well
as the resulting memory overhead. For each dimension d ∈ {2, 3, 4, 5}, the upper
block gives figures for a regular grid, the lower block for an adaptive grid based upon
the geometric refinement criterion. The columns list |P_inside(E_T)|, |P_inside(V_T)| and
the number of boundary vertices, both for the fine grid Ω_h and for the whole
k-spacetree. (The individual cardinalities are not reproduced here.)
• the volume of the fine grid converges to the analytical volume linearly in
h: Each tick in the figures represents one experiment. The heights of two
subsequent experiments' spacetrees differ by one, i.e. each refinement reduces
the error by a factor of k = 3 (Figure 2.11, top). Two subsequent refinements
thus reduce the volume's error almost by one decade (Figure 2.11, bottom),
and the convergence rate is independent of both h and d.
• As the fine grid is contained within the computational domain, the fine grid's
volume is smaller than the analytical volume. Hence, the fine grid's volume
monotonically increases towards the analytical volume.
• The convergence constant depends on d, as there is a constant √d hidden in the
error estimate (Figure 2.10): With increasing dimension d, the convergence
speed deteriorates.
• The grids for adaptive spacetrees yield the same spatial discretisation error as
their regular counterparts. Two ticks for the same d with the same ordinate value
represent such a pair of experiments. The adaptive grid comes with a
substantially smaller number of elements and vertices.
If a hypersphere is embedded into the k-spacetree’s root element, the same observations hold (Figure 2.12).
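The d = 2 measurement for the L-shape is easy to reproduce in a few lines: on a regular grid with mesh width h = 3^{−ℓ}, a cell belongs to the fine grid iff it does not intersect the cut-out corner square [0, 1/2]^2. A sketch of this experiment under exactly that assumption:

```python
def lshape_volume(level, k=3):
    """Volume of the inner cells of a regular grid of level `level` on the
    d = 2 L-shape (0,1)^2 without [0,1/2]^2; cells intersecting the corner
    square do not count as inside."""
    n = k ** level              # n x n cells with mesh width h = 1/n
    h = 1.0 / n
    inner = sum(1 for i in range(n) for j in range(n)
                if i * h >= 0.5 or j * h >= 0.5)
    return inner * h * h
```

The volumes increase monotonically towards the analytical value 3/4, and the error 3/4 minus the volume shrinks roughly by a factor of k = 3 per additional level, matching the first and second observations above.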
Second, the cardinalities of the spacetree's inner fine grid vertices and elements
are compared with the overall number of inner elements and vertices in the whole
k-spacetree. As there is no spatial discretisation error for the cube, the cube experiment reveals the pristine overhead resulting from the additional levels (Table
2.3), which is, in fact, not an overhead but the additional memory needed to store
the coarser grid levels in the spacetree.
With the k-spacetree holding all the levels simultaneously, the overall number of
vertices and elements outnumbers the fine grid numbers by a factor of at most k^d / (k^d − 1):
For a given fine grid level ℓ, the number of elements and vertices on level ℓ − 1
is smaller by a factor of k^d. The overall number of elements and vertices thus is
bounded by a factor of

1 + k^d + k^{2d} + k^{3d} + . . . = Σ_{i=0}^{ℓ} k^{i·d} = k^{ℓ·d} · Σ_{i=0}^{ℓ} k^{(i−ℓ)·d} = k^{ℓ·d} · Σ_{i=0}^{ℓ} (1/k^d)^i
≤ k^{ℓ·d} · Σ_{i=0}^{∞} (1/k^d)^i = k^{ℓ·d} · 1 / (1 − 1/k^d).
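The geometric series bound can be verified numerically. A small sketch that compares the exact level sums with the limit k^d / (k^d − 1):

```python
def overhead_factor(k, d, levels):
    """Ratio of the element count summed over all levels to the fine grid's
    element count, for a regular k-spacetree with `levels` levels."""
    fine = (k ** d) ** levels
    total = sum((k ** d) ** i for i in range(levels + 1))
    return total / fine
```

For k = 3 and d = 2, the factor approaches k^d / (k^d − 1) = 9/8 = 1.125 from below, so the coarser levels cost at most an eighth of the fine grid's memory.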
The boundary of the computational domain is a (d − 1)-manifold. Thus, the overhead
of additional boundary vertices on the coarser grids is smaller than the overhead for
the inner vertices and elements. The tables for the L-shape and the sphere reveal
the same conclusions for more complicated domains (Table 2.4).
Third, the total number of the spacetree's inner elements is contrasted with the
total number of spacetree elements. For persistent boundary vertices, the algorithm
refines the (k = 3)-spacetree once and embeds the computational domain into the
central element. This leads to k^d − 1 additional geometric elements on the
first refinement level—some kind of shadow boundary layer. All elements of the
shadow layer are outside the computational domain. They simplify and speed up
the handling of the boundary vertices but bring along a memory overhead. As
only vertices inside the computational domain and lying on the discretised domain's
boundary carry a refinement flag, the shadow layer elements are refined if and only if
they are adjacent to a boundary vertex. The shadow layer is adaptive. Furthermore,
2 Adaptive Cartesian Grids and Spacetrees
it corresponds to the domain's surface—a submanifold. The overhead is therefore
bounded by a constant smaller than k/(k − 1) (see Table 2.5 for the hypercube,
Table 2.6 for the L-shape domain).
The cube fits exactly to the spacetree’s root element. All the vertices along
the root element’s faces are boundary vertices and, hence, refined. The maximum
refinement level of the grid follows the cube’s faces exactly, and the refinement
structure left and right, above or below, and so forth of the boundary is exactly the
same. It gives the worst-case overhead resulting from the embedding. The sphere’s
overhead (Table 2.6) is smaller than the cube’s overhead, as the sphere’s surface
compared to its volume is smaller than the cube’s surface compared to its volume.
Table 2.4: Experiment from Table 2.3 with an L-shape as computational domain
(upper part) or a hypersphere as computational domain (lower part).
[Table body not recoverable from this extraction. Columns: |Pinside(ET)|, |Pinside(VT)|, and number of boundary vertices, each for the fine grid Ωh and for the whole k-spacetree; rows: d = 2, 3, 4, 5 (L-shape, upper part) and d = 2, 3, 4 (hypersphere, lower part).]
Table 2.5: Memory overhead of hypercube due to embedding: Total number of inner
elements/vertices compared to total number of elements/vertices in the
spacetree. For each dimension, the upper block gives figures for a regular
grid, the lower block for an adaptive grid due to the geometric refinement
criterion.
[Table body not recoverable from this extraction. Columns: |Pinside(ET)|, |Pinside(VT)|, |ET|, |VT|; rows: d = 2, 3, 4, 5, with a regular-grid block and an adaptive-grid block per dimension.]
Table 2.6: Experiment from Table 2.5 with an L-shape as computational domain
(upper part) and a hypersphere as computational domain (lower part).
[Table body not recoverable from this extraction. Columns: |Pinside(ET)|, |Pinside(VT)|, |ET|, |VT|; rows: d = 2, 3, 4, 5 for the L-shape (upper part) and the hypersphere (lower part); ⊥ marks configurations without data.]
2.10 Outlook
This chapter defines the grid underlying the whole thesis. Its definition is complete
and closed with respect to the algorithms here. Nevertheless, many straightforward
extensions and generalisations exist. This closing section lists some of them. The
list is neither complete nor representative.
One extension applying the spacetree philosophy further is the boundary extended
spacetree [25] providing an improved boundary approximation. While [25] constructs
them with (k = 2)-spacetrees in the three-dimensional case, the extension to arbitrary k is straightforward. The spacetree cube's faces are discretised recursively
by a (d − 1)-dimensional spacetree. Applying the idea to boundary faces resolves the boundary with a higher order and enables the application to reduce the
discretisation error. An improved boundary resolution is of great value in many
PDE problems, if the computational domain exhibits a complicated shape or if the
boundary values’ precision is of great importance. Naively refining the spacetree’s
elements at the boundary in fact does not work properly for d ≥ 3: let a numerical scheme converge in O(h²). As the boundary's approximation order is in O(h),
halving the mesh width within the domain entails halving the boundary elements' mesh width twice,
i.e. h ↦ h/2 ↦ h/4. Otherwise, the boundary's approximation error pollutes the overall
solution. Such a refinement increases the number of geometric elements within the
computational domain by a factor of k^d. As the boundary's “dimension” equals
d − 1, it increases the number of boundary elements by a factor of k^{2(d−1)} due to
the two refinement steps. The boundary elements thus soon dominate the overall
number of geometric elements, the memory consumption, and the computational
load if the domain shape is sufficiently complicated.
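A back-of-the-envelope computation illustrates why the naive double refinement at the boundary cannot pay off for d ≥ 3. The helper below is purely illustrative:

```python
def refinement_growth(k, d):
    """Growth factors per refinement step under the naive scheme:
    inner cells refine once (factor k^d), boundary cells refine
    twice (factor k^(2(d-1)))."""
    return k ** d, k ** (2 * (d - 1))

# for d = 2 both factors coincide; from d = 3 on, the boundary
# elements grow faster per step and eventually dominate
inner2, boundary2 = refinement_growth(3, 2)
inner3, boundary3 = refinement_growth(3, 3)
assert inner2 == boundary2 == 9
assert boundary3 > inner3
```

With k = 3 and d = 3, each refinement step multiplies the inner elements by 27 but the boundary elements by 81, so the shadow of the surface discretisation takes over after only a few levels.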
Besides an improvement of the boundary accuracy, boundary extended spacetrees
also fit to hypercube faces lying inside the computational domain. Applying them
there mirrors construction ideas of sparse grids [15] and yields promising results for
example for multigrid solvers on convection dominated problems [1].
Accuracy at the boundary is important for the k-spacetree’s coarser levels, too. A
loss of precision on coarser levels does not affect the solution's accuracy, but influences the solver of the linear equation system: the coarser the grid, the smaller the
computational domain becomes. Thus, geometric multigrid algorithms cannot correct the solution at the boundary as efficiently as they do within the computational
domain.
Another boundary improvement attaches to each boundary vertex the positions
where the adjacent edges intersect the computational domain's boundary. Such a
scheme also reduces the discretisation error and improves the multigrid convergence rate. [75], e.g., discusses
this approach for a (k = 3)-spacetree environment and three-dimensional problems.
The extension to arbitrary k and dimension d is straightforward.
The k-spacetree definition in this thesis does not restrict the level difference for
hanging vertices: for a two-dimensional setting, up to k^{∆l} − 1 hanging nodes can
be placed on one edge. ∆l is the level difference of two adjacent geometric elements
on the fine grid. To extend this formula to arbitrary dimensions is trivial. Some
numerical problems benefit from balanced grids, i.e. two adjacent fine grid elements’
levels differ at most by one. Such a balanced adaptive Cartesian grid corresponds
to a sufficiently balanced tree, and this balancing can be ensured by posing the
invariant

$$ \forall e \in E_T : \quad \exists v_1 \in \mathrm{vertex}(e) : P_{\mathrm{refined}}(v_1) \;\Leftrightarrow\; \forall v_2 \in \mathrm{vertex}(e) : v_2 \notin H_T $$

on the tree or the refinement criterion, respectively.
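Such an invariant can be checked mechanically. The sketch below assumes a hypothetical flat representation in which each element lists its vertices and each vertex carries refined/hanging flags; it tests the reading that a refined vertex in an element rules out hanging vertices within the same element. Names and data layout are illustrative, not part of the framework.

```python
from collections import namedtuple

# hypothetical vertex record: 'refined' mirrors P_refined,
# 'hanging' mirrors membership in H_T
Vertex = namedtuple("Vertex", ["refined", "hanging"])

def is_balanced(elements):
    """No element may combine a refined vertex with a hanging vertex."""
    return all(
        not (any(v.refined for v in e) and any(v.hanging for v in e))
        for e in elements
    )

ok  = [[Vertex(True, False), Vertex(False, False)],
       [Vertex(False, False), Vertex(False, True)]]
bad = [[Vertex(True, False), Vertex(False, True)]]
assert is_balanced(ok)
assert not is_balanced(bad)
```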
Many PDE’s solutions exhibit an anisotropic behaviour [73]. The representation
of anisotropic solutions benefits from anisotropic grids as such grids come along
with less vertices and elements compared to the adaptive Cartesian grids here,
where each element exhibits the same spatial resolution along each coordinate axis.
Furthermore, multigrid algorithms for anisotropic problems demand for specialised
inter-level transfer operators and tailored smoothers. The inter-level transfer operators also benefit from anisotropic grids. To extend the k-spacetree definition for
anisotropic grids equals a modified refinement predicate
$$ P_{\mathrm{refined}} : (V_T \setminus H_T) \times \{1, \ldots, d\} \to \{\top, \bot\} $$
controlling the refinement along each coordinate axis. The formalisms then become
more complicated but all the principles remain the same.
The children of a refined k-spacetree element are both embedded into the father’s
hypercube and disjoint from each other. To remove the disjointness invariant is
a more subtle change in the k-spacetree definition, but it permits data structures
resolving complicated domains more accurately. Furthermore, it makes the grid
fit to the grids used by many groups in the adaptive mesh refinement community
[7, 62].
Throughout this dissertation, I refrain from all these generalisations. Instead,
the pure k-spacetree introduced in this chapter is the basis of the subsequent three
chapters. They present three different algorithms exploiting the k-spacetree's definition: Chapter 3 introduces an efficient grid management, i.e. a management coming
along with very low memory demands and good memory access behaviour. Chapter 5 establishes a parallelisation and load balancing approach that is able to cope
with dynamic adaptive refinement. Chapter 4 implements a multiplicative multigrid
with a full approximation storage scheme within the k-spacetree data structure. All
three chapters are orthogonal, i.e. most of their presentation relies exclusively on
this chapter and its definition. Except for Chapter 3, all algorithms hold for arbitrary
k, and all three algorithmic ideas are well-suited for any dimension d.
3 Spacetree Traversal and Storage
Every PDE solver requires a traversal of the grid, and Chapter 2 shows that a
traversal preserving the child-father relationship is of great value to process the
k-spacetree. On the one hand, such a traversal facilitates a memory-modest encoding of the adaptive grid, as one refinement bit per vertex is sufficient. In each
recursion step, the traversal determines from this bit whether the recursion stops
or continues, i.e. whether the geometric element is refined. On the other hand, the
traversal’s interplay of different recursion levels facilitates the implementation of
inter-level transfer operations. They are essential for any multiscale algorithm.
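The single refinement bit per grid entity indeed suffices to encode the whole tree structure in depth-first order. The following sketch serialises a spacetree into a bit list and rebuilds it; for brevity it attaches the bit to cells rather than vertices, and all names are illustrative.

```python
def encode(cell):
    """Depth-first serialisation: 1 = refined (children follow), 0 = leaf.
    A cell is a list of its k^d child cells, or [] for a leaf."""
    if not cell:
        return [0]
    bits = [1]
    for child in cell:
        bits += encode(child)
    return bits

def decode(bits, children_per_cell):
    """Inverse of encode: rebuild the tree from the bit stream."""
    it = iter(bits)
    def build():
        if next(it) == 0:
            return []
        return [build() for _ in range(children_per_cell)]
    return build()

# a (k = 3, d = 1) tree: root refined, its middle child refined once more
tree = [[], [[], [], []], []]
assert decode(encode(tree), 3) == tree
```

One bit per cell replaces the pointers a conventional tree implementation would spend on every child reference, which is the memory-modest encoding the text refers to.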
Two traversals preserving the child-father relationship result from the depth-first
and breadth-first order. Both come along with advantages and disadvantages. In
this chapter, I concentrate on depth-first algorithms, as their simple backtracking
mechanism affords plain and elegant recursive implementations, and their realisation
itself comes along without any additional data container—the backtracking, i.e. the
bottom-up traversal steps, is realised by the system’s call stack. Despite this simplicity, the question nonetheless remains open what a well-suited data container for
the grid’s constituents, i.e. the vertices and geometric elements, as well as their connectivity information looks like. This chapter presents one solutions and discusses
the storage of vertex and hypercube associated data. Besides the vertex refinement
flag and the geometric encoding, this data comprises PDE-specific properties, too.
It turns out that the single bit for the spacetree code is also sufficient to store
the complete structural information, i.e. the bidirectional vertex-element adjacency
relations, if k = 3—a restriction finally weakened to odd k in the outlook. Although
all the thesis’ experiments are conducted for k = 3—I implemented only one code
combining all the presented features—most insights hence hold for arbitrary odd
k, and I thus a variable k whenever possible. The low memory requirements for
the adjacency information and k = 3 is due to a sophisticated combination of the
depth-first order with a space-filling curve and results in an exclusive usage of stacks
as grid data container.
Without the careful selection of space-filling curves and their properties, a straightforward implementation of the data containers that does not restrict the adaptivity
is a pointer-based data structure. Hereby, pointers hold the grid’s adjacency and
connectivity information. The most prominent example for such a data structure is
the vef -graph in computer graphics [16]. Such pointer networks come along with at
least three disadvantages:
1. The pointers need memory, i.e. the application’s memory demands result from
both the application’s data plus the connectivity information. The pointers
hence induce a memory overhead.
2. The size of pointer data structures is bounded by the overall memory address
space available. To make an application work with pointer structures not fitting into this space anymore entails the development of a serialisation strategy.
Besides the trickiness of such strategies, serialisation always comes along with
some performance overhead. Due to that, problems not fitting into the main
memory usually are considered as not solvable.
3. Pointer-based implementations rely on indirect memory access, i.e. the application does not read from the main memory directly. It first reads the record’s
address from the memory, and it second reads the actual record from this address. The two-step memory access causes a performance penalty. A more
severe performance drawback results from the fact that the records typically
are scattered among the address space. The mechanism thus exhibits non-local
memory access behaviour and is not tuned to caches.
Numerous concepts realise the k-spacetree without pointers. Early approaches
make both the refinement level and the spatial position act as access key (address)
for records. The Morton ordering [57], for instance, yields such a key. In this case, all the
vertices are held in a global container and the address determines a position within
this container. Hash tables are a natural choice for the container’s realisation. A
subtle choice of the key computation leads to a sophisticated and efficient memory
access ([32], e.g. ). Nevertheless, the approach still relies on indirect memory access
and it is bounded by the available address space.
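For illustration, a minimal Morton key for d = 2 interleaves the bits of the cell coordinates; combined with the refinement level, it yields the kind of hash-table key the text alludes to. This is a generic sketch, not the scheme of [32] or [57] verbatim.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y: x occupies the even bit
    positions of the key, y the odd ones."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def cell_key(level, x, y):
    """Level acts as a prefix, so cells of different levels never collide."""
    return (level, morton2d(x, y))

assert morton2d(0, 0) == 0
assert morton2d(3, 5) == 0b100111   # x bits on even, y bits on odd positions
assert cell_key(2, 1, 0) != cell_key(3, 1, 0)
```

Cells that are close in space share long key prefixes, which is what makes such keys attractive for hashed grid storage, yet every lookup still pays the indirect-access penalty discussed above.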
To overcome these constraints, the forerunners of this thesis [35, 39, 63] derive
an alternative realisation concept. They fit the construction principle of a space-filling curve [66] into the depth-first traversal, and, thereby, they make the traversal
use exclusively the two stack operations push and pop for both the vertex and
the geometric element management. The number of stacks is fixed. Besides their
simplicity, stacks exhibit three important memory access characteristics. The data
access comes without indirect addressing, and the access itself offers both spatial and
temporal locality: If a record is taken from or written to the memory, the subsequent
memory access’ position differs from the current position by at most one record’s
size. If a record is taken from or written to the memory, the time in-between two
accesses is small due to the small number of stacks and the curve's Hölder continuity.
The latter two characteristics lead to a good cache hit rate [48].
Despite all these nice properties, the forerunners’ algorithms suffer from a high implementation complexity: [35] derives an algorithm for a two-dimensional
(k = 3)-spacetree using ten stacks. [63] extends this concept to d = 3. His implementation comes along with 28 stacks. [39] finally generalises these two concepts
to arbitrary dimensions, and she proves that these algorithms utilise 3^d + 1 stacks¹.
In this dissertation, an alternative scheme coming along with 2d + 2 stacks is proposed, although it does not pose any additional restrictions on the trees. I reduce
the implementation complexity from exponential to linear.
The term space-filling is credited to the mathematicians Cantor, Netto, Peano
and Hilbert, and its underlying ideas open the door to a rich, interesting field of
research and theory. As many aspects of this theory are presented in [35, 39, 63],
this chapter restricts itself to the construction and appropriate properties. A good survey
on space-filling curves is [66]. Besides [35, 39, 63], there are additional applications
of (alternative) space-filling curves not discussed here ([2, 32], e.g.).
The chapter is organised as follows: In Section 3.1, Peano space-filling curves,
their construction principle, and some of their properties are introduced. The Peano
curve’s properties lead to the idea of stacks acting as data containers for the grid
management. This management is described in Section 3.3, and, thus, this section holds the fundamental new contribution of the chapter. The following text
reveals how today’s hardware architectures and cache hierarchies benefit from the
cache-based management, before some realisation details in Section 3.5 complete the
chapter. These details on the one hand comprise the handling of different traversal
depths as they occur for multiplicative multigrid algorithms. On the other hand,
they transfer and extend a file-based stack realisation concept introduced by [63]
to this work. With these file-based realisations, one can handle problems that need
more main memory than actually available on the machine. Some experiments study
the memory characteristics of the algorithm, and a short outlook closes the chapter.
3.1 Peano Space-Filling Curve
Peano’s cache and memory efficiency rely on the construction principles and properties of the Peano space-filling curve. Although the principle of its construction
can be understood analysing the illustrations in Figure 3.1, and although its three
important properties formalised in the theorems on page 59 and the following are intuitively clear, a rigorous, closed formalism is important for stating subsequent algorithms. This section provides this formalism. It thus leaves the concept of k-spacetrees
aside and concentrates on the concept of space-filling curves. Skipping this section, the
ideas of Section 3.3 holding the grid storage and traversal algorithm remain plausible, but are neither provable nor re-programmable.
Space-filling curves are mathematical eccentrics whose existence and properties
have been fiercely debated for a long time, as their character implies that the unit
¹ In all three theses, one can reduce the number of required stacks by 2d with some straightforward modifications.
—Alternative Peano Curves—
In [66], the term Peano space-filling curve refers to a construction principle
based upon three-partitioning of the unit square along each coordinate axis.
Such a curve is neither unique with respect to rotation, nor is it unique with
respect to the curve’s layout. The following illustration gives four three times
four iterates for four different Peano space-filling curves:
This thesis uses a standardised Peano space-filling curve (first row), where
the curve always runs along the x1 axis first, then x2 , then x3 , and so on.
Another version of the curve changes the dominant traversal direction per
iterate (second). The third variant exchanges the dominant traversal order for
each second square. All three variants are of switch-back type (Serpentinentyp)
exhibiting a 2×1×2×1×2×1 step pattern—two steps along one direction, then
one step along an orthogonal direction. Finally, the fourth variant traverses
each 32 motif with a 2 × 2 × 1 × 1 × 1 × 2 step pattern. This variant’s original
identifier is Peano curve of the meander type. The extension of all four variants
to arbitrary dimension is straightforward, and the Austrian mathematician
Walter Wunderlich coined the names.
square’s volume equals the unit interval’s volume. The term curve identifies a mapping from the unit interval to a higher-dimensional domain such as a square. The
term space-filling denotes that the curve’s image has a positive Jordan content: the
image completely “floods” a higher-dimensional domain. This section presents the
construction principle of one specific type of space-filling curves—the Peano curve.
The Peano space-filling curve is a surjective, continuous mapping from the unit
interval to the unit hypercube. It is constructed recursively:
• The hypercube (image) is split up into 3d equal subcubes.
• The unit interval (preimage) is split up into 3d equal subintervals.
• Each subinterval maps to one subcube according to Figure 3.1, i.e. neighbour
subintervals map to adjacent subcubes. The central illustration in Figure
3.1—the curve running through 3d hypercubes—is the leitmotiv.
• Neighbouring subintervals’ images are connected by the curve, and the curve
defines an order on the subintervals.
• The curve’s construction continues recursively for each subcube. For each
subcube, the leitmotiv is mirrored accordingly: The connected leitmotivs fitted into the subcubes form a continuous curve, and this curve preserves the
ordering of the coarser hypercubes resulting from the preceding construction
step.
Figure 3.1: The first three iterates of the Peano curve for d = 2. For the recursion
depth going to infinity, the curve fills the unit square surjectively. The left hypercube corresponds
to the construction scheme's recursion start, and the central illustration
shows the leitmotiv. This leitmotiv defines the overall curve. The interconnections of the different suitably translated and mirrored leitmotivs
on the right-hand side do not result from the leitmotiv directly, but
result from the leitmotiv of the preceding recursion step.
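The recursive construction translates directly into a short program. The sketch below generates the cell visit order of a level-ℓ Peano iterate for d = 2. The mirroring rule encoded here—the sub-curve is mirrored along the x-axis iff the child's second index is odd, along the y-axis iff the first index is odd, composed by XOR with the parent's mirror state—is the author's reading of the construction, not code from the thesis.

```python
def peano2d(level, mx=False, my=False):
    """Cell coordinates of the level-`level` Peano iterate in traversal
    order; mx/my flag a mirrored (reversed) x- or y-axis."""
    if level == 0:
        return [(0, 0)]
    size = 3 ** (level - 1)
    cells = []
    for r in range(3):                       # serpentine over the 3x3 children
        j = 2 - r if my else r
        forward = (r % 2 == 0) != mx
        for i in (range(3) if forward else range(2, -1, -1)):
            sub = peano2d(level - 1, mx != (j % 2 == 1), my != (i % 2 == 1))
            cells += [(i * size + x, j * size + y) for (x, y) in sub]
    return cells

curve = peano2d(2)
assert len(set(curve)) == 81                 # covers the whole 9x9 grid
assert all(abs(a - c) + abs(b - d) == 1      # continuous: face-adjacent steps
           for (a, b), (c, d) in zip(curve, curve[1:]))
```

The two assertions check exactly the properties the construction promises: the iterate visits every cell of the tripartitioned grid once, and consecutive cells always share a face.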
With the recursion depth going to infinity, the curve completely fills out the whole
unit hypercube. There are many variants of this space-filling curve based upon tripartitioning (see the excursus on page 54), and even more different mappings based
upon other partitioning techniques. The Hilbert curve based on bi-partitioning
perhaps is the most popular one. Another example embedding the curve’s image
into triangles is the Sierpinski curve. However, the Peano curve has a number of
unique properties which are very useful for a PDE solver realisation2 . Before the
subsequent subsections discuss these properties, the multitude of different Peano
space-filling curves is worth a further attention.
The leitmotiv in Figure 3.1 equals a 2 × 1 pattern: From a subcube, the curve
meanders two subcubes along one direction. Afterwards, it changes the direction
orthogonally and proceeds one subcube. Although the curve’s shape is thereby
defined unambiguously, the overall curve’s rotation is not fixed yet.
Definition 3.1. A standardised lexicographic leitmotiv traverses a 3^d pattern always
first along the x1 axis. It then steps one subcube along the x2 axis and, subsequently,
continues to meander along the x1 axis backwards. For d ≥ 3, the next meander
direction is x3 , then x4 and so on.
Figure 3.2: The first two iterates of the Peano curve for d = 3. On the right-hand
side, the unit cube is cut into three plates and the curve runs through
each plate according to the two-dimensional scheme. Afterwards, the
three individual curve fragments are connected.
The term “mirrored accordingly” within the construction definition also demands
for a more rigorous definition.
Definition 3.2. In this thesis, the term mirror along or in the direction of the
xi -axis equals a reversion of the xi -axis. For the unit hypercube, mirroring along the
xi -axis thus mirrors the curve at a hyperface with normal xi .
² This dissertation elaborates properties of the Peano curve based on tri-partitioning. Tri-partitioning in turn fits to the (k = 3)-spacetree. However, the properties hold for any standardised meander-type curve corresponding to an odd k, i.e. the area of application is much wider (see outlook).
The “mirroring accordingly” is an affine mapping P = TP̂T, where the generic
operator T translates any hypercube to the unit hypercube or the other way round,
and where P̂ performs the mirroring. The mirroring in turn depends on the hypercube’s position within the preliminary construction step. One can express this
position in terms of odd and even cubes along the coordinate axes. Let

$$ \mathrm{even} : \text{subcube} \to \{\top, \bot\}^d $$

define a d-tuple on each subcube. It is the even-flag, and a subscript picks out
a particular component of the image. even for the spacetree's root results in
{⊥, ⊥, . . . , ⊥}. If a cube is split up, the first new subcube's even-flag along the
iterate equals the original subcube's flag. For two subcubes a and b connected by a
hyperface with normal x_i, i ∈ {0, . . . , d − 1}³,

$$ \mathrm{even}_j(a) = \mathrm{even}_j(b) \quad \forall j \neq i \qquad \text{and} \qquad \mathrm{even}_i(a) = \neg\,\mathrm{even}_i(b) \tag{3.1} $$
holds (Figure 3.3). The function separates even from odd subcubes along each coordinate axis, and, thus, defines whether an element’s iterate runs along the coordinate
axis or not:
$$ \mathrm{isTraversePositiveAlongAxis} : \{\top, \bot\}^d \times \{1, \ldots, d\} \to \{\top, \bot\} \quad \text{with} $$
$$ \mathrm{isTraversePositiveAlongAxis}(\mathrm{even}, \mathrm{axis}) = \top \;\Leftrightarrow\; \exists k \in \mathbb{N}_0 : |\{ i : i \neq \mathrm{axis} \wedge \mathrm{even}_i = \top \}| = 2 \cdot k. \tag{3.2} $$
Figure 3.3: even-flag for a two-dimensional adaptive Cartesian grid. The flag
uniquely determines how the leitmotiv is mirrored. Therefore, the flag
also determines which neighbour square is crossed next by the iterate.
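Both definitions translate directly into code. In the sketch below, the even-flag of a cell on a regular level-ℓ grid is recovered from the parity of the digit 1 in the base-3 expansion of each coordinate; this closed form follows from applying (3.1) level by level and is the author's derivation, not a formula taken from the thesis. Axes are enumerated C-style from 0, as in the formalism.

```python
def even_flag(coords, level):
    """even-flag of the cell with integer coordinates `coords` on a
    regular grid of 3**level cells per axis: entry i is True iff the
    base-3 expansion of coords[i] contains an odd number of digits 1."""
    flag = []
    for c in coords:
        ones = 0
        for _ in range(level):
            ones += (c % 3 == 1)
            c //= 3
        flag.append(ones % 2 == 1)
    return tuple(flag)

def is_traverse_positive_along_axis(even, axis):
    """Definition (3.2): the iterate runs in positive direction along
    `axis` iff an even number of the other flag entries is True."""
    return sum(even[i] for i in range(len(even)) if i != axis) % 2 == 0

# bottom row of the 3x3 grid runs in +x direction, the middle row
# (even-flag True in its y-entry) runs in -x direction
assert is_traverse_positive_along_axis(even_flag((0, 0), 1), 0)
assert not is_traverse_positive_along_axis(even_flag((0, 1), 1), 0)
```

One can also verify relation (3.1) on this representation: the flags of two face-adjacent cells differ exactly in the entry of the connecting face's normal.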
The function analyses the root node’s leitmotiv: Here, the traversal runs from
the cube point nearest to the origin to the point furthest away: It runs along each
coordinate axis. even gives a d-dimensional modulo-two enumeration, and it tracks
how often the leitmotiv has been mirrored and translated. The translations do not
affect the curve’s shape. With this mirroring, there are four important observations:
³ I enumerate all these tuples in C-style, i.e. starting with 0, to keep algorithms and the formalism consistent. Consequently, the coordinate system's axes are also enumerated in C-style.
• For the unit hypercube, the algorithm applies the standard lexicographic leitmotiv.
• For two hypercubes with the same even-flag, the algorithm applies the same
leitmotiv.
• If two adjacent hypercubes’ even-flags differ in one entry i, one leitmotiv
results from the other cube’s leitmotiv by mirroring it along each coordinate
axis besides xi .
• If two even-flags differ in more than one entry, the corresponding mirroring
operations have to be applied consecutively. The image is deterministic and
unique, as the mirror operations are commutative.
A more formal description of the leitmotiv usage can be obtained by using a
grammar to describe the recursion steps. Such a grammar can be found in [39] and
[63], e.g. For this thesis, working with the even-flag is sufficient, as the individual
algorithms exploit solely the traversal's orientation and neglect the translation within
the affine operator.
Example 3.1. For d = 2, the following mirroring operators arise:

$$ \hat{P}_{\bot\bot} = id = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \hat{P}_{\top\bot} = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad \hat{P}_{\bot\top} = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, $$
$$ \hat{P}_{\top\top} = \hat{P}_{\bot\top}\hat{P}_{\top\bot} = \hat{P}_{\top\bot}\hat{P}_{\bot\top} = -id. $$

For arbitrary d, the matrices evolve from

$$ \hat{P}_{\bot\bot\bot\ldots} = id, \qquad \hat{P}_{\bot\ldots\bot\top\bot\ldots\bot} = \operatorname{diag}(1, \ldots, 1, -1, 1, \ldots, 1) $$

with the ⊤ at the ith entry of the flag and the −1 in the ith row, and the multiplication rule

$$ \hat{P}_{\bot\ldots\top\ldots\top\ldots} = \hat{P}_{\bot\ldots\top\bot\ldots} \, \hat{P}_{\bot\ldots\bot\top\ldots} $$

with the ⊤ flags at the lth and kth entries. Since the involved matrices exhibit a
diagonal pattern, they are commutative.
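The diagonal structure, and with it the commutativity, is easy to verify programmatically. The sketch below constructs mirror operators as diagonal matrices with one −1 per ⊤ flag entry (the exact entry-to-axis convention is inessential for the commutativity argument) and checks that they commute:

```python
def mirror_matrix(flags):
    """Diagonal mirror operator: one -1 per True entry of the flag tuple."""
    d = len(flags)
    return [[(-1 if flags[i] else 1) if i == j else 0 for j in range(d)]
            for i in range(d)]

def matmul(a, b):
    """Plain matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

p10 = mirror_matrix((True, False))
p01 = mirror_matrix((False, True))
assert matmul(p10, p01) == matmul(p01, p10)      # operators commute
assert matmul(p10, p01) == [[-1, 0], [0, -1]]    # their product is -id
```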
Each construction step applying the leitmotiv on all hypercubes of the smallest
size yields a curve of its own. They are iterates of the Peano space-filling curve and
belong to a non-linear recursion of order 3^d. The recursion's depth determines the
level of the iterate. With these iterates and their construction blueprint at hand,
the subsequent pages study the iterate’s properties. Peano’s spacetree traversal later
imitates the curve’s iterates and, thus, inherits these properties. They are in turn
essential to construct the grid storage and traversal algorithm.
3.1.1 Projection Property
Theorem 3.1. For the standardised Peano iterates, the projection property holds:
Examine an iterate of level ℓ. A hyperplane is cut out from the unit hypercube along
two hyperfaces with normal x_i. They are translated by k/3^ℓ and (k + 1)/3^ℓ along the
x_i-axis with a fixed k ∈ {0, . . . , 3^ℓ − 1}. The hyperplane contains a subcurve. If one
projects the subcurve orthogonally to one of the cuts' hyperfaces, the image is in turn
a (d − 1)-dimensional Peano iterate.
Figure 3.4: Projection property of the Peano space-filling curve for d = 3: If the
iterate is mapped to a face of the unit cube, the image is in turn a
(d − 1 = 2)-dimensional Peano iterate. This holds for all 2 · d faces.
The proof results from the mirroring of the leitmotiv on page 57: The iterate
within the cut plane corresponds to a sequence of leitmotiv transformations, and
the following projection to the cut planes removes the i-th entry from the image.
One can interchange construction and projection and ends up with a new
construction scheme for a (d − 1)-dimensional curve where the i-th component is
eliminated. This construction scheme equals the (d − 1)-dimensional Peano iterate.
The proof is elaborated for example in [49], and the projection property itself is
illustrated for d = 3 in Figure 3.4.
3.1.2 Inversion Property
Theorem 3.2. For the standardised Peano iterates, the inversion property holds:
Let i_a be the iterate belonging to level ℓ. If the leitmotiv of the unit hypercube is
mirrored along each coordinate axis⁴, the construction scheme of the standardised
Peano space-filling curve’s definition also yields a Peano iterate ib . ia and ib are
congruent, but they have reverse orientations.
Proof. The proof is a simple induction over the construction steps. For the induction
start, the theorem holds, as the iterate’s direction is inverted. For a refined element,
the construction sequence of the 3^d subcubes is inverted. For each subcube the
theorem holds by induction. All the curves are then connected. The connection
order equals the original curve’s inverted arrangement.
Corollary 3.1. The set of Peano space-filling curves is closed under the invert
traversal operation.
3.1.3 Palindrome Property
Theorem 3.3. For the standardised Peano iterates, the palindrome property holds:
Let there be an iterate of level ℓ. Two neighbouring hyperplanes are cut out from the
unit hypercube along three hyperfaces with normal x_i. The hyperfaces are translated
by k/3^ℓ, (k+1)/3^ℓ, and (k+2)/3^ℓ along the x_i-axis with fixed k ∈ {0, . . . , 3^ℓ − 2}.
Each hyperplane contains a subiterate ia or ib , respectively. ia and ib are congruent
but have reversed orientations.
Again, a formal proof is given in [49]. The property stems from the construction
of the even-flag, as two hypercubes contained in different planes sharing one hyperface have an equal even-flag except for one entry corresponding to the hyperface’s
normal. If the two hyperplanes’ iterates are mapped onto the connecting hyperface,
their mirroring operators thus differ in sign, i.e. one leitmotiv results from the
other by inverting the curve along each of the d − 1 coordinate axes (Figure 3.5).
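For the first iterate, the palindrome property can be verified directly. The sketch below again uses the standard single-level Peano motif (an assumption: the thesis’s standardised iterate may differ from it by a rotation); `motif` and `slab` are illustrative helper names.

```python
def motif(d):
    """Level-1 Peano iterate: coordinate i of cell n is the i-th base-3 digit
    of n, complemented iff the sum of the preceding digits is odd."""
    cells = []
    for n in range(3 ** d):
        digits = [(n // 3 ** (d - 1 - i)) % 3 for i in range(d)]
        coord, s = [], 0
        for t in digits:
            coord.append(2 - t if s % 2 == 1 else t)
            s += t
        cells.append(tuple(coord))
    return cells

def slab(cells, k):
    """Subiterate inside the slab x_1 = k, projected onto the cut's hyperface."""
    return [c[1:] for c in cells if c[0] == k]

# Neighbouring slabs carry congruent subiterates with reversed orientation.
for d in (2, 3):
    for k in (0, 1):
        assert slab(motif(d), k + 1) == list(reversed(slab(motif(d), k)))
```

This reversed orientation of neighbouring sub-curves is exactly what later allows the temporary vertex containers to be stacks.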
3.2 Deterministic Peano Traversal for k-spacetrees
The construction of the Peano space-filling curve resembles the construction of a
(k = 3)-spacetree. Both differ in three aspects: First, the grid generation stops after
a finite number of recursion steps. The Peano curve instead results from the limit of
the iterates. Second, the grid generation determines for each subcube individually
⁴Such an inversion equals the transition P̂ ↦ −id · P̂ in the mirroring operator. For an even
d, this transition in turn corresponds to an inversion of the even flags, i.e. (⊥, ⊥, ⊥, . . .) ↦
(⊤, ⊤, ⊤, . . .), for the root element. For an odd d, this equivalence does not hold.
Figure 3.5: Palindrome property of the Peano space-filling curve: If the unit cube is
cut into three plates along a coordinate axis, and the first Peano iterate
is projected onto these plates (projection property in Figure 3.4), each
iterate on the plate mirrors the neighbours’ curves, i.e. it exhibits the
same layout with a different traversal direction.
whether to refine further and, thus, facilitates adaptive grids. The Peano iterates’
image intervals, in contrast, equal a regular Cartesian grid. Third, the k-spacetree definition lacks
a definition of an order of the children resulting from the recursion steps, whereas
the Peano iterates define such an order on the image’s interval.
Both depth-first and breadth-first traversals on k-spacetrees are partially non-deterministic: they obey the order ⊑child , as they exploit the tree structure. Yet, the
loops in the traversal algorithms do not arrange the children. It is an obvious idea
to use the Peano space-filling iterates to derive such an arrangement. The resulting
spacetree traversal then inherits all the properties of the Peano space-filling curve or
its iterates, respectively. Due to the projection property, algorithms exhibit a tensor
product style. Due to the inversion property, the traversal can toggle its orientation
after each run through the spacetree, and, nevertheless, it preserves its behaviour.
And the palindrome property finally is the missing link to realise the data containers
with stacks.
Definition 3.3. A Peano spacetree is a (k = 3)-spacetree with a deterministic traversal preserving the child-father relationship. Furthermore, the siblings’ order
preserves ⊑pre . ⊑pre in turn is the order induced by the leitmotiv corresponding
to the parent’s even-flag. The (k = 3)-spacetree’s root corresponds to an arbitrary
even-flag.
The depth-first traversal for a two-dimensional Peano spacetree is illustrated in
Figure 3.6. It is realised recursively by a stack automaton (usually represented by
the call stack), and it benefits from the definition of the even-flag: The automaton
holds—besides spatial location, level, and so forth—the even-flag of the current geometric element. If one adjacent vertex holds Prefined , the automaton has to descend.
Figure 3.6: (k = 3)-spacetree with a depth-first ordering derived from the iterates of
the Peano space-filling curve. The first row illustrates the Peano iterate
on the different levels. The second row gives the enumeration corresponding to the depth-first search. At the bottom, the corresponding
spacetree is illustrated.
The even-flag identifies the first child into which to descend, and the even-flag also
determines the order of the subsequent descends due to the leitmotiv.
The counterpart of the inversion property for the stack automaton is the following
definition:
Definition 3.4. The image of the invert traversal operation
invert : T ↦ T
on a Peano spacetree is a tree with the same geometric elements, with the same
vertices, and with the same root element,
• where the leitmotiv’s direction is inverted,
• where the same child-father relationship ⊑child holds
∀ei ⊑child,T ej : ei ⊑child,invert(T ) ej ,
• but where the siblings’ order ⊑pre is inverted:
∀ei , ej ⊑child,T ek : ei ⊑pre,T ej ⇔ ej ⊑pre,invert(T ) ei .
Corollary 3.2. The set of Peano spacetrees is closed under the invert traversal
operation, i.e. the image is in turn a Peano spacetree.
Proof. The proof results from the inversion property.
Figure 3.7: The Peano space-filling curve’s iterate runs through an adaptive
(k = 3)-spacetree discretising a circle. Three snapshots of a running
traversal; all fine grid elements already visited are inked.
The deterministic stack automaton sketched above, a definition of the geometry
representation, and the refinement transition concept deliver a blueprint for a deterministic traversal algorithm on dynamic adaptive Cartesian grids resulting from
Peano spacetrees. Yet, this blueprint lacks a description of the containers for data
associated with vertices or elements.
3.3 Stack-Based Containers
Space-filling curve and depth-first traversal in combination shape the deterministic
Peano spacetree traversal. To implement this traversal, a programmer has to combine this simple traversal blueprint with fitting containers for vertices and elements.
The data flow and data access pattern underlying a k-spacetree traversal define the
access operations the data containers have to provide. Due to the properties of
Figure 3.8: The Peano space-filling curve’s iterate runs through an adaptive
(k = 3)-spacetree discretising a sphere. Six snapshots of a running
traversal. All fine grid elements already visited are inked.
space-filling curves and their iterations, the operations comprise solely operations
from a stack signature, while the data access patterns for vertices and geometric elements exhibit a different behaviour. The section at hand elaborates this behaviour.
It starts with the element data, continues with the storage of vertex data in-between
two iterations, and ends up with the stacks required throughout the traversal itself.
As a result, it presents a spacetree traversal realisation and a storage scheme coming
along with a small, fixed number of stacks as data containers.
—Face Enumeration—
Each geometric element has 2 · d faces. These faces’ enumeration is deduced
from the faces’ normal as follows:
1. The numbers 0 and 0 + d identify the faces with normal x1 .
2. The numbers 1 and 1 + d identify the faces with normal x2 .
3. The numbers 2 and 2 + d identify the faces with normal x3 .
4. . . .
With two faces having the same normal, the face closer to the coordinate
system’s origin is enumerated first.
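The enumeration above is plain index arithmetic; a minimal sketch with hypothetical helper names and 0-based axis indices:

```python
def face_number(axis, far, d):
    """Face with normal x_(axis+1): number `axis` for the face closer to the
    coordinate system's origin, `axis + d` for the opposite one."""
    return axis + (d if far else 0)

def normal_axis(face, d):
    """Recover the 0-based axis of a face number's normal."""
    return face % d
```

For d = 3 this yields exactly the numbers 0 to 5 from the list above, e.g. `face_number(2, True, 3) == 5` and `normal_axis(5, 3) == 2`.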
3.3.1 Container for the Geometric Elements
Every time a geometric element is read for the first time, its data is taken from an
input container, the corresponding vertices are read, and the traversal triggers the
corresponding event enterElement (Table 2.2). Afterwards, the traversal descends
recursively if the current geometric element is refined. In this case, the call stack
holds both the traversal’s state (position, element width, and so forth) represented
by a traversal automaton as well as the associated grid data (vertices and data
assigned to the geometric elements) until the recursion terminates. Finally, the
algorithm triggers leaveElement, deposits the vertices, and writes the element to
an output container. This write has fire-and-forget semantics, i.e. each element
written is not read again throughout this traversal. For the geometric elements,
there is thus an input and an output stream.
Figure 3.9: The traversal algorithm reads geometric elements from an input stream,
and stores the record to the output stream. The output stream then
acts as input stream for the second traversal.
For two subsequent traversals, the records stored on the output stream of the first
iteration act as input for the second iteration (Figure 3.9). The order on the corresponding output stream has to equal the order of the input stream. Yet, the order
of the elements on the input and output stream is not equal for a straightforward
depth-first type traversal:
Example 3.2. Examine a spacetree of height one. The refined root element is read
before the 3^d children are read. When the traversal descends, the parent element
is held on the call stack. As the children are leaves, they are hence written to the
output stream in the same order in which they are read. Finally, the parent geometric
element is written to the output stream:
In a depth-first traversal, parent nodes are always read before their children. In turn,
they are written to the output stream after their children have been written.
Although input and output order do not concur, it is obvious that the resorting
equals an inversion of the storage order. An algorithm switching the read access
order for each traversal thus comes along without any resorting of the containers
holding the geometric elements.
Theorem 3.4. Each Peano traversal defines an order in which the geometric elements are read (input order) and an order in which the geometric elements are
written (output order). The elements’ output order for a Peano spacetree T equals
the inverse input order for the Peano spacetree invert(T ).
Proof. The proof is a simple induction over the height of the Peano spacetree. The
theorem holds for a spacetree of height zero, i.e. one single hypercube. For a refined
geometric element e ∈ invert(T ), the leitmotiv is mirrored along every axis and

∀ei , ej ⊑child e, ei ⊑pre,T ej : ej ⊑pre,invert(T ) ei .     (3.3)

T ’s traversal writes e as the last element of the output stream. invert(T )’s traversal
reads e first. According to (3.3), the last geometric element e1 written to the output
by the traversal is the first geometric element read by the inverted traversal. For
the siblings of e1 , the theorem holds by induction. The subsequent child
e2 of e read by the inverted traversal is the element written by the traversal before
it reads and descends into e1 . For the siblings of e2 , the theorem holds by
induction. The argument continues for e3 up to e_{3^d}.
The theorem does not have to discuss dynamic refinement, as it does not correlate
the input and output order of the same traversal: Adding or removing spacetree
elements enriches or downsizes the output stream, but it does not alter the stream access
pattern or the stream’s semantics.
Running through a stream once in one direction and then in the opposite direction
equals a sequence of push operations followed by a sequence of pop operations:
Corollary 3.3. The input and output containers for geometric elements are stacks.
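The interplay of Theorem 3.4 and Corollary 3.3 boils down to a general tree fact: the postorder (output order) of a tree equals the reversed preorder (input order) of the tree with all sibling orders inverted. A small sketch with generic trees; the function names and the example tree are illustrative, not part of the thesis.

```python
def preorder(tree):
    """Input order: an element is read before the traversal descends."""
    label, children = tree
    out = [label]
    for child in children:
        out += preorder(child)
    return out

def postorder(tree):
    """Output order: an element is written once the recursion returns."""
    label, children = tree
    out = []
    for child in children:
        out += postorder(child)
    return out + [label]

def invert(tree):
    """The invert traversal operation: reverse the sibling order on all levels."""
    label, children = tree
    return (label, [invert(child) for child in reversed(children)])

# A tiny spacetree stand-in: a root with three children, the first one refined.
tree = ("root", [("a", [("a1", []), ("a2", []), ("a3", [])]),
                 ("b", []),
                 ("c", [])])

assert postorder(tree) == list(reversed(preorder(invert(tree))))
```

Hence an implementation that toggles its traversal direction after each iteration can reuse the output stack of one traversal directly as the input stack of the next, without any resorting.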
While I implement the containers as stacks, this realisation is far too general, as
it does not take the stream characteristics into account: The two stacks are always
emptied completely before a sequence of push operations refills them. More sophisticated implementations could exploit this insight and, due to the reduced functionality,
provide an optimised container realisation. The stream access pattern permits implementations, for example, to deploy data pre- and postprocessing to a thread of
their own [23].
Figure 3.10: Throughout the traversal, the algorithm reads vertex and element data
from two input streams. If a record is not needed anymore, the record
is written to an output stream. In between, records for geometric elements are stored on the call stack. Vertices, however, are manipulated
by several geometric elements. Thus, they have to be stored within
temporary containers after the first read, until they have been used by
all 2^d adjacent geometric elements.
3.3.2 Input and Output Container for Vertices
The Peano spacetree’s traversal equals a resorting of the geometric elements: From
a data flow point of view, it reads elements from an input stream and writes them
to an output stream in a permuted order. The data flow for the vertices is more
complicated, as vertices are processed 2^d times by the element-wise traversal—once
per adjacent element. A vertex is read from an input stream, too, but it is not
written to an output stream immediately. The algorithm instead reads a vertex from
an input stream when the first adjacent geometric element is entered. Afterwards,
it has to store the vertex within a temporary container. After it has been read
from or written to the temporary container 2^d − 1 times, i.e. once the
total number of read operations equals 2^d , the algorithm finally stores the vertex in
an output container (Figure 3.10). This subsection discusses the input and output
container.
The vertex transition scheme in Section 2.6 ensures that each vertex v ∈ VT \ HT
is processed 2^d times. Hanging vertices v ∈ HT are created on-the-fly, and they are
not stored on the input and output streams. Because of the vertex state transition
scheme, the refinement and coarsening is completely transparent: The algorithm
does not distinguish between existing vertices and new vertices. Existing vertices
are read from the input stream, while new vertices are created on-the-fly. Besides
the input stream there is thus a vertex source. If elements are to be destroyed due
to coarsening, the algorithm does not store these vertices on the output stream but
throws them away. The same holds for hanging vertices. Besides the output stream,
there is thus a vertex sink.
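The bookkeeping behind this scheme can be sketched independently of the Peano order. The snippet below sweeps a regular 2-d cell grid row-major (an assumption; the real traversal follows the Peano iterate) and releases a vertex to the output once all of its adjacent cells have touched it: 2^d = 4 touches for an interior vertex, fewer on the domain boundary. The function name `sweep` is illustrative.

```python
from collections import defaultdict

def sweep(n):
    """Element-wise sweep over an n x n cell grid; returns the vertex output order."""
    adjacent = defaultdict(int)          # how many cells touch each vertex
    for cx in range(n):
        for cy in range(n):
            for v in ((cx, cy), (cx + 1, cy), (cx, cy + 1), (cx + 1, cy + 1)):
                adjacent[v] += 1

    touched = defaultdict(int)
    output = []
    for cx in range(n):                  # row-major stand-in for the Peano order
        for cy in range(n):
            for v in ((cx, cy), (cx + 1, cy), (cx, cy + 1), (cx + 1, cy + 1)):
                touched[v] += 1
                if touched[v] == adjacent[v]:
                    output.append(v)     # final touch: the vertex leaves for the output
    return output
```

Every vertex is emitted exactly once, directly after its final read.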
Figure 3.11: The Peano iterate runs along the vertices a and b first from left to right
(top) and then from right to left (bottom). This behaviour results from
the palindrome property. Since a non-hanging, i.e. persistent, vertex
is written to the output stream as soon as it has been read by all 2^d
adjacent elements, vertex a is read from the input stream before b and is
written to the output stream after vertex b is written. If the traversal
direction is inverted after the iteration, vertex a again is read before
vertex b.
Theorem 3.5. Each Peano traversal defines an order in which the vertices are read
(input order) from the input stream, and it defines an order in which the vertices
are finally written (output order). The vertices’ output order for a Peano spacetree
T equals the inverse input order for the Peano spacetree invert(T ).
Proof. The proof for one single grid level derives from the inversion property. For the
traversal of the whole spacetree, the statement holds by induction, as the traversal
preserves the child-father relationship as well as the inverse child-father relationship
(Definition 2.2).
Corollary 3.4. The input and output containers for vertices are stacks.
Both the vertex and the element input and output containers are stacks and fit
to an algorithm switching the traversal direction after each iteration. Again, stacks
are even too general a data structure, as they are accessed in a stream manner.
The number of grid levels does not influence the realisation, and the implementation
of the depth-first traversal in Algorithm 2.4 thus is straightforward. Nevertheless, it
relies on a local criterion deciding whether to read a vertex from the input stream
or a temporary container. The same argumentation holds for the decision where to
write the vertex to (Algorithm 3.1).
If a stack automaton realises the element-wise traversal of the spacetree, i.e. it
runs through the geometric elements, it represents the individual states stored on
the call stack. Each geometric element has 2 · d faces. As each traversal state
Algorithm 3.1 The depth-first traversal from Algorithm 2.1 without the
events. Input and output streams are stacks. The temporary vertex container’s
realisation is not specified yet, and the for loop has to be made deterministic with
the Peano curve.
1: procedure dfs(e)
2:   ...
3:   for ei ⊑child e do
4:     pop geometric element ei from element input stream
5:     for vj ∈ vertex(ei) do
6:       if first read of vj then            ⊲ see Algorithm A.1
7:         pop vj from vertex input stream
8:       else
9:         read vj from temporary vertex container
10:      end if
11:    end for
12:    dfs(ei)
13:    push geometric element ei on element output stream
14:    for vj ∈ vertex(ei) do
15:      if vj read 2^d times then           ⊲ see Algorithm A.1
16:        push vj on vertex output stream
17:      else
18:        write vj to temporary vertex container
19:      end if
20:    end for
21:  end for
22:  ...
23: end procedure
corresponds to one geometric element, each automaton state “has” 2 · d faces. The
enumeration of its faces follows the scheme defined in the face enumeration box above.
Let Ptouched : {0, . . . , 2 · d − 1} ↦ {⊤, ⊥}
declare whether the traversal automaton has already traversed the geometric element
connected by a face. Ptouched is the touched predicate. It depends exclusively on the
stack automaton’s state and is formalised later. The predicate is ⊥ for all the
faces of the root element. If an element is refined, the iterate’s orientation (i.e. the
mirroring of the leitmotiv) determines for all the 3^d recursion steps and all the 2 · d faces
per recursion step whether Ptouched holds⁵.
With Ptouched assigned to each geometric element in combination with the traversal
⁵Although the mirroring of the leitmotiv, i.e. the even flags, determines when Ptouched holds in the
subsequent recursion steps, Ptouched and even are not equivalent, as Ptouched also depends
on the father element’s touched predicate.
automaton (it depends on the traversal direction, too), the traversal automaton can
determine where to read a vertex from and where to store a vertex. If no face
adjacent to a vertex holds the Ptouched flag, the algorithm pops the record from the
input stack. Otherwise, it reads from the temporary container. If all faces hold
Ptouched , the algorithm pushes the record to the output stack. Otherwise, it passes
the record to the temporary container. I present the algorithm deriving Ptouched
later, as it turns out that Ptouched evaluates a more general automaton property
needed anyway for the temporary stack containers. Algorithm A.1, comprising two
helper procedures, formalises the logic described above.
3.3.3 Temporary Stack Container
Figure 3.12: The continuous iterate splits up the vertices into vertices left of the
iterate (circles) and right of the iterate (crosses).
The realisation of the input and output containers is more or less trivial, i.e. having
stacks storing streams is not surprising. The subsequent text reveals that stacks are
in fact the only data structure needed for the complete grid storage.
Since the Peano curve’s iterate meanders through the discretised computational
domain and passes through each geometric element once, it transits from one element into a neighbouring element through a common face. And as the construction
principle yields continuous iterates, the discrete domain’s vertices in d = 2 either
are on the right-hand side of the curve or on the left-hand side (Figure 3.12). This
holds for all levels.
Let vertex a and vertex b on level ℓ belong to the class of left-hand side vertices
(d = 2). Due to the continuity of the iterate and the palindrome property, the following
facts hold:
• If a is read from the vertex input stream before b, a is stored on the temporary
vertex container for the first time before b.
• Afterwards, b will be read before a, i.e. the second vertex read order is inverted.
Algorithm 3.2 In each geometric element, the vertex load order has to fit to the
traversal’s direction. The for loop in Algorithm 3.1 thus has to represent the following
loop—the loop’s body instructions are to be embedded at line 13. The code snippet
evaluates the automaton’s current even flag fixing the iterate’s direction. The source
fragment is to be optimised, but the non-optimised version illustrates the traversal
mirroring. mask and firstVertex are modified in their binary representation, and
firstVertex represents a vertex position within one element, i.e. as a {0, 1}^d tuple.
1: firstVertex ← (0, 0, . . .); mask ← (0, 0, . . .)
2: for i ∈ {0, . . . , d − 1} do
3:   if even_i then
4:     mask_i ← 1
5:     mask ← ¬mask
6:     firstVertex ← firstVertex ⊻ mask
7:     mask ← ¬mask
8:     mask_i ← 0
9:   end if
10: end for
11: for i = 0 : 2^d do
12:   currentVertex ← firstVertex ⊻ i
13:   ...
14: end for
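A direct transcription of Algorithm 3.2, with vertex positions encoded as d-bit integers (bit i stands for the x_(i+1) component; the encoding and the function name are illustrative assumptions):

```python
def vertex_order(even, d):
    """Derive the element's first vertex from its even flags, then enumerate
    all 2^d vertices by XORing the loop counter onto it."""
    all_bits = (1 << d) - 1
    first_vertex, mask = 0, 0
    for i in range(d):
        if even[i]:
            mask |= 1 << i            # mask_i <- 1
            mask ^= all_bits          # mask <- not mask (on d bits)
            first_vertex ^= mask      # firstVertex <- firstVertex xor mask
            mask ^= all_bits          # undo the negation ...
            mask &= ~(1 << i)         # ... and reset mask_i to 0
    return [first_vertex ^ i for i in range(1 << d)]

# For d = 2: both flags unset keeps the reference order; setting both
# flags mirrors the enumeration.
assert vertex_order((False, False), 2) == [0, 1, 2, 3]
assert vertex_order((True, True), 2) == [3, 2, 1, 0]
```

Since XOR with a fixed value is a bijection, every even-flag combination yields a permutation of all 2^d vertices.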
As the access order inverts, it is an obvious idea to again use stacks as temporary containers, and to extend this concept to arbitrary dimensions. To make this
approach work, the sequence in which the vertices are loaded for an element has to
fit to the space-filling curve, too. Thus, the two loops “for vj ∈ vertex(ei ) do” in
Algorithm 3.1 have to fit to Algorithm 3.2.
A Left/right Classifier
The fundamental idea of [35, 39, 63] is to store all the vertices left of the iterate
on one stack and all the vertices right of the iterate on another stack. There is
a left and a right stack. If the traversal automaton leaves a geometric element,
the algorithm swaps all vertices that are to be read again to one of the two stacks
according to the order induced by the iterate running through the element. The
iterate within the neighbouring element is inverted due to the palindrome property,
i.e. in a neighbouring element the automaton reads the temporary vertices in reversed
order—it takes the vertices from a temporary stack:
• The operation “read vj from temporary vertex container” in Algorithm 3.1
becomes “if vj is left of iterate, pop it from the left stack, otherwise pop it
from the right stack”.
• The operation “write vj to temporary vertex container” in Algorithm 3.1
becomes “if vj is left of iterate, push it to the left stack, otherwise push it to
the right stack”.
As the projection property holds for the Peano curve, this idea generalises to any d
via induction.
Figure 3.13: A left and a right temporary stack alone do not work for the Peano
spacetree traversal: In the example, the left stack is studied. The
geometric elements 2, 3, 4, 5, 9 swap the vertices c,d,e,f,g and h to the
left stack. Afterwards, the traversal algorithm ascends into the parent
element, and the vertices a and b are written to the left stack. The
algorithm then transits from element 1 into element 10. Within 10, b is
taken from the left stack and the traversal descends into the first child
(11). This child has to read h from the left stack. Yet, vertex a is on
top of the stack and shadows the vertices f, g and h. The two-stack
algorithm does not work.
The Peano spacetree traversal runs through all levels of the spacetree. Unfortunately, a simple left/right temporary stack concept then results in a container access
conflict (Figure 3.13). Two approaches preserving the stacks-only policy resolve this
conflict: Either each grid level holds a pair of left/right stacks of its own, or a small
and, in particular, resolution-independent, fixed number of additional “hierarchy”
stacks locally resolves the access conflicts resulting from the multilevel interplay.
The first solution is straightforward but renders the stack idea null and void,
as it introduces a dynamic data structure mapping levels to pairs of stacks. This
mapping then establishes an additional indirect memory access for a lookup table,
and the table itself introduces an overhead as it grows with the maximum spacetree
depth. Finally, the traversal permanently switches from one level to another, i.e. it
would access different stacks permanently. Such an access behaviour leads to bad
cache hit rates—an argument picked up later. Hence, [35, 39, 63] follow the other
idea: The shadowing problem from Figure 3.13 affects vertices on the faces, i.e. the
problem corresponds to a submanifold. They therefore introduce additional stacks
corresponding to the submanifold and write the vertices causing problems (vertex
a in the example) to these additional submanifold stacks. The exact rules where to
place which vertices become more complicated, but an induction over d and the grid
hierarchy shows that the resulting algorithm succeeds with a fixed number of 3^d + 1
stacks, i.e. the number of stacks does not depend on the spacetree’s depth.
Alternative Concept
I derive an alternative access pattern for the temporary stack containers. Here, the
traversal automaton keeps track of the order in which the neighbours are connected
by a hyperface: Due to the standardisation of the Peano iterate in Definition 3.1,
the stack automaton knows for each element connected by a face whether it will be
visited or has been visited. For each vertex, the Peano traversal will next access
the vertex out of a neighbour connected by a hyperface—not an edge, not a vertex,
but a face. This is due to the continuity of the curve. Each automaton state thus
knows which neighbouring element triggers the next read access. It also knows
which neighbouring element has processed a vertex before.
Let there be two stacks per dimension. They are enumerated oscillatingly, similar
to the even-flag. Then, each element’s face corresponds to a different stack due to
the fact that there is an even number (two) of stacks but an odd number of partitions
(k = 3) per coordinate axis.
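One concrete such enumeration (an assumption: the thesis only requires that the two stacks of each axis alternate along that axis) gives the face of the element at integer position p with normal axis i and side s ∈ {0, 1} the stack number 2i + ((p_i + s) mod 2). The helper name is illustrative.

```python
def face_stack(cell, axis, side):
    """Stack number of a face: two stacks per axis, alternating along the axis.
    `cell` is the element's integer position, `side` is 0 (near) or 1 (far)."""
    return 2 * axis + (cell[axis] + side) % 2

# Opposite faces of one element use different stacks; the shared face of two
# axis neighbours uses the same stack, which is the property exploited here.
for x in range(3):
    assert face_stack((x, 0), 0, 0) != face_stack((x, 0), 0, 1)
    assert face_stack((x, 0), 0, 1) == face_stack((x + 1, 0), 0, 0)
```

With an even number of stacks per axis but an odd number (k = 3) of partitions, the parity never repeats on the two faces of one element.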
Preceding a write to the temporary containers, the algorithm identifies the face
connecting the element that triggers the next read access. This face has a stack number.
The algorithm pushes the vertex onto the stack with this number. The read
operation then performs the counterpart of this analysis, and the overall algorithm
comes along with 2 · d temporary stacks.
Example 3.3. Given is a two-dimensional grid with the stack numbers 0, 1, 2 and
3. In the following illustration, each tuple denotes the read stack (first entry) and
the write stack (second entry). x denotes an unknown stack (not important in the
example), and in and out are the input or output stacks respectively.
In the first element (top, left), the traversal runs along the x1 axis. Thus, two
vertices are stored on stack 2, the third entry is stored on stack 3, as it will be
used by the element along the x2 axis next. In the second element (top, right), the
traversal automaton reloads the vertices from stack 2. Although the traversal also
continues along the x1 axis (bottom, left), the vertices adjacent to the next element
are stored on stack 0. This is due to the stack identifiers alternating along each
coordinate axis.
It is obvious that this approach works for one single level, although it requires twice
the number of stacks compared to a left/right classification. Yet, the shadowing
problem sketched in Figure 3.13 cannot occur, as the algorithm a priori swaps the
vertices to the stacks from which they are needed next. The approach hereby relies
on a fact mentioned above: Each pair of opposite faces belongs to different stack
numbers.
Corollary 3.5. The temporary container consists of 2 · d stacks.
A formal proof is beyond the scope of this work. Yet, I formalise the underlying
algorithm in more detail. Let

access : {0, . . . , 2 · d − 1} ↦ {−2 · d + 1, −2 · d + 2, . . . , 0, 1, . . . , 2 · d − 1}     (3.4)

define a value for the faces of a geometric element. The semantics of access is as
follows:
access(f ) < 0:   The neighbour connected via face f has been processed before the
                  element was entered.
access(f ) > 0:   The neighbour connected via face f has not been processed yet and
                  will be entered after the element has been left.
access(f ) = 0:   Face f is covered by the root element’s faces.
access(f ) = −1:  The element has been entered through face f .
access(f ) = 1:   The element is left through face f .
access(i) = m:    The neighbour connected via face i will be processed after all
                  neighbours connected via faces j for which access(j) < m holds.
The constraints

access(i) > 0 ⇒ ∀ 0 < m ≤ access(i), ∃j : access(j) = m,
access(i) < 0 ⇒ ∀ access(i) ≤ m < 0, ∃j : access(j) = m, and     (3.5)
∀i with access(i) ≠ 0 : ∄j ≠ i : access(j) = access(i)     (3.6)

ensure that no numbers within one access list are left out, i.e. no natural number
is omitted (3.5). Furthermore, the numbers resulting from access are unique (3.6). The
predicate Ptouched then is
access(f ) < 0 ⇒ Ptouched (f ),
access(f ) ≥ 0 ⇒ ¬Ptouched (f ),
and the range of access’s image is bounded and fulfils (3.4). The spacetree’s
root element corresponds to
access(f ) = 0 ∀ f ∈ {0, . . . , 2 · d − 1}.
In accordance with the even flag, an access flag is also given by the Peano iterate’s
construction. Whenever the automaton descends from a refined element into a child,
it performs the following steps:
• Clone the parent automaton state.
• Derive the new even attributes.
• Identify the new entry face fentry , set the access value of fentry to −1, and adapt the
other access entries such that (3.5) and (3.6) hold again.
• Identify the new exit face fexit , set the access value of fexit to 1, and adapt the other
access entries such that (3.5) and (3.6) hold again.
A formal description of these steps is given in Algorithm A.2, Algorithm A.3, Algorithm A.4, and Algorithm A.5 in the appendix.
Example 3.4. The access flags for the geometric elements in Example 3.3.
The access flags denoted by x depend on the parent element’s access flags.
To validate that each new state fulfils the stated constraints is trivial; it is therefore
omitted here. The identification of the temporary stack now works as follows (Algorithms
3.3 and 3.4): If a vertex is to be read and all adjacent faces’ access entries are
positive, it has to be taken from the input stream (Ptouched does not hold for any
adjacent face). Otherwise, one has to select the face f with the biggest negative access(f )
value among the adjacent faces. f identifies a stack, and f also corresponds to an
element’s face normal. If the even flag of the state along this normal does not hold,
f is the temporary stack number searched for. Otherwise, the stack searched for
belongs to the face opposite to f .
If a vertex is to be written and all adjacent faces’ access entries are negative, it
has to be written to the output stream. Otherwise, one has to select the face f with the
smallest positive access(f ) value among the adjacent faces. f again identifies a stack
and corresponds to a normal. If the even flag for this normal does not hold, f is
the temporary stack number searched for. Otherwise, the stack searched for belongs
to the face opposite to f .
The alternative stack management presented here is—from my point of view—a
significant improvement over [35, 39, 63] for two reasons: It reduces the
number of stacks for d ≥ 3, and the grid hierarchy does not affect the vertex access,
i.e. the information where to store a vertex or where to take a vertex from does not
depend on whether the vertex’s position coincides with a vertex of a smaller
level. The first aspect gains weight with increasing dimension d.
Peano offers a data persistence layer consisting exclusively of stacks, and the
element-wise grid traversal is well-defined, too, according to this chapter. Before I
switch to a concrete application built on top of the traversal via a mapping from
3 Spacetree Traversal and Storage
Algorithm 3.3 Determine the temporary stack number to read the vertex at position from.
If the operation returns UseInputStream, the vertex is to be read from the input stream
instead of a temporary stack.

getReadStack : {0, 1}^d ↦ {0, . . . , 2d − 1} ∪ {UseInputStream}

 1: procedure getReadStack(position)
 2:   smallestValue ← −2 · d − 1
 3:   result ← UseInputStream
 4:   direction ← −1
 5:   for i ∈ {0, . . . , d − 1} do
 6:     if position_i = 0 then
 7:       face ← i
 8:     else
 9:       face ← i + d
10:     end if
11:     if access_face < 0 ∧ access_face > smallestValue then
12:       result ← face
13:       smallestValue ← access_face
14:       direction ← i
15:     end if
16:     if result ≠ UseInputStream ∧ even_direction = ⊤ then
17:       if result < d then
18:         result ← result + d
19:       else
20:         result ← result − d
21:       end if
22:     end if
23:   end for
24:   return result
25: end procedure
Algorithm 3.4 Determine the temporary stack number to write the vertex at position to.
If the operation returns UseOutputStream, the vertex is to be written to the output
stream instead of a temporary stack.

getWriteStack : {0, 1}^d ↦ {0, . . . , 2d − 1} ∪ {UseOutputStream}

 1: procedure getWriteStack(position)
 2:   biggestValue ← 2 · d + 1
 3:   result ← UseOutputStream
 4:   direction ← −1
 5:   for i ∈ {0, . . . , d − 1} do
 6:     if position_i = 0 then
 7:       face ← i
 8:     else
 9:       face ← i + d
10:     end if
11:     if access_face > 0 ∧ access_face < biggestValue then
12:       result ← face
13:       biggestValue ← access_face
14:       direction ← i
15:     end if
16:     if result ≠ UseOutputStream ∧ even_direction = ⊤ then
17:       if result < d then
18:         result ← result + d
19:       else
20:         result ← result − d
21:       end if
22:     end if
23:   end for
24:   return result
25: end procedure
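The stack selection can be sketched in C++ following the textual description above. This is a hedged illustration for d = 2, not Peano’s actual API: the names CellState, getReadStack, and getWriteStack are mine, std::nullopt stands for UseInputStream/UseOutputStream, and the even-flag correction is applied once after the face with the extremal access value has been found.

```cpp
#include <array>
#include <cassert>
#include <optional>

constexpr int d = 2;  // spatial dimension, fixed for the sketch

struct CellState {
  std::array<int, 2 * d> access{};  // access value per face (faces 0..d-1 and d..2d-1)
  std::array<bool, d> even{};       // even flag per coordinate direction
};

int oppositeFace(int face) { return face < d ? face + d : face - d; }

// nullopt encodes UseInputStream.
std::optional<int> getReadStack(const CellState& s, const std::array<int, d>& position) {
  std::optional<int> result;
  int best = -2 * d - 1, direction = -1;
  for (int i = 0; i < d; ++i) {
    const int face = (position[i] == 0) ? i : i + d;    // adjacent face along axis i
    if (s.access[face] < 0 && s.access[face] > best) {  // biggest negative access value
      result = face; best = s.access[face]; direction = i;
    }
  }
  if (result && s.even[direction])  // even flag holds: take the opposite face's stack
    result = oppositeFace(*result);
  return result;
}

// nullopt encodes UseOutputStream.
std::optional<int> getWriteStack(const CellState& s, const std::array<int, d>& position) {
  std::optional<int> result;
  int best = 2 * d + 1, direction = -1;
  for (int i = 0; i < d; ++i) {
    const int face = (position[i] == 0) ? i : i + d;
    if (s.access[face] > 0 && s.access[face] < best) {  // smallest positive access value
      result = face; best = s.access[face]; direction = i;
    }
  }
  if (result && s.even[direction])
    result = oppositeFace(*result);
  return result;
}
```

If no adjacent face carries a negative (respectively positive) access value, the functions return std::nullopt, which corresponds to taking the vertex from the input stream or writing it to the output stream.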
—von-Neumann Bottleneck—
A fundamental principle in computer construction is to distinguish between
memory and processing units among other components. This idea goes back
to János Lajos Neumann (John von Neumann) [74]. The processing unit’s
speed then determines the theoretical computing power of a system—the peak
performance. The memory’s size determines the maximum amount of data
an application can handle, as long as no additional swapping techniques are
applied. Both components are connected: The processing unit has to fetch
data from the memory and write data back, as its local memory (registers) is
extremely small compared to the main memory. Nowadays, this connection
typically is a bus.
For such systems, the actual performance of an application no longer depends exclusively on the peak performance: Whenever data is to be read from
the memory, the minimum of the memory’s, the bus’, and the processing unit’s
performance determines the actual execution speed. The first two aspects go
back to the latency and bandwidth of the memory connection. Applications
restricted by these effects are called memory-bound, and, in accordance with
[74], people often speak of the von-Neumann bottleneck if the bus causes the
slowdown.
events to concrete operations, I discuss some properties arising from the interplay
of space-filling curves and stacks.
3.4 Cache Efficiency
A stack-based grid traversal and storage scheme for adaptive Cartesian multiresolution grids is an interesting subject of study from an algorithmic point of view.
The unique selling point of the approach is the low memory requirements mentioned
several times. Another nice property stems from the exclusive use of stacks: The
cache hit rate of the algorithm is extraordinarily good, as the following section points
out. As a result, the framework was never memory-bound in my experiments—a
surprising and promising property for a PDE solver framework, in particular for
applications typically running into the von-Neumann bottleneck.
How an algorithm exploits the memory hierarchy of an architecture determines
whether the algorithm is cache efficient. There are many approaches to make an
algorithm cache efficient, and there are many approaches to classify cache efficient
algorithms. A fundamental classification distinguishes between cache-aware and
—Caches in the Memory Hierarchy—
Many applications are memory-bound, i.e. they do not exploit the processing
unit’s power, as the processing unit permanently has to wait for the memory
system to deliver data. This problem becomes more and more burdensome:
On the one hand, the processing units get faster due to an increased frequency
or an increased number of cores. On the other hand, the memory becomes
bigger, and this growth slows down the memory access speed.
To overcome the memory bottleneck, computer architects introduce a hierarchical memory system. In-between the processing unit and the memory—it is
a main memory now—smaller but faster intermediate memories are plugged
in. These intermediate memories are caches, and they hold copies of the main
memory’s data. Today’s architectures typically exhibit two or three caches in
a row called cache levels: As the memory access speed depends on its size, the
nearer a cache is to the processing unit, the smaller it is.
If the processing unit wants to load a record, it does not access the memory
directly, but triggers a lookup in the nearest cache. The cache is small and fast
compared to the main memory. If the record is contained in the cache, it is
taken from the cache. If it is not contained, the cache triggers the next lookup
in the next bigger cache or in the main memory, respectively. The first case is
a cache hit, the latter case is a cache miss.
Each data transfer comes along with a certain overhead due to latency and
data consistency management. Thus, caches do not exchange single bits and
bytes, but they transfer whole blocks of memory. These blocks are cache lines.
If a record is not held in a cache, the cache requests the whole cache line
comprising the record. As the cache size is small, caches frequently run out of
space. In this case, the cache has to swap records back to the bigger caches or
the main memory. An algorithm cast in the hardware decides which cache lines
are swapped. The sooner a swapped cache line is to be reloaded, the worse the
algorithm’s decision (capacity miss). The decision which line to replace could
be optimal if the algorithm knew the application’s upcoming memory access
pattern. As the application’s access pattern is, in general, not known a priori
to the hardware, a cache swapping strategy is never optimal.
Nevertheless, an application should diminish cache line replacements.
cache-oblivious algorithms. Cache-aware algorithms exploit knowledge about underlying cache hierarchies and cache properties. Typical cache-aware algorithms
block memory accesses, reorder data accesses, and fuse loops of different program
parts (e.g. [22, 48]). Hereby, they are parametrised with the cache’s properties.
Cache-oblivious algorithms rely on the existence of caches but do not
exploit the caches’ properties quantitatively.
To measure the cache efficiency, the number of cache accesses and the cache hit
rate are suitable metrics: The former is a counter, the latter divides the number
of cache hits by the number of cache accesses. Because of the restricted size, the
cache line policy, and the replacement challenges, two data access characteristics⁶
determine the cache efficiency: The higher the spatial locality, the better the cache
hit rate, and the higher the temporal locality, the better the cache hit rate.
A data access pattern exposes spatial locality if the distance between two records
a and b in the main memory is small whenever the algorithm accesses a and b in a row.
The probability then is high that both records are contained in the same cache line.
As a result, there is also no need to swap cache lines back to a bigger cache or the
memory because of the restricted capacity.
A data access pattern exposes temporal locality, if the time in-between the two
accesses to a record a is small. The probability then is high that the cache line
holding a still resides in the cache and there is no capacity cache miss.
The two terms are introduced in [48]. Yet, that work defines both terms on Cartesian
grids in the context of iterative equation system solvers. The definition above
generalises the idea to abstract memory access patterns.
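The effect of spatial locality can be made concrete with a minimal sketch of my own (not taken from the thesis): both functions below compute the same sum, but the unit-stride loop visits consecutive records that share cache lines, while the strided loop revisits each cache line many iterations apart and therefore tends to incur far more cache misses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// High spatial locality: consecutive records share cache lines.
double sumContiguous(const std::vector<double>& a) {
  double s = 0.0;
  for (double x : a) s += x;
  return s;
}

// Low spatial locality: each access jumps 'stride' records ahead, so a cache
// line is touched, evicted, and touched again much later.
double sumStrided(const std::vector<double>& a, std::size_t stride) {
  double s = 0.0;
  for (std::size_t start = 0; start < stride; ++start)
    for (std::size_t i = start; i < a.size(); i += stride)
      s += a[i];
  return s;
}
```

Both variants touch every record exactly once; only the order of the accesses, and with it the cache behaviour, differs.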
Corollary 3.6. The stack-based grid management is cache-oblivious.
On the one hand, the stack signature permits only memory accesses exhibiting
high spatial locality, as the distance in-between two memory accesses is at most one.
On the other hand, the fixed and small number of stacks yields the temporal locality.
If there were a dynamic number of stacks, this would not hold. Furthermore, the
curve’s Hölder continuity ensures that the size of the temporary stacks does not
change by an order of magnitude throughout the traversal: Vertices swapped to the
temporary stacks are usually reused before long, as the curve meanders back soon.
Many PDE solvers suffer from a lack of spatial and temporal locality. Two examples: First, algorithms for parabolic PDEs often exhibit low temporal locality.
These equations typically lead to a sequence of linear equation systems, i.e. one
linear equation system per time step is to be solved. A simulation code has to
⁶ In this incomplete presentation, actual hardware properties such as cache line associativity are
neglected.
choose time steps that are sufficiently small to resolve all the solution’s temporal
characteristics. The solution then exhibits a smooth behaviour in time—it does not
change rapidly from one time step to another. For a given time step, the solver of
the linear equation system uses the solution of the preceding time step as initial
solution guess. As both the preceding solution and the current solution do not differ
significantly, the solver needs only a small number of iteration steps to end up with
a sufficiently accurate solution. For each time step, the whole grid is traversed. The
algorithm exhibits a low number of computations but big data movements.
Second, multiscale algorithms often exhibit low spatial locality. Typically, developers arrange records belonging to one grid level contiguously in memory. As the
number of smoother operations outnumbers the number of inter-level operations, it
is important to optimise the data layout for the smoother’s memory access. Inter-level operations access different records from different grid levels, and, thus, they do
not access the memory in a contiguous manner.
The introductions of [35, 39, 63] motivate the development of the stack-based vertex management with the von-Neumann bottleneck and the problems of algorithms
not exploiting the memory hierarchy. I, by contrast, make the cache discussion follow
the algorithm’s presentation, as arguing via caches carries the inherent danger that
the reader mixes up cache efficiency and application speed. The cache hit rate is
only one ingredient of fast numerical algorithms⁷, although it might become a more
and more important property with all the multicore architectures coming up. Here,
multiple processing units have to share one bus and one cache hierarchy, and this
sharing makes the von-Neumann bottleneck more severe. In addition, low memory requirements, the algorithm’s ability to handle arbitrary adaptivity without any
overhead, and an algorithm that is able to handle problems exceeding the memory
available (see subsequent section) are of value of their own. A good cache hit rate
then in turn is a nice non-functional algorithmic property. It is thus important to
prove that the cache properties of the algorithms of [35, 39, 63] carry over to the
alternative stack management, although this does not exempt the developer from
studying the implementation’s performance in great detail.
3.5 Some Realisation Details
On the subsequent pages, implementation consequences arising from the usage of
an object-oriented language are discussed within the context of the grid storage
and traversal concept. Furthermore, I combine a file swapping strategy with the
stack approach and add tree cuts splitting up the k-spacetree horizontally into an
⁷ In fact, some cache “optimisations” might even slow down the implementation: Replacing a
formula like −a1 + 2 · a2 − a3 with −a1 + a2 + a2 − a3 might improve the cache hit rate but
slow down the overall computation due to the increased number of operations.
active upper part and a lower part not traversed anymore. The latter feature is
especially important for multilevel solvers not traversing the whole spacetree all the
time. While the discussion exploits the stack and spacetree idea, it does not add new
features or algorithms, but bridges the gap from the pure formalism to a framework.
Peano’s implementation is a pure C++ code. Straightforward object-oriented
implementations typically exhibit two shortcomings: Their runtime efficiency suffers
from an instantiation overhead per object creation, and their objects demand
lots of memory compared to highly optimised code written in a lower-level language.
Both drawbacks gain weight if the code models the grid’s entities, i.e. vertices and
geometric elements, as classes. Yet, I follow the academic object-oriented paradigm
rigorously.
The Peano code tackles the construction overhead challenge by exploiting the flyweight pattern [27] intensely. Hereby, a small, fixed number of instances of vertices,
e.g., is created. The object’s state then is replaced on-the-fly with data from the
stacks, i.e. instead of throwing away old objects and creating new ones, the instance
hull is used to handle multiple records.
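The flyweight idea can be sketched as follows. This is a hedged illustration: VertexData, Vertex, and processAll are my names, not Peano’s actual classes; the point is solely that one object hull is reused for every record taken from a stack instead of constructing and destroying an object per record.

```cpp
#include <cassert>
#include <stack>

struct VertexData { int refinementFlag; };  // the raw record held on the stacks

class Vertex {
public:
  void setRecord(const VertexData& data) { _data = data; }  // replace state on-the-fly
  int refinementFlag() const { return _data.refinementFlag; }
private:
  VertexData _data{};
};

int processAll(std::stack<VertexData>& input) {
  Vertex flyweight;  // single instance hull for all records
  int sum = 0;
  while (!input.empty()) {
    flyweight.setRecord(input.top());  // no per-record construction overhead
    input.pop();
    sum += flyweight.refinementFlag();
  }
  return sum;
}
```

Only one Vertex is ever constructed, regardless of how many records pass through it.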
The Peano code tackles the memory overhead challenge by using the precompiler
DaStGen [13, 14]. This precompiler transforms annotated C++ classes into memory-optimised C++ classes. Sets of booleans, enumerations, and bounded integers are
packed into a small number of primitives. A boolean attribute, for example, then
requires only one bit instead of a whole byte. Furthermore, Peano distinguishes
between persistent attributes and attributes which can be discarded when they are
written to the output streams: Before data is written to the output streams, the
objects are transferred into an alternative representation holding only attributes
required in the next traversal. The precompiler automatically generates the transformation code.
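The kind of packing the precompiler performs can be illustrated with hand-written C++ bit fields (this is my illustration, not DaStGen output; the attribute names are hypothetical): booleans and a bounded integer are squeezed into one 16-bit primitive instead of occupying at least one byte each.

```cpp
#include <cassert>
#include <cstdint>

// Booleans need one bit each, a bounded integer with fewer than 64 states
// needs six bits; together they fit into a single 16-bit primitive.
struct PackedVertexState {
  std::uint16_t inside : 1;           // boolean: one bit instead of a byte
  std::uint16_t boundary : 1;
  std::uint16_t refinementState : 6;  // bounded integer: six bits suffice
};

static_assert(sizeof(PackedVertexState) == 2,
              "the whole vertex state fits into two bytes");
```

The static_assert mirrors the two-byte vertex state reported in Table 3.1; on the common compilers this layout indeed occupies two bytes, although the C++ standard leaves bit-field allocation implementation-defined.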
Example 3.5. The storage order on the input stacks determines the adjacency information of elements and vertices, as it determines which element-vertex combinations
are passed to the events. Consequently, there is no need to have a global enumeration
of the grid entities. My visualisation though needs a global vertex enumeration, and
I thus add a global number to each vertex. As the grids can change from visualisation snapshot to snapshot, these numbers are not persistent, i.e. they are not stored
on the input and output stacks. Instead, I throw them away whenever I store the
records, and I regenerate them on-the-fly whenever I need them.
The algorithm deriving the access and even flags (Algorithms A.2, A.3, A.4, and
A.5) is complicated and comes along with lots of integer operations. The Peano
code hence derives both properties once when the grid is constructed. Afterwards,
it stores the data in the geometric elements.
There are three different stack implementations for the grid management. Two
simple implementations are based upon C++’s STL (standard template library)
vector type [53, 70] and a plain array. The first is a dynamic data structure growing
with the number of entries. The latter is a fixed data structure, i.e. the user has
to specify the maximum stack size a priori. In exchange, no additional operations
are required to manage the stack’s size. The last implementation is a combination of
fixed-sized arrays and a sophisticated hard disk swapping. It enables the application
to handle problems exceeding the main memory, and it is studied on the forthcoming
pages. Afterwards, this section discusses some technical details of grid traversals
not descending into all the leaves of any depth. Many multigrid algorithms work on
one fixed maximum level per traversal, and for these algorithms it would be a waste
of computing time to traverse the spacetree’s elements belonging to a finer level.
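The two simple stack variants can be sketched as follows (an illustration under my own names, not Peano’s classes): a stack backed by a growing std::vector, and a stack backed by a plain array whose maximum size has to be fixed a priori.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class VectorStack {  // dynamic: grows with the number of entries
public:
  void push(const T& x) { _data.push_back(x); }
  T pop() { T x = _data.back(); _data.pop_back(); return x; }
  bool empty() const { return _data.empty(); }
private:
  std::vector<T> _data;
};

template <typename T, std::size_t MaxSize>
class ArrayStack {  // fixed: no size management per push/pop
public:
  void push(const T& x) { assert(_top < MaxSize); _data[_top++] = x; }
  T pop() { assert(_top > 0); return _data[--_top]; }
  bool empty() const { return _top == 0; }
private:
  std::array<T, MaxSize> _data{};
  std::size_t _top = 0;
};
```

The trade-off discussed in Section 3.6 is visible in the types: the vector variant pays for occasional reallocations, the array variant for a maximum size that has to be guessed in advance.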
3.5.1 File Swapping
The value of a numerical solution of a PDE is intertwined with the accuracy of the
numerical discretisation: The smaller the mesh width becomes, the more accurate
and reliable the numerical result. Despite all the sophisticated approaches developed
to increase the precision (p-adaptivity, improved boundary approximation, extrapolation, etc.), in the end it hence comes down to making the computational grid as
fine as possible⁸. The increase in spatial resolution is bounded by the computing
time available and the main memory offered by the simulation system—the available main memory per computing node is a limiting factor. Peano circumvents
this restriction by applying a hard disk swapping strategy for problems exceeding
the main memory. The following section outlines this tailored strategy and explains
why it does not lead to a performance breakdown.
Whenever hard disk swapping comes into play, it has to be chosen carefully,
as the hard disk is slower than the main memory by orders of magnitude. Otherwise, it
unacceptably throttles the overall simulation. Peano’s stack concept fits perfectly
to swapping: If the hard disk is the place the records are stored, the stacks in the
main memory act as a cache. The hard disk holds the persistent data, and the main
memory makes the access from the processing unit transparent to the application. It
is an obvious idea to transfer the insights on cache efficiency to the manual swapping.
Let there be one array, i.e. one buffer, of fixed size in the main memory per stack.
Furthermore, let there be one swap file per stack. As the memory is accessed by push
and pop operations—they follow a most recently used paradigm—it is sufficient to
hold the upper part of the stack in the memory. The remaining part of the stack will
only be needed after the whole upper part is read by pop operations. The algorithm
thus stores it on the hard disk. As soon as a maximal fill threshold for the memory
⁸ If an application supports p-adaptivity, it might be possible to increase the polynomial degree
instead of choosing a finer grid. In this case, this arguing does not hold. Yet, increasing
the polynomial degree in practice is restricted by an insufficient smoothness of the underlying
solution. In this case, the user has to switch to hp-adaptivity, and the arguing is justified again.
Figure 3.14: If the buffer fill rate exceeds a given maximum (3.8), the part of the
stack that was least recently used is swapped to the hard disk (top). If
the buffer fill rate falls below a fill threshold (3.8), the most recently
used part of the hard disk is reloaded into the main memory’s buffer
(bottom). Inked entries of the main ring buffer cache data from the
disk.
Algorithm 3.5 push operation for a file-based stack implementation. The predicates and functions are defined in (3.7) and (3.8).

1: procedure push(x)
2:   write x to buffer position pos(i_current)
3:   if P_Mmax ∧ i_current − i_bottom > C_filestack then
4:     write N_blocksize buffer entries at pos(i_bottom) to the swap file
5:     i_bottom ← i_bottom + N_blocksize
6:   end if
7:   i_current ← i_current + 1
8: end procedure
Algorithm 3.6 pop operation for a file-based stack implementation. The predicates and functions are defined in (3.7) and (3.8).

1: procedure pop
2:   if P_Mmin ∧ i_current − i_bottom < C_filestack then
3:     load N_blocksize entries from the swap file to location pos(i_bottom) − N_blocksize
4:     i_bottom ← i_bottom − N_blocksize
5:   end if
6:   i_current ← i_current − 1
7:   return record at position pos(i_current)
8: end procedure
buffer is passed, the oldest entries within the array are swapped to the associated
file. As soon as a minimal fill threshold is underrun, the top swap file partition is
loaded into the buffer (Figure 3.14).
The buffer’s implementation equals a ring buffer concept. Thus, the main memory
buffer for one stack is an array defined by a number N_blocks of blocks with N_blocksize
entries per block. Blocks mirror the concept of cache lines. Two indices i_current and
i_bottom identify the top element of the stack and the smallest stack element that
is still stored in the main memory buffer and not swapped to the file. Since the
buffer is a ring buffer, the actual position of an entry i of the stack within the main
memory array is given by

pos(i) = i mod (N_blocks · N_blocksize).   (3.7)
There are two constants C_min and C_max representing a minimal and a maximal fill
rate threshold within a block. One ends up with the stack access Algorithms 3.5
and 3.6 based upon the predicates

P_Mmax = (i_current mod N_blocksize ≤ C_max) ∧ ((i_current + 1) mod N_blocksize > C_max)   and
P_Mmin = (i_current mod N_blocksize ≥ C_min) ∧ ((i_current − 1) mod N_blocksize < C_min).   (3.8)
Obviously, writing to and reading from a swap file is independent of the
result and the effect of the push and pop operations. Thus, they might be deployed
into an IO thread of their own.
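The interplay of the ring buffer (3.7) with the block-wise swapping of Algorithms 3.5 and 3.6 can be condensed into a simplified sketch. A std::vector stands in for the swap file, and the fill thresholds are simplified to whole-block checks instead of the exact predicates (3.8); the class name and all identifiers are mine, not Peano’s.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SwappingStack {
public:
  void push(double x) {
    _buffer[pos(_iCurrent)] = x;
    // Buffer nearly full: swap the oldest (least recently used) block out.
    if (_iCurrent - _iBottom + 1 > (Nblocks - 1) * Nblocksize) {
      for (std::size_t j = 0; j < Nblocksize; ++j)
        _file.push_back(_buffer[pos(_iBottom + j)]);
      _iBottom += Nblocksize;
    }
    ++_iCurrent;
  }

  double pop() {
    // Buffer ran empty: reload the most recently swapped block.
    if (_iCurrent == _iBottom && !_file.empty()) {
      _iBottom -= Nblocksize;
      for (std::size_t j = 0; j < Nblocksize; ++j)
        _buffer[pos(_iBottom + j)] = _file[_file.size() - Nblocksize + j];
      _file.resize(_file.size() - Nblocksize);
    }
    --_iCurrent;
    return _buffer[pos(_iCurrent)];
  }

private:
  static constexpr std::size_t Nblocks = 4, Nblocksize = 8;
  static std::size_t pos(std::size_t i) { return i % (Nblocks * Nblocksize); }  // (3.7)

  std::vector<double> _buffer = std::vector<double>(Nblocks * Nblocksize);
  std::vector<double> _file;  // stand-in for the swap file
  std::size_t _iCurrent = 0, _iBottom = 0;
};
```

The buffer holds at most (Nblocks − 1) · Nblocksize + 1 live entries at any time, so the modulo addressing never overwrites data that has not yet been swapped out.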
Figure 3.15: A tree cut has to know the maximum traversal level ℓmax^(traversal+1) of the
subsequent traversal. Records belonging to a higher level are written
to the bottom output stream instead of the output stream (top). In
the successive iterations, the algorithm does not descend from refined
spacetree nodes of level ℓmax^(traversal+1), but it treats them as leaves. The
counterpart is the tree merge.
3.5.2 Horizontal Tree Cuts
All algorithms in this thesis are built on top of a Peano spacetree traversal. Nevertheless, not all of them have to descend into the whole tree all the time: For many
algorithms it is sufficient to descend to a given level ℓmax. All events for elements
and vertices belonging to a level ℓ > ℓmax deteriorate to “no operation” (no-op).
Example 3.6. Consider the multiplicative two-grid algorithm from the upcoming
chapter with a V (µ) cycle. This scheme first traverses the whole grid µ times, and it
performs a number of calculations—the smoothing—on each geometric element and
each vertex. Afterwards, it ascends, i.e. the consecutive µ traversals perform operations on all elements and vertices belonging to a level smaller than the maximum
level. The events for the elements and vertices having maximum level deteriorate.
It is an obvious idea to make the traversal pass exclusively through the part of the tree that
is actually needed by the algorithm. Let traversal be the actual traversal number.
The algorithm knows that the next traversal traversal + 1 performs operations for
each vertex and element up to a level ℓmax^(traversal+1). Whenever a record is written
to the output stream, the algorithm analyses its level. If the level is smaller than or
equal to ℓmax^(traversal+1), it is written to the output stream. Otherwise, it is written to
the bottom output stream. Throughout traversal traversal + 1, the algorithm reads
from traversal traversal’s output stream, and refined elements on level ℓmax^(traversal+1)
are treated as leaf nodes. The bottom output stream meanwhile is not modified.
This process is a tree cut (Figure 3.15). The counterpart of the tree cut is a tree
merge. Throughout the merge, the traversal reintegrates data from the bottom
stream.
Obviously, such an approach can work with an arbitrary sequence of cuts and
merges, i.e. more than one cut in a row is allowed, as long as the order of the
bottom stream fits to the overall stack access scheme.
The tuple op = (traversal ∈ N0, ℓmax^(traversal) ↦ ℓmax^(traversal+1)) identifies one cut or
merge, respectively, throughout traversal traversal. (16, ∞ ↦ 23), e.g., denotes that
throughout the 16th traversal all records belonging to a level ℓ > 23 are written
to the bottom stack. Before, the whole tree had been traversed. Afterwards, the
traversal algorithm descends the spacetree up to level 23, i.e. the maximum level
ℓmax changes from ∞ to 23. If the tuple is followed by (19, 23 ↦ 15), e.g., this
denotes that traversals 17 and 18 descend to level 23. Traversal 19 then descends
only to level 15 and swaps additional parts of the k-spacetree to the bottom stream.
Let op1 op2 op3 . . . denote a cut and merge sequence. The sequence has to coincide
with the following grammar with start symbol O to make the streams’ orders fit to
each other:

O ↦ ε | OO | (t, ℓi ↦ ℓj) O (t + 1 + 2 · m, ℓj ↦ ℓi),
t, m ∈ N0, ℓi, ℓj ∈ {∞} ∪ N0, ℓi > ℓj.   (3.9)
There is a merge for each cut reintegrating all the records from the bottom output
stream into the input stream, as the levels ℓi and ℓj in the two tuples coincide
(3.9). The overall cut and merge sequence is ordered in time. As the traversal
direction switches after each traversal, and as the bottom output stream acts as
bottom input stream throughout the merge, the 2 · m in (3.9) ensures that the
streams’ orders fit to each other. Finally, I postulate for two consecutive transitions
(t, ℓ1 ↦ ℓ2)(t + 1, ℓ3 ↦ ℓ4) that ℓ2 = ℓ3 and ℓ4 < ℓ2 ∨ ℓ4 = ℓ1 hold.
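A partial validity check against grammar (3.9) reduces to balanced-parentheses matching: cuts and merges must nest, the levels of a cut (t, ℓi ↦ ℓj) with ℓi > ℓj must be mirrored by its merge (t′, ℓj ↦ ℓi), and t′ = t + 1 + 2 · m forces t′ − t to be odd. The sketch below is my own; INF stands in for the level ∞, and the additional postulate on consecutive transitions is not checked.

```cpp
#include <cassert>
#include <stack>
#include <vector>

constexpr int INF = 1 << 30;  // stand-in for the level "infinity"

struct Op { int t; int from; int to; };  // the tuple (t, l_from -> l_to)

bool isValidSequence(const std::vector<Op>& seq) {
  std::stack<Op> openCuts;  // cuts still awaiting their merge
  for (const Op& op : seq) {
    if (op.from > op.to) {  // a cut: the traversal descends less deeply afterwards
      openCuts.push(op);
    } else {                // a merge: must undo the innermost open cut
      if (openCuts.empty()) return false;
      const Op cut = openCuts.top();
      openCuts.pop();
      const bool levelsMirror = cut.from == op.to && cut.to == op.from;
      const bool timeOk = op.t > cut.t && (op.t - cut.t) % 2 == 1;  // t' = t + 1 + 2m
      if (!levelsMirror || !timeOk) return false;
    }
  }
  return openCuts.empty();  // every cut has been merged again
}
```

The example from the text, (16, ∞ ↦ 23) followed by (19, 23 ↦ 15), opens two nested cuts; the sequence only becomes valid once both are closed by matching merges in reverse order.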
3.6 Experiments
The following experiments analyse the runtime per vertex for a pure grid traversal:
the traversal creates the grid and moves data from one stack to the other, while the
Table 3.1: Memory requirements for different dimensions and code constituents. All
numbers are given in bytes per grid/traversal entity. The vertices and
elements hold only structural data, i.e. they lack PDE-specific properties.

                        d=2   d=3   d=4
  Traversal automaton    40    56    72
  Geometric element      10    14    18
  Vertex                  2     2     2
traversal events—besides the geometry analysis and the evaluation of the refinement
criterion—degenerate to no-ops. This runtime behaviour is broken down into effects
stemming from the different ideas presented in this chapter. The experiments were
conducted on the Pentium, Opteron, and Itanium platforms (Appendix B).
A circle, sphere or hypersphere, respectively, is embedded into the unit hypercube
for all the experiments, i.e. the geometric setup is the same for all test runs. Besides
the grid-related data (Section 3.5), the geometric elements hold an inside/outside
flag and the vertices hold an inside/outside/boundary identifier. The program ends
up with a very modest memory consumption (Table 3.1), and the memory needs
do not depend on the type of architecture, i.e. whether it is a 32 bit or 64 bit
system. The latter statement holds if and only if the compiler’s memory alignment
is switched off.
The traversal automaton encodes—besides level, traversal state variables, and
so forth—the spatial position and the current spatial resolution. Both are vectors
with floating point entries requiring eight bytes. For each additional dimension, the
automaton’s size hence increases by two times eight bytes. If an automaton instance
is stored in the memory, it is stored on the call stack.
The geometric element encodes both the geometric flag and traversal information
such as the access and even flags. Computing these flags on-the-fly would lead to
a significant performance breakdown. The even flag is just a d-tuple of boolean
flags, the access flag consists of 2 · d integer values. Hence, the flag’s size grows by
one integer’s size—it is actually realised by a byte—per dimension for the access
flag, whereas DaStGen [13, 14] encodes all the remaining information within two
additional bytes. The realisation does not exploit DaStGen’s opportunity to pack
integers of bounded size. This could be of value for the access flag, and it might
reduce the element’s size further.
Finally, the vertex holds the geometric information, i.e. whether it is inside, outside or on the computational domain’s boundary, and the refinement status. The
latter requires six bits, and, in turn, the vertex’s state fits into two bytes.
3.6.1 Vertex Access Cost
The first set of experiments compares the runtime per persistent vertex for regular
grids to the runtime per persistent vertex for adaptive grids. A sequence of experiments on regular grids with decreasing mesh size was followed by a sequence of
experiments exclusively applying the geometric refinement criterion (2.9). Hereby,
I chose the minimal mesh size such that the total number of persistent vertices
VT \ HT is of the same order as the regular grid’s number of vertices. All the experiments
were conducted with a stack implementation based upon C++’s STL vector. As
the regular grid’s number of vertices grows by a factor of 3^d for each additional level,
the main memory rigorously restricts the number of possible experiment setups.
The measurements (Figure 3.16) reveal three insights:
1. As hanging vertices are created on-the-fly, traversals on adaptive grids last
longer than traversals on regular grids of the same size. This holds for all experiments besides very small, two-dimensional setups with solely a few thousand grid points. The time spent on the generation of hanging vertices
worsens the computing time per vertex ratio. This effect is not observable for
d = 3 and VT \ HT ≈ 1.0 · 10^4, as the adaptive grid here almost equals the
regular one. Therefore, the drawback resulting from hanging vertices occurs
only if the adaptivity is sufficiently developed, i.e. if it is a very strong, local
adaptivity.
2. The runtime gap between the adaptive and the regular grid for d = 2 is
invariant of the grid size, i.e. the hanging vertices’ overhead does not dominate
the overall computing time with a smaller and smaller mesh size. This holds
although the grid is extremely adaptive, i.e. it exclusively refines on
the domain’s boundary. For each refinement flag set, at least
5^d additional non-hanging vertices are added to the grid. The rise in computing time spent on
these vertices is not outweighed by the overhead required by the additional
hanging vertices.
3. The time per vertex converges to a fixed runtime, i.e. the implementation’s
performance per vertex is—for sufficiently big problems—invariant of the grid
size.
As all experiments were conducted with a dynamic data container, the question
is justified whether the application suffers from the dynamic memory management
of the STL implementation. To avoid “pollution” effects resulting from the hanging
nodes, I reran the experiments for the regular grids with the three different stack
implementations described in Section 3.5. The results in Figure 3.17 exhibit two
interesting insights.
Figure 3.16: A circle (top) or a sphere (bottom), respectively, is discretised by a
k-spacetree with a regular and an adaptive grid. The plots show the
time per vertex [s] over the number of vertices for the Pentium,
Opteron, and Itanium platforms.
3.6 Experiments
Figure 3.17: A circle (top) or a sphere (bottom), respectively, is discretised by a regular k-spacetree’s grid. Three different stack implementations (array, STL vector, file) are compared; the panels plot the time per vertex [s] against the number of vertices on the Pentium, the Opteron, and the Itanium.
First, the performance penalty induced by the dynamic memory management is negligible. Thus, the two advantages of the STL implementation outweigh the loss in performance: With a dynamic stack memory management, one can compute bigger problems than with a fixed-size stack memory. With a fixed-size stack memory in turn, one has to know or guess how big the temporary stacks might become. If the guess turns out to be wrong, the simulation has to be stopped, the stack size has to be adapted, and the simulation has to be rerun. The STL’s vector implementation works on arrays, i.e. if the vector runs out of memory, the implementation allocates a bigger one and transfers the stack’s data to it. Conversely, the implementation can also allocate a smaller amount of memory if the stack’s content underruns a given threshold. For an array-based implementation, the maximum experiment size is determined by the available memory in combination with the maximum size of the stacks. If the application shrinks stacks on demand, bigger problems become solvable. For the input and output stacks, the shrinking is straightforward. The maximum size of the temporary stacks is studied in Section 3.6.2.
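The grow-and-shrink behaviour described above can be emulated on top of std::vector; the following is a minimal illustration only, assuming a quarter-of-capacity threshold — the class name and the threshold are mine, not Peano’s actual stack API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a stack that hands memory back when its fill level underruns a
// threshold. ShrinkingStack and the factor 1/4 are illustrative choices.
template <typename Record>
class ShrinkingStack {
 public:
  void push(const Record& r) { _data.push_back(r); }

  Record pop() {
    Record top = _data.back();
    _data.pop_back();
    // If the fill level underruns a quarter of the capacity, release the
    // surplus memory by swapping into a right-sized vector.
    if (_data.capacity() > 16 && _data.size() < _data.capacity() / 4) {
      std::vector<Record>(_data.begin(), _data.end()).swap(_data);
    }
    return top;
  }

  std::size_t size() const { return _data.size(); }
  std::size_t capacity() const { return _data.capacity(); }
  bool empty() const { return _data.empty(); }

 private:
  std::vector<Record> _data;  // grows automatically on push_back
};
```

The input and output stacks can shrink exactly like this, since they are only popped during one half of the traversal.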
Second, the usage of the file stacks exhibits a small performance penalty, and this penalty is independent of the problem size. It is an obvious idea to study the effect of the file stacks for even bigger problems. In Section 3.6.4, additional experiments reveal that the algorithm can indeed handle problems significantly exceeding the available main memory with constant cost per degree of freedom. In these experiments, no multithreading was used, i.e. the file stack overhead merges into the constant total cost per vertex. Peano’s standard configuration deploys the file management into a thread of its own. The file swapping then comes for free if a core is spent on the overhead studied above.
3.6.2 Maximum Stack Size
The discussion of an advantageous stack implementation lacks knowledge of how big the temporary stacks actually become. The experiments in Table 3.2, Table 3.3, and Table 3.4 track the temporary stack sizes and compare them to the total number of persistent vertices. All experiments study two geometries, a hypersphere and a hypercube, discretised by a regular and an adaptive grid. The results reveal several insights:
1. As both geometries are symmetric, pairs of stacks corresponding to faces with a normal parallel to the same coordinate axis have the same maximum cardinality. Thus, there are only d different maximum stack sizes.
2. The temporary stacks’ sizes are smaller than the input and output stacks by orders of magnitude, as the Peano space-filling curve is a local curve. Here is a colloquial description of the term local: Before traversing from one element to an element far away, the curve prefers to traverse the elements around
Table 3.2: Maximum elements per stack for different resolutions (d = 2). The upper
           block studies a regular grid, the lower block shows results for an
           adaptive grid based upon the geometric refinement criterion (2.9).

              Square                            Circle
    in/out    temp 1    temp 2       in/out    temp 1    temp 2
         4         0         0            4         0         0
        68         3         7           68         3         7
       132         3         9          132         3         9
       772        13        35          772        13        35
     6,128        41       115        6,128        41       115
    53,352       123       348       53,352       123       348
   473,428       367     1,059      473,428       367     1,059

         4         0         0            4         0         0
        68         3         7           68         3         7
       248        13        21          132         3         9
       788        41        53          552         8        19
     2,408        13       142        2,052        22        35
     7,268       367       393        6,792        36        53
    21,848     1,097     1,130       21,252        57        78
    65,588     3,285     3,325       64,872        81       115
   196,808     9,847     9,894      195,972       102       156
   590,468    29,531    29,585      589,512       130       210
Table 3.3: Experiment from Table 3.2 with d = 3.

               Cube                                        Sphere
     in/out   temp 1   temp 2   temp 3        in/out   temp 1   temp 2   temp 3
          8        0        0        0             8        0        0        0
        520       15       31       63           520       15       31       63
      3,264      115      171      262         1,032       15       39       85
     36,032      899    1,249    1,361        14,288      115      299      671
    672,088    7,623   10,351    8,994       319,360      899    2,464    5,874

          8        0        0        0             8        0        0        0
        520       15       31       63           520       15       31       63
      3,200      115      171      262         1,032       15       39       85
     25,320      899    1,091    1,361        13,000       91      216      562
    222,400    7,623    8,293    8,994       150,464      392      922    2,365
  1,994,120   67,159   69,279   71,221     1,494,720    1,336    3,081    7,858
Table 3.4: Experiment from Table 3.2 with d = 4. The experiments in the first part
           of the table work on regular grids, the experiments in the second part of
           the table work on adaptive grids.

            Hypercube                                  Hypersphere
   in/out  temp 1  temp 2  temp 3  temp 4     in/out  temp 1  temp 2  temp 3  temp 4
       16       0       0       0       0         16       0       0       0       0
    4,112      63     127     255     511      4,112      63     127     255     511
   42,528   1,063   1,527   2,215   3,264      8,208      63     159     343     705

       16       0       0       0       0         16       0       0       0       0
    4,112      63     127     255     511      4,112      63     127     255     511
   42,272   1,063   1,527   2,215   3,264      8,208      63     159     343     705
  856,592  23,015  26,615  31,433  38,885    248,448     931   2,407   5,703  12,021
the local elements. This locality property is a direct result of the Peano curve’s Hölder continuity [66], and it is exploited and discussed in greater detail throughout Chapter 5, which applies the curve to the parallelisation. For the grid management, the locality results in small temporary stacks compared to the input and output stacks, and because of the small stacks the data access is very local, too: Long sequences of exclusive writes (and corresponding reads) do not occur. Instead, the temporary stack’s size oscillates around a small value. This further abets a good cache hit rate.
3. For regular grids, the pairs of stacks differ significantly in their maximum size. The definition of a standardised traversal explains this difference: Records stored on stacks corresponding to the faces with a normal along the x1 axis are more likely to be read within the next few geometric elements than records stored on other stacks. The standardisation prioritises these temporary stacks.
4. The latter observation and argument do not hold for strongly adaptive grids.
3.6.3 Cache Hit Rate
Section 3.4 predicts a good cache access behaviour for the Peano algorithm. The
figures in Table 3.5 give evidence for this statement. All the figures result from the
hardware counters of the Itanium. These counters track the system’s behaviour and
do not break the measurements down into individual processes.
Table 3.5: Memory access characteristics for different geometries. The data results
           from the Itanium’s hardware counters.

                           d    L2 Misses /     L3 References /   L3 Misses /      Bus
                                L2 References   L2 Misses         L3 References    Load
  Square       regular     2    0.03722%        1.610             43.08%           ≈ 0%
               adaptive    2    0.05370%        1.776             39.47%           ≈ 0%
  Circle       regular     2    0.03243%        1.567             40.35%           ≈ 0%
               adaptive    2    0.04029%        1.378             42.75%           ≈ 0%
  Cube         regular     3    0.06213%        1.770             30.93%           ≈ 0%
               adaptive    3    0.06778%        1.364             43.77%           ≈ 0%
  Sphere       regular     3    0.06908%        1.486             22.63%           ≈ 0%
               adaptive    3    0.05258%        1.346             38.36%           ≈ 0%
  Hypercube    regular     4    0.07454%        1.493             14.13%           ≈ 0%
               adaptive    4    0.09130%        1.238             30.31%           14%
  Hypersphere  regular     4    0.05671%        1.391             13.83%           ≈ 0%
               adaptive    4    0.12047%        0.880             31.03%           13%
Peano’s cache miss rate is negligible (first column). The second column contrasts the number of these misses with the L3 cache accesses and shows that both figures are of the same order. Nevertheless, the number of L3 cache references exceeds the number of L2 misses. Since the total number of L2 misses is that small, and since a cache simulation predicts an L3 cache miss rate close to zero, the additional accesses and the high L3 cache miss rate have to be caused by other processes (daemons and the operating system) triggering context switches. If the total number of L2 misses were not that small, these switches would not affect the measurements.
All the experiments run on grids with a comparable number of persistent vertices. For extremely adaptive grids, the ratio of the number of persistent vertices to the maximum size of the temporary stacks grows with the dimension. Thus, the tests with d = 4 on adaptive grids lead to long input and output streams, but comparably small temporary stacks⁹. As the input and output stacks do not fit completely into the main memory, each record has to be transferred into the caches via the bus. And as the half-value period in this case is rather small, the bus load becomes measurable. For smaller dimensions, the bus load is negligible, i.e. it was not measured by the hardware counters.
Memory bandwidth is a crucial factor for almost all PDE solvers, and a bandwidth-aware algorithm design gains weight with all the multicores sharing one memory
⁹ This insight is represented by the figures in Table 3.4 tracking the maximum stack size over the whole traversal time. The underlying principles are discussed in the spacetree chapter, and a huge part of the Outlook in Section 2.10 highlights drawbacks, consequences, and improvements concerning the adaptive boundary approximation.
connection. More and bigger caches in combination with cache access optimisations are considered a solution approach for this problem ([44, 48, 77], e.g.). Peano’s cache-obliviousness makes the code swim against the stream: In no experiment was the performance restricted by the memory. The framework thus is particularly promising for applications suffering extraordinarily from bandwidth restrictions. Computational fluid dynamics with its flux computations, for example, is such a field of application, since the algorithms have to traverse huge grids per iteration while the number of operations per record is typically small [34].
3.6.4 File Swapping Overhead
Figure 3.18: Runtime per vertex [s] on two different architectures (Pentium and Opteron; d = 2, 3, 4) with the file stacks, plotted against the number of vertices (10^4 to 10^10). The underlying simulation discretises a hypersphere with a regular grid.
Due to the file swapping, Peano is able to solve problems whose datasets do not fit into the main memory. In Section 3.6.1, the runtime overhead resulting from the file swapping is quantified. This overhead is independent of the problem size, i.e. the runtime per vertex is invariant of the grid’s mesh width—even for problems no longer fitting into the main memory (Figure 3.18)¹⁰.
¹⁰ For the Pentium architecture, bigger problems than presented in the figure exceed the address space: With 2^32 ≈ 4.295 · 10^9 and a maximum problem size of 4.78 · 10^8 vertices for d = 2, the next bigger problem obviously exceeds the 32-bit threshold.
The file swapping algorithm comprises several constants (PMmax, Nblocksize, and so forth). These constants so far are magic numbers, and they have to be adapted to the concrete architecture (cache hierarchy and properties, available hard disks, connection to the disks, etc.) and application domain. As this thesis establishes algorithms and ideas, a parameter study and optimisation are beyond its scope.
Peano’s grid management exhibits impressively low memory demands and outstandingly high cache hit rates for adaptive grids. The numerical experiments at the end of the thesis nevertheless reveal that the framework suffers from a poor MFlop rate. This is due to the lack of optimisation spent on the current implementation. While the cache-obliviousness does not automatically induce a high performance, the studies nevertheless yield two advantages: On the one hand, an optimisation does not have to care about the cache behaviour—it is already good. On the other hand, the low memory bandwidth requirements enable any optimisation to blow up the individual records stored within the grid constituents. The latter fact opens the door to a vast field of possible source code optimisations such as precomputing and dictionary techniques, as well as approaches yielding better results per vertex due to an increased number of bytes spent per vertex.
3.7 Outlook
This chapter’s sections establish a stack-based grid traversal algorithm for (k = 3)-spacetrees. Both the numerical algorithms and the parallelisation in this thesis are built on top of this traversal algorithm. The whole grid management, traversal, and storage come along with very modest memory requirements, and, since the algorithm requires a fixed number of 2d + 2 stacks (two additional stacks are required if tree cuts are switched on), it exhibits high cache hit rates. Furthermore, the runtime per vertex does not grow with an increasing mesh resolution. This holds even if the grid’s memory demands exceed the available main memory.
The good memory behaviour of Peano-type algorithms has already been studied in [35, 39, 63]. The fundamental new contribution of this chapter thus is the reduction from a number of stacks exponential in d to a linear number of stacks. In the preceding Chapter 2, I establish a flexible definition of k-spacetrees. The present chapter specialises to (k = 3) and derives a grid management for this special case of k-spacetrees. The restriction to (k = 3) results from the genealogy of the algorithm. Nevertheless, it is too restrictive: To make the stack idea work, the algorithm relies on the palindrome property, the projection property, and the invert traversal property. As these properties hold for any k-spacetree with k ∈ {2i + 3 : i ∈ N0}, the stack approach works for them, too (Figure 3.19)—an observation already available
Figure 3.19: The correctness of the stack access scheme relies on the palindrome, the
projection and the invert traversal property. All these properties hold
for any meander-type traversal based upon an odd number of cuts along
each coordinate axis, i.e. for k-spacetrees with k ∈ {2i + 3 : i ∈ N0 }.
throughout the metric discussion of space-filling curves [28], which calls these iterates rasterised space-filling curves. The resulting k^d-patches facilitate solver optimisations, as the PDE solvers can process whole k^d-patches. Processing k^d-patches in turn leads straightforwardly to a couple of further improvements ([23], e.g.):
1. They are traversed lexicographically, which reduces the integer arithmetic resulting from all the mirroring and traversal meanders.
2. The underlying loops are unrolled.
3. As PDE solvers typically need several loops over a grid, these loops are fused.
4. The workload can be deployed to several threads.
To perform the same number of cuts along each coordinate axis for all geometric elements does not suit every application (anisotropic partial differential equations benefit from a more flexible partitioning, flows dominated by convection benefit from grids aligned with the flow direction, etc.). Hence, it makes sense to decide per axis whether to refine or not. This idea is picked up in Section 2.10, and it fits the spacetree principle. The increase in flexibility does not affect the underlying idea of a stack-based grid management, if the resulting traversal meanders through the computational domain (Figure 3.20) and, thus, preserves the left/right classification of the vertices—as long as the number of cuts equals an odd number.
If an algorithm has to traverse the spacetree only up to a given level, the tree cut technique significantly speeds up the implementation (the effect is studied in detail in the upcoming chapter). Tree cuts in Section 3.5 are parametrised by a global maximum level, i.e. all nodes beyond this level are removed from the traversal tree. This
Figure 3.20: Anisotropic (k = 3)-spacetrees fit to the stack-based grid management,
as long as the traversal remains a continuous meander curve separating
vertices into left and right.
Figure 3.21: A tree cut removes all leaves beyond a given level from the traversal tree, i.e. it “cuts” through a given level horizontally (left). A local maximum traversal level allows subtrees to be selected individually for cutting from the tree (right).
behaviour motivates the term “horizontal”. Nevertheless, some approaches (multigrid algorithms resolving singularities, e.g.) do not yield a global maximum level, but define a maximum level individually for each subtree (Figure 3.21). To remove all the leaves from a tree is an example. Such an extension of the cut mechanism is straightforward: The global variable ℓmax is removed, and each geometric element is given an ℓmax variable of its own. The traversal then decides individually for each subtree if the records are to be stored on the output or the bottom output stream. As a result, a global stack holding the preceding cut history is no longer applicable. It has to be replaced by a more sophisticated control mechanism preserving the tree consistency.
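The effect of such a per-subtree ℓmax on which levels are visited can be sketched as follows; the pointer-based tree and all names here are illustrative only, as Peano’s actual traversal is stack-based:

```cpp
#include <cassert>
#include <vector>

// Sketch of a per-subtree maximum traversal level: each element carries its
// own lMax instead of consulting one global constant. Names are mine.
struct Element {
  int lMax;                        // maximum traversal level of this subtree
  std::vector<Element> children;
};

// Collect the levels actually visited. Children whose level would exceed the
// element's individual lMax are "cut", i.e. skipped by the traversal.
void traverse(const Element& e, int level, std::vector<int>& visited) {
  visited.push_back(level);
  if (level + 1 > e.lMax) return;  // cut: descendants stay untouched
  for (const Element& c : e.children) {
    traverse(c, level + 1, visited);
  }
}
```

The decision `level + 1 > e.lMax` is exactly the point where the real implementation would route records to the bottom output stream instead of the output stream.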
This chapter establishes all the ingredients for a (k = 3)-spacetree traversal implementation. In the subsequent chapter, I use this implementation to create a
geometric multigrid solver on top of it.
4 Full Approximation Scheme
This chapter realises a finite element solver with a matrix-free, geometric multigrid
[73, 10] and d-linear shape functions on adaptive Cartesian grids for the Poisson
problem
    −∆u = f,    u|∂Ω = g,    u, f : Ω → R,    g : ∂Ω → R,        (4.1)
with a sufficiently smooth right-hand side f and a sufficiently smooth Dirichlet
boundary condition g. The solver exploits k-spacetrees, and it is embedded into
a Peano traversal. It thus demonstrates that the Peano framework allows the implementation of state-of-the-art PDE solvers. In return, the state-of-the-art solver
inherits all the nice properties of the framework such as support of dynamic adaptivity, low memory requirements, good cache behaviour, and so forth.
The intention is not to solve a complicated PDE. Instead, I concentrate on algorithmic and numerical principles and choose a simple PDE. Nevertheless, more complicated problems—modified Poisson problems [51, 76], the Navier-Stokes equations
and the continuity equation [59, 60], fluid-structure interaction [9], as examples—can
be tackled starting from these principles.
The solver is a geometric, multiplicative multigrid implementing a full approximation storage scheme with d-linear shape functions. Geometric multigrid solvers
need a sequence of grids giving a refinement cascade. They start with a fine grid,
and the succeeding grids then become coarser and coarser. k-spacetrees yield such
a hierarchy by construction.
A d-linear shape function on a hypercube is a hat function, i.e. on a Cartesian grid one hat is located at each vertex, and its support covers the 2^d adjacent hypercubes. Such a hat’s value equals 1 at the associated vertex and vanishes at all other vertices. A linear combination of the different hats approximates the solution with a piecewise d-linear function. As each level of a k-spacetree yields (disconnected) Cartesian grids, there is such an approximation on each k-spacetree level. The overall system is not a hierarchical basis, but a hierarchical generating system [33]. This eases the implementation of a full approximation storage scheme.
Discussing solvers for the linear equation system given on the k-spacetree, [21, 35, 36, 39, 49, 63] study a Jacobi solver preconditioned with an additive multigrid. A conjugate gradient solver with a BPX-type preconditioner for a data structure similar to a k-spacetree is the subject of [2, 69]. In contrast, this chapter establishes a multiplicative approach. The algorithm follows [30], with the realisation embedding the algorithm into the element-wise traversal. For this, I combine [30] with three additional aspects: First, hierarchical generating systems [33] replace the hierarchical basis, as hierarchical generating systems fit better to k-spacetrees. Second, the bi-partitioning is generalised to k-partitioning. Third, the approach is extended from regular to adaptive grids.
The chapter is organised as follows: An introduction in Section 4.1 establishes the finite element method’s weak formulation for d-dimensional hierarchical generating systems or a nodal basis on k-spacetrees. In Section 4.2, the nodal basis leads to a discretisation and stencils for the PDE. The description of standard multigrid ingredients follows. It leads to the multigrid algorithm tailored to k-spacetrees, and Section 4.4 breaks the algorithm down into the different traversal events. The vertex attributes’ lifecycle is analysed afterwards, and I implement a linear error estimator giving a simple, dynamic refinement criterion. A discussion of the global convergence measurement leads to some experiments in Section 4.6. An outlook closes the chapter.
4.1 Hierarchical Generating Systems
The finite element method computes the solution of (4.1) in the weak formulation
    ∫_Ω (∇u, ∇ϕ) dx = ∫_Ω (f, ϕ) dx + ∫_∂Ω (∇u, n) ϕ dS(x)    ∀ϕ : Ω → R.        (4.2)
ϕ is from a set of suitably chosen test functions, whereas n equals the outer normal vector of the computational domain. (·, ·) denotes the inner product. The weak formulation (4.2) is accompanied by suitable, weak boundary conditions, and its solution is from the Sobolev space H^1(Ω) [8].
H^1(Ω) consists of an infinite number of functions. A finite element method therefore approximates H^1(Ω) with a finite dimensional function subspace H_h^1(Ω_h) ⊂ H^1(Ω). Given a set of shape functions φ_i defining a basis of H_h^1(Ω_h), abbreviated as H_h^1, the solution of the weak problem within the subspace is a linear combination

    u ↦ u_h = Σ_{i=1}^{|H_h^1|} u_i φ_i ,    i.e.

    ∫_Ω (∇u, ∇ϕ) dx ↦ Σ_{i=1}^{|H_h^1|} u_i ∫_Ω (∇φ_i, ∇ϕ) dx    ∀ϕ ∈ H^1(Ω), u_i ∈ R,

of the shape functions. The number of unknowns within this linear combination equals the basis’ cardinality. To determine the unknowns, it is sufficient to select
a finite number of test functions. This number has to equal the basis’ cardinality.
Ritz-Galerkin methods underlying the forthcoming text equalise the space of the
shape functions—the ansatz space—and the test space.
    ∫_Ω (∇u, ∇ϕ) dx → Σ_{i=1}^{|H_h^1|} u_i ∫_Ω (∇φ_i, ∇φ_j) dx        (4.3)

    ∀j ∈ {1, . . . , |H_h^1|},  φ_i, φ_j ∈ H_h^1,  u_i ∈ R.
The weak formulation in (4.3) yields |H_h^1| linear equations to be solved, i.e. a linear equation system

    A (u_1, u_2, . . .)^T = b        (4.4)

describes the solution of the PDE. This system of linear equations is determined by the stiffness matrix A with A_ji = ∫_Ω (∇φ_i, ∇φ_j) dx. If the shape functions have local support, A exhibits a sparse pattern, since a test function’s support then intersects only a few other shape functions’ supports, and most integrals under the sum in (4.3) vanish. This idea underlies the forthcoming text.
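For d = 1 hat functions on a uniform grid of mesh width h, the entries A_ji reduce to the well-known (−1, 2, −1)/h stencil, and the sparsity is immediate. A minimal element-wise assembly sketch (my own illustration, not the thesis’ code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assemble the 1D Laplacian stiffness matrix for piecewise linear hat
// functions on a uniform grid with n interior vertices and mesh width h.
// Each element [x_e, x_e + h] contributes the local matrix
// (1/h) * [ 1 -1; -1 1 ] to the two hats overlapping it.
std::vector<std::vector<double>> assembleStiffness1D(int n, double h) {
  std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));
  // Elements are numbered 0..n; element e touches vertices e-1 and e
  // (interior vertex indices 0..n-1; indices -1 and n are Dirichlet boundary).
  for (int e = 0; e <= n; ++e) {
    double local[2][2] = {{1.0 / h, -1.0 / h}, {-1.0 / h, 1.0 / h}};
    int idx[2] = {e - 1, e};
    for (int i = 0; i < 2; ++i)
      for (int j = 0; j < 2; ++j)
        if (idx[i] >= 0 && idx[i] < n && idx[j] >= 0 && idx[j] < n)
          A[idx[i]][idx[j]] += local[i][j];
  }
  return A;
}
```

The zero entries for non-neighbouring vertices are exactly the vanishing integrals of (4.3): the supports of the corresponding hats do not intersect.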
4.1.1 The Finite Element
The finite element method here works with d-dimensional shape functions, i.e. the function φ with

    φ1D : R → R,      φ1D(t) = t if t ∈ [0, 1],   1 − (t − 1) if t ∈ [1, 2],   0 else,

    φ : R^d → R,      φ(x) = ∏_{i=1}^{d} φ1D(x_i)
gives one reference shape function. For each non-hanging vertex v_i, the hat function φ is translated such that its maximum value 1 coincides with the vertex position. Afterwards, the translated hat function is scaled such that its support covers the 2^d adjacent geometric elements. The resulting function φ_i hence equals 1 at the vertex v_i and 0 at every other vertex belonging to the same grid level (Figure 4.1). All the vertices of one grid level introduce a piecewise d-linear function space, and the hat functions define a nodal basis.
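The reference hat and its tensor product can be written down directly; a minimal sketch (function names mine):

```cpp
#include <cassert>
#include <cmath>

// The 1D reference hat from the text: rises on [0,1], falls on [1,2],
// vanishes elsewhere; its maximum 1 sits at t = 1.
double phi1D(double t) {
  if (t >= 0.0 && t <= 1.0) return t;
  if (t > 1.0 && t <= 2.0) return 1.0 - (t - 1.0);
  return 0.0;
}

// d-linear reference shape function as the tensor product of 1D hats.
double phi(const double* x, int d) {
  double result = 1.0;
  for (int i = 0; i < d; ++i) result *= phi1D(x[i]);
  return result;
}
```

Translating and scaling this reference function per vertex yields the φ_i described above.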
Definition 4.1. Each non-hanging vertex v ∈ VT \ HT yields one shape function.
Figure 4.1: A hat function being one in a single vertex. The d-dimensional function vanishes at all other vertices. Its support covers 2^d geometric elements.
The number of shape functions equals the cardinality of the set VT \ HT. The text sometimes ascribes properties of a vertex directly to the corresponding shape function or vice versa. Consequently, the grid symbols Ω_h, Ω_{h,ℓ}, and Ω_{h,ℓ}^adaptive then represent the piecewise linear ansatz spaces

    H_h^1(Ω_h), H_h^1(Ω_{h,ℓ}), H_h^1(Ω_{h,ℓ}^adaptive) ⊂ H^1(Ω),

and the shape functions are denoted by φ_v, v ∈ VT \ HT, i.e. I replace the index by the vertex symbol, as v uniquely defines both level and position. u_v then is the weight of vertex v in (4.4).
Definition 4.2. A finite element is a geometric element together with the parts of the 2^d hat functions defined on its vertices and covering the element.
The construction of the d-linear ansatz space corresponds to a nodal point of view. The definition of the term finite element accentuates an element-wise point of view (Figure 4.2). As hanging vertices do not entail a shape function construction, geometric elements “contain” at most 2^d hats.
4.1.2 Generating Systems on k-spacetrees
A set of functions establishing a basis contains solely linearly independent shape functions. As Definition 4.1 does not distinguish between refined and unrefined vertices, the resulting set of shape functions is linearly dependent. It defines a hierarchical generating system. Let

    Ω_T = ⋃_ℓ Ω_{h,ℓ}
Figure 4.2: A finite element consists of one geometric element and the parts of the
grid’s hat functions covering the geometric element.
Figure 4.3: Generating system for two levels (d = 1). Coarse shape functions on refined vertices can be reconstructed from 3^d fine grid functions (solid fine grid hats). The hats thus give a hierarchical generating system.
denote the union of all the different grid levels. H_h^1(Ω_T), spanned by the corresponding hat functions, gives the hierarchical generating system for a k-spacetree. I use H_h^1(Ω_T) synonymously for the generating system, too, i.e. the set also holds the linearly dependent hat functions.
Example 4.1. Let d = 1, k = 2, and pick any vertex of VT \ HT, i.e. a vertex with two adjacent elements. If the vertex is refined, another vertex at the same position on level ℓ + 1 exists. Add this vertex’s hat and its two neighbouring shape functions on ℓ + 1 with the weights (1/2, 2/2, 1/2). The result equals the shape function on level ℓ (Figure 4.3).
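Example 4.1 can be checked numerically; the following sketch (helper names mine) samples both sides of the identity:

```cpp
#include <cassert>
#include <cmath>

// A hat of support width 2*w centred at c: value 1 at c, 0 at c +/- w.
double hat(double x, double c, double w) {
  double v = 1.0 - std::fabs(x - c) / w;
  return v > 0.0 ? v : 0.0;
}

// Example 4.1 (d = 1, k = 2): the coarse hat equals the weighted sum
// (1/2, 2/2, 1/2) of the fine-grid hats at the two neighbouring fine
// vertices and at the coarse vertex position itself.
double coarseFromFine(double x, double c, double w) {
  double h = w / 2.0;  // fine mesh width
  return 0.5 * hat(x, c - h, h) + 1.0 * hat(x, c, h) + 0.5 * hat(x, c + h, h);
}
```

Since both sides are piecewise linear with kinks on the fine grid only, agreement at the fine grid points already implies agreement everywhere.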
With the well-defined finite function space, one wants to
• represent functions u_h in H_h^1(Ω_T) and
• apply operators such as the weak Laplacian to these representations. To evaluate an operator, one has to evaluate u_h(x) for any x ∈ Ω.
These two steps are uncomplicated for a function basis of H_h^1(Ω_h), i.e. a basis for the fine grid system. Let u_h represent a function in H_h^1(Ω_h):

    ∀v = (x, ℓ) ∈ Ω_h : u_v = u_h(x)        (4.5)

or, the other way round,

    ∀x : u_h(x) = Σ_{v ∈ Ω_h} u_v φ_v(x).        (4.6)
Here, the grid points sample the function’s course (4.5), and the function in turn is a linear combination of the function space’s elements (4.6). The same approach obviously is not suited for a hierarchical generating system (Ω_h in (4.5) is replaced by VT \ HT) where the functions are not linearly independent, as
• a concatenation of (4.5) and (4.6) does not yield the identity. Both the generating system and the nodal basis have the same range, i.e. each element in H_h^1(Ω_h) can be represented in H_h^1(Ω_T) and vice versa (Figure 4.3 and Figure 4.4). Nevertheless, a naive reference system transformation followed by an inverse reference system transformation does not result in the original function. Furthermore,
• the scheme does not exploit the additional degrees of freedom resulting from the linearly dependent entries of the generating system, i.e. the generating system does not yield an additional benefit.
I apply the following strategy instead: Let h : H^1(Ω) → H_h^1(Ω_T) map a function to a vector of coefficients in the hierarchical generating system. Such an h decomposes any function into a set of hats belonging to different levels. It consequently decomposes a function into its frequencies. h is free to exploit any freedom resulting from the non-uniqueness. In turn, there is a ĥ : H_h^1(Ω_T) → H_h^1(Ω_h) mapping this representation back to the fine grid. ĥ^{-1} does not exist. Nevertheless, (ĥ ∘ h)(u) = u for all u ∈ H_h^1(Ω_h). In this thesis, I use two pairs of (h, ĥ) discussed on the following pages. The first representation choice is well-suited for a (Laplacian) operator evaluation, the second for inter-level information transfer. Switching from one representation to the other then simplifies the implementation of a multigrid scheme.
Nodal Representation
Figure 4.4: Generating system for two levels, k = 2, and d = 2. Nine shape functions on the fine grid can be combined into one coarse grid shape function. Four of them are illustrated.

In a nodal representation, each vertex samples the function to be represented. As shape functions of refined vertices can be removed from the generating system without reducing the system’s rank, the projection ĥ from H_h^1(Ω_T) to the nodal representation does not take them into account (4.8).
    h : H^1(Ω) → H_h^1(Ω_T),  with
        ∀v = (x, ℓ) ∈ VT \ HT : u_v ← u(x)  and        (4.7)
        ∀x ∉ Ω : u(x) := 0.

    ĥ : H_h^1(Ω_T) → H_h^1(Ω_h),  with
        ∀u_h ∈ H_h^1(Ω_T) : (ĥ(u_h))(x) = ( Σ_{v ∈ V_temp} u_v φ_v )(x)  and
        V_temp = {v : v ∈ VT \ HT ∧ ¬P_refined(v)}.        (4.8)
In this scheme, the vertices of the hierarchical generating system hold the nodal
value of the function (4.7), and an algorithm can evaluate a function directly. Furthermore, the scheme represents a function’s shape on each level simultaneously,
i.e. the hierarchical generating system yields a multiscale representation. The latter
ingredient mirrors the idea of full approximation schemes discussed later.
Hierarchical Basis
The construction of a hierarchical basis starts with a nodal basis on a coarse grid. It then adds linearly independent shape functions corresponding to subsequent refinement levels. For Cartesian grids, the refinement scheme is similar to the k-spacetree construction, but new shape functions are only added for vertices at “new” locations: If there is a new vertex v = (x, ℓ) and ∃v̂ = (x, ℓ̂), ℓ̂ < ℓ, the new vertex does not yield an additional shape function. If the transformation into the hierarchical system mirrors this idea, the inverse transformation just sums up the different levels’ contributions, i.e.
    h : H^1(Ω) → H_h^1(Ω_T),  with
        ∀x ∉ Ω : u(x) := 0  and        (4.9)
        ∀v = (x, ℓ) ∈ VT \ HT :
            level(v) = 0 ⇒ u_v ← u(x),
            level(v) > 0 ⇒ u_v ← u(x) − Σ_{w ∈ V_temp(level(v))} u_w φ_w(x).        (4.10)
Here, Vtemp (k) = {v = (x, ℓ) : v ∈ VT \ HT ∧ ℓ < k}, and ĥ is given by
    ĥ : H_h^1(Ω_T) → H_h^1(Ω_h)  with

        (ĥ(u_h))(x) = ( Σ_{v ∈ VT \ HT} u_v φ_v )(x).        (4.11)
In this scheme, the vertices of the hierarchical generating system hold the hierarchical surplus (4.10). To point out that a value identifies a hierarchical surplus, i.e. that the d-linear function stems from a linear combination of hats scaled with the hierarchical surplus, I write û_v instead of u_v.
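For d = 1 and k = 2, the hierarchisation (4.10) of a midpoint vertex reduces to subtracting the mean of its two parent values; a minimal sketch under this assumption (not the thesis’ traversal-based implementation):

```cpp
#include <cassert>
#include <cmath>

// Hierarchical surplus for d = 1, k = 2: the surplus of a vertex newly
// created at position x on some level is its nodal value minus the linear
// interpolant of the coarser level, i.e. the mean of the two parent values
// at distance h. u is the function being hierarchised.
double surplus(double (*u)(double), double x, double h) {
  return u(x) - 0.5 * (u(x - h) + u(x + h));
}
```

For a globally linear function the surplus vanishes on all levels but the coarsest, which is exactly why the hierarchical representation acts as a frequency decomposition.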
4.1.3 Dirichlet Boundary Conditions
Many finite element codes integrate Dirichlet boundary conditions into the ansatz space: They add tailored shape functions approximating the boundary condition. The test space remains unchanged, as the number of shape functions whose weights are to be determined remains unchanged, too. The solution then is the sum of the weighted shape functions of the ansatz space and the hats approximating the boundary values (Figure 4.5).
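In one dimension this lifting is easy to spell out: for −u″ = 0 the weighted boundary hats only contribute to the right-hand side. A sketch with illustrative names (not the thesis’ code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 1D sketch of Dirichlet handling via boundary hats: for -u'' = 0 on [0,1]
// with u(0) = g0 and u(1) = g1, the weighted boundary hats move to the
// right-hand side, b_i -= A_{i,bnd} * g, with off-diagonal entry -1/h.
std::vector<double> solveLaplace1D(double g0, double g1, int n) {
  double h = 1.0 / (n + 1);
  std::vector<double> b(n, 0.0), u(n, 0.0);
  b[0] += g0 / h;      // lifting of the left boundary hat
  b[n - 1] += g1 / h;  // lifting of the right boundary hat
  // Thomas algorithm for the tridiagonal stencil (1/h)*(-1, 2, -1).
  std::vector<double> c(n, 0.0), d(n, 0.0);
  double diag = 2.0 / h, off = -1.0 / h;
  c[0] = off / diag;
  d[0] = b[0] / diag;
  for (int i = 1; i < n; ++i) {
    double m = diag - off * c[i - 1];
    c[i] = off / m;
    d[i] = (b[i] - off * d[i - 1]) / m;
  }
  u[n - 1] = d[n - 1];
  for (int i = n - 2; i >= 0; --i) u[i] = d[i] - c[i] * u[i + 1];
  return u;
}
```

Adding the boundary hats back to this interior solution reproduces the picture of Figure 4.5: the sum of both function spaces solves the PDE.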
This thesis follows such an idea, and the boundary vertices realise the Dirichlet
boundary condition of the nearest surface point on the continuous computational
domain's boundary according to Section 2.7. This convention does not yet specify how to handle the hierarchical system. Here, the actual "boundary" values depend on both the Dirichlet value and the representation scheme chosen.
First, consider the nodal representation scheme. Unrefined boundary vertices have the value of the nearest surface point of the continuous domain's boundary. According to (4.7), the values of refined boundary vertices equal the finer level's solution at the same place (Figure 4.6).

Figure 4.5: Hats at the discretised boundary approximate the Dirichlet boundary, i.e. their weights are defined by the boundary conditions (top). A standard ansatz space is used to solve the problem (bottom). Both function spaces added give a solution of the PDE.

Figure 4.6: In the nodal representation, refined coarse boundary vertices adopt the value of the finer grid, i.e. the coarse boundary values depend on the fine grid approximation.

Figure 4.7: In the hierarchical basis, the coarse boundary vertices' values have to vanish, as the sum of coarse and fine grid shape functions may not be discontinuous.
Let ℘ : ∂Ω×VT 7→ ∂Ωh map the continuous boundary to a discrete vertex position
along the shortest distance. In accordance with the nodal representation scheme,
the boundary vertices are given by
\[
\forall v = (x,\ell) \in V_T \setminus H_T: \quad
\begin{aligned}
P_{\mathrm{boundary}}(v) \wedge \neg P_{\mathrm{refined}}(v) &: \; u_v = u(\wp^{-1}(x)), \\
P_{\mathrm{boundary}}(v) \wedge P_{\mathrm{refined}}(v) &: \; u_v = u_{(x,\ell+1)},
\end{aligned} \tag{4.12}
\]
and the value of a Dirichlet boundary point in a stationary problem can change due to additional refinements at the boundary. When the discrete computational domain grows monotonically towards the exact domain (2.8) due to a refinement, the precision of the boundary approximation on coarser levels in the hierarchical system thus improves at the same time (4.12). As a result, it is important to make (coarse) boundary vertices persistent. Otherwise, their values would have to be reconstructed by analysing all the fine grids' values. Such a reconstruction needs data from finer levels—the values are synthesised [46]—and hence introduces an additional depth-first traversal.
Second, consider the hierarchical basis. In accordance with h and ĥ, unrefined boundary vertices have the value of the nearest surface point on the continuous domain's boundary. As there is no other vertex at the same position belonging to a finer level, their values determine the discrete domain's boundary (4.10). The values of refined boundary vertices result from the mapping rule (4.11) and from the fact that the nodal representation u_h ∈ C: As the hat functions at the boundary are tailored to the discrete computational domain (the hat's support vanishes on outer elements), coarser boundary weights have to vanish (Figure 4.7). Otherwise, the d-linear approximation would become discontinuous wherever the fine grid extends the coarse grid's computational domain (Figure 4.8).

Figure 4.8: Coarse grid and fine grid are summed up. If hats at the boundary exist (dotted hats), the sum is continuous if and only if the coarse and the fine grid boundary coincide.
For the hierarchical basis, boundary vertices thus have
\[
\forall v = (x,\ell) \in V_T \setminus H_T: \quad
\begin{aligned}
P_{\mathrm{boundary}} \wedge \neg P_{\mathrm{refined}} &: \; u_v = u(\wp^{-1}(x)), \text{ and} \\
P_{\mathrm{boundary}} \wedge P_{\mathrm{refined}} &: \; u_v = 0.
\end{aligned}
\]
The coarsest grid holds a nodal representation uv of the solution supplied with homogeneous boundary values. All finer grids’ weights denote the hierarchical surplus ûv .
Weights of fine grid shape functions belonging to a coarse grid vertex’s position are
zero. With all these shape functions removed—they are scaled with zero anyway—
the scheme gives a hierarchical basis representation. Again, the value of a Dirichlet
boundary vertex can change throughout the computation due to a refinement. Yet,
it does not depend on the actual solution on the fine grid.
4.2 Stencils and Operators
The finite element method approximates the solution of a PDE with a finite-dimensional function basis due to the equation system (4.4).

Figure 4.9: The grey numbers give the grid's enumeration. Let for example tmp_{23} = 4 · u_{23} − u_{22} − u_{24} − u_{13} − u_{33}. The stencil (left) illustrates this operator. An element-wise operator evaluation splits up the operator among the cells and sums up the result (right).

To set up the involved stiffness matrix explicitly according to (4.3) is far from trivial, if the k-spacetree's
fine grid is not regular. Nevertheless, the definition of the shape functions delivers the complete set of tools required for this task, and the construction process is
describable by a set of small matrices. This section defines them. The subsequent
text then derives a scheme that solves (4.4) without setting up any global matrix.
It is based exclusively on a splitting of the underlying small construction matrices.
The thesis' introduction already highlights the advantages of such a matrix-free approach. As a result, the global matrices are used for the description of the equation system solver, but they never occur in the implementation.
Each row of the stiffness matrix corresponds to one test function φ_j in (4.3). The test function is an element of H_h^1(Ω_h), i.e. each test function corresponds to one non-hanging vertex of the fine grid, and its support covers the 2^d adjacent elements. The sum of integrals thus degenerates to a set of 2^d integrals. For all the other integrals in (4.3), the test function vanishes.
Select v ∈ VT \ HT with v surrounded by non-hanging vertices. The integral that
yields the stiffness matrix’s line corresponding to test function ϕv then becomes
\[
\sum_{i=1}^{\dim H_h^1(\Omega_h)} u_i \int_\Omega (\nabla\varphi_i, \nabla\varphi_v)\,dx
\;\mapsto\;
\underbrace{\sum_{v_i \in \mathrm{vertex}(\mathrm{element}(v))}}_{|\mathrm{vertex}(\mathrm{element}(v))| = 3^d} u_{v_i} \int_\Omega (\nabla\varphi_{v_i}, \nabla\varphi_v)\,dx
= \sum_{v_i \in \mathrm{vertex}(\mathrm{element}(v))} u_{v_i} \sum_{e \in \mathrm{element}(v)} \int_e (\nabla\varphi_{v_i}, \nabla\varphi_v)\,dx,
\]
involving the weights of 3^d vertices in total. The shape functions ϕ are d-linear on each geometric element. Hence, there is an analytical expression for the integral, and, given a global enumeration of the vertices, the sum defines the matrix's row corresponding to test function v.

Figure 4.10: The value of a hanging node is interpolated from the coarser grid linearly.

For d = 2, this matrix row looks like
\[
\left( \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; \tfrac{8}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \; -\tfrac{1}{3} \; \cdots \right),
\]
with the dots comprising lots of zero entries. The entries −1/3 result from adjacent vertices; 8/3 scales the weight of the test function's vertex.
While a global vertex enumeration underlies the stiffness matrix, the stencil representation denotes the interplay of the vertices’ weights, i.e. the entries of one row of
the stiffness matrix, without a vertex enumeration. It exploits the spatial alignment
of the unknowns (Figure 4.9). For d = 2,
 −1 −1 −1 


−1
−1
−1
3
3
3
1
 −1 8 −1 
−1
8 −1 
=
3
3
3
3
−1
−1
−1
−1 −1 −1
3
3
3
8
3
defines the stencil for the Laplacian with a d-linear ansatz space on Cartesian grids.
Its entries are in O(h^{d−2}), where h is the mesh width of the 2^d adjacent elements. Deriving the stencils for further operators is straightforward.
If a vertex neighbouring the test vertex is a hanging vertex, the Laplacian stencil
is inapplicable directly, as a hanging vertex does not have a shape function. Instead,
shape functions belonging to the coarser grid determine the function’s value at the
hanging nodes with (4.8) as well as (4.11) yielding the same values. Both formulas
define the value at a hanging vertex as a sum of coarser shape functions, i.e. they
give an interpolation (Figure 4.10). I use the term prolongation as a synonym. The interpolation can also be written down as an interpolation stencil.
To evaluate the Laplacian for a vertex that neighbours hanging vertices, the interpolation scheme first of all defines the function’s values at the hanging vertices.
Then, it applies the stencil. Such an evaluation equals a cascade of stencil applications, as the coarse vertices themselves might be hanging vertices again.
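As an illustration, the d-linear interpolation that determines a hanging vertex's value can be sketched as follows for d = 2; the weights are the coarse cell's bilinear shape functions evaluated at the hanging vertex's position (for Peano's k = 3, these positions sit at the thirds of the coarse cell). The function name is purely illustrative:

```python
# Hypothetical sketch of the d-linear interpolation that defines a hanging
# vertex's value, here for d = 2: the weights are the coarse cell's bilinear
# shape functions evaluated at the hanging vertex's position (xi, eta) in
# coarse cell coordinates; for k = 3, hanging vertices sit at the thirds.

def interpolate_hanging(coarse, xi, eta):
    """coarse holds the values at the coarse cell's vertices, ordered
    (0,0), (1,0), (0,1), (1,1)."""
    weights = [(1 - xi) * (1 - eta), xi * (1 - eta), (1 - xi) * eta, xi * eta]
    return sum(w * c for w, c in zip(weights, coarse))
```

A function that is d-linear over the coarse cell is reproduced exactly, so the interpolation is consistent with the coarse shape functions determining the value at the hanging node.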
Stencils describe the interplay of different vertices. To apply a stencil within a
Peano traversal, the algorithm has to have access to all vertices affected by a stencil
and to all vertices determining a stencil's image. For the coarsening in (4.7), e.g., these data are available throughout a bottom-up step within the k-spacetree. This is not always the case: The stencil of the Laplacian, e.g., needs all surrounding vertices, but the traversal has access to at most two different geometric elements and their vertices at one time.

Figure 4.11: Element-wise prolongation for d = 2, k = 3 (left) and k = 2 (right).
In such a case, the algorithm splits up the stencils additively into their element-wise contributions (Figure 4.9 for the Laplacian in d = 2). In the finite element world, such a splitting reflects the decomposition of the integral or the hats' supports, respectively, into integrals over individual elements. The result then is accumulated within the vertex, i.e. the vertex's result variable is set to zero by the event touchVertexFirstTime. Throughout the enterElement or leaveElement events, the result variable is incremented corresponding to the split-up stencil. If the event touchVertexLastTime is triggered, the result variable holds the stencil's image. This algorithmic principle shapes Section 4.4.
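The following sketch illustrates this element-wise splitting for the Laplacian in d = 2 on a small regular grid. It is a hypothetical stand-alone illustration, not the thesis' Peano implementation; the traversal events are mimicked by plain loops:

```python
# Element-wise evaluation of the 2D Laplacian stencil (hypothetical sketch,
# not the thesis' Peano traversal): each bilinear element contributes its
# 4x4 element stiffness matrix; summed over the four adjacent cells, the
# contributions reproduce the 9-point stencil 1/3 * [[-1,-1,-1],[-1,8,-1],[-1,-1,-1]].

# Element stiffness matrix of the Laplacian for a bilinear (Q1) square element,
# vertex order: (0,0), (1,0), (0,1), (1,1).
K = [[ 4/6, -1/6, -1/6, -2/6],
     [-1/6,  4/6, -2/6, -1/6],
     [-1/6, -2/6,  4/6, -1/6],
     [-2/6, -1/6, -1/6,  4/6]]

def residual_elementwise(u, f):
    """Accumulate r = f - A u by visiting each cell exactly once.
    touchVertexFirstTime ~ zero initialisation, enterElement ~ accumulation,
    touchVertexLastTime ~ adding the right-hand side."""
    n = len(u)
    r = [[0.0] * n for _ in range(n)]          # touchVertexFirstTime
    for i in range(n - 1):                     # loop over cells
        for j in range(n - 1):
            verts = [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]
            for a, (ia, ja) in enumerate(verts):   # enterElement: split stencil
                for b, (ib, jb) in enumerate(verts):
                    r[ia][ja] -= K[a][b] * u[ib][jb]
    for i in range(n):                         # touchVertexLastTime
        for j in range(n):
            r[i][j] += f[i][j]
    return r

def residual_stencil(u, f):
    """Reference: apply the assembled 9-point stencil to interior vertices."""
    n = len(u)
    S = [[-1/3, -1/3, -1/3], [-1/3, 8/3, -1/3], [-1/3, -1/3, -1/3]]
    r = [row[:] for row in f]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            r[i][j] = f[i][j] - sum(S[a][b] * u[i - 1 + a][j - 1 + b]
                                    for a in range(3) for b in range(3))
    return r
```

On interior vertices, the accumulated element contributions coincide with a direct application of the assembled 9-point stencil.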
4.3 Multigrid Ingredients
The algorithm solves the equation system
Auh = b
iteratively without a setup of the stiffness matrix A: At the program startup, all the
weights of the inner vertices are zero as initial guess. The algorithm then traverses
the grid several times and applies the solver. For this, matrix-vector products are
evaluated on an element basis and their results are immediately written back. The
matrix-vector product itself is not stored. After several traversals equaling solver
iterations, the nodal representation on the generating system is the solution to the
linear equation system.
The following pages run through the ingredients of the multigrid solver and identify the operators/matrices involved. They thus provide the third aspect, the missing link, to realise the solver: With a function representation—in fact two different
variants—and the stencils at hand, the algorithm has to know what to do with them,
i.e. what variables act as preimage for the stencils and what happens with the result.
In this context, the section also reveals which function representation scheme fits
best for an algorithmic step.
4.3.1 Jacobi Solver
At the core of the multigrid solver is a Jacobi iteration applying the transition
\[
u_h \mapsto u_h + \omega \,\mathrm{diag}^{-1}(A)\,(b - A u_h) =: u_h + \omega \,\mathrm{diag}^{-1}(A)\, r, \tag{4.13}
\]
where ω ∈ ]0, 1[ is a relaxation factor parameterising the updates. The auxiliary
variable r is the residual, and diag(A) extracts the diagonal from A. Section 4.4.1
explains why Jacobi is a natural choice within the Peano world.
For a fixed grid level ℓ, an algorithm can apply the update formula (4.13), yielding a Jacobi solver for all non-hanging vertices belonging to level ℓ, as long as all the vertices hold the nodal value of the function (not the hierarchical surplus û_{h,ℓ}). Let thus all vertices on level ℓ hold the nodal value. A matrix-free Jacobi for a fixed level ℓ traverses all vertices v ∈ V_T \ H_T with level(v) = ℓ. They hold u_{h,ℓ}. For each vertex, it computes the value r_v. It depends solely on the neighbours of the non-hanging vertex. These neighbours belong to ℓ, too, and their value is interpolated from the coarser grids if they are hanging. As soon as r is available for all vertices, the algorithm updates all vertices according to (4.13). Hereby, the diagonal element is taken from the stiffness matrix's stencil. The weight of each vertex then represents the new value of u_{h,ℓ}.
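A minimal matrix-free ω-Jacobi sweep for a regular grid in d = 2 may be sketched as follows; this hypothetical illustration applies the assembled 9-point stencil directly instead of splitting it element-wise:

```python
# Hypothetical sketch of one matrix-free omega-Jacobi sweep (4.13) for the
# d = 2 Laplacian on a regular grid. It applies the assembled 9-point stencil
# directly; boundary vertices keep their (Dirichlet) values.

def jacobi_sweep(u, b, omega=0.8):
    S = [[-1/3, -1/3, -1/3],
         [-1/3,  8/3, -1/3],
         [-1/3, -1/3, -1/3]]
    diag = S[1][1]                       # diagonal entry of the stencil
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            r = b[i][j] - sum(S[a][c] * u[i - 1 + a][j - 1 + c]
                              for a in range(3) for c in range(3))
            new[i][j] = u[i][j] + omega * r / diag
    return new
```

The update only ever reads the previous iterate's values of the eight neighbours, which is what makes the scheme compatible with the element-wise traversal.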
If the k-spacetree yields a regular grid and ℓ equals the height of the spacetree,
(4.13) gives a Jacobi solver for the PDE. If the k-spacetree represents an adaptive
grid, it does not solve the problem, as regions with a coarser fine grid are not updated
at all. The interpolated values resulting from these coarser regions are not updated either, and the solution on ℓ is not a solution to the PDE but a solution within subregions for artificial boundary conditions: those resulting from the actual boundary conditions wherever the finest grid covers the computational domain's boundary, and those given as artificial Dirichlet values by the hanging vertices.
4.3.2 Standard Multigrid: Smoother and Correction Scheme
Multigrid methods rely on the fact that Jacobi- and Gauß-Seidel-type solvers eliminate errors with high frequency faster than errors with low frequency. They perform
a small number γ1 of solver iterations—they are now called smoothing iterations, as
they smooth the error—on the finest grid with level ℓ. Since this section refers to regular grids only, the resulting approximation's quality is now dominated exclusively by low-frequency errors. Then, they derive a correction equation
\[
A_{\ell-1} e_{h,\ell-1} = R r_{h,\ell} = R (b - A_\ell u_{h,\ell}) \tag{4.14}
\]
on a level ℓ to approximate the error eh,ℓ−1 on the next coarser grid. R(b − Aℓ uh,ℓ ) is
the coarse grid right-hand side, and Aℓ equals the stiffness matrix. The coarser grid is
determined by the k-spacetree’s coarser level, and, thus, this approach is a geometric
multigrid. The correction equation has to eliminate the remaining error starting with
the initial guess eh,ℓ−1 = 0. As the correction problem is defined on a coarser grid,
it is smaller and, thus, easier to solve than the original problem. Furthermore, the
low frequency errors in the original systems have a higher frequency relative to this
grid due to the bigger mesh width. Multigrid methods apply the coarsening idea
recursively, i.e. they perform again only a small number of smoothing operations on
the correction equation and then apply the coarsening transformation again. Finally,
the error correction from (4.14) is transported back to the fine grid, added to the
fine grid approximation
\[
u_{h,\ell} \mapsto u_{h,\ell} + P e_{h,\ell-1} \tag{4.15}
\]
using an operator P , and the algorithm performs another small number of γ2 smoothing iterations. The overall procedure is a V-cycle (Algorithm 4.1). The letter V
results from the grid transitions: The algorithm starts with the finest grid, ascends
to the coarsest grid and returns to the fine grid. V (γ1 , γ2 ) gives the exact number of
smoothing iterations per level (Figure 4.12). There are a couple of alternative cycles based upon the V-cycle, but these modifications are not discussed here, as they do not alter the underlying idea and algorithmics.
For the time being, the operators P, Aℓ−1 and R are to be defined. P transports
the correction to a finer level, i.e. it is an interpolation from a coarse grid function
to the fine grid. Since the coarse grid in a k-spacetree holds shape functions, it is
an obvious idea to use the hat’s shape for this interpolation, i.e. the hierarchical
geometric structure determines the prolongation (Section 4.2). Additional sub- and
superscripts determine the preimage or image, respectively, of the operator.
R is a restriction transporting the fine grid residual to the coarser grid. Two
natural restriction operators arise for k-spacetrees: A coarsening operator picks out
all the fine grid vertices whose position coincides with a coarse vertex. Then, it
copies their values to the coarse grid vertices. Such a trivial coarsening operator is
implicitly introduced in (4.7). In the following, let C_ℓ^{ℓ−1} be a coarsening applied on level ℓ and overwriting vertices on level ℓ − 1.
The alternative restriction results from a weighted summation of the fine grid vertices: Hereby, a coarse vertex's weight is determined by all fine grid values covered by the coarse hat's support. The weight of the individual contributions equals the interpolation stencil P, i.e. the restriction is the transposed prolongation. Prolongation P and restriction R in combination implement a full weighting. Again, R_ℓ^{ℓ−1} denotes the information transport from level ℓ to ℓ − 1, and, thus, R_ℓ^{ℓ−1} = (P_{ℓ−1}^{ℓ})^T.

Algorithm 4.1 Blueprint of a correction scheme for regular grids. It is started on the finest mesh level, i.e. the initial ℓ_active equals the spacetree's height.
1: procedure cs(ℓ_active)
2: Compute an approximation of the solution on the fine grid level ℓ_active: A_ℓ u_{h,ℓ} = b_ℓ. A fixed number of Jacobi iterations for this problem is sufficient (presmoothing).
3: Set up the right-hand side of the coarse grid correction (4.14): b_{ℓ−1} = R (b_ℓ − A_ℓ u_{h,ℓ}) = R r.
4: Erase the variable e_{h,ℓ−1} = u_{h,ℓ−1} ← 0 approximating the error/correction on the coarser grid.
5: Recursive call: cs(ℓ_active − 1)
6: Prolongate the coarse grid correction back to the solution: u_{h,ℓ} ↦ u_{h,ℓ} + P e_{h,ℓ−1}
7: Improve the solution on the fine grid level ℓ_active (postsmoothing): A_ℓ u_{h,ℓ} = b_ℓ.
8: end procedure
Whenever unambiguous, I omit the source and destination index for the operators.
Finally, the matrix A is to be determined for the individual levels. I follow the
Galerkin multigrid idea [73, 10] with
\[
A_{\ell-1} = R_\ell^{\ell-1} A_\ell P_{\ell-1}^{\ell}. \tag{4.16}
\]
Another approach is to derive A_ℓ per level from the weak form and the shape functions. As H_h^1(Ω_T) contains shape functions on each level, this is also plausible. I
The Galerkin multigrid idea has at least three valuable properties. First, the
correction scheme matrices are well-defined via the restriction and prolongation,
i.e. there is no need for a space construction on coarse levels. Second, the prolongation and restriction operators are invariant with respect to scaling. With
R 7→ c · R, c ∈ R \ {0}, both the right-hand side of the correction equation (4.14)
and the correction operator are scaled by c. For P 7→ c · P, c ∈ R \ {0}, the Jacobi
update is scaled by c, i.e. the error summation step (4.15) becomes
\[
\begin{aligned}
u_{h,\ell} &\mapsto u_{h,\ell} + c \cdot P e_{h,\ell-1} = u_{h,\ell} + c \cdot P A_{\ell-1}^{-1} R\, r_{h,\ell} \\
&= u_{h,\ell} + c \cdot P\, c^{-1} \cdot P^{-1} A_\ell^{-1} R^{-1} R\, r_{h,\ell}.
\end{aligned}
\]
Third, most theorems on convergence analysis rely on (4.16).
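The correction scheme of this section can be condensed into a two-grid sketch for the 1D Laplacian; a hypothetical illustration with k = 2 coarsening and an exact coarse solve in place of the recursion (the thesis' setting uses k = 3 and d-linear elements):

```python
# Two-grid correction scheme following the blueprint of Algorithm 4.1,
# sketched for the 1D Laplacian (-u'' = f, zero Dirichlet values) with k = 2
# coarsening and an exact coarse solve instead of the recursion; a
# hypothetical illustration, not the thesis' k = 3 Peano traversal.

def apply_A(u, h):
    """Matrix-free A u for the 1D FEM Laplacian, zero Dirichlet boundary."""
    n = len(u)
    return [(2 * u[i] - (u[i-1] if i > 0 else 0.0)
                      - (u[i+1] if i < n - 1 else 0.0)) / h for i in range(n)]

def jacobi(u, b, h, omega=2/3, sweeps=2):
    for _ in range(sweeps):
        Au = apply_A(u, h)
        u = [u[i] + omega * (h / 2) * (b[i] - Au[i]) for i in range(len(u))]
    return u

def restrict_fw(r):
    """Full weighting R = P^T: coarse vertex j sits at fine vertex 2j+1."""
    nc = (len(r) - 1) // 2
    return [0.5 * r[2*j] + r[2*j+1] + 0.5 * r[2*j+2] for j in range(nc)]

def prolong(e):
    """d-linear interpolation P of the coarse correction to the fine grid."""
    nc = len(e)
    ef = [0.0] * (2 * nc + 1)
    for j in range(nc):
        ef[2*j+1] = e[j]
    for i in range(0, 2 * nc + 1, 2):
        left = e[i//2 - 1] if i//2 - 1 >= 0 else 0.0
        right = e[i//2] if i//2 < nc else 0.0
        ef[i] = 0.5 * (left + right)
    return ef

def solve_tridiag(b, h):
    """Exact solve of the coarse system (1/h)*tridiag(-1,2,-1) e = b."""
    n = len(b)
    c, d = [0.0] * n, [x * h for x in b]       # scale to tridiag(-1,2,-1)
    c[0], d[0] = -0.5, d[0] / 2.0
    for i in range(1, n):
        beta = 2.0 + c[i-1]
        c[i] = -1.0 / beta
        d[i] = (d[i] + d[i-1]) / beta
    for i in range(n - 2, -1, -1):
        d[i] -= c[i] * d[i+1]
    return d

def two_grid_cycle(u, b, h):
    u = jacobi(u, b, h)                              # presmoothing
    r = [bi - ai for bi, ai in zip(b, apply_A(u, h))]
    e = solve_tridiag(restrict_fw(r), 2 * h)         # coarse correction (4.14)
    u = [ui + ei for ui, ei in zip(u, prolong(e))]   # update (4.15)
    return jacobi(u, b, h)                           # postsmoothing
```

A few cycles reduce the algebraic error down to round-off for this model problem, independently of the mesh width; exactly this level-independent behaviour is what the recursion in Algorithm 4.1 generalises.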
4.3.3 Full Approximation Storage
Correction schemes pave the way to asymptotically optimal solvers for elliptic problems. As soon as the problem becomes non-linear, out-of-the-box correction schemes
run into problems. This insight historically motivated full approximation storage
schemes—an augmented correction scheme additionally holding a multiscale representation of the solution approximation on each level. The following section elaborates this extension as it falls into place for k-spacetrees. In the end, it also proves
of great value for adaptive grids. In the context of Peano, the latter fact is much
more promising and motivating than the support for non-linear equations.
Full approximation storage schemes hold a representation of the solution on each
level. Before the algorithm determines the correction equation for the coarse grid,
it thus creates a representation of the current solution on the coarser levels. The
injection C is a straightforward realisation of such a coarsening. More sophisticated implementations are discussed for example in [30]. With a full approximation
storage, the geometric multigrid algorithm follows the blueprint in Algorithm 4.2.
Algorithm 4.2 Blueprint of the full approximation storage scheme for regular grids. It is started on the finest mesh level, i.e. the initial ℓ_active equals the spacetree's height.
1: procedure fas(ℓ_active)
2: Compute an approximation of the solution on the fine grid level ℓ: A_ℓ u_{h,ℓ} = b = b_ℓ. A fixed number of Jacobi iterations for this problem is sufficient.
3: Transport the fine grid approximation to the next coarser level:
u_{h,ℓ−1} = C_ℓ^{ℓ−1} u_{h,ℓ}. (4.17)
4: The coarse grid correction (4.14) then becomes
A_{ℓ−1} (C u_{h,ℓ} + e_{h,ℓ−1}) = R r_{h,ℓ} + A_{ℓ−1} C u_{h,ℓ}, or
A_{ℓ−1} u_{h,ℓ−1} = R r_{h,ℓ} + A_{ℓ−1} C u_{h,ℓ} =: b_{ℓ−1}.
The first equation adds (4.17) on both sides of the coarse grid equation system.
5: Apply the solution idea recursively with C u_{h,ℓ} as initial guess for the solution of the coarse grid system: fas(ℓ_active − 1)
6: Prolongate the difference between the smoothed coarse grid solution and the original value u_{h,ℓ−1} to the fine grid, sum up the fine representation and this value, and continue.
7: end procedure
With the coarse grid shape functions inducing P and an unscaled full weighting R = P^T, the coarse grid operator obeys both the Galerkin multigrid idea and a direct discretisation. As a result, the traversal holds one hard-coded stiffness operator and applies it directly to the vertices after it is scaled with h^{d−2}. The d results from the integral over Ω, the reduction by two results from the two derivatives in the integrand. There is no need to hold a coarsened A_ℓ on any level.
4.3.4 Hierarchical Transformation Multigrid Method
A multigrid iteration runs through the individual grid levels. Bottom-up first and
then reverse. Throughout the steps up, the presmoothing is performed and the
correction equations are set up. Throughout the steps down, the postsmoothing
is performed and the difference between the finer and coarser levels is transported
back to the fine grid levels. Transporting the solution difference back implies that the algorithm has to keep track of the original coarse grid values. In [30], an elegant realisation of this bookkeeping is introduced. It uses the hierarchical representation scheme.
Let ûh,ℓ define the hierarchical surplus with
\[
u_{h,\ell} = P_{\ell-1}^{\ell} C_\ell^{\ell-1} u_{h,\ell} + \hat u_{h,\ell}.
\]
The hierarchical residual then is
\[
\hat r_{h,\ell} = b_\ell - A \hat u_{h,\ell}. \tag{4.18}
\]
With this definition, the right-hand side of the full approximation storage scheme
becomes
\[
\begin{aligned}
\hat r_{h,\ell} &= b_\ell + A P C u_{h,\ell} - A u_{h,\ell} \\
&= r_{h,\ell} + A P C u_{h,\ell} \\
\Rightarrow \quad R \hat r_{h,\ell} &= R r_{h,\ell} + R A P C u_{h,\ell} \\
&= R r_{h,\ell} + A C u_{h,\ell}
\end{aligned} \tag{4.19}
\]
due to the Galerkin multigrid definition. I here omit the operators’ source and
destination identifiers. The full approximation scheme corresponds to a nodal representation of the approximation within the generating system. The hierarchical
transformation multigrid (HTMG) method switches from the nodal representation
to a hierarchical basis to compute a hierarchical residual (4.18) instead of the standard residual. This hierarchical residual also simplifies, besides the bookkeeping,
the computation of the restriction’s preimage within a V (γ1 , γ2 )-cycle (4.19). See
Algorithm 4.3. The switch from a nodal system to the hierarchical surplus also has to be performed for the fine grid vertices: the mechanism strictly follows (4.10).
Algorithm 4.3 Blueprint of the hierarchical transformation multigrid scheme for regular grids. It is started on the finest mesh level, i.e. the initial ℓ_active equals the spacetree's height.
1: procedure htmg(ℓ_active)
2: Perform γ1 Jacobi smoothing steps for A u_{h,ℓ} = b_ℓ (presmoothing).
3: Coarsen the approximation with u_{h,ℓ−1} = C u_{h,ℓ}.
4: Determine the hierarchical surplus û_{h,ℓ} = u_{h,ℓ} − P u_{h,ℓ−1} on level ℓ. The nodal value on this level is not used anymore, so the algorithm stores û_{h,ℓ} within the variable u_{h,ℓ}. The algorithm hence switches the representation of grid level ℓ from a nodal to a hierarchical storage scheme.
5: Compute the hierarchical residual r̂_{h,ℓ} = b_ℓ − A û_{h,ℓ}. The hierarchical surplus on hanging vertices equals zero. As u_{h,ℓ} holds the hierarchical surplus, the residual equation equals the Jacobi update's residual computation. The residual however is not used to update the solution.
6: Transport the hierarchical residual to ℓ − 1 as new right-hand side on the coarse grid, i.e. b_{ℓ−1} = R r̂_{h,ℓ} for the refined vertices on level ℓ − 1. The unrefined vertices' right-hand side on level ℓ − 1 is not altered.
7: Apply the scheme with the new right-hand side recursively on level ℓ − 1: htmg(ℓ_active − 1)
8: Interpolate the new coarse grid value to the fine grid according to u_{h,ℓ} = û_{h,ℓ} + P u_{h,ℓ−1}. This transformation switches the representation of grid level ℓ from a hierarchical system back to a nodal system. Because of the updated coarse grid solution, the interpolated value of hanging vertices on level ℓ differs from the initial interpolated value.
9: Perform γ2 Jacobi iterations for the system A u_{h,ℓ} = b_ℓ (postsmoothing).
10: end procedure
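Identity (4.19), on which the restriction step of the hierarchical transformation relies, can be verified numerically; a hypothetical 1D sketch with k = 2, injection C, and the Galerkin coarse operator:

```python
# Numerical check (hypothetical 1D sketch, k = 2) of identity (4.19): the
# restricted hierarchical residual equals the restricted residual plus the
# Galerkin coarse operator applied to the injected solution.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

nf, nc = 7, 3                       # interior vertices; coarse node j at fine 2j+1
Af = [[(2 if i == j else -1 if abs(i - j) == 1 else 0) for j in range(nf)]
      for i in range(nf)]           # 1D Laplacian, mesh width scaled out
P = [[0.0] * nc for _ in range(nf)]  # d-linear prolongation
for j in range(nc):
    P[2*j+1][j] = 1.0
    P[2*j][j] = 0.5
    P[2*j+2][j] = 0.5
R = [[P[i][j] for i in range(nf)] for j in range(nc)]        # R = P^T
AH = [[sum(R[i][a] * Af[a][b] * P[b][j] for a in range(nf) for b in range(nf))
       for j in range(nc)] for i in range(nc)]               # Galerkin (4.16)

u = [0.3, -1.2, 0.7, 2.0, -0.4, 1.1, 0.5]                    # arbitrary iterate
b = [1.0] * nf
Cu = [u[2*j+1] for j in range(nc)]                           # injection C
uhat = [ui - pi for ui, pi in zip(u, matvec(P, Cu))]         # hierarchical surplus
rhat = [bi - ai for bi, ai in zip(b, matvec(Af, uhat))]      # (4.18)
r = [bi - ai for bi, ai in zip(b, matvec(Af, u))]

lhs = matvec(R, rhat)                                        # R r_hat
rhs = [x + y for x, y in zip(matvec(R, r), matvec(AH, Cu))]  # R r + A_H C u
```

The two sides agree to round-off for any iterate u, which is why the HTMG scheme may restrict the hierarchical residual directly instead of bookkeeping the original coarse values.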
Figure 4.12: V -cycle (left) and F -cycle (right). γ gives the pre- or postsmoothing
steps, respectively. The arrow denotes the interpolation with higher
order.
4.3.5 Full Multigrid
The accuracy of a numerical approximation is co-determined and, hence, bounded by the discretisation error. In turn, the approximation's error cannot fall below the error induced by the discretisation. Applying the thesis' conforming finite element method to the Poisson equation, this error is in O(h²) for a sufficiently smooth analytical solution. With the V-cycle, an algorithm needs a fixed number of iterations—independent of the number of unknowns—to reduce the solver error by a certain factor. It does not make sense to solve the equation system more accurately than up to the order of the discretisation error. Ergo, a grid refinement entails a fixed number of additional smoothing steps to get an error in the order of the discretisation error again.
A classical full multigrid algorithm starts on a rather coarse grid and computes a solution to the PDE with the well-known multigrid cycle—typically a V-cycle. Then, it projects this solution to the next finer level as initial guess for the new u_h and continues recursively. The resulting sequence of V-cycles on finer and finer grids is an F-cycle or FMG-cycle (Figure 4.12). An F-cycle solves a well-behaved problem in O(h^{−d}) operations, i.e. the cost depends linearly on the number of unknowns [73]. It is an optimal solver.
To make this hold, the projection of the current solution to an initial guess on the finer grid has to exhibit higher order. Higher order interpolation within the k-spacetree and the element-wise traversal world needs additional attention: The traversal automaton knows the solution's behaviour on a cell, as it has access to the 2^d vertices. For an interpolation of higher order, additional grid vertices would have to be evaluated. This is usually not possible. Yet, due to an extension of the values stored within a vertex, a k-spacetree traversal is able to interpolate values with higher order. If the algorithm stores—besides the value u_h—the central differences within a vertex, the algorithm can reconstruct the 4^d − 2^d additional values adjacent to neighbouring cells. This makes it possible to inscribe higher-order polynomials into the nodal fine grid approximation. In the implementation coming along with this thesis, the higher-order interpolation is not yet integrated.
4.4 Traversal Events
The multigrid realisation plugs into the traversal events, and, consequently, the data
available throughout the traversal has to be convenient for the multigrid solver. This
restriction is accompanied and compensated by the possibility to merge different
solver phases. In the following paragraphs, I analyse the data dependencies of the
different solver steps (Figure 4.13). A mapping from traversal events to solver steps
then enables the realisation of the FAS multigrid scheme. Let active level at this
be the current smoother level. For two-level operations (restriction, coarsening and
prolongation), the active level denotes the finer level. Furthermore, each variable
in the linear equation system, i.e. each vertex v ∈ VT \ HT in the grid, holds
five different properties/attributes: The weight of the shape function (the height
of the hat), a residual variable to accumulate the residual, the right-hand side, the
hierarchical surplus, and the hierarchical residual.
4.4.1 ω-Jacobi Smoother
The Jacobi solver (4.13) splits up into three implementation steps. All of them work
on the nodal representation scheme.
First, the value of hanging vertices is interpolated. Peano’s traversal preserves
the child-father relationship. Within a refined element, the algorithm thus analyses
the finer vertices throughout the steps down. If a vertex v ∈ HT , and if its level
is smaller than the active level, the algorithm interpolates the current solution.
Hanging vertices beyond the active level remain unchanged.
Second, the algorithm evaluates the stencil. As the stencil covers 2^d geometric elements, this second step has to be split up additively. If a vertex of the active level is read for the first time, the multigrid solver sets its residual to zero. The residual is an additional attribute of the vertex. In each geometric element belonging to the active level, the solver evaluates the 2^d different stencils affecting the element's
vertices and adds the result to their residuals. Before a vertex is written to the
output stream, the residual variable holds (b − Aℓ uh,ℓ )v as soon as the algorithm
adds the right-hand side.
Third, the vertex is updated. According to the argument above, a vertex's residual
is available as soon as the vertex is to be written to the output stream: Here, the
algorithm takes the residual and updates the current solution according to (4.13).
Hanging vertices and vertices belonging to the boundary are not modified. Afterwards, the algorithm passes the vertex to the output stream.
Throughout the steps up, the realisation coarsens the new solution of the active level. There is no numerical need for this coarsening, but, as a result, a coarse
Figure 4.13: Behaviour of the hierarchical transformation multigrid algorithm: Throughout the smoothing, the solution is always coarsened to all smaller levels (1). Throughout the ascent, the algorithm computes the hierarchical transform on the finer grid (2). The application of the stencil to the resulting hierarchical surplus yields the hierarchical residual on the fine level, and this residual is, in the same step, restricted to the coarse grid's right-hand side (3). The algorithm continues recursively. Throughout the descent, the algorithm adds the coarse grid's representation to the linear surplus stored on the fine grid (4).
grid representation of the solution is available on all levels besides the levels smaller than the active level. This is for example of value for on-the-fly visualisations of the solver's progress on a coarser level with reduced details. Furthermore, a coarsened representation is already available if the multigrid decides to ascend in the subsequent iteration.
Most multigrid realisations rely on Gauß-Seidel-type smoothers instead of the Jacobi scheme, since they exhibit better convergence and smoothing rates. Due to the information transport speed restriction, the solver implementation here cannot implement a Gauß-Seidel.
Example 4.2. Let vertex a and vertex b in d = 2 be neighbours on a regular Cartesian grid, and let the traversal implement a Gauß-Seidel smoother. A face connects
a and b, i.e. there are two geometric elements e1 and e2 holding both a and b. Both
contribute to the residual calculation for both vertices. A traversal handles element
e1 first, and the residual of a and b is incremented. It cannot update one of the two vertices, as the contributions of element e2 are still missing. The traversal continues. In geometric element e2, it evaluates a part of the stencil affecting a. Now, let the residual of a be complete, i.e. the traversal updates the value of vertex a. Then, the traversal computes the new a's contribution to b's residual. However, the traversal computes only a's contribution due to e2. The contribution due to e1 cannot be updated, as the traversal processes each element only once. Even worse, it already added a contribution to b with an invalid, old value of a, i.e. the traversal will end up with an overall invalid residual on b.
The example shows that Jacobi is the natural choice for a smoother within the
Peano world, as Gauß-Seidel-type solvers rely on a faster information transport
speed than the element-wise traversal permits. Nevertheless, more sophisticated
solver realisations exist. The Poisson equation on a staggered grid assigning the
hats to the geometric elements instead of the vertices fits naturally to a Gauß-Seidel
[51, 76]. In [23], some recursion unrolling techniques enable the user to implement
a set of more sophisticated solvers such as hybrid Jacobi-Gauß-Seidel solvers, block
Gauß-Seidel schemes on subdomains, red-black variants of Gauß-Seidel, and so forth.
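The accumulate-then-update pattern that forces a Jacobi smoother can be sketched for a 1-D Poisson problem (an illustrative Python fragment; the function name and the 1-D setting are mine, not the framework's API):

```python
import numpy as np

def jacobi_sweep(u, f, h, omega=1.0):
    """One element-wise sweep for -u'' = f on a regular 1-D grid.
    Sketch of the accumulate-then-update pattern; Dirichlet values
    u[0] and u[-1] stay fixed."""
    n = len(u)
    r = f * h                     # lumped right-hand side contribution
    # 'enterElement': element [i, i+1] adds its stiffness contribution
    # to the residuals of BOTH adjacent vertices, using old values only.
    for i in range(n - 1):
        r[i]     -= (u[i] - u[i + 1]) / h
        r[i + 1] -= (u[i + 1] - u[i]) / h
    # 'touchVertexLastTime': only after all adjacent elements have
    # contributed is a vertex updated -- necessarily with old neighbour
    # values, i.e. a Jacobi rather than a Gauss-Seidel update.
    diag = 2.0 / h
    u_new = u.copy()
    u_new[1:-1] = u[1:-1] + omega * r[1:-1] / diag
    return u_new
```

Because every vertex update uses only values from the previous sweep, the scheme is a (damped) Jacobi iteration by construction, regardless of the element order.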
4.4.2 Restriction and Hierarchical Transformation
The restriction of the right-hand side splits up into three steps: First, the hierarchical transform is computed. Afterwards, the grid holds the hierarchical surplus instead of the nodal value on the fine grid. Second, the algorithm determines the hierarchical residual. Third, this residual is restricted to the coarser grid.

The right-hand side codetermines the residual, as it is added to the accumulated value (4.13). Since the algorithm's traversal preserves the inverse child-father relationship (Definition 2.2), all 2^d adjacent elements e ∈ adjacent(v) have been
traversed before when touchVertexLastTime is triggered for a vertex v. If they are refined, all their children have been processed before, i.e. whenever a vertex is contained within the support of v's shape function, its touchVertexLastTime has been called. An algorithm restricting a vertex's hierarchical residual throughout the touchVertexLastTime event thus has a valid right-hand side for the coarse grid vertex v whenever v's touchVertexLastTime operation is invoked. As a result, my realisation merges the restriction process and the first smoothing iteration on the upcoming active level.
The algorithm computes the hierarchical transform throughout the step down events: If a vertex belongs to the active level, and if the vertex is not hanging, the algorithm takes the vertices from the level above and determines the linear interpolant's value at the vertex's position. Computing the linear interpolant is possible due to the local interpolation operators induced by the d-linear shape functions on the coarser grid. As the nodal value is redetermined throughout the inverse hierarchical transform, the variable's content holding the nodal value is replaced by the linear surplus. Hanging vertices get the value zero: they do not hold a shape function, and, thus, their linear surplus vanishes.
The algorithm accumulates the hierarchical residual throughout the traversal of
the fine grid. As all vertices have been loaded whenever the automaton enters
an element, all the vertices hold the hierarchical surplus due to the discussion in
the paragraph above, and an application of the Laplacian stencil yields the local
contributions for the hierarchical residual. The nodal residual on the active level
is not needed throughout this traversal, since no solution update occurs, and I
consequently store the hierarchical residual within the residual variable.
The algorithm restricts the hierarchical residual throughout the step up transitions: Whenever a non-hanging vertex belongs to the active level, it holds a hierarchical residual value stored in the residual attribute. Hence, the algorithm restricts
this value to all the vertices of the father element that hold the refinement flag. If
a coarse vertex does not hold the refinement flag, its right-hand side has been set
during the vertex construction. It remains unaltered.
Besides the computation of the hierarchical residual on the fine grid, the algorithm
also applies the Laplacian operator on elements belonging to the active level minus
one. Here, the semantics of the residual variable remains the nodal residual, and it
is actually used to update the solution.
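The interplay of the hierarchical transform and its inverse can be sketched in one dimension (a Python illustration assuming a k = 3 refinement between two levels; not the thesis' implementation):

```python
import numpy as np

def hierarchical_transform(u_fine, u_coarse):
    """Replace fine-grid nodal values by the linear surplus:
    nodal value minus the linear interpolant of the coarse values
    (1-D sketch; fine vertex j sits at coarse coordinate j/3)."""
    n_c = len(u_coarse)
    x_fine = np.linspace(0.0, n_c - 1.0, 3 * (n_c - 1) + 1)
    return u_fine - np.interp(x_fine, np.arange(n_c), u_coarse)

def inverse_hierarchical_transform(surplus, u_coarse):
    """Recover nodal fine-grid values: surplus plus coarse interpolant."""
    n_c = len(u_coarse)
    x_fine = np.linspace(0.0, n_c - 1.0, 3 * (n_c - 1) + 1)
    return surplus + np.interp(x_fine, np.arange(n_c), u_coarse)
```

The two operations are exact inverses of each other, and the surplus of a function that is linear between coarse vertices vanishes, which is precisely why hanging vertices carry a zero surplus.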
4.4.3 Prolongation and Inverse Hierarchical Transformation
The prolongation of the coarse grid solution and the update of the fine grid approximation due to the inverse hierarchical transform consist of three steps: First,
the coarse grid’s nodal representation is interpolated to the fine grid. Second, the
interpolated value is added to the hierarchical surplus stored within the fine grid’s
Figure 4.14: Different states of the multigrid solver.
uv. Third, a fine grid smoothing step is applied to this nodal representation, i.e. one postsmoothing step is merged with the prolongation.
The two-level operations are embedded into the step down transitions. The algorithm analyses whether a fine grid vertex is adjacent exclusively to untouched faces.
In this case, its inverse hierarchical transform has not been computed yet, and the
vertex’s solution attribute holds the hierarchical surplus. The solver computes the
linear interpolant and adds it to this hierarchical surplus. As the algorithm applies
a Jacobi smoothing step within the same traversal, hanging vertices are interpolated
d-linearly throughout the steps down.
Besides the activities above, the standard Jacobi smoother actions from page 125
are invoked throughout the traversal. The traversal events hence are mapped to
both the inverse hierarchical transform and the smoothing operations.
4.4.4 States of the Hierarchical Transformation Multigrid
Method
The user steers the multigrid cycle with a state machine (Figure 4.14). Before each
iteration, he tells the solver holding the active level whether to ascend or descend. If
neither ascend nor descend are invoked, the solver realises a Jacobi smoothing step
on the active level. Throughout the grid traversal, the solver’s state is invariant.
The mapping from events to multigrid operations depends on the solver’s state
(Table 4.1): If the user triggers the ascend operation, the multigrid solver’s state
switches to Ascend. At the end of the iteration, the solver’s state then switches
back to Smooth and the active level is decremented. If the user triggers the descend
Table 4.1: Interplay of the traversal events, the solver states and the multigrid operations. All events not listed explicitly reduce to no operation. Let ℓactive denote the active level; First abbreviates Descend.

Solver State Ascend: Switch to the next coarser level throughout the iteration. Simultaneously, a smoothing step on this coarser level ℓactive − 1 is performed.
  enterElement(e): level(e) = ℓactive ⇒ apply stencil; accumulates the hierarchical residual r̂v. As the residual itself is not needed throughout the level's ascend, r̂v is stored in rv. level(e) = ℓactive − 1 ⇒ apply stencil; accumulates the residual in rv.
  touchVertexFirstTime(v): rv ← 0, and level(e) < ℓactive ∧ Prefined(v) ⇒ bv ← 0. level(e) = ℓactive ⇒ compute hierarchical transform.
  createTemporaryVertex(v): level(v) = ℓactive ⇒ uv ← 0, and level(v) < ℓactive ⇒ interpolate coarse grid value.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ restrict r̂v, with r̂v stored in variable rv. level(v) = ℓactive − 1 ⇒ add right-hand side to residual variable and apply Jacobi update step.

Solver State Smooth: Apply Jacobi smoother on the active level.
  enterElement(e): level(e) = ℓactive ⇒ apply stencil.
  touchVertexFirstTime(v): rv ← 0.
  createTemporaryVertex(v): level(v) ≤ ℓactive ⇒ interpolate coarse grid value.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ apply Jacobi update step.

Solver State Descend: Equals Smooth, but computes the inverse hierarchical transform before it applies the smoother.
  enterElement(e): level(e) = ℓactive ⇒ apply stencil.
  touchVertexFirstTime(v): rv ← 0. level(v) = ℓactive ⇒ compute inverse hierarchical transform.
  createTemporaryVertex(v): level(v) ≤ ℓactive ⇒ interpolate coarse grid value.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ add right-hand side to residual variable and apply Jacobi update step.
operation, the multigrid solver's state switches to Descend and increments the active level. At the end of the iteration, the solver resets its state to Smooth.
If the solver is told that this is the last traversal of the current V-cycle, it triggers a Jacobi update on the active level, the computation of the local refinement criterion on each vertex, and the evaluation of the global residual. The latter two aspects are discussed after some additional remarks on the solver's realisation and behaviour on adaptive grids.
4.5 Extensions and Realisation
With the preceding section, one can straightforwardly implement a full approximation storage scheme within the Peano traversal. A naive implementation, though, suffers from unbalanced spacetrees: the tree traversal always processes the whole tree, although only a small part of the tree might belong to the active level and, thus, is actually updated. It is an obvious idea to alter the numerics and to adapt the scheme to strongly adaptive discretisations. The first subsection of the upcoming pages is dedicated to this task. Next, I discuss which of a vertex's variables have to be stored on the in- and output stream: the residual variable, for example, accumulates a helper value that is initialised at its first usage throughout the traversal and evaluated before the data are written to the output stream. There is no need to hold such helpers persistently. Finally, I spend a few pages on a suitable yet simple refinement criterion. It compares the actual solution to a solution with doubled mesh width, i.e. it studies the effect of one refinement step of h-adaptivity, and does not evaluate any complicated residual-based error estimates. Without such a criterion, the tuning to adaptive grids at the beginning of the section would be relevant solely to the domain's boundary, as regions within the computational domain would never be refined.
4.5.1 Simultaneous Coarse Grid Smoothing on Adaptive Grids
Adaptive grids with areas of different resolution suffer from the set of rules in Table 4.1, as this set employs one global active level. If an area of the domain is tessellated by a coarser mesh than the mesh identified by the active level, the Peano algorithm traverses this part of the grid without performing any solution update. An improved multigrid solver therefore simultaneously updates the active level and all elements of the fine grid belonging to a level smaller than the active level. The smoother then equals a Jacobi smoother on an adaptive grid.
The unrefined elements of a level ℓ < ℓactive identify a Cartesian grid where the solver can apply Jacobi updates. Updating both the active level and these coarser levels results in a solver working on different, independent, and uncoupled subdomains, i.e. it tackles the problem on ℓactive and an additional number of small Dirichlet boundary problems on coarser grids.

Figure 4.15: The simultaneous coarse grid smoothing extends the fine grid by one element wherever possible and applies a Jacobi smoother on the resulting adaptive Cartesian grid. Interpolation and restriction couple the individual grid domains.
To couple the individual problems, I first of all extend the Cartesian grid of
each coarser level ℓ < ℓactive by one element, and, thus, introduce a shadow layer
of width one around the domain. Boundary vertices of these individual grids are
now either hanging vertices, refined vertices, or boundary vertices, and their values
are prescribed: The hanging vertices’ values result from the interpolation rules.
If the solver improves coarser representations, the interpolated vertices change in
the next iteration and the smoothing process on the finer grid continues with an
improved interpolated value—interpolation transports information from coarse grids
to finer grids. The refined vertices’ values depend on the coarsening operator (4.7).
Whenever the solver improves fine representations, the coarser vertices change in the
next iteration and the smoothing process on coarser grids continues with an improved
coarsened value—coarsening transports information from fine grids to coarser grids.
The implementation of the improvement is straightforward (Table 4.2) if a new predicate is introduced:

∀e ∈ ET : Punrefined(e) ⇔ ∃v ∈ vertex(e) : ¬Prefined(v)
holds for all fine grid elements. Furthermore, it holds for all elements adjacent to
fine grid elements. On levels smaller than the active level, it thus identifies all the
geometric elements whose stencils have to be evaluated, and the resulting smoother
scheme equals a domain decomposition approach where different Cartesian grids
Table 4.2: Modified rule set for the multigrid solver. For a given active level, it smoothes, besides the active level, also those parts of the computational domain covered with a grid coarser than the active level. The rules replace entries from Table 4.1; rules not listed again remain unaltered.

Solver State Ascend:
  enterElement(e): level(e) = ℓactive ⇒ apply stencil. level(e) < ℓactive ∧ Punrefined(e) ⇒ apply stencil. level(e) = ℓactive − 1 ⇒ apply stencil.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ restrict r̂v, with r̂v stored in variable rv. level(v) = ℓactive − 1 ⇒ apply Jacobi update step. level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step.

Solver State Smooth:
  enterElement(e): level(e) = ℓactive ⇒ apply stencil. level(e) < ℓactive ∧ Punrefined(e) ⇒ apply stencil.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ apply Jacobi update step. level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step.

Solver State Descend:
  enterElement(e): level(e) = ℓactive ⇒ apply stencil. level(e) < ℓactive ∧ Punrefined(e) ⇒ apply stencil.
  touchVertexLastTime(v): level(v) = ℓactive ⇒ apply Jacobi update step. level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step.
overlap by at least one coarse element. The corresponding function basis holds one shape function per vertex. Yet, the Dirac property of the hats, ϕv(v) = 1 and ϕv(v̂) = 0 ∀ v̂ ≠ v, does not hold anymore (Figure 4.15).
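The predicate and the modified enterElement rule can be sketched as follows (a Python illustration; elements are reduced to their level and their vertices' refinement flags):

```python
def p_unrefined(vertex_refined_flags):
    """P_unrefined(e): true iff at least one vertex of the element
    carries no refinement flag (sketch)."""
    return any(not refined for refined in vertex_refined_flags)

def applies_stencil(level_e, active_level, vertex_refined_flags):
    """Modified enterElement rule of Table 4.2 (sketch): evaluate the
    stencil on the active level and on every coarser element with at
    least one unrefined vertex."""
    if level_e == active_level:
        return True
    return level_e < active_level and p_unrefined(vertex_refined_flags)
```

Elements below the active level whose vertices are all refined are traversed without any arithmetic work, exactly as before the modification.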
Such a modification of the event mapping yields a non-uniform smoother: the coarser an unrefined vertex, the greater the number of Jacobi updates this vertex receives throughout every V-cycle. In the numerical experiments, the modified adaptive smoothing outperforms the standard multigrid, and there is hence no reason to skip the additional coarse grid updates (even if they might not be necessary from a smoothing point of view, i.e. even if the error at these vertices is already sufficiently smooth) as long as the coarse subtrees are traversed anyway. This reasoning would not hold if the additional floating point operations slowed down the coarse grid traversal; I never observed such behaviour.
4.5.2 Persistence and Semantics of Vertex Attributes
According to the introductory remarks of Section 4.4, each vertex v ∈ VT \ HT in the grid holds five different values or variables, respectively. Yet, the hierarchical surplus and the actual solution are never needed at the same time, and the algorithm can reconstruct one value from the other. Thus, both values are held by one variable having either the one semantics or the other. The same argument holds for the residual and the hierarchical residual. This reduces the memory consumption by two floating point variables per vertex.
The (hierarchical) residual is accumulated throughout the iterations. The update uses this variable to improve the solution or to compute a new right-hand side whenever touchVertexLastTime is called. In-between two iterations, the residual is not needed. The implementation hence does not store the residuals on the output and input streams. This reduces the memory consumption by one floating point variable per vertex.
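The reduced vertex record can be sketched as follows (an illustrative Python fragment; the attribute names are mine, not the framework's):

```python
class Vertex:
    """Sketch of the reduced vertex record: 'u' holds either the nodal
    value or the hierarchical surplus, 'r' either the residual or the
    hierarchical residual -- the paired semantics are never needed at
    the same time."""
    __slots__ = ("u", "r", "b", "refined")
    PERSISTENT = ("u", "b", "refined")   # 'r' is a per-traversal helper

    def __init__(self, u=0.0, b=0.0, refined=False):
        self.u, self.r, self.b, self.refined = u, 0.0, b, refined

    def to_stream(self):
        """Only persistent attributes are written to the output stream;
        the residual is re-initialised at its first use per traversal."""
        return {name: getattr(self, name) for name in self.PERSISTENT}
```

Keeping the residual out of the streams trades one floating point variable of bandwidth per vertex against a re-initialisation at the start of each traversal.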
4.5.3 Linear Surplus as Error Estimator
Peano’s grid management is hidden from the PDE solver: the grid management
triggers events, and the solver plugs into these events. Nevertheless, the communication is not unidirectional, as any event implementation is allowed to initiate a
refinement or a coarsening on the vertices. The grid then refines or coarses, respectively, throughout the subsequent traversal.
In [21], the effect and implementation of different refinement criteria are evaluated. For the experiments here, I provide a refinement criterion based upon the linear surplus. Hereby, the algorithm compares the nodal solution uv on the fine grid in each vertex to the mean value ũv of the 3^d − 1 surrounding vertices. If

|ũv − uv| ≥ ǫ    or    (4.20)
h^d · |ũv − uv| ≥ ǫ,    (4.21)

i.e. the nodal value exceeds a given threshold, the solver refines v. (4.20) reduces the error in the ‖·‖max norm, (4.21) reduces the error in the L2 norm.
The mean value ũv is modelled as an additional variable for each vertex. Its value is set to zero within touchVertexFirstTime. After that, the stencil evaluation accumulates it besides the residual. For d = 2, e.g., the stencil

        ⎡ 1 1 1 ⎤
(1/8) · ⎢ 1 0 1 ⎥    (4.22)
        ⎣ 1 1 1 ⎦
computes the mean value ũv. It is available throughout touchVertexLastTime where the algorithm evaluates the refinement formula and triggers the refinement for the next iteration. As for the residual, there is no need to store the mean value persistently.
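For a regular 2-D grid patch, the mean value accumulation and the two criteria can be sketched as follows (an illustrative Python fragment; the array layout and the threshold are assumptions):

```python
import numpy as np

def refinement_flags(u, h, eps, norm="max"):
    """Linear-surplus refinement criterion (sketch) on a regular 2-D
    grid: compare each inner nodal value to the mean of its
    3^d - 1 = 8 neighbours and flag the vertex if (4.20) or (4.21)
    exceeds the threshold eps."""
    mean = np.zeros_like(u)
    # accumulate the 8 surrounding vertices, weight 1/8 each
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            mean[1:-1, 1:-1] += u[1 + dx:u.shape[0] - 1 + dx,
                                  1 + dy:u.shape[1] - 1 + dy]
    mean /= 8.0
    diff = np.abs(mean - u)
    if norm == "max":          # criterion (4.20): max-norm control
        crit = diff
    else:                      # criterion (4.21): L2-norm control
        crit = h**2 * diff     # h^d with d = 2
    flags = np.zeros(u.shape, dtype=bool)
    flags[1:-1, 1:-1] = crit[1:-1, 1:-1] >= eps
    return flags
```

The criterion vanishes for locally linear data, so it effectively samples the second derivative, in line with the discussion in Section 4.6.3.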
The two simple error estimators above deliver a proof of concept that Peano’s
grid management, the full approximation scheme multigrid solver, and the dynamic
Figure 4.16: Adaptive grid for the L-shape with (below) and without (above) the
boundary refinement criteria (4.23) and (4.24). (4.23) and (4.24) yield
a finer grid around the semi-singularity, and they also highlight the
boundary approximation improvement.
adaptivity fit together. Nevertheless, this thesis lacks an exhaustive discussion of different refinement criteria.
One remark is essential for the experiments: Criteria (4.20) and (4.21) observe only inner vertices. Typically, though, pollution and accuracy problems stem from (semi-)singularities on the domain's boundary, i.e. to reduce the mesh width inside the domain does not resolve the root of the problem: often, it only introduces additional hanging points on the boundary. Hanging points do not hold information and, thus, do not improve the boundary sampling (Figure 4.16). Consequently, I refine all vertices vboundary ∈ VT \ HT with

¬Prefined(vboundary) ∧ vboundary ∈ father(vnew) ∧ Pboundary(vboundary)    (4.23)

whenever

Pinside(vnew) ∧ ∃v′ ∈ father(vnew) : Pboundary(v′) ∧ ∄v′′ ∈ father(vnew) : Pboundary(v′′) ∧ Prefined(v′′)    (4.24)

holds after the creation of a new vertex vnew.
A comparison of different refinement criteria for k-spacetrees is started in [21]. More elaborate schemes such as dual problem formulations there yield more appropriate grids for some problems, whereas the simple linear surplus delivers grids of sufficient quality for most examples while it is cheap to compute. For alternative Laplacian stencils, alternative mean value computations might result in better grids, i.e. in grids delivering the same quality of solution with a smaller number of grid points. The example above evaluates a mean value computation corresponding to a full stencil, as it evaluates 3^d − 1 surrounding vertices. Other mean value computations such as five point rules evaluating 2d − 1 vertices yield other adaptivity patterns. Besides the full mean value computation and the five point rule, the experiments also employ a skewed mean value stencil that exclusively evaluates vertices not connected to the mean value location by a hyperface. Also, the hierarchical surplus is an indicator of how much precision one gains due to the refinement. Nevertheless, the hierarchical surplus proved to be an insufficient refinement criterion throughout the numerical experiments, as it introduces k-dependent refinement patterns.

The counterpart of refinement is coarsening. A simple coarsening criterion also compares the mean value to the solution's value. It triggers a coarsening as soon as a coarsening threshold is underrun. Such a feature is of great value for time-dependent problems, e.g., where the grid of the preceding time step acts as the startup grid for the subsequent time step.
4.6 Experiments
The following experiments illustrate the solver's numerical behaviour. The experiments concentrate on the multigrid properties, i.e. the figures highlight the smoothing behaviour, the convergence rate, and so forth. They do not present runtime and memory requirements.

To measure the effect of a multigrid cycle, the experiments either compare two subsequent approximations' fine grid representations, or they measure the global residual r on the fine grid. Since the solver is defined on an adaptive grid, the norms take the mesh structure into account wherever necessary.
V := {v ∈ VT \ HT : ¬Prefined(v)},
‖uh‖max = max_{v∈V} |uv|,    ‖du‖max = max_{v∈V} |uv^old − uv^new|,    (4.25)
Figure 4.17: The ‖·‖h norm interprets the nodal values as weights of a piece-wise constant nodal function basis where each basis element surrounds one vertex (a). While it converges to the L2 norm, it yields too large values for adaptive grids: here, coarse "hats" (b) and finer "hats" (c) overlap.
‖uh‖h = √( Σ_{v∈V} h^d (uv)² ),    ‖du‖h = √( Σ_{v∈V} h^d (uv^old − uv^new)² ),    (4.26)
‖r‖max = max_{v∈V} |rv|,    and    (4.27)
‖r‖2 = √( Σ_{v∈V} rv² )    (4.28)
compute the discrete norms with h being the generic variable for the grid width of the surrounding geometric elements. The mesh width in (4.26) interprets the nodal value as a weight belonging to a piece-wise constant ansatz space with each of its shape functions covering one fine grid element, suitably dilated such that it surrounds the vertex. Thus, this metric corresponds to the L2 norm of the measurand. If the grid is adaptive, the ‖·‖h norm yields a value bigger than the L2 norm, as the surrounding constant shape functions corresponding to the measurement overlap (Figure 4.17). This error vanishes with decreasing mesh width. Norm (4.26) is an extension of the ‖·‖h norm in [10] to adaptive Cartesian grids. Obviously, the maximum norm (4.25) is independent of the grid structure.
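Norm (4.26) can be sketched as follows (Python; the local mesh width is passed per vertex, d = 2 by default):

```python
import numpy as np

def norm_h(values, h, d=2):
    """Discrete ||.||_h norm (4.26) on an adaptive grid: every vertex
    contributes h^d * value^2 with its local mesh width h (sketch)."""
    values = np.asarray(values, dtype=float)
    h = np.asarray(h, dtype=float)
    return float(np.sqrt(np.sum(h**d * values**2)))
```

On a regular grid, this reduces to the classical mesh-weighted Euclidean norm; on an adaptive grid, each vertex simply carries its own weight h^d.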
In contrast to the solution’s representation, the residual in (4.27) and (4.28) is by
definition scaled with hd due to the integral over Ω in the system matrix. It neither
has an interpretation in the solution space nor is it an error indicator [10]. Hence,
Figure 4.18: Solution of (4.30) for d = 2.
changes of the underlying grid change the semantics and order of the residual, too.
Two residuals belonging to different grids thus are not comparable directly.
The multigrid convergence factor

ρ = ‖eh^new‖ / ‖eh^old‖    (4.29)

for two subsequent iterations and suitable norms is measured via the residual,

ρ ≈ ρh,‖·‖ = ‖r^new‖ / ‖r^old‖,

applying a suitable norm ‖·‖. This approximation is accurate if the grid is not changing anymore and if the error is bigger than the discretisation error by an order of magnitude [73].
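The convergence factor estimate can be sketched as follows (an illustrative Python fragment):

```python
def convergence_factor(residual_norms):
    """Estimate rho (4.29) from a sequence of residual norms taken
    after subsequent cycles: the geometric mean of the per-cycle
    reduction factors (sketch)."""
    cycles = len(residual_norms) - 1
    return (residual_norms[-1] / residual_norms[0]) ** (1.0 / cycles)
```

Averaging over several cycles smooths out start-up effects, which is why the tables below report ρ over the whole iteration history rather than a single ratio.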
4.6.1 Convergence Rates
The first experiments solve the problem

−∆u = d π² ∏_{i=1}^{d} sin(π xi),    u|∂Ω = 0,    (4.30)

with the analytical solution

u = ∏_{i=1}^{d} sin(π xi),    ‖u‖∞ = 1,

and

‖u‖L2 = ( ∫_Ω |u(x)|² dx )^{1/2} = ( ∏_{i=1}^{d} ∫_0^1 sin²(πt) dt )^{1/2}
      = ( ∫_0^1 −(1/4) (e^{iπt} − e^{−iπt})² dt )^{d/2} = (1/2)^{d/2}.    (4.31)
The minimal and maximal mesh width are equal, and a dynamic refinement is
switched off, i.e. the grid underlying the solution (Figure 4.18) is regular.
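A quick finite-difference consistency check of (4.30) for d = 2 (an illustrative Python fragment, not part of the solver):

```python
import numpy as np

def u_exact(x):
    """Analytical solution of (4.30): prod_i sin(pi x_i)."""
    return float(np.prod(np.sin(np.pi * np.asarray(x))))

def f_rhs(x):
    """Right-hand side of (4.30): d pi^2 prod_i sin(pi x_i)."""
    x = np.asarray(x)
    return len(x) * np.pi**2 * float(np.prod(np.sin(np.pi * x)))

def neg_laplacian_fd(u, x, h=1e-3):
    """-Laplace(u) at x via second-order central differences."""
    x = np.asarray(x, dtype=float)
    lap = 0.0
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        lap += (u(x + e) - 2.0 * u(x) + u(x - e)) / h**2
    return -lap
```

The analytical solution attains its maximum of one at the domain's midpoint, and ∫_0^1 sin²(πt) dt = 1/2 yields ‖u‖L2 = (1/2)^{d/2} as stated in (4.31).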
The smoothing behaviour of the Jacobi solver deteriorates with increasing problem size, i.e. the finer the grid, the smaller the residual reduction per iteration (Figure 4.19). A V(1,1)-cycle, in turn, converges almost independently of both the grid size and the dimension (Figure 4.20). The maximum precision to be obtained depends on the discretisation error.
The corresponding F(1,1)-cycle performs one V(1,1)-iteration per grid and then adds the next level. This F-cycle yields results close to the V-cycle's output (the difference between the solutions is in the order of the discretisation error) but runs much faster, as the initial iterations traverse only coarse grids. Both multigrid schemes exhibit a convergence factor ρh,‖·‖ ∈ ]4.69 · 10−1, 5.87 · 10−1[.

The V-cycle's and the F-cycle's runtime profit from the horizontal tree cuts. With this feature, coarse grid update steps do not traverse the whole tree. They restrict themselves to the upper part of the tree whose elements belong to the active level or a smaller level. As a result, the sum of the pre- and postsmoothing steps has to be even (Section 3.5.2). I return to absolute runtimes in the closing numerical results.
4.6.2 Multigrid Adjustment Screws
Multigrid algorithms rely on smoothers that eliminate high-frequency errors. For the Jacobi smoother, the relaxation factor ω determines the smoother's effect on the different error frequencies. While the relaxation factors corresponding to bi-partitioning are well-studied, literature on arbitrary k-partitioning is rare.
Figure 4.19: Behaviour of the Jacobi solver for (4.30): the residual is divided by the initial residual corresponding to the start solution guess u = 0. (Panels "Jacobi Solver (d=2)" and "Jacobi Solver (d=3)": |u|max and |r|max over up to 500 iterations; the legends denote the number of unknowns.)
Figure 4.20: The multigrid solver tackles (4.30) with V(1,1)-cycles, and the error comprising both discretisation and solver contributions is measured at x0 = (1/2, 1/2, . . .)^T ∈ R^d. The legend denotes the number of unknowns. (Panels "V(1,1) cycle (d=2)" and "V(1,1) cycle (d=3)": |e(x0)| and |r|max over 60 cycles.)
Figure 4.21: Influence of the relaxation factor ω on two different V-cycles. (Panels "V(2,2)-cycle, d=2" and "V(3,3)-cycle, d=2": |r|L2 over 10 cycles for ω = 0.1, . . . , 1.1.)
Table 4.3: Residual |r|2 for different V-cycles and d = 2. Regular grid for (4.30) with mesh size 3.3 · 10−03, and ω = 0.8. The F-cycle refines after each V-cycle until the grid meets a prescribed precision bound.

it        V (2, 2)       V (3, 3)       F (2, 2)       F (3, 3)       tree depth
1         4.39 · 10−04   9.46 · 10−05   2.57 · 10−02   1.61 · 10−03   2
2         8.39 · 10−05   8.79 · 10−06   2.67 · 10−01   1.07 · 10−01   3
3         1.81 · 10−05   9.81 · 10−07   4.47 · 10−02   8.51 · 10−03   4
4         4.40 · 10−06   1.34 · 10−07   1.25 · 10−02   1.39 · 10−03   5
5         1.15 · 10−06   2.04 · 10−08   4.07 · 10−03   4.12 · 10−04   6
6         3.12 · 10−07   3.26 · 10−09   1.35 · 10−03   1.36 · 10−04   7
7         8.58 · 10−08   5.28 · 10−10   5.07 · 10−07   4.50 · 10−08   7
8         2.38 · 10−08   8.60 · 10−11   1.26 · 10−07   4.87 · 10−09   7
9         6.58 · 10−09   1.41 · 10−11   3.27 · 10−08   6.51 · 10−10   7
10        1.82 · 10−09   2.31 · 10−12   8.84 · 10−09   9.88 · 10−11   7
11        5.03 · 10−10   3.81 · 10−13   2.43 · 10−09   1.57 · 10−11   7
12        1.39 · 10−10   7.64 · 10−14   6.75 · 10−10   2.55 · 10−12   7
13        3.85 · 10−11   4.45 · 10−14   1.88 · 10−11   4.18 · 10−13   7
ρh,‖·‖2   ≈ 0.276        ≈ 0.164        ≈ 0.277        ≈ 0.168
Nevertheless, the choice of the relaxation factor is worth some effort, as the standard relaxation factor ω ≈ 2/3 for bi-partitioning does not fit (k = 3)-spacetrees (Figure 4.21). Instead, a relaxation factor ω ≈ 0.9 is appropriate. Following [41], an alternating sequence (ω1 ≈ 1/2, ω2 ≈ 1) of relaxation factors yields an even better rate. These results recur for d = 3, other combinations of pre- and postsmoothing, as well as more complicated domains and right-hand sides. The experiments show that an elaborate Fourier and smoothing analysis is yet to be done. This is beyond the scope of this work.
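A simplified 1-D local Fourier sketch reproduces this trend (an illustration under standard model assumptions, not the thesis' analysis): frequencies θ ∈ [π/k, π] cannot be represented on a k-times coarser grid, and the ω-Jacobi symbol for the 1-D Laplacian is 1 − ω(1 − cos θ).

```python
import numpy as np

def optimal_jacobi_weight(k, samples=2001):
    """Minimise the smoothing factor max |1 - omega*(1 - cos theta)|
    over the high frequencies theta in [pi/k, pi] (1-D model sketch)."""
    thetas = np.linspace(np.pi / k, np.pi, samples)
    lam = 1.0 - np.cos(thetas)           # in [1 - cos(pi/k), 2]
    omegas = np.linspace(0.01, 1.2, samples)
    mu = np.array([np.max(np.abs(1.0 - w * lam)) for w in omegas])
    best = np.argmin(mu)
    return omegas[best], mu[best]
```

For k = 2 this sketch recovers the classical ω = 2/3 with smoothing factor 1/3; for k = 3 it yields ω ≈ 0.8 with smoothing factor 0.6, i.e. a noticeably larger relaxation factor than in the bi-partitioning case, consistent with the experiments above.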
The fine-tuning of an F-cycle is a laborious task, as the number of pre- and postsmoothing steps of the underlying V-cycles can either be fixed a priori or made dependent on the residual's behaviour. Furthermore, their optimal choice also depends on the applied operators, the relaxation factor, and the smoother type. Since this chapter gives only a proof of concept, such a tuning is beyond the scope of this work. Table 4.3 nevertheless lists some residual developments for different variations of pre- and postsmoothing steps, and compares them to the V-cycles' results.
The F-cycle exhibits a complicated behaviour (Table 4.3) splitting up into three phases: In the first phase (cycles one to six), the residual reduces monotonically as the grid is refined further and further. In the second phase (cycle seven), it diminishes by four orders of magnitude. In the third phase (from cycle seven on), the solver reduces the residual with an almost constant convergence rate.
Due to the large number of pre- and postsmoothing steps, each of the seven iterations ends up with a good approximation, i.e. the fine grid approximation is very accurate
Figure 4.22: Experiment from Table 4.3. The F-cycle refines the grid each time the residual has been reduced by a factor of 100. Residual jumps identify these refinements. (Panels: |e(x0)| and |u|L2 − |uh|h for F(1,1), F(2,2), and F(3,3); |r|2 over 70 cycles.)
whenever an additional grid level is added. The additional level resolves an additional high-frequency error. This high-frequency error is not tackled by the interpolation, as the algorithm lacks the full multigrid's higher-order interpolation; it uses the standard d-linear interpolation given by operator P. From cycle one to seven, the algorithm eliminates the coarse grid error and, in turn, adds an additional fine grid error. As soon as no additional level is added anymore, no additional high-frequency component enters the residual either: it "jumps" down. Afterwards, the solver behaves like a standard V-cycle. In [10, 73], e.g., the full multigrid algorithm typically exhibits a better convergence rate than the standard V-cycle, and there is no abrupt residual decay. Both facts illustrate the lack of a more sophisticated smoother (Gauß-Seidel, e.g.) and of a higher-order interpolation.
The influence of the missing higher-order interpolation is studied further in Figure 4.22: The F-cycle adds the subsequent grid as soon as the ratio of residual to initial residual falls below 1.0 · 10−02. The experiment stops as soon as ‖du‖h < 10−12, and its figures illustrate the interplay of residual, grid levels, and actual solution. It reveals two insights: On the one hand, the peaks correspond to the cycles where an additional grid level is added. Their height should be reduced by a higher-order interpolation. Nevertheless, the algorithm converges independently of the mesh width. On the other hand, it does not make sense for the F-cycle to wait for the V-cycle to converge. Instead, it can add another level immediately after one V-cycle.
Figure 4.23: Experiment (4.32) with the solution, a grid resulting from the surplus stencil (4.22), a five-point stencil, and the skewed five-point stencil (top down, left-hand side). Results for experiment (4.32) on the right-hand side.
4.6.3 Dynamic Adaptivity
In Section 4.5.3, the difference between the actual approximation and the mean value
of the surrounding vertices acts as refinement criterion. Since this value vanishes
for continuous solutions with a constant first derivative, the criterion measures the
second derivative and refines where the second derivative is high. It assumes that
such regions are of interest. The experiments studying this refinement strategy
comprise, besides (4.30), the setups
  −∆u = 1,   u|∂Ω = 0   with Ω = ]0, 1[^d \ [0, 1/2]^d,                    (4.32)

and

  u(x, y) = 1/sinh(8π) · cos(2π · ((x+1)/2 − (y+1)/2))
            · sinh(2π · ((x+1)/2 + (y+1)/2 + 2))   with Ω = ]0, 1[^2.      (4.33)
The L-shape problem (4.32) exhibits a semi-singularity at x = (1/2, 1/2, . . .)^T, i.e. the first derivative there is not continuous, and the second derivative around this point is very large. Due to the semi-singularity, the approximation error no longer belongs to O(h^2):
Figure 4.24: Grid of (4.32) with two linear surplus stencils: The full stencil’s pattern (left) differs from the pattern of the skewed five-point stencil (right).
it introduces a pollution of the approximation. An adaptive grid refining around x = (1/2, 1/2, . . .)^T eliminates this pollution.
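A one-dimensional sketch of the mean value refinement criterion from Section 4.5.3, assuming unit-spaced vertices and illustrative function names:

```python
import numpy as np

def linear_surplus(u):
    # Difference between each inner vertex value and the mean of its two
    # neighbours; it vanishes for functions with constant first derivative,
    # so it acts as a (scaled) second-derivative indicator.
    return np.abs(u[1:-1] - 0.5 * (u[:-2] + u[2:]))

def refine_flags(u, threshold):
    # Mark the inner vertices whose surplus exceeds the prescribed threshold.
    return linear_surplus(u) > threshold

x = np.linspace(0.0, 1.0, 11)
u_linear = 2.0 * x + 1.0      # constant first derivative: surplus ≈ 0
u_kink = np.abs(x - 0.5)      # kink at x = 0.5: large surplus there
flags = refine_flags(u_kink, 1e-3)
```

For the linear function the surplus vanishes up to round-off, while the kink at x = 0.5, a discrete model of a large second derivative, is flagged for refinement.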
Experiment (4.33), stemming from [64], is a smooth experiment, but its analytical solution exhibits a region of interest at the square’s boundary around (1, 1)^T ∈ R^2. Here, the solution rises significantly, whereas in the remaining part of the domain the solution is almost constant.
Both experiments are illustrated in Figure 4.23. A zoom into the results in Figure 4.24 reveals the effect of different stencils to compute the mean value. The corresponding Tables 4.4 and 4.5 compare a V(2,2)-cycle to an F(2,2)-cycle for the three different experiments. Both experiments stop as soon as the solution update falls below the current approximation by more than a factor of 10^-10 in the ‖.‖_h norm, i.e.

  ‖du‖_h < 1.0 · 10^-10 · ‖u_h‖_h.                                          (4.34)
The V-cycle works with a prescribed fixed mesh width inside the computational domain. No refinement criterion is switched on. The F-cycle starts on the coarsest grid containing at least one inner vertex. It then refines as long as the mean value error estimator (4.20) yields values greater than a prescribed threshold. The threshold is chosen such that both the V-cycle and the F-cycle converge to a solution with the same number of accurate digits in the ‖.‖_h norm. The surplus is determined by the full mean value stencil. In accordance with the insights on page 144, the F-cycle does
Table 4.4: Comparison of a V-cycle to an F-cycle for (4.30). Both cycles terminate if (4.34) holds. d = 2 for the upper part, d = 3 for the lower part.

                          V(2,2)                          F(2,2)
  mesh width     vertices     depth  iterations   vertices     depth  iterations
  1.0 · 10^-1    1.29 · 10^3  4      11           1.29 · 10^3  4      13
  1.0 · 10^-2    7.02 · 10^4  6      10           1.52 · 10^4  6      15
  1.0 · 10^-3    5.41 · 10^6  8       7           1.66 · 10^6  8      10
  1.0 · 10^-2    3.60 · 10^4  4      11           3.60 · 10^4  4      13
  1.0 · 10^-3    1.59 · 10^7  6      10           2.34 · 10^6  6      15
Table 4.5: Comparison of a V-cycle to an F-cycle for (4.33). Both cycles terminate if (4.34) holds (d = 2).

                          V(2,2)                          F(2,2)
  mesh width     vertices     depth  iterations   vertices     depth  iterations
  1.0 · 10^-1    1.28 · 10^3  4      15           1.28 · 10^3  4      15
  1.0 · 10^-2    7.02 · 10^4  6      17           6.10 · 10^3  6      15
  1.0 · 10^-3    5.41 · 10^6  8      16           5.33 · 10^5  9      13
not search for a converged iteration for each individual grid refinement step, but
evaluates and realises the refinement criterion after each V -cycle.
The results for (4.30) are listed in Table 4.4. Since the solution is smooth, the adaptivity criterion yields rather regular grids: the difference between the number of degrees of freedom for adaptive and regular grids is not significant, and both solvers
work on a spacetree with the same maximum depth. The F-cycle needs some additional sweeps, as it first retreats to the coarser levels. These additional traversals might disappear with a higher-order interpolation implemented. Beyond the figures in the tables, the F-cycle nevertheless outperforms the V-cycle, as the complete F-cycle’s initial iterations, working on rather coarse grids, are significantly faster than any V-cycle. If the number of cycles required in total is (almost) constant, this setup phase dominates the overall runtime. This runtime observation is picked up again in the numerical conclusions. Finally, it becomes obvious that a termination criterion should adapt to the minimum mesh size for a real-world problem, i.e. the criterion should pay attention to the finest grid. For sufficiently smooth problems, a fine resolution (last measurement, d = 2) improves the coarse grid corrections, but in this example the solver terminates before the fine grid solution “comes into play”.
The results for (4.33) are listed in Table 4.5. The corresponding grid becomes finer with increasing distance from the coordinate system’s origin, as the solution’s
Table 4.6: First ten cycles of a V-cycle and an F-cycle for problem (4.32). The V(2,2)-cycle works on a fixed grid with 4.08 · 10^6 vertices and depth 8 throughout.

  iteration   V(2,2) ‖.‖_h   F(2,2) vertices   depth   ‖.‖_h
   1          6.47 · 10^-3   2.63 · 10^2        3      7.53 · 10^-3
   2          1.02 · 10^-2   2.63 · 10^2        3      1.09 · 10^-2
   3          1.25 · 10^-2   1.19 · 10^3        4      1.20 · 10^-2
   4          1.40 · 10^-2   6.82 · 10^3        5      1.34 · 10^-2
   5          1.52 · 10^-2   5.32 · 10^4        6      1.44 · 10^-2
   6          1.57 · 10^-2   4.58 · 10^5        7      1.52 · 10^-2
   7          1.62 · 10^-2   5.68 · 10^5        8      1.58 · 10^-2
   8          1.66 · 10^-2   7.32 · 10^5        9      1.63 · 10^-2
   9          1.69 · 10^-2   9.08 · 10^5       10      1.66 · 10^-2
  10          1.71 · 10^-2   9.43 · 10^5       11      1.69 · 10^-2
derivative increases with a growing distance. Thus, the refinement criterion yields
a grid with a significantly lower number of vertices compared to a regular grid.
Finally, Table 4.6 lists the solution evolution for problem (4.32). Here, the adaptivity criterion yields a grid that is strongly refined around the boundary’s concave vertex, and the F-cycle can compete with the V-cycle’s accuracy although it uses a grid that is significantly smaller.
4.6.4 Simultaneous Coarse Grid Smoothing
The multigrid suffers from adaptive grids if the smoother acts exclusively on the active level. All the events on the fine grid cells belonging to a level that is smaller than the active level then degenerate to no-ops. In Section 4.5.1, this rationale motivates Table 4.2. The underlying mechanism introduces a simultaneous coarse grid smoothing while finer grids are processed.
For (4.32), the F-cycle in the preceding section has to cope with a grid that is
refined extremely around one single point. It is thus a good demonstrator for the
effect of the simultaneous coarse grid smoothing (Table 4.7).
The coarse grid smoothing reduces the iteration numbers by more than a factor of two. This improvement grows with the accuracy of the refinement criterion. The traversal always runs through the whole input stream, and the simultaneous coarse grid smoothing does not alter its cardinality. Hence, the runtime depends linearly on the reduction of cycles. Adaptive grids benefit from the simultaneous coarse grid smoothing. The improvement vanishes for regular grids. A runtime analysis is given at the end of the thesis.
Table 4.7: Number of iterations for an F-cycle on problem (4.32) with coarse grid smoothing. Both experiments stop if ‖du‖_h ≤ 1.0 · 10^-5. Either the solver exclusively processes the active level (off), or it applies the rules from Table 4.2, i.e. all fine grid levels smaller than or equal to the active level are updated (on).

  refinement               d = 2        d = 3
  criterion threshold      off   on     off   on
  1.0 · 10^-1              38    22     37    22
  3.3 · 10^-2              42    21     41    21
  1.0 · 10^-2              45    19     44    19
  3.3 · 10^-3              43    18
4.7 Outlook
While every PDE beyond the Poisson equation brings along its own challenges and
difficulties, an efficient solver for the Poisson equation and the accompanying insights
and rationale are a good starting point for any solver. They show that sophisticated
multiscale algorithms fit to the k-spacetree world. Some projects already exploit and
extend this chapter’s principles: The ingredients on hand for a collocated degree of
freedom layout (the unknowns are assigned to the vertices) can be transferred to
a staggered, i.e. cell-centered, layout where each geometric element corresponds to
an unknown [51, 76]. Such a discretisation of the function space is fundamental for
many computational fluid dynamics codes [59, 60]. And while d ∈ {2, 3} is a natural choice for the Poisson equation, the k-spacetree idea allows arbitrary d, and it is an obvious idea to develop schemes for time-dependent equations with a full space-time discretisation.
The insights accompanying the experiments are twofold. On the one hand, the results show—the missing higher-order interpolation aside—that the approach realises all the features one expects from an up-to-date solver: The convergence behaviour is independent of the mesh size, the adaptivity is able to resolve pollution effects resulting from singularities, the memory requirements are low, and grids permanently changing throughout the computation do not pose a restriction on the solver’s behaviour. Both the low memory demands and the latter aspect are due to the fact that no stiffness matrix is set up.
On the other hand, the convergence rates are low compared to the literature. The solver is not competitive. As the multigrid behaviour as such holds, the low convergence rates are due to a poor smoother. To end up with a competitive iteration behaviour, the multigrid has to use a more sophisticated smoother than Jacobi. Unfortunately, the choice of this particular smoother results from the speed at which information
Figure 4.25: Domain with “complicated” shape. The marked vertices are boundary vertices, i.e. they hold neither a shape function nor a degree of freedom. The multigrid solver degenerates to a Jacobi solver, as there are no shape functions to implement a coarse grid correction.
can be transported by the element-wise k-spacetree traversal (Section 2.6). Nevertheless, there are better solvers available within the Peano framework: a Gauß-Seidel type solver can be realised within the framework if the discretisation switches to an element-centered degree of freedom layout [51, 76], which is a popular choice, for example, for the pressure Poisson equation in computational fluid dynamics. For this, the choice of an odd k proves valuable: Every coarse grid cell’s center coincides with a fine grid cell center. If this did not hold—any even k implies this—coarser grids would end up with unknowns assigned to both vertices and elements. For an odd k, all grid levels exhibit a homogeneous staggered degree of freedom layout.
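The coincidence argument can be checked with exact rational arithmetic in one dimension (a toy verification, not Peano code): the children of the unit cell [0, 1] have centres (i + 1/2)/k, and the parent centre 1/2 is among them exactly for i = (k − 1)/2, i.e. for odd k.

```python
from fractions import Fraction

def center_coincides(k):
    # The children of [0, 1] are [i/k, (i+1)/k] with centres (2i+1)/(2k).
    # The parent centre 1/2 equals (2i+1)/(2k) for i = (k-1)/2, which is
    # an integer if and only if k is odd.
    parent_centre = Fraction(1, 2)
    child_centres = [Fraction(2 * i + 1, 2 * k) for i in range(k)]
    return parent_centre in child_centres

print([k for k in range(2, 8) if center_coincides(k)])  # odd k only: [3, 5, 7]
```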
Besides the improved smoothers, two further aspects have to be mentioned in the context of k-spacetrees. First, singularities typically yield grids that are refined extremely around the singularity, i.e. each grid refinement step entails several refinement steps at the singularity. Within the multigrid solver context with simultaneous coarse grid smoothing, the effect recurs the other way round: The solver starts with an active level set to the maximum level. Every time the active level is decremented, the grid could also switch to a coarser level in the coarse regions of the solution, as the solution there is sufficiently smooth, too. A local active level would realise such an approach. This discussion picks up the local tree cut paradigm (see Figure 3.21), and both concepts have to be combined in future work.
Second, geometric multigrid algorithms suffer from irregular domains. As the continuous domain is mapped to the individual grid levels, non-smooth domain boundaries make the coarse grid representations significantly smaller than the finer grids. They then hold a significantly smaller number of shape functions (Figure 4.25), and the coarse grid correction effect deteriorates. An improved boundary handling that tracks the actual intersection of the continuous domain with the geometric elements resolves this problem [75]. Boundary vertices then hold a shape function, but this shape function is tailored to the actual domain boundary. As a result, the coarse
Figure 4.26: Additive multigrid scheme: The solver computes a fine grid residual,
restricts this residual immediately to all levels and updates all variables
on all levels. Before the next iteration, a prolongation (and summation)
of all the level contributions yields the next iteration’s solution.
grid correction effect improves. Furthermore, the boundary approximation accuracy improves to O(h^2), which is especially important for d > 2, as Section 2.7 points out.
4.7.1 Additive Multigrid
The forerunners of this thesis [35, 39, 63] also solve the Poisson equation, i.e. they realise exactly the same demonstration challenge. Their implementations provide an additive multigrid scheme restricting the fine grid residual simultaneously to all grid levels. The Jacobi update step then updates all grids simultaneously, too, and these update steps are combined into a fine grid solution throughout the successive depth-first traversal (Figure 4.26).
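A minimal one-dimensional sketch of such an additive two-level scheme for −u'' = f (the damping, the cycle count, injection as restriction, and linear interpolation as prolongation are my own illustrative choices, not the forerunners' implementations):

```python
import numpy as np

def apply_laplace(u, h):
    # 1-D operator -u'' with homogeneous Dirichlet values on [0, 1].
    up = np.concatenate(([0.0], u, [0.0]))
    return (2.0 * up[1:-1] - up[:-2] - up[2:]) / (h * h)

def prolongate(vc, n):
    # Linear interpolation from coarse inner vertices to n fine inner vertices.
    vf = np.zeros(n)
    vf[1::2] = vc                            # coarse vertices carry their value over
    vcp = np.concatenate(([0.0], vc, [0.0]))
    vf[0::2] = 0.5 * (vcp[:-1] + vcp[1:])    # in-between vertices: neighbour average
    return vf

def additive_two_level(f, cycles=300, omega=0.25):
    # Additive scheme: restrict the fine residual to the coarse level, update
    # both levels simultaneously with damped Jacobi, then prolongate and sum.
    n = len(f)                               # n odd; coarse vertices = fine indices 1, 3, ...
    h = 1.0 / (n + 1)
    u = np.zeros(n)
    for _ in range(cycles):
        r = f - apply_laplace(u, h)
        rc = r[1::2]                         # injection to the coarse level
        du_fine = omega * r * h * h / 2.0    # Jacobi step: diag(A_h) = 2/h^2
        du_coarse = omega * rc * (2.0 * h) ** 2 / 2.0
        u = u + du_fine + prolongate(du_coarse, n)
    return u
```

The sum of the per-level contributions replaces the sequential coarse-grid correction of the multiplicative scheme, which is why the finest level is touched in every iteration.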
Additive multigrid algorithms suffer from the fact that the finest level has to be processed in each iteration. They are thus slower than multiplicative schemes for most problems. Nevertheless, there are setups where the additive variant is more robust.
As the multiplicative scheme here is based upon a depth-first traversal, it is possible to replace the naive Jacobi smoother with the additive multigrid scheme. The result is a multiplicative multigrid scheme with a Jacobi smoother that is preconditioned with an additive multigrid. Such an approach sounds promising.
Figure 4.27: Block Gauß-Seidel: If a 3^d patch’s values are available, the algorithm can update one vertex and immediately use the updated value to update the other vertices. Within the block, it is a Gauß-Seidel scheme. At the boundary, the residuals are finally accumulated. Here, it remains a Jacobi smoother.
4.7.2 Block-wise Smoothers
Space-filling curves are the foundation of an efficient realisation of the traversal in Chapter 3. Efficiency refers to both the memory requirements and the memory access characteristics. Besides these two factors, they also prove valuable within the parallelisation context in the upcoming chapter.
Nevertheless, first experiments with the running code highlight a fundamental challenge coming along with the sophisticated traversal order: The traversal consists of a lot of code and a lot of integer arithmetic, which might even slow down the floating point performance. If the code unrolls recursion steps, small blocks of the adaptive Cartesian grid can be processed as regular Cartesian blocks. Such an unrolling leads to the parallelisation approach in the upcoming section, and it allows for sophisticated algorithmic optimisations [23].
Besides optimisation and parallelisation, recursion unrolling enables the solver to handle small sections of the Cartesian grid as regular grid blocks. In these blocks, the information transport restriction for the smoother does not hold anymore, and the solver can employ a more sophisticated scheme. If the code, for example, merges k^d blocks—Chapter 5 actually establishes this merging—the code can update the
vertices within these blocks with a Gauß-Seidel scheme, e.g. (Figure 4.27), whereas the blocks’ boundary update scheme remains a Jacobi scheme. Such an algorithm equals a block Gauß-Seidel smoother. Extending this idea to bigger patches of unrolled recursion steps establishes the basis for a set of smoother improvements: (multicore) Gauß-Seidel schemes and red-black schemes become realisable, several iteration steps can be merged [48], or matrices belonging to the patches can be inverted explicitly and, thus, whole subblocks solved exactly.
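A one-dimensional sketch of this hybrid smoother for −u'' = f; here a block is simply a run of three consecutive vertices, standing in for the 3^d patches of the text, and all names are illustrative:

```python
import numpy as np

def block_gauss_seidel_sweep(u, f, h, block=3):
    # One sweep for -u'' = f with homogeneous Dirichlet values: inside each
    # block the update immediately uses freshly computed values (Gauss-Seidel),
    # while values from outside the current block are taken from the previous
    # iterate, so the coupling between blocks remains of Jacobi type.
    u_old = np.concatenate(([0.0], u, [0.0]))   # previous iterate plus boundary
    u_new = u_old.copy()
    n = len(u)
    for start in range(0, n, block):
        for i in range(start, min(start + block, n)):
            j = i + 1                           # index into the padded arrays
            left = u_new[j - 1] if i > start else u_old[j - 1]
            right = u_old[j + 1]                # not yet updated in this sweep
            u_new[j] = 0.5 * (left + right + h * h * f[i])
    return u_new[1:-1]

# -u'' = 1 on ]0,1[ with 9 inner vertices; the discrete solution is x(1-x)/2.
n, h = 9, 0.1
u = np.zeros(n)
f = np.ones(n)
for _ in range(500):
    u = block_gauss_seidel_sweep(u, f, h)
```

Inside a block the update sequence is the usual lexicographic Gauß-Seidel; only the leftmost vertex of each block and every right neighbour read old values.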
5 Parallelisation
Parallelisation is crucial for simulation codes. At least three observations confirm this statement: First, new computer architectures gain computing power due to an increased level of parallelism. While their clock rate and instruction complexity remain roughly the same from generation to generation, the number of computing units per computer rises, and any application profiting from future architectural improvements has to exploit multiple computing units. As a result, Moore’s law still holds, but it does not apply directly to the (single node) performance [71]. Second,
new simulation codes produce more and more data, i.e. the amount of data rapidly exceeds a single node’s memory. Scientific insights, however, rely on an increasing level of detail. Third, new data is of value if and only if it is delivered within a reasonable time. Consequently, the convenience of a simulation workflow depends on the latency with which data is delivered by a simulation code. The data problem is tackled in Chapter 3. Nevertheless, reducing the memory spent on a PDE solver postpones the data threshold, but it does not resolve the underlying problem. The Peano framework consequently has to run in parallel.
Domain decomposition methods are the predominant approach to parallelise multiscale PDE solvers and make them keep pace with the increasing architectural concurrency [45]. They split up the computational domain into several overlapping or
non-overlapping parts (see [54], e.g., for some remarks and citations on multigrid
solvers with different overlapping and non-overlapping domains), and they distribute
these parts among the computing nodes. The problems on the individual nodes are
small compared to the original problem, i.e. they can be solved faster with less
memory. k-spacetrees induce a multiscale representation of a computational domain, and each level already represents a non-overlapping—I consider exclusively
non-overlapping approaches here—decomposition of the domain into geometric elements. A mapping from the geometric elements to computing nodes yields a domain
decomposition. It is balanced if all computing nodes are assigned the same workload.
For multiscale domain representations where not only data but data with a hierarchical topology are to be split up, two mapping paradigms compete: Some
algorithms analyse the fine grid and distribute the fine grid’s elements among the
computing nodes. These elements then determine to which node coarser elements belong: Whoever holds fine grid elements is a candidate to handle their father
elements, too. If there are multiple candidates, some approaches duplicate the father responsibilities ([50, 55], e.g.)—such an overlap is sometimes referred to as full
domain partitioning—while others employ a decision function selecting one dedicated node ([18, 19], e.g.). This is the approach followed here. Technically, the partitions overlap slightly, but logically they are disjoint. Both realisation variants follow a bottom-up approach balancing the workload on the fine grid. Balancing the fine grid facilitates a fine granularity due to the large number of atomic grid constituents, while coarse partitionings depend on the fine grid partition. Operations on the coarse partitions might not scale. Alternative algorithms distribute a fixed coarse grid among
the computing nodes ([26, 29], e.g.). Descendants of one coarse grid element are
then processed by the same computing node. It is a top-down approach balancing
the work on the analysed coarse grid. This partitioning is usually comparably cheap
to determine. If finer grids result from a uniform refinement, their balancing is good,
too. Otherwise, fine grid distributions might be unbalanced.
Whenever a data decomposition is realised, it either distributes the data exclusively among the computing nodes, i.e. each node has access to its and only its data chunk, or it is based on a shared, common data set. The latter realisations rely on shared memory architectures: it is important that each computing node can access any data at any time. Implementations splitting their data repositories are more general and fit both shared and distributed memory architectures. However, they come with an additional overhead, as they have to synchronise and preserve the distributed data consistency manually. This discussion restricts itself to distributed memory parallelisation realised by a message passing interface. It thus fits both shared memory and distributed memory clusters. To exploit a shared memory architecture without the additional overhead introduced by a logical splitting of the data working set is beyond the scope of this work. Extensions of this thesis nevertheless adapt the Peano framework to shared memory architectures ([23], e.g.).
This chapter’s ideas mirror the recursive top-down traversal, i.e. they start balancing the workload top-down throughout the grid setup on each single level. Later, throughout the simulation run, the fine grid workload is measured bottom-up, and the partitioning/balancing is adapted and optimised. However, with a k-spacetree where individual subtrees are assigned to remote nodes, a depth-first approach is useless, as it descends into subtrees sequentially. I thus combine the Peano traversal with a single-step breadth-first order into the level-wise depth-first order and, from this, introduce a recursive master-worker topology on the computing nodes, i.e. the computing nodes are organised as a tree themselves. While the partitioning is based upon ideas of [31, 40, 50], the combination of a hierarchical assignment with the parallel Peano traversal is—to my knowledge—a new approach to exploit a k-spacetree for a domain decomposition.
As soon as a parallel traversal for a given partition is available, the question arises what a good partition looks like. A good partitioning is at hand if the workload is balanced and if the surface-volume ratio of the individual partitions is small: the
amount of computations per node then is big compared to the amount of data to be
exchanged, as the computational load corresponds to the volume of a subdomain,
whereas the exchanged data volume corresponds to the surface of the subdomain.
Partitions induced by space-filling curves are well-known in the parallel community
for their good surface-volume ratio [17, 31, 42, 44]. Consequently, Peano’s parallelisation benefits from the Peano traversal, although the decomposition and load
balancing ideas are not correlated with the space-filling curve. To balance the workload, the Peano framework applies a bottom-up cost model to the k-spacetree and
derives a top-down optimality metric from this. The model incorporates both the
fine grid workload and the multi-level representation of the grid. Such an approach
equals a tree attribute grammar [46] formalising a discrete optimisation problem on
an acyclic connected graph. It quantifies the balancing.
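Such a bottom-up tree attribute can be sketched as a recursive evaluation over a toy spacetree; the unit cost per element is a placeholder for the actual cost model:

```python
def subtree_workload(tree, node, local_cost=1.0):
    # Bottom-up tree attribute: each element contributes its own (local) cost,
    # and a refined element accumulates its children's workloads on top.
    return local_cost + sum(subtree_workload(tree, c, local_cost)
                            for c in tree.get(node, []))

# Toy (k = 3, d = 1) spacetree: a0 is refined, and so is its child a1.
tree = {"a0": ["a1", "a2", "a3"], "a1": ["b1", "b2", "b3"]}
print(subtree_workload(tree, "a0"))   # 7 elements at unit cost each: 7.0
```

Comparing such subtree attributes top-down yields the optimality metric: a partition is well balanced when the workloads of the subtrees assigned to the workers are (almost) equal.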
As Peano’s grid is allowed to change throughout the computation, the algorithm has to adapt the parallel partitions permanently. Again, different approaches are available for such a dynamic load balancing: My decomposition approach restricts the balancing activities to forks and joins, i.e. a partition can either split up into several partitions, or several partitions can be merged. Such an approach fits the master-worker paradigm mentioned before, as only new master-worker relationships are established or existing masters carry over grid parts from their workers. Workers released by their masters can be re-employed by other nodes. The load balancing permanently compares the nodes with the heaviest workload to the least loaded nodes.
Whenever a node could afford to do the work of one of its workers, it merges with
this worker. Whenever a node slows down the computation, it hires additional
workers. The approach is a greedy algorithm based upon the tree attributes, and
it is decentralised, i.e. the balancing itself runs in parallel and does not become
a sequential bottleneck. The parallel community offers a vast range of different parallelisation paradigms, algorithms, and taxonomies. While it is hard to keep pace with all the ideas coming up—many problems need a tailored solution of their own—it is almost impossible for one thesis to give a comprehensive overview and classification of its own ideas. I thus concentrate exclusively on Peano’s approach and leave it to the reader to classify and compare it to other solutions.
The chapter is organised as follows: At first, the parallel traversal for a given partitioning of the k-spacetree is established (Section 5.1). As the sequential depth-first Peano traversal does not fit the partitioning approach, I extend it. Next, Section 5.2 picks up the fact that space-filling curves, on the one hand, imply quasi-optimal parallel partitions. On the other hand, they simplify the exchange of the data assigned to the vertices on the domain boundaries. Tree attributes quantifying a partitioning are introduced in Section 5.3. Two predicates triggering the forks and joins are then derived from them. Realisation details—the mapping of the computational nodes to the hardware, the concept of a node pool, the buffering of messages, and so forth—in Section 5.4 lead over to a discussion of the interplay of the parallelisation
and the Poisson solver from Chapter 4. Some numerical experiments and a short
outlook close the chapter.
5.1 Parallel Spacetree Traversal
In a domain decomposition algorithm, each node has to traverse its local domain,
and partition boundary operators and the exchange of the subdomains’ boundaries
ensure that the parallel code preserves the sequential semantics of an algorithm.
Typically, the traversal start and the traversal end act as synchronisation points. A decomposition then defines a partial order on the computations, as the traversals are run one after another, but the computations within one traversal run in parallel.
All the algorithms in this thesis are based on an element-wise traversal (Section 2.4) of the k-spacetree. Though the semantics of many algorithms, such as the multigrid approach in Chapter 4, do not rely on the actual order of the k-spacetree’s nodes—they depend solely on the child-father relationship—it proves of great value to arrange the individual nodes within the depth-first traversal along the iterates of the Peano space-filling curve (Section 3.3). The resulting element-wise traversal defines a total order on the nodes of the k-spacetree.
Any total order prohibits a straightforward parallelisation, since each parallelisation exploits the freedom of a partial evaluation order. A total order lacks this degree of freedom, i.e. a depth-first traversal ⊑dfo

  a ⊑child b               ⇒  b ⊑dfo a,
  c ⊑pre d                 ⇒  c ⊑dfo d,      and
  e ⊑child f ∧ f ⊑pre g    ⇒  e ⊑dfo g,      a, b, c, d, e, f, g ∈ E_T,      (5.1)
with ⊑pre giving an order on the children of each refined element is not well-suited
for a parallel algorithm. ⊑dfo here specifies how geometric elements are read the
first time. The store order, in turn, results from the recursive implementation of the
traversal.
Example 5.1. Let k = 3, d = 2 with a1, a2, . . . , a9 ⊑child a0 and ai ⊑pre ai+1, i ∈ {1, . . . , 8}. a1 and a2 are refined. A Peano traversal cannot deploy the work on a1 and a2 to different nodes, as a1 and all its descendants have to be processed before a2 according to the depth-first total order.
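The two orders can be made concrete on a reduced version of Example 5.1 (three instead of nine children per refined element; the helper names are mine):

```python
def depth_first(tree, root):
    # Standard depth-first order: a child's complete subtree is finished
    # before the next sibling is visited.
    order = [root]
    for c in tree.get(root, []):
        order += depth_first(tree, c)
    return order

def level_wise_depth_first(tree, root):
    # Level-wise depth-first order (Definition 5.1): all children of a refined
    # element are visited en bloc before the traversal descends into any of them.
    order = [root]
    def descend(node):
        children = tree.get(node, [])
        order.extend(children)        # the whole patch is available at once
        for c in children:
            descend(c)
    descend(root)
    return order

# a0 is refined into a1..a3; a1 and a2 are refined further.
tree = {"a0": ["a1", "a2", "a3"],
        "a1": ["b1", "b2", "b3"],
        "a2": ["c1", "c2", "c3"]}
print(depth_first(tree, "a0"))            # a0 a1 b1 b2 b3 a2 c1 c2 c3 a3
print(level_wise_depth_first(tree, "a0")) # a0 a1 a2 a3 b1 b2 b3 c1 c2 c3
```

In the level-wise order all children of a0 are at hand before any descent, so the events on a1, a2, and a3 can be triggered together.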
To make the spacetree traversal run in parallel, one has to weaken the processing
order into a partial order. Three observations accompany this idea:
1. The overall child-father relationship has to be preserved. Otherwise, one has to
rewrite the complete multigrid and stack access algorithms, as both concepts
rely on this relationship.
2. The order in which the events are called on the children of one refined element
does not affect the algorithm’s behaviour because of the restriction on the information speed (see Section 2.6). Here, I can weaken the order and parallelise
the individual events.
3. The order of the children of one refined element has to preserve ⊑pre such that
the stack-based grid management still works.
It is not possible to trigger the events of observation two in parallel within a depth-first traversal, as a depth-first traversal never provides all the data of the k^d children of one refined element at one point. I hence elaborate a modified traversal facilitating such a parallelism.
Definition 5.1. The level-wise depth-first order ⊑lw is an order on a k-spacetree with

  a ⊑child b                             ⇒  b ⊑lw a,                          (5.2)
  c ⊑pre d                               ⇒  c ⊑lw d,
  e ⊑child f ∧ f ⊑pre g                  ⇒  g ⊑lw e,      and                 (5.3)
  h ⊑pre i ∧ k ⊑child h ∧ m ⊑child i     ⇒  k ⊑lw m.                          (5.4)

The order ⊑child determines the spacetree’s structure. The order ⊑pre on all children of each refined element makes the traversal deterministic.
The distinction between the level-wise and a standard depth-first order results from the difference between (5.1) and (5.3). The latter makes the traversal visit all children of one node first before it descends further. A refined element’s k^d children thus are available en bloc (as a patch) before the algorithm continues on the subsequent level (Algorithm 5.1). If the tree is processed sequentially, the traversal’s uniqueness is recovered by (5.4), i.e. the traversal again is a total order and—in the case of a space-filling curve defining ⊑pre—follows the Peano space-filling curve. While the traversal still imposes a total order, I consequently weaken the order on the events and can first load all the data, then trigger the events in parallel, and finally store the elements.
The traversal modification deserves two additional remarks. On the one hand, the term level-wise implies that the traversal is a mixture of two traversal paradigms: The overall traversal remains of depth-first type. Within one set of children, though, the traversal equals a breadth-first approach. On the other hand, the technique of recursion unrolling [65] also leads to a level-wise depth-first order. k^d pairs of push and pop operations on the call stack are replaced by one recursion step holding the data simultaneously, i.e. the traversal replaces two levels (one recursion step) with one code fragment. This code fragment then can employ more sophisticated
Algorithm 5.1 Level-wise depth-first traversal without traversal events and vertex handling. The algorithm starts with lwDFS(e0), and the enumeration of the children ei obeys a given traversal order ⊑pre.

  T = (E_T, ⊑child ∈ E_T × E_T, e0 ∈ E_T, V_T)

   1: procedure lwDFS(e)
   2:   echild ← (⊥, ⊥, . . .)                      ⊲ |echild| = k^d
   3:   for i = 1 : k^d do
   4:     echild,i ← ei = popinput(), ei ⊑child e   ⊲ ei stem from the input stack
   5:   end for                                     ⊲ Trigger enterElement events, etc.
   6:
   7:   for i = 1 : k^d do
   8:     lwDFS(echild,i)
   9:   end for
  10:                                               ⊲ Trigger leaveElement events, etc.
  11:   for i = 1 : k^d do
  12:     pushoutput(echild,i)                      ⊲ Store children on output stack
  13:   end for
  14: end procedure
algorithms, as a larger part of the grid—k^d elements and their vertices instead of one single element—is available. Interpreting the level-wise depth-first order as the image of a recursion unrolling motivates that a stack-based grid management for k = 3 still works: the geometric elements are replaced by 3^d patches as atomic elements, and the 2^d vertices within these patches are directly transported from the input stream to the output stream. The stack sequence consequently alters, but the data access pattern for vertices on the patch boundaries remains unchanged.
5.1.1 k-spacetree Decomposition
Triggering up to k^d events in parallel is a poor level of concurrency. Yet, it is a good starting point for a more elaborate scheme. In the following, I first introduce a colouring for the k-spacetree. Each colour later represents one computing node’s responsibilities. While its definition states properties of the colouring, a formal construction is given later on. For the time being, I elaborate a parallel traversal scheme combining a given colouring and the level-wise depth-first traversal. This parallel scheme then, in turn, states the properties of an efficient colouring in Section 5.3.
Let p denote the number of computing nodes. Each node has a number: the rank.
5.1 Parallel Spacetree Traversal
Figure 5.1: A (k = 3)-spacetree for d = 1 with three different colours (left). A
level-wise depth-first enumeration of the same spacetree (right).
Assign each element of a spacetree T an attribute holding the rank number of the
program instance responsible for this element. At the beginning, the attributes'
values are undefined (⊥). For each rank besides 0, choose an arbitrary spacetree
element with marker ⊥ other than the spacetree's root. Set it to the rank, and,
afterwards, assign all unmarked descendants of the element this rank, too. Finally,
all the remaining elements with ⊥ are set to rank 0. Rank 0's partition comprises
at least the spacetree's root element.
The colouring introduces a tree topology on the involved computing nodes if
each node handles one colour (Figure 5.1), and it defines a distributed adaptive grid
hierarchy [61]: each node is responsible for a subtree of the overall k-spacetree, i.e. it
holds a small k-spacetree itself. If subparts of this small k-spacetree are coloured
with a different rank, the decomposition delegates work on these parts to further
computing nodes. The tree topology thus introduces a master-worker relationship:
Rank 0 is a global master not working for anyone. Each other rank is a worker for
one master. Each rank delegating work also acts as master for other nodes.
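The colouring construction and the induced master-worker relation can be sketched as follows. The explicit tree dictionary, the seed choice, and the helper names are illustrative assumptions for this example, not part of Peano's interface.

```python
def colour(children, seeds):
    """Colour a spacetree (element id -> child ids, root is 0): each seed
    element and all its not-yet-marked descendants get the seeding rank;
    every element still unmarked afterwards falls to rank 0."""
    rank = {e: None for e in children}
    for r, seed in enumerate(seeds, start=1):  # one seed per rank 1..p-1
        stack = [seed]
        while stack:
            e = stack.pop()
            if rank[e] is None:
                rank[e] = r
                stack.extend(children[e])
    return {e: (0 if v is None else v) for e, v in rank.items()}

def masters(children, rank):
    """Master of rank r: the rank colouring the parent of r's topmost element."""
    parent = {c: p for p, cs in children.items() for c in cs}
    return {r: rank[parent[e]]
            for e, r in rank.items()
            if e in parent and rank[parent[e]] != r}

children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}
rank = colour(children, seeds=[4, 1])  # rank 1 seeded at element 4, rank 2 at 1
```

Element 4 is coloured before element 1's subtree is flooded, so rank 2 inherits only the unmarked remainder {1, 5}; the resulting relation (rank 1 works for rank 2, rank 2 for the global master) illustrates the tree topology of Figure 5.1.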
5.1.2 Parallel Level-wise Depth-first Traversal
With a decomposition at hand, the recursive traversal becomes a parallel traversal
(Algorithm 5.2, Figure 5.2): The global master starts the spacetree traversal. All
the other nodes wait. On each processor exactly the same recursive program is
executed—a single program multiple data paradigm. The traversal stack automaton loads k^d subelements in each recursion step. Before it performs operations on
the subelements or steps down further, it analyses which subnodes are assigned to
another remote rank. It first informs all the remote ranks to start their traversal. Then, it steps down into the nodes assigned to itself. As remote nodes handle
elements assigned to them as well as their descendants, the local algorithm exclusively steps down into subtrees assigned to itself. As soon as all local subtrees have
been processed, the algorithm waits for all remote nodes to finish. Afterwards, the
Algorithm 5.2 Extension of Algorithm 5.1 exploiting a given k-spacetree decomposition. The result is a parallel, level-wise depth-first traversal.

 1: procedure lwDFS(e)
 2:   e_child ← (⊥, ⊥, . . .),  |e_child| = k^d
 3:   for i = 1 : k^d do
 4:     e_child,i ← e_i = pop_input(), e_i ⊑_child e   ⊲ e_i stem from the input stack
 5:   end for
 6:   . . .                                            ⊲ Trigger enterElement events, etc.
 7:   for i = 1 : k^d do
 8:     if e_i is remote then
 9:       Startup remote k-spacetree with root e_i
10:     end if
11:   end for                                          ⊲ All remote nodes are started up
12:   for i = 1 : k^d do                               ⊲ Do local work on finer levels
13:     if e_i is not remote then
14:       lwDFS(e_child,i)
15:     end if
16:   end for
17:   for i = 1 : k^d do                               ⊲ Local work is done
18:     if e_i is remote then
19:       Wait for remote k-spacetree's result from element e_i
20:     end if
21:   end for
22:   . . .                                            ⊲ Trigger leaveElement events, etc.
23:   for i = 1 : k^d do
24:     push_output(e_child,i)                         ⊲ Store children on output stack
25:   end for
26: end procedure
Figure 5.2: Parallel level-wise depth-first traversal: The traversal starts at the root
element (1). Whenever a remote element is loaded, the traversal starts
up the responsible worker (2). Such remote elements are never refined
locally. Then, the traversal follows the level-wise depth-first order (3-4)
and ascends again (5). If a remote element is passed during the ascent,
the traversal waits for the remote node's results, i.e. for the remote
traversal to finish (6).
algorithm continues to step up.
The description reveals a number of properties of the parallel traversal:
• At the beginning, only one node is active. All the other nodes wait for an
activation.
• The deeper the global master descends in the k-spacetree, the more computing
nodes become involved in the computation.
• The higher the global master ascends in the k-spacetree, the fewer computing
nodes work.
• The individual nodes are synchronised only via their root elements, i.e. as soon
as a worker is activated, it runs completely independently of its master. Finally,
the traversals' termination is synchronised, as the master waits for its workers
to finish.
The first three issues reveal a bottleneck: the startup phase, where the global
master starts the traversal but has not yet activated its workers, which in turn have
to activate their workers. This phase corresponds to a difficulty all multiscale algorithms face: The finer the problem, the more work there is to distribute. In turn,
on its coarsest levels every problem is too coarse to exploit all computational resources.
The synchronisation discussion in the fourth issue reveals that information such
as steering data (current smoother level, refinement policy, and so forth) and grid
properties (number of spacetree elements, information whether the solver has triggered
a refinement, and so forth) is passed through the tree bottom-up and top-down—
information analysis and inheritance [46]. Information inheritance is mirrored by
recursion steps of the traversal automaton and startup calls for remote nodes. Information analysis corresponds to a call stack depth reduction, i.e. return statements
within the recursive block or finish messages sent by the workers to their master,
respectively.
5.1.3 Multiple Top-level Elements
So far, each worker handles one k-subspacetree, i.e. each worker's workload corresponds to a tree identified by one root element. Such an approach leads to scaling
problems, since a master has to wait for all remote nodes whenever it leaves a k^d
patch; the following example as well as Figure 5.3 illustrate this.
Figure 5.3: In a gedankenexperiment, a node has booked two workers 1 and 2 for a
3^d patch (d = 2). (a) If all children of the patch induce a spacetree of the
same depth, and if each worker can handle only one child, the splitting
into three processes results in a bad work balancing, as the workload
of the initial process is reduced by two units, but the workload of the
workers equals one unit. (b) If each worker handles two subtrees, the
work balancing improves (see also Figure 5.4). (c) The best balancing
corresponds to three subtrees per worker. Sphere domain split up among
four nodes (right).
Example 5.2. Let T be a spacetree yielding a regular Cartesian grid for d = 2, k =
3. The height of T is h, p ∈ {2, . . . , 3^d}, and the nine children of the root element
are distributed among the p nodes. Each remote tree has height h − 1, i.e. Σ_{i=0}^{h−1} 9^i
elements. As p − 1 elements on the first level are deployed to remote nodes, the global
master has to descend into (9 − (p − 1)) · Σ_{i=0}^{h−1} 9^i elements before it starts to receive
the remote nodes' finish messages. The workload on the global master is 10 − p
times higher than the remote nodes' workload, i.e. the workload is not well-balanced
if p ≠ 9.
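The imbalance of the example can be checked numerically. The helper below is a sketch with the example's numbers hard-wired (d = 2, k = 3, nine first-level subtrees); the function names are invented for this illustration.

```python
def subtree_elements(height, k_d=9):
    """Elements of a regular subtree: sum of 9^i over its levels."""
    return sum(k_d ** i for i in range(height))

def master_worker_ratio(p, h=4):
    """Example 5.2: p - 1 of the nine first-level subtrees go to remote
    nodes, the master keeps the remaining 9 - (p - 1) of them."""
    per_subtree = subtree_elements(h - 1)
    return ((9 - (p - 1)) * per_subtree) // per_subtree
```

For every p between 2 and 9 the master carries 10 − p times the load of a worker; only p = 9 balances the work.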
[Plot: speedup over the number of nodes (1–9), comparing linear speedup with one to five top-level elements (spacetrees) per rank.]
Figure 5.4: One k-spacetree per node results in a poor load balancing if the number
of computing nodes does not fit to the grid. Here, the grid is regular
(d = 2) with 6.09·105 vertices. The speedup improves with an increasing
number of spacetrees (top level elements) mapped to one rank. Results
stem from the Infinicluster.
The example’s considerations lead to three important insights for regular grids:
1. The number of computing nodes has to fit to the grid to make a domain
decomposition result in a good speedup, i.e. for p = k^d the balancing works
fine, but for p < k^d no good balancing exists.
2. The bigger the spacetree becomes, the stronger the effect of the imbalance due
to a growing height h in the example.
3. For p > k^d, the gedankenexperiment yields the insight that a good decomposition exists only for p = (k^d)^i, i ∈ ℕ. Consequently, the balancing problem worsens
with an increasing number p of computing nodes, i.e. the number of nodes has
to grow exponentially to make the traversal benefit from additional computing
power.
If the algorithm is allowed to map several spacetrees to one rank, this problem can be
weakened (Figure 5.4). Nevertheless, a step pattern for the speedup curve remains,
since the rank with the maximum number of elements or subspacetrees, respectively,
determines the traversal time.
Example 5.3. Let d = 2, k = 3. I study an optimal colouring with four computing
nodes, where the first node (master) has to distribute all nine children of a refined
element among the four participants. Let the traversal furthermore scale linearly with
the number of geometric elements processed. If the workload is split up equally, three
out of four nodes handle two children. The remaining computing node is responsible
for three children and determines the traversal time, if all children induce spacetrees
of the same height. Having only three nodes at hand would lead to the same speedup,
as the speedup correlates to the maximum number of elements per computing node.
The implementation later recognises situations as sketched above, and it uses
only a minimal number of nodes: Assuming that all children induce spacetrees of
the same height, it releases workers that would not improve the performance
further.
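Under the linear-scaling assumption of Example 5.3, this release criterion can be sketched as follows; both function names are invented for this illustration.

```python
from math import ceil

def speedup(subtrees, p):
    """The node with the most subtrees determines the traversal time,
    so p nodes yield subtrees / ceil(subtrees / p)."""
    return subtrees / ceil(subtrees / p)

def minimal_nodes(subtrees, p):
    """Smallest node count achieving the same speedup as p nodes;
    surplus workers are released, assuming subtrees of equal height."""
    return min(q for q in range(1, p + 1)
               if speedup(subtrees, q) == speedup(subtrees, p))
```

For nine children, four nodes achieve the same speedup as three, so the fourth worker is given back.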
5.2 Partitions Induced by Space-Filling Curves
Figure 5.5: Two illustrations of a domain decomposition for a circle domain. On the
right-hand side, the underlying logical topology of the individual subdomains.
Peano’s traversal runs through the individual spacetree levels along the Peano
curve’s iterates. Let the partitioning follow this traversal idea, too: If one node holds
two spacetrees corresponding to two geometric elements e1 and e2 , these elements
should be siblings, and they should be neighbours along the Peano space-filling
curve’s iterate.
As a result, the fine grid partitions resemble subcurves of the Peano iterates (Figure 5.5), and the parallel performance benefits from this. Several factors influence
the efficiency of a parallel algorithm: First, the whole amount of exchanged data
influences the algorithm’s performance, as it defines the bandwidth required for
the communication. Second, the merge of received data with the local data along
the partition’s boundary influences the algorithm’s performance, as it is realised by
additional code. Third, the number and the size of individual messages influence
the algorithm’s performance, as the higher this number, the higher is the impact
of the network's latency and the communication realisation¹. Fourth, the parallel
performance relies on a good load balancing—perhaps the most important aspect.
Finally, additional technical factors such as the logical topology, the load balancing
overhead, and communication latency codetermine the actual speed. The following pages study the realisation aspects one to three, and I neglect the technical
factors. Furthermore, I assume that the tree colouring has already introduced a
perfect balancing.
5.2.1 Quasi-Optimal Partitions
The bandwidth determines the maximum amount of data per time a network can
exchange. If the amount of data for a communication task is known, the bandwidth
hence determines the time required by one node to send the data to another node as
one big bunch. It is common wisdom that clever parallel implementations make
the exchange of data and the computations overlap: The nodes exchange data and,
meanwhile, they continue to compute. The computation on each node in turn continues as long as its input does not depend on remote data not exchanged yet. If this
dependency makes the computation wait, the algorithm is communication-restricted.
The more data is exchanged, the more dominant this restriction becomes. Otherwise,
the complete communication is hidden from the runtime. Communication facilities
in many supercomputers are extremely limited, and many PDE simulation codes
run the risk of becoming bandwidth- or latency-restricted—if the underlying data
sets are scaled up, the amount of exchanged data scales up, too. As a result, the
scaling’s influence on the computing time and the exchanged data is important.
For a given partitioning, each computing node handles one partition, i.e. the
partition’s volume determines the workload (amount of operations) of the node.
¹ Architectures such as the Infinicluster provide, for example, direct memory access mechanisms,
as long as all messages fit into a hardware buffer.
At the same time, the individual ranks have to exchange information along the
partitions’ boundary. The bigger the volume is, the higher is the computational
load. The bigger the surface is, the bigger the amount of exchanged data.
The smaller a partition's surface compared to its volume, the easier it is to make the
communication run in the background of the computation. A good partitioning
yields a big volume-surface ratio [17, 31], often referred to as a high quality coefficient of
the partitions [42], high partition locality [28], or surface-to-volume if it is defined
as a minimisation problem [44].
Hyperspheres exhibit a minimal surface smin with respect to their volume V given
by their radius r(V ). A partition hence is optimal, if it converges to a hypersphere
with decreasing mesh size.
s_min(r) = r^{d−1} · 2π^{d/2} / Γ(d/2),   and

r(V) = (V · Γ(d/2 + 1) / π^{d/2})^{1/d} = V^{1/d}/√π · ((d/2) · Γ(d/2))^{1/d},   i.e.

s_min(V) = V^{(d−1)/d} · C_optimal,   (5.5)

with

C_optimal = ((d/2) · Γ(d/2))^{(d−1)/d} · 2π^{d/2} / (Γ(d/2) · π^{(d−1)/2})
          = 2√π / Γ(d/2) · ((d/2) · Γ(d/2))^{(d−1)/d}.
It is impossible to split up an arbitrary computational domain into disjoint hyperspheres. However, one can achieve a quasi-optimal partitioning yielding subdomains
with a volume-surface ratio as good as the optimal partitioning up to a constant,
i.e. (5.5) holds with a constant C ≥ C_optimal. For meshes, the volume equals the
number of geometric elements assigned to a partition, and the surface equals the
number of boundary vertices.
Theorem 5.1. The partitions induced by the Peano space-filling curve’s iterate on
a regular grid are quasi-optimal.
The proof of the theorem in [17] is two-fold: First, it analyses the fine grid of a
partition. Its surface is bounded by the surface of the bounding box plus the contributions of concave parts of the subdomain. The bounding box corresponds to a
construction step of the iterate. The concave contributions correspond to subsequent
construction steps, and, hence, they can be analysed by a recursive formulation. Second, the proof takes the fine grid surface’s estimate and adds the vertices belonging
to coarser grids of the k-spacetree. It ends up with
s(V) ≤ V^{(d−1)/d} · 4d · 3^{d−1} / (1 − 3^{1−d}).
For the continuous case, i.e. partitions corresponding to an infinitely small mesh width,
the Hölder continuity of the space-filling curve [66] immediately yields an analogous
estimate.
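Both constants are easy to evaluate numerically. The snippet below transcribes my reading of the surface formulas above (so its correctness hinges on that reconstruction) and checks the quasi-optimality relation C ≥ C_optimal for small d.

```python
from math import gamma, pi, sqrt

def c_optimal(d):
    """Hypersphere surface-volume constant of (5.5)."""
    return 2 * sqrt(pi) / gamma(d / 2) * ((d / 2) * gamma(d / 2)) ** ((d - 1) / d)

def c_peano(d):
    """Constant of the Peano partition bound s(V) <= V^((d-1)/d) * c_peano(d)."""
    return 4 * d * 3 ** (d - 1) / (1 - 3 ** (1 - d))
```

A sanity check for d = 2: a circle of area V has circumference 2√(πV), i.e. C_optimal = 2√π ≈ 3.545, while the Peano constant is 36, so the bound indeed holds with C ≥ C_optimal.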
5.2.2 Parallel Peano Spacetrees
A node is allowed to hold several k-spacetrees. To avoid duplicated data and to
work within a homogeneous algorithmic environment, these spacetrees are stored
within one single data structure: Besides the spacetree’s root elements, each node
also holds their common parent. Yet, it does not perform any operation on this
parent, as the parent belongs to the master’s partition. Individual subspacetrees
then integrate seamlessly into one data structure.
Figure 5.6: As soon as a partition is found (left), the algorithm augments the partition's spacetree. First, it adds elements to have a k-spacetree again
(middle). Then, it embeds the whole spacetree into k^d elements (right).
The definition of a k-spacetree comprises the completeness of each level, i.e. each
refined element holds k^d children. To make the integrated spacetree fit this
definition—this avoids a modification of the underlying traversal algorithm—the
spacetree is augmented with the missing geometric elements and vertices. The additional records are set to outside, and, hence, the node does not perform any operation on them². Such an augmentation is common for many approaches (see [54]
and papers cited therein). The induced memory overhead is studied below.
Boundary vertices in the Peano implementation are stored persistently according
to Section 2.7. For the parallel realisation, an additional advantage of this convention
becomes apparent: A k-spacetree representing a subtree also might have “boundary”
vertices that are inside the domain, and, thus, persistent. Their attributes have to
be available in each iteration. Hence, the parallel Peano spacetree is also embedded
into a coarser grid. As a result, all parallel domain boundary vertices are held
persistently on the input and output streams, if a stack-based traversal is chosen.
Vertices stemming from the data structure augmentation are set to outer if they are
not adjacent to any element assigned to the local rank.
² The inside/outside definitions in Section 2.7 have to be altered accordingly.
[Plot: parallel/serial vertex number ratio (1–3.5) over the number of nodes (1–8192).]
Figure 5.7: Memory overhead resulting from the two-step augmentation of the individual node’s k-spacetrees. The figures result from a d = 2 experiment
with a regular grid for a square domain and 6.09 · 105 vertices in the
sequential code.
The two-step augmentation of the spacetree (Figure 5.6) entails a memory overhead. This overhead is bounded, as, first, the partition’s boundary is a submanifold
with a reduced dimension, and, second, the surrounding grid is chosen as coarse as
possible—outer vertices are never refined. The measurements in Figure 5.7 give a
worst case estimate of this overhead for a strong scaling, i.e. the problem size is fixed
and the number of computing nodes is increased. Although the overhead appears
substantial (more than a factor of three for around 8000 nodes), this overhead has
to be broken down to the individual processes, i.e. the overhead per process equals
the overall overhead divided by the number of ranks. In the following discussion, I
hence neglect it.
5.2.3 Vertex Data Exchange
The domain decomposition has to exchange the boundaries of the individual partitions: If a vertex is adjacent to elements belonging to different nodes, this vertex
is held on each node locally. Some of its data such as refinement flags have to be
exchanged to enable the individual subspacetrees to keep the grids and the corresponding data consistent. In the following, I use the terms vertex and the exchanged
data as synonyms.
Figure 5.8: There is no need to re-sort the exchanged vertices: The k-spacetree augmentation and the space-filling curve ensure that the vertices are exchanged in the right order, i.e. the send order equals the output stream's
order.
All actions of the parallel traversal split up into operations belonging to a computational phase followed by a communication phase comprising the remaining actions.
The latter phase exchanges boundary data with all neighbouring nodes: Each vertex
belonging to the subdomain’s boundary is sent to all the nodes holding a copy of the
vertex. Afterwards, the copy sent by all the communication partners is received and
merged into the local copy. The computation phase works exclusively on the local
copy. According to the idea of the communication running in the background of the
computation, both communication and computation phase are merged. A vertex is
sent away as soon as it is written to the output stream (fire-and-forget semantics).
Its remote contributions have to be received when the vertex is again loaded from
the input stream, but there is no need to have a vertex’s data already available when
the subsequent iteration starts.
With a given partitioning for two computing nodes, the vertex exchange pattern
resembles an X-pattern: Both nodes traverse their subspacetree and send data to the
Figure 5.9: For each vertex loaded from the input stream, the algorithm checks
whether this vertex belongs to the partition’s domain. If this is the case,
the vertex is the first vertex in the incoming vertex queue belonging to
the communication partner. Throughout the vertex store process, copies
are sent to the neighbours.
neighbour. The next traversal then combines the local vertex’s state and the received
data (Figure 5.8, 5.9, and 5.12). Such an exchange pattern imposes a restriction
on the information speed: If a vertex’s state is altered, this state transition is not
available at remote nodes before the next iteration. This fits to the discussion in
Section 2.6, where the refinement and coarsening are adapted to the information
speed restriction. No additional effort thus is to be spent on the parallel data
consistency management.
Each node receives data and has to merge the received information into the local
vertex input stream. If, for example, a received vertex holds the flag
refinement-triggered, this flag of the local vertex copy has to be set before it
is passed to the grid traversal. A naive implementation of this merge either takes
the received vertex and searches for the corresponding vertex in the input stream,
or it searches within the queue of received vertices as soon as a boundary vertex is
to be loaded from the input stream. Such a search has O(s log s) complexity, with
s representing the partition boundary's size.
Here, this sorting overhead however disappears completely due to the space-filling
curve and the k-spacetree augmentation (Figure 5.8 and Figure 5.9): All nodes traverse their k-spacetree simultaneously in accordance with the global Peano iterate.
Although the individual iterates and the subdomains do not cover each other, the
order of the boundary vertices on the output stream is globally unique: Each node
has an input and output stream of its own. They are different on each node. However, if a boundary vertex a is stored to the output stream before a boundary vertex
b is stored, this relation holds on all nodes holding both a and b.
Received messages are stored in one queue per communication partner. The queue
is treated as a stack: At the end of an iteration, the queue waits until its size equals the
number of sent messages. Afterwards, it deploys the messages starting with the last
message received, i.e. whenever the traversal reads a vertex from the input stream,
it analyses whether this vertex belongs to the parallel subdomain's boundary. If this
is the case, there are 1 ≤ n ≤ 2^d − 1 remote nodes holding this vertex, too. For each
such node, there is a receive queue. Its copy of the vertex is the message lying on top
of this queue. Whenever the grid traversal writes a vertex to the output stream, the
stack management analyses whether this vertex belongs to the parallel subdomain's
boundary. If this is the case, there are 1 ≤ n ≤ 2^d − 1 remote nodes waiting for
a copy in the next iteration. The local node copies the vertex n times and sends
all the copies to the remote nodes. Afterwards, the vertex is stored on the output
stream.
A straightforward fire-and-forget realisation—the outgoing vertices are sent
individually—often leads to performance breakdowns, as each send entails a certain
overhead resulting, e.g., from network latencies. To tackle this issue, the implementation uses an additional buffer per communication partner. The outgoing messages
are stored within this buffer, and its content is sent away as one huge message as
soon as the buffer is full or the traversal terminates. The receive process also receives
the messages in blocks. As the receiving node does not read the received data before
the subsequent iteration, the buffering strategy is hidden from the exchange process and does not impose any restriction or imply any modification of the exchange
pattern.
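A minimal sketch of such a send buffer is given below; the class and the transmit callback are invented for illustration, whereas the real implementation wraps the message-passing layer.

```python
class BufferedChannel:
    """Fire-and-forget vertex exchange with a per-neighbour send buffer:
    outgoing vertices are collected and shipped as one block once the
    buffer is full or the traversal terminates (sketch, not Peano's code)."""
    def __init__(self, capacity, transmit):
        self.capacity = capacity
        self.transmit = transmit   # callback standing in for the network send
        self.buffer = []

    def send(self, vertex):
        self.buffer.append(vertex)
        if len(self.buffer) == self.capacity:
            self.flush()

    def flush(self):               # also invoked when the traversal terminates
        if self.buffer:
            self.transmit(list(self.buffer))
            self.buffer.clear()

blocks = []
channel = BufferedChannel(capacity=3, transmit=blocks.append)
for v in range(7):                 # seven boundary vertices written this iteration
    channel.send(v)
channel.flush()                    # traversal terminates, remainder is shipped
```

Since the receiver does not read the data before the subsequent iteration, shipping blocks instead of single vertices changes nothing in the exchange pattern.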
5.2.4 Parallel Boundary Vertex Realisation
The preceding section defines the vertex exchange pattern. Yet, it lacks a concrete
realisation, and, in particular, it does not define how to identify whether a vertex belongs to the partition's boundary, which ranks' partitions are adjacent, and how this
adjacency information is altered whenever the colouring of the k-spacetree changes.
Due to the element-wise traversal, the automaton holds only a small number of geometric elements at a time, and it cannot derive adjacency information from the
elements—particularly, elements being neighbours but not siblings are never available at the same time, i.e. a partition border in-between such elements cannot be
detected. The boundary information thus is stored within the vertices.
Domain Adjacency Lists
Each vertex holds two domain adjacency lists

thisLevel : V_T × {0, . . . , 2^d − 1} → ℕ₀ ∪ {⊥},   and
subLevel : V_T × {0, . . . , 2^d − 1} → ℕ₀ ∪ {⊥}.
Figure 5.10: For each vertex (marked), thisLevel holds the ranks of the nodes responsible for the adjacent geometric elements (top). subLevel holds
this information for the subsequent level (bottom).
thisLevel stores the rank of the nodes responsible for the 2^d elements that are
adjacent to the vertex. Its entries are lexicographically enumerated. An entry
⊥ denotes that the rank is unknown or unimportant. subLevel holds the same
information for the vertex at the same position on the subsequent level: if a vertex
becomes refined, subLevel gives the new vertex's thisLevel mapping. Consequently,
both lists are related yet not equal if any adjacent element is refined and holds a
child assigned to a different rank.
Let rank denote the number (rank) of the computing node processing a vertex. The
following facts either derive from the semantics of the two mappings or fit naturally
into the concept:
• v ∈ H_T implies that the vertex does not hold information. Ergo, there is
also no adjacency information stored: ∀i ∈ {0, . . . , 2^d − 1} : thisLevel(v, i) =
⊥ ∧ subLevel(v, i) = ⊥.
• ∀i ∈ {0, . . . , 2^d − 1} : thisLevel(v, i) = rank implies that the vertex v is located
within the partition, i.e. it is surrounded by elements assigned to rank. It does
thus not belong to the partition's boundary.
• ∃i ∈ {0, . . . , 2^d − 1} : thisLevel(v, i) ≠ rank ∧ thisLevel(v, i) ≠ ⊥ implies that
the vertex v belongs to the partition boundary, i.e. whenever rank stores the
vertex on the output stream, it has to send a copy to node thisLevel(v, i).
In turn, whenever the stack delivers the vertex, there is also a copy from node
thisLevel(v, i) available in the message queue.
• ∀i ∈ {0, . . . , 2^d − 1} : thisLevel(v, i) ≠ rank implies that vertex v is not an
element of rank's domain. Thus, the algorithm resets the vertex's lists:
thisLevel(v, ·) ← (⊥, ⊥, . . .) and subLevel(v, ·) ← (⊥, ⊥, . . .).
• ∀i ∈ {0, . . . , 2^d − 1} : thisLevel(v, i) ≠ rank furthermore implies that ¬P_refined(v).
The invariant

∀i ∈ {0, . . . , 2^d − 1} : thisLevel(v_i, 2^d − 1 − i) = rank ∨ thisLevel(v_i, 2^d − 1 − i) = ⊥   (5.6)

holds for all the 2^d non-hanging vertices of an element e processed by node rank.
i here enumerates the element's vertices lexicographically. Whenever an element is
deployed to a new worker rank_worker, all its entries update according to

∀i ∈ {0, . . . , 2^d − 1} : subLevel(v_i, 2^d − 1 − i) ← rank_worker,

and the vertices of the finer levels follow the inheritance Algorithm 5.3. For vertices
inside a partition, thisLevel and subLevel have the same content.
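The adjacency bookkeeping can be mimicked with plain lists. The class below is an illustrative sketch (d = 2, the marker ⊥ modelled as None), not Peano's vertex record.

```python
BOTTOM = None  # stands in for the undefined marker ⊥

class Vertex:
    """thisLevel holds the rank responsible for each of the 2^d adjacent
    elements, subLevel the same for the vertex copy on the next finer level."""
    def __init__(self, d):
        self.this_level = [BOTTOM] * 2 ** d
        self.sub_level = [BOTTOM] * 2 ** d

    def is_boundary(self, rank):
        """Some adjacent element belongs to a different, known rank."""
        return any(r not in (rank, BOTTOM) for r in self.this_level)

    def neighbour_ranks(self, rank):
        """Ranks that must receive a copy when the vertex is stored."""
        return {r for r in self.this_level if r not in (rank, BOTTOM)}

v = Vertex(d=2)
v.this_level = [0, 0, 1, 1]  # two elements local to rank 0, two held by rank 1
```

Storing v on rank 0's output stream hence triggers one copy towards rank 1; rank 1 symmetrically sends its copy towards rank 0.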
Vertex Merge
The definition of the adjacency lists finally allows to formalise the merge process
(Algorithm 5.5). The merge of local vertices and data from remote nodes consists
of three steps:
1. If a neighbouring node has booked a new worker, or if the neighbour has
merged with its master, the local node has to update the local adjacency
lists, as the tree colouring has changed. Otherwise, the local node would send
the vertex to the old, invalid neighbouring process at the end of the iteration.
Entries of the adjacency list may differ from the local entries if and only if they
hold the remote node’s rank, i.e. each node is responsible for “its” adjacency
entries. If entries in the receive queue differ from the local entries, the local
node updates its vertices.
2. If the remote vertex has triggered a refinement, the refinement has to be
triggered on the local node, too. As a result, the state refinement-triggered
is set, and the grid structure’s consistency is preserved. For the coarsening,
the same argument holds.
3. Finally, the merge process has to merge the PDE-specific data. Section 5.5
for example elaborates this part of the merge process for the Poisson multigrid
algorithm.
With the realisation details written down, Peano can handle distributed spacetrees. The boundary exchange and data synchronisation fit smoothly into the stack
concept proposed in Chapter 3, and it is clear that Peano’s spacetrees represent
quasi-optimal partitions if the workload is balanced. Balancing thus is the missing
link addressed by the upcoming section.
Algorithm 5.3 A new vertex's adjacency information derives from the coarser
level's vertices (Figure 5.10). vertices holds the coarse element's vertices, and
coordinates defines the new vertex's position within a k^d patch. The operation
relies on a recursive operation with the same name and an extended signature.

derive : V_T^{2^d} × {0, . . . , k − 1}^d × V_T → V_T
derive : V_T^{2^d} × {0, . . . , k − 1}^d × V_T × {−1, . . . , d − 1} → (ℕ₀ ∪ {⊥})^{2^d}

 1: procedure derive(vertices, coordinates, result)
 2:   thisLevel(result) ← derive(vertices, coordinates, result, d − 1)
 3:   subLevel(result) ← derive(vertices, coordinates, result, d − 1)
 4: end procedure
 5: procedure derive(vertices, coordinates, result, dim)
 6:   if dim = −1 then
 7:     coarseCoord ← (0, 0, . . .)
 8:     for all i ∈ {0, . . . , d − 1} ∧ coordinates_i = k − 1 do
 9:       coarseCoord_i ← 1
10:     end for
11:     return subList(vertices_coarseCoord)
12:   end if
13:   if coordinates_dim = 0 ∨ coordinates_dim = k − 1 then
14:     return derive(vertices, coordinates, result, dim − 1)
15:   end if
16:   smallCoord ← coordinates ∧ bigCoord ← coordinates
17:   smallCoord_dim ← 0 ∧ bigCoord_dim ← k − 1
18:   smallList ← derive(vertices, smallCoord, result, dim − 1)
19:   bigList ← derive(vertices, bigCoord, result, dim − 1)
20:   return mergeTwoLists(smallList, bigList, dim)     ⊲ see Algorithm 5.4
21: end procedure
Algorithm 5.4 Helper for Algorithm 5.3.

mergeTwoLists : A × A × {0, . . . , d − 1} → (ℕ₀ ∪ {⊥})^{2^d}
A := {0, . . . , 2^d − 1} → ℕ₀ ∪ {⊥},   i ∈ ℕ₀^d

 1: procedure mergeTwoLists(smaller, bigger, axis)
 2:   result ← (⊥, ⊥, . . .)
 3:   for all i ∈ {(0, 0, . . .), . . . , (1, 1, . . .)} ∧ i_axis = 0 do
 4:     iOpposed ← i
 5:     iOpposed_axis ← 1
 6:     tmp ← smaller_iOpposed
 7:     if smaller_iOpposed ≠ bigger_i then   ⊲ Invariant (5.6) for coarser element
 8:       if smaller_iOpposed = ⊥ then
 9:         tmp ← bigger_i
10:       end if
11:       if bigger_i = ⊥ then
12:         tmp ← smaller_iOpposed
13:       end if
14:     end if
15:     result_i ← tmp ∧ result_iOpposed ← tmp
16:   end for
17:   return result
18: end procedure
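A Python transcription of Algorithm 5.4 may clarify the index juggling; representing an adjacency list as a dict over binary multi-indices is an assumption of this sketch.

```python
from itertools import product

def merge_two_lists(smaller, bigger, axis, d):
    """Combine the adjacency lists of the two face vertices at coordinate 0
    and k-1 along `axis`. A list maps each binary multi-index (one of the
    2^d adjacent elements) to a rank or None (standing in for ⊥)."""
    result = {i: None for i in product((0, 1), repeat=d)}
    for i in product((0, 1), repeat=d):
        if i[axis] != 0:
            continue
        i_opp = tuple(1 if a == axis else c for a, c in enumerate(i))
        tmp = smaller[i_opp]
        if smaller[i_opp] != bigger[i]:   # invariant (5.6) on the coarser element
            if smaller[i_opp] is None:
                tmp = bigger[i]
            if bigger[i] is None:
                tmp = smaller[i_opp]
        result[i] = tmp
        result[i_opp] = tmp
    return result

# d = 1: the new vertex lies between face vertices `smaller` and `bigger`;
# one of bigger's entries is still undefined and is filled from smaller.
smaller = {(0,): 0, (1,): 1}
bigger = {(0,): None, (1,): 2}
merged = merge_two_lists(smaller, bigger, axis=0, d=1)
```

Both halves of the new vertex's list agree, mirroring that the new vertex inherits the rank of the element in-between the two face vertices.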
5.3 Work Partitioning and Load Balancing
The preceding pages introduce a colouring that fixes the partitions of the k-spacetree, make the partitions fit the k-spacetree idea again, introduce the concept of cuts along the Peano space-filling curve, and discuss the realisation of the vertex exchange. Yet, they do not explain how to determine the tree decomposition. This section derives a splitting algorithm that fits the level-wise depth-first traversal and balances the workload among the computing nodes.
Classical load balancing algorithms focus on a homogeneous load distribution, i.e. they try to assign all nodes of a parallel machine the same workload: Then, no node is a bottleneck and the parallel speedup is maximised. The approach here is twofold and adds an additional reasoning: On the one hand, it balances the workload. On the other hand, external factors such as the grid structure, the load distribution overhead, and so forth often determine how many nodes can be employed economically, i.e. using more nodes does not improve the performance. In such a case, the concept here tries to use as few nodes as possible, leaving the remaining nodes idle. The rationale is that the grid changes throughout the computation, i.e. the load balancing has to be redone permanently. Having idle nodes at hand simplifies this rebalancing.

Algorithm 5.5 If a remote vertex and a local vertex are merged, the adjacency lists and the refinement flags have to be updated. Thereby, each node is allowed to modify only its own partition.

mergeWithNeighbour : VT \ HT × VT \ HT × ℕ0 ↦ VT \ HT

 1: procedure mergeWithNeighbour(local, neighbour, rankOfNeighbour)
 2:   for all i ∈ {0, …, 2^d − 1} do
 3:     if thisLevel(local, i) = rankOfNeighbour then
 4:       thisLevel(local, i) ← thisLevel(neighbour, i)
 5:       subLevel(local, i) ← subLevel(neighbour, i)
 6:     end if
 7:   end for
 8:   if ¬Prefined(local) ∧ Prefinement triggered(neighbour) then
 9:     set Prefinement triggered(local)
10:   end if
11:   if Prefined(local) ∧ Pcoarsening triggered(neighbour) then
12:     set Pcoarsening triggered(local)
13:   end if
14:   return local
15: end procedure
The non-functional properties of the algorithms are twofold, too: As the grid
changes permanently, the algorithm implements an on-the-fly load balancing, i.e. the
grid is permanently rebalanced. Alternative algorithms stop the computation from
time to time and redistribute the complete workload. Consequently, they require a
heuristic or a rule of thumb to decide when it is worth stopping the simulation. My approach
gets along without stopping the computation. Next, as the traversal and multigrid
algorithm never set up global data structures, the distribution algorithm also refrains
from a global data structure. Instead, the whole load balancing is integrated into
the element-wise traversal, and the decisions are made locally, i.e. the load balancing
runs in parallel, too.
The balancing’s design follows a greedy paradigm:
• It assumes that a good k-spacetree decomposition is already available for a
given, static k-spacetree. If the spacetree consists of only one element, such a
decomposition is trivial, as it employs only one node.
• It assumes that (enough) idle nodes are available.
• Whenever an algorithm on a node triggers a refinement somewhere in the
grid, the algorithm analyses whether this refinement introduces a new global
bottleneck, i.e. whether the refinement slows down the traversal. If this is the
case, it takes an idle node and adds it as new worker to the refining node. As a
result, the bottleneck is eliminated a priori, before the grid is actually refined.
• Whenever a master-worker relationship in the tree decomposition could be
removed without making the master slow down the overall traversal, the master and the worker are merged. Thus, the merged worker becomes idle and is
available for further decompositions.
5.3.1 Weight Attribute
The balancing requires a cost model. Before a redistribution concept is elaborated,
the algorithm has to be able to measure the time spent on the traversal. It is
a straightforward idea to assign each geometric element a weight representing an
abstract cost function. This weight mirrors the element-wise traversal and a linear
time assumption: If the number of geometric elements is doubled, the time per
traversal doubles, too. This linear cost model in turn is validated by the (almost)
constant time per vertex in Section 3.6.
Definition 5.2. The weight of a geometric element represents the time the traversal
spends in this element and all the descendants of the element.
Let w : ET ↦ ℕ+ denote the weight with

    Cweight := 1,                                                        (5.7)
    w(e) = Cweight   ∀e ∈ ET , e is a leaf, i.e. ¬Prefined(e),           (5.8)

    w(p) = ⎧ Cweight + wremote(p)   if Pwait(p)
           ⎩ Cweight + wlocal(p)    if ¬Pwait(p),                        (5.9)

    wremote(p) = max {w(c) : c ⊑child p, c remote},                      (5.10)

    wlocal(p) = Σ {w(c) : c ⊑child p, c not remote},  and                (5.11)

    Pwait(p) = wremote(p) ≥ wlocal(p).                                   (5.12)
The weight model is homogeneous, i.e. the computational cost per geometric element is constant (5.7). If this assumption is not valid anymore, the constant Cweight has to be replaced by a function modelling the computational load. A leaf’s weight is prescribed (5.8), whereas a refined element’s weight results from the computational load of the element itself and all the descendants (5.9): First, let wlocal denote the time spent in all the descendants (5.11). Second, w denotes for each remote node the weight, i.e. the time the worker spends in this element³. As all the workers’ traversals run in parallel, the remote time equals the maximum of the remote runtimes (5.10). Third, a node either needs longer than all the remote nodes to process its own subelements, or it has to wait for a remote node. In the latter case, predicate Pwait in (5.12) holds. Finally, the weight of a refined element with remote subelements is dominated either by the workers’ runtime (Pwait holds) or by the local traversal (Pwait does not hold). Whenever a node is responsible for all the children of a refined element, the element’s weight w equals the sum of the runtime spent within the children plus one cost unit for the node itself (5.11). The attribute w can be evaluated within the traversal’s bottom-up steps. It is synthesised.

Figure 5.11: The weight and the δ attribute on a k-spacetree split up into four parts.
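Equations (5.8)–(5.12) translate directly into a bottom-up tree walk. The Python sketch below is illustrative only; the Element class and its fields are made-up stand-ins, not types from the Peano code.

```python
C_WEIGHT = 1  # homogeneous cost model (5.7)

class Element:
    """Made-up stand-in for a spacetree element."""
    def __init__(self, children=(), remote=False):
        self.children = list(children)
        self.remote = remote  # subtree handled by another computing node

def weight(e):
    """Synthesised attribute w, cf. (5.8)-(5.12)."""
    if not e.children:  # leaf (5.8)
        return C_WEIGHT
    w_remote = max((weight(c) for c in e.children if c.remote), default=0)  # (5.10)
    w_local = sum(weight(c) for c in e.children if not c.remote)            # (5.11)
    p_wait = w_remote >= w_local                                            # (5.12)
    return C_WEIGHT + (w_remote if p_wait else w_local)                     # (5.9)
```

A refined element with four local leaves hence has weight 1 + 4 = 5; if one child is a remote subtree of weight 5 while the local children only sum up to 3, Pwait holds and the weight becomes 1 + 5 = 6.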
5.3.2 δ Attribute
Load balancing is a discrete optimisation problem. This thesis follows a local greedy
algorithm: it improves the load distribution on each node locally, permanently evaluating for each node whether it is a bottleneck or whether it is allowed to do
more work without becoming a bottleneck. The weight attribute gives the computational load per subtree, while the wait predicate identifies bottlenecks: If the wait
predicate holds, a worker thwarts its master. Yet, the weight lacks the information of how much additional work could be added without slowing down the algorithm—it supports a bottleneck analysis, but does not identify bottlenecks in advance. If an element refines, additional work is added, and it is important to know whether this additional work would slow down the application.

³ If several remote spacetrees are assigned to one node, w has to be modified accordingly.
Definition 5.3. The attribute δ for a geometric element identifies how many additional work units can be spent on the element and all its successors without making
the corresponding traversal a bottleneck.
Let δ : ET 7→ Z give a runtime difference for each node. It defines how many
additional cost units (geometric elements) could be spent on the element and its
successors without making the worker responsible for the element a bottleneck for
its master. If δ < 0, the worker is already a bottleneck. If δ = n > 0 holds, the
worker will not become a bottleneck if up to n elements are added to the subtree
identified by the node. δ is an inherited attribute. It is computed throughout the
top–down steps and uses the weight attribute w of the preceding traversal:
    δ(e0) = −1   for the spacetree’s root, and                           (5.13)

    ∀c, p ∈ ET , c ⊑child p :

           ⎧ δ(p) + w(p) − w(c) − 1   c and p processed on different nodes
    δ(c) = ⎨ δ(p) + w(p) − w(c) − 1   same node ∧ Pwait(p)
           ⎩ δ(p)                     same node ∧ ¬Pwait(p).
Attribute δ tells the algorithm where a fork would improve the overall runtime of
the traversal, as the root node is assigned a negative δ in (5.13). Whenever a child
and its parent are handled on the same node, and the parent does not have to wait
for another remote node, the child has the same δ as the parent: If the parent may
spend n additional work units on the grid, the child is also allowed to do so—the
parent is already refined and can not get additional work. If the parent is too slow by
n work units and slows down the whole application, the traversal on the subtree has
to speed up. The parent itself can not speed up, as it has k d children to be processed.
Thus, it delegates the speed up requirement to its children. Whenever a child and
its parent are handled on the same node, and the parent has to wait for another
remote node, the child node can gain weight without slowing down the traversal.
Whenever a child and its parent are handled on different nodes, the child’s node is
a worker of the parent’s node. In both of the latter cases, the difference between
the parent’s weight and the child’s weight equals the additional cost the traversal
could spend on the child. The decrement eliminates the weight of the parent node
from the δ calculation. A combination of both attributes now enables each node to
decide recursively whether it would or does benefit from additional workers handling
subtrees. This strategy is elaborated in the following two sections.
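The inherited attribute can be sketched as a single update rule applied in the top-down steps; the function below is an illustration only, with the parent’s values passed in explicitly (the names are made up).

```python
def delta_child(delta_p, w_p, w_c, different_nodes, p_wait):
    """Delta of a child from its parent's delta; the root receives -1 (5.13).

    different_nodes: child and parent are processed on different computing nodes.
    p_wait: the parent has to wait for a remote worker (predicate P_wait).
    """
    if different_nodes or p_wait:
        # the parent's slack minus the child's weight; the decrement removes
        # the parent's own cost unit from the budget
        return delta_p + w_p - w_c - 1
    return delta_p  # same node, no waiting: inherit the parent's budget
```

With δ(p) = −1, w(p) = 10 and w(c) = 4 on different nodes, the child may grow by δ(c) = 4 work units before it thwarts its master.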
5.3.3 Fork Predicate and the Node Pool
My implementation makes use of a worker pool. At the beginning, all the computing
nodes except the initial one processing the spacetree’s root are idle workers assigned
to the worker pool. Whenever a computing node wants to delegate work, it tries to
book a worker from the worker pool.
If a leaf l, l ⊑child p, is to be refined due to a refinement-triggered flag and
δ(l) = n ≤ k^d holds, the refinement would slow down the overall computation. Throughout the descent, the traversal then determines in element p how many of p’s leaves would refine. Afterwards, it tries to book up to k^d − 1 workers from the node pool in order to distribute those leaves among the new workers along the Peano
iterate. If the worker pool is empty, the computing node has to do the work for the
new leaf itself.
Example 5.4. Let d = 3, k = 2. Six out of eight children of a refined element
want to refine, and the refined element’s δ-attribute is negative. The refined element
is already a bottleneck of the overall computation, and the additional refinements
would worsen this bottleneck. The responsible node thus tries to book five additional
workers—deploying all refined elements to workers would make the node holding
the refined element an idle node, as it has to wait for its workers. Instead of five
workers, the node pool delivers only three workers. The first worker is assigned the
first two refining leaves along the Peano iterate, the second worker is assigned the
leaves three and four along the Peano iterate, the third worker is assigned the fifth
element, and the master itself handles the remaining sixth refined element.
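The distribution in Example 5.4 can be sketched as an even chunking of the refining leaves along the Peano order; the even-chunk rule and all names are assumptions for illustration, not the literal booking logic.

```python
def distribute_refining_leaves(leaves, workers, master):
    """Split refining leaves (already in Peano order) among the booked
    workers; the master keeps the last chunk itself (cf. Example 5.4)."""
    handlers = list(workers) + [master]
    n, m = len(leaves), len(handlers)
    assignment, start = {}, 0
    for j, handler in enumerate(handlers):
        # earlier handlers along the curve receive the bigger chunks
        size = n // m + (1 if j < n % m else 0)
        assignment[handler] = leaves[start:start + size]
        start += size
    return assignment
```

For the six refining leaves and three delivered workers of Example 5.4, this yields chunks of sizes 2, 2, 1 for the workers and leaves the sixth element to the master.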
Since the master deploys exclusively the refining elements along the Peano curve, the resulting remote partitions may be disconnected: If the non-refining elements are refined later throughout the computation, for example due to an a posteriori error estimator, the domain decomposition ends up with disconnected fine grid partitions. While one can avoid this by deploying solely connected refined elements, I prefer a balanced partitioning to a connected one in the experiments.
Definition 5.4. The fork predicate for a spacetree leaf holds, if a refinement of the
leaf would make the corresponding node a bottleneck.
The predicate is given as

    ∀e ∈ ET : Pfork(e) = ⎧ ⊤   if ¬Prefined(e) ∧ δ(e) ≤ k^d
                         ⎩ ⊥   else.
As it depends only on δ, the algorithm computes the predicate on-demand.
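As a one-line sketch (a hypothetical helper, not from the implementation), the predicate reads:

```python
def p_fork(refined, delta_e, k, d):
    """Fork predicate of Definition 5.4: holds for an unrefined leaf whose
    budget delta_e could not absorb the k**d new elements of a refinement."""
    return (not refined) and delta_e <= k ** d
```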
While an application’s performance typically motivates the parallelisation, many
applications also suffer from hard memory restrictions. In such a case, the fork
predicate should also evaluate an upper memory bound, i.e. if the process would exceed a given memory threshold, the fork predicate holds, too. Such an augmented fork predicate makes the load distribution fit the memory available on the individual nodes, i.e. a node never exceeds the main memory due to a refinement, and it improves the applicability and stability for long-running, real-world simulations.
Such an argumentation, though, is seldom found in the literature (yet [43], e.g., picks it up).
5.3.4 Join Predicate
The fork predicate makes the parallel traversal as fast as possible. In addition, the
balancing shall use as few nodes as possible. Such a property is not an end in itself: On the one hand, it reduces the exchanged data per iteration. On the other
hand, it ensures that the node pool always holds as many idle nodes as possible.
The following pages elaborate when nodes can be freed without harming the parallel
performance.
Let there be a join predicate: If a node p has delegated the work of a child e and, thus, a complete subtree to a worker, it is the master of this worker. If Pjoin(p, e) holds, it is a lazy master, i.e. it could take over the job of the worker without becoming a bottleneck for its master itself.

    ∀p, c ∈ ET , c ⊑child p :

                  ⎧ ⊤   p, c processed on different nodes ∧ Pwait(p) ∧
                  ⎪       wlocal(p) + w(c) < w(p) + δ(p)
    Pjoin(p, c) = ⎨ ⊤   p, c processed on different nodes ∧ ¬Pwait(p) ∧
                  ⎪       wlocal(p) + w(c) < w(p) − w(c) + δ(p)
                  ⎩ ⊥   else.
Definition 5.5. A lazy master is a node that delegates work to workers but could
do this work without becoming a bottleneck. For the root nodes of workers working
for a lazy master, the join predicate holds.
As Pjoin depends on the tree attributes w and δ, the algorithm computes it on-demand. w and δ are always kept up-to-date. Whenever a node is refined, the fork
condition is evaluated, and, thus, new workers might come into play. Whenever the
join predicate holds for a node, the two corresponding spacetrees are merged, and
the decruited computing node is reassigned to the worker pool.
If a k-spacetree’s data is moved from a node to another node, it is always only from
worker to master. This simple data movement paradigm preserves the tree topology
on the computing nodes, and only complete subtrees are exchanged. The algorithm
here does not allow for the rebalancing of work packages smaller than a complete subtree.
While the join predicate evaluates the runtime behaviour, memory also can affect
join decisions in several ways: It is not reasonable to join a worker with its master, if
the master would run out of memory because of this join. A join predicate evaluating
the memory requirements besides the runtime behaviour is the counterpart of a fork
predicate splitting up processes because of limited memory. The size of the worker
also determines the amount of data required to join two partitions. As subdomains
have to be moved from the worker to the master, a join can slow down the overall
iteration due to bandwidth restrictions. In such a case, it is reasonable to restrict
the number of joins to a fixed, small number—either globally or per master.
Although the fork and join predicate balance the spacetree according to the analytical model, the per-iteration evaluation of the predicates can introduce lots of
forks and joins that finally prove to be unnecessary. For a less aggressive balancing,
the experiments on the one hand make the join predicate evaluate the refinement
state of the worker: If a worker refines any of its leaves, it is never merged into its
master’s partition. As a result, joins occur exclusively for invariant subtrees. On the
other hand, the join predicate’s left-hand side is increased by a fixed constant, i.e. the
join is performed if and only if the join would not introduce a bottleneck, even if
the subdomain were bigger by this constant. The penalty constant mirrors the overhead introduced by a join and prohibits some joins that would become unfortunate
because of the worker’s tree changing later throughout the computation.
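Combining the predicate with the penalty constant yields the following sketch; the parameter names are made up, and penalty models the fixed constant added to the left-hand side (penalty = 0 recovers the plain predicate).

```python
def p_join(different_nodes, p_wait, w_local_p, w_c, w_p, delta_p, penalty=0):
    """Join predicate (Definition 5.5): may the master take over the
    worker's subtree c without becoming a bottleneck itself?"""
    if not different_nodes:
        return False
    if p_wait:
        return w_local_p + w_c + penalty < w_p + delta_p
    return w_local_p + w_c + penalty < w_p - w_c + delta_p
```

A positive penalty hence suppresses joins that would just barely avoid a bottleneck and would likely be undone after the next grid change.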
Although the domain splittings follow the Peano space-filling curve in this thesis,
partitions assigned to one computing node can be disconnected. On page 182, I
explain this fact with the interplay of the fork process and a dynamic refinement
criterion. The merge process can also introduce disconnected partitions, as the join
predicate does not check if the worker’s and the master’s fine grid partitions are
connected. While one can incorporate such a check into the join predicate, I prefer
an aggressive join behaviour to have as many idle workers as possible and, hence,
accept disconnected partitions.
5.3.5 Join Process
The vertex merge rules in Section 5.2.4 exhibit how the vertices’ adjacency lists and
refinement flags are updated whenever the grid structure or the spacetree decomposition change. Yet, the rules do not work for partition joins: If a master and a
worker partition join, their boundary vertices are given the new (master’s) adjacency
information. At the same time, the neighbours though send vertices to the merging
worker.
To resolve this issue, the join is a two-step process (Figure 5.12). In the first
traversal, the adjacency information of the worker’s partition boundary is updated,
and the algorithm sends away the updated vertices. In the second traversal, the
worker receives the neighbour’s data, merges it into the local spacetree, and, finally,
sends its records to its master. Meanwhile, the neighbours update their adjacency
lists and send their data to the master instead of the merged worker.
Figure 5.12: Vertex exchange pattern. Each join lasts two iterations.
5.4 Node Pool Realisation
A decentralised load balancing as proposed in the preceding sections has many advantages: As there is no central load balancing instance, this instance can not become a bottleneck. Furthermore, the load balancing exhibits exactly the overall algorithm’s scaling. The only precondition is that the node pool answers requests for workers immediately, i.e. the node pool may not slow down the application. Thus, my
implementation deploys the node pool administration to a process of its own.
Yet, the worker booking process remains a competition among the individual
nodes, and the greedy fork often leads to unbalanced partitions⁴: Whenever the traversal comes across n refining leaves introducing a bottleneck, it tries to book n ≤ k^d − 1 workers. It does not make sense to book k^d workers, as the traversal has to wait for the workers throughout the ascent anyway. Thus, the node handles at least one refined element itself. Since the final grid layout is not known a
priori—it might in fact change all the time—the algorithm does not yield an optimal
partition, but permanently rebalances the partitions and, in general,
never finds the optimal partitioning. This drawback holds for any greedy algorithm.
Yet, an additional problem arises from the approach: If the node pool processes the
worker requests on a first-come first-served (FCFS) basis, nodes with a slow network
connection or a worker request that occurs late throughout the traversal are inferior.

⁴ With the number of nodes going to infinity, this drawback disappears, as each request can be fulfilled.
A more sophisticated node pool with a fair answering strategy thus waits,
• until a certain time has elapsed (timeout),
• until each working node has posted a request, or
• until no further idle nodes are available.
It then permutes the worker requests and favours nodes that have not booked workers
before. In my implementation, a simple history holds the sequence of preceding
requests, and the node pool sorts the request queue accordingly.
The timeout ensures that all nodes have a chance to post their requests, i.e. the
race among the individual nodes is softened. Yet, the node pool starts to answer
immediately, if the request queue’s size equals the number of working nodes, or if
there are no idle nodes left. In the latter case, it does not make sense to make any
node wait, since the request will be answered negatively anyway. If the queue’s size
equals p, all the working nodes already wait for an answer, i.e. no additional requests are
on the way.
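The fair answering strategy can be sketched as follows; the class is a simplified, sequential stand-in for the node pool process (names invented), and the history-based sort realises the favouring of nodes that have not booked workers recently.

```python
from collections import deque

class FairNodePool:
    """Sketch: collect requests (until a timeout fires, all working nodes
    have posted, or no idle nodes remain), then serve the least-recently
    served requesters first."""

    def __init__(self, idle_nodes):
        self.idle = deque(idle_nodes)
        self.history = []  # ranks in the order they received workers

    def _last_served(self, rank):
        reversed_history = self.history[::-1]
        # never-served ranks get the biggest value and are served first
        return reversed_history.index(rank) if rank in reversed_history else len(self.history)

    def answer(self, requests):
        answers = {}
        for rank in sorted(requests, key=self._last_served, reverse=True):
            worker = self.idle.popleft() if self.idle else None
            if worker is not None:
                self.history.append(rank)
            answers[rank] = worker
        return answers
```

A rank that received a worker in an earlier round is hence served after ranks that never booked one; a None answer corresponds to an empty node pool.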
Besides the tailoring of the worker delivery, the strategy also allows one to incorporate
architecture-specific knowledge (also suggested in [43]): As the node pool finally
decides whether a fork happens or not—it can always tell a node that there are
no idle workers left—it can prohibit new master-worker relationships introducing
runtime drawbacks. Furthermore, the node pool decides which new worker to deliver
if there are several workers idle. Knowledge about the concrete hardware topology
and the global logical topology typically influence both aspects.
5.5 Parallel Iterations and HTMG
The parallel algorithm exchanges the partitions’ boundaries after each traversal.
For the multigrid solver, this communication pattern entails some modifications.
While the element-wise evaluation remains unaltered, results of the individual stencils are not always available throughout touchVertexLastTime events: Boundary vertices are adjacent to k < 2^d (local) geometric elements. The contributions of the 2^d − k additional elements are not available until the remote vertices have been received, i.e. they are available throughout the touchVertexFirstTime event of the subsequent traversal.
The parallelisation thus entails the following algorithm updates:
1. Temporary results such as the residual become persistent attributes. They are
not discarded after the traversal.
186
5.6 Experiments
2. Each accumulated value is exchanged. The vertex exchange hence comprises
for example the residual in addition to the vertex refinement structure.
3. For each accumulated value, the boundary vertex’s remote contributions are
added to the local representation as soon as the vertex is read from the input
stream. The boundary vertex merge process from page 170 hence is augmented
by the accumulation statements.
4. In the parallel code, operations formerly triggered by touchVertexLastTime are triggered by touchVertexFirstTime, as the underlying data is not available before the merge process ensures that all elements’ contributions have been added to a vertex.
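Items 2 and 3 can be condensed into a small sketch of the augmented merge step; the record and its fields are invented for illustration and do not mirror the actual vertex layout.

```python
class BoundaryVertex:
    """Illustrative boundary vertex with a persistent, accumulated residual."""
    def __init__(self, local_residual=0.0):
        self.residual = local_residual  # persistent across traversals (item 1)

    def merge_with_neighbour(self, received_residual):
        # touchVertexFirstTime of the subsequent traversal: add the remote
        # elements' contributions to the local accumulation (items 2 and 3)
        self.residual += received_residual
        return self.residual
```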
The enumeration reveals that the changes for the parallel solver are straightforward. Nevertheless, the latter issue also reveals that the parallel solver needs half
an iteration more for a Jacobi step than the sequential code: The sequential code
starts the Jacobi step throughout the first read (it clears the residual), and finishes
the Jacobi step throughout the last write (it updates the value). The parallel code
needs an additional touchVertexFirstTime to update the vertex’s value.
As the additional traversal results from the boundary exchange pattern, this argument holds for all the other multigrid ingredients, such as restriction and the subsequent coarse grid smoothing, too. The events are “shifted” on the time scale (Table 5.1) and arise “delayed” compared to the sequential code. Furthermore, the solver needs an additional state firstSmoothingStepOnCoarserGrid. In this state, it completes the restriction started in the preceding iteration besides the first smoothing step on the coarser grid. Because of the modified event mapping, the horizontal tree cuts are also delayed by one iteration: After a multigrid ascent, the traversal has to access the former active level ℓactive + 1 once more to restrict the right-hand side, i.e. the tree cuts may not store these values on a backup stream yet.
5.6 Experiments
The following experiments analyse the parallelisation and load balancing with three
test setups elaborating algorithmic properties. They refrain from a real-world use case and have an artificial character, and they do not perform any arithmetics, i.e. the Peano instances create the grid, traverse this grid, and trigger events. These events, however, are—besides the geometry analysis—mapped to empty operations, i.e. they degenerate.
First, I study the impact of the message size on the overall performance, i.e. I determine appropriate, hardware-specific parameters for the subsequent experiments,
and, thus, reduce the architecture’s influence on the experimental results. Second,
I study the parallel speedup for regular grids. Any parallelisation scheme has to
Table 5.1: Interplay of the traversal events, the solver states and the multigrid operations for the parallel code. The rule set employs a simultaneous coarse grid smoothing. FirstCoarse abbreviates FirstSmoothingStepOnCoarserGrid.

Solver state Smooth:
  createTemporaryVertex(v):  level(v) ≤ ℓactive ⇒ interpolate coarse grid value.
  touchVertexFirstTime(v):   level(v) = ℓactive ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
                             level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
  enterElement(e):           level(e) = ℓactive ⇒ apply stencil.
                             level(e) < ℓactive ∧ Punrefined v(e) ⇒ apply stencil.
  touchVertexLastTime(v):    no operation.

Solver state Ascend:
  createTemporaryVertex(v):  level(v) = ℓactive ⇒ v ← 0, and level(v) < ℓactive ⇒ interpolate coarse grid value.
  touchVertexFirstTime(v):   level(v) = ℓactive ⇒ apply Jacobi update step, clear residual, and compute hierarchical transform.
                             level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
  enterElement(e):           level(e) = ℓactive ⇒ apply stencil.
                             level(e) < ℓactive ∧ Punrefined v(e) ⇒ apply stencil.
  touchVertexLastTime(v):    no operation.

Solver state FirstCoarse:
  createTemporaryVertex(v):  level(v) ≤ ℓactive ⇒ interpolate coarse grid value.
  touchVertexFirstTime(v):   level(v) = ℓactive + 1 ⇒ restrict r̂v, with r̂v stored in variable rv.
                             level(v) = ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
                             level(v) = ℓactive ∧ Prefined(v) ⇒ bv ← 0.
  enterElement(e):           level(e) = ℓactive ⇒ apply stencil.
                             level(e) < ℓactive ∧ Punrefined v(e) ⇒ apply stencil.
  touchVertexLastTime(v):    no operation.

Solver state Descend:
  createTemporaryVertex(v):  level(v) ≤ ℓactive ⇒ interpolate coarse grid value.
  touchVertexFirstTime(v):   level(v) < ℓactive ∧ ¬Prefined(v) ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
                             level(v) = ℓactive − 1 ∧ Prefined(v) ⇒ apply Jacobi update step and rv ← 0 (clear residual) afterwards.
                             level(v) = ℓactive ⇒ compute inverse hierarchical transform.
  enterElement(e):           level(e) = ℓactive ⇒ apply stencil.
  touchVertexLastTime(v):    no operation.
yield a satisfying parallel performance for such a setup, before it switches to more
complicated experiments. Third, I study the parallel speedup for an adaptive grid
resolving a singularity. To parallelise such a grid is challenging, as the underlying
k-spacetree is extremely unbalanced, i.e. it is difficult for a domain decomposition
to result in a parallel speedup at all. Furthermore, the grid construction process
entails a permanent rebalancing, as the decomposition is not aware of the singularity
in advance.
5.6.1 Message Size
[Figure: three panels (Infinicluster, HLRB II, Jugene) plotting the time per vertex in seconds against the number of messages per message exchange, each for d = 2 and d = 3.]
Figure 5.13: Influence of the message size on the runtime per vertex for three different architectures.
According to Section 5.2.3, several vertices are bundled into one big message bundle throughout the partition boundary exchange. This bundle’s size depends on the attributes exchanged per vertex, on the bundle’s cardinality, and on the data alignment. Although I study the correlation of size and performance, this section neither derives a globally optimal size nor gives a reasoning for the runtime behaviour. It just shows that such a performance study is critical—it has to be done for each PDE individually anyway—and derives reasonable settings for the grid records in the subsequent experiments.
For the measurements in Figure 5.13, a regular Cartesian grid is divided into equally sized subpartitions for k^d nodes. The parallelisation paradigm incorporates a
hidden communication, since the vertex exchange follows a fire-and-forget semantics,
i.e. the actual data exchange runs in the background of the traversal. Consequently,
the runtime per vertex should be independent of the bundle’s size. Nevertheless,
the results reveal that the message size can influence the application’s performance.
On the Infinicluster, a sufficiently big number of messages per bundle has to be
chosen. Otherwise, the message exchange slows down the application (see one or
two messages per bundle, e.g.). This effect is either a result of the network’s latency,
or it results from the overhead required to set up a data exchange. Choosing too big a message size for d = 3 also reduces the application’s speed. As the effect does
not occur for d = 2, and as the “three-dimensional” messages are bigger than the
messages in two dimensions due to the cardinality of the adjacency lists depending
on d, this is either a bandwidth restriction, or the Infinicluster’s hardware fails to
exchange very big messages in the background. I have no explanation for the peak
for d = 2 and 1024 messages per bundle.
On the HLRB II, the results are mixed: The performance is independent of the message size for d = 2, yet it is very sensitive to the cardinality for d = 3. This effect requires further attention, but it is beyond the scope of this thesis.
On Jugene, a BlueGene/P system, it is important to have a sufficiently big number
of messages per bundle. The latter architecture is furthermore known to be very
sensitive to data alignment [37], but the alignment’s influence is not captured by
the experiments at all.
5.6.2 Regular Cartesian Grids
A regular Cartesian grid corresponds to a balanced k-spacetree, i.e. each refined
spacetree element of one level has the same number of descendants. Because of
the simple, regular structure, the load balancing’s behaviour can be reconstructed
manually: The k^d elements on level one are distributed among the cluster, since the root element holds δ = −1 (the experiments of Figure 5.4 study the speedup for fewer than k^d computing nodes). Each of the k^d level-one elements holds δ = −1, too, i.e. they try to re-fork immediately. If the number of remaining nodes is a multiple of k^d or bigger than k^d, each subtree forks the same number of times. Otherwise, the subtree with the smallest number of forks determines the overall runtime (Table 5.2).
Each of the k^d nodes responsible for the level-one elements tries to book additional
workers. Because of this greedy approach, they compete for additional workers, and
a first-come first-served (FCFS) node pool strategy might introduce a load imbalance
non-deterministically, since it favours nodes with the best connection to the node
pool. The fair strategy in Section 5.4 tackles this challenge.
Table 5.2: Forks accompany the grid creation for a regular grid, d = 2, different numbers of computing nodes, and one dedicated node pool (rank 0) implementing a fair answering strategy. Bold ranks are the global bottlenecks.

Nodes  Iteration  Vertices     Forks (master ↦ {masters and workers})
 9     0          6.80 · 10¹   0 ↦ {2, 4, 1, 8, 3, 9, 5, 7, 6}
       ⋮
       7          4.87 · 10⁷
17     0          6.80 · 10¹   0 ↦ {8, 4, 2, 16, 1, 13, 9, 12, 5}
       1          6.44 · 10²   1 ↦ {1, 17}, 2 ↦ {2, 6}, 4 ↦ {4, 15}, 5 ↦ {5, 3}, 8 ↦ {8, 14},
                               9 ↦ {9, 10}, 12 ↦ {12, 11}, 13 ↦ {13, 7}, 16 ↦ {16}
       ⋮
       7          4.89 · 10⁷
19     0          6.80 · 10¹   0 ↦ {4, 16, 2, 1, 8, 3, 6, 5, 12}
       1          6.44 · 10²   1 ↦ {1, 17, 11}, 2 ↦ {2, 15}, 3 ↦ {3, 14}, 4 ↦ {4, 10},
                               5 ↦ {5, 19}, 6 ↦ {6, 7}, 8 ↦ {8, 9}, 12 ↦ {12, 13}, 16 ↦ {16, 18}
       ⋮
       7          4.89 · 10⁷
24     0          6.80 · 10¹   0 ↦ {16, 1, 4, 9, 8, 2, 5, 17, 7}
       1          6.44 · 10²   1 ↦ {1, 23, 6}, 2 ↦ {2, 21, 12}, 4 ↦ {4, 3}, 5 ↦ {5, 11, 10},
                               7 ↦ {7, 15, 22}, 8 ↦ {8, 13, 14}, 9 ↦ {9, 18}, 16 ↦ {16, 20},
                               17 ↦ {17, 19}
       ⋮
       7          4.90 · 10⁷
The experiments for the regular grid were at first conducted on the Infinicluster
(Figure 5.14). For bigger node cardinalities, I switched to the HLRB II (Figure 5.15).
All measurements illustrate a weak speedup—the grid size, i.e. spacetree depth, is
scaled with the number of computing nodes—starting with 6.09 · 104 (d = 2) or
6.72 · 105 (d = 3), respectively, vertices on a single node, and both the HLRB
II and the Jugene exhibited a similar speedup pattern besides some measurement
inaccuracies.
The fork analysis above explains the step layout in Figure 5.14, as it applies recursively to all grid levels. This step pattern resembles the steps in Figure 5.4 scaled up
to a bigger number of nodes. Furthermore, it becomes obvious that the fair node
pool strategy delivers a more robust parallel decomposition compared to an FCFS
implementation. I use the fair strategy throughout subsequent experiments.
The bigger the number of computing nodes and the bigger the spatial dimension
d, the more additional nodes are needed to reach the next speedup level (Figure
5 Parallelisation
[Figure: speedup over the number of nodes (up to 90); curves: linear speedup and Infinicluster measurements for d = 2 and d = 3, each with the FCFS and the fair node pool strategy.]
Figure 5.14: Weak speedup for a regular Cartesian grid on the Infinicluster.
5.15), since the work is distributed top-down and the number of nodes within the
spacetree grows exponentially with k^{id}, i ∈ N_0. If the number of computing nodes
fits the tree structure, the speedup for the regular grid is already promising. For
the other cases, three remarks are important.
First, few applications yield such regular meshes, due to more complicated boundaries or a more complicated solution behaviour. If one extracts the finest regular grid from such a mesh's k-spacetree, this grid's elements will hold non-uniform
weights: Elements covering regions with a finer resolution have a greater weight than
elements covering regions with a rather coarse mesh. Furthermore, elements distributing work to additional nodes will have a smaller weight than elements doing all
the work alone. The actual speedup curve pattern then adapts to the non-uniform
weight distribution. Consequently, each application scenario will exhibit a different
step pattern. A fair runtime comparison thus has to average the speedup
curves of several comparable scenarios to yield a reasonable speedup curve and
to eliminate the influence of the nondeterministic worker assignment. The worst
case steps from the measurement here then disappear, while the best case speedup
results become worse.
Second, the measurements here apply a one-to-one mapping of Peano instances
to computing nodes. In the outlook a modification and extension of this mapping is
[Figure: speedup over the number of nodes (up to 900); curves: linear speedup and HLRB II measurements for d = 2 and d = 3.]
Figure 5.15: Weak speedup for a regular Cartesian grid on the HLRB II. The
speedups of several runs are averaged, and the node pool implements a
fair load distribution strategy.
discussed that should smooth out the speedup curve presented in the figures here.
Finally, the experiments employ Peano without a PDE solver. The computational
work thus is exclusively determined by the grid management operations. If a PDE
is solved, the additional floating point operations codetermine the runtime, and the
poor slope for the three-dimensional problem should improve. If this is not the case,
d = 3 has to receive further attention in the future.
5.6.3 Singularities
Figure 5.16: Grid resolving a singularity.
The last experiment resolves a point singularity at x0 = (1, 1, . . .)^T ∈ R^d on the
unit square or cube, respectively: In each k-spacetree level, exclusively the vertices
of the geometric element holding x0 are refined (Figure 5.16). An optimal Peano
traversal's runtime for such a spacetree scales with h, h being the spacetree's height,
while the sequential runtime is ≈ h · 2^d · k^d: the traversal has to step down into the
finest elements once, while it can, on each level, deploy work for elements adjacent
to the one element covering x0 to remote nodes. If the runtime scales linearly in
the number of elements to be traversed, they will always finish before their master
starts to ascend again.
Again, it is straightforward to reconstruct the load balancing’s behaviour: All
elements covering x0 have δ = −1, i.e. their fork predicate holds. With each refinement, the corresponding node books additional workers. Two different patterns
of load distributions then can occur: If the node deploys all elements besides the
Figure 5.17: Grid resolving a singularity (top, left). The global master tries to deploy
all work for refined elements to remote nodes, while it administrates the
node pool itself (top, right). Below, in lexicographical order: Domain
decompositions for 5, 6, 7 and 10 nodes.
Table 5.3: Forks and merges accompany the grid construction in Figure 5.16 (d = 2).
The rank holding the fine grid element covering x0 is marked bold, and
the transitions denote forks and joins.

Iteration  Vertices     Idle Nodes  Forks/Joins
 0         6.80 · 10^1  8           0 ↦ {0, 6, 8, 5, 3}
 1         2.37 · 10^2  5           3 ↦ {3, 4, 12, 10}
 2         4.06 · 10^2  2           4 ↦ {4, 1, 11, 9}
 3         5.75 · 10^2  1           4 ↦ {4, 7}
 4         6.79 · 10^2  0           7 ↦ {7, 2}
 5         7.83 · 10^2  0
 6         8.47 · 10^2  1           {3, 10} ↦ 3
 7         8.86 · 10^2  0           7 ↦ {7, 10}
 8         9.90 · 10^2  1           {3, 12} ↦ 3
 9         1.03 · 10^3  0           10 ↦ {10, 12}
10         1.13 · 10^3  0
element covering x0 to workers, it remains the overall bottleneck: Throughout the
steps down, it triggers the workers to start up their traversal. As they are refined
only once, they always have finished their work before the master returns to step
up. In a second case, the node deploys the element covering x0 to a worker w0 .
Then, w0 becomes a bottleneck throughout the subsequent grid refinements, and
the master will become a lazy master—it could take the workload of all workers
besides w0 without becoming a global bottleneck. Therefore, it triggers joins, and
the freed workers are reemployed on a finer grid.
The corresponding experiments were conducted on the Infinicluster and showed an
asymptotic maximum runtime improvement of 2^d · k^d if the number of computing nodes is sufficiently large. Although the fork and join behaviour is nondeterministic, the structure of all operation traces is always the same. One example
is given in Table 5.3. It agrees with the δ analysis above.
If Peano runs out of idle nodes throughout the grid construction, the runtime
improvement breaks down (Figure 5.17 illustrates several examples): The application distributes the spacetree further and further until no nodes are idle anymore.
The node being the global bottleneck is also responsible for the element covering x0 .
This element refines further. As the node is already the slowest participant in the
cluster, it never becomes a lazy master. Hence, no further joins are triggered, and
the overall tree partitioning does not change its layout anymore.
An unbalanced tree such as the spacetree in Figure 5.16 is the worst case for
any tree traversal algorithm. Many PDE solvers never yield such discretisations,
except for complicated domains; our group's computational fluid dynamics code,
for example, always delivers rather regular grids. Given such behaviour, the practical use of the experiments above is rather low regarding the runtime spent on the
PDE's inner domain. It is, however, important that Peano books additional workers
for grid singularities aggressively. In combination with the speedup curves for regular
grids (Section 5.6.2), this observation shows that Peano's load balancing tackles regions with an adaptive, irregular structure aggressively with many computing nodes
and, hence, diminishes runtime defects due to these regions.
5.7 Outlook
This closing section addresses some improvements of the parallelisation, while it
leaves aside the fact that the implementation obviously needs additional attention,
profiling, and tuning for d ≥ 3. First, parameter studies and a subsequent tuning
of the algorithm's magic constants within the predicate definitions adapt the parallelisation to a concrete hardware or a concrete PDE. The parallelisation yields
nice speedups whenever the grid structure fits the traversal concept, the
hardware topology, and the number of nodes available. To make the parallelisation robust with respect to these factors is, second, an important and outstanding
challenge. Finally, the parallelisation concept here provides a vast set of links for
methodological improvements.
Among the most important magic constants in the algorithm are the message
queue size, the threshold constants in the fork and join predicate definitions, and
the weight function. In Figure 5.13, the impact of the actual message size on the
application’s performance is illustrated. An optimal message size depends on both
the cluster’s hardware and the PDE’s degree of freedom model, as the number of
unknowns per vertex codetermines the size of one message and, thus, the number
of bytes to be spent on a block of messages. It is laborious to determine a fitting
message block size for each PDE model individually. Runtime models incorporating
the network’s latency, bandwidth, memory alignment, and so forth can select a
message block size automatically and disburden the user from the laborious studies.
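Such a runtime model can be as simple as a first-order latency-bandwidth estimate. The sketch below is an illustrative assumption, not a measured model; it picks the smallest bundle cardinality for which the latency overhead per message falls below a given share of the pure transfer time:

```python
import math

def messages_per_bundle(latency_s: float, bandwidth_bps: float,
                        message_bytes: int, overhead_share: float = 0.1) -> int:
    """Smallest number of messages per bundle for which the per-message share
    of the startup latency, latency/n, drops below `overhead_share` times the
    pure transfer time of one message."""
    transfer_s = message_bytes / bandwidth_bps        # time to push one message
    return max(1, math.ceil(latency_s / (overhead_share * transfer_s)))

# illustrative values: 5 us latency, 1 GB/s bandwidth, 56-byte records
print(messages_per_bundle(5e-6, 1e9, 56))  # -> 893
```

Plugging in the actual record sizes of a concrete PDE model (cf. Table 6.1) would let such a model pick the bundle size without manual parameter studies.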
The constants in the predicate definitions also depend on the hardware used.
Triggering a remote traversal for example entails a communication overhead. As a
result, it sometimes is reasonable to merge two partitions although the master is
not a lazy master. If the time spent on the additional local workload falls below
the communication overhead, the overall runtime nevertheless improves, and an
additional free worker becomes available. Such rationale, decisions, and experiences
can be modeled by additional magic constants within the definition of the fork and
join predicates.
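A sketch of such an extended join predicate (a hypothetical helper, not Peano's actual implementation) compares the merged sequential runtime with the parallel runtime including the communication overhead:

```python
def should_join(master_load: float, worker_load: float,
                comm_overhead: float) -> bool:
    """Merge a worker's partition back into its master if the merged
    sequential runtime undercuts the parallel runtime plus the communication
    overhead of keeping the worker, even if the master is not a lazy master."""
    parallel_runtime = max(master_load, worker_load) + comm_overhead
    merged_runtime = master_load + worker_load
    return merged_runtime < parallel_runtime

# a tiny worker behind an expensive link is worth merging
print(should_join(master_load=10.0, worker_load=1.0, comm_overhead=2.0))  # -> True
```

The hardware-dependent part is hidden in `comm_overhead`, i.e. exactly one of the magic constants discussed above.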
k-spacetree leaves have a uniform cost model in this thesis, as their weight is
fixed. For complicated PDEs, this concept has one shortcoming. If sophisticated
convection operators are to be evaluated within a cell, if a multiphysics or multiscale
model is implemented on the spacetree, or if the number of operations per element
changes throughout the simulation, the weight attribute differs from leaf to leaf. In
[23], the weights also depend on the surrounding elements: the more regular the
environment, the smaller the workload becomes. An intelligent weight definition
incorporates such facts.
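A weight definition of this kind might look as follows; the discount factor and the neighbourhood measure are illustrative assumptions, not the weights used in [23]:

```python
def leaf_weight(base_cost: float, refined_neighbours: int,
                total_neighbours: int, regular_discount: float = 0.5) -> float:
    """Non-uniform leaf weight: the more regular the leaf's environment,
    the smaller the workload. A leaf surrounded exclusively by unrefined
    neighbours receives the full discount."""
    regularity = 1.0 - refined_neighbours / total_neighbours
    return base_cost * (1.0 - regular_discount * regularity)

print(leaf_weight(100.0, 0, 4))  # fully regular environment -> 50.0
print(leaf_weight(100.0, 4, 4))  # fully irregular environment -> 100.0
```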
It is cumbersome that the speedup exhibits a step pattern for regular grids. An
improved parallelisation has to yield a smoothed-out speedup curve close to the
linear speedup that is (almost) independent of the actual number of computing
nodes. First studies working with overloading, i.e. several processes are started on
each (multicore) computing node, yield promising results. In such experiments,
I work with a logical number of computing nodes exceeding the physical number
of nodes. Modern programming environments such as MPI-2 provide a dynamic
process model, i.e. one can adapt the logical cardinality to the application's needs.
With a reasonably extended number of logical computing nodes, the load balancing
can search for a big, efficient number of nodes. Afterwards, it shrinks back the set
of nodes to this efficient number.
With massive overloading, a sophisticated node pool realisation is essential. In
the situation sketched above, a node pool tracking the relationship of logical and
physical nodes helps to speed up the computation: It is absurd to make one cluster
node hold several logical program instances, while other nodes employ only a single
instance. The node pool is responsible for delivering the workers such that the load
on the physical nodes is balanced. It can, for example, track the number of non-idle
instances per physical node.
A sophisticated node pool furthermore takes the physical topology into account.
If a cluster consists for example of several blades with several processing units per
blade, some design decisions for such a sophisticated node pool strategy might be
as follows:
• As Peano’s logical topology equals a tree, it is reasonable to allocate workers
for a node on the same blade: If a computing node tries to book a worker, the
node pool searches whether a processor on the same blade is idle, and delivers
this process, as this process can communicate with its new master without
accessing the inter-blade connection typically being slower.
• On the other hand, it might, for the same reason, be reasonable to balance the
working processes among the individual blades. Processors on one blade typically share network devices. If as few as possible processes on each blade are
active, as few as possible processes have to share one piece of communication
hardware.
These two examples highlight the interplay of the topologies. More sophisticated
node pool strategies also take the partition boundaries and connections into account,
as data are exchanged along the partition boundaries.
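Combining both blade-related considerations, a node pool strategy could first look for an idle process on the master's blade and otherwise fall back to the least loaded blade. A minimal sketch; the data structures and names are hypothetical:

```python
from typing import Dict, List, Optional

def book_worker(idle: List[int], blade_of: Dict[int, int],
                active_on_blade: Dict[int, int], master: int) -> Optional[int]:
    """Topology-aware worker selection: prefer an idle process on the
    master's blade; otherwise pick a process on the blade with the fewest
    active processes, so that communication hardware is shared as little
    as possible."""
    if not idle:
        return None
    on_master_blade = [r for r in idle if blade_of[r] == blade_of[master]]
    candidates = on_master_blade if on_master_blade else idle
    return min(candidates, key=lambda r: (active_on_blade[blade_of[r]], r))

blade_of = {0: 0, 1: 0, 2: 1, 3: 1}
print(book_worker([1, 2, 3], blade_of, {0: 1, 1: 0}, master=0))  # -> 1 (same blade)
```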
The parallelisation concept orbits around the load balancing problem. In the
best case, it moreover takes memory and topology issues into account. Long-running, massively parallel simulations nowadays have to face a much broader set
of challenges: Reliability, for example, is a topic becoming more and more important since computational resources are limited and expensive, i.e. hardware failures,
power blackouts, and breaks for high-priority jobs may not harm a simulation's result. Backup strategies and checkpointing algorithms face this challenge but are
beyond the scope of this work. Yet, the interplay of stacks and space-filling curves
yields an efficient serialisation of the partitions, and I am sure that checkpointing can
benefit from them. Besides reliability, this work has not spent a single thought on
efficient inter-program communication through standardised interfaces. The whole
thematic block of simulation interaction paradigms for massive parallel codes offers
a vast field of questions to be answered. Here, the inherent multi-scale character of
this thesis is a promising link.
6 Numerical Experiments
In three (almost) orthogonal chapters, this thesis addresses different challenges of
high performance computing. Each chapter proposes algorithms, and each chapter
also comprises an experiments section studying the effects and properties of these
algorithms. Although the chapters are presented independently of each other, the idea
of k-spacetrees underlies all of them. Because of this “leitmotiv”, an application
plugging the different thesis parts together should inherit all the nice individual
properties directly: a fact to be validated. The following text studies such an
application: a parallel, multigrid solver on adaptive Cartesian grids for the Poisson
equation based upon Peano spacetrees, i.e. (k = 3)-spacetrees. All the experiments
were conducted on the HLRB II.
6.1 Memory Requirements
A unique selling point of all spacetree codes is the low memory requirements to store
a discretisation. As Peano's multigrid realises a matrix-free solver, i.e. it does not
add an additional data structure for the matrices, this discretisation memory
also governs the overall memory demands. One expects Peano to come along with
very low memory requirements.
Three classes of properties determine the record size of each grid constituent: grid management data, solver- and PDE-specific data, and parallelisation
entries. The grid management data comprises refinement information, spatial information, and state flags. It is discussed in Chapter 3. The PDE-specific data
comprises current solution, PDE parameters such as the equation’s right-hand side,
error estimator, and helper variables. It is discussed in Chapter 4. The parallel
data finally comprises domain partition information as well as load balancing, fork
and join data. It is discussed in Chapter 5. Besides the grid constituents, it is also
interesting to study the memory requirements of the traversal automaton. As Peano
follows a recursive formulation, this automaton’s memory footprint codetermines,
in combination with the maximum k-spacetree height, the size of the call stack.
Peano’s implementation comes along with a small number of bytes (Table 6.1). In
comparison, out-of-the-box solvers supporting dynamic refinement, too, frequently
require several hundred bytes per record. Furthermore, Peano’s memory footprint is
architecture independent, as I switched off any system-specific memory alignment.
Table 6.1: Memory requirements of the multigrid solver in bytes.

                                      Sequential        Parallel
                                     d = 2   d = 3    d = 2   d = 3
Traversal automaton                    40      56       40      56
Vertex                                 56      56       65      81
Geometric element                      32      36       48      56
Vertex on output stream                20      20       52      68
Geometric element on output stream     10      14       34      42
Such an alignment speeds up the data access on current computer architectures. In
the implementation, I thus use memory alignment for the internal data structures
and the temporary data containers. For the input and output stacks, I disable the
feature.
Records in the parallel mode exhibit an increased size. The geometric element
then comprises additional load balancing data such as weight and δ attribute. The
vertex comprises additional adjacency information, i.e. it has to track whether it is
adjacent to remote spacetrees. If the latter holds, a vertex also has to hold the ranks
of the associated processors, and this information is encoded by two adjacency lists
with cardinality 2^d.
The precompiler DaStGen transforms all of Peano’s records into a
memory-optimised representation [13, 14] before they are passed to the compiler.
This optimised representation uses for example one byte to hold all the individual
bits and bit sequences encoding the refinement structure and vertex states. Nevertheless, it does not yet exploit the knowledge that most of the integers assigned to a
vertex or cell have a bounded range. Integer numbers with a bounded range can also
be compressed by DaStGen, and such a tuning reduces the memory consumption
further.
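The kind of compression DaStGen could additionally apply is plain bit packing: an integer known to lie in [0, r) only needs ceil(log2 r) bits. A sketch of the idea, not DaStGen's actual code generation:

```python
def pack_fields(values, ranges):
    """Pack integers with bounded ranges into one word: each field occupies
    just enough bits to encode its range."""
    word, shift = 0, 0
    for value, upper in zip(values, ranges):
        assert 0 <= value < upper
        bits = max(1, (upper - 1).bit_length())
        word |= value << shift
        shift += bits
    return word, shift      # packed word and total number of bits used

# three fields with ranges 3, 2 and 8 fit into 2 + 1 + 3 = 6 bits
print(pack_fields([2, 1, 5], [3, 2, 8]))  # -> (46, 6)
```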
Even without the additional compression, Peano comes along with very modest
memory requirements in both sequential and parallel mode although it supports
arbitrary adaptivity. The Poisson solver inherits the memory properties already
studied in Section 3.6.
6.2 Horizontal Tree Cuts
In Section 3.5.2, I introduce the concept of horizontal tree cuts swapping a part
of the k-spacetree to a background buffer while the traversal continues on
the remaining, coarser spacetree levels. The multigrid's V-cycle in Chapter 4 then
establishes a concrete application of this tree cut mechanism: As soon as the cycle
has updated a fine grid level and has restricted the fine grid contributions to a coarser
grid, the multigrid’s state machine reduces the active level, and all the events on
the old fine grid level reduce to no operation. It is thus reasonable to swap this fine
grid level to a background buffer until the V -cycle descends again into the level of
interest. One expects Peano’s multigrid solver to profit from the horizontal tree cuts
in terms of runtime.
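The expected gain can be estimated with a simple cost model: without cuts, every smoothing sweep traverses the whole spacetree; with cuts, a sweep on an active level only touches that level and the coarser ones. The sketch below is purely illustrative (unit cost per vertex, one sweep per level per cycle):

```python
def cycle_cost(vertices_per_level, use_cuts: bool) -> int:
    """Cost of one multigrid cycle that visits every level once as the
    active level. With horizontal tree cuts, levels finer than the active
    one are swapped to a background buffer and not traversed."""
    total = sum(vertices_per_level)
    cost = 0
    for active in range(len(vertices_per_level)):
        if use_cuts:
            cost += sum(vertices_per_level[:active + 1])
        else:
            cost += total
    return cost

levels = [1, 9, 81]                        # toy regular k = 3, d = 2 spacetree
print(cycle_cost(levels, use_cuts=False))  # -> 273
print(cycle_cost(levels, use_cuts=True))   # -> 102
```

Since the finest level dominates the vertex count for regular grids, the saving grows with the spacetree depth, which is what the measurements below confirm.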
The experiments studying the tree cuts compared the runtime of 20 V (2, 2)-cycles
for both a regular grid for problem (4.30) and a grid resolving a singularity. The
latter resulted from the homogeneous Poisson equation on the unit square with a
right-hand side given by the Dirac distribution in the coordinate system’s origin.
This grid was extremely refined around the singularity, i.e. it refined solely the
vertex at x = 0 ∈ R^d on each level.
Table 6.2: Speedup due to horizontal tree cuts.

                  Minimal        Depth  Vertices      Speedup
                  mesh width
regular, d = 2    1.0 · 10^−1     4     1.29 · 10^3   1.00
                  1.0 · 10^−2     6     7.02 · 10^4   3.27
                  1.0 · 10^−3     8     5.41 · 10^6   4.62
adaptive, d = 2   1.0 · 10^−4    10     4.99 · 10^2   1.00
                  1.0 · 10^−8    18     9.39 · 10^2   1.67
                  1.0 · 10^−12   27     1.73 · 10^3   2.00
regular, d = 3    1.0 · 10^−1     4     3.60 · 10^4   1.87
                  2.0 · 10^−2     5     1.59 · 10^7   4.52
adaptive, d = 3   1.0 · 10^−3     8     3.40 · 10^3   1.38
                  1.0 · 10^−4    10     4.37 · 10^3   1.38
                  1.0 · 10^−8    18     8.25 · 10^3   1.51
                  1.0 · 10^−12   27     2.68 · 10^4   3.07
The results in Table 6.2 prove the tree cut mechanism to be robust, i.e. the cuts
never introduce a runtime penalty, and the speedup never falls below one.
Furthermore, the finer the grid becomes and the greater the number of vertices, the
better the speedup due to the tree cuts. Consequently, the runtime improvement
per grid hierarchy also scales with the spatial dimension d.
A comparison of the adaptive grids with the regularly refined meshes reveals that
the speedup per additional spacetree level depends on the regularity of the underlying grid, i.e. rather regular grids benefit more significantly from the feature than
adaptive grids. This insight is not surprising, as the number of additional spacetree
nodes per additional level increases by a factor of k d for the regular grid. For the
adaptive mesh resolving a singularity, the number of additional geometric elements
or vertices, respectively, per additional level is fixed (see also Figure 5.16). Having
a more flexible, local tree cut mechanism as discussed in the outlook of Chapter 3
(see Figure 3.21), the improvement for adaptive grids might increase, too.
As the tree cuts are robust with respect to runtime, there is no reason to switch
this feature off at any time. Furthermore, the experiments revealed exactly the insights
expected for a V -cycle.
6.3 Simultaneous Coarse Grid Smoothing
In Section 4.5.1, I introduce the simultaneous coarse grid smoothing extending the
standard Jacobi smoother to a Jacobi smoother for adaptive Cartesian grids. The
rationale behind this extension is that the Peano traversal processes all k-spacetree
elements anyway. The experiments in Chapter 4 prove that the number of iterations
reduces due to the additional coarse grid smoothing if the grid is adaptive. They
nevertheless lack runtime measurements. If the additional smoothing operations
on the coarse grid slowed down the individual steps of the V -cycles significantly,
the simultaneous coarse grid smoothing could slow down the overall computation
although it reduces the number of cycles. Hence, the runtime behaviour has to be
studied carefully.
The experiments solved problem (4.33) on the unit square with a refinement
criterion based upon the full stencil (4.22). The refinement threshold was reduced
step by step, and I compared an F (2, 2)-cycle with the simultaneous coarse grid
smoothing with a plain F (2, 2)-cycle working exclusively on one grid level a time. As
the coarse grid smoothing’s effect on the convergence rate has already been studied,
each solver performed a fixed number of 20 cycles, and I concentrated on the pure
execution time measured by hardware counters. Since the simultaneous coarse grid
smoothing alters the numerics, both approaches yielded slightly different refinement
patterns. I neglected these differences here.
Table 6.3: Runtime of the simultaneous coarse grid smoothing. No/yes denotes
whether the solver smooths grids coarser than the active level, too. h
gives the spacetree's height.

                        d = 2                                    d = 3
Threshold     h   Vertices     No          Yes         h   Vertices     No          Yes
1.0 · 10^−4    6  6.07 · 10^3  2.83s       2.96s       7   7.22 · 10^5  660.35s     695.03s
1.0 · 10^−5    7  4.44 · 10^4  18.67s      19.50s      8   1.45 · 10^7  12772.72s   13791.57s
1.0 · 10^−6    9  5.32 · 10^5  261.51s     278.65s
1.0 · 10^−7   10  5.74 · 10^6  2575.86s    2763.99s
1.0 · 10^−8   11  5.74 · 10^7  22937.93s   24510.41s
The additional operations on the coarser grids increase the runtime per cycle by
approximately ten percent (Table 6.3). This performance penalty factor is independent of the problem size. It is thus reasonable to use the coarse grid smoothing
all the time: for a given accuracy to be obtained, the solver without coarse grid
smoothing always needs more than 110% of the cycles of the solver with the coarse
grid smoothing.
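The resulting decision rule is simple arithmetic; the ten percent penalty is the value measured in Table 6.3:

```python
def smoothing_pays_off(cycles_without: int, cycles_with: int,
                       penalty: float = 1.10) -> bool:
    """Simultaneous coarse grid smoothing pays off iff the cycles it saves
    outweigh the roughly 10% runtime penalty per cycle."""
    return cycles_without > penalty * cycles_with

print(smoothing_pays_off(20, 17))  # saves 3 of 20 cycles -> True
print(smoothing_pays_off(20, 19))  # saves 1 of 20 cycles -> False
```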
6.4 MFlops on Regular Grids
Section 3.6 states an almost constant runtime per vertex for the pure grid management. One expects Peano's multigrid solver to come along with a constant cost per
vertex, too.
The achieved percentage of the peak performance is an important metric
for the quality of the implementation of a scientific software, as it measures how well
an implementation fits a given hardware. Besides time per vertex, this section
thus also expresses the cost per vertex in terms of MFlops. A big set of tuning and
optimisation strategies improving the MFlop rate exists. To select the right one, it
is important to know in advance which effect prevents the processing unit from displaying
its full power. Cache effects are a popular first guess for such a candidate. Section 3.4
postulates that Peano's cache behaviour is by construction asymptotically
optimal, and the experiments there prove this statement for the pure grid
management. However, such a statement has to be verified for a real solver, too:
One expects Peano's multigrid solver to be cache-oblivious.
One of Peano’s unique selling points is the support for arbitrary dynamic, adaptive
grids without memory drawbacks. However, it is also important to quantify how
such a flexible solver performs for a regular mesh in terms of runtime and memory.
The memory issue is already tackled by Section 6.1, and the next experiments thus
concentrate on measurements for regularly refined spacetrees. They compare the
runtime of V (2, 2)-cycles to F (2, 2)-cycles for the experiment (4.32) on the L-shape.
In the experiments, I sampled hardware counters for V (2, 2)-cycles and F (2, 2)-cycles. Besides the pure MFlop rates, the counters also tracked the number of L2 and
L3 cache misses. As the counters' values were averaged over the complete runtime,
the figures cannot distinguish between grid setup and computation phase for the
V (2, 2)-cycle. Nevertheless, with 20 iterations conducted per experiment, the grid
setup phase was assumed to be insignificant.
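For reference, converting the sampled Flop-per-cycle counters into MFlop rates and peak fractions is straightforward. The clock rate and the peak of 4 Flop per cycle below are assumptions for the HLRB II's Itanium 2 cores, chosen to be consistent with the figures in the text:

```python
CLOCK_HZ = 1.6e9            # assumed HLRB II core clock
PEAK_FLOP_PER_CYCLE = 4.0   # assumed Itanium 2 floating point peak

def mflops(flop_per_cycle: float) -> float:
    return flop_per_cycle * CLOCK_HZ / 1e6

def peak_fraction(flop_per_cycle: float) -> float:
    return flop_per_cycle / PEAK_FLOP_PER_CYCLE

print(round(mflops(31e-3)))                     # -> 50 MFlops
print(round(100 * peak_fraction(97.55e-3), 2))  # -> 2.44 (percent of peak)
```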
With the Itanium system and its 31 · 10^−3 Flop per cycle yielding approximately
50 MFlops, the measurements reveal the following insights: In the two-dimensional
experiment, the MFlop rate rises to around 70 MFlops with increasing
problem size. Level two cache misses almost disappear throughout the measurements, while the level three cache hit rate drops with a finer resolution. The
time spent on a vertex is almost constant for the V-cycle, and it exhibits a decline
towards a lower bound of around 2.5 · 10^−4 for the F-cycle.

Table 6.4: Hardware counter measurements for regularly refined k-spacetree grids on
the L-shape.

                 Vertices     Flop/Cycle       L2 hit rate  L3 hit rate  t/#vertices
V (2, 2), d = 2  1.19 · 10^3  27.04 · 10^−3    99.98%       68.12%       ≪ 10^−4
                 7.21 · 10^3  35.88 · 10^−3    99.96%       97.02%       ≪ 10^−4
                 5.48 · 10^4  41.24 · 10^−3    99.95%       96.71%       3.87 · 10^−4
                 4.63 · 10^5  42.79 · 10^−3    99.95%       55.58%       3.77 · 10^−4
                 4.08 · 10^6  43.32 · 10^−3    99.95%       59.16%       3.78 · 10^−4
F (2, 2), d = 2  1.19 · 10^3  27.15 · 10^−3    99.98%       60.38%       ≪ 10^−4
                 7.21 · 10^3  36.25 · 10^−3    99.96%       94.72%       ≪ 10^−4
                 5.48 · 10^4  41.79 · 10^−3    99.95%       96.41%       2.98 · 10^−4
                 4.63 · 10^5  43.48 · 10^−3    99.95%       55.88%       2.81 · 10^−4
                 4.08 · 10^6  42.72 · 10^−3    99.95%       40.53%       2.71 · 10^−4
                 3.66 · 10^7  42.97 · 10^−3    99.95%       36.96%       2.52 · 10^−4
V (2, 2), d = 3  3.50 · 10^4  66.28 · 10^−3    99.96%       97.76%       5.60 · 10^−4
                 6.20 · 10^5  88.52 · 10^−3    99.96%       59.34%       5.90 · 10^−4
                 1.42 · 10^7  97.55 · 10^−3    99.97%       51.98%       4.03 · 10^−4
F (2, 2), d = 3  3.50 · 10^4  66.61 · 10^−3    99.96%       97.03%       5.08 · 10^−4
                 6.20 · 10^5  89.20 · 10^−3    99.96%       59.18%       5.01 · 10^−4
                 1.42 · 10^7  97.73 · 10^−3    99.97%       51.50%       4.92 · 10^−4

In the three-dimensional experiment, the MFlop rate rises to around 160 MFlops,
i.e. the MFlop rate of the two-dimensional experiment is more than doubled. The
L2 and L3 cache behaviour of the two-dimensional experiments is preserved, while
the cost per vertex is—besides the last V -cycle experiment—almost constant.
Peano’s multigrid solver preserves a constant time per vertex model for both
dimensions, and the multigrid solver preserves the grid traversal’s cache characteristics. Hence, the statements from Chapter 3 also hold for the Poisson solver, and the
multigrid code is cache-oblivious. The latter observation implies that techniques such
as the file stack swapping can be directly applied to the Poisson solver, too.
The framework achieves less than 2.5 percent of the theoretical peak performance of one core of the HLRB II, while the switch from two to three dimensions just doubles the runtime per vertex. Studying solely the number of floating
point operations—they are given by the local matrix sizes assigned to the elements’
events—in terms of d, this increase appears to be small. Both properties deserve an
explanation.
First, I spent no effort on an exhaustive source code optimisation of the Peano
framework. This fact takes its toll.
Second, the partial differential equation and finite element method employed here
are unsuited to achieve good MFlop rates. The grid traversal processes one element
at a time. Due to the checks from which stack to take the vertices, due to the
checks whether the element and the vertices have to perform a transition, and due
to the checks which events have to be triggered for which element constituents, each
element processing comes along with a big number of case distinctions and integer
arithmetic. The number of floating point operations per event, though, is small.
A more complicated (system of) differential equations or a higher order method
increases the computational load per element and, thus, automatically leads to a
better MFlop rate due to a better usage of the superscalar computing facilities of
the processor. For the Navier-Stokes equations, e.g., exactly the same code delivers
four up to five times better rates [59], and the linear increase in computing time per
vertex throughout the switch from d = 2 to d = 3 confirms this statement. It would
not hold, if the caches or the memory system were the bottleneck.
Finally, the results show how important the elimination of all the integer arithmetic and case distinctions is. While the arbitrary adaptivity and the hierarchical data
structure are convenient for sophisticated mathematical algorithms, current computer
architectures perform best if the floating point data is fed sequentially and without
interruption into the floating point units. For the regular grids studied here, such a
streaming is theoretically possible (there is no need for refinement checks or the support of adaptive refinement), although it does not fit to Peano's traversal paradigm.
First studies with a modified Peano implementation treating whole regular subgrids
(patches) as one big stream, loading them en bloc without case distinctions, and
passing them back to the output stream en bloc consequently exhibit an improved
MFlop rate [23].
6.5 MFlops on Adaptive Grids
The previous section extensively discusses the interplay of cache hit rates, MFlop
rates, and upcoming optimisations for regularly refined grids. This section studies
the same solver properties for adaptive Cartesian grids. Section 3.6.1 suggests that
the runtime cost per vertex remains the same besides a small, fixed overhead. To
validate this assumption, the computational load per hanging vertex and the computational load to evaluate the refinement criterion have to be bounded, too: One
expects Peano to exhibit constant runtime cost per vertex for adaptive grids.
Again, the experiments measured the hardware counters for experiment (4.33)
solved by an F (2, 2)-cycle. In accordance with the results of Section 6.3, I switched
on the simultaneous coarse grid smoothing throughout all the experiments. All
the setups started from a coarse grid with four degrees of freedom, i.e. four inner
vertices. The refinement criterion (4.22) then refined the tree until the linear surplus
fell below a given threshold.
With such a setup, the following measurements and insights (Table 6.5) arise:
Table 6.5: Hardware counters for an F(2,2)-cycle on an adaptive grid with dynamic
refinement criterion.

      Threshold      h   Vertices       Flop/Cycle       L2 hit rate   L3 hit rate   t/#vertices
d=2   1.0 · 10^-4    6   6.07 · 10^3    36.85 · 10^-3    99.96%        95.35%        4.94 · 10^-4
      1.0 · 10^-5    7   4.43 · 10^4    42.09 · 10^-3    99.95%        97.84%        4.41 · 10^-4
      1.0 · 10^-6    9   5.32 · 10^5    42.95 · 10^-3    99.95%        53.29%        5.23 · 10^-4
      1.0 · 10^-7   10   5.74 · 10^6    42.21 · 10^-3    99.95%        38.94%        4.81 · 10^-4
      1.0 · 10^-8   11   5.75 · 10^7    42.56 · 10^-3    99.95%        37.24%        4.26 · 10^-4
d=3   1.0 · 10^-4    7   7.13 · 10^5    95.03 · 10^-3    99.97%        57.31%        9.75 · 10^-4
      1.0 · 10^-5    8   1.45 · 10^7   104.66 · 10^-3    99.97%        49.25%        9.52 · 10^-4

In the two-dimensional case, the Flop per cycle ratio of the experiments with the
regular grid is preserved. In the three-dimensional case, this rate even rises to
104.66 · 10^-3 Flop/Cycle. The cache hit rates of both test setups preserve the
meanwhile familiar pattern already discussed in Section 3.6. Although the Flop rates
do not change, the cost per vertex approximately doubles.
As the Flop rate remains roughly the same, the creation of the hanging vertices,
the additional interpolations arising for them, and the evaluation of the refinement
criterion do not stall the floating point pipeline. Nevertheless, they impose an
additional overhead in terms of floating point operations. Consequently, the Flop
rate is preserved, but the time per vertex increases.
As this increase is bounded by a factor of two, and as this experiment already
exhibits real-world character—it is not an artificial grid or a grid resolving solely a
singularity—Peano’s cost per vertex is invariant for adaptive grids, too.
6.6 Outlook
The preceding sections study Peano’s characteristics in detail. Throughout these
studies, two properties become apparent: Due to the framework approach,
characteristics demonstrated for the grid management apply directly to the matrix-free
solver implemented on top of Peano. Due to the orthogonality of the individual
chapters, the individual characteristics do not conflict with each other. Measuring
the cache hit rate for the grid management, e.g., proved the framework to be
cache-oblivious, and the cache hit rate of the Poisson solver mirrors this characteristic.
Furthermore, the Poisson solver’s dynamic adaptivity criteria did not alter the cache
behaviour: dynamic adaptivity is orthogonal to this property.
Besides the memory table on page 202, the preceding experiments ignore the
parallel version of Peano and concentrate solely on the sequential code. The
orthogonality of the characteristics and the framework paradigm justify such a procedure,
and it can be assumed that the parallelisation plugs seamlessly into the Poisson
solver. Nevertheless, a validation of this argument is desirable. Furthermore, several
properties of the parallelisation approach such as the dynamic load balancing come
into their own only within the context of a reasonable PDE. For lack of time, I have
to postpone such studies to future work.
7 Conclusion
In this thesis, I introduce the framework Peano for grid-based solvers of partial
differential equations. Any such endeavour has to face one particular question:
is it worth launching yet another framework? In my opinion, Peano derives its
right to exist from a combination of advantageous characteristics shaping any solver
programmed on top of the framework:
First, the solvers’ grid management requires a low amount of memory. Few bits
per vertex or geometric element, respectively, are sufficient due to the depth-first
traversal with a corresponding tree encoding. While solvers on Cartesian grids and
solvers using patches of Cartesian grids have a modest memory demand, too, they
usually do not allow for an arbitrary, dynamic refinement of the grid.
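The few-bits-per-element claim can be made concrete with a small sketch: the spacetree is written as one refinement bit per element in depth-first order. The encoding and all names below are simplified, illustrative assumptions, not Peano’s actual storage format.

```cpp
#include <cstddef>
#include <vector>

// One bit per element, emitted in depth-first order: 1 = refined
// (followed by the encodings of its k^d children), 0 = leaf.
struct TreeStream {
  std::vector<bool> bits;  // depth-first refinement bits
};

// Counts the elements of the subtree starting at position pos and
// advances pos past that subtree's bits.
std::size_t countElements(const TreeStream& t, std::size_t& pos,
                          std::size_t childrenPerElement) {
  std::size_t count = 1;                 // the element itself
  if (t.bits[pos++]) {                   // refined: visit all children
    for (std::size_t c = 0; c < childrenPerElement; ++c)
      count += countElements(t, pos, childrenPerElement);
  }
  return count;
}
```

For d = 2 and k = 3, a refined root with nine leaf children costs ten bits in total, which illustrates the order of magnitude referred to above.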
Second, the framework supports meshes exceeding the available main memory
without a runtime penalty due to a stack-based persistence concept where the main
memory acts as cache for the hard disk. While alternative hard disk swapping
strategies might provide a similar runtime behaviour, they usually pose restrictions
on the grid layout whereas the swapping here is completely encapsulated.
Third, the solvers’ grid management supports a multiscale geometric representation
of the domain due to the k-spacetrees. While alternative multiscale discretisations
provide such data by definition as well, few allow the user to refine and coarsen
the grid arbitrarily without changing the data topology underlying the grid storage.
Fourth, the solvers exhibit constant runtime cost per record, independent of the
actual grid resolution and layout, due to a cache-aware or, respectively,
memory-hierarchy-aware grid traversal benefiting from the interplay of stacks
with space-filling curves. While many cache optimisations yield a similar cache hit
rate, their memory behaviour usually breaks down if the grid changes dynamically
or becomes adaptively refined.
Fifth, the solver’s grid management supports a domain decomposition providing
subdomains with small surfaces compared to their volumes due to space-filling curves.
While many decomposition approaches exhibit nicely shaped subdomains and realise
the accompanying data exchange efficiently, these shapes typically suffer from
dynamically changing discretisations. Furthermore, few domain decompositions take
the grid hierarchy explicitly into account. Peano does.
Before I continue, one remark: Many papers promote space-filling curves
enthusiastically. One often cannot help but think that these curves render all the
parallelisation challenges null and void, as they make the search for a good partitioning
simple—just cut the whole curve into equally sized pieces—and yield quasi-optimal
partitions on-the-fly. I do not agree. Naive equally sized cuts of space-filling curves
do not take multiscale relations into account; adaptive grids and multiscale
algorithms with inter-level communication are not tracked, but need a global
representative of the whole domain; and moving or relocating cuts along the curve is, on the
one hand, far from trivial due to the complex subdomain shapes and structures, as
well as, on the other hand, not for free because of the data to be exchanged.
Parallelism remains hard work.
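The naive cutting criticised here fits into a few lines of code, which makes its popularity understandable. In the sketch below, a Morton (z-order) key serves as an illustrative stand-in for the Peano curve, and the partitioning deliberately ignores all multiscale relations—exactly the shortcoming discussed above. All names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Morton (z-order) key for a 2D cell: interleave the bits of x and y.
// A stand-in for a proper Peano curve index, used only for ordering.
std::uint64_t mortonKey2d(std::uint32_t x, std::uint32_t y) {
  std::uint64_t key = 0;
  for (int b = 0; b < 32; ++b) {
    key |= (std::uint64_t)((x >> b) & 1u) << (2 * b);
    key |= (std::uint64_t)((y >> b) & 1u) << (2 * b + 1);
  }
  return key;
}

// "Naive" partitioning: n cells, already sorted along the curve, are
// cut into p contiguous, (almost) equally sized pieces.
std::vector<int> cutCurve(std::size_t n, int p) {
  std::vector<int> owner(n);
  for (std::size_t i = 0; i < n; ++i)
    owner[i] = (int)((i * (std::size_t)p) / n);
  return owner;
}
```

Sorting the leaf cells by `mortonKey2d` and applying `cutCurve` yields connected pieces of curve, but says nothing about the tree levels above the leaves.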
As a final characteristic, Peano’s parallelisation incorporates a dynamic load
balancing due to a simple tree attribute grammar. While many load balancing
algorithms produce equivalent or even better workload distributions, few work on-the-fly,
run in parallel, consider a multiscale representation of the domain, and are nearly as
simple.
As far as I know, no other framework provides all these features in one package.
The key to obtaining them is the combination of spacetrees, space-filling curves, and
a refinement-invariant, stack-based grid storage scheme where the number of stacks
is linear in d and independent of grid structure and refinement depth.
Besides functional properties, the handling of complexity, maintainability, and
extensibility accompanies any discussion of software frameworks and software
architectures. One technique for tackling these three aspects is encapsulation together
with separation of concerns. Frameworks hiding technical details and their
realisation from the code solving the actual problem fit into this dogma. In high
performance computing, however, the quality of a software is determined by two different
non-functional requirements: execution speed and parallel scalability. As
functional decompositions often introduce runtime bottlenecks, programmers of
simulation codes often downgrade the separation-of-concerns aspect and interweave
grid- and cluster-specific as well as PDE-specific code fragments. Peano tackles the
non-functional challenges with a rigorous object-oriented architecture and a simple,
event-based coupling of the PDE-specific parts with the grid and traversal management.
This event paradigm, on the one hand, enforces a separation of the PDE-related tasks
from the well-structured framework. On the other hand, it enables the framework
to schedule activities in-between any two PDE tasks, i.e. the paradigm facilitates a
mixture of different activities.
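The event-based coupling can be sketched as a callback interface that the grid traversal invokes while the solver never touches storage or traversal code. The event names and the toy traversal below are illustrative assumptions; they do not reproduce Peano’s actual interface.

```cpp
#include <vector>

// A vertex record as the traversal hands it to the solver.
struct Vertex { double value = 0.0; };

// PDE-specific code implements this handler; the framework decides
// when each event fires. Names are hypothetical.
class EventHandler {
public:
  virtual ~EventHandler() = default;
  virtual void touchVertexFirstTime(Vertex& v) = 0;
  virtual void handleElement(std::vector<Vertex*>& vertices) = 0;
  virtual void touchVertexLastTime(Vertex& v) = 0;
};

// A toy "traversal": each vertex is touched once before and once after
// the element processing, with the solver's events invoked in-between.
void traverse(std::vector<Vertex>& vertices, EventHandler& handler) {
  for (Vertex& v : vertices) handler.touchVertexFirstTime(v);
  std::vector<Vertex*> element;
  for (Vertex& v : vertices) element.push_back(&v);
  handler.handleElement(element);
  for (Vertex& v : vertices) handler.touchVertexLastTime(v);
}
```

Because the framework owns the loop, it can interleave its own activities—persistence, parallel data exchange—between any two solver events, which is the mixture of activities referred to above.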
Whether a framework fulfils a functional or non-functional requirement, or whether
it is extensible to fulfil a requirement, has to be studied by means of concrete
applications built atop the framework. Furthermore, the collection of all these
applications sheds light on the debate on how valuable a framework is for answering and
addressing unsolved questions from science and engineering. In the following, I hence
first present a solver for computational fluid dynamics (CFD) that is implemented
with Peano. While this solver exhibits the highest maturity of any software
implemented with the framework, it does not yet exploit all the functional properties
coming along with Peano. I thus continue with a list of interesting properties and
features upcoming PDE solvers could or will provide—some efforts have already
started, some not. From this discussion, I return to the non-functional aspects.
Running software ever nearer to the (multicore) peak performance is of the essence
in high performance computing. I highlight one crucial extension and evolution
ansatz that has already shown promise in tackling the runtime challenge. With
functional and non-functional outlooks and extensions at hand, I finally pick up
again a set of philosophical questions: What programming and software strategy
is well-suited for future-generation multiphysics, multiscale, massively parallel high
performance codes? Is a holistic top-down approach represented by frameworks or
is a component-based decomposition of the challenges the way to go? Elaborating a
recently prototyped integration concept for Peano, I combine advantages from both
worlds and embed Peano into the whole simulation pipeline.
Computational Fluid Dynamics with Peano
A framework for PDE solvers is only as good as the best PDE solver implemented
with the framework. Fortunately, Peano has a sophisticated computational fluid
dynamics code [60] built on top of it. This solver is actually a collection of plugins
addressing the stationary and instationary, incompressible Navier-Stokes equations
via direct numerical simulation. The solver suite provides an exhaustive set of
standard boundary conditions, and it also supports fluid-structure interaction scenarios
due to a partitioned approach, i.e. the CFD code can compute the forces exerted by
the fluid at the domain’s boundaries, and it can handle moving boundaries in return.
Multiple spatial discretisations of the equations facilitate a spatial discretisation of
higher order—an interesting fact for any PDE solver, as it follows the appeal to
increase the “science per FLOP” [44] or “science per record” as well as the Flop
per event (Section 6.5)—and the suite provides several implicit and explicit time
integration schemes.
The overall goal of the CFD plugins is to show that interesting insights and
opportunities stem from the crossfire of adaptive Cartesian grids and computational
fluid dynamics. In this short outline, I concentrate on the CFD solver’s properties
that are interwoven with the framework, i.e. I neglect characteristics that are not
directly related to Peano.
First, the simulation can resolve the computational domain very accurately due to
the low memory requirements of the Peano framework. Such a fine resolution is for
example important if a simulation computes turbulent regimes with direct numerical
simulation, i.e. without explicit turbulence models. Here, the turbulent behaviour
on the scale of interest stems from fine scale influences. Out-of-the-box CFD codes
often cannot afford to resolve these fine scales and, hence, have to apply turbulence
models to imitate the fine grid influences. Peano’s direct numerical simulation can
validate these models and vice versa.

Figure 7.1: Channel filled with gas (left) and fluid circulating around a cylinder
(right). Illustrations are due to Tobias Neckel [60], who realised a
computational fluid dynamics code within the Peano framework.
Second, the adaptive grids can resolve selected regions significantly finer than the
overall domain due to the low memory requirements and the support of arbitrary
adaptivity. Such an improved resolution is for example important if a simulation
computes flows with different characteristics such as boundary layers where the
overall flow is turbulent whereas the flow along the boundaries is laminar (Figure
7.1). Here, interesting effects arise from the interplay of the two different types of
flow, and the need for fine grid resolutions is motivated by physics. Out-of-the-box
CFD codes often cannot afford to resolve the interesting regions with sufficient
accuracy.
Third, the dynamic adaptivity can adapt the grid to a (changing) geometry due
to the support of dynamic refinement. Such a dynamic change of the computational
grid is for example important in a fluid-structure interaction problem, where the
structure, i.e. the computational domain, permanently changes its position or shape
and is not aligned with the coordinate system axes. Out-of-the-box CFD codes
often cannot adapt the grid arbitrarily: they have to stop the simulation after a
given number of time steps, remesh, and, finally, interpolate from the old
grid to the new grid. Peano provides a homogeneous environment to handle such
flows and computational domains where the mesh update accompanies each single
time step.
Besides the three applications highlighted here, the implementation of [60] comprises
several highlights not directly related to the framework. Besides them, it also adds,
for example, a restart functionality to Peano’s pool of features. Restarts based on
regular snapshots are important for long-running parallel computations, where the
malfunction of a single node would otherwise destroy the overall computation’s result.
Figure 7.2: Space-time grid with adaptive refinement in both time and space.
Illustration of a simulation snapshot is due to Bernhard Gatzhammer, who uses
Peano’s CFD realisation within a fluid-structure interaction application
environment.
Furthermore, the snapshot mechanism enables the user to run a simulation for a
given time. Afterwards, the user can reconfigure the application’s parameters and
conduct several parameter studies on the same startup setting.
New Applications Using the Peano Framework
Computational fluid dynamics is an active and agile field of research. Many
fundamental questions in this field are still open, and the computational challenges
are far from being solved. Peano’s CFD implementation is a promising candidate
to become part of this active research. Nevertheless, the CFD implementation does
not yet exploit all of Peano’s unique selling points. In the following, I hence
pick up two additional application areas and concentrate on additional properties.
Peano allows a mathematical model to use an arbitrary d ≥ 2. While the CFD
code selects d ∈ {2, 3}, searching for models with a bigger dimension constant yields
a vast number of possible applications: Parameter studies as well as parameter fitting
and optimisation problems, for example, add additional parameter dimensions to
classical fluid dynamics problems. Some problems from mathematical finance are
by construction based on a higher dimensional computational domain. I concentrate
on a different class of “high-dimensional” partial differential equations.
Many parabolic equations are given on a computational domain with dimension
three and an additional time dimension. Choosing d = 4, Peano can solve such an
equation treating the time as just another parameter dimension of the computational
domain (Figure 7.2). In first experiments with simple setups for the heat equation,
such a holistic approach with space-time discretisations brings along at least three
interesting opportunities: First, problems with periodic solutions benefit from the
fourth simulation dimension. Starting with a first guess on a (rather coarse) time
grid, the simulation successively improves the solution and adapts the grid, the time
stepping, and the periodic boundary conditions in time. No additional effort is to
be spent on the treatment and persistence management of the individual time steps,
and the parallelisation falls seamlessly into place. Second, problems with multiscale
temporal behaviour benefit from the fourth simulation dimension. Many solutions
exhibit regions where the solution changes rapidly, while other parts of the computational domain remain almost invariant. Here, the solution is very smooth in
time. A locking of the time steps leading to a global time stepping treats both types
of regions the same. Adaptive time stepping approaches, where regions of interest
are tracked with smaller time steps, fit to the four-dimensional discretisation, as the
(d = 4)-spacetree simplifies the persistence management of the individual time steps
and the interpolation in time. Finally, simulations with strong pollution effects in
time benefit from the fourth simulation dimension. For parabolic simulations, errors in one time step propagate to the subsequent time steps, spread spatially, and
can break down the overall simulation accuracy. A prominent candidate for such
a misbehaviour is the preservation of mass in a fluid-structure simulation with
incompressible fluids: instationary boundaries introduce small mass inconsistencies
around moving geometries, and these inconsistencies then spread globally. The
four-dimensional grid simultaneously holds all time steps. Whenever pollution is
identified, it is thus possible to return to the time step causing the pollution, refine the
grid there locally, and update all subsequent time steps. The pollution’s source is
eliminated.
Besides the higher-dimensional formulations, the multiscale representation of the
domain deserves an additional remark. The Poisson equation studied in this thesis
and the CFD code introduced above rely on a holistic description of all the physics
by a single partial differential equation, i.e. the solution is determined by one equation valid for all spatial resolutions. A promising field for new scientific insights is
the coupling of different mathematical models for different spatial scales. It is for
example a common wisdom that the boundary conditions coming along with classical
fluid dynamics codes are often an inappropriate description of the real world.
In such a case, boundary conditions arising from molecular dynamics yield more
accurate global simulation results (Figure 7.3). Yet, no one can afford to compute
large-scale flows with a molecular dynamics code. Peano’s multiscale discretisation
Figure 7.3: Boundary condition of sophisticated fluid simulation (top) results from
molecular dynamics code (bottom). Illustration of molecular dynamics
is due to Martin Buchholz.
makes it possible to embed different physical models and mathematical descriptions
into one code, couple them strongly in regions of interest—only regions of interest
are refined such that the molecular dynamics code comes into play—and make both
simulations benefit from each other.
Framework Extensions
This chapter has so far concentrated on use cases and extensions concerning different
functional properties of Peano from an application point of view. Besides a
functional evolution of Peano’s plugins, emphasis also has to be put on non-functional
properties, as the applicability and quality of a framework depend to a great extent
on the maintainability, extensibility, and performance of the code. The following
paragraphs refrain from a concrete application and concentrate on technical
extensions and improvements reducing the software’s runtime.
Performance considerations particularly gain weight with the new hardware
architectures arising: Future architectures consist of more and more cores, while the
single-core performance remains almost constant. Code then no longer benefits
automatically from new architectures due to increased clock rates [71]. Accordingly,
performance metrics also undergo a fundamental paradigm shift. They incorporate
both the percentage of peak performance achieved and the scalability with respect
to multicores.
The tuning of Peano is beyond the scope of this thesis, and, besides the cache
discussion, no significant emphasis is put on the debate whether and how Peano fits
to and performs on current computer architectures. As a result, Peano is a framework
with constant cost per degree of freedom and a vast amount of features, yet it performs
rather poorly with respect to the theoretical peak performance of a processing
unit. This has to change with upcoming framework releases.
Methodologically, I consider recursion unrolling [65] to be the fundamental technique
to make Peano utilise current computer architectures. In first case studies [23], an
additional attribute grammar similar to the weight and δ attributes tracks which
subtrees of the k-spacetree are invariant and correspond to regularly refined subgrids.
On these subgrids or patches, respectively, the code switches from the recursive
traversal to a flat, sequential implementation holding the complete subtree as one
sequence of plain Cartesian multiresolution grids.
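The detection attribute can be sketched bottom-up: each subtree records down to which depth it is refined perfectly regularly, and whole regular subtrees are then handed to the flat patch code instead of the recursion. This is a simplified, illustrative analogue of the attribute grammar mentioned above, not its actual realisation.

```cpp
#include <cstddef>
#include <vector>

// A spacetree node: no children = leaf, otherwise k^d children.
struct TreeNode {
  std::vector<TreeNode> children;
};

// Bottom-up attribute: the depth down to which the subtree forms a
// full, regular grid. 0 for a leaf; for a refined node, 1 plus the
// minimum over all children (the guaranteed regular depth).
int regularDepth(const TreeNode& n) {
  if (n.children.empty()) return 0;
  int depth = regularDepth(n.children[0]);
  for (std::size_t i = 1; i < n.children.size(); ++i) {
    int d = regularDepth(n.children[i]);
    if (d < depth) depth = d;
  }
  return depth + 1;
}
```

Once `regularDepth` exceeds some unrolling threshold, the traversal can load the whole subtree en bloc as a Cartesian patch, skipping all per-element case distinctions.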
Having whole Cartesian grids at hand allows Peano to apply at least three
fundamental performance improvements. First, the regular, homogeneous data
structure with a sequential processing lends itself to instruction-level parallelism as soon as
operations on these records exploit SSE and vector processing units. Second, the
regular, homogeneous data structure fits a simple data decomposition cutting
it into equally sized contiguous pieces. Multicore architectures with multiple cores
holding one shared block of data can then take over the responsibility for subparts of
the patches; they share the computational workload. Finally, the regular,
homogeneous data structure fits block-wise numerical algorithms. Block-wise red-black
Gauß-Seidel algorithms, for example, are easy to realise on these blocks, yet improve
the numerical performance, i.e. the convergence rate.
First experiments applying recursion unrolling yield promising results for grids
with big regular subpatches, i.e. the convergence rate improves, the code scales on
several cores, and the MFlop rates rise. Wherever a (sub)grid changes, exhibits
adaptive discretisations, or covers a domain partition’s boundary, the application falls
back to the standard Peano implementation. With such a hybrid implementation,
Peano’s MFlop rates improve by an order of magnitude, while the code preserves all
the flexibility and features introduced in this thesis.
Software Challenges
Deriving new algorithmic extensions and innovations is fun, and every enthusiastic
programmer appreciates the development of a system or a new algorithmic feature
from scratch. A mere discussion of additional features arising from applications or
performance optimisations nevertheless misses one crucial point of software
development in high performance computing: Is the software’s quality sufficiently high, and
does the software remain maintainable, manageable for humans, and extensible?
At the time this thesis is handed in, the Peano framework in combination with the
computational fluid dynamics plugin already comprises more than 850 classes
distributed among 35 packages. It is hence reasonable, even indispensable, to ask such
questions.
First of all, using a framework does not weaken this challenge. Good component
decompositions provide a set of tools, represented by the components, with
well-defined interfaces. The user then selects suitable components and combines them
into an application. A good component decomposition neither forces the user into a
dedicated programming and development style nor do the components influence each
other directly—component decomposition does not coin the software development
process, and it exhibits a low acceptance threshold. Frameworks, by contrast, require
the user to make his or her programming approach and train of thought fit the
framework’s paradigm, design, and philosophy. In turn, their holistic approach
facilitates runtime- and memory-efficient implementations. A simple example: No
matter how sophisticated and fast a solver component for linear equation systems
and a component for the grid management are realised, if a discretisation changes
permanently, and if the assembly of the linear equation system that analyses the
grid is a complicated and long-running activity, the application cannot benefit from
the elaborate components it relies on. A clever integration and interweaving of
components circumvents such a problem a priori—updating, for example, only
matrix entries related to changing grid elements—and frameworks are one way to
enforce this. For an easy adoption and a low acceptance threshold, it is though
essential to design a framework’s restrictions and paradigms as clearly, plainly, and
straightforwardly as possible: It is unacceptable to force the programmer to read
through and understand several Ph.D. theses, each with 222 pages or more, before
programming starts.
Peano introduces the event concept to tackle this challenge for PDE solvers built
atop the framework. While the events hide the complete persistence management
and traversal realisation from the PDE programmer, the framework realisation itself
follows a rigorous object-oriented design incorporating many best practices and
design patterns. Nevertheless, features such as parallelisation or recursion unrolling
are so profound—a complete decoupling from the framework and its signatures
is no longer possible here—that it has to be studied carefully whether a particular
realisation complicates the events’ signature or semantics (the level-wise depth-first
traversal, for example, falls into this class) and whether it increases the framework’s
code complexity. Due to the clear functional decomposition, I consider the
implementation at hand to be maintainable, manageable, and extensible. However,
preserving these properties becomes more complicated with each feature added.
In this context, the question arises whether an object-oriented language such as
C++ is the right choice for the implementation. Object-oriented languages have
long been considered the magic bullet, since their constructs enforce a rigorous
encapsulation of details. For Peano, though, two drawbacks appear: On the one
hand, breaking down a design into fragments with one particular responsibility
per type is not always possible: A grid vertex, e.g., combines PDE-specific
properties, the management of the grid’s adaptivity, and operations to exchange data in
a distributed memory environment due to message passing. For such a type, an
aspect- or feature-oriented paradigm—the vertex is augmented automatically with
technical functionalities such as message exchange operations, and its signature and
appearance depend on the component handling the record—is better suited. Yet,
there is a lack of such tools for high performance computing. Peano exploits the
precompiler DaStGen [13, 14] addressing the aspect-oriented challenge. While such
a tool is a first step, the complete tool chain, the feature-oriented nature, and the
underlying programming paradigm are neither fully understood nor perfected. On
the other hand, Peano’s code blocks are often difficult to understand because of the
sophisticated mathematical formulas and algorithms they realise. They consist of
long sequences of loops, matrix-vector products, and complicated mathematical
expressions. While C++’s generics, due to expression templates, attenuate this
problem, domain-specific languages such as computer algebra systems or Matlab lead to
the most compact, maintainable, and intuitive formulations of such code blocks.
With a hand-written, tailored precompiler for aspect-oriented programming, a
rigorous encapsulation with object-oriented language constructs, and powerful
expression templates, I consider the implementation at hand to be maintainable,
manageable, and extensible. Though, understanding the underlying principles
and overcoming (at least some of) these issues with something beyond C++ for
high performance computing is highly desirable; in particular, as the software
becomes more complex with each feature and application added.
Peano as a Component
The preceding two paragraphs are a plea for frameworks for PDE solvers, as they
identify design and implementation challenges as the core problems and explain
them with inadequate language support. Such an argumentation in favour of
frameworks is biased and falls short: A clear bottom-up component decomposition with
well-defined responsibilities and states has proven of great value for many problems
throughout the centuries, and it eliminates many software challenges in advance, as
the resulting code is focused and typically smaller. For grid-based solvers on
dynamically adaptive meshes, I quote the performance as the crucial advantage of an
integrated approach where everything is done in one place, i.e. within the framework.
As soon as the performance argument is not overwhelming anymore, there is no
reason to dismiss a component architecture a priori in favour of a framework
approach; particularly since the threshold of acceptance for component architectures
is typically much lower due to smaller component size and complexity. Consequently,
I suggest a twofold approach: Whenever performance is crucial, a feature is realised
within Peano. Whenever performance is not crucial, a feature is deployed to an
external component of its own. Peano as a result runs as one big component among
several others. It incorporates all the performance-critical features and interacts with
other components whenever I favour a clear separation of concerns over performance.
The geometry management is one example of such a component: Peano's geometric
events reduce to a set of "is inside" and "is outside" queries as well as a check
whether, and which, boundary has been intersected by a hypercube. The latter
information is important for a PDE solver to set the appropriate type of boundary
conditions. While Peano could comprise a geometry management with mesh I/O
routines, surface management, and domain modifiers (for fluid-structure
interaction, e.g.), I prefer the geometry events to belong to one interface connected to
an external component. At our chair, the tool preCICE, for example, implements
such an interface. The resulting two-component architecture has several
advantages: First, it follows the separation-of-concerns idea. Peano already incorporates a
vast number of ideas and technical details dealing with grid storage, grid traversal,
parallelisation, PDE solvers, and so forth; there is no need to add further
complexity. Second, a component architecture allows the user to exchange the geometry
realisations. For my experiments, a few hard-coded geometries are sufficient, whereas
the computational fluid dynamics plugin benefits from a geometry component that
can handle complicated computational domains from external mesh files. Finally, a
component interface provides a single point of contact (SPOC) for Peano's parallel
realisation: all the different Peano instances in a cluster contact one instance of the
geometry component on one computational node. Consequently, the geometry data
are always consistent, and the dynamic parallelisation is hidden from the geometry
component. First experiments with preCICE yield promising results with respect
to separation of concerns, exchangeability, and parallel environments.
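The narrow geometry interface described above can be sketched in a few lines of C++. The class and method names below (GeometryComponent, isInside, isOutside, intersectsBoundary, UnitSphere) are hypothetical illustrations of the kind of queries such a component has to serve; they do not reproduce Peano's or preCICE's actual API.

```cpp
#include <array>
#include <cassert>

// An axis-aligned hypercube, fixed to three dimensions for brevity.
struct Hypercube3d {
  std::array<double, 3> centre;
  double h;  // edge length
};

// The minimal query interface a geometry component has to serve:
// point classification plus a boundary-intersection check for hypercubes.
class GeometryComponent {
 public:
  virtual ~GeometryComponent() = default;
  virtual bool isInside(const std::array<double, 3>& x) const = 0;
  virtual bool isOutside(const std::array<double, 3>& x) const = 0;
  virtual bool intersectsBoundary(const Hypercube3d& cube) const = 0;
};

// A hard-coded unit sphere, standing in for the simple built-in geometries
// the experiments use instead of a full mesh-file-based component.
class UnitSphere : public GeometryComponent {
 public:
  bool isInside(const std::array<double, 3>& x) const override {
    return norm2(x) < 1.0;
  }
  bool isOutside(const std::array<double, 3>& x) const override {
    return norm2(x) > 1.0;
  }
  bool intersectsBoundary(const Hypercube3d& cube) const override {
    // Conservative vertex test: the cube intersects the boundary if its
    // corners do not agree on inside/outside.
    bool anyIn = false, anyOut = false;
    for (int v = 0; v < 8; ++v) {
      std::array<double, 3> corner = cube.centre;
      for (int dim = 0; dim < 3; ++dim) {
        corner[dim] += (((v >> dim) & 1) ? 0.5 : -0.5) * cube.h;
      }
      if (isInside(corner)) { anyIn = true; } else { anyOut = true; }
    }
    return anyIn && anyOut;
  }

 private:
  static double norm2(const std::array<double, 3>& x) {
    return x[0] * x[0] + x[1] * x[1] + x[2] * x[2];
  }
};
```

Exchanging the geometry then amounts to passing a different GeometryComponent implementation to the solver, which is exactly the exchangeability argument made above.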
While the component interfaces at the moment are plain C++ signatures coupled
at compile time by direct function calls or MPI messages, such a component
interpretation of Peano fits perfectly into the common component architecture (CCA):
translating the C++ signatures to SIDL is straightforward, and the resulting Peano
codes realising one or several PDE solvers can act as CCA components within a bigger
problem solving environment. To have Peano's PDE solvers as components within
a greater simulation workbench is an ultimate goal bridging the gap from codes
that study individual properties prototypically to codes that are used for real-world
problems on real-world data sets by application-domain experts.
7 Conclusion
In the End
This thesis provides numerous starting points for future work: the individual
chapters present improvements and extensions concerning the individual features,
and this final chapter gives several more complicated and laborious improvements
and extensions that embed Peano into a bigger environment of open questions,
existing software packages, and computational challenges. Summarising all these aspects
in a few final sentences is impossible. I will pick up several of these issues throughout
the upcoming years, as I believe Peano to be a promising approach to obtain new,
interesting, and powerful PDE solvers. Although it has several shortcomings, the
peak performance perhaps being the most important one, and although there is a lot
of work still to invest, there is also a lot of insight to harvest. In the end, the worth
of the effort will be measured by the understanding obtained with the tool, a statement
that holds for the whole discipline of computational science and engineering. The
purpose is insight, not numbers.
A Helper Algorithms
The following pages provide some algorithms moved from the main text to the
appendix. They are collected in this chapter either due to their technical character
or because they required too much space and broke up the flow of the text. All the
algorithms are used by the grid persistence management and traversal.
The table below lists, for each algorithm, the operations defined and where they are used.

Algorithm A.1: belongsToTouchedFace (usage: page 70), belongsToUntouchedFace (usage: page 71)
Algorithm A.2: createSubStates (usage: pages 76 and 84)
Algorithm A.3: setExitManifold, setEntryManifold (usage: pages 76 and 84)
Algorithm A.4: removeFaceAccessNumber (usage: Algorithm A.3, pages 76 and 84)
Algorithm A.5: setFaceAccessNumber (usage: Algorithm A.3, pages 76 and 84)
Algorithm A.1 Determine for a vertex at position whether any adjacent face is
touched, i.e. whether an element sharing a face with the current element has read
this vertex before, or whether all adjacent faces are untouched, i.e. no element
sharing a face with the current element has written the vertex before. The position
corresponds to a lexicographic vertex enumeration within a hypercube.

belongsToTouchedFace : P_touched^{2d} × {0, 1}^d ↦ {⊤, ⊥}
belongsToUntouchedFace : P_touched^{2d} × {0, 1}^d ↦ {⊤, ⊥}

procedure belongsToTouchedFace(touched, position)
  result ← ⊥
  for i ∈ {0, ..., d − 1} do
    face ← i
    if position_i = 1 then
      face ← face + d
    end if
    result ← result ∨ touched_face
  end for
  return result
end procedure

procedure belongsToUntouchedFace(touched, position)
  result ← ⊥
  for i ∈ {0, ..., d − 1} do
    face ← i
    if position_i = 1 then
      face ← face + d
    end if
    result ← result ∨ ¬touched_face
  end for
  return result
end procedure
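For illustration, Algorithm A.1 can be transcribed directly into C++ for a fixed dimension. The sketch below assumes d = 3 and the face numbering used throughout the thesis (face i is the lower face normal to axis i, face i + d the upper one); it is a didactic transcription, not Peano's implementation.

```cpp
#include <array>

// Dimension fixed at compile time for this sketch.
constexpr int d = 3;

// touched holds one flag per face (2d faces); position holds the 0/1
// coordinates of a vertex within the hypercube. A vertex is adjacent to
// exactly d faces: per axis the lower face (bit 0) or the upper face (bit 1).
bool belongsToTouchedFace(const std::array<bool, 2 * d>& touched,
                          const std::array<int, d>& position) {
  bool result = false;
  for (int i = 0; i < d; ++i) {
    int face = i;
    if (position[i] == 1) {
      face += d;  // upper face along axis i
    }
    result = result || touched[face];
  }
  return result;
}

bool belongsToUntouchedFace(const std::array<bool, 2 * d>& touched,
                            const std::array<int, d>& position) {
  bool result = false;
  for (int i = 0; i < d; ++i) {
    int face = i;
    if (position[i] == 1) {
      face += d;
    }
    result = result || !touched[face];
  }
  return result;
}
```

With only face 0 touched, for example, the vertex at position (0, 0, 0) belongs to a touched face, while the opposite vertex (1, 1, 1) does not.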
Algorithm A.2 The following algorithm derives both the even and the access
flags for all geometric subelements of one refined element. It analyses the parent
element's flags. The result tuples are enumerated along the leitmotiv. Algorithm A.2
relies on Algorithm A.3.

A := {−2d + 1, ..., 2d − 1}^{2d}
E := {⊤, ⊥}^d
createSubStates : A × E ↦ (A × E)^{3^d}

procedure createSubStates(access, even)
  return createSubStates(access, even, d − 1)
end procedure

procedure createSubStates(access, even, axis)
  if axis = −1 then
    return (access, even)
  else
    even0 ← even
    even1 ← even
    even2 ← even                        ⊲ Create three copies of the even flags corresponding to the three substates.
    even1_axis ← ¬even1_axis            ⊲ The "in-between" flag has one entry differing from the others.
    access0 ← setExitManifold(access, even, axis)
    access1 ← setExitManifold(access, even, axis)
    access1 ← setEntryManifold(access1, even, axis)
    access2 ← setEntryManifold(access, even, axis)   ⊲ Set new entry/exit manifolds for the three substates.
    return (createSubStates(access0, even0, axis − 1),
            createSubStates(access1, even1, axis − 1),
            createSubStates(access2, even2, axis − 1))   ⊲ Recursive calls and concatenation of results.
  end if
end procedure
Algorithm A.3 Two helper operations for Algorithm A.2 simplifying the access
flag manipulation. They set a new entry or exit manifold and update the access flags
such that the invariants (3.5) and (3.6) hold again.

A := {−2d + 1, ..., 2d − 1}^{2d}
E := {⊤, ⊥}^d
setExitManifold : A × E × {0, ..., d − 1} ↦ A
setEntryManifold : A × E × {0, ..., d − 1} ↦ A

procedure setExitManifold(access, even, axis)
  if isTraversePositiveAlongAxis(even, axis) then
    access ← removeFaceAccessNumber(access, axis + d)    ⊲ see Algorithm A.4
    access ← setFaceAccessNumber(access, axis + d, 1)    ⊲ see Algorithm A.5
  else
    access ← removeFaceAccessNumber(access, axis)        ⊲ see Algorithm A.4
    access ← setFaceAccessNumber(access, axis, 1)        ⊲ see Algorithm A.5
  end if
  return access
end procedure

procedure setEntryManifold(access, even, axis)
  if isTraversePositiveAlongAxis(even, axis) then
    access ← removeFaceAccessNumber(access, axis)        ⊲ see Algorithm A.4
    access ← setFaceAccessNumber(access, axis, −1)       ⊲ see Algorithm A.5
  else
    access ← removeFaceAccessNumber(access, axis + d)    ⊲ see Algorithm A.4
    access ← setFaceAccessNumber(access, axis + d, −1)   ⊲ see Algorithm A.5
  end if
  return access
end procedure
Algorithm A.4 Helper for Algorithm A.3. Invalidates the access entry of face,
i.e. access's entries are shifted such that (3.5) and (3.6) hold again for all entries
besides face's access entry. If face's entry access_face is greater than zero, all other
entries greater than access_face hence have to be decremented. If face's entry
access_face is smaller than zero, all other entries smaller than access_face hence have
to be incremented.

A := {−2d + 1, ..., 2d − 1}^{2d}
removeFaceAccessNumber : A × {0, ..., 2d − 1} ↦ A

procedure removeFaceAccessNumber(access, face)
  oldAccessNumber ← access_face
  if access_face > 0 then
    for i ∈ {0, ..., 2d − 1} do
      if access_i ≥ oldAccessNumber then
        access_i ← access_i − 1
      end if
    end for
  end if
  if access_face < 0 then
    for i ∈ {0, ..., 2d − 1} do
      if access_i ≤ oldAccessNumber then
        access_i ← access_i + 1
      end if
    end for
  end if
  access_face ← 0
  return access
end procedure
Algorithm A.5 Helper operation for Algorithm A.3. Sets a face's access entry to a
new value. If this value is greater than zero, all other access entries greater than or
equal to value are incremented. If this value is smaller than zero, all other access
entries smaller than or equal to value are decremented. As a result, the constraints
(3.5) and (3.6) hold again.

A := {−2d + 1, ..., 2d − 1}^{2d}
setFaceAccessNumber : A × {0, ..., 2d − 1} × Z ↦ A

procedure setFaceAccessNumber(access, face, value)
  if value > 0 then
    for i ∈ {0, ..., 2d − 1} do
      if access_i ≥ value then
        access_i ← access_i + 1
      end if
    end for
  else
    for i ∈ {0, ..., 2d − 1} do
      if access_i ≤ value then
        access_i ← access_i − 1
      end if
    end for
  end if
  access_face ← value
  return access
end procedure
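As a cross-check, Algorithms A.4 and A.5 can be exercised in isolation. The following C++ transcription fixes d = 2 (four faces, access numbers in {−3, ..., 3}) and is purely illustrative, not Peano's production code; it shows that removing a face's access number and then setting the same number again restores the original flags.

```cpp
#include <array>

// Four faces for d = 2; each entry is an access number in {-3,...,3}.
constexpr int d = 2;
using Access = std::array<int, 2 * d>;

// Algorithm A.4: invalidate the entry of `face` and close the gap among the
// remaining positive (or negative) ranks.
Access removeFaceAccessNumber(Access access, int face) {
  const int oldAccessNumber = access[face];
  if (oldAccessNumber > 0) {
    for (int i = 0; i < 2 * d; ++i) {
      if (access[i] >= oldAccessNumber) access[i] -= 1;
    }
  }
  if (oldAccessNumber < 0) {
    for (int i = 0; i < 2 * d; ++i) {
      if (access[i] <= oldAccessNumber) access[i] += 1;
    }
  }
  access[face] = 0;
  return access;
}

// Algorithm A.5: assign `value` to `face` and shift the other entries so
// that the ranks stay contiguous. The loop runs over all 2d faces.
Access setFaceAccessNumber(Access access, int face, int value) {
  if (value > 0) {
    for (int i = 0; i < 2 * d; ++i) {
      if (access[i] >= value) access[i] += 1;
    }
  } else {
    for (int i = 0; i < 2 * d; ++i) {
      if (access[i] <= value) access[i] -= 1;
    }
  }
  access[face] = value;
  return access;
}
```

Starting from access = (1, 2, −1, 0), removing face 0 yields (0, 1, −1, 0), and re-setting face 0 to 1 restores the original tuple.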
B Hardware
The following sheets list the hardware used throughout the thesis. The experiments
are typically conducted on up to four different architectures. All measurements
resulting from hardware counters were conducted on Itanium nodes. The chair's
local Pentium machines are connected via standard Ethernet, i.e. they do not
belong to a high-performance cluster.
Pentium

Processor:           Pentium 4
Vendor:              Intel
Bit:                 32
Cluster:             Local workstations
Location:            Chair of Scientific Computing in Computer Science, Technische Universität München

One core
Peak performance:    6.8 GFlop/s
Clock rate:          3.40 GHz
Cores:               1
Level 1 data cache:  16 KByte
Level 2 cache:       1 MByte
Memory:              2 GByte

Cluster
Type:                Ethernet

Opteron

Processor:           Opteron 850
Vendor:              AMD
Bit:                 64
Cluster:             Infinicluster
Location:            Chair of Rechnertechnik und Rechnerorganisation, Technische Universität München

One core
Peak performance:    4.8 GFlop/s per core
Clock rate:          2.4 GHz
Cores:               4 per node
Level 1 data cache:  64 KByte, 64 Byte per line
Level 2 cache:       1 MByte, 64 Byte per line
Memory:              8 GByte shared between 4 cores

Cluster
Nodes:               32
Type:                4X Infiniband

Itanium

Processor:           Itanium2 Montecito Dual Core
Vendor:              Intel
Bit:                 64
Cluster:             HLRB II (Höchstleistungsrechner Bayern II), SGI Altix 4700
Location:            Leibniz Supercomputing Centre

One core
Peak performance:    12.8 GFlop/s per socket (two cores)
Clock rate:          1.6 GHz
Cores:               2 per socket
Level 1 data cache:  16 KByte, 64 Byte per line
Level 2 cache:       256 KByte, 128 Byte per line
Level 3 cache:       9 MByte, 128 Byte per line
Memory:              4 GByte
Memory bandwidth:    8.5 GByte/s shared between 2 cores

Cluster
Nodes:               9728 cores, 19 compute partitions, 256 sockets per compute partition
Type:                NUMAlink 4

PowerPC

Processor:           PowerPC 450
Vendor:              IBM
Bit:                 32
Cluster:             Jugene - Jülicher Blue Gene/P
Location:            Jülich Supercomputing Centre

One core
Peak performance:    13.6 GFlop/s per node
Clock rate:          0.85 GHz
Cores:               4 per node
Level 1 data cache:  32 KByte per node
Level 3 cache:       8 MByte
Memory:              2 GByte
Memory bandwidth:    13.6 GByte/s shared between 4 cores

Cluster
Nodes:               16 × 1024 compute nodes, 65536 cores
Type:                Three-dimensional torus
C Notation
d (d ≥ 2)
    Spatial dimension of the continuous problem.
P
    Powerset of a set.
P_i
    Predicate, i.e. a function to {true, false}. The text also refers to predicates as flags.
Spacetree

The following symbols, sets, and relations are used within the k-spacetree context.
All of them stem from Chapter 2.

k (k ≥ 2)
    Number of cuts along every coordinate axis.
T
    A spacetree.
E_T
    Set of geometric elements (hypercubes) of the spacetree T.
V_T
    Set of vertices within the spacetree T.
H_T ⊂ V_T
    Hanging vertices.
⊑_child ∈ E_T × E_T
    Partial order representing father-child relations in the spacetree. If the
    algorithm refines a hypercube, all the k^d resulting smaller hypercubes are
    children of this hypercube.
⊑_pre ∈ E_T × E_T
    Partial order on siblings, i.e. geometric elements having the same parent. The
    order defines which element is traversed before which sibling element.
invert : T ↦ T
    Maps a spacetree to a spacetree with inverted child order ⊑_pre.
⊑_dfo ∈ E_T × E_T
    Denotes a depth-first order on the k-spacetree.
⊑_lw ∈ E_T × E_T
    Denotes a level-wise depth-first order on the k-spacetree.
Grid Properties and Access Operations

List of operations and properties defined on the k-spacetree entities. The
implementation does not provide all of these operations, as some encode, for example,
the complete grid connectivity and adjacency information. Not all of this information,
however, is available (all the time).

adjacent : V_T ↦ P(E_T)
    Yields the adjacent geometric elements of a vertex.
first : E_T ↦ N_0
    Returns the moment within one traversal when an element is read for the
    first time.
level : V_T ↦ N_0
    Takes a vertex and delivers the vertex's grid level, equalling the level of the
    surrounding elements.
P_refined : E_T ↦ {⊤, ⊥}
    Indicates whether an element is refined, i.e. one of the adjacent vertices holds
    the refined predicate.
second : E_T ↦ N_0
    Returns the moment within one traversal when an element is read for the
    second time (call stack reduction).
vertex : E_T ↦ V_T^{2^d}
    Yields the adjacent vertices of an element.
℘ : ∂Ω × V_T ↦ ∂Ω_h
    Maps a point from the continuous computational domain's boundary to one
    boundary vertex position along the shortest path.
father : ...
    Derives the one up to 2^d father vertices, i.e. the vertices of the refined father
    element that influence a vertex. For the complete signature see page 28.
Element Properties

The following properties are defined on each geometric element within the grid.

level : E_T ↦ N_0
    Takes an element and delivers the element's grid level.
P_inside
    Element lies completely inside the discretised computational domain. In this
    case, invoke events on this element.
P_outside
    Counterpart of P_inside.
Vertex Properties

The following properties are defined on each vertex within the grid.

P_boundary
    Vertex lies on the boundary of the discretised computational domain.
P_coarsening triggered
    Coarsening has been triggered for the vertex, i.e. the 2^d adjacent geometric
    elements will be coarsened throughout the next traversal.
P_inside
    Vertex is inside the computational domain.
P_outside
    Vertex is outside of the computational domain.
P_refined
    Indicates whether a vertex is refined, i.e. whether all surrounding elements
    are refined elements.
P_refinement triggered
    Refinement has been triggered for the vertex, i.e. the 2^d adjacent geometric
    elements will be refined throughout the subsequent traversal.
Automaton Properties

The following properties are defined on the stack automaton realising the Peano
traversal. Most of these data are stored within the geometric elements, although all
values depend solely on the automaton's state in the preceding recursion step as well
as the parent element.

access : {0, ..., 2d − 1} ↦ Z
    Attaches a number to each element face. The actual range is
    {−2d + 1, ..., 2d − 1}. The number gives the order in which the neighbouring
    elements have been visited before (negative number) or will be visited
    (positive number).
even : E_T ↦ {⊤, ⊥}^d
    Splits the elements up into odd and even elements along each coordinate axis.
Finite Elements

The finite element method uses the following symbols to discuss the solution of
the Poisson equation. They define the computational domain, the function spaces
holding the weak solutions, and the spacetree's discretised approximation spaces
holding the numerical solution. Finally, the table gives the symbols of the operations
mapping these functions onto each other.

ℓ ∈ N_0
    Level within the grid.
Ω ⊂ R^d
    Computational domain. Bounded open subset of the d-dimensional space with
    sufficiently smooth boundary. The partial differential equation is defined on Ω.
∂Ω
    Boundary of the computational domain.
Ω_h
    Discretised computational domain. Also called fine grid.
Ω_{h,ℓ}
    Grid on level ℓ.
∂Ω_h
    Boundary of the fine grid.
Ω_{h,ℓ}^adaptive
    Adaptive grid up to level ℓ.
H^1(Ω)
    Sobolev space for the Poisson problem.
H_h^1(Ω_h) ⊂ H^1(Ω)
    Discretised subspace of the Sobolev space. It is spanned by a d-linear nodal
    basis (hat functions) on the fine grid.
H_h^1(Ω_{h,ℓ})
    Discretised Sobolev space spanned by the nodal basis of one grid level.
H_h^1(Ω_{h,ℓ}^adaptive)
    Discretised Sobolev space spanned by the hat functions on Ω_{h,ℓ}^adaptive.
H_h^1(Ω_T)
    Generating system on the k-spacetree.
ϕ ∈ H^1(Ω)
    Function from the Sobolev space.
φ ∈ H_h^1(Ω_h)
    (Hat) function from the discretised Sobolev space. It has local support, and
    its support covers 2^d geometric elements of one grid level.
h : H^1(Ω) ↦ H_h^1(Ω_T)
    Takes a function from H^1(Ω) and delivers a representation within the
    k-spacetree.
ĥ : H_h^1(Ω_T) ↦ H_h^1(Ω_h)
    Counterpart of h. If h's preimage is from H_h^1(Ω_h), applying h and ĥ in a
    row yields the identity, i.e. (ĥ · h)u = u for all u ∈ H_h^1(Ω_h).
Multigrid Operations

The following symbols are used for the multigrid Poisson solver. The table starts
with function symbols, continues with the values assigned to the grid's vertices,
and ends with the operators and matrices applied to these values.

ℓ
    Active level, i.e. the level the smoother is currently processing.
u_{h,ℓ}
    Approximation of the solution on level ℓ.
b_{h,ℓ}
    Right-hand side on level ℓ.
r_{h,ℓ}
    Residual belonging to the approximation on level ℓ.
e_{h,ℓ}
    Error of the approximation on level ℓ.
û_{h,ℓ}
    Hierarchical surplus of the approximation on level ℓ with respect to the
    approximation on level ℓ − 1.
r̂_{h,ℓ}
    Hierarchical residual resulting from û_{h,ℓ}.
u_v
    Value of the current approximation at a position in space. Is assigned to each
    inner vertex. For boundary vertices, it determines the Dirichlet boundary
    condition.
û_v
    Hierarchical surplus of a solution stored in vertex v. The value is usually
    stored within the vertex variable u_v.
r_v
    Residual of the current approximation in a vertex.
r̂_v
    Hierarchical residual at a vertex's position. The value is usually stored
    within the vertex variable r_v.
b_v
    Right-hand side of the PDE. The value is assigned to each inner vertex.
ũ_v
    Mean value at a vertex. Results from the surrounding vertices' values u_v.
    u_v − ũ_v gives the linear surplus.
A
    Stiffness/system matrix resulting from the weak formulation of the Poisson
    equation with a nodal ansatz space.
P
    Prolongation of a solution to the next finer level. This thesis solely applies
    full weighting corresponding to the hat functions.
R
    Restriction of a residual. This thesis realises a Galerkin multigrid approach,
    i.e. R = P^T.
C
    Coarsening operator. This thesis uses the trivial induction, i.e. for each
    coarse grid vertex coinciding with a fine grid vertex, the fine vertex's value
    is copied to the coarse grid value.
P_unrefined
    Identifies refined elements from E_T where the adaptive coarse grid smoothing
    also evaluates the stencil.
‖·‖_max
    Maximum norm.
‖·‖_h
    h-dependent norm, i.e. a norm that takes the grid layout into account. For
    h → 0, the norm converges to ‖·‖_{L2}.
Parallelisation

The following predicates and properties are introduced throughout the parallelisation
chapter. They hold both the load balancing and the domain decomposition data.

C_weight
    Weight of a leaf.
δ (δ ≥ 1)
    Delta of a geometric element. Gives the number of additional work units
    the processor handling the subtree defined by this geometric element could
    handle without becoming a bottleneck for its master.
FCFS
    "First come first served" answering strategy of the node pool server.
Fair
    Fair answering strategy of the node pool server, i.e. the node pool tries to
    treat all computational nodes equally and sorts the requests accordingly.
p (p ≥ 1)
    Number of (logical) processors available.
P_fork
    Fork predicate. Holds for a leaf if a refinement would slow down the
    traversal.
P_join
    An element is remote, but it could be merged into the master's partition.
    The master is a lazy master.
P_wait
    Wait predicate. Holds for a geometric element if the traversal has to wait
    within this element for a worker.
rank (0 ≤ rank < p)
    Rank (number) of a processor.
subLevel
    Holds the ranks of the processes that are responsible for the 2^d adjacent
    elements of the vertex at the same position in space but on the next level.
thisLevel
    Holds the ranks of the processes that are responsible for the 2^d adjacent
    elements of the vertex.
w (w ≥ 0)
    Weight of a geometric element, i.e. the costs of the element plus all its
    descendants.
w_local
    Sum of the weights of all the local children of a refined element.
w_remote
    Maximum of the weights of all the remote children of a refined element.

derive
    Derives thisRank and subLevel entries for a new vertex. Uses the parent
    element's vertices.
mergeWithNeighbour
    Merges thisRank and subLevel of a vertex with a remote vertex's lists. Each
    rank is allowed to modify the entries it is responsible for. This happens due
    to merges and joins.
Grid Traversal and Storage

The following operations evaluate the traversal stack automaton. Their results are
used to realise the grid management, i.e. the operations deliver information on where
to take vertex data from and where to write vertex data to. If a symbol accepts
arguments besides the stack automaton, these arguments are mentioned in the
description. The results are formulated as questions.

P_touched
    Accepts the number of a face. It is a proxy to evaluate access. Has the
    neighbouring element connected by this face been visited before, i.e. have
    the vertices belonging to this face been read and written by the neighbouring
    element?
belongsToTouchedFace
    Is given the 2d touched predicates assigned to the faces of an element and
    the position of a vertex within this element. Is there one touched face
    adjacent to this vertex?
belongsToUntouchedFace
    Mirrors belongsToTouchedFace. Is there one untouched face adjacent to the
    vertex?
isPositiveAlongAxis
    Accepts an axis number. Does the Peano iterate run in the positive direction
    along this coordinate axis?
getReadStack
    Accepts a vertex position and returns the number of the temporary stack if
    this vertex is to be loaded from a temporary stack. Otherwise, it returns
    UseInputStream, and the algorithm has to take the vertex from the input
    stream.
getWriteStack
    Counterpart of getReadStack.
Bibliography
[1] M. Bader.
Robuste, parallele Mehrgitterverfahren für die KonvektionsDiffusions-Gleichung. Herbert Utz Verlag, Dissertation, Technische Universität
München, 2001.
[2] M. Bader, S. Schraufstetter, C. A. Vigh, and J. Behrens. Memory Efficient
Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using
Sierpinski Curves. International Journal of Computational Science and Engineering, 4(1):12–21, 2008.
[3] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and
O. Sander. A Generic Grid Interface for Parallel and Adaptive Scientific Computing. Part I: Abstract Framework. Computing, 82(2–3):103–119, 2008.
[4] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and
O. Sander. A Generic Grid Interface for Parallel and Adaptive Scientific Computing. Part II: Implementation and Tests in DUNE. Computing, 82(2–3):121–
138, 2008.
[5] M. R. Benioff and E. D. Lazowska. Report to the President. Computational
Science: Ensuring America’s Competitiveness. President’s Information Technology Advisory Committee, 2005.
[6] B. Bergen. Hierarchical Hybrid Grids: Data Structures and Core Algorithms
for Efficient Finite Element Simulations on Supercomputers, volume AS14
of Advances in Simulation. SCS Europe, Dissertation, Friedrich-AlexanderUniversität Erlangen, 2005.
[7] M. J. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial
differential equations. Journal of Computational Physics, 53:484–512, 1984.
[8] D. Braess. Finite Elements—Theory, Fast Solvers, and Applications in Solid
Mechanics. Cambridge University Press, 3rd edition, 2007.
[9] M. Brenk, H.-J. Bungartz, M. Mehl, I. L. Muntean, T. Neckel, and T. Weinzierl.
Numerical simulation of particle transport in a drift ratchet. SIAM Journal of
Scientific Computing, 30(6):2777–2798, 2008.
[10] W. L. Briggs, H. Van Emden, and S. F. McCormick. A Multigrid Tutorial.
Cambridge University Press, 2nd edition, 2000.
[11] M. Broy. Informatik—Eine grundlegende Einführung. Programmierung und
Rechnerstrukturen. Springer-Verlag, 2nd edition, 1997.
[12] M. Broy. Informatik—Eine grundlegende Einführung. Systemstrukturen und
Theoretische Informatik. Springer-Verlag, 2nd edition, 1998.
[13] H.-J. Bungartz, W. Eckhardt, M. Mehl, and T. Weinzierl. Dastgen - A Data
Structure Generator for Parallel C++ HPC Software. In M. Buback, G. D. van
Albada, P. M. A. Sloot, and J. J. Dongarra, editors, ICCS 2008 Proceedings,
Lecture Notes in Computer Science, Heidelberg, Berlin, 2008. Springer-Verlag.
[14] H.-J. Bungartz, W. Eckhardt, T. Weinzierl, and C. Zenger. A Precompiler
to Reduce the Memory Footprint of Multiscale PDE Solvers in C++. Future
Generation Computer Systems, 2009. (in press).
[15] H.-J. Bungartz and M. Griebel. Sparse Grids. Acta Numerica, 13:147–269,
2004.
[16] H.-J. Bungartz, M. Griebel, and C. Zenger. Einführung in die Computergraphik:
Grundlagen, geometrische Modellierung, Algorithmen. Vieweg+Teubner Verlag, 2nd edition, 2002.
[17] H.-J. Bungartz, M. Mehl, and T. Weinzierl. A parallel adaptive Cartesian PDE
solver using space–filling curves. In E. W. Nagel, V. W. Walter, and W. Lehner,
editors, Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, volume 4128 of Lecture Notes in Computer Science, pages 1064–1074,
Berlin Heidelberg, 2006. Springer-Verlag.
[18] C. Burstedde, O. Ghattas, M. Gurnis, G. Stadler, E. Tan, T. Tu, L. C. Wilcox,
and S. Zhong. Scalable adaptive mantle convection simulation on petascale
supercomputers. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on
Supercomputing, pages 1–15. IEEE Press, 2008.
[19] C. Burstedde, O. Ghattas, G. Stadler, T. Tu, and L. C. Wilcox. Towards
Adaptive Mesh PDE Simulations on Petascale Computers. In Proceedings of
Teragrid ’08, published electronically on www.teragrid.org, 2008.
[20] M. de Berg, O. Cheong, and M. van Kreveld. Computational Geometry: Algorithms and Applications. Springer-Verlag, 3rd edition, 2008.
[21] N. Dieminger. Kriterien für die Selbstadaption cache-effizienter Mehrgitteralgorithmen. Diploma Thesis, Fakultät für Mathematik, Technische Universität
München, 2005.
[22] C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiss. Cache optimization for structured and unstructured grid multigrid. Electronic Transactions
on Numerical Analysis, 10:21–40, 2000.
[23] W. Eckhardt. Automated Recursion Unrolling for a Dynamical Adaptive
PDE Solver. Diploma Thesis, Fakultät für Informatik, Technische Universität
München, 2009.
[24] M. Fowler. Patterns of Enterprise Application Architecture. Addison-Wesley
Longman, 2002.
[25] A. C. Frank. Organisationsprinzipien zur Integration von geometrischer Modellierung, numerischer Simulation und Visualisierung. Herbert Utz Verlag,
Dissertation, Institut für Informatik, Technische Universität München, 2000.
[26] C. Freundl, T. Gradl, and U. Rüde. Towards Adaptive Mesh PDE Simulations
on Petascale Computers. In Petascale Computing. Algorithms and Applications,
pages 375–389. Chapman & Hall/CRC Computational Science, 2008.
[27] E. Gamma, R. Helm, R. E. Johnson, and J. Vlissides. Design Patterns - Elements of Reusable Object-Oriented Software. Addison-Wesley Longman, 1st
edition, 1994.
[28] C. Gotsman and M. Lindenbaum. On the metric properties of discrete spacefilling curves. In IEEE Transactions on Image Processing, volume 5, pages
794–797, 1996.
[29] T. Gradl and U. Rüde. High Performance Multigrid in Current Large Scale Parallel Computers. In 9th Workshop on Parallel Systems and Algorithms (PASA),
volume 124, pages 37–45. GI Edition: Lecture Notes in Informatics, 2008.
[30] M. Griebel.
Zur Lösung von Finite-Differenzen- und Finite-ElementGleichungen mittels der Hiearchischen-Transformations-Mehrgitter-Methode,
volume 342/4/90 A.
SFB-Bericht, Dissertation, Technische Universität
München, 1990.
[31] M. Griebel and G. Zumbusch. Parallel multigrid in an adaptive PDE solver
based on hashing and space-filling curves. Parallel Computing, 25(7):827–843,
1999.
[32] M. Griebel and G. W. Zumbusch. Hash-Storage Techniques for Adaptive Multilevel Solvers and their Domain Decomposition Parallelization. In J. Mandel,
C. Farhat, and X.-C. Cai, editors, Proceedings of Domain Decomposition Methods 10, DD10, number 218, pages 279–286, 1998.
[33] Michael Griebel. Multilevelmethoden als Iterationsverfahren über Erzeugendensystemen. Teubner Skripten zur Numerik. Teubner, Habilitation, Technische
Universität München, 1994.
[34] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. High-performacne
parallel implicit CFD. Parallel Computing, 27(4):337–362, 2001.
[35] F. Günther. Eine cache-optimale Implementierung der Finiten-ElementeMethode. Dissertation, published electronically, Institut für Informatik, Technische Universität München, 2004.
[36] F. Günther, M. Mehl, M. Pögl, and C. Zenger. A cache-aware algorithm for
PDEs on hierarchical data structures based on space-filling curves. SIAM Journal on Scientific Computing, 28(5):1634–1650, 2006.
[37] D. Hackenberg, R. Schöne, W.E. Nagel, and S.Pflüger. Optimizing OpenMP
Parallelized DGEMM Calls on SGI Altix 3700. In W. E. Nagel, W. V. Walter,
and W. Lehner, editors, Euro-Par, volume 4128 of Lecture Notes in Computer
Science, pages 145–154. Springer-Verlag, 2006.
[38] F. H. Harlow and J. E. Welch. Numerical calculation of time-dependent viscous
incompressible flow of fluid with a free surface. Physics of Fluids, 8(12):2182–
2189, 1965.
[39] J. Hartmann. Entwicklung eines cache-optimalen Finite-Element-Verfahrens
zur Lösung d-dimensionaler Probleme. Diploma Thesis, Institut für Informatik,
Technische Universität München, 2004.
[40] W. Herder. Lastverteilung und parallelisierte Erzeugung von Eingabedaten
für ein paralleles cache-optimales Finite-Element-Verfahren. Diploma Thesis,
Institut für Informatik, Technische Universität München, 2005.
[41] T. Huckle. Compact fourier analysis for designing multigrid methods. SIAM
Journal on Scientifc Computing, 31(1):644–666, November 2008.
[42] J. Hungershöfer and J.-M. Wierum. On the quality of partitions based on
space-filling curves. In ICCS ’02: Proceedings of the International Conference
on Computational Science-Part III, pages 36–45. Springer-Verlag, 2002.
[43] S. Iqbal and G. F. Carey. Performance analysis of dynamic load balancing algorithms with variable number of processors. Journal of Parallel and Distributed
Computing, 65(8):934–948, 2005.
[44] D. E. Keyes. Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations. In A. Bode, T. Ludwig, W. Karl,
and R. Wismüller, editors, Euro-Par ’00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, volume 1900 of Lecture
Notes in Computer Science, pages 1–17. Springer-Verlag, 2000.
[45] D. E. Keyes. Domain Decomposition Methods in the Mainstream of Computational Science. In Proceedings of the 14th International Conference on Domain
Decomposition Methods, pages 79–93. Published by the National Autonomous
University of Mexico (UNAM), 2003.
[46] D. E. Knuth. The genesis of attribute grammars. In P. Deransart and M. Jourdan, editors, WAGA: Proceedings of the international conference on Attribute
grammars and their applications, pages 1–12. Springer-Verlag, 1990.
[47] D. E. Knuth. The Art of Computer Programming - Volumes 1–3. AddisonWesley Professional, 2nd edition, 1998.
[48] M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In U. Meyer, P. Sanders, and
J. F. Sibeyn, editors, Algorithms for Memory Hierarchies 2002, pages 213–232.
Springer-Verlag, 2003.
[49] A. Krahnke. Adaptive Verfahren höherer Ordnung auf cache-optimalen Datenstrukturen für dreidimensionale Probleme. Dissertation, published electronically, Technische Universität München, 2005.
[50] M. Langlotz. Parallelisierung eines Cache-optimalen 3D Finite-ElementVerfahrens. Diploma Thesis, Fakultät für Informatik, Technische Universität
München, 2004.
[51] M. Lieb. A full multigrid implementation on staggered adaptive Cartesian grids
for the pressure Poisson equation in computational fluid dynamics. Master's
thesis, Institut für Informatik, Technische Universität München, 2008.
[52] V. D. Liseikin. Grid Generation Methods. Springer-Verlag, 1st edition, 1999.
[53] S. Meyers. Effective STL. Addison-Wesley, 2001.
[54] W. F. Mitchell. A Parallel Multigrid Method Using the Full Domain Partition. In Special Issue for Proceedings of the 8th Copper Mountain Conference
on Multigrid Methods, volume 6, pages 224–233. Electronic Transactions on
Numerical Analysis, 1998.
[55] W. F. Mitchell. The Full Domain Partition Approach to Distributing Adaptive
Grids. In Proceedings of international centre for mathematical sciences on Grid
adaptation in computational PDES: theory and applications, pages 265–275.
Elsevier, 1998.
[56] W. F. Mitchell. A refinement-tree based partitioning method for dynamic load
balancing with adaptively refined grids. Journal of Parallel and Distributed
Computing, 67(4):417–429, 2007.
[57] G. M. Morton. A computer oriented geodetic data base and a new technique
in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, 1966.
[58] R.-P. Mundani. Hierarchische Geometriemodelle zur Einbettung verteilter Simulationsaufgaben. Berichte aus der Informatik. Shaker Verlag, Dissertation,
Universität Stuttgart, 2006.
[59] T. Neckel. Einfache 2d-Fluid-Struktur-Wechselwirkungen mit einer cache-optimalen Finite-Element-Methode. Diploma Thesis, Fakultät für Mathematik,
Technische Universität München, 2005.
[60] T. Neckel. The PDE framework Peano: An environment for efficient flow
simulations. Verlag Dr. Hut, Dissertation, Technische Universität München,
2009.
[61] M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowski. A common data
management infrastructure for parallel adaptive algorithms for pde solutions.
In Proceedings of the 1997 ACM/IEEE conference on Supercomputing, pages
1–22. ACM Press, 1997.
[62] T. Plewa, T. Linde, and V. G. Weirs. Adaptive Mesh Refinement - Theory and
Applications. Springer-Verlag, 2005.
[63] M. Pögl. Entwicklung eines cache-optimalen 3D Finite-Element-Verfahrens für
große Probleme, volume 745 of Fortschritt-Berichte VDI, Informatik Kommunikation 10. VDI Verlag, Dissertation, Technische Universität München, 2004.
[64] U. Rüde. Mathematical and computational techniques for multilevel adaptive
methods, volume 13 of Frontiers in Applied Mathematics. SIAM, Habilitation,
Technische Universität München, 1993.
[65] R. Rugina and M. C. Rinard. Recursion unrolling for divide and conquer
programs. In S. P. Midkiff, J. E. Moreira, M. Gupta, S. Chatterjee, J. Ferrante, J. Prins, W. Pugh, and C.-W. Tseng, editors, LCPC ’00: Proceedings
of the 13th International Workshop on Languages and Compilers for Parallel
Computing, volume 2017 of Lecture Notes in Computer Science, pages 34–48.
Springer-Verlag, 2001.
[66] H. Sagan. Space-filling curves. Springer-Verlag, New York, 1994.
[67] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
[68] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley,
1989.
[69] S. Schraufstetter. Speichereffiziente Algorithmen zum Lösen partieller Differentialgleichungen auf adaptiven Dreiecksgittern. Diploma Thesis, Fakultät für
Mathematik, Technische Universität München, 2006.
[70] B. Stroustrup. Die C++ Programmiersprache. Addison-Wesley, 4th edition,
2000.
[71] H. Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency
in Software. Dr. Dobb's Journal, 30(3):202–210, 2005.
[72] J. F. Thompson, B. K. Soni, and N. P. Weatherill. Handbook of Grid Generation. CRC Press, 1998.
[73] U. Trottenberg, A. Schüller, and C. W. Oosterlee. Multigrid. Academic Press, 1st
edition, 2000.
[74] J. von Neumann. First Draft of a Report on the EDVAC. IEEE Annals of the
History of Computing, 15(4):27–75, 1993.
[75] T. Wagner. Randbehandlung höherer Ordnung für ein cache-optimales Finite-Element-Verfahren auf kartesischen Gittern. Diploma Thesis, Fakultät für
Mathematik, Technische Universität München, 2005.
[76] T. Weinzierl. Eine cache-optimale Implementierung eines Navier-Stokes-Lösers
unter besonderer Berücksichtigung physikalischer Erhaltungssätze. Diploma
Thesis, Institut für Informatik, Technische Universität München, 2005.
[77] C. Weiß, W. Karl, M. Kowarschik, and U. Rüde. Memory characteristics of
iterative methods. In Supercomputing ’99: Proceedings of the 1999 ACM/IEEE
conference on Supercomputing, pages 1–31. ACM Press, 1999.
[78] I. Yavneh. Why Multigrid Methods Are So Efficient. Computing in Science &
Engineering, 8(6):12–22, 2006.