Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis (Examensarbete)

A Data-Parallel Graphics Pipeline Implemented in OpenCL

Master's thesis carried out in Information Coding at the Institute of Technology, Linköping University

by
Joel Ek

LiTH-ISY-EX--12/4632--SE

Supervisor: Harald Nautsch, ISY, Linköpings universitet
Examiner: Ingemar Ragnemalm, ISY, Linköpings universitet

Linköping, 27 November 2012

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Division, Department: Avdelningen för informationskodning (Division of Information Coding), Department of Electrical Engineering, SE-581 83 Linköping
Date: 2012-11-27
Language: English
Report category: Examensarbete (Master's thesis)
ISRN: LiTH-ISY-EX--12/4632--SE
URL for electronic version: http://urn.kb.se/resolve?urn:nbn:se:liu:diva-85679
Title: A Data-Parallel Graphics Pipeline Implemented in OpenCL

Author: Joel Ek

Abstract

This report documents implementation details, results, benchmarks and technical discussions for the work carried out within a master's thesis at Linköping University. Within the master's thesis, the field of software rendering is explored in the age of parallel computing. Using the Open Computing Language, a complete graphics pipeline was implemented for use on graphics processing units from different vendors. The pipeline is tile-based, fully-configurable and provides the means of rendering visually compelling images in real time. Yet, further optimizations for parallel architectures are needed, as uneven workloads drastically decrease the overall performance of the pipeline.

Keywords: computer graphics, GPGPU, OpenCL, graphics pipeline, rasterization, texture filtering, parallelization
Contents

A Introduction
   1 Report Structure

B Background
   1 Related Research
      1.1 Samuli Laine and Tero Karras
      1.2 Larry Seiler et al.
   2 The Open Computing Language

C Mathematical Foundation
   1 The Fundamental Shapes
      1.1 Triangle
      1.2 Quadrilateral
   2 Line Normals
   3 Geometric Series
   4 Orthonormalization
   5 Matrix Inversion
      5.1 Four Elements
      5.2 Nine Elements
      5.3 Sixteen Elements
   6 Transformation Matrices
      6.1 Identity
      6.2 Translation
      6.3 Scaling
      6.4 Basis
      6.5 Rotation
      6.6 Projection
   7 Line Segment-Plane Intersection
   8 View Frustum Clipping
      8.1 Clipping Planes
      8.2 Example
      8.3 Algorithm
   9 Barycentric Coordinates
      9.1 Cramer's Rule
      9.2 Ratios of Areas
   10 The Separating Axis Theorem
      10.1 Candidate Axes
      10.2 Collision Detection
   11 Tangent Space
   12 Perspective-Correct Interpolation
   13 Rounding Modes
      13.1 Flooring
      13.2 Ceiling
      13.3 Nearest

D Texture Filtering
   1 Short on Discrete Images
   2 Magnification and Minification
   3 Wrapping Modes
      3.1 Repeating
      3.2 Clamping
   4 Nearest Neighboring Texel
   5 Bilinear Interpolation
   6 Image Pyramids
      6.1 Pre-processing
      6.2 Sampling Method
   7 Anisotropic Image Pyramids
      7.1 Pre-processing
      7.2 Sampling Method
   8 Summed-Area Tables
      8.1 Construction
      8.2 Sampling Method
      8.3 Bilinear Interpolation
      8.4 Clamping Textures
      8.5 Repeating Textures
      8.6 Higher-Order Filters

E Pipeline Overview
   1 Design Considerations
      1.1 Deferred Rendering
      1.2 Rasterization
      1.3 Batch Rendering
   2 Vertex Transformation
   3 View Frustum Clipping
   4 Triangle Projection
   5 Rasterization
      5.1 Large Tiles
      5.2 Small Tiles
      5.3 Pixels
   6 Differentiation of Texture Coordinates
   7 Fragment Illumination

F Results
   1 Rendered Images
   2 Texture Filtering
   3 Benchmark Data
      3.1 Rasterization
      3.2 Load Balancing
      3.3 Texture Mapping

G Discussion
   1 Memory Requirements
      1.1 Triangles
      1.2 Tiles
      1.3 Fragments
   2 Performance

H Conclusion
Chapter A
Introduction
With the introduction of consumer-level graphics processing units during the
late 1990’s, three-dimensional computer graphics has become increasingly advanced and visually compelling. Today, we are on the verge of generating
photo-realistic synthetic images in real-time. These synthetic images can sometimes be hard to distinguish from real photography, all thanks to the research
in computer graphics and to the technical evolution of the graphics processing
unit.
Modern graphics processing units are extremely powerful for parallel computing tasks such as rendering images. To perform such embarrassingly parallel
tasks in serial on the central processing unit is considered a thing of the past.
The visual quality expected in modern computer graphics imposes a performance requirement too large to be met by a serial architecture.
As there are multiple vendors of graphics processing units, the hardware will
differ too much between vendors and models to program the hardware directly.
Instead, modern three-dimensional computer graphics makes heavy use of application program interfaces which serve as abstraction layers of the hardware.
Two well-used interfaces are the Open Graphics Library from the Khronos
Group and Direct3D from Microsoft. Using these interfaces, the application
programmer can set up the rendering procedure in a desired way, supply rendering data and instruct the hardware to perform the rendering. This requires no knowledge about the actual graphics hardware available to the end user of the application.
Graphics hardware was once designed to render computer graphics in an almost
unmodifiable fashion. The possibilities of specifying how a certain task should
be performed were limited to a fixed set of pre-selected choices. For instance,
lighting could previously only be computed at the geometry level using either
flat or Gouraud shading as introduced in 1971 by Gouraud in [1]. Neither were
there any possibilities of multi-pass rendering. The need for programmability
was evident.
Graphics processing units introduced some programmability during the early
2000’s. Previously fixed functionality was opened up to the application programmers through the introduction of small programs modifying the behavior
of a certain stage in the graphics pipeline. These small programs became
known as shaders since they were originally intended to modify the shading
behavior in the graphics pipeline.
The shaders opened up a whole range of new rendering techniques, not only
limited to actual shading. As illustrious application programmers discovered
what was possible, they also discovered what the limitations were. During
the mid 2000’s, the concept of shaders evolved as vendors of graphics processing units introduced the unified shader model. Instead of having separate
hardware sub-systems executing the different shader types, a unified array of
processing units was introduced. These were capable of executing all shader
types supported by the graphics hardware.
The unified model allowed application programmers to utilize the hardware as
a massively parallel processing system. As such, the hardware was no longer
limited to rendering images. Embarrassingly parallel tasks became accelerated
by the new architecture.
The hardware vendors and interface developers realized the possibilities of the
new architecture and released application program interfaces for the new hardware. Notably, Nvidia released the Compute Unified Device Architecture in
2007, the Khronos Group the Open Computing Language in 2008 and Microsoft DirectCompute in 2009.
The Compute Unified Device Architecture will only run on Nvidia hardware,
DirectCompute only through Direct3D on Microsoft platforms whereas the
Open Computing Language runs on many different systems and on hardware
from different vendors.
These interfaces were intended for general-purpose heterogeneous computing
on consumer-level graphics processing units. Through these interfaces, the
units are now used for a wide array of massively parallel tasks.
An interesting set of questions arises. Does the new tool set reopen the possibility
of explicitly programming the entire graphics pipeline as done on the central
processing unit in the mid 1990’s? If so, what are the benefits and drawbacks
of a fully-configurable pipeline and how does its performance compare to the
industry standard interfaces intended for computer graphics?
1 Report Structure
This master’s thesis will explore the structure of the graphics pipeline, evaluate the related techniques and discuss implementation details using the Open
Computing Language. The following chapters of this report are structured as
shown in the list below.
• Background
• Mathematical Foundation
• Texture Filtering
• Pipeline Overview
• Results
• Discussion
• Conclusion
The background chapter will discuss related research and give an introduction
to the Open Computing Language. The following mathematical foundation
will explore techniques and theorems used in computer graphics from a mathematical standpoint.
Texture filtering is a large subject in computer graphics and as such discussed
in a separate chapter. In the chapter, details on the common texture filters
are provided, as well as details on a set of non-standard filters such as the
summed-area table.
The pipeline overview provides the actual details on the implemented graphics
pipeline which uses the Open Computing Language. All major stages are
explained using the mathematical foundation.
In the results chapter, the produced results are presented through rendered
images and tables of benchmark data. The results are discussed in the following
discussion chapter.
The final chapter concludes this master’s thesis and lists some drawbacks that
were observed in the results chapter. It also suggests future research topics
and how some stages of the pipeline could be implemented more efficiently.
Chapter B
Background
The graphics pipeline for rasterized graphics consists of several distinct stages.
Each stage receives input data, processes the data and outputs it to the next
stage in the pipeline. For a general pipeline, the stages are transformation,
clipping, projection, culling, rasterization and shading followed by blending.
The structure of the graphics pipeline varies between different application program interfaces and between different hardware and it is constantly evolving.
Recently, a stage that allows tessellation of geometry data was introduced. This
stage alters the linear layout of the traditional pipeline and is a complex stage,
unfortunately out of scope for this master’s thesis.
The set of stages previously listed is the bare minimum of stages in the graphics pipeline required in order to correctly render three-dimensional computer
graphics. The task of implementing a pipeline in a data-parallel fashion is
straightforward for some stages. Other stages pose a great challenge as the
parallel execution may cause data contamination. This is true especially for
the rasterization and blending stages.
1 Related Research

1.1 Samuli Laine and Tero Karras
This master’s thesis is heavily influenced by the research presented in 2011 by
Laine and Karras in [2]. In their paper, they present a data-parallel rasterization scheme running on the Compute Unified Device Architecture for Nvidia
devices. Their paper is heavily focused on the rasterization stage as it is the
stage in which most design decisions affecting performance can be made. As
such, rasterization is a close-kept industrial secret of the different hardware
manufacturers.
Laine and Karras employ a batch rendering scheme to enable high occupancy
of the device. The data to be rendered is transferred to device memory and is
composed of a vertex list and an indexed triangle list. They state that their
intentions were to explore the raw rasterization performance and as such, no
texturing occurs and no texture data is transferred. In addition, the vertex
transformation stage is also left out of their pipeline. This as it is trivial to
implement in parallel and its performance theoretically equal to that of the
hardware pipeline. Lighting is computed at each individual vertex and not
included in the benchmarks they present.
In the pipeline of Laine and Karras, the triangle list is processed in parallel.
They employ a clipping pass which clips every triangle against the six planes of
the view frustum, producing zero to seven sub-triangles. These are projected
into an integer grid with sub-pixel resolution and potentially culled before the
rasterization stage is entered.
For rasterization, they employ a sort-middle hierarchical rasterization refinement scheme. The screen is divided into bins of 128 by 128 pixels and every
bin into tiles of 8 by 8 pixels. In addition, they limit the screen to 16 by 16
bins which corresponds to a resolution of 2048 by 2048 pixels. This yields a
maximum number of bins of 256 where each bin in turn corresponds to 256
tiles and each tile to 64 pixels.
In the first rasterization pass of their pipeline, overlap between triangles and
bins is computed in parallel. The ID of every triangle that overlaps a bin
is appended to one of the local queues of that bin. This enables bin queues
to be updated without synchronization between thread arrays and bins to be
processed independently in the next pass, hence the sort-middle term.
In the second rasterization pass, bin queues are processed and the coverage
between triangles and tiles is computed. Similar to the first pass, the IDs of triangles overlapping a tile are stored in one of the local queues of that tile.
This is followed by the actual pixel rasterization pass in which coverage is
computed using a look-up table approach.
Instead of computing a 64-bit coverage mask for each triangle and tile, the cases
in a look-up table are determined for each of the three edges of the triangle.
This can be represented using seven bits per edge according to Laine and
Karras. In the shading pass, the actual coverage of the tile can be determined
through bit-wise operations on the values of the look-up table.
Laine and Karras observe that their pipeline performs roughly 2 to 8 times
slower in comparison to the hardware pipeline. This holds for a number of different
test scenes with depth testing enabled, perspectively-correct color interpolation
and no multi-sampling.
1.2 Larry Seiler et al.
The Larrabee was a proposed microarchitecture by Intel, designed for applications with sections of highly parallel tasks. The platform was based on the
x86 instruction set and introduced a range of new vector instructions to enable
parallel processing of those tasks.
The new vector instructions were able to operate on a set of 32 new vector
registers, each 512 bits wide. The width allowed each vector register to store
eight 64-bit or sixteen 32-bit real-valued numbers or integers. The set of new
instructions operated on those registers, effectively processing the stored values
in parallel.
To demonstrate the capabilities of the Larrabee platform, a tile-based rasterization scheme was presented in 2008 by Seiler et al. in [3]. The rasterization
employed a sort-middle hierarchical rasterization refinement scheme, very similar to that later presented by Laine and Karras.
As the new instructions were designed to handle at most 16 elements in parallel,
Seiler et al. grouped 4 by 4 pixels in a tile and 4 by 4 tiles in a bin. In
their pipeline, the rasterization was performed through evaluating the edge
functions as detailed in 1988 by Pineda in [4]. The evaluated edge functions
were compared to the trivial reject and accept values for each rasterization
level. This was all done in parallel using the new instruction set.
The Larrabee platform will never reach any end users as the project was discontinued in 2010. This is unfortunate as it would have been beneficial to render
three-dimensional computer graphics using the native x86 platform. However,
their paper opened up a discussion on the feasibility of software rendering in
the age of parallel computing and inspired the work of Laine and Karras.
2 The Open Computing Language
The Open Computing Language is an application program interface designed
to abstract the hardware of general purpose computing units. As previously
mentioned, the Khronos Group designed the specification to run on multiple
platforms and hardware from different vendors. This makes it an ideal tool for
implementing a hardware-accelerated graphics pipeline.
The interface allows small programs known as kernels to be executed in parallel
on an Open Computing Language-enabled device. These devices can actually
be of any architecture, including multi-core central processing units. However,
the best performance is achieved on graphics processing units, provided that
the task is highly parallel in nature.
Kernels are written using a subset of the C99 programming language and
may be compiled in advance if the target device is known or at run-time to
support the device available to the end user. The programming language does
not support memory allocation nor function pointers. In addition, pointer
arithmetic is allowed but could potentially break the code if memory alignment
is not considered.
The kernels are executed in parallel as threads on the target device. How the
threads are distributed within the device is determined by the implementer of
the specification and as such will depend on the type of device. For instance,
the Nvidia Fermi architecture consists of 1 to 15 streaming multi-processors,
each with 32 cores. The threads are grouped into cooperative thread arrays
and in turn into groups of 32 threads. Such a group is known as a warp
and two warps may be issued and executed at each streaming multi-processor
simultaneously. For each clock cycle, 16 cores receive an instruction from the
first warp and the other 16 from the second warp [5].
For the Open Computing Language, a thread is known as a work item and a group of threads as a work group. When executing a kernel on Nvidia hardware, work items translate to threads and work groups to cooperative thread
arrays. As such, it is desirable to partition the work items into groups of n · 32
for the Fermi architecture. This is somewhat problematic in the Open Computing Language as the total number of work items must be a multiple of the
work group size. However, it can be achieved through issuing a greater number
of work items than required for the task and ensuring that kernels return instantly for the extra work items. Since the total number of work items often is
large, the issuing of additional work items should not have a significant impact
on the overall performance.
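As an added sketch of this padding scheme, the host rounds the global work size up to a multiple of the work group size and the kernel returns instantly for the extra work items. The kernel, its arguments and the variable num_items are hypothetical names for this illustration.

/* Host side: round the global work size up to a multiple of 32,
   where num_items holds the number of items to actually process. */
size_t local_size  = 32;
size_t global_size = ((num_items + local_size - 1) / local_size) * local_size;

/* Kernel side (OpenCL C): padding work items return instantly. */
__kernel void process(__global const float *input,
                      __global float *output,
                      const uint num_items)
{
    uint id = get_global_id(0);
    if (id >= num_items)
        return;                     /* extra work item, nothing to do */
    output[id] = input[id] * 2.0f;
}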
The Open Computing Language specification defines four different memory
address spaces. These are private memory for each work item, local memory
that is shared between the work items of a work group, global memory that
is shared between all work groups and constant memory which is a read-only
variant of the global memory. The specification requires that the amount of
local memory is at least 16 kilobytes but does not specify that the local memory
needs to be located on-chip [6].
On the Fermi architecture, the local memory is located in the L1 cache of each
streaming multi-processor and can be configured to either 16 or 48 kilobytes
of the total 64 kilobytes. This makes the local memory extremely fast on the
Fermi architecture. As such, kernels should store data shared between the
work items of a work group in local memory whenever possible.
The graphics pipeline implemented was designed with Fermi as the target
architecture. As such, all kernels are executed in work groups of 32 work items
and local memory is utilized whenever possible.
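The following added OpenCL C sketch shows the pattern described above: data shared within a work group is staged in local memory and the work items are synchronized with a barrier before it is read. The work group size of 32 matches the pipeline's configuration, but the kernel itself is a hypothetical reduction, not code from the thesis.

__kernel void sum_tile(__global const float *input,
                       __global float *partial_sums)
{
    __local float tile[32];              /* shared by the work group */
    uint lid = get_local_id(0);

    tile[lid] = input[get_global_id(0)]; /* stage data in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);        /* wait for the whole group */

    if (lid == 0) {
        float sum = 0.0f;
        for (uint i = 0; i < 32; ++i)
            sum += tile[i];
        partial_sums[get_group_id(0)] = sum;
    }
}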
The data transfers and the execution of kernels in the Open Computing Language are handled by a device queue. The queue employs the just-in-time execution model and queued commands are only guaranteed to have completed if
the queue is explicitly instructed to complete all commands. As the data to be
transferred often depends on the results of previously executed kernels, data
transfers can be enqueued as blocking. This ensures that the application program interface does not return control to the application until the commands
have completed.
The Open Computing Language is able to inter-operate with application program interfaces designed for computer graphics such as the Open Graphics
Library or Direct3D. This allows generic data to be processed by kernels and
displayed using a graphics interface. In order to utilize this functionality, the
context of the Open Computing Language must be shared with that of the
graphics interface. This is performed in the initialization stage in which the
context of the graphics interface is created, often through additional interfaces
for context and window handling. The context of the Open Computing Language is then instructed to use the same context as the graphics interface.
Resources can then be acquired and released using the command queue.
One notable inter-operation feature is the possibility of sharing an image object
between the two interfaces. This allows an image to be generated using the
Open Computing Language and displayed using the graphics interface. It
also allows the image to be rendered using the graphics interface, processed
by kernels of the Open Computing Language and displayed by the graphics
interface. For these examples, no image data needs to be transferred between
the host and the device, making inter-operation features extremely useful.
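A hedged host-side sketch of this image sharing follows, assuming an OpenCL 1.1 implementation with the cl_khr_gl_sharing extension, a context already created against the OpenGL context, and an existing OpenGL texture handle gl_texture. Error handling is omitted.

/* Wrap the OpenGL texture as an OpenCL image object. */
cl_int err;
cl_mem image = clCreateFromGLTexture2D(context, CL_MEM_WRITE_ONLY,
                                       GL_TEXTURE_2D, 0, gl_texture, &err);

glFinish();                                        /* let OpenGL finish first */
clEnqueueAcquireGLObjects(queue, 1, &image, 0, NULL, NULL);
/* ... enqueue kernels that write to the shared image ... */
clEnqueueReleaseGLObjects(queue, 1, &image, 0, NULL, NULL);
clFinish(queue);                                   /* hand the image back to OpenGL */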
Chapter C
Mathematical Foundation
In order to fully comprehend the concepts discussed throughout the rest of this
report, a mathematical foundation is needed. This chapter assumes that the
reader is familiar with linear algebra and specifically the notation of vectors
and matrices.
1 The Fundamental Shapes
Two shapes commonly used in three-dimensional computer graphics are triangles and quadrilaterals.
1.1 Triangle
The fundamental shape of three-dimensional computer graphics is the triangle. It is formed from three vertices and its edges span a plane if they are linearly independent. Should two edges be linearly dependent (parallel), the triangle will collapse and have zero area. Figure 1.1 shows three vertices $\vec{U}$, $\vec{V}$ and $\vec{W}$ and how they form a proper triangle.

Figure 1.1: A triangle is formed from three vertices.
Equation 1.1 shows how the area of a two-dimensional triangle $A_t$ can be computed using the cross-product between two of the edges. Depending on the choice of edges, the resulting area may be negative. However, the cross-product between edges $\vec{U}$ to $\vec{V}$ and $\vec{U}$ to $\vec{W}$ is guaranteed to be positive if the vertices are specified in clock-wise order for a left-handed coordinate system or in counter clock-wise order for a right-handed coordinate system. A triangle with negative area is back-facing.

$$A_t = \frac{(V_x - U_x) \cdot (W_y - U_y) - (W_x - U_x) \cdot (V_y - U_y)}{2} \tag{1.1}$$
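To make equation 1.1 concrete in code, the following is a minimal C sketch; the vec2 type and the function name are illustrative and not taken from the thesis implementation.

typedef struct { float x, y; } vec2;

/* Signed area of the triangle (u, v, w) according to equation 1.1.
   Positive for a front-facing winding, negative for a back-facing one. */
float triangle_area(vec2 u, vec2 v, vec2 w)
{
    return ((v.x - u.x) * (w.y - u.y) -
            (w.x - u.x) * (v.y - u.y)) * 0.5f;
}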
1.2 Quadrilateral
Quadrilaterals are formed from four vertices and are slightly more complex than their three-vertex counterparts. In contrast to triangles, the vertices of a quadrilateral are not guaranteed to coincide with a single plane and cannot be arranged into arbitrary shapes. This makes their use in defining three-dimensional geometry limited. Yet there are other areas of computer graphics in which proper quadrilaterals are formed, making them useful. Figure 1.2 shows four vertices $\vec{K}$, $\vec{L}$, $\vec{M}$ and $\vec{N}$ and how they form a proper quadrilateral.

Figure 1.2: A quadrilateral is formed from four vertices.
Equation 1.2 shows how the area of a two-dimensional quadrilateral $A_q$ can be computed using the cross-product between two diagonals. Similar to triangles, the sign of the area depends on the choice of diagonals. In the equation, the cross-product between diagonals $\vec{K}$ to $\vec{M}$ and $\vec{L}$ to $\vec{N}$ is computed. This choice of diagonals guarantees that the resulting area is positive if the quadrilateral is front-facing with respect to the order of the vertices and to the reference coordinate system.

$$A_q = \frac{(M_x - K_x) \cdot (N_y - L_y) - (N_x - L_x) \cdot (M_y - K_y)}{2} \tag{1.2}$$

2 Line Normals
A two-dimensional line separates space into two sub-spaces. These two can be referred to as the positive and negative sub-spaces and are defined by selecting a line normal. For two-dimensional lines and line segments, there are two possible line normals of unit length. Figure 2.1 shows a line segment from $\vec{S}$ to $\vec{E}$ and the corresponding line normals $\vec{N}_1$ and $\vec{N}_2$.

Figure 2.1: The two possible unit normals of a two-dimensional line segment.
Equation 2.1 shows the relationship between the coordinates of the starting point $\vec{S}$ and the ending point $\vec{E}$ and the unnormalized line normals $\vec{N}_1$ and $\vec{N}_2$. Selecting one of these will effectively determine which sub-space is positive and which is negative.

$$\vec{N}_1 = (E_y - S_y,\; S_x - E_x) \qquad \vec{N}_2 = (S_y - E_y,\; E_x - S_x) \tag{2.1}$$

3 Geometric Series
A geometric series is an infinite sum where each term is the previous term multiplied by a constant factor $q$. These series are formed from recursive patterns and can be used to analyze whether a sum will grow indefinitely or if there is an upper bound at which the series converges. Equation 3.1 states that the series will converge if the absolute value of the constant factor $q$ is less than one. It also implies that the sum of a finite number of terms is bounded by the expression $\frac{1}{1-q}$.

$$\lim_{N \to \infty} \sum_{i=0}^{N} q^i = \begin{cases} \frac{1}{1-q} & \text{if } |q| < 1 \\ \infty & \text{otherwise} \end{cases} \tag{3.1}$$
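As a quick added sanity check of equation 3.1 (an example not present in the original text), choosing $q = \frac{1}{2}$ gives

$$\sum_{i=0}^{\infty} \left(\frac{1}{2}\right)^i = 1 + \frac{1}{2} + \frac{1}{4} + \ldots = \frac{1}{1 - \frac{1}{2}} = 2$$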
4 Orthonormalization
In order to span the entire three-dimensional space, a set of three linearly independent basis vectors is needed. This set does not need to be orthogonal, though it is a convenient property in three-dimensional computer graphics. Even more convenient is to ensure that the set is both orthogonal and that each basis vector is of unit length. This is known as an orthonormal set of basis vectors.

To force any given set of linearly independent basis vectors to be orthonormal, the Gram-Schmidt process can be used. The process operates on each basis vector in sequence and may start at any vector. Equation 4.1 shows how the first basis vector $\vec{X}'$ is selected from the set and normalized into $\vec{X}''$.

$$\vec{X}' = \vec{X} \qquad \vec{X}'' = \frac{\vec{X}'}{|\vec{X}'|} \tag{4.1}$$
Equation 4.2 shows how the second basis vector $\vec{Y}'$ is found by subtracting the projection of $\vec{Y}$ on $\vec{X}''$ from itself. The vector is then normalized into $\vec{Y}''$ and the two vectors $\vec{X}''$ and $\vec{Y}''$ now form an orthonormal pair of basis vectors.

$$\vec{Y}' = \vec{Y} - (\vec{Y} \bullet \vec{X}'') \cdot \vec{X}'' \qquad \vec{Y}'' = \frac{\vec{Y}'}{|\vec{Y}'|} \tag{4.2}$$
Equation 4.3 shows how the process continues with the final basis vector $\vec{Z}'$. The basis vector orthogonal to both $\vec{X}''$ and $\vec{Y}''$ is found by subtracting the projection of $\vec{Z}$ on $\vec{X}''$ and on $\vec{Y}''$ from itself. Finally, the vector is normalized and $\vec{Z}''$ is obtained.

$$\vec{Z}' = \vec{Z} - (\vec{Z} \bullet \vec{X}'') \cdot \vec{X}'' - (\vec{Z} \bullet \vec{Y}'') \cdot \vec{Y}'' \qquad \vec{Z}'' = \frac{\vec{Z}'}{|\vec{Z}'|} \tag{4.3}$$
After the process, the set of basis vectors $\vec{X}''$, $\vec{Y}''$ and $\vec{Z}''$ is orthogonal and each basis vector is of unit length. This process is useful if the set is initially orthonormal but subject to multiple transformations, as small errors are introduced with every transformation in a computational context. Figure 4.1 illustrates the orthonormalization process.

Figure 4.1: The Gram-Schmidt process produces a set of orthonormal basis vectors.
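The Gram-Schmidt process of equations 4.1 through 4.3 translates directly into code. The C sketch below is an added illustration under the assumption of a simple vec3 type and helper functions; it is not the thesis implementation.

#include <math.h>

typedef struct { float x, y, z; } vec3;

static float dot3(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static vec3 sub_scaled(vec3 a, vec3 b, float s)           /* a - s * b */
{
    vec3 r = { a.x - s * b.x, a.y - s * b.y, a.z - s * b.z };
    return r;
}

static vec3 normalize3(vec3 a)
{
    float inv = 1.0f / sqrtf(dot3(a, a));
    vec3 r = { a.x * inv, a.y * inv, a.z * inv };
    return r;
}

/* Re-orthonormalize the basis {x, y, z} in place (equations 4.1-4.3). */
void orthonormalize(vec3 *x, vec3 *y, vec3 *z)
{
    *x = normalize3(*x);                                   /* equation 4.1 */
    *y = normalize3(sub_scaled(*y, *x, dot3(*y, *x)));     /* equation 4.2 */
    *z = sub_scaled(*z, *x, dot3(*z, *x));                 /* equation 4.3 */
    *z = normalize3(sub_scaled(*z, *y, dot3(*z, *y)));
}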
5 Matrix Inversion
Inverting small square matrices is a common task in three-dimensional computer graphics and should therefore be detailed here. Equation 5.1 shows the general formula of the adjoint method. The method is often performed in three steps and produces an analytical inverse to a square matrix of any size. However, the complexity of the method grows fast, making it unsuitable for larger matrices. For such matrices, numerical inversion methods exist.

$$M^{-1} = \frac{\operatorname{adj}(M)}{\det(M)} = \frac{\operatorname{cof}(M)^T}{\det(M)} \quad \text{if } \det(M) \neq 0 \tag{5.1}$$
Firstly, the minor matrix of $M$ is computed. The minor at index $(i, j)$ is defined as the determinant of the matrix formed by excluding row $i$ and column $j$ from $M$. Secondly, the cofactor matrix is computed. The cofactor at index $(i, j)$ is defined as the minor at index $(i, j)$ multiplied by $(-1)^{i+j}$. Finally, the cofactor matrix is transposed and the adjoint matrix created. Dividing every element of the adjoint matrix by the determinant of the original matrix results in the inverse matrix $M^{-1}$. This provided that the determinant is not equal to zero.
5.1 Four Elements
Equation 5.2 shows a square matrix $M_f$ and its four elements $a$, $b$, $c$ and $d$.

$$M_f = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \tag{5.2}$$
Equation 5.3 shows how the determinant of $M_f$ is defined.

$$\det(M_f) = a \cdot d - b \cdot c \tag{5.3}$$
Equation 5.4 shows the cofactor matrix of $M_f$. For four-element matrices, it seems as if the cofactor matrix is formed through reorganizing the elements of $M_f$. However, every element at index $(i, j)$ is actually the minor at index $(i, j)$ with the corresponding sign adjustment.

$$\operatorname{cof}(M_f) = \begin{pmatrix} +d & -c \\ -b & +a \end{pmatrix} \tag{5.4}$$
Transposing the cofactor matrix and dividing every element by the determinant of $M_f$ gives the inverse matrix $M_f^{-1}$. This is seen in equation 5.5.

$$M_f^{-1} = \frac{1}{\det(M_f)} \begin{pmatrix} +d & -b \\ -c & +a \end{pmatrix} \quad \text{if } \det(M_f) \neq 0 \tag{5.5}$$
5.2 Nine Elements
Equation 5.6 shows a square matrix $M_n$ and its nine elements $a$, $b$, $c$, $d$, $e$, $f$, $g$, $h$ and $i$.

$$M_n = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \tag{5.6}$$
Equation 5.7 shows how the determinant of $M_n$ is defined.

$$\det(M_n) = a \cdot (e \cdot i - f \cdot h) - b \cdot (d \cdot i - f \cdot g) + c \cdot (d \cdot h - e \cdot g) \tag{5.7}$$
The cofactor matrix of $M_n$ is formed by computing the minors of $M_n$ and adjusting the sign as previously stated. In contrast to $M_f$, the complexity is higher as every minor is computed as the determinant of a square four-element matrix. This is seen in equation 5.8.

$$\operatorname{cof}(M_n) = \begin{pmatrix} +(e \cdot i - f \cdot h) & -(d \cdot i - f \cdot g) & +(d \cdot h - e \cdot g) \\ -(b \cdot i - c \cdot h) & +(a \cdot i - c \cdot g) & -(a \cdot h - b \cdot g) \\ +(b \cdot f - c \cdot e) & -(a \cdot f - c \cdot d) & +(a \cdot e - b \cdot d) \end{pmatrix} = \begin{pmatrix} e \cdot i - f \cdot h & f \cdot g - d \cdot i & d \cdot h - e \cdot g \\ c \cdot h - b \cdot i & a \cdot i - c \cdot g & b \cdot g - a \cdot h \\ b \cdot f - c \cdot e & c \cdot d - a \cdot f & a \cdot e - b \cdot d \end{pmatrix} \tag{5.8}$$
Transposing the cofactor matrix and dividing every element by the determinant of $M_n$ gives the inverse matrix $M_n^{-1}$. This is seen in equation 5.9.

$$M_n^{-1} = \frac{1}{\det(M_n)} \begin{pmatrix} e \cdot i - f \cdot h & c \cdot h - b \cdot i & b \cdot f - c \cdot e \\ f \cdot g - d \cdot i & a \cdot i - c \cdot g & c \cdot d - a \cdot f \\ d \cdot h - e \cdot g & b \cdot g - a \cdot h & a \cdot e - b \cdot d \end{pmatrix} \quad \text{if } \det(M_n) \neq 0 \tag{5.9}$$
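As an added sketch of the adjoint method for nine-element matrices (equations 5.7 and 5.9), the C function below inverts a 3x3 matrix. The row-major mat3 layout is an assumption made for this illustration.

typedef struct { float m[3][3]; } mat3;

/* Invert a using the adjoint method; returns 0 if det(a) is zero. */
int invert3x3(const mat3 *a, mat3 *out)
{
    const float (*e)[3] = a->m;
    /* Determinant according to equation 5.7. */
    float det = e[0][0] * (e[1][1] * e[2][2] - e[1][2] * e[2][1])
              - e[0][1] * (e[1][0] * e[2][2] - e[1][2] * e[2][0])
              + e[0][2] * (e[1][0] * e[2][1] - e[1][1] * e[2][0]);
    if (det == 0.0f)
        return 0;                        /* no inverse exists */
    float inv = 1.0f / det;
    /* Transposed cofactors divided by the determinant (equation 5.9). */
    out->m[0][0] = (e[1][1] * e[2][2] - e[1][2] * e[2][1]) * inv;
    out->m[0][1] = (e[0][2] * e[2][1] - e[0][1] * e[2][2]) * inv;
    out->m[0][2] = (e[0][1] * e[1][2] - e[0][2] * e[1][1]) * inv;
    out->m[1][0] = (e[1][2] * e[2][0] - e[1][0] * e[2][2]) * inv;
    out->m[1][1] = (e[0][0] * e[2][2] - e[0][2] * e[2][0]) * inv;
    out->m[1][2] = (e[0][2] * e[1][0] - e[0][0] * e[1][2]) * inv;
    out->m[2][0] = (e[1][0] * e[2][1] - e[1][1] * e[2][0]) * inv;
    out->m[2][1] = (e[0][1] * e[2][0] - e[0][0] * e[2][1]) * inv;
    out->m[2][2] = (e[0][0] * e[1][1] - e[0][1] * e[1][0]) * inv;
    return 1;
}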
5.3 Sixteen Elements
As the method is valid for any square matrix, it can be used to invert sixteen-element matrices. However, the complexity is significantly higher when compared to nine-element matrices, making the method unsuitable for real-time purposes. Yet, sixteen-element matrices are the most commonly used matrices in three-dimensional computer graphics and the need for clever inversion methods is evident.
6 Transformation Matrices
Three-dimensional computer graphics uses three-dimensional vectors to define
various geometric properties. Such properties are, amongst others, the topology of a surface, the location of a light source and the orientation of a camera
or an observer.
In order to fully model a transformation of a three-dimensional vector, the vector is often stored and handled in its four-dimensional homogeneous form. This embeds the three-dimensional vector in a four-dimensional super-space for which the fourth component is almost always equal to one. Equation 6.1 shows a homogeneous four-component vector $\vec{P}$.

$$\vec{P} = (P_x, P_y, P_z, 1) \tag{6.1}$$
The homogeneous form enables affine transformations of the vector using a square sixteen-component transformation matrix. Equation 6.2 shows an affine transformation matrix $A$. For most affine transformations, the bottom row will be equal to $(0, 0, 0, 1)$.

$$A = \begin{pmatrix} X_x & Y_x & Z_x & W_x \\ X_y & Y_y & Z_y & W_y \\ X_z & Y_z & Z_z & W_z \\ X_w & Y_w & Z_w & W_w \end{pmatrix} \tag{6.2}$$
The vector $\vec{P}$ is defined in its local coordinate system. If this coordinate system is embedded in a global coordinate system, the matrix $A$ represents a transformation from the local coordinate system into the global one. The three vectors $\vec{X}$, $\vec{Y}$ and $\vec{Z}$ represent the basis vectors of the local coordinate system expressed in the global coordinate system. Conversely, the fourth vector $\vec{W}$ can be viewed as the translation vector which represents the offset of the local coordinate system with respect to the global one.
Figure 6.1 shows this relationship. The vector $\vec{P}$ is defined in its local coordinate system with base vectors $\vec{X}$, $\vec{Y}$ and $\vec{Z}$. Applying matrix $A$ to the vector $\vec{P}$ results in the vector being expressed in $\vec{X}'$, $\vec{Y}'$ and $\vec{Z}'$. To transform in the opposite direction, the inverse of $A$ is applied to the vector $\vec{P}$ expressed in the global coordinate system.

Figure 6.1: A transformation matrix transforms vectors between coordinate systems.
A vector may be transformed between multiple coordinate systems and by different transformation matrices before reaching the coordinate system of a camera or an observer. This forms a chain of transformations. This transformation chain is often premultiplied so that the vector only needs to be transformed once. This is shown in equation 6.3.

$$\vec{P}' = A_n \cdot A_{n-1} \cdot \ldots \cdot A_2 \cdot A_1 \cdot \vec{P} = (A_n \cdot A_{n-1} \cdot \ldots \cdot A_2 \cdot A_1) \cdot \vec{P} \tag{6.3}$$
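To illustrate the premultiplication in equation 6.3, the added C sketch below multiplies a chain of transformation matrices once and reuses the compound matrix for every vector. The row-major mat4 layout and the convention that chain[0] holds $A_1$ are assumptions for this example.

typedef struct { float m[4][4]; } mat4;

/* c = a * b, so applying c equals applying b first and then a. */
static mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 c;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a->m[i][k] * b->m[k][j];
            c.m[i][j] = s;
        }
    return c;
}

/* Premultiply A_n * ... * A_1 (equation 6.3); chain[0] is A_1. */
mat4 premultiply(const mat4 *chain, int n)
{
    mat4 compound = chain[n - 1];
    for (int i = n - 2; i >= 0; --i)
        compound = mat4_mul(&compound, &chain[i]);
    return compound;
}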
6.1 Identity
The identity matrix is a transformation matrix which when applied to a vector does not alter the vector in any way. Its only function in the transformation chain is to initialize it so that subsequent matrices in the chain can be premultiplied. Equation 6.4 shows an identity matrix $I$ and its inverse $I^{-1}$. As seen in the equation, the matrix and its inverse are identical.

$$I = I^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.4}$$
6.2 Translation
The translation matrix is a transformation matrix that when applied to a vector adds another vector to it. When applied to a three-dimensional model, the model will be offset by the translation vector. Equation 6.5 shows a translation matrix $T$ and how the three vector components $x$, $y$ and $z$ are embedded into the matrix.

$$T(x, y, z) = \begin{pmatrix} 1 & 0 & 0 & x \\ 0 & 1 & 0 & y \\ 0 & 0 & 1 & z \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.5}$$
Inverting the translation matrix is done by translating in the opposite direction. Equation 6.6 shows the relationship between the translation matrix $T$ and its inverse $T^{-1}$.

$$T(x, y, z)^{-1} = T(-x, -y, -z) = \begin{pmatrix} 1 & 0 & 0 & -x \\ 0 & 1 & 0 & -y \\ 0 & 0 & 1 & -z \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.6}$$
6.3 Scaling
Uniform Scaling

The uniform scaling matrix is a transformation matrix that when applied to a vector scales all of its components equally. When applied to a three-dimensional model, the model will shrink or grow depending on the magnitude of the scaling factor $f$. This provided that the model is located around the origin of the local coordinate system. Equation 6.7 shows a uniform scaling matrix $S_u$ and how the scaling factor $f$ is embedded into the matrix.

$$S_u(f) = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & f & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.7}$$
Inverting the uniform scaling matrix is done by scaling with the inverse of $f$. Equation 6.8 shows the relationship between the uniform scaling matrix $S_u$ and its inverse $S_u^{-1}$. As seen in the equation, the inverse only exists if the scaling factor is not equal to zero. This as a scaling factor of zero would collapse the transformed vector into the null vector, eliminating the possibility of returning to the original vector.

$$S_u(f)^{-1} = S_u\!\left(\frac{1}{f}\right) = \begin{pmatrix} \frac{1}{f} & 0 & 0 & 0 \\ 0 & \frac{1}{f} & 0 & 0 \\ 0 & 0 & \frac{1}{f} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } f \neq 0 \tag{6.8}$$
Non-Uniform Scaling

The non-uniform scaling matrix is a transformation matrix that when applied to a vector scales all of its components differently. When applied to a three-dimensional model, the model will become wider or smaller in each of the three directions. As for uniform scaling matrices, this requires that the model is located around the origin of the local coordinate system. Equation 6.9 shows a non-uniform scaling matrix $S$ and how the scaling factors $f_x$, $f_y$ and $f_z$ are embedded into the matrix.

$$S(f_x, f_y, f_z) = \begin{pmatrix} f_x & 0 & 0 & 0 \\ 0 & f_y & 0 & 0 \\ 0 & 0 & f_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.9}$$
Inverting the non-uniform scaling matrix is done in the same fashion as with uniform scaling matrices. The inverse is created by scaling with the inverses of $f_x$, $f_y$ and $f_z$. Equation 6.10 shows this relationship. The inverse requires that each of the three scaling factors is not equal to zero as this would collapse one or several of the components of the vector.

$$S(f_x, f_y, f_z)^{-1} = S\!\left(\frac{1}{f_x}, \frac{1}{f_y}, \frac{1}{f_z}\right) = \begin{pmatrix} \frac{1}{f_x} & 0 & 0 & 0 \\ 0 & \frac{1}{f_y} & 0 & 0 \\ 0 & 0 & \frac{1}{f_z} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } f_x, f_y, f_z \neq 0 \tag{6.10}$$
6.4 Basis
The basis matrix is a transformation matrix that when applied to a vector transforms the vector from its local coordinate system to a global coordinate system. The local coordinate system is defined by the three base vectors $\vec{X}$, $\vec{Y}$ and $\vec{Z}$ expressed in the global coordinate system as previously detailed. Equation 6.11 shows a basis matrix $B$ and how the three base vectors are embedded into the matrix.

$$B(\vec{X}, \vec{Y}, \vec{Z}) = \begin{pmatrix} X_x & Y_x & Z_x & 0 \\ X_y & Y_y & Z_y & 0 \\ X_z & Y_z & Z_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.11}$$
Inverting the basis matrix can be done in a few different ways depending on the properties of the set of base vectors. The upper-left nine components can be inverted using the inversion method for nine-component matrices discussed earlier. This as the other elements of the matrix are equal to those of an identity matrix.

Using this method will amount to finding a set of modified base vectors $\vec{X}'$, $\vec{Y}'$ and $\vec{Z}'$ and embedding this set in a transposed fashion. This can be seen in equation 6.12.

$$B(\vec{X}, \vec{Y}, \vec{Z})^{-1} = \begin{pmatrix} X'_x & X'_y & X'_z & 0 \\ Y'_x & Y'_y & Y'_z & 0 \\ Z'_x & Z'_y & Z'_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.12}$$
When using this method, the inverse matrix $B^{-1}$ will only exist if the determinant of $B$ is not equal to zero. Computing the determinant from the three original base vectors is done as shown in equation 6.13.

$$\det(B) = X_x \cdot (Y_y \cdot Z_z - Z_y \cdot Y_z) - Y_x \cdot (X_y \cdot Z_z - Z_y \cdot X_z) + Z_x \cdot (X_y \cdot Y_z - Y_y \cdot X_z) \tag{6.13}$$
Equation 6.14 shows how the set of modified base vectors is defined from the original set of base vectors. The components of these vectors are embedded into the inverse matrix and the inversion is complete.

$$\vec{X}' = \frac{\vec{Y} \times \vec{Z}}{\det(B)} = \frac{1}{\det(B)} \begin{pmatrix} Y_y \cdot Z_z - Y_z \cdot Z_y \\ Y_z \cdot Z_x - Y_x \cdot Z_z \\ Y_x \cdot Z_y - Y_y \cdot Z_x \end{pmatrix} \quad \text{if } \det(B) \neq 0$$

$$\vec{Y}' = \frac{\vec{Z} \times \vec{X}}{\det(B)} = \frac{1}{\det(B)} \begin{pmatrix} Z_y \cdot X_z - Z_z \cdot X_y \\ Z_z \cdot X_x - Z_x \cdot X_z \\ Z_x \cdot X_y - Z_y \cdot X_x \end{pmatrix} \quad \text{if } \det(B) \neq 0$$

$$\vec{Z}' = \frac{\vec{X} \times \vec{Y}}{\det(B)} = \frac{1}{\det(B)} \begin{pmatrix} X_y \cdot Y_z - X_z \cdot Y_y \\ X_z \cdot Y_x - X_x \cdot Y_z \\ X_x \cdot Y_y - X_y \cdot Y_x \end{pmatrix} \quad \text{if } \det(B) \neq 0 \tag{6.14}$$
If the set of base vectors is orthogonal and the bases not of unit lengths, the modified base vectors will correspond to the original ones divided by their square lengths. This assumes that no base vector is equal to the null vector, which would result in a determinant of zero. Equation 6.15 shows this process.

$$\vec{X}' = \frac{\vec{X}}{\vec{X} \bullet \vec{X}} \quad \text{if orthogonal} \land \vec{X} \bullet \vec{X} \neq 0$$
$$\vec{Y}' = \frac{\vec{Y}}{\vec{Y} \bullet \vec{Y}} \quad \text{if orthogonal} \land \vec{Y} \bullet \vec{Y} \neq 0$$
$$\vec{Z}' = \frac{\vec{Z}}{\vec{Z} \bullet \vec{Z}} \quad \text{if orthogonal} \land \vec{Z} \bullet \vec{Z} \neq 0 \tag{6.15}$$
If the set of base vectors is orthonormal, the inverse matrix $B^{-1}$ will always be equal to the transpose of $B$. If this is a known property of the set of base vectors, the inverse will always exist and full inversion is unnecessary. The inverse matrix can instead be created through embedding the original base vectors in a transposed fashion. This is shown in equation 6.16.

$$B(\vec{X}, \vec{Y}, \vec{Z})^{-1} = B(\vec{X}, \vec{Y}, \vec{Z})^T = \begin{pmatrix} X_x & X_y & X_z & 0 \\ Y_x & Y_y & Y_z & 0 \\ Z_x & Z_y & Z_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if orthonormal} \tag{6.16}$$
6.5 Rotation
Rotational matrices are transformation matrices that when applied to a vector rotate the vector around a rotational axis. They are closely related to basis matrices and can be used to rotate a three-dimensional model around an axis.
About the X-Axis

Rotating about the x-axis by an angle $\alpha$ is done using equation 6.17. In the equation, the base vectors $\vec{Y}$ and $\vec{Z}$ are formed using the two trigonometric functions sine and cosine. The base vector $\vec{X}$ is left unmodified as it is the axis about which the rotation is performed.

$$R_x(\alpha) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) & 0 \\ 0 & \sin(\alpha) & \cos(\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.17}$$
The rotation can be undone by rotating with the negative angle as seen in equation 6.18. The inverse matrix $R_x^{-1}$ is also equal to the transpose of $R_x$ as the set of base vectors form an orthonormal basis.

$$R_x(\alpha)^{-1} = R_x(-\alpha) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(-\alpha) & -\sin(-\alpha) & 0 \\ 0 & \sin(-\alpha) & \cos(-\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\alpha) & \sin(\alpha) & 0 \\ 0 & -\sin(\alpha) & \cos(\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = R_x(\alpha)^T \tag{6.18}$$
About the Y-Axis

Rotating about the y-axis by an angle $\alpha$ is done using equation 6.19. In the equation, the base vectors $\vec{X}$ and $\vec{Z}$ are formed using the two trigonometric functions sine and cosine. The base vector $\vec{Y}$ is left unmodified as it is the axis about which the rotation is performed.

$$R_y(\alpha) = \begin{pmatrix} \cos(\alpha) & 0 & \sin(\alpha) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(\alpha) & 0 & \cos(\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.19}$$
The rotation can be undone by rotating with the negative angle as seen in equation 6.20. The inverse matrix $R_y^{-1}$ is also equal to the transpose of $R_y$ as the set of base vectors form an orthonormal basis.

$$R_y(\alpha)^{-1} = R_y(-\alpha) = \begin{pmatrix} \cos(-\alpha) & 0 & \sin(-\alpha) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(-\alpha) & 0 & \cos(-\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} \cos(\alpha) & 0 & -\sin(\alpha) & 0 \\ 0 & 1 & 0 & 0 \\ \sin(\alpha) & 0 & \cos(\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = R_y(\alpha)^T \tag{6.20}$$
About the Z-Axis

Rotating about the z-axis by an angle $\alpha$ is done using equation 6.21. In the equation, the base vectors $\vec{X}$ and $\vec{Y}$ are formed using the two trigonometric functions sine and cosine. The base vector $\vec{Z}$ is left unmodified as it is the axis about which the rotation is performed.

$$R_z(\alpha) = \begin{pmatrix} \cos(\alpha) & -\sin(\alpha) & 0 & 0 \\ \sin(\alpha) & \cos(\alpha) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{6.21}$$
The rotation can be undone by rotating with the negative angle as seen in equation 6.22. The inverse matrix $R_z^{-1}$ is also equal to the transpose of $R_z$ as the set of base vectors form an orthonormal basis.

$$R_z(\alpha)^{-1} = R_z(-\alpha) = \begin{pmatrix} \cos(-\alpha) & -\sin(-\alpha) & 0 & 0 \\ \sin(-\alpha) & \cos(-\alpha) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} \cos(\alpha) & \sin(\alpha) & 0 & 0 \\ -\sin(\alpha) & \cos(\alpha) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = R_z(\alpha)^T \tag{6.22}$$
About an Arbitrary Axis
In order to rotate about an arbitrary axis, an orthonormal set of base vectors
that describes the change of basis needs to be found. These base vectors can be
found analytically if the process is decomposed into several rotation matrices.
Equation 6.23 shows this process and the five rotational matrices involved.
$$R(x, y, z, \alpha) = R_1^{-1} \cdot R_2^{-1} \cdot R(\alpha) \cdot R_2 \cdot R_1 \tag{6.23}$$
The first step is to rotate the rotational axis into one of the three axis-aligned
planes. This is done using the R1 matrix. The second step is to rotate the
rotational axis in that plane so that it coincides with one of the two standard
axes that span the plane. This is done using the R2 matrix. The rotation is
then performed as a rotation about that standard axis by an angle α. This is
done using the R matrix. The final two steps involve inverting the first two
rotations and applying them in inverse order so that the original coordinate
system is restored.
The following example uses the x/z-plane as its axis-aligned plane and performs
the rotation about the z-axis. Given a rotational axis with components x, y
and z, two additional attributes k and l are computed. This is seen in equation
6.24 where k is the length of the rotational axis projected in the x/y-plane and
l the full length of the vector.
$$k = \sqrt{x^2 + y^2} \qquad l = \sqrt{x^2 + y^2 + z^2} \tag{6.24}$$
The projection of the rotational axis in the x/y-plane is needed as the rotation
into the x/z-plane is performed about the z-axis in this example. Equation
6.25 shows this rotation where R1 forms an orthonormal basis.
$$R_1 = \begin{pmatrix} \frac{x}{k} & \frac{y}{k} & 0 & 0 \\ -\frac{y}{k} & \frac{x}{k} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } k \neq 0 \tag{6.25}$$
As the basis is orthonormal, the inverse of $R_1$ is equal to its transpose. This is shown in equation 6.26.

$$R_1^{-1} = R_1^T = \begin{pmatrix} \frac{x}{k} & -\frac{y}{k} & 0 & 0 \\ \frac{y}{k} & \frac{x}{k} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } k \neq 0 \tag{6.26}$$
(6.26)
Rotating the rotational axis, now in the x/z-plane, into the z-axis is performed
as rotation about the y-axis. This is shown in equation 6.27 in which R2 also
forms an orthonormal basis.
0 − kl
l
0 1 0
R2 = 
k 0 z
l
l
0 0 0
z

0
0
 if l 6= 0
0
1
(6.27)
The orthonormal basis provides a simple way of inverting the matrix $R_2$. Equation 6.28 shows how the inverse is equal to its transpose.

$$R_2^{-1} = R_2^T = \begin{pmatrix} \frac{z}{l} & 0 & \frac{k}{l} & 0 \\ 0 & 1 & 0 & 0 \\ -\frac{k}{l} & 0 & \frac{z}{l} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } l \neq 0 \tag{6.28}$$
Premultiplying the two matrices $R_2$ and $R_1$ gives the result shown in equation 6.29.

$$R_2 \cdot R_1 = \begin{pmatrix} \frac{x \cdot z}{k \cdot l} & \frac{y \cdot z}{k \cdot l} & -\frac{k}{l} & 0 \\ -\frac{y}{k} & \frac{x}{k} & 0 & 0 \\ \frac{x}{l} & \frac{y}{l} & \frac{z}{l} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } k, l \neq 0 \tag{6.29}$$
As the inverses of $R_1$ and $R_2$ are equal to their respective transposes, premultiplying the two matrices $R_1^{-1}$ and $R_2^{-1}$ can be done by transposing the product of $R_2$ and $R_1$. Equation 6.30 shows this procedure and the resulting matrix.

$$R_1^{-1} \cdot R_2^{-1} = R_1^T \cdot R_2^T = (R_2 \cdot R_1)^T = \begin{pmatrix} \frac{x \cdot z}{k \cdot l} & -\frac{y}{k} & \frac{x}{l} & 0 \\ \frac{y \cdot z}{k \cdot l} & \frac{x}{k} & \frac{y}{l} & 0 \\ -\frac{k}{l} & 0 & \frac{z}{l} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } k, l \neq 0 \tag{6.30}$$
Multiplying the entire chain of rotations, with $R$ as a rotation about the z-axis by an angle of $\alpha$, gives a set of base vectors $\vec{X}$, $\vec{Y}$ and $\vec{Z}$. Equation 6.31 shows how these vectors are embedded into the rotational matrix.

$$R(x, y, z, \alpha) = \begin{pmatrix} X_x & Y_x & Z_x & 0 \\ X_y & Y_y & Z_y & 0 \\ X_z & Y_z & Z_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } l = 1 \tag{6.31}$$
In order to simplify the expressions of the base vectors, the rotational axis is assumed to be of unit length. Additionally, the attributes $k$ and $l$ were required to be non-zero for the individual rotational matrices. However, simplifying the entire transformation chain removes these criteria from the final matrix. Equation 6.32 shows the simplified expressions of the base vectors.

$$\begin{aligned}
X_x &= x^2 + (1 - x^2) \cdot \cos(\alpha) \\
X_y &= x \cdot y \cdot (1 - \cos(\alpha)) + z \cdot \sin(\alpha) \\
X_z &= x \cdot z \cdot (1 - \cos(\alpha)) - y \cdot \sin(\alpha) \\
Y_x &= x \cdot y \cdot (1 - \cos(\alpha)) - z \cdot \sin(\alpha) \\
Y_y &= y^2 + (1 - y^2) \cdot \cos(\alpha) \\
Y_z &= y \cdot z \cdot (1 - \cos(\alpha)) + x \cdot \sin(\alpha) \\
Z_x &= x \cdot z \cdot (1 - \cos(\alpha)) + y \cdot \sin(\alpha) \\
Z_y &= y \cdot z \cdot (1 - \cos(\alpha)) - x \cdot \sin(\alpha) \\
Z_z &= z^2 + (1 - z^2) \cdot \cos(\alpha)
\end{aligned} \tag{6.32}$$
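Equation 6.32 maps directly to code. The following added C sketch builds the rotation matrix about an arbitrary unit axis (x, y, z); the row-major mat4 layout matches the earlier sketch and is an assumption for illustration.

#include <math.h>

typedef struct { float m[4][4]; } mat4;

/* Rotation about the unit axis (x, y, z) by angle a (equation 6.32).
   The columns of the upper-left 3x3 block hold the base vectors X, Y, Z. */
mat4 rotation_axis_angle(float x, float y, float z, float a)
{
    float c = cosf(a), s = sinf(a), t = 1.0f - c;
    mat4 r = {{
        { x * x + (1.0f - x * x) * c, x * y * t - z * s,          x * z * t + y * s,          0.0f },
        { x * y * t + z * s,          y * y + (1.0f - y * y) * c, y * z * t - x * s,          0.0f },
        { x * z * t - y * s,          y * z * t + x * s,          z * z + (1.0f - z * z) * c, 0.0f },
        { 0.0f,                       0.0f,                       0.0f,                       1.0f }
    }};
    return r;
}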
Inverting the compound matrix can be done in one of two ways. Either through
a rotation by the negative angle or by transposing the compound matrix. This
is shown in equation 6.33.
$$R(x, y, z, \alpha)^{-1} = R(x, y, z, -\alpha) = R(x, y, z, \alpha)^T = \begin{pmatrix} X_x & X_y & X_z & 0 \\ Y_x & Y_y & Y_z & 0 \\ Z_x & Z_y & Z_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{if } l = 1 \tag{6.33}$$
6.6 Projection
Adjusting the perspective for a three-dimensional vector when viewed by a
camera or an observer is one of the fundamental processes in three-dimensional
computer graphics. Objects located near the observer appear bigger than identical objects located further away. The projection process is often modeled using matrices, despite the inability of performing this process solely by applying
a matrix to a vector.
The projection operator is dependent on the actual components of the vector
it is to project, making it impossible to model using matrices. However, projection can be achieved by introducing a concept known as the homogeneous
divide. Equation 6.34 shows a simple projection matrix.
$$P = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \tag{6.34}$$
Applying this matrix to a four-dimensional homogeneous vector $(x, y, z, 1)$ results in the vector $(x, y, z, z)$. As seen, its fourth component is now possibly different from one. By applying the dividing operator which divides each component by the fourth component, the vector becomes $(\frac{x}{z}, \frac{y}{z}, 1, 1)$. It is now homogenized with respect to the fourth component and perspectively projected into the axis-aligned plane at $z = 1$.
It should be emphasized that the projection matrix is irreversible as its determinant is equal to zero.
Perspective projection is better viewed as a geometric operation. Figure 6.2 shows a vector $\vec{P}$ and the orthonormal basis $\{\vec{X}, \vec{Y}, \vec{Z}\}$. Deriving an expression for projecting the vector into the axis-aligned plane at $z = 1$ can be done using the properties of similar triangles.

Figure 6.2: Perspective projection can be derived using similar triangles.
Equation 6.35 shows the relationship between the vector $\vec{P}$ and its perspectively projected counterpart $\vec{P}'$. As seen in the equation, the result is equal to that produced when using matrices and the homogeneous division operator.

$$\begin{aligned}
\frac{P'_x}{1} &= \frac{P_x}{P_z - 1 + 1} \;\Leftrightarrow\; P'_x = \frac{P_x}{P_z} \\
\frac{P'_y}{1} &= \frac{P_y}{P_z - 1 + 1} \;\Leftrightarrow\; P'_y = \frac{P_y}{P_z} \\
\frac{P'_z}{1} &= \frac{P_z}{P_z - 1 + 1} \;\Leftrightarrow\; P'_z = 1
\end{aligned} \tag{6.35}$$
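As a small added illustration of the homogeneous divide, the C snippet below homogenizes a four-component vector; the vec4 type is assumed for this example.

typedef struct { float x, y, z, w; } vec4;

/* Homogeneous divide: after applying the projection matrix P, the fourth
   component equals z, so dividing by it projects onto the plane z = 1.
   The caller must ensure that w is non-zero. */
vec4 homogeneous_divide(vec4 p)
{
    vec4 r = { p.x / p.w, p.y / p.w, p.z / p.w, 1.0f };
    return r;
}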
7 Line Segment-Plane Intersection
There are three possible outcomes for the intersection between a line and a
plane. The line may be embedded into the plane, producing infinitely many
intersection points. It may be located a distance from the plane but its direction perpendicular to the normal of the plane. For this case, no intersection
points exist. For all other cases, there is exactly one intersection point.
Line segments require that the two points forming the line are located on
opposite sides of the plane in order to produce exactly one intersection point.
This section is dedicated to locating that unique intersection point, should it
exist.
A plane is defined through a plane normal $\vec{N}$ and the minimum distance to that plane $d$. Equation 7.1 shows how the signed distance between a plane $P$ and a vector $\vec{X}$ can be computed as a scalar projection between the normal and the vector and by subtracting the minimum distance. Naturally, the plane is located where the distance is zero.

$$\operatorname{dist}(P, \vec{X}) = \vec{N} \bullet \vec{X} - d \tag{7.1}$$
Line segments are composed of a starting point $\vec{S}$ and an ending point $\vec{E}$. Determining whether the line segment intersects a plane is done by evaluating the signed distances to the two points. If the two signed distances are of opposite signs, the line segment will intersect the plane at a point $\vec{I}$. For all other cases, no intersection point exists. Figure 7.1 shows a line segment intersecting a plane.

Figure 7.1: A line segment intersecting a plane from the positive half-space.
From the figure, a number of relationships can be observed through the use of similar triangles. Equation 7.2 shows these relationships.

$$\frac{a}{\operatorname{dist}(P, \vec{S})} = \frac{b}{-\operatorname{dist}(P, \vec{E})} = \frac{a + b}{\operatorname{dist}(P, \vec{S}) - \operatorname{dist}(P, \vec{E})} \tag{7.2}$$
Finding the intersection point is essentially the process of scaling the direction vector $\vec{S}$ to $\vec{E}$ by an amount proportional to the two signed distances and adding it to the starting point. Equation 7.3 shows how the intersection point is found where $t_1$ is a fraction of the length of the direction vector.

$$\vec{I} = \vec{S} + t_1 \cdot (\vec{E} - \vec{S}) \tag{7.3}$$
The fraction $t_1$ is computed using the previously observed relationships. Equation 7.4 shows the definition of this fraction.

$$t_1 = \frac{a}{a + b} = \frac{\operatorname{dist}(P, \vec{S})}{\operatorname{dist}(P, \vec{S}) - \operatorname{dist}(P, \vec{E})} \tag{7.4}$$
The line segment may just as well start in the negative half-space of the plane
and end in the positive half-space. Though the intersection point may be
found using the previously defined equations in a mathematical context, it
will depend on the orientation in a computational context. Due to rounding
errors, the intersection point may be slightly different if the line segment starts
in the negative half-plane compared to when it starts in the positive half-plane.
In a computational context, the interpolation scheme should be defined in a
consistent fashion. That is, the fractional t should always be defined as either
the fraction of the directional vector from the inside point to the outside point
or the other way around. Figure 7.2 shows a line segment intersecting a plane
from the outside and in.
Figure 7.2: A line segment intersecting a plane from the negative half-space.
The relationships found using similar triangles are defined slightly differently for this case. Equation 7.5 shows these relationships.

$$\frac{a}{\operatorname{dist}(P, \vec{E})} = \frac{b}{-\operatorname{dist}(P, \vec{S})} = \frac{a + b}{\operatorname{dist}(P, \vec{E}) - \operatorname{dist}(P, \vec{S})} \tag{7.5}$$
As with the relationships, the interpolation scheme is slightly different. Equation 7.6 shows how interpolation is done when the line segment starts in the
negative half-space.
$$\vec{I} = \vec{E} + t_2 \cdot (\vec{S} - \vec{E}) \qquad (7.6)$$
The fraction $t_2$ of the negative direction vector from $\vec{E}$ to $\vec{S}$ is defined as shown in equation 7.7 using the previously observed relationships.
$$t_2 = \frac{a}{a+b} = \frac{\operatorname{dist}(P, \vec{E})}{\operatorname{dist}(P, \vec{E}) - \operatorname{dist}(P, \vec{S})} \qquad (7.7)$$
8 View Frustum Clipping
The view frustum or view volume is a sub-space of the virtual scene in which triangles are partially or completely visible when projected onto the screen. It is formed from six planes, defined by the orientation of a camera or an observer and a few additional parameters such as draw distance. Its shape is that of a truncated pyramid as seen from above in figure 8.1. Four of the six planes and their corresponding normals are shown in the figure, where $\vec{N}_l$, $\vec{N}_r$, $\vec{N}_n$ and $\vec{N}_f$ are the normals of the left, right, near and far planes.
Figure 8.1: The view frustum is in the shape of a truncated pyramid.
View frustum clipping is the process of removing complete triangles or parts of triangles which will not contribute to the final image when projected onto the screen. This is often done to save computational power.
8.1 Clipping Planes
By transforming the triangles into camera space before clipping, the definition of the six planes forming the view frustum becomes simpler. Equation 8.1 shows the equations of the left, right, bottom, top, near and far planes, respectively. All equations are defined in camera space.
$$\begin{aligned}
(\alpha_x, 0, 1) \bullet \vec{X} &= 0 \\
(-\alpha_x, 0, 1) \bullet \vec{X} &= 0 \\
(0, \alpha_y, 1) \bullet \vec{X} &= 0 \\
(0, -\alpha_y, 1) \bullet \vec{X} &= 0 \\
(0, 0, 1) \bullet \vec{X} &= d_n \\
(0, 0, -1) \bullet \vec{X} &= -d_f
\end{aligned} \qquad (8.1)$$
In the equation, $d_n$ and $d_f$ are the near and far draw distances and $\vec{X}$ an arbitrary point. Equation 8.2 states a requirement on the relationship between the near and far draw distances. As seen in the equation, non-positive draw distances are not allowed and the far draw distance must be greater than the near draw distance.
$$0 < d_n < d_f < \infty \qquad (8.2)$$
The two constants $\alpha_x$ and $\alpha_y$ are stretch factors which are determined from the aspect ratio of the target image. This is done to account for non-square displays by extending the field of view of the camera along the major axis. As such, the field of view of the camera becomes defined along the minor axis of the target image.
Equation 8.3 shows how the horizontal stretch factor αx is determined from
the width and height of the target image.
$$\alpha_x = \begin{cases} \dfrac{h}{w} & \text{if } h < w \\ 1 & \text{otherwise} \end{cases} \qquad (8.3)$$
The vertical stretch factor αy is determined in a similar way as shown in
equation 8.4.
$$\alpha_y = \begin{cases} 1 & \text{if } h < w \\ \dfrac{w}{h} & \text{otherwise} \end{cases} \qquad (8.4)$$
The six plane equations are fully determined from the equations above and the
clipping of triangles in camera space can be performed.
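As a sketch of how these definitions translate into code, the C fragment below builds the six camera-space clip planes from the target resolution and the draw distances, reusing the plane representation from the intersection sketch above; the function name is illustrative.

typedef struct { float x, y, z; } vec3;
typedef struct { vec3 n; float d; } plane_t; /* dist(P, X) = N . X - d */

/* Build the six camera-space clip planes of equation 8.1, using the
 * stretch factors of equations 8.3 and 8.4. The draw distances must
 * satisfy 0 < dn < df, as required by equation 8.2. */
static void build_frustum(plane_t out[6], int w, int h, float dn, float df)
{
    float ax = (h < w) ? (float)h / (float)w : 1.0f; /* equation 8.3 */
    float ay = (h < w) ? 1.0f : (float)w / (float)h; /* equation 8.4 */
    out[0] = (plane_t){ {  ax, 0.0f,  1.0f },  0.0f }; /* left   */
    out[1] = (plane_t){ { -ax, 0.0f,  1.0f },  0.0f }; /* right  */
    out[2] = (plane_t){ { 0.0f,  ay,  1.0f },  0.0f }; /* bottom */
    out[3] = (plane_t){ { 0.0f, -ay,  1.0f },  0.0f }; /* top    */
    out[4] = (plane_t){ { 0.0f, 0.0f,  1.0f },  dn  }; /* near   */
    out[5] = (plane_t){ { 0.0f, 0.0f, -1.0f }, -df  }; /* far    */
}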
8.2 Example
The clipping procedure is done in sequence for each of the six planes and
operates on a vertex list as clipping a triangle against a plane may produce
additional vertices. The vertex list is initialized with the three original vertices
of a triangle and may store up to four additional vertices. In addition, a
secondary vertex list is used to store the results of a clipping pass.
Figure 8.2 shows a triangle which intersects the view frustum and its three initial vertices $\vec{V}_1$, $\vec{V}_2$ and $\vec{V}_3$.
Figure 8.2: A triangle intersecting the view frustum.
Clipping against the left plane produces an additional vertex and the vertex
list now contains a total of four vertices. This is shown in figure 8.3.
Figure 8.3: Clipping against the left plane produces an additional vertex.
The clipping procedure may process the six planes in any order as long as the
order is consistent for all triangles. The triangle will eventually become clipped
against the right plane which does not produce any additional vertices for this
example. However, the vertices are moved to the respective intersection points
as shown in figure 8.4.
Figure 8.4: Clipping against the right plane moves two of the vertices.
After the clipping process has completed, the final vertex list may not form a single triangle. It is possible for the process to produce as many as seven vertices, which together form a planar polygon. This polygon can be triangulated into several triangles, and the number of triangles is determined from the expression in equation 8.5. In the equation, $t$ is the number of triangles and $v$ the number of vertices. As the clipping process produces at most seven vertices, the number of triangles is at most five.
$$t = \max(v - 2, 0) \qquad (8.5)$$
For this example, the polygon is split up into two triangles. Figure 8.5 shows the triangulated polygon and the two triangles $\{\vec{V}_1, \vec{V}_2, \vec{V}_3\}$ and $\{\vec{V}_1, \vec{V}_3, \vec{V}_4\}$.
Figure 8.5: The planar polygon can be triangulated into several triangles.
8.3 Algorithm
The Sutherland-Hodgman algorithm for line clipping can be used to clip each
line segment implicitly defined by the vertex list against each of the six planes
of the view frustum. The algorithm was introduced in 1974 by Sutherland and
Hodgman in [7].
For every vertex in the vertex list, the signed distance to the current plane $P$ is computed. The signed distances of the starting point $\vec{S}$ and ending point $\vec{E}$ are used to determine how the line segment is oriented with respect to the plane. This is done through two boolean tests for which the results are embedded into a bit mask. Two tests give four cases in total, all of which are shown in equation 8.6 [7].

$$\text{case} = \begin{cases}
0 & \text{if } \operatorname{dist}(P, \vec{S}) < 0 \wedge \operatorname{dist}(P, \vec{E}) < 0 \\
1 & \text{if } \operatorname{dist}(P, \vec{S}) > 0 \wedge \operatorname{dist}(P, \vec{E}) < 0 \\
2 & \text{if } \operatorname{dist}(P, \vec{S}) < 0 \wedge \operatorname{dist}(P, \vec{E}) > 0 \\
3 & \text{if } \operatorname{dist}(P, \vec{S}) > 0 \wedge \operatorname{dist}(P, \vec{E}) > 0
\end{cases} \qquad (8.6)$$
As seen in the equation, case 0 occurs when both the starting and ending
points are located in the negative half-space created by the plane. Neither the
points nor the line segment is visible and as such, nothing should be appended
to the resulting vertex list. This is illustrated in figure 8.6.
Figure 8.6: Illustration of case 0 for the Sutherland-Hodgman algorithm.
If the starting point is located in the positive half-space and the ending point located in the negative half-space, case 1 occurs. This is illustrated in figure 8.7. For this case, the intersection point $\vec{I}$ needs to be computed. The intersection point is computed with respect to the orientation of the line segment, as discussed in the section about line segment-plane intersections. As the original starting point is visible, it is appended to the resulting vertex list followed by the intersection point.
Figure 8.7: Illustration of case 1 for the Sutherland-Hodgman algorithm.
Conversely, case 2 occurs when the starting point is located in the negative
half-space and the ending point located in the positive half-space. For this
case, the intersection point is computed and appended to the resulting vertex
list. This is illustrated in figure 8.8.
Figure 8.8: Illustration of case 2 for the Sutherland-Hodgman algorithm.
The final case occurs when both points are located in the positive half-space.
For this case, no intersection point exists. The only thing appended to the
resulting vertex list is the original starting point. The case is illustrated in
figure 8.9.
Figure 8.9: Illustration of case 3 for the Sutherland-Hodgman algorithm.
The clipping against the current plane P is completed by clipping the final line
segment between the last vertex and the first vertex in the same fashion as for
the other line segments. The resulting vertex list describes the intermediate
planar polygon and can be used as input for the next clipping plane and so
forth until all clipping planes have been processed.
9 Barycentric Coordinates
For three-dimensional computer graphics, a triangle is formed from three vertices located in three-dimensional space. If the vertices are arranged so that a proper triangle is formed, the triangle will create a two-dimensional sub-space. This sub-space is a plane and the triangle is the small fraction of that plane enclosed within the convex hull of the three vertices. Figure 9.1 shows a triangle and its three vertices $\vec{U}$, $\vec{V}$ and $\vec{W}$.
Figure 9.1: A triangle is formed from three vertices.
If the plane is divided by the three lines formed when extending the edges infinitely, the triangle can be expressed as the intersection of the three positive half-spaces. This can be seen in figure 9.2 where the arrows indicate the positive half-spaces.
Figure 9.2: A triangle can be seen as the intersection of three positive half-spaces.
Attributes are commonly defined at the three vertices of the triangle and may
be shared with adjacent triangles. In order to interpolate these values across
the entire surface, a clever coordinate system is needed.
Barycentric coordinates are a set of three coordinates that can be used to refer
to all points within the plane spanned by a proper triangle. This set of coordinates is often denoted (u, v, w) and specifies the normalized perpendicular
distances between the opposite edge of a vertex and the vertex itself.
Let an arbitrary point $\vec{P}$ be defined in the plane by the three coordinates $(u, v, w)$. Equation 9.1 shows two fundamental properties for this point and for the barycentric coordinates.
$$\begin{aligned}
\vec{P} &= \vec{U} \cdot u + \vec{V} \cdot v + \vec{W} \cdot w \\
1 &= u + v + w
\end{aligned} \qquad (9.1)$$
The $u$ coordinate is defined by the normalized perpendicular distance from the edge between vertices $\vec{V}$ and $\vec{W}$. The $v$ and $w$ coordinates are defined in a similar fashion, as illustrated in figure 9.3.
Figure 9.3: The isolines of the barycentric coordinates are parallel to the edges.
As the coordinates are normalized with respect to the size of the triangle, the coordinates gain a series of useful properties. If the point $\vec{P}$ coincides with vertex $\vec{U}$, its position in the barycentric coordinate system is $(1, 0, 0)$. The same is true for vertices $\vec{V}$ and $\vec{W}$, which are located at $(0, 1, 0)$ and $(0, 0, 1)$, respectively.
If the point lies anywhere along the line between $\vec{V}$ and $\vec{W}$, its coordinates will be $(0, v, w)$. Conversely, its coordinates along the other two lines will be $(u, 0, w)$ or $(u, v, 0)$. The normalization also ensures that the sum of all three coordinates is exactly equal to one, regardless of the position of the point. This is always true, even if the point lies outside of the triangle.
This set of coordinates and their properties form an efficient interpolation
scheme for attributes defined at the three vertices.
9.1 Cramer's Rule
The fundamental properties of the barycentric coordinates form a system of equations and can be expressed in matrix form. An input vector $\vec{X}$ is transformed by a matrix $M$ into an output vector $\vec{Y}$. Equation 9.2 shows a general system of equations in matrix form.
$$\vec{Y} = M \cdot \vec{X} \qquad (9.2)$$
The input vector will be the three barycentric coordinates (u, v, w) and the
matrix and output vector will correspond to the statements from equation
9.1. Expressing these statements in matrix form results in equation 9.3 where
computing the barycentric coordinates is analogous to solving the equation for
(u, v, w).
$$\begin{pmatrix} P_x \\ P_y \\ 1 \end{pmatrix} = \begin{pmatrix} U_x & V_x & W_x \\ U_y & V_y & W_y \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} u \\ v \\ w \end{pmatrix} \qquad (9.3)$$
Cramer’s Rule states that any system of equations with as many unknowns
as there are equations (i.e. square matrices) can be solved using quotients of
determinants. Equation 9.4 states that the unknown Xi can be expressed as
the fraction between the determinant of the matrix Mi and that of the original
matrix M .
$$X_i = \frac{\det(M_i)}{\det(M)} \quad \text{if } \det(M) \neq 0 \qquad (9.4)$$
The determinant of the system is computed as shown in equation 9.5. The
bottom row simplifies the calculations as it is equal to (1, 1, 1).
$$\begin{aligned}
\det(M) &= (V_x \cdot W_y - W_x \cdot V_y) - (U_x \cdot W_y - W_x \cdot U_y) + (U_x \cdot V_y - V_x \cdot U_y) \\
&= (V_x - U_x) \cdot (W_y - U_y) - (W_x - U_x) \cdot (V_y - U_y)
\end{aligned} \qquad (9.5)$$
The additional matrices $M_i$ are formed by replacing column $i$ of the matrix $M$ with the vector $\vec{Y}$. Equation 9.6 shows the matrix $M_1$ where the first column has been replaced by $\vec{Y}$.

$$M_1 = \begin{pmatrix} P_x & V_x & W_x \\ P_y & V_y & W_y \\ 1 & 1 & 1 \end{pmatrix} \qquad (9.6)$$
Computing the determinant of M1 is performed in the standard way and results
in the expression seen in equation 9.7.
$$\begin{aligned}
\det(M_1) &= (V_x \cdot W_y - W_x \cdot V_y) - (P_x \cdot W_y - W_x \cdot P_y) + (P_x \cdot V_y - V_x \cdot P_y) \\
&= (P_x - W_x) \cdot (V_y - W_y) - (V_x - W_x) \cdot (P_y - W_y)
\end{aligned} \qquad (9.7)$$
The matrix $M_2$ is formed by replacing the second column of $M$ with the vector $\vec{Y}$. Equation 9.8 shows this matrix.

$$M_2 = \begin{pmatrix} U_x & P_x & W_x \\ U_y & P_y & W_y \\ 1 & 1 & 1 \end{pmatrix} \qquad (9.8)$$
Its determinant is similar to the determinant of M1 . Equation 9.9 shows an
expression for the determinant of M2 .
$$\begin{aligned}
\det(M_2) &= (P_x \cdot W_y - W_x \cdot P_y) - (U_x \cdot W_y - W_x \cdot U_y) + (U_x \cdot P_y - P_x \cdot U_y) \\
&= (P_x - W_x) \cdot (W_y - U_y) - (W_x - U_x) \cdot (P_y - W_y)
\end{aligned} \qquad (9.9)$$
The final matrix $M_3$ is formed by replacing the third column of $M$ with the vector $\vec{Y}$. This is seen in equation 9.10.

$$M_3 = \begin{pmatrix} U_x & V_x & P_x \\ U_y & V_y & P_y \\ 1 & 1 & 1 \end{pmatrix} \qquad (9.10)$$
An expression for the determinant of the matrix M3 is shown in equation 9.11.
$$\begin{aligned}
\det(M_3) &= (V_x \cdot P_y - P_x \cdot V_y) - (U_x \cdot P_y - P_x \cdot U_y) + (U_x \cdot V_y - V_x \cdot U_y) \\
&= (V_x - U_x) \cdot (P_y - U_y) - (P_x - U_x) \cdot (V_y - U_y)
\end{aligned} \qquad (9.11)$$
Using Cramer’s Rule, the three unknowns (u, v, w) can now be expressed as
the respective quotients. Equation 9.12 shows the expression for the three
barycentric coordinates.
$$(u, v, w) = \frac{1}{\det(M)} \cdot (\det(M_1), \det(M_2), \det(M_3)) \quad \text{if } \det(M) \neq 0 \qquad (9.12)$$
9.2 Ratios of Areas
Barycentric coordinates can also be viewed as area-oriented coordinates. The coordinates $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ will always correspond to the center of any proper triangle. It is therefore possible to define the coordinates as ratios of areas. Figure 9.4 shows the three areas formed by the three vertices $\vec{U}$, $\vec{V}$, $\vec{W}$ and an interior point $\vec{P}$.
Figure 9.4: Barycentric coordinates can be defined through ratios of areas.
The area of the entire triangle is computed as shown in equation 9.13. This
will become the normalizing factor.
$$A = \frac{(V_x - U_x) \cdot (W_y - U_y) - (W_x - U_x) \cdot (V_y - U_y)}{2} \qquad (9.13)$$
The three areas formed by introducing the interior point $\vec{P}$ are computed in a similar fashion. Equation 9.14 shows this process, in which $A_w$ is defined by the other two areas $A_u$ and $A_v$ in conjunction with the entire area $A$. Naturally, $A_w$ could be defined explicitly but the computational complexity of the method would be higher.
$$\begin{aligned}
A_u &= \frac{(P_x - W_x) \cdot (V_y - W_y) - (V_x - W_x) \cdot (P_y - W_y)}{2} \\
A_v &= \frac{(P_x - W_x) \cdot (W_y - U_y) - (W_x - U_x) \cdot (P_y - W_y)}{2} \\
A_w &= A - A_u - A_v
\end{aligned} \qquad (9.14)$$
The barycentric coordinates are obtained through a division by the total area.
This is shown in equation 9.15 where the actual coordinates (u, v, w) are computed. For efficiency reasons, the total area can be precalculated, stored and
reused in every calculation for that triangle. It is also unnecessary to divide
every area by two as it is the relative areas that are of importance.
$$(u, v, w) = \frac{1}{A} (A_u, A_v, A_w) \quad \text{if } A \neq 0 \qquad (9.15)$$
It can be observed that the two partial areas $A_u$ and $A_v$ actually correspond to half the signed perpendicular distances to the respective edges. That is, a projection of the relative vector from $\vec{W}$ to $\vec{P}$ on the normals of the two edges containing $\vec{W}$. Should the point $\vec{P}$ be located outside of the triangle, some areas and coordinates will become negative. This is in accordance with the previous observation.
10 The Separating Axis Theorem
The hyper-plane separation theorem is often referred to as the separating axis
theorem in two dimensions. The theorem can be applied to determine if two
convex shapes are overlapping and is heavily used in collision detection for
computer graphics.
10.1 Candidate Axes
If the two shapes are defined by vertices and the connecting edges in between,
there exists a finite number of axes along which the two objects can be separated. In two dimensions, these candidate axes correspond to the edge normals
present in either of the two shapes. Figure 10.1 shows two convex shapes and
their edge normals.
Figure 10.1: A box and a triangle with their corresponding edge normals.
Figure 10.2 shows the connection between the edge normals of the two shapes
and the axes for which a separation is possible. There are seven candidate
axes for the two shapes in the figure but four of them are duplicates, leaving
three unique axes.
Figure 10.2: Candidate axes are formed from unique edge normals.
Note that the separating axes are directionless. Two edge normals differing only in magnitude will correspond to the same separating axis, even if the magnitudes have opposite signs. This is because it is the relative magnitudes of the projected shapes that are compared using the theorem.
10.2 Collision Detection
The theorem states that if two shapes are separated, there must exist at least
one axis for which the projections of the two shapes do not overlap. If the two
objects overlap, no such axis will exist.
Figure 10.3 shows a case where two shapes do not overlap and how this is
reflected in the projections.
Figure 10.3: Two convex shapes with one separating axis (diagonal).
This case can be determined using the condition in equation 10.1 in which
Bi and Ti are the sets of projected vertices of the box and the triangle along
candidate axis i.
$$\exists i : (\max(B_i) < \min(T_i) \vee \min(B_i) > \max(T_i)) \qquad (10.1)$$
Figure 10.4 demonstrates a different case where the two shapes partially overlap. The overlap is also visible in the projections of the two shapes.
Figure 10.4: Two convex shapes with no separating axis.
This case can be determined using the condition in equation 10.2. The sets
of projected vertices of the box and of the triangle along candidate axis i are
denoted by Bi and Ti , respectively.
$$\forall i : (\max(B_i) > \min(T_i) \wedge \min(B_i) < \max(T_i)) \qquad (10.2)$$
An overlooked feature of the separating axis theorem is the possibility to determine if a shape is completely enclosed within another shape. If the projection
of the smaller shape is enclosed in the projection of the larger shape for every
candidate axis, the smaller shape will be completely enclosed inside the larger
shape.
Figure 10.5 demonstrates this feature. All projections of the smaller shape are
enclosed within the projections of the larger shape.
Figure 10.5: A smaller shape enclosed within a larger shape.
This case can be determined using the condition in equation 10.3. As with the other conditions, $B_i$ and $T_i$ denote the sets of projected vertices of the box and of the triangle along candidate axis $i$, respectively.
$$\forall i : (\min(B_i) > \min(T_i) \wedge \max(B_i) < \max(T_i)) \qquad (10.3)$$
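The three conditions map directly onto code. Below is a minimal C sketch of the projection test, assuming the candidate axes and the vertex sets of the two shapes are already available; the names are illustrative.

#include <stdbool.h>
#include <float.h>

typedef struct { float x, y; } vec2;
typedef struct { float min, max; } range_t;

/* Project a convex vertex set onto a candidate axis. */
static range_t project(vec2 axis, const vec2 *v, int n)
{
    range_t r = { FLT_MAX, -FLT_MAX };
    for (int j = 0; j < n; ++j) {
        float p = axis.x * v[j].x + axis.y * v[j].y;
        if (p < r.min) r.min = p;
        if (p > r.max) r.max = p;
    }
    return r;
}

/* The shapes overlap iff their projections overlap on every candidate
 * axis (equation 10.2); a single separating axis proves that they are
 * disjoint (equation 10.1). Swapping the comparison to an enclosure
 * test per axis gives the condition of equation 10.3 instead. */
static bool overlaps(const vec2 *b, int nb, const vec2 *t, int nt,
                     const vec2 *axes, int naxes)
{
    for (int j = 0; j < naxes; ++j) {
        range_t rb = project(axes[j], b, nb);
        range_t rt = project(axes[j], t, nt);
        if (rb.max < rt.min || rb.min > rt.max)
            return false;   /* separating axis found */
    }
    return true;
}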
11 Tangent Space
When the vertices of a triangle are coupled with texture coordinates, a local coordinate system known as the tangent space must exist for that triangle. Its bases are the axes along which the texture coordinates are defined as well as the normal of the triangle. A common notation for these axes is $\vec{T}$, $\vec{B}$ and $\vec{N}$, corresponding to the tangent, bitangent and normal, respectively. However, the notation $\vec{S}$, $\vec{T}$ and $\vec{N}$ will be used from here on, as texture coordinates often are denoted by $s$ and $t$.
Figure 11.1 shows a triangle and its three vertices $\vec{U}$, $\vec{V}$ and $\vec{W}$. It also shows two of the tangent space bases, $\vec{S}$ and $\vec{T}$. These bases can be of arbitrary orientation, depending on the different texture coordinates defined at the three vertices, but are always located in the same plane as the triangle.
Figure 11.1: A tangent space exists for triangles with proper texture coordinates.
For any point $\vec{P}$ in the same plane as a triangle, it is required for the tangent space of that triangle to be consistent with the spatial coordinates and with the texture coordinates of the point. Equation 11.1 shows this relationship.
$$(\vec{P}_{xyz} - \vec{U}_{xyz}) = (P_s - U_s) \cdot \vec{S} + (P_t - U_t) \cdot \vec{T} \qquad (11.1)$$
The relationship is illustrated in figure 11.2, in which the point $\vec{P}$ local to vertex $\vec{U}$ is described using the tangent space bases and the relative texture coordinates.
Figure 11.2: The tangent space is a coordinate system local to the triangle.
Defining the tangent space from the attributes of the three vertices can be done through observing an important feature. The vertices $\vec{V}$ and $\vec{W}$ relative to $\vec{U}$ must be uniquely described using the tangent space bases and the relative texture coordinates. Equation 11.2 shows these criteria.
$$\begin{aligned}
(\vec{V}_{xyz} - \vec{U}_{xyz}) &= (V_s - U_s) \cdot \vec{S} + (V_t - U_t) \cdot \vec{T} \\
(\vec{W}_{xyz} - \vec{U}_{xyz}) &= (W_s - U_s) \cdot \vec{S} + (W_t - U_t) \cdot \vec{T}
\end{aligned} \qquad (11.2)$$
These criteria can be simplified by introducing two local vectors $\vec{A}$ and $\vec{B}$ as the vectors from $\vec{U}$ to $\vec{V}$ and from $\vec{U}$ to $\vec{W}$, respectively. The two vectors will correspond to the two edges local to $\vec{U}$ as shown in figure 11.3.
Figure 11.3: Two vectors $\vec{A}$ and $\vec{B}$ local to $\vec{U}$ are introduced.
The previous equation can now be rewritten as shown in equation 11.3. This
is a system of six equations with six unknowns.
$$\begin{aligned}
\vec{A}_{xyz} &= A_s \cdot \vec{S} + A_t \cdot \vec{T} \\
\vec{B}_{xyz} &= B_s \cdot \vec{S} + B_t \cdot \vec{T}
\end{aligned} \qquad (11.3)$$
Obtaining the tangent space bases is analogous to solving this system of equations for the two bases $\vec{S}$ and $\vec{T}$. If the system of equations is rewritten in matrix form, equation 11.4 is obtained.
$$\begin{pmatrix} A_x & A_y & A_z \\ B_x & B_y & B_z \end{pmatrix} = \begin{pmatrix} A_s & A_t \\ B_s & B_t \end{pmatrix} \begin{pmatrix} S_x & S_y & S_z \\ T_x & T_y & T_z \end{pmatrix} \qquad (11.4)$$
All attributes belonging to either of the two vectors $\vec{A}$ and $\vec{B}$ are known. It is therefore possible to determine whether there exists a tangent space. A matrix $M$ consisting of the texture coordinates is defined as shown in equation 11.5.
$$M = \begin{pmatrix} A_s & A_t \\ B_s & B_t \end{pmatrix} \qquad (11.5)$$
This matrix needs to be inverted in order to compute the tangent space bases.
As with other matrices, it is only invertible if its determinant is not equal to
zero. Using the inverse formula for square four-element matrices produces the
result in equation 11.6.
$$M^{-1} = \frac{1}{A_s \cdot B_t - A_t \cdot B_s} \begin{pmatrix} B_t & -A_t \\ -B_s & A_s \end{pmatrix} \quad \text{if } A_s \cdot B_t - A_t \cdot B_s \neq 0 \qquad (11.6)$$
The matrix equation can be solved for $\vec{S}$ and $\vec{T}$ by multiplying equation 11.4 with $M^{-1}$ from the left. Rearranging the equation gives the result shown in equation 11.7.
$$\begin{pmatrix} S_x & S_y & S_z \\ T_x & T_y & T_z \end{pmatrix} = M^{-1} \begin{pmatrix} A_x & A_y & A_z \\ B_x & B_y & B_z \end{pmatrix} \qquad (11.7)$$
The explicit definitions of the two tangent space bases are shown in equation 11.8, where $M^{-1}_{i,j}$ corresponds to the element of the inverse texture coordinate matrix at row $i$ and column $j$.
$$\begin{aligned}
\vec{S} &= \begin{pmatrix} M^{-1}_{1,1} \cdot A_x + M^{-1}_{1,2} \cdot B_x \\ M^{-1}_{1,1} \cdot A_y + M^{-1}_{1,2} \cdot B_y \\ M^{-1}_{1,1} \cdot A_z + M^{-1}_{1,2} \cdot B_z \end{pmatrix} \\
\vec{T} &= \begin{pmatrix} M^{-1}_{2,1} \cdot A_x + M^{-1}_{2,2} \cdot B_x \\ M^{-1}_{2,1} \cdot A_y + M^{-1}_{2,2} \cdot B_y \\ M^{-1}_{2,1} \cdot A_z + M^{-1}_{2,2} \cdot B_z \end{pmatrix}
\end{aligned} \qquad (11.8)$$
The normal is by definition computed as the three-dimensional cross-product between two of the edges and is not connected to the texture coordinates. The two edges must be selected in accordance with the order in which the vertices are defined. Throughout this report, left-handed coordinate systems are used, requiring the vertices to be defined in clock-wise order. Equation 11.9 shows the definition of the normal $\vec{N}$ using the two local vectors $\vec{A}$ and $\vec{B}$.

$$\vec{N} = \begin{pmatrix} B_z \cdot A_y - B_y \cdot A_z \\ B_x \cdot A_z - B_z \cdot A_x \\ B_y \cdot A_x - B_x \cdot A_y \end{pmatrix} \qquad (11.9)$$
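Put together, the tangent space of a triangle can be computed as in the following C sketch, which inverts the texture coordinate matrix of equation 11.5 and applies equations 11.8 and 11.9; the function name and parameter layout are illustrative.

#include <stdbool.h>

typedef struct { float x, y, z; } vec3;

/* Compute the tangent space bases S and T from the two local edge
 * vectors A and B and their texture coordinate deltas (As, At) and
 * (Bs, Bt). Returns false when the matrix of equation 11.5 is
 * singular and no tangent space exists. */
static bool tangent_space(vec3 a, vec3 b,
                          float as, float at, float bs, float bt,
                          vec3 *s, vec3 *t, vec3 *n)
{
    float det = as * bt - at * bs;  /* determinant of equation 11.6 */
    if (det == 0.0f)
        return false;
    float inv = 1.0f / det;
    /* Equation 11.8: rows of the inverse matrix times the edges. */
    s->x = inv * ( bt * a.x - at * b.x);
    s->y = inv * ( bt * a.y - at * b.y);
    s->z = inv * ( bt * a.z - at * b.z);
    t->x = inv * (-bs * a.x + as * b.x);
    t->y = inv * (-bs * a.y + as * b.y);
    t->z = inv * (-bs * a.z + as * b.z);
    /* Equation 11.9: the normal is the cross-product of the edges. */
    n->x = b.z * a.y - b.y * a.z;
    n->y = b.x * a.z - b.z * a.x;
    n->z = b.y * a.x - b.x * a.y;
    return true;
}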
12 Perspective-Correct Interpolation
When three-dimensional triangles are perspectively projected into a plane,
the relationship between the attributes associated with each vertex becomes
distorted. In three dimensions, the relationship is linear over the surface of
the triangle. This is no longer true when the triangle is projected into a plane.
However, the inverse projection is linear and this can be utilized.
This section provides details on how to correctly interpolate attributes across
a projected triangle using the two texture coordinates s and t as an example.
The method detailed here is in no way limited to texture coordinates. Any
attributes may be interpolated using the method, even vectors and matrices.
Equation 12.1 shows the projected texture coordinates $s'$ and $t'$ and the inverse depth $z'$, all of which are linear over the surface of the projected triangle.
$$z' = \frac{1}{z} \qquad s' = \frac{s}{z} \qquad t' = \frac{t}{z} \qquad (12.1)$$
These three projected attributes can be computed at the three vertices $\vec{U}$, $\vec{V}$ and $\vec{W}$ of the triangle, providing a set of nine projected attributes. This is shown in equation 12.2.
$$\begin{aligned}
z'_U &= \frac{1}{z_U} & z'_V &= \frac{1}{z_V} & z'_W &= \frac{1}{z_W} \\
s'_U &= \frac{s_U}{z_U} & s'_V &= \frac{s_V}{z_V} & s'_W &= \frac{s_W}{z_W} \\
t'_U &= \frac{t_U}{z_U} & t'_V &= \frac{t_V}{z_V} & t'_W &= \frac{t_W}{z_W}
\end{aligned} \qquad (12.2)$$
An arbitrary point $\vec{P}$ on the projected surface is defined by the three barycentric coordinates $u$, $v$ and $w$ as stated previously. Equation 12.3 shows this point.
$$\vec{P} = (u, v, w) \qquad (12.3)$$
As the nine projected attributes are linear over the surface, the barycentric coordinates can be used to linearly interpolate these to the point $\vec{P}$. This procedure is shown in equation 12.4.
$$\begin{aligned}
z'_P &= z'_U \cdot u + z'_V \cdot v + z'_W \cdot w \\
s'_P &= s'_U \cdot u + s'_V \cdot v + s'_W \cdot w \\
t'_P &= t'_U \cdot u + t'_V \cdot v + t'_W \cdot w
\end{aligned} \qquad (12.4)$$
The inverse depth provides a simple means of recovering the actual attributes at the point. Equation 12.5 shows how the texture coordinates $s_P$ and $t_P$ are recovered at the point, together with the correct depth value $z_P$.
$$z_P = \frac{1}{z'_P} \qquad s_P = \frac{s'_P}{z'_P} \qquad t_P = \frac{t'_P}{z'_P} \qquad (12.5)$$
13 Rounding Modes
Proper handling of the rounding of real-valued numbers is an important aspect
of computer graphics. Real-valued numbers are used to describe an array
of different properties of a three-dimensional scene, ranging from geometry
definitions to texture coordinates. These will have to map to a discrete raster
of discrete colors sampled from discrete textures, making the need evident.
Rounding can be performed in a number of different modes where each mode
is beneficial in a number of different situations. These rounding modes are
implemented in most programming languages and in hardware but can also
be defined using the truncating function. The truncating function truncates
all decimals from a real-valued number x. Figure 13.1 shows this function and
how it rounds its input towards zero.
Figure 13.1: The truncating function of a real-valued number.
Defining the truncating function in a computational context can be done in a few different ways: either as a bit manipulation of the real-valued number or as a type conversion to an integer data type, provided that the integer data type used is able to represent the truncated value.
13.1 Flooring
Flooring is a rounding mode which rounds its input towards negative infinity.
A real-valued number x is rounded towards the integer to the left on the x-axis.
This is shown in figure 13.2.
Figure 13.2: The flooring function of a real-valued number.
The flooring function can be defined as shown in equation 13.1 in which the
definition properly handles positive as well as negative values.
$$\operatorname{floor}(x) \in \mathbb{Z} = \begin{cases} \operatorname{trunc}(x) - 1 & \text{if } x < \operatorname{trunc}(x) \\ \operatorname{trunc}(x) & \text{otherwise} \end{cases} \qquad (13.1)$$
Fractional
The flooring function has a complementary function known as the fractional function. It represents the value that was removed by the flooring function and is shown in figure 13.3.
Figure 13.3: The fractional function of a real-valued number.
The fractional function can be defined using the flooring function or using
the truncating function. Equation 13.2 shows the fractional function defined
using the latter. As with the flooring function, this definition properly handles
positive as well as negative values.
$$\operatorname{frac}(x) \in \mathbb{R} : [0, 1[ \; = \begin{cases} x - \operatorname{trunc}(x) + 1 & \text{if } x < \operatorname{trunc}(x) \\ x - \operatorname{trunc}(x) & \text{otherwise} \end{cases} \qquad (13.2)$$
The two functions obey an important relationship. The sum of the flooring
function and the fractional function individually applied to any real-valued
number x is always equal to x as shown in equation 13.3.
x = floor(x) + frac(x)
(13.3)
The sum of the two functions forms a continuous line, as expected. This is shown in figure 13.4.
Figure 13.4: The sum of the flooring function and the fractional function.
13.2 Ceiling
Ceiling is a rounding mode which rounds its input towards positive infinity. A
real-valued number x is rounded towards the integer to the right on the x-axis.
This is shown in figure 13.5.
Figure 13.5: The ceiling function of a real-valued number.
The ceiling function can be defined as shown in equation 13.4 in which the
definition properly handles positive as well as negative values.
$$\operatorname{ceil}(x) \in \mathbb{Z} = \begin{cases} \operatorname{trunc}(x) + 1 & \text{if } x > \operatorname{trunc}(x) \\ \operatorname{trunc}(x) & \text{otherwise} \end{cases} \qquad (13.4)$$
Reverse Fractional
Similar to the flooring function, the ceiling function has a complementary function known as the reverse fractional function. It represents the value that was introduced by the ceiling function and is shown in figure 13.6.
Figure 13.6: The reverse fractional function of a real-valued number.
The reverse fractional function can be defined using the ceiling function or
using the truncating function. Equation 13.5 shows the reverse fractional
function defined using the latter. As with the other functions, this definition
properly handles positive as well as negative values.
$$\operatorname{rfrac}(x) \in \mathbb{R} : [0, 1[ \; = \begin{cases} \operatorname{trunc}(x) + 1 - x & \text{if } x > \operatorname{trunc}(x) \\ \operatorname{trunc}(x) - x & \text{otherwise} \end{cases} \qquad (13.5)$$
The two functions obey a relationship similar to that of the flooring and fractional functions. The difference between the ceiling function and the reverse
fractional function individually applied to any real-valued number x is always
equal to x. This is shown in equation 13.6. This difference also forms a continuous line.
$$x = \operatorname{ceil}(x) - \operatorname{rfrac}(x) \qquad (13.6)$$
13.3 Nearest
Rounding to the nearest integer can be performed in a simple way utilizing the flooring or ceiling functions. Shifting the input by $\frac{1}{2}$ before flooring will create a rounding mode that produces the closest integer, left or right on the x-axis. Values at the exact midpoint between two integers will be rounded upwards. Conversely, shifting the input by $-\frac{1}{2}$ before ceiling will also produce a rounding towards the nearest integer. However, values at the midpoints will be rounded downwards.
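A minimal C sketch of these rounding modes built on truncation is given below; the cast to int serves as the truncating function, so the inputs are assumed to fit in an int.

/* Truncation through a type conversion (section 13). */
static int trunc_i(float x) { return (int)x; }

/* Equation 13.1: flooring, correct for negative inputs as well. */
static int floor_i(float x)
{
    int t = trunc_i(x);
    return (x < (float)t) ? t - 1 : t;
}

/* Equation 13.4: ceiling. */
static int ceil_i(float x)
{
    int t = trunc_i(x);
    return (x > (float)t) ? t + 1 : t;
}

/* Equations 13.2 and 13.5: the complementary fractional functions. */
static float frac_f(float x)  { return x - (float)floor_i(x); }
static float rfrac_f(float x) { return (float)ceil_i(x) - x; }

/* Section 13.3: nearest, with midpoints rounded upwards. */
static int round_i(float x) { return floor_i(x + 0.5f); }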
Chapter D
Texture Filtering
Texture mapping plays an important role in computer graphics as it greatly
enhances the visual quality of a rendered image. Texture mapping is traditionally used to modulate color values for three-dimensional models but can
also be used to introduce detail in a variety of ways. This includes introducing geometric detail through normal mapping as well as modulating material
properties other than color.
Throughout this chapter, the two words pixel and texel will be heavily used.
Pixel will be used whenever discussing the sample points of the rendered image,
whereas texel will refer to the sample points of a texture image. It is important
to differentiate between these two, even if they technically represent the exact
same thing.
1 Short on Discrete Images
Pixels and texels are indeed sample points, with emphasis on points. Points are zero-dimensional, infinitely small and should not be thought of as higher-dimensional shapes such as circles or squares. As such, they cannot be associated with a size.
The sample points together form a discrete image which has a horizontal and
a vertical resolution equal to the amount of sample points in the respective
direction. The discrete image is best viewed through a reconstruction filter
which extends the image into the continuous domain according to Smith in
[8]. The footprint of this filter will be two-dimensional and therefore have a
size and a shape.
2 Magnification and Minification
Texture mapping is a process in which a set of texture coordinates $s$ and $t$, along with the partial derivatives $\partial s$ and $\partial t$, are used to reconstruct a discrete texture image. The texture coordinates $s$ and $t$ are known as the resolution-independent texture coordinates, for which the interval between zero and one covers the entire texture. In order to access individual texels, the coordinates are commonly scaled by the actual resolution, and the scaled texture coordinates $x$ and $y$ are obtained. The partial derivatives are scaled in the same fashion, giving a hint on the optimal size for the footprint of the reconstruction filter.
Pixels in the rendered image will rarely match the texels of a texture image, neither in orientation nor in resolution and, by extension, in the distance between sample points. In three-dimensional computer graphics, the problem is worsened as the footprint of a pixel in image space may become distorted in texture space. This is a result of the perspective projection.
There are two extreme cases which pose great challenges for the texture filtering sub-system. Large regions of the rendered image might produce almost
identical small footprints which fall between four texels. Choosing the nearest
neighboring texel as a reconstruction filter will select the same texel for many
pixels in that region of the rendered image. Clearly, this is not desirable and
better reconstruction filters are needed.
When the distance between two texels is larger than the footprint of a pixel,
the texture sub-system needs to magnify the texture image. Hence the term
texture magnification. Unfortunately, not many techniques suited for real-time
applications exist for this case.
Figure 2.1 shows the footprint of a pixel in texture space. In the figure, the
footprint is smaller than the distance between two texels and magnification is
needed.
Figure 2.1: The footprint of a pixel in texture space needing magnification.
The other extreme case presents itself when the footprint of a pixel is mapped
to a region covering multiple texels in texture space. Choosing the nearest
neighboring texel as a reconstruction filter will select a texel seemingly at
random. In addition, slight changes in perspective will cause the filter to
select different texels, causing visual artifacts such as Moiré patterns in the
rendered image.
With the distance between two texels being smaller than the footprint of a
pixel, the texture sub-system needs to minify the texture image. This is known
as texture minification and multiple techniques suitable for real-time applications exist to combat the problem.
Figure 2.2 shows the footprint of a pixel in texture space. The footprint is many
times larger than the distance between two texels and texture minification is
needed.
Figure 2.2: The footprint of a pixel in texture space needing minification.
Most minification filters introduce a pre-processing pass in which the texture
is processed and additional data stored. The pre-processing can often be done
efficiently in a parallel computational context and only needs to be done once
per texture.
3 Wrapping Modes
When a pair of texture coordinates exceeds the domain of a texture, a consistent way of remapping these coordinates is needed. Generally, there are two different wrapping modes which can be used for the two texture dimensions. Both dimensions need not use the same wrapping mode, even if it often is convenient to treat both dimensions equally. However, it is crucial that a wrapping mode is defined, as it is near impossible to prevent the domain from being exceeded.
Wrapping modes can operate on the resolution-independent texture coordinates or on the integer texel coordinates. This section will define wrapping
modes for the latter. As such, wrapping is performed whenever the texture
filter requests a texel using the integer texel coordinates.
Given a set of texel coordinates xi and yi , the look-up function Il can be defined
as shown in equation 3.1. In the equation, I corresponds to the discrete texture
image whereas Wx and Wy are the two wrapping functions.
$$I_l[x_i, y_i] = I[W_x[x_i], W_y[y_i]] \qquad (3.1)$$
3.1 Repeating
If a texture is meant to be used in a tiling fashion, where texel coordinates exceeding the domain should remap to the corresponding texels in the tile, the repeat mode is used. Equation 3.2 shows how a remapping of the texel coordinates $x_i$ and $y_i$ is done for repeating textures. The method used in the equation correctly remaps $x_i$ and $y_i$ into the intervals $[0, w - 1]$ and $[0, h - 1]$, even for negative coordinates.
$$\begin{aligned}
W_x[x_i] &= ((x_i \bmod w) + w) \bmod w \\
W_y[y_i] &= ((y_i \bmod h) + h) \bmod h
\end{aligned} \qquad (3.2)$$
Graphics pipelines such as the Open Graphics Library generally support two
different repeating modes. One which mirrors or inverts every other tile and
one which repeats every tile in a consistent fashion [9]. The expressions in the
above equation are used to repeat every tile consistently.
3.2 Clamping
If a pair of texel coordinates are meant to never exceed their domain, the
clamping mode is used. Clamping remaps texel coordinates outside of the
domain to the edge texels of the texture image. Equation 3.3 shows how a
remapping of the texel coordinates xi and yi is done for clamping textures.
$$\begin{aligned}
W_x[x_i] &= \min(\max(x_i, 0), w - 1) \\
W_y[y_i] &= \min(\max(y_i, 0), h - 1)
\end{aligned} \qquad (3.3)$$
Graphics pipelines generally support a few different clamping modes which
allow fine control over the texels that actually get sampled. For instance, the
Open Graphics Library allows a virtual border of constant color or an actual
border of varying colors to be specified around the texture image. The different
clamping modes specify whether the clamping should be done to the edge of
the texture, to the border of the texture or to in between the edge and the
border [9].
Borders can be used to show that some pixels in the rendered image will map
to texels outside of the domain. Should this not be the intention, the border
texels visible in the rendered image will indicate that something is wrong with
the texture coordinates specified.
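The two wrapping modes are one-liners in C; a sketch for one dimension follows, with the same functions applying to $y$ with $h$ substituted for $w$. The double modulo in the repeat mode is what makes negative coordinates remap correctly, since C's % operator may return negative results.

/* Equation 3.2: repeat mode, valid for negative coordinates too. */
static int wrap_repeat(int xi, int w)
{
    return ((xi % w) + w) % w;
}

/* Equation 3.3: clamp to the edge texels. */
static int wrap_clamp(int xi, int w)
{
    if (xi < 0)  return 0;
    if (xi >= w) return w - 1;
    return xi;
}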
4 Nearest Neighboring Texel
The nearest neighboring texel filter is a texture filter in its most simple form. It selects the single texel from the texture which is in closest proximity to the scaled texture coordinates $x$ and $y$. Equation 4.1 shows the definition of the nearest neighboring texel filter $N$. For efficiency, the rounding can be implemented as a flooring operation shifted by $\frac{1}{2}$.
$$N(x, y) = I_l[\operatorname{round}(x), \operatorname{round}(y)] \qquad (4.1)$$
The filter is generally not used since it is neither suitable for magnification nor
minification. However, it extends the discrete texture image into the continuous domain which is the minimum requirement for a texture filter.
5 Bilinear Interpolation
Bilinear interpolation is a filter which averages four texels, weighted in accordance with distance, in an aim to reconstruct the missing data. This has proven to be sufficient for the perceived visual quality of rendered images when the texture is magnified. Hence, bilinear interpolation is one of the most commonly used texture magnification filters.
In bilinear interpolation, the scaled texture coordinates x and y are split into
integer parts and fractional parts using the flooring and fractional functions.
The integer parts are used for texel indices and the fractional parts are used
to weigh the texels accordingly.
Equation 5.1 shows how bilinear interpolation extends an image into the continuous domain. In the equation, $x_i$ and $y_i$ correspond to the integer parts while $x_f$ and $y_f$ correspond to the fractional parts of the scaled texture coordinates. The equation clearly shows that the filter uses four texel look-ups.
$$\begin{aligned}
B(x, y) = \; & I_l[x_i, y_i] \cdot (1 - x_f) \cdot (1 - y_f) \\
+ \; & I_l[x_i + 1, y_i] \cdot x_f \cdot (1 - y_f) \\
+ \; & I_l[x_i, y_i + 1] \cdot (1 - x_f) \cdot y_f \\
+ \; & I_l[x_i + 1, y_i + 1] \cdot x_f \cdot y_f
\end{aligned} \qquad (5.1)$$
Bilinear interpolation does not work well as a minification filter. When the
footprint of a pixel covers multiple texels, interpolating four of them provides
little to no gain in the visual quality. Better reconstruction filters are needed
for minification.
6 Image Pyramids
Image pyramids were introduced in 1983 by Williams in [10]. Williams used the term mip-maps, where mip is an abbreviation of the Latin phrase multum in parvo, meaning many things in a small place. The reconstruction filter utilizes a pre-processing pass in which the original image is down-sampled recursively by a factor of four (one half in each dimension) into an image with multiple resolutions, or levels. This forms a pyramidal structure, hence the term image pyramids.
As images with multiple resolutions are created and stored, the memory requirement for image pyramids is higher in comparison to that of ordinary images. Equation 6.1 shows that the upper bound of the memory requirement is $\frac{4}{3}$ when compared to ordinary images. This justifies the mip prefix as only $\frac{1}{3}$ more data is required for the storage.
$$\lim_{N \to \infty} \sum_{i=0}^{N} \left(\frac{1}{4}\right)^i = \frac{4}{3} \qquad (6.1)$$
Figure 6.1 illustrates how the image pyramid reconstruction filter creates a series of recursively down-sampled images for three-channel images. It also illustrates how the memory requirement is bounded by $\frac{4}{3}$, as all down-sampled copies fit inside an additional channel.
Figure 6.1: An illustration of mip-mapping for three-channel images.
Image pyramids are in no way limited to three-channel images; they work just as well with an arbitrary number of channels. Additionally, they can be used for both magnification and minification if the sampling is implemented in a robust way.
6.1 Pre-processing
The pre-processing and sampling of image pyramids can differ slightly between
different implementations and between different hardware vendors. Yet, its
fundamental algorithm is to pre-process the image into multiple resolutions
and sample the appropriate level or levels as determined by the size of the
footprint.
As every level of the image pyramid is isotropically down-sampled, the total
number of levels can be determined using equation 6.2. The formula used in
the equation selects the axis with minimum extent in the image and downsamples until only a single row or column remains. Should the image be
perfectly square, the final level becomes a single pixel which approximates the
entire image.
$$\text{levels} = \lfloor \log_2(\min(w, h)) \rfloor + 1 \qquad (6.2)$$
Re-sampling an image into half its resolution along both axes can be done in
two passes provided that the filter used is separable. First along the x-axis
and then along the y-axis or the other way around. This two-pass method
uses fewer memory accesses than a one-pass method would but is reliant on
an intermediate buffer storing the results from the first pass. It might also be
subject to additional quantization noise if the intermediate results are stored
in integer data types.
The recursive down-sampling process is initialized with the original image.
Each recursion level uses the image at the previous level as its input and
initializes the next recursion provided that there are more levels still to be
processed.
6.2 Sampling Method
The set of resolution-independent texture coordinates s and t as well as their
partial derivatives ∂s and ∂t together provide enough information to sample the image pyramid. The texture coordinates of a pixel are obtained
through (perspectively-correct) interpolation across the triangle while the partial derivatives can be approximated using finite differences. This differentiation is done between adjacent pixels and can be expressed in a number of
different ways.
In hardware, pixels are commonly textured in groups of four. Such a group is known as a quad pixel and its structure makes differentiation possible. Usually, the four partial derivatives $\frac{\partial s}{\partial x}$, $\frac{\partial s}{\partial y}$, $\frac{\partial t}{\partial x}$ and $\frac{\partial t}{\partial y}$ are approximated through differentiation and used for the entire quad pixel. For tile-based renderers, other differentiation schemes become possible as entire tiles are textured simultaneously.
The approximate size of an optimal footprint is related to the magnitude of the partial derivatives. A metric commonly used for image pyramids is shown in equation 6.3 [10], in which the footprint size $f_s$ is approximated from the maximum length of two gradients. The length is scaled by the number of samples along the minor axis to produce a footprint size which is expressed in distances between texels. Choosing the maximum gradient length is a trade-off which favors blur over aliasing. Hence, image pyramids often produce an over-blurred sample.

$$f_s = \max\left(\sqrt{\left(\frac{\partial s}{\partial x}\right)^2 + \left(\frac{\partial t}{\partial x}\right)^2}, \sqrt{\left(\frac{\partial s}{\partial y}\right)^2 + \left(\frac{\partial t}{\partial y}\right)^2}\right) \cdot \min(w, h) \qquad (6.3)$$
The common metric comes with an assumption about the resolution of a texture. Namely, it assumes that every texture is perfectly square and that the partial derivatives contribute equally to the approximation of the optimal footprint. Williams stated that image pyramids were intended for square, power-of-two resolutions. Yet, it is useful to extend the definition to properly include rectangular textures and preferably arbitrary resolutions as well.
To properly include rectangular textures, the partial derivatives can be weighted in accordance with resolution when computing the length of the two gradients. This modification is shown in equation 6.4.

$$f_s = \max\left(\sqrt{\left(\frac{\partial s}{\partial x} \cdot w\right)^2 + \left(\frac{\partial t}{\partial x} \cdot h\right)^2}, \sqrt{\left(\frac{\partial s}{\partial y} \cdot w\right)^2 + \left(\frac{\partial t}{\partial y} \cdot h\right)^2}\right) \qquad (6.4)$$
Approximating the footprint size from the lengths of the two gradients is computationally costly as it involves two square root computations at each pixel.
The square root can be extracted from the maximum function without altering the expression and the footprint size computed using a single square root.
Yet, further optimizations may be made as the size of the footprint is approximated. Equation 6.5 shows a method for which the size of the footprint is
approximated independently in width and height.
$$\begin{aligned}
f_w &= \max\left(\left|\frac{\partial s}{\partial x}\right|, \left|\frac{\partial s}{\partial y}\right|\right) \cdot w \\
f_h &= \max\left(\left|\frac{\partial t}{\partial x}\right|, \left|\frac{\partial t}{\partial y}\right|\right) \cdot h
\end{aligned} \qquad (6.5)$$
The independent approach also aids in the actual sampling process for rectangular images. The appropriate level of the pyramid is determined as shown in
equation 6.6. By scaling the resolution-independent texture coordinates s and
t by the resolution of the image at level li , the texture at resolution li can be
sampled. The sampling may be performed using a simple texture filter such
as nearest neighboring texel or bilinear interpolation with good results due to
the pre-processing pass.
$$l_i = \begin{cases} \lfloor \log_2(f_w) \rfloor & \text{if } f_w > f_h \\ \lfloor \log_2(f_h) \rfloor & \text{otherwise} \end{cases} \qquad (6.6)$$
When sampling a texture using the image pyramid, transitions between the
different levels of the pyramid will become visible in the rendered image. To
remedy this, trilinear interpolation was introduced. Trilinear interpolation is
essentially two bilinear interpolations applied independently to the closest two
levels in the pyramid followed by a linear interpolation between the two samples
produced. The linear level fractional lf is computed as shown in equation 6.7.
$$l_f = \begin{cases} \dfrac{f_w}{2^{l_i}} - 1 & \text{if } f_w > f_h \\ \dfrac{f_h}{2^{l_i}} - 1 & \text{otherwise} \end{cases} \qquad (6.7)$$
The trilinear interpolation can be defined as shown in equation 6.8 where
(x0 , y0 ) are the scaled texture coordinates of level li and (x1 , y1 ) of level li +
1. The two bilinear interpolations are denoted by B and are weighted in
accordance with lf . Together they form the image pyramid filter Pi . The
two bilinear interpolations use four texel look-ups each, amounting to eight
look-ups in total.
$$P_i(x, y) = B(x_0, y_0) \cdot (1 - l_f) + B(x_1, y_1) \cdot l_f \qquad (6.8)$$
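The level selection and blending can be sketched as follows in C, following equations 6.5 through 6.8. The helper sample_level(level, s, t), a bilinear look-up in one pyramid level using resolution-independent coordinates, is hypothetical and stands in for the texel storage described in section 6.1.

#include <math.h>

/* Hypothetical bilinear sampler for one pyramid level. */
extern float sample_level(int level, float s, float t);

static float trilinear(float s, float t,
                       float dsdx, float dsdy, float dtdx, float dtdy,
                       int w, int h, int levels)
{
    /* Equation 6.5: footprint size per axis, in texel distances. */
    float fw = fmaxf(fabsf(dsdx), fabsf(dsdy)) * (float)w;
    float fh = fmaxf(fabsf(dtdx), fabsf(dtdy)) * (float)h;
    float f  = fmaxf(fmaxf(fw, fh), 1.0f);   /* clamp to magnification */

    int   li = (int)floorf(log2f(f));        /* equation 6.6 */
    float lf = f / (float)(1 << li) - 1.0f;  /* equation 6.7 */
    if (li >= levels - 1) {                  /* stay inside the pyramid */
        li = levels - 1;
        lf = 0.0f;
    }
    /* Equation 6.8: blend two bilinear samples from adjacent levels. */
    float lo = sample_level(li, s, t);
    float hi = (lf > 0.0f) ? sample_level(li + 1, s, t) : lo;
    return lo * (1.0f - lf) + hi * lf;
}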
7 Anisotropic Image Pyramids
Anisotropic image pyramids are an extension to image pyramids. Instead of down-sampling the image isotropically, the image is down-sampled independently in both directions by a factor of two. Clearly, this requires more memory when compared to image pyramids and ordinary images. Equation 7.1 shows that this requirement is bounded by four times the original memory requirement.
$$\left(\lim_{N \to \infty} 2 \sum_{i=0}^{N} \left(\frac{1}{2}\right)^i - 1\right) \cdot \left(\lim_{M \to \infty} \sum_{j=0}^{M} \left(\frac{1}{4}\right)^j\right) = 3 \cdot \lim_{M \to \infty} \sum_{j=0}^{M} \left(\frac{1}{4}\right)^j = 4 \qquad (7.1)$$
The process is illustrated in figure 7.1 and forms a two-dimensional structure of levels with different resolutions. As with image pyramids, there is no limitation in the number of channels that can be used. A single channel was used here only to illustrate the memory requirement. This structure is often referred to as a rip-map, where rip corresponds to rectangular mip.
Figure 7.1: An illustration of rip-mapping for one-channel images.
It should be emphasized that the anisotropic extension includes the resolution
levels of regular image pyramids. They are present across the diagonal in the
figure.
7.1 Pre-processing
Similar to regular image pyramids, anisotropic image pyramids are recursively down-sampled in a pre-processing pass. As a two-dimensional structure of resolutions is to be generated, multiple recursive processes need to be performed, where each process down-samples the image along one dimension until only a single row or column remains. Between every recursive process, the image is down-sampled once along the other dimension. Equation 7.2 shows the number of levels an image will have in each direction. The total number of levels is the product of these two numbers.
$$\begin{aligned}
\text{levels}_x &= \lfloor \log_2(w) \rfloor + 1 \\
\text{levels}_y &= \lfloor \log_2(h) \rfloor + 1
\end{aligned} \qquad (7.2)$$
7.2 Sampling Method
The fundamental idea behind anisotropic image pyramids is that the optimal
resolution of the texture can be computed using the two partial derivatives
∂s and ∂t and selected from the structure. Using the partial derivatives, the
width and height of the footprint is computed as shown in equation 7.3.
$$\begin{aligned}
f_w &= \max\left(\left|\frac{\partial s}{\partial x}\right|, \left|\frac{\partial s}{\partial y}\right|\right) \cdot w \\
f_h &= \max\left(\left|\frac{\partial t}{\partial x}\right|, \left|\frac{\partial t}{\partial y}\right|\right) \cdot h
\end{aligned} \qquad (7.3)$$
Using the size of the footprint, the x-level $l_{i,x}$ and the y-level $l_{i,y}$ are computed as shown in equation 7.4.
$$\begin{aligned}
l_{i,x} &= \lfloor \log_2(f_w) \rfloor \\
l_{i,y} &= \lfloor \log_2(f_h) \rfloor
\end{aligned} \qquad (7.4)$$
In contrast to regular image pyramids, there are two linear level fractionals for
anisotropic image pyramids. These are defined as shown in equation 7.5.
$$\begin{aligned}
l_{f,x} &= \frac{f_w}{2^{l_{i,x}}} - 1 \\
l_{f,y} &= \frac{f_h}{2^{l_{i,y}}} - 1
\end{aligned} \qquad (7.5)$$
Sampling the anisotropic image pyramid is done in the same fashion as for regular image pyramids, only with a slight alteration. Four bilinear interpolations are performed for the four different level combinations. As the four samples need to be weighted, an additional bilinear interpolation weights these four samples using the two linear level fractionals. The final filter $P_a$ is shown in equation 7.6, for which the number of texel look-ups amounts to sixteen.
$$\begin{aligned}
P_a(x, y) = \; & B(x_0, y_0) \cdot (1 - l_{f,x}) \cdot (1 - l_{f,y}) \\
+ \; & B(x_1, y_1) \cdot l_{f,x} \cdot (1 - l_{f,y}) \\
+ \; & B(x_2, y_2) \cdot (1 - l_{f,x}) \cdot l_{f,y} \\
+ \; & B(x_3, y_3) \cdot l_{f,x} \cdot l_{f,y}
\end{aligned} \qquad (7.6)$$
In the equation (x0 , y0 ) correspond to the scaled texture coordinates in level
(li,x , li,y ), (x1 , y1 ) in level (li,x + 1, li,y ), (x2 , y2 ) in level (li,x , li,y + 1) and (x3 , y3 )
in level (li,x + 1, li,y + 1).
8 Summed-Area Tables
Summed-area tables were conceived in 1984 by Crow and presented in [11]. Crow observed that the sum of all pixels in an image enclosed within an axis-aligned rectangle could be computed using a simple formula. By altering the representation of the image into the summed-area table, the sum could be computed using only four table accesses for rectangles of arbitrary sizes. In other words, an arbitrary box-filter could be computed in constant time.
Crow stated that an element of the summed-area table S should be computed
as the sum of all pixels in the image with lower or equal indices to that of the
element. This is shown in equation 8.1 where I is the discrete texture image.
$$S[x, y] = \sum_{j=0}^{y} \sum_{i=0}^{x} I[i, j] \qquad (8.1)$$
A box-filter of a rectangular region is then defined as the total sum within the region divided by the total area of the rectangle. Equation 8.2 shows a box-filter $\bar{I}$ for the region enclosed within $(x_0, y_0)$ and $(x_1, y_1)$. This is equivalent to the mean of the image within the region.
$$\bar{I}[x_0, y_0, x_1, y_1] = \frac{\sum_{j=y_0}^{y_1} \sum_{i=x_0}^{x_1} I[i, j]}{(x_1 - x_0) \cdot (y_1 - y_0)} \qquad (8.2)$$
Figure 8.1 shows a rectangular region defined by the coordinates (x0 , y0 ) and
(x1 , y1 ) for which the box-filter is to be applied.
Figure 8.1: An axis-aligned rectangular region in the image.
The method that Crow presented involves replacing the double sum of the
image with four table look-ups in the summed-area table. This is shown in
equation 8.3 [11].
$$\bar{I}[x_0, y_0, x_1, y_1] = \frac{S[x_1, y_1] - S[x_0, y_1] - S[x_1, y_0] + S[x_0, y_0]}{(x_1 - x_0) \cdot (y_1 - y_0)} \qquad (8.3)$$
The four table look-ups in the summed-area table correspond to sums over
different regions in the original image. Figure 8.2 shows the region between
(0, 0) and (x1 , y1 ), corresponding to the table look-up S[x1 , y1 ].
Figure 8.2: An element in the table corresponds to the sum over a region.
The table look-up S[x0 , y1 ] corresponds to the sum over the region between
(0, 0) and (x0 , y1 ). This is shown in figure 8.3. This sum is subtracted from
the sum over the previous region as it is not part of the box-filter.
Figure 8.3: An element in the table corresponds to the sum over a region.
Conversely, the table look-up S[x1 , y0 ] corresponds to the sum over the region
between (0, 0) and (x1 , y0 ). This sum is also subtracted as the region is not
part of the box-filter. Figure 8.4 shows this region.
Figure 8.4: An element in the table corresponds to the sum over a region.
The final table look-up S[x0 , y0 ] is added back to the sum as it has been
removed twice by subtracting the two previous sums. It corresponds to the
sum over the region between (0, 0) and (x0 , y0 ) as shown in figure 8.5.
Figure 8.5: An element in the table corresponds to the sum over a region.
Using Crow's original definition for sampling the summed-area table, the box-filter is actually computed over the region enclosed within $(x_0 + 1, y_0 + 1)$ and $(x_1, y_1)$. This is shown in figure 8.6.
Figure 8.6: The sum is computed over a slightly different region.
This can be corrected using a slightly different definition of the box-filter.
Equation 8.4 shows the new definition for which one has been subtracted from
the coordinates x0 and y0 and the area adjusted accordingly.
$$\bar{I}[x_0, y_0, x_1, y_1] = \frac{S[x_1, y_1] - S[x_0 - 1, y_1] - S[x_1, y_0 - 1] + S[x_0 - 1, y_0 - 1]}{(x_1 - x_0 + 1) \cdot (y_1 - y_0 + 1)} \qquad (8.4)$$
Accessing elements above and to the left of the $x_0$ and $y_0$ coordinates requires that the original image is padded with a row of zeros above and a column of zeros to the left. The summed-area table can then be constructed as described by Crow. However, the introduction of an additional row and an additional column will alter the memory layout of the image and of the summed-area table. This has to be considered when accessing the elements in a computational context.
8.1 Construction
Constructing the summed-area table can be done efficiently in two passes,
each handling one of the two sums in the original definition. An intermediate
summed-area table Sx is created in the first pass. Its elements store the sum
of all pixels in the original image with lower or equal indices in the same row.
Equation 8.5 shows this process.
$$S_x[x, y] = \sum_{i=0}^{x} I[i, y] \qquad (8.5)$$
The process can be implemented efficiently in parallel as it is independent
between rows. Figure 8.7 shows how the cumulative sum of every row is
computed through an iterative process.
Figure 8.7: An intermediate summed-area table is created in the first pass.
Utilizing the intermediate summed-area table, the final summed-area table S
can be constructed as shown in equation 8.6.
$$S[x, y] = \sum_{j=0}^{y} S_x[x, j] \qquad (8.6)$$
Following the same procedure as the first pass, the final summed-area table
can be constructed independently between columns. Figure 8.8 shows this
iterative process.
Figure 8.8: The final summed-area table is created in the second pass.
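The two passes translate directly into C. The sketch below builds the table for an 8-bit single-channel image into 32-bit elements, in line with the bit depth discussion that follows; the zero padding of the previous section is omitted for brevity. Each row of the first pass and each column of the second is independent, which is what makes the construction parallelize well.

#include <stdint.h>

/* Two-pass construction of a summed-area table (equations 8.5-8.6). */
static void sat_build(const uint8_t *img, uint32_t *sat, int w, int h)
{
    /* First pass: cumulative sums along every row (equation 8.5). */
    for (int y = 0; y < h; ++y) {
        uint32_t acc = 0;
        for (int x = 0; x < w; ++x) {
            acc += img[y * w + x];
            sat[y * w + x] = acc;
        }
    }
    /* Second pass: cumulative sums down every column (equation 8.6),
     * performed in place on the intermediate table. */
    for (int x = 0; x < w; ++x) {
        uint32_t acc = 0;
        for (int y = 0; y < h; ++y) {
            acc += sat[y * w + x];
            sat[y * w + x] = acc;
        }
    }
}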
As cumulative sums of the original image need to be stored, the summed-area table requires wider data types for its elements. Assuming integer images with a bit depth per channel of $b_i$, the summed-area table needs a bit depth per channel $b_s$ corresponding to the expression in equation 8.7. This is to account for the rare worst case scenario in which every pixel of the original image is at maximum intensity. The additional row and column are for the bottom and right borders which are used for repeating textures, explained momentarily.
$$\begin{aligned}
b_s &= \lceil \log_2((2^{b_i} - 1) \cdot (w + 1) \cdot (h + 1)) \rceil \\
&\approx \lceil b_i + \log_2(w) + \log_2(h) \rceil
\end{aligned} \qquad (8.7)$$
Given an image bit depth and a summed-area table bit depth, the maximum
resolution can be computed as shown in equation 8.8.
$$(w + 1) \cdot (h + 1) = \frac{2^{b_s}}{2^{b_i} - 1} \qquad (8.8)$$
Using a bit depth per channel of 8 for the image and 32 for the summed-area
table yields a maximum resolution of roughly 16 million pixels. Textures are
rarely this large and a few bits could be saved by limiting texture resolutions
to a few megapixels. However, 32 bits will be nicely aligned in memory and
the data type is native in most computational contexts, suggesting that 32-bit
elements should be used for the summed-area table. This is a four-fold increase
in memory requirement in comparison to the original image.
It is worth noting that for integer images, the summed-area table can be constructed without introducing any quantization noise. This is in contrast to image pyramids, both regular and anisotropic.
The original summed-area table is monotonically increasing with each element.
For real-valued images, this becomes a problem as the floating-point data type
was not designed to handle operations between values with large magnitude
differences.
An improvement was introduced in 2005 by Hensley et al. in [12]. If the
texels of a channel in the original image are biased by subtracting the mean of
all texels in that channel, the summed-area table is no longer monotonically
increasing. This can be exploited for real-valued images as the box-filter is
computed between samples with potentially large magnitude differences.
For biased images, the box-filter is computed as normal and the bias added as a final step. This minimizes the precision issues when computing the summed-area table as well as the issues that may arise when bilinearly interpolating the summed-area table.
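As a small illustration of the idea, a box-filter over a biased table might be evaluated as below. This is a sketch under stated assumptions: a single-channel float table built from the mean-subtracted image, exclusive lower corners and no wrapping; the function name is hypothetical.

    /* Box-filter over the texel region (x0, x1] x (y0, y1] of a biased
       summed-area table 'sat', built after subtracting 'mean' from every
       texel. The bias is re-added as a final step, as described above. */
    float sat_box_filter_biased(const float *sat, float mean,
                                int x0, int y0, int x1, int y1, int w)
    {
        float sum = sat[y1 * w + x1] - sat[y0 * w + x1]
                  - sat[y1 * w + x0] + sat[y0 * w + x0];
        float area = (float)((x1 - x0) * (y1 - y0));
        return sum / area + mean;
    }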
8.2 Sampling Method
The previous filters have all been point-sampling filters, approximating the entire footprint by a single sample. This makes any assumptions about the shape of the footprint irrelevant. For image pyramids, the size of the footprint was used only to select the appropriate texture levels in which to perform point-sampling.
Sampling in summed-area tables is area-based. This makes both the size and the shape of the footprint important. As previously discussed, Crow conceived summed-area tables with axis-aligned box-filters in mind. As such, the original summed-area tables are only able to filter rectangular footprints aligned with the axes of the texture.
With knowledge about the partial derivatives of the texture coordinates ∂s
and ∂t, the axis-aligned rectangle can be computed in conjunction with the
texture coordinates of the sample point sc and tc . This is shown in equation
8.9.
s0 = sc − ∂s/2
s1 = sc + ∂s/2
t0 = tc − ∂t/2
t1 = tc + ∂t/2    (8.9)
For tile-based rendering pipelines, it is convenient to define the footprint as a
square around the sample point in image space. This as a box-filter is to be
applied to the underlying texture data. Figure 8.9 shows a tile to be textured
where the grid of dots correspond to the sample points of the image.
Figure 8.9: Summed-area tables work great in tile-based rendering contexts.
This requires that the texture coordinates are perspectively interpolated to
the corners of all footprint squares for each tile. For each tile, this results in
(n + 1)2 interpolations instead of n2 interpolations where n is the number of
sample points in a tile along one dimension.
In image space, the square is bounded by the four vertices K, L, M and N. In texture space, it can be distorted into a quadrilateral. This is shown in figure 8.10 in which each vertex contains the resolution-independent texture coordinates s and t.
Figure 8.10: A square in image space can become a quadrilateral in texture space.
Before the summed-area table can be sampled, the optimal axis-aligned rectangular footprint needs to be computed in texture space. This is done by computing the bounding box of the quadrilateral. Equation 8.10 shows how this is done.
s0 = min(Ks, Ls, Ms, Ns)
s1 = max(Ks, Ls, Ms, Ns)
t0 = min(Kt, Lt, Mt, Nt)
t1 = max(Kt, Lt, Mt, Nt)    (8.10)
The bounding box will in most cases have an area greater than that of the
quadrilateral which will result in a blurry sample if it is used as the actual
footprint. Blur is often preferred over visual artifacts, suggesting that the
bounding box should be used as a footprint.
However, it is possible to adjust the bounding box and by that its area so that
it corresponds to the area of the quadrilateral. This can be achieved by first
computing the center (sc , tc ) of the bounding box as shown in equation 8.11.
sc = (s0 + s1) / 2
tc = (t0 + t1) / 2    (8.11)
The bounding box of the quadrilateral and its center are now defined by three sets of resolution-independent texture coordinates. This is illustrated in figure 8.11.
Figure 8.11: The bounding box of the quadrilateral and its center.
The area of the quadrilateral Aq and the area of the rectangle Ar are computed
as shown in equation 8.12. The absolute value is required for the area of the
quadrilateral as it is possible for its orientation to be back-facing in texture
space.
Aq = |(Ms − Ks) · (Nt − Lt) − (Ns − Ls) · (Mt − Kt)| / 2
Ar = (s1 − s0) · (t1 − t0)    (8.12)
A shrinking factor f is introduced as the square root of the ratio between the two areas as shown in equation 8.13. This factor will be less than or equal to one as the area of the rectangle is always greater than or equal to that of the quadrilateral.

f = √(Aq / Ar)    (8.13)
The bounding box of the quadrilateral can be adjusted in accordance with the
shrinking factor. Equation 8.14 shows a process which shrinks the bounding
box around its center.
s0' = sc + f · (s0 − sc)
s1' = sc + f · (s1 − sc)
t0' = tc + f · (t0 − tc)
t1' = tc + f · (t1 − tc)    (8.14)
An adjusted bounding box, defined by the two sets of texture coordinates (s0', t0') and (s1', t1'), is obtained from this process. Figure 8.12 illustrates this result.
Figure 8.12: The adjusted bounding box of the quadrilateral and its center.
The area of the adjusted bounding box can be computed as shown in equation
8.15. The equation also shows that this area will be exactly equal to that of
the quadrilateral.
Ar' = (s1' − s0') · (t1' − t0')
    = (f · (s1 − sc) − f · (s0 − sc)) · (f · (t1 − tc) − f · (t0 − tc))
    = f² · (s1 − sc − s0 + sc) · (t1 − tc − t0 + tc)
    = f² · (s1 − s0) · (t1 − t0)
    = (Aq / Ar) · Ar
    = Aq    (8.15)
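For illustration, the footprint computation of equations 8.10 to 8.14 could be sketched as follows. This is a plain C sketch with an illustrative vec2 type; no guard is included for a degenerate, zero-area bounding box.

    #include <math.h>

    typedef struct { float s, t; } vec2;   /* illustrative coordinate pair */

    /* Computes the bounding box of the quadrilateral KLMN (equation 8.10)
       and shrinks it around its center by the factor of equation 8.13,
       so that its area matches that of the quadrilateral (equation 8.14). */
    void shrink_footprint(vec2 K, vec2 L, vec2 M, vec2 N,
                          float *s0, float *t0, float *s1, float *t1)
    {
        *s0 = fminf(fminf(K.s, L.s), fminf(M.s, N.s));
        *s1 = fmaxf(fmaxf(K.s, L.s), fmaxf(M.s, N.s));
        *t0 = fminf(fminf(K.t, L.t), fminf(M.t, N.t));
        *t1 = fmaxf(fmaxf(K.t, L.t), fmaxf(M.t, N.t));

        float sc = 0.5f * (*s0 + *s1);
        float tc = 0.5f * (*t0 + *t1);

        /* Areas of the quadrilateral and the rectangle (equation 8.12). */
        float Aq = fabsf((M.s - K.s) * (N.t - L.t)
                       - (N.s - L.s) * (M.t - K.t)) * 0.5f;
        float Ar = (*s1 - *s0) * (*t1 - *t0);
        float f  = sqrtf(Aq / Ar);   /* shrinking factor, f <= 1 */

        *s0 = sc + f * (*s0 - sc);
        *s1 = sc + f * (*s1 - sc);
        *t0 = tc + f * (*t0 - tc);
        *t1 = tc + f * (*t1 - tc);
    }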
As previously stated, adjusting the bounding box will cause visual artifacts. As such, the original rectangular footprint should be used for sampling the summed-area table, provided that the bounding box does not exceed the domain of the texture. The next section will provide details on how to bilinearly interpolate the summed-area table.
8.3 Bilinear Interpolation
Ordinary bilinear interpolation can be used to sample the summed-area table
provided that the image is padded with the corresponding wrapped texels at
the bottom and right borders before the summed-area table is constructed.
Similar to point-sampling, a table look-up function Sl is defined as shown in
equation 8.16 where Wx and Wy are the wrapping functions previously defined
and S the summed-area table.
Sl[xi, yi] = S[Wx[xi], Wy[yi]]    (8.16)
The scaled texture coordinate pair (x, y) is obtained through scaling the resolution-independent texture coordinate pair (s, t) by w and h, respectively. This is shown in equation 8.17.

x = s · w
y = t · h    (8.17)
The integer and fractional parts are computed using the flooring and fractional
functions as shown in equation 8.18.
xi = floor(x)
xf = frac(x)
yi = floor(y)
yf = frac(y)    (8.18)
If the image was padded before constructing the summed-area table, bilinear
interpolation is performed in the same way as for point-sampling. Equation
8.19 shows how bilinear interpolation is performed using the table look-up
function Sl four times.
B(x, y) = Sl[xi, yi] · (1 − xf) · (1 − yf)
        + Sl[xi + 1, yi] · xf · (1 − yf)
        + Sl[xi, yi + 1] · (1 − xf) · yf
        + Sl[xi + 1, yi + 1] · xf · yf    (8.19)
8.4 Clamping Textures
The bounding box of the quadrilateral is described by two resolution-independent texture coordinate pairs (s0, t0) and (s1, t1), both in texture space. For clamping textures, no sampling should occur past the four edges. Figure 8.13 shows a footprint which exceeds the domain of the texture at the bottom and right edges.
Figure 8.13: Clamping textures only sample region A.
For clamping textures, the only region of interest is the rectangle marked with A in the figure. This region can be computed by limiting the resolution-independent texture coordinate pairs to the domain of the texture as shown in equation 8.20.
s0' = min(max(s0, 0), 1)
s1' = min(max(s1, 0), 1)
t0' = min(max(t0, 0), 1)
t1' = min(max(t1, 0), 1)    (8.20)
The sum of all texels in region A is computed using bilinear interpolation as shown in equation 8.21 where the scaled texture coordinates x0, x1, y0 and y1 are obtained as normal. This is the modified definition for sampling the summed-area table as previously discussed.

Σ_A = B(x1, y1) − B(x0 − 1, y1) − B(x1, y0 − 1) + B(x0 − 1, y0 − 1)    (8.21)
The box-filter of the image enclosed within the region defined by A is computed as shown in equation 8.22.

I = Σ_A / ((x1 − x0 + 1) · (y1 − y0 + 1))    (8.22)
In total, there are 16 table look-ups for clamping textures, a number equal to that of anisotropic image pyramids.
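A sketch of the complete clamped sampling path (equations 8.16 to 8.22) is given below in plain C. The wrapped look-up Sl is assumed given as sat_lookup and is assumed to clamp its integer indices; all other names are illustrative.

    /* Bilinear table look-up of equation 8.19. */
    #include <math.h>

    extern float sat_lookup(int xi, int yi);   /* Sl, assumed given */

    float B(float x, float y)
    {
        int   xi = (int)floorf(x), yi = (int)floorf(y);
        float xf = x - floorf(x),  yf = y - floorf(y);
        return sat_lookup(xi,     yi    ) * (1 - xf) * (1 - yf)
             + sat_lookup(xi + 1, yi    ) * xf       * (1 - yf)
             + sat_lookup(xi,     yi + 1) * (1 - xf) * yf
             + sat_lookup(xi + 1, yi + 1) * xf       * yf;
    }

    /* Box-filter over the clamped footprint (equations 8.20 to 8.22);
       16 table look-ups in total, four per bilinear interpolation. */
    float sample_clamped(float s0, float t0, float s1, float t1, int w, int h)
    {
        float x0 = fminf(fmaxf(s0, 0.0f), 1.0f) * w;
        float x1 = fminf(fmaxf(s1, 0.0f), 1.0f) * w;
        float y0 = fminf(fmaxf(t0, 0.0f), 1.0f) * h;
        float y1 = fminf(fmaxf(t1, 0.0f), 1.0f) * h;

        float sum = B(x1, y1) - B(x0 - 1.0f, y1) - B(x1, y0 - 1.0f)
                  + B(x0 - 1.0f, y0 - 1.0f);
        return sum / ((x1 - x0 + 1.0f) * (y1 - y0 + 1.0f));
    }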
8.5 Repeating Textures
Crow mentioned that special considerations had to be made when sampling the summed-area table past the edges. As the summed-area table stores a cumulative sum, elements on the other side of the edges must be increased or decreased by the corresponding values. If the footprint intersects multiple edges, the elements must be increased or decreased with multiples of those values. This section provides a method which properly handles an arbitrary number of edge intersections in constant time.

For repeating textures, the sampling process becomes a little more complex. Figure 8.14 shows the same footprint exceeding the texture domain at the bottom and right edges. For repeating textures, all four regions A, B, C and D are of interest.
Figure 8.14: Repeating textures sample regions A, B, C and D.
To compute the sum in each region independently, two additional coordinates
are needed. These two coordinates correspond to the width and height of the
texture as shown in equation 8.23.
xe = w
ye = h    (8.23)
The sum of all texels in region A is computed as shown in equation 8.24, requiring four bilinear interpolations.

Σ_A = B(xe, ye) − B(x0 − 1, ye) − B(xe, y0 − 1) + B(x0 − 1, y0 − 1)    (8.24)
The sum of all texels in region B is computed using only two bilinear interpolations. This is shown in equation 8.25.

Σ_B = B(x1, ye) − B(x1, y0 − 1)    (8.25)
Conversely, the sum of all texels in region C is computed using two bilinear
interpolations. Equation 8.26 shows this process.
Σ_C = B(xe, y1) − B(x0 − 1, y1)    (8.26)
The sum of all texels in the final region D is computed using a single bilinear
interpolation as shown in equation 8.27.
Σ_D = B(x1, y1)    (8.27)
It is possible for the rectangular footprint to span across multiple entire textures, forming additional regions. However, the sample positions will be exactly the same as in the above example. Computing the correct sum is done by first determining the number of edges that are intersected by the footprint. Equation 8.28 shows how this is done.

Nx = floor(s1) − floor(s0)
Ny = floor(t1) − floor(t0)    (8.28)
Using these two numbers, the total sum for any rectangular footprint is computed as shown in equation 8.29. It should be emphasized that the expression in the equation will be exactly equal to that of clamping textures if both Nx and Ny are equal to zero, meaning that no edges were intersected in either direction.

Σ = B(x0 − 1, y0 − 1)
  − B(x1, y0 − 1)
  − B(x0 − 1, y1)
  + B(x1, y1)
  + B(xe, y1) · Nx
  − B(xe, y0 − 1) · Nx
  + B(x1, ye) · Ny
  − B(x0 − 1, ye) · Ny
  + B(xe, ye) · Nx · Ny    (8.29)
Identical to clamping textures, the box-filter is the total sum divided by the total area. This is shown in equation 8.30.

I = Σ / ((x1 − x0 + 1) · (y1 − y0 + 1))    (8.30)
The number of table look-ups will vary for this filter. In most cases, it will be 16, as for clamping textures. Should the footprint intersect the edges, the number of look-ups can be as large as 36. However, the number of look-ups is not related to the number of texels in the region of the box-filter. As such, the filter is performed in constant time with respect to the size of the footprint.
The extra coordinate pair xe and ye was defined to be exactly equal to the
width and height of the texture. This makes it completely unnecessary to
bilinearly interpolate samples which contain either xe or ye . Instead, two
linear interpolation functions are defined as shown in equation 8.31.
Lx(x, y) = Sl[xi, yi] · (1 − xf) + Sl[xi + 1, yi] · xf
Ly(x, y) = Sl[xi, yi] · (1 − yf) + Sl[xi, yi + 1] · yf    (8.31)
In addition, samples which contain both xe and ye only require a single table look-up. The two linear interpolation functions provide a more efficient way of computing the sum of all texels in the region of the box-filter. This is shown in equation 8.32 where the sum is computed using at most 25 table look-ups, saving a total of 11 look-ups.

Σ = B(x0 − 1, y0 − 1)
  − B(x1, y0 − 1)
  − B(x0 − 1, y1)
  + B(x1, y1)
  + Ly(xe, y1) · Nx
  − Ly(xe, y0 − 1) · Nx
  + Lx(x1, ye) · Ny
  − Lx(x0 − 1, ye) · Ny
  + Sl[xe, ye] · Nx · Ny    (8.32)
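Putting equations 8.28 and 8.29 together, the repeating-texture sum could be sketched as below; for clarity the sketch uses the unoptimized form of equation 8.29 rather than equation 8.32. The bilinear look-up B is the one defined earlier, and wrapping the footprint into the base texture tile is a simplifying assumption of the sketch.

    /* Constant-time repeating-texture sum (equations 8.28 and 8.29). */
    #include <math.h>

    extern float B(float x, float y);   /* bilinear table look-up */

    float repeat_sum(float s0, float t0, float s1, float t1, int w, int h)
    {
        float Nx = floorf(s1) - floorf(s0);   /* horizontal edge crossings */
        float Ny = floorf(t1) - floorf(t0);   /* vertical edge crossings   */

        /* Scaled coordinates, wrapped into the base texture tile. */
        float x0 = (s0 - floorf(s0)) * w, x1 = (s1 - floorf(s1)) * w;
        float y0 = (t0 - floorf(t0)) * h, y1 = (t1 - floorf(t1)) * h;
        float xe = (float)w, ye = (float)h;   /* extra coordinate pair */

        return B(x0 - 1, y0 - 1) - B(x1, y0 - 1) - B(x0 - 1, y1) + B(x1, y1)
             + (B(xe, y1) - B(xe, y0 - 1)) * Nx
             + (B(x1, ye) - B(x0 - 1, ye)) * Ny
             + B(xe, ye) * Nx * Ny;
    }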
8.6 Higher-Order Filters
As stated, the original summed-area tables were designed with axis-aligned box-filters in mind. In many cases, box-filters are not sufficient as they weigh all samples within the support region of the filter equally. The need for higher-order filters is evident.
A generalization of the summed-area table was presented in 1986 by Heckbert
in [13]. Heckbert recognized that the summed-area table was in fact an efficient
implementation of box-convolution and extended the method to higher-order
filters through convolution theory.
Using the new method, a variable-width Gaussian-like filter could be implemented in constant time through integrating the original image multiple times. This is followed by applying a slightly different evaluation scheme on the resulting higher-order summed-area table.
However, handling borders still seems to be an unresolved issue, making their
use for regular texturing in three-dimensional computer graphics limited. Additionally, the required bit depth is significantly higher and as such, precision
becomes an issue for higher-order summed-area tables.
Equation 8.33 [12] shows the required bit depth per channel bs of a second-order summed-area table (equivalent to a Bartlett filter when sampled) for an integer image with a bit depth per channel of bi.

bs = ⌈log2((2^bi − 1) · w · (w + 1)/2 · h · (h + 1)/2)⌉
   ≈ ⌈bi + 2 · log2(w) + 2 · log2(h) − 2⌉    (8.33)
For an image with a resolution of 512 by 512 pixels and with a bit depth per
channel of 8, the amount of required memory per channel and pixel is 42 bits
[13]. For alignment purposes, 64 bit wide data types should be used for the
second-order summed-area table. This is an eight-fold increase in memory in
comparison to the original image.
Chapter E
Pipeline Overview
So far, this report has provided a useful tool set which will be put to use in a general graphics pipeline. The chapter on texture filtering discussed different problems associated with texture mapping and how to combat them. This chapter will detail the different stages of the actual graphics pipeline implemented.
1 Design Considerations

1.1 Deferred Rendering
The implemented pipeline employs a deferred rendering scheme for a number of different reasons, mainly because of its common use in modern computer graphics but also due to a number of benefits introduced by the deferred design.

A deferred renderer postpones all lighting computations until the entire scene has been rendered and, by that, visibility correctly determined. Lighting computations often require a great amount of processing power and it is beneficial to only light fragments that are guaranteed to be visible to the observer. Deferred rendering makes the complexity of lighting computations independent of overdraw and by that only dependent on the resolution of the target image and on the number of light sources.

The deferred rendering scheme also allows for a variety of post-processing effects such as depth of field and screen-space ambient occlusion. However, postponing the lighting computations requires the pipeline to store fragments in a separate buffer. Rasterized fragments generally contain a great number of attributes such as albedo, normal and depth. These attributes must be stored on a per-pixel basis, increasing the memory requirement of the pipeline. If the fragment record is stored in an uncompressed format, its memory requirement may impose a drawback too large to motivate the deferred design.
The implemented pipeline introduces a compressed record format which stores
albedo (24 bits), normal (24 bits), specular power (8 bits), specular intensity (8
bits) and depth (32 bits). The compressed record requires 96 bits of storage in
addition to the regular pixel buffer required to accumulate light contributions
from the different light sources in the scene.
As opposed to forward rendering, deferred rendering suffers from the inability to properly handle blending. As such, blending is not included in the implemented pipeline and all models are assumed to be manifold. It should be noted that even traditional forward rendering suffers from problems with transparency, since transparency is dependent on the order in which the fragments are produced, something that cannot be correctly determined for forward rendering nor for deferred rendering.
1.2 Rasterization
It is crucial to carefully evaluate the choice of rasterization method as its
compliance with the underlying hardware will have a huge impact on the overall
performance of the pipeline.
For serial architectures, a number of different rasterization methods would be suitable. For parallel architectures, data contamination may become a serious problem. If triangles are rasterized on a parallel architecture using a method suitable only for serial architectures, two concurrently executing elements may produce fragments at the same location in the target image at roughly the same time. A race condition occurs where a seemingly random process determines which fragment is stored in the target buffer. This introduces spatial noise in the image and temporal noise between frames and is an unacceptable flaw.
For serial architectures, scanline rasterization is a common method. With this method, horizontal spans of fragments between the edges of the triangle are generated for each scanline. The scanline method iteratively processes each scanline and each span and stores the resulting fragments or pixels in the target buffer. As the rasterization is performed by a single process, no data contamination can occur, making it suitable for serial architectures only.
Pipelines executing on parallel architectures must take data contamination
into consideration. As such, the implemented pipeline employs a hierarchical
tile-based rasterization scheme where the screen is subdivided into multiple
levels of tiles and no two processes may operate on the same tile at the same
time. This completely eliminates the data contamination problem.
The tile-based scheme also handles triangles of different sizes elegantly. A
scene may consist of pixel-sized triangles and of triangles covering a larger
portion of the target image. If each rasterization process were to operate on a
single triangle, the work load would differ greatly between processes. This is
something that is undesirable for parallel architectures, motivating the choice
of the tile-based rasterization scheme.
1.3 Batch Rendering
The pipeline operates on batches of isolated triangles. Each triangle consists of three vertices and stores additional data required in the different stages of the pipeline. To supply a triangle, the application programmer invokes the corresponding interface method. The method buffers every triangle queued for rendering in host memory and ensures isolation by only accepting a single triangle at a time.
Isolating the triangles will enable parallel processing of each individual triangle
at the expense of processing a greater number of vertices during the initial
stages. This is a trade-off which favors parallelism and data locality and by
that helps to utilize the parallel architecture of the graphics processing unit
and its cache hierarchy.
When the full capacity of the host buffer is reached or when the pipeline is
explicitly instructed through the interface to perform the rendering, the buffer
is flushed and the rendering is initialized. This process starts with the triangle
buffer being copied to device memory and proceeds through a series of kernels
which will be explained in detail throughout the rest of this chapter. The entire
process is transparent to the application programmer in a fashion similar to
that of the Open Graphics Library.
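A hypothetical host-side sketch of this batching scheme is shown below; none of the names correspond to the pipeline's actual interface, and the batch size of 8 192 triangles is the one chosen later in the discussion chapter.

    #define BATCH_SIZE 8192

    typedef struct { float vertices[3][4]; /* plus per-triangle data */ } Triangle;

    void pipeline_flush(void);   /* copies the batch to the device, runs the kernels */

    static Triangle batch[BATCH_SIZE];
    static int      batch_count = 0;

    void pipeline_submit(const Triangle *t)
    {
        batch[batch_count++] = *t;      /* triangles are accepted one at a time */
        if (batch_count == BATCH_SIZE) {
            pipeline_flush();           /* full capacity reached: render the batch */
            batch_count = 0;
        }
    }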
2 Vertex Transformation
Whenever a triangle is queued for rendering, the active transformation matrix
is polled from the transformation sub-system and stored together with the
triangle.
The transformation matrix corresponds to the pre-multiplied transformation
chain representing the previous transformations as handled by the transformation sub-system. It is used to transform a triangle from object space to camera
space. This is equivalent to the model view matrix in the Open Graphics Library with the addition of a camera system.
The vertex transformation stage applies the object space to camera space matrix to the three vertices of every triangle currently in the pipeline. This is performed through ordinary matrix-vector multiplication. The stage is executed as a one-dimensional kernel with each work item processing a single triangle. Work items are issued in groups of 32.
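A minimal OpenCL C sketch of this stage might look as follows. The Tri record layout is illustrative only, as the real triangle record stores considerably more data, and bounds are checked in case the global size is rounded up.

    typedef struct {
        float4 v[3];    /* vertex positions                   */
        float  m[16];   /* object-to-camera matrix, row-major */
    } Tri;

    __kernel void transform_vertices(__global Tri *tris, uint count)
    {
        uint gid = get_global_id(0);
        if (gid >= count) return;

        __global Tri *t = &tris[gid];
        for (int i = 0; i < 3; ++i) {
            /* Ordinary matrix-vector multiplication per vertex. */
            float4 p = t->v[i];
            float4 r;
            r.x = t->m[0]*p.x  + t->m[1]*p.y  + t->m[2]*p.z  + t->m[3]*p.w;
            r.y = t->m[4]*p.x  + t->m[5]*p.y  + t->m[6]*p.z  + t->m[7]*p.w;
            r.z = t->m[8]*p.x  + t->m[9]*p.y  + t->m[10]*p.z + t->m[11]*p.w;
            r.w = t->m[12]*p.x + t->m[13]*p.y + t->m[14]*p.z + t->m[15]*p.w;
            t->v[i] = r;
        }
    }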
3 View Frustum Clipping
As the triangles have been transformed to camera space, view frustum clipping
can be performed as detailed in the mathematical foundation. The stage is
executed as a one-dimensional kernel where each work item processes a single
triangle and 32 work items form a work group.
The triangulation associated with view frustum clipping appends every generated triangle to an additional triangle buffer in global memory using atomic
operations. In theory, this buffer needs to be five times larger than the batch
size to account for the worst case scenario in which every triangle is clipped
into a polygon with seven vertices. However, this is considered extremely rare
and as such, the two buffers are of equal size.
As the process completes, the number of generated triangles needs to be read
from device memory in order to initialize subsequent kernels. Should the number of triangles be equal to zero, no further kernels are launched, the interface
is reset and control is returned to the application.
4 Triangle Projection
With triangles clipped against the view frustum, no vertices will be located
in front of the near plane. As such, it is safe to project the vertices into the
axis-aligned plane of the target image.
The projection stage is executed similarly to the previous stages. A one-dimensional kernel is launched with each work item processing a single triangle and with 32 work items grouped into a work group. Every vertex is projected into an integer raster with sub-pixel precision using the expressions in equation 4.1. In the equation, p corresponds to the sub-pixel precision while (xr, yr) is the vertex in raster space and (xc, yc, zc) the vertex in camera space. Sub-pixel precision of the geometry is crucial for good results in the subsequent rasterization stage.
xr = p/2 · (w + min(w, h) · xc/zc)
yr = p/2 · (h − min(w, h) · yc/zc)
zr = 1/zc    (4.1)
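As a sketch, the projection of equation 4.1 could be written as below in plain C. The sub-pixel precision p is left as a parameter, and rounding is simplified to truncation.

    /* Projects a camera-space vertex (xc, yc, zc) into the integer raster
       of equation 4.1 and stores the inverse depth. */
    void project_vertex(float xc, float yc, float zc,
                        int w, int h, int p,
                        int *xr, int *yr, float *zr)
    {
        float m = (float)(w < h ? w : h);
        *xr = (int)(p * 0.5f * ((float)w + m * xc / zc));
        *yr = (int)(p * 0.5f * ((float)h - m * yc / zc));
        *zr = 1.0f / zc;   /* inverse depth, linear across the projected triangle */
    }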
The expressions in the equation effectively project the triangles into the axis-aligned plane of the target image. For each vertex, the inverse depth is also stored for use when interpolating the vertex attributes in a perspectively-correct fashion as described in the next section.
The projection kernel also computes the signed double area of every triangle
using the integer coordinates in raster space. The double area is stored together
with the triangle and will be used heavily in a subsequent kernel. This is
one of the major reasons for clipping every triangle against the entire view
frustum. With high sub-pixel precision, the raster coordinates will be of large
magnitudes and the signed double area of even larger magnitude. Without
view frustum clipping, these variables would eventually overflow for triangles
extending far beyond the left, right, bottom or top edges of the target image.
The signed double area also provides an efficient method for performing back-face culling. Since only triangles with a positive area will face the camera, all triangles with zero or negative area can be culled from further processing. This is possible as the pipeline is designed for deferred rendering with no transparency or blending effects. Every triangle with an area greater than zero is appended to a final triangle buffer in global memory using atomic operations.
In addition, front-face culling can be achieved by culling all triangles with
a non-negative signed double area. However, this is not implemented in the
pipeline for the time being.
The number of triangles in the final triangle buffer is read from device memory
and used to initialize subsequent kernels. Should this number be zero, no
further kernels are launched, the interface is reset and control is returned to
the application.
5 Rasterization
Rasterization is essentially the process of breaking down the projected triangles into pixel-sized fragments. This process uses integer arithmetic whenever possible to avoid precision issues associated with floating-point arithmetic.
The stage is initialized through a one-dimensional kernel where each work item
processes one of the visible triangles in the final triangle buffer. The kernel is
executed in work groups of 32 work items and is responsible for the preparation
of different attributes required for the rasterization process.
The bounding box of every triangle is computed for three different levels. These
three levels are pixel, small tile and large tile. This is computed through a
series of minimum and maximum functions on the raster coordinates of the
three vertices of every triangle. The three levels form a sort-middle hierarchical
rasterization refinement scheme, similar to the scheme described in 2011 by
Laine and Karras in [2].
The kernel is also responsible for computing the three edge normals of every
triangle as well as projecting the vertex attributes for interpolation across the
surface.
5.1 Large Tiles
The kernel responsible for large tile rasterization is one-dimensional with one
work item for each triangle and is enqueued with a work group size of 32.
The group size of 32 is used primarily since it enables efficient use of the local
memory as described momentarily. Figure 5.1 shows a triangle projected onto
the plane of the target image and how the screen is sub-divided by large tiles.
Figure 5.1: The target image is sub-divided by large tiles.
Every work item is assigned the task of determining which large tiles are partially or completely overlapping with the triangle. To achieve this, a work item iterates through the large tiles inside of the bounding box at the large tile level. For every overlapping tile, a local memory buffer is updated with the corresponding information. Figure 5.2 shows the tiles that are processed.
Figure 5.2: Only large tiles within the bounding box are processed.
In a first pass, every large tile corner in the bounding box is projected against
the three edge normals of the triangle and the resulting magnitudes stored in
private memory. For the second pass, the separating axis theorem is applied
and tiles are classified as either outside, inside or overlapping. This information
is stored in a local memory bit matrix using atomic operations on a layer of
32-bit integers which corresponds to the triangles of the current work group.
Figure 5.3: The corners of the large tiles are projected against the edge normals.
It should be emphasized that the three edge normals of the triangle are sufficient for the separating axis theorem as only large tiles within the bounding box are processed. As such, no separation can occur along the edge normals of the large tiles.
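The classification of one tile against one triangle could be sketched as below in plain C. Each edge e is assumed given as an edge function nx[e]·x + ny[e]·y + c[e] that is non-negative inside the triangle; all names are illustrative.

    enum { TILE_OUTSIDE, TILE_INSIDE, TILE_PARTIAL };

    int classify_tile(const int nx[3], const int ny[3], const long c[3],
                      int x0, int y0, int x1, int y1)
    {
        const int cx[4] = { x0, x1, x0, x1 };   /* tile corner coordinates */
        const int cy[4] = { y0, y0, y1, y1 };
        int inside_all = 1;

        for (int e = 0; e < 3; ++e) {
            int inside = 0, outside = 0;
            /* First pass: project the corners onto the edge normal. */
            for (int i = 0; i < 4; ++i) {
                long d = (long)nx[e] * cx[i] + (long)ny[e] * cy[i] + c[e];
                if (d >= 0) inside++; else outside++;
            }
            /* Second pass: apply the separating axis theorem. */
            if (inside == 0) return TILE_OUTSIDE;   /* separated along this normal */
            if (outside != 0) inside_all = 0;       /* corners on both sides       */
        }
        return inside_all ? TILE_INSIDE : TILE_PARTIAL;
    }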
The first work item in a work group clears the bit matrix before any other
work items are allowed to proceed with their tasks. It is also responsible for
writing the resulting bit matrix into global memory when all work items in the
work group have completed. This requires two synchronizations between the
work items of a work group and is achieved through the use of barriers.
No other work group will modify the layer of the bit matrix corresponding to
the current work group. As a result, it can be written into global memory
without having to rely on atomic operations.
This step of the rasterization process introduces a limitation on the graphics
pipeline. As the amount of local memory on a streaming multi-processor is limited and an entire tile layer needs to be resident simultaneously, the resolution
will be directly limited by the size of the large tiles.
To compute the maximum possible resolution with a given aspect ratio α, a large tile size slarge and a maximum number of large tiles tmax, equation 5.1 is used.

wmax = ⌊√(tmax · α)⌋ · slarge
hmax = ⌊√(tmax/α)⌋ · slarge    (5.1)
It is safe to assume that the amount of local memory mlocal is at least 16 kilobytes, as required by the specification of the Open Computing Language. A tile layer is composed of 32 triangles, each requiring two bits of memory. This amounts to 8 bytes per tile and layer. Using this information, the maximum number of horizontal and vertical tiles in a layer can be computed using equation 5.2.

tmax = ⌊mlocal / 8⌋    (5.2)
Assuming 16 kilobytes of local memory yields a maximum of 2048 large tiles.
If these are arranged according to the common 16:9 (1.78) aspect ratio, the
maximum number of large tiles is 60 by 33. This imposes a limit on the
resolution which with a large tile size of 64 pixels is 3840 by 2112. As this is
well over modern monitor resolutions, the limitation is minimal. It should also
be noted that the large tile size can be increased to support higher resolutions.
5.2 Small Tiles
The first rasterization refinement step is executed as a three-dimensional kernel. Every work group is responsible for sub-dividing a single large tile of a
layer into multiple small tiles. As with the previous step, a work group consists of 32 work items, each corresponding to one of the original triangles. This
allows modifications of the bit matrix to be carried out with atomic operations
on a local memory buffer. Figure 5.4 shows a large tile to be processed by a
work group and the small tiles enclosed within.
Figure 5.4: Every work group processes a single large tile.
The small tile rasterization process shares many fundamental ideas with the previous step and starts by querying the overlap status of the current large tile. Tiles without overlap are not processed further and the work item completes its task. For tiles with complete overlap, every enclosed small tile is set to complete overlap as implied by the large tile. The tiles that were classified as partially overlapping are the ones in need of actual refinement. This is identical to how Seiler et al. successively refine the rasterization of a triangle.
The large tile in the previous figure is partially overlapping with the triangle.
As such, it needs refinement. Figure 5.5 shows the small tiles within the large
tile and within the bounding box of the triangle.
Figure 5.5: Only small tiles within the bounding box are processed.
An iteration is made through all corners of all small tiles within the large tile
and within the bounding box of the triangle at the corresponding level. The
corners are projected against the three edge normals and the magnitudes are
stored in private memory. A second iteration is made where the separating axis
theorem is used to classify the small tiles as either inside, outside or partially
overlapping. This is essentially the same process as for large tiles.
Figure 5.6 shows the tile corners that are projected against the edge normals
of the triangle.
Figure 5.6: The corners of the small tile are projected against the edge normals.
The first work item in the work group clears the local memory buffer and
is responsible for writing all the small tiles into global memory in the same
fashion as earlier. As for large tiles, no atomic operations are needed due to
the work group size.
5.3 Pixels
The final refinement step is the actual pixel rasterization. This is done differently in comparison to the two earlier rasterization steps. A two-dimensional kernel is launched with each work item operating on a single small tile. The size of a work group is equal to the number of small tiles in a large tile and each work item is responsible for computing the pixel coverage of every triangle located in that small tile. This provides exclusive access to the target image in the region of that tile. As such, every work item is able to read and write the target image without conflicting with other work items. This comes at the expense of load balancing, which will be discussed at a later stage in this report.
Figure 5.7 shows a small tile that a single work item is processing. The pixel
coverage is to be determined for all triangles intersecting that tile. This is
done in sequence for all triangles marked as intersecting the specific tile.
Figure 5.7: Every work item processes a single small tile.
Unlike the two other rasterization steps, all computations are performed at the sample points of each pixel as shown in figure 5.8. In a first pass, the signed double areas 2 · Au, 2 · Av and 2 · Aw are computed. The signed areas are computed using integer coordinates in the sub-pixel raster, which completely eliminates problems with floating-point precision.
Figure 5.8: The signed areas are computed for every sample point.
A set of rasterization rules must be defined which together determine if the triangle covers the sample point. From the definition of the barycentric coordinates, the inside of the triangle is defined where all three signed areas are positive, as shown in equation 5.3. This is also true for the signed double areas.

Au > 0
Av > 0
Aw > 0    (5.3)
If an edge is shared by two triangles, there might be some pixels along the edge where neither of the two triangles covers the sample point when using the above rasterization rules. However, both triangles will touch the point and the corresponding signed area will be zero for the two triangles. This will generate holes in the rasterization and must be corrected.
The set of rasterization rules must incorporate additional criteria in order to determine which of the two triangles overlaps the unrasterized pixels in the above example. This must be done in a consistent fashion: no pixel should be rasterized twice and no pixel should be left unrasterized.
The edge normals of the two triangles sharing an edge will be oriented in
opposite directions if both triangles are front-facing. Since the pipeline only
accepts front-facing triangles, this information can be used to determine pixel
coverage in a consistent fashion.
Equation 5.4 shows a set of stronger rasterization rules which ensure that a pixel is rasterized exactly once by incorporating the information of the edge normals. In the equation, Nu, Nv and Nw are the edge normals of the edges opposite to the triangle vertices U, V and W, respectively.
Au > 0 ∨ (Au = 0 ∧ (Nu,x > 0 ∨ (Nu,x = 0 ∧ Nu,y > 0)))
Av > 0 ∨ (Av = 0 ∧ (Nv,x > 0 ∨ (Nv,x = 0 ∧ Nv,y > 0)))
Aw > 0 ∨ (Aw = 0 ∧ (Nw,x > 0 ∨ (Nw,x = 0 ∧ Nw,y > 0)))
(5.4)
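These rules translate directly into a coverage test. A plain C sketch with illustrative names, where Au, Av and Aw are the signed double areas and (Nx, Ny) the components of the corresponding edge normals:

    /* One edge of the strengthened rasterization rules (equation 5.4). */
    static int edge_covered(long A, int Nx, int Ny)
    {
        return A > 0 || (A == 0 && (Nx > 0 || (Nx == 0 && Ny > 0)));
    }

    /* The sample point is covered only if all three tests pass. */
    int sample_covered(long Au, long Av, long Aw,
                       int Nux, int Nuy, int Nvx, int Nvy, int Nwx, int Nwy)
    {
        return edge_covered(Au, Nux, Nuy)
            && edge_covered(Av, Nvx, Nvy)
            && edge_covered(Aw, Nwx, Nwy);
    }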
The barycentric coordinates of a covered pixel are obtained through dividing
the three signed double areas Au , Av and Aw by the pre-computed signed
double area of the entire triangle. This set of coordinates is used to interpolate
all vertex attributes.
6 Differentiation of Texture Coordinates
For every covered pixel, the vertex attributes are interpolated in a perspectively-correct fashion as detailed in the mathematical foundation. This includes the texture coordinates, the depth value and the tangent space bases when normal mapping is activated.
The pipeline supports the three advanced texture filters detailed in the chapter
on texture filtering. In order to use these filters, the derivatives of the texture
coordinates are needed. In a computational context, derivatives are approximated through finite differences. This differentiation can be done in a number
of different ways since the pipeline is tile-based.
Figure 6.1 shows a single pixel of a small rasterization tile. One set of texture
coordinates is not enough for differentiation. Information from adjacent pixels
is required.
Figure 6.1: A single pixel is not enough for differentiation.
In the pipeline, the problem is solved by differentiating the interpolated texture
coordinates between groups of four pixels. The differentiation is done along
the diagonals as shown in figure 6.2. This is different to how the hardware
pipelines handle differentiation of the texture coordinates.
Figure 6.2: A group of four pixels enable various differentiation schemes.
Equation 6.1 shows how the partial derivatives are approximated from the maximum absolute differences between the texture coordinates along the diagonals. Using the maximum absolute differences is a trade-off which favors blur over noise. The division by √2 is present due to the distance over which the differentiation is made: the diagonal is √2 units long and this needs to be accounted for.

∂s ≈ max(|Cs − As|, |Ds − Bs|) / √2
∂t ≈ max(|Ct − At|, |Dt − Bt|) / √2    (6.1)
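A sketch of this differentiation for one 2x2 pixel group, in plain C with an illustrative TexCoord type:

    #include <math.h>

    typedef struct { float s, t; } TexCoord;

    /* Diagonal differentiation of equation 6.1 for the group A, B, C, D. */
    void diff_texcoords(TexCoord A, TexCoord B, TexCoord C, TexCoord D,
                        float *ds, float *dt)
    {
        const float inv_sqrt2 = 0.70710678f;   /* diagonals are sqrt(2) long */
        *ds = fmaxf(fabsf(C.s - A.s), fabsf(D.s - B.s)) * inv_sqrt2;
        *dt = fmaxf(fabsf(C.t - A.t), fabsf(D.t - B.t)) * inv_sqrt2;
    }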
The resulting partial derivatives of the texture coordinates are used for all four
pixels in the group. All active texture maps are sampled using the texture coordinates at pixel (i, j) and the partial derivatives for the group in which the
pixel is a member. When normal mapping is activated, the sampled normal
is transformed by the interpolated tangent space bases followed by a transformation by the object space to camera space matrix. The results are stored in
the fragment buffer and are ready for lighting in a separate pass.
7 Fragment Illumination
Fragment illumination is launched as a two-dimensional kernel with each work
item processing a single pixel in the fragment buffer. The kernel operates in
work groups of 32 work items and uses the Phong illumination model as introduced in 1975 by Phong in [14]. Phong observed that the spatial frequencies of
the specular highlights often were low for photographic images and aimed to
improve the standard illumination model used for computer generated images
at that time.
For the Phong illumination model, the vertex normals are linearly interpolated across the surface of a triangle. For every fragment to be illuminated, the interpolated normal N is used in conjunction with the light vector L and the reflected observer vector R. These vectors are shown in figure 7.1. The observer vector O is not used explicitly in the illumination model, only its reflected counterpart.
Figure 7.1: The vectors used in the Phong illumination model.
The point X is obtained through reversing the projection of the raster coordinates as shown in equation 7.1.

Xx = (2 · xr/p − w) / (min(w, h) · zr)
Xy = (h − 2 · yr/p) / (min(w, h) · zr)
Xz = 1/zr    (7.1)
By using the point in conjunction with the position of the light source S, the light vector L can be computed as shown in equation 7.2. The equation also shows how the point is reflected against the unit normal to obtain R.

L = S − X
R = X − 2 · (X • N) · N    (7.2)
The squared length of the light vector is stored and the vectors L and R are normalized to unit length. The two angles present in the previous figure are computed from these unit vectors as shown in equation 7.3.

α = max(L' • N, 0)
β = max(L' • R', 0)    (7.3)
The ambient, diffuse and specular illuminations are computed from the material properties as shown in equation 7.4 [14]. In the equation, n is the glossiness factor. Materials with high glossiness will produce small specular highlights with high intensities while materials with low glossiness will produce large specular highlights with low intensities.

ia = Ma
id = Md · α
is = Ms · β^n    (7.4)
Due to the nature of a deferred rendering pipeline, varying material properties have a great impact on the memory requirement as opposed to traditional forward rendering. Every material property needs to be stored per-pixel. This is not desirable. As such, the ambient factor Ma and the glossiness factor n are constant for all fragments. In addition, the diffuse factor Md is computed from the specular factor Ms as 1 − Ms.
The total intensity i is computed as shown in equation 7.5 where Ss is the strength of the light source. The strength is uniform across the spectrum of the light source.

i = ia + Ss / |L|² · (id + is)    (7.5)
Finally, the target image is updated with the light contribution of the current
light source. The spectral distribution of the light source is mixed with the
spectral distribution of the material and illuminated by the intensity i. The
light contribution is added to the light contribution from previous fragment
illumination passes and stored in the target image. This is shown in equation
7.6.
r' = r + Mr · Sr · i
g' = g + Mg · Sg · i
b' = b + Mb · Sb · i    (7.6)
For multiple light sources, the fragment illumination pass is run in sequence, once for every light source, until all light sources have contributed to the final image.
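The per-fragment computation of equations 7.2 to 7.5 could be sketched as below in plain C. The vector helpers and parameter names are illustrative, and the reconstruction of X from the raster coordinates (equation 7.1) is assumed to have been performed already.

    #include <math.h>

    typedef struct { float x, y, z; } vec3;

    static float dot3(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static vec3  sub3(vec3 a, vec3 b) { vec3 r = {a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
    static vec3  scale3(vec3 a, float s) { vec3 r = {a.x*s, a.y*s, a.z*s}; return r; }
    static vec3  norm3(vec3 a) { return scale3(a, 1.0f / sqrtf(dot3(a, a))); }

    /* Phong intensity for one fragment at point X with unit normal N,
       lit by a source at S with strength Ss. Md is derived as 1 - Ms. */
    float phong_intensity(vec3 X, vec3 N, vec3 S,
                          float Ma, float Ms, float n, float Ss)
    {
        vec3  L  = sub3(S, X);                            /* light vector       */
        float L2 = dot3(L, L);                            /* squared length     */
        vec3  R  = sub3(X, scale3(N, 2.0f * dot3(X, N))); /* reflected observer */

        vec3  Lu = norm3(L), Ru = norm3(R);
        float alpha = fmaxf(dot3(Lu, N), 0.0f);
        float beta  = fmaxf(dot3(Lu, Ru), 0.0f);

        float id = (1.0f - Ms) * alpha;                   /* diffuse term       */
        float is = Ms * powf(beta, n);                    /* specular term      */
        return Ma + Ss / L2 * (id + is);                  /* equation 7.5       */
    }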
Chapter F
Results
1 Rendered Images
Using the functionality of the implemented three-dimensional graphics pipeline, a set of images were rendered. These images are intended to demonstrate the capabilities of the pipeline and were rendered with a resolution four times greater than the target image. The target image was bilinearly interpolated using the texture filtering of the Open Graphics Library when displayed on the screen.
The first set demonstrates the deferred rendering of the pipeline. The tank zombie from the game Left 4 Dead was used as a test model. It contains 2 894 vertices and 5 784 triangles. Figure 1.1 shows the model illuminated by three white light sources.
Figure 1.1: The model illuminated by three white light sources.
The different channels of the fragment buffer can be visualized using the application. Figure 1.2 shows the albedo (color) channels.
Figure 1.2: The albedo (color) channels of the fragment buffer.
Figure 1.3 shows the normal channels. As the coordinate system is left-handed,
red intensities indicate surface normals directed to the right. Conversely, green
intensities indicate surface normals directed upwards and blue intensities surface normals directed into the image plane. The normals are defined in observer
space.
Figure 1.3: The normal channels of the fragment buffer.
Figure 1.4 shows the specular channel of the fragment buffer. It indicates the
specular reflectance of the visible fragments. Higher intensities will transfer
light specularly while lower intensities will transfer light diffusely.
Figure 1.4: The specular channel of the fragment buffer.
The final channel is the depth channel. It is shown in figure 1.5. Inverse,
non-linear depth is used as it can be interpolated linearly across the surface
of every projected triangle. Higher intensities correspond to fragments located
closer to the observer.
Figure 1.5: The depth channel of the fragment buffer.
2 Texture Filtering
The second set of rendered images demonstrates the texture filtering techniques, starting with the filters suitable for both texture magnification and minification. Figure 2.1 shows a surface textured using the image pyramid filter. This is equivalent to how textures are filtered in the Open Graphics Library [9].
Figure 2.1: A surface textured using the image pyramid filter.
Figure 2.2 shows a surface textured using the extended anisotropic image pyramid filter. Please note how the far details become sharper and more distinct.
Figure 2.2: A surface textured using the anisotropic image pyramid filter.
Figure 2.3 shows the same surface textured using the final of the three texture filters suitable for both texture magnification and minification, the summed-area table. Similar to anisotropic image pyramids, the far details become sharper. Yet, there are slight differences.

The image produced using summed-area tables is slightly brighter at far distances in comparison to the previous two images. This is because the image pyramids were generated without gamma-correction. As such, the brightness will drift with every down-sampled texture level. This does not happen when using summed-area tables.
Figure 2.3: A surface textured using the summed-area table.
The different texture levels of the two image pyramids can be visualized using
the application. Figure 2.4 shows the distinct transitions between the texture
levels of the regular image pyramid.
Figure 2.4: The texture levels for the regular image pyramid.
Figure 2.5 shows the corresponding transitions for the anisotropic image pyramid. As the levels are selected from a two-dimensional structure of textures with different resolutions, the visualization uses two colors: red corresponds to the horizontal level and green to the vertical level.
Figure 2.5: The texture levels for the anisotropic image pyramid.
The third and final set of images demonstrates the texture filters suitable for texture magnification. Figure 2.6 shows the nearest neighboring texel filter. At close range, the individual texels become visible.
Figure 2.6: A surface textured using the nearest neighboring texel filter.
The bilinear interpolation filter is far better at magnifying textures as shown in figure 2.7. When implemented correctly, the two image pyramids and the summed-area table default to bilinear interpolation filters at close range, producing results equal to those shown in the figure.
Figure 2.7: A surface textured using the bilinear interpolation filter.
3 Benchmark Data

3.1 Rasterization
The raw rasterization performance is of great importance for the graphics
pipeline. As such, the rasterization scheme was evaluated through a series of
tests. For the tests, the large tiles were fixed at 64 by 64 pixels and the small
tiles at 4 by 4 pixels. A total of 5 000 frames were rendered with no textures
bound and no view-frustum clipping. For every frame, a single quadrilateral
covering the entire screen was rendered. Table 3.1 shows the resulting data.
Resolution (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
128x128           0.10         0.70         0.39         1.2
256x256           0.12         0.78         0.76         1.7
512x512           0.24         0.93         1.86         3.0
1024x1024         0.65         1.22         6.09         8.0
2048x2048         1.13         1.63         10.91        13.7

Table 3.1: Benchmark data for different resolutions.
3.2 Load Balancing
In order to test the load balancing of the tile-based rasterization scheme, two separate tests were conducted. The first test was designed to evaluate how the pipeline handles a fully covered screen with no overdraw. For this, a single quadrilateral composed of two triangles was rendered. These two triangles covered the entire screen and view frustum clipping was inactivated. A series of 5 000 frames with a resolution of 800 by 600 pixels was rendered in total.

Table 3.2 shows the average processing time per frame for the three rasterization steps and for a few different tile sizes. No textures were bound to the pixel rasterization kernel.
Large (px)   Small (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
64x64        4x4          0.38         1.05         3.17         4.6
64x64        8x8          0.38         0.34         5.27         6.0
64x64        16x16        0.38         0.15         9.64         10.2
128x128      8x8          0.17         0.80         5.17         6.1
128x128      16x16        0.17         0.26         7.65         8.1

Table 3.2: Benchmark data for different tile sizes with a few large triangles.
The second test was designed to evaluate how the pipeline handles the case where only a small region of the screen is covered by a large number of triangles and with overdraw. A series of 500 frames with a resolution of 800 by 600 pixels was rendered. No textures were bound when measuring the average execution time of the rasterization kernels, shown in table 3.3. The tank zombie model was rendered from a viewpoint where it covered 6.7 percent of the total screen pixels. All triangles were rendered in a single batch.
Large (px)   Small (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
64x64        4x4          1.65         9.55         27.20        38.4
64x64        8x8          1.65         4.45         171.52       177.6
64x64        16x16        1.65         3.38         640.06       645.1
128x128      8x8          1.53         4.16         111.97       117.6
128x128      16x16        1.53         2.94         658.32       662.8

Table 3.3: Benchmark data for different tile sizes with many small triangles.
3.3 Texture Mapping
The texture mapping sub-system is likely to be dependent on the level of cache
utilization. The Fermi architecture uses cache lines of 128 bytes, has an L2
cache of 768 kilobytes and a configurable L1 cache of 64 kilobytes [5]. A series
of tests were conducted to measure how these caches were utilized for the linear
memory layout of the textures.
For the tests, a single quadrilateral covering the entire screen was rendered. The quadrilateral was textured by an albedo texture with a resolution of 512 by 512 pixels. The screen resolution was set to the identical resolution and the image pyramid filter used. The texture was set to the repeat wrapping mode and the number of tiles was altered for each test.
In theory, performance should scale with the number of tiles as the texture resolution will decrease, providing better cache utilization. A total of 5 000 frames were rendered for each test and a separate test with no textures bound was conducted. With no textures, the average execution time of the pixel rasterization kernel was 1.88 ms. The execution times for the texturing sub-system were computed as the differences between the total execution times and this duration. This is shown in table 3.4.
Level   Pixel (ms)   Texturing (ms)
0       2.76         0.88
1       2.74         0.86
2       2.67         0.79
3       2.61         0.73
4       2.56         0.68
5       2.53         0.65
6       2.50         0.62
7       2.50         0.62
8       2.50         0.62
9       2.49         0.61

Table 3.4: Benchmark data for different texture levels.
Chapter G
Discussion
1 Memory Requirements
As a result of the tile-based rasterization scheme and the deferred design of
the graphics pipeline, memory requirements have to be carefully evaluated. In
addition, the triangles in the active rendering batch require a memory buffer
in device memory.
1.1 Triangles
The triangle record occupies a total of 428 bytes of device memory. Of this,
64 bytes are used to store the transformation matrix and 108 bytes to store
the three per-vertex tangent space bases. Other data stored in the record is
the three edge normals, different vertex attributes and bounding boxes for the
different levels of the hierarchical rasterization scheme.
The 428 byte memory footprint per triangle record is large but the triangle
buffer only needs to be allocated to contain a reasonably small number of
triangles. High occupancy levels can be reached using a batch size greater
or equal to a small multiple of the number of cores available on the device.
This provided that the pipeline is able to distribute the work load evenly. For
instance, the Fermi architecture contains at most 480 cores. Rendering all
triangles in batches of a few thousand triangles should be sufficient. With this
in mind, the pipeline was set to process 8 192 triangles in each batch. This
amounts to a total memory footprint of 3 506 176 bytes. Two such buffers are
required for the different stages in the pipeline.
The memory requirement for the triangle buffers is not extremely large, yet it can be limited through the introduction of a separate vertex buffer. A large portion of the triangle record stores per-vertex data. The ordinary valence of a vertex is six, meaning that each vertex is shared by approximately six triangles.
The introduction of a vertex buffer could decrease the size of the triangle record
significantly as well as increase the overall performance of some stages in the
pipeline. In addition, the pipeline could be restricted to only handle triangle
batches with equal transformation matrices. This would remove the need of
storing the transformation matrix per-triangle and through that decrease the
size of the record.
1.2 Tiles
During the rasterization process, triangle and tile overlap is stored using 2 bits
per triangle and tile as mentioned earlier. As such, the amount of required
data is dependent on the sizes of the two different tiles, on the batch size as
well as on the screen resolution. With knowledge of the screen resolution w
and h, the size of a large tile sl and the batch size sb , the memory requirement
for the large tiles ml can be computed using equation 1.1.
ml = ⌈w/sl⌉ · ⌈h/sl⌉ · (sb/32) · 8    (1.1)
Using this equation with a batch size of 8 192 for a few different resolutions and
large tile sizes yields the data in table 1.1. As seen in the table, the memory
requirements are fairly low, even for higher resolutions.
Resolution (px)   Large (px)   Required Memory (bytes)
800x600           64x64        266 240
800x600           128x128      71 680
1024x768          64x64        393 216
1024x768          128x128      98 304
1920x1080         64x64        1 044 480
1920x1080         128x128      276 480

Table 1.1: Memory requirements for the large tiles for a few different resolutions.
The size of a large tile is defined in the pipeline using the size of a small tile and a small to large tile ratio. This is to ensure that an even number of small tiles fit inside every large tile during the small tile rasterization pass. As such, the number of small tiles is directly related to the ratio r and by that also the memory requirement. The relationship between the memory required for the large tiles and the memory required for the small tiles is ml · r². For the pipeline to fully utilize the cores of the Fermi architecture, the squared tile ratio should be as high as possible and preferably an even multiple of 32. This is one reason why the pipeline performs better with 64 small tiles in each large tile. However, this implies a large memory requirement for the small tiles.
As the screen should be divided into as many large tiles as possible and in turn
the large tiles into as many small tiles as possible, the memory requirements
become a problem. Using equation 1.2, the total memory requirement of the
tiles can be computed.
m = ml · (1 + r²)    (1.2)
The amount of required memory grows to unacceptable numbers. Table 1.2
shows the total memory required for all tiles for a few different resolutions and
large tile sizes. For the computations, the small to large tile ratio r was set to
8.
Resolution (px)   Large (px)   Required Memory (bytes)
800x600           64x64        17 305 600
800x600           128x128      4 659 200
1024x768          64x64        25 559 040
1024x768          128x128      6 389 760
1920x1080         64x64        67 891 200
1920x1080         128x128      17 971 200

Table 1.2: Memory requirements for all tiles for a few different resolutions.
Unfortunately, these numbers hint that the rasterization stage should be redesigned, as its preferences regarding tile sizes and number of tiles are in direct conflict with the memory requirements, specifically the amount of memory required for the small tiles.
1.3 Fragments
The fragment buffer is the common culprit of deferred rendering pipelines. It
requires large amounts of allocated memory to store the per-pixel attributes
of every visible fragment. As such, its memory requirement is directly related
to the resolution of the target image. The only way of limiting the memory
footprint is to limit the number of components in each individual fragment
record.
Fragment buffers often store albedo (color), surface normal, depth, specular intensity and specular power. Commonly, most of these attributes are compressed into integer data types as the intervals of most attributes are known and limited. This allows compression with minimal quantization noise. During the lighting passes, the fragment record is decompressed and its attributes used in the illumination model.
A compressed fragment may be stored in a 96-bit record. For a resolution of
1920 by 1080 pixels, this amounts to 24 883 200 bytes. This is a large memory
requirement. Unfortunately, any further improvements other than removing
individual attributes from the fragment records are prevented by the deferred
design of the pipeline. In addition, memory alignment must be considered.
Removing a single attribute compressed into 8 bits will require an additional
three attributes to be removed. Adhering to this discussion, the use of the
96-bit fragment record in the pipeline is well-motivated.
2 Performance
When it comes to rasterization, the tile-based approach provides an efficient
way of partitioning the task. As seen in table 3.1 of the results chapter, the
pipeline performs well for high resolutions with few triangles and no overdraw.
This as the amount of large tiles and by that threads increase with resolution
when the large tile size is fixed, allowing high occupancy levels of the hardware.
If the total execution times from the table are divided by the number of large tiles at each resolution, an interesting pattern emerges. This is shown in figure 2.1.
Figure 2.1: Linear plot of average rasterization time per large tile.
Only a small number of large tiles fit within the smaller resolutions. As the resolution is increased, so is the total number of large tiles. At sufficient resolutions, the performance gain evens out as expected, since high occupancy levels have been reached for resolutions above 1024 by 1024 pixels. This hints that tile-based rendering has great potential, provided that the resolution is high enough to be partitioned into a large number of large tiles.
From the results in the previous chapter, uneven load balancing seems to be a
serious issue for the implemented pipeline. When the pipeline is fed with the
reasonably small test model consisting of 5 784 triangles, the only rasterization
kernel with acceptable execution times is the one used for rasterizing the large
tiles. The other two kernels do not handle the uneven work load well.
The behavior of the large tile rasterization kernel was expected as it operates
per-triangle and as such is not subject to an uneven work load. The other two
kernels operate per-tile, with possible uneven work loads. As no specific load
balancing measures were implemented for the two kernels, some problematic
behavior was to be expected. Yet, the data presented in the results chapter
showed the bottleneck to be narrower than expected. Clearly, a redesign of
that specific stage is needed in order to achieve good performance.
The texturing sub-system yielded results far better than expected. As the memory layout of the textures is linear per row and planar per channel, texturing was expected to hurt performance significantly due to poor cache coherency. As demonstrated in the results chapter, this was not the case.
Texturing 262 144 pixels could be performed in 610 to 880 microseconds using
the regular image pyramid filter. As shown in table 3.4 of the results chapter,
the performance increases with each texture level. However, overdraw was
not considered in order to isolate the cache performance. Typical scenes have
overdraw and are usually rendered at much higher resolutions. In addition,
each fragment is often generated using multiple textures.
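For reference, the sketch below spells out what a linear-per-row, planar-per-channel layout means for addressing; the function name and parameters are illustrative rather than taken from the pipeline's source. Neighboring channels of one texel lie a full w · h plane apart, which is why poor cache coherency was anticipated.

#include <stddef.h>

/* Index of channel c of texel (x, y) in a texture of width w and
 * height h, stored linearly per row and planarly per channel.
 * Illustrative; the pipeline's actual code may differ. */
static size_t texel_index(size_t x, size_t y, size_t c,
                          size_t w, size_t h)
{
    return c * w * h  /* skip to the plane of channel c       */
         + y * w      /* skip y full rows within that plane   */
         + x;         /* column within the row                */
}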
The measured fill rate of 297 890 909 to 429 744 262 textured pixels per second compares poorly to the fill rate of 42 000 000 000 specified for the Fermi architecture [5]. However, the number stated in the specification is the theoretical texture filtering performance, computed as one filtered look-up per clock cycle.
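The measured figures follow directly from the texturing times quoted above: 262 144 px / 880 µs ≈ 297 890 909 px/s and 262 144 px / 610 µs ≈ 429 744 262 px/s.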
The Open Computing Language provides functionality to store images in a
non-linear layout and to use the hardware texturing sub-system to filter the
images. This would most certainly increase the performance of the texturing
sub-system. However, the intention of this master's thesis was to explore whether the entire graphics pipeline could be implemented using solely the programmable units. As such, the image objects of the Open Computing Language were not used, except for interoperability with the window interface.
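For comparison, the sketch below outlines the OpenCL 1.2 route that was deliberately not taken: creating an image object and filtering it through the hardware texturing sub-system. It is a minimal illustration under assumed host variables (context, width, height, pixels); error handling is omitted and the kernel is hypothetical.

/* Host side: create an image object backed by RGBA8 pixel data. */
cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };
cl_image_desc desc = { 0 };
desc.image_type   = CL_MEM_OBJECT_TYPE_IMAGE2D;
desc.image_width  = width;
desc.image_height = height;
cl_int err;
cl_mem tex = clCreateImage(context,
                           CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                           &fmt, &desc, pixels, &err);

/* Kernel side (separate .cl source): sample through the
 * hardware texturing units with bilinear filtering. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_TRUE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_LINEAR;

__kernel void shade(__read_only image2d_t tex,
                    __global float4 *out,
                    int out_w, int out_h)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    /* Hypothetical mapping: one output pixel per work-item. */
    float2 uv = (float2)((x + 0.5f) / out_w, (y + 0.5f) / out_h);
    out[y * out_w + x] = read_imagef(tex, smp, uv);
}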
Chapter H
Conclusion
The experiences with the graphics pipeline and with the graphics processing unit have been incredibly interesting. Much was learned through the continuous literature study, and even more through implementing the different graphics algorithms. In doing so, I have observed a few important aspects of tile-based rendering and of texture filtering, notably the increase in visual quality when using more sophisticated texture filters such as the anisotropic image pyramid or the summed-area table. These two come with a four-fold increase in memory requirements, something which is considered unacceptable within the industry, as a multitude of textures often need to be simultaneously resident in device memory. A single texture should not be allowed to occupy four times the memory of the original texture.
Device memory is still a huge limitation for graphics processing units. The
amount of memory on a typical consumer-level device is often far less than the
main memory of the system. In addition, render data must be transferred to the device, and techniques requiring a dynamic amount of memory become troublesome. This issue remains unsolved in the Open Computing Language as of today. In many cases, device memory must be allocated for the worst-case scenario, which is a poor solution: much of the memory remains unused while preventing other data from residing in that region of device memory.
The tile-based rendering scheme proved to handle high resolutions elegantly, but the non-existent work load balancing turned out to be a narrower bottleneck than anticipated. As such, a redesign of the rasterization stage should be considered. This redesign must account for uneven work loads, possibly handling them in a way similar to the pipeline presented by Laine and Karras.
Many of the conducted experiments indicated that device occupancy is an important aspect. A great increase in performance was observed until the number of sub-tasks reached a few multiples of the total number of device cores. This is in accordance with the technical documents published by Nvidia and the Khronos Group, as well as with the paper by Laine and Karras [2].
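In practice, the host can approximate this guideline by querying the device: CL_DEVICE_MAX_COMPUTE_UNITS is the closest portable proxy for the number of device cores. The factor of four below is an illustrative stand-in for "a few multiples", not a figure taken from the experiments.

/* Size the number of sub-tasks as a few multiples of the
 * device's compute units; the factor 4 is illustrative. */
cl_uint compute_units;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
size_t subtasks = (size_t)compute_units * 4;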
It saddens me to conclude the work carried out within this master's thesis. Given the conducted experiments and the acquired knowledge about the hardware, many design choices would have been made differently were the pipeline to be designed today. Unfortunately, the development of this graphics pipeline stops here and now.
In conclusion, I wish to thank the authors of the different research papers
quoted throughout this report. They have inspired me and given me ideas of
my own to explore.
Bibliography
[1] Henri Gouraud. Continuous shading of curved surfaces. IEEE Transactions on Computers, 20:623–629, 1971.

[2] Samuli Laine and Tero Karras. High-performance software rasterization on GPUs. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG 2011, pages 79–88, 2011.

[3] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27:1–15, 2008.

[4] Juan Pineda. A parallel algorithm for polygon rasterization. SIGGRAPH Computer Graphics, 22:17–20, 1988.

[5] Nvidia. Nvidia's next generation CUDA compute architecture: Fermi. Whitepaper, 2009.

[6] The Khronos Group. The OpenCL specification (1.2). Specification, 2011.

[7] Ivan E. Sutherland and Gary W. Hodgman. Reentrant polygon clipping. Communications of the ACM, 17:32–42, 1974.

[8] Alvy Ray Smith. A pixel is not a little square, a pixel is not a little square, a pixel is not a little square! (And a voxel is not a little cube). Technical report, Microsoft Research, 1995.

[9] The Khronos Group. The OpenGL graphics system: A specification (4.3). Specification, 2012.

[10] Lance Williams. Pyramidal parametrics. SIGGRAPH Computer Graphics, 17:1–11, 1983.

[11] Franklin C. Crow. Summed-area tables for texture mapping. SIGGRAPH Computer Graphics, 18:207–212, 1984.

[12] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. Fast summed-area table generation and its applications. Computer Graphics Forum, 24:547–555, 2005.

[13] Paul S. Heckbert. Filtering by repeated integration. SIGGRAPH Computer Graphics, 20:315–321, 1986.

[14] Bui Tuong Phong. Illumination for computer generated pictures. Communications of the ACM, 18:311–317, 1975.