Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis (Examensarbete)

A Data-Parallel Graphics Pipeline Implemented in OpenCL
(En Data-Parallell Grafikpipeline Implementerad i OpenCL)

Master's thesis carried out in Information Coding at the Institute of Technology, Linköping University.

Author:       Joel Ek ([email protected])
Report no.:   LiTH-ISY-EX--12/4632--SE
Supervisor:   Harald Nautsch, ISY, Linköpings universitet
Examiner:     Ingemar Ragnemalm, ISY, Linköpings universitet
Division:     Division of Information Coding, Department of Electrical Engineering,
              Linköpings universitet, SE-581 83 Linköping, Sweden
Date:         November 27, 2012
Electronic version: http://urn.kb.se/resolve?urn:nbn:se:liu:diva-85679

Abstract

This report documents implementation details, results, benchmarks and technical discussions for the work carried out within a master's thesis at Linköping University. Within the master's thesis, the field of software rendering is explored in the age of parallel computing. Using the Open Computing Language, a complete graphics pipeline was implemented for use on general processing units from different vendors. The pipeline is tile-based, fully-configurable and provides means of rendering visually compelling images in real-time. Yet, further optimizations for parallel architectures are needed as uneven work loads drastically decrease the overall performance of the pipeline.

Keywords: computer graphics, GPGPU, OpenCL, graphics pipeline, rasterization, texture filtering, parallelization

Contents

A Introduction
    1 Report Structure

B Background
    1 Related Research
        1.1 Samuli Laine and Tero Karras
        1.2 Larry Seiler et al.
    2 The Open Computing Language

C Mathematical Foundation
    1 The Fundamental Shapes
        1.1 Triangle
        1.2 Quadrilateral
    2 Line Normals
    3 Geometric Series
    4 Orthonormalization
    5 Matrix Inversion
        5.1 Four Elements
        5.2 Nine Elements
        5.3 Sixteen Elements
    6 Transformation Matrices
        6.1 Identity
        6.2 Translation
        6.3 Scaling
        6.4 Basis
        6.5 Rotation
        6.6 Projection
    7 Line Segment-Plane Intersection
    8 View Frustum Clipping
        8.1 Clipping Planes
        8.2 Example
        8.3 Algorithm
    9 Barycentric Coordinates
        9.1 Cramer's Rule
        9.2 Ratios of Areas
    10 The Separating Axis Theorem
        10.1 Candidate Axes
        10.2 Collision Detection
    11 Tangent Space
    12 Perspective-Correct Interpolation
    13 Rounding Modes
        13.1 Flooring
        13.2 Ceiling
        13.3 Nearest

D Texture Filtering
    1 Short on Discrete Images
    2 Magnification and Minification
    3 Wrapping Modes
        3.1 Repeating
        3.2 Clamping
    4 Nearest Neighboring Texel
    5 Bilinear Interpolation
    6 Image Pyramids
        6.1 Pre-processing
        6.2 Sampling Method
    7 Anisotropic Image Pyramids
        7.1 Pre-processing
        7.2 Sampling Method
    8 Summed-Area Tables
        8.1 Construction
        8.2 Sampling Method
        8.3 Bilinear Interpolation
        8.4 Clamping Textures
        8.5 Repeating Textures
        8.6 Higher-Order Filters

E Pipeline Overview
    1 Design Considerations
        1.1 Deferred Rendering
        1.2 Rasterization
        1.3 Batch Rendering
    2 Vertex Transformation
    3 View Frustum Clipping
    4 Triangle Projection
    5 Rasterization
        5.1 Large Tiles
        5.2 Small Tiles
        5.3 Pixels
    6 Differentiation of Texture Coordinates
    7 Fragment Illumination

F Results
    1 Rendered Images
    2 Texture Filtering
    3 Benchmark Data
        3.1 Rasterization
        3.2 Load Balancing
        3.3 Texture Mapping

G Discussion
    1 Memory Requirements
        1.1 Triangles
        1.2 Tiles
        1.3 Fragments
    2 Performance

H Conclusion

Chapter A

Introduction

With the introduction of consumer-level graphics processing units during the late 1990s, three-dimensional computer graphics has become increasingly advanced and visually compelling. Today, we are on the verge of generating photo-realistic synthetic images in real-time. These synthetic images can sometimes be hard to distinguish from real photography, all thanks to the research in computer graphics and to the technical evolution of the graphics processing unit.

Modern graphics processing units are extremely powerful for parallel computing tasks such as rendering images. To perform such embarrassingly parallel tasks in serial on the central processing unit is considered a thing of the past. The visual quality expected in modern computer graphics imposes a performance requirement too large to be met by a serial architecture.

As there are multiple vendors of graphics processing units, the hardware differs too much between vendors and models to be programmed directly. Instead, modern three-dimensional computer graphics makes heavy use of application program interfaces which serve as abstraction layers over the hardware. Two well-used interfaces are the Open Graphics Library from the Khronos Group and Direct3D from Microsoft. Using these interfaces, the application programmer can set up the rendering procedure in a desired way, supply rendering data and instruct the hardware to perform the rendering, all without any knowledge about the actual graphics hardware present to the end user of the application.
Graphics hardware was once designed to render computer graphics in an almost unmodifiable fashion. The possibilities of specifying how a certain task should be performed were limited to a fixed set of pre-selected choices. For instance, lighting could previously only be computed at the geometry level using either flat or Gouraud shading as introduced in 1971 by Gouraud in [1]. Neither were there any possibilities of multi-pass rendering. The need for programmability was evident.

Graphics processing units introduced some programmability during the early 2000s. Previously fixed functionality was opened up to the application programmers through the introduction of small programs modifying the behavior of a certain stage in the graphics pipeline. These small programs became known as shaders since they were originally intended to modify the shading behavior in the graphics pipeline. The shaders opened up a whole range of new rendering techniques, not limited to actual shading. As inventive application programmers discovered what was possible, they also discovered what the limitations were.

During the mid-2000s, the concept of shaders evolved as vendors of graphics processing units introduced the unified shader model. Instead of having separate hardware sub-systems executing the different shader types, a unified array of processing units was introduced. These were capable of executing all shader types supported by the graphics hardware. The unified model allowed application programmers to utilize the hardware as a massively parallel processing system. As such, the hardware was no longer limited to rendering images; embarrassingly parallel tasks of all kinds could be accelerated by the new architecture.

The hardware vendors and interface developers realized the possibilities of the new architecture and released application program interfaces for the new hardware. Notably, Nvidia released the Compute Unified Device Architecture in 2007, the Khronos Group the Open Computing Language in 2008 and Microsoft DirectCompute in 2009. The Compute Unified Device Architecture runs only on Nvidia hardware and DirectCompute only through Direct3D on Microsoft platforms, whereas the Open Computing Language runs on many different systems and on hardware from different vendors. These interfaces were intended for general-purpose heterogeneous computing on consumer-level graphics processing units. Through these interfaces, the units are now used for a wide array of massively parallel tasks.

An interesting set of questions arises. Does the new tool set reopen the possibility of explicitly programming the entire graphics pipeline, as was done on the central processing unit in the mid-1990s? If so, what are the benefits and drawbacks of a fully-configurable pipeline and how does its performance compare to the industry-standard interfaces intended for computer graphics?

1 Report Structure

This master's thesis will explore the structure of the graphics pipeline, evaluate the related techniques and discuss implementation details using the Open Computing Language. The following chapters of this report are structured as shown in the list below.

• Background
• Mathematical Foundation
• Texture Filtering
• Pipeline Overview
• Results
• Discussion
• Conclusion

The background chapter will discuss related research and give an introduction to the Open Computing Language. The following mathematical foundation will explore techniques and theorems used in computer graphics from a mathematical standpoint.
Texture filtering is a large subject in computer graphics and as such discussed in a separate chapter. In the chapter, details on the common texture filters are provided and as well as details on a set of non-standard filters such as the summed-area table. The pipeline overview provides the actual details on the implemented graphics pipeline which uses the Open Computing Language. All major stages are explained using the mathematical foundation. In the results chapter, the produced results are presented through rendered images and tables of benchmark data. The results are discussed in the following discussion chapter. The final chapter concludes this master’s thesis and lists some drawbacks that were observed in the results chapter. It also suggests future research topics and how some stages of the pipeline could be implemented more efficiently. 3 Chapter B Background The graphics pipeline for rasterized graphics consists of several distinct stages. Each stage receives input data, processes the data and outputs it to the next stage in the pipeline. For a general pipeline, the stages are transformation, clipping, projection, culling, rasterization and shading followed by blending. The structure of the graphics pipeline varies between different application program interfaces and between different hardware and it is constantly evolving. Recently, a stage that allow tessellation of geometry data was introduced. This stage alters the linear layout of the traditional pipeline and is a complex stage, unfortunately out of scope for this master’s thesis. The set of stages previously listed is the bare minimum of stages in the graphics pipeline required in order to correctly render three-dimensional computer graphics. The task of implementing a pipeline in a data-parallel fashion is straight forward for some stages. Other stages pose a great challenge as the parallel execution may cause data contamination. This is true especially for the rasterization and blending stages. 1 1.1 Related Research Samuli Laine and Tero Karras This master’s thesis is heavily influenced by the research presented in 2011 by Laine and Karras in [2]. In their paper, they present a data-parallel rasterization scheme running on the Compute Unified Device Architecture for Nvidia devices. Their paper is heavily focused on the rasterization stage as it is the stage in which most design decisions affecting performance can be made. As such, rasterization is a close-kept industrial secret of the different hardware manufacturers. 4 Laine and Karras employ a batch rendering scheme to enable high occupancy of the device. The data to be rendered is transferred to device memory and is composed of a vertex list and an indexed triangle list. They state that their intentions were to explore the raw rasterization performance and as such, no texturing occurs and no texture data is transferred. In addition, the vertex transformation stage is also left out of their pipeline. This as it is trivial to implement in parallel and its performance theoretically equal to that of the hardware pipeline. Lighting is computed at each individual vertex and not included in the benchmarks they present. In the pipeline of Laine and Karras, the triangle list is processed in parallel. They employ a clipping pass which clips every triangle against the six planes of the view frustum, producing zero to seven sub-triangles. These are projected into an integer grid with sub-pixel resolution and potentially culled before the rasterization stage is entered. 
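As an aside, snapping projected vertices to such an integer grid is commonly done by scaling the floating-point screen coordinates by a power of two and rounding. The sketch below only illustrates the idea; the amount of sub-pixel precision and all names are assumptions made here and are not taken from the cited paper.

```c
#include <math.h>
#include <stdint.h>

/* Number of fractional bits in the sub-pixel grid; an assumption
 * made for this example, not a value from the cited paper.        */
#define SUBPIXEL_BITS 4

/* Snap a floating-point screen coordinate to the integer grid,
 * giving (1 << SUBPIXEL_BITS) sub-pixel steps per pixel.          */
static int32_t to_subpixel(float screen_coord)
{
    return (int32_t)lroundf(screen_coord * (float)(1 << SUBPIXEL_BITS));
}
```

Working on integer coordinates makes the subsequent coverage tests exact and independent of floating-point rounding.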
For rasterization, they employ a sort-middle hierarchical rasterization refinement scheme. The screen is divided into bins of 128 by 128 pixels and every bin into tiles of 8 by 8 pixels. In addition, they limit the screen to 16 by 16 bins which corresponds to a resolution of 2048 by 2048 pixels. This yields a maximum number of bins of 256 where each bin in turn corresponds to 256 tiles and each tile to 64 pixels. In the first rasterization pass of their pipeline, overlap between triangles and bins are computed in parallel. The ID of every triangle that overlaps a bin is appended to one of the local queues of that bin. This enables bin queues to be updated without synchronization between thread arrays and bins to be processed independently in the next pass, hence the sort-middle term. In the second rasterization pass, bin queues are processed and the coverage between triangles and tiles computed. Similar to the first pass, the ID of triangles overlapping a tile are stored in one of the local queues of that tile. This is followed by the actual pixel rasterization pass in which coverage is computed using a look-up table approach. Instead of computing a 64-bit coverage mask for each triangle and tile, the cases in a look-up table are determined for each of the three edges of the triangle. This can be represented using seven bits per edge according to Laine and Karras. In the shading pass, the actual coverage of the tile can be determined through bit-wise operations on the values of the look-up table. Laine and Karras observe that their pipeline performs roughly 2 to 8 times slower in comparison to the hardware pipeline. This for a number of different test scenes with depth testing enabled, perspectively-correct color interpolation and no multi-sampling. 5 1.2 Larry Seiler et al. The Larrabee was a proposed microarchitecture by Intel, designed for applications with sections of highly parallel tasks. The platform was based on the x86 instruction set and introduced a range of new vector instructions to enable parallel processing of those tasks. The new vector instructions were able to operate on a set of 32 new vector registers, each 512 bits wide. The width allowed each vector register to store eight 64-bit or sixteen 32-bit real-valued numbers or integers. The set of new instructions operated on those registers, effectively processing the stored values in parallel. To demonstrate the capabilities of the Larrabee platform, a tile-based rasterization scheme was presented in 2008 by Seiler et al. in [3]. The rasterization employed a sort-middle hierarchical rasterization refinement scheme, very similar to that later presented by Laine and Karras. As the new instructions were designed to handle at most 16 elements in parallel, Seiler et al. grouped 4 by 4 pixels in a tile and 4 by 4 tiles in a bin. In their pipeline, the rasterization was performed through evaluating the edge functions as detailed in 1988 by Pineda in [4]. The evaluated edge functions were compared to the trivial reject and accept values for each rasterization level. This was all done in parallel using the new instruction set. The Larrabee platform will never reach any end users as the project was discontinued in 2010. This is unfortunate as it would have been beneficial to render three-dimensional computer graphics using the native x86 platform. However, their paper opened up a discussion on the feasibility of software rendering in the age of parallel computing and inspired the work of Laine and Karras. 
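To make the sort-middle hierarchies described in this section more concrete, the sketch below computes which bin and which tile a pixel belongs to, using the 128-pixel bins and 8-pixel tiles of Laine and Karras as an example. The code is purely illustrative; it is not taken from either paper and all names are invented here.

```c
#include <stdint.h>

#define BIN_SIZE   128u                   /* pixels per bin side        */
#define TILE_SIZE  8u                     /* pixels per tile side       */
#define BINS_X     16u                    /* 16 * 128 = 2048 pixels     */
#define TILES_X    (BIN_SIZE / TILE_SIZE) /* 16 tiles per bin row       */

/* Flattened index of the bin containing pixel (px, py). */
static uint32_t bin_index(uint32_t px, uint32_t py)
{
    return (py / BIN_SIZE) * BINS_X + (px / BIN_SIZE);
}

/* Flattened index of the tile within its bin (0..255). */
static uint32_t tile_index_in_bin(uint32_t px, uint32_t py)
{
    uint32_t tx = (px % BIN_SIZE) / TILE_SIZE;
    uint32_t ty = (py % BIN_SIZE) / TILE_SIZE;
    return ty * TILES_X + tx;
}
```

Since all sizes are powers of two, the divisions and modulo operations reduce to bit shifts and masks.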
2 The Open Computing Language The Open Computing Language is an application program interface designed to abstract the hardware of general purpose computing units. As previously mentioned, the Khronos Group designed the specification to run on multiple platforms and hardware from different vendors. This makes it an ideal tool for implementing a hardware-accelerated graphics pipeline. The interface allows small programs known as kernels to be executed in parallel on an Open Computing Language-enabled device. These devices can actually be of any architecture, including multi-core central processing units. However, the best performance is achieved on graphics processing units, provided that the task is highly parallel in nature. 6 Kernels are written using a subset of the C99 programming language and may be compiled in advance if the target device is known or at run-time to support the device available to the end user. The programming language does not support memory allocation nor function pointers. In addition, pointer arithmetic is allowed but could potentially break the code if memory alignment is not considered. The kernels are executed in parallel as threads on the target device. How the threads are distributed within the device is determined by the implementer of the specification and as such will depend on the type of device. For instance, the Nvidia Fermi architecture consists of 1 to 15 streaming multi-processors, each with 32 cores. The threads are grouped into cooperative thread arrays and in turn into groups of 32 threads. Such a group is known as a warp and two warps may be issued and executed at each streaming multi-processor simultaneously. For each clock cycle, 16 cores receive an instruction from the first warp and the other 16 from the second warp [5]. For the Open Computing Language, a thread is known as a work item and group of threads as a work group. When executing a kernel on Nvidia hardware, work items translate to threads and work groups to cooperative thread arrays. As such, it is desirable to partition the work items into groups of n · 32 for the Fermi architecture. This is somewhat problematic in the Open Computing Language as the total number of work items must be a multiple of the work group size. However, it can be achieved through issuing a greater number of work items than required for the task and ensuring that kernels return instantly for the extra work items. Since the total number of work items often is large, the issuing of additional work items should not have a significant impact on the overall performance. The Open Computing Language specification defines four different memory address spaces. These are private memory for each work item, local memory that is shared between the work items of a work group, global memory that is shared between all work groups and constant memory which is a read-only variant of the global memory. The specification requires that the amount of local memory is at least 16 kilobytes but does not specify that the local memory needs to be located on-chip [6]. On the Fermi architecture, the local memory is located in the L1 cache of each streaming multi-processor and can be configured to either 16 or 48 kilobytes of the total 64 kilobytes. This makes the local memory extremely fast on the Fermi architecture. As such, kernels should store data shared between the work items of a work group in local memory whenever possible. The graphics pipeline implemented was designed with Fermi as the target architecture. 
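A minimal sketch of the padding scheme described above is given below: the host rounds the global work size up to the nearest multiple of the work-group size, here ((element_count + 31) / 32) * 32, and the kernel returns immediately for the extra work items. The kernel itself is a placeholder invented for this example and is not part of the implemented pipeline.

```c
/* OpenCL C kernel sketch: the global size has been padded on the host
 * to a multiple of 32, so out-of-range work items simply return.      */
__kernel void process(__global const float *input,
                      __global float       *output,
                      const uint            element_count)
{
    uint id = get_global_id(0);
    if (id >= element_count)
        return;                     /* padding work item, nothing to do */
    output[id] = 2.0f * input[id];  /* placeholder per-element work     */
}
```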
As such, all kernels are executed in work groups of 32 work items and local memory is utilized whenever possible. 7 The data transfers and the execution of kernels in the Open Computing Language are handled by a device queue. The queue employs the just in time execution model and queued commands are only guaranteed to have completed if the queue is explicitly instructed to complete all commands. As the data to be transferred often depends on the results of previously executed kernels, data transfers can be enqueued as blocking. This ensures that the application program interface does not return control to the application until the commands have completed. The Open Computing Language is able to inter-operate with application program interfaces designed for computer graphics such as the Open Graphics Library or Direct3D. This allows generic data to be processed by kernels and displayed using a graphics interface. In order to utilize this functionality, the context of the Open Computing Language must be shared with that of the graphics interface. This is performed in the initialization stage in which the context of the graphics interface is created, often through additional interfaces for context and window handling. The context of the Open Computing Language is then instructed to use the same context as the graphics interface. Resources can then be acquired and released using the command queue. One notable inter-operation feature is the possibility of sharing an image object between the two interfaces. This allows an image to be generated using the Open Computing Language and displayed using the graphics interface. It also allows the image to be rendered using the graphics interface, processed by kernels of the Open Computing Language and displayed by the graphics interface. For these examples, no image data needs to be transferred between the host and the device, making inter-operation features extremely useful. 8 Chapter C Mathematical Foundation In order to fully comprehend the concepts discussed throughout the rest of this report, a mathematical foundation is needed. This chapter assumes that the reader is familiar with linear algebra and specifically the notation of vectors and matrices. 1 The Fundamental Shapes Two shapes commonly used in three-dimensional computer graphics are triangles and quadrilaterals. 1.1 Triangle The fundamental shape of three-dimensional computer graphics is the triangle. It is formed from three vertices and its edges span a plane if they are linearly independent. Should two edges be linearly dependent (parallel), the triangle ~ , V~ and W ~ will collapse and have zero area. Figure 1.1 shows three vertices U and how they form a proper triangle. 9 V U W Figure 1.1: A triangle is formed from three vertices. Equation 1.1 shows how the area of a two-dimensional triangle At can be computed using the cross-product between two of the edges. Depending on the choice of edges, the resulting area may be negative. However, the cross~ to V~ and U ~ to W ~ is guaranteed to be positive if the product between edges U vertices are specified in clock-wise order for a left-handed coordinate system or in counter clock-wise order for a right-handed coordinate system. A triangle with negative area is back-facing. At = 1.2 (Vx − Ux ) · (Wy − Uy ) − (Wx − Ux ) · (Vy − Uy ) 2 (1.1) Quadrilateral Quadrilaterals are formed from four vertices and are slightly more complex than their three-vertex counterparts. 
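Before continuing with quadrilaterals, it is worth noting that the signed-area expression of equation 1.1 translates directly into a few lines of code. The sketch below is illustrative only; the type and function names are not taken from the thesis implementation.

```c
/* Two-dimensional point/vertex; an illustrative type for this sketch. */
typedef struct { float x, y; } vec2;

/* Signed area of the triangle (u, v, w) according to equation 1.1.
 * A negative result indicates a back-facing triangle.                 */
static float triangle_signed_area(vec2 u, vec2 v, vec2 w)
{
    return 0.5f * ((v.x - u.x) * (w.y - u.y) -
                   (w.x - u.x) * (v.y - u.y));
}
```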
In contrast to triangles, the vertices of a quadrilateral are not guaranteed to coincide with a single plane and cannot be arranged into arbitrary shapes. This makes their use in defining threedimensional geometry limited. Yet there are other areas of computer graphics in which proper quadrilaterals are formed, making them useful. Figure 1.2 ~ L, ~ M ~ and N ~ and how they form a proper quadrilateral. shows four vertices K, 10 K L N M Figure 1.2: A quadrilateral is formed from four vertices. Equation 1.2 shows how the area of a two-dimensional quadrilateral Aq can be computed using the cross-product between two diagonals. Similar to triangles, the sign of the area depends on the choice of diagonals. In the equation, the ~ to M ~ and L ~ to N ~ is computed. This choice cross-product between diagonals K of diagonals guarantees that the resulting area is positive if the quadrilateral is front-facing with respect to the order of the vertices and to the reference coordinate system. Aq = 2 (Mx − Kx ) · (Ny − Ly ) − (Nx − Lx ) · (My − Ky ) 2 (1.2) Line Normals A two-dimensional line separates space into two sub-spaces. These two can be referred to as the positive and negative sub-spaces and are defined by selecting a line normal. For two-dimensional lines and line segments, there are two ~ possible line normals of unit length. Figure 2.1 shows a line segment from S ~ and the corresponding line normals N ~1 and N ~2 . to E 11 E N2 N1 S Figure 2.1: The two possible unit normals of a two-dimensional line segment. Equation 2.1 shows the relationship between the coordinates of the starting ~ and the ending point E ~ and the unnormalized line normals N ~1 and point S ~2 . Selecting one of these will effectively determine which sub-space is positive N and which is negative. ~1 = (Ey − Sy , Sx − Ex ) N ~2 = (Sy − Ey , Ex − Sx ) N 3 (2.1) Geometric Series A geometric series is an infinite sum where each term is the previous term multiplied by a constant factor q. These series are formed from recursive patterns and can be used to analyze whether a sum will grow indefinitely or if there is an upper bound at which the series converges. Equation 3.1 states that the series will converge if the absolute value of the constant factor q is less than one. It also implies that the sum of a finite number of terms is bounded 1 . by the expression 1−q lim N →∞ 4 N X i=0 ( i q = 1 1−q ∞ if |q| < 1 otherwise (3.1) Orthonormalization In order to span the entire three-dimensional space, a set of three linearly independent basis vectors is needed. This set does not need to be orthogonal, 12 though it is a convenient property in three-dimensional computer graphics. Even more convenient is to ensure that the set is both orthogonal and that each basis vector is of unit length. This is known as an orthonormal set of basis vectors. To force any given set of linearly independent basis vectors to be orthonormal, the Gram-Schmidt process can be used. The process operates on each basis vector in sequence and may start at any vector. Equation 4.1 shows how the ~ 0 is selected from the set and normalized into X ~ 00 . first basis vector X ~0 = X ~ X ~0 ~ 00 = X X ~ 0| |X (4.1) Equation 4.2 shows how the second basis vector Y~ 0 is found by subtracting the ~ 00 from itself. The vector is then normalized into Y~ 00 and projection of Y~ on X 00 ~ and Y~ 00 now form an orthonormal pair of basis vectors. the two vectors X ~ 00 ) · X ~ 00 Y~ 0 = Y~ − (Y~ • X Y~ 0 Y~ 00 = |Y~ 0 | (4.2) ~ 0. 
Equation 4.3 shows how the process continues with the final basis vector Z ~ 00 and Y~ 00 is found by subtracting the The basis vector orthogonal to both X 00 00 ~ on X ~ and on Y~ from itself. Finally, the vector is normalized projection of Z 00 ~ and Z is obtained. ~0 = Z ~ − (Z ~ •X ~ 00 ) · X ~ 00 − (Z ~ • Y~ 00 ) · Y~ 00 Z ~0 ~ 00 = Z Z ~ 0| |Z (4.3) ~ 00 , Y~ 00 and Z ~ 00 is ortogonal and After the process, the set of basis vectors X each basis vector is of unit length. This process is useful if the set is initially orthonormal but subject to multiple transformations as small errors are introduced with every transformation in a computational context. Figure 4.1 illustrates the orthonormalization process. 13 Y’ Y Y’’ X’’ (Y•X’’)·X’’ X/X’ Figure 4.1: The Gram-Schmidt process produces a set of orthonormal basis vectors. 5 Matrix Inversion Inverting small square matrices is a common task in three-dimensional computer graphics and should therefore be detailed here. Equation 5.1 shows the general formula of the adjoint method. The method is often performed in three steps and produces an analytical inverse to a square matrix of any size. However, the complexity of the method grows fast, making it unsuitable for larger matrices. For such matrices, numerical inversion methods exist. M −1 = cof(M )T adj(M ) = if det(M ) 6= 0 det(M ) det(M ) (5.1) Firstly, the minor matrix of M is computed. The minor at index (i, j) is defined as the determinant of the matrix formed by excluding row i and column j from M . Secondly, the cofactor matrix is computed. The cofactor at index (i, j) is defined as the minor at index (i, j) multiplied by (−1)i+j . Finally, the cofactor matrix is transposed and the adjoint matrix created. Dividing every element of the adjoint matrix by the determinant of the original matrix results in the inverse matrix M −1 . This provided that the determinant is not equal to zero. 5.1 Four Elements Equation 5.2 shows a square matrix Mf and its four elements a, b, c and d. a b Mf = c d 14 (5.2) Equation 5.3 shows how the determinant of Mf is defined. det(Mf ) = a · d − b · c (5.3) Equation 5.4 shows the cofactor matrix of Mf . For four-element matrices, it seems as if the cofactor matrix is formed through reorganizing the elements of Mf . However, every element at index (i, j) is actually the minor at index (i, j) with the corresponding sign adjustment. +d −c cof(Mf ) = −b +a (5.4) Transposing the cofactor matrix and dividing every element by the determinant of Mf gives the inverse matrix Mf−1 . This is seen in equation 5.5. Mf−1 5.2 1 +d −b if det(Mf ) 6= 0 = det(Mf ) −c +a (5.5) Nine Elements Equation 5.6 shows a square matrix Mn and its nine elements a, b, c, d, e, f , g, h and i. a b c Mn = d e f g h i (5.6) Equation 5.7 shows how the determinant of Mn is defined. det(Mn ) = a · (e · i − f · h) − b · (d · i − f · g) + c · (d · h − e · g) (5.7) The cofactor matrix of Mn is formed by computing the minors of Mn and adjusting the sign as previously stated. In contrast to Mf , the complexity is higher as every minor is computed as the determinant of a square four-element matrix. This is seen in equation 5.8. 15 +(e · i − f · h) −(d · i − f · g) +(d · h − e · g) cof(Mn ) = −(b · i − c · h) +(a · i − c · g) −(a · h − b · g) +(b · f − c · e) −(a · f − c · d) +(a · e − b · d) e·i−f ·h f ·g−d·i d·h−e·g = c · h − b · i a · i − c · g b · g − a · h b·f −c·e c·d−a·f a·e−b·d (5.8) Transposing the cofactor matrix and dividing every element by the determinant of Mn gives the inverse matrix Mn−1 . 
This is seen in equation 5.9. Mn−1 5.3 e·i−f ·h c·h−b·i b·f −c·e 1 f · g − d · i a · i − c · g c · d − a · f if det(Mn ) 6= 0 = det(Mn ) d·h−e·g b·g−a·h a·e−b·d (5.9) Sixteen Elements As the method is valid for any square matrix, it can be used to invert sixteenelement matrices. However, the complexity is significantly higher when compared to nine-element matrices, making the method unsuitable for real-time purposes. Yet, sixteen-element matrices are the most commonly used matrices in three-dimensional computer graphics and the need for clever inversion methods is evident. 6 Transformation Matrices Three-dimensional computer graphics uses three-dimensional vectors to define various geometric properties. Such properties are, amongst others, the topology of a surface, the location of a light source and the orientation of a camera or an observer. In order to fully model a transformation of a three-dimensional vector, the vector is often stored and handled in its four-dimensional homogeneous form. This embeds the three-dimensional vector in a four-dimensional super-space for which the fourth component is almost always equal to one. Equation 6.1 shows a homogeneous four-component vector P~ . P~ = (Px , Py , Pz , 1) 16 (6.1) The homogeneous form enables affine transformations of the vector using a square sixteen-component transformation matrix. Equation 6.2 shows an affine transformation matrix A. For most affine transformations, the bottom row will be equal to (0, 0, 0, 1). Xx Xy A= Xz Xw Yx Yy Yz Yw Zx Zy Zz Zw Wx Wy Wz Ww (6.2) The vector P~ is defined in its local coordinate system. If this coordinate system is embedded in a global coordinate system, the matrix A represents a transformation from the local coordinate system into the global one. The three ~ Y~ and Z ~ represent the basis vectors of the local coordinate system vectors X, ~ expressed in the global coordinate system. Conversely, the fourth vector W can be viewed as the translation vector which represents the offset of the local coordinate system with respect to the global one. Figure 6.1 shows this relationship. The vector P~ is defined in its local coordi~ Y~ and Z. ~ Applying matrix A to the vector nate system with base vectors X, ~ 0 , Y~ 0 and Z ~ 0 . To transform in the P~ results in the vector being expressed in X opposite direction, the inverse of A is applied to the vector P~ expressed in the global coordinate system. Y Y’ P X W X’ Figure 6.1: A transformation matrix transforms vectors between coordinate systems. A vector may be transformed between multiple coordinate systems and by different transformation matrices before reaching the coordinate system of a camera or an observer. This forms a chain of transformations. This transformation chain is often premultiplied so that the vector only needs to be transformed once. This is shown in equation 6.3. 17 P~ 0 = An · An−1 · . . . · A2 · A1 · P~ = (An · An−1 · . . . · A2 · A1 ) · P~ 6.1 (6.3) Identity The identity matrix is a transformation matrix which when applied to a vector does not alter the vector in any way. Its only function in the transformation chain is to initialize it so that subsequent matrices in the chain can be premultiplied. Equation 6.4 shows an identity matrix I and its inverse I −1 . As seen in the equation, the matrix and its inverse are identical. I = I −1 6.2 1 0 = 0 0 0 1 0 0 0 0 1 0 0 0 0 1 (6.4) Translation The translation matrix is a transformation matrix that when applied to a vector adds another vector to it. 
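A brief sketch of how any such 4×4 matrix, the translation matrix included, is applied to a homogeneous vector may be helpful here; this is the operation implied by equation 6.3 and by the matrices that follow. The row-major storage and all names are assumptions made for the example, not the layout used in the implementation.

```c
/* Homogeneous vector and row-major 4x4 matrix; illustrative types. */
typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4];    } mat4;

/* Apply a transformation matrix to a homogeneous vector: v' = A * v. */
static vec4 mat4_transform(const mat4 *a, vec4 v)
{
    vec4 r;
    r.x = a->m[0][0]*v.x + a->m[0][1]*v.y + a->m[0][2]*v.z + a->m[0][3]*v.w;
    r.y = a->m[1][0]*v.x + a->m[1][1]*v.y + a->m[1][2]*v.z + a->m[1][3]*v.w;
    r.z = a->m[2][0]*v.x + a->m[2][1]*v.y + a->m[2][2]*v.z + a->m[2][3]*v.w;
    r.w = a->m[3][0]*v.x + a->m[3][1]*v.y + a->m[3][2]*v.z + a->m[3][3]*v.w;
    return r;
}
```

With the translation matrix of equation 6.5 below and a point (Px, Py, Pz, 1), this evaluates to (Px + x, Py + y, Pz + z, 1), which is exactly the addition described above.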
When applied to a three-dimensional model, the model will be offset by the translation vector. Equation 6.5 shows a translation matrix T and how the three vector components x, y and z are embedded into the matrix. 1 0 T (x, y, z) = 0 0 0 1 0 0 x y z 1 0 0 1 0 (6.5) Inverting the translation matrix is done by translating in the opposite direction. Equation 6.6 shows the relationship between the translation matrix T and its inverse T −1 . T (x, y, z)−1 1 0 = T (−x, −y, −z) = 0 0 18 0 1 0 0 0 −x 0 −y 1 −z 0 1 (6.6) 6.3 Scaling Uniform Scaling The uniform scaling matrix is a transformation matrix that when applied to a vector scales all of its components equally. When applied to a threedimensional model, the model will shrink or grow depending on the magnitude of the scaling factor f . This provided that the model is located around the origin of the local coordinate system. Equation 6.7 shows a uniform scaling matrix Su and how the scaling factor f is embedded into the matrix. f 0 Su (f ) = 0 0 0 f 0 0 0 0 f 0 0 0 0 1 (6.7) Inverting the uniform scaling matrix is done by scaling with the inverse of f . Equation 6.8 shows the relationship between the uniform scaling matrix Su and its inverse Su−1 . As seen in the equation, the inverse does only exist if the scaling factor is not equal to zero. This as a scaling factor of zero would collapse the transformed vector into the null vector, eliminating the possibility of returning to the original vector. 1 0 0 1 0 f1 0 = = Su 0 0 1 f f 0 0 0 f Su (f )−1 0 0 if f 6= 0 0 1 (6.8) Non-Uniform Scaling The non-uniform scaling matrix is a transformation matrix that when applied to a vector scales all of its components differently. When applied to a threedimensional model, the model will become wider or smaller in each of the three directions. As for uniform scaling matrices, this requires that the model is located around the origin of the local coordinate system. Equation 6.9 shows a non-uniform scaling matrix S and how the scaling factors fx , fy and fz are embedded into the matrix. fx 0 0 0 fy 0 S(fx , fy , fz ) = 0 0 fz 0 0 0 19 0 0 0 1 (6.9) Inverting the non-uniform scaling matrix is done in the same fashion as with uniform scaling matrices. The inverse is created by scaling with the inverses of fx , fy and fz . Equation 6.10 shows this relationship. The inverse requires that each of the three scaling factors is not equal to zero as this would collapse one or several of the components of the vector. −1 S(fx , fy , fz ) 6.4 =S 1 1 1 , , fx fy fz 1 fx 0 = 0 0 0 1 fy 0 0 0 0 1 fz 0 0 0 if fx , fy , fz 6= 0 (6.10) 0 1 Basis The basis matrix is a transformation matrix that when applied to a vector transforms the vector from its local coordinate system to a global coordinate ~ system. The local coordinate system is defined by the three base vectors X, ~ expressed in the global coordinate system as previously detailed. Y~ and Z Equation 6.11 shows a basis matrix B and how the three base vectors are embedded into the matrix. Xx Yx Zx Xy Yy Z y ~ Y~ , Z) ~ = B(X, Xz Yz Z z 0 0 0 0 0 0 1 (6.11) Inverting the basis matrix can be done in a few different ways depending on the properties of the set of base vectors. The upper-left nine components can be inverted using the inversion method for nine-component matrices discussed earlier. This as the other elements of the matrix are equal to those of an identity matrix. ~ 0, Using this method will amount to finding a set of modified base vectors X 0 0 ~ and embedding this set in a transposed fashion. 
This can be seen Y~ and Z in equation 6.12. ~ Y~ , Z) ~ −1 B(X, 0 Xx Xy0 Xz0 Yx0 Yy0 Yz0 = Zx0 Zy0 Zz0 0 0 0 20 0 0 0 1 (6.12) When using this method, the inverse matrix B −1 will only exist if the determinant of B is not equal to zero. Computing the determinant from the three original base vectors is done as shown in equation 6.13. det(B) = Xx · (Yy · Zz − Zy · Yz ) − Yx · (Xy · Zz − Zy · Xz ) + Zx · (Xy · Yz − Yy · Xz ) (6.13) Equation 6.14 shows how the set of modified base vectors is defined from the original set of base vectors. The components of these vectors are embedded into the inverse matrix and the inversion is complete. Y · Z − Y · Z y z z y ~ ~ 1 ~0 = Y × Z = Yz · Zx − Yx · Zz if det(B) 6= 0 X det(B) det(B) Yx · Zy − Yy · Zx Zy · Xz − Zz · Xy ~ ~ 1 Z ×X Zz · Xx − Zx · Xz if det(B) 6= 0 = Y~ 0 = det(B) det(B) Zx · Xy − Zy · Xx Xy · Yz − Xz · Yy ~ ~ 1 ~0 = X × Y = Xz · Yx − Xx · Yz if det(B) 6= 0 Z det(B) det(B) Xx · Yy − Xy · Yx (6.14) If the set of base vectors is orthogonal and the bases not of unit lengths, the modified base vectors will correspond to the original ones divided by their square lengths. This assumes that no base vector is equal to the null vector, which would result in a determinant of zero. Equation 6.15 shows this process. ~0 = X Y~ 0 = ~ X ~ •X ~ X Y~ ~ •X ~ 6= 0 if orthogonal ∧ X if orthogonal ∧ Y~ • Y~ = 6 0 Y~ • Y~ ~ ~ 0 = Z if orthogonal ∧ Z ~ •Z ~ 6= 0 Z ~ •Z ~ Z (6.15) If the set of base vectors is orthonormal, the inverse matrix B −1 will always be equal to the transpose of B. If this is a know property of the set of base vectors, the inverse will always exist and full inversion is unnecessary. The inverse matrix can instead be created through embedding the original base vectors in a transposed fashion. This is shown in equation 6.16. 21 ~ Y~ , Z) ~ −1 B(X, 6.5 Xx Xy Xz ~ Y~ , Z) ~ T = Yx Yy Yz = B(X, Zx Zy Zz 0 0 0 0 0 if orthonormal 0 1 (6.16) Rotation Rotational matrices are transformation matrices that when applied to a vector rotates the vector around a rotational axis. They are closely related to basis matrices and can be used to rotate a three-dimensional model around an axis. About the X-Axis Rotating about the x-axis by an angle α is done using equation 6.17. In the ~ are formed using the two trigonometric equation, the base vectors Y~ and Z ~ is left unmodified as it is the functions sine and cosine. The base vector X axis about which the rotation is performed. 1 0 0 0 cos(α) − sin(α) Rx (α) = 0 sin(α) cos(α) 0 0 0 0 0 0 1 (6.17) The rotation can be undone by rotating with the negative angle as seen in equation 6.18. The inverse matrix Rx−1 is also equal to the transpose of Rx as the set of base vectors form an orthonormal basis. Rx (α)−1 1 0 0 0 0 cos(−α) − sin(−α) 0 = Rx (−α) = 0 sin(−α) cos(−α) 0 0 0 0 1 1 0 0 0 0 cos(α) sin(α) 0 T = 0 − sin(α) cos(α) 0 = Rx (α) 0 0 0 1 22 (6.18) About the Y-Axis Rotating about the y-axis by an angle α is done using equation 6.19. In the ~ and Z ~ are formed using the two trigonometric equation, the base vectors X functions sine and cosine. The base vector Y~ is left unmodified as it is the axis about which the rotation is performed. cos(α) 0 Ry (α) = − sin(α) 0 0 sin(α) 0 1 0 0 0 cos(α) 0 0 0 1 (6.19) The rotation can be undone by rotating with the negative angle as seen in equation 6.20. The inverse matrix Ry−1 is also equal to the transpose of Ry as the set of base vectors form an orthonormal basis. 
Ry (α)−1 cos(−α) 0 = Ry (−α) = − sin(−α) 0 cos(α) 0 0 1 = sin(α) 0 0 0 0 sin(−α) 0 1 0 0 0 cos(−α) 0 0 0 1 − sin(α) 0 0 0 = Ry (α)T cos(α) 0 0 1 (6.20) About the Z-Axis Rotating about the z-axis by an angle α is done using equation 6.21. In the ~ and Y~ are formed using the two trigonometric equation, the base vectors X ~ is left unmodified as it is the axis functions sine and cosine. The base vector Z about which the rotation is performed. cos(α) − sin(α) sin(α) cos(α) Rz (α) = 0 0 0 0 0 0 1 0 0 0 0 1 (6.21) The rotation can be undone by rotating with the negative angle as seen in equation 6.22. The inverse matrix Rz−1 is also equal to the transpose of Rz as the set of base vectors form an orthonormal basis. 23 Rz (α)−1 cos(−α) − sin(−α) sin(−α) cos(−α) = Rz (−α) = 0 0 0 0 cos(α) sin(α) 0 − sin(α) cos(α) 0 = 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 = Rz (α)T 0 1 (6.22) About an Arbitrary Axis In order to rotate about an arbitrary axis, an orthonormal set of base vectors that describes the change of basis needs to be found. These base vectors can be found analytically if the process is decomposed into several rotation matrices. Equation 6.23 shows this process and the five rotational matrices involved. R(x, y, z, α) = R1−1 · R2−1 · R(α) · R2 · R1 (6.23) The first step is to rotate the rotational axis into one of the three axis-aligned planes. This is done using the R1 matrix. The second step is to rotate the rotational axis in that plane so that it coincides with one of the two standard axes that span the plane. This is done using the R2 matrix. The rotation is then performed as a rotation about that standard axis by an angle α. This is done using the R matrix. The final two steps involves inverting the first two rotations and applying them in inverse order so that the original coordinate system is restored. The following example uses the x/z-plane as its axis-aligned plane and performs the rotation about the z-axis. Given a rotational axis with components x, y and z, two additional attributes k and l are computed. This is seen in equation 6.24 where k is the length of the rotational axis projected in the x/y-plane and l the full length of the vector. p x2 + y 2 p l = x2 + y 2 + z 2 k= (6.24) The projection of the rotational axis in the x/y-plane is needed as the rotation into the x/z-plane is performed about the z-axis in this example. Equation 6.25 shows this rotation where R1 forms an orthonormal basis. 24 x k − y k R1 = 0 0 y k x k 0 0 0 1 0 0 0 0 if k 6= 0 0 1 (6.25) As the basis is orthonormal, the inverse of R1 is equal to its transpose. This is shown in equation 6.26. − ky 0 x 0 k = R1T = 0 0 1 0 0 0 x R1−1 k y k 0 0 if k 6= 0 0 1 (6.26) Rotating the rotational axis, now in the x/z-plane, into the z-axis is performed as rotation about the y-axis. This is shown in equation 6.27 in which R2 also forms an orthonormal basis. 0 − kl l 0 1 0 R2 = k 0 z l l 0 0 0 z 0 0 if l 6= 0 0 1 (6.27) The orthonormal basis provides a simple way of inverting the matrix R2 . Equation 6.28 shows how the inverse is equal to its transpose. z l 0 R2−1 = R2T = − k l 0 0 kl 0 1 0 0 if l 6= 0 0 zl 0 0 0 1 (6.28) Premultiplying the two matrices R2 and R1 gives the result shown in equation 6.29. x·z R2 · R1 = k·l − y xk l 0 y·z k·l x k y l 0 25 − kl 0 0 0 if k, l 6= 0 z 0 l 0 1 (6.29) As the inverses of R1 and R2 are equal to their respective transposes, premultiplying the two matrices R1−1 and R2−1 can be done by transposing the product of R2 and R1 . Equation 6.30 shows this procedure and the resulting matrix. 
x·z k·l R1−1 · R2−1 = R1T · R2T y·z k·l = (R2 · R1 ) = − k l 0 T − ky x k 0 0 x l y l z l 0 0 if k, l 6= 0 0 0 1 (6.30) Multiplying the entire chain of rotations with R as a rotation about the z-axis ~ Y~ and Z. ~ Equation 6.31 shows by an angle of α gives a set of base vectors X, how these vectors are embedded into the rotational matrix. Xx Yx Zx Xy Yy Zy R(x, y, z, α) = Xz Yz Zz 0 0 0 0 0 if l = 1 0 1 (6.31) In order to simplify the expressions of the base vectors, the rotational axis is assumed to be of unit length. Additionally, the attributes k and l were required to be non-zero for the individual rotational matrices. However, simplifying the entire transformation chain removes these criteria from the final matrix. Equation 6.32 shows the simplified expressions of the base vectors. Xx Xy Xz Yx Yy Yz Zx Zy Zz = x2 + (1 − x2 ) · cos(α) = x · y · (1 − cos(α)) + z · sin(α) = x · z · (1 − cos(α)) − y · sin(α) = x · y · (1 − cos(α)) − z · sin(α) = y 2 + (1 − y 2 ) · cos(α) = y · z · (1 − cos(α)) + x · sin(α) = x · z · (1 − cos(α)) + y · sin(α) = y · z · (1 − cos(α)) − x · sin(α) = z 2 + (1 − z 2 ) · cos(α) (6.32) Inverting the compound matrix can be done in one of two ways. Either through a rotation by the negative angle or by transposing the compound matrix. This is shown in equation 6.33. 26 R(x, y, z, α)−1 6.6 Xx X y Xz Yx Yy Yz = R(x, y, z, −α) = R(x, y, z, α)T = Zx Zy Zz 0 0 0 0 0 if l = 1 0 1 (6.33) Projection Adjusting the perspective for a three-dimensional vector when viewed by a camera or an observer is one of the fundamental processes in three-dimensional computer graphics. Objects located near the observer appear bigger than identical objects located further away. The projection process is often modeled using matrices, despite the inability of performing this process solely by applying a matrix to a vector. The projection operator is dependent on the actual components of the vector it is to project, making it impossible to model using matrices. However, projection can be achieved by introducing a concept known as the homogeneous divide. Equation 6.34 shows a simple projection matrix. 1 0 P = 0 0 0 1 0 0 0 0 1 1 0 0 0 0 (6.34) Applying this matrix to a four-dimensional homogeneous vector (x, y, z, 1) results in the vector (x, y, z, z). As seen, its fourth component is now possibly different from one. By applying the dividing operator which divides each component by the fourth component, the vector becomes ( xz , yz , 1, 1). It is now homogenized with respect to the fourth component and perspectively projected into the axis-aligned plane at z = 1. It should be emphasized that the projection matrix is irreversible as its determinant is equal to zero. Perspective projection is better viewed as a geometric operation. Figure 6.2 ~ Y~ , Z}. ~ Deriving an expression shows a vector P~ and the orthonormal basis {X, for projecting the vector into the axis-aligned plane at z = 1 can be done using the properties of similar triangles. 27 Px P Pz-1 Z Px’ X Figure 6.2: Perspective projection can be derived using similar triangles. Equation 6.35 shows the relationship between the vector P~ and its perspectively projected counterpart P~ 0 . As seen in the equation, the result is equal to that produced when using matrices and the homogeneous division operator. Px Px Px0 = ⇔ Px0 = 1 Pz − 1 + 1 Pz Py0 Py Py = ⇔ Py0 = 1 Pz − 1 + 1 Pz Pz0 Pz = ⇔ Pz0 = 1 1 Pz − 1 + 1 7 (6.35) Line Segment-Plane Intersection There are three possible outcomes for the intersection between a line and a plane. 
The line may be embedded into the plane, producing infinitely many intersection points. It may be located a distance from the plane but its direction perpendicular to the normal of the plane. For this case, no intersection points exist. For all other cases, there is exactly one intersection point. Line segments require that the two points forming the line are located on opposite sides of the plane in order to produce exactly one intersection point. This section is dedicated to locating that unique intersection point, should it exist. ~ and the minimum distance to A plane is defined through a plane normal N that plane d. Equation 7.1 shows how the signed distance between a plane P ~ can be computed as a scalar projection between the normal and a vector X and the vector and by subtracting the minimum distance. Naturally, the plane is located where the distance is zero. 28 ~ =N ~ •X ~ −d dist(P, X) (7.1) ~ and an ending point E. ~ Line segments are composed of a starting point S Determining whether the line segment intersects a plane is done by evaluating the signed distances to the two points. If the two signed distances are of ~ For opposite signs, the line segment will intersect the plane at a point I. all other cases, no intersection point exists. Figure 7.1 shows a line segment intersecting a plane. dist(P, E) E P b I N a dist(P, E) dist(P, S) S Figure 7.1: A line segment intersecting a plane from the positive half-space. From the figure, a number of relationships can be observed through the use of similar triangles. Equation 7.2 shows these relationships. a ~ dist(P, S) = b ~ −dist(P, E) = a+b ~ − dist(P, E) ~ dist(P, S) (7.2) Finding the intersection point is essentially the process of scaling the direction ~ to E ~ by an amount proportional to the two signed distances and vector S adding it to the starting point. Equation 7.3 shows how the intersection point is found where t1 is a fraction of the length of the direction vector. ~ + t1 · (E ~ − S) ~ I~ = S (7.3) The fraction t1 is computed using the previously observed relationships. Equation 7.4 shows the definition of this fraction. 29 t1 = ~ a dist(P, S) = ~ − dist(P, E) ~ a+b dist(P, S) (7.4) The line segment may just as well start in the negative half-space of the plane and end in the positive half-space. Though the intersection point may be found using the previously defined equations in a mathematical context, it will depend on the orientation in a computational context. Due to rounding errors, the intersection point may be slightly different if the line segment starts in the negative half-plane compared to when it starts in the positive half-plane. In a computational context, the interpolation scheme should be defined in a consistent fashion. That is, the fractional t should always be defined as either the fraction of the directional vector from the inside point to the outside point or the other way around. Figure 7.2 shows a line segment intersecting a plane from the outside and in. dist(P, S) S P b I N a dist(P, S) dist(P, E) E Figure 7.2: A line segment intersecting a plane from the negative half-space. The relationships found using similar triangles are defined slightly different for this case. Equation 7.5 shows these relationships. a ~ dist(P, E) = b ~ −dist(P, S) = a+b ~ − dist(P, S) ~ dist(P, E) (7.5) As with the relationships, the interpolation scheme is slightly different. Equation 7.6 shows how interpolation is done when the line segment starts in the negative half-space. 
~ + t2 · (S ~ − E) ~ I~ = E 30 (7.6) ~ to S ~ is defined as shown in The fraction t2 of the negative direction vector E equation 7.7 using the previously observed relationships. t2 = 8 ~ a dist(P, E) = ~ − dist(P, S) ~ a+b dist(P, E) (7.7) View Frustum Clipping The view frustum or view volume is a sub-space of the virtual scene in which triangles are partially or completely visible when projected onto the screen. It is formed from six planes, defined by the orientation of a camera or an observer and a few additional parameters such as draw distance. Its shape is in the form of a truncated pyramid as seen from above in figure 8.1. Four of the six planes ~ l, N ~ r , N~n and and their corresponding normals are shown in the figure where N N~f are the normals of the left, right, near and far planes. Nf Nl Nr Nn Figure 8.1: The view frustum is in the shape of a truncated pyramid. View frustum clipping is the process of removing complete triangles or parts of triangles which will not contribute to the final image when projected onto the screen. This often in order to save computational power. 8.1 Clipping Planes By transforming the triangles into camera space before clipping, the definition of the six planes forming the view frustum will become simplified. Equation 8.1 shows the equations of the left, right, bottom, top, near and far planes, respectively. All equations are defined in camera space. 31 ~ =0 (αx , 0, 1) • X ~ =0 (−αx , 0, 1) • X ~ =0 (0, αy , 1) • X ~ =0 (0, −αy , 1) • X ~ = dn (0, 0, 1) • X ~ = −df (0, 0, −1) • X (8.1) ~ an In the equation, dn and df are the near and far draw distances and X arbitrary point. Equation 8.2 states a requirement on the relationship between the near and far draw distances. As seen in the equation, non-positive draw distances are not allowed and the far draw distance must be greater than the near draw distance. 0 < dn < df < ∞ (8.2) The two constants αx and αy are stretch factors which are determined from the aspect ratio of the target image. This in order to account for non-square displays by extending the field of view of the camera along the major axis. As such, the field of view of the camera becomes defined along the minor axis of the target image. Equation 8.3 shows how the horizontal stretch factor αx is determined from the width and height of the target image. ( αx = h w 1 if h < w otherwise (8.3) The vertical stretch factor αy is determined in a similar way as shown in equation 8.4. αy = ( 1 w h if h < w otherwise (8.4) The six plane equations are fully determined from the equations above and the clipping of triangles in camera space can be performed. 32 8.2 Example The clipping procedure is done in sequence for each of the six planes and operates on a vertex list as clipping a triangle against a plane may produce additional vertices. The vertex list is initialized with the three original vertices of a triangle and may store up to four additional vertices. In addition, a secondary vertex list is used to store the results of a clipping pass. Figure 8.2 shows a triangle which intersects the view frustum and its three initial vertices V~1 , V~2 and V~3 . V3 V2 V1 Figure 8.2: A triangle intersecting the view frustum. Clipping against the left plane produces an additional vertex and the vertex list now contains a total of four vertices. This is shown in figure 8.3. V4 V3 V2 V1 Figure 8.3: Clipping against the left plane produces an additional vertex. 
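The additional vertex in figure 8.3 is located with the line segment-plane intersection from section 7. A sketch of that computation (equations 7.1, 7.3 and 7.4), with illustrative types and names:

```c
/* Three-dimensional point/vector; an illustrative type for this sketch. */
typedef struct { float x, y, z; } vec3;

/* Signed distance from point p to the plane (n, d), equation 7.1. */
static float plane_distance(vec3 n, float d, vec3 p)
{
    return n.x * p.x + n.y * p.y + n.z * p.z - d;
}

/* Intersection of the segment from s to e with the plane, assuming the
 * two endpoints lie on opposite sides of it (equations 7.3 and 7.4).    */
static vec3 segment_plane_intersect(vec3 s, vec3 e, vec3 n, float d)
{
    float ds = plane_distance(n, d, s);
    float de = plane_distance(n, d, e);
    float t  = ds / (ds - de);          /* fraction along s -> e */
    vec3 i = { s.x + t * (e.x - s.x),
               s.y + t * (e.y - s.y),
               s.z + t * (e.z - s.z) };
    return i;
}
```

The same fraction t would typically also be used to interpolate any per-vertex attributes of the newly created vertex.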
The clipping procedure may process the six planes in any order as long as the order is consistent for all triangles. The triangle will eventually become clipped against the right plane which does not produce any additional vertices for this 33 example. However, the vertices are moved to the respective intersection points as shown in figure 8.4. V4 V3 V2 V1 Figure 8.4: Clipping against the right plane moves two of the vertices. After the clipping process has completed, the final vertex list may not form a single triangle. It is possible for the process to produce as much as seven vertices, which together form a planar polygon. This polygon can be triangulated into several triangles and the number of triangles determined from the expression in equation 8.5. In the equation, t is the number of triangles and v the number of vertices. As the clipping process produces at most seven vertices, the number of triangles is at most five. t = max(v − 2, 0) (8.5) For this example, the polygon is split up into two triangles. Figure 8.5 shows the triangulated polygon and the two triangles {V~1 , V~2 , V~3 } and {V~1 , V~3 , V~4 }. V4 V3 V2 V1 Figure 8.5: The planar polygon can be triangulated into several triangles. 34 8.3 Algorithm The Sutherland-Hodgman algorithm for line clipping can be used to clip each line segment implicitly defined by the vertex list against each of the six planes of the view frustum. The algorithm was introduced in 1974 by Sutherland and Hodgman in [7]. For every vertex in the vertex list, the signed distance to the current plane P is ~ and ending point E ~ are computed. The signed distances of the starting point S used to determine how the line segment is oriented with respect to the plane. This is done through two boolean tests for which the results are embedded into a bit mask. Two tests give four cases in total, all of which are shown in equation 8.6 [7]. 0 1 case = 2 3 if if if if ~ < 0 ∧ dist(P, E) ~ <0 dist(P, S) ~ > 0 ∧ dist(P, E) ~ <0 dist(P, S) ~ < 0 ∧ dist(P, E) ~ >0 dist(P, S) ~ > 0 ∧ dist(P, E) ~ >0 dist(P, S) (8.6) As seen in the equation, case 0 occurs when both the starting and ending points are located in the negative half-space created by the plane. Neither the points nor the line segment is visible and as such, nothing should be appended to the resulting vertex list. This is illustrated in figure 8.6. S P N E Figure 8.6: Illustration of case 0 for the Sutherland-Hodgman algorithm. If the starting point is located in the positive half-space and the ending point located in the negative half-space, case 1 occurs. This is illustrated in figure 8.7. For this case, the intersection point I~ needs to be computed. The intersection point is computed with respect to the orientation of the line segment as discussed in the section about line segment-plane intersections. As the original 35 starting point is visible, it is appended to the resulting vertex list followed by the intersection point. E P I N S Figure 8.7: Illustration of case 1 for the Sutherland-Hodgman algorithm. Conversely, case 2 occurs when the starting point is located in the negative half-space and the ending point located in the positive half-space. For this case, the intersection point is computed and appended to the resulting vertex list. This is illustrated in figure 8.8. S P I N E Figure 8.8: Illustration of case 2 for the Sutherland-Hodgman algorithm. The final case occurs when both points are located in the positive half-space. For this case, no intersection point exists. 
The only thing appended to the resulting vertex list is the original starting point. The case is illustrated in figure 8.9. 36 S P N E Figure 8.9: Illustration of case 3 for the Sutherland-Hodgman algorithm. The clipping against the current plane P is completed by clipping the final line segment between the last vertex and the first vertex in the same fashion as for the other line segments. The resulting vertex list describes the intermediate planar polygon and can be used as input for the next clipping plane and so forth until all clipping planes have been processed. 9 Barycentric Coordinates For three-dimensional computer graphics, a triangle is formed from three vertices located in three-dimensional space. If the vertices are arranged so that a proper triangle is formed, the triangle will create a two-dimensional sub-space. This sub-space is a plane and the triangle is the small fraction of that plane enclosed within the convex hull of the three vertices. Figure 9.1 shows a triangle ~ , V~ and W ~. and its three vertices U V U W Figure 9.1: A triangle is formed from three vertices. 37 If the plane is divided by the three lines formed when extending the edges infinitely, the triangle can be expressed as the union of the three positive halfspaces. This can be seen in figure 9.2 where the arrows indicate the positive half-spaces. V U W Figure 9.2: A triangle can be seen as the union of three positive half-spaces. Attributes are commonly defined at the three vertices of the triangle and may be shared with adjacent triangles. In order to interpolate these values across the entire surface, a clever coordinate system is needed. Barycentric coordinates are a set of three coordinates that can be used to refer to all points within the plane spanned by a proper triangle. This set of coordinates is often denoted (u, v, w) and specifies the normalized perpendicular distances between the opposite edge of a vertex and the vertex itself. Let an arbitrary point P~ be defined in the plane by the three coordinates (u, v, w). Equation 9.1 shows two fundamental properties for this point and for the barycentric coordinates. ~ · u + V~ · v + W ~ ·w P~ = U 1=u+v+w (9.1) The u coordinate is defined by the normalized perpendicular distance from the ~ . The v and w coordinates are defined in a edge between vertices V~ and W similar fashion as illustrated in figure 9.3. 38 w=0 V v= 1 U P u= 1 w=1 v= 0 u= 0 W Figure 9.3: The isolines of the barycentric coordinates are parallel to the edges. As the coordinates are normalized with respect to the size of the triangle, the coordinates gain a series of useful properties. If the point P~ coincides with ~ , its position in the barycentric coordinate system is (1, 0, 0). The vertex U ~ , which are located at (0, 1, 0) and (0, 0, 1), same is true for vertices V~ and W respectively. ~ , its coordinates If the point lies anywhere along the line between V~ and W will be (0, v, w). Conversely, its coordinates along the other two lines will be (u, 0, w) or (u, v, 0). The normalization also ensures that the sum of all three coordinates is exactly equal to one, regardless of the position of the point. This is always true, even if the point lies outside of the triangle. This set of coordinates and their properties form an efficient interpolation scheme for attributes defined at the three vertices. 9.1 Cramer’s Rule The fundamental properties of the barycentric coordinates form a system of ~ is transequations and can be expressed in matrix form. 
An input vector X ~ formed by a matrix M into an output vector Y . Equation 9.2 shows a general system of equations in matrix form. ~ Y~ = M · X (9.2) The input vector will be the three barycentric coordinates (u, v, w) and the matrix and output vector will correspond to the statements from equation 9.1. Expressing these statements in matrix form results in equation 9.3 where computing the barycentric coordinates is analogous to solving the equation for (u, v, w). 39 Px Ux Vx Wx u Py = U y V y W y v 1 1 1 1 w (9.3) Cramer’s Rule states that any system of equations with as many unknowns as there are equations (i.e. square matrices) can be solved using quotients of determinants. Equation 9.4 states that the unknown Xi can be expressed as the fraction between the determinant of the matrix Mi and that of the original matrix M . Xi = det(Mi ) if det(M ) 6= 0 det(M ) (9.4) The determinant of the system is computed as shown in equation 9.5. The bottom row simplifies the calculations as it is equal to (1, 1, 1). det(M ) = (Vx · Wy − Wx · Vy ) − (Ux · Wy − Wx · Uy ) + (Ux · Vy − Vx · Uy ) = (Vx − Ux ) · (Wy − Uy ) − (Wx − Ux ) · (Vy − Uy ) (9.5) The additional matrices Mi are formed by replacing column i of the matrix M with the vector Y~ . Equation 9.6 shows the matrix M1 where the first column has been replaced by Y~ . Px Vx Wx M1 = Py Vy Wy 1 1 1 (9.6) Computing the determinant of M1 is performed in the standard way and results in the expression seen in equation 9.7. det(M1 ) = (Vx · Wy − Wx · Vy ) − (Px · Wy − Wx · Py ) + (Px · Vy − Vx · P y) = (Px − Wx ) · (Vy − Wy ) − (Vx − Wx ) · (Py − Wy ) (9.7) The matrix M2 is formed by replacing the second column of M with the vector Y~ . Equation 9.8 shows this matrix. 40 Ux Px Wx M2 = Uy Py Wy 1 1 1 (9.8) Its determinant is similar to the determinant of M1 . Equation 9.9 shows an expression for the determinant of M2 . det(M2 ) = (Px · Wy − Wx · Py ) − (Ux · Wy − Wx · Uy ) + (Ux · Py − Px · Uy ) = (Px − Wx ) · (Wy − Uy ) − (Wx − Ux ) · (Py − Wy ) (9.9) The final matrix M3 is formed by replacing the third column of M with the vector Y~ . This is seen in equation 9.10. Ux Vx P x M3 = Uy Vy Py 1 1 1 (9.10) An expression for the determinant of the matrix M3 is shown in equation 9.11. det(M3 ) = (Vx · Py − Px · Vy ) − (Ux · Py − Px · Uy ) + (Ux · Vy − Vx · Uy ) = (Vx − Ux ) · (Py − Uy ) − (Px − Ux ) · (Vy − Uy ) (9.11) Using Cramer’s Rule, the three unknowns (u, v, w) can now be expressed as the respective quotients. Equation 9.12 shows the expression for the three barycentric coordinates. (u, v, w) = 9.2 1 · (det(M1 ), det(M2 ), det(M3 )) if det(M ) 6= 0 det(M ) (9.12) Ratios of Areas Barycentric coordinates can also be viewed as area-oriented coordinates. The 1 1 1 coordinates 3 , 3 , 3 will always correspond to the center of any proper triangle. It is therefore possible to define the coordinates as ratios of areas. Figure 9.4 41 ~ , V~ , W ~ and an interior shows the three areas formed by the three vertices U ~ point P . V U Aw Av P Au W Figure 9.4: Barycentric coordinates can be defined through ratios of areas. The area of the entire triangle is computed as shown in equation 9.13. This will become the normalizing factor. A= (Vx − Ux ) · (Wy − Uy ) − (Wx − Ux ) · (Vy − Uy ) 2 (9.13) The three areas formed by introducing the interior point P~ are computed in a similar fashion. Equation 9.14 shows this process in which Aw is defined by the other two areas Au and Av in conjunction with the entire area A. 
Naturally, Aw could be defined explicitly but the computational complexity of the method would be higher. (Px − Wx ) · (Vy − Wy ) − (Vx − Wx ) · (Py − Wy ) 2 (Px − Wx ) · (Wy − Uy ) − (Wx − Ux ) · (Py − Wy ) Av = 2 Aw = A − Au − Av Au = (9.14) The barycentric coordinates are obtained through a division by the total area. This is shown in equation 9.15 where the actual coordinates (u, v, w) are computed. For efficiency reasons, the total area can be precalculated, stored and reused in every calculation for that triangle. It is also unnecessary to divide every area by two as it is the relative areas that are of importance. (u, v, w) = 1 (Au , Av , Aw ) if A 6= 0 A 42 (9.15) It can be observed that the two partial areas Au and Av actually correspond to half the signed perpendicular distances to the respective edges. That is, a ~ to P~ on the two edge normals conprojection of the relative vector from W ~ ~ taining W . Should the point P be located outside of the triangle, some areas and coordinates will become negative. This in accordance with the previous observation. 10 The Separating Axis Theorem The hyper-plane separation theorem is often referred to as the separating axis theorem in two dimensions. The theorem can be applied to determine if two convex shapes are overlapping and is heavily used in collision detection for computer graphics. 10.1 Candidate Axes If the two shapes are defined by vertices and the connecting edges in between, there exists a finite number of axes along which the two objects can be separated. In two dimensions, these candidate axes correspond to the edge normals present in either of the two shapes. Figure 10.1 shows two convex shapes and their edge normals. Figure 10.1: A box and a triangle with their corresponding edge normals. Figure 10.2 shows the connection between the edge normals of the two shapes and the axes for which a separation is possible. There are seven candidate axes for the two shapes in the figure but four of them are duplicates, leaving three unique axes. 43 Figure 10.2: Candidate axes are formed from unique edge normals. Note that the separating axes are directionless. Two edge normals differing only in magnitude will correspond to the same separating axis, even if the magnitudes have opposite signs. This as it is the relative magnitudes of the projected shapes that are compared using the theorem. 10.2 Collision Detection The theorem states that if two shapes are separated, there must exist at least one axis for which the projections of the two shapes do not overlap. If the two objects overlap, no such axis will exist. Figure 10.3 shows a case where two shapes do not overlap and how this is reflected in the projections. Figure 10.3: Two convex shapes with one separating axis (diagonal). This case can be determined using the condition in equation 10.1 in which Bi and Ti are the sets of projected vertices of the box and the triangle along candidate axis i. 44 ∃i : (max(Bi ) < min(Ti ) ∨ min(Bi ) > max(Ti )) (10.1) Figure 10.4 demonstrates a different case where the two shapes partially overlap. The overlap is also visible in the projections of the two shapes. Figure 10.4: Two convex shapes with no separating axis. This case can be determined using the condition in equation 10.2. The sets of projected vertices of the box and of the triangle along candidate axis i are denoted by Bi and Ti , respectively. 
∀i : (max(Bi ) > min(Ti ) ∧ min(Bi ) < max(Ti )) (10.2) An overlooked feature of the separating axis theorem is the possibility to determine if a shape is completely enclosed within another shape. If the projection of the smaller shape is enclosed in the projection of the larger shape for every candidate axis, the smaller shape will be completely enclosed inside the larger shape. Figure 10.5 demonstrates this feature. All projections of the smaller shape are enclosed within the projections of the larger shape. 45 Figure 10.5: A smaller shape enclosed within a larger shape. This case can be determined using the condition in equation 10.3. As with the other conditions, Bi and Ti denotes the sets of projected vertices of the box and of the triangle along candidate axis i, respectively. ∀i : (min(Bi ) > min(Ti ) ∧ max(Bi ) < max(Ti )) 11 (10.3) Tangent Space When the vertices of a triangle are coupled with texture coordinates, a local coordinate system known as the tangent space must exist for that triangle. Its bases are the axes along which the texture coordinates are defined as well as ~ and N ~, the normal of the triangle. A common notation for these axes are T~ , B corresponding to the tangent, bitangent and normal, respectively. However, ~ T~ and N ~ will be used from here on as texture coordinates often the notation S, are denoted by s and t. ~ , V~ and W ~ . It also shows Figure 11.1 shows a triangle and its three vertices U ~ and T~ . These bases can be of arbitrary two of the tangent space bases, S orientation, depending on the different texture coordinates defined at the three vertices but are always located in the same plane as the triangle. 46 V S U W T Figure 11.1: A tangent space exists for triangles with proper texture coordinates. For any point P~ in the same plane as a triangle, it is required for the tangent space of that triangle to be consistent with the spatial coordinates and with the texture coordinates of the point. Equation 11.1 shows this relationship. ~ + (Pt − Ut ) · T~ (P~xyz − U~xyz ) = (Ps − Us ) · S (11.1) The relationship is illustrated in figure 11.2 in which the point P~ local to ~ is described using the tangent space bases and the relative texture vertex U coordinates. V (Pt-Ut)·T P S U (Ps-Us)·S W T Figure 11.2: The tangent space is a coordinate system local to the triangle. Defining the tangent space from the attributes of the three vertices can be ~ relative done through observing an important feature. The vertices V~ and W ~ must be uniquely described using the tangent space bases and the relative to U texture coordinates. Equation 11.2 shows these criteria. 47 ~ + (Vt − Ut ) · T~ (V~xyz − U~xyz ) = (Vs − Us ) · S ~ + (Wt − Ut ) · T~ (W~xyz − U~xyz ) = (Ws − Us ) · S (11.2) ~ and B ~ These criteria can be simplified by introducing two local vectors A ~ ~ ~ ~ as the vectors from U to V and U to W , respectively. The two vectors will ~ as shown in figure 11.3. correspond to the two edges local to U V A S U B W T ~ and B ~ local to U ~ are introduced. Figure 11.3: Two vectors A The previous equation can now be rewritten as shown in equation 11.3. This is a system of six equations with six unknowns. ~ + At · T~ A~xyz = As · S ~ + Bt · T~ B~xyz = Bs · S (11.3) Obtaining the tangent space bases is analogous to solving this system of equa~ and T~ . If the system of equations is rewritten in tions for the two bases S matrix form, equation 11.4 is obtained. Ax Ay Az As At Sx Sy Sz = Bx By Bz Bs Bt Tx Ty Tz (11.4) ~ and B ~ are known. 
It is All attributes belonging to either of the two vectors A therefore possible to determine whether there exists a tangent space. A matrix M consisting of the texture coordinates is defined as shown in equation 11.5. As At M= Bs Bt 48 (11.5) This matrix needs to be inverted in order to compute the tangent space bases. As with other matrices, it is only invertible if its determinant is not equal to zero. Using the inverse formula for square four-element matrices produces the result in equation 11.6. M −1 1 Bt −At if As · Bt − At · Bs 6= 0 = As · Bt − At · Bs −Bs As (11.6) ~ and T~ by multiplying equation 11.4 The matrix equation can be solved for S −1 with M from the left. Rearranging the equation gives the result shown in equation 11.7. Sx Sy Sz −1 Ax Ay Az =M Tx Ty Tz Bx By Bz (11.7) The explicit definitions of the two tangent space bases are shown in equation −1 11.8 where Mi,j corresponds to the element of the inverse texture coordinate matrix at row i and column j. ~ S T~ −1 −1 M1,1 · Ax + M1,2 · Bx −1 −1 · Ay + M1,2 · By = M1,1 −1 −1 M1,1 · Az + M1,2 · Bz −1 −1 M2,1 · Ax + M2,2 · Bx −1 −1 · Ay + M2,2 · By = M2,1 −1 −1 M2,1 · Az + M2,2 · Bz (11.8) The normal is by definition computed as the three-dimensional cross-product between two of the edges and is not connected to the texture coordinates. The two edges must be selected in accordance with the order in which the vertices are defined. Throughout this report, left-handed coordinate systems are used requiring the vertices to be defined in clock-wise order. Equation 11.9 shows ~ using the two local vectors A ~ and B. ~ the definition of the normal N Bz · Ay − By · Az ~ = Bx · Az − Bz · Ax N By · Ax − Bx · Ay 49 (11.9) 12 Perspective-Correct Interpolation When three-dimensional triangles are perspectively projected into a plane, the relationship between the attributes associated with each vertex becomes distorted. In three dimensions, the relationship is linear over the surface of the triangle. This is no longer true when the triangle is projected into a plane. However, the inverse projection is linear and this can be utilized. This section provides details on how to correctly interpolate attributes across a projected triangle using the two texture coordinates s and t as an example. The method detailed here is in no way limited to texture coordinates. Any attributes may be interpolated using the method, even vectors and matrices. Equation 12.1 shows the projected texture coordinates s0 and t0 and the inverse depth z 0 , all of which are linear over the surface of the projected triangle. 1 z s 0 s = z t t0 = z z0 = (12.1) ~ , V~ These three projected attributes can be computed at the three vertices U ~ of the triangle, providing a set of nine projected attributes. This is and W shown in equation 12.2. zU0 = s0U = t0U = 1 zU sU zU tU zU zV0 = s0V = t0V = 1 zV sV zV tV zV 0 = zW 0 sW = t0W = 1 zW sW zW tW zW (12.2) An arbitrary point P~ on the projected surface is defined by the three barycentric coordinates u, v and w as stated previously. Equation 12.3 shows this point. P~ = (u, v, w) (12.3) As the nine projected attributes are linear over the surface, the barycentric coordinates can be used to linearly interpolate these to the point P~ . This procedure is shown in equation 12.4. 50 0 zP0 = zU0 · u + zV0 · v + zW ·w 0 0 0 0 sP = sU · u + sV · v + sW · w t0P = t0U · u + t0V · v + t0W · w (12.4) The inverse depth provides a simple mean of recovering the actual attributes at the point. 
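To tie sections 9 and 12 together, the following C sketch computes barycentric coordinates through ratios of areas (equations 9.13 to 9.15) and uses them to interpolate a pair of texture coordinates in a perspective-correct fashion (equations 12.2, 12.4 and 12.5). It is a minimal sketch, not the pipeline's actual code: the names are made up, and the vertices are assumed to already be projected to the image plane with their camera-space depths kept alongside.

#include <stdio.h>

typedef struct { float x, y; } vec2;          /* projected (screen-space) position */

/* Twice the signed area of the triangle (a, b, c); equation 9.13 without the division. */
static float area2(vec2 a, vec2 b, vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

int main(void)
{
    /* Projected vertices U, V, W with their camera-space depths and texture coordinates. */
    vec2 U = { 0.0f, 0.0f }, V = { 1.0f, 0.0f }, W = { 0.0f, 1.0f };
    float zU = 1.0f, zV = 4.0f, zW = 2.0f;
    float sU = 0.0f, sV = 1.0f, sW = 0.0f;
    float tU = 0.0f, tV = 0.0f, tW = 1.0f;

    vec2 P = { 0.25f, 0.25f };                /* point to be shaded */

    /* Barycentric coordinates as ratios of areas (equations 9.14 and 9.15). */
    float A = area2(U, V, W);
    float u = area2(P, V, W) / A;
    float v = area2(U, P, W) / A;
    float w = 1.0f - u - v;

    /* The projected attributes 1/z, s/z and t/z are linear in screen space (equation 12.2),
       so they can be interpolated directly with the barycentric coordinates (equation 12.4). */
    float zP_ = u / zU + v / zV + w / zW;
    float sP_ = u * sU / zU + v * sV / zV + w * sW / zW;
    float tP_ = u * tU / zU + v * tV / zV + w * tW / zW;

    /* Recover the actual attributes by dividing by the interpolated inverse depth (equation 12.5). */
    printf("z = %f, s = %f, t = %f\n", 1.0f / zP_, sP_ / zP_, tP_ / zP_);
    return 0;
}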
Equation 12.5 shows how the texture coordinates sP and tP are recovered at the point together with the correct depth value zP . 1 zP0 s0 sP = P0 zP t0 tP = P0 zP zP = 13 (12.5) Rounding Modes Proper handling of the rounding of real-valued numbers is an important aspect of computer graphics. Real-valued numbers are used to describe an array of different properties of a three-dimensional scene, ranging from geometry definitions to texture coordinates. These will have to map to a discrete raster of discrete colors sampled from discrete textures, making the need evident. Rounding can be performed in a number of different modes where each mode is beneficial in a number of different situations. These rounding modes are implemented in most programming languages and in hardware but can also be defined using the truncating function. The truncating function truncates all decimals from a real-valued number x. Figure 13.1 shows this function and how it rounds its input towards zero. 51 Y X Figure 13.1: The truncating function of a real-valued number. Defining the truncating function in a computational context can be done in a few different ways. Either as a bit manipulation of the real-valued number or as a type conversion to an integer data type. This provided that the integer data type used is able to represent the truncated value. 13.1 Flooring Flooring is a rounding mode which rounds its input towards negative infinity. A real-valued number x is rounded towards the integer to the left on the x-axis. This is shown in figure 13.2. Y X Figure 13.2: The flooring function of a real-valued number. The flooring function can be defined as shown in equation 13.1 in which the definition properly handles positive as well as negative values. ( trunc(x) − 1 floor(x) ∈ Z = trunc(x) 52 if x < trunc(x) otherwise (13.1) Fractional The flooring function has a complimentary function known as the fractional function. It represents the value that was removed using the flooring function and is shown in figure 13.3. Y X Figure 13.3: The fractional function of a real-valued number. The fractional function can be defined using the flooring function or using the truncating function. Equation 13.2 shows the fractional function defined using the latter. As with the flooring function, this definition properly handles positive as well as negative values. ( x − trunc(x) + 1 frac(x) ∈ R : [0, 1[= x − trunc(x) if x < trunc(x) otherwise (13.2) The two functions obey an important relationship. The sum of the flooring function and the fractional function individually applied to any real-valued number x is always equal to x as shown in equation 13.3. x = floor(x) + frac(x) (13.3) The sum of the two functions is as a continuous line as expected. This is shown in figure 13.4. 53 Y X Figure 13.4: The sum of the flooring function and the fractional function. 13.2 Ceiling Ceiling is a rounding mode which rounds its input towards positive infinity. A real-valued number x is rounded towards the integer to the right on the x-axis. This is shown in figure 13.5. Y X Figure 13.5: The ceiling function of a real-valued number. The ceiling function can be defined as shown in equation 13.4 in which the definition properly handles positive as well as negative values. ( trunc(x) + 1 ceil(x) ∈ Z = trunc(x) 54 if x > trunc(x) otherwise (13.4) Reverse Fractional Similar to the flooring function, the ceiling function has a complimentary function known as the reverse fractional function. 
It represents the value that was introduced using the ceiling function and is shown in figure 13.6. Y X Figure 13.6: The reverse fractional function of a real-valued number. The reverse fractional function can be defined using the ceiling function or using the truncating function. Equation 13.5 shows the reverse fractional function defined using the latter. As with the other functions, this definition properly handles positive as well as negative values. ( trunc(x) + 1 − x rfrac(x) ∈ R : [0, 1[= trunc(x) − x if x > trunc(x) otherwise (13.5) The two functions obey a relationship similar to that of the flooring and fractional functions. The difference between the ceiling function and the reverse fractional function individually applied to any real-valued number x is always equal to x. This is shown in equation 13.6. This difference also forms a continuous line. x = ceil(x) − rfrac(x) 13.3 (13.6) Nearest Rounding to the nearest integer can be performed in a simple way utilizing the flooring or ceiling functions. Shifting the input by 21 before flooring will create a rounding mode that produces the closest integer, left or right on the 55 x-axis. Values at the exact midpoint between two integers will be rounded upwards. Conversely, shifting the input by − 21 before ceiling will also produce a rounding towards the nearest integer. However, values at the midpoints will be rounded downwards. 56 Chapter D Texture Filtering Texture mapping plays an important role in computer graphics as it greatly enhances the visual quality of a rendered image. Texture mapping is traditionally used to modulate color values for three-dimensional models but can also be used to introduce detail in a variety of ways. This includes introducing geometric detail through normal mapping as well as modulating material properties other than color. Throughout this chapter, the two words pixel and texel will be heavily used. Pixel will be used whenever discussing the sample points of the rendered image, whereas texel will refer to the sample points of a texture image. It is important to differentiate between these two, even if they technically represent the exact same thing. 1 Short on Discrete Images Pixels and texels are indeed sample points, with emphasis on points. Points are zero-dimensional, infinitely small and should not be thought of as higherdimensional shapes such as circles or squares. As such, they cannot be associated with a size. The sample points together form a discrete image which has a horizontal and a vertical resolution equal to the amount of sample points in the respective direction. The discrete image is best viewed through a reconstruction filter which extends the image into the continuous domain according to Smith in [8]. The footprint of this filter will be two-dimensional and therefore have a size and a shape. 57 2 Magnification and Minification Texture mapping is a process in which a set of texture coordinates s and t, along with the partial derivatives ∂s and ∂t are used to reconstruct a discrete texture image. The texture coordinates s and t are known as the resolutionindependent texture coordinates for which the intervals between zero and one cover the entire texture. In order to access individual texels, the coordinates are commonly scaled by the actual resolution and the scaled texture coordinates x and y are obtained. The partial derivatives are scaled in the same fashion, giving a hint on the optimal size for the footprint of the reconstruction filter. 
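The filters in this chapter rely both on the rounding modes of section 13 and on the coordinate scaling just described. The following C sketch implements the helpers according to equations 13.1, 13.2 and 13.4 and applies the scaling to a set of made-up coordinates; the names are illustrative only.

#include <stdio.h>

/* Truncation towards zero through a type conversion, as described in section 13. */
static float trunc_f(float x) { return (float)(int)x; }

/* Equation 13.1: flooring defined through the truncating function. */
static float floor_f(float x) { return x < trunc_f(x) ? trunc_f(x) - 1.0f : trunc_f(x); }

/* Equation 13.2: the fractional part, the complement of flooring. */
static float frac_f(float x) { return x - floor_f(x); }

/* Equation 13.4: ceiling defined through the truncating function. */
static float ceil_f(float x) { return x > trunc_f(x) ? trunc_f(x) + 1.0f : trunc_f(x); }

/* Section 13.3: rounding to the nearest integer by shifting by one half before flooring. */
static float round_f(float x) { return floor_f(x + 0.5f); }

int main(void)
{
    /* Resolution-independent coordinates and derivatives for one pixel (made-up values). */
    float s = 0.40f, t = 0.75f, ds = 0.01f, dt = 0.02f;
    int w = 256, h = 128;                                /* texture resolution */

    /* Scaled texture coordinates and scaled derivatives, as described above. */
    float x = s * (float)w, y = t * (float)h;
    float dx = ds * (float)w, dy = dt * (float)h;

    printf("x = %.1f: floor %.0f, frac %.2f, ceil %.0f, nearest %.0f\n",
           x, floor_f(x), frac_f(x), ceil_f(x), round_f(x));
    printf("y = %.1f, footprint hint %.1f x %.1f texels\n", y, dx, dy);
    return 0;
}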
Pixels in the rendered image will rarely match the texels of a texture image, neither in orientation nor in resolution and by that distance between sample points. In three-dimensional computer graphics, the problem is worsened as the footprint of a pixel in image space may become distorted in texture space. This as a result of the perspective projection. There are two extreme cases which pose great challenges for the texture filtering sub-system. Large regions of the rendered image might produce almost identical small footprints which fall between four texels. Choosing the nearest neighboring texel as a reconstruction filter will select the same texel for many pixels in that region of the rendered image. Clearly, this is not desirable and better reconstruction filters are needed. When the distance between two texels is larger than the footprint of a pixel, the texture sub-system needs to magnify the texture image. Hence the term texture magnification. Unfortunately, not many techniques suited for real-time applications exist for this case. Figure 2.1 shows the footprint of a pixel in texture space. In the figure, the footprint is smaller than the distance between two texels and magnification is needed. Figure 2.1: The footprint of a pixel in texture space needing magnification. 58 The other extreme case presents itself when the footprint of a pixel is mapped to a region covering multiple texels in texture space. Choosing the nearest neighboring texel as a reconstruction filter will select a texel seemingly at random. In addition, slight changes in perspective will cause the filter to select different texels, causing visual artifacts such as Moiré patterns in the rendered image. With the distance between two texels being smaller than the footprint of a pixel, the texture sub-system needs to minify the texture image. This is known as texture minification and multiple techniques suitable for real-time applications exist to combat the problem. Figure 2.2 shows the footprint of a pixel in texture space. The footprint is many times larger than the distance between two texels and texture minification is needed. Figure 2.2: The footprint of a pixel in texture space needing minification. Most minification filters introduce a pre-processing pass in which the texture is processed and additional data stored. The pre-processing can often be done efficiently in a parallel computational context and only needs to be done once per texture. 3 Wrapping Modes When a pair of texture coordinates exceed the domain of a texture, a consistent way of remapping these coordinates is needed. Generally, there are two different wrapping modes which can be used for the two texture dimensions. Both dimensions need not use the same wrapping mode, even if it often is convenient to treat both dimensions equally. However, it is crucial that a wrapping mode is defined as it is near impossible to prevent domain exceedings. 59 Wrapping modes can operate on the resolution-independent texture coordinates or on the integer texel coordinates. This section will define wrapping modes for the latter. As such, wrapping is performed whenever the texture filter requests a texel using the integer texel coordinates. Given a set of texel coordinates xi and yi , the look-up function Il can be defined as shown in equation 3.1. In the equation, I corresponds to the discrete texture image whereas Wx and Wy are the two wrapping functions. 
Il [xi , yi ] = I[Wx [xi ], Wy [yi ]] 3.1 (3.1) Repeating If a texture is meant to be used in a tiling fashion where texel coordinates exceeding the domain should remap to the corresponding texels in the tile, the repeat mode is used. Equation 3.2 shows how a remapping of the texel coordinates xi and yi is done for repeating textures. The method used in the equation correctly remaps xi and yi into the intervals between [0, w − 1] and [0, h − 1], even for negative coordinates. Wx [xi ] = ((xi mod w) + w) mod w Wy [yi ] = ((yi mod h) + h) mod h (3.2) Graphics pipelines such as the Open Graphics Library generally support two different repeating modes. One which mirrors or inverts every other tile and one which repeats every tile in a consistent fashion [9]. The expressions in the above equation are used to repeat every tile consistently. 3.2 Clamping If a pair of texel coordinates are meant to never exceed their domain, the clamping mode is used. Clamping remaps texel coordinates outside of the domain to the edge texels of the texture image. Equation 3.3 shows how a remapping of the texel coordinates xi and yi is done for clamping textures. Wx [xi ] = min(max(xi , 0), w − 1) Wy [yi ] = min(max(yi , 0), h − 1) (3.3) Graphics pipelines generally support a few different clamping modes which allow fine control over the texels that actually get sampled. For instance, the 60 Open Graphics Library allows a virtual border of constant color or an actual border of varying colors to be specified around the texture image. The different clamping modes specify whether the clamping should be done to the edge of the texture, to the border of the texture or to in between the edge and the border [9]. Borders can be used to show that some pixels in the rendered image will map to texels outside of the domain. Should this not be the intention, the border texels visible in the rendered image will indicate that something is wrong with the texture coordinates specified. 4 Nearest Neighboring Texel The nearest neighboring texel filter is a texture filter in its most simple form. It selects a single texel from the texture which corresponds to the one in closest proximity to the scaled texture coordinates x and y. Equation 4.1 shows the definition of the nearest neighboring texel filter N . For efficiency, the rounding can be implemented as a flooring operation shifted by 12 . N (x, y) = Il [round(x), round(y)] (4.1) The filter is generally not used since it is neither suitable for magnification nor minification. However, it extends the discrete texture image into the continuous domain which is the minimum requirement for a texture filter. 5 Bilinear Interpolation Bilinear interpolation is a filter which averages four texels weighted in accordance with distance in an aim to reconstruct the missing data. This has proven to be sufficient for the perceived visual quality of rendered images when the texture is magnified. Hence, bilinear interpolation is one of the most commonly used texture magnification filters. In bilinear interpolation, the scaled texture coordinates x and y are split into integer parts and fractional parts using the flooring and fractional functions. The integer parts are used for texel indices and the fractional parts are used to weigh the texels accordingly. Equation 5.1 shows how bilinear interpolation extends an image into the continuous domain. In the equation, xi and yi correspond to the integer parts 61 while xf and yf correspond to the fractional parts of the scaled texture coordinates. 
The equation clearly shows that the filter uses four texel look-ups. B(x, y) = Il [xi , yi ] · (1 − xf ) · (1 − yf ) + Il [xi + 1, yi ] · (xf ) · (1 − yf ) + Il [xi , yi + 1] · (1 − xf ) · (yf ) + Il [xi + 1, yi + 1] · (xf ) · (yf ) (5.1) Bilinear interpolation does not work well as a minification filter. When the footprint of a pixel covers multiple texels, interpolating four of them provides little to no gain in the visual quality. Better reconstruction filters are needed for minification. 6 Image Pyramids Image pyramids were introduced in 1983 by Williams in [10]. Williams used the term mip-maps where mip is an abbreviation of the latin phrase multum in parvo, meaning many things in a small place. The reconstruction filter utilizes a pre-processing pass in which the original image is down-sampled recursively by a factor of four (one half in each dimension) into an image with multiple resolutions, or levels. This forms a pyramidal structure, hence the term image pyramids. As images with multiple resolutions are created and stored, the memory requirement for image pyramids is higher in comparison to that of ordinary images. Equation 6.1 shows that the upper bound of the memory requirement is 43 when compared to ordinary images. This justifies the mip prefix as only 1 more data is required for the storage. 3 lim N →∞ N i X 1 4 i=0 = 4 3 (6.1) Figure 6.1 illustrates how the image pyramid reconstruction filter creates a series of recursively down-sampled images for three-channel images. It also illustrates how the memory requirement is bounded by 34 as all down-sampled copies fit inside an additional channel. 62 l0 l0 l1 l0 l1 l1 l2 l2 l2 Figure 6.1: An illustration of mip-mapping for three-channel images. Image pyramids are in no way limited to three-channel images, they work just as well with an arbitrary number of channels. Additionally, they can be used for both magnification and minification if the sampling is implemented in a robust way. 6.1 Pre-processing The pre-processing and sampling of image pyramids can differ slightly between different implementations and between different hardware vendors. Yet, its fundamental algorithm is to pre-process the image into multiple resolutions and sample the appropriate level or levels as determined by the size of the footprint. As every level of the image pyramid is isotropically down-sampled, the total number of levels can be determined using equation 6.2. The formula used in the equation selects the axis with minimum extent in the image and downsamples until only a single row or column remains. Should the image be perfectly square, the final level becomes a single pixel which approximates the entire image. levels = blog2 (min(w, h))c + 1 (6.2) Re-sampling an image into half its resolution along both axes can be done in two passes provided that the filter used is separable. First along the x-axis and then along the y-axis or the other way around. This two-pass method uses fewer memory accesses than a one-pass method would but is reliant on an intermediate buffer storing the results from the first pass. It might also be subject to additional quantization noise if the intermediate results are stored in integer data types. 63 The recursive down-sampling process is initialized with the original image. Each recursion level uses the image at the previous level as its input and initializes the next recursion provided that there are more levels still to be processed. 
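As a rough illustration of the pre-processing pass, the following C sketch computes the number of levels according to equation 6.2 and down-samples one level into the next with a simple 2x2 box filter, clamping look-ups at the edges. A real implementation would likely be separable and run as an OpenCL kernel; the single-channel buffer layout and the names used here are assumptions made for the example.

#include <stdio.h>

/* Clamped texel look-up in a single-channel image of resolution w x h. */
static float texel(const float *img, int w, int h, int x, int y)
{
    x = x < 0 ? 0 : (x > w - 1 ? w - 1 : x);
    y = y < 0 ? 0 : (y > h - 1 ? h - 1 : y);
    return img[y * w + x];
}

/* Equation 6.2: number of levels in the isotropic image pyramid. */
static int num_levels(int w, int h)
{
    int m = w < h ? w : h, levels = 1;
    while (m > 1) { m /= 2; levels++; }
    return levels;
}

/* Down-sample src (w x h) into dst (w/2 x h/2) with a 2x2 box filter. */
static void downsample(const float *src, int w, int h, float *dst)
{
    int dw = w / 2 > 0 ? w / 2 : 1, dh = h / 2 > 0 ? h / 2 : 1;
    for (int y = 0; y < dh; y++)
        for (int x = 0; x < dw; x++)
            dst[y * dw + x] = 0.25f * (texel(src, w, h, 2 * x,     2 * y) +
                                       texel(src, w, h, 2 * x + 1, 2 * y) +
                                       texel(src, w, h, 2 * x,     2 * y + 1) +
                                       texel(src, w, h, 2 * x + 1, 2 * y + 1));
}

int main(void)
{
    int w = 4, h = 4;
    float img[16];
    for (int i = 0; i < 16; i++) img[i] = (float)i;

    printf("levels = %d\n", num_levels(w, h));           /* 3 for a 4 x 4 image */

    float level1[4];
    downsample(img, w, h, level1);
    printf("level 1: %.1f %.1f %.1f %.1f\n", level1[0], level1[1], level1[2], level1[3]);
    return 0;
}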
6.2 Sampling Method The set of resolution-independent texture coordinates s and t as well as their partial derivatives ∂s and ∂t together provide enough information to sample the image pyramid. The texture coordinates of a pixel are obtained through (perspectively-correct) interpolation across the triangle while the partial derivatives can be approximated using finite differences. This differentiation is done between adjacent pixels and can be expressed in a number of different ways. In hardware, pixels are commonly textured in groups of four. Such a group is known as a quad pixel and its structure makes differentiation possible. Usu∂t ∂s ∂s ∂t , ∂y , ∂x and ∂y are approximated through ally, the four partial derivatives ∂x differentiation and used for the entire quad pixel. For tile-based renderers, other differentiation schemes become possible as entire tiles are textured simultaneously. The approximate size of an optimal footprint is related to the magnitude of the partial derivatives. A metric commonly used for image pyramids is shown in equation 6.3 [10] in which the footprint size fs is approximated from the maximum length of two gradients. The length is scaled by the number of samples along the minor axis to produce a footprint size which is expressed in distances between texels. Choosing the maximum gradient length is a tradeoff which favors blur over aliasing. Hence, image pyramids often produce an over-blurred sample. s 2 2 s 2 2 ∂s ∂t ∂s ∂t + , + · min(w, h) fs = max ∂x ∂x ∂y ∂y (6.3) The common metric comes with an assumption about the resolution of a texture. Namely, it assumes that every texture is perfectly square and that the partial derivatives contribute equally to the approximation of the optimal footprint. Williams stated that image pyramids were intended for square, powerof-two resolutions. Yet, it is useful to extend the definition to properly include rectangular textures and preferably arbitrary resolutions as well. 64 To properly include rectangular textures, the partial derivatives can be weighted in accordance with resolution when computing the length of the two gradients. This modification is shown in equation 6.4. s 2 2 s 2 2 ∂t ∂t ∂s ∂s ·w + ·h , ·w + ·h fs = max ∂x ∂x ∂y ∂y (6.4) Approximating the footprint size from the lengths of the two gradients is computationally costly as it involves two square root computations at each pixel. The square root can be extracted from the maximum function without altering the expression and the footprint size computed using a single square root. Yet, further optimizations may be made as the size of the footprint is approximated. Equation 6.5 shows a method for which the size of the footprint is approximated independently in width and height. ∂s ∂s fw = max , · w ∂x ∂y ∂t ∂t fh = max , · h ∂x ∂y (6.5) The independent approach also aids in the actual sampling process for rectangular images. The appropriate level of the pyramid is determined as shown in equation 6.6. By scaling the resolution-independent texture coordinates s and t by the resolution of the image at level li , the texture at resolution li can be sampled. The sampling may be performed using a simple texture filter such as nearest neighboring texel or bilinear interpolation with good results due to the pre-processing pass. ( blog2 (fw )c li = blog2 (fh )c if fw > fh otherwise (6.6) When sampling a texture using the image pyramid, transitions between the different levels of the pyramid will become visible in the rendered image. 
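The level selection described by equations 6.5 and 6.6 might be sketched in C as follows. The clamping of the level index and the use of level zero for magnification are practical additions not stated in the equations, and the function names are illustrative only.

#include <math.h>
#include <stdio.h>

/* Equations 6.5 and 6.6: approximate the footprint independently in width and
   height from the partial derivatives and select a single pyramid level. */
static int select_level(float dsdx, float dsdy, float dtdx, float dtdy,
                        int w, int h, int levels)
{
    float fw = fmaxf(fabsf(dsdx), fabsf(dsdy)) * (float)w;
    float fh = fmaxf(fabsf(dtdx), fabsf(dtdy)) * (float)h;
    float f  = fw > fh ? fw : fh;

    int level = f > 1.0f ? (int)floorf(log2f(f)) : 0;    /* magnification uses level 0 */
    return level < levels - 1 ? level : levels - 1;      /* clamp to the last level */
}

int main(void)
{
    /* A pixel whose footprint covers roughly 8 x 4 texels in a 256 x 256 texture. */
    int level = select_level(8.0f / 256.0f, 0.0f, 0.0f, 4.0f / 256.0f, 256, 256, 9);
    printf("selected level %d\n", level);                /* 3, i.e. the 32 x 32 level */
    return 0;
}

Selecting one level per pixel in this way is the direct cause of the visible transitions between levels.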
To remedy this, trilinear interpolation was introduced. Trilinear interpolation is essentially two bilinear interpolations applied independently to the closest two levels in the pyramid followed by a linear interpolation between the two samples produced. The linear level fractional lf is computed as shown in equation 6.7. ( lf = fw 2li fh 2li −1 −1 65 if fw > fh otherwise (6.7) The trilinear interpolation can be defined as shown in equation 6.8 where (x0 , y0 ) are the scaled texture coordinates of level li and (x1 , y1 ) of level li + 1. The two bilinear interpolations are denoted by B and are weighted in accordance with lf . Together they form the image pyramid filter Pi . The two bilinear interpolations use four texel look-ups each, amounting to eight look-ups in total. Pi (x, y) = B(x0 , y0 ) · (1 − lf ) + B(x1 , y1 ) · (lf ) 7 (6.8) Anisotropic Image Pyramids Anisotropic image pyramids is an extension to image pyramids. Instead of down-sampling the image isotropically, the image is down-sampled independently in both directions by a factor of two. Clearly, this requires more memory when compared to image pyramids and ordinary images. Equation 7.1 shows that this requirement is bounded by four times the original memory requirement. lim 2 N →∞ N i X 1 i=0 2 ! −1 · lim M →∞ M j X 1 ! 4 j=0 = 3 · lim M →∞ M j X 1 j=0 4 = 4 (7.1) The process is illustrated in figure 7.1 which forms a two-dimensional structure of levels with different resolutions. As with image pyramids, there is no limitation in the number of channels that can be used. A single channel was used here only to illustrate the memory requirement. This structure is often referred to as a rip-map, where rip corresponds to rectangular mip. l0,0 l1,0 l2,0 l0,1 l1,1 l2,1 l0,2 l1,2 l2,2 Figure 7.1: An illustration of rip-mapping for one-channel images. 66 It should be emphasized that the anisotropic extension includes the resolution levels of regular image pyramids. They are present across the diagonal in the figure. 7.1 Pre-processing Similar to regular image pyramids, anisotropic image pyramids are recursively down-sampled in a pre-processing pass. As a two-dimensional structure of resolutions is to be generated, multiple recursive processes need to be performed where each process down-samples the image along either dimension until only a single row or column remains. Between every recursive process, the image is down-sampled once along the other dimension until only a single row or column remains. Equation 7.2 shows the number of levels an image will have in each direction. The total number of levels is the product of these two numbers. levelsx = blog2 (w)c + 1 levelsy = blog2 (h)c + 1 7.2 (7.2) Sampling Method The fundamental idea behind anisotropic image pyramids is that the optimal resolution of the texture can be computed using the two partial derivatives ∂s and ∂t and selected from the structure. Using the partial derivatives, the width and height of the footprint is computed as shown in equation 7.3. ∂s ∂s fw = max , · w ∂x ∂y ∂t ∂t fh = max , · h ∂x ∂y (7.3) Using the size of the footprint, the x-level li,x and the y-level li,y is computed as shown in equation 7.4. li,x = blog2 (fw )c li,y = blog2 (fh )c (7.4) In contrast to regular image pyramids, there are two linear level fractionals for anisotropic image pyramids. These are defined as shown in equation 7.5. 
67 fw −1 2li,x fh = li,y − 1 2 lf,x = lf,y (7.5) Sampling the anisotropic image pyramid is done in the same fashion as for regular image pyramids, only with a slight alteration. Four bilinear interpolations are performed for the four different level combinations. As the four samples need to be weighted, an additional bilinear interpolation weights these four samples using the two linear levels fractionals. The final filter Pa is shown in equation 7.6 for which the number of texel look-ups amount to sixteen. Pa (x, y) = B(x0 , y0 ) · (1 − lf,x ) · (1 − lf,y ) + B(x1 , y1 ) · (lf,x ) · (1 − lf,y ) + B(x2 , y2 ) · (1 − lf,x ) · (lf,y ) + B(x3 , y3 ) · (lf,x ) · (lf,y ) (7.6) In the equation (x0 , y0 ) correspond to the scaled texture coordinates in level (li,x , li,y ), (x1 , y1 ) in level (li,x + 1, li,y ), (x2 , y2 ) in level (li,x , li,y + 1) and (x3 , y3 ) in level (li,x + 1, li,y + 1). 8 Summed-Area Tables Summed-area tables were conceived in 1984 by Crow and presented in [11]. Crow observed that the sum of all pixels in an image enclosed within an axisaligned rectangle could be computed using a simple formula. By altering the representation of the image into the summed-area table, the sum could be computed using only four table accesses for rectangles of arbitrary sizes. In other words, an arbitrary box-filter could be computed in constant time. Crow stated that an element of the summed-area table S should be computed as the sum of all pixels in the image with lower or equal indices to that of the element. This is shown in equation 8.1 where I is the discrete texture image. S[x, y] = y x X X I[i, j] (8.1) j=0 i=0 A box-filter of a rectangular region is the defined as the total sum within the region divided by the total area of the rectangle. Equation 8.2 shows a boxfilter I for the region enclosed within (x0 , y0 ) and (x1 , y1 ). This is equivalent to the mean of the image within the region. 68 Py1 I[x0 , y0 , x1 , y1 ] = Px1 I[i, j] (x1 − x0 ) · (y1 − y0 ) j=y0 i=x0 (8.2) Figure 8.1 shows a rectangular region defined by the coordinates (x0 , y0 ) and (x1 , y1 ) for which the box-filter is to be applied. Figure 8.1: An axis-aligned rectangular region in the image. The method that Crow presented involves replacing the double sum of the image with four table look-ups in the summed-area table. This is shown in equation 8.3 [11]. I[x0 , y0 , x1 , y1 ] = S[x1 , y1 ] − S[x0 , y1 ] − S[x1 , y0 ] + S[x0 , y0 ] (x1 − x0 ) · (y1 − y0 ) (8.3) The four table look-ups in the summed-area table correspond to sums over different regions in the original image. Figure 8.2 shows the region between (0, 0) and (x1 , y1 ), corresponding to the table look-up S[x1 , y1 ]. Figure 8.2: An element in the table corresponds to the sum over a region. 69 The table look-up S[x0 , y1 ] corresponds to the sum over the region between (0, 0) and (x0 , y1 ). This is shown in figure 8.3. This sum is subtracted from the sum over the previous region as it is not part of the box-filter. Figure 8.3: An element in the table corresponds to the sum over a region. Conversely, the table look-up S[x1 , y0 ] corresponds to the sum over the region between (0, 0) and (x1 , y0 ). This sum is also subtracted as the region is not part of the box-filter. Figure 8.4 shows this region. Figure 8.4: An element in the table corresponds to the sum over a region. The final table look-up S[x0 , y0 ] is added back to the sum as it has been removed twice by subtracting the two previous sums. 
It corresponds to the sum over the region between (0, 0) and (x0 , y0 ) as shown in figure 8.5. 70 Figure 8.5: An element in the table corresponds to the sum over a region. Using Crow’s original definition for sampling the summed-area table, the boxfilter is actually computed over the region enclosed within (x0 + 1, y0 + 1) and (x1 , y1 ). This is shown in figure 8.6. Figure 8.6: The sum is computed over a slightly different region. This can be corrected using a slightly different definition of the box-filter. Equation 8.4 shows the new definition for which one has been subtracted from the coordinates x0 and y0 and the area adjusted accordingly. I[x0 , y0 , x1 , y1 ] = S[x1 , y1 ] − S[x0 − 1, y1 ] − S[x1 , y0 − 1] + S[x0 − 1, y0 − 1] (x1 − x0 + 1) · (y1 − y0 + 1) (8.4) Accessing elements above and to the left of the x0 and y0 coordinates requires that the original image is padded with a row of zeros above and a column of zeros to the left. The summed-area table can then be constructed as described by Crow. However, the introduction of an additional row and an additional 71 column will alter the memory layout of the image and of the summed-area table. This has to be considered when accessing the elements in a computational context. 8.1 Construction Constructing the summed-area table can be done efficiently in two passes, each handling one of the two sums in the original definition. An intermediate summed-area table Sx is created in the first pass. Its elements store the sum of all pixels in the original image with lower or equal indices in the same row. Equation 8.5 shows this process. Sx [x, y] = x X I[i, y] (8.5) i=0 The process can be implemented efficiently in parallel as it is independent between rows. Figure 8.7 shows how the cumulative sum of every row is computed through an iterative process. Figure 8.7: An intermediate summed-area table is created in the first pass. Utilizing the intermediate summed-area table, the final summed-area table S can be constructed as shown in equation 8.6. S[x, y] = y X Sx [x, j] (8.6) j=0 Following the same procedure as the first pass, the final summed-area table can be constructed independently between columns. Figure 8.8 shows this iterative process. 72 Figure 8.8: The final summed-area table is created in the second pass. As cumulative sums of the original image need to be stored, the summed-area table requires wider data types for its elements. Assuming integer images with a bit depth per channel of bi , the summed-area table needs a bit depth per channel bs corresponding to the expression in equation 8.7. This in order to account for the rare worst case scenario in which every pixel of the original image is at maximum intensity. The additional row and column are for the bottom and right borders which are used for repeating textures, explained momentarily. bs = dlog2 ((2bi − 1) · (w + 1) · (h + 1))e ≈ dbi + log2 (w) + log2 (h)e (8.7) Given an image bit depth and a summed-area table bit depth, the maximum resolution can be computed as shown in equation 8.8. 2bs (w + 1) · (h + 1) = bi 2 −1 (8.8) Using a bit depth per channel of 8 for the image and 32 for the summed-area table yields a maximum resolution of roughly 16 million pixels. Textures are rarely this large and a few bits could be saved by limiting texture resolutions to a few megapixels. However, 32 bits will be nicely aligned in memory and the data type is native in most computational contexts, suggesting that 32-bit elements should be used for the summed-area table. 
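A minimal C sketch of the two-pass construction described by equations 8.5 and 8.6 is given below, using 8-bit input texels and 32-bit table elements. In the pipeline, the two passes would plausibly be OpenCL kernels operating on rows and columns in parallel; here they are plain loops, and the padding row and column are omitted for brevity.

#include <stdint.h>
#include <stdio.h>

/* Equations 8.5 and 8.6: build the summed-area table in two passes,
   first cumulative sums along the rows, then along the columns. */
static void build_sat(const uint8_t *img, uint32_t *sat, int w, int h)
{
    for (int y = 0; y < h; y++) {                        /* first pass: rows (equation 8.5) */
        uint32_t sum = 0;
        for (int x = 0; x < w; x++) {
            sum += img[y * w + x];
            sat[y * w + x] = sum;
        }
    }
    for (int x = 0; x < w; x++) {                        /* second pass: columns (equation 8.6) */
        uint32_t sum = 0;
        for (int y = 0; y < h; y++) {
            sum += sat[y * w + x];
            sat[y * w + x] = sum;
        }
    }
}

int main(void)
{
    const uint8_t img[2 * 2] = { 1, 2,
                                 3, 4 };
    uint32_t sat[2 * 2];
    build_sat(img, sat, 2, 2);
    /* Expected table: 1 3 / 4 10, i.e. the bottom-right element holds the total sum. */
    printf("%u %u\n%u %u\n", sat[0], sat[1], sat[2], sat[3]);
    return 0;
}

Note that the table elements in the sketch are 32-bit while the texels are 8-bit.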
This is a four-fold increase in memory requirement in comparison to the original image. It is worth noting that for integer images, the summed-area table can be constructed without introducing any quantization noise. This in contrast to image pyramids, both regular and anisotropic. 73 The original summed-area table is monotonically increasing with each element. For real-valued images, this becomes a problem as the floating-point data type was not designed to handle operations between values with large magnitude differences. An improvement was introduced in 2005 by Hensley et al. in [12]. If the texels of a channel in the original image are biased by subtracting the mean of all texels in that channel, the summed-area table is no longer monotonically increasing. This can be exploited for real-valued images as the box-filter is computed between samples with potentially large magnitude differences. For biased images, the box-filter is computed as normal and the bias added as a final step. This minimizes the precision issues when computing the summedarea table as well as the issues that may arise when bilinearly interpolating the summed-area table. 8.2 Sampling Method The previous filters have all been point-sampling filters, approximating the entire footprint by a single sample. This makes any assumptions about the shape of the footprint obsolete. For image pyramids, the size of the footprint was used only to select the appropriate texture levels in which to perform point-sampling. Sampling in summed-area tables is area-based. This makes both the size and the shape of the footprint of importance. As previously discussed, Crow conceived summed-area tables with axis-aligned box-filters in mind. As such, the original summed-area tables are only able to filter rectangular footprints aligned with the axes of the texture. With knowledge about the partial derivatives of the texture coordinates ∂s and ∂t, the axis-aligned rectangle can be computed in conjunction with the texture coordinates of the sample point sc and tc . This is shown in equation 8.9. ∂s 2 ∂s s1 = sc + 2 ∂t t0 = tc − 2 ∂t t1 = tc + 2 s0 = sc − 74 (8.9) For tile-based rendering pipelines, it is convenient to define the footprint as a square around the sample point in image space. This as a box-filter is to be applied to the underlying texture data. Figure 8.9 shows a tile to be textured where the grid of dots correspond to the sample points of the image. K L N M Figure 8.9: Summed-area tables work great in tile-based rendering contexts. This requires that the texture coordinates are perspectively interpolated to the corners of all footprint squares for each tile. For each tile, this results in (n + 1)2 interpolations instead of n2 interpolations where n is the number of sample points in a tile along one dimension. ~ L, ~ M ~ and N ~. In image space, the square is bounded by the four vertices K, In texture space, it can be distorted into a quadrilateral. This is shown in figure 8.10 in which each vertex contains the resolution-independent texture coordinates s and t. K L N M Figure 8.10: A square in image space can become a quadrilateral in texture space. Before the summed-area table can be sampled, the optimal axis-aligned rectangle footprint needs to be computed in texture space. This is done by computing the bounding box of the quadrilateral. Equation 8.10 shows how this is done. 
75 s0 s1 t0 t1 = min(Ks , Ls , Ms , Ns ) = max(Ks , Ls , Ms , Ns ) = min(Kt , Lt , Mt , Nt ) = max(Kt , Lt , Mt , Nt ) (8.10) The bounding box will in most cases have an area greater than that of the quadrilateral which will result in a blurry sample if it is used as the actual footprint. Blur is often preferred over visual artifacts, suggesting that the bounding box should be used as a footprint. However, it is possible to adjust the bounding box and by that its area so that it corresponds to the area of the quadrilateral. This can be achieved by first computing the center (sc , tc ) of the bounding box as shown in equation 8.11. s0 + s1 2 t0 + t1 tc = 2 sc = (8.11) The bounding box of the quadrilateral and its center are now defined by a three sets of resolution-independent texture coordinates. This is illustrated in figure 8.11. s0 t1 s1 K t0 tc sc L N M Figure 8.11: The bounding box of the quadrilateral and its center. The area of the quadrilateral Aq and the area of the rectangle Ar are computed as shown in equation 8.12. The absolute value is required for the area of the quadrilateral as it is possible for its orientation to be back-facing in texture space. 76 |(Ms − Ks ) · (Nt − Lt ) − (Ns − Ls ) · (Mt − Kt )| 2 Ar = (s1 − s0 ) · (t1 − t0 ) Aq = (8.12) A shrinking factor f is introduced as the square root of the ratio between the two areas as shown in equation 8.13. This factor will be less than or equal to one as the area of the rectangle always is greater or equal to that of the quadrilateral. r f= Aq Ar (8.13) The bounding box of the quadrilateral can be adjusted in accordance with the shrinking factor. Equation 8.14 shows a process which shrinks the bounding box around its center. s00 s01 t00 t01 = sc + f · (s0 − sc ) = sc + f · (s1 − sc ) = tc + f · (t0 − tc ) = tc + f · (t1 − tc ) (8.14) An adjusted bounding box, defined by the two sets of texture coordinates (s00 , t00 ) and (s01 , t01 ) is obtained from this process. Figure 8.12 illustrates this result. s0’ sc s1’ K t0’ tc t1’ L N M Figure 8.12: The adjusted bounding box of the quadrilateral and its center. The area of the adjusted bounding box can be computed as shown in equation 8.15. The equation also shows that this area will be exactly equal to that of the quadrilateral. 77 A0r = (s01 − s00 ) · (t01 − t00 ) = (f · (s1 − sc ) − f · (s0 − sc )) · (f · (t1 − tc ) − f · (t0 − tc )) = f 2 · (s1 − sc − s0 + sc ) · (t1 − tc − t0 + tc ) = f 2 · (s1 − s0 ) · (t1 − t0 ) Aq = · Ar Ar = Aq (8.15) As previously stated, adjusting the bounding box will cause visual artifacts. As such, the original rectangular footprint should be used for sampling the summed-area table, provided that the bounding box does not exceed the domain of the texture. The next section will provide details on how to bilinearly interpolate the summed-area table. 8.3 Bilinear Interpolation Ordinary bilinear interpolation can be used to sample the summed-area table provided that the image is padded with the corresponding wrapped texels at the bottom and right borders before the summed-area table is constructed. Similar to point-sampling, a table look-up function Sl is defined as shown in equation 8.16 where Wx and Wy are the wrapping functions previously defined and S the summed-area table. Sl [xi , yi ] = S[Wx [xi ], Wy [yi ]] (8.16) The scaled texture coordinate pair (x, y) is obtained through scaling the resolutionindependent texture coordinate pair (s, t) by w and h, respectively. This is shown in equation 8.17. 
8.3 Bilinear Interpolation

Ordinary bilinear interpolation can be used to sample the summed-area table provided that the image is padded with the corresponding wrapped texels at the bottom and right borders before the summed-area table is constructed. Similar to point-sampling, a table look-up function Sl is defined as shown in equation 8.16, where Wx and Wy are the wrapping functions previously defined and S is the summed-area table.

Sl[xi, yi] = S[Wx[xi], Wy[yi]]    (8.16)

The scaled texture coordinate pair (x, y) is obtained by scaling the resolution-independent texture coordinate pair (s, t) by w and h, respectively. This is shown in equation 8.17.

x = s · w
y = t · h    (8.17)

The integer and fractional parts are computed using the flooring and fractional functions as shown in equation 8.18.

xi = floor(x)
xf = frac(x)
yi = floor(y)
yf = frac(y)    (8.18)

If the image was padded before constructing the summed-area table, bilinear interpolation is performed in the same way as for point-sampling. Equation 8.19 shows how bilinear interpolation is performed using the table look-up function Sl four times.

B(x, y) = Sl[xi, yi] · (1 − xf) · (1 − yf)
        + Sl[xi + 1, yi] · xf · (1 − yf)
        + Sl[xi, yi + 1] · (1 − xf) · yf
        + Sl[xi + 1, yi + 1] · xf · yf    (8.19)

8.4 Clamping Textures

The bounding box of the quadrilateral is described by two resolution-independent texture coordinate pairs (s0, t0) and (s1, t1), both in texture space. For clamping textures, no sampling should occur past the four edges. Figure 8.13 shows a footprint which exceeds the domain of the texture at the bottom and right edges.

Figure 8.13: Clamping textures only sample region A.

For clamping textures, the only region of interest is the rectangle marked with A in the figure. This region can be computed by limiting the resolution-independent texture coordinate pairs to the domain of the texture as shown in equation 8.20.

s0' = min(max(s0, 0), 1)
s1' = min(max(s1, 0), 1)
t0' = min(max(t0, 0), 1)
t1' = min(max(t1, 0), 1)    (8.20)

The sum of all texels in region A is computed using bilinear interpolation as shown in equation 8.21, where the scaled texture coordinates x0, x1, y0 and y1 are obtained as normal. This is the modified definition for sampling the summed-area table as previously discussed.

Σ_A = B(x1, y1) − B(x0 − 1, y1) − B(x1, y0 − 1) + B(x0 − 1, y0 − 1)    (8.21)

The box-filter of the image enclosed within the region defined by A is computed as shown in equation 8.22.

I = Σ_A / ((x1 − x0 + 1) · (y1 − y0 + 1))    (8.22)

In total there are 16 table look-ups for clamping textures, a number equal to that of anisotropic image pyramids.
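The clamping case can be sketched in C as follows. The summed-area table is assumed to be a single-channel, row-major array of floats padded as described in section 8.3; the helper names, the clamping-to-zero behaviour at negative indices and the texel layout are assumptions of this sketch, not the pipeline's actual implementation.

#include <math.h>

/* Table look-up for a clamping texture. Indices below zero contribute
 * nothing to the cumulative sum; indices past the border are clamped. */
static float sat_lookup(const float *S, int w, int h, int xi, int yi)
{
    if (xi < 0 || yi < 0) return 0.0f;
    if (xi > w - 1) xi = w - 1;
    if (yi > h - 1) yi = h - 1;
    return S[yi * w + xi];
}

/* Equation 8.19: bilinear interpolation of the summed-area table. */
static float sat_bilinear(const float *S, int w, int h, float x, float y)
{
    int   xi = (int)floorf(x), yi = (int)floorf(y);
    float xf = x - floorf(x),  yf = y - floorf(y);
    return sat_lookup(S, w, h, xi,     yi)     * (1 - xf) * (1 - yf)
         + sat_lookup(S, w, h, xi + 1, yi)     * xf       * (1 - yf)
         + sat_lookup(S, w, h, xi,     yi + 1) * (1 - xf) * yf
         + sat_lookup(S, w, h, xi + 1, yi + 1) * xf       * yf;
}

/* Equations 8.20-8.22: clamp the footprint to the texture domain and
 * return the box-filtered intensity. */
float sat_box_filter_clamp(const float *S, int w, int h,
                           float s0, float s1, float t0, float t1)
{
    s0 = fminf(fmaxf(s0, 0.0f), 1.0f);  s1 = fminf(fmaxf(s1, 0.0f), 1.0f);
    t0 = fminf(fmaxf(t0, 0.0f), 1.0f);  t1 = fminf(fmaxf(t1, 0.0f), 1.0f);

    float x0 = s0 * w, x1 = s1 * w, y0 = t0 * h, y1 = t1 * h;

    float sum = sat_bilinear(S, w, h, x1,     y1)
              - sat_bilinear(S, w, h, x0 - 1, y1)
              - sat_bilinear(S, w, h, x1,     y0 - 1)
              + sat_bilinear(S, w, h, x0 - 1, y0 - 1);

    return sum / ((x1 - x0 + 1) * (y1 - y0 + 1));
}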
8.5 Repeating Textures

Crow mentioned that special considerations had to be made when sampling the summed-area table past the edges. As the summed-area table stores a cumulative sum, elements on the other side of the edges must be increased or decreased by the corresponding values. If the footprint intersects multiple edges, the elements must be increased or decreased with multiples of those values. This section provides a method which properly handles an arbitrary number of edge intersections in constant time.

For repeating textures, the sampling process becomes a little more complex. Figure 8.14 shows the same footprint exceeding the texture domain at the bottom and right edges. For repeating textures, all four regions A, B, C and D are of interest.

Figure 8.14: Repeating textures sample regions A, B, C and D.

To compute the sum in each region independently, two additional coordinates are needed. These two coordinates correspond to the width and height of the texture as shown in equation 8.23.

xe = w
ye = h    (8.23)

The sum of all texels in region A is computed as shown in equation 8.24, requiring four bilinear interpolations.

Σ_A = B(xe, ye) − B(x0 − 1, ye) − B(xe, y0 − 1) + B(x0 − 1, y0 − 1)    (8.24)

The sum of all texels in region B is computed using only two bilinear interpolations. This is shown in equation 8.25.

Σ_B = B(x1, ye) − B(x1, y0 − 1)    (8.25)

Conversely, the sum of all texels in region C is computed using two bilinear interpolations. Equation 8.26 shows this process.

Σ_C = B(xe, y1) − B(x0 − 1, y1)    (8.26)

The sum of all texels in the final region D is computed using a single bilinear interpolation as shown in equation 8.27.

Σ_D = B(x1, y1)    (8.27)

It is possible for the rectangular footprint to span across multiple entire textures, forming additional regions. However, the sample positions will be exactly the same as in the above example. Computing the correct sum is done by first determining the number of edges that are intersected by the footprint. Equation 8.28 shows how this is done.

Nx = floor(s1) − floor(s0)
Ny = floor(t1) − floor(t0)    (8.28)

Using these two numbers, the total sum for any rectangular footprint is computed as shown in equation 8.29. It should be emphasized that the expression in the equation will be exactly equal to that of clamping textures if both Nx and Ny are equal to zero, meaning that no edges were intersected in either direction.

Σ = B(x0 − 1, y0 − 1) − B(x1, y0 − 1) − B(x0 − 1, y1) + B(x1, y1)
  + B(xe, y1) · Nx − B(xe, y0 − 1) · Nx
  + B(x1, ye) · Ny − B(x0 − 1, ye) · Ny
  + B(xe, ye) · Nx · Ny    (8.29)

Identical to clamping textures, the box-filter is the total sum divided by the total area. This is shown in equation 8.30.

I = Σ / ((x1 − x0 + 1) · (y1 − y0 + 1))    (8.30)

The number of table look-ups will vary for this filter. In most cases, it will be 16 as for clamping textures. Should the footprint intersect the edges, the number of look-ups can be as large as 36. However, the number of look-ups is not related to the number of texels in the region of the box-filter. As such, the filter is evaluated in constant time with respect to the size of the footprint.

The extra coordinate pair xe and ye was defined to be exactly equal to the width and height of the texture. This makes it completely unnecessary to bilinearly interpolate samples which contain either xe or ye. Instead, two linear interpolation functions are defined as shown in equation 8.31.

Lx(x, y) = Sl[xi, yi] · (1 − xf) + Sl[xi + 1, yi] · xf
Ly(x, y) = Sl[xi, yi] · (1 − yf) + Sl[xi, yi + 1] · yf    (8.31)

In addition, samples which contain both xe and ye only require a single table look-up. The two linear interpolation functions provide a more efficient way of computing the sum of all texels in the region of the box-filter. This is shown in equation 8.32, where the sum is computed using at most 25 table look-ups, saving a total of 11 look-ups.

Σ = B(x0 − 1, y0 − 1) − B(x1, y0 − 1) − B(x0 − 1, y1) + B(x1, y1)
  + Ly(xe, y1) · Nx − Ly(xe, y0 − 1) · Nx
  + Lx(x1, ye) · Ny − Lx(x0 − 1, ye) · Ny
  + Sl[xe, ye] · Nx · Ny    (8.32)
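A corresponding sketch for repeating textures is given below, reusing the sat_bilinear helper and the <math.h> include from the previous sketch. It follows equations 8.28 through 8.30 directly; the look-up-saving variant of equation 8.32 is omitted for brevity, and taking the footprint area from the unwrapped coordinates is an assumption of this sketch so that edge-crossing footprints keep a positive area.

/* Box filter for repeating textures (equations 8.28-8.30).
 * Requires sat_bilinear() and <math.h> from the previous sketch. */
float sat_box_filter_repeat(const float *S, int w, int h,
                            float s0, float s1, float t0, float t1)
{
    /* Equation 8.28: number of texture edges crossed by the footprint. */
    float Nx = floorf(s1) - floorf(s0);
    float Ny = floorf(t1) - floorf(t0);

    /* Wrapped scaled coordinates; (xe, ye) marks the full texture (eq. 8.23). */
    float x0 = (s0 - floorf(s0)) * w, x1 = (s1 - floorf(s1)) * w;
    float y0 = (t0 - floorf(t0)) * h, y1 = (t1 - floorf(t1)) * h;
    float xe = (float)w, ye = (float)h;

    /* Equation 8.29: total sum over all wrapped regions. */
    float sum = sat_bilinear(S, w, h, x0 - 1, y0 - 1)
              - sat_bilinear(S, w, h, x1,     y0 - 1)
              - sat_bilinear(S, w, h, x0 - 1, y1)
              + sat_bilinear(S, w, h, x1,     y1)
              + (sat_bilinear(S, w, h, xe, y1) - sat_bilinear(S, w, h, xe, y0 - 1)) * Nx
              + (sat_bilinear(S, w, h, x1, ye) - sat_bilinear(S, w, h, x0 - 1, ye)) * Ny
              + sat_bilinear(S, w, h, xe, ye) * Nx * Ny;

    /* Equation 8.30: divide by the footprint area (extent taken before
     * wrapping so that edge-crossing footprints keep a positive area). */
    float area = ((s1 - s0) * w + 1.0f) * ((t1 - t0) * h + 1.0f);
    return sum / area;
}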
8.6 Higher-Order Filters

As stated, the original summed-area tables were designed with axis-aligned box-filters in mind. In many cases, box-filters are not sufficient as they weigh all samples within the support region of the filter equally. The need for higher-order filters is evident.

A generalization of the summed-area table was presented in 1986 by Heckbert in [13]. Heckbert recognized that the summed-area table was in fact an efficient implementation of box-convolution and extended the method to higher-order filters through convolution theory. Using the new method, a variable-width Gaussian-like filter could be implemented in constant time by integrating the original image multiple times, followed by applying a slightly different evaluation scheme on the resulting higher-order summed-area table.

However, handling borders still seems to be an unresolved issue, making the use of higher-order summed-area tables for regular texturing in three-dimensional computer graphics limited. Additionally, the required bit depth is significantly higher and as such, precision becomes an issue for higher-order summed-area tables. Equation 8.33 [12] shows the required bit depth per channel bs of a second-order summed-area table (equivalent to a Bartlett filter when sampled) for an integer image with a bit depth per channel of bi.

bs = ⌈log2((2^bi − 1) · w · (w + 1)/2 · h · (h + 1)/2)⌉
   ≈ ⌈bi + 2 · log2(w) + 2 · log2(h) − 2⌉    (8.33)

For an image with a resolution of 512 by 512 pixels and with a bit depth per channel of 8, the amount of required memory per channel and pixel is 42 bits [13]. For alignment purposes, 64-bit wide data types should be used for the second-order summed-area table. This is an eight-fold increase in memory in comparison to the original image.

Chapter E Pipeline Overview

So far, this report has provided a useful tool set which will come to use for a general graphics pipeline. The chapter on texture filtering discussed different problems associated with texture mapping and how to combat them. This chapter will detail the different stages of the actual graphics pipeline implemented.

1 Design Considerations

1.1 Deferred Rendering

The implemented pipeline employs a deferred rendering scheme for a number of different reasons, mainly because of its common use in modern computer graphics but also due to a number of benefits introduced by the deferred design.

A deferred renderer postpones all lighting computations until the entire scene has been rendered and, by that, visibility has been correctly determined. Lighting computations often require a great amount of processing power and it is beneficial to only light fragments that are guaranteed to be visible to the observer. Deferred rendering makes the complexity of lighting computations independent of overdraw and thereby only dependent on the resolution of the target image and on the number of light sources. The deferred rendering scheme also allows for a variety of post-processing effects such as depth of field and screen-space ambient occlusion.

However, postponing the lighting computations requires the pipeline to store fragments in a separate buffer. Rasterized fragments generally contain a great number of attributes such as albedo, normal and depth. These attributes must be stored on a per-pixel basis, increasing the memory requirement of the pipeline. If the fragment record is stored in an uncompressed format, its memory requirement may impose a drawback too large to motivate the deferred design. The implemented pipeline introduces a compressed record format which stores albedo (24 bits), normal (24 bits), specular power (8 bits), specular intensity (8 bits) and depth (32 bits). The compressed record requires 96 bits of storage in addition to the regular pixel buffer required to accumulate light contributions from the different light sources in the scene.

As opposed to forward rendering, deferred rendering suffers from the inability to properly handle blending. As such, blending is not included in the implemented pipeline and all models are assumed to be manifold. It should be noted that even traditional forward rendering suffers from problems with transparency. This is because transparency is dependent on the order in which the fragments are produced, something that cannot be correctly determined for forward rendering nor for deferred rendering.
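As an illustration of such a compressed record, one possible packing into three 32-bit words is sketched below. The exact bit layout used by the implemented pipeline is not documented here, so both the layout and the names are hypothetical.

#include <stdint.h>

/* A possible packing of the 96-bit fragment record described above:
 * albedo (24 bits), normal (24 bits), specular power (8 bits),
 * specular intensity (8 bits) and depth (32 bits). Illustrative only;
 * the pipeline's actual bit layout may differ. */
typedef struct {
    uint32_t word0;   /* albedo R8G8B8 | specular power     */
    uint32_t word1;   /* normal X8Y8Z8 | specular intensity */
    uint32_t word2;   /* depth as raw 32-bit float bits     */
} FragmentRecord;

FragmentRecord pack_fragment(uint8_t r, uint8_t g, uint8_t b,
                             uint8_t nx, uint8_t ny, uint8_t nz,
                             uint8_t spec_power, uint8_t spec_intensity,
                             float depth)
{
    FragmentRecord f;
    union { float f; uint32_t u; } d = { depth };
    f.word0 = (uint32_t)r | (uint32_t)g << 8 | (uint32_t)b << 16
            | (uint32_t)spec_power << 24;
    f.word1 = (uint32_t)nx | (uint32_t)ny << 8 | (uint32_t)nz << 16
            | (uint32_t)spec_intensity << 24;
    f.word2 = d.u;
    return f;
}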
1.2 Rasterization It is crucial to carefully evaluate the choice of rasterization method as its compliance with the underlying hardware will have a huge impact on the overall performance of the pipeline. For serial architectures, a number of different rasterization methods would be suitable. For parallel architectures, data contamination may become a serious problem. If triangles are rasterized on a parallel architecture using a method suitable only for serial architectures, two executed elements may produce fragments at the same location in the target image at roughly the same time. A race condition occurs where a seemingly random process determines which fragment that is stored in the target buffer. This introduces spatial noise in the image and temporal noise between different frames and is an unacceptable flaw. For serial architectures, scanline rasterization is a common method. For the method, horizontal spans of fragments between the edges of the triangle is generated for each scanline. The scanline method iteratively processes each scanline and each span and stores the resulting fragments or pixels in the target buffer. As the rasterization is performed by a single process, no data contamination can occur, making it suitable for serial architectures only. Pipelines executing on parallel architectures must take data contamination into consideration. As such, the implemented pipeline employs a hierarchical tile-based rasterization scheme where the screen is subdivided into multiple levels of tiles and no two processes may operate on the same tile at the same time. This completely eliminates the data contamination problem. 86 The tile-based scheme also handles triangles of different sizes elegantly. A scene may consist of pixel-sized triangles and of triangles covering a larger portion of the target image. If each rasterization process were to operate on a single triangle, the work load would differ greatly between processes. This is something that is undesirable for parallel architectures, motivating the choice of the tile-based rasterization scheme. 1.3 Batch Rendering The pipeline operates on batches of isolated triangles. Each triangle consists of three vertices and stores additional data required in the different stages of the pipeline. To supply a triangle, the application programmer invokes the corresponding interface method. The method buffers every triangle queued for rendering in host memory and ensures isolation by only accepting a single triangle at once. Isolating the triangles will enable parallel processing of each individual triangle at the expense of processing a greater number of vertices during the initial stages. This is a trade-off which favors parallelism and data locality and by that helps to utilize the parallel architecture of the graphics processing unit and its cache hierarchy. When the full capacity of the host buffer is reached or when the pipeline is explicitly instructed through the interface to perform the rendering, the buffer is flushed and the rendering is initialized. This process starts with the triangle buffer being copied to device memory and proceeds through a series of kernels which will be explained in detail throughout the rest of this chapter. The entire process is transparent to the application programmer in a fashion similar to that of the Open Graphics Library. 
2 Vertex Transformation

Whenever a triangle is queued for rendering, the active transformation matrix is polled from the transformation sub-system and stored together with the triangle. The transformation matrix corresponds to the pre-multiplied transformation chain representing the previous transformations as handled by the transformation sub-system. It is used to transform a triangle from object space to camera space. This is equivalent to the model view matrix in the Open Graphics Library with the addition of a camera system.

The vertex transformation stage applies the object space to camera space matrix to the three vertices of every triangle currently in the pipeline. This is performed through ordinary matrix-vector multiplication. The stage is executed as a one-dimensional kernel with each work item processing a single triangle. Work items are issued in groups of 32.

3 View Frustum Clipping

As the triangles have been transformed to camera space, view frustum clipping can be performed as detailed in the mathematical foundation. The stage is executed as a one-dimensional kernel where each work item processes a single triangle and 32 work items form a work group.

The triangulation associated with view frustum clipping appends every generated triangle to an additional triangle buffer in global memory using atomic operations. In theory, this buffer needs to be five times larger than the batch size to account for the worst case scenario in which every triangle is clipped into a polygon with seven vertices. However, this is considered extremely rare and as such, the two buffers are of equal size.

As the process completes, the number of generated triangles needs to be read from device memory in order to initialize subsequent kernels. Should the number of triangles be equal to zero, no further kernels are launched, the interface is reset and control is returned to the application.

4 Triangle Projection

With triangles clipped against the view frustum, no vertices will be located in front of the near plane. As such, it is safe to project the vertices into the axis-aligned plane of the target image.

The projection stage is executed similarly to the previous stages. A one-dimensional kernel is launched with each work item processing a single triangle and with 32 work items grouped into a work group. Every vertex is projected into an integer raster with sub-pixel precision using the expressions in equation 4.1. In the equation, p corresponds to the sub-pixel precision while (xr, yr) is the vertex in raster space and (xc, yc, zc) the vertex in camera space. Sub-pixel precision of the geometry is crucial for good results in the subsequent rasterization stage.

xr = p/2 · (w + min(w, h) · xc/zc)
yr = p/2 · (h − min(w, h) · yc/zc)
zr = 1/zc    (4.1)

The expressions in the equation effectively project the triangles into the axis-aligned plane of the target image. For each vertex, the inverse depth is also stored for use when interpolating the vertex attributes in a perspectively-correct fashion as described in the next section.

The projection kernel also computes the signed double area of every triangle using the integer coordinates in raster space. The double area is stored together with the triangle and will be used heavily in a subsequent kernel. This is one of the major reasons for clipping every triangle against the entire view frustum. With high sub-pixel precision, the raster coordinates will be of large magnitudes and the signed double area of even larger magnitude. Without view frustum clipping, these variables would eventually overflow for triangles extending far beyond the left, right, bottom or top edges of the target image.

The signed double area also provides an efficient method for performing back-face culling. Since only triangles with a positive area will face the camera, all triangles with zero or negative area can be culled from further processing. This is possible because the pipeline is designed for deferred rendering with no transparency or blending effects. Every triangle with an area greater than zero is appended to a final triangle buffer in global memory using atomic operations. In addition, front-face culling can be achieved by culling all triangles with a non-negative signed double area. However, this is not implemented in the pipeline for the time being.

The number of triangles in the final triangle buffer is read from device memory and used to initialize subsequent kernels. Should this number be zero, no further kernels are launched, the interface is reset and control is returned to the application.
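The projection of equation 4.1 and the signed double area used for back-face culling can be sketched in C as follows. This is an illustration only; the actual stage runs as an OpenCL kernel over the triangle records, and the truncation to the integer raster as well as all names are assumptions of this sketch.

/* Projection of a camera-space vertex into the sub-pixel raster
 * (equation 4.1). p is the sub-pixel precision, (w, h) the target
 * resolution in pixels. */
typedef struct { int xr, yr; float zr; } RasterVertex;

RasterVertex project_vertex(float xc, float yc, float zc, int w, int h, int p)
{
    RasterVertex v;
    float m = (float)(w < h ? w : h);
    v.xr = (int)(0.5f * p * (w + m * xc / zc));
    v.yr = (int)(0.5f * p * (h - m * yc / zc));
    v.zr = 1.0f / zc;   /* inverse depth, used for perspective-correct interpolation */
    return v;
}

/* Signed double area in the integer raster; triangles with a
 * non-positive value are culled as back-facing or degenerate. */
long signed_double_area(RasterVertex a, RasterVertex b, RasterVertex c)
{
    return (long)(b.xr - a.xr) * (c.yr - a.yr)
         - (long)(c.xr - a.xr) * (b.yr - a.yr);
}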
5 Rasterization

Rasterization is essentially the process of breaking down the projected triangles into pixel-sized fragments. This process uses integer arithmetic whenever possible to avoid precision issues associated with floating-point arithmetic.

The stage is initialized through a one-dimensional kernel where each work item processes one of the visible triangles in the final triangle buffer. The kernel is executed in work groups of 32 work items and is responsible for the preparation of different attributes required for the rasterization process.

The bounding box of every triangle is computed for three different levels. These three levels are pixel, small tile and large tile. This is computed through a series of minimum and maximum functions on the raster coordinates of the three vertices of every triangle. The three levels form a sort-middle hierarchical rasterization refinement scheme, similar to the scheme described in 2011 by Laine and Karras in [2]. The kernel is also responsible for computing the three edge normals of every triangle as well as projecting the vertex attributes for interpolation across the surface.

5.1 Large Tiles

The kernel responsible for large tile rasterization is one-dimensional with one work item for each triangle and is enqueued with a work group size of 32. The group size of 32 is used primarily since it enables efficient use of the local memory as described momentarily. Figure 5.1 shows a triangle projected onto the plane of the target image and how the screen is sub-divided by large tiles.

Figure 5.1: The target image is sub-divided by large tiles.

Every work item is assigned the task of determining which large tiles are partially or completely overlapping with the triangle. To achieve this, a work item iterates through the large tiles inside the bounding box at the large tile level. For every overlapping tile, a local memory buffer is updated with the corresponding information. Figure 5.2 shows the tiles that are processed.

Figure 5.2: Only large tiles within the bounding box are processed.

In a first pass, every large tile corner in the bounding box is projected against the three edge normals of the triangle and the resulting magnitudes are stored in private memory. In the second pass, the separating axis theorem is applied and tiles are classified as either outside, inside or overlapping.
This information is stored in a local memory bit matrix using atomic operations on a layer of 32-bit integers which corresponds to the triangles of the current work group.

Figure 5.3: The corners of the large tiles are projected against the edge normals.

It should be emphasized that the three edge normals of the triangle are sufficient for the separating axis theorem as only large tiles within the bounding box are processed. As such, no separation can occur along the edge normals of the large tiles.

The first work item in a work group clears the bit matrix before any other work items are allowed to proceed with their tasks. It is also responsible for writing the resulting bit matrix into global memory when all work items in the work group have completed. This requires two synchronizations between the work items of a work group and is achieved through the use of barriers. No other work group will modify the layer of the bit matrix corresponding to the current work group. As a result, it can be written into global memory without having to rely on atomic operations.

This step of the rasterization process introduces a limitation on the graphics pipeline. As the amount of local memory on a streaming multi-processor is limited and an entire tile layer needs to be resident simultaneously, the resolution will be directly limited by the size of the large tiles. To compute the maximum possible resolution with a given aspect ratio α, a large tile size slarge and a maximum number of large tiles tmax, equation 5.1 is used.

wmax = ⌊√(tmax · α)⌋ · slarge
hmax = ⌊√(tmax / α)⌋ · slarge    (5.1)

It is safe to assume that the amount of local memory mlocal is at least 16 kilobytes, as specified as a requirement in the specification of the Open Computing Language. A tile layer is composed of 32 triangles, each requiring two bits of memory. This amounts to 8 bytes per tile and layer. Using this information, the maximum number of tiles in a layer can be computed using equation 5.2.

tmax = ⌊mlocal / 8⌋    (5.2)

Assuming 16 kilobytes of local memory yields a maximum of 2048 large tiles. If these are arranged according to the common 16:9 (1.78) aspect ratio, the maximum number of large tiles is 60 by 33. This imposes a limit on the resolution which, with a large tile size of 64 pixels, is 3840 by 2112. As this is well over modern monitor resolutions, the limitation is minimal. It should also be noted that the large tile size can be increased to support higher resolutions.
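The resolution limit of equations 5.1 and 5.2 can be reproduced with a few lines of C. The helper below is hypothetical, but for 16 kilobytes of local memory, a 16:9 aspect ratio and 64-pixel large tiles it yields the 3840 by 2112 limit quoted above.

#include <math.h>
#include <stdio.h>

/* Maximum target resolution permitted by the local-memory bit matrix,
 * following equations 5.1 and 5.2. Names are hypothetical. */
void max_resolution(int local_mem_bytes, double aspect, int tile_size,
                    int *w_max, int *h_max)
{
    int t_max = local_mem_bytes / 8;                       /* eq. 5.2 */
    *w_max = (int)floor(sqrt(t_max * aspect)) * tile_size; /* eq. 5.1 */
    *h_max = (int)floor(sqrt(t_max / aspect)) * tile_size;
}

int main(void)
{
    int w, h;
    max_resolution(16 * 1024, 16.0 / 9.0, 64, &w, &h);
    printf("%d x %d\n", w, h);   /* prints 3840 x 2112 for these values */
    return 0;
}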
5.2 Small Tiles

The first rasterization refinement step is executed as a three-dimensional kernel. Every work group is responsible for sub-dividing a single large tile of a layer into multiple small tiles. As with the previous step, a work group consists of 32 work items, each corresponding to one of the original triangles. This allows modifications of the bit matrix to be carried out with atomic operations on a local memory buffer. Figure 5.4 shows a large tile to be processed by a work group and the small tiles enclosed within.

Figure 5.4: Every work group processes a single large tile.

The small tile rasterization process shares many fundamental ideas with the previous step and starts by querying the overlap status of the current large tile. Tiles without overlap are not processed further and the work item completes its task. For tiles with complete overlap, every enclosed small tile is set to complete overlap as implied by the large tile.

The tiles that were classified as partially overlapping are the ones in need of an actual refinement. This is identical to how Seiler et al. successively refine the rasterization of a triangle. The large tile in the previous figure is partially overlapping with the triangle. As such, it needs refinement. Figure 5.5 shows the small tiles within the large tile and within the bounding box of the triangle.

Figure 5.5: Only small tiles within the bounding box are processed.

An iteration is made through all corners of all small tiles within the large tile and within the bounding box of the triangle at the corresponding level. The corners are projected against the three edge normals and the magnitudes are stored in private memory. A second iteration is made where the separating axis theorem is used to classify the small tiles as either inside, outside or partially overlapping. This is essentially the same process as for large tiles. Figure 5.6 shows the tile corners that are projected against the edge normals of the triangle.

Figure 5.6: The corners of the small tile are projected against the edge normals.

The first work item in the work group clears the local memory buffer and is responsible for writing all the small tiles into global memory in the same fashion as earlier. As for large tiles, no atomic operations are needed for the global write due to the work group size.

5.3 Pixels

The final refinement step is the actual pixel rasterization. This is done differently in comparison to the two earlier rasterization steps. A two-dimensional kernel is launched with each work item operating on a single small tile. The size of a work group is equal to the number of small tiles in a large tile and each work item is responsible for computing the pixel coverage of every triangle located in that small tile. This provides exclusive access to the target image in the region of that tile. As such, every work item is able to read and write the target image without conflicting with other work items. This comes at the expense of load balancing, which will come into the discussion at a later stage in this report.

Figure 5.7 shows a small tile that a single work item is processing. The pixel coverage is to be determined for all triangles intersecting that tile. This is done in sequence for all triangles marked as intersecting the specific tile.

Figure 5.7: Every work item processes a single small tile.

In contrast to the two other rasterization steps, all computations are performed at the sample points of each pixel as shown in figure 5.8. In a first pass, the signed double areas 2 · Au, 2 · Av and 2 · Aw are computed. The signed areas are computed using integer coordinates in the sub-pixel raster, which completely eliminates problems with floating-point precision.

Figure 5.8: The signed areas are computed for every sample point.

A set of rasterization rules must be defined which together determine if the triangle covers the sample point. From the definition of the barycentric coordinates, the inside of the triangle is defined where all three signed areas are positive as shown in equation 5.3. This is also true for the signed double areas.

Au > 0
Av > 0
Aw > 0    (5.3)

If an edge is shared by two triangles, there might be some pixels along the edge where neither of the two triangles covers the sample point when using the above rasterization rules. However, both triangles will be tangent to the point and the corresponding signed area will be zero for the two triangles. This will generate holes in the rasterization and must be corrected.
The set of rasterization rules must incorporate additional criteria in order to determine which of the two triangles overlaps the unrasterized pixels in the above example. This must be done in a consistent fashion. No pixels should be rasterized twice and no pixels should be left unrasterized.

The edge normals of the two triangles sharing an edge will be oriented in opposite directions if both triangles are front-facing. Since the pipeline only accepts front-facing triangles, this information can be used to determine pixel coverage in a consistent fashion. Equation 5.4 shows a set of stronger rasterization rules which ensure that a pixel is rasterized exactly once by incorporating the information of the edge normals. In the equation, Nu, Nv and Nw are the edge normals of the edges opposite to the triangle vertices U, V and W, respectively.

Au > 0 ∨ (Au = 0 ∧ (Nu,x > 0 ∨ (Nu,x = 0 ∧ Nu,y > 0)))
Av > 0 ∨ (Av = 0 ∧ (Nv,x > 0 ∨ (Nv,x = 0 ∧ Nv,y > 0)))
Aw > 0 ∨ (Aw = 0 ∧ (Nw,x > 0 ∨ (Nw,x = 0 ∧ Nw,y > 0)))    (5.4)

The barycentric coordinates of a covered pixel are obtained by dividing the three signed double areas Au, Av and Aw by the pre-computed signed double area of the entire triangle. This set of coordinates is used to interpolate all vertex attributes.

6 Differentiation of Texture Coordinates

For every covered pixel, the vertex attributes are interpolated in a perspectively-correct fashion as detailed in the mathematical foundation. This includes the texture coordinates, the depth value and the tangent space bases when normal mapping is activated.

The pipeline supports the three advanced texture filters detailed in the chapter on texture filtering. In order to use these filters, the derivatives of the texture coordinates are needed. In a computational context, derivatives are approximated through finite differences. This differentiation can be done in a number of different ways since the pipeline is tile-based. Figure 6.1 shows a single pixel of a small rasterization tile. One set of texture coordinates is not enough for differentiation; information from adjacent pixels is required.

Figure 6.1: A single pixel is not enough for differentiation.

In the pipeline, the problem is solved by differentiating the interpolated texture coordinates between groups of four pixels. The differentiation is done along the diagonals as shown in figure 6.2. This is different from how the hardware pipelines handle differentiation of the texture coordinates.

Figure 6.2: A group of four pixels enables various differentiation schemes.

Equation 6.1 shows how the partial derivatives are approximated from the maximum absolute differences between the texture coordinates along the diagonals. Using the maximum absolute differences is a trade-off which favors blur over noise. The division by √2 is present due to the distance over which the differentiation is made. The diagonal is √2 units long and this needs to be accounted for.

∂s ≈ max(|Cs − As|, |Ds − Bs|) / √2
∂t ≈ max(|Ct − At|, |Ds − Bt|) / √2    (6.1)

The resulting partial derivatives of the texture coordinates are used for all four pixels in the group. All active texture maps are sampled using the texture coordinates at pixel (i, j) and the partial derivatives for the group in which the pixel is a member. When normal mapping is activated, the sampled normal is transformed by the interpolated tangent space bases followed by a transformation by the object space to camera space matrix. The results are stored in the fragment buffer and are ready for lighting in a separate pass.
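A minimal sketch of the diagonal differentiation in equation 6.1 is given below, assuming the 2x2 pixel group is laid out as in figure 6.2 with A in the upper left, B in the upper right, C in the lower right and D in the lower left; the function and array names are hypothetical.

#include <math.h>

/* Finite-difference approximation of the texture coordinate derivatives
 * for a 2x2 pixel group, following equation 6.1. Indices: 0 = A, 1 = B,
 * 2 = C, 3 = D. Differences are taken along the two diagonals A-C and
 * B-D and divided by the diagonal length sqrt(2). */
void texcoord_derivatives(const float s[4], const float t[4],
                          float *ds, float *dt)
{
    *ds = fmaxf(fabsf(s[2] - s[0]), fabsf(s[3] - s[1])) / sqrtf(2.0f);
    *dt = fmaxf(fabsf(t[2] - t[0]), fabsf(t[3] - t[1])) / sqrtf(2.0f);
}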
7 Fragment Illumination

Fragment illumination is launched as a two-dimensional kernel with each work item processing a single pixel in the fragment buffer. The kernel operates in work groups of 32 work items and uses the Phong illumination model as introduced in 1975 by Phong in [14]. Phong observed that the spatial frequencies of the specular highlights were often low in photographic images and aimed to improve the standard illumination model used for computer generated images at that time.

For the Phong illumination model, the vertex normals are linearly interpolated across the surface of a triangle. For every fragment to be illuminated, the interpolated normal N is used in conjunction with the light vector L and the reflected observer vector R. These vectors are shown in figure 7.1. The observer vector O is not used explicitly in the illumination model, only its reflected counterpart.

Figure 7.1: The vectors used in the Phong illumination model.

The point X is obtained by reversing the projection of the raster coordinates as shown in equation 7.1.

Xx = (2 · xr/p − w) / (min(w, h) · zr)
Xy = (h − 2 · yr/p) / (min(w, h) · zr)
Xz = 1/zr    (7.1)

By using the point in conjunction with the position of the light source S, the light vector L can be computed as shown in equation 7.2. The equation also shows how the point is reflected against the unit normal to obtain R.

L = S − X
R = X − 2 · (X • N) · N    (7.2)

The squared length of the light vector is stored and the vectors L and R are normalized to unit length. The two angles present in the previous figure are computed from these unit vectors as shown in equation 7.3, where L̂ and R̂ denote the normalized vectors.

α = max(L̂ • N, 0)
β = max(L̂ • R̂, 0)    (7.3)

The ambient, diffuse and specular illuminations are computed from the material properties as shown in equation 7.4 [14]. In the equation, n is the glossiness factor. Materials with high glossiness will produce small specular highlights with high intensities while materials with low glossiness will produce large specular highlights with low intensities.

ia = Ma
id = Md · α
is = Ms · β^n    (7.4)

Due to the nature of a deferred rendering pipeline, varying material properties have a great impact on the memory requirement as opposed to traditional forward rendering. Every material property needs to be stored per-pixel, which is not desirable. As such, the ambient factor Ma and the glossiness factor n are constant for all fragments. In addition, the diffuse factor Md is computed from the specular factor Ms as 1 − Ms.

The total intensity i is computed as shown in equation 7.5, where Ss is the strength of the light source. The strength is uniform across the spectrum of the light source.

i = ia + Ss · (id + is) / |L|²    (7.5)

Finally, the target image is updated with the light contribution of the current light source. The spectral distribution of the light source is mixed with the spectral distribution of the material and illuminated by the intensity i. The light contribution is added to the light contribution from previous fragment illumination passes and stored in the target image. This is shown in equation 7.6.

r' = r + Mr · Sr · i
g' = g + Mg · Sg · i
b' = b + Mb · Sb · i    (7.6)

For multiple light sources, the fragment illumination pass is run in sequence, once for every light source, until all light sources have contributed to the final image.
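The per-fragment computation of equations 7.2 through 7.5 can be summarized by the C sketch below. It is an illustration of the illumination model rather than the kernel itself, and the small vector helpers and function names are hypothetical.

#include <math.h>

/* Per-fragment Phong intensity following equations 7.2-7.5.
 * X is the reconstructed camera-space point, N the unit normal,
 * S the light position, Ss the light strength, Ma/Ms/n the material
 * terms (Md is derived as 1 - Ms, as described above). */
typedef struct { float x, y, z; } Vec3;

static Vec3  v_sub(Vec3 a, Vec3 b) { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static float v_dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  v_scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }
static Vec3  v_norm(Vec3 a) { return v_scale(a, 1.0f / sqrtf(v_dot(a, a))); }

float phong_intensity(Vec3 X, Vec3 N, Vec3 S,
                      float Ss, float Ma, float Ms, float n)
{
    /* Equation 7.2: light vector and reflected observer vector. */
    Vec3 L = v_sub(S, X);
    Vec3 R = v_sub(X, v_scale(N, 2.0f * v_dot(X, N)));

    float dist2 = v_dot(L, L);              /* squared light distance */
    Vec3  Ln = v_norm(L), Rn = v_norm(R);

    /* Equation 7.3: clamped angles. */
    float alpha = fmaxf(v_dot(Ln, N),  0.0f);
    float beta  = fmaxf(v_dot(Ln, Rn), 0.0f);

    /* Equation 7.4: ambient, diffuse and specular terms (Md = 1 - Ms). */
    float ia = Ma;
    float id = (1.0f - Ms) * alpha;
    float is = Ms * powf(beta, n);

    /* Equation 7.5: total intensity with distance attenuation. */
    return ia + Ss * (id + is) / dist2;
}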
100 Chapter F Results 1 Rendered Images Using the functionality of the implemented three-dimensional graphics pipeline, a set of images were rendered. These images are intended to demonstrate the the capabilities of the pipeline and were rendered with a resolution four times greater than the target image. The target image was bilinearly interpolated using the texturing filtering of the Open Graphics Library when displayed on the screen. The first set demonstrates the deferred rendering of the pipeline. The tank zombie from the game Left for Dead was used as a test model. It contains 2 894 vertices and 5 784 triangles. Figure 1.1 shows the model illuminated by three white light sources. 101 Figure 1.1: The model illuminated by three white light sources. The different channels of the fragment buffer can be visualized using the application. Figure 1.2 shows the albedo (color) channels. Figure 1.2: The albedo (color) channels of the fragment buffer. Figure 1.3 shows the normal channels. As the coordinate system is left-handed, red intensities indicate surface normals directed to the right. Conversely, green intensities indicate surface normals directed upwards and blue intensities surface normals directed into the image plane. The normals are defined in observer space. 102 Figure 1.3: The normal channels of the fragment buffer. Figure 1.4 shows the specular channel of the fragment buffer. It indicates the specular reflectance of the visible fragments. Higher intensities will transfer light specularly while lower intensities will transfer light diffusely. Figure 1.4: The specular channel of the fragment buffer. The final channel is the depth channel. It is shown in figure 1.5. Inverse, non-linear depth is used as it can be interpolated linearly across the surface of every projected triangle. Higher intensities correspond to fragments located closer to the observer. 103 Figure 1.5: The depth channel of the fragment buffer. 2 Texture Filtering The second set of rendered images demonstrate the texture filtering techniques, starting with the filters suitable for both texture magnification and minification. Figure 2.1 shows a surface textured using the image pyramid filter. This is equivalent to how textures are filtered in the Open Graphics Library [9]. Figure 2.1: A surface textured using the image pyramid filter. 104 Figure 2.2 shows a surface textured using the extended anisotropic image pyramid filter. Please note how the far details become sharper and more distinct. Figure 2.2: A surface textured using the anisotropic image pyramid filter. Figure 2.3 shows the same surface textured using the final of the three texture filters suitable for both texture magnification and minification, the summedarea table. Similar to anisotropic image pyramids, the far details become sharper. Yet, there are slight differences. The image produced using summed-area tables is slightly more bright at far distances in comparison to the previous two images. This as the image pyramids were generated without gamma-correction. As such, the brightness will drift with every down-sampled texture level. This does not happen when using summed-area tables. 105 Figure 2.3: A surface textured using the summed-area table. The different texture levels of the two image pyramids can be visualized using the application. Figure 2.4 shows the distinct transitions between the texture levels of the regular image pyramid. Figure 2.4: The texture levels for the regular image pyramid. 
Figure 2.5 shows the corresponding transitions for the anisotropic image pyramid. As the levels are selected from a two-dimensional structure of textures with different resolutions, the visualization uses two colors: red corresponding to the horizontal level and green corresponding to the vertical level.

Figure 2.5: The texture levels for the anisotropic image pyramid.

The third and final set of images demonstrates the texture filters suitable for texture magnification. Figure 2.6 shows the nearest neighboring texel filter. At close range, the individual texels become visible.

Figure 2.6: A surface textured using the nearest neighboring texel filter.

The bilinear interpolation filter is far better at magnifying textures as shown in figure 2.7. When implemented correctly, the two image pyramids and the summed-area table actually default into bilinear interpolation filters at close range, producing results equal to those shown in the figure.

Figure 2.7: A surface textured using the bilinear interpolation filter.

3 Benchmark Data

3.1 Rasterization

The raw rasterization performance is of great importance for the graphics pipeline. As such, the rasterization scheme was evaluated through a series of tests. For the tests, the large tiles were fixed at 64 by 64 pixels and the small tiles at 4 by 4 pixels. A total of 5 000 frames were rendered with no textures bound and no view-frustum clipping. For every frame, a single quadrilateral covering the entire screen was rendered. Table 3.1 shows the resulting data.

Resolution (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
128x128           0.10         0.70         0.39         1.2
256x256           0.12         0.78         0.76         1.7
512x512           0.24         0.93         1.86         3.0
1024x1024         0.65         1.22         6.09         8.0
2048x2048         1.13         1.63         10.91        13.7

Table 3.1: Benchmark data for different resolutions.

3.2 Load Balancing

In order to test the load balancing of the tile-based rasterization scheme, two separate tests were conducted. The first test was designed to evaluate how the pipeline handles a fully covered screen with no overdraw. For this, a single quadrilateral composed of two triangles was rendered. These two triangles covered the entire screen and view frustum clipping was disabled. A series of 5 000 frames with a resolution of 800 by 600 pixels were rendered in total. Table 3.2 shows the average processing time per frame for the three rasterization steps and for a few different tile sizes. No textures were bound to the pixel rasterization kernel.

Large (px)   Small (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
64x64        4x4          0.38         1.05         3.17         4.6
64x64        8x8          0.38         0.34         5.27         6.0
64x64        16x16        0.38         0.15         9.64         10.2
128x128      8x8          0.17         0.80         5.17         6.1
128x128      16x16        0.17         0.26         7.65         8.1

Table 3.2: Benchmark data for different tile sizes with a few large triangles.

The second test was designed to evaluate how the pipeline handles the case where only a small region of the screen is covered by a large number of triangles and with overdraw. A series of 500 frames with a resolution of 800 by 600 pixels were rendered. No textures were bound when measuring the average execution time of the rasterization kernels, shown in table 3.3. The tank zombie model was rendered from a viewpoint where it covered 6.7 percent of the total screen pixels. All triangles were rendered in a single batch.
Large (px)   Small (px)   Large (ms)   Small (ms)   Pixel (ms)   Total (ms)
64x64        4x4          1.65         9.55         27.20        38.4
64x64        8x8          1.65         4.45         171.52       177.6
64x64        16x16        1.65         3.38         640.06       645.1
128x128      8x8          1.53         4.16         111.97       117.6
128x128      16x16        1.53         2.94         658.32       662.8

Table 3.3: Benchmark data for different tile sizes with many small triangles.

3.3 Texture Mapping

The texture mapping sub-system is likely to be dependent on the level of cache utilization. The Fermi architecture uses cache lines of 128 bytes, has an L2 cache of 768 kilobytes and a configurable L1 cache of 64 kilobytes [5]. A series of tests were conducted to measure how these caches were utilized for the linear memory layout of the textures.

For the tests, a single quadrilateral covering the entire screen was rendered. The quadrilateral was textured by an albedo texture with a resolution of 512 by 512 pixels. The screen resolution was set to the identical resolution and the image pyramid filter was used. The texture was set to the repeat wrapping mode and the number of tiles was altered for each test. In theory, performance should scale with the number of tiles as the texture resolution will decrease, providing better cache utilization.

A total of 5 000 frames were rendered for each test and a separate test with no textures bound was conducted. With no textures, the average execution time of the pixel rasterization kernel was 1.88 ms. The execution times for the texturing sub-system were computed as the differences between the total execution times and this duration. This is shown in table 3.4.

Level   Pixel (ms)   Texturing (ms)
0       2.76         0.88
1       2.74         0.86
2       2.67         0.79
3       2.61         0.73
4       2.56         0.68
5       2.53         0.65
6       2.50         0.62
7       2.50         0.62
8       2.50         0.62
9       2.49         0.61

Table 3.4: Benchmark data for different texture levels.

Chapter G Discussion

1 Memory Requirements

As a result of the tile-based rasterization scheme and the deferred design of the graphics pipeline, memory requirements have to be carefully evaluated. In addition, the triangles in the active rendering batch require a memory buffer in device memory.

1.1 Triangles

The triangle record occupies a total of 428 bytes of device memory. Of this, 64 bytes are used to store the transformation matrix and 108 bytes to store the three per-vertex tangent space bases. Other data stored in the record is the three edge normals, different vertex attributes and bounding boxes for the different levels of the hierarchical rasterization scheme.

The 428 byte memory footprint per triangle record is large, but the triangle buffer only needs to be allocated to contain a reasonably small number of triangles. High occupancy levels can be reached using a batch size greater than or equal to a small multiple of the number of cores available on the device, provided that the pipeline is able to distribute the work load evenly. For instance, the Fermi architecture contains at most 480 cores. Rendering all triangles in batches of a few thousand triangles should be sufficient. With this in mind, the pipeline was set to process 8 192 triangles in each batch. This amounts to a total memory footprint of 3 506 176 bytes. Two such buffers are required for the different stages in the pipeline.

The memory requirement for the triangle buffers is not extremely large, yet it can be limited through the introduction of a separate vertex buffer. A large portion of the triangle record stores per-vertex data. The ordinary valence of a vertex is six, meaning that each vertex is shared by approximately six triangles.
The introduction of a vertex buffer could decrease the size of the triangle record significantly as well as increase the overall performance of some stages in the pipeline. In addition, the pipeline could be restricted to only handle triangle batches with equal transformation matrices. This would remove the need of storing the transformation matrix per-triangle and thereby decrease the size of the record.

1.2 Tiles

During the rasterization process, triangle and tile overlap is stored using 2 bits per triangle and tile as mentioned earlier. As such, the amount of required data is dependent on the sizes of the two different tiles, on the batch size as well as on the screen resolution. With knowledge of the screen resolution w and h, the size of a large tile sl and the batch size sb, the memory requirement for the large tiles ml can be computed using equation 1.1.

ml = ⌈w / sl⌉ · ⌈h / sl⌉ · (sb / 32) · 8    (1.1)

Using this equation with a batch size of 8 192 for a few different resolutions and large tile sizes yields the data in table 1.1. As seen in the table, the memory requirements are fairly low, even for higher resolutions.

Resolution (px)   Large (px)   Required Memory (bytes)
800x600           64x64        266 240
800x600           128x128      71 680
1024x768          64x64        393 216
1024x768          128x128      98 304
1920x1080         64x64        1 044 480
1920x1080         128x128      276 480

Table 1.1: Memory requirements for the large tiles for a few different resolutions.

The size of a large tile is defined in the pipeline using the size of a small tile and a small to large tile ratio. This is to ensure that an even number of small tiles fit inside every large tile during the small tile rasterization pass. As such, the number of small tiles is directly related to the ratio r and by that also the memory requirement. The relationship between the memory required for the large tiles and the memory required for the small tiles is ml · r². For the pipeline to fully utilize the cores of the Fermi architecture, the squared tile ratio should be as high as possible and preferably an even multiple of 32. This is one reason why the pipeline performs better with 64 small tiles in each large tile. However, this implies a large memory requirement for the small tiles.

As the screen should be divided into as many large tiles as possible and in turn the large tiles into as many small tiles as possible, the memory requirements become a problem. Using equation 1.2, the total memory requirement of the tiles can be computed.

m = ml · (1 + r²)    (1.2)

The amount of required memory grows to unacceptable numbers. Table 1.2 shows the total memory required for all tiles for a few different resolutions and large tile sizes. For the computations, the small to large tile ratio r was set to 8.

Resolution (px)   Large (px)   Required Memory (bytes)
800x600           64x64        17 305 600
800x600           128x128      4 659 200
1024x768          64x64        25 559 040
1024x768          128x128      6 389 760
1920x1080         64x64        67 891 200
1920x1080         128x128      17 971 200

Table 1.2: Memory requirements for all tiles for a few different resolutions.

Unfortunately, these numbers hint that the rasterization stage should be redesigned as its preferences regarding tile sizes and number of tiles are in direct conflict with the memory requirements. Specifically, the amount of memory required for the small tiles is the problem.
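Equations 1.1 and 1.2 can be checked with the short C program below. The function name is hypothetical; with a resolution of 1920 by 1080, 64-pixel large tiles, a batch size of 8 192 and a tile ratio of 8 it reproduces the 67 891 200 bytes listed in table 1.2.

#include <stdio.h>

/* Memory required for the overlap bit matrices, following equations
 * 1.1 and 1.2. Names are hypothetical. */
long tile_memory_bytes(int w, int h, int large_tile, int batch_size, int ratio)
{
    long tiles_x = (w + large_tile - 1) / large_tile;          /* ceil(w / sl) */
    long tiles_y = (h + large_tile - 1) / large_tile;          /* ceil(h / sl) */
    long m_large = tiles_x * tiles_y * (batch_size / 32) * 8;  /* eq. 1.1 */
    return m_large * (1 + (long)ratio * ratio);                /* eq. 1.2 */
}

int main(void)
{
    printf("%ld bytes\n", tile_memory_bytes(1920, 1080, 64, 8192, 8));
    return 0;
}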
1.3 Fragments

The fragment buffer is the common culprit of deferred rendering pipelines. It requires large amounts of allocated memory to store the per-pixel attributes of every visible fragment. As such, its memory requirement is directly related to the resolution of the target image. The only way of limiting the memory footprint is to limit the number of components in each individual fragment record.

Fragment buffers often store albedo (color), surface normal, depth, specular intensity and specular power. Commonly, most of these attributes are compressed into integer data types as the intervals of most attributes are known and limited. This allows compression with minimal quantization noise. During the lighting passes, the fragment record is decompressed and its attributes used in the illumination model.

A compressed fragment may be stored in a 96-bit record. For a resolution of 1920 by 1080 pixels, this amounts to 24 883 200 bytes. This is a large memory requirement. Unfortunately, any further improvements other than removing individual attributes from the fragment records are prevented by the deferred design of the pipeline. In addition, memory alignment must be considered. Removing a single attribute compressed into 8 bits will require an additional three attributes to be removed. Adhering to this discussion, the use of the 96-bit fragment record in the pipeline is well-motivated.

2 Performance

When it comes to rasterization, the tile-based approach provides an efficient way of partitioning the task. As seen in table 3.1 of the results chapter, the pipeline performs well for high resolutions with few triangles and no overdraw. This is because the number of large tiles, and thereby the number of threads, increases with resolution when the large tile size is fixed, allowing high occupancy levels of the hardware. If the total execution times from the table are divided by the number of large tiles at each resolution, an interesting pattern emerges. This is shown in figure 2.1.

Figure 2.1: Linear plot of average rasterization time per large tile.

Only a small number of large tiles fit within the smaller resolutions. As the resolution is increased, so is the total number of large tiles. At sufficient resolutions, the performance gain evens out as expected. This is because high occupancy levels have been reached for resolutions above 1024 by 1024 pixels. This hints that tile-based rendering has great potential, provided that the resolution is high enough to be partitioned into a high number of large tiles.

From the results in the previous chapter, uneven load balancing seems to be a serious issue for the implemented pipeline. When the pipeline is fed with the reasonably small test model consisting of 5 784 triangles, the only rasterization kernel with acceptable execution times is the one used for rasterizing the large tiles. The other two kernels do not handle the uneven work load well. The behavior of the large tile rasterization kernel was expected as it operates per-triangle and as such is not subject to an uneven work load. The other two kernels operate per-tile, with possible uneven work loads. As no specific load balancing measures were implemented for the two kernels, some problematic behavior was to be expected. Yet, the data presented in the results chapter proved the bottle-neck to be narrower than expected. Clearly, a redesign of that specific stage is needed in order to achieve good performance.

The texturing sub-system yielded results far better than expected. As the memory layout of the textures is linear per row and planar per channel, texturing was expected to have a great impact on performance due to poor cache coherency.
This was not the case as demonstrated in the results chapter. Texturing 262 144 pixels could be performed in 610 to 880 microseconds using the regular image pyramid filter. As shown in table 3.4 of the results chapter, the performance increases with each texture level. However, overdraw was not considered in order to isolate the cache performance. Typical scenes have overdraw and are usually rendered at much higher resolutions. In addition, each fragment is often generated using multiple textures. The measured fill rate of 297 890 909 to 429 744 262 textured pixels per second does not compare well to the fill rate specified at 42 000 000 000 for the Fermi architecture [5]. However, the number stated in the specification is the theoretical texture filtering performance computed as one filtered look-up per clock cycle. The Open Computing Language provides functionality to store images in a non-linear layout and to use the hardware texturing sub-system to filter the images. This would most certainly increase the performance of the texturing sub-system. However, the intention of this master’s thesis was to explore if the entire graphics pipeline could be implemented using solely the programmable units. As such, the image objects of the Open Computing Language were not used apart for inter-operability with the window interface. 115 Chapter H Conclusion This experiences with the graphics pipeline and with the graphics processing unit have been incredibly intresting. Much was learned through the continuous literature study and even more through implementing the different graphics algorithms. Through doing this, I have observed a few important aspects of tile-based rendering and of texture filtering. Notably, the increase in visual quality when using more sophisticated texture filters such as the anisotropic image pyramid or the summed-area table. These two come with a four-fold increase in memory requirement, something which is considered unacceptable within the industry. This as a multitude of textures often need to be simultaneously resident in device memory. A single texture should not be allowed to occupy four times the memory in comparison to the original texture. Device memory is still a huge limitation for graphics processing units. The amount of memory on a typical consumer-level device is often far less than the main memory of the system. In addition, render data needs to be transferred to the device and techniques requiring a dynamic amount of memory become troublesome. This is an issue which as of today remains unsolved for the Open Computing Language. In many cases, device memory needs to be allocated for the worst case scenario. This is not a good solution as much of the memory remains unused while preventing other data from residing in that region of the device memory. The tile-based rendering scheme proved to handle high resolutions elegantly but the non-existent work load balancing proved to be a bottle-neck narrower than anticipated. As such, a redesign of the rasterization stage should be considered. This redesign must account for uneven work loads, possibly handling this in a way similar to that of the pipeline presented by Laine and Karras. Many of the conducted experiments indicated that device occupancy is an important aspect. A great increase in performance was observed until the number of sub-tasks were a few multiples of the total number of device cores. 
116 This in accordance with the technical documents presented by Nvidia and the Khronos Group as well as stated in the paper by Laine and Karras [2]. It saddens me to conclude the work carried out within this master’s thesis. From the conducted experiments and the acquired knowledge about the hardware, many design choices would have been made differently, should the pipeline had been designed today. Unfortunately, the development of this graphics pipeline stops here and now. In conclusion, I wish to thank the authors of the different research papers quoted throughout this report. They have inspired me and given me ideas of my own to explore. 117 Bibliography [1] Henri Gouraud. Continuous shading of curved surfaces. IEEE Transactions on Computers, 20:623–629, 1971. [2] Samuli Laine and Tero Karras. High-performance software rasterization on gpus. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG 2011, pages 79–88, 2011. [3] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27:1–15, 2008. [4] Juan Pineda. A parallel algorithm for polygon rasterization. SIGGRAPH Computer Graphics, 22:17–20, 1988. [5] Nvidia. Nvidia’s next generation cuda compute architecture: Fermi. Whitepaper, 2009. [6] The Khronos Group. The opencl specification (1.2). Specification, 2011. [7] Ivan E. Sutherland and Gary W. Hodgman. Reentrant polygon clipping. ACM Communications, 17:32–42, 1974. [8] Alvy Ray Smith. A pixel is not a little square, a pixel is not a little square, a pixel is not a little square! (and a voxel is not a little cube). Technical report, Microsoft Research, 1995. [9] The Khronos Group. The opengl graphics system: A specification (4.3). Specification, 2012. [10] Lance Williams. Pyramidal parametrics. SIGGRAPH Computer Graphics, 17:1–11, 1983. [11] Franklin C. Crow. Summed-area tables for texture mapping. SIGGRAPH Computer Graphics, 18:207–212, 1984. 118 [12] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. Fast summed-area table generation and its applications. Computer Graphics Forum, 24:547–555, 2005. [13] Paul S. Heckbert. Filtering by repeated integration. SIGGRAPH Computer Graphics, 20:315–321, 1986. [14] Bui Tuong Phong. Illumination for computer generated pictures. ACM Communications, 18:311–317, 1975. 119