Ray Tracing NURBS Surfaces using CUDA

Master's Thesis (Version of 2nd February 2010)

Erik Valkering

THESIS submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in COMPUTER SCIENCE by Erik Valkering, born in Limmen, the Netherlands.

Computer Graphics and CAD/CAM group
Department of Mediamatics
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
Delft, the Netherlands
www.ewi.tudelft.nl

© 2010 Erik Valkering. Cover picture: the ducky scene ray traced by CNRTS with shadows and reflections.

Author: Erik Valkering
Student id: 1328913
Email: [email protected]

Abstract

This thesis presents CNRTS, a CUDA-based system capable of ray tracing NURBS surfaces directly. By coarsely subdividing the NURBS surfaces in a preprocessing step, a tighter bounding volume hierarchy is generated, which can be traversed first, greatly reducing the number of ray-patch intersection tests. Additionally, the smaller leaf-nodes provide better initial seed-points, which allows the root-finder to converge more quickly. For scene traversal, two approaches have been investigated. The first is a packet-based traversal scheme, employing a shared stack implemented using the GPU's fast on-chip memory. The second is an optimized single-ray traversal scheme, in which the scene traversal and root-finding are separated from each other, to help maximize the utilization during root-finding. The single-ray approach turns out to be up to 3× faster than the packet-based approach. Furthermore, a hybrid approach has been used for tracing primary rays, by employing the rasterizer. Besides requiring only one ray-patch intersection test per ray, the initial seed-points provided by the rasterization are very close to the actual intersections. Using the hybrid approach, the performance is generally increased; however, an artifact-free rendering is not always guaranteed, because the approach relies on a tessellation.

Thesis Committee:
Chair/supervisor: Prof.dr.ir. F.W. Jansen, Faculty EEMCS, CG, TU Delft
Committee Member: Dr. W.F. Bronsvoort, Faculty EEMCS, CG, TU Delft
Committee Member: Ir. J. Blaas, Faculty EEMCS, CG, TU Delft
Committee Member: Dr. A. Iosup, Faculty EEMCS, PDS, TU Delft

Preface

High-quality visualization of Non-Uniform Rational B-Spline (NURBS) surfaces, mostly found in industrial designs (such as cars, airplanes, and ships), is usually performed through rasterization, after a costly preprocessing phase in which the NURBS model is converted to a triangular representation. In order to represent highly curved areas, the triangulation must be very fine, so many triangles are generated in areas with high curvature, resulting in a long preprocessing time. In this thesis, a solution is presented which avoids this expensive preprocessing step and works directly with the NURBS data. By employing the ray tracing algorithm, the resulting visualization will always appear smooth, regardless of the viewing distance. Because the original NURBS data is used for visualization, the memory requirements are very low. Furthermore, the ray tracing system is fully implemented using CUDA to exploit the processing power of recent GPUs.

I would like to thank a number of people who have been a great help and support in realizing this thesis.
First of all, I would like to thank my supervisor Erik Jansen for the many interesting discussions which kept me motivated, and for providing really helpful advice on how to improve this report. Next, I would like to thank Jorik Blaas for suggesting all those clever ways to debug my software. I would also like to thank Menno Kuyper for always being willing to help me test my software on his computer, despite the many system crashes I have caused. And last but not least, I would like to thank my parents, Bert and Ellen, for their continuous support, in every possible way, and for giving me the opportunity to study Computer Science.

Contents

Contents
List of Figures
1 Introduction
  1.1 Rasterization
  1.2 Ray Tracing
  1.3 General-Purpose GPU computing
  1.4 Structure of this thesis
I Background
2 NURBS Basics
  2.1 Definition
  2.2 Cox-de Boor Recurrence
  2.3 Knot-vector
  2.4 Convex Hull
  2.5 Knot-insertion
  2.6 Surfaces
  2.7 Derivative
3 NURBS Ray Tracing
  3.1 Scene Traversal
  3.2 Packet-Based Ray Tracing
  3.3 Ray-Patch Intersection
  3.4 Adaptive subdivision
  3.5 Evaluation Schemes
  3.6 Overview
  3.7 Discussion
4 NURBS Ray Tracing on the GPU
  4.1 GPU Ray Tracing
  4.2 Hybrid Ray Tracing
  4.3 Discussion
5 CUDA
  5.1 Massive Multi-threading Architecture
  5.2 Memory Hierarchy
  5.3 Launching a kernel
  5.4 Debugging
II Development
6 System overview
  6.1 Primary-ray Accelerator subsystem
  6.2 Ray Tracer subsystem
  6.3 Summary
7 Preprocessing
  7.1 Subdivision
  7.2 Root-Finder Data
8 Kernel details
  8.1 Inter-kernel Communication
  8.2 Root-finding
  8.3 Surface Evaluation
  8.4 Primary-ray Accelerator subsystem
  8.5 Ray Tracer subsystem
  8.6 Kernel-launch configurations
III Results
9 Results
  9.1 Experimental Setup
  9.2 Image Quality
  9.3 Performance
  9.4 Analysis
  9.5 Discussion
10 Conclusions and Future Work
  10.1 Future Work
Bibliography

List of Figures

1.1 NURBS-model
1.2 Tessellation of a NURBS model.
1.3 A model of the VW Polo from different viewpoints. The car remains perfectly curved, even from the shortest viewing distance, thanks to the direct ray tracing of NURBS.
1.4 Ray Tracing
1.5 Performance statistics for the CPU and GPU.
1.6 Simplified processing model.
2.1 Fine-tuning a NURBS-curve (a): by specifying different weights (b) and through the knot-vector (c, d).
2.2 Example B-Spline basis-functions: constant (a), linear (b), quadratic (c) and cubic (d).
2.3 Example of polynomial segments of a cubic NURBS-curve.
2.4 Strong convex hull property: for u ∈ [u_i, u_{i+1}), C(u) is in the triangle P_{i−2}P_{i−1}P_i (a) and in the quadrilateral P_{i−3}P_{i−2}P_{i−1}P_i (b).
2.5 Knot-insertion: knot inserted at t = 0.5.
2.6 NURBS-surface example: Utah Teapot consisting of 32 bi-cubic patches. The patches are visualized using a red and green color mapping for the parametric u and v direction, respectively.
3.1 Six levels in a Bounding Volume Hierarchy.
3.2 Newton's Iteration: each subsequent point is closer to the root.
3.3 Plane representation for a ray for which |d_x| > |d_y| and |d_x| > |d_z| holds, resulting in a plane rotated about the z-axis and a plane perpendicular to it (a), and for a ray for which it does not hold, where the resulting plane will be rotated about the x-axis (b). The cyan line represents the ray and the yellow plane is the first plane p_1 = (n_1, d_1).
3.4 Example of a NURBS-curve with exemplary associated bounding volumes.
3.5 Flatness criterion based on the normals of a sub-patch.
3.6 Splitting a sub-patch into two sub-patches obtaining two control-point grids.
3.7 Procedure to fix a crack in the tessellation.
3.8 Evaluating the NURBS basis-functions.
4.1 Multiple stages in the ray tracer of [PBMH05].
4.2 Stackless BVH traversal encoding example.
4.3 Hybrid Ray Tracing phases.
4.4 Overview of the algorithm stages and their relation to stages in the extended graphics pipeline.
4.5 uv-texturing using midpoint of parameter range (uniform subdivision: 2 × 2). Figure 4.5a shows the corresponding uv-texture. Figure 4.5b to Figure 4.5d show the result for up to two, three, and four iterations, respectively.
4.6 uv-texturing using control point mapping (uniform subdivision: 2 × 2). Figure 4.6a shows the corresponding uv-texture. Figure 4.6b to Figure 4.6d show the result for up to two, three, and four iterations, respectively.
4.7 Figure 4.7a, Figure 4.7b: view-independent uv-texturing vs. Figure 4.7a, Figure 4.7c: view-dependent uv-texturing (max. iterations: 10, subdivision: uniform, 4 × 4). Figure 4.7d is the difference between the initial values of both methods.
4.8 Artifacts which are handled correctly.
4.9 Artifacts which cannot be handled correctly.
5.1 Thread blocks and their mapping to the multiprocessors.
5.2 Memory Hierarchy.
6.1 System overview of the rendering core of the CNRTS.
6.2 Schematic overview of the rasterization phase of the Primary-ray Accelerator. The output provides some cues for finding the exact intersection-point: the left rasterized image contains the patch-ids and the right rasterized image contains the uv-values encoded in red for the u-parameter range and green for the v-parameter range.
6.3 The ray's direction is derived from the viewing parameters. Here, a denotes the field-of-view. By setting the image-plane's distance d from the camera equal to 1, the position in the image plane is easily converted to world-coordinates.
6.4 Artifacts appear when using the Primary-ray Accelerator solely. In Figure 6.4a, background-pixels appear instead of the surface. By forwarding all background-pixels to the Ray Tracer, this artifact will disappear. Figure 6.4b shows some gaps, appearing in-between two surfaces.
Figure 6.4c shows an artifact which cannot be detected using the algorithm; the only solution is to tessellate more finely.
6.5 The ray tracing algorithm separated into different CUDA-kernels. The CPU part coordinates the algorithm by launching the kernels successively. The inter-kernel data propagation through global-memory is visualized by dashed lines, whereas the arrows indicate the flow/dependencies.
6.6 Separation of image-space into tiles: instead of applying the ray tracing algorithm onto the entire image, it is now applied successively to each tile. The tiles are visualized using different shades of grey, the pixels are separated by dashed lines.
6.7 Implementing recursion for CUDA using a stack.
6.8 Lifetime of a primary-ray processed by the Primary-ray Accelerator and the Ray Tracer. A primary-ray is spawned by rasterizing the tessellation; a fragment is further processed by the Primary-ray Accelerator's root-finder; a hit may skip the Ray Tracer's traversal and root-finder and immediately continue by casting a shadow ray; misses and background pixels are traced conventionally by the Ray Tracer, whereafter they are processed further by casting a shadow ray; shading is applied for non-blocked shadow-rays; and finally reflection and refraction rays are spawned and traced recursively.
7.1 Overview of subdivision results.
7.2 The NURBS-surface is converted into smaller Bézier-patches.
7.3 Bounding boxes of a BVH.
8.1 Memory layout of a stack inside global-memory. This stack is for 4 threads and has a maximum depth of 3. The exemplary "virtual" stack of thread 2 shows the mapping of threads and stack-levels to global-memory addresses.
8.2 Arrangement of data for a cache inside shared-memory. This example uses an array of 3 elements for each thread, and a block size of 4 threads.
8.3 Mapping of pixels to threads. This example uses an image-resolution of 16 × 16, a tile-size of 4 × 4, and a block-size of 2 × 2. The tiles are visualized using a shade of grey, the blocks are separated by black lines, and the pixels are separated by dashed lines.
9.1 Test scenes used.
9.2 teapot-scene: accelerator (left), full hybrid (middle), standard ray traced (right).
9.3 head-scene.
9.4 A close-up of the ducky-scene: hybrid (left), and normally ray traced (right). It seems that the hybrid method can sometimes introduce artifacts when zooming in too close.
9.5 Killeroo-closeup-scene rendering with an increasing miniterations-value. The images correspond to the miniterations-values in Table 9.4.
9.6 Average speedup of primary rays w.r.t. reference implementation (packet-based, no caching, no acceleration).
9.7 Average speedup of shadow rays (level 0) w.r.t. reference implementation (packet-based, no caching).
9.8 Average speedup of reflection rays (level 1) w.r.t. reference implementation (packet-based, no caching).
9.9 Average speedup of refraction rays (level 1) w.r.t. reference implementation (packet-based, no caching).
9.10 Durations of the variants using the 8800M test-system, as a function of the number of patches. The red line shows the curve corresponding to a logarithmic function fitted to the data.
9.11 Durations of the variants using the GTX 295 test-system, as a function of the number of patches. The red line shows the curve corresponding to a logarithmic function fitted to the data.
9.12 Number of BVH-nodes visited.
9.13 Amount of time spent idle of the total duration for each thread.
9.14 Frequencies of the average number of rays participating in the root-finding.

Chapter 1

Introduction

In the field of CAD/CAM (Computer Aided Design/Computer Aided Manufacturing), designs are often modeled using curved, smooth-looking free-form surfaces. In modeling industrial designs such as cars, airplanes, and ships, the surface-type of choice is the Non-Uniform Rational B-Spline (NURBS) surface (Figure 1.1). This type of surface is favored because it provides local and intuitive control, has a very compact representation, and is very well suited for a large number of problems.

Figure 1.1: NURBS-model

In current CAD/CAM software, when visualizing a model, the curved surface is first tessellated into a polygonal mesh in order to obtain a tight approximation of the surface (Figure 1.2). Such a tessellation usually results in many polygons, taking up lots of memory. After the model has been tessellated, the original continuity is lost; zooming in on the tessellated model will reveal this. Besides the memory requirements, the tessellation preprocessing step also takes some time. For complex models found in the automobile industry, for example, this can take up to a whole day, making it not very suitable for rapid prototyping [AGM06].

1.1 Rasterization

Usually, these polygon meshes are rendered using rasterization. Basically, this method first projects the objects to a two-dimensional image plane, after which the algorithm determines, for each projected object, which pixels of the image plane will be occupied by it. By taking the depth of the objects into account, the pixels will be given the color of the closest object for that pixel.

Figure 1.2: Tessellation of a NURBS model: (a) original NURBS-model, (b) tessellated NURBS-model.

While this method enables easy visualization of the model, it does not handle physically-correct lighting effects, such as shadows, reflections, and refractions, which are often needed in industrial prototyping. In order to include these effects, several methods have been devised to approximate them.
Reflections, for example, can be obtained by first generating a texture onto which the environment surrounding the object is projected (usually using a spherical or cubical texture). During rendering, reflection is applied by following the reflection-vector to obtain the reflection-texel from the environment map. However, since only the environment is stored in the texture, self-reflection cannot be simulated using this method. Furthermore, since the rendering time of the rasterization method is linear in the number of polygons, eventually the models become too complex and cannot be visualized interactively anymore. Although rasterization, when implemented on graphics hardware, quickly generates images having reasonable quality, the effects necessary to obtain a realistic visualization are non-trivial, and methods to simulate them usually produce very rough approximations.

1.2 Ray Tracing

An alternative to rasterization is visualization using the ray tracing algorithm [WH80]. This relatively simple algorithm, compared to rasterization, implicitly embeds all these effects in a physically-correct way. Although this algorithm is computationally much more expensive, it is logarithmic in the number of objects. Therefore, increasing the tessellation-complexity will have less impact on the performance when compared to rasterization. Moreover, by using ray tracing, the resulting pixel values can even be computed directly, without the intermediate tessellation step. In this way, the original model can be used instead, reducing memory requirements and increasing the detail (Figure 1.3).

Figure 1.3: A model of the VW Polo from different viewpoints. The car remains perfectly curved, even from the shortest viewing distance, thanks to the direct ray tracing of NURBS.

Figure 1.4: Ray Tracing

The algorithm is actually quite simple. Instead of projecting each geometrical object to the two-dimensional image plane, this algorithm works the other way around. For each pixel in the image plane, the color is determined by "shooting" a primary ray, originating from the camera (the eyes of the viewer) and passing through that pixel, and finding the closest intersection with an object, from which the color is obtained. These primary rays "look" into the scene and try to find an object they see. If no object crosses the path of the ray, the ray will leave the scene non-intersected and the color of the corresponding pixel will be set to a default background color. This primary intersection stage will result in an image equivalent to the rasterization algorithm, and is known as Ray Casting. If the ray does intersect an object, the lighting of the primary intersection is computed by spawning several secondary rays originating from the intersection point on the corresponding object: for each light-source in the scene a shadow ray pointing to this light-source; one reflection ray pointing in the mirror-reflection direction from a shiny surface; and one refraction ray, with a direction into the surface of the object (in case of a (semi)transparent surface). The shadow rays are used to determine if the primary intersection is illuminated by the light-sources. If any object happens to block the path of a shadow ray, the light of the corresponding light-source will not reach the primary intersection and will not contribute to its illumination.
The other secondary rays are further traced recursively to compute secondary intersections, which may contribute to the final lighting of the primary intersection. These secondary intersections will again spawn rays, and the process is continued recursively until a certain traversal-depth has been reached or the color converges. This process can be seen in Figure 1.4.

Although the images generated by the ray tracing algorithm are much more realistic than the images generated by the rasterization algorithm, the cost of generating them is very high. Algorithm 1 shows the basic ray tracing algorithm, which is executed for each pixel. As can be seen from the algorithm, the complexity of a basic ray tracer depends on the total number of pixels $N$ in the image plane, the number of geometrical objects $M$ in the scene and the traversal-depth $D$ for each ray, and can be described as $O(N M^D)$. Later it will be shown how this complexity can be reduced to $O(N \log(M)^D)$, by introducing an acceleration structure which reduces the number of intersection computations to a relatively small subset of the objects in the scene.

Algorithm 1 Ray Tracing algorithm
Require: Ray R, traversal-depth d
Ensure: Color C of ray
 1: C ← default background color
 2: if R intersects an object in the scene then
 3:   O ← nearest object intersected by R
 4:   for all light-sources L in the scene do
 5:     S ← shadow-ray originating from L and pointing to O
 6:     if S intersects no object then
 7:       C ← C + Apply illumination model(L, O)
 8:     end if
 9:   end for
10:   if d ≤ maximum traversal-depth D then
11:     {R_i} ← secondary reflection and refraction rays
12:     C ← C + ∑_i Ray Trace(R_i, d + 1)
13:   end if
14: end if
15: return C

1.3 General-Purpose GPU computing

In the last few years the potential numerical performance of graphics processing units (GPUs) has grown dramatically. Modern GPUs usually offer multiple processor cores running in parallel. Furthermore, the available memory bandwidth is usually also very high, typically 60 GB/s or more. This high memory bandwidth is needed to keep the cores busy. Figure 1.5 gives the potential numerical performance of both the latest GPU and CPU families. The potential numerical performance is expressed in GFLOPS (giga floating-point operations per second). The main reason for the fast evolution of modern GPUs is that they are specially designed for compute-intensive and highly parallel computation.

Figure 1.5: Performance statistics for the CPU and GPU: (a) floating-point operations per second, (b) memory bandwidth. (GPU codenames in the plots: GT200 = GeForce GTX 280, G92 = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.)

A typical CPU has a very large cache memory and a very deep pipeline. The cache memory is required to hide long memory latencies; the pipeline is required to hide complicated arithmetic operations. So the latency of most operations on a CPU is very short. But usually a CPU only offers a couple of fixed- and floating-point arithmetic and logical units (ALUs), so the number of operations that can be done in parallel is very limited. In other words, a CPU is not really suitable for highly parallel computation. Modern GPUs do not have very large data caches or very deep pipelines. Most of the transistors on the chip are devoted to data processing. The simplified processing model of a typical CPU and GPU is given in Figure 1.6.
Figure 1.6: Simplified processing model: (a) CPU, (b) GPU.

A GPU is very well suited for applications and algorithms that require the same set of operations for each data element. Because the same set of operations is performed for each element, there is a lower requirement for sophisticated flow control; and because the computation is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

But using this large amount of potential numerical performance can be tricky. A typical GPU can only be programmed in special development environments using special programming languages (such as Microsoft's HLSL, OpenGL GLSL, NVIDIA Cg) and complicated (texture) memory fiddling. Furthermore, no recursion is supported, due to the lack of a stack. In other words, it involves a steep learning curve. However, more recently several GPGPU (General-Purpose GPU) languages have been devised in order to help the programmer focus on the parallel problem, instead of on all the stages in the graphics pipeline. These languages allow programming in a familiar language such as C or C++. The tools then generate code that runs as shaders using the graphics pipeline (BrookGPU, Sh, RapidMind, CUDA). The latter of these languages, CUDA, is used throughout this thesis, and will be discussed in more detail in Chapter 5.

Ray tracing is an inherently parallel and highly compute-intensive operation. Therefore, the GPU should be exploited as much as possible in order to speed up the visualization process. However, due to the divergent nature of the algorithm, it is hard to keep all processor cores busy at all times. Some rays may require more intersection tests, while others are already finished. By employing a ray tracing algorithm which maximizes the number of active cores, in combination with a fast intersection test, these problems can be overcome. In addition, a hybrid rasterization/ray tracing approach is taken to accelerate the ray tracing of primary rays.

1.4 Structure of this thesis

This thesis continues with Part I, which focuses on the existing literature. In Chapter 2, the mathematical basics of NURBS and their algorithms will be given. Then, in Chapter 3, existing techniques are described for ray tracing NURBS surfaces. Next, in Chapter 4, this discussion will extend to GPU implementations. Finally, Chapter 5 will discuss the CUDA architecture, which is used in this thesis.

Part II of this thesis will present CNRTS, the ray tracing system developed for this thesis. Chapter 6 will give a brief overview of the system's architecture. Next, Chapter 7 will describe the preprocessing steps required prior to visualization. And finally, Chapter 8 will give a more in-depth description of the system.

The final part of the thesis, Part III, will discuss the results achieved with the system described in Part II. Chapter 9 will discuss the methods used to evaluate and analyze the system, and Chapter 10 will conclude this thesis by summing up the most important observations and giving some hints on how the system can be further improved.

Part I

Background

Chapter 2

NURBS Basics

In geometric modeling, the two most common methods of representing curves and surfaces are implicit equations and parametric functions [PT97]. While each form has advantages over the other, the parametric form is the preferred method to represent free-form surfaces in computer aided geometric design, due to its intuitive geometrical properties.
In parametric form, the coordinates of a surface-point are represented separately as an explicit function of two independent parameters:
\[
\vec{S}(u,v) = \begin{pmatrix} x(u,v) \\ y(u,v) \\ z(u,v) \end{pmatrix}, \qquad (u,v) \in [a,b] \times [c,d].
\]
Thus, $\vec{S}(u,v)$ is a vector-valued function of the independent variables $u$ and $v$. For many types of free-form surfaces, the domain is normalized to $0 \le u, v \le 1$, in order to simplify computations [PT97].

The most used type of parametric free-form surface is the Non-Uniform Rational B-Spline surface or NURBS-surface. Before going into the details of this surface type, the NURBS-curve is discussed first, which is easily extended to a surface. Section 2.1 will start with some mathematical definitions about NURBS.

2.1 Definition

Geometrically, a NURBS-curve is defined by a set of control-points, along with a weight for each control-point, and a knot-vector. The control-points roughly specify the shape of the curve (see Figure 2.1a). The shape of the resulting curve can be fine-tuned in several ways: by including more control-points around an existing control-point, the curve will be drawn towards the cluster of control-points. However, this will require more storage for representing the curve, since more control-points are needed in order to obtain the desired shape. Alternatively, by specifying a weight for each control-point, the curve can be pulled towards the control-point by increasing this weight (see Figure 2.1b). Decreasing this value will push the curve away from the control-point. Using weights instead of extra control-points, a more compact representation of the curve is obtained. More importantly, using the weights, the required geometry can be specified very precisely; without them it is impossible to represent basic shapes such as circles, arcs and other conic sections. Therefore, using NURBS it is possible to specify analytical shapes as well as free-form shapes. Finally, the knot-vector is used to fine-tune the shape of the curve (Figure 2.1c, Figure 2.1d). The knot-vector will be discussed in Section 2.3.

The power of a NURBS-curve lies in the fact that its complexity is independent of the number of control-points. Instead, it depends on its degree. Therefore, a NURBS-curve can be adjusted locally without changing the shape of the entire curve.

Figure 2.1: Fine-tuning a NURBS-curve (a): by specifying different weights (b) and through the knot-vector (c, d).

Mathematically, a NURBS-curve $\vec{C}(t)$ is a piecewise rational function defined by a set of $n$ control-points $\{\vec{P}_i\}$ along with their weights $\{w_i\}$, its degree $p$ (or its order $p + 1$), and a non-decreasing sequence $\{t_0, \ldots, t_{n+p}\}$ called the knot-vector, on the domain $[t_p, t_n]$:
\[
\vec{C}(t) = \frac{\sum_{i=0}^{n-1} w_i \vec{P}_i N_i^p(t)}{\sum_{i=0}^{n-1} w_i N_i^p(t)} \equiv \frac{\vec{D}(t)}{w(t)}, \tag{2.1}
\]
or more compactly as:
\[
\vec{C}^h(t) = \sum_{i=0}^{n-1} \vec{P}_i^h N_i^p(t) \equiv \left(\vec{D}(t), w(t)\right), \tag{2.2}
\]
where the $N_i^p(t)$ functions are the B-Spline basis-functions (discussed in Section 2.2). The $\{\vec{P}_i^h\}$ control-points embed their weights using a homogeneous space-coordinate, by multiplying each coordinate with the weight $w_i$ and appending the weight to the control-point: $\vec{P}_i^h \equiv (w_i \vec{P}_i, w_i)$. To obtain the corresponding curve-point as in Equation 2.1, $\vec{C}^h(t)$ only needs to be divided by its homogeneous coordinate $w(t)$. In the special case of all weights equaling one, the equation reduces to:
\[
\vec{C}(t) = \sum_{i=0}^{n-1} \vec{P}_i N_i^p(t). \tag{2.3}
\]
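As a concrete illustration of the quantities appearing in Equations 2.1-2.3, the following C++ sketch shows one possible in-memory representation of a NURBS-curve. The struct and field names are illustrative only and are not taken from the CNRTS implementation.

#include <vector>

// Minimal sketch of the data defining a NURBS-curve (Equation 2.1).
// Names are illustrative; the actual CNRTS data layout may differ.
struct Point3 { float x, y, z; };

struct NurbsCurve {
    int degree;                    // p (the order is p + 1)
    std::vector<Point3> points;    // control-points P_0 .. P_{n-1}
    std::vector<float>  weights;   // weights w_i, one per control-point
    std::vector<float>  knots;     // non-decreasing knot-vector t_0 .. t_{n+p}
};

Note how the three arrays are tied together: the number of knots always equals the number of control-points plus the order, as stated above.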
2.2 Cox-de Boor Recurrence

The B-Spline basis-functions, or blending-functions, in Equation 2.1 are used to smoothly blend together the control-points of the curve. They are derived from a knot-vector (Section 2.3) using the Cox-de Boor recurrence formula [dB72, Cox72]:
\[
N_i^p(t) =
\begin{cases}
1 & \text{if } p = 0 \text{ and } t \in [t_i, t_{i+1}), \\
0 & \text{if } p = 0 \text{ and } t \notin [t_i, t_{i+1}), \\
\dfrac{t - t_i}{t_{i+p} - t_i} N_i^{p-1}(t) + \dfrac{t_{i+p+1} - t}{t_{i+p+1} - t_{i+1}} N_{i+1}^{p-1}(t) & \text{otherwise.}
\end{cases} \tag{2.4}
\]
Here, the $N_i^0(t)$ are step functions equal to zero except in the interval $[t_i, t_{i+1})$. The basis-functions of higher degree are linear combinations of two lower-degree basis-functions (Figure 2.2). In the case that the numerator and the denominator are both zero, the convention $\frac{0}{0} \equiv 0$ is used.

Figure 2.2: Example B-Spline basis-functions: constant (a), linear (b), quadratic (c) and cubic (d).

2.2.1 Local support

Following the dependencies of the recurrence, the following triangular structure appears:
\[
\begin{array}{ccccc}
 & & N_i^p(t) & & \\
 & N_i^{p-1}(t) & & N_{i+1}^{p-1}(t) & \\
N_i^{p-2}(t) & & N_{i+1}^{p-2}(t) & & N_{i+2}^{p-2}(t) \\
 & & \vdots & & \\
N_i^0(t) & N_{i+1}^0(t) & N_{i+2}^0(t) & \cdots & N_{i+p}^0(t)
\end{array}
\]
As can be seen, basis-function $N_i^p(t)$ depends on the basis-functions $N_i^0(t), \ldots, N_{i+p}^0(t)$; therefore $N_i^p(t)$ is zero if $t \notin [t_i, t_{i+p+1})$. This means that each control-point $\vec{P}_i$ has local support on the curve on the domain $[t_i, t_{i+p+1})$: moving the control-point does not change the shape globally.

2.3 Knot-vector

The knot-vector determines the resulting basis-functions and thus the shape of the NURBS-curve. A knot-vector $\vec{t}$ consists of a non-decreasing sequence of $n + p + 1$ real numbers, the so-called knots, where $n$ is the number of control-points of the corresponding curve, and $p$ is the degree of that curve:
\[
\vec{t} = \{t_0, t_1, \cdots, t_{n+p}\}. \tag{2.5}
\]
These knots "tie together" the polynomial pieces of the basis-functions (Figure 2.3). Each successive pair of knots represents a parametric interval $[t_i, t_{i+1})$ called a knot-span. Knot-spans may be empty, increasing a knot's multiplicity (the number of successive knots which are equal). At a knot of multiplicity $k$, a basis-function $N_i^p(t)$ is $C^{p-k}$ continuous, i.e. $p - k$ times differentiable. Obviously, within a non-empty knot-span, a basis-function $N_i^p(t)$ is $C^\infty$ continuous, since in that region the basis-function is defined by only one single polynomial. In the special case of a knot $t_i$ having multiplicity $p$, the only non-zero basis-function is $N_i^p(t)$, with $N_i^p(t_i) = 1$. Therefore, the curve interpolates the control-point $\vec{P}_i$ at $t = t_i$.

Figure 2.3: Example of polynomial segments of a cubic NURBS-curve.

There are basically three flavors of knot-vectors: periodic, non-periodic, and non-uniform. Periodic knot-vectors have equidistant knots. Curves using periodic knot-vectors have coinciding endpoints, resulting in a closed shape. For example, a periodic knot-vector for a curve with $n = 4$ control-points and degree $p = 3$ could be:
\[
\vec{t} = \{0, 1, 2, 3, 4, 5, 6, 7\}.
\]
Non-periodic knot-vectors also have equidistant knots, except for the first and last knots, which have a multiplicity of $p + 1$. As a result, the curve interpolates both endpoints. For example, a non-periodic knot-vector with $n = 7$ control-points and degree $p = 2$ could be:
\[
\vec{t} = \{0, 0, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5\}.
\]
Non-uniform knot-vectors provide more flexibility, as they allow each knot-span to vary. For example, a non-uniform knot-vector with $n = 9$ control-points and degree $p = 2$ could be:
\[
\vec{t} = \{0, 0, 0, 0.5, 0.5, 0.75, 2, 2, 2.5, 2.5, 2.5\}.
\]
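The Cox-de Boor recurrence of Equation 2.4 translates almost directly into code. The sketch below is a straightforward, unoptimized version that evaluates all basis-functions of degree p for a given knot-vector; it only illustrates the recurrence (including the 0/0 ≡ 0 convention) and is not the evaluation scheme actually used in CNRTS.

#include <vector>

// Unoptimized translation of the Cox-de Boor recurrence (Equation 2.4).
// Returns N_i^p(t) for i = 0 .. n-1, where n = knots.size() - p - 1.
// The 0/0 := 0 convention is handled by testing the denominators.
std::vector<float> basisFunctions(const std::vector<float>& knots, int p, float t)
{
    const int n = static_cast<int>(knots.size()) - p - 1;

    // Degree 0: step functions that are 1 on [t_i, t_{i+1}) and 0 elsewhere.
    std::vector<float> N(knots.size() - 1, 0.0f);
    for (std::size_t i = 0; i + 1 < knots.size(); ++i)
        if (t >= knots[i] && t < knots[i + 1])
            N[i] = 1.0f;

    // Raise the degree one level at a time (the triangular structure of Section 2.2.1).
    for (int k = 1; k <= p; ++k)
        for (std::size_t i = 0; i + k + 1 < knots.size(); ++i) {
            float left = 0.0f, right = 0.0f;
            if (knots[i + k] != knots[i])
                left = (t - knots[i]) / (knots[i + k] - knots[i]) * N[i];
            if (knots[i + k + 1] != knots[i + 1])
                right = (knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1]) * N[i + 1];
            N[i] = left + right;
        }

    N.resize(n);   // only the first n functions are valid for degree p
    return N;
}

For a degree-p curve, at most p + 1 of the returned values are non-zero for any t, which is exactly what the local support property of Section 2.2.1 expresses.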
Finally, a very special knot-vector is the non-periodic knot-vector for which $n = p + 1$. These knot-vectors result in basis-functions mathematically equivalent to Bernstein-polynomials, which are used by the well-known Bézier-curves [BM99]. For example, the following knot-vector results in a cubic Bézier-curve:
\[
\vec{t} = \{0, 0, 0, 0, 1, 1, 1, 1\}.
\]

2.4 Convex Hull

A very useful property of B-Spline basis-functions is the partition of unity, which states:
\[
\sum_{j=i-p}^{i} N_j^p(t) = 1 \quad \text{for all } t \in [t_i, t_{i+1}). \tag{2.6}
\]
This is proved easily by expressing the sum in terms of lower-degree basis-functions; for the full proof I refer to [PT97]. Combined with the local support property (Section 2.2.1), the following statement can be made, known as the strong convex hull property: if $t \in [t_i, t_{i+1})$, then $\vec{C}(t)$ is contained within the convex hull of the control-points $\vec{P}_{i-p}, \ldots, \vec{P}_i$. This property is shown in Figure 2.4.

Figure 2.4: Strong convex hull property: for $u \in [u_i, u_{i+1})$, $\vec{C}(u)$ is in the triangle $\vec{P}_{i-2}\vec{P}_{i-1}\vec{P}_i$ (a) and in the quadrilateral $\vec{P}_{i-3}\vec{P}_{i-2}\vec{P}_{i-1}\vec{P}_i$ (b).

2.5 Knot-insertion

Another basic ingredient for many other algorithms is knot-insertion, in which a knot is inserted into an existing knot-vector without changing the shape of the curve [PT97]. However, since the size of the knot-vector is always equal to $n + p + 1$, an extra control-point needs to be added (or the degree needs to be incremented, but this modifies the shape of the curve). Furthermore, the existing control-points need to be modified in order to preserve the shape of the curve.

When inserting a knot $t \in [t_i, t_{i+1})$, the local support property (Section 2.2.1) states that only the basis-functions $N_{i-p}^p(t), \ldots, N_i^p(t)$ are non-zero. Therefore, the computation of the new control-points can be restricted to only these. The following scheme transforms the old control-points $\{\vec{P}_0, \ldots, \vec{P}_n\}$ into the new ones $\{\vec{Q}_0, \ldots, \vec{Q}_{n+1}\}$ (Figure 2.5):
\[
\vec{Q}_k =
\begin{cases}
\vec{P}_k & k \le i - p, \\
(1 - \alpha_k)\vec{P}_{k-1} + \alpha_k \vec{P}_k & i - p < k \le i, \\
\vec{P}_{k-1} & \text{otherwise,}
\end{cases}
\qquad \text{where } \alpha_k = \frac{t - t_k}{t_{k+p} - t_k}.
\]

Figure 2.5: Knot-insertion: knot inserted at t = 0.5.

2.6 Surfaces

A NURBS-curve is easily generalized to a NURBS-surface $\vec{S}(u,v)$, by defining $\vec{S}(u,v)$ as a grid of $n \times m$ control-points $\{\vec{P}_{ij}\}$ along with their weights $\{w_{ij}\}$, the degrees $p$ and $q$, and two sets of basis-functions, one for each parametric direction, together with their knot-vectors $\{u_0, \ldots, u_{n+p}\}$ and $\{v_0, \ldots, v_{m+q}\}$:
\[
\vec{S}(u,v) = \frac{\sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} \vec{P}_{ij} N_i^p(u) N_j^q(v)}{\sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} N_i^p(u) N_j^q(v)} \equiv \frac{\vec{D}(u,v)}{w(u,v)}, \tag{2.7}
\]
or analogous to Equation 2.2:
\[
\vec{S}^h(u,v) = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} \vec{P}_{ij}^h N_i^p(u) N_j^q(v) \equiv \left(\vec{D}(u,v), w(u,v)\right), \tag{2.8}
\]
which needs to be divided by its homogeneous coordinate $w(u,v)$ in order to obtain the surface-point. The domain of the surface is then defined as $[u_p, u_n] \times [v_q, v_m]$.

Just like the parametric domain of a NURBS-curve is divided into spans, the parametric domain of a NURBS-surface is divided into rectangular areas called patches (Figure 2.6). The patches are defined by each combination of the non-empty spans of the two parametric directions.

Figure 2.6: NURBS-surface example: Utah Teapot consisting of 32 bi-cubic patches. The patches are visualized using a red and green color mapping for the parametric u and v direction, respectively.
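Evaluating a surface-point according to Equation 2.7 then amounts to two basis-function evaluations and a weighted sum over the control-point grid. The sketch below reuses the hypothetical Point3 type and basisFunctions() from the earlier sketches and assumes a row-major control-point layout; for simplicity it ignores the fact that only (p+1)(q+1) basis-functions are non-zero, which a real implementation would exploit.

#include <vector>

// Illustrative NURBS-surface container; control-points and weights are
// stored row-major, i.e. index = j * n + i.
struct NurbsSurface {
    int n, m;                        // control-point grid dimensions
    int p, q;                        // degrees in the u and v direction
    std::vector<Point3> points;      // P_ij
    std::vector<float>  weights;     // w_ij
    std::vector<float>  uKnots, vKnots;
};

// Direct evaluation of Equation 2.7: accumulate D(u,v) and w(u,v), then divide.
Point3 evaluateSurface(const NurbsSurface& s, float u, float v)
{
    std::vector<float> Nu = basisFunctions(s.uKnots, s.p, u);
    std::vector<float> Nv = basisFunctions(s.vKnots, s.q, v);

    Point3 D{0.0f, 0.0f, 0.0f};      // homogeneous numerator D(u,v)
    float  w = 0.0f;                 // homogeneous coordinate w(u,v)
    for (int j = 0; j < s.m; ++j)
        for (int i = 0; i < s.n; ++i) {
            float b = s.weights[j * s.n + i] * Nu[i] * Nv[j];
            const Point3& P = s.points[j * s.n + i];
            D.x += b * P.x;  D.y += b * P.y;  D.z += b * P.z;
            w   += b;
        }
    return { D.x / w, D.y / w, D.z / w };   // projective division
}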
2.7 Derivative

The availability of the partial derivatives of a NURBS-surface is very important for many algorithms, such as finding roots, determining the normal at a point on a surface, computing the curvature, etc. In this section, a very simple analytical formula is given for the derivative of a NURBS-curve, which is then extended to the computation of the partial derivatives of a NURBS-surface.

The first-order derivative of a NURBS-curve is obtained by applying the quotient rule to Equation 2.1:
\[
\vec{C}(t)' = \frac{w(t)\vec{D}(t)' - \vec{D}(t)w(t)'}{w(t)^2}, \tag{2.9}
\]
where
\[
\vec{D}(t)' = \sum_{i=0}^{n-1} w_i \vec{P}_i N_i^p(t)', \qquad
w(t)' = \sum_{i=0}^{n-1} w_i N_i^p(t)',
\]
and
\[
N_i^p(t)' = \frac{p}{t_{i+p} - t_i} N_i^{p-1}(t) - \frac{p}{t_{i+p+1} - t_{i+1}} N_{i+1}^{p-1}(t). \tag{2.10}
\]
A proof of Equation 2.10 can be found in [PT97, Pro05].

The first-order partial derivatives of a NURBS-surface follow by applying the quotient rule to Equation 2.7:
\[
\vec{S}_u(u,v) = \frac{w(u,v)\vec{D}_u(u,v) - \vec{D}(u,v)w_u(u,v)}{w(u,v)^2}, \tag{2.11}
\]
\[
\vec{S}_v(u,v) = \frac{w(u,v)\vec{D}_v(u,v) - \vec{D}(u,v)w_v(u,v)}{w(u,v)^2}, \tag{2.12}
\]
where
\[
\vec{D}_u(u,v) = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} \vec{P}_{ij} N_i^p(u)' N_j^q(v), \qquad
w_u(u,v) = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} N_i^p(u)' N_j^q(v),
\]
\[
\vec{D}_v(u,v) = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} \vec{P}_{ij} N_i^p(u) N_j^q(v)', \qquad
w_v(u,v) = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} w_{ij} N_i^p(u) N_j^q(v)'.
\]
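In code, the quotient rule of Equations 2.11 and 2.12 is only a small final step once the homogeneous sums are available. The sketch below assumes that D, w and their u-derivatives Du, wu have been accumulated in the same double loop as in the evaluateSurface() sketch above, but with N_i^p(u)' (Equation 2.10) substituted for N_i^p(u); the v-derivative is handled analogously. The function name and types are illustrative.

// Quotient rule of Equation 2.11: S_u = (w * D_u - D * w_u) / w^2.
// D, w, Du and wu are the homogeneous sums defined in Section 2.7.
Point3 partialU(const Point3& D, float w, const Point3& Du, float wu)
{
    float inv = 1.0f / (w * w);
    return { (w * Du.x - D.x * wu) * inv,
             (w * Du.y - D.y * wu) * inv,
             (w * Du.z - D.z * wu) * inv };
}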
Chapter 3

NURBS Ray Tracing

As we discussed in Chapter 1, rasterization of the tessellation is not very memory-friendly, requires a long preprocessing time, and does not provide the desired realism. Furthermore, it becomes a very slow visualization for overly complex scenes. Although ray tracing solves these last two problems, it still depends on the quality of the tessellation.

Alternatively, the original NURBS-surface can be ray traced directly, by employing a root-finder to find intersections between a ray and the original mathematical representation of the NURBS-surface (Section 3.3.1). This method, while being more expensive, leads to an always exact visualization of the surface without any artifacts, independent of the view point. Additionally, since the surface does not need to be converted into a triangular mesh, the tessellation step can be avoided, resulting in a shorter start-up time and lower memory requirements.

Before we discuss the ray-patch intersection algorithm, we start with some BVH-based methods to traverse the scene more efficiently.

3.1 Scene Traversal

A naive implementation of the basic ray tracing algorithm will test every object for an intersection during ray traversal, resulting in many costly computations. Ever since the very beginning of ray tracing, several ray traversal acceleration data structures have been developed, in order to reduce this number of ray/object intersection tests. The first one ever used in ray tracing was the bounding volume hierarchy (BVH) [WH80], after which many other, more efficient data structures have been used, such as uniform grids, octrees, kd-trees, etc. In 2001, Havran investigated these data structures in his PhD thesis, and concluded that the kd-tree was the most efficient, and the BVH the least efficient [Hav01]. More recently, the bounding volume hierarchy has gained more popularity, due to its simpler traversal algorithm, lower memory requirements and being more appropriate for dynamic scenes [WBB08, WBS07, LYTM06]. Furthermore, BVHs built according to the surface area heuristic [MB90] seem to be quite competitive with kd-trees, when traced using packets of rays (Section 3.2) [WBS07].

The Bounding Volume Hierarchy, or BVH, is a hierarchical scene partitioning data structure. It differs from spatial partitioning techniques (uniform grid, kd-tree, etc.) in that it partitions objects as opposed to space. In general, a BVH starts with a root-node, holding a bounding volume corresponding to the entire scene (usually a sphere or a bounding box). The children of the root-node, and in general of every internal node (a node with children is called an internal node), partition their parent's objects by holding a bounding volume of a subset of their parent's objects. Finally, a node without children is called a leaf-node, which wraps a single object and holds a bounding volume of it. Figure 3.1 shows an example of a BVH built from a scene.

Figure 3.1: Six levels in a Bounding Volume Hierarchy: (a) level 0 through (f) level 5.

Algorithm 2 Traverse-BVH
Require: Ray R, BVH-node N
Ensure: Object O
 1: if R does not intersect the BV of N then
 2:   return null
 3: end if
 4: if N is a leaf-node then
 5:   return Object of(N)
 6: end if
 7: for all children C of N do
 8:   O ← Traverse-BVH(R, C)
 9:   if O is not null then
10:     return O
11:   end if
12: end for

3.1.1 BVH Traversal

The basic ray traversal algorithm is modified by first testing for an intersection with the BVH. A ray missing the bounding volume of the root-node implies a ray missing the entire scene, and is therefore discarded. Otherwise the ray is tested against the bounding volumes of the child-nodes, and the ray is traversed further through the nodes the ray intersects, if any. This process continues recursively until the traversal reaches a leaf-node, where the basic algorithm is executed. Algorithm 2 shows the modified traversal using a BVH as the acceleration data structure.

Using this method, many costly ray/object intersection tests will be avoided, since rays will terminate early when missing a bounding volume. On average, a ray traverses the tree until a leaf is reached or the ray is terminated early, because the ray misses all bounding volumes. Since the depth of the tree is logarithmic in the number of objects in the scene, traversing the scene is now of order $O(\log n)$, instead of $O(n)$.

3.1.2 BVH Construction

A careful construction of the BVH-tree is very important. Because the BVH determines how a ray traverses the scene, a badly constructed BVH will result in many ray/bounding-box intersection tests. For meshes with no additional hierarchical information, the construction of the BVH is non-trivial. The BVH is usually constructed by recursively partitioning the scene objects using a split plane determined by some heuristic. Traditional heuristics determine the splitting plane to be the midpoint of the bounding box's axis with the longest extent (spatial median), or the plane through the median of the objects when sorted along the axis with the longest extent (object median). Although the spatial or object median methods lead to a very efficient construction algorithm, the BVH-traversal efficiency is generally not very high. A better way to determine where to partition the scene is by using the Surface Area Heuristic. This algorithm chooses, from a set of splitting planes, the one with the lowest cost, based on the surface area and number of primitives of the children. More information about this algorithm can be found in [GS87, MB90].
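The details of the surface area heuristic are deferred to [GS87, MB90]. For orientation, the sketch below shows a commonly used form of the SAH cost for a single candidate split; the constants and the exact formulation are assumptions and not necessarily those used in CNRTS.

// Commonly used SAH cost of splitting a node into children L and R holding
// nL and nR primitives; the candidate split with the lowest cost is chosen.
struct Aabb { float min[3], max[3]; };

float surfaceArea(const Aabb& b)
{
    float dx = b.max[0] - b.min[0];
    float dy = b.max[1] - b.min[1];
    float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

float sahCost(const Aabb& parent, const Aabb& L, int nL, const Aabb& R, int nR,
              float traversalCost = 1.0f, float intersectionCost = 4.0f)
{
    float invSA = 1.0f / surfaceArea(parent);
    return traversalCost
         + intersectionCost * (surfaceArea(L) * invSA * nL
                             + surfaceArea(R) * invSA * nR);
}

The two cost constants weight traversing one more level of the tree against intersecting one more primitive; for expensive primitives such as NURBS patches, the intersection cost dominates.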
3.2 Packet-Based Ray Tracing

In [WSBW01], a modified version of the standard ray tracing algorithm is presented, called the RTRT system. By paying close attention to the coherence among the rays, a performance increase of an order of magnitude is achieved by tracing packets of rays in parallel. A kd-tree data structure is used to accelerate the ray traversal. Since the traversal algorithm is relatively simple to implement, the complexity of the code is reduced, so the compiler can optimize the generated code.

A traditional ray tracer will trace all rays sequentially. However, since there is a lot of coherence among primary rays, and to a somewhat lesser degree among shadow rays, the same memory locations will be accessed multiple times for rays that intersect the same primitive. In order to remedy this, multiple rays are grouped together to form a packet of rays. Now the whole packet is traced simultaneously. This way, the memory is only accessed once for the whole packet, instead of separately for each ray. During traversal a packet has three choices: traverse the left child, the right child, or both. Although the rays inside the packet are coherent, it is not guaranteed that they all will traverse the same child. Therefore, some rays are disabled when traversing a child they do not intersect. This way, unnecessary calculations will be avoided. By using SIMD instructions, the packets can be traversed, intersected and shaded in parallel.

Since the major bottleneck in ray tracing is the memory bandwidth [WSBW01], it is necessary to exploit the caches as much as possible. By aligning the data to the cache lines, every data item can be fetched by loading one cache line. Items larger than the cache-line width are split over multiple cache lines and padded to half the width. In order to minimize the bandwidth requirements, only data needed for the actual computation is kept together. In this way, loading data that will not be used is avoided. Another problem is the relatively high latency for accessing the main memory. By prefetching the data, instead of fetching it on demand, the data will be available instantly when it is needed.

3.3 Ray-Patch Intersection

The most important part of a NURBS ray tracer is the intersection test between a three-dimensional ray and the surface, since this test determines whether the corresponding pixel should be colored using the surface or not. Normally, such a test would be trivial, since a lot of fast analytical intersection tests are available for standard primitives, such as triangles, quadrilaterals, etc. However, for NURBS-surfaces, this is less obvious.

Algebraic methods Several approaches exist for computing an intersection between a ray and a NURBS-surface. Algebraic methods have been devised for the intersection of a ray and a bicubic parametric surface [Kaj82]. While this method computes the exact intersection, it requires the computation of the roots of a polynomial of 18th degree. Solving such a polynomial is very expensive. Furthermore, this method is applicable to bi-cubic surfaces only, and not to general NURBS-surfaces.

Subdivision Another approach is on-the-fly subdivision [BBLW07]. By repeatedly subdividing the NURBS-patches, the mesh will converge to the actual surface. Due to the coherence among rays, the same patches will be subdivided in exactly the same way. By paying attention to the coherence among these rays, the subdivision cost can be amortized over a "packet" of rays.
When the subdivision is sufficiently close to the limit surface, the ray tracer can use standard tests to compute the intersection.

Numerical methods The third approach is a numerical method to approximate the intersection, known as Newton's Iteration, or the Newton-Raphson method [PTVF97]. This method uses a very simple algorithm to iteratively find better approximations of the roots of non-linear functions. A nice property of the algorithm is that in each subsequent iteration, the error of the approximation decreases quadratically, roughly meaning that in each iteration the number of correct digits is doubled. Additionally, the algorithm only requires the previous approximation to compute the next approximation, thus keeping memory requirements low. The algorithm depends on the function, its derivative, and an initial guess. This initial guess must be close enough to the exact solution in order for the method to converge towards the correct root. The method will be explained in more detail in Section 3.3.1. Obtaining suitable initial guesses for NURBS-surfaces will be discussed in Section 3.4.

3.3.1 Newton-Raphson Method

The general idea is to start with the initial guess and compute the tangent line that approximates the function at that point (Figure 3.2). The x-intercept of this tangent line will typically be closer to the root than the initial guess. This process can be repeated to obtain closer approximations to the root of the function:
\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}. \tag{3.1}
\]

Figure 3.2: Newton's Iteration: each subsequent point is closer to the root.

However, we are dealing with a ray/NURBS-surface intersection; therefore the standard algorithm needs to be extended. The approach taken by [MCFS00] is discussed here. The ray $\vec{r} = \vec{o} + \lambda\vec{d}$, where $\vec{o}$ denotes the ray's origin and $\vec{d}$ the direction, is represented as the intersection of two orthogonal planes $\vec{p}_1 = (\vec{n}_1, d_1)$ and $\vec{p}_2 = (\vec{n}_2, d_2)$, where the $\vec{n}_i$ are orthogonal vectors of unit length, perpendicular to $\vec{d}$, and the $d_i$, the planes' distances to the world's origin, are given by $d_i = -\vec{n}_i \cdot \vec{o}$. The plane equations are set up by:
\[
\vec{n}_1 =
\begin{cases}
\dfrac{(d_y, -d_x, 0)}{\sqrt{d_x^2 + d_y^2}} & \text{if } |d_x| > |d_y| \text{ and } |d_x| > |d_z|, \\[2ex]
\dfrac{(0, d_z, -d_y)}{\sqrt{d_y^2 + d_z^2}} & \text{otherwise,}
\end{cases} \tag{3.2}
\]
\[
\vec{n}_2 = (\vec{n}_1 \times \vec{d}) \cdot |\vec{d}|^{-1}, \tag{3.3}
\]
\[
d_1 = -\vec{n}_1 \cdot \vec{o}, \tag{3.4}
\]
\[
d_2 = -\vec{n}_2 \cdot \vec{o}. \tag{3.5}
\]
The first plane equation results in a plane with a normal lying in the xy-plane or the yz-plane, depending on the largest magnitude of the components of the ray's direction (Figure 3.3).

Figure 3.3: Plane representation for a ray for which $|d_x| > |d_y|$ and $|d_x| > |d_z|$ holds, resulting in a plane rotated about the z-axis and a plane perpendicular to it (a), and for a ray for which it does not hold, where the resulting plane will be rotated about the x-axis (b). The cyan line represents the ray and the yellow plane is the first plane $\vec{p}_1 = (\vec{n}_1, d_1)$.

To find the intersection of a surface $\vec{S}(u,v)$ and a ray, we have to find the roots of:
\[
\vec{R}(u,v) = \begin{pmatrix} \vec{n}_1 \cdot \vec{S}(u,v) + d_1 \\ \vec{n}_2 \cdot \vec{S}(u,v) + d_2 \end{pmatrix}. \tag{3.6}
\]
Geometrically, this can be interpreted as the distance from the evaluated surface-point to the ray. $\vec{R}(u,v)$ becoming zero indicates that this distance becomes zero, hence an intersection point is found. The iteration is continued until one of the following termination criteria is met:

1. If the distance to the real root falls below some user-defined threshold $\varepsilon$, then an intersection point is found, i.e. $\|\vec{R}(u_n, v_n)\| < \varepsilon$.
Since the obtained intersection point is an approximation, it will in general not lie exactly on the ray. Therefore, to obtain the ray parameter $\lambda$, the point is projected onto the ray:
\[
\lambda = (\vec{S}(u,v) - \vec{o}) \cdot \vec{d}. \tag{3.7}
\]

2. Whenever the iteration takes us further away from the root, the computation is aborted, assuming divergence: $\|\vec{R}(u_{n+1}, v_{n+1})\| > \|\vec{R}(u_n, v_n)\|$.

3. A maximum number of iteration steps has been performed, also indicating divergence.

Whenever none of the criteria is met, the algorithm continues with the next iteration by updating the $(u, v)$ values. However, since we are dealing with a bi-variate function, Equation 3.1 cannot be employed. Instead, the iteration step is extended to a two-dimensional problem, given by:
\[
\begin{pmatrix} u_{n+1} \\ v_{n+1} \end{pmatrix} = \begin{pmatrix} u_n \\ v_n \end{pmatrix} - J^{-1}\vec{R}(u_n, v_n), \tag{3.8}
\]
where $J$ is the Jacobian matrix of $\vec{R}$, containing the partial derivatives:
\[
J = \begin{pmatrix} \vec{n}_1 \cdot \vec{S}_u(u,v) & \vec{n}_1 \cdot \vec{S}_v(u,v) \\ \vec{n}_2 \cdot \vec{S}_u(u,v) & \vec{n}_2 \cdot \vec{S}_v(u,v) \end{pmatrix}. \tag{3.9}
\]
One problem which may occur is that the Jacobian matrix turns out to be singular, making it impossible to compute the inverse. This may happen, for example, at non-regular areas of the surface (i.e. where the normals are zero). Therefore, in the case of a singular Jacobian, the $(u, v)$ values are perturbed with some small random values. Algorithm 3 gives an overview of how to compute the intersection between a ray and a NURBS-surface, given an initial-guess value $(u, v)$. The core part of the algorithm, the so-called iteration, heavily depends on the evaluation of a surface-point together with its partial derivatives. Several algorithms exist for efficient evaluation of surface-points and partial derivatives.

3.4 Adaptive subdivision

The first step required for a direct NURBS ray tracing implementation is the creation of a bounding volume hierarchy (Section 3.1.2). Normally, a leaf of a BVH bounds multiple primitives. Since the cost of the intersection test between a ray and a triangle (or any other primitive for which a very efficient intersection test exists) is relatively low compared to that of traversing the BVH, it is clever to group multiple primitives. However, when ray tracing NURBS-surfaces, this cost is very high; therefore, it is generally not a very good idea to group multiple NURBS-surfaces into a single BVH leaf-node. Instead of each BVH leaf-node bounding multiple surfaces, the surface is refined and for each sub-patch a new child node is created, bounding the convex hull of the sub-patch, as in [MCFS00]. Besides a reference to the corresponding surface-patch, these leaf-nodes also contain an initial-guess parameter value for the Newton's Iteration root-finder (Section 3.3.1). This initial-guess value is defined as the center of the parametric interval of the sub-patch's parametric domain. By increasing the number of leaf-nodes for a surface, the parametric interval becomes smaller, and a better initial-guess value is obtained (Figure 3.4). Additionally, by increasing the number of leaf-nodes per surface, the number of intersection tests for rays missing the surface is reduced, because the bounding volume is tighter when using many small bounding volumes. In other words, rays will be culled away before the costly intersection test takes place. In order to construct a bounding volume hierarchy from the subdivision, the leaf-nodes are sorted along the axis with the greatest extent. The first half of the leaf-nodes becomes the first child node, and the second half the second child.
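The construction just described, sorting the sub-patch leaf-nodes along the axis with the greatest extent and splitting the list in half, can be sketched as follows. The Leaf and BvhNode types are illustrative (the actual node layout used by CNRTS may differ), and the Aabb type is the axis-aligned box from the SAH sketch in Section 3.1.2.

#include <algorithm>
#include <memory>
#include <vector>

// Sketch of a median-split BVH build over sub-patch leaves; the leaves
// vector must outlive the tree, since leaf-nodes keep pointers into it.
struct Leaf {
    Aabb  box;       // convex hull of the sub-patch's control-points
    int   patchId;   // reference to the owning surface-patch
    float u, v;      // initial guess: centre of the sub-patch's parametric interval
};

struct BvhNode {
    Aabb box;
    std::unique_ptr<BvhNode> left, right;
    const Leaf* leaf = nullptr;              // set for leaf-nodes only
};

static Aabb merge(const Aabb& a, const Aabb& b)
{
    Aabb r;
    for (int k = 0; k < 3; ++k) {
        r.min[k] = std::min(a.min[k], b.min[k]);
        r.max[k] = std::max(a.max[k], b.max[k]);
    }
    return r;
}

std::unique_ptr<BvhNode> build(std::vector<Leaf>& leaves, int begin, int end)
{
    auto node = std::make_unique<BvhNode>();
    node->box = leaves[begin].box;
    for (int i = begin + 1; i < end; ++i)
        node->box = merge(node->box, leaves[i].box);

    if (end - begin == 1) {                  // single sub-patch: make a leaf-node
        node->leaf = &leaves[begin];
        return node;
    }

    // Axis with the greatest extent of this node's bounding box.
    int axis = 0;
    for (int k = 1; k < 3; ++k)
        if (node->box.max[k] - node->box.min[k] > node->box.max[axis] - node->box.min[axis])
            axis = k;

    // Order along that axis and split at the median: first half left, second half right.
    int mid = (begin + end) / 2;
    std::nth_element(leaves.begin() + begin, leaves.begin() + mid, leaves.begin() + end,
                     [axis](const Leaf& a, const Leaf& b) {
                         return a.box.min[axis] + a.box.max[axis]
                              < b.box.min[axis] + b.box.max[axis];
                     });
    node->left  = build(leaves, begin, mid);
    node->right = build(leaves, mid, end);
    return node;
}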
2 Or 22 any other primitive for which a very efficient intersection test exists Adaptive subdivision Algorithm 3 The Newton Root-Finder Require: u, v Ensure: λ, u, v 1: error prev ← ∞ 2: uinitial ← u 3: vinitial ← v 4: for i = 1 to MAX_ITERATIONS do ~S ← EvaluateSurface(u, v) 5: ~ ~R ← ~n1 · S + d1 6: ~n2 · ~S + d 2 7: error ← ~R1 + ~R2 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: if error < ε then return λ end if if error > error prev then return ∞ end if error prev ← error J ← ComputeJacobian if singular(J) then u u (uinitial − u) · random([0, 1]) ← + 0.1 v v (vinitial − v) · random([0, 1]) else u u ← − J −1~R v v end if end for return ∞ Figure 3.4: Example of a NURBS-curve with exemplary associated bounding volumes. 3.4.1 Recursive subdivision One very simple method to subdivide a surface is to recursively split each surface-patch in the center of the parametric domain and the sides of the domain, resulting in four sub-patches. These new patches determine the knot-span used for evaluation. Since there is no spatial relation between the control-points of the surface and its parametric domain, sampling in such a uniform way will not necessarily result in a uniform distribution of the 23 3. NURBS R AY T RACING sample points on the surface. For example, the subdivision could fail to represent highly curved areas of a surface if the sampling density is too low. A simple solution is to increase the sampling density, in order to refine such curved areas. However, this will unnecessarily increase the number of polygons in flat areas of the surface. A better solution is to adaptively determine the sampling density in different areas of the surface. This way, highly curved areas will automatically be sampled more dense than less curved areas. The uniform method is easily transformed into an adaptive version. Instead of using a fixed recursion depth, extra criteria can be included to determine whether or not a (sub-)patch needs more subdivision. Normal based criteria [Abe05] samples the normals ~ni of the sub-patch at eight different locations (Figure 3.5a) and then uses the following criterion to stop the subdivision for that sub-patch [Abe05]: 7 ∏~ni ·~ni+1 ≈ 1. i=1 The dot-product between two nearly-identical vectors is approximately equal to one3 . In other words, the surface is considered flat if the normals are all approximately equal. However, this method is not always successful in capturing curved areas as Figure 3.5b indicates. (a) The blue circles mark the positions where normals will be(b) The normals taken may satisfy the flatness criterion, but the evaluated. peak in the middle of the spline is not recognizes properly. Figure 3.5: Flatness criterion based on the normals of a sub-patch. Straightness based criteria A more thorough approach is taken by [Pet94], however their method operates directly on the control-point grid instead of the surface itself. A straightness-test is applied to every row and column of the control-point grid. The straightness-test return true if all points in the row (or column) are co-linear. However, this test alone is not sufficient. If the surface is twisted, the rows and columns could be straight, but the surface is nevertheless heavily curved. Therefore, a final check is made to determine if the corners are co-planar. To split the sub-patch, a refinement step is applied using the Oslo algorithm [CLR80, BBB87]. 
This algorithm modifies the knot-vector of the non-straight direction by adding p (or q) new knots 3 It 24 is assumed that |~ni | = 1. Adaptive subdivision at the parametric center of the sub-patch. Then, using this new knot-vector, the control-point grid is updated by adding control-points corresponding to these new knots. Using such a refinement operation, the shape of the surface remains unchanged. However, since p (or q) new knots are inserted, the new control-points corresponding to these knots touch the surface. Thus, two separate sub-patches are obtained along with their control-points (Figure 3.6). Figure 3.6: Splitting a sub-patch into two sub-patches obtaining two control-point grids. During the subdivision, two adjacent sub-patches with different subdivision depths introduce cracks in the surface (Figure 3.7a) if the subdivision is used as a tessellation. In order to fix this, the surface points corresponding to the corners of the crack are projected onto the straight line (Figure 3.7b). (a) Crack introduced by different subdi-(b) Fixed crack by projecting the corvision depths. ners onto the straight line. Figure 3.7: Procedure to fix a crack in the tessellation. 3.4.2 Direct subdivision Yet another method for subdivision, particularly well suited for ray tracing, is described in [MCFS00], in which the refinement scheme is based on the curvature of each patch. Each patch is refined according to a heuristic approximating the curvature of that patch. The heuristic operates on the iso-curves4 and takes three measures into consideration: the maximum curvature of a curve-segment corresponding to a knot-span, the length of that curve-segment, and a bound on the deviation of the curve-segment from its linear approximation. The measure for the maximum curvature ensures that no multiple roots are found when finding an intersection between a ray and the surface (recall that this method is intended for ray tracing). The measure for the length of a curve-segment makes sure that the initial-guess values are close enough, especially for large patches. The first two measures result in the following heuristic value: ~ ~ n1 = C1 × max[ti ,ti+1 ) {curvature(C(t))} × arclength(C(t)) [ti ,ti+1 ) . 4 Iso-curves are the curves on a surface with a fixed u value or a fixed v value. 25 3. NURBS R AY T RACING The third measure determines the number of segments that needs to be generated and is approximated by: q ~ n2 = C2 × arclength(C(t)) [ti ,ti+1 ) . Combining the previous two heuristics, the heuristic value that determines the number of knots n to insert into the knot-span is determined by n1 × n2 : 3/2 ~ ~ n = C × max[ti ,ti+1 ) {curvature(C(t))} × arclength(C(t)) [ti ,t i+1 ) . The C constant is a fineness constant and can be determined empirically. Since the maximum curvature and the arc length are hard to compute, their values are heavily estimated, which usually results in a too fine refinement rather that too coarse. However, the computation is much faster compared to the exact functions. Furthermore, the constant C can be used to fine-tune the subdivision. The final heuristic (after replacing many functions by their estimates) becomes: n =C× maxi−p+2≤ j≤i |~A j | × (ti+1 − ti )3/2 . ( 1 ∑ij=i−p+1 |~V j |)1/2 p where ~ ~ ~V j = p Pj − Pj−1 , t j+p − t j ~ ~ ~A j = (p − 1) V j − V j−1 . t j+p−1 − t j The full derivation of the estimated maximum curvature and arc length functions can be found in [MCFS00]. The actual subdivision takes place in two subsequent steps. 
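The heuristic above can be transcribed almost literally into code. The following C++ sketch computes the number of knots to insert for one knot-span of one row (or column) of the control-point grid. All names are illustrative, boundary handling and guards against degenerate (zero-length) spans are omitted, the result is simply rounded up here, and the fineness constant C is left to the caller. The two subdivision steps that consume these counts are described next.

#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static Vec3 scale(const Vec3& a, double s)    { return {a.x*s, a.y*s, a.z*s}; }
static double len(const Vec3& a) { return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z); }

// First- and second-difference vectors V_j and A_j of the control polygon.
static Vec3 V(const std::vector<Vec3>& P, const std::vector<double>& t, int p, int j)
{
    return scale(sub(P[j], P[j-1]), p / (t[j+p] - t[j]));
}
static Vec3 A(const std::vector<Vec3>& P, const std::vector<double>& t, int p, int j)
{
    return scale(sub(V(P, t, p, j), V(P, t, p, j-1)), (p - 1) / (t[j+p-1] - t[j]));
}

// Number of knots to insert into the non-empty knot-span [t_i, t_{i+1}) of one
// row/column of the control grid, following the estimated heuristic above.
int knotsToInsert(const std::vector<Vec3>& P, const std::vector<double>& t,
                  int p, int i, double C)
{
    double maxA = 0.0;
    for (int j = i - p + 2; j <= i; ++j)
        maxA = std::max(maxA, len(A(P, t, p, j)));

    double sumV = 0.0;
    for (int j = i - p + 1; j <= i; ++j)
        sumV += len(V(P, t, p, j));

    double span = t[i+1] - t[i];
    double n = C * maxA * std::pow(span, 1.5) / std::sqrt(sumV / p);
    return (int)std::ceil(n);
}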
In the first step, the heuristic is applied to determine the number of knots that needs to be inserted into the non-empty knot-spans. The number of knots that needs to be inserted into each knot-span of the u direction is determined by applying the heuristic to each row of the control-point grid. Per knot-span, the maximum of all rows for this knot-span is used as the number of new knots. This is repeated for the v direction, but now for every column. After the numbers have been determined, the new knots are inserted uniformly distributed over the knot-span. As a final step, each patch is converted into a Bézier-patch by setting the multiplicity of the internal knots to p and q for the u and v parametric directions, respectively. In the second step, the new control-point grid is computed from the new knot-vectors by using a method similar to knot-insertion (Section 2.5). While it is not important for this step to be fast (in the case of ray tracing it is only done once in a preprocessing phase), the actual implementation will not be discussed here an can be found in [MCFS00]. 3.5 Evaluation Schemes When evaluating a point on a NURBS-curve or surface, it is important to note, that not every basisfunction needs to be computed. Since t lies in exactly one knot-span, only one zero-degree basisfunction is non-zero. Consequently, only two basis-functions of degree one are non-zero. In general, only p + 1 basis-functions of degree p are non-zero. In the following scheme, these non-zero basisfunctions are emphasized: 26 Evaluation Schemes Algorithm 4 EvaluateSurface Require: spanu , spanv , Pi j , p, q, u ∈ [uspanu , uspanu +1 ), v ∈ [vspanv , vspanv +1 ) . Ensure: ~S, ~Su , ~Sv 1: Nu [0..p], Nderivu [0..p] ← EvaluateBasisFunctionsAndDerivativesu 2: Nv [0..q], Nderivv [0..q] ← EvaluateBasisFunctionsAndDerivativesv 3: ~ Sh = (~S, Sw ) ← ~0 ~u h = (S~u , Su,w ) ← ~0 4: S h S~v = (S~v , Sv,w ) ← ~0 for i = spanu − p to spanu do for j = spanv − q to spanv do ~Sh ← ~Sh + ~Ph Nu [i − (spanu − p)]Nv [ j − (spanv − q)] ij h h ~ ~ 9: Su ← Su + ~Pihj Nderivu [i − (spanu − p)]Nv [ j − (spanv − q)] h h 10: S~v ← S~v + ~Pihj Nu [i − (spanu − p)]Nderivv [ j − (spanv − q)] 11: end for 12: end for 13: ~ S ← Sw−1~Sh −1 )S−1 ~u ← (S~u − ~SSu,w 14: S w −1 )S−1 ~v ← (S~v − ~SSv,w 15: S w 16: return ~ S, ~Su , ~Sv 5: 6: 7: 8: p p p p p · · · Ni−p (t) · · · Ni−3 (t) Ni−2 (t) Ni−1 (t) Ni (t) .. . 2 (t) · · · N 2 (t) N2 (t) N2 (t) N2 (t) · · · Ni−p i−3 i−2 i−1 i 1 (t) · · · N 1 (t) N 1 (t) N1 (t) N1 (t) · · · Ni−p i−3 i−2 i−1 i 0 (t) · · · N 0 (t) N 0 (t) N 0 (t) N0 (t) · · · Ni−p i−3 i−2 i−1 i p p Ni+1 (t) Ni+2 (t) · · · 2 (t) N 2 (t) · · · Ni+1 i+2 1 (t) N 1 (t) · · · Ni+1 i+2 0 (t) N 0 (t) · · · Ni+1 i+2 For efficient evaluation, this is a very important fact: instead of multiplying and summing up all n control-points, actually only p + 1 control-points are involved in the computation. Thus Equation 2.2 becomes: a ~ h (t)[t ,t ) = C a a+1 ∑ ~Pih Nip (t) i=a−p and for a surface, becomes: b ~Sh (u, v)[u ,u ),[v ,v ) = a a+1 b b+1 ∑ " a ∑ # ~Pihj Nip (u) N qj (v) j=b−q i=a−p Basis-function cache Algorithm 4 simultaneously evaluates a point on a NURBS-surface along with its partial derivatives, by first computing the basis-functions in both parametric directions and storing them in a basis-function cache [AGM06]. Then, the computed basis-functions are multiplied with the corresponding control-points and added to the result. Directly evaluating the basis-functions using Equation 2.4 is not very efficient. 
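Algorithm 4 can be summarized by the following C++ sketch, which evaluates the surface point and its partial derivatives from the cached basis-function values, touching only the (p+1)×(q+1) relevant control points and performing the rational (homogeneous) division at the end. The type names and the helper evaluateBasisFunctions are illustrative; the helper stands for any of the evaluation schemes discussed in the remainder of this section. The reasons why a naive recursive evaluation of Equation 2.4 is inefficient are listed next.

#include <vector>

struct Vec3 { double x, y, z; };
struct Vec4 { double x, y, z, w; };   // homogeneous control point (w*P, w)

// Assumed to exist: fills N[0..p] and Nd[0..p] with the p+1 non-zero basis
// functions and their derivatives for the given knot-span.
void evaluateBasisFunctions(const std::vector<double>& knots, int span, int p,
                            double t, double* N, double* Nd);

// Sketch of Algorithm 4: evaluate S(u,v), Su(u,v), Sv(u,v) from cached basis functions.
void evaluateSurfacePoint(const std::vector<std::vector<Vec4>>& P, // control grid
                          const std::vector<double>& U, const std::vector<double>& V,
                          int p, int q, int spanU, int spanV, double u, double v,
                          Vec3& S, Vec3& Su, Vec3& Sv)
{
    double Nu[16], Ndu[16], Nv[16], Ndv[16];     // basis-function caches, p and q assumed < 16
    evaluateBasisFunctions(U, spanU, p, u, Nu, Ndu);
    evaluateBasisFunctions(V, spanV, q, v, Nv, Ndv);

    Vec4 Sh{0,0,0,0}, Suh{0,0,0,0}, Svh{0,0,0,0};
    for (int i = 0; i <= p; ++i)
        for (int j = 0; j <= q; ++j) {
            const Vec4& Pij = P[spanU - p + i][spanV - q + j];
            double b  = Nu[i]  * Nv[j];
            double bu = Ndu[i] * Nv[j];
            double bv = Nu[i]  * Ndv[j];
            Sh  = {Sh.x  + Pij.x*b,  Sh.y  + Pij.y*b,  Sh.z  + Pij.z*b,  Sh.w  + Pij.w*b };
            Suh = {Suh.x + Pij.x*bu, Suh.y + Pij.y*bu, Suh.z + Pij.z*bu, Suh.w + Pij.w*bu};
            Svh = {Svh.x + Pij.x*bv, Svh.y + Pij.y*bv, Svh.z + Pij.z*bv, Svh.w + Pij.w*bv};
        }

    // Rational division: S = Sh/Sw, Su = (Suh - S*Su,w)/Sw, Sv = (Svh - S*Sv,w)/Sw.
    double invW = 1.0 / Sh.w;
    S  = { Sh.x*invW, Sh.y*invW, Sh.z*invW };
    Su = { (Suh.x - S.x*Suh.w)*invW, (Suh.y - S.y*Suh.w)*invW, (Suh.z - S.z*Suh.w)*invW };
    Sv = { (Svh.x - S.x*Svh.w)*invW, (Svh.y - S.y*Svh.w)*invW, (Svh.z - S.z*Svh.w)*invW };
}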
At first, many higher-degree basis-functions depend on the same lower-degree basis-function, resulting in repeated 27 3. NURBS R AY T RACING Algorithm 5 EvaluateBasisFunctionPowerBasisForm Require: t, p, * P ← αp * D←0 for i = p − 1 to 0 * D ← D∗t +P * P ← P ∗ t + αi endfor return (P, D) computations for these lower-degree basis-functions. Secondly, many computations of the basisfunctions are actually unnecessary, since they depend on basis-functions whose active domain (where the basis-function is non-zero), does not contain the evaluation parameter value and are therefore zero for that value. Finally, a recursive implementation of this algorithm, when compiled, is not very efficient, since the basis-functions are evaluated many times and thus many function calls are issued, which cannot be in-lined by the compiler, resulting in poor performance. 3.5.1 Power-basis Form Evaluation One possibility to avoid the recursion in Equation 2.4 is to convert the recursive basis-functions into power-basis form as in [Abe05]. The power-basis form of a basis-function is defined as: p N(t) = ∑ αit i i=0 Using Horner’s scheme [PTVF97] the polynomial and its derivative can be evaluated very efficiently with linear time complexity: p N(t) = ∑ αit i i=0 = α0t 0 + α1t 1 + · · · + α p−1t p−1 + α pt p = t(· · ·t(t α p + α p−1 ) · · · + α1 ) + α0 | {z } linear in p For each knot-span, a set of basis-functions in power-basis form is evaluated as well as their derivatives. The power-basis coefficients αi are obtained in a preprocessing step by a polynomial expansion of Equation 2.4. For full details I refer to [Abe05]. Algorithm 5 evaluates the basis-function and its derivative simultaneously using only multiplyadd5 operations. Although the power-basis form enables a very efficient evaluation of a basis-function and its derivative, it does require some preprocessing. Additionally, there is a lot of redundancy in the generated coefficients, since a polynomial is maintained for each knot-span for each basis-function, while these are derived from the knot-vector which uses much less memory. Finally, a rather bad property of polynomials in power-basis form is that the evaluation is numerically unstable if the coefficients vary greatly in magnitude [FR87]. 5 Some hardware architectures (e.g. GPUs) are able to perform a multiply operation followed by an add operation in a single combined multiply-add instruction. 28 Evaluation Schemes 3.5.2 Direct Evaluation Although there are efficient iterative algorithms available for the original Cox-de Boor recurrence Equation 2.4, their complexity is still quadratic in the degree of the curve as opposed to the linear complexity of the power-basis form method. However, they are numerically stable, which is a very important property for the numerical root-finder (Section 3.3.1). Observing the computation of all p-degree basis-functions (Equation 2.4) for t ∈ [ti ,ti+1 ), it becomes clear there is a lot of redundancy in the computation of them: t − ti−p ti+1 − t p−1 p−1 p Ni−p (t) + N (t) (3.10) Ni−p (t) = ti − ti−p ti+1 − ti−p+1 i−p+1 t − ti−p+1 ti+2 − t p p−1 p−1 Ni−p+1 (t) = Ni−p+1 (t) + N (t) (3.11) ti+1 − ti−p+1 ti+2 − ti−p+2 i−p+2 .. . 
ti+p − t t − ti−1 p p−1 p−1 Ni−1 (t) = N (t) + N (t) (3.12) ti+p−1 − ti−1 i−1 ti+p − ti i ti+p+1 − t t − ti p−1 p−1 p (3.13) Ni (t) + N (t) Ni (t) = ti+p − ti ti+p+1 − ti+1 i+1 p p The last term of basis-function N j (t) and first term of basis-function N j+1 (t) contain the same factor: p−1 Qj = N j+1 (t) t j+p+1 − t j+1 Note that for the first term of Equation 3.10 and the last term of Equation 3.13 this factor is zero, p−1 p−1 since Ni−p (t) = Ni+1 (t) = 0 for t ∈ [ti ,ti+1 ). Now, let: α j = t − ti+1− j β j = ti+ j − t Equations 3.10-3.13 are then: " p Ni−p (t) = α p+1 p−1 Ni−p (t) # " p−1 Ni−p+1 (t) # + β1 α p+1 + β0 α p + β1 " p−1 # " p−1 # Ni−p+1 (t) Ni−p+2 (t) p Ni−p+1 (t) = α p + β2 α p + β1 α p−1 + β2 .. . # " # p−1 p−1 Ni−1 (t) Ni (t) + βp α2 + β p−1 α1 + β p " # " p−1 # p−1 Ni+1 (t) Ni (t) p Ni (t) = α1 + β p+1 α1 + β p α0 + β p+1 " p Ni−1 (t) = α2 Algorithm 6 (named “Inverted Triangle”-method in [PT97]) simultaneously evaluates the basisfunctions and their derivatives by reusing the previously computed values. Since terms resulting in a division by zero are not computed anymore, the algorithm requires no special treatment of such cases. 29 3. NURBS R AY T RACING Algorithm 6 EvaluateBasisFunctionsDirect Require: i, t ∈ [ti ,ti+1 ), p. Ensure: N[0..p], D[0..p] 1: N[0] ← 1 2: for j = 1 to p do 3: α j ← t − ti+1− j , β j ← ti+ j − t 4: RN ← 0, RD ← 0 5: for k = 0 to j − 1 do 6: Q ← α N[k] j−k +βk+1 7: N[k] ← RN + βk+1 Q 8: RN ← α j−k Q 9: D[k] ← p(RD − Q) 10: RD ← Q 11: end for 12: N[ j] ← RN 13: D[ j] ← pRD 14: end for 3.5.3 Division-free Evaluation The iterative method described in Section 3.5.2 does not require preprocessing, and the additional storage is even zero (excluding the knot-vector, which is always needed). However, this method contains division-operations, which are costly compared to basic operations such as multiply and addition. Another very fast, yet simple and memory-efficient, iterative approach is presented in [AGM06], for evaluating NURBS-curves and surfaces. Their method requires very little preprocessing, and the data resulting from it takes up only but a small amount of memory. By carefully investigating the Cox de-Boor recurrence from Equation 2.4, one can see that a lot of constants are involved. By rewriting the equation, the constant part of the equation can be separated out: ti+p − t t − ti p−1 p−1 N (t) + N (t) ti+p−1 − ti i ti+p − ti+1 i+1 1 −1 p−1 p−1 (t − ti ) Ni (t) + (t − ti+p ) Ni+1 (t) = ti+p−1 − ti ti+p − ti+1 p Ni (t) = p−1 = [a (t − ti )] Ni p−1 = [at + c] Ni p−1 (t) + [b (t − ti+p )] Ni+1 (t) p−1 (t) + [bt + d] Ni+1 (t). The resulting equation was rewritten in such a way, that evaluation of a basis-function requires only three multiply-add instructions and one multiply instruction, given that the depending basisfunctions of lower degree are known: p−1 [at + c] N | {z } i multiply-add p−1 (t) + [bt + d] Ni+1 (t) | {z } multiply-add | | 30 {z multiply-add {z multiply } } Evaluation Schemes Where the constants can be precomputed in a preprocessing step: p ai = p 1 ti+p−1 − ti p p bi = p ci = −ai ti −1 ti+p − ti+1 p di = −bi ti+p A slightly different approach is taken for computing all required basis-functions. Figure 3.8 shows the dependencies and evaluation order of the basis-functions. All basis-functions of degree higher than zero depend on two lower-degree basis-functions. Furthermore, the outer basis-functions depend on only one basis-function. (a) (b) Figure 3.8: Evaluating the NURBS basis-functions. 
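As an illustration, the constants above can be generated once per knot-vector in a preprocessing step. The following C++ sketch does this for all degrees up to p; the names are illustrative, and the 0/0 := 0 convention is applied when a knot-interval has zero length. Algorithm 7, which follows, then consumes such precomputed values while evaluating the basis-functions and their derivatives.

#include <vector>

// Per-basis-function, per-degree constants of the division-free scheme [AGM06]:
// N_i^p(t) = (a*t + c) * N_i^{p-1}(t) + (b*t + d) * N_{i+1}^{p-1}(t).
struct SpanConstants { double a, b, c, d; };

// Precompute the constants for all basis functions and all degrees 1..p from the knot-vector.
std::vector<std::vector<SpanConstants>>
precomputeConstants(const std::vector<double>& t, int p)
{
    int n = (int)t.size() - p - 1;               // number of basis functions
    std::vector<std::vector<SpanConstants>> K(p + 1, std::vector<SpanConstants>(n));
    for (int deg = 1; deg <= p; ++deg)
        for (int i = 0; i < n; ++i) {
            double da = t[i + deg - 1] - t[i];
            double db = t[i + deg]     - t[i + 1];
            double a = (da != 0.0) ?  1.0 / da : 0.0;   // 0/0 := 0 for repeated knots
            double b = (db != 0.0) ? -1.0 / db : 0.0;
            K[deg][i] = { a, b, -a * t[i], -b * t[i + deg] };
        }
    return K;
}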
Algorithm 7 simultaneously evaluates the basis-functions and their derivatives [AGM06]. Algorithm 7 EvaluateBasisFunctionsDivisionFree Require: i, t ∈ [ti ,ti+1 ), p. Ensure: N[0..p], D[0..p] 1: N[0] ← 1 2: for j = 1 to p do j 3: D[ j] ← p(ai N[ j − 1]) j j 4: N[ j] ← (ai t + ci )N[ j − 1] 5: for k = j − 1 down to 1 do j j 6: D[k] ← p(ai−( j−k) N[k] − bi−( j−k) N[k + 1]) j j j +(bi−( j−k)t + di−( j−k) )N[k + 1] end for j D[0] ← p(−ai− j N[ j]) 8: 9: 10: 11: j N[k] ← (ai−( j−k)t + ci−( j−k) )N[k] 7: j j N[0] ← (ai− j t + ci− j )N[ j] end for 3.5.4 De Boor’s Algorithm De Boor’s algorithm takes a very different approach in evaluating the points on a NURBS-curve or surface. Instead of first computing the basis-functions, and using them in a second stage by multiplying them with the control-points, it operates directly on the control-points. 31 3. NURBS R AY T RACING Geometrically, it is identical to repeatedly inserting knots (Section 2.5) until the knot has a multiplicity p. The curve-point is then equal to the control-point (after refinement) interpolated by the curve at the knot. However, instead of computing each time all new control-points of the refinement, the algorithm limits itself to a subset of the refinement. Algorithm 8 shows the De Boor’s Algorithm for evaluating a NURBS-curve at t. Algorithm 8 De Boor’s Algorithm Require: t ∈ [ti ,ti+1 ), p. ~ Ensure: C(t) 1: if t 6= ti then 2: s←0 3: else 4: s ← multiplicity(ti ) 5: end if 6: h ← p − s 7: for j = 1 to h do 8: for k = i − p + j to k − s do j 9: αk ← tk+p−t−tj+1k −tk ~P j ← (1 − α j )~P j−1 + α j ~P j−1 k k k−1 k k end for 12: end for p−s 13: return ~ Pi−s 10: 11: To extend the algorithm to a surface-evaluation, observe the following (from Equation 2.8): " # m−1 n−1 p h h ~S (u, v) = ∑ ∑ ~Pi j N (u) N q (v). i j=0 j i=0 For a fixed j this becomes: n−1 ~qhj (u) = ∑ ~Pihj Nip (u). i=0 Therefore, evaluation of ~Sh (u, v) can be done by first applying De Boor’s Algorithm q times on the iso-v-curves, and then applying it one more time to the curve defined by: m−1 ~Sh (u, v) = ∑ ~qhj (u)Nip (v). j=0 This results in a numerically very stable algorithm for computing NURBS-surface points. However, it does require the temporary storage of 2q surface-points. 3.6 Overview In summary, NURBS-surfaces can be ray traced directly using a modified version of the basic ray tracing algorithm by including the following steps: • In a preprocessing step: 1. Subdivide the surface to obtain better initial-guess values for the root-finder, and to reduce the number of intersection tests (Section 3.4). 32 Discussion 2. Construct a bounding volume hierarchy containing these initial-guess values from the subdivision from the previous step. After the bounding volume hierarchy is generated, the subdivision can be discarded (Section 3.1.2). • During ray tracing: 1. Traverse the bounding volume hierarchy using a packet-based BVH traversal scheme (Section 3.2). 2. When reaching a leaf-node, use root-finding to obtain the intersection-point (Section 3.3.1). 3. From all obtained intersection-points, take the closest w.r.t. the ray, and use that intersectionpoint for further shading computations, and as a spawning point for secondary rays. 3.7 Discussion The direct ray tracing of NURBS-surfaces poses an exciting challenge. Ray tracing itself is already a very expensive algorithm, combining it with NURBS-surface support will make it only more complex. 
However, the many advantages over standard rasterization and tessellation-based ray tracing are worthwhile to investigate an efficient implementation. At first, the reduced pre-processing time is a huge improvement, making interactive prototyping a lot easier. Secondly, as a consequence of using direct ray tracing of NURBS-surfaces, the required data, including preprocessed data, will be very compact, making it possible to ray trace very geometrically complex surfaces. Thirdly, exact representation makes it possible to zoom-in while preserving the model’s smoothness: never will a linear approximation become visible (as opposed to the rasterization-based method and the tessellation-based ray tracing method). Finally, the benefits of ray tracing alone are the accurate optical phenomena, such as shadows, reflections and refractions. In order to improve the performance of the ray tracer, efficient algorithms need to be employed for finding the intersections. The Newton-Rhapson method proves to be a very good candidate for this, due to its quadratic convergence rate and low memory requirements. By subdividing the surface in a preprocessing step, not only the surface is better approximated by the corresponding bounding volume hierarchy, but the smaller sub-patches also provide better initial-guess values. For evaluation of the surface-points and their partial-derivatives, multiple algorithms are available. While the power-basis form provides a very simple method for evaluation, the memorybandwidth required for it is rather high. Also, its stability decreases for increasing degrees. Whereas the direct form is more memory-friendly, as it does not rely on any preprocessed data. However, it performs divisions, which are very costly. The division-free method on the other hand is relatively memory-friendly and does not perform divisions at all, and is supposedly the fastest yet available according to the literature [AGM06]. The last method, the de Boor’s algorithm is definitely not memory-friendly, as it requires a huge amount of registers. However, the method is very robust. Clearly, each method has its own advantages and disadvantages. However, due to the low memoryrequirements of the direct and division-free methods, they are worthwhile for investigation on GPU. 33 Chapter 4 NURBS Ray Tracing on the GPU Recently, graphical processing units (GPUs), originally only meant for rasterization-based graphics rendering, are being “misused” for other, general-purpose computations. With the increasing programmability of commodity GPUs, these chips are capable of performing more than the specific graphics computations for which they were originally designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications [OLG+ 07]. As of now, NVIDIA GPUs are being used to accelerate computations for computational fluid dynamics, finance, physics, life sciences, signal processing, and many other non-graphics areas. However, due to the limited memory of GPUs and streaming architecture, some extra challenges need to be tackled, when trying to exploit the raw processing power of modern GPUs for ray tracing. This chapter will discuss recent developments in GPU Ray Tracing, and even more recent, NURBS Ray Tracing on the GPU. 4.1 GPU Ray Tracing Due to the parallel nature of the ray tracing algorithm (Section 1.2), researchers have long been trying to exploit the huge performance of GPUs. The first steps were taken with The Ray Engine [CHH02]. 
The ray tracer used the programmable shader pipeline of the GPU to compute all rayprimitive intersections. While the GPU was able to quickly compute the intersections, streaming the data to the GPU quickly became the bottleneck. This bottleneck in turn was avoided by moving all computations to the GPU [PBMH05]. However, this implementation required multiple rendering passes to ray trace a scene. One pass computed the eye rays, while a second pass traverses a uniform grid, a third pass computes the intersections, and the fourth pass shades the result. This process is shown schematically in Figure 4.1. Although the memory transfers were heavily decreased, they still were not able to outperform CPU-based ray tracers. Due to limited shader size and no branching support, many CPU-controlled rendering passes were necessary to traverse, intersect and shade the rays. 4.1.1 GPU Scene Traversal kd-Tree Traversal Also the implementation of kd-tree traversal has been researched. Since a stack is rather difficult to implement on a GPU, several methods have been devised to overcome this problem. [FS05] presented two methods for stackless traversal of a kd-tree on GPU, namely kd-restart and kd-backtrack [FS05]. Although their methods are better suited for the GPU, the high number of redundant traversal steps leads to relative low performance. By adding a short stack, [HSHH07] improved the kd-restart 35 4. NURBS R AY T RACING ON THE GPU Figure 4.1: Multiple stages in the ray tracer of [PBMH05]. method [HSHH07]. They achieved a performance of 15.2M rays/s for the Conference scene (appr. 280k triangles). [PGSS07] presented a CUDA-based stackless packet traversal algorithm by using ropes, pointers to adjacent nodes [PGSS07]. Their traversal algorithm does not need to be restarted and is able to instantly continue the traversal starting from the corresponding node, resulting in a performance of 16.7M rays/s. However, their implementation requires six ropes for each node, increasing the memory requirements by a factor of 4, limiting the support to only medium-sized scenes. Additionally, even though optimized for the GPU architecture, it can still not utilize the full power of modern GPUs. 4.1.2 Stackless BVH Traversal Although the kd-tree allows for fast scene traversal, the memory requirements are relatively high. Since the GPU has limited memory, the size of supported scenes for kd-trees is limited. However, BVHs generally require only 25% to 33% of memory compared to kd-trees, and compared to the stackless kd-tree traversal even 10% [PGSS07]. [TS05] presented a stackless traversal algorithm for the BVH which allows for efficient GPU implementations [TS05]. By storing the BVH tree in depth-first order, the tree is traversed in fixedorder. Therefore, no explicit reference needs to be kept for child-nodes. However, a ray could miss a node, and traversal needs to continue starting from another part of the tree. By including skippointers in the nodes, traversal can continue instantly from these other nodes (Figure 4.2). Although they outperformed both regular grids and the kd-restart and kd-backtrack [FS05] variants for kdtrees, their method is not as fast as the stackless kd-tree traversal [PGSS07]. 4.1.3 Shared-Stack BVH Traversal Although the BVH implementation of [TS05] does not depend on a stack, it uses a fixed-order traversal. The disadvantage of this is that many nodes, as well as primitives, will be tested for 36 Hybrid Ray Tracing Figure 4.2: Stackless BVH traversal encoding example. 
intersections even if they fall behind other primitives that will be traversed later. This leads to many redundant tests. [GPSS07] use an ordered, view dependent traversal, thus heavily improving performance for most scenes. However, since the traversal is ordered, a stack needs to be maintained. By tracing the rays in packets, the cost for this stack can be reduced, by amortizing it over the whole packet. By targeting their ray tracer at CUDA-enabled GPUs, several features can be exploited to implement this shared stack. The shared memory space of the GPU’s multiprocessors is used to keep this stack. Access to this data area is as fast as accessing the local registers on the chip. Their algorithm begins by traversing the hierarchy with the packet. Only one node is processed at a time. If the node is a leaf, the rays simply compute the nearest intersection with the geometry in it. If the processed node is not a leaf, the two children are loaded. The rays then intersect both children, to determine the traversal order based on the first intersected child. If the front-most child happens to lie behind an already intersected primitive, the child is considered as not being intersected. The next node is then collectively selected as the child-node which is front-most for the gross part of the rays. If any ray intersects the other child first, that child is “scheduled” by pushing it on the stack. If both children were missed, or if a leaf was processed, the next node is popped from the stack and its children are traversed. The algorithm terminates, if the stack is empty. The results show that the BVH-based GPU ray tracer is competing with the stackless kd-tree ray tracer [PGSS07] with a performance of 16M rays/s for the Conference scene. Using a preprocessing step, a special ray/triangle intersection test can be performed, which results in a performance increase of 20%. Using this optimization, their ray tracer even reaches 19M rays/s. Although kd-trees have long been regarded as the preferred acceleration datastructure for ray tracing, their novel BVH traversal algorithm is easier to implement and uses less live registers, resulting in a higher utilization (63% compared to 33% for [PGSS07]). Using their implementation, they were even able to ray trace the 12.7 million triangle POWER PLANT scene at 1024 × 1024 image resolution with 3 fps, including shading and shadows. The BVH required only 230 MB, and therefore fits easily in GPU memory. 4.2 Hybrid Ray Tracing Standard ray tracing uses an acceleration datastructure to reduce the complexity for a ray-scene intersection test from O(n) to O(log n), where n is number of primitives. While this datastructure reduces the number of intersection tests dramatically, still a significant amount of time is spent for the traversal of this datastructure. Another approach is hybrid ray tracing, where the GPU is used to rasterize the scene to obtain the intersections of the primary rays, after which this ray casted scene is further processed by the standard ray tracing algorithm (implemented on CPU or GPU), to include secondary effects such as reflections, refractions, shadows, etc. (Figure 4.3). Additionally, GPU shadow-maps can be used to avoid the intersection tests for the shadow-rays [BBDF05]. Interestingly, hybrid ray tracing is more beneficial to NURBS scenes than to triangle-based 37 4. NURBS R AY T RACING ON THE GPU (a) Rasterization phase of the hybrid algorithm. (b) Ray tracing phase of the hybrid algorithm. Figure 4.3: Hybrid Ray Tracing phases. 
scenes. Since the NURBS intersection test is extremely expensive compared to simple triangle intersections, every unnecessary intersection test avoided is worth much more compared to classical triangle tests. The following subsections discuss some hybrid techniques which accelerate the primary ray intersection stage of a NURBS ray tracer. 4.2.1 Extended Graphics Pipeline [PSS+ 06] propose a conceptual extension of the standard triangle-based graphics pipeline by an additional intersection stage. They first generate a very rough approximation of the surface by subdividing the NURBS-surface patch. The convex hulls of this subdivision are then rendered using standard rasterization. In the fragment shader they include an intersection stage to compute the exact intersection values between the rays and the surface, after which the fragments are shaded [PSS+ 06]. Figure 4.4 shows a schematic overview of the extended graphics pipeline. In order to reduce the number of ray/NURBS intersection tests, they employ some kind of early ray termination, similar to the early-z test. Their method requires the primitives to be sorted front-toback, before being uploaded to the GPU. Before computing the ray/NURBS intersection, the depth value of the convex hull coming from the rasterizer is compared against the current value in the depth-buffer. If the convex hull lies behind the already computed pixel, the corresponding surface will also lie behind it, therefore the fragment/ray will be discarded and a costly intersection test can be skipped. Otherwise, the exact depth is obtained from the intersection, and the depth-buffer is updated accordingly. Usually, the early-z test can be employed for planar primitives (such as triangles etc.), which heavily increases performance. This test compares the depth value coming from the rasterizer with the value in the depth-buffer, so that the entire fragment shader stage can be skipped if this fragment is behind another. However, the exact intersection for the NURBS-surface is determined during the fragment shader stage, consequently, the precise depth value can only be obtained using the fragment shader. Therefore, the fast early-z test cannot be employed in this case. However, it is implemented manually in the fragment shader to prevent the costly intersection tests. Test results show that this manual implementation does not decrease performance that much compared to a hardware early-z implementation [PSS+ 06]. Therefore, the benefit for the early ray termination is still very high. To obtain the intersection between the ray and the surface, Newton’s Iteration root finding method (Section 3.3.1) is used in combination with de Boor’s method (Section 3.5.4). Three different techniques are used to obtain an initial guess for the (u, v)-values. The first technique follows the approach of [MCFS00], by using multiple bounding volumes 38 Hybrid Ray Tracing Figure 4.4: Overview of the algorithm stages and their relation to stages in the extended graphics pipeline. per surface-patch. However, instead of using axis-aligned bounding boxes (AABBs), the bounding volume is the convex hull of a sub-patch itself and contains as initial-guess value the midpoint in the parameter range of the convex hull. An additional advantage compared to the AABBs, is that the smaller convex hulls approximate the surface better, so less rays will miss the surface. For the second technique, called uv-texturing, the patch is further subdivided as in the previous method. 
But instead of using the midpoint as initial-guess, the parameter values correspond to the vertices of the convex hull, which are interpolated to obtain a closer initial guess. This method can be further optimized into a view-dependent method (technique three), by using the vertex shader to compute the intersection of the ray and the vertex to obtain exact (u, v)-values for the vertices, which again will be interpolated afterwards, resulting in an even closer initial guess. The test results show that the number of iterations is reduced when using the view-dependent variant. However, they do not provide a comparison with the view-independent variant with respect to speed. Figure 4.5 and Figure 4.6 compare the number of required iterations for the intersection finder to converge. As can be seen, the uv-texturing method converges very fast, whereas the midpoint method requires several additional iterations. Figure 4.7 compares the view-independent method with the view-dependent method. It can be seen, that the view-dependent method requires less iterations in order to find the intersection. Furthermore, using an adaptive subdivision, a coarser mesh can be generated globally, while locally, in very curved areas, the mesh can be subdivided a bit further. This will automatically determine the correct subdivision depth, which results in a maximum performance for the rasterizer. Additionally, since curved areas are subdivided further, less iterations are required, increasing the performance even more. Their implementation is able to rasterize the bi-cubic NURBS-teapot consisting of 32 patches each uniformly subdivided 4×4 times (12698 triangles for the convex hulls), with a frame rate of 36 frames per second at a resolution of 1280×1024. While they are able to quickly rasterize NURBS-surfaces, their method does not include sec39 4. NURBS R AY T RACING ON THE GPU ondary effects, such as shadows, reflections, refractions, etc. Furthermore, they are not able to fully exploit the coherence among rays, since the rays are processed independently in parallel using the fragment shaders. Using CUDA, this coherence could be exploited better, and a full-fledged ray tracer could be built. However, this is not the goal of the authors, since they proposed a conceptual extension to the standard graphics pipeline, to support the NURBS surface as a primitive. (a) (b) (c) (d) Figure 4.5: uv-texturing using midpoint of parameter range (uniform subdivision: 2×2). Figure 4.5a shows the corresponding uv-texture. Figure 4.5b to Figure 4.5d show the result for up to two, three, and four iterations, respectively. (a) (b) (c) (d) Figure 4.6: uv-texturing using control point mapping (uniform subdivision: 2 × 2). Figure 4.6a shows the corresponding uv-texture. Figure 4.6b to Figure 4.6d show the result for up to two, three, and four iterations, respectively. (a) Surface (b) Number of iterations for view-independent (c) Number of iterations for view-dependent (d) Difference Figure 4.7: Figure 4.7a, Figure 4.7b view-independent uv-texturing vs. Figure 4.7a, Figure 4.7c view-dependent uv-texturing (max. iterations: 10, subdivision: uniform, 4 × 4). Figure 4.7d is the difference between the initial values of both methods. 40 Hybrid Ray Tracing 4.2.2 GPU/CPU Hybrid As in the method of [PSS+ 06] in Section 4.2.1, [ABS08] use a hybrid approach by rasterizing a coarse mesh using the GPU and then computing the exact intersections in a second stage. 
However, whereas [PSS+ 06] use the convex hull of a subdivided NURBS-patch as input for the rasterization phase, the subdivision itself is used instead to obtain a more tight approximation of the surface. Also, the intersection stage is performed on the CPU as opposed to the GPU requiring costly GPU-CPU memory transfers. During the intersection stage, they employ the fast NURBS evaluation algorithm (Section 3.5.3), together with the Newton Iteration to obtain the exact intersection [AGM06]. Two variants are employed in their method. The first one, called id-processing, uses only the id of the triangle/surface for further processing. However, since no (u, v) values are available, an additional function call is required to obtain those values. The second method, called uv-processing, uses the vertices’ (u, v) values. These values are interpolated in turn by the rasterizer to obtain an initial guess for the Newton Iteration. However, since an additional buffer is needed to hold the interpolated (u, v) values, more memory needs to be transferred from GPU-memory to CPUmemory. Comparing this variant to the view-dependent uv-texturing-approach [PSS+ 06], there are several advantages. Since the vertices of the subdivided mesh lie exactly on the surface, the precise (u, v) values are already known, whereas [PSS+ 06] require additional intersection tests using the vertex shader to obtain these values. Although using the subdivision directly fits the NURBS-surface better than using the convex hull as in [PSS+ 06], their method suffers from some artifacts. Rays that intersect the surface, but miss the tessellated silhouette, will not appear in the rasterized output, whereas standard ray tracing does include them (Figure 4.8a). In order to counter this artifact, they include a test in the intersection stage, to determine if a ray/pixel needs to traverse the entire scene, or it can depend on the hit reported by the rasterizer. Due to the mentioned artifact, every background pixel in the rasterization output therefore needs to be ray traced using a packet-based BVH ray tracer, in order to produce a correct result. While many background-pixels will actually be rays that do not intersect the surface, this check will incur some overhead. However, this overhead is relatively low, since no (or very few) NURBS intersection tests needs to be performed. Another artifact that may occur is when the rasterized output contains a hit where the ray traced output does not, or hits another surface (Figure 4.8b). Therefore, an additional check, after the intersection test is used, to determine if the intersection test succeeded. If not, the standard ray tracing algorithm is used as well. Although their artifact-handling fixes the majority of incorrect rasterized silhouettes, it is still incapable of correctly ray tracing (self-)intersecting surfaces (Figure 4.9a). Luckily, this artifact is avoided when tessellating not too coarsely. However, not mentioned by their paper, since the tessellation is an approximation, the surface at the silhouette can be missed by the rasterizer. Therefore, if another surface is behind the missed surface, additional artifacts will occur, resulting in a tessellated-looking silhouette (Figure 4.9b). The test results show that using the hybrid approach, 20-70% of the costly NURBS-intersection tests can be avoided. Normally a ray would intersect multiple surfaces, but using the z-buffer the number of surfaces is reduced to only one (except for the artifact-handling). 
Interestingly, although uv-processing starts with a better initial guess, the cost for obtaining these guesses is very high since an additional data buffer needs to be transferred. Therefore, their id-processing variant outperforms the uv-processing variant. Their implementation is able to ray cast the bi-cubic NURBS-teapot consisting of 32 patches subdivided over 12698 triangles, with a frame rate of 8.0 frames per second at a resolution of 512×512. Using a dual 3.0GHz quad-core processor, they reach 25.5 frames per second for uv-processing, and 41.7 frames per second for id-processing. Although only primary rays are traced, this implementation is easily extended to a full-fledged 41 4. NURBS R AY T RACING ON THE GPU (a) A primary ray missing the tessellation, but hitting the surface. (b) A primary ray r hitting a triangle ts from a tessellated surface while missing the original (parent) surface s. Figure 4.8: Artifacts which are handled correctly. ray tracer, by simply shooting more rays in the intersection stage. However, these secondary rays will not be traced as efficiently as the primary rays. 4.2.3 Comparison For the teapot scene, [PSS+ 06] reach 47M rays/s, whereas [ABS08] reach 11M rays/s (id-processing, using 8 threads). However, this comparison is not fair, since different hardware is used in both implementations as well as a different approach. Furthermore, the scenes are very different actually. The subdivision is performed differently, and another viewpoint is chosen for the camera. Nevertheless, the features can be compared. Whereas [PSS+ 06] rasterize the convex hull in order to produce an artifact-free result, [ABS08] rasterize the tessellation of the subdivision, introducing some possible artifacts. However, a convex hull requires more primitives to process, slowing down the rasterization stage, on the other hand, this is not the bottleneck. Using a tessellation also approximates the surface better, but more important, it provides more accurate initial-guess values. Using the convex hull, the best initial-guess values one can get are obtained by using view-dependent uv-texturing. This requires additional intersection tests during the vertex shader stage, slowing down the ray tracing process. The tessellated method implicitly uses view-dependent uv-texturing, since the vertices of the triangles lie exactly on the surface. The tessellation method also guarantees only one intersection test, except for some cases when an artifact is detected, where additional standard ray tracing is required. Using the convex hull, an intersection could be in front of the current front-most intersection according to the depth-buffer. Therefore, several intersection tests can be needed to obtain the correct intersected surface. However, this consequence can be minimized by first sorting the primitives front-to-back. Furthermore, the method by [PSS+ 06] is fully implemented using the programmable 42 Discussion (a) A packet of rays r1..4 intersecting the wrong surface t j . The altered shape tis of the surface s causes objects in the shaded area to be drawn instead of culled and vice versa (if rays origin in the opposite direction) (b) A primary ray hitting the front surface, but missing the tessellation. In the back the ray does hit the tessellation, however, this will result in a wrong hit. Figure 4.9: Artifacts which cannot be handled correctly. shader pipeline. 
Therefore, due to the parallel nature of the fragment processing units, coherence of initial values for the Newton Iteration between neighboring pixels/rays is difficult to exploit; only the caches can be used to exploit locality in memory fetches. The method of [ABS08] traces rays in packets, and therefore can profit from the coherence. Table 4.1 shows a comparison of the features for both hybrid methods.

Feature                       [PSS+06]                   [ABS08]
Artifact-free                 Yes                        No (rare)
Initial mesh                  Convex hull                Tessellation
Intersection tests/ray        ≥ 1                        1 (non-missed intersections)
Early ray termination         Yes (manually)             Yes
uv-texturing                  Yes                        Yes
View-dependent                Yes (vertex shader)        Yes
GPU/CPU intersection stage    GPU                        CPU
Coherence exploitation        Little (through caches)    Yes (through packets)

Table 4.1: Comparison of features for both hybrid methods.

4.3 Discussion

For ray tracing NURBS-surfaces, the GPU BVH ray tracer in Section 4.1.3 can be employed. It allows an efficient traversal of the bounding volume hierarchy (derived from the subdivision of the NURBS-surface) by including a shared stack per ray-packet. Together with the Ray-Patch intersection routine (Section 3.3.1) and the evaluation schemes in Section 3.5, an efficient implementation should be possible. Hybrid Ray Tracing is an interesting method for accelerating the primary intersection stage, which avoids the entire traversal stage. Although ray tracing will eventually outperform rasterization as the complexity increases, this is usually not the case for ray tracing NURBS-surfaces, as the rasterized mesh is only a rough approximation of the NURBS-surface; therefore, the complexity can be kept modest, resulting in a fast rasterization. But more important, since many triangles are being culled by the rasterization pipeline, through frustum culling, early-z culling, viewport culling, etc., the average number of intersection tests is reduced to only a few per ray. Most of the time just one intersection test is required, as most of the other triangles are culled away. Two different approaches are employed for hybrid ray tracing. [PSS+ 06] implemented the entire ray tracer using vertex and fragment shaders, whereas [ABS08] used the CPU for computing the exact intersections. For the former method, the intersection tests take place during rasterization by using the fragment shader for computing the intersections. Since all fragments are processed independently by the GPU, little attention is paid to coherence among the rays, except for the caches. The main bottleneck in the latter is the expensive GPU-CPU memory transfer, especially for the u/v processing variant, since it requires the transfer of an additional buffer for the (u, v) values. Furthermore, the resulting (u, v) buffers use RGBA data, therefore adding some amount of overhead. However, recent GPUs support integer textures, which will heavily decrease the required bandwidth. As mentioned in their paper [ABS08], by implementing the intersection stage on GPU, all memory transfers can be avoided. In combination with the packet-based GPU BVH traversal, a performance gain can be expected. Finally, a second major difference between the two approaches lies in the rasterized mesh. Whereas [PSS+ 06] use the convex hull to ensure an artifact-free result, [ABS08] rasterize the refinement itself.
Although some artifacts may occur by using the tessellation, the z-buffer guarantees that at most one intersection test is required for every pixel, except for surface boundaries, where a ray can miss the surface, as opposed to [PSS+ 06], where several intersection tests may be needed per pixel. Furthermore, almost all artifacts are correctly handled in the second intersection stage. The artifacts that can not be handled result from (self-)intersecting surfaces, which should not happen in reality. Additionally, since a tessellation is used, less primitives are generated, but more important, the tessellation results in a view-dependent uv-texturing approach for free, whereas [PSS+ 06] require additional intersection computations in the vertex shaders. Using view-dependent uv-texturing is currently the most efficient method, which converges very fast to the root. Therefore, a NURBS ray tracing system should be possible to be fully implemented on GPU by using a hybrid approach for casting primary rays and falling back on GPU BVH ray tracing for missed intersections and secondary rays, such as shadows, reflections, refractions, etc. 44 Chapter 5 CUDA With the introduction of the G80-architecture, NVIDIA has made a big step forward towards generalpurpose computing using GPU hardware. The Compute Unified Device Architecture, or CUDA, is a general-purpose parallel computing hardware/software architecture. By choosing the well-known C-language as the basis for their architecture, any programmer familiar with this language, can immediately start developing parallel software. In addition, their programming language, C for CUDA, has some of the C++ features, including polymorphism, operator overloading, and function templates. This chapter will briefly discuss the CUDA architecture, how it can be used to build highly parallel software, and which obstackles need to be faced. 5.1 Massive Multi-threading Architecture CUDA GPUs contain a number of multi-threaded streaming multiprocessors (SMs), each capable of executing SIMD-groups of threads concurrently. Each SM contains 8 cores, together with a scheduler which creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. In contrast to SIMD software for CPUs, in which 4 instructions are executed in parallel, CUDA applications launch a massive number of lightweight threads which are executed in parallel in a streaming fashion. All threads form a grid, which is divided into user-defined blocks (Figure 5.1a). Each block is assigned to a specific multiprocessor. Therefore, threads of the same block will always execute on the same multiprocessor. Groups of 32 threads, called a warp, are executed physically in parallel in SIMD-fashion. Additionally, each thread executes independently with its own instruction address and register state1 . As long as all threads in a warp execute the same instruction, the warp achieves maximum utilization. Otherwise, it is serially executed in two (or more) parallel groups. Furthermore, to hide memory latencies, warps are time-sliced by a hardware scheduler with zero schedule-overhead. The thread blocks are enumerated and distributed to the available multiprocessors, which allows for a highly scalable system (Figure 5.1b). Finally, CUDA-GPUs supporting SLI, can be linked together to increase the performance even further. However, each GPU needs to be controlled using a separate thread, and data is not shared between GPUs, therefore, all data need to be uploaded twice. 
Nevertheless, by using SLI, the performance can be doubled in theory. 1 NVIDIA refers to this as SIMT, for single-instruction, multiple-thread. 45 5. CUDA (a) Hierarchical overview of CUDA threads. (b) A device with more multiprocessors will automatically execute a kernel grid in less time than a device with fewer multiprocessors. Figure 5.1: Thread blocks and their mapping to the multiprocessors. 5.1.1 Predicated execution Sometimes, a thread may decide not to execute a branch, while other threads belonging to the same warp do execute that branch. Such a warp will usually be split into as many thread-groups as required to enable a parallel execution of each group. However, when the branch contains only a small number of instructions, the compiler might decide to predicate the instructions. Threads which did not execute the branch will be deactivated, and thus skip the instruction. 5.2 Memory Hierarchy There are five different types of memory available for use. The GPU’s main memory, named globalmemory, is very big, but the latency is very high. The global-memory is shared among all running threads, and even between kernel invocations. Each thread also has a private portion of this globalmemory, called local-memory. Although it’s optimized for efficient data transfers, it has the same latency as the global-memory, making it very slow in some situations. The texture-memory, is a special read-only portion of the global-memory which can be read using special texture-fetch functions. The advantage of using texture-memory, is that the texture-unit, which executes the fetches, caches its incoming data. Therefore, frequently accessed data is likely to appear in the cache, so a read from global-memory can be avoided. The latency to this cache is as fast as the access to a register. In addition, the texture-unit provides filtering functionality, to normalize values, interpolate values, etc. There is also a portion of global-memory, which is much alike the texture-memory: the constantmemory. It has some restrictions though, as being very small (only 64 kB), and has no filtering support. Finally, each multiprocessor has on-chip memory of 16 kB, called the shared-memory. As with the cache, the latency of this memory equals that of a register access. This memory is divided among the active blocks running on a multiprocessor. The threads in a block can all access the same shared data, however they cannot access the shared data of other blocks. The amount of shared46 Memory Hierarchy Figure 5.2: Memory Hierarchy. memory a block is given, is determined by the total amount of shared-memory which is reserved by a kernel. Figure 5.2 illustrates the different types of memory. 5.2.1 Coalescing Due to the wide memory bus, memory transactions always occur by fetching 32, 64, or 128 contiguous bytes simultaneously. Therefore, global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a half-warp (either the first 16 threads, or the last 16) can be coalesced into a single memory transaction of 32, 64, or 128 bytes. In order to assist in coalescing, data is sometimes arranged as a structure of arrays (SOA), instead of an array of structures (AOS). In this way, the data for each member is clustered, and can be fetched using a single memory transaction. 5.2.2 Prefetching The texture-unit also provides an implicit prefetching function. When fetching data from address X, the texture-unit will automatically prefetch the data on addresses X-1 and X+1. 
If the thread, or any other thread executed by the same multiprocessor, subsequently reads from a neighboring address, the data is already available and can be fetched directly from the cache.

5.2.3 Bank conflicts

The data of the constant- and shared-memory is physically arranged into 16 memory-banks. As long as the threads in a half-warp do not read or write from the same memory-bank, there will be no conflicts. In case of a conflict, the memory transactions are performed serially, once for each distinct conflict. However, a read from the same address (a broadcast) does not result in a bank-conflict.

5.3 Launching a kernel

A kernel is defined in C for CUDA, which is an extension to the C-language. For example, the well-known saxpy-operation can be defined as follows:

Listing 5.1: Example CUDA-kernel: saxpy

__global__ void saxpy(float *s, float a, float *x, float *y)
{
    int id_x = threadIdx.x + blockIdx.x * blockDim.x;
    int id_y = threadIdx.y + blockIdx.y * blockDim.y;
    int id   = id_x + id_y * (blockDim.x * gridDim.x);

    s[id] = a * x[id] + y[id];
}

This kernel first computes its unique identifier, based on its position in the grid. threadIdx is a vector containing the x and y location of the thread inside the block. blockDim gives the dimensions of every block. blockIdx represents the location of the block inside the grid. And gridDim gives the dimensions of the grid, in blocks. After having computed the identifier, the x input element is read from global memory, multiplied with the a-factor, and added to the y input element, also read from global memory. Finally, the result is written back to global memory. The code in Listing 5.2 launches a grid of 4 × 4 blocks, each containing 1024 threads.

Listing 5.2: Example of launching CUDA-threads

// ... memory allocations and variables defined elsewhere
dim3 grid(4, 4);
dim3 block(32, 32);
saxpy<<<grid, block>>>(output_s, a, input_x, input_y);

5.4 Debugging

5.4.1 Emulation

The nvcc compiler driver provides an option to generate binary code which can be executed by the CPU, using the device emulation mode. Normally, it would not be possible to specify breakpoints when the application is being executed by the GPU. However, using this mode, CUDA-applications can be debugged as if they were normal CPU-applications. Unfortunately, the emulation is not 100% accurate. Whereas the GPU can execute 32 threads physically in parallel, the emulation mode cannot. Therefore, additional synchronization points need to be set in order to obtain the same results. Finally, the emulation mode runs very slowly. Nevertheless, to find bugs, the emulation mode is a very nice addition to the CUDA Toolkit. Two other emulators which were available are Barra [CDP] and GPUocelot [Dia09, DKK09]. Those emulators do not require a different compilation mode. Instead, they temporarily replace the CUDA Runtime in order to intercept all instructions. Their goal is to provide a 100% accurate, and much faster, emulation. However, at the time of writing this thesis, the emulators were not yet capable of emulating applications designed for the CUDA 2.3 runtime. Nevertheless, an eye should be kept on these emulators.

5.4.2 Hardware breakpoints

When running Linux, breakpoints can be set in running CUDA-applications using the cuda-gdb debugger (an extension to gdb). However, no X-Server may be running on the GPU which is currently being debugged, making it very difficult to visualize the output.
Although it is possible to run the X-Server on another computer in the network, it is generally recommended to have a second GPU in the computer. Unfortunately, the debugger did not work very well for the ray tracing system, probably because the generated code was very complex. 49 Part II Development 51 Chapter 6 System overview In this chapter, the CUDA NURBS Ray Tracing System (CNRTS) is presented: a GPU-based implementation of the recursive ray tracing algorithm capable of direct ray tracing of NURBS-surfaces. Besides primary-rays, the system is also able to cast shadow rays, reflection rays, and refraction rays up to any desired depth. Additionally, the system features an acceleration subsystem for speeding-up the casting of primary-rays. Finally, the system requires very little preprocessing, and the amount of used memory is kept to a minimum. The system heavily depends on auxiliary datastructures. These datastructures are computed only once in a preprocessing step, and are then fed into the core of the system, which uses it for rendering. The next chapter will discuss the preprocessing phase. This chapter only provides a global overview of the system, the implementation details will be provided in Chapter 8. Rendering Core The rendering of NURBS-surfaces is handled by the core of the CNRTS. It is composed of two subsystems, namely the Primary-ray Accelerator and the Ray Tracer. The first subsystem accelerates the primary intersection stage of the ray tracing system by limiting the number of ray/surface intersection tests to a single test1 . Additionally, it accelerates the convergence of the root-finder by providing more accurate initial-guess values. The Ray Tracer subsystem then performs the classical recursive ray tracing algorithm, by tracing secondary-rays, including shadows, reflections and refractions. Figure 6.1 gives a high-level overview of the architecture of the core of the CNRTS. Both subsystems are executed entirely on the GPU with only a minimal amount of coordination by the CPU. Figure 6.1: System overview of the rendering core of the CNRTS 6.1 Primary-ray Accelerator subsystem A hybrid technique is employed to accelerate the tracing of primary-rays. The efficiency of the standard rasterization pipeline is exploited to limit the number of intersection tests and to obtain a 1 As will be seen later, there are some cases in which this number can be slightly higher. 53 6. S YSTEM OVERVIEW good starting point for the root-finder. Since the intention was to create an OS-independent system, OpenGL is chosen as the graphics API. However, the system is easily modified to use DirectX instead. 6.1.1 Rasterization Based on the ideas by [ABS08, PSS+ 06, BBDF05], the Primary-ray Accelerator subsystem removes the necessity for the traversal of an acceleration data structure by first rasterizing a coarse tessellation of the NURBS-model. However, instead of just using the standard rendering pipeline (Section 1.1) output, a vertex-shader and fragment-shader are employed to generate a customized output. The tessellation is obtained from a refinement of the NURBS-model using the method described in [MCFS00]. Before rasterization begins, two custom attributes are added to each vertex: a patchidentifier uniquely identifying the NURBS-surface and the span locations in the corresponding knotvectors; and the uv-parameter value of the NURBS-surface corresponding to the face-corner. The tessellation is generated in a preprocessing step and is discussed in Section 7.1. 
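Conceptually, every vertex of the coarse tessellation therefore carries a small record in addition to its position. A sketch with hypothetical names (the actual packing of these values into a single vec4 vertex-attribute is described in Section 8.4.1):

struct TessellationVertexData
{
    int   surfaceId;  // identifies the NURBS-surface (and thereby its knot-vectors)
    int   spanId;     // identifies the knot-spans containing this sub-patch
    float u, v;       // uv-parameter of this face-corner; seed for the root-finder
};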
During rasterization, the rasterization-unit will interpolate the vertices’ uv-parameter values providing a weighted average closely approximating the surface’ uv-parameter corresponding to the pixel’s true intersection-point2 . Pixels containing a rasterized primitive, will provide a good starting point for finding the exact intersection-point: the patch-identifier identifies the NURBS-surface and spans3 , and the uv-parameter value provides a good initial guess-value for the root-finder to find the exact intersection-point. Pixels containing a background-value either do not contain a surface, or missed the surface due to too coarse tesselation. Figure 6.2 shows schematically the rasterization phase of the Primary-ray Accelerator. Figure 6.2: Schematic overview of the rasterization phase of the Primary-ray Accelerator. The output provides some cues for finding the exact intersection-point: the left rasterized image contains the patch-ids and the right rasterized image contains the uv-values encoded in red for u-parameter range and green for v-parameter range. 6.1.2 Traversal-less Ray Casting As soon as the rasterization step completes, the actual ray tracing can begin. However, instead of immediately applying the entire recursive algorithm, only the primary rays are traced without spawning secondary rays, resulting in a ray casted image. Normally, in order to find out which object is front-most w.r.t. the viewer, all the objects in the scene need to be traversed, or part of the scene when using an acceleration datastructure (see 2 The patch-identifier however, does not require interpolation, since it’s constant across the entire surface patch. However, if the rootfinder happens to fail in finding an intersection-point, then the pixel can be “repaired” by the Ray Tracer, as will be seen next (Section 6.1.4 and Section 6.2). In the case the rootfinder reports an intersection, it cannot be repair, and the incorrect surface will appear in the result. 3 There are some cases in which the rasterized output incorrectly identifies the surface. 54 Primary-ray Accelerator subsystem Figure 6.3: The ray’s direction is derived from the viewing parameters. In here, a denotes the fieldof-view. By setting the image-plane’s distance d from the camera equal to 1, the position in the image plane is easily converted to world-coordinates. Section 3.1). However, we actually already know nearly certain which object is front-most: the patch-identifier resulting from rasterization identifies the surface closest to the viewer along the ray being processed (see the left rasterized image in Figure 6.2). Thus, only one intersection test is required. So all that’s left to be done is computing the intersection based on the data output by the rasterizer. The shading computations are deferred to the ray tracer subsystem. The interpolated uv-parameters resulting from the rasterization are a perfect candidate for an initial-guess value for the root-finder (see the right rasterized image in Figure 6.2). 6.1.3 Primary-ray setup In order to use the rasterization output as input for the ray caster, the pixel positions need to be mapped to primary-rays in such a way, that when the tessellation would be ray traced, the exact same output should be obtained. Figure 6.3 shows how the camera is positioned w.r.t. the image-plane. The mapping is derived from the pixel position in the image plane, and the viewing parameters used for rasterization. 
Algorithm 9 shows the pseudo-algorithm to compute the ray from the pixel location and the viewing parameters.

Algorithm 9 Computation of a primary-ray
Require: pixel location x, y, image plane resolution width and height, camera position eye, camera focal point focus, camera up-vector up, and field-of-view a.
Ensure: ray origin O, and ray direction D.
1: O ← eye
2: D ← focus − eye
3: stepx ← tan(a/2) / (width/2)
4: stepy ← tan(a/2) / (height/2)
5: D ← D + (stepx · (x − width/2 + 1/2)) · (D × up)
6: D ← D + (stepy · (y − height/2 + 1/2)) · up

6.1.4 Artifacts

While the Primary-ray Accelerator enables fast ray casting for NURBS-surfaces, it does not guarantee that the resulting output will always be correct. There are three types of artifacts which can appear, two of which can be fixed by the system.

The first type of artifact is related to the root-finder being skipped, due to a background-pixel in the rasterization output. There are two cases in which this will happen: either the ray does not intersect any surface-patch and vanishes into the background; or the ray does intersect a surface-patch, but due to a too coarse tessellation, the patch does not show up in the rasterization. While the former is correct, the latter produces an incorrect result, making the tessellation apparent in the result (Figure 6.4a). Since both cases produce a background-pixel, it is not possible to differentiate between them.

The second type of artifact occurs when the root-finder diverges. Again, there are two cases in which this can happen: either the ray does not intersect any surface-patch, but the rasterization output reported a hit; or the ray intersects another surface-patch than the one reported in the rasterization output. Obviously the former is correct, but the latter will produce gaps in the result, as can be seen clearly in Figure 6.4b.

Both types of artifacts are easily handled by the system by forwarding background-pixels and diverged pixels to the Ray Tracer subsystem in order to fix potentially incorrect results. A disadvantage of this method is that correctly computed pixels will have to be computed twice.

A third possibility leading to an incorrect result is, again, a surface that is tessellated too coarsely and therefore does not show up in the rasterization. Whereas this is handled correctly for background-pixels, it is not in the case of another surface lying behind the too coarsely tessellated surface: that surface will be intersected by the ray, and thus an incorrect intersection-point will be returned. This appears clearly in the tessellation of the teapot (Figure 6.4c). While the former two types of artifacts are fixable by forwarding them to the Ray Tracer subsystem, this type of artifact is not. Since a valid intersection-point is returned by the root-finder, it is not possible to determine whether it was the front-most. However, by using a not too coarse tessellation, these artifacts will occur rarely.

6.2 Ray Tracer subsystem

The heart of the rendering core is the Ray Tracer subsystem, which performs the full-blown recursive ray tracing algorithm. Apart from tracing secondary-rays, it also handles the artifacts described in Section 6.1.4 by retracing the primary-rays that either missed the tessellation (background-pixels) or diverged during the root-finding phase. The Ray Tracer subsystem follows the methods described in [MCFS00].
A BVH is constructed from a refinement of the NURBS-model, in which each leaf-node represents a flat enough sub-patch 56 Ray Tracer subsystem (a) Detectable corners appear due to a too (b) Gaps appear due to the tessellation. (c) Undetectable corners appear due to a too coarse tessellation. coarse tesselation. Figure 6.4: Artifacts appear when using the Primary-ray Accelerator solely. In Figure 6.4a, background-pixels appear instead of the surface. By forwarding all background-pixels to the Ray Tracer, this artifact will disappear. Figure 6.4b shows some gaps, appearing in-between two surfaces. Figure 6.4c shows an artifact which cannot be detected using the algorithm, the only solution is to tessellate more fine. of a NURBS-surface. The subdivision is generated in a preprocessing step which will be discussed in Section 7.1. The scene is traversed using a packet-based BVH-traversal scheme. The NewtonRhapson method Section 3.3.1 is adopted to obtain the ray/NURBS-surface intersections. The implementation details will follow in Chapter 8. 6.2.1 Kernel separation The first problem arising in mapping the ray tracing algorithm to GPU is that CUDA does not have native support for recursive function calls. This problem could be overcome, by rewriting the algorithm to an iterative version and using the shared-memory of CUDA to implement a stack to push the spawned rays onto. While this may seem to be reasonable, it enforces us to write the entire algorithm as a single complex kernel. However, an increase in kernel complexity, generally leads to an increase in the number of used registers. This will reduce the occupancy of the kernel. And if no more registers are available, it will result in lots of slow local-memory transfers. Additionally, it is less likely the CUDA-compiler will do a good job in optimizing a complex kernel. Finally, most CUDA GPUs have a limit on the maximum execution time of a kernel. Therefore, merging all traversal steps, will increase the kernel’s execution time, and eventually will result in program crashes. Instead I have put the different ray tracing-steps into multiple CUDA-kernels. A single kernel for tracing primary/secondary rays, another kernel for tracing shadow-rays and yet another kernel for applying shading. The kernels propagate information to each other through the global-memory. A per-pixel ray-stack is used on which reflection and refraction rays are pushed to, and popped from when tracing secondary rays (Section 8.1.1). To store the intermediate intersection data required by the shader, a hitdata-buffer is used, containing the intersection-point, the surface-normal, the material-id, etc. (Section 8.1.2). A thin layer of CPU-code is used to coordinate the algorithm, by launching the different CUDA-kernels (Algorithm 6.1). Figure 6.5 schematically shows this separation. 57 6. S YSTEM OVERVIEW Figure 6.5: The ray tracing algorithm separated into different CUDA-kernels. The CPU part coordinates the algorithm by launching the kernels successively. The inter-kernel data propagation through global-memory is visualized by dashed lines, whereas the arrows indicate the flow/dependencies. 6.2.2 Tiling Another problem that occurred during implementation was the amount of required memory for the temporary buffers, i.e. the ray-stack and hitdata-buffer. 
For a maximum recursion-depth of 3, at a resolution of 1024 × 1024, this results in 128 MB to be allocated for the ray-stack, and yet another 60 MB for the hitdata-buffer, totalling 188 MB (the derivation of these numbers can be found in Section 8.1.1 and Section 8.1.2). When comparing this number to other types of allocation (such as NURBS-data, BVH-data, image-buffers, etc.), it seems to be a rather high percentage of the total allocated memory. Although modern GPUs are capable of allocating such an amount of memory, these temporary buffers should not dominate the total amount of available memory, as it should actually be reserved for read-only data such as NURBS-data, BVH-data, textures, etc. What’s even worse, is that the buffers could even prove a potential bottleneck: since the amount of memory required for the buffers is proportional to the image-resolution, increasing this resolution eventually will fail the allocation of them4 . The solution to this problem is to divide the image-space in evenly-sized rectangular pixelblocks, called tiles. Instead of applying the ray tracing algorithm onto the entire image, it is now applied successively to each tile. Each tile is given an attribute tileIdx5 , a two-dimensional index specifying the block position in image-space (analogous to the blockIdx and threadIdx parameters for position in grid and block, respectively). Based on the tileIdx, together with the threadIdx, the corresponding pixel location is computed, and accumulated with the color value from the shader. 4 A full-hd resolution of 1920 × 1080 with a maximum recursion depth of 3 would require 372 MB, solely for the temporary buffers. 5 In CUDA-terms, this would have been gridIdx. 58 Ray Tracer subsystem Figure 6.6: Separation of image-space into tiles: instead of applying the ray tracing algorithm onto the entire image, it is now applied successively to each tile. The tiles are visualized using different shades of grey, the pixels are separated by dashed lines. Figure 6.6 schematically shows this separation into tiles. The main advantage of this method, is that it does not depend on the image-resolution anymore. Ray tracing an image with a maximum recursion-depth of 3 and a tile-size of 256 × 256 will always result in a temporary-buffer size of 11.75 MB, which is only a small fraction of the available GPU memory compared to the 188 MB when not using tiles. 6.2.3 Indirect recursion Figure 6.5 suggests an iterative rewrite of the recursive ray tracing algorithm. However, for the sake of simplicity, and since the CPU can’t possibly become a bottleneck in this system, I’ve chosen to leave it recursive. In this way, the structure of the classical recursive ray tracing algorithm can be found literally in the CPU-part of the system. Listing Algorithm 6.1 shows the CPU-code which coordinates the ray tracing algorithm. The raytrace-function is called and given a tileIdx which represents the current tile being processed. It then launches the castRays CUDA-kernel, which performs the primary or secondary ray-traversal step (including the root-finding and spawning of reflection- and refraction-rays). It then loops over all lights and performs the casting of shadow-rays, by launching the castShadowRays CUDAkernel. Subsequently, it applies shading based on the intersection obtained from the castRaysand castShadowRays-kernels, by launching the computeShading CUDA-kernel. And finally, for the recursion-step, the reflection- and refraction-rays are traced by calling the raytrace-function. 
While recursively calling the raytrace-function, a stackpointer is provided from which the location of the pushed ray is derived. In order to mimic a depth-first traversal of the ray-hierarchy, in which the reflection-ray is spawned before the refraction-ray, the refraction-ray should be pushed before the reflection-ray. Therefore, in the first recursive call, the stackpointer is increased, to let it point to the top of the stack. Section 8.1.1 will provide more details about the ray-stack. Figure 6.7a and Figure 6.7b illustrate how the stack is populated during the execution of the algorithm.

(a) The ray-hierarchy corresponding to a maximum recursion-depth of 3. Each ray spawns two rays: a reflection-ray and a refraction-ray. (b) The state of the stack while casting a ray corresponding to every node in the ray-hierarchy of Figure 6.7a.

Figure 6.7: Implementing recursion for CUDA using a stack.

Listing 6.1: CPU-code for the Ray Tracer subsystem

void raytrace(int2 tileIdx, int depth = 0, int stackpointer = 0)
{
    castRays<<<gridDim, blockDim>>>(tileIdx, stackpointer);

    for (int i = 0; i < light_count; i++)
    {
        castShadowRays<<<gridDim, blockDim>>>(tileIdx, i);
        computeShading<<<gridDim, blockDim>>>(tileIdx, i);
    }

    if (depth < maxRecursionDepth)
    {
        /* Trace reflection-rays */
        raytrace(tileIdx, depth + 1, stackpointer + 1);

        /* Trace refraction-rays */
        raytrace(tileIdx, depth + 1, stackpointer);
    }
}

6.2.4 Shading

As a final step, shading is applied by the shader, but only if the shadow-ray caster reports that the surface-point is illuminated by the light-source. For the shading computation, a simplified Phong illumination model is employed. For each pixel, the following formula is used to compute the light intensity:

Ip = ∑lights ( ka·IL + kd·(L · N)·IL + ks·(R · V)^α·IL )

In this formula, Ip denotes the light intensity per intersection-point, ka the ambient reflection constant, kd the diffuse reflection constant, ks the specular reflection constant, and α a shininess constant. Furthermore, IL is the light-source's intensity. This formula is computed for each spawned ray, and the results are summed to obtain the final intensity for the pixel. Since this intensity can be greater than 1, the value is scaled by a global intensity scaling factor.

6.3 Summary

Figure 6.8 shows the path from a primary-ray up to the final pixel. A primary-ray is spawned by rasterizing the tessellation; a fragment is further processed by the primary-ray accelerator's root-finder; a hit may skip the ray tracer's traversal and root-finder and immediately continue by casting a shadow ray; misses and background pixels are traced conventionally by the ray tracer, whereafter they are processed further by casting a shadow ray; shading is applied for non-blocked shadow-rays; and finally reflection and refraction rays are spawned and traced recursively.

Figure 6.8: Lifetime of a primary-ray processed by the Primary-ray Accelerator and the Ray Tracer.
A primary-ray is spawned by rasterizing the tessellation; a fragment is further processed by the primary-ray accelerator's root-finder; a hit may skip the ray tracer's traversal and root-finder and immediately continue by casting a shadow ray; misses and background pixels are traced conventionally by the ray tracer, whereafter they are processed further by casting a shadow ray; shading is applied for non-blocked shadow-rays; and finally reflection and refraction rays are spawned and traced recursively.

Chapter 7

Preprocessing

Using a separate program, a Wavefront OBJ file containing a NURBS-model is read, parsed, and converted to the required format. Afterwards, the NURBS-data is preprocessed to obtain the bounding volume hierarchy for the traversal, the tessellation for the rasterizer, and the Cox-de Boor items for the evaluation, which is then uploaded to the rendering core. This chapter describes the preprocessing phase of the program, and how the rendering core expects its data to be arranged.

7.1 Subdivision

Good performance, as well as an accurate result, is realized by a carefully subdivided NURBS-model. Both the ray caster and the ray tracer subsystems depend on it. To ensure a close enough linear approximation to the real surface, the NURBS-surface is first refined by inserting new knots into the knot-vectors. By adding new knots, new control-points can be computed which represent exactly the same surface, but whose convex-hull lies closer to the true surface. Therefore, by adaptively determining how many knots to insert and where, a new refined mesh is obtained which can be used for tessellation and BVH generation.

The heuristic from [MCFS00] is employed for each u knot-span and corresponding column in the grid (or row, in case of the v knot-vector). The maximum value of this heuristic, applied to each row in the column, determines the number of knots to insert in the knot-span. This is repeated for each knot-span of both knot-vectors. As a final step, the NURBS-surface is converted to Bézier-patches by setting the multiplicity of each internal knot in the new knot-vectors to the degree of the surface. The new control-grid is updated by computing the new control-points (Figure 7.2).

Figure 7.1: Overview of subdivision: the two subsystems both depend on the subdivision of the NURBS-model.

Figure 7.2: The NURBS-surface is converted into smaller Bézier-patches.

7.1.1 Tessellation

The ray caster subsystem depends on a tessellation in order to rasterize the NURBS-surface and obtain good starting values for the numerical root-finder. This tessellation is directly derived from the subdivision of the surface. The tessellation consists of a set of quadrilaterals, each corresponding to a sub-patch. The corners of each quadrilateral map to the corners of the corresponding Bézier-patch. Each vertex contains a surface-identifier, a span-identifier, and an initial-guess uv-parameter.

7.1.2 BVH generation

The bounding volume hierarchy is easily constructed from the refined control-grid. From each sub-patch (corresponding to the new non-empty intervals in the refined knot-vectors), a bounding box is computed. Since the sub-patches are Bézier, the convex-hull is just the convex hull of the control-points corresponding to the sub-patch. The bounding box is then simply the component-wise minimum, respectively maximum, of these control-points.
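In code, this amounts to a component-wise min/max reduction over the sub-patch's control-points. A minimal sketch, with assumed type names rather than the thesis' actual code:

#include <algorithm>
#include <cfloat>
#include <vector>

struct Vec3 { float x, y, z; };
struct AABB { Vec3 lo, hi; };

// The sub-patch lies inside the convex hull of its control-points, so the
// component-wise min/max of those points is a valid bounding box.
AABB patchBounds(const std::vector<Vec3>& controlPoints)
{
    AABB box = { {  FLT_MAX,  FLT_MAX,  FLT_MAX },
                 { -FLT_MAX, -FLT_MAX, -FLT_MAX } };
    for (const Vec3& p : controlPoints) {
        box.lo.x = std::min(box.lo.x, p.x);  box.hi.x = std::max(box.hi.x, p.x);
        box.lo.y = std::min(box.lo.y, p.y);  box.hi.y = std::max(box.hi.y, p.y);
        box.lo.z = std::min(box.lo.z, p.z);  box.hi.z = std::max(box.hi.z, p.z);
    }
    return box;
}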
To generate the parent nodes higher up in the tree, the set of leaf-nodes are sorted along the axis of longest extent, and split in halve (object median). A new internal node is generated and is set as the parent of these two sets. This process continues recursively until a set of leaf-nodes contains only one leaf-node Figure 7.3. 7.1.3 Memory Layout BVH Nodes The BVH nodes are arranged depth-first, as an array of structures to allow efficient access using the parallel-read operation (Section 8.5.2). The array is stored in a texture bound to a single large piece of linear-memory. As can be seen in Listing 7.1, each node contains two child-pointers and two AABB for the corresponding children. To ensure the structure is correctly aligned to 16 bytes, two extra floats are padded. 64 Subdivision Figure 7.3: Bounding boxes of a BVH. Listing 7.1: Definition of the BVHNode-struct 1 2 3 4 5 6 7 8 struct __align__ (16) BVHNode { int child [2]; float aabb_left [3][2]; float aabb_right [3][2]; float padding [2]; }; Leaf-nodes The leaf-nodes contain other data, than the internal BVH nodes. Therefore, a separate structure defines the leaf-nodes, which are actually just nurbs-patches. A nurbs-patch contains a reference to its parent surface, its corresponding knot-spans, and an initial-guess value. Additionally, an extra AABB is added to increase performance during traversal, as will be described in Section 8.5.2. Listing 7.2 shows the definition of the NURBSPatch-struct. 65 7. P REPROCESSING Algorithm 10 Cox-deBoor-items precomputation Require: Knot-vector knot, n, degree 1: cdbitems ← [] 2: for d = 1 to degree do 3: for i = n − 1 to degree − d step − 1 do 1 4: cdbitem[0] ← knot[i+d]−knot[i] 5: 6: 7: 8: 9: 10: 11: −1 cdbitem[1] ← knot[i+d+1]−knot[i+1] cdbitem[2] ← −cdbitem[0] · knot[i] cdbitem[3] ← −cdbitem[1] · knot[i + d + 1] append(cdbitems, cdbitem) end for end for return cdbitems Listing 7.2: Definition of the NURBSPatch-struct 1 2 3 4 5 6 7 8 9 struct __align__ (16) NURBSPatch { int surface_id ; int span_id ; float2 uv ; float float aabb [3][2]; padding [2]; } 7.2 Root-Finder Data The root-finder depends heavily on the evaluation algorithm, which in turn requires the availability of the NURBS-surface data. Depending on the used evaluation scheme (Section 3.5), additional data may be required. 7.2.1 Basis-function data Two evaluation schemes have been tested, namely the “inverted-triangle”-method, and the divisionfree method. While the former does not require any preprocessing, the latter depends on the precomputed Cox-deBoor-items. The computation of these items is fairly straightforward. For each surface, these items are computed row-wise. Algorithm 10 gives the pseudo-code for computing the items. 7.2.2 Memory layout NURBS-surface data The root-finder heavily depends on information describing the NURBS-surface, such as the degrees, control-points, knot-vectors, etc. The probability of two coherent rays hitting the same NURBSsurface, and therefore accessing the same data, is rather high. Therefore, the texture-memory is an ideal candidate for storing the NURBS-data. Frequently accessed surface-data will appear in the texture-cache, resulting in very fast fetch-times. 66 Root-Finder Data For packet-based traversals, the performance can be increased, by choosing a layout that minimizes the fetch-time. For this, the data is arranged as an array of structures. In this way, the data can be fetched very efficiently by a single instruction using a parallel-read. 
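As an illustration of such a parallel-read (a sketch under assumed names, using the BVHNode struct from Listing 7.1, not the thesis' exact code): each of the first sizeof(BVHNode)/4 threads of a block fetches one 32-bit word of the node, so the complete node arrives in shared-memory in a single, cached transaction.

texture<int, 1, cudaReadModeElementType> bvhNodeTex;  // assumed binding to the BVH array

// dst points to a BVHNode that resides in shared-memory.
__device__ void parallelReadBVHNode(BVHNode *dst, int nodeIndex)
{
    const int words = sizeof(BVHNode) / sizeof(int);  // 16 words for this struct
    if (threadIdx.x < words)
        ((int *)dst)[threadIdx.x] =
            tex1Dfetch(bvhNodeTex, nodeIndex * words + threadIdx.x);
    __syncthreads();  // make the node visible to all threads in the block
}

This is the kind of fetch that the fetchBVHNode and fetchNURBSPatch calls in the packet-based traversal (Listing 8.12) presumably stand for.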
For non-packet-based traversals, the AOS-approach is still beneficial over an SOA-approach, since reading a member will result in a pre-fetch of neighbouring members, which again will reduce the fetch-times. Furthermore, for the primary-ray accelerator, it could even be less efficient to store the surfaces as a SOA. At first, there is totally no spatial relation between the surfaces and the storage of them, and second, because the primary-ray accelerator only performs a single intersection test. Because the data for the control-points and the knot-vectors are variable-sized, they cannot be included in the nurbs-surface data directly, and are therefore stored in two separate textures: one for the control-points, and another for the knot-vectors. These textures will be discussed in the next subsections. Listing Listing 7.3 shows the definition of the structure describing the NURBS-surface. Listing 7.3: Definition of the NURBSSurface-struct 1 2 3 4 5 6 7 8 9 10 11 struct __align__ (16) NURBSSurface { int n_u ; int n_v ; int degree_u ; int degree_v ; int offsetControlPoints ; int offsetKnotVector_u ; int offsetKnotVector_v ; int material_id ; }; Control-points CUDA provides a very optimized way for accessing texture data by exploiting the two-dimensional spatial coherence. However, in order to benefit from this efficiency, the control-points are required to be stored into a rectangular texture. Since the sizes of the control-point grids may differ, a lot of “empty” areas will appear in the texture. Instead, the control-points of all surfaces are combined in a texture bound to a single large piece of linear-memory. The control-points are stored in row-major order. The rows of a surface are concatenated to the texture to form an indexible array. The index of the first control-point is stored into the NURBSSurface-struct and is used to find the corresponding control-points (ListingListing 7.3). Knot-vectors The knot-vector data is stored similarly to the control-point data. Using a single large texture both the knot-vectors are concatenated and appended to the texture. The starting index for each knotvector is stored into the NURBSSurface-struct and is used to find the corresponding knot-vector (ListingListing 7.3). Cox-deBoor items The Cox-deBoor array is stored similarly to the control-point data. Using a single large texture both arrays are concatenated and appended to the texture. The starting index for each array is stored into the NURBSSurface-struct and is used to find the corresponding knot-vector (ListingListing 7.3). 67 7. P REPROCESSING Material data Since there is no relationship between the spatial position of two surfaces having two distinct materials and the memory addresses of those materials, little advantageous can be expected by arranging the materials as a structure of arrays. Furthermore, while shading a pixel, every data-item of the material will be accessed successively. Therefore, the material-data should be clustered to allow prefetching (Section 5.2.2). By creating a texture containing an array of material-structures, the data can be fetched most efficiently. Listing Listing 7.4 shows the definition of the structure describing a material. Listing 7.4: Definition of the Material-struct 1 2 3 4 5 6 7 8 9 68 struct __align__ (16) Material { float3 diffuse ; float3 specular ; int shininess ; float reflectance ; float transparency ; float refraction_index ; }; Chapter 8 Kernel details This chapter will discuss all the implementation-specific details for the rendering core. 
First, all data structures will be described, which are used for inter-kernel communication. Then, the root-finder will be discussed, which will be used by the primary-ray accelerator subsystem as well as the ray tracer subsystem. Finally, the kernel-launch configurations are discussed, including the mapping of rays to threads. 8.1 8.1.1 Inter-kernel Communication Ray-stack To get around CUDA’s absence of support for recursive function calls, a ray-stack is used. Instead of stepping recursively into the ray tracing function to trace reflection- and refraction-rays, the rays are pushed onto the ray-stack, and traced further during subsequent kernel-launches. Each kernellaunch starts by popping one secondary ray from the ray-stack, which is traced and possibly spawns two more rays. The stack is implemented in global-memory as a structure of arrays (Section 5.2.1) containing stack-data for every thread. Each array maps to exactly one stack-level holding the data for each thread for that level. Figure 8.1 shows the memory layout of a stack for 4 threads using a stackdepth of 3. Reading from as well as writing to memory laid out like that, results in efficient coalesced transfers (Section 5.2.1). Usually each thread maintains its own stack-pointer, since some rays will hit a surface and spawn rays, while others will not. Instead, each ray traverses the same path in the ray-hierarchy, and pushes two dummy-rays onto the stack whenever it does not spawn secondary rays. In this way, all threads can share the same stack-pointer. The dummy-ray can be identified by setting the x-value to +∞. When popping a ray, an x-value of +∞ indicates it’s a dummy-ray and the thread should not participate in the traversal. The definition of the structure can be found in listing Listing 8.1. The amount of memory in bytes required for allocating the ray-stack containing 8 float-arrays, is given by: kRayStackk = width · height · (depth + 1) · 32 where width and height represent the image-resolution, and depth equals the maximum recursiondepth. The depth + 1 factor equals the maximum stack-size, and is one larger than the maximum recursion depth, since the ray-hierarchy is traversed depth-first. 69 8. K ERNEL DETAILS Figure 8.1: Memory layout of a stack inside global-memory. This stack is for 4 threads and has a maximum depth of 3. The exemplary “virtual” stack of thread 2 shows the mapping of threads and stack-levels to global-memory addresses. Listing 8.1: Definition of the RayStack-SOA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 struct RayStack { struct { float *x; float *y; float *z; } o; struct { float *x; float *y; float *z; } d; float * opacity ; float * refraction_index ; }; __constant__ RayStack rayStack ; 8.1.2 Hitdata-buffer The hitdata-buffer is used to propagate intermediate results from the ray caster to the shader. These include, among others, the coordinates of the intersection-point, the surface-normal corresponding to that intersection, the distance along the ray causing that intersection, etc. The buffer is arranged in global-memory as a structure of arrays in order to hide the transfer latency (Section 5.2.1). The definition of the structure can be found in listing Listing 8.2. The amount of memory in bytes required for allocating the hitdata-buffer containing 13 float-arrays and 2 int-arrays, is given by: kHitDataBufferk = width × height × 60 70 Inter-kernel Communication where width and height represent the image-resolution. 
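Filling in the numbers from Section 6.2.2 as a check: at 1024 × 1024 with a maximum recursion-depth of 3, the ray-stack takes 1024 · 1024 · (3 + 1) · 32 = 134,217,728 bytes = 128 MB and the hitdata-buffer 1024 · 1024 · 60 = 62,914,560 bytes = 60 MB, together the 188 MB quoted there; for a single 256 × 256 tile, the same formulas give 8 MB + 3.75 MB = 11.75 MB.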
Listing 8.2: Definition of the HitDataBuffer-SOA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 struct HitDataBuffer { float *t; struct { int * surface_id ; int * span_id ; } patch_id ; struct { float *x; float *y; float *z; } s; struct { float *x; float *y; float *z; } normal ; struct { float *x; float *y; float *z; } d; float *u; float *v; float * opacity ; }; __constant__ HitDataBuffer hitDataBuffer ; 8.1.3 Image-buffer After shading is applied to each ray, its resulting color is accumulated to the corresponding pixel in the image-buffer. This buffer is not explicitly allocated by CUDA1 , but is acquired by mapping an OpenGL Pixel Buffer Object (PBO) into the address-space of CUDA. The buffer is arranged as successive quadruples containing floats for the red, green, blue, and alpha channels, respectively. When mapped, the buffer can be accessed as an ordinary float4-array. After the rendering is finished, the buffer is unmapped, and displayed on screen. Listing Listing 8.3 shows the code required for initialization of the image-buffer and Listing 8.4 shows the code for rendering. 1 However, when the goal of ray tracing is solely offscreen-rendering, i.e. in order to output a ray traced video, the buffer must be allocated by CUDA. 71 8. K ERNEL DETAILS Listing 8.3: Initialization of the image-buffer. 1 2 3 4 5 6 7 8 9 /* Allocation of image-buffer */ size_t size = width * height * sizeof( float4 ); glGenBuffers (1 , & pboID ); glBindBuffer ( GL_PIXEL_PACK_BUFFER , pboID ); glBufferData ( GL_PIXEL_PACK_BUFFER , size , 0, GL_DYNAMIC_DRAW ); glBindBuffer ( GL_PIXEL_PACK_BUFFER , 0); /* Registration to CUDA */ cudaGLRegisterBufferObject ( pboID ); Listing 8.4: Usage of the image-buffer. 1 2 3 4 5 6 7 8 9 10 11 12 13 8.2 /* Mapping into address-space of CUDA */ cudaGLMapBufferObject ((void **)& imageBuffer , pboID ); /* Rendering */ ... /* Unmapping */ cudaGLUnmapBufferObject ( pboID ); /* Blitting */ glBindBuffer ( GL_PIXEL_UNPACK_BUFFER , pboID ); glDrawPixels ( width , height , GL_RGBA , GL_FLOAT , 0); glBindBuffer ( GL_PIXEL_UNPACK_BUFFER , 0); Root-finding The root-finder basically implements a slightly modified version of the Newton-Rhapson algorithm described in Section 3.3.1. Normally, when the uv-parameter leaves the surface-domain, the rootfinder should report an intersection-miss. However, it appeared that sometimes for surfaces which actually do contain a valid intersection-point, the uv-increment becomes too large during the first iteration, which causes the parameter to leave the surface-domain too soon. The root-finder will stop searching too early, and reports a miss. To prevent this faulty early-termination, a minimum iteration-count is incorporated to prevent such events. This extra minimum is also applied to the criteria in which the error increases. The results will show (Section 9.2.2), that the quality of the output is increased, while the performance is not affected (if will actually increase a little). Usually, a binary-search is performed to compute the knot-intervals from the uv-parameters. However, we already have this information: both the rasterization and the BVH provides us these data from the patch-identifier. From the span-id and the knot-vector, the patch-domain is derived which is used inside the root-finder. When the uv-parameter leaves the patch-domain, the span-id is updated according to the new location on the surface. 
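A sketch of how such a patch-domain lookup could look (hypothetical helper and texture names; the thesis' fetchPatchDomain is not listed in this chapter): the domain in one parametric direction is simply the knot-interval selected by the span index, read from the knot-vector texture starting at the surface's knot-vector offset (Listing 7.3).

texture<float, 1, cudaReadModeElementType> knotTex;  // assumed binding to all knot-vectors

__device__ float2 knotInterval(int knotOffset, int span)
{
    float lo = tex1Dfetch(knotTex, knotOffset + span);
    float hi = tex1Dfetch(knotTex, knotOffset + span + 1);
    return make_float2(lo, hi);  // the non-empty interval [knot[span], knot[span+1]]
}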
If the uv-parameter leaves the surface-domain, the root-finder will exit (only after the minimum number of iterations has been reached). 8.3 Surface Evaluation Algorithm 4 is used for evaluation of the surface-point and its partial-derivatives. An important optimization has been added however. Since the basis-functions (Section 3.5) are accessed very frequently, both during basis-function evaluation, as well as during the computation of the tensor72 Surface Evaluation Figure 8.2: Arrangement of data for a cache inside shared-memory. This example uses an array of 3 elements for each thread, and a block size of 4 threads. product, storing them in the local-memory will be rather slow. While local-memory may suggest that it’s on-chip memory, it’s actually not. Local-memory is located in the DRAM, and although the accesses are implicitly coalesced, they still have a very high latency. Therefore, a software-managed cache has been implemented which will now be described. 8.3.1 Software-managed Cache In order to minimize these latencies, a cache is implemented in shared-memory (Section 5.2). For a block of threads, some shared-memory is reserved in which these arrays reside. To prevent bankconflicts, each thread in a half-warp needs to access a different location. Therefore, the per-thread array-elements have a stride equal to the number of threads in the block, which is always a multiple of the width of a half-warp (Figure 8.2), in this way bank-conflict are avoided entirely. Listing 8.5 shows an example of a software-managed cache using the shared-memory. Listing 8.5: Software-managed Cache 1 2 3 4 5 __shared__ float N[ BLOCKSIZE * elements ]; float element_0 = N[ BLOCKSIZE * 0 + THREAD_IDX ]; float element_1 = N[ BLOCKSIZE * 1 + THREAD_IDX ]; float element_2 = N[ BLOCKSIZE * 2 + THREAD_IDX ]; ... 8.3.2 Basis-function evaluation From the different evaluation schemes described in Section 3.5, only the direct and the division-free methods have been implemented. The power-basis method was not chosen, because it requires a lot of preprocessing, and does not result in a very stable evaluation method. The de Boor’s algorithm requires too much registers and has also been skipped. 73 8. K ERNEL DETAILS 8.4 Primary-ray Accelerator subsystem 8.4.1 Rasterization Usually, rasterization is directly output to the frame-buffer and displayed on screen. However, we need to obtain the resulting output, since it needs further processing. The OpenGL Framebuffer Object (FBO) extension, provides a highly efficient mechanism for redirecting the rasterization to a virtual frame-buffer. The FBO is a collection of attachable OpenGL buffer objects containing one or more render-buffers to which the rasterization is output, a depth-buffer, stencil buffer, auxiliary buffers, etc. Also, textures can be attached to the FBO to enable direct render-to-texture support. This system uses an FBO containing a render-buffer for the data and additionally a depth-buffer to enable proper depth ordering of the fragments. Recall from Section 6.1.1, that each vertex gets some custom attributes. These attributes correspond to the surface-patch and the uv-parameter of the vertex. Since a single render-buffer is used to store the rasterization, the format of the custom attributes need to match, which is not the case (uvparameters are floating-point and patch-identifier contains integer data). Therefore, the integer data is converted to floating-point data by performing a type cast on the binary data2 . 
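A minimal sketch of this bit-level cast (hypothetical helper name): the id is reinterpreted rather than numerically converted on the host before it is uploaded as a vertex attribute, and can be reinterpreted back on the device with __float_as_int(), as done in Listing 8.8.

#include <cstring>

inline float int_bits_as_float(int i)    // host-side reinterpretation
{
    float f;
    std::memcpy(&f, &i, sizeof(f));      // copy the raw 32-bit pattern
    return f;                            // e.g. 0x3f800000 comes out as 1.0f
}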
Due to numerical round-off errors, interpolating all attributes will result in an incorrect patch-identifier. Therefore, a custom GLSL-shader is employed which only interpolates the uv-parameters, but leaves the patchidentifier data untouched. Listing 8.6 and Listing 8.7 shows the shader programs which are used to obtain the correct result. The extra flat keyword denotes an attribute which will not be interpolated by the rasterization-unit, the resulting fragment will receive the value of an arbitrary vertex contained in the primitive. Listing 8.6: Vertex shader 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #extension EXT_gpu_shader4 : enable attribute vec4 attr_data ; varying vec2 uv ; flat varying float surface_id ; flat varying float span_id ; void main (void) { gl_Position uv surface_id span_id } 2 For 74 = = = = ftransform (); attr_data . xy ; attr_data .z; attr_data .w; example, the integer value 0x3f800000 would be converted to the floating-point value 1.0. Primary-ray Accelerator subsystem Listing 8.7: Fragment shader 1 2 3 4 5 6 7 8 9 10 #extension EXT_gpu_shader4 : enable varying vec2 uv ; flat varying float surface_id ; flat varying float span_id ; void main (void) { gl_FragColor = vec4 ( uv .x , uv .y , surface_id , span_id ); } After rasterization has been completed, the render-buffer can be read by the ray caster. Unfortunately, CUDA currently does not support direct access to render-buffers. Therefore, an OpenGL Pixel Buffer Object (PBO) is used to copy the data from the render-buffer to the PBO. Although this requires another copy, the transfer is from GPU-memory to GPU-memory, which is very fast. After copying, the data is made available to CUDA by mapping the PBO into the address-space of CUDA. Instead of allocating a new buffer for this PBO, the image-buffer is used (Section 8.1.3). 8.4.2 Ray Casting There are two ray casting kernels, one for the Primary-ray Accelerator subsystem, and another for the Ray Tracer subsystem. In this section, the former is discussed, which can be seen in Listing 8.8. It starts by reading the corresponding data-item from the rasterization buffer. If the surface-identifier value happens to be equal to zero, a background pixel has been identified, and no intersection needs to be found. A non-zero value however, indicates a surface-patch, consequently a ray should be cast for this pixel. Since we already have a surface-patch candidate, we avoid the need for traversal of the scene. We can immediately try to find an intersection using the data from the rasterization. Although the Ray Tracer subsystem should spawn secondary rays, it turns out to be more efficient to immediately spawn them. The reason is that if we would let the ray tracer spawn the rays, it would depend on the hitdata, which is read from global-memory. This will heavily stall the execution of the ray tracing kernel. As we already have all required data, i.e. the intersection-point, surface-normal, material-properties, etc., we can just as easy compute the reflection and refraction ray here. This will increase the performance in a later stage, since now only the intersection-distance has to be read to determine if the pixel can be skipped. After having obtained an intersection, or if the ray was skipped since the rasterization reported a background-pixel, the hitData-record is written to global-memory through the hitData-buffer. 75 8. 
K ERNEL DETAILS Listing 8.8: Ray Casting kernel (Primary-ray Accelerator) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 template<bool spawnRays > __global__ void castRay ( int2 tileIdx , float4 * framebuffer ) { :: tileIdx = tileIdx ; float4 data = framebuffer [ OFFSET_SCREEN ]; HitData hitData ; hitData .t = FLOAT_INFINITY ; /* Skip background-pixels */ if ( __float_as_int ( data .z) != 0) { Ray ray = fromPixel ( X_SCREEN , Y_SCREEN ); NURBSPatch nurbsPatch ; nurbsPatch . uv = make_float2 ( data .x , data .y); nurbsPatch . surface_id = __float_as_int ( data .z) - 1; nurbsPatch . span_id = __float_as_int ( data .w); float2 uv_min ; float2 uv_max ; fetchPatchDomain ( nurbsPatch , & uv_min , & uv_max ); rootFinder (ray , nurbsPatch , uv_min , uv_max , hitData ); if ( spawnRays ) { Ray reflectionRay = ray . reflect (& hitData ); rayStack . push (& reflectionRay , stackpointer + 1) ; Ray refractionRay = ray . refract (& hitData ); rayStack . push (& refractionRay , stackpointer + 0) ; } } __hitDataBuffer__ . write (& hitData ); } Parallel intersection test Normally, when taking a SIMD-approach to ray-tracing, the scene is traced packet-based using a small packet-size (usually 2x2 rays). Often, four pixels in a block refer to the same patch which can then be intersected by at best four rays in parallel with only one call to the intersection routine. Occasionally, the rays intersect different patches. The intersection routine is then called for each distinct patch in the SIMD-packet. However, due to CUDA’s SIMD-width of 32 threads, the ray caster in this system uses a much larger packet, increasing the likeliness of the rays intersecting different patches. Calling the intersection routine for each distinct patch will result in many calls, in which only a small amount of rays will actually be testing the patch for an intersection. Hence, for a CUDA-based ray-tracer, employing such a method will decrease the multiprocessor’s utilization heavily. Because all dependent NURBS-data is located in texture-memory, such as the knot-vectors, control-grids, etc., it is possible for different rays to access different surface-patches. Therefore, the intersection test will be called only once for each thread, maximizing the utilization. 76 Ray Tracer subsystem 8.5 Ray Tracer subsystem 8.5.1 Ray Casting The castRay-kernel implements the majority of the ray tracing algorithm. It starts by constructing a ray if it’s casting a primary-ray, otherwise it simply pops a previously spawned reflection- or refraction-ray from the ray-stack (Section 8.1.1). It then traverses the scene using either the packetbased traversal scheme or the single-ray traversal scheme, which will be described in the next subsection. After traversal, the hitData-record is written to global-memory through the hitData-buffer. Finally, the reflection- and refraction-rays are spawned. Once again, regarding the stack-pointer, the reflection-rays are pushed “above” the refraction-rays. In order to assist nvcc in optimizing as much as possible, the complexity of the code is reduced by using a templated function. Using templates, parts of the code can be enabled or disabled during compile-time. The primaryRay-flag indicates we are casting primary-rays, so we should not pop from the stack, but instead construct the ray in-place. One could argue this is less efficient, since it increases the complexity of the code. 
However, it turns out that popping of rays is quite expensive compared to constructing primary-rays, due to the global-memory latency. The same applies for the spawnRays-flag, which is used to enable or disable the reflection- and refraction-rays. Listing Listing 8.12 shows the ray casting kernel for the ray tracing subsystem. Listing 8.9: Ray Casting kernel (Ray Tracer) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 template<bool primaryRay , bool spawnRays > __global__ void castRay ( int2 tileIdx , int stackpointer ) { :: tileIdx = tileIdx ; HitData hitData ; hitData .t = INF ; Ray ray ; if ( primaryRay ) ray = fromPixel ( X_SCREEN , Y_SCREEN ); else rayStack . pop (& ray , stackpointer ); traverseRay (ray , hitData ); hitDataBuffer . write (& hitData ); if ( spawnRays ) { Ray reflectionRay = ray . reflect (& hitData ); rayStack . push (& reflectionRay , stackpointer + 1) ; Ray refractionRay = ray . refract (& hitData ); rayStack . push (& refractionRay , stackpointer + 0) ; } } 8.5.2 Packet-based Traversal For the packet-based traversal scheme, the method from [GPSS07] is used which is described in Section 4.1.3, with some additions specific for NURBS-ray tracing. It can be beneficial for all threads to exit as early as possible. Non-participating packets for example, may skip the entire 77 8. K ERNEL DETAILS traversal algorithm. This decision is made by performing a parallel-reduction prior to the traversal (Section 8.5.2). Shared-memory stack In a packet-based traversal each ray traverses the hierarchy in exactly the same order. Therefore, the stack required to push nodes onto, can be shared by the rays. This stack is implemented by pushing node addresses to an array in shared-memory. By reserving 24 integers for this array, scenes consisting of up to 16,777,2163 sub-patches can be traversed, which should be enough. Each thread maintains its own local stack-pointer. Popping elements from the stack is then simply done by decrementing the pointer and returning the value from the shared-memory array corresponding to this pointer. Pushing to the stack requires some synchronization however, since only one thread is allowed to write to the shared-memory if all threads want to write to the same location, which is the case. To prevent potential read-after-write hazards, a barrier is set before pushing the new value. For the same reason, after the data has been written, another barrier is set. Afterwards, the stack-pointer is incremented for each thread. Listing 8.10 shows a code-fragment of how the shared-memory stack can be used to push a data-item onto the stack, and pop it from the stack afterwards. Listing 8.10: Shared-memory stack 1 2 3 4 5 6 7 8 9 10 11 __shared__ int stack [ STACK_SIZE ]; int stack_pointer = 0; /* Pushing an element to the shared-stack */ __syncthreads () ; if ( THREAD_IDX == 0) stack [ stack_pointer ] = data ; __syncthreads () ; stack_pointer ++; /* Popping an element from the shared-stack */ data = stack [-- stack_pointer ]; View-dependent ordered traversal As in [GPSS07], a view-dependent ordered traversal scheme is used, which increases performance for most scenes. A packet traverses the hierarchy, collectively deciding which child to traverse first. This decision is slightly altered however. 
Inactive rays (rays which do not intersect the current node) should not affect the decision-making of the algorithm, therefore the following modified formula is used to determine which child should be visited next: i = (t0 6= ∞) + 2 · (t1 6= ∞) where t0 is the intersection distance for the AABB of the left child, and t1 is the intersection distance for the AABB of the right child. For each ray, this value is computed and the index is used to set the ith element in a shared-memory array to true, for which each four values are initialized with the value false. Consequently, the values of this array determine which children need a visit. If the fourth element is true, or the second and third element are both true, both children need a visit. Otherwise, the left child only needs a visit if the second element is true, the right child if the third is true, and no children if the first element is true (which denotes an inactive ray). 3 Assuming 78 a balanced tree. Ray Tracer subsystem The other decision is to determine which child needs to be visited first if the packet will visit both: j = 2 · (t0 > t1 ) − (t0 6= t1 ) j will result in −1 if for this ray the left child is closer than the right child, +1 if the right child is closer, and 0 if the ray is inactive. All these values are summed, and the resulting value decides which child will be visited next. Parallel Reduction A frequently used data-parallel primitive in parallel software is the parallel reduction (or parallel sum), in which an array of data-elements is “summed” using an associative operation (i.e. addition, multiplication, maximum, minimum, etc.). Usually, performing a reduction operation on an array of data requires n steps, for an array of size n. However, with the availability of a parallel multiprocessor, this operation can be performed very efficiently. Using a tree-based approach, in which each node (thread) computes the partial sum of its two children (array elements), an array of n elements can be summed in log2 n steps. Using this primitive, also functions such as all and any can be implemented, simply by using the and resp. or operators. Using these functions, a predicate can be tested for all threads to be true, or at least one thread in case of the any operation. Listing 8.11 shows a code-fragment of how the parallel reduction is used to sum the 64 integer values in deciding which child should be visited first. Since CUDA-threads are executed in SIMDgroups of 32 threads, called warps, no synchronization is required during the reduction steps, as long as the size of the array does not exceed 64 elements. Listing 8.11: Parallel-sum 1 2 3 4 5 6 7 8 9 10 11 12 13 14 __shared__ int data [64]; data [ THREAD_IDX ] = decision ; __syncthreads () ; /* Reductions */ if ( THREAD_IDX < 32) data [ THREAD_IDX ] += if ( THREAD_IDX < 16) data [ THREAD_IDX ] += if ( THREAD_IDX < 8) data [ THREAD_IDX ] += if ( THREAD_IDX < 4) data [ THREAD_IDX ] += if ( THREAD_IDX < 2) data [ THREAD_IDX ] += if ( THREAD_IDX < 1) data [ THREAD_IDX ] += __syncthreads () ; data [ THREAD_IDX data [ THREAD_IDX data [ THREAD_IDX data [ THREAD_IDX data [ THREAD_IDX data [ THREAD_IDX + 32]; + 16]; + 8]; + 4]; + 2]; + 1]; decision = data [0]; Parallel Memory Fetching A highly optimized parallel algorithm to obtain data which is shared among all threads in the block, is the parallel-read algorithm. 
The algorithm is used to transfer a block of data from global-memory (or texture-memory) to shared-memory in an efficient way: by letting each subsequent thread read a subsequent data-element. Using this algorithm, a block of memory can be read using only one instruction as long as the number of data-elements does not exceed the number of threads in the block. The efficiency lies in the coalescing of the memory-accesses: each thread accesses a subsequent address. Therefore, the memory controller can transfer the memory line-by-line exploiting the band79 8. K ERNEL DETAILS width of the memory-bus. Furthermore, a thread does not have to wait for the read to complete before reading the next data-element, since all data are read at once. Packet-based Root-finding The Primary-ray Accelerator subsystem fetches its data directly through the texture-unit, allowing the rays to simultaneously intersect different patches. The Ray Tracer subsystem on the other hand, traverses the scene packet-based. While processing a leaf-node, all rays in a packet always call the intersection routine using the same patch. Since all rays access the same data, they can be fetched efficiently using parallel-read operations (Section 8.5.2) and transferred to shared-memory. Although the data-fetches are already cached by the texture-unit, using the parallel-transfer operations, the memory is accessed much more efficiently. Extra Leaf-node Ray-AABB intersection test If a packet visits a leaf-node because only a single ray intersects it (or a few rays), the other “nonactive” rays in the packet pay the price for the packet-based traversal. Normally, this is not that big of a deal, since ray/primitive intersection tests are very cheap for basic primitives (i.e. triangles). However, now the other rays also have to participate in the root-finding. There is a possibility some rays converge or diverge more slowly than the rays that were actually intersecting the node. This will unnecessarily stall the traversal. Also, a ray could already have obtained an intersection which is more close than the intersection with the AABB of the to-be-tested leaf-node. In that case, the intersection test even becomes unnecessary. By simply inserting an additional, repeated ray/AABB intersection test right before the rootfinder, these situations can be avoided, reducing the total number of intersection tests and stalls. Listing 8.12 shows the total packet-based traversal algorithm. 80 Ray Tracer subsystem Listing 8.12: Packet-based Traversal-function 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 __device__ void traverseRayPacket (const Ray &ray , HitData & hitData ) { if (all < BLOCKSIZE >(! ray . participating () )) return; SharedStack < MAX_STACK_DEPTH > stack ; int nodeIndex = 0; for (;;) { if ( IS_LEAF ( nodeIndex )) { __shared__ NURBSPatch nurbsPatch ; fetchNURBSPatch ( nurbsPatch , nodeIndex ); fetchSharedEvaluationData ( nurbsPatch ); if ( ray . participating () && ray . intersectAABB ( nurbsPatch . aabb , hitData .t) != INF ) rootFinder (ray , nurbsPatch , hitData ); if ( stack . empty () ) break; stack . pop ( nodeIndex ); } else { __shared__ BVHNode bvhNode ; fetchBVHNode ( bvhNode , nodeIndex ); float t0 = INF ; float t1 = INF ; if ( ray . participating () ) { t0 = ray . intersectAABB ( bvhNode . aabb [0] , hitData .t); t1 = ray . intersectAABB ( bvhNode . 
aabb [1] , hitData .t); } /* Set the flag for which child needs a visit (or both) M[0] == true --> No children intersected M[1] == true --> Only left child intersected M[2] == true --> Only right child intersected M[3] == true --> Both children intersected */ __shared__ int M [4]; if ( THREAD_IDX < 4) M[ THREAD_IDX ] = 0; __syncthreads () ; M [( t0 != INF ) + 2 * ( t1 != INF )] = 1; __syncthreads () ; if (M [3] || M [1] && M [2]) { /* -1 == prefer left, * 1 == prefer right, * 0 == no preference */ bool go_left = sum < BLOCKSIZE >(2*( t0 > t1 ) - ( t0 != t1 )) < 0; stack . push ( bvhNode . child [ go_left ]) ; nodeIndex = bvhNode . child [1 - go_left ]; } else if (M [1]) { nodeIndex = bvhNode . child [0]; } else if (M [2]) { nodeIndex = bvhNode . child [1]; } else { if ( stack . empty () ) break; stack . pop ( nodeIndex ); } } } } 81 8. K ERNEL DETAILS 8.5.3 Single-ray Traversal Although the packet-based scheme allows for an efficient traversal, it results in more leaf-nodes visited per ray. Therefore, when processing a leaf-node, the likeliness of a ray not intersecting that leaf-node is much higher than for single-ray traversal. As a consequence, the number of rays in a block actively participating in the root-finding, will be much lower than for single-ray traversal. On the other hand, the packet-based traversal scheme will not lead to warp-divergence, since each ray in a packet follows the exact same traversal path. Single-ray traversal however, does lead to heavy warp-divergence. Some rays for example, may execute the root-finder, while others are traversing the BVH. A warp containing traversing and root-finding rays, cannot be executed simultaneously. Therefore, such warps will be executed in two subsequent steps: the first executing a traversal step, disabling the root-finding threads. The second step will execute the root-finder, and disables the other threads. However, since a root-finding step is very expensive, the inactive threads will have to wait very long, before they can continue, resulting in a very low utilization. Another traversal scheme has been implemented, based on the quite recent optimized single-ray traversal scheme from [AL09]. Their method separates the root-finding from the traversal. Although warps still may diverge during traversal, they will join right after all rays have found a candidate leafnode, so that all threads will participate in the root-finding, maximizing the utilization. Listing 8.13 shows the single-ray traversal algorithm. 82 Ray Tracer subsystem Listing 8.13: Single-ray Traversal-function 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 __device__ void traverseSingleRay (const Ray &ray , HitData & hitData ) { if (! ray . participating () ) return; LocalStack < MAX_STACK_DEPTH > stack ; stack . push (0) ; int nodeindex ; while (true) { if (! traverseSingleRayNextNode (ray , hitData , nodeIndex , stack )) return; NURBSPatch nurbsPatch ; nurbsPatch . surface_id = fetchSurfaceID ( nodeIndex ); nurbsPatch . span_id = fetchSpanID ( nodeIndex ); nurbsPatch . uv = fetchUV ( nodeIndex ); float2 uv_min ; float2 uv_max ; fetchPatchDomain ( nurbsPatch , & uv_min , & uv_max ); fetchNURBSPatch ( nurbsPatch , nodeIndex ); if ( ray . intersectAABB ( nurbsPatch . 
aabb , hitData .t) != INF ) rootFinder (ray , nurbsPatch , hitData ); } } __device__ void traverseSingleRayNextNode (const Ray &ray , HitData & hitData , int & nodeIndex , LocalStack & stack ) { if ( stack . empty () ) return false; stack . pop ( nodeIndex ); for (;;) { if ( IS_LEAF ( nodeIndex )) return true; float t0 = ray . intersectAABB ( bvhNode . aabb [0] , hitData .t); float t1 = ray . intersectAABB ( bvhNode . aabb [1] , hitData .t); int M = ( t0 != FLOAT_INFINITY ) + 2 * ( t1 != FLOAT_INFINITY ); if (M == 3) { int go_left = t0 < t1 ; stack . push ( fetchChildIndex ( nodeIndex , go_left )); nodeIndex = fetchChildIndex ( nodeIndex , 1 - go_left ); } else if (M == 1) { nodeIndex = fetchChildIndex ( nodeIndex , 0) ; } else if (M == 2) { nodeIndex = fetchChildIndex ( nodeIndex , 1) ; } else { if ( stack . empty () ) return false; stack . pop ( nodeIndex ); } } } 83 8. K ERNEL DETAILS 8.5.4 Shadow-ray Casting The castShadowRay-kernel is almost equal to the castRay-kernel, except for some simplifications. Listing 8.14 shows the CUDA-kernel . Ray generation Because the shadow-ray is simply a ray originating from the light-source’s origin pointing to the intersection-point, the construction of such a ray is rather trivial. First, the t-parameter is read from the hitData-buffer to determine if we need to cast a shadow-ray. If it indicates a surface-point, the corresponding surface-point and normal is read from the hitData-buffer. In order to prevent self-intersection, the surface-point is slightly offset along the surface-normal, before the ray is constructed. If no shadow-ray needs to be cast, the ray is explicitly set to non-participating, to enable a packet-based traversal. Traversal Whereas in ray casting the front-most intersection is required, in shadow-ray casting any intersection will suffice. Consequently, the traversal can be aborted when the root-finder reports an intersection. Nevertheless, due to the packet-based traversal, a ray must continue traversing the scene until all rays have obtained an intersection, which is constantly checked using a parallel reduction. Inter-kernel communication Since the surface is unimportant, only a boolean indicating intersection or no intersection will suffice. Therefore, very little global-memory transfers are required to update the hitdata-buffer. Infact, only the t-parameter needs to be overwritten to indicate a non-illuminated surface-point. Because multiple light-sources are supported, setting this t-parameter to ∞ would break the system: when casting a shadow-ray for a second light-source, this parameter would indicate that there was no intersection at all. Therefore, the shadow-ray caster negates the t-parameter in case of an occluder. The shader can easily detect this value by allowing only positive values (and rejecting the value ∞). When processing a shadow-ray, the absolute value is taken from the t-parameter, ∞ values will be skipped, and any other values will result in casting a shadow-ray. Finally, the shadow-ray caster is not recursive and hence does not require a stack to push rays onto, again reducing global-memory transfers. 84 Ray Tracer subsystem Listing 8.14: Shadow-ray Casting kernel 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 __global__ void castShadowRay ( int2 tileIdx , int stackpointer , int lightIndex ) { :: tileIdx = tileIdx ; HitData hitData ; hitDataBuffer . 
read_t (& hitData ); if (all < BLOCKSIZE >( hitData .t == INF )) return; Ray shadowRay ; if ( hitData .t != FLOAT_INFINITY ) { /* reset the t-value negated by a previous shadow-ray */ hitData .t = fabs ( hitData .t); hitDataBuffer . read_s (& hitData ); hitDataBuffer . read_normal (& hitData ); Light light ( lightIndex ); shadowRay .o = light . pos ; hitData .s += hitData . normal * epsilonSelfIntersection ; shadowRay .d = hitData .s - shadowRay .o; normalize ( shadowRay .d); shadowRay . init () ; hitData .t = shadowRay . distance ( hitData .s); } else { shadowRay . set_nonparticipating () ; } /* Traverse shadow-ray packet */ traverseShadowRay ( shadowRay , hitData ); hitDataBuffer . write_t (& hitData ); } 8.5.5 Shading The computeShading-kernel is fairly straightforward and is really just the lighting-model written as CUDA-code. 85 8. K ERNEL DETAILS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 __global__ void computeShading ( int2 tileIdx , int lightIndex , float4 * outputBuffer ) { :: tileIdx = tileIdx ; float3 color = make_float3 (0.0 f , 0.0f , 0.0 f); HitData hitData ; hitDataBuffer . read_t (& hitData ); if ( hitData .t == INF ) return; hitDataBuffer . read_hitData (& hitData ); float3 L = environment . light [ lightIndex ]. position - hitData .s; normalize (L); float3 R = math :: reflect (L , hitData . normal ); normalize (R); float LdotN = dot (L , hitData . normal ); float RdotV = LdotN > 0.0 f ? dot (R , hitData .d) : 0.0 f; int material_id = MATERIAL_ID ( hitData . patch_id . surface_id ); color += hitData . opacity * sceneParameters . environment_ambient ; color += hitData . opacity * MATERIAL_DIFFUSE ( material_id ) * max (0.0 f , LdotN ); color += hitData . opacity * MATERIAL_SPECULAR ( material_id ) * pow ( max (0.0 f , RdotV ) , MATERIAL_SHININESS ( material_id )); outputBuffer [ SCREEN_OFFSET ] += make_float4 ( color , 1.0 f); } 8.6 Kernel-launch configurations It is not required for the different kernels to share the same block-size. Obviously, the packet-based ray casting kernels should match their packet-size, since they trace their scene packet-wise. The other kernels however, are not packet-based and therefore do not necessarily benefit from having the same block-size as the ray casters. Furthermore, due to the lesser complexity of the generated binary code for the Primary-ray Accelerator, the register usage is lower than the other ray casting kernels. Therefore, a larger block-size could possibly result in a performance increase. In the results-section of this thesis, different block-sizes are examined and a conclusion is drawn on what block-size is ideal for which kernel (Section 9.1.2). 8.6.1 Pixel/Thread mapping Which pixel, and consequently which rays, will be assigned to which thread depends on the chosen block-sizes, as well as the size of the tiles. All pixels are distributed over several tiles and mapped to a CUDA-grid. Each tile, or grid, is divided into equal-sized groups of pixels which are mapped to a CUDA-block. Finally, each CUDA-block contains the threads executing the kernel. Figure 8.3 gives a schematic overview of the mapping for an example using an image-resolution of 16 × 16, a tile-size of 4 × 4, and a block-size of 2 × 2. For the kernels the mapping from pixel to thread is not very interesting. What is interesting, is the inverse-mapping, which computes the pixel-position in image-space for a thread, based on the threadIdx, blockIdx, and tileIdx. Using the pixel-position, the primary-ray can be setup. 
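To make the launch configuration concrete, the sketch below shows a hypothetical host-side loop that launches the computeShading-kernel (shown above) once per tile, with a grid that covers exactly one tile. The function name, its parameters, and the assumption that image and tile sizes are exact multiples of the tile and block sizes are illustrative only; the actual host driver code is not shown in this chapter.

    /* Hypothetical host-side launch loop (sketch): one kernel launch per tile,
     * using the tileIdx parameter of the computeShading-kernel. */
    void shadeAllTiles(float4 *outputBuffer, int lightIndex,
                       int2 imageDim, int2 tileDim, dim3 blockSize)
    {
        /* The grid of a single launch covers exactly one tile. */
        dim3 gridSize(tileDim.x / blockSize.x, tileDim.y / blockSize.y);

        for (int ty = 0; ty < imageDim.y / tileDim.y; ++ty)
            for (int tx = 0; tx < imageDim.x / tileDim.x; ++tx)
            {
                int2 tileIdx = make_int2(tx, ty);
                computeShading<<<gridSize, blockSize>>>(tileIdx, lightIndex, outputBuffer);
            }
    }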
Figure 8.3: Mapping of pixels to threads. This example uses an image-resolution of 16 × 16, a tile-size of 4 × 4, and a block-size of 2 × 2. The tiles are visualized using a shade of grey, the blocks are separated by black lines, and the pixels are separated by dashed lines.

Also, using the same indices, the offsets can be computed for data-buffers such as the image-buffer, the hitdata-buffer, the ray-stack and, as will be seen later, the datastructures in shared-memory. Rather trivially, the positions and offsets are given by:

    tile_x       = threadIdx_x + blockIdx_x · blockDim_x
    tile_y       = threadIdx_y + blockIdx_y · blockDim_y
    image_x      = tile_x + tileIdx_x · tileDim_x
    image_y      = tile_y + tileIdx_y · tileDim_y
    block_offset = threadIdx_x + threadIdx_y · blockDim_x
    tile_offset  = tile_x + tile_y · tileDim_x
    image_offset = image_x + image_y · imageDim_x

where the tileDim and imageDim variables represent the tile-size and image-size, respectively.

Part III Results

Chapter 9 Results

In this chapter, the runtime performance of the CUDA NURBS Ray Tracing System, described in the previous chapters, will be examined. Section 9.1 describes the procedure used to test the system. Section 9.2 will discuss the impact of several parameters on the quality of the resulting renderings. Section 9.3 will present the performance statistics, which are used to compare against other implementations. Then, Section 9.4 will investigate where the bottleneck is located, and where improvements may be possible. This chapter will conclude with a discussion of the system, comparing it with existing systems and suggesting possible improvements.

9.1 Experimental Setup

In the tests, the two different traversal algorithms are investigated, namely packet-based traversal and single-ray traversal (described in Chapter 8). For each algorithm, a scene is rendered without acceleration, using the standard ray tracer, and with acceleration, using the hybrid ray tracer. Finally, every kernel is compiled and tested with caching and without caching (described in Section 8.3.1). In total, 8 different variants will be tested: packet, packet_cache, single, single_cache, hybridpacket, hybridpacket_cache, hybridsingle, and hybridsingle_cache. For these variants, 6 kernels are produced: accelerator, accelerator_cache, packet, packet_cache, single, and single_cache. Additionally, each kernel will be further fine-tuned to determine a good blocksize/occupancy ratio (Section 9.1.2).

9.1.1 Testing Platforms

Two platforms have been used for obtaining the test-results. The first one is the slightly outdated NVIDIA GeForce 8800M GTX mobile GPU. The second one is the NVIDIA GeForce GTX 295, which has some nice improvements compared to the former. Not only is the per-multiprocessor occupancy higher due to the higher register count, but also more threads are processed physically concurrently, since the number of multiprocessors is 30 per GPU. Unfortunately, to exploit both GPUs, the system must support SLI (Section 5.1), which it doesn't. Table 9.1 summarizes the most important differences between the two platforms/architectures.

9.1.2 Fine-tuning

When hand-optimizing a kernel, it is important that three goals are kept in mind: first, divergence should be minimized by determining a good block size. Second, the occupancy should be maximized, in order to exploit the parallelism of the GPU. And third, memory latencies should be avoided by reducing the local-memory usage.
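To make the trade-off between block size and register usage concrete, the following back-of-the-envelope sketch estimates the number of resident warps per multiprocessor for a compute-capability 1.1 device such as the 8800M GTX. The function name and the simplified allocation rules (register-allocation granularity is ignored) are assumptions for illustration, not part of CNRTS. As a check, a 64-thread block compiled with a 32-register limit gives a register-limited estimate of 8 resident warps out of 24, i.e. the 33% occupancy reported for the uncached packet kernel on the 8800M in Table 9.2.

    #include <algorithm>

    /* Back-of-the-envelope occupancy estimate (sketch, not the official CUDA
     * occupancy calculator): how many warps stay resident on one multiprocessor
     * of a compute-capability 1.1 device, given the register and shared-memory
     * footprint of a block. */
    int estimateResidentWarps(int regsPerThread, int threadsPerBlock, int sharedBytesPerBlock)
    {
        const int regsPerSM      = 8192;    /* registers per multiprocessor (CC 1.1)      */
        const int sharedPerSM    = 16384;   /* bytes of shared-memory per multiprocessor  */
        const int maxWarpsPerSM  = 24;      /* 768 resident threads                       */
        const int maxBlocksPerSM = 8;       /* hardware limit on resident blocks          */

        int warpsPerBlock  = (threadsPerBlock + 31) / 32;
        int blocksByRegs   = regsPerSM / (regsPerThread * threadsPerBlock);
        int blocksByShared = sharedBytesPerBlock > 0 ? sharedPerSM / sharedBytesPerBlock
                                                     : maxBlocksPerSM;
        int blocks = std::min(std::min(blocksByRegs, blocksByShared), maxBlocksPerSM);
        return std::min(blocks * warpsPerBlock, maxWarpsPerSM);
    }

    /* Occupancy is then estimateResidentWarps(...) / 24.0 for this architecture. */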
                          8800M GTX              GTX 295
    Release Date          19 November 2007       January 8, 2009
    Compute Capability    1.1                    1.3
    CUDA Cores            96                     480 (240/GPU)
    Core Clock (MHz)      500                    576
    Memory Clock (MHz)    800                    1998 (999/GPU)
    Maximum Memory (MB)   512                    1792 (896/GPU)
    Memory Interface      256-bit                896-bit (448-bit/GPU)
    Mem. Bandwidth (GB/s) 51.2                   223.8 (111.9/GPU)
    GFLOPS                360.0                  1788.48 (894.24/GPU)
    Operating System      Linux: Ubuntu 9.04     Windows Vista x64
    CUDA Driver           2.3                    2.3
    CUDA Toolkit          2.3                    2.3

Table 9.1: Testing platforms. The entries marked "/GPU" denote the statistics per GPU. Since SLI is currently not supported, the performance is limited to only one GPU.

The block size also impacts occupancy. Although choosing a very small block size increases coherence, it will also decrease the occupancy. For kernels with a relatively large, fixed shared-memory use, the shared-memory will become the limiting factor. As the block size decreases, more blocks will be necessary to maximize the occupancy; however, this will not be possible when the total amount of shared-memory required isn't available.

Regarding the third goal, increasing the maximum number of registers a kernel may use will lower the local-memory usage, since registers needed beyond the limit will be spilled to the slow local-memory (which is located in global-memory). However, when using too many registers, the occupancy (goal two) will decrease. Therefore, a balance must be found between the block size and the maximum number of registers. Additionally, the dimension of the block will also influence the coherence among rays. However, preliminary tests have shown that square-shaped blocks do not necessarily yield the maximum coherence. Therefore, the block dimension must be determined as well.

Determination of the best, most general options

In order to determine which blockdim/maxregcount-combination is best, every reasonable combination is tried and the performance is measured for several scenes. Per scene, the ratio between the best combination and every other combination is computed to provide a normalized performance indication for all blockdim/maxregcount-combinations. Then, the averages are computed for each combination and sorted. By choosing the blockdim/maxregcount-combination with the highest ratio, a kernel will be generated which deviates the least from the optimum blockdim/maxregcount for an average scene.

Chosen options

Table 9.2 gives the resulting blockdim/maxregcount-combinations used in the rest of this report. While the accelerator favors square-shaped block sizes, which is as expected since no real divergence can occur, the other kernels do not.

9.1.3 Test scenes

Several NURBS-scenes have been used to produce the results in this chapter. Table 9.3a lists some simple scenes containing only a few surfaces/patches. The degree varies from 1 to 5. Figure 9.1a shows the images corresponding to the scenes.

                              8800M                              GTX 295
                              dim               regs  occ        dim      regs  occ
    Uncached  Accelerator     16 × 8            64    17%        32 × 8   64    25%
              Packet          2 × 32            32    33%        1 × 64   64    25%
              Single          1 × 64            40    25%        1 × 64   80    19%
    Cached    Accelerator     16 × 8 if d < 7,  64    17%        8 × 8    92    13%
                              16 × 4 otherwise
              Packet          2 × 32            64    17%        1 × 64   86    13%
              Single          2 × 32            64    17%        1 × 64   94    13%

Table 9.2: Chosen compiler-options. "dim" denotes the block dimension, "regs" denotes the maximum number of registers a kernel is allowed to use. "occ" denotes the occupancy of the kernel using these parameters, with a maximum degree of 3.

The scenes in Table 9.3b depict some well-known scenes often used for comparison. All scenes are of degree 3.
Figure 9.1b shows the images corresponding to the scenes. And finally, Table 9.3c shows some scenes with increasing number of patches. All scenes are of degree 3. Figure 9.1c shows the images corresponding to the scenes. 9.2 Image Quality This section will discuss the impact of several parameters on the quality of the resulting renderings. 9.2.1 Hybrid Artifacts This subsection discusses some scenes rendered using the hybrid technique, of which some were rendered correctly, and others were rendered with noticeable artifacts. Figure 9.2 shows three renderings of the teapot scene. The left image shows the scene after it has been processed by the accelerator. Although there are some very tiny artifacts, they are hardly visible. Therefore, the teapot does not really require any further processing for the primary rays. Obviously, accelerator-artifacts are related to the fineness chosen for the scene, however, it does show that as long as no secondary rays and shadow rays are required, the accelerator alone is sufficient. Although the teapot-scene can be rendered properly using the accelerator, other scenes cannot be rendered without noticeable artifacts. Figure 9.3a shows three renderings of the head-scene, in which the accelerator clearly fails to produce an artifact-free result. Apart from the “holes” in the model, which are caused by the linear approximation having a small offset from the surface, the model suffers from an incorrect tessellation. In this scene, not every control-point interpolates the surface, therefore, the tessellation will be far off from the surface at the domain boundaries, since the convex-hull is used for tessellation. Fortunately, these artifacts are all handled correctly by the ray tracer, which fixes these erroneous pixels. Figure 9.3b, on the other hand, shows the same scene from a different view-point. In this view, a strange gap appears in the chin. Because the curvature of the surface around the chin is very high, the tessellation becomes too coarse to represent a smooth surface. To represent this part of the surface, a very high fineness would be required, which increases the size of the BVH as well. Ultimately, the artifact will become visible again when zooming in too close. Since the rasterization provided the accelerator the wrong surface, the ray tracer is not able to check if the accelerator converged correctly. Therefore, the hybrid result still contains this incorrect surface. 93 9. R ESULTS cornellbox 5 19 33 158 1-2 5776 14 5777 6452 0.13s 5.5 kB 631.8 kB 291.9 kB 100.00% Fineness Surfaces Patches Control-points Degree BVHNodes BVH Depth Quads Vertices Preprocessing time NURBS data BVH data Tessellation data Screen fill head 2 4 508 969 2-5 10745 14 10746 16556 0.25s 21.0 kB 1.2 MB 685.3 kB 35.60% ducky 2 5 124 602 2-5 11246 15 11247 13977 0.25s 15.5 kB 1.2 MB 612.5 kB 27.30% (a) Scenes containing several surfaces. Fineness Surfaces Patches Control-points Degree BVHNodes BVH Depth Quads Vertices Preprocessing time NURBS data BVH data Tessellation data Screen fill teapot 5.55 4 32 358 3 13089 15 13090 14444 0.28s 9.2 kB 1.4 MB 655.9 kB 21.16% killeroo-lowres 1.03 89 11532 17181 3 104248 18 104249 184669 2.76s 372.1 kB 11.1 MB 7.23 MB 14.10% killeroo-highres 0.79 89 46128 56625 3 130952 18 130953 330891 3.98s 1.1 MB 14.0 MB 12.1 MB 14.10% killeroo-closeup 0.79 89 46128 56625 3 130952 18 130953 330891 3.98s 1.1 MB 14.0 MB 12.1 MB 92.30% (b) Benchmark scenes for comparison with other systems. 
Fineness Surfaces Patches Control-points Degree BVHNodes BVH Depth Quads Vertices Preprocessing time NURBS data BVH data Tessellation data Screen fill 94 teapot-13 5.55 4 32 358 3 13089 15 13090 14444 0.28s 9.2 kB 1.4 MB 655.9 kB 28.50% teapot-23 5.55 32 256 2864 3 104719 18 104720 115552 2.52s 72.9 kB 11.2 MB 5.1 MB 28.70% teapot-33 5.55 108 864 9666 3 353429 20 353430 389988 6.52s 245.8 kB 37.8 MB 17.3 MB 28.20% teapot-43 5.55 256 2048 22912 3 837759 21 837760 924416 16.15s 582.6 kB 89.5 MB 41.0 MB 28.10% (c) Scenes containing many surfaces. Table 9.3: Test scenes used for the results. teapot-53 5.55 500 4000 44750 3 1636249 22 1636250 1805500 32.28s 1.1 MB 174.8 MB 80.1 MB 28.60% Image Quality (a) Scenes containing several surfaces. From left to right: cornellbox, head, ducky. (b) Benchmark scenes for comparison with other systems. From left to right: teapot, killeroo-lowres, killeroo-highres, killeroo-closeup. (c) Scenes containing many surfaces. From left to right: teapot-1x1x1, teapot-2x2x2, teapot-3x3x3, teapot-4x4x4, teapot-5x5x5. Figure 9.1: Test scenes used. Figure 9.2: teapot-scene: accelerator (left), full hybrid (middle), standard ray traced (right) 95 9. R ESULTS (a) Fixable artifact in the head-scene: accelerator-only (left), hybrid (middle), ray traced reference (right). (b) Non-fixable artifact in the head-scene: accelerator-only (left), hybrid (middle), ray traced reference (right) Figure 9.3: head-scene Figure 9.4: This image shows a close-up of the ducky-scene: hybrid (left), and normally ray traced (right). It seems that the hybrid method can sometimes introduce artifacts, when zooming in too close. 96 Performance min. #iterations 3 2 1 0 Frame rate (FPS) 2.2 2.5 2.6 1.9 Quality of result Correct Few holes Small holes Long Gaps Table 9.4: Different epsilon values. Figure 9.5: Killeroo-closeup-scene rendering with a increasing miniterations-value. The images correspond to the miniterations-values in Table 9.4. Zooming in on the eye in Figure 9.4 will reveal the tessellation, which is not fixable using hybrid ray tracing. Nevertheless, when viewed from some distance, the hybrid result appears correct. 9.2.2 Min-iterations switch Sometimes, the root-finder may incorrectly return a miss, for example if the iteration jumps out of the patch’s domain. However, sometimes this already happens during the first iteration. Usually, it’s only the first iteration that tends to have a somewhat larger uv-increment, therefore, the system includes an extra parameter, which pushes the uv-parameter back into the patch’s domain as long as the minimum number of iterations has not yet been reached. Although less iterations should be preferred, it seems to increase performance a bit when specifying a minimum of one iteration. 9.3 Performance In this section the performance of the ray tracing system will be examined. The different variants will be investigated to determine which variant performs the best on average. Finally, the system will be compared against existing NURBS ray tracing systems. 9.3.1 Primary Rays As can be seen clearly from Table 9.5 and Table 9.6, as well as Table 9.7, the single-ray variant almost always outperforms the packet-based variant, using both architectures. What’s interesting, is that the uncached version of the single-ray variant is not much slower than its cached brother, which is only about 14-17% faster on average (Figure 9.6). On the GTX 295 architecture however, it is actually faster without the cache (for standard ray tracing). 
On the other hand, the packet-based variant does benefit from the cache, increasing the average performance with 30-53% for 8800M and 27-37% for GTX 295. 97 9. R ESULTS Scene cornellbox cached ducky cached head cached Standard (ms) Packet Single 163.89 129.13 27.90 21.06 116.26 117.44 20.84 23.73 449.43 288.38 89.44 47.42 260.20 240.77 71.88 49.76 517.43 406.29 112.22 67.52 291.03 309.22 82.94 63.71 Accelerator 0.91+16.75 0.84+4.71 0.91+14.21 0.84+3.94 0.95+31.25 0.87+6.86 0.95+26.28 0.87+6.29 2.11+44.17 0.89+8.89 2.11+36.96 0.89+8.05 Hybrid (ms) Packet/Total 22.36/40.02 4.64/10.20 16.42/31.54 4.11/8.90 265.66/297.86 50.19/57.92 169.63/196.87 30.93/38.09 345.66/391.94 81.26/91.05 224.68/263.76 49.88/58.82 Single/Total 28.28/45.94 5.05/10.61 20.34/35.46 4.99/9.77 231.65/263.86 44.98/52.71 223.94/251.18 37.93/45.09 323.85/370.13 62.77/72.56 261.91/300.98 49.91/58.85 Table 9.5: Timings in milliseconds for the primary rays of some small scenes per variant. The hybrid section shows two durations: the first one denotes the duration for tracing rays which were missed during acceleration. The second duration denotes the total duration for hybrid ray tracing (acceleration+ray tracing). The accelerator is separated into rasterization+acceleration. The bold items are the timings from the GTX 295 test-system, the other timings are from the 8800M testsystem. This suggests that the majority of execution-time is spent somewhere else, where the use of cache is not possible. In Section 9.4 the variants will be investigated more thoroughly. 9.3.2 Secondary Rays Observing the diagrams in Figure 9.7, Figure 9.8, and Figure 9.9, it becomes clear that for secondary rays, the single-ray variant is again superior to the packet-based variant. While the diagrams for shadow-rays show that the use of cache is even more beneficial in the case of the 8800M test-system, all other diagrams suggest there is little advantage in using the cached variant above the uncached. This is generally a good sign, as future architectures may allow more threads to run concurrently. In order to maximize the occupancy, the shared-memory use should be kept to a minimum. In this case, no cache would be used, hence this cannot become a limiting factor. 9.3.3 Scene size A first look at the results of the massive teapot scenes (Table 9.7) reveals a logarithmic relation between the number of BVH-nodes in a scene and the time to render (Figure 9.10 and Figure 9.11). Although this is characteristic for ray tracing systems in general, the results of the killeroo-models (Table 9.6) seem to suggest that the performance of the ray tracing system is not directly related to the number of BVH-nodes. The high-resolution model, which contains a lot more control-points and BVH-nodes, does not perform worse than the low-resolution model; sometimes it even outperforms the low-resolution model. This might be explained by the fact that the increase in the number of BVH-nodes does not create a new level in the BVH. However, it does create smaller leaf-nodes, decreasing the number of intersection-tests returning a miss. 
When comparing the durations of the accelerator for the low-resolution and the high-resolution killeroo-model scenes, as well as the massive teapot scenes, it becomes clear that the number of 98 Performance Scene teapot cached killeroo-lowres cached killeroo-highres cached killeroo-closeup cached Standard (ms) Packet Single 231.57 142.96 60.53 34.76 140.89 117.75 40.31 35.78 727.55 245.16 120.16 41.41 451.56 206.45 81.26 45.37 809.36 229.85 120.60 41.58 499.88 191.16 85.00 42.85 324.34 326.97 61.39 44.88 227.17 253.88 45.74 47.69 Accelerator 0.86+19.25 0.85+5.28 0.86+14.83 0.85+4.85 4.68+22.19 1.82+5.96 4.68+17.20 1.82+6.10 7.09+23.21 4.90+6.13 7.09+17.85 4.90+5.83 9.47+58.37 4.91+11.01 9.47+43.43 4.91+8.30 Hybrid (ms) Packet/Total 121.29/141.40 32.88/39.01 92.25/107.93 21.49/27.19 206.60/233.46 41.26/49.04 165.47/187.35 28.23/36.15 231.71/262.00 41.56/52.58 169.37/194.30 30.44/41.16 168.00/235.84 32.77/48.69 139.63/192.53 28.59/41.80 Single/Total 129.17/149.28 30.43/36.55 106.99/122.67 25.03/30.73 198.97/225.84 35.12/42.90 169.29/191.17 31.29/39.20 194.05/224.35 35.43/46.45 156.62/181.55 30.62/41.35 190.09/257.93 30.11/46.03 165.26/218.16 27.33/40.53 Table 9.6: Timings in milliseconds for the primary rays of the benchmark-scenes per variant. The hybrid section shows two durations: the first one denotes the duration for tracing rays which were missed during acceleration. The second duration denotes the total duration for hybrid ray tracing (acceleration+ray tracing). The accelerator is separated into rasterization+acceleration. The bold items are the timings from the GTX 295 test-system, the other timings are from the 8800M testsystem. Packet Single Packet Single 3.0 2.0 2.5 1.5 2.0 3.02´ 1.5 2.40´ 2.11´ 1.0 2.32´ 2.79´ 2.18´ 2.06´ 1.0 1.92´ 2.38´ 1.72´ 1.82´ 2.02´ 1.37´ 1.53´ 0.5 1.00´ 0.5 1.00´ Standard Uncached Standard Cached Hybrid Uncached (a) 8800M Hybrid Cached Standard Uncached Standard Cached Hybrid Uncached Hybrid Cached (b) GTX 295 Figure 9.6: Average speedup of primary rays w.r.t. reference implementation (packet-based, no caching, no acceleration). 99 9. R ESULTS Packet Single Packet 2.0 2.5 2.0 1.5 1.5 2.85´ 1.0 2.24´ 2.18´ 1.0 1.94´ 1.25´ 1.58´ 0.5 0.5 Single 1.00´ 1.00´ Standard Uncached Standard Cached Standard Uncached (a) 8800M Standard Cached (b) GTX 295 Figure 9.7: Average speedup of shadow rays (level 0) w.r.t. reference implementation (packet-based, no caching). Packet Single Packet Single 2.5 2.0 2.0 1.5 1.5 2.36´ 2.22´ 1.0 2.69´ 2.59´ 1.0 1.27´ 0.5 1.40´ 1.00´ 0.5 Standard Uncached Standard Cached (a) 8800M 1.00´ Standard Uncached Standard Cached (b) GTX 295 Figure 9.8: Average speedup of reflection rays (level 1) w.r.t. reference implementation (packetbased, no caching). 100 Performance Packet Single Packet 2.5 2.5 2.0 2.0 1.5 2.83´ 2.74´ 1.0 1.5 2.93´ 2.66´ 1.0 1.29´ 0.5 Single 1.00´ 0.5 Standard Uncached 1.27´ 1.00´ Standard Cached Standard Uncached (a) 8800M Standard Cached (b) GTX 295 Figure 9.9: Average speedup of refraction rays (level 1) w.r.t. reference implementation (packetbased, no caching). 
Scene teapot-1x1x1 cached teapot-2x2x2 cached teapot-3x3x3 cached teapot-4x4x4 cached teapot-5x5x5 cached Standard (ms) Packet Single 276.02 197.52 54.23 30.98 161.62 148.36 36.42 32.59 497.88 250.74 82.14 41.74 317.40 220.85 56.35 44.96 698.79 292.41 100.92 46.78 505.26 277.16 80.88 52.37 925.00 348.83 128.80 58.36 654.11 345.03 98.98 62.14 1037.69 350.81 161.27 57.39 798.94 348.87 125.80 68.86 Accelerator 0.89+17.77 0.87+5.52 0.89+13.36 0.87+5.55 4.66+26.31 1.67+5.70 4.66+19.74 1.67+4.92 12.46+25.25 9.41+5.48 12.46+19.11 9.41+4.83 28.17+26.53 39.68+6.35 28.17+20.89 39.68+6.97 43.94+27.05 89.45+8.27 43.94+20.95 89.45+8.54 Hybrid (ms) Packet/Total 143.13/161.79 27.00/33.39 81.49/95.74 19.13/25.54 209.74/240.71 38.21/45.59 150.71/175.12 29.83/36.42 279.60/317.31 41.31/56.19 193.85/225.43 36.93/51.16 343.61/398.30 51.11/97.13 276.56/325.61 45.36/92.01 298.20/369.19 56.73/154.45 270.19/335.08 59.35/157.34 Single/Total 165.68/184.34 28.63/35.01 116.45/130.70 24.85/31.26 225.97/256.94 37.91/45.28 192.33/216.74 36.27/42.86 263.72/301.42 41.76/56.64 229.09/260.67 40.69/54.93 287.33/342.02 53.39/99.42 268.93/317.99 43.01/89.66 287.21/358.20 47.17/144.89 274.00/338.89 48.88/146.87 Table 9.7: Timings in milliseconds for the primary rays of the large scenes per variant. The hybrid section shows two durations: the first one denotes the duration for tracing rays which were missed during acceleration. The second duration denotes the total duration for hybrid ray tracing (acceleration+ray tracing). The accelerator is separated into rasterization+acceleration. The bold items are the timings from the GTX 295 test-system, the other timings are from the 8800M test-system. 101 9. R ESULTS 800 340 æ 300 æ 300 800 æ 600 æ 700 280 æ 250 æ 500 260 600 æ æ 500 400 240 æ 200 æ 1.0 ´ 106 500 000 æ æ 1.5 ´ 106 æ 300 220 400 æ æ æ 700 320 900 350 æ æ æ 1000 1.0 ´ 106 500 000 (a) packet, nocache 1.5 ´ 106 1.0 ´ 106 500 000 æ (b) single, nocache 1.5 ´ 106 1.0 ´ 106 500 000 (c) packet, cached 1.5 ´ 106 (d) single, cached Figure 9.10: Durations of the variants using the 8800M test-system, as a function of the number of patches. The red line shows the curve corresponding to a logarithmic function fitted to the data. æ æ æ 160 æ æ 120 65 55 æ 140 100 50 æ 120 60 æ 55 æ æ 45 æ 80 50 æ 100 æ 45 40 60 æ 80 æ æ 40 500 000 æ 500 000 1.0 ´ 10 6 1.5 ´ 10 (a) packet, nocache 6 1.0 ´ 106 1.5 ´ 106 æ æ (b) single, nocache 500 000 1.0 ´ 106 1.5 ´ 106 (c) packet, cached æ 500 000 1.0 ´ 106 (d) single, cached Figure 9.11: Durations of the variants using the GTX 295 test-system, as a function of the number of patches. The red line shows the curve corresponding to a logarithmic function fitted to the data. Direct evaluation Division-free evaluation Teapot 4.9 4.7 Killeroo-highres 4.5 4.3 Head 3.0 2.8 Table 9.8: Performance in FPS for different evaluation methods. The results were obtained using the cached single-ray ray tracer on the 8800M test system. control-points in a scene or the number of patches does not influence the performance at all, which is actually quite obvious, since each pixel contains up to one patch to find a root. The rasterization however, is influenced by the number of patches, since it’s linear in the number of sub-patches. Therefore, the performance is independent on the number of control-points as long as the rasterization is fast enough. 
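For reference, the red curves in Figure 9.10 and Figure 9.11 correspond to a logarithmic function fitted to the measured timings. The sketch below shows how such a fit can be computed by ordinary least squares; the assumed model t(n) = a + b·ln(n) and the function name are illustrative only, since the exact model used for the figures is not specified further.

    #include <cmath>
    #include <cstddef>

    /* Least-squares fit of t(n) = a + b * ln(n) on measured (node count,
     * rendering time) pairs; an illustration of the kind of logarithmic
     * curve shown in Figures 9.10 and 9.11. */
    void fitLogCurve(const double *n, const double *t, std::size_t count,
                     double &a, double &b)
    {
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (std::size_t i = 0; i < count; ++i)
        {
            const double x = std::log(n[i]);   /* linearize by substituting x = ln(n) */
            sx  += x;
            sy  += t[i];
            sxx += x * x;
            sxy += x * t[i];
        }
        b = (count * sxy - sx * sy) / (count * sxx - sx * sx);
        a = (sy - b * sx) / count;
    }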
An interesting detail, is that starting with the teapot-3x3x3 scene, the GTX 295-test-system using the standard uncached single-ray variant begins to outperform every other hybrid variant. Starting with the teapot-5x5x5 scene, it even outperforms the rasterization phase, making the hybrid approach obsolete. However, I doubt the maximum rasterization throughput has really been reached, since that part was not fully optimized yet, especially since the 8800M should perform worse. 9.3.4 Surface-Evaluation Scheme As it turned out, the direct evaluation method in slightly faster than the division-free evaluation method, which is clearly visible in Table 9.8. 102 1.5 ´ 106 Performance Scene cornellbox head ducky teapot killeroo-lowres killeroo-highres killeroo-closeup teapot-1x1x1 teapot-2x2x2 teapot-3x3x3 teapot-4x4x4 teapot-5x5x5 P 7.5/34.6 3.1/13.2 4.0/18.3 5.4/24.5 4.6/20.5 4.9/20.5 3.7/19.2 5.5/28.2 4.3/21.4 3.5/18.7 2.8/15.4 2.8/15.5 Standard P+S 3.0/16.7 2.0/8.7 2.6/11.6 3.5/17.3 3.0/13.6 2.2/13.4 2.9/12.4 4.2/20.5 1.8/14.7 1.7/12.2 3.0/10.0 2.9/10.1 P+S+1 1.8/10.1 0.7/3.2 1.0/4.9 1.3/6.5 1.0/4.9 0.5/5.0 1.0/2.9 1.6/8.0 0.6/4.8 0.6/3.8 1.8/3.2 1.7/3.2 Hybrid P 26.0/59.8 4.8/15.0 6.5/22.0 8.7/30.1 6.6/23.0 6.2/20.7 5.9/20.4 9.0/33.3 6.6/24.1 5.1/17.2 3.9/10.2 2.2/6.1 Table 9.9: Overview of the performance of the ray tracing system. The numbers represent the performance in frames per second. P denotes the primary rays, S denotes the shadow rays (level 0), and 1 denotes the level-1 secondary rays (reflection+refraction+shadow). The bold entries denote the frame rates for the GTX 295 test-platform. 9.3.5 Preferred variant Based on the observations in the previous subsections, Table 9.9 gives an overview of the performance data of the system using the overall-best variants. In here, the single-ray implementation using the cache has been selected as the preferred variant for standard ray tracing on the 8800M test platform, and the uncached version on the GTX 295 platform. For hybrid ray casting, the cached version of the packet-based implementation has been selected, for both test platforms. 9.3.6 Comparison Table 9.10 and Table 9.11 give a comparison with two other existing CPU-based direct NURBS ray tracing systems, described in [AGM06, ABS08]. Both systems implement the division-free evaluation method described in Section 3.5.3, and employ a packet-based scheme for scene-traversal. In addition, the system described in [ABS08] implements the hybrid method described in Section 4.2.2 and has a variant which uses 8 cores in parallel. Both systems use SIMD to execute 4 rays in parallel, per core. The performance of the 8800M test platform seems to outperform the single-core variants of the systems. Although the frame rate of the teapot-scene of the system described in [AGM06] is much higher than in [ABS08], the frame rate for the high-resolution killeroo scene is the same. Furthermore, the impact of the number of patches, as well as the screenfill is much higher for the CPU-based systems, than for the GPU-based system (CNRTS), suggesting the system is better suited for massive models. The GTX 295 platform performs much better, with a speedup of 4-5×. However, the CPU-based ray tracing system is still much faster when using 8 cores, although the frame rate is not much higher for the killeroo-scene. For hybrid ray tracing, the ID-processing variant of the CPU-based ray tracing system is much faster than the GPU-based system, while the uv-processing variant is comparable. 103 9. 
R ESULTS CNRTS Scene teapot killeroo-lowres killeroo-highres killeroo-closeup 5.0/24.5 4.6/20.5 4.9/20.5 3.7/19.2 [AGM06] 1 core 7.2 3.8 3.3 1.5 [ABS08] (Standard) 1 core 8 cores 5.2 34.2 3.3 22.4 - Table 9.10: Primary-ray frame rate comparison of several existing systems for standard NURBS ray tracing. The bold entries denote the frame rates of the GTX 295 test platform. The selected variants, are the single-ray cached variant for the 8800M platform, and the single-ray uncached variant for the GTX 295 platform. CNRTS 1 core Scene teapot killeroo-lowres killeroo-highres killeroo-closeup 9.0/30.1 6.6/23.0 6.2/20.7 5.9/20.4 8.0 5.9 - [ABS08] (Hybrid) 8 cores ID-processing u/v-processing 41.7 25.5 30.8 22.1 - Table 9.11: Primary-ray frame rate comparison of several existing systems for hybrid NURBS ray tracing. The bold entries denote the frame rates of the GTX 295 test platform. The selected variant, for both platforms, is the packet-based cached variant. (a) Packet-based (b) Single-ray Figure 9.12: Number of BVH-nodes visited. 9.4 Analysis In this section, the variants are compared and analyzed to identify areas with room for improvement. 9.4.1 Traversal The total number of BVH-nodes visited by each ray during traversal, is different for each variant. Figure 9.12 gives a visualization which compares the number of nodes visited per ray for the teapotscene. It’s clear that each ray in the same packet visits the same number of nodes. Also, Figure 9.12b indicates that the maximum number of visited nodes using the single-ray variant is much smaller than the packet-based variant. 104 Analysis cornellbox head ducky teapot killeroo-lowres killeroo-highres killeroo-closeup teapot-1x1x1 teapot-2x2x2 teapot-3x3x3 teapot-4x4x4 teapot-5x5x5 Single 14.6M (181.2k) 4.2M (1.5M) 2.9M (943.7k) 2.6M (1.1M) 2.2M (932.2k) 2.2M (952.0k) 8.3M (602.1k) 2.9M (1.2M) 3.6M (1.4M) 4.6M (1.8M) 4.3M (1.5M) 5.9M (2.3M) Packet 19.6M (181.3k) 9.9M (2.7M) 8.5M (1.9M) 7.8M (2.1M) 14.1M (2.6M) 15.3M (2.7M) 12.9M (902.8k) 6.8M (2.3M) 14.1M (3.4M) 25.7M (4.3M) 34.6M (5.1M) 51.0M (6.3M) Table 9.12: Total number of visited BVH nodes for every scene. The numbers between curly brackets denote the number of visited nodes when using hybrid ray tracing. Table 9.12 lists the cumulative number of visited nodes for each scene, comparing the singleray variant and the packet-based variant, as well as their hybrid versions. It is evident, that for every scene, the packet-based variant visits a lot more nodes than the single-ray variant. Note that zooming in on a model, reduces the number of visits for the packet-based variant (compare killeroo-highres and killeroo-closeup). This is of course caused by the fact that zooming in on the model, will enlarge the bounding-boxes from the packet’s perspective. Consequently, less surface-area of the boundingbox will be intersected by the packet’s volume, resulting in fewer intersected nodes. Although the screenfill is higher, it turns out to be more efficient for packet-based traversal (observe the smaller execution time in Table 9.6). The single-ray variant on the other hand, requires much more node visits for the close-up scene. Also observe, that for increasing BVHs (teapot-scenes), the number of visited nodes for the packet-based variant increases rapidly w.r.t. the single-ray variant. 
This is easily explained by the fact that the increase in number of nodes in the hierarchy, combined with the same screenfill used (zoomed to fit the screen), will shrink the size of the bounding-boxes, again from the perspective of the rays. Therefore, the opposite is expected of zooming in on the scene, namely, that the number of visited nodes increases. This is confirmed by the table. 9.4.2 Root-finding Table 9.13 shows statistics about the root-finder. Although the number of visited nodes is much larger for the packet-based variant, than the single-ray variant, the total number of intersection tests and iterations is roughly the same. This is caused by the extra Ray-AABB intersection test performed just before the root-finder stage. However, the execution times per iteration actually differ a lot. Although both variants use caching to keep the computed basis-functions, the packet-based variant performs much better. First, during a root-finding call, every ray in the packet, when participating in the root-finding, will test against the same surface-patch. Therefore, the data-coherence will be perfect: each thread requires the exact same data. Moreover, since the data is first transferred to the shared-memory, which acts again as a software-managed cache, the memory-latencies will be completely hidden. The single-ray variant will fetch its data through the texture-unit. Although this will be cached as well, it proves to be more efficient to put it into shared-memory. Furthermore, since the rays may 105 9. R ESULTS Packet Single Accelerator Packet (Hybrid) Single (Hybrid) Intersection tests Cumulative 123,042 119,408 55,382 16,790 (72,172) 16,745 (72,127) Iterations Cumulative Average 313,993 2.6 304,833 2.6 92,201 1.7 44,049 2.6 43,923 2.6 Newton-iteration Cumulative Average 9.0 · 109 28,589 22.6 · 109 74,210 9 1.1 · 10 12,328 1.3 · 109 29,749 3.0 · 109 68,060 Table 9.13: Root-finder statistics for the teapot-scene (cached). The Newton-iteration numbers denote durations in GPU clock-cycles. fetch different surface-patches, the memory-coherence will be much less than the packet-based variant. Also, because the data may be scattered in the texture, fetches will result in uncoalesced accesses the first time, which is very inefficient. The packet-based variant even has one extra advantage, in that the data will be fetched coalesced by all threads cooperatively. Unfortunately, for the packet-based variant, the single-ray variant will eventually win for other reasons, which will be investigated in the next subsection. 9.4.3 Utilization Apart from root-finding being a very expensive computation, a poor utilization of the multiprocessors will make it perform even worse. Ideally, all cores of a multiprocessor should be executing, to reach the full potential. However, this is not always the case. During packet-based traversal, some rays may intersect a leaf-node, and others do not, in which case only a subset of the packet will be participating in the root-finding. The other cores are shut-down for a moment, until every ray’s root-finder-call has finished executing, after which all rays in the packet continue their execution. Generally, the utilization refers to that of a warp. However, since we are traversing packet-wise, the utilization now covers all warps in the packet. In the worst case, if only one ray is active, the entire packet will be blocked until that ray is done. 
Unfortunately, this is the price we have to pay for the advantage of having a fast shared-stack and a very efficient node/patch fetch implementation. Figure 9.13a shows a visualization of the total idle-time per pixel for the teapot-scene rendered using the packet-based variant (using caching). As can be seen, a large portion of the execution-time the threads are idle. This is the major reason for the packet-based traversal for being rather slow, regarding the brute horsepower of the GPU. The single-ray variant on the other hand, does not suffer from these restrictions, since the traversal happens separately. As a consequence, during the root-finding phase, all rays will be active (apart from the rays which did not intersect a leaf-node), maximizing the multiprocessor’s utilization. However, some rays may finish earlier (either successful or not) than some other rays in the same warp, which causes the already finished rays will have to wait for the entire warp to finish. Therefore, also the single-ray variant suffers from some under-utilization. Although this under-utilization is restricted to only one warp and will not affect other warps. Finally, rays in the single-ray variant may have already traversed the entire BVH, while other threads in the same warp are still busy. Note, that this will never occur using the packet-based variant, where idle-time is caused by a minority of rays intersecting a leaf-node. Figure 9.13b gives a visualization of this idle-time. Finally, Figure 9.14 shows a histogram of the average utilization during root-finding. The image clearly shows that most of the time, only a very small subset of rays is active, which causes the high idle-time in the other image. 106 Analysis (a) Packet-based (Cached) (b) Single-ray (Cached) (c) Packet-based (Hybrid+Cached) (d) Single-ray (Hybrid+Cached) Figure 9.13: Amount of time spent idle of total durations for each thread. 107 9. R ESULTS 8000 6000 4000 2000 0 5 10 15 20 25 30 35 Figure 9.14: Frequencies of the average number of rays participating in the root-finding. 9.4.4 Low-level Profiling An easy way to obtain statistics about running CUDA-kernels, is by enabling the CUDA-profiler. Running CUDA-applications will then automatically generate a log-file containing low-level information for each kernel-launch. Among these measurements are: running time for a kernel, occupancy, number of divergent branching warps, etc. By using the CUDA-profiler, some insight may be gained about kernels, and why some perform less than expected. Branching warps Table 9.14 lists the collected information about main kernel for each variant based on the rendering of the teapot-scene. A high difference in the number of warp-branches is revealed, which is caused by the difference in traversal of each variant. When traversing the BVH packet-based, the set of visited nodes is the union of the set of nodes which would be visited when traversing each ray separately. Therefore, much more nodes will be visited in the packet-based variant than the singleray variant, depending on the coherence of the rays in the packet. The coherence is likely to drop, due to the much larger packets (64 rays). The possibility that rays will hit different BVH-nodes will therefore be much larger than when using smaller packets such as 2x2 packets which is usually the case ([AGM06, ABS08]). Because each extra visited node requires an extra traversal step, extra code will be executed, resulting in more branches. 
The difference in the number of instructions executed confirms this. The difference in the number of branching warps between the single-ray variants is explained by the fact that their block-sizes differ. Therefore, threads will execute differently depending on their corresponding rays.

Diverging warps

During packet-based traversal, most branches will never be divergent, because the entire packet decides the order of traversal, causing the entire warp to follow the same execution path. However, there will always be branches which cause divergence. Some rays, for example, may take longer to find an intersection while root-finding, causing the warp to diverge and leading to warp serialization (which is confirmed by the table). Additionally, a ray may skip the root-finder entirely if it misses the sub-patch's bounding-box. The single-ray variant, on the other hand, does not share a common traversal path and may therefore diverge during traversal. Since the traversal is separated from the root-finder, the divergence is kept to a minimum. Nevertheless, it is still larger than for the packet-based variant.

                        Nocache                   Cache
                        Packet       Single       Packet       Single
    GPUTime (ms)        308          216          190          168
    occupancy (%)       33           25           17           17
    branching warps     1.189.939    664.956      1.189.939    527.565
    divergent warps     36.775       45.486       36.775       35.816
    instructions        9.6M         6.9M         9.4M         5.4M
    warp serialize      780k         21k          820k         19k
    local load          1.591.277    1.431.112    4.574        67.343
    local store         4.253.364    5.066.636    40.152       320.800

Table 9.14: CUDA profiling data for the CNRTS-kernels (based on the teapot-scene).

Warp serialization

Whereas the number of divergent warps of the single-ray variant (uncached) is much higher than for the packet-based variant, the number of warp serializations is a lot smaller (37×). Warp serialization is caused by bank-conflicts in either the shared-memory or the constant memory. Since the packet-based variant visits more nodes, this number will be larger.

Local-memory accesses

The effect of using a cache is very evident in the table. The number of local loads/stores is heavily reduced for the cached variants, which is obvious since the caches avoid local-memory transfers.

9.5 Discussion

Standard Ray Tracing

Although nowadays packet-based ray tracing is considered to be the preferred way of ray tracing on CPU-architectures (and actually also on GPU-architectures), the single-ray implementation seems to perform much better on the GPU. First, the rather large packet-size, enforced by the 32-wide SIMD-architecture, causes the packets to be less coherent compared to CPU-implementations. On the CPU, these packets are usually much smaller, since the SIMD-width is 4. As a consequence, the overall efficiency will be much lower on the GPU, as fewer rays will participate in the root-finding.

While a naive single-ray implementation will suffer heavily from ray divergence, the implementation used here tries to avoid these under-utilization problems as much as possible. By separating the traversal from the root-finder, the divergence only affects the traversal stage. Under-utilization occurs only if some rays belonging to a warp have not yet returned, either successfully or unsuccessfully. As soon as all rays return, the warp enters the root-finding stage, in which each ray will try to find an intersection (except those for which no leaf-node was found during traversal). Therefore, the utilization during root-finding, which is the most expensive operation in the system, is much higher than in the packet-based implementation.
Hybrid Ray Tracing Although the accelerator phase is very fast, the additional ray tracing phase necessary to fix the small number of artifacts, takes up the majority of the total hybrid rendering time (Table 9.5, Table 9.6). Therefore, the benefit of using a GPU/CPU-hybrid is rather low and less than expected. Looking at the processing time required to fix the relatively small amount of pixels, it seems that the system does not scale very well over the number of rays. For highly complex scenes, hybrid ray tracing becomes actually slower than standard ray tracing (Table 9.7). 109 9. R ESULTS Furthermore, not always the same quality is guaranteed as standard ray tracing. Some artifacts may appear, due to wrong tessellations, which cannot be handled by the ray tracer (Figure 9.3b). However, because the total number of pixels that require a repair is generally very low, the accelerator could act as a very fast “low-detail” method to navigate through the scene. Caching While the use of a cache does improve the performance of the 8800M architecture, it is less obvious for the GTX 295 architecture. For the primary rays, using standard single-ray tracing, the use of cache even decreases performance a little (Figure 9.6). However, for shadow- and secondary-rays, the speedup becomes more apparent (Figure 9.7, Figure 9.8, and Figure 9.9). Although the difference is not that big for the GTX 295 architecture, it shows there is some improvement when using a cache. Furthermore, since the cache is implemented using the sharedmemory, the occupancy is lowered, suggesting that the performance increase would be much bigger when the same occupancy could be maintained. The upcoming Fermi architecture has an hierarchical caching system built-in. This would make the software-managed cache obsolete. Nevertheless, for “older” architectures, the cache proves to be worthwhile. Basis-function evaluation Although [AGM06] presented their method as the best performing evaluation scheme on SIMD-architectures, using CUDA their division-free basis-function evaluation method perform slightly worse than the direct evaluation-method described in [PT97]. Since a lot of preprocessed data needs to be fetched to evaluate a set of basis-functions, the advantage of not having to compute a division diminishes, whereas the direct evaluation-method only requires the knot-vector data. Therefore, the advantage of using the division-free method above the direct evaluation-method is very small. Preprocessing Although the focus of this thesis was not on efficient ways to generate BVHs, the timings in Table 9.3 show that the preprocessing time is usually under a second up to a few seconds for moderately complex scenes, which is not very high. Nevertheless, this might be improved even further by generating the BVH on the GPU also. Supported scenes Table 9.3 shows that the system is capable of ray tracing scenes containing 4000 patches. Obviously, more complex scenes are supported which is only limited by the available GPU-memory. The maximum degree the system can handle is limited by the available shared-memory. When using the uncached variant, virtually any degree is supported. Of course, the intersection-time will increase also. Time-complexity While standard rasterization is linear in the number of objects, ray tracing isn’t. Table 9.7, together with Figure 9.10 and Figure 9.11 clearly show a logarithmic increase in rendering time, in which the single-ray implementations appear to converge to a constant rendering time. 
The linear nature of the rasterization algorithm becomes apparent in the hybrid ray tracing results in Table 9.7. It can be seen that the rasterization time increases much faster than the ray trace time. However, eventually, the rasterization time will be too high, making it unusable for interactive ray tracing. 110 Chapter 10 Conclusions and Future Work Ray Tracing is by no doubt a very expensive algorithm. For regular planar primitives, such as triangles, quadrilaterals, polygons, etc. the difficulty generally lies in the traversal of the acceleration datastructure, since computing an intersection between a ray and a primitive is computationally cheap to implement. However, when ray tracing NURBS-surfaces, the bottleneck has shifted to the intersection computation, which is very expensive to compute. Furthermore, due to the incoherent nature of the ray tracing algorithm, it is very difficult to provide the GPU with homogeneous workloads since rays may take longer to compute. In this thesis we have presented a system capable of ray tracing NURBS-surfaces with shadows, reflections, and refractions up to any depth, and fully implemented on the GPU using CUDA. By separating the traversal from the root-finder, a single-ray implementation was able to outperform the generally-preferred packet-based implementation. Contrary to the common believe that GPUs are not suitable for ray tracing, due to their streaming SIMD-architecture, the results suggest the crossing-point is being reached between CPUs versus GPUs, in favor of GPUs. Although the system currently does not yet outperform the 8-core CPUimplementation from [ABS08], it does leave the 1-core implementation far behind. And the long idle-times suggest that there is plenty of room for improvement. Although using a hybrid approach increases performance, it does not perform as expected. Looking at the processing time required to fix the relatively small amount of pixels, it seems that the system does not scale very well over the number of rays. Because artifacts may remain, and rasterization times can become larger than standard ray tracing, it is doubtful that hybrid ray tracing will remain beneficial in the future. It could act however as a very fast “low-detail” method to navigate through the scene. 10.1 Future Work Trimming Although the system is able to ray trace virtually any NURBS-model, it does not yet have support for trimming curves, which is commonly used in industries to ease the design of complex models. Without trimming curves, surfaces with holes are very difficult to model. The surface will have to be wrapped around the hole, resulting in many patches. The availability of support for trimming curves would therefore be a great addition to the system. SLI Currently, the ray tracing system is limited to use only a single GPU. Therefore, the 295GTX’s full potential is far from being reached. By adding support for multiple GPUs in an SLI configuration, the ray tracing performance can be doubled theoretically. However, since the GPUs do not share their memory-spaces, all data needs to be uploaded twice. 111 10. C ONCLUSIONS AND F UTURE W ORK Persistent Threads More recently, [AL09] have taken a total different approach towards ray tracing on GPUs, by using “persistent threads”. Instead of mapping each thread to a screen-pixel, they let the threads fetch unprocessed rays from a ray-pool. If a thread finishes processing its ray, it fetches the next ray from the pool. 
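A minimal sketch of this persistent-threads pattern is given below; the kernel name, the traceSingleRay helper, and the single global counter are illustrative assumptions, not part of CNRTS. In [AL09] a warp actually claims a small batch of rays with a single atomic operation rather than one ray per thread, which reduces the atomic traffic.

    /* Sketch of the persistent-threads pattern (names hypothetical): threads
     * repeatedly grab the next unprocessed ray from a global pool instead of
     * being bound to a fixed pixel. */
    __device__ unsigned int nextRayIndex;   /* assumed to be reset to 0 before the launch */

    __global__ void persistentTrace(Ray *rayPool, HitData *hitPool, unsigned int rayCount)
    {
        for (;;)
        {
            /* Atomically claim the next ray in the pool. */
            unsigned int i = atomicAdd(&nextRayIndex, 1u);
            if (i >= rayCount)
                return;                                  /* pool exhausted: this thread is done */
            traceSingleRay(rayPool[i], hitPool[i]);      /* hypothetical per-ray trace routine */
        }
    }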
Fermi
Last but not least, the upcoming new architecture by NVIDIA looks very promising. Apart from the raw horsepower, it contains some new features that are very useful for ray tracing. The built-in cache hierarchy will automatically increase performance for many parts of the ray tracer, and the possibility to run different kernels concurrently sounds very interesting and opens new ways to improve today’s state of the art.

Bibliography

[Abe05] O. Abert. Interactive Ray Tracing of NURBS Surfaces by Using SIMD Instructions and the GPU in Parallel. Master’s thesis, University of Koblenz-Landau, 2005.

[ABS08] O. Abert, M. Bröcker, and R. Spring. Accelerating Rendering of NURBS Surfaces by Using Hybrid Ray Tracing. 2008.

[AGM06] O. Abert, M. Geimer, and S. Müller. Direct and Fast Ray Tracing of NURBS Surfaces. In Proc. IEEE Symposium on Interactive Ray Tracing 2006, pages 161–168, 18–20 September 2006.

[AL09] Timo Aila and Samuli Laine. Understanding the Efficiency of Ray Traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009, pages 145–149, New York, NY, USA, 2009. ACM.

[BBB87] Richard H. Bartels, John C. Beatty, and Brian A. Barsky. An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1987.

[BBDF05] S. Beck, A.C. Bernstein, D. Danch, and B. Fröhlich. CPU-GPU Hybrid Real Time Ray Tracing Framework. 2005.

[BBLW07] C. Benthin, S. Boulos, D. Lacewell, and I. Wald. Packet-based Ray Tracing of Catmull-Clark Subdivision Surfaces. Technical report, Scientific Computing and Imaging Institute, 2007.

[BM99] W. Boehm and A. Müller. On de Casteljau’s algorithm. Computer Aided Geometric Design, 16(7):587–605, 1999.

[CDP] Sylvain Collange, David Defour, and David Parello. Barra, a Parallel Functional GPGPU Simulator.

[CHH02] N.A. Carr, J.D. Hall, and J.C. Hart. The ray engine. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 37–46. Eurographics Association, Aire-la-Ville, Switzerland, 2002.

[CLR80] Elaine Cohen, Tom Lyche, and Richard Riesenfeld. Discrete B-splines and subdivision techniques in computer-aided geometric design and computer graphics. Computer Graphics and Image Processing, 14:87–111, 1980.

[Cox72] M.G. Cox. The numerical evaluation of B-splines. IMA Journal of Applied Mathematics, 10:16, 1972.

[dB72] C. de Boor. On calculating with B-splines. Journal of Approximation Theory, 6(1):50–62, 1972.

[Dia09] G. Diamos. The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86. 2009.

[DKK09] G. Diamos, A. Kerr, and M. Kesavan. Translating GPU binaries to tiered SIMD architectures with Ocelot. 2009.

[FR87] R.T. Farouki and V.T. Rajan. On the numerical condition of polynomials in Bernstein form. Computer Aided Geometric Design, 4:191–216, 1987.

[FS05] T. Foley and J. Sugerman. KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 15–22. ACM, New York, NY, USA, 2005.
[GPSS07] J. Günther, S. Popov, H.-P. Seidel, and P. Slusallek. Realtime Ray Tracing on GPU with BVH-based Packet Traversal. In Proc. IEEE Symposium on Interactive Ray Tracing 2007, pages 113–118, 10–12 September 2007.

[GS87] Jeffrey Goldsmith and John Salmon. Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications, 7:14–20, 1987.

[Hav01] V. Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Czech Technical University, 2001.

[HSHH07] D.R. Horn, J. Sugerman, M. Houston, and P. Hanrahan. Interactive k-d tree GPU raytracing. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 167–174. ACM, New York, NY, USA, 2007.

[Kaj82] J.T. Kajiya. Ray tracing parametric patches. In Proceedings of the 9th Annual Conference on Computer Graphics and Interactive Techniques, pages 245–254. ACM, New York, NY, USA, 1982.

[LYTM06] C. Lauterbach, S.E. Yoon, D. Tuft, and D. Manocha. RT-DEFORM: Interactive ray tracing of dynamic scenes using BVHs. In Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, pages 39–45, 2006.

[MB90] J.D. MacDonald and K.S. Booth. Heuristics for ray tracing using space subdivision. The Visual Computer, 6(3):153–166, 1990.

[MCFS00] W. Martin, E. Cohen, R. Fish, and P. Shirley. Practical Ray Tracing of Trimmed NURBS Surfaces. Journal of Graphics Tools, 5(1):27–52, 2000.

[OLG+07] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, and T.J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. In Computer Graphics Forum, volume 26, pages 80–113, 2007.

[PBMH05] T.J. Purcell, I. Buck, W.R. Mark, and P. Hanrahan. Ray tracing on programmable graphics hardware. In International Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, USA, 2005.

[Pet94] John W. Peterson. Tessellation of NURB surfaces. In Paul S. Heckbert, editor, Graphics Gems IV, pages 286–320. Academic Press, 1994.

[PGSS07] S. Popov, J. Günther, H.-P. Seidel, and P. Slusallek. Stackless KD-Tree Traversal for High Performance GPU Ray Tracing. In Computer Graphics Forum, volume 26, pages 415–424, 2007.

[Pro05] Jana Procházková. Derivative of B-spline function, 2005.

[PSS+06] H.F. Pabst, J.P. Springer, A. Schollmeyer, R. Lenhardt, C. Lessig, and B. Fröhlich. Ray Casting of Trimmed NURBS Surfaces on the GPU. In Proc. IEEE Symposium on Interactive Ray Tracing 2006, pages 151–160, 18–20 September 2006.

[PT97] L.A. Piegl and W. Tiller. The NURBS Book. Springer, 1997.

[PTVF97] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge University Press, 1997.

[TS05] N. Thrane and L.O. Simonsen. A Comparison of Acceleration Structures for GPU Assisted Ray Tracing. Master’s thesis, 2005.

[WBB08] I. Wald, C. Benthin, and S. Boulos. Getting rid of packets: Efficient SIMD single-ray traversal using multi-branching BVHs. In Interactive Ray Tracing, 2008. RT 2008. IEEE Symposium on, pages 49–57, 2008.

[WBS07] Ingo Wald, Solomon Boulos, and Peter Shirley. Ray tracing deformable scenes using dynamic bounding volume hierarchies. ACM Transactions on Graphics, 26(1):6, 2007.

[WH80] T. Whitted. An Improved Illumination Model for Shaded Display. Communications of the ACM, 23(6):343–349, 1980.

[WSBW01] I. Wald, P. Slusallek, C. Benthin, and M. Wagner. Interactive Rendering with Coherent Ray Tracing. In Computer Graphics Forum, volume 20, pages 153–165, 2001.