University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3
- Chang Liu
- Tested on personal laptop:
- i7-12700 @ 4.90GHz with 16GB RAM
- RTX 3070 Ti Laptop 8GB
This project implements a GPU path tracer written in CUDA.
A virtual camera is capturing another virtual camera
Scene Specs | [./scenes/pbr_texture.txt] |
---|---|
Resolution | 2400 x 1800 |
Samples Per Pixel | 3000 |
Render Time | < 7 minutes (> 7.5 frames per second) |
Million Rays Per Second | 32.4 |
Triangle Count | 25637 |
Large scene (6,704,264 triangles)
Lens effect: depth of field with heart-shaped aperture blades
The base code's lighting strategy is to shoot a new ray at every intersection point until the ray hits a light source. This works fine for most scenes that have large light sources and are dominated by diffuse surfaces. However, if light sources are very small, like a point light, rays can hardly ever hit them, so the resulting image will be dark and noisy.
BSDF Sampling | Light Sampling | MIS Combined |
---|---|---|
Rendered at 32 spp
Instead of passively waiting for our rays to hit the light sources, we can "pull" them toward the light source we want direct illuminance from. This strategy is called light sampling. Light sampling converges much faster than BSDF sampling, as you can see from the images. However, it introduces another problem: when surfaces are very specular, light sampling can hardly produce any valid contribution.
In this path tracer, I implemented Multiple Importance Sampling (MIS) to combine samples from BSDF sampling and light sampling. MIS takes advantage of both while avoiding the disadvantages of each, producing less noisy results.
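For reference, here is a minimal sketch of the power heuristic, the standard MIS weight; the function name and the way it is applied are illustrative, not necessarily this renderer's exact code.

```cpp
#include <cuda_runtime.h>

// Power heuristic (beta = 2) for weighting two sampling strategies.
// pdfThis: PDF of the strategy that actually generated the sample.
// pdfOther: PDF the other strategy would have assigned to the same sample.
__host__ __device__ inline float powerHeuristic(float pdfThis, float pdfOther) {
    float a = pdfThis * pdfThis;
    float b = pdfOther * pdfOther;
    return a / (a + b);
}

// Usage: a light sample's contribution is scaled by
// powerHeuristic(pdfLight, pdfBSDF); when a BSDF sample happens to hit the
// same light, its contribution is scaled by powerHeuristic(pdfBSDF, pdfLight).
```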
Tired of repeatedly setting up and adjusting artificial light sources to get the best effect? We can introduce natural light from the real world that easily improves rendering quality at almost no cost: we retrieve the radiance stored in HDR photographs when calculating the lighting and color of surfaces. As you can see, the demo images all benefit from such HDR images.
A very simple implementation is to map the direction of escaped rays from a spherical direction to a 2D texture coordinate and fetch the corresponding color from the HDR image. However, if we want it to work well with MIS direct lighting, we should treat the skybox as a light source and perform importance sampling on it.
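A minimal sketch of that direction-to-UV mapping, assuming a y-up axis convention (the renderer's axes may differ):

```cpp
#include <cuda_runtime.h>
#include <math_constants.h>   // CUDART_PI_F

// Map a normalized escaped-ray direction to equirectangular UVs in [0, 1]^2,
// which are then used to sample the HDR environment map.
__device__ inline float2 dirToEquirectUV(float3 d) {
    float u = (atan2f(d.z, d.x) + CUDART_PI_F) / (2.f * CUDART_PI_F);
    float v = acosf(fminf(fmaxf(d.y, -1.f), 1.f)) / CUDART_PI_F;
    return make_float2(u, v);
}
```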
The idea of importance sampling an HDR image is to first convert the image's colors into a 2D brightness distribution. Then, we flatten the distribution to 1D and perform importance sampling on it. For sampling a 1D distribution, I used the alias method, which requires building an alias table for the distribution; sampling with the alias table is then O(1) per sample.
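A sketch of the alias method, with Vose's O(n) construction and the O(1) sampling step; names are illustrative, and the table would be built on the CPU from the image's luminance values before being uploaded to the GPU.

```cpp
#include <vector>
#include <queue>

// One alias-table bin: with probability `prob` keep this bin,
// otherwise jump to `alias`.
struct AliasBin {
    float prob;
    int   alias;
};

// Vose's O(n) construction from unnormalized weights (e.g. pixel luminance).
std::vector<AliasBin> buildAliasTable(const std::vector<float>& w) {
    int n = (int)w.size();
    double sum = 0.0;
    for (float x : w) sum += x;

    std::vector<double> p(n);
    for (int i = 0; i < n; i++) p[i] = w[i] * n / sum;   // average becomes 1

    std::vector<AliasBin> table(n);
    std::queue<int> smallQ, largeQ;
    for (int i = 0; i < n; i++) (p[i] < 1.0 ? smallQ : largeQ).push(i);

    while (!smallQ.empty() && !largeQ.empty()) {
        int s = smallQ.front(); smallQ.pop();
        int l = largeQ.front(); largeQ.pop();
        table[s] = { (float)p[s], l };
        p[l] += p[s] - 1.0;                               // hand the excess of l to s
        (p[l] < 1.0 ? smallQ : largeQ).push(l);
    }
    auto flush = [&](std::queue<int>& q) {
        while (!q.empty()) { int i = q.front(); q.pop(); table[i] = { 1.f, i }; }
    };
    flush(largeQ);
    flush(smallQ);
    return table;
}

// O(1) sampling with two uniform random numbers u1, u2 in [0, 1).
inline int sampleAlias(const std::vector<AliasBin>& table, float u1, float u2) {
    int i = (int)(u1 * table.size());
    return u2 < table[i].prob ? i : table[i].alias;
}
```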
No Importance Sampling | Importance Sampling |
---|---|
These are rendered with an HDR image that has a very bright but tiny dot in it (the sun). You can see that only importance sampling captures it and correctly lights the ground -- the sun is too small to be hit by a random ray sampled from the BSDF.
I didn't implement many materials in this renderer. There are three material models:
- Lambertian: a classic diffuse material used to represent rough surfaces
- Dielectric: a transparent material having both reflection and refraction. Used to simulate glass, water, air or any "clear and specular" media
- Metallic workflow: a popular material model used in industry. It has two parameters, Metallic and Roughness; by changing these two parameters we can simulate a variety of materials such as plastic, glossy metal and mirrors. It's very expressive, artist friendly, and easy to combine with textures to create realistic renderings. The model implemented by the renderer is a mixture of a Lambertian diffuse lobe and a Trowbridge-Reitz (GGX; GTR2) microfacet specular lobe (see the sketch below)
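A sketch of two typical ingredients of that model -- the GGX normal distribution and the metallic-driven F0 blend; the renderer's exact formulation may differ.

```cpp
#include <cuda_runtime.h>
#include <math_constants.h>

// Trowbridge-Reitz / GGX normal distribution function.
// alpha = roughness * roughness, cosH = dot(N, H).
__device__ inline float ggxD(float cosH, float alpha) {
    float a2 = alpha * alpha;
    float d = cosH * cosH * (a2 - 1.f) + 1.f;
    return a2 / (CUDART_PI_F * d * d);
}

// The metallic parameter blends between a dielectric (4% gray F0 plus a
// Lambertian diffuse lobe) and a pure metal (colored F0, no diffuse).
__device__ inline float3 specularF0(float3 baseColor, float metallic) {
    const float f0 = 0.04f;
    return make_float3(f0 + (baseColor.x - f0) * metallic,
                       f0 + (baseColor.y - f0) * metallic,
                       f0 + (baseColor.z - f0) * metallic);
}
```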
Metallic workflow: changing metallic and roughness; Dielectric: IOR
The following picture shows the types of textures supported. You can see how the workflow makes a difference, turning a plain white model into a very realistic image.
Initially I only implemented normal mapping. When I first added texture channels for metallic and roughness and rendered the camera image, I was shocked by how expressive PBR materials and PBR textures are.
Real-world cameras' lens systems are composed of sets of glass elements that focus light onto the sensor. In path tracing, the basic camera model is the pinhole model, which creates a precise one-to-one mapping from world positions to pixel coordinates. If we want to simulate depth of field, we have to allow one world position to map to a range of pixels.
How do we achieve this? The answer is to create a virtual aperture, randomly generate a point on the aperture, and connect that point to the point we sample within the pixel to generate the ray.
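A minimal thin-lens sketch under assumed camera parameters (apertureRadius, focalDistance, camRight, camUp); the renderer's actual camera code may be organized differently.

```cpp
#include <cuda_runtime.h>
#include <math_constants.h>

// Jitter a pinhole camera ray over the aperture and re-aim it so that points
// at the focal distance stay sharp. For simplicity the focus distance is
// measured along the ray; a plane-of-focus version would divide it by
// dot(dir, camForward).
__device__ void applyDepthOfField(float3& origin, float3& dir,
                                  float3 camRight, float3 camUp,
                                  float apertureRadius, float focalDistance,
                                  float u1, float u2) {
    // Point this pinhole ray sees in perfect focus.
    float3 focus = make_float3(origin.x + dir.x * focalDistance,
                               origin.y + dir.y * focalDistance,
                               origin.z + dir.z * focalDistance);
    // Uniform sample on the circular aperture.
    float r = apertureRadius * sqrtf(u1);
    float theta = 2.f * CUDART_PI_F * u2;
    float px = r * cosf(theta), py = r * sinf(theta);
    origin = make_float3(origin.x + px * camRight.x + py * camUp.x,
                         origin.y + px * camRight.y + py * camUp.y,
                         origin.z + px * camRight.z + py * camUp.z);
    // New direction: from the aperture sample toward the focus point.
    float3 d = make_float3(focus.x - origin.x, focus.y - origin.y, focus.z - origin.z);
    float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    dir = make_float3(d.x / len, d.y / len, d.z / len);
}
```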
No DOF | DOF |
---|---|
This is my favorite part of the project.
In DOF, we sample points on the camera's aperture disk to create blurred background and foreground. This idea can even be extended by stochastically sampling a mask image instead of the circle shape of the aperture, which creates very interesting and amazing results.
Star Mask | Heart Mask |
---|---|
The sampling method is the same as the one used to pick a direction from the skybox.
You can toggle this by changing the value of ApertureMask in scene files: set it to the path of a mask image, or leave it Null to disable the effect.
This is also a trivial feature, but it can be used to create some funny-looking images.
Twisted Sponza rendered with panorama camera
The toggle for this feature is compile time: change the value of CAMERA_PANORAMA in ./src/common.h
Implementing gamma correction is trivial, but it is necessary if we want the final image to be displayed correctly on monitors. Otherwise, the image will always look darker than intended.
Tone mapping is another feature that requires little work but easily improves the overall appearance. It simply remaps the values stored in the color buffer from HDR to LDR before gamma correction is applied and the image is displayed on the monitor.
No Tone Mapping | ACES |
---|---|
I implemented Filmic and ACES tone mapping. The images in this readme are all processed with ACES. By comparison we can see that ACES lifts contrast, so pictures look more appealing.
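For reference, a sketch of a common ACES approximation (Narkowicz's fit) followed by gamma 2.2 correction; the exact curves used by the renderer may differ.

```cpp
#include <cuda_runtime.h>

// ACES filmic curve (Narkowicz's approximation), applied per channel.
__device__ inline float acesFilm(float x) {
    const float a = 2.51f, b = 0.03f, c = 2.43f, d = 0.59f, e = 0.14f;
    return fminf(fmaxf((x * (a * x + b)) / (x * (c * x + d) + e), 0.f), 1.f);
}

// HDR value -> tone-mapped LDR -> gamma-corrected value for display.
__device__ inline float3 toDisplay(float3 hdr) {
    float inv = 1.f / 2.2f;
    return make_float3(powf(acesFilm(hdr.x), inv),
                       powf(acesFilm(hdr.y), inv),
                       powf(acesFilm(hdr.z), inv));
}
```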
You can toggle tone mapping through GUI, or by pressing T
on your keyboard.
Ray-scene intersection is probably the most time-consuming part of the ray tracing process. In a naive method we try to intersect every ray with every object in the scene, which is quite inefficient when there are numerous objects.
You can enable this BVH visualizer through GUI
An alternative is to set up a spatial culling data structure according to how the objects are distributed in the scene, and intersect that structure before intersecting objects. For example, we could assign each object a bounding box and only test an object if the ray intersects its bounding box first. However, bounding boxes alone are not sufficient to reduce intersection time. Therefore, we need more advanced spatial structures, like a bounding volume hierarchy, octree or k-d tree. These tree-based structures usually have an average height of O(log n), so each ray only needs to test a logarithmic number of nodes instead of every object.
I implemented a BVH as the acceleration structure in this renderer, along with two levels of optimization to make it fast and robust. These optimizations allow my path tracer to build the BVH for the Rungholt scene (6,704,264 triangles) in 7 seconds and run at 3 FPS at 2560 x 1440. (Click the images to see details)
Rendering | Bounding Volume Hierarchy of the Whole Scene |
---|---|
SAH, the Surface Area Heuristic, is a method for deciding how to split a set of bounding volumes into subsets when constructing a BVH, so that the constructed tree's structure is close to optimal.
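The cost that an SAH builder minimizes for each candidate split looks roughly like this; the traversal and intersection constants are illustrative.

```cpp
#include <cuda_runtime.h>   // float3

struct AABB { float3 lo, hi; };

inline float surfaceArea(const AABB& b) {
    float dx = b.hi.x - b.lo.x, dy = b.hi.y - b.lo.y, dz = b.hi.z - b.lo.z;
    return 2.f * (dx * dy + dy * dz + dz * dx);
}

// SAH cost of splitting parent P into children L (nLeft prims) and R (nRight):
//   cost = C_trav + [SA(L) * nLeft + SA(R) * nRight] * C_isect / SA(P)
// The split (axis and position) with the lowest cost is chosen.
inline float sahCost(const AABB& parent, const AABB& left, int nLeft,
                     const AABB& right, int nRight) {
    const float cTrav = 1.f, cIsect = 1.f;   // assumed relative costs
    return cTrav + cIsect * (surfaceArea(left) * nLeft +
                             surfaceArea(right) * nRight) / surfaceArea(parent);
}
```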
The second level of optimization is done on the GPU. A BVH is a tree, after all, so we still have to traverse it during ray-scene intersection even on the GPU. Tree traversal is usually based on recursion or a stack, but both are awkward on the GPU: deep recursion is impractical in standard GPU programming, and if we use a stack instead, we need to maintain one in every thread -- the registers and local memory required would limit thread concurrency, not to mention what happens when the BVH depth exceeds the stack's limit.
For the BVH in this project, I implemented stackless traversal. The idea is to preprocess the tree structure so that we know in advance which node to visit next depending on whether the ray intersects the current node's bounding box. By doing so we effectively create a state machine whose traversal only requires keeping track of the current node index -- no per-thread stack at all.
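A sketch of that traversal over a depth-first node array with precomputed miss links; the node layout, the intersectPrimitive helper and all names here are assumptions, not the project's exact code.

```cpp
#include <cuda_runtime.h>
#include <cfloat>

struct Ray  { float3 o, d; };                       // assumed ray layout
struct AABB { float3 lo, hi; };

struct BVHNode {
    AABB bounds;
    int  primIdx;    // >= 0 for a leaf, -1 for an interior node
    int  missLink;   // next node on a miss (or after a leaf); -1 ends traversal
};

// Renderer's existing primitive test; assumed to return FLT_MAX on a miss.
__device__ float intersectPrimitive(int primIdx, const Ray& r);

// Slab test against the current closest hit distance.
__device__ bool hitAABB(const AABB& b, const Ray& r, float tMax) {
    float t0 = 0.f, t1 = tMax;
#define SLAB(a)                                             \
    {                                                       \
        float inv = 1.f / r.d.a;                            \
        float tn = (b.lo.a - r.o.a) * inv;                  \
        float tf = (b.hi.a - r.o.a) * inv;                  \
        if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; } \
        t0 = fmaxf(t0, tn); t1 = fminf(t1, tf);             \
    }
    SLAB(x) SLAB(y) SLAB(z)
#undef SLAB
    return t0 <= t1;
}

// Stackless traversal: on a hit we advance to the next node in depth-first
// order (node + 1); on a miss we jump to the precomputed missLink.
__device__ float traverseBVH(const BVHNode* nodes, const Ray& ray, int& hitPrim) {
    float tMin = FLT_MAX;
    hitPrim = -1;
    int node = 0;
    while (node != -1) {
        if (hitAABB(nodes[node].bounds, ray, tMin)) {
            if (nodes[node].primIdx >= 0) {               // leaf
                float t = intersectPrimitive(nodes[node].primIdx, ray);
                if (t < tMin) { tMin = t; hitPrim = nodes[node].primIdx; }
                node = nodes[node].missLink;
            } else {
                node = node + 1;                          // descend
            }
        } else {
            node = nodes[node].missLink;                  // skip subtree
        }
    }
    return tMin;
}
```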
I only tested relatively small scenes, considering that it would take fairly long to render without a BVH.
This result came from the demo camera scene, rendered at 1800 x 1350. The BVH obviously beats naive intersection.
To disable BVH, you can toggle BVH_DISABLE in ./src/common.h, but I highly discourage you from doing so.
In path tracing, or any other Monte Carlo-based light transport algorithm, apart from improving performance from a programming point of view, we can also improve it mathematically. Quasi-Monte Carlo (QMC) sequences are a class of low-discrepancy sequences widely used in Monte Carlo simulation. This kind of sequence is mathematically proven to be more efficient than pseudorandom sequences (like what thrust::default_random_engine generates). You can see that QMC sequences look more uniformly distributed.
Theoretically, to maximize the benefit of the Sobol sequence, we would need to generate a unique sequence for every pixel in every sample iteration at runtime -- this is not trivial, not to mention that computing each number requires up to 32 bit-loop iterations. A better choice is to precompute one pixel's sequence, then use some sort of perturbation to produce different sequences for different pixels.
I precomputed a Sobol sequence of 10,000 samples and 200 dimensions in ./sobol_10k_200.bin, so no runtime computation is needed. For the perturbation, I used xor scrambling: simply xoring each pixel's sequence with a per-pixel random number.
Here is the result I got from testing the PBR texture scene without textures. With the same number of samples per pixel, path tracing with the Sobol sequence produces much lower variance (less noise).
Pseudorandom Sequence | Xor-Scrambled Sobol Sequence |
---|---|
To figure out how much stream compaction can possibly improve a GPU path tracer's performance, we need a baseline to compare against. Instead of toggling the streamed path tracer's kernels to disable stream compaction, we can separately write another kernel that does the entire ray tracing process: shoot rays, find intersections, shade surfaces and sample new rays, all in one kernel. Unlike streamed path tracing, where path information must be stored in global memory after each pass, this single kernel keeps the required variables in each thread's local memory.
In real-time rendering, a technique called deferred shading stores the scene's geometry information in texture buffers (a G-buffer) at the beginning of the render pass, so that later passes can save time by retrieving geometry information directly from the buffer rather than re-running rasterization. It turns out we can do something similar in offline rendering by caching the first intersection of camera rays.
However, my test result shows NO significant improvement from caching the first intersection -- probably because the first intersection pass is already a small fraction of the total frame time.
Toggle the pragma ENABLE_GBUFFER
in ./src/common.h
for comparison.
Before shading and sampling surfaces, we can sort the intersection records by material type (or id) so that the same or similar materials are contiguous in memory, which potentially reduces thread divergence during shading and sampling. I simply used thrust::sort_by_key to do this job.
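A sketch of that sort; the key extraction (a simple gather of material ids into the key buffers) is assumed to have happened already, and the buffer names are illustrative.

```cpp
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

// thrust::sort_by_key permutes one value array per call, so the material-id
// keys are duplicated and both dependent buffers get the same ordering.
template <typename Intersection, typename Path>
void sortByMaterial(int* dev_keysA, int* dev_keysB,
                    Intersection* dev_intersections,
                    Path* dev_paths, int numPaths) {
    cudaMemcpy(dev_keysB, dev_keysA, numPaths * sizeof(int),
               cudaMemcpyDeviceToDevice);
    thrust::sort_by_key(thrust::device, dev_keysA, dev_keysA + numPaths,
                        dev_intersections);
    thrust::sort_by_key(thrust::device, dev_keysB, dev_keysB + numPaths,
                        dev_paths);
}
```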
Some of the required performance analysis is already presented in the sections above.
I decided to put these three together for comparison and analysis.
This is the result I got by inspecting the exact execution time of each kernel during each pass and summing them up with NSight Compute. As we can see from the graph, stream compaction improves overall performance, and the amount depends on the scene's spatial structure: if the scene is more open, stream compaction removes the majority of rays, which escape after the first intersections, while the single-kernel path tracer has to keep those threads idle. If the scene is more closed, the improvement is less significant, probably because the number of valid rays does not drop much after each bounce.
In general, it's a tradeoff between thread concurrency and time spent accessing global memory.
After implementing material sorting, I found it was actually slower. Sorting by material does improve memory coalescing for intersection, sampling and stream compaction -- the graph shows they take less time. However, the improvement is not enough to offset the additional time introduced by sorting. As we can see from the test result below, sorting makes up more than a third of the ray tracing time in the open scene, which is unacceptable.
Another possibility is that BSDF sampling and evaluation are not as time-consuming as expected.
Therefore, in my opinion, material sorting is best applied when:
- The scene is more closed than open
- There are many objects, many different materials in the scene
- Objects sharing the same material are randomly distributed in many small clusters over the scene space. The clusters' sizes in solid angle are typically less than what a GPU warp can cover
Here is a more detailed graph recording the time spent by each kernel at each tracing depth. In closed scenes the time consumed per depth can even grow, probably because memory accesses to the path segment and intersection buffers become less coalesced at greater depths.
 | Open Scene | Closed Scene |
---|---|---|
Sorted | | |
Streamed | | |
Single Kernel | | |
To fulfill the requirement, I did implement a basic procedural texture to compare with (the code has changed since I saved this picture, so it will look different now):
It shows little difference no matter what kind of texture is used.
To use this procedural texture, simply set BaseColor
of a material to Procedural
.
- All core features
- Refraction (2)
- Depth of field (2)
- Antialiasing (2)
- Texture mapping & normal mapping combined with mesh loading (6)
- Direct lighting (2)
- Better sampling method (Sobol) (3)
- Post processing (?)
- OBJ loading (2)
- BVH (6)
- HDR skybox (?)
- And more...
Apart from what was provided with the base code, two additional third party libraries are included:
- tinyobjloader
- tinygltf (not used. I didn't implement GLTF mesh loading)
The following scene assets are licensed under CC BY 3.0. They are modified to fit in the path tracer's scene file structure.
- Rungholt
- Crytek Sponza
- Background image of dragon demo from sIBL HDRI Archive
The rest of the assets are all CC0 licensed, from Poly Haven, a very good website where you can download free, public-domain models, textures and HDR images.