Explicit Multi-GPU with DirectX 12 – Frame Pipelining, a New Alternative
This is the second part of the blog post about explicit multi-GPU programming with DirectX 12. In this part, I’ll describe frame pipelining - a new way of utilizing multiple GPUs that was not possible before DirectX 12. I’ll first explain pipelining in general and then go through a case study.
Frame pipelining - A new possibility
Explicit control can help solve some problems, but it doesn’t increase PCIe bus transfer capacity. Too many inter-frame dependencies can kill performance regardless of whether AFR is done implicitly by the driver or explicitly by the application. But with the ability to explicitly control multi-GPU behavior, you are not limited to AFR. You can think outside the box.
A classic alternative to AFR has been Split Frame Rendering (SFR), where the screen is divided into regions that are drawn by different GPUs. Naturally, you can now do this with DirectX 12. However, depending on your rendering process, SFR may cause more problems than it solves. Geometry has to be replicated on all GPUs, and rendering techniques that may sample across the whole screen, like screen space reflections, can be problematic.
Pipelined rendering of frames is a different way to break the problematic dependency chain in AFR. A pipeline can be formed from multiple GPUs, or from the engines of multiple GPUs. The first GPU begins rendering the frame. Then, at a predefined point in the rendering process, the copy engine takes the intermediate results (a set of textures in practice) and copies them to the next GPU for further processing. Finally, the completed frame is presented on screen.
The key idea in pipelining is that there are no back-and-forth dependencies between the physical GPUs. Crucially, in the example above nothing produced by the second GPU is needed by the first. As with AFR, the bandwidth of the PCIe bus unavoidably limits what can be transferred, but because the dependencies are simpler, the transfers are easier to manage without hurting frame rate. Notably, temporal techniques can be used without penalty, with one exception: a GPU at the beginning of the pipeline cannot use something produced by a GPU later in the pipeline.
Pipelining can also be used to do additional work with multiple GPUs instead of doing the same work faster. It can be used to increase image quality or simulation fidelity. You could, for example, update your global illumination data with a secondary GPU.
The challenge in implementing pipelining is finding a good point at which to split the rendering process. There cannot be too much data to copy. On a mainstream system supporting two GPUs in PCIe 3.0 x8 mode, the theoretical maximum transfer rate in one direction is 7.875 GB/s. In practice, mainly due to lower-level protocol overhead, you can expect effective transfer speeds for application data over PCIe to be around 80% of the theoretical maximum, which means that transferring, for example, 64 MB takes around 10 ms. To reduce the amount of data that needs to be transferred, you can also recompute some things on both GPUs, just as you can to reduce dependencies in an AFR implementation.
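The transfer-time estimate above is easy to reproduce. The following is a minimal sketch, not a measurement tool: the function name is made up for illustration, the 0.8 efficiency default simply encodes the 80% rule of thumb from the text, and 64 MB is read as 64 × 2^20 bytes (reading it as 64 × 10^6 bytes changes the result by about half a millisecond).

```python
# Theoretical one-direction rate for PCIe 3.0 x8, in GB/s (10^9 bytes per second).
PCIE3_X8_GBPS = 7.875


def transfer_time_ms(size_bytes, theoretical_gbps=PCIE3_X8_GBPS, efficiency=0.8):
    """Estimate how long copying size_bytes over PCIe takes, in milliseconds.

    efficiency is the assumed fraction of the theoretical rate that is
    available to application data (0.8 is the rule of thumb from the text).
    """
    effective_bytes_per_s = theoretical_gbps * 1e9 * efficiency
    return size_bytes / effective_bytes_per_s * 1000.0


if __name__ == "__main__":
    # 64 MB read as 64 * 2^20 bytes.
    print(f"64 MB: {transfer_time_ms(64 * 2**20):.1f} ms")
```

Both readings of "64 MB" land between 10 and 11 ms, so a budget of roughly 10 ms per 64 MB is a reasonable planning figure on such a system.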
As the distribution of workload over the steps of the rendering process usually depends on what is visible to the camera, the workload cannot be perfectly balanced between the GPUs in practice. The cost of rendering passes that process only visible objects can naturally vary a lot as the camera moves. Fortunately, there are often also full-screen passes with fairly constant cost that can be distributed between the GPUs to achieve a reasonable and fairly reliable balance. The better the balance, the more performance you get from the multiple GPUs. Achieving good balance matters most when the overall workload is high.
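To see why balance matters, it can help to model the pipeline in steady state. The sketch below is a deliberate simplification: it assumes every stage (first GPU, copy engine, second GPU) works on a different frame at the same time and that per-stage costs are constant, which the paragraph above notes is not true in practice. The stage names and millisecond values are hypothetical.

```python
def pipeline_stats(stage_ms):
    """Return (fps, idle_ms) for a frame pipeline in steady state.

    stage_ms maps a stage name to its per-frame cost in milliseconds.
    The slowest stage sets the frame rate; every other stage idles for
    the difference each frame.
    """
    bottleneck = max(stage_ms.values())
    fps = 1000.0 / bottleneck
    idle_ms = {name: bottleneck - cost for name, cost in stage_ms.items()}
    return fps, idle_ms


if __name__ == "__main__":
    # Hypothetical costs: the same 45 ms of GPU work, split two ways.
    unbalanced = {"gpu0": 20.0, "copy": 10.0, "gpu1": 25.0}
    balanced = {"gpu0": 22.5, "copy": 10.0, "gpu1": 22.5}
    for label, stages in (("unbalanced", unbalanced), ("balanced", balanced)):
        fps, idle = pipeline_stats(stages)
        print(label, f"{fps:.1f} fps", idle)
```

In this model, moving 2.5 ms of work from the second GPU to the first raises the frame rate from 40 fps to about 44 fps without removing any work, which is the effect a fairly constant full-screen pass can be used for.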
Frame pipelining case study
Image 12. A view from the Microsoft DirectX 12 miniengine used in the case study
As a proof of concept, I did a case study with the publicly available Microsoft DirectX 12 miniengine. It contains a number of rendering passes to experiment with. As a stress test, I used 4K (3840 x 2160) screen resolution and a 4096 x 4096 sun shadow map, since increasing resolution quickly grows texture sizes. I also crudely multiplied the geometry in the test scene to get enough rendering workload. I ended up with the following pipelining solution. The first GPU generates the pre-depth, linear depth, SSAO and sun shadow map textures. The second GPU continues with the primary pass, temporal AA, particles and bloom. The best pipelining solution for a given renderer depends entirely on the passes it does and the dependencies between them. In my solution, the total amount of data to copy from the first GPU to the second was 87.3 MB, including the following textures:
| Texture | Format | Resolution | Size |
|---|---|---|---|
| Linear Depth | R16_FLOAT | 3840x2160 | 15.8 MB |
| Sun Shadow Map | D16_UNORM | 4096x4096 | 32 MB |
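The sizes in the table follow directly from resolution and texel size. Here is a small sketch of that arithmetic, assuming tightly packed single-mip textures with no row-pitch padding and "MB" meaning 2^20 bytes. The lookup table and function name are illustrative, and only the two formats listed above are covered.

```python
# Bytes per texel for the two formats in the table (both are 16-bit).
BYTES_PER_TEXEL = {"R16_FLOAT": 2, "D16_UNORM": 2}


def texture_size_mib(width, height, fmt):
    """Size of a tightly packed, single-mip 2D texture in MiB (2^20 bytes)."""
    return width * height * BYTES_PER_TEXEL[fmt] / 2**20


if __name__ == "__main__":
    linear_depth = texture_size_mib(3840, 2160, "R16_FLOAT")
    sun_shadow = texture_size_mib(4096, 4096, "D16_UNORM")
    print(f"Linear Depth:   {linear_depth:.1f} MB")
    print(f"Sun Shadow Map: {sun_shadow:.1f} MB")
    print(f"Listed so far:  {linear_depth + sun_shadow:.1f} MB of 87.3 MB")
```

The two listed textures account for about 47.8 MB of the 87.3 MB total.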
My test system runs two GTX 980 Ti GPUs in PCIe 3.0 x8 mode. Copying the 87.3 MB took 14.6 ms, which is in line with the expected transfer speed. I measured the following frame rates:
Single GPU: 22 fps
Two GPUs: 31 fps
Two GPUs using copy engine: 37 fps
That’s about a 70% increase from a single GPU to two GPUs using the copy engine. The copying step taking 14.6 ms limits the frame rate to about 1000 / 14.6 = 68 fps on the test system. So it seems that pipelining can achieve relatively good multi-GPU performance, at least with the case study renderer. A GPUView capture is probably the best way to show how the work is distributed between the GPUs during each frame.
Image 13. GPUView capture from the case study. The red circles mark idle time caused by imbalance in work between the GPUs. The copy engine work is circled in green. The dotted green line marks the processing flow of one frame.
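The numbers quoted in this section can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported figures, assuming "MB" means 2^20 bytes and "GB/s" means 10^9 bytes per second; the helper names are illustrative.

```python
MIB = 2**20  # bytes per "MB" as used for the texture sizes


def effective_rate_gbps(size_mib, time_ms):
    """Achieved copy rate in GB/s (10^9 bytes per second)."""
    return size_mib * MIB / (time_ms / 1000.0) / 1e9


def copy_bound_fps(copy_ms):
    """Upper bound on frame rate when one copy per frame takes copy_ms."""
    return 1000.0 / copy_ms


def speedup_percent(baseline_fps, new_fps):
    """Frame rate increase relative to the baseline, in percent."""
    return (new_fps / baseline_fps - 1.0) * 100.0


if __name__ == "__main__":
    rate = effective_rate_gbps(87.3, 14.6)
    print(f"copy rate: {rate:.2f} GB/s, {rate / 7.875:.0%} of 7.875 GB/s")
    print(f"copy-bound frame rate: {copy_bound_fps(14.6):.1f} fps")
    print(f"two GPUs: +{speedup_percent(22, 31):.0f}%")
    print(f"two GPUs using copy engine: +{speedup_percent(22, 37):.0f}%")
```

The measured copy works out to roughly 6.3 GB/s, close to 80% of the theoretical 7.875 GB/s, which matches the rule of thumb given earlier.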
While the FPS increase is reasonable, the copying step increases latency compared with doing the same rendering on only one GPU at a lower frame rate. So the latency problem in frame pipelining is similar to what it is with AFR. You can, however, potentially do something about it. By dividing the work into smaller command lists and taking more fine-grained control of the dependencies between them, you may be able to start processing earlier on the pipeline node that presents the frames on screen. While doing this, you also have to remember that too many command lists that are too small can cause problems of their own. Signaling and waiting on synchronization fences are command queue operations and can limit batch submission of command lists. In the case study, the copy latency could be almost entirely hidden by dividing the work into smaller command lists. The GPUView capture below shows the more complex distribution of work.
Image 14. GPUView capture from the case study when work is divided into smaller command lists in order to hide latency. The dotted green lines mark dependencies between the command lists.
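The trade-off between hiding copy latency and paying for too many small command lists can be illustrated with a toy model. This sketch makes strong assumptions that a real renderer will not satisfy exactly: each stage's work divides into equal chunks, chunk i of a stage depends only on chunk i of the previous stage, and every chunk pays a fixed submission and fence cost (`overhead_ms`). The function, its parameters, and the 20 ms GPU costs are hypothetical; only the 14.6 ms copy time comes from the case study.

```python
def frame_latency_ms(stage_ms, chunks, overhead_ms=0.0):
    """Latency of one frame through consecutive stages split into chunks.

    stage_ms is a list of per-frame stage costs in pipeline order.
    A chunk starts on a stage once the same chunk has finished on the
    previous stage and the stage has finished its previous chunk.
    """
    ready = [0.0] * chunks  # time at which each chunk's input is available
    for total in stage_ms:
        per_chunk = total / chunks + overhead_ms
        clock = 0.0
        finished = []
        for chunk_ready in ready:
            clock = max(clock, chunk_ready) + per_chunk
            finished.append(clock)
        ready = finished
    return ready[-1]


if __name__ == "__main__":
    stages = [20.0, 14.6, 20.0]  # first GPU, copy, second GPU (GPU costs hypothetical)
    for chunks in (1, 2, 4, 16, 64):
        latency = frame_latency_ms(stages, chunks, overhead_ms=0.5)
        print(f"{chunks:3d} chunks: {latency:.1f} ms")
```

With one chunk the latency is simply the sum of the stages. Splitting lets the copy and the second GPU start early, but once the per-chunk overhead dominates, the latency grows again, which is the "too many, too small" problem described above.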
That concludes part 2 of this blog post. I hope you now have a better understanding of the new possibilities available. Feel free to go ahead and try new ideas to exploit multiple GPUs under DX12.