An increasingly common design pattern for autonomous vehicles (AVs), robotics, and spatial AI systems is bird’s-eye-view (BEV) perception. BEV models project multicamera image features into a shared top-down grid, providing downstream perception and planning modules with a common spatial layout for reasoning about lanes, vehicles, pedestrians, and free space.
A key operation in this pipeline is BEV pooling, which gathers image features, weights them with depth information, and scatter-reduces them into BEV grid cells. For developers, the practical value of BEV perception is that it converts many camera-specific views into one spatially consistent representation of the scene. Instead of reasoning separately over each camera image, downstream modules can operate on a unified top-down feature map aligned to the world around the vehicle or robot. BEV pooling is the step that makes this representation usable in real time: it turns depth-aware image features into a compact BEV tensor that can feed detection, occupancy, trajectory prediction, mapping, and planning workloads.
Conceptually, this is simple. In deployment, however, BEV pooling can become a latency bottleneck because it combines irregular memory access, repeated index reads, scatter-reduce behavior, and GPU-specific cache effects.
This post uses BEVPoolV3 as a case study in optimizing BEV pooling and other gather- or scatter-heavy operators for NVIDIA GPUs. It walks through a practical workflow you can apply to your workloads: classify the memory regime, remove redundant scatter traffic, map the kernel implementation to the target GPU, and validate the active bottleneck with NVIDIA Nsight Compute. The performance results show why this workflow matters: the same BEV pooling operator can require different optimization strategies depending on whether the working set is DRAM-bound or largely L2-resident.
How does BEVPoolV3 reduce BEV pooling latency on NVIDIA RTX GPUs?
Prior work has already made important progress. BEVPoolV2, referred to as V2 in this post, introduced an efficient deployment-oriented BEV pooling formulation for BEVDet-style models. CUDA-BEVFusion includes bevpool_half_pack10_kernel, referred to here as V2+DO, which uses depth-outer traversal to remove much of the V2 repeated tile-outer index loading.
BEVPoolV3 continues this optimization direction with four additional changes: reduced duplicate depth loads, a five-array INT32 scatter map, precomputed indices that remove runtime integer division, and interval-owned output writes.
The evaluation compares two NVIDIA RTX GPUs that represent different memory regimes: NVIDIA RTX A6000, an NVIDIA Ampere SM86 GPU with a 6 MB L2 cache and no native FP8 ISA, and NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, an NVIDIA Blackwell SM120 GPU with a 128 MB L2 cache and native FP8 support. The canonical config used here is derived from real nuScenes samples and contains about 209K scatter points, 80 feature channels, and a 49 MB BEV pooling working set. That working set exceeds RTX A6000 L2 cache but fits inside RTX PRO 6000 Blackwell Max-Q L2 cache, making RTX A6000 DRAM-bound and RTX PRO 6000 Blackwell Max-Q largely L2-resident after the initial fill.

In the canonical config, the V2-style NVIDIA TensorRT plugin path takes 274.0 µs on RTX PRO 6000 Blackwell Max-Q. BEVPoolV3 reduces that to 17.3 µs in FP16 and 16.4 µs in FP8. On RTX A6000, the DRAM-adapted BEVPoolV3 FP16 path reaches 90.0 µs. Beyond the speedup, this post shows a repeatable workflow for optimizing scatter-reduce kernels: classify the working set, remove redundant memory traffic, match the launch shape to the target GPU, and validate the result with Nsight Compute.

Prerequisites
This post discusses CUDA kernel behavior, TensorRT plugin integration, and GPU profiling in the context of BEV pooling. Helpful prerequisites include:
- CUDA kernel concepts such as warp scheduling, atomics, vectorized global loads, and DRAM/L2/L1 cache behavior
- TensorRT plugin integration, especially the IPluginV3 interface
- Nsight Compute profiling for validating memory behavior, occupancy, and instruction-issue bottlenecks
- The BEV-pooling kernel in CUDA-BEVFusion as the prior depth-outer reference implementation
For related background information, see the CUDA C++ Programming Guide, TensorRT plugin documentation, TensorRT samples, and Nsight Compute Profiling Guide.
Classify the memory regime
The first step is to classify whether the BEV-pooling working set fits in L2. In the canonical config, the main arrays total about 49 MB, dominated by feature data and output. That single number determines the memory regime: it is larger than the 6 MB L2 cache on RTX A6000, but smaller than the 128 MB L2 cache on RTX PRO 6000 Blackwell Max-Q.

This fit/no-fit decision changes the optimization target. On RTX A6000, feature gathers and output traffic spill beyond L2, so the small-L2 path prioritizes byte reduction and cache-streaming output stores. On RTX PRO 6000 Blackwell Max-Q, the canonical working set fits in L2, so the large-L2 path shifts toward instruction efficiency, occupancy, precomputed indices, vectorized loads, and FP8 specialization.
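The fit/no-fit check can be written as a first-order calculation. The following is a minimal sketch, not part of BEVPoolV3: the function name `classify_memory_regime` and the regime labels are hypothetical, megabytes are assumed to be decimal, and the comparison ignores any other data that shares L2 at run time.

```python
MB = 1_000_000  # assumption: decimal megabytes; use 2**20 for MiB


def classify_memory_regime(working_set_bytes, l2_bytes):
    """Return 'l2_resident' if the working set fits in L2, else 'dram_bound'."""
    return "l2_resident" if working_set_bytes <= l2_bytes else "dram_bound"


# Canonical working set and L2 capacities quoted in this post.
WORKING_SET = 49 * MB
L2_CAPACITY = {
    "RTX A6000": 6 * MB,
    "RTX PRO 6000 Blackwell Max-Q": 128 * MB,
}

regimes = {
    gpu: classify_memory_regime(WORKING_SET, l2_bytes)
    for gpu, l2_bytes in L2_CAPACITY.items()
}
for gpu, regime in regimes.items():
    print(f"{gpu}: {regime}")
```

Treat the result as a starting hypothesis. The profiler still has to confirm which ceiling is active.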
Remove redundant scatter traffic
The BEV scatter-reduce can be summarized as:
out[ranks_bev[t], c] += depth[ranks_depth[t]] * feat[ranks_feat[t], c];
BEVPoolV2 iterates over channel tiles outside the scatter loop. For C=80 and an 8-channel tile, the same scatter indices are loaded 10 times. That produces roughly 25.1 MB of index traffic for indices that only need 2.51 MB when read once. A depth-outer loop order fixes most of that problem by iterating over each BEV interval first and accumulating all channels for that interval in one pass.
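The index-traffic figures above follow from simple arithmetic. This sketch assumes exactly 209,000 scatter points (the post says about 209K) and three INT32 index values per point; the helper name `index_traffic_bytes` is hypothetical.

```python
import math


def index_traffic_bytes(num_points, channels, tile_c,
                        index_arrays=3, bytes_per_index=4):
    """Estimate scatter-index read traffic for a tile-outer loop order.

    Returns (tile_passes, bytes_if_read_once, bytes_if_reloaded_per_tile).
    """
    tile_passes = math.ceil(channels / tile_c)
    bytes_once = num_points * index_arrays * bytes_per_index
    return tile_passes, bytes_once, tile_passes * bytes_once


passes, once, tile_outer = index_traffic_bytes(
    num_points=209_000, channels=80, tile_c=8)
print(f"tile passes: {passes}")
print(f"indices read once: {once / 1e6:.2f} MB")
print(f"indices reloaded per tile: {tile_outer / 1e6:.2f} MB")
```

With these inputs the sketch gives 10 tile passes, about 2.51 MB when the indices are read once, and about 25.1 MB when they are reloaded for every channel tile.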
BEVPoolV3 extends the depth-outer optimization direction used in CUDA-BEVFusion bevpool_half_pack10_kernel, referred to here as V2+DO. V2+DO is a useful baseline because it already removes the repeated tile-outer index loads in BEVPoolV2 and demonstrates the value of interval-based traversal. BEVPoolV3 keeps that direction and adds four implementation changes that improve portability and performance across GPU memory regimes: reduced duplicate depth loads within each interval; a five-array INT32 scatter map using ranks_depth, ranks_feat, ranks_bev, interval_starts, and interval_lengths; precomputed explicit indices that remove runtime integer division; and interval-owned output writes that avoid atomics relative to the V2-style path.
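To make the five-array scatter map concrete, here is a minimal host-side sketch of how such a map could be prepared ahead of time. It assumes the input is a list of (depth index, feature row, BEV cell) triples and that points are sorted by BEV cell so each interval is contiguous. The function name `build_scatter_map` is hypothetical, and the actual BEVPoolV3 preprocessing may differ.

```python
def build_scatter_map(triples):
    """Build five index arrays from (depth_idx, feat_idx, bev_idx) triples."""
    # Sort by BEV cell so points that reduce into the same cell are contiguous.
    ordered = sorted(triples, key=lambda t: t[2])
    ranks_depth = [t[0] for t in ordered]
    ranks_feat = [t[1] for t in ordered]
    ranks_bev = [t[2] for t in ordered]

    # One interval per run of equal BEV cell indices.
    interval_starts, interval_lengths = [], []
    for i, bev in enumerate(ranks_bev):
        if i == 0 or bev != ranks_bev[i - 1]:
            interval_starts.append(i)
            interval_lengths.append(1)
        else:
            interval_lengths[-1] += 1
    return ranks_depth, ranks_feat, ranks_bev, interval_starts, interval_lengths


triples = [(0, 0, 2), (1, 0, 0), (2, 1, 2), (3, 1, 0), (4, 2, 5)]
rd, rf, rb, starts, lengths = build_scatter_map(triples)
print(rb, starts, lengths)
```

Because the map depends only on camera geometry and the grid definition, it can be built once and reused across frames, which is what lets the kernel read explicit indices instead of deriving them.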

The five-array scatter map is especially important on large-L2 GPUs. Packing (ranks_depth, ranks_feat, ranks_bev) into an int3 array gives a 12-byte record. That layout is inconvenient for aligned memory transactions and does not map cleanly to a 16-byte LDG.128 load. Separate INT32 arrays let adjacent threads merge aligned loads and avoid field coupling. The total logical bytes may look similar, but the instruction stream is much cleaner.
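The alignment argument can be checked with byte-offset arithmetic. This sketch assumes every array starts on a 16-byte boundary; the helper names are hypothetical and it models offsets only, not actual GPU load instructions.

```python
RECORD_BYTES = 12   # packed (ranks_depth, ranks_feat, ranks_bev) as three INT32
INT32_BYTES = 4
VECTOR_BYTES = 16   # width of one 128-bit load


def packed_record_is_vector_aligned(i):
    """True if packed record i starts on a 16-byte boundary."""
    return (i * RECORD_BYTES) % VECTOR_BYTES == 0


def soa_group_is_vector_aligned(i):
    """True if four consecutive INT32 entries starting at element i
    begin on a 16-byte boundary."""
    return (i * INT32_BYTES) % VECTOR_BYTES == 0


# Packed layout: which of the first eight records start 16-byte aligned?
aligned_packed = [i for i in range(8) if packed_record_is_vector_aligned(i)]
# Separate arrays: groups of four INT32 entries starting at 0, 4, 8, ...
aligned_soa_groups = [i for i in range(0, 32, 4) if soa_group_is_vector_aligned(i)]

print(aligned_packed)
print(aligned_soa_groups)
```

In the packed layout only every fourth record begins on a 16-byte boundary, while every group of four entries in a separate INT32 array does, which is why the separate arrays are easier to read with wide aligned loads.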
Implement interval-owned scatter-reduce
In production, BEVPoolV3 uses multiple specialized kernels, but the core implementation idea is easier to understand as a small logic sketch. The scatter map is prepared ahead of time, each BEV interval is assigned to one owner, the owner walks the points in that interval, accumulates the relevant feature channels, and writes the output once.
This structure removes the inner-loop decoding work that appears when the scatter map is packed into a single record. Instead of reconstructing indices at runtime, the kernel reads explicit arrays such as ranks_depth, ranks_feat, ranks_bev, interval_starts, and interval_lengths.
// 1. Use five precomputed scatter arrays.
// 2. Read explicit indices directly, with no runtime index division.
// 3. Let one interval owner accumulate the output cell.
// 4. Load each depth value once per scatter point in the owner loop.
for each interval iv in parallel:
    start = interval_starts[iv]
    length = interval_lengths[iv]
    bev = ranks_bev[start]
    acc[channel_tile] = 0
    for offset in 0 .. length - 1:
        t = start + offset
        d = depth[ranks_depth[t]]
        feat_row = ranks_feat[t]
        for c in channel_tile:
            acc[c] += d * feat[feat_row, c]
    out[bev, channel_tile] = acc
This code sketch captures the common BEVPoolV3 structure: the scatter map is explicit, runtime index decoding is removed, depth is loaded in the interval owner loop, and each output cell is written once after local accumulation.
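The same structure can be run on the CPU as a reference. The following is a minimal Python sketch, not the production kernel: it processes intervals sequentially rather than in parallel, uses a single channel tile that covers all channels, and compares the interval-owned result with the plain per-point scatter-reduce. The function names and the tiny test data are hypothetical.

```python
def pool_naive(depth, feat, ranks_depth, ranks_feat, ranks_bev,
               num_cells, channels):
    """Per-point scatter-reduce: out[bev, c] += depth * feat."""
    out = [[0.0] * channels for _ in range(num_cells)]
    for t in range(len(ranks_bev)):
        d = depth[ranks_depth[t]]
        for c in range(channels):
            out[ranks_bev[t]][c] += d * feat[ranks_feat[t]][c]
    return out


def pool_interval_owned(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                        interval_starts, interval_lengths, num_cells, channels):
    """Each interval owner accumulates locally and writes its cell once."""
    out = [[0.0] * channels for _ in range(num_cells)]
    for start, length in zip(interval_starts, interval_lengths):
        bev = ranks_bev[start]
        acc = [0.0] * channels
        for t in range(start, start + length):
            d = depth[ranks_depth[t]]      # one depth load per scatter point
            row = feat[ranks_feat[t]]
            for c in range(channels):
                acc[c] += d * row[c]
        out[bev] = acc                     # single write, no atomics needed
    return out


depth = [0.5, 0.25, 1.0, 2.0, 0.125]
feat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ranks_depth = [1, 3, 0, 2, 4]
ranks_feat = [0, 1, 0, 1, 2]
ranks_bev = [0, 0, 2, 2, 5]
interval_starts = [0, 2, 4]
interval_lengths = [2, 2, 1]

naive = pool_naive(depth, feat, ranks_depth, ranks_feat, ranks_bev, 6, 2)
owned = pool_interval_owned(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                            interval_starts, interval_lengths, 6, 2)
print(owned)
```

Because each output cell has exactly one owner, the write is a plain store. Cells that receive no points stay at zero in this sketch.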
The production kernels specialize this structure for the target memory regime. On small-L2 GPUs such as RTX A6000, the implementation prioritizes byte reduction, FP16 half2 accumulation, and cache-streaming output stores so the output tensor does not evict useful index data from L2. On large-L2 GPUs such as RTX PRO 6000 Blackwell Max-Q, the implementation first matches a high-occupancy launch envelope, then reduces instruction overhead with precomputed indices, vectorized index loads, and FP8-specialized inner loops where the working set is L2-resident.
The algorithmic invariant stays the same: own the interval, avoid runtime index decoding, accumulate locally, and write once. The architecture-specific work changes how that invariant is implemented, not what the BEV-pooling operator computes.

The absolute latency results on RTX PRO 6000 Blackwell Max-Q show how the large-L2 path behaves across different point counts and channel widths. The same optimization pattern also holds on the RTX A6000 DRAM-bound path when measured as speedup over the V2 FP16 baseline. On RTX A6000, the DRAM-adapted V3 FP16 path reaches speedups of 11x to 22x over V2 across the tested configurations. On RTX PRO 6000 Blackwell Max-Q, V3 FP8 reaches speedups of 11x to 42x over V2, with the largest gains appearing at larger point counts and wider channel configurations.

Deploy and validate the TensorRT plugin
BEVPoolV3 is exposed as a TensorRT IPluginV3 operator. The plugin accepts the five-array scatter map plus depth and feat, then dispatches the appropriate kernel for the GPU class and dtype. The benchmark path used ONNX-to-TensorRT builds and CUDA Graph replay with trtexec.
For validation, compare against an FP64 reference or an existing trusted V2 path. The RTX A6000 DRAM-adapted kernel passed all tested output elements across the six configurations at atol=1e-2, with maximum observed error of 0.0065. On RTX PRO 6000 Blackwell Max-Q, V2 and V3 produced identical outputs for the tested configurations, indicating that the optimized scatter-map and launch changes preserved the numerical behavior of the reference path.
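An element-wise tolerance check of this kind can be sketched in a few lines. The example below is an illustration only: it assumes flattened output tensors stored as Python lists, uses Python floats as the FP64 reference, and simulates a reduced-precision output by rounding the reference through FP16 with the `struct` module's half-precision format. The helper names and sample values are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_abs_error(candidate, reference):
    """Largest absolute element-wise difference."""
    return max(abs(a - b) for a, b in zip(candidate, reference))


def all_within_atol(candidate, reference, atol=1e-2):
    """True if every element is within the absolute tolerance."""
    return all(abs(a - b) <= atol for a, b in zip(candidate, reference))


reference = [0.1, 1.2345, 3.14159, 7.001]       # stand-in FP64 reference values
candidate = [to_fp16(v) for v in reference]     # stand-in for a kernel output

err = max_abs_error(candidate, reference)
passed = all_within_atol(candidate, reference)
print(f"max abs error: {err:.6f}, passed: {passed}")
```

Reporting the maximum observed error alongside the pass/fail result, as the post does, shows how much margin remains under the chosen tolerance.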
Map the algorithm onto the hardware
The four BEVPoolV3 algorithmic changes are portable, but the production kernel must match the active GPU bottleneck. The key decision is whether the BEV-pooling working set fits in L2.
On RTX A6000, the canonical working set exceeds L2, so the kernel is limited by random-gather DRAM traffic. The FP16 path therefore prioritizes byte reduction and cache preservation. Increasing TILE_C from 8 to 16 cuts the C=80 tile passes from 10 to 5, reducing loop overhead and repeated scalar work. Using __half2 accumulation with __hfma2 avoids unnecessary FP16-to-FP32 widening and packing. Cache-streaming output stores prevent the 12.8 MB output tensor from evicting the smaller L2-resident index arrays. After these changes, the RTX A6000 path reaches 90.0 µs in the canonical config, compared with 1,738.0 µs for V2 FP16.
On RTX PRO 6000 Blackwell Max-Q, the canonical working set fits in L2, so the limiting factors shift toward instruction issue, occupancy, and dependency latency. The production kernel first matches the high-occupancy V2+DO-style launch envelope, then removes inner-loop overhead with the five-array scatter map and precomputed indices. This avoids runtime integer division and reduces scatter-map load pressure. In the canonical config, V3 FP16 reaches 17.3 µs versus 37.8 µs for V2+DO FP16, a 2.18x speedup at the same dtype.
The FP8 path specializes further for the large-L2 case. Because feature and output data are served from L2, reducing their dtype can translate into real latency gains. The production FP8 path uses per-channel-count entry points, LDG.64 index packing for C=80, and wider feature loads for C=128 and C=256. More aggressive combinations, such as adding loop unrolling on top of the packed-index path, did not compose cleanly because they increased register pressure and spill traffic.
The precision ladder has a practical endpoint for this workload. We tested an NVFP4 path that stores camera features in E2M1 with per-16-element E4M3 microblock scales while keeping depth and output in FP8. Even with an aggressively optimized implementation that uses __half2 packed accumulators, fused scale-depth coefficients, and a half-precision LUT, the decode overhead makes it notably slower than the FP8 baseline.
Profiling with Nsight Compute shows that the kernel's working set is fully resident in L2 cache, with low DRAM bandwidth utilization and smsp__issue_active well below peak throughput, while the ALU pipeline carries significantly more instructions than the FMA pipeline.
This indicates that this scatter-reduce regime has already captured the available byte-efficiency benefits at FP8. The NVFP4 path adds per-element nibble extraction, value decode, and a per-microblock scale fold, which is inner-loop work that the FP8 path avoids with a single scalar FP8-to-half conversion. The result is a clear workload-placement guideline: NVFP4 remains a strong fit for compute-bound matrix multiplication shapes that run on Tensor Cores through MMA.kind::nvfp4, while FP8 is the practical endpoint of the dtype ladder for L2-resident scatter-reduce workloads.
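To see where the extra inner-loop work comes from, consider a minimal decode sketch for packed 4-bit E2M1 values. It is not the production NVFP4 path: the scale is passed in as a plain float rather than decoded from E4M3, the low-nibble-first packing order is an assumption, and the name `decode_fp4_block` is hypothetical. The lookup table lists the eight E2M1 magnitudes, with bit 3 of the code acting as the sign.

```python
# E2M1 magnitudes for codes 0..7; codes 8..15 are the negated values.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_LUT = E2M1_MAGNITUDES + [-m for m in E2M1_MAGNITUDES]


def decode_fp4_block(packed_bytes, block_scale):
    """Decode two E2M1 values per byte and apply one per-block scale."""
    values = []
    for byte in packed_bytes:
        lo = byte & 0xF            # nibble extraction (assumed: low nibble first)
        hi = (byte >> 4) & 0xF
        values.append(E2M1_LUT[lo] * block_scale)   # value decode + scale fold
        values.append(E2M1_LUT[hi] * block_scale)
    return values


decoded = decode_fp4_block(bytes([0x21, 0xF7]), block_scale=0.5)
print(decoded)
```

Each element needs a mask or shift, a table lookup, and a scale multiply before it can enter the accumulation, whereas an FP8 element needs only one conversion.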
The same analysis applies beyond BEV pooling. For sparse embeddings, voxelization, histograms, segmented reductions, and other gather- or scatter-heavy operators, first classify the memory regime, then use Nsight Compute to determine whether the active ceiling is bandwidth, instruction issue, or occupancy.
Table 1 summarizes RTX PRO 6000 Blackwell Max-Q TensorRT plugin-path latency, reported as 100-iteration median latency.
| Config | Channels (C) | V2 FP16 | V2+DO FP16 | V3 FP16 | V3 FP8 | V3 FP8 speedup over V2 FP16 |
| --- | --- | --- | --- | --- | --- | --- |
| small | 80 | 137.8 µs | 31.5 µs | 12.7 µs | 12.6 µs | 10.94x |
| canonical | 80 | 274.0 µs | 37.8 µs | 17.3 µs | 16.4 µs | 16.71x |
| large | 80 | 749.9 µs | 48.0 µs | 27.3 µs | 24.9 µs | 30.12x |
| xlarge | 80 | 1,675.0 µs | 61.9 µs | 48.0 µs | 39.8 µs | 42.09x |
| wide_c128 | 128 | 457.3 µs | 54.2 µs | 21.4 µs | 14.8 µs | 30.90x |
| wide_c256 | 256 | 880.9 µs | 152.3 µs | 33.4 µs | 22.0 µs | 40.04x |
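The last column of Table 1 is the V2 FP16 latency divided by the V3 FP8 latency. The following sketch recomputes it from the latencies in the table; the dictionary layout and the `speedup` helper are illustrative only.

```python
# (V2 FP16 latency, V3 FP8 latency) in microseconds, copied from Table 1.
LATENCY_US = {
    "small": (137.8, 12.6),
    "canonical": (274.0, 16.4),
    "large": (749.9, 24.9),
    "xlarge": (1675.0, 39.8),
    "wide_c128": (457.3, 14.8),
    "wide_c256": (880.9, 22.0),
}


def speedup(baseline_us, optimized_us):
    """Speedup of the optimized path over the baseline."""
    return baseline_us / optimized_us


speedups = {
    name: round(speedup(v2, v3_fp8), 2)
    for name, (v2, v3_fp8) in LATENCY_US.items()
}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

The recomputed values match the published column to two decimal places, which confirms that the ratio is baseline over optimized and not the inverse.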
Considerations for edge-class platforms
The same analysis can extend to edge-class NVIDIA platforms, including NVIDIA DRIVE AGX Thor. In early edge-oriented experiments, the FP16 BEVPoolV3 path carries over well because the core improvements—removing redundant scatter traffic, avoiding runtime index decoding, and using interval-owned writes—are architecture-independent.
FP8 speedup, however, is not automatic. On edge-class targets, smaller problem sizes, memory hierarchy behavior, register pressure, and FP8 conversion overhead can limit or offset the theoretical dtype bandwidth benefit. This makes FP8 a kernel- and architecture-specific optimization rather than a guaranteed drop-in replacement for FP16.
Get started with BEV pooling optimization
To apply the BEVPoolV3 workflow to your own BEV perception or gather/scatter-heavy workload, start by profiling the operator in isolation. Measure the feature, depth, scatter-index, and output tensor sizes, then compare the total working set with the target GPU L2 cache capacity.
Use NVIDIA Nsight Compute to validate whether the active bottleneck is memory bandwidth, instruction issue, occupancy, or dependency latency. Then choose the optimization strategy that matches the memory regime: byte reduction and cache-preserving stores for DRAM-bound workloads, or occupancy, precomputed indices, vectorized loads, and dtype specialization for L2-resident workloads.
The same approach applies to sparse embeddings, voxelization, histograms, segmented reductions, and other irregular memory-bound kernels. Use the BEVPoolV3 results as a guide for profiling your own operator, selecting the right implementation strategy for the target GPU, and validating the result before deploying through TensorRT. For related resources, see the TensorRT documentation, CUDA C++ Programming Guide, Nsight Compute documentation, and NVIDIA Developer Forums.