Hardware-Rooted AI Security That Won’t Slow You Down

NVIDIA Confidential Computing delivers security at up to 98% of the performance of solutions that don’t enable CC

AI has transformed how organizations operate, driving gains in productivity and innovation. However, AI adoption can be impeded by concerns about data privacy, data sovereignty, and how to secure data while it is in use, that is, during inference and other interaction with AI models. NVIDIA Confidential Computing (CC) is engineered as a secure, performant way to scale any model in the era of agentic AI.

CC protects enterprise data, proprietary model weights, and the model itself during active inference. This post provides an overview of CC and presents benchmarks showing that its inference performance is close to (up to 98% of) that of solutions that don’t enable CC.

Data, code, and model integrity

CC provides a security layer that spans silicon, interconnect, and system software. Here’s how it works:

Hardware root of trust

NVIDIA Blackwell GPUs, including the NVIDIA RTX PRO 6000, HGX B200, and HGX B300, are engineered with CC embedded in the hardware. The HGX B200 and HGX B300 GPUs support confidential computing across multiple GPUs (up to 8) with NVIDIA NVLink encryption. At the silicon level, the GPU maintains a private signing key that is fused at the time of manufacturing and never exposed to software, firmware, or the host system. This key is the foundation of the attestation chain.

Attestation: Verification before execution

Before a confidential workload receives any secrets, it undergoes remote attestation. The NVIDIA Remote Attestation Service (NRAS) verifies a signed evidence bundle—the GPU’s hardware report combined with CPU TEE measurements (AMD SEV-SNP or Intel TDX)—against a known-good reference integrity manifest (RIM).

Once the Confidential VM (CVM) is verified to be in an unmodified state, secrets such as model decryption keys can be deployed into the CVM. The attestation handshake is typically a one-time startup event. Once the workload is running, attestation does not add latency to individual inference requests.

Figure 2. Attestation services remotely validate the identity, configuration, and integrity of Trusted Execution Environments and issue cryptographic proof
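
The verify-then-release flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NRAS API: the `Evidence` type, the `sign`, `verify_evidence`, and `release_key` helpers, and the measurement names are all hypothetical, and an HMAC over a shared demo key stands in for the asymmetric signature that a real GPU produces with its fused private key.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence bundle: measurements plus a signature over them."""
    measurements: dict  # component name -> hex digest
    signature: bytes


def _payload(measurements):
    # Canonical byte encoding so signer and verifier hash the same bytes.
    return "\n".join(f"{k}={v}" for k, v in sorted(measurements.items())).encode()


def sign(measurements, key):
    # ASSUMPTION: HMAC-SHA256 is a stand-in for the hardware signature.
    return hmac.new(key, _payload(measurements), hashlib.sha256).digest()


def verify_evidence(evidence, reference, key):
    """Accept only if the signature is valid AND measurements match the reference."""
    expected = sign(evidence.measurements, key)
    if not hmac.compare_digest(expected, evidence.signature):
        return False
    return evidence.measurements == reference


def release_key(evidence, reference, key, model_key):
    """Hand out the model decryption key only after successful verification."""
    if not verify_evidence(evidence, reference, key):
        return None
    return model_key
```

The point of the sketch is the ordering: the secret is returned only on the success path, so a workload whose measurements differ from the reference manifest never receives the model key.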

Optimizing AI inference performance in Confidential Computing

The impact of CC on AI inference performance on Blackwell GPUs can come from two areas:

  1. Secure work submission latency: For inference, this is often the larger factor. Because of the added overhead from encryption and kernel launches, smaller units of work are affected more. Increasing the amount of work performed per GPU work launch reduces the relative impact of the secure launch overhead.
  2. Reduced CPU-to-GPU (host-to-device) bandwidth: If a workload depends heavily on transferring inputs to the GPU, performance depends on whether the bandwidth required to keep the GPU fully utilized exceeds the encrypted transfer bandwidth available in CC mode.
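
Both effects can be reasoned about with a simple back-of-the-envelope model. The sketch below is illustrative only: the function names are hypothetical, and the millisecond values in the loop are made-up inputs, not measurements from this post.

```python
def launch_overhead_fraction(work_ms, extra_launch_ms):
    """Share of total time spent on the added secure-launch cost.

    work_ms: useful GPU work performed per launch
    extra_launch_ms: additional per-launch cost in CC mode (hypothetical value)
    """
    return extra_launch_ms / (work_ms + extra_launch_ms)


def is_transfer_bound(required_gbps, available_gbps):
    """True when the workload needs more host-to-device bandwidth than
    the encrypted path can supply, so transfers limit GPU utilization."""
    return required_gbps > available_gbps


# The same fixed per-launch cost matters less as each launch does more work.
for work_ms in (0.05, 0.5, 5.0):
    frac = launch_overhead_fraction(work_ms, extra_launch_ms=0.01)
    print(f"{work_ms:>5} ms of work per launch -> {frac:.1%} overhead")
```

This is why batching more work into each launch, or replaying many kernels as one CUDA graph, reduces the visible cost of CC mode: the fixed overhead is amortized over a larger unit of work.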

Several optimizations improve inference performance with CC, including:

  • CC-safe autotuner timing: FlashInfer replaces event timers in CC mode with the GPU global timer register, allowing autotuners to accurately compare kernel candidates and select the fastest implementation for each shape.
  • Async D2H copy worker: SGLang moves per-step token readback off the scheduler’s critical path. This helps restore compute/copy overlap because CC can otherwise make many host-to-device and device-to-host copies effectively synchronous during cudaMemcpyAsync.
  • Piecewise CUDA graph support: SGLang adds CUDA graph replay for prefill and mixed batches, reducing kernel launch overhead that is amplified in CC mode.
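
The idea behind an asynchronous readback worker can be shown with a generic producer/consumer pattern. The sketch below is not SGLang's implementation: `AsyncReadbackWorker` and `slow_copy` are hypothetical names, and a short `time.sleep` stands in for a blocking device-to-host copy.

```python
import queue
import threading
import time


class AsyncReadbackWorker:
    """Runs blocking readbacks on a background thread (illustrative only)."""

    def __init__(self, read_fn):
        self._read_fn = read_fn
        self._jobs = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: no more work
                break
            step, handle = job
            self._results.put((step, self._read_fn(handle)))

    def submit(self, step, handle):
        # Returns immediately; the caller does not wait for the copy.
        self._jobs.put((step, handle))

    def close(self):
        # Drain remaining jobs, stop the thread, and collect results in order.
        self._jobs.put(None)
        self._thread.join()
        out = []
        while not self._results.empty():
            out.append(self._results.get())
        return out


def slow_copy(handle):
    time.sleep(0.01)  # stand-in for a blocking device-to-host copy
    return handle * 2


worker = AsyncReadbackWorker(slow_copy)
for step in range(4):
    worker.submit(step, step)  # the scheduling loop moves on right away
results = worker.close()
```

Because a single worker consumes a FIFO queue, results come back in submission order while the submitting loop never blocks on an individual copy.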

NVIDIA continues to work with the upstream inference framework communities to ensure these frameworks are optimized for performance.

We measured the inference performance of CC across several key metrics. The results, test setup, and measurement details follow.

Benchmark results

Across all workload configurations tested, enabling CC mode produced minimal throughput and time per output token overhead during steady-state inference.

The following table summarizes the CC throughput and TPOT overhead on Blackwell Ultra (HGX B300) for the model Qwen/Qwen3.5-397B-A17B-FP8.

Relative Performance of Confidential Computing

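| Concurrency | ISL/OSL = 1024 / 1024: Throughput/GPU (tok/s), Δ% vs OFF | ISL/OSL = 1024 / 1024: Median TPOT (ms), Δ% vs OFF | ISL/OSL = 8192 / 1024: Throughput/GPU (tok/s), Δ% vs OFF | ISL/OSL = 8192 / 1024: Median TPOT (ms), Δ% vs OFF |
| --- | --- | --- | --- | --- |
| 4 | -2.0% | -1.6% | -3.5% | -3.6% |
| 8 | -2.6% | -2.4% | -2.8% | -2.9% |
| 16 | -5.3% | -4.9% | -2.8% | -3.0% |
| 32 | -6.3% | -7.8% | -1.0% | -0.9% |
| 64 | -6.2% | -6.8% | -2.3% | -2.4% |
| 128 | -7.5% | -8.1% | -3.5% | -3.5% |
| 256 | -4.6% | -4.1% | -3.6% | -3.7% |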
Table 1. Relative performance impact of enabling NVIDIA Confidential Computing 
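
To read the throughput deltas as "percent of CC-off performance," add each Δ% to 100. The short sketch below does this for the throughput columns of Table 1; the values are copied from the table, while the variable and function names are ours.

```python
CONCURRENCY = [4, 8, 16, 32, 64, 128, 256]

# Throughput/GPU deltas (Δ% vs OFF) from Table 1, in concurrency order.
THROUGHPUT_DELTA_PCT = {
    "1024/1024": [-2.0, -2.6, -5.3, -6.3, -6.2, -7.5, -4.6],
    "8192/1024": [-3.5, -2.8, -2.8, -1.0, -2.3, -3.5, -3.6],
}


def relative_performance(delta_pct):
    """Convert a Δ% vs OFF value into percent of CC-off throughput."""
    return round(100.0 + delta_pct, 1)


def summarize(deltas):
    """Return (worst, best) relative throughput across the concurrency sweep."""
    rel = [relative_performance(d) for d in deltas]
    return min(rel), max(rel)


for shape, deltas in THROUGHPUT_DELTA_PCT.items():
    lo, hi = summarize(deltas)
    print(f"ISL/OSL {shape}: {lo}% to {hi}% of CC-off throughput")
```

For the values in Table 1, this gives 92.5% to 98.0% of CC-off throughput at ISL/OSL 1024/1024 and 96.4% to 99.0% at ISL/OSL 8192/1024.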

Test Setup

Benchmark: Qwen 3.5 397B-A17B model at FP8 precision
Environment: Virtual Machine with GPU passthrough
Baseline: Confidential Computing Off
Experiment: Confidential Computing On

All other variables held constant. 

Hardware Configurations

HGX B300 with Blackwell Ultra. 

Software Stack

| Component | Version / Detail |
| --- | --- |
| Platform | Intel TDX |
| Host OS | Ubuntu 25.10 |
| Host Kernel | 6.17.0-20-generic |
| Guest OS | Ubuntu 24.04.4 LTS |
| Guest Kernel | 6.8.0-124-generic |
| Guest vCPUs | 256 |
| Guest NUMA | 2 nodes |
| NVIDIA Driver | 595.71.05 |
| VBIOS | FW 1.4.x [97.10.64.00.0C] |
| GPU Power Limit | 1100.00 |
| CUDA | 13.2 |
| SGLang | docker.io/lmsysorg/sglang:v0.5.12-cu130; PRs: 28251 (SGLang) and 3638 (FlashInfer) |
| NCCL | v2.28.9-1 |
| OpenSSL | 3.6.0 |
| Orchestration | Docker Container + NVIDIA Container Toolkit |
Table 2. Software configuration for test setup

Note: Please follow the CPU power and vCPU pinning configuration described in this document. 

Workload Parameters

Each configuration was tested across a range of conditions representative of real enterprise inference workloads:

Input/output token lengths: 8192/1024, 1024/1024
Batch sizes: 4, 8, 16, 32, 64, 128 and 256 concurrent requests. 
Inference framework (Mode): SGLang (Server)
Baseline: Without --enable-symm-mem
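
The parameters above define a small test matrix, and each point is run with CC off and CC on. The sketch below only enumerates that matrix; it does not launch any benchmark, and the dictionary keys are hypothetical labels, not SGLang or benchmark-tool flags.

```python
from itertools import product

SHAPES = [(8192, 1024), (1024, 1024)]       # (input tokens, output tokens)
CONCURRENCY = [4, 8, 16, 32, 64, 128, 256]  # concurrent requests
CC_MODES = ["off", "on"]                    # baseline vs. experiment


def build_sweep():
    """Enumerate every (shape, concurrency, CC mode) combination."""
    return [
        {"isl": isl, "osl": osl, "concurrency": c, "cc": cc}
        for (isl, osl), c, cc in product(SHAPES, CONCURRENCY, CC_MODES)
    ]


runs = build_sweep()
print(len(runs))  # 2 shapes x 7 concurrency levels x 2 CC modes
```

Pairing every configuration with both CC modes is what lets each row of the results be reported as a delta against the CC-off baseline.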

Metrics Collected

Output Throughput per GPU (tokens/sec/gpu)
Median Time to First Token (TTFT) — latency from request submission to first token generated, in ms
Median Time Per Output Token (TPOT) — per-token generation latency in steady-state streaming, in ms
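
These metrics can be derived from per-request timestamps. The sketch below uses one common set of definitions; the benchmarking tool used for this post may compute them differently, and the function names and example timestamps are hypothetical.

```python
from statistics import median


def ttft_ms(submit_s, token_times_s):
    """Time to first token: request submission to first generated token."""
    return (token_times_s[0] - submit_s) * 1000.0


def tpot_ms(token_times_s):
    """Time per output token: mean gap between tokens after the first."""
    n = len(token_times_s)
    if n < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (token_times_s[-1] - token_times_s[0]) * 1000.0 / (n - 1)


def throughput_per_gpu(total_output_tokens, wall_s, num_gpus):
    """Output tokens per second, normalized by GPU count."""
    return total_output_tokens / wall_s / num_gpus


# Hypothetical traces: (submit time, token arrival times), in seconds.
traces = [
    (0.0, [0.5, 0.75, 1.0]),
    (0.0, [1.0, 1.5, 2.0]),
    (0.0, [0.25, 0.5, 0.75]),
]
median_ttft = median(ttft_ms(s, t) for s, t in traces)
median_tpot = median(tpot_ms(t) for _, t in traces)
```

Reporting medians rather than means keeps a few slow requests from dominating the comparison between CC-on and CC-off runs.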

Path forward

Hardware-level security with CC protects sensitive AI workloads while preserving the performance that production deployments need.

CC provides a stronger security foundation for production inference workloads with minimal performance overhead. In our evaluation using Qwen 3.5 on SGLang, we observed this across a sweep of concurrency levels, input sequence lengths, and output sequence lengths, indicating that organizations can secure their AI workloads and data and stay compliant with regulations without compromising on performance.

Join NVIDIA and our partners to secure your AI workloads with CC on Blackwell by accessing the resources below.

Resources

NVIDIA Confidential Computing Documentation
NVIDIA Blackwell Architecture Whitepaper
NVIDIA GPU Operator and Container Toolkit
NVIDIA Remote Attestation Service (NRAS)
NIST SP 800-207 Zero Trust Architecture
HIPAA Security Rule (HHS)
GDPR Article 32 — Security of Processing
