Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM

Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often leads to low average GPU utilization, high compute costs, and unpredictable latency.

The problem isn’t just about packing more workloads onto GPUs but about scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance).

This blog post covers:

  • The inference utilization problem: Why traditional scheduling underutilizes GPU resources.
  • How NVIDIA NIM delivers production inference: The role of containerized microservices in standardizing model deployment.
  • NVIDIA Run:ai’s intelligent scheduling strategies: Four key capabilities that improve performance (lower latency, higher throughput per GPU) while increasing GPU utilization and reducing compute costs.
  • Benchmarking results: ~2x GPU utilization improvement with minimal throughput loss, up to ~1.4x higher throughput under heavy concurrency with dynamic fractions, and 44-61x faster first-request latency with GPU memory swap.
  • How to get started: Practical guidance for implementing these strategies with NIM on NVIDIA Run:ai.

The inference utilization problem

GPU utilization determines how many workloads can be run on a given cluster, and at what cost. In practice, most inference deployments leave significant GPU capacity idle as each model is assigned a full GPU “just to be safe” or because naive sharing without memory isolation causes out-of-memory (OOM) conditions and latency spikes under traffic.

Without intelligent orchestration, teams are forced to choose between overprovisioning (waste) and underprovisioning (performance risk).

How NVIDIA NIM delivers production inference

NVIDIA NIM packages optimized inference engines as containerized microservices with:

  • Packaged inference engines: Inference runtimes pre-configured for improved throughput and latency
  • Industry-standard APIs: OpenAI-compatible endpoints for integration
  • Model optimization: Automatic selection of quantization, batching, and acceleration techniques
  • Production-ready containers: Pre-built with dependencies, tested at scale
  • Security and compliance: Enterprise-grade security controls and container signing for deployments
  • Enterprise support: NVIDIA support and maintenance for production deployments

NIM standardizes the deployment layer, but maximizing GPU utilization requires intelligent orchestration. This is where NVIDIA Run:ai’s scheduling capabilities become essential.

How NVIDIA Run:ai unlocks efficient resource management for NVIDIA NIM

Inference utilization is more than just scheduling—it’s about adapting to how workloads behave. With NVIDIA Run:ai, NIM deployments get inference-first prioritization, GPU fractions with full memory isolation, smarter placement based on workload needs, dynamic memory management, and autoscaling (including replica scaling and scale-to-zero). This enables users to follow traffic and give back GPUs when models are idle.

Inference priority protects user-facing workloads

NVIDIA Run:ai automatically assigns inference workloads the highest default priority, ensuring training jobs never preempt them. Why this matters:

  • Inference serves users: Latency spikes and downtime impact the user experience and SLA compliance.
  • Training can tolerate interruption: Model training can checkpoint and resume; inference requests cannot wait.

This automatic priority assignment eliminates manual tuning in most environments. For organizations running mixed workloads, this ensures training jobs flex around inference demands rather than competing with them. GPUs can train when inference load is low, automatically yielding resources when user-facing requests arrive.
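This priority policy can be sketched as a toy admission loop. The following is an illustrative model, not the NVIDIA Run:ai implementation; the priority values and job structure are assumptions chosen for the example.

```python
# Illustrative sketch of priority-based admission: an arriving inference job
# may preempt a running training job, but never the reverse. Priority values
# are arbitrary placeholders, not Run:ai internals.
INFERENCE, TRAINING = 2, 1

def admit(pending, running, free_gpus):
    """Admit `pending` if capacity exists; otherwise preempt the
    lowest-priority running job only if `pending` outranks it."""
    if free_gpus >= pending["gpus"]:
        running.append(pending)
        return "scheduled"
    if running:
        victim = min(running, key=lambda j: j["priority"])
        if pending["priority"] > victim["priority"]:
            running.remove(victim)   # a real system would checkpoint + requeue
            running.append(pending)
            return f"scheduled (preempted {victim['name']})"
    return "queued"

running = [{"name": "train-job", "priority": TRAINING, "gpus": 1}]
# Inference preempts training when the cluster is full...
print(admit({"name": "nim-infer", "priority": INFERENCE, "gpus": 1}, running, free_gpus=0))
# ...but a new training job cannot preempt running inference; it queues.
print(admit({"name": "train-2", "priority": TRAINING, "gpus": 1}, running, free_gpus=0))
```

The asymmetry in the second call is the point: training flexes around inference demand rather than competing with it.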

GPU fractions with bin packing for multiple small models on a GPU

Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need an entire GPU. When used with GPU fractions, NVIDIA Run:ai’s bin packing strategy fills GPUs before allocating new ones, maximizing utilization across the cluster.

How GPU fractions with bin packing work:

  • GPU fractions provide true memory isolation (not soft limits); each model gets a guaranteed memory allocation.
  • Bin packing scores GPUs by current utilization, prioritizing partially used GPUs for new workloads before allocating fresh ones.
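The bin-packing behavior can be sketched as a simple placement function. This is an illustrative model, not the Run:ai scheduler; the GPU bookkeeping and fraction values are assumptions loosely mirroring the three NIM microservices in this post.

```python
# Illustrative sketch (not the Run:ai implementation): bin-packing placement
# that favors partially used GPUs over empty ones.

def place_fraction(gpus, requested_fraction):
    """Pick the GPU whose remaining capacity still fits the request,
    preferring the most-utilized candidate (tightest fit first)."""
    candidates = [g for g in gpus if 1.0 - g["allocated"] >= requested_fraction]
    if not candidates:
        return None  # no existing GPU can host this fraction; allocate a new one
    # Highest current allocation wins: fill partially used GPUs before fresh ones
    best = max(candidates, key=lambda g: g["allocated"])
    best["allocated"] += requested_fraction
    return best["id"]

gpus = [{"id": "gpu-0", "allocated": 0.0}, {"id": "gpu-1", "allocated": 0.0}]
# Fractions loosely modeled on the benchmark workloads (assumed values)
for name, frac in [("mistral-7b", 0.3), ("nano-12b-vl", 0.4), ("nano-30b", 0.65)]:
    print(name, "->", place_fraction(gpus, frac))
# The two small fractions pack onto gpu-0; the 0.65 fraction lands on gpu-1,
# so three models occupy ~1.35 GPUs of capacity instead of three full GPUs.
```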

Benchmarking results:

The approach was tested by simulating a scenario with three NIM models (a 7B LLM, a 12B VLM, and a 30B MoE) on NVIDIA H100 GPUs:

  • Scenario A: Three GPUs with one H100 GPU per NIM (baseline)
  • Scenario B: Three NIM microservices on 1.5 H100 GPUs using NVIDIA Run:ai fractions, keeping NIM configurations and client load patterns constant
Figure 1. Three NIM microservices consolidated from three dedicated H100 GPUs to ~1.5 H100 GPUs using GPU fractions and bin packing, retaining 91–100% of baseline throughput

The benchmarks exercised both short- and long-context prompts; key findings include:

  • Each NIM retained about 91–100% of its single-GPU throughput, with modest increases in time-to-first-token (TTFT) and end-to-end (E2E) latency.
  • Mistral-7B matched its dedicated-GPU throughput at 834 token/s with long-context input (100%).
  • Nemotron-3-Nano-30B retained 95% (582 vs. 614 token/s).
  • Nemotron-Nano-12B-v2-VL retained 91% (658 vs. 723 token/s) at short-context input.

Three NIM microservices that previously required three dedicated H100s were consolidated onto ≈1.5 H100s, freeing the remaining capacity for other workloads.
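The retention percentages follow directly from the reported token/s figures and can be checked with a quick calculation:

```python
# Verify the reported throughput-retention figures
# (fractional-GPU vs. dedicated-GPU token/s values quoted in the text)
pairs = {
    "Nemotron-3-Nano-30B": (582, 614),
    "Nemotron-Nano-12B-v2-VL": (658, 723),
}
for name, (frac_tps, dedicated_tps) in pairs.items():
    print(name, f"{frac_tps / dedicated_tps:.0%}")
# Nemotron-3-Nano-30B -> 95%, Nemotron-Nano-12B-v2-VL -> 91%
```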

Dynamic GPU fractions maintain performance under heavy concurrent requests

Static GPU fractions guarantee memory isolation, but they impose a rigid ceiling that strands capacity. As concurrent requests increase, each NIM’s KV-cache grows dynamically to track active sequences. When that growth hits the fixed fraction boundary, throughput plateaus and latency degrades. This bottleneck forces a difficult trade-off: over-allocate fractions (wasting GPU capacity) or cap concurrency to stay within the fixed memory budget.

NVIDIA Run:ai’s dynamic GPU fractions solve this by replacing fixed allocations with a request/limit model, borrowing Kubernetes resource semantics for GPU memory:

  • Request: The guaranteed minimum fraction, always reserved for the workload.
  • Limit: The burstable upper bound, enabling the NIM to spread into available GPU memory when on-demand KV-cache or compute pressure increases.

When a NIM operates within its request, the unused headroom between the request and limit remains available to co-located workloads. When concurrent traffic spikes occur, the NIM bursts toward its limit, claiming that memory and converting it into active throughput. This transition between request and limit is handled automatically: workloads scale up when they need resources and release them when demand subsides, maximizing total GPU utilization without manual intervention.
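A toy model makes the request/limit semantics concrete. The first-come burst granting below is a simplifying assumption for illustration; the real scheduler’s arbitration is more sophisticated.

```python
# Toy model (assumption: simplified semantics) of request/limit GPU memory
# fractions on one GPU. Each workload is guaranteed its request; bursting
# above the request is granted only from currently unallocated headroom.

def grant(workloads, gpu_capacity=1.0):
    """Return the memory fraction each workload can actually use right now."""
    granted = {w["name"]: w["request"] for w in workloads}   # guaranteed minimums
    free = gpu_capacity - sum(granted.values())              # unreserved headroom
    for w in workloads:                                      # first-come bursting
        burst = min(w["demand"], w["limit"]) - w["request"]  # desired extra memory
        if burst > 0:
            extra = min(burst, free)
            granted[w["name"]] += extra
            free -= extra
    return granted

# Example: a KV-cache spike drives workload "a" from its request toward its limit
workloads = [
    {"name": "a", "request": 0.3, "limit": 0.5, "demand": 0.5},
    {"name": "b", "request": 0.4, "limit": 0.5, "demand": 0.4},
]
print(grant(workloads))  # "a" bursts to ~0.5 while "b" keeps its guaranteed 0.4
```

When the spike subsides ("a"'s demand drops back to its request), re-running `grant` returns the headroom to the pool for co-located workloads.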

Benchmarking results:

Using the same three NIM models and 1.5 H100 GPU footprint from Experiment 1, static fractions were replaced with dynamic fractions to measure performance under increasing concurrency:

  • Mistral-7B NIM (Request: 0.3, Limit: 0.4)
  • Nemotron-Nano-12B-v2-VL NIM (Request: 0.4, Limit: 0.5)
  • Nemotron-3-Nano-30B NIM (Request: 0.65, Limit: 0.75) 
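A quick sanity check of this fraction budget shows why bursting is opportunistic on the ~1.5-GPU footprint: the guaranteed requests fit, but the summed limits exceed it, so not every NIM can reach its limit simultaneously.

```python
# Sum the request/limit fractions from the benchmark configuration above
fractions = {
    "Mistral-7B": (0.30, 0.40),
    "Nemotron-Nano-12B-v2-VL": (0.40, 0.50),
    "Nemotron-3-Nano-30B": (0.65, 0.75),
}
requests = sum(req for req, _ in fractions.values())
limits = sum(lim for _, lim in fractions.values())
print(f"requests={requests:.2f} GPUs, limits={limits:.2f} GPUs")
# requests=1.35 GPUs fit within the 1.5-GPU footprint; limits=1.65 GPUs do not,
# so burst headroom is shared across workloads rather than reserved per NIM.
```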

Scenarios compared:

  • Scenario A (static fractions + bin packing): The fixed-fraction deployment from Experiment 1 (See Figure 1), where each NIM has a hard memory ceiling with full isolation.
  • Scenario B (dynamic fractions + bin packing): Same bin-packed layout on ≈1.5 H100 GPUs, but each NIM uses a request/limit pair instead of a fixed allocation.
Figure 2. Throughput vs. p50 end-to-end latency for Nemotron-3-Nano-30B on H100 GPUs with 2,048 input tokens

In Figures 2, 3, and 4, as concurrency ramped up, static fractions hit a performance wall: throughput stalled and latency spiked because models couldn’t access additional memory for growing KV caches. With dynamic fractions, NIM microservices absorbed the pressure by bursting toward their limits during traffic peaks and releasing memory when the load subsided.

Across all three NVIDIA NIM microservices, dynamic fractions delivered up to 1.4x higher throughput and 1.7x lower latency, scaling cleanly with concurrency. For example:

  • Nemotron-3-Nano-30B sustained 1,025 token/s at 256 concurrent requests with dynamic fractions compared to a static-fraction ceiling of 721 token/s at just four concurrent requests before instability (1.4x).
  • Mistral-7B-Instruct-v0.3 p50 end-to-end latency dropped from 5,235 ms to 3,098 ms at 64 concurrent 2,048-token requests (1.7x). 

The p50 latency curve remains smooth and monotonic rather than spiking or collapsing, confirming that the request/limit headroom accommodates KV-cache growth patterns, improving GPU utilization.

Figure 3. Throughput vs. p50 end-to-end latency for Mistral-7B-Instruct-v0.3 on H100 GPUs with 2,048 input tokens

Key takeaway:

  • Static fractions + bin packing: Predictable traffic, low-to-moderate concurrency, models with stable memory footprints
  • Dynamic GPU fractions + bin packing: Variable traffic, high concurrency, models with significant KV-cache growth
Figure 4. Throughput vs. p50 end-to-end latency for Nemotron-Nano-12B-v2-VL on H100 GPUs with 2,048 input tokens

Dynamic GPU fractions eliminate the performance ceiling of static allocations at high concurrency while maintaining workload density. With static fractions, the KV-cache cannot grow beyond the fixed memory boundary, and the inference engine begins rejecting requests because it lacks the headroom to admit new sequences. With dynamic GPU fractions, a NIM can burst into available headroom on demand, so organizations get both the efficiency of bin packing and the resilience to handle traffic spikes without allocating additional GPUs.

GPU memory swap: Efficiently serving rarely-used models

Organizations serving LLMs face a fundamental trade-off between latency and cost. Scaling an LLM from zero means full container initialization, loading model weights from disk, and allocating GPU memory, a process that can take tens of seconds to minutes. Because this cold-start latency is unacceptable for user-facing applications, most organizations choose to over-provision, keeping multiple replicas always-on with dedicated GPUs even during low-traffic or idle periods.

This guarantees low latency but wastes GPU capacity, paying for hardware that sits idle just to avoid the risk of a cold start. Scale-to-zero (the Kubernetes pattern of shutting down idle replicas completely and restarting them on demand) can free the GPUs, but the cold-start penalty makes it impractical for latency-sensitive inference workloads.

How GPU memory swap works:

With GPU memory swap, idle models are kept in CPU memory, and NVIDIA Run:ai dynamically swaps model weights between CPU and GPU as requests arrive. Only the active model’s weights reside in GPU memory at any moment. When a request targets an idle model, GPU memory swap moves the currently loaded model’s weights to CPU RAM and loads the requested model into GPU memory, keeping it warm for a configurable window. A model never leaves memory entirely; it just moves between GPU and CPU, eliminating container restarts, disk I/O, and cold-start initialization.
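The swap lifecycle can be sketched as a small state machine. This is a simplified single-slot illustration (an assumption for clarity; real deployments track multiple models, fractional GPUs, and warm-window timers).

```python
# Illustrative sketch (assumption: single GPU slot) of weight swapping:
# at most one model's weights are resident on the GPU at a time; an idle
# model's weights are parked in CPU RAM instead of being unloaded entirely.

class SwapManager:
    def __init__(self):
        self.on_gpu = None          # model currently resident in GPU memory
        self.in_cpu_ram = set()     # warm models parked in CPU memory

    def handle_request(self, model):
        if self.on_gpu == model:
            return "warm-hit"                    # weights already on the GPU
        if self.on_gpu is not None:
            self.in_cpu_ram.add(self.on_gpu)     # swap the current model out
        if model in self.in_cpu_ram:
            self.in_cpu_ram.discard(model)
            action = "swap-in"                   # CPU->GPU copy, no cold start
        else:
            action = "cold-start"                # first load only: container + disk I/O
        self.on_gpu = model
        return action

mgr = SwapManager()
print(mgr.handle_request("mistral-7b"))   # cold-start (first ever load)
print(mgr.handle_request("nano-30b"))     # cold-start; mistral-7b parked in CPU RAM
print(mgr.handle_request("mistral-7b"))   # swap-in, far cheaper than a cold start
```

The key property is the third call: once a model has been loaded once, every subsequent activation is a CPU-to-GPU copy rather than a full cold start.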

GPU memory swap works across single-GPU, multi-GPU, and fractional GPU workloads. Previous benchmarking with single-GPU deployments showed up to 66x improvements in TTFT compared to scale-from-zero. This benchmark combined GPU memory swap with NIM deployments on fractional GPUs to test whether the same latency benefits hold when models share hardware through bin packing under memory constraints.

Benchmarking results:

First-request latency was compared between GPU memory swap and scale-from-zero for the same three NIM deployments:

  • Scenario A (scale-from-zero): Each NIM cold-starts from scratch on a dedicated H100 GPU when traffic arrives (three GPUs in total).
  • Scenario B (GPU memory swap): The three NVIDIA NIM microservices share 1.5 H100 GPUs (with the same fractions from previous experiments), with swap-in/swap-out between GPU and CPU memory.
Figure 5. GPU memory swap vs. scale-from-zero TTFT on H100 GPUs with 128-token prompts
Figure 6. GPU memory swap vs. scale-from-zero TTFT on H100 GPUs with longer 2048-token prompts

With scale-from-zero, infrequently accessed NIM microservices suffer high first-request latency due to full cold starts. With GPU memory swap, first-request latency stays acceptable, and subsequent requests see warm TTFT. All three NIM microservices run on half of the GPUs, freeing up the remaining capacity for high-traffic or other workloads. 

At 128-token input, cold-start TTFT ranged from 75.3 s (Mistral-7B) to 92.7 s (Nemotron-3-Nano-30B), while GPU memory swap reduced these to 1.23–1.61 s, a 55–61x improvement. At 2,048-token input, cold-start TTFT of 158.3–180.2 s dropped to 3.52–4.02 s with swap, a consistent ~44x reduction.
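The speedups follow directly from the measured TTFT values; the two short-context endpoints quoted above land within the reported 55–61x band:

```python
# Reproduce the short-context speedups from the quoted TTFT measurements
# (cold-start seconds vs. GPU-memory-swap seconds)
cases = {
    "Mistral-7B @ 128 tokens": (75.3, 1.23),
    "Nemotron-3-Nano-30B @ 128 tokens": (92.7, 1.61),
}
for name, (cold_s, swap_s) in cases.items():
    print(name, f"{cold_s / swap_s:.0f}x")
# Mistral-7B -> 61x, Nemotron-3-Nano-30B -> 58x
```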

Key takeaway: GPU memory swap delivers 44-61x faster TTFT than scale-from-zero while using fewer resources when combined with GPU fractions, eliminating the cold-start penalty for infrequently accessed models, whether deployed on dedicated or fractional GPUs.

Get started with NVIDIA Run:ai and NVIDIA NIM

Check out this guide to get started with deploying NVIDIA NIM as a native inference workload on NVIDIA Run:ai. Watch this webinar to see how teams manage growing AI workloads with intelligent scheduling, fine-grained GPU controls, Kubernetes-native traffic balancing, and autoscaling—while new platform updates improve access control, endpoint management, and visibility. 
