
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

Joint benchmarking with Nebius shows that fractional GPUs significantly improve throughput and utilization for production LLM workloads

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is delivered entirely by NVIDIA Run:ai in any environment—cloud, NVIDIA Cloud Partner (NCP), and on-premises.

This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale. 

All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

  • 77% of full-GPU throughput and 86% of full-GPU concurrent user capacity using only a 0.5 GPU fraction, with time to first token (TTFT) under one second
  • Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions
  • Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs
  • Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact
  • Production-ready autoscaling with no latency cliffs or error spikes during scale-out

This benchmarking shows that fractional GPU scheduling is no longer just an optimization technique. It is a foundational capability for running large-scale, multimodel LLM inference efficiently in production.

LLM inference enterprise challenges

Enterprise IT departments operate with a finite, often fixed inventory of GPUs. Deploying an LLM for inference requires a dedicated GPU (or multiple GPUs) to be allocated to a single LLM instance, even when traffic is sporadic. This is necessary because the model weights must be loaded before inference requests arrive, so that token-generation latency stays as low as possible.

As a result, each LLM consumes all the GPUs allocated to it, making it difficult to run more than one model on the same pool of GPUs. In this scenario, enterprise IT must manually maintain the GPU-to-LLM allocation, figure out when and how to scale LLMs as inference demand grows to keep chat-request and token-generation latency in bounds, and cannot repurpose idle GPUs during off-peak hours.

Ideally, enterprises want an elastic environment where GPUs can run multiple LLMs, not just one, without significantly reducing the number of users who can run inference or degrading latency for those users. They also want to scale GPUs up with workload demand and scale them down during off-peak hours so that other workloads can consume the same GPUs.

Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud 

The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters and dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a flexible, production-ready framework for maximizing GPU ROI. 

In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.

Dynamic GPU fractioning

NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads simultaneously. Users specify their memory requirements directly and the scheduler allocates resources on-demand without any preconfiguration. This is particularly impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation. 

Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.
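As an illustration of how a workload might request a fraction with a guaranteed memory Request and a burstable Limit, the sketch below composes a Kubernetes pod manifest in Python. The annotation keys, scheduler name, and image tag are assumptions for illustration only; check the NVIDIA Run:ai documentation for the exact fields supported by your platform version.

```python
# A minimal sketch of requesting a fractional GPU for a NIM pod through
# scheduler-level annotations. The annotation keys below ("gpu-fraction",
# "gpu-memory-request", "gpu-memory-limit") and the scheduler name are
# assumptions for illustration; check the NVIDIA Run:ai documentation for
# the exact fields supported by your platform version.
import json

def fractional_nim_pod(name: str, image: str, fraction: float,
                       mem_request_mib: int, mem_limit_mib: int) -> dict:
    """Build a Pod manifest asking for a GPU fraction with a guaranteed
    memory Request and a burstable Limit."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Fraction of a single GPU to guarantee to this workload.
                "gpu-fraction": str(fraction),
                # Guaranteed minimum GPU memory (Request) in MiB -- assumed key.
                "gpu-memory-request": str(mem_request_mib),
                # Burstable upper bound (Limit) in MiB -- assumed key.
                "gpu-memory-limit": str(mem_limit_mib),
            },
        },
        "spec": {
            # Hand the pod to the fraction-aware scheduler instead of the default.
            "schedulerName": "runai-scheduler",
            "containers": [{"name": "nim", "image": image}],
        },
    }

if __name__ == "__main__":
    # Llama 3.1 8B holds ~16 GB of weights, so half of an 80 GB GPU is ample.
    manifest = fractional_nim_pod(
        name="llama-3-1-8b-half-gpu",
        image="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",  # example image
        fraction=0.5,
        mem_request_mib=20_000,
        mem_limit_mib=40_000,
    )
    print(json.dumps(manifest, indent=2))  # pipe into `kubectl apply -f -`
```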

Intelligent workload scheduling

The NVIDIA Run:ai scheduler acts as the “brain” of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met. 

The scheduler also automatically scales LLMs up or down based on the number of concurrent users running inference and on token latency, according to the SLA criteria set by the admin. These strategies collectively drive higher utilization rates, lower operational complexity, and reduced total cost of ownership (TCO). 
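To make this behavior easier to picture, here is a conceptual Python sketch (not the actual NVIDIA Run:ai algorithm) that prioritizes latency-sensitive inference over batch training and first-fit packs fractional requests onto free GPU capacity.

```python
# A conceptual sketch (not the actual NVIDIA Run:ai algorithm) of how a
# fraction-aware scheduler can favor latency-sensitive inference over batch
# training while bin-packing fractional requests onto free GPU capacity.
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str
    kind: str            # "inference" or "training"
    gpu_fraction: float  # e.g. 0.125, 0.25, 0.5, 1.0

@dataclass
class Gpu:
    name: str
    free: float = 1.0    # remaining fraction of this GPU
    placed: list = field(default_factory=list)

def schedule(pending: list[Workload], gpus: list[Gpu]) -> None:
    # Inference first (latency SLA), then training; larger fractions first
    # within each class so small requests fill the leftover capacity.
    order = sorted(pending,
                   key=lambda w: (w.kind != "inference", -w.gpu_fraction))
    for wl in order:
        # First-fit: place on the first GPU with enough headroom.
        target = next((g for g in gpus if g.free >= wl.gpu_fraction), None)
        if target:
            target.free -= wl.gpu_fraction
            target.placed.append(wl.name)

gpus = [Gpu("gpu-0"), Gpu("gpu-1")]
schedule([Workload("llama-chat", "inference", 0.5),
          Workload("embeddings", "inference", 0.125),
          Workload("finetune-job", "training", 1.0),
          Workload("phi-assistant", "inference", 0.25)], gpus)
for g in gpus:
    print(g.name, g.placed, f"free={g.free:.3f}")
```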

Teams at NVIDIA and Nebius ran benchmarking to discover the impact NVIDIA Run:ai has on running inference at scale for various LLMs. Scale tests measured the number of concurrent users able to run various chat requests, recording TTFT, output throughput (tokens/second generated), and GPU utilization. At NVIDIA, these tests ran on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architectures with NVIDIA H100 NVL GPUs. At Nebius AI Cloud, the tests ran on a cluster built following the HGX-based Enterprise RA for NVIDIA HGX B200 GPUs.

Benchmarking setup

The software stack is based on NVIDIA Enterprise RAs (Figure 1). This includes the NVIDIA AI Enterprise stack to manage GPUs using NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed in a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA.

Infrastructure

Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

Diagram illustrating the NVIDIA Run:ai deployment stack on NVIDIA Enterprise Reference Architecture.
Figure 1. NVIDIA Run:ai deployment on NVIDIA Enterprise Reference Architecture

Model selection

The four models selected span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with very different resource demands. 

Model                 | Number of parameters | Memory requirements | Use case
Llama 3.1 8B Instruct | 8B                   | ~16 GB              | General-purpose chat
Phi-4-Mini            | 3.8B                 | ~8 GB               | Lightweight assistant
Qwen3-14B             | 14B                  | ~28 GB              | Reasoning
Qwen-Embeddings-0.6B  | 0.6B                 | ~1.5 GB             | Document embedding and reranking
Table 1. Models selected span diverse sizes, memory requirements, and use cases

Notably, the largest model (Qwen3-14B) occupies only ~35% of one NVIDIA H100 NVL GPU’s 80 GB capacity, illustrating why traditional whole-GPU allocation can leave so much capacity stranded. 
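A quick back-of-the-envelope calculation, using the approximate footprints from Table 1 and the 80 GB figure cited above, shows how much memory a dedicated GPU per model leaves idle. Real deployments also reserve KV-cache and activation headroom, so treat this as illustrative arithmetic rather than sizing guidance.

```python
# Back-of-the-envelope view of how much GPU memory whole-GPU allocation
# strands, using the approximate footprints from Table 1 and an 80 GB GPU.
# Real deployments also reserve KV-cache and activation headroom, so treat
# this as illustrative arithmetic rather than sizing guidance.
GPU_MEMORY_GB = 80

models = {
    "Llama 3.1 8B Instruct": 16,
    "Phi-4-Mini": 8,
    "Qwen3-14B": 28,
    "Qwen-Embeddings-0.6B": 1.5,
}

for name, weights_gb in models.items():
    used = weights_gb / GPU_MEMORY_GB
    print(f"{name:<24} weights ~{weights_gb:>5} GB "
          f"-> {used:5.1%} of one GPU, "
          f"~{GPU_MEMORY_GB - weights_gb:.1f} GB stranded per dedicated GPU")
```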

Methodology

GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load.

Primary metrics include:

  • TTFT: Latency from request submission to first response token
  • Output throughput: Tokens generated per second per session
  • GPU utilization: Percentage of GPU memory consumed under load
  • Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency SLA drops)
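For intuition on what TTFT and output throughput capture, the following sketch measures both for a single streaming request against an OpenAI-compatible NIM endpoint. GenAI Perf automates this at scale, adding concurrency sweeps and percentile reporting; the URL and model name below are placeholders.

```python
# A minimal sketch of measuring TTFT and output throughput for one streaming
# request against an OpenAI-compatible NIM chat endpoint. GenAI Perf automates
# this (plus concurrency sweeps and percentiles); this only illustrates what
# the metrics mean. The endpoint URL and model name are placeholders.
import json
import time
import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "meta/llama-3.1-8b-instruct"                    # placeholder model name

def measure_once(prompt: str) -> dict:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,  # stream tokens so time to first token is observable
    }
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    with requests.post(NIM_URL, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # OpenAI-compatible servers stream server-sent events: "data: {...}"
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            choices = json.loads(data).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                chunks += 1  # each content chunk approximates one token
    total = time.perf_counter() - start
    return {
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at else None,
        "output_tokens_per_sec": chunks / total if total > 0 else 0.0,
    }

print(measure_once("Summarize fractional GPU scheduling in two sentences."))
```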

Test conditions

Each model was benchmarked under the following five configurations:

  • Baseline: LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)
  • Full GPU(s) with NVIDIA Run:ai: 1.0 GPU allocation per model replica
  • Fractional 0.5 GPU(s): NVIDIA Run:ai with 0.5 GPU allocation per model replica
  • Fractional 0.25 GPU(s): NVIDIA Run:ai with 0.25 GPU allocation per model replica
  • Mixed mode: Multiple LLMs co-located on shared GPUs

For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.

Benchmarking results using NVIDIA Run:ai

This section presents observations based on the results captured from GenAI Perf. 

NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared to native Kubernetes, and fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.

No scheduler overhead

NVIDIA Run:ai introduces no measurable performance penalty compared to native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.

Fractional GPU efficiency at half allocation

Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users with TTFT under one second (1,000 ms) for every user—86% of full-GPU capacity (10,200 CCU). This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

Graph showing CCU scaling from 1–64 GPUs for Meta Llama 3.1 8B. Three configurations compared: no Run:ai, Run:ai at 1.0 GPU, and Run:ai at 0.5 GPU. At 64 GPUs, 0.5 GPU delivers 86% of full CCU (8,768 vs 10,200).
Figure 2. Concurrent user scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec—77% of full GPU throughput (198,680 tokens/sec), as shown in Figure 3.

All three configurations—without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU—scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale are not artifacts of small deployments.

Graph showing throughput scaling from 1–64 GPUs for Llama 3.1 8B. Three configurations: no Run:ai, Run:ai at 1.0 GPU, Run:ai at 0.5 GPU. 0.5 GPU delivers 77% of full GPU throughput.
Figure 3. Output throughput scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Smaller models scale further with quarter-GPU fractions

Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much concurrency and throughput this enables.

Graph showing CCU scaling from 1–32 GPUs for Phi-4-Mini-4B-Instruct on NVIDIA HGX B200 (Nebius AI Cloud). At 32 GPUs: 1.0 GPU = 7,100 CCU, 0.5 GPU = 11,000 CCU (155%), 0.25 GPU = 12,200 CCU (172%).
Figure 4. Concurrent user scaling (1-32 GPUs) for Phi-4-Mini with TTFT under 1,000 ms on an NVIDIA HGX B200 cluster running on Nebius AI Cloud

On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-4-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.

Graph showing throughput scaling for Phi-4-Mini-4B-Instruct on Blackwell (Nebius). At 32 GPUs: 1.0 GPU = 456,295 tokens/sec, 0.5 GPU = 458,138 (100%), 0.25 GPU = 389,197 (85%).
Figure 5. Throughput at scale for Phi-4 Mini NIM on NVIDIA HGX B200 cluster running on Nebius AI Cloud

Multimodel co-location on fractional GPUs in Nebius AI Cloud

NVIDIA Run:ai supports allocating fractional GPUs dynamically. While the previous tests ran a single model per GPU fraction, this test loaded two models (Llama 3.1 8B and DeepSeek-R1-Distill-8B) on 0.5 fractions of NVIDIA H100 NVL GPUs using NVIDIA Run:ai, so a single NVIDIA H100 NVL GPU served two inference models simultaneously. 

Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact grew once the test scaled past 50% of the GPUs in the cluster: at maximum scale, TTFT for the combined users increased by 3x while throughput dropped by only 0.4x.

Bar chart comparing system CCU: without Run:ai = 9,934; Run:ai 0.5 GPU = 8,768; Run:ai 0.5 GPU mixed models = 17,792.
Figure 6. Total number of concurrent users on cluster powered by NVIDIA H100 NVL GPU server running two models on a single GPU

Traditional Kubernetes schedulers don’t support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation without manual capacity planning. 

NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

Bar chart comparing total concurrent users between a mixed model scenario and Llama-only deployment across three scales. Mixed (0.5 Llama plus 0.25 PHI plus 0.125 Qwen) delivers ~3x more users: 1-GPU = 303 versus 104; 1-Host (8 GPUs) = 2,960 versus 850; 1-Cluster = 9,190 versus 3,000.
Figure 7. The total system users that ran with multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud more than tripled

Nebius ran a similar test co-deploying 0.5 GPU Llama 3.1 8B, 0.25 GPU Phi-4-Mini, and 0.125 GPU Qwen-Embeddings. The cluster achieved predictable scaling with no cross-model interference, and combined throughput exceeded 350K TPS at full scale (Figure 8). The total number of concurrent users that could run inference went up by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin-pack heterogeneous inference workloads without destabilizing latency or utilization.

Bar chart comparing total throughput between a mixed model scenario and Llama-only deployment across three scales. Mixed achieves higher TPS at all scales: 1-GPU = 9,943 versus 6,894; 1-Host = 141,838 versus 52,740; 1-Cluster = 354,312 versus 200,979.
Figure 8. Total system throughput while running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud
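The packing arithmetic behind this mix is simple to verify. The sketch below is illustrative only (the NVIDIA Run:ai scheduler performs the real placement and enforces isolation); it shows why a 0.5 + 0.25 + 0.125 mix leaves little stranded capacity per GPU.

```python
# Illustrative packing arithmetic for the mixed deployment described above:
# 0.5 Llama 3.1 8B + 0.25 Phi-4-Mini + 0.125 Qwen-Embeddings per GPU.
# The NVIDIA Run:ai scheduler does this placement automatically; this only
# shows why the mix leaves little stranded capacity.
MIX = {"llama-3.1-8b": 0.5, "phi-4-mini": 0.25, "qwen-embeddings-0.6b": 0.125}
GPUS_PER_HOST = 8

per_gpu = sum(MIX.values())          # 0.875 of each GPU
leftover_per_gpu = 1.0 - per_gpu     # 0.125 left on every GPU
extra_embedders = int(leftover_per_gpu // MIX["qwen-embeddings-0.6b"])

print(f"one mixed set uses {per_gpu} of a GPU, leaving {leftover_per_gpu}")
print(f"that leftover still fits {extra_embedders} more embedding replica(s)")
print(f"an {GPUS_PER_HOST}-GPU host runs {GPUS_PER_HOST} mixed sets plus "
      f"{GPUS_PER_HOST * extra_embedders} extra embedding replicas")
```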

Autoscaling NIM LLM with NVIDIA Run:ai

NVIDIA Run:ai supports autoscaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius configured Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service.

Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.

Run:ai dashboard showing autoscaling for Llama 3.1 8B.
Figure 9. Autoscaling results for Llama 3.1 8B on NVIDIA HGX B200 in Nebius AI Cloud
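The scaling policy itself reduces to simple arithmetic. The following sketch mirrors the thresholds used in this test (scale out at 50 concurrent users per replica, bounded between 1 and 16 replicas); in practice, the policy is configured declaratively in NVIDIA Run:ai rather than hand-coded.

```python
# A minimal sketch of the concurrency-based autoscaling policy described
# above: one more replica for every 50 concurrent users, bounded between
# 1 and 16 replicas. The threshold and bounds mirror this test; inside
# NVIDIA Run:ai the policy is configured declaratively, not hand-coded.
import math

TARGET_CONCURRENCY = 50   # scale out when >50 concurrent users per replica
MIN_REPLICAS, MAX_REPLICAS = 1, 16

def desired_replicas(concurrent_users: int) -> int:
    wanted = math.ceil(concurrent_users / TARGET_CONCURRENCY)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

for users in (10, 50, 120, 400, 900):
    print(f"{users:>4} concurrent users -> {desired_replicas(users):>2} replica(s)")
```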

Get started with GPU fractioning in NVIDIA Run:ai 

NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:

  • GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets
  • Near‑linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact
  • Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes
  • Production‑ready autoscaling for fractional LLM inference—no SLA cliffs during scale‑out
  • More workloads per GPU, higher concurrency, and reduced fleet size

For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius. 

Get started with the latest version of NVIDIA Run:ai v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234].
