As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is delivered entirely by NVIDIA Run:ai and works in any environment: cloud, NCP, and on-premises.
This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale.
All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.
The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:
- 77% of full-GPU throughput and 86% of full-GPU concurrent user capacity using only a 0.5 GPU fraction, with time to first token (TTFT) under one second
- Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions
- Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs
- Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact
- Production-ready autoscaling with no latency cliffs or error spikes during scale-out
This benchmarking shows that fractional GPU scheduling is no longer just an optimization technique. It is a foundational capability for running large-scale, multimodel LLM inference efficiently in production.
LLM inference enterprise challenges
Enterprise IT departments operate with a finite, often fixed inventory of GPUs. Deploying an LLM for inference requires dedicating a GPU (or multiple GPUs) to a single LLM instance, even when traffic is sporadic. This is necessary because the model must load all of its weights before an inference request arrives, so that token-generation latency stays as low as possible.
As a result, each LLM consumes all of the GPUs allocated to it, and it becomes difficult to run more than one model on the same GPU pool. In this scenario, enterprise IT must manually maintain the GPU-to-LLM allocation, work out when and how to scale each LLM as the number of inference users grows so that request and token latency stays acceptable, and cannot repurpose idle GPUs during off-peak hours.
Ideally, enterprises want an elastic environment where GPUs can run multiple LLMs, not just one, without significantly reducing the number of users who can run inference or degrading latency for those users. They also want to scale GPU allocations up with workload demand and scale them down during off-peak hours, so that other workloads can consume the same GPUs.
Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud
The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters, and through dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a flexible, production-ready framework for maximizing GPU ROI.
In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.
Dynamic GPU fractioning
NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads simultaneously. Users specify their memory requirements directly and the scheduler allocates resources on-demand without any preconfiguration. This is particularly impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation.
Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.
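To make this concrete, here is a minimal sketch of what requesting half a GPU for a NIM replica can look like from the user side. The gpu-fraction annotation, the runai-scheduler name, and the NIM image path follow publicly documented conventions, but keys and defaults can differ across NVIDIA Run:ai and NIM versions, so treat the manifest as illustrative rather than a drop-in deployment.

```python
# Illustrative manifest generator. The gpu-fraction annotation and the
# runai-scheduler name follow Run:ai's public documentation, but keys and
# defaults can change between Run:ai versions, so verify against your release.
import yaml  # pip install pyyaml

def fractional_nim_pod(name: str, image: str, gpu_fraction: float) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Share of one GPU's compute and frame buffer for this replica.
                # Memory-based requests and burstable limits use additional
                # annotations described in the Run:ai documentation.
                "gpu-fraction": str(gpu_fraction),
            },
        },
        "spec": {
            "schedulerName": "runai-scheduler",  # hand placement to the Run:ai scheduler
            "containers": [{
                "name": "nim",
                "image": image,
                "ports": [{"containerPort": 8000}],
            }],
        },
    }

print(yaml.safe_dump(
    fractional_nim_pod(
        "llama-3-1-8b-nim",
        "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",
        0.5,
    ),
    sort_keys=False,
))
```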
Intelligent workload scheduling
The NVIDIA Run:ai scheduler acts as the “brain” of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met.
The scheduler also automatically scales LLMs up or down based on the number of concurrent inference users and token latency, according to the SLA criteria set by the administrator. These strategies collectively drive higher utilization rates, lower operational complexity, and reduced total cost of ownership (TCO).
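The sketch below illustrates the kind of priority-aware admission decision described above: latency-sensitive inference is considered before batch training when fractional GPU capacity is tight. It is a simplified illustration of the policy, not the NVIDIA Run:ai scheduler's actual algorithm, and the workload names and priority values are hypothetical.

```python
# Illustration of the policy described above, not the Run:ai scheduler's actual
# algorithm: latency-sensitive inference is admitted ahead of batch training
# when fractional GPU capacity is tight. Names and priorities are hypothetical.
from dataclasses import dataclass, field

@dataclass(order=True)
class Workload:
    priority: int                                  # lower value = scheduled first
    name: str = field(compare=False)
    gpus_requested: float = field(compare=False)   # fractional requests allowed

PRIORITY = {"inference": 0, "training": 1}

def admit(pending, free_gpus):
    """Greedily admit workloads in priority order while capacity remains."""
    admitted = []
    for wl in sorted(pending):                     # stable sort on priority only
        if wl.gpus_requested <= free_gpus:
            admitted.append(wl.name)
            free_gpus -= wl.gpus_requested
    return admitted, free_gpus

pending = [
    Workload(PRIORITY["training"], "fine-tune-job", 4.0),
    Workload(PRIORITY["inference"], "llama-8b-chat", 0.5),
    Workload(PRIORITY["inference"], "qwen-embeddings", 0.125),
]
print(admit(pending, free_gpus=2.0))
# -> (['llama-8b-chat', 'qwen-embeddings'], 1.375)
```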
Teams at NVIDIA and Nebius ran benchmarks to measure the impact of NVIDIA Run:ai on running inference at scale for various LLMs. The scale tests measured how many concurrent users could issue chat requests while recording TTFT, output throughput (tokens per second generated), and GPU utilization. At NVIDIA, the tests ran on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architecture (RA) with NVIDIA H100 NVL GPUs. At Nebius AI Cloud, the tests ran on a cluster built following the HGX-based Enterprise RA with NVIDIA HGX B200 GPUs.
Benchmarking setup
The software stack is based on NVIDIA Enterprise RAs (Figure 1). This includes the NVIDIA AI Enterprise stack to manage GPUs using NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed in a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA.
Infrastructure
Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

Model selection
The four models selected span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with different memory footprints.
| Model | Number of parameters | Memory requirements | Use case |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | ~16 GB | General-purpose chat |
| Phi-4-Mini | 3.8B | ~8 GB | Lightweight assistant |
| Qwen3-14B | 14B | ~28 GB | Reasoning |
| Qwen-Embeddings-0.6B | 0.6B | ~1.5 GB | Document embedding and reranking |
Table 1. Models selected for benchmarking
Notably, the largest model (Qwen3-14B) occupies only ~35% of a single NVIDIA H100 NVL GPU's 80 GB of memory, illustrating why traditional whole-GPU allocation can leave so much capacity stranded.
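A quick back-of-the-envelope check of the Table 1 figures against the 80 GB capacity makes the point; the numbers below only cover model weights and ignore KV cache, activations, and runtime overhead, which a real NIM deployment also needs headroom for.

```python
# Back-of-the-envelope check using the Table 1 figures and the 80 GB capacity
# cited above. This counts only model weights; KV cache, activations, and the
# NIM runtime need additional headroom in practice.
GPU_MEMORY_GB = 80

models = {
    "Llama 3.1 8B Instruct": 16,
    "Phi-4-Mini": 8,
    "Qwen3-14B": 28,
    "Qwen-Embeddings-0.6B": 1.5,
}

for name, weights_gb in models.items():
    share = weights_gb / GPU_MEMORY_GB
    print(f"{name:<24} ~{weights_gb:>4} GB weights -> {share:.1%} of one GPU")
# Even the largest model (Qwen3-14B) sits at ~35%, leaving most of a whole GPU
# idle under one-model-per-GPU allocation.
```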
Methodology
GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load.
Primary metrics include the following (a sketch showing how they can be derived from per-request traces appears after the list):
- TTFT: Latency from request submission to first response token
- Output throughput: Tokens generated per second per session
- GPU utilization: Percentage of GPU memory consumed under load
- Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency SLA drops)
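As a rough illustration, the snippet below shows how TTFT, per-session output throughput, and an SLA check can be computed from per-request timing traces. The trace format and threshold are assumptions made for illustration; GenAI Perf computes and reports these metrics itself.

```python
# Minimal sketch with an assumed trace format; GenAI Perf computes and reports
# these metrics itself, so this is only meant to make the definitions concrete.
from statistics import quantiles

# Each trace record: (request_start_s, first_token_s, last_token_s, output_tokens)
def summarize(traces, ttft_slo_ms=1000):
    ttft_ms = [(first - start) * 1000 for start, first, _, _ in traces]
    gen_tps = [
        tokens / (last - first) if last > first else 0.0
        for _, first, last, tokens in traces
    ]
    p95_ttft = quantiles(ttft_ms, n=20)[18]  # 95th percentile TTFT
    return {
        "p95_ttft_ms": round(p95_ttft, 1),
        "mean_output_tps": round(sum(gen_tps) / len(gen_tps), 1),
        "meets_ttft_slo": p95_ttft <= ttft_slo_ms,
    }

# Three synthetic requests captured at one concurrency level.
traces = [
    (0.00, 0.21, 2.85, 512),
    (0.01, 0.34, 3.10, 480),
    (0.02, 0.52, 3.40, 505),
]
print(summarize(traces))
```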
Test conditions
Each model was benchmarked under the following five configurations:
- Baseline: LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)
- Full GPU(s) with NVIDIA Run:ai: 1.0 GPU allocation per model replica
- Fractional 0.5 GPU(s): NVIDIA Run:ai with 0.5 GPU allocation per model replica
- Fractional 0.25 GPU(s): NVIDIA Run:ai with 0.25 GPU allocation per model replica
- Mixed mode: Multiple LLMs co-located on shared GPUs
For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.
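The test plan can be thought of as a simple parameter matrix over models and scheduling configurations. The labels below are illustrative, not the actual harness configuration, and mixed mode in practice co-locates several models per GPU rather than iterating models independently.

```python
# Sketch of the benchmark matrix implied by the test conditions above. The labels
# are illustrative, and mixed mode actually co-locates several models per GPU
# rather than iterating models independently.
from itertools import product

models = ["llama-3.1-8b-instruct", "phi-4-mini", "qwen3-14b", "qwen-embeddings-0.6b"]
configs = [
    ("baseline-k8s", "default scheduler", 1.0),
    ("runai-full", "runai-scheduler", 1.0),
    ("runai-half", "runai-scheduler", 0.5),
    ("runai-quarter", "runai-scheduler", 0.25),
    ("runai-mixed", "runai-scheduler", "shared"),
]

for model, (label, scheduler, gpu_per_replica) in product(models, configs):
    print(f"{model:<22} {label:<14} {scheduler:<18} gpu/replica={gpu_per_replica}")
```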
Benchmarking results using NVIDIA Run:ai
This section presents observations based on the results captured from GenAI Perf.
Fractional GPU efficiency at half allocation
NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared to native Kubernetes, and fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.
No scheduler overhead
NVIDIA Run:ai introduces no measurable performance penalty compared to native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.
Fractional GPU efficiency
Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users, with TTFT for each user staying under one second (1,000 ms). That is 86% of full GPU capacity (10,200 concurrent users). This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec, or 77% of full GPU throughput (198,680 tokens/sec), as shown in Figure 3.
All three configurations—without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU—scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale are not artifacts of small deployments.
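For reference, the efficiency ratios quoted in this section follow directly from the raw measurements reported above:

```python
# Recompute the headline ratios from the 64-GPU measurements cited above.
full_gpu_ccu, half_gpu_ccu = 10_200, 8_768       # concurrent users
full_gpu_tps, half_gpu_tps = 198_680, 152_694    # tokens/sec
native_ccu = 9_934                               # native Kubernetes scheduler

print(f"0.5 GPU vs. full GPU, concurrent users: {half_gpu_ccu / full_gpu_ccu:.0%}")  # ~86%
print(f"0.5 GPU vs. full GPU, throughput:       {half_gpu_tps / full_gpu_tps:.0%}")  # ~77%
print(f"Run:ai full GPU vs. native scheduler:   {full_gpu_ccu / native_ccu:.2f}x")   # ~1.03x
```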

Smaller models scale further with quarter-GPU fractions
Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much concurrency and throughput this enables.

On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-4-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.
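Part of this headroom comes simply from replica density: smaller fractions let the scheduler place more NIM replicas on the same hardware. The quick calculation below shows the schedulable replica counts on a 32-GPU cluster at each fraction size; the measured gains come from the benchmark itself, not from this arithmetic.

```python
# Schedulable replica count per fraction size on a 32-GPU cluster. The measured
# gains (up to 72% more users, ~450K tokens/sec) come from the benchmark itself;
# this only shows why smaller fractions create more placement headroom.
cluster_gpus = 32
for fraction in (1.0, 0.5, 0.25, 0.125):
    replicas = int(cluster_gpus / fraction)
    print(f"{fraction:>5} GPU per replica -> {replicas:>3} replicas on {cluster_gpus} GPUs")
```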

Multimodel co-location on fractional GPUs in Nebius AI Cloud
NVIDIA Run:ai supports allocating fractional GPUs dynamically. Whereas the previous tests ran a single model per GPU fraction, this test loaded two models (Llama 3.1 8B and DeepSeek-R1-Distill-8B) onto fractional 0.5 NVIDIA H100 NVL GPUs using NVIDIA Run:ai, so a single NVIDIA H100 NVL GPU served two inference models simultaneously.
Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact grew once the deployment scaled beyond 50% of the GPUs in the cluster. At maximum scale, TTFT for the combined users degraded by 3x, while throughput dropped by only 0.4x.

Traditional Kubernetes schedulers don’t support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation without manual capacity planning.
NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

Nebius ran a similar test co-deploying Llama 3.1 8B at 0.5 GPU, Phi-4-Mini at 0.25 GPU, and Qwen-Embeddings at 0.125 GPU per replica. The cluster achieved predictable scaling with no cross-model interference, and combined throughput exceeded 350K tokens/sec at full scale (Figure 8). The total number of concurrent users that could run inference increased by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin-pack heterogeneous inference workloads without destabilizing latency or utilization.
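Conceptually, placing 0.5, 0.25, and 0.125 fractions onto whole GPUs is a bin-packing problem. The sketch below uses plain first-fit-decreasing to show how a mixed replica set can pack onto a small number of GPUs; it is an illustration of the concept, not the NVIDIA Run:ai scheduler's actual placement algorithm.

```python
# Conceptual illustration of packing the mixed fractions from this test
# (0.5 + 0.25 + 0.125 replicas) onto whole GPUs. This is plain first-fit-
# decreasing, not the Run:ai scheduler's actual placement algorithm.
def first_fit_decreasing(fractions, gpu_capacity=1.0):
    gpus = []  # each entry is the remaining capacity on one GPU
    for frac in sorted(fractions, reverse=True):
        for i, free in enumerate(gpus):
            if frac <= free + 1e-9:
                gpus[i] = free - frac
                break
        else:
            gpus.append(gpu_capacity - frac)   # open a new GPU
    return gpus

# For example, 8 chat, 8 assistant, and 8 embedding replicas:
replicas = [0.5] * 8 + [0.25] * 8 + [0.125] * 8
remaining = first_fit_decreasing(replicas)
print(f"{len(replicas)} replicas packed onto {len(remaining)} GPUs")
# -> 24 replicas packed onto 7 GPUs
```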

Autoscaling NIM LLM with NVIDIA Run:ai
NVIDIA Run:ai supports auto-scaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius set up Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service.
Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.
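The scaling behavior can be approximated by a simple threshold policy: target roughly 50 concurrent users per replica, bounded between 1 and 16 replicas. The function below models that policy for illustration only; the actual trigger evaluation and pod scaling are handled by NVIDIA Run:ai.

```python
# Simplified model of the policy described above: roughly 50 concurrent users per
# replica, scaling between 1 and 16 replicas. The real trigger evaluation and pod
# scaling are handled by NVIDIA Run:ai, not by this function.
import math

def desired_replicas(concurrent_users, users_per_replica=50,
                     min_replicas=1, max_replicas=16):
    needed = math.ceil(concurrent_users / users_per_replica)
    return max(min_replicas, min(max_replicas, needed))

for users in (10, 60, 250, 900):
    print(f"{users:>4} concurrent users -> {desired_replicas(users):>2} replica(s)")
# 10 -> 1, 60 -> 2, 250 -> 5, 900 -> 16 (capped at max_replicas)
```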

Get started with GPU fractioning in NVIDIA Run:ai
NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:
- GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets
- Near-linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact
- Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes
- Production-ready autoscaling for fractional LLM inference, with no SLA cliffs during scale-out
- More workloads per GPU, higher concurrency, and reduced fleet size
For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius.
Get started with the latest version of NVIDIA Run:ai v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234].