
Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure.

Consider two teams with equal priority sharing a cluster. Team A continuously submits smaller jobs, while Team B needs to run a larger job that requires more resources. Every time resources free up, the smaller jobs from Team A fit immediately and get scheduled. The larger job from Team B continues to wait for enough resources to become available. Before that happens, the next small job from Team A claims the freed capacity. The result: although both teams have identical priority and entitlements, Team A runs job after job while the job from Team B sits in the queue indefinitely.

Time-based fairshare solves this problem by giving the scheduler memory. Instead of calculating fair share at a single instant, the scheduler now tracks historical resource usage and adjusts each queue’s share based on past consumption. Teams that have used more resources recently receive lower scores for over-quota allocation, while teams that have been waiting receive a boost. 

Time-based fairshare results in proportional compute time over days and weeks. This enables true time-sharing of GPU resources, burst access for occasional large jobs, and resource planning that aligns with weekly or monthly GPU-hour budgets. Importantly, guaranteed quotas and queue priorities continue to work exactly as before.

This post explains the problem in more detail, walks through a real-world use case, and demonstrates how to enable time-based fairshare in NVIDIA Run:ai and KAI Scheduler.

Why is over-quota GPU resource fairness important?

Enterprise deployments have shown a consistent pattern: when organizations move from static GPU allocation to dynamic scheduling, cluster usage becomes far more fluid. Over-quota resources (the shared pool beyond guaranteed quotas) become one of the most heavily utilized resource types. Teams regularly exceed their guaranteed allocations, resulting in higher GPU utilization and more compute time for researchers.

This makes over-quota fairness critical. When a significant portion of cluster value comes from this shared pool, that pool needs to be divided fairly over time.

How does stateless fair share scheduling work?

The classical stateless fair share algorithm divides cluster resources in two phases. First, it allocates the Deserved Quota, the guaranteed resources that each queue is entitled to. This allocation always happens first and is unaffected by historical usage. Time-based fairshare does not change this behavior.

After deserved quotas are satisfied, any remaining capacity becomes the Over-Quota Pool, a shared surplus that queues compete for based on their weights. This is where point-in-time fairness breaks down.

When dividing over-quota resources, the scheduler:

  1. Groups queues by priority level and starts with the highest tier
  2. Calculates fair share based on weights in that tier:
    fairShare = remainingCapacity × weight / totalWeights
  3. Queues using less than their fair share get resources first 
  4. Breaks ties using workload submission time
  5. If resources remain, moves to the next priority tier and repeats (see the sketch below)
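
The following minimal Python sketch mirrors these steps under simplifying assumptions: a single scheduling pass, no gang scheduling or preemption, and illustrative names (Queue, divide_over_quota) that are not KAI Scheduler's actual API.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Queue:
    name: str
    priority: int      # higher value = higher tier, scheduled first
    weight: float      # over-quota weight within its priority tier
    requested: float   # GPUs requested beyond the deserved quota

def divide_over_quota(queues, remaining_capacity):
    """Point-in-time division of the over-quota pool, mirroring the steps
    above. Illustrative only: single pass, no gang scheduling or reclaim."""
    allocations = {q.name: 0.0 for q in queues}
    # 1. Group queues by priority level, highest tier first
    ordered = sorted(queues, key=lambda q: -q.priority)
    for _, tier in groupby(ordered, key=lambda q: q.priority):
        tier = list(tier)
        total_weight = sum(q.weight for q in tier)
        for q in tier:
            # 2. fairShare = remainingCapacity * weight / totalWeights
            fair_share = remaining_capacity * q.weight / total_weight
            # 3. A queue is served only up to its fair share (or its request)
            allocations[q.name] = min(q.requested, fair_share)
        # 5. Capacity the tier did not consume rolls over to the next tier
        remaining_capacity -= sum(allocations[q.name] for q in tier)
    return allocations

# Equal priority and equal weights, 40 over-quota GPUs free:
print(divide_over_quota(
    [Queue("team-a", 100, weight=1, requested=15),
     Queue("team-b", 100, weight=1, requested=40)],
    remaining_capacity=40))
# {'team-a': 15.0, 'team-b': 20.0} -- team-b is capped at its 20-GPU fair
# share, so with gang scheduling its 40-GPU job could not actually start.
```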

Here’s where the problem lies. Consider two queues competing for over-quota resources, in two weight configurations.

When queues have equal weights: Both receive the same calculated fair share. When resources become available after a job completes, both queues are in the exact same state – same allocation (zero), same fair share, both with pending jobs. The scheduler sees no difference between them, falls back to tie-breakers (queue creation timestamp, then alphabetical order) and the same queue wins every time.

When queues have different weights: The higher-weight queue receives a larger fair share, which is correct. But the point-in-time calculation doesn’t track whether queues actually receive their proportional share over time. For example, if Queue A has weight 3 and Queue B has weight 1, the scheduler correctly calculates that A is entitled to 75% of over-quota resources (3/4) and B to 25% (1/4). But if Queue A submits large workloads while Queue B submits many smaller ones, Queue B can easily fit within its fair share, while Queue A’s large jobs would push it above its own fair share. The scheduler continues to prefer Queue B because it appears “underallocated” at each decision point. Over time, Queue B ends up running far more workloads than its 25% entitlement.

In both cases, the scheduler has no memory. It doesn’t know that one team just finished running a job while the other has been waiting for hours.

How does time-based fairshare work?

The core idea of time-based fairshare is straightforward: for each queue, compare the proportion of over-quota resources it actually consumed over the configured time window against the proportion it should have received based on its weight. Then adjust accordingly.

For example, if Queue A has weight 3 and Queue B has weight 1, Queue A should receive 75% of over-quota resources and Queue B should receive 25%. If the scheduler looks back over the past week and sees that Queue A actually consumed 90% while Queue B only received 10%, it will boost Queue B’s effective weight and reduce Queue A’s, balancing future allocations toward the 75/25 split.

Everything else stays the same. Deserved quotas are still guaranteed first. Priority ordering still applies. Queue hierarchies work as before. Time-based fairshare only changes how the over-quota pool gets divided.

How is time-based fairshare calculated?

The scheduler uses three inputs to adjust the effective weight of each queue:

  • Weight: What the queue should get based on its configured weight relative to others
  • Usage: What the queue actually consumed over a configurable time window (default: one week)
  • K-value: How aggressively the scheduler corrects imbalances. Higher values mean faster correction

When a queue has consumed more than its fair share, its effective weight is reduced. When it has been starved, its effective weight is boosted. This way, allocations naturally drift back toward the intended proportions over time.
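
This post doesn’t spell out the exact formula, but the direction of the adjustment can be sketched as follows. This is an illustrative assumption only: effective_weight, entitled_share, used_share, and the simple multiplicative correction are hypothetical, not KAI Scheduler’s implementation.

```python
def effective_weight(weight, entitled_share, used_share, k=1.0):
    """Illustrative adjustment only. The idea: shrink the effective weight
    of queues that over-consumed their over-quota share during the time
    window, and boost queues that were starved. `k` controls how
    aggressively the imbalance is corrected."""
    imbalance = entitled_share - used_share   # positive if the queue was starved
    return max(weight * (1.0 + k * imbalance), 0.0)

# Queue A (weight 3) consumed 90% of over-quota usage last week,
# Queue B (weight 1) consumed 10%, although they are entitled to 75%/25%.
print(effective_weight(3, entitled_share=0.75, used_share=0.90))  # ~2.55 (reduced)
print(effective_weight(1, entitled_share=0.25, used_share=0.10))  # ~1.15 (boosted)
```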

Time-based fairshare can be enabled or disabled directly from the UI (see the Node Pools section of the NVIDIA Run:ai documentation), while parameters like window size, window type, and decay rates can be tuned via API to balance responsiveness against stability. Because these settings are configured per node-pool, administrators can experiment on a dedicated node-pool without affecting the rest of the cluster. For the full details, see the time-based fairshare documentation.

A few details worth noting:

  • Usage is measured against cluster capacity, not against what others consumed. This prevents teams from being penalized for using GPUs that were sitting idle anyway.
  • Priority still comes first. Time-based fairshare operates within each priority tier. A high-priority queue still gets resources before lower-priority queues, regardless of historical usage.

Example scenario: One cluster, multiple workload types

This section walks through a realistic scenario that shows how time-based fairshare solves resource contention in a heterogeneous cluster.

A 100-GPU cluster is shared by two ML teams with very different workload patterns. The LLM team focuses on post-training and inference, with 30 GPUs guaranteed. The Vision team focuses on computer vision R&D, with 20 GPUs guaranteed. Both teams have equal over-quota weight. The remaining 50 GPUs form the over-quota pool, available for burst workloads.

The LLM team runs customer-facing inference endpoints that serve production traffic. These inference workloads use 10 GPUs continuously. They are critical and must never be interrupted. The remaining 20 GPUs from their quota, plus access to the over-quota pool, are available for post-training jobs when the team occasionally needs to improve their models based on customer feedback.

The Vision team focuses on computer vision research: running VSCode, testing architectures, hyperparameter sweeps, and training object detection models. They have a steady stream of training jobs that regularly tap into the over-quota pool.

The problem: Burst access becomes blocked

One day, the LLM team finishes analyzing a batch of customer feedback and is ready to launch a post-training run. The job needs 60 GPUs: the 20 GPUs left in their quota plus 40 from the over-quota pool. What happens with and without time-based fairshare is outlined below.

To illustrate this scenario, we used the open source time-based fairshare simulator from the KAI Scheduler. This tool lets you model different cluster configurations and visualize how resources are allocated over time. The simulations below show exactly what happens in our example scenario.

Without time-based fairshare

  • LLM team’s inference endpoints continue running on their 10 guaranteed GPUs (deserved quota is protected).
  • Vision team has been continuously running CV training jobs, consuming over-quota resources.
  • LLM team’s 60-GPU post-training job enters the queue.
  • Whenever over-quota resources are free, the Vision team has more pending jobs ready.
  • Vision team’s jobs continue to be scheduled first. This happens because the LLM team’s 40-GPU over-quota request exceeds their fair share. The scheduler won’t allocate beyond fair share while the Vision team still has pending jobs claiming their portion. The LLM team must wait until Vision team’s over-quota usage drops.
  • LLM team’s post-training job waits…and waits…and waits.

The LLM team’s inference services are fine, and the guaranteed quota works perfectly. But their post-training job is effectively starved because a team with continuous workloads monopolizes the over-quota pool. The occasional user never gets their turn.

Two stacked line graphs show GPU allocation and fair share evolution over simulation cycles without time-based fairshare. The Vision team holds 50 GPUs throughout while the LLM team stays flat at 10, even after the LLM burst jobs enter the queue at cycle 256; the LLM fair share rises with the new demand but remains below the job size, so the burst job is never scheduled.
Figure 1. Without time-based fairshare, the LLM burst job remains pending while the Vision team continues using over-quota resources

With time-based fairshare

For detailed instructions on configuring time-based fairshare in the NVIDIA Run:ai UI under node pools, see the NVIDIA Run:ai documentation or the KAI Scheduler documentation.

With time-based fairshare, the scheduler tracks historical usage. When the LLM Team submits their post-training job:

  • Vision team has accumulated high historical over-quota usage from continuous CV training
  • LLM team has minimal historical over-quota usage (they’ve been running jobs within quota)
  • LLM team’s effective fair share is boosted because they’ve been “starved” for over-quota
  • LLM team’s 60-GPU job is scheduled

If the post-training job runs long enough, both teams end up time-sharing over-quota resources. The LLM Team runs for a while, accumulating usage. As their historical usage grows, the Vision Team becomes relatively more starved and starts getting prioritized. The resources oscillate back and forth (sometimes the LLM job runs, sometimes Vision jobs run) resulting in fair sharing over time rather than one team monopolizing the pool.
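
To see how this oscillation emerges, consider the toy loop below. It is not the KAI Scheduler simulator and not the scheduler’s actual algorithm; it only assumes that each cycle the over-quota pool goes to whichever queue is furthest below its entitled share of usage recorded over a sliding window.

```python
# Toy re-creation of the scenario above (illustrative numbers only):
# two queues with equal entitlement share a 50-GPU over-quota pool.
from collections import deque

WINDOW = 20          # cycles of history the scheduler "remembers"
POOL = 50            # over-quota GPUs up for grabs each cycle
entitled = {"llm": 0.5, "vision": 0.5}   # equal over-quota weights
history = {q: deque(maxlen=WINDOW) for q in entitled}

def used_share(queue):
    total = sum(sum(h) for h in history.values()) or 1.0
    return sum(history[queue]) / total

for cycle in range(60):
    # Positive score = starved relative to entitlement, negative = over-served
    scores = {q: entitled[q] - used_share(q) for q in entitled}
    winner = max(scores, key=scores.get)
    for q in entitled:
        history[q].append(POOL if q == winner else 0)
    if cycle % 10 == 0:
        print(cycle, {q: round(used_share(q), 2) for q in entitled})
# After the first few cycles, usage shares settle around 0.5/0.5:
# the pool oscillates between the teams instead of one team holding it.
```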

Two stacked line graphs show GPU allocation and fair share evolution over simulation cycles with time-based fairshare enabled. Initially the Vision team uses the over-quota pool while the LLM team runs only inference; when the LLM burst jobs arrive, their fair share is boosted due to low historical usage and the 60-GPU job runs. The fair share lines then cross over multiple times, and GPU allocation oscillates between the teams as usage rebalances.
Figure 2. With time-based fairshare, the LLM burst job is scheduled and resources oscillate fairly between teams

Time-based fairshare enables several important patterns, including:

  • Protected critical workloads: Inference endpoints and other production services run on guaranteed quota, completely untouched by fairness adjustments.
  • Burst access when needed: Teams that don’t continuously consume over-quota resources can still get burst capacity when they need it, without being blocked for long periods of time or even permanently.
  • Fair sharing over time: No team monopolizes the over-quota pool indefinitely. Everyone gets their proportional share across the configured time window.
  • Fairer treatment of large workloads: In point-in-time fair share, queues with large jobs often get deprioritized because smaller jobs from other queues fit more easily. Time-based fairshare improves this: as the queue with large jobs accumulates less usage, it becomes increasingly prioritized until it gets a chance to run.

Get started with NVIDIA Run:ai time-based fairshare 

Time-based fairshare addresses a fundamental limitation in point-in-time fair share scheduling: the lack of memory. By tracking historical usage, the scheduler distributes over-quota resources fairly across time windows rather than just at each scheduling decision. Guaranteed quotas remain untouched – critical workloads like inference endpoints stay protected.

Ready to get started? NVIDIA Run:ai v2.24 includes time-based fairshare with straightforward configuration through the platform UI. Settings are configured per node-pool, so it’s easy to experiment on a dedicated pool without imposing the new mode across the entire cluster. For setup details, see the time-based fairshare documentation.

Time-based fairshare is also available in open source KAI Scheduler. Complete the configuration steps, enable Prometheus, set your parameters, and start scheduling.

Want to try time-based fairshare before deploying it? Check out the time-based fairshare simulator, where you can model queue allocations over time. Define your queues, weights, and workloads in a simple YAML file, run the simulation, and visualize how resources oscillate between competing teams.

To learn more about time-based fairshare and other features in the NVIDIA Run:ai v2.24 release, join the upcoming webinar Elevate Your AI Operations With Simplified Workload Management.
