
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand while managing GPU costs. Organizations often face a trade-off between over-provisioning GPUs for peak demand and risking service-level agreement (SLA) violations during traffic spikes, typically deciding between:

  1. Deploying many replicas with GPUs to handle worst-case traffic scenarios, paying for hardware that spends most of its time idling.
  2. Scaling up aggressively from zero, with users suffering through latency spikes.

Neither approach is ideal. The first drains your budget—the second risks frustrating your users.

NVIDIA Run:ai GPU memory swap, also known as model hot-swapping, is designed to push the boundaries of GPU utilization for inference workloads by addressing GPU memory constraints and enhancing auto-scaling efficiency.

Why model hot-swapping?

Hot-swapping introduces a more dynamic approach to resource management in serving models. It enables multiple models to share the same GPUs, even if their combined memory requirements exceed the available GPU capacity. Here’s how it works:

  1. Dynamic memory offloading: Models that receive no requests within a specific time frame no longer hog GPU memory; they are swapped out to CPU memory when not in use.
  2. Rapid activation: On receiving a request, the model is immediately swapped back into GPU memory with minimal latency.
  3. More model replicas, less hardware: This enables multiple models to share the same hardware, significantly reducing the number of always-on machines, without compromising responsiveness. Additionally, since the server (i.e., the CPU process) remains active even when the GPU part is swapped out, the replica can be quickly re-enabled as the server is already initialized.

With hot-swapping, organizations can efficiently handle unpredictable workloads while avoiding costly over-provisioning.
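
To make the mechanics concrete, here is a minimal, illustrative PyTorch-style sketch of the swap lifecycle described above. This is not the NVIDIA Run:ai implementation, which handles swapping transparently at the infrastructure level without application changes; the SwappableModel class, the idle_timeout_s parameter, and the background eviction call are assumptions made purely for illustration.

```python
import time

import torch


class SwappableModel:
    """Illustrative wrapper: keep a model in GPU memory only while it is
    actively serving requests, and park it in CPU memory otherwise."""

    def __init__(self, model: torch.nn.Module, idle_timeout_s: float = 30.0):
        self.model = model.to("cpu")      # start swapped out (CPU memory)
        self.idle_timeout_s = idle_timeout_s
        self.last_request_ts = 0.0
        self.on_gpu = False

    def _swap_in(self) -> None:
        # Copy weights over PCIe into GPU memory; this transfer time is what
        # bounds TTFT in the swapped case.
        self.model.to("cuda")
        self.on_gpu = True

    def _swap_out(self) -> None:
        # Move weights back to CPU memory, freeing the GPU for another model.
        # The serving process stays alive, so re-activation is fast.
        self.model.to("cpu")
        torch.cuda.empty_cache()
        self.on_gpu = False

    def maybe_swap_out(self) -> None:
        # Called periodically by a background loop: evict after the idle timeout.
        if self.on_gpu and time.time() - self.last_request_ts > self.idle_timeout_s:
            self._swap_out()

    @torch.no_grad()
    def __call__(self, inputs: torch.Tensor) -> torch.Tensor:
        self.last_request_ts = time.time()
        if not self.on_gpu:
            self._swap_in()               # rapid activation on the first request
        return self.model(inputs.to("cuda"))
```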

Benchmarking GPU memory swap: validating performance

To demonstrate the performance of GPU memory swap, we simulated real-world LLM deployment scenarios.

Models tested

  • Llama 3.1 8B Instruct
  • Mistral-7B
  • Falcon-11B

Hardware and software environment

  • GPU: NVIDIA L40S (48 GB) connected via PCIe Gen4 x8, limiting it to half of its maximum theoretical PCIe throughput
  • Instance type: AWS g6e.4xlarge
  • Scheduler: NVIDIA Run:ai Scheduler (v2.19)
  • Inference engine: vLLM version 0.6.4 with default configurations
  • Setup: The server image is preloaded onto the node and the model weights are cached on an Amazon EBS volume, eliminating network transfer overhead in all scenarios.

Metrics

  • Time to first token (TTFT): Measured from the moment the first request hits the server to when the model generates its first token. We used the official vLLM benchmarking script, simulating a production environment by disabling the warm-up phase.
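
As a rough illustration of how TTFT can be measured against a streaming, OpenAI-compatible server (the interface vLLM exposes), the sketch below times how long the first streamed completion chunk takes to arrive. This is not the official vLLM benchmarking script used for the results in this post; the URL, model name, and payload values are assumptions for a locally running server.

```python
import time

import requests

# Hypothetical endpoint of a locally running, OpenAI-compatible vLLM server.
URL = "http://localhost:8000/v1/completions"


def measure_ttft(model: str, prompt: str) -> float:
    """Return seconds from sending the request until the first token arrives."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64, "stream": True}
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events: the first non-empty "data:" line carries
            # the first generated token.
            if line and line.startswith(b"data:") and b"[DONE]" not in line:
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was produced")


if __name__ == "__main__":
    print(f"TTFT: {measure_ttft('meta-llama/Llama-3.1-8B-Instruct', 'Hello'):.3f} s")
```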

Input conditions

  • Prompt lengths: 128 tokens and 2,048 tokens, with models stopping at the EOS token.

Comparing latency and efficiency across three deployment scenarios

We evaluated three distinct scenarios:

  1. Scale from zero: Measuring TTFT when loading a model from scratch.
  2. GPU memory swap between models on a single GPU: Evaluating TTFT when a model is swapped from CPU memory back into GPU memory.
  3. Baseline (warm models): Establishing a baseline TTFT when the model is already resident in GPU memory.

1. Scaling from zero—long delays

Scaling from zero involves initializing the pod, loading the model onto the GPU, and processing the first request. As expected, this approach resulted in the highest TTFT due to initialization overhead.
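
As a rough local approximation of this path, the sketch below times vLLM engine initialization, weight loading, and the first generation together. It ignores pod scheduling and container start-up, and the model name and sampling parameters are illustrative, so it will not reproduce the exact numbers reported below.

```python
import time

from vllm import LLM, SamplingParams

# Everything below is paid before the first token in a scale-from-zero path:
# engine start-up, weight loading, and the first forward pass.
start = time.perf_counter()

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")   # illustrative model
params = SamplingParams(max_tokens=1, temperature=0.0)  # stop after the first token
llm.generate(["Summarize PCIe in one sentence."], params)

print(f"Cold start to first token: {time.perf_counter() - start:.1f} s")
```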

Model                    Input Length (Tokens)    TTFT (s)
Llama 3.1 8B Instruct    128                      159.49
Llama 3.1 8B Instruct    2,048                    159.77
Mistral-7B               128                      145.90
Mistral-7B               2,048                    146.90
Falcon-11B               128                      207.07
Falcon-11B               2,048                    208.13
Table 1. Scale-from-zero results with Llama 3.1 8B Instruct, Mistral-7B, and Falcon-11B

TTFT consistently exceeded 140 seconds for smaller models and stretched beyond 200 seconds for slightly larger ones. These delays—up to 208 seconds—are often impractical for real-time applications, underscoring the inefficiency of scaling from zero in production.

2. GPU memory swap—optimal efficiency

For this test, models started in CPU memory and were dynamically swapped into GPU memory upon request. Two model groups were tested: the first consisted of Llama 3.1 8B Instruct and Mistral-7B, and the second of Llama 3.1 8B Instruct and Falcon-11B. Each group followed this sequence:

  1. A request was sent to one model, prompting it to load into GPU memory. The system dynamically swapped this model from CPU to GPU memory, and TTFT was recorded.
  2. Once this model completed its task and was automatically swapped back to CPU memory, a request was sent to the second model. Similarly, it was loaded into GPU memory, and its TTFT was recorded.

Note: With GPU memory swap, TTFT is bounded by the PCIe bandwidth and the time it takes to move model weights between CPU and GPU memory.
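
The sketch below outlines this alternating request pattern. It assumes both models sit behind the same OpenAI-compatible endpoint, reuses the measure_ttft helper from the earlier TTFT sketch (saved here as a hypothetical ttft_benchmark.py), and uses an illustrative idle wait so the scheduler has time to swap the idle model back out.

```python
import time

from ttft_benchmark import measure_ttft  # helper from the earlier TTFT sketch

# Hypothetical model names; both are served behind the same endpoint.
MODELS = ["mistralai/Mistral-7B-Instruct-v0.3", "meta-llama/Llama-3.1-8B-Instruct"]
IDLE_WAIT_S = 60  # illustrative: long enough for the idle model to be swapped out

# Alternate between the two models so each request forces a CPU-to-GPU swap-in.
for model in MODELS:
    ttft = measure_ttft(model, "Describe PCIe in one sentence.")
    print(f"{model}: TTFT {ttft:.2f} s")
    time.sleep(IDLE_WAIT_S)  # let the scheduler move this model back to CPU memory
```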

Model                    Input Length (Tokens)    TTFT (s)
Mistral-7B               128                      2.4
Mistral-7B               2,048                    2.57
Llama 3.1 8B Instruct    128                      2.9
Llama 3.1 8B Instruct    2,048                    3.0
Table 2. GPU memory swap results with Mistral-7B and Llama 3.1 8B Instruct

Model                    Input Length (Tokens)    TTFT (s)
Falcon-11B               128                      2.93
Falcon-11B               2,048                    3.13
Llama 3.1 8B Instruct    128                      2.9
Llama 3.1 8B Instruct    2,048                    3.13
Table 3. GPU memory swap results with Falcon-11B and Llama 3.1 8B Instruct

Both groups, Llama 3.1 8B Instruct paired with Mistral-7B and Llama 3.1 8B Instruct paired with Falcon-11B, produced consistent results across models and input sizes. Falcon-11B showed a slightly longer TTFT than Mistral-7B, as expected, due to its larger memory footprint. However, this variation (~0.5 seconds) is minimal and well within acceptable performance ranges for real-world scenarios.

These results, with TTFT of just 2-3 seconds depending on the model and input length, represent a roughly 50-66x improvement over scaling from zero. For example, Mistral-7B with a 128-token prompt drops from 145.90 seconds to 2.4 seconds, about a 61x reduction.

3. Baseline performance—warm models, high costs

To establish a baseline, we measured TTFT for models already fully loaded into GPU memory. This represents the theoretical best-case scenario in terms of latency.

Model                    Input Length (Tokens)    TTFT (s)
Llama 3.1 8B Instruct    128                      0.038
Llama 3.1 8B Instruct    2,048                    0.21
Mistral-7B               128                      0.036
Mistral-7B               2,048                    0.17
Falcon-11B               128                      0.05
Falcon-11B               2,048                    0.25
Table 4. Warm model performance as a baseline using Llama 3.1 8B Instruct, Mistral-7B, and Falcon-11B

Warm models deliver near-instant responses but require the GPU to be fully dedicated to the model at all times. This leads to significant costs when handling multiple models or replicas, as GPUs remain underutilized during periods of low demand.

Cost efficiency without compromise

Figure 1. Cost efficiency and TTFT of GPU memory swap vs. scale from zero across Falcon-11B, Llama 3.1 8B Instruct, and Mistral-7B

GPU memory swap achieves the ideal balance between performance and cost, reducing TTFT to just a few seconds. This approach enables organizations to consolidate workloads onto fewer GPUs while maintaining stringent SLAs, ensuring both efficiency and reliability. Compared to always-on warm models, this approach delivers significant cost savings with only a minor latency trade-off.

While NVIDIA Run:ai Model Streamer can reduce TTFT in scale-from-zero scenarios by a few tens of seconds, GPU memory swap pushes TTFT into the sub-10-second range, fitting applications that require such SLAs.

With GPU memory swap, you can maximize GPU efficiency, minimize idle costs, and maintain the responsiveness your users expect. Contact us to see GPU memory swap live and learn more about how NVIDIA Run:ai can transform your AI infrastructure.
