
Introducing NVIDIA BlueField-4-Powered Inference Context Memory Storage Platform for the Next Frontier of AI

AI‑native organizations increasingly face scaling challenges as agentic AI workflows drive context windows to millions of tokens and models scale toward trillions of parameters. These systems currently rely on agentic long‑term memory for context that persists across turns, tools, and sessions so agents can build on prior reasoning instead of starting from scratch on every request. 

As context windows increase, Key-Value (KV) cache capacity requirements grow proportionally, while the compute required to recalculate that history grows much faster (the attention term scales roughly quadratically with context length), making KV cache reuse and efficient storage essential for both performance and efficiency.

This increases pressure on existing memory hierarchies. AI providers are forced to choose between scarce GPU high‑bandwidth memory (HBM) and general‑purpose storage tiers that are optimized for durability, data management, and protection rather than for serving ephemeral, AI-native KV cache. The result is higher power consumption, inflated cost per token, and expensive GPUs left underutilized.

The NVIDIA Rubin platform enables AI-native organizations to scale inference infrastructure and meet the demands of the agentic era. The platform organizes AI infrastructure into compute pods: multi-rack units of GPUs, NVIDIA Spectrum‑X Ethernet networking, and storage that serve as the basic scale-out building block for AI factories.

Within each pod, the NVIDIA Inference Context Memory Storage (ICMS) platform provides a new class of AI-native storage infrastructure designed for gigascale inference. NVIDIA Spectrum‑X Ethernet provides predictable, low‑latency, high‑bandwidth RDMA connectivity, ensuring consistent, low‑jitter access to shared KV cache at scale.

Powered by the NVIDIA BlueField-4 data processor, the Rubin platform establishes an optimized context memory tier that augments existing networked object and file storage by holding latency‑sensitive, reusable inference context and prestaging it to increase GPU utilization. This additional context storage enables up to 5x higher tokens‑per‑second (TPS) and is up to 5x more power efficient than traditional storage.

This post explains how growing agentic AI workloads and long-context inference put increasing pressure on existing memory and storage tiers, and introduces the NVIDIA Inference Context Memory Storage (ICMS) platform as a new context tier in Rubin AI factories to deliver higher throughput, better power efficiency, and scalable KV cache reuse.

A new inference paradigm and a context storage challenge

Organizations face new scalability challenges as models evolve from simple chatbots to complex, multiturn agentic workflows. With foundation models reaching trillions of parameters and context windows spanning millions of tokens, the three AI scaling laws (pretraining, post-training, and test-time scaling) are driving a surge in compute-intensive reasoning. Agents are no longer stateless chatbots; they depend on long‑term memory of conversations, tools, and intermediate results that is shared across services and revisited over time.

In transformer-based models, that long‑term memory is realized as inference context, also known as the KV cache, which preserves inference-time context so the model does not recompute history for every new token. As sequence lengths increase, the KV cache grows linearly, and it must increasingly persist across longer sessions and be shared across inference services.
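As a rough illustration of why reuse matters, the back-of-the-envelope sketch below estimates KV cache size and the compute needed to rebuild that context from scratch for a hypothetical model. The model dimensions and the cost formula are illustrative assumptions, not specifications of any particular model or NVIDIA system; the point is the shape of the curves: storage grows linearly with context length, while recompute grows roughly quadratically.

```python
# Back-of-the-envelope sizing for a hypothetical transformer.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(seq_len, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache grows linearly with sequence length:
    2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def recompute_flops(seq_len, num_layers=80, hidden=8192):
    """Very rough attention-recompute cost: the quadratic term
    dominates at long context, so rebuilding lost context costs
    far more than storing it."""
    return num_layers * 2 * seq_len * seq_len * hidden

for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    pflops = recompute_flops(tokens) / 1e15
    print(f"{tokens:>9,} tokens: ~{gib:7.1f} GiB of KV, "
          f"~{pflops:9.1f} PFLOPs to recompute")
```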

This evolution positions KV cache as a unique class of AI‑native data defined by a specific duality: it is critical for performance yet inherently ephemeral. In agentic systems, KV cache effectively becomes the model’s long‑term memory, reused and extended across many steps rather than discarded after a single-prompt response. 

Unlike immutable enterprise records, inference context is derived and recomputable, demanding a storage architecture that prioritizes speed, scale, and power and cost efficiency over traditional data durability. In modern AI infrastructure, every megawatt of power is ultimately judged by how many useful tokens it can deliver.

Meeting these requirements stretches today’s memory and storage tiers to their limits. This is why organizations are rethinking how context is placed across GPU memory, host memory, and shared storage.

To understand the gap, it's helpful to look at how inference context currently moves across the G1–G4 hierarchy (Figure 1). AI infrastructure teams use orchestration frameworks, such as NVIDIA Dynamo, to manage context across these tiers (a simplified placement sketch follows the list):

  • G1 (GPU HBM) for hot, latency‑critical KV used in active generation 
  • G2 (system RAM) for staging and buffering KV off HBM
  • G3 (local SSDs) for warm KV that is reused over shorter timescales 
  • G4 (shared storage) for cold artifacts, history, and results that must be durable but are not on the immediate critical path 
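To make the division of labor concrete, here is a deliberately simplified placement sketch. It is not Dynamo's actual policy or API; the tier names, inputs, and thresholds are assumptions used only to illustrate how an orchestrator might map KV blocks onto G1–G4.

```python
from enum import Enum

class Tier(Enum):
    G1_HBM = 1        # active generation, latency-critical
    G2_DRAM = 2       # staging / spillover off HBM
    G3_LOCAL_SSD = 3  # warm reuse over shorter timescales
    G4_SHARED = 4     # cold, durable artifacts

def place_kv_block(age_s: float, reuse_prob: float, in_active_decode: bool) -> Tier:
    """Hypothetical placement policy: hotter, more reusable context
    stays closer to the GPU; cold, rarely reused context moves down.
    Thresholds are illustrative, not tuned values."""
    if in_active_decode:
        return Tier.G1_HBM
    if age_s < 5 and reuse_prob > 0.5:
        return Tier.G2_DRAM
    if age_s < 300 and reuse_prob > 0.1:
        return Tier.G3_LOCAL_SSD
    return Tier.G4_SHARED

print(place_kv_block(age_s=45, reuse_prob=0.3, in_active_decode=False))
```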

G1 is optimized for access speed, while G3 and G4 are optimized for durability. As context grows, KV cache quickly exhausts local capacity (G1–G3), and pushing it down to enterprise storage (G4) introduces unacceptable overheads that drive up both cost and power consumption.

Figure 1 illustrates this tradeoff, showing how KV cache usage becomes increasingly expensive as it moves farther from the GPU across the memory and storage hierarchy.

[Figure: four-tier KV cache hierarchy (G1 GPU HBM, G2 system DRAM, G3 local or rack-local SSD, G4 shared object/file storage), with access latency rising from nanoseconds to milliseconds and efficiency declining as context moves down the tiers]
Figure 1. KV cache memory hierarchy, from on‑GPU memory (G1) to shared storage (G4)

At the top of the hierarchy, GPU HBM (G1) delivers nanosecond-scale access and the highest efficiency, making it ideal for active KV cache used directly in token generation. As context grows beyond the physical limits of HBM, KV cache spills into system DRAM (G2) and local/rack-attached storage (G3), where access latency increases and per-token energy and cost begin to rise. While these tiers extend effective capacity, each additional hop introduces overhead that reduces overall efficiency.

At the bottom of the hierarchy, shared object and file storage (G4) provides durability and capacity, but at millisecond-level latency and the lowest efficiency for inference. While suitable for cold or shared artifacts, pushing active or frequently reused KV cache into this tier drives up power consumption and directly limits cost-efficient AI scaling.
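The efficiency gradient can be made concrete with a rough transfer-time estimate. The bandwidth figures below are order-of-magnitude placeholders for illustration only, not measured numbers for any tier or product; even with generous assumptions, each step down the hierarchy adds roughly an order of magnitude to the time a GPU may spend waiting for context.

```python
# Rough time to move a 10 GiB KV working set at illustrative,
# order-of-magnitude bandwidths (not measured figures).
KV_BYTES = 10 * 2**30

tiers_gbps = {            # assumed effective bandwidth, GB/s
    "G1 GPU HBM":        4000,
    "G2 system DRAM":     400,
    "G3 local NVMe SSD":   10,
    "G4 shared storage":    3,
}

for tier, gbps in tiers_gbps.items():
    ms = KV_BYTES / (gbps * 1e9) * 1e3
    print(f"{tier:<20} ~{ms:8.1f} ms to move 10 GiB")
```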

The key takeaway is that latency and efficiency are tightly coupled: as inference context moves away from the GPU, access latency increases, energy use and cost per token rise, and overall efficiency declines. This growing gap between performance-optimized memory and capacity-optimized storage is what forces AI infrastructure teams to rethink how growing KV cache context is placed, managed, and scaled across the system.

AI factories need a complementary, purpose‑built context layer that treats KV cache as its own AI‑native data class rather than forcing it into either scarce HBM or general‑purpose enterprise storage.

Introducing the NVIDIA Inference Context Memory Storage platform 

The NVIDIA Inference Context Memory Storage platform is a fully integrated storage infrastructure. It uses the NVIDIA BlueField-4 data processor to create a purpose-built context memory tier that operates at the pod level, bridging the gap between high-speed GPU memory and scalable shared storage. It accelerates KV cache access and high-speed sharing across nodes within the pod, enhancing performance and optimizing power consumption for the growing demands of large-context inference.

The platform establishes a new G3.5 layer, an Ethernet-attached flash tier optimized specifically for KV cache. This tier acts as the agentic long‑term memory of the AI infrastructure pod: large enough to hold shared, evolving context for many agents simultaneously, yet close enough for that context to be prestaged frequently back into GPU and host memory without stalling decode.

It provides petabytes of shared capacity per GPU pod, allowing long‑context workloads to retain history after eviction from HBM and DRAM. The history is stored in a lower‑power, flash‑based tier that extends the GPU and host memory hierarchy. The G3.5 tier delivers massive aggregate bandwidth with better efficiency than classic shared storage. This transforms KV cache into a shared, high‑bandwidth resource that orchestrators can coordinate across agents and services without rematerializing it independently on each node.

With a large portion of latency-sensitive, ephemeral KV cache now served from the G3.5 tier, durable G4 object and file storage can be reserved for what truly needs to persist over time. This includes inactive multiturn KV state, query history, logs, and other artifacts of multiturn inference that may be recalled in later sessions.

This reduces capacity and bandwidth pressure on G4 while still preserving application-level history where it matters. As inference scale increases, G1–G3 KV capacity grows with the number of GPUs but remains too small to cover all KV needs. ICMS fills this missing KV capacity between G1–G3 and G4.   

Inference frameworks like NVIDIA Dynamo use their KV block managers together with the NVIDIA Inference Transfer Library (NIXL) to orchestrate how inference context moves between memory and storage tiers, using ICMS as the context memory layer for KV cache. KV managers in these frameworks prestage KV blocks, bringing them from ICMS into G2 or G1 memory ahead of the decode phase.

This reliable prestaging, backed by the higher bandwidth and better power efficiency of ICMS compared to traditional storage, is designed to minimize stalls and reduce idle time, enabling up to 5x higher sustained TPS for long-context and agentic workloads. When combined with the NVIDIA BlueField-4 processor running the KV I/O plane, the system efficiently terminates NVMe-oF and object/RDMA protocols.
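The sketch below illustrates the prestaging idea in the abstract. It does not use real Dynamo or NIXL APIs; `fetch_from_icms` and `decode_step` are hypothetical stand-ins meant only to show context fetches overlapping with ongoing decode instead of stalling it.

```python
import asyncio

async def fetch_from_icms(block_id: str) -> bytes:
    """Hypothetical stand-in for an RDMA read of a KV block from
    the ICMS (G3.5) tier into host memory."""
    await asyncio.sleep(0.002)          # placeholder transfer time
    return b"kv-block:" + block_id.encode()

async def decode_step(kv_blocks: list[bytes]) -> None:
    """Hypothetical stand-in for one decode iteration on the GPU."""
    await asyncio.sleep(0.005)

async def serve_request(needed_blocks: list[str]) -> None:
    # Kick off fetches for upcoming KV blocks before decode needs them,
    # so transfers overlap with compute instead of stalling it.
    prefetch = [asyncio.create_task(fetch_from_icms(b)) for b in needed_blocks]
    ready: list[bytes] = []
    for task in prefetch:
        ready.append(await task)        # blocks only if not yet arrived
        await decode_step(ready)

asyncio.run(serve_request([f"blk-{i}" for i in range(4)]))
```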

Figure 2 shows how ICMS fits into the NVIDIA Rubin platform and AI factory stack. 

[Figure: layered view of ICMS in the NVIDIA Rubin platform, from the Dynamo/NIXL inference pool and Grove orchestration, through Rubin compute nodes with KV cache tiering across memory tiers, down to Spectrum-X-connected BlueField-4 ICMS nodes built on SSDs]
Figure 2. NVIDIA Inference Context Memory Storage architecture within the NVIDIA Rubin platform, from inference pool to BlueField-4 ICMS target nodes

At the inference layer, NVIDIA Dynamo and NIXL manage prefill, decode, and KV cache while coordinating access to shared context. Beneath that, a topology-aware orchestration layer built on NVIDIA Grove places workloads across racks with awareness of KV locality, so workloads can continue to reuse context even as they move between nodes.

At the compute node level, KV tiering spans GPU HBM, host memory, local SSDs, ICMS, and network storage, providing orchestrators with a continuum of capacity and latency targets for placing context. Tying it all together, Spectrum-X Ethernet links Rubin compute nodes with BlueField-4 ICMS target nodes, providing consistently low latency and efficient networking that integrates flash-backed context memory into the same AI-optimized fabric that serves training and inference.

Powering the NVIDIA Inference Context Memory Storage platform

NVIDIA BlueField-4 powers ICMS with 800 Gb/s connectivity, a 64-core NVIDIA Grace CPU, and high-bandwidth LPDDR memory. Its dedicated hardware acceleration engines deliver line‑rate encryption and CRC data protection at up to 800 Gb/s.

These crypto and integrity accelerators are designed to be used as part of the KV pipeline, securing and validating KV flows without adding host CPU overhead. By leveraging standard NVMe and NVMe-oF transports, including NVMe KV extensions, ICMS maintains interoperability with standard storage infrastructure while delivering the specialized performance required for KV cache. 
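A keyed interface suits KV cache because blocks can be addressed by the content that produced them. The sketch below shows one common prefix-hashing pattern for deriving such keys; it is a generic illustration of the idea, not the specific keying scheme used by ICMS, Dynamo, or the NVMe KV command set.

```python
import hashlib

BLOCK_TOKENS = 256  # illustrative block granularity

def kv_block_keys(token_ids: list[int], model_id: str) -> list[str]:
    """Derive a stable key per KV block from the token prefix that
    produced it, so identical prefixes map to identical keys and any
    node can look up already-computed context."""
    keys, running = [], hashlib.sha256(model_id.encode())
    for start in range(0, len(token_ids), BLOCK_TOKENS):
        chunk = token_ids[start:start + BLOCK_TOKENS]
        running.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        keys.append(running.copy().hexdigest())
    return keys

print(kv_block_keys(list(range(600)), model_id="example-model")[:2])
```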

The architecture uses BlueField‑4 to accelerate KV I/O and control plane operations across both the DPUs on Rubin compute nodes and the controllers in ICMS flash enclosures, reducing reliance on the host CPU and minimizing serialization and host memory copies. Spectrum‑X Ethernet provides the AI‑optimized RDMA fabric that links ICMS flash enclosures and GPU nodes with predictable, low‑latency, high‑bandwidth connectivity.

In addition, the NVIDIA DOCA framework introduces a KV communication and storage layer that treats context cache as a first-class resource for KV management, sharing, and placement, leveraging the unique properties of KV blocks and inference patterns. DOCA interfaces with inference frameworks, while BlueField-4 transfers the KV cache efficiently to and from the underlying flash media.

This stateless and scalable approach aligns with AI-native KV cache strategies and leverages NIXL and Dynamo for advanced sharing across AI nodes and improved inference performance. The DOCA framework supports open interfaces for broader orchestration, providing flexibility to storage partners to expand their inference solutions to cover the G3.5 context tier.

Spectrum-X Ethernet serves as the high-performance network fabric for RDMA-based access to AI-native KV cache, enabling efficient data sharing and retrieval for the NVIDIA Inference Context Memory Storage platform. Spectrum-X Ethernet is purpose-built for AI, delivering predictable, low-latency, high-bandwidth connectivity at scale. It achieves this through advanced congestion control, adaptive routing, and optimized lossless RoCE, which minimizes jitter, tail latency, and packet loss under heavy load. 

With very high effective bandwidth, deep telemetry, and hardware-assisted performance isolation, Spectrum-X Ethernet enables consistent, repeatable performance in large, multitenant AI fabrics while remaining fully standards-based and interoperable with open networking software. Spectrum-X Ethernet enables ICMS to scale with consistent high performance, maximizing throughput and responsiveness for multiturn, agentic inference workloads.

Delivering power‑efficient, high-throughput KV cache storage

Power availability is the primary constraint for scaling AI factories, making energy efficiency a defining metric for gigascale inference. Traditional, general-purpose storage stacks sacrifice this efficiency because they run on x86‑based controllers and expend significant energy on features like metadata management, replication, and background consistency checks that are unnecessary for ephemeral, reconstructable KV data.

KV cache fundamentally differs from enterprise data: it is transient, derived, and recomputable if lost. As inference context, it does not require the durability, redundancy, or extensive data protection mechanisms designed for long-lived records. Applying these heavy storage services to KV cache introduces unnecessary overhead, increasing latency and power consumption while degrading inference efficiency. By recognizing KV cache as a distinct, AI-native data class, ICMS eliminates this excess overhead, enabling up to 5x improvements in power efficiency compared to general-purpose storage approaches.

This efficiency extends beyond the storage tier to the compute fabric itself. By reliably prestaging context and reducing or avoiding decode stalls, ICMS prevents GPUs from wasting energy on idle cycles or redundant recomputation of history, which results in up to 5x higher TPS. This approach ensures that power is directed toward active reasoning rather than infrastructure overhead, maximizing effective tokens‑per‑watt for the entire AI pod.
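To see how throughput and power combine into tokens‑per‑watt, here is a simple calculation with illustrative numbers only (not measured ICMS or Rubin results): if a pod serves roughly 5x more tokens from about the same power budget, tokens‑per‑watt improves by nearly the same factor.

```python
def tokens_per_watt(tps: float, pod_power_kw: float) -> float:
    """Effective tokens per second per watt for a pod."""
    return tps / (pod_power_kw * 1e3)

# Illustrative numbers only: a pod that avoids decode stalls and
# recomputation serves more tokens from roughly the same power budget.
baseline = tokens_per_watt(tps=50_000, pod_power_kw=800)
improved = tokens_per_watt(tps=250_000, pod_power_kw=820)  # ~5x TPS, small storage-power delta
print(f"baseline: {baseline:.3f} tok/s/W, improved: {improved:.3f} tok/s/W "
      f"({improved / baseline:.1f}x)")
```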

Enabling gigascale agentic AI with better performance and TCO

The BlueField‑4–powered ICMS provides AI‑native organizations with a new way to scale agentic AI: a pod‑level context tier that extends effective GPU memory and turns KV cache into a shared high‑bandwidth, long‑term memory resource across NVIDIA Rubin pods. By offloading KV movement and treating context as a reusable, nondurable data class, ICMS reduces recomputation and decode stalls, translating higher tokens‑per‑second directly into more queries served, more agents running concurrently, and shorter tail latencies at scale.

Together, these gains improve total cost of ownership (TCO) by enabling teams to fit more usable AI capacity into the same rack, row, or data center, extend the life of existing facilities, and plan future expansions around GPU capacity instead of storage overhead.

To learn more about the NVIDIA BlueField-4-powered Inference Context Memory Storage platform, see the press release and the NVIDIA BlueField-4 datasheet.

Watch NVIDIA Live at CES 2026 with CEO Jensen Huang and explore related sessions.
