DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient million-token context inference.
DeepSeek-V4-Pro is the largest model in the family, with 1.6T total parameters and 49B active parameters. DeepSeek-V4-Flash is a smaller 284B-parameter model with 13B active parameters, designed for higher-speed, higher-efficiency workloads. Both models support up to a 1M-token context window, opening new possibilities for long-context coding, document analysis, retrieval, and agentic AI workflows.
| Specification | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
|---|---|---|
| Modality | Text | Text |
| Total parameters | 1.6T | 284B |
| Active parameters | 49B | 13B |
| Context length | 1M tokens | 1M tokens |
| Max output length | Up to 384K tokens (per DeepSeek API docs) | Up to 384K tokens (per DeepSeek API docs) |
| Primary use cases | Advanced reasoning, coding, long-context agents | High-speed efficiency, chat, routing, summarization |
| License | MIT | MIT |
Architectural innovations for long-context inference
The V4 family builds on the DeepSeek MoE architecture, with an increased focus on optimizing the attention component of the transformer architecture. These innovations are designed to achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden compared with DeepSeek-V3.2.
That matters because long context is becoming a core requirement for agentic applications. Agents store more than a single prompt and response. They carry system instructions, tool outputs, retrieved context, code, logs, memory, and multi-step reasoning traces across a workflow. As context windows grow, attention and KV cache become major bottlenecks.

The core architectural solution to this challenge is hybrid attention, which blends:
- Compressed Sparse Attention (CSA): Uses dynamic sequence compression to shrink KV entries, reducing the KV cache memory footprint, and then applies DeepSeek Sparse Attention (DSA) to sparsify the attention matrices and cut computational overhead.
- Heavily Compressed Attention (HCA): Applies much more aggressive compression by consolidating KV entries across sets of tokens into a single compressed entry, resulting in a significant reduction in KV cache size.
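To see why KV cache compression matters at a 1M-token context, here is a back-of-envelope sizing sketch. The layer count, head count, and head dimension below are illustrative assumptions, not published DeepSeek-V4 internals; only the 90% reduction figure comes from the text above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV cache size: 2 tensors (K and V) per layer,
    stored for every token in the sequence (FP16/BF16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration for a large MoE model (illustrative only).
layers, kv_heads, head_dim = 60, 8, 128
ctx = 1_000_000  # 1M-token context window

baseline = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
compressed = baseline * (1 - 0.90)  # the 90% KV cache reduction cited above

print(f"baseline  : {baseline / 2**30:.1f} GiB per sequence")
print(f"compressed: {compressed / 2**30:.1f} GiB per sequence")
```

Even under these modest assumptions, an uncompressed 1M-token KV cache runs to hundreds of GiB per sequence, which is why a 90% reduction changes what a single node can serve.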
DeepSeek-V4’s architectural innovations signal a shift from basic chat toward multi-turn, long-context inference and agentic systems. This new paradigm stresses the entire stack – software, memory, compute, and networking – fundamentally altering the dynamics of inference economics. As open models reach the frontier of intelligence, the enterprise focus is pivoting from model selection to infrastructure strategy. In this landscape, the ultimate competitive advantage is the ability to deploy and scale these high-performance models at the lowest token cost.
Out-of-the-box NVIDIA Blackwell performance insights
Whether developers are deploying the 1.6T Pro model for advanced reasoning or the 284B Flash model for high-speed efficiency, Blackwell provides the scale and low-latency performance required for a new era of 1M long-context inference and trillion-parameter intelligence.
The NVIDIA Blackwell Platform is built for this class of workload. Out-of-the-box tests of DeepSeek-V4-Pro on NVIDIA GB200 NVL72 demonstrate over 150 tokens/sec/user. In addition to these initial tests, the NVIDIA team leveraged vLLM’s Day 0 NVIDIA Blackwell B300 recipe to produce a snapshot of out-of-the-box performance across the Pareto frontier (Figure 2).

Expect this performance to climb even higher as we optimize our entire extreme co-design stack: Dynamo, NVFP4, optimized CUDA kernels, advanced parallelization techniques, and beyond.
Build with NVIDIA GPU-accelerated endpoints
Developers can start building with DeepSeek V4 through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program. Hosted endpoints provide a fast way to prototype with the latest models before moving to self-hosted deployment paths.
DeepSeek V4 is also available to download on day 0 with NVIDIA NIM, so it can be deployed to build long-context coding, document analysis, and agentic workflows using familiar API patterns.
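As a sketch of the familiar API pattern, the hosted endpoints on build.nvidia.com follow the OpenAI-compatible chat-completions shape. The model identifier below is an assumption; check the model card on build.nvidia.com for the exact name.

```python
import json
import urllib.request

API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
API_KEY = "nvapi-..."  # your NVIDIA API key from build.nvidia.com

payload = {
    # Assumed model identifier; confirm on the build.nvidia.com model card.
    "model": "deepseek-ai/deepseek-v4",
    "messages": [
        {"role": "user",
         "content": "Summarize the tradeoffs of sparse attention."},
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# Uncomment to send the request with a valid API key:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload works unchanged against a self-hosted NIM, SGLang, or vLLM server by swapping the base URL.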
Deploying with SGLang
SGLang offers three primary serving recipes for DeepSeek‑V4 on NVIDIA Blackwell and Hopper, each tuned for a different latency/throughput profile (low‑latency, balanced, and max‑throughput), along with specialized recipes for long‑context workloads and for prefill/decode disaggregation.
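As an illustrative single-node launch (the model path, parallelism, and context flags here are assumptions; consult the published SGLang recipe for the tuned low-latency, balanced, and max-throughput configurations):

```shell
# Hypothetical single-node SGLang launch for DeepSeek-V4.
# Flag values are placeholders -- use the official recipe's settings.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4 \
  --tp 8 \
  --context-length 1000000 \
  --port 30000
```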
Deploying with vLLM
vLLM provides DeepSeek‑V4 single‑node and multinode serving recipes for NVIDIA Blackwell and Hopper, including multinode prefill/decode disaggregation recipes scaling up to 100+ GPUs, with support for tool calling, reasoning, and speculative decoding.
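As a comparable illustrative sketch for vLLM (again, the model path and flag values are assumptions; the published recipes cover the tuned multinode and disaggregated configurations):

```shell
# Hypothetical single-node vLLM serve command for DeepSeek-V4.
# Flag values are placeholders -- use the official recipe's settings.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --enable-auto-tool-choice \
  --tool-call-parser deepseek_v3
```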
Powering agentic workflows
DeepSeek V4 is especially well suited to agents, as it excels at long-context orchestration, reasoning, and tool calling. To get started, developers can configure DeepSeek V4 as the LLM:
- NVIDIA NemoClaw: Run OpenClaw in a secure OpenShell environment to create a long-running personal assistant powered by DeepSeek V4 for tasks like code generation, personal assistance, autonomous support, and more. Run `nemoclaw onboard` and, during step 3, enter your DeepSeek V4 provider URL and model name.
- NVIDIA AI-Q Blueprint: The blueprint makes a best-in-class deep research assistant available to you or your agents. Based on LangChain Deep Agents, it is extensible, making it easy to add DeepSeek V4 into your workflow for orchestration and planning.
- NVIDIA Data Explorer Agent: The agent won 1st place on the DABstep benchmark and excels at data analysis, data science, and tabular research. It is written with the NeMo Agent Toolkit, making it easy to switch to DeepSeek V4.
The best part of using open agent harnesses and open models is that you can always try new models to stay on the bleeding edge.
Get started with DeepSeek
From data center deployments on NVIDIA Blackwell to managed NIM microservices and fine-tuning workflows, NVIDIA provides a range of options for integrating DeepSeek and other open models across different stages of development and deployment. NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is committed to optimizing community software, and open models let users broadly share work in AI safety and resilience.
To get started, check out DeepSeek-V4 on Hugging Face or test out DeepSeek-V4-Pro on build.nvidia.com.