DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient million-token context inference.
DeepSeek-V4-Pro is the largest model in the family, with 1.6T total parameters and 49B active parameters. DeepSeek-V4-Flash is a smaller 284B-parameter model with 13B active parameters, designed for higher-speed, higher-efficiency workloads. Both models support up to a 1M-token context window, opening new possibilities for long-context coding, document analysis, retrieval, and agentic AI workflows.
| Specification | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
|---|---|---|
| Modality | Text | Text |
| Total parameters | 1.6T | 284B |
| Active parameters | 49B | 13B |
| Context length | 1M tokens | 1M tokens |
| Max output length | Up to 384K tokens (per DeepSeek API docs) | Up to 384K tokens (per DeepSeek API docs) |
| Primary use cases | Advanced reasoning, coding, long-context agents | High-speed efficiency, chat, routing, summarization |
| License | MIT | MIT |
Architectural innovations for long-context inference
The V4 family builds on the DeepSeek MoE architecture, with an increased focus on optimizing the attention component of the transformer architecture. These innovations are designed to achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden compared with DeepSeek-V3.2.
That matters because long context is becoming a core requirement for agentic applications. Agents store more than a single prompt and response. They carry system instructions, tool outputs, retrieved context, code, logs, memory, and multi-step reasoning traces across a workflow. As context windows grow, attention and KV cache become major bottlenecks.

The core architectural solution to this challenge is hybrid attention, which blends:
- Compressed Sparse Attention (CSA): Uses dynamic sequence compression to shrink KV entries, reducing the KV cache memory footprint, and then applies DeepSeek Sparse Attention (DSA) to sparsify the attention matrices and cut computational overhead.
- Heavily Compressed Attention (HCA): Applies much more aggressive compression by consolidating KV entries across sets of tokens into a single compressed entry, resulting in a significant reduction in KV cache size.
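To see why KV cache compression matters at a 1M-token context, here is a back-of-envelope sizing sketch. The layer count, head count, and head dimension below are illustrative assumptions, not published DeepSeek-V4 internals; only the 90% reduction figure comes from the text above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV cache size: 2 tensors (K and V) per layer,
    stored for every token in the sequence (FP16/BF16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration for a large MoE model (illustrative only).
layers, kv_heads, head_dim = 60, 8, 128
ctx = 1_000_000  # 1M-token context window

baseline = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
compressed = baseline * (1 - 0.90)  # the 90% KV cache reduction cited above

print(f"baseline  : {baseline / 2**30:.1f} GiB per sequence")
print(f"compressed: {compressed / 2**30:.1f} GiB per sequence")
```

Even under these modest assumptions, an uncompressed 1M-token KV cache runs to hundreds of GiB per sequence, which is why a 90% reduction changes what a single node can serve.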
DeepSeek-V4’s architectural innovations signal a shift from basic chat toward multi-turn, long-context inference and agentic systems. This new paradigm stresses the entire stack – software, memory, compute, and networking – fundamentally altering the dynamics of inference economics. As open models reach the frontier of intelligence, the enterprise focus is pivoting from model selection to infrastructure strategy. In this landscape, the ultimate competitive advantage is the ability to deploy and scale these high-performance models at the lowest token cost.
Out-of-the-box NVIDIA Blackwell performance insights
Whether developers are deploying the 1.6T Pro model for advanced reasoning or the 284B Flash model for high-speed efficiency, Blackwell provides the scale and low-latency performance required for a new era of 1M long-context inference and trillion-parameter intelligence.
The NVIDIA Blackwell Platform is built for this class of workload. Out-of-the-box tests of DeepSeek-V4-Pro on NVIDIA GB200 NVL72 demonstrate over 150 tokens/sec/user. In addition to these initial tests, the NVIDIA team leveraged vLLM’s Day 0 NVIDIA Blackwell B300 recipe to produce a snapshot of out-of-the-box performance across the Pareto frontier (Figure 2).

Expect this performance to climb even higher as we optimize our entire extreme co-design stack: Dynamo, NVFP4, optimized CUDA kernels, advanced parallelization techniques, and beyond.
Build with NVIDIA GPU-accelerated endpoints
Developers can start building with DeepSeek V4 through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program. Hosted endpoints provide a fast way to prototype with the latest models before moving to self-hosted deployment paths.
DeepSeek V4 is also available to download on day 0 with NVIDIA NIM, so it can be deployed to build long-context coding, document analysis, and agentic workflows using familiar API patterns.
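As a sketch of the familiar API pattern, the hosted endpoints on build.nvidia.com follow the OpenAI-compatible chat-completions shape. The model identifier below is an assumption; check the model card on build.nvidia.com for the exact name.

```python
import json
import urllib.request

API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
API_KEY = "nvapi-..."  # your NVIDIA API key from build.nvidia.com

payload = {
    # Assumed model identifier; confirm on the build.nvidia.com model card.
    "model": "deepseek-ai/deepseek-v4",
    "messages": [
        {"role": "user",
         "content": "Summarize the tradeoffs of sparse attention."},
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# Uncomment to send the request with a valid API key:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload works unchanged against a self-hosted NIM, SGLang, or vLLM server by swapping the base URL.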
Deploying with SGLang
SGLang offers three primary serving recipes for DeepSeek‑V4 on NVIDIA Blackwell and Hopper, each tuned for a different latency/throughput profile (low‑latency, balanced, and max‑throughput), along with specialized recipes for long‑context workloads and for prefill/decode disaggregation.
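As an illustrative single-node launch (the model path, parallelism, and context flags here are assumptions; consult the published SGLang recipe for the tuned low-latency, balanced, and max-throughput configurations):

```shell
# Hypothetical single-node SGLang launch for DeepSeek-V4.
# Flag values are placeholders -- use the official recipe's settings.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4 \
  --tp 8 \
  --context-length 1000000 \
  --port 30000
```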
Deploying with vLLM
vLLM provides DeepSeek‑V4 single‑node and multinode serving recipes for NVIDIA Blackwell and Hopper, including multinode prefill/decode disaggregation recipes scaling up to 100+ GPUs, with support for tool calling, reasoning, and speculative decoding.
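As a comparable illustrative sketch for vLLM (again, the model path and flag values are assumptions; the published recipes cover the tuned multinode and disaggregated configurations):

```shell
# Hypothetical single-node vLLM serve command for DeepSeek-V4.
# Flag values are placeholders -- use the official recipe's settings.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --enable-auto-tool-choice \
  --tool-call-parser deepseek_v3
```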
Powering agentic workflows
DeepSeek V4 is especially well suited to agents, as it excels at long-context orchestration, reasoning, and tool calling. To get started, developers can configure DeepSeek V4 as the LLM:
- NVIDIA NemoClaw: Run OpenClaw in a secure OpenShell environment to create a long-running personal assistant powered by DeepSeek V4 for tasks like code generation, personal assistance, autonomous support, and more. Run `nemoclaw onboard` and, during step 3, enter your DeepSeek V4 provider URL and model name.
- NVIDIA AI-Q Blueprint: The blueprint makes a best-in-class deep research assistant available to you or your agents. Based on LangChain Deep Agents, it is extensible, making it easy to add DeepSeek V4 into your workflow for orchestration and planning.
- NVIDIA Data Explorer Agent: The agent won 1st place on the DABstep benchmark and excels at data analysis, data science, and tabular research. It is written with the NeMo Agent Toolkit, making it easy to switch to DeepSeek V4.
The best part of using open agent harnesses and open models is that you can always try new models to stay on the bleeding edge.
Get started with DeepSeek
From data center deployments on NVIDIA Blackwell to managed NIM microservices and fine-tuning workflows, NVIDIA provides a range of options for integrating DeepSeek and other open models across different stages of development and deployment. NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is committed to optimizing community software, and open models let users broadly share work in AI safety and resilience.
To get started, check out DeepSeek-V4 on Hugging Face or test out DeepSeek-V4-Pro on build.nvidia.com.