
NVIDIA-Accelerated Mistral 3 Open Models Deliver Efficiency, Accuracy at Any Scale 

The new Mistral 3 open model family delivers industry-leading accuracy, efficiency, and customization capabilities for developers and enterprises. Optimized from NVIDIA GB200 NVL72 to edge platforms, Mistral 3 includes: 

  • One large state-of-the-art sparse multimodal and multilingual mixture of experts (MoE) model with a total parameter count of 675B 
  • A suite of small, dense high-performance models (called Ministral 3) of sizes 3B, 8B, and 14B, each with Base, Instruct, and Reasoning variants (nine models total) 

All the models were trained on NVIDIA Hopper GPUs and are now available through Mistral AI on Hugging Face. Developers can choose from a variety of options for deploying these models on different NVIDIA GPUs with different model precision formats and open source framework compatibility (Table 1). 

| | Mistral Large 3 | Ministral-3-14B | Ministral-3-8B | Ministral-3-3B |
|---|---|---|---|---|
| Total parameters | 675B | 14B | 8B | 3B |
| Active parameters | 41B | 14B | 8B | 3B |
| Context window | 256K | 256K | 256K | 256K |
| Base | – | BF16 | BF16 | BF16 |
| Instruct | – | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 |
| Reasoning | Q4_K_M, NVFP4, FP8 | Q4_K_M, BF16 | Q4_K_M, BF16 | Q4_K_M, BF16 |
| Frameworks | | | | |
| vLLM | ✔ | ✔ | ✔ | ✔ |
| SGLang | ✔ | – | – | – |
| TensorRT-LLM | ✔ | – | – | – |
| Llama.cpp | – | ✔ | ✔ | ✔ |
| Ollama | – | ✔ | ✔ | ✔ |
| NVIDIA hardware | | | | |
| GB200 NVL72 | ✔ | ✔ | ✔ | ✔ |
| Dynamo | ✔ | ✔ | ✔ | ✔ |
| DGX Spark | ✔ | ✔ | ✔ | ✔ |
| RTX | – | ✔ | ✔ | ✔ |
| Jetson | – | ✔ | ✔ | ✔ |

Table 1. Mistral 3 model specifications

Mistral Large 3 delivers best-in-class performance on NVIDIA GB200 NVL72  

NVIDIA-accelerated Mistral Large 3 achieves best-in-class performance on NVIDIA GB200 NVL72 by leveraging a comprehensive stack of optimizations tailored for large state-of-the-art MoEs. Figure 1 shows the performance Pareto frontiers for GB200 NVL72 and NVIDIA H200 across the interactivity range. 

[Line chart: "Performance per MW on Mistral Large 3 NVFP4, ISL/OSL 1K/8K." The x-axis shows TPS per user (interactivity); the y-axis shows TPS per megawatt. GB200 starts near 5,000,000 TPS/MW at roughly 40 TPS/user, H200 near 2,000,000 TPS/MW at roughly 15 TPS/user; both slope downward as interactivity increases, with GB200 consistently far higher.]
Figure 1. Performance per megawatt for Mistral Large 3, comparing NVIDIA GB200 NVL72 and NVIDIA H200 across different interactivity targets

For production AI systems that must deliver both a strong user experience (UX) and cost-efficient scale, GB200 NVL72 provides up to 10x higher performance than the previous-generation H200, exceeding 5,000,000 tokens per second per megawatt (MW) at 40 tokens per second per user. 

This generational gain translates to better UX, lower per-token cost, and higher energy efficiency for the new model. The gain is primarily driven by the following components of the inference optimization stack: 

  • NVIDIA TensorRT-LLM Wide Expert Parallelism (Wide-EP) provides optimized MoE GroupGEMM kernels, expert distribution and load balancing, and expert scheduling to fully exploit the NVL72 coherent memory domain. Of particular interest is how resilient this Wide-EP feature set is to architectural variations across large MoEs: it enables a model such as Mistral Large 3, with 128 experts per layer (roughly half as many as DeepSeek-R1), to still realize the high-bandwidth, low-latency, non-blocking benefits of the NVIDIA NVLink fabric. 
  • NVFP4 low-precision inference maintains accuracy while improving efficiency, with support in SGLang, TensorRT-LLM, and vLLM. 
  • Mistral Large 3 relies on NVIDIA Dynamo, a low-latency distributed inference framework, to rate-match and disaggregate the prefill and decode phases of inference. This in turn boosts performance for long-sequence workloads, such as the 1K/8K ISL/OSL configuration shown in Figure 1. 

As with all models, upcoming performance optimizations—such as speculative decoding with multitoken prediction (MTP) and EAGLE-3—are expected to push performance further, unlocking even more benefits from this new model. 

NVFP4 quantization 

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint that was quantized offline using the open source llm-compressor library. This reduces compute and memory costs while maintaining accuracy, because NVFP4's higher-precision FP8 scaling factors and finer-grained block scaling keep quantization error under control. 

The recipe targets only the MoE weights while keeping all other components at their original checkpoint precision. Because NVFP4 is native to NVIDIA Blackwell, this variant deploys seamlessly on GB200 NVL72. 
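
To make the shape of such a recipe concrete, here is a minimal sketch using llm-compressor's one-shot quantization flow. The model ID, the NVFP4 scheme identifier, and the module patterns that restrict quantization to the MoE experts are assumptions for illustration, not the exact recipe behind the published checkpoint; depending on the scheme, a small calibration dataset may also be required.

```python
# Hypothetical sketch of an offline NVFP4 quantization pass with llm-compressor.
# The model ID, scheme name, and ignore patterns are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3-Instruct"  # placeholder Hugging Face ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize only the Linear layers inside the MoE experts to NVFP4; keep attention,
# embeddings, and the LM head at their original precision, as described above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",  # assumed identifier for the NVFP4 scheme
    ignore=["lm_head", "re:.*self_attn.*", "re:.*embed.*"],  # assumed module patterns
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("mistral-large-3-nvfp4")
tokenizer.save_pretrained("mistral-large-3-nvfp4")
```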

Open source inference 

These open weight models can be used with your open source inference framework of choice. TensorRT-LLM leverages optimizations for large MoE models to boost performance on GB200 NVL72 systems. To get started, you can use the TensorRT-LLM preconfigured Docker container.  
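
If you prefer a Python entry point over the container, the sketch below uses the TensorRT-LLM LLM API. The checkpoint ID and parallelism settings are placeholders you would adjust to your checkpoint and GPU topology.

```python
# Minimal sketch of generating text with the TensorRT-LLM LLM API.
# The checkpoint ID and parallelism settings are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-3-Instruct-NVFP4",  # placeholder checkpoint ID
    tensor_parallel_size=8,  # example only; size this to your Blackwell/GB200 topology
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain expert parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```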

NVIDIA collaborated with vLLM to expand support for kernel integrations for speculative decoding (EAGLE), NVIDIA Blackwell, disaggregation, and expanded parallelism. To get started, you can deploy the launchable that uses vLLM on NVIDIA cloud GPUs. To see the boilerplate code for serving the model and sample API calls for common use cases, check out Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs.
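
As a minimal client-side sketch, the snippet below queries a vLLM OpenAI-compatible endpoint (for example, one started with `vllm serve`). The endpoint URL and model ID are placeholders; see the linked guide for the full serving configuration.

```python
# Sketch of querying a locally served vLLM OpenAI-compatible endpoint.
# The endpoint URL and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize the Mistral 3 model family."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```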

Figure 2 shows the range of GPUs available in the NVIDIA build platform where you can deploy Mistral Large 3 and Ministral 3. You can select the appropriate GPU size and configuration for your needs.  

[Screenshot: the "Select your Compute" page of the brev.dev console, where users choose from GPU options including H200, H100, A100, L40S, and A10.]
Figure 2. A range of GPUs is available in the NVIDIA build platform where developers can deploy Mistral Large 3 and Ministral 3

NVIDIA also collaborated with SGLang to create an implementation of Mistral Large 3 with disaggregation and speculative decoding. For details, see the SGLang documentation.

Ministral 3 models deliver speed, versatility, and accuracy   

The small, dense, high-performance Ministral 3 models are designed for edge deployment. Offering flexibility for a variety of needs, they come in three parameter sizes—3B, 8B, and 14B—each with Base, Instruct, and Reasoning variants. You can try the models on edge platforms like NVIDIA GeForce RTX AI PCs, NVIDIA DGX Spark, and NVIDIA Jetson.

When developing locally, you still get the benefit of NVIDIA acceleration: NVIDIA collaborated with Ollama and llama.cpp to deliver faster iteration, lower latency, and greater data privacy. You can expect fast inference at up to 385 tokens per second on the NVIDIA RTX 5090 GPU with the Ministral-3B variants. Get started with Llama.cpp and Ollama. 
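
For a quick local test on an RTX system, here is a minimal sketch using the llama-cpp-python bindings. The GGUF file name is a placeholder for whichever Q4_K_M Ministral 3 checkpoint you download.

```python
# Minimal local-inference sketch with llama-cpp-python on an RTX GPU.
# The GGUF file name is a placeholder for the Ministral 3 checkpoint you download.
from llama_cpp import Llama

llm = Llama(
    model_path="ministral-3-3b-instruct-q4_k_m.gguf",  # placeholder file name
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # context length for this session
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about edge AI."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```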

For Ministral-3-3B-Instruct, Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second at single concurrency, scaling up to 273 tokens per second at a concurrency of 8. 
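
To see that concurrency scaling in practice, the sketch below issues eight concurrent requests against the container's OpenAI-compatible endpoint. The endpoint URL and model ID are placeholders for your Jetson Thor deployment.

```python
# Sketch of sending 8 concurrent requests to a vLLM OpenAI-compatible endpoint,
# such as the vLLM container running on Jetson Thor. URL and model ID are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mistralai/Ministral-3-3B-Instruct",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Give me one fact about edge AI (item {i})." for i in range(8)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```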

Production-ready deployment with NVIDIA NIM 

Mistral Large 3 and Ministral-3-14B-Instruct are available through the NVIDIA API catalog and a preview API, so developers can get started with minimal setup. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure. 
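
A minimal sketch of calling the hosted preview endpoint through the OpenAI-compatible API is shown below. The exact model identifier on the API catalog is an assumption, and you supply your own API key.

```python
# Sketch of streaming a response from the NVIDIA API catalog's
# OpenAI-compatible endpoint. The model identifier is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

stream = client.chat.completions.create(
    model="mistralai/mistral-large-3-instruct",  # assumed catalog model ID
    messages=[{"role": "user", "content": "What makes MoE models efficient?"}],
    max_tokens=256,
    stream=True,  # stream tokens as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```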

Video 1. Mistral 3 users can input text and images and view the response from the hosted model

Get started building with open source AI 

The NVIDIA-accelerated Mistral 3 open model family represents a major leap for transatlantic AI in the open source community. With both a large-scale MoE and edge-friendly dense transformers, the family meets developers where they are and within their development lifecycle. 

With NVIDIA-optimized performance, advanced quantization techniques like NVFP4, and broad framework support, developers can achieve exceptional efficiency and scalability from cloud to edge. To get started, download Mistral 3 models from Hugging Face or test deployment-free on build.nvidia.com/mistralai. 
