
NVIDIA-Accelerated Mistral 3 Open Models Deliver Efficiency, Accuracy at Any Scale 

The new Mistral 3 open model family delivers industry-leading accuracy, efficiency, and customization capabilities for developers and enterprises. Optimized from NVIDIA GB200 NVL72 to edge platforms, Mistral 3 includes: 

  • One large state-of-the-art sparse multimodal and multilingual mixture of experts (MoE) model with a total parameter count of 675B 
  • A suite of small, dense high-performance models (called Ministral 3) of sizes 3B, 8B, and 14B, each with Base, Instruct, and Reasoning variants (nine models total) 

All the models were trained on NVIDIA Hopper GPUs and are now available through Mistral AI on Hugging Face. Developers can choose from a variety of options for deploying these models on different NVIDIA GPUs with different model precision formats and open source framework compatibility (Table 1). 

| | Mistral Large 3 | Ministral-3-14B | Ministral-3-8B | Ministral-3-3B |
|---|---|---|---|---|
| Total parameters | 675B | 14B | 8B | 3B |
| Active parameters | 41B | 14B | 8B | 3B |
| Context window | 256K | 256K | 256K | 256K |
| Base | – | BF16 | BF16 | BF16 |
| Instruct | – | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 |
| Reasoning | Q4_K_M, NVFP4, FP8 | Q4_K_M, BF16 | Q4_K_M, BF16 | Q4_K_M, BF16 |
| Frameworks | | | | |
| vLLM | ✔ | ✔ | ✔ | ✔ |
| SGLang | ✔ | – | – | – |
| TensorRT-LLM | ✔ | – | – | – |
| Llama.cpp | – | ✔ | ✔ | ✔ |
| Ollama | – | ✔ | ✔ | ✔ |
| NVIDIA hardware | | | | |
| GB200 NVL72 | ✔ | ✔ | ✔ | ✔ |
| Dynamo | ✔ | ✔ | ✔ | ✔ |
| DGX Spark | ✔ | ✔ | ✔ | ✔ |
| RTX | – | ✔ | ✔ | ✔ |
| Jetson | – | ✔ | ✔ | ✔ |

Table 1. Mistral 3 model specifications

Mistral Large 3 delivers best-in-class performance on NVIDIA GB200 NVL72  

NVIDIA-accelerated Mistral Large 3 achieves best-in-class performance on NVIDIA GB200 NVL72 by leveraging a comprehensive stack of optimizations tailored for large state-of-the-art MoEs. Figure 1 shows the performance Pareto frontiers for GB200 NVL72 and NVIDIA H200 across the interactivity range. 

[Line chart: "Performance per MW on Mistral Large 3 NVFP4, ISL/OSL 1K/8K." The x-axis shows TPS per user (interactivity); the y-axis shows TPS per megawatt. GB200 starts near 5,000,000 TPS/MW at roughly 40 TPS/user, H200 near 2,000,000 TPS/MW at roughly 15 TPS/user; both slope downward as interactivity increases, with GB200 consistently far higher.]
Figure 1. Performance per megawatt for Mistral Large 3, comparing NVIDIA GB200 NVL72 and NVIDIA H200 across different interactivity targets

For production AI systems that must deliver both a strong user experience (UX) and cost-efficient scale, GB200 NVL72 provides up to 10x higher performance than the previous-generation H200, exceeding 5,000,000 tokens per second per megawatt (MW) at 40 tokens per second per user. 

This generational gain translates to better UX, lower per-token cost, and higher energy efficiency for the new model. The gain is primarily driven by the following components of the inference optimization stack: 

  • NVIDIA TensorRT-LLM Wide Expert Parallelism (Wide-EP) provides optimized MoE GroupGEMM kernels, expert distribution and load balancing, and expert scheduling to fully exploit the NVL72 coherent memory domain. Of particular interest is how resilient this Wide-EP feature set is to architectural variations across large MoEs: it enables a model such as Mistral Large 3, with 128 experts per layer (roughly half as many as DeepSeek-R1), to still realize the high-bandwidth, low-latency, non-blocking benefits of the NVIDIA NVLink fabric. 
  • NVFP4 low-precision inference maintains accuracy while improving efficiency, with support in SGLang, TensorRT-LLM, and vLLM. 
  • Mistral Large 3 relies on NVIDIA Dynamo, a low-latency distributed inference framework, to rate-match and disaggregate the prefill and decode phases of inference. This in turn boosts performance for long-sequence workloads, such as the 1K/8K ISL/OSL configuration shown in Figure 1. 

As with all models, upcoming performance optimizations—such as speculative decoding with multitoken prediction (MTP) and EAGLE-3—are expected to push performance further, unlocking even more benefits from this new model. 

NVFP4 quantization 

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint that was quantized offline using the open source llm-compressor library. This reduces compute and memory costs while maintaining accuracy, because NVFP4's higher-precision FP8 scaling factors and finer-grained block scaling keep quantization error under control. 

The recipe targets only the MoE weights while keeping all other components at their original checkpoint precision. Because NVFP4 is native to NVIDIA Blackwell, this variant deploys seamlessly on GB200 NVL72. 
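
To make the shape of such a recipe concrete, here is a minimal sketch using llm-compressor's one-shot quantization flow. The model ID, the NVFP4 scheme identifier, and the module patterns that restrict quantization to the MoE experts are assumptions for illustration, not the exact recipe behind the published checkpoint; depending on the scheme, a small calibration dataset may also be required.

```python
# Hypothetical sketch of an offline NVFP4 quantization pass with llm-compressor.
# The model ID, scheme name, and ignore patterns are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3-Instruct"  # placeholder Hugging Face ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize only the Linear layers inside the MoE experts to NVFP4; keep attention,
# embeddings, and the LM head at their original precision, as described above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",  # assumed identifier for the NVFP4 scheme
    ignore=["lm_head", "re:.*self_attn.*", "re:.*embed.*"],  # assumed module patterns
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("mistral-large-3-nvfp4")
tokenizer.save_pretrained("mistral-large-3-nvfp4")
```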

Open source inference 

These open weight models can be used with your open source inference framework of choice. TensorRT-LLM leverages optimizations for large MoE models to boost performance on GB200 NVL72 systems. To get started, you can use the TensorRT-LLM preconfigured Docker container.  
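
If you prefer a Python entry point over the container, the sketch below uses the TensorRT-LLM LLM API. The checkpoint ID and parallelism settings are placeholders you would adjust to your checkpoint and GPU topology.

```python
# Minimal sketch of generating text with the TensorRT-LLM LLM API.
# The checkpoint ID and parallelism settings are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-3-Instruct-NVFP4",  # placeholder checkpoint ID
    tensor_parallel_size=8,  # example only; size this to your Blackwell/GB200 topology
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain expert parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```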

NVIDIA collaborated with vLLM to expand support for kernel integrations for speculative decoding (EAGLE), NVIDIA Blackwell, disaggregation, and expanded parallelism. To get started, you can deploy the launchable that uses vLLM on NVIDIA cloud GPUs. To see the boilerplate code for serving the model and sample API calls for common use cases, check out Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs.
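
As a minimal client-side sketch, the snippet below queries a vLLM OpenAI-compatible endpoint (for example, one started with `vllm serve`). The endpoint URL and model ID are placeholders; see the linked guide for the full serving configuration.

```python
# Sketch of querying a locally served vLLM OpenAI-compatible endpoint.
# The endpoint URL and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize the Mistral 3 model family."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```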

Figure 2 shows the range of GPUs available in the NVIDIA build platform where you can deploy Mistral Large 3 and Ministral 3. You can select the appropriate GPU size and configuration for your needs.  

[Screenshot: the "Select your Compute" page of the brev.dev console, where users choose from GPU options including H200, H100, A100, L40S, and A10.]
Figure 2. A range of GPUs is available in the NVIDIA build platform where developers can deploy Mistral Large 3 and Ministral 3

NVIDIA also collaborated with SGLang to create an implementation of Mistral Large 3 with disaggregation and speculative decoding. For details, see the SGLang documentation.

Ministral 3 models deliver speed, versatility, and accuracy   

The small, dense, high-performance Ministral 3 models are designed for edge deployment. Offering flexibility for a variety of needs, they come in three parameter sizes—3B, 8B, and 14B—each with Base, Instruct, and Reasoning variants. You can try the models on edge platforms like NVIDIA GeForce RTX AI PCs, NVIDIA DGX Spark, and NVIDIA Jetson.

When developing locally, you still get the benefit of NVIDIA acceleration: NVIDIA collaborated with Ollama and llama.cpp to deliver faster iteration, lower latency, and greater data privacy. You can expect fast inference at up to 385 tokens per second on the NVIDIA RTX 5090 GPU with the Ministral-3B variants. Get started with Llama.cpp and Ollama. 
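
For a quick local test on an RTX system, here is a minimal sketch using the llama-cpp-python bindings. The GGUF file name is a placeholder for whichever Q4_K_M Ministral 3 checkpoint you download.

```python
# Minimal local-inference sketch with llama-cpp-python on an RTX GPU.
# The GGUF file name is a placeholder for the Ministral 3 checkpoint you download.
from llama_cpp import Llama

llm = Llama(
    model_path="ministral-3-3b-instruct-q4_k_m.gguf",  # placeholder file name
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # context length for this session
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about edge AI."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```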

For Ministral-3-3B-Instruct, Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second at single concurrency, scaling up to 273 tokens per second at a concurrency of 8. 
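
To see that concurrency scaling in practice, the sketch below issues eight concurrent requests against the container's OpenAI-compatible endpoint. The endpoint URL and model ID are placeholders for your Jetson Thor deployment.

```python
# Sketch of sending 8 concurrent requests to a vLLM OpenAI-compatible endpoint,
# such as the vLLM container running on Jetson Thor. URL and model ID are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mistralai/Ministral-3-3B-Instruct",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Give me one fact about edge AI (item {i})." for i in range(8)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```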

Production-ready deployment with NVIDIA NIM 

Mistral Large 3 and Ministral-3-14B-Instruct are available through the NVIDIA API catalog and a preview API, so developers can get started with minimal setup. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure. 
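
A minimal sketch of calling the hosted preview endpoint through the OpenAI-compatible API is shown below. The exact model identifier on the API catalog is an assumption, and you supply your own API key.

```python
# Sketch of streaming a response from the NVIDIA API catalog's
# OpenAI-compatible endpoint. The model identifier is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

stream = client.chat.completions.create(
    model="mistralai/mistral-large-3-instruct",  # assumed catalog model ID
    messages=[{"role": "user", "content": "What makes MoE models efficient?"}],
    max_tokens=256,
    stream=True,  # stream tokens as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```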

Video 1. Mistral 3 users can input text and images and view the response from the hosted model

Get started building with open source AI 

The NVIDIA-accelerated Mistral 3 open model family represents a major leap for transatlantic AI in the open source community. With both a large-scale MoE and edge-friendly dense transformers, the family meets developers where they are and within their development lifecycle. 

With NVIDIA-optimized performance, advanced quantization techniques like NVFP4, and broad framework support, developers can achieve exceptional efficiency and scalability from cloud to edge. To get started, download Mistral 3 models from Hugging Face or test deployment-free on build.nvidia.com/mistralai. 
