Agentic AI / Generative AI

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

Jun 12, 2026

By Anu Srivastava

Discuss (0)

AI-Generated Summary

Dislike

MiniMax M3, a 428B parameter Mixture-of-Experts model with 1M-token context and native multimodality, leverages NVIDIA Blackwell infrastructure to unify text, vision, and code tasks, supporting agentic workflows and extended creative applications within a single architecture.
The core MiniMax Sparse Attention mechanism replaces standard quadratic attention with a pre-filtering stage, enabling more than 4x faster contiguous KV cache access, 1/20th per-token compute cost at 1M context, and significant speedups in prefill and decoding, with no loss in precision or compression of key-values.
Deployment and customization leverage the NVIDIA ecosystem, including open source inference on TensorRT LLM, SGLang, and vLLM, large-scale serving with NVIDIA Dynamo for disaggregated inference, and advanced fine-tuning or RL via the NVIDIA NeMo Framework with full N-D parallelism and context parallelism up to 128k tokens.

AI-generated content may summarize information incompletely. Verify important information. Learn more

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and code—leading to added complexity, higher costs, and slower iteration.

MiniMax M3—available on NVIDIA accelerated infrastructure, including NVIDIA Blackwell—changes this by enabling a single multimodal system capable of long-context reasoning, agentic workflows, and creative tasks.

The 428B parameter MoE supports up to 1M tokens and native multimodal input. Developers can build applications like long video understanding, extended coding sessions (8+ hours), and high-quality design workflows—all with a unified model and production-ready deployment paths on NVIDIA platforms.

Name	MiniMax M3
Input modalities	Video, image, text
Total parameters	428B
Visual encoder parameters	600M
Active parameters	22B
Context length	1M
Experts	Total 128, 4 experts activated per token
Precision format	BF16, MXFP8

Table 1. MiniMax M3 a VLM MoE model specs

MiniMax M3’s core architectural innovation is MiniMax Sparse Attention (MSA), which replaces standard quadratic attention with a pre-filtering stage that identifies relevant context blocks and attends only to those. At the operator level, each KV cache block is read once with contiguous memory access—more than 4x faster than existing sparse attention implementations. This yields 1/20th the per-token compute of M2 at 1M-token context, with 9x faster prefill and 15x faster decoding, all without compressing key-values or sacrificing precision. The model also trains text, images, and video natively from step 0 across ~100 trillion interleaved tokens, rather than adding multimodality post-training.

Video 1. MiniMax M3 in the NVIDIA API catalog, where developers can test prompts, adjust parameters and explore reasoning controls before building with the model

NVIDIA Blackwell performance insights

MiniMax M3 is built for a new class of multimodal, long-context inference, and NVIDIA Blackwell provides the scale and low-latency performance to serve it efficiently. On MiniMax M3, NVIDIA Blackwell Ultra delivers up to 10x higher throughput than the prior-generation NVIDIA Hopper shown in Figure 1, doubling interactivity and increasing AI factory throughput.

These gains are driven by a combination of hardware and software for extreme co-design. With speculative decoding, MTP, and accuracy-preserving NVFP4 acceleration, Blackwell is positioned to push MiniMax M3 performance even further as the stack continues to optimize across NVIDIA Dynamo and NVIDIA CUDA kernels.

Open source inference

Developers can use accelerated computing with their open source inference engine of choice, such as NVIDIA TensorRT LLM (text-only), SGLang, or vLLM.

Deploying with NVIDIA TensorRT LLM

The optimizations are available on the NVIDIA TensorRT LLM GitHub repository. Follow the quick start guide to stand up a high-performance server—it covers downloading model checkpoints from Hugging Face, a ready-to-run Docker container, and configuration options for both low-latency and max-throughput serving. NVIDIA also collaborated on the developer experience through the Transformers library.

Deploying with SGLang

Users deploying models with the SGLang serving framework can use the following instructions. See the SGLang documentation for more information and configuration options.

# 8 GPUs node case 
$ python -m sglang.launch_server \ 
    --model-path MiniMaxAI/MiniMax-M3 \ 
    --dtype bfloat16 \ 
    --tp-size 8 \ 
    --ep-size 8 \ 
    --trust-remote-code \ 
    --mem-fraction-static 0.8 \ 
    --enable-multimodal \ 
    --quantization mxfp8 \ 
    --attention-backend flashinfer \ 
    --mm-attention-backend flashinfer_cudnn \ 
    --moe-runner-backend deep_gemm \ 
    --chunked-prefill-size 8192 \ 
    --reasoning-parser minimax-m3 \ 
    --tool-call-parser minimax-m3-nom 
--tr

Deploying with vLLM

When deploying models with the vLLM serving framework, use the following instructions. For more information, see the vLLM Recipe.

vllm serve MiniMaxAI/MiniMax-M3 \ 
  --tensor-parallel-size 8 \ 
  --enable-expert-parallel \ 
  --block-size 128 \ 
  --mm-encoder-attn-backend FLASHINFER \ 
  --mm-processor-cache-type shm \ 
  --tool-call-parser minimax_m3 \ 
  --enable-auto-tool-choice \ 
  --reasoning-parser minimax_m3 \ 
  --trust-remote-code

Scaling with NVIDIA Dynamo

Dynamo is an open source distributed inference serving platform for developers to deploy frontier models like MiniMax M3 for large-scale applications. Deploying MiniMax M3 using Dynamo with TensorRT LLM improves performance for long input sequence lengths without sacrificing throughput or increasing GPU budget.

Dynamo integrates with all major inference engines and frameworks, including PyTorch, SGLang, TensorRT LLM, and vLLM, and offers LLM-aware routing, elastic autoscaling, and low-latency data transfer. Developers can follow the deployment guide to run MiniMax M3 with Dynamo.

Customize with NVIDIA NeMo Framework

MiniMax M3 can be customized and fine-tuned with the open source NVIDIA NeMo Framework. Users can:

Use NVIDIA NeMo AutoModel for out-of-the-box fine-tuning (both SFT and LoRA) over Hugging Face checkpoints without any conversion, with high-throughput acceleration from full N-D parallelism. Specifically, context parallel support is available for sequence lengths up to 128k.
Use NVIDIA NeMo RL to conduct reinforcement learning on top of Minimax M3, referencing the following sample accuracy curves.

These libraries provide developers with a suite of lightweight tools for rapid experimentation on the latest frontier models.

Get started today

Developers can prototype and evaluate MiniMax M3 by using the GPU-accelerated API on build.nvidia.com or by downloading the weights from Hugging Face.

Discuss (0)

About the Authors

About Anu Srivastava
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.

View all posts by Anu Srivastava