New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture Delivering Improved Accuracy and Accelerated Parallel Processing Across the NVIDIA Platform

As AI models grow larger and process longer sequences of text, efficiency becomes just as important as scale.  

To showcase what’s next, Alibaba released two new open models, Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct, to preview a new hybrid Mixture of Experts (MoE) architecture with the research and developer community.

Qwen3-Next-80B-A3B-Thinking is now live on build.nvidia.com, giving developers instant access to test its advanced reasoning capabilities directly in the UI or through the NVIDIA NIM API. 

Video 1.  Qwen3-Next-80B-A3B-Thinking demo on build.nvidia.com  

The new architecture of these Qwen3-Next models is optimized for long context lengths (more than 260K input tokens) and large-scale parameter efficiency. Each model has 80B total parameters, but only 3B are activated per token thanks to its sparse MoE structure, delivering the power of a massive model with the efficiency of a much smaller one. The MoE module has 512 routed experts and 1 shared expert, with 10 routed experts activated per token.
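
To make the sparsity concrete, here is a minimal NumPy sketch of top-k expert routing. It is purely illustrative, not the model’s implementation: the hidden size is a toy value, and the softmax-over-top-k gating shown is a common MoE convention assumed here for clarity.

import numpy as np

# Conceptual sketch of sparse MoE routing: 512 routed experts, top 10 selected
# per token (plus 1 always-on shared expert), so only a small fraction of the
# total parameters is active for any given token.
NUM_ROUTED_EXPERTS = 512
TOP_K = 10
HIDDEN = 64  # toy hidden size for illustration only

rng = np.random.default_rng(0)
token = rng.standard_normal(HIDDEN)                      # one token's hidden state
router_w = rng.standard_normal((NUM_ROUTED_EXPERTS, HIDDEN))

logits = router_w @ token                                # router score for each expert
top_idx = np.argsort(logits)[-TOP_K:]                    # indices of the 10 selected experts
weights = np.exp(logits[top_idx] - logits[top_idx].max())
weights /= weights.sum()                                 # normalized gating weights

# Only the selected experts (and the shared expert) run their MLPs for this token;
# their outputs are combined using the gating weights.
print("Selected experts:", sorted(top_idx.tolist()))
print("Gating weights sum to:", weights.sum())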

The performance of an MoE model like Qwen3-Next, which routes requests between 512 different experts, is heavily dependent on inter-GPU communication. Blackwell’s 5th-generation NVLink provides 1.8 TB/s of direct GPU-to-GPU bandwidth. This high-speed fabric is essential for minimizing latency during the expert routing process, directly translating to faster inference and higher token throughput in the AI Factory. 

The model has 48 layers: every fourth layer uses grouped-query attention (GQA), while the remaining layers use the new linear attention. Large language models (LLMs) use attention layers to interpret and assign importance to each token in the input sequence. Less mature software stacks lack pre-optimized primitives for novel architectures, or the specific kernel fusions required to make the constant switching between attention types efficient.

The diagram shows an example input sequence, “The Cat Jumped Over,” passing through encoding, decoding, and linear layers, with each token assigned an attention weight between 0 and 1.
Figure 1. A general representation of how an input sequence is parsed and weighted by a transformer

To achieve its long input context capability, the model leverages Gated DeltaNet, developed by NVIDIA Research and MIT. Gated DeltaNet combines adaptive gating with the delta rule, helping the model stay focused while processing very long text without drifting or forgetting what matters. As a result, it can efficiently process extremely long sequences, with memory and computation scaling almost linearly with sequence length.
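
The sketch below shows the basic gated delta-rule recurrence that Gated DeltaNet builds on: a fixed-size fast-weight memory is decayed by a gate and updated with a delta-rule write, so per-token cost does not grow with sequence length. This is a simplified, single-head, unnormalized illustration; the model’s actual parameterization and chunked GPU kernels differ.

import numpy as np

# Conceptual gated delta-rule update. S is a fast-weight memory mapping keys to values;
# alpha is a decay gate in (0, 1], beta is a writing strength in (0, 1].
d_k, d_v = 8, 8
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))                       # recurrent memory state (fixed size)

def gated_delta_step(S, k, v, q, alpha, beta):
    # Erase the old value associated with key k, write the new value v,
    # while the gate alpha decays stale content in the state.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q                                  # read the memory with the query
    return S, o

for _ in range(5):                             # process a toy sequence step by step
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    q = rng.standard_normal(d_k)
    S, o = gated_delta_step(S, k, v, q, alpha=0.95, beta=0.5)

# Each step costs O(d_k * d_v), independent of sequence length.
print("state shape:", S.shape, "output shape:", o.shape)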

In addition to these architectural innovations, the models can be run on NVIDIA Hopper and Blackwell GPUs for optimized inference performance. NVIDIA’s flexible CUDA programming model enables experimentation with new and unique approaches, supporting both the full attention layers of traditional Transformer models and the linear attention layers in the Qwen3-Next models. Run on NVIDIA GPUs, this hybrid approach can deliver efficiency gains, paving the way for greater token generation and revenue for AI Factories.

The diagram shows the model’s 48 layers grouped into 12 repeating blocks (the first block, the 10 middle blocks, and the last block); each block contains three linear attention layers and one full attention layer.
Figure 2. Diagram of the configuration of the 48 layers in the model
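
As a quick illustration of the layout in Figure 2, the snippet below enumerates the 48-layer pattern, with every fourth layer using full attention and the rest linear attention. The layer labels are hypothetical names used only for this sketch.

# Illustrative sketch of the hybrid layer layout (not the model's code):
layers = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(48)
]

print(layers[:4])                                                     # one block: 3 linear + 1 full
print("full attention layers:", layers.count("full_attention"))      # 12
print("linear attention layers:", layers.count("linear_attention"))  # 36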

NVIDIA collaborated with the open source frameworks SGLang and vLLM to enable model deployment for the community, and is also packaging both models as NVIDIA NIM microservices. Developers can consume leading open models through enterprise software containers, depending on their needs.

Deploying with SGLang 

Users deploying models with the SGLang serving framework can use the following instructions. See the SGLang documentation for more information and configuration options.

python3 -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --tp-size 4
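
Once the server is up, it exposes an OpenAI-compatible API that can be queried with any standard client. The sketch below assumes SGLang’s default port of 30000; adjust the base URL to match your launch configuration.

# Minimal client sketch for the locally launched SGLang server (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of hybrid MoE models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)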

Deploying with vLLM  

Users deploying models with the vLLM serving framework can use the following instructions. See the vLLM announcement blog for more information.

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4
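
As an alternative to serving, the same checkpoint can be used for offline batch inference through vLLM’s Python API. This is a minimal sketch mirroring the tensor-parallel setting of the serve command above (hardware permitting); sampling values are arbitrary.

# Offline batch inference sketch with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain linear attention in two sentences."], sampling)
print(outputs[0].outputs[0].text)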

Production-ready deployment with NVIDIA NIM 

Enterprise developers can try Qwen3-Next-80B-A3B along with the rest of the Qwen models for free using NVIDIA-hosted NIM microservice endpoints in the NVIDIA API catalog.  Prepackaged, optimized NIM microservices for the models will also be available for download soon. 
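
The hosted endpoints in the API catalog follow the OpenAI-compatible interface, so existing clients work with a base URL change. In the sketch below, the model identifier is an assumption; check the model card on build.nvidia.com for the exact ID, and supply your own API key.

# Sketch of calling the NVIDIA-hosted NIM endpoint in the API catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$NVIDIA_API_KEY",  # replace with your NVIDIA API key
)

response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-thinking",  # assumed ID; verify on build.nvidia.com
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=512,
)
print(response.choices[0].message.content)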

Building on the Power of Open Source AI 

The new hybrid MoE Qwen3-Next architecture pushes the boundaries of efficiency and reasoning, marking a significant advancement for the community. Making these models openly available empowers researchers and developers everywhere to experiment, build, and accelerate innovation. At NVIDIA, we share this commitment to open source through contributions such as NeMo for AI lifecycle management, Nemotron LLMs, and Cosmos world foundation models (WFMs). We’re working alongside the community to advance the state of AI. Together, these efforts ensure that the future of AI models is not just more powerful, but more accessible, transparent, and collaborative. 

Get started today 

Try the models on OpenRouter: Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct, or download them from Hugging Face: Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
