Alibaba has introduced Qwen3.5, a new open source series built for native multimodal agents. The first model in the series is a ~400B-parameter native vision-language model (VLM) with reasoning, built on a hybrid architecture that combines mixture of experts (MoE) with Gated Delta Networks. Qwen3.5 can understand and navigate user interfaces, improving on the previous generation of VLMs.
Qwen3.5 is ideal for a variety of use cases, including:
- Coding, including web development
- Visual reasoning, including mobile and web interfaces
- Chat applications
- Complex search
| Qwen3.5 | |
| --- | --- |
| Modalities | Vision, language |
| Total parameters | 397B |
| Active parameters | 17B |
| Activation rate | 4.28% |
| Input context length | 256K, extensible to 1M tokens |
| Languages supported | 200+ |

| Additional configuration information | |
| --- | --- |
| Experts | 512 |
| Shared experts | 1 |
| Experts per token | 11 (10 routed + 1 shared) |
| Layers | 60 |
| Vocabulary size | 248,320 tokens |
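The activation rate in the table follows directly from the parameter counts: only the experts routed for each token are active, so 17B of the 397B total parameters are used per forward pass. A quick sanity check (variable names are illustrative):

```python
# Sparse-MoE activation rate: active parameters / total parameters
total_params = 397e9   # 397B total
active_params = 17e9   # 17B active per token

activation_rate = active_params / total_params * 100
print(f"{activation_rate:.2f}%")  # 4.28%
```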
Build with NVIDIA endpoints
You can start building with Qwen3.5 today with free access to GPU-accelerated endpoints on build.nvidia.com, powered by NVIDIA Blackwell GPUs. As part of the NVIDIA Developer Program, you can explore quickly in the browser, experiment with prompts, and even test the model with your own data to evaluate real-world performance.
You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.
```python
import os

import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    # Read the API key from the environment rather than hardcoding it
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""  # your prompt here
        }
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    "stream": False,  # False returns a single JSON body; True returns server-sent events
    "temperature": 1,
    "top_p": 1
}

# Reuse connections across requests
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()

response_body = response.json()
print(response_body)
```
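If you enable streaming instead, the endpoint returns server-sent events, one `data:` line per chunk, terminated by a `[DONE]` sentinel. A minimal sketch of parsing those lines (the `parse_sse_line` helper is illustrative, not part of the API):

```python
import json

def parse_sse_line(line: bytes):
    """Parse one server-sent-events line into a JSON chunk, or None."""
    decoded = line.decode("utf-8").strip()
    if not decoded.startswith("data: "):
        return None  # comments, keep-alives, and blank lines carry no payload
    data = decoded[len("data: "):]
    if data == "[DONE]":
        return None  # sentinel marking the end of the stream
    return json.loads(data)

# With a streaming request you would iterate the response like:
# for line in session.post(invoke_url,
#                          headers={**headers, "Accept": "text/event-stream"},
#                          json={**payload, "stream": True},
#                          stream=True).iter_lines():
#     chunk = parse_sse_line(line)
#     if chunk:
#         print(chunk["choices"][0]["delta"].get("content", ""), end="")

# Offline demonstration with a sample event line:
sample = b'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_line(sample)["choices"][0]["delta"]["content"])  # Hello
```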
To take advantage of tool calling, define an array of OpenAI-compatible tools and pass it in the chat completions `tools` parameter.
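For example, a tools array in the OpenAI-compatible schema might look like the following; the `get_weather` function and its parameters are hypothetical, not a real API:

```python
# Hypothetical tool definition in the OpenAI-compatible function-calling schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function name
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

# Added to the chat completions payload alongside the other parameters:
payload = {
    "model": "qwen/qwen3.5-397b-a17b",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}
print(payload["tools"][0]["function"]["name"])  # get_weather
```

When the model decides a tool is needed, the response contains a tool call with JSON arguments matching this schema, which your application executes and returns in a follow-up message.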
NVIDIA NIM makes it easy to take Qwen3.5 from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it anywhere: on-premises, in the cloud, or across hybrid environments.
Customize with NVIDIA NeMo
While Qwen3.5 offers impressive “out-of-the-box” multimodal capabilities, the NVIDIA NeMo framework provides the essential tools to adapt it for specialized domain needs. Using the NeMo Automodel library, developers can fine-tune the Qwen3.5 397B-parameter architecture with high-throughput efficiency.
NeMo Automodel is a PyTorch-native training library that offers Day 0 Hugging Face support, enabling direct training on existing checkpoints without tedious model conversions. This facilitates rapid experimentation, whether performing full supervised fine-tuning (SFT) or using memory-efficient methods such as LoRA.
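LoRA's memory efficiency comes from freezing the pretrained weights and learning only a low-rank update ΔW = BA per layer. A minimal sketch of the idea in plain Python (this illustrates the math only, not NeMo Automodel's actual API):

```python
import random

def matmul(a, b):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d, r = 8, 2  # toy hidden size and LoRA rank (r << d)

# Frozen pretrained weight W (d x d) is never updated during fine-tuning
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# Only the low-rank factors B (d x r) and A (r x d) are trained;
# B starts at zero so the initial update delta_W = B @ A is zero
B = [[0.0] * r for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]

delta_W = matmul(B, A)  # d x d update backed by only 2*d*r trainable parameters

# At a realistic hidden size the savings are dramatic:
d_real, r_real = 4096, 8
full_params = d_real * d_real
lora_params = 2 * d_real * r_real
print(f"{lora_params} trainable vs {full_params} frozen "
      f"({lora_params / full_params:.2%} of full fine-tuning)")
```

For a 397B-parameter model, training only these low-rank factors is what makes fine-tuning feasible on far less GPU memory than full SFT.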
As a reference implementation guide, developers can leverage the technical tutorial on Medical Visual QA, which details how to fine-tune Qwen3.5 on radiological datasets. For massive scale, NeMo supports multinode Slurm and Kubernetes deployments, ensuring that even the largest MoE models are optimized for domain-specific reasoning and complex agentic workflows with minimal latency.
Get started with Qwen3.5
From data center deployments on NVIDIA Blackwell to NVIDIA NIM microservices for containerized deployment anywhere, NVIDIA offers solutions for integrating Qwen3.5. To get started, check out the Qwen3.5 model page on Hugging Face and test Qwen3.5 on build.nvidia.com.