Agentic AI / Generative AI

Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints 

Kimi K2.5 is the newest open vision language model (VLM) from the Kimi family of models. Kimi K2.5 is a general-purpose multimodal model that excels at high-demand tasks such as agentic AI workflows, chat, reasoning, coding, and mathematics.

The model was trained using the open source Megatron-LM framework. Megatron-LM provides GPU-accelerated, scalable training of massive transformer-based models through several types of parallelism, including tensor, data, and sequence parallelism.

The model architecture builds on leading state-of-the-art open models for efficiency and capability. It uses a mixture-of-experts (MoE) design composed of 384 experts with a single dense layer, which allows for smaller experts and specialized routing for different modalities. Kimi K2.5 activates only 3.2% of its parameters per token.

| Specification | Kimi K2.5 |
| --- | --- |
| Modalities | Text, image, video |
| Total parameters | 1T |
| Active parameters | 32.86B |
| Activation rate | 3.2% |
| Input context length | 262K |
| Additional configuration information | |
| # experts | 384 |
| # shared experts | |
| # experts per token | |
| # layers | 61 (1 dense, 60 MoE) |
| # attention heads | 64 |
| Vocab size | ~164K |

Table 1. Specifications and configuration details for the Kimi K2.5 model

For vision capability, the large training vocabulary of ~164K tokens includes vision-specific tokens. Kimi created the MoonViT3d Vision Tower as the visual processing component of the model, converting images and video frames into embeddings.

Illustration of the Kimi K2.5 Vision Pipeline, which consists of a Vision Tower (MoonViT3d) (left), a Visual and Text Embedding Merger (center), and a Language Model (right).
Figure 1. Kimi K2.5 vision pipeline 

Build with NVIDIA GPU-accelerated endpoints 

You can start building with Kimi K2.5 through free access to GPU-accelerated endpoints for prototyping on build.nvidia.com, as part of the NVIDIA Developer Program. You can use your own data in the browser experience. NVIDIA NIM microservices, containers for production inference, are coming soon.

Video 1. Learn how you can test Kimi K2.5 on NVIDIA GPU-accelerated endpoints

You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.  

import os
import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

# Read the API key from the environment instead of hardcoding it
headers = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""  # add your prompt text here
        }
    ],
    "model": "moonshotai/kimi-k2.5",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    "stream": False,  # set to True to stream tokens as server-sent events
    "temperature": 1,
    "top_p": 1
}

# Reuse connections across requests
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()

response_body = response.json()
print(response_body)

To take advantage of tool calling, define an array of OpenAI-compatible tool definitions and add it to the tools parameter of the chat completions request, as sketched below.
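The following is a minimal sketch that builds on the payload from the previous example; the get_weather function and its schema are hypothetical placeholders for illustration, not a tool shipped with Kimi K2.5.

# Hypothetical tool definition in the OpenAI-compatible function-calling format.
# "get_weather" and its parameters are illustrative placeholders only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"}
                },
                "required": ["city"]
            }
        }
    }
]

# Add the tool definitions to the chat completions payload defined earlier
payload["tools"] = tools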

Deploying with vLLM 

To deploy the model with the vLLM serving framework, start by installing vLLM using the following instructions. For more information, see the vLLM recipe for Kimi K2.5.

$ uv venv
$ source .venv/bin/activate
$ uv pip install -U vllm --pre \
   --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
   --extra-index-url https://download.pytorch.org/whl/cu129 \
   --index-strategy unsafe-best-match
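After installation, the model can be served behind an OpenAI-compatible endpoint with vllm serve. The command below is only a sketch: the Hugging Face model ID, tensor-parallel size, and flags are assumptions that depend on your checkpoint and GPU topology, so use the launch configuration from the vLLM recipe for Kimi K2.5.

$ # Sketch only; adjust the model ID and --tensor-parallel-size for your setup
$ vllm serve moonshotai/Kimi-K2.5 \
   --tensor-parallel-size 8 \
   --trust-remote-code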

Fine-tuning with NVIDIA NeMo Framework 

Kimi K2.5 can be customized and fine-tuned with the open source NeMo Framework, using the NeMo AutoModel library to adapt the model for domain-specific multimodal tasks, agentic workflows, and enterprise reasoning use cases.

NeMo Framework is a suite of open libraries enabling scalable model pretraining and post-training, including supervised fine-tuning, parameter-efficient methods, and reinforcement learning for models of all sizes and modalities. 

NeMo AutoModel is a PyTorch Distributed-native training library within NeMo Framework that provides high-throughput training directly on Hugging Face checkpoints, without the need for conversion. This gives developers and researchers a lightweight and flexible tool for rapid experimentation on the latest frontier models.
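As a minimal sketch of what loading the checkpoint for fine-tuning might look like, assuming NeMo AutoModel exposes a Hugging Face-style from_pretrained interface; the class name and model ID below are assumptions, so confirm the exact API and configuration against the NeMo AutoModel recipe.

# Sketch only: the class name and model ID are assumptions, not verified against
# the NeMo AutoModel recipe for Kimi K2.5.
from nemo_automodel import NeMoAutoModelForCausalLM

model = NeMoAutoModelForCausalLM.from_pretrained("moonshotai/Kimi-K2.5")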

Try fine-tuning Kimi K2.5 with the NeMo AutoModel recipe

Get started with Kimi K2.5  

From data center deployments on NVIDIA Blackwell to fully managed enterprise NVIDIA NIM microservices, NVIDIA offers solutions for integrating Kimi K2.5. To get started, check out the Kimi K2.5 model page on Hugging Face and the Kimi API Platform, and test Kimi K2.5 in the build.nvidia.com playground.
