Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time—turning fragmented information into actionable insights.

Step 3.7 Flash, the latest model from StepFun, brings these capabilities to enterprise-scale production on NVIDIA-accelerated infrastructure. It is a 198B-parameter Mixture-of-Experts (MoE) vision-language model with approximately 11B parameters activated per forward pass, optimized for agentic workflows that combine perception, search, and multi-step reasoning.

With native image and video input, three configurable reasoning levels (low, medium, and high), and a 256K context window, it is designed for enterprise workloads such as financial analysis, concurrent coding agents, and other high-throughput multimodal applications. Developers can use StepFun’s NVFP4-quantized checkpoint, available on Hugging Face, for faster inference, because the 4-bit format reduces memory bandwidth and storage requirements.

Model	Step 3.7 Flash
Total parameters	198B
Visual encoder parameters	1.8B
Active parameters	11B
Context length	256K
Experts	288 (8 active)

Table 1. Overview of the key Step 3.7 Flash specs, such as parameter counts, context length, and MoE configuration
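To make the memory argument for the quantized checkpoint concrete, the parameter counts in Table 1 can be turned into a rough weight-footprint estimate. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes every weight is stored at the named precision (real checkpoints typically keep some layers in higher precision) and counts NVFP4 as about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block). The helper names are illustrative.

```python
# Rough weight-memory estimate from the Table 1 parameter counts.
# Assumption: all weights use the named format; KV cache and
# activations are not included.

TOTAL_PARAMS = 198e9   # total parameters (Table 1)
ACTIVE_PARAMS = 11e9   # active parameters (Table 1)

# Effective bits per weight. For NVFP4: 4-bit values plus one 8-bit
# scale shared by each 16-value block, so 4 + 8/16 = 4.5 bits.
BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "nvfp4": 4.5}


def weight_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def active_fraction() -> float:
    """Share of parameters activated per forward pass."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>6}: ~{weight_gb(TOTAL_PARAMS, fmt):.0f} GB of weights")
    print(f"active share: ~{active_fraction():.1%} of parameters")
```

Under these assumptions, the weights take roughly 396 GB in BF16 and roughly 111 GB in NVFP4, and about 5.6% of the parameters are active on each forward pass.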

Step 3.7 Flash can be deployed with open source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM to utilize kernels optimized for NVIDIA hardware.

Build with NVIDIA endpoints

Developers can prototype and evaluate Step 3.7 Flash using GPU-accelerated endpoints on build.nvidia.com. To try it out, use the demo notebook, which combines Step 3.7 Flash with NVIDIA Nemotron Parse in a multi-step document intelligence pipeline. The pipeline extracts structured insights, with bounding boxes, from large, complex documents (including PDFs) such as financial reports, slide decks, and scientific papers, and organizes the output.
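For a quick check outside the notebook, a hosted endpoint can be called over plain HTTPS. The following is a minimal sketch using only the Python standard library. It assumes the hosted API is served at `https://integrate.api.nvidia.com/v1`, that the model ID is `stepfun/step-3.7-flash`, and that an API key is stored in the `NVIDIA_API_KEY` environment variable; confirm all three on the model page before relying on them. The helper name `build_chat_request` is illustrative.

```python
import json
import os
import urllib.request

# Assumptions: hosted base URL and model ID; verify on build.nvidia.com.
BASE_URL = "https://integrate.api.nvidia.com/v1"
MODEL_ID = "stepfun/step-3.7-flash"


def build_chat_request(prompt: str, api_key: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only send the request when a key is configured.
if __name__ == "__main__" and os.environ.get("NVIDIA_API_KEY"):
    request = build_chat_request(
        "List three checks to run on a quarterly financial report.",
        os.environ["NVIDIA_API_KEY"],
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    print(body["choices"][0]["message"]["content"])
```

Separating request construction from sending keeps the payload easy to inspect and reuse when moving from a hosted endpoint to a self-hosted one.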

Video 1. See how document intelligence pipelines extract usable data, then follow the workflow in a JupyterLab notebook

Production-ready deployment with NVIDIA NIM

NVIDIA NIM makes it easy to take Step 3.7 Flash from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it on-premises, in the cloud, or across hybrid environments. NIM exposes a standard OpenAI-compatible API for sending inference requests to the NIM server.

Download the NIM container from the NVIDIA container registry (enterprise license required).
Start the NIM server and connect to it with the OpenAI client.
Send either text or image input to the endpoint.

from openai import OpenAI

# Point the client at the local NIM server's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required",
)

# Request a streamed chat completion.
completion = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[{"role": "user", "content": "Explain particle physics."}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,
)

# Print each piece of the response as it arrives.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
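
The example above sends text only. For image input, a common pattern with OpenAI-compatible servers is to embed the image as a base64 data URL in an `image_url` content part. The sketch below assumes the NIM endpoint for this model accepts that message format; check the NIM documentation for the supported content types. The helper name `image_message` is illustrative, and the image path is taken from the command line.

```python
import base64
import sys


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message holding a text prompt plus an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Usage: python send_image.py path/to/image.png
if __name__ == "__main__" and len(sys.argv) > 1:
    from openai import OpenAI  # imported here so the helper stays stdlib-only

    with open(sys.argv[1], "rb") as f:
        message = image_message("Describe this image.", f.read())

    # Same local server and model ID as the text example.
    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="no-key-required")
    completion = client.chat.completions.create(
        model="stepfun/step-3.7-flash",
        messages=[message],
        max_tokens=512,
    )
    print(completion.choices[0].message.content)
```

Inlining the image keeps the request self-contained, at the cost of a larger payload; for large images, a URL the server can fetch may be preferable if the deployment allows it.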

Day 0 fine-tuning with NVIDIA NeMo Framework

Step 3.7 Flash can be customized with domain-specific data using open libraries from the NVIDIA NeMo framework. The NVIDIA NeMo Automodel library combines native PyTorch n-D parallelism with optimized performance and supports Day 0 fine-tuning directly from Hugging Face model checkpoints, with no checkpoint conversion required. The Automodel fine-tuning recipe for Step 3.7 Flash supports techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA at 600 tokens/sec on Hopper GPUs.
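To illustrate why LoRA is described as memory-efficient, the sketch below counts trainable parameters for a single linear layer. LoRA freezes the original weight matrix and trains two low-rank factors, so the trainable count is `rank * (d_in + d_out)` instead of `d_in * d_out`. The layer size and rank used here are hypothetical and are not taken from the Step 3.7 Flash architecture or the Automodel recipe.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when fine-tuning the full weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the two LoRA factors:
    A with shape (rank, d_in) and B with shape (d_out, rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_in, d_out, rank = 4096, 4096, 16
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
    print(f"ratio: {lora / full:.2%}")
```

The frozen base weights still have to be held in memory; the savings come from the gradients and optimizer state, which are only needed for the small low-rank factors.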

For advanced large-scale training, teams can also use the NeMo Megatron-Bridge fine-tuning recipe, which provides additional performance optimizations.

From data center deployments on NVIDIA Blackwell and deskside development on NVIDIA DGX Station to managed NIM microservices and Day 0 fine-tuning workflows, NVIDIA provides a range of options for integrating Step 3.7 Flash across different stages of development and deployment. With 748 GB of coherent memory, DGX Station is well suited to running Step 3.7 Flash, with headroom for the full 256K context length and faster local developer iteration.

NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open source licenses. NVIDIA is committed to open models such as Step 3.7 Flash that promote AI transparency and enable users to share their AI safety and resilience work.

To get started, check out Step 3.7 Flash on Hugging Face, test it with your own data on build.nvidia.com, or run it locally on DGX Station using the vLLM Playbook.