As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.
Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country's diverse population, support nearly two dozen languages, and keep model development and data governance fully under India's sovereign control. To meet strict latency targets and improve inference efficiency for its flagship sovereign Sarvam 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.
This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs and established a clear path for deployment on the next-generation NVIDIA Blackwell architecture. The end-to-end boost combines kernel and scheduling optimizations on NVIDIA H100 SXM GPUs, which contributed a 2x speedup, with the compute capabilities of Blackwell and NVFP4 weight quantization, which added another 2x; at higher interactivity points, that additional gain grows to 2.8x.
NVIDIA engineers helped Sarvam AI build and optimize a new family of sovereign foundation models at the 3B, 30B, and 100B scales, trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo Framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can use NVIDIA's full-stack AI platform, from data to deployment, to achieve state-of-the-art performance and localized AI capabilities.
This post walks through the joint engineering effort and shares benchmarks for the speedups achieved on the NVIDIA H100, the most widely deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.
Making multilingual sovereign AI scalable with MoE
To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at the 3B, 30B, and 100B scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NVIDIA NeMo-RL was used for post-training workflows, including long-context reasoning.
Sarvam 30B utilizes a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.
Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. Additionally, the 100B model adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the Key-Value (KV) cache, enabling massive context windows without the memory penalties of standard attention.
Both models feature a shared expert design where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and complex memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
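To make the two configurations easier to compare, the following sketch captures the architecture parameters described above in a small Python dataclass. The field and variable names are illustrative only and do not correspond to Sarvam AI's actual model code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoEConfig:
    """Illustrative summary of the architecture parameters described in the text."""
    num_layers: int              # total transformer layers (dense + MoE)
    num_dense_layers: int        # leading dense layers
    num_experts: int             # routed experts per MoE layer
    top_k: int                   # experts activated per token
    attention: str               # attention variant
    moe_ffn_hidden: Optional[int] = None  # expert FFN hidden size, where published
    shared_expert: bool = True   # dedicated always-on expert alongside routed experts

# Sarvam 30B: 19 layers (1 dense + 18 MoE), 128 experts, top-6 routing, GQA
sarvam_30b = MoEConfig(num_layers=19, num_dense_layers=1, num_experts=128,
                       top_k=6, attention="GQA")

# Sarvam 100B: 32 layers (1 dense + 31 MoE), 128 experts, top-8 routing,
# MLA attention, 2048-wide expert FFN
sarvam_100b = MoEConfig(num_layers=32, num_dense_layers=1, num_experts=128,
                        top_k=8, attention="MLA", moe_ffn_hidden=2048)
```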
The performance challenge: SLAs and baseline configuration on NVIDIA H100
Optimizing the Sarvam 30B model wasn’t just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service level agreements (SLAs):
- P95 (95th percentile) time to first token (TTFT): < 1000 ms
- P95 (95th percentile) inter-token latency (ITL): < 15 ms
P95 (95th percentile) in inference performance testing measures tail latency: 95% of served requests complete faster than this threshold, while the slowest 5% take longer. It is a critical metric for evaluating user experience and system stability, ensuring that even under load, the vast majority of users experience no more than a specified delay. The engineering goal was to maximize the inference server's token throughput (and therefore the number of concurrently served requests) without breaching these P95 targets.
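As a concrete illustration of how these targets are evaluated, the snippet below computes P95 TTFT and ITL from per-request measurements and checks them against the SLAs. The sample values and the p95 helper are hypothetical, shown only to make the metric definition concrete.

```python
import numpy as np

def p95(samples_ms):
    """95th-percentile latency: 95% of requests finish at or below this value."""
    return float(np.percentile(samples_ms, 95))

# Hypothetical per-request measurements from a load test (milliseconds)
ttft_samples = [640, 710, 830, 905, 980, 720, 690, 1010, 760, 840]
itl_samples = [9.8, 11.2, 12.4, 10.1, 13.9, 9.5, 12.8, 11.7, 14.2, 10.9]

ttft_p95, itl_p95 = p95(ttft_samples), p95(itl_samples)
print(f"P95 TTFT: {ttft_p95:.0f} ms (target < 1000 ms)")
print(f"P95 ITL:  {itl_p95:.1f} ms (target < 15 ms)")
assert ttft_p95 < 1000 and itl_p95 < 15, "SLA breached at this concurrency"
```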
For the initial performance analysis, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention, a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture: RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. Furthermore, SGLang's Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.
The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a specific parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:
- Expert parallelism (EP=2) for the expert weights. This configuration utilizes Grouped GEMM kernels to maximize compute density and ensures that the massive expert weights reside in HBM, reducing the cost of expert routing.
- Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase. A hedged launch sketch combining these settings follows this list.
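As a reference point, a baseline launch along these lines might look like the Python sketch below. Only --enable-dp-attention is taken directly from the configuration above; the model path and the remaining flag names are assumptions that vary across SGLang versions, so treat this as an illustration of the topology rather than a verified command line.

```python
import subprocess

# Hypothetical launch of the two-GPU H100 baseline described above.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/sarvam-30b",       # placeholder path
    "--tp-size", "2",                           # two H100 SXM GPUs
    "--ep-size", "2",                           # expert parallelism (EP=2) for MoE weights
    "--dp-size", "2", "--enable-dp-attention",  # data-parallel attention (DP=2)
]
subprocess.run(cmd, check=True)
```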
While this configuration provided a robust functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization – leading us to the specific kernel and precision strategies detailed below.
From profiling to performance: eliminating MoE bottlenecks
Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting SLA requirements. To identify the precise bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams utilized NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of every kernel within a single transformer layer.
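The per-kernel breakdown shown later in Table 1 comes from aggregating kernel durations in those traces. A minimal sketch of that kind of post-processing is shown below; it assumes the Nsight Systems trace has already been exported to a CSV of kernel names and durations (the file name and column names are placeholders).

```python
import csv
from collections import defaultdict

# Aggregate per-kernel GPU time from a kernel summary exported from the
# Nsight Systems trace (for example, via `nsys stats`). The file and column
# names below are hypothetical.
totals_us = defaultdict(float)
with open("decode_kernels.csv") as f:
    for row in csv.DictReader(f):
        totals_us[row["kernel_name"]] += float(row["duration_ns"]) / 1e3

# Rank kernels by their total latency contribution within the profiled window
for name, us in sorted(totals_us.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{us:10.1f} us  {name}")
```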
The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations were suffering from kernel launch overheads and redundant memory reads.

Following these observations, we executed a targeted optimization strategy across three axes: kernel optimizations, scheduling efficiency, and disaggregated serving.
Cutting transformer layer time by 25% with kernel-level optimizations
The NVIDIA and Sarvam AI teams systematically targeted the most expensive kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We implemented the models first using a baseline implementation on SGLang with H100 GPUs and then optimized them to achieve significant speedups, as detailed below in Table 1 and in the following text.
| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
| --- | --- | --- | --- |
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Use optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | Use FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Use fused TopK impl.; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Tune kernel params for the DEP2 configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlap with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |
MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.
- Optimization: We implemented a Fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. Furthermore, we utilized a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound.
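For reference, the unfused routing pattern looks like the PyTorch sketch below: a small projection produces the router logits and a separate top-k call selects the experts, each launching its own kernels and making extra trips to memory. The fused kernel described above performs the equivalent work in a single launch; the tensor shapes and the normalization of the routing weights here are illustrative assumptions.

```python
import torch

num_tokens, hidden, num_experts, top_k = 4096, 3072, 128, 6  # illustrative sizes

x = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.bfloat16)
# Router weights are tiny relative to the experts, so replicating them on every
# GPU (the ReplicatedLinear approach) avoids communication during gating.
router_w = torch.randn(num_experts, hidden, device="cuda", dtype=torch.bfloat16)

# Baseline: separate ops mean multiple kernel launches and redundant memory reads
logits = torch.nn.functional.linear(x, router_w)           # [num_tokens, num_experts]
weights, expert_ids = torch.topk(logits.float(), k=top_k)  # top-6 routing
weights = torch.softmax(weights, dim=-1)                   # normalization is model-specific
```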
Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm, followed by rotary positional embeddings (RoPE), required reading and writing the massive KV cache twice.
- Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption.
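Conceptually, the baseline performs the two steps below as separate passes over the query and key tensors, while the fused kernel applies both in place in a single pass. The PyTorch sketch only illustrates the math; the shapes, epsilon, and RoPE convention are assumptions.

```python
import torch

def rmsnorm(t, weight, eps=1e-6):
    # Root-mean-square normalization over the head dimension
    return t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + eps) * weight

def rope(t, cos, sin):
    # Rotary embedding: rotate channel pairs by position-dependent angles
    t1, t2 = t.chunk(2, dim=-1)
    return torch.cat((t1 * cos - t2 * sin, t2 * cos + t1 * sin), dim=-1)

# Illustrative shapes: [tokens, heads, head_dim]
q, k, w = torch.randn(4096, 32, 128), torch.randn(4096, 8, 128), torch.ones(128)
pos = torch.arange(4096).unsqueeze(-1)
inv_freq = 1.0 / (10000 ** (torch.arange(0, 64) / 64))
cos = torch.cos(pos * inv_freq).unsqueeze(1)
sin = torch.sin(pos * inv_freq).unsqueeze(1)

# Baseline: two passes (norm, then RoPE), each re-reading q/k from global memory.
# The optimized kernel fuses both into one in-place pass, keeping data in cache.
q, k = rope(rmsnorm(q, w), cos, sin), rope(rmsnorm(k, w), cos, sin)
```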
Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By utilizing separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU’s compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.
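The overlap pattern can be expressed in PyTorch roughly as below: the shared expert runs on a side stream while routing runs on the default stream, and stream synchronization ensures both finish before the routed experts are dispatched and the outputs combined. This is a simplified sketch with placeholder module names, not the production scheduling code.

```python
import torch

side_stream = torch.cuda.Stream()

def moe_forward(x, shared_expert, router, routed_experts):
    # Launch the shared expert on a separate CUDA stream so it overlaps with routing
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        shared_out = shared_expert(x)

    # Router logits + TopK run concurrently on the default stream
    weights, expert_ids = router(x)

    # Wait for the shared expert before combining outputs
    torch.cuda.current_stream().wait_stream(side_stream)
    routed_out = routed_experts(x, weights, expert_ids)
    return routed_out + shared_out
```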
These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict <1,000 ms TTFT and <15 ms ITL SLAs, as shown in Figure 2 below.

How mixed prefill and decode scheduling improves GPU utilization
While kernel-level optimizations improve individual operation latency, significant efficiency gains can be achieved at the scheduler level by optimizing aggregated serving (prefill and decode run on the same GPU) and disaggregated serving (prefill and decode run on different GPUs).
The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often leads to suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading the KV cache). Serializing them means the GPU's Tensor Cores sit underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly at the low-concurrency operating point imposed by the tight SLA requirements.
To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to mix prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff. Mixing heavy prefill chunks into the decode stream can arguably increase inter-token latency (ITL) for the active decode requests, as they must wait for the shared compute resources.
However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high ISL, low OSL scenario of interest here. For more decode-heavy cases, it might be worthwhile to pick smaller mixed chunk sizes or disable it altogether.
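The policy can be illustrated with the toy batch builder below: each iteration reserves one token per active decode request and then piggybacks a capped chunk of prefill tokens into the same batch. This is a simplified illustration of the idea under an assumed token budget, not SGLang's actual scheduler.

```python
from collections import deque

TOKEN_BUDGET = 8192    # max tokens processed per forward pass (illustrative)
PREFILL_CHUNK = 2048   # cap on prefill tokens mixed into a decode batch

def build_mixed_batch(decode_reqs, prefill_queue: deque):
    """Return (decode_reqs, prefill_tokens_taken) for one mixed iteration."""
    # Every active decode request contributes exactly one token this step
    budget = min(TOKEN_BUDGET - len(decode_reqs), PREFILL_CHUNK)

    prefill_tokens = 0
    while prefill_queue and prefill_tokens < budget:
        req = prefill_queue[0]
        take = min(req["remaining"], budget - prefill_tokens)
        req["remaining"] -= take
        prefill_tokens += take
        if req["remaining"] == 0:      # prefill finished; the request becomes
            prefill_queue.popleft()    # a decode request on the next iteration
    return decode_reqs, prefill_tokens

# Example: 32 concurrent decodes plus a queued 3,584-token prefill request
decodes = [f"req{i}" for i in range(32)]
queue = deque([{"remaining": 3584}])
print(build_mixed_batch(decodes, queue))  # mixes 2,048 prefill tokens this step
```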

How disaggregated serving removes the critical path and boosts throughput 1.5x
Despite kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU’s memory, we pivoted from model parallelism to disaggregated serving.
We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: We observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), proving that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.
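A 1P+1D launch along these lines might look like the hedged sketch below. SGLang's prefill/decode disaggregation flags and router options differ across versions, so every flag here should be read as an assumption that illustrates the topology rather than a verified command line.

```python
import subprocess

MODEL = "/models/sarvam-30b"  # placeholder path

# One H100 dedicated to prefill, one to decode; a router (for example, the
# SGLang router) in front pairs the workers and hands finished prefills to
# the decode worker. Flag names are assumptions and vary by SGLang version.
prefill = ["python", "-m", "sglang.launch_server", "--model-path", MODEL,
           "--disaggregation-mode", "prefill", "--base-gpu-id", "0", "--port", "30001"]
decode = ["python", "-m", "sglang.launch_server", "--model-path", MODEL,
          "--disaggregation-mode", "decode", "--base-gpu-id", "1", "--port", "30002"]

for cmd in (prefill, decode):
    subprocess.Popen(cmd)
```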

The end-to-end impact of kernel, scheduling, and disaggregation optimizations
Figure 5 below summarizes the end-to-end performance speedup we were able to achieve via a combination of optimized kernels and scheduling optimizations. We also observe that disaggregated serving is the optimal configuration for this model, this ISL/OSL workload pattern, and these specific TTFT and ITL SLAs.

Running the Sarvam 30B model on NVIDIA Blackwell GPUs
The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, representing a jump over the NVIDIA H100 GPU’s capabilities. This throughput is driven by the second-generation Transformer Engine, which utilizes the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.
To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Whereas the H100 deployment required two GPUs, we found that the NVIDIA HGX B200 was able to serve the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell's NVFP4 compute throughput, we realized a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.
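A hedged sketch of that quantization step with the Model Optimizer (modelopt) Python API is shown below. The model path, calibration prompts, and export directory are placeholders, and the exact config and export helpers can differ across Model Optimizer releases.

```python
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "/models/sarvam-30b"  # placeholder path to the BF16 checkpoint

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16,
                                             device_map="auto")
tok = AutoTokenizer.from_pretrained(MODEL)

def forward_loop(m):
    # Run a small calibration set through the model to collect activation statistics
    for prompt in ["नमस्ते, आप कैसे हैं?", "Summarize the following passage ..."]:
        m(**tok(prompt, return_tensors="pt").to(m.device))

# Post-training quantization to NVFP4, then export for serving on Blackwell
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="/models/sarvam-30b-nvfp4")
```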
As indicated in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency due to its superior compute, as well as exceptional throughput at higher concurrencies from its memory capacity advantage.

Learn more
Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining strict TTFT and inter-token latency targets required for real-world deployment.
The result is not just a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams building large, production-grade AI systems on NVIDIA platforms.
More information about Sarvam AI’s models can be found here.
To begin exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
- Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
- Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com.
- Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.
- Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron
And read more about NVIDIA Cloud Functions, NVIDIA’s multi-cloud, high-performance AI inference solution, here.