
How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.

Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two dozen languages, and keep model development and data governance fully under India’s sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.

This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs and established a path for deployment on the next-generation Blackwell architecture. The end-to-end boost came from two sources: kernel and scheduling optimizations on NVIDIA H100 SXM GPUs contributed a 2x speedup, and the compute capabilities of Blackwell, combined with NVFP4 weight quantization, added a further 2x, with an even larger gain of 2.8x at higher interactivity points.

NVIDIA engineers helped Sarvam AI build and optimize a new family of sovereign foundation models at 3B, 30B, and 100B parameters, trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can leverage NVIDIA’s full-stack AI platform—from data to deployment—to achieve state-of-the-art performance and localized AI capabilities.

This post walks through the joint engineering effort and shares benchmarks for the speed-ups achieved on the NVIDIA H100, the largest-deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.

Making multilingual sovereign AI scalable with MoE

To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at 3B, 30B, and 100B parameter scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NVIDIA NeMo-RL was then used for post-training workflows, including long-context reasoning.

Sarvam 30B utilizes a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.

Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. Additionally, the 100B model adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the Key-Value (KV) cache, enabling massive context windows without the memory penalties of standard attention.

Both models feature a shared expert design where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and complex memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
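To keep the two designs side by side, the sketch below records the hyperparameters stated above in a simple Python dataclass. The field names are ours, not Sarvam AI’s configuration schema, and any value not stated in this post is left unset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoELayoutSketch:
    """Illustrative summary of the architecture details described in this post."""
    total_layers: int            # 1 dense layer + N MoE layers
    dense_layers: int
    num_experts: int             # routed experts per MoE layer
    top_k: int                   # experts activated per token
    attention: str               # "GQA" or "MLA"
    shared_expert: bool          # dedicated expert for common features
    moe_ffn_hidden: Optional[int] = None  # only stated for the 100B model

# Sarvam 30B: 19 layers (1 dense + 18 MoE), 128 experts, top-6 routing, GQA
sarvam_30b = MoELayoutSketch(
    total_layers=19, dense_layers=1, num_experts=128, top_k=6,
    attention="GQA", shared_expert=True,
)

# Sarvam 100B: 32 layers (1 dense + 31 MoE), 128 experts, top-8 routing,
# MLA attention, MoE FFN hidden size of 2048
sarvam_100b = MoELayoutSketch(
    total_layers=32, dense_layers=1, num_experts=128, top_k=8,
    attention="MLA", shared_expert=True, moe_ffn_hidden=2048,
)
```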

The performance challenge: SLAs and baseline configuration on NVIDIA H100

Optimizing the Sarvam 30B model wasn’t just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service level agreements (SLAs):

  • P95 (95th percentile) time to first token (TTFT): < 1000 ms
  • P95 (95th percentile) inter-token latency (ITL): < 15 ms

In inference performance testing, P95 (95th percentile) latency indicates that 95% of served requests complete faster than this threshold, while the slowest 5% take longer. It is a critical tail-latency metric for evaluating user experience and system stability, ensuring that even under load, most users experience no more than a specific delay. The engineering goal was to maximize the inference server’s token throughput (and with it the number of concurrently served requests) without breaching these P95 targets.
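For illustration, the short sketch below shows how such a P95 SLA check can be computed from raw latency samples; the values and the measurement harness are placeholders, not the benchmark setup used in this work.

```python
import numpy as np

# SLA targets for the voice-to-voice agent workload described above
TTFT_SLA_MS = 1000.0   # P95 time to first token
ITL_SLA_MS = 15.0      # P95 inter-token latency

def meets_sla(ttft_ms: np.ndarray, itl_ms: np.ndarray) -> bool:
    """Return True if 95th-percentile latencies stay under the SLA targets.

    ttft_ms: one TTFT sample per request
    itl_ms:  one sample per generated token (beyond the first)
    """
    p95_ttft = np.percentile(ttft_ms, 95)
    p95_itl = np.percentile(itl_ms, 95)
    return p95_ttft < TTFT_SLA_MS and p95_itl < ITL_SLA_MS

# Synthetic measurements for demonstration only (not real benchmark data)
rng = np.random.default_rng(0)
print(meets_sla(rng.normal(600, 120, 1_000), rng.normal(11, 1.5, 50_000)))
```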

For the initial performance analysis, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention—a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture; RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. Furthermore, SGLang’s Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.

The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a specific parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:

  • Expert parallelism (EP=2) for the expert weights. This configuration utilizes Grouped GEMM kernels to maximize compute density and ensures that the massive expert weights reside in HBM, reducing the cost of expert routing. 
  • Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across batches, significantly boosting the aggregate throughput of the prefill phase.

While this configuration provided a robust functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization – leading us to the specific kernel and precision strategies detailed below. 
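For reference, a baseline deployment along the lines described above might be launched roughly as follows. This is a hedged sketch, not Sarvam AI’s production launcher: the model path is a placeholder, and SGLang flag names and valid flag combinations vary across releases, so check `python -m sglang.launch_server --help` for your installed version.

```python
import subprocess

# Illustrative launch of the baseline 2x H100 configuration described above.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/sarvam-30b",   # placeholder path
    "--tp-size", "2",                       # two H100 SXM GPUs
    "--ep-size", "2",                       # EP=2 for the expert weights
    "--dp-size", "2",                       # DP=2 for attention
    "--enable-dp-attention",                # parallelize attention across batches
    "--port", "30000",
]
subprocess.run(cmd, check=True)
```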

From profiling to performance: eliminating MoE bottlenecks

Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting SLA requirements. To identify the precise bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams utilized NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of every kernel within a single transformer layer.
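The traces in this work were captured with NVIDIA Nsight Systems. As a lightweight stand-in for readers who want a quick per-kernel view without Nsight, PyTorch’s built-in profiler can report CUDA kernel timings in a similar spirit; the model below is a placeholder, not Sarvam 30B.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Capture per-kernel CUDA timings for one forward pass of a placeholder model.
model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize()

# Sort kernels by total GPU time to surface the most expensive operations.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```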

The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations were suffering from kernel launch overheads and redundant memory reads.

Figure 1. Nsight Systems profiler timeline of a single transformer layer during the prefill phase, showing SM activity and kernel execution over time. Red boxes mark the most expensive operations in the layer: query-key (QK) normalization with rotary positional embedding (RoPE), attention, router logits with top-K selection, the routed MoE expert GEMM plus gated linear unit (GLU), and the shared expert GEMM plus GLU.

Following these observations, we executed a targeted optimization strategy across three axes – kernel optimizations, scheduling efficiency, and disaggregated serving.

Making each transformer layer 1.34x faster with kernel-level optimizations

The NVIDIA and Sarvam AI teams systematically targeted the most expensive kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We first ran the models with a baseline SGLang implementation on H100 GPUs, then optimized them to achieve significant speedups, as detailed in Table 1 and the following text.

| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
|---|---|---|---|
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Use optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | Use FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Use fused TopK impl.; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Tune kernel params for the DEP2 (DP2 + EP2) configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlap with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |
Table 1. Kernel-level optimizations pay off: Fusing and tuning the hottest kernels cut layer time drastically and deliver faster prefill. 

MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.

  • Optimization: We implemented a Fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. Furthermore, we utilized a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound.
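For intuition, the unfused reference path that this optimization replaces looks roughly like the following PyTorch sketch. Tensor shapes are illustrative, details such as softmax placement vary by model, and the production implementation is a single fused CUDA kernel with replicated router weights rather than these separate ops.

```python
import torch

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 6):
    """Unfused MoE routing: a small router GEMM followed by top-k selection.

    In the baseline this spans several kernel launches and extra memory
    round-trips; the optimized path fuses logit computation and selection.
    """
    logits = hidden @ router_weight.t()                       # [tokens, num_experts]
    probs = torch.softmax(logits, dim=-1, dtype=torch.float32)  # gating probabilities
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)    # top-6 routing
    return weights, expert_ids

hidden = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
router_weight = torch.randn(128, 2048, device="cuda", dtype=torch.bfloat16)  # 128 experts
w, ids = route_tokens(hidden, router_weight)
```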

Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm, followed by rotary positional embeddings (RoPE), required reading and writing the massive KV cache twice.

  • Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption.
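The baseline two-pass pattern being fused here can be sketched in PyTorch as follows. The shapes and RoPE formulation are illustrative assumptions; the production kernel performs normalization and rotary embedding in a single in-place pass instead of separate calls.

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Root-mean-square normalization over the last dimension
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotary positional embedding applied to the two halves of the head dim
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

# Baseline: two separate passes over Q and K (extra global-memory traffic).
# The fused kernel performs both steps in one pass, keeping data in L2.
q = torch.randn(32, 3584, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(32, 3584, 128, device="cuda", dtype=torch.bfloat16)
norm_w = torch.ones(128, device="cuda", dtype=torch.bfloat16)
cos = torch.randn(3584, 64, device="cuda", dtype=torch.bfloat16)
sin = torch.randn(3584, 64, device="cuda", dtype=torch.bfloat16)

q = apply_rope(rms_norm(q, norm_w), cos, sin)
k = apply_rope(rms_norm(k, norm_w), cos, sin)
```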

Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By utilizing separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU’s compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.
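A minimal PyTorch sketch of this overlap pattern is shown below, using hypothetical tensor sizes rather than the production kernels.

```python
import torch

# Hypothetical activations and weights (not the production shapes)
hidden = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
router_w = torch.randn(128, 2048, device="cuda", dtype=torch.bfloat16)   # 128 experts
shared_w = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)  # shared expert projection

routing_stream = torch.cuda.Stream()
shared_stream = torch.cuda.Stream()

# Make the side streams wait for any pending work on the default stream.
routing_stream.wait_stream(torch.cuda.current_stream())
shared_stream.wait_stream(torch.cuda.current_stream())

# Router logits + top-k on one stream...
with torch.cuda.stream(routing_stream):
    logits = hidden @ router_w.t()
    probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
    weights, expert_ids = torch.topk(probs, k=6, dim=-1)

# ...while the shared-expert GEMM runs concurrently on another stream.
with torch.cuda.stream(shared_stream):
    shared_out = hidden @ shared_w.t()

# The routed-expert computation (not shown) waits for both results.
torch.cuda.current_stream().wait_stream(routing_stream)
torch.cuda.current_stream().wait_stream(shared_stream)
```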

These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict <1,000 ms time to first token (TTFT) and <15 ms inter-token latency (ITL) SLAs, as shown in Figure 2 below.

Figure 2. Performance gains from kernel optimizations on the Sarvam 30B model across concurrency points (tokens per second per GPU vs. tokens per second per user). At the 75 TPS/user operating point, the optimized kernels reach about 1,255 TPS/GPU versus about 998 TPS/GPU for the unoptimized baseline, a 1.26x improvement in overall token throughput per GPU.

How mixed prefill and decode scheduling improves GPU utilization

While kernel-level optimizations improve individual operation latency, significant efficiency gains can be achieved at the scheduler level by optimizing aggregated serving (prefill and decode run on the same GPU) and disaggregated serving (prefill and decode run on different GPUs).

The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often leads to suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading KV cache). Serializing them means the GPU’s Tensor Cores are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly at the low-concurrency operating point imposed by the tight SLA requirements.

To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to mix prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff: mixing heavy prefill chunks into the decode stream can increase inter-token latency (ITL) for active decode requests, as they must wait for the shared compute resources.

However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high ISL, low OSL scenario of interest here. For more decode-heavy cases, it might be worthwhile to pick smaller mixed chunk sizes or disable it altogether.
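As a rough illustration, mixed batching might be enabled on an SGLang server along the following lines. The flag names (--enable-mixed-chunk, --chunked-prefill-size) follow SGLang’s server arguments at the time of writing but can vary by release, and the chunk size shown is only illustrative; this extends the baseline launch sketch earlier in the post rather than reproducing the production configuration.

```python
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/sarvam-30b",    # placeholder path
    "--tp-size", "2", "--ep-size", "2",
    "--dp-size", "2", "--enable-dp-attention",
    "--enable-mixed-chunk",                  # mix prefill chunks into decode batches
    "--chunked-prefill-size", "4096",        # illustrative chunk size; tune per workload
]
subprocess.run(cmd, check=True)
```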

Figure 3. The impact of mixed prefill and decode chunk scheduling in SGLang aggregated serving (tokens per second per GPU vs. P95 request latency). At the 2-second request latency point, mixed chunks deliver roughly 1,310 TPS/GPU versus about 1,140 TPS/GPU for separate prefill and decode chunks, an approximately 15% token throughput gain.

How disaggregated serving removes communication from the critical path and boosts throughput 1.5x

Despite kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU’s memory, we pivoted from model parallelism to disaggregated serving.

We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: we observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over the optimized aggregated configuration, and 2x over the unoptimized H100 baseline), proving that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.
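A hedged sketch of what a 1P+1D launch could look like is shown below. The disaggregation and router flag names are assumptions based on SGLang’s prefill/decode disaggregation support and change across releases, so treat this as an outline rather than the exact commands used in this work.

```python
import os
import subprocess

common = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/sarvam-30b",   # placeholder path
]

# Prefill worker pinned to GPU 0, decode worker pinned to GPU 1.
# The --disaggregation-mode flag is an assumption about SGLang's PD support.
prefill = subprocess.Popen(
    common + ["--disaggregation-mode", "prefill", "--port", "30001"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)
decode = subprocess.Popen(
    common + ["--disaggregation-mode", "decode", "--port", "30002"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"},
)

# The SGLang router pairs the two workers behind a single endpoint; the
# router flag names below are likewise assumptions and vary by release.
router = subprocess.Popen([
    "python", "-m", "sglang_router.launch_router",
    "--pd-disaggregation",
    "--prefill", "http://127.0.0.1:30001",
    "--decode", "http://127.0.0.1:30002",
])
```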

Figure 4. The benefits of disaggregated serving on NVIDIA H100 SXM for the Sarvam 30B model (tokens per second per GPU vs. tokens per second per user), comparing the unoptimized aggregated EP2 baseline, the optimized aggregated EP2 configuration, and the optimized disaggregated 1P+1D configuration. At 75 TPS/user, disaggregated serving reaches roughly 1,995 TPS/GPU versus about 998 TPS/GPU for the baseline, a roughly 2x improvement.

The end-to-end impact of kernel, scheduling, and disaggregation optimizations

Figure 5 below summarizes the end-to-end performance speedup we were able to achieve through a combination of optimized kernels and scheduling optimizations. We also observe that disaggregated serving is the optimal configuration for this model, this ISL/OSL workload pattern, and these specific TTFT and ITL SLAs.

Figure 5. Progressive improvements in Sarvam 30B inference on NVIDIA H100 SXM, shown as the token throughput ratio at 75 TPS/user: baseline aggregated serving at 1.00; aggregated serving with optimized kernels (MoE GEMM shape tuning, router kernel optimization, fused normalization with RoPE, shared expert overlap) at 1.26; aggregated serving with optimized kernels and mixed prefill and decode chunking at 1.31; and disaggregated prefill and decode (1P+1D) with optimized kernels at 2.00.

Running the Sarvam 30B model on NVIDIA Blackwell GPUs

The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, representing a jump over the NVIDIA H100 GPU’s capabilities. This throughput is driven by the second-generation Transformer Engine, which utilizes the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.

To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike the multi-GPU H100 configurations, the NVIDIA HGX B200 system served the Sarvam 30B model most efficiently on just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell’s NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.
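For readers who want to try a similar workflow, below is a hedged sketch of NVFP4 post-training quantization with the NVIDIA TensorRT Model Optimizer (modelopt) library. The checkpoint path and calibration data are placeholders, and configuration names can differ across modelopt versions, so consult the library documentation before relying on this exact code.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/models/sarvam-30b-bf16"   # placeholder for the BF16 base checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use a handful of representative samples (ideally real traffic) for calibration.
calib_texts = ["placeholder calibration sample"] * 32

def forward_loop(m):
    # Run calibration batches so modelopt can collect quantization scales.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        with torch.no_grad():
            m(**inputs)

# Quantize to NVFP4 per the default NVFP4 configuration.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```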

As indicated in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency due to its superior compute, as well as exceptional throughput at higher concurrencies thanks to its memory capacity advantage.

Figure 6. Performance comparison between the NVIDIA B200 (NVFP4, aggregated, 1 GPU) and NVIDIA H100 SXM (FP8, disaggregated 1P+1D) for Sarvam 30B inference (tokens per second per GPU vs. tokens per second per user). At the 100 TPS/user operating point, the B200 delivers about 3,571 TPS/GPU versus about 1,274 TPS/GPU for the H100 SXM, a 2.8x throughput advantage; at 75 TPS/user, the B200 maintains a 2x advantage.

Learn more

Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining strict TTFT and inter-token latency targets required for real-world deployment. 

The result is not just a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams building large, production-grade AI systems on NVIDIA platforms.

More information about Sarvam AI’s models can be found here.

To begin exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

And read more about NVIDIA Cloud Functions, NVIDIA’s multi-cloud, high-performance AI inference solution, here.
