At Microsoft Ignite 2025, the vision for an AI-ready enterprise database becomes a reality with the announcement of Microsoft SQL Server 2025, giving developers powerful new tools like built-in vector search and native SQL APIs for calling external AI models. NVIDIA has partnered with Microsoft to seamlessly connect SQL Server 2025 with the NVIDIA Nemotron RAG collection of open models, enabling you to build high-performance, secure AI applications on your data in the cloud or on-premises.
Retrieval-augmented generation (RAG) is one of the most effective ways for enterprises to put their data to work: it grounds AI in live, proprietary data without the immense cost and complexity of retraining a model from scratch. Yet the effectiveness of RAG depends on compute-intensive steps, chief among them vector embedding generation, which creates a serious performance bottleneck on traditional CPU-only infrastructure.
This challenge is compounded by the complexity of deployment at scale and the need for model flexibility. Enterprises require a portfolio of embedding models to balance accuracy, speed, and cost for different tasks.
This post details the new NVIDIA reference architecture that solves this problem, built on SQL Server 2025 and Llama Nemotron Embed 1B v2, part of the Nemotron RAG family. It explains how this integration enables you to call the Nemotron RAG model directly from your SQL Server database, turning the database into a high-performance AI application engine. The implementation covers both Azure Cloud and Azure Local, addressing the main SQL Server deployment scenarios in the cloud and on-premises.
Solving enterprise AI RAG challenges with Nemotron RAG and SQL Server 2025
Connecting SQL Server 2025 to the flexible, accelerated NVIDIA AI engine with Nemotron RAG solves the core enterprise RAG challenges: performance, deployment, flexibility, and security.
Eliminate RAG performance bottlenecks
This architecture solves the primary RAG performance bottleneck by offloading embedding generation from CPUs to NVIDIA GPUs using Llama Nemotron Embed 1B v2, a state-of-the-art open model that creates highly accurate embeddings optimized for retrieval tasks. It offers multilingual and cross-lingual text question-answering retrieval with long-context support and efficient data storage.
Llama Nemotron Embed 1B v2 is part of Nemotron RAG, a collection of extraction, embedding, and reranking models fine-tuned with the Nemotron RAG datasets and scripts to achieve the best accuracy.
On the database side, SQL Server 2025 delivers seamless, high-performance data retrieval with vector search, powered by native vector distance functions. Hosting embedding models locally eliminates network overhead and cuts latency, two key factors in overall performance.
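As a minimal illustration of the native distance functions, the hedged T-SQL sketch below ranks rows by cosine distance to a query vector. The table, column names, and tiny three-dimensional vectors are hypothetical; real embeddings use the model's full dimension.

```sql
-- Minimal sketch (hypothetical table and data): rank rows by cosine
-- distance to a query vector using the built-in VECTOR_DISTANCE function.
-- Real embeddings use the model's dimension, not a toy VECTOR(3).
DECLARE @query VECTOR(3) = CAST('[0.10, 0.75, 0.05]' AS VECTOR(3));

SELECT TOP (5)
    Id,
    Content,
    VECTOR_DISTANCE('cosine', Embedding, @query) AS Distance  -- lower = more similar
FROM dbo.Documents
ORDER BY Distance;
```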
Deploy AI models as simple, containerized endpoints
Deployment is where NVIDIA NIM microservices come in. NIM microservices are prebuilt, production-ready containers designed to streamline the deployment of the latest optimized AI models, like NVIDIA Nemotron RAG, across any NVIDIA-accelerated infrastructure, whether in the cloud or on-premises. With NIM, you can deploy AI models as simple, containerized endpoints without managing complex libraries or dependencies.
Data residency and compliance requirements are also addressed through locally hosted models powered by NIM microservices. Ease of use is another key advantage: the prebuilt nature of NIM, combined with native SQL REST APIs, significantly reduces the learning curve, making it easier to bring AI closer to the data customers already have.
Maintain security and flexibility
This architecture provides a portfolio of state-of-the-art Nemotron RAG models while keeping proprietary data secure within your SQL Server database. NIM microservices are designed for enterprise-grade security and backed by NVIDIA enterprise support. All communication between NIM microservices and SQL Server is further secured using end-to-end HTTPS encryption.
Nemotron RAG and Microsoft SQL Server 2025 reference architecture
The Nemotron RAG and SQL Server 2025 reference architecture details the implementation of the solution using the Llama Nemotron Embed 1B v2 embedding model, delivered as a NIM microservice. This enables enterprise-grade, secure, GPU-accelerated RAG workflows directly from SQL Server on Azure Cloud or Azure Local.
For the full code, deployment scripts, and detailed walkthroughs for this solution, see NVIDIA NIM with SQL Server 2025 AI on Azure Cloud and Azure Local.
Core architecture components
Figure 1 shows the three core architecture components and how they fit together; each is described in detail below.

SQL Server 2025: The AI-ready database
The foundation of this solution is SQL Server 2025, which introduces four transformative capabilities that act as the engine for in-database AI:
- Native vector data type: This feature enables you to securely store vector embeddings directly alongside structured data. It eliminates the need for a separate vector database, simplifying your architecture, reducing data movement, and enabling hybrid searches such as finding products that are both “running shoes” (vector search) and “in stock” (structured filter).
- Vector distance search: You can now perform similarity searches directly within SQL Server 2025 using built-in functions. This allows you to rank results by closeness in embedding space, enabling use cases like semantic search, recommendation systems, and personalization—all without leaving the database.
- Create external model: Register and manage external AI models (NIM microservices, for example) as first-class entities in SQL Server 2025. This provides a seamless way to orchestrate inference workflows while keeping governance and security centralized.
- Generate embeddings: Use the AI_GENERATE_EMBEDDINGS function to create embeddings for text or other data directly from T-SQL. Under the hood, this function calls an external REST API, enabling real-time embedding generation without complex integration steps (see the sketch after this list).
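To make these capabilities concrete, here is a minimal T-SQL sketch that ties them together, assuming a reachable NIM endpoint. The endpoint URL, model ID, table, and embedding dimension are placeholders, not values from this solution; check your NIM deployment (for example, its /v1/models endpoint) for the actual model ID and dimension.

```sql
-- 1) Register a NIM endpoint as an external model. NIM exposes an
--    OpenAI-compatible API, so API_FORMAT = 'OpenAI'. The URL and
--    model ID below are placeholders.
CREATE EXTERNAL MODEL NemotronEmbed
WITH (
    LOCATION   = 'https://my-nim-endpoint.example.com/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nvidia/llama-nemotron-embed-1b-v2'  -- placeholder model ID
);
GO

-- 2) Store embeddings next to structured data with the native VECTOR type.
--    The dimension (1024 here) is an assumption; it must match the
--    embedding size your model actually returns.
CREATE TABLE dbo.Products (
    ProductId   INT IDENTITY PRIMARY KEY,
    Name        NVARCHAR(200) NOT NULL,
    InStock     BIT NOT NULL,
    Description NVARCHAR(MAX),
    Embedding   VECTOR(1024)
);
GO

-- 3) Generate an embedding from T-SQL and run a hybrid search:
--    semantic ranking (vector search) plus a structured filter.
DECLARE @q VECTOR(1024) =
    AI_GENERATE_EMBEDDINGS(N'running shoes' USE MODEL NemotronEmbed);

SELECT TOP (10)
    ProductId,
    Name,
    VECTOR_DISTANCE('cosine', Embedding, @q) AS Distance
FROM dbo.Products
WHERE InStock = 1      -- structured filter
ORDER BY Distance;     -- lower distance = closer match
```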
NVIDIA NIM microservices: The accelerated AI engine
The Nemotron RAG family of open models, including the Llama Nemotron Embed 1B v2 model used in this reference architecture, is delivered as production-ready NVIDIA NIM microservices that run in standard Docker containers.
This approach simplifies deployment and ensures compatibility across cloud and local Windows or Linux environments with NVIDIA GPUs. The models can be deployed on Azure Container Apps (ACA) or on-premises with Azure Local. This containerized delivery supports both automatic and manual scaling strategies and provides the ideal “ground-to-cloud” flexibility for use with SQL Server 2025.
- Cloud scale: You can deploy NIM microservices to ACA with serverless NVIDIA GPUs. This approach abstracts all infrastructure management. You get on-demand, GPU-accelerated inference that scales to zero with per-second billing, optimizing costs and simplifying operations.
- On-premises: For maximum data sovereignty and low latency, you can run the same NIM container on-premises using Azure Local with NVIDIA GPUs. Azure Local extends Azure’s management plane to your own hardware, enabling you to run AI directly against your local data while meeting strict compliance or performance needs.
The link between SQL Server and NIM microservices
The communication bridge between SQL Server and the NIM microservice is simple and robust, built on standard, secure web protocols.
- OpenAI-compatible API: NVIDIA NIM exposes an OpenAI-compatible API endpoint. This allows SQL Server 2025 to use its native functions to call the NIM service just as it would call an OpenAI service, ensuring seamless, out-of-the-box integration.
- Standard POST requests: SQL Server 2025 issues standard HTTPS POST requests to retrieve results such as embeddings.
- Secure and flexible communication: The design uses TLS certificates for end-to-end encryption, establishing mutual trust and ensuring all responses are secure, performant, and standards-compliant for both cloud and on-premises deployments. This is a significant advantage over a remote-only model: you retain full control, and proprietary data never leaves your secure environment (a sketch of such a call follows this list).
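To show the shape of this traffic, the hedged sketch below issues the same kind of OpenAI-style POST manually from T-SQL. It assumes the sp_invoke_external_rest_endpoint procedure is available and enabled in your build, and the URL and model ID are placeholders; in normal use, AI_GENERATE_EMBEDDINGS makes an equivalent call for you.

```sql
-- Hedged sketch: a raw OpenAI-compatible /v1/embeddings call from T-SQL.
-- Assumes sp_invoke_external_rest_endpoint is available and enabled;
-- the URL and model ID are placeholders.
DECLARE @response NVARCHAR(MAX);

EXEC sp_invoke_external_rest_endpoint
    @url      = N'https://my-nim-endpoint.example.com/v1/embeddings',
    @method   = N'POST',
    -- Retrieval embedding NIMs typically also accept an "input_type"
    -- field ("query" or "passage"); treat that as an assumption here.
    @payload  = N'{"model": "nvidia/llama-nemotron-embed-1b-v2", "input": ["running shoes"], "input_type": "query"}',
    @response = @response OUTPUT;

-- The wrapper places the HTTP body under $.result; in the OpenAI
-- response format, the embedding is a JSON array at data[0].embedding.
SELECT JSON_QUERY(@response, '$.result.data[0].embedding') AS Embedding;
```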
While this reference architecture features the state-of-the-art Nemotron RAG models, it can be extended to enable SQL Server 2025 to call any NIM microservice to power a broad range of AI applications—such as text summarization, content classification, or predictive analysis—all performed directly on your data in SQL Server 2025.
Two methods of deployment
This post covers the two primary deployment patterns for this solution: on-premises (using Azure Local) and cloud (using Azure Container Apps). Both patterns rely on the same core mechanism: SQL Server 2025 calling the NVIDIA NIM microservice endpoint using the standard OpenAI-compatible protocol.
On-premises implementation with Azure Local
The on-premises implementation ensures maximum flexibility, supporting practical combinations of Windows and Linux systems running on NVIDIA GPU-enabled servers, such as:
- Windows/Ubuntu Server or Windows/Ubuntu on-premises virtual machine running both SQL Server and NVIDIA NIM
- Windows running SQL Server and Ubuntu running NVIDIA NIM, or vice versa
To deploy, use Azure Local, the new Microsoft offering that extends the Azure Cloud platform directly into your on-premises environment. For full installation instructions, including how to establish secure communication and deploy NIM, visit NVIDIA/GenerativeAIExamples on GitHub. Note that this solution was validated using SQL Server 2025 (RC 17.0.950.3).
Cloud implementation
The cloud deployment uses NVIDIA Llama Nemotron Embedding NIM hosted on ACA, Microsoft Azure’s fully managed serverless container platform, which fully supports and extends the advantages of the proposed architecture. To learn more, see NVIDIA NIM with Microsoft SQL Server 2025 AI on Azure Cloud and Azure Local on the NVIDIA/GenerativeAIExamples GitHub repo.
This serverless approach provides several key advantages for deploying your AI applications with data stored in SQL Server 2025.
To accelerate NIM replica startup, we recommend using ACA volumes backed by an Azure Files share or ephemeral storage to persist the local NIM cache. The number of replicas is managed automatically through ACA HTTP scaling, allowing you to scale to zero.
ACA applications can host multiple versions and types of NIM in parallel, each accessible through distinct URLs configured in SQL Server.
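Concretely, each deployment’s URL can be registered as its own external model in SQL Server. The hedged sketch below registers two parallel NIM endpoints; all URLs and model IDs are placeholders.

```sql
-- Hedged sketch: two parallel NIM deployments on ACA, each registered
-- as a distinct external model. URLs and model IDs are placeholders.
CREATE EXTERNAL MODEL NemotronEmbedProd
WITH (
    LOCATION   = 'https://nim-embed-prod.example.azurecontainerapps.io/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nvidia/llama-nemotron-embed-1b-v2'   -- placeholder
);
GO

CREATE EXTERNAL MODEL NemotronEmbedCanary
WITH (
    LOCATION   = 'https://nim-embed-canary.example.azurecontainerapps.io/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nvidia/llama-nemotron-embed-1b-v2'   -- placeholder
);
```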
Solution demo
For complete instructions for running the full end-to-end workflow, see the demo SQL Server 2025 AI functionality with NVIDIA Retrieval QA using E5 Embedding v5.
Specifically, the demo SQL scripts guide you through the following steps (the population and verification steps are sketched after the list):
- Create the AdventureWorks sample database
- Create the ProductDescriptionEmbeddings demo table
- Execute demo scripts to populate embeddings through the NVIDIA NIM integration
- Verify and visualize stored embeddings using Select_Embeddings.sql
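For orientation, the hedged sketch below shows what the population and verification steps amount to. Table and column names are assumptions based on the AdventureWorks schema, and the repo scripts remain the authoritative version.

```sql
-- Hedged sketch of the population step: generate an embedding for each
-- AdventureWorks product description through the registered NIM model.
-- Table and column names are assumptions; see the repo for the real scripts.
INSERT INTO dbo.ProductDescriptionEmbeddings (ProductDescriptionID, Embedding)
SELECT
    pd.ProductDescriptionID,
    AI_GENERATE_EMBEDDINGS(pd.Description USE MODEL NemotronEmbed)
FROM Production.ProductDescription AS pd;

-- Spot-check stored embeddings, analogous to Select_Embeddings.sql.
SELECT TOP (5) ProductDescriptionID, Embedding
FROM dbo.ProductDescriptionEmbeddings;
```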
This workflow demonstrates the new AI capabilities of SQL Server 2025, using the built-in T-SQL functions VECTOR_DISTANCE and AI_GENERATE_EMBEDDINGS along with the CREATE EXTERNAL MODEL statement, which together form the foundation of AI integration in SQL Server 2025.
Get started with SQL Server 2025 and NVIDIA Nemotron RAG
The integration of Microsoft SQL Server 2025 with NVIDIA Nemotron RAG, delivered as production-grade NVIDIA NIM microservices, offers a seamless “ground-to-cloud” path for building high-performance AI applications. By combining the SQL Server 2025 built-in AI capabilities with the NVIDIA GPU-optimized inference stack, you can now solve the primary RAG performance bottleneck, bringing AI directly to your data: securely, efficiently, and without the operational complexity of managing data pipelines.
This joint reference architecture demonstrates how you can build RAG applications that generate embeddings, perform semantic search, and invoke inference services directly within SQL Server 2025. This approach delivers the flexibility to deploy state-of-the-art models such as NVIDIA Nemotron wherever the data lives—on Azure Cloud or on-premises with Azure Local—while preserving full data sovereignty.
Ready to start building? Get all deployment scripts, code samples, and detailed walkthroughs for both cloud and on-premises scenarios through NVIDIA NIM with Microsoft SQL Server 2025 AI on Azure Cloud and Azure Local on the NVIDIA/GenerativeAIExamples GitHub repo.