Agentic AI / Generative AI

Accelerating Text-to-SQL Inference on Vanna with NVIDIA NIM for Faster Analytics

May 30, 2025

By Divyansh Jain, Fay Wang and Gorkem Batmaz

Discuss (0)

AI-Generated Summary

Dislike

Vanna's text-to-SQL solution is optimized using NVIDIA NIM, which delivers accelerated inference microservices for generative AI models, improving performance and reducing latency.
The tutorial demonstrates how to initialize Vanna with NVIDIA-accelerated components, including the NVIDIA NeMo Retriever embedding model and Llama 3.1 70B NIM microservice.
By leveraging NVIDIA NIM for LLMs endpoints, the system can generate SQL queries from natural language inputs more efficiently, enabling more responsive analytics for user-generated queries.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Slow and inefficient query generation from natural language inputs bottlenecks decision-making. This forces analysts and business users to rely heavily on data teams, delaying insights and limiting agility.

Text-to-SQL is shifting how we interact with data, enabling users to query structured databases in natural language. This is especially useful when deploying a domain-specific model for specialized industries. However, inferencing with these models at scale for analytics workloads comes with latency and performance challenges.

In this tutorial, we’ll show how to optimize Vanna’s text-to-SQL solution using NVIDIA NIM, accelerated inference microservices that deliver optimized endpoints for generative AI models. Vanna’s open-source text-to-SQL solution has gained traction among organizations due to its flexibility, security, and adaptability.

You’ll learn how to:

Insert preprocessed data into a database and connect with Vanna.
Train Vanna on business terminology or definitions.
Leverage an NVIDIA NeMo Retriever embedding model to retrieve context for SQL generation.
Use an NVIDIA NIM for LLMs endpoints for accelerated inference.

For the demonstration, we used two open-source Steam games datasets from Kaggle, a rich dataset of game metadata that’s ideal for analytics queries.

Prerequisites

Python version 3.10 or higher installed
NVIDIA supported LangChain package for an NVIDIA NeMo Retriever embedding model
Access to an NVIDIA NIM endpoint
Familiarity with SQLiteVanna library installed
Steam datasets
- Dataset 1
- Dataset 2

Procedure

1. Download and preprocess the Steam dataset

To follow along, clone the repository located at /NVIDIA/GenerativeAIExamples and open the Vanna_with_NVIDIA notebook in the community section.

The preprocessing steps in this tutorial are adapted from an excellent Kaggle notebook and can be found in the ‘Data Preparation’ section of the Vanna with NVIDIA notebook.

Key preprocessing steps include:

Filtering required columns
Unpacking columns with a range into separate columns
Adding game length to each game title

We get three CSVs at the end of the preprocessing step:

tableau_games.csv
tableau_categories.csv
tableau_tags.csv

2. Initialize Vanna with NVIDIA NIM and NeMo Retriever

A Vanna instance needs a connection to a vector database, an embedder, and the LLM endpoint. We use the Milvus vector database as it has GPU acceleration capabilities, an NVIDIA retriever embedding model (llama-3.2-nv-embedqa-1b-v2), and the Llama 3.1 70B NIM microservice.

With NIM microservices, your inference model runs on NVIDIA accelerated infrastructure, delivering faster response times and cost efficiency for production deployments. NVIDIA NIM—which is included with the NVIDIA AI Enterprise software license—can run anywhere, giving you ownership of your customizations and full control of your intellectual property (IP) and AI applications.

For setup instructions on NIM endpoints, refer to the NVIDIA NIM documentation.

In the following text, we will go over the initialization of a Vanna instance with NVIDIA-accelerated components.

Define the Vanna Class as follows.

from pymilvus import MilvusClient, model
from vanna.milvus import Milvus_VectorStore
from vanna.openai import OpenAI_Chat
from openai import OpenAI

# Set the nvidia api key 
nvidia_api_key = '...'

# Define Vanna Class
class VannaMilvus(Milvus_VectorStore, OpenAI_Chat):
    def __init__(self, llm_client, config=None):
        Milvus_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, client=llm_client, config=config)

Next, create a NIM client based on OpenAI Wrapper with an NVIDIA API key.

def get_openai_client():
    client = OpenAI(
        base_url = "https://integrate.api.nvidia.com/v1",
        api_key = nvidia_api_key
    )
    return client

llm_client = get_openai_client()

Use a wrapper for the NeMo Retriever embedding model with Milvus DB. You can use the same nvidia_api_key.

class EmbeddingWrapper:
    def __init__(self, embedder):
        self.embedder = embedder

    def encode_documents(self, texts):
        result = self.embedder.embed_documents(texts)
        return [np.array(r) for r in result]

    def encode_queries(self, texts):
        embeddings = []
        for text in texts:
            embeddings.append(self.embedder.embed_query(text))
        return embeddings

vanna_embedder = EmbeddingWrapper(nvidia_embedder)

Next, create a NIM client based on OpenAI Wrapper with an NVIDIA API key.

def get_openai_client():
    client = OpenAI(
        base_url = "https://integrate.api.nvidia.com/v1",
        api_key = nvidia_api_key
    )
    return client

llm_client = get_openai_client()

Using a local Milvus DB instance.

milvus_uri = "./milvus_nvidia.db"
milvus_client_nvidia = MilvusClient(uri=milvus_uri)

Specify the config for the Vanna instance with the components defined previously.

# Specify the LLM used for SQL generation
model_name = "nvidia/llama-3.1-70b-instruct"

# Define the configuration for Vanna instance
config_nvidia = {
        "model": model_name,
        "milvus_client": milvus_client_nvidia,
        "embedding_function": vanna_embedder,
        "n_results": 2,  # The number of results to return from Milvus semantic search.
    }

vn_nvidia = VannaMilvus(llm_client, config=config_nvidia)

3. Create and populate the SQLite database

Using SQLite for this demo provides a lightweight, serverless database for testing text-to-SQL. You can use any database supported by Vanna. Insert the three CSVs from the pre-processing step into the DB.

import sqlite3
import pandas as pd

# Specify the path to the SQLite database
sqlite_path = 'steam_data.db'

# Connect to the SQLite database
sql_connect = sqlite3.connect(sqlite_path)
c = sql_connect.cursor()

# Create tables
init_sqls = """
CREATE TABLE IF NOT EXISTS games (
    app_id INTEGER PRIMARY KEY,
    name TEXT,
    release_date TEXT,
    price REAL,
    short_description TEXT,
    positive INTEGER,
    negative INTEGER,
    min_owners INTEGER,
    max_owners INTEGER,
    hltb_single REAL
);

CREATE TABLE IF NOT EXISTS categories (
    app_id INTEGER,
    categories TEXT,
    FOREIGN KEY (app_id) REFERENCES games(app_id)
);

CREATE TABLE IF NOT EXISTS tags (
    app_id INTEGER,
    tags TEXT,
    tag_frequencies TEXT,
    FOREIGN KEY (app_id) REFERENCES games(app_id)
);
"""

for sql in init_sqls.split(";"):
    c.execute(sql)

# Read the CSV files
games_df = pd.read_csv('processed_dataset/games.csv')
categories_df = pd.read_csv('processed_dataset/categories.csv')
tags_df = pd.read_csv('processed_dataset/tags.csv')

# Insert data into tables
games_df.to_sql('games', sql_connect, if_exists='append', index=False)
categories_df.to_sql('categories', sql_connect, if_exists='append', index=False)
tags_df.to_sql('tags', sql_connect, if_exists='append', index=False)

sql_connect.commit()

Now, connect the SQL DB to Vanna.

# Connect to the SQLite database
vn_nvidia.connect_to_sqlite(sqlite_path)

4. Train with data

Vanna performs best with context on the queried data and the business-specific terminology.

Remove existing training data to avoid de-duplication and train Vanna with the database data definition language (DDL).

# Remove existing training data
existing_training_data = vn_nvidia.get_training_data()
if len(existing_training_data) > 0:
    for _, training_data in existing_training_data.iterrows():
        vn_nvidia.remove_training_data(training_data["id"])

# Get the DDL of the SQLite database
df_ddl = vn_nvidia.run_sql("SELECT type, sql FROM sqlite_master WHERE sql is not null")

# Train the model on the DDL data
for ddl in df_ddl["sql"].to_list():
    vn_nvidia.train(ddl=ddl)

Next, add documentation about the three tables in the database schema games, categories, and tags.

# Add documentation about your business terminology or definitions.
vn_nvidia.train(
documentation="""
This dataset is used to answer questions about the game trends.
"""
)

# Add documentation about the tables
vn_nvidia.train(
documentation="""
The games table contains information about the games.
The app_id is the unique identifier for the game.This is a primary key.
The name is the name of the game.
The release_date is the date the game was released.
The price is the price of the game. Price in USD, 0.0 if its free.
The short_description is a brief description of the game.
The positive is the number of positive reviews or votes.
The negative is the number of negative reviews or votes.
The min_owners is the minimum number of owners. Used together with max_owners to get an estimate of the player base.
The max_owners is the maximum number of owners. Used together with min_owners to get an estimate of the player base.
The hltb_single is the average playtime of the game. This is an estimate.
"""
)

vn_nvidia.train(
documentation="""
The categories table contains information about the categories of the games.
The app_id is the unique identifier for the game.
The categories is the categories of the game.
The app_id is a foreign key to the games table.
"""
)

vn_nvidia.train(
documentation="""
The tags table contains information about the tags of the games.
The app_id is the unique identifier for the game.
The tags is the tags of the game. These are user defined.
The tag_frequencies is the frequencies of the tags.
The app_id is a foreign key to the games table.
"""
)

Then, view the training data with a simple command.

training_data = vn_nvidia.get_training_data()
training_data

5. Generate SQL

You are all set to start querying the database with NIM-powered Vanna.

Start with a simple query. The generated SQL can also be printed before execution.

sql = vn_nvidia.generate_sql("Which 5 games have the most positive reviews and how many?")
vn_nvidia.run_sql(sql)

Run a harder query now.

sql = vn_nvidia.generate_sql("Which indie game has the biggest player base?")
vn_nvidia.run_sql(sql)

Conclusion

In this tutorial, you learned how to accelerate text-to-SQL inference on Vanna using NVIDIA NIM. With a clean database schema, real-world dataset, and an optimized inference endpoint, the system is well-positioned to deliver more responsive analytics for user-generated queries.

To explore further:

Try Vanna on your own dataset using this notebook as a starting point.
Deploy NVIDIA NIM endpoints for production-scale inference.

Experiment with adding different training information to unlock better SQL generation.

Discuss (0)

About the Authors

About Divyansh Jain
Divyansh Jain is a senior data scientist on the RAPIDS Applied Data Science team at NVIDIA, where he builds GPU-accelerated AI solutions for large-scale analytics and intelligent automation. His recent work spans generative AI, SQL reasoning, predictive maintenance, and manufacturing optimization. Before NVIDIA, he earned his B.S. in Computer Science from VIT and M.S. in Data Science from the University of Southern California.

View all posts by Divyansh Jain

About Fay Wang
Fay Wang is a senior data engineer on the Applied Data Science team at NVIDIA. She has extensive experience in curating and managing large-scale datasets, designing robust SQL-based data pipelines, and integrating cutting-edge LLM technologies to enhance data processing and analytics. Before joining NVIDIA in 2017, she worked at IBM as a senior software engineer developing various applications on DB2. She holds an M.S. in Computer Engineering from Santa Clara University.

View all posts by Fay Wang

About Gorkem Batmaz
Gorkem Batmaz is a senior data scientist on the NVIDIA Morpheus team at NVIDIA. His focus is on applying GPU-accelerated high-performance analytics to solve cybersecurity challenges. He developed ML/NLP-based predictive maintenance, phishing-DGA-Malware detection, asset classification, periodicity detection, and Generative AI for cyber solutions. Before joining the Morpheus team, Gorkem worked in the cybersecurity team for NVIDIA's autonomous cars division. Before joining NVIDIA in 2011, he worked at Motorola and Alcatel-Lucent. He holds an M.S. in Engineering & Technology Management and a B.S. in Electrical Electronics Engineering from Bogazici University.

View all posts by Gorkem Batmaz

Accelerating Text-to-SQL Inference on Vanna with NVIDIA NIM for Faster Analytics

Prerequisites

Procedure

Conclusion

Tags

About the Authors

Comments