
Build an AI Catalog System That Delivers Localized, Interactive Product Experiences

Learn how to deploy, integrate, and customize the NVIDIA Blueprint for Retail at scale.


E-commerce catalogs often contain sparse product data: generic images, a basic title, and a short description. This limits discoverability, engagement, and conversion. Manual enrichment doesn’t scale because it relies on catalog managers to hand-write descriptions, apply tags, and categorize products. The process is slow, inconsistent, and error-prone.

This tutorial shows developers, product managers, and catalog teams how to deploy an AI-powered enrichment blueprint that transforms a single product image into rich, localized catalog entries.

Using NVIDIA Nemotron large language models (LLMs) and vision-language models (VLMs), including Nemotron-Nano-12B-v2-VL and Llama-3.3-Nemotron-Super-49B-v1.5, along with FLUX.1-Kontext-dev for image generation and TRELLIS image-to-3D models, the system automatically generates detailed titles and descriptions, accurate categories, comprehensive tags, localized cultural variations, and interactive 3D assets tailored to regional markets.

The tutorial covers the complete architecture, API usage for VLM analysis and asset generation, deployment strategies with Docker containers, and real-world integration patterns. By the end, you’ll know how to automate catalog enrichment at scale, turning sparse product data like “Black Purse” into rich listings like “Glamorous Black Evening Handbag with Gold Accents,” complete with detailed descriptions, validated categories, tags, and multiple asset types.

Prerequisites

This tutorial assumes intermediate to advanced technical knowledge. It involves working with AI APIs, building REST services, and deploying containerized applications. Basic familiarity with the following technologies will help you follow along and implement the system:

  • Python 3.11+
  • The uv package manager (or pip)
  • An NVIDIA API key
  • A HuggingFace token for FLUX model access
  • Docker and Docker Compose 

Creating an AI-powered catalog enrichment blueprint

To close the scalability and consistency gaps of manual catalog enrichment, and to address the discoverability and conversion problems of sparse data, the blueprint is designed as an end-to-end catalog transformation pipeline: a modular system of specialized models, containerized with Docker and served through NVIDIA NIM for enterprise-grade performance.

Figure 1. Catalog enrichment workflow: product images and optional text pass through Nemotron VLM and LLM models to generate localized titles, descriptions, categories, and attributes, while FLUX and TRELLIS models handle 2D image and 3D asset generation

Here’s the core technology stack:

  • NVIDIA Nemotron VLM (nemotron-nano-12b-v2-vl): Analyzes product images to extract features, categories, and context.
  • NVIDIA Nemotron LLM (llama-3_3-nemotron-super-49b-v1_5): Acts as the “brain,” generating rich, localized text (titles, descriptions) and planning culturally aware prompts for image generation.
  • Black Forest Labs FLUX.1-Kontext-dev: Generates new, high-quality 2D image variations.
  • Microsoft TRELLIS Image-to-3D: Transforms 2D product images into interactive 3D models.

The most important part of this solution is its modular, three-stage API. A common mistake is building one slow, monolithic API call that does everything.

  1. Stage 1: Fast VLM analysis (POST /vlm/analyze)
    • Job: Takes an image and locale, plus optional existing product data and brand instructions.
    • Output: Rich, structured JSON with improved titles, descriptions, validated categories, comprehensive tags, and attributes localized to the target region.
  2. Stage 2: Image generation (POST /generate/variation)
    • Job: Takes the Stage 1 output (title, description, and tags) plus the original image.
    • Output: A new, culturally-appropriate 2D image variation.
  3. Stage 3: 3D asset generation (POST /generate/3d)
    • Job: Takes the original 2D image.
    • Output: An interactive 3D .glb model.

The frontend can call /vlm/analyze, get instant results to show the user, and then offer buttons to “generate 3D model” or “create marketing assets,” which trigger asynchronous backend jobs.
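
To see why this separation matters in practice, here’s a minimal Python client sketch of the staged flow. The helper names are illustrative; the endpoints and form fields come from the steps below:

import requests

BASE_URL = "http://localhost:8000"  # local backend from Step 1 below


def enrich_product(image_path, locale="en-US"):
    # Stage 1: fast, synchronous analysis -- results can be shown immediately.
    with open(image_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/vlm/analyze",
            files={"image": (image_path, f, "image/jpeg")},
            data={"locale": locale},
        )
    resp.raise_for_status()
    return resp.json()


def generate_variation(image_path, analysis):
    # Stage 2: slower 2D generation, seeded with the Stage 1 text.
    # Triggered on demand (e.g., a "create marketing assets" button).
    with open(image_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/generate/variation",
            files={"image": (image_path, f, "image/jpeg")},
            data={"locale": analysis["locale"],
                  "title": analysis["title"],
                  "description": analysis["description"]},
        )
    resp.raise_for_status()
    return resp.json()


def generate_3d(image_path):
    # Stage 3: 3D generation needs only the original image.
    with open(image_path, "rb") as f:
        resp = requests.post(f"{BASE_URL}/generate/3d",
                             files={"image": (image_path, f, "image/jpeg")})
    resp.raise_for_status()
    return resp.content  # binary .glb by default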

Building the enrichment pipeline

In this section, the backend runs locally so the enrichment APIs can be called end-to-end: upload a product image to generate enriched, localized metadata, create an image variation with quality scoring, and produce a 3D asset. The following steps walk through each stage of the three-stage API in turn.

Step 1: Set up the local backend

First, get the FastAPI backend server running on a local machine to test the API endpoints.

Clone the repository:

git clone https://github.com/NVIDIA-AI-Blueprints/Retail-Catalog-Enrichment.git
cd Retail-Catalog-Enrichment

Create an .env file in the root directory with the API keys:

NGC_API_KEY=your_nvidia_api_key_here
HF_TOKEN=your_huggingface_token_here

Set up the Python environment using uv (or pip):

# Create and activate a virtual environment
uv venv .venv
source .venv/bin/activate

# Install dependencies
uv pip install -e .

Run the FastAPI server with Uvicorn:

uvicorn --app-dir src backend.main:app --host 0.0.0.0 --port 8000 --reload

The API is now live at http://localhost:8000. Its health can be checked at http://localhost:8000/health.

Step 2: Visual analysis

With the server running, the core /vlm/analyze endpoint can be used. This is the workhorse of the system, designed for instant, synchronous feedback.

Execute a basic analysis of a product image. This command sends a product image (bag.jpg) and specifies the en-US locale.

curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  http://localhost:8000/vlm/analyze

Review the JSON response. In just a few seconds, a rich JSON object is returned. This is the “before-and-after” transformation:

{
  "title": "Glamorous Black Evening Handbag with Gold Accents",
  "description": "This exquisite handbag exudes sophistication and elegance. Crafted from high-quality, glossy leather...",
  "categories": ["accessories"],
  "tags": ["black leather", "gold accents", "evening bag", "rectangular shape"],
  "colors": ["black", "gold"],
  "locale": "en-US"
}
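
That response maps cleanly onto a catalog record. As a minimal sketch (the record shape is illustrative, and enrich_product is the helper from the client sketch above):

sparse_record = {"sku": "BAG-001", "title": "Black Purse", "description": ""}

analysis = enrich_product("bag.jpg", locale="en-US")

# Overwrite the sparse fields with the enriched, localized values.
enriched_record = {
    **sparse_record,
    "title": analysis["title"],
    "description": analysis["description"],
    "categories": analysis["categories"],
    "tags": analysis["tags"],
    "colors": analysis["colors"],
}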

Step 3: Augment data with localization and brand voice

The true power of the API comes from its augmentation capabilities. Localize content for a new region by providing existing product data and a new locale. This example targets the Spanish market (es-ES); the system enhances the sparse data using regional terminology.

curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Black Purse","description":"Elegant bag"}' \
  -F "locale=es-ES" \
  http://localhost:8000/vlm/analyze

Apply a custom brand voice using the brand_instructions parameter. A brand isn’t generic, so the content shouldn’t be either. This guides the AI’s tone, voice, and taxonomy.

curl -X POST \
  -F "image=@product.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Beauty Product","description":"Nice cream"}' \
  -F "locale=en-US" \
  -F 'brand_instructions=You work at a premium beauty retailer. Use a playful, empowering, and inclusive brand voice. Focus on self-expression and beauty discovery. Use terms like "beauty lovers", "glow", "radiant", and "treat yourself".' \
  http://localhost:8000/vlm/analyze

The AI will generate a description that’s accurate and on-brand.
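
Because brand_instructions is just another form field, the same voice can be applied consistently across an entire catalog and multiple markets. A minimal sketch, assuming the enrich_product helper from earlier is extended to forward optional product_data and brand_instructions form fields:

import json

BRAND_VOICE = (
    'You work at a premium beauty retailer. Use a playful, empowering, '
    'and inclusive brand voice. Use terms like "beauty lovers", "glow", '
    '"radiant", and "treat yourself".'
)

sparse = json.dumps({"title": "Beauty Product", "description": "Nice cream"})

# One enrichment pass per target market, same image and brand voice.
# enrich_product is assumed extended with product_data / brand_instructions.
listings = {
    locale: enrich_product("product.jpg", locale=locale,
                           product_data=sparse,
                           brand_instructions=BRAND_VOICE)
    for locale in ("en-US", "es-ES", "fr-FR")
}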

Step 4: Generate cultural image variations

Now that rich, localized text has been generated, the /generate/variation endpoint can be used to create matching 2D marketing assets.

Generate a new image by passing in the results from Step 2. This endpoint uses the localized text as a plan to generate a new image with the FLUX model.

curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  -F "title=Glamorous Black Evening Handbag with Gold Accents" \
  -F "description=This exquisite handbag exudes sophistication..." \
  -F 'categories=["accessories"]' \
  -F 'tags=["black leather","gold accents","evening bag"]' \
  -F 'colors=["black","gold"]' \
  http://localhost:8000/generate/variation

This call returns JSON with a generated_image_b64 string. If using the es-ES locale, the model generates a background more fitting for that market, like a Mediterranean courtyard instead of a modern studio.

Review the JSON response:

{
  "generated_image_b64": "iVBORw0KGgoAAAANS...",
  "artifact_id": "a4511bbed05242078f9e3f7ead3b2247",
  "image_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.png",
  "metadata_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.json",
  "locale": "en-US"
}
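
Saving the generated asset is then just a matter of decoding the base64 payload. A quick sketch, passing only a subset of the Stage 1 fields for brevity:

import base64
import requests

with open("bag.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/generate/variation",
        files={"image": ("bag.jpg", f, "image/jpeg")},
        data={"locale": "en-US",
              "title": "Glamorous Black Evening Handbag with Gold Accents",
              "description": "This exquisite handbag exudes sophistication..."},
    )
result = resp.json()

# Decode the returned base64 image and write it next to the original.
with open("bag_variation.png", "wb") as out:
    out.write(base64.b64decode(result["generated_image_b64"]))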

Step 5: Automated quality control with NVIDIA Nemotron VLM

Generative AI is powerful, but it can hallucinate. In an enterprise catalog, a “Black Handbag” can’t suddenly have a blue strap or a missing handle. To solve this, an agentic reflection loop has been implemented.

Instead of relying on human reviewers, a Quality Assurance Agent powered by NVIDIA Nemotron VLM can be deployed. This module acts as a strict critic, performing a “reflection” step that compares the generated variation against the original product image to ensure fidelity.

Before the API responds, this agent analyzes the generated image against the original product photo across five strict dimensions:

  • Product consistency: Do colors, materials, and textures match the original?
  • Structural fidelity: Are key elements like handles, zippers, and pockets preserved?
  • Size and scale: Does the product look realistically sized in its new context?
  • Anatomical accuracy: If a human model is present, are the hands and fingers rendered correctly?
  • Background quality: Is the lighting and context photorealistic?

The “VLM Judge” output: The API returns the generated asset alongside a detailed quality report, including a quality score and a list of specific issues.

{
  "generated_image_b64": "iVBORw0KGgoAAAANSUhEUgA...",
  "artifact_id": "027c08866d90450399f6bf9980ab7...",
  "image_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...png",
  "metadata_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...json",
  "quality_score": 72.5,
  "quality_issues": [
    "Product appears slightly oversized relative to background context",
    "Minor texture inconsistency on handle hardware"
  ],
  "locale": "en-US"
}

This feature provides the critical metadata needed for automation. Software integrators can expand this functionality to build self-correcting pipelines where the system autonomously retries generation with adjusted prompts until the VLM Judge awards a passing score (e.g., >85).
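
A minimal sketch of such a loop follows. The threshold, retry count, and the idea of feeding quality_issues back into the description are assumptions for illustration, not behavior shipped with the blueprint:

import requests

PASSING_SCORE = 85.0
MAX_ATTEMPTS = 3


def generate_until_passing(image_path, fields):
    # Retry generation until the VLM Judge score clears the bar.
    attempt_fields = dict(fields)
    result = {}
    for _ in range(MAX_ATTEMPTS):
        with open(image_path, "rb") as f:
            resp = requests.post(
                "http://localhost:8000/generate/variation",
                files={"image": (image_path, f, "image/jpeg")},
                data=attempt_fields,
            )
        result = resp.json()
        if result.get("quality_score", 0) >= PASSING_SCORE:
            return result
        # Fold the judge's complaints into the next attempt's description.
        issues = "; ".join(result.get("quality_issues", []))
        attempt_fields["description"] = (
            fields.get("description", "") + f" Avoid: {issues}."
        )
    return result  # best effort after MAX_ATTEMPTS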

Step 6: Create interactive 3D assets

Finally, bring the product to life with a 3D model using the /generate/3d endpoint.

Request a 3D model from the original 2D image. This is a simple call that only needs the image.

curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  http://localhost:8000/generate/3d \
  --output product.glb

In a few seconds, a product.glb file is generated. This file can be dropped directly into any web-based 3D viewer, allowing customers to inspect the product from every angle.

Request a JSON response (optional). For web clients, it’s often easier to handle a JSON response. To do this, set return_json=true.

curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "return_json=true" \
  http://localhost:8000/generate/3d

Review the JSON response: This will return the 3D model as a base64 string, along with metadata.

{
  "glb_base64": "Z2xURgIAAA...A=",
  "artifact_id": "c724a1b8e1f54a6b8d2c9a7e6f3d1b9f",
  "metadata": {
    "slat_cfg_scale": 5.0,
    "ss_cfg_scale": 10.0,
    "slat_sampling_steps": 50,
    "ss_sampling_steps": 50,
    "seed": 0,
    "size_bytes": 1234567
  }
}
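
Decoding that payload back into a usable .glb file takes just a few lines. A quick sketch, assuming the JSON response was saved to response.json:

import base64
import json

with open("response.json") as f:
    response = json.load(f)

# Decode the base64 model data and write a standard glTF binary file.
with open("product.glb", "wb") as out:
    out.write(base64.b64decode(response["glb_base64"]))

print(f"Wrote {response['metadata']['size_bytes']} bytes "
      f"(artifact {response['artifact_id']})")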

Step 7: Move to production (Docker and troubleshooting)

Here are a few tips for debugging common issues and moving to a full, production-like deployment.

  • Run the full stack with Docker. In this example, the backend was run locally; however, the complete project is designed for Docker. The docker-compose.yml file launches the frontend, the backend, and all the AI models served through NVIDIA NIM microservices.
  • Check GPU availability. If models fail, the first check should be nvidia-smi to ensure Docker can see the GPUs.
  • Inspect service logs. The best way to debug is by tailing the logs for a specific service: docker-compose logs -f backend

Extensibility and future features

The goal of extending this blueprint is to increase the breadth and quality of commerce-ready assets and metadata autonomously. The project roadmap includes several extensions that can be built on:

  • Agentic social media research: This planned feature introduces a specialized social media research agent as part of an agentic workflow, where autonomous agents handle complex tasks. Powered by reasoning models like NVIDIA Nemotron and using tool calling with social media APIs or MCPs, the agent analyzes real-world usage patterns, sentiment, and trending terminology, feeding these insights into the /vlm/analyze step to keep product descriptions rich, relevant, and on-trend.
  • Short video generation: The next step is to add another generative endpoint to create 3-5 second product video clips. Using open source models, short video clips can be generated directly from 2D images, creating a dynamic, AI-generated lifestyle clip or product spin without needing a complex video shoot.

This foundation is designed for extension. Modules can be added for virtual try-on, automated ad generation, or dynamic pricing models by following the same pattern of adding a new, specialized microservice.
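
As an illustration of that pattern, each new capability is just another endpoint alongside the existing ones. This FastAPI stub is hypothetical (the /generate/video route and its request shape are not part of the blueprint) but mirrors the conventions of /generate/variation:

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()


@app.post("/generate/video")
async def generate_video(image: UploadFile = File(...),
                         locale: str = Form("en-US")):
    # Hypothetical endpoint: produce a short product clip from a 2D image.
    # A real implementation would call a video-generation model here and
    # return the clip (or a job ID for asynchronous processing).
    image_bytes = await image.read()
    return {"status": "not_implemented",
            "received_bytes": len(image_bytes),
            "locale": locale}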

Conclusion

We’ve successfully built a powerful, AI-driven pipeline that solves the sparse catalog problem. The key takeaways for building a system like this are:

  • Go modular: A production-ready system must separate fast analysis from slow generation. This provides a responsive UI and the flexibility to treat asset generation as an on-demand or background task.
  • Localization is key: True enrichment isn’t just translation; it’s cultural adaptation. By making locale a core parameter, the system generates text and images that resonate with global audiences.
  • Brand voice is a feature: The brand_instructions parameter is a game-changer. It transforms the LLM from a generic generator into a true, scalable brand assistant.

Resources

Ready to build this yourself? Dive into the project documentation:

Learn more about the Retail Catalog Enrichment Blueprint.
