How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton

It’s easy to underestimate how many moving parts a real-world, production-grade AI system involves. Whether you’re building an agent that combines internal data with external LLMs or a service that generates anime on demand, the system must orchestrate multiple models and dynamic data across online and offline components.

Many AI services, from LLMs to vector databases, are readily accessible via off-the-shelf APIs, enabling rapid prototyping and quick demos. As product requirements evolve and API wrappers become increasingly commoditized, differentiated AI products rely more on proprietary data, thoughtfully designed code and agents, and fine-tuned models. This shift often motivates companies to own and operate key components in-house, which also helps alleviate concerns around security, privacy, and compliance.

In this post, we walk through a realistic use case that demonstrates the benefits of operating the stack in-house. We build a Reddit post stylizer and subreddit recommender powered by tens of thousands of vector indices and an online LLM component. Beyond the application itself, we highlight the infrastructure requirements and show how to leverage the new NVIDIA DGX Cloud Lepton for flexible GPU access. We also demonstrate how to use open-source Metaflow—available as a managed service by NVIDIA Inception program partner Outerbounds—to orchestrate the entire system end-to-end.

How Outerbounds helps build differentiated AI products and services

A key challenge to in-sourcing AI components is the operational cost and complexity involved. Nearly all components, including training, inference, and RAG systems, depend on GPUs and require a sophisticated software stack to run efficiently and at scale. The AI stack is deep: from efficient GPU-centric data centers, such as Nebius, to optimized models and inference runtimes available as NVIDIA NIM microservices. Then there’s orchestration with developer-friendly APIs, which is where Outerbounds comes in.

Outerbounds provides a secure, cloud-native platform for developing and operating AI systems in your own environment. Built on open source Metaflow, it equips developers with powerful, composable APIs to build, orchestrate, and continuously improve AI products at scale.

How to build AI systems with NVIDIA DGX Cloud Lepton

The GPU cloud landscape has evolved significantly since the early days of the current AI boom. Today, a diverse range of providers, both large and small, offer GPU resources with varying geographic reach and stack depth. Navigating this landscape can be complex, particularly as these clouds must work with your existing hyperscaler infrastructure.

A key benefit of Outerbounds is easy access to diverse compute resources, which removes a major obstacle to building differentiated AI products. From the start, Outerbounds has integrated with NVIDIA Cloud Functions (NVCF) and, more recently, has partnered with Nebius, an NVIDIA Cloud Partner.

Outerbounds is now enabling early access to NVIDIA DGX Cloud Lepton, which expands access to a growing pool of GPUs through a unified interface.

The following diagram illustrates the new setup in the context of the demo application featured below.

An architecture diagram showing NVIDIA DGX Cloud Lepton integrated with the AI stack on Outerbounds and Nebius cloud infrastructure accelerated by NVIDIA GPUs.
Figure 1. NVIDIA DGX Cloud Lepton, integrated with the AI stack on Outerbounds and GPUs through Nebius.

A common obstacle to adopting new GPU clouds is that a company’s existing infrastructure, developer operations (DevOps) practices, and security policies are tightly coupled to its current cloud environment. Outerbounds integrates with DGX Cloud Lepton and NVIDIA Cloud Partners, including Nebius, so you can bring your own policies and run existing code seamlessly alongside your home cloud without migration. This minimizes the risk and effort involved in getting access to new infrastructure.

Develop a Reddit Agent with DGX Cloud Lepton

To illustrate the benefits of the complete stack and to highlight the intricacies of real-world AI, let’s walk through a fun demo application: an agent that helps you choose the most suitable groups and style when posting on Reddit. A screenshot is worth a thousand words:

Screenshot of a Reddit Agent tool. At the top, a text box contains the user’s prompt: “I think ion thrusters are a good option for future Mars missions.” Below, under “Suggested Subreddits,” three subreddit cards are shown: r/ArtemisProgram, r/SpaceXLounge, and r/IsaacArthur. Each card has a short paragraph post tailored to that subreddit, discussing ion thrusters for Mars missions in contexts such as NASA’s Solar Electric Propulsion, pairing with nuclear power, and their role in space logistics.
Figure 2. Example output from the Reddit Agent tool. Each suggestion includes a short, tailored post highlighting the relevance of ion thrusters to that community’s interests.

Although Reddit data is public, we used a preprocessed dataset available on Hugging Face consisting of nearly 100 million posts and comments. Note that many real-world applications involve private or proprietary data; in such cases, it is beneficial, and often necessary, to build and operate your own end-to-end stack, including Retrieval-Augmented Generation (RAG), to ensure data privacy and maintain full control over the system, as this example demonstrates.

The following outlines the system’s high-level architecture and operation:

Diagram of Reddit Agent architecture. At the top, a “Prompt” box leads to databases that match subreddits and comments, then format the content into responses. This process is supported by NVIDIA DGX Cloud Lepton, which contains four components: Embeddings model, Update vector indices, Retrieval model, and Agent deployment. Output flows back to generate the final response. The system is deployed in the cloud and is powered by Nebius.
Figure 3. System architecture of the Reddit Agent deployed by Outerbounds.

Here’s what happens when you enter a prompt in the demo app (a code sketch follows the list):

  1. The system converts the prompt to an embedding using the nv-embedqa-e5-v5 model, part of the NVIDIA NeMo Retriever collection, deployed as an NVIDIA NIM container through DGX Cloud Lepton.
  2. The embedding is matched against a GPU-accelerated FAISS index that contains centroids for all subreddits.
  3. The embedding is then matched against the subreddit-specific vector indices for the top subreddits to retrieve topical samples.
  4. The original prompt and topical samples are passed to an LLM, llama-3_1-nemotron-70b-instruct (also deployed as a NIM container), which reformats the prompt to match the style of the chosen subreddits.
  5. The agent itself is deployed as a container on DGX Cloud Lepton.
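
To make these steps concrete, here’s a minimal sketch of the online path in Python. The endpoint URLs, index file names, model identifier strings, and prompt template are illustrative assumptions, not the demo’s actual code:

    import faiss
    import numpy as np
    import requests

    # Hypothetical endpoints for the two NIM microservices.
    EMBED_URL = "http://embed-nim:8000/v1/embeddings"
    LLM_URL = "http://llm-nim:8000/v1/chat/completions"

    def embed(prompt):
        # Step 1: NIM embedding services expose an OpenAI-compatible API;
        # input_type distinguishes query-side from passage-side embeddings.
        resp = requests.post(EMBED_URL, json={
            "model": "nv-embedqa-e5-v5",  # exact identifier depends on the deployment
            "input": [prompt],
            "input_type": "query",
        })
        return np.array([resp.json()["data"][0]["embedding"]], dtype="float32")

    def suggest(prompt, n_subs=3, n_samples=5):
        vec = embed(prompt)

        # Step 2: shortlist subreddits with the centroid index.
        centroids = faiss.read_index("centroids.faiss")
        _, sub_ids = centroids.search(vec, n_subs)

        posts = {}
        for sub_id in sub_ids[0]:
            # Step 3: pull topical samples from that subreddit's own index.
            sub_index = faiss.read_index(f"subreddit_{sub_id}.faiss")
            _, sample_ids = sub_index.search(vec, n_samples)
            # (A real version would look up the sample texts by ID and
            # include them in the system prompt below.)

            # Step 4: ask the LLM to restyle the prompt for this community.
            resp = requests.post(LLM_URL, json={
                "model": "llama-3_1-nemotron-70b-instruct",
                "messages": [
                    {"role": "system",
                     "content": "Rewrite the user's post in the style of this subreddit."},
                    {"role": "user", "content": prompt},
                ],
            })
            posts[int(sub_id)] = resp.json()["choices"][0]["message"]["content"]
        return posts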

Additionally, a workflow is scheduled to update the vector indices. Thanks to the integration between DGX Cloud Lepton and Metaflow, you can execute the index-building task as part of a Metaflow workflow by adding the following decorators:

    @conda(packages={'faiss-gpu-cuvs': '1.11.0'}, python='3.11')
    @nvidia(gpu=1, gpu_type='NEBIUS_H100')
    @step
    def build_indices(self):
        ...

Notably, as illustrated by the @conda decorator above, you can manage the software supply chain efficiently, ensuring that all necessary dependencies, including the NVIDIA CUDA libraries, are available to the task, no matter which execution environment you target.
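
For context, here’s a minimal sketch of how such an index-refreshing flow could be scheduled end-to-end. The flow structure and names are hypothetical, and the @nvidia decorator is assumed to come from the Outerbounds platform extensions:

    from metaflow import FlowSpec, conda, schedule, step
    from metaflow import nvidia  # assumed: supplied by the Outerbounds extensions

    @schedule(daily=True)  # refresh the vector indices once a day
    class IndexRefreshFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.build_indices)

        @conda(packages={'faiss-gpu-cuvs': '1.11.0'}, python='3.11')
        @nvidia(gpu=1, gpu_type='NEBIUS_H100')
        @step
        def build_indices(self):
            ...  # embed new passages and rebuild the FAISS shards here
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        IndexRefreshFlow()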

Produce lightning-fast embeddings and vector indices

Our indexing workflow starts with a dataset containing nearly 100 million posts and comments. After removing comments with fewer than 10 tokens and subreddits with fewer than 100 posts, the dataset contains 50 million passages, spread over 30,000 subreddits.
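
The filtering pass itself is simple; here’s a sketch, assuming the raw data is loaded as a pandas DataFrame with hypothetical subreddit and n_tokens columns:

    import pandas as pd

    def filter_passages(df: pd.DataFrame) -> pd.DataFrame:
        # Drop very short comments.
        df = df[df["n_tokens"] >= 10]
        # Drop subreddits with too few posts to model a distinct style.
        counts = df["subreddit"].value_counts()
        return df[df["subreddit"].isin(counts[counts >= 100].index)]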

As a special feature of this example, instead of building a single vector database, the system constructs a separate vector database for each subreddit—over 30,000 vector databases in total—matching samples specific to the style of each community. In addition, the system builds a database for centroids of each community to find the most suitable communities for the prompt.
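
A centroid here is simply the mean of a community’s passage embeddings. A minimal sketch, assuming the embeddings and subreddit labels are already loaded as NumPy arrays:

    import numpy as np

    def compute_centroids(embeddings, subreddit_ids):
        # One mean vector per subreddit; these rows seed the centroid index.
        centroids = {}
        for sub in np.unique(subreddit_ids):
            centroids[sub] = embeddings[subreddit_ids == sub].mean(axis=0)
        return centroids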

Due to the large scale of the dataset, the system needs to:

  1. Produce a large set of embeddings in a reasonable amount of time as a batch process.
  2. Index the embeddings quickly, producing tens of thousands of database shards.
  3. Produce an embedding and matching entries with low latency during prompting.

A major benefit of DGX Cloud Lepton is that it provides access to a deep pool of GPU resources across environments. Taking advantage of this, the system parallelizes embedding production, orchestrated by a workflow on Outerbounds, calling the embedding model across multiple NVIDIA H100 GPUs. Throughput scales almost linearly with the number of parallel workers:

A bar chart with 10 green bars showing embeddings throughput as a function of the number of parallel workers.
Figure 4. Embeddings throughput as a function of the number of parallel workers.
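
In Metaflow terms, this fan-out is a foreach split. Here’s a minimal sketch, with the shard contents and the actual embedding call left as placeholders:

    from metaflow import FlowSpec, step

    NUM_WORKERS = 8  # illustrative worker count

    class EmbedFlow(FlowSpec):

        @step
        def start(self):
            # Partition the passages into one shard per worker.
            self.shards = list(range(NUM_WORKERS))
            self.next(self.embed, foreach='shards')

        @step
        def embed(self):
            # Each parallel task would call the embedding NIM for its shard.
            self.shard_id = self.input
            self.next(self.join)

        @step
        def join(self, inputs):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        EmbedFlow()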

Check out this site for further benchmark results for the nv-embedqa-e5-v5 model, as well as other NVIDIA embedding models, on a variety of GPU infrastructures. The resulting dataset of 50 million 1024-dimensional embeddings is nearly 200 GB, so Metaflow’s optimized IO path comes in handy when moving the matrix around.

The system achieves very high performance by leveraging the new NVIDIA cuVS-accelerated FAISS library running on an NVIDIA H100 GPU: it can index 10 million embeddings in 80 seconds. In this case, producing all 30,000 indices, many of which are small, was 2.5x faster on a single H100 than on a large CPU instance (r5.24xlarge) using up to 60 CPU cores in parallel.

Thanks to Nebius pricing, the GPU-accelerated version, using a single H100, is over 2x faster while being 2x cheaper than the CPU instance.
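
For reference, here’s what building and querying one such index looks like with the GPU-backed FAISS API. The data is a random stand-in, and the choice of an exact (flat) index is an assumption:

    import faiss
    import numpy as np

    d = 1024                                           # embedding dimension
    xb = np.random.rand(100_000, d).astype('float32')  # stand-in embeddings

    res = faiss.StandardGpuResources()                 # GPU memory and streams
    index = faiss.GpuIndexFlatL2(res, d)               # exact L2 search on the GPU
    index.add(xb)

    # Top-5 nearest passages for a single query vector.
    distances, ids = index.search(xb[:1], 5)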

How to assemble building blocks into production-ready AI systems with Outerbounds

The Reddit Recommender Agent illustrates the structure of a typical AI system, spanning:

  • Multiple models: In this case, an embedding model and an LLM.
  • Agent deployments: Stateful workers that call LLMs and take actions accordingly.
  • Batch processing: Such as building vector indices and data processing.

You need to orchestrate and operate all these components as a cohesive system, safely and securely deployed within your governance boundary. Importantly, your development workflows and DevOps practices must support safe iteration across the entire system, enabling A/B testing of models, agent versions, and datasets, with detailed tracking of all assets and the ability to observe and evaluate results.

Outerbounds addresses these needs by enabling both online agents and offline workflows on a single platform. You can build AI systems with state-of-the-art components, like NIM containers and GPU-accelerated vector indices, while accessing the latest accelerated computing through direct integrations with providers like Nebius, or through the deep pool of resources in DGX Cloud Lepton.

Crucially, you can access these resources through simple Python APIs, making the experience as easy as calling off-the-shelf APIs. That helps keep simple things simple while also making sophisticated solutions possible.

To give you an idea, here’s what a live deployment of a particular version of the Reddit Agent looks like on Outerbounds:

 Screenshot of the Outerbounds platform showing the “Reddit Recommender” deployment page. The agent is active and deployed to an NVIDIA H100 GPU compute pool in Nebius, using NVIDIA NIM MessageFormatter and Embeddings models. The interface lists components for Code, Data, and Model, along with 2/64 active workers. A console log displays recent subreddit suggestions for example prompts, such as recommending r/ArtemisProgram, r/Spaceflight, and r/IsaacArthur for a Mars ion thruster discussion. The left sidebar contains navigation links for project assets, components, deployments, workflows, and platform settings.
Figure 5. Outerbounds deployment interface for the Reddit Agent.

As shown in Figure 5, Outerbounds keeps track of all the key assets, including the code, data, and models, that form the end-to-end solution. This is especially useful when multiple people (or multiple AI co-pilots) work together, as it allows you to safely deploy any number of concurrent variants, each with its own assets, as isolated branched deployments.

Because of these tracking capabilities, you can easily evaluate variants against each other to, for instance, compare the performance of off-the-shelf APIs to custom models.
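
In Metaflow, this isolation maps to the @project decorator, which namespaces flows so that branch deployments don’t interfere with one another. A minimal sketch, with a hypothetical flow name:

    from metaflow import FlowSpec, project, step

    @project(name='reddit_agent')
    class RecommenderFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        RecommenderFlow()

Deploying the same flow with a --branch flag then creates an independent variant with its own tracked assets, which is what makes side-by-side evaluation of versions practical.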

How to develop differentiated AI systems with full ownership

Building differentiated AI products requires a complete stack from scalable GPU compute to a developer-friendly software layer. Enterprise deployments also need to account for factors like geography, compliance, and data residency, making infrastructure choices important.

DGX Cloud Lepton offers a unified interface to multiple GPU providers, allowing you to match compute demand to the needs of your use case. Outerbounds builds on this foundation, providing the tools to develop and operate AI applications efficiently and reliably.

If you ask the Reddit Agent to highlight the above value proposition in the style of r/dailybargains, which is a popular subreddit for deal hunters, you may get this answer about a promotion Outerbounds is running:

Outerbounds is offering free credits to run workloads on NVIDIA H100 GPUs via DGX Cloud Lepton. You also get access to its enterprise-ready AI platform that helps you build, deploy, and iterate on custom models and agents in your own cloud.

To start testing these capabilities in your environment, get started at Outerbounds, and claim free GPU credits on Nebius infrastructure to power your trial.

You can also go deeper with DGX Cloud Lepton in NVIDIA’s Developer Forums or learn more about the NVIDIA Inception program to see how NVIDIA supports AI startups all over the world.
