For any data center, operating large, complex GPU clusters is not for the faint of heart! There is a tremendous amount of complexity. Cooling, power, networking, and even seemingly benign things like fan replacement cycles all must be managed effectively and governed well in accelerated computing data centers. Managing all of this requires making sense of the petabytes of telemetry data generated at every layer of the compute stack.
Now imagine being able to chat with your data center directly to check on GPU cluster reliability. Consider a question such as, “Which of the top 5 most frequently replaced parts in our data center have the most supply chain risk?” Maybe you have a more complex task, such as, “Examine each GPU cluster and assign the most relevant technician to resolve the 5% of clusters most at risk for failure.”
To answer these types of questions and more, our team at NVIDIA embarked on a project that we dubbed LLo11yPop (LLM + Observability), to develop an observability AI agent framework that uses the OODA loop (observation, orientation, decision, action).
In this post, I provide an overview of how we built an observability agent framework for GPU fleet management, using a multi-LLM compound model in our architecture design. I also describe various agent roles, such as orchestration and task execution. Lastly, I share lessons learned for future experimentation with agentic observability frameworks.
An observability agent framework to chat with your GPU cluster
The NVIDIA DGX Cloud team governs a global GPU fleet that spans all major cloud service providers, as well as our own data centers. As the global buildout of accelerated data centers continues, we had to invent entirely new ways to observe the fleet so that we can deliver accelerated computing capabilities to the world as efficiently and quickly as possible.
Monitoring accelerated data centers
With each successive iteration of GPUs, the amount of observability needed expands. Standard data center metrics like utilization, errors, and throughput are the baseline.
To really understand what’s happening in this next generation of data centers, you must consider everything possible about the physical environment around them:
- Temperature
- Humidity
- Power stability
- Latency
Dozens of metrics are critical to profiling accelerated AI workloads, and tracking them enables you to address data center incidents more quickly. With the additional complexity of GPU clusters designed for training foundation models, it is essential to apply whatever technology we can to meet the challenge.
To that end, our team’s first goal was to build a system that enables you to have a conversation with your GPU cluster. Inspired by NVIDIA NIM microservices, we had already established the ability for users to have conversations with a database. If you can do that, why not open up the conversation to the kind of high-dimensional data you get from observability systems, so that data center operators gain the same capability?
At NVIDIA, we have many observability systems across our fleet that we can access using Elasticsearch. So, we decided to use NIM microservices that enable us to converse with Elasticsearch in natural language. That way, the agent system could answer questions such as, “Which clusters across the fleet have had the most issues with fan failures?” and return accurate, actionable results.
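To make the pattern concrete, here is a minimal sketch of natural-language-to-Elasticsearch querying, assuming a NIM-hosted LLM behind an OpenAI-compatible endpoint and a hypothetical `fleet-hardware-events` index. None of the endpoint names, credentials, or index schema here reflect our production setup.

```python
import requests
from openai import OpenAI

# Hypothetical endpoints, credentials, and index schema for illustration only.
NIM_BASE_URL = "https://integrate.api.nvidia.com/v1"
ES_URL = "https://observability.example.com:9200"

llm = OpenAI(base_url=NIM_BASE_URL, api_key="$NVIDIA_API_KEY")

def ask_fleet(question: str) -> dict:
    # 1. Ask a NIM-hosted LLM to translate the natural-language question
    #    into Elasticsearch SQL against an assumed hardware-events index.
    prompt = (
        "Translate this question into a single Elasticsearch SQL query over "
        "the index 'fleet-hardware-events' with columns "
        "(cluster, component, event_type, timestamp). Return only the SQL.\n"
        f"Question: {question}"
    )
    resp = llm.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    es_sql = resp.choices[0].message.content.strip()

    # 2. Execute the generated query through the Elasticsearch SQL REST API.
    return requests.post(
        f"{ES_URL}/_sql?format=json", json={"query": es_sql}, timeout=30
    ).json()

answer = ask_fleet(
    "Which clusters across the fleet have had the most issues with fan failures?"
)
```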
Model architecture
Figure 1 shows the types of agents:
- Director
  - Orchestrator agents: Route questions to the correct analyst and, as a consequence, choose the best action as a response.
- Manager
  - Analyst agents: Are trained on a specific functional domain. They understand what data they have available and convert broad questions into specific questions that are answered by retrieval agents.
  - Action agents: Coordinate action in response to something observed by the orchestrator, such as notifying an SRE when something needs human attention in the data center.
- Worker
  - Retrieval agents: Convert questions in a given topic into code that runs against a data source or service endpoint using NVIDIA NeMo Retriever NIM microservices.
  - Task execution agents: Carry out a specific task, often through a workflow engine.
All agents are modeled on the same kind of organizational hierarchy that you might find in any organization of humans doing knowledge work. Directors coordinate efforts toward the achievement of a mission, managers use specialized domain knowledge to allocate work, and worker agents are optimized toward specific tasks.
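This hierarchy can be expressed as a thin layer of plain Python objects. The sketch below is illustrative only: the class names, the keyword-based routing stub, and the fake Slurm retriever are assumptions, not the actual LLo11yPop code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RetrievalAgent:
    """Worker: converts a scoped question into a query against one data source."""
    name: str
    query_fn: Callable[[str], dict]

    def fetch(self, question: str) -> dict:
        return self.query_fn(question)

@dataclass
class AnalystAgent:
    """Manager: owns one functional domain and its retrieval agents."""
    domain: str
    retrievers: list = field(default_factory=list)

    def analyze(self, question: str) -> str:
        results = [r.fetch(question) for r in self.retrievers]
        # In production an LLM would summarize the retrieved data; stubbed here.
        return f"[{self.domain}] {results}"

@dataclass
class OrchestratorAgent:
    """Director: routes each question to the most relevant analyst."""
    analysts: dict

    def answer(self, question: str) -> str:
        # In production an LLM picks the domain; keyword match keeps this runnable.
        domain = next(
            (d for d in self.analysts if d in question.lower()),
            next(iter(self.analysts)),
        )
        return self.analysts[domain].analyze(question)

slurm_analyst = AnalystAgent(
    "slurm", [RetrievalAgent("es-slurm", lambda q: {"failed_jobs": 42})]
)
director = OrchestratorAgent({"slurm": slurm_analyst})
print(director.answer("How many slurm job failures occurred last week?"))
```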
Figure 2 shows specific agents that we built to accommodate the initial use case where we can analyze various types of telemetry from heterogeneous sources and then use Python for more detailed analysis.
Another key discovery was the need for a specific type of agent to detect off-topic questions. We learned early on that without such guardrails, the model would hallucinate more frequently about areas beyond the scope of the system.
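A guardrail of this kind can be as simple as a small classification call placed ahead of the orchestrator. The following sketch assumes a NIM-hosted model behind an OpenAI-compatible API; the scope description, prompt wording, and model choice are hypothetical.

```python
from openai import OpenAI

# Hypothetical scope description and prompt; model and endpoint are assumptions.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY"
)

SCOPE = (
    "GPU fleet observability: cluster health, hardware telemetry, "
    "Slurm and Kubernetes workloads, and data center operations"
)

def is_on_topic(question: str) -> bool:
    resp = client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[{
            "role": "user",
            "content": (
                f"System scope: {SCOPE}\n"
                f"Question: {question}\n"
                "Answer with exactly ON_TOPIC or OFF_TOPIC."
            ),
        }],
    )
    return "ON_TOPIC" in resp.choices[0].message.content.upper()

# Questions flagged OFF_TOPIC are refused before reaching any analyst agent,
# which limits hallucination about areas outside the system's scope.
```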
Moving towards a multi-LLM compound model
To make this work, we realized that we would need more than one large language model (LLM) to handle all the different types of telemetry involved in managing clusters effectively. GPU metrics are necessary but not sufficient: you must also understand every relevant layer of the stack, from the GPU layer up to orchestration layers like Slurm and Kubernetes.
For more information, see The Shift from Models to Compound AI Systems.
Using the mixture of agents technique
To address these diverse needs, we went with a mixture of agents (MoA) approach initially. We developed a series of analyst agents that were experts in their domain, such as GPU cluster operating parameters, Slurm job data, and system log patterns. We then built a supervisor model whose job it is to build a plan and assign tasks to analyst agents, who in turn ask questions to query agents.
We did all this using prompt engineering, without fine-tuning or other elaborate techniques, in the lead-up to the first version being put into use. Figure 3 shows us asking a question about Slurm job data and getting an answer and a supporting graph back.
In Figure 3, we show a common case where our team in resource governance needs to understand trends around Slurm job failures on all clusters we manage within a specific cloud provider.
The question is passed to the supervisor agent, which selects the right agent to answer it. In this case, that is the Elasticsearch Slurm analyst. Based on the need reflected in the question, the analyst then gathers data from one or more query agents. In this case, it's the Elasticsearch query tool, which converts the question into the SQL dialect understood by the Elasticsearch REST interface.
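The supervisor's routing step can itself be a single LLM call over a catalog of analyst descriptions. The sketch below is an assumption about how such routing could look, not our production prompt; the analyst names and descriptions are illustrative.

```python
from openai import OpenAI

# Illustrative analyst catalog and routing prompt; not the production setup.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY"
)

ANALYSTS = {
    "elasticsearch_slurm_analyst": "Slurm job data: failures, queue times, usage",
    "gpu_metrics_analyst": "GPU cluster operating parameters and health",
    "syslog_analyst": "System log patterns and error signatures",
}

def route(question: str) -> str:
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in ANALYSTS.items())
    resp = client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Available analyst agents:\n{catalog}\n\n"
                f"Question: {question}\n"
                "Reply with only the name of the single best analyst."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

# A Slurm failure-trend question should route to elasticsearch_slurm_analyst,
# which then calls its Elasticsearch query tool.
print(route("Show the trend of Slurm job failures across our clusters"))
```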
We typically use this as an initial query to understand a trend, then ask more questions and use the same model to drill down. Perhaps we ask in the chat session to explore why May had more failures than normal, or perhaps we ask other analysts to go into more depth.
By providing the capability to get an immediate answer, look at a graph, and quickly move on to the next question, we can quickly diagnose issues that drive anything from how we allocate GPUs to where we must do further diagnosis in GPU clusters that may be experiencing issues.
Using a swarm-of-agents technique, we chained together a series of small, focused models and performed several interesting optimizations. For example, we can fine-tune models that emit SQL in the dialect Elasticsearch understands, starting from a base model built for code generation, and use a different, larger base LLM for the planning tasks that direct execution agents.
From answers and actions to autonomous agents with OODA loops
When we demonstrated that we could get reliable answers from our analyst agents, the next obvious step was to close the loop. This led to the idea of having an autonomous supervisor agent that works towards a mission. If the supervisor agent were an AI engineering director, we would give it a metric to work towards and assign it a mission to improve that metric over time. As with an organization of humans, you would expect your AI director to operate in an OODA loop.
Solving for GPU cluster reliability
For the supervisor AI agent to improve cluster reliability, it should do the following (a minimal sketch follows the list):
- Observe: Understand the data coming from the observability system.
- Orient: Choose the right analyst agents to conduct the analysis.
- Decide: Settle on an action to take.
- Act: Invoke that action.
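A minimal OODA skeleton might look like the following; every function body is a hypothetical placeholder standing in for real telemetry queries, analyst agent calls, and ticketing integrations.

```python
import time

# Every function body is a hypothetical placeholder for real telemetry
# queries, analyst agent calls, and ticketing integrations.

def observe() -> dict:
    """Pull the latest telemetry summaries from the observability system."""
    return {"clusters_at_risk": ["cluster-17"]}

def orient(observation: dict) -> list:
    """Route the observation to the relevant analyst agents and collect findings."""
    return [f"analysis of {c}" for c in observation["clusters_at_risk"]]

def decide(findings: list) -> list:
    """Turn analyst findings into proposed actions (tickets, at first)."""
    return [{"action": "open_sre_ticket", "detail": f} for f in findings]

def act(actions: list) -> None:
    """Invoke each action; early on, that means filing a ticket for an SRE."""
    for action in actions:
        print(f"Filing ticket: {action}")

while True:
    act(decide(orient(observe())))
    time.sleep(3600)  # re-evaluate the mission metric on a fixed cadence
```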
Of course, to gain confidence that the AI director is making relevant decisions, most of the initial actions are the creation of tickets for a site reliability engineer (SRE) to analyze and, if relevant, act on. This is similar to how a human would operate.
Much like self-driving cars, automation of data center ops exists on a spectrum from human-assisted driving to fully autonomous. In the early stages of adoption, humans are always in the loop.
When we receive a cluster optimization recommendation, our SRE team analyzes the request, validates it, performs the task if relevant, and provides feedback about what is wrong with the recommendation if not. This forms a reinforcement learning from human feedback (RLHF) loop that enables us to improve the system over time, and we combine it with other techniques that use telemetry from the system itself.
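One simple way to capture that feedback is to record each SRE verdict alongside the recommendation that produced it, so the records can later feed fine-tuning or evaluation. The schema below is a hypothetical sketch, not our internal format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical feedback record; field names are illustrative only.
@dataclass
class RecommendationFeedback:
    cluster: str
    recommendation: str
    accepted: bool   # did the SRE act on it?
    sre_notes: str   # what was right or wrong with the recommendation
    timestamp: str

def log_feedback(record: RecommendationFeedback, path: str = "feedback.jsonl") -> None:
    # Append one JSON line per reviewed recommendation for later training/eval.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_feedback(RecommendationFeedback(
    cluster="cluster-17",
    recommendation="Drain node n032 ahead of fan replacement",
    accepted=True,
    sre_notes="Matched observed fan RPM degradation",
    timestamp=datetime.now(timezone.utc).isoformat(),
))
```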
Test pyramid for LLM hierarchies
In classical microservice architecture, it is common to have tests for specific services, often with their own unit, functional, or end-to-end tests. As with classical microservices, it is most efficient to have more tests on individual components, with fewer yet more comprehensive tests as you move up the hierarchy (Figure 4). This enables you to balance the competing objectives of fast feedback at the lower level, comprehensiveness at the higher level, and the ability to diagnose problems more quickly.
This last benefit is a key differentiator because if you try to do everything with a large monolithic model, it becomes much more challenging to diagnose why a given topic area might be hallucinating.
While LLM testing is a large topic area that goes beyond the scope of a single post, it is important to point out that the framework of classical software unit testing changes significantly.
For example, you might ask the LLM, “How many GPUs exceeded their normal temperature range?” You would get back responses such as, “Three,” “Three GPUs exceeded,” “3,” “3.00,” or other ways of expressing the same idea. To that end, we used a second, usually stronger, LLM to validate the conceptual equivalence of answers across a suite of over 200 different tests at various levels of the system, which runs on each build.
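In practice, that equivalence check can be a small LLM-as-judge helper wrapped in an ordinary test. The sketch below assumes a NIM-hosted judge model behind an OpenAI-compatible endpoint; the prompt and model choice are illustrative, not our actual test harness.

```python
from openai import OpenAI

# Hypothetical judge configuration; prompt and model are assumptions.
judge = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY"
)

def answers_match(question: str, expected: str, actual: str) -> bool:
    resp = judge.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Actual answer: {actual}\n"
                "Do these express the same idea? Reply YES or NO."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def test_gpu_overtemp_count():
    # "Three", "3", and "3.00" should all pass this conceptual-equivalence check.
    assert answers_match(
        "How many GPUs exceeded their normal temperature range?",
        "3",
        "Three GPUs exceeded their normal temperature range.",
    )
```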
Lessons learned from building an observability AI agent
First, you don’t need to, and frankly should not, jump to training or tuning models as you are getting started. We achieved a functional prototype with prompt engineering alone, hooking a series of NIM microservices together using LangChain.
Notable models used in the system include Mixtral 8x7B and, more recently, the new Llama 3.1 405B NIM model from ai.nvidia.com. This approach enables you to get started quickly and gives you the freedom to choose which model to use in each node of your graph without the sunk cost of model training. After you have something that works well for the 90% case, you can fine-tune specific models to raise accuracy to the threshold your use case requires.
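For a sense of what that prompt-engineering starting point looks like, here is a minimal LangChain sketch using the langchain-nvidia-ai-endpoints integration to call a NIM API model. The model ID and prompt are illustrative, and an NVIDIA_API_KEY is assumed to be set in the environment.

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate

# Model choice is illustrative; swap in whichever NIM model fits the node.
llm = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")  # reads NVIDIA_API_KEY

prompt = ChatPromptTemplate.from_template(
    "You are a GPU fleet observability analyst. {question}"
)
chain = prompt | llm  # one node in a larger graph of agents

print(
    chain.invoke(
        {"question": "Summarize why fan failures matter for GPU cluster health."}
    ).content
)
```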
Second, choose the right model for the right job. Coding models work great as a base for human-to-SQL or other tools that do things like formatting output. Smaller models work great for simpler domains and can help increase speed and save money on tokens. Use larger models for the hardest tasks, often at the orchestrator-agent level where more context is required to understand the whole picture.
Finally, don’t fully automate without a human in the loop until you have strong evidence that the actions taken by the agentic system are accurate, useful, and safe. Much as you would not start with full autonomy when deploying self-driving cars, walk before you run: earn the trust of the people operating the system before moving to fully autonomous operation in production.
Begin building your AI agent application
For more information about how you can use NVIDIA generative AI technologies and tools to build your own AI agents and applications, see ai.nvidia.com or try out NVIDIA NIM APIs.
If you’re just getting started, see Building Your First LLM Agent Application.