
Advanced AI and Retrieval-Augmented Generation for Code Development in High-Performance Computing

Decorative image of multimodal RAG workflow.

In the rapidly evolving field of software development, AI tools such as chatbots and GitHub Copilot have significantly transformed how developers write and manage code. These tools, built on large language models (LLMs), enhance productivity by automating routine coding tasks. 

Parallel computing challenges

However, the use of LLMs in generating parallel computing code—essential for high-performance computing (HPC) applications—has encountered challenges. Parallel computing applications require a nuanced understanding of both the logic of functional programming and the complexities involved in managing multiple concurrent operations, such as avoiding deadlocks and race conditions. Traditional AI models can generate serial code effectively but falter with parallel constructs due to these added complexities.

NVIDIA faced a similar challenge in generating complex code, in its case while supporting the design of new accelerated computing semiconductors. ChipNeMo, highly relevant work addressing this problem, was published during Supercomputing 2023; it successfully combined domain-adaptive pretraining (DAPT) and retrieval-augmented generation (RAG) techniques to produce an EDA copilot for engineers. The combination of DAPT with advanced RAG techniques resonated with the problem facing the scientists at Sandia and kickstarted the following collaboration.

A team of scientists at Sandia National Laboratories embarked on an ambitious project using NVIDIA NeMo software, including early access to the NVolve-40K embedding model, to address these challenges. They developed a specialized coding assistant tailored for Kokkos, a C++ library that provides you with the tools to write performance-portable applications. It abstracts hardware complexities and enables code to run efficiently on different types of HPC architectures.

Creating an AI model for such a specific task usually requires fine-tuning a foundation model with domain-specific knowledge, a resource-intensive and time-consuming process. To stay agile with software updates and bug fixes, HPC developers generally prefer a more flexible approach. 

This assistant aims to support you with accurate, context-aware code suggestions, using the latest advancements in AI.

Implementing advanced RAG

In collaboration with NVIDIA, Sandia is developing an advanced RAG approach to enable a modular workflow that adapts to ongoing dataset changes and integrates state-of-the-art models with HPC workloads as soon as they are released.

According to Robert Hoekstra, Ph.D., senior manager of Extreme Scale Computing at Sandia, organizations across several industries, including Sandia, are unlocking valuable insights with NVIDIA generative AI, enabling semantic search of their enterprise data. Sandia and NVIDIA are collaborating to evaluate emerging generative AI tools to maximize data insights while improving accuracy and performance.

By compiling extensive code repositories that use Kokkos, Sandia scientists created a dynamic and continually updated dataset. A recursive text splitter for C-family source code organizes the dataset into manageable chunks. These chunks are then converted into vector embeddings and stored in a database, a far less resource-intensive process than fine-tuning.
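The chunking step can be illustrated with a minimal sketch (not Sandia's implementation) of recursive splitting in the spirit of LangChain's RecursiveCharacterTextSplitter: try coarse separators such as blank lines first, falling back to finer ones until every chunk fits the size limit. The separator list and size limit here are illustrative assumptions.

```python
# Recursive chunking sketch: split on the coarsest separator first
# (blank lines), then newlines, spaces, and finally single characters,
# so every emitted chunk stays under max_len.
def recursive_split(text, max_len=1024, seps=("\n\n", "\n", " ", "")):
    if len(text) <= max_len:
        return [text] if text else []
    sep = seps[0]
    rest = seps[1:] if len(seps) > 1 else seps
    parts = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for part in parts:
        piece = part + sep if sep else part
        if len(buf) + len(piece) <= max_len:
            buf += piece                      # still fits: accumulate
        else:
            if buf:
                chunks.append(buf)            # flush the current chunk
            if len(piece) > max_len:
                # A single part is too big: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_len, rest))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

A production splitter would also add chunk overlap (as in the configurations benchmarked below) so context is not lost at chunk boundaries.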

When a query is made, it’s transformed into a vector that retrieves relevant code chunks from the database based on their cosine similarity. These chunks are then used to provide context for generating the response, creating a rich, informed output incorporating real-world coding patterns and solutions.
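The retrieval step described above can be sketched as follows. The hand-made vectors stand in for embeddings that a real pipeline would obtain from an embedding model such as NVolve-40K, and `top_k` is a hypothetical helper name.

```python
from math import sqrt

# Rank stored chunks by cosine similarity to the query vector and
# return the top-k chunk texts to use as LLM context.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, store, k=3):
    """store: list of (chunk_text, embedding) pairs."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy usage with hand-made 2-D "embeddings":
store = [("view_alloc.cpp", [1.0, 0.0]),
         ("mpi_reduce.cpp", [0.0, 1.0]),
         ("view_init.cpp", [0.9, 0.1])]
context = top_k([1.0, 0.0], store, k=2)
```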

Flowchart starts with the user query, expanded to include context from the LLM user history. Five similar queries are rewritten to query the LLM. All queries are fed into the embedding model, which returns the top-k documents for each. Custom data is chunked and fed to an embedding model, which generates a float vector. Float vectors are stored as vector store embeddings and determine semantic similarity. Output is reranked against the query and the top six documents are retrieved. This relevant text is passed as context to the LLM along with the rewritten query for final output.
Figure 1. RAG techniques for user queries

RAG evaluation

Sandia's preliminary evaluation of naive RAG used a variety of metrics, including BLEU, ChrF, METEOR, and ROUGE-L, to assess the effectiveness of the generated code against standard benchmarks provided by Sandia Kokkos developers. Encouragingly, initial results revealed a 3-4 point increase in scaled mean scores with the implementation of the RAG approach.

Scaled mean = [100 × (BLEU + ROUGE-L + CodeBLEU + METEOR) + ChrF] / 5
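Read literally, the formula averages four metrics rescaled from [0, 1] to 0-100 together with ChrF, which is conventionally reported on a 0-100 scale. A direct transcription, assuming those ranges, is:

```python
# Scaled mean of five code-generation metrics. Assumes BLEU, ROUGE-L,
# CodeBLEU, and METEOR are in [0, 1] and ChrF is already in [0, 100].
def scaled_mean(bleu, rouge_l, code_bleu, meteor, chrf):
    return (100 * (bleu + rouge_l + code_bleu + meteor) + chrf) / 5
```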

Embedding models     LLMs
BGE-large-EN-v1.5    Mistral-7B-Instruct-v0.2
E5-base-v2           Mixtral-8x7B-Instruct-v0.1
NVolve-40K           WizardCoder-15B-v1.0
UAE-Large-V1         Magicoder-S-DS-6.7B
Table 1. Embedding models and open-source LLMs tested
Model     OSS scaled mean     RAG scaled mean
Table 2. Benchmarking open-source software (OSS) embedding models and LLMs

RAG results for modular HPC workflows 

A naive RAG pipeline implementation achieved minor performance improvements. The Mixtral model (a mixture-of-experts architecture) was not significantly affected by the appended Kokkos context.

No superior embedding model was found. Instead, specific pairings of embedding models and LLMs offer the best results.

RAG for multi-query retrieval

The Sandia team also experimented with advanced RAG techniques to refine how relevant content is retrieved.

One such technique involved generating multiple related queries to broaden the search for applicable code snippets, especially when user queries were vague or lacked specific details. This method improved the likelihood of retrieving more relevant context, enhancing the accuracy and usefulness of the generated code.

The chart compares the naive RAG c1024_o256 with several MultiQueryRetrievers: MQ_c1024_o256 (chunk size 1024, chunk overlap 256); MQ_c500_o100 (chunk size 500, chunk overlap 100); and MQ_c300_o50 (chunk size 300, chunk overlap 50). The comparisons cover the metrics BLEU, ChrF, CodeBLEU, METEOR, ROUGE-L, and the scaled mean. The mid-size configuration MQ_c500_o100 scored consistently better (by a few percentage points) than the other retrieval methods.
Figure 2. Applying a multi-query retrieval to mid-sized code examples

The following example shows a query and the generated queries created by the multi-query retriever.

Prompt: Create a Kokkos view of doubles with size 10 x 3

Generated Queries:

1. How can I create a Kokkos view of doubles with size 10 x 3?
2. How can I create a Kokkos view of doubles with size 10 x 3 and initialize it with random values?
3. How can I create a Kokkos view of doubles with size 10 x 3 and initialize it with values from a file?
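The mechanics of multi-query retrieval can be sketched as follows: each rewritten query retrieves its own top-k chunks, and the deduplicated union (in first-seen order) becomes the context. The retrieval callback is stubbed out here; in the pipeline described above, an LLM generates the query rewrites.

```python
# Merge per-variant retrieval results, dropping duplicate chunks so the
# final context stays compact even when variants overlap.
def multi_query_retrieve(variants, retrieve, k=3):
    """retrieve(query, k) -> list of chunk strings."""
    seen, merged = set(), []
    for q in variants:
        for chunk in retrieve(q, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Broadening the search this way is what helps vague queries: a chunk missed by the original phrasing may still be surfaced by one of the rewrites.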

Dataset context enrichment

Sandia researchers are also exploring the parent document retriever approach. In this approach, the data is segmented into large parent chunks that provide general context and smaller child chunks focused on specific details. 

This strategy helps balance the need for specificity with the breadth of context, optimizing both the relevance and comprehensiveness of the information retrieved.
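A minimal sketch of the parent document retriever idea (function names are hypothetical, and simple word overlap stands in for embedding similarity): small child chunks are matched against the query for precision, but the larger parent chunk containing the best-matching child is what gets handed to the LLM for context.

```python
# Index every child chunk alongside the parent chunk that contains it,
# then retrieve by child match and return the enclosing parent.
def split_fixed(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs, parent_size=500, child_size=200):
    """Return (child_chunk, parent_chunk) pairs."""
    index = []
    for doc in docs:
        for parent in split_fixed(doc, parent_size):
            for child in split_fixed(parent, child_size):
                index.append((child, parent))
    return index

def retrieve_parent(query, index):
    # Stand-in scoring: count query words shared with the child chunk.
    # A real pipeline would compare embeddings instead.
    def score(child):
        return len(set(query.split()) & set(child.split()))
    best_child, parent = max(index, key=lambda pair: score(pair[0]))
    return parent
```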

The chart compares the naive RAG c1024_o256 with the ParentDocumentRetrievers: PD_parent-c1000_o200_child-c300-o0 (parent chunk size 1000, parent overlap 200, child chunk size 300, child overlap 0); PD_parent-c500_o100_child-c200-o0 (parent chunk size 500, parent overlap 100, child chunk size 200, child overlap 0); and PD_parent-c300_o75_child-c150-o0 (parent chunk size 300, parent overlap 75, child chunk size 150, child overlap 0). The comparisons cover the metrics BLEU, ChrF, CodeBLEU, METEOR, ROUGE-L, and the scaled mean. The context-enrichment approaches that split datasets into both parent and child chunks, especially at the smallest sizes, yielded the most substantial improvements on Kokkos.
Figure 3. Splitting datasets into both parent and child short chunks 

The following prompt example shows the child and parent chunks used to search the database.

Prompt: Create a Kokkos view of doubles with size 10 x 3

Child Chunk:
// Create the Kokkos::View X_lcl.
const size_t numLclRows = 10;
const size_t numVecs = 3;
typename dual_view_type::t_dev X_lcl ("X_lcl", numLclRows, numVecs);

Parent Chunk:
// Create the Kokkos::View X_lcl.
const size_t numLclRows = 10;
const size_t numVecs = 3;
typename dual_view_type::t_dev X_lcl ("X_lcl", numLclRows, numVecs);

// Modify the Kokkos::View's data.
Kokkos::deep_copy (X_lcl, ONE);

lclSuccess = success ? 1 : 0;
gblSuccess = 0; // output argument
reduceAll<int, int> (*comm, REDUCE_MIN, lclSuccess, outArg (gblSuccess));
if (gblSuccess != 1) {
std::ostringstream os;
os << "Proc " << comm->getRank () << ": checkpoint 1" << std::endl;
std::cerr << os.str ();

Advanced AI for HPC results

The Sandia team’s initial evaluations used standard NLP metrics like BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). 

However, they recognized the need for an evaluation metric better suited to code generation, particularly one that can assess the syntactic accuracy and operational efficacy of the generated code without relying on traditional output comparisons.
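As a purely illustrative stand-in (not the metric Sandia is developing), even a crude structural check shows how a metric can score syntactic well-formedness without comparing against a reference output; a real metric would parse or compile the candidate instead.

```python
# Toy syntax-aware score for a C++ snippet: 1.0 if parentheses, brackets,
# and braces balance, else 0.0. Ignores string literals and comments, so
# it is only a rough proxy for well-formedness.
PAIRS = {")": "(", "]": "[", "}": "{"}

def delimiter_balance_score(code):
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return 0.0          # mismatched or unopened delimiter
    return 1.0 if not stack else 0.0  # leftover opens also fail
```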

Enhancing future code functionality

As Sandia scientists refine their retrieval methods and AI models, they anticipate further enhancements in the coding assistant's ability to generate code suggestions that are not only functionally accurate but also contextually relevant. Future developments will focus on fine-tuning the base model, as NVIDIA engineers did for ChipNeMo, refining the retrieval processes, and integrating more advanced LLMs and embedding techniques as they become available.

AI for HPC code development

The integration of AI into code development, especially within the scope of high-performance computing, represents a significant leap forward in productivity and efficiency. This proof-of-concept work with NVIDIA NeMo puts Sandia at the forefront of this transformation, pushing the boundaries of what AI can achieve in complex computing environments. By continuously adapting and improving their AI models and methodologies, Sandia aims to provide you with powerful tools that enhance your ability to innovate and solve complex problems.

This advanced RAG initiative showcases the potential of AI in software development and highlights the importance of targeted solutions in technology advancement.

Sandia is evaluating both in-house and external cloud solutions to host these complex AI models, which require significant GPU resources. The choice depends on the availability of appropriate hardware and the scale of use across Sandia teams. Partnering with vendors like NVIDIA could provide valuable support, especially for integrating their container systems with Sandia's existing infrastructure.

More information

For more information, see the following resources:

ChipNeMo: Domain-Adapted LLMs for Chip Design: Includes an explanation of how RAG is used to create EDA scripts for chip design, a challenge that shares characteristics with creating a RAG model for parallel code.
