Streamline Evaluation of LLMs for Accuracy with NVIDIA NeMo Evaluator

Large language models (LLMs) have demonstrated remarkable capabilities, from tackling complex coding tasks to crafting compelling stories to translating natural language. Enterprises are customizing these models for even greater application-specific effectiveness to deliver higher accuracy and improved responses to end users.

However, customizing LLMs for specific tasks can cause the model to “forget” previously learned tasks, a problem known as catastrophic forgetting. As enterprises adopt LLMs into their applications, they therefore need to evaluate models on both the original and the newly learned tasks, continuously optimizing them to provide a better experience. In practice, evaluating a customized model means re-running the foundation and alignment evaluations to detect any regressions.
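As a rough illustration of that regression check, the sketch below compares per-benchmark scores for a base model and its customized version and flags any benchmark whose score dropped. It is not part of NeMo Evaluator: the function name `find_regressions`, the benchmark names, the scores, and the `tolerance` threshold are all hypothetical, and it assumes that a higher score is better on every benchmark.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    customized: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the benchmarks where the customized model scores lower
    than the baseline by more than `tolerance` (higher is better)."""
    regressions = []
    for name, base_score in baseline.items():
        new_score = customized.get(name)
        if new_score is None:
            continue  # benchmark was not re-run, so nothing can be concluded
        if base_score - new_score > tolerance:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical scores: the customized model improves on its target task
# but loses ground on a previously learned one.
baseline = {"reasoning": 0.71, "translation": 0.64, "support_tickets": 0.40}
customized = {"reasoning": 0.70, "translation": 0.55, "support_tickets": 0.82}

print(find_regressions(baseline, customized))  # ['translation']
```

The tolerance keeps small run-to-run fluctuations from being reported as forgetting; a suitable value depends on the benchmark and how noisy its scores are.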

To simplify the evaluation of LLMs, the NVIDIA NeMo team has announced an early access program for NeMo Evaluator, a cloud-native microservice that provides automated benchmarking capabilities. It assesses state-of-the-art foundation models and custom models using a diverse, curated set of academic benchmarks, customer-provided benchmarks, or LLM-as-a-judge.

NeMo Evaluator simplifies generative AI model evaluation

NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for training, fine-tuning, retrieval-augmented generation, guardrailing, and data curation, as well as pretrained models. It has offerings across the tech stack, from frameworks to higher-level APIs, managed endpoints, and microservices.

The NeMo Evaluator microservice, recently launched as part of the NeMo microservices suite, comprises a set of API endpoints that provide the easiest path for enterprises to get started with LLM evaluation. To learn more, see Simplify Custom Generative AI Development with NVIDIA NeMo Microservices.

Along with the NVIDIA NeMo Customizer microservice, enterprises can continuously customize and evaluate models to enhance their performance (Figure 1).

*Figure 1. The generative AI model lifecycle involves continuous customization and evaluation to improve model performance with NVIDIA NeMo microservices*

Supported evaluation methods in early access

The NeMo Evaluator microservice supports automated evaluation on a curated set of academic benchmarks and on user-provided evaluation datasets. It also supports using LLM-as-a-judge to perform a holistic evaluation of model responses, which is useful for generative tasks that have no single well-defined ground truth. Each supported evaluation method is described below.

Automated evaluation on academic benchmarks

Academic benchmarks offer a comprehensive evaluation of LLM performance across diverse language understanding and generation tasks. They serve as valuable tools for comparing different models and assisting in the selection of the most suitable LLM for specific needs. Additionally, benchmarks offer insights into areas where models may underperform, directing efforts to improve performance in those specific areas.

The NeMo Evaluator currently supports popular academic benchmarks, including:

Beyond the Imitation Game benchmark (BIG-bench): A collaborative benchmark intended to probe LLMs and extrapolate their future capabilities. It includes more than 200 tasks such as summarization, paraphrasing, solving sudoku puzzles, and more.
Multilingual: A benchmark that consists of classification and generative tasks to assess multilingual capabilities across a wide variety of languages. It tests LLMs on tasks including common-sense reasoning, multilingual question answering, and multilingual translation across 101 languages.
Toxicity: A benchmark to measure the toxicity of an LLM. Toxic output is defined as content that is inappropriate, disrespectful, or unreasonable. The toxicity benchmark here is based on RealToxicityPrompts, a set of 100,000 prompts and toxicity scores.

Automated evaluation on custom datasets

Standard academic datasets and benchmarks often fail to meet the distinctive requirements of enterprises because they overlook crucial aspects such as domain expertise, cultural nuances, localization, and other specific considerations. That’s why enterprises turn to experts to build custom datasets and run evaluations that fit their needs.

To enable evaluation on custom datasets, the NeMo Evaluator microservice provides popular natural language processing (NLP) metrics that measure the similarity between ground truth labels and LLM-generated responses, such as:

Accuracy measures the proportion of correctly predicted instances out of the total instances in the dataset.
BiLingual Evaluation Understudy (BLEU) is a metric for automatically evaluating machine-translated text. It scores how closely a machine translation matches one or more high-quality reference translations, on a scale from 0 to 1. A score of 0 indicates no match (low quality), and a score of 1 signifies a perfect match (high quality).
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of automatic text summarization, as well as text composition, by comparing the overlap between machine-generated summaries and human-generated reference summaries.
F1 combines precision and recall into a single score, providing a balance between them. It is commonly used to evaluate classification models as well as question-answering tasks.
Exact match measures the proportion of predictions that exactly match the ground truth or expected output.
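To make the simpler metrics above concrete, here is a minimal Python sketch of accuracy, exact match, and a token-level F1 of the kind often used for question answering. It is an illustration only, not the NeMo Evaluator implementation. The function names, the sample data, and the normalization step (lowercasing and whitespace tokenization) are assumptions; real evaluations often apply different normalization, which changes the scores.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Assumed normalization: lowercase, then split on whitespace."""
    return text.lower().split()


def accuracy(predicted_labels: List[str], true_labels: List[str]) -> float:
    """Proportion of predicted labels that equal the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Proportion of predictions identical to the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("inputs must have the same length")
    if not references:
        return 0.0
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall for one example."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predictions = ["Paris", "the blue whale"]
references = ["paris", "blue whale"]

print(exact_match(predictions, references))                 # 0.5
print(round(token_f1("the blue whale", "blue whale"), 2))   # 0.8
print(accuracy(["spam", "ham", "ham"], ["spam", "ham", "spam"]))
```

The second example shows why F1 is often reported alongside exact match: "the blue whale" fails the exact-match test against "blue whale" but still earns partial credit, with precision 2/3 and recall 1.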

Automated evaluation with LLM-as-a-judge

Using humans to evaluate LLM responses is a time-consuming and expensive process. However, employing LLM-as-a-judge has shown promising results in terms of scalability and efficiency. LLMs can rapidly assess numerous responses, potentially reducing evaluation time and costs while maintaining reliable judgment standards. For more details, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

The NeMo Evaluator microservice can leverage any NVIDIA NIM-supported LLM listed in the NVIDIA API catalog with the MT-Bench dataset or custom datasets for evaluating models customized with NVIDIA NeMo Customizer.
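The general LLM-as-a-judge pattern can be sketched as follows: build a grading prompt, send it to a judge model, and parse a numeric rating from the reply. This is a hedged illustration, not the NeMo Evaluator API. The prompt wording and the `[[N]]` rating format are loosely modeled on MT-Bench single-answer grading, and `judge_answer`, `parse_rating`, and `stub_judge` are hypothetical names. The `judge` argument is any callable that takes a prompt and returns text; a stub stands in for a real model so the example runs without network access.

```python
import re
from typing import Callable, Optional

# Assumed prompt template; a real evaluation would tune this wording.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1 to 10 for helpfulness, relevance, and accuracy. "
    "Explain briefly, then give the rating in the form 'Rating: [[N]]'.\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n"
)

RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judgment: str) -> Optional[float]:
    """Extract the first [[N]] rating; return None if missing or out of range."""
    match = RATING_PATTERN.search(judgment)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 1 <= rating <= 10 else None


def judge_answer(
    question: str, answer: str, judge: Callable[[str], str]
) -> Optional[float]:
    """Ask the judge model to grade one answer and return its parsed rating."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(judge(prompt))


def stub_judge(prompt: str) -> str:
    """Placeholder for a real judge model call (hypothetical)."""
    return "The answer is correct but terse. Rating: [[7]]"


print(judge_answer("What is the capital of France?", "Paris.", stub_judge))  # 7.0
```

Returning `None` for an unparseable or out-of-range reply, rather than a default score, makes judge failures visible so they can be retried or excluded instead of silently skewing the average.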

Apply for early access

To get started, apply for NeMo Evaluator early access. Applications are reviewed, and a link to access the microservice containers will be sent upon approval.

As part of the early access program, you can also request access to the NVIDIA NeMo Curator and NVIDIA NeMo Customizer microservices. Together, these microservices enable enterprises to easily build enterprise-grade custom generative AI and bring solutions to market faster.