
Announcing HelpSteer: An Open-Source Dataset for Building Helpful LLMs

NVIDIA recently announced the NVIDIA NeMo SteerLM technique as part of the NVIDIA NeMo framework. This technique enables users to control large language model (LLM) responses during inference. The developer community has shown great interest in using the approach for building custom LLMs.

The NVIDIA NeMo team is now open-sourcing the Helpfulness SteerLM dataset (HelpSteer), a multi-attribute dataset. This new resource enables developers to get started quickly with the SteerLM technique and build state-of-the-art custom models.

HelpSteer is a collaborative effort between our team and Scale AI. Coupled with the SteerLM technique, it improves the factuality and coherence of responses. Developers can now guide LLM responses on additional attributes like complexity and verbosity and enhance the overall controllability of the responses for end-users.

By using this new dataset and the SteerLM technique, NVIDIA trained a Llama 2 70B foundational model that outperforms the Llama 2-70B chat model on MT Bench and TruthfulQA MC2 benchmarks. 

Let’s dive deep into how we built this dataset. 

Collecting high-quality human feedback

We collected around 10K high-quality prompts and generated four responses for each using an in-house LLM. The Scale AI team then engaged their human experts to evaluate each response on a scale of 0–4. Scale’s Generative AI Data Engine combines automation and human intelligence to rapidly generate training data tailored to specific AI goals and data needs.

All the responses in the dataset were annotated on five attributes, namely: 

  • Helpfulness: how well the response overall satisfies what the prompt asks for.
  • Correctness: whether the response contains factually accurate and relevant information without hallucinations or false information.
  • Coherence: how clear, logical, and self-consistent the response is.
  • Complexity: the sophistication of vocabulary and sentence structure from basic to expert level.
  • Verbosity: how concise or verbose the response is relative to what the prompt asks for.

We selected these attributes to complement the attributes in OpenAssistant, the dataset used to train the original SteerLM model. 

A sample of the dataset has the following structure:

### Prompt
What is the spiciest chili in the world? Is it a type of chili or made from another ingredient?

### Response 1
The spiciest chili in the world is the Carolina Reaper. It is a type of chili that is made from the chili pepper variety.
### Label 1
helpfulness:1,correctness:1,coherence:3,complexity:1,verbosity:1

### Response 2
The spiciest chili in the world is the Carolina Reaper, which is a type of chili pepper. It was created by a man named Ed Currie and is a cross between a Ghost Pepper and a Red Habanero. It is known for its extreme heat, which is measured at over one million Scoville units. The Carolina Reaper is made from a chili pepper that has been selectively bred for its high levels of capsaicin, the compound that gives chili peppers their spicy heat. While it is not made from any other ingredient, some people may add it to other ingredients to make a spicy sauce or salsa.
### Label 2
helpfulness:4,correctness:4,coherence:4,complexity:2,verbosity:2
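
To make the structure concrete, the following is a minimal sketch of loading and filtering the dataset with the Hugging Face datasets library. It assumes the dataset is published on the Hugging Face Hub as nvidia/HelpSteer, with one row per prompt–response pair and an integer column for each of the five attributes.

```python
# Minimal sketch: load HelpSteer and inspect its per-response attribute scores.
# Assumes the dataset is published as "nvidia/HelpSteer" with integer columns
# for the five attributes described above.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer", split="train")

# Each row pairs a prompt and a response with its five 0-4 attribute scores.
sample = ds[0]
print(sample["prompt"][:80])
print({k: sample[k] for k in
       ("helpfulness", "correctness", "coherence", "complexity", "verbosity")})

# Example: keep only responses that score highly on helpfulness and correctness.
high_quality = ds.filter(lambda r: r["helpfulness"] >= 3 and r["correctness"] >= 3)
print(f"{len(high_quality)} of {len(ds)} responses score >= 3 on both attributes")
```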

LLMs trained on this dataset produce more factually correct and coherent responses, making them more helpful in industries such as customer service and education. Users can also adjust the preferred level of complexity and verbosity of the language model at inference time.

For example, developers can build LLMs that can explain complex subjects such as American History to a middle school student or to someone whose first language isn’t English, all with just one customization run. 
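
As a rough illustration, the sketch below shows how target attribute values might be encoded in a SteerLM-style prompt at inference time. The template and the build_steered_prompt helper are illustrative assumptions modeled on the dataset's label format; the exact prompt template for a given SteerLM checkpoint (for example, Llama2-70B-SteerLM-Chat) may differ.

```python
# Illustrative sketch only: the tags and attribute names below mirror the
# HelpSteer label format, not a confirmed template for any specific checkpoint.
def build_steered_prompt(user_prompt: str, complexity: int, verbosity: int) -> str:
    """Build a prompt requesting a helpful response at a target complexity and verbosity."""
    attributes = (
        "helpfulness:4,correctness:4,coherence:4,"
        f"complexity:{complexity},verbosity:{verbosity}"
    )
    return (
        "<extra_id_0>System\n\n"
        f"<extra_id_1>User\n{user_prompt}\n"
        "<extra_id_1>Assistant\n"
        f"<extra_id_2>{attributes}\n"
    )

# Same prompt, two different steering targets: a simple, brief answer versus a
# more sophisticated, more detailed one.
question = "Explain the causes of the American Revolution."
print(build_steered_prompt(question, complexity=1, verbosity=1))
print(build_steered_prompt(question, complexity=4, verbosity=3))
```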

Llama2-70B-SteerLM-Chat evaluation results 

The Llama 2-70B foundational model trained using the HelpSteer dataset along with the SteerLM technique outperformed the Llama 2-70B chat model on several benchmarks. 

The following table compares the performance of both models on two popular benchmarks: MT Bench and TruthfulQA MC2. Higher scores are better, meaning that the SteerLM model is both more helpful and more truthful than the Llama 2-70B chat model.

| Benchmark | Llama 2-70B chat | Llama2-70B-SteerLM-Chat |
| --- | --- | --- |
| MT Bench | 6.86 | 7.54 |
| TruthfulQA MC2 | 0.528 | 0.561 |

Table 1. Comparison of Llama 2-70B chat and Llama2-70B-SteerLM-Chat models on MT Bench and TruthfulQA MC2 benchmarks

In Figure 1, we observe that the Llama2-70B-SteerLM-Chat model outperforms the Llama 2-70B chat model in most categories within MT Bench.

Figure 1. Score in each category of MT Bench for Llama2-70B-SteerLM-Chat compared to Llama 2 70B chat

Llama2-70B-SteerLM-Chat model responses

Let’s look at some examples from this SteerLM model. For the same prompt, the model provided different responses based on different levels of complexity. Developers can use the same model and tailor it for different use cases at the time of inference. 

Prompt
Can we travel at the speed of light?

Low-Complexity Response
No, it is not possible to travel at the speed of light. The speed of light is a fundamental constant of nature and is the maximum speed at which information can be transmitted in a vacuum. According to our current understanding of physics, it is not possible for any matter or energy to reach or exceed the speed of light.

High-Complexity Response
According to our current understanding of physics, it is not possible to travel at the speed of light. This is because as an object approaches the speed of light, its mass would increase infinitely, making it impossible to reach the speed of light. This effect is described by Einstein's theory of relativity and is known as the "mass-energy equivalence." Additionally, as an object approaches the speed of light, its length would contract and time would slow down, making it impossible to reach the speed of light from an observer's perspective. Therefore, it is currently believed that it is not possible for any material object to travel at the speed of light, and the speed of light is considered to be the upper limit for the speed of any matter or energy in the universe.

Learn more 

To learn more about the HelpSteer dataset and how to train custom LLMs with the NeMo framework and the SteerLM technique, explore the HelpSteer dataset on Hugging Face and the SteerLM documentation in the NVIDIA NeMo framework.

