
NVIDIA Text Embedding Model Tops MTEB Leaderboard


The latest embedding model from NVIDIA—NV-Embed—set a new record for embedding accuracy with a score of 69.32 on the Massive Text Embedding Benchmark (MTEB), which covers 56 embedding tasks.

Highly accurate and effective models like NV-Embed are key to transforming vast amounts of data into actionable insights. NVIDIA provides top-performing models through the NVIDIA API catalog.

LLM-powered “Talk to your Data” pipelines rely heavily on an embedding model like NV-Embed, which creates a semantic representation of unstructured text by converting English words into a compressed mathematical representation of the information in the text. This representation is typically stored in a vector database for later use.

When a user asks a question, its mathematical representation is compared against those of all the chunks of underlying data to retrieve the most useful information for answering the question.
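As a minimal sketch of that flow, the snippet below uses a toy embed() stand-in (a hashed bag-of-words vector, not a real embedding model) so the example runs end to end; in practice, the function would call NV-Embed or another embedding model, and the in-memory list of chunk vectors would live in a vector database.

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for an embedding model.

    Hashes words into a fixed-size bag-of-words vector so the example runs
    end to end; in a real pipeline this would be a call to NV-Embed (for
    example, through the NVIDIA API catalog).
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Offline: embed the document chunks (a vector database would store these).
chunks = [
    "NV-Embed scored 69.32 on the MTEB benchmark.",
    "MTEB covers 56 embedding tasks, including retrieval and clustering.",
    "Recall@k measures the fraction of relevant chunks retrieved in the top k.",
]
chunk_vectors = [embed(c) for c in chunks]

# Online: embed the user question and retrieve the most similar chunks.
question = "How many embedding tasks does MTEB cover?"
q_vec = embed(question)
scores = [cosine_similarity(q_vec, v) for v in chunk_vectors]
for chunk, score in sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)[:2]:
    print(f"{score:.3f}  {chunk}")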

Note that this specific model can only be used for noncommercial purposes.

Breaking down the benchmarks

Before discussing the model’s accuracy numbers, it’s important to understand the benchmarks themselves. This section covers them briefly; our deep dive, Evaluating Retriever for Enterprise-Grade RAG, is an excellent resource for further information.

Understanding the metrics for embedding models

Starting with the metrics used by these benchmarks, there are two of note:

  • Normalized discounted cumulative gain (NDCG) is a rank-aware metric that measures the relevance and order of the retrieved information. Simply put, if we have 1,000 chunks and retrieve 10 (NDCG@10), the ideal score is given when the most relevant chunk is ranked first, the second-most relevant chunk is ranked second, and so on until the 10th most relevant chunk is in the 10th position.
  • Recall is a rank-agnostic metric that measures the percentage of the relevant results retrieved. In this case, if we have 1,000 chunks and retrieve 10 (Recall@10), a perfect score is given if the top 10 most relevant chunks are selected regardless of the order in which they were ranked.

Most of the benchmarks report NDCG@10, but due to the nature of most enterprise-grade retrieval-augmented generation (RAG) pipelines, we recommend using Recall@5.
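To make these definitions concrete, here is a small sketch that computes Recall@k and NDCG@k from a ranked list of retrieved chunk IDs; the chunk IDs and relevance labels are made up purely for illustration.

import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k with graded relevance labels (0 = irrelevant)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative example: three relevant chunks exist; the retriever returns five.
retrieved = ["c7", "c2", "c9", "c1", "c4"]
relevance = {"c2": 3, "c1": 2, "c5": 1}   # graded labels, higher = more relevant
relevant = set(relevance)

print("Recall@5:", recall_at_k(retrieved, relevant, 5))   # 2 of 3 relevant chunks found
print("NDCG@5:  ", ndcg_at_k(retrieved, relevance, 5))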

What are MTEB and BEIR?

The core function of retrieval pipelines is comparing the semantic representation of a question with various data points. This naturally leads developers to a couple of follow-up questions: 

  • Can the same representation be used for different tasks? 
  • If we narrow down to one task, is the model good at representing different types of questions or understanding different domains?

To answer these questions, we look at two benchmarks most prevalent in the literature around retrieval. 

  • MTEB: This benchmark covers 56 different tasks, including retrieval, classification, re-ranking, clustering, summarization, and more. Depending on your goals, you can look at the precise subset of tasks representing your use case.
  • BEIR: This benchmark focuses on the retrieval task and adds complexity in the form of different types and domains of questions, such as fact-checking, biomedical questions, or detecting duplicate questions. MTEB is largely a superset of the BEIR benchmark, so we’ll focus on MTEB for most of the discussion.
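If you want to reproduce such scores or evaluate your own model, MTEB ships as a Python package. The sketch below assumes the mteb package together with a Sentence Transformers model; the model name and task list are only examples, and the package interface may differ slightly between versions.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; this public model is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Restrict the run to retrieval tasks that resemble a generic RAG workload.
evaluation = MTEB(tasks=["NQ", "HotpotQA"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)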

NV-Embed model accuracy benchmark

Now that we have discussed the underlying benchmarks and metrics, let’s see how our new model NV-Embed performed.

Figure 1. The top five models on the MTEB benchmark:

  • NV-Embed: 69.32
  • Voyage-large-2-instruct: 68.28
  • Linq-Embed-Mistral: 68.17
  • SFR-Embedding-Mistral: 67.56
  • gte-Qwen-1.5-7B Instruct: 67.34

Averaging accuracy across the 56 tasks, the NV-Embed model performs best, with a score of 69.32 (see Figure 1).

While the NV-Embed paper covers the model architecture and training details behind the 69.32 score, the following summarizes the key improvements.

  • A new latent attention layer. We introduce a latent attention layer, which simplifies how the model combines the mathematical representations (embeddings) of a sequence of words (tokens). Typically, this is done either by taking an average, in the case of BERT-based models, or by focusing on an end-of-sequence token (<EOS>), in the case of decoder-only models.
  • A two-stage learning process. In the first stage, in-batch negatives and hard negative pairs are used for contrastive learning. Simply put, pairs of questions and evidence are used; in a hard negative pair, the evidence seems to answer the question, but if you read closely, essential information is missing. In the second stage, data from non-retrieval tasks is blended in for contrastive learning, and in-batch negative training is disabled.
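To make the idea of in-batch negatives concrete, here is a minimal InfoNCE-style contrastive loss sketch in PyTorch, where each question’s positive passage is one row of the batch and every other passage in the batch acts as a negative. It illustrates the general technique only; it is not the exact loss, pooling, or hyperparameters used to train NV-Embed.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    q_emb: (B, D) question embeddings
    p_emb: (B, D) passage embeddings; row i is the positive for question i,
           and all other rows in the batch serve as negatives.
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                          # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=logits.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy batch of 4 question/passage embedding pairs.
q_emb = torch.randn(4, 8)
p_emb = torch.randn(4, 8)
print(in_batch_contrastive_loss(q_emb, p_emb))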

A natural question at this point is, “How well does this translate to my enterprise retrieval workload?”

The answer is, it depends on the nature and domain of your data. For each benchmark, you must assess how relevant the individual datasets are to general retrieval use cases.

Our key takeaway is that while the BEIR benchmark comprises 19 datasets, some of them, like Quora, are curated with questions that go beyond the usual retrieval tasks. We therefore recommend looking at a subset of datasets that is more representative of your workload, for instance, the Natural Questions (NQ) and HotpotQA datasets. Refer to the snippets below for context.

Quora sample, with pairs focused on retrieving other similar questions asked on Quora

Input: Which question should I ask on Quora?
Target: What are good questions to ask on Quora?

HotpotQA sample with a general question-passage pair

Input-Question: Were Scott Derrickson and Ed Wood of the same nationality?

Target-Chunk: Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. He lives in Los Angeles, California. He is best known for directing horror films such as “Sinister”, “The Exorcism of Emily Rose”, and “Deliver Us From Evil”, as well as the 2016 Marvel Cinematic Universe installment, “Doctor Strange.”

NQ sample with a general question-passage pair

Input-Question: What is non-controlling interest on the balance sheet?

Target-Chunk: In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation’s stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]
Figure 2. The top three embedding models from MTEB on HotpotQA and NQ, which are good representatives of a generic retrieval use case. NV-Embed outperforms the other models on this subset.

In Figure 2, the NV-Embed model is the best for datasets representing these use cases. We encourage you to repeat this evaluation on your own data. If you don’t have clean data to test with, we recommend finding a benchmark subset that represents your use case.
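A lightweight way to run that check is to collect question/chunk pairs where you know which chunk answers which question, embed both sides, and measure Recall@k over your full chunk set. The sketch below uses a toy embed() stand-in and made-up pairs; swap in your real embedding calls and labeled data.

import numpy as np

def embed(text, dim=256):
    # Toy stand-in; replace with calls to your embedding model of choice.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Labeled evaluation pairs: each question maps to the chunk that answers it.
# These would come from your own documents and user questions.
eval_pairs = {
    "What is non-controlling interest?": "chunk_nci",
    "Were Scott Derrickson and Ed Wood of the same nationality?": "chunk_directors",
}
chunks = {
    "chunk_nci": "Minority interest, or non-controlling interest, is the portion ...",
    "chunk_directors": "Scott Derrickson is an American director ...",
    "chunk_other": "MTEB covers 56 embedding tasks across several categories.",
}

chunk_ids = list(chunks)
chunk_vecs = np.stack([embed(chunks[c]) for c in chunk_ids])

k = 5
hits = 0
for question, gold_chunk in eval_pairs.items():
    scores = chunk_vecs @ embed(question)              # cosine similarity (unit vectors)
    top_k = [chunk_ids[i] for i in np.argsort(-scores)[:k]]
    hits += gold_chunk in top_k
print(f"Recall@{k}: {hits / len(eval_pairs):.2f}")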

Begin prototyping today

Experience the NV-Embed model through the NVIDIA API catalog.
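The model card in the API catalog includes ready-to-use request snippets. As a rough sketch, the catalog exposes an OpenAI-compatible interface, so a call looks something like the following; the base URL, model identifier, and input_type field shown here are assumptions to verify against the NV-Embed page, and NVIDIA_API_KEY is a placeholder for your own key.

import os
from openai import OpenAI  # the API catalog exposes an OpenAI-compatible interface

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # check the model card for the exact URL
    api_key=os.environ["NVIDIA_API_KEY"],             # your API catalog key
)

response = client.embeddings.create(
    model="nvidia/nv-embed-v1",                       # verify the model ID on the catalog page
    input=["What is non-controlling interest on the balance sheet?"],
    extra_body={"input_type": "query"},               # retrieval models distinguish query vs. passage
)
print(response.data[0].embedding[:8])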

In addition, use the NVIDIA NeMo Retriever collection of microservices, designed to enable organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. 
