Generative AI / LLMs

Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU


NVIDIA collaborated with Mistral AI to build a next-generation language model that achieves leading performance across benchmarks in its class. With a growing number of language models purpose-built for select tasks, NVIDIA Research and Mistral AI combined forces to offer a versatile, open language model that is performant and runs on a single GPU, such as an NVIDIA A100 or H100.

This post explores the benefits of Mistral NeMo, training and inference optimizations, applicability for various use cases, and the ease of deployment with NVIDIA NIM.

Mistral NeMo 12B

Mistral NeMo is a 12B-parameter, dense, decoder-only transformer text model trained with a 131K-token multilingual vocabulary. It delivers leading accuracy on popular benchmarks spanning common-sense reasoning, world knowledge, coding, math, and multilingual and multi-turn chat tasks.

| Model | Context window | HellaSwag (0-shot) | Winograd (0-shot) | NaturalQ (5-shot) | TriviaQA (5-shot) | MMLU (5-shot) | OpenBookQA (0-shot) | CommonSenseQA (0-shot) | TruthfulQA (0-shot) | MBPP (pass@1, 3-shot) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral NeMo 12B | 128K | 83.5% | 76.8% | 31.2% | 73.8% | 68.0% | 60.6% | 70.4% | 50.3% | 61.8% |
| Gemma 2 9B | 8K | 80.1% | 74.0% | 29.8% | 71.3% | 71.5% | 50.8% | 60.8% | 46.6% | 56.0% |
| Llama 3 8B | 8K | 80.6% | 73.5% | 28.2% | 61.0% | 62.3% | 56.4% | 66.7% | 43.0% | 57.2% |
Table 1. Mistral NeMo model performance across popular benchmarks

Supporting 128K context length, the model has enhanced understanding and the capability to process extensive and complex information, leading to more coherent, accurate, and contextually relevant outputs.

Mistral NeMo is trained on Mistral’s proprietary dataset that includes a large proportion of multilingual and code data, which enables better feature learning, reduced bias, and an improved ability to handle diverse and complex scenarios.

Optimized training

The model is trained using NVIDIA Megatron-LM, an open-source, PyTorch-based library with a collection of GPU-optimized techniques, cutting-edge system-level innovations, and modular APIs for training models at large scale.

Megatron-LM, part of NVIDIA NeMo, offers the core building blocks for distributed training of text, multimodal, and mixture of experts (MoE) models, natively built into the library (a minimal illustration of activation recomputation follows the list):

  • Attention mechanisms
  • Transformer blocks and layers
  • Normalization layers
  • Embedding techniques
  • Activation recomputation
  • Distributed checkpointing
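
Several of these building blocks have straightforward conceptual analogs in plain PyTorch. As a minimal sketch of the activation recomputation idea only (this is not Megatron-LM's API), the following module recomputes its intermediate activations during the backward pass instead of storing them, trading extra compute for lower peak memory:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Transformer-style feed-forward block whose intermediate activations
    are recomputed in the backward pass rather than kept in memory."""
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def _inner(self, x):
        return self.ff(self.norm(x))

    def forward(self, x):
        # checkpoint() discards the activations of _inner after the forward
        # pass and recomputes them during backward, reducing peak memory.
        return x + checkpoint(self._inner, x, use_reentrant=False)

block = CheckpointedBlock()
x = torch.randn(2, 128, 1024, requires_grad=True)
block(x).sum().backward()
print(x.grad.shape)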

Optimized inference

Mistral NeMo is optimized with NVIDIA TensorRT-LLM engines for higher inference performance. TensorRT-LLM compiles the model into TensorRT engines, converting the model layers into optimized CUDA kernels using pattern matching and fusion. These engines are executed by the TensorRT-LLM runtime, which includes additional optimizations such as in-flight batching.

Inference in FP8 precision is also supported through NVIDIA TensorRT Model Optimizer. Using post-training quantization (PTQ) on NVIDIA Hopper and NVIDIA Ada GPUs, you can reduce model complexity and create a smaller model with a lower memory footprint, without sacrificing accuracy.
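
As a rough illustration of what FP8 post-training quantization does, the sketch below applies per-tensor scaling to a weight matrix and casts it to the FP8 E4M3 format. It uses plain PyTorch dtypes (available in recent PyTorch builds), not the TensorRT Model Optimizer API, and is meant only to show the idea of trading precision for memory:

import torch

def quantize_fp8(weight: torch.Tensor):
    """Per-tensor FP8 (E4M3) post-training quantization sketch: scale the
    weight into the representable FP8 range, cast, and keep the scale so
    the values can be dequantized at inference time."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3
    scale = weight.abs().max() / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8(w)
w_dequant = w_fp8.to(torch.float32) * scale

print("max abs error:", (w - w_dequant).abs().max().item())
print("bytes per element: fp32 =", w.element_size(), ", fp8 =", w_fp8.element_size())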

This model fits on a single GPU, improving compute efficiency, lowering compute cost, and enhancing security and privacy. Run the model for a range of commercial use cases, including long-document summarization, classification, multi-turn conversations, language translation, and code generation.

NVIDIA NIM

The Mistral NeMo model is packaged as an NVIDIA NIM inference microservice to streamline and accelerate the deployment of generative AI models across NVIDIA accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand. It supports a wide range of generative AI models across domains including speech, image, video, healthcare, and more.
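
Because NIM exposes an OpenAI-compatible API, a deployed endpoint can be queried with standard client libraries. The following is a minimal sketch: the base URL assumes a NIM container serving locally on port 8000, and the model identifier is an assumption; check the models served by your deployment (for example, via its /v1/models route) to confirm the exact name.

from openai import OpenAI

# Point the standard OpenAI client at a locally deployed NIM endpoint.
# The base URL and model ID below are assumptions; adjust them to match
# your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-for-local-nim")

response = client.chat.completions.create(
    model="mistralai/mistral-nemo-12b-instruct",  # assumed model ID
    messages=[{"role": "user",
               "content": "Summarize the key decisions in this meeting transcript: ..."}],
    max_tokens=256,
    temperature=0.2,
)
print(response.choices[0].message.content)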

NVIDIA NIM delivers best-in-class throughput, enabling enterprises to generate tokens up to 5x faster. For generative AI applications, token processing is the key performance metric, and increased token throughput directly translates to higher revenue for enterprises.

Mistral NeMo is released under the permissive Apache 2.0 license, which gives enterprises the flexibility to customize the model and integrate it into their commercial applications.

Here’s how to test the Mistral NeMo model for programming tasks.

Coding copilot

Coding is a popular use case for the model: it enhances developer productivity with AI-powered code suggestions. A coding copilot not only offers inline code suggestions but also generates code, documentation, comments, unit tests, and error fixes, all easily accessible through industry-standard APIs.

Figure 1. Mistral NeMo coding copilot example: the model interprets a natural-language prompt and generates the corresponding code
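
The request behind a copilot-style interaction like Figure 1 can also be issued programmatically. The sketch below targets the NVIDIA-hosted endpoint; the base URL, model identifier, and NVIDIA_API_KEY environment variable are assumptions to verify against the model's entry in the NVIDIA API Catalog.

import os
from openai import OpenAI

# Hosted endpoint and model ID are assumptions; confirm both in the
# NVIDIA API Catalog before running.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

stream = client.chat.completions.create(
    model="nv-mistralai/mistral-nemo-12b-instruct",  # assumed model ID
    messages=[{
        "role": "user",
        "content": "Write a PyTorch convolutional neural network that "
                   "classifies 32x32 RGB images into 10 classes, and show "
                   "a test forward pass with a dummy input.",
    }],
    max_tokens=1024,
    temperature=0.2,
    stream=True,  # stream tokens back as a copilot would display them
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")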

Here’s the syntactically and functionally correct code generated by Mistral NeMo from an English-language prompt. You can paste the following into your development environment. To generate other code samples, see the NVIDIA API Catalog.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        # Fully connected layer
        self.fc = nn.Linear(32 * 8 * 8, 10)  # Assuming input image size is 32x32

    def forward(self, x):
        # Add convolutional layers with ReLU activation and max pooling
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)

        # Flatten the tensor before passing it to the fully connected layer
        x = x.view(-1, 32 * 8 * 8)

        # Add fully connected layer with log softmax for multi-class classification
        x = self.fc(x)
        output = F.log_softmax(x, dim=1)
        return output

# Create an instance of the neural network
net = Net()

# Print the model architecture
print(net)

# Test the forward pass with a dummy input
dummy_input = torch.randn(1, 3, 32, 32)  # Batch size of 1, 3 channels, 32x32 image size
output = net(dummy_input)
print("Test output:\n", output)

You may also want to fine-tune the model with your domain data to generate higher-accuracy responses. NVIDIA offers tools to align the model for your use case.

Model customization

The instruction-tuned variant of the Mistral NeMo model offers strong performance amongst similarly sized LLMs across several benchmarks such as MT Bench, MixEval-Hard, IFEval-v5, and WildBench.

You can further customize it for your specific needs using NVIDIA NeMo, an end-to-end platform for developing custom generative AI, anywhere.

NeMo offers state-of-the-art fine-tuning and alignment support with parameter-efficient fine-tuning (PEFT) techniques, including p-tuning, low-rank adaptation (LoRA), and its quantized version (QLoRA). These techniques are useful for creating custom models without requiring large amounts of compute.
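
As a conceptual sketch of the LoRA idea only, not the NeMo PEFT API, a low-rank adapter keeps the pretrained weight frozen and learns a small rank-r update on top of it:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = base(x) + (alpha / r) * B(A(x)). Only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")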

NeMo also supports supervised fine-tuning (SFT) and alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and NeMo SteerLM. These techniques make it possible to further steer model responses and align them with human preferences, making the LLM ready to integrate into custom applications.
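
To make one of these objectives concrete, the direct preference optimization (DPO) loss nudges the policy to prefer chosen over rejected responses relative to a frozen reference model. This is a conceptual sketch of the loss itself, not the NeMo implementation:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on per-sequence log-probabilities: reward the policy for
    widening its chosen-vs-rejected margin beyond the reference model's,
    with beta controlling how strongly preferences are enforced."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Dummy log-probabilities for a batch of four preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())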

Get started

To experience the Mistral NeMo NIM microservice, see the Artificial Intelligence solution page. You will also find popular models there, such as Llama 3.1 405B, Mixtral 8x22B, and Gemma 2B.

With free NVIDIA cloud credits, you can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack.
