In the rapidly evolving field of generative AI, coding models have become indispensable tools for developers, boosting productivity and precision in software development. They automate complex tasks, improve scalability, and foster innovation, making them invaluable in modern software workflows.
This post explores the benefits of Codestral Mamba, its Mamba-2 architecture, the inference optimizations supported in NVIDIA TensorRT-LLM, and the ease of deployment with NVIDIA NIM.
Codestral Mamba
Developed by Mistral, Codestral Mamba is a groundbreaking coding model built on the innovative Mamba-2 architecture. It is designed specifically for superior code completion. Using an advanced technique called fill-in-the-middle (FIM), Codestral Mamba sets a new standard in generating accurate and contextually relevant code examples.
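Conceptually, FIM means the model conditions on both the code before and after an insertion point and generates the missing middle. The following is a minimal sketch of that flow using a placeholder complete_fim helper (a hypothetical stand-in, not an actual Codestral Mamba or NIM API); a real deployment would send the prefix and suffix to the model and get back the generated middle span.

def complete_fim(prefix: str, suffix: str) -> str:
    # Hypothetical stand-in for a FIM completion call: a real client would
    # send the prefix and suffix to the model and return only the middle.
    return "return n % 2 == 0"

prefix = "def is_even(n):\n    "
suffix = "\n\nprint(is_even(4))"

middle = complete_fim(prefix, suffix)
print(prefix + middle + suffix)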
Codestral Mamba’s seamless integration with NVIDIA NIM for containerization also ensures effortless deployment across diverse environments.
The following syntactically and functionally correct code sample was generated by Codestral Mamba from an English-language prompt. You can copy it into a development environment. To generate other code samples, see the NVIDIA API Catalog.
from collections import deque

def bfs_traversal(graph, start):
    visited = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex not in visited:
            visited.add(vertex)
            print(vertex)
            queue.extend(graph[vertex] - visited)

# Example usage:
graph = {
    'A': set(['B', 'C']),
    'B': set(['A', 'D', 'E']),
    'C': set(['A', 'F']),
    'D': set(['B']),
    'E': set(['B', 'F']),
    'F': set(['C', 'E'])
}
bfs_traversal(graph, 'A')
Mamba-2
The Mamba-2 architecture is an advanced state space model (SSM) architecture. It is a recurrent model carefully designed to challenge the dominance of attention-based architectures in language modeling.
Mamba-2 connects SSMs and attention mechanisms through the concept of structured state space duality (SSD). Exploring this connection led to improvements in accuracy and implementation efficiency compared to Mamba-1.
The architecture uses selective SSMs, which can dynamically choose to focus on or ignore inputs at each timestep. This flexibility enables a more efficient processing of sequences by focusing computational resources on the most relevant parts of the input.
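As a rough illustration of this selectivity, the following NumPy sketch computes an input-dependent decay and update at every timestep, so the model can decide how much past state to keep and how much of the new input to absorb. The shapes and parameterization are simplified and do not match the exact Mamba-2 formulation.

import numpy as np

def selective_ssm(x, W_a, W_b, W_c):
    # Minimal selective SSM sketch: the state decay (a_t) and the state
    # update (b_t) are computed from the current input at each timestep.
    seq_len, d_state = x.shape[0], W_b.shape[1]
    h = np.zeros(d_state)                            # hidden state
    outputs = []
    for t in range(seq_len):
        a_t = 1.0 / (1.0 + np.exp(-(x[t] @ W_a)))    # input-dependent decay in (0, 1)
        b_t = x[t] @ W_b                             # input-dependent state update
        h = a_t * h + b_t                            # selective recurrence
        outputs.append(h @ W_c)                      # readout
    return np.stack(outputs)

# Example usage with random weights and a short sequence.
rng = np.random.default_rng(0)
d_model, d_state, seq_len = 8, 16, 5
x = rng.normal(size=(seq_len, d_model))
y = selective_ssm(x,
                  W_a=rng.normal(size=(d_model, d_state)),
                  W_b=rng.normal(size=(d_model, d_state)),
                  W_c=rng.normal(size=(d_state,)))
print(y.shape)  # (5,)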
Mamba-2 also addresses inefficiencies in tensor parallelism and enhances the computational efficiency of the model, making it faster and more suitable for GPUs.
TensorRT-LLM
NVIDIA TensorRT-LLM optimizes LLM inference by supporting Mamba-2's SSD algorithm. SSD retains the core benefits of Mamba-1's selective SSM, such as fast autoregressive inference with parallelizable selective scans that filter out irrelevant information. It further simplifies the SSM parameter matrix A from a diagonal to a scalar structure, which enables the use of the matrix multiplication units that also accelerate the Transformer attention mechanism on GPUs.
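The duality can be illustrated on a toy scalar recurrence: the same outputs can be produced either step by step or with one matrix multiplication against a lower-triangular decay mask. The NumPy sketch below is only a conceptual illustration, not the TensorRT-LLM kernel.

import numpy as np

rng = np.random.default_rng(1)
T = 6
a = rng.uniform(0.5, 1.0, size=T)   # scalar per-step decay (selective)
u = rng.normal(size=T)              # per-step input contribution

# Recurrent form: h_t = a_t * h_{t-1} + u_t
h = 0.0
recurrent = []
for t in range(T):
    h = a[t] * h + u[t]
    recurrent.append(h)
recurrent = np.array(recurrent)

# Dual form: a single matmul with a lower-triangular decay mask L,
# where L[t, s] = a_{s+1} * ... * a_t for s < t, and 1 when s == t.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1]) if s < t else 1.0
dual = L @ u

print(np.allclose(recurrent, dual))  # True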
An added benefit of Mamba-2's SSD, also supported in TensorRT-LLM, is the ability to share the recurrence dynamics across all state dimensions N (d_state) as well as head dimensions D (d_head). This enables a larger state space expansion than Mamba-1 by using GPU Tensor Cores, and the larger state size helps improve model quality and generated outputs.
Batching variable-length sequences can be a challenge for Transformer-based models: it requires either padding all sequences to the same length (which wastes computation) or implementing specialized variable-length attention kernels with careful load balancing.
Mamba-2-based models can treat the whole batch as a long sequence and avoid passing the states between different sequences in the batch by setting the state transition to 0 for tokens at the end of each sequence.
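The following toy NumPy sketch shows that trick: two sequences are packed into one stream and the decay is zeroed at the boundary, so the second sequence produces exactly the same states as it would if scanned on its own. It is an illustration of the idea, not the TensorRT-LLM implementation.

import numpy as np

def packed_scan(u, a, seq_lens):
    # Run a scalar recurrence h_t = a_t * h_{t-1} + u_t over a packed
    # stream of sequences. The decay is forced to 0 at each sequence
    # boundary so no state is carried from one sequence into the next.
    a = a.copy()
    boundary = 0
    for length in seq_lens[:-1]:
        boundary += length
        a[boundary] = 0.0          # first token of the next sequence ignores prior state
    h, out = 0.0, []
    for t in range(len(u)):
        h = a[t] * h + u[t]
        out.append(h)
    return np.array(out)

rng = np.random.default_rng(2)
u = rng.normal(size=7)             # two sequences of lengths 4 and 3, packed together
a = rng.uniform(0.5, 1.0, size=7)
packed = packed_scan(u, a, seq_lens=[4, 3])

# The second sequence, scanned on its own, matches its packed slice exactly.
alone = packed_scan(u[4:], a[4:], seq_lens=[3])
print(np.allclose(packed[4:], alone))  # True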
TensorRT-LLM supports SSD's chunking and state passing on input sequences using Tensor Core matmuls in both the context and generation phases. It uses chunk scanning over intermediate shorter chunk states to determine the final output state given all the previous inputs.
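Conceptually, chunking splits a long sequence into short blocks, processes each block independently, and passes only a single summary state between blocks. The sketch below shows that chunk-wise processing of a toy scalar recurrence reproduces the full sequential scan; it is a conceptual illustration, not the optimized kernel.

import numpy as np

def scan(u, a, h0=0.0):
    # Sequential scalar recurrence h_t = a_t * h_{t-1} + u_t.
    h, out = h0, []
    for a_t, u_t in zip(a, u):
        h = a_t * h + u_t
        out.append(h)
    return np.array(out), h

def chunked_scan(u, a, chunk_size):
    # Process the sequence chunk by chunk, passing only the final
    # state of each chunk into the next one.
    outputs, state = [], 0.0
    for start in range(0, len(u), chunk_size):
        end = start + chunk_size
        out, state = scan(u[start:end], a[start:end], h0=state)
        outputs.append(out)
    return np.concatenate(outputs)

rng = np.random.default_rng(3)
u = rng.normal(size=10)
a = rng.uniform(0.5, 1.0, size=10)

full, _ = scan(u, a)
chunked = chunked_scan(u, a, chunk_size=4)
print(np.allclose(full, chunked))  # True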
NVIDIA NIM
NVIDIA NIM inference microservices are designed to streamline and accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations.
NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand. It supports a wide range of generative AI models across domains including speech, image, video, healthcare, and more.
NIM delivers best-in-class throughput, enabling enterprises to generate tokens up to 5x faster. For generative AI applications, token processing is the key performance metric, and increased token throughput directly translates to higher revenue for enterprises.
Get started
To experience Codestral Mamba, see Instantly Deploy Generative AI with NVIDIA NIM. Here, you will also find popular models like Llama3-70B, Llama3-8B, Gemma 2B, and Mixtral 8X22B.
With free NVIDIA cloud credits, you can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack.
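As a minimal sketch of that connection, the following uses the OpenAI-compatible Python client against the NVIDIA-hosted endpoint. The model identifier shown is an assumption, so confirm the exact base URL and model name in the API Catalog entry for Codestral Mamba, and set NVIDIA_API_KEY in your environment.

import os
from openai import OpenAI  # OpenAI-compatible client

# Base URL and model name are assumptions; confirm both in the
# NVIDIA API Catalog entry for Codestral Mamba before use.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="mistralai/mamba-codestral-7b-v0.1",   # assumed model identifier
    messages=[{"role": "user",
               "content": "Write a Python function that checks if a string is a palindrome."}],
    temperature=0.2,
    max_tokens=256,
)
print(completion.choices[0].message.content)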