IBM’s New Granite 3.0 Generative AI Models Are Small, Yet Highly Accurate and Efficient

Today, IBM released the third generation of IBM Granite, a collection of open language models and complementary tools. Prior generations of Granite focused on domain-specific use cases; the latest IBM Granite models meet or exceed the performance of leading similarly sized open models across both academic and enterprise benchmarks. 

The developer-friendly Granite 3.0 generative AI models are designed for function calling, supporting tool-based use cases. They were developed as workhorse enterprise models capable of serving as the primary building block of sophisticated workflows across use cases including text generation, agentic AI, classification, tool calling, summarization, entity extraction, customer service chatbots, and more. 

Introducing IBM’s Granite Generation 3 family

IBM developed the Granite series, available as an NVIDIA NIM microservice, for enterprise use, prioritizing industry-leading trust, safety and cost efficiency without compromising performance. 

In its entirety, the Granite 3.0 release comprises:

  • Dense, text-only LLMs: Granite 3.0 8B, Granite 3.0 2B
  • Mixture of Experts (MoE) LLMs: Granite 3.0 3B-A800M, Granite 3.0 1B-A400M
  • LLM-based input-output guardrail models: Granite Guardian 8B, Granite Guardian 2B

Core components of the Granite architecture are grouped-query attention (GQA), rotary position embeddings (RoPE) for positional information, a multilayer perceptron (MLP) with SwiGLU activation, RMSNorm, and shared input/output embeddings. 
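To make these components concrete, the following is a minimal PyTorch sketch of a Granite-style decoder block combining grouped-query attention, RMSNorm, and a SwiGLU MLP. The dimensions and head counts are illustrative placeholders, not the published Granite 3.0 configuration, and the RoPE rotation is noted but omitted for brevity.

```python
# Minimal sketch of a Granite-style decoder block (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """MLP with SwiGLU activation: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class GraniteStyleBlock(nn.Module):
    """Decoder block: RMSNorm -> GQA attention -> RMSNorm -> SwiGLU MLP."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Grouped-query attention: fewer key/value heads than query heads.
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, 4 * dim)

    def forward(self, x):
        b, t, d = x.shape
        h = self.attn_norm(x)
        q = self.q(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(h).chunk(2, dim=-1)
        k = k.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        # (RoPE would be applied to q and k here; omitted for brevity.)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.o(attn)
        return x + self.mlp(self.mlp_norm(x))
```

GQA is what makes the block memory-friendly at inference time: the KV cache stores only `n_kv_heads` heads instead of `n_heads`, which is a 4x reduction with the placeholder sizes above.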

Optimized performance with speculative decoding

Trained on over 12 trillion tokens of carefully curated enterprise data, the new 8B and 2B models demonstrate significant improvements over their predecessors in both performance and speed. 

Speculative decoding is an optimization technique for accelerating model inference, helping LLMs generate text faster while using the same (or fewer) compute resources and allowing more users to query a model at the same time. For example, in a recent IBM Research breakthrough, speculative decoding was used to cut the latency of Granite Code 20B in half while quadrupling its throughput.

In standard inference, an LLM attends over all of the tokens it has generated so far and then produces exactly one new token per forward pass. In speculative decoding, the LLM also evaluates several prospective tokens that might come after the token it is about to generate. If these "speculated" tokens are verified as sufficiently accurate, one pass can produce two or more tokens for the computational "price" of one.
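The following is a simplified, greedy sketch of that draft-and-verify loop, assuming Hugging Face transformers. The model names are placeholders, and IBM's production implementation (which pairs Granite with a trained speculator) differs in detail.

```python
# Simplified greedy speculative decoding: a small draft model proposes
# k tokens, and one forward pass of the large target model verifies them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_name, target_name = "small-draft-model", "large-target-model"  # placeholders
tok = AutoTokenizer.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(target_name)

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Draft k tokens cheaply, then verify them with one target pass."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal = draft.generate(ids, max_new_tokens=k, do_sample=False)
    drafted = proposal[:, ids.shape[1]:]
    # 2. One target forward pass scores every drafted position at once.
    logits = target(proposal).logits
    # The target's greedy choice at each drafted position.
    preds = logits[:, ids.shape[1] - 1 : -1].argmax(dim=-1)
    # 3. Accept drafted tokens up to the first disagreement.
    matches = (preds == drafted).cumprod(dim=-1)
    n_accept = int(matches.sum())
    accepted = drafted[:, :n_accept]
    # The same pass also yields one "free" token after the accepted run.
    bonus = logits[:, ids.shape[1] - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([ids, accepted, bonus], dim=-1)
```

Each call to `speculative_step` costs one target-model forward pass but can emit up to k + 1 tokens, which is where the latency savings come from; with greedy verification, the output is identical to what the target model would have generated on its own.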

| Benchmark Metric | Mistral 7B | Llama-3.1 8B | Granite-3.0 8B |
|---|---|---|---|
| IFEval 0-shot | 49.93 | 50.37 | 52.27 |
| MT-Bench | 7.62 | 8.21 | 8.22 |
| AGI-Eval 5-shot | 37.15 | 41.07 | 40.52 |
| MMLU 5-shot | 62.01 | 68.27 | 65.82 |
| MMLU-Pro 5-shot | 30.34 | 37.97 | 34.45 |
| OBQA 0-shot | 47.40 | 43.00 | 46.60 |
| SIQA 0-shot | 59.64 | 65.01 | 71.21 |
| Hellaswag 10-shot | 84.61 | 80.12 | 82.61 |
| WinoGrande 5-shot | 78.85 | 78.37 | 77.51 |
| TruthfulQA 0-shot | 59.68 | 54.07 | 60.32 |
| BoolQ 5-shot | 87.34 | 87.25 | 88.65 |
| SQuAD 2.0 0-shot | 18.66 | 21.49 | 21.58 |
| ARC-C 25-shot | 63.65 | 60.67 | 64.16 |
| GPQA 0-shot | 30.45 | 32.13 | 33.81 |
| BBH 3-shot | 46.73 | 50.81 | 51.55 |
| HumanEvalSynthesis pass@1 | 34.76 | 63.41 | 64.63 |
| HumanEvalExplain pass@1 | 21.65 | 45.88 | 57.16 |
| HumanEvalFix pass@1 | 53.05 | 68.90 | 65.85 |
| MBPP pass@1 | 38.60 | 52.20 | 49.60 |
| GSM8k 5-shot, CoT | 37.68 | 65.04 | 68.99 |
| MATH 4-shot | 13.10 | 34.46 | 30.94 |
| PAWS-X (7 langs) 0-shot | 56.57 | 64.68 | 64.94 |
| MGSM (6 langs) 5-shot | 35.27 | 43.00 | 48.20 |
| Average All | 45.86 | 52.87 | 54.33 |
| Open LLM Leaderboard 1 | 65.54 | 68.58 | 69.04 |
| Open LLM Leaderboard 2 | 34.61 | 37.28 | 37.56 |
| LiveBench | 22.40 | 27.60 | 26.20 |
| MixEval | 73.55 | 73.35 | 76.5 |

Table 1. Accuracy performance of the IBM Granite-3.0 8B Instruct model across popular benchmarks, compared to other foundational LLMs.

Granite 3.0 8B Instruct kept pace with Mistral and Llama models on RAGBench, a benchmarking dataset consisting of 100,000 retrieval-augmented generation (RAG) tasks drawn from industry corpora such as user manuals.

IBM Granite’s first MoE models

IBM Granite Generation 3 also includes Granite's first mixture of experts (MoE) models: Granite-3B-A800M-Instruct and Granite-1B-A400M-Instruct. Trained on over 10 trillion tokens of data, the Granite MoE models are ideal for deployment in on-device applications or situations requiring extremely low latency.

In this architecture, the MLP layers used by the dense models are replaced with MoE layers. Core components of the Granite MoE architecture are fine-grained experts; dropless token routing, which guarantees that no input token is dropped by the MoE router regardless of the load imbalance among experts; and a load-balancing loss that maintains an even distribution of tokens across experts. 
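As a concrete illustration, here is a minimal PyTorch sketch of an MoE layer with top-k routing and a Switch-style auxiliary load-balancing loss. The expert count, expert size, and exact loss form are assumptions for illustration, not Granite's published configuration.

```python
# Illustrative MoE layer: fine-grained experts, dropless top-k routing,
# and an auxiliary load-balancing loss (sizes are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim=512, n_experts=32, top_k=4, hidden=256):
        super().__init__()
        # "Fine-grained" experts: many small MLPs instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                       # flatten to (tokens, dim)
        probs = F.softmax(self.router(tokens), dim=-1)  # routing distribution
        weights, idx = probs.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(tokens)
        # Dropless routing: every token is processed by all of its selected
        # experts; there is no capacity limit, so no token is ever dropped.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(tokens[token_ids])
        # Load-balancing loss: combine each expert's assignment fraction with
        # its mean router probability, penalizing uneven expert usage.
        counts = torch.bincount(idx.flatten(), minlength=len(self.experts)).float()
        frac_tokens = counts / idx.numel()
        mean_prob = probs.mean(dim=0)
        aux_loss = (frac_tokens * mean_prob).sum() * len(self.experts)
        return out.reshape(b, t, d), aux_loss
```

During training, `aux_loss` would be added to the language-modeling loss with a small weight; at inference time only the routed forward pass runs, so compute scales with active parameters rather than total parameters.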

| Benchmark Metric | Llama-3.2 1B | SmolLM 1.7B | Granite-3.0 3B-A800M |
|---|---|---|---|
| Active parameters | 1B | 1.7B | 800M |
| Total parameters | 1B | 1.7B | 3B |
| Instruction Following | | | |
| IFEval 0-shot | 41.68 | 9.20 | 42.49 |
| MT-Bench | 5.78 | 4.82 | 7.02 |
| Human Exams | | | |
| AGI-Eval 5-shot | 19.63 | 19.50 | 25.70 |
| MMLU 5-shot | 45.40 | 28.47 | 50.16 |
| MMLU-Pro 5-shot | 19.52 | 11.13 | 20.51 |
| Commonsense | | | |
| OBQA 0-shot | 34.60 | 39.40 | 40.80 |
| SIQA 0-shot | 35.50 | 34.26 | 59.95 |
| Hellaswag 10-shot | 59.74 | 62.61 | 71.86 |
| WinoGrande 5-shot | 61.01 | 58.17 | 67.01 |
| TruthfulQA 0-shot | 43.83 | 39.73 | 48.00 |
| Reading Comprehension | | | |
| BoolQ 5-shot | 66.73 | 69.97 | 78.65 |
| SQuAD 2.0 0-shot | 16.50 | 19.80 | 6.71 |
| Reasoning | | | |
| ARC-C 25-shot | 41.38 | 45.56 | 50.94 |
| GPQA 0-shot | 25.67 | 25.42 | 26.85 |
| BBH 3-shot | 33.54 | 30.69 | 37.70 |
| Code | | | |
| HumanEvalSynthesis pass@1 | 35.98 | 18.90 | 39.63 |
| HumanEvalExplain pass@1 | 21.49 | 6.25 | 40.85 |
| HumanEvalFix pass@1 | 36.62 | 3.05 | 35.98 |
| MBPP | 37.00 | 25.20 | 27.40 |
| Math | | | |
| GSM8k 5-shot, CoT | 26.16 | 0.61 | 47.54 |
| MATH 4-shot | 17.62 | 0.14 | 19.86 |
| Multilingual | | | |
| PAWS-X (7 langs) 0-shot | 34.44 | 17.86 | 50.23 |
| MGSM (6 langs) 5-shot | 23.80 | 0.07 | 28.87 |
| Average All | 34.07 | 24.82 | 40.20 |
| Open Leaderboards | | | |
| Open LLM Leaderboard 1 | 47.36 | 39.87 | 55.83 |
| Open LLM Leaderboard 2 | 26.50 | 18.30 | 27.79 |
| LiveBench | 11.60 | 3.40 | 16.8 |

Table 2. Accuracy performance of the IBM Granite-3.0 3B-A800M MoE model compared to other small foundational LLMs.

Granite Guardian: leading safety guardrails

The new Granite Guardian 3.0 8B and Granite Guardian 3.0 2B are variants of their correspondingly sized base pre-trained Granite models, fine-tuned to evaluate and classify model inputs and outputs across various risk and harm dimensions, including jailbreaking, bias, violence, profanity, sexual content, and unethical behavior.

The Granite Guardian 3.0 models also cover a range of RAG-specific concerns, evaluating for qualities like groundedness (measuring the degree to which an output is supported by the retrieved documents), context relevance (gauging whether the retrieved documents are germane to the input prompt), and answer relevance. 
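The typical pattern is to run the guardrail model over each prompt/response pair before it reaches the user. The sketch below assumes Hugging Face transformers; the model ID, chat-template behavior, and yes/no label parsing are assumptions for illustration, so consult the Granite Guardian model card for the exact prompt format and risk taxonomy.

```python
# Hedged sketch: using a Granite Guardian model as an input/output filter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.0-2b"  # assumed Hugging Face ID
tok = AutoTokenizer.from_pretrained(model_id)
guardian = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def is_risky(user_prompt: str, assistant_reply: str) -> bool:
    """Ask the guardrail model to classify a prompt/response pair."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_reply},
    ]
    # The real template carries a risk definition (e.g., harm, jailbreak);
    # apply_chat_template formats the pair as the model expects.
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    out = guardian.generate(inputs, max_new_tokens=5)
    verdict = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("yes")  # assumed label format
```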

The model family is developer-friendly, offered under the Apache 2.0 license and accompanied by new developer recipes available in IBM’s Granite Community on GitHub. 

Deploy Granite models anywhere with NVIDIA NIM

NVIDIA has partnered with IBM to offer the Granite family of models through NVIDIA NIM, a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI inference across clouds, data centers, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.

NVIDIA NIM delivers best-in-class throughput, enabling enterprises to generate more tokens, faster. For generative AI applications, token throughput is the key performance metric, and increased throughput translates directly to higher revenue for enterprises and a better user experience.

Get started

Experience the Granite models with free NVIDIA cloud credits. You can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack. 
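NIM endpoints expose an OpenAI-compatible API, so a proof of concept can be a few lines of Python. In this sketch, the base URL and model ID follow NVIDIA's hosted-API conventions but should be treated as assumptions; check build.nvidia.com for the exact model identifier and to generate an API key.

```python
# Minimal sketch of calling the NVIDIA-hosted Granite endpoint through the
# OpenAI-compatible API that NIM exposes.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key="$NVIDIA_API_KEY",                       # replace with your key
)

completion = client.chat.completions.create(
    model="ibm/granite-3.0-8b-instruct",             # assumed model ID
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
    temperature=0.2,
    max_tokens=512,
    stream=True,
)

# Stream tokens as they arrive, as a chat UI would.
for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

Because the API is OpenAI-compatible, the same code works against a self-hosted NIM container by pointing `base_url` at your own deployment.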

Visit the documentation page to download the models and deploy on any NVIDIA GPU-accelerated workstation, data center, or cloud platform.
