IBM’s New Granite 3.0 Generative AI Models Are Small, Yet Highly Accurate and Efficient

Today, IBM released the third generation of IBM Granite, a collection of open language models and complementary tools. Prior generations of Granite focused on domain-specific use cases; the latest IBM Granite models meet or exceed the performance of leading similarly sized open models across both academic and enterprise benchmarks. 

The developer-friendly Granite 3.0 generative AI models are designed for function calling, supporting tool-based use cases. They were developed as workhorse enterprise models capable of serving as the primary building block of sophisticated workflows across use cases including text generation, agentic AI, classification, tool calling, summarization, entity extraction, customer service chatbots, and more. 

Introducing IBM’s Granite Generation 3 family

IBM developed the Granite series, available as an NVIDIA NIM microservice, for enterprise use, prioritizing industry-leading trust, safety and cost efficiency without compromising performance. 

In its entirety, the Granite 3.0 release comprises:

  • Dense, text-only LLMs: Granite 3.0 8B, Granite 3.0 2B
  • Mixture of Experts (MoE) LLMs: Granite 3.0 3B-A800M, Granite 3.0 1B-A400M
  • LLM-based input-output guardrail models: Granite Guardian 8B, Granite Guardian 2B

Core components of the Granite architecture are grouped-query attention (GQA), rotary position embeddings (RoPE) for positional information, a multilayer perceptron (MLP) with SwiGLU activation, RMSNorm, and shared input/output embeddings. 
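To make these components concrete, the following is a minimal PyTorch sketch of a Granite-style decoder block combining grouped-query attention, RMSNorm, and a SwiGLU MLP. The dimensions and head counts are illustrative placeholders, not the published Granite 3.0 configuration, and the RoPE rotation is noted but omitted for brevity.

```python
# Minimal sketch of a Granite-style decoder block (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """MLP with SwiGLU activation: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class GraniteStyleBlock(nn.Module):
    """Decoder block: RMSNorm -> GQA attention -> RMSNorm -> SwiGLU MLP."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Grouped-query attention: fewer key/value heads than query heads.
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, 4 * dim)

    def forward(self, x):
        b, t, d = x.shape
        h = self.attn_norm(x)
        q = self.q(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(h).chunk(2, dim=-1)
        k = k.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        # (RoPE would be applied to q and k here; omitted for brevity.)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.o(attn)
        return x + self.mlp(self.mlp_norm(x))
```

GQA is what makes the block memory-friendly at inference time: the KV cache stores only `n_kv_heads` heads instead of `n_heads`, which is a 4x reduction with the placeholder sizes above.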

Optimized performance with speculative decoding

Trained on over 12 trillion tokens of carefully curated enterprise data, the new 8B and 2B models demonstrate significant improvements over their predecessors in both performance and speed. 

Speculative decoding is an optimization technique for accelerating model inference, helping LLMs generate text faster while using the same (or fewer) compute resources and allowing more users to query a model at the same time. For example, in a recent IBM Research breakthrough, speculative decoding was used to cut the latency of Granite Code 20B in half while quadrupling its throughput.

In standard inference, an LLM attends over all of the tokens it has generated so far and then produces exactly one new token per forward pass. In speculative decoding, the LLM also evaluates several prospective tokens that might come after the token it is about to generate. If these "speculated" tokens are verified as sufficiently accurate, one pass can produce two or more tokens for the computational "price" of one.
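The following is a simplified, greedy sketch of that draft-and-verify loop, assuming Hugging Face transformers. The model names are placeholders, and IBM's production implementation (which pairs Granite with a trained speculator) differs in detail.

```python
# Simplified greedy speculative decoding: a small draft model proposes
# k tokens, and one forward pass of the large target model verifies them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_name, target_name = "small-draft-model", "large-target-model"  # placeholders
tok = AutoTokenizer.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(target_name)

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Draft k tokens cheaply, then verify them with one target pass."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal = draft.generate(ids, max_new_tokens=k, do_sample=False)
    drafted = proposal[:, ids.shape[1]:]
    # 2. One target forward pass scores every drafted position at once.
    logits = target(proposal).logits
    # The target's greedy choice at each drafted position.
    preds = logits[:, ids.shape[1] - 1 : -1].argmax(dim=-1)
    # 3. Accept drafted tokens up to the first disagreement.
    matches = (preds == drafted).cumprod(dim=-1)
    n_accept = int(matches.sum())
    accepted = drafted[:, :n_accept]
    # The same pass also yields one "free" token after the accepted run.
    bonus = logits[:, ids.shape[1] - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([ids, accepted, bonus], dim=-1)
```

Each call to `speculative_step` costs one target-model forward pass but can emit up to k + 1 tokens, which is where the latency savings come from; with greedy verification, the output is identical to what the target model would have generated on its own.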

| Benchmark Metric | Mistral 7B | Llama-3.1 8B | Granite-3.0 8B |
|---|---|---|---|
| IFEval 0-shot | 49.93 | 50.37 | 52.27 |
| MT-Bench | 7.62 | 8.21 | 8.22 |
| AGI-Eval 5-shot | 37.15 | 41.07 | 40.52 |
| MMLU 5-shot | 62.01 | 68.27 | 65.82 |
| MMLU-Pro 5-shot | 30.34 | 37.97 | 34.45 |
| OBQA 0-shot | 47.40 | 43.00 | 46.60 |
| SIQA 0-shot | 59.64 | 65.01 | 71.21 |
| Hellaswag 10-shot | 84.61 | 80.12 | 82.61 |
| WinoGrande 5-shot | 78.85 | 78.37 | 77.51 |
| TruthfulQA 0-shot | 59.68 | 54.07 | 60.32 |
| BoolQ 5-shot | 87.34 | 87.25 | 88.65 |
| SQuAD 2.0 0-shot | 18.66 | 21.49 | 21.58 |
| ARC-C 25-shot | 63.65 | 60.67 | 64.16 |
| GPQA 0-shot | 30.45 | 32.13 | 33.81 |
| BBH 3-shot | 46.73 | 50.81 | 51.55 |
| HumanEvalSynthesis pass@1 | 34.76 | 63.41 | 64.63 |
| HumanEvalExplain pass@1 | 21.65 | 45.88 | 57.16 |
| HumanEvalFix pass@1 | 53.05 | 68.90 | 65.85 |
| MBPP pass@1 | 38.60 | 52.20 | 49.60 |
| GSM8k 5-shot, CoT | 37.68 | 65.04 | 68.99 |
| MATH 4-shot | 13.10 | 34.46 | 30.94 |
| PAWS-X (7 langs) 0-shot | 56.57 | 64.68 | 64.94 |
| MGSM (6 langs) 5-shot | 35.27 | 43.00 | 48.20 |
| Average All | 45.86 | 52.87 | 54.33 |
| Open LLM Leaderboard 1 | 65.54 | 68.58 | 69.04 |
| Open LLM Leaderboard 2 | 34.61 | 37.28 | 37.56 |
| LiveBench | 22.40 | 27.60 | 26.20 |
| MixEval | 73.55 | 73.35 | 76.5 |

Table 1. Accuracy performance of the IBM Granite-3.0 8B Instruct model across popular benchmarks, compared to other foundational LLMs.

Granite 3.0 8B Instruct kept pace with Mistral and Llama models on RAGBench, a benchmarking dataset consisting of 100,000 retrieval-augmented generation (RAG) tasks drawn from industry corpora such as user manuals.

IBM Granite’s first MoE models

IBM Granite Generation 3 also includes Granite's first mixture of experts (MoE) models: Granite-3B-A800M-Instruct and Granite-1B-A400M-Instruct. Trained on over 10 trillion tokens of data, the Granite MoE models are ideal for deployment in on-device applications or situations requiring extremely low latency.

In this architecture, the MLP layers used by the dense models are replaced with MoE layers. Core components of the Granite MoE architecture are fine-grained experts; dropless token routing, which guarantees that no input token is dropped by the MoE router regardless of the load imbalance among experts; and a load-balancing loss that maintains an even distribution of tokens across experts. 
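As a concrete illustration, here is a minimal PyTorch sketch of an MoE layer with top-k routing and a Switch-style auxiliary load-balancing loss. The expert count, expert size, and exact loss form are assumptions for illustration, not Granite's published configuration.

```python
# Illustrative MoE layer: fine-grained experts, dropless top-k routing,
# and an auxiliary load-balancing loss (sizes are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim=512, n_experts=32, top_k=4, hidden=256):
        super().__init__()
        # "Fine-grained" experts: many small MLPs instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                       # flatten to (tokens, dim)
        probs = F.softmax(self.router(tokens), dim=-1)  # routing distribution
        weights, idx = probs.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(tokens)
        # Dropless routing: every token is processed by all of its selected
        # experts; there is no capacity limit, so no token is ever dropped.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(tokens[token_ids])
        # Load-balancing loss: combine each expert's assignment fraction with
        # its mean router probability, penalizing uneven expert usage.
        counts = torch.bincount(idx.flatten(), minlength=len(self.experts)).float()
        frac_tokens = counts / idx.numel()
        mean_prob = probs.mean(dim=0)
        aux_loss = (frac_tokens * mean_prob).sum() * len(self.experts)
        return out.reshape(b, t, d), aux_loss
```

During training, `aux_loss` would be added to the language-modeling loss with a small weight; at inference time only the routed forward pass runs, so compute scales with active parameters rather than total parameters.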

| Benchmark Metric | Llama-3.2 1B | SmolLM 1.7B | Granite-3.0 3B-A800M |
|---|---|---|---|
| Active parameters | 1B | 1.7B | 800M |
| Total parameters | 1B | 1.7B | 3B |
| Instruction Following | | | |
| IFEval 0-shot | 41.68 | 9.20 | 42.49 |
| MT-Bench | 5.78 | 4.82 | 7.02 |
| Human Exams | | | |
| AGI-Eval 5-shot | 19.63 | 19.50 | 25.70 |
| MMLU 5-shot | 45.40 | 28.47 | 50.16 |
| MMLU-Pro 5-shot | 19.52 | 11.13 | 20.51 |
| Commonsense | | | |
| OBQA 0-shot | 34.60 | 39.40 | 40.80 |
| SIQA 0-shot | 35.50 | 34.26 | 59.95 |
| Hellaswag 10-shot | 59.74 | 62.61 | 71.86 |
| WinoGrande 5-shot | 61.01 | 58.17 | 67.01 |
| TruthfulQA 0-shot | 43.83 | 39.73 | 48.00 |
| Reading Comprehension | | | |
| BoolQ 5-shot | 66.73 | 69.97 | 78.65 |
| SQuAD 2.0 0-shot | 16.50 | 19.80 | 6.71 |
| Reasoning | | | |
| ARC-C 25-shot | 41.38 | 45.56 | 50.94 |
| GPQA 0-shot | 25.67 | 25.42 | 26.85 |
| BBH 3-shot | 33.54 | 30.69 | 37.70 |
| Code | | | |
| HumanEvalSynthesis pass@1 | 35.98 | 18.90 | 39.63 |
| HumanEvalExplain pass@1 | 21.49 | 6.25 | 40.85 |
| HumanEvalFix pass@1 | 36.62 | 3.05 | 35.98 |
| MBPP | 37.00 | 25.20 | 27.40 |
| Math | | | |
| GSM8k 5-shot, CoT | 26.16 | 0.61 | 47.54 |
| MATH 4-shot | 17.62 | 0.14 | 19.86 |
| Multilingual | | | |
| PAWS-X (7 langs) 0-shot | 34.44 | 17.86 | 50.23 |
| MGSM (6 langs) 5-shot | 23.80 | 0.07 | 28.87 |
| Average All | 34.07 | 24.82 | 40.20 |
| Open Leaderboards | | | |
| Open LLM Leaderboard 1 | 47.36 | 39.87 | 55.83 |
| Open LLM Leaderboard 2 | 26.50 | 18.30 | 27.79 |
| LiveBench | 11.60 | 3.40 | 16.8 |

Table 2. Accuracy performance of the IBM Granite-3.0 3B-A800M MoE model compared to other small foundational LLMs.

Granite Guardian: leading safety guardrails

The new Granite Guardian 3.0 8B and Granite Guardian 3.0 2B are variants of their correspondingly sized base pre-trained Granite models, fine-tuned to evaluate and classify model inputs and outputs across various risk and harm dimensions, including jailbreaking, bias, violence, profanity, sexual content, and unethical behavior.

The Granite Guardian 3.0 models also cover a range of RAG-specific concerns, evaluating for qualities like groundedness (measuring the degree to which an output is supported by the retrieved documents), context relevance (gauging whether the retrieved documents are germane to the input prompt), and answer relevance. 
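The typical pattern is to run the guardrail model over each prompt/response pair before it reaches the user. The sketch below assumes Hugging Face transformers; the model ID, chat-template behavior, and yes/no label parsing are assumptions for illustration, so consult the Granite Guardian model card for the exact prompt format and risk taxonomy.

```python
# Hedged sketch: using a Granite Guardian model as an input/output filter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.0-2b"  # assumed Hugging Face ID
tok = AutoTokenizer.from_pretrained(model_id)
guardian = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def is_risky(user_prompt: str, assistant_reply: str) -> bool:
    """Ask the guardrail model to classify a prompt/response pair."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_reply},
    ]
    # The real template carries a risk definition (e.g., harm, jailbreak);
    # apply_chat_template formats the pair as the model expects.
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    out = guardian.generate(inputs, max_new_tokens=5)
    verdict = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("yes")  # assumed label format
```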

The model family is developer-friendly, offered under the Apache 2.0 license and accompanied by new developer recipes available in IBM’s Granite Community on GitHub. 

Deploy Granite models anywhere with NVIDIA NIM

NVIDIA has partnered with IBM to offer the Granite family of models through NVIDIA NIM, a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI inference across clouds, data centers, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.

NVIDIA NIM delivers best-in-class throughput, enabling enterprises to generate more tokens, faster. For generative AI applications, token throughput is the key performance metric, and increased throughput translates directly to higher revenue for enterprises and a better user experience.

Get started

Experience the Granite models with free NVIDIA cloud credits. You can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack. 
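NIM endpoints expose an OpenAI-compatible API, so a proof of concept can be a few lines of Python. In this sketch, the base URL and model ID follow NVIDIA's hosted-API conventions but should be treated as assumptions; check build.nvidia.com for the exact model identifier and to generate an API key.

```python
# Minimal sketch of calling the NVIDIA-hosted Granite endpoint through the
# OpenAI-compatible API that NIM exposes.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key="$NVIDIA_API_KEY",                       # replace with your key
)

completion = client.chat.completions.create(
    model="ibm/granite-3.0-8b-instruct",             # assumed model ID
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
    temperature=0.2,
    max_tokens=512,
    stream=True,
)

# Stream tokens as they arrive, as a chat UI would.
for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

Because the API is OpenAI-compatible, the same code works against a self-hosted NIM container by pointing `base_url` at your own deployment.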

Visit the documentation page to download the models and deploy on any NVIDIA GPU-accelerated workstation, data center, or cloud platform.
