
NVIDIA Speech AI Models Deliver Industry-Leading Accuracy and Performance

Jun 04, 2025

By Adi Margolin, Maryam Motamedi and Nithin Rao Koluguri



NVIDIA is driving state-of-the-art performance, efficiency, and accessibility in both speech AI and language models, setting the stage for innovations that are redefining what’s possible in automatic speech recognition (ASR).

NVIDIA Parakeet TDT 0.6B v2 is a 600-million-parameter ASR model designed for high-quality English transcription. It is currently ranked #1 on the Hugging Face ASR leaderboard, alongside four other top-ranking NVIDIA Parakeet models. NVIDIA NeMo Canary models also rank near the top of the same leaderboard.

This post explores how these and other NVIDIA speech AI models are setting new benchmarks for ASR accuracy, speed, and versatility. We review model highlights, leaderboard performance, and practical deployment options so you can use these models in real-world applications.

Overview of NVIDIA speech AI models

The NVIDIA Parakeet and Canary AI model families are part of NVIDIA Riva, a collection of GPU-accelerated multilingual speech and translation microservices for building fully customizable, real-time conversational AI pipelines.

Riva speech models typically begin as research prototypes and move from experimentation to scalable, high-performance deployment: they undergo performance tuning and are packaged as NVIDIA NIM microservices, which can then be deployed with Riva for real-world applications. Although this path is structured, the decision to promote a model to a NIM microservice often depends on real-world demand and on how the model performs in the broader developer community.

To learn more, check out the recent interview with Joey Conway, senior director of product management, generative AI software at NVIDIA.

NVIDIA Parakeet v2 model highlights

With an industry-best 6.05% word error rate (WER), Parakeet v2 pairs top accuracy with very fast inference: an inverse real-time factor (RTFx) of 3386.02, or 50x faster than alternatives. It also offers capabilities such as accurate timestamps and song-to-lyrics transcription. Parakeet v2 is open source and available for commercial use.
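To make the two headline metrics concrete, here is a minimal sketch of how WER and RTFx are commonly computed. The function names `word_error_rate` and `rtfx` are illustrative and not part of any NVIDIA API. Leaderboards typically normalize text (casing, punctuation) before scoring, and that step is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # One hour of audio processed in two seconds of compute.
    print(rtfx(3600.0, 2.0))
```

Read this way, an RTFx of 3386.02 means one hour of audio would take roughly 3600 / 3386.02 ≈ 1.06 seconds of compute on the benchmark hardware.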

Where other ASR models struggle to balance speed, accuracy, and specialized use cases, Parakeet v2 delivers all of these, making it the go-to choice for developers who demand both cutting-edge performance and versatility.

Video 1. An example of a song-to-lyric transcription created using NVIDIA Parakeet v2

NVIDIA NeMo Canary model highlights

NVIDIA NeMo Canary models are also topping the Hugging Face ASR leaderboard. NVIDIA NeMo Canary 1B and NVIDIA NeMo Canary 1B Flash, currently ranking #4 and #3 respectively, stand out for their strong multilingual performance and rapid inference. The models rank among the best for speech recognition and translation in several major languages.

NVIDIA speech AI model details and use cases

The latest NVIDIA speech AI models are designed to deliver where it matters most. The Recurrent Neural Network Transducer (RNNT) multilingual model supports global reach with 25 languages, making it easy to connect with teams and customers anywhere.

For scenarios with background noise, such as transcription in hospitals, airports, and virtually any other busy, noisy place, built-in Silero VAD helps maintain accurate output. Parakeet v2, the model with the lowest WER, delivers fast, precise results and adds advanced features like music transcription.
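To illustrate what a voice activity detection stage does in a pipeline, the sketch below flags the time spans that contain signal so that only those spans need to reach the recognizer. This is not Silero VAD, which is a neural model. It is a much simpler energy-threshold detector, and the function names, frame size, and threshold are assumptions chosen for illustration.

```python
import math


def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return the RMS energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) spans whose frame energy meets the threshold.

    Hypothetical helper: a real VAD such as Silero uses a trained model,
    not a fixed energy threshold.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    segments = []
    seg_start = None
    for idx, energy in enumerate(energies):
        t = idx * frame_len / sample_rate  # start time of this frame in seconds
        if energy >= threshold and seg_start is None:
            seg_start = t                    # speech begins
        elif energy < threshold and seg_start is not None:
            segments.append((seg_start, t))  # speech ends
            seg_start = None
    if seg_start is not None:
        # Audio ended while still inside a speech span.
        segments.append((seg_start, len(energies) * frame_len / sample_rate))
    return segments


if __name__ == "__main__":
    # Synthetic signal: 0.2 s silence, 0.3 s of constant-amplitude "speech", 0.2 s silence.
    audio = [0.0] * 200 + [0.5] * 300 + [0.0] * 200
    print(speech_segments(audio, sample_rate=1000, frame_ms=100))
```

A fixed threshold like this breaks down quickly in loud environments, which is the gap a learned detector is meant to close.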

For teams that want ready-to-deploy solutions, NVIDIA offers a lineup of fully supported Riva NIM microservices. These include:

Parakeet RNNT 1.1B for accurate, multilingual transcription with punctuation in 25 languages
Parakeet CTC 1.1B for fast, low-latency results, with support for Silero Voice Activity Detector (VAD), a lightweight, open source deep learning model designed to improve noise robustness
Parakeet CTC 0.6B, a 600-million-parameter English model trained on over 35,000 hours of speech for crisp, natural text

Video 2. Learn more about the leaderboard-topping NVIDIA Parakeet TDT 0.6B v2 model in this short interview

NVIDIA speech models are easy to deploy and enterprise-ready: Riva models are available through NVIDIA AI Enterprise, NVIDIA NGC, and as NVIDIA NIM microservices. The latest research models can be accessed on Hugging Face.

Model name	Architecture	Languages	Key features	Sample use cases
Parakeet TDT 0.6B v2	FastConformer-TDT	English (en-US)	Industry-best WER; ultra-fast; word-level timestamps; song lyrics; punctuation	Media and entertainment; edge and IoT
Parakeet RNNT 1.1B	FastConformer-RNNT	25 languages	Universal tokenizer; punctuation-aware; NVIDIA NIM	Global customer support; multilingual transcription
Parakeet CTC 1.1B (Silero VAD, optional)	FastConformer-CTC	English (en-US)	High-speed ASR; noise robust; Silero VAD; high throughput; low latency	Virtual assistants and enterprise voice apps; noisy environments (hospitals, airports, drive-through kiosks)
Parakeet CTC 0.6B	FastConformer-CTC	English (en-US), Spanish (es-US)	High-speed ASR; trained on ASRSet and 35K+ hours of English (en-US) speech; lowercase output with spaces and apostrophes; fast inference	Clear dictation needs (in healthcare and finance, for example); media; edge devices

Table 1. An overview of NVIDIA Parakeet models

Get started with NVIDIA speech AI models

With continuous innovation and new releases, NVIDIA Parakeet ASR models are setting the pace for speech recognition, delivering global language coverage, robust noise handling, and industry-leading speed and accuracy. Whether you’re building enterprise voice solutions, powering multilingual customer support, or developing next-generation media applications, Parakeet models provide the tools to make your products stand out with clarity and intelligence.

To get started, download NVIDIA Parakeet v2 and experience the NVIDIA Riva speech NIM. For technical details, deployment guides, and more, visit the NGC Catalog.


About the Authors

About Adi Margolin
Adi Margolin is the product manager for Riva SDK and Speech NIM. With 16 years of product management experience, Adi has built high-impact speech technology solutions across enterprise software companies, developing expertise in bringing ASR and TTS innovations to market. Based in San Jose, Adi brings a unique perspective to speech AI development, having successfully navigated the transition from legacy systems to modern AI-driven platforms while addressing complex requirements of real-time media applications.


About Maryam Motamedi
Maryam Motamedi is a product marketing lead for AI software at NVIDIA. She brings decades of cross-industry experience in media/AdTech, streaming, retail, and telecom. Maryam specializes in translating cutting-edge technology into real-world solutions, helping developers and enterprises build AI-powered applications that redefine how we connect, work, and interact.


About Nithin Rao Koluguri
Nithin Rao Koluguri is a senior research scientist on the NVIDIA Conversational AI team, focusing on advancing speech and speaker recognition models. As a key contributor to the NVIDIA NeMo toolkit, he plays a vital role in enhancing features for conversational AI model development.
