NVIDIA Speech AI Models Deliver Industry-Leading Accuracy and Performance

NVIDIA is driving state-of-the-art performance, efficiency, and accessibility in both speech AI and language models, setting the stage for innovations that are redefining what’s possible in automatic speech recognition (ASR). 

NVIDIA Parakeet TDT 0.6B v2 is a 600-million-parameter ASR model designed for high-quality English transcription. It is currently ranked #1 on the Hugging Face Open ASR Leaderboard, alongside four other top-ranking NVIDIA Parakeet models. NVIDIA NeMo Canary models also place near the top of the same leaderboard.

This post explores how these and other NVIDIA speech AI models are setting new benchmarks for accuracy, speed, and versatility in ASR. It reviews model highlights, leaderboard performance, and practical deployment options so you can use these models in real-world applications.

Overview of NVIDIA speech AI models

The NVIDIA Parakeet and Canary AI model families are part of NVIDIA Riva, a collection of GPU-accelerated multilingual speech and translation microservices for building fully customizable, real-time conversational AI pipelines. 

Riva speech models typically begin as research prototypes, undergoing a journey from experimentation to scalable, high-performance deployments. While the journey from research to deployment follows a structured path, the decision to progress a model to an NVIDIA NIM microservice often depends on real-world demand and how the model performs in the broader developer community.

In practice, a research prototype is performance-tuned and then packaged as a NIM microservice, which can be deployed through Riva for scalable, real-world applications. To learn more, check out the recent interview with Joey Conway, senior director of product management for generative AI software at NVIDIA.

NVIDIA Parakeet v2 model highlights

With an industry-best 6.05% word error rate (WER), Parakeet v2 pairs high accuracy with fast inference: an inverse real-time factor (RTFx) of 3386.02, or 50x faster than alternatives. RTFx is the number of seconds of audio processed per second of compute, so higher is better. Parakeet v2 also adds capabilities such as accurate timestamps and song-to-lyrics transcription. These models are open source and available for commercial use.
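For readers who want to check these two metrics on their own data, here is a minimal sketch of how WER and RTFx are commonly computed. The function names are illustrative, not part of any NVIDIA library. The sketch assumes whitespace tokenization with no text normalization, whereas leaderboards typically normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx: seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds
```

Under this definition, an RTFx of 3386.02 corresponds to transcribing one hour of audio (3,600 seconds) in roughly 1.06 seconds of processing time.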

Where other ASR models struggle to balance speed, accuracy, and specialized use cases, Parakeet v2 delivers all of these, making it the go-to choice for developers who demand both cutting-edge performance and versatility. 

Video 1. An example of a song-to-lyric transcription created using NVIDIA Parakeet v2

NVIDIA NeMo Canary model highlights

NVIDIA NeMo Canary models are also topping the Hugging Face ASR leaderboard. NVIDIA NeMo Canary 1B and NVIDIA NeMo Canary 1B Flash, currently ranking #4 and #3 respectively, stand out for their strong multilingual performance and rapid inference. The models rank among the best for speech recognition and translation in several major languages. 

Figure 1. Several NVIDIA Parakeet and Canary models top the Hugging Face Open ASR Leaderboard for Speech Recognition

NVIDIA speech AI model details and use cases

The latest NVIDIA speech AI models each target a specific need. The multilingual Recurrent Neural Network Transducer (RNNT) model, Parakeet RNNT 1.1B, supports 25 languages, making it easier to connect with teams and customers anywhere.

For scenarios with background noise, such as transcription in hospitals, airports, or virtually any busy place, built-in Silero voice activity detection (VAD) helps maintain accurate output. Parakeet v2, the model with the lowest WER, delivers fast, precise results and adds advanced features such as music transcription.
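To illustrate what a VAD stage does in front of a recognizer, here is a deliberately simple sketch. It is not Silero VAD, which is a neural model. It uses a plain energy threshold, a swapped-in technique chosen only because it fits in a few lines, and it would perform poorly in exactly the noisy settings where a learned VAD helps. The frame size and threshold are arbitrary assumptions.

```python
import math


def frame_rms(samples, frame_size):
    """Root-mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return energies


def speech_segments(samples, frame_size=160, threshold=0.1):
    """Return (start_sample, end_sample) spans whose frame RMS meets the threshold."""
    energies = frame_rms(samples, frame_size)
    segments = []
    span_start = None
    for index, energy in enumerate(energies):
        if energy >= threshold and span_start is None:
            span_start = index * frame_size  # a speech-like span begins
        elif energy < threshold and span_start is not None:
            segments.append((span_start, index * frame_size))
            span_start = None
    if span_start is not None:
        # The audio ended while still inside an active span.
        segments.append((span_start, len(energies) * frame_size))
    return segments
```

Only the returned spans would be passed on to the ASR model, so silence and low-energy stretches never reach the recognizer.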

For teams that want ready-to-deploy solutions, NVIDIA offers a lineup of fully supported Riva NIM microservices.

Video 2. Learn more about the leaderboard-topping NVIDIA Parakeet TDT 0.6B v2 model in this short interview

NVIDIA speech models are easy to deploy and enterprise-ready. Riva models are available through NVIDIA AI Enterprise, NVIDIA NGC, and as NVIDIA NIM microservices. The latest research models can be accessed on Hugging Face.

| Model name | Architecture | Languages | Key features | Sample use cases |
|---|---|---|---|---|
| Parakeet TDT 0.6B v2 | FastConformer-TDT | English (en-US) | Industry-best WER; ultra-fast; word-level timestamps; song lyrics; punctuation | Media and entertainment; edge and IoT |
| Parakeet RNNT 1.1B | FastConformer-RNNT | 25 languages | Universal tokenizer; punctuation-aware; NVIDIA NIM | Global customer support; multilingual transcription |
| Parakeet CTC 1.1B (Silero VAD, optional) | FastConformer-CTC | English (en-US) | High-speed ASR; noise robust; Silero VAD; high throughput; low latency | Virtual assistants and enterprise voice apps; noisy environments (hospitals, airports, drive-through kiosks) |
| Parakeet CTC 0.6B | FastConformer-CTC | English (en-US), Spanish (es-US) | High-speed ASR; trained on ASRSet and 35K+ hours of English (en-US) speech; lowercase; spaces and apostrophes; fast inference | Clear dictation needs (in healthcare and finance, for example); media; edge devices |
Table 1. An overview of NVIDIA Parakeet models

Get started with NVIDIA speech AI models

With continuous innovation and new releases, NVIDIA Parakeet ASR models are setting the pace for speech recognition, delivering global language coverage, robust noise handling, and industry-leading speed and accuracy. Whether you’re building enterprise voice solutions, powering multilingual customer support, or developing next-generation media applications, Parakeet models provide the tools to make your products stand out with clarity and intelligence.

To get started, download NVIDIA Parakeet v2 and experience the NVIDIA Riva speech NIM. For technical details, deployment guides, and more, visit the NGC Catalog.
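As a rough starting point, the sketch below loads the research checkpoint through NVIDIA NeMo and prints word-level timestamps. It assumes the `nemo_toolkit[asr]` package is installed and that the model is published under the Hugging Face ID `nvidia/parakeet-tdt-0.6b-v2`. It also assumes each word timestamp is a dictionary with `word`, `start`, and `end` keys, so check the model card for the exact output structure. The helper names and the `sample.wav` path are hypothetical.

```python
def transcribe_with_timestamps(audio_paths, model_name="nvidia/parakeet-tdt-0.6b-v2"):
    """Load the model through NeMo and transcribe a list of audio files."""
    # Imported lazily so the formatting helper below works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    return asr_model.transcribe(audio_paths, timestamps=True)


def format_word_timestamps(word_stamps):
    """Render word timestamps as 'start - end : word' lines (key names assumed)."""
    lines = []
    for stamp in word_stamps:
        lines.append(f"{stamp['start']:.2f}s - {stamp['end']:.2f}s : {stamp['word']}")
    return lines


# Example usage (downloads the checkpoint on first run; path is hypothetical):
# results = transcribe_with_timestamps(["sample.wav"])
# print(results[0].text)
# for line in format_word_timestamps(results[0].timestamp["word"]):
#     print(line)
```

The same word-level output could feed subtitle or lyric alignment, which is the kind of use the song-to-lyrics example points to.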
