Breaking barriers in speech recognition, NVIDIA NeMo proudly presents pretrained models tailored for Dutch and Persian—languages often overlooked in the AI landscape.
These models leverage the recently introduced FastConformer architecture and were trained simultaneously with CTC and transducer objectives to maximize each model’s accuracy.
Automatic speech recognition (ASR) is a fundamental technology for conversational AI applications, as it enables users to communicate with AI systems and other devices using voice. It’s also widely adopted in conversational analytics and audio captioning, resulting in broader content accessibility.
Persian speech recognition model
The Persian model was trained on Mozilla’s Common Voice (MCV) 15.0 Persian data. Notably, two techniques helped maximize the model’s performance: initialization from a pretrained English checkpoint and a custom train-test split that allowed the use of an extra 300 hours of MCV-validated recordings.
This model achieves a 13.16% word error rate (WER) and a 3.85% character error rate (CER) in evaluation. While WER is a standard metric for ASR, it does not necessarily reflect ASR performance in Persian well because compound word notation is flexible: the parts of a compound word may or may not be separated by whitespace. A transcript that differs from the reference only in such spacing is still penalized with word errors. In these cases, CER may be a more realistic indication of an ASR system’s accuracy.
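To make the difference between the two metrics concrete, here is a minimal sketch that computes WER and CER from a plain Levenshtein edit distance. The helper names (`edit_distance`, `wer`, `cer`) are hypothetical, and this is not the evaluation code used for the models. The sentence pair is an English stand-in chosen only to illustrate the compound-word spacing effect, not Persian evaluation data. The sketch also assumes that spaces count as characters for CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits (spaces included)
    divided by reference character count."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair: the hypothesis differs only by a space inside a compound.
reference = "the bookstore is open"
hypothesis = "the book store is open"

print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 50.00%
print(f"CER: {cer(reference, hypothesis):.2%}")  # CER: 4.76%
```

A single extra space costs two word-level edits (one substitution and one insertion) out of four reference words, but only one character-level edit out of 21 reference characters. This is why CER penalizes spacing variation far less than WER.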
Dutch speech recognition model
The Dutch model was trained on 40 hours of MCV data, 547 hours of Multilingual LibriSpeech (MLS) data, and 34 hours of VoxPopuli data.
This model achieves word error rates of 9.2% on MCV and 12.1% on MLS in evaluation, which places it among the top available open-source Dutch models. This model can also produce transcripts with punctuation and capitalization.
Try the models
These models are released under the permissive CC BY 4.0 license, which allows commercial use. They are available to download from both NGC and Hugging Face: