The year 2022 has thus far been a momentous, thrilling, and overwhelming year for AI aficionados. Get3D is pushing the boundaries of generative 3D modeling, an AI model can now diagnose breast cancer from MRIs as accurately as board-certified radiologists, and state-of-the-art speech AI models have widened their horizons to extended reality.
Pretrained models from NVIDIA have redefined performance this year, amused us on the stage of America’s Got Talent, and won four global contests and a Best Inventions 2022 award from Time Magazine.
In addition to empowering researchers and data scientists, NVIDIA pretrained models help developers create cutting-edge AI applications by providing trained deep learning weights as a starting point, which speeds up convergence. To enable this, NVIDIA has spearheaded the research behind building and training these pretrained models for use cases like automatic speech recognition, pose estimation, object detection, 3D generation, semantic segmentation, and many more.
Pretrained models also streamline deployment. Over the last 3 months, users have already reaped the benefits of 870 different NVIDIA pretrained models that support more than 50 use cases across several industries.
This post walks through a few of the top pretrained AI models that are behind groundbreaking AI applications.
Speech recognition for all
NVIDIA NeMo is serving a variety of industries with cutting-edge AI application development for speech AI and natural language processing. The use cases include the creation of virtual assistants in Arabic and the facilitation of state-of-the-art automatic speech recognition (ASR) for financial audio.
For language-specific ASR, the NVIDIA NeMo Conformer-Transducer and Conformer-CTC (connectionist temporal classification) pretrained models are popular. These models combine a robust architecture with pretraining on a range of datasets, such as LibriSpeech and Mozilla Common Voice, which gives them high accuracy, meaning a low word error rate (WER) and a low character error rate (CER).
These models are laying the groundwork for state-of-the-art ASR models for Kinyarwanda, Kabyle, Catalan, and many other low-resource languages, bringing enhanced speech AI to low-resource languages, regions, and sectors.
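To make the two metrics concrete, here is a minimal sketch of how word error rate and character error rate are commonly computed: the edit distance between the reference transcript and the model output, divided by the reference length. The function names are illustrative and are not part of NeMo; production evaluations usually also normalize text (casing, punctuation) first, which this sketch skips.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over words, divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Edit distance over characters, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words.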
For more information, see NeMo automatic speech recognition models.
Verifying speakers for the greater good
To determine ‘who talked when,’ voice AI enthusiasts and application developers are fusing deep neural network speech recognition with speaker diarization architecture.
Beyond well-known uses like multi-speaker transcription in video conferencing, developers are benefiting from this AI architecture in specialized use cases:
- Analyzing clinical speech recordings and understanding medical conversations for effective healthcare
- Captioning and separating teacher-student speech in the education sector
Pretrained embeddings of the modified Emphasized Channel Attention, Propagation, and Aggregation in TDNN (ECAPA-TDNN) model are accessible with the NVIDIA NeMo toolkit. Fisher, VoxCeleb, and real room-reaction data were used to train this deep neural network model for speaker identification and verification.
One of the best solutions for speaker diarization, ECAPA is based on the time-delay neural network (TDNN) and squeeze-and-excitation (SE) structure, with 22.3M parameters. It outperforms traditional TDNNs by emphasizing channel attention, propagation, and aggregation, which significantly reduces error rates.
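Speaker verification with embeddings typically comes down to comparing two fixed-length vectors. The sketch below assumes the embeddings have already been extracted from two audio clips and scores them with cosine similarity. The function names and the 0.7 threshold are hypothetical illustrations, not NeMo defaults; in practice the threshold is tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the pair as one speaker if similarity reaches the threshold.

    The threshold value is an assumption for illustration only.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Diarization pipelines build on the same idea: embeddings from short audio segments are clustered so that segments with similar vectors are attributed to the same speaker.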
For more information, see Speaker Diarization.
Visionary image control with SegFormer AI models
SegFormer is visionary research that uses AI to pioneer world-class image control. The original model and its variants are thriving in a variety of industries, including manufacturing, healthcare, automotive, and retail. Its enormous potential is best demonstrated by applications like virtual changing rooms, robotic image control, medical imaging and diagnostics, and vision analytics in self-driving cars.
SegFormer is built for semantic segmentation, a computer vision task that assigns a class label to every pixel so that the various objects in an image can be separated. To meet particular needs, SegFormer checkpoints fine-tuned on datasets like ADE20K and Cityscapes are available at several resolutions, such as 512×512, 640×640, and 1024×1024. The design, which draws inspiration from the Transformer model architecture, achieves state-of-the-art results in a variety of tasks.
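As a rough illustration of what a semantic segmentation model produces and how it is scored, the sketch below turns per-class score maps into a label map by taking the highest-scoring class at each pixel, then computes mean intersection over union (mIoU) against a ground-truth map. It uses plain nested lists and hypothetical function names; it is not SegFormer code, and real pipelines operate on tensors.

```python
def predict_labels(scores):
    """scores[c][y][x] holds the score of class c at pixel (y, x).

    Returns labels[y][x], the index of the highest-scoring class per pixel.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


def mean_iou(pred, target, num_classes):
    """Average of per-class intersection over union.

    Classes absent from both prediction and target are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += (p == c) and (t == c)
                union += (p == c) or (t == c)
        if union:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in prediction or target")
    return sum(ious) / len(ious)
```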
For more information, see the NVlabs/SegFormer GitHub repo.
Purpose-built, pretrained model for automotive low-code developers
By detecting and identifying cars, people, road signs, and two-wheelers to comprehend traffic flow, TrafficCamNet has been driving smart city initiatives and detection technology for the automotive sector.
The model has been trained on a vast amount of data that includes images of real traffic intersections in US cities. It is based on the NVIDIA DetectNet_v2 detector with ResNet18 as a feature extractor. The architecture, which is sometimes referred to as GridBox object detection, employs bounding-box regression on a regular grid in the input image. The NVIDIA TAO Toolkit can be used to access and further fine-tune the purpose-built, pretrained TrafficCamNet model for best-in-class accuracy.
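The idea of bounding-box regression on a regular grid can be sketched as follows. The code assumes a simplified, hypothetical output format: each grid cell carries a confidence value and four distances (left, top, right, bottom, in pixels) from the cell center to the box edges. The exact tensor layout, normalization, and stride used by DetectNet_v2 may differ, and a real pipeline would also cluster overlapping candidates afterward.

```python
def decode_grid(coverage, offsets, stride=16, threshold=0.5):
    """Turn per-cell predictions into candidate boxes.

    coverage[row][col] is the confidence that an object covers the cell.
    offsets[row][col] is (left, top, right, bottom) in pixels, measured
    from the cell center. Both the layout and the default stride are
    assumptions made for this illustration.
    """
    boxes = []
    for row, coverage_row in enumerate(coverage):
        for col, confidence in enumerate(coverage_row):
            if confidence < threshold:
                continue  # skip cells unlikely to contain an object
            center_x = (col + 0.5) * stride
            center_y = (row + 0.5) * stride
            left, top, right, bottom = offsets[row][col]
            boxes.append((center_x - left, center_y - top,
                          center_x + right, center_y + bottom,
                          confidence))
    return boxes
```

Because every cell predicts relative to its own position, the network only has to learn local offsets rather than absolute image coordinates.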
For more information, see Purpose-Built Models.
NVIDIA pretrained models have won numerous awards for their cutting-edge performance, extraordinary research, and exemplary ability to solve real-world problems. Here are some notable wins.
World’s largest genomics language model wins Gordon Bell Special Award 2022
Researchers from Argonne National Laboratory, NVIDIA, the Technical University of Munich, the University of Chicago, Caltech, Harvard University, and others developed one of the world’s largest genomics language models that predicts new COVID variants. For their work, they won the 2022 Gordon Bell Special Award.
The model informs timely public health intervention strategies and downstream vaccine development for emerging viral variants. The research was published in October 2022 and presents GenSLMs (genome-scale language models), which can accurately and rapidly identify variants of concern in the SARS-CoV-2 virus.
The large genomics language models were pretrained on more than 110M gene sequences, and a SARS-CoV-2-specific model was then fine-tuned on 1.5M genomes, with 2.5B and 25B trainable parameters, respectively. This research enables developers to advance genomic language modeling by creating applications that can assist different public health initiatives.
For more information, see Speaking the Language of the Genome: Gordon Bell Winner Applies Large Language Models to Predict New COVID Variants.
State-of-the-art vision model wins Robust Vision Challenge 2022
The Fully Attentional Network (FAN) Transformer model from NVIDIA Research won the Robust Vision Challenge 2022. The team adopted the SegFormer head on top of an ImageNet-22k pretrained FAN-B-Hybrid model, as described in the Understanding The Robustness in Vision Transformers paper. The model was then further fine-tuned on a composed, large-scale dataset, similar to MSeg.
NVIDIA Research developed all the models used. The model achieved a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrated state-of-the-art accuracy and robustness in two downstream tasks, semantic segmentation and object detection.
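For readers unfamiliar with mCE (mean corruption error, lower is better), here is a minimal sketch assuming the standard ImageNet-C definition: for each corruption type, the model's error rates summed over severity levels are divided by the corresponding sum for a baseline model, and the ratios are averaged over corruption types. The function names are illustrative, and any numbers passed in are the caller's own data, not results from the FAN paper.

```python
def corruption_error(model_errors, baseline_errors):
    """Error summed over severity levels, normalized by a baseline model."""
    return sum(model_errors) / sum(baseline_errors)


def mean_corruption_error(model, baseline):
    """Average normalized corruption error over all corruption types.

    model and baseline map a corruption name (for example "noise") to a
    list of top-1 error rates, one per severity level.
    """
    if not model:
        raise ValueError("at least one corruption type is required")
    errors = [corruption_error(model[name], baseline[name]) for name in model]
    return sum(errors) / len(errors)
```

A value below 1.0 (or below 100 when expressed as a percentage) means the model degrades less under corruption than the baseline does.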
For more information, see the NVlabs/FAN GitHub repo.
Winning the Telugu automatic speech recognition competition
NVIDIA recently won the Telugu-ASR challenge conducted by IIIT-Hyderabad, India. The team trained a Conformer-RNNT (recurrent neural network transducer) model from scratch using 2K hours of Telugu-only data provided by the organizers. This model took first position on the leaderboard for the closed track with a WER of 13.12%.
For the open competition track, the team performed transfer learning on a pretrained SSL Conformer-RNNT checkpoint trained on 36K hours from 40 Indic languages. With a WER of 12.64%, they won the competition. The fine-tuned winning model can be used by developers to create applications for automatic speech recognition that will benefit the 83M Telugu speakers globally.
NVIDIA pretrained models
NVIDIA pretrained models remove the need to build models from scratch or to experiment with other open-source models that don’t converge, making high-performing AI development simple, rapid, and accessible.
For more information, see AI models.