NVIDIA JARVIS

NVIDIA Jarvis is an application framework for multimodal conversational AI services that delivers real-time performance on GPUs.




Figure 1: Jarvis Pipeline



Jarvis is a fully accelerated application framework for building multimodal conversational AI services that use an end-to-end deep learning pipeline. Enterprise developers can fine-tune state-of-the-art models on their own data to achieve a deeper understanding of their specific context, then optimize them for inference to offer end-to-end real-time services that run in less than 300 milliseconds (ms) and deliver 7x higher throughput on GPUs compared with CPUs.

The Jarvis framework includes pre-trained conversational AI models, tools in the NVIDIA AI Toolkit, and optimized end-to-end services for speech, vision, and natural language understanding (NLU) tasks.

Fusing vision, audio, and other sensor inputs simultaneously enables capabilities such as multi-user, multi-context conversations in applications including virtual assistants, multi-user diarization, and call center assistants.

Jarvis-based applications have been optimized to maximize performance on the NVIDIA EGX™ platform in the cloud, in the data center, and at the edge.



Real-Time Performance

Run deep learning-based conversational AI applications in under 300 ms, the latency threshold for real-time performance.

Multimodal

Fuse speech and vision to offer accurate and natural interactions in virtual assistants, chatbots, and other conversational AI applications.

Automated Deployment

Use one command to deploy conversational AI services in the cloud or at the edge.



Create State-of-the-Art Deep Learning Models


Figure 1: Pre-trained models

Use state-of-the-art deep learning models trained for more than 100,000 hours on NVIDIA DGX™ systems for speech, language understanding, and vision tasks. Pre-trained models and scripts used in Jarvis are freely available in NGC™.

You can fine-tune these models for your domain with your data using NVIDIA NeMo and then use tools in the NVIDIA AI Toolkit to easily deploy them as services.



Develop New Multimodal Skills


Build multimodal skills such as multi-speaker transcription, chatbots, gesture recognition, and look-to-talk for your conversational AI applications.

Jarvis includes multi-skill samples that you can customize for your use case. To author new skills, you can combine the speech, language understanding, and vision pipelines with a dialog manager that supports multi-user and multi-context conversations.

Figure 2: Multimodal application with multiple users and contexts


Optimized Task-Specific Services


Figure 3: Jarvis AI services

Access high-performance services for tasks such as speech recognition, intent recognition, text-to-speech, pose estimation, gaze detection, and facial landmark detection through a simple API.

Pipelines for each skill can be fused to build new skills. Each pipeline is tuned for the highest possible performance and can be customized for your specific use case.
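As a rough illustration of fusing pipelines into a skill and checking the result against the 300 ms real-time threshold, the sketch below chains three stages and times each one. The stage functions (`recognize_speech`, `recognize_intent`, `synthesize_reply`) are hypothetical placeholders that return canned values. They are not Jarvis API calls, and the helper names are assumptions made for this example only.

```python
import time
from typing import Any, Callable, List, Tuple

# Real-time latency threshold quoted on this page, in milliseconds.
LATENCY_BUDGET_MS = 300.0


# Hypothetical placeholder stages: they stand in for real services
# and are NOT part of any Jarvis API.
def recognize_speech(audio: bytes) -> str:
    return "turn on the lights"


def recognize_intent(text: str) -> dict:
    return {"intent": "lights_on", "text": text}


def synthesize_reply(result: dict) -> str:
    return f"OK, handling {result['intent']}"


def run_pipeline(
    stages: List[Callable[[Any], Any]], payload: Any
) -> Tuple[Any, List[Tuple[str, float]]]:
    """Run stages in order; return the final output and per-stage timings in ms."""
    timings = []
    for stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append((stage.__name__, elapsed_ms))
    return payload, timings


def within_budget(
    timings: List[Tuple[str, float]], budget_ms: float = LATENCY_BUDGET_MS
) -> bool:
    """True if the summed stage latency is under the budget."""
    return sum(ms for _, ms in timings) < budget_ms


reply, stage_timings = run_pipeline(
    [recognize_speech, recognize_intent, synthesize_reply], b""
)
for name, ms in stage_timings:
    print(f"{name}: {ms:.3f} ms")
print("within real-time budget:", within_budget(stage_timings))
```

Swapping a placeholder for a call to a real service keeps the same structure: each stage takes the previous stage's output, and the timing list shows which stage consumes most of the budget.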



Build and Deploy Services Easily



Jarvis automates the steps that go from pre-trained models to optimized services deployed in the cloud, in the data center, and at the edge. Under the hood, Jarvis applies powerful NVIDIA® TensorRT™ optimizations to models, configures the NVIDIA Triton™ Inference Server, and exposes the models as a service through a standard API.

To deploy, you can use a single command to download, set up, and run the entire Jarvis application or individual services through Helm charts on Kubernetes clusters. The Helm charts can be customized for your use case and setup.

Figure 4: Helm command to deploy models to production
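The exact command from the figure is not reproduced here. As a minimal sketch of what a customized Helm deployment might look like, the snippet below assembles a `helm install` invocation with `--set` overrides. The release name, chart path, namespace, and the `replicaCount` value key are all hypothetical placeholders, not names taken from the Jarvis Helm charts.

```python
import shlex
from typing import Dict, List


def build_helm_install(
    release: str, chart: str, namespace: str, overrides: Dict[str, str]
) -> List[str]:
    """Build a `helm install` argument list with one --set flag per override."""
    cmd = ["helm", "install", release, chart, "--namespace", namespace]
    # Sort keys so the generated command is deterministic.
    for key, value in sorted(overrides.items()):
        cmd += ["--set", f"{key}={value}"]
    return cmd


# All values below are placeholders chosen for illustration only.
command = build_helm_install(
    release="jarvis-demo",
    chart="./jarvis-chart",
    namespace="default",
    overrides={"replicaCount": "1"},
)

# Print the command for review; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting mistakes when override values contain spaces or special characters.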



Leading Adopters Across All Verticals


"Ping An addresses millions of queries from customers each day using chat-bot agents. As an early partner of the Jarvis early access program, we were able to use the tools and build better solutions with higher accuracy and lower latency, thus providing better services. More specifically, with NeMo, the pre-trained model, and the ASR pipeline optimized using Jarvis, the system achieved 5% improvement on accuracy, so as to serve our customers with better experience."

— Dr. Jing Xiao, the Chief Scientist at Ping An

“In our evaluation of Jarvis for virtual assistants and speech analytics, we saw remarkable accuracy by fine-tuning the Automated Speech Recognition models in the Russian language using the NeMo toolkit in Jarvis. Jarvis can provide up to 10x throughput performance with powerful TensorRT optimizations on models, so we’re looking forward to using Jarvis to get the most out of these technology advancements.”

— Nikita Semenov, Head of ML at MTS AI

“InstaDeep delivers decision-making AI products and solutions for enterprises. For this project, our goal is to build a virtual assistant in the Arabic language, and NVIDIA Jarvis played a significant role in improving the application’s performance. Using the NeMo toolkit in Jarvis, we were able to fine-tune an Arabic speech-to-text model to get a Word Error Rate as low as 7.84% and reduced the training time of the model from days to hours using GPUs. We look forward to integrating these models in Jarvis’ end-to-end pipeline to ensure real-time latency.”

— Karim Beguir, CEO and Co-Founder at InstaDeep

“At Intelligent Voice, we provide high performance speech recognition solutions, but our customers are always looking for more. Jarvis takes a multi-modal approach that fuses key elements of Automatic Speech Recognition with entity and intent matching to address new use cases where high-throughput and low latency are required. The Jarvis API is very easy to use, integrate and customize to our customers’ workflows for optimized performance.”

— Nigel Cannings, CTO at Intelligent Voice

“At Northwestern Medicine, we aim to improve patient satisfaction and staff productivity with our suite of healthcare AI solutions. Conversational AI, powered by NVIDIA Clara Guardian and Jarvis, improves patient and staff safety during COVID-19 by reducing direct physical contact while delivering high-quality care. Jarvis ASR and TTS models make this conversational AI a reality. Patients now no longer need to wait for the clinical staff to become available, they can receive immediate answers from an AI-powered virtual assistant.”

— Andrew Gostine, MD, MBA, CEO of Whiteboard Coordinator

“Low latency is critical in call centers, and with NVIDIA GPUs, our agents are able to listen, understand, and respond in under a second with the highest levels of accuracy. Based on early evaluations of speech and language understanding pipelines in NVIDIA Jarvis, we believe we can improve latency even further while maintaining accuracy, delivering the best experience possible for our customers.”

— Alan Bekker, co-founder and CTO of Voca

“Through the NVIDIA Jarvis early access program, we’ve been able to power our conversational AI products with state-of-the-art models using NVIDIA NeMo, significantly reducing the cost of getting started. Jarvis speech recognition has amazingly low latency and high accuracy. Having the flexibility to deploy on-prem and offer a range of data privacy and security options to our customers has helped us position our conversational AI-enabled products in new industry verticals.”

— Rajesh Jha, CEO of Siminsights

“Conversational AI applications are data hungry. Imagine the data needed to train models or the storage required to hold all of the information to have more natural and useful interactions. Jarvis helped us to leverage this data to reach our goal of building virtual assistants for retail-stores faster. Jarvis pipelines use state-of-the-art deep learning models and run the conversational applications in milliseconds.”

— AJ Mahajan, Senior Director, Solutions at NetApp



Resources




Sign up to receive exclusive NVIDIA Jarvis news and updates and to apply for the beta program.
