NVIDIA Jarvis

Multimodal AI SDK

During everyday conversations, humans rely on sight, sound and past interactions for context. Today's conversation systems, by contrast, rely on a single input such as text or audio, and the application developer must inject context programmatically. This leads to several limitations: agents cannot differentiate between speakers or handle more than one conversation at a time, and they have very limited ability to derive context for a question or to offer responses beyond simple discrete tasks. To achieve their full potential, conversation-based AI applications need to process several inputs simultaneously, fuse them to derive context, and then use that context to generate more accurate, engaging and natural responses.

NVIDIA Jarvis is an SDK for building and deploying AI applications that fuse vision, speech and other sensors. It offers a complete workflow to build, train and deploy GPU-accelerated AI systems that can use visual cues such as gestures and gaze along with speech in context. For example, lip movement can be fused with speech input to identify the active speaker, and gaze can be used to determine whether the speaker is addressing the AI agent or other people in the scene. Such multimodal fusion enables simultaneous multi-user, multi-context conversations with the AI agent, which require a deeper understanding of context.
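The lip-movement example can be illustrated with a minimal sketch. This is not the Jarvis API. It assumes that some upstream modules have already produced a per-frame audio energy value and a per-frame lip activity score for each tracked person, and it uses a plain Pearson correlation as a stand-in fusion rule. The function names `pearson` and `active_speaker`, the person ids, and the sample values are all hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def active_speaker(audio_energy, lip_activity_by_person):
    """Pick the person whose lip activity best tracks the audio energy.

    Hypothetical fusion rule: score each person by the correlation between
    their lip activity and the audio energy, then take the highest score.
    """
    scores = {
        person_id: pearson(audio_energy, lips)
        for person_id, lips in lip_activity_by_person.items()
    }
    return max(scores, key=scores.get), scores


# Made-up per-frame values for illustration only.
audio = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
lips = {
    "person_a": [0.0, 0.9, 0.8, 0.1, 0.6, 0.0],  # moves lips when audio is loud
    "person_b": [0.5, 0.1, 0.0, 0.6, 0.1, 0.4],  # moves lips when audio is quiet
}
speaker, scores = active_speaker(audio, lips)
print(speaker)
```

In this toy input, `person_a` is selected because that person's lip activity rises and falls with the audio. A production system would more likely use learned audio-visual models than a fixed correlation rule.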


Jarvis provides several GPU-accelerated base modules for speech tasks such as intent and entity classification, sentiment analysis, dialog modeling, and domain and fulfillment mapping. For vision, modules include person detection and tracking, detection of key body landmarks and body pose, gestures, lip activity and gaze. You can also use custom modules or fine-tune the provided ones to adapt them to your use case. To use these modules together, Jarvis provides several fusion algorithms that can be modularly integrated, extended and customized for novel use cases.

Jarvis-based applications can achieve the highest accuracy and performance possible by using state-of-the-art speech and vision research algorithms through NVIDIA Neural Modules. These machine learning and deep learning models can be trained on your custom data to achieve better accuracy for your use case. As the application grows and new modules or sensor inputs are added, system administrators can set up and manage the components easily. Services developed using Jarvis can be deployed for production to the cloud, the data center and the edge using the EGX platform. Deep learning and sensor fusion have applications across retail stores and manufacturing, as well as in cars, restaurants and hospitals, to name a few.

For edge and IoT use cases, Jarvis runs on the NVIDIA EGX stack, which is compatible with all commercially available Kubernetes infrastructure. The NVIDIA EGX stack now packages the NVIDIA Driver, NVIDIA Kubernetes Plug-In, NVIDIA Container Runtime Plug-In, and NVIDIA GPU Monitoring software into a GPU operator. The operator installs all the NVIDIA software as containers that run on Kubernetes, which simplifies the management of GPU-enabled servers.

Apply for exclusive news, updates, and early access to NVIDIA Jarvis.
