NVIDIA just announced a series of small language models (SLMs) that increase the amount and type of information digital humans can use to augment their responses. This includes new large-context models that provide more relevant answers and new multi-modal models that allow images as inputs. These models are available now as part of NVIDIA ACE, a suite of digital human technologies that brings life to agents, assistants, and avatars.
NVIDIA ACE introduces its first multi-modal SLM
To elevate their responses, digital humans must be able to ingest more context from the world around them, just as humans do. The NVIDIA Nemovision-4B-Instruct model is a small multi-modal model that enables digital humans to understand visual imagery, both in the real world and on the Windows desktop, and produce relevant responses.
This model builds on the latest NVIDIA VILA and uses the NVIDIA NeMo framework's recipe for distillation, pruning, and quantization, making it small enough to run performantly on a broad range of NVIDIA RTX GPUs while maintaining the accuracy developers need. Multi-modality serves as the foundation for agentic workflows, enabling digital humans that can reason and take action with little to no assistance from a user.
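NVIDIA has not published the Nemovision-4B-Instruct interface here, but many NIM microservices expose an OpenAI-compatible chat endpoint. Purely as a hedged sketch, assuming such an endpoint at a placeholder URL with a placeholder model name and image path (none of these are documented values), an image-plus-text request might look like this:

```python
# Hypothetical sketch: sending an image plus a text prompt to a locally hosted
# vision-language model through an OpenAI-style /v1/chat/completions endpoint.
# The URL, model name, and image path are placeholders, not a documented API.
import base64
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
IMAGE_PATH = "desktop_screenshot.png"                    # placeholder image

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "nemovision-4b-instruct",  # placeholder model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is on screen, and what should I do next?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
print(response.json()["choices"][0]["message"]["content"])
```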
Solving larger problems requires large-context language models
The new family of large-context SLMs is designed to handle large data inputs, which lets the models follow longer, more complex prompts. The Mistral-NeMo-Minitron-128k-Instruct family comes in 8B-, 4B-, and 2B-parameter versions so you can trade off speed, memory usage, and accuracy on NVIDIA RTX AI PCs. These large-context models can process large sets of data in a single pass, reducing the need for segmentation and reassembly and delivering greater accuracy (see the sketch after the table below).
| | | Mistral NeMo-Minitron-8B-128k-Instruct | Mistral NeMo-12B-Instruct | Llama-3.1-8B-Instruct | Qwen-2.5-7B-Instruct | Phi-3-Small-8k-Instruct | Gemma-2-9B-Instruct |
|---|---|---|---|---|---|---|---|
| Features | Context Window | 128K | 128K | 128K | 128K | 8K | 8K |
| Benchmarks* | Instruction Following (IFEval) | **83.7** | 64.7 | *79.7* | 76.9 | 65.8 | 75.2 |
| | Reasoning (MUSR) | *12.08* | 8.48 | 8.41 | 8.45 | **16.77** | 9.74 |
| | Function Calling (BFCL v2 Live) | **69.5** | 47.9 | 44.3 | 62.1 | 39.9 | *65.7* |
| | Multi-Turn Conversation (MTBench, GPT4-Turbo) | 7.84 | 8.10 | 7.78 | **8.41** | 7.63 | *8.05* |
| | General Knowledge (GPQA Main, 0-shot) | *33.3* | 28.6 | 30.4 | 29.9 | 30.8 | **35.5** |
| | General Knowledge (MMLU Pro) | 33.36 | 27.97 | 30.68 | *36.52* | **38.96** | 31.95 |
| | Math (GSM8k, 0-shot) | **87.6** | 79.8 | *83.9* | 55.5 | 81.7 | 80.1 |
| | Coding (MBPP, 0-shot) | **74.1** | 66.7 | 72.8 | *73.5* | 68.7 | 44.4 |
| Speed* | Latency (TTFT) | 190 ms | 919 ms | 170 ms | 557 ms | DNR** | 237 ms |
| | Throughput (tok/s) | 108.4 | 51.4 | 120.7 | 80.8 | DNR** | 84.4 |
The table compares the Mistral NeMo-Minitron-8B-128k-Instruct model to other models in a similar size range and to its teacher, the Mistral NeMo 12B model. Higher numbers indicate better accuracy. Bold numbers mark the best and italicized numbers the second-best result among the 8B-class models.
Note: Models were executed with llama.cpp at Q4_0 quantization on an NVIDIA RTX 4090. Input sequence length = 2,000 tokens; output sequence length = 100 tokens.
* Benchmarks were run at FP16 precision; speed measurements were run at INT4 quantization.
** Does not run in GPT-Generated Unified Format (GGUF)
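To make the single-pass point above concrete, here is a minimal sketch in Python; `query_llm` is a hypothetical placeholder for whatever inference call your deployment exposes, and the chunk size is an arbitrary assumption rather than a recommendation:

```python
# Illustrative sketch only: contrasting chunked processing (small context window)
# with a single-pass prompt (128K context window).
def query_llm(prompt: str) -> str:
    """Placeholder for a call to whatever locally deployed SLM you are using."""
    raise NotImplementedError

def summarize_with_small_context(document: str, chunk_chars: int = 24_000) -> str:
    # An ~8K-token window forces segmentation: summarize each chunk separately,
    # then reassemble the partial summaries in a second pass.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [query_llm(f"Summarize this excerpt:\n{chunk}") for chunk in chunks]
    return query_llm("Combine these partial summaries into one summary:\n" + "\n".join(partials))

def summarize_with_128k_context(document: str) -> str:
    # A 128K-token window lets the whole document fit in a single prompt,
    # avoiding the segmentation-and-reassembly step and the accuracy loss it can introduce.
    return query_llm(f"Summarize this document:\n{document}")
```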
The NVIDIA Nemovision-4B-Instruct and large-context models are available through early access.
New updates to Audio2Face-3D NIM microservice
When building these more intelligent digital humans, you need realistic facial animation for interactions that feel authentic and believable.
The NVIDIA Audio2Face-3D NIM microservice converts audio into lip-sync and facial animation in real time. Now, Audio2Face-3D, an easy-to-use inference microservice for accelerated deployment, is available as a single downloadable, optimized container. The NIM microservice exposes new configuration options for improved customizability, and it includes the inference model used in the “James” digital human, now available for public use.
Deploying digital humans for NVIDIA RTX AI PCs made easier
It’s challenging to orchestrate animation, intelligence, and speech AI models efficiently and to optimize the pipeline on a PC for the fastest response time at the highest accuracy.
These pipelines become more complex when introducing the multiple inputs and outputs required to fully realize advanced use cases, such as autonomous agents. Selecting the right models and frameworks, writing the orchestration code, and optimizing them for your specific hardware is a time-consuming task that slows down development.
NVIDIA is announcing new SDK plugins and samples for on-device workflows, available now. This collection includes NVIDIA Riva Automatic Speech Recognition for speech-to-text transcription, a retrieval augmented generation (RAG) demo and reference implementation, and an Unreal Engine 5 sample application powered by Audio2Face-3D.
These on-device plugins are built on the NVIDIA In-Game Inference SDK, available in beta today. The In-Game Inference SDK simplifies AI integration by automating model and dependency download, abstracting away the details of inference libraries and hardware, and enabling hybrid AI, where the application can easily switch between AI running on the PC and AI running in the cloud.
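The In-Game Inference SDK's own API is not shown here; the following Python sketch only illustrates the hybrid AI idea in the abstract, and every name in it is a hypothetical placeholder rather than part of the SDK. The point of the pattern is that the application calls one inference function and the routing decision stays in one place.

```python
# Conceptual sketch of hybrid AI routing: prefer on-device inference when local
# resources allow, and fall back to a cloud endpoint otherwise. All functions
# below are illustrative placeholders, not part of the In-Game Inference SDK.
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    local_vram_required_gb: float   # VRAM the local model needs
    cloud_endpoint: str             # fallback cloud URL

def available_vram_gb() -> float:
    """Placeholder: query the GPU for free VRAM (for example, via NVML)."""
    return 8.0

def run_local(prompt: str) -> str:
    """Placeholder: run the model on the local RTX GPU."""
    return f"[local] {prompt}"

def run_cloud(prompt: str, endpoint: str) -> str:
    """Placeholder: send the request to a hosted endpoint."""
    return f"[cloud:{endpoint}] {prompt}"

def infer(prompt: str, cfg: InferenceConfig) -> str:
    # Hybrid AI: the same application code serves both paths, so the app can
    # switch between on-device and cloud execution without restructuring.
    if available_vram_gb() >= cfg.local_vram_required_gb:
        return run_local(prompt)
    return run_cloud(prompt, cfg.cloud_endpoint)
```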
You can get started with the SDK plugins and samples today at NVIDIA Developer.