New NVIDIA Maxine Microservices Enhance Real-Time Audio and Video Effects for Conferences at Scale

At CES 2023, NVIDIA announced Maxine SDK updates and new microservices that enable clear communications in video conferences running on private or public clouds.

NVIDIA Maxine is a suite of GPU-accelerated AI SDKs and cloud-native microservices for deploying optimized and accelerated AI features that enhance audio, video, and augmented reality (AR) effects for real-time communications.

Powered by state-of-the-art NVIDIA AI models, Maxine improves individual presence in video conferences through clear communications and innovative features. The high-quality effects in Maxine can be achieved with standard microphones and webcams.

New Maxine microservices

NVIDIA is releasing two new microservices in early access in January 2023: Live Portrait and Video Effects. They join the updated Audio Effects microservice.

Maxine microservices are ready-to-use containerized applications built from state-of-the-art, pretrained models for enhancing audio, video, and augmented reality using AI. The containers include all necessary dependencies and can be easily deployed on public and private clouds. The microservices integrate applications with Maxine algorithms through cloud-based GPU computing.

Figure 1. NVIDIA Maxine microservices cloud reference application architecture. The diagram shows the workflow between clients, the app load balancer, and the Maxine Audio Effects, Video Effects, and Live Portrait microservices.

The Live Portrait microservice animates a photo or stylized portrait through a live video feed. With Live Portrait integrated into a video conferencing client, participants can remain off-camera but still be present in the meeting by speaking through their animated photo. Watch how Live Portrait and NVIDIA Omniverse Avatar Cloud Engine (ACE) bring 2D photos to life in the NVIDIA CES keynote.

The Video Effects microservice is a package of two Maxine AI features for video:

Virtual Background: Segments a person from the video stream and applies AI-powered background removal, replacement, or blur.
Eye Contact: Simulates eye contact by estimating and aligning the gaze with the camera.

For more information, see the Eye Contact feature demo.

The Maxine Audio Effects microservice, introduced at GTC Fall 2021, is now UCF-compliant. NVIDIA Unified Compute Framework is a low-code framework for developing cloud-native, real-time, and multimodal AI applications. UCF enables you to build and combine accelerated microservices across domains into real-time, multimodal AI applications. Every microservice can be independently deployed, managed, and scaled within the application.

Maxine SDK updates

In the latest release, NVIDIA has introduced a series of updates to the three Maxine SDKs (Audio Effects, Video Effects, and Augmented Reality):

Speaker Focus is an audio effect that separates the audio tracks of foreground and background speakers, making each voice more intelligible. Available in early access, this feature has undergone significant performance and quality improvements.

Performance and quality improvements have been made to Audio Super Resolution, Virtual Background, Eye Contact, and Face Expression Estimation.

Audio Super Resolution improves audio quality by increasing the temporal resolution (sampling rate) of the audio signal. It currently supports upsampling from 8 kHz to 16 kHz and from 16 kHz to 48 kHz. In the latest release, Audio Super Resolution latency has been reduced by more than 50% compared with the previous release.
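To make the two supported upsampling paths concrete, the following minimal sketch works out how many output samples correspond to a block of input samples. It is illustrative arithmetic only and does not call the Maxine Audio Effects SDK; the function name `upsampled_length` and the 10 ms frame size are assumptions chosen for the example.

```python
# Illustrative arithmetic only; this is not the Maxine Audio Effects SDK API.
# Supported paths taken from the text: 8 kHz -> 16 kHz and 16 kHz -> 48 kHz.
SUPPORTED_UPSAMPLING = {8_000: 16_000, 16_000: 48_000}  # input Hz -> output Hz


def upsampled_length(input_rate_hz: int, num_input_samples: int) -> int:
    """Return how many output samples correspond to num_input_samples."""
    if input_rate_hz not in SUPPORTED_UPSAMPLING:
        raise ValueError(f"unsupported input rate: {input_rate_hz} Hz")
    output_rate_hz = SUPPORTED_UPSAMPLING[input_rate_hz]
    factor = output_rate_hz // input_rate_hz  # 2 for 8->16 kHz, 3 for 16->48 kHz
    return num_input_samples * factor


# A hypothetical 10 ms frame: 80 samples at 8 kHz, 160 samples at 16 kHz.
print(upsampled_length(8_000, 80))    # 160
print(upsampled_length(16_000, 160))  # 480
```

The same duration of audio carries two or three times as many samples after upsampling, which is why buffer sizes on the output side need to be scaled by the same factor.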

Virtual Background segments a person from the video stream and applies AI-powered background removal, replacement, or blur. The latest update reduces latency.
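Once a segmentation mask is available, background replacement comes down to a per-pixel blend of the camera frame and the new background. The sketch below shows that generic compositing step on a tiny grayscale image. It is not Maxine code: the mask is written by hand, whereas a real pipeline would obtain it from the segmentation model, and the name `composite` is an assumption made for illustration.

```python
# Generic mask-based compositing sketch; not the Maxine Video Effects SDK API.
from typing import List

Image = List[List[float]]  # tiny grayscale image, values in [0, 1]


def composite(frame: Image, mask: Image, background: Image) -> Image:
    """Blend per pixel: out = mask * frame + (1 - mask) * background."""
    return [
        [m * f + (1.0 - m) * b for f, m, b in zip(frow, mrow, brow)]
        for frow, mrow, brow in zip(frame, mask, background)
    ]


frame = [[1.0, 0.5], [0.5, 0.25]]        # camera frame
mask = [[1.0, 0.0], [0.5, 0.0]]          # 1 = person, 0 = background (hand-written)
background = [[0.0, 0.0], [0.0, 0.75]]   # replacement background

result = composite(frame, mask, background)
print(result)
```

Fractional mask values, such as the 0.5 above, give soft edges around hair and shoulders. Background blur can be expressed with the same blend by passing a blurred copy of the frame as `background`.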

The Eye Contact feature simulates eye contact by estimating and aligning gaze with the camera. The latest release includes performance improvements through NVIDIA CUDA Graph functionality.

Face Expression Estimation tracks a face and infers the subject’s expression. Estimated blend shape coefficients are used to animate a properly rigged model to accurately mirror the subject’s expression. The latest release includes an enhanced AI model, with a new six-degree-of-freedom (6DOF) head pose, and a new face model with updated blend shapes and face area partitioning.
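As a rough illustration of how blend shape coefficients drive a rigged model, the sketch below applies the standard linear blend shape formula: each vertex is the neutral position plus the weighted sum of per-shape offsets. This is the general technique, not the Maxine AR SDK API, and the shape names (`jaw_open`, `smile`), the two-vertex mesh, and the function name are hypothetical.

```python
# Linear blend shape sketch; shape names and mesh are hypothetical examples.
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def apply_blend_shapes(
    neutral: List[Vertex],
    deltas: Dict[str, List[Vertex]],
    coefficients: Dict[str, float],
) -> List[Vertex]:
    """Return neutral + sum over shapes of (coefficient * per-vertex offset)."""
    result = [list(v) for v in neutral]
    for name, weight in coefficients.items():
        for vertex, offset in zip(result, deltas[name]):
            for axis in range(3):
                vertex[axis] += weight * offset[axis]
    return [tuple(v) for v in result]


neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = {
    "jaw_open": [(0.0, -1.0, 0.0), (0.0, -0.5, 0.0)],
    "smile": [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0)],
}
# In a real pipeline these coefficients would be estimated per video frame.
coefficients = {"jaw_open": 0.5, "smile": 1.0}

posed = apply_blend_shapes(neutral, deltas, coefficients)
print(posed)  # [(0.5, -0.5, 0.0), (1.0, 0.25, 0.0)]
```

A 6DOF head pose would typically be applied afterward as a rigid rotation and translation of the posed mesh; that step is omitted here.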

Virtual conference feature chaining

Maxine’s features can also be chained together to power your presence in a virtual conference.
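Conceptually, chaining means feeding the output of one effect into the next. The sketch below shows that pattern with plain function composition. The effects are placeholders that only record that they ran, and the names `chain` and `tag` are assumptions for illustration rather than anything provided by Maxine.

```python
# Effect-chaining sketch with placeholder effects; not Maxine SDK code.
from functools import reduce
from typing import Callable, Dict, List

Frame = Dict[str, List[str]]
Effect = Callable[[Frame], Frame]


def chain(*effects: Effect) -> Effect:
    """Compose effects so that each one receives the previous one's output."""
    def run(frame: Frame) -> Frame:
        return reduce(lambda current, effect: effect(current), effects, frame)
    return run


def tag(name: str) -> Effect:
    """Build a placeholder effect that only records that it ran."""
    def effect(frame: Frame) -> Frame:
        return {**frame, "applied": [*frame["applied"], name]}
    return effect


pipeline = chain(tag("virtual_background"), tag("eye_contact"))
output = pipeline({"applied": []})
print(output["applied"])  # ['virtual_background', 'eye_contact']
```

Keeping each effect as an independent callable means the order can be changed, or a stage removed, without touching the other stages.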

“With NVIDIA Maxine integrated in Avaya’s Media Processing Core, hybrid teams can connect and collaborate more clearly and effectively from anywhere, with any device,” said Mike Kuch, senior director of Solutions Marketing at Avaya.

“We can easily combine advanced NVIDIA AI technologies with our own unique server side architecture, which allows our teams at Pexip to deliver an enhanced experience for virtual meetings,” said Eddie Clifton, senior vice president of Strategic Alliances at Pexip.

Learn more about NVIDIA Maxine, and read about how we worked with Avaya and Pexip in their delivery of virtual meetings.