
Improving Real-Time Communication Experiences with NVIDIA Maxine

The audio and video quality of real-time communication applications, such as virtual collaboration and content creation tools, is the true gauge of the user experience. Both depend heavily on network bandwidth and on the quality of user equipment.

Narrow network bandwidth and low-quality equipment produce unstable and noisy audio and video output. The problem is often compounded because users simultaneously produce and consume audio and video, multiplying the number of streams that can be corrupted. Content creation tools add further congestion.

To help you enhance the audio and video quality of real-time communication applications, NVIDIA Maxine offers GPU-accelerated SDKs that do the following:

  • Improve the standard microphone and webcam experience through the Video Effects, Audio Effects, and AR SDK features.
  • Chain multiple audio, video, and augmented reality features in real time, whether you use Maxine to build a new end-to-end pipeline or integrate it into an existing one.
  • Include transcription and translation when using Maxine together with NVIDIA Riva, an SDK for building conversational AI applications.

In short, Maxine enables the highest performance for virtual collaboration and content creation applications with high audio and video quality, on PCs, on-premises, or in cloud data centers with GPUs.
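Conceptually, such a pipeline is just a chain of per-frame stages. The plain-Python sketch below illustrates only the chaining idea; the stage names and frame representation are hypothetical stand-ins, not the Maxine API, which native applications consume through its C interface.

```python
# Illustrative sketch of chaining effects into one end-to-end pipeline.
# Stage names are hypothetical stand-ins, not Maxine SDK calls.

def denoise(frame):
    # Stand-in for audio/video noise removal
    return {**frame, "noisy": False}

def super_resolution(frame):
    # Stand-in for AI upscaling (here: double each dimension)
    return {**frame, "width": frame["width"] * 2, "height": frame["height"] * 2}

def virtual_background(frame):
    # Stand-in for background replacement
    return {**frame, "background": "office"}

def make_pipeline(*stages):
    """Compose stages left to right into a single per-frame callable."""
    def run(frame):
        for stage in stages:
            frame = stage(frame)
        return frame
    return run

pipeline = make_pipeline(denoise, super_resolution, virtual_background)
out = pipeline({"width": 640, "height": 360, "noisy": True, "background": "factory"})
# out: 1280x720, noise removed, background replaced
```

The point of the composition pattern is that stages can be reordered, added, or dropped without touching the rest of the pipeline, which mirrors how the modular Maxine SDKs are meant to be mixed and matched.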

In this post, you learn about:

  • Audio and video enhancements that users experience when you use Maxine SDKs for real-time communication applications.
  • The benefits of building end-to-end pipelines with Maxine, the NVIDIA Video Codec SDK, and Riva.
  • Real-world examples of Maxine SDK feature integration in video conferencing, content creation, and live streaming applications.
Figure 1. End-to-end pipeline with Maxine and Riva SDKs

Transform standard audio and video equipment into smart devices

NVIDIA Maxine consists of the Video Effects SDK, Audio Effects SDK, and AR SDK, with GPU-accelerated state-of-the-art AI features that are developed through hundreds of thousands of training hours.

With the Maxine Video Effects SDK, you can turn standard webcam input into high-quality video. The video improvements are as follows:

  • A crisper, sharper image with enriched details, achieved with the Maxine Super Resolution and Upscaler features.
  • Significant reduction of video noise caused by webcam sensor type, exposure, and low illumination levels, achieved with the Maxine Video Noise Removal feature.
  • Removal of blocky artifacts, ringing, and mosquito noise with the Maxine Artifact Reduction feature.
  • Virtual backgrounds of the user's choice, enabled with the Virtual Background feature (Figure 2).
Figure 2. The illustration of virtual background implementation

For more information about how you can run these effects with standard webcam input and integrate them into your application, see Transforming Noisy Low-Resolution into High-Quality Videos for Captivating End-User Experiences.
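To make "upscaling" concrete, a naive non-AI baseline is nearest-neighbor interpolation, which simply repeats pixels. The toy sketch below (plain Python on a tiny grayscale grid, not the Maxine implementation) shows that baseline; Maxine's Super Resolution instead uses a trained network to reconstruct plausible detail rather than duplicating pixels.

```python
def nearest_neighbor_upscale(image, factor):
    """Upscale a 2D grid (list of rows) by an integer factor by repeating pixels."""
    out = []
    for row in image:
        # Repeat each pixel horizontally...
        wide_row = [px for px in row for _ in range(factor)]
        # ...then repeat the widened row vertically.
        out.extend([wide_row[:] for _ in range(factor)])
    return out

small = [[0, 255],
         [255, 0]]
big = nearest_neighbor_upscale(small, 2)
# Each source pixel becomes a 2x2 block in the 4x4 result
```

Naive repetition preserves blockiness, which is exactly the artifact class the Super Resolution and Artifact Reduction features are designed to avoid.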

The Maxine Audio Effects SDK offers AI models that remove virtually any type of noise obstructing narrowband, wideband, and ultra-wideband audio, and improve conversation quality. The benefits of addressing poor audio quality with Maxine are as follows:

  • No unwanted background noise such as AC noise, construction sounds, traffic noise, or keyboard strokes. For more information about the full list of types of background noise removed by the Noise Removal feature, see About the Background Noise Suppression Effect.
  • No unintelligible or distorted voices: the reverberations that arise when talking in large spaces with reflective surfaces are removed with the Maxine Room Echo Cancellation feature.

For better end-to-end quality, you can combine Maxine audio effects features. For more information about how to build virtual collaboration and content creation applications with outstanding audio quality, see Achieving Noise-Free Audio for Virtual Collaboration and Content Creation Applications.
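For contrast with the AI-based approach, here is a toy amplitude noise gate in plain Python (a hypothetical illustration, not how Maxine works): samples below a threshold are zeroed. A gate like this mutes quiet hiss between words but cannot remove noise that overlaps speech, which is the gap trained models address.

```python
def noise_gate(samples, threshold):
    """Zero out samples whose magnitude falls below the threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]

# Quiet hiss (roughly +/-0.02) around a louder speech burst (roughly +/-0.5)
signal = [0.02, -0.01, 0.5, -0.45, 0.3, 0.015, -0.02]
clean = noise_gate(signal, threshold=0.1)
# Hiss samples are zeroed; the speech burst passes through unchanged
```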

The Maxine Augmented Reality SDK empowers you to create fun and engaging AR effects from a webcam video and use them in applications to engage users, understand users’ moods, or create 3D photo-realistic avatars.

The Maxine AR SDK offers Face Tracking, Face Landmark Tracking, and Face Mesh features (Figure 3).

Figure 3. Illustration of Maxine AR face-related features
  • Face Tracking creates a bounding box around the face and tracks face position over time.
  • Face Landmark Tracking recognizes facial features such as nose, eyes, and lips, and tracks them in real time.
  • Face Mesh represents the face with a 3D mesh that mimics the user's face as it changes in real time and can be used for facial authentication and for building avatars.

Face Tracking and Face Landmark Tracking can be used to track driver attentiveness or to detect face masks and eyewear.
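To make the relationship between these two features concrete, here is a minimal plain-Python sketch (not the Maxine API; the helper and the landmark coordinates are hypothetical) deriving a Face Tracking-style bounding box as the extent of a set of landmark points:

```python
def bounding_box(landmarks):
    """Return (x_min, y_min, x_max, y_max) enclosing all landmark points."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical landmark positions: eyes, nose tip, mouth corners
landmarks = [(120, 80), (180, 82), (150, 110), (130, 140), (170, 142)]
box = bounding_box(landmarks)
# box spans from the leftmost/topmost to the rightmost/bottommost landmark
```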

With the Maxine Body Pose Estimation feature, you can create applications for understanding the user’s pose and use it in human activity recognition, motion transfer, and virtual interactions in real time.
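As a minimal illustration of "understanding the user's pose", the sketch below computes a joint angle from three hypothetical 2D keypoints (shoulder, elbow, wrist) with basic vector math. This is the kind of quantity an activity-recognition or motion-transfer application might derive from pose-estimation output; it is plain Python, not the Maxine API.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Hypothetical keypoints forming a right angle at the elbow
shoulder, elbow, wrist = (0, 0), (0, 10), (10, 10)
angle = joint_angle(shoulder, elbow, wrist)
```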

Quickly build real-time, end-to-end pipelines

When building audio and video pipelines, developers often customize AI models to achieve the desired audio and video effects. In addition, their pipelines must support multiple platforms, such as embedded, PC, and server, while satisfying low-latency and high-throughput video processing requirements. Such pipelines are compute-intensive, so a tradeoff is often made between operating costs and the quality of the audio and video streams.

NVIDIA Maxine and the ecosystem around it are well positioned to tackle this challenge. By taking advantage of NVIDIA GPU acceleration and state-of-the-art AI models, you can build applications that provide a better user experience while managing the associated costs. Here's how this works.

Maxine AI features

At the core of NVIDIA Maxine are three SDKs that provide a multitude of AI features. These features increase video resolution, remove noise from both audio and video, and provide unique capabilities such as AR effects.

The NVIDIA ecosystem around Maxine consists of two key products, NVIDIA Video Codec SDK and NVIDIA Riva.

  • With the Video Codec SDK, you get access to the NVENC and NVDEC APIs, which provide hardware-accelerated encoding and decoding capabilities.
  • With NVIDIA Riva, you can build a conversational AI to help enhance the virtual collaboration experience by providing capabilities such as transcription and translation.

All these features are GPU-accelerated, so the volume of media that can be processed is considerably higher than with a CPU-based pipeline.

For example, consider a production floor manager in Germany who remotely interacts with executives in the USA to make critical business decisions. Factories are often in remote locations with limited internet connections, and production floors are frequently large rooms filled with background noise.

  • With the Maxine Noise Removal feature, the manager can remove the background noise of the production floor.
  • With Room Echo Cancellation, they can remove audio reverberations.
  • With Riva translation, the production floor manager and executives can communicate in their preferred languages.
  • With Video Noise Removal and Super Resolution, a noisy 360p video is transformed into a clear 1440p video.
  • With the Virtual Background feature, the manager can mask the production floor clutter with a clean background.

Figure 4 underscores the sheer difference in user experiences with and without Maxine.

Figure 4. Video conferencing pipeline with Maxine ecosystem: Maxine, Video Codec, and Riva

The advantage of Maxine's modular design is that you can easily select and integrate the SDKs you need into existing pipelines, or build new end-to-end pipelines from scratch. Maxine and the ecosystem around it enable you to quickly build a high-throughput, end-to-end pipeline that transforms noisy streams in real time into a noise-free, high-quality, and high-utility experience streamable to all devices.

Real-world examples of supercharging apps with Maxine SDKs

To show how you can integrate Maxine features, we have chosen a few real-world examples, one for each major use case.

Avaya Spaces

Avaya Spaces, a modern meeting and workstream collaboration platform built on CPaaS, offers high-definition video conferencing, video compositing, meeting recording, real-time transcription, and persistent collaboration at cloud scale.

Avaya's goal is to democratize real-time, high-quality media services at large scale, delivered with a browser-first experience regardless of users' equipment quality or location worldwide. To achieve these objectives and optimize media processing, Avaya combines its underlying cloud-based CPaaS with NVIDIA Maxine technology.

Cloud deployment with a range of GPUs attached to containers and VMs enables 100% uptime. Servers are spun up based on demand, facilitating large-scale, real-time, two-way video interactive meetings with thousands of participants.

Avaya uses the Maxine Noise Removal feature to stay within tight compute and latency budgets while meeting users' need for clean, crisp audio free of background noise, and their low tolerance for audio and speech gaps caused by low-quality equipment and poor network performance. Compared to the traditional DSP approach, Maxine Noise Removal is more powerful and covers broader scenarios. It achieves low latency while running not on endpoints but as close as possible to the network's edge. In addition, no buffering is needed in the audio pipeline, as the AI-based algorithm's delay is under 40 ms.
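The 40 ms figure can be put in perspective with simple arithmetic. The sample rate and frame size below are assumptions for illustration (the article states only the sub-40 ms algorithmic delay): at an assumed 48 kHz, the delay corresponds to fewer than two thousand samples, or a handful of 10 ms processing frames.

```python
SAMPLE_RATE_HZ = 48_000   # assumed sample rate, for illustration only
FRAME_MS = 10             # assumed processing frame length, for illustration only
DELAY_MS = 40             # algorithmic delay bound stated in the article

delay_samples = SAMPLE_RATE_HZ * DELAY_MS // 1000   # samples of algorithmic delay
delay_frames = DELAY_MS // FRAME_MS                 # equivalent 10 ms frames
```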

Avaya also uses the Maxine Virtual Background feature to overlay multiple speakers on the presentation for more engaging delivery. End users don't need special hardware or any software download. They can do it on any device, with the flexibility to create different types of layouts.

With the Maxine Virtual Background feature, Avaya delivers robust video segmentation for speakers moving and in complicated body positions. For more information about how Maxine enabled Avaya to provide a professional, high-quality, ubiquitous, end-user experience that’s accessible from any platform, see Avaya’s recent GTC session, How NVIDIA’s Maxine Changed the Way We Communicate.

Notch

Notch is a real-time graphics tool for 3D, VFX, and live-event visuals. Creating effects for stage shows often requires separate, costly camera and tracking solutions to track body motion. Producing visual effects can also be tricky when the full camera feed, including the background, must be processed.

With the Maxine real-time Face Tracking and Body Pose Estimation features, Notch empowers artists to considerably simplify live-event stage setup by reducing the need for custom hardware tracking systems. Instead, Notch enables the use of standard camera equipment. Users can also drive 3D character animation from the motion-captured skeletal rig (Figure 5).

Figure 5. The Maxine Body Pose Estimation feature enables entire human body tracking in 3D real time

With the Maxine AI-driven Virtual Background feature, Notch users can create video processing effects that separate people from the background and apply processing to just the talent on the stage, or to the background itself. The process works with high resolution and accuracy, even in complex conditions such as dark clothing and tricky lighting. For more information, see the demo video, Notch 0.9.23.195 NVIDIA patch release walk-through.

Be.Live

Another example of using the Maxine Virtual Background feature is in the live-streamer space. Be.Live is a live streaming studio that helps enterprises, SMBs, and retailers create professional-looking live streams without a learning curve. They run all processes related to the virtual background in the cloud.

Whether it's a small business looking to connect with an audience or an enterprise setting up employer-to-employee communication, Maxine and Be.Live provide a solution that enables top-tier background removal without a green screen behind the host. Besides enjoying a high-quality background with no need to upgrade webcams and studio setups, users save computing capacity and get a smoother stream.

Be.Live aims to implement the Maxine Virtual Background innovation in the Live Commerce ecosystem as well, as the technology can help many brands start their live shopping streams without much investment.

Summary

Interested in trying out Maxine to quickly build or supercharge your application? Head over to the Getting Started webpage and download Maxine SDKs. Maxine Video Effects, Audio Effects, and AR SDKs have sample applications that help you learn the API and evaluate the features.

If you have an NVIDIA RTX GPU, try the NVIDIA Broadcast App to experience the features locally.
