NVIDIA Collaborates with Hugging Face to Simplify Generative AI Model Deployments

As generative AI experiences rapid growth, the community has stepped up to foster this expansion in two significant ways: swiftly publishing state-of-the-art foundational models, and streamlining their integration into application development and production.

NVIDIA is aiding this effort with NVIDIA NIM by optimizing foundation models to enhance performance, allowing enterprises to generate tokens faster, reduce the cost of running models, and improve the end-user experience.

NVIDIA NIM

NVIDIA NIM inference microservices are designed to streamline and accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including the cloud, data centers, and workstations.

NIM leverages the TensorRT-LLM inference optimization engine, industry-standard APIs, and prebuilt containers to provide low-latency, high-throughput AI inference that scales with demand. It supports a broad range of LLMs, including Llama 3, Mixtral 8x22B, Phi-3, and Gemma, as well as optimizations for domain-specific applications in speech, image, video, healthcare, and more.
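
For a concrete sense of what those industry-standard APIs look like, here is a minimal sketch that sends a chat completion request to a NIM through its OpenAI-compatible interface. The base URL assumes a Llama 3 8B NIM container already serving locally on port 8000, and the model name shown is a placeholder; substitute the values for your own deployment.

```python
# Minimal sketch: querying a NIM through its OpenAI-compatible API.
# Assumptions: a Llama 3 8B NIM container is already serving locally on
# port 8000, and "meta/llama3-8b-instruct" is the served model name.
# Adjust base_url, api_key, and model for your environment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPU inference."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```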

NIM delivers superior throughput, enabling enterprises to generate tokens up to 5x faster. For generative AI applications, token processing is the key performance metric, and increased token throughput directly translates to higher revenue for enterprises.

By simplifying the integration and deployment process, NIM enables enterprises to rapidly move from AI model development to production, enhancing efficiency, reducing operational costs, and allowing businesses to focus on innovation and growth.

And now, we’re going a step further with Hugging Face to help developers run models in a matter of minutes.

Deploy NIM on Hugging Face with a few clicks

Hugging Face is a leading platform for AI models and has become the go-to destination for AI developers, making state-of-the-art models broadly accessible.

Leverage the power of seamless deployment with NVIDIA NIM, starting with Llama 3 8B and Llama 3 70B, on your preferred cloud service provider, all directly accessible from Hugging Face.

NIM delivers superior throughput and achieves near-100% utilization with multiple concurrent requests, enabling enterprises to generate text 3x faster.

The Llama 3 NIM is performance-optimized to deliver higher throughput, which translates to higher revenue and lower TCO. The Llama 3 8B NIM processes ~9,300 tokens per second, compared to ~2,700 tokens per second for the non-NIM version on Hugging Face Inference Endpoints.
Figure 1. Llama 3 8B NIM on Hugging Face achieves 3x higher throughput
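
If you want to sanity-check throughput on your own deployment, the sketch below times a single request and divides the completion tokens reported in the response by the elapsed time. The URL, token, and model name are placeholders, and a single request will not approach the aggregate figures above, which reflect many concurrent requests.

```python
# Rough single-request throughput check against an OpenAI-compatible
# /v1/chat/completions route (as exposed by NIM). URL, bearer token, and
# model name are placeholders; aggregate throughput requires many
# concurrent requests, so expect a much lower number from one call.
import time
import requests

ENDPOINT = "https://<your-endpoint>/v1/chat/completions"   # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}          # placeholder

payload = {
    "model": "meta/llama3-8b-instruct",  # assumed model name
    "messages": [{"role": "user", "content": "Summarize the benefits of GPU inference."}],
    "max_tokens": 512,
}

start = time.perf_counter()
data = requests.post(ENDPOINT, headers=HEADERS, json=payload, timeout=120).json()
elapsed = time.perf_counter() - start

tokens = data["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")
```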

The dedicated NIM endpoint on Hugging Face spins up instances on your preferred cloud, automatically fetches and deploys the NVIDIA optimized model, and enables you to start inference with just a few clicks, all in a matter of minutes.

Let’s take a closer look.

Step 1: Navigate to the Llama 3 8B or 70B instruct model page on Hugging Face, click the ‘Deploy’ drop-down, and then select ‘NVIDIA NIM Endpoints’ from the menu.

Hugging Face provides various serverless and dedicated endpoint options to deploy the models. NVIDIA NIM endpoints can be deployed on top cloud platforms.
Figure 2. Screenshot of the Llama 3 model page on Hugging Face
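
For developers who prefer to script endpoint creation rather than click through the UI, huggingface_hub also exposes a create_inference_endpoint helper; a minimal sketch is below. It provisions a standard dedicated endpoint, so the click-through steps that follow remain the reference path for selecting the NVIDIA NIM container, and the vendor, region, and instance strings are placeholders that should match the options shown in the UI for your account.

```python
# Sketch: creating a dedicated Inference Endpoint with huggingface_hub.
# This provisions a standard dedicated endpoint; the click-through flow in
# the steps below remains the reference path for selecting the NVIDIA NIM
# container. Vendor, region, and instance strings are placeholders and must
# match values offered in the Hugging Face UI for your account.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama3-8b-demo",                                   # endpoint name (arbitrary)
    repository="meta-llama/Meta-Llama-3-8B-Instruct",   # model to deploy
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                 # or "gcp"
    region="us-east-1",           # placeholder
    instance_size="x1",           # placeholder
    instance_type="nvidia-a100",  # placeholder; match the UI options
)
endpoint.wait()   # block until the endpoint reports it is running
print(endpoint.url)
```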

Step 2: A new page, ‘Create a new Dedicated Endpoint’ with NVIDIA NIM, is presented. Select your preferred CSP instance type to run the model on. A10G and A100 instances on AWS, and A100 and H100 instances on GCP, leverage NVIDIA-optimized model engines for best performance.

Create a new dedicated NIM endpoint by selecting your cloud service provider, region, and GPU configuration.
Figure 3. Select your Cloud Service Provider (CSP) and infrastructure configuration on the endpoint page

Step 3: In the ‘Advanced configuration’ section, choose ‘NVIDIA NIM’ from the Container Type drop-down, and then click on ‘Create Endpoint’.

Select the NVIDIA NIM container. The remaining configurations are pre-selected to take the guesswork out of picking the best options, allowing you to focus on building your solution.
Figure 4. Select the NVIDIA NIM container in the ‘Advanced configuration’ section of the page

Step 4: Within a matter of minutes, an inference endpoint is up and running.

The Llama 3 NIM endpoint is up and running. Now you can make API calls to the model and run your generative AI application.
Figure 5. NIM Endpoint is deployed and online
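
Once the endpoint reports that it is running, you can query it from any OpenAI-compatible client. The sketch below assumes the endpoint URL copied from the Hugging Face endpoint page fronts NIM’s OpenAI-compatible /v1 route and accepts a Hugging Face access token as the bearer credential; the model name is likewise a placeholder for your deployment.

```python
# Sketch: streaming a chat completion from the deployed NIM endpoint.
# Assumptions: the endpoint URL (copied from the Hugging Face endpoint page)
# exposes an OpenAI-compatible /v1 route and accepts a Hugging Face access
# token as the bearer credential; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",  # placeholder
    api_key="hf_xxx",  # your Hugging Face access token (placeholder)
)

stream = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three use cases for Llama 3 in retail."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```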

Get started

Deploy Llama 3 8B and 70B NIMs from Hugging Face to speed time to market for generative AI solutions, boost revenue with high token throughput, and reduce inference costs.

To experience and prototype applications with over 40 multimodal NIMs available today, visit ai.nvidia.com.

With free NVIDIA cloud credits, you can build and test prototype applications by integrating NVIDIA-hosted API endpoints with just a few lines of code.
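
As a sketch of what those few lines look like, the example below calls an NVIDIA-hosted Llama 3 endpoint through the API catalog’s OpenAI-compatible interface. The base URL and model identifier follow the catalog’s conventions at the time of writing; check the model card on ai.nvidia.com for the exact values and to generate an API key.

```python
# Sketch: calling an NVIDIA-hosted Llama 3 endpoint from the API catalog.
# The base URL and model identifier follow the catalog's conventions at the
# time of writing; check the model card on ai.nvidia.com for exact values.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # your NVIDIA API key (placeholder)
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is NVIDIA NIM?"}],
    max_tokens=200,
)
print(completion.choices[0].message.content)
```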
