Deploying AI-enabled applications and services presents enterprises with significant challenges:
- Performance is critical: it directly shapes user experience and competitive advantage, and it affects deployment costs, which influence your overall return on investment.
- Achieving scalability is essential to meet the fluctuating demands of the deployed AI application effectively without over-provisioning compute resources. This entails scaling up resources dynamically during peak periods to ensure smooth operation and scaling down during quieter times to optimize costs.
- Complexity adds another layer of difficulty, involving tasks such as optimizing the performance of multiple AI models, integrating them seamlessly into existing workflows, and managing the underlying infrastructure.
Addressing these challenges requires a full-stack approach that can optimize performance, manage scalability effectively, and navigate the complexities of deployment, enabling organizations to realize AI’s full potential while maintaining operational efficiency and cost-effectiveness.
Google Cloud and NVIDIA have collaborated to address these challenges and simplify AI inference deployments by combining the performance of the NVIDIA AI platform and the ease of serverless computing in the cloud.
Cloud Run, Google Cloud’s fully managed serverless container runtime, has added support for NVIDIA L4 Tensor Core GPUs, available in preview. You can now run real-time AI applications on demand, accelerated at scale, without worrying about infrastructure management. Combined with the power of NVIDIA NIM microservices, Cloud Run can significantly simplify the complexities of optimizing and serving AI models for production while maximizing application performance.
Deploy real-time AI-enabled applications
Cloud Run enables you to deploy and run containerized applications by abstracting away infrastructure management and dynamically allocating resources on demand. It automatically scales applications based on incoming traffic so you don’t have to provision excess compute resources to handle peak loads. With its fast instance starts and scale to zero, you also don’t have to maintain idle resources during periods of low demand.
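For illustration, these scaling bounds can be set directly at deploy time. The sketch below uses a placeholder service name and Google's public sample container; the scaling flags shown are standard gcloud run deploy options.

# A minimal sketch (service name and image are placeholders): Cloud Run
# scales out with traffic, --min-instances=0 lets the service scale to
# zero when idle, and --max-instances caps instances during peaks.
$ gcloud run deploy my-ai-service \
    --image=us-docker.pkg.dev/cloudrun/container/hello \
    --region=us-central1 \
    --min-instances=0 \
    --max-instances=10 \
    --concurrency=80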
Cloud Run support for NVIDIA L4 Tensor Core GPUs marks a significant leap from its previous CPU-only offerings.
The NVIDIA L4 GPU is optimized for inference at scale across a broad range of AI applications that deliver personalized experiences, including recommendations, voice-based AI assistants, generative AI, visual search, and contact center automation. L4 GPUs deliver up to 120x higher AI video performance than CPU solutions and 2.7x more generative AI inference performance than the previous generation.
Google Cloud was the first cloud to offer NVIDIA L4 GPUs with its G2 VMs and they are supported across Google Cloud services including Google Compute Engine (GCE), Google Kubernetes Engine (GKE), and Vertex AI.
Companies like Let’s Enhance, Wombo, Writer, Descript, and AppLovin are using the power of NVIDIA L4 GPUs to bring generative AI–powered applications to life and deliver delightful experiences to their customers.
Adding support for NVIDIA L4 on Cloud Run enables you to deploy real-time inference applications with lightweight generative AI models such as Gemma-2B/7B, Llama3-8B, and Mixtral-8x7B, combined with the scalability, per-second billing, low latency, and fast cold-start times of Cloud Run’s serverless platform.
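As a rough sketch of what a GPU-enabled deployment looks like outside the NIM-specific scripts covered later (the service name and image are placeholders, and the GPU flags reflect the preview at the time of writing), an L4 GPU can be requested directly at deploy time:

# Illustrative sketch only: requesting one NVIDIA L4 GPU for a Cloud Run
# service. Flag names reflect the GPU preview and may evolve; the image is
# a placeholder for your own inference container.
$ gcloud beta run deploy my-llm-service \
    --image=us-central1-docker.pkg.dev/YOUR_PROJECT/your-repo/your-llm:latest \
    --region=us-central1 \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --no-cpu-throttling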
Performance-optimized serverless AI inference
Optimizing the performance of the AI model being deployed is crucial because it directly affects the resources required and influences the overall costs of deploying the AI-enabled application.
To address this challenge, NVIDIA introduced NVIDIA NIM, a set of optimized cloud-native microservices designed to simplify and accelerate the deployment of AI models. NIM provides pre-optimized, containerized models that can be easily integrated into applications, reducing development time and maximizing resource efficiency.
By using NVIDIA NIM on Cloud Run, you can deploy high-performance AI applications using optimized inference engines that unlock the full potential of NVIDIA L4 GPUs and deliver the best throughput and latency, without the need for expertise in inference performance optimization.
Part of NVIDIA AI Enterprise on Google Cloud Marketplace, NIM offers flexible integration with an OpenAI API-compatible programming model and custom extensions, while prioritizing enterprise-grade security by using safetensors, continuously monitoring and patching CVEs, and conducting regular internal penetration tests. This ensures that AI applications are robust, secure, and well-supported, facilitating a smooth transition from development to production.
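To make the programming model concrete, here is a hedged sketch of a chat completions request against a NIM endpoint; the base URL is a placeholder, and the model name follows the NIM convention for Llama3-8B-Instruct:

# Sketch: OpenAI API-compatible chat completions request to a NIM endpoint.
# NIM_BASE_URL is a placeholder for wherever the microservice is reachable.
$ curl -X POST "$NIM_BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta/llama3-8b-instruct",
          "messages": [{"role": "user", "content": "What is NVIDIA NIM?"}],
          "max_tokens": 128
        }'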
In addition to Cloud Run, NVIDIA NIM can be deployed across other Google Cloud services, including Google Kubernetes Engine (GKE) and Google Compute Engine (GCE), giving you the choice of the level of abstraction you need for building and deploying AI-enabled applications.
Deploying a Llama3-8B-Instruct NIM microservice on Google Cloud Run with NVIDIA L4
Here’s how you can deploy a Llama3-8B-Instruct model with Cloud Run on an NVIDIA L4 GPU using NIM. Cloud Run currently supports attaching one NVIDIA L4 GPU per Cloud Run instance. As a prerequisite, install the Google Cloud SDK on your workstation.
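In addition to the SDK, the project typically needs to be authenticated and have the relevant APIs enabled, and an NGC API key is required to pull NIM containers from nvcr.io. A minimal sketch of the common setup steps (the API names are the standard Google Cloud service identifiers):

# Sketch of common prerequisites: authenticate, select the target project,
# and enable the Cloud Run and Artifact Registry services used by the scripts.
$ gcloud auth login
$ gcloud config set project YOUR_PROJECT_ID
$ gcloud services enable run.googleapis.com artifactregistry.googleapis.com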
Clone the repository:
$ git clone https://github.com/NVIDIA/nim-deploy
$ cd nim-deploy/cloud-service-providers/google-cloud/cloudrun
Set the environment variables needed for launch:
$ cat env
export SERVICE_ACCOUNT_ID=<Put your service account>
export PROJECTID=<Put your project ID>
export PROJECTUSER=<Put your user name>
export PROJECTNUM=<Put your project number>
export REGION=<Put your region>
export GCSBUCKET=<Put your GCS bucket>
export SERVICE_NAME=llama-3-8b-instruct
# ---- entries below created by build_nim.sh
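Several of these values can be looked up from the active gcloud configuration rather than entered by hand; a small sketch (standard gcloud queries, shown only for convenience):

# Optional sketch: look up values for the env file from gcloud.
$ gcloud config get-value project        # PROJECTID
$ gcloud config get-value account        # PROJECTUSER
$ gcloud projects describe "$(gcloud config get-value project)" \
    --format='value(projectNumber)'      # PROJECTNUM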
Edit the Dockerfile with the appropriate NIM microservice name needed for deployment. Place the desired model URL from NGC in the FROM statement:
FROM nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
Build the container for launch:
$ source ./env && ./build_nim.sh
Deploy the container by executing the run.sh script:
$ source ./env && ./run.sh
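After the deployment finishes, the service URL can be retrieved and the endpoint probed to confirm the model is serving; the sketch below assumes the service requires authenticated invocations (drop the Authorization header if it allows unauthenticated access):

# Sketch: fetch the deployed service URL and verify the NIM is serving.
$ export SERVICE_URL=$(gcloud run services describe "$SERVICE_NAME" \
    --region "$REGION" --format='value(status.url)')
$ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
    "$SERVICE_URL/v1/models"

A chat completions request like the one shown earlier can then be sent to $SERVICE_URL/v1/chat/completions.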
Ready to get started?
The powerful combination of the NVIDIA AI platform, including NVIDIA NIM and NVIDIA L4 GPUs, together with Google Cloud Run, addresses the critical challenges of performance, scalability, and complexity inherent in deploying AI applications. This synergy not only accelerates deployment but also boosts application performance, helping organizations make the most of AI while keeping operations efficient and costs low.
You can experience and prototype with NVIDIA NIM microservices through the NVIDIA API catalog, enabling you to test and refine your applications. You can then download the NIM containers to continue development, research, and testing on Google Cloud Run as part of the free NVIDIA Developer Program.
If you are looking for enterprise-grade security, support, and API stability, you can access NIM through a free 90-day NVIDIA AI Enterprise license. You can also try a hands-on lab with NIM on NVIDIA LaunchPad.
Cloud Run with NVIDIA L4 GPU support is currently in preview and available in the us-central1 Google Cloud region. For more information about this feature and to see demos in action, see the launch event livestream and sign up for access today!