Inference is an important part of the machine learning lifecycle and occurs after you have trained your model. It is when a business realizes value from its AI investment. Common applications of AI include image classification (“this is an image of a tumor”), recommendation (“here is a movie you will like”), transcription of speech audio into text, and decision-making (“turn the car to the left”).
Systems for deep learning training require substantial computing capability, but after an AI model has been trained, fewer resources are needed to run it in production. The most important factors in determining the system requirements for inference workloads are the model being run and the deployment location. This post discusses these areas, with a particular focus on AI inference at the edge.
AI model inference requirements
To help determine the optimal inference deployment configuration, a tool like NVIDIA Triton Model Analyzer makes recommendations based on the specific AI models that are running. An inference compiler like NVIDIA TensorRT can reduce the resource requirements for inference by optimizing the model to run with the highest throughput and lowest latency while preserving accuracy.
Even with these optimizations, GPUs are still critical to meeting the business service level agreements (SLAs) and other requirements for inference workloads. Results from the MLPerf Inference 2.0 benchmark demonstrate that NVIDIA GPUs are more than 100x faster than CPU-only systems. GPUs can also provide the low latency required for workloads that need a real-time response.
Deployment locations of inference workloads
AI inference workloads can be found both in the data center and at the edge. Examples of inference workloads running in a data center include recommender systems and natural language processing.
There is great variety in the way these workloads can be run. For example, many different models can be served simultaneously from the same servers, and there can be hundreds, thousands, or even tens of thousands of concurrent inference requests in flight. In addition, data center servers often run other workloads besides AI inference.
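One way to reason about how many requests are in flight is Little's law, which relates average concurrency to throughput and mean latency. The sketch below is illustrative only: the function name and the example numbers are assumptions, not measurements from this post.

```python
def requests_in_flight(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x mean time in system."""
    if throughput_rps < 0 or mean_latency_s < 0:
        raise ValueError("throughput and latency must be non-negative")
    return throughput_rps * mean_latency_s


# Hypothetical load: 20,000 requests per second at a 50 ms mean latency.
in_flight = requests_in_flight(20_000, 0.050)
print(f"about {in_flight:.0f} requests in flight on average")
```

Under these assumed numbers, a server fleet would need to hold roughly a thousand requests at once, which is why batching and concurrent model execution matter for data center sizing.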
There is no “one size fits all” solution when it comes to system design for data center inference.
Inference applications running at edge locations represent an important and growing class of workloads. Edge computing is driven by the requirement for low-latency, real-time results as well as the desire to reduce data transit for both cost and security reasons. Edge systems run in locations physically close to where data is collected or processed, in settings such as retail stores, factory floors, and cell phone base stations.
Compared with data center inference, system requirements for AI inference at the edge are easier to articulate, because these systems are usually designed to focus on a narrow range of inference workloads.
Edge inference typically involves either a camera or other sensor gathering data that must be acted upon. An example of this could be sensor-equipped video cameras in chemical plants being used to detect corrosion in pipes and alert staff before any damage is done.
Edge inference system requirements
Servers for AI training must be designed to process large amounts of historical data to learn the right values for model parameters. By contrast, servers for edge inference process streaming data gathered in real time at the edge location, and that data is smaller in volume.
As a result, system memory doesn’t need to be as large, and the number of CPU cores can be lower. The network adapter doesn’t need as much bandwidth, and the local storage on the server can be smaller because it isn’t caching any training data sets.
However, both the networking and storage should be configured to enable the lowest latency, as the ability to respond as quickly as possible is critical.
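A simple way to see why every component's latency matters is to add up a per-request latency budget. The sketch below is a minimal illustration: the stage names, the millisecond values, and the 33 ms deadline (roughly one frame at 30 frames per second) are all hypothetical.

```python
def check_latency_budget(stage_latencies_ms: dict[str, float],
                         budget_ms: float) -> tuple[bool, str]:
    """Return whether the stages fit in the budget, and the slowest stage."""
    total_ms = sum(stage_latencies_ms.values())
    slowest = max(stage_latencies_ms, key=stage_latencies_ms.get)
    return total_ms <= budget_ms, slowest


# Hypothetical per-request stages for a camera-based edge application.
stages_ms = {
    "sensor_capture": 5.0,
    "network_transfer": 2.0,
    "storage_read": 1.0,
    "model_inference": 12.0,
}
fits, slowest_stage = check_latency_budget(stages_ms, budget_ms=33.0)
print(fits, slowest_stage)
```

Because the stages add up, time spent in networking or storage directly reduces the time left for the model itself.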
| Component | AI training in the data center | AI inferencing at the edge |
|---|---|---|
| CPU | Fastest CPUs with high core count | |
| GPU | Fastest GPUs with most memory, more GPUs per system | Lower-power GPU, or larger GPU with MIG, one or two GPUs per system |
| Memory | Large memory size | Average memory size |
| Storage | High bandwidth NVMe flash drive, one per CPU | Average bandwidth, lowest-latency NVMe flash drive, one per system |
| Network | Highest bandwidth network adapter, Ethernet or InfiniBand, one per GPU pair | Average bandwidth network adapter, Ethernet, one per system |
| PCIe | Devices balanced across PCIe topology; PCIe switch for multi-GPU, multi-NIC deployments | Devices balanced across PCIe topology; PCIe switch not required |
Edge systems are by definition deployed outside traditional data centers, often in remote locations. The environment is often constrained in terms of space and power. These constraints can be met by using smaller systems in conjunction with low-power GPUs, such as the NVIDIA A2.
If the inference workload is more demanding, and power budgets allow it, then a larger GPU, such as the NVIDIA A30 or NVIDIA A100, can be used. The Multi-Instance GPU (MIG) feature enables these GPUs to service multiple inference streams simultaneously so that the system overall can provide highly efficient performance.
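To make the idea of servicing multiple streams concrete, the sketch below spreads inference streams across MIG instances in round-robin order. It is plain scheduling logic and not an NVIDIA API; the function name and stream names are hypothetical. The instance limits reflect the documented maximums of seven instances on an A100 and four on an A30.

```python
from collections import defaultdict

# Maximum number of MIG instances per GPU model.
MAX_MIG_INSTANCES = {"A100": 7, "A30": 4}


def assign_streams(gpu: str, streams: list[str]) -> dict[int, list[str]]:
    """Round-robin inference streams over a GPU's MIG instances (illustrative)."""
    instance_count = MAX_MIG_INSTANCES[gpu]
    assignment: dict[int, list[str]] = defaultdict(list)
    for index, stream in enumerate(streams):
        assignment[index % instance_count].append(stream)
    return dict(assignment)


# Hypothetical camera streams sharing one A30.
cameras = [f"cam{i}" for i in range(6)]
print(assign_streams("A30", cameras))
```

In a real deployment, the instances themselves would be created and sized with NVIDIA tooling; this sketch only shows how work might then be distributed over them.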
Other factors for edge inference
Beyond system requirements, there are other factors to consider that are unique to the edge.
Security is a critical aspect of edge systems. Data centers by their nature can provide a level of physical control as well as centralized management that can prevent or mitigate attempts to steal information or take control of servers.
Edge systems must be designed with the assumption that their deployment locations are not physically secured, and that they cannot benefit from as many of the access control mechanisms found in data center IT management systems.
Trusted Platform Module (TPM) is one technology that can help greatly with host security. Configured appropriately, a TPM can help ensure that the system boots only with firmware and software that are digitally signed and unaltered. Additional security checks such as signed containers ensure that applications haven’t been tampered with, and disk volumes can be encrypted with keys that are securely stored in the TPM.
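The way a TPM detects altered boot components can be illustrated with a simplified simulation of its "extend" operation, in which each component's hash is folded into a running register value. This is a sketch of the concept only and does not call a real TPM; the function names and component strings are hypothetical.

```python
import hashlib


def extend(register: bytes, component: bytes) -> bytes:
    """Simulate an extend: new = SHA-256(old || SHA-256(component))."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(register + measurement).digest()


def measure_boot_chain(components: list[bytes]) -> bytes:
    """Fold every boot component, in order, into one register value."""
    register = bytes(32)  # the register starts as all zeros
    for component in components:
        register = extend(register, component)
    return register


# Hypothetical boot chains: one intact, one with a modified bootloader.
trusted = measure_boot_chain([b"firmware-v1", b"bootloader-v1", b"kernel-v1"])
tampered = measure_boot_chain([b"firmware-v1", b"bootloader-modified", b"kernel-v1"])
print(trusted.hex())
print(trusted == tampered)
```

Because the final value depends on every component and on their order, a key that is released only when the register matches the expected value stays locked if anything in the chain changes.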
Another important consideration is the encryption of all network traffic to and from the edge system. Signed network adapters with encryption acceleration hardware, as found in NVIDIA ConnectX products, ensure that this protection doesn’t come at the expense of data transfer rates.
For certain use cases, such as on a factory floor for automation control or in an enclosure next to a telecommunications antenna tower, edge systems must perform well under potentially harsh conditions, such as elevated temperatures, heavy shock and vibration, and dust.
Ruggedized servers designed for these purposes are increasingly available with GPUs, allowing even these extreme use cases to benefit from much higher performance.
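At the application level, encrypting traffic usually means using TLS for every connection. The sketch below builds a client-side TLS context with Python's standard `ssl` module; it shows software configuration only and says nothing about adapter-level hardware acceleration. The function name is hypothetical, and the TLS 1.2 floor is an assumed policy choice.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Create a TLS client context that verifies the server's identity."""
    # The default context requires a valid certificate and checks the hostname.
    context = ssl.create_default_context()
    # Assumed policy: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.verify_mode, context.check_hostname, context.minimum_version)
```

A context like this can then be passed to the networking library that carries data between the edge system and its management or data services.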
Choose an end-to-end platform for inference
NVIDIA has extended the NVIDIA-Certified Systems program to include categories for edge deployments that run outside a traditional data center. The design criteria for these systems include all of the following:
- NVIDIA GPUs
- CPU, memory, and network configurations that provide optimal performance
- Security and remote management capabilities
The Qualified System Catalog has a list of NVIDIA-Certified Systems from NVIDIA partners. The list can be filtered by category of system, including the following categories, which are well suited to inference workloads:
- Data Center servers are validated for performance and scale-out capabilities on a variety of data science workloads and are ideal for data center inference.
- Enterprise Edge systems are designed to be deployed in controlled environments, such as the back office of a retail store. Systems in this category are tested in data center-like environments.
- Industrial Edge systems are designed for industrial or rugged environments, such as a factory floor or cell phone tower base station. Systems that achieve this certification must pass all tests while running within the environment for which the system was designed, such as elevated temperature environments outside of the typical data center range.
In addition to certifying systems for the edge, NVIDIA has also developed enterprise software to run and manage inference workloads.
NVIDIA Triton Inference Server streamlines AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. It helps you deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
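As a rough illustration of what a client sends to the server, the sketch below builds a request for Triton's HTTP inference endpoint, which follows the KServe v2 inference protocol. It only constructs the URL and JSON body and does not contact a server. The helper functions, the model name `corrosion_detector`, and the tensor name `input__0` are hypothetical.

```python
import json
from math import prod


def infer_url(host: str, model_name: str) -> str:
    """URL of the HTTP inference endpoint for one model."""
    return f"http://{host}/v2/models/{model_name}/infer"


def build_infer_request(input_name: str, shape: list[int], data: list[float],
                        datatype: str = "FP32") -> dict:
    """Build a JSON-serializable request body with a single input tensor."""
    if len(data) != prod(shape):
        raise ValueError("data length must match the product of shape")
    return {"inputs": [{"name": input_name, "shape": shape,
                        "datatype": datatype, "data": data}]}


# Hypothetical model and tensor names, for illustration only.
url = infer_url("localhost:8000", "corrosion_detector")
body = build_infer_request("input__0", [1, 4], [0.1, 0.2, 0.3, 0.4])
print(url)
print(json.dumps(body))
```

In practice, the official Triton client libraries wrap this protocol, so hand-built requests like this one are mainly useful for understanding or debugging.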
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that is optimized so every organization can succeed with AI and certified to deploy in both data center and edge locations. It includes global enterprise support so that AI projects stay on track.
NVIDIA Fleet Command is a cloud service that centrally connects systems at edge locations to securely deploy, manage, and scale AI applications from one dashboard. It’s turnkey with layers of security protocols and can be fully functional in hours.
By choosing an end-to-end platform consisting of certified systems and infrastructure software, you can kick-start your AI production deployments and have inference applications deployed and running much more quickly than if you tried to assemble a solution from individual components.
Learn more about the NVIDIA AI Inference platform
There’s a lot more involved when it comes to deep learning inference. The NVIDIA AI Inference Platform Technical Overview has an in-depth discussion of this topic, including a view of the end-to-end deep learning workflow, the details of taking AI-enabled applications from prototype to production deployments, and software frameworks for building and running AI inference applications.
Sign up for Edge AI News to stay up to date with the latest trends, customer use cases, and technical walkthroughs.