The boom in open source generative AI models is pushing beyond data centers into machines operating in the physical world. Developers are eager to deploy these models at the edge, enabling physical AI agents and autonomous robots to automate heavy-duty tasks.
A key challenge is efficiently running multi-billion-parameter models on edge devices with limited memory. With ongoing constraints on memory supply and rising costs, developers are focused on achieving more with less.
The NVIDIA Jetson platform supports popular open models while delivering strong runtime performance and memory optimization at the edge. For edge developers, memory footprint determines whether a system functions. Unlike cloud environments, edge devices operate under strict memory limits, with CPU and GPU sharing constrained resources.
Inefficient memory use can lead to bottlenecks, latency spikes, or system failure. Meanwhile, modern edge applications often run multiple pipelines—such as detection, tracking, and segmentation—making efficient memory management critical for stable, real-time performance under power and thermal constraints.
Optimizing memory usage provides clear benefits. Developers can improve performance on the same hardware by reducing overhead and increasing concurrency, while enabling more complex workloads like LLMs, multi-camera systems, and sensor fusion. It also reduces system cost by fitting into smaller memory configurations and improves efficiency (performance per watt) by minimizing bottlenecks and maximizing GPU utilization.
This blog explores optimization strategies to help developers maximize performance, efficiency, and capability on resource-constrained edge systems.
Edge AI software stack
Let’s dive deeper into the runtime software stack for edge devices. This isn’t an exhaustive guide to full-memory optimization, but a reference framework to spark ideas and help developers identify new ways to improve their stacks. The memory savings show what NVIDIA teams achieved. Experienced users may reach higher efficiencies, while others can use these examples as a starting point to better use resources on NVIDIA Jetson and NVIDIA IGX platforms.
This blog explores five key layers—starting from the foundation with Jetson BSP and NVIDIA JetPack, and moving up through the inference pipeline, inference frameworks, and quantization techniques. Let’s dive into each layer step by step.

Foundation layers: Board support package and software stack
The NVIDIA Jetson Board Support Package (BSP) and NVIDIA JetPack layer form the foundation of the software stack, interfacing with hardware. It includes the Linux kernel, device drivers, firmware, and the JetPack SDK with components that enable compute, multimedia, and accelerated I/O. This layer abstracts hardware complexity—GPUs, CPUs, memory, and peripherals—providing a stable, optimized base for higher-level services and applications.
At this layer, memory savings can be achieved by disabling unused services and reclaiming reserved carveout regions. These optimizations reduce overhead and free DRAM for application workloads without affecting core functionality. The following sections highlight key techniques to enable these optimizations.
These BSP and JetPack layer optimization guidelines apply to Jetson Orin NX and Jetson Orin Nano.
| Knobs | Memory that may be reclaimed | Instructions |
| --- | --- | --- |
| Disabling the graphical desktop, including display and UI-related services | Up to 865 MB | `sudo systemctl set-default multi-user.target` |
| Disabling networking, connectivity, and non-essential journaling services | Up to 32 MB | `sudo systemctl disable <service-name>` |
Carveout regions on NVIDIA Jetson Orin NX, along with kernel- and user-space optimizations, are key areas for improving overall system efficiency. The following sections explore practical techniques for optimizing these layers.
Carveout optimization
Carveout regions in NVIDIA Jetson Orin NX and NVIDIA Jetson Orin Nano are reserved physical memory set aside at boot for specific hardware engines, firmware, and real-time subsystems. They aren’t accessible to Linux or NVIDIA CUDA applications and are used by on-chip microcontrollers and accelerators. These act as dedicated memory pools to ensure isolation, security, and deterministic behavior. Depending on your pipeline and application needs, some carveouts can be disabled to further optimize memory usage.
| Carveout | When to disable | How to disable | Reclaimed DRAM size |
| --- | --- | --- | --- |
| CARVEOUT_DCE_TSEC | When display isn’t needed | Refer to note 1 and re-flash | 1 MB |
| CARVEOUT_DCE | When display isn’t needed | Refer to note 1 and re-flash | 32 MB |
| CARVEOUT_DISP_EARLY_BOOT_FB | When display isn’t needed | Refer to note 1 and re-flash | 34 MB |
| CARVEOUT_TSEC_DCE | When display isn’t needed | Refer to note 1 and re-flash | 1 MB |
| CARVEOUT_CAMERA_TASKLIST | When camera isn’t needed | Refer to note 2 and re-flash | 32 MB |
| CARVEOUT_RCE | When camera isn’t needed | Refer to note 2 and re-flash | 1 MB |
Note 1: The following example shows how to reclaim memory when display isn’t required. Add this snippet inside the /misc/carveout/ node of Linux_for_Tegra/bootloader/generic/BCT/tegra234-mb1-bct-misc-p3767-0000.dts:
// Display-related carveouts
aux_info@CARVEOUT_BPMP_DCE {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
aux_info@CARVEOUT_DCE_TSEC {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
aux_info@CARVEOUT_DCE {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
aux_info@CARVEOUT_DISP_EARLY_BOOT_FB {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
aux_info@CARVEOUT_TSEC_DCE {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
Update /mb2-misc/auxp_controls@3/ node’s content of Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi to:
/* Control fields for DCE cluster. */
auxp_controls@3 {
enable_init = <0>;
enable_fw_load = <0>;
enable_unhalt = <0>;
reset_vector = <0x40000000>;
};
Remove the entire /mb2-misc/auxp_ast_config@6 and /mb2-misc/auxp_ast_config@7 nodes from Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi.
Use the dtc tool to decompile the kernel dtb to dts, mark the status of the /display@13800000 node as disabled, then recompile the dts to the kernel dtb:
display@13800000 {
status = "disabled";
};
Note 2: The following example shows how to reclaim memory when a camera isn’t needed. Add this snippet inside the /misc/carveout/ node of Linux_for_Tegra/bootloader/generic/BCT/tegra234-mb1-bct-misc-p3767-0000.dts:
aux_info@CARVEOUT_CAMERA_TASKLIST {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
aux_info@CARVEOUT_RCE {
pref_base = <0x0 0x0>;
size = <0x0 0x0>; // 0MB
alignment = <0x0 0x0>; // 0MB
};
Update /mb2-misc/auxp_controls@2/ node’s content of Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi to:
/* Control fields for RCE cluster. */
auxp_controls@2 {
enable_init = <0>;
enable_fw_load = <0>;
enable_unhalt = <0>;
};
Kernel-side optimization
Jetson Orin, Orin NX, and Orin Nano platforms use an NVIDIA-specific Input/Output Memory Management Unit (IOMMU) to handle Direct Memory Access (DMA) address translation for peripherals, enabling devices to access system memory regardless of physical address.
The Linux Software I/O Translation Lookaside Buffer (SWIOTLB) is a workaround for systems without a hardware IOMMU or with peripherals limited to 32-bit DMA. Since Orin includes a robust hardware IOMMU that remaps DMA addresses, SWIOTLB is generally redundant.
SWIOTLB tuning
For specific use cases or non-standard peripherals requiring SWIOTLB—or when kernel logs indicate DMA issues—the reservation size can be adjusted using boot arguments.
The swiotlb= parameter defines the number of I/O TLB slabs (each 2 KB):
Total size (bytes) = swiotlb_value × 2,048
Example (4 MB buffer):
- 4 MB ÷ 2 KB = 2,048 slabs
- Kernel command:
swiotlb=2048
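The slab arithmetic can be sketched as a small shell calculation (the 4 MB target is just an example; pick a size that matches your peripherals’ DMA needs):

```shell
# Convert a desired SWIOTLB buffer size into the slab count the kernel
# expects on its command line; each I/O TLB slab is 2 KB.
buf_mb=4                              # example target buffer size in MB
slab_kb=2                             # fixed slab size
slabs=$(( buf_mb * 1024 / slab_kb ))
echo "swiotlb=${slabs}"               # append this to the kernel boot arguments
```

On Jetson, the kernel command line is typically set in /boot/extlinux/extlinux.conf.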
User-space optimizations
On Jetson, total application memory includes:
- CPU memory used by processes and system services.
- Hardware (NvMap) memory used by CUDA, multimedia buffers, and accelerators.
Both share the same physical memory pool, so optimizing one benefits the other.
Reduce CPU memory usage
Start by identifying processes that consume the most CPU memory. Background services—such as GUI or audio components—can use significant memory and may be unnecessary in production.
- Measure CPU memory usage
Use procrank to analyze memory usage:
$ git clone https://github.com/csimmonds/procrank_linux.git
$ cd procrank_linux/
$ make
$ sudo ./procrank
The output is sorted by PSS (Proportional Set Size), reflecting actual physical memory usage.
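If building procrank isn’t an option, the same Pss field can be read directly from the kernel’s per-process rollup files; a minimal sketch for a single process (aggregating across all PIDs requires root):

```shell
# Read the proportional set size (PSS) of the current shell from
# /proc/<pid>/smaps_rollup -- the same metric procrank sorts by.
pss_kb=$(awk '/^Pss:/ {print $2}' /proc/self/smaps_rollup)
echo "PSS of this shell: ${pss_kb} KB"
```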
- Optimize based on findings and identify candidate processes:
  - gnome-shell or Xorg (GUI)
  - pulseaudio (audio)
  - Unused python3 processes
These are often unnecessary in production and can be disabled to reclaim memory. In headless deployments, disabling GUI services can free significant system memory.

- Analyze and measure hardware memory usage
In addition to CPU memory, GPU and multimedia allocations can impact available memory.
$ sudo cat /sys/kernel/debug/nvmap/iovmm/clients
This shows memory usage across processes using NvMap (e.g., CUDA, video pipelines).
- Optimize hardware memory
Identify processes using large GPU or buffer allocations. As with CPU optimization, services like GUI pipelines (gnome-shell, Xorg) may consume unnecessary hardware memory. Reducing these allocations frees up more memory for AI workloads.

Inferencing pipeline
This layer manages the end-to-end data flow through preprocessing, inference, and postprocessing to produce actionable outputs. Frameworks like NVIDIA DeepStream provide a high-performance, GPU-accelerated pipeline for streaming data such as video and sensor inputs. They handle decoding, batching, inference, tracking, and analytics in a streamlined workflow, enabling scalable processing. This layer abstracts complexity and optimizes data movement and compute utilization for efficient, production-ready AI applications.
Learn how to optimize the inferencing pipeline to reduce memory footprint and improve performance through configuration and implementation choices. While shown with DeepStream, these principles apply broadly across frameworks and applications.
| Knob | Memory that may be reclaimed |
| --- | --- |
| Container vs. bare metal | Up to 70 MB |
| Switching from Python to C++ | Up to 84 MB |
| Tweaking pipeline configuration** (disable Tiler/OSD, use FakeSink) | Up to 258 MB |
| Total | 412 MB |
**In a DeepStream-style inference pipeline, disabling Tiler/OSD and using FakeSink removes display stages needed for visualization but unnecessary in headless or production deployments. This saves memory, reduces GPU load, and improves throughput.
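As an illustration, in a deepstream-app–style configuration file these display stages map to a few config groups; a minimal sketch (group and key names follow the deepstream-app reference configuration format):

```
[tiled-display]
enable=0        # disable the tiler stage

[osd]
enable=0        # disable on-screen display

[sink0]
enable=1
type=1          # 1 = FakeSink: discard output instead of rendering
```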
Inferencing frameworks
The inference-serving framework layer for LLMs focuses on efficiently deploying and scaling large language models in production, with frameworks like vLLM, SGLang, and Llama.cpp leading this space. These frameworks optimize inference through techniques such as continuous batching, KV cache management, and efficient memory utilization to maximize throughput and reduce latency.
- vLLM excels in high-throughput serving with its paged attention mechanism.
- SGLang enables flexible and programmable inference workflows.
- Llama.cpp and NVIDIA TensorRT Edge-LLM are optimized for memory-efficient execution in resource-constrained environments.
These frameworks provide the infrastructure needed to serve LLMs reliably when deploying locally at the edge.
Model quantization
Model quantization is a key technique for reducing the memory footprint and accelerating inference of AI models by representing weights and activations with lower-precision data types.
Quantization should be driven by explicit accuracy and performance requirements for the target use case. Before selecting a quantization scheme, define:
- The minimum acceptable model quality or task accuracy.
- The target throughput and latency.
- The deployment constraints, especially available GPU memory.
With these requirements locked down, the recommended approach is to progressively evaluate lower-precision quantization options. Start from the highest-accuracy baseline and move downward through supported quantization formats until the model no longer meets the required quality threshold. The selected quantization point should be the lowest precision that still satisfies the use-case accuracy requirement, since that typically provides the best memory savings and efficiency.
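The progressive search can be expressed as a simple loop; a sketch with hypothetical accuracy scores (illustrative numbers, not benchmark results), walking from highest to lowest precision and keeping the lowest format that still clears the threshold:

```shell
# Hypothetical accuracy per format, ordered highest precision first.
threshold=95
selected=""
for entry in "fp16:97" "fp8:96" "w4a16:95" "int4:93"; do
  fmt=${entry%%:*}
  acc=${entry##*:}
  if [ "$acc" -ge "$threshold" ]; then
    selected=$fmt            # still meets the quality bar; keep descending
  else
    break                    # first failure: stop the search
  fi
done
echo "selected format: ${selected}"
```

With these numbers, the loop stops at INT4 and keeps W4A16 as the lowest precision that still satisfies the accuracy requirement.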

If lower-bit quantization introduces unacceptable degradation, use recovery techniques such as quantization-aware distillation (QAD) to recover lost accuracy. These methods can often restore enough model quality to enable more aggressive quantization while still meeting deployment requirements.
Once the quantization level is chosen, optimize runtime memory for the target deployment. Perform a sweep across vLLM configuration parameters—especially GPU memory utilization—to find the minimum memory footprint to sustain target performance. This ensures an efficient, right-sized deployment for throughput and latency goals.
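One way to structure the sweep is to generate a launch command per candidate memory fraction and benchmark each run (the model name is a placeholder; --gpu-memory-utilization is vLLM’s flag for capping the GPU memory fraction):

```shell
# Emit one vLLM launch command per candidate GPU memory fraction; run each,
# measure throughput and latency, and keep the smallest fraction that still
# meets the performance targets.
model="Qwen/Qwen3-4B"        # placeholder model name
for util in 0.5 0.6 0.7 0.8; do
  echo "vllm serve ${model} --gpu-memory-utilization ${util}"
done
```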
Formats like FP16 and FP8 balance accuracy and performance, with FP8 increasingly used for higher throughput. More aggressive schemes, like W4A16, reduce memory and bandwidth needs while maintaining acceptable accuracy. NVIDIA NVFP4 further improves efficiency with hardware-friendly 4-bit computation. Together, these approaches enable faster, cost-effective inference for large models and resource-constrained systems. Support varies across Jetson platforms—refer to the NVIDIA Jetson product catalog for details.
| Knob | Memory that may be reclaimed | Notes |
| --- | --- | --- |
| Model quantization on Qwen3 8B from FP16 to W4A16 | ~10 GB | Qwen3 8B |
| Model quantization on Qwen3 4B from BF16 to INT4 | ~5.6 GB | Qwen3 4B |
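The table’s savings can be sanity-checked with weights-only arithmetic (this ignores KV cache and runtime overhead, which is why measured savings are somewhat lower than the raw weight delta):

```shell
# Approximate weight storage for an 8B-parameter model at different precisions.
params_b=8                                   # parameters, in billions
fp16_gb=$(( params_b * 2 ))                  # 2 bytes/param  -> 16 GB
w4_gb=$(( params_b / 2 ))                    # ~0.5 bytes/param -> 4 GB
echo "FP16: ${fp16_gb} GB, 4-bit: ~${w4_gb} GB, delta: ~$(( fp16_gb - w4_gb )) GB"
```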
Depending on the components of the five-layer software stack that are included and optimized, memory savings of up to 10–12 GB are possible while maintaining high accuracy and feature parity.
Disaggregating inference at the edge with specialized accelerators
Jetson platforms include several non-GPU accelerators that improve efficiency by offloading specialized workloads from the CPU and GPU. These include an image signal processor (ISP) for camera processing, NVENC/NVDEC for video encoding/decoding, and the NVIDIA Programmable Vision Accelerator (PVA) for vision tasks.
The PVA, available from Jetson Orin NX to Jetson Thor, is well-suited for always-on, low-power vision workloads—such as sentry mode, motion detection, object tracking, and feature extraction—where continuous GPU use would be inefficient. By offloading these tasks, the PVA reduces latency and frees GPU resources for more complex inference or parallel workloads, improving overall performance and power efficiency in edge deployments.
The NVIDIA cuPVA SDK is currently in Early Access. If you’re interested in exploring its capabilities, reach out for more information.
Possible savings across multiple layers:
| Layer | Potential savings |
| --- | --- |
| BSP & OS services | ~1,025 MB |
| Pipeline optimization | ~412 MB |
| Inferencing frameworks and model quantization | ~5 to 10 GB |
If there’s one key takeaway, it’s to use the right quantization precision.
Formats like NVFP4, INT4, and W4A16 significantly reduce memory and storage needs while maintaining strong accuracy for many LLM workloads.
Real use case: Reachy Mini Jetson Assistant
To show the impact of these memory optimizations, consider the Reachy Mini Jetson Assistant, an on-device conversational AI robot running on the Jetson Orin Nano with 8 GB of unified memory and no cloud dependency.
The assistant runs a multimodal AI pipeline concurrently, including: a vision-language model (Cosmos-Reason2-2B) quantized to 4-bit (Q4_K_M GGUF) and served via Llama.cpp for visual understanding; faster-whisper (small.en) for speech recognition; and Kokoro TTS for text-to-speech—all alongside the Reachy Mini robot SDK and a live web dashboard.
With stack-wide optimizations—disabling the display manager, running headless, serving the VLM via Llama.cpp instead of heavier Python frameworks, using 4-bit quantized Cosmos Reason2 2B, and selecting optimized runtimes (CTranslate2 for STT, ONNX Runtime for TTS and VAD)—the full pipeline runs on a single Orin Nano 8 GB system.
More broadly, combining 4-bit quantization with efficient inference runtimes, like Llama.cpp and TensorRT-Edge-LLM, makes a wide range of models accessible within this memory budget: LLMs up to ~10B parameters and VLMs up to ~4B parameters. The full list of tested models is available on the Jetson AI Lab Models page and NVIDIA Developer Forum.
Get started
- Learn more about the next-generation Orin NX Edge AI platform.
- Install JetPack and DeepStream.
- Share your story and how this post helped you on the NVIDIA Jetson Forum.