Robotics

Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson


The boom in open source generative AI models is pushing beyond data centers into machines operating in the physical world. Developers are eager to deploy these models at the edge, enabling physical AI agents and autonomous robots to automate heavy-duty tasks.

A key challenge is efficiently running multi-billion-parameter models on edge devices with limited memory. With ongoing constraints on memory supply and rising costs, developers are focused on achieving more with less.

The NVIDIA Jetson platform supports popular open models while delivering strong runtime performance and memory optimization at the edge. For edge developers, memory footprint determines whether a system functions. Unlike cloud environments, edge devices operate under strict memory limits, with CPU and GPU sharing constrained resources. 

Inefficient memory use can lead to bottlenecks, latency spikes, or system failure. Meanwhile, modern edge applications often run multiple pipelines—such as detection, tracking, and segmentation—making efficient memory management critical for stable, real-time performance under power and thermal constraints.

Optimizing memory usage provides clear benefits. Developers can improve performance on the same hardware by reducing overhead and increasing concurrency, while enabling more complex workloads like LLMs, multi-camera systems, and sensor fusion. It also reduces system cost by fitting into smaller memory configurations and improves efficiency (performance per watt) by minimizing bottlenecks and maximizing GPU utilization.

This blog explores optimization strategies to help developers maximize performance, efficiency, and capability on resource-constrained edge systems.

Edge AI software stack

Let’s dive deeper into the runtime software stack for edge devices. This isn’t an exhaustive guide to memory optimization, but a reference framework to spark ideas and help developers identify new ways to improve their stacks. The memory savings show what NVIDIA teams achieved. Experienced users may reach higher efficiencies, while others can use these examples as a starting point to better use resources on NVIDIA Jetson and NVIDIA IGX platforms.

This blog explores five key layers—starting from the foundation with Jetson BSP and NVIDIA JetPack, and moving up through the inference pipeline, inference frameworks, and quantization techniques. Let’s dive into each layer step by step.

Foundation layers: Board support package and software stack

The NVIDIA Jetson Board Support Package (BSP) and NVIDIA JetPack layer form the foundation of the software stack, interfacing with hardware. It includes the Linux kernel, device drivers, firmware, and the JetPack SDK with components that enable compute, multimedia, and accelerated I/O. This layer abstracts hardware complexity—GPUs, CPUs, memory, and peripherals—providing a stable, optimized base for higher-level services and applications.

At this layer, memory savings can be achieved by disabling unused services and reclaiming reserved carveout regions. These optimizations reduce overhead and free DRAM for application workloads without affecting core functionality. The following sections highlight key techniques to enable these optimizations.

The following BSP and JetPack layer optimization guidelines apply to Jetson Orin NX and Jetson Orin Nano.

| Knob | Memory that may be reclaimed | Instructions |
| --- | --- | --- |
| Disabling the graphical desktop, including display and UI-related services | Up to 865 MB | sudo systemctl set-default multi-user.target |
| Disabling networking, connectivity, and non-essential journaling services | Up to 32 MB | sudo systemctl disable <service-name> |

Table 1. Memory optimization knobs at the BSP and JetPack levels
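The two knobs above can be applied with a short script. The snippet below is a sketch that defaults to a dry run; the service names (bluetooth, ModemManager, snapd) are examples from a stock Ubuntu-based image, not a recommendation. Confirm each service is unused on your system before disabling it:

```shell
# Sketch: boot to console and disable example non-essential services.
# APPLY=0 keeps this a dry run that only prints the commands.
APPLY=0

run() {
    if [ "$APPLY" -eq 1 ]; then
        sudo "$@"
    else
        echo "would run: $*"
    fi
}

# Boot to console (no graphical desktop); reclaims display/UI memory
run systemctl set-default multi-user.target

# Example services -- verify each is unused on your system first
for svc in bluetooth.service ModemManager.service snapd.service; do
    run systemctl disable --now "$svc"
done
```

After rebooting on the target device, compare free -m before and after to confirm the savings.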

Carveout regions on NVIDIA Jetson Orin NX, along with kernel- and user-space optimizations, are key areas for improving overall system efficiency. The following sections explore practical techniques for optimizing these layers.

Carveout optimization

Carveout regions in NVIDIA Jetson Orin NX and NVIDIA Jetson Orin Nano are reserved physical memory set aside at boot for specific hardware engines, firmware, and real-time subsystems. They aren’t accessible to Linux or NVIDIA CUDA applications and are used by on-chip microcontrollers and accelerators. These act as dedicated memory pools to ensure isolation, security, and deterministic behavior. Depending on your pipeline and application needs, some carveouts can be disabled to further optimize memory usage.

| Carveout | When to disable | How to disable | Reclaimed DRAM size |
| --- | --- | --- | --- |
| CARVEOUT_DCE_TSEC | When display isn’t needed | Refer to note 1 and re-flash | 1 MB |
| CARVEOUT_DCE | When display isn’t needed | Refer to note 1 and re-flash | 32 MB |
| CARVEOUT_DISP_EARLY_BOOT_FB | When display isn’t needed | Refer to note 1 and re-flash | 34 MB |
| CARVEOUT_TSEC_DCE | When display isn’t needed | Refer to note 1 and re-flash | 1 MB |
| CARVEOUT_CAMERA_TASKLIST | When camera isn’t needed | Refer to note 2 and re-flash | 32 MB |
| CARVEOUT_RCE | When camera isn’t needed | Refer to note 2 and re-flash | 1 MB |

Table 2. Memory optimization knobs for various carveouts

Note 1: The following example shows how to reclaim memory when display isn’t required. Add the following snippet inside the /misc/carveout/ node of Linux_for_Tegra/bootloader/generic/BCT/tegra234-mb1-bct-misc-p3767-0000.dts:

// Display-related carveouts
aux_info@CARVEOUT_BPMP_DCE {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

aux_info@CARVEOUT_DCE_TSEC {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

aux_info@CARVEOUT_DCE {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

aux_info@CARVEOUT_DISP_EARLY_BOOT_FB {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

aux_info@CARVEOUT_TSEC_DCE {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

Update the /mb2-misc/auxp_controls@3/ node of Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi to:

/* Control fields for DCE cluster. */
auxp_controls@3 {
        enable_init = <0>;
        enable_fw_load = <0>;
        enable_unhalt = <0>;
        reset_vector = <0x40000000>;
};

Remove the entire /mb2-misc/auxp_ast_config@6 and /mb2-misc/auxp_ast_config@7 nodes from Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi.

Use the dtc tool to decompile the kernel DTB to DTS, set the status of the /display@13800000 node to disabled, then recompile the DTS back to a kernel DTB:

display@13800000 {
        status = "disabled";
};

Note 2: The following example shows how to reclaim memory when a camera isn’t needed. Add the following snippet inside the /misc/carveout/ node of Linux_for_Tegra/bootloader/generic/BCT/tegra234-mb1-bct-misc-p3767-0000.dts:

aux_info@CARVEOUT_CAMERA_TASKLIST {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

aux_info@CARVEOUT_RCE {
        pref_base = <0x0 0x0>;
        size = <0x0 0x0>; // 0MB
        alignment = <0x0 0x0>; // 0MB
};

Update the /mb2-misc/auxp_controls@2/ node of Linux_for_Tegra/bootloader/tegra234-mb2-bct-common.dtsi to:

/* Control fields for RCE cluster. */
auxp_controls@2 {
        enable_init = <0>;
        enable_fw_load = <0>;
        enable_unhalt = <0>;
};

Kernel-side optimization

Jetson Orin, Orin NX, and Orin Nano platforms use an NVIDIA-specific Input/Output Memory Management Unit (IOMMU) to handle Direct Memory Access (DMA) address translation for peripherals, enabling devices to access system memory regardless of physical address.

The Linux Software I/O Translation Lookaside Buffer (SWIOTLB) is a workaround for systems without a hardware IOMMU or with peripherals limited to 32-bit DMA. Since Orin includes a robust hardware IOMMU that remaps DMA addresses, SWIOTLB is generally redundant.

SWIOTLB tuning

For specific use cases or non-standard peripherals requiring SWIOTLB—or when kernel logs indicate DMA issues—the reservation size can be adjusted using boot arguments.

The swiotlb= parameter defines the number of I/O TLB slabs (each 2 KB):

   Total size (bytes) = swiotlb_value × 2,048

Example (4 MB buffer):

  • 4 MB ÷ 2 KB = 2,048 slabs
  • Kernel command: swiotlb=2048
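The arithmetic above can be scripted when sizing the buffer. On Jetson, kernel boot arguments are typically added to the APPEND line in /boot/extlinux/extlinux.conf; the snippet below only computes the value to append:

```shell
# Convert a desired SWIOTLB buffer size (in MB) into the slab count
# expected by the swiotlb= kernel parameter (each slab is 2 KB).
buffer_mb=4
slabs=$(( buffer_mb * 1024 * 1024 / 2048 ))

# Append the result (for example, swiotlb=2048) to the APPEND line in
# /boot/extlinux/extlinux.conf and reboot.
echo "swiotlb=$slabs"
```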

User-space optimizations

On Jetson, total application memory includes:

  • CPU memory used by processes and system services.
  • Hardware (NvMap) memory used by CUDA, multimedia buffers, and accelerators.

Both share the same physical memory pool, so optimizing one benefits the other.

Reduce CPU memory usage


Start by identifying processes that consume the most CPU memory. Background services—such as GUI or audio components—can use significant memory and may be unnecessary in production.

  1. Measure CPU memory usage
    Use procrank to analyze memory usage:
$ git clone https://github.com/csimmonds/procrank_linux.git
$ cd procrank_linux/
$ make
$ sudo ./procrank

The output is sorted by PSS (Proportional Set Size), reflecting actual physical memory usage.

  2. Optimize based on the findings. Common heavy processes include:
  • gnome-shell or Xorg (GUI)
  • pulseaudio
  • Unused python3 processes

These are often unnecessary in production and can be disabled to reclaim memory. In headless deployments, disabling GUI services can free significant system memory.

  3. Analyze and measure hardware memory usage

In addition to CPU memory, GPU and multimedia allocations can impact available memory.

$ sudo cat /sys/kernel/debug/nvmap/iovmm/clients

This shows memory usage across processes using NvMap (for example, CUDA and video pipelines).

  4. Optimize hardware memory

Identify processes using large GPU or buffer allocations. As with CPU optimization, services like GUI pipelines (gnome-shell, Xorg) may consume unnecessary hardware memory. Reducing these allocations frees up more memory for AI workloads.
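A quick way to total the NvMap allocations is to sum the size column of the clients listing. The sample text and column layout below are illustrative only; inspect your own /sys/kernel/debug/nvmap/iovmm/clients output first, as the exact format can vary across JetPack releases:

```shell
# Sum per-process NvMap usage (KB) from a clients listing.
# The sample below is illustrative -- on a device, pipe the real file:
#   sudo cat /sys/kernel/debug/nvmap/iovmm/clients | awk ...
sample="user  gnome-shell  1201  98304K
user  deepstream   2410  524288K"

# Strip the trailing K from the size column and accumulate
total_kb=$(echo "$sample" | awk '{ gsub(/K$/, "", $4); sum += $4 } END { print sum }')
echo "Total NvMap usage: $(( total_kb / 1024 )) MB"
```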

Inferencing pipeline

This layer manages the end-to-end data flow through preprocessing, inference, and postprocessing to produce actionable outputs. Frameworks like NVIDIA DeepStream provide a high-performance, GPU-accelerated pipeline for streaming data such as video and sensor inputs. They handle decoding, batching, inference, tracking, and analytics in a streamlined workflow, enabling scalable processing. This layer abstracts complexity and optimizes data movement and compute utilization for efficient, production-ready AI applications.

Learn how to optimize the inferencing pipeline to reduce memory footprint and improve performance through configuration and implementation choices. While shown with DeepStream, these principles apply broadly across frameworks and applications.

| Knob | Memory that may be reclaimed |
| --- | --- |
| Container vs. bare metal | Up to 70 MB |
| Switching from Python to C++ | Up to 84 MB |
| Tweaking pipeline configuration: disable Tiler/OSD, use FakeSink** | Up to 258 MB |
| Total | 412 MB |

Table 3. Knobs that help reduce memory footprint in the DeepStream-style inference pipeline
**In a DeepStream-style inference pipeline, disabling Tiler/OSD and using FakeSink removes display stages needed for visualization but unnecessary in headless or production deployments. This saves memory, reduces GPU load, and improves throughput.
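In a deepstream-app-style configuration, these knobs look roughly like the fragment below. Group and key names follow the DeepStream reference application’s sample configs (sink type=1 selects FakeSink there); verify them against your DeepStream version:

```ini
# Headless sketch: no tiled display, no on-screen display, fake sink
[tiled-display]
enable=0

[osd]
enable=0

[sink0]
enable=1
type=1
```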

Inferencing frameworks

The inference-serving framework layer for LLMs focuses on efficiently deploying and scaling large language models in production, with frameworks like vLLM, SGLang, and Llama.cpp leading this space. These frameworks optimize inference through techniques such as continuous batching, KV cache management, and efficient memory utilization to maximize throughput and reduce latency. 

  • vLLM excels in high-throughput serving with its paged attention mechanism.
  • SGLang enables flexible and programmable inference workflows. 
  • Llama.cpp and NVIDIA TensorRT Edge-LLM are optimized for memory-efficient execution in resource-constrained environments. 

These frameworks provide the infrastructure needed to serve LLMs reliably when deploying locally at the edge.

Model quantization 

Model quantization is a key technique used for reducing the memory footprint and accelerating inference of AI models by representing weights and activations with lower-precision data types. 

Quantization should be driven by explicit accuracy and performance requirements for the target use case. Before selecting a quantization scheme, define:

  • The minimum acceptable model quality or task accuracy.
  • The target throughput and latency.
  • The deployment constraints, especially available GPU memory.

With these requirements locked down, the recommended approach is to progressively evaluate lower-precision quantization options. Start from the highest-accuracy baseline and move downward through supported quantization formats until the model no longer meets the required quality threshold. The selected quantization point should be the lowest precision that still satisfies the use-case accuracy requirement, since that typically provides the best memory savings and efficiency.
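The descent described above can be expressed as a small selection loop. The accuracy numbers below are placeholders standing in for your own evaluation results:

```shell
# Walk from highest to lowest precision; keep the last format that still
# meets the accuracy floor, and stop once quality drops below it.
# Format:score pairs are placeholders -- substitute your eval results.
threshold=95
selected=""
for entry in "FP16:97" "FP8:96" "W4A16:95" "INT4:91"; do
    fmt=${entry%%:*}
    acc=${entry##*:}
    if [ "$acc" -ge "$threshold" ]; then
        selected=$fmt
    else
        break   # quality floor crossed; stop descending
    fi
done
echo "Selected precision: $selected"
```

With these placeholder scores the loop settles on W4A16, the lowest precision still meeting the 95-point floor.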

If lower-bit quantization introduces unacceptable degradation, use recovery techniques such as quantization-aware distillation (QAD) to recover lost accuracy. These methods can often restore enough model quality to enable more aggressive quantization while still meeting deployment requirements.

Once the quantization level is chosen, optimize runtime memory for the target deployment. Perform a sweep across vLLM configuration parameters—especially GPU memory utilization—to find the minimum memory footprint to sustain target performance. This ensures an efficient, right-sized deployment for throughput and latency goals.
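One way to run such a sweep is a simple wrapper around the server launch. The --gpu-memory-utilization flag is part of the current vLLM CLI, but the model placeholder and benchmark step below are assumptions to fill in:

```shell
# Sweep candidate GPU memory fractions from generous to tight; at each
# step, launch the server and run your own benchmark, keeping the lowest
# setting that still meets latency/throughput targets.
for util in 0.9 0.8 0.7 0.6 0.5; do
    echo "vllm serve <your-model> --gpu-memory-utilization $util"
    # launch the printed command and benchmark here; stop descending once
    # targets are missed or the server fails to allocate its KV cache
done
```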

Formats like FP16 and FP8 balance accuracy and performance, with FP8 increasingly used for higher throughput. More aggressive schemes, like W4A16, reduce memory and bandwidth needs while maintaining acceptable accuracy. NVIDIA NVFP4 further improves efficiency with hardware-friendly 4-bit computation. Together, these approaches enable faster, cost-effective inference for large models and resource-constrained systems. Support varies across Jetson platforms—refer to the NVIDIA Jetson product catalog for details.

| Knob | Memory that may be reclaimed | Notes |
| --- | --- | --- |
| Model quantization on Qwen3 8B from FP16 to W4A16 | ~10 GB | Qwen3 8B |
| Model quantization on Qwen3 4B from BF16 to INT4 | ~5.6 GB | Qwen3 4B |

Table 4. Memory reclaimed in model quantization
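As a sanity check on the Qwen3 8B row: FP16 stores 2 bytes per parameter, while W4A16 packs weights into roughly half a byte each. The observed ~10 GB saving sits a bit below this weights-only delta because 4-bit formats add scale metadata and the runtime keeps other allocations (KV cache, activations):

```shell
# Weights-only, back-of-envelope estimate for an 8B-parameter model
params=8000000000
fp16_gb=$(( params * 2 / 1000000000 ))   # 2 bytes per parameter
w4_gb=$(( params / 2 / 1000000000 ))     # ~0.5 bytes per parameter
echo "FP16 ~${fp16_gb} GB, W4A16 ~${w4_gb} GB, weights delta ~$(( fp16_gb - w4_gb )) GB"
```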

Depending on the components of the five-layer software stack that are included and optimized, memory savings of up to 10–12 GB are possible while maintaining high accuracy and feature parity.

Disaggregating inference at the edge with specialized accelerators

Jetson platforms include several non-GPU accelerators that improve efficiency by offloading specialized workloads from the CPU and GPU. These include an image signal processor (ISP) for camera processing, NVENC/NVDEC for video encoding/decoding, and the NVIDIA Programmable Vision Accelerator (PVA) for vision tasks.

The PVA, available from Jetson Orin NX to Jetson Thor, is well-suited for always-on, low-power vision workloads—such as sentry mode, motion detection, object tracking, and feature extraction—where continuous GPU use would be inefficient. By offloading these tasks, the PVA reduces latency and frees GPU resources for more complex inference or parallel workloads, improving overall performance and power efficiency in edge deployments.

The NVIDIA cuPVA SDK is currently in Early Access. If you’re interested in exploring its capabilities, reach out for more information.

Possible savings across multiple layers:

| Layer | Potential savings |
| --- | --- |
| BSP and OS services | ~1,025 MB |
| Pipeline optimization | ~412 MB |
| Inferencing frameworks and model quantization | ~5 to 10 GB |

Table 5. Memory reclaimed at various levels in the software stack

If there’s one key takeaway, it’s to use the right quantization precision.

Formats like NVFP4, INT4, and W4A16 significantly reduce memory and storage needs while maintaining strong accuracy for many LLM workloads.

Real use case: Reachy Mini Jetson Assistant

To show the impact of these memory optimizations, consider the Reachy Mini Jetson Assistant, an on-device conversational AI robot running on the Jetson Orin Nano with 8 GB of unified memory and no cloud dependency. 

The assistant runs a multimodal AI pipeline concurrently, including: a vision-language model (Cosmos-Reason2-2B) quantized to 4-bit (Q4_K_M GGUF) and served via Llama.cpp for visual understanding; faster-whisper (small.en) for speech recognition; and Kokoro TTS for text-to-speech—all alongside the Reachy Mini robot SDK and a live web dashboard.

With stack-wide optimizations—disabling the display manager, running headless, serving the VLM via Llama.cpp instead of heavier Python frameworks, using 4-bit quantized Cosmos Reason2 2B, and selecting optimized runtimes (CTranslate2 for STT, ONNX Runtime for TTS and VAD)—the full pipeline runs on a single Orin Nano 8 GB system.

More broadly, combining 4-bit quantization with efficient inference runtimes, like Llama.cpp and TensorRT-Edge-LLM, makes a wide range of models accessible within this memory budget, with LLMs up to ~10B parameters and VLMs up to ~4B parameters. The full list of tested models is available on the Jetson AI Lab Models page and NVIDIA Developer Forum.

Get started

  1. Learn more about the next-generation Orin NX Edge AI platform.
  2. Install JetPack and DeepStream.
  3. Share your story and how this post helped you on the NVIDIA Jetson Forum.
