
Unlock Faster, Smarter Edge Models with 7x Gen AI Performance on NVIDIA Jetson AGX Thor


A defining strength of the NVIDIA software ecosystem is its commitment to continuous optimization. In August, NVIDIA Jetson AGX Thor launched with up to a 5x boost in generative AI performance over NVIDIA Jetson AGX Orin. Through software updates since the release, Jetson Thor now delivers a 7x increase in generative AI throughput.

With this proven approach, showcased previously on NVIDIA Jetson Orin and NVIDIA Jetson AGX Xavier, developers can enjoy these improvements on models such as Llama and DeepSeek, and similar benefits are expected for future model releases. In addition to consistent software enhancements, NVIDIA also provides support for leading models, often within days of their launch. This enables developers to experiment with the latest AI models early on.

The Jetson Thor platform also supports major quantization formats, including the new NVFP4 from the NVIDIA Blackwell GPU architecture, helping optimize inference even further. New techniques like speculative decoding are also being supported, offering an additional way to accelerate Gen AI workloads at the edge.

Continuous software optimization

With the recent vLLM container release, Jetson Thor delivers up to 3.5x greater performance on the same model and same quantization compared to its launch-day performance in late August. Table 1 shows the output tokens/sec on Llama 3.3 70B and DeepSeek R1 70B models at launch in August, compared to the latest benchmarked numbers from September 2025.

| Family | Model | Jetson AGX Thor, Aug 2025 (output tokens/sec) | Jetson AGX Thor, Sep 2025 (output tokens/sec) | Speedup compared to launch |
| --- | --- | --- | --- | --- |
| Llama | Llama 3.3 70B | 12.64 | 41.5 | 3.3x |
| DeepSeek | DeepSeek R1 70B | 11.5 | 40.29 | 3.5x |

Table 1. Tokens/sec output on Llama 3.3 and DeepSeek R1 at launch compared to the latest benchmarks 

Benchmark configuration: Input Sequence Length: 2048; Output Sequence Length: 128; Max Concurrency: 8; Power Mode: MAXN

Jetson Thor also now supports Eagle 3 speculative decoding in vLLM containers to further increase the performance of generative AI models. For example, on Llama 3.3 70B with speculative decoding, you can get 88.62 output tokens/sec, creating a 7x speedup compared to launch.

Figure 1. 3.5x increase with software optimization and 7x increase with speculative decoding, shown for DeepSeek R1 and Llama 3.3

Run the latest models with day 0 support

Developers can run the latest and greatest generative AI models at the edge on Jetson Thor with day 0 support. For example, gpt-oss was supported on llama.cpp/Ollama on day 0 of its launch on Jetson AGX Thor, and it’s supported on vLLM as well. Similarly, many NVIDIA Nemotron models are supported within the first week of their release.

Get max gen AI performance with Jetson Thor 

Jetson Thor is powerful for generative AI at the edge, but using it to its full advantage requires the right techniques. This section is your guide to getting the most out of the platform. We’ll dive into quantization and speculative decoding, the two strategies for accelerating LLM and VLM inference. We’ll finish with a tutorial showing how to benchmark your models on Jetson Thor. This will give you a clear path for choosing the best model and configuration for your specific use case.

Quantization: Shrinking model size, speeding up inference

At its core, quantization is the process of reducing the numerical precision of a model’s data (its weights and activations). Think of it like using fewer decimal places to represent a number—it’s not exactly the same, but it’s close enough and much more efficient to store and calculate. We typically move from the standard 16-bit formats (like FP16 or BF16) to lower-bit formats like 8-bit or 4-bit.
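
To make this concrete, here is a minimal NumPy sketch (not a production quantization library) of symmetric 4-bit weight quantization with per-group scales: map each group of weights into the signed 4-bit range, round, then reconstruct to see both the mechanics and the storage savings.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(1024, 4096)).astype(np.float32)  # toy weight matrix
GROUP = 128  # real W4A16 kernels typically use per-group scales like this

# One scale per group of 128 weights so each group maps into the signed 4-bit range [-8, 7]
grouped = weights.reshape(-1, GROUP)
scales = np.abs(grouped).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(grouped / scales), -8, 7).astype(np.int8)  # the 4-bit integer codes

# Dequantize and compare against the original weights
reconstructed = (q * scales).astype(np.float32).reshape(weights.shape)
rms_error = np.sqrt(np.mean((reconstructed - weights) ** 2)) / np.sqrt(np.mean(weights ** 2))
print(f"relative RMS weight error: {rms_error:.1%}")  # weight-level error; task accuracy is what you validate

# Storage: 4 bits per weight (packed two per byte) vs 16 bits in FP16
print(f"FP16: {weights.size * 2 / 1e6:.1f} MB  ->  4-bit: {weights.size / 2 / 1e6:.1f} MB (plus scales)")
```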

This gives you two huge wins:

  1. Smaller memory footprint
    This is the key that unlocks larger models on-device. By cutting the number of bytes needed for each parameter, you can load models that would otherwise be too big.  

    As a rule of thumb, a 70-billion-parameter model’s weights take up about:
    • 140 GB in 16-bit floating point (FP16), which won’t fit in Jetson Thor’s 128 GB of memory.
    • 70 GB in 8-bit floating point (FP8), which fits with room to spare.
    • 35 GB in 4-bit, which leaves headroom to run multiple large models.

  2. Faster memory access
    Smaller weights mean fewer bytes to pull from memory into the compute cores. This directly reduces latency, which is critical for edge applications where time-to-first-token affects responsiveness and user experience.
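
As a quick sanity check before choosing a format, you can estimate weight memory straight from the parameter count and bits per weight. The sketch below reproduces the rule-of-thumb numbers above; it’s a rough estimate that ignores activations, the KV cache, and runtime overhead, which all need additional headroom.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed just for the model weights, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("W4A16 (4-bit weights)", 4)]:
    gb = weight_memory_gb(70e9, bits)  # 70B-parameter model
    fits = "fits" if gb < 128 else "does not fit"
    print(f"70B model @ {name:>22}: ~{gb:4.0f} GB -> {fits} in Jetson Thor's 128 GB")
```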

Let’s look at the two formats that matter most on Jetson Thor.

FP8

FP8 is your go-to for a nearly lossless first step in optimization. A 70B model’s 16-bit weights alone exceed Jetson Thor’s 128 GB of memory, before you even account for activations and the KV cache. By halving the memory needed for the weights, FP8 makes it practical to load and run that same model on-device. When properly calibrated, FP8’s accuracy is extremely close to the FP16 baseline (often with a drop of less than 1%), making it a “safe first step” for chat and general workloads, though sensitive tasks like math or code generation may require extra tuning.

W4A16: 4-bit weights and 16-bit activations

W4A16 unlocks massive models on the edge by quantizing static model weights to an ultra-compact 4-bit, while keeping the dynamic, in-flight calculations (the activations) at a higher-precision 16-bit. This trade-off makes it possible to fit models with over 175B parameters on a single Jetson Thor, leaving plenty of headroom for their activations. Serving multiple large models at once—for example, two 70B models—is a feat that was a major challenge for previous Jetson generations.

Which format should you use?

Our recommendation is simple: start with W4A16. It typically delivers the highest inference speeds and the lowest memory footprint. If you test the quantized model on your task and find that the accuracy meets your quality bar, stick with it.

If your task is more complex (like nuanced reasoning or code generation) and you find W4A16’s accuracy isn’t quite there, switch to FP8. It’s still fast, keeps memory usage low, and provides more than enough quality for most edge use cases.
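
As a concrete starting point, here’s a minimal sketch of loading a pre-quantized W4A16 checkpoint (the same one used in the speculative decoding example below) with vLLM’s offline Python API. It assumes a Jetson Thor environment with vLLM installed, such as the container discussed later; the prompt and sampling settings are purely illustrative.

```python
from vllm import LLM, SamplingParams

# Pre-quantized W4A16 checkpoint; vLLM picks up the quantization scheme from the model config.
llm = LLM(model="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16", max_model_len=2048)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Summarize why 4-bit weights help on edge devices."], params)
print(outputs[0].outputs[0].text)
```

If you’re serving requests over HTTP instead of running offline inference, the vllm serve command shown later exposes the same model through an OpenAI-compatible endpoint.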

Speculative decoding: Boost inference with a draft-and-verify approach

Once you’ve picked a quantization format, the next big performance lever is speculative decoding. This technique speeds up inference by using two models: a small, fast “draft” model and your large, accurate “target” model.

Here’s how it works:

  1. The draft model quickly generates a chunk of candidate tokens (a “guess” of what comes next).
  2. The target model then validates the entire chunk in a single pass instead of generating one token at a time.

This “draft-and-verify” process generates multiple tokens per cycle while guaranteeing the final output is identical to what the target model would produce alone. Your success is measured by the acceptance rate—the percentage of draft tokens accepted. A high rate yields significant latency wins, while a low rate can add overhead, so it’s crucial to benchmark with prompts that reflect your workload. Your main lever for improving this is the draft model choice; start with one architecturally similar to your target, and for specialized domains, consider fine-tuning a custom draft model to maximize the acceptance rate.
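
To build intuition for the accept/reject loop (the serving engine handles all of this for you), here is a toy, framework-free sketch of greedy draft-and-verify. The draft_next and target_next callables stand in for real model calls and are purely illustrative.

```python
from typing import Callable, List

def speculative_step(prefix: List[str],
                     draft_next: Callable[[List[str]], str],
                     target_next: Callable[[List[str]], str],
                     num_speculative_tokens: int = 5) -> List[str]:
    """One greedy draft-and-verify cycle.

    Note: a real engine scores every draft position with the target model in a
    single batched forward pass; this sketch calls target_next per position
    only to keep the acceptance logic easy to follow.
    """
    # 1. The draft model proposes a chunk of candidate tokens.
    draft: List[str] = []
    for _ in range(num_speculative_tokens):
        draft.append(draft_next(prefix + draft))

    # 2. The target model verifies the chunk, accepting until the first mismatch.
    accepted: List[str] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token == expected:
            accepted.append(token)      # draft guessed right: this token is effectively free
        else:
            accepted.append(expected)   # mismatch: take the target's token and stop
            break
    else:
        # Every draft token was accepted; the target contributes one bonus token.
        accepted.append(target_next(prefix + accepted))

    # The output matches what greedy decoding with the target alone would produce.
    return prefix + accepted

# Toy usage: a deterministic "target" and a "draft" that usually agrees with it.
def target(ctx: List[str]) -> str:
    return str(len(ctx) % 3)

def draft(ctx: List[str]) -> str:
    return target(ctx) if len(ctx) % 5 else "x"

print(speculative_step(["<s>"], draft, target))
```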

In our experiments, we found that EAGLE-3 speculative decoding delivered the best speedups. In our benchmarks on Llama 3.3 70B (W4A16), this feature delivered a 2.5x performance uplift, boosting throughput from 6.27 to 16.19 tokens/sec using vLLM with a concurrency of 1. We benchmarked this using the ShareGPT dataset, but you should always test on your own data to validate performance for your specific use case.

Putting together quantization and speculative decoding

The real magic happens when you combine these techniques. We used vLLM, which has great built-in support for EAGLE-3. Here’s an example command we used to serve the Llama 3.3 70B W4A16 model with speculative decoding enabled.

vllm serve "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16" --trust_remote_code --speculative-config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.3-Instruct-70B","num_speculative_tokens":5}'

To make getting started simpler, NVIDIA is releasing a standalone vLLM container that supports Jetson Thor and is updated monthly with the latest improvements.

Here’s a step-by-step guide to finding the best balance between model quality and inference performance:

  1. Establish a quality baseline. Before optimizing, load your model at its highest possible precision (FP16 preferably, but if the model is too big, FP8 is also fine) and simply verify that it performs your task correctly.
  2. Optimize with quantization. Progressively lower the weight precision (for example, to W4A16), testing for accuracy at each step. Stop when the quality no longer meets your requirements.
  3. Benchmark against reality. Validate your final setup using a performance benchmark that mimics your workload, whether that involves high concurrency, large context windows, or long output sequences.

If your chosen model still isn’t fast enough, repeat this process with a smaller one. To see exactly how to run these performance benchmarks, follow our hands-on tutorial on Jetson AI Lab.
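
Before running full benchmarks, a quick throughput probe against the server’s OpenAI-compatible endpoint can confirm everything is wired up. This is a rough sketch, not a rigorous benchmark; the port, model name, and prompt are assumptions to replace with your own, and it measures a single request rather than concurrent load.

```python
import time
import requests

URL = "http://localhost:8000/v1/completions"  # default port for vllm serve
MODEL = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"

payload = {
    "model": MODEL,
    "prompt": "Explain speculative decoding in two sentences.",
    "max_tokens": 128,
    "temperature": 0.2,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.perf_counter() - start

# The OpenAI-compatible response reports how many output tokens were generated.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s -> {completion_tokens / elapsed:.1f} tokens/sec")
```

Average several runs, and scale concurrency, context length, and output length toward your real workload (as in the Table 1 configuration) when you want comparable numbers.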

Now you can confidently improve your generative AI model performance on Jetson Thor. Get your Jetson AGX Thor Developer Kit today and download the latest NVIDIA JetPack 7 to start your journey. 
