
Building Autonomous Vehicles That Reason with NVIDIA Alpamayo

Autonomous vehicle (AV) research is undergoing a rapid shift. The field is being reshaped by the emergence of reasoning-based vision–language–action (VLA) models that bring human-like thinking to AV decision-making. These models can be viewed as implicit world models operating in a semantic space, allowing AVs to solve complex problems step-by-step and to generate reasoning traces that mirror human thought processes. This shift extends beyond the models themselves: traditional open-loop evaluation is no longer sufficient to rigorously assess such models, and new evaluation tools are required.

Recently, NVIDIA introduced Alpamayo, a family of models, simulation tools, and datasets to enable development of reasoning-based AV architectures. Our goal is to provide researchers and developers with a flexible, fast, and scalable platform for evaluating, and ultimately training, modern reasoning-based AV architectures in realistic closed-loop settings. 

In this blog, we introduce Alpamayo and how to get up and running with reasoning-based AV development:

  • Part 1: Introducing NVIDIA Alpamayo 1, an open, 10B-parameter reasoning VLA model, and showing how to use it to generate trajectory predictions and review the corresponding reasoning traces.
  • Part 2: Introducing the Physical AI dataset, one of the largest and most geographically diverse open AV datasets available, which enables training and evaluating these models.
  • Part 3: Introducing NVIDIA AlpaSim, an open-source, closed-loop simulation tool for evaluating end-to-end models.

These three key components provide the essential pieces needed to start building reasoning-based VLA models: a base model, large-scale data for training, and a simulator for testing and evaluation.

Figure 1. Alpamayo 1 model driving closed-loop in AlpaSim using reconstructed scenes from the NVIDIA Physical AI – AV NuRec Dataset.

Part 1: Alpamayo 1, an open reasoning VLA for AVs

Get started with the Alpamayo reasoning VLA model in just three steps.

Step 1: Access Alpamayo model weights and code

The Hugging Face repository contains pretrained model weights, which can be loaded with the corresponding code on GitHub.

Step 2: Prepare your environment

The Alpamayo GitHub repository contains steps to set up your development environment, including setting up uv (if not already installed) and creating a Python virtual environment.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Setup the virtual environment
uv venv ar1_venv
source ar1_venv/bin/activate

# Install pip in the virtual environment (if missing)
./ar1_venv/bin/python -m ensurepip

# Install Jupyter notebook package
./ar1_venv/bin/python -m pip install notebook

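# Sync the project dependencies into the active virtual environment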
uv sync --active

Finally, the model requires access to gated Hugging Face resources. Request access here:

Then, get your Hugging Face token here and authenticate with:

hf auth login

Step 3: Run the Alpamayo reasoning VLA

The model repository includes a notebook that will download the Alpamayo model weights, load some example data from the NVIDIA PhysicalAI-AV Dataset, run the model on it, and visualize the output trajectories and their associated reasoning traces.

In particular, the example data shows the ego-vehicle passing a construction zone, with four timesteps (columns) from four cameras (front_left, front_wide, front_right, and front_tele, one per row) visualized below.

Figure 2. A visualization of the example data sample, containing a construction zone, that will be passed into the model. Specifically, 4 timesteps (across columns) from 4 cameras (front_left, front_wide, front_right, and front_tele) are shown.

After you run this sample through the Alpamayo model, an example output you may see in the notebook is “Nudge to the left to increase clearance from the construction cones encroaching into the lane,” with the corresponding predicted trajectory and ground truth trajectory visualized below.

Figure 3. A visualization of the trajectory output from the model (in blue) along with the ground truth trajectory (in red) for comparison.

To produce more trajectories and reasoning traces, change the num_traj_samples=1 argument in the inference call to a higher number.

Part 2: Physical AI AV dataset for large-scale, diverse AV data

The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, most geographically diverse collections of multi-sensor data for AV researchers to build the next generation of physical AI-based end-to-end driving systems.

Figure 4. Clips from the Physical AI AV Dataset, one of the largest, most geographically diverse collections of multi-sensor AV data.

It contains a total of 1,727 hours of driving recorded in 25 countries and over 2,500 cities (coverage shown below, with color indicating the number of clips per country). The dataset captures diverse traffic, weather conditions, obstacles, and pedestrians in the environment. Overall, it consists of 310,895 clips that are each 20 seconds long. The sensor data includes multi-camera and LiDAR coverage for all clips, and radar coverage for 163,850 clips.
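
As a quick consistency check on these figures, the clip count and clip length line up with the stated total duration (a minimal Python snippet, using only the numbers quoted above):

num_clips = 310_895
clip_seconds = 20
total_hours = num_clips * clip_seconds / 3600
print(f"{total_hours:,.0f} hours")  # ~1,727 hours, matching the stated total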

Figure 5. Geographic coverage of the Physical AI AV Dataset. It contains a total of 1,727 hours of driving recorded in 25 countries and over 2,500 cities (color indicates the number of clips by country).

To get started with the Physical AI AV Dataset, the physical_ai_av GitHub repository contains a Python developer kit and documentation (in the form of a wiki). In fact, this package was already used in Part 1 to load a sample of the dataset for Alpamayo 1.

Part 3: AlpaSim, a closed-loop simulation for AV evaluation

AlpaSim overview

Figure 6. High-level overview of the AlpaSim microservice architecture around the central Runtime. Each service runs in a separate process, enabling flexible scaling and modularity.

AlpaSim is built on a microservice architecture centered around the Runtime (see Figure 6), which orchestrates all simulation activity. Individual services, such as the Driver, Renderer, TrafficSim, Controller, and Physics, run in separate processes and can be assigned to different GPUs. This design offers two major advantages:

  1. Clear, modular APIs via gRPC, making it easy to integrate new services without dependency conflicts.
  2. Arbitrary horizontal scaling, allowing researchers to allocate compute where it matters most. For example, if driver inference becomes the bottleneck, simply launch additional driver processes. If rendering is the bottleneck, dedicate more GPUs to rendering. And if a rendering process cannot handle multiple scenes simultaneously, you can run multiple renderer instances on the same GPU to maximize utilization.

But horizontal scaling alone isn’t the full story. The real power of AlpaSim lies in how the Runtime enables pipeline parallelism (see Figure 7).

In traditional sequential rollouts, components must wait on one another, for instance, the driver must pause after each inference step until the renderer produces the next perception input. AlpaSim removes this bottleneck: while one scene is rendering, the driver can run inference for another scene. This overlap dramatically improves GPU utilization and throughput. Scaling even further, driver inference can be batched across many scenes, while multiple rendering processes generate perception inputs in parallel.

Figure 7. AlpaSim implements Pipeline Parallel Execution to optimize GPU utilization and increase throughput.
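
To make the overlap concrete, below is a minimal, self-contained Python sketch (illustrative only, not AlpaSim code) of how interleaving rendering and driver inference across scenes hides latency. The render and infer coroutines are stand-ins for the Renderer and Driver services, with made-up timings:

import asyncio
import time

RENDER_TIME = 0.05  # stand-in for per-step rendering latency (seconds)
INFER_TIME = 0.03   # stand-in for per-step driver inference latency (seconds)

async def render(scene: int, step: int) -> str:
    # Stands in for a call to the Renderer service
    await asyncio.sleep(RENDER_TIME)
    return f"frame(scene={scene}, step={step})"

async def infer(frame: str) -> str:
    # Stands in for a call to the Driver service
    await asyncio.sleep(INFER_TIME)
    return f"action({frame})"

async def rollout(scene: int, n_steps: int) -> None:
    # Within a scene, steps stay sequential: render, then infer
    for step in range(n_steps):
        frame = await render(scene, step)
        await infer(frame)

async def main() -> None:
    start = time.perf_counter()
    # Across scenes, the stages overlap: while one scene waits on rendering,
    # another scene's driver inference can proceed
    await asyncio.gather(*(rollout(scene, n_steps=10) for scene in range(4)))
    elapsed = time.perf_counter() - start
    sequential = 4 * 10 * (RENDER_TIME + INFER_TIME)
    print(f"4 scenes x 10 steps took {elapsed:.2f}s (fully sequential: ~{sequential:.2f}s)")

asyncio.run(main())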

A shared ecosystem

We provide initial implementations for all core services, including rendering via the NVIDIA Omniverse NuRec 3DGUT algorithm, a reference controller, and driver baselines. We will also be adding more driver models, including Alpamayo 1 and CAT-K, in the coming weeks.

The platform also ships initially with roughly 900 reconstructed scenes, each 20 seconds long, and the Physical AI AV Dataset, giving researchers an immediate way to evaluate end-to-end models in realistic closed-loop scenarios. In addition, AlpaSim offers extensive configurability, from camera parameters and rendering frequency to artificial latencies and many other simulation settings.

Beyond these built-in components, we see AlpaSim evolving into a broader collaborative ecosystem, where labs can seamlessly plug in their own driving, rendering, or traffic models and compare approaches directly on shared benchmarks.

AlpaSim in action

AlpaSim is already powering several of our internal research efforts.

Firstly, in our recently proposed Sim2Val framework, we demonstrated that AlpaSim rollouts are realistic enough to meaningfully improve real-world validation. By incorporating simulated trajectories into our evaluation pipeline, we were able to reduce variance in key real-world metrics by up to 83%, enabling faster and more confident model assessments.

Secondly, we rely on AlpaSim for closed-loop evaluation of our Alpamayo 1 model. By replaying reconstructed scenes and allowing the policy to drive end-to-end, we compute a DrivingScore that reflects performance under realistic traffic conditions.

Beyond evaluation, we are leveraging AlpaSim for closed-loop training using our concurrently released RoaD algorithm. RoaD effectively mitigates covariate shift between open-loop training and closed-loop deployment while being significantly more data-efficient than traditional reinforcement learning. 

Figure 8. Metric correlation between real-world drive (x-axis) and re-simulated drive (y-axis). We measure the closest distance to a nearby object (left) and the distance to the lane center (right).

Getting started with AlpaSim

Get started using AlpaSim for your own model evaluation in just three steps.

Step 1: Access AlpaSim

The open source repository contains the necessary software, with scene reconstruction artifacts available from the NVIDIA Physical AI Open Dataset.

Step 2: Prepare your environment

First, make sure to follow the onboarding steps in ONBOARDING.md.

Then, perform initial setup/installations with the following command:

source setup_local_env.sh

This will compile protos, download an example driver model, download a sample scene from Hugging Face, and install the alpasim_wizard command line tool.

Step 3: Run the simulation

Use the wizard to build, run, and evaluate a simulation rollout:

alpasim_wizard +deploy=local wizard.log_dir=$PWD/tutorial

The simulation logs and output can be found in the created tutorial directory. For a visualization of the results, an MP4 file is created at tutorial/eval/videos/clipgt-05bb8212..._0.mp4, which will look similar to the following.

Figure 9. An output visualization from AlpaSim, displaying a top-down semantic view with agent bounding boxes and maps (if available), average and per-timestep metrics, as well as the front camera view with predicted and ground truth trajectories overlaid. 

For more details about the output, and much more information about using AlpaSim, please see TUTORIAL.md.

Overall, this example demonstrates how real-world drives can be replayed with an end-to-end policy, including all static and dynamic objects from the original scene. From this starting point, and with AlpaSim's flexible plug-and-play architecture, users can tweak contender behavior, modify camera parameters, and iterate on their policies.

Integrating your policy

Driving policies are easily swappable through generic APIs, allowing developers to test their state-of-the-art implementations.

Step 1: gRPC integration

AlpaSim uses gRPC as the interface between components: a sample implementation of the driver component can be used as inspiration for conforming to the driver interface.
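
As a rough illustration of the integration pattern, here is a minimal Python sketch of a gRPC driver server. The stub modules (driver_pb2, driver_pb2_grpc), the Driver service name, the Drive method, and the message fields are hypothetical placeholders; the real names and fields come from the protos and sample driver in the AlpaSim repository:

from concurrent import futures

import grpc

# Hypothetical stubs; generate the real ones from the AlpaSim driver proto
import driver_pb2        # placeholder: request/response message definitions
import driver_pb2_grpc   # placeholder: service base class and registration helper

class MyDriver(driver_pb2_grpc.DriverServicer):
    # Wraps your policy behind the (hypothetical) Driver gRPC interface
    def Drive(self, request, context):
        # request would carry rendered camera frames and ego state;
        # replace run_my_policy with your model's inference call
        trajectory = run_my_policy(request)
        return driver_pb2.DriveResponse(trajectory=trajectory)

def run_my_policy(request):
    # Placeholder: return whatever trajectory representation the proto expects
    raise NotImplementedError

def serve(port: int = 50051) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    driver_pb2_grpc.add_DriverServicer_to_server(MyDriver(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()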

Step 2: Reconfigure and run

AlpaSim is highly customizable through YAML configuration files, including the specification of which components the simulator uses at runtime. Create a new configuration file for your model, for example:

# driver_configs/my_model.yaml

# @package _global_
services:
  driver:
    image: <user docker image>
    command:
      - "<command to start user-defined service>"

And run:

alpasim_wizard +deploy=local wizard.log_dir=$PWD/my_model +driver_configs=my_model.yaml

Examples of customization using the CLI:

You can also override configuration options directly when running the wizard, for example:

# Different scene
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  scenes.scene_ids=['clipgt-02eadd92-02f1-46d8-86fe-a9e338fed0b6']

# More rollouts
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  runtime.default_scenario_parameters.n_rollouts=8

# Different simulation length
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  runtime.default_scenario_parameters.n_sim_steps=200

Configuration is managed via Hydra – see src/wizard/configs/base_config.yaml for all available options.
To download the scene referenced above in Figure 9, you can run the following command:

hf download --repo-type=dataset \
--local-dir=data/nre-artifacts/all-usdzs \
nvidia/PhysicalAI-Autonomous-Vehicles-NuRec \
sample_set/25.07_release/Batch0001/02eadd92-02f1-46d8-86fe-a9e338fed0b6/02eadd92-02f1-46d8-86fe-a9e338fed0b6.usdz
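
If you prefer to fetch this artifact from Python instead of the CLI, an equivalent download with the huggingface_hub library (assuming it is installed and you are authenticated for the dataset) looks roughly like this:

from huggingface_hub import hf_hub_download

# Download the same USDZ scene artifact referenced for Figure 9
local_path = hf_hub_download(
    repo_id="nvidia/PhysicalAI-Autonomous-Vehicles-NuRec",
    repo_type="dataset",
    filename="sample_set/25.07_release/Batch0001/02eadd92-02f1-46d8-86fe-a9e338fed0b6/02eadd92-02f1-46d8-86fe-a9e338fed0b6.usdz",
    local_dir="data/nre-artifacts/all-usdzs",
)
print(local_path)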

Scaling your runs

AlpaSim adapts to fit your hardware configuration through coordination and parallelization of services, efficiently facilitating large test suites, perturbation studies, and training.

For example, to run a test suite with more rollouts per scene:

alpasim_wizard +deploy=local wizard.log_dir=$PWD/test_suite +experiment=my_test_suite.yaml runtime.default_scenario_parameters.n_rollouts=16

Figure 10. Multiple scene realizations can be obtained from the same starting point, due to variations in ego-vehicle motion or other agent behaviors. Four different rollouts are shown in this example, all starting from the same initial state.

Conclusion: Putting it all together

The future of autonomous driving relies on powerful end-to-end models, and AlpaSim provides the capability to quickly test and iterate on those models, accelerating research efforts. In this blog we introduced the Alpamayo 1 model, the Physical AI dataset, and the AlpaSim simulator. Together, they provide a complete framework for developing reasoning-based AV systems: a model, large amounts of data to train it, and a simulator for evaluation.

Putting it all together, below is an example of Alpamayo 1 driving closed-loop through a construction zone within AlpaSim, demonstrating the model’s reasoning and driving capabilities as well as AlpaSim’s ability to evaluate AV models in a variety of realistic driving environments.

Figure 11. Alpamayo 1 driving closed-loop within AlpaSim, navigating through a construction zone with its reasoning traces and trajectory predictions visualized.

Happy coding!
