
Building Generalist Humanoid Capabilities with NVIDIA Isaac GR00T N1.6 Using a Sim-to-Real Workflow 


To make humanoid robots useful, they need cognition and loco-manipulation that span perception, planning, and whole-body control in dynamic environments. 

Building these generalist robots requires a workflow that unifies simulation, control, and learning, so robots can acquire complex skills before transferring them to the real world.

In this post, we present NVIDIA Isaac GR00T N1.6 and describe a sim-to-real workflow that combines whole-body reinforcement learning (RL) in NVIDIA Isaac Lab, synthetic data–trained navigation with COMPASS, and vision-based localization using NVIDIA CUDA-accelerated visual mapping and simultaneous localization and mapping (SLAM). 

These components enable loco-manipulation, robust navigation, and environment-aware behavior across diverse robot embodiments.

Vision-language-action and reasoning

GR00T N1.6 is a multimodal vision-language-action (VLA) model that integrates visual observations from egocentric camera streams, robot states, and natural language instructions into a unified policy representation. The model uses world models, such as NVIDIA Cosmos Reason, to decompose high-level instructions into stepwise action plans grounded in scene understanding to perform real-world tasks. This architecture enables GR00T to execute locomotion and dexterous manipulation through end-to-end learned representations.
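
To make the observation-to-action interface of such a policy concrete, the sketch below wraps a placeholder model behind the kinds of inputs described above (an egocentric camera frame, proprioceptive state, and a language instruction) and a chunked action output. All class names, dimensions, and shapes are illustrative assumptions, not the GR00T N1.6 API.

# Illustrative sketch only: class names, dimensions, and shapes are assumptions,
# not the GR00T N1.6 API.
from dataclasses import dataclass
import numpy as np

@dataclass
class VLAObservation:
    rgb: np.ndarray          # egocentric camera frame, (H, W, 3) uint8
    robot_state: np.ndarray  # proprioception: joint positions, velocities, base pose
    instruction: str         # natural-language task description

class PlaceholderVLAPolicy:
    """Stand-in for a vision-language-action model: maps a multimodal
    observation to a short chunk of future actions."""

    def __init__(self, action_dim: int = 29, horizon: int = 16):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: VLAObservation) -> np.ndarray:
        # A real VLA encodes the image and instruction, fuses them with the
        # robot state, and decodes actions with a diffusion transformer.
        # Here we only return zeros with the expected shape.
        return np.zeros((self.horizon, self.action_dim), dtype=np.float32)

policy = PlaceholderVLAPolicy()
obs = VLAObservation(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    robot_state=np.zeros(58, dtype=np.float32),
    instruction="pick up the apple from the kitchen table",
)
action_chunk = policy.act(obs)  # executed step by step before re-planning
print(action_chunk.shape)       # (16, 29)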

GR00T N1.6 introduces several enhancements over previous releases that expand its capabilities and real-world applicability:

  1. Enhanced reasoning and perception: Uses a variant of Cosmos-Reason-2B VLM with native resolution support, enabling the robot to “see” clearly without distortion and reason better about its environment. This improvement translates to better scene understanding and more reliable task decomposition.
  2. Fluid, adaptive motion: A 2x larger diffusion transformer (32 layers) and state-relative action predictions result in smoother, less jittery movements that adapt easily to changing positions (see the sketch after this list).
  3. Improved cross-embodiment performance: Trained on thousands of hours of new and diverse teleoperation data (humanoids, mobile manipulators, bimanual arms), enabling better generalization across various robot embodiments. 
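
One way to read "state-relative action predictions" (an interpretation on our part, not a specification of the GR00T internals) is that the policy predicts offsets from the robot's current state rather than absolute targets, so the same prediction stays valid as positions drift. A minimal numeric sketch:

# Minimal sketch of state-relative action targets; an interpretation of the
# term above, not GR00T internals.
import numpy as np

def to_state_relative(absolute_targets: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """Training-time transform: store action targets as offsets from the state."""
    return absolute_targets - current_state

def to_absolute(relative_actions: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """Execution-time transform: apply predicted offsets to the live state."""
    return current_state + relative_actions

current = np.array([0.10, -0.25, 0.40])            # current joint positions (rad)
predicted_offsets = np.array([0.02, 0.00, -0.05])  # what a state-relative policy emits
joint_targets = to_absolute(predicted_offsets, current)
print(joint_targets)  # [ 0.12 -0.25  0.35]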

GR00T N1.6 includes pretrained weights for zero-shot evaluation and validation of basic manipulation primitives. Finetuning the model is beneficial when deploying it to a specific embodiment or task.

This demo from the Conference on Robot Learning (CoRL) shows GR00T N1.6 in action, performing a loco-manipulation task on a G1 humanoid robot. 


Video 1. GR00T N1.6 performing a loco-manipulation task on a G1 humanoid robot at CoRL

Whole-body RL training and sim-to-real transfer

Whole-body RL training in simulation provides the low-level motor intelligence that GR00T N1.6 uses and coordinates through its higher-level VLA policy. The whole-body controller trained in Isaac Lab with RL produces human-like, dynamically stable motion primitives covering locomotion, manipulation, and coordinated multi-contact behaviors.

These policies are trained and stress-tested at scale in Isaac Lab and Isaac Sim, then transferred zero-shot to physical humanoids, minimizing task-specific finetuning while maintaining robustness across environments and embodiments. This sim-to-real pipeline lets GR00T’s high-level VLA treat whole-body control as reliable, focusing its reasoning on task sequencing and scene-aware decision-making rather than raw motor stability.

GR00T-WholeBodyControl serves as the whole-body controller, providing the low-level loco-manipulation layer under GR00T N1.6. Using this controller, the full stack is validated in simulation before deployment on hardware, spanning high-level instruction following, mid-level behavior composition, and robust low-level control.
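
The division of labor between the two layers can be sketched as nested control loops: the VLA runs at a low rate and emits task-space commands, while the RL-trained whole-body controller runs at a high rate and turns the latest command plus proprioception into joint-level actions. The rates, command format, and class names below are assumptions for illustration, not the released interfaces.

# Conceptual sketch of the two-level control stack; rates, dimensions, and
# class names are assumptions, not the released interfaces.
import numpy as np

class HighLevelVLA:
    """Low-rate policy (a few Hz): reasons over images and language and
    outputs a task-space command, e.g. base velocity plus end-effector targets."""

    def command(self, obs: dict) -> np.ndarray:
        return np.zeros(10, dtype=np.float32)  # placeholder command vector

class WholeBodyController:
    """High-rate RL policy (tens to hundreds of Hz): consumes the latest
    high-level command plus proprioception and outputs joint targets that
    track the command while keeping the robot balanced."""

    def step(self, command: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        return np.zeros(29, dtype=np.float32)  # placeholder joint targets

vla, wbc = HighLevelVLA(), WholeBodyController()
command = vla.command({"rgb": None, "instruction": "walk to the table"})
for _ in range(20):  # the inner loop runs many times per high-level step
    proprio = np.zeros(58, dtype=np.float32)
    joint_targets = wbc.step(command, proprio)
    # joint_targets would be sent to the robot's joint-level controllers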

Synthetic-data–trained navigation

To layer goal-directed navigation on top of whole-body control, GR00T N1.6 is finetuned for point-to-point navigation using large-scale synthetic datasets generated by COMPASS in Isaac Lab. In this setup, COMPASS acts as a navigation specialist, producing diverse trajectories across scenes and embodiments that are used to adapt GR00T from a general VLA model into a strong point-navigation policy.

The navigation policy is trained in simulation and interfaces with the whole-body controller through simple velocity commands, rather than directly producing joint torques. This enables the low-level whole-body RL policy to handle balance and contact, while the navigation head focuses on obstacle avoidance, path following, and navigation–manipulation handoffs in real-world scenes. In experiments, this synthetic-only training pipeline achieves zero-shot sim-to-real transfer, including zero-shot deployment to new physical environments, without additional task-specific data collection.
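
To make the velocity-command handoff concrete, the snippet below uses a hand-written proportional controller as a stand-in for the learned point-navigation head; the real policy is learned from COMPASS data, and the gains, limits, and command layout here are arbitrary assumptions.

# A hand-written stand-in for the learned point-navigation head, only to show
# the (v_x, v_y, yaw_rate) command interface the whole-body controller consumes.
# Gains and limits are arbitrary assumptions.
import numpy as np

def point_goal_to_velocity(goal_xy: np.ndarray,
                           robot_xy: np.ndarray,
                           robot_yaw: float,
                           max_lin: float = 0.6,
                           max_ang: float = 1.0) -> np.ndarray:
    """Return a (v_x, v_y, yaw_rate) command in the robot frame toward a 2D goal."""
    delta_world = goal_xy - robot_xy
    # Rotate the world-frame offset into the robot frame.
    c, s = np.cos(robot_yaw), np.sin(robot_yaw)
    delta_robot = np.array([ c * delta_world[0] + s * delta_world[1],
                            -s * delta_world[0] + c * delta_world[1]])
    heading_error = np.arctan2(delta_robot[1], delta_robot[0])
    v = np.clip(1.0 * delta_robot, -max_lin, max_lin)    # proportional linear velocity
    w = np.clip(2.0 * heading_error, -max_ang, max_ang)  # proportional yaw rate
    return np.array([v[0], v[1], w], dtype=np.float32)

cmd = point_goal_to_velocity(np.array([2.0, 1.0]), np.array([0.0, 0.0]), 0.0)
# cmd has the same layout as the commands the learned navigation policy emits.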

COMPASS is a novel workflow for developing cross-embodiment mobility policies by integrating imitation learning, residual RL, and policy distillation. It has demonstrated the effectiveness of RL fine-tuning and strong zero-shot sim-to-real performance using Isaac Lab. 
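
Of the three ingredients named above, residual RL is the easiest to summarize in code: a base policy (for example, one trained with imitation learning) proposes an action, and a small RL-trained residual corrects it. The sketch below illustrates that composition pattern generically; it is not COMPASS source code.

# Generic residual-RL action composition; an illustration of the pattern,
# not COMPASS source code.
import numpy as np

class BasePolicy:
    """Imitation-learned mobility policy: observation -> nominal velocity command."""

    def act(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs[:3])  # placeholder (v_x, v_y, yaw_rate)

class ResidualPolicy:
    """RL-trained correction applied on top of the base policy's action."""

    def act(self, obs: np.ndarray, base_action: np.ndarray) -> np.ndarray:
        return np.zeros_like(base_action)  # placeholder residual

def composed_action(obs: np.ndarray, base: BasePolicy, residual: ResidualPolicy) -> np.ndarray:
    a_base = base.act(obs)
    return a_base + residual.act(obs, a_base)  # final command sent to the controller

obs = np.zeros(16, dtype=np.float32)
print(composed_action(obs, BasePolicy(), ResidualPolicy()))  # [0. 0. 0.]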

A humanoid robot navigating a crowded room with the COMPASS policy.
Figure 1. GR1 robot using the COMPASS workflow

Building on this, the GR00T N1.6 PointNav example releases provide step-by-step instructions and code for fine-tuning and evaluating navigation policies using COMPASS-generated data, so practitioners can reproduce and extend the navigation stack for their own embodiments and scenes.

Video 2. NVIDIA robot mobility workflows and AI models

Vision-based localization

Vision-based localization enables the GR00T N1.6 stack to use its whole-body controller and navigation policies in a large, real-world environment. After whole-body RL equips the robot with robust loco-manipulation skills and COMPASS-style synthetic data finetunes GR00T for point-to-point navigation, the system still requires an accurate estimate of the robot’s location so commands and waypoints correspond to real coordinates.

To provide this, a vision-centric mapping and localization stack uses onboard cameras and prebuilt maps to maintain low-drift pose estimates, enabling robot commands to be grounded in precise robot and object coordinates.

The visual mapping and localization stack is built on NVIDIA Isaac and NVIDIA CUDA-X, using the following libraries and models:

  • cuVSLAM is a real-time visual-inertial SLAM and odometry library. Its odometry provides smooth vehicle velocity, and its SLAM backend produces low-drift poses with loop-closure corrections for navigation.
  • cuVGL is a visual global localization library that computes an initial pose in a prebuilt map, which is used to bootstrap cuVSLAM.
  • FoundationStereo is a foundation model for stereo depth estimation, offering strong zero-shot generalization across diverse environments.
  • nvblox is an efficient 3D perception library that reconstructs the environment and generates a 2D occupancy map for path planning.

We collect stereo images of the environment and pre-build maps, including a cuVSLAM landmark map, a cuVGL bag-of-words map, and an occupancy map. Semantic locations, such as the kitchen table, are identified in the occupancy map and used for task planning.

At runtime, cuVGL retrieves visually similar image pairs from the pre-built map and estimates an initial pose from the stereo pairs. Using this pose as a prior, cuVSLAM matches local landmarks against the pre-built landmark map to localize. After successful localization, cuVSLAM tracks features continuously and performs map-based optimization, keeping the robot accurately localized during navigation.
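
The runtime sequence above (global localization, then map-based localization, then continuous tracking) can be summarized as a small state machine. The cu* functions below are stubs standing in for cuVGL and cuVSLAM calls; they are not the real library APIs.

# Conceptual state machine for the runtime localization flow; the cu* functions
# are stubs standing in for cuVGL/cuVSLAM, not real library APIs.
from enum import Enum, auto

def cuvgl_estimate_initial_pose(stereo_pair):         # stub
    return (0.0, 0.0, 0.0)

def cuvslam_localize_with_prior(stereo_pair, prior):  # stub
    return True

def cuvslam_track_and_optimize(stereo_pair):          # stub
    return (0.0, 0.0, 0.0)

class LocState(Enum):
    GLOBAL_LOCALIZATION = auto()  # cuVGL: initial pose from the prebuilt map
    MAP_LOCALIZATION = auto()     # cuVSLAM: landmark matching using that prior
    TRACKING = auto()             # cuVSLAM: continuous tracking and optimization

def localization_step(state, stereo_pair, prior=None):
    if state is LocState.GLOBAL_LOCALIZATION:
        prior = cuvgl_estimate_initial_pose(stereo_pair)
        return (LocState.MAP_LOCALIZATION, prior) if prior else (state, None)
    if state is LocState.MAP_LOCALIZATION:
        ok = cuvslam_localize_with_prior(stereo_pair, prior)
        return (LocState.TRACKING, prior) if ok else (LocState.GLOBAL_LOCALIZATION, None)
    pose = cuvslam_track_and_optimize(stereo_pair)  # low-drift pose used for navigation
    return (LocState.TRACKING, pose)

state, aux = LocState.GLOBAL_LOCALIZATION, None
state, aux = localization_step(state, stereo_pair=None, prior=aux)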

We provide an offline map creation workflow in Isaac ROS that builds the maps from a ROS bag, along with the isaac_ros_visual_slam and isaac_ros_visual_global_localization packages for localization. You can assemble a localization pipeline in ROS 2 from a stereo camera driver, image rectification nodes, an occupancy map server, and the cuVSLAM and cuVGL nodes.
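
A ROS 2 launch description along those lines might look roughly like the sketch below. The package names follow the Isaac ROS packages mentioned above, but the executable names, remappings, and parameters are assumptions; check them against the Isaac ROS documentation before use.

# Sketch of a ROS 2 Python launch file for the localization pipeline.
# Executable names, remappings, and parameters are assumptions; verify them
# against the Isaac ROS documentation.
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Stereo camera driver (vendor-specific; shown as a placeholder).
        Node(package='your_camera_driver', executable='stereo_camera_node'),
        # Rectify the left and right image streams.
        Node(package='image_proc', executable='rectify_node', name='rectify_left',
             remappings=[('image', '/left/image_raw'),
                         ('image_rect', '/left/image_rect')]),
        Node(package='image_proc', executable='rectify_node', name='rectify_right',
             remappings=[('image', '/right/image_raw'),
                         ('image_rect', '/right/image_rect')]),
        # cuVSLAM visual-inertial odometry and SLAM.
        Node(package='isaac_ros_visual_slam', executable='isaac_ros_visual_slam'),
        # cuVGL global localization against the prebuilt map.
        Node(package='isaac_ros_visual_global_localization',
             executable='visual_global_localization_node',
             parameters=[{'map_dir': '/maps/my_building'}]),
        # Occupancy map server used for path planning.
        Node(package='nav2_map_server', executable='map_server',
             parameters=[{'yaml_filename': '/maps/my_building/occupancy.yaml'}]),
    ])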

A black and white gif of a robot picking up an apple off a counter from a first-person perspective.
Figure 2. cuVSLAM feature tracking when a robot picks up an apple

Get started

  • Download and experiment with the open Isaac GR00T N1.6 model from Hugging Face (a download sketch follows this list).
  • Use Isaac Lab and Newton for RL and policy training, and Isaac Lab to generate synthetic navigation data with COMPASS.
  • Use the CUDA-X visual mapping and localization libraries released as part of Isaac ROS.
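
For the first bullet, the model weights can be pulled locally with the huggingface_hub client. The repository ID below is a placeholder; substitute the actual GR00T N1.6 repository name listed on Hugging Face.

# Download the model weights locally; the repo_id is a placeholder, so replace
# it with the actual GR00T N1.6 repository on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nvidia/GR00T-N1.6",          # placeholder repository ID
    local_dir="./checkpoints/gr00t-n1.6",
)
print(f"Model downloaded to {local_dir}")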

Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on LinkedIn, Instagram, X, and Facebook. Explore NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Robotics forum. To start your robotics journey, enroll in our free NVIDIA Robotics Fundamentals courses today.

Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems.

Learn more by watching NVIDIA Live at CES
