
Building Generalist Humanoid Capabilities with NVIDIA Isaac GR00T N1.6 Using a Sim-to-Real Workflow 


To make humanoid robots useful, they need cognition and loco-manipulation that span perception, planning, and whole-body control in dynamic environments. 

Building these generalist robots requires a workflow that unifies simulation, control, and learning, so robots can acquire complex skills before transferring them to the real world.

In this post, we present NVIDIA Isaac GR00T N1.6 and describe a sim-to-real workflow that combines whole-body reinforcement learning (RL) in NVIDIA Isaac Lab, synthetic data–trained navigation with COMPASS, and vision-based localization using NVIDIA CUDA-accelerated visual mapping and simultaneous localization and mapping (SLAM). 

These components enable loco-manipulation, robust navigation, and environment-aware behavior across diverse robot embodiments.

Vision-language-action and reasoning

GR00T N1.6 is a multimodal vision-language-action (VLA) model that integrates visual observations from egocentric camera streams, robot states, and natural language instructions into a unified policy representation. The model uses world models, such as NVIDIA Cosmos Reason, to decompose high-level instructions into stepwise action plans grounded in scene understanding to perform real-world tasks. This architecture enables GR00T to execute locomotion and dexterous manipulation through end-to-end learned representations.
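
To make the observation-to-action interface of such a policy concrete, the sketch below wraps a placeholder model behind the kinds of inputs described above (an egocentric camera frame, proprioceptive state, and a language instruction) and a chunked action output. All class names, dimensions, and shapes are illustrative assumptions, not the GR00T N1.6 API.

# Illustrative sketch only: class names, dimensions, and shapes are assumptions,
# not the GR00T N1.6 API.
from dataclasses import dataclass
import numpy as np

@dataclass
class VLAObservation:
    rgb: np.ndarray          # egocentric camera frame, (H, W, 3) uint8
    robot_state: np.ndarray  # proprioception: joint positions, velocities, base pose
    instruction: str         # natural-language task description

class PlaceholderVLAPolicy:
    """Stand-in for a vision-language-action model: maps a multimodal
    observation to a short chunk of future actions."""

    def __init__(self, action_dim: int = 29, horizon: int = 16):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: VLAObservation) -> np.ndarray:
        # A real VLA encodes the image and instruction, fuses them with the
        # robot state, and decodes actions with a diffusion transformer.
        # Here we only return zeros with the expected shape.
        return np.zeros((self.horizon, self.action_dim), dtype=np.float32)

policy = PlaceholderVLAPolicy()
obs = VLAObservation(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    robot_state=np.zeros(58, dtype=np.float32),
    instruction="pick up the apple from the kitchen table",
)
action_chunk = policy.act(obs)  # executed step by step before re-planning
print(action_chunk.shape)       # (16, 29)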

GR00T N1.6 introduces several enhancements over previous releases that expand its capabilities and real-world applicability:

  1. Enhanced reasoning and perception: Uses a variant of Cosmos-Reason-2B VLM with native resolution support, enabling the robot to “see” clearly without distortion and reason better about its environment. This improvement translates to better scene understanding and more reliable task decomposition.
  2. Fluid, adaptive motion: A 2x larger diffusion transformer (32 layers) and state-relative action predictions result in smoother, less jittery movements that adapt easily to changing positions (see the sketch after this list).
  3. Improved cross-embodiment performance: Trained on thousands of hours of new and diverse teleoperation data (humanoids, mobile manipulators, bimanual arms), enabling better generalization across various robot embodiments. 
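
One way to read "state-relative action predictions" (an interpretation on our part, not a specification of the GR00T internals) is that the policy predicts offsets from the robot's current state rather than absolute targets, so the same prediction stays valid as positions drift. A minimal numeric sketch:

# Minimal sketch of state-relative action targets; an interpretation of the
# term above, not GR00T internals.
import numpy as np

def to_state_relative(absolute_targets: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """Training-time transform: store action targets as offsets from the state."""
    return absolute_targets - current_state

def to_absolute(relative_actions: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """Execution-time transform: apply predicted offsets to the live state."""
    return current_state + relative_actions

current = np.array([0.10, -0.25, 0.40])            # current joint positions (rad)
predicted_offsets = np.array([0.02, 0.00, -0.05])  # what a state-relative policy emits
joint_targets = to_absolute(predicted_offsets, current)
print(joint_targets)  # [ 0.12 -0.25  0.35]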

GR00T N1.6 includes pretrained weights for zero-shot evaluation and validation of basic manipulation primitives. Finetuning the model is beneficial when deploying it to a specific embodiment or task.

This demo from the Conference on Robot Learning (CoRL) shows GR00T N1.6 in action, performing a loco-manipulation task on a G1 humanoid robot. 


Video 1. GR00T N1.6 performing a loco-manipulation task on a G1 humanoid robot at CoRL

Whole-body RL training and sim-to-real transfer

Whole-body RL training in simulation provides the low-level motor intelligence that GR00T N1.6 uses and coordinates through its higher-level VLA policy. The whole-body controller trained in Isaac Lab with RL produces human-like, dynamically stable motion primitives covering locomotion, manipulation, and coordinated multi-contact behaviors.

These policies are trained and stress-tested at scale in Isaac Lab and Isaac Sim, then transferred zero-shot to physical humanoids, minimizing task-specific finetuning while maintaining robustness across environments and embodiments. This sim-to-real pipeline lets GR00T’s high-level VLA treat whole-body control as reliable, focusing its reasoning on task sequencing and scene-aware decision-making rather than raw motor stability.

GR00T-WholeBodyControl serves as the whole-body controller, providing the low-level loco-manipulation layer under GR00T N1.6. Using this controller, the full stack is validated in simulation before deployment on hardware, spanning high-level instruction following, mid-level behavior composition, and robust low-level control.
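
The division of labor between the two layers can be sketched as nested control loops: the VLA runs at a low rate and emits task-space commands, while the RL-trained whole-body controller runs at a high rate and turns the latest command plus proprioception into joint-level actions. The rates, command format, and class names below are assumptions for illustration, not the released interfaces.

# Conceptual sketch of the two-level control stack; rates, dimensions, and
# class names are assumptions, not the released interfaces.
import numpy as np

class HighLevelVLA:
    """Low-rate policy (a few Hz): reasons over images and language and
    outputs a task-space command, e.g. base velocity plus end-effector targets."""

    def command(self, obs: dict) -> np.ndarray:
        return np.zeros(10, dtype=np.float32)  # placeholder command vector

class WholeBodyController:
    """High-rate RL policy (tens to hundreds of Hz): consumes the latest
    high-level command plus proprioception and outputs joint targets that
    track the command while keeping the robot balanced."""

    def step(self, command: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        return np.zeros(29, dtype=np.float32)  # placeholder joint targets

vla, wbc = HighLevelVLA(), WholeBodyController()
command = vla.command({"rgb": None, "instruction": "walk to the table"})
for _ in range(20):  # the inner loop runs many times per high-level step
    proprio = np.zeros(58, dtype=np.float32)
    joint_targets = wbc.step(command, proprio)
    # joint_targets would be sent to the robot's joint-level controllers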

Synthetic-data–trained navigation

To layer goal-directed navigation on top of whole-body control, GR00T N1.6 is finetuned for point-to-point navigation using large-scale synthetic datasets generated by COMPASS in Isaac Lab. In this setup, COMPASS acts as a navigation specialist, producing diverse trajectories across scenes and embodiments that are used to adapt GR00T from a general VLA model into a strong point-navigation policy.

The navigation policy is trained in simulation and interfaces with the whole-body controller through simple velocity commands, rather than directly producing joint torques. This enables the low-level whole-body RL policy to handle balance and contact, while the navigation head focuses on obstacle avoidance, path following, and navigation–manipulation handoffs in real-world scenes. In experiments, this synthetic-only training pipeline achieves zero-shot sim-to-real transfer, including zero-shot deployment to new physical environments, without additional task-specific data collection.
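
To make the velocity-command handoff concrete, the snippet below uses a hand-written proportional controller as a stand-in for the learned point-navigation head; the real policy is learned from COMPASS data, and the gains, limits, and command layout here are arbitrary assumptions.

# A hand-written stand-in for the learned point-navigation head, only to show
# the (v_x, v_y, yaw_rate) command interface the whole-body controller consumes.
# Gains and limits are arbitrary assumptions.
import numpy as np

def point_goal_to_velocity(goal_xy: np.ndarray,
                           robot_xy: np.ndarray,
                           robot_yaw: float,
                           max_lin: float = 0.6,
                           max_ang: float = 1.0) -> np.ndarray:
    """Return a (v_x, v_y, yaw_rate) command in the robot frame toward a 2D goal."""
    delta_world = goal_xy - robot_xy
    # Rotate the world-frame offset into the robot frame.
    c, s = np.cos(robot_yaw), np.sin(robot_yaw)
    delta_robot = np.array([ c * delta_world[0] + s * delta_world[1],
                            -s * delta_world[0] + c * delta_world[1]])
    heading_error = np.arctan2(delta_robot[1], delta_robot[0])
    v = np.clip(1.0 * delta_robot, -max_lin, max_lin)    # proportional linear velocity
    w = np.clip(2.0 * heading_error, -max_ang, max_ang)  # proportional yaw rate
    return np.array([v[0], v[1], w], dtype=np.float32)

cmd = point_goal_to_velocity(np.array([2.0, 1.0]), np.array([0.0, 0.0]), 0.0)
# cmd has the same layout as the commands the learned navigation policy emits.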

COMPASS is a novel workflow for developing cross-embodiment mobility policies by integrating imitation learning, residual RL, and policy distillation. It has demonstrated the effectiveness of RL fine-tuning and strong zero-shot sim-to-real performance using Isaac Lab. 
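
Of the three ingredients named above, residual RL is the easiest to summarize in code: a base policy (for example, one trained with imitation learning) proposes an action, and a small RL-trained residual corrects it. The sketch below illustrates that composition pattern generically; it is not COMPASS source code.

# Generic residual-RL action composition; an illustration of the pattern,
# not COMPASS source code.
import numpy as np

class BasePolicy:
    """Imitation-learned mobility policy: observation -> nominal velocity command."""

    def act(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs[:3])  # placeholder (v_x, v_y, yaw_rate)

class ResidualPolicy:
    """RL-trained correction applied on top of the base policy's action."""

    def act(self, obs: np.ndarray, base_action: np.ndarray) -> np.ndarray:
        return np.zeros_like(base_action)  # placeholder residual

def composed_action(obs: np.ndarray, base: BasePolicy, residual: ResidualPolicy) -> np.ndarray:
    a_base = base.act(obs)
    return a_base + residual.act(obs, a_base)  # final command sent to the controller

obs = np.zeros(16, dtype=np.float32)
print(composed_action(obs, BasePolicy(), ResidualPolicy()))  # [0. 0. 0.]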

A humanoid robot navigating a crowded room with the COMPASS policy.
Figure 1. GR1 robot using the COMPASS workflow

Building on this, the GR00T N1.6 PointNav example releases provide step-by-step instructions and code for fine-tuning and evaluating navigation policies using COMPASS-generated data, so practitioners can reproduce and extend the navigation stack for their own embodiments and scenes.

Video 2. NVIDIA robot mobility workflows and AI models

Vision-based localization

Vision-based localization enables the GR00T N1.6 stack to use its whole-body controller and navigation policies in a large, real-world environment. After whole-body RL equips the robot with robust loco-manipulation skills and COMPASS-style synthetic data finetunes GR00T for point-to-point navigation, the system still requires an accurate estimate of the robot’s location so commands and waypoints correspond to real coordinates.

To provide this, a vision-centric mapping and localization stack uses onboard cameras and prebuilt maps to maintain low-drift pose estimates, enabling robot commands to be grounded in precise robot and object coordinates.

The visual mapping and localization stack is built on NVIDIA Isaac and NVIDIA CUDA-X, using the following libraries and models:

  • cuVSLAM is a real-time visual-inertial SLAM and odometry library. Its odometry provides smooth vehicle velocity, and its SLAM backend produces low-drift poses with loop-closure corrections for navigation.
  • cuVGL is a visual global localization library that computes an initial pose in a prebuilt map, which is used to bootstrap cuVSLAM.
  • FoundationStereo is a foundation model for stereo depth estimation, offering strong zero-shot generalization across diverse environments.
  • nvblox is an efficient 3D perception library that reconstructs the environment and generates a 2D occupancy map for path planning.

We collect stereo images of the environment and pre-build maps, including a cuVSLAM landmark map, a cuVGL bag-of-words map, and an occupancy map. Semantic locations, such as the kitchen table, are identified in the occupancy map and used for task planning.

At runtime, cuVGL retrieves visually similar image pairs from the pre-built map and estimates an initial pose from the stereo pairs. Using this pose as a prior, cuVSLAM matches local landmarks against the pre-built landmark map to localize. After successful localization, cuVSLAM tracks features continuously and performs map-based optimization, keeping the robot accurately localized during navigation.
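
The runtime sequence above (global localization, then map-based localization, then continuous tracking) can be summarized as a small state machine. The cu* functions below are stubs standing in for cuVGL and cuVSLAM calls; they are not the real library APIs.

# Conceptual state machine for the runtime localization flow; the cu* functions
# are stubs standing in for cuVGL/cuVSLAM, not real library APIs.
from enum import Enum, auto

def cuvgl_estimate_initial_pose(stereo_pair):         # stub
    return (0.0, 0.0, 0.0)

def cuvslam_localize_with_prior(stereo_pair, prior):  # stub
    return True

def cuvslam_track_and_optimize(stereo_pair):          # stub
    return (0.0, 0.0, 0.0)

class LocState(Enum):
    GLOBAL_LOCALIZATION = auto()  # cuVGL: initial pose from the prebuilt map
    MAP_LOCALIZATION = auto()     # cuVSLAM: landmark matching using that prior
    TRACKING = auto()             # cuVSLAM: continuous tracking and optimization

def localization_step(state, stereo_pair, prior=None):
    if state is LocState.GLOBAL_LOCALIZATION:
        prior = cuvgl_estimate_initial_pose(stereo_pair)
        return (LocState.MAP_LOCALIZATION, prior) if prior else (state, None)
    if state is LocState.MAP_LOCALIZATION:
        ok = cuvslam_localize_with_prior(stereo_pair, prior)
        return (LocState.TRACKING, prior) if ok else (LocState.GLOBAL_LOCALIZATION, None)
    pose = cuvslam_track_and_optimize(stereo_pair)  # low-drift pose used for navigation
    return (LocState.TRACKING, pose)

state, aux = LocState.GLOBAL_LOCALIZATION, None
state, aux = localization_step(state, stereo_pair=None, prior=aux)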

We provide an offline map creation workflow in Isaac ROS that builds the maps from a ROS bag, along with the isaac_ros_visual_slam and isaac_ros_visual_global_localization packages for localization. You can assemble a localization pipeline in ROS 2 from a stereo camera driver, image rectification nodes, an occupancy map server, and the cuVSLAM and cuVGL nodes.
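
A ROS 2 launch description along those lines might look roughly like the sketch below. The package names follow the Isaac ROS packages mentioned above, but the executable names, remappings, and parameters are assumptions; check them against the Isaac ROS documentation before use.

# Sketch of a ROS 2 Python launch file for the localization pipeline.
# Executable names, remappings, and parameters are assumptions; verify them
# against the Isaac ROS documentation.
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Stereo camera driver (vendor-specific; shown as a placeholder).
        Node(package='your_camera_driver', executable='stereo_camera_node'),
        # Rectify the left and right image streams.
        Node(package='image_proc', executable='rectify_node', name='rectify_left',
             remappings=[('image', '/left/image_raw'),
                         ('image_rect', '/left/image_rect')]),
        Node(package='image_proc', executable='rectify_node', name='rectify_right',
             remappings=[('image', '/right/image_raw'),
                         ('image_rect', '/right/image_rect')]),
        # cuVSLAM visual-inertial odometry and SLAM.
        Node(package='isaac_ros_visual_slam', executable='isaac_ros_visual_slam'),
        # cuVGL global localization against the prebuilt map.
        Node(package='isaac_ros_visual_global_localization',
             executable='visual_global_localization_node',
             parameters=[{'map_dir': '/maps/my_building'}]),
        # Occupancy map server used for path planning.
        Node(package='nav2_map_server', executable='map_server',
             parameters=[{'yaml_filename': '/maps/my_building/occupancy.yaml'}]),
    ])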

A black and white gif of a robot picking up an apple off a counter from a first-person perspective.
Figure 2. cuVSLAM feature tracking when a robot picks up an apple

Get started

  • Download and experiment with the open Isaac GR00T N1.6 model from Hugging Face (a download sketch follows this list).
  • Use Isaac Lab and Newton for RL and policy training, and Isaac Lab to generate synthetic navigation data with COMPASS.
  • Use the CUDA-X visual mapping and localization libraries released as part of Isaac ROS.
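
For the first bullet, the model weights can be pulled locally with the huggingface_hub client. The repository ID below is a placeholder; substitute the actual GR00T N1.6 repository name listed on Hugging Face.

# Download the model weights locally; the repo_id is a placeholder, so replace
# it with the actual GR00T N1.6 repository on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nvidia/GR00T-N1.6",          # placeholder repository ID
    local_dir="./checkpoints/gr00t-n1.6",
)
print(f"Model downloaded to {local_dir}")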

Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on LinkedIn, Instagram, X, and Facebook. Explore NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Robotics forum. To start your robotics journey, enroll in our free NVIDIA Robotics Fundamentals courses today.

Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems.

Learn more by watching NVIDIA Live at CES
