While today’s robots excel in controlled settings, they still struggle with the unpredictability, dexterity, and nuanced interactions required for real-world tasks—from assembling delicate components to manipulating everyday objects with human-like precision.
Robot learning has emerged as the key to bridging this gap between laboratory demonstrations and real-world deployment. Yet traditional approaches face fundamental limitations:
- Classical simulators can’t capture the full complexity of modern robotic systems
- Human demonstrations are difficult to translate across different robot embodiments
- The intricate coordination of vision and touch that humans take for granted remains elusive for machines
This edition of NVIDIA Robotics Research and Development Digest (R²D²) explores three groundbreaking neural innovations from NVIDIA Research that are transforming how robots learn and adapt, featured at CoRL 2025:
- NeRD (Neural Robot Dynamics): Enhances simulation with learned dynamics models that generalize across tasks while enabling real-world fine-tuning.
- Dexplore: Unlocks human-level dexterity by treating motion-captured demonstrations as adaptive guidance.
- VT-Refine: Combines vision and tactile sensing to master precise bimanual assembly tasks through novel real-to-sim-to-real training.
Together, these advances give developers the techniques, libraries, and workflows to push robot learning research forward.
Teaching robots through neural simulation
Simulation plays a key role in the robotics development workflow. Robots can learn to perform tasks robustly in simulation, since parameters and properties like mass and friction can be randomized during training. However, traditional simulators struggle to capture the complexity of modern robots, which often have high degrees of freedom and intricate mechanisms. Neural models can help with this challenge, as they can efficiently predict complex dynamics and adapt to real-world data.
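To make that randomization concrete, here is a minimal sketch of per-episode domain randomization; the toy `Body` and `Shape` classes and the parameter ranges are illustrative, not taken from any particular simulator:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Body:
    mass: float

@dataclass
class Shape:
    friction: float

def randomize_dynamics(bodies, shapes, rng):
    """Resample physical parameters for one episode (illustrative ranges)."""
    for body in bodies:
        body.mass *= rng.uniform(0.8, 1.2)      # scale each link mass by +/-20%
    for shape in shapes:
        shape.friction = rng.uniform(0.5, 1.2)  # draw a fresh friction coefficient

rng = np.random.default_rng(0)
bodies = [Body(mass=1.0) for _ in range(4)]
shapes = [Shape(friction=0.9) for _ in range(4)]
for episode in range(3):
    randomize_dynamics(bodies, shapes, rng)  # new dynamics every episode
    # ... roll out the policy in the randomized simulator here ...
```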
NeRD, for example, is a learned dynamics model that predicts the future states of a specific robot (or articulated rigid-body system) under contact constraints. It can replace the low-level dynamics and contact solvers of an analytical simulator, enabling a hybrid simulation framework.
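The hybrid idea can be pictured as a simulator step whose contact solve and integration are swapped for a neural forward model. Below is a minimal sketch under that assumption; the stand-in MLP and the dimensions are hypothetical, not the actual NeRD architecture or Warp API:

```python
import torch
import torch.nn as nn

class LearnedDynamics(nn.Module):
    """Stand-in for a trained NeRD-style model: maps (state, action)
    to the next state, including the effect of contacts."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def hybrid_step(state, action, model):
    # The learned model replaces the analytical contact solve and
    # integration; resets, rendering, and rewards stay in the simulator.
    with torch.no_grad():
        return model(state, action)

model = LearnedDynamics(state_dim=37, action_dim=12)  # sizes are illustrative
state = torch.zeros(1, 37)
for _ in range(1000):  # long rollouts are where stability matters
    action = torch.zeros(1, 12)
    state = hybrid_step(state, action, model)
```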

NeRD uses a robot-centric state representation that enforces spatial invariance. This boosts NeRD’s training and data efficiency and greatly improves generalization. NeRD can be easily integrated into existing articulated rigid-body simulation frameworks: it has been validated through integration with NVIDIA Warp and is planned to serve as one of the many solvers in the Newton Physics Engine.
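One way to read “robot-centric and spatially invariant” is that global position and heading are factored out of the network input, so the same local motion looks identical everywhere in the world. Here is a toy 2D version of that transformation; the paper’s actual representation differs in detail:

```python
import numpy as np

def to_robot_frame(world_state):
    """Express a state relative to the robot's base, discarding global
    position and heading so the input is spatially invariant."""
    yaw = world_state["base_yaw"]            # heading in world frame
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])
    # Rotate velocities into the base frame; joint states are already local.
    # Global position ("base_pos") is deliberately dropped.
    lin_vel_local = rot @ world_state["lin_vel"]
    return np.concatenate([lin_vel_local,
                           world_state["joint_pos"],
                           world_state["joint_vel"]])

state = {"base_pos": np.array([3.0, -1.0]), "base_yaw": 0.7,
         "lin_vel": np.array([0.5, 0.0]),
         "joint_pos": np.zeros(12), "joint_vel": np.zeros(12)}
obs = to_robot_frame(state)  # identical wherever the robot stands in the world
```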
To train a NeRD model for a given robot, 100K random trajectories of 100 timesteps each are collected as training data. NeRD is implemented as a lightweight GPT-2-style transformer, and models were trained for six different robotic systems.
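On the modeling side, a GPT-2-style dynamics model is essentially a causal transformer trained to predict the next state from a window of past states and actions. The sketch below shows that training-loop shape with a tiny stand-in network and synthetic data; all sizes are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the actual models cover six different robotic systems.
STATE_DIM, ACTION_DIM, CONTEXT = 37, 12, 100

class TinyDynamicsTransformer(nn.Module):
    """GPT-2-flavored stand-in: attends over a window of past
    (state, action) pairs and predicts the next state."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(STATE_DIM + ACTION_DIM, 128)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(128, STATE_DIM)

    def forward(self, states, actions):
        x = self.embed(torch.cat([states, actions], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=causal))

model = TinyDynamicsTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# One synthetic batch standing in for the 100K random trajectories.
states = torch.randn(8, CONTEXT, STATE_DIM)
actions = torch.randn(8, CONTEXT, ACTION_DIM)
pred = model(states[:, :-1], actions[:, :-1])       # predict state t+1 from history
loss = nn.functional.mse_loss(pred, states[:, 1:])  # next-state supervision
loss.backward()
opt.step()
```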
NeRD models are stable and accurate over thousands of time steps, achieving less than 0.1% error in accumulated reward over a 1,000-step policy evaluation on the ANYmal quadruped robot. The approach has also demonstrated zero-shot sim-to-real transfer: a Franka reach policy learned in a NeRD-integrated simulator transferred directly to hardware. NeRD can also be fine-tuned on real-world data to further close the sim-to-real gap.
Neural models like NeRD will speed up robotics research, enabling developers to accurately simulate complex full-body dynamics alongside classical simulation techniques.

Learning dexterous skills from human motion
Teaching robot hands human-level dexterity has historically been a difficult problem. Human hands combine kinematic complexity, compliance, and rich tactile sensing in a way robotic hands cannot yet match: robotic hands have fewer degrees of freedom, simpler actuation, and more limited sensing and control. This makes it difficult for robots to learn dexterous manipulation from humans.
Hand-object motion-capture (MoCap) repositories provide abundant contact-rich human demonstrations, but they cannot be used directly for robot policy learning. Existing workflows chain three major components: retargeting, tracking, and residual correction, and errors compound across them.
Dexplore introduces Reference-Scoped Exploration (RSE), a unified, single-loop optimization that integrates retargeting and tracking to train a scalable robot control policy directly from MoCap data. Demonstrations are not treated as strict ground truth; instead, they are viewed as soft guidance.
This preserves the intent of the demonstration and enables the robot to autonomously discover motions compatible with its own embodiment.
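One intuition for soft guidance is a reward that is indifferent within a tolerance band around the reference motion and only pulls the policy back once it strays beyond that band. The toy scoring function below illustrates the idea; it is not Dexplore’s actual objective:

```python
import numpy as np

def reference_scoped_reward(robot_pose, ref_pose, tolerance=0.05):
    """Reward exploration within a band around the MoCap reference.

    Inside the band the demonstration exerts no pressure, letting the
    policy find embodiment-compatible motions; outside it, the reward
    decays smoothly back toward the reference."""
    dist = np.linalg.norm(robot_pose - ref_pose)
    excess = max(0.0, dist - tolerance)   # zero while within scope
    return float(np.exp(-10.0 * excess))  # smooth pull-back beyond it

# The same motion scores identically anywhere inside the band:
print(reference_scoped_reward(np.zeros(3), np.full(3, 0.02)))  # ~1.0
print(reference_scoped_reward(np.zeros(3), np.full(3, 0.2)))   # < 1.0
```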

In the second part of the workflow, a vision-based generative control policy is trained by distilling the state-based imitation policy. This enables the robotic hand to manipulate an object using only partial observations from a single-view depth image, together with sparse, user-defined goals.
During training, the policy’s objective is to have the robot hand follow the given trajectory, enabling diverse object manipulation skills such as grasping a banana, cellphone, cup, or binoculars. The model comprises an encoder, a prior network, and a decoder policy. At inference time, the encoder is omitted and the latent embedding is sampled directly from the learned prior, producing a generative control policy capable of effective goal-conditioned dexterous manipulation from only partial observations.
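That encoder/prior/decoder split mirrors a conditional VAE: the encoder shapes the latent space during distillation, and inference replaces it by sampling from the learned prior. Here is a schematic sketch under that reading, with illustrative sizes and interfaces that are assumptions, not the paper’s actual design:

```python
import torch
import torch.nn as nn

OBS_DIM, GOAL_DIM, LATENT, ACT_DIM = 256, 16, 32, 24  # illustrative sizes

class GenerativePolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: used only during training/distillation.
        self.encoder = nn.Linear(OBS_DIM + ACT_DIM, 2 * LATENT)
        # Prior: maps partial observations + sparse goal to a latent distribution.
        self.prior = nn.Linear(OBS_DIM + GOAL_DIM, 2 * LATENT)
        # Decoder: turns (observation, latent) into a hand action.
        self.decoder = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM))

    def act(self, obs, goal):
        """Inference path: the encoder is omitted and the latent is
        sampled directly from the learned prior."""
        mu, log_std = self.prior(torch.cat([obs, goal], -1)).chunk(2, -1)
        z = mu + log_std.exp() * torch.randn_like(mu)
        return self.decoder(torch.cat([obs, z], -1))

policy = GenerativePolicy()
obs = torch.randn(1, OBS_DIM)    # features from a single-view depth image
goal = torch.randn(1, GOAL_DIM)  # sparse, user-defined goal
action = policy.act(obs, goal)
```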
This approach achieves almost 20% higher success rates with the Inspire hand and consistently outperforms each baseline method on both the Inspire and Allegro robot hands. The state-based policy is evaluated on its ability to imitate human demonstrations and generalize to unseen scenarios, while the vision-based policy is evaluated on manipulation in simulation and successful transfer to the real world.
Combining vision and touch for precise bimanual assembly
Humans are good at manipulation and bimanual assembly tasks because we rely on both visual and tactile feedback. Envision performing a plug-and-socket assembly with both hands: first, you visually identify and grasp the components; then, during assembly, tactile feedback becomes essential, because occlusions make visual feedback alone insufficient to complete the task.
Behavioral cloning with diffusion policies is useful here, but it suffers from the scarcity of real-world demonstrations and from data collection interfaces that capture little or no tactile feedback.
To address this data problem, VT-Refine introduces a real-to-sim-to-real framework that combines simulation, vision, and touch for bimanual assembly tasks (Figure 4). At a high level, the steps (sketched in code after this list) are:
- Collecting a small number of real-world demonstrations (30 episodes, for example) to pretrain a bimanual visuo-tactile diffusion policy.
- Fine-tuning this policy in a digital twin, using a parallelized simulation environment and reinforcement learning (RL).
- Deploying this policy back to the real world.
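The flow of those three stages can be summarized in code. Every function below is a placeholder stub for illustration, including the environment count, and none of it is the actual VT-Refine API:

```python
def collect_real_demos(num_episodes):
    """Stage 0: a small set of real-world teleoperated episodes."""
    return [{"obs": None, "action": None} for _ in range(num_episodes)]

def pretrain_diffusion_policy(demos):
    """Stage 1: behavior-clone a bimanual visuo-tactile diffusion policy."""
    return {"stage": "pretrained", "num_demos": len(demos)}

def rl_finetune(policy, num_envs):
    """Stage 2: RL in a parallelized digital twin of the scene."""
    return {**policy, "stage": "rl_finetuned", "envs": num_envs}

policy = pretrain_diffusion_policy(collect_real_demos(num_episodes=30))
policy = rl_finetune(policy, num_envs=4096)  # illustrative env count
# Stage 3: deploy `policy` back on the real robot.
print(policy)
```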

The tactile sensory input is simulated with TacSL, a GPU-based tactile simulation library integrated with Isaac Lab. Because TacSL efficiently approximates the softness of tactile sensors in GPU-accelerated simulation, it enables scalable training and better sim-to-real transfer. The observations used for training, assembled as in the sketch after this list, include:
- Point cloud captured by an ego-centric camera
- Point cloud representation of the tactile sensor feedback
- Joint positions from the arms and grippers
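A minimal sketch of how such an observation might be assembled, with stub sensor classes; the point-cloud sizes and joint count are assumptions, not VT-Refine’s actual values:

```python
import numpy as np

class StubCamera:
    def get_point_cloud(self):        # ego-centric scene geometry
        return np.random.rand(1024, 3)

class StubTactileSensor:
    def get_contact_points(self):     # touch rendered as points
        return np.random.rand(64, 3)

class StubRobot:
    def get_joint_positions(self):    # both arms plus grippers
        return np.zeros(16)

def build_observation(camera, sensors, robot):
    """Assemble one visuo-tactile observation; tactile readings share
    the point-cloud representation used for vision."""
    return {
        "scene_points": camera.get_point_cloud(),          # (1024, 3)
        "tactile_points": np.concatenate(
            [s.get_contact_points() for s in sensors]),    # (128, 3)
        "joint_pos": robot.get_joint_positions(),          # (16,)
    }

obs = build_observation(StubCamera(), [StubTactileSensor()] * 2, StubRobot())
```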
The collected data is then used to pretrain a diffusion policy. For scaled training in simulation, a digital twin of the scene is set up with the vision and tactile sensors. Pretraining on human demonstrations provides a strong prior that guides RL exploration without the need for complex reward engineering.

The RL fine-tuned policy significantly boosts performance on high-precision assembly tasks by introducing the necessary exploration. It improves real-world success rates by approximately 20% for the vision-only variant and 40% for the visuo-tactile variant. There is a slight sim-to-real transfer drop of around 5-10%, which is negligible compared to the over-30% improvement in success rate gained from RL fine-tuning in simulation.
This work is one of the first successful demonstrations of RL sim-to-real transfer for bimanual visuo-tactile policies trained with large-scale simulation.
Summary
Advances in robot learning are transforming how robots acquire and transfer complex skills from simulation to the real world. NeRD enables more accurate dynamics prediction, RSE streamlines learning dexterous manipulation from human demonstrations, and VT-Refine combines vision and touch for robust bimanual assembly. Together, these approaches show how scalable, data-driven learning is narrowing the gap between robotic and human capabilities.
This post is part of our NVIDIA Robotics Research and Development Digest (R²D²), which gives developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.
Learn more about the research being showcased at CoRL and Humanoids, happening September 27–October 2 in Seoul, Korea.
Also, join the 2025 BEHAVIOR Challenge, a robotics benchmark for testing reasoning, locomotion, and manipulation, featuring 50 household tasks and 10,000 tele-operated demonstrations.
Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and NVIDIA Developer Forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.
Acknowledgments
For their contributions to the research mentioned in this post, we’d like to thank Arsalan Mousavian, Balakumar Sundaralingam, Binghao Huang, Dieter Fox, Eric Heiden, Iretiayo Akinola, Jie Xu, Liang-Yan Gui, Liuyu Bian, Miles Macklin, Rowland O’Flaherty, Sirui Xu, Wei Yang, Xiaolong Wang, Yashraj Narang, Yunzhu Li, Yu-Wei Chao, Yu-Xiong Wang.