Traditional task and motion planning (TAMP) systems for robot manipulation operate on static models that often fail in new environments. Integrating perception with manipulation planning addresses this challenge, enabling robots to update plans mid-execution and adapt to dynamic scenarios.
In this edition of the NVIDIA Robotics Research and Development Digest (R²D²), we explore the use of perception-based TAMP and GPU-accelerated TAMP for long-horizon manipulation. We’ll also learn about a framework for improving robot manipulation skills. And we’ll show how vision and language can be used to translate pixels into subgoals, affordances, and differentiable constraints.
- Subgoals are smaller intermediate objectives that guide the robot step-by-step toward the final goal.
- Affordances describe the actions that an object or environment allows a robot to perform, based on its properties and context. For instance, a handle affords “grasping,” a button affords “pressing,” and a cup affords “pouring.”
- Differentiable constraints in robot-motion planning ensure that the robot’s movements satisfy physical limits (like joint angles, collision avoidance, or end-effector positions) while still being adjustable via learning. Because they’re differentiable, GPUs can compute and refine them efficiently during training or real-time planning (a minimal sketch follows this list).
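To make that last point concrete, here is a minimal sketch of our own (not taken from any of the frameworks covered below) of how joint limits and a goal position can be written as differentiable costs and refined for a whole batch of candidate configurations at once. The planar two-link arm, its link lengths, limits, and goal are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not from any framework below):
# joint limits and a goal position as differentiable costs in PyTorch,
# refined for a whole batch of candidate configurations in parallel.
import torch

LINK_LENGTHS = torch.tensor([0.4, 0.3])     # meters (assumed)
JOINT_LOWER = torch.tensor([-2.6, -2.6])    # radians (assumed)
JOINT_UPPER = torch.tensor([2.6, 2.6])

def forward_kinematics(q):
    """End-effector (x, y) of a planar two-link arm; q has shape (batch, 2)."""
    x = LINK_LENGTHS[0] * torch.cos(q[:, 0]) + LINK_LENGTHS[1] * torch.cos(q[:, 0] + q[:, 1])
    y = LINK_LENGTHS[0] * torch.sin(q[:, 0]) + LINK_LENGTHS[1] * torch.sin(q[:, 0] + q[:, 1])
    return torch.stack([x, y], dim=-1)

def constraint_cost(q, goal_xy):
    """Differentiable penalty: joint-limit violations plus distance to the goal."""
    limit_violation = torch.relu(JOINT_LOWER - q) + torch.relu(q - JOINT_UPPER)
    goal_error = torch.linalg.norm(forward_kinematics(q) - goal_xy, dim=-1)
    return limit_violation.sum(dim=-1) + goal_error

# Refine 1,024 random configurations toward the goal at once (GPU-friendly).
q = torch.randn(1024, 2, requires_grad=True)
goal = torch.tensor([0.5, 0.2])
optimizer = torch.optim.Adam([q], lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    constraint_cost(q, goal).sum().backward()
    optimizer.step()
```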
How task and motion planning transforms vision and language into robot action
TAMP involves deciding both what a robot should do and how it should move to do it. This requires combining high-level task planning (which actions to take, and in what order) with low-level motion planning (how to execute each action with feasible, collision-free motion).
Modern robots can use both vision and language (like pictures and instructions) to break down complex tasks into smaller steps, called subgoals. These subgoals help the robot understand what needs to happen next, what objects to interact with, and how to move safely.
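As a toy illustration (the skill names, arguments, and objects below are invented for this post, not taken from any specific framework), an instruction like “put the orange on the table” might decompose into an ordered list of subgoals that a motion planner then tries to satisfy one at a time:

```python
# Hypothetical subgoal decomposition for "put the orange on the table".
# Skill names and arguments are illustrative, not from any specific system.
plan_skeleton = [
    ("move_to", {"target": "orange"}),
    ("grasp",   {"object": "orange", "affordance": "top_grasp"}),
    ("move_to", {"target": "table"}),
    ("place",   {"object": "orange", "surface": "table"}),
]

for skill, args in plan_skeleton:
    # A real system would invoke a motion planner here and check feasibility;
    # this loop just walks the ordered subgoals.
    print(f"subgoal: {skill} {args}")
```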
Advanced models turn images and written instructions into concrete plans the robot can follow in real-world situations. Long-horizon manipulation requires structuring that intent into subgoals and constraints the planner can actually satisfy. Let’s see how OWL-TAMP, VLM-TAMP, and NOD-TAMP help address this:
- OWL-TAMP: This workflow enables robots to execute complex, long-horizon manipulation tasks described in natural language, such as “put the orange on the table.” OWL-TAMP is a hybrid workflow that integrates vision-language models (VLMs) with TAMP, where the VLM generates constraints that describe how to ground open-world language (OWL) instructions in the robot’s action space. These constraints are incorporated into the TAMP system, which ensures physical feasibility and correctness through simulation feedback (a toy sketch of such a grounded constraint follows this list).
- VLM-TAMP: This is a workflow for planning multi-step tasks for robots in visually rich environments. VLM-TAMP combines VLMs with traditional TAMP to generate and refine action plans in real-world scenes. It uses a VLM to interpret images together with a task description (like “make chicken soup”) and generate high-level plans for the robot. These plans are then iteratively refined through simulation and motion planning to check feasibility. This hybrid approach outperforms both VLM-only and TAMP-only baselines on long-horizon kitchen tasks that require 30 to 50 sequential actions and involve up to 21 different objects. The workflow enables robots to handle ambiguous information by using both visual and language context, resulting in improved performance in complex manipulation tasks.

- NOD-TAMP: Traditional TAMP frameworks often struggle to generalize on long-horizon manipulation tasks because they rely on explicit geometric models and object representations. NOD-TAMP overcomes this by using neural object descriptors (NODs), learned representations derived from 3D point clouds that encode the spatial and relational properties of objects, to generalize across object shapes and categories. This enables robots to interact with new objects and helps the planner adapt actions dynamically.
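To illustrate the grounding idea in the OWL-TAMP bullet above (a sketch of ours, not code from the paper), an open-world instruction such as “put the orange on the table” can be grounded as a constraint on the object’s pose whose violation the planner drives toward zero. The table geometry and heights here are assumptions.

```python
# Hypothetical grounding of "put the orange on the table" as a pose constraint.
# The table extent and table-top height are illustrative assumptions.
TABLE_X = (0.2, 0.8)    # assumed table extent along x (meters)
TABLE_Y = (-0.3, 0.3)   # assumed table extent along y (meters)
TABLE_TOP_Z = 0.75      # assumed table-top height (meters)

def on_table_violation(orange_xyz):
    """Returns 0.0 when the orange rests on the table top, > 0 otherwise."""
    x, y, z = orange_xyz
    dx = max(TABLE_X[0] - x, 0.0, x - TABLE_X[1])   # outside the x extent?
    dy = max(TABLE_Y[0] - y, 0.0, y - TABLE_Y[1])   # outside the y extent?
    dz = abs(z - TABLE_TOP_Z)                       # off the table surface?
    return dx + dy + dz

# A TAMP solver would sample or optimize placements until the violation is ~0.
print(on_table_violation((0.5, 0.0, 0.75)))   # 0.0 -> constraint satisfied
print(on_table_violation((1.0, 0.0, 0.90)))   # > 0 -> constraint violated
```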
How cuTAMP accelerates robot planning with GPU parallelization
Classical TAMP first enumerates the outline of actions for a task (called a plan skeleton) and then solves for the continuous variables (such as grasps, placements, and trajectories) that make the skeleton work. This second step is usually the bottleneck in manipulation systems, and it is the step cuTAMP accelerates. For a given skeleton, cuTAMP samples thousands of seeds (particles) and then runs differentiable batch optimization on the GPU to satisfy the skeleton’s constraints (like inverse kinematics, collisions, stability, and goal costs).
If a skeleton turns out to be infeasible, the algorithm backtracks and tries another; if it is feasible, the optimization returns a full plan, often within seconds for constrained packing and stacking tasks. This means robots can find solutions for packing, stacking, or manipulating many objects in seconds instead of minutes or hours.
This “vectorized satisfaction” is the essence of making long-horizon problem-solving feasible in real-world applications.
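To give a feel for this pattern (our own minimal sketch, not cuTAMP code), many candidate solutions can be packed into one tensor and pushed downhill on a summed constraint cost in parallel, keeping the best-scoring particle at the end. The goal point and circular obstacle below are assumed stand-ins for real kinematic and collision constraints.

```python
# Minimal sketch of batched particle optimization on a GPU (not cuTAMP itself).
# The cost asks each particle (a candidate 2D placement) to reach an assumed
# goal point while staying out of an assumed circular obstacle.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

GOAL = torch.tensor([0.7, 0.2], device=device)       # assumed goal point
OBSTACLE = torch.tensor([0.4, 0.4], device=device)   # assumed obstacle center
OBSTACLE_RADIUS = 0.15

def total_cost(x):
    goal_cost = torch.linalg.norm(x - GOAL, dim=-1)
    # Hinge penalty on penetration depth into the obstacle.
    penetration = OBSTACLE_RADIUS - torch.linalg.norm(x - OBSTACLE, dim=-1)
    return goal_cost + 10.0 * torch.relu(penetration)

# Sample thousands of seeds and optimize them all in one batch.
particles = torch.rand(4096, 2, device=device, requires_grad=True)
optimizer = torch.optim.Adam([particles], lr=0.02)
for _ in range(300):
    optimizer.zero_grad()
    total_cost(particles).sum().backward()   # one backward pass updates every particle
    optimizer.step()

best = particles[total_cost(particles).argmin()].detach()
print("best particle:", best.cpu().numpy())
```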

How robots learn from failures using Stein variational inference
Long-horizon manipulation models can fail in novel conditions not seen during training. Fail2Progress is a framework for improving manipulation by enabling robots to learn from their own failures. This framework integrates failures into skill models through data-driven correction and simulation-based refinement. Fail2Progress uses Stein variational inference to generate targeted synthetic datasets similar to observed failures.
These generated datasets can then be used to fine-tune and redeploy a skill-effect model, reducing repeated failures of the same kind on long-horizon tasks.
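For intuition about the Stein variational piece, here is a generic Stein variational gradient descent (SVGD) sketch rather than the Fail2Progress implementation: a set of particles is pulled toward high-probability regions of a target distribution, here a Gaussian centered on an assumed observed failure, while a kernel repulsion term keeps the synthetic samples diverse.

```python
# Generic SVGD sketch in PyTorch; the target (a Gaussian around an assumed
# "failure state") and all hyperparameters are illustrative, not Fail2Progress.
import torch

torch.manual_seed(0)
FAILURE_STATE = torch.tensor([1.0, -0.5])   # assumed observed failure configuration
SIGMA = 0.2                                 # assumed spread of the target distribution

def log_prob(x):
    """Log-density (up to a constant) of a Gaussian around the failure state."""
    return -((x - FAILURE_STATE) ** 2).sum(dim=-1) / (2 * SIGMA**2)

def svgd_step(x, step_size=0.1):
    """One SVGD update: kernel-weighted attraction toward log_prob plus repulsion."""
    x = x.detach().requires_grad_(True)
    grad_logp = torch.autograd.grad(log_prob(x).sum(), x)[0]

    diff = x.unsqueeze(1) - x.unsqueeze(0)            # diff[i, j] = x_i - x_j
    sq_dist = (diff ** 2).sum(-1)
    h = sq_dist.median() / torch.log(torch.tensor(x.shape[0] + 1.0))  # median heuristic
    k = torch.exp(-sq_dist / (h + 1e-8))              # RBF kernel matrix

    attraction = k @ grad_logp                        # pulls particles toward the target
    repulsion = (2.0 / (h + 1e-8)) * (k.unsqueeze(-1) * diff).sum(dim=1)  # keeps them spread out
    return (x + step_size * (attraction + repulsion) / x.shape[0]).detach()

# Refine 64 synthetic samples so they cluster around the observed failure.
particles = torch.randn(64, 2)
for _ in range(500):
    particles = svgd_step(particles)
print("sample mean:", particles.mean(dim=0))   # approaches FAILURE_STATE
```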
Getting started
In this post, we covered perception-based TAMP, GPU-accelerated TAMP, and a simulation-based refinement framework for robot manipulation. We saw common challenges in traditional TAMP and how these research efforts aim to solve them.
Check out the following resources to learn more:
- OWL-TAMP – Paper, project website
- VLM-TAMP – Paper, project website
- NOD-TAMP – Paper, project website
- cuTAMP – Paper, project website
- Fail2Progress – Paper, project website
This post is part of our NVIDIA Robotics Research and Development Digest (R²D²) to give developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.
Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.
Acknowledgments
For their contributions to the research mentioned in this post, thanks to Ankit Goyal, Caelan Garrett, Tucker Hermans, Yixuan Huang, Leslie Pack Kaelbling, Nishanth Kumar, Tomas Lozano-Perez, Ajay Mandlekar, Fabio Ramos, Shuo Cheng, Mohanraj Devendran Shanthi, William Shen, Danfei Xu, Zhutian Yang, Novella Alvina, Dieter Fox, and Xiaohan Zhang.