
Using Generative AI to Enable Robots to Reason and Act with ReMEmbR

Sep 23, 2024

By Abrar Anwar, John Welsh and Yan Chang


Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over it, and return the output in a structured format. Building on a broad base of pretraining, they can be easily adapted to different vision-related tasks through new prompts or parameter-efficient fine-tuning.

They can also be integrated with live data sources and tools, to request more information if they don’t know the answer or take action when they do. LLMs and VLMs can act as agents, reasoning over data to help robots perform meaningful tasks that might be hard to define.

In a previous post, Bringing Generative AI to Life with NVIDIA Jetson, we demonstrated that you can run LLMs and VLMs on NVIDIA Jetson Orin devices, enabling a breadth of new capabilities like zero-shot object detection, video captioning, and text generation on edge devices.

But how can you apply these advances to perception and autonomy in robotics? What are the challenges you face when deploying these models into the field?

In this post, we discuss ReMEmbR, a project that combines LLMs, VLMs, and retrieval-augmented generation (RAG) to enable robots to reason and take actions over what they see during a long-horizon deployment, on the order of hours to days.

ReMEmbR’s memory-building phase uses VLMs and vector databases to efficiently build a long-horizon semantic memory. Then ReMEmbR’s querying phase uses an LLM agent to reason over that memory. It is fully open source and runs on-device.

ReMEmbR addresses many of the challenges faced when using LLMs and VLMs in a robotics application:

How to handle large contexts.
How to reason over a spatial memory.
How to build a prompt-based agent to query more data until a user’s question is answered.

To take things a step further, we also built an example of using ReMEmbR on a real robot with Nova Carter and NVIDIA Isaac ROS, and we share the code and the steps that we took. For more information, see the following resources:

ReMEmbR website
/NVIDIA-AI-IOT/remembr GitHub repo
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation paper

Video 1. Enhancing Robot Navigation with LLM Agent ReMEmbR

ReMEmbR for long-horizon spatial and temporal memory, reasoning, and action

Robots are increasingly expected to perceive and interact with their environments over extended periods. Robots are deployed for hours, if not days, at a time and they incidentally perceive different objects, events, and locations.

For robots to understand and respond to questions that require complex multi-step reasoning in scenarios where the robot has been deployed for long periods, we built ReMEmbR, a retrieval-augmented memory for embodied robots.

ReMEmbR builds scalable long-horizon memory and reasoning systems for robots, which improve their capacity for perceptual question-answering and semantic action-taking. ReMEmbR consists of two phases: memory-building and querying.

In the memory-building phase, we used VLMs to construct a structured memory stored in a vector database. In the querying phase, we built an LLM agent that calls different retrieval functions in a loop until it can answer the question that the user asked.

Building a smarter memory

ReMEmbR’s memory-building phase is all about making memory work for robots. When your robot has been deployed for hours or days, you need an efficient way of storing this information. Videos are easy to store, but hard to query and understand.

During memory building, we take short segments of video, caption them with the NVIDIA VILA captioning VLM, and then embed them into a MilvusDB vector database. We also store timestamps and coordinate information from the robot in the vector database.

This setup enables us to efficiently store and query all kinds of information from the robot’s memory. By captioning video segments with VILA and embedding the captions into a MilvusDB vector database, the system can remember anything that VILA can capture, from dynamic events such as people walking around and specific small objects, all the way to more general categories.

Using a vector database makes it easy to add new kinds of information for ReMEmbR to take into consideration.
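The memory-building flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ReMEmbR implementation: the `ToyMemory` class, the `embed` function, and the sample captions are all hypothetical. A word-count vector stands in for a real text-embedding model, hand-written strings stand in for VILA captions, and a Python list stands in for the MilvusDB vector database.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: a real system uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class MemoryItem:
    caption: str                          # text produced by the captioning model
    timestamp: float                      # seconds since the start of the run
    position: tuple[float, float, float]  # x, y, z in the map frame
    embedding: Counter


class ToyMemory:
    """In-memory stand-in for the vector database."""

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def insert(self, caption: str, timestamp: float, position) -> None:
        self.items.append(MemoryItem(caption, timestamp, position, embed(caption)))

    def search_text(self, query: str, k: int = 1) -> list[MemoryItem]:
        """Return the k items whose captions are most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it.embedding), reverse=True)
        return ranked[:k]


memory = ToyMemory()
memory.insert("a vending machine with snacks next to a water cooler", 120.0, (4.2, 1.5, 0.0))
memory.insert("two people walking down a hallway", 300.0, (10.0, -2.0, 0.0))
memory.insert("an elevator with its doors closed", 480.0, (15.5, 3.0, 0.0))

best = memory.search_text("where is the elevator")[0]
print(best.caption, best.timestamp, best.position)
```

Because each record keeps its timestamp and position next to the caption embedding, a text match immediately tells you when and where the robot saw something.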

ReMEmbR agent

Given such a long memory stored in the database, a standard LLM would struggle to reason quickly over the long context.

The LLM backend for the ReMEmbR agent can be NVIDIA NIM microservices, local on-device LLMs, or other LLM APIs. When a user poses a question, the LLM generates queries to the database, retrieving relevant information iteratively. The LLM can query for text information, time information, or position information depending on what the user is asking. This process repeats until the question is answered.

Our use of these different tools for the LLM agent enables the robot to go beyond answering questions about how to go to specific places and enables reasoning spatially and temporally. Figure 2 shows how this reasoning phase may look.
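The retrieval loop described in this section can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the three tool functions, the `MEMORY` records, and especially `stub_llm`, a hard-coded policy standing in for a real LLM that would decide which retrieval call to make next. Keyword matching stands in for vector-similarity search.

```python
import math

# Hypothetical memory records: caption, time (s), and planar position (m).
MEMORY = [
    {"caption": "a vending machine with snacks", "time": 120.0, "pos": (4.2, 1.5)},
    {"caption": "two people walking down a hallway", "time": 300.0, "pos": (10.0, -2.0)},
    {"caption": "an elevator with its doors closed", "time": 480.0, "pos": (15.5, 3.0)},
]


def retrieve_by_text(keyword):
    """Keyword match, standing in for a similarity search over caption embeddings."""
    return [m for m in MEMORY if keyword.lower() in m["caption"].lower()]


def retrieve_by_time(t, k=1):
    """Return the k memories recorded closest to time t."""
    return sorted(MEMORY, key=lambda m: abs(m["time"] - t))[:k]


def retrieve_by_position(xy, k=1):
    """Return the k memories recorded closest to position xy."""
    return sorted(MEMORY, key=lambda m: math.dist(m["pos"], xy))[:k]


TOOLS = {"text": retrieve_by_text, "time": retrieve_by_time, "position": retrieve_by_position}


def stub_llm(question, observations):
    """Stand-in for the LLM: returns either a tool call or a final answer."""
    keywords = ["cafeteria", "snack"]  # queries this toy policy tries in order
    if observations and observations[-1]:
        return {"answer": observations[-1][0]["pos"]}  # last retrieval succeeded
    if len(observations) < len(keywords):
        return {"tool": "text", "arg": keywords[len(observations)]}
    return {"answer": None}  # nothing relevant found


def run_agent(question, llm=stub_llm, max_steps=5):
    """Call retrieval tools in a loop until the LLM produces an answer."""
    observations = []
    for _ in range(max_steps):
        decision = llm(question, observations)
        if "answer" in decision:
            return decision["answer"]
        observations.append(TOOLS[decision["tool"]](decision["arg"]))
    return None


goal = run_agent("Take me somewhere I can get a snack")
print(goal)
```

In this toy run, the first query finds nothing, so the loop issues a second query before answering. The agent only sees the handful of records each retrieval returns rather than the full memory.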

Deploying ReMEmbR on a real robot

To demonstrate how ReMEmbR can be integrated into a real robot, we built a demo using ReMEmbR with NVIDIA Isaac ROS and Nova Carter. Isaac ROS, built on the open-source ROS 2 software framework, is a collection of accelerated computing packages and AI models, bringing NVIDIA acceleration to ROS developers everywhere.

In the demo, the robot answers questions and guides people around an office environment. To demystify the process of building the application, we wanted to share the steps we took:

Building an occupancy grid map
Running the memory builder
Running the ReMEmbR agent
Adding speech recognition

Building an occupancy grid map

The first step we took was to create a map of the environment. To build the vector database, ReMEmbR needs access to the monocular camera images as well as the global location (pose) information.

Depending on your environment or platform, obtaining the global pose information can be challenging. Fortunately, this is straightforward when using Nova Carter. Nova Carter, powered by the Nova Orin reference architecture, is a complete robotics development platform that accelerates the development and deployment of next-generation autonomous mobile robots (AMRs). It may be equipped with a 3D LIDAR to generate accurate and globally consistent metric maps.

By following the Isaac ROS documentation, we quickly built an occupancy map by teleoperating the robot. This map is later used for localization when building the ReMEmbR database and for path planning and navigation for the final robot deployment.
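To make the role of the occupancy map more concrete, here is a minimal sketch of how a world coordinate can be looked up in a grid map. It assumes the common ROS occupancy grid convention (row-major cells, -1 for unknown, 0 to 100 for occupancy) and an origin with no rotation. The `GridMap` class and the tiny example map are hypothetical and are not taken from the Isaac ROS code.

```python
import math
from dataclasses import dataclass


@dataclass
class GridMap:
    resolution: float  # meters per cell
    origin_x: float    # world x of the corner of cell (row 0, col 0)
    origin_y: float    # world y of the corner of cell (row 0, col 0)
    width: int         # number of columns
    height: int        # number of rows
    data: list[int]    # row-major; -1 unknown, 0 free ... 100 occupied

    def world_to_cell(self, x: float, y: float):
        """Convert a world coordinate to (row, col), or None if off the map."""
        col = math.floor((x - self.origin_x) / self.resolution)
        row = math.floor((y - self.origin_y) / self.resolution)
        if 0 <= col < self.width and 0 <= row < self.height:
            return row, col
        return None

    def is_free(self, x: float, y: float, threshold: int = 50) -> bool:
        """True only for known cells whose occupancy is below the threshold."""
        cell = self.world_to_cell(x, y)
        if cell is None:
            return False
        row, col = cell
        value = self.data[row * self.width + col]
        return 0 <= value < threshold


# A 4 x 3 map with 0.5 m cells: a short wall (100) and an unknown column (-1).
grid = GridMap(
    resolution=0.5, origin_x=-1.0, origin_y=-1.0, width=4, height=3,
    data=[0, 0, 100, -1,
          0, 0, 100, -1,
          0, 0, 0, -1],
)

print(grid.world_to_cell(0.2, 0.2), grid.is_free(0.2, 0.2))
print(grid.is_free(0.2, -0.8))
```

Treating unknown and off-map cells as not free is a conservative choice for this sketch; a planner may handle them differently.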

Running the memory builder

After we created the map of the environment, the second step was to populate the vector database used by ReMEmbR. For this, we teleoperated the robot, while running AMCL for global localization. For more information about how to do this with Nova Carter, see Tutorial: Autonomous Navigation with Isaac Perceptor and Nav2.

With the localization running in the background, we launched two additional ROS nodes specific to the memory-building phase.

The first ROS node runs the VILA model to generate captions for the robot camera images. This node runs on the device, so we could still build a reliable database even when the network connection was intermittent.

Running this node on Jetson is made easier with NanoLLM for quantization and inference. This library, along with many others, is featured in the Jetson AI Lab. There is even a recently released ROS package (ros2_nanollm) for easily integrating NanoLLM models with a ROS application.

The second ROS node subscribes to the captions generated by VILA, as well as the global pose estimated by the AMCL node. It builds text embeddings for the captions and stores the pose, text, embeddings, and timestamps in the vector database.
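Captions and pose estimates generally arrive at different rates, so a node like this needs some way to decide which pose goes with which caption. The sketch below shows one simple option, pairing each caption with the pose nearest in time. This is an assumption for illustration, not necessarily how the ReMEmbR node does it, and `nearest_pose` and the sample data are hypothetical.

```python
from bisect import bisect_left


def nearest_pose(pose_times, poses, t):
    """Return the pose whose timestamp is closest to t.

    pose_times must be sorted in ascending order and contain at least one entry.
    """
    i = bisect_left(pose_times, t)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(pose_times)]
    best = min(candidates, key=lambda j: abs(pose_times[j] - t))
    return poses[best]


# Poses arrive frequently (here every 0.5 s); captions arrive less often.
pose_times = [0.0, 0.5, 1.0, 1.5, 2.0]
poses = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.1), (0.8, 0.2)]
captions = [(0.6, "a hallway with a door"), (1.9, "an elevator")]

# Each record holds what the post says is stored, minus the text embedding.
records = [
    {"time": t, "caption": text, "pose": nearest_pose(pose_times, poses, t)}
    for t, text in captions
]

for record in records:
    print(record)
```

The binary search keeps each lookup cheap even when many poses have been buffered.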

Running the ReMEmbR agent

After we populated the vector database, the ReMEmbR agent had everything it needed to answer user queries and produce meaningful actions.

The third step was to run the live demo. To make the robot’s memory static, we disabled the image captioning and memory-building nodes and enabled the ReMEmbR agent node. As detailed earlier, the ReMEmbR agent is responsible for taking a user query, querying the vector database, and determining the appropriate action the robot should take. In this instance, the action is a destination goal pose corresponding to the user’s query.

We then tested the system end-to-end by manually typing in user queries:

“Take me to the nearest elevator”
“Take me somewhere I can get a snack”

The ReMEmbR agent determines the best goal pose and publishes it to the /goal_pose topic. The path planner then generates a global path for the robot to follow to navigate to this goal.
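As a rough illustration of what a goal pose contains, the sketch below converts a planar position and heading into a position plus orientation quaternion. The post only names the /goal_pose topic, so the message layout here is an assumption: a plain dictionary loosely modeled on a stamped pose message, with hypothetical helper names `yaw_to_quaternion` and `make_goal_pose`.

```python
import math


def yaw_to_quaternion(yaw: float):
    """Quaternion (x, y, z, w) for a rotation of `yaw` radians about the z axis."""
    return (0.0, 0.0, math.sin(yaw / 2.0), math.cos(yaw / 2.0))


def make_goal_pose(x: float, y: float, yaw: float, frame_id: str = "map") -> dict:
    """Build a goal as a nested dict (assumed layout, not an actual ROS message)."""
    qx, qy, qz, qw = yaw_to_quaternion(yaw)
    return {
        "header": {"frame_id": frame_id},
        "pose": {
            "position": {"x": x, "y": y, "z": 0.0},
            "orientation": {"x": qx, "y": qy, "z": qz, "w": qw},
        },
    }


# Hypothetical goal: the remembered elevator location, facing along +y.
goal = make_goal_pose(15.5, 3.0, math.pi / 2)
print(goal)
```

On a real robot, the same fields would be filled into the message type that the navigation stack expects before publishing.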

Adding speech recognition

In a real application, users likely won’t have access to a terminal to enter queries, so they need a more intuitive way to interact with the robot. For this, we took the application a step further by integrating speech recognition to generate the queries for the agent.

On Jetson Orin platforms, integrating speech recognition is straightforward. We accomplished this by writing a ROS node that wraps the recently released WhisperTRT project. WhisperTRT optimizes OpenAI’s Whisper model with NVIDIA TensorRT, enabling low-latency inference on NVIDIA Jetson AGX Orin and NVIDIA Jetson Orin Nano.

The WhisperTRT ROS node directly accesses the microphone using PyAudio and publishes recognized speech on the speech topic.

All together

With all the components combined, we created our full demo of the robot.

Get started

We hope this post inspires you to explore generative AI in robotics. To learn more about the content presented in this post, try out the ReMEmbR code, and get started building your own generative AI robotics applications, see the following resources:

ReMEmbR website
/NVIDIA-AI-IOT/remembr GitHub repo
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation paper
NVIDIA Isaac ROS documentation
Nova Carter
NVIDIA Jetson AI Lab



About the Authors

About Abrar Anwar
Abrar Anwar is a Ph.D. student at the University of Southern California and an intern at NVIDIA. His research interests are on the intersection of language and robotics, with a focus on navigation and human-robot interaction. Abrar received his B.Sc. in computer science from the University of Texas at Austin.


About John Welsh
John Welsh is a developer technology engineer of autonomous machines at NVIDIA, where he develops accelerated applications with NVIDIA Jetson. Whether it’s Legos, robots or a song on a guitar, he always enjoys creating new things. John holds a Bachelor of Science and Master of Science in electrical engineering from the University of Maryland, focusing on robotics and computer vision.


About Yan Chang
Yan Chang is a senior engineering manager and principal engineer at NVIDIA. She leads the NVIDIA Isaac loco‑manipulation team, advancing robot learning and synthetic data generation (SDG) for cross‑embodiment robots. She received her Ph.D. from the University of Michigan.
