
Building Robotic Mental Models with NVIDIA Warp and Gaussian Splatting


This post explores a promising direction for building dynamic digital representations of the physical world, a topic gaining increasing attention in recent research. We introduce an approach for constructing a digital twin in a robotic setting that remains synchronized with the real world in real time. Such a twin can provide rich state information that supports and enhances a wide range of downstream tasks.

Humans construct an internal model of the world from vision with remarkable ease. We turn the flat images formed on our retinas into a coherent, three-dimensional environment. Within this imagined space, we simulate physical interactions, predict outcomes, and adapt seamlessly. Even with our eyes closed, we can “see” ourselves moving objects around. When we reopen them, we reconcile any mismatch between what we imagined and what actually happened.

Replicating this dynamic, visual-physical reasoning in robots is a frontier of physical AI, and it’s beginning to materialize. At the core of our approach, Physically Embodied Gaussians, is the idea that robots can benefit from maintaining a live, internal simulation of the world. Rather than relying solely on raw image streams or offline reconstructions, we aim to build a continuously updated, physics-aware world model that mirrors reality in real time.

Why explicit simulation?

Historically, explicit modeling of the physical world has been challenging because it requires known 3D models, well-tuned dynamics, and well-modeled sensors to ensure that results in simulation can reliably transfer to the real world.

Today, that barrier is eroding.

Thanks to breakthroughs in differentiable rendering, particularly Gaussian splatting, combined with modern segmentation and scene understanding models, it’s now possible to generate simulators from just a handful of images and basic physical prior knowledge. In our use case, high modeling accuracy becomes less critical, as the simulator can be supervised and corrected continuously using a stream of real-world image observations.

Continuous visual supervision through differentiable rendering

In Physically Embodied Gaussians, differentiable rendering plays a dual role—initializing and supervising the simulator. 

Supervision is achieved by adjusting the simulator’s state continuously until the rendered images align with real-world observations. When paired with a physics engine running at approximately 30 Hz, this creates a robust feedback loop. The simulator only needs to remain accurate for around 33 milliseconds. If it drifts, the rendering system quickly corrects it. In practice, this enables even imperfectly initialized physical models to remain accurate over time, as the real-time correction mechanism compensates for errors in the simulation.

Using Gaussian splatting as the renderer, combined with fast modern GPUs, enables this entire process to run in real time.
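The correction loop above can be sketched in miniature. The following toy is not the actual system: the real renderer is gsplat rasterizing 3D Gaussians, while here the “renderer” is a fixed linear map so the photometric gradient has a closed form. The point it illustrates is the same: a gradient step on the photometric loss each frame keeps a drifting state estimate locked onto what the camera observes.

```python
import numpy as np

# Toy stand-in for continuous visual supervision (hypothetical shapes and
# names). A fixed linear map plays the role of the differentiable renderer.
W = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.7, 0.3]])  # maps a 2D "state" to 3 "pixels"

def render(state):
    return W @ state

def correct(state, observed, lr=0.3):
    # Photometric loss L = ||render(state) - observed||^2; its gradient with
    # respect to the state is 2 W^T (render(state) - observed).
    residual = render(state) - observed
    return state - lr * (2.0 * W.T @ residual)

rng = np.random.default_rng(0)
true_state = np.array([1.0, -0.5])
observed = render(true_state)      # what the camera "sees"

est = np.zeros(2)                  # imperfectly initialized simulator state
for frame in range(100):           # roughly 3 seconds at 30 Hz
    est += rng.normal(scale=0.01, size=2)  # simulator drift each frame
    est = correct(est, observed)           # one visual correction each frame

print(np.linalg.norm(est - true_state))    # stays small despite the drift
```

Even though the state is perturbed every frame, the per-frame correction keeps the error bounded, which is the property that lets an imperfect physical model stay accurate indefinitely.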

Fewer cameras, thanks to strong prior knowledge

Gaussian splatting systems typically rely on 30 or more cameras to work reliably, which is a non-starter for robotics applications.

We address this by using prior knowledge available in a robotics setting. For example:

  • We know the robot’s pose and geometry at all times.
  • We know which objects the robot is likely to interact with and whether they’re rigid or deformable.
  • We know the basic physics of the world: objects fall, collide, and don’t pass through each other.

With this prior information, we can go beyond visual replication. Our representation is grounded not just in appearance, but also in physics, and it can function robustly with far fewer cameras. 

A dual representation: particles and Gaussians

Two images showing a robot interacting with tabletop objects.
Figure 1. The dual representation of embodied Gaussians showing the particles that are acted upon by the physics system (left) and the Gaussians rendered with Gaussian splatting (right)

To bring this vision to life, we built our simulator around two key components:

  • Particles represent the physical structure of the world. They are governed by a fast and stable physics engine using extended position-based dynamics (XPBD), a technique widely used in real-time graphics and games.
  • 3D Gaussians represent the visual appearance of the scene. These are attached to the particles and rendered using Gaussian splatting.
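A minimal sketch of this dual representation might look as follows. All shapes and constants here are illustrative, not the paper’s implementation, and the constraint projection is the basic position-based-dynamics pattern (XPBD adds compliance on top of it): predict positions from velocities, project constraints on positions, then recover velocities from the corrected positions.

```python
import numpy as np

def pbd_step(x, v, dt=1 / 30, gravity=-9.81):
    # Position-based dynamics step: predict, project, recover velocities.
    v = v + np.array([0.0, 0.0, gravity]) * dt
    x_pred = x + v * dt
    x_proj = x_pred.copy()
    x_proj[:, 2] = np.maximum(x_proj[:, 2], 0.0)  # stay above the table z = 0
    v_new = (x_proj - x) / dt
    return x_proj, v_new

# One particle starting slightly above the table, with one Gaussian
# attached to it at a fixed offset.
x = np.array([[0.0, 0.0, 0.05]])
v = np.zeros((1, 3))
gaussian_offset = np.array([[0.0, 0.0, 0.02]])

for frame in range(10):
    x, v = pbd_step(x, v)

gaussians = x + gaussian_offset  # Gaussians inherit their particles' motion
print(x[0, 2], gaussians[0, 2])  # particle rests on the table; Gaussian above
```

Because the Gaussians are expressed relative to the particles, any motion produced by the physics engine immediately moves the rendered appearance as well.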

The particles drive the motion of the Gaussians, while the visual errors from the differentiable renderer generate corrective forces that push the particles back into alignment. This dual system forms a closed loop: physics moves visuals, visuals correct physics.

Together, these two subsystems maintain a real-time, visually and physically accurate model of the environment—adaptable, efficient, and grounded in perception.
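Stitched together, the closed loop alternates a physics step with a visual correction. The 1D toy below uses made-up dynamics and constants: the photometric residual is treated as a spring-like corrective force on the particle, which is the flavor of coupling described above, not the system’s actual formulation.

```python
import numpy as np

# Toy 1D closed loop: physics drifts, the renderer's residual pushes back.
rng = np.random.default_rng(0)
true_pos = 2.0                 # where the camera actually sees the object
x, v = 0.0, 0.0                # imperfectly initialized particle state
dt, k, c = 1 / 30, 20.0, 5.0   # timestep, correction stiffness, damping

for frame in range(150):       # roughly 5 seconds at 30 Hz
    x += rng.normal(scale=0.01)        # physics step with model error
    residual = x - true_pos            # visual error (stand-in for rendering)
    v += (-k * residual - c * v) * dt  # corrective force from the residual
    x += v * dt                        # semi-implicit Euler integration

print(abs(x - true_pos))
```

The particle settles onto the observed position and stays there: physics moves the visuals, and the visual error feeds back as a force on the physics.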

Built with NVIDIA Warp and gsplat

Our simulator uses NVIDIA Warp for the physics engine and visual tools, and gsplat for differentiable rendering.

For more technical details, demos, and open-source code, visit https://embodied-gaussians.github.io/.
