Reconstructing Dynamic Driving Scenarios Using Self-Supervised Learning

From monotonous highways to routine neighborhood trips, driving is often uneventful. As a result, much of the training data for autonomous vehicle (AV) development collected in the real world is heavily skewed toward simple scenarios. 

This poses a challenge to deploying robust perception models. AVs must be thoroughly trained, tested, and validated to handle complex situations, which requires an immense amount of data covering such scenarios.

Simulation offers an alternative to finding and collecting such data in the real world—which would be incredibly time- and cost-intensive. And yet generating complicated, dynamic scenarios at scale is still a significant hurdle.

In a recently released paper, NVIDIA Research shows how a new neural radiance field (NeRF)-based method, known as EmerNeRF, uses self-supervised learning to accurately generate dynamic scenarios. By training through self-supervision, EmerNeRF not only outperforms other NeRF-based methods for dynamic objects, but also for static scenes. For more details, see EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision.

Figure 1. Examples of EmerNeRF reconstructing dynamic driving scenes: a nighttime drive, a busy intersection with cars, and a busy intersection with pedestrians

Compared with similar NeRF-based methods, EmerNeRF improves dynamic scene reconstruction accuracy by 15% and static scene reconstruction by 11%, and achieves a 12% improvement in novel view synthesis.

Addressing limitations in NeRF-based methods

NeRFs take a set of static images and reconstruct them into a realistic 3D scene. They make it possible to create high-fidelity simulations from driving logs for closed-loop deep neural network (DNN) training, testing, and validation. 
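At its core, a NeRF composites per-sample densities and colors along each camera ray into a single pixel color. The following is a minimal NumPy sketch of that volume-rendering step; the function name and array shapes are our own illustration, not code from the paper:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite N samples along one ray into a pixel color.

    sigmas: (N,) volume densities, colors: (N, 3) RGB values,
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity of each sample
    # Transmittance: probability the ray reaches each sample unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans
    rgb = (weights[:, None] * colors).sum(axis=0)  # composited color
    return rgb, weights
```

An opaque first sample dominates the composited color, while empty space (zero density) contributes nothing, which is what makes the representation trainable from posed images alone.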

However, current NeRF-based reconstruction methods struggle with dynamic objects and have proven difficult to scale. For example, while some approaches can generate both static and dynamic scenes, they require ground truth (GT) labels to do so. This means that each object in the driving logs must be accurately outlined and defined using autolabeling techniques or human annotators.

Other NeRF methods rely on additional models to achieve complete information about a scene, such as optical flow. 

To address these limitations, EmerNeRF uses self-supervised learning to decompose a scene into static, dynamic, and flow fields. The model learns associations and structure from raw data rather than relying on human-labeled GT annotations. It then renders both the temporal and spatial aspects of a scene simultaneously, eliminating the need for an external model to fill in the gaps while improving accuracy.
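To make the decomposition concrete, here is a hypothetical sketch of how static and dynamic fields can be combined at a single ray sample: sum the two densities and blend the colors in proportion to each field's density. This is an illustration of the general idea, not the paper's exact formulation:

```python
import numpy as np

def blend_fields(sigma_static, rgb_static, sigma_dynamic, rgb_dynamic):
    """Density-weighted blend of static and dynamic fields at one sample.

    Illustrative sketch: the combined density is the sum, and each
    field's color contributes in proportion to its density.
    """
    sigma = sigma_static + sigma_dynamic
    w_dyn = sigma_dynamic / np.maximum(sigma, 1e-8)  # dynamic share
    rgb = (1.0 - w_dyn) * rgb_static + w_dyn * rgb_dynamic
    return sigma, rgb
```

Where the dynamic field is empty, the render reduces to the static field alone, so the decomposition emerges from reconstruction loss without any labels telling the model which pixels are moving.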

Figure 2. EmerNeRF breaks down the nighttime scene shown in Figure 1 into rendered depth, decomposed dynamic RGB and depth, emergent forward flow, and decomposed static RGB and depth

As a result, while other models tend to produce over-smoothed renderings and dynamic objects with lower accuracy, EmerNeRF reconstructs high-fidelity background scenery as well as dynamic objects, all while preserving the fine details of a scene.

**Dynamic-32 Split**

| Methods | Scene Reconstruction: Full Image | Scene Reconstruction: Dynamic Only | Novel View Synthesis: Full Image | Novel View Synthesis: Dynamic Only |
| --- | --- | --- | --- | --- |
| D²NeRF | 24.35 / 0.645 | 21.78 / 0.504 | 24.17 / 0.642 | 21.44 / 0.494 |

Table 1. Evaluation results comparing EmerNeRF with other NeRF-based reconstruction methods for dynamic scenes, categorized into performance for scene reconstruction and novel view synthesis
**Static-32 Split**

| Methods | Static Scene Reconstruction |
| --- | --- |

Table 2. Evaluation results comparing EmerNeRF with other NeRF-based reconstruction methods for static scenes

The EmerNeRF approach

Using self-supervised learning, rather than human annotation or external models, enables EmerNeRF to bypass challenges previous methods have encountered.

A diagram showing how EmerNeRF breaks a scene into static, dynamic, and flow fields, then reconstructs the final scene rendering simultaneously, rather than using an external model.
Figure 3. EmerNeRF decomposition and reconstruction pipeline

EmerNeRF is designed to break down a scene into dynamic and static elements. As it decomposes a scene, EmerNeRF also estimates a flow field from dynamic objects, such as cars and pedestrians, and uses this field to further improve reconstruction quality by aggregating features across time. Other approaches use external models to provide such optical flow data, which can often lead to inaccuracies.

By combining the static, dynamic, and flow fields all at once, EmerNeRF can represent highly dynamic scenes self-sufficiently, which improves accuracy and enables scaling to general data sources. 
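The temporal aggregation idea can be illustrated with a toy sketch (the function names and signatures below are hypothetical, not the paper's API): the dynamic field is queried not only at the current position and time, but also at positions advected by the estimated forward and backward flow, and the resulting features are averaged. For a rigidly moving point with correctly estimated flow, all three queries agree:

```python
import numpy as np

def aggregate_temporal(dyn_field, x, t, flow_fwd, flow_bwd):
    """Average dynamic-field features across adjacent timesteps,
    following the estimated scene flow (illustrative toy sketch)."""
    f_curr = dyn_field(x, t)
    f_prev = dyn_field(x + flow_bwd(x, t), t - 1)  # where x came from
    f_next = dyn_field(x + flow_fwd(x, t), t + 1)  # where x is going
    return (f_prev + f_curr + f_next) / 3.0

# Toy scene: a point translating at constant velocity v; the "feature"
# is its motion-compensated (canonical) position.
v = np.array([1.0, 0.0, 0.0])
dyn_field = lambda x, t: x - v * t
flow_fwd = lambda x, t: v    # displacement to time t + 1
flow_bwd = lambda x, t: -v   # displacement to time t - 1
```

When the flow is consistent with the motion, aggregation returns the same feature as a single query but with lower variance in practice, which is why it sharpens dynamic reconstructions.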

Adding semantic understanding with foundation models

EmerNeRF’s semantic understanding of a scene is further strengthened using foundation models for additional supervision. Foundation models have a broad knowledge of objects (specific types of vehicles or animals, for example). EmerNeRF leverages vision transformer (ViT) models such as DINO and DINOv2 to incorporate semantic features into its scene reconstruction.

This enables EmerNeRF to better predict objects in a scene, as well as perform downstream tasks such as autolabeling.

Figure 4. EmerNeRF uses foundation models such as DINO and DINOv2 to strengthen its semantic understanding of a scene; clockwise: the original ground truth recording, the EmerNeRF reconstruction, the DINO semantic rendering, and the DINOv2 semantic rendering

However, transformer-based foundation models pose a new challenge: semantic features can exhibit position-dependent noise, which can significantly limit downstream task performance.

Figure 5. EmerNeRF uses positional embedding decomposition to eliminate noise caused by transformer-based foundation models; clockwise: the original ground truth recording, the EmerNeRF reconstruction, the decomposed noise-free DINO semantic rendering, and the decomposed noise-free DINOv2 semantic rendering

To solve the noise issue, EmerNeRF uses positional embedding decomposition to recover a noise-free feature map. This unlocks the full, accurate representation of foundation model semantic features, as shown in Figure 5.
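The intuition can be shown with a crude stand-in for the learned decomposition: if the noisy component depends only on image position and is shared across images, the per-position mean over many feature maps approximates it, and subtracting that mean leaves the content features. EmerNeRF learns this decomposition end to end; the batch averaging below is only a toy proxy:

```python
import numpy as np

def split_positional(feat_maps):
    """Split ViT feature maps (B, H, W, C) into a shared positional
    pattern and per-image content features.

    Toy proxy for EmerNeRF's learned positional-embedding
    decomposition: the image-independent part is estimated as the
    per-position mean over the batch.
    """
    pe = feat_maps.mean(axis=0)      # (H, W, C): position-dependent part
    content = feat_maps - pe[None]   # residual, zero-mean over the batch
    return pe, content
```

If the feature maps really are content plus a shared positional pattern, this split recovers both exactly, which is the noise-free feature map shown in Figure 5.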

Evaluating EmerNeRF

As detailed in EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision, we evaluated the performance of EmerNeRF on a curated dataset of 120 unique scenarios, divided into 32 static, 32 dynamic, and 56 diverse scenes, spanning challenging conditions such as high speeds and low light.

Each NeRF model was then evaluated on its ability to reconstruct scenes and synthesize novel views based on different subsets of the dataset.

We found that EmerNeRF consistently and significantly outperformed other methods in both scene reconstruction and novel view synthesis, as shown in Table 1.

EmerNeRF also outperformed methods specifically designed for static scenes, suggesting that self-supervised decomposition of a scene into static and dynamic elements improves static reconstruction as well as dynamic.


AV simulation is only effective if it can accurately reproduce the real world. The need for fidelity increases—and becomes more challenging to achieve—as scenarios become more dynamic and complex.

EmerNeRF represents and reconstructs dynamic scenarios more accurately than previous methods, without requiring human supervision or external models. This enables reconstructing and modifying complicated driving data at scale, addressing current imbalances in AV training datasets.

We’re eager to investigate new capabilities that EmerNeRF unlocks, including end-to-end driving, autolabeling, and simulation. 

To learn more, visit the EmerNeRF project page and read the paper, EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision.
