Scene graphs (SGs) are an interpretable, structured representation of scenes in both computer vision and computer graphics. A scene graph summarizes the entities in a scene and the plausible relationships among them. SGs have applications in computer vision, robotics, autonomous vehicles, and other fields.
Current SG-generation techniques depend on expensive labeled datasets, which are in limited supply. Synthetic data is a viable alternative, as annotations are essentially free. Although synthetic data has been used for a variety of tasks such as image classification, object detection, and semantic segmentation, its use for SG generation and visual relationships is yet to be explored. The crucial issue is that training neural network models on labeled synthetic data and evaluating them on unlabeled real data leads to the domain gap problem, because the synthetic and real data differ in both appearance and content.
Sim2SG framework
To overcome these challenges, we propose Sim2SG, a scalable technique for sim-to-real transfer for scene graph generation. The primary goal of this research is to enable scene graph generation from real-world images by first training a neural network on a simulated dataset that contains labeled SG information and then transferring the learned model to a real-world dataset.
During training, Sim2SG learns to generate scene graphs while addressing the domain gap, which can be subdivided into the following gaps:
- Appearance gap is the discrepancy in the appearance of the two domains, such as differences in the texture, color, lighting, or reflectance of objects in the scene.
- Content gap refers to discrepancies in the content of the two domains, including differences in the distribution of object counts and classes, placement, pose, and scale.
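To make the content gap concrete, the sketch below scores it as the KL divergence between the per-class object-count histograms of the two domains. This metric and the example counts are illustrative assumptions, not part of Sim2SG itself.

```python
import numpy as np

def content_gap_proxy(src_counts, tgt_counts, eps=1e-8):
    """Hypothetical proxy for the content gap: KL divergence between
    the per-class object-count distributions of the two domains."""
    p = np.asarray(src_counts, dtype=float) + eps
    q = np.asarray(tgt_counts, dtype=float) + eps
    p /= p.sum()  # normalize histograms to probability distributions
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example histograms over [car, pedestrian, vegetation, house];
# a larger value indicates a larger mismatch in scene content.
print(content_gap_proxy([500, 50, 200, 100], [300, 150, 250, 120]))
```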
We analyze the content gap further and address its subcomponents: label and prediction discrepancies. Figure 1 shows Sim2SG generating accurate scene graphs for a real-world driving dataset and Figure 2 shows the entire pipeline.
In Figure 2, the Sim2SG pipeline takes labeled synthetic data from the source domain and unlabeled real data from the target domain as input. Both are mapped to a shared representation, Z, using an encoder. We then train the scene graph prediction network, h, on Z using the synthetic data. We handle the label discrepancy with pseudo-statistic-based self-learning, which generates label-aligned synthetic data for training. We further align both the prediction and appearance discrepancies between the two domains adversarially, using a gradient reversal layer (GRL) and a domain discriminator.
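As a rough illustration of the adversarial alignment step, here is a minimal PyTorch sketch of a gradient reversal layer paired with a domain discriminator. The feature dimensionality, the discriminator architecture, and the assumption that Z is pooled to one vector per image are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass; flips (and
    scales) gradients on the backward pass so the encoder learns to
    fool the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical domain discriminator over 512-dim pooled features of Z;
# the paper's actual architecture may differ.
discriminator = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
bce = nn.BCEWithLogitsLoss()

def alignment_loss(z_src, z_tgt, lam=1.0):
    """Adversarial alignment of the shared representation Z: the
    discriminator separates source from target, while the reversed
    gradients push the encoder toward domain-invariant features."""
    logits_src = discriminator(grad_reverse(z_src, lam))
    logits_tgt = discriminator(grad_reverse(z_tgt, lam))
    return (bce(logits_src, torch.ones_like(logits_src)) +
            bce(logits_tgt, torch.zeros_like(logits_tgt)))
```

Because the GRL flips gradients, minimizing this single loss trains the discriminator to tell the domains apart while simultaneously pushing the encoder to make Z indistinguishable across domains.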
Quantitative evaluation
We used four classes—car, pedestrian, vegetation, and house—and four types of relationships—front, left, right, and behind. All relationships have the car as the subject.
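For concreteness, a predicted scene graph over these classes and relationships can be viewed as a set of subject-predicate-object triples. The encoding below is a hypothetical illustration, not Sim2SG's actual data format.

```python
# Hypothetical triple encoding of a scene graph for the driving setting.
CLASSES = ["car", "pedestrian", "vegetation", "house"]
RELATIONSHIPS = ["front", "left", "right", "behind"]

# Every relationship has the car as its subject, e.g. a scene with a
# pedestrian in front of the car and a house to its right:
scene_graph = [
    ("car", "front", "pedestrian"),
    ("car", "right", "house"),
    ("car", "behind", "vegetation"),
]

# Sanity-check the structure: car subject, known predicate and object.
for subj, pred, obj in scene_graph:
    assert subj == "car" and pred in RELATIONSHIPS and obj in CLASSES
```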
Table 1 shows how label alignment and appearance alignment in the proposed method drastically reduce the domain gap compared to the baselines. We compared Sim2SG to a randomization-based method (Prakash et al., 2019), a method addressing the content gap (Kar et al., 2019), self-learning based on pseudo labels (Zou et al., 2018), and domain adaptation methods for object detection (Chen et al., 2018; Xu et al., 2020; Li et al., 2020). The domain gap is reduced further by combining label, appearance, and prediction alignment (final row).
Qualitative evaluation
Figure 3 shows the qualitative results of Sim2SG on the target domain. The first column shows that the source-only baseline either fails to detect objects or produces a high number of false positives (mislabels), leading to poor scene graphs. Our method detects objects better, has fewer false positives, and ultimately generates more accurate scene graphs, as shown in the second and third columns. This is because the appearance alignment term reduces false positive detections, while the label alignment term improves detection performance by generating synthetic training data whose labels are better aligned with the target domain. Figure 4 shows some label-aligned synthetic reconstructions corresponding to target domain samples.
Summary
In this work, we propose Sim2SG, a model that achieves sim-to-real transfer learning for scene graph generation on unlabeled real-world datasets. We decompose the domain gap into label, prediction, and appearance discrepancies between synthetic and real domains. We propose methods to address these discrepancies and achieve significant improvements over baselines in all three environments: Clevr, Dining-Sim, and Drive-Sim.
For more information, see the following resources: