To improve natural language queries, including the retrieval of images from speech, researchers from IBM and the University of Virginia developed a deep learning model that generates objects and their attributes from natural language descriptions. Unlike other recent methods, the approach does not rely on GANs.
“We show that under minor modifications, the proposed framework can handle the generation of different forms of scene representations, including cartoon-like scenes, object layouts corresponding to real images, and synthetic images,” the researchers stated in their paper.
Named Text2Scene, the model can interpret visually descriptive language to generate scene representations.
Using NVIDIA Tesla P100 GPUs on the Google Cloud platform, with the cuDNN-accelerated PyTorch deep learning framework, the researchers trained several models, including a text encoder, an image encoder, a convolutional recurrent module, attention modules, an object decoder, and an attribute decoder, to generate compositional scene representations from text.
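The sketch below is an illustrative, simplified take on that kind of pipeline, not the authors' code: a text encoder produces per-word features, a convolutional encoder summarizes the partially built scene canvas, a recurrent module attends over the words, and separate heads predict the next object and its attributes. All class names, dimensions, and layer choices here are assumptions for demonstration.

```python
# Hypothetical sketch of a sequential text-to-scene generator (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, T) word indices -> per-word features (B, T, 2*hid_dim)
        return self.gru(self.embed(tokens))[0]

class CanvasEncoder(nn.Module):
    """Encodes the partially generated scene canvas into pooled features."""
    def __init__(self, in_ch=3, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, canvas):
        return self.conv(canvas).mean(dim=(2, 3))  # (B, feat_dim)

class Text2SceneSketch(nn.Module):
    def __init__(self, vocab_size, num_objects, num_attrs, feat_dim=256, txt_dim=512):
        super().__init__()
        self.text_enc = TextEncoder(vocab_size, hid_dim=txt_dim // 2)
        self.canvas_enc = CanvasEncoder(feat_dim=feat_dim)
        self.recurrent = nn.GRUCell(feat_dim + txt_dim, feat_dim)   # scene-state RNN
        self.attn = nn.Linear(feat_dim, txt_dim)                    # query for text attention
        self.obj_head = nn.Linear(feat_dim + txt_dim, num_objects)  # which object to place next
        self.attr_head = nn.Linear(feat_dim + txt_dim, num_attrs)   # attribute bins (pose, size, location)

    def step(self, tokens, canvas, state):
        words = self.text_enc(tokens)          # (B, T, txt_dim)
        scene = self.canvas_enc(canvas)        # (B, feat_dim)
        # Attend over the words, using the current scene state as the query.
        scores = torch.bmm(words, self.attn(state).unsqueeze(2)).squeeze(2)
        context = (F.softmax(scores, dim=1).unsqueeze(2) * words).sum(dim=1)
        # Update the recurrent scene state and predict the next object and its attributes.
        state = self.recurrent(torch.cat([scene, context], dim=1), state)
        fused = torch.cat([state, context], dim=1)
        return self.obj_head(fused), self.attr_head(fused), state
```

In such a setup, generation would proceed step by step: the predicted object and attributes are rendered onto the canvas, and the updated canvas is fed back into the next call to `step`.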
The model can generate different forms of scenes including cartoon-like scenes, semantic layouts corresponding to real images, and synthetic image composites.
“Our method is not only competitive when compared with state-of-the-art GAN-based methods using automatic metrics and superior based on human judgments but also has the advantage of producing interpretable results,” the researchers stated.
In addition to generating abstract scenes of clip art, the researchers also tested their model on semantic layout generation on the COCO dataset, as well as on synthetic image generation.
According to the researchers, the model outperforms several previous approaches by large margins, with the exception of one method.
For real-time inference, the models rely on NVIDIA GeForce GTX 1080 Ti GPUs.
“As our model adopts a composite image generation framework without adversarial training, gaps between adjacent patches may result in unnaturally shaded areas. We observe that, after performing a regression-based inpainting, the composite outputs achieve consistent improvements on all automatic metrics,” the team said.
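As a rough illustration of that post-processing step, the snippet below fills the uncovered gaps between composited patches with an off-the-shelf inpainting call. OpenCV's Telea inpainting is used here purely as a stand-in for the paper's regression-based inpainting network, and the zero-pixel gap convention is an assumption.

```python
# Hypothetical post-processing sketch: inpaint gaps left between composited patches.
import cv2
import numpy as np

def fill_patch_gaps(composite: np.ndarray) -> np.ndarray:
    """composite: HxWx3 uint8 image where pixels no patch covered are exactly zero."""
    # Mark the uncovered pixels; these are the gaps to fill in.
    mask = np.all(composite == 0, axis=2).astype(np.uint8) * 255
    # Telea inpainting propagates surrounding colors into the masked regions.
    return cv2.inpaint(composite, mask, 3, cv2.INPAINT_TELEA)
```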
The work was presented this year at the annual Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California.