
AI Model Can Generate Images from Natural Language Descriptions

To help improve natural language queries, including the retrieval of images from speech, researchers from IBM and the University of Virginia developed a deep learning model that can generate objects and their attributes from natural language descriptions. Unlike other recent methods, this approach does not use GANs.

“We show that under minor modifications, the proposed framework can handle the generation of different forms of scene representations, including cartoon-like scenes, object layouts corresponding to real images, and synthetic images,” the researchers stated in their paper.

Named Text2Scene, the model can interpret visually descriptive language to generate scene representations. 

Using NVIDIA Tesla P100 GPUs on Google Cloud Platform, with the cuDNN-accelerated PyTorch deep learning framework, the researchers trained several models, each comprising a text encoder, an image encoder, a convolutional recurrent module, attention modules, an object encoder, and an attribute encoder, to generate compositional scene representations from text.

Overview of Text2Scene. The general framework consists of (A) a Text Encoder that produces a sequential representation of the input, (B) an Image Encoder that encodes the current state of the generated scene, (C) a Convolutional Recurrent Module that tracks, for each spatial location, the history of what has been generated so far, (D-F) two attention-based predictors that sequentially focus on different parts of the input text, first to decide what object to place, then to decide which attributes to assign to the object, and (G) an optional foreground embedding step that learns an appearance vector for patch retrieval in the synthetic image generation task.
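For readers who want a concrete picture of how these pieces fit together, the following is a minimal PyTorch sketch of components (A)-(F). The class name, layer sizes, vocabulary sizes, and the plain GRU cell standing in for the paper's convolutional recurrent module are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Text2Scene component layout described above.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Text2SceneSketch(nn.Module):
    def __init__(self, vocab_size=10000, num_objects=100, num_attributes=50,
                 embed_dim=256, hidden_dim=256):
        super().__init__()
        # (A) Text encoder: embeds the caption into per-word hidden states.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # (B) Image encoder: encodes the current state of the generated canvas.
        self.canvas_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # (C) Recurrent module tracking what has been generated so far.
        # The paper uses a convolutional recurrent module; a plain GRUCell over
        # pooled canvas features stands in here as a simplification.
        self.state_rnn = nn.GRUCell(hidden_dim, hidden_dim)
        # (D-F) Attention-based predictors: first what object to place,
        # then which attributes (e.g. pose, location) to assign to it.
        self.object_attn = nn.Linear(2 * hidden_dim, 1)
        self.object_head = nn.Linear(2 * hidden_dim, num_objects)
        self.attribute_attn = nn.Linear(2 * hidden_dim, 1)
        self.attribute_head = nn.Linear(2 * hidden_dim, num_attributes)

    def attend(self, words, state, attn_layer):
        # Score each word against the current recurrent state, then pool.
        expanded = state.unsqueeze(1).expand(-1, words.size(1), -1)
        scores = attn_layer(torch.cat([words, expanded], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), words).squeeze(1)
        return context, weights

    def forward(self, caption_ids, canvas, state=None):
        words, _ = self.text_rnn(self.word_embed(caption_ids))          # (A)
        canvas_feat = self.canvas_cnn(canvas).flatten(1)                 # (B)
        state = self.state_rnn(canvas_feat, state)                       # (C)
        obj_ctx, obj_attn = self.attend(words, state, self.object_attn)  # (D)
        obj_logits = self.object_head(torch.cat([obj_ctx, state], dim=-1))
        attr_ctx, attr_attn = self.attend(words, state, self.attribute_attn)
        attr_logits = self.attribute_head(torch.cat([attr_ctx, state], dim=-1))
        return obj_logits, attr_logits, state, (obj_attn, attr_attn)
```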

The model can generate different forms of scenes including cartoon-like scenes, semantic layouts corresponding to real images, and synthetic image composites. 

Step-by-step generation of an abstract scene, showing the top-3 attended words for object prediction and attribute prediction at each time step. Notice how, except for the sun predicted at the first time step, the top-1 attended words in the object decoder map almost one-to-one to the predicted objects. The words attended to by the attribute decoder also carry information that is semantically useful for predicting either pose or location; for example, to predict the location of the hotdog at the fifth time step, the model attends to “mike” and “holding.”
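As a rough illustration of this step-by-step decoding, the hypothetical loop below runs the sketch module from above for a few steps and prints the top-3 attended words for the object and attribute decoders at each step. The dummy caption, empty canvas, fixed step count, and placeholder vocabulary are all assumptions made for the example.

```python
# Hypothetical decoding loop built on the Text2SceneSketch module above.
import torch

model = Text2SceneSketch()
caption_ids = torch.randint(0, 10000, (1, 12))   # a tokenized caption (dummy ids)
canvas = torch.zeros(1, 3, 64, 64)               # empty scene canvas
state = None                                     # recurrent state, starts empty
vocab = [f"word_{i}" for i in range(10000)]      # placeholder id-to-word mapping

with torch.no_grad():
    for step in range(10):
        obj_logits, attr_logits, state, (obj_attn, attr_attn) = model(
            caption_ids, canvas, state)
        obj_id = obj_logits.argmax(dim=-1).item()
        attr_id = attr_logits.argmax(dim=-1).item()
        # Inspect the top-3 attended words of each decoder, as in the figure.
        top_obj = [vocab[caption_ids[0, i].item()]
                   for i in obj_attn.topk(3, dim=-1).indices[0]]
        top_attr = [vocab[caption_ids[0, i].item()]
                    for i in attr_attn.topk(3, dim=-1).indices[0]]
        print(step, obj_id, attr_id, top_obj, top_attr)
        # In the full system, the predicted object and attributes would be
        # rendered onto the canvas here before the next decoding step.
```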

“Our method is not only competitive when compared with state-of-the-art GAN-based methods using automatic metrics and superior based on human judgments but also has the advantage of producing interpretable results,” the researchers stated. 

In addition to generating abstract scenes of clip art, the researchers also tested their model on semantic layout generation on the COCO dataset, as well as synthetic image generation.

Qualitative examples of synthetic image generation (best viewed in color). The first column shows input captions with manually highlighted objects (purple), counts (blue), and relations (red). The second column shows the true images. Columns in the middle show competing approaches. The last two columns show the outputs of our model before and after post-processing.

According to the researchers, the model outperforms several previous methods by large margins, with the exception of one.

For real-time inference, the models rely on NVIDIA GeForce GTX 1080 Ti GPUs.

“As our model adopts a composite image generation framework without adversarial training, gaps between adjacent patches may result in unnaturally shaded areas. We observe that, after performing a regression-based inpainting, the composite outputs achieve consistent improvements on all automatic metrics,” the team said. 
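The paper's regression-based inpainting network is not described in detail here, so the snippet below only shows where such a post-processing pass would sit in a composite pipeline. OpenCV's classical Telea inpainting stands in for the learned step, and the function name and coverage mask are made up for the example.

```python
# Illustrative gap-filling for a patch composite. The researchers describe a
# learned, regression-based inpainting step; classical inpainting stands in
# here purely to show where such a pass fits.
import cv2
import numpy as np

def fill_composite_gaps(composite_bgr, gap_mask):
    """composite_bgr: HxWx3 uint8 image assembled from retrieved patches.
    gap_mask: HxW uint8 mask, 255 wherever no patch covers the canvas."""
    return cv2.inpaint(composite_bgr, gap_mask, 3, cv2.INPAINT_TELEA)

# Dummy example: a flat composite with an uncovered seam between two patches.
composite = np.full((128, 128, 3), 180, dtype=np.uint8)
gap = np.zeros((128, 128), dtype=np.uint8)
gap[:, 62:66] = 255                  # the seam left between adjacent patches
composite[gap == 255] = 0            # gaps show up as unnaturally shaded areas
smoothed = fill_composite_gaps(composite, gap)
```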

The work was recently presented at the annual Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California.

