Google has released the latest version of its automatic image captioning model, which is more accurate and much faster to train than the original system.
“The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief (a system Google previously used for generating image captions) on an NVIDIA K20 GPU, meaning that total training time is just 25 percent of the time previously required,” Chris Shallue, a software engineer on the Google Brain Team, wrote in a blog post.
Using CUDA and the TensorFlow deep learning framework, Google trains Show and Tell by showing it images alongside captions that people wrote for those images. Sometimes, if the model thinks it sees something in a new image that is exactly like a previous image it has seen, it falls back on the caption for that previous image. But at other times, Show and Tell is able to come up with original captions. “Moreover,” Shallue wrote, “it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.”
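The model follows an encoder-decoder design: a convolutional network turns the image into a feature vector, and a recurrent network then generates the caption one word at a time. The snippet below is only a minimal sketch of that idea, assuming TensorFlow's Keras API, an InceptionV3 image encoder, and placeholder vocabulary and dimension sizes; it is not the code Google released.

```python
import tensorflow as tf

# Minimal, illustrative sketch of a Show-and-Tell-style captioner (not the
# released code): a pretrained CNN encodes the image into a single feature
# vector, which is prepended to the caption's word embeddings; an LSTM then
# learns to predict each caption word from everything that came before it.
# Vocabulary size and dimensions below are placeholder values.
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 512, 20

# Image encoder: pretrained InceptionV3 without its classification head.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
cnn.trainable = False  # reuse pretrained image features, train only the decoder

image_in = tf.keras.Input(shape=(299, 299, 3))
img_embed = tf.keras.layers.Dense(EMBED_DIM)(cnn(image_in))      # (batch, EMBED_DIM)
img_token = tf.keras.layers.Reshape((1, EMBED_DIM))(img_embed)   # treat it as the first "word"

# Caption decoder: embed the ground-truth caption word ids for training.
caption_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
word_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)

# Prepend the image embedding to the word sequence and run the LSTM over it,
# predicting a distribution over the next word at every position.
sequence = tf.keras.layers.Concatenate(axis=1)([img_token, word_embed])
hidden = tf.keras.layers.LSTM(EMBED_DIM, return_sequences=True)(sequence)
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = tf.keras.Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

During training, such a model would be fit on image-caption pairs, with the caption shifted by one position as the target so that each step predicts the next word; at inference time, generated words are fed back in one at a time to produce a new caption.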
The initial training phase took nearly two weeks on a single Tesla K20 GPU; Google notes it would be roughly 10 times slower if run on a CPU.
Read more >
Google Open-Sources Image Captioning Intelligence
Sep 27, 2016

Related resources
- GTC session: Introduction to "Learning Deep Learning" (Spring 2023)
- GTC session: Generative AI Text-to-Video: Humanizing the Way We Interact with Machines (Spring 2023)
- GTC session: Retrieval-Augmented Language Model and Its Application for Question-Answering and Image Captioning (Spring 2023)
- SDK: DALI
- Webinar: Inception Workshop 101 - Getting Started with Conversational AI
- Webinar: Inception Workshop 101 - Getting Started with Vision AI