MIT researchers have developed a deep learning system that can identify objects within an image, based on a spoken description of the picture, in real time.
“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory, told MIT News. “We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”

Given an image and a spoken description of it, the system highlights the regions of the image that correspond to what is being described in the audio.
Using NVIDIA TITAN Xp GPUs with the cuDNN-accelerated PyTorch deep learning framework, Harwath and his team trained two convolutional neural networks on 402,385 image/caption pairs. One of the CNNs processes images and the other processes spectrograms. The team uses the same GPUs for inference.

What sets the approach apart is that Harwath and his team do not use conventional speech recognition or object detection. Instead of learning fixed points in an embedding space, the neural network learns representations that are distributed both spatially and temporally, the researchers said.
“Both the speech and images are completely unsegmented, unaligned, and unannotated during training, aside from the assumption that we know which images and spoken captions belong together,” Harwath said. “The biggest contribution of the paper is demonstrating that these cross-modal alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don’t.”
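To make the idea of spatially and temporally distributed representations more concrete, here is a minimal sketch in plain Python. It assumes the image network outputs one embedding per image region and the audio network outputs one embedding per time frame; every region is then compared with every frame, and a matched image/caption pair should score higher than a mismatched one. The function names, the toy vectors, the max-then-average scoring rule, and the margin are illustrative assumptions, not the team's published code.

```python
# Illustrative sketch only: names, shapes, and the scoring rule are
# assumptions made for this example, not the authors' implementation.

def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_grid(image_feats, audio_feats):
    """Compare every image region with every audio frame.

    image_feats: list of region embeddings (one per spatial cell)
    audio_feats: list of frame embeddings (one per time step)
    Returns a regions x frames grid of dot products.
    """
    return [[dot(region, frame) for frame in audio_feats]
            for region in image_feats]

def pair_score(image_feats, audio_feats):
    """Score an image/caption pair: for each audio frame keep its
    best-matching region, then average over frames."""
    grid = similarity_grid(image_feats, audio_feats)
    n_frames = len(audio_feats)
    best_per_frame = [max(row[t] for row in grid) for t in range(n_frames)]
    return sum(best_per_frame) / n_frames

def ranking_loss(matched_score, mismatched_score, margin=1.0):
    """Hinge-style loss: zero once the matched pair beats the
    mismatched pair by at least `margin` (an assumed hyperparameter)."""
    return max(0.0, margin + mismatched_score - matched_score)

# Toy data: two image regions and two spoken captions (hypothetical values).
image = [[1.0, 0.0], [0.0, 1.0]]
matching_caption = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
other_caption = [[-1.0, 0.0], [0.0, -1.0]]

matched = pair_score(image, matching_caption)
mismatched = pair_score(image, other_caption)
loss = ranking_loss(matched, mismatched)
```

In this toy setup the only supervision is which caption belongs to which image. The region-to-frame grid that falls out of the comparison is what would let a trained model point to the part of the picture being described at a given moment in the audio.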
The network has a vocabulary of 44,000 words and was trained on speech from over 2,500 speakers.
The work was recently presented at the ECCV conference in Munich, Germany, and the code and dataset have been published online.