Inception Spotlight: DeepZen Uses AI to Generate Speech for Audiobooks

Almost 1,000,000 books are published every year in the United States. However, only around 40,000 of them are converted into audiobooks, primarily because of cost and production time.

To help with the process, DeepZen, a London-based company and a member of the Inception program, NVIDIA’s start-up incubator, developed a deep learning-based system that generates human-like, emotionally expressive audio recordings of complete books and supports other voice-related applications.

“The traditional process is taking too long and costing too much money,” said Taylan Kamis, the company’s Co-Founder and CEO. “If you think about it, we need to find a narrator, arrange a studio, and do a lot of recordings with that person. It’s quite a lengthy [process], taking anywhere from three weeks to months, and can cost up to $5,000 per book. We are hoping to give people more options.”

Using NVIDIA P100 and V100 GPUs on Google Cloud, with the cuDNN-accelerated PyTorch and TensorFlow deep learning frameworks available through NVIDIA’s NGC container registry, the team trained its text-to-speech algorithms on thousands of hours of narrator speech.

Once trained, the system automatically analyzes text, converts it to speech, and adds the necessary emotion for each line and word.

“At a really basic stage, we teach machines to speak as humans do. If you think about human speech, there are punctuation rules, pauses, emotions, and a lot of different aspects of speech that we train a machine to replicate,” Kamis said.

For inference, the company runs the NVIDIA TensorRT inference engine from NGC on V100 GPUs in the Amazon Web Services cloud. Company developers say it previously took four to five hours just to set up a server and frameworks. With NGC, the process takes minutes, which saves compute costs as well as expensive developer time.

“We have built an end-to-end system where any text is fed, cleaned, analyzed by our NLP to understand the context, enrich with information that defines the exact emotional mix for each word and sentence,” said Kerem Sozugecer, the company’s Chief Technology Officer. “Context is set in a way that the natural transitions between sentences and paragraphs are ensured for continuity of the story, just like a human narrator would read.”
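The stages Sozugecer describes (clean the text, analyze it, attach an emotion to each sentence) can be illustrated with a toy sketch. This is not DeepZen's system, which relies on trained deep learning models. The keyword table, the function names, and the pause lengths below are all hypothetical stand-ins chosen only to show the shape of such a pipeline.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical keyword cues standing in for a trained emotion model.
EMOTION_CUES = {
    "joy": {"laughed", "smiled", "delighted"},
    "fear": {"trembled", "screamed", "dark"},
    "sadness": {"wept", "sighed", "alone"},
}


@dataclass
class AnnotatedSentence:
    text: str
    emotion: str
    pause_ms: int  # assumed pause to insert after the sentence


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tag_emotion(sentence: str) -> str:
    """Return the first emotion whose cue words appear, else 'neutral'."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    for emotion, cues in EMOTION_CUES.items():
        if words & cues:
            return emotion
    return "neutral"


def annotate(text: str) -> List[AnnotatedSentence]:
    """Clean the text, split it, and attach an emotion and pause to each sentence."""
    annotated = []
    for sentence in split_sentences(clean(text)):
        # Assumed rule: a longer pause after exclamations and questions.
        pause_ms = 600 if sentence.endswith(("!", "?")) else 400
        annotated.append(AnnotatedSentence(sentence, tag_emotion(sentence), pause_ms))
    return annotated


if __name__ == "__main__":
    sample = "She  laughed.\nThen the room went dark!  He sat quietly."
    for item in annotate(sample):
        print(item)
```

In a real system, the annotated sentences would be passed to a speech synthesis model that conditions on the emotion label. Here they are only printed.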

The tool has the potential to revolutionize video game sound, professional voice-overs, and the audiobook publishing industry. The system can also give people who are blind or visually impaired, or who have reading disabilities, greater access to books.

“Human speech is so unique and it takes more than just a program to get the human-like experience, so deep learning is the only programming system that allows us to capture all the factors that allow us to generate speech,” Kamis said. “We are scaling the process so people can have choices.”

The AI-generated recordings can also be easily modified by human editors through proprietary software to convey emotion more precisely in specific lines and pages. Publishers can also choose among voices of different genders and accents to produce different versions of the same book.