To help localize subtitles from English into other languages, such as Russian, Spanish, or Portuguese, Netflix developed a proof-of-concept AI model that can automatically simplify subtitles and translate them into multiple languages.
The work is presented in a paper, Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine Translation, published this month on the preprint platform arXiv. The work is a collaboration between Netflix and Virginia Tech.
“Black-box machine translation systems have proven incredibly useful for a variety of applications yet by design are hard to adapt, tune to a specific domain, or build on top of,” the researchers explained.
To help solve the problem, the team used a technique they call automatic preprocessing (APP), which applies sentence simplification as a preprocessing step before the text is passed to a black-box, AI-based translation system.
“The model is used to preprocess source sentences of multiple low-resource language pairs. We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences,” the researchers stated.
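The simplify-then-translate idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the phrase table stands in for the learned simplification model, and `fake_black_box` stands in for a real translation service. None of these names come from the paper.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a learned simplification model:
# a tiny phrase table mapping wordy phrases to plain ones.
SIMPLIFICATIONS: Dict[str, str] = {
    "in the event that": "if",
    "at this point in time": "now",
}


def simplify(sentence: str) -> str:
    """Rewrite the source sentence into a simpler form before translation."""
    out = sentence
    for phrase, plain in SIMPLIFICATIONS.items():
        out = out.replace(phrase, plain)
    return out


def simplify_then_translate(sentence: str, translate: Callable[[str], str]) -> str:
    """Preprocess the source, then call the translator as an opaque function."""
    return translate(simplify(sentence))


def fake_black_box(sentence: str) -> str:
    """Hypothetical translator; a real system would call an external service."""
    return f"<translated>{sentence}</translated>"


if __name__ == "__main__":
    print(simplify_then_translate("Call me in the event that he arrives", fake_black_box))
```

The point of the design is that the translator is only ever called, never modified, so the preprocessing step can sit in front of any translation system.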
For training, the team used the Transformer architecture through the tensor2tensor library on TensorFlow. All experiments were run using machines with four NVIDIA V100 GPUs.
“We use a sub-word vocabulary of size 32,000 implemented using the Word-Piece algorithm to deal with out-of-vocabulary words and the open vocabulary problem in speech-to-speech language models,” the researchers explained.
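To show how a sub-word vocabulary handles words it has never seen whole, here is a rough Python sketch of greedy longest-match segmentation, the lookup step commonly associated with WordPiece. The toy vocabulary and the `##` continuation prefix are assumptions borrowed from a common convention, not details from the paper, and the sketch does not cover how the 32,000-entry vocabulary is learned.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Assumed convention: non-initial pieces carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one would have thousands of entries.
TOY_VOCAB = {"sub", "##title", "##s", "trans", "##late"}

if __name__ == "__main__":
    print(wordpiece_tokenize("subtitles", TOY_VOCAB))
```

With this toy vocabulary, a word such as "subtitles" is covered by three known pieces even though the full word is not in the vocabulary.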
Although this work focuses mainly on simplifying English subtitles, the model is not language-specific and can be used for other languages.
“Our work merges two important sub-fields of natural language processing (NLP) (machine translation and sentence simplification) and paves the path for future research in both of these fields,” the researchers said.