
Customizing Neural Machine Translation Models with NVIDIA NeMo, Part 1


Neural machine translation (NMT) is the task of automatically translating a sequence of words from one language to another with neural networks. In recent years, attention-based transformer models have had a profound impact on language modeling tasks, which predict the next token in a sequence, and NMT is a typical example.

There are plenty of open-source NMT models available in the community. However, it may be challenging to use them directly in real life to translate content. Common issues include: 

  • Mistranslation
  • Lack of semantic accuracy
  • Lack of domain-specific knowledge
  • Inability to handle proper nouns or rare words

A common underlying reason for these issues is that the data used to train the model does not match the distribution of data in the target use cases. In such cases, model fine-tuning is necessary.

NVIDIA NeMo is an end-to-end platform for developing custom generative AI anywhere, including large language models (LLMs), multimodal, vision, and speech AI. It includes tools for training, retrieval-augmented generation (RAG), guardrailing, and data curation, along with toolkits and pretrained models. NeMo offers enterprises an easy, cost-effective, and fast way to adopt generative AI.

In this post, we walk through the prerequisites, running the pretrained model with NeMo, and evaluating its performance. In the second post, we walk through customizing the dataset and fine-tuning the model with the custom dataset. 

Introduction to machine translation

Researchers have been interested in studying machine translation since the 1950s. Before 2010, rule-based and statistical machine translation were the prevalent research fields. Rule-based machine translation generates the translation from morphological, syntactic, and semantic analysis of the source and target languages. Statistical machine translation later superseded it by using statistical models to predict the probability distribution of target words.

With the advancements in deep neural networks in the 2010s, NMT became the dominant approach and remains so today. As research has evolved, the quality of NMT has benefited from recurrent neural networks (RNN), long short-term memory (LSTM), attention-based encoder-decoder transformers, and decoder-only LLMs.

This post uses two NMT models—NVIDIA NeMo NMT and Advanced Language Model-Based Translator (ALMA NMT)—as examples, demonstrating the process of fine-tuning these models on custom datasets. 

NVIDIA NeMo NMT models

NeMo enables you to efficiently create, customize, and deploy new generative AI models. It also provides pretrained models for different natural language processing (NLP) and speech tasks such as NMT, automatic speech recognition (ASR), and text-to-speech (TTS).

The NeMo NMT models are bilingual models with a self-attention–based encoder-decoder structure that has 24 layers in the encoder and 6 layers in the decoder. They are trained with NeMo on publicly available parallel datasets.

In the diagram, the encoder takes an English sentence as input and produces the encoded embedding. The decoder is fed both its own output token from the previous step and the encoded embedding to generate the next token autoregressively.
Figure 1. Translation process with self-attention–based encoder-decoder transformer architecture
(source: The Annotated Transformer: English-to-Chinese Translator)
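
The decoding loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the NeMo implementation: `toy_next_token` is a hypothetical stand-in for the decoder that replays a canned token sequence instead of attending to the encoded embedding, and greedy selection is used for simplicity where real models typically use beam search.

```python
from typing import Callable, List

BOS, EOS = "<s>", "</s>"


def greedy_decode(
    next_token: Callable[[List[str], List[str]], str],
    encoded_source: List[str],
    max_steps: int = 20,
) -> List[str]:
    """Generate tokens one at a time until EOS or max_steps is reached."""
    output = [BOS]
    for _ in range(max_steps):
        # The decoder sees the encoded source plus everything generated so far.
        token = next_token(encoded_source, output)
        if token == EOS:
            break
        output.append(token)
    return output[1:]  # drop the BOS marker


def toy_next_token(encoded_source: List[str], generated: List[str]) -> str:
    """Hypothetical decoder: replays a canned translation token by token."""
    canned = ["AI", "正在", "推动", "变革", EOS]
    step = len(generated) - 1  # number of tokens produced so far
    return canned[step] if step < len(canned) else EOS


tokens = greedy_decode(toy_next_token, ["AI", "is", "powering", "change"])
print(tokens)
```

The key point is the feedback: each call to the decoder receives the tokens it has already produced, which is what makes generation autoregressive.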

ALMA NMT models

ALMA is a many-to-many, LLM-based (decoder-only) translation model. It starts from a pretrained LLM, such as Llama 2, continues training on a monolingual corpus (stage 1), and then applies low-rank adaptation (LoRA) tuning with a parallel translation dataset to further improve performance (stage 2).

Compared to encoder-decoder–based models, the decoder-only model treats NMT as a downstream task and needs an instruction prompt as part of the input. In the pretraining and continued-training stages, the model is trained to predict the next token on multiple monolingual corpora. The parallel dataset is used only in the LoRA tuning stage.
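
For reference, the low-rank update behind LoRA tuning can be written compactly. The notation below follows the common LoRA formulation rather than anything taken from the ALMA code: $W_0$ is a frozen pretrained weight matrix, $A$ and $B$ are the trainable low-rank factors, and the rank $r$ and scaling factor $\alpha$ are hyperparameters.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Because only $A$ and $B$ are updated, the number of trainable parameters per adapted matrix drops from $d \cdot k$ to $r \cdot (d + k)$, which is why the parallel dataset in stage 2 can be comparatively small.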

NMT model customization pipeline

Two NMT models are used because they represent two types of fine-tuning recipes: 

  • The NeMo NMT models are natively designed to be trained and fine-tuned with NeMo.
  • The ALMA models are open-source projects available on Hugging Face and GitHub, and they use Accelerate for fine-tuning.

Both recipes share a common customization pipeline but they have subtle differences in particular steps.

The next sections in this post follow the order of the model fine-tuning pipeline:

  1. Get the prerequisites.
  2. Run the pretrained NMT models.
  3. Evaluate the initial performance of the NMT models.

In part 2, we cover the following steps:

  1. Create a custom data collection.
  2. Create a data preprocessing pipeline.
  3. Fine-tune the model.
  4. Evaluate the fine-tuned model.

In model customization, the first step is to investigate the publicly available models and evaluate their initial performance on custom datasets collected from, or simulated for, the target use cases. Customization is needed when none of these models satisfies the requirements.

Data collection and preprocessing are essential in customization, and data quality has a large impact on the final fine-tuned model. The collected data should follow the distribution of real use-case scenarios. It should be preprocessed to remove outliers and sometimes needs normalization as well.
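
As a rough illustration of what outlier removal and normalization can look like for a parallel corpus, here is a minimal Python sketch. It is not the pipeline used in this series: the function name `clean_parallel_pairs` and the thresholds `max_chars` and `max_ratio` are assumptions chosen for the example and would need tuning for a real language pair.

```python
import unicodedata
from typing import Iterable, List, Tuple


def clean_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_chars: int = 500,    # assumed cutoff for overly long sentences
    max_ratio: float = 4.0,  # assumed cutoff for source/target length mismatch
) -> List[Tuple[str, str]]:
    """Normalize and filter (source, target) sentence pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Unicode normalization plus whitespace collapsing.
        src = " ".join(unicodedata.normalize("NFKC", src).split())
        tgt = " ".join(unicodedata.normalize("NFKC", tgt).split())
        if not src or not tgt:
            continue  # drop pairs with an empty side
        if len(src) > max_chars or len(tgt) > max_chars:
            continue  # drop overly long outliers
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # drop pairs whose lengths are badly mismatched
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


raw = [
    ("AI is powering change.", "AI 正在推动变革。"),
    ("AI is powering change.", "AI 正在推动变革。"),  # duplicate
    ("Hello", ""),  # empty target
    ("A very long source sentence with no real translation", "好"),  # ratio outlier
]
cleaned = clean_parallel_pairs(raw)
print(cleaned)
```

Character-length ratios are only a crude proxy, especially between English and Chinese, so a real pipeline would likely combine several such filters.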

The next step is to run fine-tuning with the curated dataset on top of the pretrained model. In this tutorial, we demonstrate NeMo NMT fine-tuning and ALMA model LoRA tuning (stage 2) with a parallel translation dataset. Finally, model evaluation is performed again on the fine-tuned models. 

Model customization is an iterative process that continues until the model reaches satisfactory performance for real use cases.


Prerequisites

The NGC Catalog provides you with access to GPU-accelerated software that speeds up end-to-end workflows with performance-optimized containers, pretrained AI models, and industry-specific SDKs that can be deployed on-premises, in the cloud, or at the edge. 

For this tutorial, use the NeMo framework container 24.01 available on the NGC catalog for fine-tuning the NMT models.

To start up the container and follow the tutorial, we suggest the following system resources:

Generate the NGC API Key and log in to the Docker NGC registry:

docker login nvcr.io

Username: $oauthtoken
Password: <Your Key>

Start up the NeMo framework container with the following command:

docker run --runtime=nvidia -it --rm -p 8888:8888 -p 6006:6006  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864

Run the pretrained NMT models

When you’ve entered the NeMo framework container, you can run inference on various pretrained NMT models and test their initial performance.

Inferencing pretrained NeMo NMT models

It’s straightforward to download and run inference on the pretrained NeMo NMT models in a NeMo container. For this post, we used an English-to-Simplified Chinese model.

Run the following Python commands in the NeMo framework container:

from nemo.collections.nlp.models import MTEncDecModel

# Download the pretrained English-to-Chinese model and translate one sentence
model = MTEncDecModel.from_pretrained("nmt_en_zh_transformer24x6")
translations = model.translate(["AI is powering change in every industry"], source_lang="en", target_lang="zh")
print(translations)


The execution result is as follows:

['AI 正在推动每个行业的变革']

Inferencing pretrained ALMA NMT models 

ALMA provides LoRA-tuned models on Hugging Face. To run these models in the NeMo framework container, you must install the additional peft dependency, which is used to load the LoRA weights.

pip install peft

The following is a sample inference code example for ALMA:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, "haoranxu/ALMA-7B-Pretrain-LoRA")
tokenizer = LlamaTokenizer.from_pretrained("haoranxu/ALMA-7B-Pretrain", padding_side='left')

# Add the source sentence into the prompt template
prompt_template = "Translate this from English to Chinese:\nEnglish: {}\nChinese:"
prompt = prompt_template.format("AI is powering change in every industry")

# Tokenize
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()
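# Inference
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)

# Decode the generated token IDs back into text
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)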

Here’s the result:


The prompt template "Translate this from English to Chinese:\nEnglish: {}\nChinese:" serves as the instruction to the LLM, and the model's output continues the prompt. Because the decoded text contains the prompt followed by the generated text, you must post-process the output to extract the actual translation.
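
One way to do that post-processing is sketched below. The helper name `extract_translation` is hypothetical, and the `decoded` string is a made-up example of a decoded sequence, not real ALMA output. The sketch assumes the decoded text starts with the prompt, and falls back to splitting on the last "Chinese:" tag if it does not.

```python
def extract_translation(decoded: str, prompt: str, target_tag: str = "Chinese:") -> str:
    """Return only the text the model generated after the prompt."""
    if decoded.startswith(prompt):
        completion = decoded[len(prompt):]
    else:
        # Fall back to splitting on the last target-language tag.
        completion = decoded.rsplit(target_tag, 1)[-1]
    stripped = completion.strip()
    # Keep the first line only, in case the model keeps generating.
    return stripped.splitlines()[0].strip() if stripped else ""


prompt = "Translate this from English to Chinese:\nEnglish: Hello\nChinese:"
decoded = prompt + " 你好\nEnglish: something else"  # made-up decoded text
translation = extract_translation(decoded, prompt)
print(translation)
```

Keeping only the first line is a design choice that suits single-sentence inputs. Multi-line source texts would need a different stopping rule.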

NMT model evaluation

Evaluation of the pretrained model gives insights into how the model performs and expectations on the fine-tuning result. 

The bilingual evaluation understudy (BLEU) algorithm is commonly used as a metric for evaluating the quality of machine translation. It measures how closely a generated translation matches the reference text by comparing n-gram matches. BLEU is defined on a scale from 0 to 1 (sacrebleu reports it multiplied by 100), and a higher score indicates a closer match to the reference.
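
To make the n-gram comparison concrete, the following is a simplified sketch of corpus-level BLEU in plain Python. It assumes pre-tokenized input, a single reference per sentence, and no smoothing, so its numbers will not match sacrebleu, which applies its own tokenization and reports scores on a 0 to 100 scale. The function names are illustrative only.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[List[str]], references: List[List[str]], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU (single reference, no smoothing), in [0, 1]."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes translations shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "the cat sat on the mat".split()
exact = corpus_bleu([reference], [reference])
partial = corpus_bleu(["the cat sat on a mat".split()], [reference])
print(exact, round(partial, 4))
```

A single substituted word lowers the score noticeably because it breaks every higher-order n-gram that spans it.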

The NeMo framework container includes the sacrebleu package for benchmarking the BLEU metric:

sacrebleu reference.txt -i generated.txt -m bleu -b -w 4

The generated.txt file is the machine-generated translation and reference.txt is the reference translation. Each line in each file has a one-to-one correspondence in the other file. In the command, the following parameters are used:

  • -m bleu means the BLEU metric is used.
  • -b means that only the BLEU score is output.
  • -w 4 is the floating point width.

You can collect custom parallel translations from real use cases as the evaluation dataset, use the inference codes from the previous section to generate translations for the dataset, and measure the pretrained model performance. 
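
A minimal sketch of that workflow is shown below. The helper `translate_file` is hypothetical: it takes any callable that maps a batch of source sentences to a list of translations and writes one output line per input line, so that the result lines up with the reference file. The upper-casing lambda is only a stand-in for a real model.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def translate_file(
    translate: Callable[[List[str]], List[str]],
    src_path: str,
    out_path: str,
    batch_size: int = 32,
) -> int:
    """Translate src_path in batches and write one output line per input line."""
    lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    outputs: List[str] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        results = translate(batch)
        if len(results) != len(batch):
            raise ValueError("translate() must return one output per input")
        # Collapse whitespace, including newlines, so each translation stays on one line.
        outputs.extend(" ".join(r.split()) for r in results)
    Path(out_path).write_text("".join(o + "\n" for o in outputs), encoding="utf-8")
    return len(outputs)


# Demo with a stand-in "model" that upper-cases its input.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input_en.txt"
    out = Path(tmp) / "generated.txt"
    src.write_text("hello\nworld\n", encoding="utf-8")
    count = translate_file(lambda batch: [s.upper() for s in batch], str(src), str(out), batch_size=1)
    written = out.read_text(encoding="utf-8").splitlines()
print(count, written)
```

With the NeMo model from the earlier section, the callable could be `lambda batch: model.translate(batch, source_lang="en", target_lang="zh")`. The resulting file can then be scored against reference.txt with sacrebleu.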

Alternatively, for NeMo NMT models, the NeMo framework provides a script for translating a text file line by line. The following commands download the pretrained English-to-Chinese NeMo model and evaluate its performance on the input_en.txt file.

# Download the pretrained en-zh NeMo model
mkdir -p model/pretrained_ckpt
wget -O --content-disposition ""
unzip -d model/pretrained_ckpt

# Translation script
python /opt/NeMo/examples/nlp/machine_translation/ \
   --model model/pretrained_ckpt/en_zh_24x6.nemo \
   --srctext input_en.txt \
   --tgtout pretrained_nemo_out_zh.txt \
   --source_lang en \
   --target_lang zh \
   --batch_size 200 \
   --max_delta_length 20 

sacrebleu reference.txt -i pretrained_nemo_out_zh.txt -m bleu -b -w 4

In these commands:

  • input_en.txt is the English text file.
  • pretrained_nemo_out_zh.txt is the output translated text file.
  • reference.txt is the reference translation.


In this post, we showed you how to run a pretrained model with NeMo and evaluate its performance. In the second post, we walk you through curating a dataset and fine-tuning the model with the custom dataset using LoRA tuning.
