Building an effective automatic speech recognition (ASR) model for underrepresented languages presents unique challenges due to limited data resources.
In this post, I discuss best practices for preparing the dataset, configuring the model, and training it effectively. I also cover the evaluation metrics and the challenges encountered along the way. By following these practices, you can confidently develop your own high-quality ASR model for Georgian or any other language with limited data resources.
Finding and enriching Georgian language data resources
Mozilla Common Voice (MCV), an open-source initiative for more inclusive voice technology, provides a diverse range of Georgian voice data.
The MCV dataset for Georgian includes approximately:
- 76.38 hours of validated training data
- 19.82 hours of validated development (dev) data
- 20.46 hours of validated test data
This validated data totals approximately 116.6 hours, which is still considered small for training a robust ASR model; a good dataset size for models like this starts at around 250 hours. For more information, see Example: Training Esperanto ASR model using Mozilla Common Voice Dataset.
To overcome this limitation, I included the unvalidated data from the MCV dataset, which amounts to 63.47 hours. Because unvalidated data may be less accurate or clean, extra processing is required to ensure its quality before using it for training. I explain this extra data processing in detail in this post. All mentioned hours of data are measured after preprocessing.
An interesting aspect of the Georgian language is its unicameral nature, as it does not have distinct uppercase and lowercase letters. This unique characteristic simplifies text normalization, potentially contributing to improved ASR performance.
Choosing FastConformer Hybrid Transducer CTC BPE
Harnessing the power of the NVIDIA FastConformer hybrid transducer Connectionist Temporal Classification (CTC) and byte-pair encoding (BPE) model for developing an ASR model offers key advantages:
- Enhanced speed performance: FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling that reduces the computational complexity.
- Improved accuracy: The model is trained in a multitask setup with joint transducer and CTC decoder loss functions, improving speech recognition and transcription accuracy.
- Robustness: The multitask setup enhances resilience to variations and noise in the input data.
- Versatility: This solution combines Conformer blocks for capturing long-range dependencies with efficient operations suitable for real-time applications, enabling it to handle a wider range of ASR tasks with varying levels of complexity and difficulty.
Fine-tuning model parameters ensures accurate transcription and better user experiences even with small datasets.
Building Georgian language data for an ASR model
Building a robust ASR model for the Georgian language requires careful data preparation and training. This section explains how to prepare and clean the data to ensure high quality, including integrating additional data sources and creating a custom tokenizer for the Georgian language. It also covers the different ways to train the model to achieve the best results, with a focus on checking and improving the model throughout the process. Each step is described in more detail below:
- Processing data
- Adding data
- Creating a tokenizer
- Training the model
- Combining data
- Evaluating performance
- Averaging checkpoints
- Evaluating the model
Processing data
To create manifest files, use the /NVIDIA/NeMo-speech-data-processor repo. In the dataset_configs/georgian/mcv folder, you can find a config.yaml file that handles data processing.
Convert data to the NeMo format
Extract and convert all data to the NeMo format necessary for future processing. In SDP, each processor works sequentially, requiring the NeMo format for the subsequent processors.
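For reference, a NeMo manifest is a JSON-lines file in which each line describes one utterance. The entry below is purely illustrative, with a hypothetical file path; the exact fields written by SDP may include additional metadata:

{"audio_filepath": "/data/mcv/ka/clips/common_voice_ka_12345.wav", "duration": 4.2, "text": "გამარჯობა"}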
Replace unsupported characters
Certain unsupported characters and punctuation marks are replaced with equivalent supported versions (Table 1).
| Non-supported characters | Replacement |
| --- | --- |
| ! | . |
| … | . |
| ; | , |
| “:\“”;„“-/ | Space |
| Multiple consecutive spaces | Single space |
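In SDP these replacements are declared in config.yaml, but the logic is equivalent to a few regular-expression substitutions. The following standalone Python sketch (not the SDP implementation itself) illustrates the replacements in Table 1:

import re

def normalize_text(text: str) -> str:
    # Replace exclamation marks and ellipses with a period, semicolons with a comma.
    text = text.replace("!", ".").replace("…", ".").replace(";", ",")
    # Replace unsupported quotation marks, colons, backslashes, dashes, and slashes with a space.
    text = re.sub(r'[“”„":\\/\-]', " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("გამარჯობა!  როგორ   ხარ;"))  # -> გამარჯობა. როგორ ხარ,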
Drop non-Georgian data
Remove data that does not contain any Georgian letters. This is crucial, as unvalidated data often includes texts with only punctuation or empty text.
Filter data by the supported alphabet
Drop any data containing symbols not in the supported alphabet, keeping only data with Georgian letters and the supported punctuation marks [?.,].
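As an illustration of both filtering steps (dropping entries with no Georgian letters, then rejecting entries with out-of-alphabet symbols), the following Python sketch uses the modern Georgian Mkhedruli Unicode range; the exact character set used by SDP is defined in config.yaml:

# The 33 letters of the modern Georgian (Mkhedruli) alphabet, ა through ჰ.
GEORGIAN_LETTERS = {chr(c) for c in range(0x10D0, 0x10F1)}
SUPPORTED = GEORGIAN_LETTERS | set("?.,") | {" "}

def has_georgian(text: str) -> bool:
    # Keep only entries that contain at least one Georgian letter.
    return any(ch in GEORGIAN_LETTERS for ch in text)

def in_supported_alphabet(text: str) -> bool:
    # Reject entries containing any symbol outside the supported alphabet.
    return all(ch in SUPPORTED for ch in text)

entries = [{"text": "გამარჯობა, როგორ ხარ?"}, {"text": "..."}, {"text": "hello"}]
kept = [e for e in entries if has_georgian(e["text"]) and in_supported_alphabet(e["text"])]
print(kept)  # only the first entry survives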
Filter by characters and word occurrence
After SDE analysis, data with an abnormal character rate (more than 18) or a word rate outside the range of 0.3 to 2.67 is dropped.
Filter by duration
Drop any data that has a duration of more than 18 seconds, as typical audio in MCV is less than this duration.
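The following sketch illustrates these threshold filters on NeMo manifest entries, assuming character rate and word rate are computed per second of audio; the exact definitions used by SDE and SDP may differ:

def keep_entry(entry: dict) -> bool:
    # entry is a NeMo manifest record with "text" and "duration" fields.
    duration = entry["duration"]
    char_rate = len(entry["text"]) / duration
    word_rate = len(entry["text"].split()) / duration
    if duration > 18.0:               # drop clips longer than typical MCV audio
        return False
    if char_rate > 18.0:              # drop abnormal character rates
        return False
    if not (0.3 < word_rate < 2.67):  # keep only plausible word rates
        return False
    return True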
For more information about how to work with NeMo-speech-data-processor, see the /NVIDIA/NeMo-speech-data-processor GitHub repo and the documentation of the Georgian dataset. The following command runs the config.yaml file in SDP:
python main.py --config-path=dataset_configs/georgian/mcv/ --config-name=config.yaml
Adding data
From the FLEURS dataset, I also incorporated the following data:
- 3.20 hours of training data
- 0.84 hours of development data
- 1.89 hours of test data
The same preprocessing steps were applied to ensure consistency and quality. Use the same config file for the FLEURS Georgian data, but download the dataset yourself.
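One convenient way to obtain the Georgian split is through the Hugging Face datasets library; the google/fleurs dataset name and ka_ge configuration below are assumptions about your download path, and any method that yields audio files plus transcripts works just as well:

from datasets import load_dataset

# Download the Georgian (ka_ge) configuration of FLEURS; depending on your
# datasets version, additional arguments such as trust_remote_code may be needed.
fleurs_ka = load_dataset("google/fleurs", "ka_ge")
print(fleurs_ka["train"][0]["transcription"])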
Creating a tokenizer
After data processing, create a tokenizer that contains the vocabulary. I tested two different tokenizers:
- Byte Pair Encoding (BPE) tokenizer by Google
- Word Piece Encoding tokenizer for transformers
The BPE tokenizer yielded better results. Tokenizers are integrated into the NeMo architecture and are created with the following command:
python <NEMO_ROOT>/scripts/tokenizers/process_asr_text_tokenizer.py \
    --manifest=<path to train manifest files, separated by commas> \
    --data_root="<output directory>" \
    --vocab_size=1024 \
    --tokenizer=spe \
    --no_lower_case \
    --spe_type=unigram \
    --spe_character_coverage=1.0 \
    --log

If you are starting from plain text instead of manifest files, pass --data_file=<path to text data, separated by commas> in place of --manifest.
Running this command generates two folders in the output directory:
- text_corpus
- tokenizer_spe_unigram_1024

During training, the path to the second folder, tokenizer_spe_unigram_1024, is required.
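Optionally, you can sanity-check the tokenizer before training by loading the generated SentencePiece model directly. This sketch assumes the folder contains a tokenizer.model file, which is the usual output of the NeMo tokenizer script:

import sentencepiece as spm

# Load the SentencePiece model produced by process_asr_text_tokenizer.py.
sp = spm.SentencePieceProcessor(model_file="tokenizer_spe_unigram_1024/tokenizer.model")

# Inspect how a Georgian sentence is split into subword tokens.
print(sp.encode("გამარჯობა, როგორ ხარ?", out_type=str))
print(sp.get_piece_size())  # expected to match the vocabulary size (1024)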
Training the model
The next step is model training. I trained the FastConformer hybrid transducer CTC BPE model. The config file is located in the following folder:
<NEMO_ROOT>/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe.yaml
Start training from the English model checkpoint stt_en_fastconformer_hybrid_large_pc.nemo, chosen because it was trained on a large dataset and delivers excellent performance. Add the checkpoint to the config file:
name: "FastConformer-Hybrid-Transducer-CTC-BPE"
init_from_nemo_model:
  model0:
    path: '<path_to_the_checkpoint>/stt_en_fastconformer_hybrid_large_pc.nemo'
    exclude: ['decoder','joint']
Train the model with the following command, experiment to find the configuration with the best performance, and then set the final parameters:
python <NEMO_ROOT>/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
    --config-path=<path to dir of configs> \
    --config-name=<name of config without .yaml> \
    model.train_ds.manifest_filepath=<path to train manifest> \
    model.validation_ds.manifest_filepath=<path to val/test manifest> \
    model.tokenizer.dir=<path to directory of tokenizer (not full path to the vocab file!)> \
    model.tokenizer.type=bpe
Combining data
The model was trained with various data combinations:
- MCV-Train: 76.28 hours of training data
- MCV-Development: 19.5 hours
- MCV-Test: 20.4 hours
- MCV-Other (unvalidated data)
- FLEURS-Train: 3.20 hours
- FLEURS-Development: 0.84 hours
- FLEURS-Test: 1.89 hours
Because the amount of training data is small relative to the development and test data, in some training runs I added the development data to the training data.
The combinations of data during the training include the following:
- MCV-Train
- MCV-Train/Dev
- MCV-Train/Dev/Other
- MCV-Train/Other
- MCV-Train/Dev + FLEURS-Train/Dev
- MCV-Train/Dev/Other + FLEURS-Train/Dev
Evaluating performance
CTC and RNN-T models trained on various MCV subsets show that incorporating additional data (MCV-Train/Dev/Other) improves the WER, with lower values indicating better performance. This highlights the models’ robustness when extended datasets are used.
CTC and RNN-T models trained on various Mozilla Common Voice (MCV) subsets demonstrate improved WER on the Google FLEURS dataset when additional data (MCV-Train/Dev/Other) is incorporated. Lower WER values indicate better performance, underscoring the models’ robustness with extended datasets.
The model was trained with approximately 3.20 hours of FLEURS training data, 0.84 hours of development data, and 1.89 hours of test data, yet still achieved commendable results.
Averaging checkpoints
The NeMo architecture enables you to average checkpoints saved during the training to improve the model’s performance, using the following command:
find . -path '*/checkpoints/*.nemo' | grep -v -- "-averaged.nemo" | xargs python <NEMO_ROOT>/scripts/checkpoint_averaging/checkpoint_averaging.py
Best parameters
Table 2 lists the parameters of the best-performing model, trained on the MCV-Train/Dev/Other + FLEURS-Train/Dev data combination.
| Parameter | Value |
| --- | --- |
| Epochs | 150 |
| Precision | 32 |
| Tokenizer | spe-unigram-bpe |
| Vocabulary size | 1024 |
| Punctuation | ?,. |
| Min learning rate | 2e-4 |
| Max learning rate | 6e-3 |
| Optimizer | Adam |
| Batch size | 32 |
| Accumulate grad batches | 4 |
| Number of GPUs | 8 |
Evaluating the model
Training the model on approximately 163 hours of data took 18 hours on a single node with eight GPUs.
The evaluations consider scenarios with and without punctuation to comprehensively assess the model’s performance.
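A minimal sketch of such an evaluation, assuming NeMo's word_error_rate helper and using placeholder strings in place of actual model transcriptions, could look like the following:

from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = ["გამარჯობა როგორ ხარ"]    # placeholder model outputs
references = ["გამარჯობა, როგორ ხარ?"]  # ground-truth references from the test manifest

# WER and CER with punctuation retained.
print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
print("CER:", word_error_rate(hypotheses=hypotheses, references=references, use_cer=True))

# WER without punctuation: strip the supported punctuation marks first.
remove_punct = lambda s: s.translate(str.maketrans("", "", "?.,"))
print("WER (no punctuation):", word_error_rate(
    hypotheses=[remove_punct(h) for h in hypotheses],
    references=[remove_punct(r) for r in references],
))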
Following these impressive results, I trained a FastConformer hybrid transducer CTC BPE streaming model for real-time transcription. This model features a look-behind of 5.6 seconds and a latency of 1.04 seconds. I initiated training from an English streaming model checkpoint, using the same parameters as for the previously described model. The next section compares the results of the two FastConformer models with those of Seamless and Whisper.
Comparing with Seamless from MetaAI
FastConformer and FastConformer Streaming with CTC outperformed Seamless and Whisper Large V3 across nearly all metrics (word error rate (WER), character error rate (CER), and punctuation error rates) on both the Mozilla Common Voice and Google FLEURS datasets. Seamless and Whisper do not support CTC-WER.
Conclusion
FastConformer stands out as an advanced ASR model for the Georgian language, achieving significantly lower WER and CER compared to MetaAI’s Seamless on the MCV dataset and Whisper large V3 on all datasets. The model’s robust architecture and effective data preprocessing drive its impressive performance, making it a reliable choice for real-time speech recognition in underrepresented languages such as Georgian.
FastConformer’s adaptability to various datasets and optimization for resource-constrained environments highlight its practical application across diverse ASR scenarios. Despite being trained with a relatively small amount of FLEURS data, FastConformer demonstrates commendable efficiency and robustness.
For those working on ASR projects for low-resource languages, FastConformer is a powerful tool to consider. Its exceptional performance in Georgian ASR suggests its potential for excellence in other languages as well.
Discover FastConformer’s capabilities and elevate your ASR solutions by integrating this cutting-edge model into your projects. Share your experiences and results in the comments to contribute to the advancement of ASR technology.