Handwritten character recognition is one of the quintessential deep learning (DL) problems. One of the oldest and most widely used benchmark datasets for machine learning (ML) tasks is the MNIST dataset, which consists of 70,000 handwritten digits and was released in the 1990s. To this day, it is one of the best-studied and best-understood ML problems.
In 1998, Yann LeCun and his team introduced LeNet-5, one of the earliest successful convolutional neural networks (CNNs), which achieved state-of-the-art accuracy on handwritten digit recognition. After that, automated character recognition became a standard tool in many applications. However, the use of DL for character recognition has been predominantly restricted to Western alphabets, which are particularly amenable to these approaches because they have relatively few unique characters. With the rise of computing power and large corpora of handwritten characters in other alphabets, DL techniques can increasingly be applied to non-Western writing systems as well.
The Bengali language is the official language of Bangladesh and one of the 22 scheduled languages of India. It is the native language of approximately 228 million people, and a second language for another 37 million, making it the fifth most widely spoken native language in the world.
Written Bengali is an abugida, a writing system in which individual characters (graphemes) are constructed by combining three components: a grapheme root, a vowel diacritic, and a consonant diacritic. This is a complex writing system, with roughly 13,000 possible grapheme variations. Bengali script is also cursive, which adds further complexity. Developing a machine learning algorithm for Bengali character recognition is orders of magnitude harder than for languages written with Western characters.
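The ~13,000 figure follows directly from multiplying the component counts. The counts below are taken from the Kaggle competition's data description (each diacritic count includes a "none" option); treat them as an assumption of this sketch:

```python
# Rough combinatorics behind the ~13,000 possible graphemes.
# Counts are from the competition's data description; each diacritic
# count includes a "no diacritic" option.
NUM_ROOTS = 168
NUM_VOWEL_DIACRITICS = 11
NUM_CONSONANT_DIACRITICS = 7

possible_graphemes = NUM_ROOTS * NUM_VOWEL_DIACRITICS * NUM_CONSONANT_DIACRITICS
print(possible_graphemes)  # 12936, i.e. ~13,000
```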
The Kaggle Bengali handwritten grapheme classification competition ran between December 2019 and March 2020. It attracted 2,623 participants from all over the world, in 2,059 teams. Participants submitted trained models that were then evaluated on an unseen test set. The training set consisted of over 200,000 grayscale scans of handwritten Bengali graphemes at a resolution of 137×236 pixels. It contained approximately 1,000 unique graphemes, leaving most possible component combinations unseen during training; many of those unseen combinations were present in the test set. The final solution had to finish inference within two hours while running on an NVIDIA P100 GPU.
NVIDIA team approaches
NVIDIA was represented at this competition by five members of the Kaggle Grandmasters of NVIDIA (KGMON) team, in four different teams. Here are short summaries of the solutions, with the key insights.
Single model with SE-ResNeXt50
Christof Henkel’s final solution was a single model with an SE-ResNeXt50 backbone and a custom head. The model was trained several times with different random seeds, to reduce noise and gain a bagging effect. The custom head predicted the individual grapheme components as well as each grapheme as a whole. The training schedule was simple yet effective: training on the original images for 200 epochs using a cyclic cosine annealing learning rate schedule. Using CutMix as the main augmentation ensured an appropriate degree of regularization.
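A cyclic cosine annealing schedule can be sketched in a few lines. This is a generic implementation of the schedule itself (step counts and rates are illustrative), not Christof's actual training code:

```python
import math

def cyclic_cosine_lr(step, cycle_len, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that restarts at lr_max every cycle_len steps."""
    t = step % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

# The rate starts each cycle at lr_max, decays along a half cosine
# toward lr_min, then restarts -- the "warm restart" pattern.
```

PyTorch users get the same behavior from `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.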
While the single model used was strong, it wasn’t good enough to earn one of the top spots without some extra magic. Christof realized that some consonant diacritics had special forms that were not present in the training dataset, although they were in the test set. He addressed this issue by further disassembling the consonant diacritics into subcomponents and adjusted his model accordingly. The adjusted model predicted unseen graphemes more precisely and secured a top spot on the private leaderboard. For more information, see 5th place solution.
Seven models on six architectures
Bojan Tunguz and Chris Deotte were part of Team ২১, which used an ensemble of seven different models trained on six different underlying architectures:
- EfficientNet-B4, -B6, and -B7
- DenseNet-201
- SENet-154
- SE-ResNeXt-101
They split the training data into a 90% training set and a 10% holdout validation set. They trained on several different image resolutions, and the final ensemble used models trained on 128×128, 224×224, 137×256, and 128×256 images.
One of the most important aspects of the modeling effort was the training schedule. They would start with a low learning rate of 0.0001, train until the validation metric reached a plateau, and then reduce the learning rate. They trained between 100 and 200 epochs, with batch sizes between 150 and 300. After the main cycle, they retrained for another 70 epochs. Depending on the machine, network, and batch size, one epoch took between 10 and 30 minutes to train.
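The "train until a plateau, then reduce" pattern can be sketched as a minimal scheduler. This is a generic illustration of the idea (the `factor` and `patience` values are assumptions), not the team's actual code:

```python
class PlateauLRScheduler:
    """Minimal plateau scheduler: multiply the learning rate by `factor`
    when the monitored validation loss stops improving for `patience`
    epochs. (A sketch of the idea, not the team's implementation.)"""

    def __init__(self, lr=1e-4, factor=0.5, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:          # improvement: reset the counter
            self.best, self.bad_epochs = val_loss, 0
        else:                             # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor    # reduce the learning rate
                self.bad_epochs = 0
        return self.lr
```

In practice PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same logic.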
Another key contribution was image augmentation: they used both CutMix and CoarseDropout. The single best network achieved 0.9980 on local validation and 0.9892 on the public leaderboard. However, to make the models generalize to the unknown images in the private test set, they also had to implement strong post-processing. The competition's recall metric is sensitive to the distribution of the various classes. To exploit this insight, they scaled the predictions so that the probabilities of less frequent classes received proportionally higher weight. This final step was crucial in avoiding the leaderboard shakeup and maintaining placement in the Gold Zone. For more information about Bojan and Chris’s team solution, see 14th Place Solution and CAM CutMix Augmentation.
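Up-weighting rare classes can be sketched as follows; this is an illustrative scaling (the exact formula and exponent the team used may differ):

```python
import numpy as np

def rebalance(probs, class_freq, power=1.0):
    """Scale predicted probabilities so rare classes get proportionally
    more weight, which tends to help macro-averaged recall.
    (Illustrative sketch; the team's exact scaling may differ.)"""
    weights = (1.0 / np.asarray(class_freq)) ** power
    scaled = probs * weights                 # up-weight rare classes
    return scaled / scaled.sum(axis=1, keepdims=True)

probs = np.array([[0.6, 0.4]])               # model slightly prefers class 0
freq = [0.9, 0.1]                            # but class 0 is 9x more frequent
print(rebalance(probs, freq).argmax(1))      # [1] -- prediction flips to the rare class
```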
Before teaming up, Jean-Francois Puget worked on 64×64 images, and a bit on 128×128 images, with SE-ResNeXt50. Small image sizes lead to high epoch throughput and allow for many experiments. He didn’t work on image augmentation, focusing instead on learning PyTorch, understanding losses, and image preprocessing. Marios Michailidis, on the other hand, had an EfficientNet-B4 model with a good augmentation pipeline. He used Cutout, GridMask, and MixUp through the Albumentations library, but found that CutMix wasn’t effective. He also used an OHEM (online hard example mining) loss and a cosine learning rate scheduler.
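The core idea of an OHEM loss is to average the loss over only the hardest examples in each batch. A minimal NumPy sketch of the technique (the `keep_frac` value is an assumption, and this is not Marios's actual implementation):

```python
import numpy as np

def ohem_cross_entropy(probs, targets, keep_frac=0.7):
    """Online hard example mining: average cross-entropy over only the
    hardest `keep_frac` fraction of the batch, so easy, already-learned
    samples stop dominating the gradient. (Generic sketch.)"""
    n = len(targets)
    losses = -np.log(probs[np.arange(n), targets] + 1e-12)  # per-sample CE
    k = max(1, int(keep_frac * n))
    hardest = np.sort(losses)[::-1][:k]   # keep the k largest losses
    return hardest.mean()
```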
After teaming up, they explored ways to add diversity, and ways to combine as many models as possible in the scoring kernel. They used an average of 14 EfficientNet-B4 model predictions in the end. For each model, they averaged weights of several checkpoints. Models differed by the following:
- Input data: original size, or crop-resized with the same aspect ratio
- Training data: a single fold or the full training set
- Loss: weighted loss across R, V, and C (grapheme root, vowel diacritic, and consonant diacritic), or weighted loss per class within each of R, V, and C
- OHEM or unmodified cross-entropy loss
- Use of external data or not
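Averaging the weights of several checkpoints of the same model is a simple operation. The sketch below uses plain dicts of lists in place of real tensors to show the idea; it is not the team's code:

```python
def average_checkpoints(state_dicts):
    """Average parameters across several checkpoints of the same model,
    a common way to stabilize a single model's weights before ensembling.
    (Sketch using dicts of plain lists instead of torch tensors.)"""
    n = len(state_dicts)
    return {
        key: [sum(vals) / n for vals in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

With PyTorch models the same elementwise mean is taken over each tensor in the checkpoints' `state_dict`s.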
For more information about this solution, see Solution 38, some lessons learned.
Insights from other top solutions
As is usually the case with Kaggle competitions, the final solutions from the successful teams demonstrated a variety of insights and ingenuity. Many teams successfully applied the ArcFace loss, which has seen an increase in popularity in recent years. The second-place team used FMix, a variant of mixed sample data augmentation, the class of augmentations that also includes CutMix and MixUp.
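ArcFace works by adding an angular margin to each sample's target class before the softmax, which forces tighter, better-separated class clusters. A NumPy sketch of the logit transform only (shapes and hyperparameters are illustrative):

```python
import numpy as np

def arcface_logits(embeddings, class_centers, targets, margin=0.5, scale=30.0):
    """ArcFace logit transform: cosine similarity between L2-normalized
    embeddings and class centers, with an additive angular margin applied
    to each sample's target class. (Sketch; hyperparameters illustrative.)"""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    cos = e @ c.T                                  # cosine similarity logits
    theta = np.arccos(np.clip(cos, -1.0, 1.0))     # angles in radians
    rows = np.arange(len(targets))
    theta[rows, targets] += margin                 # penalize the target angle
    return scale * np.cos(theta)                   # feed to softmax cross-entropy
```

The margin lowers the target logit relative to plain cosine similarity, so the model must separate classes by at least that angle to score well.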
Perhaps the most fascinating work was done by the winning team, which used CycleGAN-based zero-shot learning. In this approach, they used GANs to create synthetic representations of all possible combinations of grapheme components. This was an involved approach, but it proved to generalize well to unseen images.
Grow your data science skills by competing in Kaggle competitions. If you are specifically interested in character recognition, consider the ongoing Digit Recognizer competition and the finished Kannada MNIST playground competition.