3. Quick prototyping with a small dataset
Data augmentation provided by TAO Toolkit and pretrained models from the NGC catalog can be used together for quick prototyping with as few as 100 images.
3.1 Challenges with data collection for training AI models
The old adage of “garbage in equals garbage out” holds true in the world of AI. You need large high-quality datasets when training models from scratch. There are several challenges with data collection:
- Collecting and labeling data is time-consuming, labor-intensive, and expensive
- Using small datasets can lead to poorly performing models, as they lack the variety needed to train a robust model
- In some cases, data may simply be unavailable or restricted (e.g., patient medical X-rays and scans)
Take for instance a case where detecting defects on a PCB assembly line is a critical task for ensuring quality and discovering errors in the assembly process. However, the rate at which defects occur in PCB assembly is so low that it can take months or years to collect enough images to train an accurate model. The challenge of collecting enough data spans many industries when trying to detect anomalies.
There are a few ways to overcome these challenges:
- Synthetic data generation: Synthetic data is annotated information that computer simulations or algorithms generate.
- Data augmentation: Augmenting your dataset adds more variability and randomness that enables model generalization, which improves accuracy on data that the model has never seen before.
Both these methods are significantly cheaper and faster than collecting more data. For this experiment, we’ll look at the data augmentation feature in the TAO Toolkit.
3.2 What is data augmentation?
Data augmentation takes an existing dataset and applies transformations in the spatial and color domains to create new images that are similar to the originals but different enough to add variability and help the model generalize. Much research has been done to determine the most effective types of augmentation techniques [1]. Common transformations include translation, rotation, and color shifting.
When a model trains on a small dataset, it begins to memorize the patterns in the data rather than learn the features needed to solve the problem. Increasing the size of the dataset by applying augmentation increases the complexity of the data and forces the model to generalize rather than memorize. This reduces overfitting on the training set and improves performance on images that it hasn’t seen before. Augmentation is especially useful in cases where the model may come across objects in variable lighting conditions, positions, and orientations.
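The spatial and color transformations described above can be illustrated with a minimal NumPy sketch. This is not how TAO Toolkit implements augmentation; the function names (`horizontal_flip`, `translate`, `shift_brightness`, `flip_box`) and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration. Images are assumed to be `uint8` arrays of shape height x width x channels.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror the image left to right (a spatial transformation)."""
    return image[:, ::-1, :]


def translate(image, dx, dy):
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with zeros."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    # A shift larger than the image leaves nothing visible.
    if abs(dx) >= w or abs(dy) >= h:
        return out
    # Source and destination windows that overlap after the shift.
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_x0, dst_x1 = max(0, dx), min(w, w + dx)
    dst_y0, dst_y1 = max(0, dy), min(h, h + dy)
    out[dst_y0:dst_y1, dst_x0:dst_x1] = image[src_y0:src_y1, src_x0:src_x1]
    return out


def shift_brightness(image, delta):
    """Add delta to every channel and clip to the 8-bit range (a color transformation)."""
    shifted = image.astype(np.int16) + delta
    return np.clip(shifted, 0, 255).astype(np.uint8)


def flip_box(box, image_width):
    """Update an (x_min, y_min, x_max, y_max) box after a horizontal flip."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)
```

For object detection, any spatial transformation applied to an image must also be applied to its bounding boxes, which is what `flip_box` sketches for the flip case. Color transformations leave the boxes unchanged.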
Applying augmentation to a dataset is done either offline or online. Offline augmentation is applied before training and will create new images in storage with the applied transformations. This enables control over the number of unique images that the model trains on and typically leads to the model converging in fewer epochs. Online augmentation dynamically applies randomized transformations to each image as it is used in training. This means that no extra images are stored and no extra disk space is required. This also enables the model to train on new images continuously as each applied transformation creates a unique image. As a model trains with online augmentation, it may take more epochs to converge because it is continuously seeing new images.
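The difference between the two modes can be sketched in a few lines of Python. This is a simplified illustration, not the TAO Toolkit implementation: `random_augment`, `offline_augment`, and `online_batches` are hypothetical names, the transform set (random flip plus brightness shift) is an arbitrary choice, and the offline variant keeps images in memory where a real pipeline would write them to storage.

```python
import random

import numpy as np


def random_augment(image, rng):
    """Apply a random flip and brightness shift (hypothetical transform set)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    delta = rng.randint(-30, 30)
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)


def offline_augment(images, copies_per_image, seed=0):
    """Build a fixed, enlarged dataset once, before training starts."""
    rng = random.Random(seed)
    augmented = list(images)  # keep the originals
    for image in images:
        for _ in range(copies_per_image):
            augmented.append(random_augment(image, rng))
    return augmented


def online_batches(images, epochs, seed=0):
    """Yield a freshly augmented version of each image every epoch; nothing is stored."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        for image in images:
            yield epoch, random_augment(image, rng)
```

With `offline_augment`, the number of unique images is fixed up front (original count times `copies_per_image + 1`). With `online_batches`, the dataset size on disk never changes, but the model is handed a differently transformed image on every pass.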
3.3 Applying data augmentation with the TAO Toolkit
The TAO Toolkit supports both online and offline data augmentation. You perform offline augmentation by configuring a spec file and using the command-line interface to generate the images. The configuration gives you control to customize spatial, color, and blurring augmentations. You can also customize online augmentation to specify the range of spatial and color augmentations to be applied while the model is training.
The recommended way to use augmentation in the TAO Toolkit is to first apply offline augmentation to increase the size of the dataset. Then, configure training to use online augmentation to further increase the complexity of the dataset. Combining both types of augmentation allows the model to see a large variety of images and leads to better model performance, as we show in the augmentation task results.
The following table shows the different augmentation techniques that are supported by TAO Toolkit.
Spatial | Offline | Online | Color | Offline | Online
---|---|---|---|---|---
Rotation | | | Hue Rotation | |
Flip | | | Saturation Shift | |
Translation | | | Contrast | |
Shear | | | Brightness | |
Zoom | | | Color Shift | |
Blur | | | | |
Table 2. Spatial and color augmentations supported by TAO Toolkit
3.4 Results
To show the benefits of data augmentation, take the example of a PCB defect dataset with only 100 images to train on. The task of the model is to detect six types of defects from an image of a PCB. The key performance indicator for this task is mean average precision (mAP), which gives a measure of accuracy on the bounding boxes that the model places around the defects when compared to the ground truth.
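To make the metric concrete, the sketch below computes intersection over union (IoU) between boxes and a simplified, non-interpolated average precision (AP) for one class; mAP is then the mean of the per-class AP values. The source does not state which mAP variant or IoU threshold was used, so the 0.5 threshold, the `(x_min, y_min, x_max, y_max)` box layout, and all function names here are assumptions for illustration. For brevity, the sketch treats all boxes as belonging to a single image.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one class.

    detections: list of (score, box); ground_truth: list of boxes.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(True)
        else:
            hits.append(False)
    # Average the precision at each rank where a true positive occurs.
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values (for example, one per defect type)."""
    return sum(per_class_ap) / len(per_class_ap)
```

A detection counts as correct only if it overlaps an unmatched ground-truth box by at least the threshold, so both misplaced boxes and duplicate detections lower the score.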
The task was carried out by first training a model on only 100 images without augmentation, and then applying offline augmentation to create datasets that are 10X and 20X the original size. Models were then trained on these offline-augmented datasets with and without online augmentation for comparison.
Trained on 100 images without any augmentation, the model achieved only 36% mAP on the test set. With both forms of augmentation applied, the model improved significantly to almost 79% mAP. This task shows that even with datasets as small as 100 images, augmentation is highly effective at improving the quality of the model.
The initial prototype model used only 100 images. More data was then added to achieve even higher accuracy, and the same tasks were carried out with 500 annotated images. The increase in data led to an mAP of 82% without any augmentation, and the mAP increased to over 95% with both forms of augmentation.
You can apply the augmentation provided by TAO Toolkit to a dataset of any size. It is a useful tool when rapidly prototyping a machine learning application with a small dataset, or when trying to further improve a model that already has high accuracy.
The task was conducted with the PCB Defect Dataset [2]. All code and steps to reproduce the results are provided in the TAO Tasks GitHub repo.
1 Shorten, C., Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J Big Data 6, 60 (2019). https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0
2 R. Ding, L. Dai, G. Li and H. Liu, "TDD-net: a tiny defect detection network for printed circuit boards," in CAAI Transactions on Intelligence Technology, vol. 4, no. 2, pp. 110-116, June 2019, doi: 10.1049/trit.2019.0019.