2. Adapting to different camera types
Learn how the TAO Toolkit allows pretrained models to be adapted to new domains, environments, or sensors with ease.
2.1 Industry problem
All computer vision applications require an AI model to sense the world around it. Cameras are the most commonly used sensor for this purpose, enabling an AI model to take in visual inputs and complete tasks such as object classification, detection, and tracking.
When AI models that rely on cameras are deployed in the field, they must perform well under varying conditions caused by environmental and technological factors. The types of cameras and their locations can lead to image distortions, color shifting, and changes in brightness levels. Addressing these and other constraints by customizing a model to work in a specific environment is crucial for rapid field deployment.
For example, infrared (IR) or thermal cameras are extremely useful for capturing images in low-light environments because they don’t rely on the visible light spectrum. Although IR cameras work in the dark, their output lacks color data, is often low resolution, and doesn’t have clear contours between objects. These characteristics pose many challenges when trying to use an AI model trained on regular RGB images with an IR dataset. However, NVIDIA pretrained models help reduce data requirements and training time even across different camera types and environmental conditions.
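One common first step, before any retraining, is to reshape single-channel thermal frames into the three-channel layout an RGB-trained detector expects. The sketch below is a minimal, hedged example of such preprocessing (the function name, output size, and min-max normalization are illustrative assumptions, not part of the FLIR dataset tooling or the TAO Toolkit):

```python
import numpy as np
from PIL import Image

def ir_to_rgb_like(path, out_size=(960, 544)):
    """Convert a single-channel thermal frame into a 3-channel, 8-bit image
    that matches the input layout an RGB-trained detector expects.
    Illustrative sketch; values and names are assumptions."""
    ir = np.array(Image.open(path)).astype(np.float32)   # HxW, possibly 16-bit
    # Stretch the raw thermal range to 0-255 with a simple min-max normalization.
    ir = (ir - ir.min()) / max(float(ir.max() - ir.min()), 1e-6) * 255.0
    ir = ir.astype(np.uint8)
    # Replicate the single channel three times to mimic an RGB image.
    rgb_like = np.stack([ir, ir, ir], axis=-1)
    return Image.fromarray(rgb_like).resize(out_size)
```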
2.2 Start with a pretrained AI model
The PeopleNet pretrained model, which has been trained on more than a million images, can detect people in crowded environments, partially occluded people, and people at low resolutions. The model was trained on images from well-lit areas, so in its original form it performs poorly on images from a thermal IR camera. Using the TAO Toolkit, the model can be adapted to perform well on IR images.
To show the power of transfer learning across different camera types, this task adapts the NVIDIA pretrained PeopleNet model to work with thermal infrared images. It demonstrates how pretrained models require less data and achieve higher accuracy by training two models on differently sized subsets of an IR camera dataset.
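In the TAO Toolkit, this adaptation is configured through a training spec that points at the PeopleNet weights and the new IR dataset. As a rough illustration of the underlying transfer-learning mechanics only (a PyTorch sketch, not the TAO Toolkit or PeopleNet API; the detector, data loader, and hyperparameters are assumptions), the idea is to load a pretrained detector, swap its classification head for the new label set, and fine-tune on the smaller IR dataset:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Illustrative stand-in for a pretrained people detector; PeopleNet itself is a
# DetectNet_v2 model that is fine-tuned through the TAO Toolkit, not this API.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the classification head so it predicts the IR dataset's classes
# (here assumed to be background + person).
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Freeze the pretrained backbone so only the detection heads are updated;
# this is what lets a small IR dataset go a long way.
for param in model.backbone.parameters():
    param.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9, weight_decay=1e-4,
)

def fine_tune(model, ir_loader, epochs=10, device="cuda"):
    """Fine-tune on the IR dataset. `ir_loader` is a hypothetical DataLoader
    yielding (images, targets) batches in torchvision detection format."""
    model.to(device).train()
    for _ in range(epochs):
        for images, targets in ir_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            losses = model(images, targets)          # dict of detection losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The key design choice this sketch mirrors is reusing pretrained weights and updating only a small part of the network, which is what allows a few thousand IR images to match a much larger from-scratch dataset.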
2.3 Results
Training a model from scratch requires around 6,300 images to reach 77% mAP. However, when starting from the pretrained PeopleNet model, only 2,500 images are needed to get to more than 78% mAP. For this use case, you can achieve comparable accuracy with 60% less data when using the PeopleNet pretrained model. That means less time collecting and annotating the extra images and faster training on a smaller dataset.
| Training approach | Dataset size (number of images) | mAP |
| --- | --- | --- |
| Trained from scratch | 6,300 | 77% |
| Fine-tuned from PeopleNet | 2,500 | 78% |

Table 1. Results obtained when training a model from scratch on 6,300 images versus fine-tuning the PeopleNet pretrained model on only 2,500 images.
The PeopleNet pretrained model also achieves higher overall accuracy when trained on the entire dataset of 6,300 images, reaching 83% mAP, 6 points higher than the model trained without PeopleNet. This task shows how using a pretrained model can save on data labeling and training costs by reaching higher accuracy with a smaller dataset.
This task was conducted using the FLIR Thermal Dataset. The complete task implementation with a step-by-step guide is available in the TAO Tasks GitHub repo.