Computer Vision / Video Analytics

Using Multi-Scale Attention for Semantic Segmentation

Jun 12, 2020

By Andrew Tao and Karan Sapra

Discuss (1)

AI-Generated Summary

Dislike

NVIDIA Research has developed a new method for semantic segmentation that achieves state-of-the-art results in Cityscapes and Mapillary Vistas benchmarks, with 85.4 IOU and 61.1 IOU respectively.
The method addresses common failure modes in semantic segmentation, such as fine detail and class confusion, by using a hierarchical multi-scale attention approach that learns to combine multi-scale predictions.
The approach also leverages auto-labelling to generate richer labels for coarse data, improving generalization and achieving a 0.9 IOU improvement on Cityscapes when combined with fine data.

AI-generated content may summarize information incompletely. Verify important information. Learn more

There’s an important technology that is commonly used in autonomous driving, medical imaging, and even Zoom virtual backgrounds: semantic segmentation. That’s the process of labelling pixels in an image as belonging to one of N classes (N being any number of classes), where the classes can be things like cars, roads, people, or trees. In the case of medical images, classes correspond to different organs or anatomical structures.

NVIDIA Research is working on semantic segmentation because it is a broadly applicable technology. We also believe that the techniques discovered to improve semantic segmentation may also help to improve many other dense prediction tasks, such as optical flow prediction (predicting the motion of objects), image super-resolution, and so on.

We have developed a new method for semantic segmentation that achieves record-setting state-of-the-art results in two common benchmarks: Cityscapes and Mapillary Vistas, as seen in the following tables. IOU is intersection over union, a metric that describes the accuracy of semantic prediction.

In Cityscapes, this method achieves 85.4 IOU on the test set, a sizable improvement over other entries, given how close each of those scores are to each other.

Entry	IOU
Ours	85.4
Panoptic DeepLab	85.2
HRNet-OCR SegFix	84.5
iFlytek-CV	84.4
MoKu Lab	84.3

Table 1. Results on Cityscapes test set.

With Mapillary, it achieves 61.1 IOU on the validation set with a single model, as compared to the next best result that used an ensemble to achieve 58.7.

Entry	IOU
Ours	61.1
Panoptic Deeplab (ensemble)	58.7
DeepLabV3+	52.0
Seamless	50.4

Table 2. Results on Mapillary Vistas semantic segmentation validation set.

Prediction results

Figure 1 (video). Qualitative results on Cityscapes ‘stuttgart demo 1’, a held-out dataset.

Research journey

To develop this new method, we considered what specific areas of an image needed improvement. Figure 2 shows two of the biggest failure modes of current semantic segmentation models: errors with fine detail and class confusion.

In this example, two issues are present: fine detail and class confusion.

The fine details of the posts in the first image are best resolved in the 2x scale prediction, but poorly resolved at 0.5x scale.
The coarse prediction of the road compared to median segmentation is much better resolved at 0.5x scale than 2x scale, where there is class confusion.

Our solution performs much better on both issues, with the class confusion all but gone in the road and much smoother and consistent prediction of the fine detail.

After identifying these failure modes, the team experimented with many different strategies, including different network trunks (for example, WiderResnet-38, EfficientNet-B4, Xception-71), as well as different segmentation decoders (for example, DeeperLab). We decided to adopt HRNet as the network backbone and RMI as the primary loss function.

HRNet has been demonstrated to be well-suited to computer vision tasks as it maintains a 2x higher resolution representation than the previous network WiderResnet38. The RMI loss provides a way to have a structural loss without having to resort to something like a conditional random field. Both HRNet and RMI loss were helpful to address both fine detail and class confusion.

To further address the primary failure modes, we innovated on two approaches: multi-scale attention and auto-labelling.

Multi-scale attention

To achieve the best results, it is common practice in computer vision models to use multi-scale inference. Multiple image scales are run through the network and the results are combined with average pooling.

Using average pooling as a combination strategy treats all scales as equally important. However, fine detail is usually best predicted at higher scales and large objects are better predicted at lower scales, where the receptive field of the network is better able to understand the scene.

Learning how to combine multi-scale predictions at a pixel level can help to address this issue. There is prior work on this strategy, with Chen’s approach in Attention to Scale being the closest. In Chen’s method, attention is learned for all scales simultaneously. We refer to this as the explicit approach, shown in Figure 3.

Motivated by Chen’s approach, we proposed a multi-scale attention model that also learns to predict a dense mask to combine multi-scale predictions together. However, in this method, we learn a relative attention mask to attend between one scale and the next higher scale, as shown in Figure 4. We refer to this as the hierarchical approach.

The primary benefits to this approach are as follows:

The theoretical training cost is reduced over Chen’s method by ~4x.
While training happens with only pairs of scales, inference is flexible and can be done with any number of scales.

	Train scales	Eval scales	Mapillary val IOU	Training Cost	Minibatch time (sec)
Baseline single scale	1.0	1.0	47.7	1.00x	0.8
Baseline avgpool	1.0	0.5,1.0,2.0	49.4	1.00x	0.8
Explicit	0.5,1.0,2.0	0.5,1.0,2.0	51.4	5.25x	3.1
Hierarchical (Ours)	0.5,1.0	0.5,1.0,2.0	51.6	1.25x	1.2
Hierarchical (Ours)	0.5,1.0	0.25,0.5,1.0,2.0	52.2	1.25x	1.2

Table 3. Comparison of the hierarchical multi-scale attention method vs. other approaches on the Mapillary validation set. The network architecture is DeepLab V3+ with a ResNet-50 trunk. Eval scales: scales used for multi-scale evaluation. FLOPS: the relative amount of flops consumed by the network for training. This method achieves the best validation score, but with only a moderate cost as compared to the explicit approach.

Some examples of our method, along with the learned attention mask, are shown in Figure 5. For the fine posts in the image on the left, there is little attention to the 0.5x prediction, but a very strong attention to the 2.0x scale prediction. Conversely, for the very large road/divider region in the image on the right, the attention mechanism learns to most leverage the lower scale (0.5x) and much less of the erroneous 2.0x prediction.

Auto-labelling

A common approach to improving semantic segmentation results with Cityscapes is to leverage the large set of coarse data. This data is roughly 7x as large as the baseline fine data. Previous SOTA approaches to Cityscapes used coarse labels as-is and either use the coarse data for pretraining the network or mix it in with the fine data.

However, the coarse labels present a challenge because they are noisy and imprecise. The ground truth coarse labels are shown in Figure 6 as ‘Original coarse label’.

Inspired by recent work, we pursued auto-labelling as a means to generate much richer labels to fill in the labelling gaps in the ground truth coarse labels. Our generated auto-labels show much finer detail than the baseline coarse labels as seen in Figure 6. We believe that this helps generalization by filling in the gaps in the data distribution for long-tail classes.

A naive approach to using auto-labelling, such as using the multi-class probabilities from a teacher network to guide the student, would be very costly in disk space. Generating labels for the 20,000 coarse images, which are all 1920×1080 in resolution across 19 classes would cost roughly 2 TB of storage. The biggest impact of such a large footprint would be reduced training performance.

We used a hard thresholding approach instead of a soft one to vastly reduce the generated label footprint from 2 TB to 600 MB. In this approach, teacher predictions with probability > 0.5 are valid, and predictions with lower probability are treated as an ‘ignore’ class. Table 4 shows the benefit of adding the coarse data to the fine data and training a new student with the fused dataset.

Dataset	Cityscales Val IOU	Benefit
Fine + ground truth coarse labels	85.4	–
Fine + auto-generated coarse labels	86.3	0.9

Table 4. The baseline method shown here uses HRNet-OCR as the trunk and our multi-scale attention method. We compare two regimes: training with ground truth fine + ground truth coarse labels to ground truth fine + auto-coarse labels (our method). The regime including the auto-coarse labels improves on the baseline by 0.9 IOU.

Final details

This model was trained using the PyTorch framework with automated mixed precision training with fp16 Tensor Cores across four DGX nodes.

For more information, see the paper, Hierarchical Multi-Scale Attention for Semantic Segmentation, and the code in the NVIDIA/semantic-segmentation GitHub repo, where you can run pretrained models or train your own.

To learn more about projects like this, see Research at NVIDIA.

Discuss (1)

About the Authors

About Andrew Tao
Andrew Tao is a distinguished engineer and manager of the Computer Vision team in the Applied Deep Learning Research group at NVIDIA. He received his masters degree in Electrical Engineering from Stanford University in 1992 with an emphasis on Computer Architecture. At NVIDIA, he has worked as a CPU hardware engineer, GPU hardware engineer and architect, and director of applied architecture. Previously, he led a number of computer vision teams in the automotive sector.

View all posts by Andrew Tao

About Karan Sapra
Karan Sapra is a senior research scientist on the Applied Deep Learning Research team at NVIDIA. He graduated with his Ph.D. in Computer Engineering from Clemson University. His research interests include deep learning, graph theory, and genomic networks. He has also previously worked on research in the field P2P networks, cloud computing, and high-performance computing.

View all posts by Karan Sapra