
Novel Transformer Model Achieves State-of-the-Art Benchmarks in 3D Medical Image Analysis

Jun 22, 2022

By Ali Hatamizadeh, Vishwesh Nath and Vanessa Braunstein



At the Computer Vision and Pattern Recognition Conference (CVPR), NVIDIA researchers are presenting over 35 papers. This includes work on Shifted WINdows UNEt TRansformers (Swin UNETR)—the first transformer-based pretraining framework tailored for self-supervised tasks in 3D medical image analysis. The research is the first step in creating pretrained, large-scale, and self-supervised 3D models for data annotation.

As a transformer-based approach for computer vision, Swin UNETR employs MONAI, an open-source PyTorch framework for deep learning in healthcare imaging, including radiology and pathology. Using this pretraining scheme, Swin UNETR has set new state-of-the-art benchmarks for various medical image segmentation tasks and consistently demonstrates its effectiveness even with a small amount of labeled data.

Swin UNETR model training

The Swin UNETR model was trained on an NVIDIA DGX-1 cluster using eight GPUs and the AdamW optimization algorithm. It was pretrained on 5,050 publicly available CT images from various body regions of healthy and unhealthy subjects selected to maintain a balanced dataset.
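For readers unfamiliar with AdamW, here is a minimal scalar sketch of a single update step with decoupled weight decay. The function name `adamw_step` is hypothetical, and the hyperparameter values are common defaults chosen for illustration. They are assumptions, not the settings used to train Swin UNETR.

```python
import math


def adamw_step(param, grad, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter.

    All default hyperparameters are illustrative assumptions.
    `step` is the 1-based iteration count used for bias correction.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding an L2 term to the gradient.
    param = param * (1.0 - lr * weight_decay)

    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero-initialized moment estimates.
new_param, new_m, new_v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(new_param)
```

With a zero gradient, the only change to the parameter is the weight-decay shrinkage, which is the property that distinguishes AdamW from Adam with L2 regularization.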

For self-supervised pretraining of the 3D Swin Transformer encoder, the researchers used a variety of pretext tasks. Randomly cropped tokens were augmented with different transforms such as rotation and cutout. These tokens were used for masked volume inpainting, rotation, and contrastive learning, for the encoder to learn a contextual representation of training data, without increasing the burden of data annotation.
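The way a single unlabeled crop can yield training targets for free can be sketched as follows. This is a simplified, pure-Python illustration, not the authors' pipeline. It assumes rotations in 90-degree steps about the z-axis and a single zeroed cuboid of half the crop size. The helper names (`make_volume`, `rotate90_z`, `cutout`, `make_pretext_sample`) are hypothetical, and the contrastive task is omitted.

```python
import random


def make_volume(size):
    """Toy cubic volume (nested lists indexed [z][y][x]) with nonzero voxels."""
    return [[[1.0 + (z * size + y) * size + x for x in range(size)]
             for y in range(size)]
            for z in range(size)]


def rotate90_z(volume):
    """Rotate every axial slice by 90 degrees counter-clockwise."""
    return [[list(row) for row in zip(*plane)][::-1] for plane in volume]


def cutout(volume, start, extent):
    """Return a copy of the volume with one cuboid region set to zero."""
    masked = [[row[:] for row in plane] for plane in volume]
    z0, y0, x0 = start
    dz, dy, dx = extent
    for z in range(z0, z0 + dz):
        for y in range(y0, y0 + dy):
            for x in range(x0, x0 + dx):
                masked[z][y][x] = 0.0
    return masked


def make_pretext_sample(volume, rng):
    """Build (network input, inpainting target, rotation label) from one crop."""
    size = len(volume)
    rotation_class = rng.randrange(4)  # 0, 90, 180, or 270 degrees (assumed)
    rotated = volume
    for _ in range(rotation_class):
        rotated = rotate90_z(rotated)
    extent = (size // 2,) * 3  # assumed cutout size
    start = tuple(rng.randrange(size - e + 1) for e in extent)
    masked = cutout(rotated, start, extent)
    # The encoder sees `masked`; it is trained to reconstruct `rotated`
    # and to predict `rotation_class`. No human annotation is involved.
    return masked, rotated, rotation_class


masked, target, label = make_pretext_sample(make_volume(4), random.Random(0))
print("rotation class:", label)
```

Both targets are derived from the image itself, which is why these tasks add no annotation burden.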

The technology behind Swin UNETR

Swin Transformers adopt a hierarchical Vision Transformer (ViT) design that computes self-attention locally within nonoverlapping windows. This unlocks the opportunity to create a medical-specific ImageNet for large companies and removes the bottleneck of needing a large quantity of high-quality annotated datasets for creating medical AI models.

Compared to CNN architectures, the ViT demonstrates exceptional capability in self-supervised learning of global and local representations from unlabeled data (the larger the dataset, the stronger the pretrained backbone). The user can fine-tune the pretrained model in downstream tasks (for example, segmentation, classification, and detection) with a very small amount of labeled data.

The Swin Transformer architecture computes self-attention in local windows and has shown better performance than ViT. In addition, the hierarchical nature of Swin Transformers makes them well suited for tasks requiring multiscale modeling.
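A rough count of query-key pairs shows why restricting attention to local windows matters for 3D inputs. The grid and window sizes below are illustrative assumptions, not values from this post, and the function names are hypothetical.

```python
def global_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    num_tokens = grid ** 3
    return num_tokens * num_tokens


def windowed_attention_pairs(grid, window):
    """Query-key pairs when attention is restricted to cubic windows."""
    if grid % window != 0:
        raise ValueError("grid must be divisible by window in this sketch")
    num_windows = (grid // window) ** 3
    tokens_per_window = window ** 3
    return num_windows * tokens_per_window ** 2


# Assumed example: a 48 x 48 x 48 token grid with 6 x 6 x 6 windows.
full = global_attention_pairs(48)
local = windowed_attention_pairs(48, 6)
print(full, local, full // local)
```

Global attention grows quadratically with the number of tokens, while windowed attention grows linearly for a fixed window size.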

Following the success of the pioneering UNETR model with a ViT-based encoder that directly uses 3D patch embeddings, Swin UNETR uses a 3D Swin Transformer encoder with a pyramid-like architecture.

In the encoder of the Swin UNETR, self-attention is computed in local windows because computing naive global self-attention is not feasible for high-resolution feature maps. To increase the receptive field beyond the local windows, window shifting is used so that regions belonging to different windows can interact.
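The effect of window shifting can be sketched by tracking which window a voxel falls into before and after a cyclic shift of the grid. This is a simplified illustration with assumed sizes (an 8 x 8 x 8 token grid, windows of 4, shift of 2) and a hypothetical helper `window_id`. The attention mask that real implementations apply to wrapped-around regions is omitted.

```python
def window_id(coord, grid, window, shift=0):
    """Index of the window containing `coord` after cyclically
    shifting the token grid by `shift` positions along each axis."""
    per_axis = grid // window
    z, y, x = (((c - shift) % grid) // window for c in coord)
    return (z * per_axis + y) * per_axis + x


GRID, WINDOW, SHIFT = 8, 4, 2  # assumed sizes for illustration

a, b = (3, 3, 3), (4, 4, 4)  # neighboring voxels on either side of a window border

# Regular partition: the two voxels land in different windows,
# so they cannot attend to each other.
print(window_id(a, GRID, WINDOW), window_id(b, GRID, WINDOW))

# Shifted partition: the same voxels now share a window.
print(window_id(a, GRID, WINDOW, SHIFT), window_id(b, GRID, WINDOW, SHIFT))
```

Alternating regular and shifted partitions across layers lets information cross window borders without ever computing global attention.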

The encoder of the Swin UNETR is connected to a residual UNet-like decoder at five different resolutions by skip connections. It can capture multiscale feature representations for dense prediction tasks, such as medical image segmentation.
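To make the multiscale idea concrete, the sketch below lists the feature-map sizes a pyramid-like encoder could hand to the decoder. It assumes that each level halves the spatial resolution and doubles the channel count, starting at half the input resolution with 48 channels for a 96-voxel crop. These numbers and the function name `encoder_pyramid` are assumptions for illustration and are not stated in this post.

```python
def encoder_pyramid(input_size, base_channels, levels):
    """Return (spatial size per axis, channels) for each encoder level.

    Assumes the first level is at half the input resolution and that
    every further level halves resolution and doubles channels.
    """
    shapes = []
    size, channels = input_size // 2, base_channels
    for _ in range(levels):
        shapes.append((size, channels))
        size, channels = size // 2, channels * 2
    return shapes


# Five resolutions, matching the five skip connections described above.
for size, channels in encoder_pyramid(input_size=96, base_channels=48, levels=5):
    print(f"{size} x {size} x {size} voxels, {channels} channels")
```

Each of these feature maps would feed one skip connection, so the decoder combines coarse context with fine spatial detail.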

Swin UNETR model performance

After fine-tuning on the Beyond the Cranial Vault (BTCV) Segmentation Challenge, which covers 13 abdominal organs in CT, and on the segmentation tasks from the Medical Segmentation Decathlon (MSD) dataset, the model achieved state-of-the-art accuracy on the public leaderboards.

BTCV

In the BTCV, Swin UNETR obtained an average Dice of 0.918, outperforming other top-ranked models.
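For reference, the Dice score measures the overlap between a predicted mask and a ground-truth mask, ranging from 0 (no overlap) to 1 (perfect match). A minimal sketch for flattened binary masks follows. The function name and the toy masks are illustrative and are not challenge data.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention used here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(round(dice_score(pred, target), 3))  # 0.667
```

A leaderboard average such as the one above is the mean of per-organ Dice scores like this one.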

There are improvements compared to prior state-of-the-art methods for smaller organs, such as the splenic and portal veins (3.6%), pancreas (1.6%), and adrenal glands (3.8%). Labeling small organs for segmentation is an excruciatingly difficult task for a radiologist. The improvement can be seen in the figure below.

MSD

In the MSD, Swin UNETR achieved state-of-the-art performance on the brain tumor, lung, pancreas, and colon tasks. The results are comparable for the heart, liver, hippocampus, prostate, hepatic vessel, and spleen tasks. Overall, Swin UNETR presented the best average Dice of 78.68% across all 10 tasks and achieved the top ranking on the MSD leaderboard.

Swin UNETR has shown better segmentation performance using significantly fewer training GPU hours compared to DiNTS, a powerful AutoML methodology for medical image segmentation. In addition, qualitative segmentation outputs for the task of hepatic vessel segmentation demonstrate the capability of Swin UNETR to better model long-range spatial dependencies.

Conclusion

The Swin UNETR architecture provides a much-needed breakthrough in medical imaging using transformers. Given the need in medical imaging to build accurate models quickly, the Swin UNETR architecture empowers data scientists to pretrain on a large corpus of unlabeled data. This reduces the cost and time associated with expert annotation by radiologists, pathologists, and other clinical teams. Here we show state-of-the-art (SOTA) segmentation performance, which is used for organ detection and automatic volume measurements.

To learn more:

Check out this work at the CVPR conference.
Read the study Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis.
Download the SwinUNETR code on GitHub.


About the Authors

About Ali Hatamizadeh
Ali Hatamizadeh is a research scientist in the Learning & Perception Research (LPR) team at NVIDIA. During his time at NVIDIA, he has spearheaded the efforts in developing vision transformer-based methodologies for various medical imaging tasks. He received his PhD from the University of California, Los Angeles Computer Science Department, where he was advised by Prof. Demetri Terzopoulos and worked on various visual perception tasks such as detection and segmentation using classical techniques and deep learning.


About Vishwesh Nath
Dr. Vishwesh Nath is an Applied Research Scientist at NVIDIA. He works with the Clara DLMED Research Team, and his research is focused on medical imaging with subdomains including AI-Assisted Annotation (DeepGrow 2D & 3D), Neural Architecture Search, and Federated Learning. Prior to NVIDIA, Dr. Nath earned his PhD and MS degrees in Computer Science from the Electrical Engineering & Computer Science department at Vanderbilt University and a Bachelor's in Electrical and Electronics Engineering from Manipal Institute of Technology.


About Vanessa Braunstein
Vanessa Braunstein leads healthcare and life science product marketing at NVIDIA for our Clara products in drug discovery, genomics, medical imaging, medical devices, NLP, and smart hospitals. Previously, she was in product development, business development and marketing for radiology, genomics, pharmaceutical, chemistry, and bioinformatics companies using AI. She studied molecular and cell biology, public health, and business at UC Berkeley and UCLA.
