Addressing Medical Imaging Limitations with Synthetic Data Generation

Synthetic data in medical imaging offers numerous benefits, including the ability to augment datasets with diverse and realistic images where real data is limited. This reduces the costs and labor associated with annotating real images. Synthetic data also provides an ethical alternative to using sensitive patient data, which helps with education and training without compromising patient privacy.

This post introduces MAISI, an NVIDIA AI Foundation model for 3D computed tomography (CT) image generation. The overarching goal of MAISI is to revolutionize the field of medical imaging by providing a reliable and efficient way to generate high-quality synthetic images that can be used for various research and clinical applications. By overcoming the challenges of data scarcity and privacy concerns, MAISI aims to enhance the accessibility and usability of medical imaging data.

The model can generate high-resolution synthetic CT images and corresponding segmentation masks with up to 127 anatomical classes (including bones, organs, and tumors), while reaching landmark voxel dimensions of 512 × 512 × 512 and a spacing of 1.0 × 1.0 × 1.0 mm³. A key application is data augmentation: generating realistic medical imaging data to supplement datasets that are limited by privacy concerns or by the rarity of a condition.


The DLMED research team at NVIDIA focused on high-resolution, detailed context in 3D medical image generation. This approach not only enriches the dataset but also enhances the performance of other machine learning models in the field of medical imaging. Another major application is saving annotation work. Generating (image, label) pairs based on user-defined classes simplifies the process of creating annotated synthetic medical images, providing a cost-effective alternative to the labor-intensive task of collecting and annotating real medical data.

Furthermore, the MAISI model also addresses the issue of ethical data use. It provides a responsible alternative to using sensitive patient data, as the images generated do not correspond to real individuals. This capability is invaluable for generating a variety of medical images for educational purposes, helping trainees and medical students make diagnoses without having to access confidential patient records.

Foundation compression network

To generate high-resolution 3D images, the research team trained a foundation compression model that is designed to efficiently compress CT and magnetic resonance imaging (MRI) data into a condensed feature space. This variational autoencoder (VAE) model accepts CT or MRI images as inputs and produces a feature representation output. The output serves as the foundational input for the subsequent latent diffusion model. The training regimen for this model encompassed a vast collection of CT and MRI images from various anatomical regions and featuring diverse voxel spacings. 

This extensive training has endowed the model with robust adaptability, enabling application to diverse datasets without the need for additional fine-tuning. In parallel, a decoder model was meticulously trained to accurately reconstruct high-resolution images from the generated feature sets.
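To see why a compression stage matters at this resolution, the arithmetic can be sketched as follows. The volume size comes from this post; the downsampling factor and the number of latent channels are illustrative assumptions only, since the post does not state them.

```python
def voxel_count(shape):
    """Number of voxels in a 3D volume with the given (x, y, z) shape."""
    x, y, z = shape
    return x * y * z


def latent_shape(shape, downsample=4):
    """Shape after shrinking every axis by `downsample` (hypothetical factor)."""
    return tuple(side // downsample for side in shape)


# Resolution quoted in this post: 512 x 512 x 512 voxels.
image_shape = (512, 512, 512)
image_values = voxel_count(image_shape)

# Assumed for illustration only: 4x downsampling per axis, 4 latent channels.
latent_channels = 4
compressed_shape = latent_shape(image_shape, downsample=4)
latent_values = latent_channels * voxel_count(compressed_shape)

print(f"image values:  {image_values:,}")
print(f"latent values: {latent_values:,}")
print(f"reduction:     {image_values / latent_values:.0f}x")
```

Under these assumed settings, the diffusion model would operate on far fewer values than the full-resolution volume, which is the practical reason for running diffusion in a latent space.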

Foundation diffusion network

Latent diffusion models (LDMs) have emerged as a powerful tool within generative machine learning, particularly for synthesizing 3D medical images. These models function by iteratively removing noise from a random distribution within a latent space. This process effectively enables the LDM to learn the underlying data distribution of the training data and then generate novel, high-fidelity samples. 
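The iterative denoising idea can be illustrated with a toy example. This is a minimal sketch, not the MAISI sampler: the noise schedule, the deterministic DDIM-style update, and the `oracle_noise` function (a hypothetical stand-in for the trained denoising network that simply knows the clean latent) are all assumptions made for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean latent with Gaussian noise."""
    return [
        math.sqrt(alpha_bar) * value + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for value in x0
    ]


def oracle_noise(xt, x0, alpha_bar):
    """Hypothetical stand-in for the denoising network: it knows x0."""
    return [
        (x - math.sqrt(alpha_bar) * value) / math.sqrt(1.0 - alpha_bar)
        for x, value in zip(xt, x0)
    ]


def ddim_step(xt, eps, ab_t, ab_prev):
    """One deterministic reverse step (DDIM-style, no added noise)."""
    x0_hat = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for x, e in zip(xt, eps)]
    return [
        math.sqrt(ab_prev) * value + math.sqrt(1.0 - ab_prev) * e
        for value, e in zip(x0_hat, eps)
    ]


def denoise(x_start, x0, alpha_bars):
    """Walk from the noisiest step back to step 0, removing noise each time."""
    x = x_start
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = oracle_noise(x, x0, alpha_bars[t])
        x = ddim_step(x, eps, alpha_bars[t], ab_prev)
    return x


rng = random.Random(0)
alpha_bars = make_alpha_bars(50)
clean_latent = [0.5, -1.0, 2.0]
noisy_latent = add_noise(clean_latent, alpha_bars[-1], rng)
recovered = denoise(noisy_latent, clean_latent, alpha_bars)
print([round(value, 6) for value in recovered])
```

In a real LDM, the loop starts from pure noise and the noise estimate comes from a trained network, so the result is a new sample from the learned distribution rather than a recovered input.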

In the domain of 3D medical imaging, LDMs hold immense promise for generating anatomically accurate and diverse images. By learning the data distribution, the model can produce synthetic images that reflect real-world variations.

Our LDM was trained using large-scale, high-resolution CT datasets. We also incorporated conditionings based on body regions as an extra feature embedding. These regions encompass the head, chest, abdomen, and lower body. At the inference stage, users can specify the body regions for which they wish to generate CT images. Two concrete examples of generated CT images are shown in Figure 1.
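One simple way to picture a body-region condition is as a multi-hot vector with one slot per region. The post does not describe how the embedding is actually encoded, so the encoding and the `encode_regions` helper below are hypothetical.

```python
# The four regions named in this post, in a fixed (assumed) order.
BODY_REGIONS = ("head", "chest", "abdomen", "lower body")


def encode_regions(requested):
    """Return a multi-hot vector with one slot per body region."""
    unknown = set(requested) - set(BODY_REGIONS)
    if unknown:
        raise ValueError(f"unknown body regions: {sorted(unknown)}")
    return [1.0 if region in requested else 0.0 for region in BODY_REGIONS]


print(encode_regions(["chest", "abdomen"]))  # [0.0, 1.0, 1.0, 0.0]
```

A vector like this could then be passed to the diffusion model as an extra feature alongside the noisy latent.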

Front and lateral views of two generated CT images showing head, chest, and abdomen regions.
Figure 1. Examples of generated images with different region inputs

ControlNet to support additional conditioning

ControlNet is a framework that supports various spatial contexts as additional conditioning for diffusion models like Stable Diffusion. It was introduced in the paper, Adding Conditional Control to Text-to-Image Diffusion Models. With ControlNet, users have more control over the generation process. The output can be customized with different spatial contexts such as depth maps, segmentation maps, scribbles, key points, and more. 

Specifically, the research team leveraged ControlNet to treat the organ segmentation maps, including 127 anatomic structures, as the extra condition of the foundation diffusion model to facilitate the CT image generation. Figure 2 shows a typical generated CT image and its corresponding segmentation condition.

A side-by-side animated comparison of a typical generated CT image and its corresponding segmentation condition.
Figure 2. An example of a typical generated CT image and its corresponding segmentation condition

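This conditioning is achieved by keeping a locked copy of the pretrained foundation diffusion model and training a separate copy on the segmentation condition, with “zero-convolution” layers (convolution layers whose weights and biases start at zero) connecting the two. The zero-convolution layers enable the model to preserve the semantics already learned by the pretrained foundation diffusion model while enabling the trainable copy to learn the specific spatial conditioning required for the task.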
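The effect of zero initialization can be shown with a toy example. In the ControlNet design, a frozen (locked) copy of the pretrained block runs alongside a trainable copy, and the trainable branch is attached through layers whose weights and biases start at zero. The sketch below replaces real convolutions with an element-wise `weight * v + bias` stand-in; all function names and numbers are hypothetical.

```python
def linear(values, weight, bias):
    """Element-wise stand-in for a 1x1 convolution: weight * v + bias."""
    return [weight * v + bias for v in values]


def locked_block(x):
    """Frozen pretrained block (toy stand-in for a diffusion network block)."""
    return linear(x, weight=2.0, bias=0.5)


def controlled_block(x, condition, zero_in, zero_out, trainable=(2.0, 0.5)):
    """Locked output plus the trainable copy's output, gated by zero convolutions."""
    # zero_in and zero_out are (weight, bias) pairs that start at (0.0, 0.0).
    injected = [a + b for a, b in zip(x, linear(condition, *zero_in))]
    branch = linear(injected, *trainable)
    correction = linear(branch, *zero_out)
    return [a + b for a, b in zip(locked_block(x), correction)]


x = [0.1, -0.3, 0.7]
mask = [1.0, 0.0, 2.0]  # toy segmentation condition

at_init = controlled_block(x, mask, zero_in=(0.0, 0.0), zero_out=(0.0, 0.0))
after_training = controlled_block(x, mask, zero_in=(0.3, 0.0), zero_out=(0.1, 0.0))

print(at_init == locked_block(x))
print(after_training == locked_block(x))
```

Before any training, the combined block reproduces the locked block exactly, so the pretrained behavior is untouched. Once the zero-initialized weights move away from zero, the condition starts to influence the output.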

Performance evaluation

Our team conducted a comprehensive evaluation of the foundation diffusion model and the ControlNet using multiple datasets. This ensures broad coverage of many different body regions.

Image quality

First, we evaluated the quality of the images generated by our model by comparing them with images produced by baseline methods, using the released model weights of those methods. We generated chest CT images and compared them with the real chest CT datasets listed in Table 1.

Our method demonstrated superior performance over previous methods according to the Fréchet Inception Distance (FID) scores. In addition, our generated images are much closer to real images in appearance.

FID (Average) ↓ | | MSD Task 06* | LIDC-IDRI | TCIA
Real | MSD Task 06 | - | 3.987 | 1.858
Table 1. Fréchet Inception Distance scores of the MAISI model and the baseline method using its released checkpoint with multiple public datasets as the references
*Dataset used for model training
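For readers unfamiliar with the metric, FID fits a Gaussian to feature vectors extracted from real images and another to features from generated images, then measures the Fréchet distance between the two Gaussians. Lower values mean the two feature distributions are closer. The full metric uses mean vectors and covariance matrices; the sketch below is only the one-dimensional special case on hypothetical scalar features, and it is not the evaluation code used for the tables.

```python
import statistics


def frechet_distance_1d(real_features, generated_features):
    """Squared Fréchet distance between 1D Gaussians fitted to two samples.

    General form: |mu_r - mu_g|^2 + Tr(S_r + S_g - 2 * sqrt(S_r * S_g)).
    With scalar features this reduces to (mu_r - mu_g)^2 + (sd_r - sd_g)^2.
    """
    mu_r = statistics.fmean(real_features)
    mu_g = statistics.fmean(generated_features)
    sd_r = statistics.pstdev(real_features)
    sd_g = statistics.pstdev(generated_features)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


real = [1.0, 2.0, 3.0]       # hypothetical feature values
shifted = [2.0, 3.0, 4.0]    # same spread, mean moved by 1

print(frechet_distance_1d(real, real))
print(frechet_distance_1d(real, shifted))
```

Identical samples give a distance of zero, and shifting the mean while keeping the spread increases the distance by the squared shift.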

Subsequently, we retrained several state-of-the-art diffusion model-based methods using our datasets. The results in Tables 2 and 3 show that our method consistently outperformed the previous methods on both our dataset and an unseen dataset (autoPET 2023).

Method | FID (XY Plane) ↓ | FID (YZ Plane) ↓ | FID (ZX Plane) ↓ | FID (Average) ↓
Table 2. Comparison of Fréchet Inception Distance scores between our foundation model and retrained baseline methods using our dataset as the reference

Method | FID (XY Plane) ↓ | FID (YZ Plane) ↓ | FID (ZX Plane) ↓ | FID (Average) ↓
Table 3. Comparison of Fréchet Inception Distance scores between our foundation model and retrained baseline methods using the unseen public dataset autoPET 2023 as the reference

Figure 3 shows that the images generated by our method exhibit significantly enhanced details and more accurate global anatomical structures.

Four rows of generated images illustrating that our method produces significantly enhanced details and more accurate global anatomical structures.
Figure 3. Qualitative comparison of generated images between baseline methods (retrained using our large-scale dataset) and our method

Downstream tasks

One of the most important applications of a generative model is synthesizing new data for data augmentation in model training. We can therefore evaluate the quality of generated images by assessing the impact of including synthetic data in training. We adopted the Auto3DSeg pipeline, an automatic pipeline for developing medical image segmentation solutions in MONAI, and trained each segmentation model from scratch with five-fold cross-validation to reduce randomness.

There are two sets of experiments: 

  1. Real: The normal model training is conducted on real data. 
  2. Real + Synthetic: Real and synthetic data are combined in equal proportions during training to show the effect of synthetic data for data augmentation. 
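The two setups can be sketched as a data-splitting routine. This is not the Auto3DSeg API: the helper names and case IDs are hypothetical, and keeping the validation folds real-only is an assumption rather than something stated in this post.

```python
import random


def five_fold_splits(case_ids, num_folds=5, seed=0):
    """Yield (train_ids, validation_ids) for each cross-validation fold."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        train = [c for i, fold in enumerate(folds) if i != held_out for c in fold]
        yield train, validation


def mix_equal(real_train, synthetic_pool, seed=0):
    """Add as many synthetic cases as there are real training cases (1:1 mix)."""
    if len(synthetic_pool) < len(real_train):
        raise ValueError("not enough synthetic cases for a 1:1 mix")
    picked = random.Random(seed).sample(synthetic_pool, len(real_train))
    return real_train + picked


real_cases = [f"real_{i:03d}" for i in range(20)]
synthetic_cases = [f"synthetic_{i:03d}" for i in range(40)]

for fold, (train, validation) in enumerate(five_fold_splits(real_cases)):
    mixed = mix_equal(train, synthetic_cases, seed=fold)
    print(fold, len(train), len(mixed), len(validation))
```

Here `train` corresponds to the Real experiment and `mixed` to the Real + Synthetic experiment for the same fold, so both are scored against the same held-out cases.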

As shown in Table 4, adding synthetic data improved performance on the test set for all five tumor types, raising the Dice score by about 2.5 to 4.5 percentage points. These results indicate better generalizability of models trained with synthetic data.

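Experiment | Dataset | Tumor Type | Dice Score | Improvement
Real | MSD Task 06 | Lung Tumor | 0.581 | -
Real + Synthetic | MSD Task 06 | Lung Tumor | 0.625 | 4.5%
Real | MSD Task 10 | Colon Tumor | 0.449 | -
Real + Synthetic | MSD Task 10 | Colon Tumor | 0.490 | 4.1%
Real | In-House Bone Lesion | Bone Lesion | 0.504 | -
Real + Synthetic | In-House Bone Lesion | Bone Lesion | 0.534 | 3.0%
Real | MSD Task 03 | Hepatic Tumor | 0.662 | -
Real + Synthetic | MSD Task 03 | Hepatic Tumor | 0.687 | 2.5%
Real | MSD Task 07 | Pancreatic Tumor | 0.433 | -
Real + Synthetic | MSD Task 07 | Pancreatic Tumor | 0.473 | 4.0%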
Table 4. Improved average Dice Score for Auto3DSeg compared to the baseline performance of various models on different tumor types
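As a reminder of what the metric measures, the Dice score is twice the overlap between a predicted mask and a reference mask, divided by the total size of both. The sketch below computes it on toy flat binary masks. It also reads the Improvement column as an absolute difference in Dice score expressed in percentage points, which is an interpretation that matches the colon, bone, hepatic, and pancreatic rows; the lung row differs by 0.1, presumably because the listed scores are rounded.

```python
def dice_score(prediction, reference):
    """Dice = 2 * |P and R| / (|P| + |R|) for flat binary masks."""
    if len(prediction) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(prediction, reference) if p and r)
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


def improvement_points(baseline, augmented):
    """Absolute Dice gain expressed in percentage points."""
    return round((augmented - baseline) * 100.0, 1)


prediction = [0, 1, 1, 1, 0, 0]  # toy predicted tumor mask
reference = [0, 0, 1, 1, 1, 0]   # toy ground-truth mask

print(dice_score(prediction, reference))
print(improvement_points(0.449, 0.490))  # colon tumor row of Table 4
```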

Qualitative assessment

Figure 4 shows qualitative evaluations of three cases with abnormalities. MAISI yields excellent CT generation quality on both normal organs and abnormal tumor regions, as shown in the boxes of each subfigure. Our results indicate that MAISI delineates abnormal tissue boundaries with high fidelity, demonstrating its robustness in capturing intricate details from segmentation mask conditions. MAISI has the potential to enhance the diversity and realism of generated CT images for data augmentation purposes.

A side-by-side of segmentation masks and generated CT images showing how MAISI can generate CT images containing a colon tumor, a bone lesion, and a lung tumor.
Figure 4. Examples of segmentation masks (left) and generated CT images (right) for colon tumor (top), bone lesion (middle), and lung tumor (bottom)

Notably, in each case, MAISI accurately simulates the appearance of abnormal tumor regions and opens the possibility of enriching the dataset with variations in tumor morphology and spatial distribution. These findings highlight the potential of MAISI as a powerful tool for augmenting medical imaging datasets, thereby improving the robustness and generalization of machine learning models in clinical applications.


MAISI is a state-of-the-art foundation AI model for generating 3D high-resolution synthetic medical images with corresponding labels to address data limitations, reduce annotation costs, and maintain patient privacy. With its ability to generate high-resolution images and segmentation masks covering up to 127 anatomical classes, MAISI is poised to make a significant impact in medical imaging. Incorporating MAISI-generated synthetic data into the training of segmentation models has demonstrated substantial performance improvements, paving the way for increased robustness and generalization in clinical applications.

To explore the potential of synthetic data generation with MAISI for your projects, join the early access program.


All co-authors wish to note that they made equal contributions to the research presented here and to the writing of this post.
