
Gen AI Super-resolution Accelerates Weather Prediction with Scalable, Low-Compute Models

With NVIDIA Earth-2 and CorrDiff, faster training, real-time inference, and scalable ensemble workflows are making AI forecasting more efficient and accessible.

As AI weather and climate prediction models rapidly gain adoption, the NVIDIA Earth-2 platform provides libraries and tools for accelerating solutions using a GPU-optimized software stack. Downscaling, which is the task of refining coarse-resolution (25km scale) weather data, enables national meteorological service (NMS) agencies to deliver high-resolution predictions for agriculture, energy, transportation, and disaster preparedness at spatial resolutions fine enough for actionable decision-making and planning. 

Traditional dynamical downscaling is prohibitively expensive, especially for large ensembles at high resolution over extensive spatial domains. CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods, achieves state-of-the-art results, and uses a patch-based multidiffusion approach to scale to continental and global domains, unlocking significant gains in efficiency and scalability while greatly reducing computational cost.

CorrDiff has gained global adoption for various use cases, demonstrating its versatility and impact across domains where fine-scale weather information is essential:

  • The Weather Company (TWC) for supporting the agriculture, energy, and aviation industries.
  • G42 for improving smog and dust storm predictions in the Middle East.
  • Tomorrow.io for enhancing a range of storm-scale predictions, including fire weather forecasts and wind gust forecasts that disrupt railway operations.

In this blog post, we describe the performance optimizations and enhancements for CorrDiff training and inference that were incorporated into two tools in the Earth-2 stack, NVIDIA PhysicsNeMo and NVIDIA Earth2Studio. Achieving over a 50x speedup over the training and inference baselines, these optimizations enable:

  • Scaling patch-based training to the entire planet in under 3,000 GPU-hours.
  • Lowering most country-scale training runs to O(100) GPU-hours.
  • Training over the contiguous United States (CONUS) in under 1,000 GPU-hours.
  • Fine-tuning and bespoke training that democratize km-scale AI weather prediction.
  • Country-scale inference in GPU-seconds, planetary-scale inference in GPU-minutes. 
  • Generating large ensembles affordably for high-resolution probabilistic forecasting.
  • Interactive exploration of kilometer-scale data.

CorrDiff: Training and inference 

Figure 1. CorrDiff training and sampling workflow. Coarse-resolution (25 km) global weather data is first used to predict the conditional mean µ with a regression model, which is then stochastically corrected by an Elucidated Diffusion Model (EDM) that generates a residual r; together they produce a probabilistic high-resolution (2 km) regional forecast. Bottom right: the diffusion model is conditioned on the coarse-resolution input and generates the residual r after a few denoising steps. Bottom left: the score function for diffusion is learned with a UNet architecture.

Figure 1 illustrates the training and sampling workflow of CorrDiff for generative downscaling. During diffusion training, a pretrained regression model is used to generate the conditional mean, which serves as input for training the diffusion model. For background and details on CorrDiff, refer to the CorrDiff publication, PhysicsNeMo docs, and source code.
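
To make the two-stage structure concrete, the following minimal sketch shows the generation flow, assuming illustrative names (regression_net, diffusion_sampler) rather than the actual PhysicsNeMo API:

import torch

# Minimal sketch of CorrDiff's two-stage generation; regression_net and
# diffusion_sampler are illustrative stand-ins, not the PhysicsNeMo API.
@torch.no_grad()
def corrdiff_sample(regression_net, diffusion_sampler, coarse_input):
    mean = regression_net(coarse_input)               # deterministic conditional mean
    residual = diffusion_sampler(coarse_input, mean)  # stochastic residual r from the EDM
    return mean + residual                            # one probabilistic high-resolution sample

Because the residual is sampled stochastically, repeated calls yield an ensemble of plausible high-resolution fields for the same coarse input.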

Why optimize CorrDiff? 

Diffusion models are resource-intensive because they rely on iterative sampling, with each denoising step involving multiple neural network computations. This makes inference time-consuming and costly. Training is also more expensive because the denoiser has to be trained for the full range of noise levels. Optimizing their performance requires:

  • Streamlining core operations (e.g., fusing kernels, using mixed precision, using NVIDIA CUDA graphs, etc.).
  • Improving the sampling process by reducing the number of denoising steps and using optimal time integration schemes.

CorrDiff uses the EDM architecture, where several computationally expensive operations, such as group normalization, activation functions, and convolutions, can be optimized using advanced packages and kernel fusion.

CorrDiff also uses a two-stage pipeline (regression and correction), offering opportunities to amortize cost across multiple diffusion steps by caching regression outputs, minimizing redundant compute.

Accelerated CorrDiff

In the following figures, we summarize the optimizations that together deliver over a 50x speedup for both training and inference over the CONUS domain. Figures 2 and 3 summarize the cumulative speedup factors achieved over the baseline with each successive optimization. Details of each optimization are provided in subsequent sections.

Figure 2. Patch-based CorrDiff training speedup per patch for CONUS-scale training; each patch is ~200K pixels (448×448). Green bars plot the speedup relative to the FP32 baseline (left y-axis), and the solid gray line plots the runtime per patch on a log scale (right y-axis)
Figure 3. Patch-based CorrDiff inference speedup per sample for CONUS scale, ~2M pixels (1056×1792). Green bars plot the speedup relative to the FP32 baseline (left y-axis), and the solid gray line plots the sampling runtime per batch element on a log scale (right y-axis)

Optimized CorrDiff: How it’s achieved

The baseline performance of CorrDiff on NVIDIA H100 GPUs with FP32 precision, batch size = 1, patch size = 1 (in absolute time) was as follows:

  • Regression forward: 1204ms
    • Domain:  CONUS of size 1056 × 1792 pixels
    • Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25km resolution 
    • Output channels:  [“refc”, “2t”, “10u”, “10v”] at 2km resolution 
  • Diffusion forward: 155ms
    • Domain: spatial patch of size 448 x 448 pixels
    • Input channels: [“u500”, “v500”, “z500”, “t500”, “u850”, “v850”, “z850”, “t850”, “u10m”, “v10m”, “t2m”, “tcwv”] at 25km resolution 
    • Output channels:  [“refc”, “2t”, “10u”, “10v”] at 2 km resolution 
  • Diffusion backward: 219ms

While effective, this baseline was limited by expensive regression model forward passes and inefficient data transposes.

Figure 4 shows an NVIDIA Nsight Systems profile of baseline patch-based CorrDiff training. The runtime distribution across training stages shows that the regression forward pass dominates total iteration time, making it the primary computational bottleneck during training.
Figure 4. Training performance profile: NVIDIA Nsight Systems runtime distribution of baseline patch-based CorrDiff training stages, with the regression forward pass dominating total iteration time

Key CorrDiff training optimizations 

To achieve substantial acceleration in CorrDiff training, culminating in a 53.86x speedup on NVIDIA B200 and 25.51x on H100, we introduced the series of performance optimizations outlined below.

Optimization 1: Enable AMP-BF16 for training
The original training recipe uses FP32 precision. Here, we enabled Automatic Mixed Precision (AMP) with BF16 for training to reduce memory usage and improve throughput without compromising numerical stability, leading to a 2.03x speedup over baseline.
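
As a minimal sketch (placeholder model, loss, and optimizer rather than the exact PhysicsNeMo recipe), enabling AMP-BF16 amounts to wrapping the forward pass in an autocast context:

import torch

def train_step_bf16(model, loss_fn, optimizer, coarse, target):
    # One AMP-BF16 training step: the forward pass runs in mixed FP32/BF16
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(coarse), target)
    loss.backward()  # BF16 shares FP32's exponent range, so no GradScaler is needed
    optimizer.step()
    return loss.detach()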

Optimization 2: Amortizing regression cost using multi-iteration patching
In the original patch-based training workflow, each 448×448 patch sample for diffusion model training required a regression model inference over the full 1056×1792 CONUS domain. This caused diffusion model training throughput to be bottlenecked by regression model inference.

We improved efficiency by caching regression outputs and reusing them across 16 diffusion patches per timestamp. This provided broader spatial coverage while spreading the regression cost more effectively, yielding a 12.33× speedup over the baseline.
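
The pattern is sketched below under simplifying assumptions: the coarse conditioning is already interpolated to the high-resolution grid, and diffusion_step is a hypothetical callable that runs one diffusion training step on a patch.

import torch

def train_from_cached_regression(regression_net, diffusion_step, coarse_full, target_full,
                                 patch=448, patches_per_cache=16):
    # Run the expensive full-domain regression once...
    with torch.no_grad():
        mean_full = regression_net(coarse_full)  # one 1056x1792 forward pass
    # ...then reuse it for several randomly placed diffusion training patches.
    _, _, h, w = mean_full.shape
    for _ in range(patches_per_cache):
        ys = torch.randint(0, h - patch + 1, (1,)).item()
        xs = torch.randint(0, w - patch + 1, (1,)).item()
        sl = (..., slice(ys, ys + patch), slice(xs, xs + patch))
        diffusion_step(coarse_full[sl], mean_full[sl], target_full[sl])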

Optimization 3: Eliminating data transposes with Apex GroupNorm
The training pipeline initially used the default NCHW memory layout, which triggers costly implicit memory transposes before/after convolutions. Switching the model and input tensors to NHWC (channels-last) format aligns them with cuDNN’s preferred layout. However, PyTorch GroupNorm ops do not support the channels-last format. To prevent transposes and keep data in channels-last format for more efficient normalization kernels, we replaced PyTorch GroupNorm with NVIDIA Apex GroupNorm. This eliminated transpose overhead and yielded a 16.71× speedup over the baseline.
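
A minimal sketch of the swap is shown below, assuming Apex is built with its group_norm extension; the block structure is illustrative, not CorrDiff's actual UNet code.

import torch
import torch.nn as nn
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm  # channels-last-capable GroupNorm

class ConvGnSilu(nn.Module):
    # Illustrative EDM-style block: convolution, GroupNorm, SiLU activation.
    def __init__(self, channels, groups=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = ApexGroupNorm(groups, channels)  # replaces nn.GroupNorm; keeps NHWC layout
        self.act = nn.SiLU()                         # Optimization 4 fuses this into the Apex kernel

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

# Convert the model and inputs to channels-last so cuDNN convolutions and the
# Apex kernels avoid implicit NCHW <-> NHWC transposes.
model = ConvGnSilu(128).cuda().to(memory_format=torch.channels_last)
x = torch.randn(2, 128, 448, 448, device="cuda").to(memory_format=torch.channels_last)
y = model(x)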

Optimization 4: Fusing GroupNorm with SiLU
By fusing GroupNorm and SiLU activation into a single kernel using Apex, we reduced kernel launches and the number of global memory accesses. This increased GPU utilization and delivered a 17.15× speedup over the baseline.

Optimization 5: Extended channel dimension support in Apex GroupNorm
Some CorrDiff layers use channel dimensions unsupported by Apex. We extended support for these channel dimensions, unlocking fusion for all layers. This improved performance to 19.74× speedup over the baseline.

Optimization 6: Kernel fusion through graph compilation
We used torch.compile to fuse the remaining elementwise operations (e.g., addition, multiplication, sqrt, exp). This improved scheduling, reduced global memory accesses, and cut Python overhead, delivering a 25.51× speedup over the baseline.
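
The compiled function below is an illustrative stand-in for this kind of elementwise math (the coefficients follow the EDM paper), not CorrDiff's actual code path:

import torch

@torch.compile  # Inductor fuses these elementwise ops (mul, add, sqrt, log) into a few kernels
def edm_scalings(sigma: torch.Tensor, sigma_data: float = 0.5):
    # EDM preconditioning coefficients, built purely from elementwise math
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / torch.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / torch.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * torch.log(sigma)
    return c_skip, c_out, c_in, c_noise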

Optimization 7: Apex GroupNorm V2 on NVIDIA Blackwell
Using Apex GroupNorm V2, optimized for NVIDIA Blackwell GPUs, yielded a 53.86× speedup over the baseline on B200 and 2.1× over the H100-optimized workflow.

Figure 5. Optimized patch-based CorrDiff training profile (NVIDIA Nsight Systems), with the regression forward cost amortized across multiple diffusion training steps

Training throughput 

We compare the training throughput of baseline CorrDiff on NVIDIA Hopper against optimized versions on Hopper and Blackwell in Table 1. The optimized implementation achieves improvements in efficiency across both architectures, with Blackwell showing the most significant gains.

Note: Regression refers to the regression forward pass. Diffusion refers to the combined diffusion forward and backward passes. Total is the combined cost of regression forward + diffusion forward + diffusion backward.

GPU | Precision | Regression (ms/patch) | Diffusion (ms/patch) | Total runtime (ms/patch) | Throughput (patches/s)
H100 | Baseline FP32 | 1204.0 | 374.0 | 1578.0 | 0.63
H100 | Optimized BF16 | 10.609 | 51.25 | 61.859 | 16.2
B200 | Optimized BF16 | 4.734 | 24.56 | 29.297 | 34.1
Table 1. CorrDiff training throughput comparison

Speed-of-Light analysis 

To evaluate how close our optimized CorrDiff workflow comes to the hardware performance ceiling, we conducted a Speed-of-Light (SOL) analysis on H100 and B200 GPUs. This provides an upper-bound estimate of achievable performance by assessing how effectively GPU resources are being used.

Steps to estimate SOL (a code sketch of this bookkeeping follows the list):

  1. Identify low-utilization kernels:
    We focus on kernels with both DRAM read/write bandwidth < 60% and Tensor Core utilization < 60%. Such kernels are neither memory-bound nor compute-bound, making them likely performance bottlenecks.
  2. Estimate per-kernel potential:
    For each low-utilization kernel, we estimate the potential speedup under ideal conditions—namely, full DRAM bandwidth usage or full Tensor Core activity.
  3. Aggregate overall speedup:
    We then compute the hypothetical end-to-end speedup if each kernel were optimized to its ideal performance.
  4. Compute SOL efficiency:
    Finally, we estimate SOL efficiency as the fraction of peak performance the workflow achieves relative to a hypothetical run in which the top 10 runtime-dominant kernels are each boosted to their theoretical maximum.
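
The sketch below illustrates this bookkeeping on per-kernel statistics of the kind exported from NVIDIA Nsight tools; the field names, thresholds, and aggregation are our illustrative reading of the steps above, not a published formula:

def estimate_sol(kernels, top_k=10, low_threshold=0.60):
    # kernels: list of dicts with 'time_ms', 'dram_util', and 'tc_util' (utilizations in [0, 1]).
    total = sum(k["time_ms"] for k in kernels)
    # Step 1: time spent in kernels that are neither memory-bound nor compute-bound
    low_util_share = sum(k["time_ms"] for k in kernels
                         if k["dram_util"] < low_threshold and k["tc_util"] < low_threshold) / total
    # Steps 2-3: assume each top runtime-dominant kernel could run at the speed implied by
    # pushing its busier resource (DRAM or Tensor Core) to 100% utilization
    ideal_total = total
    for k in sorted(kernels, key=lambda k: k["time_ms"], reverse=True)[:top_k]:
        ideal_total -= k["time_ms"] * (1.0 - max(k["dram_util"], k["tc_util"]))
    # Step 4: fraction of the estimated speed-of-light the measured run achieves
    return ideal_total / total, low_util_share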

Using this framework, our optimized CorrDiff workflow achieves 63% of the estimated SOL on H100 and 67% on B200. This indicates strong GPU utilization while still leaving headroom for future kernel-level improvements.

To further assess efficiency, we visualize kernel performance as shown in Figures 6 and 7. Each dot represents a kernel, plotted by NVIDIA Tensor Core utilization (x-axis) and combined DRAM read/write bandwidth utilization (y-axis). The dot size reflects its share of total runtime, highlighting performance-critical operations.

Kernels near the top or right edges are generally well-optimized, as they fully exploit compute or memory resources. In contrast, kernels in the bottom-left quadrant underutilize both and represent the best opportunities for further optimization. This visualization provides a clear picture of the runtime distribution and helps identify where GPU efficiency can be improved.

Figure 6: Baseline FP32 patch-based CorrDiff kernel utilization on B200

Figure 6 shows the distribution of kernels in terms of Tensor Core utilization and DRAM bandwidth utilization for the baseline CorrDiff implementation. In the unoptimized FP32 workflow, more than 95% of the time is spent in low-utilization kernels, where both DRAM utilization (read + write) and Tensor Core utilization are below 60%.

The majority of runtime-dominant kernels cluster near the origin, showing very low DRAM and Tensor Core utilization. Only a small number of kernels lie near the upper or right boundaries, where kernels become clearly memory-bound or compute-bound. The unoptimized CONUS CorrDiff workflow reached only 1.23% of the estimated SOL on B200.

Figure 7: Optimized BF16 patch-based CorrDiff kernel utilization on B200

Figure 7 shows the distribution of kernels in the optimized implementation in terms of Tensor Core utilization and DRAM bandwidth utilization. In the optimized workflow with AMP-BF16 training, a higher proportion of kernels sit near the top-left or bottom-right edges, indicating good GPU utilization as memory-bound or compute-bound kernels. Optimized CorrDiff now reaches 67% of the estimated SOL on B200. Despite the overall improvements, some kernels still have headroom for further acceleration.

CorrDiff inference optimizations 

Many of the training optimizations can also be applied to inference. On top of them, we added several inference-specific optimizations to maximize performance.

Optimized multi-diffusion
CorrDiff uses a patch-based multi-diffusion approach, where overlapping spatial patches are denoised and then aggregated. Initially, 27.1% of the total runtime was spent in im2col folding/unfolding operations. Precomputing overlap counts for each pixel and using torch.compile() to accelerate the remaining folding/unfolding steps eliminates the im2col bottleneck entirely, resulting in a speedup of 7.86x.
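
A simplified version of this idea is sketched below: the per-pixel overlap counts are built once with unfold/fold and cached, so aggregating each sample reduces to one fold and a division (shapes, stride, and names are illustrative, not the exact PhysicsNeMo implementation):

import torch
import torch.nn.functional as F

def precompute_overlap_counts(height, width, patch, stride, device="cuda"):
    # How many overlapping patches cover each output pixel; computed once and reused.
    ones = torch.ones(1, 1, height, width, device=device)
    cols = F.unfold(ones, kernel_size=patch, stride=stride)
    counts = F.fold(cols, output_size=(height, width), kernel_size=patch, stride=stride)
    return counts.clamp_min(1.0)  # guard against pixels the chosen stride does not cover

def aggregate_patches(patch_cols, counts, height, width, patch, stride):
    # patch_cols: (N, C*patch*patch, L) denoised patches in unfold layout.
    # Sum overlapping contributions with fold, then average using the cached counts.
    summed = F.fold(patch_cols, output_size=(height, width), kernel_size=patch, stride=stride)
    return summed / counts

In the optimized workflow, the remaining fold/unfold calls are additionally wrapped in torch.compile, as described above.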

Deterministic Euler sampler (12 steps) 
The original stochastic sampler used 18 denoising steps with the Heun solver and second-order correction. By switching to a deterministic Euler sampler (no second-order correction), we reduced the number of denoising steps to 12 without impacting output quality. This change delivered an additional ~2.8× speedup on both Hopper and Blackwell, bringing the cumulative inference speedup with the 12-step deterministic sampler to 21.94x on H100 and 54.87x on B200.
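
For reference, a deterministic first-order (Euler) EDM sampler has roughly the shape sketched below; the noise-schedule defaults follow the EDM paper and are illustrative, and denoiser(x, sigma, condition) stands in for the conditioned CorrDiff denoiser:

import torch

def edm_euler_sample(denoiser, condition, shape, num_steps=12,
                     sigma_min=0.002, sigma_max=800.0, rho=7.0, device="cuda"):
    # Karras noise schedule from sigma_max down to sigma_min, followed by 0
    i = torch.arange(num_steps, device=device)
    sigmas = (sigma_max ** (1 / rho)
              + i / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    sigmas = torch.cat([sigmas, torch.zeros(1, device=device)])

    x = torch.randn(shape, device=device) * sigmas[0]
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, sigma_cur, condition)) / sigma_cur  # dx/dsigma from the learned denoiser
        x = x + (sigma_next - sigma_cur) * d                     # one Euler step; no second-order correction
    return x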

Several of the optimizations described in this blog post also apply to diffusion models in general, and some are specific to patch-based approaches. As such, those can be ported to other models in PhysicsNeMo and used in the development of solutions beyond weather downscaling. 

Getting started

To train or run inference with CorrDiff in PhysicsNeMo, see the PhysicsNeMo CorrDiff documentation.

  • To train with the optimized codebase, follow the instructions in the CorrDiff repo README and set the following options in the training.perf section of your selected training YAML config:

    fp_optimizations: amp-bf16
    use_apex_gn: True
    torch_compile: True
    profile_mode: False

  • To run inference with the optimized codebase, follow the instructions in the CorrDiff repo README and set the following options in the generation.perf section of your selected generation config:

    use_fp16: True
    use_apex_gn: True
    use_torch_compile: True
    profile_mode: False
    io_syncronous: True
  • Keep profile_mode set to False for optimized performance, as the NVTX profiling annotations introduce graph breaks in the torch.compile workflow.
  • To use the latest Apex GroupNorm kernels, either build Apex GroupNorm in the PhysicsNeMo container Dockerfile or build it locally inside the PhysicsNeMo container.
    • Clone the Apex repo and build it from within the cloned directory using:
# Clone the NVIDIA Apex repository, then build the group_norm extension from inside it
git clone https://github.com/NVIDIA/apex.git
cd apex
CFLAGS="-g0" NVCC_APPEND_FLAGS="--threads 8" \
pip install \
--no-build-isolation \
--no-cache-dir \
--disable-pip-version-check \
--config-settings "--build-option=--group_norm" .

Learn more about optimized CorrDiff training in PhysicsNeMo and run optimized workflows in Earth2Studio.

Video 1. Visualizing patch-based CorrDiff downscaling on CONUS with 55x speedup
