Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities

Scientists and engineers who design and build unique scientific research facilities face similar challenges: managing massive data rates that exceed the capacity of current computational infrastructure to extract scientific insights, and steering experiments in real time. These challenges are obstacles to maximizing the impact of scientific discoveries and significantly slow the pace of knowledge growth.

Scientists and engineers at NVIDIA work with these facilities to develop new solutions, built on parallel and distributed computation, that remove these blockers. This post walks through two notable examples of formalizing complex physics problems into tractable mathematical puzzles that benefit greatly from GPU-accelerated scientific computing, both involving the U.S. Department of Energy: the NSF-DOE Vera C. Rubin Observatory and SLAC's Linac Coherent Light Source II (LCLS-II).

These unique, massive-scale research facilities each took a decade to build, and both enable unprecedented scientific discoveries that serve worldwide scientific communities. NVIDIA accelerated computing, together with the GPU-accelerated Python libraries CuPy and cuPyNumeric, is enabling live feedback for experiment steering, which was previously impossible. The teams leveraged Accelerated Space and Time Image Analysis (ASTIA) to process real-time "movies" of the southern sky, and Ultrafast X-ray Analysis of Nanoscale Imaging (XANI), built on cuPyNumeric and CuPy, to achieve real-time steering of LCLS-II experiments.

Data analyses that previously took nine months were completed in four hours.

Astrophysics and ultrafast X-ray science

Breakthroughs in experimental instrumentation have enabled extremely high data acquisition rates, capturing more objects than ever before on their intrinsic time and length scales.

At the Vera C. Rubin Observatory, for the first time, astrophysicists and astronomers can capture the entire southern sky and discover 2,000+ new asteroids per night using a 3.2-gigapixel camera. Meanwhile, at LCLS-II, scientists and engineers accelerate electrons along a 3-km tunnel, converting their energy into ultrafast X-ray bursts that make movies of materials on the atomic scale.

Astrophysics: The NSF-DOE Vera C. Rubin Observatory's LSST camera will produce 20 terabytes of images per night and operate continuously for 10 years, mapping the entire southern sky every three to four nights. Over the course of a month or more, the LSST camera accumulates petabytes of data that will be used to create a 10-year time-lapse movie of the universe.

X-ray science: The LCLS-II produces the most powerful X-ray pulses in the world, up to 1 million bursts per second, increasing brightness by a factor of 10,000 over the original LCLS. This enables mapping the swiftest and smallest movements of electrons and atoms inside matter. LCLS-II produces petabyte-scale X-ray data within days, making movies of quantum phenomena and providing unprecedented insights into how matter behaves.

Figure 1. The Linac Coherent Light Source at SLAC has the world's longest X-ray particle accelerator tunnel, making data available at unprecedented speed and volume. Image credit: SLAC National Accelerator Laboratory

Common challenge: The demand for real-time analysis of massive datasets requires both computational speed and memory capacity beyond traditional systems. Accelerated computing provides the speed, but distributed systems are still needed to handle the enormous problem sizes. By combining HPC systems, GPU acceleration, and specialized networking, scientists can meet these demands. With cuPyNumeric, programmers get a single programming model that runs on traditional systems and exploits modern hardware features alike.

Toward full workflow automation: Both facilities are moving beyond batch analysis, favoring modular, highly parallel pipelines that execute reliably regardless of experiment size. Data movement, transformation, and extraction are automated to the point that human oversight focuses on hypotheses and interpretation rather than manual intervention or IT tuning.

Solutions: NVIDIA accelerated computing, coupled with the GPU-accelerated Python libraries CuPy and cuPyNumeric, is enabling live feedback for experiment steering, which was previously impossible due to excessively long computations. Now, by running these same scientific analysis pipelines on NVIDIA Grace Hopper and NVIDIA Blackwell systems, NVIDIA DGX Spark, and NVIDIA RTX PRO, researchers are gaining powerful new advantages in both performance and collaboration.

Data analyses that previously took nine months are now possible in four hours by reformulating the governing equations and distributing the computation across GPUs. Unified memory, available on the NVIDIA GH200 Grace Hopper Superchip and the NVIDIA Blackwell architecture, unlocks massive problem sizes, letting GPU acceleration extract physics parameters quickly. These parameters are then used to train AI models for autonomous experiments and science analyses at unprecedented speed.

Vera C. Rubin Observatory accelerated workflow and prompt processing

The LSST surveys the sky in space and time with a 3.2-gigapixel camera that captures the southern sky, producing up to 20 TB of images per night. Every night, the camera will discover 2,000+ new asteroids that have never been seen before. The principal scientific goals include:

  • Tracking billions of celestial objects with precise time-resolved measurements.
  • Detecting and classifying transient phenomena that have never been observed before (for example, supernovae, near-Earth objects, and variable stars).
  • Searching for signatures of dark matter and dark energy in the ever-expanding universe.
  • Creating a year-round repository of objects and their locations in space and time across the complete southern sky, and sending alerts to a worldwide network of broker platforms and astronomical telescopes to acquire more detailed follow-up observations of individual stars, galaxies, and black holes.

To date, the astrophysics and astronomy communities have jointly developed an open source, CPU-based data processing pipeline that takes up to 10 minutes per image, while each image is acquired in just 40 seconds. Live data processing, needed to promptly send alerts to telescopes around the world and steer observation decisions, requires accelerated computing.

These steps require advanced image calibration, basis construction, convolutions, subpixel differencing, pattern extraction, and real-time statistical inference on data streams too large for the current CPU cluster processing workflow developed by scientists and engineers from the worldwide astrophysics and astronomy communities.

To realize these goals on an accelerated timescale and enable greater complexity in data processing operations, scientists and engineers at NVIDIA, Princeton University, and SLAC are developing an accelerated GPU workflow, called Accelerated Space and Time Image Analysis (ASTIA). This workflow includes:

  • Calibration and basis construction: Rapidly calibrate massive CCD data to remove artifacts and distortions, and construct basis functions of each acquired image to enable coordinate mapping and transformations.
  • Chained transformation: Warping, convolutions, background and image subtractions, object movement, and error calculations (through CuPy) are benchmarked on both NVIDIA Grace Hopper and NVIDIA Grace Blackwell; see the sketch after this list.
  • Parallelization: Parallel prompt processing (mapping, object detection, fitting, and cataloging) runs as batch or interactive sessions. Numerical computations happen in milliseconds instead of minutes.
  • Packaging and broker alert: Catalog new objects, orbit information, coordinates, and issue global alerts within seconds to the worldwide LSST community.
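
To make the chained transformation step concrete, here is a minimal single-GPU sketch of PSF-matched image differencing in CuPy. The synthetic data and the simple Gaussian kernel (standing in for a fitted Alard-Lupton basis) are illustrative assumptions, not the actual ASTIA pipeline code.

```python
import cupy as cp
from cupyx.scipy import ndimage

# Synthetic stand-ins for a calibrated science exposure and a reference
# template; the real pipeline operates on full CCD mosaics.
rng = cp.random.default_rng(seed=0)
science = rng.standard_normal((4096, 4096), dtype=cp.float32)
template = rng.standard_normal((4096, 4096), dtype=cp.float32)

def gaussian_kernel(size=25, sigma=2.0):
    # Simple stand-in for the PSF-matching kernel; Alard-Lupton fits a
    # linear combination of Gaussian basis functions per image.
    ax = cp.arange(size, dtype=cp.float32) - size // 2
    g = cp.exp(-0.5 * (ax / sigma) ** 2)
    kernel = cp.outer(g, g)
    return kernel / kernel.sum()

# Chained transformation: convolve the template to match the science
# image's PSF, then subtract to reveal transient sources.
matched = ndimage.convolve(template, gaussian_kernel(), mode="nearest")
difference = science - matched

# Simple 5-sigma candidate detection on the difference image.
sigma = cp.std(difference)
candidates = cp.argwhere(cp.abs(difference) > 5 * sigma)
print(f"{candidates.shape[0]} candidate pixels above 5 sigma")
```
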
Figure 2. Prompt processing workflow for astrophysical alert production and live steering of the LSST camera on CPU versus GPU. NVIDIA H200 and GH200 GPUs accelerate prompt processing by 86x to 180x, reducing the first two steps (GetTemplate and Alard-Lupton) from 8.4 and 12.5 seconds to 46 and 146 milliseconds, respectively, enabling rapid generation and distribution of astrophysical alerts.

LCLS-II: Scaling with parallel and distributed computation

At LCLS-II, ultrafast X-ray pulses generate movies of atomic and electronic dynamics within materials and molecules. Major science challenges include:

  • Capturing 3D X-ray movies across tens of terabytes in a single session
  • Characterizing defects, phonon dispersions, crystal structures, electron distributions, and quantum phenomena from scattered X-ray patterns at rapid cadence
  • Delivering live feedback for experiment steering, so scientists can adjust parameters in real time to catch rare dynamic states

This requires processing and analyzing data at the single-pixel, single-event level, with mathematical models that can detect and reconstruct complex atomic motions—all under stringent time constraints. In essence, this enables researchers to watch atoms move in real time.

Ultrafast X-ray analysis of nanoscale imaging (XANI) workflow

At LCLS, NVIDIA and SLAC scientists and engineers developed a pipeline that concurrently processes X-ray frames, fits physical models pixel by pixel, and rapidly reconstructs 3D phonon dispersions to extract the thermal, optical, and electrical properties of materials. The analysis leverages pattern matching, nonlinear fitting, and large-scale reduction to summarize experiment outcomes in a form meaningful for real-time scientific inference and automatic instrument steering.
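
As a rough illustration of this pixel-wise fitting pattern, the sketch below fits a fixed-frequency oscillation to every pixel of a synthetic frame stack using cuPyNumeric's NumPy-compatible API. The array shapes, the cosine model, and the frequency are assumptions for illustration, not the XANI implementation.

```python
import math
import cupynumeric as np  # drop-in NumPy replacement with distributed execution

# Illustrative stand-in data: a stack of detector frames sampled in time.
n_frames, ny, nx = 256, 512, 512
t = np.linspace(0.0, 1.0, n_frames)
frames = np.random.rand(n_frames, ny, nx)

# Fit I(t) = a*cos(w t) + b*sin(w t) + c per pixel for an assumed phonon
# frequency w, using closed-form least squares so every pixel is solved
# by the same vectorized array operations.
w = 2.0 * math.pi * 5.0
basis = np.stack([np.cos(w * t), np.sin(w * t), np.ones_like(t)], axis=1)  # (n_frames, 3)

# Normal equations, batched over all pixels: coeffs = (B^T B)^-1 B^T y.
y = frames.reshape(n_frames, -1)     # (n_frames, ny*nx)
gram = basis.T @ basis               # (3, 3)
rhs = basis.T @ y                    # (3, ny*nx)
coeffs = np.linalg.solve(gram, rhs)  # (3, ny*nx)

# Oscillation amplitude per pixel, reduced to a map for live feedback.
amplitude = np.sqrt(coeffs[0] ** 2 + coeffs[1] ** 2).reshape(ny, nx)
print("mean fitted amplitude:", float(amplitude.mean()))
```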

Figure 3. LCLS nanoscale science discovery workflow, from experiment setup to data collection and analysis, with XANI acceleration of nanoscale imaging. The accompanying performance chart shows a 1,100x computational acceleration on GPUs versus CPUs, reducing processing time from 15 hours (CPU baseline) to 0.5 seconds using 128 GPUs.

How does XANI accelerate the stack?

  • Data ingestion: High-throughput connections rapidly transfer images or experiment data to local cluster, supercomputer, or DGX Spark storage.
  • Parallelization: cuPyNumeric achieves efficient parallelization across available resources by strategically partitioning the global data arrays. It then distributes computations by mapping operations on these sub-partitions to separate processing units. The runtime also decomposes the scientific code into a dependency-driven task graph, which enables implicit parallelism and dynamic scheduling of work across all allocated resources.
  • Operator chains: XANI executes complex transformation graphs (sum, convolution, basis change) as a series of kernels, reducing latency and memory-movement overhead. Interoperability through Python tasks enables embedding of third-party single-GPU Python libraries (CuPy, for example) for data-parallel operations.
  • Distributed scaling: cuPyNumeric enables array and matrix computations to scale from a desktop to thousand-GPU clusters, handling datasets that exceed a single node's memory, all natively in Python; see the sketch after this list.
  • Collaboration and control: Researchers access their environment and computational results interactively, monitor GPU/CPU utilization, and profile performance with built-in tools.
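
The sketch below shows what such an operator chain looks like as plain cuPyNumeric code; the array shapes and the polynomial basis are illustrative assumptions. Because cuPyNumeric partitions the arrays and schedules the resulting task graph automatically, the same script can run on one GPU or many through the legate launcher (for example, legate --gpus 8 operator_chain.py).

```python
import cupynumeric as np

# Illustrative operator chain: background subtraction, reduction over
# events, and a basis change, written as plain array operations.
events = np.random.rand(128, 1024, 1024)  # event frames (assumed shape)
background = events.mean(axis=0)          # reduction across events
signal = events - background              # broadcast subtraction
summed = signal.sum(axis=0)               # large-scale reduction

# Basis change: project the summed image onto a small polynomial basis
# (an assumption standing in for the real transformation).
xs = np.linspace(-1.0, 1.0, summed.shape[1])
basis = np.stack([xs**k for k in range(4)], axis=0)  # (4, width)
projection = summed @ basis.T                        # (height, 4)
print(projection.shape)
```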

Accelerated computation enables physics-informed AI training

The CUDA Python stack provides an integrated solution for accelerated science:

  • CUDA Python enables developing accelerated mathematical kernels and functions, broadly compatible with the Python ecosystem, when existing solutions are not available.
  • CuPy offers GPU-compatible NumPy and SciPy interfaces that parallelize and accelerate numerical computations on a single GPU; see the example after this list.
  • cuPyNumeric delivers a familiar NumPy/SciPy interface that distributes computation across multiple GPUs and nodes using advanced runtime management.
  • XANI uses high-performance array operations and transformation chains, optimized for tasks like matrix math, subpixel warping, and polynomial projection. This package accelerates ultrafast X-ray characterization with GPU kernels and advanced workflow integration.
  • All of the above code is optimized to run on servers based on Grace Hopper and Grace Blackwell. For individual testing and development, running it on DGX Spark or RTX PRO provides accelerated results compared to CPU-only systems.
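
For instance, a CPU NumPy computation often ports to a single GPU simply by swapping the array module, as in this minimal sketch (the FFT workload is an arbitrary illustration):

```python
import numpy as np
import cupy as cp
from cupyx.scipy import fft as cufft

# CPU reference with NumPy...
x_cpu = np.random.rand(1 << 20).astype(np.float32)
spectrum_cpu = np.fft.rfft(x_cpu)

# ...and the same computation on the GPU with CuPy's SciPy-like FFT API.
x_gpu = cp.asarray(x_cpu)         # host-to-device copy
spectrum_gpu = cufft.rfft(x_gpu)  # runs on the GPU

# Results agree to single-precision tolerance.
np.testing.assert_allclose(cp.asnumpy(spectrum_gpu), spectrum_cpu,
                           rtol=1e-3, atol=1.0)
```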

Tips for using GPUs and CUDA Python for science

To use GPUs and CUDA Python to solve scientific problems, follow these strategies:

  • Identify the key scientific questions, then the relevant mathematical operations and models, favoring forms that reduce to array and linear-algebra operations. Develop a workflow to process the raw data and solve the models using NumPy, then port it to CuPy for single-GPU parallelization. For thousands to billions of computations that require multinode systems, introduce cuPyNumeric to distribute the same code across multiple GPUs and nodes, leveraging the patterns discussed in this post; see the portability sketch after this list.
  • For ultrafast X-ray and other pixel-wise, model-fitting workloads, XANI provides an open, Python-based pipeline that wraps high-performance GPU kernels and uses cuPyNumeric to distribute vectorized tasks over available resources and schedule them across many GPUs. Interested teams can clone XANI, treat it as a reference design, and adapt their own domain-specific steps—such as data ingestion, operator graphs, fitting, and reductions—to run with cuPyNumeric distributed execution for cluster-scale acceleration.
  • The same software stacks (CuPy, cuPyNumeric, and XANI) run on a spectrum of NVIDIA hardware, from NVIDIA DGX Spark, NVIDIA RTX PRO Servers, and workstation and desktop-class systems through 8-way servers and NVIDIA DGX SuperPODs equipped with the NVIDIA Grace Hopper and NVIDIA Grace Blackwell platforms, with unified memory simplifying the handling of datasets larger than a single device. Developers and researchers can begin by reproducing scaled-down workflows on a laptop, workstation, single DGX Spark, or small lab cluster, then move unchanged code to the cloud or larger on-premises DGX systems, using the open repos as templates and focusing effort on domain logic rather than rewriting for new hardware.
  • Adopt CUDA Python to attain fast processing and live-steering of scientific instruments and extract scientific insights in seconds.
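
A minimal sketch of that NumPy-to-CuPy-to-cuPyNumeric progression follows. The backend flag and the toy reduction are illustrative assumptions; the point is that the analysis function itself never changes across backends.

```python
import argparse

def get_array_module(backend):
    # Select the array backend at startup; the scientific code is identical.
    if backend == "numpy":
        import numpy as xp
    elif backend == "cupy":
        import cupy as xp
    else:  # "cupynumeric": distributed execution (launch via legate)
        import cupynumeric as xp
    return xp

def reduce_frames(xp, n_frames=64, size=512):
    # Toy analysis: per-pixel mean and variance over a frame stack.
    frames = xp.random.rand(n_frames, size, size)
    mean = frames.mean(axis=0)
    var = ((frames - mean) ** 2).mean(axis=0)
    return mean, var

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="numpy",
                        choices=["numpy", "cupy", "cupynumeric"])
    args = parser.parse_args()
    xp = get_array_module(args.backend)
    mean, var = reduce_frames(xp)
    print(float(mean.mean()), float(var.mean()))
```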

Benefits of adopting accelerated computing to enable live-steering experiments

Adopting accelerated computing to enable live-steering of scientific experiments offers numerous benefits, including:

  • Elastic scalability: The same Python code, powered by cuPyNumeric and CuPy, can be run unmodified on modest local clusters and then scaled out to exascale resources or supercomputer nodes when needed.
  • Shorter time to insight: Accelerated networking and device-level parallelism mean that data is processed as it arrives—enabling discoveries, experiment steering, or event detection on timescales aligned with the instrumentation.
  • Resource optimization: High-density, energy-efficient DGX Spark nodes deliver performance comparable to large-scale cluster racks in a compact office footprint.
  • Unified memory: Unlocks higher performance and flexibility for accelerating CPU-GPU workflows. With NVLink-C2C, the CPU and GPU share a single virtual address space for large data structures, up to 128 GB, with very high bandwidth, low latency, and concurrency. For physics-informed AI, this means simpler code and higher sustained throughput that is not constrained by a slower, higher-latency PCIe link; see the sketch after this list.
  • Collaborative science: Teams benefit from shared data, distributed compute jobs, and rapid workflow iteration—crucial for multi-institutional research, experiment repeatability, and open science.
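
As one illustration of tapping unified memory from Python, CuPy can route its allocations through CUDA managed memory, so arrays can exceed free GPU memory and migrate between CPU and GPU on demand; on NVLink-C2C systems this migration is cache-coherent and fast. The array size below is an arbitrary assumption.

```python
import cupy as cp

# Route all CuPy allocations through CUDA managed (unified) memory so a
# single virtual address space spans CPU and GPU.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# This array may exceed the GPU's free memory; the driver migrates pages
# on demand (size is illustrative; tune it to your system).
x = cp.random.rand(1 << 28)  # ~2 GB of float64
print(float(cp.sqrt(x).sum()))
```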

Get started with accelerated computing for science

XANI, cuPyNumeric, CuPy, and the broader NVIDIA accelerated computing stack are already powering production-scale astrophysics and ultrafast X-ray science. The same open source Python libraries and NVIDIA platforms are available for any researcher or developer to adopt in their own workflows.

XANI, CUDA Python, cuPyNumeric, and CuPy demonstrate a generational leap in scientific computing capability for exascale-era facilities such as the Rubin Observatory and LCLS-II. By merging local desktop-class hardware, scalable server infrastructure, scalable software, and high-performance networking, researchers can develop, test, and deploy massive data workflows faster and more flexibly than ever before. Whether analyzing a single sky survey or orchestrating a global experiment, NVIDIA accelerated computing empowers science teams to achieve real-time insight and discovery.

Get started with CUDA Python, cuPyNumeric, and CuPy.

Learn more at the NVIDIA GTC AI Conference with the session, Accelerated HPC+AI Workflow Enables Live-Steering of Vera C. Rubin Observatory and X-ray Free Electron Laser [S81766].

Acknowledgments

Thanks to Yusra AlSayyad and Nate Lust (Princeton University); Adam Bolton, Seshu Yamajala, and Jana Thayer (SLAC National Accelerator Laboratory); and Lucas Erlandson, Emilio Castillo Villar, Malte Foerster, and Irina Demeshko (NVIDIA) for their contributions.   
