For decades, computational chemistry has faced a tug-of-war between accuracy and speed. Ab initio methods like density functional theory (DFT) provide high fidelity but are computationally expensive, limiting researchers to systems of a few hundred atoms. Conversely, classical force fields are fast but often lack the chemical accuracy required for complex bond-breaking or transition-state analysis.
Machine learning interatomic potentials (MLIPs) have emerged as the bridge, offering quantum accuracy at classical speeds. However, the software ecosystem is a new bottleneck. While the MLIP models themselves run on GPUs, the surrounding simulation infrastructure often relies on legacy CPU-centric code.
NVIDIA ALCHEMI (AI Lab for Chemistry and Materials Innovation) helps address these challenges by accelerating chemicals and materials discovery with AI. We have previously announced two components of the ALCHEMI portfolio:
- ALCHEMI NIM microservices: Scalable, cloud‑ready microservices for AI-accelerated batched atomistic simulations in chemistry and materials science
- ALCHEMI Toolkit-Ops: A set of foundational GPU kernels designed to accelerate the calculations behind simulations, such as neighbor lists, dispersion corrections, and electrostatics
Today, we are introducing the NVIDIA ALCHEMI Toolkit, a collection of GPU-accelerated simulation building blocks that incorporates and expands on ALCHEMI Toolkit-Ops. The Toolkit manages the data flow between accelerated, domain-specific chemistry and materials kernels and deep learning models, and it extends beyond individual models and kernels to provide a modular, PyTorch-native structure for researchers and developers to compose custom simulation workflows.
Figure 1 shows the ALCHEMI architectural stack and product features supported in this initial release of ALCHEMI Toolkit, including expanded functionality in Toolkit-Ops. This release includes capabilities for geometry relaxation and molecular dynamics, and the supporting pipeline infrastructure for combining multiple simulation workflows.

How does ALCHEMI Toolkit advance digital chemistry?
ALCHEMI Toolkit is not just a collection of scripts. It’s designed to enable researchers and developers to build custom, performant atomistic simulation workflows with ease.
Expanding ALCHEMI Toolkit-Ops
ALCHEMI Toolkit leverages the capabilities of Toolkit-Ops to handle the underlying calculations of the simulations. The previous release included several key operations:
- Neighbor list constructions
- DFT-D3 dispersion corrections
- Long-range electrostatic interactions
This release broadens the scope of common operations addressed to include:
- Batched dynamics kernels
- JAX support (for v0.2.0 release features)
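To make concrete what operations such as neighbor-list construction compute, here is a minimal O(N²) NumPy reference (illustrative only; Toolkit-Ops replaces this kind of loop with optimized, batched GPU kernels):

```python
import numpy as np

def brute_force_neighbor_list(positions: np.ndarray, cutoff: float) -> np.ndarray:
    """Reference neighbor list: all pairs (i, j), i != j, within cutoff.

    Illustrative CPU sketch only; not the Toolkit-Ops implementation.
    """
    deltas = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(deltas, axis=-1)
    # exclude self-pairs on the diagonal
    mask = (dist < cutoff) & ~np.eye(len(positions), dtype=bool)
    i, j = np.nonzero(mask)
    return np.stack([i, j])  # COO edge index, shape (2, num_pairs)

positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
edges = brute_force_neighbor_list(positions, cutoff=2.0)
# only atoms 0 and 1 are within the cutoff of each other
```

The GPU kernels perform this same pair search across an entire batch of systems in a single launch, which is where the speedups reported below come from.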
Integration with the atomistic simulation ecosystem
ALCHEMI Toolkit is designed to integrate seamlessly with the broader atomistic simulation ecosystem. We’re excited to announce the following integrations with leading platforms in the chemistry and materials science community.
Orbital
Orbital develops advanced AI foundation models used to accelerate the discovery of novel cooling systems for data centers and sustainable materials. Orbital has integrated ALCHEMI Toolkit into their new OrbMolv2 model to drastically reduce the time required for inference. The new model will leverage ALCHEMI Toolkit components such as PME electrostatics for periodic Coulomb interactions and the MTK integrator for batched constant-pressure molecular dynamics. The existing Orb models already leverage Toolkit-Ops for GPU-accelerated graph construction, providing a ~1.7x acceleration for large systems and ~33x for batched smaller systems with TorchSim support.
Materials Graph Library (MatGL)
MatGL is an open source framework for state-of-the-art graph-based MLIPs. ALCHEMI Toolkit is integrating with the MatGL TensorNet model to significantly accelerate materials simulations and property predictions workflows. By leveraging ALCHEMI Toolkit GPU-native kernels and batching infrastructure, MatGL users can achieve higher computational efficiency and lower memory consumption for simulations at scale.
Matlantis
Matlantis enables rapid materials discovery by combining universal MLIPs with high-performance cloud computing. Matlantis is actively exploring the ALCHEMI Toolkit and identifying where its composable dynamics can deliver the greatest value for industrial materials simulation customers. This builds on its proven integration of ALCHEMI Toolkit-Ops—including Warp-optimized neighbor list construction and DFT-D3 dispersion corrections—which significantly reduces computational overhead of atomistic interactions with speedups of up to 10x.
Furthermore, by evaluating specific components within ALCHEMI Toolkit, this collaboration has the potential to enable Matlantis to move beyond single-structure optimization to high-throughput, parallel relaxation of millions of molecular configurations. Ultimately, this integration aims to further power small-scale research and industrial-scale materials design, accelerating chemical evaluation with unparalleled GPU efficiency.
How to get started with ALCHEMI Toolkit
ALCHEMI Toolkit is straightforward to set up. This section covers the system requirements and installation options.
System and package requirements
- Python ≥3.11, <3.14
- PyTorch ≥2.8
- CUDA Toolkit 12+, NVIDIA driver 470.57.02+
- Operating System: Linux (primary), macOS
- NVIDIA GPU (RTX 20xx or newer), CUDA Compute Capability ≥ 7.0
- Minimum 4 GB RAM (16 GB recommended for large systems)
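A quick sanity check against these requirements, using only the standard library plus PyTorch if it is already installed, might look like:

```python
import sys

def check_python() -> bool:
    # Python >=3.11, <3.14 per the requirements above
    return (3, 11) <= sys.version_info[:2] < (3, 14)

def check_gpu() -> str:
    # CUDA compute capability >= 7.0 per the requirements above
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability()
    return "OK" if capability >= (7, 0) else f"compute capability {capability} too low"

print("Python:", "OK" if check_python() else "unsupported")
print("GPU:", check_gpu())
```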
Installation
Use the following code to install ALCHEMI Toolkit:
# Install Atomic Simulation Environment (ASE, used in the examples below)
uv pip install ase
# Using pip
pip install nvalchemi-toolkit
# Using uv
uv venv --seed --python 3.12
uv pip install nvalchemi-toolkit
# Install from source
git clone https://github.com/NVIDIA/nvalchemi-toolkit.git
cd nvalchemi-toolkit
uv sync --all-extras
# Add nvalchemi as a project dependency
uv add nvalchemi-toolkit
For more information, reference the NVIDIA/nvalchemi-toolkit GitHub repo and the ALCHEMI Toolkit documentation.
Key features of ALCHEMI Toolkit for building end-to-end workflows
This section dives into four core ALCHEMI Toolkit features: customizable batched simulation workflows, build-your-own dynamics classes, model wrappers, and advanced data management. These features provide researchers and developers with the tools and flexibility needed to create bespoke end-to-end workflows that maximize efficiency and performance on NVIDIA GPUs.
Customizable batched simulation workflows
The distinctive feature of the NVIDIA ALCHEMI Toolkit is the GPU-native batched dynamics engine. No single MLIP model is perfect for every chemical environment, especially when dealing with nonlocal, long-range interactions.
ALCHEMI Toolkit enables researchers to combine modular chemistry and materials science domain-specific kernels and models into customized simulation workflows. This architecture supports the development of specialized compute workflows and running virtual laboratories with millions of concurrent atomic interactions without the latency of traditional software stacks.
Capabilities
- Composable calculators combining MLIPs with physics-based corrections
- High-performance wrappers (MACE, TensorNet, AIMNet2)
API example
The following example constructs the data, sets up the MLIP, and configures a FIRE2 geometry optimization that is then used as a starting point for velocity Verlet (microcanonical) dynamics:
from ase import Atoms
from nvalchemi.data import AtomicData, Batch
from nvalchemi.dynamics import ConvergenceHook
from nvalchemi.dynamics.optimizers import FIRE2
from nvalchemi.dynamics.integrator import VelocityVerlet

# set up a batch of atomic structures
atomic_data = [AtomicData.from_atoms(Atoms(...), device="cuda") for _ in range(16)]
batch = Batch.from_data_list(atomic_data)

# set up your MLIP and dynamics classes
mlip = ...

# optimizer convergence depends on the force norm and max values
conv_criteria = ConvergenceHook(
    criteria=[
        {"key": "forces", "threshold": 0.05, "reduce_op": "norm"},
        {"key": "forces", "threshold": 0.1, "reduce_op": "max"},
    ]
)
optimizer = FIRE2(
    mlip,
    convergence_hook=conv_criteria,
    n_steps=200,
)
velverlet = VelocityVerlet(mlip, n_steps=1000)
You can run and scale the simulation pipelines in one of two ways: on a single GPU, or across multiple CPUs and GPUs.
Run and scale the pipeline on a single GPU: The FusedStage class is formed by “adding” two or more dynamics objects together. This enables wrapping the end-to-end workflow in torch.compile and sharing CUDA stream contexts.
fused = optimizer + velverlet

# context manager handles compilation and CUDA stream
with fused:
    # runs 200 steps of optimization and 1000 steps of MD
    fused.run(batch)
With this approach, you can easily build simulation workflows in which samples within the batch advance to the next stage as soon as they converge, making optimal use of your GPU.
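The `+` composition above relies on Python operator overloading. A hypothetical sketch of the pattern (not the actual FusedStage implementation, which also handles `torch.compile` and CUDA streams):

```python
class Stage:
    def __init__(self, name: str, n_steps: int):
        self.name = name
        self.n_steps = n_steps

    def run(self, batch):
        # a real stage would advance `batch` by n_steps of dynamics
        return f"{self.name} x{self.n_steps}"

    def __add__(self, other: "Stage") -> "FusedStage":
        return FusedStage([self, other])

class FusedStage:
    def __init__(self, stages: list):
        self.stages = stages

    def __enter__(self):
        # a real implementation compiles the fused graph and enters a CUDA stream
        return self

    def __exit__(self, *exc):
        return False

    def run(self, batch):
        # stages run back to back on the same batch and device context
        return [stage.run(batch) for stage in self.stages]

fused = Stage("FIRE2", 200) + Stage("VelocityVerlet", 1000)
with fused:
    results = fused.run(batch=None)  # ['FIRE2 x200', 'VelocityVerlet x1000']
```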
Run and scale the pipeline across multiple CPUs and GPUs: The second approach is to distribute the pipeline across multiple CPUs and GPUs. Using the pipe operator on two dynamics classes distributes the FIRE2 optimization onto one GPU and the velocity Verlet integration onto another.
pipeline = optimizer | velverlet
# equivalent to manual allocation with explicit producer/consumer:
# optimizer.next_rank = 1, velverlet.prior_rank = 0
# DistributedPipeline({0: optimizer, 1: velverlet})
with pipeline:
    pipeline.run(batch)
While this example is deliberately simplified for illustrative purposes, this abstraction allows users to scale a pipeline up to multiple GPUs on a node, and out across multiple nodes to arbitrarily large datasets and numbers of ranks.
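As with `+`, the pipe operator is plain operator overloading. A hypothetical sketch of how `|` could wire producer and consumer ranks (the real DistributedPipeline additionally moves batches between ranks over torch.distributed):

```python
class Stage:
    def __init__(self, name: str):
        self.name = name
        self.prior_rank = None  # rank this stage receives batches from
        self.next_rank = None   # rank this stage sends results to

    def __or__(self, other: "Stage") -> dict:
        # producer on rank 0 feeds consumer on rank 1
        self.next_rank = 1
        other.prior_rank = 0
        return {0: self, 1: other}

pipeline = Stage("FIRE2") | Stage("Langevin")
# a rank-to-stage mapping: {0: <FIRE2 stage>, 1: <Langevin stage>}
```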
The following example configures eight GPUs to run geometry optimization and pipelines the results to Langevin dynamics running on another eight GPUs:
from torch import distributed as dist
from torch.utils.data.distributed import DistributedSampler
from nvalchemi.data.datapipes import Dataset, DataLoader

# set up distributed; torchrun --nproc-per-node 8 --nnodes 2 ...
dist.init_process_group()

# set up data and distributed sampler
dataset = Dataset(...)
data_sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
)
loader = DataLoader(
    dataset,
    batch_size=128,
    sampler=data_sampler,
    use_stream=True,
)

# configure your pipeline; 8 ranks do optimization, 8 do Langevin dynamics
optimizers = [FIRE2(mlip, ..., next_rank=index + 8) for index in range(8)]
dynamics = [Langevin(mlip, ..., prior_rank=index) for index in range(8)]
pipeline = DistributedPipeline(
    {index: stage for index, stage in enumerate(optimizers + dynamics)}
)
with pipeline:
    for batch in loader:
        pipeline.run(batch)
Build-your-own dynamics classes
ALCHEMI Toolkit offers a modular architecture to build and customize dynamics classes from the ground up. This enables the community to integrate new sampling methods or thermodynamic ensembles into the ALCHEMI environment while maintaining direct access to the underlying kernels.
Capabilities
- Specialized GPU-first trajectory analysis tools
- Integrated and customizable dynamics kernels (Velocity Verlet, NPT, Langevin thermostats)
- FIRE and FIRE2 optimizers
API example
from enum import Enum

import torch

from nvalchemi.data import Batch
from nvalchemi.dynamics import ConvergenceHook
from nvalchemi.dynamics.base import BaseDynamics, DynamicsStage
from nvalchemi.hooks import Hook, HookContext
from nvalchemi.models.base import BaseModelMixin

class MySimulatedAnnealer(Hook):
    def __init__(
        self,
        t_start: float,
        t_end: float,
        cooldown_steps: int,
        frequency: int,
        stage: DynamicsStage,
    ) -> None:
        # this hook fires every `frequency` MD steps,
        # bringing the temperature from `t_start` to `t_end`
        self.frequency = frequency
        self.t_start = t_start
        self.t_end = t_end
        self.cooldown_steps = cooldown_steps
        self.stage = stage
        self.decay = (t_end / t_start) ** (1.0 / cooldown_steps)

    def __call__(self, ctx: HookContext, stage: Enum) -> None:
        # access the calling dynamics class through `HookContext`
        dynamics = ctx.workflow
        dynamics.target_temperature = max(
            dynamics.target_temperature * self.decay,
            self.t_end,
        )

class VelocityVerlet(BaseDynamics):
    __needs_keys__ = {"energies", "forces", "masses", "velocities"}
    __provides_keys__ = {"positions"}

    def __init__(
        self,
        model: BaseModelMixin,
        n_steps: int,
        dt: float = 1.0,  # timestep
        target_temperature: float = 300.0,  # initial temperature
        tau: float = 10.0,  # coupling constant
        hooks: list[Hook] | None = None,
        convergence_hook: ConvergenceHook | dict | None = None,
        **kwargs,
    ):
        super().__init__(
            model=model, n_steps=n_steps, hooks=hooks, convergence_hook=convergence_hook
        )
        self.dt = dt
        self.target_temperature = target_temperature
        self.tau = tau
        self._prev_accelerations = None

    def pre_update(self, batch: Batch) -> None:
        # perform the first half of velocity Verlet
        with torch.no_grad():
            accelerations = batch.forces / batch.masses
            self._prev_accelerations = accelerations.clone()
            batch.positions.add_(
                batch.velocities * self.dt + 0.5 * accelerations * self.dt**2.0
            )

    def post_update(self, batch: Batch) -> None:
        # perform the second half of velocity Verlet, with thermostat
        # temperature update
        with torch.no_grad():
            new_accelerations = batch.forces / batch.masses
            batch.velocities.add_(
                0.5 * (self._prev_accelerations + new_accelerations) * self.dt
            )
            ke_per_atom = 0.5 * batch.masses * (batch.velocities**2).sum(dim=-1, keepdim=True)
            # get the total kinetic energy per system
            total_ke = scatter_add_(...)
            current_temp = 2.0 * total_ke / (batch.num_atoms * 3.0)
            ratio = self.target_temperature / current_temp
            lam = torch.sqrt(
                torch.tensor(1.0 + (self.dt / self.tau) * (ratio - 1.0))
            ).clamp(min=0.8, max=1.2)  # clamp for stability
            batch.velocities.mul_(lam)

# configure the new dynamics class
my_velverlet = VelocityVerlet(
    ...,
    hooks=[
        MySimulatedAnnealer(
            t_start=900.0,
            t_end=300.0,
            cooldown_steps=10,
            frequency=100,
            stage=DynamicsStage.BEFORE_STEP,
        )
    ],
)
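The thermostat in `post_update` rescales velocities by a Berendsen-style weak-coupling factor, λ = sqrt(1 + (Δt/τ)(T_target/T − 1)), clamped to [0.8, 1.2]. A quick pure-Python check of that factor (the helper name below is ours, not a Toolkit API):

```python
import math

def scaling_factor(dt: float, tau: float, t_target: float, t_current: float,
                   lo: float = 0.8, hi: float = 1.2) -> float:
    # same expression as `lam` in post_update above, clamped for stability
    lam = math.sqrt(1.0 + (dt / tau) * (t_target / t_current - 1.0))
    return min(max(lam, lo), hi)

# at the target temperature the factor is exactly 1 (no rescaling)
assert scaling_factor(1.0, 10.0, 300.0, 300.0) == 1.0
# a cold system is gently heated (factor > 1), a hot one cooled (factor < 1)
assert scaling_factor(1.0, 10.0, 300.0, 200.0) > 1.0
assert scaling_factor(1.0, 10.0, 300.0, 400.0) < 1.0
```

The clamp bounds keep a single step from changing velocities by more than ±20%, which prevents the thermostat from destabilizing the trajectory when the instantaneous temperature is far from the target.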
Model wrappers
With ALCHEMI Toolkit, you can use your own pretrained models with accelerated physics components. It provides the essential infrastructure for importing your own models into the pipeline, ensuring that proprietary or domain-specific architectures can leverage GPU-native orchestration. This abstracts the complexity of different model types, providing a standardized path to move from a standalone model to a production-ready, high-throughput simulation.
Capabilities
- MLIP support (MACE, TensorNet, AIMNet2)
- Composable calculators
- Standardized model configuration
API example
from typing import Any

import torch.nn as nn
from beartype import beartype

from super_mlip import BestMLIPModel
from nvalchemi._typing import ModelOutputs
from nvalchemi.data import Batch
from nvalchemi.models.base import BaseModelMixin, ModelConfig, NeighborConfig

class BestMLIPWrapper(nn.Module, BaseModelMixin):
    def __init__(self, model: BestMLIPModel, **kwargs):
        super().__init__(**kwargs)
        # ModelConfig declares model capabilities (which are frozen)
        # and runtime control (mutable) for the rest of the framework
        self.model_config = ModelConfig(
            outputs=frozenset({"energy", "forces", "hessians"}),
            # this is actually the default value
            required_inputs=frozenset({"positions", "atomic_numbers"}),
            autograd_outputs=frozenset({"forces"}),
            neighbor_config=NeighborConfig(cutoff=5.0, format="coo"),
        )

    def adapt_input(self, data: Batch, **kwargs) -> dict[str, Any]:
        # adapts the nvalchemi data structure to what is
        # expected by the model
        model_inputs = super().adapt_input(data, **kwargs)
        # dict structure expected by BestMLIPModel
        model_inputs["atom_numbers"] = data.atomic_numbers
        model_inputs["coords"] = data.positions
        return model_inputs

    def adapt_output(self, model_output: Any, data: Batch) -> ModelOutputs:
        # adapt the model outputs from the model's forward pass to the
        # format expected by nvalchemi
        output = super().adapt_output(model_output, data)
        output["energies"] = model_output["energies"]
        # check model config for expected outputs
        if "forces" in self.model_config.active_outputs:
            output["forces"] = model_output["forces"]
        return output

    # beartype decorator is optional, but will runtime type check arguments
    @beartype
    def forward(self, data: Batch, **kwargs) -> ModelOutputs:
        model_inputs = self.adapt_input(data, **kwargs)
        # calls BestMLIPModel's forward definition based on MRO
        model_outputs = super().forward(**model_inputs)
        return self.adapt_output(model_outputs, data)
Advanced data management
Traditionally, the “memory tax” of moving data between the CPU and GPU is a significant bottleneck in AI-driven discovery. ALCHEMI Toolkit acts as the specialized orchestrator for scientific data, providing the infrastructure required to build custom ingestion pipelines to move information from standard research files into optimized GPU tensors.
This enables discovery at scale, making industrial-scale simulations accessible through familiar interfaces. By standardizing how atomic information is represented and loaded, ALCHEMI Toolkit keeps data resident on the device, so the entire simulation stays on the GPU, enabling batched simulations that maximize GPU utilization and eliminate communication overhead.
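The batching idea can be sketched as follows (a common flat-batching pattern, shown here in NumPy for illustration; this is not Toolkit internals): variable-size systems are concatenated into one flat array plus an index vector mapping each atom to its system, so a single kernel launch covers the whole batch.

```python
import numpy as np

# three systems with 5, 3, and 8 atoms
systems = [np.random.rand(n, 3) for n in (5, 3, 8)]

# flatten into a single (16, 3) positions array
positions = np.concatenate(systems, axis=0)
# index vector mapping each atom back to its system:
# [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
batch_index = np.repeat(np.arange(len(systems)), [len(s) for s in systems])

# per-system reductions become segmented operations over batch_index,
# e.g. the centroid of each system:
centroids = np.stack(
    [positions[batch_index == i].mean(axis=0) for i in range(len(systems))]
)
```

Because every per-system quantity is recoverable from `batch_index`, systems of different sizes can share one device-resident tensor with no padding and no per-system kernel launches.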
Capabilities
- High-performance data loaders
- ASE and Pymatgen interface
- AtomicData and batch objects
API example
from ase.build import fcc111

from nvalchemi import AtomicData, Batch
from nvalchemi import data

atoms = fcc111(...)

# Create an AtomicData object from an ase.Atoms object
atomic_data = AtomicData.from_atoms(atoms, device="cuda")
atomic_data.node_properties
atomic_data.system_properties

# Create a Batch object from a list of AtomicData
batch = Batch.from_data_list([atomic_data, atomic_data, atomic_data])
batch.num_graphs
batch.get_data(0)
# get the first two samples
batch[:2]
# boolean-mask and key-based indexing
batch[mask]
batch["energies"]
# build a batch directly from a list of ase.Atoms
Batch.from_atoms([atoms, ...])

# Create a dataset from ase.Atoms
writer = data.AtomicDataZarrWriter("atom_dataset.zarr")
# writer amortizes overhead by writing batches of data;
# this is equivalent to writing individual samples, but more efficient
writer.write(batch)

# Read the data back from zarr
reader = data.AtomicDataZarrReader("atom_dataset.zarr")
# Dataset handles devices natively: individual samples are placed on the GPU,
# which accelerates preprocessing transforms;
# num_workers sets the number of threads used for async prefetching
dataset = data.Dataset(reader, device="cuda", num_workers=4)
dataloader = data.DataLoader(dataset, batch_size=16)
for batch in dataloader:
    ...  # do something with batch
Get started building molecular workflows with ALCHEMI Toolkit
ALCHEMI Toolkit provides researchers and developers with the low-level primitives and high-level abstractions needed to build end-to-end, GPU-native molecular workflows. Moving critical bottlenecks—such as neighbor list construction, structural relaxation, and integration steps—into the PyTorch ecosystem eliminates the host-to-device memory transfer overhead that has traditionally throttled MLIP-driven simulations.
Whether you’re composing hybrid ML or physics potentials or scaling batched molecular dynamics, ALCHEMI Toolkit exposes the necessary API hooks to manage complex tensorized states without sacrificing performance.
To accelerate your chemistry and materials science simulations and explore building your own custom workflows, visit the NVIDIA/nvalchemi-toolkit GitHub repo and ALCHEMI Toolkit documentation. As we continue to expand the library of supported operations and architectures, we encourage you to clone the repository, explore the provided Jupyter notebooks, and begin integrating these GPU-accelerated workflows into your own discovery pipelines.
Acknowledgments
We’d like to thank James Gin, Tim Duignan, Vaidas Šimkus of Orbital; Professor Shyue Ping Ong of MatGL; Susumu Ohno, Ryuhei Okuno, Jethro Tan of Matlantis for working with us to adopt NVIDIA ALCHEMI Toolkit into their platforms. We would also like to thank Nikita Fedik, Roman Zubatyuk, Atul Thakur, and Logan Ward for their contributions to this post.