Shelby Thomas

Shelby is the product lead for training reliability in DGX Cloud at NVIDIA. Before NVIDIA, he worked at OctoAI on accelerating ML model deployment across diverse hardware platforms and developed deep learning models at Google. He received his Ph.D. in computer science from UC San Diego.

Posts by Shelby Thomas

Data Center / Cloud

Ensuring Reliable Model Training on NVIDIA DGX Cloud

Training AI models on massive GPU clusters presents significant challenges for model builders. Because manual intervention becomes impractical as job scale...
8 MIN READ