Gargi Prasad

Gargi Prasad is the program lead for resilience in DGX Cloud at NVIDIA. Her main focus areas are AI infrastructure resilience and performance optimization. Prior to NVIDIA, Gargi worked at Meta in Core Infra, serving large-scale distributed systems. She has more than 15 years of industry experience, with expertise in software and systems engineering and architecture. Gargi has a master’s degree in Computer Science from Delft University of Technology with a specialization in Parallel & Distributed Systems.

Posts by Gargi Prasad

Networking / Communications

Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,... 7 MIN READ
Networking / Communications

Enhancing Communication Observability of AI Workloads with NCCL Inspector

When using the NVIDIA Collective Communication Library (NCCL) to run a deep learning training or inference workload that uses collective operations (such as... 6 MIN READ