GTC Silicon Valley 2019, ID S9449: Building a Distributed GPU DataFrame with Python
We'll discuss the GPU Open Analytics Initiative, an effort to develop a GPU data frame that can handle large-scale data-analytics workflows and support out-of-core cases in which the data is larger than GPU memory. We'll describe how we divided the problem into two parts: first developing an elementary single-GPU data frame to handle in-memory use cases, and then combining multiple single-GPU data frames into a distributed multi-GPU data frame for out-of-core use cases. We'll briefly introduce our distributed GPU data frame and its capabilities. We'll then explain how we scaled out by using Dask, a distributed computation framework in Python, to orchestrate the single-GPU data frames and achieve out-of-core capability with minimal effort. Our idea can be generalized to build custom distributed GPU computations by composing single-GPU libraries.
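The composition idea described above can be illustrated with a minimal, CPU-only sketch. It is not the talk's implementation and uses neither Dask nor a GPU library: `SingleFrame`, `PartitionedFrame`, and `make_loader` are hypothetical names, a standard-library thread pool stands in for the scheduler, and plain Python lists stand in for GPU columns. The sketch only shows the pattern of aggregating each in-memory partition separately and then merging the partial results, so that the full dataset never has to be resident at once.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class SingleFrame:
    """Hypothetical stand-in for a single-GPU data frame that fits in memory."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def groupby_sum(self, key, value):
        # In-memory aggregation over one partition only.
        totals = defaultdict(float)
        for k, v in zip(self.columns[key], self.columns[value]):
            totals[k] += v
        return dict(totals)


class PartitionedFrame:
    """Hypothetical distributed frame: a list of lazily loaded partitions."""

    def __init__(self, loaders):
        self.loaders = loaders  # each loader returns one SingleFrame on demand

    def groupby_sum(self, key, value, workers=2):
        def partial(loader):
            # Load one partition, aggregate it, and return only the small result.
            return loader().groupby_sum(key, value)

        merged = defaultdict(float)
        # The thread pool plays the orchestrator role (an assumption for this sketch).
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for part in pool.map(partial, self.loaders):
                # Combine per-partition partial sums into the global result.
                for k, v in part.items():
                    merged[k] += v
        return dict(merged)


def make_loader(rows):
    # Simulates reading one chunk from storage only when it is needed.
    return lambda: SingleFrame({"key": [r[0] for r in rows],
                                "val": [r[1] for r in rows]})


chunks = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)], [("b", 4.0), ("c", 5.0)]]
frame = PartitionedFrame([make_loader(c) for c in chunks])
result = frame.groupby_sum("key", "val")
print(result)
```

In this sketch at most `workers` partitions are loaded at a time, which is the property that makes the out-of-core case workable; a real scheduler would additionally handle placement across devices and machines.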