Developer Blog

Cloudera and NVIDIA Collaborate to Accelerate Data Analytics and AI at Scale

Data engineering and data science workflows are often limited by how well platforms can process rapidly growing volumes of data. The integration of the Cloudera Data Platform (CDP), the RAPIDS Accelerator for Apache Spark 3.0, and NVIDIA computing, announced April 12, 2021, enables accelerated, scalable big data pre-processing and workflows without code changes. With Cloudera CDP and the power of NVIDIA computing, customers like the IRS can accelerate data processing and model training at a lower cost across any on-premises, public cloud, or hybrid cloud deployment.

“We need to be able to make accurate decisions at speed utilizing vast swathes of data. That challenge is ever-evolving as data volumes and velocities continue to increase,” said Joe Ansaldi, IRS/Research Applied Analytics & Statistics Division (RAAS)/Technical Branch Chief. “The Cloudera and NVIDIA integration will empower us to use data-driven insights to power mission-critical use cases such as fraud detection. We are currently implementing this integration, and are already seeing over three times speed improvements for our data engineering and data science workflows.”

How do NVIDIA GPUs on CDP enable fast and robust computation for end-to-end ML workflows?

GPUs, with their massively parallel architecture, have driven the advancement of deep learning (DL) and machine learning (ML) model training over the past several years. With GPUs, you can exploit data parallelism through columnar data processing instead of the traditional row-based reading originally designed for CPUs. This provides higher performance and cost savings. But for modern data workflows, data science teams need both a robust platform that facilitates collaboration and robust computing frameworks that go beyond GPU-accelerated model building.
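To make the row-based versus columnar distinction concrete, here is a minimal CPU-only sketch in plain Python. The sample records and function names are invented for illustration. It shows only the difference in data layout, not GPU execution: with a columnar layout, an aggregation reads one contiguous array and never touches the other fields, which is the access pattern that parallel hardware can exploit.

```python
from array import array

# Row-oriented layout: the fields of each record are stored together.
rows = [
    ("east", 120.0, 3),
    ("west", 80.0, 1),
    ("east", 45.5, 2),
    ("north", 10.0, 7),
]

# Column-oriented layout: each field is stored as its own contiguous array.
columns = {
    "region": [r[0] for r in rows],
    "amount": array("d", (r[1] for r in rows)),
    "quantity": array("q", (r[2] for r in rows)),
}


def total_amount_row_based(records):
    """Walk every record and pick one field out of each."""
    total = 0.0
    for _region, amount, _quantity in records:
        total += amount
    return total


def total_amount_columnar(cols):
    """Reduce a single contiguous column; the other fields are never read."""
    return sum(cols["amount"])


row_total = total_amount_row_based(rows)
col_total = total_amount_columnar(columns)
```

Both functions return the same total. The difference is in how much unrelated data each one has to step over to get there.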

What is the Cloudera Data Platform?

Cloudera Data Platform (CDP) is a software framework that provides big data management and analytics services for enterprises across hybrid public cloud, private cloud, and multi-cloud environments. CDP can manage data and data workloads, spin up the necessary cluster infrastructure and software, and scale them up or down on demand, both on-premises and across the three major public clouds. CDP lets you structure and optimize data and data processing where they are best suited, and allows existing on-prem implementations to “burst to the cloud” for scaling and performance.

What is NVIDIA Accelerated End-to-End Data Science?

The RAPIDS suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces such as the pandas and scikit-learn APIs.
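As a small illustration of what "familiar interfaces" can look like in practice, the sketch below uses cuDF, the RAPIDS DataFrame library, with a filter and group-by written just as they would be in pandas. The sample data and the `as_dict` helper are invented for illustration, and the fallback to pandas is an assumption added here so the snippet also runs on a machine without RAPIDS installed.

```python
try:
    import cudf as xdf  # RAPIDS GPU DataFrame library
except ImportError:
    import pandas as xdf  # CPU fallback (assumption: lets the sketch run without RAPIDS)

# Hypothetical sample data.
df = xdf.DataFrame(
    {
        "region": ["east", "west", "east", "north"],
        "amount": [120.0, 80.0, 45.5, 10.0],
    }
)

# The same filter and group-by expressions work with either library.
large = df[df["amount"] > 40.0]
totals = large.groupby("region")["amount"].sum()


def as_dict(series):
    """Return a plain dict, copying from GPU memory first if needed."""
    if hasattr(series, "to_pandas"):
        series = series.to_pandas()
    return series.to_dict()


result = as_dict(totals)
```

Only the import line differs between the GPU and CPU paths in this sketch. Real pipelines may rely on pandas features that cuDF handles differently, so treat it as a starting point.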

The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS library and the scale of the Apache Spark distributed computing framework to accelerate SQL and DataFrame data processing with GPUs without code changes.  
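One way to read "without code changes" is that the accelerator is switched on through Spark configuration, not through edits to SQL or DataFrame code. The sketch below builds such settings as a plain dictionary. `spark.plugins` and `spark.rapids.sql.enabled` are settings used by the open-source RAPIDS Accelerator. The helper names are hypothetical, and the article does not describe how CDP applies these settings, so this is an assumption about a generic Spark 3 deployment. It also assumes the accelerator jar is already on the Spark classpath.

```python
def rapids_spark_conf(enable_gpu_sql=True):
    """Build Spark settings that load the RAPIDS Accelerator plugin.

    Hypothetical helper: assumes the plugin jar is already available
    to the Spark driver and executors.
    """
    return {
        # Registers the accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Toggles GPU execution of supported SQL/DataFrame operations.
        "spark.rapids.sql.enabled": str(enable_gpu_sql).lower(),
    }


def to_submit_args(conf):
    """Render settings as spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = rapids_spark_conf()
submit_args = to_submit_args(conf)
```

The application code is not touched at any point: the same job can be submitted with or without these arguments.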

In addition, RAPIDS integration with ML/DL frameworks and Spark 3.x GPU task scheduling enables the acceleration of model training and tuning. This allows data scientists and ML engineers to have a unified, GPU-accelerated pipeline for ETL and analytics, while ML and DL applications leverage the same GPU infrastructure, removing bottlenecks, increasing performance, and simplifying clusters. For IT teams, the simplest infrastructure path to enabling this accelerated data science is to deploy NVIDIA Certified Servers.
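Spark 3.x GPU task scheduling is driven by resource settings on executors and tasks. The sketch below is a simplified model of how those settings bound the number of tasks that can run at once on one executor. The setting names are standard Spark 3 configuration keys. The function names and example values are invented for illustration, and the model ignores everything else the scheduler considers.

```python
from fractions import Fraction


def gpu_task_slots(executor_gpus, task_gpu_amount):
    """Tasks that the executor's GPUs can host at once (simplified model).

    A fractional per-task amount (for example "0.25") lets several tasks
    share one GPU; a whole-number amount gives each task that many GPUs.
    """
    amount = Fraction(task_gpu_amount)
    if amount < 1:
        return executor_gpus * int(1 // amount)
    return int(executor_gpus // amount)


def concurrent_tasks(conf):
    """Concurrency is capped by whichever resource runs out first."""
    by_cpu = int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])
    by_gpu = gpu_task_slots(
        int(conf["spark.executor.resource.gpu.amount"]),
        conf["spark.task.resource.gpu.amount"],
    )
    return min(by_cpu, by_gpu)


# Hypothetical executor: 8 cores, 1 GPU, each task asks for a quarter of a GPU.
example_conf = {
    "spark.executor.cores": "8",
    "spark.task.cpus": "1",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
}
slots = concurrent_tasks(example_conf)
```

In this example the GPU share, not the core count, is the limiting factor. Lowering the per-task GPU amount raises concurrency, at the cost of more tasks competing for the same GPU memory.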

CDP Powered by NVIDIA Accelerated Data Science

While leveraging GPUs to accelerate end-to-end ETL and ML workflows has been difficult in the past, enabling this capability on CDP powered by NVIDIA is turnkey. Data scientists can use best-in-class GPU computing frameworks from NVIDIA natively in CDP on any cloud and on-premises through CDP Private Cloud Base. Cloudera, together with NVIDIA, makes it easier than ever to optimize data science workflows and execute compute-heavy processes in a fraction of the time previously required.

To find out about the CDP powered by NVIDIA roadmap, performance results, and more, attend the NVIDIA and Cloudera GTC session: Enabling Machine Learning at Scale: Accelerating Big Data Workloads on Cloudera Data Platform with NVIDIA RAPIDS [S31947].