As consumer applications generate more data than ever before, enterprises are turning to causal inference methods for observational data to help shed light on how changes to individual components of their app impact key business metrics.
Over the last decade, econometricians have developed a technique called double machine learning that brings the power of machine learning models to causal inference problems. It involves training two predictive nuisance models on independent samples of the data and combining them to form a debiased estimate of the causal effect of interest.
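To make the idea concrete, here is a minimal partialling-out sketch on synthetic data using scikit-learn. This is an illustrative toy, not the full cross-fitted procedure the DoubleML library implements, and the data-generating process is made up for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))           # confounders
d = X[:, 0] + rng.normal(size=n)      # treatment depends on confounders
theta = 0.5                           # true causal effect of d on y
y = theta * d + X[:, 0] ** 2 + rng.normal(size=n)

# Two nuisance models, predicting y from X and d from X,
# using out-of-fold predictions so each model is evaluated
# on data it was not trained on
ml_l = RandomForestRegressor(n_estimators=100, random_state=0)
ml_m = RandomForestRegressor(n_estimators=100, random_state=0)
y_res = y - cross_val_predict(ml_l, X, y, cv=2)
d_res = d - cross_val_predict(ml_m, X, d, cv=2)

# Regressing residual on residual gives a debiased estimate of theta
theta_hat = (d_res @ y_res) / (d_res @ d_res)
print(f"estimated effect: {theta_hat:.2f}")  # close to the true 0.5
```

A real DoubleML workflow layers cross-fitting bookkeeping and valid standard errors on top of this residual-on-residual regression.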
Open-source Python libraries like DoubleML make it easy for data scientists to tap into this technique, but on CPUs they can struggle with the dataset sizes enterprises need to process.
RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuML is a GPU-accelerated machine learning library for Python with a scikit-learn compatible API.
In this blog post, we illustrate how you can use RAPIDS cuML with the DoubleML library for faster causal inference, enabling you to more effectively work with large datasets.
Why causal inference?
Many data science and machine learning use cases are more focused on the quality of predictions than the exact effect sizes of individual features on the outcome variable. As a result, non-parametric models like random forest (available in scikit-learn) and XGBoost have become a go-to choice for many data scientists.
For some problems, we need to measure the causal effect of one variable (feature) on our target outcome variable. The gold standard for doing this effectively is to run a randomized controlled trial or A/B test and measure average treatment effects across groups.
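With randomized assignment, the average treatment effect reduces to a difference in group means. A toy sketch on synthetic data (the +0.3 effect size is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
treated = rng.integers(0, 2, size=n).astype(bool)  # random assignment
# Hypothetical outcome: a baseline score plus a +0.3 treatment effect
outcome = rng.normal(loc=1.0, size=n) + 0.3 * treated

# Average treatment effect = mean(treated) - mean(control)
ate = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated ATE: {ate:.2f}")  # close to the true effect of 0.3
```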
Unfortunately, this isn’t always practical for enterprises due to the impact that changes can have on their business. Ideally, we’d be able to find out how much a component of the in-app experience contributes to user churn without risking an increase in churn. Causal inference techniques enable estimating that relationship from real-world datasets of user behavior, providing critical guidance about where we should invest resources to improve a product.
Historically, it was challenging to use flexible, non-parametric models like random forest and XGBoost for causal inference. Double machine learning allows us to easily tap into these advancements.
Bringing accelerated computing to double machine learning
Using state-of-the-art machine learning algorithms for causal inference increases the computational requirements for the workflow. With small datasets, this isn’t an issue. But as datasets continue to grow, using DoubleML on CPUs in practice can be a challenge.
In the benchmark below, we lightly adapt this example from the DoubleML documentation and run it on a range of dataset sizes using scikit-learn and cuML to see how performance changes.
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

import cuml

NROWS = [10000, 100000, 1000000, 10000000]
USE_GPU = True

for N in NROWS:
    data = make_plr_CCDDHNR2018(
        alpha=0.5, n_obs=N, dim_x=100, return_type="DataFrame"
    ).astype("float32")
    obj_dml_data = dml.DoubleMLData(data, "y", "d")

    if USE_GPU:
        learner = cuml.ensemble.RandomForestRegressor(
            n_estimators=200, max_features=100, max_depth=10, min_samples_leaf=2
        )
    else:  # standard scikit-learn
        learner = RandomForestRegressor(
            n_estimators=200, max_features=100, max_depth=10,
            min_samples_leaf=2, n_jobs=-1
        )

    ml_l = clone(learner)
    ml_m = clone(learner)
    dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m).fit()
With hundreds of thousands or millions of records, CPU-based DoubleML pipelines quickly slow down as the underlying machine learning model becomes the bottleneck. On the 10-million-row, 100-column dataset, fitting the DoubleMLPLR pipeline takes more than 6.5 hours with scikit-learn. Switching the underlying model to GPU-accelerated RAPIDS cuML brings that down to just 51 minutes, a 7.7x speedup.
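Per-size timings like these can be collected with a simple wall-clock wrapper around the fit call (a sketch; the exact harness used for the benchmark isn't shown here):

```python
import time

def timed_fit(dml_obj):
    """Fit any object exposing .fit() and report the wall-clock time."""
    start = time.perf_counter()
    dml_obj.fit()
    elapsed = time.perf_counter() - start
    print(f"fit took {elapsed:.1f} s")
    return dml_obj
```

The same wrapper works for both the scikit-learn and cuML configurations, since only the learner passed to DoubleMLPLR changes.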
Based on the results, accelerated machine learning libraries like cuML can provide up to 12x speedups compared to using scikit-learn’s CPU-based RandomForestRegressor as the backend model, with minimal code changes required.
Conclusion
Causal inference can help enterprises better understand key components of their products, but traditionally it’s been challenging to take advantage of innovations in machine learning focused on prediction.
New techniques like double machine learning are bridging this gap, enabling enterprises to use computationally intensive machine learning algorithms for causal inference problems. As datasets grow, CPU-based infrastructure struggles to keep up with productivity demands.
Using accelerated computing libraries like RAPIDS cuML with DoubleML makes it possible to turn hours of waiting into minutes, with minimal code changes.
To learn more about accelerated machine learning, visit the cuML documentation.