
7 Drop-In Replacements to Instantly Speed Up Your Python Data Science Workflows

You’ve been there. You wrote the perfect Python script, tested it on a sample CSV, and everything worked flawlessly. But when you unleashed it on the full 10 million row dataset, your laptop fan started screaming, your console froze, and you had enough time to brew three pots of coffee before seeing a result.

What if you could get massive speedups on those exact same workflows with a simple flag or parameter switch?

It turns out, you can. Many of Python’s most popular data science libraries—including pandas, Polars, scikit-learn, and XGBoost—can now run much faster on GPUs with little to no code changes. Using libraries like NVIDIA cuDF, cuML, and cuGraph, you can keep your existing code and scale it to handle much larger workloads with ease.

This post shares seven drop-in replacements that speed up popular Python libraries—complete with starter code to try yourself.

Make pandas and Polars run faster on larger datasets

The foundation of any data science or machine learning project is data preparation. It’s often the most time-consuming part of the workflow, but it doesn’t have to be.

#1: %load_ext cudf.pandas: Use pandas as-is with GPU acceleration

pandas is the cornerstone of Python data science—but it slows down fast on large datasets. With cudf.pandas, you can keep your code exactly the same and still get GPU acceleration.

How it works: Simply load the cudf.pandas extension at the top of your script or notebook. cuDF will then intelligently run your pandas commands on the GPU whenever possible, dramatically speeding up your workflow.

# Just add this to the top of your script!
%load_ext cudf.pandas


# Your existing pandas code now runs on the GPU
import pandas as pd
df = pd.read_csv("your_large_dataset.csv")

# ... all your other pandas operations are now accelerated
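
Outside of a notebook, the same acceleration can be enabled without the IPython magic. A minimal sketch, assuming cuDF is installed (the file name is illustrative):

# Option 1: run an unmodified script through the cudf.pandas module loader
#   python -m cudf.pandas your_script.py

# Option 2: enable the accelerator programmatically before importing pandas
import cudf.pandas
cudf.pandas.install()

import pandas as pd
df = pd.read_csv("your_large_dataset.csv")  # runs on the GPU when possible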


Watch how fast this stock analysis workflow runs with cudf.pandas enabled:

Video 1. Comparing processing performance on 18M rows of stock data in pandas with cuDF on/off

Try it now:

#2: .collect(engine="gpu"): Make Polars even faster

Polars is already famous for its speed. Now you can combine its powerful query optimization with the raw processing power of cuDF for even greater performance.

How it works: Polars has a built-in execution engine that can be pointed to the GPU. By enabling the cuDF-powered engine, you tell Polars to leverage the GPU for its operations.

# Install Polars with the GPU feature flag
pip install polars[gpu]

import polars as pl

# Lazily scan the transaction data (illustrative file name)
transactions = pl.scan_csv("transactions.csv")

# Call the GPU engine at collection
(transactions
 .group_by("CUST_ID")
 .agg(pl.col("AMOUNT").sum())
 .sort(by="AMOUNT", descending=True)
 .head()
 .collect(engine="gpu"))
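
If you want stricter control over GPU execution, the collect call can also take an engine object. A short sketch, assuming your Polars build exposes pl.GPUEngine:

import polars as pl

# Configure the GPU engine explicitly; raise_on_fail surfaces queries
# that would otherwise silently fall back to the CPU engine
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)

result = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .collect(engine=gpu_engine)
)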

Here’s what happens when we run the same Polars query with and without GPU acceleration.

Video 2. Demo showing how the Polars GPU Engine powered by cuDF can tackle massive transaction datasets that typically cause slowdowns, processing 100 million rows in under two seconds

Try it now:

Train scikit-learn and XGBoost faster

With your data prepped, it’s time to train your models—and this is where many Python workflows hit another wall. Libraries like scikit-learn and XGBoost are powerful but can get slow on large datasets. Fortunately, both offer simple ways to unlock GPU acceleration and drastically reduce training time.

#3: %load_ext cuml.accel: Train scikit-learn models faster with GPU support

Many data scientists rely on scikit-learn for everyday machine learning tasks like classification, regression, and clustering. But as data grows and hyperparameter tuning and visualization cycles get added to the mix, training time escalates. With cuML, you can accelerate popular scikit-learn models on the GPU to save time and train on larger datasets without changing your code.

How it works: Just load the accelerator, and keep writing scikit-learn code as usual. Under the hood, cuML handles the GPU execution. No syntax changes. No new APIs.

%load_ext cuml.accel


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
 
X, y = make_classification(n_samples=500000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
 
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
 
rf.fit(X_train, y_train)
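
Evaluation works exactly as it does on CPU. A quick follow-on sketch using the held-out split from above:

from sklearn.metrics import accuracy_score

# Predictions come back as regular arrays, so the usual metrics apply
preds = rf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, preds):.3f}")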

Watch how RandomForestClassifier training drops from minutes to seconds:

Video 3. cuML accelerates machine learning training with no code changes required

Try it now:

Note: While cuml.accel works out of the box for many common scikit-learn models, it’s still expanding coverage. Some workflows may partially fall back to CPU—see supported estimators for details.

#4: device = "cuda": Enable CUDA acceleration in XGBoost With one parameter

XGBoost is a world-class, competition-winning library that comes with GPU acceleration built-in. You just have to turn it on.

How it works: There’s no need for a different library. Simply set the device parameter to “cuda” during model initialization to leverage the GPU.

# Just set the device to "cuda"
import xgboost as xgb

xgb_model = xgb.XGBRegressor(device="cuda")
xgb_model.fit(X, y)  # X, y are your existing training features and labels

This demo shows how enabling GPU acceleration speeds up model training and unlocks faster iteration during feature engineering and hyperparameter tuning—making it easier to test, refine, and improve models in less time.
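
Because the estimator keeps the scikit-learn API, a GPU-backed model also drops straight into tuning utilities. A minimal sketch, where the synthetic data and search space are purely illustrative:

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=100_000, n_features=50, random_state=0)

# Every candidate fit in the search runs on the GPU
search = RandomizedSearchCV(
    xgb.XGBRegressor(device="cuda", tree_method="hist"),
    param_distributions={"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)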

Video 4. Demo showing how GPUs are used to accelerate XGBoost workflows using real-world taxi fare data. It walks through how to build baseline models and improve model accuracy with advanced feature engineering.

Try it now: Run in Colab

Accelerating exploratory ML and clustering workflows

Before training your model, it’s common to explore high-dimensional patterns or identify clusters in your data. Tools like UMAP and HDBSCAN are great for the job—but they can slow to a crawl on large datasets. With cuML, you can run these workflows much faster. 

#5: %load_ext cuml.accel: Make UMAP visualizations run in seconds, not minutes

UMAP is a powerful technique for dimensionality reduction, but it can be painfully slow on large datasets. cuML’s implementation lets you create stunning visualizations in a fraction of the time.

How it works: Just like with scikit-learn, you simply change the import to use cuml.accel and let the GPU do the heavy lifting.

%load_ext cuml.accel

import umap

umap_model = umap.UMAP(n_neighbors=15, n_components=2, random_state=42, min_dist=0.0)

# Fit UMAP to your scaled feature matrix
X_train_umap = umap_model.fit_transform(X_train_scaled)
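
From there, the embedding can be plotted like any other 2-D array. A quick illustrative sketch with matplotlib:

import matplotlib.pyplot as plt

# Scatter the 2-D embedding to eyeball cluster structure
plt.scatter(X_train_umap[:, 0], X_train_umap[:, 1], s=1)
plt.title("UMAP projection")
plt.show()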

Here’s a side-by-side demo showing how cuML speeds up UMAP on a real-world dataset.

Video 5. UMAP projection on the UCI HAR dataset, using the same umap-learn code on CPU vs GPU. Thanks to cuML’s accelerator mode, the GPU version runs in under a second—without any code changes.

Try it now:

#6: %load_ext cuml.accel: Faster HDBSCAN clustering on millions of rows

Density-based clustering with HDBSCAN can be painfully slow on CPU—especially with high-dimensional data. With cuML’s accelerator mode, you can uncover complex structures in seconds instead of minutes.

How it works: Load the cuml.accel extension, and your existing HDBSCAN code automatically runs on GPU—no refactoring needed.

%load_ext cuml.accel

import hdbscan

clusterer = hdbscan.HDBSCAN()
%time clusterer.fit(X)  # X is your existing feature matrix
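
Once fit, the results can be inspected the same way as with CPU hdbscan. A small follow-on sketch (HDBSCAN labels noise points as -1):

import numpy as np

labels = clusterer.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters found; {np.sum(labels == -1)} points labeled as noise")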

Clustering large datasets with HDBSCAN can be time-consuming—even on toy examples, CPU runs can take 30 to 60 seconds. With cuML’s accelerator mode, you can fit clustering models like HDBSCAN in a second, using the same Python code.

Video 6. Watch how a single import—%load_ext cuml.accel—drops HDBSCAN clustering time from 45 seconds to under 2 seconds on a high-dimensional dataset. No code rewrites needed. Just load the extension and keep using your existing hdbscan code.

Try it now:

Scaling graph analytics with NetworkX

Graphs are incredibly powerful for analyzing relationships in data, and NetworkX is one of the most widely used libraries for working with them.

It offers hundreds of functions to help you build and analyze diverse graph structures with ease. But its pure-Python implementation can become a bottleneck on large datasets—making it hard to scale to real-world graph analytics on CPU.

#7: %env NX_CUGRAPH_AUTOCONFIG=True: Instantly scale your NetworkX graphs

To overcome these scaling limitations, the NetworkX ecosystem now includes a GPU-accelerated backend powered by cuGraph called nx-cugraph. With nx-cugraph, you can keep your exact same NetworkX code and unlock GPU acceleration—no code changes required.

How it works: Just install nx-cugraph and set the environment variable NX_CUGRAPH_AUTOCONFIG=True before running your usual NetworkX code. NetworkX automatically detects algorithms supported by nx-cugraph and routes them to cuGraph on the GPU—no rewrites or conversions required.

# Install the GPU-accelerated NetworkX backend
pip install nx-cugraph-cu11 --extra-index-url https://pypi.nvidia.com


# Enable GPU acceleration for NetworkX
%env NX_CUGRAPH_AUTOCONFIG=True

# Your existing NetworkX code stays the same
import pandas as pd
import networkx as nx

df = pd.read_csv("your_edgelist.csv", names=["src", "dst"])
G = nx.from_pandas_edgelist(df, source="src", target="dst")

centrality_scores = nx.betweenness_centrality(G, k=10)
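
If you'd rather opt in per call instead of globally, NetworkX's backend dispatch also accepts a backend keyword. A small sketch, assuming nx-cugraph is installed:

# Route just this call to the GPU backend
scores = nx.betweenness_centrality(G, k=10, backend="cugraph")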

Here’s a side-by-side of NetworkX on CPU vs GPU:

Video 7. A side-by-side demo of NetworkX running on CPU vs GPU using the new nx-cugraph accelerator. With one environment variable and zero code changes, you can run standard NetworkX code on much larger graphs — and finish in seconds, not minutes.

Try it now:

Conclusion: Same code. More speed. 

You don’t need to be a CUDA expert to leverage the massive parallel processing power of GPUs. For a huge number of data science and machine learning workflows, the tools are already here. By using libraries like cuDF, cuML, and cuGraph, you can accelerate your favorite tools and get results faster.

Ready to start building? All the examples, notebooks, and starter code from this blog post are available in this GitHub repo.
