In the world of Python data science, pandas has long reigned as the go-to library for intuitive data manipulation and analysis. However, as data volumes grow, CPU-bound pandas workflows can become a bottleneck.
That’s where cuDF and its pandas accelerator mode, cudf.pandas, step in. This mode accelerates operations on the GPU whenever possible, seamlessly falling back to the CPU for unsupported operations. A fundamental pillar of this approach is the cudf.pandas profiler, which provides insight into how much of your code is being executed on the GPU compared to the CPU.
In this post, we discuss what the cudf.pandas profiler is, how to use it, and why it’s critical for understanding and optimizing your accelerated pandas workloads.
cudf.pandas profiler overview
The cudf.pandas.profile magic command, available in Jupyter and IPython, is a profiling tool that analyzes your pandas-style code in real time. After you enable the cudf.pandas extension, the profiler reports each operation’s execution device (GPU or CPU) and counts how many times a particular function or method was triggered.
By capturing this data, you can quickly pinpoint the following:
- Which operations were fully accelerated on the GPU.
- Which operations fell back to pandas on the CPU.
- Where potential bottlenecks or unnecessary data transfers may be lurking.
Enabling the profiler
To get started, load the cudf.pandas extension in your notebook, just as you would with other IPython magics:
%load_ext cudf.pandas
import pandas as pd
From here, you can begin writing pandas code—such as reading CSVs, merging DataFrames, or running groupbys—and let cudf.pandas automatically decide whether to run each operation on the GPU or fall back to the CPU when needed.
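As a minimal illustrative sketch, continuing from the import above (the DataFrames and column names here are invented for the example), ordinary pandas code like the following runs under the accelerator once the extension is loaded:
left = pd.DataFrame({'key': [1, 2, 3], 'x': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'key': [1, 2, 2], 'y': [0.1, 0.2, 0.3]})

# A merge followed by a groupby aggregation; cudf.pandas dispatches each
# operation to the GPU when cuDF supports it and to pandas otherwise.
merged = left.merge(right, on='key')
print(merged.groupby('key')['y'].mean())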
Profiling in action
There are several ways to use the cudf.pandas profiler:
- Using the cell-level profiler
- Using the line profiler
- Using the profiler from the command line
Using the cell-level profiler
In Jupyter or IPython, you can activate profiling on a cell-by-cell basis:
%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]})
df.min(axis=1)
out = df.groupby('a').filter(lambda group: len(group) > 1)
After the cell completes, you’ll see an output that breaks down which operations ran on the GPU compared to the CPU, how many times each operation was called, and a handy summary for discovering performance bottlenecks.
Total time elapsed: 0.256 seconds
3 GPU function calls in 0.170 seconds
1 CPU function calls in 0.009 seconds
Stats
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame │ 1 │ 0.031 │ 0.031 │ 0 │ 0.000 │ 0.000 │
│ DataFrame.min │ 1 │ 0.137 │ 0.137 │ 0 │ 0.000 │ 0.000 │
│ DataFrame.groupby │ 1 │ 0.001 │ 0.001 │ 0 │ 0.000 │ 0.000 │
│ DataFrameGroupBy.filter │ 0 │ 0.000 │ 0.000 │ 1 │ 0.009 │ 0.009 │
└─────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘
Not all pandas operations ran on the GPU. The following functions required CPU fallback:
- DataFrameGroupBy.filter
To request GPU support for any of these functions, please file a Github issue here: https://github.com/rapidsai/cudf/issues/new/choose.
DataFrameGroupBy.filter isn’t accelerated on the GPU, so you see one CPU call to it taking 0.009 seconds in the table.
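When a fallback like this matters for performance, it can often be expressed in terms of supported operations instead. Here is a hedged sketch of one such rewrite of the groupby-filter above using transform; whether the transform stays on the GPU depends on your cuDF version, and the profiler will confirm either way:
# Equivalent to df.groupby('a').filter(lambda group: len(group) > 1):
# compute each row's group size, then keep rows whose group has more
# than one member.
group_sizes = df.groupby('a')['a'].transform('size')
out = df[group_sizes > 1]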
Using the line profiler
To go a step deeper and see how each line in your cell performed, run the following code example:
%%cudf.pandas.line_profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]})
df.min(axis=1)
out = df.groupby('a').filter(lambda group: len(group) > 1)
Here, the profiler shows execution details line by line, making it easier to spot specific lines of code that are causing CPU fallbacks or performance slowdowns.
Total time elapsed: 0.244 seconds
Stats
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Line no. ┃ Line ┃ GPU TIME(s) ┃ CPU TIME(s) ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ 2 │ df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]}) │ 0.004833249 │ │
│ │ │ │ │
│ 3 │ df.min(axis=1) │ 0.006497159 │ │
│ │ │ │ │
│ 4 │ out = df.groupby('a').filter(lambda group: len(group) > 1) │ 0.000599624 │ 0.000347643 │
│ │ │ │ │
└──────────┴────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘
Using the profiler from the command line
You can also run the cudf.pandas profiler from the command line by passing the --profile argument.
Here are the contents of demo.py:
import pandas as pd
s = pd.Series([1, 2, 3])
s = (s * 10) + 2
print(s)
Run the following command:
python -m cudf.pandas --profile demo.py
You will see the profile stats output:
Total time elapsed: 0.852 seconds
4 GPU function calls in 0.029 seconds
0 CPU function calls in 0.000 seconds
Stats
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Series │ 1 │ 0.002 │ 0.002 │ 0 │ 0.000 │ 0.000 │
│ OpsMixin.__mul__ │ 1 │ 0.011 │ 0.011 │ 0 │ 0.000 │ 0.000 │
│ OpsMixin.__add__ │ 1 │ 0.008 │ 0.008 │ 0 │ 0.000 │ 0.000 │
│ object.__str__ │ 1 │ 0.008 │ 0.008 │ 0 │ 0.000 │ 0.000 │
└──────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘
Why profiling matters
When it comes to GPU acceleration, not everything is automatically supported. Some operations—such as certain Python-level functions, custom lambdas, or DataFrame methods not yet implemented on the GPU—may trigger CPU fallback.
The profiler helps you see precisely where those fallbacks occur. With this knowledge, you can adjust your workflow, rewrite certain functions, or investigate potential areas for improvement:
- Rewrite CPU-bound operations: Some user-defined functions or operations can be modified to become more GPU-friendly, as in the groupby-filter rewrite shown earlier.
- Watch for frequent data transfers: Excessive data movement between CPU and GPU can offset acceleration gains. Alternating runs of GPU and CPU operations are a common culprit, because each switch of device can trigger an expensive transfer (see the sketch after this list).
- Stay up to date: cuDF is constantly adding functionality and bridging gaps with pandas. Knowing which methods are currently CPU-bound helps you keep track of future improvements.
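To make the data-transfer point concrete, here is a hedged sketch contrasting a loop that hits the CPU fallback on every iteration with a version that computes the supported part once and reuses it; the exact device placement depends on your cuDF version:
# Fallback inside a loop: each iteration calls the unsupported
# DataFrameGroupBy.filter, potentially paying a device transfer every time.
slow = [df.groupby('a').filter(lambda g: len(g) > t) for t in (1, 2)]

# Compute group sizes once with a supported operation, then reuse the result.
sizes = df.groupby('a')['a'].transform('size')
fast = [df[sizes > t] for t in (1, 2)]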
Conclusion
The pandas user experience is central to modern data science, but scaling to large datasets often strains CPU performance.
With cudf.pandas, you get the same intuitive API with the speed of GPUs when possible and an automatic CPU fallback for unsupported operations. The cudf.pandas profiler is the key to understanding this hybrid CPU/GPU execution model, highlighting acceleration opportunities and helping you refine your code for the best performance.
Give it a try in your data project. By profiling your code and identifying CPU fallbacks, you can quickly push the boundaries of pandas without leaving the comfort of its familiar API.
If you run into APIs that you would like to have GPU-accelerated, please open a GitHub issue in the /rapidsai/cudf GitHub repo.