In the world of Python data science, pandas has long reigned as the go-to library for intuitive data manipulation and analysis. However, as data volumes grow, CPU-bound pandas workflows can become a bottleneck.
That’s where cuDF and its pandas accelerator mode, cudf.pandas, step in. This mode accelerates operations on the GPU whenever possible, seamlessly falling back to the CPU for unsupported operations. A fundamental pillar of this approach is the cudf.pandas profiler, which provides insight into how much of your code is being executed on the GPU compared to the CPU.
In this post, we discuss what the cudf.pandas profiler is, how to use it, and why it’s critical for understanding and optimizing your accelerated pandas workloads.
cudf.pandas profiler overview
The cudf.pandas.profile magic command, available in Jupyter and IPython, is a profiling tool that analyzes your pandas-style code in real time. After you enable the cudf.pandas extension, the profiler reports each operation’s execution device (GPU or CPU) and counts how many times a particular function or method was triggered.
By capturing this data, you can quickly pinpoint the following:
- Which operations were fully accelerated on the GPU.
- Which operations fell back to pandas on the CPU.
- Where potential bottlenecks or unnecessary data transfers may be lurking.
Enabling the profiler
To get started, load the cudf.pandas extension in your notebook, just as you would with other IPython magics:
%load_ext cudf.pandas
import pandas as pd
From here, you can begin writing pandas code—such as reading CSVs, merging DataFrames, or running groupbys—and let cudf.pandas automatically decide whether to run each operation on the GPU or fall back to the CPU when needed.
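As a minimal illustrative sketch, continuing from the import above (the DataFrames and column names here are invented for the example), ordinary pandas code like the following runs under the accelerator once the extension is loaded:
left = pd.DataFrame({'key': [1, 2, 3], 'x': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'key': [1, 2, 2], 'y': [0.1, 0.2, 0.3]})

# A merge followed by a groupby aggregation; cudf.pandas dispatches each
# operation to the GPU when cuDF supports it and to pandas otherwise.
merged = left.merge(right, on='key')
print(merged.groupby('key')['y'].mean())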
Profiling in action
There are several ways to use the cudf.pandas profiler:
- Using the cell-level profiler
- Using the line profiler
- Using the profiler from the command line
Using the cell-level profiler
In Jupyter or IPython, you can activate profiling on a cell-by-cell basis:
%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]})
df.min(axis=1)
out = df.groupby('a').filter(lambda group: len(group) > 1)
After the cell completes, you’ll see an output that breaks down which operations ran on the GPU compared to the CPU, how many times each operation was called, and a handy summary for discovering performance bottlenecks.
Total time elapsed: 0.256 seconds
3 GPU function calls in 0.170 seconds
1 CPU function calls in 0.009 seconds
Stats
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame │ 1 │ 0.031 │ 0.031 │ 0 │ 0.000 │ 0.000 │
│ DataFrame.min │ 1 │ 0.137 │ 0.137 │ 0 │ 0.000 │ 0.000 │
│ DataFrame.groupby │ 1 │ 0.001 │ 0.001 │ 0 │ 0.000 │ 0.000 │
│ DataFrameGroupBy.filter │ 0 │ 0.000 │ 0.000 │ 1 │ 0.009 │ 0.009 │
└─────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘
Not all pandas operations ran on the GPU. The following functions required CPU fallback:
- DataFrameGroupBy.filter
To request GPU support for any of these functions, please file a Github issue here: https://github.com/rapidsai/cudf/issues/new/choose.
DataFrameGroupBy.filter isn’t accelerated on the GPU, so you see one CPU call to it taking 0.009 seconds in the table.
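When a fallback like this matters for performance, it can often be expressed in terms of supported operations instead. Here is a hedged sketch of one such rewrite of the groupby-filter above using transform; whether the transform stays on the GPU depends on your cuDF version, and the profiler will confirm either way:
# Equivalent to df.groupby('a').filter(lambda group: len(group) > 1):
# compute each row's group size, then keep rows whose group has more
# than one member.
group_sizes = df.groupby('a')['a'].transform('size')
out = df[group_sizes > 1]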
Using the line profiler
To go a step deeper and see how each line in your cell performed, run the following code example:
%%cudf.pandas.line_profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]})
df.min(axis=1)
out = df.groupby('a').filter(lambda group: len(group) > 1)
Here, the profiler shows execution details line by line, making it easier to spot specific lines of code that are causing CPU fallbacks or performance slowdowns.
Total time elapsed: 0.244 seconds
Stats
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Line no. ┃ Line ┃ GPU TIME(s) ┃ CPU TIME(s) ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ 2 │ df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 3]}) │ 0.004833249 │ │
│ │ │ │ │
│ 3 │ df.min(axis=1) │ 0.006497159 │ │
│ │ │ │ │
│ 4 │ out = df.groupby('a').filter(lambda group: len(group) > 1) │ 0.000599624 │ 0.000347643 │
│ │ │ │ │
└──────────┴────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘
Using the profiler from the command line
You can also run the cudf.pandas profiler from the command line by passing the --profile argument.
Here are the contents of demo.py:
import pandas as pd
s = pd.Series([1, 2, 3])
s = (s * 10) + 2
print(s)
Run the following command:
python -m cudf.pandas --profile demo.py
You will see the profile stats output:
Total time elapsed: 0.852 seconds
4 GPU function calls in 0.029 seconds
0 CPU function calls in 0.000 seconds
Stats
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Series │ 1 │ 0.002 │ 0.002 │ 0 │ 0.000 │ 0.000 │
│ OpsMixin.__mul__ │ 1 │ 0.011 │ 0.011 │ 0 │ 0.000 │ 0.000 │
│ OpsMixin.__add__ │ 1 │ 0.008 │ 0.008 │ 0 │ 0.000 │ 0.000 │
│ object.__str__ │ 1 │ 0.008 │ 0.008 │ 0 │ 0.000 │ 0.000 │
└──────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘
Why profiling matters
When it comes to GPU acceleration, not everything is automatically supported. Some operations—such as certain Python-level functions, custom lambdas, or DataFrame methods not yet implemented on the GPU—may trigger CPU fallback.
The profiler helps you see precisely where those fallbacks occur. With this knowledge, you can adjust your workflow, rewrite certain functions, or investigate potential areas for improvement:
- Rewrite CPU-bound operations: Some user-defined functions or operations can be modified to become more GPU-friendly, as in the groupby-filter rewrite shown earlier.
- Watch for frequent data transfers: Excessive data movement between CPU and GPU can offset acceleration gains. Alternating runs of GPU and CPU operations are a common culprit, because each switch of device can trigger an expensive transfer (see the sketch after this list).
- Stay up to date: cuDF is constantly adding functionality and bridging gaps with pandas. Knowing which methods are currently CPU-bound helps you keep track of future improvements.
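To make the data-transfer point concrete, here is a hedged sketch contrasting a loop that hits the CPU fallback on every iteration with a version that computes the supported part once and reuses it; the exact device placement depends on your cuDF version:
# Fallback inside a loop: each iteration calls the unsupported
# DataFrameGroupBy.filter, potentially paying a device transfer every time.
slow = [df.groupby('a').filter(lambda g: len(g) > t) for t in (1, 2)]

# Compute group sizes once with a supported operation, then reuse the result.
sizes = df.groupby('a')['a'].transform('size')
fast = [df[sizes > t] for t in (1, 2)]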
Conclusion
The pandas user experience is central to modern data science, but scaling to large datasets often strains CPU performance.
With cudf.pandas, you get the same intuitive API with the speed of GPUs when possible and an automatic CPU fallback for unsupported operations. The cudf.pandas profiler is the key to understanding this hybrid CPU/GPU execution model, highlighting acceleration opportunities and helping you refine your code for the best performance.
Give it a try in your data project. By profiling your code and identifying CPU fallbacks, you can quickly push the boundaries of pandas without leaving the comfort of its familiar API.
If you run into APIs that you would like to have GPU-accelerated, please open a GitHub issue in the /rapidsai/cudf GitHub repo.