Supercharging Deduplication in pandas Using RAPIDS cuDF

A common operation in data analytics is to drop duplicate rows. Deduplication is critical in Extract, Transform, Load (ETL) workflows, where you might want to study the latest records, find the first time a key appears, or remove duplicate keys completely from your data. Which rows you keep and the row ordering both impact downstream processing steps. 

The RAPIDS suite of open-source libraries provides optimized GPU-accelerated libraries for the data science ecosystem. RAPIDS cuDF provides optimized algorithms for DataFrame analytics, leading to faster performance for pandas applications on NVIDIA GPUs with zero code changes.

This post explores the algorithms and data structures used for fast, correct deduplication in cuDF. We demonstrate how cuDF implements the popular method DataFrame.drop_duplicates from pandas, providing lightning-fast results without application code changes. 

Introduction to drop_duplicates in pandas

In pandas drop_duplicates, the keep option is the most important aspect for a correct implementation because it determines which duplicates to retain. The following options are described in the pandas drop_duplicates documentation.

  • "first": Drop duplicates except for the first occurrence
  • "last": Drop duplicates except for the last occurrence
  • False: Drop all duplicates

The following code demonstrates the different options:

>>> import pandas as pd
>>> df = pd.DataFrame({"key": [0, 0, 0, 1, 1, 2, 2], "values": [1, 2, 3, 4, 5, 6, 7]})
>>> df
   key  values
0    0       1
1    0       2
2    0       3
3    1       4
4    1       5
5    2       6
6    2       7
>>> df.drop_duplicates(subset=["key"], keep="first")
   key  values
0    0       1
3    1       4
5    2       6
>>> df.drop_duplicates(subset=["key"], keep="last")
   key  values
2    0       3
4    1       5
6    2       7
>>> df.drop_duplicates(subset=["key"], keep=False)
Empty DataFrame
Columns: [key, values]
Index: []

Stable ordering

Another crucial point is that drop_duplicates includes stable ordering, which means that the output rows retain their original order from the input. Without stable ordering, the algorithm returns the requested duplicates (first, last, or none) in any order. For serial algorithms, stable ordering can be straightforward to satisfy. The algorithm used by pandas 2.2.2 runs in serial using khash, a C hash table implementation. 

Matching pandas requires stable ordering. GPU execution exploits a high level of parallelism, which for some algorithms comes at the expense of deterministic or stably-ordered behavior. We delve into the consequences of requiring stable ordering in subsequent sections.

Solving drop_duplicates in CUDA C++

When cudf.pandas is enabled, pandas APIs, from DataFrame constructors to methods such as DataFrame.drop_duplicates, call the cuDF Python library, which in turn calls the libcudf CUDA C++ library to execute operations on the GPU. The cudf.pandas interface provides the entire pandas API, accelerating existing pandas code with no code changes by calling cuDF on the GPU and falling back to CPU pandas execution if GPU execution fails.

Achieving this requires implementing CUDA C++ code in libcudf that supports all of the pandas options, including keep. Before discussing a GPU implementation, we’ll first look to the C++ Standard Library and similar libraries of algorithmic primitives, such as CCCL, for inspiration on how C++ library developers might approach the problem of deduplicating data. Then we’ll find the most parallel-friendly CUDA solution to compose from those primitive algorithms and data structures.

The C++ Standard Library offers several algorithms for handling duplicates. The most relevant ones are std::unique from the <algorithm> header and std::unordered_set from the <unordered_set> header.

1. std::unique

  • Removes consecutive duplicate elements in a range
  • Requires a sorted range if you want to remove all duplicates (not just consecutive ones)
  • Operates on iterators, providing flexibility to work with various container types

The following code provides an example:

#include <algorithm>
#include <vector>
#include <iostream>

int main() {
  std::vector<int> vec = {1, 2, 2, 3, 4, 4, 2, 2};
  auto it = std::unique(vec.begin(), vec.end());
  vec.erase(it, vec.end());
  for (int n : vec) {
    std::cout << n << " ";
  }
  // Outputs 1, 2, 3, 4, 2 (retains input order)
  return 0;
}

2. std::unordered_set

  • Provides a direct way to eliminate duplicates using a hash-based structure
  • Does not preserve the order of elements

The following code provides an example:

#include <unordered_set>
#include <vector>
#include <iostream>

int main() {
  std::vector<int> vec = {1, 2, 2, 3, 4, 4, 2, 2};
  std::unordered_set<int> s(vec.begin(), vec.end());
  for (int n : s) {
    std::cout << n << " ";
  }
  // Outputs 1, 2, 3, 4 in some order (implementation defined)
  return 0;
}

Although the C++ STL offers two means of deduplicating data, neither one follows what pandas would do. The std::unique algorithm only drops adjacent duplicates. To drop all duplicates would require sorting the input before calling std::unique, which can be expensive. It would also require tracking the input indices, and a second sort at the end based on the input indices to achieve stable ordering with respect to the input data. 
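To make the cost concrete, the sort-based approach described above can be sketched in Python (an illustrative serial sketch, not cuDF code): tag each element with its input index, sort by value, keep the first element of each run of equal values, then sort the survivors back by original index to restore stable input order.

```python
def sort_based_distinct(values):
    # Tag each value with its original position, then sort by value.
    # Python's sort is stable, so equal values keep their input order.
    tagged = sorted(enumerate(values), key=lambda p: p[1])
    # Keep only the first entry of each run of equal values
    # (equivalent to std::unique on a sorted range).
    survivors = [tagged[0]] if tagged else []
    for idx, val in tagged[1:]:
        if val != survivors[-1][1]:
            survivors.append((idx, val))
    # A second sort by original index restores stable input order.
    survivors.sort(key=lambda p: p[0])
    return [val for _, val in survivors]

print(sort_based_distinct([1, 2, 2, 3, 4, 4, 2, 2]))  # [1, 2, 3, 4]
```

The two full sorts are exactly the expense the hash-based approach avoids.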

On the other hand, while building a std::unordered_set uses a hash-based data structure that eliminates the need to sort, it does not retain the order of the inputs. This means that it is not possible to implement features like keep="first" or provide stable ordering. We conclude that neither std::unique nor std::unordered_set can provide the features needed for order-aware deduplication with control over which duplicates are kept.

Introducing the distinct algorithm

Because there is no direct analogue for the algorithm we want in the C++ Standard Library, we call this algorithm cudf::distinct (and correspondingly cudf::stable_distinct when preserving input order). This distinguishes it from the behavior of cudf::unique which only considers adjacent duplicates like its namesake, std::unique. The name drop_duplicates is too specific to pandas behavior and could be easily confused with the behavior of cudf::unique.

Hash-based solution for distinct

When implementing CUDA C++ algorithms, we prefer hash-based approaches, thanks to the excellent performance of concurrent data structures in cuCollections. For more details, see Maximizing Performance with Massively Parallel Hash Maps on GPUs.

Implementing the distinct algorithm requires tracking which duplicate value came first or last in the input data, as well as the ordering of the input rows (even for keep=False). The logic for each keep option could be implemented with std::unordered_map in pure C++.

However, a GPU implementation requires concurrent data structures such as static_set and static_map in cuCollections that provide similar functionality to std::unordered_set or std::unordered_map in the C++ Standard Library. For stream compaction algorithms like distinct, we have found the highest data throughput with implementations based on static_set using a results array rather than using static_map.

Supporting keep options in distinct

libcudf introduces a new option called keep="any", which is the default option for cudf::distinct. This option offers the best performance in most cases, as it simplifies index comparison and tracking within the algorithm. If we can keep any of the duplicate rows arbitrarily, we can use a static_set and avoid the need for a value reduction that tracks indices or group sizes. We can directly extract the matching data from the set. 
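In serial Python terms, keep="any" reduces to plain set membership with no per-key value to maintain. This is an illustrative sketch, not the cudf API; in a serial loop it happens to coincide with keep="first", whereas on the GPU whichever thread wins the concurrent insert determines the surviving row, which is exactly why no reduction is needed.

```python
def distinct_any(keys):
    seen = set()
    out = []
    for k in keys:
        # No index reduction: the first insert to succeed simply wins.
        if k not in seen:
            seen.add(k)
            out.append(k)
    return out

print(distinct_any([0, 0, 0, 1, 1, 2, 2]))  # [0, 1, 2]
```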

For keep="first" or keep="last", if a key that does not already exist is inserted into the hash map, its corresponding row index is stored. Whenever a duplicate key is encountered, the value is updated depending on the keep option. For keep="first", the hash map retains the smaller row index, whereas for keep="last", the larger row index is preserved. 

For keep=False, a different approach is used. When a new key is inserted, its initial value is set to 1, and each additional insertion of the same key increases this value. This effectively tracks the count of rows for each key, similar to df.groupby("key").size() in pandas. 

When all the data has been inserted into the cuCollections static_set, a means to get the desired result is needed. For keep="any", all of the keys in the hash map are retrieved. For the keep="first" and keep="last" options, thrust::copy_if is employed to collect the matching indices for all distinct rows. For the keep=False option, only the keys with a group size of 1 are retained. This approach results in the exclusion of all duplicated rows, as their corresponding values, indicating the size of the duplicate grouping, are greater than 1.
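A serial Python analogue of these per-key reductions, using a plain dict in place of the concurrent static_set plus results array, might look like the following. The function name and the final sort are illustrative; on the GPU the updates are atomic and the indices come back unordered.

```python
def reduce_keys(keys, keep):
    table = {}  # key -> row index (first/last) or group count (keep=False)
    for i, k in enumerate(keys):
        if keep == "first":
            table.setdefault(k, i)       # retain the smaller row index
        elif keep == "last":
            table[k] = i                 # retain the larger row index
        else:  # keep=False: count occurrences, like groupby("key").size()
            table[k] = table.get(k, 0) + 1
    if keep is False:
        # Keep only rows whose key has a group size of exactly 1.
        singles = {k for k, n in table.items() if n == 1}
        return [i for i, k in enumerate(keys) if k in singles]
    return sorted(table.values())

keys = [0, 0, 0, 1, 1, 2, 2]
print(reduce_keys(keys, "first"))  # [0, 3, 5]
print(reduce_keys(keys, "last"))   # [2, 4, 6]
print(reduce_keys(keys, False))    # []
```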

Supporting stable ordering in distinct

For deduplication in cudf.pandas, recall that stable ordering is required to match pandas. The indices available at this point may not have the same order as the input, because our hash table doesn’t provide any guarantees about the order of the results. It’s possible to take the indices and call thrust::sort, but there is a faster solution that doesn’t require sorting. 

Instead, create an array of Boolean values that is the same length as the input, with every entry initialized as false. Then set the value to true for every index that we have retrieved with a scatter operation. Finally, use that Boolean mask to filter the input data, and its input order is retained. This Boolean mask filtration step is only necessary if stable ordering is needed. Otherwise, a gather operation with the indices from the hash map could be used.
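Sketched in pure Python, with lists standing in for device arrays and for the corresponding thrust scatter and filter steps, the mask-based stabilization looks like this (illustrative, not the libcudf code):

```python
def stable_gather(data, kept_indices):
    # kept_indices may arrive in arbitrary order from the hash table.
    mask = [False] * len(data)
    for i in kept_indices:  # scatter: mark every retained row
        mask[i] = True
    # Filtering with the Boolean mask preserves the original input order.
    return [x for x, keep in zip(data, mask) if keep]

data = [10, 20, 30, 40, 50]
# Indices retrieved from the hash table, in no particular order:
print(stable_gather(data, [4, 0, 2]))  # [10, 30, 50]
```

No sort of the indices is needed: the scatter is order-insensitive, and the mask filter inherits the input order for free.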

Matching pandas drop_duplicates with cudf.pandas

cuDF supports a broad range of deduplication conventions in the libcudf CUDA C++ layer, and only uses a subset of these to support the deduplication conventions in pandas. Depending on the keep option and stable ordering requirement, different data structures, value reductions, and output materialization methods are used. Table 1 shows the hash-based algorithms for finding distinct entries in libcudf.

| libcudf hash-based algorithm | Data structure | Value reduction | Output materialization | pandas equivalent |
| --- | --- | --- | --- | --- |
| Stable Distinct, KEEP_FIRST and KEEP_LAST | static_set plus results array | Set or update row index in results array | Scatter into Boolean mask and gather using the mask | drop_duplicates, keep="first" or keep="last" |
| Stable Distinct, KEEP_NONE | static_set plus results array | Set or update count in results array | Scatter into Boolean mask and gather using the mask | drop_duplicates, keep=False |
| Stable Distinct, KEEP_ANY | static_set | n/a | Scatter into Boolean mask and gather using the mask | n/a |
| Distinct, KEEP_FIRST and KEEP_LAST | static_set plus results array | Set or update row index in results array | Gather | n/a |
| Distinct, KEEP_NONE | static_set plus results array | Set or update count in results array | Gather | n/a |
| Distinct, KEEP_ANY | static_set | n/a | Gather | n/a |
Table 1. Hash-based implementations for distinct in RAPIDS libcudf, matched to corresponding pandas APIs where applicable

Performance of distinct algorithms in libcudf

Overall, we observe good saturation behavior in data throughput for all of the distinct implementations, and significant performance benefits from relaxing the keep option and stable ordering requirements. Performance statistics were collected on the 24.10 branch of libcudf, using an NVIDIA H100 Tensor Core 80 GB HBM3 GPU and an Intel Xeon Platinum 8480CL CPU with 2 TiB of RAM. We used the libcudf nvbench microbenchmarks (see code pointer for details) to collect data on the performance of distinct algorithms as a function of data size and cardinality.

Some basic examples for benchmarking distinct using nvbench with custom axis values are shown below.

./STREAM_COMPACTION_NVBENCH -d 0 -b distinct -a NumRows[pow2]=[24:28] -a cardinality[pow2]=[24:28]

./STREAM_COMPACTION_NVBENCH -d 0 -b stable_distinct -a NumRows[pow2]=[24:28] -a cardinality[pow2]=[24:28]

Impact of cardinality and keep option

For the CUDA C++ implementation of distinct in libcudf, data cardinality (the number of distinct elements) and the selected keep option have a significant impact on data throughput. Figure 1 shows data throughput for KEEP_ANY and KEEP_FIRST across a range of data sizes and data cardinalities. Note that data throughput for KEEP_FIRST is similar to the throughput for KEEP_LAST and KEEP_NONE.

For the common case where cardinality is high and duplicate count is low (region 1), we observe data throughput increasing steadily as row count grows from 50K to 1 million rows, followed by saturation above 1 million rows at 10-15 billion items/sec. This region of high cardinality reaches the limit of random-access memory bandwidth for GPU global memory.

As the cardinality decreases below 1%, we observe some cases with higher throughput, up to 40 billion items/sec (region 2). This peak throughput region is due to a higher concentration of L2 cache hits in GPU memory for smaller counts of distinct elements. 

Finally, for KEEP_FIRST, we observe a throughput drop for cardinalities below 32 distinct elements due to atomic contention while updating the results array (region 3). Overall, we recommend using KEEP_ANY when possible to achieve the highest throughput, especially when your data has very low cardinality, such as 10 distinct elements in 100 million rows.

Heat map showing data throughput with annotations marking region 1: high cardinality; region 2: low cardinality; and region 3: very low cardinality with KEEP_FIRST.
Figure 1. Data throughput for KEEP_ANY and KEEP_FIRST across a range of data sizes and data cardinalities

Impact of stable ordering

For applications that require stable ordering of the outputs, stable_distinct layers this functionality onto the core distinct algorithm with only a 10-20% increase in runtime in many cases. Figure 2 shows the data throughput as a function of row count for distinct and stable_distinct where the data has a distinct ratio of 0.5 (50% duplicates). In each case, we observe 10-15 billion rows/sec for row counts above 20 million. Comparing distinct and stable_distinct directly, we observe that stable ordering brings ~10% decreased throughput for KEEP_ANY and ~20% decreased throughput for KEEP_FIRST.

Line chart showing data throughput versus data size for distinct with KEEP_ANY, distinct with KEEP_FIRST, stable_distinct with KEEP_ANY, and stable_distinct with KEEP_FIRST.
Figure 2. Data throughput as a function of row count for distinct and stable_distinct where the data has a distinct ratio of 0.5 (50% duplicates)

Conclusion

In summary, drop_duplicates in pandas provides a user-friendly way to remove duplicates from a dataset with convenient high-level options. Through cudf.pandas and its use of libcudf and cuCollections, RAPIDS provides efficient GPU implementations that scale to massive data sizes with excellent performance.

For the fastest way to try GPU-accelerated DataFrame processing, check out RAPIDS cuDF pandas accelerator mode and RAPIDS cuDF for accelerated data science on Google Colab. If you’re interested in building in C++ with RAPIDS libcudf, we recommend building and running the stream_compaction benchmarks available through rapidsai/cudf on GitHub. 

For more information about CUDA-accelerated DataFrames, see the cuDF documentation and the rapidsai/cudf GitHub repo. For easier testing and deployment, RAPIDS Docker containers are also available for releases and nightly builds. 
