In this post, we continue the series on accelerating vector search using NVIDIA cuVS. Our previous post in the series introduced IVF-Flat, a fast algorithm for accelerating approximate nearest neighbors (ANN) search on GPUs. We discussed how using an inverted file index (IVF) provides an intuitive way to reduce the complexity of the nearest neighbor search by limiting it to only a small subset of data using clustering.

Building on these ideas, this post introduces IVF-PQ (Inverted File Index with Product Quantization), an algorithm that improves both the search performance and the memory footprint of vector search by storing the data in a compressed, encoded form. This comes with a cost to accuracy, a compromise that we explore in the second part of this post.

## Example

To show an example of IVF-PQ in action, we prepared a few benchmarks using the DEEP dataset. The full dataset has a billion records and 96 dimensions in fp32 with a total size of 360 GiB. A typical IVF-PQ configuration compresses this into an index of 54 GiB without a significant impact to the search performance (or as small as 24 GiB with ~1.5x slower search). The compression allows the index to fit into GPU memory. cuVS’s IVF-PQ algorithm can be used to accelerate both index building and vector search on billion-scale datasets. Figure 1 compares cuVS IVF-PQ to a popular CPU algorithm, HNSW, on a 100-million subset of the DEEP dataset.

So, how does PQ compression work? More importantly, how do we set its parameters to achieve the best results? In this two-part post, we dive into the details to answer these questions.

## Algorithm overview

Like the IVF-Flat algorithm, IVF-PQ is an ANN search method that logically consists of two steps: coarse search and fine search. The coarse search step for IVF-PQ is the same as for IVF-Flat. Conceptually, the fine search step is also the same, and involves calculating the distance between the query points and all the vectors in the probed clusters. However, IVF-PQ stores the database vectors in a compressed format using a technique called product quantization (PQ).

PQ is a lossy compression. This means IVF-PQ has an advantage for large datasets, because it can fit more data into GPU memory. This also means better utilization of GPU memory bandwidth, and thus faster search.

As a lossy compression, PQ does decrease the recall of the search. The following sections discuss how to alleviate this by tuning the compression ratio, or using a technique called reranking, which is also referred to as refinement. When recall is paramount, you can always fall back to cuVS IVF-Flat or even brute-force search when the index fits in the GPU memory.

## Product quantization

In the IVF-PQ index, a database vector ` is approximated with two-level quantization:`

The first-level quantizer (), maps the vector to the nearest cluster center. In this case, IVF-PQ employs the same balanced hierarchical k-means algorithm as IVF-Flat. The number of clusters is `n_lists`

.

The second-level quantizer is called the product quantizer and it encodes the residual, or distance to the closest cluster center. For more information, see Product Quantization for Nearest Neighbor Search.

A product quantizer encodes a `dim-`

dimensional vector with a `pq_dim-`

dimensional vector. First, the input vector is split into `pq_dim`

subvectors (denoted by ), where each vector contains `pq_len`

distinct components of .

_{}

_{Figure 2. The input vector is split into multiple subvectors denoted by }

Next, each subvector is encoded with a separate quantizer , and the results are concatenated:

Each quantizer outputs a code with `pq_bits`

bits and the cuVS k-means algorithm is used one more time for this step. By default, the clustering is applied to each `pq_len-`

dimensional subspace individually (`pq_dim`

times) to encode each subvector . The closest cluster center in the `pq_len-`

dimensional subspace becomes the approximation of .

The number of k-means clusters is called the * codebook size*, and the index of the closest cluster becomes the encoding of the subvector . To fully use the available

`pq_bits`

bits of the encoding, the codebook size is set to `2`^{pq_bits}

. The `pq_bits`

variable has a strong impact on both recall and speed of the algorithm.Parameters `pq_bits`

and `pq_dim`

together determine the compression ratio. Intuitively, we should be able to pack data records of size `(dim·sizeof(dataset_element_type))`

bytes into `(pq_dim·pq_bits)`

-bit encodings but in practice, we align the data into 16-byte chunks, which allows us to maximize the data throughput and optimize the bit arithmetic:

```
record_size = 16·⌈pq_dim/pq_chunk⌉, where pq_chunk = ⌊128/pq_bits⌋
```

During the search, a look-up table (LUT) is constructed using the appropriate codebooks and the query coordinates for every query and probed list. The LUT is essentially an array of distances between the query subvector and all possible cluster centers (encoded as ) in the corresponding subspace.

Then, the distance from the full vector to any vector in the cluster is computed as a function of the sum of the LUT records given by the indices that constitute the vector encoding:

The size of the LUT has a profound effect on performance:

`lut_size = sizeof(lut_dtype)·pq_dim·2`

^{pq_bits}

Here, `lut_dtype`

is the data type used internally on the GPU for storing the LUT elements. If possible, the LUT is stored right in the GPU’s shared memory during search but if it’s too large for shared memory then it’s stored in the slower global GPU memory. Setting the `lut_dtype`

to a lower-precision data type can, in some cases, reduce the size of a large LUT just enough for it to fit into shared memory.

To summarize, the product quantization is defined by the two parameters: `pq_dim`

and `pq_bits`

. They determine the index data layout and size, as well as the size of the look-up table (`lut_size`

). Both of these have a profound effect on the algorithm performance, which we explore in more detail in the second part of this post.

## How does cuVS accelerate IVF-PQ?

Overall, the logic of IVF-PQ search is quite similar to IVF-Flat. Initial pre-processing is followed by the coarse search to select which clusters to probe for each query. Then, the fine search runs for every query-probe pair. The output of the fine search is passed to the cuVS select-top-k algorithm to aggregate the candidates from all probes. Finally, some post-processing is applied to transform the distance and index types.

The heart of the implementation (and the main bottleneck for a broad range of inputs) is the fine search. Broadly speaking, it consists of two steps performed on each query-probe pair: 1) the LUT construction and 2) the cluster scanning.

During the scanning, the LUT is used to compute the distance scores directly from the encoded data. Depending on the cluster sizes and other parameters, the workload over these two steps can be distributed anywhere from 10%-90% to 50%-50% of the run time.

The IVF-PQ fine search logic in cuVS is implemented as a set of GPU device functions called CUDA kernels. An appropriate kernel is selected according to the input data type, size, the indexing and search parameters. With this decision, a few code paths are switched.

### Fused and non-fused top-k fine search

For all but the smallest work sizes (`n_queries·n_probes`

), we prefer to filter out the best k candidates at the time we compute the distances. In GPU parlance we refer to this as “fusing” operations. Fusing the k-selection with the distance computation enables 1) reducing the output size for better memory bandwidth utilization and a faster aggregation step, and 2) additional optimizations, such as discarding some candidates in the middle of the distance computation.

For a fixed small value of k, highly optimized select-top-k methods are possible for the fused filtering. For details, see Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods. Unfortunately, these methods become too expensive for a larger number of candidates. Hence, beyond some threshold, it is best to switch to the non-fused version of the fine search algorithm. Currently, this limit is set to `k ≤ 128`

in IVF-PQ.

### LUT in shared and global memory

As explained in the Product Quantization section, the LUT is the critical part of the fine search kernel. We try to keep this in the shared memory of the GPU for faster memory access. However, this is not always possible due to the limited device shared memory. If the LUT does not fit in shared memory, it is moved into the global memory and a corresponding code path is activated in the kernel.

### Precomputed distance in shared memory

When constructing the LUT, certain parts of the computation can be reused – for example, the distance between the query and the cluster center for the Euclidean metric. As with LUT, these intermediate values are placed in the shared memory to reduce the access time. This optimization competes for the shared memory resource with the LUT optimization and the kernel selection logic disables one of the optimizations when necessary.

### Other optimizations

We use many other optimizations in cuVS to ensure the optimal performance of algorithms on the GPU. A few of these are listed below:

- A custom 8-bit floating point data type in the LUT. Used only to store values, this does not implement any arithmetic operations but enables a faster conversion to and from
`half`

and`float types.`

- The data in the index is kept in 16-byte, aligned chunks and grouped in strides to enable optimal vectorized data transfers.
- In the fused kernel, for some metrics, we implement an “early stop” check to avoid computing the whole distance when it’s clear from the first few components that the distance is too large.
- In the fused kernel, we keep the distance to k-th neighbors globally to make the early stop check (and filtering in general) more efficient.
- We unroll the inner-most loops of distance computation using C++ templates to reduce the load on the GPU arithmetic logic unit, which is strained by the PQ decoding.

## Performance: 100-million scale dataset

### Index building space and time

Building an IVF-PQ index involves the same k-means clustering step as an IVF-Flat index. In addition, IVF-PQ must train the sub-quantizers, which makes the index building time take considerably longer.

Figure 3 and Figure 4 show that for a given dataset size, IVF-PQ takes longer to build than IVF-Flat by a factor of x2-5. The building time depends less on the number of clusters because of the extra overheads of the quantizer training and the data encoding.

The total index size is much smaller with IVF-PQ. Even with the least compression possible, every input element (fp32 in case of DEEP dataset) is encoded into a single byte. The selected IVF-PQ configs consistently yield x4-5 compression. This means we can fit a significantly larger dataset into the limited GPU memory and take advantage of the GPU acceleration.

### Search performance

In Figure 5, we compare the search performance on a dataset at 100-million scale for both IVF-Flat and IVF-PQ. For IVF-Flat, we used a single index with 100K clusters and two indices for IVF-PQ with 50K clusters.

We set ` pq_dim = dim = 96`

for both IVF-PQ indices, but set the `pq_bits`

to 5 and 8, respectively, and enabled the refinement with ratio 2. Both `internal_distance_dtype`

and `lut_dtype`

were set to 16-bit float (`half`

). With IVF-PQ, we are able to keep the original dataset on the host, so that we can perform the refinement operation on the CPU. We present the refinement step in more detail in part 2 of this blog post.

Figure 5 shows QPS/recall curves for IVF-Flat and IVF-PQ models. Each curve represents the number of queries per section (QPS) and the recall values with varying search parameters. On a small batch (10 queries) without the refinement step, IVF-PQ can be slightly faster than IVF-Flat, but only in the area of very low recall. Refinement, on the other hand, has a significant impact on recall, and even with a higher compression ratio, the IVF-PQ index with refinement has competitive recall. The throughput boost due to the smaller LUT can compensate for the overheads of the refinement, while the refinement fixes the errors arising from the lossy PQ compression.

Figure 6 shows the same experiment performed from Figure 4, but performed on a large batch of 10K queries at a time. Thanks to a larger work size (`n_queries·n_probes`

), both algorithms can fully utilize the GPU and hide latencies like those encountered from launching the CUDA kernels.

With large batches, IVF-PQ performs even better than IVF-Flat, showing 3-4x the number of queries per second of IVF-Flat. This result is somewhat expected, given that the IVF-PQ index is smaller than the IVF-Flat index by a similar ratio. Data transfer is therefore proportionally faster for IVF-PQ.

## Getting started with IVF-PQ in NVIDIA cuVS

cuVS is a library for vector search and clustering on the GPU. Many of the important building blocks in cuVS can be integrated into vector databases and other vector search libraries. Today, cuVS IVF-PQ is integrated into Milvus 2.3+ and FAISS 1.8.

## Summary

IVF-PQ is a fast ANN search algorithm that clusters and compresses data points to limit the complexity and improve the throughput of the nearest neighbor search. In the first part of this two-part post, we provided an overview of the internal workings of the NVIDIA cuVS IVF-PQ implementation and explained why it is fast on the GPU. IVF-PQ shares a lot in common with the IVF-Flat algorithm, but adds the product quantization compression on top of it.

By the end of this post, you should have an intuition behind the design details of the IVF-PQ algorithm and an idea of its overall performance. We encourage you to continue to the second part, to get practical recommendations on how to tune the performance of IVF-PQ for your data.