
Build Efficient Recommender Systems with Co-Visitation Matrices and RAPIDS cuDF

Recommender systems play a crucial role in personalizing user experiences across various platforms. These systems are designed to predict and suggest items that users are likely to interact with, based on their past behavior and preferences. Building an effective recommender system involves understanding and leveraging huge, complex datasets that capture interactions between users and items.

This post will show you how to build a simple yet strong recommender system based on co-visitation matrices. One of the key challenges in building co-visitation matrices is the computational complexity involved in processing large datasets. Traditional methods using libraries like pandas can be inefficient and slow, especially when dealing with millions or even billions of interactions. This is where RAPIDS cuDF comes in. RAPIDS cuDF is a GPU DataFrame library that provides a pandas-like API for loading, filtering, and manipulating data.

Recommender systems and co-visitation matrices

Recommender systems are a category of machine learning algorithms aimed at delivering personalized suggestions, or recommendations, to users. These systems are used in a variety of applications, including e-commerce (Amazon, OTTO), content streaming (Netflix, Spotify), social media (Instagram, X, TikTok), and more. The role of these systems is to assist users in discovering products, services, or other content in alignment with their interests and preferences.

Datasets used to build a recommender system typically contain the following:

  • N items to recommend. N can be huge (even millions).
  • Interactions between users and items. A sequence of such interactions for a given user is called a session. The goal is then to infer which items the user is going to interact with next.  

Figure 1 shows an example session in which the user interacted with items 6543, 242, 5381, and 5391. The goal of the recommender system is to predict which items the user will interact with next. One common way to assess performance is recall on the k guesses made by the model (recall@k): the number of ground-truth items the model retrieves, normalized by the total number of ground-truth items.

A diagram showing an example session. Item 6543 comes first, then item 242, then 5381, followed by item 5391. The next action is unknown.
Figure 1. Example session used to build a recommender system

During a session, a user will often interact with several items. A co-visitation matrix counts the items appearing together and is of size N x N. Co-visitation matrices can be easily used to make recommendations by checking which items co-occur frequently with the items in the session. For the session shown in Figure 1, if item 2834 is frequently bought together with item 6543, then it is a good recommendation.
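
To make the idea concrete, here is a tiny, self-contained sketch of co-visitation counting and the resulting recommendation logic. The sessions and counts are made up for illustration; the implementation later in this post works on DataFrames instead of Python dictionaries.

from collections import defaultdict
from itertools import permutations

# Toy sessions: each one is the list of item IDs a user interacted with
sessions = [
    [6543, 242, 5381, 5391],
    [6543, 2834],
    [242, 2834, 6543],
]

# Count how often each pair of distinct items appears in the same session
covisit = defaultdict(lambda: defaultdict(int))
for session in sessions:
    for a, b in permutations(set(session), 2):
        covisit[a][b] += 1

# Recommend the items that co-occur most often with the session's items
def recommend(session_items, k=3):
    scores = defaultdict(int)
    for item in session_items:
        for candidate, count in covisit[item].items():
            if candidate not in session_items:
                scores[candidate] += count
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend([6543, 242]))  # item 2834 comes out on top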

Challenges building co-visitation matrices

Computing co-visitation matrices requires looking at all the sessions and counting all the co-occurrences. This quickly becomes costly: for a given session of length L, the complexity is O(L^2). A real-world recommender system dataset contains millions of sessions, if not billions. As a result, heavy optimization is necessary for co-visitation matrices to be usable, and sessions need to be processed together rather than one at a time.

pandas makes the computations easy to implement, but at the expense of efficiency. On one hand, the data has to be processed in chunks to keep memory usage under control. On the other hand, large datasets cause significant slowdowns.

What is needed is a faster computing framework that keeps the code clarity of pandas. This is where RAPIDS cuDF comes in: with zero code changes, you can accelerate the computations by 40x.

This post demonstrates how to build a co-visitation matrix and accelerate the workflow using the RAPIDS cuDF pandas accelerator mode. Running the code as you read will give you a better sense of how beneficial the accelerator really is. Before you begin, make sure to turn on the GPU accelerator. For more details, see the demo notebook.

RAPIDS cuDF pandas accelerator mode

RAPIDS cuDF is a Python GPU DataFrame library designed to speed up operations that can be slow when performed on CPU on big datasets (loading, joining, aggregating, and filtering, for example).

Its API style is similar to pandas’, and the library delivers 50x to 150x faster performance for tabular data processing.

The RAPIDS cuDF pandas accelerator mode extends this capability, enabling you to bring accelerated computing to your pandas workflows without requiring any code changes.

The data

The data used in this tutorial is extracted from the train set of the OTTO – Multi-Objective Recommender System Kaggle competition, which contains one month of sessions. The first three weeks are used for building the matrices, and the last week for evaluation. Validation sessions were truncated to build targets for the model. Recall@20 will be used to see how well the co-visitation matrices can retrieve the truncated items.

Note that it’s important to use a temporal split to avoid leaking information; the competition’s test data contains the fifth week of the dataset. The dataset contains 1.86 million items and about 500 million user-item interactions, stored in chunked parquet files for easier handling.

The following information is known about the data:

  • session: The session ID; a session is equivalent to a user in this case
  • aid: The item ID
  • ts: The time at which the interaction occurred
  • type: The type of interaction; can be clicks, carts, or orders
The image shows 5 rows of session 0.  First, item 1517085 was clicked, then items 1563459, 1309446, 16246 and 1781822 were clicked.
Figure 2. Data sample

Implementing co-visitation matrices

Because of the number of items in the dataset, memory is an issue. So, split the data into two parts to prevent the co-visitation matrix from being too large. Then loop over all the parquet files in the training data.
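
The snippets that follow reuse a handful of names (imports, constants, and file lists) that are defined once up front. A minimal setup could look like the sketch below; the paths and exact values are assumptions for illustration and may differ from the demo notebook.

import gc
from glob import glob

import pandas as pd
from tqdm import tqdm

# Assumed constants: number of item-ID ranges, range width, candidates kept
PARTS = 2
SIZE = 1_860_000 // PARTS + 1
N_CANDIDS = 20

# Assumed locations of the chunked parquet files
train_files = sorted(glob("train_parquet/*.parquet"))  # weeks 1 to 3
val_files = sorted(glob("val_parquet/*.parquet"))      # validation week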

covisitation_matrix = []
for part in range(PARTS):
      print(f"- Part {part + 1}/{PARTS}")
      matrix = None
      for file_idx, file in enumerate(tqdm(train_files)):

The first step is to load the data. Then apply some transformations to save memory: cast the columns to int32 and truncate long sessions, keeping at most 30 interactions per session.

for part in range(PARTS):
     for file_idx, file in enumerate(tqdm(train_files)):
          [...]
          # Load sessions & convert columns to save memory
          df = pd.read_parquet(file, columns=["session", "aid", "ts"])
          df["ts"] = (df["ts"] / 1000).astype("int32")
          df[["session", "aid"]] = df[["session", "aid"]].astype("int32")

          # Keep the 30 most recent interactions per session
          df = df.sort_values(
               ["session", "ts"],
               ascending=[True, False],
               ignore_index=True
          )
          df["n"] = df.groupby("session").cumcount()
          df = df.loc[df.n < 30].drop("n", axis=1)

Next, you can get all the co-occurrences by aggregating the data with itself on the session column. This is already time-consuming and the resulting data frame is quite big.

     # Compute pairs 
     df = df.merge(df, on="session")

To reduce the cost of the matrix computation, restrict the items to the ones in the part currently being considered. Also, only keep pairs of interactions that occur within one hour of each other and involve different items, and drop duplicated pairs within a session.

     # Split in parts to reduce memory usage
     df = df.loc[
          (df["aid_x"] >= part * SIZE) &
          (df["aid_x"] < (part + 1) * SIZE)
     ]

     # Restrict to pairs within 1 hour and remove self-matches
     df = df.loc[
          ((df["ts_x"] - df["ts_y"]).abs() < 60 * 60) &
          (df.aid_x != df.aid_y)
     ]

     # No duplicated interactions within sessions
     df = df.drop_duplicates(
          subset=["session", "aid_x", "aid_y"],
          keep="first",
     ).reset_index(drop=True)

In the next step, compute the matrix weights. Count all the co-occurrences with a sum aggregation over the pairs.

     # Compute weights of pairs
     df["wgt"] = 1
     df["wgt"] = df["wgt"].astype("float32")

     df.drop(["session", "ts_x", "ts_y"], axis=1, inplace=True)
     df = df.groupby(["aid_x", "aid_y"]).sum()

At the end of the second loop (the one which loops over parquet files), update the coefficients by adding the newly computed weights to the previous ones. Since co-visitation matrices are big, this process is slow and consumes memory. To free some memory, delete unused variables.

     # Update covisitation matrix with new weights
     if matrix is None:
          matrix = df
     else:  # this is the bottleneck operation
          matrix = matrix.add(df, fill_value=0)  

     # Clear memory
     del df
     gc.collect()

After seeing all the data, reduce the size of the matrices by only keeping the N best candidates per item—that is, the candidates with the highest weight. This is where the interesting information is located.

for part in range(PARTS):
     [...]
     # Final matrix: sort candidates by weight
     matrix = matrix.reset_index().rename(
          columns={"aid_x": "aid", "aid_y": "candidate"}
     )
     matrix = matrix.sort_values(
          ["aid", "wgt"], ascending=[True, False], ignore_index=True
     )

     # Keep the N_CANDIDS best candidates per item
     matrix["rank"] = matrix.groupby("aid").candidate.cumcount()
     matrix = matrix[matrix["rank"] < N_CANDIDS].reset_index(drop=True)
     covisitation_matrix.append(matrix)

The final step is to concatenate the different parts of the matrix that were computed separately. Afterwards, you can choose to save your matrix to disk if needed.

covisitation_matrix = pd.concat(covisitation_matrix, ignore_index=True)
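
If you want to reuse the matrix in the candidate generation step without recomputing it, one option is to write it to disk, for example as a parquet file (the file name here is arbitrary):

covisitation_matrix.to_parquet("covisitation_matrix.parquet", index=False)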

There you have it. This code computes a simple co-visitation matrix using pandas. It does have one major flaw: it is slow, taking almost 10 minutes to compute the matrix! It also required restricting the data significantly for a reasonable runtime. 

cuDF pandas accelerator mode

And this is where the cuDF pandas accelerator mode comes in. Restart the kernel, and with one line of code unleash the power of your GPU:

%load_ext cudf.pandas
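
The magic command above works in Jupyter notebooks. In a plain Python script, the accelerator can also be enabled programmatically before pandas is imported, or from the command line without modifying the script:

import cudf.pandas
cudf.pandas.install()

import pandas as pd  # subsequent pandas operations run on the GPU where possible

From the command line, the equivalent is python -m cudf.pandas your_script.py.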

Table 1 shows the runtimes with and without cuDF acceleration. One line of code achieves a 40x speedup.

                  pandas         cuDF pandas
10% of the data   8 min 41 s     13 s
Whole dataset     1 h 30 min+    5 min 30 s

Table 1. Performance comparison of pandas versus cuDF pandas

Generating candidates

Generating candidates to recommend is one use of co-visitation matrices. This is done by aggregating the weights of the co-visitation matrix over all the items in a session. The items with the highest weight will be recommended. The implementation is straightforward in pandas, and once again benefits from the GPU accelerator.

Start by loading the data, then only consider the last N_CANDIDS seen items. Items that have already been seen in the sessions are great recommendations. They will be recommended first. Remember that self-matches were removed from the co-visitation matrices to save memory, so simply recover them here.

candidates_df_list = []
last_seen_items_list = []

for file_idx in tqdm(range(len(val_files))):
     # Load sessions & convert columns to save memory
     df = pd.read_parquet(
          val_files[file_idx], columns=["session", "aid", "ts"]
     )
     df["ts"] = (df["ts"] / 1000).astype("int32")
     df[["session", "aid"]] = df[["session", "aid"]].astype("int32")

     # Last seen items
     df = df.sort_values(
          ["session", "ts"], ascending=[True, False], ignore_index=True
     )
     df["n"] = df.groupby("session").cumcount()

     # Keep the last N_CANDIDS items of each session
     last_seen_items_list.append(
          df.loc[df.n < N_CANDIDS].drop(["ts", "n"], axis=1)
     )
     df.drop(["ts", "n"], axis=1, inplace=True)

Next, merge the co-visitation matrix on the items in the session. Each item in the session is then associated with up to N_CANDIDS (candidate, weight) pairs. To get a session-level weight for each candidate, sum its weights over the session’s items. Keeping the N_CANDIDS candidates with the highest weight gives the candidates for the recommender system.

for file_idx in tqdm(range(len(val_files))):
     [...]
     # Merge covisitation matrix
     df = df.merge(
          covisitation_matrix.drop("rank", axis=1), how="left", on="aid"
     )
     df = df.drop("aid", axis=1).groupby(
          ["session", "candidate"]
     ).sum().reset_index()

     # Sort candidates
     df = df.sort_values(["session", "wgt"], ascending=[True, False])

     # Keep the N_CANDIDS best candidates per session
     df["rank"] = df.groupby("session").candidate.cumcount()
     df = df[df["rank"] < N_CANDIDS].reset_index(drop=True)
     candidates_df_list.append(df)

Performance assessment

To evaluate the strength of the candidates, use the recall metric. The recall measures the proportion of items in the ground truth that were successfully found by the retriever. 

In this case, allow for 20 candidates. You want to see the proportion of items users purchased that the matrices successfully retrieved. The achieved recall@20 is 0.5868: on average, almost 59% of the items a user went on to purchase are found among the 20 candidates. This is already a strong baseline; during the competition, top teams reached scores close to 0.7. Refer to the demo notebook for details about the implementation, which are beyond the scope of this post.
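
As a rough sketch of how such a recall@20 could be computed, assuming a candidates DataFrame like the one built above (with session, candidate, and rank columns) and a ground_truth DataFrame with one row per session and purchased item (its layout and column names are assumptions):

def recall_at_k(candidates, ground_truth, k=20):
    # Top-k candidates per session
    topk = candidates.loc[candidates["rank"] < k, ["session", "candidate"]]

    # Ground-truth items that appear among the candidates
    hits = ground_truth.merge(
        topk, left_on=["session", "aid"], right_on=["session", "candidate"]
    )

    # Proportion of ground-truth items that were retrieved
    return len(hits) / len(ground_truth)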

Going further

Accelerating co-visitation matrix computation and aggregation enables you to iterate quickly and improve candidate recall. The first improvement is to give the matrices more history. With a fast implementation, this can be done without waiting hours for computations to end.

Then refine the matrices. So far, every co-occurring pair of items is given a weight of one. The demo gives more weight to items that are closer in time, since such interactions seem more relevant. Another idea is to consider the type of interaction: items purchased together, or added to a cart together, seem more relevant to recommend than items that were simply viewed.
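
As a sketch of what such a refinement could look like in the weight computation above, replacing the constant weight of one (the decay shape, the type weights, and the type_y column are illustrative assumptions; the type column would also need to be loaded and carried through the merge):

     # Give more weight to interactions that happen close together in time
     # (ts is in seconds and pairs are already restricted to within 1 hour)
     df["wgt"] = 1 - (df["ts_x"] - df["ts_y"]).abs() / (60 * 60)

     # Optionally weight by the type of the co-occurring interaction
     type_weights = {"clicks": 1.0, "carts": 3.0, "orders": 6.0}
     df["wgt"] *= df["type_y"].map(type_weights)

     df["wgt"] = df["wgt"].astype("float32")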

Until now, all items in the session have been considered equally important when aggregating matrix weights. However, items that appear later in the session are more predictive of the user’s next actions than older ones. Therefore, you can refine the aggregation by increasing the weights of candidates generated by items closer to the end of the session.
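
A sketch of this idea in the candidate generation loop, assuming the n column from the cumcount above is kept instead of dropped (the 1 / (n + 1) boost is just an illustrative choice):

     # n is 0 for the most recent item of a session, 1 for the one before, etc.
     df["pos_wgt"] = 1 / (df["n"] + 1)

     # Merge the co-visitation matrix, then scale weights by item recency
     df = df.merge(covisitation_matrix.drop("rank", axis=1), how="left", on="aid")
     df["wgt"] *= df["pos_wgt"]

     df = df.drop(["aid", "n", "pos_wgt"], axis=1).groupby(
          ["session", "candidate"]
     ).sum().reset_index()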

Using these three changes, the recall@20 improves to 0.5992. The implementations follow the same scheme as the code presented in this post, with a few added lines to bring in the improvements. For details, see the demo notebook.

The final step is to merge several co-visitation matrices to capture different types of candidates.

Summary

This post provides a comprehensive guide to building and optimizing co-visitation matrices. Although co-visitation matrices are a simple tool—they count the co-occurrence of items during user sessions—building them involves processing large amounts of data. Doing so efficiently is necessary to iterate quickly and improve recommender systems.

By leveraging RAPIDS cuDF and its newly released pandas accelerator mode, co-visitation matrix computation is up to 50x faster. Building a diverse set of co-visitation matrices quickly, thanks to GPU acceleration, was a key component that enabled Kaggle Grandmasters of NVIDIA to win multiple recommender system competitions, including KDDCup 2023 and OTTO.

While the demo notebook only computes two types of matrices, the possibilities are limitless. Experiment with the code and try to improve the construction of the matrices to capture different candidates.
