Data Science

Using NetworkX, Jaccard Similarity, and cuGraph to Predict Your Next Favorite Movie

As the amount of data available to everyone in the world increases, the ability for a consumer to make informed decisions becomes increasingly difficult. Fortunately, large datasets are a beneficial component for recommendation systems, which can make a sometimes overwhelming decision much easier.

Graphs are excellent choices for modeling the relationships inherent in the data that fuel recommendation systems, and NetworkX is a very popular option that many data scientists turn to for graph analytics in Python. NetworkX is easy to learn and use, stocked with a wide breadth of graph algorithms, backed by a large and friendly community, and has copious examples available in notebooks, documents, Stack Overflow, and your favorite LLM.

However, to the disappointment of countless developers that broke into graph analytics with or even because of NetworkX, it famously falls short in performance at the scales used by typical recommendation systems.

This begs the question: Can an effective graph-based recommendation system be written in a few simple lines of Python? More generally, can developers and data scientists have both easy-to-use and high-performance graph analytics?

The answer to both questions is, “Yes.”

Read on to discover how you can create a simple and effective recommendation system in Python using NetworkX, a dataset of 33M movie reviews, the Jaccard Similarity algorithm, and the NVIDIA cuGraph back-end, which provides the >250x speedup necessary for modern large-scale graph data.

The MovieLens dataset

Here’s the most important part of the system: the data. The MovieLens dataset is generously made available for public download and is described in more detail in the README file. The full set includes about 331K anonymized users reviewing 87K movies, resulting in 34M ratings.

Image showing how the MovieLens data can be shown as a graph.
Figure 1. MovieLens data represented as a graph, where the individual ratings easily map to edges between user and movie nodes

Extracting recommendations from the data: bipartite graphs and Jaccard Similarity

The type of graph created from the MovieLens data is a bipartite graph because there are only two types of nodes: movies and users. The reviews (edges) can only occur between a user and a movie. This makes it particularly easy to apply the Jaccard Similarity algorithm to find similarities between movies.

Jaccard Similarity compares pairs of nodes and computes a similarity coefficient using their relationships in the graph. In this case, movies are related to each other based on how users have chosen to watch and review them.

Image shows six users linked to one or more of four movies and the similarity coefficients for movie recommendations based on a user's preferences.
Figure 2. Jaccard Similarity computes a similarity coefficient using the sizes of the sets of neighbors for the two nodes being compared

Based on the viewing preferences of users, you can see m3 is more similar to m2 than it is to m1, and movies m4 and m1 aren’t similar at all. This system would recommend m2 to someone who likes m3 and wouldn’t recommend m1 to someone who likes m4.

NetworkX makes it easy… for smaller graphs

Not surprisingly, NetworkX supports the type of analysis described earlier, and it’s quite easy to start seeing results in just a few lines of Python. But as you’ll see, performance becomes a limitation for larger-sized graphs—such as those needed for your movie recommendation system—when using NetworkX without the GPU-accelerated cuGraph backend.

I discuss the key pieces of the recommendation system later in this post, but the full source code is available in the /rapidsai/nx-cugraph GitHub repo.

Because the Jaccard Similarity algorithm you’re using doesn’t take edge weights into account, it considers all reviews equal. You don’t want movies with low reviews to be recommended, so filter out all reviews under a certain threshold, which has the side effect of making the graph smaller too.

# Create a separate DataFrame containing only "good" reviews (rating >= 3).
good_ratings_df = ratings_df[ratings_df["rating"] >= 3]
good_user_ids = good_ratings_df["userId"].unique()
good_movie_ids = good_ratings_df["movieId"].unique()

If you print the sizes of the data you’re working with, you see that your graph of good reviews is approximately 330K nodes and 28M edges, with an average degree (number of neighbors per node) of 84:

total number of users: 330975
total number of reviews: 33832162
average number of total reviews/user: 102.22
total number of users with good ratings: 329127
total number of good reviews: 27782577
average number of good reviews/user: 84.41

As mentioned earlier, graphs of this size often present a challenge to NetworkX, but GPU acceleration using the cuGraph backend removes the performance limitations often associated with this much data. However, I’ll continue with a CPU-only environment to demonstrate the default performance.

All the following examples were run on a workstation using NetworkX 3.4.2 and a Intel Xeon Platinum 8480CL at 2.0 GHz with 2 TB RAM.

Using a NetworkX graph created from users and good movie reviews, pick a user, find one of their highest rated movies, and use Jaccard Similarity to find other movies like it:

# Pick a user and one of their highly-rated movies
user = good_user_ids[321]
user_reviews = good_user_movie_G[user]
highest_rated_movie = max(
	user_reviews,
	key=lambda n: user_reviews[n].get("rating", 0)
)

When you look up the node ID in the movie name map, you see that one of this user’s highest rated movies is the animated film, Mulan:

highest rated movie for user=289308 is Mulan (1998), id: 1907, rated: {'rating': 5.0}

You can now use Jaccard Similarity to recommend a movie based on the user’s preferences and viewing history:

%%time
# Run Jaccard Similarity
jacc_coeffs = list(nx.jaccard_coefficient(good_user_movie_G, ebunch))
CPU times: user 2min 5s, sys: 15.4 ms, total: 2min 5s
Wall time: 2min 14s

The Jaccard Similarity computation using the default NetworkX implementation ran for over two minutes. Using these results, you can now provide a recommendation.

# Sort by coefficient value, which is the 3rd item in the tuples
jacc_coeffs.sort(key=lambda t: t[2], reverse=True)
 
# Create a list of recommendations ordered by "best" to "worst" based on the
# Jaccard Similarity coefficients and the movies already seen
movies_seen = list(good_user_movie_G.neighbors(user))
recommendations = [mid for (_, mid, _) in jacc_coeffs
               	  if mid not in movies_seen]

Now you can print the first movie in the sorted list of recommendations:

User ID 289308 might like Tarzan (1999) (movie ID: 2687)

The code is easy and the results look good, but performance holds us back

As you can see, the recommendation seems reasonable; someone who likes Mulan seems likely to also enjoy the 1999 Disney animated film Tarzan.

However, if the goal was to provide a service or to analyze hundreds or thousands of movies, the two-minute runtime would have you start looking for an alternative to NetworkX. You can see that finding similarities between other movies using this system isn’t any faster:

%%time
# 1196: "Star Wars: Episode V - The Empire Strikes Back (1980)"
print_similar_movies(1196)
movies similar to Star Wars: Episode V - The Empire Strikes Back (1980):
movieId=260, Star Wars: Episode IV - A New Hope (1977)
movieId=1210, Star Wars: Episode VI - Return of the Jedi (1983)
movieId=1198, Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
CPU times: user 13min 47s, sys: 71.8 ms, total: 13min 47s
Wall time: 11min 30s

Here’s another example:

%%time
# 318: "Shawshank Redemption, The (1994)"
print_similar_movies(318) 
movies similar to "Shawshank Redemption, The (1994)":
movieId=296, Pulp Fiction (1994)
movieId=593, "Silence of the Lambs, The (1991)"
movieId=356, Forrest Gump (1994)
CPU times: user 28min 28s, sys: 172 ms, total: 28min 28s
Wall time: 16min 49s

The quality of the recommendations returned is impressive given that this system is composed of only a few lines of code. However, the runtime performance makes it virtually unusable. ‌As described earlier, finding recommendations based on Shawshank Redemption, The (1994) takes nearly 17 minutes.

NVIDIA cuGraph makes it transformatively faster

The graph algorithm in this workflow is prohibitively expensive, but by using the NVIDIA cuGraph backend and a compatible GPU, you can dramatically improve performance without changing the code.

Jaccard Similarity is supported in nx-cugraph version 25.02 or later. Version 25.02 is available from nightly builds and will be part of future stable releases later this month. Instructions on installing nx-cugraph, as well as other RAPIDS packages, from both nightly and stable channels using conda or pip, are available in the RAPIDS Installation Guide.

After being installed, enable nx-cugraph by setting an environment variable:

NX_CUGRAPH_AUTOCONFIG=True

cuGraph uses the GPU to dramatically accelerate the neighbor lookups and set comparisons needed for the Jaccard Similarity computation. As the graph scales and the number of movies and reviews per movie increases, performance remains almost constant.

The best part of the system—the simplicity of the code—does not change, and the results are identical, but performance increases by over 250x for the run that previously took nearly 17 minutes, reducing it to under 4 seconds.

Bar chart shows the Jaccard Similarity speedup with cuGraph of 252.88x on Shawshank Redemption; 174.68x on The Empire Strikes Back; and 65.05x on Mulan.
Figure 3. Speedup of cuGraph over NetworkX for Jaccard Similarity computation for various movies

Software: NetworkX 3.4.2, cuGraph/nx-cugraph 25.02; CPU: Intel(R) Xeon(R) Platinum 8480CL @ 2.0GHz 2TB RAM; GPU: NVIDIA Quadro RTX 8000 48GB RAM

Conclusion

This post covered a simple and effective recommendation system that’s easy to write in Python using NetworkX. Although there are many other approaches you could take—as covered in What Is a Recommendation System?—few would match the low effort required to start exploring data that graph analysis with NetworkX offers.

However, productive and meaningful data exploration requires quick turnaround, and NetworkX has traditionally struggled to scale to larger, real-world problem sizes.

The NVIDIA cuGraph backend for NetworkX accelerates the familiar and flexible NetworkX API to also make it performant at scale, generating results in seconds instead of tens of minutes, keeping you focused and productive. You can now continue using NetworkX, the most popular graph analytics library, without concern for scaling issues simply by adding a GPU and the cuGraph backend to your environment.

For more information about accelerated graph analysis using NetworkX and NVIDIA cuGraph, see RAPIDS: GPU-Accelerated NetworkX Backend.

Discuss (0)

Tags