Simulation / Modeling / Design

Building Recommender Systems Faster Using Jupyter Notebooks from NGC

The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Recommender System.

Recommender systems deal with predicting user preferences for products based on historical behavior and are widespread in online retail for product recommendations, in social media for targeted ads, in streaming video and music platforms for providing user centric content, and more.

In my previous post, Building Image Segmentation Faster Using Jupyter Notebooks from NGC, I discussed some of the problems inherent in building, training, and optimizing state-of-the-art models. I also explained how NGC is now adding sample Jupyter notebooks, complete with instructions on how to train and deploy a model using these artefacts from the NGC Catalog.

In this post, I show how you can use the sample recommendation system notebook. The trained model in this notebook predicts the rate of a movie for a user and recommends movies.

Recommender system

The Variational Autoencoder (VAE) shown here is an optimized implementation of the architecture first described in Variational Autoencoders for Collaborative Filtering and can be used for recommendation tasks. The Jupyter notebook shows the process of training and testing the VAE model. The model consists of two parts: the encoder and decoder. The encoder transforms the vector, which contains the interactions for a specific user, into a n-dimensional variational distribution. You can use this variational distribution to obtain a latent representation of a user. This latent representation is then fed into the decoder. The result is a vector of item interaction probabilities for a particular user.

The Variational Autoencoder (VAE) shown here is an optimized implementation of the architecture first described in Variational Autoencoders for Collaborative Filtering and can be used for recommendation tasks
Figure 1. Model architecture for recommender systems.

The following features were implemented in this model:

  • Sparse matrix support
  • Data-parallel multi-GPU training
  • Dynamic loss scaling with backoff for tensor cores (mixed precision) training
  • Feature support matrix

To follow this tutorial, ensure that you have the following resources:

Clone the repository

You can find the resources related to this model on the NGC Catalog resources page. The resource can be downloaded manually from the top right menu or by using the wget resource command.

The image shows the VAE Recommender Model page, with all the pull instructions, documentation for setting up the model and the necessary performance metrics associated with this model.
Figure 2. Resource landing page for VAE Recommender Model in the NGC Catalog.

To download and unzip resources using the command line interface, follow these steps:

  1. Make a new folder using mkdir.
  2. Go to the new folder using cd.
  3. Use wget to download resources as a zip file inside the folder.
  4. Unzip the zip file using unzip.
mkdir VAE

cd VAE

wget --content-disposition -O


Build the container

Make the container using the Dockerfile inside the resource folder that you just downloaded:

docker build . -t < tagname>

You can tag the container with any name.

docker build . -t vae

Run the container

Start the container. This mounts a folder for the data inside the container. You download a movie rating dataset inside this folder. Set port 8888:8888 for this container so that you can access the Jupyter notebooks inside the container later using this port.

docker run -it --rm --runtime=nvidia -v /data/vae-cf:/data -p 8888:8888 vae /bin/bash

Data set

For this post, you use the MovieLens 20m data set to train the model. MovieLens 20M is a movie-rating data set that includes 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The goal of the model is to predict the rating of a new movie for a user considering the previous sets of the user: movie, ratings, and so on.

If you do not have the dataset downloaded, run the following commands to download and extract the MovieLens dataset to the /data/ml-20m/extracted/ folder. The data set can be preprocessed by running the following command in the Docker container:


By default, the dataset is stored in the /data directory. To store the data in a different location, pass the desired location to the --data_dir argument.

mkdir /data/ml-20m

mkdir /data/ml-20m/extracted



If you already have the data set downloaded and unzipped elsewhere, run the following commands to first exit the current VAE-CF Docker container and then restart the VAE-CF Docker container by mounting the MovieLens data set location.

docker run -it --rm --gpus all -v /data/vae-cf:/data -v <ml-20m folder path>:/data/ml-20m/extracted/ml-20m -p 8888:8888 vae /bin/bash

Train and test the model inside the container

When you have the container running and the dataset accessible within the container, run the rest of commands using a Jupyter notebook. Open the Jupyterlab workspace in the container. After that, upload the VAE-model notebook and run the cells of the notebook to train, test, and run inference on this model.

jupyter notebook --ip --port 8888 --allow-root

Preparing the dataset

After downloading the dataset in the extracted folder, go back to the main workspace and run This command takes the ml-20m dataset and divides it up so that it can be used for training, validating, and testing.


Training the model

Start the training by running the script with the train argument. The resulting checkpoints, containing the trained model weights, are then stored in the directory specified by the --checkpoint_dir directory option. By default, no checkpoints are saved.

Additionally, the --results_dir command-line argument (by default, None) specifies where to save the following statistics in a JSON format. A complete list of command-line arguments is saved as /args.json, along with a dictionary of validation metrics and performance metrics recorded during training.

When you run the training command, you see the details of the training process in each epoch. Also, you can change the training hyperparameters of the model by changing the arguments of For more information about the arguments, see the last cell of the notebook.

After each 50 epochs, run inference to see the recall of the model. Recall is a metric that shows the performance of this model as a percentage of the correctly predicted ratings over all the correct ratings: true positive /true positive + false negative.

mpirun --allow-run-as-root -np 1 -H localhost:8 python --train --amp --checkpoint_dir ./checkpoints

Figure 3 shows the output of running the previous command.

Output of each training step from the model for 50 epochs that shows the additional parameter such as batch size, validation steps, iteration time and the training throughput.
Figure 3. Output of each training step from the model for 50 epochs.

Testing the model

The model is exported to the default model_dir location and can be loaded and tested using the following command. You use the weights of the trained model that have been saved in the checkpoints folder and test data for performance on the unseen data.

In this step, use the data set set aside for testing on the trained model to test the performance of the model. You also view the recall metric to view the accuracy of the prediction.

python --test --amp --checkpoint_dir ./checkpoints

Figure 4 shows the output of the test command.

The output of the test command shows the accuracy of the model in predicting the rate of a movie for a new user by considering the (movie id, rate) of previous users. The model was trained using (movie id, rate) of some users and test with unseen data for model.
Figure 4. Output results from the model after using a small subset of the data set aside for testing.

This model was trained using some default hyperparameters and small dataset for 400 epochs. You can use larger datasets, increase the training epochs, and choose different hyperparameters to enhance the model performance. The --help command shows you the different options you have for working with this specific model. The script also provides an entry point to all the provided functionalities such as training, testing, and inference. The behavior of the script is controlled by command-line arguments listed in the next section. The script can be used to preprocess the MovieLens 20m data set.


The most important command-line parameters include:

  • --data_dir— Specifies the directory inside the Docker container where the data is stored, overriding the default location /data.
  • --checkpoint_dir— Controls if and where the checkpoints are stored.
  • --amp— Enables mixed precision training.

There are also multiple parameters controlling the various hyperparameters of the training process, such as learning rate, batch size, and so on. To see the full list of available options and their descriptions, use the -h or --help command-line option.

python --help

Figure 5 shows some available options to customize running

Figure 5. Options for customizing



In this post, I showed you how to pull the VAE model and build a container from the resources related to this model. After that, you uploaded a notebook in the container to train and test the model.

Get started today by visiting the NGC Catalog, downloading the Jupyter notebooks for a recommender system, and applying it to your own use cases.

NVIDIA GTC provides training, insights, and direct access to experts. Join us for breakthroughs in AI, data center, accelerated computing, healthcare, game development, networking, and more. Invent the future with us April 12-16, 2021. 

Discuss (6)