After clicking “Watch Now” you will be prompted to login or join.
Condensa: A Programming System for DNN Model Compression
Saurav Muralidharan, NVIDIA
GTC 2020
Deep neural networks contain far more weights than they need for the specific task they're trained to perform. They can be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy, but finding the best compression strategy for a given neural network target platform and optimization objective often requires extensive experimentation. Also, finding optimal hyperparameters for a given compression strategy results in even more expensive, and frequently manual, trial-and-error exploration. We'll introduce a programmable system for model compression called Condensa. Users of our framework can programmatically compose simple operators in Python to build complex compression strategies. Given a strategy and a user-provided objective, such as minimizing runtime, Condensa uses a novel sample efficient constrained Bayesian optimization-based algorithm to automatically infer optimal sparsity ratios.