Grouped Sparse Projection for Deep Learning
Riyasat Ohib,
Nicolas Gillis,
Sergey Plis,
and Vamsi Potluru
ICLR Hardware Aware Efficient Training workshop
2021
Accumulating empirical evidence shows that very large deep learning models learn faster and achieve higher accuracy than their
smaller counterparts. Yet, smaller models are more energy efficient and often easier to interpret. To simultaneously
get the benefits of large and small models, we often encourage sparsity in the weights of large models. For this, different
approaches have been proposed, including weight pruning and distillation. Unfortunately, most existing approaches do not offer a
controllable way to specify a desired level of sparsity as an interpretable parameter and attain it in a single run. In this work,
we design a new sparse projection method for a set of weights that achieves a desired average level of sparsity, measured using
the ratio of the l1 and l2 norms, without additional hyperparameter tuning. Instead of projecting each vector of the weight matrix
individually, or using sparsity as a regularizer, we project all vectors together to achieve an average target sparsity, where the
sparsity levels of the individual vectors of the weight matrix are automatically tuned. Our projection operator has the following
guarantees: (A) it is fast, with a runtime linear in the size of the vectors; (B) the solution is unique except on a set of
measure zero. We utilize our projection operator to obtain the desired sparsity of deep learning models in a single run with a
negligible performance hit, while competing methods require sparsity hyperparameter tuning. Even with a single projection of a
pre-trained dense model followed by fine-tuning, we show empirical performance competitive with the state of the art. We support these
claims with empirical evidence on real-world datasets and on a number of architectures, comparing our method to other
state-of-the-art methods, including DeepHoyer.
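
The sparsity measure referred to above is the Hoyer measure, based on the ratio of the l1 and l2 norms. As a minimal sketch for
illustration only (not the paper's grouped projection operator), the following Python snippet computes this measure for each weight
vector and its average over a set of vectors; the function names hoyer_sparsity and average_sparsity are ours for this example.

    import numpy as np

    def hoyer_sparsity(x):
        # Hoyer sparsity of a nonzero vector x, built from the l1/l2 ratio:
        # 0 for a constant (fully dense) vector, 1 for a 1-sparse vector.
        n = x.size
        l1 = np.abs(x).sum()
        l2 = np.linalg.norm(x)
        return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

    def average_sparsity(vectors):
        # Average Hoyer sparsity over a set of weight vectors
        # (e.g., the rows of a layer's weight matrix).
        return float(np.mean([hoyer_sparsity(v) for v in vectors]))

    # Example: average sparsity of the rows of a random dense weight matrix;
    # the value is low for dense Gaussian weights and approaches 1 as rows get sparser.
    W = np.random.randn(128, 64)
    print(average_sparsity(W))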