Sparsification is the process of taking a trained, over-parameterized deep learning model and removing redundant information from it, resulting in a faster and smaller model. Sparsification techniques range from inducing sparsity through pruning and quantization to distilling a larger model into a smaller version. When implemented correctly, these techniques yield significantly faster and smaller models with little to no effect on the baseline metrics.
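As a minimal sketch of what inducing sparsity looks like in practice, the snippet below applies unstructured magnitude pruning to a single layer using PyTorch's built-in pruning utilities. The layer shape and sparsity level are illustrative, not taken from the text above.

```python
import torch
import torch.nn.utils.prune as prune

# Example layer; the shape and 80% sparsity target are illustrative only
layer = torch.nn.Linear(256, 256)

# Zero out the 80% of weights with the smallest magnitudes
prune.l1_unstructured(layer, name="weight", amount=0.8)

# Make the pruning permanent by folding the mask into the weight tensor
prune.remove(layer, "weight")

print(f"Fraction of zero weights: {(layer.weight == 0).float().mean():.2f}")
```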
Combining multiple sparsification algorithms generally produces models that are more compressed and faster than any single algorithm alone. This combination of algorithms is termed Compound Sparsification. For example, pruning and quantization are commonly combined to create sparse-quantized models that can be up to 4 times smaller. Similarly, NLP models often combine distillation, weight pruning, layer dropping, and quantization to create much smaller models that recover close to the original baseline. A small sketch of one such combination follows.
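The sketch below illustrates one simple form of compound sparsification under stated assumptions: unstructured magnitude pruning of the linear layers followed by INT8 dynamic quantization, both using standard PyTorch APIs. The model, layer sizes, and sparsity level are hypothetical; production workflows typically apply pruning gradually during training rather than in one shot.

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical toy model used only to demonstrate the combination of techniques
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Step 1: prune 80% of the smallest-magnitude weights in each Linear layer
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # fold the mask into the weights

# Step 2: quantize the remaining weights from FP32 to INT8
sparse_quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```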
See our blog for a detailed conceptual discussion of pruning.
Ultimately, the power of sparsification is only realized when the deployment environment supports it. DeepSparse is engineered specifically to exploit sparse networks, delivering GPU-class performance on CPUs.
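As a minimal sketch of deploying a sparsified model, the snippet below compiles a sparse ONNX file with DeepSparse and runs a single inference. It assumes DeepSparse exposes a compile_model entry point as in its documented examples; the model path and input shape are hypothetical placeholders.

```python
import numpy as np
from deepsparse import compile_model

# Hypothetical path to a sparsified (e.g., sparse-quantized) ONNX model
model_path = "model.onnx"

# Compile the model for CPU execution
engine = compile_model(model_path, batch_size=1)

# Run inference on a random input matching the assumed input shape
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
```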