Sparsifying a Model for SparseML Integrations

This page walks through an example of creating a sparsification recipe to prune a dense model from scratch and applying that recipe to a supported integration.

SparseML has pre-made integrations with many popular model repositories, such as Hugging Face Transformers and Ultralytics YOLOv5. For these integrations, a sparsification recipe is all you need, and you can apply state-of-the-art sparsification algorithms, including pruning, distillation, and quantization, with a single command line call.

Install Requirements

This page requires the SparseML Torchvision install to run the example in the Applying a Recipe section.

Pruning and Pruning Recipes

Pruning is a systematic way of removing redundant weights and connections within a neural network. A pruning algorithm must determine which weights are redundant and can be removed without affecting accuracy.

A standard algorithm for pruning is gradual magnitude pruning, or GMP for short. With it, the weights closest to zero are iteratively removed over several epochs or training steps. The non-zero weights are then fine-tuned to the objective function. This iterative process enables the model to adjust to a new optimization space after pathways are removed before pruning again.
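To make this concrete, below is a minimal PyTorch sketch of one GMP update: a sparsity target is interpolated for the current epoch and the smallest-magnitude weights are zeroed. The cubic ramp and the helper names are illustrative assumptions, not SparseML's implementation.

import torch


def sparsity_at(epoch, start_epoch, end_epoch, init_sparsity, final_sparsity):
    # Interpolate the sparsity target for the current epoch; a cubic ramp
    # is assumed here purely for illustration.
    progress = min(max((epoch - start_epoch) / (end_epoch - start_epoch), 0.0), 1.0)
    return final_sparsity + (init_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weight, sparsity):
    # Build a 0/1 mask that zeroes the smallest-magnitude fraction of weights.
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).to(weight.dtype)


# Example: prune a random layer to the epoch-10 target of a 5% -> 80% ramp.
weight = torch.randn(256, 512)
target = sparsity_at(epoch=10, start_epoch=0, end_epoch=30,
                     init_sparsity=0.05, final_sparsity=0.80)
weight = weight * magnitude_mask(weight, target)
print(f"target: {target:.2f}, actual: {(weight == 0).float().mean().item():.2f}")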

Important hyperparameters that need to be set are the following:

  • The layers to prune and their target sparsity levels
  • The number of epochs for pruning
  • The frequency of pruning
  • The length of time to fine-tune after pruning
  • The learning rates (LRs) for pruning and fine-tuning

The proper hyperparameter values will differ for different model architectures, training schemes, and domains, but there is some general intuition for safe starting values. The following are reasonable default values to start with:

  • The final sparsity is set to 80%, applied globally across all layers.
  • The pruning frequency is set to once per epoch (up to a few times per epoch for shorter schedules).
  • The number of pruning epochs is set to 1/3 the original training epochs.
  • The number of fine-tuning epochs is set to 1/4 the original epochs.
  • The pruning LR is set to the midpoint between the model's initial and final training LRs.
  • The fine-tuning LR cycles from the pruning LR down to the final LR used in the original training.

SparseML conveniently encodes these hyperparameters into a YAML-based recipe file. The rest of the system parses the arguments in the YAML file to configure the sparsification algorithms.

For example, the following recipe.yaml file encodes the default values listed above:

modifiers:
    - !GlobalMagnitudePruningModifier
        init_sparsity: 0.05
        final_sparsity: 0.8
        start_epoch: 0.0
        end_epoch: 30.0
        update_frequency: 1.0
        params: __ALL_PRUNABLE__

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 0.05

    - !LearningRateFunctionModifier
        start_epoch: 30.0
        end_epoch: 50.0
        lr_func: cosine
        init_lr: 0.05
        final_lr: 0.001

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 50.0

In this recipe:

  • GlobalMagnitudePruningModifier applies gradual magnitude pruning globally across all the prunable parameters/weights in a model.
  • GlobalMagnitudePruningModifier starts at 5% sparsity at epoch 0 and gradually ramps up to 80% sparsity at epoch 30, pruning at the start of each epoch.
  • SetLearningRateModifier sets the pruning LR to 0.05 (midpoint between the original 0.1 and 0.001 training LRs).
  • LearningRateFunctionModifier cycles the fine-tuning LR from the pruning LR to 0.001 with a cosine curve (0.001 was the final original training LR).
  • EpochRangeModifier expands the training time to continue fine-tuning for an additional 20 epochs after pruning has ended.
  • 30 pruning epochs and 20 fine-tuning epochs were chosen based on a 90-epoch training schedule -- be sure to adjust these based on the number of epochs used for your initial training.
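Under the hood, SparseML parses this YAML into modifier objects that hook into the training loop. For reference, a rough sketch of loading and applying a recipe in a standalone PyTorch loop is shown below (pre-made integrations such as the YOLOv5 CLI later on this page do this for you). The toy model, data, and loss are stand-ins, and the import path can vary across SparseML versions.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

# Toy stand-ins; in practice these are your real model, data, and loss.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
train_loader = DataLoader(TensorDataset(torch.randn(256, 32),
                                        torch.randint(0, 10, (256,))), batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Parse the recipe and wrap the optimizer so its modifiers run during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(50):  # matches end_epoch in the recipe above
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # clean up the modifier hooks once training is done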

Quantization and Quantization Recipes

A quantization recipe systematically reduces the precision for weights and activations within a neural network, generally from FP32 to INT8. Running a quantized model increases speed and reduces memory consumption while sacrificing very little in terms of accuracy.

Quantization-aware training (QAT) is the standard algorithm. With QAT, fake quantization operators are injected into the graph before quantizable nodes for activations, and weights are wrapped with fake quantization operators. The fake quantization operators simulate quantizing the weights and activations down to INT8 on the forward pass but enable a full update of the weights at FP32 on the backward pass. Updating the weights at FP32 throughout training allows the model to adapt to the loss of information from quantization on the forward pass. QAT generally achieves better recovery for a given model than post-training quantization (PTQ), where no training is used.
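The toy PyTorch sketch below illustrates the fake-quantization idea: values are rounded to an INT8 grid on the forward pass, and gradients pass straight through on the backward pass. It is a conceptual illustration only; real QAT flows rely on the framework's quantization observers and calibrated scales rather than a hand-rolled function like this.

import torch


class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point):
        # Quantize to the INT8 grid, then immediately dequantize back to FP32.
        q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the rounding as the identity.
        return grad_output, None, None


x = torch.randn(4, requires_grad=True)
scale = x.detach().abs().max() / 127  # naive symmetric scale; zero_point = 0
y = FakeQuantSTE.apply(x, scale, 0)
y.sum().backward()
print(y)       # quantize-dequantized values seen on the forward pass
print(x.grad)  # gradients flow through unchanged, so FP32 weights keep updating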

Important hyperparameters for QAT are the learning rate (LR), the number of epochs to train while quantized, and when to freeze batch normalization statistics for CNNs. Freezing batch normalization statistics enables these operators to be folded into the preceding convolutions at inference time and is an essential step for QAT (a sketch of this folding follows the list below).

The proper hyperparameter values will differ for different model architectures, training schemes, and domains, but there is some general intuition for safe starting values. The following are reasonably good values to start with:

  • The LR is set to 0.1 or 0.01 times the final LR used during training.
  • The number of quantized training epochs is set to 5.
  • The batch normalization statistics are frozen at the start of the third epoch.
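As noted above, once the batch normalization statistics are frozen, the normalization can be folded into the preceding convolution. The sketch below shows that folding for a Conv2d/BatchNorm2d pair in plain PyTorch; it illustrates the math only and is not SparseML's export path.

import torch
from torch import nn


def fold_bn_into_conv(conv, bn):
    # Absorb the (frozen) BN statistics into a new convolution's weights and bias.
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std  # per-output-channel scale

    folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        folded.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        folded.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return folded


conv, bn = nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8)
bn.eval()  # use running statistics, as when they are frozen
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))  # True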

For example, the following recipe.yaml file encodes the default values listed above:

modifiers:
    - !QuantizationModifier
        start_epoch: 0.0
        freeze_bn_stats_epoch: 3.0

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 10e-6

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 5.0

In this recipe:

  • The QuantizationModifier applies QAT to all quantizable modules under the model scope. Note that model is used here as a general placeholder for the root module; to determine the root module name for your network, print out the model and use that name.
  • The QuantizationModifier starts at epoch 0 and freezes batch normalization statistics at the start of epoch 3.
  • The SetLearningRateModifier sets the quantization LR to 10e-6 (0.01 times the example final LR of 0.001).
  • The EpochRangeModifier sets the training time to continue training for the desired 5 epochs.

Pruning plus Quantization Recipe

To create a pruning and quantization recipe, the pruning and quantization recipes from the previous sections are merged. Quantization is added after pruning and fine-tuning are complete so that the training cycles end with it. This avoids the stability issues that reduced precision can cause while pruning and using larger LRs.

Combining the two previous recipes creates the following new recipe.yaml file:

modifiers:
    - !GlobalMagnitudePruningModifier
        init_sparsity: 0.05
        final_sparsity: 0.8
        start_epoch: 0.0
        end_epoch: 30.0
        update_frequency: 1.0
        params: __ALL_PRUNABLE__

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 0.05

    - !LearningRateFunctionModifier
        start_epoch: 30.0
        end_epoch: 50.0
        lr_func: cosine
        init_lr: 0.05
        final_lr: 0.001

    - !QuantizationModifier
        start_epoch: 50.0
        freeze_bn_stats_epoch: 53.0

    - !SetLearningRateModifier
        start_epoch: 50.0
        learning_rate: 10e-6

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 55.0

Applying a Recipe

The recipe created above can now be applied using the SparseML integrations. For example, SparseML installs with a CLI that wraps the Ultralytics YOLOv5 repository and its training pathways, among others. To view instructions for the CLI, run the following command:

sparseml.yolov5.train --help

To use the recipe given in the previous section, save it locally as recipe.yaml. It can then be passed to the --recipe argument of the YOLOv5 train CLI.

Running the following command applies the GMP and QAT algorithms encoded in the recipe to the dense version of YOLOv5s (pulled down from the SparseZoo). In this example, the model is fine-tuned on the COCO dataset.

sparseml.yolov5.train \
  --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none \
  --data coco.yaml \
  --hyp data/hyps/hyp.scratch.yaml \
  --recipe recipe.yaml
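After a run finishes, a quick sanity check is to confirm the weights actually reached the target sparsity. The helper below works on any PyTorch module; how you load the trained YOLOv5 checkpoint back into a module depends on your setup and is not shown here. The toy model and the manual 80% mask exist only to make the snippet runnable.

import torch
from torch import nn


def report_sparsity(model):
    # Print the fraction of zero-valued entries in each weight tensor.
    total = zeros = 0
    for name, param in model.named_parameters():
        if param.dim() > 1:  # weight kernels/matrices, not biases or BN params
            layer_zeros = (param == 0).sum().item()
            total += param.numel()
            zeros += layer_zeros
            print(f"{name}: {layer_zeros / param.numel():.1%} sparse")
    print(f"overall: {zeros / total:.1%} sparse")


# Toy demo: manually prune one layer to roughly 80% sparsity, then report.
toy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
with torch.no_grad():
    w = toy[0].weight
    w *= (w.abs() > w.abs().quantile(0.8)).float()
report_sparsity(toy)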