This page walks through an example of creating a sparsification recipe to prune a dense model from scratch and applying a recipe to a supported integration.
SparseML has pre-made integrations with many popular model repositories, such as with Hugging Face Transformers and Ultralytics YOLOv5. For these integrations, a sparsification recipe is all you need, and you can apply state-of-the-art sparsification algorithms, including pruning, distillation, and quantization, with a single command line call.
This section requires SparseML Torchvision Install to run the Apply the Recipe section.
Pruning is a systematic way of removing redundant weights and connections within a neural network. A pruning algorithm must determine which weights are redundant and can be removed without affecting accuracy.
A standard algorithm for pruning is gradual magnitude pruning, or GMP for short. With it, the weights closest to zero are iteratively removed over several epochs or training steps. The non-zero weights are then fine-tuned to the objective function. This iterative process enables the model to adjust to a new optimization space after pathways are removed before pruning again.
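As a rough illustration of how the sparsity target ramps during GMP, the following is a minimal sketch of a pruning schedule, assuming the cubic interpolation commonly used for GMP; the function, values, and update points are illustrative (they mirror the example recipe later on this page) and are not SparseML internals.

def gmp_sparsity(epoch, init_sparsity=0.05, final_sparsity=0.8,
                 start_epoch=0.0, end_epoch=30.0):
    """Sparsity target at a given epoch under an assumed cubic GMP schedule."""
    if epoch <= start_epoch:
        return init_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    progress = (epoch - start_epoch) / (end_epoch - start_epoch)
    # Cubic ramp: prune aggressively early, then slow down as the target nears.
    return final_sparsity + (init_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Sparsity targets at a few update points (an update frequency of 1 epoch assumed):
for epoch in range(0, 31, 5):
    print(epoch, round(gmp_sparsity(epoch), 3))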
The important hyperparameters to set include the initial and final sparsity, the epoch range over which to prune, the pruning update frequency, and the learning rates (LRs) used during and after pruning. The proper values differ across model architectures, training schemes, and domains, but there is some general intuition for safe starting values; the example recipe below encodes reasonable defaults.
SparseML conveniently encodes these hyperparameters into a YAML-based Recipe file. The rest of the system parses the arguments in the YAML file to set the parameters of the algorithm.
For example, the following recipe.yaml file encodes the default values described above:
modifiers:
    - !GlobalMagnitudePruningModifier
        init_sparsity: 0.05
        final_sparsity: 0.8
        start_epoch: 0.0
        end_epoch: 30.0
        update_frequency: 1.0
        params: __ALL_PRUNABLE__

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 0.05

    - !LearningRateFunctionModifier
        start_epoch: 30.0
        end_epoch: 50.0
        lr_func: cosine
        init_lr: 0.05
        final_lr: 0.001

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 50.0
In this recipe:
- GlobalMagnitudePruningModifier applies gradual magnitude pruning globally across all the prunable parameters/weights in a model.
- GlobalMagnitudePruningModifier starts at 5% sparsity at epoch 0 and gradually ramps up to 80% sparsity at epoch 30, pruning at the start of each epoch.
- SetLearningRateModifier sets the pruning LR to 0.05 (midpoint between the original 0.1 and 0.001 training LRs).
- LearningRateFunctionModifier cycles the fine-tuning LR from the pruning LR to 0.001 with a cosine curve (0.001 was the final original training LR).
- EpochRangeModifier expands the training time to continue fine-tuning for an additional 20 epochs after pruning has ended.

A quantization recipe systematically reduces the precision of the weights and activations within a neural network, generally from FP32 to INT8. Running a quantized model increases speed and reduces memory consumption while sacrificing very little in terms of accuracy.
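To make the FP32-to-INT8 mapping concrete, the toy sketch below quantizes a tensor with PyTorch's affine quantization and prints the rounding error that is traded for speed and memory. The scale and zero point are hand-picked for illustration and are not how SparseML calibrates them.

import torch

x = torch.randn(2, 3)                              # FP32 weights/activations
scale, zero_point = x.abs().max().item() / 127, 0  # hand-picked symmetric qparams
q = torch.quantize_per_tensor(x, scale, zero_point, torch.qint8)
print(q.int_repr())                                # the INT8 storage
print((q.dequantize() - x).abs().max())            # the rounding error paid for the speedup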
Quantization-aware training (QAT) is the standard algorithm. With QAT, fake quantization operators are injected into the graph before quantizable nodes for activations, and weights are wrapped with fake quantization operators. The fake quantization operators round the weights and activations to INT8-representable values on the forward pass but enable full updates of the weights at FP32 on the backward pass. Updating the weights at FP32 throughout training allows the model to adapt to the information lost to quantization on the forward pass. QAT generally achieves better recovery for a given model than post-training quantization (PTQ), where no training is used.
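As a hedged sketch of the fake-quantization mechanics described above (illustrative only, not SparseML's internal implementation), PyTorch exposes the underlying operator directly; note how the forward output snaps to INT8-representable values while the backward pass still produces full FP32 gradients.

import torch

w = torch.randn(4, 4, requires_grad=True)
scale, zero_point = 0.1, 0  # toy quantization parameters
# Forward: quantize-dequantize so downstream ops only see INT8-representable values.
q = torch.fake_quantize_per_tensor_affine(w, scale, zero_point, -128, 127)
q.sum().backward()
# Backward: a straight-through estimator passes FP32 gradients (zeroed only outside
# the representable range), so the underlying weights keep updating at full precision.
print(w.grad)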
Important hyperparameters for QAT are the learning rate (LR), the number of epochs to train for while quantized, and when to freeze batch normalization statistics for CNNs. Freezing batch normalization statistics enables these operators to be folded into the preceding convolutions at inference time and is an essential step for QAT. The proper hyperparameter values will differ for different model architectures, training schemes, and domains, but there is some general intuition for safe starting values; the example recipe below encodes reasonably good defaults.
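The sketch below shows, under illustrative assumptions (toy model, random data, hand-set epoch counts), how these starting values might be wired into a plain PyTorch loop. The fake-quantization insertion itself is omitted here; with SparseML, the QuantizationModifier drives all of this from the recipe.

import torch
import torch.nn as nn

# Toy CNN, data, and schedule purely for illustration.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)  # 0.01 x the example final LR of 0.001
criterion = nn.CrossEntropyLoss()
inputs, labels = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))

NUM_EPOCHS = 5       # short quantized fine-tuning run
FREEZE_BN_EPOCH = 3  # freeze BN statistics partway through so they can fold at inference

for epoch in range(NUM_EPOCHS):
    if epoch == FREEZE_BN_EPOCH:
        # Switch BatchNorm layers to eval mode: running mean/var stop updating,
        # while their learned affine parameters continue to train.
        for m in model.modules():
            if isinstance(m, nn.modules.batchnorm._BatchNorm):
                m.eval()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()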
For example, the following recipe.yaml file encodes the default values described above:
modifiers:
    - !QuantizationModifier
        start_epoch: 0.0
        freeze_bn_stats_epoch: 3.0

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 10e-6

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 5.0
In this recipe:
- QuantizationModifier applies QAT to all quantizable modules under the model scope. Note that model is used here as a general placeholder; to determine the name of the root module for your model, print out the root module and use that root name.
- QuantizationModifier starts at epoch 0 and freezes batch normalization statistics at the start of epoch 3.
- SetLearningRateModifier sets the quantization LR to 10e-6 (0.01 times the example final LR of 0.001).
- EpochRangeModifier sets the training time to continue training for the desired 5 epochs.

To create a pruning and quantization recipe, merge the pruning and quantization recipes from the previous sections. Quantization is added only after pruning and fine-tuning are complete, so the training cycle ends with it. This ordering prevents the stability issues that reduced precision would cause while pruning and training with larger LRs.
Combining the two previous recipes creates the following new recipe.yaml file:
modifiers:
    - !GlobalMagnitudePruningModifier
        init_sparsity: 0.05
        final_sparsity: 0.8
        start_epoch: 0.0
        end_epoch: 30.0
        update_frequency: 1.0
        params: __ALL_PRUNABLE__

    - !SetLearningRateModifier
        start_epoch: 0.0
        learning_rate: 0.05

    - !LearningRateFunctionModifier
        start_epoch: 30.0
        end_epoch: 50.0
        lr_func: cosine
        init_lr: 0.05
        final_lr: 0.001

    - !QuantizationModifier
        start_epoch: 50.0
        freeze_bn_stats_epoch: 53.0

    - !SetLearningRateModifier
        start_epoch: 50.0
        learning_rate: 10e-6

    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 55.0
The recipe created above can now be applied using the SparseML integrations. For example, SparseML installs with a CLI built on the Ultralytics YOLOv5 repository and its training pathways, among others. To view instructions for the CLI, run the following command:
sparseml.yolov5.train --help
To use the recipe given in the previous section, save it locally as a recipe.yaml file. It can then be passed to the --recipe argument of the YOLOv5 train CLI.
Running the following command applies the GMP and QAT algorithms encoded in the recipe to the dense version of YOLOv5s (pulled down from the SparseZoo). In this example, fine-tuning is performed on the COCO dataset.
sparseml.yolov5.train \
    --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none \
    --data coco.yaml \
    --hyp data/hyps/hyp.scratch.yaml \
    --recipe recipe.yaml
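For training pipelines that are not covered by a pre-made integration, the same recipe.yaml can be applied to a custom PyTorch loop through SparseML's ScheduledModifierManager. The following is a minimal sketch based on SparseML's documented PyTorch workflow; the model, data, optimizer, and loss are illustrative stand-ins for your own training objects, and exact import paths may vary by SparseML version.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

# Toy model and data purely for illustration; swap in your own training objects.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
train_loader = DataLoader(TensorDataset(torch.randn(64, 3, 32, 32),
                                        torch.randint(0, 10, (64,))), batch_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Parse the recipe and wrap the optimizer so the modifiers run on their schedule.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(int(manager.max_epochs)):
    for batch, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch), targets)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # remove modifier hooks once training completes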