sparseml.onnx.optim.quantization package

Submodules

sparseml.onnx.optim.quantization.calibration module

Provides a class for performing quantization calibration on an Onnx model.

class sparseml.onnx.optim.quantization.calibration.CalibrationSession(onnx_file: str, calibrate_op_types: Iterable[str] = ('Conv', 'MatMul', 'Gemm'), exclude_nodes: Optional[List[str]] = None, include_nodes: Optional[List[str]] = None, augmented_model_path: Optional[str] = None, static: bool = True)[source]

Bases: object

Class for performing quantization calibration on an Onnx model.

Parameters
  • onnx_file – File path to saved Onnx model to calibrate

  • calibrate_op_types – List of Onnx ops names to calibrate and quantize within the model. Currently Onnx only supports quantizing ‘Conv’ and ‘MatMul’ ops.

  • exclude_nodes – List of operator names that should not be quantized

  • include_nodes – List of operator names to force to be quantized

  • augmented_model_path – file path to save augmented model to for verification

  • static – True to use static quantization. Default is True
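
For illustration, a minimal sketch of constructing a CalibrationSession from the parameters documented above (the file paths are hypothetical placeholders):

    from sparseml.onnx.optim.quantization.calibration import CalibrationSession

    # "model.onnx" and "augmented_model.onnx" are placeholder paths
    calibration = CalibrationSession(
        onnx_file="model.onnx",
        calibrate_op_types=("Conv", "MatMul", "Gemm"),
        exclude_nodes=None,
        include_nodes=None,
        augmented_model_path="augmented_model.onnx",
        static=True,
    )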

add_reduce_to_node_output(node: onnx.onnx_ml_pb2.NodeProto, output_edge: str, op_type: str) → Tuple[onnx.onnx_ml_pb2.NodeProto, onnx.onnx_ml_pb2.ValueInfoProto][source]
Parameters
  • node – the node to add the reduce op to

  • output_edge – the output of node to generate reduce op for

  • op_type – the reduce operation name

Returns

a tuple of the reduce operation node and its output

generate_augmented_model() → onnx.onnx_ml_pb2.ModelProto[source]
Returns

A new Onnx model with ReduceMin and ReduceMax nodes added to all quantizable nodes in the original model, ensuring their outputs are stored as part of the graph output.

get_model_input_names() → List[str][source]
Returns

List of input names to the model

get_quantization_params_dict() → Dict[str, List[Union[int, float]]][source]
Returns

A dictionary of quantization parameters based on the original model and calibrated quantization thresholds from runs of the process_batch function. The format of the dictionary will be: {“param_name”: [zero_point, scale]}

property model

Returns

The loaded model; if optimization has run, this will be the optimized version

property model_augmented

Returns

The augmented model; if optimization has run, this will be the optimized version

process_batch(input_batch: Dict[str, numpy.ndarray]) → None[source]

Updates the model’s calibration thresholds based on a run of the input batch

Parameters

input_batch – Dictionary of pre-processed model input batch to use, with input names mapped to a numpy array of the batch
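
As a hedged sketch of the calibration flow using the methods documented above; sample_batches is a hypothetical iterable of pre-processed numpy arrays, one array per model input:

    from sparseml.onnx.optim.quantization.calibration import CalibrationSession

    calibration = CalibrationSession("model.onnx")  # placeholder path
    input_names = calibration.get_model_input_names()

    # sample_batches: hypothetical iterable of pre-processed numpy batches
    for batch in sample_batches:
        # map each input name to its numpy array for this batch
        calibration.process_batch(
            {name: arr for name, arr in zip(input_names, batch)}
        )

    # calibrated thresholds in {"param_name": [zero_point, scale]} format
    quantization_params = calibration.get_quantization_params_dict()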

sparseml.onnx.optim.quantization.quantize module

class sparseml.onnx.optim.quantization.quantize.ONNXQuantizer(model, per_channel, mode, static, fuse_dynamic_quant, weight_qType, input_qType, quantization_params, nodes_to_quantize, nodes_to_exclude)[source]

Bases: object

find_weight_data(initializer)[source]
Parameters

initializer – TensorProto initializer object from a graph

Returns

a list of initialized data in a given initializer object

quantize_model()[source]
class sparseml.onnx.optim.quantization.quantize.QuantizationMode[source]

Bases: object

IntegerOps = 0
QLinearOps = 1
class sparseml.onnx.optim.quantization.quantize.QuantizedInitializer(name, initializer, rmins, rmaxs, zero_points, scales, data=[], quantized_data=[], axis=None, qType=2)[source]

Bases: object

Represents a linearly quantized weight input from ONNX operators

class sparseml.onnx.optim.quantization.quantize.QuantizedValue(name, new_quantized_name, scale_name, zero_point_name, quantized_value_type, axis=None, qType=2)[source]

Bases: object

Represents a linearly quantized value (input/output/initializer)

class sparseml.onnx.optim.quantization.quantize.QuantizedValueType[source]

Bases: object

Initializer = 1
Input = 0
sparseml.onnx.optim.quantization.quantize.check_opset_version(org_model, force_fusions)[source]

Check the opset version of the original model and set the opset version and fuse_dynamic_quant accordingly. If the opset version is less than 10, set the quantized model's opset version to 10. If the opset version is 10, do quantization without using the DynamicQuantizeLinear operator. If the opset version is 11, do quantization using the DynamicQuantizeLinear operator.

Returns

fuse_dynamic_quant boolean value

sparseml.onnx.optim.quantization.quantize.quantize(model, per_channel=False, nbits=8, quantization_mode=0, static=False, force_fusions=False, symmetric_activation=False, symmetric_weight=False, quantization_params=None, nodes_to_quantize=None, nodes_to_exclude=None)[source]

Given an Onnx ModelProto, create and return a quantized version of the model

Parameters
  • model – ModelProto to quantize

  • per_channel – quantize weights per channel

  • nbits – number of bits to represent quantized data. Currently only supporting 8-bit types

  • quantization_mode

    Can be one of the QuantizationMode types.

    IntegerOps: the function will use integer ops. Only ConvInteger and MatMulInteger ops are supported now.

    QLinearOps: the function will use QLinear ops. Only QLinearConv and QLinearMatMul ops are supported now.

  • static

    True: The inputs/activations are quantized using static scale and zero point values specified through quantization_params.

    False: The inputs/activations are quantized using dynamic scale and zero point values computed while running the model.

  • force_fusions – True: Fuses nodes added for dynamic quantization False: No fusion is applied for nodes which are added for dynamic quantization. Should be only used in cases where backends want to apply special fusion routines

  • symmetric_activation – True: activations are quantized into signed integers. False: activations are quantized into unsigned integers.

  • symmetric_weight – True: weights are quantized into signed integers. False: weights are quantized into unsigned integers.

  • quantization_params

    Dictionary to specify the zero point and scale values for inputs to conv and matmul nodes. Should be specified when static is set to True. The quantization_params should be specified in the following format:

    {"input_name": [zero_point, scale]}

    zero_point should be of type np.uint8 and scale should be of type np.float32. Example:

    {
        'resnet_model/Relu_1:0': [np.uint8(0), np.float32(0.019539741799235344)],
        'resnet_model/Relu_2:0': [np.uint8(0), np.float32(0.011359662748873234)]
    }

  • nodes_to_quantize

    List of node names to quantize. When this list is not None, only the nodes in this list are quantized. Example:

    ['Conv__224', 'Conv__252']

  • nodes_to_exclude – List of node names to exclude. When this list is not None, the nodes in it will be excluded from quantization.

Returns

ModelProto with quantization
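
A hedged usage sketch for quantize, using static QLinear quantization; the model path is a placeholder and the quantization_params dictionary stands in for the output of CalibrationSession.get_quantization_params_dict():

    import numpy as np
    import onnx
    from sparseml.onnx.optim.quantization.quantize import QuantizationMode, quantize

    model = onnx.load("model.onnx")  # placeholder path

    # placeholder calibration output; in practice this comes from
    # CalibrationSession.get_quantization_params_dict()
    quantization_params = {
        "input": [np.uint8(0), np.float32(0.02)],
    }

    quantized_model = quantize(
        model,
        quantization_mode=QuantizationMode.QLinearOps,
        static=True,
        quantization_params=quantization_params,
    )
    onnx.save(quantized_model, "model-quant.onnx")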

sparseml.onnx.optim.quantization.quantize.quantize_data(data, quantize_range, qType)[source]
Parameters
  • data – data to quantize

  • quantize_range – quantization range to pack the weight data into

  • qType – data type to quantize to. Supported types UINT8 and INT8

Returns

minimum, maximum, zero point, scale, and quantized weights

To pack weights, we compute a linear transformation
  • when data type == uint8, from [rmin, rmax] -> [0, 2^b - 1] and

  • when data type == int8, from [-m, m] -> [-(2^{b-1} - 1), 2^{b-1} - 1], where m = max(abs(rmin), abs(rmax))

and add the necessary intermediate nodes to transform the quantized weight to the full weight using the equation r = S(q - z), where

  • r: real original value

  • q: quantized value

  • S: scale

  • z: zero point
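
As an illustration of the transformation above (a sketch, not the library's exact implementation), computing the scale, zero point, and quantized values for the uint8 case with a quantization range of 255:

    import numpy as np

    data = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
    quantize_range = 255  # 2^8 - 1 for 8-bit quantization

    rmin, rmax = float(data.min()), float(data.max())
    scale = (rmax - rmin) / quantize_range if rmax != rmin else 1.0
    zero_point = int(round(-rmin / scale))  # maps rmin to 0 for uint8

    # q = r / S + z, clipped to the uint8 range; the inverse is r = S * (q - z)
    quantized = np.clip(np.round(data / scale) + zero_point, 0, 255).astype(np.uint8)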

sparseml.onnx.optim.quantization.quantize_model_post_training module

Provides a wrapper function for calibrating and quantizing an Onnx model

sparseml.onnx.optim.quantization.quantize_model_post_training.quantize_model_post_training(onnx_file: str, data_loader: sparseml.onnx.utils.data.DataLoader, output_model_path: Optional[str] = None, calibrate_op_types: Iterable[str] = ('Conv', 'MatMul', 'Gemm'), exclude_nodes: Optional[List[str]] = None, include_nodes: Optional[List[str]] = None, augmented_model_path: Optional[str] = None, static: bool = True, symmetric_weight: bool = False, force_fusions: bool = False, show_progress: bool = True, run_extra_opt: bool = True) → Union[None, onnx.onnx_ml_pb2.ModelProto][source]

Wrapper function for calibrating and quantizing an Onnx model

Parameters
  • onnx_file – File path to saved Onnx model to calibrate and quantize

  • data_loader – Iterable of lists of model inputs or filepath to directory of numpy arrays. If the model has multiple inputs and an .npz file is provided, the function will try to extract each input from the .npz file by name. If the names do not match, the function will try to extract the inputs in order. Will raise an exception if the number of inputs does not match the number of arrays in the .npz file.

  • output_model_path – Filepath to where the quantized model should be saved to. If not provided, then the quantized Onnx model object will be returned instead.

  • calibrate_op_types – List of Onnx ops names to calibrate and quantize within the model. Currently Onnx only supports quantizing ‘Conv’ and ‘MatMul’ ops.

  • exclude_nodes – List of operator names that should not be quantized

  • include_nodes – List of operator names to force to be quantized

  • augmented_model_path – file path to save augmented model to for verification

  • static – True to use static quantization. Default is True

  • symmetric_weight – True to use symmetric weight quantization. Default is False

  • force_fusions – True to force fusions in quantization. Default is False

  • show_progress – If true, will display a tqdm progress bar during calibration. Default is True

  • run_extra_opt – If true, will run additional optimizations on the quantized model. Currently the only optimization is quantizing identity relu outputs in ResNet blocks

Returns

None or quantized onnx model object if output_model_path is not provided
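
A hedged end-to-end sketch of the wrapper; the paths are placeholders, and the DataLoader construction (a glob of saved numpy input samples plus a batch size) is an assumption about sparseml.onnx.utils.data.DataLoader, so check that class for the exact arguments:

    from sparseml.onnx.utils.data import DataLoader
    from sparseml.onnx.optim.quantization.quantize_model_post_training import (
        quantize_model_post_training,
    )

    # assumption: DataLoader can be built from a glob of saved .npz input samples
    data_loader = DataLoader("calibration-samples/*.npz", None, batch_size=1)

    # calibrates on the data loader batches, then writes the quantized model to disk;
    # returns None because output_model_path is provided
    quantize_model_post_training(
        "model.onnx",  # placeholder path to the FP32 model
        data_loader,
        output_model_path="model-quant.onnx",
        static=True,
        symmetric_weight=False,
        show_progress=True,
    )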

Module contents

Post training quantization tools for quantizing and calibrating onnx models.