https://huggingface.co/blog/quanto-introduction
Quantization is a technique to reduce the computational and memory costs of evaluating deep learning models by representing their weights and activations with low-precision data types, such as 8-bit integers (int8), instead of the usual 32-bit floating point (float32).
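As a quick illustration of the idea (not quanto's internal algorithm), here is how a float32 tensor can be mapped to int8 with a simple affine scheme; the scale and zero-point variables are generic quantization terminology, not quanto APIs.

```python
import torch

x = torch.randn(4, 4)  # float32 weights

# Affine quantization: map [min, max] onto the int8 range [-128, 127].
scale = (x.max() - x.min()) / 255
zero_point = (-128 - x.min() / scale).round()
x_int8 = torch.clamp((x / scale + zero_point).round(), -128, 127).to(torch.int8)

# Dequantize to approximate the original float32 values.
x_approx = (x_int8.float() - zero_point) * scale
```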
Today, we are excited to introduce quanto, a versatile PyTorch quantization toolkit that provides several unique features:
- available in eager mode (works with non-traceable models),
- quantized models can be placed on any device (including CUDA and MPS),
- automatically inserts quantization and dequantization stubs,
- automatically inserts quantized functional operations,
- automatically inserts quantized modules (see below the list of supported modules),
- provides a seamless workflow from a float model to a dynamic and then a static quantized model (see the sketch after this list),
- supports quantized model serialization as a state_dict,
- supports not only int8 weights, but also int2 and int4,
- supports not only int8 activations, but also float8.
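To make the workflow concrete, here is a minimal sketch of the quantize, calibrate, and freeze sequence on a toy model. The `quantize`, `Calibration`, `freeze`, and `qint8` names come from quanto's API at the time of this post; the model, shapes, and calibration data are placeholders.

```python
import torch
from quanto import quantize, freeze, qint8, Calibration

# Any eager-mode PyTorch model works; a tiny MLP is used here for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# 1. Convert: insert quantized modules and (de)quantization stubs.
quantize(model, weights=qint8, activations=qint8)

# 2. Calibrate: record activation ranges on sample data (dynamic -> static).
with Calibration():
    model(torch.randn(16, 64))

# 3. Freeze: replace the float weights with their quantized counterparts.
freeze(model)

# 4. Serialize the quantized model as a plain state_dict.
torch.save(model.state_dict(), "quantized_mlp.pt")
```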