first commit
docs/source/en/quantization/sinq.md
[arXiv:2509.22944](https://arxiv.org/abs/2509.22944)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[GitHub stars](https://github.com/huawei-csl/SINQ/stargazers)
[Hugging Face: huawei-csl](https://huggingface.co/huawei-csl)

# SINQ

[Sinkhorn-Normalized Quantization (SINQ)](https://github.com/huawei-csl/SINQ/tree/main) is a fast, plug-and-play, model-agnostic quantization technique for Large Language Models, designed to deliver state-of-the-art quantized performance with minimal loss of accuracy.

### 🔍 What You’ll Find Here

- [1. Quantize (and save) any LLM with SINQ](#1-quantize-any-llm-with-sinq)
- [2. How to Cite This Work](#2-how-to-cite-this-work)
- [3. Current Limitations](#3-current-limitations)

#### 📊 Feature Comparison: SINQ vs HQQ _(calibration-free)_ and A-SINQ vs AWQ _(calibrated)_

| Feature | **SINQ** | **HQQ** | **A-SINQ** | **AWQ** |
|------------|:--------:|:--------:|:----------:|:-------:|
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
| 📦 NF4 Support | **Yes** | No | **Yes** | No |
| ⚡ Quantization Speed | ~2× **Faster** than HQQ | Slower | ~4× **Faster** than AWQ | Slower |
| 📈 Model Quality | **Higher** | Lower | **Higher** | Lower |

📄 **Want to know more?**
- Read our paper on [**arXiv**](http://arxiv.org/abs/2509.22944)
- Check the official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) GitHub repository

---

## 1. Quantize any LLM with SINQ

### Setup & Quick Start

First, install the package. There are two options:
- From source, using the official GitHub repository [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) **[Recommended]**
- From the pip package:
```bash
pip install sinq
```

---

### Quantize in a few lines

Quantizing any 🤗 Hugging Face model with SINQ takes only a few lines of code.
First, create a [`SinqConfig`] and set the following parameters:

| Parameter | Description | Type | Options | Default |
|------|-------------|---------|---------|----------|
| `nbits` | Bit-width for weight quantization | int | 2, 3, 4, 5, 6, 8 | 4 |
| `tiling_mode` | Weight matrix tiling strategy | str | 1D, 2D | 1D |
| `group_size` | Number of weights per quantization group | int | 64, 128 | 64 |
| `method` | Quantization method | str | sinq, asinq | sinq |
| `modules_to_not_convert` | Layers that are **not** quantized | list of str | [lm_head, ...] | [lm_head] |

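To get a feel for how `nbits` and `group_size` interact, the sketch below estimates the storage cost per weight. It is only a back-of-the-envelope model: it assumes each quantization group stores 32 bits of metadata (for example one 16-bit scale and one 16-bit zero point), which is an illustrative assumption and not a description of SINQ's actual storage layout. The function names are hypothetical helpers, not part of any library.

```python
def effective_bits_per_weight(nbits, group_size, metadata_bits_per_group=32):
    """Average bits stored per weight, including per-group metadata.

    `metadata_bits_per_group` is an assumed value for illustration only.
    """
    return nbits + metadata_bits_per_group / group_size


def estimate_weight_gib(num_params, nbits, group_size, metadata_bits_per_group=32):
    """Rough size in GiB of `num_params` quantized weights."""
    total_bits = num_params * effective_bits_per_weight(
        nbits, group_size, metadata_bits_per_group
    )
    return total_bits / 8 / 2**30


# Compare the two group sizes listed in the table for 4-bit quantization
for group_size in (64, 128):
    bits = effective_bits_per_weight(4, group_size)
    print(f"group_size={group_size}: {bits} bits per weight")
```

Under these assumptions, a smaller `group_size` costs more metadata per weight (4.5 bits at 64 versus 4.25 bits at 128 for `nbits=4`), in exchange for finer-grained quantization parameters.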
Then specify the model you want to quantize and pass the `SinqConfig` as the quantization configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig

model_name = "Qwen/Qwen3-1.7B"

# Quantization settings
cfg = SinqConfig(
    nbits=4,
    group_size=64,
    tiling_mode="1D",
    method="sinq",
    modules_to_not_convert=["lm_head"],
)

tok = AutoTokenizer.from_pretrained(model_name)
# Passing the config makes from_pretrained return a quantized model
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=cfg,
    dtype=torch.bfloat16,
)
```

✅ That’s it. Your model is now quantized with **SINQ** and ready for inference or saving.

> Check our official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) GitHub repository to stay updated!

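As a quick check that the quantized model is usable for inference, the sketch below wraps the standard Transformers generation calls in a small helper. `generate_text` is a hypothetical name introduced here for illustration; it assumes `qmodel` and `tok` were created as in the snippet above and that the quantized model supports the regular `generate` API.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Generate a continuation of `prompt` (illustrative helper, not part of SINQ)."""
    # Tokenize and move the inputs to the device the model lives on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Standard Transformers generation call
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence in the batch
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, assuming `qmodel` and `tok` from the quantization snippet:
# print(generate_text(qmodel, tok, "Explain weight quantization in one sentence."))
```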
---

### Save & reload

To reuse a quantized model later, save it to disk or push it to the Hugging Face Hub, then reload it without needing the original full-precision weights.
If you installed SINQ from source, call the `patch_hf_pretrained_io` function before reloading a quantized model:
```python
# Save the SINQ-quantized model (`qmodel` and `tok` come from the snippet above)
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
tok.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from sinq.hf_io import patch_hf_pretrained_io

# Patch the Hugging Face I/O functions before reloading
patch_hf_pretrained_io()

# Reload a SINQ-quantized model
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"
tok = AutoTokenizer.from_pretrained(hf_hub_model)
qmodel = AutoModelForCausalLM.from_pretrained(hf_hub_model)
```
Otherwise, if you installed SINQ through pip, you can use the built-in Hugging Face functions directly:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- Save to a folder (sharded safetensors) ---

# `qmodel` must already be SINQ-quantized
# Save locally
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
# Push to the Hub
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
tok.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")

# --- Reload later ---

save_dir = "/path/to/save/qwen3-1.7B-sinq-4bit"
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"

# From a local directory
tok = AutoTokenizer.from_pretrained(save_dir)
qmodel = AutoModelForCausalLM.from_pretrained(save_dir)

# From the Hugging Face Hub
tok = AutoTokenizer.from_pretrained(hf_hub_model)
qmodel = AutoModelForCausalLM.from_pretrained(hf_hub_model)
```

✅ Your model is now loaded and ready for inference!

> Note: If the model was quantized to 4 bits and the `gemlite` library is installed, the faster gemlite kernel is used for inference.

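After saving, it can be useful to confirm what actually landed in the checkpoint folder. The sketch below lists the `.safetensors` shards in a directory and sums their sizes using only the standard library. `summarize_checkpoint` is a hypothetical helper name, and the example assumes the shards sit directly in the save directory.

```python
from pathlib import Path


def summarize_checkpoint(save_dir):
    """Return (shard file names, total shard size in bytes) for a saved checkpoint."""
    # Collect the safetensors shards in a stable order
    shards = sorted(Path(save_dir).glob("*.safetensors"))
    total_bytes = sum(path.stat().st_size for path in shards)
    return [path.name for path in shards], total_bytes


# Usage, assuming the folder from the save step exists:
# names, total_bytes = summarize_checkpoint("/path/to/save/qwen3-1.7B-sinq-4bit")
# print(names, f"{total_bytes / 2**20:.1f} MiB")
```

Comparing this total with the size of the original checkpoint gives a rough view of the compression achieved on disk.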
---

### Compatible with [`lm-eval`](https://github.com/EleutherAI/lm-evaluation-harness) evaluation framework

Below is a minimal example showing how to evaluate a SINQ-quantized model on a benchmark dataset:

```python
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# Define the device before it is used below
device = "cuda:0"

# Wrap the already quantized model and tokenizer with HFLM
lm = HFLM(pretrained=qmodel, tokenizer=tok, device=device)

# Evaluate (many tasks are available in lm-eval, such as MMLU and HellaSwag)
results = evaluator.simple_evaluate(
    model=lm,
    tasks=["wikitext"],  # small and fast benchmark
    device=device,
)
```

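To inspect the outcome, the sketch below flattens the returned dictionary into `(task, metric, value)` rows. It assumes the usual lm-eval layout in which `results["results"]` maps each task name to a dictionary of metrics; `summarize_results` is a hypothetical helper, and the exact metric key names depend on the installed lm-eval version.

```python
def summarize_results(results):
    """Flatten an lm-eval style result dict into (task, metric, value) rows."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for name, value in metrics.items():
            # Keep numeric metrics only; skip aliases and other string fields
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, float(value)))
    return rows


# Usage, assuming `results` comes from evaluator.simple_evaluate(...):
# for task, metric, value in summarize_results(results):
#     print(f"{task:<12} {metric:<28} {value:.4f}")
```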
## 2. How to Cite This Work

If you find **SINQ** useful in your research or applications:
- Support our project by starring ⭐️ the [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) GitHub repository
- Please cite our [**paper**](http://arxiv.org/abs/2509.22944):

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```

## 3. Current Limitations

Currently, the A-SINQ method is not supported in Hugging Face. Please refer to the official [SINQ repository](https://github.com/huawei-csl/SINQ/tree/main) to quantize a model with this strategy.
At the moment, neither SINQ quantization nor SINQ-quantized models support multi-GPU execution. If your system has multiple GPUs, specify which one should be used.

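One common way to restrict a process to a single GPU is the `CUDA_VISIBLE_DEVICES` environment variable. This is a general CUDA mechanism rather than a SINQ-specific API, and `select_gpu` below is a hypothetical helper. It has to run before CUDA is initialized, so in practice call it before importing `torch`.

```python
import os


def select_gpu(index):
    """Expose only the GPU with the given physical index to this process."""
    value = str(index)
    # Must be set before any CUDA context is created
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Make only physical GPU 0 visible
select_gpu(0)
```

Once the variable is set, the selected GPU is the only one the process sees, so it is addressed as `cuda:0` regardless of its physical index.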