Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
182 lines
7.0 KiB
Markdown
Executable File
182 lines
7.0 KiB
Markdown
Executable File
[](https://arxiv.org/abs/2509.22944)
|
||
[](https://opensource.org/licenses/Apache-2.0)
|
||
[](https://github.com/huawei-csl/SINQ/stargazers)
|
||
[](https://huggingface.co/huawei-csl)
|
||
|
||
# SINQ
|
||
|
||
[Sinkhorn-Normalized Quantization (SINQ)](https://github.com/huawei-csl/SINQ/tree/main) is a fast, plug-and-play, model-agnostic quantization technique delivering state-of-the-art performance for Large Language Models without sacrificing accuracy.
|
||
|
||
### 🔍 What You’ll Find Here
|
||
|
||
- [1. Quantize (and save) any LLM with SINQ](#1-quantize-any-llm-with-sinq)
|
||
- [2. How to Cite This Work](#2-how-to-cite-this-work)
|
||
- [3. Current Limitations](#3-current-limitations)
|
||
|
||
#### 📊 Feature Comparison: SINQ vs HQQ _(calibration-free)_ and A-SINQ vs AWQ _(calibrated)_
|
||
|
||
| Feature | **SINQ** | **HQQ** | **A-SINQ** | **AWQ** |
|
||
|------------|:--------:|:--------:|:----------:|:-------:|
|
||
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
|
||
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
|
||
| 📦 NF4 Support | **Yes** | No | **Yes** | No |
|
||
| ⚡ Quantization Speed | ~2× **Faster** than HQQ | Slower | ~4× **Faster** than AWQ | Slower |
|
||
| 📈 Model Quality | **Higher** | Lower | **Higher** | Lower |
|
||
|
||
|
||
📄 **Want to know more?**
|
||
- Read our paper on [**arXiv**](http://arxiv.org/abs/2509.22944)
|
||
- Check the official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) github repository
|
||
|
||
---
|
||
|
||
## 1. Quantize any LLM with SINQ
|
||
|
||
### Setup & Quick Start
|
||
|
||
First, install the package. It can be done in two ways:
|
||
- From source using the official Github repository [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) **[Recommended]**
|
||
- Using pip package:
|
||
```bash
|
||
pip install sinq
|
||
```
|
||
|
||
---
|
||
|
||
### Quantize in a few lines
|
||
|
||
Quantizing any 🤗 Hugging Face model with SINQ is simple and takes only a few lines of code.
|
||
First, create a [`SinqConfig`] and specify the following parameters:
|
||
|
||
| Flag | Description | Type | Options | Default |
|
||
|------|-------------|---------|---------|----------|
|
||
| `--nbits` | Bit-width for weight quantization | int | 2, 3, 4, 5, 6, 8 | 4 |
|
||
| `--tiling_mode` | Weight matrix tiling strategy | str | 1D, 2D | 1D |
|
||
| `--group_size` | Weights per quantization group | int | 64, 128 | 64 |
|
||
| `--method` | Quantization method | str | sinq, asinq | sinq |
|
||
| `--modules_to_not_convert` | List of the layers that are NOT quantize | List of str | [lm_head, ...] | [lm_head] |
|
||
|
||
Then specify the model you want to quantize and pass the SinqConfig as quantization configuration option
|
||
|
||
```python
|
||
import torch
|
||
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig
|
||
|
||
model_name = "Qwen/Qwen3-1.7B"
|
||
|
||
cfg = SinqConfig(
|
||
nbits=4,
|
||
group_size=64,
|
||
tiling_mode="1D",
|
||
method="sinq",
|
||
modules_to_not_convert=["lm_head"]
|
||
)
|
||
|
||
tok = AutoTokenizer.from_pretrained(model_name)
|
||
qmodel = AutoModelForCausalLM.from_pretrained(
|
||
model_name,
|
||
quantization_config=cfg,
|
||
dtype=torch.bfloat16
|
||
)
|
||
|
||
```
|
||
|
||
✅ That’s it. Your model is now quantized with **SINQ** and ready for inference or saving.
|
||
|
||
> Check our official [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) github repository to stay updated!
|
||
|
||
---
|
||
|
||
### Save & reload
|
||
|
||
If you want to reuse a quantized model later, save it to disk or push it on the HuggingFace Hub and reload it without needing base FP weights.
|
||
If you installed SINQ from source you should call *patch_hf_pretrained_io* function when re-loading a quantized model:
|
||
```python
|
||
# Save sinq quantized model
|
||
model.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
|
||
model.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
|
||
tokenizer.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
|
||
```
|
||
```python
|
||
from sinq.hf_io import patch_hf_pretrained_io
|
||
patch_hf_pretrained_io()
|
||
# Reload a sinq quantized model
|
||
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"
|
||
tokenizer = AutoTokenizer.from_pretrained(hf_hub_model)
|
||
model = AutoModelForCausalLM.from_pretrained(hf_hub_model)
|
||
```
|
||
Otherwise, if you installed SINQ through pip, you can simply use HF built-in functions:
|
||
|
||
```python
|
||
# --- Save to a folder (sharded safetensors) ---
|
||
|
||
# 'model' must already be SINQ-quantized
|
||
# Locally save
|
||
qmodel.save_pretrained("/path/to/save/qwen3-1.7B-sinq-4bit")
|
||
# Push to the Hub
|
||
qmodel.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
|
||
tok.push_to_hub("HF_Hub_username/qwen3-1.7B-sinq-4bit")
|
||
|
||
# --- Reload later--
|
||
|
||
save_dir = "/path/to/save/qwen3-1.7B-sinq-4bit"
|
||
hf_hub_model = "HF_Hub_username/qwen3-1.7B-sinq-4bit"
|
||
|
||
# From local directory
|
||
tok = AutoTokenizer.from_pretrained(save_dir)
|
||
qmodel = AutoModelForCausalLM.from_pretrained(save_dir)
|
||
|
||
# From HF Hub
|
||
tok = AutoTokenizer.from_pretrained(hf_hub_model)
|
||
qmodel = AutoModelForCausalLM.from_pretrained(hf_hub_model)
|
||
|
||
```
|
||
|
||
✅ Your model is now loaded and ready for inference!
|
||
|
||
> Note: If the model has been quantized in 4 bit and `gemlite` library is installed, gemlite faster kernel is used to run the inference.
|
||
|
||
---
|
||
|
||
### Compatible with [`lm-eval`](https://github.com/EleutherAI/lm-evaluation-harness) evaluation framework
|
||
|
||
Below is a minimal example showing how to evaluate a SINQ-quantized model on a benchmark dataset:
|
||
|
||
```python
|
||
from lm_eval import evaluator
|
||
from lm_eval.models.huggingface import HFLM
|
||
|
||
# Wrap the already quantized model and tokenizer with HFLM
|
||
lm = HFLM(pretrained=qmodel, tokenizer=tok, device=device)
|
||
device = "cuda:0"
|
||
|
||
# Evaluate (many tasks available on lm-eval such as MMLU and HellaSwag)
|
||
results = evaluator.simple_evaluate(
|
||
model=lm,
|
||
tasks=["wikitext"], # small and fast benchmark
|
||
device=device
|
||
)
|
||
```
|
||
|
||
## 2. How to Cite This Work
|
||
|
||
If you find **SINQ** useful in your research or applications
|
||
- Support our project by putting a star ⭐️ in the [**SINQ**](https://github.com/huawei-csl/SINQ/tree/main) github repository
|
||
- Please cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:
|
||
|
||
```bibtex
|
||
@misc{muller2025sinq,
|
||
title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
|
||
author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
|
||
year={2025},
|
||
eprint={2509.22944},
|
||
archivePrefix={arXiv},
|
||
primaryClass={cs.LG},
|
||
url={http://arxiv.org/abs/2509.22944}
|
||
}
|
||
```
|
||
|
||
## 3. Current Limitations
|
||
|
||
Currently, the A-SINQ method is not supported in Hugging Face. Please refer to the official [SINQ repository](https://github.com/huawei-csl/SINQ/tree/main) to quantize a model with this strategy.
|
||
At the moment the SINQ quantization strategy and SINQ quantized models do not support Multi-GPU option, so if your system counts multiple GPUs please specify which one should be used. |