first commit
docs/source/en/kernel_doc/loading_kernels.md

<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Loading kernels

A kernel works as a drop-in replacement for standard PyTorch operations. It swaps a layer's `forward` method for an optimized kernel implementation without changing the model code.

This guide shows how to load kernels to accelerate inference.

Install the `kernels` package. We recommend the latest version, which provides the best performance and bug fixes.

> [!NOTE]
> kernels >=0.11.0 is the minimum required version for working with Transformers.

```bash
pip install -U kernels
```

Set `use_kernels=True` in [`~PreTrainedModel.from_pretrained`] to load a matching kernel variant for your platform and environment. This replaces supported PyTorch operations with the kernel implementation.

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    use_kernels=True,
    device_map="cuda"
)
```

## Attention kernels

Load attention kernels from the Hub with the `attn_implementation` argument.

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn2",
    device_map="cuda"
)
```

Attention kernels from `kernels-community` are trusted, and more trusted sources may be added in the future. A kernel from any other repository requires the additional `allow_all_kernels=True` kwarg, similar to the `trust_remote_code=True` kwarg for non-HF models. Loading a kernel can execute arbitrary code on the host machine, and not every repository can be verified, so you need to allow it explicitly.

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="random-repo/random-attention",
    allow_all_kernels=True,
    device_map="cuda"
)
```

Some kernels, like attention, accept a version specifier appended to the repository name.

- `@v2.1.0` pins to a specific tag or branch.
- `@>=2.0,<3.0` sets semantic versioning constraints.

```py
from transformers import AutoModelForCausalLM

# pin to a specific version
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn2@v2.1.0",
    device_map="cuda"
)
# use semantic versioning constraints
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn2@>=2.0,<3.0",
    device_map="cuda"
)
```

## Mode-awareness

Kernels automatically adapt to [training](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train) and [inference](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval) modes based on PyTorch's `model.training` state.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    use_kernels=True,
    device_map="cuda"
)

# Example inputs; for causal language modeling the labels are the input ids
input_ids = tokenizer("Kernels speed up inference.", return_tensors="pt").input_ids.to(model.device)
labels = input_ids.clone()

# Switch to inference mode - uses inference-optimized kernels
model.eval()
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=50)

# Switch to training mode - uses training-optimized kernels with gradient support
model.train()
loss = model(input_ids, labels=labels).loss
loss.backward()
```

Explicitly enable training and inference modes with the `mode` argument in the [`~kernels.kernelize`] function. Training mode can also be combined with `Mode.TORCH_COMPILE` to select kernels that are compatible with torch.compile.

```py
from kernels import Mode

# inference optimized kernels
model.kernelize(mode=Mode.INFERENCE)

# training optimized kernels
model.kernelize(mode=Mode.TRAINING)

# training and torch-compile friendly kernels
model.kernelize(mode=Mode.TRAINING | Mode.TORCH_COMPILE)
```

## KernelConfig

[`KernelConfig`] customizes which kernels are used in a model.

The `:` separator names a specific kernel entry inside the repository and maps it to a layer.

```py
from transformers import AutoModelForCausalLM, KernelConfig

kernel_config = KernelConfig(
    kernel_mapping={
        "RMSNorm": "kernels-community/liger_kernels:LigerRMSNorm",
    }
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn2:FlashAttention2",
    use_kernels=True,
    kernel_config=kernel_config,
    device_map="cuda"
)
```

Specify different kernel implementations for each device type.

```py
from transformers import KernelConfig

kernel_config = KernelConfig(
    kernel_mapping={
        "RMSNorm": {
            "cuda": "kernels-community/liger_kernels:LigerRMSNorm",
            "rocm": "kernels-community/rocm-kernels:RocmRMSNorm",
            "metal": "kernels-community/metal-kernels:MetalRMSNorm",
            "xpu": "kernels-community/xpu-kernels:XpuRMSNorm"
        }
    }
)
```
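
To make the two mapping shapes above concrete, the sketch below reads a layer's entry for a given device type. `lookup_kernel` is a hypothetical helper written only for illustration; it is not part of Transformers or `kernels` and does not claim to mirror how [`KernelConfig`] resolves entries internally.

```py
# Hypothetical helper (not a Transformers or kernels API): it only illustrates
# how the flat and per-device mapping shapes shown above can be read.
def lookup_kernel(kernel_mapping, layer_name, device_type):
    """Return the kernel reference for a layer on a device, or None if there is none."""
    entry = kernel_mapping.get(layer_name)
    if isinstance(entry, dict):
        # Per-device form: {"cuda": "...", "rocm": "..."}
        entry = entry.get(device_type)
    # A missing or empty entry means no kernel is configured for this layer and device.
    return entry or None


kernel_mapping = {
    "RMSNorm": {
        "cuda": "kernels-community/liger_kernels:LigerRMSNorm",
        "rocm": "kernels-community/rocm-kernels:RocmRMSNorm",
    }
}

print(lookup_kernel(kernel_mapping, "RMSNorm", "cuda"))  # kernels-community/liger_kernels:LigerRMSNorm
print(lookup_kernel(kernel_mapping, "RMSNorm", "xpu"))   # None
```

A flat string entry is returned unchanged for every device type, which matches the single-string form used earlier in this section.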

## Local kernels

Set `use_local_kernel=True` in [`KernelConfig`] to load kernels from a local filesystem path instead of a Hub repository.

Local kernels use `/abs/path:layer_name` instead of the Hub format `org/repo:layer_name`.

```py
from transformers import KernelConfig, AutoModelForCausalLM

kernel_mapping = {
    "RMSNorm": "/path/to/liger_kernels:LigerRMSNorm",
}
kernel_config = KernelConfig(kernel_mapping, use_local_kernel=True)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    dtype="auto",
    device_map="auto",
    use_kernels=True,
    kernel_config=kernel_config
)
```

## Disabling kernels

Disable kernels for specific layers by mapping the layer to an empty string in [`KernelConfig`].

```py
from transformers import AutoModelForCausalLM, KernelConfig

kernel_config = KernelConfig(
    kernel_mapping={
        "RMSNorm": "",
    }
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    use_kernels=True,
    kernel_config=kernel_config,
    device_map="cuda"
)
```

Set the `USE_HUB_KERNELS` environment variable to `0`, `OFF`, or `NO` to disable kernels globally.

```bash
export USE_HUB_KERNELS=0 # or OFF or NO
```
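
The same variable can be set from Python, for example in a notebook where exporting it in the shell is inconvenient. This is a minimal sketch that assumes the variable is read when Transformers is imported, so it is set before the import; that timing is an assumption, not something this guide confirms.

```py
import os

# Assumption: USE_HUB_KERNELS is read at import time, so set it before
# importing transformers. "OFF" and "NO" are the other documented values.
os.environ["USE_HUB_KERNELS"] = "0"

# import transformers  # import only after the variable is set
```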

## Troubleshooting

Kernel integration depends on hardware, drivers, and package versions working together. The following sections cover common failures.

### Installation issues

An import error usually indicates the `kernels` library isn't installed. Install or upgrade it.

```bash
pip install -U kernels
```
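
To check the installed version against the 0.11.0 minimum stated at the top of this guide, a small script along these lines can help. It uses only the standard library; `parse_release` and `kernels_status` are hypothetical helper names, and the version parsing is deliberately simple (it keeps the first three numeric components and ignores suffixes such as `.dev0`).

```py
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

MINIMUM = (0, 11, 0)  # minimum kernels version stated in this guide


def parse_release(text):
    """Turn a version string such as '0.11.0' into a tuple like (0, 11, 0)."""
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    # Pad short versions such as '0.11' so tuple comparison behaves as expected
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def kernels_status():
    """Report whether kernels is installed and meets the minimum version."""
    try:
        installed = version("kernels")
    except PackageNotFoundError:
        return "kernels is not installed; run: pip install -U kernels"
    if parse_release(installed) < MINIMUM:
        return f"kernels {installed} is older than 0.11.0; run: pip install -U kernels"
    return f"kernels {installed} satisfies the minimum version"


print(kernels_status())
```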

### Kernel loading failures

If specific kernels fail to load, try the following.

- Check your hardware compatibility with the kernel requirements.
- Verify your CUDA/ROCm/Metal drivers are up to date.
- Consult the kernel repository documentation for known issues.

### Device compatibility

Not all kernels support all devices. The library falls back to standard PyTorch operations if a kernel is unavailable for your hardware. Check kernel repository documentation for device-specific support.

## Resources

- [Kernels](https://github.com/huggingface/kernels) repository
- [Enhance Your Models in 5 Minutes with the Hugging Face Kernel Hub](https://huggingface.co/blog/hello-hf-kernels) blog post
- Discover kernels in the [kernels-community](https://huggingface.co/kernels-community) org