Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
388 lines
16 KiB
Markdown
388 lines
16 KiB
Markdown
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations under the License.
|
|
|
|
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
|
rendered properly in your Markdown viewer.
|
|
|
|
-->
|
|
|
|
# Optimizers and schedulers
|
|
|
|
An optimizer updates model weights during training. The scheduler wraps the optimizer and adjusts the learning rate each training step. [`Trainer`] creates both when it calls [`~Trainer.create_optimizer_and_scheduler`].
|
|
|
|
```md
|
|
┌────────────┐ ┌──────────────┐
|
|
│ Optimizer │ │ Scheduler │
|
|
│ (adamw_torch_fused)◄─│ (linear) │
|
|
│ │ │ │
|
|
│ param_groups | |
|
|
│ └ lr ◄────────┤ |
|
|
│ └ weight_decay │ │
|
|
└──────┬─────┘ └──────────────┘
|
|
│
|
|
┌──── EACH TRAINING STEP ───────────────────────────────────────────┐
|
|
│ │ │
|
|
│ model(batch) │ │
|
|
│ │ │ │
|
|
│ ▼ │ │
|
|
│ loss ──► loss.backward() ──► param.grad │
|
|
│ │ │
|
|
│ ┌─────────────┘ │
|
|
│ ▼ │
|
|
│ optimizer.step() │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ param.data updated │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ lr_scheduler.step() ──► recalculates lr │
|
|
│ │ writes to optimizer │
|
|
│ ▼ .param_groups['lr'] │
|
|
│ model.zero_grad() │
|
|
│ │
|
|
└───────────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
Configure optimizer and scheduler behavior, like [`~TrainingArguments.lr_scheduler_type`] and [`~TrainingArguments.optim`], in [`TrainingArguments`]. The defaults (`adamw_torch` optimizer and `linear` warmup scheduler) are a good starting point for most fine-tuning runs.
|
|
|
|
```py
|
|
from transformers import TrainingArguments
|
|
|
|
args = TrainingArguments(
|
|
...,
|
|
# Optimizer
|
|
optim="adamw_torch", # or "adamw_torch_fused", "adafactor", "sgd", etc.
|
|
learning_rate=2e-5,
|
|
weight_decay=0.01,
|
|
adam_beta1=0.9,
|
|
adam_beta2=0.999,
|
|
adam_epsilon=1e-8,
|
|
# Scheduler
|
|
lr_scheduler_type="cosine", # "linear", "cosine", "constant_with_warmup", etc.
|
|
warmup_steps=500,
|
|
lr_scheduler_kwargs={"num_cycles": 3}, # scheduler-specific extras
|
|
)
|
|
```
|
|
|
|
## Metric-based schedulers
|
|
|
|
Some schedulers adapt to training dynamics instead of following a fixed schedule.
|
|
|
|
[GreedyLR](https://huggingface.co/papers/2512.14527) updates the learning rate from evaluation results. It raises the learning rate by dividing it by `factor` when the metric keeps improving, and lowers the learning rate by multiplying it by `factor` when the metric doesn't improve. When the learning rate stops at `min_lr` and doesn't improve after `reset_start` steps, [`GreedyLR`] resets to its initial state and starts a new cycle.
|
|
|
|
[`GreedyLR`] requires evaluation during training. Set `eval_strategy` to `"steps"` or `"epoch"`.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ lr_scheduler_type="greedy",
|
|
+ lr_scheduler_kwargs={"patience": 10, "factor": 0.95, "min_lr": 1e-5},
|
|
+ eval_strategy="steps",
|
|
+ eval_steps=200,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
> [!TIP]
|
|
> The default `mode="min"` works for loss. If you're tracking a metric where a higher value is better, like accuracy, pass `"mode": "max"` in `lr_scheduler_kwargs`.
|
|
|
|
See the [`GreedyLR`] class for the full list of configurable parameters.
|
|
|
|
## Optimizer integrations
|
|
|
|
Transformers integrates third-party optimizers for specialized training scenarios.
|
|
|
|
| Optimizer | Install | `optim="value"` | Description |
|
|
|---|---|---|---|
|
|
| APOLLO | `apollo-torch` | `apollo_adamw` | Memory-efficient full-param via random projections; rank-1 sufficient |
|
|
| FlashOptim | `flashoptim` | `flash_adamw`, `flash_adam`, `flash_sgd`, `flash_sgdw`, `flash_lion` | Reduces optimizer memory with low-precision master weights |
|
|
| GrokAdamW | `grokadamw` | `grokadamw` | Targets delayed generalization (grokking) |
|
|
| LOMO / AdaLomo | `lomo-optim` | `lomo` / `adalomo` | Fuses gradient + update step for low-memory full-param fine-tuning |
|
|
| Schedule Free | `schedulefree` | `schedule_free_adamw`, `schedule_free_radam`, `schedule_free_sgd` | Eliminates LR annealing; pair with `lr_scheduler_type="constant"` |
|
|
| GaLore | `galore-torch` | `galore_adamw`, `galore_adafactor`, `galore_adamw_8bit` | Full-parameter learning via gradient low-rank projection |
|
|
| StableAdamW | `torch-optimi` | `stable_adamw` | AdamW + AdaFactor update clipping; no gradient clipping needed |
|
|
|
|
<hfoptions id="optimizer">
|
|
<hfoption id="APOLLO">
|
|
|
|
```bash
|
|
pip install apollo-torch
|
|
```
|
|
|
|
[Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO)](https://huggingface.co/papers/2412.05270) is a memory-efficient optimizer for full-parameter learning during pretraining and fine-tuning. It matches AdamW performance with SGD-like memory cost by using cheap random projections instead of SVD. For extreme memory savings, use APOLLO-Mini, a rank-1 variant.
|
|
|
|
Use the `optim_target_modules` parameter to specify which layers to train.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="apollo_adamw",
|
|
+ optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
Pass additional hyperparameters through `optim_args`.
|
|
|
|
> [!TIP]
|
|
> Set `scale` to `n/r`, where `n` is the original space dimension and `r` is the low-rank space dimension. Adjusting the learning rate while keeping `scale` at its default achieves a similar effect.
|
|
|
|
| parameter | description | APOLLO | APOLLO-Mini |
|
|
|---|---|---|---|
|
|
| rank | rank of the auxiliary sub-space for gradient scaling | 256 | 1 |
|
|
| scale_type | how scaling factors are applied | `channel` (per-channel scaling) | `tensor` (per-tensor scaling) |
|
|
| scale | adjusts gradient updates to stabilize training | 1.0 | 128 |
|
|
| update_proj_gap | steps before updating projection matrices | 200 | 200 |
|
|
| proj | projection type | `random` | `random` |
|
|
|
|
Enable APOLLO-Mini with a rank-1 configuration.
|
|
|
|
```py
|
|
args = TrainingArguments(
|
|
optim="apollo_adamw",
|
|
optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
|
|
optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="FlashOptim">
|
|
|
|
```bash
|
|
pip install flashoptim
|
|
```
|
|
|
|
[FlashOptim](https://huggingface.co/papers/2602.23349) reduces optimizer memory by storing master weights in lower precision. It supports AdamW, Adam, SGD, SGDW, and Lion variants.
|
|
|
|
> [!TIP]
|
|
> FlashOptim requires bf16 or fp16 model weights. It automatically disables `master_weight_bits` and warns if your model uses fp32.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="flash_adamw",
|
|
+ bf16=True,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
`master_weight_bits` controls the precision of the optimizer's master weight copy. By default, it stores the master copy in 24 bits. Set it to `"None"` to remove the master copy entirely for maximum memory savings at the cost of a slightly higher loss.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="flash_adamw",
|
|
+ optim_args="master_weight_bits=None",
|
|
+ bf16=True,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="GrokAdamW">
|
|
|
|
```bash
|
|
pip install grokadamw
|
|
```
|
|
|
|
[GrokAdamW](https://github.com/cognitivecomputations/grokadamw) targets *grokking*, where models exhibit delayed generalization due to slow-varying gradients.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="grokadamw",
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="LOMO">
|
|
|
|
```bash
|
|
pip install lomo-optim
|
|
```
|
|
|
|
[Low-Memory Optimization (LOMO)](https://github.com/OpenLMLab/LOMO) includes two optimizers for low-memory full-parameter finetuning, [LOMO](https://huggingface.co/papers/2306.09782) and [AdaLomo](https://hf.co/papers/2310.10195). Both fuse gradient computation and parameter updates into one step. AdaLomo adds an adaptive per-parameter learning rate, similar to Adam.
|
|
|
|
> [!TIP]
|
|
> AdaLomo works best without `grad_norm`, improving performance and throughput.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="adalomo",
|
|
learning_rate=2e-6,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="Schedule Free">
|
|
|
|
```bash
|
|
pip install schedulefree
|
|
```
|
|
|
|
[Schedule Free optimizer (SFO)](https://hf.co/papers/2405.15682) replaces momentum with a combination of averaging and interpolation, completely removing the need to anneal the learning rate.
|
|
|
|
SFO supports the RAdam (`schedule_free_radam`), AdamW (`schedule_free_adamw`), and SGD (`schedule_free_sgd`) optimizers. The RAdam scheduler doesn't require `warmup_steps`.
|
|
|
|
Pair SFO with `lr_scheduler_type="constant"`. Other scheduler types work but affect SFO's intended behavior.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="schedule_free_radam",
|
|
+ lr_scheduler_type="constant",
|
|
learning_rate=2e-6,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="StableAdamW">
|
|
|
|
```bash
|
|
pip install torch-optimi
|
|
```
|
|
|
|
[StableAdamW](https://huggingface.co/papers/2304.13013) ports AdaFactor's update clipping into AdamW, removing the need for gradient clipping. Otherwise, it's a drop-in replacement for AdamW.
|
|
|
|
> [!TIP]
|
|
> If you're training with large batch sizes or still observing loss spikes, try setting `beta_2` between 0.95 and 0.99.
|
|
|
|
```diff
|
|
args = TrainingArguments(
|
|
+ optim="stable_adamw",
|
|
learning_rate=2e-6,
|
|
... # remaining args from the TrainingArguments intro config
|
|
)
|
|
```
|
|
</hfoption>
|
|
<hfoption id="GaLore">
|
|
|
|
```bash
|
|
pip install galore-torch trl
|
|
```
|
|
|
|
[Gradient Low-Rank Projection (GaLore)](https://hf.co/papers/2403.03507) reduces memory for training LLMs. Unlike low-rank adaptation methods like [LoRA](https://hf.co/papers/2106.09685), GaLore preserves *full-parameter* learning.
|
|
|
|
Set `optim` in [`trl.SFTConfig`] to a GaLore optimizer (`"galore_adamw"`, `"galore_adafactor"`, or `"galore_adamw_8bit"`). Specify target modules with `optim_target_modules` and GaLore-specific parameters (`rank`, `update_proj_gap`, `scale`) through `optim_args`.
|
|
|
|
```py
|
|
from trl import SFTConfig
|
|
|
|
args = SFTConfig(
|
|
output_dir="./galore",
|
|
max_steps=100,
|
|
optim="galore_adamw",
|
|
optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
|
|
optim_args="rank=64, update_proj_gap=100, scale=0.10",
|
|
)
|
|
```
|
|
|
|
Append `_layerwise` to the optimizer name for layerwise optimization (`"galore_adamw_layerwise"`). Only linear layers targeted by GaLore use low-rank decomposition. All other layers are optimized normally.
|
|
|
|
```py
|
|
from trl import SFTConfig, SFTTrainer
|
|
|
|
args = SFTConfig(
|
|
output_dir="./galore",
|
|
max_steps=100,
|
|
optim="galore_adamw_layerwise",
|
|
optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
|
|
optim_args="rank=64, update_proj_gap=100, scale=0.10",
|
|
)
|
|
```
|
|
|
|
Layerwise mode is experimental. It only runs on a [single GPU](https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#train-7b-model-with-a-single-gpu-with-24gb-memory), doesn't support DistributedDataParallel (DDP), and gradient clipping and DeepSpeed may not work.
|
|
|
|
</hfoption>
|
|
</hfoptions>
|
|
|
|
## Customizing optimizer and scheduler
|
|
|
|
Create a custom optimizer and scheduler to use an optimizer not yet integrated, adjust per-layer learning rates, or apply custom logic.
|
|
|
|
### Pass a class and kwargs
|
|
|
|
[`~Trainer.optimizer_cls_and_kwargs`] accepts a custom optimizer class while delegating parameter grouping and device placement to [`Trainer`].
|
|
|
|
[`Trainer`] defers building the optimizer until [`~Trainer.create_optimizer`] runs, so the model is already on the correct device.
|
|
|
|
```py
|
|
import torch
|
|
|
|
trainer = Trainer(
|
|
...
|
|
optimizer_cls_and_kwargs=(
|
|
torch.optim.SGD,
|
|
{"momentum": 0.9, "nesterov": True}
|
|
),
|
|
)
|
|
```
|
|
|
|
### Pass prebuilt instances
|
|
|
|
Pass a predefined optimizer and scheduler to [`~Trainer.optimizers`]. [`Trainer`] skips [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`] when prebuilt instances are provided. If you don't pass a scheduler, [`Trainer`] automatically creates one.
|
|
|
|
> [!WARNING]
|
|
> Build the optimizer after placing your model on the correct device. Parameters are resolved at construction time, before `Trainer` moves the model. In distributed training, mismatched devices can silently cause incorrect behavior.
|
|
|
|
```py
|
|
import torch
|
|
from transformers import Trainer, get_cosine_schedule_with_warmup
|
|
|
|
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
|
|
scheduler = get_cosine_schedule_with_warmup(
|
|
optimizer, num_warmup_steps=500, num_training_steps=10_000
|
|
)
|
|
|
|
trainer = Trainer(
|
|
...
|
|
optimizers=(optimizer, scheduler),
|
|
)
|
|
```
|
|
|
|
Prebuilt instances bypass [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`], so you need to specify your own parameter groups.
|
|
|
|
### Override optimizer and scheduler methods
|
|
|
|
Subclass [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`] for full control. Both methods run *during* [`~Trainer.train`].
|
|
|
|
Override [`~Trainer.create_scheduler`] to use a scheduler like [OneCycleLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html) that isn't available in [`SchedulerType`].
|
|
|
|
For each method, make sure to assign to `self` and return it.
|
|
|
|
```py
|
|
import torch
|
|
from transformers import Trainer
|
|
|
|
class MyTrainer(Trainer):
|
|
|
|
def create_scheduler(self, num_training_steps, optimizer=None):
|
|
optimizer = optimizer or self.optimizer
|
|
self.lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
|
|
optimizer,
|
|
max_lr=0.1,
|
|
total_steps=num_training_steps,
|
|
)
|
|
return self.lr_scheduler
|
|
```
|
|
|
|
You don't need to override [`~Trainer.create_optimizer`] if the default optimizer works. Extending a method with `super()` is easier than replacing it entirely. For example, add an extra parameter group while keeping everything else the same.
|
|
|
|
```py
|
|
class MyTrainer(Trainer):
|
|
def create_optimizer(self, model=None):
|
|
super().create_optimizer(model) # builds the default two param groups
|
|
# add extra param group
|
|
self.optimizer.add_param_group({
|
|
"params": self.model.classifier.parameters(),
|
|
"lr": self.args.learning_rate * 10,
|
|
})
|
|
return self.optimizer
|
|
```
|