gavin/transformers

陈赣 06f1fd69a6

first commit

2026-06-05 16:53:03 +08:00

Optimizers and schedulers

An optimizer updates model weights during training. The scheduler wraps the optimizer and adjusts the learning rate each training step. [Trainer] creates both when it calls [~Trainer.create_optimizer_and_scheduler].

                                    ┌────────────┐         ┌──────────────┐
                                    │ Optimizer  │         │  Scheduler   │
                                    │ (adamw_torch_fused)◄─│  (linear)    │
                                    │            │         │              │
                                    │ param_groups         │              │
                                    │  └ lr       ◄────────┤              │
                                    │  └ weight_decay      │              │
                                    └──────┬─────┘         └──────────────┘
                                           │                      
  ┌──── EACH TRAINING STEP ───────────────────────────────────────────┐
  │                                        │                          │
  │   model(batch)                         │                          │
  │       │                                │                          │
  │       ▼                                │                          │
  │     loss ──► loss.backward() ──► param.grad                       │
  │                                        │                          │
  │                          ┌─────────────┘                          │
  │                          ▼                                        │
  │              optimizer.step()                                     │
  │                          │                                        │
  │                          ▼                                        │
  │                   param.data updated                              │
  │                          │                                        │
  │                          ▼                                        │
  │              lr_scheduler.step()  ──► recalculates lr             │
  │                          │            writes to optimizer         │
  │                          ▼            .param_groups[i]['lr']      │
  │              model.zero_grad()                                    │
  │                                                                   │
  └───────────────────────────────────────────────────────────────────┘
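
The loop in the diagram can be sketched in plain PyTorch. This is a minimal illustration, not [Trainer]'s implementation: the toy linear model, the random batches, the dummy loss, and the hand-written `LambdaLR` linear decay are all assumptions chosen to keep the example self-contained.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # toy model (assumption), stands in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

num_steps = 4
# Linear decay from the initial lr to 0, written by hand for illustration
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 1.0 - step / num_steps
)

lrs = []
for _ in range(num_steps):
    batch = torch.randn(8, 4)          # random data (assumption)
    loss = model(batch).pow(2).mean()  # forward pass and a dummy loss
    loss.backward()                    # fills param.grad
    optimizer.step()                   # updates the weights from param.grad
    scheduler.step()                   # recalculates lr and writes it to param_groups
    model.zero_grad()                  # clears gradients for the next step
    lrs.append(optimizer.param_groups[0]["lr"])
```

After the loop, `lrs` holds the learning rate the scheduler wrote into the optimizer after each step.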

Configure optimizer and scheduler behavior in [TrainingArguments] with parameters like [~TrainingArguments.optim] and [~TrainingArguments.lr_scheduler_type]. The defaults (the adamw_torch optimizer and the linear scheduler) are a good starting point for most fine-tuning runs.

from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    # Optimizer
    optim="adamw_torch",          # or "adamw_torch_fused", "adafactor", "sgd", etc.
    learning_rate=2e-5,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    # Scheduler
    lr_scheduler_type="cosine",   # "linear", "cosine", "constant_with_warmup", etc.
    warmup_steps=500,
    lr_scheduler_kwargs={"num_cycles": 3},  # scheduler-specific extras
)
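
To make the scheduler settings concrete, here is a sketch of the multiplier that a linear or cosine schedule with warmup typically applies to the base learning rate. The function names are hypothetical and the total of 10,000 training steps is an assumption. The formulas follow the common definition of these schedules and may differ in detail from the library's implementation.

```python
import math

def linear_multiplier(step, warmup_steps, total_steps):
    """Scale factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay to 0

def cosine_multiplier(step, warmup_steps, total_steps, num_cycles=0.5):
    """Cosine decay after warmup; num_cycles=0.5 is a single half-wave down to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

base_lr = 2e-5
total_steps = 10_000  # assumed run length
lr_mid_warmup = base_lr * linear_multiplier(250, 500, total_steps)
lr_at_peak = base_lr * cosine_multiplier(500, 500, total_steps)
```

With `warmup_steps=500`, both schedules reach the full `learning_rate` at step 500 and differ only in how they decay afterwards.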

Metric-based schedulers

Some schedulers adapt to training dynamics instead of following a fixed schedule.

[GreedyLR] updates the learning rate from evaluation results. With factor below 1, it raises the learning rate by dividing it by factor when the metric keeps improving, and lowers the learning rate by multiplying it by factor when the metric doesn't improve. When the learning rate reaches min_lr and the metric still doesn't improve after reset_start steps, [GreedyLR] resets to its initial state and starts a new cycle.

[GreedyLR] requires evaluation during training. Set eval_strategy to "steps" or "epoch".

args = TrainingArguments(
+   lr_scheduler_type="greedy",
+   lr_scheduler_kwargs={"patience": 10, "factor": 0.95, "min_lr": 1e-5},
+   eval_strategy="steps",
+   eval_steps=200,
    ...  # remaining args from the TrainingArguments intro config
)

Tip

The default mode="min" works for loss. If you're tracking a metric where a higher value is better, like accuracy, pass "mode": "max" in lr_scheduler_kwargs.

See the [GreedyLR] class for the full list of configurable parameters.
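
The raise-and-lower rule described above can be illustrated with a toy update function. This is a simplified sketch, not the [GreedyLR] implementation: it ignores patience and the reset cycle, the `greedy_lr_update` helper and its `max_lr` cap are hypothetical, and the evaluation losses are made up.

```python
def greedy_lr_update(lr, improved, factor=0.95, min_lr=1e-5, max_lr=1.0):
    """One simplified update: raise lr on improvement, lower it otherwise."""
    if improved:
        return min(lr / factor, max_lr)  # dividing by factor < 1 raises lr
    return max(lr * factor, min_lr)      # multiplying by factor < 1 lowers lr

lr = 2e-5
best = float("inf")
history = []
for eval_loss in [1.00, 0.90, 0.95, 0.97, 0.80]:  # made-up evaluation losses
    improved = eval_loss < best  # mode="min": a lower metric is better
    best = min(best, eval_loss)
    lr = greedy_lr_update(lr, improved)
    history.append(lr)
```

The learning rate rises after the first two evaluations, falls after the next two, and rises again once the loss reaches a new best.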

Optimizer integrations

Transformers integrates third-party optimizers for specialized training scenarios.

| Optimizer | Install | `optim="value"` | Description |
|---|---|---|---|
| APOLLO | `apollo-torch` | `apollo_adamw` | Memory-efficient full-param via random projections; rank-1 sufficient |
| FlashOptim | `flashoptim` | `flash_adamw`, `flash_adam`, `flash_sgd`, `flash_sgdw`, `flash_lion` | Reduces optimizer memory with low-precision master weights |
| GrokAdamW | `grokadamw` | `grokadamw` | Targets delayed generalization (grokking) |
| LOMO / AdaLomo | `lomo-optim` | `lomo` / `adalomo` | Fuses gradient + update step for low-memory full-param fine-tuning |
| Schedule Free | `schedulefree` | `schedule_free_adamw`, `schedule_free_radam`, `schedule_free_sgd` | Eliminates LR annealing; pair with `lr_scheduler_type="constant"` |
| GaLore | `galore-torch` | `galore_adamw`, `galore_adafactor`, `galore_adamw_8bit` | Full-parameter learning via gradient low-rank projection |
| StableAdamW | `torch-optimi` | `stable_adamw` | AdamW + AdaFactor update clipping; no gradient clipping needed |

pip install apollo-torch

Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO) is a memory-efficient optimizer for full-parameter learning during pretraining and fine-tuning. It matches AdamW performance with SGD-like memory cost by using cheap random projections instead of SVD. For extreme memory savings, use APOLLO-Mini, a rank-1 variant.

Use the optim_target_modules parameter to specify which layers APOLLO applies to.

args = TrainingArguments(
+   optim="apollo_adamw",
+   optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    ...  # remaining args from the TrainingArguments intro config
)
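
To check which layers a set of patterns selects, you can test them against module names yourself. The sketch below assumes the patterns are matched as full regular expressions against each module name. The module names are hypothetical examples in the style of `model.named_modules()`, and `is_targeted` is an illustrative helper, not a library function.

```python
import re

patterns = [r".*.attn.*", r".*.mlp.*"]

# Hypothetical module names, in the style of model.named_modules()
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.embed_tokens",
    "lm_head",
]

def is_targeted(name, patterns):
    """Return True if any pattern matches the whole module name."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)

targeted = [name for name in module_names if is_targeted(name, patterns)]
```

Here only the attention and MLP projections are selected. The embedding and output head fall outside the patterns.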

Pass additional hyperparameters through optim_args.

Tip

Set scale to n/r, where n is the original space dimension and r is the low-rank space dimension. Adjusting the learning rate while keeping scale at its default achieves a similar effect.

| Parameter | Description | APOLLO | APOLLO-Mini |
|---|---|---|---|
| rank | rank of the auxiliary sub-space for gradient scaling | 256 | 1 |
| scale_type | how scaling factors are applied | `channel` (per-channel scaling) | `tensor` (per-tensor scaling) |
| scale | adjusts gradient updates to stabilize training | 1.0 | 128 |
| update_proj_gap | steps before updating projection matrices | 200 | 200 |
| proj | projection type | `random` | `random` |

Enable APOLLO-Mini with a rank-1 configuration.

args = TrainingArguments(
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
    ...  # remaining args from the TrainingArguments intro config
)
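
The optim_args value is a single string of comma-separated key=value pairs. The sketch below shows one way such a string can be split into settings. `parse_optim_args` is a hypothetical helper written for illustration, and the type conversions are assumptions about how the values might be consumed.

```python
def parse_optim_args(optim_args):
    """Split "key=value,key=value" into a dict of strings."""
    parsed = {}
    for pair in optim_args.replace(" ", "").split(","):
        key, value = pair.split("=")
        parsed[key] = value
    return parsed

raw = parse_optim_args(
    "proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200"
)

# Values arrive as strings, so numeric settings need explicit conversion
config = {
    **raw,
    "rank": int(raw["rank"]),
    "scale": float(raw["scale"]),
    "update_proj_gap": int(raw["update_proj_gap"]),
}
```

Because every value starts out as a string, a typo such as `rank=one` only surfaces when the value is converted.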

pip install flashoptim

FlashOptim reduces optimizer memory by storing master weights in lower precision. It supports AdamW, Adam, SGD, SGDW, and Lion variants.

Tip

FlashOptim requires bf16 or fp16 model weights. It automatically disables master_weight_bits and warns if your model uses fp32.

args = TrainingArguments(
+   optim="flash_adamw",
+   bf16=True,
    ...  # remaining args from the TrainingArguments intro config
)

master_weight_bits controls the precision of the optimizer's master weight copy. By default, it stores the master copy in 24 bits. Set it to "None" to remove the master copy entirely for maximum memory savings at the cost of a slightly higher loss.

args = TrainingArguments(
+   optim="flash_adamw",
+   optim_args="master_weight_bits=None",
+   bf16=True,
    ...  # remaining args from the TrainingArguments intro config
)

pip install grokadamw

GrokAdamW targets grokking, where models exhibit delayed generalization due to slow-varying gradients.

args = TrainingArguments(
+   optim="grokadamw",
    ...  # remaining args from the TrainingArguments intro config
)

pip install lomo-optim

Low-Memory Optimization (LOMO) includes two optimizers for low-memory full-parameter fine-tuning, LOMO and AdaLomo. Both fuse gradient computation and parameter updates into one step. AdaLomo adds an adaptive per-parameter learning rate, similar to Adam.

Tip

AdaLomo works best without grad_norm, improving performance and throughput.

args = TrainingArguments(
+   optim="adalomo",
    learning_rate=2e-6,
    ...  # remaining args from the TrainingArguments intro config
)

pip install schedulefree

Schedule Free optimizer (SFO) replaces momentum with a combination of averaging and interpolation, completely removing the need to anneal the learning rate.

SFO supports the RAdam (schedule_free_radam), AdamW (schedule_free_adamw), and SGD (schedule_free_sgd) optimizers. The RAdam variant doesn't require warmup_steps.

Pair SFO with lr_scheduler_type="constant". Other scheduler types work but affect SFO's intended behavior.

args = TrainingArguments(
+   optim="schedule_free_radam",
+   lr_scheduler_type="constant",
    learning_rate=2e-6,
    ...  # remaining args from the TrainingArguments intro config
)

pip install torch-optimi

StableAdamW ports AdaFactor's update clipping into AdamW, removing the need for gradient clipping. Otherwise, it's a drop-in replacement for AdamW.

Tip

If you're training with large batch sizes or still observing loss spikes, try setting beta_2 (adam_beta2 in [TrainingArguments]) between 0.95 and 0.99.

args = TrainingArguments(
+   optim="stable_adamw",
    learning_rate=2e-6,
    ...  # remaining args from the TrainingArguments intro config
)

pip install galore-torch trl

Gradient Low-Rank Projection (GaLore) reduces memory for training LLMs. Unlike low-rank adaptation methods like LoRA, GaLore preserves full-parameter learning.

Set optim in [trl.SFTConfig] to a GaLore optimizer ("galore_adamw", "galore_adafactor", or "galore_adamw_8bit"). Specify target modules with optim_target_modules and GaLore-specific parameters (rank, update_proj_gap, scale) through optim_args.

from trl import SFTConfig

args = SFTConfig(
    output_dir="./galore",
    max_steps=100,
    optim="galore_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
)

Append _layerwise to the optimizer name for layerwise optimization ("galore_adamw_layerwise"). Only linear layers targeted by GaLore use low-rank decomposition. All other layers are optimized normally.

from trl import SFTConfig

args = SFTConfig(
    output_dir="./galore",
    max_steps=100,
    optim="galore_adamw_layerwise",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
)

Layerwise mode is experimental. It only runs on a single GPU, doesn't support DistributedDataParallel (DDP), and gradient clipping and DeepSpeed may not work.

Customizing optimizer and scheduler

Create a custom optimizer and scheduler to use an optimizer not yet integrated, adjust per-layer learning rates, or apply custom logic.

Pass a class and kwargs

[~Trainer.optimizer_cls_and_kwargs] accepts a custom optimizer class while delegating parameter grouping and device placement to [Trainer].

[Trainer] defers building the optimizer until [~Trainer.create_optimizer] runs, so the model is already on the correct device.

import torch
from transformers import Trainer

trainer = Trainer(
    ...
    optimizer_cls_and_kwargs=(
        torch.optim.SGD,
        {"momentum": 0.9, "nesterov": True}
    ),
)

Pass prebuilt instances

Pass a predefined optimizer and scheduler to [~Trainer.optimizers]. [Trainer] skips [~Trainer.create_optimizer] and [~Trainer.create_scheduler] when prebuilt instances are provided. If you don't pass a scheduler, [Trainer] automatically creates one.

Warning

Build the optimizer after placing your model on the correct device. Parameters are resolved at construction time, before Trainer moves the model. In distributed training, mismatched devices can silently cause incorrect behavior.

import torch
from transformers import Trainer, get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

trainer = Trainer(
    ...
    optimizers=(optimizer, scheduler),
)

Prebuilt instances bypass [~Trainer.create_optimizer] and [~Trainer.create_scheduler], so you need to specify your own parameter groups.
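
As an illustration of building parameter groups by hand, the sketch below splits parameters into a group with weight decay and a group without it. The toy model is an assumption, and the rule of exempting every 1-D tensor (biases and normalization weights) is a common convention simplified for this example, not necessarily what [Trainer] does.

```python
import torch

model = torch.nn.Sequential(  # toy model (assumption)
    torch.nn.Linear(8, 8),
    torch.nn.LayerNorm(8),
    torch.nn.Linear(8, 2),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # 1-D tensors cover the biases and LayerNorm weights in this toy model
    (no_decay if param.ndim == 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=2e-5,
)
```

The resulting optimizer can be passed as the first element of the optimizers tuple.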

Override optimizer and scheduler methods

Subclass [Trainer] and override [~Trainer.create_optimizer] and [~Trainer.create_scheduler] for full control. Both methods run during [~Trainer.train].

Override [~Trainer.create_scheduler] to use a scheduler like OneCycleLR that isn't available in [SchedulerType].

In each method, make sure to assign the new object to self.optimizer or self.lr_scheduler and return it.

import torch
from transformers import Trainer

class MyTrainer(Trainer):

    def create_scheduler(self, num_training_steps, optimizer=None):
        optimizer = optimizer or self.optimizer
        self.lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=0.1,
            total_steps=num_training_steps,
        )
        return self.lr_scheduler

You don't need to override [~Trainer.create_optimizer] if the default optimizer works. Extending a method with super() is easier than replacing it entirely. For example, add an extra parameter group while keeping everything else the same.

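class MyTrainer(Trainer):
    def create_optimizer(self, model=None):
        super().create_optimizer(model)  # builds the default two param groups
        head_params = list(self.model.classifier.parameters())
        head_ids = {id(p) for p in head_params}
        # a parameter can only belong to one group, so drop the head
        # from the default groups before adding it as a new group
        for group in self.optimizer.param_groups:
            group["params"] = [p for p in group["params"] if id(p) not in head_ids]
        # add extra param group with a higher learning rate
        self.optimizer.add_param_group({
            "params": head_params,
            "lr": self.args.learning_rate * 10,
        })
        return self.optimizer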
