transformers/tests/trainer/TESTING_GUIDE.md
陈赣 06f1fd69a6
first commit
2026-06-05 16:53:03 +08:00

Trainer Testing Guide

Test files

File                           What it covers
test_trainer.py                Core: mixed precision, grad accumulation, logging, metrics, early stopping
test_trainer_checkpointing.py  Checkpoint save/resume, interrupted training, frozen params
test_trainer_data.py           Collators, dynamic shapes, iterable datasets, label smoothing
test_trainer_optimizers.py     Optimizers & LR schedulers
test_trainer_seq2seq.py        Encoder-decoder fine-tuning
trainer_test_utils.py          Shared utilities (models, datasets, helpers), not a test file
distributed/                   DDP, FSDP, DeepSpeed (see below)

Running tests

Always use RUN_SLOW=1 — most trainer tests are @slow and will be skipped without it.
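
To show why the flag matters, here is a minimal sketch of how an environment-gated skip decorator can work. It is a simplified stand-in, not the actual @slow implementation: run_slow_enabled and slow_sketch are hypothetical names, and the set of accepted truthy spellings is an assumption.

```python
import os
import unittest

# Assumed truthy spellings; the real flag parser may accept a different set.
_TRUTHY = {"1", "true", "yes", "y", "on", "t"}


def run_slow_enabled(environ=None):
    """Return True when RUN_SLOW is set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get("RUN_SLOW", "0").strip().lower() in _TRUTHY


def slow_sketch(test_item):
    """Hypothetical stand-in for @slow: skip unless RUN_SLOW is enabled."""
    reason = "slow test: set RUN_SLOW=1 to run"
    return unittest.skipUnless(run_slow_enabled(), reason)(test_item)


class ExampleTest(unittest.TestCase):
    @slow_sketch
    def test_expensive_path(self):
        # Reported as skipped, not failed, when RUN_SLOW is unset.
        self.assertEqual(sum(range(4)), 6)
```

Because the check happens at decoration time, a run without RUN_SLOW=1 reports such tests as skipped, which is easy to mistake for a passing run.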

Debugging workflow

Never run the full suite until the specific failing test passes. Work from smallest scope outward:

  1. Single GPU — fastest feedback:
    CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
    
  2. Fix and re-run that same test until it passes.
  3. 2 GPUs — catch DataParallel issues:
    CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
    
  4. Full test class — check for regressions:
    RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
    
  5. All tests in that file — only at the very end:
    RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
    

Same for distributed tests — single failing test first, fix, confirm, then widen scope.

Tip: a -k filter matches test names across every collected file, so it can select unintended tests. Use full node IDs instead: pytest file::Class::test.
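
The escalation order above can be sketched as a small helper that builds full node IDs and the matching commands. The names node_id and debug_stages are hypothetical; the flags and environment variables are the ones used in the steps above.

```python
def node_id(test_file, test_class=None, test_name=None):
    """Join the parts of a pytest node ID with '::', skipping missing parts."""
    return "::".join(part for part in (test_file, test_class, test_name) if part)


def debug_stages(test_file, test_class, test_name):
    """Return (extra_env, argv) pairs ordered from narrowest to widest scope."""
    base = ["python", "-m", "pytest"]
    single = node_id(test_file, test_class, test_name)
    return [
        # 1. Single GPU, single test.
        ({"CUDA_VISIBLE_DEVICES": "0", "RUN_SLOW": "1"}, base + [single, "-xvs"]),
        # 2. Two GPUs, same test.
        ({"CUDA_VISIBLE_DEVICES": "0,1", "RUN_SLOW": "1"}, base + [single, "-xvs"]),
        # 3. Whole test class.
        ({"RUN_SLOW": "1"}, base + [node_id(test_file, test_class), "-xvs"]),
        # 4. Whole file, only at the very end.
        ({"RUN_SLOW": "1"}, base + [test_file, "-v", "--tb=line"]),
    ]


for env, argv in debug_stages("tests/trainer/test_trainer.py", "Class", "test_name"):
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, " ".join(argv))
```

To execute a stage rather than print it, one option is subprocess.run(argv, env={**os.environ, **env}), stopping at the first stage that fails.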

Writing tests

get_regression_trainer() is the fastest way to get a working Trainer. Pass any TrainingArguments kwarg directly. It uses RegressionModel and RegressionDataset, which train in milliseconds.

For LLM tests, use tiny Hub models: AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM").

Use max_steps=10 instead of num_train_epochs=3 when you just need training to run.

Multi-GPU safety

The Trainer uses nn.DataParallel when n_gpu > 1:

  • train_batch_size = per_device_train_batch_size * n_gpu — don't hardcode batch sizes in assertions.
  • Compute steps dynamically: math.ceil(num_samples / (batch_size * grad_accum)).
  • Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
  • DataParallel gather introduces ~1e-8 FP differences — use places=6 for loss assertions.
  • If a test model has **kwargs but ignores num_items_in_batch, set model.accepts_loss_kwargs = False.
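
As a minimal sketch of the first, second, and fourth points, the helpers below derive the expected step count from the GPU count instead of hardcoding it. The function names and the loss values are illustrative assumptions, and the max(1, n_gpu) guard for CPU-only runs is an addition to the formula above.

```python
import math
import unittest


def train_batch_size(per_device_train_batch_size, n_gpu):
    """Effective batch size; max(1, ...) keeps CPU-only runs (n_gpu == 0) valid."""
    return per_device_train_batch_size * max(1, n_gpu)


def expected_steps(num_samples, per_device_train_batch_size, n_gpu, grad_accum=1):
    """Optimizer steps per epoch, computed from the actual device count."""
    batch_size = train_batch_size(per_device_train_batch_size, n_gpu)
    return math.ceil(num_samples / (batch_size * grad_accum))


# Same arguments, different machines: the expected value follows n_gpu.
print(expected_steps(100, 8, n_gpu=1, grad_accum=2))  # ceil(100 / 16) = 7
print(expected_steps(100, 8, n_gpu=2, grad_accum=2))  # ceil(100 / 32) = 4

# Illustrative losses that differ by 1e-8, as a DataParallel gather might produce.
loss_single_gpu = 0.43210000
loss_multi_gpu = 0.43210001
unittest.TestCase().assertAlmostEqual(loss_single_gpu, loss_multi_gpu, places=6)
```

In a real test, n_gpu would come from the trainer's arguments rather than being passed by hand.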

Decorators

@parameterized.expand must be outermost (top), above @require_*.


Distributed tests

Directory layout

distributed/
  test_trainer_distributed.py           # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py       # DDP tests
  test_trainer_distributed_fsdp.py      # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                   # YAML configs for `accelerate launch`
  scripts/                              # Scripts launched as subprocesses
    train.py                            # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py               # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json

Architecture

Each framework has three pieces:

  1. {Framework}CommandsMixin: get_torchrun_cmd() and get_accelerate_cmd().
  2. TestTrainerDistributed{Framework} — framework-specific tests (env parity, etc.). NOT @slow.
  3. TestTrainerDistributed{Framework}Common — inherits TrainerDistributedCommon for shared scenarios. @slow.

MRO: class Foo(Mixin, TrainerDistributedCommon, TestCasePlus) — Mixin before ABC.

TrainerDistributedCommon provides: check_training, check_mixed_precision, check_gradient_accumulation, check_resume, check_eval. Subclasses call these with config_file=....
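
One plausible reason for the base-class order can be shown with stand-in classes. This sketch assumes the ABC declares the command builders as abstract methods; every class name, method signature, and the ddp.yaml file name here is hypothetical, and unittest.TestCase stands in for TestCasePlus.

```python
import unittest
from abc import ABC, abstractmethod


class TrainerDistributedCommonSketch(ABC):
    """Shared scenarios, written once against abstract command builders."""

    @abstractmethod
    def get_torchrun_cmd(self, script, num_processes): ...

    @abstractmethod
    def get_accelerate_cmd(self, script, config_file): ...

    def check_training(self, config_file):
        # A real check would launch this command; the sketch only builds it.
        return self.get_accelerate_cmd("train.py", config_file)


class DDPCommandsMixinSketch:
    """Framework-specific command builders."""

    def get_torchrun_cmd(self, script, num_processes):
        return ["torchrun", f"--nproc_per_node={num_processes}", script]

    def get_accelerate_cmd(self, script, config_file):
        return ["accelerate", "launch", "--config_file", config_file, script]


class GoodOrder(DDPCommandsMixinSketch, TrainerDistributedCommonSketch, unittest.TestCase):
    def test_training(self):
        cmd = self.check_training(config_file="ddp.yaml")
        self.assertEqual(cmd[:2], ["accelerate", "launch"])


class BadOrder(TrainerDistributedCommonSketch, DDPCommandsMixinSketch, unittest.TestCase):
    pass


GoodOrder("test_training")  # works: the mixin's concrete methods come first in the MRO
try:
    BadOrder()
except TypeError as err:
    # The abstract methods shadow the mixin, so the class stays abstract.
    print("cannot instantiate:", err)
```

With the mixin listed first, attribute lookup reaches the concrete builders before the abstract declarations, so the test class can be instantiated and collected.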

Env parity tests

Both the torchrun and accelerate sides must be launched with the same framework:

  • DDP: no extra args (both DistributedType.MULTI_GPU)
  • FSDP: --fsdp full_shard --fsdp_config '{"fsdp_version": 1}' (JSON string, no file)
  • DeepSpeed: --deepspeed path/to/ds_config_zero2.json

torchrun_env_check.py uses HfArgumentParser(TrainingArguments) — accepts any TrainingArguments flag.
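
The per-framework flags above can be kept in one place so that both launchers receive identical arguments. This is a sketch under assumptions: framework_args and parity_commands are hypothetical helpers, the process count is arbitrary, and the DeepSpeed path is the placeholder from the list above.

```python
import json


def framework_args(framework, ds_config="path/to/ds_config_zero2.json"):
    """Extra TrainingArguments flags that select the framework."""
    if framework == "ddp":
        return []  # no extra args needed
    if framework == "fsdp":
        # Inline JSON string, no config file on disk.
        return ["--fsdp", "full_shard", "--fsdp_config", json.dumps({"fsdp_version": 1})]
    if framework == "deepspeed":
        return ["--deepspeed", ds_config]
    raise ValueError(f"unknown framework: {framework!r}")


def parity_commands(framework, num_processes=2, script="torchrun_env_check.py"):
    """Build the torchrun and accelerate commands with the same trailing flags."""
    extra = framework_args(framework)
    torchrun = ["torchrun", f"--nproc_per_node={num_processes}", script] + extra
    accelerate = ["accelerate", "launch", f"--num_processes={num_processes}", script] + extra
    return torchrun, accelerate


for name in ("ddp", "fsdp", "deepspeed"):
    torchrun_cmd, accelerate_cmd = parity_commands(name)
    print(" ".join(torchrun_cmd))
    print(" ".join(accelerate_cmd))
```

Passing the commands as argument lists to subprocess.run avoids shell-quoting problems with the JSON string.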

Adding a distributed test

  1. Shared scenario → add check_* to TrainerDistributedCommon, wire from each Common class.
  2. Framework-specific → add to TestTrainerDistributed{Framework}.
  3. New scripts → distributed/scripts/, reference via SCRIPTS_DIR.

Pitfalls

  • str(args.parallel_mode) returns "ParallelMode.DISTRIBUTED", not "DISTRIBUTED".
  • FSDP cpu_offload is not JSON-serializable — use str().
  • train.py defaults to do_train=True. Pass --do_eval explicitly for eval; eval is also enabled automatically when --eval_output_file is passed.
  • DeepSpeed eval only works with ZeRO-3.
  • --fsdp_config accepts a file path OR JSON string starting with {. Same for --deepspeed, --accelerator_config.
  • args.local_rank may be -1 before framework consumes it — use assertIn(val, (rank, -1)).
  • @parameterized.expand + ABC: can't use @abstractmethod on methods that subclasses decorate with expand.
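
Several of these pitfalls can be reproduced with standard-library stand-ins. The sketch below is not the library's code: ParallelModeSketch, CPUOffloadSketch, load_config, and dump_env are hypothetical names that mimic the behaviour described above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class ParallelModeSketch(Enum):
    """Stand-in for a plain Enum such as ParallelMode."""
    DISTRIBUTED = "distributed"


@dataclass
class CPUOffloadSketch:
    """Stand-in for a config object that json.dumps cannot serialize."""
    offload_params: bool = True


def load_config(value):
    """Treat a string starting with '{' as inline JSON, anything else as a file path."""
    if value.startswith("{"):
        return json.loads(value)
    with open(value, encoding="utf-8") as handle:
        return json.load(handle)


def dump_env(parallel_mode, cpu_offload, local_rank):
    """Serialize env info; str() makes the enum and offload object JSON-safe."""
    return json.dumps({
        "parallel_mode": str(parallel_mode),
        "cpu_offload": str(cpu_offload),
        "local_rank": local_rank,
    })


dumped = json.loads(dump_env(ParallelModeSketch.DISTRIBUTED, CPUOffloadSketch(), -1))
# str() of a plain Enum member keeps the class prefix.
assert dumped["parallel_mode"] == "ParallelModeSketch.DISTRIBUTED"
# Rank 0 may still report -1 before the framework consumes local_rank.
assert dumped["local_rank"] in (0, -1)
assert load_config('{"fsdp_version": 1}') == {"fsdp_version": 1}
```

Comparing against the full string, class prefix included, keeps the assertion aligned with what the per-rank JSON dump actually contains.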