gavin/transformers

陈赣 06f1fd69a6

first commit

2026-06-05 16:53:03 +08:00

Trainer Testing Guide

Test files

| File | What it covers |
| --- | --- |
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers); not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see below) |

Running tests

Always use RUN_SLOW=1 — most trainer tests are @slow and will be skipped without it.

Debugging workflow

Never run the full suite until the specific failing test passes. Work from smallest scope outward:

1. Single GPU, for the fastest feedback:

   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs

   Fix and re-run that same test until it passes.

2. Two GPUs, to catch DataParallel issues:

   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs

3. Full test class, to check for regressions:

   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs

4. All tests in that file, only at the very end:

   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line

Same for distributed tests — single failing test first, fix, confirm, then widen scope.

Tip: a -k filter matches test names across every collected file, so it can select tests you did not intend. Use full node IDs instead: pytest file::Class::test.

Writing tests

get_regression_trainer() is the fastest way to get a working Trainer. Pass any TrainingArguments kwarg directly. It uses RegressionModel and RegressionDataset, which train in milliseconds.

For LLM tests, use tiny Hub models: AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM").

Use max_steps=10 instead of num_train_epochs=3 when you just need training to run.

Multi-GPU safety

The Trainer uses nn.DataParallel when n_gpu > 1:

train_batch_size = per_device_train_batch_size * n_gpu — don't hardcode batch sizes in assertions.
Compute steps dynamically: math.ceil(num_samples / (batch_size * grad_accum)).
Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
DataParallel gather introduces ~1e-8 FP differences — use places=6 for loss assertions.
If a test model has **kwargs but ignores num_items_in_batch, set model.accepts_loss_kwargs = False.
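
The batch-size and step-count rules above can be sketched as a small helper. This is a minimal illustration, not the Trainer's own code: `expected_optimizer_steps` is a hypothetical name, and it assumes a sized, map-style dataset whose last partial batch is kept.

```python
import math


def expected_optimizer_steps(num_samples, per_device_batch_size, n_gpu, grad_accum, num_epochs=1):
    """Hypothetical helper: expected optimizer steps to use in a test assertion."""
    # Under DataParallel the effective batch grows with the number of visible GPUs.
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    # Formula from the guide: ceil(num_samples / (batch_size * grad_accum)).
    steps_per_epoch = math.ceil(num_samples / (train_batch_size * grad_accum))
    return steps_per_epoch * num_epochs


# Same test data, different hardware: the expected step count changes.
print(expected_optimizer_steps(128, 8, n_gpu=1, grad_accum=2))  # 8
print(expected_optimizer_steps(128, 8, n_gpu=2, grad_accum=2))  # 4
```

In a real test, the GPU count would come from the runtime rather than a literal, so the same assertion holds on one GPU and on several.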

Decorators

@parameterized.expand must be outermost (top), above @require_*.

Distributed tests

Directory layout

distributed/
  test_trainer_distributed.py           # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py       # DDP tests
  test_trainer_distributed_fsdp.py      # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                   # YAML configs for `accelerate launch`
  scripts/                              # Scripts launched as subprocesses
    train.py                            # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py               # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json

Architecture

Each framework has three pieces:

{Framework}CommandsMixin — get_torchrun_cmd() and get_accelerate_cmd().
TestTrainerDistributed{Framework} — framework-specific tests (env parity, etc.). NOT @slow.
TestTrainerDistributed{Framework}Common — inherits TrainerDistributedCommon for shared scenarios. @slow.

MRO: class Foo(Mixin, TrainerDistributedCommon, TestCasePlus) — Mixin before ABC.

TrainerDistributedCommon provides: check_training, check_mixed_precision, check_gradient_accumulation, check_resume, check_eval. Subclasses call these with config_file=....
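
Why the mixin must precede the ABC can be shown with plain `abc` and `unittest`. The classes below are stand-ins, not the real test classes: `CommonSketch`, `DDPCommandsMixinSketch`, and the assumption that the command builder is declared abstract on the common base are all illustrative, and `unittest.TestCase` stands in for TestCasePlus.

```python
import unittest
from abc import ABC, abstractmethod


class CommonSketch(ABC):
    """Stand-in for TrainerDistributedCommon (names and bodies are assumptions)."""

    @abstractmethod
    def get_torchrun_cmd(self, script, num_processes):
        ...

    def check_training(self):
        # Shared scenario: only builds the command here; a real check would run it.
        return self.get_torchrun_cmd("train.py", num_processes=2)


class DDPCommandsMixinSketch:
    """Stand-in for a {Framework}CommandsMixin."""

    def get_torchrun_cmd(self, script, num_processes):
        return ["torchrun", f"--nproc_per_node={num_processes}", script]


# Mixin first: its concrete method shadows the abstract one in the MRO.
class GoodOrder(DDPCommandsMixinSketch, CommonSketch, unittest.TestCase):
    def test_training(self):
        self.assertEqual(self.check_training()[0], "torchrun")


# ABC first: the abstract method wins the lookup, so the class stays abstract.
class BadOrder(CommonSketch, DDPCommandsMixinSketch, unittest.TestCase):
    pass


def can_instantiate(cls):
    try:
        cls()
        return True
    except TypeError:
        return False


print(can_instantiate(GoodOrder))  # True
print(can_instantiate(BadOrder))   # False
```

With the wrong order the class definition itself succeeds, and the failure only appears when the test runner tries to instantiate the class.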

Env parity tests

Both the torchrun side and the accelerate side must run with the framework under test enabled:

DDP: no extra args (both DistributedType.MULTI_GPU)
FSDP: --fsdp full_shard --fsdp_config '{"fsdp_version": 1}' (JSON string, no file)
DeepSpeed: --deepspeed path/to/ds_config_zero2.json

torchrun_env_check.py uses HfArgumentParser(TrainingArguments) — accepts any TrainingArguments flag.
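
One way to keep the per-framework flags above in a single place is a lookup table that feeds an argument list. This is a sketch under assumptions: `build_torchrun_cmd` and `FRAMEWORK_ARGS` are hypothetical names, not helpers from the test suite, and the DeepSpeed path is left elided exactly as in the list above.

```python
import json


def build_torchrun_cmd(script, num_processes, extra_args=()):
    """Hypothetical helper: returns an argument list, so no shell quoting is needed."""
    return ["torchrun", f"--nproc_per_node={num_processes}", script, *extra_args]


# Extra TrainingArguments flags per framework, taken from the list above.
FRAMEWORK_ARGS = {
    "ddp": [],
    # The FSDP config is passed inline as a JSON string rather than as a file.
    "fsdp": ["--fsdp", "full_shard", "--fsdp_config", json.dumps({"fsdp_version": 1})],
    "deepspeed": ["--deepspeed", "path/to/ds_config_zero2.json"],
}

cmd = build_torchrun_cmd("torchrun_env_check.py", 2, FRAMEWORK_ARGS["fsdp"])
print(cmd)
```

Passing the list to `subprocess.run` directly avoids quoting problems with the embedded JSON; the same extra arguments would be appended on the accelerate side.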

Adding a distributed test

Shared scenario → add check_* to TrainerDistributedCommon, wire from each Common class.
Framework-specific → add to TestTrainerDistributed{Framework}.
New scripts → distributed/scripts/, reference via SCRIPTS_DIR.

Pitfalls

str(args.parallel_mode) → "ParallelMode.DISTRIBUTED", not "DISTRIBUTED".
FSDP cpu_offload is not JSON-serializable — use str().
train.py defaults to do_train=True. Pass --do_eval explicitly for eval; eval is also auto-enabled when --eval_output_file is passed.
DeepSpeed eval only works with ZeRO-3.
--fsdp_config accepts a file path OR JSON string starting with {. Same for --deepspeed, --accelerator_config.
args.local_rank may be -1 before the framework consumes it, so use assertIn(val, (rank, -1)).
@parameterized.expand + ABC: can't use @abstractmethod on methods that subclasses decorate with expand.
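
Two of the pitfalls above are easy to reproduce with the standard library alone. The sketch below uses a stand-in enum (`ParallelModeSketch`, with assumed member values) instead of the real ParallelMode, and `load_config_arg` is a hypothetical helper that mirrors the "file path OR JSON string starting with {" rule; neither is the library's own code.

```python
import json
from enum import Enum


class ParallelModeSketch(Enum):
    """Stand-in enum; member values here are assumptions."""

    NOT_PARALLEL = "not_parallel"
    DISTRIBUTED = "distributed"


mode = ParallelModeSketch.DISTRIBUTED
print(str(mode))  # ParallelModeSketch.DISTRIBUTED (the class name is included)
print(mode.name)  # DISTRIBUTED


def load_config_arg(value):
    """Hypothetical helper: inline JSON if the value starts with '{', else a file path."""
    if value.startswith("{"):
        return json.loads(value)
    with open(value, encoding="utf-8") as f:
        return json.load(f)


print(load_config_arg('{"fsdp_version": 1}'))  # {'fsdp_version': 1}
```

When a script dumps the parallel mode to JSON, comparing against `mode.name` or the full `str(mode)` form consistently on both sides avoids the mismatch.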
