5.4 KiB
Trainer Testing Guide
Test files
| File | What it covers |
|---|---|
test_trainer.py |
Core: mixed precision, grad accumulation, logging, metrics, early stopping |
test_trainer_checkpointing.py |
Checkpoint save/resume, interrupted training, frozen params |
test_trainer_data.py |
Collators, dynamic shapes, iterable datasets, label smoothing |
test_trainer_optimizers.py |
Optimizers & LR schedulers |
test_trainer_seq2seq.py |
Encoder-decoder fine-tuning |
trainer_test_utils.py |
Shared utilities (models, datasets, helpers) — not a test file |
distributed/ |
DDP, FSDP, DeepSpeed (see below) |
Running tests
Always use RUN_SLOW=1 — most trainer tests are @slow and will be skipped without it.
Debugging workflow
Never run the full suite until the specific failing test passes. Work from smallest scope outward:
- Single GPU — fastest feedback:
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs - Fix and re-run that same test until it passes.
- 2 GPUs — catch DataParallel issues:
CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs - Full test class — check for regressions:
RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs - All tests in that file — only at the very end:
RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
Same for distributed tests — single failing test first, fix, confirm, then widen scope.
Tip: -k filter applies globally across files. Use full node IDs instead: pytest file::Class::test.
Writing tests
get_regression_trainer() is the fastest way to get a working Trainer. Pass any TrainingArguments kwarg directly. Uses RegressionModel + RegressionDataset (trains in milliseconds).
For LLM tests, use tiny Hub models: AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM").
Use max_steps=10 instead of num_train_epochs=3 when you just need training to run.
Multi-GPU safety
The Trainer uses nn.DataParallel when n_gpu > 1:
train_batch_size = per_device_train_batch_size * n_gpu— don't hardcode batch sizes in assertions.- Compute steps dynamically:
math.ceil(num_samples / (batch_size * grad_accum)). - Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
- DataParallel gather introduces ~1e-8 FP differences — use
places=6for loss assertions. - If a test model has
**kwargsbut ignoresnum_items_in_batch, setmodel.accepts_loss_kwargs = False.
Decorators
@parameterized.expand must be outermost (top), above @require_*.
Distributed tests
Directory layout
distributed/
test_trainer_distributed.py # Base: path constants, TrainerDistributedCommon ABC
test_trainer_distributed_ddp.py # DDP tests
test_trainer_distributed_fsdp.py # FSDP tests (config parsing + distributed)
test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
accelerate_configs/ # YAML configs for `accelerate launch`
scripts/ # Scripts launched as subprocesses
train.py # Main training script (synthetic data, tiny Qwen2)
torchrun_env_check.py # Dumps distributed env info to JSON per rank
ds_config_zero2.json, ds_config_zero3.json
Architecture
Each framework has three pieces:
{Framework}CommandsMixin—get_torchrun_cmd()andget_accelerate_cmd().TestTrainerDistributed{Framework}— framework-specific tests (env parity, etc.). NOT@slow.TestTrainerDistributed{Framework}Common— inheritsTrainerDistributedCommonfor shared scenarios.@slow.
MRO: class Foo(Mixin, TrainerDistributedCommon, TestCasePlus) — Mixin before ABC.
TrainerDistributedCommon provides: check_training, check_mixed_precision, check_gradient_accumulation, check_resume, check_eval. Subclasses call these with config_file=....
Env parity tests
Both torchrun and accelerate sides must use the framework:
- DDP: no extra args (both
DistributedType.MULTI_GPU) - FSDP:
--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'(JSON string, no file) - DeepSpeed:
--deepspeed path/to/ds_config_zero2.json
torchrun_env_check.py uses HfArgumentParser(TrainingArguments) — accepts any TrainingArguments flag.
Adding a distributed test
- Shared scenario → add
check_*toTrainerDistributedCommon, wire from each Common class. - Framework-specific → add to
TestTrainerDistributed{Framework}. - New scripts →
distributed/scripts/, reference viaSCRIPTS_DIR.
Pitfalls
str(args.parallel_mode)→"ParallelMode.DISTRIBUTED", not"DISTRIBUTED".- FSDP
cpu_offloadis not JSON-serializable — usestr(). train.pydefaults todo_train=True. Pass--do_evalexplicitly for eval. Auto-enables when--eval_output_fileis passed.- DeepSpeed eval only works with ZeRO-3.
--fsdp_configaccepts a file path OR JSON string starting with{. Same for--deepspeed,--accelerator_config.args.local_rankmay be -1 before framework consumes it — useassertIn(val, (rank, -1)).@parameterized.expand+ ABC: can't use@abstractmethodon methods that subclasses decorate with expand.