# Trainer Testing Guide

## Test files

| File | What it covers |
|---|---|
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers); not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |

## Running tests

Always set `RUN_SLOW=1`: most trainer tests are marked `@slow` and are skipped without it.

### Debugging workflow

**Never run the full suite until the specific failing test passes.** Work from the smallest scope outward:

1. **Single GPU** (fastest feedback):
   ```bash
   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
2. **Fix and re-run** that same test until it passes.
3. **2 GPUs** (catches DataParallel issues):
   ```bash
   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
4. **Full test class** (checks for regressions):
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
   ```
5. **All tests in that file, only at the very end**:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
   ```

The same applies to distributed tests: start with the single failing test, fix it, confirm, then widen the scope.

**Tip**: a `-k` filter applies globally across files, so it can pick up tests you did not intend to run. Use full node IDs instead: `pytest file::Class::test`.

## Writing tests

**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. It uses `RegressionModel` + `RegressionDataset` (trains in milliseconds).

For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.

Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.

### Multi-GPU safety

The Trainer uses `nn.DataParallel` when `n_gpu > 1`:

- `train_batch_size = per_device_train_batch_size * n_gpu`, so don't hardcode batch sizes in assertions.
- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
- Use 100+ samples: small datasets can leave zero steps to resume on multi-GPU.
- The DataParallel gather introduces floating-point differences on the order of 1e-8, so use `places=6` for loss assertions.
- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.

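
The batch-size and step-count rules above can be sketched in plain Python. This is a minimal illustration, not code from the test suite: `expected_train_steps` is a hypothetical helper, and the sample counts and loss values are made up for the example.

```python
import math
import unittest


def expected_train_steps(num_samples, per_device_batch_size, n_gpu, grad_accum=1):
    """Hypothetical helper: optimizer steps per epoch for the visible GPU count."""
    # nn.DataParallel splits one batch across GPUs, so the effective batch
    # grows with n_gpu; max(1, ...) covers CPU-only runs where n_gpu is 0.
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    return math.ceil(num_samples / (train_batch_size * grad_accum))


# 128 samples, per-device batch 8, gradient accumulation 2.
steps_1gpu = expected_train_steps(128, 8, n_gpu=1, grad_accum=2)
steps_2gpu = expected_train_steps(128, 8, n_gpu=2, grad_accum=2)

# A 20-sample dataset collapses to a single step on 2 GPUs,
# which leaves nothing to resume after the first checkpoint.
tiny_steps = expected_train_steps(20, 8, n_gpu=2, grad_accum=2)

# Loss comparison that tolerates tiny gather differences (illustrative values).
case = unittest.TestCase()
case.assertAlmostEqual(0.123456781, 0.123456789, places=6)
```

Deriving the expected step count from `n_gpu` keeps the same assertion valid on one GPU and on several, which is the point of not hardcoding it.
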
### Decorators

`@parameterized.expand` must be **outermost** (top), above `@require_*`.

---

## Distributed tests

### Directory layout

```
distributed/
  test_trainer_distributed.py            # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py        # DDP tests
  test_trainer_distributed_fsdp.py       # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py  # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                    # YAML configs for `accelerate launch`
  scripts/                               # Scripts launched as subprocesses
    train.py                             # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py                # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json
```

### Architecture

Each framework has three pieces:

1. **`{Framework}CommandsMixin`**: provides `get_torchrun_cmd()` and `get_accelerate_cmd()`.
2. **`TestTrainerDistributed{Framework}`**: framework-specific tests (env parity, etc.). Not `@slow`.
3. **`TestTrainerDistributed{Framework}Common`**: inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.

MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)`, with the mixin listed before the ABC.

`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.

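
One plausible reason for the base-class order is sketched below with stand-in classes. It assumes the ABC declares the command builders as abstract methods, which this guide does not confirm. `DDPCommandsMixin`, the stand-in `TrainerDistributedCommon`, and the use of `unittest.TestCase` in place of `TestCasePlus` are all illustrative.

```python
import unittest
from abc import ABC, abstractmethod


class DDPCommandsMixin:
    """Stand-in for a `{Framework}CommandsMixin` (hypothetical body)."""

    def get_torchrun_cmd(self):
        return ["torchrun", "--nproc_per_node=2"]


class TrainerDistributedCommon(ABC):
    """Stand-in for the shared base; the abstract method is an assumption."""

    @abstractmethod
    def get_torchrun_cmd(self):
        ...

    def check_training(self, config_file=None):
        # Shared scenario: relies on whatever command the mixin builds.
        return self.get_torchrun_cmd()


# Mixin first: its concrete method wins in the MRO and satisfies the ABC.
class GoodOrder(DDPCommandsMixin, TrainerDistributedCommon, unittest.TestCase):
    pass


# ABC first: the abstract method shadows the mixin, so the class stays abstract.
class BadOrder(TrainerDistributedCommon, DDPCommandsMixin, unittest.TestCase):
    pass


good_cmd = GoodOrder().check_training(config_file="ddp.yaml")
good_abstract = set(GoodOrder.__abstractmethods__)
bad_abstract = set(BadOrder.__abstractmethods__)
```

Under this assumption, instantiating `BadOrder` raises `TypeError` because `get_torchrun_cmd` is still abstract, while `GoodOrder` resolves it from the mixin.
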
### Env parity tests

Both the torchrun side and the accelerate side must enable the framework under test:

- **DDP**: no extra args (both are `DistributedType.MULTI_GPU`)
- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`

`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)`, so it accepts any `TrainingArguments` flag.

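
As a rough sketch, the per-framework flags listed above could be kept in one table and appended to the torchrun command. `build_torchrun_cmd` is a hypothetical helper written for this example; the real tests go through `get_torchrun_cmd()`, whose signature is not shown in this guide.

```python
import json

# Extra TrainingArguments flags for each framework, copied from the list above.
FRAMEWORK_ARGS = {
    "ddp": [],
    "fsdp": ["--fsdp", "full_shard", "--fsdp_config", json.dumps({"fsdp_version": 1})],
    "deepspeed": ["--deepspeed", "path/to/ds_config_zero2.json"],
}


def build_torchrun_cmd(framework, script="torchrun_env_check.py", nproc=2):
    """Hypothetical helper: build an argv list suitable for subprocess.run."""
    return ["torchrun", f"--nproc_per_node={nproc}", script, *FRAMEWORK_ARGS[framework]]


ddp_cmd = build_torchrun_cmd("ddp")
fsdp_cmd = build_torchrun_cmd("fsdp")
deepspeed_cmd = build_torchrun_cmd("deepspeed", nproc=4)
```

Passing the command as a list rather than a shell string avoids quoting problems with the inline JSON value for `--fsdp_config`.
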
### Adding a distributed test

1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.

### Pitfalls

- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
- FSDP `cpu_offload` is not JSON-serializable; wrap it in `str()`.
- `train.py` defaults to `do_train=True`. Pass `--do_eval` explicitly for eval; it is also enabled automatically when `--eval_output_file` is passed.
- DeepSpeed eval only works with ZeRO-3.
- `--fsdp_config` accepts a file path OR a JSON string starting with `{`. The same holds for `--deepspeed` and `--accelerator_config`.
- `args.local_rank` may be -1 before the framework consumes it; use `assertIn(val, (rank, -1))`.
- `@parameterized.expand` + ABC: you can't use `@abstractmethod` on methods that subclasses decorate with expand.

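
Three of the pitfalls above can be reproduced with the standard library alone. The classes below are stand-ins, not the real `ParallelMode` or FSDP offload types, and `load_config` is a hypothetical loader that only mirrors the "file path or JSON string" rule.

```python
import json
from dataclasses import dataclass
from enum import Enum


class ParallelMode(Enum):
    """Stand-in enum; only the member used here is reproduced."""

    DISTRIBUTED = "distributed"


@dataclass
class CPUOffload:
    """Stand-in for an offload setting that json cannot encode."""

    offload_params: bool = False


def load_config(value):
    """Hypothetical loader: inline JSON if it starts with '{', else a file path."""
    if value.startswith("{"):
        return json.loads(value)
    with open(value, encoding="utf-8") as f:
        return json.load(f)


mode = ParallelMode.DISTRIBUTED
mode_as_str = str(mode)  # includes the class name
mode_name = mode.name    # just the member name

# json.dumps(CPUOffload()) would raise TypeError, so stringify it first.
env_dump = json.dumps({"parallel_mode": mode_name, "cpu_offload": str(CPUOffload())})

inline_config = load_config('{"fsdp_version": 1}')
```

When comparing modes across ranks, `mode.name` gives the bare member name, while `str(mode)` keeps the `ParallelMode.` prefix.
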