# Trainer Testing Guide

## Test files

| File | What it covers |
|---|---|
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers) — not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |

## Running tests

Always use `RUN_SLOW=1` — most trainer tests are `@slow` and will be skipped without it.

### Debugging workflow

**Never run the full suite until the specific failing test passes.** Work from smallest scope outward:

1. **Single GPU** — fastest feedback:
   ```bash
   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
2. **Fix and re-run** that same test until it passes.
3. **2 GPUs** — catch DataParallel issues:
   ```bash
   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
4. **Full test class** — check for regressions:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
   ```
5. **All tests in that file — only at the very end**:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
   ```

Same for distributed tests — single failing test first, fix, confirm, then widen scope.

**Tip**: the `-k` filter matches against every collected file, so it can pick up same-named tests from other files. Use a full node ID instead: `pytest file::Class::test`.

## Writing tests

**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. It uses `RegressionModel` and `RegressionDataset`, which train in milliseconds.

For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.

Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.

### Multi-GPU safety

The Trainer uses `nn.DataParallel` when `n_gpu > 1`:

- `train_batch_size = per_device_train_batch_size * n_gpu` — don't hardcode batch sizes in assertions.
- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
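- Use 100+ samples. With a small dataset, the larger multi-GPU batch can leave no steps to run after resuming from a checkpoint.
- DataParallel gather introduces floating-point differences on the order of 1e-8, so use `places=6` in `assertAlmostEqual` for loss assertions.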
- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.
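
The batch-size and step-count rules above can be folded into a small helper so that assertions adapt to however many GPUs are visible. This is a minimal sketch: `expected_optimizer_steps` is a hypothetical helper (it is not part of `trainer_test_utils.py`), and it only applies the formula from the list above.

```python
import math


def expected_optimizer_steps(num_samples, per_device_batch_size, n_gpu, grad_accum, epochs=1):
    """Optimizer steps for `epochs` passes, derived from the visible GPU count."""
    # Under DataParallel the effective batch size scales with n_gpu
    # (n_gpu == 0 on CPU is treated like a single device).
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    steps_per_epoch = math.ceil(num_samples / (train_batch_size * grad_accum))
    return steps_per_epoch * epochs


# 128 samples, per-device batch size 8, gradient accumulation 2
for n_gpu in (1, 2, 4):
    print(n_gpu, expected_optimizer_steps(128, 8, n_gpu, grad_accum=2))
# prints "1 8", then "2 4", then "4 2"
```

In a test, `n_gpu` would typically be read from `trainer.args.n_gpu` rather than hardcoded, and the result compared against the trainer's reported step count.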

### Decorators

`@parameterized.expand` must be **outermost** (top), above `@require_*`.
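
One plausible reason for this ordering, sketched below with standard-library stand-ins: an expand-style decorator generates new methods from whatever function it receives, so a gate applied above it never reaches the generated methods. `toy_expand` is a deliberately simplified, hypothetical imitation (not the real `parameterized` implementation), and `requires_gpu` stands in for a `@require_*` decorator.

```python
import functools
import inspect
import unittest


def toy_expand(params):
    """Hypothetical, simplified imitation of an expand-style decorator."""
    def decorator(func):
        # Inject one generated method per parameter into the enclosing class body.
        class_namespace = inspect.currentframe().f_back.f_locals
        for index, param in enumerate(params):
            @functools.wraps(func)  # also copies attributes such as skip markers
            def generated(self, _param=param):
                return func(self, _param)
            generated.__name__ = f"{func.__name__}_{index}"
            class_namespace[generated.__name__] = generated

        def placeholder(self):
            raise RuntimeError("call the generated methods instead")
        return placeholder
    return decorator


# Stand-in for a @require_* gate: marks the decorated function as skipped.
requires_gpu = unittest.skip("stand-in for a hardware requirement")


class Demo:  # plain class: the sketch only inspects markers, nothing is run
    @toy_expand(["a", "b"])
    @requires_gpu
    def test_good(self, value):
        pass

    @requires_gpu
    @toy_expand(["a", "b"])
    def test_bad(self, value):
        pass


def is_gated(method):
    """True if unittest would skip this method."""
    return getattr(method, "__unittest_skip__", False)


print(is_gated(Demo.test_good_0))  # True: the gate was copied onto the generated method
print(is_gated(Demo.test_bad_0))   # False: the gate only wrapped the placeholder
```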

---

## Distributed tests

### Directory layout

```
distributed/
  test_trainer_distributed.py           # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py       # DDP tests
  test_trainer_distributed_fsdp.py      # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                   # YAML configs for `accelerate launch`
  scripts/                              # Scripts launched as subprocesses
    train.py                            # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py               # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json
```

### Architecture

Each framework has three pieces:

1. **`{Framework}CommandsMixin`** — `get_torchrun_cmd()` and `get_accelerate_cmd()`.
2. **`TestTrainerDistributed{Framework}`** — framework-specific tests (env parity, etc.). NOT `@slow`.
3. **`TestTrainerDistributed{Framework}Common`** — inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.

Base-class order matters for the MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)`. List the mixin before the ABC so that the mixin's methods are found first.

`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.
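
The effect of the base-class order can be seen in a small sketch with stand-in classes. Everything suffixed `Sketch` is hypothetical, `unittest.TestCase` stands in for `TestCasePlus`, and the assumption that the ABC declares `get_torchrun_cmd` as abstract is made only for illustration.

```python
import abc
import unittest


class TrainerDistributedCommonSketch(abc.ABC):
    """Stand-in for the shared base: scenarios rely on hooks a mixin supplies."""

    @abc.abstractmethod
    def get_torchrun_cmd(self, script, num_processes):
        ...

    def check_training(self, config_file=None):
        # The real helper would launch this command; the sketch just returns it.
        return self.get_torchrun_cmd("train.py", num_processes=2)


class DDPCommandsMixinSketch:
    """Stand-in for a {Framework}CommandsMixin."""

    def get_torchrun_cmd(self, script, num_processes):
        return ["torchrun", f"--nproc_per_node={num_processes}", script]


# Mixin first: its concrete method shadows the abstract one in the MRO.
class GoodOrder(DDPCommandsMixinSketch, TrainerDistributedCommonSketch, unittest.TestCase):
    def test_training(self):
        self.assertIn("torchrun", self.check_training())


# ABC first: the abstract method wins the lookup, so the class stays abstract.
class BadOrder(TrainerDistributedCommonSketch, DDPCommandsMixinSketch, unittest.TestCase):
    pass


print(GoodOrder().check_training())
try:
    BadOrder()
except TypeError as error:
    print("BadOrder cannot be instantiated:", error)
```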

### Env parity tests

Both the torchrun launch and the accelerate launch must enable the framework under test:

- **DDP**: no extra args (both `DistributedType.MULTI_GPU`)
- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`

`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)` — accepts any TrainingArguments flag.

### Adding a distributed test

1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.

### Pitfalls

- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
- FSDP `cpu_offload` is not JSON-serializable — use `str()`.
- `train.py` defaults to `do_train=True`. Pass `--do_eval` explicitly for eval; it is also enabled automatically when `--eval_output_file` is passed.
- DeepSpeed eval only works with ZeRO-3.
- `--fsdp_config` accepts a file path OR JSON string starting with `{`. Same for `--deepspeed`, `--accelerator_config`.
- `args.local_rank` may still be -1 before the framework consumes it, so use `assertIn(val, (rank, -1))`.
- `@parameterized.expand` + ABC: can't use `@abstractmethod` on methods that subclasses decorate with expand.
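
Several of these pitfalls are plain Python behavior and can be reproduced without any distributed setup. The sketch below uses stand-ins only: `ParallelMode` here is a local enum that merely shares the member name, `CPUOffloadSketch` is a hypothetical non-serializable setting, and `dump_env` and `load_config` are illustrative helpers that mirror the rules above, not functions from the test suite.

```python
import dataclasses
import enum
import json


class ParallelMode(enum.Enum):
    """Local stand-in that shares the member name with the real enum."""
    DISTRIBUTED = "distributed"


@dataclasses.dataclass
class CPUOffloadSketch:
    """Hypothetical stand-in for a setting that json cannot serialize."""
    offload_params: bool = False


def dump_env(parallel_mode, cpu_offload, local_rank):
    """Build a JSON-safe record of the values a rank wants to report."""
    return json.dumps({
        "parallel_mode": parallel_mode.name,  # "DISTRIBUTED"; str() would add the class prefix
        "cpu_offload": str(cpu_offload),      # json.dumps on the object itself raises TypeError
        "local_rank": local_rank,
    })


def load_config(value):
    """Accept an inline JSON string starting with '{' or a path to a JSON file."""
    if value.lstrip().startswith("{"):
        return json.loads(value)
    with open(value, encoding="utf-8") as handle:
        return json.load(handle)


print(str(ParallelMode.DISTRIBUTED))  # ParallelMode.DISTRIBUTED
print(dump_env(ParallelMode.DISTRIBUTED, CPUOffloadSketch(), local_rank=-1))
print(load_config('{"fsdp_version": 1}'))  # {'fsdp_version': 1}
```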