first commit
tests/trainer/TESTING_GUIDE.md
# Trainer Testing Guide

## Test files

| File | What it covers |
|---|---|
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers); not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |

## Running tests

Always set `RUN_SLOW=1`: most trainer tests are marked `@slow` and are skipped without it.

### Debugging workflow

**Never run the full suite until the specific failing test passes.** Work from the smallest scope outward:

1. **Single GPU**, for the fastest feedback:
   ```bash
   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
2. **Fix and re-run** that same test until it passes.
3. **2 GPUs**, to catch DataParallel issues:
   ```bash
   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
4. **Full test class**, to check for regressions:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
   ```
5. **All tests in that file**, only at the very end:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
   ```

The same applies to distributed tests: run the single failing test first, fix it, confirm, then widen the scope.

**Tip**: a `-k` filter applies globally across files. Use full node IDs instead: `pytest file::Class::test`.

## Writing tests

**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. It uses `RegressionModel` + `RegressionDataset`, which train in milliseconds.

For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.

Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.

### Multi-GPU safety

The Trainer uses `nn.DataParallel` when `n_gpu > 1`:

- `train_batch_size = per_device_train_batch_size * n_gpu`, so don't hardcode batch sizes in assertions.
- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
- Use 100+ samples: small datasets can leave zero resume steps on multi-GPU.
- DataParallel gather introduces ~1e-8 floating-point differences, so use `places=6` for loss assertions.
- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.

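
The batch-size and step-count rules above can be sketched as follows. This is a minimal illustration, not the Trainer's own implementation: `expected_steps` and `StepCountExample` are hypothetical names, the step formula is the one quoted in the list, and the `max(1, n_gpu)` guard is an assumption added so the helper also works on CPU-only machines.

```python
import math
import unittest


def expected_steps(num_samples: int, per_device_batch_size: int, n_gpu: int, grad_accum: int) -> int:
    """Hypothetical helper: optimizer steps per epoch, using the formula from the list above."""
    # DataParallel splits each batch across GPUs, so the effective batch grows with n_gpu.
    # max(1, n_gpu) is an assumption so that n_gpu == 0 (CPU) still yields a valid batch size.
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    return math.ceil(num_samples / (train_batch_size * grad_accum))


class StepCountExample(unittest.TestCase):
    def test_steps_depend_on_gpu_count(self):
        # The same dataset needs half as many steps when a second GPU is visible.
        self.assertEqual(expected_steps(128, 8, n_gpu=1, grad_accum=2), 8)
        self.assertEqual(expected_steps(128, 8, n_gpu=2, grad_accum=2), 4)

    def test_loss_tolerance(self):
        # Made-up loss values that differ by 1e-8, the size of drift the guide mentions.
        single_gpu_loss = 0.1234567
        multi_gpu_loss = single_gpu_loss + 1e-8
        self.assertAlmostEqual(single_gpu_loss, multi_gpu_loss, places=6)
```

In a real test, `n_gpu` would come from the training arguments rather than being hardcoded, which is what keeps the assertion valid on both single-GPU and multi-GPU runners.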

### Decorators

`@parameterized.expand` must be **outermost** (top), above `@require_*`.

---

## Distributed tests

### Directory layout

```
distributed/
test_trainer_distributed.py # Base: path constants, TrainerDistributedCommon ABC
test_trainer_distributed_ddp.py # DDP tests
test_trainer_distributed_fsdp.py # FSDP tests (config parsing + distributed)
test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
accelerate_configs/ # YAML configs for `accelerate launch`
scripts/ # Scripts launched as subprocesses
train.py # Main training script (synthetic data, tiny Qwen2)
torchrun_env_check.py # Dumps distributed env info to JSON per rank
ds_config_zero2.json, ds_config_zero3.json
```

### Architecture

Each framework has three pieces:

1. **`{Framework}CommandsMixin`**: provides `get_torchrun_cmd()` and `get_accelerate_cmd()`.
2. **`TestTrainerDistributed{Framework}`**: framework-specific tests (env parity, etc.). NOT `@slow`.
3. **`TestTrainerDistributed{Framework}Common`**: inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.

MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)`, with the Mixin before the ABC.

`TrainerDistributedCommon` provides `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, and `check_eval`. Subclasses call these with `config_file=...`.

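
One way to see why the Mixin has to come before the ABC is the sketch below. It uses stand-ins only: `CommonBase`, `DDPCommandsMixin`, `GoodOrder`, and `BadOrder` are hypothetical, `unittest.TestCase` replaces `TestCasePlus`, and the sketch assumes the shared base declares the command builder as abstract, which the guide does not state.

```python
import unittest
from abc import ABC, abstractmethod


class CommonBase(ABC):
    """Hypothetical stand-in for TrainerDistributedCommon."""

    @abstractmethod
    def get_torchrun_cmd(self, script):
        """Return the launch command; supplied by a framework mixin."""

    def check_training(self, script):
        # Shared scenario: relies on whichever command builder the MRO resolves.
        return self.get_torchrun_cmd(script)


class DDPCommandsMixin:
    """Hypothetical stand-in for a {Framework}CommandsMixin."""

    def get_torchrun_cmd(self, script):
        return ["torchrun", "--nproc_per_node", "2", script]


# Mixin first: its concrete method wins attribute lookup, so the class is instantiable.
class GoodOrder(DDPCommandsMixin, CommonBase, unittest.TestCase):
    pass


# ABC first: the abstract method wins attribute lookup, so the class stays abstract.
class BadOrder(CommonBase, DDPCommandsMixin, unittest.TestCase):
    pass
```

Under these assumptions, `GoodOrder().check_training("train.py")` returns the mixin's command list, while `BadOrder()` raises `TypeError` because `get_torchrun_cmd` still resolves to the abstract definition.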

### Env parity tests

Both the torchrun side and the accelerate side must enable the framework under test:

- **DDP**: no extra args (both report `DistributedType.MULTI_GPU`)
- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`

`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)`, so it accepts any `TrainingArguments` flag.

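
The per-framework arguments listed above could be assembled along these lines. This is a sketch, not code from the test suite: `framework_args` and `DS_CONFIG` are hypothetical names, and the DeepSpeed path is the placeholder from the list rather than a real location.

```python
import json
import shlex

# Placeholder path copied from the list above; substitute the real config location.
DS_CONFIG = "path/to/ds_config_zero2.json"


def framework_args(framework: str) -> list[str]:
    """Hypothetical helper: extra TrainingArguments flags that enable each framework."""
    if framework == "ddp":
        # DDP needs nothing beyond launching with more than one process.
        return []
    if framework == "fsdp":
        # The config is passed inline as a JSON string instead of a file path.
        return ["--fsdp", "full_shard", "--fsdp_config", json.dumps({"fsdp_version": 1})]
    if framework == "deepspeed":
        return ["--deepspeed", DS_CONFIG]
    raise ValueError(f"unknown framework: {framework}")


def as_shell(argv: list[str]) -> str:
    """Render an argument list as one shell-quoted string, for logging or debugging."""
    return shlex.join(argv)
```

Passing the arguments as a list to `subprocess.run` avoids quoting problems with the inline JSON; `as_shell` is only useful for printing the command in a form that can be pasted into a terminal.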

### Adding a distributed test

1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.

### Pitfalls

- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
- FSDP `cpu_offload` is not JSON-serializable; use `str()`.
- `train.py` defaults to `do_train=True`. Pass `--do_eval` explicitly for eval; it is also enabled automatically when `--eval_output_file` is passed.
- DeepSpeed eval only works with ZeRO-3.
- `--fsdp_config` accepts a file path OR a JSON string starting with `{`. The same goes for `--deepspeed` and `--accelerator_config`.
- `args.local_rank` may be -1 before the framework consumes it; use `assertIn(val, (rank, -1))`.
- `@parameterized.expand` + ABC: you can't use `@abstractmethod` on methods that subclasses decorate with expand.

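
Several of these pitfalls come down to plain Python behavior, which the sketch below reproduces with standard-library stand-ins. Nothing here is the real library code: the `ParallelMode` enum, the `CPUOffload` class, and the `load_config_arg` and `dump_env` helpers are all hypothetical and exist only to illustrate the points above.

```python
import json
from enum import Enum


class ParallelMode(Enum):
    """Stand-in enum for illustration; the real one is defined in transformers."""
    NOT_PARALLEL = "not_parallel"
    DISTRIBUTED = "distributed"


class CPUOffload:
    """Hypothetical config object that json.dumps cannot serialize directly."""

    def __init__(self, offload_params: bool):
        self.offload_params = offload_params

    def __repr__(self) -> str:
        return f"CPUOffload(offload_params={self.offload_params})"


def load_config_arg(value: str) -> dict:
    """Hypothetical helper: treat the value as inline JSON if it starts with '{', else as a path."""
    if value.lstrip().startswith("{"):
        return json.loads(value)
    with open(value, encoding="utf-8") as f:
        return json.load(f)


def dump_env(parallel_mode: ParallelMode, cpu_offload: CPUOffload, local_rank: int) -> str:
    """Serialize env info; str() makes the enum and the offload object JSON-safe."""
    return json.dumps(
        {
            "parallel_mode": str(parallel_mode),  # "ParallelMode.DISTRIBUTED", not "DISTRIBUTED"
            "cpu_offload": str(cpu_offload),
            "local_rank": local_rank,
        }
    )
```

When comparing against the dumped value, match the full `"ParallelMode.DISTRIBUTED"` string, or compare `parallel_mode.name` if only the member name is wanted.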