first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/tests/trainer/TESTING_GUIDE.md
+++ b/tests/trainer/TESTING_GUIDE.md
@@ -0,0 +1,122 @@
+# Trainer Testing Guide
+
+## Test files
+
+| File | What it covers |
+|---|---|
+| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
+| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
+| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
+| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
+| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
+| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers) — not a test file |
+| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |
+
+## Running tests
+
+Always use `RUN_SLOW=1` — most trainer tests are `@slow` and will be skipped without it.
+
+### Debugging workflow
+
+**Never run the full suite until the specific failing test passes.** Work from smallest scope outward:
+
+1. **Single GPU** — fastest feedback:
+   ```bash
+   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
+   ```
+2. **Fix and re-run** that same test until it passes.
+3. **2 GPUs** — catch DataParallel issues:
+   ```bash
+   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
+   ```
+4. **Full test class** — check for regressions:
+   ```bash
+   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
+   ```
+5. **All tests in that file — only at the very end**:
+   ```bash
+   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
+   ```
+
+Same for distributed tests — single failing test first, fix, confirm, then widen scope.
+
+**Tip**: `-k` filter applies globally across files. Use full node IDs instead: `pytest file::Class::test`.
+
+## Writing tests
+
+**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. Uses `RegressionModel` + `RegressionDataset` (trains in milliseconds).
+
+For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.
+
+Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.
+
+### Multi-GPU safety
+
+The Trainer uses `nn.DataParallel` when `n_gpu > 1`:
+
+- `train_batch_size = per_device_train_batch_size * n_gpu` — don't hardcode batch sizes in assertions.
+- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
+- Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
+- DataParallel gather introduces ~1e-8 FP differences — use `places=6` for loss assertions.
+- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.
+
+### Decorators
+
+`@parameterized.expand` must be **outermost** (top), above `@require_*`.
+
+---
+
+## Distributed tests
+
+### Directory layout
+
+```
+distributed/
+  test_trainer_distributed.py           # Base: path constants, TrainerDistributedCommon ABC
+  test_trainer_distributed_ddp.py       # DDP tests
+  test_trainer_distributed_fsdp.py      # FSDP tests (config parsing + distributed)
+  test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
+  accelerate_configs/                   # YAML configs for `accelerate launch`
+  scripts/                              # Scripts launched as subprocesses
+    train.py                            # Main training script (synthetic data, tiny Qwen2)
+    torchrun_env_check.py               # Dumps distributed env info to JSON per rank
+    ds_config_zero2.json, ds_config_zero3.json
+```
+
+### Architecture
+
+Each framework has three pieces:
+
+1. **`{Framework}CommandsMixin`** — `get_torchrun_cmd()` and `get_accelerate_cmd()`.
+2. **`TestTrainerDistributed{Framework}`** — framework-specific tests (env parity, etc.). NOT `@slow`.
+3. **`TestTrainerDistributed{Framework}Common`** — inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.
+
+MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)` — Mixin before ABC.
+
+`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.
+
+### Env parity tests
+
+Both torchrun and accelerate sides must use the framework:
+
+- **DDP**: no extra args (both `DistributedType.MULTI_GPU`)
+- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
+- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`
+
+`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)` — accepts any TrainingArguments flag.
+
+### Adding a distributed test
+
+1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
+2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
+3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.
+
+### Pitfalls
+
+- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
+- FSDP `cpu_offload` is not JSON-serializable — use `str()`.
+- `train.py` defaults to `do_train=True`. Pass `--do_eval` explicitly for eval. Auto-enables when `--eval_output_file` is passed.
+- DeepSpeed eval only works with ZeRO-3.
+- `--fsdp_config` accepts a file path OR JSON string starting with `{`. Same for `--deepspeed`, `--accelerator_config`.
+- `args.local_rank` may be -1 before framework consumes it — use `assertIn(val, (rank, -1))`.
+- `@parameterized.expand` + ABC: can't use `@abstractmethod` on methods that subclasses decorate with expand.