first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
122
tests/trainer/TESTING_GUIDE.md
Normal file
@@ -0,0 +1,122 @@
# Trainer Testing Guide

## Test files

| File | What it covers |
|---|---|
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers); not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |

## Running tests

Always use `RUN_SLOW=1`: most trainer tests are `@slow` and will be skipped without it.

### Debugging workflow

**Never run the full suite until the specific failing test passes.** Work from smallest scope outward:

1. **Single GPU** (fastest feedback):
   ```bash
   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
2. **Fix and re-run** that same test until it passes.
3. **2 GPUs** (catch DataParallel issues):
   ```bash
   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
   ```
4. **Full test class** (check for regressions):
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
   ```
5. **All tests in that file, only at the very end**:
   ```bash
   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
   ```

Same for distributed tests: single failing test first, fix, confirm, then widen scope.

**Tip**: `-k` filter applies globally across files. Use full node IDs instead: `pytest file::Class::test`.

## Writing tests

**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. Uses `RegressionModel` + `RegressionDataset` (trains in milliseconds).

For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.

Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.

### Multi-GPU safety

The Trainer uses `nn.DataParallel` when `n_gpu > 1`:

- `train_batch_size = per_device_train_batch_size * n_gpu`: don't hardcode batch sizes in assertions.
- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
- Use 100+ samples: small datasets can leave zero resume steps on multi-GPU.
- DataParallel gather introduces ~1e-8 FP differences: use `places=6` for loss assertions.
- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.
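
The first two bullets can be combined into a small helper. This is a minimal sketch, not code from the test suite: the name `expected_optimizer_steps` and its parameters are hypothetical, and it only applies the formula quoted above, so it may differ from what the Trainer computes in edge cases such as `dataloader_drop_last`.

```python
import math


def expected_optimizer_steps(num_samples, per_device_batch_size, n_gpu, grad_accum=1):
    """Apply the guide's formula: ceil(num_samples / (batch_size * grad_accum)).

    Hypothetical helper for illustration; not part of trainer_test_utils.py.
    """
    # Under nn.DataParallel the effective batch is the per-device size times n_gpu.
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    return math.ceil(num_samples / (train_batch_size * grad_accum))


# The same dataset gives a different step count on 1 GPU and on 2 GPUs,
# which is why assertions should not hardcode the number.
steps_one_gpu = expected_optimizer_steps(128, per_device_batch_size=8, n_gpu=1, grad_accum=2)
steps_two_gpus = expected_optimizer_steps(128, per_device_batch_size=8, n_gpu=2, grad_accum=2)
```

In a real test the GPU count would more likely be read from `trainer.args.n_gpu` than passed as a literal.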

### Decorators

`@parameterized.expand` must be **outermost** (top), above `@require_*`.

---

## Distributed tests

### Directory layout

```
distributed/
  test_trainer_distributed.py            # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py        # DDP tests
  test_trainer_distributed_fsdp.py       # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py  # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                    # YAML configs for `accelerate launch`
  scripts/                               # Scripts launched as subprocesses
    train.py                             # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py                # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json
```

### Architecture

Each framework has three pieces:

1. **`{Framework}CommandsMixin`**: `get_torchrun_cmd()` and `get_accelerate_cmd()`.
2. **`TestTrainerDistributed{Framework}`**: framework-specific tests (env parity, etc.). NOT `@slow`.
3. **`TestTrainerDistributed{Framework}Common`**: inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.

MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)`, with the Mixin before the ABC.
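
One reason base-class order can matter is sketched below with stand-in classes. It assumes, purely for illustration, that the ABC declares the command builder as an abstract method; whether `TrainerDistributedCommon` actually does so is not stated in this guide. `CommonScenarios`, `DDPCommandsMixin`, `MixinFirst`, `AbcFirst`, and `can_instantiate` are all hypothetical names.

```python
from abc import ABC, abstractmethod


class CommonScenarios(ABC):
    """Stand-in for a shared-scenario ABC (hypothetical shape)."""

    @abstractmethod
    def get_torchrun_cmd(self):
        ...

    def check_training(self):
        # Shared scenario built on top of the framework-specific command.
        return self.get_torchrun_cmd() + ["--do_train"]


class DDPCommandsMixin:
    """Stand-in for a {Framework}CommandsMixin."""

    def get_torchrun_cmd(self):
        return ["torchrun", "--nproc_per_node", "2"]


class MixinFirst(DDPCommandsMixin, CommonScenarios):
    pass


class AbcFirst(CommonScenarios, DDPCommandsMixin):
    pass


def can_instantiate(cls):
    """Return True if cls() succeeds, False if it is still abstract."""
    try:
        cls()
    except TypeError:
        return False
    return True


mixin_first_ok = can_instantiate(MixinFirst)
abc_first_ok = can_instantiate(AbcFirst)
```

With the mixin first, attribute lookup reaches the concrete `get_torchrun_cmd` before the abstract declaration, so the class is instantiable. With the ABC first, the abstract version wins and instantiation raises `TypeError`.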

`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.

### Env parity tests

Both the torchrun side and the accelerate side must enable the framework under test:

- **DDP**: no extra args (both `DistributedType.MULTI_GPU`)
- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`

`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)`, so it accepts any TrainingArguments flag.
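
The FSDP bullet passes `--fsdp_config` as an inline JSON string rather than a file. The sketch below shows one way such an argument could be told apart from a path. `load_config_arg` is a hypothetical helper written for this guide, not the parsing code used by `TrainingArguments`.

```python
import json
import os
import tempfile


def load_config_arg(value):
    """Return a dict from either an inline JSON string or a path to a JSON file.

    Hypothetical helper: a value starting with "{" is treated as JSON text,
    anything else is treated as a file path.
    """
    stripped = value.strip()
    if stripped.startswith("{"):
        return json.loads(stripped)
    with open(value, encoding="utf-8") as f:
        return json.load(f)


# Inline form, as used on the torchrun command line above.
inline = load_config_arg('{"fsdp_version": 1}')

# File form, written to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fsdp_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"fsdp_version": 1}, f)
    from_file = load_config_arg(path)
```

Both calls yield the same dictionary, which is what lets a parity test skip the config file on the torchrun side.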

### Adding a distributed test

1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.

### Pitfalls

- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
- FSDP `cpu_offload` is not JSON-serializable: use `str()`.
- `train.py` defaults to `do_train=True`. Pass `--do_eval` explicitly for eval. Eval is auto-enabled when `--eval_output_file` is passed.
- DeepSpeed eval only works with ZeRO-3.
- `--fsdp_config` accepts a file path OR a JSON string starting with `{`. The same holds for `--deepspeed` and `--accelerator_config`.
- `args.local_rank` may be -1 before the framework consumes it: use `assertIn(val, (rank, -1))`.
- `@parameterized.expand` + ABC: can't use `@abstractmethod` on methods that subclasses decorate with expand.
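
The first two pitfalls come down to how Python enums stringify and serialize. The sketch below uses a stand-in `ParallelMode` enum (its members and values are illustrative assumptions, not copied from the library) to show the difference between `str(member)` and `member.name`, and why a raw enum fails in `json.dumps`.

```python
import json
from enum import Enum


class ParallelMode(Enum):
    # Stand-in enum for illustration; members and values are assumptions.
    NOT_PARALLEL = "not_parallel"
    DISTRIBUTED = "distributed"


mode = ParallelMode.DISTRIBUTED

as_str = str(mode)    # includes the class name
by_name = mode.name   # just the member name

# A plain Enum member is not JSON-serializable as-is.
try:
    json.dumps({"parallel_mode": mode})
    raw_dump_failed = False
except TypeError:
    raw_dump_failed = True

# Converting with str() first makes the payload serializable.
payload = json.dumps({"parallel_mode": str(mode)})
```

A test that compares against env-check JSON would therefore expect the `"ParallelMode.DISTRIBUTED"` form whenever the script serialized the value with `str()`.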
0
tests/trainer/__init__.py
Normal file
0
tests/trainer/distributed/__init__.py
Normal file
3
tests/trainer/distributed/accelerate_configs/ddp.yaml
Normal file
@@ -0,0 +1,3 @@
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
@@ -0,0 +1,4 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
num_processes: 2
@@ -0,0 +1,9 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
num_processes: 2
parallelism_config:
  parallelism_config_sp_size: 2
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: sdpa
@@ -0,0 +1,4 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero3.json
num_processes: 2
4
tests/trainer/distributed/accelerate_configs/fsdp.yaml
Normal file
@@ -0,0 +1,4 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 1
num_processes: 2
4
tests/trainer/distributed/accelerate_configs/fsdp2.yaml
Normal file
@@ -0,0 +1,4 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
num_processes: 2
10
tests/trainer/distributed/accelerate_configs/fsdp2_cp.yaml
Normal file
@@ -0,0 +1,10 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
num_processes: 2
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2
  parallelism_config_cp_comm_strategy: alltoall
88
tests/trainer/distributed/scripts/dispatch_batches.py
Normal file
@@ -0,0 +1,88 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Worker script for dispatch_batches=False with a finite iterable dataset.

Verifies that training completes successfully when ``dispatch_batches``
is disabled.

Run via torchrun or accelerate launch.
"""

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset

from transformers import HfArgumentParser, Trainer, TrainingArguments


class RegressionModel(nn.Module):
    def __init__(self, a=0, b=0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(a).float())
        self.b = nn.Parameter(torch.tensor(b).float())
        self.config = None

    def forward(self, input_x, labels=None, **kwargs):
        y = input_x * self.a + self.b
        if labels is None:
            return (y,)
        loss = nn.functional.mse_loss(y, labels)
        return (loss, y)


class RegressionDataset:
    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
        np.random.seed(seed)
        self.label_names = ["labels"] if label_names is None else label_names
        self.length = length
        self.x = np.random.normal(size=(length,)).astype(np.float32)
        self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
        self.ys = [y.astype(np.float32) for y in self.ys]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
        result["input_x"] = self.x[i]
        return result


class FiniteIterableDataset(IterableDataset):
    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
        self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)
        self.current_sample = 0

    def __iter__(self):
        while self.current_sample < len(self.dataset):
            yield self.dataset[self.current_sample]
            self.current_sample += 1


if __name__ == "__main__":
    parser = HfArgumentParser((TrainingArguments,))
    training_args = parser.parse_args_into_dataclasses()[0]

    training_args.per_device_train_batch_size = 1
    training_args.max_steps = 1
    training_args.accelerator_config.dispatch_batches = False

    train_dataset = FiniteIterableDataset(label_names=["labels", "extra"], length=1)
    model = RegressionModel()

    trainer = Trainer(model, training_args, train_dataset=train_dataset)
    trainer.train()
32
tests/trainer/distributed/scripts/ds_config_zero2.json
Normal file
@@ -0,0 +1,32 @@
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
35
tests/trainer/distributed/scripts/ds_config_zero3.json
Normal file
@@ -0,0 +1,35 @@
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto"
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
113
tests/trainer/distributed/scripts/eval_ddp.py
Normal file
@@ -0,0 +1,113 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Worker script for eval/predict ordering tests.

Verifies that distributed eval/predict returns all samples in the correct order.

Run via torchrun or accelerate launch.
"""

import torch
import torch.nn as nn
from torch.utils.data import Dataset

from transformers import EvalPrediction, HfArgumentParser, Trainer, TrainingArguments
from transformers.utils import logging


logger = logging.get_logger(__name__)


class DummyDataset(Dataset):
    def __init__(self, length: int = 101):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i) -> int:
        return i


class DummyDataCollator:
    def __call__(self, features):
        return {"input_ids": torch.tensor(features), "labels": torch.tensor(features)}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Add some (unused) params otherwise DDP will complain.
        self.fc = nn.Linear(120, 80)

    def forward(self, input_ids, labels=None):
        if labels is not None:
            return torch.tensor(0.0, device=input_ids.device), input_ids
        else:
            return input_ids


if __name__ == "__main__":
    parser = HfArgumentParser((TrainingArguments,))
    training_args = parser.parse_args_into_dataclasses()[0]

    for dataset_length in [49, 7]:
        dataset = DummyDataset(dataset_length)

        def compute_metrics(p: EvalPrediction) -> dict:
            sequential = list(range(len(dataset)))
            success = p.predictions.tolist() == sequential and p.label_ids.tolist() == sequential
            if not success and training_args.local_process_index == 0:
                logger.warning(
                    "Predictions and/or labels do not match expected results:\n - predictions: "
                    f"{p.predictions.tolist()}\n - labels: {p.label_ids.tolist()}\n - expected: {sequential}"
                )
            return {"success": success}

        trainer = Trainer(
            model=DummyModel(),
            args=training_args,
            data_collator=DummyDataCollator(),
            eval_dataset=dataset,
            compute_metrics=compute_metrics,
        )
        metrics = trainer.evaluate()
        logger.info(metrics)
        if metrics["eval_success"] is not True:
            logger.error(metrics)
            exit(1)

        p = trainer.predict(dataset)
        logger.info(p.metrics)
        if p.metrics["test_success"] is not True:
            logger.error(p.metrics)
            exit(1)

        trainer.args.eval_accumulation_steps = 2

        metrics = trainer.evaluate()
        logger.info(metrics)
        if metrics["eval_success"] is not True:
            logger.error(metrics)
            exit(1)

        p = trainer.predict(dataset)
        logger.info(p.metrics)
        if p.metrics["test_success"] is not True:
            logger.error(p.metrics)
            exit(1)

        trainer.args.eval_accumulation_steps = None
125
tests/trainer/distributed/scripts/fsdp_generate.py
Normal file
@@ -0,0 +1,125 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Worker script for FSDP generation tests.

Launched via ``torchrun`` from ``test_trainer_distributed_fsdp.py``.
"""

import argparse
import functools
from collections.abc import Callable
from typing import Any

import torch
import torch.distributed
from torch.distributed._composable.fsdp import fully_shard, register_fsdp_forward_method
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.gpt2.modeling_gpt2 import GPT2Block
from transformers.testing_utils import backend_device_count, backend_torch_accelerator_module, torch_device


data = 4 * [
    "Hello world!",
    "The quick brown fox jumps over the lazy dog.",
]


def manage_process_group(func: Callable[..., Any]) -> Callable[..., Any]:
    """Manage the creation and destruction of the distributed process group for the wrapped function."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        device_count = backend_device_count(torch_device)
        torch.distributed.init_process_group(world_size=device_count)
        try:
            return func(*args, **kwargs)
        finally:
            torch.distributed.destroy_process_group()

    return wrapped


@manage_process_group
def fsdp_generate():
    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))

    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)

    fsdp_model = FullyShardedDataParallel(
        model,
        auto_wrap_policy=functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}),
        limit_all_gathers=True,
        use_orig_params=True,
    )

    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)

    with FullyShardedDataParallel.summon_full_params(fsdp_model):
        _ = fsdp_model.module.generate(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            max_length=30,
        )


@manage_process_group
def fsdp2_generate():
    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))

    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)

    mesh = init_device_mesh(device.type, (torch.distributed.get_world_size(),))
    for submodule in model.modules():
        if isinstance(submodule, GPT2Block):
            fully_shard(submodule, mesh=mesh)
    fully_shard(model, mesh=mesh)

    register_fsdp_forward_method(model, "generate")

    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)

    _ = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_length=30,
    )


if __name__ == "__main__":

    class CLIArgs(argparse.Namespace):
        fsdp: bool
        fsdp2: bool

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--fsdp", action="store_true")
    group.add_argument("--fsdp2", action="store_true")
    args = parser.parse_args(namespace=CLIArgs())

    if args.fsdp:
        fsdp_generate()
    elif args.fsdp2:
        fsdp2_generate()
    else:
        raise ValueError("Missing test selection")
114
tests/trainer/distributed/scripts/loss_averaging.py
Normal file
@@ -0,0 +1,114 @@
|
||||
# Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
"""
Worker script for loss averaging tests.

Verifies that ``average_tokens_across_devices`` produces correct loss
compared to a single-GPU baseline.

When ``--run_both_averaging_modes`` is passed, the script runs training
twice (with and without averaging) in a single process launch, saving
``<output_dir>/broken_losses.json`` and ``<output_dir>/fixed_losses.json``.

Run via torchrun or accelerate launch.
"""

import argparse
import json

import datasets
import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    Trainer,
    TrainerCallback,
    TrainingArguments,
    set_seed,
)


class StoreLossCallback(TrainerCallback):
    """Simple callback to store the loss."""

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # ``logs`` defaults to None, so guard before the membership test.
        if logs and "loss" in logs:
            self.losses.append(logs["loss"])


def run_distributed_training(training_args, loss_file):
    set_seed(42)
    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
    dataset_name = "wikitext"
    dataset_config = "wikitext-2-raw-v1"
    dataset = datasets.load_dataset(dataset_name, dataset_config, split="train[:50]")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize_function(examples):
        return tokenizer(examples["text"], max_length=128, padding="max_length", truncation=True)

    tokenized_dataset = dataset.map(tokenize_function, batched=True)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

    loss_callback = StoreLossCallback()

    training_args.logging_steps = 1
    training_args.max_steps = 10
    training_args.learning_rate = 3e-4
    training_args.disable_tqdm = True
    training_args.dataloader_drop_last = True

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_dataset,
        callbacks=[loss_callback],
        data_collator=data_collator,
    )
    trainer.train()
    with open(loss_file, "w") as f:
        json.dump(loss_callback.losses, f)


if __name__ == "__main__":
    # Parse our custom flag first, pass the rest to HfArgumentParser.
    pre_parser = argparse.ArgumentParser(add_help=False)
    pre_parser.add_argument("--run_both_averaging_modes", action="store_true")
    custom_args, remaining = pre_parser.parse_known_args()

    hf_parser = HfArgumentParser((TrainingArguments,))
    (training_args,) = hf_parser.parse_args_into_dataclasses(remaining)

    if custom_args.run_both_averaging_modes:
        base_dir = training_args.output_dir
        # Run without averaging ("broken")
        training_args.average_tokens_across_devices = False
        training_args.output_dir = base_dir + "/broken"
        run_distributed_training(training_args, loss_file=base_dir + "/broken_losses.json")
        # Run with averaging ("fixed")
        training_args.average_tokens_across_devices = True
        training_args.output_dir = base_dir + "/fixed"
        run_distributed_training(training_args, loss_file=base_dir + "/fixed_losses.json")
    else:
        run_distributed_training(training_args, loss_file=training_args.output_dir + "_losses.json")
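The gap between the "broken" and "fixed" runs that this worker script records can be illustrated with plain arithmetic. The sketch below is a simplified assumption about the mechanism, not the Trainer's implementation: each device is taken to report a summed token loss and a count of loss-bearing tokens. The helper names `mean_of_device_means` and `token_weighted_mean`, and all of the numbers, are hypothetical.

```python
def mean_of_device_means(shards):
    """Average each device's mean loss, ignoring how many tokens it saw."""
    return sum(loss_sum / n_tokens for loss_sum, n_tokens in shards) / len(shards)


def token_weighted_mean(shards):
    """Divide the global summed loss by the global token count."""
    total_loss = sum(loss_sum for loss_sum, _ in shards)
    total_tokens = sum(n_tokens for _, n_tokens in shards)
    return total_loss / total_tokens


# Hypothetical (summed token loss, loss-bearing token count) per device.
shards = [(12.0, 4), (90.0, 60)]

unweighted = mean_of_device_means(shards)  # (3.0 + 1.5) / 2
weighted = token_weighted_mean(shards)  # 102.0 / 64
print(unweighted, weighted)
```

The two values coincide only when every device holds the same number of loss-bearing tokens; with uneven counts, as in this toy input, the unweighted mean over-represents the device with fewer tokens.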
93
tests/trainer/distributed/scripts/torchrun_env_check.py
Normal file
@@ -0,0 +1,93 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Dumps distributed environment info to a JSON file for verification.

This script creates a Trainer (which initializes the accelerator) and writes
each worker's env vars, TrainingArguments fields, and accelerator state to
``<output_dir>/env_rank<N>.json``.

Accepts all TrainingArguments flags (e.g. ``--deepspeed``, ``--fsdp``) so the
Trainer sets up the correct framework regardless of launcher.

Works with any launcher (torchrun, accelerate launch with DDP/FSDP/DeepSpeed).
"""

import json
import os

from transformers import AutoModelForCausalLM, HfArgumentParser, Trainer, TrainingArguments


def main():
    parser = HfArgumentParser((TrainingArguments,))
    (args,) = parser.parse_args_into_dataclasses()
    args.disable_tqdm = True

    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
    model = AutoModelForCausalLM.from_pretrained(model_name)

    trainer = Trainer(model=model, args=args)
    accelerator = trainer.accelerator

    env_info = {
        # Raw env vars set by torchrun / accelerate
        "env_world_size": os.environ.get("WORLD_SIZE"),
        "env_rank": os.environ.get("RANK"),
        "env_local_rank": os.environ.get("LOCAL_RANK"),
        "env_master_addr": os.environ.get("MASTER_ADDR"),
        "env_master_port": os.environ.get("MASTER_PORT"),
        # TrainingArguments-derived values
        "args_local_rank": args.local_rank,
        "args_world_size": args.world_size,
        "args_process_index": args.process_index,
        "args_local_process_index": args.local_process_index,
        "args_parallel_mode": str(args.parallel_mode),
        "args_n_gpu": args.n_gpu,
        # Accelerator state
        "accelerator_num_processes": accelerator.num_processes,
        "accelerator_process_index": accelerator.process_index,
        "accelerator_local_process_index": accelerator.local_process_index,
        "accelerator_is_main_process": accelerator.is_main_process,
        "accelerator_is_local_main_process": accelerator.is_local_main_process,
        "accelerator_use_distributed": accelerator.use_distributed,
        "accelerator_distributed_type": str(accelerator.distributed_type),
        "accelerator_device": str(accelerator.device),
        # Trainer-level flags (these gate framework-specific code paths)
        "trainer_is_fsdp_enabled": trainer.is_fsdp_enabled,
        "trainer_is_deepspeed_enabled": trainer.is_deepspeed_enabled,
    }

    # FSDP plugin info
    fsdp_plugin = getattr(accelerator.state, "fsdp_plugin", None)
    if fsdp_plugin is not None:
        env_info["fsdp_version"] = getattr(fsdp_plugin, "fsdp_version", None)
        env_info["fsdp_sharding_strategy"] = str(getattr(fsdp_plugin, "sharding_strategy", None))
        env_info["fsdp_cpu_offload"] = str(getattr(fsdp_plugin, "cpu_offload", None))
        env_info["fsdp_auto_wrap_policy"] = str(getattr(fsdp_plugin, "auto_wrap_policy", None))

    # DeepSpeed plugin info
    deepspeed_plugin = getattr(accelerator.state, "deepspeed_plugin", None)
    if deepspeed_plugin is not None:
        env_info["deepspeed_zero_stage"] = deepspeed_plugin.zero_stage
        env_info["deepspeed_offload_optimizer_device"] = str(deepspeed_plugin.offload_optimizer_device)
        env_info["deepspeed_offload_param_device"] = str(deepspeed_plugin.offload_param_device)

    output_file = os.path.join(args.output_dir, f"env_rank{args.process_index}.json")
    with open(output_file, "w") as f:
        json.dump(env_info, f)


if __name__ == "__main__":
    main()
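The per-rank JSON dumps written by this script lend themselves to a key-by-key comparison between two launchers. The following is a minimal sketch of such a comparison using only the standard library. The helper names `load_rank_dump` and `diff_rank_dumps`, the `ignore` default, and the demo values are assumptions for illustration and are not part of the commit.

```python
import json
import os
import tempfile


def load_rank_dump(directory, rank):
    """Read the ``env_rank<N>.json`` file for one rank."""
    with open(os.path.join(directory, f"env_rank{rank}.json")) as f:
        return json.load(f)


def diff_rank_dumps(dir_a, dir_b, num_processes, ignore=("env_master_port",)):
    """Return {rank: {key: (value_a, value_b)}} for keys whose values differ."""
    mismatches = {}
    for rank in range(num_processes):
        a = load_rank_dump(dir_a, rank)
        b = load_rank_dump(dir_b, rank)
        diff = {
            key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if key not in ignore and a.get(key) != b.get(key)
        }
        if diff:
            mismatches[rank] = diff
    return mismatches


# Self-contained demo with made-up dumps for two ranks and two launch directories.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for rank in range(2):
        for directory, port in ((dir_a, "29500"), (dir_b, "29501")):
            info = {"env_rank": str(rank), "args_world_size": 2, "env_master_port": port}
            with open(os.path.join(directory, f"env_rank{rank}.json"), "w") as f:
                json.dump(info, f)
    result = diff_rank_dumps(dir_a, dir_b, num_processes=2)

print(result)
```

The master port is ignored by default here because each launch is expected to pick its own port; any other differing key is reported per rank.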
136
tests/trainer/distributed/scripts/train.py
Normal file
@@ -0,0 +1,136 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Simple causal LM script for distributed tests (FSDP, DeepSpeed).

Uses a tiny Qwen2 model with synthetic data so tests run fast
and don't require downloading real datasets.

Supports --do_train (default) and --do_eval via TrainingArguments.

32 training samples are created; with per_device_train_batch_size=4
and 2 GPUs this gives 4 steps per epoch.
"""

import json
import sys

import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
)


DTYPE_MAP = {"fp32": torch.float32, "bf16": torch.bfloat16, "fp16": torch.float16}


def _pop_custom_arg(name):
    """Pop a custom --name value arg from sys.argv before HfArgumentParser sees it."""
    if name in sys.argv:
        idx = sys.argv.index(name)
        value = sys.argv[idx + 1]
        sys.argv.pop(idx)
        sys.argv.pop(idx)
        return value
    return None


def main():
    # Parse custom args (not TrainingArguments fields)
    model_name = _pop_custom_arg("--model_name") or "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
    loss_output_file = _pop_custom_arg("--loss_output_file")
    eval_output_file = _pop_custom_arg("--eval_output_file")
    model_dtype = _pop_custom_arg("--model_dtype")
    attn_impl = _pop_custom_arg("--attn_implementation")
    pad_to_multiple_of = _pop_custom_arg("--pad_to_multiple_of")

    parser = HfArgumentParser((TrainingArguments,))
    (training_args,) = parser.parse_args_into_dataclasses()

    # Default to training if neither --do_train nor --do_eval is set
    if not training_args.do_train and not training_args.do_eval:
        training_args.do_train = True

    # Auto-enable eval when an eval output file is requested
    if eval_output_file:
        training_args.do_eval = True

    torch_dtype = DTYPE_MAP[model_dtype] if model_dtype else None

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model_kwargs = {}
    if torch_dtype:
        model_kwargs["torch_dtype"] = torch_dtype
    if attn_impl:
        model_kwargs["attn_implementation"] = attn_impl
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    model.generation_config.pad_token_id = tokenizer.pad_token_id

    # Synthetic dataset: 32 samples of tokenized text.
    # With per_device_train_batch_size=4 and 2 GPUs this gives 4 steps per epoch.
    texts = [
        "The quick brown fox jumps over the lazy dog. " * 5,
        "A journey of a thousand miles begins with a single step. " * 5,
        "To be or not to be, that is the question. " * 5,
        "All that glitters is not gold, all that wanders is not lost. " * 5,
    ] * 8

    train_dataset = None
    eval_dataset = None
    if training_args.do_train:
        train_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts]
    if training_args.do_eval:
        eval_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts[:8]]

    collator_kwargs = {}
    if pad_to_multiple_of:
        collator_kwargs["pad_to_multiple_of"] = int(pad_to_multiple_of)

    training_args.disable_tqdm = True

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, **collator_kwargs),
    )

    if training_args.do_train:
        trainer.train()

    if training_args.do_eval:
        eval_metrics = trainer.evaluate()
        if eval_output_file and training_args.process_index == 0:
            with open(eval_output_file, "w") as f:
                json.dump(eval_metrics, f)

    # Save per-step losses for equivalence testing
    if training_args.do_train and loss_output_file and training_args.process_index == 0:
        losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
        with open(loss_output_file, "w") as f:
            json.dump(losses, f)


if __name__ == "__main__":
    main()
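The training script strips its own flags out of `sys.argv` by hand before `HfArgumentParser` runs. An alternative that the loss averaging worker in this commit already uses is a small `argparse` pre-parser with `parse_known_args`. The sketch below shows that pattern on its own; the function name `split_custom_args` and the choice of flags are assumptions for illustration, and the second parser is left out so the block stays standard-library only.

```python
import argparse


def split_custom_args(argv):
    """Separate script-specific flags from the ones left for another parser."""
    # allow_abbrev=False keeps unrelated flags such as --model from being
    # silently matched as a prefix of --model_name.
    pre_parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    pre_parser.add_argument("--model_name", default=None)
    pre_parser.add_argument("--loss_output_file", default=None)
    pre_parser.add_argument("--pad_to_multiple_of", type=int, default=None)
    return pre_parser.parse_known_args(argv)


custom, remaining = split_custom_args(
    ["--output_dir", "out", "--loss_output_file", "losses.json", "--bf16"]
)
print(custom.loss_output_file, remaining)
```

Compared with popping values by index, this form reports a missing value as a usage error and handles type conversion, at the cost of declaring each custom flag up front.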
@@ -0,0 +1,4 @@
{
  "image_processor_type": "ViTImageProcessor",
  "size": 30
}
87
tests/trainer/distributed/scripts/worker_seed.py
Normal file
@@ -0,0 +1,87 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Worker script for dataloader worker seed divergence tests.

Verifies that dataloader workers get different random seeds across GPUs,
so that each rank sees different random augmentations.

Run via torchrun or accelerate launch.
"""

import random

import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import Dataset

from transformers import HfArgumentParser, Trainer, TrainingArguments, set_seed
from transformers.testing_utils import torch_device


def gather_from_all_gpus(tensor, world_size):
    gather_list = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gather_list, tensor)
    return gather_list


class DummyDataset(Dataset):
    def __init__(self):
        self.length = 64

    def __len__(self):
        return self.length

    def __getitem__(self, i) -> dict:
        # Draw from all three RNGs that a dataloader worker may seed.
        x = random.random()
        y = np.random.random()
        z = torch.rand([]).item()
        return {"x": torch.tensor([x, y, z])}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x):
        # ``x`` is already a tensor; copy it rather than re-wrapping it with torch.tensor().
        local_tensor = x.detach().clone().to(torch_device)
        gathered = gather_from_all_gpus(local_tensor, dist.get_world_size())
        # Fail if every rank received the same random batch.
        assert not all(torch.allclose(t, gathered[0]) for t in gathered[1:])
        y = self.fc(x)
        return (y.mean(), y)


def run_distributed_training(training_args):
    set_seed(42)
    model = DummyModel()
    dataset = DummyDataset()
    training_args.max_steps = 3
    # dataloader_num_workers must be > 0 to enable worker_init_fn
    training_args.dataloader_num_workers = 2
    trainer = Trainer(
        model,
        training_args,
        train_dataset=dataset,
    )
    trainer.train()


if __name__ == "__main__":
    parser = HfArgumentParser((TrainingArguments,))
    training_args = parser.parse_args_into_dataclasses()[0]
    run_distributed_training(training_args)
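The property this worker script asserts, that random draws differ across ranks, can be shown in isolation with the standard library. The sketch below is a toy model under a stated assumption: the seeding formula in `worker_seed` is hypothetical and is not claimed to be what PyTorch or the Trainer uses. It only demonstrates that distinct seeds give distinct streams while a shared seed reproduces the same stream.

```python
import random


def worker_seed(base_seed, rank, worker_id, num_workers):
    """Hypothetical scheme: give every (rank, worker) pair its own seed."""
    return base_seed + rank * num_workers + worker_id


def first_draws(seed, n=3):
    """Return the first ``n`` floats produced by a generator seeded with ``seed``."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


num_workers = 2
seeds = {
    (rank, worker): worker_seed(42, rank, worker, num_workers)
    for rank in range(2)
    for worker in range(num_workers)
}
draws = {key: first_draws(seed) for key, seed in seeds.items()}

# Every (rank, worker) pair should produce its own sequence.
all_distinct = len({tuple(values) for values in draws.values()}) == len(draws)
print(seeds, all_distinct)
```

If all ranks shared one seed, `first_draws` would return identical lists on every rank, which is the failure mode the gathered-tensor assertion in the model's `forward` is meant to catch.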
180
tests/trainer/distributed/test_trainer_distributed.py
Normal file
@@ -0,0 +1,180 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Shared constants, helpers, and reusable test logic for distributed trainer tests.

This module provides:
- Path constants for test scripts and accelerate configs.
- ``TrainerDistributedCommon``, an abstract base class that contains reusable
  test scenarios (training, mixed-precision, gradient accumulation, checkpoint
  resume, evaluation). Framework-specific test files (DDP, FSDP, DeepSpeed)
  subclass it and wire each scenario to parameterized test methods.
"""

import json
import os
from abc import ABC, abstractmethod

from transformers import is_torch_available
from transformers.testing_utils import execute_subprocess_async
from transformers.trainer_callback import TrainerState
from transformers.trainer_utils import get_last_checkpoint


if is_torch_available():
    import torch

# ---------------------------------------------------------------------------
# Path constants
# ---------------------------------------------------------------------------
DISTRIBUTED_DIR = os.path.dirname(__file__)
CONFIGS_DIR = os.path.join(DISTRIBUTED_DIR, "accelerate_configs")
SCRIPTS_DIR = os.path.join(DISTRIBUTED_DIR, "scripts")
TRAIN_SCRIPT = os.path.join(SCRIPTS_DIR, "train.py")


class TrainerDistributedCommon(ABC):
    """Reusable test scenarios shared across DDP, FSDP, and DeepSpeed.

    Subclasses must:
    1. Implement ``get_accelerate_cmd`` to build the launch command.
    2. Define the following test methods (parameterized as needed)::

        test_training → self.check_training(dtype, ...)
        test_training_mixed_precision → self.check_mixed_precision(dtype, ...)
        test_training_with_gradient_accumulation → self.check_gradient_accumulation(...)
        test_training_and_can_resume_normally → self.check_resume(...)
        test_eval → self.check_eval(...)

    These test methods can't be defined here as ``@abstractmethod`` because
    ``@parameterized.expand`` removes the original method name from the
    subclass, which would cause ABC to raise ``TypeError`` at instantiation.
    """

    @abstractmethod
    def get_accelerate_cmd(self, script, config_file, launch_args=None, script_args=None, **kwargs):
        """Build the full ``accelerate launch`` command list.

        Args:
            script: Path to the Python script to run.
            config_file: Path to the accelerate YAML config (always required).
            launch_args: Extra flags inserted *before* the script
                (e.g. ``--fsdp_sharding_strategy``, ``--offload_optimizer_device``).
            script_args: Extra flags appended *after* the script
                (e.g. ``--output_dir``, ``--bf16``).
            **kwargs: Framework-specific overrides (e.g. ``num_processes``).
        """
        ...

    # -------------------------------------------------------------------
    # Helpers
    # -------------------------------------------------------------------
    def _get_default_script_args(self, output_dir, num_epochs=1, logging_steps=5, save_steps=None):
        """Build the baseline CLI arguments shared by all training runs."""
        args = [
            "--output_dir",
            output_dir,
            "--num_train_epochs",
            str(num_epochs),
            "--logging_steps",
            str(logging_steps),
            "--per_device_train_batch_size",
            "4",
            "--learning_rate",
            "5e-5",
        ]
        if save_steps is not None:
            args += ["--save_steps", str(save_steps)]
        else:
            args += ["--save_strategy", "no"]
        return args

    def _train_and_get_log_history(self, cmd, output_dir):
        """Run a training command and return the log history from the last checkpoint."""
        execute_subprocess_async(cmd, env=self.get_env())
        checkpoint = get_last_checkpoint(output_dir)
        state_file = os.path.join(checkpoint, "trainer_state.json")
        return TrainerState.load_from_json(state_file).log_history

    # -------------------------------------------------------------------
    # Reusable test scenarios (called from subclass test methods)
    # -------------------------------------------------------------------
    def check_training(self, dtype="bf16", **cmd_kwargs):
        """Verify that training completes with the model loaded in *dtype* (no mixed precision)."""
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir) + ["--model_dtype", dtype]
        execute_subprocess_async(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            env=self.get_env(),
        )

    def check_mixed_precision(self, dtype="bf16", **cmd_kwargs):
        """Verify mixed-precision training: model in fp32, compute in *dtype*."""
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir) + ["--model_dtype", "fp32", f"--{dtype}"]
        # fp16 requires a non-fused optimizer to avoid nan losses on small models
        if dtype == "fp16":
            args += ["--optim", "adamw_torch"]
        execute_subprocess_async(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            env=self.get_env(),
        )

    def check_gradient_accumulation(self, **cmd_kwargs):
        """Verify that training with gradient accumulation completes without error."""
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir) + ["--bf16", "--gradient_accumulation_steps", "2"]
        execute_subprocess_async(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            env=self.get_env(),
        )

    def check_resume(self, **cmd_kwargs):
        """Verify that training can resume from a checkpoint with consistent learning rates."""
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2) + ["--bf16"]

        original_logs = self._train_and_get_log_history(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            output_dir,
        )

        checkpoint = os.path.join(output_dir, "checkpoint-2")
        self.assertTrue(os.path.isdir(checkpoint), f"Checkpoint dir not found: {checkpoint}")

        resume_args = args + ["--resume_from_checkpoint", checkpoint]
        resumed_logs = self._train_and_get_log_history(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
            output_dir,
        )

        for original, resumed in zip(original_logs, resumed_logs):
            if "learning_rate" in original:
                self.assertAlmostEqual(original["learning_rate"], resumed["learning_rate"], delta=1e-5)

    def check_eval(self, **cmd_kwargs):
        """Verify that evaluation produces a finite eval loss."""
        output_dir = self.get_auto_remove_tmp_dir()
        eval_output = os.path.join(output_dir, "eval_metrics.json")
        args = self._get_default_script_args(output_dir) + ["--do_eval", "--eval_output_file", eval_output]
        execute_subprocess_async(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            env=self.get_env(),
        )

        with open(eval_output) as f:
            eval_metrics = json.load(f)
        self.assertIn("eval_loss", eval_metrics)
        self.assertTrue(torch.isfinite(torch.tensor(eval_metrics["eval_loss"])))
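The module above relies on a common pattern: shared scenario logic lives in an abstract base class, and each framework supplies only the command-building hook. The following is a stripped-down sketch of that pattern without any test framework. The class names `LaunchCommandBase` and `DDPLaunch` and the method `build_training_cmd` are hypothetical stand-ins, not names from the commit; the `accelerate launch --config_file` prefix mirrors the command the DDP tests build.

```python
from abc import ABC, abstractmethod


class LaunchCommandBase(ABC):
    """Hypothetical stand-in for a shared base with one abstract hook."""

    @abstractmethod
    def get_accelerate_cmd(self, script, config_file, launch_args=None, script_args=None):
        ...

    def build_training_cmd(self, script, config_file, output_dir):
        # Shared logic lives in the base and calls the framework-specific hook.
        return self.get_accelerate_cmd(script, config_file, script_args=["--output_dir", output_dir])


class DDPLaunch(LaunchCommandBase):
    def get_accelerate_cmd(self, script, config_file, launch_args=None, script_args=None):
        cmd = ["accelerate", "launch", "--config_file", config_file]
        cmd += list(launch_args or [])  # flags that go before the script
        cmd.append(script)
        cmd += list(script_args or [])  # flags that go after the script
        return cmd


cmd = DDPLaunch().build_training_cmd("train.py", "ddp.yaml", "/tmp/out")
print(cmd)

# The base class cannot be instantiated while the hook is still abstract.
try:
    LaunchCommandBase()
    base_instantiable = True
except TypeError:
    base_instantiable = False
print(base_instantiable)
```

Keeping only the hook abstract means a subclass that forgets to implement it fails at instantiation, while the scenario methods remain ordinary methods that subclasses call from their own test functions.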
297
tests/trainer/distributed/test_trainer_distributed_ddp.py
Normal file
@@ -0,0 +1,297 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
DDP-specific distributed trainer tests.
"""

import json
import os
import re

from parameterized import parameterized

from transformers.testing_utils import (
    CaptureStderr,
    TestCasePlus,
    backend_device_count,
    execute_subprocess_async,
    get_torch_dist_unique_port,
    require_torch_multi_accelerator,
    slow,
    torch_device,
)
from transformers.utils import is_torch_bf16_available_on_device, is_torch_fp16_available_on_device

from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TrainerDistributedCommon


DDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "ddp.yaml")

dtypes = []
if is_torch_bf16_available_on_device(torch_device):
    dtypes += ["bf16"]
if is_torch_fp16_available_on_device(torch_device):
    dtypes += ["fp16"]

pure_dtype_params = ["fp32"] + dtypes
mixed_precision_params = list(dtypes)


def _parameterized_custom_name_func(func, param_num, param):
    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
    return f"{func.__name__}_{param_based_name}"


class DDPCommandsMixin:
    """Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for DDP."""

    def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
        if num_processes is None:
            num_processes = backend_device_count(torch_device)
        port = get_torch_dist_unique_port()
        cmd = [
            "torchrun",
            f"--nproc_per_node={num_processes}",
            "--nnodes=1",
            f"--master_port={port}",
            script,
        ]
        if script_args:
            cmd.extend(script_args)
        return cmd

    def get_accelerate_cmd(
        self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
    ):
        if num_processes is None:
            num_processes = backend_device_count(torch_device)
        port = get_torch_dist_unique_port()
        cmd = [
            "accelerate",
            "launch",
            "--config_file",
            config_file,
            "--num_processes",
            str(num_processes),
            "--main_process_port",
            str(port),
        ]
        if launch_args:
            cmd.extend(launch_args)
        cmd.append(script)
        if script_args:
            cmd.extend(script_args)
        return cmd


@slow
@require_torch_multi_accelerator
class TestTrainerDistributedDDP(DDPCommandsMixin, TestCasePlus):
    # -----------------------------------------------------------------------
    # accelerate launch tests
    # -----------------------------------------------------------------------
    def test_eval_order(self):
        output_dir = self.get_auto_remove_tmp_dir()
        script = os.path.join(SCRIPTS_DIR, "eval_ddp.py")
        cmd = self.get_accelerate_cmd(
            script,
            DDP_CONFIG_FILE,
            script_args=["--output_dir", output_dir],
        )
        execute_subprocess_async(cmd, env=self.get_env())

    def test_loss_averaging(self):
        device_count = backend_device_count(torch_device)
        min_bs = 2
        output_dir = self.get_auto_remove_tmp_dir()
        script = os.path.join(SCRIPTS_DIR, "loss_averaging.py")

        # Launch 1: single-GPU baseline
        cmd = self.get_accelerate_cmd(
            script,
            DDP_CONFIG_FILE,
            script_args=[
                "--output_dir",
                f"{output_dir}/base",
                "--per_device_train_batch_size",
                str(min_bs * device_count),
                "--average_tokens_across_devices",
                "True",
            ],
            num_processes=1,
        )
        execute_subprocess_async(cmd, env=self.get_env())

        # Launch 2: multi-GPU with both averaging modes in one process
        cmd = self.get_accelerate_cmd(
            script,
            DDP_CONFIG_FILE,
            script_args=[
                "--output_dir",
                f"{output_dir}/multi",
                "--per_device_train_batch_size",
                str(min_bs),
                "--run_both_averaging_modes",
            ],
            num_processes=device_count,
        )
        execute_subprocess_async(cmd, env=self.get_env())

        with open(f"{output_dir}/base_losses.json") as f:
            base_loss = json.load(f)
        with open(f"{output_dir}/multi/broken_losses.json") as f:
            broken_loss = json.load(f)
        with open(f"{output_dir}/multi/fixed_losses.json") as f:
            fixed_loss = json.load(f)

        broken_diff = [abs(base_loss[i] - broken_loss[i]) for i in range(len(base_loss))]
        fixed_diff = [abs(base_loss[i] - fixed_loss[i]) for i in range(len(base_loss))]
        sum_base = sum(base_loss)
        sum_broken = sum(broken_loss)
        relative_broken = abs(sum_base - sum_broken) / max(sum_base, sum_broken)

        self.assertGreater(max(broken_diff), 0.5)
        self.assertLess(max(fixed_diff), 0.005)
        self.assertLess(relative_broken, 0.1)

    def test_worker_seed(self):
        output_dir = self.get_auto_remove_tmp_dir()
        script = os.path.join(SCRIPTS_DIR, "worker_seed.py")
        cmd = self.get_accelerate_cmd(
            script,
            DDP_CONFIG_FILE,
            script_args=["--output_dir", output_dir],
        )
        execute_subprocess_async(cmd, env=self.get_env())

    # -----------------------------------------------------------------------
    # torchrun vs accelerate env parity
    # -----------------------------------------------------------------------
    def test_torchrun_accelerate_env_parity(self):
        """Verify torchrun and accelerate launch produce the same distributed environment for DDP."""
        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
        num_processes = backend_device_count(torch_device)

        torchrun_dir = self.get_auto_remove_tmp_dir()
        cmd = self.get_torchrun_cmd(script, script_args=["--output_dir", torchrun_dir], num_processes=num_processes)
        execute_subprocess_async(cmd, env=self.get_env())

        accelerate_dir = self.get_auto_remove_tmp_dir()
        cmd = self.get_accelerate_cmd(
            script, DDP_CONFIG_FILE, script_args=["--output_dir", accelerate_dir], num_processes=num_processes
        )
        execute_subprocess_async(cmd, env=self.get_env())

        for rank in range(num_processes):
            with open(os.path.join(torchrun_dir, f"env_rank{rank}.json")) as f:
                tr = json.load(f)
            with open(os.path.join(accelerate_dir, f"env_rank{rank}.json")) as f:
                ac = json.load(f)

            for info in (tr, ac):
                # Rank consistency: env vars, TrainingArguments, and accelerator all agree
                self.assertEqual(info["env_world_size"], str(num_processes))
                self.assertEqual(info["env_rank"], str(rank))
                self.assertEqual(info["env_local_rank"], str(rank))
                self.assertEqual(info["args_process_index"], rank)
                self.assertEqual(info["args_local_process_index"], rank)
                self.assertIn(info["args_local_rank"], (rank, -1))  # may be -1 before framework consumes it
                self.assertEqual(info["accelerator_process_index"], rank)
                self.assertEqual(info["accelerator_local_process_index"], rank)
                self.assertIsNotNone(info["env_master_addr"])
                self.assertIsNotNone(info["env_master_port"])

                # World size and parallel mode
                self.assertEqual(info["args_world_size"], num_processes)
                self.assertEqual(info["args_n_gpu"], 1)
                self.assertEqual(info["args_parallel_mode"], "ParallelMode.DISTRIBUTED")
                self.assertEqual(info["accelerator_num_processes"], num_processes)
                self.assertTrue(info["accelerator_use_distributed"])
                self.assertEqual(info["accelerator_is_main_process"], rank == 0)
                self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)

                # DDP: distributed type is MULTI_GPU
                self.assertEqual(info["accelerator_distributed_type"], "DistributedType.MULTI_GPU")

                # Each rank on its own device
                self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])

                # DDP should not activate FSDP or DeepSpeed
                self.assertFalse(info["trainer_is_fsdp_enabled"])
                self.assertFalse(info["trainer_is_deepspeed_enabled"])
                self.assertNotIn("fsdp_version", info)
                self.assertNotIn("deepspeed_zero_stage", info)

    @parameterized.expand(
        [
            # Each tuple is (name, CLI flags, expected count of "Running training" on stderr)
            # for a 2-process launch.
            ("base", "--log_level info", 1),  # only the main process logs at info
            ("low", "--log_level debug --log_level_replica debug", 2),  # both processes log
            ("high", "--log_level error --log_level_replica debug", 1),  # only the replica logs
            ("mixed", "--log_level error --log_level_replica error", 0),  # neither process logs
        ]
    )
    def test_log_level_replica(self, _name, extra_args_str, expected_matches):
        """Test that log_level_replica controls logging on non-main processes."""
        output_dir = self.get_auto_remove_tmp_dir()
        script = os.path.join(SCRIPTS_DIR, "train.py")
        script_args = [
            "--output_dir",
            output_dir,
            "--num_train_epochs",
            "1",
            "--per_device_train_batch_size",
            "4",
            "--logging_strategy",
            "no",
        ]
        if extra_args_str:
            script_args.extend(extra_args_str.split())
        cmd = self.get_accelerate_cmd(script, DDP_CONFIG_FILE, script_args=script_args, num_processes=2)
        log_info_string = "Running training"
        with CaptureStderr() as cl:
            execute_subprocess_async(cmd, env=self.get_env())
        n_matches = len(re.findall(log_info_string, cl.err))
        self.assertEqual(n_matches, expected_matches)


# ---------------------------------------------------------------------------
||||
# DDP training integration tests (using train.py)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@slow
|
||||
@require_torch_multi_accelerator
|
||||
class TestTrainerDistributedDDPCommon(DDPCommandsMixin, TrainerDistributedCommon, TestCasePlus):
|
||||
"""
|
||||
Distributed DDP training tests using ``accelerate launch`` with the shared
|
||||
train.py script. Mirrors the test structure used in FSDP and DeepSpeed.
|
||||
"""
|
||||
|
||||
@parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
|
||||
def test_training(self, dtype):
|
||||
self.check_training(dtype, config_file=DDP_CONFIG_FILE)
|
||||
|
||||
@parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
|
||||
def test_training_mixed_precision(self, dtype):
|
||||
self.check_mixed_precision(dtype, config_file=DDP_CONFIG_FILE)
|
||||
|
||||
def test_training_with_gradient_accumulation(self):
|
||||
self.check_gradient_accumulation(config_file=DDP_CONFIG_FILE)
|
||||
|
||||
def test_training_and_can_resume_normally(self):
|
||||
self.check_resume(config_file=DDP_CONFIG_FILE)
|
||||
|
||||
def test_eval(self):
|
||||
self.check_eval(config_file=DDP_CONFIG_FILE)
|
||||
1706
tests/trainer/distributed/test_trainer_distributed_deepspeed.py
Normal file
File diff suppressed because it is too large
668
tests/trainer/distributed/test_trainer_distributed_fsdp.py
Normal file
@@ -0,0 +1,668 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
FSDP-specific distributed trainer tests.
"""

import itertools
import json
import os
import unittest
from functools import partial
from pathlib import Path
from unittest.mock import patch

from parameterized import parameterized

from tests.trainer.trainer_test_utils import TrainerIntegrationCommon, get_regression_trainer  # noqa
from transformers import HfArgumentParser, PreTrainedConfig, TrainingArguments, is_torch_available
from transformers.testing_utils import (
    TestCasePlus,
    backend_device_count,
    execute_subprocess_async,
    get_torch_dist_unique_port,
    mockenv_context,
    require_torch,
    require_torch_accelerator,
    require_torch_multi_accelerator,
    slow,
    torch_device,
)
from transformers.trainer_utils import set_seed
from transformers.utils import (
    is_torch_bf16_available_on_device,
    is_torch_fp16_available_on_device,
)

from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TRAIN_SCRIPT, TrainerDistributedCommon


if is_torch_available():
    import torch
    from torch import nn

    from transformers import PreTrainedModel
    from transformers.trainer import FSDP_MODEL_NAME

# Base accelerate configs (version only; model-specific settings via launch args)
FSDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp.yaml")
FSDP2_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2.yaml")
FSDP2_CP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2_cp.yaml")
FSDP_GENERATE_SCRIPT = os.path.join(SCRIPTS_DIR, "fsdp_generate.py")

FSDP_CONFIGS = {
    "fsdp1": FSDP_CONFIG_FILE,
    "fsdp2": FSDP2_CONFIG_FILE,
}

# Launch args shared by all training tests
TRAIN_LAUNCH_ARGS = [
    "--fsdp_auto_wrap_policy",
    "TRANSFORMER_BASED_WRAP",
]

dtypes = []
if is_torch_bf16_available_on_device(torch_device):
    dtypes += ["bf16"]
if is_torch_fp16_available_on_device(torch_device):
    dtypes += ["fp16"]

sharding_strategies = ["full_shard", "shard_grad_op"]  # zero3 and zero2
fsdp_versions = ["fsdp1", "fsdp2"]

config_params = list(itertools.product(sharding_strategies, dtypes))
# Mixed precision: model loaded in fp32, training with --bf16/--fp16
mixed_precision_params = list(itertools.product(sharding_strategies, dtypes, fsdp_versions))
# Pure dtype: model loaded in target dtype, no mixed precision flags
pure_dtype_params = list(itertools.product(["fp32"] + dtypes, fsdp_versions))

resume_params = [
    ("FULL_STATE_DICT", "fsdp1"),  # FULL_STATE_DICT only supported for fsdp1
    ("SHARDED_STATE_DICT", "fsdp1"),
    ("SHARDED_STATE_DICT", "fsdp2"),
]

set_seed(42)


if is_torch_available():
    # hack to restore original logging level pre #21700
    get_regression_trainer = partial(get_regression_trainer, log_level="info")


if is_torch_available():

    class _BaseModel(PreTrainedModel):
        base_model_prefix = "base"
        config_class = PreTrainedConfig

        def __init__(self, config):
            super().__init__(config)
            self.linear = nn.Linear(5, 5)
            self.linear_2 = nn.Linear(5, 5)
            self.post_init()

        def forward(self, x):
            return self.linear_2(self.linear(x))


@require_torch
class InitializeMissingKeysTest(unittest.TestCase):
    """Tests for FSDP non-rank-0 weight initialization: params should be moved from meta to CPU
    and marked as initialized without being re-initialized."""

    def _clear_init_flags(self, model):
        for module in model.modules():
            if hasattr(module, "_is_hf_initialized"):
                delattr(module, "_is_hf_initialized")
        for param in model.parameters():
            if hasattr(param, "_is_hf_initialized"):
                delattr(param, "_is_hf_initialized")
        for buffer in model.buffers():
            if hasattr(buffer, "_is_hf_initialized"):
                delattr(buffer, "_is_hf_initialized")

    def test_move_missing_keys_fsdp_non_rank0_moves_meta_to_cpu(self):
        """FSDP non-rank-0 path should move all params from meta to CPU."""
        with torch.device("meta"):
            model = _BaseModel(PreTrainedConfig())

        for param in model.parameters():
            self.assertEqual(param.device, torch.device("meta"))

        with (
            patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
            patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
        ):
            model._move_missing_keys_from_meta_to_device(
                missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
            )

        for name, param in model.named_parameters():
            self.assertEqual(param.device, torch.device("cpu"), f"param {name} should be on CPU after FSDP move")

    def test_fsdp_non_rank0_end_to_end_no_reinit(self):
        """End-to-end: move from meta + _initialize_missing_keys should mark all params initialized
        without changing their values."""
        with torch.device("meta"):
            model = _BaseModel(PreTrainedConfig())

        with (
            patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
            patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
        ):
            model._move_missing_keys_from_meta_to_device(
                missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
            )
            pre_init_values = {name: param.clone() for name, param in model.named_parameters()}
            self._clear_init_flags(model)
            model._initialize_missing_keys(is_quantized=False)

        for name, param in model.named_parameters():
            self.assertTrue(getattr(param, "_is_hf_initialized", False), f"param {name} not marked initialized")
            torch.testing.assert_close(param, pre_init_values[name], msg=f"param {name} was re-initialized")
        self.assertTrue(getattr(model, "_is_hf_initialized", False))


def _parameterized_custom_name_func(func, param_num, param):
    # customize the test name generator function as we want both params to appear in the sub-test
    # name, as by default it shows only the first param
    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
    return f"{func.__name__}_{param_based_name}"

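To see what sub-test names the parameter grids and the custom name function yield, here is a minimal sketch that avoids the `parameterized` dependency. `safe_name` and `Param` are hypothetical stand-ins introduced only for this illustration: `safe_name` assumes that `parameterized.to_safe_name` collapses runs of characters other than letters, digits, and underscores into a single underscore, and the dtype list assumes both `bf16` and `fp16` are available on the device.

```python
import itertools
import re
from collections import namedtuple

# Hypothetical stand-in for the parameter object that `parameterized` hands to name_func.
Param = namedtuple("Param", ["args"])


def safe_name(text):
    # Assumption: mimics parameterized.to_safe_name by collapsing runs of
    # characters other than letters, digits, and underscores into "_".
    return re.sub(r"[^a-zA-Z0-9_]+", "_", text)


def custom_name(func_name, param):
    # Join every argument so that all of them appear in the sub-test name.
    joined = "_".join(str(x) for x in param.args)
    return f"{func_name}_{safe_name(joined)}"


sharding_strategies = ["full_shard", "shard_grad_op"]
dtypes = ["bf16", "fp16"]  # assumed: both dtypes are available on the device
fsdp_versions = ["fsdp1", "fsdp2"]

# Same shape as `mixed_precision_params` in the test module.
grid = list(itertools.product(sharding_strategies, dtypes, fsdp_versions))
names = [custom_name("test_training_mixed_precision", Param(args=combo)) for combo in grid]

for name in names:
    print(name)
```

With both dtypes available the grid has 2 x 2 x 2 entries, so eight sub-tests would be generated for the mixed-precision test.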


# ---------------------------------------------------------------------------
# Command mixins
# ---------------------------------------------------------------------------


class FSDPCommandsMixin:
    """Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for FSDP."""

    def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
        if num_processes is None:
            num_processes = backend_device_count(torch_device)
        port = get_torch_dist_unique_port()
        cmd = [
            "torchrun",
            f"--nproc_per_node={num_processes}",
            "--nnodes=1",
            f"--master_port={port}",
            script,
        ]
        if script_args:
            cmd.extend(script_args)
        return cmd

    def get_accelerate_cmd(
        self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
    ):
        if num_processes is None:
            num_processes = backend_device_count(torch_device)
        port = get_torch_dist_unique_port()
        cmd = [
            "accelerate",
            "launch",
            "--config_file",
            config_file,
            "--num_processes",
            str(num_processes),
            "--main_process_port",
            str(port),
        ]
        if launch_args:
            cmd.extend(launch_args)
        cmd.append(script)
        if script_args:
            cmd.extend(script_args)
        return cmd

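The ordering inside `get_accelerate_cmd` matters: launcher options go before the script path, and everything after the script path is handed to the script itself. The following standalone sketch mirrors that layout. `build_accelerate_cmd` is a hypothetical free function written for this illustration, and it takes the port and process count as explicit arguments instead of calling `get_torch_dist_unique_port()` and `backend_device_count()`; the file names and port are made up.

```python
def build_accelerate_cmd(script, config_file, port, num_processes, launch_args=None, script_args=None):
    """Hypothetical standalone version of the mixin method, for illustration only."""
    cmd = [
        "accelerate",
        "launch",
        "--config_file",
        config_file,
        "--num_processes",
        str(num_processes),
        "--main_process_port",
        str(port),
    ]
    # Launcher options (for example FSDP overrides) come before the script path.
    cmd.extend(launch_args or [])
    cmd.append(script)
    # Arguments after the script path are forwarded to the training script.
    cmd.extend(script_args or [])
    return cmd


cmd = build_accelerate_cmd(
    "train.py",
    "fsdp2.yaml",
    port=29500,  # made-up port for the example
    num_processes=2,
    launch_args=["--fsdp_offload_params", "true"],
    script_args=["--output_dir", "/tmp/out", "--bf16"],
)
print(" ".join(cmd))
```

Keeping `launch_args` and `script_args` as separate lists lets a test override an accelerate setting without touching the arguments parsed by the training script.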


# ---------------------------------------------------------------------------
# Config parsing tests (no distributed training runs)
# ---------------------------------------------------------------------------


@require_torch_accelerator
class TestFSDPConfig(TestCasePlus):
    def setUp(self):
        super().setUp()
        master_port = get_torch_dist_unique_port()
        self.dist_env_1_gpu = {
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": str(master_port),
            "RANK": "0",
            "LOCAL_RANK": "0",
            "WORLD_SIZE": "1",
        }
        self.accelerate_fsdp_config = {
            "fsdp_activation_checkpointing": False,
            "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
            "fsdp_backward_prefetch": "BACKWARD_PRE",
            "fsdp_cpu_ram_efficient_loading": True,
            "fsdp_forward_prefetch": False,
            "fsdp_offload_params": False,
            "fsdp_reshard_after_forward": "FULL_SHARD",
            "fsdp_state_dict_type": "FULL_STATE_DICT",
            "fsdp_sync_module_states": True,
            "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
            "fsdp_use_orig_params": True,
            "fsdp_version": 1,
        }

        self.fsdp_config = {
            "backward_prefetch": "BACKWARD_PRE",
            "forward_prefetch": "false",
            "limit_all_gathers": "false",
            "use_orig_params": "true",
            "sync_module_states": "true",
            "cpu_ram_efficient_loading": "true",
            "activation_checkpointing": "false",
            "min_num_params": 1,
        }

    @parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
    def test_accelerate_fsdp_config(self, sharding_strategy, dtype):
        output_dir = self.get_auto_remove_tmp_dir()
        # Snapshot before trainer construction: `_process_fsdp_args` strips the
        # `fsdp_` prefix in place.
        expected = dict(self.accelerate_fsdp_config)
        kwargs = {
            "output_dir": output_dir,
            "train_len": 128,
            "save_steps": 5,
            "learning_rate": 0.1,
            "fsdp": f"{sharding_strategy} offload auto_wrap",
            "fsdp_config": self.accelerate_fsdp_config,
        }
        kwargs[dtype] = True
        with mockenv_context(**self.dist_env_1_gpu):
            trainer = get_regression_trainer(**kwargs)
            self.assertIs(trainer.args.fsdp, True)
            self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
            for k, v in expected.items():
                assert k.startswith("fsdp_")
                # `transformer_layer_cls_to_wrap` is normalized from str → list during parsing.
                if k == "fsdp_transformer_layer_cls_to_wrap" and isinstance(v, str):
                    v = [v]
                self.assertEqual(trainer.args.fsdp_config[k[5:]], v)

    def test_torchrun_fsdp_config(self):
        """Verify that --fsdp + --fsdp_config (torchrun-style) are parsed correctly."""
        output_dir = self.get_auto_remove_tmp_dir()
        fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer"}
        kwargs = {
            "output_dir": output_dir,
            "train_len": 128,
            "save_steps": 5,
            "learning_rate": 0.1,
            "fsdp": "full_shard auto_wrap",
            "fsdp_config": fsdp_config,
            "bf16": True,
        }
        with mockenv_context(**self.dist_env_1_gpu):
            trainer = get_regression_trainer(**kwargs)
            self.assertIs(trainer.args.fsdp, True)
            # fsdp_ prefix is stripped and value is normalized to a list during parsing
            self.assertIn("Qwen2DecoderLayer", trainer.args.fsdp_config["transformer_layer_cls_to_wrap"])

    def test_fsdp_cli_parsing(self):
        """`--fsdp` (bare) → True; legacy `--fsdp full_shard` still parses; absent → None."""
        parser = HfArgumentParser(TrainingArguments)
        base = ["--output_dir", "/tmp/x"]

        args, _ = parser.parse_known_args([*base, "--fsdp"])
        self.assertIs(args.fsdp, True)

        args, _ = parser.parse_known_args([*base, "--fsdp", "full_shard"])
        self.assertEqual(args.fsdp, "full_shard")

        args, _ = parser.parse_known_args(base)
        self.assertIsNone(args.fsdp)

        # Bare `--fsdp` should resolve to a fully enabled FSDP setup through `_process_fsdp_args`.
        with mockenv_context(**self.dist_env_1_gpu):
            trainer_args = TrainingArguments(output_dir="/tmp/x", fsdp=True)
            self.assertIs(trainer_args.fsdp, True)
            self.assertIsNotNone(trainer_args.fsdp_plugin_args)

    @parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
    def test_fsdp_config(self, sharding_strategy, dtype):
        output_dir = self.get_auto_remove_tmp_dir()
        kwargs = {
            "output_dir": output_dir,
            "train_len": 128,
            "save_steps": 5,
            "learning_rate": 0.1,
            "fsdp": f"{sharding_strategy} offload auto_wrap",
            "fsdp_config": self.fsdp_config,
        }
        kwargs[dtype] = True
        with mockenv_context(**self.dist_env_1_gpu):
            trainer = get_regression_trainer(**kwargs)
            self.assertIs(trainer.args.fsdp, True)
            self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
            for k, v in self.fsdp_config.items():
                self.assertEqual(trainer.args.fsdp_config[k], v)


# ---------------------------------------------------------------------------
# FSDP distributed tests
# ---------------------------------------------------------------------------


@require_torch_multi_accelerator
class TestTrainerDistributedFSDP(FSDPCommandsMixin, TestCasePlus):
    def _run_env_check(self, cmd, num_processes):
        """Run the env check script and return per-rank results."""
        execute_subprocess_async(cmd, env=self.get_env())
        # output_dir is the value that follows the `--output_dir` flag in the command
        output_dir = cmd[cmd.index("--output_dir") + 1]
        results = []
        for rank in range(num_processes):
            with open(os.path.join(output_dir, f"env_rank{rank}.json")) as f:
                results.append(json.load(f))
        return results

    def test_torchrun_accelerate_fsdp1_env_parity(self):
        """Verify torchrun+--fsdp and accelerate launch produce the same FSDP1 env."""
        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
        num_processes = backend_device_count(torch_device)

        torchrun_dir = self.get_auto_remove_tmp_dir()
        torchrun_results = self._run_env_check(
            self.get_torchrun_cmd(
                script,
                script_args=[
                    "--output_dir",
                    torchrun_dir,
                    "--fsdp",
                    "full_shard",
                    "--fsdp_config",
                    '{"fsdp_version": 1}',
                ],
                num_processes=num_processes,
            ),
            num_processes,
        )

        accel_dir = self.get_auto_remove_tmp_dir()
        accel_results = self._run_env_check(
            self.get_accelerate_cmd(
                script, FSDP_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
            ),
            num_processes,
        )

        self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=1)

    def test_torchrun_accelerate_fsdp2_env_parity(self):
        """Verify torchrun+--fsdp and accelerate launch produce the same FSDP2 env."""
        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
        num_processes = backend_device_count(torch_device)

        torchrun_dir = self.get_auto_remove_tmp_dir()
        torchrun_results = self._run_env_check(
            self.get_torchrun_cmd(
                script,
                script_args=[
                    "--output_dir",
                    torchrun_dir,
                    "--fsdp",
                    "full_shard",
                    "--fsdp_config",
                    '{"fsdp_version": 2}',
                ],
                num_processes=num_processes,
            ),
            num_processes,
        )

        accel_dir = self.get_auto_remove_tmp_dir()
        accel_results = self._run_env_check(
            self.get_accelerate_cmd(
                script, FSDP2_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
            ),
            num_processes,
        )

        self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=2)

    def _check_parity(self, torchrun_results, accel_results, num_processes, expected_fsdp_version):
        for rank in range(num_processes):
            tr, ac = torchrun_results[rank], accel_results[rank]

            # Both should agree on distributed env
            self.assertEqual(tr["args_world_size"], ac["args_world_size"])
            self.assertEqual(tr["args_process_index"], ac["args_process_index"])
            self.assertEqual(tr["args_parallel_mode"], ac["args_parallel_mode"])
            self.assertEqual(tr["accelerator_num_processes"], ac["accelerator_num_processes"])
            self.assertEqual(tr["accelerator_use_distributed"], ac["accelerator_use_distributed"])

            for info in (tr, ac):
                # Rank consistency across all layers
                self.assertEqual(info["env_world_size"], str(num_processes))
                self.assertEqual(info["env_rank"], str(rank))
                self.assertEqual(info["args_process_index"], rank)
                self.assertEqual(info["args_local_process_index"], rank)
                self.assertEqual(info["accelerator_process_index"], rank)
                self.assertEqual(info["accelerator_local_process_index"], rank)
                self.assertEqual(info["args_n_gpu"], 1)
                self.assertEqual(info["accelerator_is_main_process"], rank == 0)
                self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)
                self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])

                # Both should have FSDP enabled with the correct version
                self.assertEqual(info["accelerator_distributed_type"], "DistributedType.FSDP")
                self.assertTrue(info["trainer_is_fsdp_enabled"])
                self.assertFalse(info["trainer_is_deepspeed_enabled"])
                self.assertEqual(info["fsdp_version"], expected_fsdp_version)
                self.assertNotIn("deepspeed_zero_stage", info)

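The parity tests above read one `env_rank{rank}.json` file per process. The script that writes those files is not shown on this page, so the following is only a guess at the launcher-environment part of it: `write_env_snapshot` and `read_env_snapshots` are hypothetical helpers, the `env_*` key names are taken from the assertions above, and the `args_*`, `accelerator_*`, and `trainer_*` fields (which need `TrainingArguments` and an accelerator) are left out. Two ranks are simulated with plain dictionaries instead of real processes.

```python
import json
import os
import tempfile


def write_env_snapshot(output_dir, environ):
    """Hypothetical helper: dump the launcher-provided env vars for one rank."""
    rank = environ["RANK"]
    info = {
        "env_world_size": environ.get("WORLD_SIZE"),
        "env_rank": rank,
        "env_local_rank": environ.get("LOCAL_RANK"),
        "env_master_addr": environ.get("MASTER_ADDR"),
        "env_master_port": environ.get("MASTER_PORT"),
    }
    path = os.path.join(output_dir, f"env_rank{rank}.json")
    with open(path, "w") as f:
        json.dump(info, f)
    return path


def read_env_snapshots(output_dir, num_processes):
    """Load the per-rank files in rank order, as `_run_env_check` does."""
    results = []
    for rank in range(num_processes):
        with open(os.path.join(output_dir, f"env_rank{rank}.json")) as f:
            results.append(json.load(f))
    return results


# Simulate a two-process launch with made-up environment values.
with tempfile.TemporaryDirectory() as tmp_dir:
    for rank in range(2):
        fake_env = {
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),
            "WORLD_SIZE": "2",
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": "29500",
        }
        write_env_snapshot(tmp_dir, fake_env)
    results = read_env_snapshots(tmp_dir, 2)

for rank, info in enumerate(results):
    print(rank, info["env_rank"], info["env_world_size"])
```

Environment variables are strings, which is why the tests compare `env_rank` and `env_world_size` against `str(rank)` and `str(num_processes)` while the `args_*` fields are compared against integers.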


# ---------------------------------------------------------------------------
# All distributed FSDP training tests
# ---------------------------------------------------------------------------
@slow
@require_torch_multi_accelerator
class TestTrainerDistributedFSDPCommon(
    FSDPCommandsMixin, TrainerDistributedCommon, TestCasePlus, TrainerIntegrationCommon
):
    # -------------------------------------------------------------------
    # FSDP training: accelerate (parameterized over fsdp version)
    # -------------------------------------------------------------------

    # Pure dtype training: model loaded in target dtype, no mixed precision
    @parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
    def test_training(self, dtype, fsdp_version):
        self.check_training(dtype, config_file=FSDP_CONFIGS[fsdp_version])

    # Mixed precision: model loaded in fp32, training with --bf16/--fp16
    @parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
    def test_training_mixed_precision(self, sharding_strategy, dtype, fsdp_version):
        if fsdp_version == "fsdp2":
            reshard = "true" if sharding_strategy == "full_shard" else "false"
        else:
            reshard = sharding_strategy.upper()
        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_reshard_after_forward", reshard]
        self.check_mixed_precision(dtype, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args)

    @parameterized.expand(["true", "false"], name_func=_parameterized_custom_name_func)
    def test_fsdp2_cpu_ram_efficient_loading(self, cpu_ram_efficient_loading):
        launch_args = list(TRAIN_LAUNCH_ARGS) + [
            "--fsdp_cpu_ram_efficient_loading",
            cpu_ram_efficient_loading,
        ]
        self.check_training("bf16", config_file=FSDP2_CONFIG_FILE, launch_args=launch_args)

    @parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
    def test_training_with_gradient_accumulation(self, fsdp_version):
        self.check_gradient_accumulation(config_file=FSDP_CONFIGS[fsdp_version])

    @parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
    def test_basic_run_with_cpu_offload(self, fsdp_version):
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir) + ["--bf16", "--max_steps", "10"]
        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_offload_params", "true"]
        execute_subprocess_async(
            self.get_accelerate_cmd(
                TRAIN_SCRIPT, script_args=args, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args
            ),
            env=self.get_env(),
        )

    @parameterized.expand(resume_params, name_func=_parameterized_custom_name_func)
    def test_training_and_can_resume_normally(self, state_dict_type, fsdp_version):
        output_dir = self.get_auto_remove_tmp_dir()
        args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2)

        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", state_dict_type]
        cmd_kwargs = {"config_file": FSDP_CONFIGS[fsdp_version], "launch_args": launch_args}

        logs = self._train_and_get_log_history(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
            output_dir,
        )

        # resume from ckpt
        checkpoint = os.path.join(output_dir, "checkpoint-2")
        resume_args = args + ["--resume_from_checkpoint", checkpoint]

        is_fsdp_ckpt = os.path.isdir(checkpoint) and (
            # this checks the FSDP state dict when `SHARDED_STATE_DICT` is used
            any(
                FSDP_MODEL_NAME in folder_name
                for folder_name in os.listdir(checkpoint)
                if os.path.isdir(os.path.join(checkpoint, folder_name))
            )
            # this checks the FSDP state dict when `FULL_STATE_DICT` is used
            or os.path.isfile(os.path.join(checkpoint, f"{FSDP_MODEL_NAME}.bin"))
        )
        self.assertTrue(is_fsdp_ckpt)

        logs_resume = self._train_and_get_log_history(
            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
            output_dir,
        )

        for log, log1 in zip(logs, logs_resume):
            if "learning_rate" in log:
                self.assertAlmostEqual(log["learning_rate"], log1["learning_rate"], delta=1e-5)

    # -------------------------------------------------------------------
    # Context parallel tests
    # -------------------------------------------------------------------
    def test_cp_equivalence(self):
        """Test that CP produces the same losses as without CP."""

        # CP doesn't work with Qwen2 (DTensor mixing error), so we use Llama here.
        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", "SHARDED_STATE_DICT"]
        cp_script_args = [
            "--model_name",
            "hf-internal-testing/tiny-random-LlamaForCausalLM",
            "--max_steps",
            "10",
            "--per_device_train_batch_size",
            "1",
            "--seed",
            "42",
            "--logging_steps",
            "1",
            "--save_strategy",
            "no",
            "--model_dtype",
            "fp32",
            "--attn_implementation",
            "sdpa",
            "--pad_to_multiple_of",
            "4",
        ]

        # Step 1: Run with CP enabled (cp_size=2)
        cp_yes_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
        cp_yes_losses_path = cp_yes_output_dir / "cp_yes_losses.json"
        cmd = self.get_accelerate_cmd(
            TRAIN_SCRIPT,
            config_file=FSDP2_CP_CONFIG_FILE,
            launch_args=launch_args,
            script_args=["--output_dir", str(cp_yes_output_dir), "--loss_output_file", str(cp_yes_losses_path)]
            + cp_script_args,
        )
        execute_subprocess_async(cmd, env=self.get_env())

        # Step 2: Run without CP (FSDP with num_processes=1, no parallelism_config)
        cp_no_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
        cp_no_losses_path = cp_no_output_dir / "cp_no_losses.json"

        cmd = self.get_accelerate_cmd(
            TRAIN_SCRIPT,
            config_file=FSDP2_CONFIG_FILE,
            launch_args=launch_args,
            script_args=[
                "--output_dir",
                str(cp_no_output_dir),
                "--loss_output_file",
                str(cp_no_losses_path),
            ]
            + cp_script_args,
            num_processes=1,
        )
        execute_subprocess_async(cmd, env=self.get_env())

        # Compare losses
        with open(cp_yes_losses_path) as f:
            cp_yes_losses = json.load(f)
        with open(cp_no_losses_path) as f:
            cp_no_losses = json.load(f)

        assert len(cp_yes_losses) == len(cp_no_losses), (
            f"Different number of losses: CP has {len(cp_yes_losses)}, no-CP has {len(cp_no_losses)}"
        )

        cp_yes_losses_tensor = torch.tensor(cp_yes_losses)
        cp_no_losses_tensor = torch.tensor(cp_no_losses)

        torch.testing.assert_close(
            cp_yes_losses_tensor,
            cp_no_losses_tensor,
            rtol=2e-2,
            atol=2e-2,
            msg=f"CP losses {cp_yes_losses} do not match non-CP losses {cp_no_losses}",
        )

    # -------------------------------------------------------------------
    # FSDP eval tests
    # -------------------------------------------------------------------
    def test_eval(self):
        self.check_eval(config_file=FSDP_CONFIG_FILE)

    # -------------------------------------------------------------------
    # FSDP generation tests (moved from tests/generation/test_fsdp.py)
    # -------------------------------------------------------------------
    def test_fsdp_generate(self):
        cmd = self.get_accelerate_cmd(
            FSDP_GENERATE_SCRIPT,
            config_file=FSDP_CONFIG_FILE,
            script_args=["--fsdp"],
        )
        execute_subprocess_async(cmd, env=self.get_env())

    def test_fsdp2_generate(self):
        cmd = self.get_accelerate_cmd(
            FSDP_GENERATE_SCRIPT,
            config_file=FSDP2_CONFIG_FILE,
            script_args=["--fsdp2"],
        )
        execute_subprocess_async(cmd, env=self.get_env())

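The context-parallel equivalence test compares the two loss curves with `rtol=2e-2` and `atol=2e-2`. As a rough guide to how loose that is, the sketch below applies the elementwise criterion documented for `torch.testing.assert_close` on finite values, `|actual - expected| <= atol + rtol * |expected|`, in plain Python so it runs without torch. `is_close` is a hypothetical helper and the loss values are made up for the illustration; they are not results from the test.

```python
def is_close(actual, expected, rtol, atol):
    # Elementwise closeness criterion for finite values:
    # |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Made-up loss curves standing in for the CP and non-CP runs.
cp_losses = [2.31, 1.98, 1.75]
no_cp_losses = [2.30, 2.00, 1.74]

mismatches = [
    step
    for step, (actual, expected) in enumerate(zip(cp_losses, no_cp_losses))
    if not is_close(actual, expected, rtol=2e-2, atol=2e-2)
]
print("mismatching steps:", mismatches)

# A loss near 2.0 may deviate by up to about 0.02 + 0.02 * 2.0 = 0.06.
print(is_close(2.05, 2.0, rtol=2e-2, atol=2e-2))
print(is_close(2.10, 2.0, rtol=2e-2, atol=2e-2))
```

Because the absolute and relative terms add, the allowed deviation grows with the magnitude of the expected loss, which makes the check more forgiving early in training when losses are large.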
1248
tests/trainer/test_data_collator.py
Normal file
File diff suppressed because it is too large
1313
tests/trainer/test_trainer.py
Normal file
File diff suppressed because it is too large
250
tests/trainer/test_trainer_accelerator.py
Normal file
@@ -0,0 +1,250 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Trainer AcceleratorConfig tests: creation from dict/YAML/dataclass, partial overrides,
gradient accumulation settings, custom AcceleratorState, and validation.
"""

import dataclasses
import json
import tempfile
from pathlib import Path
from typing import Any

from accelerate import Accelerator
from accelerate.state import AcceleratorState

from transformers import Trainer, TrainingArguments
from transformers.testing_utils import TestCasePlus, require_torch
from transformers.trainer_pt_utils import AcceleratorConfig

from .trainer_test_utils import (
    RegressionModelConfig,
    RegressionPreTrainedModel,
    RegressionTrainingArguments,
    SampleIterableDataset,
)


@require_torch
class TrainerAcceleratorConfigTest(TestCasePlus):
    def test_accelerator_config_empty(self):
        # Checks that a config can be made with the defaults if not passed
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = RegressionModelConfig(a=1.5, b=2.5)
            model = RegressionPreTrainedModel(config)
            eval_dataset = SampleIterableDataset()

            # No accelerator_config is passed, so every option should keep its default
            args = RegressionTrainingArguments(output_dir=tmp_dir)
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.split_batches, False)
            self.assertEqual(trainer.accelerator.dispatch_batches, None)
            self.assertEqual(trainer.accelerator.even_batches, True)
            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
            # gradient_accumulation_kwargs configure gradient_state; none were passed here
            self.assertNotIn("sync_each_batch", trainer.accelerator.gradient_state.plugin_kwargs)

    def test_accelerator_config_from_dict(self):
        # Checks that accelerator kwargs can be passed through
        # and the accelerator is initialized accordingly
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = RegressionModelConfig(a=1.5, b=2.5)
            model = RegressionPreTrainedModel(config)
            eval_dataset = SampleIterableDataset()

            accelerator_config: dict[str, Any] = {
                "split_batches": True,
                "dispatch_batches": True,
                "even_batches": False,
                "use_seedable_sampler": True,
            }
            accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}

            # Leaves all options as something *not* basic
            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.split_batches, True)
            self.assertEqual(trainer.accelerator.dispatch_batches, True)
            self.assertEqual(trainer.accelerator.even_batches, False)
            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)

    def test_accelerator_config_from_yaml(self):
        # Checks that accelerator kwargs can be loaded from a config file (JSON here)
        # and the accelerator is initialized accordingly
        with tempfile.TemporaryDirectory() as tmp_dir:
            path_file = Path(tmp_dir) / "accelerator_config.json"
            with open(path_file, "w") as f:
                accelerator_config = {
                    "split_batches": True,
                    "dispatch_batches": True,
                    "even_batches": False,
                    "use_seedable_sampler": False,
                }
                json.dump(accelerator_config, f)
            config = RegressionModelConfig(a=1.5, b=2.5)
            model = RegressionPreTrainedModel(config)
            eval_dataset = SampleIterableDataset()

            # Leaves all options as something *not* basic
            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=path_file)
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.split_batches, True)
            self.assertEqual(trainer.accelerator.dispatch_batches, True)
            self.assertEqual(trainer.accelerator.even_batches, False)
            self.assertEqual(trainer.accelerator.use_seedable_sampler, False)

    def test_accelerator_config_from_dataclass(self):
        # Checks that accelerator kwargs can be passed through
        # and the accelerator is initialized accordingly

        accelerator_config = AcceleratorConfig(
            split_batches=True,
            dispatch_batches=True,
            even_batches=False,
            use_seedable_sampler=False,
        )
        config = RegressionModelConfig(a=1.5, b=2.5)
        model = RegressionPreTrainedModel(config)
        eval_dataset = SampleIterableDataset()
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.split_batches, True)
            self.assertEqual(trainer.accelerator.dispatch_batches, True)
            self.assertEqual(trainer.accelerator.even_batches, False)
            self.assertEqual(trainer.accelerator.use_seedable_sampler, False)

    def test_accelerate_config_from_dataclass_grad_accum(self):
        # Checks that gradient accumulation kwargs can be passed through
        # and that `num_steps` ends up as the gradient accumulation steps

        grad_acc_kwargs = {
            "num_steps": 10,
            "adjust_scheduler": False,
            "sync_with_dataloader": False,
            "sync_each_batch": True,
        }
        accelerator_config = AcceleratorConfig(
            split_batches=True,
            dispatch_batches=True,
            even_batches=False,
            use_seedable_sampler=False,
            gradient_accumulation_kwargs=grad_acc_kwargs,
        )
        config = RegressionModelConfig(a=1.5, b=2.5)
        model = RegressionPreTrainedModel(config)
        eval_dataset = SampleIterableDataset()
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.args.gradient_accumulation_steps, 10)

    def test_accelerator_config_from_partial(self):
        # Checks that a partial config overrides only the given option
        # and the remaining options keep their defaults
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = RegressionModelConfig(a=1.5, b=2.5)
            model = RegressionPreTrainedModel(config)
            eval_dataset = SampleIterableDataset()

            # Leaves one option as something *not* basic
            args = RegressionTrainingArguments(
                output_dir=tmp_dir,
                accelerator_config={
                    "split_batches": True,
                },
            )
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.split_batches, True)
            self.assertEqual(trainer.accelerator.dispatch_batches, None)
            self.assertEqual(trainer.accelerator.even_batches, True)
            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)

    def test_accelerator_custom_state(self):
        AcceleratorState._reset_state(reset_partial_state=True)
        with tempfile.TemporaryDirectory() as tmp_dir:
            with self.assertRaises(ValueError) as cm:
                _ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
                self.assertIn("Please define this beforehand", str(cm.warnings[0].message))
            _ = Accelerator()
            _ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
        AcceleratorState._reset_state(reset_partial_state=True)

    def test_accelerator_config_from_dict_grad_accum_num_steps(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = RegressionModelConfig(a=1.5, b=2.5)
            model = RegressionPreTrainedModel(config)
            eval_dataset = SampleIterableDataset()

            # case - TrainingArguments.gradient_accumulation_steps == 1
            #      - gradient_accumulation_kwargs['num_steps'] == 1
            # results in grad accum set to 1
            args = RegressionTrainingArguments(
                output_dir=tmp_dir,
                gradient_accumulation_steps=1,
                accelerator_config={
                    "gradient_accumulation_kwargs": {
                        "num_steps": 1,
                    }
                },
            )
            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertEqual(trainer.accelerator.gradient_state.plugin_kwargs["num_steps"], 1)

            # case - TrainingArguments.gradient_accumulation_steps > 1
            #      - gradient_accumulation_kwargs['num_steps'] specified
            # results in exception raised
            args = RegressionTrainingArguments(
                output_dir=tmp_dir,
                gradient_accumulation_steps=2,
                accelerator_config={
                    "gradient_accumulation_kwargs": {
                        "num_steps": 10,
                    }
                },
            )
            with self.assertRaises(Exception) as context:
                trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
            self.assertTrue("The `AcceleratorConfig`'s `num_steps` is set but" in str(context.exception))

    def test_accelerator_config_not_instantiated(self):
        # Checks that passing the `AcceleratorConfig` class itself
        # (rather than an instance) raises a NotImplementedError
        with tempfile.TemporaryDirectory() as tmp_dir:
            with self.assertRaises(NotImplementedError) as context:
                _ = RegressionTrainingArguments(
                    output_dir=tmp_dir,
                    accelerator_config=AcceleratorConfig,
                )
            self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))

        # Now test with a custom subclass
        @dataclasses.dataclass
        class CustomAcceleratorConfig(AcceleratorConfig):
            pass

        @dataclasses.dataclass
        class CustomTrainingArguments(TrainingArguments):
            accelerator_config: dict = dataclasses.field(
                default=CustomAcceleratorConfig,
            )

        with tempfile.TemporaryDirectory() as tmp_dir:
            with self.assertRaises(NotImplementedError) as context:
                _ = CustomTrainingArguments(
                    output_dir=tmp_dir,
                )
            self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))
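
The gradient accumulation tests in this file (`test_accelerate_config_from_dataclass_grad_accum` and `test_accelerator_config_from_dict_grad_accum_num_steps`) suggest a simple precedence rule between `TrainingArguments.gradient_accumulation_steps` and `gradient_accumulation_kwargs["num_steps"]`. The sketch below is inferred only from those assertions. The helper `resolve_grad_accum_steps` and its error wording are hypothetical and are not the Trainer's actual code.

```python
from typing import Optional


def resolve_grad_accum_steps(args_steps: int = 1, num_steps: Optional[int] = None) -> int:
    """Hypothetical helper: pick the effective number of accumulation steps.

    args_steps stands in for TrainingArguments.gradient_accumulation_steps and
    num_steps for gradient_accumulation_kwargs["num_steps"].
    """
    if num_steps is None:
        # Nothing set in the accelerator config: the TrainingArguments value is used.
        return args_steps
    if args_steps > 1:
        # Both places set a value: the tests expect an exception in this case.
        raise ValueError(
            "num_steps is set in gradient_accumulation_kwargs but "
            "gradient_accumulation_steps is greater than 1"  # illustrative wording
        )
    # Only the accelerator config sets a value (args_steps is left at 1).
    return num_steps
```

Under these assumptions, `resolve_grad_accum_steps(1, 10)` yields 10, matching the dataclass test, while `resolve_grad_accum_steps(2, 10)` raises, matching the second case of the `num_steps` test.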
1343
tests/trainer/test_trainer_callback.py
Normal file
File diff suppressed because it is too large
2250
tests/trainer/test_trainer_checkpointing.py
Normal file
File diff suppressed because it is too large
870
tests/trainer/test_trainer_data.py
Normal file
@@ -0,0 +1,870 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Trainer data-related tests: dataloaders, samplers, sharding, label smoothing,
batch size finder, pad/concatenate, collators, and eval loop container.
"""

import copy
import tempfile
import unittest
import warnings

import numpy as np
import torch
from torch import nn

from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)
from transformers.data.data_collator import default_data_collator as _default_data_collator
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers.testing_utils import (
    TestCasePlus,
    backend_device_count,
    require_accelerate,
    require_torch,
    torch_device,
)
from transformers.tokenization_utils_base import BatchEncoding
from transformers.trainer_pt_utils import (
    DistributedLengthGroupedSampler,
    DistributedSamplerWithLoop,
    EvalLoopContainer,
    IterableDatasetShard,
    LabelSmoother,
    LengthGroupedSampler,
    ShardSampler,
    get_parameter_names,
    numpy_pad_and_concatenate,
    torch_pad_and_concatenate,
)
from transformers.trainer_utils import RemoveColumnsCollator, find_executable_batch_size

from .trainer_test_utils import (
    AlmostAccuracy,
    CustomDataloaderTrainer,
    DynamicShapesDataset,
    RegressionDataset,
    RegressionModel,
    RegressionModelConfig,
    RegressionPreTrainedModel,
    RegressionTrainingArguments,
    SampleIterableDataset,
    TrainerIntegrationCommon,
    TstLayer,
    get_regression_trainer,
)


class RandomIterableDataset(torch.utils.data.IterableDataset):
    # For testing, an iterable dataset of random length
    def __init__(self, p_stop=0.01, max_length=1000):
        self.p_stop = p_stop
        self.max_length = max_length
        self.generator = torch.Generator()

    def __iter__(self):
        count = 0
        stop = False
        while not stop and count < self.max_length:
            yield count
            count += 1
            number = torch.rand(1, generator=self.generator).item()
            stop = number < self.p_stop
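
`RandomIterableDataset` has a random length, so the sharding checks later in this file reseed its generator before each pass to obtain a reproducible reference. A dependency-free sketch of the same idea follows, using `random.Random` in place of `torch.Generator`. The class name `SeededStopDataset` is hypothetical and is not part of the test suite.

```python
import random


class SeededStopDataset:
    """Yields 0, 1, 2, ... and stops at a random point, like the dataset above."""

    def __init__(self, p_stop=0.01, max_length=1000):
        self.p_stop = p_stop
        self.max_length = max_length
        self.rng = random.Random()

    def __iter__(self):
        count = 0
        stop = False
        while not stop and count < self.max_length:
            yield count
            count += 1
            # Stop with probability p_stop after each yielded item.
            stop = self.rng.random() < self.p_stop


dataset = SeededStopDataset()
dataset.rng.seed(0)
first_pass = list(dataset)
dataset.rng.seed(0)  # same seed, so the second pass stops at the same point
second_pass = list(dataset)
```

Reseeding before every pass is what lets a test compare sharded output against a reference list taken from the same dataset object.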


# ---------------------------------------------------------------------------
# Dataloader tests
# ---------------------------------------------------------------------------


@require_torch
class TrainerDataloaderTest(TestCasePlus):
    """Tests for train/eval dataloaders, drop_last, persistent workers."""

    def test_train_and_eval_dataloaders(self):
        if torch_device == "cuda":
            n_gpu = max(1, backend_device_count(torch_device))
        else:
            # DP is deprecated by PyTorch, and accelerators like XPU don't support DP
            n_gpu = 1

        tmp_dir = self.get_auto_remove_tmp_dir()
        trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16, output_dir=tmp_dir)
        self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
        trainer = get_regression_trainer(learning_rate=0.1, per_device_eval_batch_size=16, output_dir=tmp_dir)
        self.assertEqual(trainer.get_eval_dataloader().total_batch_size, 16 * n_gpu)

        # Check drop_last works
        trainer = get_regression_trainer(
            train_len=66,
            eval_len=74,
            learning_rate=0.1,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            output_dir=tmp_dir,
        )
        self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu) + 1)
        self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu) + 1)

        trainer = get_regression_trainer(
            train_len=66,
            eval_len=74,
            learning_rate=0.1,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            dataloader_drop_last=True,
            output_dir=tmp_dir,
        )
        self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu))
        self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu))

        # Check passing a new dataset for evaluation works
        new_eval_dataset = RegressionDataset(length=128)
        self.assertEqual(len(trainer.get_eval_dataloader(new_eval_dataset)), 128 // (32 * n_gpu))

    # tests that we do not require dataloader to have a .dataset attribute
    def test_dataloader_without_dataset(self):
        train_dataset = RegressionDataset(length=128)
        trainer = CustomDataloaderTrainer(
            model=RegressionModel(),
            train_dataset=train_dataset,
            eval_dataset=train_dataset,
            args=TrainingArguments(output_dir=self.get_auto_remove_tmp_dir()),
        )

        trainer.train()
        trainer.evaluate()

    def test_get_eval_dataloader_without_persistent_workers(self):
        train_dataset = RegressionDataset()
        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
        tiny_gpt2 = GPT2LMHeadModel(config)
        args = TrainingArguments(self.get_auto_remove_tmp_dir(), dataloader_persistent_workers=False)

        # Single evaluation dataset
        eval_dataset = RegressionDataset()
        trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
        trainer.accelerator.prepare = lambda x: x

        default_dataloader = trainer.get_eval_dataloader()
        dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)

        self.assertEqual(default_dataloader.dataset, eval_dataset)
        self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
        self.assertNotEqual(default_dataloader, dataloader_with_dataset)

        # Multiple evaluation datasets
        first_dataset = RegressionDataset()
        second_dataset = RegressionDataset()
        trainer = Trainer(
            tiny_gpt2,
            args,
            train_dataset=train_dataset,
            eval_dataset={"first": first_dataset, "second": second_dataset},
        )
        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
        trainer.accelerator.prepare = lambda x: x

        first_dataloader = trainer.get_eval_dataloader("first")
        first_dataloader_repeated = trainer.get_eval_dataloader("first")
        second_dataloader = trainer.get_eval_dataloader("second")
        second_dataloader_repeated = trainer.get_eval_dataloader("second")

        self.assertEqual(first_dataset, first_dataloader.dataset)
        self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
        self.assertEqual(second_dataset, second_dataloader.dataset)
        self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
        self.assertNotEqual(first_dataloader, first_dataloader_repeated)
        self.assertNotEqual(second_dataloader, second_dataloader_repeated)

    def test_get_eval_dataloader_with_persistent_workers(self):
        train_dataset = RegressionDataset()
        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
        tiny_gpt2 = GPT2LMHeadModel(config)
        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(),
            dataloader_persistent_workers=True,
            dataloader_num_workers=2,
        )

        # Single evaluation dataset
        eval_dataset = RegressionDataset()
        trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
        trainer.accelerator.prepare = lambda x: x

        default_dataloader = trainer.get_eval_dataloader()
        dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)

        self.assertEqual(default_dataloader.dataset, eval_dataset)
        self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
        self.assertEqual(default_dataloader, dataloader_with_dataset)

        # Multiple evaluation datasets
        first_dataset = RegressionDataset()
        second_dataset = RegressionDataset()
        trainer = Trainer(
            tiny_gpt2,
            args,
            train_dataset=train_dataset,
            eval_dataset={"first": first_dataset, "second": second_dataset},
        )
        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
        trainer.accelerator.prepare = lambda x: x

        first_dataloader = trainer.get_eval_dataloader("first")
        first_dataloader_repeated = trainer.get_eval_dataloader("first")
        second_dataloader = trainer.get_eval_dataloader("second")
        second_dataloader_repeated = trainer.get_eval_dataloader("second")

        self.assertEqual(first_dataset, first_dataloader.dataset)
        self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
        self.assertEqual(second_dataset, second_dataloader.dataset)
        self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
        self.assertEqual(first_dataloader, first_dataloader_repeated)
        self.assertEqual(second_dataloader, second_dataloader_repeated)
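
The `drop_last` assertions in `test_train_and_eval_dataloaders` rest on simple batch-count arithmetic. A minimal sketch of that arithmetic is shown here, assuming the effective batch size is `per_device_batch_size * n_gpu` as in the test. The helper `expected_num_batches` is hypothetical and exists only for illustration.

```python
import math


def expected_num_batches(num_samples, per_device_batch_size, n_gpu=1, drop_last=False):
    """Number of batches a dataloader yields for a dataset of num_samples items."""
    total_batch_size = per_device_batch_size * n_gpu
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // total_batch_size
    # The trailing partial batch is kept as one extra, smaller batch.
    return math.ceil(num_samples / total_batch_size)
```

With one device, 66 training samples at batch size 16 give 5 batches (4 with `drop_last=True`), and 74 evaluation samples at batch size 32 give 3 batches (2 with `drop_last=True`), which are the values the test checks.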


# ---------------------------------------------------------------------------
# Label smoothing tests
# ---------------------------------------------------------------------------


@require_torch
class TrainerLabelSmoothingTest(unittest.TestCase):
    """Tests for label smoothing and its interaction with multi-label classification."""

    def test_label_smoothing(self):
        epsilon = 0.1
        num_labels = 12
        random_logits = torch.randn(4, 5, num_labels)
        random_labels = torch.randint(0, num_labels, (4, 5))
        loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
        model_output = SequenceClassifierOutput(logits=random_logits)
        label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
        log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
        expected_loss = (1 - epsilon) * loss + epsilon * log_probs.mean()
        torch.testing.assert_close(label_smoothed_loss, expected_loss)

        # With a few -100 labels
        random_labels[0, 1] = -100
        random_labels[2, 1] = -100
        random_labels[2, 3] = -100

        loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
        model_output = SequenceClassifierOutput(logits=random_logits)
        label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
        log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
        # Mask the log probs with the -100 labels
        log_probs[0, 1] = 0.0
        log_probs[2, 1] = 0.0
        log_probs[2, 3] = 0.0
        expected_loss = (1 - epsilon) * loss + epsilon * log_probs.sum() / (num_labels * 17)
        torch.testing.assert_close(label_smoothed_loss, expected_loss)

    def test_label_smoothing_multi_label_incompatibility(self):
        """Test that Trainer warns and disables label smoothing for multi-label classification"""

        # Mock model config with multi-label classification
        class MockConfig:
            problem_type = "multi_label_classification"

        class MockModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.config = MockConfig()
                self.linear = nn.Linear(10, 3)

            def forward(self, **kwargs):
                return {"logits": torch.randn(2, 3)}

        model = MockModel()

        # Create training args with label smoothing
        training_args = TrainingArguments(
            output_dir="./test-trainer",
            label_smoothing_factor=0.1,
            per_device_train_batch_size=2,
            num_train_epochs=1,
        )

        # Should warn and disable label smoothing
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            trainer = Trainer(model=model, args=training_args)

            # Check warning was issued
            self.assertEqual(len(w), 1)
            self.assertIn("Label smoothing is not compatible with multi-label classification", str(w[0].message))

            # Check label_smoother was disabled
            self.assertIsNone(trainer.label_smoother)
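
The expected value in `test_label_smoothing` combines the ordinary cross-entropy with a uniform term averaged over all classes, skipping positions labelled `-100`. A dependency-free sketch of that formula is given below, written to mirror the test's expected value rather than the `LabelSmoother` source. The function names are hypothetical.

```python
import math

IGNORE_INDEX = -100


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def label_smoothed_loss(batch_logits, labels, epsilon=0.1):
    """(1 - eps) * NLL + eps * mean over classes of -log p, averaged over active positions."""
    nll_sum = 0.0
    smooth_sum = 0.0
    active = 0
    for logits, label in zip(batch_logits, labels):
        if label == IGNORE_INDEX:
            continue  # ignored positions contribute to neither term
        neg_log_probs = [-lp for lp in log_softmax(logits)]
        nll_sum += neg_log_probs[label]
        smooth_sum += sum(neg_log_probs) / len(logits)
        active += 1
    return (1 - epsilon) * nll_sum / active + epsilon * smooth_sum / active
```

In the test, 20 positions minus 3 ignored ones leave 17 active positions, which is where the `num_labels * 17` divisor in the second expected loss comes from.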


# ---------------------------------------------------------------------------
# Sampler and sharding tests
# ---------------------------------------------------------------------------


@require_torch
class TrainerSamplerTest(unittest.TestCase):
    """Tests for length-grouped samplers, distributed samplers, iterable dataset sharding, and shard samplers."""

    def test_group_by_length(self):
        # Get some inputs of random lengths
        lengths = torch.randint(0, 25, (100,)).tolist()
        # Put one bigger than the others to check it ends up in first position
        lengths[32] = 50

        indices = list(LengthGroupedSampler(4, lengths=lengths))
        # The biggest element should be first
        self.assertEqual(lengths[indices[0]], 50)
        # The indices should be a permutation of range(100)
        self.assertEqual(sorted(indices), list(range(100)))

    def test_group_by_length_with_dict(self):
        # Get some inputs of random lengths
        data = []
        for _ in range(6):
            input_ids = torch.randint(0, 25, (100,)).tolist()
            data.append({"input_ids": input_ids})
        # Put one bigger than the others to check it ends up in first position
        data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()

        indices = list(LengthGroupedSampler(4, dataset=data))
        # The biggest element should be first
        self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
        # The indices should be a permutation of range(6)
        self.assertEqual(sorted(indices), list(range(6)))

    def test_group_by_length_with_batch_encoding(self):
        # Get some inputs of random lengths
        data = []
        for _ in range(6):
            input_ids = torch.randint(0, 25, (100,)).tolist()
            data.append(BatchEncoding({"input_ids": input_ids}))
        # Put one bigger than the others to check it ends up in first position
        data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()

        indices = list(LengthGroupedSampler(4, dataset=data))
        # The biggest element should be first
        self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
        # The indices should be a permutation of range(6)
        self.assertEqual(sorted(indices), list(range(6)))

    def test_distributed_length_grouped(self):
        # Get some inputs of random lengths
        lengths = torch.randint(0, 25, (100,)).tolist()
        # Put one bigger than the others to check it ends up in first position
        lengths[32] = 50

        indices_process_0 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=0, lengths=lengths))
        indices_process_1 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=1, lengths=lengths))
        # The biggest element should be first
        self.assertEqual(lengths[indices_process_0[0]], 50)
        # The indices of both processes together should be a permutation of range(100)
        self.assertEqual(sorted(indices_process_0 + indices_process_1), list(range(100)))

    def test_distributed_sampler_with_loop(self):
        batch_size = 16
        for length in [23, 64, 123]:
            dataset = list(range(length))
            shard1 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=0)
            shard2 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=1)

            # Set seeds
            shard1.set_epoch(0)
            shard2.set_epoch(0)

            # Sample
            samples1 = list(shard1)
            samples2 = list(shard2)

            self.assertTrue(len(samples1) % batch_size == 0)
            self.assertTrue(len(samples2) % batch_size == 0)

            total = []
            for sample1, sample2 in zip(samples1, samples2):
                total += [sample1, sample2]

            self.assertEqual(set(total[:length]), set(dataset))
            self.assertEqual(set(total[length:]), set(total[: (len(total) - length)]))

    def check_iterable_dataset_shard(self, dataset, batch_size, drop_last, num_processes=2, epoch=0):
        # Set the seed for the base dataset to get the proper reference.
        dataset.generator.manual_seed(epoch)
        reference = list(dataset)

        shards = [
            IterableDatasetShard(
                dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
            )
            for i in range(num_processes)
        ]
        for shard in shards:
            shard.set_epoch(epoch)
        shard_lists = [list(shard) for shard in shards]

        for shard in shard_lists:
            # All shards have a number of samples that is a round multiple of batch size
            self.assertTrue(len(shard) % batch_size == 0)
            # All shards have the same number of samples
            self.assertEqual(len(shard), len(shard_lists[0]))

        for shard in shards:
            # All shards know the total number of samples
            self.assertEqual(shard.num_examples, len(reference))

        observed = []
        for idx in range(0, len(shard_lists[0]), batch_size):
            for shard in shard_lists:
                observed += shard[idx : idx + batch_size]

        # If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
        # batch_size
        if not drop_last:
            while len(reference) < len(observed):
                reference += reference
        self.assertListEqual(observed, reference[: len(observed)])

        # Check equivalence between IterableDataset and ShardSampler
        dataset.generator.manual_seed(epoch)
        reference = list(dataset)

        sampler_shards = [
            ShardSampler(
                reference, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
            )
            for i in range(num_processes)
        ]
        for shard, sampler_shard in zip(shard_lists, sampler_shards):
            self.assertListEqual(shard, list(sampler_shard))

    def test_iterable_dataset_shard(self):
        dataset = RandomIterableDataset()

        self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=2, epoch=0)
        self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=2, epoch=0)

        self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=3, epoch=42)
        self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=3, epoch=42)

    def test_iterable_dataset_shard_with_length(self):
        sampler_shards = [
            IterableDatasetShard(list(range(100)), batch_size=4, drop_last=True, num_processes=2, process_index=i)
            for i in range(2)
        ]

        # Build expected shards: each process will have batches of size 4 until there are not enough elements to
        # form two full batches (so we stop at 96 = (100 // (4 * 2)) * (4 * 2))
        expected_shards = [[], []]
        current_shard = 0
        for i in range(0, 96, 4):
            expected_shards[current_shard].extend(list(range(i, i + 4)))
            current_shard = 1 - current_shard

        self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
        self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])

        sampler_shards = [
            IterableDatasetShard(list(range(100)), batch_size=4, drop_last=False, num_processes=2, process_index=i)
            for i in range(2)
        ]
        # When drop_last=False, we get two last full batches by looping back to the beginning.
        expected_shards[0].extend(list(range(96, 100)))
        expected_shards[1].extend(list(range(0, 4)))

        self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
        self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])
|
||||
|
||||
def check_shard_sampler(self, dataset, batch_size, drop_last, num_processes=2):
|
||||
shards = [
|
||||
ShardSampler(
|
||||
dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
|
||||
)
|
||||
for i in range(num_processes)
|
||||
]
|
||||
shard_lists = [list(shard) for shard in shards]
|
||||
|
||||
for shard in shard_lists:
|
||||
# All shards have a number of samples that is a round multiple of batch size
|
||||
self.assertTrue(len(shard) % batch_size == 0)
|
||||
# All shards have the same number of samples
|
||||
self.assertEqual(len(shard), len(shard_lists[0]))
|
||||
|
||||
observed = []
|
||||
for idx in range(0, len(shard_lists[0]), batch_size):
|
||||
for shard in shard_lists:
|
||||
observed += shard[idx : idx + batch_size]
|
||||
|
||||
# If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
|
||||
# batch_size
|
||||
reference = copy.copy(dataset)
|
||||
if not drop_last:
|
||||
while len(reference) < len(observed):
|
||||
reference += reference
|
||||
self.assertListEqual(observed, reference[: len(observed)])
|
||||
|
||||
def test_shard_sampler(self):
|
||||
for n_elements in [64, 123]:
|
||||
dataset = list(range(n_elements))
|
||||
|
||||
self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=2)
|
||||
self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=2)
|
||||
|
||||
self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=3)
|
||||
self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=3)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Batch size finder tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@require_torch
|
||||
class TrainerBatchSizeFinderTest(unittest.TestCase):
|
||||
"""Tests for the auto batch size finder (find_executable_batch_size)."""
|
||||
|
||||
@require_accelerate
|
||||
def test_executable_batch_size(self):
|
||||
batch_sizes = []
|
||||
|
||||
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=True)
|
||||
def mock_training_loop_function(batch_size):
|
||||
nonlocal batch_sizes
|
||||
batch_sizes.append(batch_size)
|
||||
if batch_size > 16:
|
||||
raise RuntimeError("CUDA out of memory.")
|
||||
|
||||
mock_training_loop_function()
|
||||
self.assertEqual(batch_sizes, [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15])
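# The expected list in the assertion above is consistent with shrinking the batch size by 10%
# (rounded down) after each out-of-memory failure. The sketch below reproduces that sequence under
# this assumption; `batch_sizes_tried` is a hypothetical helper, not part of the library, and the
# real decorator may differ in its details.

```python
def batch_sizes_tried(starting_batch_size, largest_that_fits):
    """Return every batch size attempted until one fits (assumed 10% backoff, rounded down)."""
    tried = []
    batch_size = starting_batch_size
    while batch_size > 0:
        tried.append(batch_size)
        if batch_size <= largest_that_fits:
            break  # this attempt would not raise an out-of-memory error
        # Integer arithmetic avoids floating-point rounding surprises.
        batch_size = batch_size * 9 // 10
    return tried


sequence = batch_sizes_tried(64, largest_that_fits=16)
```

# Starting at 64 with 16 as the largest size that fits, the last failing attempt is 17 and the
# first successful one is 15.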
|
||||
|
||||
@require_accelerate
|
||||
def test_executable_batch_size_no_search(self):
|
||||
batch_sizes = []
|
||||
|
||||
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
|
||||
def mock_training_loop_function(batch_size):
|
||||
nonlocal batch_sizes
|
||||
batch_sizes.append(batch_size)
|
||||
|
||||
mock_training_loop_function()
|
||||
self.assertEqual(batch_sizes, [64])
|
||||
|
||||
@require_accelerate
|
||||
def test_executable_batch_size_with_error(self):
|
||||
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
|
||||
def mock_training_loop_function(batch_size):
|
||||
raise RuntimeError("CUDA out of memory.")
|
||||
|
||||
with self.assertRaises(RuntimeError) as cm:
|
||||
mock_training_loop_function()
|
||||
self.assertEqual("CUDA out of memory.", cm.exception.args[0])
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data utility tests (parameter names, pad/concat, collators, eval loop container)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@require_torch
|
||||
class TrainerDataUtilsTest(unittest.TestCase):
|
||||
"""Tests for get_parameter_names, pad_and_concatenate, RemoveColumnsCollator, and EvalLoopContainer."""
|
||||
|
||||
def test_get_parameter_names(self):
|
||||
model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
|
||||
# fmt: off
|
||||
self.assertEqual(
|
||||
get_parameter_names(model, [nn.LayerNorm]),
|
||||
['0.linear1.weight', '0.linear1.bias', '0.linear2.weight', '0.linear2.bias', '0.bias', '1.0.linear1.weight', '1.0.linear1.bias', '1.0.linear2.weight', '1.0.linear2.bias', '1.0.bias', '1.1.linear1.weight', '1.1.linear1.bias', '1.1.linear2.weight', '1.1.linear2.bias', '1.1.bias']
|
||||
)
|
||||
# fmt: on
|
||||
|
||||
def test_get_parameter_names_rmsnorm(self):
|
||||
class RMSNorm(nn.Module):
|
||||
def __init__(self, hidden_size):
|
||||
super().__init__()
|
||||
self.weight = nn.Parameter(torch.ones(hidden_size))
|
||||
self.bias = nn.Parameter(torch.zeros(hidden_size))
|
||||
|
||||
class ModelWithRMSNorm(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.linear = nn.Linear(128, 128)
|
||||
self.rmsnorm = RMSNorm(128)
|
||||
self.bias = nn.Parameter(torch.zeros(128))
|
||||
|
||||
model = ModelWithRMSNorm()
|
||||
# Test both type-based and name-based filtering
|
||||
decay_parameters = get_parameter_names(model, [], ["bias", "rmsnorm"])
|
||||
|
||||
# Parameters that should be in weight decay
|
||||
self.assertIn("linear.weight", decay_parameters)
|
||||
|
||||
# Parameters that should NOT be in weight decay
|
||||
self.assertNotIn("linear.bias", decay_parameters)
|
||||
self.assertNotIn("rmsnorm.weight", decay_parameters)
|
||||
self.assertNotIn("rmsnorm.bias", decay_parameters)
|
||||
self.assertNotIn("bias", decay_parameters)
|
||||
|
||||
def test_pad_and_concatenate_with_1d(self):
|
||||
"""Tests whether pad_and_concatenate works with scalars."""
|
||||
array1 = 1.0
|
||||
array2 = 2.0
|
||||
result = numpy_pad_and_concatenate(array1, array2)
|
||||
self.assertTrue(np.array_equal(np.array([1.0, 2.0]), result))
|
||||
|
||||
tensor1 = torch.tensor(1.0)
|
||||
tensor2 = torch.tensor(2.0)
|
||||
result = torch_pad_and_concatenate(tensor1, tensor2)
|
||||
self.assertTrue(torch.equal(result, torch.Tensor([1.0, 2.0])))
|
||||
|
||||
def test_remove_columns_collator(self):
|
||||
class MockLogger:
|
||||
def __init__(self) -> None:
|
||||
self.called = 0
|
||||
|
||||
def info(self, msg):
|
||||
self.called += 1
|
||||
self.last_msg = msg
|
||||
|
||||
data_batch = [
|
||||
{"col1": 1, "col2": 2, "col3": 3},
|
||||
{"col1": 1, "col2": 2, "col3": 3},
|
||||
]
|
||||
logger = MockLogger()
|
||||
remove_columns_collator = RemoveColumnsCollator(
|
||||
_default_data_collator, ["col1", "col2"], logger, "model", "training"
|
||||
)
|
||||
|
||||
self.assertNotIn("col3", remove_columns_collator(data_batch))
|
||||
# check that the logging message is printed out only once
|
||||
remove_columns_collator(data_batch)
|
||||
remove_columns_collator(data_batch)
|
||||
self.assertEqual(logger.called, 1)
|
||||
self.assertIn("col3", logger.last_msg)
|
||||
|
||||
def test_eval_loop_container(self):
|
||||
batch_1 = [
|
||||
torch.ones([8, 5]),
|
||||
{"loss": torch.tensor(1.0)},
|
||||
(torch.ones([8, 2, 3]), torch.ones([8, 2])),
|
||||
]
|
||||
batch_2 = [
|
||||
torch.ones([4, 5]),
|
||||
{"loss": torch.tensor(2.0)},
|
||||
(torch.ones([4, 2, 3]), torch.ones([4, 6])),
|
||||
]
|
||||
|
||||
concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
|
||||
concat_container.add(batch_1)
|
||||
concat_container.add(batch_2)
|
||||
concat_container.to_cpu_and_numpy()
|
||||
arrays = concat_container.get_arrays()
|
||||
|
||||
# Test two nested batches concatenation
|
||||
self.assertIsInstance(arrays, list)
|
||||
self.assertEqual(len(arrays), 3)
|
||||
self.assertIsInstance(arrays[0], np.ndarray)
|
||||
self.assertEqual(arrays[0].shape, (12, 5))
|
||||
self.assertIsInstance(arrays[1], dict)
|
||||
self.assertIsInstance(arrays[1]["loss"], np.ndarray)
|
||||
self.assertEqual(arrays[1]["loss"].shape, (2,))
|
||||
self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0, 2.0])))
|
||||
self.assertIsInstance(arrays[2], tuple)
|
||||
self.assertEqual(len(arrays[2]), 2)
|
||||
self.assertEqual(arrays[2][0].shape, (12, 2, 3))
|
||||
self.assertEqual(arrays[2][1].shape, (12, 6))
|
||||
# check that the first batch is padded with padding index -100 after concatenation
|
||||
self.assertEqual(arrays[2][1][0][2], -100)
|
||||
|
||||
# Test two batches with no concatenation
|
||||
list_container = EvalLoopContainer(do_nested_concat=False)
|
||||
list_container.add(batch_1)
|
||||
list_container.add(batch_2)
|
||||
list_container.to_cpu_and_numpy()
|
||||
arrays = list_container.get_arrays()
|
||||
|
||||
self.assertEqual(len(arrays), 2)
|
||||
self.assertIsInstance(arrays, list)
|
||||
np_batch_1, np_batch_2 = arrays
|
||||
|
||||
self.assertIsInstance(np_batch_1, list)
|
||||
self.assertEqual(len(np_batch_1), 3)
|
||||
self.assertIsInstance(np_batch_1[0], np.ndarray)
|
||||
self.assertIsInstance(np_batch_1[1], dict)
|
||||
self.assertIsInstance(np_batch_1[2], tuple)
|
||||
self.assertEqual(np_batch_1[0].shape, (8, 5))
|
||||
self.assertEqual(np_batch_1[1]["loss"].shape, ())
|
||||
self.assertEqual(np_batch_1[2][0].shape, (8, 2, 3))
|
||||
self.assertEqual(np_batch_1[2][1].shape, (8, 2))
|
||||
|
||||
self.assertIsInstance(np_batch_2, list)
|
||||
self.assertEqual(len(np_batch_2), 3)
|
||||
self.assertIsInstance(np_batch_2[0], np.ndarray)
|
||||
self.assertIsInstance(np_batch_2[1], dict)
|
||||
self.assertIsInstance(np_batch_2[2], tuple)
|
||||
self.assertEqual(np_batch_2[0].shape, (4, 5))
|
||||
self.assertEqual(np_batch_2[1]["loss"].shape, ())
|
||||
self.assertEqual(np_batch_2[2][0].shape, (4, 2, 3))
|
||||
self.assertEqual(np_batch_2[2][1].shape, (4, 6))
|
||||
|
||||
# Test no batches
|
||||
none_arr = EvalLoopContainer(do_nested_concat=True, padding_index=-100).get_arrays()
|
||||
self.assertIsNone(none_arr)
|
||||
|
||||
none_arr = EvalLoopContainer(do_nested_concat=False).get_arrays()
|
||||
self.assertIsNone(none_arr)
|
||||
|
||||
# Test one batch
|
||||
concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
|
||||
concat_container.add(batch_1)
|
||||
arrays = concat_container.get_arrays()
|
||||
self.assertIsInstance(arrays, list)
|
||||
self.assertEqual(len(arrays), 3)
|
||||
self.assertIsInstance(arrays[0], np.ndarray)
|
||||
self.assertEqual(arrays[0].shape, (8, 5))
|
||||
self.assertIsInstance(arrays[1], dict)
|
||||
self.assertIsInstance(arrays[1]["loss"], np.ndarray)
|
||||
self.assertEqual(arrays[1]["loss"].shape, ())
|
||||
self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0])))
|
||||
self.assertIsInstance(arrays[2], tuple)
|
||||
self.assertEqual(len(arrays[2]), 2)
|
||||
self.assertEqual(arrays[2][0].shape, (8, 2, 3))
|
||||
self.assertEqual(arrays[2][1].shape, (8, 2))
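# The padding check in the concatenation case above (an (8, 2) batch joined with a (4, 6) batch gives
# (12, 6), with -100 filling the narrower rows) can be illustrated with nested lists. This is a
# simplified sketch for 2D data only; `pad_and_concatenate_rows` is a hypothetical helper, not the
# library function.

```python
def pad_and_concatenate_rows(rows_a, rows_b, padding_index=-100):
    """Concatenate two lists of rows, right-padding shorter rows to the widest row."""
    all_rows = rows_a + rows_b
    width = max(len(row) for row in all_rows)
    return [list(row) + [padding_index] * (width - len(row)) for row in all_rows]


first = [[1.0] * 2 for _ in range(8)]   # shape (8, 2)
second = [[1.0] * 6 for _ in range(4)]  # shape (4, 6)
merged = pad_and_concatenate_rows(first, second)
```

# Rows from the first batch keep their two values and are padded to width 6, so index 2 of the first
# row holds the padding value.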
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Dynamic shapes and iterable dataset tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@require_torch
|
||||
class TrainerDynamicShapesAndIterableTest(TestCasePlus, TrainerIntegrationCommon):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
args = TrainingArguments("..")
|
||||
self.n_epochs = args.num_train_epochs
|
||||
self.batch_size = args.train_batch_size
|
||||
|
||||
def test_dynamic_shapes(self):
|
||||
eval_dataset = DynamicShapesDataset(batch_size=self.batch_size)
|
||||
model = RegressionModel(a=2, b=1)
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = TrainingArguments(tmp_dir)
|
||||
trainer = Trainer(model, args, eval_dataset=eval_dataset)
|
||||
|
||||
# Check evaluation can run to completion
|
||||
_ = trainer.evaluate()
|
||||
|
||||
# Check predictions
|
||||
preds = trainer.predict(eval_dataset)
|
||||
for expected, seen in zip(eval_dataset.ys, preds.label_ids):
|
||||
self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
|
||||
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
|
||||
|
||||
for expected, seen in zip(eval_dataset.xs, preds.predictions):
|
||||
self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
|
||||
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
|
||||
|
||||
# Same tests with eval accumulation
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = TrainingArguments(tmp_dir, eval_accumulation_steps=2)
|
||||
trainer = Trainer(model, args, eval_dataset=eval_dataset)
|
||||
|
||||
# Check evaluation can run to completion
|
||||
_ = trainer.evaluate()
|
||||
|
||||
# Check predictions
|
||||
preds = trainer.predict(eval_dataset)
|
||||
for expected, seen in zip(eval_dataset.ys, preds.label_ids):
|
||||
self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
|
||||
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
|
||||
|
||||
for expected, seen in zip(eval_dataset.xs, preds.predictions):
|
||||
self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
|
||||
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
|
||||
|
||||
def test_training_iterable_dataset(self):
|
||||
config = RegressionModelConfig()
|
||||
model = RegressionPreTrainedModel(config)
|
||||
# Adding one column not used by the model should have no impact
|
||||
train_dataset = SampleIterableDataset(label_names=["labels", "extra"])
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = RegressionTrainingArguments(output_dir=tmp_dir, max_steps=4)
|
||||
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
|
||||
trainer.train()
|
||||
self.assertEqual(trainer.state.global_step, 4)
|
||||
|
||||
loader = trainer.get_train_dataloader()
|
||||
self.assertIsInstance(loader, torch.utils.data.DataLoader)
|
||||
self.assertIsInstance(loader.sampler, torch.utils.data.dataloader._InfiniteConstantSampler)
|
||||
|
||||
def test_evaluation_iterable_dataset(self):
|
||||
config = RegressionModelConfig(a=1.5, b=2.5)
|
||||
model = RegressionPreTrainedModel(config)
|
||||
# RegressionPreTrainedModel accepts **kwargs but doesn't actually use num_items_in_batch,
|
||||
# so disable the loss scaling that assumes the model handles token-level averaging.
|
||||
model.accepts_loss_kwargs = False
|
||||
# Adding one column not used by the model should have no impact
|
||||
eval_dataset = SampleIterableDataset(label_names=["labels", "extra"])
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = RegressionTrainingArguments(output_dir=tmp_dir)
|
||||
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
|
||||
results = trainer.evaluate()
|
||||
x, y = trainer.eval_dataset.dataset.x, trainer.eval_dataset.dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
eval_dataset = SampleIterableDataset(length=66)
|
||||
results = trainer.evaluate(eval_dataset)
|
||||
|
||||
x, y = eval_dataset.dataset.x, eval_dataset.dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
def test_predict_iterable_dataset(self):
|
||||
config = RegressionModelConfig(a=1.5, b=2.5)
|
||||
model = RegressionPreTrainedModel(config)
|
||||
eval_dataset = SampleIterableDataset()
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = RegressionTrainingArguments(output_dir=tmp_dir)
|
||||
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
|
||||
preds = trainer.predict(trainer.eval_dataset).predictions
|
||||
x = eval_dataset.dataset.x
|
||||
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
# Adding one column not used by the model should have no impact
|
||||
test_dataset = SampleIterableDataset(length=66, label_names=["labels", "extra"])
|
||||
preds = trainer.predict(test_dataset).predictions
|
||||
x = test_dataset.dataset.x
|
||||
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
|
||||
519
tests/trainer/test_trainer_evaluation.py
Normal file
@@ -0,0 +1,519 @@
|
||||
# Copyright 2018 the HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""
|
||||
Trainer evaluation and prediction tests: evaluate, predict, batched metrics, dynamic shapes,
|
||||
iterable datasets, early stopping, FP16/BF16 full eval memory, torch.compile, and MRPC/LM eval.
|
||||
"""
|
||||
|
||||
import gc
|
||||
import tempfile
|
||||
|
||||
import numpy as np
|
||||
|
||||
from transformers import (
|
||||
AutoTokenizer,
|
||||
TrainingArguments,
|
||||
is_torch_available,
|
||||
)
|
||||
from transformers.testing_utils import (
|
||||
TestCasePlus,
|
||||
backend_device_count,
|
||||
get_tests_dir,
|
||||
require_torch,
|
||||
require_torch_accelerator,
|
||||
require_torch_bf16,
|
||||
require_torch_fp16,
|
||||
slow,
|
||||
torch_device,
|
||||
)
|
||||
|
||||
from .trainer_test_utils import (
|
||||
PATH_SAMPLE_TEXT,
|
||||
AlmostAccuracy,
|
||||
AlmostAccuracyBatched,
|
||||
RegressionDataset,
|
||||
RegressionDictModel,
|
||||
TrainerIntegrationCommon,
|
||||
get_dataset,
|
||||
get_regression_trainer,
|
||||
)
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import (
|
||||
AutoModelForCausalLM,
|
||||
AutoModelForSequenceClassification,
|
||||
GlueDataset,
|
||||
GlueDataTrainingArguments,
|
||||
Trainer,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core evaluate / predict tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@require_torch
|
||||
class TrainerEvaluationTest(TestCasePlus, TrainerIntegrationCommon):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
args = TrainingArguments("..")
|
||||
self.n_epochs = args.num_train_epochs
|
||||
self.batch_size = args.train_batch_size
|
||||
|
||||
def test_evaluate(self):
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
trainer = get_regression_trainer(a=1.5, b=2.5, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5, b=2.5, eval_len=66, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir
|
||||
)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
# With logits preprocess
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
compute_metrics=AlmostAccuracy(),
|
||||
preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
def test_predict(self):
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
trainer = get_regression_trainer(a=1.5, b=2.5, output_dir=tmp_dir)
|
||||
preds = trainer.predict(trainer.eval_dataset).predictions
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
trainer = get_regression_trainer(a=1.5, b=2.5, eval_len=66, output_dir=tmp_dir)
|
||||
preds = trainer.predict(trainer.eval_dataset).predictions
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
|
||||
|
||||
# With more than one output of the model
|
||||
trainer = get_regression_trainer(a=1.5, b=2.5, double_output=True, output_dir=tmp_dir)
|
||||
preds = trainer.predict(trainer.eval_dataset).predictions
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertEqual(len(preds), 2)
|
||||
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
|
||||
|
||||
# With more than one output/label of the model
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5, b=2.5, double_output=True, label_names=["labels", "labels_2"], output_dir=tmp_dir
|
||||
)
|
||||
outputs = trainer.predict(trainer.eval_dataset)
|
||||
preds = outputs.predictions
|
||||
labels = outputs.label_ids
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertEqual(len(preds), 2)
|
||||
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
|
||||
self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
|
||||
|
||||
def test_train_and_predict_loss_parity(self):
|
||||
"""
|
||||
Tests that the loss computed during a training_step is the same as the one computed during prediction_step
for the same inputs.
|
||||
"""
|
||||
model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")
|
||||
# Create a dummy batch of inputs
|
||||
inputs = {}
|
||||
inputs["input_ids"] = []
|
||||
for row_ind in range(4):
|
||||
seq_len = torch.randint(32, 64, (1,)).item()
|
||||
x = torch.randint(1, 100, (seq_len,))
|
||||
inputs["input_ids"].append(x)
|
||||
inputs["input_ids"] = torch.nn.utils.rnn.pad_sequence(inputs["input_ids"], batch_first=True, padding_value=0)
|
||||
inputs["labels"] = inputs["input_ids"].clone()
|
||||
inputs["labels"][inputs["input_ids"] == 0] = -100
|
||||
num_items_in_batch = inputs["labels"][..., 1:].ne(-100).sum().item()
|
||||
|
||||
def custom_loss_func(outputs, labels, num_items_in_batch=None):
|
||||
logits = outputs["logits"]
|
||||
loss_fct = torch.nn.CrossEntropyLoss()
|
||||
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
|
||||
if num_items_in_batch is not None:
|
||||
return loss / num_items_in_batch  # normalize by the number of items in the batch
|
||||
return loss
|
||||
|
||||
trainer = Trainer(model, train_dataset=None, compute_loss_func=custom_loss_func)
|
||||
|
||||
# compute the loss for the same batch through the training path and the prediction path
|
||||
train_loss = trainer.training_step(model, inputs, num_items_in_batch)
|
||||
predict_loss = trainer.prediction_step(model, inputs, prediction_loss_only=True)[0]
|
||||
|
||||
torch.testing.assert_close(train_loss, predict_loss, atol=1e-6, rtol=0)
|
||||
|
||||
def test_eval_use_gather_object(self):
|
||||
train_dataset = RegressionDataset()
|
||||
eval_dataset = RegressionDataset()
|
||||
model = RegressionDictModel()
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = TrainingArguments(tmp_dir, eval_use_gather_object=True)
|
||||
trainer = Trainer(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
|
||||
trainer.train()
|
||||
_ = trainer.evaluate()
|
||||
_ = trainer.predict(eval_dataset)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Batch eval metrics tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@require_torch
|
||||
class TrainerBatchEvalMetricsTest(TestCasePlus, TrainerIntegrationCommon):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
args = TrainingArguments("..")
|
||||
self.n_epochs = args.num_train_epochs
|
||||
self.batch_size = args.train_batch_size
|
||||
|
||||
def test_evaluate_with_batch_eval_metrics(self):
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
|
||||
)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
eval_len=66,
|
||||
compute_metrics=AlmostAccuracyBatched(),
|
||||
batch_eval_metrics=True,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
# With logits preprocess
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
compute_metrics=AlmostAccuracyBatched(),
|
||||
batch_eval_metrics=True,
|
||||
preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
results = trainer.evaluate()
|
||||
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
pred = 1.5 * x + 2.5
|
||||
expected_loss = ((pred - y) ** 2).mean()
|
||||
self.assertAlmostEqual(results["eval_loss"], expected_loss)
|
||||
expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
|
||||
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
|
||||
|
||||
def test_predict_with_batch_eval_metrics(self):
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
|
||||
)
|
||||
results = trainer.predict(trainer.eval_dataset)
|
||||
preds = results.predictions
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
gt = 1.5 * x + 2.5
|
||||
self.assertTrue(np.allclose(preds, gt))
|
||||
expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
|
||||
self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
|
||||
|
||||
# With a number of elements not a round multiple of the batch size
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
eval_len=66,
|
||||
compute_metrics=AlmostAccuracyBatched(),
|
||||
batch_eval_metrics=True,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
results = trainer.predict(trainer.eval_dataset)
|
||||
preds = results.predictions
|
||||
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
|
||||
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
|
||||
expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
|
||||
self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
|
||||
|
||||
# With more than one output of the model
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
double_output=True,
|
||||
compute_metrics=AlmostAccuracyBatched(),
|
||||
batch_eval_metrics=True,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
preds = trainer.predict(trainer.eval_dataset).predictions
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertEqual(len(preds), 2)
|
||||
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
|
||||
|
||||
# With more than one output/label of the model
|
||||
trainer = get_regression_trainer(
|
||||
a=1.5,
|
||||
b=2.5,
|
||||
double_output=True,
|
||||
label_names=["labels", "labels_2"],
|
||||
compute_metrics=AlmostAccuracyBatched(),
|
||||
batch_eval_metrics=True,
|
||||
output_dir=tmp_dir,
|
||||
)
|
||||
outputs = trainer.predict(trainer.eval_dataset)
|
||||
preds = outputs.predictions
|
||||
labels = outputs.label_ids
|
||||
x = trainer.eval_dataset.x
|
||||
self.assertEqual(len(preds), 2)
|
||||
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
|
||||
self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
|
||||
self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FP16 / BF16 full eval memory tests
|
||||
# ---------------------------------------------------------------------------


@require_torch
class TrainerFullEvalMemoryTest(TestCasePlus):
    @require_torch_fp16
    @require_torch_accelerator
    def test_fp16_full_eval(self):
        # this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
        # it's using pretty large safety margins, but small enough to detect broken functionality.
        debug = 0
        n_gpus = backend_device_count(torch_device)

        with tempfile.TemporaryDirectory() as tmp_dir:
            bs = 8
            eval_len = 16 * n_gpus
            # make the params somewhat big so that there will be enough RAM consumed to be able to
            # measure things. We should get about 64KB for a+b in fp32
            a = torch.ones(1000, bs) + 0.001
            b = torch.ones(1000, bs) - 0.001

            # 1. with fp16_full_eval disabled
            trainer = get_regression_trainer(
                a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
            )
            metrics = trainer.evaluate()
            del trainer
            gc.collect()

            fp32_init = metrics["init_mem_gpu_alloc_delta"]
            fp32_eval = metrics["eval_mem_gpu_alloc_delta"]

            if debug:
                print(f"fp32_init {fp32_init}")
                print(f"fp32_eval {fp32_eval}")

            # here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
            # perfect world: fp32_init == 64<<10
            self.assertGreater(fp32_init, 59_000)
            # after eval there should be no extra memory allocated - with a small margin (other than the peak
            # memory consumption for the forward calculation that gets recovered)
            # perfect world: fp32_eval == close to zero
            self.assertLess(fp32_eval, 5_000)

            # 2. with fp16_full_eval enabled
            trainer = get_regression_trainer(
                a=a, b=b, eval_len=eval_len, fp16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
            )
            metrics = trainer.evaluate()
            fp16_init = metrics["init_mem_gpu_alloc_delta"]
            fp16_eval = metrics["eval_mem_gpu_alloc_delta"]

            if debug:
                print(f"fp16_init {fp16_init}")
                print(f"fp16_eval {fp16_eval}")

            # here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
            # perfect world: fp16_init == close to zero
            self.assertLess(fp16_init, 5_000)
            # here we put the model on device in eval and only `half()` of it, i.e. about 32K (again we ignore
            # the peak margin which gets returned back)
            # perfect world: fp16_eval == 32<<10
            self.assertGreater(fp16_eval, 27_000)

            # 3. relative comparison fp32 vs full fp16
            # fp16_eval should be about half of fp32_init
            # perfect world: fp32_init/2 == fp16_eval
            self.assertAlmostEqual(fp16_eval, fp32_init / 2, delta=5_000)

    @require_torch_accelerator
    @require_torch_bf16
    def test_bf16_full_eval(self):
        # note: most of the logic is the same as test_fp16_full_eval

        # this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
        # it's using pretty large safety margins, but small enough to detect broken functionality.
        debug = 0
        n_gpus = backend_device_count(torch_device)

        bs = 8
        eval_len = 16 * n_gpus
        # make the params somewhat big so that there will be enough RAM consumed to be able to
        # measure things. We should get about 64KB for a+b in fp32
        a = torch.ones(1000, bs) + 0.001
        b = torch.ones(1000, bs) - 0.001

        with tempfile.TemporaryDirectory() as tmp_dir:
            # 1. with bf16_full_eval disabled
            trainer = get_regression_trainer(
                a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
            )
            metrics = trainer.evaluate()
            del trainer
            gc.collect()

            fp32_init = metrics["init_mem_gpu_alloc_delta"]
            fp32_eval = metrics["eval_mem_gpu_alloc_delta"]

            if debug:
                print(f"fp32_init {fp32_init}")
                print(f"fp32_eval {fp32_eval}")

            # here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
            # perfect world: fp32_init == 64<<10
            self.assertGreater(fp32_init, 59_000)
            # after eval there should be no extra memory allocated - with a small margin (other than the peak
            # memory consumption for the forward calculation that gets recovered)
            # perfect world: fp32_eval == close to zero
            self.assertLess(fp32_eval, 5_000)

            # 2. with bf16_full_eval enabled
            trainer = get_regression_trainer(
                a=a, b=b, eval_len=eval_len, bf16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
            )
            metrics = trainer.evaluate()
            bf16_init = metrics["init_mem_gpu_alloc_delta"]
            bf16_eval = metrics["eval_mem_gpu_alloc_delta"]

            if debug:
                print(f"bf16_init {bf16_init}")
                print(f"bf16_eval {bf16_eval}")

            # here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
            # perfect world: bf16_init == close to zero
            self.assertLess(bf16_init, 5_000)
            # here we put the model on device in eval and only a half-precision copy of it, i.e. about 32K
            # (again we ignore the peak margin which gets returned back)
            # perfect world: bf16_eval == 32<<10
            self.assertGreater(bf16_eval, 27_000)

            # 3. relative comparison fp32 vs full bf16
            # bf16_eval should be about half of fp32_init
            # perfect world: fp32_init/2 == bf16_eval
            self.assertAlmostEqual(bf16_eval, fp32_init / 2, delta=5_000)
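The thresholds asserted in the two full-eval tests follow from the size of the `a` and `b` parameters. The arithmetic can be sketched in plain Python as below. The helper `param_bytes` and the `BYTES_PER_ELEMENT` table are hypothetical and not part of the test suite; they assume 4 bytes per fp32 element and 2 bytes per fp16/bf16 element, and they ignore any rounding the device allocator may apply.

```python
# Hypothetical helper (not part of transformers): rough parameter memory arithmetic.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def param_bytes(shape, dtype):
    """Return the number of bytes a dense tensor of `shape` occupies in `dtype`."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * BYTES_PER_ELEMENT[dtype]


bs = 8
shape = (1000, bs)  # same shape as `a` and `b` in the tests

fp32_total = 2 * param_bytes(shape, "fp32")  # a + b kept in fp32
fp16_total = 2 * param_bytes(shape, "fp16")  # a + b after casting to half precision

print(fp32_total)  # 64000
print(fp16_total)  # 32000
```

Under these assumptions the lower bounds of 59_000 and 27_000 bytes each sit 5_000 bytes below the ideal value, which matches the `delta=5_000` used in the relative comparison.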


# ---------------------------------------------------------------------------
# Slow external model eval tests
# ---------------------------------------------------------------------------


@require_torch
class TrainerSlowEvalTest(TestCasePlus):
    @slow
    def test_trainer_eval_mrpc(self):
        MODEL_ID = "google-bert/bert-base-cased-finetuned-mrpc"
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
        data_args = GlueDataTrainingArguments(
            task_name="mrpc", data_dir=f"{get_tests_dir()}/fixtures/tests_samples/MRPC", overwrite_cache=True
        )
        eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")

        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = TrainingArguments(output_dir=tmp_dir, use_cpu=True)
            trainer = Trainer(model=model, args=training_args, eval_dataset=eval_dataset)
            result = trainer.evaluate()
            self.assertLess(result["eval_loss"], 0.2)

    @slow
    def test_trainer_eval_multiple(self):
        MODEL_ID = "openai-community/gpt2"
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

        dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = TrainingArguments(
                output_dir=tmp_dir,
                use_cpu=True,
                per_device_eval_batch_size=1,
            )
            trainer = Trainer(
                model=model,
                args=training_args,
                eval_dataset={
                    "data1": dataset,
                    "data2": dataset,
                },
            )
            result = trainer.evaluate()
            self.assertIn("eval_data1_loss", result)
            self.assertIn("eval_data2_loss", result)

    @slow
    def test_trainer_eval_lm(self):
        MODEL_ID = "distilbert/distilroberta-base"
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
        self.assertEqual(len(dataset), 31)
308
tests/trainer/test_trainer_hyperparameter.py
Normal file
@@ -0,0 +1,308 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Trainer hyperparameter search tests: Optuna (single/multi-objective, full eval),
Ray Tune (with client), W&B sweeps, and backend availability detection.
"""

import tempfile
import unittest

from transformers import TrainingArguments
from transformers.hyperparameter_search import ALL_HYPERPARAMETER_SEARCH_BACKENDS, HPSearchBackend
from transformers.testing_utils import require_optuna, require_ray, require_torch, require_wandb, torch_device
from transformers.trainer_utils import IntervalStrategy
from transformers.utils.hp_naming import TrialShortNamer

from .trainer_test_utils import (
    AlmostAccuracy,
    RegressionModelConfig,
    RegressionPreTrainedModel,
    get_regression_trainer,
)


@require_torch
@require_optuna
class TrainerHyperParameterOptunaIntegrationTest(unittest.TestCase):
    def setUp(self):
        args = TrainingArguments("..")
        self.n_epochs = args.num_train_epochs
        self.batch_size = args.train_batch_size

    def test_hyperparameter_search(self):
        class MyTrialShortNamer(TrialShortNamer):
            DEFAULTS = {"a": 0, "b": 0}

        def hp_space(trial):
            return {}

        def model_init(trial):
            if trial is not None:
                a = trial.suggest_int("a", -4, 4)
                b = trial.suggest_int("b", -4, 4)
            else:
                a = 0
                b = 0
            config = RegressionModelConfig(a=a, b=b, double_output=False)

            return RegressionPreTrainedModel(config).to(torch_device)

        def hp_name(trial):
            return MyTrialShortNamer.shortname(trial.params)

        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = get_regression_trainer(
                output_dir=tmp_dir,
                learning_rate=0.1,
                logging_steps=1,
                eval_strategy=IntervalStrategy.EPOCH,
                save_strategy=IntervalStrategy.EPOCH,
                num_train_epochs=4,
                disable_tqdm=True,
                load_best_model_at_end=True,
                run_name="test",
                model_init=model_init,
            )
            trainer.hyperparameter_search(direction="minimize", hp_space=hp_space, hp_name=hp_name, n_trials=4)


@require_torch
@require_optuna
class TrainerHyperParameterMultiObjectOptunaIntegrationTest(unittest.TestCase):
    def setUp(self):
        args = TrainingArguments("..")
        self.n_epochs = args.num_train_epochs
        self.batch_size = args.train_batch_size

    def test_hyperparameter_search(self):
        class MyTrialShortNamer(TrialShortNamer):
            DEFAULTS = {"a": 0, "b": 0}

        def hp_space(trial):
            return {}

        def model_init(trial):
            if trial is not None:
                a = trial.suggest_int("a", -4, 4)
                b = trial.suggest_int("b", -4, 4)
            else:
                a = 0
                b = 0
            config = RegressionModelConfig(a=a, b=b, double_output=False)

            return RegressionPreTrainedModel(config).to(torch_device)

        def hp_name(trial):
            return MyTrialShortNamer.shortname(trial.params)

        # one value per entry in `direction` below: (loss to minimize, accuracy to maximize)
        def compute_objective(metrics: dict[str, float]) -> tuple[float, float]:
            return metrics["eval_loss"], metrics["eval_accuracy"]

        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = get_regression_trainer(
                output_dir=tmp_dir,
                learning_rate=0.1,
                logging_steps=1,
                eval_strategy=IntervalStrategy.EPOCH,
                save_strategy=IntervalStrategy.EPOCH,
                num_train_epochs=10,
                disable_tqdm=True,
                load_best_model_at_end=True,
                run_name="test",
                model_init=model_init,
                compute_metrics=AlmostAccuracy(),
            )
            trainer.hyperparameter_search(
                direction=["minimize", "maximize"],
                hp_space=hp_space,
                hp_name=hp_name,
                n_trials=4,
                compute_objective=compute_objective,
            )


@require_torch
@require_optuna
class TrainerHyperParameterOptunaIntegrationTestWithFullEval(unittest.TestCase):
    def test_hyperparameter_search(self):
        def hp_space(trial):
            return {}

        def model_init(trial):
            if trial is not None:
                a = trial.suggest_int("a", -4, 4)
                b = trial.suggest_int("b", -4, 4)
            else:
                a = 0
                b = 0
            config = RegressionModelConfig(a=a, b=b, double_output=False)

            return RegressionPreTrainedModel(config).to(torch_device)

        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = get_regression_trainer(
                output_dir=tmp_dir,
                disable_tqdm=True,
                model_init=model_init,
                fp16_full_eval=True,
            )
            trainer.hyperparameter_search(
                direction="minimize",
                hp_space=hp_space,
                n_trials=2,
            )


@require_torch
@require_ray
@unittest.skip("doesn't work because of a serialization issue")
class TrainerHyperParameterRayIntegrationTest(unittest.TestCase):
    def setUp(self):
        args = TrainingArguments("..")
        self.n_epochs = args.num_train_epochs
        self.batch_size = args.train_batch_size

    def ray_hyperparameter_search(self):
        class MyTrialShortNamer(TrialShortNamer):
            DEFAULTS = {"a": 0, "b": 0}

        def hp_space(trial):
            from ray import tune

            return {
                "a": tune.randint(-4, 4),
                "b": tune.randint(-4, 4),
            }

        def model_init(config):
            if config is None:
                a = 0
                b = 0
            else:
                a = config["a"]
                b = config["b"]
            model_config = RegressionModelConfig(a=a, b=b, double_output=False)

            return RegressionPreTrainedModel(model_config).to(torch_device)

        def hp_name(params):
            return MyTrialShortNamer.shortname(params)

        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = get_regression_trainer(
                output_dir=tmp_dir,
                learning_rate=0.1,
                logging_steps=1,
                eval_strategy=IntervalStrategy.EPOCH,
                save_strategy=IntervalStrategy.EPOCH,
                num_train_epochs=4,
                disable_tqdm=True,
                load_best_model_at_end=True,
                run_name="test",
                model_init=model_init,
            )
            trainer.hyperparameter_search(
                direction="minimize", hp_space=hp_space, hp_name=hp_name, backend="ray", n_trials=4
            )

    def test_hyperparameter_search(self):
        self.ray_hyperparameter_search()

    def test_hyperparameter_search_ray_client(self):
        import ray
        from ray.util.client.ray_client_helpers import ray_start_client_server

        with ray_start_client_server():
            assert ray.util.client.ray.is_connected()
            self.ray_hyperparameter_search()


@require_torch
@require_wandb
class TrainerHyperParameterWandbIntegrationTest(unittest.TestCase):
    def setUp(self):
        args = TrainingArguments("..")
        self.n_epochs = args.num_train_epochs
        self.batch_size = args.train_batch_size

    def test_hyperparameter_search(self):
        def hp_space(trial):
            return {
                "method": "random",
                "metric": {},
                "parameters": {
                    "a": {"distribution": "uniform", "min": 1e-6, "max": 1e-4},
                    "b": {"distribution": "int_uniform", "min": 1, "max": 6},
                },
            }

        def model_init(config):
            if config is None:
                a = 0
                b = 0
            else:
                a = config["a"]
                b = config["b"]
            model_config = RegressionModelConfig(a=a, b=b, double_output=False)

            return RegressionPreTrainedModel(model_config).to(torch_device)

        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = get_regression_trainer(
                output_dir=tmp_dir,
                learning_rate=0.1,
                logging_steps=1,
                eval_strategy=IntervalStrategy.EPOCH,
                save_strategy=IntervalStrategy.EPOCH,
                num_train_epochs=4,
                disable_tqdm=True,
                load_best_model_at_end=True,
                run_name="test",
                model_init=model_init,
            )
            sweep_kwargs = {
                "direction": "minimize",
                "hp_space": hp_space,
                "backend": "wandb",
                "n_trials": 4,
            }
            best_run = trainer.hyperparameter_search(**sweep_kwargs)

            self.assertIsNotNone(best_run.run_id)
            self.assertIsNotNone(best_run.run_summary)
            hp_keys = set(best_run.hyperparameters.keys())
            self.assertSetEqual(hp_keys, {"a", "b", "assignments", "metric"})

            # pretend restarting the process purged the environ
            import os

            del os.environ["WANDB_ENTITY"]
            del os.environ["WANDB_PROJECT"]
            sweep_kwargs["sweep_id"] = best_run.run_summary
            updated_best_run = trainer.hyperparameter_search(**sweep_kwargs)

            self.assertIsNotNone(updated_best_run.run_id)
            self.assertEqual(updated_best_run.run_summary, best_run.run_summary)
            updated_hp_keys = set(updated_best_run.hyperparameters.keys())
            self.assertSetEqual(updated_hp_keys, {"a", "b", "assignments", "metric"})


class HyperParameterSearchBackendsTest(unittest.TestCase):
    def test_hyperparameter_search_backends(self):
        self.assertEqual(
            list(ALL_HYPERPARAMETER_SEARCH_BACKENDS.keys()),
            list(HPSearchBackend),
        )
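The backend test above compares the keys of a registry dict with the members of an enum, so it fails when a member is missing from the registry and also when the two are listed in a different order. A minimal sketch of that kind of check follows. `Backend` and `REGISTRY` are hypothetical stand-ins invented for illustration; they are not the real `HPSearchBackend` or `ALL_HYPERPARAMETER_SEARCH_BACKENDS`, and the member names and values are placeholders.

```python
from enum import Enum


class Backend(Enum):  # hypothetical stand-in for a search-backend enum
    OPTUNA = "optuna"
    RAY = "ray"
    WANDB = "wandb"


# Hypothetical registry keyed by enum member; the values are placeholders.
REGISTRY = {
    Backend.OPTUNA: "OptunaBackend",
    Backend.RAY: "RayTuneBackend",
    Backend.WANDB: "WandbBackend",
}

# Iterating an Enum yields members in definition order and dicts preserve
# insertion order, so list equality checks both membership and ordering.
registry_in_sync = list(REGISTRY.keys()) == list(Backend)
print(registry_in_sync)  # True

# Dropping one entry makes the check fail.
incomplete = {k: v for k, v in REGISTRY.items() if k is not Backend.RAY}
print(list(incomplete.keys()) == list(Backend))  # False
```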
853
tests/trainer/test_trainer_optimizers.py
Normal file
@@ -0,0 +1,853 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Trainer optimizer and LR scheduler tests: custom optimizers, LR scheduler kwargs, cosine-with-min-lr,
reduce-on-plateau, Adafactor, bitsandbytes (RMSProp, AdEMAMix), LOMO, GrokAdamW, schedule-free,
GaLore, Apollo, Stable AdamW, Liger kernel, optimizer choice resolution, factory pattern detection,
and model parameter inspection.
"""

import tempfile

import numpy as np
from parameterized import parameterized

from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    LlamaConfig,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
    is_torch_available,
)
from transformers.testing_utils import (
    TestCasePlus,
    require_apollo_torch,
    require_bitsandbytes,
    require_galore_torch,
    require_grokadamw,
    require_lomo,
    require_schedulefree,
    require_torch,
    require_torch_accelerator,
    require_torch_optimi,
)
from transformers.trainer_utils import check_target_module_exists

from .trainer_test_utils import (
    BasicTextGenerationModel,
    RegressionDataset,
    RegressionModel,
    RepeatDataset,
    TorchTracemalloc,
    TrainerIntegrationCommon,
    TstLayer,
    bytes2megabytes,
    get_regression_trainer,
)


if is_torch_available():
    import torch
    from torch import nn

_ATTN_MLP_TARGET_MODULES = [r".*attn.*", r".*mlp.*"]


@require_torch
class TrainerOptimizerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
    def setUp(self):
        super().setUp()
        args = TrainingArguments("..")
        self.n_epochs = args.num_train_epochs
        self.batch_size = args.train_batch_size

    # ---------------------------------------------------------------------------
    # Helpers
    # ---------------------------------------------------------------------------

    def _get_llama_and_dataset(self):
        config = LlamaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=3, num_attention_heads=4)
        model = LlamaForCausalLM(config)
        train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
        return model, train_dataset

    def _get_gpt2_and_dataset(self):
        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
        model = GPT2LMHeadModel(config)
        train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
        return model, train_dataset

    def _train_with_llama(self, optim, optim_target_modules=None, **extra_kwargs):
        """Smoke-test: tiny Llama + RepeatDataset with the given optimizer."""
        tiny_llama, train_dataset = self._get_llama_and_dataset()
        kwargs = {"learning_rate": 1e-9, "logging_steps": 5, "optim": optim}
        if optim_target_modules is not None:
            kwargs["optim_target_modules"] = optim_target_modules
        kwargs.update(extra_kwargs)
        args = TrainingArguments(self.get_auto_remove_tmp_dir(), **kwargs)
        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
        trainer.train()
        return trainer

    def _check_lr_display_without_scheduler(self, optim, optim_target_modules):
        """Verify that LR is correctly reported without an LR scheduler."""
        tiny_llama, train_dataset = self._get_llama_and_dataset()
        learning_rate = 1e-9
        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(),
            learning_rate=learning_rate,
            logging_steps=5,
            optim=optim,
            optim_target_modules=optim_target_modules,
        )
        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
        trainer.create_optimizer_and_scheduler(num_training_steps=10)
        self.assertEqual(trainer.get_learning_rates(), [learning_rate, learning_rate])

    def _check_lr_display_with_scheduler(self, optim, optim_target_modules, num_train_epochs=2):
        """Verify warmup + cosine LR schedule: increases then decreases."""
        tiny_llama, train_dataset = self._get_llama_and_dataset()
        learning_rate = 2e-4
        num_warmup_steps = 5
        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(),
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            warmup_steps=num_warmup_steps,
            lr_scheduler_type="cosine",
            logging_steps=1,
            optim=optim,
            optim_target_modules=optim_target_modules,
        )
        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
        trainer.train()
        logs = trainer.state.log_history[1:-1]

        self.assertTrue(logs[num_warmup_steps - 1]["learning_rate"] == learning_rate)
        self.assertTrue(np.allclose(logs[-1]["learning_rate"], 0, atol=5e-6))

        increasing_lrs = [
            logs[i]["learning_rate"] < logs[i + 1]["learning_rate"]
            for i in range(len(logs))
            if i < num_warmup_steps - 1
        ]
        decreasing_lrs = [
            logs[i]["learning_rate"] > logs[i + 1]["learning_rate"]
            for i in range(len(logs) - 1)
            if i >= num_warmup_steps - 1
        ]

        self.assertTrue(all(increasing_lrs))
        self.assertTrue(all(decreasing_lrs))
        self.assertTrue(len(decreasing_lrs) > len(increasing_lrs))
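The shape that `_check_lr_display_with_scheduler` asserts (a rise during warmup, the peak at the last warmup step, then a decay towards zero) can be reproduced with a small stand-alone schedule. This is a sketch of a linear-warmup plus half-cosine-decay rule written for illustration; `warmup_cosine_lr` is a hypothetical function, the total of 40 steps is an arbitrary choice, and the Trainer's own scheduler may differ in detail.

```python
import math


def warmup_cosine_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """LR after `step` optimizer steps: linear warmup, then half-cosine decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr, warmup, total = 2e-4, 5, 40  # `total` is an arbitrary illustrative value
lrs = [warmup_cosine_lr(step, base_lr, warmup, total) for step in range(1, total + 1)]

# Same three checks as the test helper, applied to the synthetic schedule.
rising = all(lrs[i] < lrs[i + 1] for i in range(warmup - 1))
falling = all(lrs[i] > lrs[i + 1] for i in range(warmup - 1, total - 1))

print(lrs[warmup - 1] == base_lr)  # True
print(rising, falling)  # True True
```

With these numbers there are 4 increasing transitions and 35 decreasing ones, which is why the helper can also assert that the decreasing list is longer than the increasing one.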

    # ---------------------------------------------------------------------------
    # Adafactor optimizer test
    # ---------------------------------------------------------------------------

    def test_adafactor_lr_none(self):
        # test the special case where lr=None, since Trainer always needs an lr_scheduler

        from transformers.optimization import Adafactor, AdafactorSchedule

        train_dataset = RegressionDataset()
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(tmp_dir)
            model = RegressionModel()
            optimizer = Adafactor(
                model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None
            )
            lr_scheduler = AdafactorSchedule(optimizer)
            trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
            trainer.train()

            # Train a default model to compare against
            default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
            default_trainer.train()

            self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
            self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
            self.assertGreater(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 0)

    # ---------------------------------------------------------------------------
    # BNB optimizer tests
    # ---------------------------------------------------------------------------

    @parameterized.expand(["rmsprop_bnb", "ademamix", "ademamix_8bit", "rmsprop_bnb_8bit", "rmsprop_bnb_32bit"])
    @require_bitsandbytes
    def test_bnb_optim(self, optim):
        tiny_gpt2, train_dataset = self._get_gpt2_and_dataset()
        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(),
            learning_rate=1e-9,
            logging_steps=5,
            logging_nan_inf_filter=False,
            optim=optim,
        )
        Trainer(tiny_gpt2, args, train_dataset=train_dataset).train()

    @require_bitsandbytes
    def test_bnb_8bit_optimizer_skip_embedding(self):
        model = BasicTextGenerationModel(8, 4)
        with tempfile.TemporaryDirectory() as tmp_dir:
            for name_optim in ["rmsprop_bnb_8bit", "adamw_8bit"]:
                args = TrainingArguments(
                    output_dir=tmp_dir,
                    optim=name_optim,
                )
                trainer = Trainer(model=model, args=args)
                optimizer = trainer.create_optimizer()
                modules = optimizer.mng.module_weight_config_triple
                self.assertNotEqual(len(modules), 0)
                module, name, config = modules[0]
                self.assertIsInstance(module, torch.nn.Embedding)
                self.assertEqual(name, "weight")
                self.assertDictEqual(config, {"optim_bits": 32})

    # ---------------------------------------------------------------------------
    # LOMO tests
    # ---------------------------------------------------------------------------

    @require_lomo
    @require_torch_accelerator
    def test_lomo(self):
        tiny_llama, train_dataset = self._get_llama_and_dataset()
        previous_params = {n: p.clone() for n, p in tiny_llama.named_parameters()}

        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(), learning_rate=1e-2, logging_steps=5, optim="lomo", max_steps=20
        )
        Trainer(tiny_llama, args, train_dataset=train_dataset).train()

        for name, param in tiny_llama.named_parameters():
            self.assertFalse(torch.allclose(param, previous_params[name].to(param.device), rtol=1e-12, atol=1e-12))

    @require_lomo
    @require_torch_accelerator
    def test_adalomo(self):
        self._train_with_llama("adalomo")

    # ---------------------------------------------------------------------------
    # GrokAdamW test
    # ---------------------------------------------------------------------------

    @require_grokadamw
    @require_torch_accelerator
    def test_grokadamw(self):
        self._train_with_llama("grokadamw", learning_rate=2e-5, max_steps=20)

    # ---------------------------------------------------------------------------
    # Schedule-free tests
    # ---------------------------------------------------------------------------

    @parameterized.expand([("schedule_free_adamw",), ("schedule_free_radam",)])
    @require_schedulefree
    @require_torch_accelerator
    def test_schedulefree(self, optim):
        self._train_with_llama(optim, lr_scheduler_type="constant")

    # ---------------------------------------------------------------------------
    # GaLore tests
    # ---------------------------------------------------------------------------

    def test_galore_matched_modules(self):
        regex_patterns = [r".*.attn.*", r".*.mlp.*"]

        module_names = [
            "model.transformer.h.0.ln_1",
            "model.transformer.h.0.attn.q_proj",
            "model.lm_head",
            "model.transformer.h.0.mlp.up_proj",
        ]
        expected_values = [False, True, False, True]

        for expected_value, module_name in zip(expected_values, module_names):
            is_module_matched, is_regex = check_target_module_exists(regex_patterns, module_name, return_is_regex=True)
            self.assertTrue(is_module_matched == expected_value)
            if is_module_matched:
                self.assertTrue(is_regex)

        exact_patterns = ["q_proj", "up_proj"]

        module_names = [
            "model.transformer.h.0.ln_1",
            "model.transformer.h.0.attn.q_proj",
            "model.lm_head",
            "model.transformer.h.0.mlp.up_proj",
        ]
        expected_values = [False, True, False, True]

        for expected_value, module_name in zip(expected_values, module_names):
            is_module_matched, is_regex = check_target_module_exists(exact_patterns, module_name, return_is_regex=True)
            self.assertTrue(is_module_matched == expected_value)
            if is_module_matched:
                self.assertFalse(is_regex)

        simple_regex = r".*.attn.*"

        module_names = [
            "model.transformer.h.0.ln_1",
            "model.transformer.h.0.attn.q_proj",
            "model.lm_head",
            "model.transformer.h.0.mlp.up_proj",
        ]
        expected_values = [False, True, False, False]

        for expected_value, module_name in zip(expected_values, module_names):
            is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
            self.assertTrue(is_module_matched == expected_value)
            if is_module_matched:
                self.assertTrue(is_regex)

        simple_regex = "model.transformer.h.0.attn.q_proj"

        module_names = [
            "model.transformer.h.0.ln_1",
            "model.transformer.h.0.attn.q_proj",
            "model.lm_head",
            "model.transformer.h.0.mlp.up_proj",
        ]
        expected_values = [False, True, False, False]

        for expected_value, module_name in zip(expected_values, module_names):
            is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
            self.assertTrue(is_module_matched == expected_value)
            if is_module_matched:
                self.assertFalse(is_regex)

        target_modules = ["attn", "mlp"]

        module_names = [
            "model.transformer.h.0.ln_1",
            "model.transformer.h.0.attn.q_proj",
            "model.lm_head",
            "model.transformer.h.0.mlp.up_proj",
        ]
        expected_values = [False, True, False, True]

        for expected_value, module_name in zip(expected_values, module_names):
            is_module_matched, is_regex = check_target_module_exists(target_modules, module_name, return_is_regex=True)
            self.assertTrue(is_module_matched == expected_value)
            if is_module_matched:
                self.assertFalse(is_regex)
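The five cases in `test_galore_matched_modules` pin down a matching rule: a single string is treated as a full-match regex (and counts as a regex only when it differs from the module name), while a list is tried as exact names, then substrings, then full-match regexes. The sketch below is one rule that satisfies those assertions. `match_target_module` is a hypothetical stand-in written for illustration and is not claimed to be how `check_target_module_exists` is implemented.

```python
import re


def match_target_module(targets, module_name):
    """Hypothetical stand-in: return (matched, is_regex) for one module name."""
    if isinstance(targets, str):
        matched = re.fullmatch(targets, module_name) is not None
        # An identical string is an exact hit, anything else matched as a pattern.
        return matched, matched and targets != module_name
    # Lists: exact name or plain substring first, neither counts as a regex.
    if module_name in targets or any(t in module_name for t in targets):
        return True, False
    # Otherwise fall back to treating each entry as a full-match regex.
    if any(re.fullmatch(t, module_name) for t in targets):
        return True, True
    return False, False


attn = "model.transformer.h.0.attn.q_proj"
mlp = "model.transformer.h.0.mlp.up_proj"
ln = "model.transformer.h.0.ln_1"

print(match_target_module([r".*.attn.*", r".*.mlp.*"], attn))  # (True, True)
print(match_target_module(["q_proj", "up_proj"], mlp))  # (True, False)
print(match_target_module(r".*.attn.*", mlp))  # (False, False)
print(match_target_module(attn, attn))  # (True, False)
print(match_target_module(["attn", "mlp"], ln))  # (False, False)
```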
|
||||
|
||||
@parameterized.expand([("galore_adamw",), ("galore_adamw_layerwise",), ("galore_adamw_8bit",)])
|
||||
@require_galore_torch
|
||||
@require_torch_accelerator
|
||||
def test_galore(self, optim):
|
||||
self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)
|
||||
|
||||
@require_galore_torch
|
||||
@require_torch_accelerator
|
||||
def test_galore_extra_args(self):
|
||||
self._train_with_llama(
|
||||
"galore_adamw",
|
||||
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
|
||||
optim_args="rank=64, update_proj_gap=100, scale=0.10",
|
||||
)
|
||||
|
||||
    @require_galore_torch
    @require_torch_accelerator
    def test_galore_layerwise_with_scheduler(self):
        self._train_with_llama(
            "galore_adamw_layerwise",
            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
            lr_scheduler_type="cosine",
        )

    @parameterized.expand(
        [
            (_ATTN_MLP_TARGET_MODULES,),
            (["q_proj", "k_proj", "v_proj"],),
            ("all-linear",),
        ]
    )
    @require_galore_torch
    @require_torch_accelerator
    def test_galore_adafactor(self, optim_target_modules):
        upper_bound_pm = 700
        lower_bound_pm = 650
        tiny_llama, train_dataset = self._get_llama_and_dataset()

        with tempfile.TemporaryDirectory() as tmpdir, TorchTracemalloc() as tracemalloc:
            args = TrainingArguments(
                tmpdir,
                learning_rate=1e-9,
                logging_steps=5,
                optim="galore_adafactor",
                optim_target_modules=optim_target_modules,
            )
            Trainer(tiny_llama, args, train_dataset=train_dataset).train()

        galore_peak_memory = tracemalloc.peaked + bytes2megabytes(tracemalloc.begin)
        self.assertLess(galore_peak_memory, upper_bound_pm)
        self.assertLess(lower_bound_pm, galore_peak_memory)

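The test above records peak memory inside a context manager and then checks that it falls inside a window. The sketch below shows the same pattern with the standard-library `tracemalloc` module, which measures Python heap allocations rather than accelerator memory; `PeakMemoryTracker` and `within_bounds` are hypothetical names, and this is a stand-in for the idea, not for `TorchTracemalloc`.

```python
import tracemalloc


class PeakMemoryTracker:
    """Record peak Python heap usage (in MiB) for the duration of a with-block."""

    def __enter__(self):
        tracemalloc.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.peak_mib = peak_bytes / 2**20
        return False


def within_bounds(value, lower, upper):
    """True when lower < value < upper, mirroring the two-sided check in the test."""
    return lower < value < upper


with PeakMemoryTracker() as tracker:
    payload = [bytes(1024) for _ in range(4096)]  # roughly 4 MiB of data
```

A two-sided bound catches regressions in both directions: memory growing past the budget, and memory dropping so far that the optimizer under test was probably not exercised.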
    @require_galore_torch
    @require_torch_accelerator
    def test_galore_lr_display_without_scheduler(self):
        self._check_lr_display_without_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)

    @require_galore_torch
    @require_torch_accelerator
    def test_galore_lr_display_with_scheduler(self):
        self._check_lr_display_with_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)

    # ---------------------------------------------------------------------------
    # Apollo tests
    # ---------------------------------------------------------------------------

    @parameterized.expand([("apollo_adamw",), ("apollo_adamw_layerwise",)])
    @require_apollo_torch
    @require_torch_accelerator
    def test_apollo(self, optim):
        self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)

    @require_apollo_torch
    @require_torch_accelerator
    def test_apollo_extra_args(self):
        self._train_with_llama(
            "apollo_adamw",
            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
            optim_args="proj=random,scale_type=tensor,rank=1,update_proj_gap=100,scale=128.0",
        )

    @require_apollo_torch
    @require_torch_accelerator
    def test_apollo_layerwise_with_scheduler(self):
        self._train_with_llama(
            "apollo_adamw_layerwise",
            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
            lr_scheduler_type="cosine",
        )

    @require_apollo_torch
    @require_torch_accelerator
    def test_apollo_lr_display_without_scheduler(self):
        self._check_lr_display_without_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES)

    @require_apollo_torch
    @require_torch_accelerator
    def test_apollo_lr_display_with_scheduler(self):
        self._check_lr_display_with_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)

    # ---------------------------------------------------------------------------
    # Stable AdamW tests
    # ---------------------------------------------------------------------------

    @require_torch_optimi
    @require_torch_accelerator
    def test_stable_adamw(self):
        self._train_with_llama("stable_adamw", optim_target_modules=_ATTN_MLP_TARGET_MODULES)

    @require_torch_optimi
    @require_torch_accelerator
    def test_stable_adamw_extra_args(self):
        self._train_with_llama(
            "stable_adamw",
            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
            optim_args="decouple_lr=True,max_lr=1e-3,kahan_sum=True",
        )

    @require_torch_optimi
    @require_torch_accelerator
    def test_stable_adamw_trainer_adamw_args(self):
        tiny_llama, train_dataset = self._get_llama_and_dataset()
        args = TrainingArguments(
            self.get_auto_remove_tmp_dir(),
            learning_rate=1e-9,
            logging_steps=5,
            weight_decay=0.001,
            adam_beta1=0.89,
            adam_beta2=0.98,
            adam_epsilon=1e-8,
            optim="stable_adamw",
            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
        )
        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
        trainer.create_optimizer_and_scheduler(num_training_steps=10)

        # check StableAdamW optimizer is created with the correct parameters
        self.assertEqual(trainer.optimizer.defaults["beta1"], args.adam_beta1)
        self.assertEqual(trainer.optimizer.defaults["beta2"], args.adam_beta2)
        self.assertEqual(trainer.optimizer.defaults["eps"], args.adam_epsilon)
        self.assertEqual(trainer.optimizer.defaults["weight_decay"], args.weight_decay)

    @require_torch_optimi
    @require_torch_accelerator
    def test_stable_adamw_lr_display_without_scheduler(self):
        self._check_lr_display_without_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES)

    @require_torch_optimi
    @require_torch_accelerator
    def test_stable_adamw_lr_display_with_scheduler(self):
        self._check_lr_display_with_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)

    # ---------------------------------------------------------------------------
    # Misc optimizer tests
    # ---------------------------------------------------------------------------

    def test_optimizer_factory_pattern(self):
        """Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
        from transformers.trainer_optimizer import is_optimizer_factory

        # Create a mock optimizer class
        class MockComplexOptimizer(torch.optim.Optimizer):
            def __init__(self, params, lr=1e-3):
                defaults = {"lr": lr}
                super().__init__(params, defaults)

            def step(self, closure=None):
                pass

        # Create a factory class (simulates Muon/Dion pattern)
        class MockOptimizerFactory:
            def __call__(self, opt_model, **optimizer_kwargs):
                all_params = list(opt_model.parameters())
                return MockComplexOptimizer(all_params, **optimizer_kwargs)

        # Verify is_optimizer_factory correctly identifies factories vs optimizer classes
        self.assertFalse(is_optimizer_factory(MockComplexOptimizer))  # Optimizer class should return False
        self.assertTrue(is_optimizer_factory(MockOptimizerFactory))  # Factory class should return True

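The test above distinguishes an optimizer class from a factory class whose instances build an optimizer when called. One way such a check could work is sketched below without PyTorch; `Optimizer` here is a local stand-in for `torch.optim.Optimizer`, and `looks_like_optimizer_factory` is a hypothetical function, not the `is_optimizer_factory` implementation.

```python
import inspect


class Optimizer:
    """Stand-in for torch.optim.Optimizer so the sketch runs without PyTorch."""

    def __init__(self, params, defaults):
        self.param_groups = [{"params": list(params), **defaults}]


class SGDLike(Optimizer):
    def __init__(self, params, lr=1e-3):
        super().__init__(params, {"lr": lr})


class SGDLikeFactory:
    """Builds an optimizer from a model-like object when called."""

    def __call__(self, params_owner, **optimizer_kwargs):
        return SGDLike(params_owner.parameters(), **optimizer_kwargs)


def looks_like_optimizer_factory(obj):
    """Hypothetical check: a non-Optimizer class that defines __call__ for its instances."""
    if not inspect.isclass(obj) or issubclass(obj, Optimizer):
        return False
    # Look in the class dictionaries, because every class object is itself callable.
    return any("__call__" in vars(base) for base in obj.__mro__)


class TinyModel:
    def parameters(self):
        return [0.1, 0.2]


optimizer = SGDLikeFactory()(TinyModel(), lr=0.5)
```

The factory receives the whole model rather than a flat parameter list, which is what lets it assign different parameters to different update rules.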
    # ---------------------------------------------------------------------------
    # Optimizer group and learning rate inspection tests
    # ---------------------------------------------------------------------------

    def test_get_optimizer_group(self):
        model = nn.Sequential(nn.Linear(128, 64))
        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
            # ValueError is raised if optimizer is None
            with self.assertRaises(ValueError):
                trainer.get_optimizer_group()
            trainer.create_optimizer()
            # Get groups
            num_groups = len(trainer.get_optimizer_group())
            self.assertEqual(num_groups, 2)
            # Get group of parameter
            param = next(model.parameters())
            group = trainer.get_optimizer_group(param)
            self.assertIn(param, group["params"])


# ---------------------------------------------------------------------------
# Custom optimizer and LR scheduler tests
# ---------------------------------------------------------------------------


class TrainerOptimizerTest(TestCasePlus):
    def test_get_optimizer_group(self):
        model = nn.Sequential(nn.Linear(128, 64))
        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
            # ValueError is raised if optimizer is None
            with self.assertRaises(ValueError):
                trainer.get_optimizer_group()
            trainer.create_optimizer()
            # Get groups
            num_groups = len(trainer.get_optimizer_group())
            self.assertEqual(num_groups, 2)
            # Get group of parameter
            param = next(model.parameters())
            group = trainer.get_optimizer_group(param)
            self.assertIn(param, group["params"])

    def test_optimizer_factory_pattern(self):
        """Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
        from transformers.trainer_optimizer import is_optimizer_factory

        # Create a mock optimizer class
        class MockComplexOptimizer(torch.optim.Optimizer):
            def __init__(self, params, lr=1e-3):
                defaults = {"lr": lr}
                super().__init__(params, defaults)

            def step(self, closure=None):
                pass

        # Create a factory class (simulates Muon/Dion pattern)
        class MockOptimizerFactory:
            def __call__(self, opt_model, **optimizer_kwargs):
                all_params = list(opt_model.parameters())
                return MockComplexOptimizer(all_params, **optimizer_kwargs)

        # Verify is_optimizer_factory correctly identifies factories vs optimizer classes
        self.assertFalse(is_optimizer_factory(MockComplexOptimizer))  # Optimizer class should return False
        self.assertTrue(is_optimizer_factory(MockOptimizerFactory))  # Factory class should return True

    def test_custom_optimizer(self):
        train_dataset = RegressionDataset()
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(tmp_dir)
            model = RegressionModel()
            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
            lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda x: 1.0)
            trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
            trainer.train()

            # Train a default model to compare against
            default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
            default_trainer.train()

            self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
            self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
            self.assertEqual(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 1.0)

    # ---------------------------------------------------------------------------
    # Weight decay parameter groups
    # ---------------------------------------------------------------------------

    def test_no_wd_param_group(self):
        model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
            trainer.create_optimizer_and_scheduler(10)
            wd_names = ['0.linear1.weight', '0.linear2.weight', '1.0.linear1.weight', '1.0.linear2.weight', '1.1.linear1.weight', '1.1.linear2.weight']  # fmt: skip
            wd_params = [p for n, p in model.named_parameters() if n in wd_names]
            no_wd_params = [p for n, p in model.named_parameters() if n not in wd_names]
            self.assertListEqual(trainer.optimizer.param_groups[0]["params"], wd_params)
            self.assertListEqual(trainer.optimizer.param_groups[1]["params"], no_wd_params)


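The weight decay test above expects only the linear weights in the first parameter group, with biases and normalization parameters in the second. A name-based sketch of that split follows. It assumes each test layer exposes `linear1`, `ln1`, `linear2`, `ln2` and a bare `bias` parameter (only the `linear*.weight` names appear in the test itself), and `split_by_weight_decay` is a hypothetical helper that works on names rather than module types.

```python
def split_by_weight_decay(param_names, norm_prefixes=("ln",)):
    """Split parameter names into (decay, no_decay) lists.

    Biases and parameters owned by a normalization layer go to no_decay.
    """
    decay, no_decay = [], []
    for name in param_names:
        parts = name.split(".")
        is_bias = parts[-1] == "bias"
        in_norm = len(parts) > 1 and parts[-2].startswith(norm_prefixes)
        (no_decay if is_bias or in_norm else decay).append(name)
    return decay, no_decay


# Assumed parameter layout of one test layer (illustrative).
layer_params = [
    "linear1.weight", "linear1.bias", "ln1.weight", "ln1.bias",
    "linear2.weight", "linear2.bias", "ln2.weight", "ln2.bias", "bias",
]
names = [f"0.{param}" for param in layer_params]
decay, no_decay = split_by_weight_decay(names)
```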
@require_torch
class TrainerLRTest(TestCasePlus):
    def test_get_learning_rates(self):
        model = nn.Sequential(nn.Linear(128, 64))
        with tempfile.TemporaryDirectory() as tmp_dir:
            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
            with self.assertRaises(ValueError):
                trainer.get_learning_rates()
            trainer.create_optimizer()
            self.assertEqual(trainer.get_learning_rates(), [5e-05, 5e-05])

    def test_lr_scheduler_kwargs(self):
        from transformers import get_polynomial_decay_schedule_with_warmup

        # test scheduler kwargs passed via TrainingArguments
        train_dataset = RegressionDataset()
        model = RegressionModel()
        num_steps, num_warmup_steps = 10, 2
        extra_kwargs = {"power": 5.0, "lr_end": 1e-5}  # Non-default arguments
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                lr_scheduler_type="polynomial",
                lr_scheduler_kwargs=extra_kwargs,
                learning_rate=0.2,
                warmup_steps=num_warmup_steps,
            )
            trainer = Trainer(model, args, train_dataset=train_dataset)
            trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)

            # Checking that the scheduler was created
            self.assertIsNotNone(trainer.lr_scheduler)

            # Checking that the correct args were passed
            sched1 = trainer.lr_scheduler
            sched2 = get_polynomial_decay_schedule_with_warmup(
                trainer.optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_steps, **extra_kwargs
            )
            self.assertEqual(sched1.lr_lambdas[0].args, sched2.lr_lambdas[0].args)
            self.assertEqual(sched1.lr_lambdas[0].keywords, sched2.lr_lambdas[0].keywords)

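The comparison of `.args` and `.keywords` above works when the schedule's lambda is a `functools.partial`, since a partial stores the arguments it was built with. A minimal sketch of a polynomial decay factor with linear warmup, wrapped in a partial, is shown below; the function and variable names are illustrative, and the formula is a sketch of the usual polynomial decay rather than a copy of the library code.

```python
from functools import partial


def polynomial_decay_factor(current_step, *, num_warmup_steps, num_training_steps, lr_init, lr_end, power):
    """Multiplier applied to the initial learning rate at a given step."""
    if current_step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return current_step / max(1, num_warmup_steps)
    if current_step > num_training_steps:
        return lr_end / lr_init
    decay_steps = num_training_steps - num_warmup_steps
    pct_remaining = 1 - (current_step - num_warmup_steps) / decay_steps
    return ((lr_init - lr_end) * pct_remaining**power + lr_end) / lr_init


settings = dict(num_warmup_steps=2, num_training_steps=10, lr_init=0.2, lr_end=1e-5, power=5.0)
schedule_a = partial(polynomial_decay_factor, **settings)
schedule_b = partial(polynomial_decay_factor, **settings)
final_lr = settings["lr_init"] * schedule_a(10)
```

Two partials built from the same settings compare equal on `.keywords`, so a mismatch there points at an argument that was not forwarded.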
    def test_cosine_with_min_lr_scheduler(self):
        train_dataset = RegressionDataset()
        model = RegressionModel()
        num_steps, num_warmup_steps = 10, 2
        extra_kwargs = {"min_lr": 1e-5}  # Non-default arguments
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                lr_scheduler_type="cosine_with_min_lr",
                lr_scheduler_kwargs=extra_kwargs,
                learning_rate=0.2,
                warmup_steps=num_warmup_steps,
            )
            trainer = Trainer(model, args, train_dataset=train_dataset)
            trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)

            # Checking that the scheduler was created
            self.assertIsNotNone(trainer.lr_scheduler)

            # Check the last learning rate
            for _ in range(num_steps):
                trainer.lr_scheduler.step()
            self.assertEqual(trainer.lr_scheduler.get_last_lr()[0], 1e-5)

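The test above steps the scheduler to the end of training and expects the learning rate to land on `min_lr`. A sketch of a cosine schedule with a floor follows, assuming half a cosine cycle and a floor expressed as a fraction of the base rate; `cosine_with_min_lr_factor` is an illustrative name, not the library function.

```python
import math


def cosine_with_min_lr_factor(current_step, *, num_warmup_steps, num_training_steps, min_lr_rate, num_cycles=0.5):
    """Multiplier on the base learning rate: linear warmup, then cosine decay to min_lr_rate."""
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress))
    # Rescale the cosine from [0, 1] to [min_lr_rate, 1].
    return max(0.0, cosine * (1 - min_lr_rate) + min_lr_rate)


base_lr, min_lr = 0.2, 1e-5
lrs = [
    base_lr
    * cosine_with_min_lr_factor(step, num_warmup_steps=2, num_training_steps=10, min_lr_rate=min_lr / base_lr)
    for step in range(11)
]
```

The sketch compares floating-point results with a tolerance; the exact equality used in the test depends on how the library computes the final factor.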
    def test_cosine_with_min_lr_schedule_with_warmup_lr_rate(self):
        train_dataset = RegressionDataset()
        model = RegressionModel()
        num_steps, num_warmup_steps = 10, 2
        extra_kwargs = {"min_lr": 1e-5}  # Non-default arguments
        args = TrainingArguments(
            "./regression",
            lr_scheduler_type="cosine_warmup_with_min_lr",
            lr_scheduler_kwargs=extra_kwargs,
            learning_rate=0.2,
            warmup_steps=num_warmup_steps,
        )
        trainer = Trainer(model, args, train_dataset=train_dataset)
        trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)

        # Checking that the scheduler was created
        self.assertIsNotNone(trainer.lr_scheduler)

        # Check the learning rate during warmup and at the last step
        step_lrs = []
        for _ in range(num_steps):
            step_lrs.append(trainer.optimizer.param_groups[0]["lr"])
            trainer.lr_scheduler.step()
        self.assertEqual(step_lrs[0], 0.1)
        self.assertEqual(step_lrs[1], 0.2)
        self.assertEqual(step_lrs[-1], 1e-05)

    def test_reduce_lr_on_plateau_args(self):
        # test passed arguments for a custom ReduceLROnPlateau scheduler
        train_dataset = RegressionDataset(length=64)
        eval_dataset = RegressionDataset(length=64)
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                eval_strategy="epoch",
                metric_for_best_model="eval_loss",
            )
            model = RegressionModel()
            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
            lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=5, cooldown=2)
            trainer = Trainer(
                model,
                args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
                optimizers=(optimizer, lr_scheduler),
            )
            trainer.train()

            self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
            self.assertEqual(trainer.lr_scheduler.factor, 0.2)
            self.assertEqual(trainer.lr_scheduler.patience, 5)
            self.assertEqual(trainer.lr_scheduler.cooldown, 2)

    def test_reduce_lr_on_plateau(self):
        # test the ReduceLROnPlateau scheduler

        class TrainerWithLRLogs(Trainer):
            def log(self, logs):
                # the LR is computed after metrics and does not exist for the first epoch
                if hasattr(self.lr_scheduler, "_last_lr"):
                    logs["learning_rate"] = self.lr_scheduler._last_lr[0]
                super().log(logs)

        train_dataset = RegressionDataset(length=64)
        eval_dataset = RegressionDataset(length=64)

        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                lr_scheduler_type="reduce_lr_on_plateau",
                eval_strategy="epoch",
                metric_for_best_model="eval_loss",
                num_train_epochs=10,
                learning_rate=0.2,
            )
            model = RegressionModel()
            trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
            trainer.train()

            self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
            patience = trainer.lr_scheduler.patience

            logs = trainer.state.log_history[1:]
            best_loss = logs[0]["eval_loss"]
            bad_epochs = 0
            for i, log in enumerate(logs[:-1]):  # Compare learning rate to next epoch's
                loss = log["eval_loss"]
                just_decreased = False
                if loss > best_loss:
                    bad_epochs += 1
                    if bad_epochs > patience:
                        self.assertLess(logs[i + 1]["learning_rate"], log["learning_rate"])
                        just_decreased = True
                        bad_epochs = 0
                else:
                    best_loss = loss
                    bad_epochs = 0
                if not just_decreased:
                    self.assertEqual(logs[i + 1]["learning_rate"], log["learning_rate"])

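The bookkeeping in the test above (best loss so far, a counter of bad epochs, a reduction once the counter exceeds `patience`) is the core of a reduce-on-plateau rule. A simplified sketch follows. `PlateauTracker` is a hypothetical class that leaves out the improvement threshold, cooldown, and minimum learning rate that `torch.optim.lr_scheduler.ReduceLROnPlateau` also supports.

```python
class PlateauTracker:
    """Cut the learning rate by `factor` after more than `patience` non-improving epochs."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.2, factor=0.5, patience=1)
history = [tracker.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97, 0.98]]
```

With `patience=1`, the rate is halved on the second consecutive epoch without improvement, and the counter then restarts from zero.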
    def test_greedy_lr_args(self):
        # test passed arguments for a custom GreedyLR scheduler
        from transformers.optimization import GreedyLR

        train_dataset = RegressionDataset(length=64)
        eval_dataset = RegressionDataset(length=64)
        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                eval_strategy="epoch",
                metric_for_best_model="eval_loss",
            )
            model = RegressionModel()
            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
            lr_scheduler = GreedyLR(optimizer, factor=0.8, patience=5, cooldown=2)
            trainer = Trainer(
                model,
                args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
                optimizers=(optimizer, lr_scheduler),
            )
            trainer.train()

            self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
            self.assertEqual(trainer.lr_scheduler.factor, 0.8)
            self.assertEqual(trainer.lr_scheduler.patience, 5)
            self.assertEqual(trainer.lr_scheduler.cooldown, 2)

    def test_greedy_lr(self):
        # test the GreedyLR scheduler
        from transformers.optimization import GreedyLR

        class TrainerWithLRLogs(Trainer):
            def log(self, logs):
                if hasattr(self.lr_scheduler, "_last_lr"):
                    logs["learning_rate"] = self.lr_scheduler._last_lr[0]
                super().log(logs)

        train_dataset = RegressionDataset(length=64)
        eval_dataset = RegressionDataset(length=64)

        with tempfile.TemporaryDirectory() as tmp_dir:
            args = TrainingArguments(
                tmp_dir,
                lr_scheduler_type="greedy",
                lr_scheduler_kwargs={"patience": 1, "factor": 0.5},
                eval_strategy="epoch",
                metric_for_best_model="eval_loss",
                num_train_epochs=10,
                learning_rate=0.2,
            )
            model = RegressionModel()
            trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
            trainer.train()

            self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
            # Verify LR was adjusted at least once during training
            logs = trainer.state.log_history[1:]
            lr_values = [log["learning_rate"] for log in logs if "learning_rate" in log]
            self.assertGreater(len(set(lr_values)), 1, "GreedyLR should have adjusted the LR at least once")
413
tests/trainer/test_trainer_seq2seq.py
Normal file
@@ -0,0 +1,413 @@
# Copyright 2020 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
from pathlib import Path
from unittest.mock import patch

from transformers import (
    AutoModelForSeq2SeqLM,
    BertConfig,
    BertTokenizer,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    GenerationConfig,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5Tokenizer,
)
from transformers.testing_utils import (
    ExtendSysPath,
    TestCasePlus,
    backend_device_count,
    execute_subprocess_async,
    get_torch_dist_unique_port,
    require_bitsandbytes,
    require_sentencepiece,
    require_torch,
    require_torch_multi_accelerator,
    require_torch_non_multi_accelerator,
    slow,
    torch_device,
)
from transformers.trainer_callback import TrainerState
from transformers.trainer_utils import set_seed
from transformers.utils import is_datasets_available, is_torch_available


if is_datasets_available():
    import datasets

if is_torch_available():
    import torch


set_seed(42)
MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"
MBART_TINY = "sshleifer/tiny-mbart"


@require_sentencepiece
class Seq2seqTrainerTester(TestCasePlus):
    @slow
    @require_torch
    def test_finetune_bert2bert(self):
        bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
            "prajjwal1/bert-tiny",
            "prajjwal1/bert-tiny",
            encoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
            decoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
            dtype=torch.float32,
        )
        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

        bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size
        tokenizer.eos_token_id = tokenizer.sep_token_id
        bert2bert.generation_config.decoder_start_token_id = tokenizer.cls_token_id
        bert2bert.generation_config.max_length = 128

        train_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:1%]")
        val_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="validation[:1%]")

        train_dataset = train_dataset.select(range(32))
        val_dataset = val_dataset.select(range(16))

        batch_size = 4

        def _map_to_encoder_decoder_inputs(batch):
            # The BERT tokenizer automatically wraps the text as [CLS] <text> [SEP]
            inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
            outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
            batch["input_ids"] = inputs.input_ids
            batch["attention_mask"] = inputs.attention_mask

            batch["decoder_input_ids"] = outputs.input_ids
            batch["labels"] = outputs.input_ids.copy()
            batch["labels"] = [
                [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
            ]
            batch["decoder_attention_mask"] = outputs.attention_mask

            assert all(len(x) == 512 for x in inputs.input_ids)
            assert all(len(x) == 128 for x in outputs.input_ids)

            return batch

        def _compute_metrics(pred):
            labels_ids = pred.label_ids
            pred_ids = pred.predictions

            # Replace -100 (ignore index) with pad_token_id before decoding
            import numpy as np

            labels_ids = np.where(labels_ids == -100, tokenizer.pad_token_id, labels_ids)

            # Decode to text, dropping special tokens such as padding
            pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
            label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

            accuracy = sum(int(pred_str[i] == label_str[i]) for i in range(len(pred_str))) / len(pred_str)

            return {"accuracy": accuracy}

        # map train dataset
        train_dataset = train_dataset.map(
            _map_to_encoder_decoder_inputs,
            batched=True,
            batch_size=batch_size,
            remove_columns=["article", "highlights"],
        )
        train_dataset.set_format(
            type="torch",
            columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
        )

        # same for validation dataset
        val_dataset = val_dataset.map(
            _map_to_encoder_decoder_inputs,
            batched=True,
            batch_size=batch_size,
            remove_columns=["article", "highlights"],
        )
        val_dataset.set_format(
            type="torch",
            columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
        )

        output_dir = self.get_auto_remove_tmp_dir()

        training_args = Seq2SeqTrainingArguments(
            output_dir=output_dir,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            predict_with_generate=True,
            eval_strategy="steps",
            do_train=True,
            do_eval=True,
            warmup_steps=0,
            eval_steps=2,
            logging_steps=2,
        )

        # instantiate trainer
        trainer = Seq2SeqTrainer(
            model=bert2bert,
            args=training_args,
            compute_metrics=_compute_metrics,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            processing_class=tokenizer,
        )

        # start training
        trainer.train()

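The test above masks padding in the labels with `-100` so the loss ignores it, then swaps the padding id back in before decoding for the metric. The round trip and the exact-match accuracy can be sketched on plain lists as follows; the token ids and strings are made up, and the helper names are hypothetical.

```python
IGNORE_INDEX = -100


def mask_padding(label_rows, pad_token_id):
    """Replace padding ids with the ignore index so the loss skips them."""
    return [[IGNORE_INDEX if token == pad_token_id else token for token in row] for row in label_rows]


def unmask_padding(label_rows, pad_token_id):
    """Restore padding ids so the rows can be decoded by a tokenizer."""
    return [[pad_token_id if token == IGNORE_INDEX else token for token in row] for row in label_rows]


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    matches = sum(int(pred == ref) for pred, ref in zip(predictions, references))
    return matches / len(predictions)


pad_id = 0  # illustrative padding id
labels = [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
masked = mask_padding(labels, pad_id)
restored = unmask_padding(masked, pad_id)
accuracy = exact_match_accuracy(["a b", "c"], ["a b", "d"])
```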
    @slow
    @require_torch
    def test_return_sequences(self):
        # Tests that the number of generated sequences is correct when num_return_sequences > 1,
        # which in effect ensures that `accelerator.gather()` is used instead of `gather_for_metrics`
        INPUT_COLUMN = "question"
        TARGET_COLUMN = "answer"
        MAX_INPUT_LENGTH = 256
        MAX_TARGET_LENGTH = 256

        dataset = datasets.load_dataset("openai/gsm8k", "main", split="train[:38]")
        model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
        tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
        data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
        gen_config = GenerationConfig.from_pretrained(
            "google-t5/t5-small", max_length=None, min_length=None, max_new_tokens=256, min_new_tokens=1, num_beams=5
        )

        training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True)

        trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            processing_class=tokenizer,
            data_collator=data_collator,
            compute_metrics=lambda x: {"samples": x[0].shape[0]},
        )

        def prepare_data(examples):
            # Tokenize the inputs and targets; the target ids become the labels
            inputs = examples[INPUT_COLUMN]
            targets = examples[TARGET_COLUMN]

            model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
            labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
            model_inputs["labels"] = labels["input_ids"]

            return model_inputs

        prepared_dataset = dataset.map(prepare_data, batched=True, remove_columns=[INPUT_COLUMN, TARGET_COLUMN])
        dataset_len = len(prepared_dataset)  # 38

        for num_return_sequences in range(3, 0, -1):
            gen_config.num_return_sequences = num_return_sequences
            metrics = trainer.evaluate(eval_dataset=prepared_dataset, generation_config=gen_config)
            assert metrics["eval_samples"] == dataset_len * num_return_sequences, (
                f"Got {metrics['eval_samples']}, expected: {dataset_len * num_return_sequences}"
            )

    @require_torch
    def test_bad_generation_config_fail_early(self):
        # Tests that a bad generation config causes the trainer to fail early
        model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
        tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
        data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
        gen_config = GenerationConfig(do_sample=False, top_p=0.9)  # bad: top_p is not compatible with do_sample=False

        training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True, generation_config=gen_config)
        with self.assertRaises(ValueError) as exc:
            _ = Seq2SeqTrainer(
                model=model,
                args=training_args,
                processing_class=tokenizer,
                data_collator=data_collator,
                compute_metrics=lambda x: {"samples": x[0].shape[0]},
            )
        self.assertIn("Fix these issues to train your model", str(exc.exception))


@require_torch
class TestTranslationExample(TestCasePlus):
    """Tests for the run_translation.py example script (seq2seq training via CLI)."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        examples_dir = Path(__file__).resolve().parents[2] / "examples" / "pytorch" / "translation"
        with ExtendSysPath(str(examples_dir)):
            from run_translation import main as _main

        cls._run_translation_main = staticmethod(_main)

    def _run_translation(
        self,
        distributed=False,
        extra_args_str=None,
        predict_with_generate=True,
        do_train=True,
        do_eval=True,
        do_predict=True,
        n_gpus_to_use=None,
    ):
        data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
        output_dir = self.get_auto_remove_tmp_dir()
        args = f"""
            --model_name_or_path {MBART_TINY}
            --train_file {data_dir}/train.json
            --validation_file {data_dir}/val.json
            --test_file {data_dir}/test.json
            --output_dir {output_dir}
            --max_train_samples 8
            --max_source_length 12
            --max_target_length 12
            --do_train
            --num_train_epochs 1
            --per_device_train_batch_size 4
            --learning_rate 3e-3
            --warmup_steps 8
            --logging_steps 0
            --logging_strategy no
            --save_steps 1
            --train_sampling_strategy group_by_length
            --label_smoothing_factor 0.1
            --target_lang ro_RO
            --source_lang en_XX
            --report_to none
        """.split()

        if do_eval:
            args += """
                --do_eval
                --per_device_eval_batch_size 4
                --max_eval_samples 8
                --val_max_target_length 12
                --eval_strategy steps
                --eval_steps 1
            """.split()

        if do_predict:
            args += ["--do_predict"]

        if predict_with_generate:
            args += ["--predict_with_generate"]

        if do_train:
            args += ["--optim", "adafactor"]

        if extra_args_str is not None:
            args += extra_args_str.split()

        if distributed:
            if n_gpus_to_use is None:
                n_gpus_to_use = backend_device_count(torch_device)
            master_port = get_torch_dist_unique_port()
            distributed_args = f"""
                -m torch.distributed.run
                --nproc_per_node={n_gpus_to_use}
                --master_port={master_port}
                {self.examples_dir_str}/pytorch/translation/run_translation.py
            """.split()
            cmd = [sys.executable] + distributed_args + args
            execute_subprocess_async(cmd, env=self.get_env())
        else:
            testargs = ["run_translation.py"] + args
            with patch.object(sys, "argv", testargs):
                self._run_translation_main()

        return output_dir

@require_torch_non_multi_accelerator
|
||||
def test_run_seq2seq_no_dist(self):
|
||||
output_dir = self._run_translation()
|
||||
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
|
||||
eval_metrics = [log for log in logs if "eval_loss" in log]
|
||||
first_step_stats = eval_metrics[0]
|
||||
assert "eval_bleu" in first_step_stats
|
||||
|
||||
@require_torch_multi_accelerator
|
||||
def test_run_seq2seq_dp(self):
|
||||
output_dir = self._run_translation(distributed=False)
|
||||
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
|
||||
eval_metrics = [log for log in logs if "eval_loss" in log]
|
||||
first_step_stats = eval_metrics[0]
|
||||
assert "eval_bleu" in first_step_stats
|
||||
|
||||
@require_torch_multi_accelerator
|
||||
def test_run_seq2seq_ddp(self):
|
||||
output_dir = self._run_translation(distributed=True)
|
||||
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
|
||||
eval_metrics = [log for log in logs if "eval_loss" in log]
|
||||
first_step_stats = eval_metrics[0]
|
||||
assert "eval_bleu" in first_step_stats
|
||||
|
||||
@slow
|
||||
def test_run_seq2seq_slow(self):
|
||||
output_dir = self._run_translation(
|
||||
extra_args_str=f"--model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 10 --max_source_length 128 --max_target_length 128 --eval_steps 2 --save_steps 2",
|
||||
)
|
||||
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
|
||||
eval_metrics = [log for log in logs if "eval_loss" in log]
|
||||
first_step_stats = eval_metrics[0]
|
||||
last_step_stats = eval_metrics[-1]
|
||||
assert first_step_stats["eval_loss"] > last_step_stats["eval_loss"], "model learned nothing"
|
||||
assert isinstance(last_step_stats["eval_bleu"], float)
|
||||
contents = {os.path.basename(p) for p in os.listdir(output_dir)}
|
||||
assert "generated_predictions.txt" in contents
|
||||
assert "predict_results.json" in contents
|
||||
|
||||
@slow
|
||||
@require_bitsandbytes
|
||||
def test_run_seq2seq_bnb(self):
|
||||
from transformers.training_args import OptimizerNames
|
||||
|
||||
def train_and_return_metrics(optim: str) -> tuple[int, int, float]:
|
||||
output_dir = self._run_translation(
|
||||
distributed=True,
|
||||
extra_args_str=f"--skip_memory_metrics 0 --model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 1 --optim {optim} --max_source_length 128 --max_target_length 128",
|
||||
do_eval=False,
|
||||
do_predict=False,
|
||||
n_gpus_to_use=1,
|
||||
)
|
||||
logs = TrainerState.load_from_json(Path(output_dir, "trainer_state.json")).log_history
|
||||
gpu_peak_mem_mb = int(logs[0]["train_mem_gpu_peaked_delta"] / 2**20)
|
||||
gpu_alloc_mem_mb = int(logs[0]["train_mem_gpu_alloc_delta"] / 2**20)
|
||||
loss = logs[0]["train_loss"]
|
||||
return gpu_peak_mem_mb, gpu_alloc_mem_mb, loss
|
||||
|
||||
gpu_peak_mem_orig, gpu_alloc_mem_orig, loss_orig = train_and_return_metrics(OptimizerNames.ADAMW_TORCH.value)
|
||||
gpu_peak_mem_bnb, gpu_alloc_mem_bnb, loss_bnb = train_and_return_metrics(OptimizerNames.ADAMW_BNB.value)
|
||||
|
||||
gpu_alloc_mem_diff = gpu_alloc_mem_orig - gpu_alloc_mem_bnb
|
||||
gpu_total_mem_orig = gpu_peak_mem_orig + gpu_alloc_mem_orig
|
||||
gpu_total_mem_bnb = gpu_peak_mem_bnb + gpu_alloc_mem_bnb
|
||||
gpu_total_mem_diff = gpu_total_mem_orig - gpu_total_mem_bnb
|
||||
|
||||
expected_savings = 120
|
||||
self.assertGreater(
|
||||
gpu_alloc_mem_diff,
|
||||
expected_savings,
|
||||
f"should use more than {expected_savings}MB less alloc gpu memory with BNB, but got diff={gpu_alloc_mem_diff}MB",
|
||||
)
|
||||
self.assertGreater(
|
||||
gpu_total_mem_diff,
|
||||
expected_savings,
|
||||
f"should use more than {expected_savings}MB less total gpu memory with BNB, but got diff={gpu_total_mem_diff}MB",
|
||||
)
|
||||
self.assertAlmostEqual(loss_orig, loss_bnb, 5, f"loss should be the same: {loss_orig} vs {loss_bnb}")
|
||||
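The memory comparison in `test_run_seq2seq_bnb` comes down to simple arithmetic on the logged byte deltas. A minimal sketch of that arithmetic follows, assuming invented byte counts: the `fake_log_*` dictionaries and the `to_mib` helper are hypothetical and are not real measurements or library functions. Only the log keys and the 120 threshold are taken from the test.

```python
# Hypothetical log entries; the byte counts are invented for illustration only.
MIB = 2**20

fake_log_adamw = {"train_mem_gpu_peaked_delta": 300 * MIB, "train_mem_gpu_alloc_delta": 500 * MIB}
fake_log_bnb = {"train_mem_gpu_peaked_delta": 300 * MIB, "train_mem_gpu_alloc_delta": 340 * MIB}


def to_mib(log: dict) -> tuple[int, int]:
    """Return (peak, alloc) deltas in whole MiB, mirroring int(x / 2**20) in the test."""
    peak = int(log["train_mem_gpu_peaked_delta"] / MIB)
    alloc = int(log["train_mem_gpu_alloc_delta"] / MIB)
    return peak, alloc


peak_orig, alloc_orig = to_mib(fake_log_adamw)
peak_bnb, alloc_bnb = to_mib(fake_log_bnb)

# Savings in allocated memory, and in "total" memory (peak delta + alloc delta).
alloc_diff = alloc_orig - alloc_bnb
total_diff = (peak_orig + alloc_orig) - (peak_bnb + alloc_bnb)

expected_savings = 120  # same threshold as the test
print(alloc_diff > expected_savings, total_diff > expected_savings)
```

With these made-up numbers both differences exceed the threshold, which is the condition the two `assertGreater` calls check.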
406
tests/trainer/test_training_args.py
Normal file
@@ -0,0 +1,406 @@
|
||||
import dataclasses
|
||||
import os
|
||||
import tempfile
|
||||
import unittest
|
||||
from unittest.mock import patch
|
||||
|
||||
import torch
|
||||
|
||||
from transformers import TrainingArguments
|
||||
from transformers.debug_utils import DebugOption
|
||||
from transformers.trainer_utils import HubStrategy, IntervalStrategy, SaveStrategy, SchedulerType
|
||||
from transformers.training_args import OptimizerNames
|
||||
|
||||
|
||||
class TestTrainingArguments(unittest.TestCase):
|
||||
def test_default_output_dir(self):
|
||||
"""Test that output_dir defaults to 'trainer_output' when not specified."""
|
||||
args = TrainingArguments(output_dir=None)
|
||||
self.assertEqual(args.output_dir, "trainer_output")
|
||||
|
||||
def test_custom_output_dir(self):
|
||||
"""Test that output_dir is respected when specified."""
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = TrainingArguments(output_dir=tmp_dir)
|
||||
self.assertEqual(args.output_dir, tmp_dir)
|
||||
|
||||
def test_output_dir_creation(self):
|
||||
"""Test that output_dir is created only when needed."""
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
output_dir = os.path.join(tmp_dir, "test_output")
|
||||
|
||||
# Directory should not exist before creating args
|
||||
self.assertFalse(os.path.exists(output_dir))
|
||||
|
||||
# Create args with save_strategy="no" - should not create directory
|
||||
args = TrainingArguments(
|
||||
output_dir=output_dir,
|
||||
do_train=True,
|
||||
save_strategy="no",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertFalse(os.path.exists(output_dir))
|
||||
|
||||
# Now set save_strategy="steps" - should create directory when needed
|
||||
args.save_strategy = "steps"
|
||||
args.save_steps = 1
|
||||
self.assertFalse(os.path.exists(output_dir)) # Still shouldn't exist
|
||||
|
||||
# Directory should be created when actually needed (e.g. in Trainer)
|
||||
|
||||
def test_torch_empty_cache_steps_requirements(self):
|
||||
"""Test that torch_empty_cache_steps is a positive integer or None."""
|
||||
|
||||
# None is acceptable (feature is disabled):
|
||||
args = TrainingArguments(torch_empty_cache_steps=None)
|
||||
self.assertIsNone(args.torch_empty_cache_steps)
|
||||
|
||||
# non-int is unacceptable:
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(torch_empty_cache_steps=1.0)
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(torch_empty_cache_steps="none")
|
||||
|
||||
# negative int is unacceptable:
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(torch_empty_cache_steps=-1)
|
||||
|
||||
# zero is unacceptable:
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(torch_empty_cache_steps=0)
|
||||
|
||||
# positive int is acceptable:
|
||||
args = TrainingArguments(torch_empty_cache_steps=1)
|
||||
self.assertEqual(args.torch_empty_cache_steps, 1)
|
||||
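The rule exercised by `test_torch_empty_cache_steps_requirements` can be sketched as a small standalone check. This is only an illustration of the behaviour the test asserts; `check_torch_empty_cache_steps` is a hypothetical helper, not the validation code inside `TrainingArguments`.

```python
def check_torch_empty_cache_steps(value):
    """Hypothetical validator: accept None or a positive int, otherwise raise ValueError."""
    if value is None:
        # None means the feature is disabled
        return value
    if not isinstance(value, int) or value <= 0:
        raise ValueError(f"torch_empty_cache_steps must be a positive integer or None, got {value!r}")
    return value


print(check_torch_empty_cache_steps(None), check_torch_empty_cache_steps(1))
```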
|
||||
def test_output_dir_expands_user(self):
|
||||
"""Test that ~ in output_dir is expanded to the user's home directory."""
|
||||
args = TrainingArguments(output_dir="~/foo", report_to=None)
|
||||
self.assertEqual(args.output_dir, os.path.expanduser("~/foo"))
|
||||
|
||||
def test_enum_coercions(self):
|
||||
"""Test that string values are correctly converted to their enum types."""
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
eval_strategy="steps",
|
||||
eval_steps=10,
|
||||
logging_strategy="steps",
|
||||
save_strategy="epoch",
|
||||
hub_strategy="end",
|
||||
lr_scheduler_type="linear",
|
||||
optim="adamw_torch",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.eval_strategy, IntervalStrategy.STEPS)
|
||||
self.assertEqual(args.logging_strategy, IntervalStrategy.STEPS)
|
||||
self.assertEqual(args.save_strategy, SaveStrategy.EPOCH)
|
||||
self.assertEqual(args.hub_strategy, HubStrategy.END)
|
||||
self.assertEqual(args.lr_scheduler_type, SchedulerType.LINEAR)
|
||||
self.assertEqual(args.optim, OptimizerNames.ADAMW_TORCH)
|
||||
|
||||
# Invalid string should raise ValueError
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(output_dir="tmp", eval_strategy="invalid_strategy", report_to=None)
|
||||
|
||||
def test_do_eval_auto_enabled(self):
|
||||
"""Test that do_eval is automatically set to True when eval_strategy is not 'no'."""
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
do_eval=False,
|
||||
eval_strategy="steps",
|
||||
eval_steps=10,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertTrue(args.do_eval)
|
||||
|
||||
def test_eval_steps_fallback_to_logging_steps(self):
|
||||
"""Test that eval_steps falls back to logging_steps when not specified."""
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
eval_strategy="steps",
|
||||
logging_steps=10,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.eval_steps, 10)
|
||||
|
||||
def test_eval_steps_required_when_strategy_steps(self):
|
||||
"""Test that eval_strategy='steps' with logging_steps=0 raises ValueError."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
eval_strategy="steps",
|
||||
logging_steps=0,
|
||||
report_to=None,
|
||||
)
|
||||
|
||||
def test_logging_steps_required_nonzero(self):
|
||||
"""Test that logging_strategy='steps' with logging_steps=0 raises ValueError."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
logging_strategy="steps",
|
||||
logging_steps=0,
|
||||
report_to=None,
|
||||
)
|
||||
|
||||
def test_steps_must_be_integer_when_greater_than_one(self):
|
||||
"""Test that fractional steps >1 raise ValueError, but <=1 are allowed."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
logging_strategy="steps",
|
||||
logging_steps=10.5,
|
||||
report_to=None,
|
||||
)
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
eval_strategy="steps",
|
||||
eval_steps=10.5,
|
||||
report_to=None,
|
||||
)
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
save_strategy="steps",
|
||||
save_steps=10.5,
|
||||
report_to=None,
|
||||
)
|
||||
# Fractional values <=1 (ratios) are allowed
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
logging_strategy="steps",
|
||||
logging_steps=0.5,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.logging_steps, 0.5)
|
||||
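The step-value rule tested above (fractional values above 1 are rejected, values up to 1 are treated as ratios) can be sketched in isolation. `check_steps` is a hypothetical helper written for this illustration and only mirrors what the test asserts, not how `TrainingArguments` implements it.

```python
def check_steps(name: str, value: float) -> float:
    """Hypothetical rule: a value above 1 must be a whole number; values up to 1 pass as ratios."""
    if value > 1 and value != int(value):
        raise ValueError(f"--{name} must be an integer if bigger than 1, got {value}")
    return value


print(check_steps("logging_steps", 0.5), check_steps("save_steps", 10))
```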
|
||||
def test_load_best_model_requires_matching_strategies(self):
|
||||
"""Test load_best_model_at_end validation for strategy and step compatibility."""
|
||||
# Mismatched eval/save strategy should raise
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="steps",
|
||||
eval_steps=10,
|
||||
save_strategy="epoch",
|
||||
report_to=None,
|
||||
)
|
||||
|
||||
# save_steps not a multiple of eval_steps should raise
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="steps",
|
||||
eval_steps=10,
|
||||
save_strategy="steps",
|
||||
save_steps=15,
|
||||
report_to=None,
|
||||
)
|
||||
|
||||
# Valid: matching strategies with compatible steps should not raise
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="steps",
|
||||
eval_steps=10,
|
||||
save_strategy="steps",
|
||||
save_steps=20,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertTrue(args.load_best_model_at_end)
|
||||
|
||||
def test_metric_for_best_model_defaults(self):
|
||||
"""Test default metric_for_best_model and greater_is_better behavior."""
|
||||
# load_best_model_at_end with no metric → defaults to "loss"
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="epoch",
|
||||
save_strategy="epoch",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.metric_for_best_model, "loss")
|
||||
self.assertFalse(args.greater_is_better)
|
||||
|
||||
# metric ending in "loss" → greater_is_better is False
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="epoch",
|
||||
save_strategy="epoch",
|
||||
metric_for_best_model="eval_loss",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertFalse(args.greater_is_better)
|
||||
|
||||
# metric not ending in "loss" → greater_is_better is True
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
load_best_model_at_end=True,
|
||||
eval_strategy="epoch",
|
||||
save_strategy="epoch",
|
||||
metric_for_best_model="accuracy",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertTrue(args.greater_is_better)
|
||||
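The defaults asserted in `test_metric_for_best_model_defaults` follow a simple naming rule. A minimal sketch, assuming a hypothetical helper `default_greater_is_better` that reproduces only the three cases covered by the test:

```python
from typing import Optional


def default_greater_is_better(metric_for_best_model: Optional[str]) -> bool:
    """Hypothetical helper: a metric whose name ends in "loss" should be minimized."""
    # With no metric given, the test expects the default metric to be "loss".
    metric = "loss" if metric_for_best_model is None else metric_for_best_model
    return not metric.endswith("loss")


for name in (None, "eval_loss", "accuracy"):
    print(name, default_greater_is_better(name))
```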
|
||||
def test_fp16_bf16_mutual_exclusivity(self):
|
||||
"""Test that fp16 and bf16 cannot both be True."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(output_dir="tmp", fp16=True, bf16=True, report_to=None)
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(output_dir="tmp", fp16_full_eval=True, bf16_full_eval=True, report_to=None)
|
||||
|
||||
def test_reduce_on_plateau_requires_eval(self):
|
||||
"""Test that reduce_lr_on_plateau scheduler requires an eval strategy."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
lr_scheduler_type="reduce_lr_on_plateau",
|
||||
eval_strategy="no",
|
||||
report_to=None,
|
||||
)
|
||||
|
||||
def test_torch_compile_auto_enable(self):
|
||||
"""Test that torch_compile is auto-enabled when mode or backend is set."""
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
torch_compile_mode="reduce-overhead",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertTrue(args.torch_compile)
|
||||
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
torch_compile_backend="inductor",
|
||||
report_to=None,
|
||||
)
|
||||
self.assertTrue(args.torch_compile)
|
||||
|
||||
# Default backend when torch_compile=True
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
torch_compile=True,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.torch_compile_backend, "inductor")
|
||||
|
||||
def test_report_to_none_handling(self):
|
||||
"""Test report_to normalization for 'none' and string values."""
|
||||
args = TrainingArguments(output_dir="tmp", report_to="none")
|
||||
self.assertEqual(args.report_to, [])
|
||||
|
||||
args = TrainingArguments(output_dir="tmp", report_to=["none"])
|
||||
self.assertEqual(args.report_to, [])
|
||||
|
||||
args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
|
||||
self.assertEqual(args.report_to, ["tensorboard"])
|
||||
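The `report_to` normalization checked above can be illustrated with a short standalone function. `normalize_report_to` is a hypothetical name and covers only the three inputs used in the test; it is not the library's own code path.

```python
def normalize_report_to(report_to):
    """Hypothetical normalizer: "none" or ["none"] becomes [], a single string becomes a list."""
    if report_to == "none" or report_to == ["none"]:
        return []
    if isinstance(report_to, str):
        return [report_to]
    return list(report_to)


print(normalize_report_to("none"), normalize_report_to(["none"]), normalize_report_to("tensorboard"))
```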
|
||||
def test_kubeflow_auto_enable(self):
|
||||
"""Test that kubeflow is auto-enabled when KUBEFLOW_TRAINER_SERVER_URL is set."""
|
||||
with patch.dict(os.environ, {"KUBEFLOW_TRAINER_SERVER_URL": "https://test-url"}, clear=False):
|
||||
# Should auto-add kubeflow when report_to is "none" (default)
|
||||
args = TrainingArguments(output_dir="tmp", report_to="none")
|
||||
self.assertIn("kubeflow", args.report_to)
|
||||
|
||||
# Should auto-add kubeflow to existing list
|
||||
args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
|
||||
self.assertIn("kubeflow", args.report_to)
|
||||
self.assertIn("tensorboard", args.report_to)
|
||||
|
||||
# Should not duplicate if already present
|
||||
args = TrainingArguments(output_dir="tmp", report_to=["kubeflow", "tensorboard"])
|
||||
self.assertEqual(args.report_to.count("kubeflow"), 1)
|
||||
|
||||
# Should not add kubeflow when env var is not set
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
args = TrainingArguments(output_dir="tmp", report_to="none")
|
||||
self.assertNotIn("kubeflow", args.report_to)
|
||||
|
||||
def test_warmup_steps_validation(self):
|
||||
"""Test warmup_steps validation for negative values."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(output_dir="tmp", warmup_steps=-1, report_to=None)
|
||||
|
||||
# Zero and fractional values are valid
|
||||
args = TrainingArguments(output_dir="tmp", warmup_steps=0, report_to=None)
|
||||
self.assertEqual(args.warmup_steps, 0)
|
||||
|
||||
args = TrainingArguments(output_dir="tmp", warmup_steps=0.5, report_to=None)
|
||||
self.assertEqual(args.warmup_steps, 0.5)
|
||||
|
||||
def test_debug_option_parsing(self):
|
||||
"""Test debug string parsing into DebugOption enum list."""
|
||||
args = TrainingArguments(output_dir="tmp", debug="underflow_overflow", report_to=None)
|
||||
self.assertEqual(args.debug, [DebugOption.UNDERFLOW_OVERFLOW])
|
||||
|
||||
args = TrainingArguments(output_dir="tmp", debug=None, report_to=None)
|
||||
self.assertEqual(args.debug, [])
|
||||
|
||||
def test_dataloader_prefetch_requires_workers(self):
|
||||
"""Test that dataloader_prefetch_factor requires num_workers > 0."""
|
||||
with self.assertRaises(ValueError):
|
||||
TrainingArguments(
|
||||
output_dir="tmp",
|
||||
dataloader_prefetch_factor=2,
|
||||
dataloader_num_workers=0,
|
||||
report_to=None,
|
||||
)
|
||||
# Valid: prefetch with workers > 0
|
||||
args = TrainingArguments(
|
||||
output_dir="tmp",
|
||||
dataloader_prefetch_factor=2,
|
||||
dataloader_num_workers=2,
|
||||
report_to=None,
|
||||
)
|
||||
self.assertEqual(args.dataloader_prefetch_factor, 2)
|
||||
|
||||
def test_use_cpu_disables_pin_memory(self):
|
||||
"""Test that use_cpu=True disables dataloader_pin_memory."""
|
||||
args = TrainingArguments(output_dir="tmp", use_cpu=True, report_to=None)
|
||||
self.assertFalse(args.dataloader_pin_memory)
|
||||
|
||||
def test_include_num_input_tokens_seen_coercion(self):
|
||||
"""Test bool-to-string coercion for include_num_input_tokens_seen."""
|
||||
args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=True, report_to=None)
|
||||
self.assertEqual(args.include_num_input_tokens_seen, "all")
|
||||
|
||||
args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=False, report_to=None)
|
||||
self.assertEqual(args.include_num_input_tokens_seen, "no")
|
||||
|
||||
def test_dict_field_parsing(self):
|
||||
"""Test that JSON string dict fields are parsed into dicts."""
|
||||
args = TrainingArguments(output_dir="tmp", lr_scheduler_kwargs='{"factor": 0.5}', report_to=None)
|
||||
self.assertEqual(args.lr_scheduler_kwargs, {"factor": 0.5})
|
||||
|
||||
def test_dtype_to_json(self):
"""Test that a torch.dtype field is serialized by to_dict() as its short string name."""
|
||||
@dataclasses.dataclass
|
||||
class TorchDtypeTrainingArguments(TrainingArguments):
|
||||
dtype: torch.dtype = dataclasses.field(
|
||||
default=torch.float32,
|
||||
)
|
||||
|
||||
for dtype in [
|
||||
"float32",
|
||||
"float64",
|
||||
"complex64",
|
||||
"complex128",
|
||||
"float16",
|
||||
"bfloat16",
|
||||
"uint8",
|
||||
"int8",
|
||||
"int16",
|
||||
"int32",
|
||||
"int64",
|
||||
"bool",
|
||||
]:
|
||||
torch_dtype = getattr(torch, dtype)
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
args = TorchDtypeTrainingArguments(output_dir=tmp_dir, dtype=torch_dtype)
|
||||
|
||||
args_dict = args.to_dict()
|
||||
self.assertIn("dtype", args_dict)
|
||||
self.assertEqual(args_dict["dtype"], dtype)
|
||||
630
tests/trainer/trainer_test_utils.py
Normal file
@@ -0,0 +1,630 @@
|
||||
# Copyright 2018 the HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Shared test infrastructure for the Trainer test suite."""
|
||||
|
||||
import dataclasses
|
||||
import gc
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
|
||||
import numpy as np
|
||||
|
||||
from transformers import (
|
||||
AutoTokenizer,
|
||||
PreTrainedConfig,
|
||||
TrainerCallback,
|
||||
TrainingArguments,
|
||||
is_datasets_available,
|
||||
is_torch_available,
|
||||
)
|
||||
from transformers.testing_utils import (
|
||||
backend_empty_cache,
|
||||
backend_max_memory_allocated,
|
||||
backend_memory_allocated,
|
||||
backend_reset_max_memory_allocated,
|
||||
get_tests_dir,
|
||||
torch_device,
|
||||
)
|
||||
from transformers.utils import (
|
||||
SAFE_WEIGHTS_INDEX_NAME,
|
||||
SAFE_WEIGHTS_NAME,
|
||||
is_accelerate_available,
|
||||
)
|
||||
|
||||
|
||||
if torch_device == "hpu":
|
||||
RTOL = 1e-3
|
||||
ATOL = 1e-3
|
||||
else:
|
||||
RTOL = 1e-5
|
||||
ATOL = 1e-5
|
||||
|
||||
if is_torch_available():
|
||||
import safetensors.torch
|
||||
import torch
|
||||
from torch import nn
|
||||
from torch.utils.data import IterableDataset
|
||||
|
||||
from transformers import (
|
||||
AutoModelForCausalLM,
|
||||
PreTrainedModel,
|
||||
Trainer,
|
||||
TrainerState,
|
||||
)
|
||||
|
||||
if is_datasets_available():
|
||||
import datasets
|
||||
|
||||
# for version specific tests in TrainerIntegrationTest
|
||||
if is_accelerate_available():
|
||||
pass
|
||||
|
||||
|
||||
PATH_SAMPLE_TEXT = f"{get_tests_dir()}/fixtures/sample_text.txt"
|
||||
|
||||
|
||||
def get_dataset(file_path, tokenizer, max_len):
|
||||
dataset = datasets.load_dataset("text", data_files=file_path)
|
||||
|
||||
# Filter out empty lines
|
||||
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)
|
||||
|
||||
# Define tokenization function
|
||||
def tokenize_function(examples):
|
||||
tokenized = tokenizer(examples["text"], add_special_tokens=True, truncation=True, max_length=max_len)
|
||||
# Add labels as a copy of input_ids
|
||||
tokenized["labels"] = tokenized["input_ids"].copy()
|
||||
return tokenized
|
||||
|
||||
# Apply tokenization and remove original text column
|
||||
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
|
||||
|
||||
return tokenized_dataset["train"]
|
||||
|
||||
|
||||
class StoreLossCallback(TrainerCallback):
    """
    Simple callback that stores every logged training loss.
    """

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` defaults to None, so guard before the membership test
        if logs is not None and "loss" in logs:
            self.losses.append(logs["loss"])
|
||||
|
||||
|
||||
class MockCudaOOMCallback(TrainerCallback):
    """
    Simple callback that simulates a CUDA OOM error when
    the batch size is greater than or equal to `batch_size_limit`.
    """

    def __init__(self, batch_size_limit=16):
        self.batch_size_limit = batch_size_limit

    def on_step_end(self, args, state, control, **kwargs):
        # simulate OOM at the end of any step whose batch size is at or above the limit
        if state.train_batch_size >= self.batch_size_limit:
            raise RuntimeError("CUDA out of memory.")
|
||||
|
||||
|
||||
class RegressionDataset:
|
||||
def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
|
||||
np.random.seed(seed)
|
||||
self.label_names = ["labels"] if label_names is None else label_names
|
||||
self.length = length
|
||||
self.x = np.random.normal(size=(length,)).astype(np.float32)
|
||||
self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
|
||||
self.ys = [y.astype(np.float32) for y in self.ys]
|
||||
|
||||
def __len__(self):
|
||||
return self.length
|
||||
|
||||
def __getitem__(self, i):
|
||||
result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
|
||||
result["input_x"] = self.x[i]
|
||||
return result
|
||||
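`RegressionDataset` draws `x` from a standard normal and builds labels as `a * x + b` plus small Gaussian noise, so a least-squares fit should recover values close to `a` and `b`. The sketch below is a stdlib-only analog: it uses `random.Random` instead of NumPy, and the helpers `make_regression_samples` and `fit_line` are hypothetical names introduced for illustration.

```python
import random


def make_regression_samples(a=2.0, b=3.0, length=64, seed=42):
    """Generate (xs, ys) with ys = a * x + b + noise, noise ~ N(0, 0.1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(length)]
    ys = [a * x + b + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


xs, ys = make_regression_samples()
a_hat, b_hat = fit_line(xs, ys)
print(round(a_hat, 2), round(b_hat, 2))
```

The fitted slope and intercept land near the generating values, which is why the regression models in this file are convenient targets for Trainer tests.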
|
||||
|
||||
# Convert bytes to mebibytes (1 MiB = 2**20 bytes), truncated to an int
def bytes2megabytes(x):
    return int(x / 2**20)
|
||||
|
||||
|
||||
# Copied from accelerate: https://github.com/huggingface/accelerate/blob/ee163b66fb7848892519e804688cb4ae981aacbe/src/accelerate/test_utils/scripts/external_deps/test_peak_memory_usage.py#L40C1-L73C68
|
||||
class TorchTracemalloc:
|
||||
def __enter__(self):
|
||||
gc.collect()
|
||||
if torch_device in ["cuda", "xpu"]:
|
||||
backend_empty_cache(torch_device)
|
||||
backend_reset_max_memory_allocated(torch_device) # reset the peak gauge to zero
|
||||
self.begin = backend_memory_allocated(torch_device)
|
||||
else:
|
||||
self.begin = 0
|
||||
return self
|
||||
|
||||
def __exit__(self, *exc):
|
||||
gc.collect()
|
||||
if torch_device in ["cuda", "xpu"]:
|
||||
backend_empty_cache(torch_device)
|
||||
self.end = backend_memory_allocated(torch_device)
|
||||
self.peak = backend_max_memory_allocated(torch_device)
|
||||
else:
|
||||
self.end = 0
|
||||
self.peak = 0
|
||||
self.used = bytes2megabytes(self.end - self.begin)
|
||||
self.peaked = bytes2megabytes(self.peak - self.begin)
|
||||
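The begin/end/peak bookkeeping in `TorchTracemalloc` can be tried without an accelerator. The sketch below swaps the device memory counters for Python's stdlib `tracemalloc`, so it measures host allocations made by the interpreter rather than GPU memory; `HostTracemalloc` is a hypothetical class written only for this illustration.

```python
import gc
import tracemalloc


class HostTracemalloc:
    """Hypothetical CPU-side analog of the begin/end/peak bookkeeping, using stdlib tracemalloc."""

    def __enter__(self):
        gc.collect()
        tracemalloc.start()
        self.begin, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, *exc):
        gc.collect()
        self.end, self.peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.used = self.end - self.begin  # bytes still held at exit
        self.peaked = self.peak - self.begin  # high-water mark above the starting point


with HostTracemalloc() as tracker:
    buffer = bytearray(4 * 2**20)  # allocate roughly 4 MiB
    del buffer  # released before exit, so it shows up in `peaked` rather than `used`

print(tracker.used, tracker.peaked)
```

Separating `used` from `peaked` is the same design choice as in the original class: a temporary buffer that is freed before the block exits only affects the peak figure.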
|
||||
|
||||
@dataclasses.dataclass
|
||||
class RegressionTrainingArguments(TrainingArguments):
|
||||
a: float = 0.0
|
||||
b: float = 0.0
|
||||
|
||||
|
||||
class RepeatDataset:
|
||||
def __init__(self, x, length=64):
|
||||
self.x = x
|
||||
self.length = length
|
||||
|
||||
def __len__(self):
|
||||
return self.length
|
||||
|
||||
def __getitem__(self, i):
|
||||
return {"input_ids": self.x, "labels": self.x}
|
||||
|
||||
|
||||
class SequenceClassificationDataset:
|
||||
def __init__(self, length=64, vocab_size=100, num_labels=5):
|
||||
self.length = length
|
||||
self.sequences = [torch.randint(0, vocab_size, (64,)).tolist() for _ in range(length)]
|
||||
self.labels = torch.randint(0, num_labels, (length,)).tolist()
|
||||
|
||||
def __len__(self):
|
||||
return self.length
|
||||
|
||||
def __getitem__(self, i):
|
||||
return {"input_ids": self.sequences[i], "label": self.labels[i]}
|
||||
|
||||
|
||||
class DynamicShapesDataset:
|
||||
def __init__(self, length=64, seed=42, batch_size=8):
|
||||
self.length = length
|
||||
np.random.seed(seed)
|
||||
sizes = np.random.randint(1, 20, (length // batch_size,))
|
||||
# For easy batching, we make every batch_size consecutive samples the same size.
|
||||
self.xs = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]
|
||||
self.ys = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]
|
||||
|
||||
def __len__(self):
|
||||
return self.length
|
||||
|
||||
def __getitem__(self, i):
|
||||
return {"input_x": self.xs[i], "labels": self.ys[i]}
|
||||
|
||||
|
||||
class AlmostAccuracy:
|
||||
def __init__(self, thresh=0.25):
|
||||
self.thresh = thresh
|
||||
|
||||
def __call__(self, eval_pred):
|
||||
predictions, labels = eval_pred
|
||||
true = np.abs(predictions - labels) <= self.thresh
|
||||
return {"accuracy": true.astype(np.float32).mean().item()}
|
||||
|
||||
|
||||
class AlmostAccuracyBatched:
|
||||
def __init__(self, thresh=0.25):
|
||||
self.thresh = thresh
|
||||
self.batch_acc = []
|
||||
|
||||
def __call__(self, eval_pred, compute_result):
|
||||
predictions, labels = eval_pred
|
||||
if isinstance(predictions, tuple):
|
||||
predictions = predictions[0]
|
||||
if isinstance(labels, tuple):
|
||||
labels = labels[0]
|
||||
batch_size = len(predictions)
|
||||
true = torch.abs(predictions - labels) <= self.thresh
|
||||
acc = true.float().mean().item()
|
||||
self.batch_acc.extend([acc] * batch_size)
|
||||
if compute_result:
|
||||
result = {"accuracy": np.mean(self.batch_acc).item()}
|
||||
self.batch_acc = []
|
||||
return result
|
||||
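`AlmostAccuracyBatched` keeps one accuracy value per sample, so larger batches weigh more in the final mean, and it resets its state once `compute_result` is true. A minimal list-based sketch of that accumulation follows; `BatchedThresholdAccuracy` is a hypothetical stand-in that takes plain Python lists instead of tensors.

```python
class BatchedThresholdAccuracy:
    """Hypothetical list-based analog: accumulate per-batch accuracy, weighted by batch size."""

    def __init__(self, thresh=0.25):
        self.thresh = thresh
        self.batch_acc = []

    def __call__(self, predictions, labels, compute_result):
        hits = [abs(p - l) <= self.thresh for p, l in zip(predictions, labels)]
        acc = sum(hits) / len(hits)
        # Repeat the batch accuracy once per sample so larger batches weigh more.
        self.batch_acc.extend([acc] * len(hits))
        if compute_result:
            result = {"accuracy": sum(self.batch_acc) / len(self.batch_acc)}
            self.batch_acc = []  # reset for the next evaluation run
            return result


metric = BatchedThresholdAccuracy()
metric([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 3.5, 4.2], compute_result=False)  # 3 of 4 within 0.25
final = metric([5.0, 6.0], [5.0, 7.0], compute_result=True)  # 1 of 2 within 0.25
print(final)
```

Here the first batch contributes 0.75 four times and the second 0.5 twice, so the overall accuracy is 4/6, the same as counting 4 hits out of 6 samples.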
|
||||
|
||||
class RegressionModelConfig(PreTrainedConfig):
|
||||
def __init__(self, a=0, b=0, double_output=False, random_torch=True, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.a = a
|
||||
self.b = b
|
||||
self.double_output = double_output
|
||||
self.random_torch = random_torch
|
||||
self.hidden_size = 1
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
|
||||
class SampleIterableDataset(IterableDataset):
|
||||
def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
|
||||
self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)
|
||||
|
||||
def __iter__(self):
|
||||
for i in range(len(self.dataset)):
|
||||
yield self.dataset[i]
|
||||
|
||||
class FiniteIterableDataset(SampleIterableDataset):
|
||||
def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
|
||||
super().__init__(a, b, length, seed, label_names)
|
||||
self.current_sample = 0
|
||||
|
||||
def __iter__(self):
|
||||
while self.current_sample < len(self.dataset):
|
||||
yield self.dataset[self.current_sample]
|
||||
self.current_sample += 1
|
||||
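Because `FiniteIterableDataset` stores its cursor on the instance rather than inside `__iter__`, a second full pass over the same object yields nothing. The stdlib-only sketch below shows that behaviour with a hypothetical `FiniteIterable` class wrapping a plain list.

```python
class FiniteIterable:
    """Hypothetical stdlib analog: the cursor lives on the instance, not in __iter__."""

    def __init__(self, items):
        self.items = items
        self.current_sample = 0

    def __iter__(self):
        while self.current_sample < len(self.items):
            yield self.items[self.current_sample]
            self.current_sample += 1


data = FiniteIterable(["a", "b", "c"])
first_pass = list(data)  # consumes every item and advances the cursor to the end
second_pass = list(data)  # the cursor is already at the end, so nothing is yielded
print(first_pass, second_pass)
```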
|
||||
class MultiLoader:
|
||||
def __init__(self, loaders):
|
||||
self.loaders = loaders
|
||||
|
||||
def __len__(self):
|
||||
return sum(len(loader) for loader in self.loaders)
|
||||
|
||||
def __iter__(self):
|
||||
for loader in self.loaders:
|
||||
yield from loader
|
||||
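`MultiLoader` only relies on `len()` and iteration, so its behaviour can be checked with plain lists. In the short usage sketch below, the lists stand in for `DataLoader` objects, which is an assumption made purely for illustration.

```python
class MultiLoader:
    def __init__(self, loaders):
        self.loaders = loaders

    def __len__(self):
        # Total number of batches across all wrapped loaders
        return sum(len(loader) for loader in self.loaders)

    def __iter__(self):
        # Exhaust each loader in order before moving on to the next one
        for loader in self.loaders:
            yield from loader


# Plain lists stand in for DataLoader objects here.
combined = MultiLoader([[0, 1, 2], [10, 11]])
print(len(combined), list(combined))
```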
|
||||
class CustomDataloaderTrainer(Trainer):
|
||||
def get_train_dataloader(self):
|
||||
dataloaders = [super().get_train_dataloader(), super().get_train_dataloader()]
|
||||
return MultiLoader(dataloaders)
|
||||
|
||||
def get_eval_dataloader(self, eval_dataset):
|
||||
dataloaders = [super().get_eval_dataloader(eval_dataset), super().get_eval_dataloader(eval_dataset)]
|
||||
return MultiLoader(dataloaders)
|
||||
|
||||
class RegressionModel(nn.Module):
|
||||
def __init__(self, a=0, b=0, double_output=False):
|
||||
super().__init__()
|
||||
self.a = nn.Parameter(torch.tensor(a).float())
|
||||
self.b = nn.Parameter(torch.tensor(b).float())
|
||||
self.double_output = double_output
|
||||
self.config = None
|
||||
|
||||
def forward(self, input_x, labels=None, **kwargs):
|
||||
y = input_x * self.a + self.b
|
||||
if labels is None:
|
||||
return (y, y) if self.double_output else (y,)
|
||||
loss = nn.functional.mse_loss(y, labels)
|
||||
return (loss, y, y) if self.double_output else (loss, y)
|
||||
|
||||
class RegressionDictModel(nn.Module):
|
||||
def __init__(self, a=0, b=0):
|
||||
super().__init__()
|
||||
self.a = nn.Parameter(torch.tensor(a).float())
|
||||
self.b = nn.Parameter(torch.tensor(b).float())
|
||||
self.config = None
|
||||
|
||||
def forward(self, input_x, labels=None, **kwargs):
|
||||
y = input_x * self.a + self.b
|
||||
result = {"output": y}
|
||||
if labels is not None:
|
||||
result["loss"] = nn.functional.mse_loss(y, labels)
|
||||
return result
|
||||
|
||||
class RegressionPreTrainedModel(PreTrainedModel):
|
||||
config_class = RegressionModelConfig
|
||||
base_model_prefix = "regression"
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
self.a = nn.Parameter(torch.as_tensor(config.a).float())
|
||||
self.b = nn.Parameter(torch.as_tensor(config.b).float())
|
||||
self.double_output = config.double_output
|
||||
self.post_init()
|
||||
|
||||
def forward(self, input_x, labels=None, **kwargs):
|
||||
y = input_x * self.a + self.b
|
||||
if labels is None:
|
||||
return (y, y) if self.double_output else (y,)
|
||||
loss = nn.functional.mse_loss(y, labels)
|
||||
return (loss, y, y) if self.double_output else (loss, y)
|
||||
|
||||
class RegressionPreTrainedModelWithGradientCheckpointing(PreTrainedModel):
    config_class = RegressionModelConfig
    base_model_prefix = "regression"
    supports_gradient_checkpointing = True

    def __init__(self, config):
        super().__init__(config)
        self.layers = nn.ModuleList([nn.Linear(config.hidden_size, config.hidden_size) for _ in range(4)])
        self.head = nn.Linear(config.hidden_size, 1)
        self.gradient_checkpointing = False
        self.double_output = config.double_output
        self.post_init()

    def forward(self, input_x, labels=None, **kwargs):
        y = input_x.unsqueeze(0)

        for layer in self.layers:
            if self.training and self.gradient_checkpointing:
                outputs = self._gradient_checkpointing_func(layer.__call__, y)
            else:
                outputs = layer(y)

            y = outputs * 3

        logits = self.head(y)

        if labels is None:
            return (logits, logits) if self.double_output else (logits,)

        loss = nn.functional.mse_loss(logits, labels)

        return (loss, y, y) if self.double_output else (loss, y)


class RegressionRandomPreTrainedModel(PreTrainedModel):
    config_class = RegressionModelConfig
    base_model_prefix = "regression"

    def __init__(self, config):
        super().__init__(config)
        self.a = nn.Parameter(torch.as_tensor(config.a).float())
        self.b = nn.Parameter(torch.as_tensor(config.b).float())
        self.random_torch = config.random_torch
        self.post_init()

    def forward(self, input_x, labels=None, **kwargs):
        y = input_x * self.a + self.b
        if self.random_torch:
            torch_rand = torch.randn(1).squeeze()
            np_rand = np.random.rand()
            rand_rand = random.random()

        if self.random_torch:
            y += 0.05 * torch_rand
            y += 0.05 * torch.tensor(np_rand + rand_rand)

        if labels is None:
            return (y,)
        loss = nn.functional.mse_loss(y, labels)
        return (loss, y)


class BasicTextGenerationModel(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, labels=None, **kwargs):
        embedded = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embedded)
        logits = self.fc(lstm_out)
        if labels is None:
            return logits

        loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits


def create_dummy_dataset_for_text_generation(vocab_size, seq_length, num_samples):
    import numpy as np

    # Create random input sequences
    input_ids = np.random.randint(0, vocab_size, (num_samples, seq_length))

    # Create a datasets.Dataset
    dataset = datasets.Dataset.from_dict({"input_ids": input_ids, "labels": input_ids})

    return dataset


class TstLayer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, hidden_size)
        self.ln1 = nn.LayerNorm(hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x):
        h = self.ln1(nn.functional.relu(self.linear1(x)))
        # Feed the first block's output into the second projection so it is not discarded.
        h = nn.functional.relu(self.linear2(h))
        return self.ln2(x + h + self.bias)


def get_regression_trainer(
    a=0,
    b=0,
    double_output=False,
    train_len=64,
    eval_len=64,
    pretrained=True,
    output_dir=None,
    **kwargs,
):
    label_names = kwargs.get("label_names")
    gradient_checkpointing = kwargs.get("gradient_checkpointing", False)
    train_dataset = RegressionDataset(length=train_len, label_names=label_names)
    eval_dataset = RegressionDataset(length=eval_len, label_names=label_names)

    model_init = kwargs.pop("model_init", None)
    if model_init is not None:
        model = None
    else:
        if pretrained:
            config = RegressionModelConfig(a=a, b=b, double_output=double_output)
            # Pick the model class depending on whether gradient checkpointing is enabled.
            target_cls = (
                RegressionPreTrainedModel
                if not gradient_checkpointing
                else RegressionPreTrainedModelWithGradientCheckpointing
            )
            model = target_cls(config)
        else:
            model = RegressionModel(a=a, b=b, double_output=double_output)

    compute_metrics = kwargs.pop("compute_metrics", None)
    data_collator = kwargs.pop("data_collator", None)
    optimizers = kwargs.pop("optimizers", (None, None))
    preprocess_logits_for_metrics = kwargs.pop("preprocess_logits_for_metrics", None)
    assert output_dir is not None, "output_dir should be specified for testing"
    args = RegressionTrainingArguments(output_dir, a=a, b=b, **kwargs)
    trainer = Trainer(
        model,
        args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        optimizers=optimizers,
        model_init=model_init,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    )
    # TODO: the loss computed in RegressionModel doesn't accept num_items_in_batch; fix later.
    trainer.model_accepts_loss_kwargs = False
    return trainer


def get_language_model_trainer(**kwargs):
    dataset = datasets.load_dataset("fka/awesome-chatgpt-prompts")
    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
    tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    def _tokenize_function(examples):
        model_inputs = tokenizer(examples["prompt"], padding="max_length", truncation=True)
        model_inputs["labels"] = np.array(model_inputs["input_ids"]).astype(np.int64)
        return model_inputs

    tokenized_datasets = dataset.map(_tokenize_function, batched=True)
    training_args = TrainingArguments(**kwargs)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
    )

    return trainer


class TrainerIntegrationCommon:
    def check_saved_checkpoints(self, output_dir, freq, total, is_pretrained=True, use_scaler=False):
        weights_file = SAFE_WEIGHTS_NAME
        file_list = [weights_file, "training_args.bin", "optimizer.pt", "scheduler.pt", "trainer_state.json"]
        if is_pretrained:
            file_list.append("config.json")
        if use_scaler:
            file_list.append("scaler.pt")
        for step in range(freq, total, freq):
            checkpoint = os.path.join(output_dir, f"checkpoint-{step}")
            self.assertTrue(os.path.isdir(checkpoint))
            for filename in file_list:
                self.assertTrue(os.path.isfile(os.path.join(checkpoint, filename)))

    def check_best_model_has_been_loaded(
        self,
        output_dir,
        freq,
        total,
        trainer,
        metric,
        greater_is_better=False,
        is_pretrained=True,
    ):
        # Get log history from the final checkpoint (could be at total if not divisible by freq)
        final_checkpoint_step = total if total % freq != 0 else (total // freq) * freq
        checkpoint = os.path.join(output_dir, f"checkpoint-{final_checkpoint_step}")
        log_history = TrainerState.load_from_json(os.path.join(checkpoint, "trainer_state.json")).log_history

        values = [d[metric] for d in log_history if metric in d]
        best_value = max(values) if greater_is_better else min(values)
        best_idx = values.index(best_value)

        # Determine which checkpoint corresponds to the best metric
        # Evals happen at freq intervals, plus potentially at the final step
        eval_steps = list(range(freq, total + 1, freq))
        if total % freq != 0:
            eval_steps.append(total)
        best_checkpoint = eval_steps[best_idx]
        checkpoint = os.path.join(output_dir, f"checkpoint-{best_checkpoint}")
        if is_pretrained:
            best_model = RegressionPreTrainedModel.from_pretrained(checkpoint)
            best_model.to(trainer.args.device)
        else:
            best_model = RegressionModel()
            state_dict = safetensors.torch.load_file(os.path.join(checkpoint, SAFE_WEIGHTS_NAME))
            best_model.load_state_dict(state_dict)
            best_model.to(trainer.args.device)
        torch.testing.assert_close(best_model.a, trainer.model.a)
        torch.testing.assert_close(best_model.b, trainer.model.b)

        metrics = trainer.evaluate()
        self.assertEqual(metrics[metric], best_value)

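# Illustrative sketch (not part of the test suite): how the best metric value is
# mapped back to a checkpoint step in check_best_model_has_been_loaded. The helper
# name `best_checkpoint_step` and the sample metric values are assumptions made
# for illustration; the step arithmetic follows the method above.

```python
def best_checkpoint_step(values, freq, total, greater_is_better=False):
    """Hypothetical helper: return (step, value) of the best evaluation."""
    best_value = max(values) if greater_is_better else min(values)
    # list.index returns the first occurrence, so ties resolve to the earliest eval.
    best_idx = values.index(best_value)
    # Evaluations run every `freq` steps, plus once more at `total` when
    # `total` is not a multiple of `freq`.
    eval_steps = list(range(freq, total + 1, freq))
    if total % freq != 0:
        eval_steps.append(total)
    return eval_steps[best_idx], best_value


# Three evaluations at steps 5, 10 and 12 (the final step, since 12 % 5 != 0).
lowest = best_checkpoint_step([0.9, 0.4, 0.6], freq=5, total=12)
highest = best_checkpoint_step([0.9, 0.4, 0.6], freq=5, total=12, greater_is_better=True)
```

# The mapping assumes exactly one logged metric value per evaluation step, in order.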
    def remove_nan_logs(self, log):
        for key in list(log.keys()):
            if log[key] != log[key]:  # Check if the value is NaN
                del log[key]

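# Illustrative sketch (not part of the test suite): the `value != value` test in
# remove_nan_logs works because NaN is the only float that compares unequal to
# itself. The variant below uses math.isnan instead; the name `drop_nan_values`
# and the sample log are assumptions made for illustration.

```python
import math


def drop_nan_values(log):
    """Hypothetical variant of remove_nan_logs using math.isnan."""
    # Iterate over a snapshot of the keys because the dict is mutated in the loop.
    for key in list(log.keys()):
        value = log[key]
        if isinstance(value, float) and math.isnan(value):
            del log[key]
    return log


log = {"loss": 0.25, "grad_norm": float("nan"), "epoch": 1.0}
drop_nan_values(log)
```

# Removing NaN entries first matters for comparisons, since two dicts that both
# hold a fresh NaN under the same key do not compare equal.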
    def check_trainer_state_are_the_same(self, trainer_state, trainer_state1):
        # We'll pop things so operate on copies.
        state = trainer_state.copy()
        state1 = trainer_state1.copy()
        # Log history may contain different logs for the time metrics (after resuming a training).
        log_history = state.pop("log_history", None)
        log_history1 = state1.pop("log_history", None)
        self.assertEqual(state, state1)
        skip_log_keys = ["train_runtime", "train_samples_per_second", "train_steps_per_second", "train_loss"]
        for log, log1 in zip(log_history, log_history1):
            for key in skip_log_keys:
                _ = log.pop(key, None)
                _ = log1.pop(key, None)

            self.remove_nan_logs(log)
            self.remove_nan_logs(log1)

            self.assertEqual(log, log1)

    def convert_to_sharded_checkpoint(self, folder):
        # Converts a checkpoint of a regression model to a sharded checkpoint.
        loader = safetensors.torch.load_file
        weights_file = os.path.join(folder, SAFE_WEIGHTS_NAME)

        extension = "safetensors"
        saver = safetensors.torch.save_file
        index_file = os.path.join(folder, SAFE_WEIGHTS_INDEX_NAME)
        shard_name = SAFE_WEIGHTS_NAME

        state_dict = loader(weights_file)

        os.remove(weights_file)
        keys = list(state_dict.keys())

        shard_files = [
            shard_name.replace(f".{extension}", f"-{idx + 1:05d}-of-{len(keys):05d}.{extension}")
            for idx in range(len(keys))
        ]
        index = {"metadata": {}, "weight_map": {key: shard_files[i] for i, key in enumerate(keys)}}

        with open(index_file, "w", encoding="utf-8") as f:
            content = json.dumps(index, indent=2, sort_keys=True) + "\n"
            f.write(content)

        for param_name, shard_file in zip(keys, shard_files):
            saver({param_name: state_dict[param_name]}, os.path.join(folder, shard_file))
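
# Illustrative sketch (not part of the test suite): the shard naming and index
# layout produced by convert_to_sharded_checkpoint, one shard per parameter.
# The helper name `build_shard_index` is hypothetical, and the default file name
# "model.safetensors" is an assumption standing in for SAFE_WEIGHTS_NAME.

```python
import json


def build_shard_index(keys, weights_name="model.safetensors"):
    """Hypothetical helper: map each parameter name to its own shard file."""
    stem, extension = weights_name.rsplit(".", 1)
    # Shards are numbered from 1 and zero-padded to five digits.
    shard_files = [f"{stem}-{idx + 1:05d}-of-{len(keys):05d}.{extension}" for idx in range(len(keys))]
    return {"metadata": {}, "weight_map": dict(zip(keys, shard_files))}


index = build_shard_index(["a", "b"])
# Same serialization settings as the method above: sorted keys, two-space indent.
content = json.dumps(index, indent=2, sort_keys=True) + "\n"
```

# For the two-parameter regression model this yields two shards, one per weight.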