first commit

This commit is contained in:
陈赣
2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions

@@ -0,0 +1,122 @@
# Trainer Testing Guide
## Test files
| File | What it covers |
|---|---|
| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers) — not a test file |
| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |
## Running tests
Always use `RUN_SLOW=1` — most trainer tests are `@slow` and will be skipped without it.
### Debugging workflow
**Never run the full suite until the specific failing test passes.** Work from smallest scope outward:
1. **Single GPU** — fastest feedback:
```bash
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
```
2. **Fix and re-run** that same test until it passes.
3. **2 GPUs** — catch DataParallel issues:
```bash
CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
```
4. **Full test class** — check for regressions:
```bash
RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
```
5. **All tests in that file — only at the very end**:
```bash
RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
```
Same for distributed tests — single failing test first, fix, confirm, then widen scope.
**Tip**: `-k` filter applies globally across files. Use full node IDs instead: `pytest file::Class::test`.
## Writing tests
**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. Uses `RegressionModel` + `RegressionDataset` (trains in milliseconds).
For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.
Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.
### Multi-GPU safety
The Trainer uses `nn.DataParallel` when `n_gpu > 1`:
- `train_batch_size = per_device_train_batch_size * n_gpu` — don't hardcode batch sizes in assertions.
- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
- Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
- DataParallel gather introduces ~1e-8 FP differences — use `places=6` for loss assertions.
- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.
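The step arithmetic from the list above can be sketched as follows. This is a minimal illustration, not the Trainer's own code: `expected_train_steps` is a hypothetical helper, and the sample counts are made up for the demonstration.

```python
import math


def expected_train_steps(num_samples, per_device_batch_size, n_gpu, grad_accum, epochs=1):
    """Hypothetical helper: optimizer steps derived from the device count, never hardcoded."""
    # Under nn.DataParallel the effective batch size grows with the number of GPUs.
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    steps_per_epoch = math.ceil(num_samples / (train_batch_size * grad_accum))
    return steps_per_epoch * epochs


if __name__ == "__main__":
    for n_gpu in (1, 2, 8):
        big = expected_train_steps(128, per_device_batch_size=8, n_gpu=n_gpu, grad_accum=2)
        small = expected_train_steps(16, per_device_batch_size=8, n_gpu=n_gpu, grad_accum=2)
        print(f"n_gpu={n_gpu}: 128 samples -> {big} steps, 16 samples -> {small} steps")
```

With 16 samples the run collapses to a single step on every device count shown, so a checkpoint taken mid-run would leave nothing to resume. That is the reason for the 100+ samples guideline.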
### Decorators
`@parameterized.expand` must be **outermost** (top), above `@require_*`.
---
## Distributed tests
### Directory layout
```
distributed/
  test_trainer_distributed.py            # Base: path constants, TrainerDistributedCommon ABC
  test_trainer_distributed_ddp.py        # DDP tests
  test_trainer_distributed_fsdp.py       # FSDP tests (config parsing + distributed)
  test_trainer_distributed_deepspeed.py  # DeepSpeed tests (single-GPU + distributed)
  accelerate_configs/                    # YAML configs for `accelerate launch`
  scripts/                               # Scripts launched as subprocesses
    train.py                             # Main training script (synthetic data, tiny Qwen2)
    torchrun_env_check.py                # Dumps distributed env info to JSON per rank
    ds_config_zero2.json, ds_config_zero3.json
```
### Architecture
Each framework has three pieces:
1. **`{Framework}CommandsMixin`** — `get_torchrun_cmd()` and `get_accelerate_cmd()`.
2. **`TestTrainerDistributed{Framework}`** — framework-specific tests (env parity, etc.). NOT `@slow`.
3. **`TestTrainerDistributed{Framework}Common`** — inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.
MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)` — Mixin before ABC.
`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.
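Why the mixin has to come before the ABC can be sketched with stand-in classes. Everything below is hypothetical (`CommandsMixin`, `DistributedCommon`, `CaseBase`), and it assumes the shared base declares the command builders as abstract, which the real `TrainerDistributedCommon` may or may not do.

```python
from abc import ABC, abstractmethod


class CommandsMixin:
    """Hypothetical stand-in for a {Framework}CommandsMixin."""

    def get_torchrun_cmd(self, script):
        return ["torchrun", "--nproc_per_node", "2", script]


class DistributedCommon(ABC):
    """Hypothetical stand-in for TrainerDistributedCommon."""

    @abstractmethod
    def get_torchrun_cmd(self, script):
        """Supplied by the framework mixin."""

    def check_training(self, config_file=None):
        # Shared scenario: relies on whichever mixin provides the command builder.
        return {"cmd": self.get_torchrun_cmd("train.py"), "config_file": config_file}


class CaseBase:
    """Hypothetical stand-in for TestCasePlus."""


class MixinFirst(CommandsMixin, DistributedCommon, CaseBase):
    pass


class AbcFirst(DistributedCommon, CommandsMixin, CaseBase):
    pass


def can_instantiate(cls):
    """Return True when the class has no unresolved abstract methods."""
    try:
        cls()
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print("Mixin first:", can_instantiate(MixinFirst))
    print("ABC first:  ", can_instantiate(AbcFirst))
    print(MixinFirst().check_training(config_file="ddp.yaml"))
```

With the mixin first, attribute lookup finds the concrete `get_torchrun_cmd` before the abstract one, so the class is instantiable. With the ABC first, the abstract method wins the lookup and instantiation raises `TypeError`.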
### Env parity tests
Both the torchrun launch and the accelerate launch must enable the same framework, otherwise the two environments are not comparable:
- **DDP**: no extra args (both `DistributedType.MULTI_GPU`)
- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`
`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)` — accepts any TrainingArguments flag.
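A parity check over the per-rank JSON dumps might look like the sketch below. The file name pattern `env_rank<N>.json` and the key names come from `torchrun_env_check.py`; the helper names, the choice of compared keys, and the fake data in the demo are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Keys written by torchrun_env_check.py; which ones a real test compares is an assumption.
PARITY_KEYS = (
    "args_world_size",
    "args_parallel_mode",
    "accelerator_num_processes",
    "accelerator_distributed_type",
)


def load_env(output_dir, rank):
    """Read the JSON dump that one rank wrote into output_dir."""
    return json.loads((Path(output_dir) / f"env_rank{rank}.json").read_text())


def env_mismatches(torchrun_dir, accelerate_dir, world_size):
    """Return (rank, key, torchrun_value, accelerate_value) for every differing field."""
    mismatches = []
    for rank in range(world_size):
        left = load_env(torchrun_dir, rank)
        right = load_env(accelerate_dir, rank)
        for key in PARITY_KEYS:
            if left.get(key) != right.get(key):
                mismatches.append((rank, key, left.get(key), right.get(key)))
    return mismatches


def write_fake_dump(output_dir, rank, distributed_type):
    """Demo-only helper that fabricates a dump instead of launching a real run."""
    info = {
        "args_world_size": 2,
        "args_parallel_mode": "ParallelMode.DISTRIBUTED",
        "accelerator_num_processes": 2,
        "accelerator_distributed_type": distributed_type,
    }
    (Path(output_dir) / f"env_rank{rank}.json").write_text(json.dumps(info))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for rank in range(2):
            write_fake_dump(a, rank, "DistributedType.FSDP")
            # Simulates a launch that forgot the framework flags.
            write_fake_dump(b, rank, "DistributedType.MULTI_GPU")
        for mismatch in env_mismatches(a, b, world_size=2):
            print(mismatch)
```

Reporting every mismatch at once, rather than failing on the first, makes it easier to see when one side silently fell back to plain DDP.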
### Adding a distributed test
1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.
### Pitfalls
- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
- FSDP `cpu_offload` is not JSON-serializable — use `str()`.
- `train.py` defaults to `do_train=True` when neither `--do_train` nor `--do_eval` is set. Pass `--do_eval` explicitly for eval; eval is also enabled automatically when `--eval_output_file` is passed.
- DeepSpeed eval only works with ZeRO-3.
- `--fsdp_config` accepts a file path OR JSON string starting with `{`. Same for `--deepspeed`, `--accelerator_config`.
- `args.local_rank` may be -1 before framework consumes it — use `assertIn(val, (rank, -1))`.
- `@parameterized.expand` + ABC: can't use `@abstractmethod` on methods that subclasses decorate with expand.
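Three of the pitfalls above can be reproduced with stand-ins from the standard library. The classes and the `load_config` helper are hypothetical; they mimic the behavior described in the list and are not the Trainer's implementation.

```python
import enum
import json
from dataclasses import dataclass


class ParallelMode(enum.Enum):
    """Stand-in for the real ParallelMode enum."""

    DISTRIBUTED = "distributed"


@dataclass
class CPUOffload:
    """Stand-in for the FSDP cpu_offload object."""

    offload_params: bool = False


def load_config(value):
    """Hypothetical helper: inline JSON when the value starts with '{', else a file path."""
    if value.lstrip().startswith("{"):
        return json.loads(value)
    with open(value) as f:
        return json.load(f)


def is_json_serializable(obj):
    try:
        json.dumps(obj)
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    # str() on an enum member includes the class name.
    print(str(ParallelMode.DISTRIBUTED))
    # An arbitrary object cannot be dumped, its string form can.
    print(is_json_serializable({"fsdp_cpu_offload": CPUOffload()}))
    print(is_json_serializable({"fsdp_cpu_offload": str(CPUOffload())}))
    # Inline JSON needs no file on disk.
    print(load_config('{"fsdp_version": 1}'))
```

For assertions on `str(args.parallel_mode)`, compare against the full `"ParallelMode.DISTRIBUTED"` string, or compare the enum members directly.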

@@ -0,0 +1,3 @@
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2

@@ -0,0 +1,4 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
num_processes: 2

@@ -0,0 +1,9 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
num_processes: 2
parallelism_config:
  parallelism_config_sp_size: 2
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: sdpa

@@ -0,0 +1,4 @@
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero3.json
num_processes: 2

@@ -0,0 +1,4 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 1
num_processes: 2

@@ -0,0 +1,4 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
num_processes: 2

@@ -0,0 +1,10 @@
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
num_processes: 2
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2
  parallelism_config_cp_comm_strategy: alltoall

@@ -0,0 +1,88 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Worker script for dispatch_batches=False with a finite iterable dataset.
Verifies that training completes successfully when ``dispatch_batches``
is disabled.
Run via torchrun or accelerate launch.
"""
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset

from transformers import HfArgumentParser, Trainer, TrainingArguments


class RegressionModel(nn.Module):
    def __init__(self, a=0, b=0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(a).float())
        self.b = nn.Parameter(torch.tensor(b).float())
        self.config = None

    def forward(self, input_x, labels=None, **kwargs):
        y = input_x * self.a + self.b
        if labels is None:
            return (y,)
        loss = nn.functional.mse_loss(y, labels)
        return (loss, y)


class RegressionDataset:
    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
        np.random.seed(seed)
        self.label_names = ["labels"] if label_names is None else label_names
        self.length = length
        self.x = np.random.normal(size=(length,)).astype(np.float32)
        self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
        self.ys = [y.astype(np.float32) for y in self.ys]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
        result["input_x"] = self.x[i]
        return result


class FiniteIterableDataset(IterableDataset):
    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
        self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)
        self.current_sample = 0

    def __iter__(self):
        while self.current_sample < len(self.dataset):
            yield self.dataset[self.current_sample]
            self.current_sample += 1


if __name__ == "__main__":
    parser = HfArgumentParser((TrainingArguments,))
    training_args = parser.parse_args_into_dataclasses()[0]
    training_args.per_device_train_batch_size = 1
    training_args.max_steps = 1
    training_args.accelerator_config.dispatch_batches = False
    train_dataset = FiniteIterableDataset(label_names=["labels", "extra"], length=1)
    model = RegressionModel()
    trainer = Trainer(model, training_args, train_dataset=train_dataset)
    trainer.train()

@@ -0,0 +1,32 @@
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

@@ -0,0 +1,35 @@
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto"
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

@@ -0,0 +1,113 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Worker script for eval/predict ordering tests.
Verifies that distributed eval/predict returns all samples in the correct order.
Run via torchrun or accelerate launch.
"""
import torch
import torch.nn as nn
from torch.utils.data import Dataset

from transformers import EvalPrediction, HfArgumentParser, Trainer, TrainingArguments
from transformers.utils import logging


logger = logging.get_logger(__name__)


class DummyDataset(Dataset):
    def __init__(self, length: int = 101):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i) -> int:
        return i


class DummyDataCollator:
    def __call__(self, features):
        return {"input_ids": torch.tensor(features), "labels": torch.tensor(features)}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Add some (unused) params otherwise DDP will complain.
        self.fc = nn.Linear(120, 80)

    def forward(self, input_ids, labels=None):
        if labels is not None:
            return torch.tensor(0.0, device=input_ids.device), input_ids
        else:
            return input_ids


if __name__ == "__main__":
    parser = HfArgumentParser((TrainingArguments,))
    training_args = parser.parse_args_into_dataclasses()[0]

    for dataset_length in [49, 7]:
        dataset = DummyDataset(dataset_length)

        def compute_metrics(p: EvalPrediction) -> dict:
            sequential = list(range(len(dataset)))
            success = p.predictions.tolist() == sequential and p.label_ids.tolist() == sequential
            if not success and training_args.local_process_index == 0:
                logger.warning(
                    "Predictions and/or labels do not match expected results:\n  - predictions: "
                    f"{p.predictions.tolist()}\n  - labels: {p.label_ids.tolist()}\n  - expected: {sequential}"
                )
            return {"success": success}

        trainer = Trainer(
            model=DummyModel(),
            args=training_args,
            data_collator=DummyDataCollator(),
            eval_dataset=dataset,
            compute_metrics=compute_metrics,
        )

        metrics = trainer.evaluate()
        logger.info(metrics)
        if metrics["eval_success"] is not True:
            logger.error(metrics)
            exit(1)

        p = trainer.predict(dataset)
        logger.info(p.metrics)
        if p.metrics["test_success"] is not True:
            logger.error(p.metrics)
            exit(1)

        trainer.args.eval_accumulation_steps = 2

        metrics = trainer.evaluate()
        logger.info(metrics)
        if metrics["eval_success"] is not True:
            logger.error(metrics)
            exit(1)

        p = trainer.predict(dataset)
        logger.info(p.metrics)
        if p.metrics["test_success"] is not True:
            logger.error(p.metrics)
            exit(1)

        trainer.args.eval_accumulation_steps = None

@@ -0,0 +1,125 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Worker script for FSDP generation tests.
Launched via ``torchrun`` from ``test_trainer_distributed_fsdp.py``.
"""
import argparse
import functools
from collections.abc import Callable
from typing import Any

import torch
import torch.distributed
from torch.distributed._composable.fsdp import fully_shard, register_fsdp_forward_method
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.gpt2.modeling_gpt2 import GPT2Block
from transformers.testing_utils import backend_device_count, backend_torch_accelerator_module, torch_device


data = 4 * [
    "Hello world!",
    "The quick brown fox jumps over the lazy dog.",
]


def manage_process_group(func: Callable[..., Any]) -> Callable[..., Any]:
    """Manage the creation and destruction of the distributed process group for the wrapped function."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        device_count = backend_device_count(torch_device)
        torch.distributed.init_process_group(world_size=device_count)
        try:
            return func(*args, **kwargs)
        finally:
            torch.distributed.destroy_process_group()

    return wrapped


@manage_process_group
def fsdp_generate():
    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))

    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)

    fsdp_model = FullyShardedDataParallel(
        model,
        auto_wrap_policy=functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}),
        limit_all_gathers=True,
        use_orig_params=True,
    )

    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)

    with FullyShardedDataParallel.summon_full_params(fsdp_model):
        _ = fsdp_model.module.generate(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            max_length=30,
        )


@manage_process_group
def fsdp2_generate():
    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))

    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)

    mesh = init_device_mesh(device.type, (torch.distributed.get_world_size(),))
    for submodule in model.modules():
        if isinstance(submodule, GPT2Block):
            fully_shard(submodule, mesh=mesh)
    fully_shard(model, mesh=mesh)

    register_fsdp_forward_method(model, "generate")

    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)

    _ = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_length=30,
    )


if __name__ == "__main__":

    class CLIArgs(argparse.Namespace):
        fsdp: bool
        fsdp2: bool

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--fsdp", action="store_true")
    group.add_argument("--fsdp2", action="store_true")
    args = parser.parse_args(namespace=CLIArgs())

    if args.fsdp:
        fsdp_generate()
    elif args.fsdp2:
        fsdp2_generate()
    else:
        raise ValueError("Missing test selection")

@@ -0,0 +1,114 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Worker script for loss averaging tests.
Verifies that ``average_tokens_across_devices`` produces correct loss
compared to a single-GPU baseline.
When ``--run_both_averaging_modes`` is passed, the script runs training
twice (without and with averaging) in a single process launch, saving
``<output_dir>/broken_losses.json`` and ``<output_dir>/fixed_losses.json``.
Otherwise it runs once and saves ``<output_dir>_losses.json``.
Run via torchrun or accelerate launch.
"""
import argparse
import json

import datasets
import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    Trainer,
    TrainerCallback,
    TrainingArguments,
    set_seed,
)


class StoreLossCallback(TrainerCallback):
    """Simple callback to store the loss."""

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:
            self.losses.append(logs["loss"])


def run_distributed_training(training_args, loss_file):
    set_seed(42)
    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
    dataset_name = "wikitext"
    dataset_config = "wikitext-2-raw-v1"
    dataset = datasets.load_dataset(dataset_name, dataset_config, split="train[:50]")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize_function(examples):
        return tokenizer(examples["text"], max_length=128, padding="max_length", truncation=True)

    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
    loss_callback = StoreLossCallback()

    training_args.logging_steps = 1
    training_args.max_steps = 10
    training_args.learning_rate = 3e-4
    training_args.disable_tqdm = True
    training_args.dataloader_drop_last = True

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_dataset,
        callbacks=[loss_callback],
        data_collator=data_collator,
    )
    trainer.train()

    with open(loss_file, "w") as f:
        json.dump(loss_callback.losses, f)


if __name__ == "__main__":
    # Parse our custom flag first, pass the rest to HfArgumentParser.
    pre_parser = argparse.ArgumentParser(add_help=False)
    pre_parser.add_argument("--run_both_averaging_modes", action="store_true")
    custom_args, remaining = pre_parser.parse_known_args()

    hf_parser = HfArgumentParser((TrainingArguments,))
    (training_args,) = hf_parser.parse_args_into_dataclasses(remaining)

    if custom_args.run_both_averaging_modes:
        base_dir = training_args.output_dir

        # Run without averaging ("broken")
        training_args.average_tokens_across_devices = False
        training_args.output_dir = base_dir + "/broken"
        run_distributed_training(training_args, loss_file=base_dir + "/broken_losses.json")

        # Run with averaging ("fixed")
        training_args.average_tokens_across_devices = True
        training_args.output_dir = base_dir + "/fixed"
        run_distributed_training(training_args, loss_file=base_dir + "/fixed_losses.json")
    else:
        run_distributed_training(training_args, loss_file=training_args.output_dir + "_losses.json")

@@ -0,0 +1,93 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dumps distributed environment info to a JSON file for verification.
This script creates a Trainer (which initializes the accelerator) and writes
each worker's env vars, TrainingArguments fields, and accelerator state to
``<output_dir>/env_rank<N>.json``.
Accepts all TrainingArguments flags (e.g. ``--deepspeed``, ``--fsdp``) so the
Trainer sets up the correct framework regardless of launcher.
Works with any launcher (torchrun, accelerate launch with DDP/FSDP/DeepSpeed).
"""
import json
import os

from transformers import AutoModelForCausalLM, HfArgumentParser, Trainer, TrainingArguments


def main():
    parser = HfArgumentParser((TrainingArguments,))
    (args,) = parser.parse_args_into_dataclasses()
    args.disable_tqdm = True

    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
    model = AutoModelForCausalLM.from_pretrained(model_name)

    trainer = Trainer(model=model, args=args)
    accelerator = trainer.accelerator

    env_info = {
        # Raw env vars set by torchrun / accelerate
        "env_world_size": os.environ.get("WORLD_SIZE"),
        "env_rank": os.environ.get("RANK"),
        "env_local_rank": os.environ.get("LOCAL_RANK"),
        "env_master_addr": os.environ.get("MASTER_ADDR"),
        "env_master_port": os.environ.get("MASTER_PORT"),
        # TrainingArguments-derived values
        "args_local_rank": args.local_rank,
        "args_world_size": args.world_size,
        "args_process_index": args.process_index,
        "args_local_process_index": args.local_process_index,
        "args_parallel_mode": str(args.parallel_mode),
        "args_n_gpu": args.n_gpu,
        # Accelerator state
        "accelerator_num_processes": accelerator.num_processes,
        "accelerator_process_index": accelerator.process_index,
        "accelerator_local_process_index": accelerator.local_process_index,
        "accelerator_is_main_process": accelerator.is_main_process,
        "accelerator_is_local_main_process": accelerator.is_local_main_process,
        "accelerator_use_distributed": accelerator.use_distributed,
        "accelerator_distributed_type": str(accelerator.distributed_type),
        "accelerator_device": str(accelerator.device),
        # Trainer-level flags (these gate framework-specific code paths)
        "trainer_is_fsdp_enabled": trainer.is_fsdp_enabled,
        "trainer_is_deepspeed_enabled": trainer.is_deepspeed_enabled,
    }

    # FSDP plugin info
    fsdp_plugin = getattr(accelerator.state, "fsdp_plugin", None)
    if fsdp_plugin is not None:
        env_info["fsdp_version"] = getattr(fsdp_plugin, "fsdp_version", None)
        env_info["fsdp_sharding_strategy"] = str(getattr(fsdp_plugin, "sharding_strategy", None))
        env_info["fsdp_cpu_offload"] = str(getattr(fsdp_plugin, "cpu_offload", None))
        env_info["fsdp_auto_wrap_policy"] = str(getattr(fsdp_plugin, "auto_wrap_policy", None))

    # DeepSpeed plugin info
    deepspeed_plugin = getattr(accelerator.state, "deepspeed_plugin", None)
    if deepspeed_plugin is not None:
        env_info["deepspeed_zero_stage"] = deepspeed_plugin.zero_stage
        env_info["deepspeed_offload_optimizer_device"] = str(deepspeed_plugin.offload_optimizer_device)
        env_info["deepspeed_offload_param_device"] = str(deepspeed_plugin.offload_param_device)

    output_file = os.path.join(args.output_dir, f"env_rank{args.process_index}.json")
    with open(output_file, "w") as f:
        json.dump(env_info, f)


if __name__ == "__main__":
    main()

@@ -0,0 +1,136 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Simple causal LM script for distributed tests (FSDP, DeepSpeed).
Uses a tiny Qwen2 model with synthetic data so tests run fast
and don't require downloading real datasets.
Supports --do_train (default) and --do_eval via TrainingArguments.
32 training samples are created; with per_device_train_batch_size=4
and 2 GPUs this gives 4 steps per epoch.
"""
import json
import sys
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
DataCollatorForLanguageModeling,
HfArgumentParser,
Trainer,
TrainingArguments,
)
DTYPE_MAP = {"fp32": torch.float32, "bf16": torch.bfloat16, "fp16": torch.float16}
def _pop_custom_arg(name):
"""Pop a custom --name value arg from sys.argv before HfArgumentParser sees it."""
if name in sys.argv:
idx = sys.argv.index(name)
value = sys.argv[idx + 1]
sys.argv.pop(idx)
sys.argv.pop(idx)
return value
return None
def main():
# Parse custom args (not TrainingArguments fields)
model_name = _pop_custom_arg("--model_name") or "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
loss_output_file = _pop_custom_arg("--loss_output_file")
eval_output_file = _pop_custom_arg("--eval_output_file")
model_dtype = _pop_custom_arg("--model_dtype")
attn_impl = _pop_custom_arg("--attn_implementation")
pad_to_multiple_of = _pop_custom_arg("--pad_to_multiple_of")
parser = HfArgumentParser((TrainingArguments,))
(training_args,) = parser.parse_args_into_dataclasses()
# Default to training if neither --do_train nor --do_eval is set
if not training_args.do_train and not training_args.do_eval:
training_args.do_train = True
# Auto-enable eval when an eval output file is requested
if eval_output_file:
training_args.do_eval = True
torch_dtype = DTYPE_MAP[model_dtype] if model_dtype else None
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model_kwargs = {}
if torch_dtype:
model_kwargs["torch_dtype"] = torch_dtype
if attn_impl:
model_kwargs["attn_implementation"] = attn_impl
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
model.generation_config.pad_token_id = tokenizer.pad_token_id
# Synthetic dataset — 32 samples of tokenized text
# With per_device_train_batch_size=4 and 2 GPUs this gives 4 steps per epoch.
texts = [
"The quick brown fox jumps over the lazy dog. " * 5,
"A journey of a thousand miles begins with a single step. " * 5,
"To be or not to be, that is the question. " * 5,
"All that glitters is not gold, all that wanders is not lost. " * 5,
] * 8
train_dataset = None
eval_dataset = None
if training_args.do_train:
train_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts]
if training_args.do_eval:
eval_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts[:8]]
collator_kwargs = {}
if pad_to_multiple_of:
collator_kwargs["pad_to_multiple_of"] = int(pad_to_multiple_of)
training_args.disable_tqdm = True
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, **collator_kwargs),
)
    if training_args.do_train:
        trainer.train()

    if training_args.do_eval:
        eval_metrics = trainer.evaluate()
        if eval_output_file and training_args.process_index == 0:
            with open(eval_output_file, "w") as f:
                json.dump(eval_metrics, f)

    # Save per-step losses for equivalence testing
    if training_args.do_train and loss_output_file and training_args.process_index == 0:
        losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
        with open(loss_output_file, "w") as f:
            json.dump(losses, f)
if __name__ == "__main__":
main()
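
The step count quoted in the dataset comment and the loss filter at the end of `main()` can be checked without loading a model. The sketch below is illustrative only: `steps_per_epoch` and `extract_losses` are hypothetical helper names, the log history is made up, and the step count assumes no gradient accumulation and that a trailing partial batch still counts as one step.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, world_size):
    """Optimizer steps per epoch, assuming no gradient accumulation (illustrative helper)."""
    global_batch_size = per_device_batch_size * world_size
    # Assumption: a trailing partial batch still counts as one step.
    return math.ceil(num_samples / global_batch_size)


def extract_losses(log_history):
    """Keep only entries that carry a training loss, mirroring the filter in the script."""
    return [log["loss"] for log in log_history if "loss" in log]


# 4 texts repeated 8 times give 32 samples; batch size 4 on 2 GPUs gives 4 steps.
print(steps_per_epoch(num_samples=32, per_device_batch_size=4, world_size=2))

# Hypothetical log history: evaluation and summary entries have no "loss" key.
history = [
    {"loss": 2.5, "step": 1},
    {"eval_loss": 2.4, "step": 1},
    {"loss": 2.1, "step": 2},
    {"train_runtime": 3.2, "train_loss": 2.3, "step": 2},
]
print(extract_losses(history))
```

Because the filter tests for the exact key `"loss"`, entries keyed `eval_loss` or `train_loss` are skipped, so the saved list holds one value per logged training step.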


@@ -0,0 +1,4 @@
{
  "image_processor_type": "ViTImageProcessor",
  "size": 30
}


@@ -0,0 +1,87 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Worker script for dataloader worker seed divergence tests.
Verifies that dataloader workers get different random seeds across GPUs,
so that each rank sees different random augmentations.
Run via torchrun or accelerate launch.
"""
import random
import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import HfArgumentParser, Trainer, TrainingArguments, set_seed
from transformers.testing_utils import torch_device
def gather_from_all_gpus(tensor, world_size):
    # Collect one copy of `tensor` from every rank.
    gather_list = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gather_list, tensor)
    return gather_list


class DummyDataset(Dataset):
    def __init__(self):
        self.length = 64

    def __len__(self):
        return self.length

    def __getitem__(self, i) -> dict:
        # Draw from the Python, NumPy, and PyTorch RNGs so all three are checked.
        x = random.random()
        y = np.random.random()
        z = torch.rand([]).item()
        return {"x": torch.tensor([x, y, z])}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x):
        local_tensor = torch.tensor(x, device=torch_device)
        gathered = gather_from_all_gpus(local_tensor, dist.get_world_size())
        # At least one rank must have drawn values that differ from rank 0's.
        assert not all(torch.allclose(t, gathered[0]) for t in gathered[1:])
        y = self.fc(x)
        return (y.mean(), y)
def run_distributed_training(training_args):
set_seed(42)
model = DummyModel()
dataset = DummyDataset()
training_args.max_steps = 3
# dataloader_num_workers must be > 0 to enable worker_init_fn
training_args.dataloader_num_workers = 2
trainer = Trainer(
model,
training_args,
train_dataset=dataset,
)
trainer.train()
if __name__ == "__main__":
parser = HfArgumentParser((TrainingArguments,))
training_args = parser.parse_args_into_dataclasses()[0]
run_distributed_training(training_args)
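
The divergence assertion in `DummyModel.forward` can be tried out on a single machine with plain Python. This is a minimal sketch under stated assumptions: `draw_samples` and `ranks_diverge` are hypothetical names, and offsetting the seed by the rank is an illustrative scheme only, not a description of how `Trainer` or Accelerate derive worker seeds.

```python
import math
import random


def draw_samples(seed, n=3):
    """Draw n floats from a generator seeded with `seed` (stand-in for one rank's worker)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def ranks_diverge(per_rank_samples, tol=1e-8):
    """Return True if at least one rank's samples differ from rank 0's."""
    first = per_rank_samples[0]

    def same(a, b):
        return all(math.isclose(u, v, abs_tol=tol) for u, v in zip(a, b))

    return not all(same(samples, first) for samples in per_rank_samples[1:])


base_seed = 42
# Assumed scheme for illustration only: offset the seed by the rank.
distinct = [draw_samples(base_seed + rank) for rank in range(2)]
# Failure mode the test guards against: every rank reuses the same seed.
identical = [draw_samples(base_seed) for _ in range(2)]

print(ranks_diverge(distinct))
print(ranks_diverge(identical))
```

The predicate has the same shape as the assertion in the worker script: it passes as soon as one rank differs from rank 0, so it detects fully duplicated seeds rather than requiring every pair of ranks to differ.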


@@ -0,0 +1,180 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Shared constants, helpers, and reusable test logic for distributed trainer tests.
This module provides:
- Path constants for test scripts and accelerate configs.
- ``TrainerDistributedCommon``, an abstract base class that contains reusable
test scenarios (training, mixed-precision, gradient accumulation, checkpoint
resume, evaluation). Framework-specific test files (DDP, FSDP, DeepSpeed)
subclass it and wire each scenario to parameterized test methods.
"""
import json
import os
from abc import ABC, abstractmethod
from transformers import is_torch_available
from transformers.testing_utils import execute_subprocess_async
from transformers.trainer_callback import TrainerState
from transformers.trainer_utils import get_last_checkpoint
if is_torch_available():
import torch
# ---------------------------------------------------------------------------
# Path constants
# ---------------------------------------------------------------------------
DISTRIBUTED_DIR = os.path.dirname(__file__)
CONFIGS_DIR = os.path.join(DISTRIBUTED_DIR, "accelerate_configs")
SCRIPTS_DIR = os.path.join(DISTRIBUTED_DIR, "scripts")
TRAIN_SCRIPT = os.path.join(SCRIPTS_DIR, "train.py")
class TrainerDistributedCommon(ABC):
"""Reusable test scenarios shared across DDP, FSDP, and DeepSpeed.
Subclasses must:
1. Implement ``get_accelerate_cmd`` to build the launch command.
2. Define the following test methods (parameterized as needed)::
test_training → self.check_training(dtype, ...)
test_training_mixed_precision → self.check_mixed_precision(dtype, ...)
test_training_with_gradient_accumulation → self.check_gradient_accumulation(...)
test_training_and_can_resume_normally → self.check_resume(...)
test_eval → self.check_eval(...)
These test methods can't be defined here as ``@abstractmethod`` because
``@parameterized.expand`` removes the original method name from the
subclass, which would cause ABC to raise ``TypeError`` at instantiation.
"""
@abstractmethod
def get_accelerate_cmd(self, script, config_file, launch_args=None, script_args=None, **kwargs):
"""Build the full ``accelerate launch`` command list.
Args:
script: Path to the Python script to run.
config_file: Path to the accelerate YAML config (always required).
launch_args: Extra flags inserted *before* the script
(e.g. ``--fsdp_sharding_strategy``, ``--offload_optimizer_device``).
script_args: Extra flags appended *after* the script
(e.g. ``--output_dir``, ``--bf16``).
**kwargs: Framework-specific overrides (e.g. ``num_processes``).
"""
...
# -------------------------------------------------------------------
# Helpers
# -------------------------------------------------------------------
def _get_default_script_args(self, output_dir, num_epochs=1, logging_steps=5, save_steps=None):
"""Build the baseline CLI arguments shared by all training runs."""
args = [
"--output_dir",
output_dir,
"--num_train_epochs",
str(num_epochs),
"--logging_steps",
str(logging_steps),
"--per_device_train_batch_size",
"4",
"--learning_rate",
"5e-5",
]
if save_steps is not None:
args += ["--save_steps", str(save_steps)]
else:
args += ["--save_strategy", "no"]
return args
def _train_and_get_log_history(self, cmd, output_dir):
"""Run a training command and return the log history from the last checkpoint."""
execute_subprocess_async(cmd, env=self.get_env())
checkpoint = get_last_checkpoint(output_dir)
state_file = os.path.join(checkpoint, "trainer_state.json")
return TrainerState.load_from_json(state_file).log_history
# -------------------------------------------------------------------
# Reusable test scenarios — called from subclass test methods
# -------------------------------------------------------------------
def check_training(self, dtype="bf16", **cmd_kwargs):
"""Verify that training completes with the model loaded in *dtype* (no mixed precision)."""
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir) + ["--model_dtype", dtype]
execute_subprocess_async(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
env=self.get_env(),
)
def check_mixed_precision(self, dtype="bf16", **cmd_kwargs):
"""Verify mixed-precision training: model in fp32, compute in *dtype*."""
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir) + ["--model_dtype", "fp32", f"--{dtype}"]
# fp16 requires a non-fused optimizer to avoid nan losses on small models
if dtype == "fp16":
args += ["--optim", "adamw_torch"]
execute_subprocess_async(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
env=self.get_env(),
)
def check_gradient_accumulation(self, **cmd_kwargs):
"""Verify that training with gradient accumulation completes without error."""
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir) + ["--bf16", "--gradient_accumulation_steps", "2"]
execute_subprocess_async(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
env=self.get_env(),
)
def check_resume(self, **cmd_kwargs):
"""Verify that training can resume from a checkpoint with consistent learning rates."""
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2) + ["--bf16"]
original_logs = self._train_and_get_log_history(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
output_dir,
)
checkpoint = os.path.join(output_dir, "checkpoint-2")
self.assertTrue(os.path.isdir(checkpoint), f"Checkpoint dir not found: {checkpoint}")
resume_args = args + ["--resume_from_checkpoint", checkpoint]
resumed_logs = self._train_and_get_log_history(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
output_dir,
)
for original, resumed in zip(original_logs, resumed_logs):
if "learning_rate" in original:
self.assertAlmostEqual(original["learning_rate"], resumed["learning_rate"], delta=1e-5)
def check_eval(self, **cmd_kwargs):
"""Verify that evaluation produces a finite eval loss."""
output_dir = self.get_auto_remove_tmp_dir()
eval_output = os.path.join(output_dir, "eval_metrics.json")
args = self._get_default_script_args(output_dir) + ["--do_eval", "--eval_output_file", eval_output]
execute_subprocess_async(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
env=self.get_env(),
)
with open(eval_output) as f:
eval_metrics = json.load(f)
self.assertIn("eval_loss", eval_metrics)
self.assertTrue(torch.isfinite(torch.tensor(eval_metrics["eval_loss"])))
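
The learning-rate comparison in `check_resume` can be exercised in isolation. The sketch below uses a hypothetical helper, `learning_rate_mismatches`, and made-up log entries. Unlike the test, it collects mismatches instead of asserting, and it treats a missing `learning_rate` in the resumed entry as a mismatch rather than raising `KeyError`.

```python
def learning_rate_mismatches(original_logs, resumed_logs, delta=1e-5):
    """Return (index, original_lr, resumed_lr) for entries that differ by more than delta."""
    mismatches = []
    for i, (original, resumed) in enumerate(zip(original_logs, resumed_logs)):
        if "learning_rate" not in original:
            continue  # e.g. the final summary entry carries no learning rate
        resumed_lr = resumed.get("learning_rate")
        if resumed_lr is None or abs(original["learning_rate"] - resumed_lr) > delta:
            mismatches.append((i, original["learning_rate"], resumed_lr))
    return mismatches


# Hypothetical log histories, shaped like TrainerState.log_history entries.
original = [
    {"learning_rate": 5e-5, "step": 2},
    {"learning_rate": 3e-5, "step": 4},
    {"train_loss": 1.0, "step": 4},
]
drifted = [
    {"learning_rate": 5e-5, "step": 2},
    {"learning_rate": 5e-5, "step": 4},  # scheduler state was not restored
    {"train_loss": 1.0, "step": 4},
]

print(learning_rate_mismatches(original, original))
print([index for index, _, _ in learning_rate_mismatches(original, drifted)])
```

Note that `zip` stops at the shorter history, so extra entries in either run are not compared.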


@@ -0,0 +1,297 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
DDP-specific distributed trainer tests.
"""
import json
import os
import re
from parameterized import parameterized
from transformers.testing_utils import (
CaptureStderr,
TestCasePlus,
backend_device_count,
execute_subprocess_async,
get_torch_dist_unique_port,
require_torch_multi_accelerator,
slow,
torch_device,
)
from transformers.utils import is_torch_bf16_available_on_device, is_torch_fp16_available_on_device
from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TrainerDistributedCommon
DDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "ddp.yaml")
dtypes = []
if is_torch_bf16_available_on_device(torch_device):
dtypes += ["bf16"]
if is_torch_fp16_available_on_device(torch_device):
dtypes += ["fp16"]
pure_dtype_params = ["fp32"] + dtypes
mixed_precision_params = list(dtypes)
def _parameterized_custom_name_func(func, param_num, param):
param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
return f"{func.__name__}_{param_based_name}"
class DDPCommandsMixin:
"""Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for DDP."""
def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
if num_processes is None:
num_processes = backend_device_count(torch_device)
port = get_torch_dist_unique_port()
cmd = [
"torchrun",
f"--nproc_per_node={num_processes}",
"--nnodes=1",
f"--master_port={port}",
script,
]
if script_args:
cmd.extend(script_args)
return cmd
def get_accelerate_cmd(
self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
):
if num_processes is None:
num_processes = backend_device_count(torch_device)
port = get_torch_dist_unique_port()
cmd = [
"accelerate",
"launch",
"--config_file",
config_file,
"--num_processes",
str(num_processes),
"--main_process_port",
str(port),
]
if launch_args:
cmd.extend(launch_args)
cmd.append(script)
if script_args:
cmd.extend(script_args)
return cmd
@slow
@require_torch_multi_accelerator
class TestTrainerDistributedDDP(DDPCommandsMixin, TestCasePlus):
# -----------------------------------------------------------------------
# accelerate launch tests
# -----------------------------------------------------------------------
def test_eval_order(self):
output_dir = self.get_auto_remove_tmp_dir()
script = os.path.join(SCRIPTS_DIR, "eval_ddp.py")
cmd = self.get_accelerate_cmd(
script,
DDP_CONFIG_FILE,
script_args=["--output_dir", output_dir],
)
execute_subprocess_async(cmd, env=self.get_env())
def test_loss_averaging(self):
device_count = backend_device_count(torch_device)
min_bs = 2
output_dir = self.get_auto_remove_tmp_dir()
script = os.path.join(SCRIPTS_DIR, "loss_averaging.py")
# Launch 1: single-GPU baseline
cmd = self.get_accelerate_cmd(
script,
DDP_CONFIG_FILE,
script_args=[
"--output_dir",
f"{output_dir}/base",
"--per_device_train_batch_size",
str(min_bs * device_count),
"--average_tokens_across_devices",
"True",
],
num_processes=1,
)
execute_subprocess_async(cmd, env=self.get_env())
# Launch 2: multi-GPU with both averaging modes in one process
cmd = self.get_accelerate_cmd(
script,
DDP_CONFIG_FILE,
script_args=[
"--output_dir",
f"{output_dir}/multi",
"--per_device_train_batch_size",
str(min_bs),
"--run_both_averaging_modes",
],
num_processes=device_count,
)
execute_subprocess_async(cmd, env=self.get_env())
with open(f"{output_dir}/base_losses.json") as f:
base_loss = json.load(f)
with open(f"{output_dir}/multi/broken_losses.json") as f:
broken_loss = json.load(f)
with open(f"{output_dir}/multi/fixed_losses.json") as f:
fixed_loss = json.load(f)
broken_diff = [abs(base_loss[i] - broken_loss[i]) for i in range(len(base_loss))]
fixed_diff = [abs(base_loss[i] - fixed_loss[i]) for i in range(len(base_loss))]
sum_base = sum(base_loss)
sum_broken = sum(broken_loss)
relative_broken = abs(sum_base - sum_broken) / max(sum_base, sum_broken)
self.assertGreater(max(broken_diff), 0.5)
self.assertLess(max(fixed_diff), 0.005)
self.assertLess(relative_broken, 0.1)
def test_worker_seed(self):
output_dir = self.get_auto_remove_tmp_dir()
script = os.path.join(SCRIPTS_DIR, "worker_seed.py")
cmd = self.get_accelerate_cmd(
script,
DDP_CONFIG_FILE,
script_args=["--output_dir", output_dir],
)
execute_subprocess_async(cmd, env=self.get_env())
# -----------------------------------------------------------------------
# torchrun vs accelerate env parity
# -----------------------------------------------------------------------
def test_torchrun_accelerate_env_parity(self):
"""Verify torchrun and accelerate launch produce the same distributed environment for DDP."""
script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
num_processes = backend_device_count(torch_device)
torchrun_dir = self.get_auto_remove_tmp_dir()
cmd = self.get_torchrun_cmd(script, script_args=["--output_dir", torchrun_dir], num_processes=num_processes)
execute_subprocess_async(cmd, env=self.get_env())
accelerate_dir = self.get_auto_remove_tmp_dir()
cmd = self.get_accelerate_cmd(
script, DDP_CONFIG_FILE, script_args=["--output_dir", accelerate_dir], num_processes=num_processes
)
execute_subprocess_async(cmd, env=self.get_env())
for rank in range(num_processes):
with open(os.path.join(torchrun_dir, f"env_rank{rank}.json")) as f:
tr = json.load(f)
with open(os.path.join(accelerate_dir, f"env_rank{rank}.json")) as f:
ac = json.load(f)
for info in (tr, ac):
# Rank consistency: env vars, TrainingArguments, and accelerator all agree
self.assertEqual(info["env_world_size"], str(num_processes))
self.assertEqual(info["env_rank"], str(rank))
self.assertEqual(info["env_local_rank"], str(rank))
self.assertEqual(info["args_process_index"], rank)
self.assertEqual(info["args_local_process_index"], rank)
self.assertIn(info["args_local_rank"], (rank, -1)) # may be -1 before framework consumes it
self.assertEqual(info["accelerator_process_index"], rank)
self.assertEqual(info["accelerator_local_process_index"], rank)
self.assertIsNotNone(info["env_master_addr"])
self.assertIsNotNone(info["env_master_port"])
# World size and parallel mode
self.assertEqual(info["args_world_size"], num_processes)
self.assertEqual(info["args_n_gpu"], 1)
self.assertEqual(info["args_parallel_mode"], "ParallelMode.DISTRIBUTED")
self.assertEqual(info["accelerator_num_processes"], num_processes)
self.assertTrue(info["accelerator_use_distributed"])
self.assertEqual(info["accelerator_is_main_process"], rank == 0)
self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)
# DDP: distributed type is MULTI_GPU
self.assertEqual(info["accelerator_distributed_type"], "DistributedType.MULTI_GPU")
# Each rank on its own device
self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])
# DDP should not activate FSDP or DeepSpeed
self.assertFalse(info["trainer_is_fsdp_enabled"])
self.assertFalse(info["trainer_is_deepspeed_enabled"])
self.assertNotIn("fsdp_version", info)
self.assertNotIn("deepspeed_zero_stage", info)
@parameterized.expand(
[
("base", "--log_level info", 1),
("low", "--log_level debug --log_level_replica debug", 2),
("high", "--log_level error --log_level_replica debug", 1),
("mixed", "--log_level error --log_level_replica error", 0),
]
)
def test_log_level_replica(self, _name, extra_args_str, expected_matches):
"""Test that log_level_replica controls logging on non-main processes."""
output_dir = self.get_auto_remove_tmp_dir()
script = os.path.join(SCRIPTS_DIR, "train.py")
script_args = [
"--output_dir",
output_dir,
"--num_train_epochs",
"1",
"--per_device_train_batch_size",
"4",
"--logging_strategy",
"no",
]
if extra_args_str:
script_args.extend(extra_args_str.split())
cmd = self.get_accelerate_cmd(script, DDP_CONFIG_FILE, script_args=script_args, num_processes=2)
log_info_string = "Running training"
with CaptureStderr() as cl:
execute_subprocess_async(cmd, env=self.get_env())
n_matches = len(re.findall(log_info_string, cl.err))
self.assertEqual(n_matches, expected_matches)
# ---------------------------------------------------------------------------
# DDP training integration tests (using train.py)
# ---------------------------------------------------------------------------
@slow
@require_torch_multi_accelerator
class TestTrainerDistributedDDPCommon(DDPCommandsMixin, TrainerDistributedCommon, TestCasePlus):
"""
Distributed DDP training tests using ``accelerate launch`` with the shared
train.py script. Mirrors the test structure used in FSDP and DeepSpeed.
"""
@parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
def test_training(self, dtype):
self.check_training(dtype, config_file=DDP_CONFIG_FILE)
@parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
def test_training_mixed_precision(self, dtype):
self.check_mixed_precision(dtype, config_file=DDP_CONFIG_FILE)
def test_training_with_gradient_accumulation(self):
self.check_gradient_accumulation(config_file=DDP_CONFIG_FILE)
def test_training_and_can_resume_normally(self):
self.check_resume(config_file=DDP_CONFIG_FILE)
def test_eval(self):
self.check_eval(config_file=DDP_CONFIG_FILE)
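
The argument ordering that `get_accelerate_cmd` relies on (launcher flags before the script, script flags after it) can be shown with a standalone function. This is a sketch, not the test helper itself: `build_accelerate_cmd` is a hypothetical name, the port is passed in explicitly instead of coming from `get_torch_dist_unique_port`, and the paths and port number are placeholders.

```python
def build_accelerate_cmd(script, config_file, num_processes, port, launch_args=None, script_args=None):
    """Assemble an `accelerate launch` argument list in the same order as the mixin."""
    cmd = [
        "accelerate",
        "launch",
        "--config_file",
        config_file,
        "--num_processes",
        str(num_processes),
        "--main_process_port",
        str(port),
    ]
    # Launcher flags must come before the script path.
    cmd.extend(launch_args or [])
    cmd.append(script)
    # Everything after the script path is passed on to the script itself.
    cmd.extend(script_args or [])
    return cmd


# Placeholder paths and port, for illustration only.
cmd = build_accelerate_cmd(
    script="scripts/train.py",
    config_file="accelerate_configs/fsdp.yaml",
    num_processes=2,
    port=29500,
    launch_args=["--fsdp_auto_wrap_policy", "TRANSFORMER_BASED_WRAP"],
    script_args=["--output_dir", "/tmp/out", "--bf16"],
)
print(" ".join(cmd))
```

Keeping the two argument groups separate is what lets the shared scenarios pass training flags through `script_args` while framework-specific test files add launcher flags through `launch_args`.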



@@ -0,0 +1,668 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
FSDP-specific distributed trainer tests.
"""
import itertools
import json
import os
import unittest
from functools import partial
from pathlib import Path
from unittest.mock import patch
from parameterized import parameterized
from tests.trainer.trainer_test_utils import TrainerIntegrationCommon, get_regression_trainer # noqa
from transformers import HfArgumentParser, PreTrainedConfig, TrainingArguments, is_torch_available
from transformers.testing_utils import (
TestCasePlus,
backend_device_count,
execute_subprocess_async,
get_torch_dist_unique_port,
mockenv_context,
require_torch,
require_torch_accelerator,
require_torch_multi_accelerator,
slow,
torch_device,
)
from transformers.trainer_utils import set_seed
from transformers.utils import (
is_torch_bf16_available_on_device,
is_torch_fp16_available_on_device,
)
from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TRAIN_SCRIPT, TrainerDistributedCommon
if is_torch_available():
import torch
from torch import nn
from transformers import PreTrainedModel
from transformers.trainer import FSDP_MODEL_NAME
# Base accelerate configs (version only — model-specific settings via launch args)
FSDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp.yaml")
FSDP2_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2.yaml")
FSDP2_CP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2_cp.yaml")
FSDP_GENERATE_SCRIPT = os.path.join(SCRIPTS_DIR, "fsdp_generate.py")
FSDP_CONFIGS = {
"fsdp1": FSDP_CONFIG_FILE,
"fsdp2": FSDP2_CONFIG_FILE,
}
# Launch args shared by all training tests
TRAIN_LAUNCH_ARGS = [
"--fsdp_auto_wrap_policy",
"TRANSFORMER_BASED_WRAP",
]
dtypes = []
if is_torch_bf16_available_on_device(torch_device):
dtypes += ["bf16"]
if is_torch_fp16_available_on_device(torch_device):
dtypes += ["fp16"]
sharding_strategies = ["full_shard", "shard_grad_op"] # zero3 and zero2
fsdp_versions = ["fsdp1", "fsdp2"]
config_params = list(itertools.product(sharding_strategies, dtypes))
# Mixed precision: model loaded in fp32, training with --bf16/--fp16
mixed_precision_params = list(itertools.product(sharding_strategies, dtypes, fsdp_versions))
# Pure dtype: model loaded in target dtype, no mixed precision flags
pure_dtype_params = list(itertools.product(["fp32"] + dtypes, fsdp_versions))
resume_params = [
("FULL_STATE_DICT", "fsdp1"), # FULL_STATE_DICT only supported for fsdp1
("SHARDED_STATE_DICT", "fsdp1"),
("SHARDED_STATE_DICT", "fsdp2"),
]
set_seed(42)
if is_torch_available():
# hack to restore original logging level pre #21700
get_regression_trainer = partial(get_regression_trainer, log_level="info")
if is_torch_available():
class _BaseModel(PreTrainedModel):
base_model_prefix = "base"
config_class = PreTrainedConfig
def __init__(self, config):
super().__init__(config)
self.linear = nn.Linear(5, 5)
self.linear_2 = nn.Linear(5, 5)
self.post_init()
def forward(self, x):
return self.linear_2(self.linear(x))
@require_torch
class InitializeMissingKeysTest(unittest.TestCase):
"""Tests for FSDP non-rank-0 weight initialization: params should be moved from meta to CPU
and marked as initialized without being re-initialized."""
def _clear_init_flags(self, model):
for module in model.modules():
if hasattr(module, "_is_hf_initialized"):
delattr(module, "_is_hf_initialized")
for param in model.parameters():
if hasattr(param, "_is_hf_initialized"):
delattr(param, "_is_hf_initialized")
for buffer in model.buffers():
if hasattr(buffer, "_is_hf_initialized"):
delattr(buffer, "_is_hf_initialized")
def test_move_missing_keys_fsdp_non_rank0_moves_meta_to_cpu(self):
"""FSDP non-rank-0 path should move all params from meta to CPU."""
with torch.device("meta"):
model = _BaseModel(PreTrainedConfig())
for param in model.parameters():
self.assertEqual(param.device, torch.device("meta"))
with (
patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
):
model._move_missing_keys_from_meta_to_device(
missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
)
for name, param in model.named_parameters():
self.assertEqual(param.device, torch.device("cpu"), f"param {name} should be on CPU after FSDP move")
def test_fsdp_non_rank0_end_to_end_no_reinit(self):
"""End-to-end: move from meta + _initialize_missing_keys should mark all params initialized
without changing their values."""
with torch.device("meta"):
model = _BaseModel(PreTrainedConfig())
with (
patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
):
model._move_missing_keys_from_meta_to_device(
missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
)
pre_init_values = {name: param.clone() for name, param in model.named_parameters()}
self._clear_init_flags(model)
model._initialize_missing_keys(is_quantized=False)
for name, param in model.named_parameters():
self.assertTrue(getattr(param, "_is_hf_initialized", False), f"param {name} not marked initialized")
torch.testing.assert_close(param, pre_init_values[name], msg=f"param {name} was re-initialized")
self.assertTrue(getattr(model, "_is_hf_initialized", False))
def _parameterized_custom_name_func(func, param_num, param):
# customize the test name generator function as we want both params to appear in the sub-test
# name, as by default it shows only the first param
param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
return f"{func.__name__}_{param_based_name}"
# ---------------------------------------------------------------------------
# Command mixins
# ---------------------------------------------------------------------------
class FSDPCommandsMixin:
"""Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for FSDP."""
def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
if num_processes is None:
num_processes = backend_device_count(torch_device)
port = get_torch_dist_unique_port()
cmd = [
"torchrun",
f"--nproc_per_node={num_processes}",
"--nnodes=1",
f"--master_port={port}",
script,
]
if script_args:
cmd.extend(script_args)
return cmd
def get_accelerate_cmd(
self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
):
if num_processes is None:
num_processes = backend_device_count(torch_device)
port = get_torch_dist_unique_port()
cmd = [
"accelerate",
"launch",
"--config_file",
config_file,
"--num_processes",
str(num_processes),
"--main_process_port",
str(port),
]
if launch_args:
cmd.extend(launch_args)
cmd.append(script)
if script_args:
cmd.extend(script_args)
return cmd
# ---------------------------------------------------------------------------
# Config parsing tests (no distributed training runs)
# ---------------------------------------------------------------------------
@require_torch_accelerator
class TestFSDPConfig(TestCasePlus):
def setUp(self):
super().setUp()
master_port = get_torch_dist_unique_port()
self.dist_env_1_gpu = {
"MASTER_ADDR": "localhost",
"MASTER_PORT": str(master_port),
"RANK": "0",
"LOCAL_RANK": "0",
"WORLD_SIZE": "1",
}
self.accelerate_fsdp_config = {
"fsdp_activation_checkpointing": False,
"fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
"fsdp_backward_prefetch": "BACKWARD_PRE",
"fsdp_cpu_ram_efficient_loading": True,
"fsdp_forward_prefetch": False,
"fsdp_offload_params": False,
"fsdp_reshard_after_forward": "FULL_SHARD",
"fsdp_state_dict_type": "FULL_STATE_DICT",
"fsdp_sync_module_states": True,
"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
"fsdp_use_orig_params": True,
"fsdp_version": 1,
}
self.fsdp_config = {
"backward_prefetch": "BACKWARD_PRE",
"forward_prefetch": "false",
"limit_all_gathers": "false",
"use_orig_params": "true",
"sync_module_states": "true",
"cpu_ram_efficient_loading": "true",
"activation_checkpointing": "false",
"min_num_params": 1,
}
    @parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
    def test_accelerate_fsdp_config(self, sharding_strategy, dtype):
        output_dir = self.get_auto_remove_tmp_dir()
        # Snapshot before trainer construction: `_process_fsdp_args` strips the
        # `fsdp_` prefix in place.
        expected = dict(self.accelerate_fsdp_config)
        kwargs = {
            "output_dir": output_dir,
            "train_len": 128,
            "save_steps": 5,
            "learning_rate": 0.1,
            "fsdp": f"{sharding_strategy} offload auto_wrap",
            "fsdp_config": self.accelerate_fsdp_config,
        }
        kwargs[dtype] = True
        with mockenv_context(**self.dist_env_1_gpu):
            trainer = get_regression_trainer(**kwargs)
            self.assertIs(trainer.args.fsdp, True)
            self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
            for k, v in expected.items():
                # Use a unittest assertion so the check survives `python -O`.
                self.assertTrue(k.startswith("fsdp_"), f"unexpected key without fsdp_ prefix: {k}")
                # `transformer_layer_cls_to_wrap` is normalized from str → list during parsing.
                if k == "fsdp_transformer_layer_cls_to_wrap" and isinstance(v, str):
                    v = [v]
                self.assertEqual(trainer.args.fsdp_config[k[5:]], v)
def test_torchrun_fsdp_config(self):
"""Verify that --fsdp + --fsdp_config (torchrun-style) are parsed correctly."""
output_dir = self.get_auto_remove_tmp_dir()
fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer"}
kwargs = {
"output_dir": output_dir,
"train_len": 128,
"save_steps": 5,
"learning_rate": 0.1,
"fsdp": "full_shard auto_wrap",
"fsdp_config": fsdp_config,
"bf16": True,
}
with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(**kwargs)
self.assertIs(trainer.args.fsdp, True)
# fsdp_ prefix is stripped and value is normalized to a list during parsing
self.assertIn("Qwen2DecoderLayer", trainer.args.fsdp_config["transformer_layer_cls_to_wrap"])
def test_fsdp_cli_parsing(self):
"""`--fsdp` (bare) → True; legacy `--fsdp full_shard` still parses; absent → None."""
parser = HfArgumentParser(TrainingArguments)
base = ["--output_dir", "/tmp/x"]
args, _ = parser.parse_known_args([*base, "--fsdp"])
self.assertIs(args.fsdp, True)
args, _ = parser.parse_known_args([*base, "--fsdp", "full_shard"])
self.assertEqual(args.fsdp, "full_shard")
args, _ = parser.parse_known_args(base)
self.assertIsNone(args.fsdp)
# Bare `--fsdp` should resolve to a fully enabled FSDP setup through `_process_fsdp_args`.
with mockenv_context(**self.dist_env_1_gpu):
trainer_args = TrainingArguments(output_dir="/tmp/x", fsdp=True)
self.assertIs(trainer_args.fsdp, True)
self.assertIsNotNone(trainer_args.fsdp_plugin_args)
@parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
def test_fsdp_config(self, sharding_strategy, dtype):
output_dir = self.get_auto_remove_tmp_dir()
kwargs = {
"output_dir": output_dir,
"train_len": 128,
"save_steps": 5,
"learning_rate": 0.1,
"fsdp": f"{sharding_strategy} offload auto_wrap",
"fsdp_config": self.fsdp_config,
}
kwargs[dtype] = True
with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(**kwargs)
self.assertIs(trainer.args.fsdp, True)
self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
for k, v in self.fsdp_config.items():
self.assertEqual(trainer.args.fsdp_config[k], v)
# ---------------------------------------------------------------------------
# FSDP distributed tests
# ---------------------------------------------------------------------------
@require_torch_multi_accelerator
class TestTrainerDistributedFSDP(FSDPCommandsMixin, TestCasePlus):
def _run_env_check(self, cmd, num_processes):
"""Run the env check script and return per-rank results."""
execute_subprocess_async(cmd, env=self.get_env())
# output_dir is the value that follows the `--output_dir` flag in cmd
output_dir = cmd[cmd.index("--output_dir") + 1]
results = []
for rank in range(num_processes):
with open(os.path.join(output_dir, f"env_rank{rank}.json")) as f:
results.append(json.load(f))
return results
def test_torchrun_accelerate_fsdp1_env_parity(self):
"""Verify torchrun+--fsdp and accelerate launch produce the same FSDP1 env."""
script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
num_processes = backend_device_count(torch_device)
torchrun_dir = self.get_auto_remove_tmp_dir()
torchrun_results = self._run_env_check(
self.get_torchrun_cmd(
script,
script_args=[
"--output_dir",
torchrun_dir,
"--fsdp",
"full_shard",
"--fsdp_config",
'{"fsdp_version": 1}',
],
num_processes=num_processes,
),
num_processes,
)
accel_dir = self.get_auto_remove_tmp_dir()
accel_results = self._run_env_check(
self.get_accelerate_cmd(
script, FSDP_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
),
num_processes,
)
self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=1)
def test_torchrun_accelerate_fsdp2_env_parity(self):
"""Verify torchrun+--fsdp and accelerate launch produce the same FSDP2 env."""
script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
num_processes = backend_device_count(torch_device)
torchrun_dir = self.get_auto_remove_tmp_dir()
torchrun_results = self._run_env_check(
self.get_torchrun_cmd(
script,
script_args=[
"--output_dir",
torchrun_dir,
"--fsdp",
"full_shard",
"--fsdp_config",
'{"fsdp_version": 2}',
],
num_processes=num_processes,
),
num_processes,
)
accel_dir = self.get_auto_remove_tmp_dir()
accel_results = self._run_env_check(
self.get_accelerate_cmd(
script, FSDP2_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
),
num_processes,
)
self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=2)
def _check_parity(self, torchrun_results, accel_results, num_processes, expected_fsdp_version):
for rank in range(num_processes):
tr, ac = torchrun_results[rank], accel_results[rank]
# Both should agree on distributed env
self.assertEqual(tr["args_world_size"], ac["args_world_size"])
self.assertEqual(tr["args_process_index"], ac["args_process_index"])
self.assertEqual(tr["args_parallel_mode"], ac["args_parallel_mode"])
self.assertEqual(tr["accelerator_num_processes"], ac["accelerator_num_processes"])
self.assertEqual(tr["accelerator_use_distributed"], ac["accelerator_use_distributed"])
for info in (tr, ac):
# Rank consistency across all layers
self.assertEqual(info["env_world_size"], str(num_processes))
self.assertEqual(info["env_rank"], str(rank))
self.assertEqual(info["args_process_index"], rank)
self.assertEqual(info["args_local_process_index"], rank)
self.assertEqual(info["accelerator_process_index"], rank)
self.assertEqual(info["accelerator_local_process_index"], rank)
self.assertEqual(info["args_n_gpu"], 1)
self.assertEqual(info["accelerator_is_main_process"], rank == 0)
self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)
self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])
# Both should have FSDP enabled with the correct version
self.assertEqual(info["accelerator_distributed_type"], "DistributedType.FSDP")
self.assertTrue(info["trainer_is_fsdp_enabled"])
self.assertFalse(info["trainer_is_deepspeed_enabled"])
self.assertEqual(info["fsdp_version"], expected_fsdp_version)
self.assertNotIn("deepspeed_zero_stage", info)
# ---------------------------------------------------------------------------
# All distributed FSDP training tests
# ---------------------------------------------------------------------------
@slow
@require_torch_multi_accelerator
class TestTrainerDistributedFSDPCommon(
FSDPCommandsMixin, TrainerDistributedCommon, TestCasePlus, TrainerIntegrationCommon
):
# -------------------------------------------------------------------
# FSDP training — accelerate (parameterized over fsdp version)
# -------------------------------------------------------------------
# Pure dtype training: model loaded in target dtype, no mixed precision
@parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
def test_training(self, dtype, fsdp_version):
self.check_training(dtype, config_file=FSDP_CONFIGS[fsdp_version])
# Mixed precision: model loaded in fp32, training with --bf16/--fp16
@parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
def test_training_mixed_precision(self, sharding_strategy, dtype, fsdp_version):
if fsdp_version == "fsdp2":
reshard = "true" if sharding_strategy == "full_shard" else "false"
else:
reshard = sharding_strategy.upper()
launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_reshard_after_forward", reshard]
self.check_mixed_precision(dtype, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args)
@parameterized.expand(["true", "false"], name_func=_parameterized_custom_name_func)
def test_fsdp2_cpu_ram_efficient_loading(self, cpu_ram_efficient_loading):
launch_args = list(TRAIN_LAUNCH_ARGS) + [
"--fsdp_cpu_ram_efficient_loading",
cpu_ram_efficient_loading,
]
self.check_training("bf16", config_file=FSDP2_CONFIG_FILE, launch_args=launch_args)
@parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
def test_training_with_gradient_accumulation(self, fsdp_version):
self.check_gradient_accumulation(config_file=FSDP_CONFIGS[fsdp_version])
@parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
def test_basic_run_with_cpu_offload(self, fsdp_version):
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir) + ["--bf16", "--max_steps", "10"]
launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_offload_params", "true"]
execute_subprocess_async(
self.get_accelerate_cmd(
TRAIN_SCRIPT, script_args=args, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args
),
env=self.get_env(),
)
@parameterized.expand(resume_params, name_func=_parameterized_custom_name_func)
def test_training_and_can_resume_normally(self, state_dict_type, fsdp_version):
output_dir = self.get_auto_remove_tmp_dir()
args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2)
launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", state_dict_type]
cmd_kwargs = {"config_file": FSDP_CONFIGS[fsdp_version], "launch_args": launch_args}
logs = self._train_and_get_log_history(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
output_dir,
)
# resume from ckpt
checkpoint = os.path.join(output_dir, "checkpoint-2")
resume_args = args + ["--resume_from_checkpoint", checkpoint]
is_fsdp_ckpt = os.path.isdir(checkpoint) and (
# this checks the FSDP state dict when `SHARDED_STATE_DICT` is used
any(
FSDP_MODEL_NAME in folder_name
for folder_name in os.listdir(checkpoint)
if os.path.isdir(os.path.join(checkpoint, folder_name))
)
# this checks the FSDP state dict when `FULL_STATE_DICT` is used
or os.path.isfile(os.path.join(checkpoint, f"{FSDP_MODEL_NAME}.bin"))
)
self.assertTrue(is_fsdp_ckpt)
logs_resume = self._train_and_get_log_history(
self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
output_dir,
)
for log, log1 in zip(logs, logs_resume):
if "learning_rate" in log:
self.assertAlmostEqual(log["learning_rate"], log1["learning_rate"], delta=1e-5)
# -------------------------------------------------------------------
# Context parallel tests
# -------------------------------------------------------------------
def test_cp_equivalence(self):
"""Test that CP produces the same losses as without CP."""
# CP doesn't work with Qwen2 (DTensor mixing error), so we use Llama here.
launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", "SHARDED_STATE_DICT"]
cp_script_args = [
"--model_name",
"hf-internal-testing/tiny-random-LlamaForCausalLM",
"--max_steps",
"10",
"--per_device_train_batch_size",
"1",
"--seed",
"42",
"--logging_steps",
"1",
"--save_strategy",
"no",
"--model_dtype",
"fp32",
"--attn_implementation",
"sdpa",
"--pad_to_multiple_of",
"4",
]
# Step 1: Run with CP enabled (cp_size=2)
cp_yes_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
cp_yes_losses_path = cp_yes_output_dir / "cp_yes_losses.json"
cmd = self.get_accelerate_cmd(
TRAIN_SCRIPT,
config_file=FSDP2_CP_CONFIG_FILE,
launch_args=launch_args,
script_args=["--output_dir", str(cp_yes_output_dir), "--loss_output_file", str(cp_yes_losses_path)]
+ cp_script_args,
)
execute_subprocess_async(cmd, env=self.get_env())
# Step 2: Run without CP (FSDP with num_processes=1, no parallelism_config)
cp_no_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
cp_no_losses_path = cp_no_output_dir / "cp_no_losses.json"
cmd = self.get_accelerate_cmd(
TRAIN_SCRIPT,
config_file=FSDP2_CONFIG_FILE,
launch_args=launch_args,
script_args=[
"--output_dir",
str(cp_no_output_dir),
"--loss_output_file",
str(cp_no_losses_path),
]
+ cp_script_args,
num_processes=1,
)
execute_subprocess_async(cmd, env=self.get_env())
# Compare losses
with open(cp_yes_losses_path) as f:
cp_yes_losses = json.load(f)
with open(cp_no_losses_path) as f:
cp_no_losses = json.load(f)
assert len(cp_yes_losses) == len(cp_no_losses), (
f"Different number of losses: CP has {len(cp_yes_losses)}, no-CP has {len(cp_no_losses)}"
)
cp_yes_losses_tensor = torch.tensor(cp_yes_losses)
cp_no_losses_tensor = torch.tensor(cp_no_losses)
torch.testing.assert_close(
cp_yes_losses_tensor,
cp_no_losses_tensor,
rtol=2e-2,
atol=2e-2,
msg=f"CP losses {cp_yes_losses} do not match non-CP losses {cp_no_losses}",
)
# -------------------------------------------------------------------
# FSDP eval tests
# -------------------------------------------------------------------
def test_eval(self):
self.check_eval(config_file=FSDP_CONFIG_FILE)
# -------------------------------------------------------------------
# FSDP generation tests (moved from tests/generation/test_fsdp.py)
# -------------------------------------------------------------------
def test_fsdp_generate(self):
cmd = self.get_accelerate_cmd(
FSDP_GENERATE_SCRIPT,
config_file=FSDP_CONFIG_FILE,
script_args=["--fsdp"],
)
execute_subprocess_async(cmd, env=self.get_env())
def test_fsdp2_generate(self):
cmd = self.get_accelerate_cmd(
FSDP_GENERATE_SCRIPT,
config_file=FSDP2_CONFIG_FILE,
script_args=["--fsdp2"],
)
execute_subprocess_async(cmd, env=self.get_env())
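The tri-state behaviour checked in `test_fsdp_cli_parsing` above (bare `--fsdp` gives `True`, `--fsdp full_shard` gives the string, an absent flag gives `None`) can be sketched with the standard library alone. This is a minimal illustration, assuming plain `argparse` with `nargs="?"`; it is not how `HfArgumentParser` or `TrainingArguments` are implemented, and `build_parser` and `parse_fsdp` are hypothetical names introduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; the real tests use HfArgumentParser(TrainingArguments).
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    # nargs="?" makes the value optional:
    #   flag absent         -> default (None)
    #   bare flag           -> const (True)
    #   flag with a value   -> that value as a string
    parser.add_argument("--fsdp", nargs="?", const=True, default=None)
    return parser


def parse_fsdp(argv):
    # parse_known_args mirrors the call style used in the test and ignores unknown flags.
    args, _unknown = build_parser().parse_known_args(argv)
    return args.fsdp


if __name__ == "__main__":
    base = ["--output_dir", "/tmp/x"]
    print(parse_fsdp([*base, "--fsdp"]))
    print(parse_fsdp([*base, "--fsdp", "full_shard"]))
    print(parse_fsdp(base))
```

Using `const` together with `default` keeps "flag present without a value" distinguishable from "flag not given at all", which is the distinction the test relies on.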


@@ -0,0 +1,250 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Trainer AcceleratorConfig tests: creation from dict/YAML/dataclass, partial overrides,
gradient accumulation settings, custom AcceleratorState, and validation.
"""
import dataclasses
import json
import tempfile
from pathlib import Path
from typing import Any
from accelerate import Accelerator
from accelerate.state import AcceleratorState
from transformers import Trainer, TrainingArguments
from transformers.testing_utils import TestCasePlus, require_torch
from transformers.trainer_pt_utils import AcceleratorConfig
from .trainer_test_utils import (
RegressionModelConfig,
RegressionPreTrainedModel,
RegressionTrainingArguments,
SampleIterableDataset,
)
@require_torch
class TrainerAcceleratorConfigTest(TestCasePlus):
def test_accelerator_config_empty(self):
# Checks that a config can be made with the defaults if not passed
with tempfile.TemporaryDirectory() as tmp_dir:
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
# No accelerator_config is passed, so every option keeps its default
args = RegressionTrainingArguments(output_dir=tmp_dir)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.split_batches, False)
self.assertEqual(trainer.accelerator.dispatch_batches, None)
self.assertEqual(trainer.accelerator.even_batches, True)
self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
# gradient accumulation kwargs configure gradient_state; none were passed, so `sync_each_batch` is absent
self.assertNotIn("sync_each_batch", trainer.accelerator.gradient_state.plugin_kwargs)
def test_accelerator_config_from_dict(self):
# Checks that accelerator kwargs can be passed through as a dict
# and that the accelerator is initialized accordingly
with tempfile.TemporaryDirectory() as tmp_dir:
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
accelerator_config: dict[str, Any] = {
"split_batches": True,
"dispatch_batches": True,
"even_batches": False,
"use_seedable_sampler": True,
}
accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}
# Leaves all options as something *not* basic
args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.split_batches, True)
self.assertEqual(trainer.accelerator.dispatch_batches, True)
self.assertEqual(trainer.accelerator.even_batches, False)
self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
def test_accelerator_config_from_yaml(self):
# Checks that accelerator kwargs can be loaded from a config file (written as JSON here)
# and that the accelerator is initialized accordingly
with tempfile.TemporaryDirectory() as tmp_dir:
path_file = Path(tmp_dir) / "accelerator_config.json"
with open(path_file, "w") as f:
accelerator_config = {
"split_batches": True,
"dispatch_batches": True,
"even_batches": False,
"use_seedable_sampler": False,
}
json.dump(accelerator_config, f)
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
# Leaves all options as something *not* basic
args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=path_file)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.split_batches, True)
self.assertEqual(trainer.accelerator.dispatch_batches, True)
self.assertEqual(trainer.accelerator.even_batches, False)
self.assertEqual(trainer.accelerator.use_seedable_sampler, False)
def test_accelerator_config_from_dataclass(self):
# Checks that accelerator kwargs can be passed through as an `AcceleratorConfig` instance
# and that the accelerator is initialized accordingly
accelerator_config = AcceleratorConfig(
split_batches=True,
dispatch_batches=True,
even_batches=False,
use_seedable_sampler=False,
)
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
with tempfile.TemporaryDirectory() as tmp_dir:
args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.split_batches, True)
self.assertEqual(trainer.accelerator.dispatch_batches, True)
self.assertEqual(trainer.accelerator.even_batches, False)
self.assertEqual(trainer.accelerator.use_seedable_sampler, False)
def test_accelerate_config_from_dataclass_grad_accum(self):
# Checks that gradient accumulation kwargs can be passed through the dataclass
# and that `num_steps` is reflected in `gradient_accumulation_steps`
grad_acc_kwargs = {
"num_steps": 10,
"adjust_scheduler": False,
"sync_with_dataloader": False,
"sync_each_batch": True,
}
accelerator_config = AcceleratorConfig(
split_batches=True,
dispatch_batches=True,
even_batches=False,
use_seedable_sampler=False,
gradient_accumulation_kwargs=grad_acc_kwargs,
)
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
with tempfile.TemporaryDirectory() as tmp_dir:
args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.args.gradient_accumulation_steps, 10)
def test_accelerator_config_from_partial(self):
# Checks that a partial config can be passed through
# and that unspecified options keep their defaults
with tempfile.TemporaryDirectory() as tmp_dir:
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
# Leaves one option as something *not* basic
args = RegressionTrainingArguments(
output_dir=tmp_dir,
accelerator_config={
"split_batches": True,
},
)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.split_batches, True)
self.assertEqual(trainer.accelerator.dispatch_batches, None)
self.assertEqual(trainer.accelerator.even_batches, True)
self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
def test_accelerator_custom_state(self):
AcceleratorState._reset_state(reset_partial_state=True)
with tempfile.TemporaryDirectory() as tmp_dir:
with self.assertRaises(ValueError) as cm:
_ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
self.assertIn("Please define this beforehand", str(cm.exception))
_ = Accelerator()
_ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
AcceleratorState._reset_state(reset_partial_state=True)
def test_accelerator_config_from_dict_grad_accum_num_steps(self):
with tempfile.TemporaryDirectory() as tmp_dir:
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
# case - TrainingArguments.gradient_accumulation_steps == 1
# - gradient_accumulation_kwargs['num_steps'] == 1
# results in grad accum set to 1
args = RegressionTrainingArguments(
output_dir=tmp_dir,
gradient_accumulation_steps=1,
accelerator_config={
"gradient_accumulation_kwargs": {
"num_steps": 1,
}
},
)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertEqual(trainer.accelerator.gradient_state.plugin_kwargs["num_steps"], 1)
# case - TrainingArguments.gradient_accumulation_steps > 1
# - gradient_accumulation_kwargs['num_steps'] specified
# results in exception raised
args = RegressionTrainingArguments(
output_dir=tmp_dir,
gradient_accumulation_steps=2,
accelerator_config={
"gradient_accumulation_kwargs": {
"num_steps": 10,
}
},
)
with self.assertRaises(Exception) as context:
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
self.assertTrue("The `AcceleratorConfig`'s `num_steps` is set but" in str(context.exception))
def test_accelerator_config_not_instantiated(self):
# Checks that passing the `AcceleratorConfig` class itself (not an instance)
# raises a `NotImplementedError`
with tempfile.TemporaryDirectory() as tmp_dir:
with self.assertRaises(NotImplementedError) as context:
_ = RegressionTrainingArguments(
output_dir=tmp_dir,
accelerator_config=AcceleratorConfig,
)
self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))
# Now test with a custom subclass
@dataclasses.dataclass
class CustomAcceleratorConfig(AcceleratorConfig):
pass
@dataclasses.dataclass
class CustomTrainingArguments(TrainingArguments):
accelerator_config: dict = dataclasses.field(
default=CustomAcceleratorConfig,
)
with tempfile.TemporaryDirectory() as tmp_dir:
with self.assertRaises(NotImplementedError) as context:
_ = CustomTrainingArguments(
output_dir=tmp_dir,
)
self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))
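The interaction between `gradient_accumulation_steps` and `gradient_accumulation_kwargs["num_steps"]` exercised by the tests above can be summarised in a small standalone function. This is only a sketch inferred from the assertions in this file (1 and 1 resolve to 1, default and 10 resolve to 10, 2 and 10 raise); `resolve_grad_accum_steps`, the `ValueError` type, and the error wording are assumptions for illustration, not the `Trainer` implementation.

```python
from typing import Any, Mapping


def resolve_grad_accum_steps(args_steps: int, grad_accum_kwargs: Mapping[str, Any]) -> int:
    """Hypothetical helper: pick the effective number of accumulation steps.

    args_steps stands in for TrainingArguments.gradient_accumulation_steps,
    grad_accum_kwargs for AcceleratorConfig.gradient_accumulation_kwargs.
    """
    if "num_steps" not in grad_accum_kwargs:
        # Nothing set on the accelerator side: the training argument wins.
        return args_steps
    if args_steps > 1:
        # Both sides ask for accumulation: refuse rather than guess.
        # (Illustrative wording; the tests only check a prefix of the real message.)
        raise ValueError(
            "`num_steps` is set in the accelerator config but "
            f"`gradient_accumulation_steps` is {args_steps} (> 1); set only one of them."
        )
    # args_steps is left at 1, so num_steps from the kwargs takes effect.
    return int(grad_accum_kwargs["num_steps"])


if __name__ == "__main__":
    print(resolve_grad_accum_steps(1, {"num_steps": 1}))
    print(resolve_grad_accum_steps(1, {"num_steps": 10}))
    print(resolve_grad_accum_steps(4, {}))
```

Raising on the ambiguous case, instead of silently preferring one value, matches what the second half of `test_accelerator_config_from_dict_grad_accum_num_steps` expects.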


@@ -0,0 +1,870 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Trainer data-related tests: dataloaders, samplers, sharding, label smoothing,
batch size finder, pad/concatenate, collators, and eval loop container.
"""
import copy
import tempfile
import unittest
import warnings
import numpy as np
import torch
from torch import nn
from transformers import (
GPT2Config,
GPT2LMHeadModel,
Trainer,
TrainingArguments,
)
from transformers.data.data_collator import default_data_collator as _default_data_collator
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers.testing_utils import (
TestCasePlus,
backend_device_count,
require_accelerate,
require_torch,
torch_device,
)
from transformers.tokenization_utils_base import BatchEncoding
from transformers.trainer_pt_utils import (
DistributedLengthGroupedSampler,
DistributedSamplerWithLoop,
EvalLoopContainer,
IterableDatasetShard,
LabelSmoother,
LengthGroupedSampler,
ShardSampler,
get_parameter_names,
numpy_pad_and_concatenate,
torch_pad_and_concatenate,
)
from transformers.trainer_utils import RemoveColumnsCollator, find_executable_batch_size
from .trainer_test_utils import (
AlmostAccuracy,
CustomDataloaderTrainer,
DynamicShapesDataset,
RegressionDataset,
RegressionModel,
RegressionModelConfig,
RegressionPreTrainedModel,
RegressionTrainingArguments,
SampleIterableDataset,
TrainerIntegrationCommon,
TstLayer,
get_regression_trainer,
)
class RandomIterableDataset(torch.utils.data.IterableDataset):
# For testing, an iterable dataset of random length
def __init__(self, p_stop=0.01, max_length=1000):
self.p_stop = p_stop
self.max_length = max_length
self.generator = torch.Generator()
def __iter__(self):
count = 0
stop = False
while not stop and count < self.max_length:
yield count
count += 1
number = torch.rand(1, generator=self.generator).item()
stop = number < self.p_stop
# ---------------------------------------------------------------------------
# Dataloader tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerDataloaderTest(TestCasePlus):
"""Tests for train/eval dataloaders, drop_last, persistent workers."""
def test_train_and_eval_dataloaders(self):
if torch_device == "cuda":
n_gpu = max(1, backend_device_count(torch_device))
else:
# DP is deprecated by PyTorch, and accelerators like XPU don't support DP
n_gpu = 1
tmp_dir = self.get_auto_remove_tmp_dir()
trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16, output_dir=tmp_dir)
self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
trainer = get_regression_trainer(learning_rate=0.1, per_device_eval_batch_size=16, output_dir=tmp_dir)
self.assertEqual(trainer.get_eval_dataloader().total_batch_size, 16 * n_gpu)
# Check drop_last works
trainer = get_regression_trainer(
train_len=66,
eval_len=74,
learning_rate=0.1,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
output_dir=tmp_dir,
)
self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu) + 1)
self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu) + 1)
trainer = get_regression_trainer(
train_len=66,
eval_len=74,
learning_rate=0.1,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
dataloader_drop_last=True,
output_dir=tmp_dir,
)
self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu))
self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu))
# Check passing a new dataset for evaluation works
new_eval_dataset = RegressionDataset(length=128)
self.assertEqual(len(trainer.get_eval_dataloader(new_eval_dataset)), 128 // (32 * n_gpu))
# tests that we do not require dataloader to have a .dataset attribute
def test_dataloader_without_dataset(self):
train_dataset = RegressionDataset(length=128)
trainer = CustomDataloaderTrainer(
model=RegressionModel(),
train_dataset=train_dataset,
eval_dataset=train_dataset,
args=TrainingArguments(output_dir=self.get_auto_remove_tmp_dir()),
)
trainer.train()
trainer.evaluate()
def test_get_eval_dataloader_without_persistent_workers(self):
train_dataset = RegressionDataset()
config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
tiny_gpt2 = GPT2LMHeadModel(config)
args = TrainingArguments(self.get_auto_remove_tmp_dir(), dataloader_persistent_workers=False)
# Single evaluation dataset
eval_dataset = RegressionDataset()
trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
trainer.accelerator.prepare = lambda x: x
default_dataloader = trainer.get_eval_dataloader()
dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)
self.assertEqual(default_dataloader.dataset, eval_dataset)
self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
self.assertNotEqual(default_dataloader, dataloader_with_dataset)
# Multiple evaluation datasets
first_dataset = RegressionDataset()
second_dataset = RegressionDataset()
trainer = Trainer(
tiny_gpt2,
args,
train_dataset=train_dataset,
eval_dataset={"first": first_dataset, "second": second_dataset},
)
# Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
trainer.accelerator.prepare = lambda x: x
first_dataloader = trainer.get_eval_dataloader("first")
first_dataloader_repeated = trainer.get_eval_dataloader("first")
second_dataloader = trainer.get_eval_dataloader("second")
second_dataloader_repeated = trainer.get_eval_dataloader("second")
self.assertEqual(first_dataset, first_dataloader.dataset)
self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
self.assertEqual(second_dataset, second_dataloader.dataset)
self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
self.assertNotEqual(first_dataloader, first_dataloader_repeated)
self.assertNotEqual(second_dataloader, second_dataloader_repeated)
def test_get_eval_dataloader_with_persistent_workers(self):
train_dataset = RegressionDataset()
config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
tiny_gpt2 = GPT2LMHeadModel(config)
args = TrainingArguments(
self.get_auto_remove_tmp_dir(),
dataloader_persistent_workers=True,
dataloader_num_workers=2,
)
# Single evaluation dataset
eval_dataset = RegressionDataset()
trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
trainer.accelerator.prepare = lambda x: x
default_dataloader = trainer.get_eval_dataloader()
dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)
self.assertEqual(default_dataloader.dataset, eval_dataset)
self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
self.assertEqual(default_dataloader, dataloader_with_dataset)
# Multiple evaluation datasets
first_dataset = RegressionDataset()
second_dataset = RegressionDataset()
trainer = Trainer(
tiny_gpt2,
args,
train_dataset=train_dataset,
eval_dataset={"first": first_dataset, "second": second_dataset},
)
# Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
trainer.accelerator.prepare = lambda x: x
first_dataloader = trainer.get_eval_dataloader("first")
first_dataloader_repeated = trainer.get_eval_dataloader("first")
second_dataloader = trainer.get_eval_dataloader("second")
second_dataloader_repeated = trainer.get_eval_dataloader("second")
self.assertEqual(first_dataset, first_dataloader.dataset)
self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
self.assertEqual(second_dataset, second_dataloader.dataset)
self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
self.assertEqual(first_dataloader, first_dataloader_repeated)
self.assertEqual(second_dataloader, second_dataloader_repeated)
# ---------------------------------------------------------------------------
# Label smoothing tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerLabelSmoothingTest(unittest.TestCase):
"""Tests for label smoothing and its interaction with multi-label classification."""
def test_label_smoothing(self):
epsilon = 0.1
num_labels = 12
random_logits = torch.randn(4, 5, num_labels)
random_labels = torch.randint(0, num_labels, (4, 5))
loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
model_output = SequenceClassifierOutput(logits=random_logits)
label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
expected_loss = (1 - epsilon) * loss + epsilon * log_probs.mean()
torch.testing.assert_close(label_smoothed_loss, expected_loss)
# With a few -100 labels
random_labels[0, 1] = -100
random_labels[2, 1] = -100
random_labels[2, 3] = -100
loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
model_output = SequenceClassifierOutput(logits=random_logits)
label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
# Mask the log probs with the -100 labels
log_probs[0, 1] = 0.0
log_probs[2, 1] = 0.0
log_probs[2, 3] = 0.0
# 4 * 5 = 20 label positions, minus the 3 masked with -100, leaves 17 active positions
expected_loss = (1 - epsilon) * loss + epsilon * log_probs.sum() / (num_labels * 17)
torch.testing.assert_close(label_smoothed_loss, expected_loss)
def test_label_smoothing_multi_label_incompatibility(self):
"""Test that Trainer warns and disables label smoothing for multi-label classification"""
# Mock model config with multi-label classification
class MockConfig:
problem_type = "multi_label_classification"
class MockModel(nn.Module):
def __init__(self):
super().__init__()
self.config = MockConfig()
self.linear = nn.Linear(10, 3)
def forward(self, **kwargs):
return {"logits": torch.randn(2, 3)}
model = MockModel()
# Create training args with label smoothing
training_args = TrainingArguments(
output_dir="./test-trainer",
label_smoothing_factor=0.1,
per_device_train_batch_size=2,
num_train_epochs=1,
)
# Should warn and disable label smoothing
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
trainer = Trainer(model=model, args=training_args)
# Check warning was issued
self.assertEqual(len(w), 1)
self.assertIn("Label smoothing is not compatible with multi-label classification", str(w[0].message))
# Check label_smoother was disabled
self.assertIsNone(trainer.label_smoother)
# ---------------------------------------------------------------------------
# Sampler and sharding tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerSamplerTest(unittest.TestCase):
"""Tests for length-grouped samplers, distributed samplers, iterable dataset sharding, and shard samplers."""
def test_group_by_length(self):
# Get some inputs of random lengths
lengths = torch.randint(0, 25, (100,)).tolist()
# Put one bigger than the others to check it ends up in first position
lengths[32] = 50
indices = list(LengthGroupedSampler(4, lengths=lengths))
# The biggest element should be first
self.assertEqual(lengths[indices[0]], 50)
# The indices should be a permutation of range(100)
self.assertEqual(sorted(indices), list(range(100)))
def test_group_by_length_with_dict(self):
# Get some inputs of random lengths
data = []
for _ in range(6):
input_ids = torch.randint(0, 25, (100,)).tolist()
data.append({"input_ids": input_ids})
# Put one bigger than the others to check it ends up in first position
data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()
indices = list(LengthGroupedSampler(4, dataset=data))
# The biggest element should be first
self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
# The indices should be a permutation of range(6)
self.assertEqual(sorted(indices), list(range(6)))
def test_group_by_length_with_batch_encoding(self):
# Get some inputs of random lengths
data = []
for _ in range(6):
input_ids = torch.randint(0, 25, (100,)).tolist()
data.append(BatchEncoding({"input_ids": input_ids}))
# Put one bigger than the others to check it ends up in first position
data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()
indices = list(LengthGroupedSampler(4, dataset=data))
# The biggest element should be first
self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
# The indices should be a permutation of range(6)
self.assertEqual(sorted(indices), list(range(6)))
def test_distributed_length_grouped(self):
# Get some inputs of random lengths
lengths = torch.randint(0, 25, (100,)).tolist()
# Put one bigger than the others to check it ends up in first position
lengths[32] = 50
indices_process_0 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=0, lengths=lengths))
indices_process_1 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=1, lengths=lengths))
# The biggest element should be first
self.assertEqual(lengths[indices_process_0[0]], 50)
# The indices should be a permutation of range(100)
self.assertEqual(sorted(indices_process_0 + indices_process_1), list(range(100)))
def test_distributed_sampler_with_loop(self):
batch_size = 16
for length in [23, 64, 123]:
dataset = list(range(length))
shard1 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=0)
shard2 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=1)
# Set seeds
shard1.set_epoch(0)
shard2.set_epoch(0)
# Sample
samples1 = list(shard1)
samples2 = list(shard2)
self.assertTrue(len(samples1) % batch_size == 0)
self.assertTrue(len(samples2) % batch_size == 0)
total = []
for sample1, sample2 in zip(samples1, samples2):
total += [sample1, sample2]
self.assertEqual(set(total[:length]), set(dataset))
self.assertEqual(set(total[length:]), set(total[: (len(total) - length)]))
def check_iterable_dataset_shard(self, dataset, batch_size, drop_last, num_processes=2, epoch=0):
# Set the seed for the base dataset to get the proper reference.
dataset.generator.manual_seed(epoch)
reference = list(dataset)
shards = [
IterableDatasetShard(
dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
)
for i in range(num_processes)
]
for shard in shards:
shard.set_epoch(epoch)
shard_lists = [list(shard) for shard in shards]
for shard in shard_lists:
# All shards have a number of samples that is a round multiple of batch size
self.assertTrue(len(shard) % batch_size == 0)
# All shards have the same number of samples
self.assertEqual(len(shard), len(shard_lists[0]))
for shard in shards:
# All shards know the total number of samples
self.assertEqual(shard.num_examples, len(reference))
observed = []
for idx in range(0, len(shard_lists[0]), batch_size):
for shard in shard_lists:
observed += shard[idx : idx + batch_size]
# If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
# batch_size
if not drop_last:
while len(reference) < len(observed):
reference += reference
self.assertListEqual(observed, reference[: len(observed)])
# Check equivalence between IterableDataset and ShardSampler
dataset.generator.manual_seed(epoch)
reference = list(dataset)
sampler_shards = [
ShardSampler(
reference, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
)
for i in range(num_processes)
]
for shard, sampler_shard in zip(shard_lists, sampler_shards):
self.assertListEqual(shard, list(sampler_shard))
def test_iterable_dataset_shard(self):
dataset = RandomIterableDataset()
self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=2, epoch=0)
self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=2, epoch=0)
self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=3, epoch=42)
self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=3, epoch=42)
def test_iterable_dataset_shard_with_length(self):
sampler_shards = [
IterableDatasetShard(list(range(100)), batch_size=4, drop_last=True, num_processes=2, process_index=i)
for i in range(2)
]
# Build expected shards: each process will have batches of size 4 until there are not enough elements to
# form two full batches (so we stop at 96 = (100 // (4 * 2)) * (4 * 2))
expected_shards = [[], []]
current_shard = 0
for i in range(0, 96, 4):
expected_shards[current_shard].extend(list(range(i, i + 4)))
current_shard = 1 - current_shard
self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])
sampler_shards = [
IterableDatasetShard(list(range(100)), batch_size=4, drop_last=False, num_processes=2, process_index=i)
for i in range(2)
]
# When drop_last=False, we get two last full batches by looping back to the beginning.
expected_shards[0].extend(list(range(96, 100)))
expected_shards[1].extend(list(range(0, 4)))
self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])
def check_shard_sampler(self, dataset, batch_size, drop_last, num_processes=2):
shards = [
ShardSampler(
dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
)
for i in range(num_processes)
]
shard_lists = [list(shard) for shard in shards]
for shard in shard_lists:
# All shards have a number of samples that is a round multiple of batch size
self.assertTrue(len(shard) % batch_size == 0)
# All shards have the same number of samples
self.assertEqual(len(shard), len(shard_lists[0]))
observed = []
for idx in range(0, len(shard_lists[0]), batch_size):
for shard in shard_lists:
observed += shard[idx : idx + batch_size]
# If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
# batch_size
reference = copy.copy(dataset)
if not drop_last:
while len(reference) < len(observed):
reference += reference
self.assertListEqual(observed, reference[: len(observed)])
def test_shard_sampler(self):
for n_elements in [64, 123]:
dataset = list(range(n_elements))
self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=2)
self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=2)
self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=3)
self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=3)
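# Illustrative sketch, not part of the test suite: the sharding behaviour checked by the tests above can be
# modelled in plain Python. `shard_indices` is a hypothetical helper, not the library's ShardSampler. It assumes
# batches are dealt to processes round-robin and that, when drop_last is False, samples wrap around to the start
# of the dataset until every process has a full final batch.

```python
def shard_indices(dataset, batch_size, drop_last, num_processes, process_index):
    """Hypothetical model of per-process sharding; returns the samples seen by one process."""
    n = len(dataset)
    chunk = batch_size * num_processes  # samples consumed by one round over all processes
    if drop_last:
        total = (n // chunk) * chunk  # keep only complete rounds
    else:
        total = -(-n // chunk) * chunk  # round up to a complete round
    # Wrap around to the beginning of the dataset to fill the last round.
    padded = [dataset[i % n] for i in range(total)]
    shard = []
    for start in range(process_index * batch_size, total, chunk):
        shard += padded[start : start + batch_size]
    return shard


data = list(range(100))
first = shard_indices(data, batch_size=4, drop_last=False, num_processes=2, process_index=0)
second = shard_indices(data, batch_size=4, drop_last=False, num_processes=2, process_index=1)
print(first[-4:], second[-4:])
```

# With 100 samples, batch size 4 and two processes, the second shard ends with the first four samples again,
# which is the situation test_iterable_dataset_shard_with_length builds by hand.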
# ---------------------------------------------------------------------------
# Batch size finder tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerBatchSizeFinderTest(unittest.TestCase):
"""Tests for the auto batch size finder (find_executable_batch_size)."""
@require_accelerate
def test_executable_batch_size(self):
batch_sizes = []
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=True)
def mock_training_loop_function(batch_size):
nonlocal batch_sizes
batch_sizes.append(batch_size)
if batch_size > 16:
raise RuntimeError("CUDA out of memory.")
mock_training_loop_function()
self.assertEqual(batch_sizes, [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15])
@require_accelerate
def test_executable_batch_size_no_search(self):
batch_sizes = []
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
def mock_training_loop_function(batch_size):
nonlocal batch_sizes
batch_sizes.append(batch_size)
mock_training_loop_function()
self.assertEqual(batch_sizes, [64])
@require_accelerate
def test_executable_batch_size_with_error(self):
@find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
def mock_training_loop_function(batch_size):
raise RuntimeError("CUDA out of memory.")
with self.assertRaises(RuntimeError) as cm:
mock_training_loop_function()
self.assertEqual("CUDA out of memory.", cm.exception.args[0])
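# Illustrative sketch, not part of the test suite: the sequence expected in test_executable_batch_size is
# consistent with shrinking the batch size by about 10% (rounded down) after each out-of-memory error. That
# shrink rule is an assumption inferred from the expected list, not taken from the library source, and
# `simulate_batch_size_search` is a hypothetical helper.

```python
def simulate_batch_size_search(starting_batch_size, max_fitting_batch_size):
    """Hypothetical helper: list the batch sizes tried until one fits in memory."""
    tried = []
    batch_size = starting_batch_size
    while batch_size > 0:
        tried.append(batch_size)
        if batch_size <= max_fitting_batch_size:
            break  # this batch size "fits", so the search stops here
        # Assumed rule: multiply by 0.9 and round down (integer arithmetic avoids float rounding).
        batch_size = batch_size * 9 // 10
    return tried


print(simulate_batch_size_search(64, 16))
```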
# ---------------------------------------------------------------------------
# Data utility tests (parameter names, pad/concat, collators, eval loop container)
# ---------------------------------------------------------------------------
@require_torch
class TrainerDataUtilsTest(unittest.TestCase):
"""Tests for get_parameter_names, pad_and_concatenate, RemoveColumnsCollator, and EvalLoopContainer."""
def test_get_parameter_names(self):
model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
# fmt: off
self.assertEqual(
get_parameter_names(model, [nn.LayerNorm]),
['0.linear1.weight', '0.linear1.bias', '0.linear2.weight', '0.linear2.bias', '0.bias', '1.0.linear1.weight', '1.0.linear1.bias', '1.0.linear2.weight', '1.0.linear2.bias', '1.0.bias', '1.1.linear1.weight', '1.1.linear1.bias', '1.1.linear2.weight', '1.1.linear2.bias', '1.1.bias']
)
# fmt: on
def test_get_parameter_names_rmsnorm(self):
class RMSNorm(nn.Module):
def __init__(self, hidden_size):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.bias = nn.Parameter(torch.zeros(hidden_size))
class ModelWithRMSNorm(nn.Module):
def __init__(self):
super().__init__()
self.linear = nn.Linear(128, 128)
self.rmsnorm = RMSNorm(128)
self.bias = nn.Parameter(torch.zeros(128))
model = ModelWithRMSNorm()
# Test both type-based and name-based filtering
decay_parameters = get_parameter_names(model, [], ["bias", "rmsnorm"])
# Parameters that should be in weight decay
self.assertIn("linear.weight", decay_parameters)
# Parameters that should NOT be in weight decay
self.assertNotIn("linear.bias", decay_parameters)
self.assertNotIn("rmsnorm.weight", decay_parameters)
self.assertNotIn("rmsnorm.bias", decay_parameters)
self.assertNotIn("bias", decay_parameters)
def test_pad_and_concatenate_with_1d(self):
"""Tests whether pad_and_concatenate works with scalars."""
array1 = 1.0
array2 = 2.0
result = numpy_pad_and_concatenate(array1, array2)
self.assertTrue(np.array_equal(np.array([1.0, 2.0]), result))
tensor1 = torch.tensor(1.0)
tensor2 = torch.tensor(2.0)
result = torch_pad_and_concatenate(tensor1, tensor2)
self.assertTrue(torch.equal(result, torch.Tensor([1.0, 2.0])))
def test_remove_columns_collator(self):
class MockLogger:
def __init__(self) -> None:
self.called = 0
def info(self, msg):
self.called += 1
self.last_msg = msg
data_batch = [
{"col1": 1, "col2": 2, "col3": 3},
{"col1": 1, "col2": 2, "col3": 3},
]
logger = MockLogger()
remove_columns_collator = RemoveColumnsCollator(
_default_data_collator, ["col1", "col2"], logger, "model", "training"
)
self.assertNotIn("col3", remove_columns_collator(data_batch))
# check that the logging message is printed out only once
remove_columns_collator(data_batch)
remove_columns_collator(data_batch)
self.assertEqual(logger.called, 1)
self.assertIn("col3", logger.last_msg)
def test_eval_loop_container(self):
batch_1 = [
torch.ones([8, 5]),
{"loss": torch.tensor(1.0)},
(torch.ones([8, 2, 3]), torch.ones([8, 2])),
]
batch_2 = [
torch.ones([4, 5]),
{"loss": torch.tensor(2.0)},
(torch.ones([4, 2, 3]), torch.ones([4, 6])),
]
concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
concat_container.add(batch_1)
concat_container.add(batch_2)
concat_container.to_cpu_and_numpy()
arrays = concat_container.get_arrays()
# Test two nested batches concatenation
self.assertIsInstance(arrays, list)
self.assertEqual(len(arrays), 3)
self.assertIsInstance(arrays[0], np.ndarray)
self.assertEqual(arrays[0].shape, (12, 5))
self.assertIsInstance(arrays[1], dict)
self.assertIsInstance(arrays[1]["loss"], np.ndarray)
self.assertEqual(arrays[1]["loss"].shape, (2,))
self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0, 2.0])))
self.assertIsInstance(arrays[2], tuple)
self.assertEqual(len(arrays[2]), 2)
self.assertEqual(arrays[2][0].shape, (12, 2, 3))
self.assertEqual(arrays[2][1].shape, (12, 6))
# check that first batch padded with padding index -100 after concatenation
self.assertEqual(arrays[2][1][0][2], -100)
# Test two batches with no concatenation
list_container = EvalLoopContainer(do_nested_concat=False)
list_container.add(batch_1)
list_container.add(batch_2)
list_container.to_cpu_and_numpy()
arrays = list_container.get_arrays()
self.assertEqual(len(arrays), 2)
self.assertIsInstance(arrays, list)
np_batch_1, np_batch_2 = arrays
self.assertIsInstance(np_batch_1, list)
self.assertEqual(len(np_batch_1), 3)
self.assertIsInstance(np_batch_1[0], np.ndarray)
self.assertIsInstance(np_batch_1[1], dict)
self.assertIsInstance(np_batch_1[2], tuple)
self.assertEqual(np_batch_1[0].shape, (8, 5))
self.assertEqual(np_batch_1[1]["loss"].shape, ())
self.assertEqual(np_batch_1[2][0].shape, (8, 2, 3))
self.assertEqual(np_batch_1[2][1].shape, (8, 2))
self.assertIsInstance(np_batch_2, list)
self.assertEqual(len(np_batch_2), 3)
self.assertIsInstance(np_batch_2[0], np.ndarray)
self.assertIsInstance(np_batch_2[1], dict)
self.assertIsInstance(np_batch_2[2], tuple)
self.assertEqual(np_batch_2[0].shape, (4, 5))
self.assertEqual(np_batch_2[1]["loss"].shape, ())
self.assertEqual(np_batch_2[2][0].shape, (4, 2, 3))
self.assertEqual(np_batch_2[2][1].shape, (4, 6))
# Test no batches
none_arr = EvalLoopContainer(do_nested_concat=True, padding_index=-100).get_arrays()
self.assertIsNone(none_arr)
none_arr = EvalLoopContainer(do_nested_concat=False).get_arrays()
self.assertIsNone(none_arr)
# Test one batch
concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
concat_container.add(batch_1)
arrays = concat_container.get_arrays()
self.assertIsInstance(arrays, list)
self.assertEqual(len(arrays), 3)
self.assertIsInstance(arrays[0], np.ndarray)
self.assertEqual(arrays[0].shape, (8, 5))
self.assertIsInstance(arrays[1], dict)
self.assertIsInstance(arrays[1]["loss"], np.ndarray)
self.assertEqual(arrays[1]["loss"].shape, ())
self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0])))
self.assertIsInstance(arrays[2], tuple)
self.assertEqual(len(arrays[2]), 2)
self.assertEqual(arrays[2][0].shape, (8, 2, 3))
self.assertEqual(arrays[2][1].shape, (8, 2))
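# Illustrative sketch, not part of the test suite: the padding checked in test_eval_loop_container, where an
# (8, 2) batch and a (4, 6) batch become a (12, 6) array, can be modelled with NumPy. `pad_and_concat_2d` is a
# hypothetical helper restricted to 2D arrays; the library's pad-and-concatenate utilities cover more cases.

```python
import numpy as np


def pad_and_concat_2d(first, second, padding_index=-100):
    """Hypothetical helper: stack two 2D arrays along axis 0, padding the narrower one on the right."""
    width = max(first.shape[1], second.shape[1])
    rows = first.shape[0] + second.shape[0]
    # Start from an array filled with the padding index, then copy each batch into place.
    result = np.full((rows, width), padding_index, dtype=first.dtype)
    result[: first.shape[0], : first.shape[1]] = first
    result[first.shape[0] :, : second.shape[1]] = second
    return result


merged = pad_and_concat_2d(np.ones((8, 2)), np.ones((4, 6)))
print(merged.shape, merged[0])
```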
# ---------------------------------------------------------------------------
# Dynamic shapes and iterable dataset tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerDynamicShapesAndIterableTest(TestCasePlus, TrainerIntegrationCommon):
def setUp(self):
super().setUp()
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_dynamic_shapes(self):
eval_dataset = DynamicShapesDataset(batch_size=self.batch_size)
model = RegressionModel(a=2, b=1)
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(tmp_dir)
trainer = Trainer(model, args, eval_dataset=eval_dataset)
# Check evaluation can run to completion
_ = trainer.evaluate()
# Check predictions
preds = trainer.predict(eval_dataset)
for expected, seen in zip(eval_dataset.ys, preds.label_ids):
self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
for expected, seen in zip(eval_dataset.xs, preds.predictions):
self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
# Same tests with eval accumulation
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(tmp_dir, eval_accumulation_steps=2)
trainer = Trainer(model, args, eval_dataset=eval_dataset)
# Check evaluation can run to completion
_ = trainer.evaluate()
# Check predictions
preds = trainer.predict(eval_dataset)
for expected, seen in zip(eval_dataset.ys, preds.label_ids):
self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
for expected, seen in zip(eval_dataset.xs, preds.predictions):
self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
def test_training_iterable_dataset(self):
config = RegressionModelConfig()
model = RegressionPreTrainedModel(config)
# Adding one column not used by the model should have no impact
train_dataset = SampleIterableDataset(label_names=["labels", "extra"])
with tempfile.TemporaryDirectory() as tmp_dir:
args = RegressionTrainingArguments(output_dir=tmp_dir, max_steps=4)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
self.assertEqual(trainer.state.global_step, 4)
loader = trainer.get_train_dataloader()
self.assertIsInstance(loader, torch.utils.data.DataLoader)
self.assertIsInstance(loader.sampler, torch.utils.data.dataloader._InfiniteConstantSampler)
def test_evaluation_iterable_dataset(self):
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
# RegressionPreTrainedModel accepts **kwargs but doesn't actually use num_items_in_batch,
# so disable the loss scaling that assumes the model handles token-level averaging.
model.accepts_loss_kwargs = False
# Adding one column not used by the model should have no impact
eval_dataset = SampleIterableDataset(label_names=["labels", "extra"])
with tempfile.TemporaryDirectory() as tmp_dir:
args = RegressionTrainingArguments(output_dir=tmp_dir)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
results = trainer.evaluate()
x, y = trainer.eval_dataset.dataset.x, trainer.eval_dataset.dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
# With a number of elements not a round multiple of the batch size
eval_dataset = SampleIterableDataset(length=66)
results = trainer.evaluate(eval_dataset)
x, y = eval_dataset.dataset.x, eval_dataset.dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
def test_predict_iterable_dataset(self):
config = RegressionModelConfig(a=1.5, b=2.5)
model = RegressionPreTrainedModel(config)
eval_dataset = SampleIterableDataset()
with tempfile.TemporaryDirectory() as tmp_dir:
args = RegressionTrainingArguments(output_dir=tmp_dir)
trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
preds = trainer.predict(trainer.eval_dataset).predictions
x = eval_dataset.dataset.x
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
# With a number of elements not a round multiple of the batch size
# Adding one column not used by the model should have no impact
test_dataset = SampleIterableDataset(length=66, label_names=["labels", "extra"])
preds = trainer.predict(test_dataset).predictions
x = test_dataset.dataset.x
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))

@@ -0,0 +1,519 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Trainer evaluation and prediction tests: evaluate, predict, batched metrics, dynamic shapes,
iterable datasets, early stopping, FP16/BF16 full eval memory, torch.compile, and MRPC/LM eval.
"""
import gc
import tempfile
import numpy as np
from transformers import (
AutoTokenizer,
TrainingArguments,
is_torch_available,
)
from transformers.testing_utils import (
TestCasePlus,
backend_device_count,
get_tests_dir,
require_torch,
require_torch_accelerator,
require_torch_bf16,
require_torch_fp16,
slow,
torch_device,
)
from .trainer_test_utils import (
PATH_SAMPLE_TEXT,
AlmostAccuracy,
AlmostAccuracyBatched,
RegressionDataset,
RegressionDictModel,
TrainerIntegrationCommon,
get_dataset,
get_regression_trainer,
)
if is_torch_available():
import torch
from transformers import (
AutoModelForCausalLM,
AutoModelForSequenceClassification,
GlueDataset,
GlueDataTrainingArguments,
Trainer,
)
# ---------------------------------------------------------------------------
# Core evaluate / predict tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerEvaluationTest(TestCasePlus, TrainerIntegrationCommon):
def setUp(self):
super().setUp()
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_evaluate(self):
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(a=1.5, b=2.5, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
# With a number of elements not a round multiple of the batch size
trainer = get_regression_trainer(
a=1.5, b=2.5, eval_len=66, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir
)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
# With logits preprocess
trainer = get_regression_trainer(
a=1.5,
b=2.5,
compute_metrics=AlmostAccuracy(),
preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
output_dir=tmp_dir,
)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
def test_predict(self):
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(a=1.5, b=2.5, output_dir=tmp_dir)
preds = trainer.predict(trainer.eval_dataset).predictions
x = trainer.eval_dataset.x
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
# With a number of elements not a round multiple of the batch size
trainer = get_regression_trainer(a=1.5, b=2.5, eval_len=66, output_dir=tmp_dir)
preds = trainer.predict(trainer.eval_dataset).predictions
x = trainer.eval_dataset.x
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
# With more than one output of the model
trainer = get_regression_trainer(a=1.5, b=2.5, double_output=True, output_dir=tmp_dir)
preds = trainer.predict(trainer.eval_dataset).predictions
x = trainer.eval_dataset.x
self.assertEqual(len(preds), 2)
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
# With more than one output/label of the model
trainer = get_regression_trainer(
a=1.5, b=2.5, double_output=True, label_names=["labels", "labels_2"], output_dir=tmp_dir
)
outputs = trainer.predict(trainer.eval_dataset)
preds = outputs.predictions
labels = outputs.label_ids
x = trainer.eval_dataset.x
self.assertEqual(len(preds), 2)
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
def test_train_and_predict_loss_parity(self):
"""
Tests that the loss computed during a training_step is the same as the one computed during prediction_step
for the same inputs.
"""
model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")
# Create a dummy batch of inputs
inputs = {}
inputs["input_ids"] = []
for row_ind in range(4):
seq_len = torch.randint(32, 64, (1,)).item()
x = torch.randint(1, 100, (seq_len,))
inputs["input_ids"].append(x)
inputs["input_ids"] = torch.nn.utils.rnn.pad_sequence(inputs["input_ids"], batch_first=True, padding_value=0)
inputs["labels"] = inputs["input_ids"].clone()
inputs["labels"][inputs["input_ids"] == 0] = -100
num_items_in_batch = inputs["labels"][..., 1:].ne(-100).sum().item()
def custom_loss_func(outputs, labels, num_items_in_batch=None):
logits = outputs["logits"]
loss_fct = torch.nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
if num_items_in_batch is not None:
return loss / num_items_in_batch # divide by the number of items in the batch
return loss
trainer = Trainer(model, train_dataset=None, compute_loss_func=custom_loss_func)
# creating log history of trainer, results don't matter
train_loss = trainer.training_step(model, inputs, num_items_in_batch)
predict_loss = trainer.prediction_step(model, inputs, prediction_loss_only=True)[0]
torch.testing.assert_close(train_loss, predict_loss, atol=1e-6, rtol=0)
def test_eval_use_gather_object(self):
train_dataset = RegressionDataset()
eval_dataset = RegressionDataset()
model = RegressionDictModel()
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(tmp_dir, eval_use_gather_object=True)
trainer = Trainer(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
_ = trainer.evaluate()
_ = trainer.predict(eval_dataset)
# ---------------------------------------------------------------------------
# Batch eval metrics tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerBatchEvalMetricsTest(TestCasePlus, TrainerIntegrationCommon):
def setUp(self):
super().setUp()
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_evaluate_with_batch_eval_metrics(self):
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
# With a number of elements not a round multiple of the batch size
trainer = get_regression_trainer(
a=1.5,
b=2.5,
eval_len=66,
compute_metrics=AlmostAccuracyBatched(),
batch_eval_metrics=True,
output_dir=tmp_dir,
)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
# With logits preprocess
trainer = get_regression_trainer(
a=1.5,
b=2.5,
compute_metrics=AlmostAccuracyBatched(),
batch_eval_metrics=True,
preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
output_dir=tmp_dir,
)
results = trainer.evaluate()
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
pred = 1.5 * x + 2.5
expected_loss = ((pred - y) ** 2).mean()
self.assertAlmostEqual(results["eval_loss"], expected_loss)
expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
def test_predict_with_batch_eval_metrics(self):
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
)
results = trainer.predict(trainer.eval_dataset)
preds = results.predictions
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
gt = 1.5 * x + 2.5
self.assertTrue(np.allclose(preds, gt))
expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
# With a number of elements not a round multiple of the batch size
trainer = get_regression_trainer(
a=1.5,
b=2.5,
eval_len=66,
compute_metrics=AlmostAccuracyBatched(),
batch_eval_metrics=True,
output_dir=tmp_dir,
)
results = trainer.predict(trainer.eval_dataset)
preds = results.predictions
x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
# With more than one output of the model
trainer = get_regression_trainer(
a=1.5,
b=2.5,
double_output=True,
compute_metrics=AlmostAccuracyBatched(),
batch_eval_metrics=True,
output_dir=tmp_dir,
)
preds = trainer.predict(trainer.eval_dataset).predictions
x = trainer.eval_dataset.x
self.assertEqual(len(preds), 2)
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
# With more than one output/label of the model
trainer = get_regression_trainer(
a=1.5,
b=2.5,
double_output=True,
label_names=["labels", "labels_2"],
compute_metrics=AlmostAccuracyBatched(),
batch_eval_metrics=True,
output_dir=tmp_dir,
)
outputs = trainer.predict(trainer.eval_dataset)
preds = outputs.predictions
labels = outputs.label_ids
x = trainer.eval_dataset.x
self.assertEqual(len(preds), 2)
self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
# ---------------------------------------------------------------------------
# FP16 / BF16 full eval memory tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerFullEvalMemoryTest(TestCasePlus):
@require_torch_fp16
@require_torch_accelerator
def test_fp16_full_eval(self):
# this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
# it's using pretty large safety margins, but small enough to detect broken functionality.
debug = 0
n_gpus = backend_device_count(torch_device)
with tempfile.TemporaryDirectory() as tmp_dir:
bs = 8
eval_len = 16 * n_gpus
# make the params somewhat big so that there will be enough RAM consumed to be able to
# measure things. We should get about 64KB for a+b in fp32
a = torch.ones(1000, bs) + 0.001
b = torch.ones(1000, bs) - 0.001
# 1. with fp16_full_eval disabled
trainer = get_regression_trainer(
a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
)
metrics = trainer.evaluate()
del trainer
gc.collect()
fp32_init = metrics["init_mem_gpu_alloc_delta"]
fp32_eval = metrics["eval_mem_gpu_alloc_delta"]
if debug:
print(f"fp32_init {fp32_init}")
print(f"fp32_eval {fp32_eval}")
# here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
# perfect world: fp32_init == 64<<10
self.assertGreater(fp32_init, 59_000)
# after eval should be no extra memory allocated - with a small margin (other than the peak
# memory consumption for the forward calculation that gets recovered)
# perfect world: fp32_eval == close to zero
self.assertLess(fp32_eval, 5_000)
# 2. with fp16_full_eval enabled
trainer = get_regression_trainer(
a=a, b=b, eval_len=eval_len, fp16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
)
metrics = trainer.evaluate()
fp16_init = metrics["init_mem_gpu_alloc_delta"]
fp16_eval = metrics["eval_mem_gpu_alloc_delta"]
if debug:
print(f"fp16_init {fp16_init}")
print(f"fp16_eval {fp16_eval}")
# here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
# perfect world: fp16_init == close to zero
self.assertLess(fp16_init, 5_000)
# here we put the model on device in eval and only `half()` of it, i.e. about 32K (again we ignore the peak margin which gets returned back)
# perfect world: fp16_eval == 32<<10
self.assertGreater(fp16_eval, 27_000)
# 3. relative comparison fp32 vs full fp16
# fp16_eval should be about half of fp32_init
# perfect world: fp32_init/2 == fp16_eval
self.assertAlmostEqual(fp16_eval, fp32_init / 2, delta=5_000)
@require_torch_accelerator
@require_torch_bf16
def test_bf16_full_eval(self):
# note: most of the logic is the same as test_fp16_full_eval
# this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
# it's using pretty large safety margins, but small enough to detect broken functionality.
debug = 0
n_gpus = backend_device_count(torch_device)
bs = 8
eval_len = 16 * n_gpus
# make the params somewhat big so that there will be enough RAM consumed to be able to
# measure things. We should get about 64KB for a+b in fp32
a = torch.ones(1000, bs) + 0.001
b = torch.ones(1000, bs) - 0.001
with tempfile.TemporaryDirectory() as tmp_dir:
# 1. with bf16_full_eval disabled
trainer = get_regression_trainer(
a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
)
metrics = trainer.evaluate()
del trainer
gc.collect()
fp32_init = metrics["init_mem_gpu_alloc_delta"]
fp32_eval = metrics["eval_mem_gpu_alloc_delta"]
if debug:
print(f"fp32_init {fp32_init}")
print(f"fp32_eval {fp32_eval}")
# here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
# perfect world: fp32_init == 64<<10
self.assertGreater(fp32_init, 59_000)
# after eval should be no extra memory allocated - with a small margin (other than the peak
# memory consumption for the forward calculation that gets recovered)
# perfect world: fp32_eval == close to zero
self.assertLess(fp32_eval, 5_000)
# 2. with bf16_full_eval enabled
trainer = get_regression_trainer(
a=a, b=b, eval_len=eval_len, bf16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
)
metrics = trainer.evaluate()
bf16_init = metrics["init_mem_gpu_alloc_delta"]
bf16_eval = metrics["eval_mem_gpu_alloc_delta"]
if debug:
print(f"bf16_init {bf16_init}")
print(f"bf16_eval {bf16_eval}")
# here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
# perfect world: bf16_init == close to zero
self.assertLess(bf16_init, 5_000)
# here we put the model on device in eval and only a bf16 copy of it, i.e. about 32K (again we ignore the peak margin which gets returned back)
# perfect world: bf16_eval == 32<<10
self.assertGreater(bf16_eval, 27_000)
# 3. relative comparison fp32 vs full bf16
# bf16_eval should be about half of fp32_init
# perfect world: fp32_init/2 == bf16_eval
self.assertAlmostEqual(bf16_eval, fp32_init / 2, delta=5_000)
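# The thresholds used in the two memory tests above (59_000, 27_000, delta=5_000) follow from
# simple size arithmetic on `a` and `b`. The sketch below reproduces that arithmetic in plain
# Python. `param_bytes` is a hypothetical helper introduced only for illustration; it is not
# part of the test suite, and real allocators may round sizes up, which is why the tests use margins.

```python
# Hypothetical helper, not part of the test suite: estimate parameter memory.
FP32_BYTES = 4
HALF_BYTES = 2  # fp16 and bf16 both use 2 bytes per element


def param_bytes(shape, bytes_per_element):
    """Return the storage size in bytes of one dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


shape = (1000, 8)  # same shape as `a` and `b` in the tests (bs = 8)
fp32_total = 2 * param_bytes(shape, FP32_BYTES)  # a + b in fp32
half_total = 2 * param_bytes(shape, HALF_BYTES)  # a + b after casting to half precision

print(fp32_total)  # 64000
print(half_total)  # 32000
```

# Both totals sit comfortably above the lower bounds asserted in the tests, and the
# half-precision total is exactly half of the fp32 total, which is what step 3 checks.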
# ---------------------------------------------------------------------------
# Slow external model eval tests
# ---------------------------------------------------------------------------
@require_torch
class TrainerSlowEvalTest(TestCasePlus):
@slow
def test_trainer_eval_mrpc(self):
MODEL_ID = "google-bert/bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
data_args = GlueDataTrainingArguments(
task_name="mrpc", data_dir=f"{get_tests_dir()}/fixtures/tests_samples/MRPC", overwrite_cache=True
)
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(output_dir=tmp_dir, use_cpu=True)
trainer = Trainer(model=model, args=training_args, eval_dataset=eval_dataset)
result = trainer.evaluate()
self.assertLess(result["eval_loss"], 0.2)
@slow
def test_trainer_eval_multiple(self):
MODEL_ID = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
use_cpu=True,
per_device_eval_batch_size=1,
)
trainer = Trainer(
model=model,
args=training_args,
eval_dataset={
"data1": dataset,
"data2": dataset,
},
)
result = trainer.evaluate()
self.assertIn("eval_data1_loss", result)
self.assertIn("eval_data2_loss", result)
@slow
def test_trainer_eval_lm(self):
MODEL_ID = "distilbert/distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
self.assertEqual(len(dataset), 31)

@@ -0,0 +1,308 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Trainer hyperparameter search tests: Optuna (single/multi-objective, full eval),
Ray Tune (with client), W&B sweeps, and backend availability detection.
"""
import tempfile
import unittest
from transformers import TrainingArguments
from transformers.hyperparameter_search import ALL_HYPERPARAMETER_SEARCH_BACKENDS, HPSearchBackend
from transformers.testing_utils import require_optuna, require_ray, require_torch, require_wandb, torch_device
from transformers.trainer_utils import IntervalStrategy
from transformers.utils.hp_naming import TrialShortNamer
from .trainer_test_utils import (
AlmostAccuracy,
RegressionModelConfig,
RegressionPreTrainedModel,
get_regression_trainer,
)
@require_torch
@require_optuna
class TrainerHyperParameterOptunaIntegrationTest(unittest.TestCase):
def setUp(self):
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_hyperparameter_search(self):
class MyTrialShortNamer(TrialShortNamer):
DEFAULTS = {"a": 0, "b": 0}
def hp_space(trial):
return {}
def model_init(trial):
if trial is not None:
a = trial.suggest_int("a", -4, 4)
b = trial.suggest_int("b", -4, 4)
else:
a = 0
b = 0
config = RegressionModelConfig(a=a, b=b, double_output=False)
return RegressionPreTrainedModel(config).to(torch_device)
def hp_name(trial):
return MyTrialShortNamer.shortname(trial.params)
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
output_dir=tmp_dir,
learning_rate=0.1,
logging_steps=1,
eval_strategy=IntervalStrategy.EPOCH,
save_strategy=IntervalStrategy.EPOCH,
num_train_epochs=4,
disable_tqdm=True,
load_best_model_at_end=True,
run_name="test",
model_init=model_init,
)
trainer.hyperparameter_search(direction="minimize", hp_space=hp_space, hp_name=hp_name, n_trials=4)
@require_torch
@require_optuna
class TrainerHyperParameterMultiObjectOptunaIntegrationTest(unittest.TestCase):
def setUp(self):
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_hyperparameter_search(self):
class MyTrialShortNamer(TrialShortNamer):
DEFAULTS = {"a": 0, "b": 0}
def hp_space(trial):
return {}
def model_init(trial):
if trial is not None:
a = trial.suggest_int("a", -4, 4)
b = trial.suggest_int("b", -4, 4)
else:
a = 0
b = 0
config = RegressionModelConfig(a=a, b=b, double_output=False)
return RegressionPreTrainedModel(config).to(torch_device)
def hp_name(trial):
return MyTrialShortNamer.shortname(trial.params)
def compute_objective(metrics: dict[str, float]) -> tuple[float, float]:
return metrics["eval_loss"], metrics["eval_accuracy"]
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
output_dir=tmp_dir,
learning_rate=0.1,
logging_steps=1,
eval_strategy=IntervalStrategy.EPOCH,
save_strategy=IntervalStrategy.EPOCH,
num_train_epochs=10,
disable_tqdm=True,
load_best_model_at_end=True,
run_name="test",
model_init=model_init,
compute_metrics=AlmostAccuracy(),
)
trainer.hyperparameter_search(
direction=["minimize", "maximize"],
hp_space=hp_space,
hp_name=hp_name,
n_trials=4,
compute_objective=compute_objective,
)
@require_torch
@require_optuna
class TrainerHyperParameterOptunaIntegrationTestWithFullEval(unittest.TestCase):
def test_hyperparameter_search(self):
def hp_space(trial):
return {}
def model_init(trial):
if trial is not None:
a = trial.suggest_int("a", -4, 4)
b = trial.suggest_int("b", -4, 4)
else:
a = 0
b = 0
config = RegressionModelConfig(a=a, b=b, double_output=False)
return RegressionPreTrainedModel(config).to(torch_device)
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
output_dir=tmp_dir,
disable_tqdm=True,
model_init=model_init,
fp16_full_eval=True,
)
trainer.hyperparameter_search(
direction="minimize",
hp_space=hp_space,
n_trials=2,
)
@require_torch
@require_ray
@unittest.skip("doesn't work because of a serialization issue")
class TrainerHyperParameterRayIntegrationTest(unittest.TestCase):
def setUp(self):
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def ray_hyperparameter_search(self):
class MyTrialShortNamer(TrialShortNamer):
DEFAULTS = {"a": 0, "b": 0}
def hp_space(trial):
from ray import tune
return {
"a": tune.randint(-4, 4),
"b": tune.randint(-4, 4),
}
def model_init(config):
if config is None:
a = 0
b = 0
else:
a = config["a"]
b = config["b"]
model_config = RegressionModelConfig(a=a, b=b, double_output=False)
return RegressionPreTrainedModel(model_config).to(torch_device)
def hp_name(params):
return MyTrialShortNamer.shortname(params)
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
output_dir=tmp_dir,
learning_rate=0.1,
logging_steps=1,
eval_strategy=IntervalStrategy.EPOCH,
save_strategy=IntervalStrategy.EPOCH,
num_train_epochs=4,
disable_tqdm=True,
load_best_model_at_end=True,
run_name="test",
model_init=model_init,
)
trainer.hyperparameter_search(
direction="minimize", hp_space=hp_space, hp_name=hp_name, backend="ray", n_trials=4
)
def test_hyperparameter_search(self):
self.ray_hyperparameter_search()
def test_hyperparameter_search_ray_client(self):
import ray
from ray.util.client.ray_client_helpers import ray_start_client_server
with ray_start_client_server():
assert ray.util.client.ray.is_connected()
self.ray_hyperparameter_search()
@require_torch
@require_wandb
class TrainerHyperParameterWandbIntegrationTest(unittest.TestCase):
def setUp(self):
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
def test_hyperparameter_search(self):
def hp_space(trial):
return {
"method": "random",
"metric": {},
"parameters": {
"a": {"distribution": "uniform", "min": 1e-6, "max": 1e-4},
"b": {"distribution": "int_uniform", "min": 1, "max": 6},
},
}
def model_init(config):
if config is None:
a = 0
b = 0
else:
a = config["a"]
b = config["b"]
model_config = RegressionModelConfig(a=a, b=b, double_output=False)
return RegressionPreTrainedModel(model_config).to(torch_device)
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = get_regression_trainer(
output_dir=tmp_dir,
learning_rate=0.1,
logging_steps=1,
eval_strategy=IntervalStrategy.EPOCH,
save_strategy=IntervalStrategy.EPOCH,
num_train_epochs=4,
disable_tqdm=True,
load_best_model_at_end=True,
run_name="test",
model_init=model_init,
)
sweep_kwargs = {
"direction": "minimize",
"hp_space": hp_space,
"backend": "wandb",
"n_trials": 4,
}
best_run = trainer.hyperparameter_search(**sweep_kwargs)
self.assertIsNotNone(best_run.run_id)
self.assertIsNotNone(best_run.run_summary)
hp_keys = set(best_run.hyperparameters.keys())
self.assertSetEqual(hp_keys, {"a", "b", "assignments", "metric"})
# pretend restarting the process purged the environ
import os
del os.environ["WANDB_ENTITY"]
del os.environ["WANDB_PROJECT"]
sweep_kwargs["sweep_id"] = best_run.run_summary
updated_best_run = trainer.hyperparameter_search(**sweep_kwargs)
self.assertIsNotNone(updated_best_run.run_id)
self.assertEqual(updated_best_run.run_summary, best_run.run_summary)
updated_hp_keys = set(updated_best_run.hyperparameters.keys())
self.assertSetEqual(updated_hp_keys, {"a", "b", "assignments", "metric"})
class HyperParameterSearchBackendsTest(unittest.TestCase):
def test_hyperparameter_search_backends(self):
self.assertEqual(
list(ALL_HYPERPARAMETER_SEARCH_BACKENDS.keys()),
list(HPSearchBackend),
)

@@ -0,0 +1,853 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Trainer optimizer and LR scheduler tests: custom optimizers, LR scheduler kwargs, cosine-with-min-lr,
reduce-on-plateau, Adafactor, bitsandbytes (RMSProp, AdEMAMix), LOMO, GrokAdamW, schedule-free,
GaLore, Apollo, Stable AdamW, Liger kernel, optimizer choice resolution, factory pattern detection,
and model parameter inspection.
"""
import tempfile
import numpy as np
from parameterized import parameterized
from transformers import (
GPT2Config,
GPT2LMHeadModel,
LlamaConfig,
LlamaForCausalLM,
Trainer,
TrainingArguments,
is_torch_available,
)
from transformers.testing_utils import (
TestCasePlus,
require_apollo_torch,
require_bitsandbytes,
require_galore_torch,
require_grokadamw,
require_lomo,
require_schedulefree,
require_torch,
require_torch_accelerator,
require_torch_optimi,
)
from transformers.trainer_utils import check_target_module_exists
from .trainer_test_utils import (
BasicTextGenerationModel,
RegressionDataset,
RegressionModel,
RepeatDataset,
TorchTracemalloc,
TrainerIntegrationCommon,
TstLayer,
bytes2megabytes,
get_regression_trainer,
)
if is_torch_available():
import torch
from torch import nn
_ATTN_MLP_TARGET_MODULES = [r".*attn.*", r".*mlp.*"]
@require_torch
class TrainerOptimizerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
def setUp(self):
super().setUp()
args = TrainingArguments("..")
self.n_epochs = args.num_train_epochs
self.batch_size = args.train_batch_size
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _get_llama_and_dataset(self):
config = LlamaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=3, num_attention_heads=4)
model = LlamaForCausalLM(config)
train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
return model, train_dataset
def _get_gpt2_and_dataset(self):
config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
model = GPT2LMHeadModel(config)
train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
return model, train_dataset
def _train_with_llama(self, optim, optim_target_modules=None, **extra_kwargs):
"""Smoke-test: tiny Llama + RepeatDataset with the given optimizer."""
tiny_llama, train_dataset = self._get_llama_and_dataset()
kwargs = {"learning_rate": 1e-9, "logging_steps": 5, "optim": optim}
if optim_target_modules is not None:
kwargs["optim_target_modules"] = optim_target_modules
kwargs.update(extra_kwargs)
args = TrainingArguments(self.get_auto_remove_tmp_dir(), **kwargs)
trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
trainer.train()
return trainer
def _check_lr_display_without_scheduler(self, optim, optim_target_modules):
"""Verify that LR is correctly reported without an LR scheduler."""
tiny_llama, train_dataset = self._get_llama_and_dataset()
learning_rate = 1e-9
args = TrainingArguments(
self.get_auto_remove_tmp_dir(),
learning_rate=learning_rate,
logging_steps=5,
optim=optim,
optim_target_modules=optim_target_modules,
)
trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
trainer.create_optimizer_and_scheduler(num_training_steps=10)
self.assertEqual(trainer.get_learning_rates(), [learning_rate, learning_rate])
def _check_lr_display_with_scheduler(self, optim, optim_target_modules, num_train_epochs=2):
"""Verify warmup + cosine LR schedule: increases then decreases."""
tiny_llama, train_dataset = self._get_llama_and_dataset()
learning_rate = 2e-4
num_warmup_steps = 5
args = TrainingArguments(
self.get_auto_remove_tmp_dir(),
num_train_epochs=num_train_epochs,
learning_rate=learning_rate,
warmup_steps=num_warmup_steps,
lr_scheduler_type="cosine",
logging_steps=1,
optim=optim,
optim_target_modules=optim_target_modules,
)
trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
trainer.train()
logs = trainer.state.log_history[1:-1]
self.assertTrue(logs[num_warmup_steps - 1]["learning_rate"] == learning_rate)
self.assertTrue(np.allclose(logs[-1]["learning_rate"], 0, atol=5e-6))
increasing_lrs = [
logs[i]["learning_rate"] < logs[i + 1]["learning_rate"]
for i in range(len(logs))
if i < num_warmup_steps - 1
]
decreasing_lrs = [
logs[i]["learning_rate"] > logs[i + 1]["learning_rate"]
for i in range(len(logs) - 1)
if i >= num_warmup_steps - 1
]
self.assertTrue(all(increasing_lrs))
self.assertTrue(all(decreasing_lrs))
self.assertTrue(len(decreasing_lrs) > len(increasing_lrs))
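# The helper above checks the shape of a warmup + cosine schedule: the LR rises during warmup,
# peaks at `learning_rate`, then decays towards zero. A minimal standalone sketch of the usual
# linear-warmup + cosine-decay formula is shown below. `lr_at_step` and the step counts are
# assumptions made for illustration; the exact log indices checked by the helper depend on how
# the Trainer records steps.

```python
import math


def lr_at_step(step, base_lr, num_warmup_steps, num_training_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative values: same peak LR and warmup length as the helper, arbitrary total steps.
base_lr, warmup, total = 2e-4, 5, 32
lrs = [lr_at_step(s, base_lr, warmup, total) for s in range(total + 1)]

increasing = all(lrs[i] < lrs[i + 1] for i in range(warmup))
decreasing = all(lrs[i] > lrs[i + 1] for i in range(warmup, total))
print(increasing, decreasing)
```

# With more decay steps than warmup steps, the decreasing segment is longer than the
# increasing one, which mirrors the final length assertion in the helper.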
# ---------------------------------------------------------------------------
# Adafactor optimizer test
# ---------------------------------------------------------------------------
def test_adafactor_lr_none(self):
# test the special case where lr=None, since Trainer always needs an lr_scheduler
from transformers.optimization import Adafactor, AdafactorSchedule
train_dataset = RegressionDataset()
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(tmp_dir)
model = RegressionModel()
optimizer = Adafactor(
model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None
)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
trainer.train()
# Train a default model to compare against
default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
default_trainer.train()
self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
self.assertGreater(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 0)
# ---------------------------------------------------------------------------
# BNB optimizer tests
# ---------------------------------------------------------------------------
@parameterized.expand(["rmsprop_bnb", "ademamix", "ademamix_8bit", "rmsprop_bnb_8bit", "rmsprop_bnb_32bit"])
@require_bitsandbytes
def test_bnb_optim(self, optim):
tiny_gpt2, train_dataset = self._get_gpt2_and_dataset()
args = TrainingArguments(
self.get_auto_remove_tmp_dir(),
learning_rate=1e-9,
logging_steps=5,
logging_nan_inf_filter=False,
optim=optim,
)
Trainer(tiny_gpt2, args, train_dataset=train_dataset).train()
@require_bitsandbytes
def test_bnb_8bit_optimizer_skip_embedding(self):
model = BasicTextGenerationModel(8, 4)
with tempfile.TemporaryDirectory() as tmp_dir:
for name_optim in ["rmsprop_bnb_8bit", "adamw_8bit"]:
args = TrainingArguments(
output_dir=tmp_dir,
optim=name_optim,
)
trainer = Trainer(model=model, args=args)
optimizer = trainer.create_optimizer()
modules = optimizer.mng.module_weight_config_triple
self.assertNotEqual(len(modules), 0)
module, name, config = modules[0]
self.assertIsInstance(module, torch.nn.Embedding)
self.assertEqual(name, "weight")
self.assertDictEqual(config, {"optim_bits": 32})
# ---------------------------------------------------------------------------
# LOMO tests
# ---------------------------------------------------------------------------
@require_lomo
@require_torch_accelerator
def test_lomo(self):
tiny_llama, train_dataset = self._get_llama_and_dataset()
previous_params = {n: p.clone() for n, p in tiny_llama.named_parameters()}
args = TrainingArguments(
self.get_auto_remove_tmp_dir(), learning_rate=1e-2, logging_steps=5, optim="lomo", max_steps=20
)
Trainer(tiny_llama, args, train_dataset=train_dataset).train()
for name, param in tiny_llama.named_parameters():
self.assertFalse(torch.allclose(param, previous_params[name].to(param.device), rtol=1e-12, atol=1e-12))
@require_lomo
@require_torch_accelerator
def test_adalomo(self):
self._train_with_llama("adalomo")
# ---------------------------------------------------------------------------
# GrokAdamW test
# ---------------------------------------------------------------------------
@require_grokadamw
@require_torch_accelerator
def test_grokadamw(self):
self._train_with_llama("grokadamw", learning_rate=2e-5, max_steps=20)
# ---------------------------------------------------------------------------
# Schedule-free tests
# ---------------------------------------------------------------------------
@parameterized.expand([("schedule_free_adamw",), ("schedule_free_radam",)])
@require_schedulefree
@require_torch_accelerator
def test_schedulefree(self, optim):
self._train_with_llama(optim, lr_scheduler_type="constant")
# ---------------------------------------------------------------------------
# GaLore tests
# ---------------------------------------------------------------------------
def test_galore_matched_modules(self):
regex_patterns = [r".*.attn.*", r".*.mlp.*"]
module_names = [
"model.transformer.h.0.ln_1",
"model.transformer.h.0.attn.q_proj",
"model.lm_head",
"model.transformer.h.0.mlp.up_proj",
]
expected_values = [False, True, False, True]
for expected_value, module_name in zip(expected_values, module_names):
is_module_matched, is_regex = check_target_module_exists(regex_patterns, module_name, return_is_regex=True)
self.assertTrue(is_module_matched == expected_value)
if is_module_matched:
self.assertTrue(is_regex)
exact_patterns = ["q_proj", "up_proj"]
module_names = [
"model.transformer.h.0.ln_1",
"model.transformer.h.0.attn.q_proj",
"model.lm_head",
"model.transformer.h.0.mlp.up_proj",
]
expected_values = [False, True, False, True]
for expected_value, module_name in zip(expected_values, module_names):
is_module_matched, is_regex = check_target_module_exists(exact_patterns, module_name, return_is_regex=True)
self.assertTrue(is_module_matched == expected_value)
if is_module_matched:
self.assertFalse(is_regex)
simple_regex = r".*.attn.*"
module_names = [
"model.transformer.h.0.ln_1",
"model.transformer.h.0.attn.q_proj",
"model.lm_head",
"model.transformer.h.0.mlp.up_proj",
]
expected_values = [False, True, False, False]
for expected_value, module_name in zip(expected_values, module_names):
is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
self.assertTrue(is_module_matched == expected_value)
if is_module_matched:
self.assertTrue(is_regex)
simple_regex = "model.transformer.h.0.attn.q_proj"
module_names = [
"model.transformer.h.0.ln_1",
"model.transformer.h.0.attn.q_proj",
"model.lm_head",
"model.transformer.h.0.mlp.up_proj",
]
expected_values = [False, True, False, False]
for expected_value, module_name in zip(expected_values, module_names):
is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
self.assertTrue(is_module_matched == expected_value)
if is_module_matched:
self.assertFalse(is_regex)
target_modules = ["attn", "mlp"]
module_names = [
"model.transformer.h.0.ln_1",
"model.transformer.h.0.attn.q_proj",
"model.lm_head",
"model.transformer.h.0.mlp.up_proj",
]
expected_values = [False, True, False, True]
for expected_value, module_name in zip(expected_values, module_names):
is_module_matched, is_regex = check_target_module_exists(target_modules, module_name, return_is_regex=True)
self.assertTrue(is_module_matched == expected_value)
if is_module_matched:
self.assertFalse(is_regex)
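# The five cases in test_galore_matched_modules pin down how a target specification is matched
# against a module name and when the match is reported as a regex. The sketch below is a
# hypothetical stand-in, `match_target_module`, inferred only from those assertions; it is not
# the library's `check_target_module_exists`, which may differ in details.

```python
import re


def match_target_module(targets, name):
    """Return (matched, is_regex) for a module name.

    Hypothetical stand-in inferred from the test assertions, not the real implementation.
    """
    if isinstance(targets, str):
        matched = re.fullmatch(targets, name) is not None
        # A string identical to the name counts as an exact hit, not a regex.
        return matched, matched and targets != name
    if name in targets or any(t in name for t in targets):
        return True, False  # exact or substring hit
    if any(re.fullmatch(t, name) for t in targets):
        return True, True  # only a regex could have matched
    return False, False


attn = "model.transformer.h.0.attn.q_proj"
mlp = "model.transformer.h.0.mlp.up_proj"

print(match_target_module([r".*.attn.*", r".*.mlp.*"], attn))
print(match_target_module(["q_proj", "up_proj"], mlp))
print(match_target_module(r".*.attn.*", mlp))
```

# Plain substrings are tried before regexes, so a list such as ["attn", "mlp"] is reported
# as a non-regex match, in line with the last case of the test.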
@parameterized.expand([("galore_adamw",), ("galore_adamw_layerwise",), ("galore_adamw_8bit",)])
@require_galore_torch
@require_torch_accelerator
def test_galore(self, optim):
self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)
@require_galore_torch
@require_torch_accelerator
def test_galore_extra_args(self):
self._train_with_llama(
"galore_adamw",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
optim_args="rank=64, update_proj_gap=100, scale=0.10",
)
@require_galore_torch
@require_torch_accelerator
def test_galore_layerwise_with_scheduler(self):
self._train_with_llama(
"galore_adamw_layerwise",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
lr_scheduler_type="cosine",
)
@parameterized.expand(
[
(_ATTN_MLP_TARGET_MODULES,),
(["q_proj", "k_proj", "v_proj"],),
("all-linear",),
]
)
@require_galore_torch
@require_torch_accelerator
def test_galore_adafactor(self, optim_target_modules):
upper_bound_pm = 700
lower_bound_pm = 650
tiny_llama, train_dataset = self._get_llama_and_dataset()
with tempfile.TemporaryDirectory() as tmpdir, TorchTracemalloc() as tracemalloc:
args = TrainingArguments(
tmpdir,
learning_rate=1e-9,
logging_steps=5,
optim="galore_adafactor",
optim_target_modules=optim_target_modules,
)
Trainer(tiny_llama, args, train_dataset=train_dataset).train()
galore_peak_memory = tracemalloc.peaked + bytes2megabytes(tracemalloc.begin)
self.assertTrue(galore_peak_memory < upper_bound_pm)
self.assertTrue(lower_bound_pm < galore_peak_memory)
@require_galore_torch
@require_torch_accelerator
def test_galore_lr_display_without_scheduler(self):
self._check_lr_display_without_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)
@require_galore_torch
@require_torch_accelerator
def test_galore_lr_display_with_scheduler(self):
self._check_lr_display_with_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)
# ---------------------------------------------------------------------------
# Apollo tests
# ---------------------------------------------------------------------------
@parameterized.expand([("apollo_adamw",), ("apollo_adamw_layerwise",)])
@require_apollo_torch
@require_torch_accelerator
def test_apollo(self, optim):
self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)
@require_apollo_torch
@require_torch_accelerator
def test_apollo_extra_args(self):
self._train_with_llama(
"apollo_adamw",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
optim_args="proj=random,scale_type=tensor,rank=1,update_proj_gap=100,scale=128.0",
)
@require_apollo_torch
@require_torch_accelerator
def test_apollo_layerwise_with_scheduler(self):
self._train_with_llama(
"apollo_adamw_layerwise",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
lr_scheduler_type="cosine",
)
@require_apollo_torch
@require_torch_accelerator
def test_apollo_lr_display_without_scheduler(self):
self._check_lr_display_without_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES)
@require_apollo_torch
@require_torch_accelerator
def test_apollo_lr_display_with_scheduler(self):
self._check_lr_display_with_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)
# ---------------------------------------------------------------------------
# Stable AdamW tests
# ---------------------------------------------------------------------------
@require_torch_optimi
@require_torch_accelerator
def test_stable_adamw(self):
self._train_with_llama("stable_adamw", optim_target_modules=_ATTN_MLP_TARGET_MODULES)
@require_torch_optimi
@require_torch_accelerator
def test_stable_adamw_extra_args(self):
self._train_with_llama(
"stable_adamw",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
optim_args="decouple_lr=True,max_lr=1e-3,kahan_sum=True",
)
@require_torch_optimi
@require_torch_accelerator
def test_stable_adamw_trainer_adamw_args(self):
tiny_llama, train_dataset = self._get_llama_and_dataset()
args = TrainingArguments(
self.get_auto_remove_tmp_dir(),
learning_rate=1e-9,
logging_steps=5,
weight_decay=0.001,
adam_beta1=0.89,
adam_beta2=0.98,
adam_epsilon=1e-8,
optim="stable_adamw",
optim_target_modules=_ATTN_MLP_TARGET_MODULES,
)
trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
trainer.create_optimizer_and_scheduler(num_training_steps=10)
# check StableAdamW optimizer is created with the correct parameters
self.assertEqual(trainer.optimizer.defaults["beta1"], args.adam_beta1)
self.assertEqual(trainer.optimizer.defaults["beta2"], args.adam_beta2)
self.assertEqual(trainer.optimizer.defaults["eps"], args.adam_epsilon)
self.assertEqual(trainer.optimizer.defaults["weight_decay"], args.weight_decay)
@require_torch_optimi
@require_torch_accelerator
def test_stable_adamw_lr_display_without_scheduler(self):
self._check_lr_display_without_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES)
@require_torch_optimi
@require_torch_accelerator
def test_stable_adamw_lr_display_with_scheduler(self):
self._check_lr_display_with_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)
# ---------------------------------------------------------------------------
# Misc optimizer tests
# ---------------------------------------------------------------------------
def test_optimizer_factory_pattern(self):
"""Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
from transformers.trainer_optimizer import is_optimizer_factory
# Create a mock optimizer class
class MockComplexOptimizer(torch.optim.Optimizer):
def __init__(self, params, lr=1e-3):
defaults = {"lr": lr}
super().__init__(params, defaults)
def step(self, closure=None):
pass
# Create a factory class (simulates Muon/Dion pattern)
class MockOptimizerFactory:
def __call__(self, opt_model, **optimizer_kwargs):
all_params = list(opt_model.parameters())
return MockComplexOptimizer(all_params, **optimizer_kwargs)
# Verify is_optimizer_factory correctly identifies factories vs optimizer classes
self.assertFalse(is_optimizer_factory(MockComplexOptimizer)) # Optimizer class should return False
self.assertTrue(is_optimizer_factory(MockOptimizerFactory)) # Factory class should return True
# ---------------------------------------------------------------------------
# Optimizer group and learning rate inspection tests
# ---------------------------------------------------------------------------
def test_get_optimizer_group(self):
model = nn.Sequential(nn.Linear(128, 64))
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
# ValueError is raised if optimizer is None
with self.assertRaises(ValueError):
trainer.get_optimizer_group()
trainer.create_optimizer()
# Get groups
num_groups = len(trainer.get_optimizer_group())
self.assertEqual(num_groups, 2)
# Get group of parameter
param = next(model.parameters())
group = trainer.get_optimizer_group(param)
self.assertIn(param, group["params"])
# ---------------------------------------------------------------------------
# Custom optimizer and LR scheduler tests
# ---------------------------------------------------------------------------
class TrainerOptimizerTest(TestCasePlus):
def test_get_optimizer_group(self):
model = nn.Sequential(nn.Linear(128, 64))
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
# ValueError is raised if optimizer is None
with self.assertRaises(ValueError):
trainer.get_optimizer_group()
trainer.create_optimizer()
# Get groups
num_groups = len(trainer.get_optimizer_group())
self.assertEqual(num_groups, 2)
# Get group of parameter
param = next(model.parameters())
group = trainer.get_optimizer_group(param)
self.assertIn(param, group["params"])
def test_optimizer_factory_pattern(self):
"""Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
from transformers.trainer_optimizer import is_optimizer_factory
# Create a mock optimizer class
class MockComplexOptimizer(torch.optim.Optimizer):
def __init__(self, params, lr=1e-3):
defaults = {"lr": lr}
super().__init__(params, defaults)
def step(self, closure=None):
pass
# Create a factory class (simulates Muon/Dion pattern)
class MockOptimizerFactory:
def __call__(self, opt_model, **optimizer_kwargs):
all_params = list(opt_model.parameters())
return MockComplexOptimizer(all_params, **optimizer_kwargs)
# Verify is_optimizer_factory correctly identifies factories vs optimizer classes
self.assertFalse(is_optimizer_factory(MockComplexOptimizer)) # Optimizer class should return False
self.assertTrue(is_optimizer_factory(MockOptimizerFactory)) # Factory class should return True
def test_custom_optimizer(self):
train_dataset = RegressionDataset()
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(tmp_dir)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda x: 1.0)
trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
trainer.train()
# Train a default model to compare against
default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
default_trainer.train()
self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
self.assertEqual(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 1.0)
# ---------------------------------------------------------------------------
# Weight decay parameter groups
# ---------------------------------------------------------------------------
def test_no_wd_param_group(self):
model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
trainer.create_optimizer_and_scheduler(10)
wd_names = ['0.linear1.weight', '0.linear2.weight', '1.0.linear1.weight', '1.0.linear2.weight', '1.1.linear1.weight', '1.1.linear2.weight'] # fmt: skip
wd_params = [p for n, p in model.named_parameters() if n in wd_names]
no_wd_params = [p for n, p in model.named_parameters() if n not in wd_names]
self.assertListEqual(trainer.optimizer.param_groups[0]["params"], wd_params)
self.assertListEqual(trainer.optimizer.param_groups[1]["params"], no_wd_params)
@require_torch
class TrainerLRTest(TestCasePlus):
def test_get_learning_rates(self):
model = nn.Sequential(nn.Linear(128, 64))
with tempfile.TemporaryDirectory() as tmp_dir:
trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
with self.assertRaises(ValueError):
trainer.get_learning_rates()
trainer.create_optimizer()
self.assertEqual(trainer.get_learning_rates(), [5e-05, 5e-05])
def test_lr_scheduler_kwargs(self):
from transformers import get_polynomial_decay_schedule_with_warmup
# test scheduler kwargs passed via TrainingArguments
train_dataset = RegressionDataset()
model = RegressionModel()
num_steps, num_warmup_steps = 10, 2
extra_kwargs = {"power": 5.0, "lr_end": 1e-5} # Non-default arguments
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
lr_scheduler_type="polynomial",
lr_scheduler_kwargs=extra_kwargs,
learning_rate=0.2,
warmup_steps=num_warmup_steps,
)
trainer = Trainer(model, args, train_dataset=train_dataset)
trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
# Checking that the scheduler was created
self.assertIsNotNone(trainer.lr_scheduler)
# Checking that the correct args were passed
sched1 = trainer.lr_scheduler
sched2 = get_polynomial_decay_schedule_with_warmup(
trainer.optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_steps, **extra_kwargs
)
self.assertEqual(sched1.lr_lambdas[0].args, sched2.lr_lambdas[0].args)
self.assertEqual(sched1.lr_lambdas[0].keywords, sched2.lr_lambdas[0].keywords)
def test_cosine_with_min_lr_scheduler(self):
train_dataset = RegressionDataset()
model = RegressionModel()
num_steps, num_warmup_steps = 10, 2
extra_kwargs = {"min_lr": 1e-5} # Non-default arguments
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
lr_scheduler_type="cosine_with_min_lr",
lr_scheduler_kwargs=extra_kwargs,
learning_rate=0.2,
warmup_steps=num_warmup_steps,
)
trainer = Trainer(model, args, train_dataset=train_dataset)
trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
# Checking that the scheduler was created
self.assertIsNotNone(trainer.lr_scheduler)
# Check the last learning rate
for _ in range(num_steps):
trainer.lr_scheduler.step()
self.assertEqual(trainer.lr_scheduler.get_last_lr()[0], 1e-5)
def test_cosine_with_min_lr_schedule_with_warmup_lr_rate(self):
train_dataset = RegressionDataset()
model = RegressionModel()
num_steps, num_warmup_steps = 10, 2
extra_kwargs = {"min_lr": 1e-5} # Non-default arguments
args = TrainingArguments(
"./regression",
lr_scheduler_type="cosine_warmup_with_min_lr",
lr_scheduler_kwargs=extra_kwargs,
learning_rate=0.2,
warmup_steps=num_warmup_steps,
)
trainer = Trainer(model, args, train_dataset=train_dataset)
trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
# Checking that the scheduler was created
self.assertIsNotNone(trainer.lr_scheduler)
# Record the LR before each scheduler step, then check the warmup values and the final LR
step_lrs = []
for _ in range(num_steps):
step_lrs.append(trainer.optimizer.param_groups[0]["lr"])
trainer.lr_scheduler.step()
self.assertEqual(step_lrs[0], 0.1)
self.assertEqual(step_lrs[1], 0.2)
self.assertEqual(step_lrs[-1], 1e-05)
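# Sketch (not part of the test suite): a pure-Python schedule that reproduces the three
# values asserted above (0.1, 0.2, and 1e-5 at the last step). The formula and the name
# `lr_factor` are assumptions chosen to match those assertions; the library's actual
# "cosine_warmup_with_min_lr" implementation may differ in detail.

```python
import math


def lr_factor(step, num_warmup_steps, num_training_steps, min_lr_rate):
    """Multiplier applied to the base LR at `step` (assumed formula, hypothetical helper)."""
    if step < num_warmup_steps:
        # Warmup starts at 1/num_warmup_steps instead of 0, so step 0 already has a nonzero LR.
        return (step + 1) / max(1, num_warmup_steps)
    # Cosine decay from 1.0 down to min_lr_rate over the remaining steps.
    progress = (step - num_warmup_steps + 1) / max(1, num_training_steps - num_warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cosine * (1.0 - min_lr_rate) + min_lr_rate


base_lr, min_lr = 0.2, 1e-5
# Same settings as the test: 10 training steps, 2 warmup steps.
step_lrs = [base_lr * lr_factor(s, 2, 10, min_lr / base_lr) for s in range(10)]
print(step_lrs[0], step_lrs[1], step_lrs[-1])
```

# Under these assumptions the first two entries are the warmup values and the last entry
# is approximately min_lr (compare with math.isclose rather than exact equality).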
def test_reduce_lr_on_plateau_args(self):
# test passed arguments for a custom ReduceLROnPlateau scheduler
train_dataset = RegressionDataset(length=64)
eval_dataset = RegressionDataset(length=64)
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
eval_strategy="epoch",
metric_for_best_model="eval_loss",
)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=5, cooldown=2)
trainer = Trainer(
model,
args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
optimizers=(optimizer, lr_scheduler),
)
trainer.train()
self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
self.assertEqual(trainer.lr_scheduler.factor, 0.2)
self.assertEqual(trainer.lr_scheduler.patience, 5)
self.assertEqual(trainer.lr_scheduler.cooldown, 2)
def test_reduce_lr_on_plateau(self):
# test the ReduceLROnPlateau scheduler
class TrainerWithLRLogs(Trainer):
def log(self, logs):
# the LR is computed after metrics and does not exist for the first epoch
if hasattr(self.lr_scheduler, "_last_lr"):
logs["learning_rate"] = self.lr_scheduler._last_lr[0]
super().log(logs)
train_dataset = RegressionDataset(length=64)
eval_dataset = RegressionDataset(length=64)
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
lr_scheduler_type="reduce_lr_on_plateau",
eval_strategy="epoch",
metric_for_best_model="eval_loss",
num_train_epochs=10,
learning_rate=0.2,
)
model = RegressionModel()
trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
patience = trainer.lr_scheduler.patience
logs = trainer.state.log_history[1:]
best_loss = logs[0]["eval_loss"]
bad_epochs = 0
for i, log in enumerate(logs[:-1]): # Compare learning rate to next epoch's
loss = log["eval_loss"]
just_decreased = False
if loss > best_loss:
bad_epochs += 1
if bad_epochs > patience:
self.assertLess(logs[i + 1]["learning_rate"], log["learning_rate"])
just_decreased = True
bad_epochs = 0
else:
best_loss = loss
bad_epochs = 0
if not just_decreased:
self.assertEqual(logs[i + 1]["learning_rate"], log["learning_rate"])
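# Sketch (not part of the test suite): the patience bookkeeping that the loop above checks,
# written as a standalone function. This is a simplified model of a reduce-on-plateau
# scheduler (no threshold, cooldown, or minimum LR); `plateau_lrs` is a hypothetical helper,
# not a library function.

```python
def plateau_lrs(eval_losses, lr, factor=0.1, patience=10):
    """Return the LR in effect after each evaluation."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in eval_losses:
        if loss < best:
            # Improvement: remember the new best loss and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` consecutive non-improving evaluations: shrink the LR.
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


lrs = plateau_lrs([1.0, 0.9, 0.95, 0.96, 0.8], lr=0.2, factor=0.5, patience=1)
print(lrs)
```

# With patience=1, the LR drops only after the second consecutive evaluation without improvement.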
def test_greedy_lr_args(self):
# test passed arguments for a custom GreedyLR scheduler
from transformers.optimization import GreedyLR
train_dataset = RegressionDataset(length=64)
eval_dataset = RegressionDataset(length=64)
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
eval_strategy="epoch",
metric_for_best_model="eval_loss",
)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
lr_scheduler = GreedyLR(optimizer, factor=0.8, patience=5, cooldown=2)
trainer = Trainer(
model,
args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
optimizers=(optimizer, lr_scheduler),
)
trainer.train()
self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
self.assertEqual(trainer.lr_scheduler.factor, 0.8)
self.assertEqual(trainer.lr_scheduler.patience, 5)
self.assertEqual(trainer.lr_scheduler.cooldown, 2)
def test_greedy_lr(self):
# test the GreedyLR scheduler
from transformers.optimization import GreedyLR
class TrainerWithLRLogs(Trainer):
def log(self, logs):
if hasattr(self.lr_scheduler, "_last_lr"):
logs["learning_rate"] = self.lr_scheduler._last_lr[0]
super().log(logs)
train_dataset = RegressionDataset(length=64)
eval_dataset = RegressionDataset(length=64)
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(
tmp_dir,
lr_scheduler_type="greedy",
lr_scheduler_kwargs={"patience": 1, "factor": 0.5},
eval_strategy="epoch",
metric_for_best_model="eval_loss",
num_train_epochs=10,
learning_rate=0.2,
)
model = RegressionModel()
trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
# Verify LR was adjusted at least once during training
logs = trainer.state.log_history[1:]
lr_values = [log["learning_rate"] for log in logs if "learning_rate" in log]
self.assertTrue(len(set(lr_values)) > 1, "GreedyLR should have adjusted the LR at least once")


@@ -0,0 +1,413 @@
# Copyright 2020 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
from pathlib import Path
from unittest.mock import patch
from transformers import (
AutoModelForSeq2SeqLM,
BertConfig,
BertTokenizer,
DataCollatorForSeq2Seq,
EncoderDecoderModel,
GenerationConfig,
Seq2SeqTrainer,
Seq2SeqTrainingArguments,
T5Tokenizer,
)
from transformers.testing_utils import (
ExtendSysPath,
TestCasePlus,
backend_device_count,
execute_subprocess_async,
get_torch_dist_unique_port,
require_bitsandbytes,
require_sentencepiece,
require_torch,
require_torch_multi_accelerator,
require_torch_non_multi_accelerator,
slow,
torch_device,
)
from transformers.trainer_callback import TrainerState
from transformers.trainer_utils import set_seed
from transformers.utils import is_datasets_available, is_torch_available
if is_datasets_available():
import datasets
if is_torch_available():
import torch
set_seed(42)
MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"
MBART_TINY = "sshleifer/tiny-mbart"
@require_sentencepiece
class Seq2seqTrainerTester(TestCasePlus):
@slow
@require_torch
def test_finetune_bert2bert(self):
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
"prajjwal1/bert-tiny",
"prajjwal1/bert-tiny",
encoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
decoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
dtype=torch.float32,
)
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size
tokenizer.eos_token_id = tokenizer.sep_token_id
bert2bert.generation_config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.generation_config.max_length = 128
train_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:1%]")
val_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="validation[:1%]")
train_dataset = train_dataset.select(range(32))
val_dataset = val_dataset.select(range(16))
batch_size = 4
def _map_to_encoder_decoder_inputs(batch):
# The BERT tokenizer automatically wraps the text as [CLS] <text> [SEP]
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
]
batch["decoder_attention_mask"] = outputs.attention_mask
assert all(len(x) == 512 for x in inputs.input_ids)
assert all(len(x) == 128 for x in outputs.input_ids)
return batch
def _compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# Replace -100 (ignore index) with pad_token_id before decoding
import numpy as np
labels_ids = np.where(labels_ids == -100, tokenizer.pad_token_id, labels_ids)
# all unnecessary tokens are removed
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
accuracy = sum(int(pred_str[i] == label_str[i]) for i in range(len(pred_str))) / len(pred_str)
return {"accuracy": accuracy}
# map train dataset
train_dataset = train_dataset.map(
_map_to_encoder_decoder_inputs,
batched=True,
batch_size=batch_size,
remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch",
columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
_map_to_encoder_decoder_inputs,
batched=True,
batch_size=batch_size,
remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch",
columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
output_dir = self.get_auto_remove_tmp_dir()
training_args = Seq2SeqTrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_with_generate=True,
eval_strategy="steps",
do_train=True,
do_eval=True,
warmup_steps=0,
eval_steps=2,
logging_steps=2,
)
# instantiate trainer
trainer = Seq2SeqTrainer(
model=bert2bert,
args=training_args,
compute_metrics=_compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
processing_class=tokenizer,
)
# start training
trainer.train()
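# Sketch (not part of the test suite): the label handling used by
# `_map_to_encoder_decoder_inputs` and `_compute_metrics` above, reduced to plain lists.
# Padding positions are replaced with -100 so the loss ignores them, then mapped back to the
# pad id before decoding. The pad id 0 and the helper names are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by the loss, as in the test above
PAD_TOKEN_ID = 0     # hypothetical pad id for this example


def mask_padding(labels, pad_token_id=PAD_TOKEN_ID):
    """Replace pad tokens with IGNORE_INDEX so they do not contribute to the loss."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in seq] for seq in labels]


def unmask_padding(labels, pad_token_id=PAD_TOKEN_ID):
    """Map IGNORE_INDEX back to the pad id so the sequences can be decoded."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in labels]


def exact_match_accuracy(predictions, references):
    """Fraction of decoded predictions that equal their reference string exactly."""
    matches = sum(int(p == r) for p, r in zip(predictions, references))
    return matches / len(predictions)


labels = [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
masked = mask_padding(labels)
restored = unmask_padding(masked)
accuracy = exact_match_accuracy(["a cat", "a dog"], ["a cat", "a bird"])
print(masked, restored, accuracy)
```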
@slow
@require_torch
def test_return_sequences(self):
# Tests that the number of generated sequences is correct when num_return_sequences > 1
# and essentially ensuring that `accelerator.gather()` is used instead of `gather_for_metrics`
INPUT_COLUMN = "question"
TARGET_COLUMN = "answer"
MAX_INPUT_LENGTH = 256
MAX_TARGET_LENGTH = 256
dataset = datasets.load_dataset("openai/gsm8k", "main", split="train[:38]")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
gen_config = GenerationConfig.from_pretrained(
"google-t5/t5-small", max_length=None, min_length=None, max_new_tokens=256, min_new_tokens=1, num_beams=5
)
training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
data_collator=data_collator,
compute_metrics=lambda x: {"samples": x[0].shape[0]},
)
def prepare_data(examples):
# Tokenize inputs and targets; the target token ids become the labels
inputs = examples[INPUT_COLUMN]
targets = examples[TARGET_COLUMN]
model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
prepared_dataset = dataset.map(prepare_data, batched=True, remove_columns=[INPUT_COLUMN, TARGET_COLUMN])
dataset_len = len(prepared_dataset) # 38
for num_return_sequences in range(3, 0, -1):
gen_config.num_return_sequences = num_return_sequences
metrics = trainer.evaluate(eval_dataset=prepared_dataset, generation_config=gen_config)
assert metrics["eval_samples"] == dataset_len * num_return_sequences, (
f"Got {metrics['eval_samples']}, expected: {dataset_len * num_return_sequences}"
)
@require_torch
def test_bad_generation_config_fail_early(self):
# Tests that a bad generation config causes the trainer to fail early
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
gen_config = GenerationConfig(do_sample=False, top_p=0.9) # bad: top_p is not compatible with do_sample=False
training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True, generation_config=gen_config)
with self.assertRaises(ValueError) as exc:
_ = Seq2SeqTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
data_collator=data_collator,
compute_metrics=lambda x: {"samples": x[0].shape[0]},
)
self.assertIn("Fix these issues to train your model", str(exc.exception))
@require_torch
class TestTranslationExample(TestCasePlus):
"""Tests for the run_translation.py example script (seq2seq training via CLI)."""
@classmethod
def setUpClass(cls):
super().setUpClass()
examples_dir = Path(__file__).resolve().parents[2] / "examples" / "pytorch" / "translation"
with ExtendSysPath(str(examples_dir)):
from run_translation import main as _main
cls._run_translation_main = staticmethod(_main)
def _run_translation(
self,
distributed=False,
extra_args_str=None,
predict_with_generate=True,
do_train=True,
do_eval=True,
do_predict=True,
n_gpus_to_use=None,
):
data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
output_dir = self.get_auto_remove_tmp_dir()
args = f"""
--model_name_or_path {MBART_TINY}
--train_file {data_dir}/train.json
--validation_file {data_dir}/val.json
--test_file {data_dir}/test.json
--output_dir {output_dir}
--max_train_samples 8
--max_source_length 12
--max_target_length 12
--do_train
--num_train_epochs 1
--per_device_train_batch_size 4
--learning_rate 3e-3
--warmup_steps 8
--logging_steps 0
--logging_strategy no
--save_steps 1
--train_sampling_strategy group_by_length
--label_smoothing_factor 0.1
--target_lang ro_RO
--source_lang en_XX
--report_to none
""".split()
if do_eval:
args += """
--do_eval
--per_device_eval_batch_size 4
--max_eval_samples 8
--val_max_target_length 12
--eval_strategy steps
--eval_steps 1
""".split()
if do_predict:
args += ["--do_predict"]
if predict_with_generate:
args += ["--predict_with_generate"]
if do_train:
args += ["--optim", "adafactor"]
if extra_args_str is not None:
args += extra_args_str.split()
if distributed:
if n_gpus_to_use is None:
n_gpus_to_use = backend_device_count(torch_device)
master_port = get_torch_dist_unique_port()
distributed_args = f"""
-m torch.distributed.run
--nproc_per_node={n_gpus_to_use}
--master_port={master_port}
{self.examples_dir_str}/pytorch/translation/run_translation.py
""".split()
cmd = [sys.executable] + distributed_args + args
execute_subprocess_async(cmd, env=self.get_env())
else:
testargs = ["run_translation.py"] + args
with patch.object(sys, "argv", testargs):
self._run_translation_main()
return output_dir
@require_torch_non_multi_accelerator
def test_run_seq2seq_no_dist(self):
output_dir = self._run_translation()
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
eval_metrics = [log for log in logs if "eval_loss" in log]
first_step_stats = eval_metrics[0]
assert "eval_bleu" in first_step_stats
@require_torch_multi_accelerator
def test_run_seq2seq_dp(self):
output_dir = self._run_translation(distributed=False)
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
eval_metrics = [log for log in logs if "eval_loss" in log]
first_step_stats = eval_metrics[0]
assert "eval_bleu" in first_step_stats
@require_torch_multi_accelerator
def test_run_seq2seq_ddp(self):
output_dir = self._run_translation(distributed=True)
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
eval_metrics = [log for log in logs if "eval_loss" in log]
first_step_stats = eval_metrics[0]
assert "eval_bleu" in first_step_stats
@slow
def test_run_seq2seq_slow(self):
output_dir = self._run_translation(
extra_args_str=f"--model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 10 --max_source_length 128 --max_target_length 128 --eval_steps 2 --save_steps 2",
)
logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
eval_metrics = [log for log in logs if "eval_loss" in log]
first_step_stats = eval_metrics[0]
last_step_stats = eval_metrics[-1]
assert first_step_stats["eval_loss"] > last_step_stats["eval_loss"], "model learned nothing"
assert isinstance(last_step_stats["eval_bleu"], float)
contents = {os.path.basename(p) for p in os.listdir(output_dir)}
assert "generated_predictions.txt" in contents
assert "predict_results.json" in contents
@slow
@require_bitsandbytes
def test_run_seq2seq_bnb(self):
from transformers.training_args import OptimizerNames
def train_and_return_metrics(optim: str) -> tuple[int, int, float]:
output_dir = self._run_translation(
distributed=True,
extra_args_str=f"--skip_memory_metrics 0 --model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 1 --optim {optim} --max_source_length 128 --max_target_length 128",
do_eval=False,
do_predict=False,
n_gpus_to_use=1,
)
logs = TrainerState.load_from_json(Path(output_dir, "trainer_state.json")).log_history
gpu_peak_mem_mb = int(logs[0]["train_mem_gpu_peaked_delta"] / 2**20)
gpu_alloc_mem_mb = int(logs[0]["train_mem_gpu_alloc_delta"] / 2**20)
loss = logs[0]["train_loss"]
return gpu_peak_mem_mb, gpu_alloc_mem_mb, loss
gpu_peak_mem_orig, gpu_alloc_mem_orig, loss_orig = train_and_return_metrics(OptimizerNames.ADAMW_TORCH.value)
gpu_peak_mem_bnb, gpu_alloc_mem_bnb, loss_bnb = train_and_return_metrics(OptimizerNames.ADAMW_BNB.value)
gpu_alloc_mem_diff = gpu_alloc_mem_orig - gpu_alloc_mem_bnb
gpu_total_mem_orig = gpu_peak_mem_orig + gpu_alloc_mem_orig
gpu_total_mem_bnb = gpu_peak_mem_bnb + gpu_alloc_mem_bnb
gpu_total_mem_diff = gpu_total_mem_orig - gpu_total_mem_bnb
expected_savings = 120
self.assertGreater(
gpu_alloc_mem_diff,
expected_savings,
f"should use ~150MB less alloc gpu memory with BNB, but got diff={gpu_alloc_mem_diff}MB",
)
self.assertGreater(
gpu_total_mem_diff,
expected_savings,
f"should use ~150MB less total gpu memory with BNB, but got diff={gpu_total_mem_diff}MB",
)
self.assertAlmostEqual(loss_orig, loss_bnb, 5, f"loss should be the same: {loss_orig} vs {loss_bnb}")


@@ -0,0 +1,406 @@
import dataclasses
import os
import tempfile
import unittest
from unittest.mock import patch
import torch
from transformers import TrainingArguments
from transformers.debug_utils import DebugOption
from transformers.trainer_utils import HubStrategy, IntervalStrategy, SaveStrategy, SchedulerType
from transformers.training_args import OptimizerNames
class TestTrainingArguments(unittest.TestCase):
def test_default_output_dir(self):
"""Test that output_dir defaults to 'trainer_output' when not specified."""
args = TrainingArguments(output_dir=None)
self.assertEqual(args.output_dir, "trainer_output")
def test_custom_output_dir(self):
"""Test that output_dir is respected when specified."""
with tempfile.TemporaryDirectory() as tmp_dir:
args = TrainingArguments(output_dir=tmp_dir)
self.assertEqual(args.output_dir, tmp_dir)
def test_output_dir_creation(self):
"""Test that output_dir is created only when needed."""
with tempfile.TemporaryDirectory() as tmp_dir:
output_dir = os.path.join(tmp_dir, "test_output")
# Directory should not exist before creating args
self.assertFalse(os.path.exists(output_dir))
# Create args with save_strategy="no" - should not create directory
args = TrainingArguments(
output_dir=output_dir,
do_train=True,
save_strategy="no",
report_to=None,
)
self.assertFalse(os.path.exists(output_dir))
# Now set save_strategy="steps" - should create directory when needed
args.save_strategy = "steps"
args.save_steps = 1
self.assertFalse(os.path.exists(output_dir)) # Still shouldn't exist
# Directory should be created when actually needed (e.g. in Trainer)
def test_torch_empty_cache_steps_requirements(self):
"""Test that torch_empty_cache_steps is a positive integer or None."""
# None is acceptable (feature is disabled):
args = TrainingArguments(torch_empty_cache_steps=None)
self.assertIsNone(args.torch_empty_cache_steps)
# non-int is unacceptable:
with self.assertRaises(ValueError):
TrainingArguments(torch_empty_cache_steps=1.0)
with self.assertRaises(ValueError):
TrainingArguments(torch_empty_cache_steps="none")
# negative int is unacceptable:
with self.assertRaises(ValueError):
TrainingArguments(torch_empty_cache_steps=-1)
# zero is unacceptable:
with self.assertRaises(ValueError):
TrainingArguments(torch_empty_cache_steps=0)
# positive int is acceptable:
args = TrainingArguments(torch_empty_cache_steps=1)
self.assertEqual(args.torch_empty_cache_steps, 1)
def test_output_dir_expands_user(self):
"""Test that ~ in output_dir is expanded to the user's home directory."""
args = TrainingArguments(output_dir="~/foo", report_to=None)
self.assertEqual(args.output_dir, os.path.expanduser("~/foo"))
def test_enum_coercions(self):
"""Test that string values are correctly converted to their enum types."""
args = TrainingArguments(
output_dir="tmp",
eval_strategy="steps",
eval_steps=10,
logging_strategy="steps",
save_strategy="epoch",
hub_strategy="end",
lr_scheduler_type="linear",
optim="adamw_torch",
report_to=None,
)
self.assertEqual(args.eval_strategy, IntervalStrategy.STEPS)
self.assertEqual(args.logging_strategy, IntervalStrategy.STEPS)
self.assertEqual(args.save_strategy, SaveStrategy.EPOCH)
self.assertEqual(args.hub_strategy, HubStrategy.END)
self.assertEqual(args.lr_scheduler_type, SchedulerType.LINEAR)
self.assertEqual(args.optim, OptimizerNames.ADAMW_TORCH)
# Invalid string should raise ValueError
with self.assertRaises(ValueError):
TrainingArguments(output_dir="tmp", eval_strategy="invalid_strategy", report_to=None)
def test_do_eval_auto_enabled(self):
"""Test that do_eval is automatically set to True when eval_strategy is not 'no'."""
args = TrainingArguments(
output_dir="tmp",
do_eval=False,
eval_strategy="steps",
eval_steps=10,
report_to=None,
)
self.assertTrue(args.do_eval)
def test_eval_steps_fallback_to_logging_steps(self):
"""Test that eval_steps falls back to logging_steps when not specified."""
args = TrainingArguments(
output_dir="tmp",
eval_strategy="steps",
logging_steps=10,
report_to=None,
)
self.assertEqual(args.eval_steps, 10)
def test_eval_steps_required_when_strategy_steps(self):
"""Test that eval_strategy='steps' with logging_steps=0 raises ValueError."""
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
eval_strategy="steps",
logging_steps=0,
report_to=None,
)
def test_logging_steps_required_nonzero(self):
"""Test that logging_strategy='steps' with logging_steps=0 raises ValueError."""
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
logging_strategy="steps",
logging_steps=0,
report_to=None,
)
def test_steps_must_be_integer_when_greater_than_one(self):
"""Test that fractional steps >1 raise ValueError, but <=1 are allowed."""
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
logging_strategy="steps",
logging_steps=10.5,
report_to=None,
)
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
eval_strategy="steps",
eval_steps=10.5,
report_to=None,
)
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
save_strategy="steps",
save_steps=10.5,
report_to=None,
)
# Fractional values <=1 (ratios) are allowed
args = TrainingArguments(
output_dir="tmp",
logging_strategy="steps",
logging_steps=0.5,
report_to=None,
)
self.assertEqual(args.logging_steps, 0.5)
def test_load_best_model_requires_matching_strategies(self):
"""Test load_best_model_at_end validation for strategy and step compatibility."""
# Mismatched eval/save strategy should raise
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="steps",
eval_steps=10,
save_strategy="epoch",
report_to=None,
)
# save_steps not a multiple of eval_steps should raise
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="steps",
eval_steps=10,
save_strategy="steps",
save_steps=15,
report_to=None,
)
# Valid: matching strategies with compatible steps should not raise
args = TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="steps",
eval_steps=10,
save_strategy="steps",
save_steps=20,
report_to=None,
)
self.assertTrue(args.load_best_model_at_end)
def test_metric_for_best_model_defaults(self):
"""Test default metric_for_best_model and greater_is_better behavior."""
# load_best_model_at_end with no metric → defaults to "loss"
args = TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="epoch",
save_strategy="epoch",
report_to=None,
)
self.assertEqual(args.metric_for_best_model, "loss")
self.assertFalse(args.greater_is_better)
# metric ending in "loss" → greater_is_better is False
args = TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="epoch",
save_strategy="epoch",
metric_for_best_model="eval_loss",
report_to=None,
)
self.assertFalse(args.greater_is_better)
# metric not ending in "loss" → greater_is_better is True
args = TrainingArguments(
output_dir="tmp",
load_best_model_at_end=True,
eval_strategy="epoch",
save_strategy="epoch",
metric_for_best_model="accuracy",
report_to=None,
)
self.assertTrue(args.greater_is_better)
def test_fp16_bf16_mutual_exclusivity(self):
"""Test that fp16 and bf16 cannot both be True."""
with self.assertRaises(ValueError):
TrainingArguments(output_dir="tmp", fp16=True, bf16=True, report_to=None)
with self.assertRaises(ValueError):
TrainingArguments(output_dir="tmp", fp16_full_eval=True, bf16_full_eval=True, report_to=None)
def test_reduce_on_plateau_requires_eval(self):
"""Test that reduce_lr_on_plateau scheduler requires an eval strategy."""
with self.assertRaises(ValueError):
TrainingArguments(
output_dir="tmp",
lr_scheduler_type="reduce_lr_on_plateau",
eval_strategy="no",
report_to=None,
)

    def test_torch_compile_auto_enable(self):
        """Test that torch_compile is auto-enabled when mode or backend is set."""
        args = TrainingArguments(
            output_dir="tmp",
            torch_compile_mode="reduce-overhead",
            report_to=None,
        )
        self.assertTrue(args.torch_compile)
        args = TrainingArguments(
            output_dir="tmp",
            torch_compile_backend="inductor",
            report_to=None,
        )
        self.assertTrue(args.torch_compile)
        # Default backend when torch_compile=True
        args = TrainingArguments(
            output_dir="tmp",
            torch_compile=True,
            report_to=None,
        )
        self.assertEqual(args.torch_compile_backend, "inductor")

    def test_report_to_none_handling(self):
        """Test report_to normalization for 'none' and string values."""
        args = TrainingArguments(output_dir="tmp", report_to="none")
        self.assertEqual(args.report_to, [])
        args = TrainingArguments(output_dir="tmp", report_to=["none"])
        self.assertEqual(args.report_to, [])
        args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
        self.assertEqual(args.report_to, ["tensorboard"])

    def test_kubeflow_auto_enable(self):
        """Test that kubeflow is auto-enabled when KUBEFLOW_TRAINER_SERVER_URL is set."""
        with patch.dict(os.environ, {"KUBEFLOW_TRAINER_SERVER_URL": "https://test-url"}, clear=False):
            # Should auto-add kubeflow when report_to is "none" (default)
            args = TrainingArguments(output_dir="tmp", report_to="none")
            self.assertIn("kubeflow", args.report_to)
            # Should auto-add kubeflow to existing list
            args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
            self.assertIn("kubeflow", args.report_to)
            self.assertIn("tensorboard", args.report_to)
            # Should not duplicate if already present
            args = TrainingArguments(output_dir="tmp", report_to=["kubeflow", "tensorboard"])
            self.assertEqual(args.report_to.count("kubeflow"), 1)
        # Should not add kubeflow when env var is not set
        with patch.dict(os.environ, {}, clear=True):
            args = TrainingArguments(output_dir="tmp", report_to="none")
            self.assertNotIn("kubeflow", args.report_to)

    def test_warmup_steps_validation(self):
        """Test warmup_steps validation for negative values."""
        with self.assertRaises(ValueError):
            TrainingArguments(output_dir="tmp", warmup_steps=-1, report_to=None)
        # Zero and fractional values are valid
        args = TrainingArguments(output_dir="tmp", warmup_steps=0, report_to=None)
        self.assertEqual(args.warmup_steps, 0)
        args = TrainingArguments(output_dir="tmp", warmup_steps=0.5, report_to=None)
        self.assertEqual(args.warmup_steps, 0.5)

    def test_debug_option_parsing(self):
        """Test debug string parsing into DebugOption enum list."""
        args = TrainingArguments(output_dir="tmp", debug="underflow_overflow", report_to=None)
        self.assertEqual(args.debug, [DebugOption.UNDERFLOW_OVERFLOW])
        args = TrainingArguments(output_dir="tmp", debug=None, report_to=None)
        self.assertEqual(args.debug, [])

    def test_dataloader_prefetch_requires_workers(self):
        """Test that dataloader_prefetch_factor requires num_workers > 0."""
        with self.assertRaises(ValueError):
            TrainingArguments(
                output_dir="tmp",
                dataloader_prefetch_factor=2,
                dataloader_num_workers=0,
                report_to=None,
            )
        # Valid: prefetch with workers > 0
        args = TrainingArguments(
            output_dir="tmp",
            dataloader_prefetch_factor=2,
            dataloader_num_workers=2,
            report_to=None,
        )
        self.assertEqual(args.dataloader_prefetch_factor, 2)

    def test_use_cpu_disables_pin_memory(self):
        """Test that use_cpu=True disables dataloader_pin_memory."""
        args = TrainingArguments(output_dir="tmp", use_cpu=True, report_to=None)
        self.assertFalse(args.dataloader_pin_memory)

    def test_include_num_input_tokens_seen_coercion(self):
        """Test bool-to-string coercion for include_num_input_tokens_seen."""
        args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=True, report_to=None)
        self.assertEqual(args.include_num_input_tokens_seen, "all")
        args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=False, report_to=None)
        self.assertEqual(args.include_num_input_tokens_seen, "no")

    def test_dict_field_parsing(self):
        """Test that JSON string dict fields are parsed into dicts."""
        args = TrainingArguments(output_dir="tmp", lr_scheduler_kwargs='{"factor": 0.5}', report_to=None)
        self.assertEqual(args.lr_scheduler_kwargs, {"factor": 0.5})

    def test_dtype_to_json(self):
        @dataclasses.dataclass
        class TorchDtypeTrainingArguments(TrainingArguments):
            dtype: torch.dtype = dataclasses.field(
                default=torch.float32,
            )

        for dtype in [
            "float32",
            "float64",
            "complex64",
            "complex128",
            "float16",
            "bfloat16",
            "uint8",
            "int8",
            "int16",
            "int32",
            "int64",
            "bool",
        ]:
            torch_dtype = getattr(torch, dtype)
            with tempfile.TemporaryDirectory() as tmp_dir:
                args = TorchDtypeTrainingArguments(output_dir=tmp_dir, dtype=torch_dtype)

                args_dict = args.to_dict()
                self.assertIn("dtype", args_dict)
                self.assertEqual(args_dict["dtype"], dtype)

@@ -0,0 +1,630 @@
# Copyright 2018 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared test infrastructure for the Trainer test suite."""

import dataclasses
import gc
import json
import os
import random

import numpy as np

from transformers import (
    AutoTokenizer,
    PreTrainedConfig,
    TrainerCallback,
    TrainingArguments,
    is_datasets_available,
    is_torch_available,
)
from transformers.testing_utils import (
    backend_empty_cache,
    backend_max_memory_allocated,
    backend_memory_allocated,
    backend_reset_max_memory_allocated,
    get_tests_dir,
    torch_device,
)
from transformers.utils import (
    SAFE_WEIGHTS_INDEX_NAME,
    SAFE_WEIGHTS_NAME,
    is_accelerate_available,
)


if torch_device == "hpu":
    RTOL = 1e-3
    ATOL = 1e-3
else:
    RTOL = 1e-5
    ATOL = 1e-5

if is_torch_available():
    import safetensors.torch
    import torch
    from torch import nn
    from torch.utils.data import IterableDataset

    from transformers import (
        AutoModelForCausalLM,
        PreTrainedModel,
        Trainer,
        TrainerState,
    )

if is_datasets_available():
    import datasets

# for version specific tests in TrainerIntegrationTest
if is_accelerate_available():
    pass


PATH_SAMPLE_TEXT = f"{get_tests_dir()}/fixtures/sample_text.txt"


def get_dataset(file_path, tokenizer, max_len):
    dataset = datasets.load_dataset("text", data_files=file_path)

    # Filter out empty lines
    dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

    # Define tokenization function
    def tokenize_function(examples):
        tokenized = tokenizer(examples["text"], add_special_tokens=True, truncation=True, max_length=max_len)
        # Add labels as a copy of input_ids
        tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

    # Apply tokenization and remove original text column
    tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

    return tokenized_dataset["train"]


class StoreLossCallback(TrainerCallback):
    """
    Simple callback to store the loss.
    """

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:
            self.losses.append(logs["loss"])


class MockCudaOOMCallback(TrainerCallback):
    """
    Simple callback to simulate a CUDA OOM error when
    the batch size is greater than or equal to `batch_size_limit`.
    """

    def __init__(self, batch_size_limit=16):
        self.batch_size_limit = batch_size_limit

    def on_step_end(self, args, state, control, **kwargs):
        # simulate OOM on the first step
        if state.train_batch_size >= self.batch_size_limit:
            raise RuntimeError("CUDA out of memory.")


class RegressionDataset:
    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
        np.random.seed(seed)
        self.label_names = ["labels"] if label_names is None else label_names
        self.length = length
        self.x = np.random.normal(size=(length,)).astype(np.float32)
        self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
        self.ys = [y.astype(np.float32) for y in self.ys]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
        result["input_x"] = self.x[i]
        return result


# Converting Bytes to Megabytes
def bytes2megabytes(x):
    return int(x / 2**20)


# Copied from accelerate: https://github.com/huggingface/accelerate/blob/ee163b66fb7848892519e804688cb4ae981aacbe/src/accelerate/test_utils/scripts/external_deps/test_peak_memory_usage.py#L40C1-L73C68
class TorchTracemalloc:
    def __enter__(self):
        gc.collect()
        if torch_device in ["cuda", "xpu"]:
            backend_empty_cache(torch_device)
            backend_reset_max_memory_allocated(torch_device)  # reset the peak gauge to zero
            self.begin = backend_memory_allocated(torch_device)
        else:
            self.begin = 0
        return self

    def __exit__(self, *exc):
        gc.collect()
        if torch_device in ["cuda", "xpu"]:
            backend_empty_cache(torch_device)
            self.end = backend_memory_allocated(torch_device)
            self.peak = backend_max_memory_allocated(torch_device)
        else:
            self.end = 0
            self.peak = 0
        self.used = bytes2megabytes(self.end - self.begin)
        self.peaked = bytes2megabytes(self.peak - self.begin)


@dataclasses.dataclass
class RegressionTrainingArguments(TrainingArguments):
    a: float = 0.0
    b: float = 0.0


class RepeatDataset:
    def __init__(self, x, length=64):
        self.x = x
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        return {"input_ids": self.x, "labels": self.x}


class SequenceClassificationDataset:
    def __init__(self, length=64, vocab_size=100, num_labels=5):
        self.length = length
        self.sequences = [torch.randint(0, vocab_size, (64,)).tolist() for _ in range(length)]
        self.labels = torch.randint(0, num_labels, (length,)).tolist()

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        return {"input_ids": self.sequences[i], "label": self.labels[i]}


class DynamicShapesDataset:
    def __init__(self, length=64, seed=42, batch_size=8):
        self.length = length
        np.random.seed(seed)
        sizes = np.random.randint(1, 20, (length // batch_size,))
        # For easy batching, we make every batch_size consecutive samples the same size.
        self.xs = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]
        self.ys = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        return {"input_x": self.xs[i], "labels": self.ys[i]}


class AlmostAccuracy:
    def __init__(self, thresh=0.25):
        self.thresh = thresh

    def __call__(self, eval_pred):
        predictions, labels = eval_pred
        true = np.abs(predictions - labels) <= self.thresh
        return {"accuracy": true.astype(np.float32).mean().item()}


class AlmostAccuracyBatched:
    def __init__(self, thresh=0.25):
        self.thresh = thresh
        self.batch_acc = []

    def __call__(self, eval_pred, compute_result):
        predictions, labels = eval_pred
        if isinstance(predictions, tuple):
            predictions = predictions[0]
        if isinstance(labels, tuple):
            labels = labels[0]
        batch_size = len(predictions)
        true = torch.abs(predictions - labels) <= self.thresh
        acc = true.type(torch.FloatTensor).mean().item()
        self.batch_acc.extend([acc] * batch_size)
        if compute_result:
            result = {"accuracy": np.mean(self.batch_acc).item()}
            self.batch_acc = []
            return result


class RegressionModelConfig(PreTrainedConfig):
    def __init__(self, a=0, b=0, double_output=False, random_torch=True, **kwargs):
        super().__init__(**kwargs)
        self.a = a
        self.b = b
        self.double_output = double_output
        self.random_torch = random_torch
        self.hidden_size = 1


if is_torch_available():

    class SampleIterableDataset(IterableDataset):
        def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
            self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)

        def __iter__(self):
            for i in range(len(self.dataset)):
                yield self.dataset[i]

    class FiniteIterableDataset(SampleIterableDataset):
        def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
            super().__init__(a, b, length, seed, label_names)
            self.current_sample = 0

        def __iter__(self):
            while self.current_sample < len(self.dataset):
                yield self.dataset[self.current_sample]
                self.current_sample += 1

    class MultiLoader:
        def __init__(self, loaders):
            self.loaders = loaders

        def __len__(self):
            return sum(len(loader) for loader in self.loaders)

        def __iter__(self):
            for loader in self.loaders:
                yield from loader

    class CustomDataloaderTrainer(Trainer):
        def get_train_dataloader(self):
            dataloaders = [super().get_train_dataloader(), super().get_train_dataloader()]
            return MultiLoader(dataloaders)

        def get_eval_dataloader(self, eval_dataset):
            dataloaders = [super().get_eval_dataloader(eval_dataset), super().get_eval_dataloader(eval_dataset)]
            return MultiLoader(dataloaders)

    class RegressionModel(nn.Module):
        def __init__(self, a=0, b=0, double_output=False):
            super().__init__()
            self.a = nn.Parameter(torch.tensor(a).float())
            self.b = nn.Parameter(torch.tensor(b).float())
            self.double_output = double_output
            self.config = None

        def forward(self, input_x, labels=None, **kwargs):
            y = input_x * self.a + self.b
            if labels is None:
                return (y, y) if self.double_output else (y,)
            loss = nn.functional.mse_loss(y, labels)
            return (loss, y, y) if self.double_output else (loss, y)

    class RegressionDictModel(nn.Module):
        def __init__(self, a=0, b=0):
            super().__init__()
            self.a = nn.Parameter(torch.tensor(a).float())
            self.b = nn.Parameter(torch.tensor(b).float())
            self.config = None

        def forward(self, input_x, labels=None, **kwargs):
            y = input_x * self.a + self.b
            result = {"output": y}
            if labels is not None:
                result["loss"] = nn.functional.mse_loss(y, labels)
            return result

    class RegressionPreTrainedModel(PreTrainedModel):
        config_class = RegressionModelConfig
        base_model_prefix = "regression"

        def __init__(self, config):
            super().__init__(config)
            self.a = nn.Parameter(torch.as_tensor(config.a).float())
            self.b = nn.Parameter(torch.as_tensor(config.b).float())
            self.double_output = config.double_output

            self.post_init()

        def forward(self, input_x, labels=None, **kwargs):
            y = input_x * self.a + self.b
            if labels is None:
                return (y, y) if self.double_output else (y,)
            loss = nn.functional.mse_loss(y, labels)
            return (loss, y, y) if self.double_output else (loss, y)

    class RegressionPreTrainedModelWithGradientCheckpointing(PreTrainedModel):
        config_class = RegressionModelConfig
        base_model_prefix = "regression"
        supports_gradient_checkpointing = True

        def __init__(self, config):
            super().__init__(config)
            self.layers = nn.ModuleList([nn.Linear(config.hidden_size, config.hidden_size) for _ in range(4)])
            self.head = nn.Linear(config.hidden_size, 1)
            self.gradient_checkpointing = False
            self.double_output = config.double_output

            self.post_init()

        def forward(self, input_x, labels=None, **kwargs):
            y = input_x.unsqueeze(0)

            for layer in self.layers:
                if self.training and self.gradient_checkpointing:
                    outputs = self._gradient_checkpointing_func(layer.__call__, y)
                else:
                    outputs = layer(y)

                y = outputs * 3

            logits = self.head(y)

            if labels is None:
                return (logits, logits) if self.double_output else (logits,)

            loss = nn.functional.mse_loss(logits, labels)

            return (loss, y, y) if self.double_output else (loss, y)

    class RegressionRandomPreTrainedModel(PreTrainedModel):
        config_class = RegressionModelConfig
        base_model_prefix = "regression"

        def __init__(self, config):
            super().__init__(config)
            self.a = nn.Parameter(torch.as_tensor(config.a).float())
            self.b = nn.Parameter(torch.as_tensor(config.b).float())
            self.random_torch = config.random_torch

            self.post_init()

        def forward(self, input_x, labels=None, **kwargs):
            y = input_x * self.a + self.b
            if self.random_torch:
                torch_rand = torch.randn(1).squeeze()
            np_rand = np.random.rand()
            rand_rand = random.random()

            if self.random_torch:
                y += 0.05 * torch_rand
            y += 0.05 * torch.tensor(np_rand + rand_rand)

            if labels is None:
                return (y,)
            loss = nn.functional.mse_loss(y, labels)
            return (loss, y)

    class BasicTextGenerationModel(nn.Module):
        def __init__(self, vocab_size, hidden_size):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, hidden_size)
            self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, input_ids, labels=None, **kwargs):
            embedded = self.embedding(input_ids)
            lstm_out, _ = self.lstm(embedded)
            logits = self.fc(lstm_out)

            if labels is None:
                return logits

            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            return loss, logits

    def create_dummy_dataset_for_text_generation(vocab_size, seq_length, num_samples):
        import numpy as np

        # Create random input sequences
        input_ids = np.random.randint(0, vocab_size, (num_samples, seq_length))

        # Create a datasets.Dataset
        dataset = datasets.Dataset.from_dict({"input_ids": input_ids, "labels": input_ids})

        return dataset

    class TstLayer(nn.Module):
        def __init__(self, hidden_size):
            super().__init__()
            self.linear1 = nn.Linear(hidden_size, hidden_size)
            self.ln1 = nn.LayerNorm(hidden_size)
            self.linear2 = nn.Linear(hidden_size, hidden_size)
            self.ln2 = nn.LayerNorm(hidden_size)
            self.bias = nn.Parameter(torch.zeros(hidden_size))

        def forward(self, x):
            h = self.ln1(nn.functional.relu(self.linear1(x)))
            h = nn.functional.relu(self.linear2(x))
            return self.ln2(x + h + self.bias)

    def get_regression_trainer(
        a=0,
        b=0,
        double_output=False,
        train_len=64,
        eval_len=64,
        pretrained=True,
        output_dir=None,
        **kwargs,
    ):
        label_names = kwargs.get("label_names")
        gradient_checkpointing = kwargs.get("gradient_checkpointing", False)
        train_dataset = RegressionDataset(length=train_len, label_names=label_names)
        eval_dataset = RegressionDataset(length=eval_len, label_names=label_names)

        model_init = kwargs.pop("model_init", None)
        if model_init is not None:
            model = None
        else:
            if pretrained:
                config = RegressionModelConfig(a=a, b=b, double_output=double_output)
                # We infer the correct model class if one uses gradient_checkpointing or not
                target_cls = (
                    RegressionPreTrainedModel
                    if not gradient_checkpointing
                    else RegressionPreTrainedModelWithGradientCheckpointing
                )
                model = target_cls(config)
            else:
                model = RegressionModel(a=a, b=b, double_output=double_output)

        compute_metrics = kwargs.pop("compute_metrics", None)
        data_collator = kwargs.pop("data_collator", None)
        optimizers = kwargs.pop("optimizers", (None, None))
        preprocess_logits_for_metrics = kwargs.pop("preprocess_logits_for_metrics", None)
        assert output_dir is not None, "output_dir should be specified for testing"
        args = RegressionTrainingArguments(output_dir, a=a, b=b, **kwargs)
        trainer = Trainer(
            model,
            args,
            data_collator=data_collator,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            optimizers=optimizers,
            model_init=model_init,
            preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        )
        # TODO: loss function defined in RegressionModel doesn't accept num_item_per_batch, to fix later
        trainer.model_accepts_loss_kwargs = False
        return trainer

    def get_language_model_trainer(**kwargs):
        dataset = datasets.load_dataset("fka/awesome-chatgpt-prompts")
        model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
        tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
        tokenizer.pad_token = tokenizer.eos_token

        def _tokenize_function(examples):
            model_inputs = tokenizer(examples["prompt"], padding="max_length", truncation=True)
            model_inputs["labels"] = np.array(model_inputs["input_ids"]).astype(np.int64)
            return model_inputs

        tokenized_datasets = dataset.map(_tokenize_function, batched=True)
        training_args = TrainingArguments(**kwargs)
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_datasets["train"],
        )
        return trainer


class TrainerIntegrationCommon:
    def check_saved_checkpoints(self, output_dir, freq, total, is_pretrained=True, use_scaler=False):
        weights_file = SAFE_WEIGHTS_NAME
        file_list = [weights_file, "training_args.bin", "optimizer.pt", "scheduler.pt", "trainer_state.json"]
        if is_pretrained:
            file_list.append("config.json")
        if use_scaler:
            file_list.append("scaler.pt")
        for step in range(freq, total, freq):
            checkpoint = os.path.join(output_dir, f"checkpoint-{step}")
            self.assertTrue(os.path.isdir(checkpoint))
            for filename in file_list:
                self.assertTrue(os.path.isfile(os.path.join(checkpoint, filename)))

    def check_best_model_has_been_loaded(
        self,
        output_dir,
        freq,
        total,
        trainer,
        metric,
        greater_is_better=False,
        is_pretrained=True,
    ):
        # Get log history from the final checkpoint (could be at total if not divisible by freq)
        final_checkpoint_step = total if total % freq != 0 else (total // freq) * freq
        checkpoint = os.path.join(output_dir, f"checkpoint-{final_checkpoint_step}")
        log_history = TrainerState.load_from_json(os.path.join(checkpoint, "trainer_state.json")).log_history

        values = [d[metric] for d in log_history if metric in d]
        best_value = max(values) if greater_is_better else min(values)
        best_idx = values.index(best_value)

        # Determine which checkpoint corresponds to the best metric
        # Evals happen at freq intervals, plus potentially at the final step
        eval_steps = list(range(freq, total + 1, freq))
        if total % freq != 0:
            eval_steps.append(total)
        best_checkpoint = eval_steps[best_idx]
        checkpoint = os.path.join(output_dir, f"checkpoint-{best_checkpoint}")
        if is_pretrained:
            best_model = RegressionPreTrainedModel.from_pretrained(checkpoint)
            best_model.to(trainer.args.device)
        else:
            best_model = RegressionModel()
            state_dict = safetensors.torch.load_file(os.path.join(checkpoint, SAFE_WEIGHTS_NAME))
            best_model.load_state_dict(state_dict)
            best_model.to(trainer.args.device)
        torch.testing.assert_close(best_model.a, trainer.model.a)
        torch.testing.assert_close(best_model.b, trainer.model.b)

        metrics = trainer.evaluate()
        self.assertEqual(metrics[metric], best_value)

    def remove_nan_logs(self, log):
        for key in list(log.keys()):
            if log[key] != log[key]:  # Check if the value is NaN
                del log[key]

    def check_trainer_state_are_the_same(self, trainer_state, trainer_state1):
        # We'll pop things so operate on copies.
        state = trainer_state.copy()
        state1 = trainer_state1.copy()
        # Log history may contain different logs for the time metrics (after resuming a training).
        log_history = state.pop("log_history", None)
        log_history1 = state1.pop("log_history", None)
        self.assertEqual(state, state1)
        skip_log_keys = ["train_runtime", "train_samples_per_second", "train_steps_per_second", "train_loss"]
        for log, log1 in zip(log_history, log_history1):
            for key in skip_log_keys:
                _ = log.pop(key, None)
                _ = log1.pop(key, None)

            self.remove_nan_logs(log)
            self.remove_nan_logs(log1)

            self.assertEqual(log, log1)

    def convert_to_sharded_checkpoint(self, folder):
        # Converts a checkpoint of a regression model to a sharded checkpoint.
        loader = safetensors.torch.load_file
        weights_file = os.path.join(folder, SAFE_WEIGHTS_NAME)
        extension = "safetensors"
        saver = safetensors.torch.save_file
        index_file = os.path.join(folder, SAFE_WEIGHTS_INDEX_NAME)
        shard_name = SAFE_WEIGHTS_NAME

        state_dict = loader(weights_file)

        os.remove(weights_file)
        keys = list(state_dict.keys())

        shard_files = [
            shard_name.replace(f".{extension}", f"-{idx + 1:05d}-of-{len(keys):05d}.{extension}")
            for idx in range(len(keys))
        ]
        index = {"metadata": {}, "weight_map": {key: shard_files[i] for i, key in enumerate(keys)}}

        with open(index_file, "w", encoding="utf-8") as f:
            content = json.dumps(index, indent=2, sort_keys=True) + "\n"
            f.write(content)

        for param_name, shard_file in zip(keys, shard_files):
            saver({param_name: state_dict[param_name]}, os.path.join(folder, shard_file))