first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/tests/trainer/TESTING_GUIDE.md
+++ b/tests/trainer/TESTING_GUIDE.md
@@ -0,0 +1,122 @@
+# Trainer Testing Guide
+
+## Test files
+
+| File | What it covers |
+|---|---|
+| `test_trainer.py` | Core: mixed precision, grad accumulation, logging, metrics, early stopping |
+| `test_trainer_checkpointing.py` | Checkpoint save/resume, interrupted training, frozen params |
+| `test_trainer_data.py` | Collators, dynamic shapes, iterable datasets, label smoothing |
+| `test_trainer_optimizers.py` | Optimizers & LR schedulers |
+| `test_trainer_seq2seq.py` | Encoder-decoder fine-tuning |
+| `trainer_test_utils.py` | Shared utilities (models, datasets, helpers) — not a test file |
+| `distributed/` | DDP, FSDP, DeepSpeed (see [below](#distributed-tests)) |
+
+## Running tests
+
+Always use `RUN_SLOW=1` — most trainer tests are `@slow` and will be skipped without it.
+
+### Debugging workflow
+
+**Never run the full suite until the specific failing test passes.** Work from smallest scope outward:
+
+1. **Single GPU** — fastest feedback:
+   ```bash
+   CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
+   ```
+2. **Fix and re-run** that same test until it passes.
+3. **2 GPUs** — catch DataParallel issues:
+   ```bash
+   CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class::test_name -xvs
+   ```
+4. **Full test class** — check for regressions:
+   ```bash
+   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py::Class -xvs
+   ```
+5. **All tests in that file — only at the very end**:
+   ```bash
+   RUN_SLOW=1 python -m pytest tests/trainer/test_trainer.py -v --tb=line
+   ```
+
+Same for distributed tests — single failing test first, fix, confirm, then widen scope.
+
+**Tip**: a `-k` filter matches test names in every collected file, so it can select tests you did not intend. Use full node IDs instead: `pytest file::Class::test`.
+
+## Writing tests
+
+**`get_regression_trainer()`** is the fastest way to get a working Trainer. Pass any `TrainingArguments` kwarg directly. It uses `RegressionModel` and `RegressionDataset`, which train in milliseconds.
+
+For LLM tests, use tiny Hub models: `AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")`.
+
+Use `max_steps=10` instead of `num_train_epochs=3` when you just need training to run.
+
+### Multi-GPU safety
+
+The Trainer uses `nn.DataParallel` when `n_gpu > 1`:
+
+- `train_batch_size = per_device_train_batch_size * n_gpu` — don't hardcode batch sizes in assertions.
+- Compute steps dynamically: `math.ceil(num_samples / (batch_size * grad_accum))`.
+- Use 100+ samples — small datasets can leave zero resume steps on multi-GPU.
+- DataParallel gather introduces ~1e-8 FP differences — use `places=6` for loss assertions.
+- If a test model has `**kwargs` but ignores `num_items_in_batch`, set `model.accepts_loss_kwargs = False`.
+
+### Decorators
+
+`@parameterized.expand` must be **outermost** (top), above `@require_*`.
+
+---
+
+## Distributed tests
+
+### Directory layout
+
+```
+distributed/
+  test_trainer_distributed.py           # Base: path constants, TrainerDistributedCommon ABC
+  test_trainer_distributed_ddp.py       # DDP tests
+  test_trainer_distributed_fsdp.py      # FSDP tests (config parsing + distributed)
+  test_trainer_distributed_deepspeed.py # DeepSpeed tests (single-GPU + distributed)
+  accelerate_configs/                   # YAML configs for `accelerate launch`
+  scripts/                              # Scripts launched as subprocesses
+    train.py                            # Main training script (synthetic data, tiny Qwen2)
+    torchrun_env_check.py               # Dumps distributed env info to JSON per rank
+    ds_config_zero2.json, ds_config_zero3.json
+```
+
+### Architecture
+
+Each framework has three pieces:
+
+1. **`{Framework}CommandsMixin`** — `get_torchrun_cmd()` and `get_accelerate_cmd()`.
+2. **`TestTrainerDistributed{Framework}`** — framework-specific tests (env parity, etc.). NOT `@slow`.
+3. **`TestTrainerDistributed{Framework}Common`** — inherits `TrainerDistributedCommon` for shared scenarios. `@slow`.
+
+MRO: `class Foo(Mixin, TrainerDistributedCommon, TestCasePlus)` — Mixin before ABC.
+
+`TrainerDistributedCommon` provides: `check_training`, `check_mixed_precision`, `check_gradient_accumulation`, `check_resume`, `check_eval`. Subclasses call these with `config_file=...`.
+
+### Env parity tests
+
+Both the torchrun side and the accelerate side must enable the same framework, or the parity comparison is meaningless:
+
+- **DDP**: no extra args (both `DistributedType.MULTI_GPU`)
+- **FSDP**: `--fsdp full_shard --fsdp_config '{"fsdp_version": 1}'` (JSON string, no file)
+- **DeepSpeed**: `--deepspeed path/to/ds_config_zero2.json`
+
+`torchrun_env_check.py` uses `HfArgumentParser(TrainingArguments)` — accepts any TrainingArguments flag.
+
+### Adding a distributed test
+
+1. Shared scenario → add `check_*` to `TrainerDistributedCommon`, wire from each Common class.
+2. Framework-specific → add to `TestTrainerDistributed{Framework}`.
+3. New scripts → `distributed/scripts/`, reference via `SCRIPTS_DIR`.
+
+### Pitfalls
+
+- `str(args.parallel_mode)` → `"ParallelMode.DISTRIBUTED"`, not `"DISTRIBUTED"`.
+- FSDP `cpu_offload` is not JSON-serializable — use `str()`.
+- `train.py` sets `do_train=True` only when neither `--do_train` nor `--do_eval` is given. Pass `--do_eval` explicitly for eval; it is also enabled automatically when `--eval_output_file` is passed.
+- DeepSpeed eval only works with ZeRO-3.
+- `--fsdp_config` accepts a file path OR JSON string starting with `{`. Same for `--deepspeed`, `--accelerator_config`.
+- `args.local_rank` may still be -1 before the framework consumes it; use `assertIn(val, (rank, -1))`.
+- `@parameterized.expand` + ABC: can't use `@abstractmethod` on methods that subclasses decorate with expand.
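Note on the "Multi-GPU safety" section of `TESTING_GUIDE.md` above: the advice to compute step counts dynamically can be sketched as follows. This is a minimal illustration of the formula quoted in the guide, not the Trainer's own code; the helper name `expected_steps_per_epoch` is hypothetical.

```python
import math


def expected_steps_per_epoch(num_samples, per_device_batch_size, n_gpu, grad_accum):
    """Optimizer steps per epoch, following the formula quoted in the guide."""
    # Under nn.DataParallel one process feeds every visible GPU, so the
    # effective batch size scales with n_gpu (at least 1 for CPU-only runs).
    train_batch_size = per_device_batch_size * max(1, n_gpu)
    return math.ceil(num_samples / (train_batch_size * grad_accum))


# The same test data yields different step counts on 1 and 2 GPUs.
for n_gpu in (1, 2):
    print(n_gpu, expected_steps_per_epoch(100, 8, n_gpu, grad_accum=1))
```

A test would then compare `trainer.state.global_step` against the helper's result instead of a hardcoded literal, so the assertion holds whether one or two GPUs are visible.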
--- a/tests/trainer/__init__.py
+++ b/tests/trainer/__init__.py
--- a/tests/trainer/distributed/__init__.py
+++ b/tests/trainer/distributed/__init__.py
--- a/tests/trainer/distributed/accelerate_configs/ddp.yaml
+++ b/tests/trainer/distributed/accelerate_configs/ddp.yaml
@@ -0,0 +1,3 @@
+distributed_type: MULTI_GPU
+num_machines: 1
+num_processes: 2
--- a/tests/trainer/distributed/accelerate_configs/deepspeed_zero2.yaml
+++ b/tests/trainer/distributed/accelerate_configs/deepspeed_zero2.yaml
@@ -0,0 +1,4 @@
+distributed_type: DEEPSPEED
+deepspeed_config:
+  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
+num_processes: 2
--- a/tests/trainer/distributed/accelerate_configs/deepspeed_zero2_sp.yaml
+++ b/tests/trainer/distributed/accelerate_configs/deepspeed_zero2_sp.yaml
@@ -0,0 +1,9 @@
+distributed_type: DEEPSPEED
+deepspeed_config:
+  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero2.json
+num_processes: 2
+parallelism_config:
+  parallelism_config_sp_size: 2
+  parallelism_config_sp_backend: deepspeed
+  parallelism_config_sp_seq_length_is_variable: true
+  parallelism_config_sp_attn_implementation: sdpa
--- a/tests/trainer/distributed/accelerate_configs/deepspeed_zero3.yaml
+++ b/tests/trainer/distributed/accelerate_configs/deepspeed_zero3.yaml
@@ -0,0 +1,4 @@
+distributed_type: DEEPSPEED
+deepspeed_config:
+  deepspeed_config_file: tests/trainer/distributed/scripts/ds_config_zero3.json
+num_processes: 2
--- a/tests/trainer/distributed/accelerate_configs/fsdp.yaml
+++ b/tests/trainer/distributed/accelerate_configs/fsdp.yaml
@@ -0,0 +1,4 @@
+distributed_type: FSDP
+fsdp_config:
+  fsdp_version: 1
+num_processes: 2
--- a/tests/trainer/distributed/accelerate_configs/fsdp2.yaml
+++ b/tests/trainer/distributed/accelerate_configs/fsdp2.yaml
@@ -0,0 +1,4 @@
+distributed_type: FSDP
+fsdp_config:
+  fsdp_version: 2
+num_processes: 2
--- a/tests/trainer/distributed/accelerate_configs/fsdp2_cp.yaml
+++ b/tests/trainer/distributed/accelerate_configs/fsdp2_cp.yaml
@@ -0,0 +1,10 @@
+distributed_type: FSDP
+fsdp_config:
+  fsdp_version: 2
+num_processes: 2
+parallelism_config:
+  parallelism_config_dp_replicate_size: 1
+  parallelism_config_dp_shard_size: 1
+  parallelism_config_tp_size: 1
+  parallelism_config_cp_size: 2
+  parallelism_config_cp_comm_strategy: alltoall
--- a/tests/trainer/distributed/scripts/dispatch_batches.py
+++ b/tests/trainer/distributed/scripts/dispatch_batches.py
@@ -0,0 +1,88 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Worker script for dispatch_batches=False with a finite iterable dataset.
+
+Verifies that training completes successfully when ``dispatch_batches``
+is disabled.
+
+Run via torchrun or accelerate launch.
+"""
+
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.utils.data import IterableDataset
+
+from transformers import HfArgumentParser, Trainer, TrainingArguments
+
+
+class RegressionModel(nn.Module):
+    def __init__(self, a=0, b=0):
+        super().__init__()
+        self.a = nn.Parameter(torch.tensor(a).float())
+        self.b = nn.Parameter(torch.tensor(b).float())
+        self.config = None
+
+    def forward(self, input_x, labels=None, **kwargs):
+        y = input_x * self.a + self.b
+        if labels is None:
+            return (y,)
+        loss = nn.functional.mse_loss(y, labels)
+        return (loss, y)
+
+
+class RegressionDataset:
+    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
+        np.random.seed(seed)
+        self.label_names = ["labels"] if label_names is None else label_names
+        self.length = length
+        self.x = np.random.normal(size=(length,)).astype(np.float32)
+        self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
+        self.ys = [y.astype(np.float32) for y in self.ys]
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i):
+        result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
+        result["input_x"] = self.x[i]
+        return result
+
+
+class FiniteIterableDataset(IterableDataset):
+    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
+        self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)
+        self.current_sample = 0
+
+    def __iter__(self):
+        while self.current_sample < len(self.dataset):
+            yield self.dataset[self.current_sample]
+            self.current_sample += 1
+
+
+if __name__ == "__main__":
+    parser = HfArgumentParser((TrainingArguments,))
+    training_args = parser.parse_args_into_dataclasses()[0]
+
+    training_args.per_device_train_batch_size = 1
+    training_args.max_steps = 1
+    training_args.accelerator_config.dispatch_batches = False
+
+    train_dataset = FiniteIterableDataset(label_names=["labels", "extra"], length=1)
+    model = RegressionModel()
+
+    trainer = Trainer(model, training_args, train_dataset=train_dataset)
+    trainer.train()
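Note on `FiniteIterableDataset` in `dispatch_batches.py` above: because `current_sample` is stored on the instance and never reset, the dataset appears to be exhausted after one pass. The torch-free sketch below mirrors that cursor logic with a hypothetical `OneShot` class (not part of the commit) to make the behavior visible.

```python
class OneShot:
    """Stand-in for FiniteIterableDataset: the cursor lives on the instance."""

    def __init__(self, items):
        self.items = list(items)
        self.current_sample = 0

    def __iter__(self):
        # Same loop shape as the worker script: yield, then advance the cursor.
        while self.current_sample < len(self.items):
            yield self.items[self.current_sample]
            self.current_sample += 1


ds = OneShot([10, 20, 30])
first_pass = list(ds)   # consumes every item
second_pass = list(ds)  # cursor is already at the end, so nothing is yielded
print(first_pass, second_pass)
```

For the script's single step on a length-1 dataset this is enough; a multi-epoch test would presumably need the cursor reset at the start of `__iter__`.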
--- a/tests/trainer/distributed/scripts/ds_config_zero2.json
+++ b/tests/trainer/distributed/scripts/ds_config_zero2.json
@@ -0,0 +1,32 @@
+{
+    "fp16": {
+        "enabled": "auto"
+    },
+    "bf16": {
+        "enabled": "auto"
+    },
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+    "zero_optimization": {
+        "stage": 2
+    },
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto"
+}
--- a/tests/trainer/distributed/scripts/ds_config_zero3.json
+++ b/tests/trainer/distributed/scripts/ds_config_zero3.json
@@ -0,0 +1,35 @@
+{
+    "fp16": {
+        "enabled": "auto"
+    },
+    "bf16": {
+        "enabled": "auto"
+    },
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+    "zero_optimization": {
+        "stage": 3,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto"
+    },
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto"
+}
--- a/tests/trainer/distributed/scripts/eval_ddp.py
+++ b/tests/trainer/distributed/scripts/eval_ddp.py
@@ -0,0 +1,113 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Worker script for eval/predict ordering tests.
+
+Verifies that distributed eval/predict returns all samples in the correct order.
+
+Run via torchrun or accelerate launch.
+"""
+
+import torch
+import torch.nn as nn
+from torch.utils.data import Dataset
+
+from transformers import EvalPrediction, HfArgumentParser, Trainer, TrainingArguments
+from transformers.utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class DummyDataset(Dataset):
+    def __init__(self, length: int = 101):
+        self.length = length
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i) -> int:
+        return i
+
+
+class DummyDataCollator:
+    def __call__(self, features):
+        return {"input_ids": torch.tensor(features), "labels": torch.tensor(features)}
+
+
+class DummyModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # Add some (unused) params otherwise DDP will complain.
+        self.fc = nn.Linear(120, 80)
+
+    def forward(self, input_ids, labels=None):
+        if labels is not None:
+            return torch.tensor(0.0, device=input_ids.device), input_ids
+        else:
+            return input_ids
+
+
+if __name__ == "__main__":
+    parser = HfArgumentParser((TrainingArguments,))
+    training_args = parser.parse_args_into_dataclasses()[0]
+
+    for dataset_length in [49, 7]:
+        dataset = DummyDataset(dataset_length)
+
+        def compute_metrics(p: EvalPrediction) -> dict:
+            sequential = list(range(len(dataset)))
+            success = p.predictions.tolist() == sequential and p.label_ids.tolist() == sequential
+            if not success and training_args.local_process_index == 0:
+                logger.warning(
+                    "Predictions and/or labels do not match expected results:\n  - predictions: "
+                    f"{p.predictions.tolist()}\n  - labels: {p.label_ids.tolist()}\n  - expected: {sequential}"
+                )
+            return {"success": success}
+
+        trainer = Trainer(
+            model=DummyModel(),
+            args=training_args,
+            data_collator=DummyDataCollator(),
+            eval_dataset=dataset,
+            compute_metrics=compute_metrics,
+        )
+        metrics = trainer.evaluate()
+        logger.info(metrics)
+        if metrics["eval_success"] is not True:
+            logger.error(metrics)
+            exit(1)
+
+        p = trainer.predict(dataset)
+        logger.info(p.metrics)
+        if p.metrics["test_success"] is not True:
+            logger.error(p.metrics)
+            exit(1)
+
+        trainer.args.eval_accumulation_steps = 2
+
+        metrics = trainer.evaluate()
+        logger.info(metrics)
+        if metrics["eval_success"] is not True:
+            logger.error(metrics)
+            exit(1)
+
+        p = trainer.predict(dataset)
+        logger.info(p.metrics)
+        if p.metrics["test_success"] is not True:
+            logger.error(p.metrics)
+            exit(1)
+
+        trainer.args.eval_accumulation_steps = None
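Note on `eval_ddp.py` above: the dataset lengths 49 and 7 do not divide evenly across two processes, which is what makes the ordering check meaningful. The toy model below shows one way sharding with padding and a final truncation can return samples in order. It is an assumption-laden sketch with hypothetical helper names, not the sampler that Trainer or Accelerate actually uses.

```python
def shard_indices(num_samples, world_size, rank):
    """Contiguous shard for one rank, padded by wrapping around to the start."""
    per_rank = -(-num_samples // world_size)  # ceiling division
    padded = [i % num_samples for i in range(per_rank * world_size)]
    return padded[rank * per_rank : (rank + 1) * per_rank]


def gather_and_truncate(shards, num_samples):
    """Concatenate shards in rank order and drop the padding at the end."""
    flat = [i for shard in shards for i in shard]
    return flat[:num_samples]


for length in (49, 7):
    shards = [shard_indices(length, world_size=2, rank=r) for r in range(2)]
    gathered = gather_and_truncate(shards, length)
    print(length, gathered == list(range(length)))
```

If the truncation step were missing, the gathered list would contain one duplicated sample for each odd length, which is the kind of failure `compute_metrics` in the script is written to catch.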
--- a/tests/trainer/distributed/scripts/fsdp_generate.py
+++ b/tests/trainer/distributed/scripts/fsdp_generate.py
@@ -0,0 +1,125 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Worker script for FSDP generation tests.
+
+Launched via ``torchrun`` from ``test_trainer_distributed_fsdp.py``.
+"""
+
+import argparse
+import functools
+from collections.abc import Callable
+from typing import Any
+
+import torch
+import torch.distributed
+from torch.distributed._composable.fsdp import fully_shard, register_fsdp_forward_method
+from torch.distributed.device_mesh import init_device_mesh
+from torch.distributed.fsdp import FullyShardedDataParallel
+from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers.models.gpt2.modeling_gpt2 import GPT2Block
+from transformers.testing_utils import backend_device_count, backend_torch_accelerator_module, torch_device
+
+
+data = 4 * [
+    "Hello world!",
+    "The quick brown fox jumps over the lazy dog.",
+]
+
+
+def manage_process_group(func: Callable[..., Any]) -> Callable[..., Any]:
+    """Manage the creation and destruction of the distributed process group for the wrapped function."""
+
+    def wrapped(*args: Any, **kwargs: Any) -> Any:
+        device_count = backend_device_count(torch_device)
+        torch.distributed.init_process_group(world_size=device_count)
+        try:
+            return func(*args, **kwargs)
+        finally:
+            torch.distributed.destroy_process_group()
+
+    return wrapped
+
+
+@manage_process_group
+def fsdp_generate():
+    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
+    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))
+
+    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)
+
+    fsdp_model = FullyShardedDataParallel(
+        model,
+        auto_wrap_policy=functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}),
+        limit_all_gathers=True,
+        use_orig_params=True,
+    )
+
+    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
+    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)
+
+    with FullyShardedDataParallel.summon_full_params(fsdp_model):
+        _ = fsdp_model.module.generate(
+            input_ids=batch["input_ids"],
+            attention_mask=batch["attention_mask"],
+            max_length=30,
+        )
+
+
+@manage_process_group
+def fsdp2_generate():
+    torch_accelerator_module = backend_torch_accelerator_module(torch_device)
+    torch_accelerator_module.set_device(device := torch.device(rank := torch.distributed.get_rank()))
+
+    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2").to(device)
+
+    mesh = init_device_mesh(device.type, (torch.distributed.get_world_size(),))
+    for submodule in model.modules():
+        if isinstance(submodule, GPT2Block):
+            fully_shard(submodule, mesh=mesh)
+    fully_shard(model, mesh=mesh)
+
+    register_fsdp_forward_method(model, "generate")
+
+    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
+    batch = tokenizer(data[rank], return_tensors="pt", return_attention_mask=True).to(device)
+
+    _ = model.generate(
+        input_ids=batch["input_ids"],
+        attention_mask=batch["attention_mask"],
+        max_length=30,
+    )
+
+
+if __name__ == "__main__":
+
+    class CLIArgs(argparse.Namespace):
+        fsdp: bool
+        fsdp2: bool
+
+    parser = argparse.ArgumentParser()
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument("--fsdp", action="store_true")
+    group.add_argument("--fsdp2", action="store_true")
+    args = parser.parse_args(namespace=CLIArgs())
+
+    if args.fsdp:
+        fsdp_generate()
+    elif args.fsdp2:
+        fsdp2_generate()
+    else:
+        raise ValueError("Missing test selection")
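Note on `manage_process_group` in `fsdp_generate.py` above: the decorator pairs setup with a `try`/`finally` teardown so the process group is destroyed even when the wrapped function raises. The sketch below reproduces that pattern with hypothetical stand-in callbacks instead of `torch.distributed`, so it runs without any accelerator. Using `functools.wraps` here is an addition for illustration; the script itself does not use it.

```python
import functools

events = []


def managed(setup, teardown):
    """Decorator factory: run setup before and teardown after, even on error."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapped(*args, **kwargs):
            setup()
            try:
                return func(*args, **kwargs)
            finally:
                teardown()

        return wrapped

    return decorator


@managed(lambda: events.append("init"), lambda: events.append("destroy"))
def failing_worker():
    events.append("run")
    raise RuntimeError("boom")


try:
    failing_worker()
except RuntimeError:
    events.append("caught")

print(events)
```

The teardown entry lands before the caller's exception handler runs, which is the ordering that keeps a failed worker from leaving a process group behind.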
--- a/tests/trainer/distributed/scripts/loss_averaging.py
+++ b/tests/trainer/distributed/scripts/loss_averaging.py
@@ -0,0 +1,114 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Worker script for loss averaging tests.
+
+Verifies that ``average_tokens_across_devices`` produces correct loss
+compared to a single-GPU baseline.
+
+When ``--run_both_averaging_modes`` is passed, the script runs training
+twice (without, then with averaging) in a single process launch, saving
+``<output_dir>/broken_losses.json`` and ``<output_dir>/fixed_losses.json``.
+
+Run via torchrun or accelerate launch.
+"""
+
+import argparse
+import json
+
+import datasets
+import torch
+
+from transformers import (
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    DataCollatorForLanguageModeling,
+    HfArgumentParser,
+    Trainer,
+    TrainerCallback,
+    TrainingArguments,
+    set_seed,
+)
+
+
+class StoreLossCallback(TrainerCallback):
+    """Simple callback to store the loss."""
+
+    def __init__(self):
+        self.losses = []
+
+    def on_log(self, args, state, control, logs=None, **kwargs):
+        if logs and "loss" in logs:
+            self.losses.append(logs["loss"])
+
+
+def run_distributed_training(training_args, loss_file):
+    set_seed(42)
+    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+    dataset_name = "wikitext"
+    dataset_config = "wikitext-2-raw-v1"
+    dataset = datasets.load_dataset(dataset_name, dataset_config, split="train[:50]")
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    def tokenize_function(examples):
+        return tokenizer(examples["text"], max_length=128, padding="max_length", truncation=True)
+
+    tokenized_dataset = dataset.map(tokenize_function, batched=True)
+
+    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
+    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
+
+    loss_callback = StoreLossCallback()
+
+    training_args.logging_steps = 1
+    training_args.max_steps = 10
+    training_args.learning_rate = 3e-4
+    training_args.disable_tqdm = True
+    training_args.dataloader_drop_last = True
+
+    trainer = Trainer(
+        model,
+        training_args,
+        train_dataset=tokenized_dataset,
+        callbacks=[loss_callback],
+        data_collator=data_collator,
+    )
+    trainer.train()
+    with open(loss_file, "w") as f:
+        json.dump(loss_callback.losses, f)
+
+
+if __name__ == "__main__":
+    # Parse our custom flag first, pass the rest to HfArgumentParser.
+    pre_parser = argparse.ArgumentParser(add_help=False)
+    pre_parser.add_argument("--run_both_averaging_modes", action="store_true")
+    custom_args, remaining = pre_parser.parse_known_args()
+
+    hf_parser = HfArgumentParser((TrainingArguments,))
+    (training_args,) = hf_parser.parse_args_into_dataclasses(remaining)
+
+    if custom_args.run_both_averaging_modes:
+        base_dir = training_args.output_dir
+        # Run without averaging ("broken")
+        training_args.average_tokens_across_devices = False
+        training_args.output_dir = base_dir + "/broken"
+        run_distributed_training(training_args, loss_file=base_dir + "/broken_losses.json")
+        # Run with averaging ("fixed")
+        training_args.average_tokens_across_devices = True
+        training_args.output_dir = base_dir + "/fixed"
+        run_distributed_training(training_args, loss_file=base_dir + "/fixed_losses.json")
+    else:
+        run_distributed_training(training_args, loss_file=training_args.output_dir + "_losses.json")
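Note on `loss_averaging.py` above: the "broken" and "fixed" labels refer to how per-device losses are combined when devices see different numbers of non-padding tokens. The arithmetic below is a simplified illustration with made-up numbers and hypothetical helper names; it shows the general idea only and is not the Trainer's implementation of `average_tokens_across_devices`.

```python
def mean_of_device_means(loss_sums, token_counts):
    """Average each device's mean loss, ignoring how many tokens it saw."""
    per_device = [s / n for s, n in zip(loss_sums, token_counts)]
    return sum(per_device) / len(per_device)


def token_weighted_mean(loss_sums, token_counts):
    """Divide the total loss by the total token count across devices."""
    return sum(loss_sums) / sum(token_counts)


# Illustrative values: device 0 saw 10 tokens, device 1 saw only 2.
loss_sums = [10.0, 6.0]
token_counts = [10, 2]

print(mean_of_device_means(loss_sums, token_counts))
print(token_weighted_mean(loss_sums, token_counts))
```

The token-weighted value is what a single device would compute over the same 12 tokens, which is why the script compares against a single-GPU baseline.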
--- a/tests/trainer/distributed/scripts/torchrun_env_check.py
+++ b/tests/trainer/distributed/scripts/torchrun_env_check.py
@@ -0,0 +1,93 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Dumps distributed environment info to a JSON file for verification.
+
+This script creates a Trainer (which initializes the accelerator) and writes
+each worker's env vars, TrainingArguments fields, and accelerator state to
+``<output_dir>/env_rank<N>.json``.
+
+Accepts all TrainingArguments flags (e.g. ``--deepspeed``, ``--fsdp``) so the
+Trainer sets up the correct framework regardless of launcher.
+
+Works with any launcher (torchrun, accelerate launch with DDP/FSDP/DeepSpeed).
+"""
+
+import json
+import os
+
+from transformers import AutoModelForCausalLM, HfArgumentParser, Trainer, TrainingArguments
+
+
+def main():
+    parser = HfArgumentParser((TrainingArguments,))
+    (args,) = parser.parse_args_into_dataclasses()
+    args.disable_tqdm = True
+
+    model_name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+    model = AutoModelForCausalLM.from_pretrained(model_name)
+
+    trainer = Trainer(model=model, args=args)
+    accelerator = trainer.accelerator
+
+    env_info = {
+        # Raw env vars set by torchrun / accelerate
+        "env_world_size": os.environ.get("WORLD_SIZE"),
+        "env_rank": os.environ.get("RANK"),
+        "env_local_rank": os.environ.get("LOCAL_RANK"),
+        "env_master_addr": os.environ.get("MASTER_ADDR"),
+        "env_master_port": os.environ.get("MASTER_PORT"),
+        # TrainingArguments-derived values
+        "args_local_rank": args.local_rank,
+        "args_world_size": args.world_size,
+        "args_process_index": args.process_index,
+        "args_local_process_index": args.local_process_index,
+        "args_parallel_mode": str(args.parallel_mode),
+        "args_n_gpu": args.n_gpu,
+        # Accelerator state
+        "accelerator_num_processes": accelerator.num_processes,
+        "accelerator_process_index": accelerator.process_index,
+        "accelerator_local_process_index": accelerator.local_process_index,
+        "accelerator_is_main_process": accelerator.is_main_process,
+        "accelerator_is_local_main_process": accelerator.is_local_main_process,
+        "accelerator_use_distributed": accelerator.use_distributed,
+        "accelerator_distributed_type": str(accelerator.distributed_type),
+        "accelerator_device": str(accelerator.device),
+        # Trainer-level flags (these gate framework-specific code paths)
+        "trainer_is_fsdp_enabled": trainer.is_fsdp_enabled,
+        "trainer_is_deepspeed_enabled": trainer.is_deepspeed_enabled,
+    }
+
+    # FSDP plugin info
+    fsdp_plugin = getattr(accelerator.state, "fsdp_plugin", None)
+    if fsdp_plugin is not None:
+        env_info["fsdp_version"] = getattr(fsdp_plugin, "fsdp_version", None)
+        env_info["fsdp_sharding_strategy"] = str(getattr(fsdp_plugin, "sharding_strategy", None))
+        env_info["fsdp_cpu_offload"] = str(getattr(fsdp_plugin, "cpu_offload", None))
+        env_info["fsdp_auto_wrap_policy"] = str(getattr(fsdp_plugin, "auto_wrap_policy", None))
+
+    # DeepSpeed plugin info
+    deepspeed_plugin = getattr(accelerator.state, "deepspeed_plugin", None)
+    if deepspeed_plugin is not None:
+        env_info["deepspeed_zero_stage"] = deepspeed_plugin.zero_stage
+        env_info["deepspeed_offload_optimizer_device"] = str(deepspeed_plugin.offload_optimizer_device)
+        env_info["deepspeed_offload_param_device"] = str(deepspeed_plugin.offload_param_device)
+
+    output_file = os.path.join(args.output_dir, f"env_rank{args.process_index}.json")
+    with open(output_file, "w") as f:
+        json.dump(env_info, f)
+
+
+if __name__ == "__main__":
+    main()
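Note on `torchrun_env_check.py` above: a test consuming the per-rank dumps might look roughly like the sketch below. The helpers `load_env_dumps` and `check_rank_fields` are hypothetical, and the JSON written in the demo is placeholder data that only exercises them; real values come from the launcher.

```python
import json
import os
import tempfile


def load_env_dumps(output_dir, world_size):
    """Read env_rank<N>.json for every rank, in rank order."""
    dumps = []
    for rank in range(world_size):
        with open(os.path.join(output_dir, f"env_rank{rank}.json")) as f:
            dumps.append(json.load(f))
    return dumps


def check_rank_fields(dumps):
    """Per-rank sanity checks on a few fields written by the script."""
    for rank, info in enumerate(dumps):
        assert info["args_process_index"] == rank
        # local_rank may still be -1, as the guide's pitfalls section notes.
        assert info["args_local_rank"] in (rank, -1)
        # WORLD_SIZE is read from os.environ, so it is a string (or None).
        assert info["env_world_size"] == str(len(dumps))


# Placeholder two-rank dump, written only to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    for rank in range(2):
        fake = {"args_process_index": rank, "args_local_rank": -1, "env_world_size": "2"}
        with open(os.path.join(tmp, f"env_rank{rank}.json"), "w") as f:
            json.dump(fake, f)
    dumps = load_env_dumps(tmp, world_size=2)
    check_rank_fields(dumps)
```

Comparing the list returned for a torchrun launch with the one for an accelerate launch, field by field, is one way an env parity test could be structured.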
--- a/tests/trainer/distributed/scripts/train.py
+++ b/tests/trainer/distributed/scripts/train.py
@@ -0,0 +1,136 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Simple causal LM script for distributed tests (FSDP, DeepSpeed).
+
+Uses a tiny Qwen2 model with synthetic data so tests run fast
+and don't require downloading real datasets.
+
+Supports --do_train (default) and --do_eval via TrainingArguments.
+
+32 training samples are created; with per_device_train_batch_size=4
+and 2 GPUs this gives 4 steps per epoch.
+"""
+
+import json
+import sys
+
+import torch
+
+from transformers import (
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    DataCollatorForLanguageModeling,
+    HfArgumentParser,
+    Trainer,
+    TrainingArguments,
+)
+
+
+DTYPE_MAP = {"fp32": torch.float32, "bf16": torch.bfloat16, "fp16": torch.float16}
+
+
+def _pop_custom_arg(name):
+    """Pop a custom --name value arg from sys.argv before HfArgumentParser sees it."""
+    if name in sys.argv:
+        idx = sys.argv.index(name)
+        value = sys.argv[idx + 1]
+        sys.argv.pop(idx)
+        sys.argv.pop(idx)
+        return value
+    return None
+
+
+def main():
+    # Parse custom args (not TrainingArguments fields)
+    model_name = _pop_custom_arg("--model_name") or "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+    loss_output_file = _pop_custom_arg("--loss_output_file")
+    eval_output_file = _pop_custom_arg("--eval_output_file")
+    model_dtype = _pop_custom_arg("--model_dtype")
+    attn_impl = _pop_custom_arg("--attn_implementation")
+    pad_to_multiple_of = _pop_custom_arg("--pad_to_multiple_of")
+
+    parser = HfArgumentParser((TrainingArguments,))
+    (training_args,) = parser.parse_args_into_dataclasses()
+
+    # Default to training if neither --do_train nor --do_eval is set
+    if not training_args.do_train and not training_args.do_eval:
+        training_args.do_train = True
+
+    # Auto-enable eval when an eval output file is requested
+    if eval_output_file:
+        training_args.do_eval = True
+
+    torch_dtype = DTYPE_MAP[model_dtype] if model_dtype else None
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    model_kwargs = {}
+    if torch_dtype is not None:
+        model_kwargs["torch_dtype"] = torch_dtype
+    if attn_impl:
+        model_kwargs["attn_implementation"] = attn_impl
+    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
+    model.generation_config.pad_token_id = tokenizer.pad_token_id
+
+    # Synthetic dataset — 32 samples of tokenized text
+    # With per_device_train_batch_size=4 and 2 GPUs this gives 4 steps per epoch.
+    texts = [
+        "The quick brown fox jumps over the lazy dog. " * 5,
+        "A journey of a thousand miles begins with a single step. " * 5,
+        "To be or not to be, that is the question. " * 5,
+        "All that glitters is not gold, all that wanders is not lost. " * 5,
+    ] * 8
+
+    train_dataset = None
+    eval_dataset = None
+    if training_args.do_train:
+        train_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts]
+    if training_args.do_eval:
+        eval_dataset = [tokenizer(text, max_length=128, truncation=True, padding="max_length") for text in texts[:8]]
+
+    collator_kwargs = {}
+    if pad_to_multiple_of:
+        collator_kwargs["pad_to_multiple_of"] = int(pad_to_multiple_of)
+
+    training_args.disable_tqdm = True
+
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, **collator_kwargs),
+    )
+
+    if training_args.do_train:
+        trainer.train()
+
+    if training_args.do_eval:
+        eval_metrics = trainer.evaluate()
+        if eval_output_file and training_args.process_index == 0:
+            with open(eval_output_file, "w") as f:
+                json.dump(eval_metrics, f)
+
+    # Save per-step losses for equivalence testing
+    if training_args.do_train and loss_output_file and training_args.process_index == 0:
+        losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
+        with open(loss_output_file, "w") as f:
+            json.dump(losses, f)
+
+
+if __name__ == "__main__":
+    main()
--- a/tests/trainer/distributed/scripts/vit_feature_extractor.json
+++ b/tests/trainer/distributed/scripts/vit_feature_extractor.json
@@ -0,0 +1,4 @@
+{
+    "image_processor_type": "ViTImageProcessor",
+    "size": 30
+}
--- a/tests/trainer/distributed/scripts/worker_seed.py
+++ b/tests/trainer/distributed/scripts/worker_seed.py
@@ -0,0 +1,87 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Worker script for dataloader worker seed divergence tests.
+
+Verifies that dataloader workers get different random seeds across GPUs,
+so that each rank sees different random augmentations.
+
+Run via torchrun or accelerate launch.
+"""
+
+import random
+
+import numpy as np
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+from torch.utils.data import Dataset
+
+from transformers import HfArgumentParser, Trainer, TrainingArguments, set_seed
+from transformers.testing_utils import torch_device
+
+
+def gather_from_all_gpus(tensor, world_size):
+    gather_list = [torch.zeros_like(tensor) for _ in range(world_size)]
+    dist.all_gather(gather_list, tensor)
+    return gather_list
+
+
+class DummyDataset(Dataset):
+    def __init__(self):
+        self.length = 64
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i) -> dict:
+        x = random.random()
+        y = np.random.random()
+        z = torch.rand([]).item()
+        return {"x": torch.tensor([x, y, z])}
+
+
+class DummyModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.fc = nn.Linear(3, 1)
+
+    def forward(self, x):
+        local_tensor = x.detach().clone().to(torch_device)
+        gathered = gather_from_all_gpus(local_tensor, dist.get_world_size())
+        assert not all(torch.allclose(t, gathered[0]) for t in gathered[1:])
+        y = self.fc(x)
+        return (y.mean(), y)
+
+
+def run_distributed_training(training_args):
+    set_seed(42)
+    model = DummyModel()
+    dataset = DummyDataset()
+    training_args.max_steps = 3
+    # dataloader_num_workers must be > 0 to enable worker_init_fn
+    training_args.dataloader_num_workers = 2
+    trainer = Trainer(
+        model,
+        training_args,
+        train_dataset=dataset,
+    )
+    trainer.train()
+
+
+if __name__ == "__main__":
+    parser = HfArgumentParser((TrainingArguments,))
+    training_args = parser.parse_args_into_dataclasses()[0]
+    run_distributed_training(training_args)
--- a/tests/trainer/distributed/test_trainer_distributed.py
+++ b/tests/trainer/distributed/test_trainer_distributed.py
@@ -0,0 +1,180 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Shared constants, helpers, and reusable test logic for distributed trainer tests.
+
+This module provides:
+- Path constants for test scripts and accelerate configs.
+- ``TrainerDistributedCommon``, an abstract base class that contains reusable
+  test scenarios (training, mixed-precision, gradient accumulation, checkpoint
+  resume, evaluation). Framework-specific test files (DDP, FSDP, DeepSpeed)
+  subclass it and wire each scenario to parameterized test methods.
+"""
+
+import json
+import os
+from abc import ABC, abstractmethod
+
+from transformers import is_torch_available
+from transformers.testing_utils import execute_subprocess_async
+from transformers.trainer_callback import TrainerState
+from transformers.trainer_utils import get_last_checkpoint
+
+
+if is_torch_available():
+    import torch
+
+# ---------------------------------------------------------------------------
+# Path constants
+# ---------------------------------------------------------------------------
+DISTRIBUTED_DIR = os.path.dirname(__file__)
+CONFIGS_DIR = os.path.join(DISTRIBUTED_DIR, "accelerate_configs")
+SCRIPTS_DIR = os.path.join(DISTRIBUTED_DIR, "scripts")
+TRAIN_SCRIPT = os.path.join(SCRIPTS_DIR, "train.py")
+
+
+class TrainerDistributedCommon(ABC):
+    """Reusable test scenarios shared across DDP, FSDP, and DeepSpeed.
+
+    Subclasses must:
+    1. Implement ``get_accelerate_cmd`` to build the launch command.
+    2. Define the following test methods (parameterized as needed)::
+
+        test_training               → self.check_training(dtype, ...)
+        test_training_mixed_precision → self.check_mixed_precision(dtype, ...)
+        test_training_with_gradient_accumulation → self.check_gradient_accumulation(...)
+        test_training_and_can_resume_normally    → self.check_resume(...)
+        test_eval                   → self.check_eval(...)
+
+    These test methods can't be defined here as ``@abstractmethod`` because
+    ``@parameterized.expand`` removes the original method name from the
+    subclass, which would cause ABC to raise ``TypeError`` at instantiation.
+    """
+
+    @abstractmethod
+    def get_accelerate_cmd(self, script, config_file, launch_args=None, script_args=None, **kwargs):
+        """Build the full ``accelerate launch`` command list.
+
+        Args:
+            script: Path to the Python script to run.
+            config_file: Path to the accelerate YAML config (always required).
+            launch_args: Extra flags inserted *before* the script
+                (e.g. ``--fsdp_sharding_strategy``, ``--offload_optimizer_device``).
+            script_args: Extra flags appended *after* the script
+                (e.g. ``--output_dir``, ``--bf16``).
+            **kwargs: Framework-specific overrides (e.g. ``num_processes``).
+        """
+        ...
+
+    # -------------------------------------------------------------------
+    # Helpers
+    # -------------------------------------------------------------------
+    def _get_default_script_args(self, output_dir, num_epochs=1, logging_steps=5, save_steps=None):
+        """Build the baseline CLI arguments shared by all training runs."""
+        args = [
+            "--output_dir",
+            output_dir,
+            "--num_train_epochs",
+            str(num_epochs),
+            "--logging_steps",
+            str(logging_steps),
+            "--per_device_train_batch_size",
+            "4",
+            "--learning_rate",
+            "5e-5",
+        ]
+        if save_steps is not None:
+            args += ["--save_steps", str(save_steps)]
+        else:
+            args += ["--save_strategy", "no"]
+        return args
+
+    def _train_and_get_log_history(self, cmd, output_dir):
+        """Run a training command and return the log history from the last checkpoint."""
+        execute_subprocess_async(cmd, env=self.get_env())
+        checkpoint = get_last_checkpoint(output_dir)
+        state_file = os.path.join(checkpoint, "trainer_state.json")
+        return TrainerState.load_from_json(state_file).log_history
+
+    # -------------------------------------------------------------------
+    # Reusable test scenarios — called from subclass test methods
+    # -------------------------------------------------------------------
+    def check_training(self, dtype="bf16", **cmd_kwargs):
+        """Verify that training completes with the model loaded in *dtype* (no mixed precision)."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir) + ["--model_dtype", dtype]
+        execute_subprocess_async(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            env=self.get_env(),
+        )
+
+    def check_mixed_precision(self, dtype="bf16", **cmd_kwargs):
+        """Verify mixed-precision training: model in fp32, compute in *dtype*."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir) + ["--model_dtype", "fp32", f"--{dtype}"]
+        # fp16 requires a non-fused optimizer to avoid nan losses on small models
+        if dtype == "fp16":
+            args += ["--optim", "adamw_torch"]
+        execute_subprocess_async(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            env=self.get_env(),
+        )
+
+    def check_gradient_accumulation(self, **cmd_kwargs):
+        """Verify that training with gradient accumulation completes without error."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir) + ["--bf16", "--gradient_accumulation_steps", "2"]
+        execute_subprocess_async(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            env=self.get_env(),
+        )
+
+    def check_resume(self, **cmd_kwargs):
+        """Verify that training can resume from a checkpoint with consistent learning rates."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2) + ["--bf16"]
+
+        original_logs = self._train_and_get_log_history(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            output_dir,
+        )
+
+        checkpoint = os.path.join(output_dir, "checkpoint-2")
+        self.assertTrue(os.path.isdir(checkpoint), f"Checkpoint dir not found: {checkpoint}")
+
+        resume_args = args + ["--resume_from_checkpoint", checkpoint]
+        resumed_logs = self._train_and_get_log_history(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
+            output_dir,
+        )
+
+        for original, resumed in zip(original_logs, resumed_logs):
+            if "learning_rate" in original:
+                self.assertAlmostEqual(original["learning_rate"], resumed["learning_rate"], delta=1e-5)
+
+    def check_eval(self, **cmd_kwargs):
+        """Verify that evaluation produces a finite eval loss."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        eval_output = os.path.join(output_dir, "eval_metrics.json")
+        args = self._get_default_script_args(output_dir) + ["--do_eval", "--eval_output_file", eval_output]
+        execute_subprocess_async(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            env=self.get_env(),
+        )
+
+        with open(eval_output) as f:
+            eval_metrics = json.load(f)
+        self.assertIn("eval_loss", eval_metrics)
+        self.assertTrue(torch.isfinite(torch.tensor(eval_metrics["eval_loss"])))
--- a/tests/trainer/distributed/test_trainer_distributed_ddp.py
+++ b/tests/trainer/distributed/test_trainer_distributed_ddp.py
@@ -0,0 +1,297 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+DDP-specific distributed trainer tests.
+"""
+
+import json
+import os
+import re
+
+from parameterized import parameterized
+
+from transformers.testing_utils import (
+    CaptureStderr,
+    TestCasePlus,
+    backend_device_count,
+    execute_subprocess_async,
+    get_torch_dist_unique_port,
+    require_torch_multi_accelerator,
+    slow,
+    torch_device,
+)
+from transformers.utils import is_torch_bf16_available_on_device, is_torch_fp16_available_on_device
+
+from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TrainerDistributedCommon
+
+
+DDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "ddp.yaml")
+
+dtypes = []
+if is_torch_bf16_available_on_device(torch_device):
+    dtypes += ["bf16"]
+if is_torch_fp16_available_on_device(torch_device):
+    dtypes += ["fp16"]
+
+pure_dtype_params = ["fp32"] + dtypes
+mixed_precision_params = list(dtypes)
+
+
+def _parameterized_custom_name_func(func, param_num, param):
+    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
+    return f"{func.__name__}_{param_based_name}"
+
+
+class DDPCommandsMixin:
+    """Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for DDP."""
+
+    def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
+        if num_processes is None:
+            num_processes = backend_device_count(torch_device)
+        port = get_torch_dist_unique_port()
+        cmd = [
+            "torchrun",
+            f"--nproc_per_node={num_processes}",
+            "--nnodes=1",
+            f"--master_port={port}",
+            script,
+        ]
+        if script_args:
+            cmd.extend(script_args)
+        return cmd
+
+    def get_accelerate_cmd(
+        self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
+    ):
+        if num_processes is None:
+            num_processes = backend_device_count(torch_device)
+        port = get_torch_dist_unique_port()
+        cmd = [
+            "accelerate",
+            "launch",
+            "--config_file",
+            config_file,
+            "--num_processes",
+            str(num_processes),
+            "--main_process_port",
+            str(port),
+        ]
+        if launch_args:
+            cmd.extend(launch_args)
+        cmd.append(script)
+        if script_args:
+            cmd.extend(script_args)
+        return cmd
+
+
+@slow
+@require_torch_multi_accelerator
+class TestTrainerDistributedDDP(DDPCommandsMixin, TestCasePlus):
+    # -----------------------------------------------------------------------
+    # accelerate launch tests
+    # -----------------------------------------------------------------------
+    def test_eval_order(self):
+        output_dir = self.get_auto_remove_tmp_dir()
+        script = os.path.join(SCRIPTS_DIR, "eval_ddp.py")
+        cmd = self.get_accelerate_cmd(
+            script,
+            DDP_CONFIG_FILE,
+            script_args=["--output_dir", output_dir],
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+    def test_loss_averaging(self):
+        device_count = backend_device_count(torch_device)
+        min_bs = 2
+        output_dir = self.get_auto_remove_tmp_dir()
+        script = os.path.join(SCRIPTS_DIR, "loss_averaging.py")
+
+        # Launch 1: single-GPU baseline
+        cmd = self.get_accelerate_cmd(
+            script,
+            DDP_CONFIG_FILE,
+            script_args=[
+                "--output_dir",
+                f"{output_dir}/base",
+                "--per_device_train_batch_size",
+                str(min_bs * device_count),
+                "--average_tokens_across_devices",
+                "True",
+            ],
+            num_processes=1,
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        # Launch 2: multi-GPU with both averaging modes in a single launch
+        cmd = self.get_accelerate_cmd(
+            script,
+            DDP_CONFIG_FILE,
+            script_args=[
+                "--output_dir",
+                f"{output_dir}/multi",
+                "--per_device_train_batch_size",
+                str(min_bs),
+                "--run_both_averaging_modes",
+            ],
+            num_processes=device_count,
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        with open(f"{output_dir}/base_losses.json") as f:
+            base_loss = json.load(f)
+        with open(f"{output_dir}/multi/broken_losses.json") as f:
+            broken_loss = json.load(f)
+        with open(f"{output_dir}/multi/fixed_losses.json") as f:
+            fixed_loss = json.load(f)
+
+        broken_diff = [abs(base_loss[i] - broken_loss[i]) for i in range(len(base_loss))]
+        fixed_diff = [abs(base_loss[i] - fixed_loss[i]) for i in range(len(base_loss))]
+        sum_base = sum(base_loss)
+        sum_broken = sum(broken_loss)
+        relative_broken = abs(sum_base - sum_broken) / max(sum_base, sum_broken)
+
+        self.assertGreater(max(broken_diff), 0.5)
+        self.assertLess(max(fixed_diff), 0.005)
+        self.assertLess(relative_broken, 0.1)
+
+    def test_worker_seed(self):
+        output_dir = self.get_auto_remove_tmp_dir()
+        script = os.path.join(SCRIPTS_DIR, "worker_seed.py")
+        cmd = self.get_accelerate_cmd(
+            script,
+            DDP_CONFIG_FILE,
+            script_args=["--output_dir", output_dir],
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+    # -----------------------------------------------------------------------
+    # torchrun vs accelerate env parity
+    # -----------------------------------------------------------------------
+    def test_torchrun_accelerate_env_parity(self):
+        """Verify torchrun and accelerate launch produce the same distributed environment for DDP."""
+        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
+        num_processes = backend_device_count(torch_device)
+
+        torchrun_dir = self.get_auto_remove_tmp_dir()
+        cmd = self.get_torchrun_cmd(script, script_args=["--output_dir", torchrun_dir], num_processes=num_processes)
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        accelerate_dir = self.get_auto_remove_tmp_dir()
+        cmd = self.get_accelerate_cmd(
+            script, DDP_CONFIG_FILE, script_args=["--output_dir", accelerate_dir], num_processes=num_processes
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        for rank in range(num_processes):
+            with open(os.path.join(torchrun_dir, f"env_rank{rank}.json")) as f:
+                tr = json.load(f)
+            with open(os.path.join(accelerate_dir, f"env_rank{rank}.json")) as f:
+                ac = json.load(f)
+
+            for info in (tr, ac):
+                # Rank consistency: env vars, TrainingArguments, and accelerator all agree
+                self.assertEqual(info["env_world_size"], str(num_processes))
+                self.assertEqual(info["env_rank"], str(rank))
+                self.assertEqual(info["env_local_rank"], str(rank))
+                self.assertEqual(info["args_process_index"], rank)
+                self.assertEqual(info["args_local_process_index"], rank)
+                self.assertIn(info["args_local_rank"], (rank, -1))  # may be -1 before framework consumes it
+                self.assertEqual(info["accelerator_process_index"], rank)
+                self.assertEqual(info["accelerator_local_process_index"], rank)
+                self.assertIsNotNone(info["env_master_addr"])
+                self.assertIsNotNone(info["env_master_port"])
+
+                # World size and parallel mode
+                self.assertEqual(info["args_world_size"], num_processes)
+                self.assertEqual(info["args_n_gpu"], 1)
+                self.assertEqual(info["args_parallel_mode"], "ParallelMode.DISTRIBUTED")
+                self.assertEqual(info["accelerator_num_processes"], num_processes)
+                self.assertTrue(info["accelerator_use_distributed"])
+                self.assertEqual(info["accelerator_is_main_process"], rank == 0)
+                self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)
+
+                # DDP: distributed type is MULTI_GPU
+                self.assertEqual(info["accelerator_distributed_type"], "DistributedType.MULTI_GPU")
+
+                # Each rank on its own device
+                self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])
+
+                # DDP should not activate FSDP or DeepSpeed
+                self.assertFalse(info["trainer_is_fsdp_enabled"])
+                self.assertFalse(info["trainer_is_deepspeed_enabled"])
+                self.assertNotIn("fsdp_version", info)
+                self.assertNotIn("deepspeed_zero_stage", info)
+
+    @parameterized.expand(
+        [
+            ("base", "--log_level info", 1),
+            ("low", "--log_level debug --log_level_replica debug", 2),
+            ("high", "--log_level error --log_level_replica debug", 1),
+            ("mixed", "--log_level error --log_level_replica error", 0),
+        ]
+    )
+    def test_log_level_replica(self, _name, extra_args_str, expected_matches):
+        """Test that log_level_replica controls logging on non-main processes."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        script = os.path.join(SCRIPTS_DIR, "train.py")
+        script_args = [
+            "--output_dir",
+            output_dir,
+            "--num_train_epochs",
+            "1",
+            "--per_device_train_batch_size",
+            "4",
+            "--logging_strategy",
+            "no",
+        ]
+        if extra_args_str:
+            script_args.extend(extra_args_str.split())
+        cmd = self.get_accelerate_cmd(script, DDP_CONFIG_FILE, script_args=script_args, num_processes=2)
+        log_info_string = "Running training"
+        with CaptureStderr() as cl:
+            execute_subprocess_async(cmd, env=self.get_env())
+        n_matches = len(re.findall(log_info_string, cl.err))
+        self.assertEqual(n_matches, expected_matches)
+
+
+# ---------------------------------------------------------------------------
+# DDP training integration tests (using train.py)
+# ---------------------------------------------------------------------------
+
+
+@slow
+@require_torch_multi_accelerator
+class TestTrainerDistributedDDPCommon(DDPCommandsMixin, TrainerDistributedCommon, TestCasePlus):
+    """
+    Distributed DDP training tests using ``accelerate launch`` with the shared
+    train.py script. Mirrors the test structure used in FSDP and DeepSpeed.
+    """
+
+    @parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
+    def test_training(self, dtype):
+        self.check_training(dtype, config_file=DDP_CONFIG_FILE)
+
+    @parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
+    def test_training_mixed_precision(self, dtype):
+        self.check_mixed_precision(dtype, config_file=DDP_CONFIG_FILE)
+
+    def test_training_with_gradient_accumulation(self):
+        self.check_gradient_accumulation(config_file=DDP_CONFIG_FILE)
+
+    def test_training_and_can_resume_normally(self):
+        self.check_resume(config_file=DDP_CONFIG_FILE)
+
+    def test_eval(self):
+        self.check_eval(config_file=DDP_CONFIG_FILE)
--- a/tests/trainer/distributed/test_trainer_distributed_deepspeed.py
+++ b/tests/trainer/distributed/test_trainer_distributed_deepspeed.py
--- a/tests/trainer/distributed/test_trainer_distributed_fsdp.py
+++ b/tests/trainer/distributed/test_trainer_distributed_fsdp.py
@@ -0,0 +1,668 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+FSDP-specific distributed trainer tests.
+"""
+
+import itertools
+import json
+import os
+import unittest
+from functools import partial
+from pathlib import Path
+from unittest.mock import patch
+
+from parameterized import parameterized
+
+from tests.trainer.trainer_test_utils import TrainerIntegrationCommon, get_regression_trainer  # noqa
+from transformers import HfArgumentParser, PreTrainedConfig, TrainingArguments, is_torch_available
+from transformers.testing_utils import (
+    TestCasePlus,
+    backend_device_count,
+    execute_subprocess_async,
+    get_torch_dist_unique_port,
+    mockenv_context,
+    require_torch,
+    require_torch_accelerator,
+    require_torch_multi_accelerator,
+    slow,
+    torch_device,
+)
+from transformers.trainer_utils import set_seed
+from transformers.utils import (
+    is_torch_bf16_available_on_device,
+    is_torch_fp16_available_on_device,
+)
+
+from .test_trainer_distributed import CONFIGS_DIR, SCRIPTS_DIR, TRAIN_SCRIPT, TrainerDistributedCommon
+
+
+if is_torch_available():
+    import torch
+    from torch import nn
+
+    from transformers import PreTrainedModel
+    from transformers.trainer import FSDP_MODEL_NAME
+
+# Base accelerate configs (version only — model-specific settings via launch args)
+FSDP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp.yaml")
+FSDP2_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2.yaml")
+FSDP2_CP_CONFIG_FILE = os.path.join(CONFIGS_DIR, "fsdp2_cp.yaml")
+FSDP_GENERATE_SCRIPT = os.path.join(SCRIPTS_DIR, "fsdp_generate.py")
+
+FSDP_CONFIGS = {
+    "fsdp1": FSDP_CONFIG_FILE,
+    "fsdp2": FSDP2_CONFIG_FILE,
+}
+
+# Launch args shared by all training tests
+TRAIN_LAUNCH_ARGS = [
+    "--fsdp_auto_wrap_policy",
+    "TRANSFORMER_BASED_WRAP",
+]
+
+dtypes = []
+if is_torch_bf16_available_on_device(torch_device):
+    dtypes += ["bf16"]
+if is_torch_fp16_available_on_device(torch_device):
+    dtypes += ["fp16"]
+
+sharding_strategies = ["full_shard", "shard_grad_op"]  # analogous to ZeRO-3 and ZeRO-2
+fsdp_versions = ["fsdp1", "fsdp2"]
+
+config_params = list(itertools.product(sharding_strategies, dtypes))
+# Mixed precision: model loaded in fp32, training with --bf16/--fp16
+mixed_precision_params = list(itertools.product(sharding_strategies, dtypes, fsdp_versions))
+# Pure dtype: model loaded in target dtype, no mixed precision flags
+pure_dtype_params = list(itertools.product(["fp32"] + dtypes, fsdp_versions))
+
+resume_params = [
+    ("FULL_STATE_DICT", "fsdp1"),  # FULL_STATE_DICT only supported for fsdp1
+    ("SHARDED_STATE_DICT", "fsdp1"),
+    ("SHARDED_STATE_DICT", "fsdp2"),
+]
+
+set_seed(42)
+
+
+if is_torch_available():
+    # hack to restore original logging level pre #21700
+    get_regression_trainer = partial(get_regression_trainer, log_level="info")
+
+
+if is_torch_available():
+
+    class _BaseModel(PreTrainedModel):
+        base_model_prefix = "base"
+        config_class = PreTrainedConfig
+
+        def __init__(self, config):
+            super().__init__(config)
+            self.linear = nn.Linear(5, 5)
+            self.linear_2 = nn.Linear(5, 5)
+            self.post_init()
+
+        def forward(self, x):
+            return self.linear_2(self.linear(x))
+
+
+@require_torch
+class InitializeMissingKeysTest(unittest.TestCase):
+    """Tests for FSDP non-rank-0 weight initialization: params should be moved from meta to CPU
+    and marked as initialized without being re-initialized."""
+
+    def _clear_init_flags(self, model):
+        for module in model.modules():
+            if hasattr(module, "_is_hf_initialized"):
+                delattr(module, "_is_hf_initialized")
+        for param in model.parameters():
+            if hasattr(param, "_is_hf_initialized"):
+                delattr(param, "_is_hf_initialized")
+        for buffer in model.buffers():
+            if hasattr(buffer, "_is_hf_initialized"):
+                delattr(buffer, "_is_hf_initialized")
+
+    def test_move_missing_keys_fsdp_non_rank0_moves_meta_to_cpu(self):
+        """FSDP non-rank-0 path should move all params from meta to CPU."""
+        with torch.device("meta"):
+            model = _BaseModel(PreTrainedConfig())
+
+        for param in model.parameters():
+            self.assertEqual(param.device, torch.device("meta"))
+
+        with (
+            patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
+            patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
+        ):
+            model._move_missing_keys_from_meta_to_device(
+                missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
+            )
+
+        for name, param in model.named_parameters():
+            self.assertEqual(param.device, torch.device("cpu"), f"param {name} should be on CPU after FSDP move")
+
+    def test_fsdp_non_rank0_end_to_end_no_reinit(self):
+        """End-to-end: move from meta + _initialize_missing_keys should mark all params initialized
+        without changing their values."""
+        with torch.device("meta"):
+            model = _BaseModel(PreTrainedConfig())
+
+        with (
+            patch("transformers.modeling_utils.is_fsdp_enabled", return_value=True),
+            patch("transformers.modeling_utils.is_local_dist_rank_0", return_value=False),
+        ):
+            model._move_missing_keys_from_meta_to_device(
+                missing_keys=set(), device_map=None, device_mesh=None, hf_quantizer=None
+            )
+            pre_init_values = {name: param.clone() for name, param in model.named_parameters()}
+            self._clear_init_flags(model)
+            model._initialize_missing_keys(is_quantized=False)
+
+        for name, param in model.named_parameters():
+            self.assertTrue(getattr(param, "_is_hf_initialized", False), f"param {name} not marked initialized")
+            torch.testing.assert_close(param, pre_init_values[name], msg=f"param {name} was re-initialized")
+        self.assertTrue(getattr(model, "_is_hf_initialized", False))
+
+
+def _parameterized_custom_name_func(func, param_num, param):
+    # Customize the test name generator so that every param appears in the sub-test name;
+    # by default only the first param is shown.
+    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
+    return f"{func.__name__}_{param_based_name}"
+
+
+# ---------------------------------------------------------------------------
+# Command mixins
+# ---------------------------------------------------------------------------
+
+
+class FSDPCommandsMixin:
+    """Provides ``get_torchrun_cmd`` and ``get_accelerate_cmd`` for FSDP."""
+
+    def get_torchrun_cmd(self, script, script_args=None, num_processes=None):
+        if num_processes is None:
+            num_processes = backend_device_count(torch_device)
+        port = get_torch_dist_unique_port()
+        cmd = [
+            "torchrun",
+            f"--nproc_per_node={num_processes}",
+            "--nnodes=1",
+            f"--master_port={port}",
+            script,
+        ]
+        if script_args:
+            cmd.extend(script_args)
+        return cmd
+
+    def get_accelerate_cmd(
+        self, script, config_file, launch_args=None, script_args=None, num_processes=None, **kwargs
+    ):
+        if num_processes is None:
+            num_processes = backend_device_count(torch_device)
+        port = get_torch_dist_unique_port()
+        cmd = [
+            "accelerate",
+            "launch",
+            "--config_file",
+            config_file,
+            "--num_processes",
+            str(num_processes),
+            "--main_process_port",
+            str(port),
+        ]
+        if launch_args:
+            cmd.extend(launch_args)
+        cmd.append(script)
+        if script_args:
+            cmd.extend(script_args)
+        return cmd
+
+
+# ---------------------------------------------------------------------------
+# Config parsing tests (no distributed training runs)
+# ---------------------------------------------------------------------------
+
+
+@require_torch_accelerator
+class TestFSDPConfig(TestCasePlus):
+    def setUp(self):
+        super().setUp()
+        master_port = get_torch_dist_unique_port()
+        self.dist_env_1_gpu = {
+            "MASTER_ADDR": "localhost",
+            "MASTER_PORT": str(master_port),
+            "RANK": "0",
+            "LOCAL_RANK": "0",
+            "WORLD_SIZE": "1",
+        }
+        self.accelerate_fsdp_config = {
+            "fsdp_activation_checkpointing": False,
+            "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+            "fsdp_backward_prefetch": "BACKWARD_PRE",
+            "fsdp_cpu_ram_efficient_loading": True,
+            "fsdp_forward_prefetch": False,
+            "fsdp_offload_params": False,
+            "fsdp_reshard_after_forward": "FULL_SHARD",
+            "fsdp_state_dict_type": "FULL_STATE_DICT",
+            "fsdp_sync_module_states": True,
+            "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+            "fsdp_use_orig_params": True,
+            "fsdp_version": 1,
+        }
+
+        self.fsdp_config = {
+            "backward_prefetch": "BACKWARD_PRE",
+            "forward_prefetch": "false",
+            "limit_all_gathers": "false",
+            "use_orig_params": "true",
+            "sync_module_states": "true",
+            "cpu_ram_efficient_loading": "true",
+            "activation_checkpointing": "false",
+            "min_num_params": 1,
+        }
+
+    @parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
+    def test_accelerate_fsdp_config(self, sharding_strategy, dtype):
+        output_dir = self.get_auto_remove_tmp_dir()
+        # Snapshot before trainer construction — `_process_fsdp_args` strips the
+        # `fsdp_` prefix in place.
+        expected = dict(self.accelerate_fsdp_config)
+        kwargs = {
+            "output_dir": output_dir,
+            "train_len": 128,
+            "save_steps": 5,
+            "learning_rate": 0.1,
+            "fsdp": f"{sharding_strategy} offload auto_wrap",
+            "fsdp_config": self.accelerate_fsdp_config,
+        }
+        kwargs[dtype] = True
+        with mockenv_context(**self.dist_env_1_gpu):
+            trainer = get_regression_trainer(**kwargs)
+            self.assertIs(trainer.args.fsdp, True)
+            self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
+            for k, v in expected.items():
+                assert k.startswith("fsdp_")
+                # `transformer_layer_cls_to_wrap` is normalized from str → list during parsing.
+                if k == "fsdp_transformer_layer_cls_to_wrap" and isinstance(v, str):
+                    v = [v]
+                self.assertEqual(trainer.args.fsdp_config[k[5:]], v)
+
+    def test_torchrun_fsdp_config(self):
+        """Verify that --fsdp + --fsdp_config (torchrun-style) are parsed correctly."""
+        output_dir = self.get_auto_remove_tmp_dir()
+        fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer"}
+        kwargs = {
+            "output_dir": output_dir,
+            "train_len": 128,
+            "save_steps": 5,
+            "learning_rate": 0.1,
+            "fsdp": "full_shard auto_wrap",
+            "fsdp_config": fsdp_config,
+            "bf16": True,
+        }
+        with mockenv_context(**self.dist_env_1_gpu):
+            trainer = get_regression_trainer(**kwargs)
+            self.assertIs(trainer.args.fsdp, True)
+            # fsdp_ prefix is stripped and value is normalized to a list during parsing
+            self.assertIn("Qwen2DecoderLayer", trainer.args.fsdp_config["transformer_layer_cls_to_wrap"])
+
+    def test_fsdp_cli_parsing(self):
+        """`--fsdp` (bare) → True; legacy `--fsdp full_shard` still parses; absent → None."""
+        parser = HfArgumentParser(TrainingArguments)
+        base = ["--output_dir", "/tmp/x"]
+
+        args, _ = parser.parse_known_args([*base, "--fsdp"])
+        self.assertIs(args.fsdp, True)
+
+        args, _ = parser.parse_known_args([*base, "--fsdp", "full_shard"])
+        self.assertEqual(args.fsdp, "full_shard")
+
+        args, _ = parser.parse_known_args(base)
+        self.assertIsNone(args.fsdp)
+
+        # Bare `--fsdp` should resolve to a fully enabled FSDP setup through `_process_fsdp_args`.
+        with mockenv_context(**self.dist_env_1_gpu):
+            trainer_args = TrainingArguments(output_dir="/tmp/x", fsdp=True)
+            self.assertIs(trainer_args.fsdp, True)
+            self.assertIsNotNone(trainer_args.fsdp_plugin_args)
+
+    @parameterized.expand(config_params, name_func=_parameterized_custom_name_func)
+    def test_fsdp_config(self, sharding_strategy, dtype):
+        output_dir = self.get_auto_remove_tmp_dir()
+        kwargs = {
+            "output_dir": output_dir,
+            "train_len": 128,
+            "save_steps": 5,
+            "learning_rate": 0.1,
+            "fsdp": f"{sharding_strategy} offload auto_wrap",
+            "fsdp_config": self.fsdp_config,
+        }
+        kwargs[dtype] = True
+        with mockenv_context(**self.dist_env_1_gpu):
+            trainer = get_regression_trainer(**kwargs)
+            self.assertIs(trainer.args.fsdp, True)
+            self.assertTrue(trainer.args.fsdp_config.get("cpu_offload"))
+            for k, v in self.fsdp_config.items():
+                self.assertEqual(trainer.args.fsdp_config[k], v)
+
+
+# ---------------------------------------------------------------------------
+# FSDP distributed tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch_multi_accelerator
+class TestTrainerDistributedFSDP(FSDPCommandsMixin, TestCasePlus):
+    def _run_env_check(self, cmd, num_processes):
+        """Run the env check script and return per-rank results."""
+        execute_subprocess_async(cmd, env=self.get_env())
+        # output_dir is the value that follows the `--output_dir` flag in the command
+        output_dir = cmd[cmd.index("--output_dir") + 1]
+        results = []
+        for rank in range(num_processes):
+            with open(os.path.join(output_dir, f"env_rank{rank}.json")) as f:
+                results.append(json.load(f))
+        return results
+
+    def test_torchrun_accelerate_fsdp1_env_parity(self):
+        """Verify torchrun+--fsdp and accelerate launch produce the same FSDP1 env."""
+        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
+        num_processes = backend_device_count(torch_device)
+
+        torchrun_dir = self.get_auto_remove_tmp_dir()
+        torchrun_results = self._run_env_check(
+            self.get_torchrun_cmd(
+                script,
+                script_args=[
+                    "--output_dir",
+                    torchrun_dir,
+                    "--fsdp",
+                    "full_shard",
+                    "--fsdp_config",
+                    '{"fsdp_version": 1}',
+                ],
+                num_processes=num_processes,
+            ),
+            num_processes,
+        )
+
+        accel_dir = self.get_auto_remove_tmp_dir()
+        accel_results = self._run_env_check(
+            self.get_accelerate_cmd(
+                script, FSDP_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
+            ),
+            num_processes,
+        )
+
+        self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=1)
+
+    def test_torchrun_accelerate_fsdp2_env_parity(self):
+        """Verify torchrun+--fsdp and accelerate launch produce the same FSDP2 env."""
+        script = os.path.join(SCRIPTS_DIR, "torchrun_env_check.py")
+        num_processes = backend_device_count(torch_device)
+
+        torchrun_dir = self.get_auto_remove_tmp_dir()
+        torchrun_results = self._run_env_check(
+            self.get_torchrun_cmd(
+                script,
+                script_args=[
+                    "--output_dir",
+                    torchrun_dir,
+                    "--fsdp",
+                    "full_shard",
+                    "--fsdp_config",
+                    '{"fsdp_version": 2}',
+                ],
+                num_processes=num_processes,
+            ),
+            num_processes,
+        )
+
+        accel_dir = self.get_auto_remove_tmp_dir()
+        accel_results = self._run_env_check(
+            self.get_accelerate_cmd(
+                script, FSDP2_CONFIG_FILE, script_args=["--output_dir", accel_dir], num_processes=num_processes
+            ),
+            num_processes,
+        )
+
+        self._check_parity(torchrun_results, accel_results, num_processes, expected_fsdp_version=2)
+
+    def _check_parity(self, torchrun_results, accel_results, num_processes, expected_fsdp_version):
+        for rank in range(num_processes):
+            tr, ac = torchrun_results[rank], accel_results[rank]
+
+            # Both should agree on distributed env
+            self.assertEqual(tr["args_world_size"], ac["args_world_size"])
+            self.assertEqual(tr["args_process_index"], ac["args_process_index"])
+            self.assertEqual(tr["args_parallel_mode"], ac["args_parallel_mode"])
+            self.assertEqual(tr["accelerator_num_processes"], ac["accelerator_num_processes"])
+            self.assertEqual(tr["accelerator_use_distributed"], ac["accelerator_use_distributed"])
+
+            for info in (tr, ac):
+                # Rank consistency across all layers
+                self.assertEqual(info["env_world_size"], str(num_processes))
+                self.assertEqual(info["env_rank"], str(rank))
+                self.assertEqual(info["args_process_index"], rank)
+                self.assertEqual(info["args_local_process_index"], rank)
+                self.assertEqual(info["accelerator_process_index"], rank)
+                self.assertEqual(info["accelerator_local_process_index"], rank)
+                self.assertEqual(info["args_n_gpu"], 1)
+                self.assertEqual(info["accelerator_is_main_process"], rank == 0)
+                self.assertEqual(info["accelerator_is_local_main_process"], rank == 0)
+                self.assertIn(f"{torch_device}:{rank}", info["accelerator_device"])
+
+                # Both should have FSDP enabled with the correct version
+                self.assertEqual(info["accelerator_distributed_type"], "DistributedType.FSDP")
+                self.assertTrue(info["trainer_is_fsdp_enabled"])
+                self.assertFalse(info["trainer_is_deepspeed_enabled"])
+                self.assertEqual(info["fsdp_version"], expected_fsdp_version)
+                self.assertNotIn("deepspeed_zero_stage", info)
+
+
+# ---------------------------------------------------------------------------
+# All distributed FSDP training tests
+# ---------------------------------------------------------------------------
+@slow
+@require_torch_multi_accelerator
+class TestTrainerDistributedFSDPCommon(
+    FSDPCommandsMixin, TrainerDistributedCommon, TestCasePlus, TrainerIntegrationCommon
+):
+    # -------------------------------------------------------------------
+    # FSDP training — accelerate (parameterized over fsdp version)
+    # -------------------------------------------------------------------
+
+    # Pure dtype training: model loaded in target dtype, no mixed precision
+    @parameterized.expand(pure_dtype_params, name_func=_parameterized_custom_name_func)
+    def test_training(self, dtype, fsdp_version):
+        self.check_training(dtype, config_file=FSDP_CONFIGS[fsdp_version])
+
+    # Mixed precision: model loaded in fp32, training with --bf16/--fp16
+    @parameterized.expand(mixed_precision_params, name_func=_parameterized_custom_name_func)
+    def test_training_mixed_precision(self, sharding_strategy, dtype, fsdp_version):
+        if fsdp_version == "fsdp2":
+            reshard = "true" if sharding_strategy == "full_shard" else "false"
+        else:
+            reshard = sharding_strategy.upper()
+        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_reshard_after_forward", reshard]
+        self.check_mixed_precision(dtype, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args)
+
+    @parameterized.expand(["true", "false"], name_func=_parameterized_custom_name_func)
+    def test_fsdp2_cpu_ram_efficient_loading(self, cpu_ram_efficient_loading):
+        launch_args = list(TRAIN_LAUNCH_ARGS) + [
+            "--fsdp_cpu_ram_efficient_loading",
+            cpu_ram_efficient_loading,
+        ]
+        self.check_training("bf16", config_file=FSDP2_CONFIG_FILE, launch_args=launch_args)
+
+    @parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
+    def test_training_with_gradient_accumulation(self, fsdp_version):
+        self.check_gradient_accumulation(config_file=FSDP_CONFIGS[fsdp_version])
+
+    @parameterized.expand(fsdp_versions, name_func=_parameterized_custom_name_func)
+    def test_basic_run_with_cpu_offload(self, fsdp_version):
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir) + ["--bf16", "--max_steps", "10"]
+        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_offload_params", "true"]
+        execute_subprocess_async(
+            self.get_accelerate_cmd(
+                TRAIN_SCRIPT, script_args=args, config_file=FSDP_CONFIGS[fsdp_version], launch_args=launch_args
+            ),
+            env=self.get_env(),
+        )
+
+    @parameterized.expand(resume_params, name_func=_parameterized_custom_name_func)
+    def test_training_and_can_resume_normally(self, state_dict_type, fsdp_version):
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = self._get_default_script_args(output_dir, num_epochs=2, logging_steps=2, save_steps=2)
+
+        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", state_dict_type]
+        cmd_kwargs = {"config_file": FSDP_CONFIGS[fsdp_version], "launch_args": launch_args}
+
+        logs = self._train_and_get_log_history(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=args, **cmd_kwargs),
+            output_dir,
+        )
+
+        # resume from ckpt
+        checkpoint = os.path.join(output_dir, "checkpoint-2")
+        resume_args = args + ["--resume_from_checkpoint", checkpoint]
+
+        is_fsdp_ckpt = os.path.isdir(checkpoint) and (
+            # this checks the FSDP state dict when `SHARDED_STATE_DICT` is used
+            any(
+                FSDP_MODEL_NAME in folder_name
+                for folder_name in os.listdir(checkpoint)
+                if os.path.isdir(os.path.join(checkpoint, folder_name))
+            )
+            # this checks the FSDP state dict when `FULL_STATE_DICT` is used
+            or os.path.isfile(os.path.join(checkpoint, f"{FSDP_MODEL_NAME}.bin"))
+        )
+        self.assertTrue(is_fsdp_ckpt)
+
+        logs_resume = self._train_and_get_log_history(
+            self.get_accelerate_cmd(TRAIN_SCRIPT, script_args=resume_args, **cmd_kwargs),
+            output_dir,
+        )
+
+        for log, log1 in zip(logs, logs_resume):
+            if "learning_rate" in log:
+                self.assertAlmostEqual(log["learning_rate"], log1["learning_rate"], delta=1e-5)
+
+    # -------------------------------------------------------------------
+    # Context parallel tests
+    # -------------------------------------------------------------------
+    def test_cp_equivalence(self):
+        """Test that CP produces the same losses as without CP."""
+
+        # CP doesn't work with Qwen2 (DTensor mixing error), so we use Llama here.
+        launch_args = list(TRAIN_LAUNCH_ARGS) + ["--fsdp_state_dict_type", "SHARDED_STATE_DICT"]
+        cp_script_args = [
+            "--model_name",
+            "hf-internal-testing/tiny-random-LlamaForCausalLM",
+            "--max_steps",
+            "10",
+            "--per_device_train_batch_size",
+            "1",
+            "--seed",
+            "42",
+            "--logging_steps",
+            "1",
+            "--save_strategy",
+            "no",
+            "--model_dtype",
+            "fp32",
+            "--attn_implementation",
+            "sdpa",
+            "--pad_to_multiple_of",
+            "4",
+        ]
+
+        # Step 1: Run with CP enabled (cp_size=2)
+        cp_yes_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
+        cp_yes_losses_path = cp_yes_output_dir / "cp_yes_losses.json"
+        cmd = self.get_accelerate_cmd(
+            TRAIN_SCRIPT,
+            config_file=FSDP2_CP_CONFIG_FILE,
+            launch_args=launch_args,
+            script_args=["--output_dir", str(cp_yes_output_dir), "--loss_output_file", str(cp_yes_losses_path)]
+            + cp_script_args,
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        # Step 2: Run without CP (FSDP with num_processes=1, no parallelism_config)
+        cp_no_output_dir = Path(self.get_auto_remove_tmp_dir()).resolve()
+        cp_no_losses_path = cp_no_output_dir / "cp_no_losses.json"
+
+        cmd = self.get_accelerate_cmd(
+            TRAIN_SCRIPT,
+            config_file=FSDP2_CONFIG_FILE,
+            launch_args=launch_args,
+            script_args=[
+                "--output_dir",
+                str(cp_no_output_dir),
+                "--loss_output_file",
+                str(cp_no_losses_path),
+            ]
+            + cp_script_args,
+            num_processes=1,
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+        # Compare losses
+        with open(cp_yes_losses_path) as f:
+            cp_yes_losses = json.load(f)
+        with open(cp_no_losses_path) as f:
+            cp_no_losses = json.load(f)
+
+        assert len(cp_yes_losses) == len(cp_no_losses), (
+            f"Different number of losses: CP has {len(cp_yes_losses)}, no-CP has {len(cp_no_losses)}"
+        )
+
+        cp_yes_losses_tensor = torch.tensor(cp_yes_losses)
+        cp_no_losses_tensor = torch.tensor(cp_no_losses)
+
+        torch.testing.assert_close(
+            cp_yes_losses_tensor,
+            cp_no_losses_tensor,
+            rtol=2e-2,
+            atol=2e-2,
+            msg=f"CP losses {cp_yes_losses} do not match non-CP losses {cp_no_losses}",
+        )
+
+    # -------------------------------------------------------------------
+    # FSDP eval tests
+    # -------------------------------------------------------------------
+    def test_eval(self):
+        self.check_eval(config_file=FSDP_CONFIG_FILE)
+
+    # -------------------------------------------------------------------
+    # FSDP generation tests (moved from tests/generation/test_fsdp.py)
+    # -------------------------------------------------------------------
+    def test_fsdp_generate(self):
+        cmd = self.get_accelerate_cmd(
+            FSDP_GENERATE_SCRIPT,
+            config_file=FSDP_CONFIG_FILE,
+            script_args=["--fsdp"],
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
+
+    def test_fsdp2_generate(self):
+        cmd = self.get_accelerate_cmd(
+            FSDP_GENERATE_SCRIPT,
+            config_file=FSDP2_CONFIG_FILE,
+            script_args=["--fsdp2"],
+        )
+        execute_subprocess_async(cmd, env=self.get_env())
--- a/tests/trainer/test_data_collator.py
+++ b/tests/trainer/test_data_collator.py
--- a/tests/trainer/test_trainer.py
+++ b/tests/trainer/test_trainer.py
--- a/tests/trainer/test_trainer_accelerator.py
+++ b/tests/trainer/test_trainer_accelerator.py
@@ -0,0 +1,250 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Trainer AcceleratorConfig tests: creation from dict/YAML/dataclass, partial overrides,
+gradient accumulation settings, custom AcceleratorState, and validation.
+"""
+
+import dataclasses
+import json
+import tempfile
+from pathlib import Path
+from typing import Any
+
+from accelerate import Accelerator
+from accelerate.state import AcceleratorState
+
+from transformers import Trainer, TrainingArguments
+from transformers.testing_utils import TestCasePlus, require_torch
+from transformers.trainer_pt_utils import AcceleratorConfig
+
+from .trainer_test_utils import (
+    RegressionModelConfig,
+    RegressionPreTrainedModel,
+    RegressionTrainingArguments,
+    SampleIterableDataset,
+)
+
+
+@require_torch
+class TrainerAcceleratorConfigTest(TestCasePlus):
+    def test_accelerator_config_empty(self):
+        # Checks that a config can be made with the defaults if not passed
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            config = RegressionModelConfig(a=1.5, b=2.5)
+            model = RegressionPreTrainedModel(config)
+            eval_dataset = SampleIterableDataset()
+
+            # No accelerator_config is passed, so every option keeps its default
+            args = RegressionTrainingArguments(output_dir=tmp_dir)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.split_batches, False)
+            self.assertEqual(trainer.accelerator.dispatch_batches, None)
+            self.assertEqual(trainer.accelerator.even_batches, True)
+            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
+            # gradient_accumulation_kwargs were not set, so gradient_state has no `sync_each_batch` entry
+            self.assertNotIn("sync_each_batch", trainer.accelerator.gradient_state.plugin_kwargs)
+
+    def test_accelerator_config_from_dict(self):
+        # Checks that accelerator kwargs passed as a dict are forwarded
+        # and the accelerator is initialized accordingly
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            config = RegressionModelConfig(a=1.5, b=2.5)
+            model = RegressionPreTrainedModel(config)
+            eval_dataset = SampleIterableDataset()
+
+            accelerator_config: dict[str, Any] = {
+                "split_batches": True,
+                "dispatch_batches": True,
+                "even_batches": False,
+                "use_seedable_sampler": True,
+            }
+            accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}
+
+            # Sets most options to non-default values
+            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.split_batches, True)
+            self.assertEqual(trainer.accelerator.dispatch_batches, True)
+            self.assertEqual(trainer.accelerator.even_batches, False)
+            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
+
+    def test_accelerator_config_from_yaml(self):
+        # Checks that accelerator kwargs can be loaded from a config file on disk
+        # (serialized as JSON here) and the accelerator is initialized accordingly
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            path_file = Path(tmp_dir) / "accelerator_config.json"
+            with open(path_file, "w") as f:
+                accelerator_config = {
+                    "split_batches": True,
+                    "dispatch_batches": True,
+                    "even_batches": False,
+                    "use_seedable_sampler": False,
+                }
+                json.dump(accelerator_config, f)
+            config = RegressionModelConfig(a=1.5, b=2.5)
+            model = RegressionPreTrainedModel(config)
+            eval_dataset = SampleIterableDataset()
+
+            # Sets all options to non-default values
+            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=path_file)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.split_batches, True)
+            self.assertEqual(trainer.accelerator.dispatch_batches, True)
+            self.assertEqual(trainer.accelerator.even_batches, False)
+            self.assertEqual(trainer.accelerator.use_seedable_sampler, False)
+
+    def test_accelerator_config_from_dataclass(self):
+        # Checks that accelerator kwargs passed as an `AcceleratorConfig` dataclass are forwarded
+        # and the accelerator is initialized accordingly
+
+        accelerator_config = AcceleratorConfig(
+            split_batches=True,
+            dispatch_batches=True,
+            even_batches=False,
+            use_seedable_sampler=False,
+        )
+        config = RegressionModelConfig(a=1.5, b=2.5)
+        model = RegressionPreTrainedModel(config)
+        eval_dataset = SampleIterableDataset()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.split_batches, True)
+            self.assertEqual(trainer.accelerator.dispatch_batches, True)
+            self.assertEqual(trainer.accelerator.even_batches, False)
+            self.assertEqual(trainer.accelerator.use_seedable_sampler, False)
+
+    def test_accelerate_config_from_dataclass_grad_accum(self):
+        # Checks that `num_steps` in `gradient_accumulation_kwargs` is propagated
+        # to `TrainingArguments.gradient_accumulation_steps`
+
+        grad_acc_kwargs = {
+            "num_steps": 10,
+            "adjust_scheduler": False,
+            "sync_with_dataloader": False,
+            "sync_each_batch": True,
+        }
+        accelerator_config = AcceleratorConfig(
+            split_batches=True,
+            dispatch_batches=True,
+            even_batches=False,
+            use_seedable_sampler=False,
+            gradient_accumulation_kwargs=grad_acc_kwargs,
+        )
+        config = RegressionModelConfig(a=1.5, b=2.5)
+        model = RegressionPreTrainedModel(config)
+        eval_dataset = SampleIterableDataset()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config=accelerator_config)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.args.gradient_accumulation_steps, 10)
+
+    def test_accelerator_config_from_partial(self):
+        # Checks that a partial accelerator config overrides only the given keys
+        # and leaves the remaining options at their defaults
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            config = RegressionModelConfig(a=1.5, b=2.5)
+            model = RegressionPreTrainedModel(config)
+            eval_dataset = SampleIterableDataset()
+
+            # Sets a single option to a non-default value
+            args = RegressionTrainingArguments(
+                output_dir=tmp_dir,
+                accelerator_config={
+                    "split_batches": True,
+                },
+            )
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.split_batches, True)
+            self.assertEqual(trainer.accelerator.dispatch_batches, None)
+            self.assertEqual(trainer.accelerator.even_batches, True)
+            self.assertEqual(trainer.accelerator.use_seedable_sampler, True)
+
+    def test_accelerator_custom_state(self):
+        AcceleratorState._reset_state(reset_partial_state=True)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            # `use_configured_state=True` must raise when no state has been configured beforehand.
+            with self.assertRaises(ValueError):
+                _ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
+            _ = Accelerator()
+            _ = RegressionTrainingArguments(output_dir=tmp_dir, accelerator_config={"use_configured_state": True})
+        AcceleratorState._reset_state(reset_partial_state=True)
+
+    def test_accelerator_config_from_dict_grad_accum_num_steps(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            config = RegressionModelConfig(a=1.5, b=2.5)
+            model = RegressionPreTrainedModel(config)
+            eval_dataset = SampleIterableDataset()
+
+            # case - TrainingArguments.gradient_accumulation_steps == 1
+            #      - gradient_accumulation_kwargs['num_steps'] == 1
+            # results in grad accum set to 1
+            args = RegressionTrainingArguments(
+                output_dir=tmp_dir,
+                gradient_accumulation_steps=1,
+                accelerator_config={
+                    "gradient_accumulation_kwargs": {
+                        "num_steps": 1,
+                    }
+                },
+            )
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertEqual(trainer.accelerator.gradient_state.plugin_kwargs["num_steps"], 1)
+
+            # case - TrainingArguments.gradient_accumulation_steps > 1
+            #      - gradient_accumulation_kwargs['num_steps'] specified
+            # results in exception raised
+            args = RegressionTrainingArguments(
+                output_dir=tmp_dir,
+                gradient_accumulation_steps=2,
+                accelerator_config={
+                    "gradient_accumulation_kwargs": {
+                        "num_steps": 10,
+                    }
+                },
+            )
+            with self.assertRaises(Exception) as context:
+                trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+            self.assertTrue("The `AcceleratorConfig`'s `num_steps` is set but" in str(context.exception))
+
+    def test_accelerator_config_not_instantiated(self):
+        # Checks that passing the `AcceleratorConfig` class itself (rather than an
+        # instance or a dict) raises a `NotImplementedError`
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with self.assertRaises(NotImplementedError) as context:
+                _ = RegressionTrainingArguments(
+                    output_dir=tmp_dir,
+                    accelerator_config=AcceleratorConfig,
+                )
+            self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))
+
+        # Now test with a custom subclass
+        @dataclasses.dataclass
+        class CustomAcceleratorConfig(AcceleratorConfig):
+            pass
+
+        @dataclasses.dataclass
+        class CustomTrainingArguments(TrainingArguments):
+            accelerator_config: dict = dataclasses.field(
+                default=CustomAcceleratorConfig,
+            )
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with self.assertRaises(NotImplementedError) as context:
+                _ = CustomTrainingArguments(
+                    output_dir=tmp_dir,
+                )
+            self.assertTrue("Tried passing in a callable to `accelerator_config`" in str(context.exception))
--- a/tests/trainer/test_trainer_callback.py
+++ b/tests/trainer/test_trainer_callback.py
--- a/tests/trainer/test_trainer_checkpointing.py
+++ b/tests/trainer/test_trainer_checkpointing.py
--- a/tests/trainer/test_trainer_data.py
+++ b/tests/trainer/test_trainer_data.py
@@ -0,0 +1,870 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Trainer data-related tests: dataloaders, samplers, sharding, label smoothing,
+batch size finder, pad/concatenate, collators, and eval loop container.
+"""
+
+import copy
+import tempfile
+import unittest
+import warnings
+
+import numpy as np
+import torch
+from torch import nn
+
+from transformers import (
+    GPT2Config,
+    GPT2LMHeadModel,
+    Trainer,
+    TrainingArguments,
+)
+from transformers.data.data_collator import default_data_collator as _default_data_collator
+from transformers.modeling_outputs import SequenceClassifierOutput
+from transformers.testing_utils import (
+    TestCasePlus,
+    backend_device_count,
+    require_accelerate,
+    require_torch,
+    torch_device,
+)
+from transformers.tokenization_utils_base import BatchEncoding
+from transformers.trainer_pt_utils import (
+    DistributedLengthGroupedSampler,
+    DistributedSamplerWithLoop,
+    EvalLoopContainer,
+    IterableDatasetShard,
+    LabelSmoother,
+    LengthGroupedSampler,
+    ShardSampler,
+    get_parameter_names,
+    numpy_pad_and_concatenate,
+    torch_pad_and_concatenate,
+)
+from transformers.trainer_utils import RemoveColumnsCollator, find_executable_batch_size
+
+from .trainer_test_utils import (
+    AlmostAccuracy,
+    CustomDataloaderTrainer,
+    DynamicShapesDataset,
+    RegressionDataset,
+    RegressionModel,
+    RegressionModelConfig,
+    RegressionPreTrainedModel,
+    RegressionTrainingArguments,
+    SampleIterableDataset,
+    TrainerIntegrationCommon,
+    TstLayer,
+    get_regression_trainer,
+)
+
+
+class RandomIterableDataset(torch.utils.data.IterableDataset):
+    # For testing, an iterable dataset of random length
+    def __init__(self, p_stop=0.01, max_length=1000):
+        self.p_stop = p_stop
+        self.max_length = max_length
+        self.generator = torch.Generator()
+
+    def __iter__(self):
+        count = 0
+        stop = False
+        while not stop and count < self.max_length:
+            yield count
+            count += 1
+            number = torch.rand(1, generator=self.generator).item()
+            stop = number < self.p_stop
+
+
+# ---------------------------------------------------------------------------
+# Dataloader tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerDataloaderTest(TestCasePlus):
+    """Tests for train/eval dataloaders, drop_last, persistent workers."""
+
+    def test_train_and_eval_dataloaders(self):
+        if torch_device == "cuda":
+            n_gpu = max(1, backend_device_count(torch_device))
+        else:
+            # DP is deprecated by PyTorch, and accelerators like XPU don't support DP
+            n_gpu = 1
+
+        tmp_dir = self.get_auto_remove_tmp_dir()
+        trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16, output_dir=tmp_dir)
+        self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
+        trainer = get_regression_trainer(learning_rate=0.1, per_device_eval_batch_size=16, output_dir=tmp_dir)
+        self.assertEqual(trainer.get_eval_dataloader().total_batch_size, 16 * n_gpu)
+
+        # Check drop_last works
+        trainer = get_regression_trainer(
+            train_len=66,
+            eval_len=74,
+            learning_rate=0.1,
+            per_device_train_batch_size=16,
+            per_device_eval_batch_size=32,
+            output_dir=tmp_dir,
+        )
+        self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu) + 1)
+        self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu) + 1)
+
+        trainer = get_regression_trainer(
+            train_len=66,
+            eval_len=74,
+            learning_rate=0.1,
+            per_device_train_batch_size=16,
+            per_device_eval_batch_size=32,
+            dataloader_drop_last=True,
+            output_dir=tmp_dir,
+        )
+        self.assertEqual(len(trainer.get_train_dataloader()), 66 // (16 * n_gpu))
+        self.assertEqual(len(trainer.get_eval_dataloader()), 74 // (32 * n_gpu))
+
+        # Check passing a new dataset for evaluation works
+        new_eval_dataset = RegressionDataset(length=128)
+        self.assertEqual(len(trainer.get_eval_dataloader(new_eval_dataset)), 128 // (32 * n_gpu))
+
+    # tests that we do not require dataloader to have a .dataset attribute
+    def test_dataloader_without_dataset(self):
+        train_dataset = RegressionDataset(length=128)
+        trainer = CustomDataloaderTrainer(
+            model=RegressionModel(),
+            train_dataset=train_dataset,
+            eval_dataset=train_dataset,
+            args=TrainingArguments(output_dir=self.get_auto_remove_tmp_dir()),
+        )
+
+        trainer.train()
+        trainer.evaluate()
+
+    def test_get_eval_dataloader_without_persistent_workers(self):
+        train_dataset = RegressionDataset()
+        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
+        tiny_gpt2 = GPT2LMHeadModel(config)
+        args = TrainingArguments(self.get_auto_remove_tmp_dir(), dataloader_persistent_workers=False)
+
+        # Single evaluation dataset
+        eval_dataset = RegressionDataset()
+        trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
+        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
+        trainer.accelerator.prepare = lambda x: x
+
+        default_dataloader = trainer.get_eval_dataloader()
+        dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)
+
+        self.assertEqual(default_dataloader.dataset, eval_dataset)
+        self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
+        self.assertNotEqual(default_dataloader, dataloader_with_dataset)
+
+        # Multiple evaluation datasets
+        first_dataset = RegressionDataset()
+        second_dataset = RegressionDataset()
+        trainer = Trainer(
+            tiny_gpt2,
+            args,
+            train_dataset=train_dataset,
+            eval_dataset={"first": first_dataset, "second": second_dataset},
+        )
+        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
+        trainer.accelerator.prepare = lambda x: x
+
+        first_dataloader = trainer.get_eval_dataloader("first")
+        first_dataloader_repeated = trainer.get_eval_dataloader("first")
+        second_dataloader = trainer.get_eval_dataloader("second")
+        second_dataloader_repeated = trainer.get_eval_dataloader("second")
+
+        self.assertEqual(first_dataset, first_dataloader.dataset)
+        self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
+        self.assertEqual(second_dataset, second_dataloader.dataset)
+        self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
+        self.assertNotEqual(first_dataloader, first_dataloader_repeated)
+        self.assertNotEqual(second_dataloader, second_dataloader_repeated)
+
+    def test_get_eval_dataloader_with_persistent_workers(self):
+        train_dataset = RegressionDataset()
+        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
+        tiny_gpt2 = GPT2LMHeadModel(config)
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            dataloader_persistent_workers=True,
+            dataloader_num_workers=2,
+        )
+
+        # Single evaluation dataset
+        eval_dataset = RegressionDataset()
+        trainer = Trainer(tiny_gpt2, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
+        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
+        trainer.accelerator.prepare = lambda x: x
+
+        default_dataloader = trainer.get_eval_dataloader()
+        dataloader_with_dataset = trainer.get_eval_dataloader(eval_dataset)
+
+        self.assertEqual(default_dataloader.dataset, eval_dataset)
+        self.assertEqual(dataloader_with_dataset.dataset, eval_dataset)
+        self.assertEqual(default_dataloader, dataloader_with_dataset)
+
+        # Multiple evaluation datasets
+        first_dataset = RegressionDataset()
+        second_dataset = RegressionDataset()
+        trainer = Trainer(
+            tiny_gpt2,
+            args,
+            train_dataset=train_dataset,
+            eval_dataset={"first": first_dataset, "second": second_dataset},
+        )
+        # Mocking the prepare method to avoid the dataloader changing with each call to get_eval_dataloader
+        trainer.accelerator.prepare = lambda x: x
+
+        first_dataloader = trainer.get_eval_dataloader("first")
+        first_dataloader_repeated = trainer.get_eval_dataloader("first")
+        second_dataloader = trainer.get_eval_dataloader("second")
+        second_dataloader_repeated = trainer.get_eval_dataloader("second")
+
+        self.assertEqual(first_dataset, first_dataloader.dataset)
+        self.assertEqual(first_dataloader.dataset, first_dataloader_repeated.dataset)
+        self.assertEqual(second_dataset, second_dataloader.dataset)
+        self.assertEqual(second_dataloader.dataset, second_dataloader_repeated.dataset)
+        self.assertEqual(first_dataloader, first_dataloader_repeated)
+        self.assertEqual(second_dataloader, second_dataloader_repeated)
+
+
+# ---------------------------------------------------------------------------
+# Label smoothing tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerLabelSmoothingTest(unittest.TestCase):
+    """Tests for label smoothing and its interaction with multi-label classification."""
+
+    def test_label_smoothing(self):
+        epsilon = 0.1
+        num_labels = 12
+        random_logits = torch.randn(4, 5, num_labels)
+        random_labels = torch.randint(0, num_labels, (4, 5))
+        loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
+        model_output = SequenceClassifierOutput(logits=random_logits)
+        label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
+        log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
+        expected_loss = (1 - epsilon) * loss + epsilon * log_probs.mean()
+        torch.testing.assert_close(label_smoothed_loss, expected_loss)
+
+        # With a few -100 labels
+        random_labels[0, 1] = -100
+        random_labels[2, 1] = -100
+        random_labels[2, 3] = -100
+
+        loss = nn.functional.cross_entropy(random_logits.view(-1, num_labels), random_labels.view(-1))
+        model_output = SequenceClassifierOutput(logits=random_logits)
+        label_smoothed_loss = LabelSmoother(0.1)(model_output, random_labels)
+        log_probs = -nn.functional.log_softmax(random_logits, dim=-1)
+        # Zero the log probs at the three -100 positions, leaving 4 * 5 - 3 = 17 active labels
+        log_probs[0, 1] = 0.0
+        log_probs[2, 1] = 0.0
+        log_probs[2, 3] = 0.0
+        expected_loss = (1 - epsilon) * loss + epsilon * log_probs.sum() / (num_labels * 17)
+        torch.testing.assert_close(label_smoothed_loss, expected_loss)
+
+    def test_label_smoothing_multi_label_incompatibility(self):
+        """Test that Trainer warns and disables label smoothing for multi-label classification"""
+
+        # Mock model config with multi-label classification
+        class MockConfig:
+            problem_type = "multi_label_classification"
+
+        class MockModel(nn.Module):
+            def __init__(self):
+                super().__init__()
+                self.config = MockConfig()
+                self.linear = nn.Linear(10, 3)
+
+            def forward(self, **kwargs):
+                return {"logits": torch.randn(2, 3)}
+
+        model = MockModel()
+
+        # Create training args with label smoothing
+        training_args = TrainingArguments(
+            output_dir="./test-trainer",
+            label_smoothing_factor=0.1,
+            per_device_train_batch_size=2,
+            num_train_epochs=1,
+        )
+
+        # Should warn and disable label smoothing
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter("always")
+            trainer = Trainer(model=model, args=training_args)
+
+            # Check warning was issued
+            self.assertEqual(len(w), 1)
+            self.assertIn("Label smoothing is not compatible with multi-label classification", str(w[0].message))
+
+            # Check label_smoother was disabled
+            self.assertIsNone(trainer.label_smoother)
+
+
+# ---------------------------------------------------------------------------
+# Sampler and sharding tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerSamplerTest(unittest.TestCase):
+    """Tests for length-grouped samplers, distributed samplers, iterable dataset sharding, and shard samplers."""
+
+    def test_group_by_length(self):
+        # Get some inputs of random lengths
+        lengths = torch.randint(0, 25, (100,)).tolist()
+        # Put one bigger than the others to check it ends up in first position
+        lengths[32] = 50
+
+        indices = list(LengthGroupedSampler(4, lengths=lengths))
+        # The biggest element should be first
+        self.assertEqual(lengths[indices[0]], 50)
+        # The indices should be a permutation of range(100)
+        self.assertEqual(sorted(indices), list(range(100)))
+
+    def test_group_by_length_with_dict(self):
+        # Get some inputs of random lengths
+        data = []
+        for _ in range(6):
+            input_ids = torch.randint(0, 25, (100,)).tolist()
+            data.append({"input_ids": input_ids})
+        # Put one bigger than the others to check it ends up in first position
+        data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()
+
+        indices = list(LengthGroupedSampler(4, dataset=data))
+        # The biggest element should be first
+        self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
+        # The indices should be a permutation of range(6)
+        self.assertEqual(sorted(indices), list(range(6)))
+
+    def test_group_by_length_with_batch_encoding(self):
+        # Get some inputs of random lengths
+        data = []
+        for _ in range(6):
+            input_ids = torch.randint(0, 25, (100,)).tolist()
+            data.append(BatchEncoding({"input_ids": input_ids}))
+        # Put one bigger than the others to check it ends up in first position
+        data[3]["input_ids"] = torch.randint(0, 25, (105,)).tolist()
+
+        indices = list(LengthGroupedSampler(4, dataset=data))
+        # The biggest element should be first
+        self.assertEqual(len(data[indices[0]]["input_ids"]), 105)
+        # The indices should be a permutation of range(6)
+        self.assertEqual(sorted(indices), list(range(6)))
+
+    def test_distributed_length_grouped(self):
+        # Get some inputs of random lengths
+        lengths = torch.randint(0, 25, (100,)).tolist()
+        # Put one bigger than the others to check it ends up in first position
+        lengths[32] = 50
+
+        indices_process_0 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=0, lengths=lengths))
+        indices_process_1 = list(DistributedLengthGroupedSampler(4, num_replicas=2, rank=1, lengths=lengths))
+        # The biggest element should be first
+        self.assertEqual(lengths[indices_process_0[0]], 50)
+        # The indices should be a permutation of range(100)
+        self.assertEqual(sorted(indices_process_0 + indices_process_1), list(range(100)))
+
+    def test_distributed_sampler_with_loop(self):
+        batch_size = 16
+        for length in [23, 64, 123]:
+            dataset = list(range(length))
+            shard1 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=0)
+            shard2 = DistributedSamplerWithLoop(dataset, batch_size, num_replicas=2, rank=1)
+
+            # Set seeds
+            shard1.set_epoch(0)
+            shard2.set_epoch(0)
+
+            # Sample
+            samples1 = list(shard1)
+            samples2 = list(shard2)
+
+            self.assertTrue(len(samples1) % batch_size == 0)
+            self.assertTrue(len(samples2) % batch_size == 0)
+
+            total = []
+            for sample1, sample2 in zip(samples1, samples2):
+                total += [sample1, sample2]
+
+            self.assertEqual(set(total[:length]), set(dataset))
+            self.assertEqual(set(total[length:]), set(total[: (len(total) - length)]))
+
+    def check_iterable_dataset_shard(self, dataset, batch_size, drop_last, num_processes=2, epoch=0):
+        # Set the seed for the base dataset to get the proper reference.
+        dataset.generator.manual_seed(epoch)
+        reference = list(dataset)
+
+        shards = [
+            IterableDatasetShard(
+                dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
+            )
+            for i in range(num_processes)
+        ]
+        for shard in shards:
+            shard.set_epoch(epoch)
+        shard_lists = [list(shard) for shard in shards]
+
+        for shard in shard_lists:
+            # All shards have a number of samples that is a round multiple of batch size
+            self.assertTrue(len(shard) % batch_size == 0)
+            # All shards have the same number of samples
+            self.assertEqual(len(shard), len(shard_lists[0]))
+
+        for shard in shards:
+            # All shards know the total number of samples
+            self.assertEqual(shard.num_examples, len(reference))
+
+        observed = []
+        for idx in range(0, len(shard_lists[0]), batch_size):
+            for shard in shard_lists:
+                observed += shard[idx : idx + batch_size]
+
+        # If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
+        # batch_size
+        if not drop_last:
+            while len(reference) < len(observed):
+                reference += reference
+        self.assertListEqual(observed, reference[: len(observed)])
+
+        # Check equivalence between IterableDataset and ShardSampler
+        dataset.generator.manual_seed(epoch)
+        reference = list(dataset)
+
+        sampler_shards = [
+            ShardSampler(
+                reference, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
+            )
+            for i in range(num_processes)
+        ]
+        for shard, sampler_shard in zip(shard_lists, sampler_shards):
+            self.assertListEqual(shard, list(sampler_shard))
+
+    def test_iterable_dataset_shard(self):
+        dataset = RandomIterableDataset()
+
+        self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=2, epoch=0)
+        self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=2, epoch=0)
+
+        self.check_iterable_dataset_shard(dataset, 4, drop_last=True, num_processes=3, epoch=42)
+        self.check_iterable_dataset_shard(dataset, 4, drop_last=False, num_processes=3, epoch=42)
+
+    def test_iterable_dataset_shard_with_length(self):
+        sampler_shards = [
+            IterableDatasetShard(list(range(100)), batch_size=4, drop_last=True, num_processes=2, process_index=i)
+            for i in range(2)
+        ]
+
+        # Build expected shards: each process will have batches of size 4 until there are not enough elements to
+        # form two full batches (so we stop at 96 = (100 // (4 * 2)) * (4 * 2))
+        expected_shards = [[], []]
+        current_shard = 0
+        for i in range(0, 96, 4):
+            expected_shards[current_shard].extend(list(range(i, i + 4)))
+            current_shard = 1 - current_shard
+
+        self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
+        self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])
+
+        sampler_shards = [
+            IterableDatasetShard(list(range(100)), batch_size=4, drop_last=False, num_processes=2, process_index=i)
+            for i in range(2)
+        ]
+        # When drop_last=False, the final two batches are completed by looping back to the beginning.
+        expected_shards[0].extend(list(range(96, 100)))
+        expected_shards[1].extend(list(range(0, 4)))
+
+        self.assertListEqual([list(shard) for shard in sampler_shards], expected_shards)
+        self.assertListEqual([len(shard) for shard in sampler_shards], [len(shard) for shard in expected_shards])
+
+    def check_shard_sampler(self, dataset, batch_size, drop_last, num_processes=2):
+        shards = [
+            ShardSampler(
+                dataset, batch_size=batch_size, drop_last=drop_last, num_processes=num_processes, process_index=i
+            )
+            for i in range(num_processes)
+        ]
+        shard_lists = [list(shard) for shard in shards]
+
+        for shard in shard_lists:
+            # All shards have a number of samples that is a round multiple of batch size
+            self.assertTrue(len(shard) % batch_size == 0)
+            # All shards have the same number of samples
+            self.assertEqual(len(shard), len(shard_lists[0]))
+
+        observed = []
+        for idx in range(0, len(shard_lists[0]), batch_size):
+            for shard in shard_lists:
+                observed += shard[idx : idx + batch_size]
+
+        # If drop_last is False we loop through samples at the beginning to have a size that is a round multiple of
+        # batch_size
+        reference = copy.copy(dataset)
+        if not drop_last:
+            while len(reference) < len(observed):
+                reference += reference
+        self.assertListEqual(observed, reference[: len(observed)])
+
+    def test_shard_sampler(self):
+        for n_elements in [64, 123]:
+            dataset = list(range(n_elements))
+
+            self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=2)
+            self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=2)
+
+            self.check_shard_sampler(dataset, 4, drop_last=True, num_processes=3)
+            self.check_shard_sampler(dataset, 4, drop_last=False, num_processes=3)
+
+
+# ---------------------------------------------------------------------------
+# Batch size finder tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerBatchSizeFinderTest(unittest.TestCase):
+    """Tests for the auto batch size finder (find_executable_batch_size)."""
+
+    @require_accelerate
+    def test_executable_batch_size(self):
+        batch_sizes = []
+
+        @find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=True)
+        def mock_training_loop_function(batch_size):
+            nonlocal batch_sizes
+            batch_sizes.append(batch_size)
+            if batch_size > 16:
+                raise RuntimeError("CUDA out of memory.")
+
+        mock_training_loop_function()
+        self.assertEqual(batch_sizes, [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15])
+
+    @require_accelerate
+    def test_executable_batch_size_no_search(self):
+        batch_sizes = []
+
+        @find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
+        def mock_training_loop_function(batch_size):
+            nonlocal batch_sizes
+            batch_sizes.append(batch_size)
+
+        mock_training_loop_function()
+        self.assertEqual(batch_sizes, [64])
+
+    @require_accelerate
+    def test_executable_batch_size_with_error(self):
+        @find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=False)
+        def mock_training_loop_function(batch_size):
+            raise RuntimeError("CUDA out of memory.")
+
+        with self.assertRaises(RuntimeError) as cm:
+            mock_training_loop_function()
+        self.assertEqual("CUDA out of memory.", cm.exception.args[0])
+
+
+# ---------------------------------------------------------------------------
+# Data utility tests (parameter names, pad/concat, collators, eval loop container)
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerDataUtilsTest(unittest.TestCase):
+    """Tests for get_parameter_names, pad_and_concatenate, RemoveColumnsCollator, and EvalLoopContainer."""
+
+    def test_get_parameter_names(self):
+        model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
+        # fmt: off
+        self.assertEqual(
+            get_parameter_names(model, [nn.LayerNorm]),
+            ['0.linear1.weight', '0.linear1.bias', '0.linear2.weight', '0.linear2.bias', '0.bias', '1.0.linear1.weight', '1.0.linear1.bias', '1.0.linear2.weight', '1.0.linear2.bias', '1.0.bias', '1.1.linear1.weight', '1.1.linear1.bias', '1.1.linear2.weight', '1.1.linear2.bias', '1.1.bias']
+        )
+        # fmt: on
+
+    def test_get_parameter_names_rmsnorm(self):
+        class RMSNorm(nn.Module):
+            def __init__(self, hidden_size):
+                super().__init__()
+                self.weight = nn.Parameter(torch.ones(hidden_size))
+                self.bias = nn.Parameter(torch.zeros(hidden_size))
+
+        class ModelWithRMSNorm(nn.Module):
+            def __init__(self):
+                super().__init__()
+                self.linear = nn.Linear(128, 128)
+                self.rmsnorm = RMSNorm(128)
+                self.bias = nn.Parameter(torch.zeros(128))
+
+        model = ModelWithRMSNorm()
+        # Test both type-based and name-based filtering
+        decay_parameters = get_parameter_names(model, [], ["bias", "rmsnorm"])
+
+        # Parameters that should be in weight decay
+        self.assertIn("linear.weight", decay_parameters)
+
+        # Parameters that should NOT be in weight decay
+        self.assertNotIn("linear.bias", decay_parameters)
+        self.assertNotIn("rmsnorm.weight", decay_parameters)
+        self.assertNotIn("rmsnorm.bias", decay_parameters)
+        self.assertNotIn("bias", decay_parameters)
+
+    def test_pad_and_concatenate_with_1d(self):
+        """Tests whether pad_and_concatenate works with scalars."""
+        array1 = 1.0
+        array2 = 2.0
+        result = numpy_pad_and_concatenate(array1, array2)
+        self.assertTrue(np.array_equal(np.array([1.0, 2.0]), result))
+
+        tensor1 = torch.tensor(1.0)
+        tensor2 = torch.tensor(2.0)
+        result = torch_pad_and_concatenate(tensor1, tensor2)
+        self.assertTrue(torch.equal(result, torch.Tensor([1.0, 2.0])))
+
+    def test_remove_columns_collator(self):
+        class MockLogger:
+            def __init__(self) -> None:
+                self.called = 0
+
+            def info(self, msg):
+                self.called += 1
+                self.last_msg = msg
+
+        data_batch = [
+            {"col1": 1, "col2": 2, "col3": 3},
+            {"col1": 1, "col2": 2, "col3": 3},
+        ]
+        logger = MockLogger()
+        remove_columns_collator = RemoveColumnsCollator(
+            _default_data_collator, ["col1", "col2"], logger, "model", "training"
+        )
+
+        self.assertNotIn("col3", remove_columns_collator(data_batch))
+        # check that the logging message is printed out only once
+        remove_columns_collator(data_batch)
+        remove_columns_collator(data_batch)
+        self.assertEqual(logger.called, 1)
+        self.assertIn("col3", logger.last_msg)
+
+    def test_eval_loop_container(self):
+        batch_1 = [
+            torch.ones([8, 5]),
+            {"loss": torch.tensor(1.0)},
+            (torch.ones([8, 2, 3]), torch.ones([8, 2])),
+        ]
+        batch_2 = [
+            torch.ones([4, 5]),
+            {"loss": torch.tensor(2.0)},
+            (torch.ones([4, 2, 3]), torch.ones([4, 6])),
+        ]
+
+        concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
+        concat_container.add(batch_1)
+        concat_container.add(batch_2)
+        concat_container.to_cpu_and_numpy()
+        arrays = concat_container.get_arrays()
+
+        # Test two nested batches concatenation
+        self.assertIsInstance(arrays, list)
+        self.assertEqual(len(arrays), 3)
+        self.assertIsInstance(arrays[0], np.ndarray)
+        self.assertEqual(arrays[0].shape, (12, 5))
+        self.assertIsInstance(arrays[1], dict)
+        self.assertIsInstance(arrays[1]["loss"], np.ndarray)
+        self.assertEqual(arrays[1]["loss"].shape, (2,))
+        self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0, 2.0])))
+        self.assertIsInstance(arrays[2], tuple)
+        self.assertEqual(len(arrays[2]), 2)
+        self.assertEqual(arrays[2][0].shape, (12, 2, 3))
+        self.assertEqual(arrays[2][1].shape, (12, 6))
+        # check that the first batch is padded with padding index -100 after concatenation
+        self.assertEqual(arrays[2][1][0][2], -100)
+
+        # Test two batches with no concatenation
+        list_container = EvalLoopContainer(do_nested_concat=False)
+        list_container.add(batch_1)
+        list_container.add(batch_2)
+        list_container.to_cpu_and_numpy()
+        arrays = list_container.get_arrays()
+
+        self.assertEqual(len(arrays), 2)
+        self.assertIsInstance(arrays, list)
+        np_batch_1, np_batch_2 = arrays
+
+        self.assertIsInstance(np_batch_1, list)
+        self.assertEqual(len(np_batch_1), 3)
+        self.assertIsInstance(np_batch_1[0], np.ndarray)
+        self.assertIsInstance(np_batch_1[1], dict)
+        self.assertIsInstance(np_batch_1[2], tuple)
+        self.assertEqual(np_batch_1[0].shape, (8, 5))
+        self.assertEqual(np_batch_1[1]["loss"].shape, ())
+        self.assertEqual(np_batch_1[2][0].shape, (8, 2, 3))
+        self.assertEqual(np_batch_1[2][1].shape, (8, 2))
+
+        self.assertIsInstance(np_batch_2, list)
+        self.assertEqual(len(np_batch_2), 3)
+        self.assertIsInstance(np_batch_2[0], np.ndarray)
+        self.assertIsInstance(np_batch_2[1], dict)
+        self.assertIsInstance(np_batch_2[2], tuple)
+        self.assertEqual(np_batch_2[0].shape, (4, 5))
+        self.assertEqual(np_batch_2[1]["loss"].shape, ())
+        self.assertEqual(np_batch_2[2][0].shape, (4, 2, 3))
+        self.assertEqual(np_batch_2[2][1].shape, (4, 6))
+
+        # Test no batches
+        none_arr = EvalLoopContainer(do_nested_concat=True, padding_index=-100).get_arrays()
+        self.assertIsNone(none_arr)
+
+        none_arr = EvalLoopContainer(do_nested_concat=False).get_arrays()
+        self.assertIsNone(none_arr)
+
+        # Test one batch
+        concat_container = EvalLoopContainer(do_nested_concat=True, padding_index=-100)
+        concat_container.add(batch_1)
+        arrays = concat_container.get_arrays()
+        self.assertIsInstance(arrays, list)
+        self.assertEqual(len(arrays), 3)
+        self.assertIsInstance(arrays[0], np.ndarray)
+        self.assertEqual(arrays[0].shape, (8, 5))
+        self.assertIsInstance(arrays[1], dict)
+        self.assertIsInstance(arrays[1]["loss"], np.ndarray)
+        self.assertEqual(arrays[1]["loss"].shape, ())
+        self.assertTrue(np.allclose(arrays[1]["loss"], np.array([1.0])))
+        self.assertIsInstance(arrays[2], tuple)
+        self.assertEqual(len(arrays[2]), 2)
+        self.assertEqual(arrays[2][0].shape, (8, 2, 3))
+        self.assertEqual(arrays[2][1].shape, (8, 2))
+
+
+# ---------------------------------------------------------------------------
+# Dynamic shapes and iterable dataset tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerDynamicShapesAndIterableTest(TestCasePlus, TrainerIntegrationCommon):
+    def setUp(self):
+        super().setUp()
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_dynamic_shapes(self):
+        eval_dataset = DynamicShapesDataset(batch_size=self.batch_size)
+        model = RegressionModel(a=2, b=1)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(tmp_dir)
+            trainer = Trainer(model, args, eval_dataset=eval_dataset)
+
+            # Check evaluation can run to completion
+            _ = trainer.evaluate()
+
+            # Check predictions
+            preds = trainer.predict(eval_dataset)
+            for expected, seen in zip(eval_dataset.ys, preds.label_ids):
+                self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
+                self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
+
+            for expected, seen in zip(eval_dataset.xs, preds.predictions):
+                self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
+                self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
+
+        # Same tests with eval accumulation
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(tmp_dir, eval_accumulation_steps=2)
+            trainer = Trainer(model, args, eval_dataset=eval_dataset)
+
+            # Check evaluation can run to completion
+            _ = trainer.evaluate()
+
+            # Check predictions
+            preds = trainer.predict(eval_dataset)
+            for expected, seen in zip(eval_dataset.ys, preds.label_ids):
+                self.assertTrue(np.array_equal(expected, seen[: expected.shape[0]]))
+                self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
+
+            for expected, seen in zip(eval_dataset.xs, preds.predictions):
+                self.assertTrue(np.array_equal(2 * expected + 1, seen[: expected.shape[0]]))
+                self.assertTrue(np.all(seen[expected.shape[0] :] == -100))
+
+    def test_training_iterable_dataset(self):
+        config = RegressionModelConfig()
+        model = RegressionPreTrainedModel(config)
+        # Adding one column not used by the model should have no impact
+        train_dataset = SampleIterableDataset(label_names=["labels", "extra"])
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = RegressionTrainingArguments(output_dir=tmp_dir, max_steps=4)
+            trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
+            trainer.train()
+            self.assertEqual(trainer.state.global_step, 4)
+
+            loader = trainer.get_train_dataloader()
+            self.assertIsInstance(loader, torch.utils.data.DataLoader)
+            self.assertIsInstance(loader.sampler, torch.utils.data.dataloader._InfiniteConstantSampler)
+
+    def test_evaluation_iterable_dataset(self):
+        config = RegressionModelConfig(a=1.5, b=2.5)
+        model = RegressionPreTrainedModel(config)
+        # RegressionPreTrainedModel accepts **kwargs but doesn't actually use num_items_in_batch,
+        # so disable the loss scaling that assumes the model handles token-level averaging.
+        model.accepts_loss_kwargs = False
+        # Adding one column not used by the model should have no impact
+        eval_dataset = SampleIterableDataset(label_names=["labels", "extra"])
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = RegressionTrainingArguments(output_dir=tmp_dir)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
+            results = trainer.evaluate()
+            x, y = trainer.eval_dataset.dataset.x, trainer.eval_dataset.dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+            # With a number of elements not a round multiple of the batch size
+            eval_dataset = SampleIterableDataset(length=66)
+            results = trainer.evaluate(eval_dataset)
+
+            x, y = eval_dataset.dataset.x, eval_dataset.dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss, places=6)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+    def test_predict_iterable_dataset(self):
+        config = RegressionModelConfig(a=1.5, b=2.5)
+        model = RegressionPreTrainedModel(config)
+        eval_dataset = SampleIterableDataset()
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = RegressionTrainingArguments(output_dir=tmp_dir)
+            trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset, compute_metrics=AlmostAccuracy())
+            preds = trainer.predict(trainer.eval_dataset).predictions
+            x = eval_dataset.dataset.x
+            self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
+
+            # With a number of elements not a round multiple of the batch size
+            # Adding one column not used by the model should have no impact
+            test_dataset = SampleIterableDataset(length=66, label_names=["labels", "extra"])
+            preds = trainer.predict(test_dataset).predictions
+            x = test_dataset.dataset.x
+            self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
--- a/tests/trainer/test_trainer_evaluation.py
+++ b/tests/trainer/test_trainer_evaluation.py
@@ -0,0 +1,519 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Trainer evaluation and prediction tests: evaluate, predict, batched metrics, dynamic shapes,
+iterable datasets, early stopping, FP16/BF16 full eval memory, torch.compile, and MRPC/LM eval.
+"""
+
+import gc
+import tempfile
+
+import numpy as np
+
+from transformers import (
+    AutoTokenizer,
+    TrainingArguments,
+    is_torch_available,
+)
+from transformers.testing_utils import (
+    TestCasePlus,
+    backend_device_count,
+    get_tests_dir,
+    require_torch,
+    require_torch_accelerator,
+    require_torch_bf16,
+    require_torch_fp16,
+    slow,
+    torch_device,
+)
+
+from .trainer_test_utils import (
+    PATH_SAMPLE_TEXT,
+    AlmostAccuracy,
+    AlmostAccuracyBatched,
+    RegressionDataset,
+    RegressionDictModel,
+    TrainerIntegrationCommon,
+    get_dataset,
+    get_regression_trainer,
+)
+
+
+if is_torch_available():
+    import torch
+
+    from transformers import (
+        AutoModelForCausalLM,
+        AutoModelForSequenceClassification,
+        GlueDataset,
+        GlueDataTrainingArguments,
+        Trainer,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Core evaluate / predict tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerEvaluationTest(TestCasePlus, TrainerIntegrationCommon):
+    def setUp(self):
+        super().setUp()
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_evaluate(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(a=1.5, b=2.5, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir)
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+            # With a number of elements not a round multiple of the batch size
+            trainer = get_regression_trainer(
+                a=1.5, b=2.5, eval_len=66, compute_metrics=AlmostAccuracy(), output_dir=tmp_dir
+            )
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+            # With logits preprocess
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                compute_metrics=AlmostAccuracy(),
+                preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
+                output_dir=tmp_dir,
+            )
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+    def test_predict(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(a=1.5, b=2.5, output_dir=tmp_dir)
+            preds = trainer.predict(trainer.eval_dataset).predictions
+            x = trainer.eval_dataset.x
+            self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
+
+            # With a number of elements not a round multiple of the batch size
+            trainer = get_regression_trainer(a=1.5, b=2.5, eval_len=66, output_dir=tmp_dir)
+            preds = trainer.predict(trainer.eval_dataset).predictions
+            x = trainer.eval_dataset.x
+            self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
+
+            # With more than one output of the model
+            trainer = get_regression_trainer(a=1.5, b=2.5, double_output=True, output_dir=tmp_dir)
+            preds = trainer.predict(trainer.eval_dataset).predictions
+            x = trainer.eval_dataset.x
+            self.assertEqual(len(preds), 2)
+            self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
+            self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
+
+            # With more than one output/label of the model
+            trainer = get_regression_trainer(
+                a=1.5, b=2.5, double_output=True, label_names=["labels", "labels_2"], output_dir=tmp_dir
+            )
+            outputs = trainer.predict(trainer.eval_dataset)
+            preds = outputs.predictions
+            labels = outputs.label_ids
+            x = trainer.eval_dataset.x
+            self.assertEqual(len(preds), 2)
+            self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
+            self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
+            self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
+            self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
+
+    def test_train_and_predict_loss_parity(self):
+        """
+        Tests that the loss computed during a training_step is the same as the one computed during
+        prediction_step for the same inputs.
+        """
+        model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")
+        # Create a dummy batch of inputs
+        inputs = {}
+        inputs["input_ids"] = []
+        for _ in range(4):
+            seq_len = torch.randint(32, 64, (1,)).item()
+            x = torch.randint(1, 100, (seq_len,))
+            inputs["input_ids"].append(x)
+        inputs["input_ids"] = torch.nn.utils.rnn.pad_sequence(inputs["input_ids"], batch_first=True, padding_value=0)
+        inputs["labels"] = inputs["input_ids"].clone()
+        inputs["labels"][inputs["input_ids"] == 0] = -100
+        num_items_in_batch = inputs["labels"][..., 1:].ne(-100).sum().item()
+
+        def custom_loss_func(outputs, labels, num_items_in_batch=None):
+            logits = outputs["logits"]
+            loss_fct = torch.nn.CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
+            if num_items_in_batch is not None:
+                return loss / num_items_in_batch  # divide by the number of items in the batch
+            return loss
+
+        trainer = Trainer(model, train_dataset=None, compute_loss_func=custom_loss_func)
+
+        # compute the loss on the same batch via training_step and prediction_step
+        train_loss = trainer.training_step(model, inputs, num_items_in_batch)
+        predict_loss = trainer.prediction_step(model, inputs, prediction_loss_only=True)[0]
+
+        torch.testing.assert_close(train_loss, predict_loss, atol=1e-6, rtol=0)
+
+    def test_eval_use_gather_object(self):
+        train_dataset = RegressionDataset()
+        eval_dataset = RegressionDataset()
+        model = RegressionDictModel()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(tmp_dir, eval_use_gather_object=True)
+            trainer = Trainer(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
+            trainer.train()
+            _ = trainer.evaluate()
+            _ = trainer.predict(eval_dataset)
+
+
+# ---------------------------------------------------------------------------
+# Batch eval metrics tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerBatchEvalMetricsTest(TestCasePlus, TrainerIntegrationCommon):
+    def setUp(self):
+        super().setUp()
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_evaluate_with_batch_eval_metrics(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
+            )
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+            # With a number of elements not a round multiple of the batch size
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                eval_len=66,
+                compute_metrics=AlmostAccuracyBatched(),
+                batch_eval_metrics=True,
+                output_dir=tmp_dir,
+            )
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+            # With logits preprocess
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                compute_metrics=AlmostAccuracyBatched(),
+                batch_eval_metrics=True,
+                preprocess_logits_for_metrics=lambda logits, labels: logits + 1,
+                output_dir=tmp_dir,
+            )
+            results = trainer.evaluate()
+
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            pred = 1.5 * x + 2.5
+            expected_loss = ((pred - y) ** 2).mean()
+            self.assertAlmostEqual(results["eval_loss"], expected_loss)
+            expected_acc = AlmostAccuracy()((pred + 1, y))["accuracy"]
+            self.assertAlmostEqual(results["eval_accuracy"], expected_acc)
+
+    def test_predict_with_batch_eval_metrics(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                a=1.5, b=2.5, compute_metrics=AlmostAccuracyBatched(), batch_eval_metrics=True, output_dir=tmp_dir
+            )
+            results = trainer.predict(trainer.eval_dataset)
+            preds = results.predictions
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            gt = 1.5 * x + 2.5
+            self.assertTrue(np.allclose(preds, gt))
+            expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
+            self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
+
+            # With a number of elements not a round multiple of the batch size
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                eval_len=66,
+                compute_metrics=AlmostAccuracyBatched(),
+                batch_eval_metrics=True,
+                output_dir=tmp_dir,
+            )
+            results = trainer.predict(trainer.eval_dataset)
+            preds = results.predictions
+            x, y = trainer.eval_dataset.x, trainer.eval_dataset.ys[0]
+            self.assertTrue(np.allclose(preds, 1.5 * x + 2.5))
+            expected_acc = AlmostAccuracy()((preds, y))["accuracy"]
+            self.assertAlmostEqual(results.metrics["test_accuracy"], expected_acc)
+
+            # With more than one output of the model
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                double_output=True,
+                compute_metrics=AlmostAccuracyBatched(),
+                batch_eval_metrics=True,
+                output_dir=tmp_dir,
+            )
+            preds = trainer.predict(trainer.eval_dataset).predictions
+            x = trainer.eval_dataset.x
+            self.assertEqual(len(preds), 2)
+            self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
+            self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
+
+            # With more than one output/label of the model
+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                double_output=True,
+                label_names=["labels", "labels_2"],
+                compute_metrics=AlmostAccuracyBatched(),
+                batch_eval_metrics=True,
+                output_dir=tmp_dir,
+            )
+            outputs = trainer.predict(trainer.eval_dataset)
+            preds = outputs.predictions
+            labels = outputs.label_ids
+            x = trainer.eval_dataset.x
+            self.assertEqual(len(preds), 2)
+            self.assertTrue(np.allclose(preds[0], 1.5 * x + 2.5))
+            self.assertTrue(np.allclose(preds[1], 1.5 * x + 2.5))
+            self.assertTrue(np.array_equal(labels[0], trainer.eval_dataset.ys[0]))
+            self.assertTrue(np.array_equal(labels[1], trainer.eval_dataset.ys[1]))
+
+
+# ---------------------------------------------------------------------------
+# FP16 / BF16 full eval memory tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerFullEvalMemoryTest(TestCasePlus):
+    @require_torch_fp16
+    @require_torch_accelerator
+    def test_fp16_full_eval(self):
+        # this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
+        # it's using pretty large safety margins, but small enough to detect broken functionality.
+        debug = 0
+        n_gpus = backend_device_count(torch_device)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            bs = 8
+            eval_len = 16 * n_gpus
+            # make the params somewhat big so that there will be enough RAM consumed to be able to
+            # measure things. We should get about 64KB for a+b in fp32
+            a = torch.ones(1000, bs) + 0.001
+            b = torch.ones(1000, bs) - 0.001
+
+            # 1. with fp16_full_eval disabled
+            trainer = get_regression_trainer(
+                a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
+            )
+            metrics = trainer.evaluate()
+            del trainer
+            gc.collect()
+
+            fp32_init = metrics["init_mem_gpu_alloc_delta"]
+            fp32_eval = metrics["eval_mem_gpu_alloc_delta"]
+
+            if debug:
+                print(f"fp32_init {fp32_init}")
+                print(f"fp32_eval {fp32_eval}")
+
+            # here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
+            # perfect world: fp32_init == 64<<10
+            self.assertGreater(fp32_init, 59_000)
+            # after eval should be no extra memory allocated - with a small margin (other than the peak
+            # memory consumption for the forward calculation that gets recovered)
+            # perfect world: fp32_eval == close to zero
+            self.assertLess(fp32_eval, 5_000)
+
+            # 2. with fp16_full_eval enabled
+            trainer = get_regression_trainer(
+                a=a, b=b, eval_len=eval_len, fp16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
+            )
+            metrics = trainer.evaluate()
+            fp16_init = metrics["init_mem_gpu_alloc_delta"]
+            fp16_eval = metrics["eval_mem_gpu_alloc_delta"]
+
+            if debug:
+                print(f"fp16_init {fp16_init}")
+                print(f"fp16_eval {fp16_eval}")
+
+            # here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
+            # perfect world: fp16_init == close to zero
+            self.assertLess(fp16_init, 5_000)
+            # here we put the model on device in eval and only the `half()` version of it, i.e. about 32K (again we ignore the peak margin which gets returned back)
+            # perfect world: fp16_eval == 32<<10
+            self.assertGreater(fp16_eval, 27_000)
+
+            # 3. relative comparison fp32 vs full fp16
+            # fp16_eval should be about half of fp32_init
+            # perfect world: fp32_init/2 == fp16_eval
+            self.assertAlmostEqual(fp16_eval, fp32_init / 2, delta=5_000)
+
+    @require_torch_accelerator
+    @require_torch_bf16
+    def test_bf16_full_eval(self):
+        # note: most of the logic is the same as test_fp16_full_eval
+
+        # this is a sensitive test so let's keep debugging printouts in place for quick diagnosis.
+        # it's using pretty large safety margins, but small enough to detect broken functionality.
+        debug = 0
+        n_gpus = backend_device_count(torch_device)
+
+        bs = 8
+        eval_len = 16 * n_gpus
+        # make the params somewhat big so that there will be enough RAM consumed to be able to
+        # measure things. We should get about 64KB for a+b in fp32
+        a = torch.ones(1000, bs) + 0.001
+        b = torch.ones(1000, bs) - 0.001
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            # 1. with bf16_full_eval disabled
+            trainer = get_regression_trainer(
+                a=a, b=b, eval_len=eval_len, skip_memory_metrics=False, output_dir=tmp_dir
+            )
+            metrics = trainer.evaluate()
+            del trainer
+            gc.collect()
+
+            fp32_init = metrics["init_mem_gpu_alloc_delta"]
+            fp32_eval = metrics["eval_mem_gpu_alloc_delta"]
+
+            if debug:
+                print(f"fp32_init {fp32_init}")
+                print(f"fp32_eval {fp32_eval}")
+
+            # here we expect the model to be preloaded in trainer.__init__ and consume around 64K gpu ram.
+            # perfect world: fp32_init == 64<<10
+            self.assertGreater(fp32_init, 59_000)
+            # after eval should be no extra memory allocated - with a small margin (other than the peak
+            # memory consumption for the forward calculation that gets recovered)
+            # perfect world: fp32_eval == close to zero
+            self.assertLess(fp32_eval, 5_000)
+
+            # 2. with bf16_full_eval enabled
+            trainer = get_regression_trainer(
+                a=a, b=b, eval_len=eval_len, bf16_full_eval=True, skip_memory_metrics=False, output_dir=tmp_dir
+            )
+            metrics = trainer.evaluate()
+            bf16_init = metrics["init_mem_gpu_alloc_delta"]
+            bf16_eval = metrics["eval_mem_gpu_alloc_delta"]
+
+            if debug:
+                print(f"bf16_init {bf16_init}")
+                print(f"bf16_eval {bf16_eval}")
+
+            # here we expect the model to not be preloaded in trainer.__init__, so with a small margin it should be close to 0
+            # perfect world: bf16_init == close to zero
+            self.assertLess(bf16_init, 5_000)
+            # here we put the model on device in eval and only the bf16 version of it, i.e. about 32K (again we ignore the peak margin which gets returned back)
+            # perfect world: bf16_eval == 32<<10
+            self.assertGreater(bf16_eval, 27_000)
+
+            # 3. relative comparison fp32 vs full bf16
+            # bf16_eval should be about half of fp32_init
+            # perfect world: fp32_init/2 == bf16_eval
+            self.assertAlmostEqual(bf16_eval, fp32_init / 2, delta=5_000)
+
+
+# ---------------------------------------------------------------------------
+# Slow external model eval tests
+# ---------------------------------------------------------------------------
+
+
+@require_torch
+class TrainerSlowEvalTest(TestCasePlus):
+    @slow
+    def test_trainer_eval_mrpc(self):
+        MODEL_ID = "google-bert/bert-base-cased-finetuned-mrpc"
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+        model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+        data_args = GlueDataTrainingArguments(
+            task_name="mrpc", data_dir=f"{get_tests_dir()}/fixtures/tests_samples/MRPC", overwrite_cache=True
+        )
+        eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            training_args = TrainingArguments(output_dir=tmp_dir, use_cpu=True)
+            trainer = Trainer(model=model, args=training_args, eval_dataset=eval_dataset)
+            result = trainer.evaluate()
+            self.assertLess(result["eval_loss"], 0.2)
+
+    @slow
+    def test_trainer_eval_multiple(self):
+        MODEL_ID = "openai-community/gpt2"
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+        model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
+
+        dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            training_args = TrainingArguments(
+                output_dir=tmp_dir,
+                use_cpu=True,
+                per_device_eval_batch_size=1,
+            )
+            trainer = Trainer(
+                model=model,
+                args=training_args,
+                eval_dataset={
+                    "data1": dataset,
+                    "data2": dataset,
+                },
+            )
+            result = trainer.evaluate()
+            self.assertIn("eval_data1_loss", result)
+            self.assertIn("eval_data2_loss", result)
+
+    @slow
+    def test_trainer_eval_lm(self):
+        MODEL_ID = "distilbert/distilroberta-base"
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+        dataset = get_dataset(PATH_SAMPLE_TEXT, tokenizer, 100)
+        self.assertEqual(len(dataset), 31)
--- a/tests/trainer/test_trainer_hyperparameter.py
+++ b/tests/trainer/test_trainer_hyperparameter.py
@@ -0,0 +1,308 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Trainer hyperparameter search tests: Optuna (single/multi-objective, full eval),
+Ray Tune (with client), W&B sweeps, and backend availability detection.
+"""
+
+import tempfile
+import unittest
+
+from transformers import TrainingArguments
+from transformers.hyperparameter_search import ALL_HYPERPARAMETER_SEARCH_BACKENDS, HPSearchBackend
+from transformers.testing_utils import require_optuna, require_ray, require_torch, require_wandb, torch_device
+from transformers.trainer_utils import IntervalStrategy
+from transformers.utils.hp_naming import TrialShortNamer
+
+from .trainer_test_utils import (
+    AlmostAccuracy,
+    RegressionModelConfig,
+    RegressionPreTrainedModel,
+    get_regression_trainer,
+)
+
+
+@require_torch
+@require_optuna
+class TrainerHyperParameterOptunaIntegrationTest(unittest.TestCase):
+    def setUp(self):
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_hyperparameter_search(self):
+        class MyTrialShortNamer(TrialShortNamer):
+            DEFAULTS = {"a": 0, "b": 0}
+
+        def hp_space(trial):
+            return {}
+
+        def model_init(trial):
+            if trial is not None:
+                a = trial.suggest_int("a", -4, 4)
+                b = trial.suggest_int("b", -4, 4)
+            else:
+                a = 0
+                b = 0
+            config = RegressionModelConfig(a=a, b=b, double_output=False)
+
+            return RegressionPreTrainedModel(config).to(torch_device)
+
+        def hp_name(trial):
+            return MyTrialShortNamer.shortname(trial.params)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                output_dir=tmp_dir,
+                learning_rate=0.1,
+                logging_steps=1,
+                eval_strategy=IntervalStrategy.EPOCH,
+                save_strategy=IntervalStrategy.EPOCH,
+                num_train_epochs=4,
+                disable_tqdm=True,
+                load_best_model_at_end=True,
+                run_name="test",
+                model_init=model_init,
+            )
+            trainer.hyperparameter_search(direction="minimize", hp_space=hp_space, hp_name=hp_name, n_trials=4)
+
+
+@require_torch
+@require_optuna
+class TrainerHyperParameterMultiObjectOptunaIntegrationTest(unittest.TestCase):
+    def setUp(self):
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_hyperparameter_search(self):
+        class MyTrialShortNamer(TrialShortNamer):
+            DEFAULTS = {"a": 0, "b": 0}
+
+        def hp_space(trial):
+            return {}
+
+        def model_init(trial):
+            if trial is not None:
+                a = trial.suggest_int("a", -4, 4)
+                b = trial.suggest_int("b", -4, 4)
+            else:
+                a = 0
+                b = 0
+            config = RegressionModelConfig(a=a, b=b, double_output=False)
+
+            return RegressionPreTrainedModel(config).to(torch_device)
+
+        def hp_name(trial):
+            return MyTrialShortNamer.shortname(trial.params)
+
+        def compute_objective(metrics: dict[str, float]) -> tuple[float, float]:
+            return metrics["eval_loss"], metrics["eval_accuracy"]
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                output_dir=tmp_dir,
+                learning_rate=0.1,
+                logging_steps=1,
+                eval_strategy=IntervalStrategy.EPOCH,
+                save_strategy=IntervalStrategy.EPOCH,
+                num_train_epochs=10,
+                disable_tqdm=True,
+                load_best_model_at_end=True,
+                run_name="test",
+                model_init=model_init,
+                compute_metrics=AlmostAccuracy(),
+            )
+            trainer.hyperparameter_search(
+                direction=["minimize", "maximize"],
+                hp_space=hp_space,
+                hp_name=hp_name,
+                n_trials=4,
+                compute_objective=compute_objective,
+            )
+
+
+@require_torch
+@require_optuna
+class TrainerHyperParameterOptunaIntegrationTestWithFullEval(unittest.TestCase):
+    def test_hyperparameter_search(self):
+        def hp_space(trial):
+            return {}
+
+        def model_init(trial):
+            if trial is not None:
+                a = trial.suggest_int("a", -4, 4)
+                b = trial.suggest_int("b", -4, 4)
+            else:
+                a = 0
+                b = 0
+            config = RegressionModelConfig(a=a, b=b, double_output=False)
+
+            return RegressionPreTrainedModel(config).to(torch_device)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                output_dir=tmp_dir,
+                disable_tqdm=True,
+                model_init=model_init,
+                fp16_full_eval=True,
+            )
+            trainer.hyperparameter_search(
+                direction="minimize",
+                hp_space=hp_space,
+                n_trials=2,
+            )
+
+
+@require_torch
+@require_ray
+@unittest.skip("doesn't work because of a serialization issue")
+class TrainerHyperParameterRayIntegrationTest(unittest.TestCase):
+    def setUp(self):
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def ray_hyperparameter_search(self):
+        class MyTrialShortNamer(TrialShortNamer):
+            DEFAULTS = {"a": 0, "b": 0}
+
+        def hp_space(trial):
+            from ray import tune
+
+            return {
+                "a": tune.randint(-4, 4),
+                "b": tune.randint(-4, 4),
+            }
+
+        def model_init(config):
+            if config is None:
+                a = 0
+                b = 0
+            else:
+                a = config["a"]
+                b = config["b"]
+            model_config = RegressionModelConfig(a=a, b=b, double_output=False)
+
+            return RegressionPreTrainedModel(model_config).to(torch_device)
+
+        def hp_name(params):
+            return MyTrialShortNamer.shortname(params)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                output_dir=tmp_dir,
+                learning_rate=0.1,
+                logging_steps=1,
+                eval_strategy=IntervalStrategy.EPOCH,
+                save_strategy=IntervalStrategy.EPOCH,
+                num_train_epochs=4,
+                disable_tqdm=True,
+                load_best_model_at_end=True,
+                run_name="test",
+                model_init=model_init,
+            )
+            trainer.hyperparameter_search(
+                direction="minimize", hp_space=hp_space, hp_name=hp_name, backend="ray", n_trials=4
+            )
+
+    def test_hyperparameter_search(self):
+        self.ray_hyperparameter_search()
+
+    def test_hyperparameter_search_ray_client(self):
+        import ray
+        from ray.util.client.ray_client_helpers import ray_start_client_server
+
+        with ray_start_client_server():
+            assert ray.util.client.ray.is_connected()
+            self.ray_hyperparameter_search()
+
+
+@require_torch
+@require_wandb
+class TrainerHyperParameterWandbIntegrationTest(unittest.TestCase):
+    def setUp(self):
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    def test_hyperparameter_search(self):
+        def hp_space(trial):
+            return {
+                "method": "random",
+                "metric": {},
+                "parameters": {
+                    "a": {"distribution": "uniform", "min": 1e-6, "max": 1e-4},
+                    "b": {"distribution": "int_uniform", "min": 1, "max": 6},
+                },
+            }
+
+        def model_init(config):
+            if config is None:
+                a = 0
+                b = 0
+            else:
+                a = config["a"]
+                b = config["b"]
+            model_config = RegressionModelConfig(a=a, b=b, double_output=False)
+
+            return RegressionPreTrainedModel(model_config).to(torch_device)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = get_regression_trainer(
+                output_dir=tmp_dir,
+                learning_rate=0.1,
+                logging_steps=1,
+                eval_strategy=IntervalStrategy.EPOCH,
+                save_strategy=IntervalStrategy.EPOCH,
+                num_train_epochs=4,
+                disable_tqdm=True,
+                load_best_model_at_end=True,
+                run_name="test",
+                model_init=model_init,
+            )
+            sweep_kwargs = {
+                "direction": "minimize",
+                "hp_space": hp_space,
+                "backend": "wandb",
+                "n_trials": 4,
+            }
+            best_run = trainer.hyperparameter_search(**sweep_kwargs)
+
+            self.assertIsNotNone(best_run.run_id)
+            self.assertIsNotNone(best_run.run_summary)
+            hp_keys = set(best_run.hyperparameters.keys())
+            self.assertSetEqual(hp_keys, {"a", "b", "assignments", "metric"})
+
+            # pretend restarting the process purged the environ
+            import os
+
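+            # use pop() with a default so the test does not raise KeyError when a variable is unset
+            os.environ.pop("WANDB_ENTITY", None)
+            os.environ.pop("WANDB_PROJECT", None)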
+            sweep_kwargs["sweep_id"] = best_run.run_summary
+            updated_best_run = trainer.hyperparameter_search(**sweep_kwargs)
+
+            self.assertIsNotNone(updated_best_run.run_id)
+            self.assertEqual(updated_best_run.run_summary, best_run.run_summary)
+            updated_hp_keys = set(updated_best_run.hyperparameters.keys())
+            self.assertSetEqual(updated_hp_keys, {"a", "b", "assignments", "metric"})
+
+
+class HyperParameterSearchBackendsTest(unittest.TestCase):
+    def test_hyperparameter_search_backends(self):
+        self.assertEqual(
+            list(ALL_HYPERPARAMETER_SEARCH_BACKENDS.keys()),
+            list(HPSearchBackend),
+        )
--- a/tests/trainer/test_trainer_optimizers.py
+++ b/tests/trainer/test_trainer_optimizers.py
@@ -0,0 +1,853 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Trainer optimizer and LR scheduler tests: custom optimizers, LR scheduler kwargs, cosine-with-min-lr,
+reduce-on-plateau, Adafactor, bitsandbytes (RMSProp, AdEMAMix), LOMO, GrokAdamW, schedule-free,
+GaLore, Apollo, Stable AdamW, Liger kernel, optimizer choice resolution, factory pattern detection,
+and model parameter inspection.
+"""
+
+import tempfile
+
+import numpy as np
+from parameterized import parameterized
+
+from transformers import (
+    GPT2Config,
+    GPT2LMHeadModel,
+    LlamaConfig,
+    LlamaForCausalLM,
+    Trainer,
+    TrainingArguments,
+    is_torch_available,
+)
+from transformers.testing_utils import (
+    TestCasePlus,
+    require_apollo_torch,
+    require_bitsandbytes,
+    require_galore_torch,
+    require_grokadamw,
+    require_lomo,
+    require_schedulefree,
+    require_torch,
+    require_torch_accelerator,
+    require_torch_optimi,
+)
+from transformers.trainer_utils import check_target_module_exists
+
+from .trainer_test_utils import (
+    BasicTextGenerationModel,
+    RegressionDataset,
+    RegressionModel,
+    RepeatDataset,
+    TorchTracemalloc,
+    TrainerIntegrationCommon,
+    TstLayer,
+    bytes2megabytes,
+    get_regression_trainer,
+)
+
+
+if is_torch_available():
+    import torch
+    from torch import nn
+
+_ATTN_MLP_TARGET_MODULES = [r".*attn.*", r".*mlp.*"]
+
+
+@require_torch
+class TrainerOptimizerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
+    def setUp(self):
+        super().setUp()
+        args = TrainingArguments("..")
+        self.n_epochs = args.num_train_epochs
+        self.batch_size = args.train_batch_size
+
+    # ---------------------------------------------------------------------------
+    # Helpers
+    # ---------------------------------------------------------------------------
+
+    def _get_llama_and_dataset(self):
+        config = LlamaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=3, num_attention_heads=4)
+        model = LlamaForCausalLM(config)
+        train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
+        return model, train_dataset
+
+    def _get_gpt2_and_dataset(self):
+        config = GPT2Config(vocab_size=100, n_positions=128, n_embd=32, n_layer=3, n_head=4)
+        model = GPT2LMHeadModel(config)
+        train_dataset = RepeatDataset(torch.randint(0, 100, (128,)))
+        return model, train_dataset
+
+    def _train_with_llama(self, optim, optim_target_modules=None, **extra_kwargs):
+        """Smoke-test: tiny Llama + RepeatDataset with the given optimizer."""
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+        kwargs = {"learning_rate": 1e-9, "logging_steps": 5, "optim": optim}
+        if optim_target_modules is not None:
+            kwargs["optim_target_modules"] = optim_target_modules
+        kwargs.update(extra_kwargs)
+        args = TrainingArguments(self.get_auto_remove_tmp_dir(), **kwargs)
+        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
+        trainer.train()
+        return trainer
+
+    def _check_lr_display_without_scheduler(self, optim, optim_target_modules):
+        """Verify that LR is correctly reported without an LR scheduler."""
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+        learning_rate = 1e-9
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            learning_rate=learning_rate,
+            logging_steps=5,
+            optim=optim,
+            optim_target_modules=optim_target_modules,
+        )
+        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
+        trainer.create_optimizer_and_scheduler(num_training_steps=10)
+        self.assertEqual(trainer.get_learning_rates(), [learning_rate, learning_rate])
+
+    def _check_lr_display_with_scheduler(self, optim, optim_target_modules, num_train_epochs=2):
+        """Verify warmup + cosine LR schedule: increases then decreases."""
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+        learning_rate = 2e-4
+        num_warmup_steps = 5
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            num_train_epochs=num_train_epochs,
+            learning_rate=learning_rate,
+            warmup_steps=num_warmup_steps,
+            lr_scheduler_type="cosine",
+            logging_steps=1,
+            optim=optim,
+            optim_target_modules=optim_target_modules,
+        )
+        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
+        trainer.train()
+        logs = trainer.state.log_history[1:-1]
+
+        self.assertTrue(logs[num_warmup_steps - 1]["learning_rate"] == learning_rate)
+        self.assertTrue(np.allclose(logs[-1]["learning_rate"], 0, atol=5e-6))
+
+        increasing_lrs = [
+            logs[i]["learning_rate"] < logs[i + 1]["learning_rate"]
+            for i in range(len(logs))
+            if i < num_warmup_steps - 1
+        ]
+        decreasing_lrs = [
+            logs[i]["learning_rate"] > logs[i + 1]["learning_rate"]
+            for i in range(len(logs) - 1)
+            if i >= num_warmup_steps - 1
+        ]
+
+        self.assertTrue(all(increasing_lrs))
+        self.assertTrue(all(decreasing_lrs))
+        self.assertTrue(len(decreasing_lrs) > len(increasing_lrs))
+
+    # ---------------------------------------------------------------------------
+    # Adafactor optimizer test
+    # ---------------------------------------------------------------------------
+
+    def test_adafactor_lr_none(self):
+        # test the special case where lr=None, since Trainer always requires an lr_scheduler
+
+        from transformers.optimization import Adafactor, AdafactorSchedule
+
+        train_dataset = RegressionDataset()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(tmp_dir)
+            model = RegressionModel()
+            optimizer = Adafactor(
+                model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None
+            )
+            lr_scheduler = AdafactorSchedule(optimizer)
+            trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
+            trainer.train()
+
+            # Train a default model to compare against
+            default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
+            default_trainer.train()
+
+            self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
+            self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
+            self.assertGreater(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 0)
+
+    # ---------------------------------------------------------------------------
+    # BNB optimizer tests
+    # ---------------------------------------------------------------------------
+
+    @parameterized.expand(["rmsprop_bnb", "ademamix", "ademamix_8bit", "rmsprop_bnb_8bit", "rmsprop_bnb_32bit"])
+    @require_bitsandbytes
+    def test_bnb_optim(self, optim):
+        tiny_gpt2, train_dataset = self._get_gpt2_and_dataset()
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            learning_rate=1e-9,
+            logging_steps=5,
+            logging_nan_inf_filter=False,
+            optim=optim,
+        )
+        Trainer(tiny_gpt2, args, train_dataset=train_dataset).train()
+
+    @require_bitsandbytes
+    def test_bnb_8bit_optimizer_skip_embedding(self):
+        model = BasicTextGenerationModel(8, 4)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            for name_optim in ["rmsprop_bnb_8bit", "adamw_8bit"]:
+                args = TrainingArguments(
+                    output_dir=tmp_dir,
+                    optim=name_optim,
+                )
+                trainer = Trainer(model=model, args=args)
+                optimizer = trainer.create_optimizer()
+                modules = optimizer.mng.module_weight_config_triple
+                self.assertNotEqual(len(modules), 0)
+                module, name, config = modules[0]
+                self.assertIsInstance(module, torch.nn.Embedding)
+                self.assertEqual(name, "weight")
+                self.assertDictEqual(config, {"optim_bits": 32})
+
+    # ---------------------------------------------------------------------------
+    # LOMO tests
+    # ---------------------------------------------------------------------------
+
+    @require_lomo
+    @require_torch_accelerator
+    def test_lomo(self):
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+        previous_params = {n: p.clone() for n, p in tiny_llama.named_parameters()}
+
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(), learning_rate=1e-2, logging_steps=5, optim="lomo", max_steps=20
+        )
+        Trainer(tiny_llama, args, train_dataset=train_dataset).train()
+
+        for name, param in tiny_llama.named_parameters():
+            self.assertFalse(torch.allclose(param, previous_params[name].to(param.device), rtol=1e-12, atol=1e-12))
+
+    @require_lomo
+    @require_torch_accelerator
+    def test_adalomo(self):
+        self._train_with_llama("adalomo")
+
+    # ---------------------------------------------------------------------------
+    # GrokAdamW test
+    # ---------------------------------------------------------------------------
+
+    @require_grokadamw
+    @require_torch_accelerator
+    def test_grokadamw(self):
+        self._train_with_llama("grokadamw", learning_rate=2e-5, max_steps=20)
+
+    # ---------------------------------------------------------------------------
+    # Schedule-free tests
+    # ---------------------------------------------------------------------------
+
+    @parameterized.expand([("schedule_free_adamw",), ("schedule_free_radam",)])
+    @require_schedulefree
+    @require_torch_accelerator
+    def test_schedulefree(self, optim):
+        self._train_with_llama(optim, lr_scheduler_type="constant")
+
+    # ---------------------------------------------------------------------------
+    # GaLore tests
+    # ---------------------------------------------------------------------------
+
+    def test_galore_matched_modules(self):
+        regex_patterns = [r".*.attn.*", r".*.mlp.*"]
+
+        module_names = [
+            "model.transformer.h.0.ln_1",
+            "model.transformer.h.0.attn.q_proj",
+            "model.lm_head",
+            "model.transformer.h.0.mlp.up_proj",
+        ]
+        expected_values = [False, True, False, True]
+
+        for expected_value, module_name in zip(expected_values, module_names):
+            is_module_matched, is_regex = check_target_module_exists(regex_patterns, module_name, return_is_regex=True)
+            self.assertTrue(is_module_matched == expected_value)
+            if is_module_matched:
+                self.assertTrue(is_regex)
+
+        exact_patterns = ["q_proj", "up_proj"]
+
+        module_names = [
+            "model.transformer.h.0.ln_1",
+            "model.transformer.h.0.attn.q_proj",
+            "model.lm_head",
+            "model.transformer.h.0.mlp.up_proj",
+        ]
+        expected_values = [False, True, False, True]
+
+        for expected_value, module_name in zip(expected_values, module_names):
+            is_module_matched, is_regex = check_target_module_exists(exact_patterns, module_name, return_is_regex=True)
+            self.assertTrue(is_module_matched == expected_value)
+            if is_module_matched:
+                self.assertFalse(is_regex)
+
+        simple_regex = r".*.attn.*"
+
+        module_names = [
+            "model.transformer.h.0.ln_1",
+            "model.transformer.h.0.attn.q_proj",
+            "model.lm_head",
+            "model.transformer.h.0.mlp.up_proj",
+        ]
+        expected_values = [False, True, False, False]
+
+        for expected_value, module_name in zip(expected_values, module_names):
+            is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
+            self.assertTrue(is_module_matched == expected_value)
+            if is_module_matched:
+                self.assertTrue(is_regex)
+
+        simple_regex = "model.transformer.h.0.attn.q_proj"
+
+        module_names = [
+            "model.transformer.h.0.ln_1",
+            "model.transformer.h.0.attn.q_proj",
+            "model.lm_head",
+            "model.transformer.h.0.mlp.up_proj",
+        ]
+        expected_values = [False, True, False, False]
+
+        for expected_value, module_name in zip(expected_values, module_names):
+            is_module_matched, is_regex = check_target_module_exists(simple_regex, module_name, return_is_regex=True)
+            self.assertTrue(is_module_matched == expected_value)
+            if is_module_matched:
+                self.assertFalse(is_regex)
+
+        target_modules = ["attn", "mlp"]
+
+        module_names = [
+            "model.transformer.h.0.ln_1",
+            "model.transformer.h.0.attn.q_proj",
+            "model.lm_head",
+            "model.transformer.h.0.mlp.up_proj",
+        ]
+        expected_values = [False, True, False, True]
+
+        for expected_value, module_name in zip(expected_values, module_names):
+            is_module_matched, is_regex = check_target_module_exists(target_modules, module_name, return_is_regex=True)
+            self.assertTrue(is_module_matched == expected_value)
+            if is_module_matched:
+                self.assertFalse(is_regex)
+
+    @parameterized.expand([("galore_adamw",), ("galore_adamw_layerwise",), ("galore_adamw_8bit",)])
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore(self, optim):
+        self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)
+
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore_extra_args(self):
+        self._train_with_llama(
+            "galore_adamw",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+            optim_args="rank=64, update_proj_gap=100, scale=0.10",
+        )
+
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore_layerwise_with_scheduler(self):
+        self._train_with_llama(
+            "galore_adamw_layerwise",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+            lr_scheduler_type="cosine",
+        )
+
+    @parameterized.expand(
+        [
+            (_ATTN_MLP_TARGET_MODULES,),
+            (["q_proj", "k_proj", "v_proj"],),
+            ("all-linear",),
+        ]
+    )
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore_adafactor(self, optim_target_modules):
+        upper_bound_pm = 700
+        lower_bound_pm = 650
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+
+        with tempfile.TemporaryDirectory() as tmpdir, TorchTracemalloc() as tracemalloc:
+            args = TrainingArguments(
+                tmpdir,
+                learning_rate=1e-9,
+                logging_steps=5,
+                optim="galore_adafactor",
+                optim_target_modules=optim_target_modules,
+            )
+            Trainer(tiny_llama, args, train_dataset=train_dataset).train()
+
+        galore_peak_memory = tracemalloc.peaked + bytes2megabytes(tracemalloc.begin)
+        self.assertTrue(galore_peak_memory < upper_bound_pm)
+        self.assertTrue(lower_bound_pm < galore_peak_memory)
+
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore_lr_display_without_scheduler(self):
+        self._check_lr_display_without_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)
+
+    @require_galore_torch
+    @require_torch_accelerator
+    def test_galore_lr_display_with_scheduler(self):
+        self._check_lr_display_with_scheduler("galore_adamw", _ATTN_MLP_TARGET_MODULES)
+
+    # ---------------------------------------------------------------------------
+    # Apollo tests
+    # ---------------------------------------------------------------------------
+
+    @parameterized.expand([("apollo_adamw",), ("apollo_adamw_layerwise",)])
+    @require_apollo_torch
+    @require_torch_accelerator
+    def test_apollo(self, optim):
+        self._train_with_llama(optim, optim_target_modules=_ATTN_MLP_TARGET_MODULES)
+
+    @require_apollo_torch
+    @require_torch_accelerator
+    def test_apollo_extra_args(self):
+        self._train_with_llama(
+            "apollo_adamw",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+            optim_args="proj=random,scale_type=tensor,rank=1,update_proj_gap=100,scale=128.0",
+        )
+
+    @require_apollo_torch
+    @require_torch_accelerator
+    def test_apollo_layerwise_with_scheduler(self):
+        self._train_with_llama(
+            "apollo_adamw_layerwise",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+            lr_scheduler_type="cosine",
+        )
+
+    @require_apollo_torch
+    @require_torch_accelerator
+    def test_apollo_lr_display_without_scheduler(self):
+        self._check_lr_display_without_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES)
+
+    @require_apollo_torch
+    @require_torch_accelerator
+    def test_apollo_lr_display_with_scheduler(self):
+        self._check_lr_display_with_scheduler("apollo_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)
+
+    # ---------------------------------------------------------------------------
+    # Stable AdamW tests
+    # ---------------------------------------------------------------------------
+
+    @require_torch_optimi
+    @require_torch_accelerator
+    def test_stable_adamw(self):
+        self._train_with_llama("stable_adamw", optim_target_modules=_ATTN_MLP_TARGET_MODULES)
+
+    @require_torch_optimi
+    @require_torch_accelerator
+    def test_stable_adamw_extra_args(self):
+        self._train_with_llama(
+            "stable_adamw",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+            optim_args="decouple_lr=True,max_lr=1e-3,kahan_sum=True",
+        )
+
+    @require_torch_optimi
+    @require_torch_accelerator
+    def test_stable_adamw_trainer_adamw_args(self):
+        tiny_llama, train_dataset = self._get_llama_and_dataset()
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            learning_rate=1e-9,
+            logging_steps=5,
+            weight_decay=0.001,
+            adam_beta1=0.89,
+            adam_beta2=0.98,
+            adam_epsilon=1e-8,
+            optim="stable_adamw",
+            optim_target_modules=_ATTN_MLP_TARGET_MODULES,
+        )
+        trainer = Trainer(tiny_llama, args, train_dataset=train_dataset)
+        trainer.create_optimizer_and_scheduler(num_training_steps=10)
+
+        # check StableAdamW optimizer is created with the correct parameters
+        self.assertEqual(trainer.optimizer.defaults["beta1"], args.adam_beta1)
+        self.assertEqual(trainer.optimizer.defaults["beta2"], args.adam_beta2)
+        self.assertEqual(trainer.optimizer.defaults["eps"], args.adam_epsilon)
+        self.assertEqual(trainer.optimizer.defaults["weight_decay"], args.weight_decay)
+
+    @require_torch_optimi
+    @require_torch_accelerator
+    def test_stable_adamw_lr_display_without_scheduler(self):
+        self._check_lr_display_without_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES)
+
+    @require_torch_optimi
+    @require_torch_accelerator
+    def test_stable_adamw_lr_display_with_scheduler(self):
+        self._check_lr_display_with_scheduler("stable_adamw", _ATTN_MLP_TARGET_MODULES, num_train_epochs=10)
+
+    # ---------------------------------------------------------------------------
+    # Misc optimizer tests
+    # ---------------------------------------------------------------------------
+
+    def test_optimizer_factory_pattern(self):
+        """Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
+        from transformers.trainer_optimizer import is_optimizer_factory
+
+        # Create a mock optimizer class
+        class MockComplexOptimizer(torch.optim.Optimizer):
+            def __init__(self, params, lr=1e-3):
+                defaults = {"lr": lr}
+                super().__init__(params, defaults)
+
+            def step(self, closure=None):
+                pass
+
+        # Create a factory class (simulates Muon/Dion pattern)
+        class MockOptimizerFactory:
+            def __call__(self, opt_model, **optimizer_kwargs):
+                all_params = list(opt_model.parameters())
+                return MockComplexOptimizer(all_params, **optimizer_kwargs)
+
+        # Verify is_optimizer_factory correctly identifies factories vs optimizer classes
+        self.assertFalse(is_optimizer_factory(MockComplexOptimizer))  # Optimizer class should return False
+        self.assertTrue(is_optimizer_factory(MockOptimizerFactory))  # Factory class should return True
+
+    # ---------------------------------------------------------------------------
+    # Optimizer group and learning rate inspection tests
+    # ---------------------------------------------------------------------------
+
+    def test_get_optimizer_group(self):
+        model = nn.Sequential(nn.Linear(128, 64))
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
+            # ValueError is raised if optimizer is None
+            with self.assertRaises(ValueError):
+                trainer.get_optimizer_group()
+            trainer.create_optimizer()
+            # Get groups
+            num_groups = len(trainer.get_optimizer_group())
+            self.assertEqual(num_groups, 2)
+            # Get group of parameter
+            param = next(model.parameters())
+            group = trainer.get_optimizer_group(param)
+            self.assertIn(param, group["params"])
+
+
+# ---------------------------------------------------------------------------
+# Custom optimizer and LR scheduler tests
+# ---------------------------------------------------------------------------
+
+
+class TrainerOptimizerTest(TestCasePlus):
+    def test_get_optimizer_group(self):
+        model = nn.Sequential(nn.Linear(128, 64))
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
+            # ValueError is raised if optimizer is None
+            with self.assertRaises(ValueError):
+                trainer.get_optimizer_group()
+            trainer.create_optimizer()
+            # Get groups
+            num_groups = len(trainer.get_optimizer_group())
+            self.assertEqual(num_groups, 2)
+            # Get group of parameter
+            param = next(model.parameters())
+            group = trainer.get_optimizer_group(param)
+            self.assertIn(param, group["params"])
+
+    def test_optimizer_factory_pattern(self):
+        """Test that is_optimizer_factory correctly identifies factory classes vs optimizer classes."""
+        from transformers.trainer_optimizer import is_optimizer_factory
+
+        # Create a mock optimizer class
+        class MockComplexOptimizer(torch.optim.Optimizer):
+            def __init__(self, params, lr=1e-3):
+                defaults = {"lr": lr}
+                super().__init__(params, defaults)
+
+            def step(self, closure=None):
+                pass
+
+        # Create a factory class (simulates Muon/Dion pattern)
+        class MockOptimizerFactory:
+            def __call__(self, opt_model, **optimizer_kwargs):
+                all_params = list(opt_model.parameters())
+                return MockComplexOptimizer(all_params, **optimizer_kwargs)
+
+        # Verify is_optimizer_factory correctly identifies factories vs optimizer classes
+        self.assertFalse(is_optimizer_factory(MockComplexOptimizer))  # Optimizer class should return False
+        self.assertTrue(is_optimizer_factory(MockOptimizerFactory))  # Factory class should return True
+
+    def test_custom_optimizer(self):
+        train_dataset = RegressionDataset()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(tmp_dir)
+            model = RegressionModel()
+            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
+            lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda x: 1.0)
+            trainer = Trainer(model, args, train_dataset=train_dataset, optimizers=(optimizer, lr_scheduler))
+            trainer.train()
+
+            # Train a default model to compare against
+            default_trainer = get_regression_trainer(learning_rate=0.1, output_dir=tmp_dir)
+            default_trainer.train()
+
+            self.assertFalse(torch.allclose(trainer.model.a, default_trainer.model.a))
+            self.assertFalse(torch.allclose(trainer.model.b, default_trainer.model.b))
+            self.assertEqual(trainer.optimizer.state_dict()["param_groups"][0]["lr"], 1.0)
+
+    # ---------------------------------------------------------------------------
+    # Weight decay parameter groups
+    # ---------------------------------------------------------------------------
+
+    def test_no_wd_param_group(self):
+        model = nn.Sequential(TstLayer(128), nn.ModuleList([TstLayer(128), TstLayer(128)]))
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
+            trainer.create_optimizer_and_scheduler(10)
+            wd_names = ['0.linear1.weight', '0.linear2.weight', '1.0.linear1.weight', '1.0.linear2.weight', '1.1.linear1.weight', '1.1.linear2.weight']  # fmt: skip
+            wd_params = [p for n, p in model.named_parameters() if n in wd_names]
+            no_wd_params = [p for n, p in model.named_parameters() if n not in wd_names]
+            self.assertListEqual(trainer.optimizer.param_groups[0]["params"], wd_params)
+            self.assertListEqual(trainer.optimizer.param_groups[1]["params"], no_wd_params)
+
+
+@require_torch
+class TrainerLRTest(TestCasePlus):
+    def test_get_learning_rates(self):
+        model = nn.Sequential(nn.Linear(128, 64))
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            trainer = Trainer(model=model, args=TrainingArguments(output_dir=tmp_dir))
+            with self.assertRaises(ValueError):
+                trainer.get_learning_rates()
+            trainer.create_optimizer()
+            self.assertEqual(trainer.get_learning_rates(), [5e-05, 5e-05])
+
+    def test_lr_scheduler_kwargs(self):
+        from transformers import get_polynomial_decay_schedule_with_warmup
+
+        # test scheduler kwargs passed via TrainingArguments
+        train_dataset = RegressionDataset()
+        model = RegressionModel()
+        num_steps, num_warmup_steps = 10, 2
+        extra_kwargs = {"power": 5.0, "lr_end": 1e-5}  # Non-default arguments
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                lr_scheduler_type="polynomial",
+                lr_scheduler_kwargs=extra_kwargs,
+                learning_rate=0.2,
+                warmup_steps=num_warmup_steps,
+            )
+            trainer = Trainer(model, args, train_dataset=train_dataset)
+            trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
+
+            # Checking that the scheduler was created
+            self.assertIsNotNone(trainer.lr_scheduler)
+
+            # Checking that the correct args were passed
+            sched1 = trainer.lr_scheduler
+            sched2 = get_polynomial_decay_schedule_with_warmup(
+                trainer.optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_steps, **extra_kwargs
+            )
+            self.assertEqual(sched1.lr_lambdas[0].args, sched2.lr_lambdas[0].args)
+            self.assertEqual(sched1.lr_lambdas[0].keywords, sched2.lr_lambdas[0].keywords)
+
+    def test_cosine_with_min_lr_scheduler(self):
+        train_dataset = RegressionDataset()
+        model = RegressionModel()
+        num_steps, num_warmup_steps = 10, 2
+        extra_kwargs = {"min_lr": 1e-5}  # Non-default arguments
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                lr_scheduler_type="cosine_with_min_lr",
+                lr_scheduler_kwargs=extra_kwargs,
+                learning_rate=0.2,
+                warmup_steps=num_warmup_steps,
+            )
+            trainer = Trainer(model, args, train_dataset=train_dataset)
+            trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
+
+            # Checking that the scheduler was created
+            self.assertIsNotNone(trainer.lr_scheduler)
+
+            # Check the last learning rate
+            for _ in range(num_steps):
+                trainer.lr_scheduler.step()
+            self.assertEqual(trainer.lr_scheduler.get_last_lr()[0], 1e-5)
+
+    def test_cosine_with_min_lr_schedule_with_warmup_lr_rate(self):
+        train_dataset = RegressionDataset()
+        model = RegressionModel()
+        num_steps, num_warmup_steps = 10, 2
+        extra_kwargs = {"min_lr": 1e-5}  # Non-default arguments
+        args = TrainingArguments(
+            self.get_auto_remove_tmp_dir(),
+            lr_scheduler_type="cosine_warmup_with_min_lr",
+            lr_scheduler_kwargs=extra_kwargs,
+            learning_rate=0.2,
+            warmup_steps=num_warmup_steps,
+        )
+        trainer = Trainer(model, args, train_dataset=train_dataset)
+        trainer.create_optimizer_and_scheduler(num_training_steps=num_steps)
+
+        # Checking that the scheduler was created
+        self.assertIsNotNone(trainer.lr_scheduler)
+
+        # Record the learning rate at every step, then check the warmup and final values
+        step_lrs = []
+        for _ in range(num_steps):
+            step_lrs.append(trainer.optimizer.param_groups[0]["lr"])
+            trainer.lr_scheduler.step()
+        self.assertEqual(step_lrs[0], 0.1)
+        self.assertEqual(step_lrs[1], 0.2)
+        self.assertEqual(step_lrs[-1], 1e-05)
+
+    def test_reduce_lr_on_plateau_args(self):
+        # test passed arguments for a custom ReduceLROnPlateau scheduler
+        train_dataset = RegressionDataset(length=64)
+        eval_dataset = RegressionDataset(length=64)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                eval_strategy="epoch",
+                metric_for_best_model="eval_loss",
+            )
+            model = RegressionModel()
+            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
+            lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=5, cooldown=2)
+            trainer = Trainer(
+                model,
+                args,
+                train_dataset=train_dataset,
+                eval_dataset=eval_dataset,
+                optimizers=(optimizer, lr_scheduler),
+            )
+            trainer.train()
+
+            self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
+            self.assertEqual(trainer.lr_scheduler.factor, 0.2)
+            self.assertEqual(trainer.lr_scheduler.patience, 5)
+            self.assertEqual(trainer.lr_scheduler.cooldown, 2)
+
+    def test_reduce_lr_on_plateau(self):
+        # test the ReduceLROnPlateau scheduler
+
+        class TrainerWithLRLogs(Trainer):
+            def log(self, logs):
+                # the LR is computed after metrics and does not exist for the first epoch
+                if hasattr(self.lr_scheduler, "_last_lr"):
+                    logs["learning_rate"] = self.lr_scheduler._last_lr[0]
+                super().log(logs)
+
+        train_dataset = RegressionDataset(length=64)
+        eval_dataset = RegressionDataset(length=64)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                lr_scheduler_type="reduce_lr_on_plateau",
+                eval_strategy="epoch",
+                metric_for_best_model="eval_loss",
+                num_train_epochs=10,
+                learning_rate=0.2,
+            )
+            model = RegressionModel()
+            trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
+            trainer.train()
+
+            self.assertIsInstance(trainer.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau)
+            patience = trainer.lr_scheduler.patience
+
+            logs = trainer.state.log_history[1:]
+            best_loss = logs[0]["eval_loss"]
+            bad_epochs = 0
+            for i, log in enumerate(logs[:-1]):  # Compare learning rate to next epoch's
+                loss = log["eval_loss"]
+                just_decreased = False
+                if loss > best_loss:
+                    bad_epochs += 1
+                    if bad_epochs > patience:
+                        self.assertLess(logs[i + 1]["learning_rate"], log["learning_rate"])
+                        just_decreased = True
+                        bad_epochs = 0
+                else:
+                    best_loss = loss
+                    bad_epochs = 0
+                if not just_decreased:
+                    self.assertEqual(logs[i + 1]["learning_rate"], log["learning_rate"])
+
+    def test_greedy_lr_args(self):
+        # test passed arguments for a custom GreedyLR scheduler
+        from transformers.optimization import GreedyLR
+
+        train_dataset = RegressionDataset(length=64)
+        eval_dataset = RegressionDataset(length=64)
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                eval_strategy="epoch",
+                metric_for_best_model="eval_loss",
+            )
+            model = RegressionModel()
+            optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
+            lr_scheduler = GreedyLR(optimizer, factor=0.8, patience=5, cooldown=2)
+            trainer = Trainer(
+                model,
+                args,
+                train_dataset=train_dataset,
+                eval_dataset=eval_dataset,
+                optimizers=(optimizer, lr_scheduler),
+            )
+            trainer.train()
+
+            self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
+            self.assertEqual(trainer.lr_scheduler.factor, 0.8)
+            self.assertEqual(trainer.lr_scheduler.patience, 5)
+            self.assertEqual(trainer.lr_scheduler.cooldown, 2)
+
+    def test_greedy_lr(self):
+        # test the GreedyLR scheduler
+        from transformers.optimization import GreedyLR
+
+        class TrainerWithLRLogs(Trainer):
+            def log(self, logs):
+                if hasattr(self.lr_scheduler, "_last_lr"):
+                    logs["learning_rate"] = self.lr_scheduler._last_lr[0]
+                super().log(logs)
+
+        train_dataset = RegressionDataset(length=64)
+        eval_dataset = RegressionDataset(length=64)
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(
+                tmp_dir,
+                lr_scheduler_type="greedy",
+                lr_scheduler_kwargs={"patience": 1, "factor": 0.5},
+                eval_strategy="epoch",
+                metric_for_best_model="eval_loss",
+                num_train_epochs=10,
+                learning_rate=0.2,
+            )
+            model = RegressionModel()
+            trainer = TrainerWithLRLogs(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset)
+            trainer.train()
+
+            self.assertIsInstance(trainer.lr_scheduler, GreedyLR)
+            # Verify LR was adjusted at least once during training
+            logs = trainer.state.log_history[1:]
+            lr_values = [log["learning_rate"] for log in logs if "learning_rate" in log]
+            self.assertGreater(len(set(lr_values)), 1, "GreedyLR should have adjusted the LR at least once")
--- a/tests/trainer/test_trainer_seq2seq.py
+++ b/tests/trainer/test_trainer_seq2seq.py
@@ -0,0 +1,413 @@
+# Copyright 2020 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+from pathlib import Path
+from unittest.mock import patch
+
+from transformers import (
+    AutoModelForSeq2SeqLM,
+    BertConfig,
+    BertTokenizer,
+    DataCollatorForSeq2Seq,
+    EncoderDecoderModel,
+    GenerationConfig,
+    Seq2SeqTrainer,
+    Seq2SeqTrainingArguments,
+    T5Tokenizer,
+)
+from transformers.testing_utils import (
+    ExtendSysPath,
+    TestCasePlus,
+    backend_device_count,
+    execute_subprocess_async,
+    get_torch_dist_unique_port,
+    require_bitsandbytes,
+    require_sentencepiece,
+    require_torch,
+    require_torch_multi_accelerator,
+    require_torch_non_multi_accelerator,
+    slow,
+    torch_device,
+)
+from transformers.trainer_callback import TrainerState
+from transformers.trainer_utils import set_seed
+from transformers.utils import is_datasets_available, is_torch_available
+
+
+if is_datasets_available():
+    import datasets
+
+if is_torch_available():
+    import torch
+
+
+set_seed(42)
+MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"
+MBART_TINY = "sshleifer/tiny-mbart"
+
+
+@require_sentencepiece
+class Seq2seqTrainerTester(TestCasePlus):
+    @slow
+    @require_torch
+    def test_finetune_bert2bert(self):
+        bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
+            "prajjwal1/bert-tiny",
+            "prajjwal1/bert-tiny",
+            encoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
+            decoder_config=BertConfig.from_pretrained("prajjwal1/bert-tiny"),
+            dtype=torch.float32,
+        )
+        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
+
+        bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size
+        tokenizer.eos_token_id = tokenizer.sep_token_id
+        bert2bert.generation_config.decoder_start_token_id = tokenizer.cls_token_id
+        bert2bert.generation_config.max_length = 128
+
+        train_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:1%]")
+        val_dataset = datasets.load_dataset("abisee/cnn_dailymail", "3.0.0", split="validation[:1%]")
+
+        train_dataset = train_dataset.select(range(32))
+        val_dataset = val_dataset.select(range(16))
+
+        batch_size = 4
+
+        def _map_to_encoder_decoder_inputs(batch):
+            # The BERT tokenizer automatically wraps the text as [CLS] text [SEP] (used here as BOS/EOS)
+            inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
+            outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
+            batch["input_ids"] = inputs.input_ids
+            batch["attention_mask"] = inputs.attention_mask
+
+            batch["decoder_input_ids"] = outputs.input_ids
+            batch["labels"] = outputs.input_ids.copy()
+            batch["labels"] = [
+                [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
+            ]
+            batch["decoder_attention_mask"] = outputs.attention_mask
+
+            assert all(len(x) == 512 for x in inputs.input_ids)
+            assert all(len(x) == 128 for x in outputs.input_ids)
+
+            return batch
+
+        def _compute_metrics(pred):
+            labels_ids = pred.label_ids
+            pred_ids = pred.predictions
+
+            # Replace -100 (ignore index) with pad_token_id before decoding
+            import numpy as np
+
+            labels_ids = np.where(labels_ids == -100, tokenizer.pad_token_id, labels_ids)
+
+            # all unnecessary tokens are removed
+            pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
+            label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
+
+            accuracy = sum(int(pred_str[i] == label_str[i]) for i in range(len(pred_str))) / len(pred_str)
+
+            return {"accuracy": accuracy}
+
+        # map train dataset
+        train_dataset = train_dataset.map(
+            _map_to_encoder_decoder_inputs,
+            batched=True,
+            batch_size=batch_size,
+            remove_columns=["article", "highlights"],
+        )
+        train_dataset.set_format(
+            type="torch",
+            columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
+        )
+
+        # same for validation dataset
+        val_dataset = val_dataset.map(
+            _map_to_encoder_decoder_inputs,
+            batched=True,
+            batch_size=batch_size,
+            remove_columns=["article", "highlights"],
+        )
+        val_dataset.set_format(
+            type="torch",
+            columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
+        )
+
+        output_dir = self.get_auto_remove_tmp_dir()
+
+        training_args = Seq2SeqTrainingArguments(
+            output_dir=output_dir,
+            per_device_train_batch_size=batch_size,
+            per_device_eval_batch_size=batch_size,
+            predict_with_generate=True,
+            eval_strategy="steps",
+            do_train=True,
+            do_eval=True,
+            warmup_steps=0,
+            eval_steps=2,
+            logging_steps=2,
+        )
+
+        # instantiate trainer
+        trainer = Seq2SeqTrainer(
+            model=bert2bert,
+            args=training_args,
+            compute_metrics=_compute_metrics,
+            train_dataset=train_dataset,
+            eval_dataset=val_dataset,
+            processing_class=tokenizer,
+        )
+
+        # start training
+        trainer.train()
+
+    @slow
+    @require_torch
+    def test_return_sequences(self):
+        # Tests that the number of generated sequences is correct when num_return_sequences > 1,
+        # which in effect checks that `accelerator.gather()` is used instead of `gather_for_metrics`
+        INPUT_COLUMN = "question"
+        TARGET_COLUMN = "answer"
+        MAX_INPUT_LENGTH = 256
+        MAX_TARGET_LENGTH = 256
+
+        dataset = datasets.load_dataset("openai/gsm8k", "main", split="train[:38]")
+        model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
+        tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
+        data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
+        gen_config = GenerationConfig.from_pretrained(
+            "google-t5/t5-small", max_length=None, min_length=None, max_new_tokens=256, min_new_tokens=1, num_beams=5
+        )
+
+        training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True)
+
+        trainer = Seq2SeqTrainer(
+            model=model,
+            args=training_args,
+            processing_class=tokenizer,
+            data_collator=data_collator,
+            compute_metrics=lambda x: {"samples": x[0].shape[0]},
+        )
+
+        def prepare_data(examples):
+            # Tokenize the question/answer columns; the tokenized answers become the labels
+            inputs = examples[INPUT_COLUMN]
+            targets = examples[TARGET_COLUMN]
+
+            model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
+            labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
+            model_inputs["labels"] = labels["input_ids"]
+
+            return model_inputs
+
+        prepared_dataset = dataset.map(prepare_data, batched=True, remove_columns=[INPUT_COLUMN, TARGET_COLUMN])
+        dataset_len = len(prepared_dataset)  # 38
+
+        for num_return_sequences in range(3, 0, -1):
+            gen_config.num_return_sequences = num_return_sequences
+            metrics = trainer.evaluate(eval_dataset=prepared_dataset, generation_config=gen_config)
+            assert metrics["eval_samples"] == dataset_len * num_return_sequences, (
+                f"Got {metrics['eval_samples']}, expected: {dataset_len * num_return_sequences}"
+            )
+
+    @require_torch
+    def test_bad_generation_config_fail_early(self):
+        # Tests that a bad generation config causes the trainer to fail early
+        model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
+        tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
+        data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt", padding="longest")
+        gen_config = GenerationConfig(do_sample=False, top_p=0.9)  # bad: top_p is not compatible with do_sample=False
+
+        training_args = Seq2SeqTrainingArguments(".", predict_with_generate=True, generation_config=gen_config)
+        with self.assertRaises(ValueError) as exc:
+            _ = Seq2SeqTrainer(
+                model=model,
+                args=training_args,
+                processing_class=tokenizer,
+                data_collator=data_collator,
+                compute_metrics=lambda x: {"samples": x[0].shape[0]},
+            )
+        self.assertIn("Fix these issues to train your model", str(exc.exception))
+
+
+@require_torch
+class TestTranslationExample(TestCasePlus):
+    """Tests for the run_translation.py example script (seq2seq training via CLI)."""
+
+    @classmethod
+    def setUpClass(cls):
+        super().setUpClass()
+        examples_dir = Path(__file__).resolve().parents[2] / "examples" / "pytorch" / "translation"
+        with ExtendSysPath(str(examples_dir)):
+            from run_translation import main as _main
+
+            cls._run_translation_main = staticmethod(_main)
+
+    def _run_translation(
+        self,
+        distributed=False,
+        extra_args_str=None,
+        predict_with_generate=True,
+        do_train=True,
+        do_eval=True,
+        do_predict=True,
+        n_gpus_to_use=None,
+    ):
+        data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
+        output_dir = self.get_auto_remove_tmp_dir()
+        args = f"""
+            --model_name_or_path {MBART_TINY}
+            --train_file {data_dir}/train.json
+            --validation_file {data_dir}/val.json
+            --test_file {data_dir}/test.json
+            --output_dir {output_dir}
+            --max_train_samples 8
+            --max_source_length 12
+            --max_target_length 12
+            --do_train
+            --num_train_epochs 1
+            --per_device_train_batch_size 4
+            --learning_rate 3e-3
+            --warmup_steps 8
+            --logging_steps 0
+            --logging_strategy no
+            --save_steps 1
+            --train_sampling_strategy group_by_length
+            --label_smoothing_factor 0.1
+            --target_lang ro_RO
+            --source_lang en_XX
+            --report_to none
+        """.split()
+
+        if do_eval:
+            args += """
+                --do_eval
+                --per_device_eval_batch_size 4
+                --max_eval_samples 8
+                --val_max_target_length 12
+                --eval_strategy steps
+                --eval_steps 1
+            """.split()
+
+        if do_predict:
+            args += ["--do_predict"]
+
+        if predict_with_generate:
+            args += ["--predict_with_generate"]
+
+        if do_train:
+            args += ["--optim", "adafactor"]
+
+        if extra_args_str is not None:
+            args += extra_args_str.split()
+
+        if distributed:
+            if n_gpus_to_use is None:
+                n_gpus_to_use = backend_device_count(torch_device)
+            master_port = get_torch_dist_unique_port()
+            distributed_args = f"""
+                -m torch.distributed.run
+                --nproc_per_node={n_gpus_to_use}
+                --master_port={master_port}
+                {self.examples_dir_str}/pytorch/translation/run_translation.py
+            """.split()
+            cmd = [sys.executable] + distributed_args + args
+            execute_subprocess_async(cmd, env=self.get_env())
+        else:
+            testargs = ["run_translation.py"] + args
+            with patch.object(sys, "argv", testargs):
+                self._run_translation_main()
+
+        return output_dir
+
+    @require_torch_non_multi_accelerator
+    def test_run_seq2seq_no_dist(self):
+        output_dir = self._run_translation()
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log]
+        first_step_stats = eval_metrics[0]
+        assert "eval_bleu" in first_step_stats
+
+    @require_torch_multi_accelerator
+    def test_run_seq2seq_dp(self):
+        output_dir = self._run_translation(distributed=False)
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log]
+        first_step_stats = eval_metrics[0]
+        assert "eval_bleu" in first_step_stats
+
+    @require_torch_multi_accelerator
+    def test_run_seq2seq_ddp(self):
+        output_dir = self._run_translation(distributed=True)
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log]
+        first_step_stats = eval_metrics[0]
+        assert "eval_bleu" in first_step_stats
+
+    @slow
+    def test_run_seq2seq_slow(self):
+        output_dir = self._run_translation(
+            extra_args_str=f"--model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 10 --max_source_length 128 --max_target_length 128 --eval_steps 2 --save_steps 2",
+        )
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log]
+        first_step_stats = eval_metrics[0]
+        last_step_stats = eval_metrics[-1]
+        assert first_step_stats["eval_loss"] > last_step_stats["eval_loss"], "model learned nothing"
+        assert isinstance(last_step_stats["eval_bleu"], float)
+        contents = {os.path.basename(p) for p in os.listdir(output_dir)}
+        assert "generated_predictions.txt" in contents
+        assert "predict_results.json" in contents
+
+    @slow
+    @require_bitsandbytes
+    def test_run_seq2seq_bnb(self):
+        from transformers.training_args import OptimizerNames
+
+        def train_and_return_metrics(optim: str) -> tuple[int, int, float]:
+            output_dir = self._run_translation(
+                distributed=True,
+                extra_args_str=f"--skip_memory_metrics 0 --model_name_or_path {MARIAN_MODEL} --learning_rate 3e-4 --num_train_epochs 1 --optim {optim} --max_source_length 128 --max_target_length 128",
+                do_eval=False,
+                do_predict=False,
+                n_gpus_to_use=1,
+            )
+            logs = TrainerState.load_from_json(Path(output_dir, "trainer_state.json")).log_history
+            gpu_peak_mem_mb = int(logs[0]["train_mem_gpu_peaked_delta"] / 2**20)
+            gpu_alloc_mem_mb = int(logs[0]["train_mem_gpu_alloc_delta"] / 2**20)
+            loss = logs[0]["train_loss"]
+            return gpu_peak_mem_mb, gpu_alloc_mem_mb, loss
+
+        gpu_peak_mem_orig, gpu_alloc_mem_orig, loss_orig = train_and_return_metrics(OptimizerNames.ADAMW_TORCH.value)
+        gpu_peak_mem_bnb, gpu_alloc_mem_bnb, loss_bnb = train_and_return_metrics(OptimizerNames.ADAMW_BNB.value)
+
+        gpu_alloc_mem_diff = gpu_alloc_mem_orig - gpu_alloc_mem_bnb
+        gpu_total_mem_orig = gpu_peak_mem_orig + gpu_alloc_mem_orig
+        gpu_total_mem_bnb = gpu_peak_mem_bnb + gpu_alloc_mem_bnb
+        gpu_total_mem_diff = gpu_total_mem_orig - gpu_total_mem_bnb
+
+        expected_savings = 120
+        self.assertGreater(
+            gpu_alloc_mem_diff,
+            expected_savings,
+            f"should use ~150MB less alloc gpu memory with BNB (threshold: {expected_savings}MB), but got diff={gpu_alloc_mem_diff}MB",
+        )
+        self.assertGreater(
+            gpu_total_mem_diff,
+            expected_savings,
+            f"should use ~150MB less total gpu memory with BNB (threshold: {expected_savings}MB), but got diff={gpu_total_mem_diff}MB",
+        )
+        self.assertAlmostEqual(loss_orig, loss_bnb, 5, f"loss should be the same: {loss_orig} vs {loss_bnb}")
--- a/tests/trainer/test_training_args.py
+++ b/tests/trainer/test_training_args.py
@@ -0,0 +1,406 @@
+import dataclasses
+import os
+import tempfile
+import unittest
+from unittest.mock import patch
+
+import torch
+
+from transformers import TrainingArguments
+from transformers.debug_utils import DebugOption
+from transformers.trainer_utils import HubStrategy, IntervalStrategy, SaveStrategy, SchedulerType
+from transformers.training_args import OptimizerNames
+
+
+class TestTrainingArguments(unittest.TestCase):
+    def test_default_output_dir(self):
+        """Test that output_dir defaults to 'trainer_output' when not specified."""
+        args = TrainingArguments(output_dir=None)
+        self.assertEqual(args.output_dir, "trainer_output")
+
+    def test_custom_output_dir(self):
+        """Test that output_dir is respected when specified."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            args = TrainingArguments(output_dir=tmp_dir)
+            self.assertEqual(args.output_dir, tmp_dir)
+
+    def test_output_dir_creation(self):
+        """Test that output_dir is created only when needed."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            output_dir = os.path.join(tmp_dir, "test_output")
+
+            # Directory should not exist before creating args
+            self.assertFalse(os.path.exists(output_dir))
+
+            # Create args with save_strategy="no" - should not create directory
+            args = TrainingArguments(
+                output_dir=output_dir,
+                do_train=True,
+                save_strategy="no",
+                report_to=None,
+            )
+            self.assertFalse(os.path.exists(output_dir))
+
+            # Now set save_strategy="steps" - should create directory when needed
+            args.save_strategy = "steps"
+            args.save_steps = 1
+            self.assertFalse(os.path.exists(output_dir))  # Still shouldn't exist
+
+            # Directory should be created when actually needed (e.g. in Trainer)
+
+    def test_torch_empty_cache_steps_requirements(self):
+        """Test that torch_empty_cache_steps is a positive integer or None."""
+
+        # None is acceptable (feature is disabled):
+        args = TrainingArguments(torch_empty_cache_steps=None)
+        self.assertIsNone(args.torch_empty_cache_steps)
+
+        # non-int is unacceptable:
+        with self.assertRaises(ValueError):
+            TrainingArguments(torch_empty_cache_steps=1.0)
+        with self.assertRaises(ValueError):
+            TrainingArguments(torch_empty_cache_steps="none")
+
+        # negative int is unacceptable:
+        with self.assertRaises(ValueError):
+            TrainingArguments(torch_empty_cache_steps=-1)
+
+        # zero is unacceptable:
+        with self.assertRaises(ValueError):
+            TrainingArguments(torch_empty_cache_steps=0)
+
+        # positive int is acceptable:
+        args = TrainingArguments(torch_empty_cache_steps=1)
+        self.assertEqual(args.torch_empty_cache_steps, 1)
+
+    def test_output_dir_expands_user(self):
+        """Test that ~ in output_dir is expanded to the user's home directory."""
+        args = TrainingArguments(output_dir="~/foo", report_to=None)
+        self.assertEqual(args.output_dir, os.path.expanduser("~/foo"))
+
+    def test_enum_coercions(self):
+        """Test that string values are correctly converted to their enum types."""
+        args = TrainingArguments(
+            output_dir="tmp",
+            eval_strategy="steps",
+            eval_steps=10,
+            logging_strategy="steps",
+            save_strategy="epoch",
+            hub_strategy="end",
+            lr_scheduler_type="linear",
+            optim="adamw_torch",
+            report_to=None,
+        )
+        self.assertEqual(args.eval_strategy, IntervalStrategy.STEPS)
+        self.assertEqual(args.logging_strategy, IntervalStrategy.STEPS)
+        self.assertEqual(args.save_strategy, SaveStrategy.EPOCH)
+        self.assertEqual(args.hub_strategy, HubStrategy.END)
+        self.assertEqual(args.lr_scheduler_type, SchedulerType.LINEAR)
+        self.assertEqual(args.optim, OptimizerNames.ADAMW_TORCH)
+
+        # Invalid string should raise ValueError
+        with self.assertRaises(ValueError):
+            TrainingArguments(output_dir="tmp", eval_strategy="invalid_strategy", report_to=None)
+
+    def test_do_eval_auto_enabled(self):
+        """Test that do_eval is automatically set to True when eval_strategy is not 'no'."""
+        args = TrainingArguments(
+            output_dir="tmp",
+            do_eval=False,
+            eval_strategy="steps",
+            eval_steps=10,
+            report_to=None,
+        )
+        self.assertTrue(args.do_eval)
+
+    def test_eval_steps_fallback_to_logging_steps(self):
+        """Test that eval_steps falls back to logging_steps when not specified."""
+        args = TrainingArguments(
+            output_dir="tmp",
+            eval_strategy="steps",
+            logging_steps=10,
+            report_to=None,
+        )
+        self.assertEqual(args.eval_steps, 10)
+
+    def test_eval_steps_required_when_strategy_steps(self):
+        """Test that eval_strategy='steps' with logging_steps=0 raises ValueError."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                eval_strategy="steps",
+                logging_steps=0,
+                report_to=None,
+            )
+
+    def test_logging_steps_required_nonzero(self):
+        """Test that logging_strategy='steps' with logging_steps=0 raises ValueError."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                logging_strategy="steps",
+                logging_steps=0,
+                report_to=None,
+            )
+
+    def test_steps_must_be_integer_when_greater_than_one(self):
+        """Test that fractional steps >1 raise ValueError, but <=1 are allowed."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                logging_strategy="steps",
+                logging_steps=10.5,
+                report_to=None,
+            )
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                eval_strategy="steps",
+                eval_steps=10.5,
+                report_to=None,
+            )
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                save_strategy="steps",
+                save_steps=10.5,
+                report_to=None,
+            )
+        # Fractional values <=1 (ratios) are allowed
+        args = TrainingArguments(
+            output_dir="tmp",
+            logging_strategy="steps",
+            logging_steps=0.5,
+            report_to=None,
+        )
+        self.assertEqual(args.logging_steps, 0.5)
+
+    def test_load_best_model_requires_matching_strategies(self):
+        """Test load_best_model_at_end validation for strategy and step compatibility."""
+        # Mismatched eval/save strategy should raise
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                load_best_model_at_end=True,
+                eval_strategy="steps",
+                eval_steps=10,
+                save_strategy="epoch",
+                report_to=None,
+            )
+
+        # save_steps not a multiple of eval_steps should raise
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                load_best_model_at_end=True,
+                eval_strategy="steps",
+                eval_steps=10,
+                save_strategy="steps",
+                save_steps=15,
+                report_to=None,
+            )
+
+        # Valid: matching strategies with compatible steps should not raise
+        args = TrainingArguments(
+            output_dir="tmp",
+            load_best_model_at_end=True,
+            eval_strategy="steps",
+            eval_steps=10,
+            save_strategy="steps",
+            save_steps=20,
+            report_to=None,
+        )
+        self.assertTrue(args.load_best_model_at_end)
+
+    def test_metric_for_best_model_defaults(self):
+        """Test default metric_for_best_model and greater_is_better behavior."""
+        # load_best_model_at_end with no metric → defaults to "loss"
+        args = TrainingArguments(
+            output_dir="tmp",
+            load_best_model_at_end=True,
+            eval_strategy="epoch",
+            save_strategy="epoch",
+            report_to=None,
+        )
+        self.assertEqual(args.metric_for_best_model, "loss")
+        self.assertFalse(args.greater_is_better)
+
+        # metric ending in "loss" → greater_is_better is False
+        args = TrainingArguments(
+            output_dir="tmp",
+            load_best_model_at_end=True,
+            eval_strategy="epoch",
+            save_strategy="epoch",
+            metric_for_best_model="eval_loss",
+            report_to=None,
+        )
+        self.assertFalse(args.greater_is_better)
+
+        # metric not ending in "loss" → greater_is_better is True
+        args = TrainingArguments(
+            output_dir="tmp",
+            load_best_model_at_end=True,
+            eval_strategy="epoch",
+            save_strategy="epoch",
+            metric_for_best_model="accuracy",
+            report_to=None,
+        )
+        self.assertTrue(args.greater_is_better)
+
+    def test_fp16_bf16_mutual_exclusivity(self):
+        """Test that fp16 and bf16 cannot both be True."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(output_dir="tmp", fp16=True, bf16=True, report_to=None)
+        with self.assertRaises(ValueError):
+            TrainingArguments(output_dir="tmp", fp16_full_eval=True, bf16_full_eval=True, report_to=None)
+
+    def test_reduce_on_plateau_requires_eval(self):
+        """Test that reduce_lr_on_plateau scheduler requires an eval strategy."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                lr_scheduler_type="reduce_lr_on_plateau",
+                eval_strategy="no",
+                report_to=None,
+            )
+
+    def test_torch_compile_auto_enable(self):
+        """Test that torch_compile is auto-enabled when mode or backend is set."""
+        args = TrainingArguments(
+            output_dir="tmp",
+            torch_compile_mode="reduce-overhead",
+            report_to=None,
+        )
+        self.assertTrue(args.torch_compile)
+
+        args = TrainingArguments(
+            output_dir="tmp",
+            torch_compile_backend="inductor",
+            report_to=None,
+        )
+        self.assertTrue(args.torch_compile)
+
+        # Default backend when torch_compile=True
+        args = TrainingArguments(
+            output_dir="tmp",
+            torch_compile=True,
+            report_to=None,
+        )
+        self.assertEqual(args.torch_compile_backend, "inductor")
+
+    def test_report_to_none_handling(self):
+        """Test report_to normalization for 'none' and string values."""
+        args = TrainingArguments(output_dir="tmp", report_to="none")
+        self.assertEqual(args.report_to, [])
+
+        args = TrainingArguments(output_dir="tmp", report_to=["none"])
+        self.assertEqual(args.report_to, [])
+
+        args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
+        self.assertEqual(args.report_to, ["tensorboard"])
+
+    def test_kubeflow_auto_enable(self):
+        """Test that kubeflow is auto-enabled when KUBEFLOW_TRAINER_SERVER_URL is set."""
+        with patch.dict(os.environ, {"KUBEFLOW_TRAINER_SERVER_URL": "https://test-url"}, clear=False):
+            # Should auto-add kubeflow when report_to is "none" (default)
+            args = TrainingArguments(output_dir="tmp", report_to="none")
+            self.assertIn("kubeflow", args.report_to)
+
+            # Should auto-add kubeflow to existing list
+            args = TrainingArguments(output_dir="tmp", report_to="tensorboard")
+            self.assertIn("kubeflow", args.report_to)
+            self.assertIn("tensorboard", args.report_to)
+
+            # Should not duplicate if already present
+            args = TrainingArguments(output_dir="tmp", report_to=["kubeflow", "tensorboard"])
+            self.assertEqual(args.report_to.count("kubeflow"), 1)
+
+        # Should not add kubeflow when env var is not set
+        with patch.dict(os.environ, {}, clear=True):
+            args = TrainingArguments(output_dir="tmp", report_to="none")
+            self.assertNotIn("kubeflow", args.report_to)
+
+    def test_warmup_steps_validation(self):
+        """Test warmup_steps validation for negative values."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(output_dir="tmp", warmup_steps=-1, report_to=None)
+
+        # Zero and fractional values are valid
+        args = TrainingArguments(output_dir="tmp", warmup_steps=0, report_to=None)
+        self.assertEqual(args.warmup_steps, 0)
+
+        args = TrainingArguments(output_dir="tmp", warmup_steps=0.5, report_to=None)
+        self.assertEqual(args.warmup_steps, 0.5)
+
+    def test_debug_option_parsing(self):
+        """Test debug string parsing into DebugOption enum list."""
+        args = TrainingArguments(output_dir="tmp", debug="underflow_overflow", report_to=None)
+        self.assertEqual(args.debug, [DebugOption.UNDERFLOW_OVERFLOW])
+
+        args = TrainingArguments(output_dir="tmp", debug=None, report_to=None)
+        self.assertEqual(args.debug, [])
+
+    def test_dataloader_prefetch_requires_workers(self):
+        """Test that dataloader_prefetch_factor requires num_workers > 0."""
+        with self.assertRaises(ValueError):
+            TrainingArguments(
+                output_dir="tmp",
+                dataloader_prefetch_factor=2,
+                dataloader_num_workers=0,
+                report_to=None,
+            )
+        # Valid: prefetch with workers > 0
+        args = TrainingArguments(
+            output_dir="tmp",
+            dataloader_prefetch_factor=2,
+            dataloader_num_workers=2,
+            report_to=None,
+        )
+        self.assertEqual(args.dataloader_prefetch_factor, 2)
+
+    def test_use_cpu_disables_pin_memory(self):
+        """Test that use_cpu=True disables dataloader_pin_memory."""
+        args = TrainingArguments(output_dir="tmp", use_cpu=True, report_to=None)
+        self.assertFalse(args.dataloader_pin_memory)
+
+    def test_include_num_input_tokens_seen_coercion(self):
+        """Test bool-to-string coercion for include_num_input_tokens_seen."""
+        args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=True, report_to=None)
+        self.assertEqual(args.include_num_input_tokens_seen, "all")
+
+        args = TrainingArguments(output_dir="tmp", include_num_input_tokens_seen=False, report_to=None)
+        self.assertEqual(args.include_num_input_tokens_seen, "no")
+
+    def test_dict_field_parsing(self):
+        """Test that JSON string dict fields are parsed into dicts."""
+        args = TrainingArguments(output_dir="tmp", lr_scheduler_kwargs='{"factor": 0.5}', report_to=None)
+        self.assertEqual(args.lr_scheduler_kwargs, {"factor": 0.5})
+
+    def test_dtype_to_json(self):
+        @dataclasses.dataclass
+        class TorchDtypeTrainingArguments(TrainingArguments):
+            dtype: torch.dtype = dataclasses.field(
+                default=torch.float32,
+            )
+
+        for dtype in [
+            "float32",
+            "float64",
+            "complex64",
+            "complex128",
+            "float16",
+            "bfloat16",
+            "uint8",
+            "int8",
+            "int16",
+            "int32",
+            "int64",
+            "bool",
+        ]:
+            torch_dtype = getattr(torch, dtype)
+            with tempfile.TemporaryDirectory() as tmp_dir:
+                args = TorchDtypeTrainingArguments(output_dir=tmp_dir, dtype=torch_dtype)
+
+                args_dict = args.to_dict()
+                self.assertIn("dtype", args_dict)
+                self.assertEqual(args_dict["dtype"], dtype)
--- a/tests/trainer/trainer_test_utils.py
+++ b/tests/trainer/trainer_test_utils.py
@@ -0,0 +1,630 @@
+# Copyright 2018 the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Shared test infrastructure for the Trainer test suite."""
+
+import dataclasses
+import gc
+import json
+import os
+import random
+
+import numpy as np
+
+from transformers import (
+    AutoTokenizer,
+    PreTrainedConfig,
+    TrainerCallback,
+    TrainingArguments,
+    is_datasets_available,
+    is_torch_available,
+)
+from transformers.testing_utils import (
+    backend_empty_cache,
+    backend_max_memory_allocated,
+    backend_memory_allocated,
+    backend_reset_max_memory_allocated,
+    get_tests_dir,
+    torch_device,
+)
+from transformers.utils import (
+    SAFE_WEIGHTS_INDEX_NAME,
+    SAFE_WEIGHTS_NAME,
+    is_accelerate_available,
+)
+
+
+if torch_device == "hpu":
+    RTOL = 1e-3
+    ATOL = 1e-3
+else:
+    RTOL = 1e-5
+    ATOL = 1e-5
+
+if is_torch_available():
+    import safetensors.torch
+    import torch
+    from torch import nn
+    from torch.utils.data import IterableDataset
+
+    from transformers import (
+        AutoModelForCausalLM,
+        PreTrainedModel,
+        Trainer,
+        TrainerState,
+    )
+
+if is_datasets_available():
+    import datasets
+
+# for version specific tests in TrainerIntegrationTest
+if is_accelerate_available():
+    pass
+
+
+PATH_SAMPLE_TEXT = f"{get_tests_dir()}/fixtures/sample_text.txt"
+
+
+def get_dataset(file_path, tokenizer, max_len):
+    dataset = datasets.load_dataset("text", data_files=file_path)
+
+    # Filter out empty lines
+    dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)
+
+    # Define tokenization function
+    def tokenize_function(examples):
+        tokenized = tokenizer(examples["text"], add_special_tokens=True, truncation=True, max_length=max_len)
+        # Add labels as a copy of input_ids
+        tokenized["labels"] = tokenized["input_ids"].copy()
+        return tokenized
+
+    # Apply tokenization and remove original text column
+    tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
+
+    return tokenized_dataset["train"]
+
+
+class StoreLossCallback(TrainerCallback):
+    """
+    Simple callback to store the loss.
+    """
+
+    def __init__(self):
+        self.losses = []
+
+    def on_log(self, args, state, control, logs=None, **kwargs):
+        if logs is not None and "loss" in logs:
+            self.losses.append(logs["loss"])
+
+
+class MockCudaOOMCallback(TrainerCallback):
+    """
+    Simple callback to simulate a CUDA OOM error when
+    the batch size is >= `batch_size_limit`.
+    """
+
+    def __init__(self, batch_size_limit=16):
+        self.batch_size_limit = batch_size_limit
+
+    def on_step_end(self, args, state, control, **kwargs):
+        # simulate OOM at the end of any step while the batch size is still at or above the limit
+        if state.train_batch_size >= self.batch_size_limit:
+            raise RuntimeError("CUDA out of memory.")
+
+
+class RegressionDataset:
+    def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
+        np.random.seed(seed)
+        self.label_names = ["labels"] if label_names is None else label_names
+        self.length = length
+        self.x = np.random.normal(size=(length,)).astype(np.float32)
+        self.ys = [a * self.x + b + np.random.normal(scale=0.1, size=(length,)) for _ in self.label_names]
+        self.ys = [y.astype(np.float32) for y in self.ys]
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i):
+        result = {name: y[i] for name, y in zip(self.label_names, self.ys)}
+        result["input_x"] = self.x[i]
+        return result
+
+
+# Convert bytes to megabytes (1 MB = 2**20 bytes here), truncating to an int
+def bytes2megabytes(x):
+    return int(x / 2**20)
+
+
+# Copied from accelerate: https://github.com/huggingface/accelerate/blob/ee163b66fb7848892519e804688cb4ae981aacbe/src/accelerate/test_utils/scripts/external_deps/test_peak_memory_usage.py#L40C1-L73C68
+class TorchTracemalloc:
+    def __enter__(self):
+        gc.collect()
+        if torch_device in ["cuda", "xpu"]:
+            backend_empty_cache(torch_device)
+            backend_reset_max_memory_allocated(torch_device)  # reset the peak gauge to zero
+            self.begin = backend_memory_allocated(torch_device)
+        else:
+            self.begin = 0
+        return self
+
+    def __exit__(self, *exc):
+        gc.collect()
+        if torch_device in ["cuda", "xpu"]:
+            backend_empty_cache(torch_device)
+            self.end = backend_memory_allocated(torch_device)
+            self.peak = backend_max_memory_allocated(torch_device)
+        else:
+            self.end = 0
+            self.peak = 0
+        self.used = bytes2megabytes(self.end - self.begin)
+        self.peaked = bytes2megabytes(self.peak - self.begin)
+
+
+@dataclasses.dataclass
+class RegressionTrainingArguments(TrainingArguments):
+    a: float = 0.0
+    b: float = 0.0
+
+
+class RepeatDataset:
+    def __init__(self, x, length=64):
+        self.x = x
+        self.length = length
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i):
+        return {"input_ids": self.x, "labels": self.x}
+
+
+class SequenceClassificationDataset:
+    def __init__(self, length=64, vocab_size=100, num_labels=5):
+        self.length = length
+        self.sequences = [torch.randint(0, vocab_size, (64,)).tolist() for _ in range(length)]
+        self.labels = torch.randint(0, num_labels, (length,)).tolist()
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i):
+        return {"input_ids": self.sequences[i], "label": self.labels[i]}
+
+
+class DynamicShapesDataset:
+    def __init__(self, length=64, seed=42, batch_size=8):
+        self.length = length
+        np.random.seed(seed)
+        sizes = np.random.randint(1, 20, (length // batch_size,))
+        # For easy batching, we make every batch_size consecutive samples the same size.
+        self.xs = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]
+        self.ys = [np.random.normal(size=(s,)).astype(np.float32) for s in sizes.repeat(batch_size)]
+
+    def __len__(self):
+        return self.length
+
+    def __getitem__(self, i):
+        return {"input_x": self.xs[i], "labels": self.ys[i]}
+
+
+class AlmostAccuracy:
+    def __init__(self, thresh=0.25):
+        self.thresh = thresh
+
+    def __call__(self, eval_pred):
+        predictions, labels = eval_pred
+        true = np.abs(predictions - labels) <= self.thresh
+        return {"accuracy": true.astype(np.float32).mean().item()}
+
+
+class AlmostAccuracyBatched:
+    def __init__(self, thresh=0.25):
+        self.thresh = thresh
+        self.batch_acc = []
+
+    def __call__(self, eval_pred, compute_result):
+        predictions, labels = eval_pred
+        if isinstance(predictions, tuple):
+            predictions = predictions[0]
+        if isinstance(labels, tuple):
+            labels = labels[0]
+        batch_size = len(predictions)
+        true = torch.abs(predictions - labels) <= self.thresh
+        acc = true.type(torch.FloatTensor).mean().item()
+        self.batch_acc.extend([acc] * batch_size)
+        if compute_result:
+            result = {"accuracy": np.mean(self.batch_acc).item()}
+            self.batch_acc = []
+            return result
+
+
+class RegressionModelConfig(PreTrainedConfig):
+    def __init__(self, a=0, b=0, double_output=False, random_torch=True, **kwargs):
+        super().__init__(**kwargs)
+        self.a = a
+        self.b = b
+        self.double_output = double_output
+        self.random_torch = random_torch
+        self.hidden_size = 1
+
+
+if is_torch_available():
+
+    class SampleIterableDataset(IterableDataset):
+        def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
+            self.dataset = RegressionDataset(a=a, b=b, length=length, seed=seed, label_names=label_names)
+
+        def __iter__(self):
+            for i in range(len(self.dataset)):
+                yield self.dataset[i]
+
+    class FiniteIterableDataset(SampleIterableDataset):
+        def __init__(self, a=2, b=3, length=64, seed=42, label_names=None):
+            super().__init__(a, b, length, seed, label_names)
+            self.current_sample = 0
+
+        def __iter__(self):
+            while self.current_sample < len(self.dataset):
+                yield self.dataset[self.current_sample]
+                self.current_sample += 1
+
+    class MultiLoader:
+        def __init__(self, loaders):
+            self.loaders = loaders
+
+        def __len__(self):
+            return sum(len(loader) for loader in self.loaders)
+
+        def __iter__(self):
+            for loader in self.loaders:
+                yield from loader
+
+    class CustomDataloaderTrainer(Trainer):
+        def get_train_dataloader(self):
+            dataloaders = [super().get_train_dataloader(), super().get_train_dataloader()]
+            return MultiLoader(dataloaders)
+
+        def get_eval_dataloader(self, eval_dataset):
+            dataloaders = [super().get_eval_dataloader(eval_dataset), super().get_eval_dataloader(eval_dataset)]
+            return MultiLoader(dataloaders)
+
+    class RegressionModel(nn.Module):
+        def __init__(self, a=0, b=0, double_output=False):
+            super().__init__()
+            self.a = nn.Parameter(torch.tensor(a).float())
+            self.b = nn.Parameter(torch.tensor(b).float())
+            self.double_output = double_output
+            self.config = None
+
+        def forward(self, input_x, labels=None, **kwargs):
+            y = input_x * self.a + self.b
+            if labels is None:
+                return (y, y) if self.double_output else (y,)
+            loss = nn.functional.mse_loss(y, labels)
+            return (loss, y, y) if self.double_output else (loss, y)
+
+    class RegressionDictModel(nn.Module):
+        def __init__(self, a=0, b=0):
+            super().__init__()
+            self.a = nn.Parameter(torch.tensor(a).float())
+            self.b = nn.Parameter(torch.tensor(b).float())
+            self.config = None
+
+        def forward(self, input_x, labels=None, **kwargs):
+            y = input_x * self.a + self.b
+            result = {"output": y}
+            if labels is not None:
+                result["loss"] = nn.functional.mse_loss(y, labels)
+            return result
+
+    class RegressionPreTrainedModel(PreTrainedModel):
+        config_class = RegressionModelConfig
+        base_model_prefix = "regression"
+
+        def __init__(self, config):
+            super().__init__(config)
+            self.a = nn.Parameter(torch.as_tensor(config.a).float())
+            self.b = nn.Parameter(torch.as_tensor(config.b).float())
+            self.double_output = config.double_output
+            self.post_init()
+
+        def forward(self, input_x, labels=None, **kwargs):
+            y = input_x * self.a + self.b
+            if labels is None:
+                return (y, y) if self.double_output else (y,)
+            loss = nn.functional.mse_loss(y, labels)
+            return (loss, y, y) if self.double_output else (loss, y)
+
+    class RegressionPreTrainedModelWithGradientCheckpointing(PreTrainedModel):
+        config_class = RegressionModelConfig
+        base_model_prefix = "regression"
+        supports_gradient_checkpointing = True
+
+        def __init__(self, config):
+            super().__init__(config)
+            self.layers = nn.ModuleList([nn.Linear(config.hidden_size, config.hidden_size) for _ in range(4)])
+            self.head = nn.Linear(config.hidden_size, 1)
+            self.gradient_checkpointing = False
+            self.double_output = config.double_output
+            self.post_init()
+
+        def forward(self, input_x, labels=None, **kwargs):
+            y = input_x.unsqueeze(0)
+
+            for layer in self.layers:
+                if self.training and self.gradient_checkpointing:
+                    outputs = self._gradient_checkpointing_func(layer.__call__, y)
+                else:
+                    outputs = layer(y)
+
+                y = outputs * 3
+
+            logits = self.head(y)
+
+            if labels is None:
+                return (logits, logits) if self.double_output else (logits,)
+
+            loss = nn.functional.mse_loss(logits, labels)
+
+            return (loss, y, y) if self.double_output else (loss, y)
+
+    class RegressionRandomPreTrainedModel(PreTrainedModel):
+        config_class = RegressionModelConfig
+        base_model_prefix = "regression"
+
+        def __init__(self, config):
+            super().__init__(config)
+            self.a = nn.Parameter(torch.as_tensor(config.a).float())
+            self.b = nn.Parameter(torch.as_tensor(config.b).float())
+            self.random_torch = config.random_torch
+            self.post_init()
+
+        def forward(self, input_x, labels=None, **kwargs):
+            y = input_x * self.a + self.b
+            if self.random_torch:
+                torch_rand = torch.randn(1).squeeze()
+            np_rand = np.random.rand()
+            rand_rand = random.random()
+
+            if self.random_torch:
+                y += 0.05 * torch_rand
+            y += 0.05 * torch.tensor(np_rand + rand_rand)
+
+            if labels is None:
+                return (y,)
+            loss = nn.functional.mse_loss(y, labels)
+            return (loss, y)
+
+    class BasicTextGenerationModel(nn.Module):
+        def __init__(self, vocab_size, hidden_size):
+            super().__init__()
+            self.embedding = nn.Embedding(vocab_size, hidden_size)
+            self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
+            self.fc = nn.Linear(hidden_size, vocab_size)
+
+        def forward(self, input_ids, labels=None, **kwargs):
+            embedded = self.embedding(input_ids)
+            lstm_out, _ = self.lstm(embedded)
+            logits = self.fc(lstm_out)
+            if labels is None:
+                return logits
+
+            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
+            return loss, logits
+
+    def create_dummy_dataset_for_text_generation(vocab_size, seq_length, num_samples):
+        # Create random input sequences of token ids in [0, vocab_size)
+        input_ids = np.random.randint(0, vocab_size, (num_samples, seq_length))
+
+        # Create a datasets.Dataset
+        dataset = datasets.Dataset.from_dict({"input_ids": input_ids, "labels": input_ids})
+
+        return dataset
+
+    class TstLayer(nn.Module):
+        def __init__(self, hidden_size):
+            super().__init__()
+            self.linear1 = nn.Linear(hidden_size, hidden_size)
+            self.ln1 = nn.LayerNorm(hidden_size)
+            self.linear2 = nn.Linear(hidden_size, hidden_size)
+            self.ln2 = nn.LayerNorm(hidden_size)
+            self.bias = nn.Parameter(torch.zeros(hidden_size))
+
+        def forward(self, x):
+            h = self.ln1(nn.functional.relu(self.linear1(x)))
+            h = nn.functional.relu(self.linear2(h))
+            return self.ln2(x + h + self.bias)
+
+    def get_regression_trainer(
+        a=0,
+        b=0,
+        double_output=False,
+        train_len=64,
+        eval_len=64,
+        pretrained=True,
+        output_dir=None,
+        **kwargs,
+    ):
+        label_names = kwargs.get("label_names")
+        gradient_checkpointing = kwargs.get("gradient_checkpointing", False)
+        train_dataset = RegressionDataset(length=train_len, label_names=label_names)
+        eval_dataset = RegressionDataset(length=eval_len, label_names=label_names)
+
+        model_init = kwargs.pop("model_init", None)
+        if model_init is not None:
+            model = None
+        else:
+            if pretrained:
+                config = RegressionModelConfig(a=a, b=b, double_output=double_output)
+                # Pick the model class depending on whether gradient checkpointing is requested
+                target_cls = (
+                    RegressionPreTrainedModel
+                    if not gradient_checkpointing
+                    else RegressionPreTrainedModelWithGradientCheckpointing
+                )
+                model = target_cls(config)
+            else:
+                model = RegressionModel(a=a, b=b, double_output=double_output)
+
+        compute_metrics = kwargs.pop("compute_metrics", None)
+        data_collator = kwargs.pop("data_collator", None)
+        optimizers = kwargs.pop("optimizers", (None, None))
+        preprocess_logits_for_metrics = kwargs.pop("preprocess_logits_for_metrics", None)
+        assert output_dir is not None, "output_dir should be specified for testing"
+        args = RegressionTrainingArguments(output_dir, a=a, b=b, **kwargs)
+        trainer = Trainer(
+            model,
+            args,
+            data_collator=data_collator,
+            train_dataset=train_dataset,
+            eval_dataset=eval_dataset,
+            compute_metrics=compute_metrics,
+            optimizers=optimizers,
+            model_init=model_init,
+            preprocess_logits_for_metrics=preprocess_logits_for_metrics,
+        )
+        # TODO: the loss function defined in RegressionModel doesn't accept num_items_in_batch, to fix later
+        trainer.model_accepts_loss_kwargs = False
+        return trainer
+
+    def get_language_model_trainer(**kwargs):
+        dataset = datasets.load_dataset("fka/awesome-chatgpt-prompts")
+        model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+        tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+        tokenizer.pad_token = tokenizer.eos_token
+
+        def _tokenize_function(examples):
+            model_inputs = tokenizer(examples["prompt"], padding="max_length", truncation=True)
+            model_inputs["labels"] = np.array(model_inputs["input_ids"]).astype(np.int64)
+            return model_inputs
+
+        tokenized_datasets = dataset.map(_tokenize_function, batched=True)
+        training_args = TrainingArguments(**kwargs)
+
+        trainer = Trainer(
+            model=model,
+            args=training_args,
+            train_dataset=tokenized_datasets["train"],
+        )
+
+        return trainer
+
+
+class TrainerIntegrationCommon:
+    def check_saved_checkpoints(self, output_dir, freq, total, is_pretrained=True, use_scaler=False):
+        weights_file = SAFE_WEIGHTS_NAME
+        file_list = [weights_file, "training_args.bin", "optimizer.pt", "scheduler.pt", "trainer_state.json"]
+        if is_pretrained:
+            file_list.append("config.json")
+        if use_scaler:
+            file_list.append("scaler.pt")
+        for step in range(freq, total, freq):
+            checkpoint = os.path.join(output_dir, f"checkpoint-{step}")
+            self.assertTrue(os.path.isdir(checkpoint))
+            for filename in file_list:
+                self.assertTrue(os.path.isfile(os.path.join(checkpoint, filename)))
+
+    def check_best_model_has_been_loaded(
+        self,
+        output_dir,
+        freq,
+        total,
+        trainer,
+        metric,
+        greater_is_better=False,
+        is_pretrained=True,
+    ):
+        # Load the log history from the checkpoint at the final step (`total`), whether or not `total` is divisible by `freq`
+        final_checkpoint_step = total
+        checkpoint = os.path.join(output_dir, f"checkpoint-{final_checkpoint_step}")
+        log_history = TrainerState.load_from_json(os.path.join(checkpoint, "trainer_state.json")).log_history
+
+        values = [d[metric] for d in log_history if metric in d]
+        best_value = max(values) if greater_is_better else min(values)
+        best_idx = values.index(best_value)
+
+        # Determine which checkpoint corresponds to the best metric
+        # Evals happen at freq intervals, plus potentially at the final step
+        eval_steps = list(range(freq, total + 1, freq))
+        if total % freq != 0:
+            eval_steps.append(total)
+        best_checkpoint = eval_steps[best_idx]
+        checkpoint = os.path.join(output_dir, f"checkpoint-{best_checkpoint}")
+        if is_pretrained:
+            best_model = RegressionPreTrainedModel.from_pretrained(checkpoint)
+            best_model.to(trainer.args.device)
+        else:
+            best_model = RegressionModel()
+            state_dict = safetensors.torch.load_file(os.path.join(checkpoint, SAFE_WEIGHTS_NAME))
+            best_model.load_state_dict(state_dict)
+            best_model.to(trainer.args.device)
+        torch.testing.assert_close(best_model.a, trainer.model.a)
+        torch.testing.assert_close(best_model.b, trainer.model.b)
+
+        metrics = trainer.evaluate()
+        self.assertEqual(metrics[metric], best_value)
+
+    def remove_nan_logs(self, log):
+        for key in list(log.keys()):
+            if log[key] != log[key]:  # Check if the value is NaN
+                del log[key]
+
+    def check_trainer_state_are_the_same(self, trainer_state, trainer_state1):
+        # We'll pop things so operate on copies.
+        state = trainer_state.copy()
+        state1 = trainer_state1.copy()
+        # Log history may contain different logs for the time metrics (after resuming a training).
+        log_history = state.pop("log_history", None)
+        log_history1 = state1.pop("log_history", None)
+        self.assertEqual(state, state1)
+        skip_log_keys = ["train_runtime", "train_samples_per_second", "train_steps_per_second", "train_loss"]
+        for log, log1 in zip(log_history, log_history1):
+            for key in skip_log_keys:
+                _ = log.pop(key, None)
+                _ = log1.pop(key, None)
+
+            self.remove_nan_logs(log)
+            self.remove_nan_logs(log1)
+
+            self.assertEqual(log, log1)
+
+    def convert_to_sharded_checkpoint(self, folder):
+        # Converts a checkpoint of a regression model to a sharded checkpoint, with one shard file per parameter.
+        loader = safetensors.torch.load_file
+        weights_file = os.path.join(folder, SAFE_WEIGHTS_NAME)
+
+        extension = "safetensors"
+        saver = safetensors.torch.save_file
+        index_file = os.path.join(folder, SAFE_WEIGHTS_INDEX_NAME)
+        shard_name = SAFE_WEIGHTS_NAME
+
+        state_dict = loader(weights_file)
+
+        os.remove(weights_file)
+        keys = list(state_dict.keys())
+
+        shard_files = [
+            shard_name.replace(f".{extension}", f"-{idx + 1:05d}-of-{len(keys):05d}.{extension}")
+            for idx in range(len(keys))
+        ]
+        index = {"metadata": {}, "weight_map": {key: shard_files[i] for i, key in enumerate(keys)}}
+
+        with open(index_file, "w", encoding="utf-8") as f:
+            content = json.dumps(index, indent=2, sort_keys=True) + "\n"
+            f.write(content)
+
+        for param_name, shard_file in zip(keys, shard_files):
+            saver({param_name: state_dict[param_name]}, os.path.join(folder, shard_file))