gavin/transformers

陈赣 06f1fd69a6

first commit

2026-06-05 16:53:03 +08:00

Data collators

A data collator assembles individual dataset samples into a batch for the model. It can also dynamically pad samples to the longest sequence in each batch, which is more efficient than padding to a global maximum length.

Dataset[0] → {"input_ids": [101, 2003], "labels": 1}
Dataset[1] → {"input_ids": [101, 2003, 1996], "labels": 0}
Dataset[2] → {"input_ids": [101, 7592], "labels": 1}
         ↓  collator
{
  "input_ids": tensor([[101, 2003,    0],   # padded to longest
                        [101, 2003, 1996],
                        [101, 7592,    0]]),
  "labels":    tensor([1, 0, 1])
}
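To make the efficiency claim concrete, here is a minimal pure-Python sketch that pads the three samples from the diagram two ways: to the longest sequence in the batch, and to a fixed global maximum. The helper names `pad_batch` and `padding_added`, and the global maximum of 8, are assumptions chosen for illustration; they are not part of Transformers.

```python
def pad_batch(samples, pad_id=0, length=None):
    """Right-pad each list of token IDs to `length`.

    If `length` is None, pad to the longest sample in this batch (dynamic padding).
    Samples longer than `length` are left unchanged (no truncation).
    """
    target = length if length is not None else max(len(s) for s in samples)
    return [s + [pad_id] * (target - len(s)) for s in samples]


def padding_added(samples, batch):
    """Number of pad tokens that were appended to build `batch`."""
    return sum(len(row) for row in batch) - sum(len(s) for s in samples)


samples = [[101, 2003], [101, 2003, 1996], [101, 7592]]

dynamic = pad_batch(samples)          # pad to the longest sample in the batch
fixed = pad_batch(samples, length=8)  # pad to a hypothetical global maximum of 8

print(dynamic)
print(padding_added(samples, dynamic), padding_added(samples, fixed))
```

In this toy batch, dynamic padding appends 2 pad tokens while the fixed length of 8 appends 17, and the model would have to process every one of them.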

Transformers provides data collators for various tasks (see all available data collators). Create a custom data collator with:

DataCollatorWithPadding when you need standard tokenizer-based padding plus extra fields.
DataCollatorMixin when you need custom padding logic, multiple paired inputs per sample, or a batch structure the tokenizer can't produce on its own.

DataCollatorWithPadding

For simple use cases like adding an extra field, subclass [DataCollatorWithPadding] and extend its __call__ method. The example below adds a "score" field.

Remove the custom field first because [~PreTrainedTokenizerBase.pad] doesn't recognize it.
Call the parent class to handle input_ids and attention_mask.
Add the "score" field back to the batch.

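import torch
from dataclasses import dataclass
from transformers import DataCollatorWithPadding, PreTrainedTokenizerBase

@dataclass
class DataCollatorWithScore(DataCollatorWithPadding):
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Remove "score" so the tokenizer's pad method only sees fields it recognizes.
        # Note: pop() modifies each feature dict in place.
        scores = [f.pop("score") for f in features]

        # The parent class pads input_ids and attention_mask.
        batch = super().__call__(features)
        # Add "score" back to the batch as a float tensor.
        batch["score"] = torch.tensor(scores, dtype=torch.float)

        return batch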
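One design point worth noting: `f.pop("score")` modifies each feature dict in place. That is harmless when the dataset returns a fresh dict on every access, but if the same dicts are collated twice (for example, a plain list of dicts iterated over several epochs), the second call would raise a `KeyError`. A minimal sketch of a non-mutating alternative, using a hypothetical `split_field` helper that is not part of Transformers:

```python
def split_field(features, name):
    """Return (values, stripped) without modifying the input dicts.

    `split_field` is a hypothetical helper for illustration only.
    """
    values = [f[name] for f in features]
    stripped = [{k: v for k, v in f.items() if k != name} for f in features]
    return values, stripped


features = [
    {"input_ids": [101, 2003], "score": 0.9},
    {"input_ids": [101, 2003, 1996], "score": 0.1},
]

scores, stripped = split_field(features, "score")

# The original dicts still contain "score", so they can be collated again.
print(scores)
print(stripped)
```

Inside a collator, `stripped` would take the place of `features` in the call to the parent class.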

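Pass the custom data collator to [Trainer] like any other data collator. With the default remove_unused_columns=True, [Trainer] drops dataset columns that the model's forward method doesn't accept before the collator sees them, so set remove_unused_columns=False in [TrainingArguments] if "score" isn't one of the model's inputs.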

from transformers import Trainer

trainer = Trainer(
    ...,
    data_collator=DataCollatorWithScore(tokenizer=tokenizer),
)

DataCollatorMixin

Subclass [DataCollatorMixin] for full control over batch assembly and implement your own __call__ method. Build custom padding logic, handle multiple input types, or create entirely new batch structures. The DataCollatorForPreference example below uses [DataCollatorMixin] because each training sample has a chosen and rejected response, and the model needs to see both.

Build separate lists of chosen_ids and rejected_ids tensors because [~trl.trainer.utils.pad] expects a flat list of tensors.
Concatenate the two lists, chosen first, so a batch of N pairs yields 2N sequences.
Generate attention_mask with torch.ones_like instead of the tokenizer because the collator works with raw token IDs.
Pad input_ids with pad_token_id and attention_mask with 0.

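import torch
from dataclasses import dataclass
from transformers.data.data_collator import DataCollatorMixin
from trl.trainer.utils import pad

# @dataclass generates an __init__ from the annotated fields,
# so the collator can be built with DataCollatorForPreference(pad_token_id=...).
@dataclass
class DataCollatorForPreference(DataCollatorMixin):
    pad_token_id: int
    pad_to_multiple_of: int | None = None

    def __call__(self, examples: list[dict]) -> dict:
        # One tensor per sequence; chosen and rejected are kept in separate lists.
        chosen_input_ids   = [torch.tensor(ex["chosen_ids"])   for ex in examples]
        rejected_input_ids = [torch.tensor(ex["rejected_ids"]) for ex in examples]

        # Chosen sequences first, then rejected: 2 * len(examples) rows in total.
        input_ids      = chosen_input_ids + rejected_input_ids
        # Every real token gets a 1; padding positions are filled with 0 below.
        attention_mask = [torch.ones_like(ids) for ids in input_ids]

        output = {
            "input_ids": pad(
                input_ids,
                padding_value=self.pad_token_id,
                padding_side="right",
                pad_to_multiple_of=self.pad_to_multiple_of,
            ),
            "attention_mask": pad(
                attention_mask,
                padding_value=0,
                padding_side="right",
                pad_to_multiple_of=self.pad_to_multiple_of,
            ),
        }

        ...

        return output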
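The batch layout this collator produces can be sketched in plain Python, without torch or trl. The sketch below assumes that pad_to_multiple_of rounds the padded length up to the next multiple, which is how the parameter is conventionally used. `round_up` and `collate_preference` are hypothetical names for illustration, and the lists stand in for the tensors of the real collator.

```python
def round_up(length, multiple):
    """Smallest multiple of `multiple` that is >= `length` (unchanged if `multiple` is None)."""
    if multiple is None or length % multiple == 0:
        return length
    return length + multiple - length % multiple


def collate_preference(examples, pad_token_id, pad_to_multiple_of=None):
    # Chosen sequences first, then rejected: 2 * len(examples) rows in total.
    input_ids = [ex["chosen_ids"] for ex in examples] + [ex["rejected_ids"] for ex in examples]
    target = round_up(max(len(ids) for ids in input_ids), pad_to_multiple_of)
    return {
        # Right-pad token IDs with the pad token.
        "input_ids": [ids + [pad_token_id] * (target - len(ids)) for ids in input_ids],
        # 1 for real tokens, 0 for padding positions.
        "attention_mask": [[1] * len(ids) + [0] * (target - len(ids)) for ids in input_ids],
    }


examples = [
    {"chosen_ids": [5, 6, 7], "rejected_ids": [5, 8]},
    {"chosen_ids": [9, 10], "rejected_ids": [9, 11, 12, 13, 14]},
]
batch = collate_preference(examples, pad_token_id=0, pad_to_multiple_of=4)

# The first half of the rows are the chosen responses, the second half the rejected ones.
num_pairs = len(examples)
chosen_rows = batch["input_ids"][:num_pairs]
rejected_rows = batch["input_ids"][num_pairs:]
print(chosen_rows)
print(rejected_rows)
```

Because the longest sequence has 5 tokens and pad_to_multiple_of is 4, every row is padded to length 8. Code that consumes the batch can recover the two halves by slicing at the number of pairs.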

Next steps

See all available data collators for common tasks like token classification.
