Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
118 lines
6.2 KiB
Markdown
118 lines
6.2 KiB
Markdown
# Tracing model intermediate outputs
|
||
|
||
Every model's `forward()` method used to manually resolve `None` flags like `output_attentions` from config defaults, accumulate per-layer attention weights and hidden states into tuples, and convert [`ModelOutput`] dataclasses to plain tuples when `return_dict=False`. Two decorators replace all of that boilerplate.
|
||
|
||
- `@capture_outputs` resolves output flags, collects intermediate values, and handles `return_dict` conversion.
|
||
- `@merge_with_config_defaults` resolves `use_cache` from config. Omit it for models that don't cache, like [`CLIPModel`].
|
||
|
||
You'll mostly encounter these decorators when integrating a new model. See [adding a model to 🤗 Transformers](./modular_transformers) for a step-by-step guide.
|
||
|
||
## Declare which submodules to capture
|
||
|
||
Apply `@capture_outputs` to the base model's `forward()` method. It attaches forward hooks that:
|
||
|
||
- Intercept outputs from specified submodule classes during the forward pass, without those submodules needing to know they're being observed.
|
||
- Collect per-layer attention weights and hidden states into tuples.
|
||
- Inject collected values into the returned [`ModelOutput`] dataclass.
|
||
- Convert the dataclass to a plain tuple when `return_dict=False`.
|
||
- Resolve `output_attentions` and `output_hidden_states` from kwargs or `self.config` when `None`.
|
||
|
||
## Map output fields to submodules
|
||
|
||
`@capture_outputs` needs to know which submodule produces which output. Declare a `_can_record_outputs` class-level dictionary on your `PreTrainedModel` subclass. Each key is an output field name (`"hidden_states"`, `"attentions"`, `"cross_attentions"`), and each value is a module class or an `OutputRecorder` instance.
|
||
|
||
## Fine-grained control with OutputRecorder
|
||
|
||
`OutputRecorder` accepts a `target_class` (an `nn.Module` subclass whose outputs to collect) and an optional `index` to select which element of the module's output tuple to get. Pass `layer_name` to attach hooks only to modules with a specific attribute name. Use `layer_name` when two layers share the same class, for example self-attention vs. cross-attention.
|
||
|
||
The example below shows both decorators in practice with different levels of output control. See [LlamaModel](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py) for real world reference.
|
||
|
||
```python
|
||
from ...processing_utils import Unpack
|
||
from ...utils import TransformersKwargs
|
||
from ...utils.generic import merge_with_config_defaults
|
||
from ...utils.output_capturing import capture_outputs, OutputRecorder
|
||
|
||
class MyPreTrainedModel(PreTrainedModel):
|
||
_can_record_outputs = {
|
||
# Capture hidden_states: hook fires after each MyDecoderBlock forward,
|
||
# grabbing its first output (index 0 by default).
|
||
"hidden_states": MyDecoderBlock,
|
||
|
||
# Capture self-attention weights: hook fires after each MyAttention
|
||
# forward, grabbing its second output (index=1 by default).
|
||
"attentions": MyAttention,
|
||
|
||
# Capture cross-attention weights: same class, different submodule.
|
||
# layer_name targets the attribute `self.crossattention` inside the block.
|
||
# Captures second output as requested (index=1)
|
||
"cross_attentions": OutputRecorder(
|
||
MyAttention, layer_name="crossattention", index=1
|
||
),
|
||
}
|
||
|
||
# Now in base model's forward we need the decorators and `Unpack` `kwargs`
|
||
class MyModel(MyPreTrainedModel):
|
||
|
||
@merge_with_config_defaults # ← resolves use_cache
|
||
@capture_outputs # ← handles output collection + return_dict
|
||
def forward(
|
||
self,
|
||
input_ids: torch.LongTensor | None = None,
|
||
past_key_values: Cache = None,
|
||
**kwargs: Unpack[TransformersKwargs],
|
||
) -> BaseModelOutputWithPast:
|
||
|
||
# No manual collection needed. Just run your layers normally.
|
||
hidden_states = self.embed_tokens(input_ids)
|
||
for layer in self.layers:
|
||
hidden_states = layer(hidden_states, **kwargs)
|
||
|
||
# Return the primary outputs. The decorator will fill in
|
||
# hidden_states/attentions/cross_attentions automatically.
|
||
return BaseModelOutputWithPast(
|
||
last_hidden_state=hidden_states,
|
||
past_key_values=past_key_values
|
||
)
|
||
```
|
||
|
||
## Patch layer classes
|
||
|
||
Output tracing depends on `_can_record_outputs` pointing at the *exact* classes a model's layers instantiate. If you swap a layer implementation for a custom attention kernel, a quantized expert layer, or an architecture variant, those pointers must stay in sync. The [patching API](./monkey_patching) provides a clean global registry for keeping `_can_record_outputs` consistent.
|
||
|
||
`register_patch_mapping` maps original class names to replacement `nn.Module` subclasses. Keys can be exact class names or regex patterns. Exact matches take priority. Patterns are tested with `re.search()`, so unanchored patterns match anywhere in the class name. Registering the same key twice raises `ValueError` unless you pass `overwrite=True`.
|
||
|
||
Remove entries with `unregister_patch_mapping`.
|
||
|
||
|
||
```python
|
||
from transformers.monkey_patching import register_patch_mapping, unregister_patch_mapping
|
||
|
||
# Exact name – replaces only Qwen2MoeExperts
|
||
register_patch_mapping({"Qwen2MoeExperts": SequentialExperts})
|
||
|
||
# Regex – replaces every class whose name ends in "Attention"
|
||
register_patch_mapping({".*Attention$": FusedAttention})
|
||
|
||
# Anchored version – only matches Llama2Attention, Llama3Attention, …
|
||
register_patch_mapping({"^Llama\\d+Attention$": CustomLlamaAttention})
|
||
|
||
# Same way, custom keys can be removed from the registry by passing the name that was registered
|
||
unregister_patch_mapping(["Qwen2MoeExperts", ".*Attention$"])
|
||
```
|
||
|
||
Once mappings are registered, `patch_output_recorders` walks every submodule and updates each `OutputRecorder.target_class` to the registered replacement.
|
||
|
||
> [!TIP]
|
||
> The [`~PreTrainedModel.from_pretrained`] method calls `patch_output_recorders` automatically. You only need to call it yourself when constructing a model directly.
|
||
|
||
```python
|
||
from transformers.monkey_patching import patch_output_recorders
|
||
# Built manually, outside from_pretrained
|
||
model = Qwen2MoeModel(config)
|
||
|
||
# Without this, _can_record_outputs still points at the original Qwen2MoeExperts class
|
||
# and hooks will never fire on CustomExperts instances.
|
||
patch_output_recorders(model)
|
||
```
|