This model was published in HF papers on 2021-07-30 and contributed to Hugging Face Transformers on 2021-12-08.

Perceiver

Overview

The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Perceiver IO is a generalization of Perceiver to handle arbitrary outputs in addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio. This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example, Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.

The abstract from the paper is the following:

The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.

Here's a TLDR explaining how Perceiver works:

The main problem with the self-attention mechanism of the Transformer is that its time and memory requirements scale quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a maximum sequence length of 512 tokens. Perceiver addresses this by performing self-attention on a set of latent variables instead of on the inputs, and using the inputs only for cross-attention. In this way, the cost of the self-attention layers no longer depends on the length of the inputs, because one uses a fixed number of latent variables, such as 256 or 512; only the cross-attention scales with the input length, and it does so linearly. The latents are randomly initialized and then trained end-to-end using backpropagation.
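To make the scaling argument concrete, the sketch below counts attention-score entries for the two designs. It is a back-of-the-envelope illustration, not a measurement: it ignores attention heads, feature dimensions, and MLP layers, and the function names and the depth of 26 layers are assumptions chosen for the example.

```python
def self_attention_scores(seq_len: int) -> int:
    """Entries in one full self-attention score matrix over the inputs."""
    return seq_len * seq_len


def perceiver_scores(seq_len: int, num_latents: int, num_self_attention_layers: int) -> int:
    """Score entries for one input cross-attention plus the latent self-attention layers."""
    cross = num_latents * seq_len        # latents (queries) x inputs (keys): linear in seq_len
    latent = num_latents * num_latents   # latents attend to latents: independent of seq_len
    return cross + num_self_attention_layers * latent


NUM_LAYERS = 26  # illustrative depth, an assumption of this sketch

for seq_len in (512, 2048, 8192):
    full = NUM_LAYERS * self_attention_scores(seq_len)
    latent_based = perceiver_scores(seq_len, num_latents=256, num_self_attention_layers=NUM_LAYERS)
    print(f"seq_len={seq_len}: full self-attention={full}, latent-based={latent_based}")
```

Doubling `seq_len` quadruples the first count, while the second grows only by the extra cross-attention entries.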

Internally, [PerceiverModel] creates the latents, a tensor of shape (batch_size, num_latents, d_latents). One must provide inputs (text, images, audio, or any other modality) to the model, which the latents attend to through cross-attention. The output of the Perceiver encoder is a tensor of the same shape as the latents. One can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along the sequence dimension and placing a linear layer on top to project d_latents to num_labels.
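The pooling-and-projection step described above can be sketched in plain Python for a single example (no batch dimension). The helpers `mean_pool` and `linear` are hypothetical names for this illustration, the sizes are tiny placeholders, and the random values stand in for trained latents and weights; this is not the library's implementation.

```python
import random


def mean_pool(latents):
    """Average a (num_latents, d_latents) list of rows along the sequence dimension."""
    num_latents = len(latents)
    d_latents = len(latents[0])
    return [sum(row[j] for row in latents) / num_latents for j in range(d_latents)]


def linear(vector, weight, bias):
    """Project a d_latents vector to num_labels logits; weight is (num_labels, d_latents)."""
    return [sum(w * x for w, x in zip(w_row, vector)) + b for w_row, b in zip(weight, bias)]


random.seed(0)
num_latents, d_latents, num_labels = 4, 8, 3  # tiny illustrative sizes
latents = [[random.gauss(0, 1) for _ in range(d_latents)] for _ in range(num_latents)]
weight = [[random.gauss(0, 0.02) for _ in range(d_latents)] for _ in range(num_labels)]
bias = [0.0] * num_labels

pooled = mean_pool(latents)            # length d_latents
logits = linear(pooled, weight, bias)  # length num_labels
print(len(pooled), len(logits))
```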

This was the idea of the original Perceiver paper. However, it could only output classification logits. In the follow-up work, Perceiver IO, the authors generalized it so that the model can also produce outputs of arbitrary size. How, you might ask? The idea is relatively simple: one defines outputs of an arbitrary size and then applies cross-attention with the last hidden states of the latents, using the outputs as queries and the latents as keys and values.
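The role of queries, keys, and values on both sides of the model can be illustrated with a minimal single-head attention in plain Python. This is a simplified sketch under stated assumptions: it has no learned projections, uses one tensor as both keys and values, and requires queries and keys to share the same width, none of which is claimed to match the actual model. The function name `cross_attention` and all sizes are made up for the example.

```python
import math


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    queries:     num_queries rows, each of length d
    keys_values: num_kv rows, each of length d (used as both keys and values)
    returns:     num_queries rows, each of length d
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys_values]
        max_score = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)])
    return outputs


# Encoder side: 2 latents (queries) read from 5 input positions (keys/values).
inputs = [[float(i + j) for j in range(4)] for i in range(5)]
latents = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
latents = cross_attention(latents, inputs)          # still 2 rows

# Decoder side: 7 output queries read from the 2 latents (keys/values).
output_queries = [[0.01 * i] * 4 for i in range(7)]
decoded = cross_attention(output_queries, latents)  # 7 rows, one per output position
print(len(latents), len(decoded), len(decoded[0]))
```

The number of rows coming out always equals the number of queries, which is why the output size can be chosen independently of both the input size and the number of latents.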

So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input length does not affect the computation time of the self-attention layers, one can provide raw bytes, giving inputs of length 2048 to the model. If one now masks out some of these 2048 tokens, one can define the outputs as being of shape (batch_size, 2048, 768). Next, one performs cross-attention with the final hidden states of the latents to update the outputs tensor. After cross-attention, one still has a tensor of shape (batch_size, 2048, 768). One can then place a regular language modeling head on top to project the last dimension to the vocabulary size of the model, i.e. creating logits of shape (batch_size, 2048, 262) (as Perceiver uses a vocabulary size of 262 byte IDs).
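The byte-level setup and the shapes in this walkthrough can be traced with a short standard-library sketch. It assumes that the 262 IDs consist of the 256 byte values shifted past 6 reserved special-token IDs; the specific `PAD_ID` and `MASK_ID` values, the helper names, and the masked span are hypothetical choices for illustration. It does not call `PerceiverTokenizer` and may differ from it in details such as added special tokens.

```python
NUM_SPECIAL_TOKENS = 6                    # assumption: 262 IDs minus 256 byte values
VOCAB_SIZE = 256 + NUM_SPECIAL_TOKENS     # 262
PAD_ID = 0                                # hypothetical ID for this sketch
MASK_ID = 3                               # hypothetical ID for this sketch


def encode_bytes(text, max_length):
    """Map UTF-8 bytes to IDs shifted past the special tokens, then pad to max_length."""
    ids = [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")][:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def decode_bytes(ids):
    """Drop special-token IDs and turn the remaining IDs back into text."""
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if i >= NUM_SPECIAL_TOKENS)
    return raw.decode("utf-8", errors="replace")


batch_size, seq_len, d_model = 1, 2048, 768
input_ids = encode_bytes("This is an incomplete sentence.", seq_len)
for position in range(8, 14):             # arbitrary span to mask out
    input_ids[position] = MASK_ID

# Shapes traced through the decoder as described in the text.
decoder_output_shape = (batch_size, seq_len, d_model)   # after cross-attention with the latents
logits_shape = (batch_size, seq_len, VOCAB_SIZE)        # after the language modeling head
print(decoder_output_shape, logits_shape)
```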

Perceiver IO architecture. Taken from the original paper

This model was contributed by nielsr. The original code can be found here.

Perceiver does not work with torch.nn.DataParallel due to a bug in PyTorch; see issue #36035.

Resources

Perceiver specific outputs

autodoc models.perceiver.modeling_perceiver.PerceiverModelOutput

autodoc models.perceiver.modeling_perceiver.PerceiverDecoderOutput

autodoc models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput

autodoc models.perceiver.modeling_perceiver.PerceiverClassifierOutput

PerceiverConfig

autodoc PerceiverConfig

PerceiverTokenizer

autodoc PerceiverTokenizer - __call__

PerceiverImageProcessor

autodoc PerceiverImageProcessor - preprocess

PerceiverImageProcessorPil

autodoc PerceiverImageProcessorPil - preprocess

PerceiverTextPreprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverTextPreprocessor

PerceiverImagePreprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverImagePreprocessor

PerceiverOneHotPreprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor

PerceiverAudioPreprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor

PerceiverMultimodalPreprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor

PerceiverProjectionDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverProjectionDecoder

PerceiverBasicDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverBasicDecoder

PerceiverClassificationDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverClassificationDecoder

PerceiverOpticalFlowDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder

PerceiverBasicVideoAutoencodingDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverBasicVideoAutoencodingDecoder

PerceiverMultimodalDecoder

autodoc models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder

PerceiverProjectionPostprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor

PerceiverAudioPostprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor

PerceiverClassificationPostprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor

PerceiverMultimodalPostprocessor

autodoc models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor

PerceiverModel

autodoc PerceiverModel - forward

PerceiverForMaskedLM

autodoc PerceiverForMaskedLM - forward

PerceiverForSequenceClassification

autodoc PerceiverForSequenceClassification - forward

PerceiverForImageClassificationLearned

autodoc PerceiverForImageClassificationLearned - forward

PerceiverForImageClassificationFourier

autodoc PerceiverForImageClassificationFourier - forward

PerceiverForImageClassificationConvProcessing

autodoc PerceiverForImageClassificationConvProcessing - forward

PerceiverForOpticalFlow

autodoc PerceiverForOpticalFlow - forward

PerceiverForMultimodalAutoencoding

autodoc PerceiverForMultimodalAutoencoding - forward