# Multimodal processors

A processor combines a tokenizer with one or more modality processors, such as an image processor, video processor, or feature extractor. It exposes a single `__call__` method that routes each input to the right component and merges the outputs into one dictionary.

Some multimodal models interleave text with images, videos, or audio. For these models, [`ProcessorMixin`] can replace placeholder tokens like ``, `