*This model was contributed to Hugging Face Transformers on 2025-12-17.*
*This model is to be announced*
# Pixio
[Pixio]() is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.
You can find all the original Pixio checkpoints under the [Pixio]() collection.
The example below demonstrates how to obtain an image embedding with the [`AutoModel`] class.
```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
processor = AutoImageProcessor.from_pretrained("facebook/pixio-vith16")
model = AutoModel.from_pretrained("facebook/pixio-vith16", device_map="auto")
inputs = processor(images=image, return_tensors="pt").to(model.device)
outputs = model(**inputs, output_hidden_states=True)  # hidden states are only returned when requested
features_norm = outputs.last_hidden_state # class tokens + patch tokens after last LayerNorm
features = outputs.hidden_states[-1] # class tokens + patch tokens before last LayerNorm
```
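To make the shape of `last_hidden_state` concrete, the sketch below works out the expected sequence length as class tokens plus one token per patch. The helper name `expected_sequence_length` is hypothetical, and the default values (a patch size of 16 and 8 class tokens) are assumptions read off the shapes printed in the Notes section, not values queried from a checkpoint. In practice, prefer `model.config.patch_size` and `model.config.n_cls_tokens`.

```python
def expected_sequence_length(height, width, patch_size=16, n_cls_tokens=8):
    """Hypothetical helper: tokens = class tokens + one token per patch.

    The defaults are assumptions taken from the shapes shown in this page.
    """
    num_patches_height = height // patch_size
    num_patches_width = width // patch_size
    return n_cls_tokens + num_patches_height * num_patches_width


# A 256x256 input gives a 16x16 grid of patches, plus 8 class tokens.
seq_len = expected_sequence_length(256, 256)
print(seq_len)  # 264
```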
## Notes
- The example below shows how to split the output tensor into:
  - a set of global embeddings for the whole image, commonly referred to as `CLS` tokens,
    useful for classification and retrieval.
    You can either average them (recommended) or concatenate them along the channel dimension.
  - a set of local embeddings, one for each `16x16` patch of the input image,
    useful for dense tasks, such as depth estimation and semantic segmentation.
```py
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
print(image.height, image.width) # 480 640
processor = AutoImageProcessor.from_pretrained('facebook/pixio-vith16')
model = AutoModel.from_pretrained('facebook/pixio-vith16', device_map="auto")
patch_size = model.config.patch_size
inputs = processor(images=image, return_tensors="pt").to(model.device)
print(inputs.pixel_values.shape) # [1, 3, 256, 256]
batch_size, rgb, img_height, img_width = inputs.pixel_values.shape
num_patches_height, num_patches_width = img_height // patch_size, img_width // patch_size
num_patches_flat = num_patches_height * num_patches_width
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape) # [1, 8 + 256, 1280]
assert last_hidden_states.shape == (batch_size, model.config.n_cls_tokens + num_patches_flat, model.config.hidden_size)
cls_tokens = last_hidden_states[:, :model.config.n_cls_tokens, :]
patch_features = last_hidden_states[:, model.config.n_cls_tokens:, :].unflatten(1, (num_patches_height, num_patches_width))
```
- Use [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed up inference.
```py
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
processor = AutoImageProcessor.from_pretrained('facebook/pixio-vith16')
model = AutoModel.from_pretrained('facebook/pixio-vith16', device_map="auto")
compiled_model = torch.compile(model)
inputs = processor(images=image, return_tensors="pt").to(model.device)
outputs = compiled_model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
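The first note above says the class tokens can be either averaged or concatenated along the channel dimension. The following is a minimal sketch of both options. It runs on a random tensor standing in for `outputs.last_hidden_state`, so no checkpoint is needed. The sizes (8 class tokens, 256 patches, hidden size 1280) are assumptions copied from the shapes printed in that example.

```python
import torch

# Dummy stand-in for `outputs.last_hidden_state` (random values, not model output).
# Assumed shape: (batch, n_cls_tokens + num_patches, hidden_size) = (1, 8 + 256, 1280).
n_cls_tokens, num_patches, hidden_size = 8, 256, 1280
last_hidden_states = torch.randn(1, n_cls_tokens + num_patches, hidden_size)

# The class tokens come first in the sequence.
cls_tokens = last_hidden_states[:, :n_cls_tokens, :]  # [1, 8, 1280]

# Option 1: average the class tokens into a single global embedding.
pooled_mean = cls_tokens.mean(dim=1)  # [1, 1280]

# Option 2: concatenate the class tokens along the channel dimension.
pooled_concat = cls_tokens.flatten(start_dim=1)  # [1, 8 * 1280]

print(pooled_mean.shape, pooled_concat.shape)
```

Averaging keeps the embedding size equal to the hidden size, while concatenation multiplies it by the number of class tokens, which matters for the input size of any downstream classifier or retrieval index.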
## PixioConfig
[[autodoc]] PixioConfig
## PixioModel
[[autodoc]] PixioModel
- forward
## PixioBackbone
[[autodoc]] PixioBackbone
- forward