*This model was published in HF papers on 2025-01-22 and contributed to Hugging Face Transformers on 2025-10-13.*

# VideoLLaMA3
VideoLLaMA3 architecture. Taken from the technical report.
This model was contributed by [lkhl](https://huggingface.co/lkhl).
## Usage example
### Single Media inference
The model accepts both images and videos as input. The example below runs inference on a single image and then on a single video.
```python
from transformers import AutoProcessor, VideoLlama3ForConditionalGeneration

# Load the model on the available device(s)
model = VideoLlama3ForConditionalGeneration.from_pretrained("lkhl/VideoLLaMA3-2B-Image-HF", device_map="auto")
processor = AutoProcessor.from_pretrained("lkhl/VideoLLaMA3-2B-Image-HF")

# Image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/sora.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Inference: generate, then drop the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)

# Video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Inference: generate, then drop the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
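The examples above slice each output row by the length of its prompt, because `generate` returns the prompt tokens followed by the newly generated ones. The following is a minimal sketch of that slicing step, using plain Python lists in place of tensors. The helper name `strip_prompt` and the token IDs are made up for illustration and are not part of the library.

```python
def strip_prompt(prompt_ids, full_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [full[len(prompt):] for prompt, full in zip(prompt_ids, full_ids)]


# Hypothetical token IDs: two prompts of length 3, each followed by two new tokens.
prompt_ids = [[11, 12, 13], [21, 22, 23]]
full_ids = [[11, 12, 13, 101, 102], [21, 22, 23, 201, 202]]

new_tokens = strip_prompt(prompt_ids, full_ids)
print(new_tokens)  # [[101, 102], [201, 202]]
```

Only the sliced rows are passed to `batch_decode`, so the decoded text contains the answer without the echoed prompt.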
### Batch Mixed Media Inference
The model can process a batch that mixes images, videos, and text-only samples. The example below reuses the `model` and `processor` loaded in the previous example.
```python
# Image
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/sora.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Video
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

# Text
conversation3 = [
    {
        "role": "user",
        "content": "What color is a banana?",
    }
]

conversations = [conversation1, conversation2, conversation3]

# Preparation for batch inference
inputs = processor.apply_chat_template(
    conversations,
    fps=1,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    padding_side="left",
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Batch inference: generate, then drop the (padded) prompt tokens from each row
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
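The batch example passes `padding_side="left"`. A rough way to see why is that left padding places every prompt's last real token in the final column, so generated tokens are appended directly after the prompt and a single slice length works for every row. The sketch below illustrates this with plain Python lists. The helper `left_pad` and the pad ID `0` are hypothetical and only stand in for what the processor does internally.

```python
def left_pad(sequences, pad_id):
    """Left-pad token ID lists to a common width and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical prompts of different lengths; pad_id=0 is an assumption for illustration.
batch = [[5, 6, 7, 8], [9, 10]]
input_ids, attention_mask = left_pad(batch, pad_id=0)

print(input_ids)       # [[5, 6, 7, 8], [0, 0, 9, 10]]
print(attention_mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Every padded row has the same length, which is why slicing each output by `len(input_ids)` removes the whole prompt, padding included.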
#### Flash-Attention 2 to speed up generation
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Your hardware must also be compatible with FlashAttention-2; see the [flash attention repository](https://github.com/Dao-AILab/flash-attention) for details. FlashAttention-2 can only be used when the model is loaded in `torch.float16` or `torch.bfloat16`.
To load and run a model with FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:
```python
import torch
from transformers import VideoLlama3ForConditionalGeneration

model = VideoLlama3ForConditionalGeneration.from_pretrained(
    "lkhl/VideoLLaMA3-2B-Image-HF",
    dtype=torch.bfloat16,  # FlashAttention-2 requires float16 or bfloat16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
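When the same script has to run on machines with and without FlashAttention-2, one option is to choose the attention backend at runtime. The sketch below is an assumption-laden illustration, not part of the library: `pick_attn_implementation` is a hypothetical helper that checks only whether the `flash_attn` package is importable and whether the requested dtype is half-precision, and it falls back to `"sdpa"` otherwise. It does not verify GPU compatibility.

```python
import importlib.util

# dtype names for which FlashAttention-2 can be used
HALF_PRECISION = {"float16", "bfloat16"}


def pick_attn_implementation(dtype_name: str) -> str:
    """Return "flash_attention_2" if flash_attn is installed and the dtype allows it, else "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_flash_attn and dtype_name in HALF_PRECISION:
        return "flash_attention_2"
    return "sdpa"


# The first result depends on whether flash_attn is installed in this environment.
print(pick_attn_implementation("bfloat16"))
print(pick_attn_implementation("float32"))  # sdpa
```

The returned string can then be passed as `attn_implementation=...` to `from_pretrained`.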
## VideoLlama3Config
[[autodoc]] VideoLlama3Config
## VideoLlama3VisionConfig
[[autodoc]] VideoLlama3VisionConfig
## VideoLlama3ImageProcessor
[[autodoc]] VideoLlama3ImageProcessor
    - preprocess
## VideoLlama3VideoProcessor
[[autodoc]] VideoLlama3VideoProcessor
    - preprocess
## VideoLlama3ImageProcessorPil
[[autodoc]] VideoLlama3ImageProcessorPil
    - preprocess
## VideoLlama3Processor
[[autodoc]] VideoLlama3Processor
    - __call__
## VideoLlama3Model
[[autodoc]] VideoLlama3Model
    - forward
    - get_video_features
    - get_image_features
## VideoLlama3VisionModel
[[autodoc]] VideoLlama3VisionModel
    - forward
## VideoLlama3ForConditionalGeneration
[[autodoc]] VideoLlama3ForConditionalGeneration
    - forward
    - get_video_features
    - get_image_features