first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
0
tests/models/gemma4/__init__.py
Normal file
247
tests/models/gemma4/test_image_processing_gemma4.py
Normal file
@@ -0,0 +1,247 @@
# Copyright 2026 the HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

import numpy as np
from parameterized import parameterized

from transformers.models.gemma4.image_processing_pil_gemma4 import get_aspect_ratio_preserving_size
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_torch_available, is_torchvision_available, is_vision_available

from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs


if is_torch_available():
    import torch

if is_vision_available():
    from PIL import Image

if is_torchvision_available():
    pass


class Gemma4ImageProcessingTester:
    def __init__(
        self,
        parent,
        batch_size=7,
        num_channels=3,
        min_resolution=30,
        max_resolution=400,
        do_resize=True,
        do_normalize=False,
        image_mean=None,
        image_std=None,
        do_convert_rgb=True,
        patch_size=6,
        max_soft_tokens=70,
        pooling_kernel_size=1,
    ):
        super().__init__()
        image_mean = image_mean if image_mean is not None else [0.0, 0.0, 0.0]
        image_std = image_std if image_std is not None else [1.0, 1.0, 1.0]
        self.parent = parent
        self.batch_size = batch_size
        self.num_channels = num_channels
        self.min_resolution = min_resolution
        self.max_resolution = max_resolution
        self.do_resize = do_resize
        self.do_normalize = do_normalize
        self.image_mean = image_mean
        self.image_std = image_std
        self.do_convert_rgb = do_convert_rgb
        self.patch_size = patch_size
        self.max_soft_tokens = max_soft_tokens
        self.pooling_kernel_size = pooling_kernel_size

    def prepare_image_processor_dict(self):
        return {
            "do_resize": self.do_resize,
            "do_normalize": self.do_normalize,
            "image_mean": self.image_mean,
            "image_std": self.image_std,
            "do_convert_rgb": self.do_convert_rgb,
            "patch_size": self.patch_size,
            "max_soft_tokens": self.max_soft_tokens,
            "pooling_kernel_size": self.pooling_kernel_size,
        }

    # Copied from tests.models.clip.test_image_processing_clip.CLIPImageProcessingTester.prepare_image_inputs
    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
        return prepare_image_inputs(
            batch_size=self.batch_size,
            num_channels=self.num_channels,
            min_resolution=self.min_resolution,
            max_resolution=self.max_resolution,
            equal_resolution=equal_resolution,
            numpify=numpify,
            torchify=torchify,
        )

    def expected_output_image_shape(self, images=None):
        """Return the expected per-image output shape: (max_patches, patch_pixels)."""
        max_patches = self.max_soft_tokens * self.pooling_kernel_size**2
        # Images are always converted to RGB (3 channels) before patchification
        patch_pixels = self.patch_size**2 * 3
        return max_patches, patch_pixels


@require_torch
@require_vision
class Gemma4ImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
    def setUp(self):
        super().setUp()
        self.image_processor_tester = Gemma4ImageProcessingTester(self)

    @unittest.skip("Gemma4 patchification requires RGB (3-channel) images; 4-channel inputs are unsupported.")
    def test_call_numpy_4_channels(self):
        pass

    @property
    def image_processor_dict(self):
        return self.image_processor_tester.prepare_image_processor_dict()

    def test_image_processor_properties(self):
        """Test that all expected attributes are present."""
        for image_processing_class in self.image_processing_classes.values():
            image_processing = image_processing_class(**self.image_processor_dict)
            self.assertTrue(hasattr(image_processing, "do_resize"))
            self.assertTrue(hasattr(image_processing, "do_normalize"))
            self.assertTrue(hasattr(image_processing, "image_mean"))
            self.assertTrue(hasattr(image_processing, "image_std"))
            self.assertTrue(hasattr(image_processing, "do_convert_rgb"))
            self.assertTrue(hasattr(image_processing, "patch_size"))
            self.assertTrue(hasattr(image_processing, "max_soft_tokens"))
            self.assertTrue(hasattr(image_processing, "pooling_kernel_size"))

    def test_image_processor_defaults(self):
        """Test default parameter values for Gemma4 matching VARASP_SL280_K3."""
        for image_processing_class in self.image_processing_classes.values():
            proc = image_processing_class()
            self.assertEqual(proc.patch_size, 16)
            self.assertEqual(proc.max_soft_tokens, 280)
            self.assertEqual(proc.pooling_kernel_size, 3)
            self.assertFalse(proc.do_normalize)
            self.assertEqual(list(proc.image_mean), [0.0, 0.0, 0.0])
            self.assertEqual(list(proc.image_std), [1.0, 1.0, 1.0])
            self.assertEqual(proc.resample, 3)

    def test_image_processor_from_dict_with_kwargs(self):
        for image_processing_class in self.image_processing_classes.values():
            image_processor = image_processing_class.from_dict(self.image_processor_dict)
            self.assertEqual(image_processor.patch_size, 6)
            self.assertEqual(image_processor.max_soft_tokens, 70)

            image_processor = image_processing_class.from_dict(self.image_processor_dict, patch_size=18)
            self.assertEqual(image_processor.patch_size, 18)

    def test_output_keys(self):
        """Test that the output contains pixel_values, image_position_ids, and num_soft_tokens_per_image."""
        for image_processing_class in self.image_processing_classes.values():
            image_processing = image_processing_class(**self.image_processor_dict)
            image = Image.fromarray(np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8))
            result = image_processing(image, return_tensors="pt")
            self.assertIn("pixel_values", result)
            self.assertIn("image_position_ids", result)
            self.assertIn("num_soft_tokens_per_image", result)

    def test_aspect_ratio_preserving_resize_dimensions(self):
        """Test resize dimension calculations match C++ source of truth VisionAspectRatioTests."""
        for patch_size, max_patches, pooling_kernel_size, height, width, expectation in [
            (16, 256, 1, 256, 256, (256, 256)),
            (16, 256, 1, 512, 512, (256, 256)),
            (10, 200, 1, 50, 10000, (10, 2000)),
            (10, 200, 1, 25, 10000, (10, 2000)),
            (16, 2304, 6, 2785, 34, (6144, 96)),
            (10, 200, 1, 25, 20000, (10, 2000)),
            (4, 64, 2, 50, 1000, (8, 128)),
            (5, 100, 3, 100, 100, (45, 45)),
            (5, 20, 3, 5, 100, (15, 30)),
        ]:
            target_h, target_w = get_aspect_ratio_preserving_size(
                height=height,
                width=width,
                patch_size=patch_size,
                max_patches=max_patches,
                pooling_kernel_size=pooling_kernel_size,
            )
            side_mult = patch_size * pooling_kernel_size

            self.assertEqual((target_h, target_w), expectation)
            self.assertEqual(target_h % side_mult, 0, f"Resized height {target_h} not divisible by {side_mult}")
            self.assertEqual(target_w % side_mult, 0, f"Resized width {target_w} not divisible by {side_mult}")

    @parameterized.expand([(70), (140), (280), (560), (1120)])
    def test_max_soft_tokens_values(self, max_soft_tokens):
        """Test that the processor produces valid patchified output for each supported max_soft_tokens value."""
        for image_processing_class in self.image_processing_classes.values():
            processor = image_processing_class(patch_size=16, max_soft_tokens=max_soft_tokens, pooling_kernel_size=3)
            image = Image.fromarray(np.random.randint(0, 255, (200, 300, 3), dtype=np.uint8))
            result = processor(image, return_tensors="pt")

            max_patches = max_soft_tokens * 3**2
            patch_pixels = 16 * 16 * 3
            self.assertEqual(result.pixel_values.shape, (1, max_patches, patch_pixels))
            self.assertEqual(result.image_position_ids.shape, (1, max_patches, 2))

            # Verify real patches don't exceed the budget
            real_mask = result.image_position_ids[0, :, 0] >= 0
            num_real = real_mask.sum().item()
            self.assertLessEqual(num_real, max_patches)

    def test_position_ids_structure(self):
        """Test that image_position_ids has correct real and padding structure."""
        for image_processing_class in self.image_processing_classes.values():
            image_processing = image_processing_class(**self.image_processor_dict)
            image = Image.fromarray(np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8))
            result = image_processing(image, return_tensors="pt")

            position_ids = result.image_position_ids[0]  # (max_patches, 2)
            max_patches = (
                self.image_processor_tester.max_soft_tokens * self.image_processor_tester.pooling_kernel_size**2
            )

            # Real positions should be non-negative
            real_mask = position_ids[:, 0] >= 0
            num_real = real_mask.sum().item()
            self.assertGreater(num_real, 0)
            self.assertLessEqual(num_real, max_patches)

            # Padding positions should be (-1, -1)
            pad_mask = ~real_mask
            if pad_mask.any():
                pad_positions = position_ids[pad_mask]
                self.assertTrue((pad_positions == -1).all())

            # Real positions should come before padding positions
            if pad_mask.any():
                last_real_idx = torch.where(real_mask)[0][-1].item()
                first_pad_idx = torch.where(pad_mask)[0][0].item()
                self.assertEqual(last_real_idx + 1, first_pad_idx)

    def test_padding_patches_are_zero(self):
        """Test that padding patches in pixel_values are filled with zeros."""
        for image_processing_class in self.image_processing_classes.values():
            image_processing = image_processing_class(**self.image_processor_dict)
            image = Image.fromarray(np.random.randint(1, 255, (100, 100, 3), dtype=np.uint8))
            result = image_processing(image, return_tensors="pt")

            position_ids = result.image_position_ids[0]
            pad_mask = position_ids[:, 0] < 0
            if pad_mask.any():
                pad_patches = result.pixel_values[0, pad_mask]
                self.assertTrue((pad_patches == 0).all())
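Reviewer note on `test_aspect_ratio_preserving_resize_dimensions`: the expectation table can be reproduced with the small sketch below. It is a guess at the resize rule, derived only from the nine rows in the test, and it is not the code in `image_processing_pil_gemma4`. The function name `sketch_aspect_ratio_preserving_size` is hypothetical. The assumed rule is: scale the image so its area fits the pixel budget `max_patches * patch_size**2`, floor each side to a multiple of `patch_size * pooling_kernel_size`, and, if one side floors to zero, pin it to one multiple and cap the other side.

```python
import math


def sketch_aspect_ratio_preserving_size(height, width, patch_size, max_patches, pooling_kernel_size):
    """Hypothetical re-derivation of the resize rule; NOT the library implementation."""
    side_mult = patch_size * pooling_kernel_size
    # Pixel budget implied by the patch budget.
    target_pixels = max_patches * patch_size**2

    # floor(sqrt(target_pixels / (h * w)) * h / side_mult), computed with
    # integers only so that no floating-point rounding can change the floor.
    k_h = math.isqrt((target_pixels * height) // (width * side_mult**2))
    k_w = math.isqrt((target_pixels * width) // (height * side_mult**2))
    target_h, target_w = k_h * side_mult, k_w * side_mult

    # Assumed cap for the long side when the short side collapses to zero.
    max_side = (max_patches // pooling_kernel_size**2) * side_mult
    if target_h == 0 and target_w == 0:
        raise ValueError("Image is too small for the given patch budget.")
    if target_h == 0:
        target_h = side_mult
        target_w = min((width // height) * side_mult, max_side)
    elif target_w == 0:
        target_w = side_mult
        target_h = min((height // width) * side_mult, max_side)
    return target_h, target_w


# Rows taken from the test's expectation table.
print(sketch_aspect_ratio_preserving_size(512, 512, 16, 256, 1))
print(sketch_aspect_ratio_preserving_size(2785, 34, 16, 2304, 6))
print(sketch_aspect_ratio_preserving_size(5, 100, 5, 20, 3))
```

The integer `math.isqrt` form is a choice made for this sketch so that the floor is exact; the real helper may well use floating-point arithmetic and handle other edge cases differently.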
906
tests/models/gemma4/test_modeling_gemma4.py
Normal file
@@ -0,0 +1,906 @@
# Copyright 2026 the HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch Gemma4 model."""

import unittest
from contextlib import contextmanager

import pytest
from parameterized import parameterized

from transformers import (
    AutoTokenizer,
    Gemma4Config,
    Gemma4TextConfig,
    is_torch_available,
)
from transformers.testing_utils import (
    Expectations,
    cleanup,
    require_deterministic_for_xpu,
    require_torch,
    require_torch_accelerator,
    require_torch_multi_gpu,
    slow,
    torch_device,
)

from ...causal_lm_tester import CausalLMModelTest, CausalLMModelTester
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
from ...test_processing_common import url_to_local_path


if is_torch_available():
    import torch

    from transformers import (
        AutoModelForCausalLM,
        Gemma4ForCausalLM,
        Gemma4ForConditionalGeneration,
        Gemma4Model,
        Gemma4Processor,
        Gemma4TextModel,
    )


GEMMA4_RANDOM_MOE_FA2_SKIP_REASON = (
    "Randomly initialized Gemma4 MoE routers are too sensitive to tiny eager/FA2 input differences"
)


class Gemma4TextModelTester(CausalLMModelTester):
    if is_torch_available():
        config_class = Gemma4TextConfig
        base_model_class = Gemma4TextModel
        causal_lm_class = Gemma4ForCausalLM

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_hidden_layers = 4  # override to correctly test the cache-sharing pattern
        self.num_kv_shared_layers = 2  # important to override
        self.layer_types = [
            "sliding_attention",
            "full_attention",
            "sliding_attention",
            "full_attention",
        ]  # similarly, we want to test sharing on both layer types
        self.global_head_dim = self.head_dim  # Gemma4 uses a different head_dim for full and sliding layers

        # To make the model small
        self.vocab_size_per_layer_input = 99
        self.hidden_size_per_layer_input = 16

        # To activate MoE blocks
        self.enable_moe_block = True
        self.moe_intermediate_size = 16
        self.top_k_experts = 2

        # Test that the bidirectional image mask path works
        self.use_bidirectional_attention = "vision"


@require_torch
class Gemma4TextModelTest(CausalLMModelTest, unittest.TestCase):
    model_tester_class = Gemma4TextModelTester
    # used in `test_torch_compile_for_training`
    _torch_compile_train_cls = Gemma4ForCausalLM if is_torch_available() else None

    @unittest.skip("We need 4 layers to correctly test cache sharing.")
    def test_num_layers_is_small(self):
        pass

    @unittest.skip("Gemma4 uses different rope per layer type, which is not compatible with this test")
    def test_model_rope_scaling_frequencies(self):
        pass

    @parameterized.expand([("linear",), ("dynamic",), ("yarn",)])
    @unittest.skip("Gemma4 uses different rope per layer type, which is not compatible with this test")
    def test_model_rope_scaling_from_config(self):
        pass

    @unittest.skip(
        "Gemma4 cannot use random inputs_embeds, as it needs to reverse them when input_ids is not provided"
    )
    def test_generate_from_random_inputs_embeds(self):
        pass

    @unittest.skip(
        "Flaky on CI, but not locally on Mac. If model is set to fp32 instead of bf16, not flaky anymore. "
        "TODO Cyril: investigate where the loss of precision between bf16 and fp32 comes from."
    )
    def test_sdpa_padding_matches_padding_free_with_position_ids(self):
        pass

    @unittest.skip(
        "Fails after fully removing the unused weights, even if `forward` is exactly the same. Investigate why."
    )
    def test_tp_generation_quantized(self):
        pass

    @unittest.skip(GEMMA4_RANDOM_MOE_FA2_SKIP_REASON)
    def test_flash_attn_2_equivalence(self):
        pass

    @unittest.skip(GEMMA4_RANDOM_MOE_FA2_SKIP_REASON)
    def test_flash_attn_2_inference_equivalence(self):
        pass

    @unittest.skip(GEMMA4_RANDOM_MOE_FA2_SKIP_REASON)
    def test_flash_attn_2_inference_equivalence_right_padding(self):
        pass

    def test_all_bidirectional_attention_uses_bidirectional_mask(self):
        self.model_tester.use_bidirectional_attention = "all"
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config._attn_implementation = "eager"

        model = Gemma4TextModel(config).to(torch_device)
        model.eval()

        input_ids = inputs_dict["input_ids"][:1]
        with torch.no_grad():
            out = model(input_ids=input_ids, output_attentions=True)

        for attention in out.attentions:
            self.assertTrue((attention[..., :4, :4] != 0).all().item())

    def test_model_training(self):
        pass

    @unittest.skip(
        "Under non-bf16 dtypes, MoE grouped_mm falls back to "
        "_grouped_mm_fallback_backward which is incompatible with torch.compile under 'reduce-overhead' mode"
    )
    def test_flash_attn_2_can_compile_with_attention_mask_None_without_graph_break(self):
        pass

    @unittest.skip(
        "Under non-bf16 dtypes, MoE grouped_mm falls back to "
        "_grouped_mm_fallback_backward which is incompatible with torch.compile under 'reduce-overhead' mode"
    )
    def test_torch_compile_for_training(self):
        pass


class Gemma4Audio2TextModelTester:
    def __init__(
        self,
        parent,
        image_token_id=4,
        boi_token_id=5,
        eoi_token_id=6,
        audio_token_id=7,
        boa_token_id=8,
        eoa_token_index=9,
        video_token_id=10,
        seq_length=50,
        audio_seq_length=96,
        audio_num_channels=16,
        is_training=True,
        audio_config={
            "hidden_size": 32,
            "num_hidden_layers": 2,
            "num_attention_heads": 4,
            "hidden_act": "silu",
            "subsampling_conv_channels": [16, 8],
            "conv_kernel_size": 3,
            "attention_chunk_size": 4,
            "attention_context_left": 5,
            "attention_context_right": 0,
            "output_proj_dims": 32,
            # Clipped linears register inf/-inf buffers which cause NaN in test_torch_save_load's
            # comparison logic (inf - inf = NaN). Disable for testing.
            "use_clipped_linears": False,
        },
    ):
        self.parent = parent
        self.image_token_id = image_token_id
        self.boi_token_id = boi_token_id
        self.eoi_token_id = eoi_token_id
        self.audio_token_id = audio_token_id
        self.boa_token_id = boa_token_id
        self.eoa_token_index = eoa_token_index
        self.video_token_id = video_token_id
        self.llm_tester = Gemma4TextModelTester(self.parent)
        self.llm_tester.use_bidirectional_attention = None
        self.text_config = self.llm_tester.get_config()
        self.audio_config = audio_config
        self.seq_length = seq_length
        self.audio_seq_length = audio_seq_length
        self.audio_num_channels = audio_num_channels
        self.pad_token_id = self.text_config.pad_token_id

        self.num_hidden_layers = self.text_config.num_hidden_layers
        self.vocab_size = self.text_config.vocab_size
        self.hidden_size = self.text_config.hidden_size
        self.num_attention_heads = self.text_config.num_attention_heads
        self.is_training = is_training

        self.batch_size = 3
        self.encoder_seq_length = seq_length

    def get_config(self):
        return Gemma4Config(
            text_config=self.text_config,
            vision_config=None,
            audio_config=self.audio_config,
            image_token_id=self.image_token_id,
            boi_token_id=self.boi_token_id,
            eoi_token_id=self.eoi_token_id,
            audio_token_id=self.audio_token_id,
            boa_token_id=self.boa_token_id,
            eoa_token_index=self.eoa_token_index,
            video_token_id=self.video_token_id,
        )

    def prepare_config_and_inputs(self):
        input_features = floats_tensor([self.batch_size, self.audio_seq_length, self.audio_num_channels])
        input_features_mask = torch.ones(self.batch_size, self.audio_seq_length, dtype=torch.bool)
        config = self.get_config()
        return config, input_features, input_features_mask

    def prepare_config_and_inputs_for_common(self):
        config, input_features, input_features_mask = self.prepare_config_and_inputs()
        input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 1) + 1
        attention_mask = input_ids.ne(self.pad_token_id).to(torch_device)

        # Ensure no tokens accidentally match special token IDs
        for token_id in [config.image_token_id, config.video_token_id, config.audio_token_id]:
            input_ids[input_ids == token_id] = self.pad_token_id

        # The audio encoder produces audio_seq_length / 4 tokens per audio sample after subsampling.
        # We need that many audio placeholder tokens per sequence in input_ids.
        num_audio_tokens = self.audio_seq_length // 4
        input_ids[:, :num_audio_tokens] = config.audio_token_id

        inputs_dict = {
            "input_features": input_features,
            "input_features_mask": input_features_mask,
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        return config, inputs_dict


@require_torch
class Gemma4Audio2TextModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    all_model_classes = (Gemma4Model, Gemma4ForConditionalGeneration) if is_torch_available() else ()
    all_generative_model_classes = (Gemma4ForConditionalGeneration,) if is_torch_available() else ()

    def setUp(self):
        self.model_tester = Gemma4Audio2TextModelTester(self)
        self.config_tester = ConfigTester(self, config_class=Gemma4Config, hidden_size=37)

    @unittest.skip("The tester has no image in input dict")
    def test_get_image_features_hidden_states(self):
        pass

    @unittest.skip("The tester has no image in input dict")
    def test_get_image_features_attentions(self):
        pass

    @unittest.skip("The tester has no image in input dict")
    @parameterized.expand([True, False, None])
    def test_get_image_features_output(self, return_dict: bool | None):
        pass

    @unittest.skip("The tester has no videos in input dict")
    def test_get_video_features_hidden_states(self):
        pass

    @unittest.skip("The tester has no videos in input dict")
    def test_get_video_features_attentions(self):
        pass

    @unittest.skip("The tester has no videos in input dict")
    @parameterized.expand([True, False, None])
    def test_get_video_features_output(self, return_dict: bool | None):
        pass

    @unittest.skip("We need 4 layers to correctly test cache sharing.")
    def test_num_layers_is_small(self):
        pass

    @unittest.skip("Gemma4 needs correct embeddings for per-layer-input computation, random won't work!")
    def test_generate_from_random_inputs_embeds(self):
        pass

    @unittest.skip(GEMMA4_RANDOM_MOE_FA2_SKIP_REASON)
    def test_flash_attn_2_inference_equivalence(self):
        pass

    @unittest.skip(GEMMA4_RANDOM_MOE_FA2_SKIP_REASON)
    def test_flash_attn_2_inference_equivalence_right_padding(self):
        pass

    def test_audio_rel_pos_encoding_uses_context_size_from_config(self):
        """Regression test for #45468; attention context size is properly read from config"""
        from transformers.models.gemma4.configuration_gemma4 import Gemma4AudioConfig
        from transformers.models.gemma4.modeling_gemma4 import Gemma4AudioRelPositionalEncoding

        config = Gemma4AudioConfig(
            hidden_size=32,
            attention_chunk_size=6,
            attention_context_left=5,
            attention_context_right=1,
            use_clipped_linears=False,
        )

        module = Gemma4AudioRelPositionalEncoding(config)
        hidden_states = torch.zeros(1, 3, config.hidden_size)

        pos = module(hidden_states)

        context_size = config.attention_chunk_size + config.attention_context_left - 1 + config.attention_context_right
        expected_len = context_size // 2 + 1

        self.assertEqual(pos.shape, (1, expected_len, config.hidden_size))

        position_ids = torch.arange(context_size // 2, -1, -1, device=hidden_states.device)[..., None]
        scaled_time = position_ids * module.inv_timescales.to(device=hidden_states.device)
        expected = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=-1).to(hidden_states.dtype)

        torch.testing.assert_close(pos, expected)


class Gemma4Vision2TextModelTester:
    def __init__(
        self,
        parent,
        mm_tokens_per_image=2,
        image_token_id=4,
        video_token_id=7,
        audio_token_id=8,
        boi_token_id=5,
        eoi_token_id=6,
        seq_length=25,
        is_training=True,
        vision_config={
            "use_labels": True,
            "image_size": 20,
            "patch_size": 5,
            "num_channels": 3,
            "is_training": True,
            "hidden_size": 32,
            "num_key_value_heads": 1,
            "num_hidden_layers": 2,
            "num_attention_heads": 4,
            "intermediate_size": 37,
            "dropout": 0.1,
            "attention_dropout": 0.1,
            "initializer_range": 0.02,
        },
    ):
        self.parent = parent
        # `image_token_id` is set to 0 to pass "resize_embeddings" test, do not modify
        self.mm_tokens_per_image = mm_tokens_per_image
        self.image_token_id = image_token_id
        self.video_token_id = video_token_id
        self.audio_token_id = audio_token_id
        self.boi_token_id = boi_token_id
        self.eoi_token_id = eoi_token_id
        self.llm_tester = Gemma4TextModelTester(self.parent)
        self.text_config = self.llm_tester.get_config()
        self.vision_config = vision_config
        self.seq_length = seq_length
        self.pad_token_id = self.text_config.pad_token_id

        self.num_hidden_layers = self.text_config.num_hidden_layers
        self.vocab_size = self.text_config.vocab_size
        self.hidden_size = self.text_config.hidden_size
        self.num_attention_heads = self.text_config.num_attention_heads
        self.is_training = is_training

        self.batch_size = 3
        self.num_channels = vision_config["num_channels"]
        self.image_size = vision_config["image_size"]
        self.encoder_seq_length = seq_length

    def get_config(self):
        return Gemma4Config(
            text_config=self.text_config,
            vision_config=self.vision_config,
            image_token_id=self.image_token_id,
            video_token_id=self.video_token_id,
            audio_token_id=self.audio_token_id,
            boi_token_id=self.boi_token_id,
            eoi_token_id=self.eoi_token_id,
            mm_tokens_per_image=self.mm_tokens_per_image,
||||
)
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
config = self.get_config()
|
||||
config.vision_config.pooling_kernel_size = 2
|
||||
|
||||
# (num_images, max_num_patches, patch_size * patch_size * num_channels)
|
||||
patch_size = config.vision_config.patch_size
|
||||
pixel_values = floats_tensor(
|
||||
[
|
||||
self.batch_size,
|
||||
self.vision_config["image_size"],
|
||||
patch_size * patch_size * self.vision_config["num_channels"],
|
||||
]
|
||||
)
|
||||
# (num_images, max_num_patches, 2) for height/width positions. Let it be all ones for testing
|
||||
pixel_position_ids = torch.ones(self.vision_config["image_size"], device=torch_device, dtype=torch.long)
|
||||
pixel_position_ids = pixel_position_ids[None, :, None].repeat(self.batch_size, 1, 2)
|
||||
|
||||
return config, pixel_values, pixel_position_ids
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, pixel_values, pixel_position_ids = config_and_inputs
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 1) + 1
|
||||
attention_mask = input_ids.ne(self.pad_token_id).to(torch_device)
|
||||
|
||||
# Ensure no tokens accidentally match special token IDs
|
||||
for token_id in [config.image_token_id, config.video_token_id, config.audio_token_id]:
|
||||
input_ids[input_ids == token_id] = self.pad_token_id
|
||||
input_ids[:, :1] = config.image_token_id
|
||||
|
||||
mm_token_type_ids = torch.zeros_like(input_ids)
|
||||
mm_token_type_ids[input_ids == config.image_token_id] = 1
|
||||
|
||||
inputs_dict = {
|
||||
"pixel_values": pixel_values,
|
||||
"image_position_ids": pixel_position_ids,
|
||||
"input_ids": input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"mm_token_type_ids": mm_token_type_ids,
|
||||
}
|
||||
return config, inputs_dict
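The input preparation in `prepare_config_and_inputs_for_common` follows three steps: scrub token IDs that accidentally collide with special tokens, plant one image token at the start of each row, and derive `mm_token_type_ids` from the result. Below is a rough sketch with plain Python lists standing in for tensors. The function name `build_inputs`, the sample rows, and `pad_token_id=0` are assumptions made for illustration only.

```python
def build_inputs(rows, image_token_id, special_ids, pad_token_id):
    """Plain-list analogue of the tester's tensor logic (illustrative only)."""
    cleaned = []
    for row in rows:
        # Replace any token that collides with a special ID by the pad token.
        row = [pad_token_id if tok in special_ids else tok for tok in row]
        # Force exactly one image placeholder at position 0.
        row[0] = image_token_id
        cleaned.append(row)
    # 1 marks image placeholder positions, 0 marks everything else.
    type_ids = [[1 if tok == image_token_id else 0 for tok in row] for row in cleaned]
    return cleaned, type_ids


# Token IDs 4, 7, 8 match the tester defaults for image, video, and audio.
input_ids, mm_token_type_ids = build_inputs(
    rows=[[9, 4, 7, 12], [8, 10, 11, 4]],
    image_token_id=4,
    special_ids={4, 7, 8},
    pad_token_id=0,  # assumed value for this sketch
)
print(input_ids, mm_token_type_ids)
```

Scrubbing before planting the image token ensures each row carries exactly one image placeholder.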
|
||||
|
||||
|
||||
@require_torch
|
||||
class Gemma4Vision2TextModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (Gemma4Model, Gemma4ForConditionalGeneration) if is_torch_available() else ()
|
||||
all_generative_model_classes = (Gemma4ForConditionalGeneration,) if is_torch_available() else ()
|
||||
additional_model_inputs = ["mm_token_type_ids"]
|
||||
model_split_percents = [0.85, 0.9]
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = Gemma4Vision2TextModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=Gemma4Config, hidden_size=37)
|
||||
self.skip_flash_attn_inference_equivalence_tests()
|
||||
|
||||
def skip_flash_attn_inference_equivalence_tests(self):
|
||||
skippable_tests = [
|
||||
"test_flash_attn_2_inference_equivalence",
|
||||
"test_flash_attn_3_inference_equivalence",
|
||||
"test_flash_attn_4_inference_equivalence",
|
||||
]
|
||||
for test in skippable_tests:
|
||||
if self._testMethodName.startswith(test):
|
||||
self.skipTest(
|
||||
reason="The base test does not pass image_position_ids and mm_token_type_ids required by Gemma4"
|
||||
)
|
||||
|
||||
def test_training(self):
|
||||
# Overwrite to test training with text-only samples, should not raise errors
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.return_dict = True
|
||||
|
||||
model = Gemma4ForConditionalGeneration(config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
inputs = self._prepare_for_class(inputs_dict, Gemma4ForConditionalGeneration, return_labels=True)
|
||||
loss = model(**inputs).loss
|
||||
loss.backward()
|
||||
|
||||
# pop out image-related inputs and try to run forward
|
||||
inputs.pop("mm_token_type_ids", None)
|
||||
inputs.pop("pixel_values", None)
|
||||
loss = model(**inputs).loss
|
||||
loss.backward()
|
||||
|
||||
@unittest.skip("The tester has no audios in input dict")
|
||||
def test_get_audio_features_hidden_states(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("The tester has no audios in input dict")
|
||||
def test_get_audio_features_attentions(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("The tester has no audios in input dict")
|
||||
@parameterized.expand([True, False, None])
|
||||
def test_get_audio_features_output(self, return_dict: bool | None):
|
||||
pass
|
||||
|
||||
@unittest.skip("The tester has no videos in input dict")
|
||||
def test_get_video_features_hidden_states(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("The tester has no videos in input dict")
|
||||
def test_get_video_features_attentions(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("The tester has no videos in input dict")
|
||||
@parameterized.expand([True, False, None])
|
||||
def test_get_video_features_output(self, return_dict: bool | None):
|
||||
pass
|
||||
|
||||
@unittest.skip("We need 4 layers to correctly test cache sharing.")
|
||||
def test_num_layers_is_small(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("Gemma4 needs correct embeddings for per-layer-input computation, random won't work!")
|
||||
def test_generate_from_random_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Randomly starts failing after module order changed in the __init__ because accelerate is not robust enough"
|
||||
)
|
||||
def test_cpu_offload(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Randomly starts failing after module order changed in the __init__ because accelerate is not robust enough"
|
||||
)
|
||||
def test_disk_offload_bin(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Randomly starts failing after module order changed in the __init__ because accelerate is not robust enough"
|
||||
)
|
||||
def test_disk_offload_safetensors(self):
|
||||
pass
|
||||
|
||||
def test_per_layer_inputs_are_correctly_forwarded(self):
|
||||
from transformers.models.gemma4.modeling_gemma4 import Gemma4TextModel
|
||||
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
model = Gemma4ForConditionalGeneration(config).to(torch_device)
|
||||
model.eval()
|
||||
|
||||
input_ids = torch.randint(20, 50, (1, 10), device=torch_device)
|
||||
inputs_embeds = model.get_input_embeddings()(input_ids)
|
||||
per_layer_inputs = model.model.language_model.get_per_layer_inputs(input_ids, None)
|
||||
|
||||
@contextmanager
|
||||
def count_get_per_layer_inputs_calls():
|
||||
original = Gemma4TextModel.get_per_layer_inputs
|
||||
counter = {"call_count": 0}
|
||||
|
||||
def count_calls(*args, **kwargs):
|
||||
nonlocal counter
|
||||
counter["call_count"] += 1
|
||||
return original(*args, **kwargs)
|
||||
|
||||
Gemma4TextModel.get_per_layer_inputs = count_calls
|
||||
try:
|
||||
yield counter
|
||||
finally:
|
||||
Gemma4TextModel.get_per_layer_inputs = original
|
||||
|
||||
# We should never call `get_per_layer_inputs` if we provide both inputs_embeds and per_layer_inputs
|
||||
with count_get_per_layer_inputs_calls() as counter:
|
||||
_ = model(inputs_embeds=inputs_embeds, per_layer_inputs=per_layer_inputs)
|
||||
self.assertEqual(counter["call_count"], 0)
|
||||
|
||||
# We should call it once if we provide only input_ids
|
||||
with count_get_per_layer_inputs_calls() as counter:
|
||||
_ = model(input_ids)
|
||||
self.assertEqual(counter["call_count"], 1)
|
||||
|
||||
# We should call it once as well if we provide only inputs_embeds
|
||||
with count_get_per_layer_inputs_calls() as counter:
|
||||
_ = model(inputs_embeds=inputs_embeds)
|
||||
self.assertEqual(counter["call_count"], 1)
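The call-counting pattern used in `test_per_layer_inputs_are_correctly_forwarded` can be tried in isolation. The sketch below is a simplified analogue: `Encoder`, `get_aux`, and `count_calls` are hypothetical names invented for this example, with `get_aux` playing the role of the helper that should be skipped when its result is already supplied.

```python
from contextlib import contextmanager


class Encoder:
    """Hypothetical stand-in for a model class with an expensive helper."""

    def get_aux(self, x):
        return x * 2

    def forward(self, x, aux=None):
        # Only compute the auxiliary input when the caller did not supply it.
        if aux is None:
            aux = self.get_aux(x)
        return x + aux


@contextmanager
def count_calls(cls, method_name):
    """Temporarily wrap `cls.method_name` and count how often it is called."""
    original = getattr(cls, method_name)
    counter = {"call_count": 0}

    def wrapper(*args, **kwargs):
        counter["call_count"] += 1
        return original(*args, **kwargs)

    setattr(cls, method_name, wrapper)
    try:
        yield counter
    finally:
        # Always restore the real method, even if the body raises.
        setattr(cls, method_name, original)


enc = Encoder()

with count_calls(Encoder, "get_aux") as counter:
    enc.forward(3, aux=6)
calls_when_supplied = counter["call_count"]

with count_calls(Encoder, "get_aux") as counter:
    enc.forward(3)
calls_when_missing = counter["call_count"]

print(calls_when_supplied, calls_when_missing)
```

The counter is a dict that is mutated in place, so the wrapper needs no `nonlocal` declaration. The `finally` clause keeps the patch from leaking into other tests.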
|
||||
|
||||
|
||||
@slow
|
||||
@require_torch_accelerator
|
||||
class Gemma4IntegrationTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.model_name = "google/gemma-4-E2B-it"
|
||||
self.processor = Gemma4Processor.from_pretrained(self.model_name)
|
||||
|
||||
self.url1 = url_to_local_path(
|
||||
"https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/cow_beach_1.png"
|
||||
)
|
||||
self.url2 = url_to_local_path(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/australia.jpg"
|
||||
)
|
||||
self.messages = [
|
||||
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": self.url1},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
def tearDown(self):
|
||||
cleanup(torch_device, gc_collect=True)
|
||||
|
||||
@require_deterministic_for_xpu
|
||||
def test_model_with_image(self):
|
||||
model = Gemma4ForConditionalGeneration.from_pretrained(self.model_name, device_map=torch_device)
|
||||
|
||||
inputs = self.processor.apply_chat_template(
|
||||
self.messages,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
add_generation_prompt=True,
|
||||
).to(torch_device)
|
||||
|
||||
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
output_text = self.processor.batch_decode(output[:, input_size:], skip_special_tokens=True)
|
||||
|
||||
EXPECTED_TEXTS = Expectations(
|
||||
{
|
||||
("cuda", 8): ['This image shows a **brown and white cow** standing on a **sandy beach** with the **ocean and a blue sky** in the background'],
|
||||
("xpu", 3): ['This image shows a **brown and white cow** standing on a **sandy beach** with the **ocean and a blue sky** in the background'],
|
||||
}
|
||||
) # fmt: skip
|
||||
EXPECTED_TEXT = EXPECTED_TEXTS.get_expectation()
|
||||
self.assertEqual(output_text, EXPECTED_TEXT)
|
||||
|
||||
@require_deterministic_for_xpu
|
||||
def test_model_with_image_batch(self):
|
||||
model = Gemma4ForConditionalGeneration.from_pretrained(self.model_name, device_map=torch_device)
|
||||
|
||||
messages_2 = [
|
||||
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": self.url1,
|
||||
},
|
||||
{"type": "image", "url": self.url2},
|
||||
{"type": "text", "text": "Are these images identical?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
inputs = self.processor.apply_chat_template(
|
||||
[self.messages, messages_2],
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
padding=True,
|
||||
add_generation_prompt=True,
|
||||
).to(torch_device)
|
||||
|
||||
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
output_text = self.processor.batch_decode(output[:, input_size:], skip_special_tokens=True)
|
||||
|
||||
EXPECTED_TEXTS = Expectations(
|
||||
{
|
||||
("cuda", (8, 0)): [
|
||||
"This image shows a **brown and white cow** standing on a **sandy beach** with the **ocean and a blue sky** in the background",
|
||||
"No, these images are not identical.\n\nThe first image is a photograph of a **cow** standing on a beach under a blue sky.\n\n",
|
||||
],
|
||||
("cuda", (8, 6)): [
|
||||
"This image shows a **brown and white cow** standing on a **sandy beach** with the **ocean and a blue sky** in the background",
|
||||
"No, these images are not identical.\n\nThe first image is a photograph of a **brown and white cow standing on a beach** under a blue",
|
||||
],
|
||||
("xpu", 3): [
|
||||
"This image shows a **brown and white cow** standing on a **sandy beach** with the **ocean and a blue sky** in the background",
|
||||
"No, these images are **not identical**.\n\nHere's a breakdown of the differences:\n\n1. **Image 1 (Cow on",
|
||||
],
|
||||
}
|
||||
)
|
||||
EXPECTED_TEXT = EXPECTED_TEXTS.get_expectation()
|
||||
self.assertEqual(output_text, EXPECTED_TEXT)
|
||||
|
||||
@require_deterministic_for_xpu
|
||||
def test_model_multiimage(self):
|
||||
model = Gemma4ForConditionalGeneration.from_pretrained(self.model_name, device_map=torch_device)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": self.url2},
|
||||
{"type": "text", "text": "What do you see here?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
inputs = self.processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
padding=True,
|
||||
add_generation_prompt=True,
|
||||
).to(torch_device)
|
||||
|
||||
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
output_text = self.processor.batch_decode(output[:, input_size:], skip_special_tokens=True)
|
||||
EXPECTED_TEXTS = Expectations(
|
||||
{
|
||||
("cuda", 8): ['Based on the image, here is a description of what I see:\n\n**Foreground & Street Scene:**\n* **Traffic Sign:** The most prominent'],
|
||||
("xpu", 3): ['Based on the image, here is a description of what I see:\n\n**Foreground & Street Scene:**\n* **Roadway:** There is an'],
|
||||
}
|
||||
) # fmt: skip
|
||||
EXPECTED_TEXT = EXPECTED_TEXTS.get_expectation()
|
||||
self.assertEqual(output_text, EXPECTED_TEXT)
|
||||
|
||||
@require_torch_multi_gpu
|
||||
def test_model_text_only_multigpu(self):
|
||||
"""Accelerate destroys the input dict `shared_kv_states` if it's not passed as kwarg and part of
|
||||
`_skip_keys_device_placement`, so test this to avoid regressions.
|
||||
"""
|
||||
model = AutoModelForCausalLM.from_pretrained(self.model_name, device_map="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained(self.model_name, padding_side="left")
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": "Write a poem about Machine Learning."}],
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
add_generation_prompt=True,
|
||||
).to(model.device)
|
||||
|
||||
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
output_text = self.processor.batch_decode(output[:, input_size:], skip_special_tokens=True)
|
||||
|
||||
EXPECTED_TEXTS = Expectations(
|
||||
{
|
||||
("cuda", (8, 0)): ['## The Algorithmic Mind\n\nA whisper starts, a seed unseen,\nOf data vast, a vibrant sheen.\nA sea of numbers,'],
|
||||
("cuda", (8, 6)): ['## The Algorithmic Mind\n\nA tapestry of data, vast and deep,\nWhere silent numbers in their slumber sleep.\nA sea of text'],
|
||||
}
|
||||
) # fmt: skip
|
||||
EXPECTED_TEXT = EXPECTED_TEXTS.get_expectation()
|
||||
self.assertEqual(output_text, EXPECTED_TEXT)
|
||||
|
||||
@require_deterministic_for_xpu
|
||||
def test_model_text_only(self):
|
||||
model = AutoModelForCausalLM.from_pretrained(self.model_name, device_map=torch_device)
|
||||
tokenizer = AutoTokenizer.from_pretrained(self.model_name, padding_side="left")
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": "Write a poem about Machine Learning."}],
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
add_generation_prompt=True,
|
||||
).to(torch_device)
|
||||
|
||||
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
output_text = self.processor.batch_decode(output[:, input_size:], skip_special_tokens=True)
|
||||
|
||||
EXPECTED_TEXTS = Expectations(
|
||||
{
|
||||
("cuda", (8, 0)): ['## The Algorithmic Mind\n\nA whisper starts, a seed unseen,\nOf data vast, a vibrant sheen.\nA sea of numbers,'],
|
||||
("cuda", (8, 6)): ['## The Algorithmic Mind\n\nA tapestry of data, vast and deep,\nWhere silent numbers in their slumber sleep.\nA sea of text'],
|
||||
("xpu", 3): ['## The Algorithmic Mind\n\nA whisper starts in silicon deep,\nWhere data streams in endless sweep.\nNo flesh and blood, no beating'],
|
||||
}
|
||||
) # fmt: skip
|
||||
EXPECTED_TEXT = EXPECTED_TEXTS.get_expectation()
|
||||
self.assertEqual(output_text, EXPECTED_TEXT)
|
||||
|
||||
def test_states_sharing_with_and_without_cache(self):
|
||||
model = AutoModelForCausalLM.from_pretrained(self.model_name, device_map=torch_device)
|
||||
tokenizer = AutoTokenizer.from_pretrained(self.model_name, padding_side="left")
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": "Who are you? What can you do?"}],
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
add_generation_prompt=True,
|
||||
).to(torch_device)
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
|
||||
# Generation with and without cache should share kv states the same way
|
||||
output_with_cache = model.generate(**inputs, max_new_tokens=30, do_sample=False, use_cache=True)
|
||||
output_without_cache = model.generate(**inputs, max_new_tokens=30, do_sample=False, use_cache=False)
|
||||
|
||||
output_text_with_cache = tokenizer.batch_decode(output_with_cache[:, input_size:], skip_special_tokens=True)
|
||||
output_text_without_cache = tokenizer.batch_decode(
|
||||
output_without_cache[:, input_size:], skip_special_tokens=True
|
||||
)
|
||||
|
||||
self.assertEqual(output_text_with_cache, output_text_without_cache)
|
||||
|
||||
# Note: we do not test FA2 as the head dim is 512 on some layers, which is not compatible with the kernels
|
||||
@parameterized.expand([("sdpa",), ("eager",)])
|
||||
@require_deterministic_for_xpu
|
||||
def test_generation_beyond_sliding_window(self, attn_implementation: str):
|
||||
"""Test that we can correctly generate beyond the sliding window. Outputs for every attention function
|
||||
should be coherent and identical.
|
||||
"""
|
||||
|
||||
input_text = [
|
||||
"This is a nice place. " * 800 + "I really enjoy the scenery,", # This is larger than 4096 tokens
|
||||
"A list of colors: red, blue", # This will almost all be padding tokens
|
||||
]
|
||||
tokenizer = AutoTokenizer.from_pretrained(self.model_name, padding_side="left")
|
||||
input_text = [
|
||||
tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": item}],
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
)
|
||||
for item in input_text
|
||||
]
|
||||
inputs = tokenizer(input_text, padding=True, return_tensors="pt").to(torch_device)
|
||||
|
||||
model = Gemma4ForConditionalGeneration.from_pretrained(
|
||||
self.model_name,
|
||||
device_map=torch_device,
|
||||
attn_implementation=attn_implementation,
|
||||
)
|
||||
|
||||
# Make sure prefill is larger than sliding window
|
||||
input_size = inputs.input_ids.shape[-1]
|
||||
self.assertTrue(input_size > model.config.get_text_config().sliding_window)
|
||||
|
||||
out = model.generate(**inputs, max_new_tokens=16, do_sample=False, cache_implementation="static")
|
||||
output_text = tokenizer.batch_decode(out[:, input_size:])
|
||||
|
||||
EXPECTED_COMPLETIONS = Expectations(
|
||||
{
|
||||
("cuda", 8): [
|
||||
"That sounds lovely! It seems like you're really enjoying the place you'",
|
||||
"Here are a few ways you could use or expand upon that list, depending on",
|
||||
],
|
||||
("xpu", 3): [
|
||||
"That sounds lovely! It seems like you're really enjoying the place you'",
|
||||
"Here are a few ways you could use or expand upon that list, depending on",
|
||||
],
|
||||
}
|
||||
)
|
||||
self.assertEqual(output_text, EXPECTED_COMPLETIONS.get_expectation())
|
||||
|
||||
@pytest.mark.torch_export_test
|
||||
def test_export_text_only(self):
|
||||
from transformers.integrations.executorch import TorchExportableModuleForDecoderOnlyLM
|
||||
|
||||
model = Gemma4ForConditionalGeneration.from_pretrained(self.model_name, device_map=torch_device)
|
||||
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
|
||||
|
||||
exportable_module = TorchExportableModuleForDecoderOnlyLM(
|
||||
model, batch_size=1, max_cache_len=1024, device=torch_device
|
||||
)
|
||||
exported_program = exportable_module.export(
|
||||
input_ids=torch.tensor([[1]], device=torch_device, dtype=torch.long),
|
||||
)
|
||||
|
||||
# Test generation with the exported model
|
||||
prompt = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": "What is the capital of France?"}],
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
)
|
||||
|
||||
max_new_tokens_to_generate = 20
|
||||
# Generate text with the exported model
|
||||
export_generated_text = TorchExportableModuleForDecoderOnlyLM.generate(
|
||||
exported_program, tokenizer, prompt, max_new_tokens=max_new_tokens_to_generate, device=torch_device
|
||||
)
|
||||
|
||||
input_text = tokenizer(prompt, return_tensors="pt").to(torch_device)
|
||||
eager_outputs = model.generate(
|
||||
**input_text,
|
||||
max_new_tokens=max_new_tokens_to_generate,
|
||||
do_sample=False, # Use greedy decoding to match the exported model
|
||||
)
|
||||
|
||||
eager_generated_text = tokenizer.decode(eager_outputs[0], skip_special_tokens=True)
|
||||
self.assertEqual(export_generated_text, eager_generated_text)
|
||||
207
tests/models/gemma4/test_processing_gemma4.py
Normal file
@@ -0,0 +1,207 @@
|
||||
# Copyright 2026 the HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import shutil
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
|
||||
from transformers import Gemma4Processor
|
||||
from transformers.testing_utils import get_tests_dir, require_vision
|
||||
from transformers.utils import is_vision_available
|
||||
|
||||
from ...test_processing_common import ProcessorTesterMixin
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
pass
|
||||
|
||||
SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")
|
||||
|
||||
|
||||
@require_vision
|
||||
class Gemma4ProcessorTest(ProcessorTesterMixin, unittest.TestCase):
|
||||
processor_class = Gemma4Processor
|
||||
video_unstructured_max_length = 570
|
||||
video_text_kwargs_max_length = 570
|
||||
video_text_kwargs_override_max_length = 570
|
||||
|
||||
@classmethod
|
||||
def _setup_test_attributes(cls, processor):
|
||||
cls.image_token = processor.image_token
|
||||
cls.video_token = processor.video_token
|
||||
|
||||
@classmethod
|
||||
def _setup_video_processor(cls):
|
||||
video_processor_class = cls._get_component_class_from_processor("video_processor")
|
||||
gemma4_video_processor_kwargs = {
|
||||
"patch_size": 28,
|
||||
"max_soft_tokens": 70,
|
||||
"pooling_kernel_size": 3,
|
||||
"num_frames": 2,
|
||||
}
|
||||
return video_processor_class(**gemma4_video_processor_kwargs)
|
||||
|
||||
@classmethod
|
||||
def _setup_feature_extractor(cls):
|
||||
feature_extractor_class = cls._get_component_class_from_processor("feature_extractor")
|
||||
gemma4_feature_extractor_kwargs = {}
|
||||
return feature_extractor_class(**gemma4_feature_extractor_kwargs)
|
||||
|
||||
@classmethod
|
||||
def _setup_image_processor(cls):
|
||||
image_processor_class = cls._get_component_class_from_processor("image_processor")
|
||||
gemma4_image_processor_kwargs = {
|
||||
"patch_size": 28,
|
||||
"max_soft_tokens": 70,
|
||||
"pooling_kernel_size": 3,
|
||||
}
|
||||
return image_processor_class(**gemma4_image_processor_kwargs)
|
||||
|
||||
@classmethod
|
||||
def _setup_tokenizer(cls):
|
||||
tokenizer_class = cls._get_component_class_from_processor("tokenizer")
|
||||
extra_special_tokens = {
|
||||
"image_token": "<|image|>",
|
||||
"video_token": "<|video|>",
|
||||
"boi_token": "<start_of_image>",
|
||||
"eoi_token": "<end_of_image>",
|
||||
"audio_token": "<audio_soft_token>",
|
||||
"boa_token": "<start_of_audio>",
|
||||
"eoa_token": "<end_of_audio>",
|
||||
}
|
||||
tokenizer = tokenizer_class.from_pretrained(
|
||||
SAMPLE_VOCAB, keep_accents=True, extra_special_tokens=extra_special_tokens
|
||||
)
|
||||
tokenizer.pad_token_id = tokenizer.eos_token_id
|
||||
return tokenizer
|
||||
|
||||
# Copied from tests.models.llava.test_processing_llava.LlavaProcessorTest.test_get_num_vision_tokens
|
||||
def test_get_num_vision_tokens(self):
|
||||
"Tests general functionality of the helper used internally in vLLM"
|
||||
|
||||
processor = self.get_processor()
|
||||
|
||||
output = processor._get_num_multimodal_tokens(image_sizes=[(100, 100), (300, 100), (500, 30)])
|
||||
self.assertTrue("num_image_tokens" in output)
|
||||
self.assertEqual(len(output["num_image_tokens"]), 3)
|
||||
|
||||
self.assertTrue("num_image_patches" in output)
|
||||
self.assertEqual(len(output["num_image_patches"]), 3)
|
||||
|
||||
@classmethod
|
||||
def tearDownClass(cls):
|
||||
shutil.rmtree(cls.tmpdirname, ignore_errors=True)
|
||||
|
||||
@staticmethod
|
||||
def prepare_processor_dict():
|
||||
return {
|
||||
"chat_template": "{{ bos_token }}\n{%- if messages[0]['role'] == 'system' -%}\n {%- set first_user_prefix = messages[0]['content'][0]['text'] + '\n\n' -%}\n {%- set loop_messages = messages[1:] -%}\n{%- else -%}\n {%- set first_user_prefix = \"\" -%}\n {%- set loop_messages = messages -%}\n{%- endif -%}\n{%- for message in loop_messages -%}\n {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}\n {{ raise_exception(\"Conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif -%}\n {%- if (message['role'] == 'assistant') -%}\n {%- set role = \"model\" -%}\n {%- else -%}\n {%- set role = message['role'] -%}\n {%- endif -%}\n {{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else \"\") }}\n {%- if message['content'] is string -%}\n {{ message['content'] | trim }}\n {%- elif message['content'] is iterable -%}\n {%- for item in message['content'] -%}\n {%- if item['type'] == 'image' -%}\n {{ '<|image|>' }}\n {%- elif item['type'] == 'video' -%}\n{{ '<video_soft_token>' }}\n {%- elif item['type'] == 'text' -%}\n {{ item['text'] | trim }}\n {%- endif -%}\n {%- endfor -%}\n {%- else -%}\n {{ raise_exception(\"Invalid content type\") }}\n {%- endif -%}\n {{ '<end_of_turn>\n' }}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n {{'<start_of_turn>model\n'}}\n{%- endif -%}\n", "image_seq_length": 3,
|
||||
} # fmt: skip
|
||||
|
||||
# Override as Gemma4 needs images to be an explicitly nested batch
|
||||
def prepare_image_inputs(self, batch_size: int | None = None):
|
||||
"""This function prepares a list of PIL images for testing"""
|
||||
images = super().prepare_image_inputs(batch_size)
|
||||
if isinstance(images, (list, tuple)):
|
||||
images = [[image] for image in images]
|
||||
return images
|
||||
|
||||
def test_text_with_image_tokens(self):
|
||||
feature_extractor = self.get_component("feature_extractor")
|
||||
image_processor = self.get_component("image_processor")
|
||||
video_processor = self.get_component("video_processor")
|
||||
tokenizer = self.get_component("tokenizer")
|
||||
|
||||
processor = self.processor_class(
|
||||
feature_extractor=feature_extractor,
|
||||
tokenizer=tokenizer,
|
||||
image_processor=image_processor,
|
||||
video_processor=video_processor,
|
||||
)
|
||||
text_multi_images = f"{processor.image_token}{processor.image_token}Dummy text!"
|
||||
text_single_image = f"{processor.image_token}Dummy text!"
|
||||
|
||||
image = self.prepare_image_inputs()
|
||||
|
||||
# We can't be sure what the user's intention is: one image per text, OR two images for the first text and no image for the second text
|
||||
with self.assertRaises(ValueError):
|
||||
_ = processor(text=[text_single_image, text_single_image], images=[image, image], return_tensors="np")
|
||||
|
||||
# The user is expected to be explicit about which image belongs to which text by nesting the images list
|
||||
out_multiimages = processor(text=text_multi_images, images=[image, image], return_tensors="np")
|
||||
out_batch_oneimage = processor(
|
||||
text=[text_single_image, text_single_image], images=[[image], [image]], return_tensors="np"
|
||||
)
|
||||
self.assertListEqual(
|
||||
out_batch_oneimage[self.images_input_name].tolist(), out_multiimages[self.images_input_name].tolist()
|
||||
)
|
||||
|
||||
def test_special_mm_token_truncation(self):
|
||||
"""Tests that special vision tokens do not get truncated when `truncation=True` is set."""
|
||||
|
||||
processor = self.get_processor()
|
||||
|
||||
input_str = self.prepare_text_inputs(batch_size=2, modalities="image")
|
||||
image_input = self.prepare_image_inputs(batch_size=2)
|
||||
_ = processor(
|
||||
text=input_str,
|
||||
images=image_input,
|
||||
return_tensors="pt",
|
||||
truncation=None,
|
||||
padding=True,
|
||||
)
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
_ = processor(
|
||||
text=input_str,
|
||||
images=image_input,
|
||||
return_tensors="pt",
|
||||
truncation=True,
|
||||
padding=True,
|
||||
max_length=5,
|
||||
)
|
||||
|
||||
def test_get_num_multimodal_tokens_matches_processor_call(self):
|
||||
"Tests that the helper used internally in vLLM works correctly"
|
||||
|
||||
processor = self.get_processor()
|
||||
if processor.tokenizer.pad_token_id is None:
|
||||
processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id
|
||||
|
||||
if not hasattr(processor, "_get_num_multimodal_tokens"):
|
||||
self.skipTest("Processor doesn't support `_get_num_multimodal_tokens` yet")
|
||||
|
||||
image_sizes = [(100, 100), (300, 100), (500, 30), (213, 167)]
|
||||
|
||||
# Overwritten because Gemma4 needs nested image inputs
|
||||
image_inputs = []
|
||||
for h, w in image_sizes:
|
||||
image_inputs.append([np.random.randint(255, size=(h, w, 3), dtype=np.uint8)])
|
||||
|
||||
text = [f"This is an image {getattr(self, 'image_token', '')}"] * len(image_inputs)
|
||||
inputs = processor(
|
||||
text=text, images=image_inputs, padding=True, return_mm_token_type_ids=True, return_tensors="pt"
|
||||
)
|
||||
|
||||
if "mm_token_type_ids" not in inputs:
|
||||
self.skipTest("Processor doesn't support `mm_token_type_ids`")
|
||||
|
||||
num_image_tokens_from_call = inputs.mm_token_type_ids.sum(-1).tolist()
|
||||
num_image_tokens_from_helper = processor._get_num_multimodal_tokens(image_sizes=image_sizes)
|
||||
self.assertListEqual(num_image_tokens_from_call, num_image_tokens_from_helper["num_image_tokens"])
|
||||
|
||||
@unittest.skip("This test seems to be loading a different video, check for all models and fix")
|
||||
def test_apply_chat_template_video_frame_sampling(self):
|
||||
pass
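The nested-batch convention that `prepare_image_inputs` and `test_text_with_image_tokens` rely on can be shown without any image library. In this sketch the helper name `nest_images` is hypothetical and the strings are placeholders standing in for PIL images. It only illustrates the "one inner list per text sample" layout.

```python
def nest_images(images):
    """Wrap each image in its own list so that each text sample owns one image."""
    if isinstance(images, (list, tuple)):
        return [[image] for image in images]
    # A single (non-list) image is passed through unchanged.
    return images


flat = ["img_a", "img_b"]  # placeholder strings standing in for PIL images
nested = nest_images(flat)
single = nest_images("img_a")

print(nested, single)
```

A flat list of two images is ambiguous: it could mean one image per text or two images for the first text. The nested form states the assignment explicitly.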
|
||||