# Writing model tests The Transformers test suite uses a mixin-based architecture to auto-generate 100+ tests from minimal code. You write a small amount of model-specific code, and the mixins handle save/load, generation, pipelines, training, and tensor parallelism. Run your model's tests with the following commands. ```bash # run your model's tests pytest tests/models/mymodel/test_modeling_mymodel.py -v # run a specific test pytest tests/models/mymodel/test_modeling_mymodel.py::MyModelTest::test_model # run tests matching a keyword pattern (useful to run all integration tests) pytest tests/models/mymodel/ -k integration -v # include slow integration tests RUN_SLOW=1 pytest tests/models/mymodel/ -v ``` The Hugging Face CI runs model tests without `@slow` on every pull request, and slow tests run on a nightly schedule (see [Pull request checks](./pr_checks) for what the CI validates). ## Pick a base test class Three base classes cover the most common model families. Pick the one that matches your model's modality. | Base class | Use for | Mixins | |---|---|---| | `CausalLMModelTest` | Causal language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`, `TrainingTesterMixin`, `TensorParallelTesterMixin` | | `VLMModelTest` | Vision-language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin` | | `ALMModelTest` | Audio-language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin` | `VLMModelTest` and `ALMModelTest` share a common `MultiModalModelTest` parent that nests sub-configs into a composite top-level config and places modality placeholder tokens in `input_ids` alongside the raw modality features (audio or vision). `CausalLMModelTest` doesn't use the multimodal parent. It builds on the three shared mixins and adds `TrainingTesterMixin` and `TensorParallelTesterMixin` for training and tensor-parallel coverage. For architectures that don't fit any of the three (encoder-only, encoder-decoder, etc.), build the test infrastructure directly from the [two-class pattern](#modeltester-and-modeltest) and [test mixins](#test-mixins) described below. ## CausalLMModelTest `CausalLMModelTest` is the recommended base class for testing causal language models. It inherits from five [test mixins](#test-mixins) and auto-generates tests for save/load, generation, pipelines, training, and tensor parallelism. ```py import unittest from transformers.testing_utils import require_torch from transformers import is_torch_available from ...causal_lm_tester import CausalLMModelTest, CausalLMModelTester if is_torch_available(): from transformers import MyModel class MyModelTester(CausalLMModelTester): if is_torch_available(): base_model_class = MyModel @require_torch class MyModelTest(CausalLMModelTest, unittest.TestCase): model_tester_class = MyModelTester ``` These two classes give full test coverage for `MyModel` and all its head classes (`MyModelForCausalLM`, `MyModelForSequenceClassification`, etc.). See [tests/models/llama/test_modeling_llama.py](https://github.com/huggingface/transformers/blob/main/tests/models/llama/test_modeling_llama.py) for a real example. `CausalLMModelTester` only requires `base_model_class`. The tester strips the `Model` suffix to get a base name (`LlamaModel` becomes `Llama`), then appends suffixes like `Config` or `ForCausalLM` to discover related classes. If a class doesn't exist in the module, the attribute stays `None` and the tester skips the corresponding tests. ### Overriding defaults in the CausalLMTester If your model doesn't follow standard naming, or you need to customize behavior, override attributes on the tester or test class. ```py class MyModelTester(CausalLMModelTester): if is_torch_available(): base_model_class = MyModel # override if the class name doesn't follow the convention causal_lm_class = MyCustomCausalLM @require_torch class MyModelTest(CausalLMModelTest, unittest.TestCase): model_tester_class = MyModelTester # disable embedding resize tests test_resize_embeddings = False ``` For models that need custom constructor parameters on the tester, override `__init__` and call `super().__init__(parent=parent)` before setting extra attributes. See [tests/models/youtu/test_modeling_youtu.py](https://github.com/huggingface/transformers/blob/main/tests/models/youtu/test_modeling_youtu.py) for a real example. ```py class YoutuModelTester(CausalLMModelTester): if is_torch_available(): base_model_class = YoutuModel def __init__(self, parent, kv_lora_rank=16, q_lora_rank=32): super().__init__(parent=parent) self.kv_lora_rank = kv_lora_rank self.q_lora_rank = q_lora_rank ``` ## VLMModelTest `VLMModelTest` is the base class for vision-language models. It inherits from three mixins (`ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`) and sets `_is_composite = True` to handle multiple sub-models. ```py import unittest from transformers.testing_utils import require_torch from transformers import is_torch_available from ...vlm_tester import VLMModelTest, VLMModelTester if is_torch_available(): from transformers import ( MyVLMConfig, MyVLMModel, MyVLMTextConfig, MyVLMVisionConfig, MyVLMForConditionalGeneration, ) class MyVLMTester(VLMModelTester): if is_torch_available(): base_model_class = MyVLMModel config_class = MyVLMConfig text_config_class = MyVLMTextConfig vision_config_class = MyVLMVisionConfig conditional_generation_class = MyVLMForConditionalGeneration @require_torch class MyVLMTest(VLMModelTest, unittest.TestCase): model_tester_class = MyVLMTester ``` ### Overriding defaults in the VLMModelTester When the VLM needs custom vision parameters or non-default config values, override `__init__`. Set defaults with `setdefault` before calling `super().__init__(parent, **kwargs)`. The example below shows the first few defaults from [tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py](https://github.com/huggingface/transformers/blob/main/tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py). ```py class QianfanOCRVisionText2TextModelTester(VLMModelTester): base_model_class = QianfanOCRModel config_class = QianfanOCRConfig text_config_class = Qwen3Config vision_config_class = QianfanOCRVisionConfig conditional_generation_class = QianfanOCRForConditionalGeneration def __init__(self, parent, **kwargs): kwargs.setdefault("image_token_id", 1) kwargs.setdefault("image_size", 32) kwargs.setdefault("patch_size", 4) kwargs.setdefault("num_channels", 3) # ... more defaults super().__init__(parent, **kwargs) ``` VLM tests differ from `CausalLMModelTest` in a few ways. - You must set `config_class`, `text_config_class`, `vision_config_class`, and `conditional_generation_class` on the tester. - `VLMModelTest` doesn't include `TrainingTesterMixin` or `TensorParallelTesterMixin`. - The tester's `__init__` accepts vision parameters (`image_size`, `patch_size`, `num_channels`, `num_image_tokens`) from `**kwargs` and `setdefault()`. - `ConfigTester` uses `has_text_modality=False` because the top-level config is a composite config rather than a text model config. ## ALMModelTest `ALMModelTest` is the base class for audio-language models (ALMs) like Qwen2Audio, AudioFlamingo3, and GraniteSpeech. It mirrors the VLM pattern with the same `MultiModalModelTest` parent and auto-discovery of head classes. The vision-side machinery is swapped for audio features, an audio sub-config, and an audio-token placement strategy. ```py class MyALMTester(ALMModelTester): config_class = MyALMConfig text_config_class = MyALMTextConfig audio_config_class = MyALMAudioConfig conditional_generation_class = MyALMForConditionalGeneration audio_mask_key = "feature_attention_mask" class MyALMTest(ALMModelTest, unittest.TestCase): model_tester_class = MyALMTester ``` ### Overriding defaults in the ALMModelTester The tester's `__init__` sets ALM-specific defaults (`feat_seq_length=128`, `num_mel_bins=80`, `audio_token_id=0`). Override them with `setdefault` before calling `super().__init__(parent, **kwargs)`. Two class attributes tell the tester how your model names things. - `audio_mask_key`: the kwarg name your model expects for the audio mask (`"feature_attention_mask"`, `"input_features_mask"`, etc.). Leave it `None` if your model doesn't consume a separate audio mask. - `audio_config_key`: the attribute name your top-level config uses to nest the audio sub-config. Defaults to `"audio_config"` but models like GraniteSpeech use `"encoder_config"`. ```py class Qwen2AudioModelTester(ALMModelTester): def __init__(self, parent, **kwargs): kwargs.setdefault("feat_seq_length", 60) kwargs.setdefault("max_source_positions", kwargs["feat_seq_length"] // 2) super().__init__(parent, **kwargs) ``` `ALMModelTester` requires you to override one hook, `get_audio_embeds_mask(audio_mask)`, and exposes a few more optional ones for customization. - `get_audio_embeds_mask(audio_mask)`: returns the per-batch mask of audio embedding positions after the encoder's downsampling. The tester uses its row-wise sum to decide how many `audio_token_id` placeholders to insert into `input_ids`, so the count must match what your encoder emits. - `create_audio_features()`: returns the audio feature tensor. Default shape is `[batch_size, num_mel_bins, feat_seq_length]`. Override when your model, like GraniteSpeech, expects time-first features (`[batch_size, feat_seq_length, num_mel_bins]`). - `create_audio_mask()`: returns the audio-level attention mask. The default builds a random contiguous valid region per row in the batch. Override with a deterministic full-length mask if your tests compare two `prepare_config_and_inputs_for_common()` invocations against each other, or if your audio encoder dispatches to a backend that rejects non-null masks. - `place_audio_tokens(input_ids, config, num_audio_tokens)`: places audio placeholder tokens contiguously after `BOS`. Override only if your model needs a different layout. - `get_audio_feature_key()`: returns the inputs-dict key for audio features (`"input_features"` by default). In addition to the inherited multimodal tests, `ALMModelTest` adds `test_mismatching_num_audio_tokens`. The test asserts the model raises a clear `ValueError` when the number of audio features doesn't match the number of audio placeholder tokens in `input_ids`, and verifies that a prompt with multiple audio segments still forwards successfully. ## Write tests for other architectures For encoder-only, encoder-decoder, audio, or other non-standard architectures, build the test infrastructure directly from the two-class pattern and test mixins described below. ### ModelTester and ModelTest Every model test file follows the same structure. 1. `ModelTester` (plain class) creates tiny configs and dummy inputs for testing, and can also hold small regression tests specific to the model. 2. `ModelTest` (`unittest.TestCase` + mixins) inherits auto-generated tests and runs them against every model variant. `ModelTest` calls `prepare_config_and_inputs_for_common()` on the tester to get a `(config, inputs_dict)` tuple. All mixins rely on `prepare_config_and_inputs_for_common()` for test data. ### Test mixins Pick the mixins your model needs. | Mixin | Source file | What it tests | |---|---|---| | `ModelTesterMixin` | `tests/test_modeling_common.py` | Save/load, gradient checkpointing, forward signature, common attributes | | `GenerationTesterMixin` | `tests/generation/test_utils.py` | Greedy, sampling, beam search, assisted decoding | | `PipelineTesterMixin` | `tests/test_pipeline_mixin.py` | One test per pipeline task | | `TrainingTesterMixin` | `tests/test_training_mixin.py` | Overfitting on a small batch | | `TensorParallelTesterMixin` | `tests/test_tensor_parallel_mixin.py` | Distributed tensor parallelism | ### Writing a model test See [tests/models/modernbert/test_modeling_modernbert.py](https://github.com/huggingface/transformers/blob/main/tests/models/modernbert/test_modeling_modernbert.py) for a complete working example. The key steps are outlined below. 1. The `ModelTester` class builds tiny configs and dummy inputs. Keep dimensions small so tests finish in seconds on CPU. Use the three tensor helpers below to build inputs. - `ids_tensor(shape, vocab_size)`: Random integer tensor in `[0, vocab_size)`. Use for `input_ids`, `token_type_ids`, and label tensors. - `random_attention_mask(shape)`: Binary tensor (0s and 1s) where the first token is always 1. Use for `attention_mask`. - `floats_tensor(shape, scale=1.0)`: Random float tensor. Use for continuous inputs like `pixel_values` or `inputs_embeds`. The tester must implement `get_config()`, `prepare_config_and_inputs()`, and `prepare_config_and_inputs_for_common()`. Add `create_and_check_*` methods for each task head (base model, sequence classification, token classification, etc.). 2. Inherit from the mixins your model needs, set `all_model_classes` and `pipeline_model_mapping`, and define `setUp()`. Write `test_*` methods that delegate to the tester's `create_and_check_*` methods. 3. For each task head, add a `create_and_check_*` method on the tester that instantiates the model, runs a forward pass, and asserts output shapes. Then add a corresponding `test_*` method on the test class. ### File organization Test files live in `tests/models/mymodel/` following the structure shown below. ```text tests/models/mymodel/ ├── __init__.py ├── test_modeling_mymodel.py # model tests (required) ├── test_tokenization_mymodel.py # tokenizer tests (if custom tokenizer) ├── test_image_processing_mymodel.py # image processor tests (if vision model) ├── test_feature_extraction_mymodel.py # feature extractor tests (if audio/speech model) └── test_processing_mymodel.py # processor tests (if multimodal) ``` Tokenizer tests follow the same pattern. Inherit `TokenizerTesterMixin` from `tests/test_tokenization_common.py`, set a few attributes, and get auto-generated tests. See [tests/models/llama/test_tokenization_llama.py](https://github.com/huggingface/transformers/blob/main/tests/models/llama/test_tokenization_llama.py) for an example. ## Config tests `ConfigTester` verifies that a config class handles serialization, save/load, and standard properties correctly. `CausalLMModelTest` and `VLMModelTest` include config tests automatically. For the general path with `ModelTester` and `ModelTest`, define the config tester manually in `setUp()`. ```py from tests.test_configuration_common import ConfigTester def setUp(self): self.config_tester = ConfigTester(self, config_class=MyModelConfig, hidden_size=32) def test_config(self): self.config_tester.run_common_tests() ``` `run_common_tests()` runs several checks. - Checks that common properties like `hidden_size`, `num_attention_heads`, and `num_hidden_layers` exist (and `vocab_size` if `has_text_modality=True`). - Tests JSON serialization with `to_json_string()` and `to_json_file()`. - Round-trips `save_pretrained()` and `from_pretrained()`. - Confirms `id2label` and `label2id` consistency. - Creates a config with no arguments to validate default initialization. - Sets common kwargs like `output_hidden_states` and confirms they're stored correctly. Pass `has_text_modality=False` for vision-only models that lack `vocab_size`, and pass extra `**kwargs` to override config defaults. ```py self.config_tester = ConfigTester( self, config_class=MyVisionConfig, has_text_modality=False, hidden_size=64 ) ``` ## Integration tests and tiny models Mixin tests use tiny configs with random weights to verify model behavior quickly. Integration tests run inference with real pretrained weights to validate output correctness. Tiny models on the Hub are small enough for fast CI, but structured like real checkpoints. ### Writing integration tests Place integration tests in a separate test class and mark them with `@slow`. Each test downloads real weights, runs inference, and checks outputs against expected values. Call `cleanup(torch_device, gc_collect=False)` in `setUp` and `tearDown` to avoid memory residuals. ```py import torch from transformers import AutoTokenizer from transformers.testing_utils import cleanup, require_torch, slow, torch_device class MyModelIntegrationTest(unittest.TestCase): def setUp(self): cleanup(torch_device, gc_collect=False) def tearDown(self): cleanup(torch_device, gc_collect=False) @slow @require_torch def test_inference(self): model = MyModelForCausalLM.from_pretrained("myorg/mymodel-base").to(torch_device) tokenizer = AutoTokenizer.from_pretrained("myorg/mymodel-base") inputs = tokenizer("Hello, world", return_tensors="pt").to(torch_device) with torch.no_grad(): outputs = model(**inputs) # check against expected values expected_slice = torch.tensor([[-0.1234, 0.5678, -0.9012]]) torch.testing.assert_close(outputs.logits[0, :1, :3], expected_slice, atol=1e-4, rtol=1e-4) ``` Mark any test with `@slow` if it downloads weights, loads a large dataset, or takes more than a few seconds. The [pull request CI](./pr_checks) skips slow tests, but the nightly schedule runs them. #### Generation integration tests Use `do_sample=False` in generation tests so the output is deterministic across runs and hardware. For Mixture-of-Experts models, also call `model.set_experts_implementation("eager")` before generating to force a stable expert dispatch path. Without it, small numerical differences in the router can flip which expert handles a token and change the output. ```py @slow @require_torch def test_generate(self): model = MyModelForCausalLM.from_pretrained("myorg/mymodel-base").to(torch_device) tokenizer = AutoTokenizer.from_pretrained("myorg/mymodel-base") inputs = tokenizer("Hello, world", return_tensors="pt").to(torch_device) # model.set_experts_implementation("eager") # uncomment for MoE models generated_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False) output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) self.assertEqual(output, ["Hello, world! This is the expected continuation..."]) ``` #### Hardware-specific expectations Transformers CI runs slow tests on an NVIDIA A10. Numerical results can vary slightly across GPU generations, so integration tests use the [Expectations](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py#L3247) class to register per-device expected values. `Expectations` picks the best match for the current hardware based on `(device_type, (major, minor))` SM keys, and falls back to a default when nothing matches. Run `torch.cuda.get_device_capability()` to print your local SM version (e.g. `(8, 6)` for an A10, `(9, 0)` for H100). ```py from transformers.testing_utils import Expectations expected_texts = Expectations( { ("cuda", (8, 6)): ["Hello, world! This is the A10 continuation..."], ("cuda", (9, 0)): ["Hello, world! This is the H100 continuation..."], } ).get_expectation() self.assertEqual(output, expected_texts) ``` ### Creating tiny models Tiny models with random weights live on the Hub under the [hf-internal-testing](https://huggingface.co/hf-internal-testing) organization. Pipeline tests rely on tiny models when they need a Hub-hosted checkpoint but don't care about output quality. Fast smoke tests also load tiny models to verify forward pass shapes without downloading large checkpoints. Tiny models are a last resort for integration tests. Only use them when the smallest available checkpoint exceeds ~24 GB of VRAM. Use original pretrained weights, when possible, to catch real numerical regressions. The `utils/create_dummy_models.py` script generates tiny models from `ModelTester.get_config()`. The script extracts tiny hyperparameters from your tester, builds a model with random weights, and uploads the result to the Hub. Generate tiny models locally. ```bash python utils/create_dummy_models.py output_dir -m your_model_type ``` Upload them to the Hub. ```bash python utils/create_dummy_models.py output_dir -m your_model_type --upload --organization hf-internal-testing ``` Each model uses the name `hf-internal-testing/tiny-random-{ModelClassName}` and gets recorded in `tests/utils/tiny_model_summary.json`. A CI workflow (`.github/workflows/check_tiny_models.yml`) regenerates tiny models daily. ## Control what gets tested Boolean flags on `ModelTesterMixin` toggle auto-generated tests. Override any flag on your test class to enable or disable specific checks. ```py class MyModelTest(CausalLMModelTest, unittest.TestCase): model_tester_class = MyModelTester test_resize_embeddings = False test_all_params_have_gradient = False # when not all parameters are activated in every forward pass ``` | Flag | Default | What it controls | |---|---|---| | `test_resize_embeddings` | `True` | Embedding layer resizing | | `test_resize_position_embeddings` | `False` | Position embedding resizing | | `test_mismatched_shapes` | `True` | Mismatched input/output shape handling | | `test_missing_keys` | `True` | Missing key warnings on load | | `test_torch_exportable` | `True` | `torch.export` compatibility | | `test_all_params_have_gradient` | `True` | All parameters receive gradients (set `False` when not all parameters are activated in every forward pass, such as MoE experts) | | `is_encoder_decoder` | `False` | Encoder-decoder specific tests | | `has_attentions` | `True` | Attention output tests | | `_is_composite` | `False` | Composite/multimodal model handling | | `model_split_percents` | `[0.5, 0.7, 0.9]` | Split percentages for model parallelism tests | ## Next steps - Browse the [pytest](https://docs.pytest.org/en/latest/getting-started.html) docs for more about test selection, fixtures, logging, and more.