Tensor parallelism

Tensor parallelism slices a model layer into pieces so multiple hardware accelerators work on it simultaneously. This lets you run models that exceed a single GPU's memory capacity and achieve higher throughput. You'll need fast intra-node communication because GPUs exchange partial results at each layer.
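To make the idea concrete, here is a minimal single-process sketch in plain Python, with no GPUs and no torch.distributed. It assumes a toy weight and two simulated devices; the helper names matmul and split_columns are illustrative and not part of Transformers. Each simulated device multiplies the input by its own slice of the weight along the output dimension, and concatenating the partial outputs reproduces the unsharded result.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous, equal slices."""
    cols = len(w[0])
    assert cols % num_devices == 0, "toy example assumes an even split"
    step = cols // num_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)  # result without any sharding

# Each simulated device owns half of the output columns.
shards = split_columns(w, num_devices=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the per-device outputs stands in for the gather step.
gathered = [value for partial in partials for value in partial]

assert gathered == full
```

On real hardware the concatenation is a collective communication step between GPUs, which is why fast intra-node links matter.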

The list below shows models with native tensor parallelism support. Open a GitHub issue or pull request to add support for a model.

This guide covers enabling tensor parallelism in Transformers and the available partitioning strategies.

Partitioning a model

Transformers enables tensor parallelism when a model has a tp_plan. Choose from two partitioning methods.

Set tp_plan="auto" for an automatic plan based on the model's predefined configuration.
Define and pass a manual tp_plan.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct" # better to visualize all the possible strategies
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
print(model._tp_plan)

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# distributed run
outputs = model(inputs)

Launch the inference script with torchrun. The command below starts 4 processes on the node, one per GPU.

torchrun --nproc-per-node 4 demo.py
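torchrun starts one copy of the script per process and tells each copy who it is through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below reads them. The helper name read_dist_env and the single-process fallback defaults are assumptions added for illustration.

```python
import os


def read_dist_env(env=os.environ):
    """Return the rank information that torchrun exports to each process."""
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


info = read_dist_env()
print(f"process {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

Every process runs the whole script, so an unguarded print appears once per GPU. Check `info["rank"] == 0` first if you want a single line of output.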

Define a tensor parallel plan for each layer in tp_plan. Pass it to [~PreTrainedModel.from_pretrained]. The example below uses column and row partitioning. See the Partitioning strategies section for other supported strategies.

Manual partitioning requires a deep understanding of the model architecture and of how strategies interact. A poor partitioning choice can make a model slow, cause it to fail, or produce incorrect results. The Ultra-Scale Playbook explains partitioning strategies in detail.

from transformers import AutoModelForCausalLM

tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    # ...
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto", tp_plan=tp_plan)
print(model.tp_plan)
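The `*` in each key stands for the layer index, so one entry covers every decoder layer. The sketch below shows one way such keys could be resolved against concrete module names, by replacing each run of digits with `*` before the lookup. The helper strategy_for is hypothetical and the matching rule is an assumption for illustration. It is not a copy of the Transformers implementation.

```python
import re

tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
}


def strategy_for(module_name, plan):
    """Look up a module's strategy after generalizing its layer index to `*`."""
    generic_name = re.sub(r"\d+", "*", module_name)
    return plan.get(generic_name)  # None when the plan has no entry


print(strategy_for("model.layers.0.self_attn.q_proj", tp_plan))   # colwise
print(strategy_for("model.layers.31.self_attn.o_proj", tp_plan))  # rowwise
print(strategy_for("model.layers.5.mlp.up_proj", tp_plan))        # None
```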

Partitioning strategies

The [ParallelInterface] class defines all partitioning strategies. It maps a string to the strategy implementation. You don't need to interact with this class directly since you set strategies with tp_plan in [~PreTrainedModel.from_pretrained]. It's useful for checking available strategies.

class ParallelInterface(MutableMapping):
    """
    Dict-like object keeping track of allowed attention functions. You can easily add a new attention function
    with a call to `register()`. If a model needs to locally overwrite an existing attention function, say `sdpa`,
    it needs to declare a new instance of this class inside the `modeling_<model>.py`, and declare it on that instance.
    """
    _global_mapping = {
        "colwise": ColwiseParallel(),
        "rowwise": RowwiseParallel(),
        "colwise_rep": ColwiseParallel(output_layouts=Replicate()),
        "rowwise_rep": RowwiseParallel(input_layouts=Replicate()),
        "local_colwise": ColwiseParallel(use_dtensor=False),
        "local_rowwise": RowwiseParallel(use_dtensor=False),
        "local": IsolatedParallel(),
        "moe_tp_experts": MoeTensorParalellExperts(),
        "local_packed_rowwise": PackedRowwiseParallel(use_dtensor=False),
        "sequence_parallel": SequenceParallel(),
        "replicate": ReplicateParallel(),
    }

The table below describes each strategy.

Strategy	Description
`ColwiseParallel`	Partitions weights and biases column-wise.
`RowwiseParallel`	Partitions weights and biases row-wise. Also supports partitioning `nn.Embedding` modules.
`SequenceParallel`	Sequence-parallel implementation for `LayerNorm` and `Dropout` layers. Also supports the Python implementation of RMSNorm.
`PackedColwiseParallel`	A variant of `ColwiseParallel` that supports packed weights (for example, packing `up_proj` and `gate_proj` together). Refer to the code for more details.
`PackedRowwiseParallel`	A variant of `RowwiseParallel` that supports packed weights. Refer to the code for more details.
`GatherParallel`	Gathers module outputs across devices.
`IsolatedParallel`	Isolates a module from other devices. Used for Experts in Mixture-of-Experts (MoE) layers.
`ReplicateParallel`	Replicates modules across all devices. Prevents `torch.distributed` APIs from breaking due to a partially sharded model.
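As a rough illustration of the first two rows, the sketch below computes the per-device shape of an nn.Linear-style weight with shape (out_features, in_features). It assumes column-wise sharding splits dimension 0, row-wise sharding splits dimension 1, and the split dimension divides evenly across devices. local_weight_shape is a hypothetical helper, not a Transformers function.

```python
def local_weight_shape(shape, strategy, world_size):
    """Return the shape of one device's shard of a 2D weight."""
    out_features, in_features = shape
    if strategy == "colwise":
        assert out_features % world_size == 0, "sketch assumes an even split"
        return (out_features // world_size, in_features)
    if strategy == "rowwise":
        assert in_features % world_size == 0, "sketch assumes an even split"
        return (out_features, in_features // world_size)
    raise ValueError(f"unsupported strategy: {strategy}")


# A 4096 x 4096 projection sharded over 4 devices.
print(local_weight_shape((4096, 4096), "colwise", 4))  # (1024, 4096)
print(local_weight_shape((4096, 4096), "rowwise", 4))  # (4096, 1024)
```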

Packed strategies

Weight packing combines multiple linear layers into a single, larger layer. The PackedColwiseParallel and PackedRowwiseParallel strategies shard packed weights correctly. Basic ColwiseParallel or RowwiseParallel strategies shard packed weights incorrectly.

The example below packs up_proj and gate_proj into a single gate_up_proj module and requires the PackedRowwiseParallel strategy to shard gate_up_proj.

class Llama4TextExperts(nn.Module):
    def __init__(self, config):
        super().__init__()
        ...
        self.gate_up_proj = nn.Parameter(torch.zeros(self.num_experts, self.hidden_size, 2 * self.expert_dim))

Use batch matrix multiplication in the forward pass to compute the output of the gate_up_proj module.

def forward(self, hidden_states):
    ...
    gate_up = torch.bmm(hidden_states, self.gate_up_proj) # Compute the output of the gate_up_proj module
    gate, up = gate_up.chunk(2, dim=-1) # Split the output into gate and up

Tip

See this comment for a visual representation of why Packed* needs to be used.
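The difference can be sketched with column labels instead of real weights. The example assumes the packed dimension is laid out as all gate columns followed by all up columns, and it uses two simulated devices. The helpers naive_shard and packed_shard are hypothetical and only illustrate the idea; they are not the library's code.

```python
def naive_shard(columns, rank, world_size):
    """Contiguous split of the packed dimension, ignoring the packing."""
    step = len(columns) // world_size
    return columns[rank * step:(rank + 1) * step]


def packed_shard(columns, rank, world_size, num_blocks=2):
    """Split each packed block separately, then re-pack this rank's pieces."""
    block_size = len(columns) // num_blocks
    step = block_size // world_size
    shard = []
    for b in range(num_blocks):
        block = columns[b * block_size:(b + 1) * block_size]
        shard.extend(block[rank * step:(rank + 1) * step])
    return shard


packed = ["g0", "g1", "g2", "g3", "u0", "u1", "u2", "u3"]

print(naive_shard(packed, rank=0, world_size=2))   # ['g0', 'g1', 'g2', 'g3']
print(packed_shard(packed, rank=0, world_size=2))  # ['g0', 'g1', 'u0', 'u1']
print(packed_shard(packed, rank=1, world_size=2))  # ['g2', 'g3', 'u2', 'u3']
```

With the naive split, rank 0 holds only gate columns, so splitting its local output in half no longer separates gate from up.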

Local strategies

Local strategies (local_colwise, local_rowwise, local_packed_rowwise) don't use DTensor because it lacks support for some operations like torch.chunk. Instead, local strategies use the basic torch.Tensor and perform distributed logic manually.

Custom partitioning strategies

Inherit from TensorParallelLayer to create a custom partitioning strategy. Implement partition_tensor, _prepare_input_fn and _prepare_output_fn.

The example below shows how to implement ColwiseParallel with this workflow.

Inherit from TensorParallelLayer. In the __init__ method, define input_layouts and output_layouts to describe how the incoming and outgoing tensors are placed on devices. The desired_input_layouts attribute specifies the layout the layer needs its input in, so inputs are redistributed to it before the forward pass.

class ColwiseParallel(TensorParallelLayer):
    def __init__(
        self,
        *,
        input_layouts: Optional[Placement] = None, # The input layout coming from the previous layer
        output_layouts: Optional[Placement] = None, # The output layout we want to achieve
        use_local_output: bool = True, # Whether to use local output or not
        use_dtensor=True, # Whether to use DTensor or not
    ):
        self.input_layouts = (input_layouts or Replicate(),) # The input sharding coming from the previous layer
        self.output_layouts = (output_layouts or Shard(-1),) # Desired output sharding
        self.desired_input_layouts = (Replicate(),) # Desired input sharding, inputs should be replicated across GPUs
        self.use_local_output = use_local_output
        self.use_dtensor = use_dtensor

Implement the partition_tensor, _prepare_input_fn, and _prepare_output_fn methods.

The partition_tensor method partitions the tensor and fills empty_param with the partitioned tensor. Use the utility function get_tensor_shard to help you get the correct shard of the original parameter for a given rank and get_packed_weights to help with packed weights.

def partition_tensor(
    self,
    param, # Full tensor of the parameter
    empty_param, # Empty tensor of the parameter, will be filled with the partitioned tensor
    param_type, # Type of the parameter, `bias` or `weight`
    param_casting_dtype, # The type to cast the parameter to
    to_contiguous, # Whether to convert the tensor to a contiguous memory layout
    rank, # The rank of the current device
    device_mesh, # The device mesh
) -> nn.Parameter: # Return the partitioned parameter
    ...
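The core of a partition step is picking the slice of the full parameter that belongs to a rank. The sketch below shows that index arithmetic on a nested list standing in for a 2D weight. The helpers shard_bounds and shard_rows are hypothetical, and the ceil-based handling of uneven sizes is an assumption. Check get_tensor_shard for the behavior Transformers actually implements.

```python
import math


def shard_bounds(dim_size, rank, world_size):
    """Return the [start, end) range of a dimension owned by `rank`."""
    shard_size = math.ceil(dim_size / world_size)
    start = min(rank * shard_size, dim_size)
    end = min(start + shard_size, dim_size)
    return start, end


def shard_rows(weight, rank, world_size):
    """Slice dimension 0 of a nested-list weight for the given rank."""
    start, end = shard_bounds(len(weight), rank, world_size)
    return weight[start:end]


weight = [[r * 10 + c for c in range(3)] for r in range(6)]  # 6 x 3 toy weight

print(shard_bounds(6, rank=1, world_size=4))     # (2, 4)
print(shard_rows(weight, rank=0, world_size=2))  # [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
```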

The _prepare_input_fn and _prepare_output_fn methods are used in the pre-forward and forward hooks. They redistribute the inputs and outputs to the desired layout as specified in the __init__.

def _prepare_input_fn(input_layouts, desired_input_layouts, mod, inputs, device_mesh):
    ...
    # Do some custom logic, cast to DTensor etc.
    ...
    return inputs.redistribute(placements=desired_input_layouts, device_mesh=device_mesh)

def _prepare_output_fn(output_layouts, use_local_output, mod, outputs, device_mesh):
    ...
    # Do some custom logic, cast to DTensor etc.
    ...
    return outputs.redistribute(placements=output_layouts, device_mesh=device_mesh)

Register the strategy with ParallelInterface so that tp_plan can refer to it by name.

import torch
from transformers import AutoModelForCausalLM
from transformers.integrations.tensor_parallel import ParallelInterface

ParallelInterface.register_strategy("colwise_custom", ColwiseParallel)
tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise_custom",
    # ...
}
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan=tp_plan)

Benchmarks

Tensor parallelism significantly speeds up inference, especially for large batch sizes or long sequences.

This chart shows the expected speedup for a single forward pass on Llama with a sequence length of 512.

Design implementation

Transformers implements tensor parallelism in a framework-agnostic way. It relies on DeviceMesh and DTensor from torch.distributed to provide a simple, extensible interface.

DeviceMesh

DeviceMesh creates a multi-dimensional grid of devices that communicate together. Different parallelization strategies require different communication patterns. Create a DeviceMesh with multiple sub-meshes to handle these patterns.

from torch.distributed.device_mesh import init_device_mesh

# Create a 1D mesh of 4 GPUs
device_mesh = init_device_mesh("cuda", (4,), mesh_dim_names=["tp"])

Most torch.distributed parallelization strategies apply to the mesh itself or its sub-mesh. The mesh automatically handles communication patterns.
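To illustrate sub-meshes, the sketch below lays out 8 ranks in a 2D grid with a hypothetical "dp" dimension of size 2 and a "tp" dimension of size 4, then lists which ranks communicate along each dimension. It is plain Python that assumes a row-major rank layout and doesn't call torch.distributed. The helper name mesh_groups is illustrative.

```python
def mesh_groups(mesh_shape):
    """Return (grid, row_groups, column_groups) for a 2D row-major mesh."""
    rows, cols = mesh_shape
    grid = [[r * cols + c for c in range(cols)] for r in range(rows)]
    # Ranks in the same row differ only along the second mesh dimension.
    row_groups = [list(row) for row in grid]
    # Ranks in the same column differ only along the first mesh dimension.
    col_groups = [[grid[r][c] for r in range(rows)] for c in range(cols)]
    return grid, row_groups, col_groups


grid, tp_groups, dp_groups = mesh_groups((2, 4))  # dimensions named ("dp", "tp")

print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In this layout, tensor parallel collectives stay inside one row, while anything along "dp" crosses rows.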

DTensor

DTensor (Distributed Tensor) handles distributed logic on top of usual tensor operations. Most model weights in tensor parallelism are stored as DTensors.

The placement attribute tells PyTorch how to place a tensor on devices in DeviceMesh. It accepts the following values:

Shard(dimension) shards a DTensor across a given dimension over the DeviceMesh it was constructed under. The example below shows how to shard weights over different dimensions for column-wise partitioning.

weight = ...
weight = DTensor.from_local(weight, device_mesh["tp"], placements=[Shard(0)]) # Shard across the 1st (column-wise) dimension
bias = ...
bias = DTensor.from_local(bias, device_mesh["tp"], placements=[Shard(-1)]) # Shard across the ONLY dimension

This example shows how to shard weights over different dimensions for row-wise partitioning.

weight = ...
weight = DTensor.from_local(weight, device_mesh["tp"], placements=[Shard(1)]) # Shard across the 2nd (row-wise) dimension
bias = ...
bias = DTensor.from_local(bias, device_mesh["tp"], placements=[Replicate()]) # Replicate bias across all GPUs

Replicate() replicates a DTensor across the DeviceMesh. It creates a full copy of the tensor on each device.

bias = ...
bias = DTensor.from_local(bias, device_mesh["tp"], placements=[Replicate()]) # Replicate bias across all GPUs

Partial() indicates a tensor is pending a reduction operation (not typically relevant for Transformers usage).
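A pending reduction arises naturally in row-wise partitioning: each device multiplies its slice of the input by its slice of the weight along the input dimension and holds only a partial sum. The single-process sketch below uses toy values and an illustrative matmul helper to show that summing the partial outputs, which is what an all-reduce would do, recovers the full result.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0],
     [1.0, 3.0]]

full = matmul(x, w)  # result without any sharding

# Two simulated devices: each owns half of the input features and the
# matching half of the weight's rows.
partials = [matmul(x[0:2], w[0:2]), matmul(x[2:4], w[2:4])]

# Summing element-wise plays the role of the all-reduce.
reduced = [sum(values) for values in zip(*partials)]

print(partials)  # [[1.0, 2.0], [10.0, 18.0]]
print(reduced)   # [11.0, 20.0]
assert reduced == full
```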

Resources

The Ultra-Scale Playbook section on tensor parallelism provides more details.
Check the expert parallelism guide if you're using a mixture-of-experts (MoE) model. These models support tensor parallelism and expert parallelism.
Read the Tensor Parallelism (TP) in Transformers: 5 Minutes to Understand blog post for a quick overview of tensor parallelism and learn how column and row parallel setups differ.
See the Tensor parallelism training guide to learn how to use it in a training setting.
