first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/grad_checkpointing.md
+++ b/docs/source/en/grad_checkpointing.md
@@ -0,0 +1,50 @@
+<!---Copyright 2026 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Gradient checkpointing
+
+The forward pass typically caches all intermediate activations so the backward pass can reuse them. However, activation memory grows with batch size and sequence length, so it can quickly become a large share of training memory. Gradient checkpointing saves only certain activations (the checkpoints) and discards the rest. The backward pass then recomputes the discarded activations on the fly, as they're needed, starting from the nearest checkpoint.
+
+```text
+Normal training:
+  Forward:   [L1]→[L2]→[L3]→[L4]   (save ALL activations)
+  Backward:  ←uses cached activations everywhere
+
+Gradient checkpointing:
+  Forward:   [L1]→[L2]→[L3]→[L4]   (save only at checkpoints, discard the rest)
+  Backward:  ←reaches L2, recomputes L2→L3 from scratch, uses it, discards it
+```
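+
+The diagram above can be sketched in plain Python. This is a minimal illustration, not how Transformers or PyTorch implement it: the scalar `tanh` layers, the `WEIGHTS` values, and the helper names are all hypothetical, chosen only to show that recomputing from checkpoints yields the same gradient as caching everything.
+
+```py
+import math
+
+# Hypothetical toy network: four scalar layers, each computing tanh(w * x).
+WEIGHTS = [0.5, -1.2, 0.8, 1.5]
+
+
+def layer(w, x):
+    return math.tanh(w * x)
+
+
+def layer_grad(w, x):
+    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
+    return w * (1.0 - math.tanh(w * x) ** 2)
+
+
+def backward_full_cache(x):
+    # Normal training: the forward pass keeps the input of every layer.
+    inputs = []
+    for w in WEIGHTS:
+        inputs.append(x)
+        x = layer(w, x)
+    # The backward pass applies the chain rule using the cached inputs.
+    grad = 1.0
+    for w, inp in zip(reversed(WEIGHTS), reversed(inputs)):
+        grad *= layer_grad(w, inp)
+    return grad, len(inputs)
+
+
+def backward_checkpointed(x, every=2):
+    # Forward pass: keep only the input at the start of each segment.
+    checkpoints = {}
+    for i, w in enumerate(WEIGHTS):
+        if i % every == 0:
+            checkpoints[i] = x
+        x = layer(w, x)
+    # Backward pass: visit segments from last to first.
+    grad = 1.0
+    for start in sorted(checkpoints, reverse=True):
+        segment = WEIGHTS[start:start + every]
+        # Recompute this segment's inputs from its checkpoint.
+        inputs = []
+        h = checkpoints[start]
+        for w in segment:
+            inputs.append(h)
+            h = layer(w, h)
+        for w, inp in zip(reversed(segment), reversed(inputs)):
+            grad *= layer_grad(w, inp)
+        # The recomputed inputs are dropped before the next segment.
+    return grad, len(checkpoints)
+
+
+full_grad, full_saved = backward_full_cache(0.3)
+ckpt_grad, ckpt_saved = backward_checkpointed(0.3)
+print(full_saved, ckpt_saved, math.isclose(full_grad, ckpt_grad))
+```
+
+Both functions return the same gradient, but the checkpointed forward pass stores two values instead of four. The extra forward work inside each segment is the source of the slowdown.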
+
+Gradient checkpointing reduces activation memory at the cost of speed. Training is roughly 20% slower because the discarded activations have to be recomputed during the backward pass.
+
+Set `gradient_checkpointing=True` in `TrainingArguments` to enable it.
+
+> [!TIP]
+> Combine gradient checkpointing with [gradient accumulation](./grad_accumulation) to further reduce memory usage.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    ...,
+    gradient_checkpointing=True,
+)
+```
+
+## Next steps
+
+- Read the [GPU memory usage](./model_memory_anatomy) doc to understand what is driving memory usage on the GPU during training.
+- See the [Mixed precision training](./mixed_precision_training) guide to learn how to use lower precision data types to further reduce memory and speed up training.
+- See the [Kernels](./kernels) guide to learn how to speed up training with custom fused kernels.