transformers/docs/source/zh/main_classes/tokenizer.md
陈赣 06f1fd69a6
first commit
2026-06-05 16:53:03 +08:00


Tokenizer

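The tokenizer is responsible for preparing inputs for a model. The library contains tokenizers for all the models. Most tokenizers come in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow: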

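  1. a significant speed-up, in particular when doing batched tokenization, and
  2. additional methods to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character, or the span of characters corresponding to a given token).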
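The mapping methods in item 2 can be sketched as follows. This is a minimal illustration, not the library's recommended setup: the tiny word-level vocabulary and the sample sentence are made up so that the snippet runs without downloading a pretrained tokenizer.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, invented for this illustration only.
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3, "!": 4}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# Wrap the Rust-backed tokenizer so it exposes the "Fast" API.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

text = "hello world !"
encoding = tokenizer(text)

# Index of the token that contains character 6 (the "w" of "world").
token_index = encoding.char_to_token(6)

# Span of characters in the original string covered by token 1.
char_span = encoding.token_to_chars(1)

print(encoding.tokens())
print(token_index, text[char_span.start:char_span.end])
```

These alignment methods are only available on encodings produced by a "Fast" tokenizer; a pure Python tokenizer does not track character offsets in this way.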

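The base classes [PreTrainedTokenizer] and [PreTrainedTokenizerFast] implement the common methods for encoding string inputs into model inputs (see below) and for instantiating/saving Python and "Fast" tokenizers either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). They both rely on [~tokenization_utils_base.PreTrainedTokenizerBase], which contains the common methods.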
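As a rough sketch of the save/instantiate round trip through a local directory, the snippet below builds a toy tokenizer in memory, saves it, and reloads it. The vocabulary is a made-up assumption chosen so that nothing has to be downloaded; with a real checkpoint, `from_pretrained` would be given a model identifier or an existing directory instead.

```python
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, invented for this illustration only.
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

original_ids = tokenizer("hello world")["input_ids"]

with tempfile.TemporaryDirectory() as save_dir:
    # Write the tokenizer files into a local directory ...
    tokenizer.save_pretrained(save_dir)
    # ... and instantiate a new tokenizer from that directory.
    reloaded = PreTrainedTokenizerFast.from_pretrained(save_dir)
    reloaded_ids = reloaded("hello world")["input_ids"]

print(original_ids, reloaded_ids)
```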

[PreTrainedTokenizer] and [PreTrainedTokenizerFast] thus implement the main methods for using all the tokenizers:

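  • Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
  • Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...).
  • Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization.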
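The three groups of methods listed above might be exercised as in the sketch below. The word-level vocabulary, the added token "tokenizers", and the "[MASK]" string are assumptions picked for the example; a real model's tokenizer would typically use a sub-word algorithm and its own special tokens.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, invented for this illustration only.
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

# 1. Tokenizing, converting tokens to ids and back, encoding/decoding.
tokens = tokenizer.tokenize("hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
back = tokenizer.convert_ids_to_tokens(ids)
decoded = tokenizer.decode(tokenizer.encode("hello world"))

# 2. Adding a new token without touching the underlying model structure.
num_added = tokenizer.add_tokens(["tokenizers"])

# 3. Registering a special token; it becomes available as an attribute
#    and is not split during tokenization.
tokenizer.add_special_tokens({"mask_token": "[MASK]"})
masked_tokens = tokenizer.tokenize("hello [MASK] world")

print(tokens, ids, decoded)
print(num_added, tokenizer.mask_token, masked_tokens)
```

Without the `add_special_tokens` call, the whitespace pre-tokenizer used here would break "[MASK]" apart at the brackets, which is the kind of splitting the last bullet refers to.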

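[BatchEncoding] holds the output of the encoding methods of [~tokenization_utils_base.PreTrainedTokenizerBase] (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask...). When the tokenizer is a "Fast" tokenizer (i.e., backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods that can be used to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character, or the span of characters corresponding to a given token).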
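A small sketch of the dictionary-like behavior described above, again using a made-up toy vocabulary so the snippet is self-contained. The keys shown are the ones this particular tokenizer produces; other tokenizers may return additional model inputs.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import BatchEncoding, PreTrainedTokenizerFast

# Toy vocabulary, invented for this illustration only.
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

# Calling the tokenizer on a batch returns a BatchEncoding.
batch = tokenizer(["hello world", "hello"], padding=True)
print(isinstance(batch, BatchEncoding))

# Dictionary-style access to the computed model inputs.
print(batch["input_ids"])
print(batch["attention_mask"])
print("input_ids" in batch, list(batch.keys()))

# Extra alignment helpers exist because the tokenizer is "Fast".
print(batch.is_fast)
print(batch.word_ids(1))  # word index per token of the second sequence
```

In the second sequence the padding position has no corresponding word, so its word index is reported as `None`.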

PreTrainedTokenizer

[[autodoc]] PreTrainedTokenizer
    - __call__
    - add_tokens
    - add_special_tokens
    - apply_chat_template
    - batch_decode
    - decode
    - encode
    - push_to_hub
    - all

PreTrainedTokenizerFast

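[PreTrainedTokenizerFast] depends on the tokenizers library. Tokenizers obtained from the 🤗 tokenizers library can be loaded very simply into 🤗 transformers. Take a look at the Using tokenizers from 🤗 tokenizers page to understand how this is done.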
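One way this loading can look is sketched below: a tokenizer built with the 🤗 tokenizers library is serialized to a JSON file and then handed to [PreTrainedTokenizerFast] through its `tokenizer_file` argument. The vocabulary and file name are assumptions made for the example; the linked page remains the reference for the full procedure.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# A tokenizer created purely with the tokenizers library (toy vocabulary).
vocab = {"[UNK]": 0, "hello": 1, "tokenizers": 2}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Serialize the tokenizer to a single JSON file ...
    path = os.path.join(tmp_dir, "tokenizer.json")
    backend.save(path)
    # ... and load that file into transformers.
    fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=path, unk_token="[UNK]")

ids = fast_tokenizer("hello tokenizers")["input_ids"]
print(ids)
```

An in-memory `Tokenizer` can also be passed directly with `tokenizer_object=backend`, which skips the intermediate file.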

[[autodoc]] PreTrainedTokenizerFast
    - __call__
    - add_tokens
    - add_special_tokens
    - apply_chat_template
    - batch_decode
    - decode
    - encode
    - push_to_hub
    - all

BatchEncoding

[[autodoc]] BatchEncoding