<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Customizing tokenizers

Tokenizers are decoupled from their learned vocabularies. This lets you initialize an empty tokenizer for training or create one directly with your own vocabulary. The core tokenization pipeline (normalizer, pre-tokenizer, tokenization algorithm) stays the same, so you don't need to recreate it from scratch.

This guide shows you how to train and create a custom tokenizer.

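To make the decoupling concrete, here is a minimal pure-Python sketch of a pipeline whose stages stay fixed while the vocabulary is swapped. The `normalize`, `pre_tokenize`, and `encode` functions and both vocabularies are hypothetical stand-ins for illustration, not the Transformers API, and whole-word lookup is used in place of a real tokenization algorithm.

```python
import unicodedata


def normalize(text):
    # Hypothetical normalizer: Unicode NFKC normalization plus lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    # Hypothetical pre-tokenizer: split on whitespace.
    return text.split()


def encode(text, vocab, unk_token="<unk>"):
    # Stand-in for the tokenization algorithm: whole-word lookup with an
    # unknown-token fallback. Only `vocab` changes between tokenizers.
    words = pre_tokenize(normalize(text))
    return [vocab.get(word, vocab[unk_token]) for word in words]


# Two hypothetical vocabularies used with the same pipeline.
general_vocab = {"<unk>": 0, "the": 1, "market": 2}
finance_vocab = {"<unk>": 0, "the": 1, "stock": 2, "market": 3, "rallied": 4}

text = "The stock market rallied"
general_ids = encode(text, general_vocab)
finance_ids = encode(text, finance_vocab)
print(general_ids)  # [1, 0, 2, 0]
print(finance_ids)  # [1, 2, 3, 4]
```

The pipeline functions are identical in both calls. Only the vocabulary decides which words map to known ids and which fall back to `<unk>`.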
## Training a tokenizer

An empty, trainable tokenizer replaces the vocabulary with a new target vocabulary. This is useful for adapting to a new domain, such as finance, a low-resource language, or code.

Create an empty tokenizer and load a dataset.

```py
from datasets import load_dataset
from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer()
dataset = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train")
```

Use the [`TokenizersBackend.train_new_from_iterator`] method to train the tokenizer. The method accepts an iterator, such as a generator, that yields batches of text from the dataset instead of loading everything into memory at once. The `vocab_size` argument sets the size of the tokenizer's vocabulary.

```py
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["assistant"]

trained_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=32000,
)
encoded = trained_tokenizer("The stock market rallied today.")
print(encoded["input_ids"])
[5866, 11503, 98, 5885, 8617, 13381, 30]
```

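To see what the generator yields without downloading the dataset, the sketch below runs the same batching pattern over a hypothetical in-memory list of rows. The `rows` list and its contents are made up for illustration. Unlike a `datasets` slice, which returns a dict of columns, a plain list slice returns rows, so the text field is extracted per row here.

```python
# Hypothetical stand-in for the dataset: a list of rows with an "assistant" text field.
rows = [{"assistant": f"sample text {i}"} for i in range(2500)]


def batch_iterator(rows, batch_size=1000):
    # Yield one list of strings per batch so only one slice is built at a time.
    for i in range(0, len(rows), batch_size):
        yield [row["assistant"] for row in rows[i : i + batch_size]]


# Collect only the batch lengths to show how the rows are chunked.
batch_sizes = [len(batch) for batch in batch_iterator(rows)]
print(batch_sizes)  # [1000, 1000, 500]
```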
Add new special tokens with the `new_special_tokens` argument, or use `special_tokens_map` to rename the old special tokens to new ones.

Save the new finance tokenizer with [`~PreTrainedTokenizerBase.save_pretrained`], or save and upload it to the Hub with [`~PreTrainedTokenizerBase.push_to_hub`]. This creates a `tokenizer.json` file that captures the newly trained vocabulary, the merge rules, and the full pipeline configuration.

```py
trained_tokenizer.save_pretrained("./finance-gemma-tokenizer")
trained_tokenizer.push_to_hub("finance-gemma-tokenizer")
```

## Custom vocabulary

An empty tokenizer supports a custom vocabulary through the `vocab` and `merges` arguments.

- `vocab` is the complete set of tokens a tokenizer knows, and each entry maps a token to its input id.
- `merges` defines how the BPE algorithm should combine adjacent tokens.

```py
from transformers import GemmaTokenizer

vocab = {
    "<pad>": 0,
    "</s>": 1,
    "<s>": 2,
    "<unk>": 3,
    "<mask>": 4,
    "▁the": 5,
    "▁stock": 6,
    "▁market": 7,
    "▁": 8,
    "r": 9,
    "a": 10,
    "l": 11,
    "i": 12,
    "e": 13,
    "d": 14,
    "ra": 15,
    "li": 16,
    "lie": 17,
    "lied": 18,
    "ral": 19,
    "ralli": 20,
    "rallie": 21,
    "rallied": 22,
}
merges = [
    ("r", "a"),  # r + a → ra
    ("l", "i"),  # l + i → li
    ("li", "e"),  # li + e → lie
    ("lie", "d"),  # lie + d → lied
    ("ra", "l"),  # ra + l → ral
    ("ral", "li"),  # ral + li → ralli
    ("ralli", "e"),  # ralli + e → rallie
    ("rallie", "d"),  # rallie + d → rallied
]

tokenizer = GemmaTokenizer(vocab=vocab, merges=merges)
encoded = tokenizer("the stock market rallied")
print(encoded["input_ids"])
```

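To build intuition for how a `merges` list drives tokenization, the following is a minimal pure-Python sketch of greedy BPE merging applied to the bare word `rallied`. It assumes the common convention that earlier rules have higher priority. The `apply_merges` helper is hypothetical and is not how `GemmaTokenizer` is implemented: it skips normalization, pre-tokenization, and the `▁` prefix.

```python
def apply_merges(word, merges):
    # Rank each merge rule by its position: earlier rules have higher priority.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [pair for pair in pairs if pair in ranks]
        if not candidates:
            break  # no applicable merge rule is left
        best = min(candidates, key=ranks.get)
        # Merge every occurrence of the best-ranked pair in one left-to-right pass.
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Same merge rules as in the example above.
merges = [
    ("r", "a"), ("l", "i"), ("li", "e"), ("lie", "d"),
    ("ra", "l"), ("ral", "li"), ("ralli", "e"), ("rallie", "d"),
]
# Subset of the vocabulary above that is relevant to this word.
vocab = {"ra": 15, "li": 16, "lie": 17, "lied": 18, "ral": 19,
         "ralli": 20, "rallie": 21, "rallied": 22}

tokens = apply_merges("rallied", merges)
ids = [vocab[token] for token in tokens]
print(tokens)  # ['ral', 'lied']
print(ids)  # [19, 18]
```

In this simplified procedure, the rule `("ral", "li")` and the rules after it never fire, because the higher-priority rules have already absorbed `li` into `lied`. Rule order therefore matters as much as rule content. The ids produced by the real tokenizer may differ, since it also applies its own normalization and pre-tokenization.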
## Subclassing TokenizersBackend

Tokenizers support four different [backends](./fast_tokenizers#backend-uri). In general, you should use [`TokenizersBackend`] to define a new tokenizer because it is faster.

> [!TIP]
> [`PythonBackend`] is a pure Python tokenizer that doesn't depend on backends like Rust, SentencePiece, or mistral-common. Use [`PythonBackend`] only if you're building a highly specialized tokenizer that can't be expressed by the Rust backend.

1. Subclass [`TokenizersBackend`] with class attributes such as the padding side and the tokenization algorithm to use.
2. Define the tokenization pipeline in `__init__`. This includes the tokenization algorithm to use, how to split the raw text before the algorithm runs, and how to decode tokens back into text.

```py
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from transformers import TokenizersBackend

class NewTokenizer(TokenizersBackend):
    padding_side = "left"
    model = BPE

    def __init__(
        self,
        vocab=None,
        merges=None,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="<pad>",
    ):
        self._vocab = vocab or {
            str(unk_token): 0,
            str(bos_token): 1,
            str(eos_token): 2,
            str(pad_token): 3,
        }
        self._merges = merges or []

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
        self._tokenizer.decoder = decoders.ByteLevel()

        super().__init__(
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
        )
```

Train or save the new empty tokenizer.

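```py
tokenizer = NewTokenizer()

# train on a new corpus (`batch_iterator` is the generator defined in the training section;
# the method needs a text iterator and a vocabulary size, and returns a new tokenizer)
trained_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
# or save the empty tokenizer as-is
tokenizer.save_pretrained("./new-tokenizer")
```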