first commit
陈赣
2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions

docs/source/zh/tasks/asr.md Normal file

@@ -0,0 +1,384 @@
<!--
Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
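# Automatic speech recognition
[[open-in-colab]]
<Youtube id="TksaY_FDgnk"/>
Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs.
Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications, such as live captioning and note-taking during meetings.
This guide will show you how to:
1. Fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the
[MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
2. Use your fine-tuned model for inference.
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/automatic-speech-recognition).
</Tip>
Before you begin, make sure you have all the necessary libraries installed: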
```bash
pip install transformers datasets evaluate jiwer
```
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community.
When prompted, enter your token to log in:
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
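## Load the MInDS-14 dataset
Start by loading a smaller subset of the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14)
dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.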
```py
>>> from datasets import load_dataset, Audio
>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
```
Split the dataset's `train` split into a train and test set with the [`~Dataset.train_test_split`] method:
```py
>>> minds = minds.train_test_split(test_size=0.2)
```
Then take a look at the dataset:
```py
>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})
```
While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, this guide
focuses on `audio` and `transcription`. Remove the other columns with the [`~datasets.Dataset.remove_columns`] method:
```py
>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
```
Review the example again:
```py
>>> minds["train"][0]
{'audio': {'array': array([-0.00024414, 0. , 0. , ..., 0.00024414,
0.00024414, 0.00024414], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'sampling_rate': 8000},
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```
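There are two fields:
- `audio`: a one-dimensional `array` of the speech signal. Accessing this column loads and resamples the audio file.
- `transcription`: the target text.
## Preprocess
The next step is to load a Wav2Vec2 processor to process the audio signal: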
```py
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
```
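The MInDS-14 dataset has a sampling rate of 8000 Hz (8 kHz) (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)),
which means you need to resample the dataset to 16000 Hz (16 kHz) to use the pretrained Wav2Vec2 model: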
```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'sampling_rate': 16000},
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```
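To make the resampling step concrete, here is a minimal sketch of what going from 8 kHz to 16 kHz does to the number of samples. It uses plain linear interpolation, which is an assumption made for illustration only; `Audio(sampling_rate=16_000)` relies on a proper resampling backend, and the helper name `resample_linear` is hypothetical.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a 1-D list of floats with linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = int(round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last input sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of (silent) audio at 8 kHz becomes 16000 samples at 16 kHz.
one_second_8k = [0.0] * 8000
resampled = resample_linear(one_second_8k, 8_000, 16_000)
print(len(resampled))  # 16000
```

The duration of the clip is unchanged; only the number of samples per second doubles, which is why the `array` values in the output above differ from the 8 kHz version.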
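As you can see in the `transcription` above, the text contains a mix of uppercase and lowercase characters.
The Wav2Vec2 tokenizer is only trained on uppercase characters, so you need to make sure the text matches the tokenizer's vocabulary: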
```py
>>> def uppercase(example):
...     return {"transcription": example["transcription"].upper()}
>>> minds = minds.map(uppercase)
```
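Now create a preprocessing function that:
1. Calls the `audio` column to load and resample the audio file.
2. Extracts the `input_values` from the audio file and tokenizes the `transcription` column with the processor.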
```py
>>> def prepare_dataset(batch):
...     audio = batch["audio"]
...     batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
...     batch["input_length"] = len(batch["input_values"][0])
...     return batch
```
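To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] function.
You can speed up `map` by increasing the number of processes with the `num_proc` parameter.
Remove the columns you don't need with the [`~datasets.Dataset.remove_columns`] method: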
```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
```
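🤗 Transformers doesn't have a data collator for ASR, so you need to adapt [`DataCollatorWithPadding`] to create a batch of examples.
It also dynamically pads your text and labels to the length of the longest element in their batch (rather than the entire dataset) so they have a uniform length.
While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`: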
```py
>>> import torch
>>> from dataclasses import dataclass, field
>>> from typing import Any, Dict, List, Optional, Union
>>> @dataclass
... class DataCollatorCTCWithPadding:
...     processor: AutoProcessor
...     padding: Union[bool, str] = "longest"
...     def __call__(self, features: list[dict[str, Union[list[int], torch.Tensor]]]) -> dict[str, torch.Tensor]:
...         # split inputs and labels since they have to be of different lengths and need
...         # different padding methods
...         input_features = [{"input_values": feature["input_values"][0]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]
...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
...         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")
...         # replace padding with -100 to ignore loss correctly
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
...         batch["labels"] = labels
...         return batch
```
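The idea behind the collator can be sketched without any framework. The snippet below is a simplified illustration, not the collator itself: the helper `pad_batch` and the toy values are hypothetical, and it pads labels with `-100` directly instead of padding with the pad token and masking afterwards as the class above does.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real element, 0 marks padding
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask


# Toy batch: audio inputs and label ids have unrelated lengths,
# so each group is padded separately.
input_values = [[0.1, 0.2, 0.3], [0.4]]
label_ids = [[7, 8], [9, 10, 11, 12]]

padded_inputs, input_mask = pad_batch(input_values, pad_value=0.0)
padded_labels, _ = pad_batch(label_ids, pad_value=-100)  # -100 is ignored by the loss

print(padded_inputs)
print(padded_labels)
```

Padding to the longest element of each batch, rather than of the whole dataset, keeps batches as small as their contents allow.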
Now instantiate your `DataCollatorCTCWithPadding`:
```py
>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
```
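## Evaluate
Including a metric during training is often helpful for evaluating your model's performance.
You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library.
For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric
(see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):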
```py
>>> import evaluate
>>> wer = evaluate.load("wer")
```
Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the WER:
```py
>>> import numpy as np
>>> def compute_metrics(pred):
...     pred_logits = pred.predictions
...     pred_ids = np.argmax(pred_logits, axis=-1)
...     pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
...     pred_str = processor.batch_decode(pred_ids)
...     label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
...     # use a separate name so the global `wer` metric is not shadowed
...     wer_score = wer.compute(predictions=pred_str, references=label_str)
...     return {"wer": wer_score}
```
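For intuition about what the metric measures, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain-Python approximation under that definition; the function name and the example sentences are hypothetical, and the `evaluate` metric remains the one to use in practice.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference and prediction: three substitutions plus one deletion
# over nine reference words.
score = word_error_rate(
    "I WOULD LIKE TO SET UP A JOINT ACCOUNT",
    "I WOUD LIKE O SET UP JOINT ACOUNT",
)
print(score)
```

Lower is better, which is why the training arguments later in this guide set `greater_is_better=False`.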
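Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
## Train
<Tip>
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load Wav2Vec2 with [`AutoModelForCTC`].
Specify the reduction to apply with the `ctc_loss_reduction` parameter. It is often better to use the average instead of the default summation: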
```py
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer
>>> model = AutoModelForCTC.from_pretrained(
...     "facebook/wav2vec2-base",
...     ctc_loss_reduction="mean",
...     pad_token_id=processor.tokenizer.pad_token_id,
... )
```
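At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model.
You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
With the arguments below, the [`Trainer`] evaluates the WER and saves a training checkpoint every 1000 steps.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, processor, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to fine-tune your model.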
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_asr_mind_model",
... per_device_train_batch_size=8,
... gradient_accumulation_steps=2,
... learning_rate=1e-5,
... warmup_steps=500,
... max_steps=2000,
... gradient_checkpointing=True,
... fp16=True,
... train_sampling_strategy="group_by_length",
... eval_strategy="steps",
... per_device_eval_batch_size=8,
... save_steps=1000,
... eval_steps=1000,
... logging_steps=25,
... load_best_model_at_end=True,
... metric_for_best_model="wer",
... greater_is_better=False,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=encoded_minds["train"],
... eval_dataset=encoded_minds["test"],
... processing_class=processor,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
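<Tip>
For a more in-depth example of how to fine-tune a model for automatic speech recognition,
take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR
and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.
</Tip>
## Inference
Great, now that you've fine-tuned a model, you can use it for inference!
Load an audio file you'd like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!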
```py
>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]
```
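The simplest way to try out your fine-tuned model for inference is to use it in a [`pipeline`].
Instantiate a `pipeline` for automatic speech recognition with your model, and pass your audio file to it: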
```py
>>> from transformers import pipeline
>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model")
>>> transcriber(audio_file)
{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'}
```
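<Tip>
The transcription is decent, but it could be better! Try fine-tuning your model on more examples to get even better results!
</Tip>
You can also manually replicate the results of the `pipeline` if you'd like:
Load a processor to preprocess the audio file and transcription, and return the `input` as PyTorch tensors: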
```py
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
```
Pass your inputs to the model and return the logits:
```py
>>> import torch
>>> from transformers import AutoModelForCTC
>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```
Get the predicted `input_ids` with the highest probability, and use the processor to decode the predicted `input_ids` back into text:
```py
>>> import torch
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription
['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER']
```
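The decoding step relies on the CTC convention: take the most likely id per audio frame, collapse consecutive repeats, then drop the blank token. A minimal sketch of that rule follows, assuming a toy vocabulary in which id `0` is the blank; the function name and vocabulary are hypothetical and do not reproduce the processor's full behavior.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then remove blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # keep a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous = token_id
    return "".join(decoded)


# Toy vocabulary; the blank between the two runs of "L" keeps them distinct.
vocab = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O"}
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # HELLO
```

This is also why `compute_metrics` decodes the labels with `group_tokens=False`: reference labels are already the final characters, so repeated letters in them must not be collapsed.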


@@ -0,0 +1,283 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
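# Question answering
[[open-in-colab]]
<Youtube id="ajPx5LwJD-I"/>
Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Doubao or Siri what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.
This guide will show you how to:
1. Fine-tune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering.
2. Use your fine-tuned model for inference.
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/question-answering).
</Tip>
Before you begin, make sure you have all the necessary libraries installed: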
```bash
pip install transformers datasets evaluate
```
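We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: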
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
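## Load the SQuAD dataset
Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This lets you experiment and make sure everything works before spending more time training on the full dataset.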
```py
>>> from datasets import load_dataset
>>> squad = load_dataset("squad", split="train[:5000]")
```
Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:
```py
>>> squad = squad.train_test_split(test_size=0.2)
```
Then take a look at an example:
```py
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
```
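There are several important fields here:
- `answers`: the starting character position of the answer in the context, and the answer text.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question the model should answer.
## Preprocess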
<Youtube id="qgaM0weJHpA"/>
The next step is to load a DistilBERT tokenizer to process the `question` and `context` fields:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
```
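There are a few preprocessing steps particular to question answering tasks you should be aware of:
1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the [`~tokenizers.Encoding.sequence_ids`] method to find which part of the offsets corresponds to the `question` and which corresponds to the `context`.
Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`: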
```py
>>> def preprocess_function(examples):
...     questions = [q.strip() for q in examples["question"]]
...     inputs = tokenizer(
...         questions,
...         examples["context"],
...         max_length=384,
...         truncation="only_second",
...         return_offsets_mapping=True,
...         padding="max_length",
...     )
...     offset_mapping = inputs.pop("offset_mapping")
...     answers = examples["answers"]
...     start_positions = []
...     end_positions = []
...     for i, offset in enumerate(offset_mapping):
...         answer = answers[i]
...         start_char = answer["answer_start"][0]
...         end_char = answer["answer_start"][0] + len(answer["text"][0])
...         sequence_ids = inputs.sequence_ids(i)
...         # Find the start and end of the context
...         idx = 0
...         while sequence_ids[idx] != 1:
...             idx += 1
...         context_start = idx
...         while sequence_ids[idx] == 1:
...             idx += 1
...         context_end = idx - 1
...         # If the answer is not fully inside the context, label it (0, 0)
...         if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
...             start_positions.append(0)
...             end_positions.append(0)
...         else:
...             # Otherwise it's the start and end token positions
...             idx = context_start
...             while idx <= context_end and offset[idx][0] <= start_char:
...                 idx += 1
...             start_positions.append(idx - 1)
...             idx = context_end
...             while idx >= context_start and offset[idx][1] >= end_char:
...                 idx -= 1
...             end_positions.append(idx + 1)
...     inputs["start_positions"] = start_positions
...     inputs["end_positions"] = end_positions
...     return inputs
```
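The offset logic is easier to follow on a toy example. The sketch below assumes a hypothetical whitespace tokenizer in place of the DistilBERT tokenizer, and its token indices are relative to the context only (no question or special tokens), so it illustrates the two `while` loops rather than reproducing the function above.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token); (0, 0) if it lies outside."""
    if offsets[0][0] > end_char or offsets[-1][1] < start_char:
        return (0, 0)
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1  # last token that starts at or before the answer start
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1  # first token that ends at or after the answer end
    return (start_token, end_token)


context = "The cat sat on the mat"
answer = "sat on"
start_char = context.index(answer)
end_char = start_char + len(answer)
offsets = whitespace_offsets(context)
print(char_span_to_token_span(offsets, start_char, end_char))  # (2, 3)
```

In the guide's function, position `0` is the `[CLS]` token, so `(0, 0)` unambiguously means "no answer in this window"; in this toy version index `0` is a real token, which is one reason it is only a sketch.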
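To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need: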
```py
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
```
Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the [`DefaultDataCollator`] does not apply any additional preprocessing such as padding.
```py
>>> from transformers import DefaultDataCollator
>>> data_collator = DefaultDataCollator()
```
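## Train
<Tip>
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load DistilBERT with [`AutoModelForQuestionAnswering`]: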
```py
>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")
```
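At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.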
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_qa_model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... num_train_epochs=3,
... weight_decay=0.01,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_squad["train"],
... eval_dataset=tokenized_squad["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... )
>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
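<Tip>
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
</Tip>
## Evaluate
Evaluation for question answering requires a significant amount of post-processing. To avoid taking up too much of your time, this guide skips the evaluation step. The [`Trainer`] still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance.
If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#post-processing) chapter of the 🤗 Hugging Face Course!
## Inference
Great, now that you've fine-tuned a model, you can use it for inference!
Come up with a question and some context you'd like the model to predict on: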
```py
>>> question = "How many programming languages does BLOOM support?"
>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
```
The simplest way to run inference with your fine-tuned model here is to use the tokenizer and model directly. Tokenize the text and return PyTorch tensors:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="pt")
```
Pass your inputs to the model and return the output:
```py
>>> import torch
>>> from transformers import AutoModelForQuestionAnswering
>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> with torch.no_grad():
...     outputs = model(**inputs)
```
Get the highest-probability start and end positions from the model output:
```py
>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()
```
Decode the predicted tokens to get the answer:
```py
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
```
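Taking the two `argmax` values independently can produce an overly long span, or even an end position that precedes the start. A common post-processing idea, which is not what the code above does, is to score every valid `(start, end)` pair and keep the best one. The sketch below assumes plain Python lists of logits; the function name, the `max_answer_len` limit, and the toy logits are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with start <= end that maximizes the summed logits."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        # only consider ends at or after the start, within the length limit
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where independent argmax would give start=3 and end=1 (invalid).
start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [0.2, 4.0, 1.0, 0.5]
print(best_span(start_logits, end_logits))  # (1, 1)
```

With real model outputs you would convert `outputs.start_logits[0]` and `outputs.end_logits[0]` to lists first, and usually restrict the search to context tokens.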


@@ -0,0 +1,254 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
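# Text classification
[[open-in-colab]]
<Youtube id="leNG9fN9FQU"/>
Text classification is a common NLP task that assigns a label or class to text. Many large companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.
This guide will show you how to:
1. Fine-tune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative.
2. Use your fine-tuned model for inference.
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/text-classification).
</Tip>
Before you begin, make sure you have all the necessary libraries installed: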
```bash
pip install transformers datasets evaluate accelerate
```
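We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: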
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## Load the IMDb dataset
Start by loading the IMDb dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> imdb = load_dataset("imdb")
```
Then take a look at an example:
```py
>>> imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}
```
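There are two fields in this dataset:
- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.
## Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the `text` field: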
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
```
Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length:
```py
>>> def preprocess_function(examples):
...     return tokenizer(examples["text"], truncation=True)
```
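To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once: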
```py
>>> tokenized_imdb = imdb.map(preprocess_function, batched=True)
```
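Now create a batch of examples using [`DataCollatorWithPadding`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.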
```py
>>> from transformers import DataCollatorWithPadding
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
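## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):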
```py
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")
```
Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:
```py
>>> import numpy as np
>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
```
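Your `compute_metrics` function is ready to go now, and you'll use it when you set up your training.
## Train
Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`: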
```py
>>> id2label = {0: "NEGATIVE", 1: "POSITIVE"}
>>> label2id = {"NEGATIVE": 0, "POSITIVE": 1}
```
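<Tip>
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load DistilBERT with [`AutoModelForSequenceClassification`] along with the number of expected labels and the label mappings: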
```py
>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> model = AutoModelForSequenceClassification.from_pretrained(
... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )
```
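At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to fine-tune your model.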
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_model",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... num_train_epochs=2,
... weight_decay=0.01,
... eval_strategy="epoch",
... save_strategy="epoch",
... load_best_model_at_end=True,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_imdb["train"],
... eval_dataset=tokenized_imdb["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
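<Tip>
[`Trainer`] applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly.
</Tip>
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model: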
```py
>>> trainer.push_to_hub()
```
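<Tip>
For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb).
</Tip>
## Inference
Great, now that you've fine-tuned a model, you can use it for inference!
Grab some text you'd like to run inference on: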
```py
>>> text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
```
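The simplest way to try out your fine-tuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it: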
```py
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
>>> classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]
```
You can also manually replicate the results of the `pipeline` if you'd like:
Tokenize the text and return PyTorch tensors:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="pt")
```
Pass your inputs to the model and return the `logits`:
```py
>>> import torch
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```
Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:
```py
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'POSITIVE'
```
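The manual path returns only a label, whereas the `pipeline` output above also includes a `score`. That score is typically the softmax probability of the predicted class. The sketch below shows the computation on hypothetical logits using only the standard library; the logit values are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.5, 4.1]  # hypothetical values for a single example
probs = softmax(logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class_id], probs[predicted_class_id])
```

With PyTorch tensors, the equivalent call would be `torch.softmax(logits, dim=-1)`.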


@@ -0,0 +1,256 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
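# Summarization
[[open-in-colab]]
<Youtube id="yHnr5Dk2zCI"/>
Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:
- Extractive: extract the most relevant information from a document.
- Abstractive: generate new text that captures the most important information.
This guide will show you how to:
1. Fine-tune [T5](https://huggingface.co/google-t5/t5-small) on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset for abstractive summarization.
2. Use your fine-tuned model for inference.
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/summarization).
</Tip>
Before you begin, make sure you have all the necessary libraries installed: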
```bash
pip install transformers datasets evaluate rouge_score
```
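We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: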
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## Load the BillSum dataset
Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> billsum = load_dataset("billsum", split="ca_test")
```
Split the dataset into a train and test set with the [`~datasets.Dataset.train_test_split`] method:
```py
>>> billsum = billsum.train_test_split(test_size=0.2)
```
Then take a look at an example:
```py
>>> billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
'text': 'The people of the State of California do enact as follows: ...',
'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}
```
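There are two fields that you'll want to use:
- `text`: the text of the bill, which will be the input to the model.
- `summary`: a condensed version of `text`, which will be the model's target.
## Preprocess
The next step is to load a T5 tokenizer to process `text` and `summary`: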
```py
>>> from transformers import AutoTokenizer
>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```
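The preprocessing function you want to create needs to:
1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the keyword `text_target` argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.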
```py
>>> prefix = "summarize: "
>>> def preprocess_function(examples):
...     inputs = [prefix + doc for doc in examples["text"]]
...     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
...     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
```
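To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once: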
```py
>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)
```
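Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.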
```py
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
```
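To make the benefit of dynamic padding concrete, here is a minimal, library-free sketch. The token ids, the batch split, and `pad_id=0` are illustrative assumptions, not output from the real tokenizer or collator; the sketch only compares how many positions each padding strategy has to store.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Hypothetical tokenized examples of very different lengths.
dataset = [[5, 6], [7, 8, 9], [1], [2, 3, 4, 5, 6, 7, 8, 9]]
batches = [dataset[0:2], dataset[2:4]]

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = [pad_batch(batch) for batch in batches]
dynamic_total = sum(len(row) for batch in dynamic for row in batch)

# Static padding: every example is padded to the longest sequence in the dataset.
global_max = max(len(seq) for seq in dataset)
static_total = global_max * len(dataset)

print(dynamic_total, static_total)
```

The first batch only needs length 3, so the short examples never pay for the single long example that lives in the second batch.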
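## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):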
```py
>>> import evaluate
>>> rouge = evaluate.load("rouge")
```
Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the ROUGE metric:
```py
>>> import numpy as np
>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
...     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
...     result["gen_len"] = np.mean(prediction_lens)
...     return {k: round(v, 4) for k, v in result.items()}
```
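Two steps in `compute_metrics` are easy to misread: label positions set to `-100` are swapped back to the pad token id before decoding, and `gen_len` counts the non-pad tokens in each prediction. The sketch below reproduces both steps on plain Python lists. The token ids and `pad_token_id=0` are made-up assumptions, and the helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def restore_padding(label_rows, pad_token_id):
    """Replace ignored label positions with the pad id so they can be decoded."""
    return [
        [pad_token_id if tok == IGNORE_INDEX else tok for tok in row]
        for row in label_rows
    ]


def mean_generated_length(prediction_rows, pad_token_id):
    """Average number of non-pad tokens per prediction (the `gen_len` value)."""
    lengths = [sum(1 for tok in row if tok != pad_token_id) for row in prediction_rows]
    return sum(lengths) / len(lengths)


# Hypothetical label and prediction ids for a batch of two examples.
labels = [[21, 34, 1, -100, -100], [8, 1, -100, -100, -100]]
predictions = [[0, 21, 34, 1, 0], [0, 8, 1, 0, 0]]

restored = restore_padding(labels, pad_token_id=0)
gen_len = mean_generated_length(predictions, pad_token_id=0)
print(restored, gen_len)
```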
Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
## Train
<Tip>
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load T5 with [`AutoModelForSeq2SeqLM`]:
```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
```
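At this point, only three steps remain:
1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. Push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the ROUGE metric and save the training checkpoint.
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.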
```py
>>> training_args = Seq2SeqTrainingArguments(
... output_dir="my_awesome_billsum_model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... weight_decay=0.01,
... save_total_limit=3,
... num_train_epochs=4,
... predict_with_generate=True,
... fp16=True, #change to bf16=True for XPU
... push_to_hub=True,
... )
>>> trainer = Seq2SeqTrainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_billsum["train"],
... eval_dataset=tokenized_billsum["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
<Tip>
For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb).
</Tip>
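## Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization, prefix your input as shown below: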
```py
>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
```
Tokenize the text and return the `input_ids` as PyTorch tensors:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids
```
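Use the [`~generation.GenerationMixin.generate`] method to create the summary. For more details about the different text generation strategies and the parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API.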
```py
>>> from transformers import AutoModelForSeq2SeqLM
>>> model = AutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
```
Decode the generated token ids back into text:
```py
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it\'s the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.'
```

<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Token classification
[[open-in-colab]]
<Youtube id="wVHdVlPScxA"/>
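Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.
This guide will show you how to:
1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities.
2. Use your finetuned model for inference.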
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/token-classification).
</Tip>
Before you begin, make sure you have all the necessary libraries installed:
```bash
pip install transformers datasets evaluate seqeval
```
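We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: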
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## Load the WNUT 17 dataset
Start by loading the WNUT 17 dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> wnut = load_dataset("wnut_17")
```
Then take a look at an example:
```py
>>> wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
```
Each number in `ner_tags` represents an entity. Convert the numbers to their label names to find out what the entities are:
```py
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
>>> label_list
[
"O",
"B-corporation",
"I-corporation",
"B-creative-work",
"I-creative-work",
"B-group",
"I-group",
"B-location",
"I-location",
"B-person",
"I-person",
"B-product",
"I-product",
]
```
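The letter that prefixes each `ner_tag` indicates the token position of the entity:
- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (for example, the `State` token is part of an entity like `Empire State Building`).
- `O` indicates the token doesn't correspond to any entity (tag id `0` in `ner_tags`).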
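As an illustration of how the `B-`/`I-`/`O` prefixes combine, the following sketch groups tagged tokens into entity spans. The helper `bio_to_spans` is hypothetical (it is not part of 🤗 Datasets or Transformers), and it simply drops an `I-` tag that does not continue a matching entity, which is one of several possible conventions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity that is currently being built, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Tokens and tags taken from the WNUT 17 example above (ids 7, 8, 8, 0, 7).
tokens = [".", "Empire", "State", "Building", "=", "ESB", "."]
tags = ["O", "B-location", "I-location", "I-location", "O", "B-location", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```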
## Preprocess
<Youtube id="iY2AZYdZAr0"/>
The next step is to load a DistilBERT tokenizer to preprocess the `tokens` field:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
```
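As you saw in the example `tokens` field above, it looks like the input has already been tokenized. But the input has only been split into words, so you'll need to set `is_split_into_words=True` to tokenize the words into subwords. For example: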
```py
>>> example = wnut["train"][0]
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```
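However, this adds the special tokens `[CLS]` and `[SEP]`, and the subword tokenization creates a mismatch between the input and the labels: a single word that corresponds to a single label may now be split into two subwords. You'll need to realign the tokens and labels by:
1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function (see [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
3. Only labeling the first token of a given word, and assigning `-100` to the other subtokens of the same word.
Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length: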
```py
>>> def tokenize_and_align_labels(examples):
...     tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
...     labels = []
...     for i, label in enumerate(examples[f"ner_tags"]):
...         word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
...         previous_word_idx = None
...         label_ids = []
...         for word_idx in word_ids:  # Set the special tokens to -100.
...             if word_idx is None:
...                 label_ids.append(-100)
...             elif word_idx != previous_word_idx:  # Only label the first token of a given word.
...                 label_ids.append(label[word_idx])
...             else:
...                 label_ids.append(-100)
...             previous_word_idx = word_idx
...         labels.append(label_ids)
...     tokenized_inputs["labels"] = labels
...     return tokenized_inputs
```
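To see the alignment rule in isolation, the sketch below applies it to a hand-written `word_ids` list instead of calling the tokenizer. The `word_ids` values are an assumption modeled on how `empire state building = es ##b` would map back to five words; `align_labels` is a hypothetical helper that mirrors the inner loop above.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; ignore everything else."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] have no source word.
            label_ids.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subword of a new word keeps the word-level label.
            label_ids.append(word_labels[word_idx])
        else:
            # Later subwords of the same word are ignored by the loss.
            label_ids.append(ignore_index)
        previous_word_idx = word_idx
    return label_ids


# Assumed mapping for: [CLS] empire state building = es ##b [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 4, None]
word_labels = [7, 8, 8, 0, 7]  # B-location, I-location, I-location, O, B-location
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Only `es` carries the `B-location` label for the word `ESB`; the trailing `##b` subword receives `-100`.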
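To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once: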
```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```
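Now create a batch of examples using [`DataCollatorForTokenClassification`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.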
```py
>>> from transformers import DataCollatorForTokenClassification
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```
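## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) framework (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric). seqeval actually produces several scores: precision, recall, F1, and accuracy.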
```py
>>> import evaluate
>>> seqeval = evaluate.load("seqeval")
```
Get the NER labels first, and then create a function that passes your true predictions and true labels to [`~evaluate.EvaluationModule.compute`] to calculate the scores:
```py
>>> import numpy as np
>>> labels = [label_list[i] for i in example[f"ner_tags"]]
>>> def compute_metrics(p):
...     predictions, labels = p
...     predictions = np.argmax(predictions, axis=2)
...     true_predictions = [
...         [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
...         for prediction, label in zip(predictions, labels)
...     ]
...     true_labels = [
...         [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
...         for prediction, label in zip(predictions, labels)
...     ]
...     results = seqeval.compute(predictions=true_predictions, references=true_labels)
...     return {
...         "precision": results["overall_precision"],
...         "recall": results["overall_recall"],
...         "f1": results["overall_f1"],
...         "accuracy": results["overall_accuracy"],
...     }
```
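Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
## Train
Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`: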
```py
>>> id2label = {
... 0: "O",
... 1: "B-corporation",
... 2: "I-corporation",
... 3: "B-creative-work",
... 4: "I-creative-work",
... 5: "B-group",
... 6: "I-group",
... 7: "B-location",
... 8: "I-location",
... 9: "B-person",
... 10: "I-person",
... 11: "B-product",
... 12: "I-product",
... }
>>> label2id = {
... "O": 0,
... "B-corporation": 1,
... "I-corporation": 2,
... "B-creative-work": 3,
... "I-creative-work": 4,
... "B-group": 5,
... "I-group": 6,
... "B-location": 7,
... "I-location": 8,
... "B-person": 9,
... "I-person": 10,
... "B-product": 11,
... "I-product": 12,
... }
```
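Writing both dictionaries by hand is easy to get out of sync. As an alternative sketch, the same mappings can be derived from the label names. Here `label_list` is retyped by hand so the snippet runs on its own; in the guide it comes from `wnut["train"].features["ner_tags"].feature.names`.

```python
# Label names in tag-id order, copied from the `label_list` output above.
label_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# The position in the list is the tag id.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[7], label2id["I-product"])
```

With this approach, `num_labels` can also be taken as `len(label_list)` rather than hard-coded.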
<Tip>
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load DistilBERT with [`AutoModelForTokenClassification`], and specify the number of expected labels and the label mappings:
```py
>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
>>> model = AutoModelForTokenClassification.from_pretrained(
... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
... )
```
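At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. Push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the seqeval scores and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.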
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_wnut_model",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... num_train_epochs=2,
... weight_decay=0.01,
... eval_strategy="epoch",
... save_strategy="epoch",
... load_best_model_at_end=True,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_wnut["train"],
... eval_dataset=tokenized_wnut["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
<Tip>
For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb).
</Tip>
## Inference
Great, now that you've finetuned a model, you can use it for inference!
Grab some text you'd like to run inference on:
```py
>>> text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
```
The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for NER with your model, and pass your text to it:
```py
>>> from transformers import pipeline
>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
>>> classifier(text)
[{'entity': 'B-location',
'score': 0.42658573,
'index': 2,
'word': 'golden',
'start': 4,
'end': 10},
{'entity': 'I-location',
'score': 0.35856336,
'index': 3,
'word': 'state',
'start': 11,
'end': 16},
{'entity': 'B-group',
'score': 0.3064001,
'index': 4,
'word': 'warriors',
'start': 17,
'end': 25},
{'entity': 'B-location',
'score': 0.65523505,
'index': 13,
'word': 'san',
'start': 80,
'end': 83},
{'entity': 'B-location',
'score': 0.4668663,
'index': 14,
'word': 'francisco',
'start': 84,
'end': 93}]
```
You can also manually replicate the results of the `pipeline` if you'd like:
Tokenize the text and return PyTorch tensors:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> inputs = tokenizer(text, return_tensors="pt")
```
Pass your inputs to the model and return the `logits`:
```py
>>> import torch
>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```
Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:
```py
>>> predictions = torch.argmax(logits, dim=2)
>>> predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
>>> predicted_token_class
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
```
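The `score` values reported by the `pipeline` are probabilities, while the manual path above works directly on logits. The sketch below shows the relationship on plain Python lists: a softmax turns each token's logits into probabilities, and the argmax is unchanged by that step. The logits and the three-label `id2label` mapping are made-up assumptions, not values from the finetuned model.

```python
import math


def softmax(logits):
    """Convert one token's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_labels(token_logits, id2label):
    """Return (label, probability) for the most likely class of each token."""
    results = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


# Hypothetical logits for two tokens over a reduced label set.
id2label = {0: "O", 1: "B-location", 2: "I-location"}
token_logits = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]
predicted = predict_labels(token_logits, id2label)
print(predicted)
```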

<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Translation
[[open-in-colab]]
<Youtube id="1JvfrvZgi6c"/>
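Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, like translation or summarization. Translation systems are commonly used for translation between different language texts, but they can also be used for speech or some combination in between, like text-to-speech or speech-to-text.
This guide will show you how to:
1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.
2. Use your finetuned model for inference.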
<Tip>
To see all architectures and checkpoints compatible with this task, we recommend checking the [task page](https://huggingface.co/tasks/translation).
</Tip>
Before you begin, make sure you have all the necessary libraries installed:
```bash
pip install transformers datasets evaluate sacrebleu
```
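We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: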
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## Load the OPUS Books dataset
Start by loading the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> books = load_dataset("opus_books", "en-fr")
```
Split the dataset into a train and test set with the [`~datasets.Dataset.train_test_split`] method:
```py
>>> books = books["train"].train_test_split(test_size=0.2)
```
Then take a look at an example:
```py
>>> books["train"][0]
{'id': '90560',
'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
```
`translation`: an English and French translation of the text.
## Preprocess
<Youtube id="XAR8jnZZuUs"/>
The next step is to load a T5 tokenizer to process the English-French language pairs:
```py
>>> from transformers import AutoTokenizer
>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```
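The preprocessing function you want to create needs to:
1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Set the target language (French) in the `text_target` parameter to ensure the tokenizer processes the target text correctly. If you don't set `text_target`, the tokenizer processes the target text as English.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.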
```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "
>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
...     return model_inputs
```
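To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once: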
```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```
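With `batched=True`, the function receives a dictionary of lists rather than a single example, which is why `preprocess_function` iterates over `examples["translation"]`. The sketch below isolates that step without the tokenizer. The two sentence pairs are invented for illustration, and `split_pairs` is a hypothetical helper.

```python
prefix = "translate English to French: "


def split_pairs(examples, source_lang="en", target_lang="fr"):
    """Turn a batch of translation dicts into prefixed inputs and raw targets."""
    inputs = [prefix + pair[source_lang] for pair in examples["translation"]]
    targets = [pair[target_lang] for pair in examples["translation"]]
    return inputs, targets


# A batch as a dictionary of lists: one list per column, one entry per example.
batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}

inputs, targets = split_pairs(batch)
print(inputs)
print(targets)
```

In the guide's function, `inputs` and `targets` are then passed to the tokenizer as the text and `text_target` arguments respectively.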
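Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.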
```py
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
```
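## Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):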
```py
>>> import evaluate
>>> metric = evaluate.load("sacrebleu")
```
Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the SacreBLEU score:
```py
>>> import numpy as np
>>> def postprocess_text(preds, labels):
...     preds = [pred.strip() for pred in preds]
...     labels = [[label.strip()] for label in labels]
...     return preds, labels
>>> def compute_metrics(eval_preds):
...     preds, labels = eval_preds
...     if isinstance(preds, tuple):
...         preds = preds[0]
...     decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
...     decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
...     result = metric.compute(predictions=decoded_preds, references=decoded_labels)
...     result = {"bleu": result["score"]}
...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
...     result["gen_len"] = np.mean(prediction_lens)
...     result = {k: round(v, 4) for k, v in result.items()}
...     return result
```
Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
## Train
<Tip>
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
</Tip>
You're ready to start training your model now! Load T5 with [`AutoModelForSeq2SeqLM`]:
```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
```
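At this point, only three steps remain:
1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. Push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the SacreBLEU metric and save the training checkpoint.
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.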
```py
>>> training_args = Seq2SeqTrainingArguments(
... output_dir="my_awesome_opus_books_model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... weight_decay=0.01,
... save_total_limit=3,
... num_train_epochs=2,
... predict_with_generate=True,
... fp16=True, #change to bf16=True for XPU
... push_to_hub=True,
... )
>>> trainer = Seq2SeqTrainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_books["train"],
... eval_dataset=tokenized_books["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
<Tip>
For a more in-depth example of how to finetune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb).
</Tip>
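## Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with some text you'd like to translate to another language. For T5, you need to prefix your input depending on the task you're working on. For translation from English to French, prefix your input as shown below: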
```py
>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."
```
Tokenize the text and return the `input_ids` as PyTorch tensors:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_opus_books_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids
```
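Use the [`~generation.GenerationMixin.generate`] method to create the translation. For more details about the different text generation strategies and the parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API.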
```py
>>> from transformers import AutoModelForSeq2SeqLM
>>> model = AutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_opus_books_model")
>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
```
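The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The following is a simplified, library-free sketch of that idea on a toy next-token distribution. It is not the Transformers implementation; the probabilities, the candidate words, and the helper names are all assumptions chosen for illustration.

```python
import random


def renormalize(dist):
    """Rescale probabilities so they sum to 1."""
    total = sum(dist.values())
    return {token: prob / total for token, prob in dist.items()}


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens and renormalize.
    dist = renormalize(dict(ranked[:top_k]))
    # Top-p (nucleus): walk from most to least likely until the mass reaches top_p.
    kept, cumulative = {}, 0.0
    for token, prob in dist.items():  # dicts preserve the sorted insertion order
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return renormalize(kept)


# Hypothetical next-token probabilities.
probs = {"des": 0.5, "les": 0.3, "un": 0.15, "la": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)

# Sample one token from the filtered distribution.
rng = random.Random(0)
choice = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, choice)
```

Because sampling is random, running generation with `do_sample=True` can return a different translation on each call unless a seed is fixed.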
Decode the generated token ids back into text:
```py
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
"Les lignées partagent des ressources avec des bactéries enfixant l'azote."
```