HFTrainer

Trains a new Hugging Face Transformer model using the Trainer framework.

Example

The following shows a simple example using this pipeline.

import pandas as pd

from datasets import load_dataset

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Pandas DataFrame
df = pd.read_csv("training.csv")
model, tokenizer = trainer("bert-base-uncased", df)

# Hugging Face dataset
ds = load_dataset("glue", "sst2")
model, tokenizer = trainer("bert-base-uncased", ds["train"], columns=("sentence", "label"))

# List of dicts
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
model, tokenizer = trainer("bert-base-uncased", dt)

# Support additional TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt, 
                            learning_rate=3e-5, num_train_epochs=5)

All TrainingArguments are supported as function arguments to the trainer call.
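
For instance, a minimal sketch passing a few more standard TrainingArguments fields as keyword arguments (the output directory and values shown here are illustrative, not defaults):

# Any TrainingArguments field can be passed as a keyword argument
model, tokenizer = trainer(
    "bert-base-uncased", dt,
    output_dir="model-output",
    per_device_train_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=3
)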

See the links below for more detailed examples.

Notebook | Description
--- | ---
Train a text labeler | Build text sequence classification models
Train without labels | Use zero-shot classifiers to train new models
Train a QA model | Build and fine-tune question-answering models
Train a language model from scratch | Build new language models

Training tasks

The HFTrainer pipeline builds and/or fine-tunes models for the following training tasks.

Task | Description
--- | ---
language-generation | Causal language model for text generation (e.g. GPT)
language-modeling | Masked language model for general tasks (e.g. BERT)
question-answering | Extractive question-answering model, typically with the SQuAD dataset
sequence-sequence | Sequence-to-sequence model (e.g. T5)
text-classification | Classify text with a set of labels
token-detection | ELECTRA-style pre-training with replaced token detection
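
For example, a minimal sketch of selecting a task with the task parameter (the unlabeled data layout below is an illustrative assumption):

# Hypothetical unlabeled text records for language modeling
lm = [{"text": "sentence 1"}, {"text": "sentence 2"}]

# Masked language modeling (BERT-style)
model, tokenizer = trainer("bert-base-uncased", lm, task="language-modeling")

# Causal language modeling (GPT-style)
model, tokenizer = trainer("gpt2", lm, task="language-generation")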

PEFT

Parameter-efficient fine-tuning (PEFT) is supported through Hugging Face's PEFT library. Quantization is provided via bitsandbytes. See the examples below.

from txtai.pipeline import HFTrainer

trainer = HFTrainer()
trainer(..., quantize=True, lora=True)

When these parameters are set to True, they use the default configuration. This can also be customized.

quantize = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16"
}

lora = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none"
}

trainer(..., quantize=quantize, lora=lora)

These parameters also accept transformers.BitsAndBytesConfig and peft.LoraConfig instances.
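
For example, a minimal sketch building the configuration objects directly (assumes the transformers, peft and torch packages are installed; the values mirror the defaults shown above):

import torch

from transformers import BitsAndBytesConfig
from peft import LoraConfig

quantize = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none"
)

trainer(..., quantize=quantize, lora=lora)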

See the following PEFT documentation links for more information.

Methods

Python documentation for this pipeline.

__call__(base, train, validation=None, columns=None, maxlength=None, stride=128, task='text-classification', prefix=None, metrics=None, tokenizers=None, checkpoint=None, quantize=None, lora=None, **args)

Builds a new model using arguments.

Parameters

Name | Description | Default
--- | --- | ---
base | path to base model, accepts a Hugging Face model hub id, local path or (model, tokenizer) tuple | required
train | training data | required
validation | validation data | None
columns | tuple of columns to use for text/label, defaults to (text, None, label) | None
maxlength | maximum sequence length, defaults to tokenizer.model_max_length | None
stride | chunk size for splitting data for QA tasks | 128
task | optional model task or category, determines the model type, defaults to "text-classification" | 'text-classification'
prefix | optional source prefix | None
metrics | optional function that computes and returns a dict of evaluation metrics (see the sketch after this table) | None
tokenizers | optional number of concurrent tokenizers, defaults to None | None
checkpoint | optional resume from checkpoint flag or path to checkpoint directory, defaults to None | None
quantize | quantization configuration to pass to base model | None
lora | lora configuration to pass to PEFT model | None
args | training arguments | {}
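
The metrics function is passed through to the underlying Trainer as compute_metrics (see the source below), so it receives a transformers EvalPrediction. A minimal sketch for a classification task, assuming a hypothetical validation DataFrame vdf with the same columns as df:

import numpy as np

def accuracy(eval_pred):
    # eval_pred is a transformers EvalPrediction: (predictions, label_ids)
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

model, tokenizer = trainer("bert-base-uncased", df, validation=vdf, metrics=accuracy)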

Returns

Type | Description
--- | ---
(model, tokenizer) | the trained model and its tokenizer

Source code in txtai/pipeline/train/hftrainer.py
def __call__(
    self,
    base,
    train,
    validation=None,
    columns=None,
    maxlength=None,
    stride=128,
    task="text-classification",
    prefix=None,
    metrics=None,
    tokenizers=None,
    checkpoint=None,
    quantize=None,
    lora=None,
    **args
):
    """
    Builds a new model using arguments.

    Args:
        base: path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple
        train: training data
        validation: validation data
        columns: tuple of columns to use for text/label, defaults to (text, None, label)
        maxlength: maximum sequence length, defaults to tokenizer.model_max_length
        stride: chunk size for splitting data for QA tasks
        task: optional model task or category, determines the model type, defaults to "text-classification"
        prefix: optional source prefix
        metrics: optional function that computes and returns a dict of evaluation metrics
        tokenizers: optional number of concurrent tokenizers, defaults to None
        checkpoint: optional resume from checkpoint flag or path to checkpoint directory, defaults to None
        quantize: quantization configuration to pass to base model
        lora: lora configuration to pass to PEFT model
        args: training arguments

    Returns:
        (model, tokenizer)
    """

    # Quantization / LoRA support
    if (quantize or lora) and not PEFT:
        raise ImportError('PEFT is not available - install "pipeline" extra to enable')

    # Parse TrainingArguments
    args = self.parse(args)

    # Set seed for model reproducibility
    set_seed(args.seed)

    # Load model configuration, tokenizer and max sequence length
    config, tokenizer, maxlength = self.load(base, maxlength)

    # Default tokenizer pad token if it's not set
    tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

    # Prepare parameters
    process, collator, labels = self.prepare(task, train, tokenizer, columns, maxlength, stride, prefix, args)

    # Tokenize training and validation data
    train, validation = process(train, validation, os.cpu_count() if tokenizers and isinstance(tokenizers, bool) else tokenizers)

    # Create model to train
    model = self.model(task, base, config, labels, tokenizer, quantize)

    # Default config pad token if it's not set
    model.config.pad_token_id = model.config.pad_token_id if model.config.pad_token_id is not None else model.config.eos_token_id

    # Load as PEFT model, if necessary
    model = self.peft(task, lora, model)

    # Add model to collator
    if collator:
        collator.model = model

    # Build trainer
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=collator,
        args=args,
        train_dataset=train,
        eval_dataset=validation if validation else None,
        compute_metrics=metrics,
    )

    # Run training
    trainer.train(resume_from_checkpoint=checkpoint)

    # Run evaluation
    if validation:
        trainer.evaluate()

    # Save model outputs
    if args.should_save:
        trainer.save_model()
        trainer.save_state()

    # Put model in eval mode to disable weight updates and return (model, tokenizer)
    return (model.eval(), tokenizer)