
Translation


The Translation pipeline translates text between languages. It supports over 100 languages. Automatic source language detection is built-in. This pipeline detects the language of each input text row, loads a model for the source-target combination and translates text to the target language.
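The per-row flow described above can be sketched in plain Python. The `detect` function here is a hypothetical stand-in for the pipeline's built-in detector, for illustration only:

```python
# Sketch of the per-row flow: detect each row's language, then group
# rows by detected language so one model handles each source-target pair.
def group_by_language(texts, detect):
    groups = {}
    for index, text in enumerate(texts):
        # Keep the original index so results can be restored in input order
        groups.setdefault(detect(text), []).append((index, text))
    return groups

# Hypothetical keyword-based detector, illustration only
def detect(text):
    return "es" if "hola" in text.lower() else "en"

groups = group_by_language(["Hola mundo", "Hello world"], detect)
# groups == {"es": [(0, "Hola mundo")], "en": [(1, "Hello world")]}
```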

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Translation

# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")

See the link below for a more detailed example.

| Notebook | Description |
|----------|-------------|
| Translate text between languages | Streamline machine translation and language detection |

Configuration-driven example

Pipelines can be run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
translation:

# Run pipeline with workflow
workflow:
  translate:
    tasks:
      - action: translation
        args: ["es"]

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("translate", ["This is a test translation into Spanish"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"translate", "elements":["This is a test translation into Spanish"]}'
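The same API call can be made from Python with only the standard library. This is a sketch assuming the API instance started above is listening on localhost:8000:

```python
import json
from urllib.request import Request, urlopen

# Build the same request the curl command sends
payload = {"name": "translate",
           "elements": ["This is a test translation into Spanish"]}
request = Request(
    "http://localhost:8000/workflow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to execute against a running API instance
# print(json.load(urlopen(request)))
```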

Methods

Python documentation for the pipeline.

__init__(path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True)

Constructs a new language translation pipeline.

Parameters

| Name | Description | Default |
|------|-------------|---------|
| path | optional path to model, accepts Hugging Face model hub id or local path, uses default model for task if not provided | None |
| quantize | if model should be quantized, defaults to False | False |
| gpu | True/False if GPU should be enabled, also supports a GPU device id | True |
| batch | batch size used to incrementally process content | 64 |
| langdetect | set a custom language detection function, method must take a list of strings and return language codes for each, uses default language detector if not provided | None |
| findmodels | True/False if the Hugging Face Hub will be searched for source-target translation models | True |
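As an example of the langdetect contract above, any callable that takes a list of strings and returns one language code per string can be supplied. The keyword-based detector below is purely hypothetical:

```python
# Hypothetical custom detector: must accept a list of strings and
# return one language code per string
def simple_langdetect(texts):
    spanish = ("hola", "gracias", "mundo")
    return ["es" if any(word in text.lower() for word in spanish) else "en"
            for text in texts]

codes = simple_langdetect(["Hola mundo", "Hello world"])
# codes == ["es", "en"]

# Pass it to the pipeline:
# translate = Translation(langdetect=simple_langdetect)
```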
Source code in txtai/pipeline/text/translation.py
def __init__(self, path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
    """
    Constructs a new language translation pipeline.

    Args:
        path: optional path to model, accepts Hugging Face model hub id or local path,
              uses default model for task if not provided
        quantize: if model should be quantized, defaults to False
        gpu: True/False if GPU should be enabled, also supports a GPU device id
        batch: batch size used to incrementally process content
        langdetect: set a custom language detection function, method must take a list of strings and return
                    language codes for each, uses default language detector if not provided
        findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models
    """

    # Call parent constructor
    super().__init__(path if path else "facebook/m2m100_418M", quantize, gpu, batch)

    # Language detection
    self.detector = None
    self.langdetect = langdetect
    self.findmodels = findmodels

    # Language models
    self.models = {}
    self.ids = None

__call__(texts, target='en', source=None, showmodels=False)

Translates text from source language into target language.

This method supports texts as a string or a list. If the input is a string, the return type is string. If the input is a list, the return type is a list.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| texts | text\|list | | required |
| target | | target language code, defaults to "en" | 'en' |
| source | | source language code, detects language if not provided | None |

Returns

| Type | Description |
|------|-------------|
| | list of translated text |
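The string-versus-list contract can be illustrated with a stand-in function. This is hypothetical and mirrors only the input/output handling, not actual translation:

```python
def fake_translate(texts, target="en"):
    # Normalize to a list, as the pipeline does internally
    values = [texts] if not isinstance(texts, list) else texts
    outputs = [f"[{target}] {text}" for text in values]
    # String input returns a string, list input returns a list
    return outputs[0] if isinstance(texts, str) else outputs

single = fake_translate("hello", "es")    # -> "[es] hello"
batch = fake_translate(["hello"], "es")   # -> ["[es] hello"]
```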

Source code in txtai/pipeline/text/translation.py
def __call__(self, texts, target="en", source=None, showmodels=False):
    """
    Translates text from source language into target language.

    This method supports texts as a string or a list. If the input is a string,
    the return type is string. If text is a list, the return type is a list.

    Args:
        texts: text|list
        target: target language code, defaults to "en"
        source: source language code, detects language if not provided

    Returns:
        list of translated text
    """

    values = [texts] if not isinstance(texts, list) else texts

    # Detect source languages
    languages = self.detect(values) if not source else [source] * len(values)
    unique = set(languages)

    # Build a dict from language to list of (index, text)
    langdict = {}
    for x, lang in enumerate(languages):
        if lang not in langdict:
            langdict[lang] = []
        langdict[lang].append((x, values[x]))

    results = {}
    for language in unique:
        # Get all indices and text values for a language
        inputs = langdict[language]

        # Translate text in batches
        outputs = []
        for chunk in self.batch([text for _, text in inputs], self.batchsize):
            outputs.extend(self.translate(chunk, language, target, showmodels))

        # Store output value
        for y, (x, _) in enumerate(inputs):
            if showmodels:
                model, op = outputs[y]
                results[x] = (op.strip(), language, model)
            else:
                results[x] = outputs[y].strip()

    # Return results in same order as input
    results = [results[x] for x in sorted(results)]
    return results[0] if isinstance(texts, str) else results
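The self.batch call above splits inputs into chunks of batchsize for incremental processing. A plausible standalone sketch of such a helper (an assumption, not the actual parent-class implementation) looks like:

```python
# Standalone sketch of a batching helper: split items into
# fixed-size chunks for incremental processing
def batch(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

chunks = batch(list(range(10)), 4)
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```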