Text To Speech

The Text To Speech pipeline generates speech from text.

Example

The following shows a simple example using this pipeline.

import numpy as np

from txtai.pipeline import TextToSpeech

# Create and run pipeline with default model
tts = TextToSpeech()
tts("Say something here")

# Stream audio - incrementally generates snippets of audio
# (use `yield from` in place of the loop when inside a generator function)
for snippet in tts(
  "Say something here. And say something else.".split(),
  stream=True
):
  print(snippet)

# Generate audio using a speaker id
tts = TextToSpeech("neuml/vctk-vits-onnx")
tts("Say something here", speaker=15)

# Generate audio using speaker embeddings
tts = TextToSpeech("neuml/txtai-speecht5-onnx")
tts("Say something here", speaker=np.array(...))
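The calls above return audio in memory. As a minimal sketch of how a result could be saved to disk, the helper below writes samples to a 16-bit PCM WAV file using only the standard library. `write_wav` is a hypothetical helper, not part of txtai, and it assumes the pipeline returns mono float samples in the range [-1, 1] together with a sample rate when no encoding is set.

```python
import struct
import wave


def write_wav(path, audio, rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in audio:
        # Clip to the valid range, then scale to a signed 16-bit integer
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))

    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 2 bytes per sample = 16-bit
        f.setframerate(rate)
        f.writeframes(bytes(frames))


# Usage with the pipeline (requires txtai and a model download).
# Assumption: a single string input returns an (audio, sample rate) pair.
# audio, rate = TextToSpeech()("Say something here")
# write_wav("out.wav", audio, rate)
```

The helper accepts any iterable of numbers, so a NumPy array works without importing NumPy here.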

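See the links below for more detailed examples.

| Notebook | Description |
|---|---|
| Text to speech generation | Generate speech from text |
| Speech to Speech RAG ▶️ | Full cycle speech to speech workflow with RAG |
| Generative Audio | Storytelling with generative audio workflows |

This pipeline is backed by ONNX models from the Hugging Face Hub. The following models are currently available.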

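Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.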

config.yml

# Create pipeline using lower case class name
texttospeech:

# Run pipeline with workflow
workflow:
  tts:
    tasks:
      - action: texttospeech

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tts", ["Say something here"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tts", "elements":["Say something here"]}'
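The same request can be built from Python. The sketch below mirrors the curl call using only the standard library. `build_workflow_request` is a hypothetical helper name, and the URL assumes the server started above is listening on localhost port 8000.

```python
import json
from urllib.request import Request


def build_workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """Build a POST request equivalent to the curl command."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request("tts", ["Say something here"])

# Sending the request requires the API server to be running:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     results = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.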

Methods

Python documentation for the pipeline.

__init__(path=None, maxtokens=512, rate=22050)

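Creates a new TextToSpeech pipeline.

Parameters

| Name | Description | Default |
|---|---|---|
| path | optional model path | None |
| maxtokens | maximum number of tokens model can process, defaults to 512 | 512 |
| rate | target sample rate, defaults to 22050 | 22050 |

Source code in txtai/pipeline/audio/texttospeech.py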
def __init__(self, path=None, maxtokens=512, rate=22050):
    """
    Creates a new TextToSpeech pipeline.

    Args:
        path: optional model path
        maxtokens: maximum number of tokens model can process, defaults to 512
        rate: target sample rate, defaults to 22050
    """

    if not TTS:
        raise ImportError('TextToSpeech pipeline is not available - install "pipeline" extra to enable')

    # Default path
    path = path if path else "neuml/ljspeech-jets-onnx"

    # Target sample rate
    self.rate = rate

    # Load target tts pipeline
    self.pipeline = None
    if self.hasfile(path, "model.onnx") and self.hasfile(path, "config.yaml"):
        self.pipeline = ESPnet(path, maxtokens, self.providers())
    elif self.hasfile(path, "model.onnx") and self.hasfile(path, "voices.json"):
        self.pipeline = Kokoro(path, maxtokens, self.providers())
    else:
        self.pipeline = SpeechT5(path, maxtokens, self.providers())
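
The constructor above picks a backend from the files present at the model path. The sketch below isolates that rule as a pure function so it can be checked without loading a model. `select_backend` is a hypothetical name, and it returns the backend's name as a string rather than instantiating anything.

```python
def select_backend(files):
    """Return the backend name implied by the files at a model path."""
    files = set(files)

    # model.onnx plus config.yaml selects ESPnet
    if {"model.onnx", "config.yaml"} <= files:
        return "ESPnet"

    # model.onnx plus voices.json selects Kokoro
    if {"model.onnx", "voices.json"} <= files:
        return "Kokoro"

    # Anything else falls back to SpeechT5
    return "SpeechT5"
```

As in the constructor, the ESPnet check runs first, so a path containing both `config.yaml` and `voices.json` resolves to ESPnet.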

__call__(text, stream=False, speaker=1, encoding=None, **kwargs)

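Generates speech from text. Text longer than maxtokens will be batched and returned as a single waveform per text input.

This method supports text as a string or a list. If the input is a string, the return type is audio. If text is a list, the return type is a list.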
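
The batching described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the library's implementation: `synthesize_long` and the `synthesize` callback are hypothetical, tokens are assumed to be a plain list, and audio is represented as a list of samples.

```python
def synthesize_long(tokens, maxtokens, synthesize):
    """Synthesize tokens in chunks of at most maxtokens and join the audio."""
    waveform = []
    for start in range(0, len(tokens), maxtokens):
        # Each chunk stays within the model's token limit
        chunk = tokens[start:start + maxtokens]

        # Append this chunk's samples so the caller gets one waveform
        waveform.extend(synthesize(chunk))

    return waveform
```

The caller sees a single waveform per input regardless of how many chunks were needed.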

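Parameters

| Name | Description | Default |
|---|---|---|
| text | text or list | required |
| stream | stream response if True, defaults to False | False |
| speaker | speaker id, defaults to 1 | 1 |
| encoding | optional audio encoding format | None |
| kwargs | additional keyword args | {} |

Returns

list of (audio, sample rate) or list of audio depending on encoding parameter

Source code in txtai/pipeline/audio/texttospeech.py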
def __call__(self, text, stream=False, speaker=1, encoding=None, **kwargs):
    """
    Generates speech from text. Text longer than maxtokens will be batched and returned
    as a single waveform per text input.

    This method supports text as a string or a list. If the input is a string,
    the return type is audio. If text is a list, the return type is a list.

    Args:
        text: text|list
        stream: stream response if True, defaults to False
        speaker: speaker id, defaults to 1
        encoding: optional audio encoding format
        kwargs: additional keyword args

    Returns:
        list of (audio, sample rate) or list of audio depending on encoding parameter
    """

    # Convert results to a list if necessary
    texts = [text] if isinstance(text, str) else text

    # Streaming response
    if stream:
        return self.stream(texts, speaker, encoding)

    # Transform text to speech
    results = [self.execute(x, speaker, encoding, **kwargs) for x in texts]

    # Return results
    return results[0] if isinstance(text, str) else results
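
The `stream=True` branch above delegates to `self.stream`, whose body is not shown here. The sketch below is one way incremental generation over a list of tokens could work, assuming audio is produced each time a token ends a sentence. `stream_segments`, the `synthesize` callback, and the terminator set are all hypothetical and not taken from txtai.

```python
def stream_segments(tokens, synthesize, terminators=(".", "!", "?")):
    """Yield synthesized output one sentence at a time."""
    buffer = []
    for token in tokens:
        buffer.append(token)

        # A token ending in sentence punctuation closes the current segment
        if token.endswith(terminators):
            yield synthesize(" ".join(buffer))
            buffer = []

    # Flush trailing tokens that never reached a terminator
    if buffer:
        yield synthesize(" ".join(buffer))
```

Because this is a generator, each segment can be played or sent to a client as soon as it is ready instead of waiting for the full input.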