Segmentation

The Segmentation pipeline segments text into semantic units.

Example

The following shows a simple example of using this pipeline.

from txtai.pipeline import Segmentation

# Create and run pipeline
segment = Segmentation(sentences=True)
segment("This is a test. And another test.")

# Load third-party chunkers
segment = Segmentation(chunker="semantic")
segment("This is a test. And another test.")

The Segmentation pipeline supports splitting text into sentences, lines, paragraphs and sections with rule-based methods. These modes can be set when creating the pipeline. Third-party chunkers are also supported via the chunker parameter.
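
For example, a minimal sketch of the other rule-based modes; the sample strings are illustrative only and the exact split rules are defined in segmentation.py.

from txtai.pipeline import Segmentation

# Split text into lines
segment = Segmentation(lines=True)
segment("First line\nSecond line")

# Split text into paragraphs
segment = Segmentation(paragraphs=True)
segment("First paragraph.\n\nSecond paragraph.")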

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
segmentation:
  sentences: true

# Run pipeline with workflow
workflow:
  segment:
    tasks:
      - action: segmentation

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("segment", ["This is a test. And another test."]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"segment", "elements":["This is a test. And another test."]}'

Methods

Python documentation for the pipeline.

__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False, cleantext=True, chunker=None, **kwargs)

Creates a new Segmentation pipeline.

Parameters

| Name | Description | Default |
| --- | --- | --- |
| sentences | tokenize text into sentences if True | False |
| lines | tokenize text into lines if True | False |
| paragraphs | tokenize text into paragraphs if True | False |
| minlength | require at least minlength characters per text element | None |
| join | join tokenized sections back together if True | False |
| sections | tokenize text into sections if True; splits using section or page breaks, depending on what's available | False |
| cleantext | apply text cleaning rules | True |
| chunker | create a third-party chunker to tokenize text if set | None |
| kwargs | additional keyword arguments | {} |
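
As an illustration of how these parameters combine, a hedged sketch; the exact filtering and joining behavior is defined in segmentation.py.

from txtai.pipeline import Segmentation

# Keep only sentences with at least 12 characters, then join the results back into a single string
segment = Segmentation(sentences=True, minlength=12, join=True)
segment("Short. This sentence is long enough to keep.")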
Source code in txtai/pipeline/data/segmentation.py
def __init__(
    self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False, cleantext=True, chunker=None, **kwargs
):
    """
    Creates a new Segmentation pipeline.

    Args:
        sentences: tokenize text into sentences if True, defaults to False
        lines: tokenizes text into lines if True, defaults to False
        paragraphs: tokenizes text into paragraphs if True, defaults to False
        minlength: require at least minlength characters per text element, defaults to None
        join: joins tokenized sections back together if True, defaults to False
        sections: tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available
        cleantext: apply text cleaning rules, defaults to True
        chunker: creates a third-party chunker to tokenize text if set, defaults to None
        kwargs: additional keyword arguments
    """

    if not NLTK and sentences:
        raise ImportError('NLTK is not available - install "pipeline" extra to enable')

    if not CHONKIE and chunker:
        raise ImportError('Chonkie is not available - install "pipeline" extra to enable')

    self.sentences = sentences
    self.lines = lines
    self.paragraphs = paragraphs
    self.sections = sections
    self.minlength = minlength
    self.join = join
    self.cleantext = cleantext

    # Create a third-party chunker, if applicable
    self.chunker = self.createchunker(chunker, **kwargs) if chunker else None

__call__(text)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; this could be a list of text or a list of lists depending on the tokenization strategy.

Parameters

| Name | Description | Default |
| --- | --- | --- |
| text | text or list of text | required |

Returns

| Type | Description |
| --- | --- |
| | segmented text |
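
To make the string and list cases concrete, a small sketch; the comments describe the documented behavior rather than captured output.

from txtai.pipeline import Segmentation

segment = Segmentation(sentences=True)

# String input: returns the segmented text directly
segment("This is a test. And another test.")

# List input: returns one result per input element, here a list of sentence lists
segment(["This is a test. And another test.", "One more test."])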

Source code in txtai/pipeline/data/segmentation.py
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned. This could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results