File To HTML

The File To HTML pipeline transforms files to HTML. It supports the following text extraction backends.

Apache Tika

Apache Tika detects and extracts metadata and text from over a thousand different file types. See this link for a list of supported document formats.

Apache Tika requires Java to be installed. An alternative is to start a separate Apache Tika service via this Docker image and set these environment variables.
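
As a rough sketch of the second option, the snippet below points the current Python process at a Tika server that is already running. It assumes the Tika backend relies on the tika-python client, which reads `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT`. The helper name `use_remote_tika` and the endpoint `http://localhost:9998` are illustrative assumptions, not values taken from this page.

```python
import os

def use_remote_tika(endpoint="http://localhost:9998"):
    """
    Configures the environment to use a separately running Tika server.

    Hypothetical helper: the variable names are assumed from the tika-python client.
    """

    # Treat Tika as a REST client instead of starting a local Java server
    os.environ["TIKA_CLIENT_ONLY"] = "True"

    # Address of the running Tika service (assumed host and port)
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint

    return {name: os.environ[name] for name in ("TIKA_CLIENT_ONLY", "TIKA_SERVER_ENDPOINT")}

settings = use_remote_tika()
```

Under these assumptions, the variables would need to be set before the pipeline is created so that the client picks them up.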

Docling

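Docling parses documents and exports them to the desired format quickly and easily. The library has rapidly gained popularity since late 2024. Docling excels at parsing formatting elements in PDFs (tables, sections, etc.).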

See this link for a list of supported document formats.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import FileToHTML

# Create and run pipeline
html = FileToHTML()
html("/path/to/file")

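Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.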

config.yml

# Create pipeline using lower case class name
filetohtml:

# Run pipeline with workflow
workflow:
  html:
    tasks:
      - action: filetohtml

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("html", ["/path/to/file"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"html", "elements":["/path/to/file"]}'
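
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper name `workflow_request` is hypothetical, and the URL and payload simply mirror the curl call above. The send step is left commented out because it needs the API service to be running and assumes the response body is JSON.

```python
import json
from urllib.request import Request, urlopen

def workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """
    Builds the POST request for the workflow endpoint used in the curl call.

    Hypothetical helper, not part of txtai.
    """

    # Same JSON payload as the curl example
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")

    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")

request = workflow_request("html", ["/path/to/file"])

# Sending requires the API service started above to be running
# with urlopen(request) as response:
#     results = json.load(response)
```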

Methods

Python documentation for the pipeline.

`__init__(backend='available')`

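Creates a new File to HTML pipeline.

Parameters

Name	Type	Description	Default
`backend`		backend to use to extract content, supports "tika", "docling" or "available" (default), which finds the first available backend	`'available'`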

Source code in txtai/pipeline/data/filetohtml.py

def __init__(self, backend="available"):
    """
    Creates a new File to HTML pipeline.

    Args:
        backend: backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available
    """

    # Lowercase backend parameter
    backend = backend.lower() if backend else None

    # Check for available backend
    if backend == "available":
        backend = "tika" if Tika.available() else "docling" if Docling.available() else None

    # Create backend instance
    self.backend = Tika() if backend == "tika" else Docling() if backend == "docling" else None
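
The selection order in the constructor above can be tried out in isolation. The sketch below is a standalone approximation: `installed` and `resolve_backend` are hypothetical names, and checking whether the `tika` or `docling` module is importable is only a stand-in for the real `Tika.available()` and `Docling.available()` checks, which are not shown here and may test more than that.

```python
from importlib.util import find_spec

def installed(module):
    """Returns True if the named top-level module can be found."""
    return find_spec(module) is not None

def resolve_backend(backend="available", has_tika=None, has_docling=None):
    """
    Mirrors the constructor's selection order and returns the backend name or None.

    Hypothetical helper: the availability flags default to a simple import check.
    """

    # Stand-in availability checks (assumption, see lead-in)
    has_tika = installed("tika") if has_tika is None else has_tika
    has_docling = installed("docling") if has_docling is None else has_docling

    # Lowercase backend parameter
    backend = backend.lower() if backend else None

    # "available" prefers Tika, then Docling
    if backend == "available":
        backend = "tika" if has_tika else "docling" if has_docling else None

    # Any other value maps to no backend
    return backend if backend in ("tika", "docling") else None

print(resolve_backend("available", has_tika=False, has_docling=True))
```

Note that an explicitly named backend is returned without an availability check, which matches the constructor: only `"available"` triggers the lookup.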

`__call__(path)`

Transforms the file at path to HTML. Returns None if no backend is available.

Parameters

Name	Type	Description	Default
`path`		input file path	required

Returns

Type	Description
	html if a backend is available, otherwise None
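
Because the call returns None rather than raising when no backend is available, callers may want to check the result explicitly. The following is a minimal sketch of such a guard; `convert` is a hypothetical helper that accepts any callable taking a path, and the lambda is only a stand-in for a real pipeline instance.

```python
def convert(pipeline, path):
    """
    Runs a File to HTML style callable and fails loudly when nothing is returned.

    Hypothetical helper: pipeline is any callable that takes a path and returns HTML or None.
    """

    html = pipeline(path)

    # None signals that no extraction backend was available
    if html is None:
        raise RuntimeError(f"No HTML returned for {path}: check that Tika or Docling is installed")

    return html

# Stand-in callable used only to demonstrate the successful case
print(convert(lambda path: "<html></html>", "/path/to/file"))
```

With a real pipeline, the first argument would be the `FileToHTML()` instance from the example earlier on this page.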