huridocs/pdf-document-layout-analysis Docker Image Overview


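HURIDOCS' pdf-document-layout-analysis is a Docker-based microservice for intelligent PDF document layout analysis. It supports OCR, content extraction, element classification (text, titles, tables, and more), reading-order detection, and conversion to other formats (Markdown, HTML).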

PDF Document Layout Analysis

A Docker-powered microservice for intelligent PDF document layout analysis, OCR, and content extraction

Built with ❤️ by HURIDOCS

⭐ Star us on GitHub • 🐳 Pull from Docker Hub • 🤗 View on Hugging Face

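🚀 Overview

This project provides a powerful and flexible PDF analysis microservice built on Clean Architecture principles. The service performs OCR, segmentation, and classification of the different parts of PDF pages, identifying elements such as text, titles, pictures, tables, and formulas. It also determines the correct reading order of these elements and converts PDFs to formats such as Markdown and HTML.

✨ Key Features

🔍 Advanced PDF layout analysis - Segments and classifies PDF content with high precision
🖼️ Visual and fast models - Choose the VGT (Vision Grid Transformer) model for accuracy or the LightGBM model for speed
📝 Multiple output formats - Export to JSON, Markdown, or HTML, and visualize the PDF segmentation
🌐 OCR support - 150+ languages through Tesseract OCR
📊 Table and formula extraction - Tables are extracted as HTML and formulas as LaTeX
🏗️ Clean Architecture - Modular, testable, and maintainable codebase
🐳 Docker-ready - Easy to deploy, with GPU acceleration support
⚡ RESTful API - Comprehensive API with 10+ endpoints

🚀 Quick Start

Run the service:

With GPU support: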

bash
docker run --rm --name pdf-document-layout-analysis --gpus '"device=0"' -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.31

Without GPU support:

bash
docker run --rm --name pdf-document-layout-analysis -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.31

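📝 The service also supports translation, but that feature requires installing from source. See the GitHub page for instructions.

Get the segmentation of a PDF: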

bash
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060

Stop the service:

bash
docker stop pdf-document-layout-analysis

💡 Tip: Replace /PATH/TO/PDF/pdf_name.pdf with the actual path to your PDF file. The service returns a JSON response containing the segmented content and its metadata.
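For callers who prefer Python to curl, the same upload can be sketched with the standard library alone. This is a minimal sketch rather than part of the project: `build_multipart` and `analyze_pdf` are hypothetical helper names, and it assumes the service is listening on localhost:5060 and answers `POST /` with a JSON body.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one PDF under the form field name 'file'."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def analyze_pdf(pdf_path, url="http://localhost:5060", **fields):
    """POST a PDF to the running service and decode the JSON it returns."""
    path = Path(pdf_path)
    body, content_type = build_multipart(fields, path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the container running, `analyze_pdf("/PATH/TO/PDF/pdf_name.pdf")` should return the same data the curl command prints; extra form fields can be passed as keyword arguments, for example `fast="true"`.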

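📋 Table of Contents

🚀 Quick Start
⚙️ Dependencies
📋 Requirements
📚 API Reference
💡 Usage Examples
🏗️ Architecture
🤖 Models
📊 Data
📈 Benchmarks
- Performance
- Speed
🌐 Installing More Languages for OCR
🔗 Related Services
🤝 Contributing

⚙️ Dependencies

Required

Docker Desktop 4.25.0+ - Installation Guide
Python 3.10+ (for local development)

Optional

NVIDIA Container Toolkit - Installation Guide (for GPU support)

📋 Requirements

System Requirements

RAM: at least 2 GB
GPU memory: 5 GB (optional; the service falls back to the CPU automatically if no GPU is available)
Disk space: 10 GB (for models and dependencies)
CPU: multi-core recommended for better performance

Docker Requirements

Docker Engine 20.10+
Docker Compose 2.0+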

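📚 API Reference

The service exposes a comprehensive RESTful API with the following endpoints:

Core Analysis Endpoints

Endpoint	Method	Description	Parameters
`/`	POST	Analyze the PDF layout and extract segments	`file`, `fast`, `ocr_tables`
`/save_xml/{filename}`	POST	Analyze the PDF and save the XML output	`file`, `xml_file_name`, `fast`
`/get_xml/{filename}`	GET	Retrieve a saved XML analysis	`xml_file_name`

Content Extraction Endpoints

Endpoint	Method	Description	Parameters
`/text`	POST	Extract text by content type	`file`, `fast`, `types`
`/toc`	POST	Extract the table of contents	`file`, `fast`
`/toc_legacy_uwazi_compatible`	POST	Extract the table of contents (Uwazi-compatible)	`file`

Format Conversion Endpoints

Endpoint	Method	Description	Parameters
`/markdown`	POST	Convert the PDF to Markdown (the zip includes segmentation data)	`file`, `fast`, `extract_toc`, `dpi`, `output_file`
`/html`	POST	Convert the PDF to HTML (the zip includes segmentation data)	`file`, `fast`, `extract_toc`, `dpi`, `output_file`
`/visualize`	POST	Visualize the segmentation on the PDF	`file`, `fast`

OCR and Utility Endpoints

Endpoint	Method	Description	Parameters
`/ocr`	POST	Apply OCR to the PDF	`file`, `language`
`/info`	GET	Get service information	-
`/`	GET	Health check and system information	-
`/error`	GET	Test error handling	-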
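
The endpoint tables can be transcribed into a small lookup that catches a misplaced parameter before a request is sent. This is only a client-side sketch: `POST_ENDPOINT_PARAMS` and `unsupported_params` are hypothetical names, the sets simply copy the parameter columns of the tables, and `/save_xml/{filename}` is left out because its path is templated.

```python
# Parameters listed in the tables for each fixed-path POST endpoint.
POST_ENDPOINT_PARAMS = {
    "/": {"file", "fast", "ocr_tables"},
    "/text": {"file", "fast", "types"},
    "/toc": {"file", "fast"},
    "/toc_legacy_uwazi_compatible": {"file"},
    "/markdown": {"file", "fast", "extract_toc", "dpi", "output_file"},
    "/html": {"file", "fast", "extract_toc", "dpi", "output_file"},
    "/visualize": {"file", "fast"},
    "/ocr": {"file", "language"},
}


def unsupported_params(endpoint, params):
    """Return, sorted, the names the tables do not list for this endpoint."""
    return sorted(set(params) - POST_ENDPOINT_PARAMS[endpoint])
```

For example, `unsupported_params("/ocr", ["file", "fast"])` reports `fast`, since the `/ocr` row lists only `file` and `language`.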

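Common Parameters

file: The PDF file to process (multipart/form-data)
fast: Use the LightGBM model instead of VGT (boolean, default: false)
ocr_tables: Apply OCR to table regions (boolean, default: false)
language: OCR language code (string, default: "en")
types: Comma-separated content types to extract (string, default: "all")
extract_toc: Include a table of contents at the start of the output (boolean, default: false)
dpi: Image resolution for the conversion (integer, default: 120)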
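
The defaults listed above can be captured in one place so that a client only sends what it changes. The sketch below is an illustration, not the service's own code: `AnalysisOptions` and `to_form_fields` are hypothetical names, booleans are serialized as lowercase strings in the style of `fast=true`, and not every endpoint accepts every field.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AnalysisOptions:
    """Optional form parameters with the defaults documented above."""

    fast: bool = False
    ocr_tables: bool = False
    language: str = "en"
    types: str = "all"
    extract_toc: bool = False
    dpi: int = 120

    def to_form_fields(self):
        """Return only the options that differ from their defaults."""
        form = {}
        for field in fields(self):
            value = getattr(self, field.name)
            if value == field.default:
                continue
            # Booleans become 'true'/'false'; everything else is str().
            if isinstance(value, bool):
                form[field.name] = str(value).lower()
            else:
                form[field.name] = str(value)
        return form
```

Leaving defaults out of the request keeps the client in step with the service if a default ever changes on the server side.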

💡 Usage Examples

Basic PDF Analysis

Standard analysis with the VGT model:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  http://localhost:5060

Fast analysis with the LightGBM model:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  -F 'fast=true' \
  http://localhost:5060

Analysis with table OCR:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  -F 'ocr_tables=true' \
  http://localhost:5060

Text Extraction

Extract all text:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  -F 'types=all' \
  http://localhost:5060/text

Extract specific content types:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  -F 'types=title,text,table' \
  http://localhost:5060/text

Format Conversion

Convert to Markdown:

bash
curl -X POST http://localhost:5060/markdown \
  -F 'file=@document.pdf' \
  -F 'extract_toc=true' \
  -F 'output_file=document.md' \
  --output 'document.zip'

Convert to HTML:

bash
curl -X POST http://localhost:5060/html \
  -F 'file=@document.pdf' \
  -F 'extract_toc=true' \
  -F 'output_file=document.html' \
  --output 'document.zip'

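📋 Segmentation data: The format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip contains a {filename}_segmentation.json file with information about each detected document segment, including:

Coordinates: left, top, width, height

Page information: page_number, page_width, page_height

Content: the text content and the segment type (for example, "Title", "Text", "Table", "Picture")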
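
Reading the segmentation file back out of the downloaded zip could look like the sketch below. `load_segmentation` is a hypothetical helper; it relies only on the `_segmentation.json` suffix described above and returns whatever JSON the file holds, since the exact top-level shape of that file is an assumption here.

```python
import io
import json
import zipfile


def load_segmentation(zip_bytes):
    """Return the parsed *_segmentation.json member of a conversion zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("_segmentation.json"):
                return json.loads(archive.read(name))
    raise FileNotFoundError("no *_segmentation.json member in the archive")
```

For a file saved with `--output 'document.zip'`, this would be called as `load_segmentation(open("document.zip", "rb").read())`.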

OCR Processing

OCR in English:

bash
curl -X POST \
  -F 'file=@scanned_document.pdf' \
  -F 'language=en' \
  http://localhost:5060/ocr \
  --output ocr_processed.pdf

OCR in other languages:

bash
# French
curl -X POST \
  -F 'file=@document_french.pdf' \
  -F 'language=fr' \
  http://localhost:5060/ocr \
  --output ocr_french.pdf

# Spanish
curl -X POST \
  -F 'file=@document_spanish.pdf' \
  -F 'language=es' \
  http://localhost:5060/ocr \
  --output ocr_spanish.pdf

Visualization

Generate a visualization PDF:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  http://localhost:5060/visualize \
  --output visualization.pdf

Table of Contents Extraction

Extract a structured table of contents:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  http://localhost:5060/toc

XML Storage and Retrieval

Analyze and save the XML:

bash
curl -X POST \
  -F 'file=@document.pdf' \
  http://localhost:5060/save_xml/my_analysis

Retrieve the saved XML:

bash
curl http://localhost:5060/get_xml/my_analysis.xml

Service Information

Get service information and supported languages:

bash
curl http://localhost:5060/info

Health check:

bash
curl http://localhost:5060/
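
Because the container may need some time before it accepts requests, a script can poll the health check before uploading anything. The following is a sketch under assumptions: `wait_until_ready` is a hypothetical helper, it treats an HTTP 200 from `GET /` as "ready", and the attempt count and delay are arbitrary. The `opener` argument exists only so the function can be exercised without a running service.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://localhost:5060/", attempts=30, delay=2.0,
                     opener=urllib.request.urlopen):
    """Poll the health-check URL until it answers 200 or attempts run out."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            # Service not reachable yet (or returned an HTTP error); retry.
            pass
        time.sleep(delay)
    return False
```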

Response Format

Most endpoints return JSON with the segmentation information:

json
[
  {
    "left": 72.0,
    "top": 84.0,
    "width": 451.2,
    "height": 23.04,
    "page_number": 1,
    "page_width": 595.32,
    "page_height": 841.92,
    "text": "Document Title",
    "type": "Title"
  },
  {
    "left": 72.0,
    "top": 120.0,
    "width": 451.2,
    "height": 200.0,
    "page_number": 1,
    "page_width": 595.32,
    "page_height": 841.92,
    "text": "This is the main text content...",
    "type": "Text"
  }
]
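
A response of this shape can be regrouped into plain text per page with a few lines of Python. The sketch below uses a hypothetical helper, `text_by_page`, and assumes that the list arrives in reading order and that page headers and footers are not wanted in the output; both choices are illustrative.

```python
from collections import defaultdict


def text_by_page(segments, skip_types=("Page header", "Page footer")):
    """Join segment text per page, keeping the order of the input list."""
    pages = defaultdict(list)
    for segment in segments:
        if segment["type"] in skip_types:
            continue
        pages[segment["page_number"]].append(segment["text"])
    # Sort by page number so the result is stable even if pages interleave.
    return {page: "\n".join(texts) for page, texts in sorted(pages.items())}
```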

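Supported Content Types

Caption - Picture and table captions
Footnote - Footnote text
Formula - Mathematical formulas
List item - List items and bullet points
Page footer - Page footer content
Page header - Page header content
Picture - Pictures and figures
Section header - Section headings
Table - Table content
Text - Regular text paragraphs
Title - Document and section titles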
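
Client code can check a response against this list, for instance to summarize a document or to notice a label it does not know how to handle. This is a sketch with hypothetical names (`SUPPORTED_TYPES`, `count_by_type`); it assumes the `type` field carries these labels exactly as spelled in the list.

```python
from collections import Counter

# Labels copied from the list of supported content types.
SUPPORTED_TYPES = (
    "Caption", "Footnote", "Formula", "List item", "Page footer",
    "Page header", "Picture", "Section header", "Table", "Text", "Title",
)


def count_by_type(segments):
    """Count segments per content type, rejecting labels not in the list."""
    counts = Counter(segment["type"] for segment in segments)
    unknown = sorted(set(counts) - set(SUPPORTED_TYPES))
    if unknown:
        raise ValueError(f"unexpected segment types: {unknown}")
    return {name: counts[name] for name in SUPPORTED_TYPES if counts[name]}
```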

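🏗️ Architecture

This project follows Clean Architecture principles to ensure separation of concerns, testability, and maintainability. The codebase is organized into distinct layers:

Directory Structure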

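src/
├── domain/                 # Enterprise business rules
│   ├── PdfImages.py       # PDF image handling domain logic
│   ├── PdfSegment.py      # PDF segment entity
│   ├── Prediction.py      # ML prediction entity
│   └── SegmentBox.py      # Core segment box entity
├── use_cases/             # Application business rules
│   ├── pdf_analysis/      # PDF analysis use cases
│   ├── text_extraction/   # Text extraction use cases
│   ├── toc_extraction/    # Table of contents extraction use cases
│   ├── visualization/     # PDF visualization use cases
│   ├── ocr/              # OCR processing use cases
│   ├── markdown_conversion/ # Markdown conversion use cases
│   └── html_conversion/   # HTML conversion use cases
├── adapters/              # Interface adapters
│   ├── infrastructure/    # External service adapters
│   ├── ml/               # Machine learning model adapters
│   ├── storage/          # File storage adapters
│   └── web/              # Web framework adapters
├── ports/                 # Interface definitions
│   ├── services/         # Service interfaces
│   └── repositories/     # Repository interfaces
└── drivers/              # Frameworks and drivers
    └── web/              # FastAPI application setup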
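
To illustrate how these layers relate, the sketch below shows a use case that depends on a port (an interface) while an adapter supplies the implementation. Every name in it (`Segment`, `LayoutModel`, `ExtractTextUseCase`, `FakeModel`) is hypothetical and chosen for illustration; none is taken from the project's source.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a domain entity."""

    text: str
    type: str


class LayoutModel(Protocol):
    """Port: what a use case needs from any layout model."""

    def predict(self, pdf_bytes: bytes) -> list[Segment]: ...


class ExtractTextUseCase:
    """Use case: depends on the port, never on a concrete model."""

    def __init__(self, model: LayoutModel):
        self.model = model

    def execute(self, pdf_bytes: bytes, types: list[str]) -> str:
        wanted = {name.lower() for name in types}
        segments = self.model.predict(pdf_bytes)
        return "\n".join(s.text for s in segments if s.type.lower() in wanted)


class FakeModel:
    """Adapter used in tests; a real adapter would wrap an ML model."""

    def predict(self, pdf_bytes: bytes) -> list[Segment]:
        return [
            Segment("Report", "Title"),
            Segment("Body", "Text"),
            Segment("3", "Page footer"),
        ]
```

Because the use case only sees the port, swapping one model adapter for another would not require changes to the use case itself, which is the testability benefit the layering is meant to provide.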