s3-ocr

Tools for running OCR against files stored in S3

Background on this project: s3-ocr: Extract text from PDF files stored in an S3 bucket

Installation

Install this tool using pip:

pip install s3-ocr

Demo

You can see the results of running this tool against three PDFs from the Internet Archive (one, two, three) in this example table hosted using Datasette.

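Starting OCR against PDFs in a bucket

The start command takes a list of keys and submits them to Textract for OCR processing.

You need to have AWS configured using environment variables, a credentials file in your home directory, or a JSON or INI file generated using s3-credentials.

You can start the process running like this: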

s3-ocr start name-of-your-bucket my-pdf-file.pdf

The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket, it should look like this:

s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf

OCR can take some time. The results of the OCR will be stored in the textract-output folder in your bucket.

To process every file in the bucket with a .pdf extension, use --all:

s3-ocr start name-of-bucket --all

To process every file with a .pdf extension within a specific folder, use --prefix:

s3-ocr start name-of-bucket --prefix path/to/folder
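The selection rules described above (explicit keys, --all, or --prefix) can be illustrated with a small sketch. This is not the tool's own code: the function name select_pdf_keys is hypothetical, and exact, case-sensitive matching on the .pdf suffix is an assumption.

```python
def select_pdf_keys(all_keys, prefix=None):
    """Return the keys that end in .pdf, optionally limited to a prefix.

    Hypothetical helper: it mimics what --all (prefix=None) and
    --prefix are described as doing, not how s3-ocr implements them.
    """
    selected = []
    for key in all_keys:
        # Skip keys outside the requested folder, if one was given
        if prefix is not None and not key.startswith(prefix):
            continue
        # Assumption: the match is an exact, case-sensitive suffix check
        if key.endswith(".pdf"):
            selected.append(key)
    return selected


keys = ["a.pdf", "PUBLIC/b.pdf", "PUBLIC/c.txt", "a.pdf.s3-ocr.json"]
print(select_pdf_keys(keys))
print(select_pdf_keys(keys, prefix="PUBLIC/"))
```

Running with --dry-run first is a cheap way to confirm which keys the real command would submit before incurring Textract charges.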

s3-ocr start --help

Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]...

  Start OCR tasks for PDF files in an S3 bucket

      s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf

  To process every file with a .pdf extension:

      s3-ocr start name-of-bucket --all

  To process every .pdf in the PUBLIC/ folder:

      s3-ocr start name-of-bucket --prefix PUBLIC/

Options:
  --all                 Process all PDF files in the bucket
  --prefix TEXT         Process all PDF files within this prefix
  --dry-run             Show what this would do, but don't actually do it
  --no-retry            Don't retry failed requests
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Checking status

The s3-ocr status command shows a rough indication of progress through the tasks:

% s3-ocr status sfms-history
153 complete out of 532 jobs

It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have had their results written to the textract-output/ folder.
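A rough sketch of that comparison follows. It assumes that results for a job are stored under textract-output/<job_id>/ and that any object under that prefix means the job is complete; both are simplifying assumptions, and count_complete is a hypothetical name rather than part of s3-ocr.

```python
def count_complete(submitted_job_ids, bucket_keys, output_prefix="textract-output/"):
    """Return (complete, total) for the submitted job IDs.

    submitted_job_ids: job IDs read from the .s3-ocr.json files
    bucket_keys: every key in the bucket
    """
    finished = set()
    for key in bucket_keys:
        if key.startswith(output_prefix):
            # Assumed layout: textract-output/<job_id>/<part>
            rest = key[len(output_prefix):]
            finished.add(rest.split("/", 1)[0])
    submitted = set(submitted_job_ids)
    complete = sum(1 for job_id in submitted if job_id in finished)
    return complete, len(submitted)


complete, total = count_complete(
    ["job-a", "job-b"],
    ["one.pdf", "one.pdf.s3-ocr.json", "textract-output/job-a/1"],
)
print(f"{complete} complete out of {total} jobs")
```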

s3-ocr status --help

Usage: s3-ocr status [OPTIONS] BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key ...

Inspecting a job

The s3-ocr inspect-job command can be used to check the status of a specific job ID:

% s3-ocr inspect-job b267282745685226339b7e0d4366c4ff6887b7e293ed4b304dc8bb8b991c7864
{
  "DocumentMetadata": {
    "Pages": 583
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
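If you want to act on that output in a script, a minimal sketch like the one below can interpret it. The describe_job helper is hypothetical; the status names are the ones Textract documents for asynchronous text detection jobs (IN_PROGRESS, SUCCEEDED, FAILED, PARTIAL_SUCCESS).

```python
import json

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"}


def describe_job(raw_json):
    """Summarize the JSON printed by s3-ocr inspect-job (hypothetical helper)."""
    data = json.loads(raw_json)
    status = data.get("JobStatus")
    pages = data.get("DocumentMetadata", {}).get("Pages")
    return {
        "status": status,
        "pages": pages,
        "finished": status in TERMINAL_STATUSES,
    }


sample = """
{
  "DocumentMetadata": {"Pages": 583},
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
"""
print(describe_job(sample))
```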

s3-ocr inspect-job --help

Usage: s3-ocr inspect-job [OPTIONS] JOB_ID

  Show the current status of an OCR job

      s3-ocr inspect-job 

Options:
  --access-key ...

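Fetching the results

Once an OCR job has completed you can download the resulting JSON using the fetch command:

s3-ocr fetch name-of-bucket path/to/file.pdf

This will save files in the current directory with names like this: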

4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json
4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json

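The number of files will vary depending on the length of the document.

If you don't want separate files you can combine them together using the -c/--combine option: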

s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json

The output.json file will then contain data that looks something like this:

{
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {...}
      "Page": 1,
      ...
    },
    {
      "BlockType": "LINE",
      "Page": 1,
      ...
      "Text": "Barry",
    },
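If you have already downloaded the separate numbered files, combining them yourself might look roughly like the sketch below. It assumes each part file holds a top-level "Blocks" list, as in the sample above, and that parts should be joined in numeric order; combine_parts is a hypothetical helper, not the tool's implementation of --combine.

```python
import json
import re
from pathlib import Path


def combine_parts(directory, job_id):
    """Merge <job_id>-1.json, <job_id>-2.json, ... into one {"Blocks": [...]}."""
    pattern = re.compile(re.escape(job_id) + r"-(\d+)\.json$")
    parts = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            # Keep the numeric suffix so that -10 sorts after -2
            parts.append((int(match.group(1)), path))
    blocks = []
    for _, path in sorted(parts):
        blocks.extend(json.loads(path.read_text()).get("Blocks", []))
    return {"Blocks": blocks}
```

Sorting on the integer suffix rather than the file name matters for long documents, where a plain string sort would place part 10 before part 2.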

s3-ocr fetch --help

Usage: s3-ocr fetch [OPTIONS] BUCKET KEY

  Fetch the OCR results for a specified file

      s3-ocr fetch name-of-bucket path/to/key.pdf

  This will save files in the current directory called things like

      a806e67e504fc15f...48314e-1.json     a806e67e504fc15f...48314e-2.json

  To combine these together into a single JSON file with a specified name, use:

      s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json

  Use "--output -" to print the combined JSON to standard output instead.

Options:
  -c, --combine FILENAME  Write combined JSON to file
  --access-key ...

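Fetching just the text of a page

If you don't want to deal with the JSON directly, you can use the text command to retrieve just the text extracted from a PDF: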

s3-ocr text name-of-bucket path/to/file.pdf

This will output plain text to standard output.

To save that to a file, use this:

s3-ocr text name-of-bucket path/to/file.pdf > text.txt

Separate pages will be separated by three newlines. To separate them with a ---- horizontal divider instead, add --divider:

s3-ocr text name-of-bucket path/to/file.pdf --divider
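To show how page text can be derived from the Textract JSON, here is a minimal sketch that collects the LINE blocks for each page and joins the pages. The blocks_to_text name is hypothetical, and the exact whitespace placed around the ---- divider is an assumption.

```python
def blocks_to_text(blocks, divider=False):
    """Join the text of LINE blocks, grouped by page number."""
    pages = {}
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages.setdefault(block.get("Page", 1), []).append(block.get("Text", ""))
    # Three newlines between pages by default, a ---- rule with divider=True
    separator = "\n----\n" if divider else "\n\n\n"
    return separator.join("\n".join(pages[number]) for number in sorted(pages))


blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Barry"},
    {"BlockType": "LINE", "Page": 2, "Text": "Second page"},
]
print(blocks_to_text(blocks, divider=True))
```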

s3-ocr text --help

Usage: s3-ocr text [OPTIONS] BUCKET KEY

  Retrieve the text from an OCRd PDF file

      s3-ocr text name-of-bucket path/to/key.pdf

Options:
  --divider             Add ---- between pages
  --access-key ...

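Avoiding processing duplicates

If you move files around within your S3 bucket, s3-ocr can lose track of which files have already been processed. This can lead to additional Textract charges if you run s3-ocr start against those new files.

The s3-ocr dedupe command addresses this by scanning your bucket for files that have a new name but were previously processed. It does this by looking at the ETag for each file, which represents the MD5 hash of the file contents. (For objects uploaded in multiple parts, S3 computes the ETag differently, so it is not a plain MD5 of the contents.)

This command will write a new .s3-ocr.json file for each detected duplicate. This avoids those duplicates being run through OCR again when you run s3-ocr start.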

s3-ocr dedupe name-of-bucket

Add --dry-run for a preview of the changes that will be made to your bucket.
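The ETag matching idea can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: find_duplicates is a hypothetical name, and the inputs (a mapping of key to ETag, plus the set of keys that already have a .s3-ocr.json file) are assumed to have been gathered beforehand.

```python
def find_duplicates(etags_by_key, processed_keys):
    """Map each unprocessed PDF key to an already-processed key with the same ETag."""
    processed_by_etag = {}
    for key in processed_keys:
        etag = etags_by_key.get(key)
        if etag is not None:
            # Remember the first processed key seen for each ETag
            processed_by_etag.setdefault(etag, key)
    duplicates = {}
    for key, etag in etags_by_key.items():
        if key in processed_keys or not key.endswith(".pdf"):
            continue
        original = processed_by_etag.get(etag)
        if original is not None:
            duplicates[key] = original
    return duplicates


etags = {
    "old/report.pdf": '"b0c7"',
    "new/report.pdf": '"b0c7"',
    "new/other.pdf": '"ffff"',
}
print(find_duplicates(etags, {"old/report.pdf"}))
```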

s3-ocr dedupe --help

Usage: s3-ocr dedupe [OPTIONS] BUCKET

  Scan every file in the bucket checking for duplicates - files that have not
  yet been OCRd but that have the same contents (based on ETag) as a file that
  HAS been OCRd.

      s3-ocr dedupe name-of-bucket

Options:
  --dry-run             Show output without writing anything to S3
  --access-key ...

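Changes made to your bucket

To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.

This file will be called:

path-to-file/name-of-file.pdf.s3-ocr.json

Each of these JSON files contains data that looks like this: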

{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}

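The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.

The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.

This design for the tool, with the .s3-ocr.json files tracking jobs that have been submitted, means that it is safe to run s3-ocr start against the same bucket multiple times without the risk of starting duplicate OCR jobs.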
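A minimal sketch of how such a marker file could be used is shown below. Both helpers are hypothetical, and treating a changed ETag as a reason to resubmit is an assumption about one possible use of the recorded value, not a description of what s3-ocr start does.

```python
import json


def marker_key(pdf_key):
    """Return the key of the tracking file that sits next to a PDF."""
    return pdf_key + ".s3-ocr.json"


def needs_ocr(current_etag, marker_body):
    """Decide whether a file should be submitted.

    marker_body is the text of the .s3-ocr.json file, or None if it is missing.
    """
    if marker_body is None:
        return True  # never submitted
    recorded = json.loads(marker_body).get("etag")
    # The recorded ETag keeps its surrounding double quotes, as in the sample
    return recorded != current_etag


body = json.dumps({"job_id": "a34eb4e8", "etag": '"b0c77472"'})
print(marker_key("path-to-file/name-of-file.pdf"))
print(needs_ocr('"b0c77472"', body))
```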

Creating a SQLite index of your OCR results

The s3-ocr index command creates a SQLite database containing the OCR results, and configures SQLite full-text search against the text:

% s3-ocr index sfms-history index.db
Fetching job details  [####################################]  100%
Populating pages table  [####################----------------]   55%  00:03:18

The schema of the resulting database looks like this (excluding the FTS tables):

CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);

The database is designed to be used with Datasette.
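The database can also be queried directly from Python. The sketch below recreates the pages table from the schema above in an in-memory database and searches it with LIKE; the sample rows are made up for illustration, and LIKE is used only because the names of the FTS tables are not shown here.

```python
import sqlite3

# The pages table, copied from the schema shown above
SCHEMA = """
CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
"""


def search_pages(conn, term):
    """Return (path, page) pairs whose text contains the term."""
    sql = "SELECT path, page FROM pages WHERE text LIKE ? ORDER BY path, page"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample rows, standing in for real OCR output
conn.executemany(
    "INSERT INTO pages (path, page, folder, text) VALUES (?, ?, ?, ?)",
    [
        ("docs/one.pdf", 1, "docs", "Minutes recorded by Barry"),
        ("docs/one.pdf", 2, "docs", "Budget discussion"),
    ],
)
print(search_pages(conn, "Barry"))
```

Against a real index.db, replace the in-memory connection with sqlite3.connect("index.db") and skip the schema and insert steps.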

s3-ocr index --help

Usage: s3-ocr index [OPTIONS] BUCKET DATABASE

  Create a SQLite database with OCR results for files in a bucket

Options:
  --access-key ...

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd s3-ocr
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

To regenerate the README file with the latest --help:

cog -r README.md

s3-ocr by simonw
