Multi-engine document OCR with cascading fallback and quality audit.
socr orchestrates multiple OCR engines — calling each as a CLI subprocess, auditing output quality, and falling back to a different engine when results are poor. Each engine is a standalone CLI tool (gemini-ocr, deepseek-ocr, marker-ocr, etc.) that can also be used independently.
pip install socr
# With specific engine backends
pip install socr[gemini] # Google Gemini (cloud)
pip install socr[local] # DeepSeek + Nougat (local/free)
pip install socr[all] # All engines
Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.
# Process a PDF (deterministic mode)
socr paper.pdf
# Cost-aware agentic mode: processes + saves one page at a time (pages/NNN.md),
# resumable on re-run, byte-identical final output.
socr paper.pdf --agentic
socr paper.pdf --agentic --strict-local # local-only (free), page-by-page
socr paper.pdf --agentic --cost-budget 0.05 # cap spend per document
# Interrupted a run? Just run it again — finished pages are skipped:
socr paper.pdf --agentic # resumes from the last saved page
# Choose engine (deterministic mode)
socr paper.pdf --primary gemini
socr paper.pdf --save-figures
# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run # preview what would be processed
# Reproducibly rebuild a document from a manifest (no model calls)
socr replay output/paper/manifest.json -o paper.md
# Check which engines are available
socr engines
socr routes each page to an OCR engine, checks the result, and retries on a different engine when the result is poor. It runs in two modes that differ in how the engine for a page is chosen.
PDF → classify each page → easy: local engine · hard: primary engine
→ heuristic audit → fallback on failed pages → Markdown
The engine is chosen up front from predicted page difficulty (tables, equations, layout). Born-digital prose uses native text for free. Quality is checked by heuristics; failed pages fall back to another engine.
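The deterministic flow above can be sketched in a few lines. This is a minimal illustration, not socr's actual code: the `Page` fields, the `choose_engine` rule, the word-count audit, and the `"glm"` local default are all assumptions made for the example, and `run_engine` is a hypothetical callable supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    has_tables: bool = False
    has_equations: bool = False
    native_text: str = ""  # embedded text layer; empty for scanned pages


def choose_engine(page, local="glm", primary="gemini"):
    """Pick an engine up front from predicted difficulty (hypothetical rule)."""
    if page.has_tables or page.has_equations:
        return primary  # hard page: send to the primary engine
    if page.native_text.strip():
        return "native"  # born-digital prose: reuse the text layer, no OCR
    return local  # easy scanned page: local engine


def passes_audit(text, min_words=50):
    """Word-count heuristic standing in for the real quality audit."""
    return len(text.split()) >= min_words


def process_page(page, run_engine, fallback="marker"):
    """Route, audit, and fall back once if the audit fails."""
    engine = choose_engine(page)
    text = page.native_text if engine == "native" else run_engine(engine, page)
    if not passes_audit(text) and engine != fallback:
        engine = fallback
        text = run_engine(fallback, page)
    return engine, text
```

In this sketch a page with tables goes to the primary engine first and only reaches the fallback when the audit rejects the primary output.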
PDF → for each page, in order:
route → extract text → reconcile tables → place figures/charts → (equations)
→ verify → FLUSH pages/NNN.md to disk → next page
→ stitch fragments → Markdown (byte-identical to whole-doc assembly) + manifest
The engine is chosen dynamically by cost: try the cheapest available provider
first, let a judge (a vision model that looks at the page, or a heuristic
fallback) decide accept-or-escalate, and climb the cost ladder
(local → cheap cloud → premium cloud) only when the cheaper output is rejected.
Stops at the first accepted output, bounded by --cost-budget / --max-cost-per-page.
Each run records the winning provider + cost per page and writes a manifest
that socr replay can reconstruct with zero model calls.
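As a rough sketch of the cost ladder described above: the `Rung` type, the `ocr_page` function, and the exact budget semantics (stop before a rung that would push spend past the budget) are assumptions for illustration, not socr's internals. The `judge` argument is any callable that returns True to accept an output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rung:
    name: str
    cost_per_page: float  # USD; 0.0 for local engines
    run: Callable[[bytes], str]


def ocr_page(page_image, ladder, judge, spent=0.0,
             max_cost_per_page=0.0, cost_budget=0.0):
    """Climb the ladder cheapest-first; return (text, provider, page_cost).

    Returns (None, None, page_cost) when no rung is accepted.
    A value of 0 for either limit means "no limit", mirroring the CLI help.
    """
    page_cost = 0.0
    for rung in sorted(ladder, key=lambda r: r.cost_per_page):
        if max_cost_per_page and rung.cost_per_page > max_cost_per_page:
            continue  # provider is above the per-page price cap
        if cost_budget and spent + page_cost + rung.cost_per_page > cost_budget:
            break  # escalating further would exceed the document budget
        text = rung.run(page_image)
        page_cost += rung.cost_per_page
        if judge(page_image, text):
            return text, rung.name, page_cost  # stop at first accepted output
    return None, None, page_cost
```

Recording the returned provider and cost for each page is what would let a manifest describe the run afterwards.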
Progressive, page-by-page processing. In agentic mode socr is page-major: it
finishes one page completely and writes it to disk immediately (pages/NNN.md
plus a pages/NNN.json status sidecar) before starting the next, rather than
holding the whole document in memory and saving once at the end. Concretely:
- Crash-safe. A hang or crash on page 30 leaves pages 1–29 already on disk. The
  final <stem>.md is the ordered stitch of the fragments and is byte-identical
  to the non-progressive whole-doc assembly.
- Resume. Re-running a partially-processed document skips pages already
  finished (matched by a per-page run-fingerprint + input checksum) and
  reprocesses only the rest. A model/prompt/flag change invalidates the
  fingerprint and forces re-OCR. The skip is conservative: on any doubt the
  page is reprocessed, never silently reused.
- Clean halt on a wedged model. If the local VLM stops responding, socr flushes
  the finished pages, marks the document PARTIAL_SAVE_VLM_TIMEOUT, and stops
  cleanly instead of feeding more work into a stuck GPU.
- Tables / figures / charts, per page. Born-digital table geometry is verified
  before paying for a VLM judge; figures are embedded inline within their page;
  chart/front-matter pages (vector charts that would otherwise become text
  word-salad) are saved as image assets with a note rather than transcribed.
- Equations (opt-in). --detect-equations saves model-free crop PNGs of display
  equations; --recover-clean-equations additionally reads each crop to LaTeX
  into a non-destructive sidecar (validated; bad LaTeX never replaces the
  native text/crop).
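The resume rule can be illustrated with a small sketch. The sidecar keys (`status`, `terminal`, `engine`, `fingerprint`), the five-digit file names, and the inputs hashed into the fingerprint are assumptions chosen for the example; socr's real schema may differ.

```python
import hashlib
import json
from pathlib import Path


def run_fingerprint(model, prompt, flags, page_bytes):
    """Hash everything that can change a page's output (assumed inputs)."""
    h = hashlib.sha256()
    for part in (model, prompt, json.dumps(flags, sort_keys=True)):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent fields cannot run together
    h.update(hashlib.sha256(page_bytes).digest())  # input checksum
    return h.hexdigest()


def can_skip(pages_dir, page_no, fingerprint):
    """Reuse a page only if its fragment and a terminal, matching sidecar exist."""
    md = pages_dir / f"{page_no:05d}.md"
    sidecar = pages_dir / f"{page_no:05d}.json"
    try:
        meta = json.loads(sidecar.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # missing or corrupt sidecar: reprocess the page
    return (
        md.is_file()
        and isinstance(meta, dict)
        and meta.get("terminal") is True
        and meta.get("fingerprint") == fingerprint
    )


def save_page(pages_dir, page_no, text, fingerprint, engine):
    """Write the fragment first, then the sidecar that marks it finished."""
    pages_dir.mkdir(parents=True, exist_ok=True)
    (pages_dir / f"{page_no:05d}.md").write_text(text, encoding="utf-8")
    meta = {"status": "ok", "terminal": True,
            "engine": engine, "fingerprint": fingerprint}
    (pages_dir / f"{page_no:05d}.json").write_text(json.dumps(meta), encoding="utf-8")
```

Writing the sidecar last means a crash between the two writes leaves a page without a terminal marker, so the conservative check reprocesses it.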
Each engine is a separate CLI binary. socr calls it as a subprocess, reads the
output markdown, and applies the quality pipeline. See docs/ARCHITECTURE.md for
the full design.
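A subprocess call of this kind might look like the sketch below. It assumes the engine prints its markdown to stdout, which the text above does not confirm (an engine may write a file instead), and `run_engine` is a hypothetical helper name. The command line is passed in by the caller, so no engine flags are assumed here.

```python
import subprocess


def run_engine(command, timeout=300.0):
    """Run an engine CLI and return its stdout as markdown, or None on failure.

    `command` is the full argument list, e.g. [binary, pdf_path, ...].
    """
    try:
        proc = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # hung engine or binary not installed
    if proc.returncode != 0 or not proc.stdout.strip():
        return None  # engine failed or produced nothing
    return proc.stdout
```

Returning None for every failure mode keeps the caller's fallback logic to a single check before the quality pipeline runs.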
Routing is local-first, then Ollama Cloud, with paid cloud reserved for edge cases. See
docs/MODELS.md for the full per-sub-task policy and the measured data behind it.
| Engine | Package | Type | Routing role |
|---|---|---|---|
| Qwen | qwen-ocr-cli | Ollama Cloud / local | Workhorse VLM (qwen3.5:cloud, no extra key) |
| Gemini | gemini-ocr-cli | Cloud | Edge-case escalation, ~$0.0002/page |
| Marker | marker-ocr-cli | Local | Layout-aware fallback (Surya + Texify) |
| GLM | glm-ocr-cli | Local | Fast local emergency fallback |
| Nougat | nougat-ocr-cli | Local | Academic papers, Python <3.13 |
| Mistral | mistral-ocr-cli | Cloud | Manual only (--primary mistral); dominated by Gemini |
| DeepSeek | deepseek-ocr-cli | Local | Manual only (--primary deepseek); low quality |
Check availability:
$ socr engines
[+] gemini cloud, ~$0.0002/page
[+] marker local, layout-aware (Surya + Texify)
[+] mistral cloud, ~$0.001/page
[+] deepseek local via Ollama
[x] nougat local, academic papers
socr process <PDF> [OPTIONS]
-o, --output-dir PATH Output directory
--primary ENGINE Primary OCR engine (gemini, marker, deepseek, etc.)
--fallback ENGINE Fallback engine
--no-audit Skip quality audit
--no-native-first OCR every page (don't use native text for prose)
--save-figures Extract figure PNGs + inline image refs (no captions)
--describe-figures Also add VLM captions (opt-in, non-authoritative)
--timeout SECONDS Subprocess timeout
--profile NAME Load ~/.config/socr/{name}.yaml
--config PATH Custom YAML config file
-q, --quiet / -v, --verbose Output verbosity
--dry-run / --reprocess List-only / force reprocess
# Agentic cost-aware routing (page-major; progressive save + resume)
--agentic Per page: cheapest provider first, judge escalates,
then flush pages/NNN.md to disk before the next page.
Re-running resumes from the last finished page.
--strict-local Only local/free rungs (no paid cloud)
--judge-backend MODE auto | vlm | heuristic (default: auto)
--judge-model NAME VLM model for the judge (e.g. qwen2-vl:7b)
--max-cost-per-page USD Skip providers above this price (0 = no cap)
--cost-budget USD Stop escalating once doc spend hits this (0 = ∞)
--write-manifest Write a replayable manifest + blob cache
--detect-equations Detect display-equation regions, save crop PNGs (model-free)
--recover-clean-equations Also read equation crops to LaTeX into a sidecar (opt-in)
--legacy-routing Use the old deterministic backbone instead of agentic
socr batch <DIR> [OPTIONS]
Same options as process, plus:
--limit N Process first N files
socr replay <MANIFEST> [-o OUT] Rebuild a document from cache (no model calls)
socr judge-benchmark <DATASET> Score the judge against labeled good/mangled pages
socr engines Show available engines
output/<doc_stem>/
├── <doc_stem>.md # final OCR text (stitched from pages/)
├── metadata.json # processing stats + status
├── pages/ # agentic mode: per-page progressive save + resume ledger
│ ├── 00001.md # one fragment per page (stitches to <doc_stem>.md, byte-identical)
│ ├── 00001.json # sidecar: status, terminal flag, engine, run-fingerprint
│ └── ...
├── figures/ # with --save-figures
│ └── figure_1_page3.png
└── audit_log.json # per-page audit events (timeouts, chart-asset pages, failures)
The pages/ directory is what makes a run crash-safe and resumable: each
NNN.md is written the instant its page finishes, and re-running reuses the ones
whose NNN.json is terminal with a matching fingerprint.
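For illustration, stitching the fragments back into one document could be as simple as the sketch below. It assumes the final file is the plain byte concatenation of the numbered fragments in page order; the `stitch` helper is hypothetical and socr's assembly may add separators or other handling.

```python
from pathlib import Path


def stitch(pages_dir, out_path):
    """Concatenate numbered page fragments in page order; return the count."""
    fragments = sorted(
        (p for p in pages_dir.glob("*.md") if p.stem.isdigit()),
        key=lambda p: int(p.stem),  # numeric sort, independent of zero padding
    )
    with out_path.open("wb") as out:
        for fragment in fragments:
            out.write(fragment.read_bytes())  # bytes in, bytes out: no re-encoding
    return len(fragments)
```

Working on raw bytes avoids newline or encoding translation, which matters if the stitched file is meant to match a whole-document assembly byte for byte.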
Create ~/.config/socr/config.yaml:
primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50
Or use profiles: ~/.config/socr/fast.yaml → socr paper.pdf --profile fast
Each backend is an independent CLI tool:
- gemini-ocr-cli — Google Gemini
- deepseek-ocr-cli — DeepSeek via Ollama
- mistral-ocr-cli — Mistral AI
- marker-ocr-cli — Marker (Surya + Texify)
- nougat-ocr-cli — Meta Nougat
MIT