Document Field Extractor

An asynchronous OCR pipeline that extracts structured fields — vendor, invoice number, date, total, and email — from invoices and receipts. Every extracted field carries a confidence score, and low-confidence fields are automatically flagged for human review.

Built with FastAPI · Celery · Redis · Tesseract · OpenCV · Docker.

Architecture

OCR on a real document can take several seconds. Blocking an HTTP request that long is not acceptable, so all heavy work runs off the request path:

        upload                      queue                     process
  client ──────▶ FastAPI ───────▶ Redis ───────▶ Celery worker
                    ▲                                   │
                    └─────── poll /result/{id} ◀────────┘
                                (result stored in Redis)

| Service | Role |
| --- | --- |
| FastAPI | Accepts uploads, validates files, returns `job_id` immediately (HTTP 202) |
| Redis | Message broker + result store |
| Celery worker | Preprocesses image, runs Tesseract OCR, extracts fields |

The client polls GET /result/{job_id} until the job is done. This decoupling means the API stays responsive under load regardless of how long OCR takes.
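A minimal sketch of the client side of this polling loop, using only the standard library. The helper names (`fetch_result`, `wait_for_result`), the one-second interval, and the 60-second timeout are illustrative assumptions, not part of the project. Only the `queued` and `done` statuses appear in this README, so the sketch does not try to detect failed jobs.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8000"  # local address used in this README


def fetch_result(job_id: str) -> Dict[str, Any]:
    """GET /result/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/result/{job_id}") as resp:
        return json.load(resp)


def wait_for_result(
    job_id: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_result,
    interval: float = 1.0,   # assumed polling interval
    timeout: float = 60.0,   # assumed upper bound on waiting
) -> Dict[str, Any]:
    """Poll until the job reports status "done" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch(job_id)
        if payload.get("status") == "done":
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        time.sleep(interval)
```

Passing `fetch` as a parameter keeps the loop testable without a running server; in normal use `wait_for_result("abc-123")` is enough.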

Image Preprocessing Pipeline

Raw images from phones and scanners are often skewed, noisy, and low-contrast. Before Tesseract sees a single pixel, every image passes through a 5-step cleaning pipeline:

| Step | Implementation | Why |
| --- | --- | --- |
| Grayscale | `cv2.cvtColor` | Tesseract operates on intensity only |
| Scale up | `cv2.resize` (if height < 1000 px) | Accuracy drops sharply below ~150 dpi |
| Denoise | `cv2.fastNlMeansDenoising` | Removes salt-and-pepper noise from cheap scanners |
| Deskew | `cv2.minAreaRect` + rotate | Corrects tilt from uneven placement on a scanner bed |
| Adaptive threshold | `cv2.adaptiveThreshold` | Handles shadows and uneven lighting from phone photos |
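To illustrate why the last step thresholds locally rather than globally, here is a dependency-free sketch of mean adaptive thresholding on a tiny one-row "image". It is not the project's code, which calls `cv2.adaptiveThreshold`. The function names, the 3-pixel window, and the offset `c = 10` are assumptions chosen for the example, and the border handling (clipping the window) is simpler than OpenCV's.

```python
from typing import List

Image = List[List[int]]  # rows of 0-255 grayscale values


def global_threshold(img: Image, t: int) -> Image:
    """Binarize with a single cutoff for the whole image."""
    return [[255 if px > t else 0 for px in row] for row in img]


def adaptive_mean_threshold(img: Image, block: int = 3, c: float = 10) -> Image:
    """Binarize each pixel against the mean of its block x block neighbourhood minus c."""
    h, w = len(img), len(img[0])
    r = block // 2
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            # Clip the window at the image border (OpenCV replicates edges instead).
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [img[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(255 if img[y][x] > local_mean - c else 0)
        out.append(row)
    return out


# Background fades from 200 to 80 (a shadow); "ink" sits at indices 1 and 7.
shaded = [[200, 95, 170, 155, 140, 125, 110, 5, 80]]

print(global_threshold(shaded, 100))     # last background pixel is lost to the shadow
print(adaptive_mean_threshold(shaded))   # only the two ink pixels turn black
```

No single global cutoff works here: the ink pixel at 95 is brighter than the shaded background pixel at 80, so separating them requires a threshold that follows the local brightness.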

Use POST /preview to see the preprocessed image before running the full pipeline — useful for diagnosing poor extraction results.

Field Extraction

Turning raw OCR text into reliable structured data across inconsistent document layouts is the core challenge. This project addresses it with a rule-plus-regex approach in which:

- Every extractor returns a (value, confidence) pair, so there are no silent guesses.
- Fields whose confidence falls below the review threshold are added to a needs_review list.
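The two rules above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `app/extraction.py`: the regex patterns, the function names, and the `REVIEW_THRESHOLD` value of 0.6 are hypothetical. The confidence values and the first-non-empty-line vendor heuristic mirror what this README describes.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Extraction = Tuple[Optional[str], float]  # (value, confidence)

REVIEW_THRESHOLD = 0.6  # assumed cutoff; the README does not state the real value


def extract_vendor(text: str) -> Extraction:
    # Heuristic: first non-empty line, deliberately low confidence.
    for line in text.splitlines():
        if line.strip():
            return line.strip(), 0.50
    return None, 0.0


def extract_total(text: str) -> Extraction:
    # Hypothetical pattern: the word "total", then an amount with two decimals.
    match = re.search(r"\btotal\b[^\d\n]*(\d[\d,]*\.\d{2})", text, re.IGNORECASE)
    return (match.group(1), 0.95) if match else (None, 0.0)


def extract_email(text: str) -> Extraction:
    # Hypothetical pattern: a simple address matcher, not a full RFC 5322 parser.
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    return (match.group(0), 0.95) if match else (None, 0.0)


EXTRACTORS: Dict[str, Callable[[str], Extraction]] = {
    "vendor": extract_vendor,
    "total": extract_total,
    "email": extract_email,
}


def extract_fields(text: str) -> dict:
    """Run every extractor and flag fields below the review threshold."""
    fields = {}
    for name, extractor in EXTRACTORS.items():
        value, confidence = extractor(text)
        fields[name] = {
            "value": value,
            "confidence": confidence,
            "needs_review": confidence < REVIEW_THRESHOLD,
        }
    needs_review = [name for name, f in fields.items() if f["needs_review"]]
    return {"fields": fields, "needs_review": needs_review}
```

A missing field comes back as `(None, 0.0)`, so it lands in `needs_review` instead of being silently dropped.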

Example response:

{
  "job_id": "abc-123",
  "status": "done",
  "result": {
    "filename": "invoice.png",
    "fields": {
      "fields": {
        "vendor":         { "value": "ACME SUPPLIES LTD", "confidence": 0.50, "needs_review": true  },
        "invoice_number": { "value": "INV-2024-0042",     "confidence": 0.85, "needs_review": false },
        "date":           { "value": "2024-01-31",        "confidence": 0.90, "needs_review": false },
        "total":          { "value": "495.00",            "confidence": 0.95, "needs_review": false },
        "email":          { "value": "billing@acme.com",  "confidence": 0.95, "needs_review": false }
      },
      "needs_review": ["vendor"],
      "summary": "4 of 5 fields extracted confidently."
    },
    "raw_text_preview": "ACME SUPPLIES LTD\n..."
  }
}
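A short sketch of how a client might pull the flagged fields out of a response shaped like the one above. The helper name `fields_needing_review` is hypothetical. Note that, in this example response, the per-field entries sit one level deeper than the `needs_review` list, under `result.fields.fields`.

```python
from typing import Any, Dict, List


def fields_needing_review(payload: Dict[str, Any]) -> List[str]:
    """Describe each field that the response flags for human review."""
    extraction = payload["result"]["fields"]  # holds "fields", "needs_review", "summary"
    per_field = extraction["fields"]          # name -> {value, confidence, needs_review}
    return [
        f"{name}: {per_field[name]['value']!r} "
        f"(confidence {per_field[name]['confidence']:.2f})"
        for name in extraction["needs_review"]
    ]
```

For the example response this returns a single entry describing the vendor field.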

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/` | Web UI |
| `GET` | `/health` | Health check |
| `POST` | `/extract` | Upload a document, returns `job_id` immediately |
| `GET` | `/result/{job_id}` | Poll for the extraction result |
| `POST` | `/preview` | Return the preprocessed image as PNG |

Supported formats: PDF, PNG, JPEG

Rate limits: /extract — 10 requests/min per IP · /preview — 20 requests/min per IP

Exceeding the limit returns 429 Too Many Requests.
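A client that submits documents in bulk may want to back off when it sees a 429. The sketch below is one possible approach, not something the project ships: `send_with_backoff` and its parameters are hypothetical, and `send` stands for any callable that performs one request and returns `(status_code, body)`. The 6-second base delay is derived from the 10 requests/min limit on /extract; this README does not say how the limiter's window is measured, so treat the timing as a guess.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (HTTP status code, body)


def send_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 6.0,  # 60 s / 10 requests, per the /extract limit
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry with exponential backoff while it returns 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 6 s, 12 s, 24 s, ...
    return status, body  # still rate-limited after the final attempt
```

The caller decides what to do with a final 429; the helper returns it rather than raising so that the response body stays available.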

Interactive docs available at http://localhost:8000/docs when running locally.

Getting Started

Prerequisites

Docker Desktop

Run

git clone https://github.com/YOUR_USERNAME/ocr-extractor.git
cd ocr-extractor
docker compose up --build

All three services (API, worker, Redis) start automatically.

Usage

Open http://localhost:8000 in your browser to use the web UI — drag and drop a document, click Extract Fields, and results appear as fields with confidence bars and review flags.

Or use the API directly:

# Submit a document
curl -X POST http://localhost:8000/extract -F "file=@invoice.png"
# -> {"job_id": "abc-123", "status": "queued"}

# Poll for the result
curl http://localhost:8000/result/abc-123

# Preview the preprocessed image
curl -X POST http://localhost:8000/preview -F "file=@invoice.png" --output preview.png
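
The same upload can be scripted from Python. The sketch below uses only the standard library and builds the multipart body by hand; the function names are hypothetical, and the form field name `file` is taken from the curl commands above. With the third-party `requests` package installed, a one-line `requests.post(url, files=...)` call would do the same job.

```python
import json
import mimetypes
import os
import urllib.request
import uuid
from typing import Any, Dict, Tuple


def build_multipart(filename: str, data: bytes, field: str = "file") -> Tuple[bytes, str]:
    """Return (body, Content-Type header) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # Assumes the filename contains no double quotes or line breaks.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def submit_document(path: str, url: str = "http://localhost:8000/extract") -> Dict[str, Any]:
    """POST a document to /extract and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, header = build_multipart(os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `submit_document("invoice.png")` against a running stack should return the same JSON that the first curl command prints.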

Stop

docker compose down

Running Tests

Tests cover extraction logic and the image preprocessing pipeline independently — no OCR engine, Redis, or Docker required.

pip install -r requirements.txt
pytest tests/ -v

22 passed in 1.14s

Project Structure

.
├── app/
│   ├── main.py             # FastAPI routes + rate limiting
│   ├── worker.py           # Celery task: OCR + extraction
│   ├── extraction.py       # Field extractors with confidence scores
│   ├── image_processing.py # 5-step preprocessing pipeline
│   ├── celery_app.py       # Celery + Redis configuration
│   └── static/
│       └── index.html      # Single-page web UI (drag & drop, live polling)
├── tests/
│   ├── test_extraction.py
│   └── test_image_processing.py
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Limitations & What's Next

- Rule-based extraction is fast and transparent but will miss unusual document layouts. A layout-aware model (e.g. LayoutLM) is the natural next step for fields where regex plateaus.
- No persistence beyond Redis result expiry (1 hour). A production deployment would store results in a database.
- Vendor detection is heuristic (first non-empty line) and intentionally carries low confidence so it always gets flagged for review.
