Overview

A production ETL platform built for a financial institution that needed to digitize and extract structured data from a large, heterogeneous corpus of banking documents. The documents arrived in multiple formats — PDFs, Word documents, and scanned images — with wildly inconsistent internal structure.

The system processes tens of thousands of documents per day, extracts structured fields, validates against business rules, and delivers clean data into PostgreSQL for downstream systems.

Problem

The institution was manually reviewing thousands of documents per month to extract key financial data: loan amounts, interest rates, borrower information, guarantor details, collateral descriptions. This took 40+ person-hours per week and introduced transcription errors at a 3–4% rate.

The additional challenge: documents arrived from multiple sources with no standardized format. A loan agreement from one branch looked nothing like one from another. The same logical field might be labeled differently, appear in different positions, or be split across pages.

Constraints

Accuracy: Financial data errors are not acceptable; 99%+ accuracy required
Mixed formats: PDF (digital + scanned), DOCX, images as attachments
No vendor lock-in: Existing infrastructure; no commercial OCR SaaS budget
Audit trail: Every extraction decision must be logged with confidence scores
Idempotency: Re-processing the same document must not create duplicate records

Architecture

Document Intake
    │  - S3 upload trigger (Lambda notification → SQS)
    │  - Celery consumer picks from queue
    │
    ▼
Format Detection Layer
    ├── PDF digital → PyMuPDF text extraction
    ├── PDF scanned → Tesseract OCR pipeline
    └── DOCX → python-docx extraction
    │
    ▼
Text Normalization
    │  - Unicode normalization
    │  - Table detection and flattening
    │  - Section header detection
    │
    ▼
Structured Extraction (Spark job)
    │  - Field-level extraction with regex + heuristics
    │  - Confidence scoring per field
    │  - Multi-pass: tries multiple patterns
    │
    ▼
Validation Layer
    │  - Business rule validation (e.g., loan amount must be positive)
    │  - Cross-field consistency checks (e.g., dates must be sequential)
    │  - Flag low-confidence extractions for human review
    │
    ▼
PostgreSQL Storage
    │  - Immutable raw record (original extracted text)
    │  - Structured record (validated fields)
    │  - Audit log (confidence scores, extraction source)
    │
    ▼
Human Review Queue (if confidence < threshold)

Technology Decisions

Decision	Choice	Why
Task queue	Celery + Redis	Existing stack; reliable retry semantics; good monitoring
Distributed processing	Apache Spark	Bulk re-processing of historical documents; Python UDFs
PDF text extraction	PyMuPDF	Best text extraction fidelity for digital PDFs
OCR	Tesseract	Open source; runs on our infra; sufficient accuracy after preprocessing
Image preprocessing	OpenCV	Deskew, denoise, binarize before OCR — critical for accuracy
Storage	PostgreSQL	Relational integrity for financial data; transactional guarantees
Raw storage	AWS S3	Original documents preserved permanently for audit

Implementation

Celery Task Chain

Each document goes through a chain of tasks, where failure at any stage triggers retries with exponential backoff:

from celery import chain

def process_document(document_id: str) -> None:
    """
    Orchestrate document processing as a Celery task chain.
    Each step retries independently on failure.
    """
    workflow = chain(
        detect_format.s(document_id),
        extract_text.s(),
        normalize_text.s(),
        extract_fields.s(),
        validate_fields.s(),
        persist_results.s(),
    )
    workflow.apply_async()

OCR Preprocessing Pipeline

Raw scanned documents need significant preprocessing before Tesseract can achieve acceptable accuracy:

import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
    """
    Preprocess a scanned document image for OCR.
    Deskew, denoise, and binarize to maximize Tesseract accuracy.
    """
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    
    # Deskew using Hough line detection
    img = deskew(img)
    
    # Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)
    
    # Adaptive thresholding (better than global for uneven lighting)
    img = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    
    return img

Field Extraction with Confidence Scoring

Every extracted field is scored by confidence, driving the human review decision:

def extract_loan_amount(text: str) -> ExtractionResult:
    """
    Extract loan amount from document text using multiple patterns.
    Returns value and confidence score.
    """
    patterns = [
        (r"Loan Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.95),
        (r"Principal Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.85),
        (r"Amount of Loan[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.80),
        (r"₹\s*([\d,]{5,}(?:\.\d{2})?)", 0.60),  # fallback: any large rupee amount
    ]
    
    for pattern, confidence in patterns:
        if match := re.search(pattern, text, re.IGNORECASE):
            return ExtractionResult(
                value=parse_amount(match.group(1)),
                confidence=confidence,
                pattern_used=pattern,
            )
    
    return ExtractionResult(value=None, confidence=0.0, pattern_used=None)

Failures & Lessons

Failure 1: No preprocessing on OCR inputs Early versions sent raw scans directly to Tesseract. Accuracy was 74% — unacceptable. Adding the OpenCV preprocessing pipeline pushed it to 96.8%. Deskewing alone accounted for a 12% accuracy improvement.

Failure 2: Synchronous Spark processing Initial architecture tried to use Spark for every document. Spark startup overhead (~30s) made this impractical for single documents. Resolution: use Spark only for bulk historical processing; Celery + Python for real-time single document flow.

Failure 3: Confidence threshold calibration Initial threshold for human review was too low — 40% of documents were routed to human review. After calibrating per field type (some fields are reliably extracted, others aren't), human review rate dropped to 8% without accuracy loss.

Failure 4: Duplicate processing System initially allowed the same document to be processed multiple times (e.g., on upload + manual trigger). Added an idempotency layer: SHA-256 hash of document content stored; duplicate check before processing begins.

Metrics

Metric	Value
Documents processed daily	10,000+
Extraction accuracy (validated fields)	97.3%
Average processing time per document	< 8 seconds
Human review rate	8.2%
Error recovery rate	99.1%
Manual review hours saved per week	38 hours
Error reduction vs. manual process	92%

Future Improvements

LLM extraction layer — Use GPT-4o with structured outputs for ambiguous documents; fall back to pattern matching for high-confidence cases (cost control)
Document classification model — Fine-tuned BERT to classify document type before extraction, routing to specialized extractors
Table extraction — Current table detection is heuristic; a dedicated table-understanding model would significantly improve accuracy on tabular financial data
Streaming ingestion — Move from S3 batch upload to streaming document ingestion for sub-second processing triggers

Key Takeaways

Image preprocessing quality is the single biggest lever for OCR accuracy — more impactful than OCR model choice
Per-field confidence scoring + selective human review is essential; don't binary classify "review / don't review" the whole document
Separate the real-time path (Celery) from the bulk path (Spark) — they have fundamentally different latency requirements
Idempotency must be designed from the start in ETL systems; retrofitting it is painful