BackendProduction

Document Intelligence Pipeline

Production ETL platform processing thousands of banking documents — PDFs, Word files, image attachments — using asynchronous processing, Celery, and Apache Spark.

6 min readETLDocument ProcessingCeleryApache SparkBankingPDFOCR
10K+
Documents/Day
97.3%
Processing Accuracy
< 8 sec
Avg Processing Time
99.1%
Error Recovery Rate

Overview

A production ETL platform built for a financial institution that needed to digitize and extract structured data from a large, heterogeneous corpus of banking documents. The documents arrived in multiple formats — PDFs, Word documents, and scanned images — with wildly inconsistent internal structure.

The system processes tens of thousands of documents per day, extracts structured fields, validates against business rules, and delivers clean data into PostgreSQL for downstream systems.

Problem

The institution was manually reviewing thousands of documents per month to extract key financial data: loan amounts, interest rates, borrower information, guarantor details, collateral descriptions. This took 40+ person-hours per week and introduced transcription errors at a 3–4% rate.

The additional challenge: documents arrived from multiple sources with no standardized format. A loan agreement from one branch looked nothing like one from another. The same logical field might be labeled differently, appear in different positions, or be split across pages.

Constraints

  • Accuracy: Financial data errors are not acceptable; 99%+ accuracy required
  • Mixed formats: PDF (digital + scanned), DOCX, images as attachments
  • No vendor lock-in: Existing infrastructure; no commercial OCR SaaS budget
  • Audit trail: Every extraction decision must be logged with confidence scores
  • Idempotency: Re-processing the same document must not create duplicate records

Architecture

Document Intake
    │  - S3 upload trigger (Lambda notification → SQS)
    │  - Celery consumer picks from queue
    │
    ▼
Format Detection Layer
    ├── PDF digital → PyMuPDF text extraction
    ├── PDF scanned → Tesseract OCR pipeline
    └── DOCX → python-docx extraction
    │
    ▼
Text Normalization
    │  - Unicode normalization
    │  - Table detection and flattening
    │  - Section header detection
    │
    ▼
Structured Extraction (Spark job)
    │  - Field-level extraction with regex + heuristics
    │  - Confidence scoring per field
    │  - Multi-pass: tries multiple patterns
    │
    ▼
Validation Layer
    │  - Business rule validation (e.g., loan amount must be positive)
    │  - Cross-field consistency checks (e.g., dates must be sequential)
    │  - Flag low-confidence extractions for human review
    │
    ▼
PostgreSQL Storage
    │  - Immutable raw record (original extracted text)
    │  - Structured record (validated fields)
    │  - Audit log (confidence scores, extraction source)
    │
    ▼
Human Review Queue (if confidence < threshold)

Technology Decisions

DecisionChoiceWhy
Task queueCelery + RedisExisting stack; reliable retry semantics; good monitoring
Distributed processingApache SparkBulk re-processing of historical documents; Python UDFs
PDF text extractionPyMuPDFBest text extraction fidelity for digital PDFs
OCRTesseractOpen source; runs on our infra; sufficient accuracy after preprocessing
Image preprocessingOpenCVDeskew, denoise, binarize before OCR — critical for accuracy
StoragePostgreSQLRelational integrity for financial data; transactional guarantees
Raw storageAWS S3Original documents preserved permanently for audit

Implementation

Celery Task Chain

Each document goes through a chain of tasks, where failure at any stage triggers retries with exponential backoff:

from celery import chain

def process_document(document_id: str) -> None:
    """
    Orchestrate document processing as a Celery task chain.
    Each step retries independently on failure.
    """
    workflow = chain(
        detect_format.s(document_id),
        extract_text.s(),
        normalize_text.s(),
        extract_fields.s(),
        validate_fields.s(),
        persist_results.s(),
    )
    workflow.apply_async()

OCR Preprocessing Pipeline

Raw scanned documents need significant preprocessing before Tesseract can achieve acceptable accuracy:

import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
    """
    Preprocess a scanned document image for OCR.
    Deskew, denoise, and binarize to maximize Tesseract accuracy.
    """
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    
    # Deskew using Hough line detection
    img = deskew(img)
    
    # Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)
    
    # Adaptive thresholding (better than global for uneven lighting)
    img = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    
    return img

Field Extraction with Confidence Scoring

Every extracted field is scored by confidence, driving the human review decision:

def extract_loan_amount(text: str) -> ExtractionResult:
    """
    Extract loan amount from document text using multiple patterns.
    Returns value and confidence score.
    """
    patterns = [
        (r"Loan Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.95),
        (r"Principal Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.85),
        (r"Amount of Loan[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.80),
        (r"₹\s*([\d,]{5,}(?:\.\d{2})?)", 0.60),  # fallback: any large rupee amount
    ]
    
    for pattern, confidence in patterns:
        if match := re.search(pattern, text, re.IGNORECASE):
            return ExtractionResult(
                value=parse_amount(match.group(1)),
                confidence=confidence,
                pattern_used=pattern,
            )
    
    return ExtractionResult(value=None, confidence=0.0, pattern_used=None)

Failures & Lessons

Failure 1: No preprocessing on OCR inputs Early versions sent raw scans directly to Tesseract. Accuracy was 74% — unacceptable. Adding the OpenCV preprocessing pipeline pushed it to 96.8%. Deskewing alone accounted for a 12% accuracy improvement.

Failure 2: Synchronous Spark processing Initial architecture tried to use Spark for every document. Spark startup overhead (~30s) made this impractical for single documents. Resolution: use Spark only for bulk historical processing; Celery + Python for real-time single document flow.

Failure 3: Confidence threshold calibration Initial threshold for human review was too low — 40% of documents were routed to human review. After calibrating per field type (some fields are reliably extracted, others aren't), human review rate dropped to 8% without accuracy loss.

Failure 4: Duplicate processing System initially allowed the same document to be processed multiple times (e.g., on upload + manual trigger). Added an idempotency layer: SHA-256 hash of document content stored; duplicate check before processing begins.

Metrics

MetricValue
Documents processed daily10,000+
Extraction accuracy (validated fields)97.3%
Average processing time per document< 8 seconds
Human review rate8.2%
Error recovery rate99.1%
Manual review hours saved per week38 hours
Error reduction vs. manual process92%

Future Improvements

  1. LLM extraction layer — Use GPT-4o with structured outputs for ambiguous documents; fall back to pattern matching for high-confidence cases (cost control)
  2. Document classification model — Fine-tuned BERT to classify document type before extraction, routing to specialized extractors
  3. Table extraction — Current table detection is heuristic; a dedicated table-understanding model would significantly improve accuracy on tabular financial data
  4. Streaming ingestion — Move from S3 batch upload to streaming document ingestion for sub-second processing triggers

Key Takeaways

  • Image preprocessing quality is the single biggest lever for OCR accuracy — more impactful than OCR model choice
  • Per-field confidence scoring + selective human review is essential; don't binary classify "review / don't review" the whole document
  • Separate the real-time path (Celery) from the bulk path (Spark) — they have fundamentally different latency requirements
  • Idempotency must be designed from the start in ETL systems; retrofitting it is painful