Overview
A production ETL platform built for a financial institution that needed to digitize and extract structured data from a large, heterogeneous corpus of banking documents. The documents arrived in multiple formats — PDFs, Word documents, and scanned images — with wildly inconsistent internal structure.
The system processes tens of thousands of documents per day, extracts structured fields, validates against business rules, and delivers clean data into PostgreSQL for downstream systems.
Problem
The institution was manually reviewing thousands of documents per month to extract key financial data: loan amounts, interest rates, borrower information, guarantor details, collateral descriptions. This took 40+ person-hours per week and introduced transcription errors at a 3–4% rate.
The additional challenge: documents arrived from multiple sources with no standardized format. A loan agreement from one branch looked nothing like one from another. The same logical field might be labeled differently, appear in different positions, or be split across pages.
Constraints
- Accuracy: Financial data errors are not acceptable; 99%+ accuracy required
- Mixed formats: PDF (digital + scanned), DOCX, images as attachments
- No vendor lock-in: Existing infrastructure; no commercial OCR SaaS budget
- Audit trail: Every extraction decision must be logged with confidence scores
- Idempotency: Re-processing the same document must not create duplicate records
Architecture
Document Intake
│ - S3 upload trigger (Lambda notification → SQS)
│ - Celery consumer picks from queue
│
▼
Format Detection Layer
├── PDF digital → PyMuPDF text extraction
├── PDF scanned → Tesseract OCR pipeline
└── DOCX → python-docx extraction
│
▼
Text Normalization
│ - Unicode normalization
│ - Table detection and flattening
│ - Section header detection
│
▼
Structured Extraction (Spark job)
│ - Field-level extraction with regex + heuristics
│ - Confidence scoring per field
│ - Multi-pass: tries multiple patterns
│
▼
Validation Layer
│ - Business rule validation (e.g., loan amount must be positive)
│ - Cross-field consistency checks (e.g., dates must be sequential)
│ - Flag low-confidence extractions for human review
│
▼
PostgreSQL Storage
│ - Immutable raw record (original extracted text)
│ - Structured record (validated fields)
│ - Audit log (confidence scores, extraction source)
│
▼
Human Review Queue (if confidence < threshold)
Technology Decisions
| Decision | Choice | Why |
|---|---|---|
| Task queue | Celery + Redis | Existing stack; reliable retry semantics; good monitoring |
| Distributed processing | Apache Spark | Bulk re-processing of historical documents; Python UDFs |
| PDF text extraction | PyMuPDF | Best text extraction fidelity for digital PDFs |
| OCR | Tesseract | Open source; runs on our infra; sufficient accuracy after preprocessing |
| Image preprocessing | OpenCV | Deskew, denoise, binarize before OCR — critical for accuracy |
| Storage | PostgreSQL | Relational integrity for financial data; transactional guarantees |
| Raw storage | AWS S3 | Original documents preserved permanently for audit |
Implementation
Celery Task Chain
Each document goes through a chain of tasks, where failure at any stage triggers retries with exponential backoff:
from celery import chain
def process_document(document_id: str) -> None:
"""
Orchestrate document processing as a Celery task chain.
Each step retries independently on failure.
"""
workflow = chain(
detect_format.s(document_id),
extract_text.s(),
normalize_text.s(),
extract_fields.s(),
validate_fields.s(),
persist_results.s(),
)
workflow.apply_async()
OCR Preprocessing Pipeline
Raw scanned documents need significant preprocessing before Tesseract can achieve acceptable accuracy:
import cv2
import numpy as np
def preprocess_for_ocr(image_path: str) -> np.ndarray:
"""
Preprocess a scanned document image for OCR.
Deskew, denoise, and binarize to maximize Tesseract accuracy.
"""
img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
# Deskew using Hough line detection
img = deskew(img)
# Denoise
img = cv2.fastNlMeansDenoising(img, h=10)
# Adaptive thresholding (better than global for uneven lighting)
img = cv2.adaptiveThreshold(
img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY, 11, 2
)
return img
Field Extraction with Confidence Scoring
Every extracted field is scored by confidence, driving the human review decision:
def extract_loan_amount(text: str) -> ExtractionResult:
"""
Extract loan amount from document text using multiple patterns.
Returns value and confidence score.
"""
patterns = [
(r"Loan Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.95),
(r"Principal Amount[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.85),
(r"Amount of Loan[:\s]+₹?\s*([\d,]+(?:\.\d{2})?)", 0.80),
(r"₹\s*([\d,]{5,}(?:\.\d{2})?)", 0.60), # fallback: any large rupee amount
]
for pattern, confidence in patterns:
if match := re.search(pattern, text, re.IGNORECASE):
return ExtractionResult(
value=parse_amount(match.group(1)),
confidence=confidence,
pattern_used=pattern,
)
return ExtractionResult(value=None, confidence=0.0, pattern_used=None)
Failures & Lessons
Failure 1: No preprocessing on OCR inputs Early versions sent raw scans directly to Tesseract. Accuracy was 74% — unacceptable. Adding the OpenCV preprocessing pipeline pushed it to 96.8%. Deskewing alone accounted for a 12% accuracy improvement.
Failure 2: Synchronous Spark processing Initial architecture tried to use Spark for every document. Spark startup overhead (~30s) made this impractical for single documents. Resolution: use Spark only for bulk historical processing; Celery + Python for real-time single document flow.
Failure 3: Confidence threshold calibration Initial threshold for human review was too low — 40% of documents were routed to human review. After calibrating per field type (some fields are reliably extracted, others aren't), human review rate dropped to 8% without accuracy loss.
Failure 4: Duplicate processing System initially allowed the same document to be processed multiple times (e.g., on upload + manual trigger). Added an idempotency layer: SHA-256 hash of document content stored; duplicate check before processing begins.
Metrics
| Metric | Value |
|---|---|
| Documents processed daily | 10,000+ |
| Extraction accuracy (validated fields) | 97.3% |
| Average processing time per document | < 8 seconds |
| Human review rate | 8.2% |
| Error recovery rate | 99.1% |
| Manual review hours saved per week | 38 hours |
| Error reduction vs. manual process | 92% |
Future Improvements
- LLM extraction layer — Use GPT-4o with structured outputs for ambiguous documents; fall back to pattern matching for high-confidence cases (cost control)
- Document classification model — Fine-tuned BERT to classify document type before extraction, routing to specialized extractors
- Table extraction — Current table detection is heuristic; a dedicated table-understanding model would significantly improve accuracy on tabular financial data
- Streaming ingestion — Move from S3 batch upload to streaming document ingestion for sub-second processing triggers
Key Takeaways
- Image preprocessing quality is the single biggest lever for OCR accuracy — more impactful than OCR model choice
- Per-field confidence scoring + selective human review is essential; don't binary classify "review / don't review" the whole document
- Separate the real-time path (Celery) from the bulk path (Spark) — they have fundamentally different latency requirements
- Idempotency must be designed from the start in ETL systems; retrofitting it is painful