Image PII Redaction with OCR: Scanned Documents and ID Cards
Research Source
Organizations digitize paper records by scanning, creating image files (PNG, JPEG, TIFF) and scanned PDFs. These contain PII visible to humans but invisible to text-based PII detection. Names, addresses, government IDs, and medical information in scanned documents pass through every text-based anonymization tool undetected. OCR (Optical Character Recognition) bridges this gap by extracting text from images for PII detection.
Executive Summary
Text-based PII tools cannot see scanned documents. Names, government IDs, and medical data in scanned PDFs and photographs pass through every text-only anonymization tool undetected.
cloak.business integrates Tesseract OCR for image-based PII detection across 37 languages. Bounding-box redaction applies black rectangles over PII regions while preserving the document layout. It supports PNG, JPEG, TIFF, BMP, WebP, and GIF images of up to 10 MB and 150 megapixels each.
The Problem: The Analog-Digital PII Gap
Healthcare organizations scan patient intake forms. Legal teams scan signed contracts. Government agencies digitize archived records. Insurance companies photograph damage reports that show license plates and addresses. All of these produce images containing PII that text-based tools cannot process. Even modern AI-powered PII detection works only on text: feeding it a JPEG returns nothing, regardless of how much PII the image contains.
Irreducible truth: PII detection that only works on text ignores an entire category of documents. As long as organizations use scanners, cameras, and fax machines, image-based PII detection is not optional — it is required for comprehensive coverage.
The Solution: How cloak.business Addresses This
Tesseract OCR Engine
cloak.business uses Tesseract OCR to extract text from images, with 95%+ accuracy on clean documents. It supports 37 languages across Latin, Cyrillic, CJK, Arabic, and Devanagari scripts. EXIF auto-orientation rotates each image according to its orientation metadata before extraction, so photos saved sideways or upside down are still read correctly.
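The extraction step can be sketched in Python. This is a minimal illustration, not cloak.business's implementation: it assumes the Pillow and pytesseract packages plus a local Tesseract install, and the helper names (ocr_words, index_words, boxes_for_span) are hypothetical. It shows how word-level OCR output can be joined into plain text for a text-based PII detector while keeping enough position data to map each detected character span back to pixel boxes.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def ocr_words(path: str, lang: str = "eng") -> Dict[str, list]:
    """Run Tesseract on an image after applying its EXIF orientation.

    Needs Pillow, pytesseract and a Tesseract binary; they are imported
    lazily so the pure-Python helpers below work without them.
    """
    from PIL import Image, ImageOps
    import pytesseract

    with Image.open(path) as img:
        upright = ImageOps.exif_transpose(img)  # honour the EXIF Orientation tag
        return pytesseract.image_to_data(
            upright, lang=lang, output_type=pytesseract.Output.DICT
        )


def index_words(data: Dict[str, list]) -> Tuple[str, List[Tuple[int, int, Box]]]:
    """Join OCR words into one string; record each word's char span and box."""
    parts: List[str] = []
    spans: List[Tuple[int, int, Box]] = []
    cursor = 0
    for i, word in enumerate(data["text"]):
        word = word.strip()
        if not word:
            continue  # Tesseract emits empty entries for block and line rows
        if parts:
            cursor += 1  # account for the joining space
        left, top = data["left"][i], data["top"][i]
        box = (left, top, left + data["width"][i], top + data["height"][i])
        spans.append((cursor, cursor + len(word), box))
        parts.append(word)
        cursor += len(word)
    return " ".join(parts), spans


def boxes_for_span(
    spans: List[Tuple[int, int, Box]], start: int, end: int
) -> List[Box]:
    """Return the boxes of every word overlapping characters [start, end)."""
    return [box for s, e, box in spans if s < end and e > start]
```

A detector that reports a name at characters 0 to 8 of the joined text would call boxes_for_span(spans, 0, 8) to obtain the rectangles to black out.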
Bounding-Box Redaction
Detected PII regions are redacted with black rectangles precisely positioned over the text. Adjacent boxes are automatically merged to prevent partial character visibility. The document layout, non-PII content, and formatting remain intact.
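The merging of adjacent boxes can be illustrated with a short Python sketch. The merge rule and the 4-pixel gap are assumptions chosen for illustration, not the product's documented algorithm, and the function names are hypothetical. Boxes that overlap or sit within the gap of each other are fused into one rectangle, so no sliver of a character remains visible between two redacted words. The redact helper assumes Pillow is installed.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _close(a: Box, b: Box, gap: int) -> bool:
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (
        a[0] - gap <= b[2] and b[0] - gap <= a[2]
        and a[1] - gap <= b[3] and b[1] - gap <= a[3]
    )


def merge_boxes(boxes: List[Box], gap: int = 4) -> List[Box]:
    """Fuse nearby boxes into their common bounding rectangle."""
    merged: List[Box] = []
    for box in boxes:
        current = box
        changed = True
        while changed:  # repeat: a grown box may now touch other boxes
            changed = False
            for other in merged:
                if _close(current, other, gap):
                    merged.remove(other)
                    current = (
                        min(current[0], other[0]), min(current[1], other[1]),
                        max(current[2], other[2]), max(current[3], other[3]),
                    )
                    changed = True
                    break
        merged.append(current)
    return merged


def redact(image, boxes: List[Box], gap: int = 4):
    """Paint merged boxes black on a Pillow image (modified in place)."""
    from PIL import ImageDraw  # lazy import: only needed when drawing

    draw = ImageDraw.Draw(image)
    for box in merge_boxes(boxes, gap):
        draw.rectangle(box, fill="black")
    return image
```

Because only the pixels inside the merged rectangles are overwritten, the rest of the page, including layout and non-PII text, is left unchanged.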
Supported Formats and Limits
Supported formats are PNG, JPEG/JPG, TIFF, BMP, WebP, and GIF. Each image may be up to 10 MB in size and 150 megapixels in resolution. Batch processing is available via the API and the MCP Server (analyze_image and redact_image tools).
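A client can check these limits locally before uploading. The sketch below is a hypothetical pre-flight check, not part of any cloak.business SDK. It assumes that "10MB" means 10 MiB, that "150MP" means 150,000,000 pixels, and that the .tif extension is accepted alongside .tiff; the service may interpret any of these differently.

```python
import os
from typing import List

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
MAX_PIXELS = 150_000_000      # assumption: "150MP" read as 150 million pixels
ALLOWED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp", ".webp", ".gif",
}


def check_limits(filename: str, size_bytes: int, width: int, height: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    if width * height > MAX_PIXELS:
        problems.append(f"resolution too high: {width * height} pixels")
    return problems
```

For example, check_limits("scan.png", 2_000_000, 2480, 3508) describes an A4 page scanned at 300 dpi and falls within all three limits.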
Integration Points
Image redaction is available through the web app (drag-and-drop), REST API (/api/presidio/image), MCP Server (2 image tools), desktop app, and Nextcloud app. The same 320+ entity types are detected in images as in text.
Compliance Mapping
This feature addresses GDPR Article 4(1) (personal data in any form — including images), HIPAA §164.514 (de-identification of scanned medical records), and archival/FOIA requirements where scanned government documents must be redacted before public release.
cloak.business's coverage of GDPR, HIPAA, PCI-DSS, ISO 27001, and SOC 2, combined with customer-selected hosting, gives organizations documented technical measures to reference in their own compliance documentation.
Product Specifications
| Specification | Value |
|---|---|
| Entity Types | 320+ |
| Detection | 3-layer hybrid: Presidio + NLP + Stance classification |
| Test Coverage | 100% (419/419 tests) |
| Languages | 48 |
| Anonymization Methods | Replace, Redact, Mask, Hash, Encrypt (AES-256-GCM), RSA-4096 Asymmetric, Keep |
| Platforms | Web App, REST API, SDKs (JavaScript, Python), Cloud Storage Add-ins, Nextcloud |
| Pricing | Enterprise (custom) |
| Hosting | Customer-selected |
| Compliance | GDPR, HIPAA, PCI-DSS, ISO 27001, SOC 2 |