Pain Point Case Study NP-25

Image PII Redaction with OCR: Scanned Documents and ID Cards

anonym.community · 2026-03-14

Research Source

Scanned Documents Bypass Text-Based PII Detection Entirely
anonym.community March 2026 crawl

Organizations digitize paper records by scanning, creating image files (PNG, JPEG, TIFF) and scanned PDFs. These contain PII visible to humans but invisible to text-based PII detection. Names, addresses, government IDs, and medical information in scanned documents pass through every text-based anonymization tool undetected. OCR (Optical Character Recognition) bridges this gap by extracting text from images for PII detection.

Executive Summary

Text-based PII tools cannot see scanned documents. Names, government IDs, and medical data in scanned PDFs and photographs pass through every text-only anonymization tool undetected.

cloak.business integrates Tesseract OCR for image-based PII detection across 37 languages. Bounding-box redaction draws black rectangles over PII regions while preserving the document layout. It supports PNG, JPEG, TIFF, BMP, WebP, and GIF images up to 10 MB in file size and 150 megapixels in resolution.

The Problem: The Analog-Digital PII Gap

Healthcare organizations scan patient intake forms. Legal teams scan signed contracts. Government agencies digitize archived records. Insurance companies photograph damage reports that show license plates and addresses. All of these produce images containing PII that text-based tools cannot process. Even modern AI-powered PII detection works only on text: feeding it a JPEG returns nothing, regardless of how much PII the image contains.

Irreducible truth: PII detection that only works on text ignores an entire category of documents. As long as organizations use scanners, cameras, and fax machines, image-based PII detection is not optional — it is required for comprehensive coverage.

The Solution: How cloak.business Addresses This

Tesseract OCR Engine

cloak.business uses Tesseract OCR to extract text from images with 95%+ accuracy on clean documents. It supports 37 languages across Latin, Cyrillic, CJK, Arabic, and Devanagari scripts. EXIF auto-orientation rotates each image according to its orientation metadata before extraction, so photos captured sideways or upside down are still read correctly.
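How OCR output feeds PII detection can be sketched in a few lines. The example below is a minimal illustration, not the cloak.business implementation: it assumes the OCR step yields one record per word with pixel coordinates (similar in shape to the left/top/width/height/text columns of Tesseract's TSV output), and it uses a toy email regex as a stand-in for the real detection layers. The names Word, EMAIL, and find_pii_words are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its pixel position (assumed shape, see lead-in)."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Toy detector: a simple email pattern standing in for the real PII engine.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_pii_words(words, pattern=EMAIL):
    """Return the OCR words whose characters overlap a pattern match."""
    # Rebuild the line as plain text and remember each word's character span.
    spans = []
    pos = 0
    for w in words:
        spans.append((pos, pos + len(w.text), w))
        pos += len(w.text) + 1  # +1 for the joining space
    text = " ".join(w.text for w in words)

    hits = []
    for match in pattern.finditer(text):
        for start, end, w in spans:
            overlaps = start < match.end() and match.start() < end
            if overlaps and w not in hits:
                hits.append(w)
    return hits


line = [
    Word("Contact:", 10, 10, 80, 20),
    Word("jane.doe@example.org", 100, 10, 200, 20),
    Word("today", 310, 10, 50, 20),
]
pii = find_pii_words(line)
```

The point of the character-span bookkeeping is that a text-level match can be traced back to pixel coordinates, which is what a later redaction step needs.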

Bounding-Box Redaction

Detected PII regions are redacted with black rectangles precisely positioned over the text. Adjacent boxes are automatically merged to prevent partial character visibility. The document layout, non-PII content, and formatting remain intact.
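The merging of adjacent boxes described above can be illustrated with a small sketch. This is one plausible approach under stated assumptions, not the product's actual algorithm: boxes are (left, top, right, bottom) tuples in pixels, two boxes are merged when they overlap vertically and are separated horizontally by at most a hypothetical gap threshold, and the pass is greedy (a single left-to-right sweep).

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _near(a: Box, b: Box, gap: int) -> bool:
    """True if a and b share a text line and are at most `gap` px apart."""
    vertical_overlap = a[1] < b[3] and b[1] < a[3]
    horizontal_gap = max(a[0], b[0]) - min(a[2], b[2])  # negative if overlapping
    return vertical_overlap and horizontal_gap <= gap


def _union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_adjacent(boxes: List[Box], gap: int = 5) -> List[Box]:
    """Greedily merge neighbouring boxes, sweeping from left to right."""
    merged: List[Box] = []
    for box in sorted(boxes):  # sorts by left edge, then top edge
        for i, other in enumerate(merged):
            if _near(other, box, gap):
                merged[i] = _union(other, box)
                break
        else:
            merged.append(box)
    return merged


# Two words on the same line (5 px apart) plus one word on a lower line.
result = merge_adjacent([(55, 10, 100, 30), (10, 10, 50, 30), (20, 50, 60, 70)])
```

Merging closes the small gaps between neighbouring word boxes, so a single rectangle covers a multi-word name or address and no slivers of characters remain visible between boxes.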

Supported Formats and Limits

Supported formats are PNG, JPEG/JPG, TIFF, BMP, WebP, and GIF. Each image may be at most 10 MB in file size and 150 megapixels in resolution. Batch processing is available via the API and the MCP Server (analyze_image and redact_image tools).
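A client can check these limits locally before uploading. The sketch below is illustrative only and makes several assumptions that the source does not confirm: that "10MB" means 10 MiB, that "150MP" means 150 million pixels (width times height), and that formats are recognized by file extension (including the .tif and .jpeg spellings). The function name check_image is hypothetical.

```python
import os

# Extensions for the formats listed above; .tif and .jpeg are assumed aliases.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif"}
MAX_BYTES = 10 * 1024 * 1024   # assumption: "10MB" interpreted as 10 MiB
MAX_PIXELS = 150_000_000       # assumption: "150MP" = 150 million pixels


def check_image(filename, size_bytes, width, height):
    """Return a list of limit violations; an empty list means the image passes."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    if width * height > MAX_PIXELS:
        problems.append(f"resolution too high: {width * height} pixels")
    return problems


ok = check_image("intake_form.tiff", 2_000_000, 4000, 3000)
bad = check_image("archive.pdf", 11 * 1024 * 1024, 20000, 10000)
```

Returning every violation at once, rather than failing on the first, lets a batch uploader report all problems with a file in a single pass.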

Integration Points

Image redaction is available through the web app (drag-and-drop), REST API (/api/presidio/image), MCP Server (2 image tools), desktop app, and Nextcloud app. The same 320+ entity types are detected in images as in text.

Compliance Mapping

This feature addresses GDPR Article 4(1) (personal data in any form — including images), HIPAA §164.514 (de-identification of scanned medical records), and archival/FOIA requirements where scanned government documents must be redacted before public release.

cloak.business's compliance coverage (GDPR, HIPAA, PCI-DSS, ISO 27001, and SOC 2), combined with customer-selected hosting, provides documented technical measures that organizations can reference in their own compliance documentation.

Product Specifications

Specification | Value
Entity Types | 320+
Detection | 3-layer hybrid: Presidio + NLP + Stance classification
Test Coverage | 100% (419/419 tests)
Languages | 48
Anonymization Methods | Replace, Redact, Mask, Hash, Encrypt (AES-256-GCM), RSA-4096 Asymmetric, Keep
Platforms | Web App, REST API, SDKs (JavaScript, Python), Cloud Storage Add-ins, Nextcloud
Pricing | Enterprise (custom)
Hosting | Customer-selected
Compliance | GDPR, HIPAA, PCI-DSS, ISO 27001, SOC 2