Why LLMs Miss 50% of Clinical PHI and What the Research Says About Better De-Identification

Healthcare compliance guide with research citations.

The Challenge

A 2025 study (arXiv:2509.14464) found that general-purpose LLM tools miss more than 50% of clinical PHI in free-text clinical notes. HIPAA Safe Harbor requires removing 18 specific identifiers, but clinical notes contain them in unstructured, abbreviated, and context-dependent forms, for example "Pt. John D., DOB 4/12/67, presented to ED...". Tools that rely solely on pattern matching fail on abbreviated forms; tools that rely solely on ML fail on regional variations and rare identifier types.
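The pattern-matching failure mode can be shown on the sample note above. This is a minimal sketch: both regular expressions are hypothetical illustrations, not patterns taken from any particular tool. The date pattern finds the DOB, while a naive full-name pattern finds nothing because the surname is abbreviated to an initial.

```python
import re

NOTE = "Pt. John D., DOB 4/12/67, presented to ED..."

# Hypothetical patterns, for illustration only.
# Dates written as M/D/YY or MM/DD/YYYY.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
# Naive full name: two capitalized words, each followed by lowercase letters.
FULL_NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def find_spans(pattern: re.Pattern, text: str) -> list:
    """Return every substring of text matched by pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


dates = find_spans(DATE_RE, NOTE)
names = find_spans(FULL_NAME_RE, NOTE)

print(dates)  # ['4/12/67']
print(names)  # [] because "John D." has no full surname to match
```

Loosening the name pattern to accept a single initial would catch "John D." but would also start matching ordinary capitalized text, which is the precision cost of relying on patterns alone.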

By the Numbers

LLMs miss >50% of clinical PHI in multilingual documents (arXiv:2509.14464, 2025)
34.8% of all ChatGPT inputs contain sensitive data including multilingual PII (Cyberhaven Q4 2025)

Real-World Scenario

A hospital system is building a de-identified research dataset from 500,000 clinical notes. Internal testing shows that its current tool (Presidio with default settings) misses roughly 30% of PHI, which creates IRB compliance issues for the research and potential HIPAA violations. anonym.legal's hybrid approach with healthcare-specific entity types reduces the miss rate to under 5%.
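A miss rate like the ones quoted in this scenario is typically measured by comparing a tool's output with hand-annotated notes. The sketch below shows one way to compute it, assuming spans are character offsets and that any overlap with a gold span counts as a detection (labels are ignored). The function name, the span format, and the sample annotations are hypothetical.

```python
from typing import Set, Tuple

# (start offset, end offset, entity label); end is exclusive.
Span = Tuple[int, int, str]


def miss_rate(gold: Set[Span], predicted: Set[Span]) -> float:
    """Fraction of gold PHI spans that no predicted span overlaps."""
    if not gold:
        return 0.0

    def overlaps(g: Span, p: Span) -> bool:
        return g[0] < p[1] and p[0] < g[1]

    missed = sum(
        1 for g in gold if not any(overlaps(g, p) for p in predicted)
    )
    return missed / len(gold)


# Hypothetical annotations for "Pt. John D., DOB 4/12/67, presented to ED..."
gold = {(4, 11, "NAME"), (17, 24, "DATE")}
predicted = {(17, 24, "DATE")}  # the tool found the date but not the name

print(miss_rate(gold, predicted))  # 0.5
```

Overlap matching is a lenient choice. Requiring exact span boundaries would give a higher miss rate on the same output, so the matching rule should be fixed before comparing tools.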

Technical Approach

Hybrid three-tier detection combines ML-based NER, which provides high recall on names and contextual PHI, with regex matching, which provides high precision on structured identifiers. The 260+ entity types include medical-specific identifiers: MRN formats, NPI numbers, DEA numbers, and health plan IDs. Confidence thresholds can be lowered to maximize recall in high-risk PHI scenarios.
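The interaction between a regex tier, an ML tier, and a confidence threshold can be sketched as follows. This is an illustration under stated assumptions, not anonym.legal's implementation: it covers only two tiers, `stub_ner` is a hypothetical stand-in for a trained NER model, and all names, scores, and patterns are invented for the example. The NPI check uses the Luhn algorithm over the number prefixed with 80840, and 1234567893 is a commonly used test value that passes it.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    label: str
    score: float
    source: str


def npi_checksum_ok(npi: str) -> bool:
    """Luhn check over the 10-digit NPI prefixed with 80840."""
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


# Hypothetical patterns for structured identifiers.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
NPI_RE = re.compile(r"\b\d{10}\b")


def regex_tier(text: str) -> List[Detection]:
    """High-precision tier: patterns plus checksum validation."""
    hits = [Detection(m.start(), m.end(), "DATE", 1.0, "regex")
            for m in DATE_RE.finditer(text)]
    hits += [Detection(m.start(), m.end(), "NPI", 1.0, "regex")
             for m in NPI_RE.finditer(text) if npi_checksum_ok(m.group(0))]
    return hits


def stub_ner(text: str) -> List[Detection]:
    """Hypothetical stand-in for an NER model with a low-confidence hit."""
    i = text.find("John D.")
    return [Detection(i, i + 7, "NAME", 0.42, "ner")] if i >= 0 else []


def detect(text: str, ner_tier: Callable[[str], List[Detection]],
           ner_threshold: float = 0.5) -> List[Detection]:
    """Merge regex hits with NER hits at or above the threshold."""
    hits = regex_tier(text)
    hits += [d for d in ner_tier(text) if d.score >= ner_threshold]
    return sorted(hits, key=lambda d: d.start)


TEXT = "Pt. John D., DOB 4/12/67, NPI 1234567893"
print([d.label for d in detect(TEXT, stub_ner, 0.5)])  # ['DATE', 'NPI']
print([d.label for d in detect(TEXT, stub_ner, 0.3)])  # ['NAME', 'DATE', 'NPI']
```

Lowering the threshold from 0.5 to 0.3 recovers the abbreviated name at the cost of admitting more low-confidence candidates, which is the recall-for-precision trade the paragraph above describes.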
