Presidio's 22.7% Precision Problem: Why False Positives Are Destroying Your Anonymization Results

technical comparison targeting developers and data engineers who have tried Presidio.

The Challenge

Microsoft Presidio's default NER (Named Entity Recognition) model generates high false positive rates in unstructured text. A 2024 benchmark study found Presidio's person name recognizer achieved 22.7% precision in business document contexts — meaning 77.3% of "person name" detections are false positives. For a document with 100 capitalized proper nouns (product names, company names, place names), only 23 are actual person names, but Presidio flags all 100. The downstream effect: organizations anonymize meaningful content (product names, company names) while users lose confidence in the tool and may start disabling detection to reduce noise.

By the Numbers

A 2024 benchmark study found Presidio's person name recognizer achieved 22.7% precision in business document contexts — meaning 77.3% of "person name" detections are false positives.
For a document with 100 capitalized proper nouns (product names, company names, place names), only 23 are actual person names, but Presidio flags all 100.

Real-World Scenario

A data analytics firm processing customer feedback surveys abandoned Presidio after 40% of survey responses had product names, city names, and brand mentions incorrectly redacted alongside actual PII. Downstream analysis was corrupted by over-anonymization. Switching to anonym.legal's hybrid recognizer, precision improved to ~85%+ — product names preserved, person names correctly identified. Analysis quality restored.

Technical Approach

The hybrid recognizer stack (Regex + NLP + XLM-RoBERTa transformers) dramatically improves precision by using context from surrounding text. Transformer-based models understand that "Apple announced its earnings" refers to a company, while "Apple Smith joined the team" refers to a person. The result is materially higher precision than bare Presidio, preserving document utility while maintaining privacy protection. Users who experienced Presidio's false positive problem find anonym.legal's accuracy meaningfully better.

Source

The Challenge

By the Numbers

Real-World Scenario

Technical Approach

Comments (0)