Three NLP Engines: spaCy, Stanza, and XLM-RoBERTa Combined
Research Source
No single NLP engine covers all 48 languages effectively. spaCy has excellent models for European languages but limited coverage for South/Southeast Asian languages. Stanza excels at specific languages (Bulgarian, Hungarian, Hebrew) but lacks breadth. Transformer models (XLM-RoBERTa) handle many languages but are computationally expensive. A hybrid approach — routing each language to its strongest engine — maximizes accuracy while minimizing resource usage.
Executive Summary
No single NLP engine covers all languages effectively. spaCy excels at European languages, Stanza at specific NER tasks, XLM-RoBERTa at broad multilingual coverage. A hybrid approach routes each language to its strongest engine.
anonym.legal combines three NLP engines: spaCy (24 languages), Stanza NER (6 languages), and an XLM-RoBERTa transformer (18 languages), for 48 languages in total. Each language is routed to the engine that gives it the best accuracy.
The Problem: The Single-Engine Limitation
spaCy provides fast, accurate NER for 24 languages but has no models for Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, or Armenian. Stanza provides excellent NER for these 6 languages but is slower and more memory-intensive. XLM-RoBERTa handles 18 additional languages (Arabic, Hindi, Thai, and others) but needs GPU-class compute to reach production throughput. An organization processing documents in 48 languages therefore needs all three engines, with routing that sends each document to the best available engine.
Irreducible truth: Language coverage is not a single number; it is accuracy measured per language. Claiming '48 languages' with a single engine that performs well on 20 and poorly on 28 is misleading. True coverage means every language is processed by an engine optimized for it.
The Solution: How anonym.legal Addresses This
spaCy: 24 Languages
Fast and accurate NER for: Catalan, Danish, German, Greek, English, Spanish, Finnish, French, Croatian, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian, Dutch, Polish, Portuguese, Romanian, Russian, Slovenian, Swedish, Ukrainian, Chinese. Models are loaded lazily, on first use, and held in an LRU cache.
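A minimal sketch of LRU caching with lazy loading, under stated assumptions: the names `get_model`, `_load_from_disk`, and `LOAD_LOG`, and the cache size of 2, are hypothetical and chosen only for illustration. The loader is a stand-in so the example runs without any models installed; in a real deployment it would presumably wrap an expensive call such as `spacy.load(...)`.

```python
from functools import lru_cache

# Records every real (expensive) load so the caching behaviour is visible.
LOAD_LOG = []


def _load_from_disk(lang):
    """Stand-in for an expensive model load (hypothetical helper)."""
    LOAD_LOG.append(lang)
    return {"lang": lang}


@lru_cache(maxsize=2)  # assumed size: keep at most two models in memory
def get_model(lang):
    """Load a model on first request only; later requests hit the cache."""
    return _load_from_disk(lang)


get_model("de")  # first use: loads
get_model("de")  # cache hit: no load
get_model("fr")  # loads
get_model("en")  # loads and evicts "de", the least recently used entry
get_model("de")  # loads again because it was evicted
print(LOAD_LOG)
```

Nothing is loaded at import time, so a process that only ever sees German documents pays for one model. The cache size trades memory against reload latency for languages that are rarely requested.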
Stanza NER: 6 Languages
Specialized NER models for the languages spaCy has no model for: Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian. These languages rely on Stanza's neural NER pipeline for accurate name and entity recognition.
XLM-RoBERTa Transformer: 18 Languages
Cross-lingual transformer for: Arabic, Hindi, Turkish, Czech, Slovak, Indonesian, Thai, Persian, Serbian, Latvian, Estonian, Malay, Bengali, Urdu, Swahili, Tagalog, Icelandic, Basque. These languages are mapped by alias to the English NLP pipeline, and custom recognizers cover language-specific patterns.
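The alias mapping and custom recognizers described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the names `PIPELINE_ALIASES`, `CUSTOM_PATTERNS`, `resolve_pipeline`, and `find_pattern_entities` are hypothetical, the alias table shows only a subset, and the 11-digit pattern is a toy example, not a recognizer taken from the product.

```python
import re

# Hypothetical subset: languages served through the English pipeline by alias.
PIPELINE_ALIASES = {"tr": "en", "th": "en", "ar": "en"}

# Illustrative pattern only; real recognizers would be validated per language.
CUSTOM_PATTERNS = {
    "tr": [("NATIONAL_ID", re.compile(r"\b\d{11}\b"))],
}


def resolve_pipeline(lang):
    """Return the pipeline language to use, following an alias if one exists."""
    return PIPELINE_ALIASES.get(lang, lang)


def find_pattern_entities(text, lang):
    """Run the custom pattern recognizers registered for the original language."""
    hits = []
    for label, pattern in CUSTOM_PATTERNS.get(lang, []):
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return hits


print(resolve_pipeline("tr"))
print(find_pattern_entities("Kimlik 12345678901 tamam", "tr"))
```

The point of the split is that the pipeline is chosen by the alias while the pattern recognizers are still selected by the original language code.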
Intelligent Routing
The analyzer engine automatically routes each request to the appropriate NLP engine based on the detected or specified language. Users specify the language or let auto-detection choose it, and the system selects the engine; no routing configuration is required.
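The routing described above amounts to a lookup from language to engine. The sketch below assumes ISO 639-1 codes (with `nb` for Norwegian) and uses hypothetical names (`ROUTES`, `route`); the language lists come from the three engine sections, but the codes and the error handling are illustrative assumptions, not the product's actual implementation.

```python
# Language codes are assumed ISO 639-1; "nb" is assumed for Norwegian.
SPACY = ("ca", "da", "de", "el", "en", "es", "fi", "fr", "hr", "it", "ja", "ko",
         "lt", "mk", "nb", "nl", "pl", "pt", "ro", "ru", "sl", "sv", "uk", "zh")
STANZA = ("bg", "hu", "he", "vi", "af", "hy")
XLMR = ("ar", "hi", "tr", "cs", "sk", "id", "th", "fa", "sr", "lv", "et", "ms",
        "bn", "ur", "sw", "tl", "is", "eu")

# One flat table: language code -> engine name.
ROUTES = {}
for engine, langs in (("spacy", SPACY), ("stanza", STANZA), ("xlm-roberta", XLMR)):
    for code in langs:
        ROUTES[code] = engine


def route(lang):
    """Return the engine name for a language code, or raise if unsupported."""
    try:
        return ROUTES[lang.lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None


print(len(ROUTES), route("de"), route("hu"), route("th"))
```

A static table keeps the choice auditable: anyone can see which engine handles which language, and moving a language to a different engine is a one-line change.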
Compliance Mapping
This architecture supports GDPR Article 5(1)(d), the accuracy principle, by processing each language with its most accurate engine. It also enables global deployments in which documents arrive in any of 48 languages and must be processed with consistent accuracy.
anonym.legal's GDPR, HIPAA, PCI-DSS, and ISO 27001 compliance coverage, combined with ISO 27001 hosting at Hetzner in Germany, provides documented technical measures that organizations can reference in their compliance documentation.
Product Specifications
| Specification | Value |
|---|---|
| Entity Types | 320+ |
| Detection | 3-layer hybrid: Presidio + NLP + Stance classification |
| Test Coverage | 100% (419/419 tests) |
| Languages | 48 |
| Anonymization Methods | Replace, Redact, Mask, Hash (SHA-256/512), Encrypt (AES-256-GCM) |
| Platforms | Web App, Desktop, Office Add-in, Chrome Extension, MCP Server, REST API |
| Pricing | Free €0, Basic €3, Pro €15, Business €29 |
| Hosting | Hetzner Germany, ISO 27001 |
| Compliance | GDPR, HIPAA, PCI-DSS, ISO 27001 |
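To make the anonymization methods in the table concrete, here is a rough sketch of four of them using only the Python standard library. The function names, the `<ENTITY_TYPE>` placeholder format, and the masking convention (keep the last four characters) are assumptions for illustration and may differ from the product's behaviour. Encrypt (AES-256-GCM) is omitted because it needs a third-party cryptography library.

```python
import hashlib


def replace(value, entity_type):
    """Replace the value with a placeholder naming its entity type (assumed format)."""
    return f"<{entity_type}>"


def redact(value):
    """Remove the value entirely."""
    return ""


def mask(value, visible=4, char="*"):
    """Hide all but the last `visible` characters (assumed convention)."""
    if visible <= 0:
        return char * len(value)
    return char * max(len(value) - visible, 0) + value[-visible:]


def hash_value(value, algorithm="sha256"):
    """Return the hex digest of the value; pass "sha512" for SHA-512."""
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


print(replace("Anna Schmidt", "PERSON"))
print(mask("4111111111111111"))
print(hash_value("Anna Schmidt")[:16])
```

Replace and Redact discard the original value, Mask keeps a recognizable fragment, and Hash gives a stable pseudonym so that the same input always maps to the same output.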