targeting HR, finance, and data management professionals.
The Challenge
Excel spreadsheets used in business operations are among the most PII-dense document types: customer lists, employee records, patient registries, vendor databases, and financial records. Unlike PDFs (text layer) or Word documents (flowing text), Excel has a two-dimensional structure: PII entities can appear in any cell, across hundreds of columns and thousands of rows. Naive text scanning misses the structural context. A column header "SSN", for example, indicates that the entire column contains Social Security numbers, even if the values don't look like SSNs to a general NER model. Excel-specific challenges include date cells stored or displayed as serial numbers, partial SSNs split across columns, and formulas that compute PII values from references to other cells.
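Two of the challenges above, numeric date cells and SSNs split across columns, can be illustrated with a minimal Python sketch. It is not anonym.legal's implementation: the function names `serial_to_date` and `find_split_ssn` are hypothetical, the row is a plain Python list standing in for cells read from a worksheet, and the date conversion assumes Excel's default 1900 date system.

```python
import re
from datetime import date, timedelta

# Excel's 1900 date system: a serial number counts days from this epoch.
# The offset is valid for serials >= 61, i.e. after Excel's fictitious
# 1900-02-29. Workbooks using the 1904 date system need a different epoch.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert an Excel date serial (e.g. a date of birth) to a date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


# An SSN split over three adjacent cells: 3 digits, 2 digits, 4 digits.
SSN_PARTS = (re.compile(r"\d{3}"), re.compile(r"\d{2}"), re.compile(r"\d{4}"))


def find_split_ssn(row):
    """Return the index of the first cell of a split SSN, or None.

    Assumption: parts are stored as text. Numeric cells can lose leading
    zeros ("05" becomes 5), which this simple check would miss.
    """
    width = len(SSN_PARTS)
    for start in range(len(row) - width + 1):
        cells = [str(c).strip() for c in row[start:start + width]]
        if all(p.fullmatch(c) for p, c in zip(SSN_PARTS, cells)):
            return start
    return None
```

A scanner that reads each cell in isolation sees "123", "45", and "6789" as three harmless numbers; checking adjacent cells together is what surfaces the identifier.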
Real-World Scenario
An HR department receives employee records from an acquired company: a 15,000-row XLSX file with 40 columns, including employee IDs, names, SSNs, salaries, performance scores, and manager names. Sharing the file with an external HR consultant requires removing personal identifiers while preserving the statistical structure of the data. anonym.legal processes the full XLSX with the "HR GDPR" preset: names, SSNs, email addresses, and phone numbers are anonymized cell by cell, while salary data, performance scores, and department codes are preserved. Processing time: 8 minutes, versus an estimated 40 hours of manual review.
Technical Approach
anonym.legal supports XLSX natively, with cell-level PII detection that uses column headers as context signals. In a column labeled "SSN", values that match only partial patterns are still treated as SSNs, which covers edge-case values. Multi-sheet processing applies the same configuration across all sheets. The output preserves Excel formatting, column structure, formulas, and non-PII data; only PII cell values are anonymized.
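The idea of using column headers as context signals can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the product's actual detection logic: the keyword table `HEADER_HINTS`, the patterns in `VALUE_PATTERNS`, the placeholder format, and the function names are all hypothetical, and the sheet is modeled as a list of rows with the header in the first row.

```python
import re

# Hypothetical header keywords mapped to entity types (illustrative only).
HEADER_HINTS = {
    "ssn": "SSN",
    "social security": "SSN",
    "name": "NAME",
    "email": "EMAIL",
    "phone": "PHONE",
}

# Strict value patterns, used only when the header gives no hint.
VALUE_PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def classify_header(header):
    """Return the entity type suggested by a column header, or None."""
    text = str(header).strip().lower()
    for keyword, entity in HEADER_HINTS.items():
        if keyword in text:
            return entity
    return None


def anonymize_sheet(rows):
    """Replace PII cells with placeholders; leave all other cells as-is."""
    header, *data = rows
    column_types = [classify_header(h) for h in header]
    out = [list(header)]
    for row in data:
        new_row = []
        for value, col_type in zip(row, column_types):
            entity = col_type  # header context wins, even for odd values
            if entity is None and isinstance(value, str):
                for name, pattern in VALUE_PATTERNS.items():
                    if pattern.match(value.strip()):
                        entity = name
                        break
            if entity is not None and value not in (None, ""):
                new_row.append(f"<{entity}>")
            else:
                new_row.append(value)
        out.append(new_row)
    return out
```

With a header row of `["Employee ID", "Name", "SSN", "Salary", "Notes"]`, a partial value such as `"6789"` in the SSN column is replaced because of the header alone, while the salary column passes through unchanged. Substring matching on headers is deliberately crude here; a real system would need to handle headers such as "Department Name" that contain a keyword without holding personal data.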