Why 'Delete the Email Column' Isn't Enough: Detecting PII in CSV Free-Text Fields for Research Data Sharing

Audience: academic researchers and research data management professionals.

The Challenge

Research data shared between institutions (universities, NGOs, think tanks) frequently travels in CSV format, the lingua franca of data exchange. Survey CSVs are particularly challenging: structured columns (name, email, phone) are easy to identify and clean, but free-text response columns contain unstructured PII mixed in with the actual research data. A column like "additional_comments" might contain "My doctor at Boston Medical Center said...", which reveals an institution and health information and, combined with other details in the row, can point to the respondent. Standard CSV anonymization approaches clean the structured columns but leave free-text PII untouched. This "partial anonymization" fails GDPR's definition of anonymized data, which requires that individuals are no longer identifiable (Recital 26).
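The gap is easy to reproduce. The sketch below is a minimal illustration, not any particular tool's method: the sample rows, the column names, and the helpers `drop_columns` and `residual_hits` are all hypothetical. It drops the structured PII columns and then searches the remaining cells for email addresses. A single regex only catches the email; the hospital reference in the same cell would need named-entity recognition to find.

```python
import csv
import io
import re

# Hypothetical sample data; the column names are illustrative only.
RAW = """respondent_id,name,email,additional_comments
1,Jane Roe,jane@example.org,"My doctor at Boston Medical Center said to email me at jane@example.org."
2,John Doe,john@example.org,"No further comments."
"""

STRUCTURED_PII_COLUMNS = {"name", "email"}

# Deliberately simple email pattern, used only to show that PII survives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def drop_columns(text, columns):
    """Return rows as dicts with the named columns removed."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if k not in columns} for row in reader]


def residual_hits(rows, pattern):
    """List (row_index, column, match) for every pattern match left in any cell."""
    hits = []
    for i, row in enumerate(rows):
        for col, cell in row.items():
            for m in pattern.finditer(cell):
                hits.append((i, col, m.group()))
    return hits


partially_cleaned = drop_columns(RAW, STRUCTURED_PII_COLUMNS)
leaks = residual_hits(partially_cleaned, EMAIL_RE)
print(leaks)
```

Even with the "name" and "email" columns gone, the respondent's address is still present in "additional_comments", which is exactly the partial anonymization described above.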

Real-World Scenario

A research consortium of three European universities shares a 5,000-row survey CSV about patient experiences. The free-text columns contain incidental names, hospital references, and location details that would identify individual respondents. anonym.legal processes the CSV: it detects and anonymizes 47 PII entities in the free-text columns and cleans the structured PII columns (name, email, birth date). The anonymized CSV is then shared between the institutions in compliance with GDPR Article 89, which permits processing for research purposes subject to appropriate safeguards. The research ethics board approves the anonymization methodology.

Technical Approach

CSV processing applies entity detection to every cell, including those in free-text columns, using the same NLP + transformer stack as document processing. PII entities in free-text survey responses (for example, "My name is John and I work at IBM") are detected and replaced, while the surrounding context ("I feel that the new policy...") is preserved. Structured columns with PII headers are also cleaned. The result is a genuinely anonymized CSV that maintains research utility.
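The cell-by-cell approach can be sketched as follows. This is an illustration under stated assumptions, not anonym.legal's implementation: the regex `DETECTORS` are a stand-in for the NLP + transformer stack, and `scrub_cell`, `scrub_csv`, and the sample data are hypothetical names. Regexes can find emails and phone numbers but not names like "John" or organizations like "IBM"; in practice a named-entity recognition model would take the place of the detectors inside `scrub_cell`.

```python
import csv
import io
import re

# Stand-in detectors: simple regexes, NOT the NLP + transformer stack
# described above. A real pipeline would call an NER model here.
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scrub_cell(cell):
    """Replace each detected entity with a [LABEL] placeholder.

    Returns the cleaned text and the number of replacements made.
    """
    total = 0
    for label, pattern in DETECTORS.items():
        cell, n = pattern.subn(f"[{label}]", cell)
        total += n
    return cell, total


def scrub_csv(text):
    """Apply scrub_cell to every data cell; the header row is left untouched."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header as-is
    entities = 0
    for row in reader:
        cleaned = []
        for cell in row:
            new_cell, n = scrub_cell(cell)
            cleaned.append(new_cell)
            entities += n
        writer.writerow(cleaned)
    return out.getvalue(), entities


# Hypothetical sample input.
SAMPLE = (
    "respondent_id,comments\n"
    '1,"Reach me at jane@example.org or +1 617 555 0100. '
    'I feel that the new policy is fair."\n'
    "2,No further comments.\n"
)

scrubbed, count = scrub_csv(SAMPLE)
print(scrubbed)
print("entities replaced:", count)
```

Replacing entities with typed placeholders rather than deleting them keeps each sentence readable, so the research content of the response ("I feel that the new policy is fair.") survives intact.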
