Re-identification Attacks on "Anonymized" Data: What Works in 2026
The history of data anonymization is largely a history of failed anonymizations. The Netflix Prize dataset. AOL search logs. Massachusetts health records. NYC taxi data. In each case, data that had been stripped of obvious identifiers was later re-identified by researchers — sometimes completely, sometimes partially, always enough to undermine the claim of anonymization.
This is not ancient history. The same techniques continue to work against data released today.
The Fundamental Problem: Quasi-Identifiers
The core insight, articulated by Latanya Sweeney in the late 1990s and confirmed repeatedly since, is that re-identification does not require recovering the original identifiers. It requires only finding a combination of attributes that is unique enough to identify an individual.
Sweeney's original finding: 87% of Americans are uniquely identified by the combination of ZIP code, date of birth, and sex. These three fields appear in countless "anonymized" datasets. Each alone is innocuous. Together, they are identifying.
Modern datasets often contain dozens of such quasi-identifiers. An "anonymized" employee dataset might contain: department, job title, salary band, years at company, state of residence, age range. An "anonymized" health dataset might contain: diagnosis codes, procedure codes, admission date, discharge date, hospital, ZIP code. The more columns, the more combinations available for linking.
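The uniqueness problem is easy to check empirically. A minimal sketch, with invented records and column names, that measures what fraction of records are unique on a chosen set of quasi-identifiers:

```python
from collections import Counter

# Hypothetical "anonymized" records: direct identifiers removed, but the
# quasi-identifiers (ZIP, birth year, sex) remain.
records = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "A"},
    {"zip": "02138", "birth_year": 1965, "sex": "M", "diagnosis": "B"},
    {"zip": "02139", "birth_year": 1971, "sex": "F", "diagnosis": "C"},
    {"zip": "02139", "birth_year": 1971, "sex": "F", "diagnosis": "D"},
    {"zip": "02140", "birth_year": 1958, "sex": "M", "diagnosis": "E"},
]

quasi_identifiers = ("zip", "birth_year", "sex")

def uniqueness(records, qi):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(tuple(r[k] for k in qi) for r in records)
    unique = sum(1 for r in records if counts[tuple(r[k] for k in qi)] == 1)
    return unique / len(records)

print(uniqueness(records, quasi_identifiers))  # 3 of 5 records are unique: 0.6
```

Running the same check on a real release, against each plausible subset of columns, gives a first-order sense of how exposed the data is.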
Current Attack Techniques
1. Linkage Attacks
The most basic form: link the target dataset against an external reference dataset that contains known identifiers. Where quasi-identifiers match, identities transfer.
In 2019, Rocher, Hendrickx, and de Montjoye published a landmark study in *Nature Communications* showing that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes. The model required no special access — it used census microdata as the reference.
The implication: any "anonymized" dataset containing demographic attributes is vulnerable to linkage if a sufficiently rich external reference exists. In 2026, such references are abundant — voter registration files, commercial data brokers, LinkedIn, social media.
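The mechanics of a linkage attack are a plain join. A sketch with invented records and names, where the reference plays the role of a voter file or broker dataset:

```python
# Hypothetical linkage attack: an "anonymized" release joined against a
# public reference file on shared quasi-identifiers. All data is invented.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1971, "sex": "M", "diagnosis": "diabetes"},
]
reference = [
    {"name": "Jane Doe", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "John Roe", "zip": "02139", "birth_year": 1971, "sex": "M"},
    {"name": "Ann Poe",  "zip": "02140", "birth_year": 1958, "sex": "F"},
]
QI = ("zip", "birth_year", "sex")

def link(anonymized, reference, qi):
    """Return {name: record} for every record with exactly one reference match."""
    index = {}
    for ref in reference:
        index.setdefault(tuple(ref[k] for k in qi), []).append(ref["name"])
    hits = {}
    for rec in anonymized:
        names = index.get(tuple(rec[k] for k in qi), [])
        if len(names) == 1:  # unique match: identity transfers
            hits[names[0]] = rec
    return hits

print(link(anonymized, reference, QI))  # both records re-identified
```

Note that the attacker needs no match on every record; a single unique match is already a disclosure.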
2. Inference Attacks on LLMs
Large language models memorize training data. The degree of memorization depends on how frequently text appears in the training corpus and the size of the model, but even rare examples can be extracted.
Carlini et al. (2021) demonstrated that GPT-2 could be prompted to reproduce verbatim training text, including names, contact details, and other personal information, and follow-up work extended the result to larger models. The attack: provide a prefix that appeared in training data and observe whether the model completes it with memorized content.
For organizations that used personal data in LLM training without adequate anonymization, this is not a theoretical risk — it is a deployed risk. Models in production today may reproduce personal information that appeared in their training data.
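The prefix-completion attack doubles as a defensive test. A sketch of an extraction probe: `complete` is a stub standing in for a real model call (swap in your inference client), and the canary strings are invented; planting such canaries in training data is a common way to make memorization detectable.

```python
# Verbatim-extraction probe. `complete(prefix)` is a STUB standing in for a
# real model call. All canary strings here are invented for illustration.
CANARIES = {
    "Contact the patient liaison at": " jane.doe@example.com",
    "Employee badge number for J. Roe:": " 48-2213",
}

def complete(prefix: str) -> str:
    # A memorizing model returns the training continuation for a known prefix;
    # this stub simulates exactly that behavior.
    return CANARIES.get(prefix, " [unrelated text]")

def extraction_probe(canaries) -> list:
    """Return the prefixes whose model completion reproduces the planted secret."""
    return [p for p, secret in canaries.items()
            if secret.strip() in complete(p)]

print(extraction_probe(CANARIES))  # with this stub, both canaries leak
```

Against a production model, the probe would be run across many sampled prefixes and decoding temperatures, since memorized text does not always surface on the first attempt.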
3. Differencing Attacks
When an organization releases multiple versions of a dataset (before and after removing a record, for example), an attacker with access to both versions can compute the difference and identify the removed individual's attributes.
Even without multiple releases, count queries on small populations can reveal individual data. "How many employees in the legal department in Berlin have a salary above €120,000?" — if the answer is 1, the attribute is effectively disclosed.
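The count-query version of the attack is trivially mechanical. A sketch with invented salary data:

```python
# Differencing attack on count queries (invented data). Two queries whose
# populations differ by exactly one person disclose that person's attribute.
salaries = {"alice": 130_000, "bob": 90_000, "carol": 95_000}

def count_above(people, threshold, exclude=frozenset()):
    """Count people with salary above threshold, optionally excluding some."""
    return sum(1 for name, s in people.items()
               if s > threshold and name not in exclude)

q_all = count_above(salaries, 120_000)                         # whole team
q_without = count_above(salaries, 120_000, exclude={"alice"})  # after one leaver
# The difference is 1: the attacker learns Alice earns above 120k.
print(q_all - q_without)
```

Query auditing and minimum-cell-size rules mitigate this only partially; overlapping queries can be differenced in ways no static threshold catches.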
Differential privacy was developed to provide mathematically rigorous protection against differencing attacks. Its adoption remains limited outside of large technology companies and statistical agencies.
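For intuition, the classic ε-DP mechanism for a count query adds Laplace noise with scale 1/ε, since a count has sensitivity 1. A minimal sketch (the inverse-CDF sampling and parameter choices are illustrative, not a production implementation):

```python
import math
import random

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count under epsilon-differential privacy via the Laplace
    mechanism. A count query has sensitivity 1: adding or removing one
    person changes the true answer by at most 1."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    # Inverse-CDF sample from Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(0)
# With small epsilon the noise swamps the single-person signal that the
# differencing query above would otherwise extract.
print(laplace_count(1, epsilon=0.1))
```

Averaged over many releases the noise cancels, which is exactly why the privacy budget must be tracked across all queries, not per query.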
4. Reconstruction Attacks
Recent work has shown that even aggregate statistics — published in the form of marginal tables — can be inverted to reconstruct individual-level records with high accuracy when enough marginals are published.
The US Census Bureau acknowledged this vulnerability and adopted differential privacy for the 2020 Census specifically because sophisticated reconstruction attacks against 2010 Census data were feasible.
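The mechanics are visible even at toy scale. A brute-force sketch with invented data: three records with two binary attributes, and three published marginals that turn out to pin down the microdata exactly:

```python
from itertools import product

# Toy reconstruction attack (invented data): published marginals over a
# population of 3 records, each with two binary attributes A and B.
N = 3
published = {"A": 2, "B": 1, "A_and_B": 1}

solutions = set()
for rows in product(product((0, 1), repeat=2), repeat=N):
    a = sum(r[0] for r in rows)
    b = sum(r[1] for r in rows)
    ab = sum(r[0] & r[1] for r in rows)
    if (a, b, ab) == (published["A"], published["B"], published["A_and_B"]):
        solutions.add(tuple(sorted(rows)))  # record order is irrelevant

# Exactly one microdata set is consistent with the "aggregates".
print(solutions)
```

Real reconstruction attacks replace the brute force with integer programming or SAT solving, but the principle is the same: enough published constraints leave only one consistent dataset.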
5. Auxiliary Information Attacks
Social media profiles, public records, news articles, and other auxiliary sources can provide enough information to link an anonymous record to a named individual when the anonymous record contains unusual combinations of attributes.
The classic example: the "anonymized" governor of Massachusetts. His hospital records were linked to his voter registration record on ZIP code, birth date, and sex: he was the only voter in his ZIP code with that birth date and sex, and the match exposed his diagnosis.
In 2026, auxiliary information has only grown richer. LinkedIn profiles provide employer, title, location, and career history. Twitter/X provides stated opinions, travel locations, relationship information. This information is public and increasingly indexed.
What "Anonymized" Claims Actually Cover
Understanding what specific anonymization technique was applied matters enormously, because different techniques protect against different attacks.
k-anonymity: Guarantees that each record is indistinguishable from at least k-1 others on quasi-identifier combinations. Does not protect against attribute disclosure when the sensitive attribute is uniform within a group. Does not protect against auxiliary information attacks. k=2 or k=5 provides minimal protection against modern linkage attacks.
l-diversity: Extension of k-anonymity requiring diversity in sensitive attribute values within each equivalence class. Better than k-anonymity but still vulnerable to skewness attacks.
Differential privacy (ε-DP): Provides probabilistic bounds on what an attacker can learn from the mechanism's output. The privacy budget ε must be chosen carefully: ε=10 provides minimal protection, while ε=0.1 provides strong protection at the cost of significant utility loss. The guarantee holds against arbitrary auxiliary information, but it covers only the mechanism's outputs, not data released through other channels.
Pseudonymization: Replaces direct identifiers with tokens. Still personal data under GDPR. Protects against naive linkage but not against attacks using quasi-identifiers.
Masking/redaction: Removes or replaces identified fields. Protection depends entirely on the completeness of identification — any missed quasi-identifier combination remains exploitable.
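The k-anonymity and l-diversity definitions are directly computable for a proposed release. A sketch with hypothetical records that reports both values, including a group that is 2-anonymous yet fully disclosing because its sensitive value is uniform:

```python
from collections import defaultdict

# Hypothetical release: quasi-identifiers (zip, age_band), sensitive column
# "diagnosis". The (02139, 40-49) class is 2-anonymous but has l = 1:
# everyone in it shares the diagnosis, so membership alone discloses it.
records = [
    {"zip": "02138", "age_band": "60-69", "diagnosis": "flu"},
    {"zip": "02138", "age_band": "60-69", "diagnosis": "asthma"},
    {"zip": "02139", "age_band": "40-49", "diagnosis": "HIV"},
    {"zip": "02139", "age_band": "40-49", "diagnosis": "HIV"},
]

def k_and_l(records, qi, sensitive):
    """Smallest equivalence-class size (k) and smallest number of distinct
    sensitive values within any class (l)."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[c] for c in qi)].append(r[sensitive])
    k = min(len(vals) for vals in classes.values())
    l = min(len(set(vals)) for vals in classes.values())
    return k, l

print(k_and_l(records, ("zip", "age_band"), "diagnosis"))  # (2, 1)
```

A release that reports k without also reporting l (and without naming the quasi-identifier set it was computed over) tells a reviewer very little.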
What This Means for Practice
Release Less
The most reliable protection against re-identification is not releasing data in the first place. For many use cases, aggregate statistics, synthetic data, or secure access environments (research data centers, trusted researcher frameworks) provide sufficient analytical value without the re-identification risk of individual-level releases.
Assess Risk Before Every Release
Re-identification risk depends on the specific dataset, the available auxiliary information, and the attacker model. A k=5 anonymized dataset of hospital patients might be low-risk if the patient population is large and the auxiliary information is limited. The same dataset might be high-risk for a rare disease cohort.
Before any release: enumerate the quasi-identifiers present, estimate the population size of the smallest equivalence classes, and assess what external data could be used for linkage.
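A pre-release gate following these steps might look like the sketch below. The column names are illustrative, and the threshold is an assumption: min_class_size=11 mirrors a common small-cell suppression rule of thumb, but the right value must come from your own risk assessment.

```python
from collections import Counter

def release_check(records, quasi_identifiers, min_class_size=11):
    """Pre-release check: size of the smallest equivalence class over the
    declared quasi-identifiers. The default threshold of 11 is a common
    small-cell rule of thumb, not a universal standard."""
    counts = Counter(tuple(r[k] for k in quasi_identifiers) for r in records)
    smallest = min(counts.values())
    return {"smallest_class": smallest, "ok": smallest >= min_class_size}

# Illustrative run: 12 indistinguishable rows pass, one distinct row fails.
rows = [{"state": "MA", "age_range": "40-49"}] * 12
print(release_check(rows, ("state", "age_range")))  # ok: smallest class is 12
rows.append({"state": "RI", "age_range": "70-79"})
print(release_check(rows, ("state", "age_range")))  # fails: smallest class is 1
```

The harder part is the quasi-identifier enumeration itself, which is a judgment call about what external data an attacker could plausibly hold; the code can only check the columns you declare.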
Document What You Did and Why
Regulators under GDPR and the EU AI Act are increasingly asking for evidence of re-identification risk assessment, not just the fact of anonymization. Document: what technique was applied, what risk assessment was performed, what residual risk was accepted and why.
Treat LLM Outputs as Potential PII Disclosures
Any model trained on personal data should be tested for memorization before deployment. Standard approaches include: membership inference tests, verbatim text extraction probes, and structured output evaluation for formats that appear in training data (email addresses, phone numbers, health record structures).
The Regulatory Trajectory
Data that genuinely meets GDPR's anonymization standard (Recital 26) falls outside the Regulation, and Article 89 then provides research derogations for personal data that does not. Supervisory authorities are increasingly questioning anonymization claims. The Article 29 Working Party's Opinion 05/2014 on Anonymisation Techniques remains the reference text, but enforcement practice has moved toward requiring demonstrable re-identification risk assessment rather than accepting self-assessment.
Under the EU AI Act, training data documentation requirements create a new disclosure surface: an organization that publishes a training data summary admitting that personal data was present may face questions about the adequacy of its anonymization measures.
The technical standard for what counts as "anonymized" is tightening. Re-identification research is a live field. Techniques that provided adequate protection five years ago may not provide adequate protection today.