Synthetic Data vs. Anonymized Data: Which Provides Better GDPR Protection?
Synthetic data has had a remarkable rise in enterprise data strategy conversations. The pitch is compelling: generate a dataset that has the same statistical properties as your real data, but contains no real individuals. Use it freely for development, testing, analytics, and AI training — without GDPR constraints.
The reality is more complicated. Whether synthetic data is outside GDPR scope depends on how it was generated, what the original data looked like, and what an attacker with access to the synthetic data could infer. The legal position is not fully settled. And the privacy guarantees provided by synthetic data generation methods vary enormously.
This post compares synthetic data and anonymized data across four dimensions: technical privacy, legal status, utility, and practical deployment.
What Is Synthetic Data?
Synthetic data is artificially generated data that mimics the statistical properties of a real dataset without containing records of real individuals. Generation methods range from simple statistical models to sophisticated deep learning approaches.
Parametric synthesis: Fit statistical models (distributions, correlations) to the real data, then sample from the fitted model. Fast, transparent, works well for structured tabular data. Privacy depends on whether the fitted model has memorized individual records.
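The idea can be sketched in a few lines. This is a minimal illustration, not a production method: it fits a multivariate normal to two hypothetical numeric columns and samples from the fitted model.

```python
# Sketch of parametric synthesis (illustrative; the "age"/"income" columns
# and their parameters are hypothetical stand-ins for real data).
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for the real data: 1,000 records with correlated age and income.
real = rng.multivariate_normal(
    mean=[40, 55_000], cov=[[100, 40_000], [40_000, 4e8]], size=1_000
)

# Fit: estimate the mean vector and covariance matrix from the real data.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample: draw as many synthetic records as desired from the fitted model.
synthetic = rng.multivariate_normal(mean=mu, cov=cov, size=1_000)

# The synthetic data reproduces means and correlations, but no row is
# copied from `real`. A two-parameter Gaussian cannot memorize individual
# records; flexible models (GANs, copulas) can.
print(np.corrcoef(real[:, 0], real[:, 1])[0, 1])
print(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1])
```

The privacy question the post raises is visible even here: the more expressive the fitted model, the closer the samples can sit to individual training records.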
GANs (Generative Adversarial Networks): Train a generator to produce samples indistinguishable from real data. High fidelity for complex distributions. Known to memorize training examples — particularly rare records.
VAEs (Variational Autoencoders): Learn a compressed latent representation of the data and sample from it. Similar fidelity to GANs with different memorization characteristics.
Differential privacy synthesis (e.g., DP-GAN, MST, AIM): Adds calibrated noise to the synthesis process to provide mathematical privacy guarantees. Lower fidelity than unconstrained synthesis but provable privacy protection.
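A minimal sketch of the idea behind marginal-based DP synthesis (real methods like MST and AIM are far more sophisticated, selecting and combining many marginals): add calibrated Laplace noise to a histogram of the real data, then sample synthetic records from the noisy histogram. The column and bin choices here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
epsilon = 1.0  # privacy budget; lower = more noise = stronger privacy

# Stand-in real data: ages 18-89 for 10,000 people.
ages = rng.integers(18, 90, size=10_000)

# True histogram over one-year age bins.
bins = np.arange(18, 91)
counts, _ = np.histogram(ages, bins=bins)

# Adding or removing one person changes one count by at most 1, so the
# sensitivity is 1; Laplace noise with scale 1/epsilon gives epsilon-DP.
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
noisy = np.clip(noisy, 0, None)  # post-processing preserves the guarantee

# Sample synthetic ages from the noisy marginal.
probs = noisy / noisy.sum()
synthetic_ages = rng.choice(bins[:-1], size=10_000, p=probs)
```

The fidelity cost is visible in the noisy bins; the privacy gain is that the released marginal, and anything sampled from it, carries a provable bound on what it reveals about any one person.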
Technical Privacy: What Can an Attacker Learn?
Membership Inference
Given a synthetic record, can an attacker determine whether a specific individual's record was in the training data? This is the fundamental privacy question for generative models.
Research consistently shows that standard (non-DP) GANs and VAEs are vulnerable to membership inference attacks, particularly for records that are unusual or that appear multiple times in training data. The attack success rate varies by model architecture, training data size, and the unusualness of the target record, but "vulnerable" is the right description for most off-the-shelf synthesis approaches.
Differentially private synthesis methods bound membership inference risk, at the cost of data fidelity.
Attribute Inference
Even without knowing whether a specific individual was in the training data, can an attacker infer a sensitive attribute of a known individual? For example: "I know Alice works at Company X and was born in 1985. Does the synthetic dataset reveal that Alice's income is €95,000?"
If the synthetic data accurately reproduces the joint distribution of employer, birth year, and income from the real data, attribute inference may be possible for individuals whose records are unique in those dimensions. This is a linkage attack, not a memorization attack — and it affects any synthetic data that preserves accurate joint distributions.
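The Alice example can be made concrete. Everything here is hypothetical (names, employers, incomes); the point is that the lookup needs nothing more than the released synthetic table and publicly known quasi-identifiers.

```python
# Synthetic records as (employer, birth_year, income) tuples.
synthetic = [
    ("CompanyX", 1985, 95_000),
    ("CompanyX", 1990, 60_000),
    ("CompanyY", 1985, 70_000),
    ("CompanyY", 1990, 72_000),
]

def infer_attribute(employer, birth_year):
    """Incomes of synthetic records matching the known quasi-identifiers."""
    return [inc for (emp, yr, inc) in synthetic
            if emp == employer and yr == birth_year]

# If the joint distribution is faithful and Alice is unique on these
# dimensions, the single match likely reveals her real income.
print(infer_attribute("CompanyX", 1985))  # [95000]
```

Note that no memorization is required: a perfectly "fresh" synthetic dataset that preserves this joint distribution leaks the same inference.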
Reconstruction
If an attacker has access to marginal statistics from the synthetic data — summary tables, aggregates — they may be able to reconstruct approximate individual-level records. This is the same attack that affects published statistics and was the reason the US Census Bureau adopted differential privacy for the 2020 census.
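A toy version of reconstruction from aggregates, with hypothetical salaries: given enough independent overlapping sums, the attacker solves a linear system and recovers individual values exactly.

```python
import numpy as np

true_salaries = np.array([50_000, 80_000, 120_000])

# Each row marks which individuals a published aggregate covers:
# total of all three, total of persons 1+2, total of persons 2+3.
A = np.array([[1, 1, 1],
              [1, 1, 0],
              [0, 1, 1]])
published = A @ true_salaries

# With as many independent aggregates as individuals, the system is
# exactly solvable -- no individual value was ever "published", yet
# all of them are recoverable.
recovered = np.linalg.solve(A, published)
print(recovered)  # matches true_salaries
```

At census scale the attacker uses many more aggregates and solves an over-determined system approximately, but the mechanism is the same, which is why adding calibrated noise to the aggregates is the standard defense.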
Legal Status Under GDPR
Is Synthetic Data "Personal Data"?
GDPR applies to personal data — information relating to an identified or identifiable natural person. Whether synthetic data qualifies as personal data depends on whether any individual can be identified from it.
Case 1: Clean synthesis from adequately anonymized data. If the original dataset was anonymized before synthesis (removing all direct and quasi-identifiers to the GDPR standard), the synthetic data generated from it is not derived from personal data. The synthesis process cannot recreate personal data that was not present in the input. This is the safest position legally — but it means the anonymization work still needs to be done.
Case 2: Synthesis directly from personal data, with DP guarantees. Differential privacy provides mathematical bounds on information leakage. The EDPB has not formally ruled that DP-synthetic data is outside GDPR scope, but DP guarantees are increasingly recognized as evidence of adequate anonymization in academic and regulatory literature. This is an unsettled area.
Case 3: Synthesis directly from personal data, without DP guarantees. Standard GAN/VAE synthesis from personal data produces a dataset that may contain memorized records and is vulnerable to membership inference. The Article 29 Working Party's position (Opinion 05/2014 on Anonymisation Techniques, carried forward by the EDPB) would likely classify this as pseudonymization at best, not anonymization. It therefore remains personal data under GDPR.
The EDPB Position
The EDPB has not issued a specific opinion on synthetic data (as of early 2026). The 2014 anonymization opinion predates modern generative models. Several national DPAs have published guidance suggesting that synthetic data does not automatically fall outside GDPR scope — the test is whether re-identification is reasonably likely given all available means, not whether the data is labeled "synthetic."
The ICO (UK, post-Brexit, but influential) published a call for views on synthetic data in 2023, acknowledging the legal uncertainty. Their tentative position: synthetic data is not automatically anonymous and must be assessed case by case.
Anonymized Data: The Traditional Approach
Traditional anonymization — PII detection and redaction, generalization, suppression — has a clearer legal trajectory. GDPR Recital 26 explicitly states that anonymous data is outside scope. The Article 29 Working Party's 2014 opinion (Opinion 05/2014) provides the reference framework for what techniques provide adequate anonymization.
The limitations are well-documented (see our post on re-identification attacks): k-anonymity at small k values provides limited protection, quasi-identifier combinations remain exploitable, and the external data landscape for linkage attacks has grown richer.
But the legal status of well-anonymized data is clearer than synthetic data: if you can demonstrate that re-identification is not "reasonably likely given all available means," the data is outside GDPR scope.
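The k-anonymity limitation mentioned above is easy to check mechanically: group records by their quasi-identifier combination and find the smallest group. This sketch uses hypothetical records and column names.

```python
from collections import Counter

records = [
    {"zip": "10115", "birth_year": 1985, "sex": "F", "diagnosis": "A"},
    {"zip": "10115", "birth_year": 1985, "sex": "F", "diagnosis": "B"},
    {"zip": "10117", "birth_year": 1990, "sex": "M", "diagnosis": "A"},
]

quasi_identifiers = ("zip", "birth_year", "sex")

# Count how many records share each quasi-identifier combination.
classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
k = min(classes.values())

# k == 1 means at least one record is unique on its quasi-identifiers
# and therefore exposed to linkage attacks.
print(f"dataset is {k}-anonymous over {quasi_identifiers}")
```

This kind of audit is part of what makes the traditional approach "auditable": the equivalence classes, and what was generalized or suppressed to enlarge them, can be inspected directly.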
Utility Comparison
For most analytical use cases, the utility comparison is nuanced:
- Synthetic data advantages:
  - Can generate unlimited records beyond the original dataset size
  - Can generate scenarios not present in the original data
  - Better for ML training when the original dataset is small and privacy constraints prevent release
  - Easier to share externally — no residual individual records to explain
- Anonymized data advantages:
  - Preserves exact statistical properties of the original population (within the anonymization technique's constraints)
  - No fidelity loss from generalization or noise addition in well-executed redaction
  - Auditable — you can verify what was removed and why
  - Legal position more established
Differential privacy synthesis is the best privacy option within synthetic data — but DP noise reduces fidelity, and the privacy-utility tradeoff must be calibrated to the use case. ε=1 provides strong privacy protection but may produce statistics that diverge significantly from the original.
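The tradeoff is easy to quantify for the simplest case, a single count with sensitivity 1, where the Laplace noise scale is 1/ε: halving ε doubles the expected error.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Error of a noisy count release at different privacy budgets.
# Sensitivity is 1, so the Laplace scale is 1/epsilon.
for epsilon in (10.0, 1.0, 0.1):
    errors = np.abs(rng.laplace(scale=1.0 / epsilon, size=10_000))
    print(f"epsilon={epsilon:>4}: mean absolute error ~ {errors.mean():.1f}")
```

Whether an expected error of ~1 on a count (ε=1) is acceptable depends entirely on the use case, which is why the calibration must be done against the statistics that actually matter downstream.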
Practical Recommendation
The "synthetic data is always privacy-safe" framing is incorrect. Both approaches require careful implementation and assessment. A practical decision framework:
- Use anonymized data when:
  - The original dataset is the ground truth you need to preserve
  - Legal certainty is paramount — the GDPR anonymization standard is well-understood
  - Use case requires exact statistical representation of the original population
  - Dataset is large enough that utility loss from redaction is manageable
- Use synthetic data when:
  - You need unlimited records or scenarios beyond the original
  - The original dataset is small and anonymized release would produce small equivalence classes
  - You are generating training data for ML from scratch (using DP synthesis)
  - You have validated that the synthesis approach provides adequate privacy for your threat model
- In all cases:
  - Document your privacy analysis before release or use
  - For synthetic data: validate against membership inference and attribute inference attacks
  - For anonymized data: validate against re-identification with available external data
  - Do not rely on the "synthetic" or "anonymized" label alone — assess the specific dataset and generation method
The GDPR does not care about labels. It cares about whether individuals can be identified. That question must be answered with evidence, not assertions.