Hook: You've tagged your PII columns in dbt. Your raw data still hits the warehouse unmasked. Here's the gap between tagging and actual compliance.
The Challenge
Modern data engineering teams use ELT pipelines (dbt, Airflow, Spark) to transform raw data before loading it into analytics warehouses (Snowflake, BigQuery, Redshift). These pipelines routinely process raw customer data containing PII — names, emails, phone numbers, addresses — before analytics engineers have a chance to apply masking. A Medium article from Voi Engineering on PII data privacy in Snowflake documents the complexity: tag-based masking policies must be defined per column, propagated through lineage, and enforced at query time across all downstream models. Without automated PII detection in the pipeline, analytics teams rely on manual column tagging, which is error-prone and doesn't scale as schemas evolve.
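The manual-tagging gap above can be narrowed by scanning sample rows and emitting column tags automatically. The sketch below illustrates the idea with two toy regex detectors standing in for a real detection engine such as Presidio; the function names, the `pii`/`pii_type` meta keys, and the detectors themselves are illustrative assumptions, not part of any dbt or Presidio API.

```python
import re

# Toy detectors for illustration only; a production pipeline would use a
# real PII engine (e.g. Presidio) instead of these simplified regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def detect_pii_columns(rows):
    """Scan sample rows (dicts) and return {column: pii_type} for flagged columns."""
    flagged = {}
    for row in rows:
        for col, val in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if col not in flagged and isinstance(val, str) and pattern.search(val):
                    flagged[col] = pii_type
    return flagged

def to_dbt_meta(flagged):
    """Emit a dbt-style schema.yml fragment (as a dict) tagging flagged columns."""
    return {
        "columns": [
            {"name": col, "meta": {"pii": True, "pii_type": pii_type}}
            for col, pii_type in sorted(flagged.items())
        ]
    }

sample = [
    {"id": "1", "email": "a@example.com", "phone": "+1 (555) 123-4567", "city": "Oslo"},
]
print(to_dbt_meta(detect_pii_columns(sample)))
```

In a real pipeline the generated tags would feed the warehouse's tag-based masking policies, so detection runs once per schema change instead of relying on an engineer to remember each new column.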
Technical Approach
Batch processing supports CSV, JSON, and XML formats, applying the same PII detection to every file in a batch. Exported processing metadata (CSV/JSON) provides the data lineage report that compliance teams need. Running the same Presidio-based engine across all platforms keeps results consistent between manual review (web/desktop) and automated batch processing.
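A minimal sketch of that batch flow, assuming a trivial email regex in place of the actual Presidio engine: each file is scanned with the same detector, and the per-file findings are exported as a JSON or CSV metadata report. File names, record fields, and function names here are illustrative, not part of any documented interface.

```python
import csv
import io
import json
import re

# Toy detector standing in for the Presidio-based engine described above.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scan_text(name, text):
    """Return one metadata record per file: detector, match count."""
    hits = EMAIL.findall(text)
    return {"file": name, "entity_type": "EMAIL_ADDRESS", "count": len(hits)}

def scan_batch(files):
    """files: {name: text}. Apply identical detection to every file."""
    return [scan_text(name, text) for name, text in sorted(files.items())]

def export_metadata(records, fmt="json"):
    """Export the audit/lineage report as JSON or CSV text."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["file", "entity_type", "count"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

batch = {
    "users.csv": "id,email\n1,a@example.com\n2,b@example.com\n",
    "events.json": '{"user": "c@example.com", "action": "login"}',
}
print(export_metadata(scan_batch(batch), fmt="csv"))
```

The point of the metadata export is auditability: compliance teams get a machine-readable record of what was scanned and what was found, independent of which interface (web, desktop, or batch) did the scanning.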