Started June 1, 2025
Completed
Data Quality Auditing Framework
Implemented automated data validation using Great Expectations and Apache Airflow.
Data Validation · Great Expectations · Airflow · SQL Server · Data Quality
Overview
Built a data auditing pipeline that detects missing, duplicated, or schema-breaking data in critical financial datasets. Used Great Expectations for test definitions and Apache Airflow for orchestrated runs across historical and incremental loads.
The project demonstrates:
- Data validation on financial systems
- Airflow DAG design
- Automated QA workflows
- Column-level profiling and anomaly detection
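A minimal sketch of the column-level checks, assuming the legacy `ge.from_pandas` pandas-dataset API (Great Expectations pre-1.0); the connection string, `dbo.GLTransactions` table, and column names are illustrative placeholders, not the project's actual schema:

```python
import great_expectations as ge
import pandas as pd
from sqlalchemy import create_engine

# Placeholder SQL Server connection string; credentials are illustrative only.
engine = create_engine(
    "mssql+pyodbc://audit_user:***@finance-db/FinanceDW"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Pull one batch of a (hypothetical) core financial table and wrap it for validation.
batch = ge.from_pandas(pd.read_sql("SELECT * FROM dbo.GLTransactions", engine))

# Column-level assertions: schema shape, completeness, uniqueness, value ranges.
results = [
    batch.expect_table_columns_to_match_ordered_list(
        ["transaction_id", "account_id", "posted_date", "amount"]
    ),
    batch.expect_column_values_to_not_be_null("transaction_id"),
    batch.expect_column_values_to_be_unique("transaction_id"),
    batch.expect_column_values_to_be_between("amount", min_value=-1e9, max_value=1e9),
]

failed = [r for r in results if not r.success]
if failed:
    raise ValueError(f"{len(failed)} expectation(s) failed")
```

In the real pipeline such assertions would live in named, versioned expectation suites rather than inline calls, so they can be rendered into Data Docs and reused across loads.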
Architecture
- Airflow schedules and monitors validation DAGs
- Great Expectations performs column-level assertions
- SQL Server acts as the primary data source
- The Slack API delivers alerts on validation failures
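How these pieces fit together, sketched as a single Airflow 2.x DAG: a Python task runs the Great Expectations checks and a failure callback posts to a Slack incoming webhook. The DAG id, staging file path, and webhook URL are placeholders, not the production values.

```python
from datetime import datetime, timedelta

import great_expectations as ge
import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook


def notify_slack(context):
    """Failure callback: post the failing task and run date to Slack."""
    text = (
        f":x: Data quality audit failed: {context['task_instance'].task_id} "
        f"({context['ds']})"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


def validate_transactions(**_):
    """Run column-level Great Expectations checks on the staged extract."""
    # Placeholder staging file produced by the upstream ETL task.
    batch = ge.from_pandas(pd.read_parquet("/tmp/gl_transactions.parquet"))
    checks = [
        batch.expect_column_values_to_not_be_null("transaction_id"),
        batch.expect_column_values_to_be_unique("transaction_id"),
    ]
    if not all(c.success for c in checks):
        raise ValueError("Great Expectations validation failed")


with DAG(
    dag_id="data_quality_audit",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="validate_transactions",
        python_callable=validate_transactions,
        on_failure_callback=notify_slack,
    )
```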
Status
Audits are now integrated into every ETL run. Alerting reduced QA turnaround time by 80%.
Challenges
- False positives on NULL fields caused by an evolving schema
- Synchronizing audit time windows with ETL cutoffs (see the sketch after this list)
- Managing Great Expectations suites across staging and production environments
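One way to keep audit windows aligned with ETL cutoffs is to scope the audit query to Airflow's data interval. The sketch below assumes the same placeholder table and connection as above and uses the `data_interval_start` / `data_interval_end` context values available in Airflow 2.2+.

```python
import great_expectations as ge
import pandas as pd
from sqlalchemy import create_engine


def audit_etl_window(**context):
    """Audit only the rows inside this run's data interval.

    Tying the audit window to Airflow's data interval keeps the audit cutoff
    aligned with the window the ETL run actually loaded.
    """
    start = context["data_interval_start"]
    end = context["data_interval_end"]
    engine = create_engine("mssql+pyodbc://...")  # placeholder connection
    query = (
        "SELECT * FROM dbo.GLTransactions "
        f"WHERE posted_date >= '{start}' AND posted_date < '{end}'"
    )
    batch = ge.from_pandas(pd.read_sql(query, engine))
    result = batch.expect_column_values_to_not_be_null("transaction_id")
    if not result.success:
        raise ValueError("Window-scoped audit failed")
```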
Lessons Learned
- Keep expectations versioned and environment-scoped
- Null tolerance must match data-freshness expectations (see the sketch after this list)
- Use Data Docs for stakeholder visibility
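A small sketch of how environment-scoped null tolerance might look, using Great Expectations' `mostly` parameter; the environment variable, thresholds, and file path are illustrative assumptions, not the project's actual configuration.

```python
import os

import great_expectations as ge
import pandas as pd

# Assumed per-environment tolerances: staging data lands earlier and is less
# complete, so a looser null threshold there avoids false positives.
NULL_TOLERANCE = {"staging": 0.90, "prod": 0.99}
mostly = NULL_TOLERANCE[os.environ.get("AUDIT_ENV", "prod")]

# Placeholder batch; in the pipeline this comes from the scheduled extract.
batch = ge.from_pandas(pd.read_parquet("/tmp/gl_transactions.parquet"))
result = batch.expect_column_values_to_not_be_null("posted_date", mostly=mostly)
print(result.success)
```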
Related Services
- Data quality monitoring
- ETL validation systems
- Great Expectations onboarding
- Airflow integration for audits
Recent Activity
Wired Slack alerts to Great Expectations failures in Airflow; automated QA coverage now live.
Launched column-level data validation suite using Great Expectations against core financial tables.