Background
Schneider Electric operates one of the most complex SAP environments in the global energy sector. With master data maintained across dozens of plant codes and business units, data quality degradation was a slow, quiet problem, until it became an urgent one. By the time issues surfaced in downstream reporting, the cost of remediation had compounded significantly.
The goal of this project was straightforward: build a monitoring pipeline that catches master data issues early, surfaces them clearly, and gives the governance team the visibility they need to act before problems propagate.
Problem: Master data quality issues in SAP were invisible until they caused downstream reporting failures. The team needed a proactive monitoring layer, not a reactive one.
Approach & Architecture
Rather than building a heavyweight data platform, the solution was designed to be lean and maintainable. Raw data extracted from SAP feeds into PostgreSQL via a scheduled ELT process, where a layer of SQL transformations applies the quality rules defined with the data governance team.
Pipeline Overview
The pipeline runs on a daily schedule triggered by GitHub Actions. Each run extracts a delta from the SAP source, loads it into a staging schema in PostgreSQL (hosted on Neon Cloud), then applies a suite of SQL transformations that evaluate completeness, conformity, and consistency rules.
-- Example: completeness check on material master
SELECT
    plant_code,
    COUNT(*) AS total_records,
    COUNT(*) FILTER (WHERE uom IS NULL) AS missing_uom,
    COUNT(*) FILTER (WHERE material_group IS NULL) AS missing_material_group,
    ROUND(
        COUNT(*) FILTER (WHERE uom IS NULL)::numeric
            / NULLIF(COUNT(*), 0) * 100, 2
    ) AS pct_missing_uom
FROM staging.material_master
GROUP BY plant_code
ORDER BY pct_missing_uom DESC;
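The daily run described above can be sketched as a simple three-step orchestration. This is a minimal illustration only: the function names, the sample record, and the summary shape are assumptions for the sake of the example, not the production code.

```python
"""Sketch of the daily ELT run: extract delta -> load staging -> run rules.

All function bodies are stubs; in production, extract_delta reads the SAP
export, load_staging bulk-inserts into staging.material_master in PostgreSQL,
and run_quality_rules executes the SQL rule suite.
"""
from datetime import date, timedelta


def extract_delta(since: date) -> list[dict]:
    # Stub: return a canned sample record instead of querying the SAP source.
    return [{"material_id": "M-1001", "plant_code": "DE01", "uom": None}]


def load_staging(rows: list[dict]) -> int:
    # Stub: pretend to bulk-insert into the staging schema; return row count.
    return len(rows)


def run_quality_rules() -> dict:
    # Stub: pretend to apply the SQL rule suite and return summary metrics.
    return {"rules_run": 12, "failures": 1}


def daily_run(today: date) -> dict:
    """One scheduled run, as triggered by GitHub Actions."""
    rows = extract_delta(since=today - timedelta(days=1))
    loaded = load_staging(rows)
    summary = run_quality_rules()
    return {"loaded": loaded, **summary}
```

The value of keeping the orchestration this thin is that all real logic lives in version-controlled SQL, which is what the governance team reviews.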
Quality Rules
Rules were co-designed with the data governance team to reflect actual business logic, not generic data quality frameworks. Each rule produces a pass/fail result per record, which is then aggregated to a plant-level score. The scoring model uses weighted rules based on downstream impact.
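The weighted aggregation can be sketched as follows. The rule names and weights here are illustrative placeholders, not the actual rule set agreed with the governance team.

```python
# Hypothetical rule weights, reflecting downstream impact (higher = more critical).
RULE_WEIGHTS = {
    "missing_uom": 3.0,
    "missing_material_group": 2.0,
    "invalid_plant": 1.0,
}


def plant_score(pass_rates: dict[str, float]) -> float:
    """Aggregate per-rule pass rates (0.0-1.0) into a weighted plant score (0-100).

    A rule with no recorded pass rate is treated as fully failing, so gaps in
    rule coverage drag the score down rather than inflating it.
    """
    total_weight = sum(RULE_WEIGHTS.values())
    weighted = sum(w * pass_rates.get(rule, 0.0) for rule, w in RULE_WEIGHTS.items())
    return round(weighted / total_weight * 100, 1)
```

Treating missing rule results as failures is a deliberate design choice: it keeps the score honest when a rule is added but not yet backfilled for every plant.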
All rule logic is version-controlled in GitHub. Any changes to thresholds or rule definitions are tracked and auditable, an important requirement for a client operating in a regulated environment.
Key Challenges
SAP Data Variability
SAP exports are notoriously inconsistent in structure across plant codes and business units. Null handling, encoding differences, and field repurposing across regions required a normalisation layer before any quality rules could be applied reliably.
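A typical normalisation pass looks something like the sketch below. The specific field names and rules are assumptions for illustration; the real layer handles many more region-specific quirks.

```python
# Illustrative normalisation applied to each raw SAP record before quality
# rules run. Field names ("plant_code", "uom") are examples, not the client's
# actual schema.
def normalise_record(raw: dict) -> dict:
    # Harmonise key casing across exports ("Plant_Code" vs "plant_code").
    rec = {key.strip().lower(): value for key, value in raw.items()}

    # Treat empty-string and whitespace-only values as true NULLs, so the
    # completeness rules see them consistently across regions.
    for key, value in rec.items():
        if isinstance(value, str):
            value = value.strip()
            rec[key] = value if value else None

    # Uppercase plant codes so 'de01' and 'DE01' aggregate together.
    if rec.get("plant_code"):
        rec["plant_code"] = rec["plant_code"].upper()

    return rec
```

Running every record through one normalisation function, rather than special-casing inside each rule, keeps the SQL rule definitions simple and region-agnostic.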
Defining "Quality"
The most time-consuming part of the project wasn't technical: it was aligning on what "good" data actually looks like. Different teams had different definitions. Facilitated workshops with the governance team produced a shared rule set that reflected real operational requirements rather than theoretical standards.
Document every rule decision and the reasoning behind it. Six months later, no one remembers why a threshold was set to 95% instead of 98%. A decision log saves future arguments.
Results
The pipeline has been in production since mid-2024. Key outcomes after the first quarter of operation:
Beyond the numbers, the governance team shifted from reactive firefighting to proactive monitoring. The Tableau dashboard now anchors their weekly data quality review meeting.
The Dashboard
The Tableau dashboard connects directly to the PostgreSQL results tables via a live connection. It surfaces three views: a plant-level scorecard, a rule-level drill-down, and a trend view showing quality score movement over time.
The full dashboard cannot be shared publicly due to client confidentiality. The screenshot above is a representative mock-up using anonymised data.
Reflections
This project reinforced something I've seen repeatedly: the technical build is rarely the hardest part. A well-designed SQL pipeline takes weeks. Getting fifteen stakeholders to agree on what "complete" means for a material record takes months. The analyst's job is as much facilitation as it is engineering.
If I were to start again, I would invest more time upfront in data profiling before the rule design workshops. Walking into those sessions with concrete examples of where the data breaks down is far more productive than starting from abstract principles.