Python & Automation Personal Project

Automated Sales Performance Tracker
(AI Generated Sample Project)

A personal project built to practise end-to-end pipeline development. Python pulls and cleans synthetic sales data, loads it to a PostgreSQL database on Neon Cloud via a scheduled GitHub Actions workflow, and a Power BI report sits on top for slice-and-dice analysis.

Python · PostgreSQL · Power BI · Pandas · GitHub Actions · Neon Cloud · DAX · SQL
Type Personal Project
Timeline 3 weeks · 2024
Goal Pipeline Practice
Status Complete

Background

Most of the pipeline work I do professionally runs inside client environments — which means I can't always tinker with the architecture, swap out components, or break things deliberately to see what happens. I wanted a project I owned end-to-end, where every decision was mine, and where the failure mode was just a broken personal project rather than a production incident.

Sales data is a useful domain for this kind of practice. The shape of the problem is familiar — transactions, products, regions, time — but it has enough complexity to make the pipeline interesting. Aggregations behave differently depending on how the data is modelled, and there are natural opportunities to practise both ETL logic and reporting design in the same project.

🎯

Goal: Build a fully automated, end-to-end pipeline from raw data to live Power BI report — using only tools already in my professional stack, with no manual steps after initial setup.

The Data

Rather than scraping real data or relying on a static CSV, I generated a synthetic sales dataset using Python's faker library combined with some hand-crafted logic to make the data behave realistically. The dataset covers twelve months of fictional transactions for a mid-sized retail business with four product categories and three sales regions.

Synthetic data design

The key design decision was making the data tell a story. Purely random data produces flat, uninteresting reports. Instead, I seeded realistic patterns into the generation logic: seasonal peaks in Q4, a struggling product category in the second half of the year, and a consistently outperforming region. This gave the eventual Power BI report something to actually surface.

Python
import pandas as pd
import numpy as np
from datetime import date
from faker import Faker

fake = Faker('en_AU')
rng  = np.random.default_rng(seed=42)

REGIONS    = ['NSW', 'VIC', 'QLD']
CATEGORIES = ['Electronics', 'Apparel', 'Home & Garden', 'Sports']

# Q4 seasonal uplift — transactions more likely Oct–Dec
def seasonal_weight(month: int) -> float:
    return 1.6 if month >= 10 else 1.0

# Turn the uplift into month-sampling probabilities
MONTHS  = np.arange(1, 13)
month_p = np.array([seasonal_weight(m) for m in MONTHS])
month_p = month_p / month_p.sum()

rows = []
for _ in range(10_000):
    month = int(rng.choice(MONTHS, p=month_p))
    rows.append({
        'transaction_id': fake.uuid4(),
        'date':           date(2024, month, int(rng.integers(1, 29))),  # day ≤ 28 avoids month-length edge cases
        'region':         rng.choice(REGIONS, p=[0.45, 0.35, 0.20]),
        'category':       rng.choice(CATEGORIES),
        'revenue':        round(rng.normal(loc=250, scale=80), 2),
        'units_sold':     int(rng.integers(1, 12)),
    })

df = pd.DataFrame(rows)
df['revenue'] = df['revenue'].clip(lower=10)   # floor at $10; the normal draw can go negative

Pipeline Architecture

The pipeline has three distinct stages: generation and cleaning in Python, loading to PostgreSQL on Neon Cloud, and transformation via SQL views that Power BI queries directly. No orchestration framework — just a GitHub Actions workflow on a daily cron schedule.

Stage 1: Generate and clean

The Python script runs first, generating a fresh batch of data and running it through a cleaning step. In a real scenario this would be an extract from a source system, but the cleaning logic is identical either way: type coercion, null handling, deduplication on transaction_id, and basic sanity checks on numeric ranges.
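A sketch of that cleaning pass, assuming the column names from the generation script. clean_sales is an illustrative name rather than the project's actual function:

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Type coercion: unparseable dates and revenues become NaN/NaT
    out['date'] = pd.to_datetime(out['date'], errors='coerce')
    out['revenue'] = pd.to_numeric(out['revenue'], errors='coerce')
    # Null handling on the columns the mart layer depends on
    out = out.dropna(subset=['transaction_id', 'date', 'revenue'])
    # Deduplication on the business key
    out = out.drop_duplicates(subset='transaction_id', keep='first')
    # Basic sanity checks on numeric ranges
    out = out[(out['revenue'] > 0) & (out['units_sold'] >= 1)]
    return out.reset_index(drop=True)
```

The order matters: coercion first so that null handling catches bad values, then deduplication, then range checks on what survives.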

Stage 2: Load to Neon

The cleaned DataFrame is written to a staging.sales_raw table in PostgreSQL using psycopg2. The connection string is stored as a GitHub Actions secret, so credentials never touch the repository. The load uses a truncate-and-reload pattern — simple, predictable, and easy to reason about for a dataset of this size.

Python
import psycopg2
import os
from psycopg2.extras import execute_values

conn = psycopg2.connect(os.environ['NEON_CONNECTION_STRING'])
cur  = conn.cursor()

cur.execute("TRUNCATE TABLE staging.sales_raw;")

execute_values(
    cur,
    """INSERT INTO staging.sales_raw
       (transaction_id, date, region, category, revenue, units_sold)
       VALUES %s""",
    [tuple(row) for row in df.itertuples(index=False)],
)

conn.commit()
cur.close()
conn.close()

Stage 3: SQL transformation layer

Rather than transforming data in Python before loading, the mart layer lives in PostgreSQL as a set of views. Power BI connects to these views rather than the raw table. This keeps the reporting layer stable even if the upstream data changes shape — the view definitions absorb the translation.

SQL
-- mart.v_sales_monthly: monthly aggregates by region and category
CREATE OR REPLACE VIEW mart.v_sales_monthly AS
SELECT
    DATE_TRUNC('month', date)::date   AS month,
    region,
    category,
    SUM(revenue)                      AS total_revenue,
    SUM(units_sold)                   AS total_units,
    COUNT(*)                          AS transaction_count,
    ROUND(AVG(revenue)::numeric, 2)   AS avg_order_value
FROM staging.sales_raw
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
ℹ️

Neon Cloud's serverless PostgreSQL spins down when idle and back up on connection. For a daily pipeline this is fine — the first query of the day takes a second or two longer while the instance warms. For lower-latency use cases it would need a different approach.
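One lightweight way to absorb the cold start is a retry wrapper around the connection call. This is a sketch rather than code from the project; connect_fn stands in for any zero-argument callable, such as a functools.partial around psycopg2.connect:

```python
import time

def connect_with_retry(connect_fn, attempts=3, delay=2.0):
    """Call connect_fn up to `attempts` times, backing off between tries."""
    last_err = None
    for attempt in range(attempts):
        try:
            return connect_fn()
        except Exception as err:  # psycopg2.OperationalError in practice
            last_err = err
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))  # linear backoff while the instance warms
    raise last_err
```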

Automation with GitHub Actions

The entire pipeline runs automatically on a daily schedule using a GitHub Actions workflow. The workflow installs dependencies, runs the Python script, and exits. If anything fails, GitHub sends an email notification — a lightweight alerting mechanism that requires zero additional infrastructure.

YAML
name: Sales Pipeline

on:
  schedule:
    - cron: '0 6 * * *'   # 06:00 UTC daily
  workflow_dispatch:        # manual trigger for testing

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        env:
          NEON_CONNECTION_STRING: ${{ secrets.NEON_CONNECTION_STRING }}
        run: python pipeline/run.py

Power BI Report

Power BI Desktop connects to Neon via the PostgreSQL connector, pointing at the mart views. The report has three pages: an executive summary with KPI cards, a regional breakdown with a small multiples bar chart, and a product category trend view showing month-on-month revenue movement.

DAX measures

Most of the logic lives in the SQL views, keeping DAX relatively clean. The main measures I wrote are a rolling 3-month average for the trend line and a simple period-over-period comparison for the KPI cards.

DAX
-- Rolling 3-month average revenue
Revenue 3M Avg =
AVERAGEX(
    DATESINPERIOD(
        'Calendar'[Date],
        LASTDATE('Calendar'[Date]),
        -3,
        MONTH
    ),
    [Total Revenue]
)

-- Month-over-month change %
MoM Revenue % =
VAR CurrentMonth = [Total Revenue]
VAR PriorMonth   = CALCULATE([Total Revenue], DATEADD('Calendar'[Date], -1, MONTH))
RETURN
    DIVIDE(CurrentMonth - PriorMonth, PriorMonth)

What I Got Out of It

0 · Manual steps after initial setup
3 · Pipeline stages: generate, load, transform
~8s · Full pipeline run time end-to-end

The pipeline itself is not complex — that was deliberate. The value of the project was in connecting all the pieces in a way I controlled completely. I now have a working pattern for secret management via GitHub Actions, a clear mental model of the truncate-and-reload vs incremental load trade-off, and a better feel for where SQL transformation belongs vs Python transformation.

The Power BI side also surfaced something useful: connecting to a serverless database is straightforward, but scheduled dataset refresh in the Power BI Service requires an on-premises data gateway for non-cloud connectors. Worth knowing before assuming everything is plug-and-play.

💡

Personal projects are most useful when they deliberately use tools from your real stack rather than learning new ones. The goal here wasn't to learn Python — it was to practise the seams between familiar tools.

Reflections

If I were building this for a real client, the truncate-and-reload approach would need to be replaced with an incremental load pattern as the dataset grew. The current setup regenerates all 10,000 rows daily, which is fine for a toy dataset but would not scale to millions of records without either partitioning the table or switching to an upsert-based load.
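As a sketch of that upsert-based pattern, PostgreSQL's INSERT ... ON CONFLICT handles the merge directly. The target table mart.sales and its primary key on transaction_id are hypothetical here, not objects that exist in the current project:

```sql
-- Illustrative incremental load: merge staging rows into a persistent table
INSERT INTO mart.sales
    (transaction_id, date, region, category, revenue, units_sold)
SELECT transaction_id, date, region, category, revenue, units_sold
FROM staging.sales_raw
ON CONFLICT (transaction_id) DO UPDATE
SET revenue    = EXCLUDED.revenue,
    units_sold = EXCLUDED.units_sold;
```

With this in place, the staging table only needs to carry new or changed rows on each run rather than the full history.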

I would also add proper logging. At the moment, GitHub Actions provides run logs, but there is no structured record of how many rows were loaded, whether any failed validation checks, or how long each stage took. A simple log table in Neon would cost nothing and make debugging significantly faster.
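A sketch of what that log table could look like; the ops schema name and all column choices are assumptions, not existing objects in the project:

```sql
-- Hypothetical run-log table: one row per pipeline execution
CREATE TABLE IF NOT EXISTS ops.pipeline_runs (
    run_id        serial PRIMARY KEY,
    started_at    timestamptz NOT NULL DEFAULT now(),
    rows_loaded   integer,
    rows_rejected integer,
    duration_s    numeric(6, 2),
    status        text NOT NULL          -- 'success' or 'failed'
);
```

The Python script would then insert one row at the end of each run, inside the same transaction as the load, so a failed load never records a misleading success.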

The project is on GitHub if you want to look at the code directly.


Questions about this project? Get in touch.