AI / ML Product · Energy Tech · B2B SaaS · Watt Footprint · 2025

Rebuilding a broken energy bill pipeline: twice

How we took a bill reader from 597 inaccurate extractions to a validated AI pipeline processing 930 bills, cutting manual review time by 80%.

My Role: Product Owner
Timeline: Jul – Dec 2025
Team: 5 people
Geography: Ireland · UK · UAE

Key results:
- 80% reduction in processing time
- 95% PDF extraction accuracy
- 80% image bill accuracy
- 333 net new bills in 6 months

The data was wrong. On almost every bill.

When I joined Watt Footprint in July 2025, the platform was processing energy bills through a direct OpenAI call. It would extract fields — MPRN, account number, invoice number, energy charges, standing charges, PSO levy, VAT, totals — and push them into the database. In theory. In practice, I was finding errors on nearly every bill I spot-checked.

Account numbers off by a digit. Unit rates misread. Standing charges swapped with usage charges. Financial totals that didn't add up. I stopped trusting the system within the first week.

The bill volume at that point was 597. By December it was 930. And with the client base actively growing into the UK and UAE, the problem was only going to compound. Irish, UK, and UAE energy bills look different. Electricity, gas, and water each have their own billing formats. Some came in as clean PDFs. Others were scanned images — blurry, tilted, inconsistently laid out. Accommodating all of that while the extraction pipeline was already struggling was the core challenge.

Manual validation was the workaround. Ten bills took 60 to 90 minutes: read, cross-check, format the numbers, verify the totals. High-concentration work that shouldn't have been necessary at that scale.

The business impact wasn't just efficiency. When a client's energy dashboard shows wrong data, they notice. Onboarding slows down. Trust erodes before the product has a chance to prove itself.

A small team, mixed backgrounds, international scope.

The team working on this: a Head of Operations, a Software Lead, a Data Analyst, a full-stack SWE, and me as PO. One of the harder parts of the project was that the HOO and Software Lead came from business backgrounds. The trade-offs between different OCR approaches, or why adding a validation layer was worth the complexity, weren't immediately obvious to them.

That meant a lot of the work on my end was translation — explaining what Textract could and couldn't do, what accuracy meant in a HITL context, why a 75% threshold made sense as a routing rule. Making sure everyone understood the constraints before decisions got locked in.

We also had currency to deal with. When the first UK and UAE clients came on board, the bill reader had been built for euros. Adding pound sterling and UAE dirham support wasn't just a data field change — it touched parsing logic, reporting, and how totals were validated downstream.
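To illustrate why multi-currency support touched parsing logic: amounts had to be normalised together with a currency code before totals could be validated. The sketch below is illustrative only; the symbol-to-code mapping, the default currency, and the dot-decimal assumption are mine, not the production logic.

```python
import re
from decimal import Decimal

# Illustrative marker-to-ISO-code mapping (not the production table).
CURRENCY_MARKERS = {
    "€": "EUR", "EUR": "EUR",
    "£": "GBP", "GBP": "GBP",
    "AED": "AED", "د.إ": "AED",
}

def parse_amount(raw: str, default_currency: str = "EUR") -> tuple[str, Decimal]:
    """Return (currency_code, amount) from a bill string like '£1,234.56'."""
    currency = default_currency
    for marker, code in CURRENCY_MARKERS.items():
        if marker in raw:
            currency = code
            break
    # Keep digits, separators, and sign; assume a dot decimal separator
    # and comma thousands separators (true for the bills described here).
    digits = re.sub(r"[^\d.,\-]", "", raw).replace(",", "")
    return currency, Decimal(digits)
```

Downstream validation then compares amounts per currency rather than mixing euro and dirham totals in one sum.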

Three versions, each fixing what the last one broke.

Version 1: Direct OpenAI Extraction
Called OpenAI directly to extract structured fields from bill text. Fast to build, but the model hallucinated values and had no error-catching mechanism.
Outcome: scrapped — too many errors in production.

Version 2: AWS Textract OCR
Replaced the OpenAI extraction with Textract for OCR. Better character accuracy on printed PDFs, but the parsing logic was fragile and calculated fields kept going wrong on edge cases.
Outcome: improved — but parsing still brittle.

Version 3: Textract + LLM Validation
Textract extracts; an OpenAI reasoning model validates. Each field is checked for logical consistency before it hits the database, with human review for anything under 75% confidence.
Outcome: shipped — 95% PDF accuracy.

The v3 architecture was the one that held:

Upload (S3) → Extract (Textract) → Structure (JSON) → Validate (OpenAI) → Persist (PostgreSQL)

The key decision in v3 was adding the reasoning model as a validation step, not an extraction step. Textract is better at OCR than a general-purpose LLM. But Textract can't tell you whether a unit rate that looks correct actually multiplies to the stated total, or whether a VAT figure is internally consistent with the pre-tax charge. That's where OpenAI reasoning models came in — they were the strongest available at the time for this kind of structured logical checking, and we had startup credits that made it feasible without a material cost increase.

Anything that came through with a validation score below 75% was flagged for human review rather than pushed to the database. That routing rule kept bad data out while making sure the reviewer's time was spent only where the system wasn't confident.
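The consistency rules and the 75% routing cut can be sketched as deterministic checks. In production the checking was performed by the OpenAI reasoning model, so the field names, tolerance, and scoring below are illustrative assumptions, not the shipped implementation.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")   # illustrative rounding tolerance
HITL_THRESHOLD = 0.75         # bills below this go to human review

def validate_bill(fields: dict) -> tuple[float, list[str]]:
    """Score a bill's internal consistency (0..1) and list failed checks."""
    issues = []
    checks = 0

    def check(name: str, ok: bool) -> None:
        nonlocal checks
        checks += 1
        if not ok:
            issues.append(name)

    # Unit rate x units consumed should reproduce the energy charge.
    check("energy_charge",
          abs(fields["unit_rate"] * fields["units"] - fields["energy_charge"]) <= TOLERANCE)
    # VAT should be consistent with the pre-tax subtotal.
    check("vat",
          abs(fields["subtotal"] * fields["vat_rate"] - fields["vat"]) <= TOLERANCE)
    # Grand total should equal subtotal plus VAT.
    check("total",
          abs(fields["subtotal"] + fields["vat"] - fields["total"]) <= TOLERANCE)

    score = (checks - len(issues)) / checks
    return score, issues

def route(fields: dict) -> str:
    """Apply the HITL routing rule: persist confident bills, flag the rest."""
    score, _ = validate_bill(fields)
    return "persist" if score >= HITL_THRESHOLD else "human_review"
```

A bill failing one of three checks scores about 0.67 and is routed to review; a fully consistent bill scores 1.0 and is persisted.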

Before committing to Textract, we stress-tested it ourselves.

Before building v2 properly, the data analyst put together a quick Python prototype that called the Textract API directly and saved results to CSV. We ran it against a deliberate mix: clean PDFs, scanned bills, blurry images, awkward layouts. The kind of bills that were already causing problems in the live system.
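A minimal version of that prototype might look like the following, assuming synchronous Textract calls via boto3 (the sync `detect_document_text` API handles images and single-page PDFs; multi-page PDFs need the async API). File handling and CSV layout here are illustrative, not the analyst's actual script.

```python
import csv

def extract_lines(textract_response: dict) -> list[str]:
    """Pull LINE blocks out of a Textract DetectDocumentText response."""
    return [b["Text"] for b in textract_response.get("Blocks", [])
            if b.get("BlockType") == "LINE"]

def run_prototype(bill_paths: list[str], out_csv: str = "textract_output.csv") -> None:
    """Call Textract on each bill and dump the recognised lines to CSV."""
    import boto3  # AWS SDK; region and credentials assumed to be configured
    client = boto3.client("textract")
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "line_no", "text"])
        for path in bill_paths:
            with open(path, "rb") as doc:
                resp = client.detect_document_text(Document={"Bytes": doc.read()})
            for i, line in enumerate(extract_lines(resp)):
                writer.writerow([path, i, line])
```

Keeping the response-parsing separate from the API call makes the line extraction testable against canned responses without AWS access.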

For validation, we built a ground-truth spreadsheet — manually entered correct values alongside OCR output for each bill. Accuracy was calculated field by field using a three-tier scoring approach:

- Match (score 1): predicted value exactly matches the manually verified ground truth.
- Near miss (score 0.5): mathematically correct but with a minor formatting difference (e.g. trailing zero, currency symbol).
- Fail (score 0): value significantly wrong, or field not extracted at all.
- HITL threshold (75%): bills scoring below this were routed to human review rather than pushed to the database.
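The three-tier scoring can be sketched as follows. The near-miss normalisation (stripping currency symbols and comparing numerically) is an assumption based on the examples above, not the exact spreadsheet formula.

```python
import re

def field_score(predicted: str, truth: str) -> float:
    """Three-tier field score: 1 exact match, 0.5 near miss, 0 fail."""
    if predicted == truth:
        return 1.0

    def as_number(s: str):
        # Strip currency symbols, spaces, and thousands separators.
        cleaned = re.sub(r"[^\d.\-]", "", s)
        try:
            return float(cleaned)
        except ValueError:
            return None

    p, t = as_number(predicted), as_number(truth)
    if p is not None and t is not None and p == t:
        return 0.5  # mathematically correct, formatting differs
    return 0.0

def bill_score(predicted: dict, truth: dict) -> float:
    """Average field score for a bill, compared against the 75% HITL threshold."""
    fields = truth.keys()
    return sum(field_score(str(predicted.get(f, "")), str(truth[f]))
               for f in fields) / len(fields)
```

A bill with one exact match and one near miss across two fields scores 0.75, right at the routing threshold.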

For financial fields specifically, we also tracked Financial Delta — comparing the sum of OCR-extracted totals against the sum of manually verified totals across the batch. A small per-bill error can compound quickly when you're reporting on 13 enterprise accounts' energy spend.
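As described, Financial Delta reduces to a batch-level relative error between extracted and verified totals (this formulation is assumed from the description above):

```python
def financial_delta(ocr_totals: list[float], truth_totals: list[float]) -> float:
    """Relative gap between OCR-extracted and verified bill totals for a batch."""
    ocr_sum, truth_sum = sum(ocr_totals), sum(truth_totals)
    return abs(ocr_sum - truth_sum) / truth_sum
```

A 5-euro slip on a 300-euro batch is under 2%, but the same per-bill error repeated across hundreds of bills shows up clearly at the batch level.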

On clean printed PDFs, Textract reliably achieves a Character Error Rate (CER) below 2–3%, i.e. character accuracy above 97%. Overall system accuracy, once scanned and blurry bills are accounted for, came in around 95% for PDFs and 80% for image-based bills.
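The CER figure above is the standard Levenshtein-edit-distance metric: edits needed to turn the OCR output into the reference text, divided by the reference length. A minimal implementation for reference:

```python
def cer(predicted: str, reference: str) -> float:
    """Character Error Rate: Levenshtein edit distance over reference length."""
    m, n = len(predicted), len(reference)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / n if n else 0.0
```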

80% faster. Lower cognitive load. Clients noticed.

| Metric | Manual | Automated + HITL | Improvement |
|---|---|---|---|
| Time per 10 bills | 60 – 90 mins | 12 – 18 mins | ~80% reduction |
| Action per bill | Type, calculate, verify | Confirm or correct | Effort type shift |
| Cognitive load | High focus, sustained | Low focus, spot-check | Significant reduction |
| Bills in system | 597 (Jul 2025) | 930 (Dec 2025) | +333 in 6 months |

More than the time saving, the HITL model changed the nature of the work. Manual validation is high-concentration — reading energy tables and doing mental arithmetic for 90 minutes at a stretch. The automated pipeline turned that into low-focus verification: look at the flagged fields, confirm or correct, save. That shift in cognitive effort matters when the same person is also managing client onboarding.

On the client side, a few enterprise accounts specifically mentioned how quickly their bill data was showing up in the dashboard after upload. Month-over-month comparisons became usable almost immediately after onboarding, rather than days later once someone had found the time to manually process the batch. That kind of early-stage experience tends to reduce churn before it starts.

What I'd do differently.

Stress-test earlier: The v1-to-v2 migration was reactive. By the time we properly identified the failure modes of direct OpenAI extraction, clients were already seeing incorrect data. Running a structured accuracy test against a diverse bill set before going to production would have surfaced those issues earlier — and probably saved us a version.

Ground the 75% threshold in data: The HITL routing threshold was set on judgment rather than analysis. A more rigorous approach would have mapped error types to business impact — a wrong invoice total damages client trust much more than a wrong account name. The threshold should probably have been field-specific, not a single overall score.

Earlier alignment on "accurate enough": Different stakeholders had different intuitions about what accuracy meant in practice. The HOO wanted zero errors. Engineering was comfortable with 95%+. Clients cared most about financial fields being right. Getting explicit alignment on what "good enough" looked like for each field type earlier would have reduced back-and-forth on prioritisation.