The data was wrong. On almost every bill.
When I joined Watt Footprint in July 2025, the platform was processing energy bills through a direct OpenAI call. It would extract fields — MPRN, account number, invoice number, energy charges, standing charges, PSO levy, VAT, totals — and push them into the database. In theory. In practice, I was finding errors on nearly every bill I spot-checked.
Account numbers off by a digit. Unit rates misread. Standing charges swapped with usage charges. Financial totals that didn't add up. I stopped trusting the system within the first week.
The bill volume at that point was 597. By December it was 930. And with the client base actively growing into the UK and UAE, the problem was only going to compound. Irish, UK, and UAE energy bills look different. Electricity, gas, and water each have their own billing formats. Some came in as clean PDFs. Others were scanned images — blurry, tilted, inconsistently laid out. Accommodating all of that while the extraction pipeline was already struggling was the core challenge.
Manual validation was the workaround. Ten bills took 60 to 90 minutes: read, cross-check, format the numbers, verify the totals. High-concentration work that shouldn't have been necessary at that scale.
A small team, mixed backgrounds, international scope.
The team working on this: a Head of Operations, a Software Lead, a Data Analyst, a full-stack software engineer, and me as product owner. One of the harder parts of the project was that the Head of Operations and the Software Lead came from business backgrounds. The trade-offs between different OCR approaches, or why adding a validation layer was worth the complexity, weren't immediately obvious to them.
That meant a lot of the work on my end was translation — explaining what Textract could and couldn't do, what accuracy meant in a human-in-the-loop (HITL) context, why a 75% threshold made sense as a routing rule. Making sure everyone understood the constraints before decisions got locked in.
We also had currency to deal with. When the first UK and UAE clients came on board, the bill reader had been built for euros. Adding pound sterling and UAE dirham support wasn't just a data field change — it touched parsing logic, reporting, and how totals were validated downstream.
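To make the "not just a data field change" point concrete, here is a minimal sketch of currency-aware amount parsing. The marker table and function name are hypothetical illustrations, not Watt Footprint's actual code:

```python
from decimal import Decimal

# Hypothetical marker-to-ISO-code table; a real parser would also handle
# locale-specific thousands/decimal separators per market.
CURRENCY_MARKERS = {
    "€": "EUR",    # Ireland
    "£": "GBP",    # UK
    "AED": "AED",  # UAE (bills may also print the Arabic symbol)
}

def parse_amount(raw: str) -> tuple[str, Decimal]:
    """Detect the currency marker, then normalise the numeric part."""
    text = raw.strip()
    for marker, code in CURRENCY_MARKERS.items():
        if marker in text:
            digits = text.replace(marker, "").replace(",", "").strip()
            return code, Decimal(digits)
    raise ValueError(f"No known currency marker in {raw!r}")
```

Every downstream consumer — validation arithmetic, reporting, dashboard totals — then has to carry the currency code alongside the amount instead of assuming euros.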
Three versions, each fixing what the last one broke.
The v3 architecture was the one that held: Textract for OCR extraction, an OpenAI reasoning model as a validation layer on top, and a confidence threshold that routed low-scoring bills to human review.
The key decision in v3 was adding the reasoning model as a validation step, not an extraction step. Textract is better at OCR than a general-purpose LLM. But Textract can't tell you whether a unit rate that looks correct actually multiplies to the stated total, or whether a VAT figure is internally consistent with the pre-tax charge. That's where OpenAI reasoning models came in — they were the strongest available at the time for this kind of structured logical checking, and we had startup credits that made it feasible without a material cost increase.
Anything that came through with a validation score below 75% was flagged for human review rather than pushed to the database. That routing rule kept bad data out while making sure the reviewer's time was spent only where the system wasn't confident.
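The deterministic core of those consistency checks and the routing rule can be sketched roughly as below. Field names, the tolerance, and the scoring are illustrative assumptions; in the real pipeline the reasoning model produced the validation score:

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class ExtractedBill:
    units_kwh: Decimal
    unit_rate: Decimal
    energy_charge: Decimal
    pre_tax_total: Decimal
    vat_rate: Decimal
    vat_amount: Decimal

TOLERANCE = Decimal("0.02")  # allow rounding to the nearest cent

def consistency_score(bill: ExtractedBill) -> float:
    """Fraction of internal arithmetic checks that pass."""
    checks = [
        # Does units x rate actually reproduce the stated energy charge?
        abs(bill.units_kwh * bill.unit_rate - bill.energy_charge) <= TOLERANCE,
        # Is the VAT figure consistent with the pre-tax total?
        abs(bill.pre_tax_total * bill.vat_rate - bill.vat_amount) <= TOLERANCE,
    ]
    return sum(checks) / len(checks)

def route(bill: ExtractedBill, threshold: float = 0.75) -> str:
    """Below-threshold bills go to a human, not the database."""
    return "database" if consistency_score(bill) >= threshold else "human_review"
```

The point of the threshold is exactly as described in the text: confident extractions flow straight through, and reviewer time is spent only on the bills the system can't vouch for.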
Before committing to Textract, we stress-tested it ourselves.
Before building v2 properly, the data analyst put together a quick Python prototype that called the Textract API directly and saved results to CSV. We ran it against a deliberate mix: clean PDFs, scanned bills, blurry images, awkward layouts. The kind of bills that were already causing problems in the live system.
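A prototype along those lines might look like the following. The helper names are my own sketch, but `detect_document_text` is Textract's real synchronous text-detection call, and the `Blocks` structure is what it returns:

```python
import csv

def ocr_lines(path: str) -> list[str]:
    """Send one bill to Textract and return its detected text lines."""
    import boto3  # requires AWS credentials; imported here to keep the parser testable offline
    client = boto3.client("textract")
    with open(path, "rb") as f:
        resp = client.detect_document_text(Document={"Bytes": f.read()})
    return extract_lines(resp)

def extract_lines(response: dict) -> list[str]:
    """Pull LINE blocks out of a Textract response."""
    return [b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"]

def to_csv(rows: list[tuple[str, str]], out_path: str) -> None:
    """Dump (filename, line) pairs for manual cross-checking."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows([("file", "line"), *rows])
```

Keeping the response parsing separate from the API call made it easy to re-run scoring against saved responses without paying for another Textract pass.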
For validation, we built a ground-truth spreadsheet — manually entered correct values alongside OCR output for each bill. Accuracy was calculated field by field using a three-tier scoring approach.
For financial fields specifically, we also tracked Financial Delta — comparing the sum of OCR-extracted totals against the sum of manually verified totals across the batch. A small per-bill error can compound quickly when you're reporting on 13 enterprise accounts' energy spend.
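The two batch metrics can be sketched like this; the dict-of-fields representation and field names are illustrative assumptions:

```python
from decimal import Decimal

def field_accuracy(truth: list[dict], ocr: list[dict], field: str) -> float:
    """Fraction of bills where OCR matched the ground-truth value exactly."""
    hits = sum(1 for t, o in zip(truth, ocr) if t[field] == o[field])
    return hits / len(truth)

def financial_delta(truth: list[dict], ocr: list[dict], field: str = "total") -> Decimal:
    """Batch-level gap between summed OCR totals and summed verified totals."""
    return sum((o[field] for o in ocr), Decimal(0)) - sum((t[field] for t in truth), Decimal(0))
```

Financial Delta is deliberately a sum, not an average: per-bill errors in opposite directions can mask each other field by field, but a systematic bias shows up immediately in the aggregate spend figure reported to clients.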
On clean printed PDFs, Textract reliably achieves a character error rate (CER) of 2–3% or lower, which translates to 97%+ character accuracy. Overall system accuracy, once scanned and blurry bills were factored in, came in around 95% for PDFs and 80% for image-based bills.
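CER itself is just edit distance normalised by the length of the reference string, so character accuracy is one minus CER. A compact sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 cost if equal)
            ))
        prev = curr
    return prev[-1]

def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edits needed to turn the OCR output into the ground truth, per reference char."""
    return levenshtein(reference, hypothesis) / len(reference)
```

So a single misread digit in a ten-character account number is a 10% CER for that field — which is why field-level scoring mattered more to us than the headline character figure.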
80% faster. Lower cognitive load. Clients noticed.
More than the time saving, the HITL model changed the nature of the work. Manual validation is high-concentration — reading energy tables and doing mental arithmetic for 90 minutes at a stretch. The automated pipeline turned that into low-focus verification: look at the flagged fields, confirm or correct, save. That shift in cognitive effort matters when the same person is also managing client onboarding.
On the client side, a few enterprise accounts specifically mentioned how quickly their bill data was showing up in the dashboard after upload. Month-over-month comparisons became usable almost immediately after onboarding, rather than days later once someone had found the time to manually process the batch. That kind of early-stage experience tends to reduce churn before it starts.