This week we shipped a quiet upgrade to the part of Tidy Receipts that finds the receipt inside your photo. Where the old code correctly cropped about 5 of every 100 hand-photographed receipts, the new code gets about 16. On our benchmark that is 4.5% to 16.2%, roughly a 3.5× improvement, and we can prove it.
This post is the story of how we got there, including a surprising finding that changed where we go next.
The problem we couldn’t see
Tidy Receipts is a privacy-first tool: everything happens in your browser, and your receipts never reach our servers. That is deliberate, and it is the point. The trade-off is that we have no telemetry to tell us when the edge detector found your receipt and when it didn’t. We can’t watch over your shoulder. We can only see what we test ourselves.
Until recently, “what we test ourselves” was a handful of receipts on a developer’s desk. The algorithm worked on those, so we shipped it. We had no honest numbers for how it did on the broader range of receipts real people actually photograph: white receipts on light tablecloths, thermal receipts curled at the edges, fingers gripping a corner, wood-grain tables that look just as rectangular as the receipt.
So before we changed anything, we built measurement.
The measurement harness
The harness is a small browser-side tool that imports the same code the production site runs, points it at a known dataset, and reports how well it did against ground truth. Crucially, it tests the actual production code, not a copy. That means we can’t accidentally optimise a stand-in algorithm and ship something different.
The dataset is the Zenodo SIBGRAPI 2024 receipt collection: 198 phone-photographed restaurant receipts, each one annotated with the four ground-truth corners of the receipt and a full transcription of the printed text. It’s a realistic spread: portrait and landscape, low and high contrast, clear backgrounds and busy ones, some with hands in frame.
When we ran the old algorithm against this dataset, the numbers were not flattering:
- Correct crop (within ~50px of ground truth on a ~1000px image): 4.5%
- Confidently wrong crop (silent bad output): 2.5%
- No detection (algorithm gave up): 93%
The 93% “gave up” rate is less damning than it sounds. When detection fails, the tool falls back to contrast-enhancement on the full frame, which is still readable. But the user wasn’t told that detection had failed; the result row just said “cleaned.” That is a kind of quiet dishonesty, and we’d been shipping it.
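To make the three buckets concrete, here is a minimal Python sketch of how one receipt could be classified. It is an illustration, not the harness itself (which runs in the browser). The function names are hypothetical, and the rule that a crop is correct when its worst corner lies within the tolerance is our assumption about how “within ~50px” could be applied.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def classify_detection(
    predicted: Optional[Sequence[Point]],
    truth: Sequence[Point],
    tolerance_px: float = 50.0,
) -> str:
    """Return 'correct', 'wrong', or 'none' for one receipt."""
    if predicted is None:
        # The detector gave up; the tool falls back to contrast-only.
        return "none"
    # Assumption: both quads list their corners in the same order, and a crop
    # counts as correct when its worst corner is within the tolerance.
    worst = max(math.dist(p, t) for p, t in zip(predicted, truth))
    return "correct" if worst <= tolerance_px else "wrong"


def summarise(outcomes: List[str]) -> Dict[str, float]:
    """Percentage of receipts in each bucket, rounded to one decimal place."""
    total = len(outcomes)
    return {
        bucket: round(100 * outcomes.count(bucket) / total, 1)
        for bucket in ("correct", "wrong", "none")
    }


truth = [(100, 100), (900, 100), (900, 1400), (100, 1400)]
close = [(x + 10, y - 20) for x, y in truth]    # every corner about 22px off
off = [(160, 100)] + truth[1:]                  # one corner 60px off
outcomes = [classify_detection(q, truth) for q in (close, off, None, None)]
print(summarise(outcomes))
```

The “wrong” bucket is the one to watch: it is the only outcome where the tool produces output that looks confident and is not.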
Why the old algorithm was so naïve
The original code was about as simple as image processing gets. It used Canny edge detection, a 1986 technique by John Canny for finding strong intensity transitions, then picked the largest closed four-sided contour as the receipt. That works perfectly on a clean photograph against a high-contrast background. It struggles when:
- The receipt has weaker edges than its surroundings (low contrast against the table)
- Internal text creates dense competing rectangles (thermal receipts with many lines)
- Fingers, hands, or shadows break the receipt boundary
- The background itself has rectangular structure (wood grain, table edges, tile patterns)
These aren’t edge cases. They are how most real photos look.
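The selection step, and the last failure mode in particular, can be sketched in a few lines of Python. This assumes the edge detector has already produced contours simplified to corner lists; the helper names and sample coordinates are hypothetical, and this is not the production code.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(points: Sequence[Point]) -> float:
    """Absolute area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(points, list(points[1:]) + [points[0]]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def largest_quad(contours: Sequence[Sequence[Point]]) -> Optional[Sequence[Point]]:
    """Pick the largest four-cornered contour, or None if there is none."""
    quads = [c for c in contours if len(c) == 4]
    return max(quads, key=polygon_area, default=None)


# Hypothetical contours from one photo: the receipt, the table it sits on,
# and one block of printed text inside the receipt.
receipt = [(300, 200), (700, 200), (700, 1100), (300, 1100)]
table_edge = [(50, 50), (950, 50), (950, 1300), (50, 1300)]
text_block = [(320, 400), (680, 400), (680, 450), (320, 450)]

picked = largest_quad([receipt, table_edge, text_block])
print(picked == table_edge)  # the table wins because it is simply bigger
```

“Largest four-sided shape” has no notion of what a receipt looks like, so any bigger rectangle in the frame beats the receipt.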
The new approach
Rather than make one strategy work everywhere, we widened the search and let the algorithm pick.
The new detector runs five candidate generators in parallel:
- Canny edges, three times with different sensitivity settings (so weak edges and strong edges both get a chance).
- Otsu’s method, a 1979 algorithm by Nobuyuki Otsu that automatically finds the brightness threshold which best separates the image into two regions. Great when the receipt is markedly brighter or darker than the background.
- Adaptive thresholding with text-density detection. This builds on Bradley and Roth’s adaptive thresholding technique, inverted so that dark text becomes the foreground and then dilated to merge text rows into one receipt-shaped blob. This catches white-on-light cases the brightness-based methods miss.
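For readers who have not met Otsu’s method, here is a minimal pure-Python sketch of the textbook algorithm working on a flat list of grey values. It illustrates the idea only. It is not the code the site runs, and the sample pixel values are made up.

```python
from typing import Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t that maximises between-class variance.

    Pixels with value <= t form one class; the rest form the other.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is now in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# A dark table (values near 45) under a bright receipt (values near 205).
pixels = [40] * 60 + [50] * 40 + [200] * 50 + [210] * 50
threshold = otsu_threshold(pixels)
receipt_pixels = [p for p in pixels if p > threshold]
print(threshold, len(receipt_pixels))
```

When the histogram has two clear humps like this, the threshold lands in the gap between them. A white receipt on a light tablecloth has no such gap, which is why this generator cannot carry the whole job alone.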
Each generator emits a list of candidate quadrilaterals. Each candidate is scored on four criteria: how much of the image it covers, its aspect ratio, how convex it is, and how tightly it fills its minimum rotated rectangle. The highest-scoring candidate above a confidence threshold wins. If nothing meets the threshold, we explicitly say so, and a little grey dot appears in the result row next to the words “Contrast only: couldn’t detect receipt edges”. You see, accurately, that we didn’t find your receipt’s outline.
When the algorithm does work, the badge is green and says “Cropped and straightened”. You can glance at the thumbnail to verify it looks right.
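The scoring step can be sketched as follows. The four measurements match the criteria described above, but how they are combined, the 0.2 coverage target, the aspect-ratio limit of 8, and the 0.5 confidence threshold are all placeholder assumptions for illustration, not the shipped values.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(points: Sequence[Point]) -> float:
    """Absolute polygon area via the shoelace formula."""
    n = len(points)
    twice = sum(points[i][0] * points[(i + 1) % n][1]
                - points[(i + 1) % n][0] * points[i][1] for i in range(n))
    return abs(twice) / 2.0


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(sequence) -> List[Point]:
        chain: List[Point] = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def min_rect_sides(hull: Sequence[Point]) -> Tuple[float, float]:
    """(short, long) sides of the smallest enclosing rotated rectangle.

    The minimum-area rectangle is always aligned with one hull edge.
    """
    best = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        along = [x * ux + y * uy for x, y in hull]
        across = [-x * uy + y * ux for x, y in hull]
        sides = (max(along) - min(along), max(across) - min(across))
        if best is None or sides[0] * sides[1] < best[0] * best[1]:
            best = sides
    return min(best), max(best)


def score_candidate(quad: Sequence[Point], image_w: int, image_h: int) -> float:
    """Combine the four criteria into one 0..1 score (placeholder weighting)."""
    area = polygon_area(quad)
    hull = convex_hull(quad)
    if area == 0 or len(hull) < 3:
        return 0.0
    short_side, long_side = min_rect_sides(hull)
    coverage = area / (image_w * image_h)
    convexity = area / polygon_area(hull)
    rectangularity = area / (short_side * long_side)
    aspect = long_side / short_side
    coverage_term = min(coverage / 0.2, 1.0)      # assumed target: 20% of frame
    aspect_term = 1.0 if aspect <= 8.0 else 0.5   # assumed limit for slivers
    return coverage_term * aspect_term * convexity * rectangularity


def pick_best(candidates, image_w, image_h, threshold=0.5) -> Optional[Sequence[Point]]:
    """Highest-scoring candidate, or None when nothing clears the threshold."""
    scored = [(score_candidate(q, image_w, image_h), q) for q in candidates]
    if not scored:
        return None
    best_score, best_quad = max(scored, key=lambda pair: pair[0])
    return best_quad if best_score >= threshold else None


receipt = [(300, 200), (700, 200), (700, 1100), (300, 1100)]
speck = [(10, 10), (30, 10), (30, 40), (10, 40)]
print(pick_best([speck, receipt], 1000, 1500) == receipt)
print(pick_best([speck], 1000, 1500))  # None: report "contrast only"
```

Returning `None` instead of the least-bad candidate is the design choice that matters here: it is what allows the grey “Contrast only” badge to be truthful.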
The results
Running the new detector against the same 198 receipts:
| metric | before | after |
|---|---|---|
| Correct crop (≤50px from ground truth) | 4.5% | 16.2% |
| Confidently wrong crop | 2.5% | 9.1% |
| Falls through to contrast-only | 93% | 75% |
| Mean processing time per receipt | 17ms | 67ms |
The 3.5× improvement on correct crops is real, and you’ll see it in the cleaner PDFs that come out of the tool now. The price is a higher rate of “cropped, but in the wrong place”, which is why the new per-receipt status badge matters. Bad crops are visible at a glance, and you know to retake the photo.
The extra 50ms of compute per receipt is imperceptible on a phone. On a batch of 20 receipts it adds up to about one extra second in total.
The surprise
Once the geometric numbers stabilised, we wired up a second measurement layer: end-to-end OCR validation. The harness now runs Tesseract on each algorithm’s cleaned output and checks whether the total matches the ground-truth total from the dataset’s text labels. This is the metric that actually matters to users: geometric accuracy is a means; recovering your total is the end.
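A minimal sketch of the “did we recover the total” check, in Python. The matching rule used here (the ground-truth total must appear among the money-like amounts found in the OCR text) and the amount pattern are our assumptions for illustration; the function names and the sample text are hypothetical.

```python
import re
from decimal import Decimal
from typing import List

# Money-like tokens with exactly two decimals, accepting either "," or "."
# as the decimal separator. The lookarounds keep dates such as 12.05.2024
# from being read as amounts.
AMOUNT = re.compile(r"(?<!\d)\d[\d.,]*[.,]\d{2}(?![.,]?\d)")


def extract_amounts(text: str) -> List[Decimal]:
    """Return every money-like amount in the text as an exact Decimal."""
    amounts = []
    for token in AMOUNT.findall(text):
        digits = re.sub(r"[.,]", "", token)  # drop all separators
        amounts.append(Decimal(digits[:-2] + "." + digits[-2:]))
    return amounts


def total_recovered(ocr_text: str, truth_total: str) -> bool:
    """True when the ground-truth total appears among the OCR'd amounts."""
    expected = extract_amounts(truth_total)
    return bool(expected) and expected[0] in extract_amounts(ocr_text)


ocr_text = "SUBTOTAL 79,90\nSERVICO 8,00\nTOTAL R$ 87,90\n12.05.2024"
print(total_recovered(ocr_text, "87,90"))       # matches the TOTAL line
print(total_recovered("TOTAL 81,90", "87,90"))  # one misread digit fails
```

A pass/fail check like this is deliberately unforgiving: one misread digit counts as a failure, which is how it feels to a user too.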
Then we ran a bake-off: each candidate strategy on its own (with the three Canny settings grouped as Multi-Canny), plus the combined version, against the full dataset. We expected the OCR ranking to track the geometric ranking. It did not.
| algorithm | correct crop rate | total recovered rate |
|---|---|---|
| Old algorithm | 4.5% | not measured |
| Adaptive only | 2.0% | 37.4% |
| Multi-Canny only | 5.1% | 37.9% |
| Otsu only | 16.2% | 36.9% |
| Combined (shipped) | 16.2% | 38.4% |
Across an 8× spread in geometric accuracy, end-to-end OCR success barely moved. Why? Because when an algorithm confidently crops the wrong region, OCR loses the content that was outside the bad crop. When it doesn’t crop at all, OCR works on the contrast-enhanced full frame, and Tesseract is forgiving enough that this often recovers the total just fine.
This was uncomfortable to discover. It meant the metric we’d been optimising, geometric crop accuracy, wasn’t quite the metric that determined whether users got useful output. We shipped the combined detector anyway, because it matched or beat every single-strategy variant on both measurements, and the extra compute is imperceptible. But it changed where we’ll spend the next round.
What we’ll do next
The contrast-only fallback already recovers ~42% of totals without any detection at all. The detection step has limited remaining leverage on that number. The bigger levers we’ll explore next:
- Run OCR twice, on the warp and on the full frame, and keep whichever extracts more. Decouples OCR success from detection success entirely.
- Tune the cleanup that happens before OCR (contrast curves, deskew, denoise) with the harness watching the OCR numbers.
- A different OCR engine or model, if we ever add a server-side processing tier.
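The first of these ideas is small enough to sketch now. In this Python illustration the OCR engine is passed in as a function, and “extracts more” is measured by counting letters and digits; both the names and that measure are assumptions, and we may settle on something different.

```python
from typing import Callable, Optional


def count_readable(text: str) -> int:
    """Crude yield measure: how many letters and digits OCR produced."""
    return sum(1 for ch in text if ch.isalnum())


def best_ocr_text(
    run_ocr: Callable[[str], str],
    full_frame: str,
    warped: Optional[str] = None,
) -> str:
    """Run OCR on the full frame and, if present, the warp; keep the richer text."""
    candidates = [run_ocr(full_frame)]
    if warped is not None:
        candidates.append(run_ocr(warped))
    # On a tie, max() keeps the first candidate, which is the full frame.
    return max(candidates, key=count_readable)


# Stand-in OCR for the example: a bad crop that cut off the total line.
fake_results = {
    "full.png": "CAFE LUA\nSUBTOTAL 79,90\nTOTAL 87,90",
    "bad_crop.png": "CAFE LUA\nSUBTOTAL 79,90",
}
chosen = best_ocr_text(fake_results.__getitem__, "full.png", "bad_crop.png")
print("TOTAL 87,90" in chosen)
```

With this in place a confidently wrong crop can no longer make the OCR result worse than the contrast-only fallback, at the price of running OCR twice.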
We’ll also keep an eye on what the existing detector does wrong. The 9% “confidently wrong” rate is the cost of the more aggressive approach. If it climbs, or if specific failure modes become common, we’ll layer in more filters.
A note on honesty
Tidy Receipts can’t see your receipts, because they never leave your browser. The flip side is that our quality work has to happen against public stand-ins like the Zenodo dataset, which are good but never quite as varied as the real distribution. If you’ve used the tool and the detection looked wrong, or surprisingly right, the Feedback link in the navigation is how that information reaches us. It is the only window we have onto what we don’t already know.
We measure what we ship. Now you know what that looks like.
References
- John Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
- Nobuyuki Otsu. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics, 1979.
- Derek Bradley and Gerhard Roth. Adaptive Thresholding using the Integral Image. 2007.
- Filtering and Preparation of Document Images for OCR (MaVILab-UFV, SIBGRAPI 2024). The Zenodo dataset we used was published alongside this paper; the project repository has further context.