What to Expect
This page summarises the engine's measured performance and what the person responsible for the Focus task
can expect when enabling the engine in production.
False Reject Rate
0%
No crowdworker is ever incorrectly penalised. This is guaranteed by design: the engine never auto-rejects, it only auto-approves or escalates.
False Pass Rate
~2–3%
Of auto-approved submissions, ~2–3% should have been rejected. These are caught by the customer feedback loop.
Automation
~55–60%
Of all submissions, the engine auto-approves ~55–60%. The rest are escalated for human review.
Bad Caught
~77%
Of submissions that should be rejected, ~77% are correctly escalated by the engine.
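The four headline metrics all derive from the same breakdown of engine decisions. A minimal sketch, where the function name and counts are illustrative assumptions, not the engine's internal API:

```python
# Illustrative sketch of how the four headline metrics relate to the
# engine's decision counts. Names and numbers are assumptions for
# demonstration only.
def engine_metrics(auto_good, auto_bad, esc_good, esc_bad):
    """Compute the headline metrics from decision counts.

    auto_*: auto-approved submissions (genuinely good vs. should-reject).
    esc_*:  escalated submissions, split the same way.
    The engine never auto-rejects, so the false reject rate is 0 by design.
    """
    auto_total = auto_good + auto_bad
    total = auto_total + esc_good + esc_bad
    return {
        "false_reject_rate": 0.0,                  # no auto-reject path
        "false_pass_rate": auto_bad / auto_total,  # bad slipping through
        "automation_rate": auto_total / total,     # share auto-approved
        "bad_caught": esc_bad / (auto_bad + esc_bad),
    }
```

With counts matching this page's dataset (3,533 + 67 auto-approved, 2,275 + 223 escalated), this reproduces ~59% automation, ~2% false pass, and ~77% bad caught.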
What Does This Mean in Practice?
Based on the current dataset of 6,098 submissions (95% approved, 5% rejected):
| Current (manual) | With FocusReview |
| --- | --- |
| 5,290 submissions reviewed by humans | ~2,500 submissions reviewed by humans (53% reduction) |
| 290 submissions rejected | ~223 correctly escalated, ~67 slip through auto-approve |
| Zero automation | ~3,600 auto-approved without human intervention |
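The projected figures follow directly from applying the headline rates to the current dataset; a quick sketch of the arithmetic, with constants taken from this page (small differences come from rounding):

```python
# Reproducing the comparison table from the headline rates.
TOTAL = 6098
REJECTED = 290            # ~5% of all submissions
AUTOMATION_RATE = 0.59    # engine auto-approves ~55-60%
BAD_CAUGHT = 0.77         # ~77% of bad submissions are escalated

auto_approved = round(TOTAL * AUTOMATION_RATE)   # ~3,600
human_reviewed = TOTAL - auto_approved           # ~2,500
caught = round(REJECTED * BAD_CAUGHT)            # ~223
slipped = REJECTED - caught                      # ~67
```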
The 67 false passes (~2–3% of auto-approved) are submissions that the engine
approved but a human reviewer would have rejected. These are primarily "metre not captured in full" cases
where the photos look fine individually but don't cover the entire shelf. Such cases are still caught by
the existing customer feedback loop (Typeform responses).
What the Engine Catches
| Issue Type | Detection Rate | How |
| --- | --- | --- |
| Wrong store visited | 100% | GPS verification + storefront photo matching |
| Missing entire sections | 86% | Benchmark comparison + empty rayon detection |
| Incorrect answers | 83% | Answer pattern analysis |
| Other / misc | 75% | Combined signal detection |
| Fraud attempts | 50% | Vision AI duplicate detection |
| Too wide angle | 43% | Not actively checked (low customer impact) |
| Shelf not fully captured | 37% | Bay counting + benchmarks (hardest category) |
What the Engine Does NOT Catch Well
"Metre not captured in full" (40% of all rejections) remains the hardest category
with only ~37% detection. This is because:
- Individual photos look fine — the issue is that not enough of them were taken
- The engine can only detect this when shelf bay numbers are readable on photos, or when the photo count deviates significantly from the benchmark
- Human reviewers have store-specific knowledge about expected shelf sizes that the engine lacks
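The photo-count signal mentioned above can be pictured as a simple deviation check against the store/task benchmark. A hypothetical sketch, where the function name and the 40% tolerance are assumptions, not the engine's actual values:

```python
# Hypothetical sketch of the photo-count deviation signal. The
# tolerance is an invented example, not the engine's real setting.
def photo_count_suspicious(photo_count, benchmark_count, tolerance=0.4):
    """Flag a submission whose photo count deviates from the
    store/task benchmark by more than the tolerance."""
    if benchmark_count <= 0:
        return False  # no benchmark available: cannot judge, don't flag
    deviation = abs(photo_count - benchmark_count) / benchmark_count
    return deviation > tolerance
```

This also illustrates the limitation: a submission with roughly the expected number of photos, each individually fine, passes the check even when the shelf is not fully covered.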
Recommended Rollout
Phase 1: Shadow Mode (recommended start)
Engine ON, Auto-Submit OFF. The engine processes all incoming submissions and logs its decisions,
but does not actually submit reviews. You can compare engine decisions against human reviewer decisions
in the dashboard to build confidence.
Duration: 2–4 weeks. Goal: Confirm metrics match expectations on live data.
Phase 2: Auto-Approve
Engine ON, Auto-Submit ON. The engine auto-approves high-confidence submissions. All escalated
submissions still go to human reviewers, with AI annotations showing exactly which checks failed.
Expected impact: ~53% reduction in human review workload.
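The difference between shadow mode and auto-approve amounts to a single routing switch. A minimal sketch, assuming hypothetical flag and decision names (the engine's real configuration may differ):

```python
# Minimal sketch of shadow-mode vs. auto-approve routing. Flag and
# decision names are hypothetical.
PHASES = {
    "shadow":       {"engine_enabled": True, "auto_submit": False},
    "auto_approve": {"engine_enabled": True, "auto_submit": True},
}

def route(decision, phase_config):
    """In shadow mode decisions are only logged; with auto-submit on,
    approvals go through automatically and everything else escalates
    to a human reviewer."""
    if not phase_config["auto_submit"]:
        return "log_only"
    return "auto_approved" if decision == "approve" else "escalated"
```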
Phase 3: Monitor & Tune
Monitor the customer feedback loop (Typeform responses) for false passes. If the rate is acceptable,
no action needed. If specific customers or rejection reasons show higher false pass rates,
customer-specific thresholds can be configured.
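Customer-specific thresholds could be expressed as a simple override map. A sketch under stated assumptions: the default value and the customer ID below are invented examples.

```python
# Sketch of customer-specific threshold overrides. Values and IDs
# are invented for illustration.
DEFAULT_THRESHOLD = 0.80
CUSTOMER_THRESHOLDS = {
    "customer_a": 0.90,  # stricter: higher observed false-pass rate
}

def approval_threshold(customer_id):
    """Confidence the engine must reach before auto-approving for
    this customer; falls back to the global default."""
    return CUSTOMER_THRESHOLDS.get(customer_id, DEFAULT_THRESHOLD)
```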
Cost
| Metric | Value |
| --- | --- |
| Processing time per submission | ~60–90 seconds |
| Gemini API cost per submission | ~$0.02–0.04 |
| Estimated annual cost (6,098 submissions) | ~$120–$240 |
| Model | Gemini 3 Flash (tested: outperforms Pro on this task) |
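The annual estimate is simply the per-submission API cost multiplied by yearly volume, using the range from the table above:

```python
# Annual cost estimate: per-submission API cost times yearly volume.
SUBMISSIONS_PER_YEAR = 6098
COST_PER_SUBMISSION = (0.02, 0.04)  # USD, Gemini API, low/high bound

annual_low = SUBMISSIONS_PER_YEAR * COST_PER_SUBMISSION[0]   # ~$122
annual_high = SUBMISSIONS_PER_YEAR * COST_PER_SUBMISSION[1]  # ~$244
```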
Evaluation Methodology
Performance metrics are based on:
- Development tests: Multiple runs of n=100 (stratified 70/30 approved/rejected) across 10+ iterations
- Independent validation: n=222 fresh submissions never seen during development
- Large-scale validation: n=392 submissions with the same benchmark configuration
- Model comparison: n=50 Flash vs Pro side-by-side (Flash wins on all metrics)
- Voting test: 30 submissions × 3 runs each — 87% unanimous agreement, majority voting provides no significant improvement
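The voting test compared a single run against the majority verdict of three runs. A minimal sketch of the voting scheme:

```python
from collections import Counter

# Sketch of the 3-run majority vote from the test above. With ~87%
# unanimous agreement between runs, the majority verdict rarely
# differs from a single run, which is why voting adds little.
def majority_vote(verdicts):
    """Return the most common verdict across repeated engine runs."""
    return Counter(verdicts).most_common(1)[0][0]
```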
Gemini variance: The vision model is non-deterministic. Key metrics can vary by ±5–10 percentage points
between runs. The numbers above are averages across multiple runs. In production, variance averages out
over many submissions.