What to Expect
This page summarises the engine's measured performance and what the person responsible for the Focus task
can expect when enabling the engine in production.
False Reject Rate
0%
No crowdworker is ever incorrectly penalised. This is guaranteed by design: the engine never auto-rejects, it only auto-approves or escalates.
False Pass Rate
~2–3%
Of auto-approved submissions, ~2–3% should have been rejected. These are caught by the customer feedback loop.
Automation
~55–60%
Of all submissions, the engine auto-approves ~55–60%. The rest are escalated for human review.
Bad Caught
~77%
Of submissions that should be rejected, ~77% are correctly escalated by the engine.
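The four headline metrics all derive from the same breakdown of engine decisions. A minimal sketch, where the function name and counts are illustrative assumptions, not the engine's internal API:

```python
# Illustrative sketch of how the four headline metrics relate to the
# engine's decision counts. Names and numbers are assumptions for
# demonstration only.
def engine_metrics(auto_good, auto_bad, esc_good, esc_bad):
    """Compute the headline metrics from decision counts.

    auto_*: auto-approved submissions (genuinely good vs. should-reject).
    esc_*:  escalated submissions, split the same way.
    The engine never auto-rejects, so the false reject rate is 0 by design.
    """
    auto_total = auto_good + auto_bad
    total = auto_total + esc_good + esc_bad
    return {
        "false_reject_rate": 0.0,                  # no auto-reject path
        "false_pass_rate": auto_bad / auto_total,  # bad slipping through
        "automation_rate": auto_total / total,     # share auto-approved
        "bad_caught": esc_bad / (auto_bad + esc_bad),
    }
```

With counts matching this page's dataset (3,533 + 67 auto-approved, 2,275 + 223 escalated), this reproduces ~59% automation, ~2% false pass, and ~77% bad caught.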
What Does This Mean in Practice?
Based on the current dataset of 6,098 submissions (95% approved, 5% rejected):
| Current (manual) | With FocusReview |
| --- | --- |
| 5,290 submissions reviewed by humans | ~2,500 submissions reviewed by humans (53% reduction) |
| 290 submissions rejected | ~223 correctly escalated, ~67 slip through auto-approve |
| Zero automation | ~3,600 auto-approved without human intervention |
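The projected figures follow directly from applying the headline rates to the current dataset; a quick sketch of the arithmetic, with constants taken from this page (small differences come from rounding):

```python
# Reproducing the comparison table from the headline rates.
TOTAL = 6098
REJECTED = 290            # ~5% of all submissions
AUTOMATION_RATE = 0.59    # engine auto-approves ~55-60%
BAD_CAUGHT = 0.77         # ~77% of bad submissions are escalated

auto_approved = round(TOTAL * AUTOMATION_RATE)   # ~3,600
human_reviewed = TOTAL - auto_approved           # ~2,500
caught = round(REJECTED * BAD_CAUGHT)            # ~223
slipped = REJECTED - caught                      # ~67
```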
The 67 false passes (~2–3% of auto-approved) are submissions that the engine
approved but a human reviewer would have rejected. These are primarily "metre not captured in full" cases
where the photos look fine individually but don't cover the entire shelf. Such cases are still caught by
the existing customer feedback loop (Typeform responses).
What the Engine Catches
| Issue Type | Detection Rate | How |
| --- | --- | --- |
| Wrong store visited | 100% | GPS verification + storefront photo matching |
| Missing entire sections | 86% | Benchmark comparison + empty rayon detection |
| Incorrect answers | 83% | Answer pattern analysis |
| Other / misc | 75% | Combined signal detection |
| Fraud attempts | 50% | Vision AI duplicate detection |
| Too wide angle | 43% | Not actively checked (low customer impact) |
| Shelf not fully captured | 37% | Bay counting + benchmarks (hardest category) |
What the Engine Does NOT Catch Well
"Metre not captured in full" (40% of all rejections) remains the hardest category
with only ~37% detection. This is because:
- Individual photos look fine — the issue is that not enough of them were taken
- The engine can only detect this when shelf bay numbers are readable on photos, or when the photo count deviates significantly from the benchmark
- Human reviewers have store-specific knowledge about expected shelf sizes that the engine lacks
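The photo-count signal mentioned above can be pictured as a simple deviation check against the store/task benchmark. A hypothetical sketch, where the function name and the 40% tolerance are assumptions, not the engine's actual values:

```python
# Hypothetical sketch of the photo-count deviation signal. The
# tolerance is an invented example, not the engine's real setting.
def photo_count_suspicious(photo_count, benchmark_count, tolerance=0.4):
    """Flag a submission whose photo count deviates from the
    store/task benchmark by more than the tolerance."""
    if benchmark_count <= 0:
        return False  # no benchmark available: cannot judge, don't flag
    deviation = abs(photo_count - benchmark_count) / benchmark_count
    return deviation > tolerance
```

This also illustrates the limitation: a submission with roughly the expected number of photos, each individually fine, passes the check even when the shelf is not fully covered.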
Recommended Rollout
Phase 1: Shadow Mode (recommended start)
Engine ON, Auto-Submit OFF. The engine processes all incoming submissions and logs its decisions,
but does not actually submit reviews. You can compare engine decisions against human reviewer decisions
in the dashboard to build confidence.
Duration: 2–4 weeks. Goal: Confirm metrics match expectations on live data.
Phase 2: Auto-Approve
Engine ON, Auto-Submit ON. The engine auto-approves high-confidence submissions. All escalated
submissions still go to human reviewers, with AI annotations showing exactly which checks failed.
Expected impact: ~53% reduction in human review workload.
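The difference between shadow mode and auto-approve amounts to a single routing switch. A minimal sketch, assuming hypothetical flag and decision names (the engine's real configuration may differ):

```python
# Minimal sketch of shadow-mode vs. auto-approve routing. Flag and
# decision names are hypothetical.
PHASES = {
    "shadow":       {"engine_enabled": True, "auto_submit": False},
    "auto_approve": {"engine_enabled": True, "auto_submit": True},
}

def route(decision, phase_config):
    """In shadow mode decisions are only logged; with auto-submit on,
    approvals go through automatically and everything else escalates
    to a human reviewer."""
    if not phase_config["auto_submit"]:
        return "log_only"
    return "auto_approved" if decision == "approve" else "escalated"
```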
Phase 3: Monitor & Tune
Monitor the customer feedback loop (Typeform responses) for false passes. If the rate is acceptable,
no action needed. If specific customers or rejection reasons show higher false pass rates,
customer-specific thresholds can be configured.
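Customer-specific thresholds could be expressed as a simple override map. A sketch under stated assumptions: the default value and the customer ID below are invented examples.

```python
# Sketch of customer-specific threshold overrides. Values and IDs
# are invented for illustration.
DEFAULT_THRESHOLD = 0.80
CUSTOMER_THRESHOLDS = {
    "customer_a": 0.90,  # stricter: higher observed false-pass rate
}

def approval_threshold(customer_id):
    """Confidence the engine must reach before auto-approving for
    this customer; falls back to the global default."""
    return CUSTOMER_THRESHOLDS.get(customer_id, DEFAULT_THRESHOLD)
```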
Cost
| Metric | Value |
| --- | --- |
| Processing time per submission | ~60–90 seconds |
| Gemini API cost per submission | ~$0.02–0.04 |
| Estimated annual cost (6,098 submissions) | ~$120–$240 |
| Model | Gemini 3 Flash (tested: outperforms Pro on this task) |
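The annual estimate is simply the per-submission API cost multiplied by yearly volume, using the range from the table above:

```python
# Annual cost estimate: per-submission API cost times yearly volume.
SUBMISSIONS_PER_YEAR = 6098
COST_PER_SUBMISSION = (0.02, 0.04)  # USD, Gemini API, low/high bound

annual_low = SUBMISSIONS_PER_YEAR * COST_PER_SUBMISSION[0]   # ~$122
annual_high = SUBMISSIONS_PER_YEAR * COST_PER_SUBMISSION[1]  # ~$244
```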
Evaluation Methodology
Performance metrics are based on:
- Development tests: Multiple runs of n=100 (stratified 70/30 approved/rejected) across 10+ iterations
- Independent validation: n=222 fresh submissions never seen during development
- Large-scale validation: n=392 submissions with the same benchmark configuration
- Model comparison: n=50 Flash vs Pro side-by-side (Flash wins on all metrics)
- Voting test: 30 submissions × 3 runs each — 87% unanimous agreement, majority voting provides no significant improvement
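The voting test compared a single run against the majority verdict of three runs. A minimal sketch of the voting scheme:

```python
from collections import Counter

# Sketch of the 3-run majority vote from the test above. With ~87%
# unanimous agreement between runs, the majority verdict rarely
# differs from a single run, which is why voting adds little.
def majority_vote(verdicts):
    """Return the most common verdict across repeated engine runs."""
    return Counter(verdicts).most_common(1)[0][0]
```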
Gemini variance: The vision model is non-deterministic. Key metrics can vary by ±5–10 percentage points
between runs. The numbers above are averages across multiple runs. In production, variance averages out
over many submissions.