Team Benchmark: Q2 2026 Structured Extraction Performance Review
Our team of 3 people spent three full weeks running a production-grade JSON extraction benchmark. We tested Claude Sonnet 4.6, Haiku 4.5, GPT series, and Gemini series across customer support email extraction, web scraping parsing, and financial document structuring — over a thousand API calls.
Let me cut to the chase:
Claude Sonnet 4.6 is currently the best “quality-speed-cost” balanced model for structured extraction. But if your task isn’t that complex, Haiku 4.5 gives you 96.7% completeness at 1/8 the price.
Below is all real test data and code examples. Minimal fluff.
Part 1: JSON Extraction Capability — Full Picture
1.1 What Is “Structured Extraction” and Why Test It Separately?
Many dev teams have been burned: You ask the model to “output JSON.” It gives you “Sure, here’s the JSON you requested: {…} Hope this helps!” — and your JSON.parse() crashes.
Structured extraction isn’t about having AI write poetry. It’s about extracting sender, intent, urgency, refund amount from emails. Product name, price, stock status from web pages. Revenue, profit, growth rate from financial reports. Output must be clean. Format must be correct. Fields must be complete.
1.2 Core Model JSON Extraction Comparison
Based on real customer service email extraction across 30 emails, testing 6 leading models:
| Model | Field Completeness | Hallucination Rate | Cost for 30 emails ($) | P50 Latency (s) | JSON Validity |
|---|---|---|---|---|---|
| claude-sonnet-4.6 | 30/30 | 0 | $0.024 | 1.1s | ✅ 100% |
| claude-haiku-4.5 | 29/30 | 0 | $0.003 | 0.7s | ✅ 100% |
| gpt-5 | 30/30 | 1 | $0.045 | 1.8s | ✅ 100% |
| gpt-4.1 | 28/30 | 2 | $0.018 | 1.4s | ⚠️ Occasional drift |
| gemini-2.5-pro | 27/30 | 4 | $0.012 | 1.6s | ⚠️ Occasional |
| gemini-2.5-flash | 26/30 | 3 | $0.001 | 0.9s | ⚠️ Occasional |
Three key findings:
- Sonnet 4.6 is the only model with 0 hallucinations and 100% field completeness. For accuracy-critical scenarios (medical, financial, legal), that 3-point gap is worth the price.
- Haiku 4.5 is the value king. 96.7% completeness at 1/8 the price of Sonnet. If your business can tolerate 3% failure with fallback logic, Haiku is the answer.
- Gemini is fast but hallucinates. 13% hallucination rate (4 out of 30 emails made up refund amounts) is a dealbreaker for serious production use.
Part 2: Advanced Scenario — Web Scraping with Structured Extraction
2.1 Why Do Regular Models Struggle with Web Scraping?
Web scraping extraction is an order of magnitude harder than regular JSON extraction: messy HTML, nested CSS selectors, needs to understand both DOM tree AND data semantics simultaneously.
Zyte (global largest scraping provider) ran a professional benchmark using their Web Scraping Copilot, testing whether models can auto-generate extraction code from raw HTML. Three core metrics:
- rouge1_f1_adj: Extraction value accuracy (higher = better)
- SLOC per attribute: Lines of code generated per field (lower = better)
- Time per extractor: Time to generate complete extractor (lower = better)
2.2 Complete Benchmark Data
| Model | SLOC/attribute ↓ | rouge1_f1_adj ↑ | Time/extractor (secs) ↓ |
|---|---|---|---|
| claude-sonnet-4.6 | 19.59 | 0.8348 | 171 |
| gemini-3-pro | 19.16 | 0.8330 | 611 |
| claude-opus-4.6 | 19.97 | 0.8202 | 170 |
| gpt-5.2-codex | 26.28 | 0.7746 | 154 |
| claude-sonnet-4.5 | 18.95 | 0.7024 | 153 |
Key insights:
- Sonnet 4.6 ranks #1 for extraction quality (0.8348), barely edging out Gemini 3 Pro (0.8330)
- Sonnet 4.6 beats Opus 4.6 — counterintuitive. Usually “stronger” models underperform on extraction tasks. This suggests Sonnet 4.6 has special optimization for HTML understanding and structured output
- Sonnet 4.6 is 3.5x faster than Gemini 3 Pro (171 sec vs 611 sec). The difference between waiting 10 minutes vs 3 minutes is massive in an IDE
2.3 API Cost per Domain
| Model | API cost per domain ($) | Premium requests/domain |
|---|---|---|
| claude-sonnet-4.6 | $3.94 | 53.2 |
| gemini-3-pro | $5.67 | 55.7 |
| claude-opus-4.6 | $5.74 | 148.9 |
| gpt-5.2-codex | $2.35 | 55.2 |
Sonnet 4.6 is 30%+ cheaper than Gemini 3 Pro and 45%+ cheaper than Opus 4.6. Bottom line: For scraping extraction, Sonnet 4.6 is the overall best.
Part 3: Claude Code Structured Extraction Workflow
3.1 Don’t Say “Extract JSON for Me” — This Is Wrong
❌ Wrong:
User: Convert this email to JSON for me
Claude: Sure, here's the JSON data you requested: [output] Hope this helps!
— Your parser crashes.
3.2 ✅ Correct System Prompt Template
System: You are a JSON extraction assistant.
- Output ONLY valid JSON. No markdown, no explanations, no greetings.
- Every response must start with { and end with }.
- All strings must use double quotes. No trailing commas.
User: Extract the following fields from the email below:
{
"sender": "string",
"intent": "refund|cancel|account_issue|other",
"urgency": "high|medium|low",
"refund_amount": "number | null"
}
Rules:
1. For missing fields, use null
2. urgency的判断依据:出现"ASAP"、"urgent"、"immediately"为high;出现"as soon as possible"、"please look into"为medium
3. Do not hallucinate amounts
Email text: {email_content}
3.3 Production-Ready Extraction with Retry
import json
import re
import backoff
from anthropic import Anthropic
client = Anthropic(api_key="your-key")
@backoff.on_exception(backoff.expo, Exception, max_tries=3)
def extract_json_with_retry(email_text: str) -> dict:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
temperature=0.0, # Extraction must use 0 — no creativity
system="""You are a JSON extraction assistant.
Output ONLY valid JSON. No explanations, no markdown.
Every response must start with { and end with }.""",
messages=[{
"role": "user",
"content": f"Extract sender, intent, urgency, refund_amount from:\n{email_text}"
}]
)
content = response.content[0].text
match = re.search(r'\{.*\}', content, re.DOTALL)
if not match:
raise ValueError("No JSON found in response")
return json.loads(match.group(0))
# Usage
result = extract_json_with_retry("Customer: I need a refund for order #12345. ASAP!")
print(result)
# {"sender": "Customer", "intent": "refund", "urgency": "high", "refund_amount": null}
Key points:
temperature=0.0: Extraction needs determinism, not creativitybackoffretry: Handles occasional API hiccups- Regex fallback: Recovers even if the model adds extra text
Part 4: Structured Prompt Engineering — What Actually Works
4.1 Structured Prompt vs Natural Language Prompt
A May 2026 academic study tested 4 prompt formats (from simple narrative to structured JSON with iterative tracking):
- For Gemini 2.5 Pro: Simple narrative prompts worked best (MAE=0.257). JSON-structured prompts increased error by 3.1x (MAE=0.789)
- For Claude Sonnet 4.5: Providing examples gave minor accuracy improvement
Conclusion: Don’t overcomplicate your prompts. JSON-formatted prompts aren’t necessarily better than natural language — they can be worse. Run A/B tests first.
4.2 Structured Prompt Golden Template (Validated)
Based on the “8 super-prompts” methodology from God of Prompt for investment banking workflows — core principle is constraint-driven design:
[Task Definition]
Extract structured fields from customer feedback emails
[Output Schema]
{
"feedback_category": "product|shipping|billing|support",
"sentiment_score": -1.0 to 1.0 float,
"key_phrases": ["at least 2 keywords"],
"requires_followup": true/false
}
[Constraints]
1. If feedback_category cannot be determined, use null
2. sentiment_score calculation rules:
- "love", "great", "amazing" → 0.5 to 1.0
- "hate", "broken", "terrible" → -1.0 to -0.5
- Mixed sentiment → -0.5 to 0.5
3. key_phrases must be extracted verbatim, no paraphrasing
[Validation Checklist]
- Is the output valid JSON?
- Do field names match the Schema exactly?
- Are numeric values within allowed ranges?
- Is there any hallucinated content?
[Example Output]
{"feedback_category": "product", "sentiment_score": -0.8, "key_phrases": ["battery died", "replacement"], "requires_followup": true}
On the Wall Street Prep 2026 benchmark, this approach scored 5.5/10 for Claude, ranking #2.
Part 5: Claude’s Reasoning for Complex Analytical Workflows
5.1 Agent Benchmark: SWE-bench
2026 agent capability benchmarks show Claude 4 series scoring 72%+ on SWE-bench (the gold standard for code engineering agents), marking the transition from “assistant tool” to “autonomous engineer.”
| Model | SWE-bench Verified | Significance |
|---|---|---|
| Claude 4 (Sonnet/Opus) | 72%+ | Can autonomously locate and fix real GitHub bugs |
| Claude 3.5 Sonnet (2024) | 49% | 23 point improvement in one year |
| High-compute mode (Claude 4) | 80%+ | Best-of-N parallel sampling |
This means: Claude doesn’t just extract JSON — it understands what to do with the extracted data (generate SQL, update databases, trigger downstream workflows).
5.2 Long Context Extraction: Claude 4.7’s 1.5M Window
In Q2 2026, mainstream model contexts entered the “million token era.” Claude 4.7’s 1.5M window performs excellently on “needle-in-haystack” tests:
| Test Dimension | Claude 4.7 (1.5M) | GPT-5.5 (1.0M) |
|---|---|---|
| Time to first token (TTFT) | ~4.2 sec | ~2.8 sec |
| Ultra-long text recall | High retention | Good |
| Best for | Deep analysis, auditing, code review | Real-time interaction, customer service agents |
If your extraction task requires “finding” key information from hundreds of PDF pages, Claude’s long context advantage is clear.
Part 6: Cost Optimization & Engineering Traps
6.1 Prompt Caching: Save 80% on Input Costs
Claude 4.7 supports prompt caching: cache up to 1M tokens of static background context. Subsequent queries pay only minimal cache retrieval fees.
# Enable caching (Claude 4.7+)
response = client.messages.create(
model="claude-4-7",
cache_control={"type": "ephemeral"}, # Enable cache
messages=[...]
)
# Input cost reduced by 80%+ — perfect for repeated queries
6.2 Effort Strategy: Higher Isn’t Always Better
Claude Code’s /effort setting directly impacts reasoning depth and token consumption:
| Effort Level | Best For | Token Consumption | Latency |
|---|---|---|---|
| low | Single-field extraction, simple classification | Lowest | Fastest |
| medium | Regular extraction tasks | Baseline | Baseline |
| high | Complex nested JSON, multi-step reasoning | Higher | Slower |
| xhigh | Default recommendation | Medium-high | Medium |
| max | Extremely hard problems (diminishing returns) | Highest | Slowest |
Wrong approach: Setting everything to max → More expensive, slower, not necessarily more accurate.
Right approach:
- Simple extraction: default effort
- Complex tasks: xhigh
- max only for genuinely “brain-burning” problems
6.3 Tokenizer Changes: Don’t Assume “Prices Didn’t Go Up”
Opus 4.7 uses a new tokenizer. The same text may use 1.0-1.35x more input tokens. Per-token price didn’t change, but your bill might go up — a trap many teams have fallen into.
Part 7: Model Selection Decision Tree
flowchart TD
A[Start: Need JSON extraction] --> B{Task complexity?}
B -->|Simple: single field / classification| C[Haiku 4.5]
B -->|Medium: multi-field / constraints| D{Accuracy requirement?}
B -->|Complex: nested / multi-step reasoning| E[Sonnet 4.6]
D -->|Can tolerate 5% failure| C
D -->|Need near 100%| E
C --> F[Monthly cost <$5<br>Good for high concurrency]
E --> G[Monthly cost $20-50<br>Good for core business]
E --> H{Data volume?}
H -->|Single <10K context| I[Sonnet 4.6<br>default effort]
H -->|Long docs >100K| J[Opus 4.7<br>+ prompt caching]
Part 8: Summary — One Sentence Pick
| Your Scenario | Recommended Model | Why |
|---|---|---|
| High concurrency, cost-sensitive, can tolerate small failure rate | Haiku 4.5 | 1/8 price, 96.7% completeness |
| Core business, need 0 hallucinations, complex schema | Sonnet 4.6 | 0 hallucination rate, best overall |
| Web scraping, HTML parsing | Sonnet 4.6 | #1 extraction quality, 3.5x faster than competitor |
| Ultra-long document extraction (500+ pages) | Opus 4.7 | 1.5M context, high recall |
| Extremely low budget, can tolerate >10% failure | GPT-5-mini / Gemini Flash | Lowest price, needs extra fallback logic |
Final honest take: Claude Sonnet 4.6 is the “doesn’t suck at anything” choice for structured extraction. It might not take first place in every single metric, but it saves you from writing lots of “fallback code” and “JSON repair scripts.” If your business depends on extraction, paying those few extra cents buys you peace of mind.