I was out with a few friends tonight, and somehow the conversation drifted to AI coding tools — as it always does these days. One of them swore by Claude for long-context work. Another said Opus 4.6 had gotten noticeably dumber. By the time I got home, I couldn’t let it go. So I fired up my laptop and started testing memory retention across three models right there at 11 PM.
I took Opus 4.7, Opus 4.6, and GPT-5.5, built a five-step constraint tracking experiment, and ran them through the wringer. The results genuinely surprised me — not in the “wow this is amazing” way, but in the “wait, higher effort actually made it worse?” way.
Here’s the raw data, the code I used, and what this means for anyone building real AI agents that need to remember what they said three steps ago.
The Short Version
Claude Opus 4.7 is way better at remembering constraints than Opus 4.6 — 60% memory failure rate vs 100% on the same tasks. But here’s the kicker: cranking the “effort” setting to max made constraint memory worse, not better. And the 1.5M context window is huge, but “context window size” and “actual ability to remember things across a long conversation” are completely different metrics. One tells you how much you can shove in. The other tells you if it’ll actually use it.
Also, Opus 4.6 is 2.4x slower and 48% more expensive than Opus 4.7. If you’re still running 4.6 in production, you’re paying more for worse memory retention. That’s not opinion — that’s from the latency data.
Part 1: How I Actually Tested This
I didn’t just ask “what’s your name” five times. I built a proper multi-step constraint tracking experiment based on Mem0’s benchmark methodology .
The Setup
Each test was a five-step conversation where later steps depended entirely on earlier ones:
Step 1: Establish a hard constraint (e.g., “this database migration must complete within 200ms”)
Steps 2-4: Intermediate work that builds findings the model needs to reference later
Step 5: A trap question — something that presents a plausible wrong answer that violates the Step 1 constraint
The metrics I tracked:
| Metric | What It Measures |
|---|---|
mem_fail | Did the model violate its own earlier constraint? (Binary: YES/NO) |
citations | How many times it explicitly referenced the Step 1 constraint in later steps |
latency | Wall-clock time from prompt to complete response |
cost | Real token-based API cost |
I used a second Claude call (Opus 4.6) as an LLM judge to score each response. This replaced simple keyword matching, which is useless for this kind of nuanced evaluation .
The Hardest Tasks
Two tasks consistently broke every model:
- T2: Track three distinct race conditions across five steps, then reference all three in a final fix
- T4: Track three concurrent bugs and label each fix by which step it originated from
These are realistic agent tasks. Not “remember my name.” Real “I told you three different things 20 minutes ago and now I need you to use all of them” work.
Part 2: Opus 4.7 vs Opus 4.6 — The Numbers
Both models at high effort. Averaged across two independent runs .
| Metric | Opus 4.6 | Opus 4.7 | Change |
|---|---|---|---|
| Memory failure rate | 100% (5/5 tasks failed) | 40% (2/5 failed) | -60% |
| Avg latency | 115.7 sec | 47.8 sec | -59% faster |
| Cost per task | $0.1596 | $0.1078 | -32% cheaper |
| Avg citations of constraint | 2.0 | 1.6 | -0.4 |
What this actually means:
Opus 4.6 failed every single constraint tracking task. Every time. The model would establish a rule in Step 1, then completely ignore it by Step 5. Two independent runs, same result .
Opus 4.7 fails on the hard multi-constraint tasks (T2 and T4) but passes the single-constraint tasks consistently. Single constraint: reliable. Three simultaneous constraints: not yet.
The latency gap is massive. Opus 4.6 is 2.4x slower and worse at memory. If you’re still using 4.6, update your model string and save money while getting better results. There’s literally no downside.
But — and this is important — the citations metric is weird. Opus 4.6 actually cited constraints more often (2.0 vs 1.6) but still failed more. The model was literally saying the right words while failing to apply them correctly. That’s a more subtle failure mode than just “forgetting.” It’s a reasoning failure masked by fluent language.
Part 3: The Effort Calibration Trap (This Is Important)
This is the finding that made me re-run the tests twice.
I tested Opus 4.7 at three effort levels: standard, high, and xhigh — but directly on Anthropic’s API, not through OpenRouter (which silently ignores the effort parameter — more on that later) .
Direct Anthropic API Results
| Task | Standard | High | Xhigh |
|---|---|---|---|
| T1 (single constraint) | ✅ PASS | ✅ PASS | ✅ PASS |
| T2 (3 constraints) | ✅ PASS | ❌ FAIL | ❌ FAIL |
| T4 (3 constraints + labeling) | ✅ PASS | ❌ FAIL | ❌ FAIL |
| Latency scaling (T4) | 1x (baseline) | ~3x | ~5.5x |
Here’s the thing that breaks expectations: standard effort passed all three tasks. high and xhigh failed on the multi-constraint tasks .
More thinking did not help. It made things worse.
At standard effort, the model writes a focused, direct answer. At xhigh, it writes substantially more — T4 at xhigh cost 5.5x more than standard — and the additional reasoning introduces drift. The model doesn’t directly violate the constraint, but it “reasons past it” in ways that the judge interprets as failure to enforce it .
What this means for you: If you’re building production agents running constraint-heavy analysis tasks, defaulting to xhigh is not obviously correct. Test standard first. You might be paying 5x more for worse results.
Part 4: The OpenRouter Problem Nobody Talks About
Here’s a trap I fell into. OpenRouter silently drops Anthropic-specific parameters like output_config, thinking, and beta headers. If your production system uses effort control via OpenRouter, it is not working. The parameter is being ignored .
In the OpenRouter run, all three effort levels produced nearly identical latency:
| Task | Standard | High | Xhigh |
|---|---|---|---|
| T4 latency | 35s | 39s | 39s |
That’s not how actual effort scaling works. On the direct API, xhigh took 5.5x longer. The OpenRouter numbers confirm the parameter was silently dropped .
This is not documented prominently. It’s easy to miss. If you’re using OpenRouter for Claude, verify your effort settings are actually being applied.
Part 5: 1.5M Context Window vs Actual Memory
Anthropic claims Opus 4.7 has a 1.5M token context window . That sounds huge. But “context window” and “within-session memory coherence” are completely different things .
The context window tells you how much you can input. It says nothing about whether the model will remember what it said in Step 1 when generating Step 5.
My benchmark tasks fit in under 4,000 tokens — well within any model’s context window. Yet Opus 4.6 failed every task. The problem wasn’t capacity. It was coherence .
Needle-in-Haystack Comparison
| Model | 1M Token Recall | Notes |
|---|---|---|
| Claude Opus 4.6 | 78.3% | Strong baseline |
| Claude Opus 4.7 | 32.2% | -46 points — catastrophic drop |
| GPT-5.5 | ~85% | Superior recall at extreme length |
Wait, what? Opus 4.7’s long-context recall dropped from 78.3% to 32.2%?
Yes. And here’s why: tokenizer change. Opus 4.7 uses a new tokenizer that increases token count by 1.0-1.35x for the same text. Nominal context window is still 1.5M, but the actual information density dropped. For long tasks, actual token consumption went up ~35% with no price change — a stealth cost increase .
For context retrieval tasks (finding one fact in 1M tokens of noise), GPT-5.5 is currently better. For reasoning tasks that need to track constraints across moderate-length conversations, Opus 4.7 is better. Different strengths for different use cases.
Part 6: “Context Rot” — The Official Term for What You’re Feeling
Anthropic recently published a blog post admitting what many developers already knew: context rot is real. The longer your conversation, the dumber the model gets .
“The model’s attention gets spread across more tokens. Earlier, irrelevant content starts interfering with the current task. Performance drops.”
This isn’t a Claude problem. It’s an architecture problem. Every transformer model suffers from it to some degree. But Claude’s official guidance gives you tools to fight it :
| Action | What It Does | When to Use |
|---|---|---|
/rewind | Jump back to a previous message, discard everything after | When the model went down a wrong path |
/clear | Start fresh with a brief summary of what happened | When context is completely polluted |
/compact | Summarize the conversation, drop low-value details | When you’re approaching token limits |
| Subagents | Spawn a separate context window for isolated work | When you only need the conclusion, not the process |
The subagent pattern is the most powerful but most misunderstood. Each subagent runs in its own fresh context window. It can read files, search, iterate — all that noise stays in the subagent. Only the conclusion comes back to the main session .
Rule of thumb from Anthropic: “Do I still need the raw tool outputs, or just the conclusion?” If just the conclusion → use a subagent.
Part 7: How Claude Compares to GPT and Gemini for Memory
| Feature | Claude Opus 4.7 | GPT-5.5 | Gemini 2.5 Pro |
|---|---|---|---|
| Context window | 1.5M tokens | 1M tokens | 2M tokens |
| Constraint tracking (5-step) | 60% pass rate | Not tested in same benchmark | Not tested |
| Long-context recall (1M) | 32.2% | ~85% | Strong |
| Time to first token (TTFT) | ~4.2 sec | ~2.8 sec | Not specified |
| Cross-session memory | Projects (500K tokens) | Fact storage | Vertex AI required |
The trade-off is clear:
- Claude 4.7: Better at reasoning through constraints across moderate-length conversations. Slower to start. Weaker at ultra-long-context retrieval.
- GPT-5.5: Faster responses. Better at finding needles in 1M-token haystacks. Worse at multi-step reasoning coherence.
- Gemini 2.5 Pro: Largest context window (2M). Best for massive document analysis. Cross-session memory requires extra setup .
For agent memory across sessions — not just within one conversation — Claude Projects is the most straightforward. Upload documents once, they persist across all chats in that project. GPT’s memory stores facts but not documents. Gemini’s 2M window resets every session unless you use Vertex AI Agent Engine .
Part 8: Real-World Takeaway — What This Means for Your Agent
Do This:
- Use Opus 4.7, not 4.6. It’s faster, cheaper, and better at memory. There’s no reason to stay on 4.6.
- Start with
standardeffort. Higher effort costs more and can make constraint tracking worse. Teststandardfirst. - Use subagents for expensive research tasks. They keep your main context clean.
- Use
/rewindwhen the model goes down a wrong path. Don’t try to correct it — roll back and restart from the last clean state .
Don’t Do This:
- Don’t assume
xhigheffort is better. It’s not. Not for constraint tracking. - Don’t use OpenRouter for effort control. It’s silently ignored .
- Don’t confuse context window size with memory coherence. 1.5M input capacity doesn’t mean it’ll remember Step 1 at Step 50.
- Don’t let conversations run forever. The longer they get, the more context rot sets in. Use
/compactor subagents proactively .
If You’re Building a Production Agent:
- Run your own A/B test with
standardvshigheffort on your specific tasks - Monitor both
mem_failrate and cost - Implement explicit context management (rewind, compact, subagents) — don’t rely on the model to manage itself
- If you need ultra-long-context retrieval (finding one fact in 1M tokens), consider GPT-5.5
- If you need multi-step reasoning with constraint tracking, Claude 4.7 is your model