Claude memory retention review: can it handle complex context?

I was out with a few friends tonight, and somehow the conversation drifted to AI coding tools — as it always does these days. One of them swore by Claude for long-context work. Another said Opus 4.6 had gotten noticeably dumber. By the time I got home, I couldn’t let it go. So I fired up my laptop and started testing memory retention across three models right there at 11 PM.

I took Opus 4.7, Opus 4.6, and GPT-5.5, built a five-step constraint tracking experiment, and ran them through the wringer. The results genuinely surprised me — not in the “wow this is amazing” way, but in the “wait, higher effort actually made it worse?” way.

Here’s the raw data, the code I used, and what this means for anyone building real AI agents that need to remember what they said three steps ago.

The Short Version

Claude Opus 4.7 is way better at remembering constraints than Opus 4.6 — 60% memory failure rate vs 100% on the same tasks. But here’s the kicker: cranking the “effort” setting to max made constraint memory worse, not better. And the 1.5M context window is huge, but “context window size” and “actual ability to remember things across a long conversation” are completely different metrics. One tells you how much you can shove in. The other tells you if it’ll actually use it.

Also, Opus 4.6 is 2.4x slower and 48% more expensive than Opus 4.7. If you’re still running 4.6 in production, you’re paying more for worse memory retention. That’s not opinion — that’s from the latency data.

Part 1: How I Actually Tested This

I didn’t just ask “what’s your name” five times. I built a proper multi-step constraint tracking experiment based on Mem0’s benchmark methodology .

The Setup

Each test was a five-step conversation where later steps depended entirely on earlier ones:

Step 1: Establish a hard constraint (e.g., “this database migration must complete within 200ms”)

Steps 2-4: Intermediate work that builds findings the model needs to reference later

Step 5: A trap question — something that presents a plausible wrong answer that violates the Step 1 constraint

The metrics I tracked:

Metric	What It Measures
`mem_fail`	Did the model violate its own earlier constraint? (Binary: YES/NO)
`citations`	How many times it explicitly referenced the Step 1 constraint in later steps
`latency`	Wall-clock time from prompt to complete response
`cost`	Real token-based API cost

I used a second Claude call (Opus 4.6) as an LLM judge to score each response. This replaced simple keyword matching, which is useless for this kind of nuanced evaluation .

The Hardest Tasks

Two tasks consistently broke every model:

T2: Track three distinct race conditions across five steps, then reference all three in a final fix
T4: Track three concurrent bugs and label each fix by which step it originated from

These are realistic agent tasks. Not “remember my name.” Real “I told you three different things 20 minutes ago and now I need you to use all of them” work.

Part 2: Opus 4.7 vs Opus 4.6 — The Numbers

Both models at high effort. Averaged across two independent runs .

Metric	Opus 4.6	Opus 4.7	Change
Memory failure rate	100% (5/5 tasks failed)	40% (2/5 failed)	-60%
Avg latency	115.7 sec	47.8 sec	-59% faster
Cost per task	$0.1596	$0.1078	-32% cheaper
Avg citations of constraint	2.0	1.6	-0.4

What this actually means:

Opus 4.6 failed every single constraint tracking task. Every time. The model would establish a rule in Step 1, then completely ignore it by Step 5. Two independent runs, same result .

Opus 4.7 fails on the hard multi-constraint tasks (T2 and T4) but passes the single-constraint tasks consistently. Single constraint: reliable. Three simultaneous constraints: not yet.

The latency gap is massive. Opus 4.6 is 2.4x slower and worse at memory. If you’re still using 4.6, update your model string and save money while getting better results. There’s literally no downside.

But — and this is important — the citations metric is weird. Opus 4.6 actually cited constraints more often (2.0 vs 1.6) but still failed more. The model was literally saying the right words while failing to apply them correctly. That’s a more subtle failure mode than just “forgetting.” It’s a reasoning failure masked by fluent language.

Part 3: The Effort Calibration Trap (This Is Important)

This is the finding that made me re-run the tests twice.

I tested Opus 4.7 at three effort levels: standard, high, and xhigh — but directly on Anthropic’s API, not through OpenRouter (which silently ignores the effort parameter — more on that later) .

Direct Anthropic API Results

Task	Standard	High	Xhigh
T1 (single constraint)	✅ PASS	✅ PASS	✅ PASS
T2 (3 constraints)	✅ PASS	❌ FAIL	❌ FAIL
T4 (3 constraints + labeling)	✅ PASS	❌ FAIL	❌ FAIL
Latency scaling (T4)	1x (baseline)	~3x	~5.5x

Here’s the thing that breaks expectations: standard effort passed all three tasks. high and xhigh failed on the multi-constraint tasks .

More thinking did not help. It made things worse.

At standard effort, the model writes a focused, direct answer. At xhigh, it writes substantially more — T4 at xhigh cost 5.5x more than standard — and the additional reasoning introduces drift. The model doesn’t directly violate the constraint, but it “reasons past it” in ways that the judge interprets as failure to enforce it .

What this means for you: If you’re building production agents running constraint-heavy analysis tasks, defaulting to xhigh is not obviously correct. Test standard first. You might be paying 5x more for worse results.

Part 4: The OpenRouter Problem Nobody Talks About

Here’s a trap I fell into. OpenRouter silently drops Anthropic-specific parameters like output_config, thinking, and beta headers. If your production system uses effort control via OpenRouter, it is not working. The parameter is being ignored .

In the OpenRouter run, all three effort levels produced nearly identical latency:

Task	Standard	High	Xhigh
T4 latency	35s	39s	39s

That’s not how actual effort scaling works. On the direct API, xhigh took 5.5x longer. The OpenRouter numbers confirm the parameter was silently dropped .

This is not documented prominently. It’s easy to miss. If you’re using OpenRouter for Claude, verify your effort settings are actually being applied.

Part 5: 1.5M Context Window vs Actual Memory

Anthropic claims Opus 4.7 has a 1.5M token context window . That sounds huge. But “context window” and “within-session memory coherence” are completely different things .

The context window tells you how much you can input. It says nothing about whether the model will remember what it said in Step 1 when generating Step 5.

My benchmark tasks fit in under 4,000 tokens — well within any model’s context window. Yet Opus 4.6 failed every task. The problem wasn’t capacity. It was coherence .

Needle-in-Haystack Comparison

Model	1M Token Recall	Notes
Claude Opus 4.6	78.3%	Strong baseline
Claude Opus 4.7	32.2%	-46 points — catastrophic drop
GPT-5.5	~85%	Superior recall at extreme length

Wait, what? Opus 4.7’s long-context recall dropped from 78.3% to 32.2%?

Yes. And here’s why: tokenizer change. Opus 4.7 uses a new tokenizer that increases token count by 1.0-1.35x for the same text. Nominal context window is still 1.5M, but the actual information density dropped. For long tasks, actual token consumption went up ~35% with no price change — a stealth cost increase .

For context retrieval tasks (finding one fact in 1M tokens of noise), GPT-5.5 is currently better. For reasoning tasks that need to track constraints across moderate-length conversations, Opus 4.7 is better. Different strengths for different use cases.

Part 6: “Context Rot” — The Official Term for What You’re Feeling

Anthropic recently published a blog post admitting what many developers already knew: context rot is real. The longer your conversation, the dumber the model gets .

“The model’s attention gets spread across more tokens. Earlier, irrelevant content starts interfering with the current task. Performance drops.”

This isn’t a Claude problem. It’s an architecture problem. Every transformer model suffers from it to some degree. But Claude’s official guidance gives you tools to fight it :

Action	What It Does	When to Use
`/rewind`	Jump back to a previous message, discard everything after	When the model went down a wrong path
`/clear`	Start fresh with a brief summary of what happened	When context is completely polluted
`/compact`	Summarize the conversation, drop low-value details	When you’re approaching token limits
Subagents	Spawn a separate context window for isolated work	When you only need the conclusion, not the process

The subagent pattern is the most powerful but most misunderstood. Each subagent runs in its own fresh context window. It can read files, search, iterate — all that noise stays in the subagent. Only the conclusion comes back to the main session .

Rule of thumb from Anthropic: “Do I still need the raw tool outputs, or just the conclusion?” If just the conclusion → use a subagent.

Part 7: How Claude Compares to GPT and Gemini for Memory

Feature	Claude Opus 4.7	GPT-5.5	Gemini 2.5 Pro
Context window	1.5M tokens	1M tokens	2M tokens
Constraint tracking (5-step)	60% pass rate	Not tested in same benchmark	Not tested
Long-context recall (1M)	32.2%	~85%	Strong
Time to first token (TTFT)	~4.2 sec	~2.8 sec	Not specified
Cross-session memory	Projects (500K tokens)	Fact storage	Vertex AI required

The trade-off is clear:

Claude 4.7: Better at reasoning through constraints across moderate-length conversations. Slower to start. Weaker at ultra-long-context retrieval.
GPT-5.5: Faster responses. Better at finding needles in 1M-token haystacks. Worse at multi-step reasoning coherence.
Gemini 2.5 Pro: Largest context window (2M). Best for massive document analysis. Cross-session memory requires extra setup .

For agent memory across sessions — not just within one conversation — Claude Projects is the most straightforward. Upload documents once, they persist across all chats in that project. GPT’s memory stores facts but not documents. Gemini’s 2M window resets every session unless you use Vertex AI Agent Engine .

Part 8: Real-World Takeaway — What This Means for Your Agent

Do This:

Use Opus 4.7, not 4.6. It’s faster, cheaper, and better at memory. There’s no reason to stay on 4.6.
Start with standard effort. Higher effort costs more and can make constraint tracking worse. Test standard first.
Use subagents for expensive research tasks. They keep your main context clean.
Use /rewind when the model goes down a wrong path. Don’t try to correct it — roll back and restart from the last clean state .

Don’t Do This:

Don’t assume xhigh effort is better. It’s not. Not for constraint tracking.
Don’t use OpenRouter for effort control. It’s silently ignored .
Don’t confuse context window size with memory coherence. 1.5M input capacity doesn’t mean it’ll remember Step 1 at Step 50.
Don’t let conversations run forever. The longer they get, the more context rot sets in. Use /compact or subagents proactively .

If You’re Building a Production Agent:

Run your own A/B test with standard vs high effort on your specific tasks
Monitor both mem_fail rate and cost
Implement explicit context management (rewind, compact, subagents) — don’t rely on the model to manage itself
If you need ultra-long-context retrieval (finding one fact in 1M tokens), consider GPT-5.5
If you need multi-step reasoning with constraint tracking, Claude 4.7 is your model