Claude Opus 4.7 benchmarked well and felt worse — refusals, early give-ups, and a quiet token-cost creep. Here's what actually went wrong, how Opus 4.8 walked it back, and why we defected to Codex for the whole 4.7 run.
When Anthropic shipped Claude Opus 4.7 on April 16, 2026, the benchmark table looked like a win. SWE-Bench Verified at 87.6%. SWE-Bench Pro at 64.3%. On paper, it was the strongest coding model the company had released.
Then developers actually used it for a month.
What followed wasn't a meltdown — despite what a few breathless headlines suggested — but it was a genuinely rough stretch: a wave of bizarre refusals, a model that kept declaring victory before the job was done, and a quiet bump in everyone's token bill that nobody announced. Six weeks later, on May 28, Opus 4.8 arrived looking less like a feature release and more like an apology written in changelog form.
This is the story of what went wrong with 4.7, and how 4.8 walked it back — told with the receipts, and without the parts that turned out to be internet noise.
The benchmark paradox
Here's the tension that defines this whole episode: 4.7 scored well and felt worse.
That gap is the most useful thing to understand about modern model releases. A higher SWE-Bench number doesn't mean the model is more pleasant — or even more reliable — to work with day to day. 4.7 could resolve more isolated benchmark tasks than its predecessor while simultaneously being more likely to refuse your work, give up halfway, or skip the one tool call that mattered. Benchmarks measure capability ceilings. They don't measure whether the model will actually use that capability on your problem at 4pm on a Tuesday.
Keep that in mind, because 4.8's headline improvements are almost entirely about the second category.
What actually went wrong with 4.7
1. The refusals — the real backlash
The loudest, best-documented complaint had nothing to do with coding skill. Opus 4.7 shipped with new automatic cybersecurity safeguards designed to block high-risk requests — and they misfired constantly.
Anthropic's own claude-code GitHub repo tells the story plainly. Refusal complaints went from roughly two or three a month through late 2025 to more than thirty in April 2026 alone, as The Register reported. And these weren't edge cases:
The director of LSU's Cyber Center was refused help proofreading a lab handout and basic cryptography exercises, noting in his issue that "for $200+ per month, basic help with editing tasks will not be rejected."
Standard computational structural biology got flagged as a Usage Policy violation — and critically, the same prompts had worked fine under Opus 4.6. That's not a model being cautious; that's a regression.
When a security researcher, a biologist, and a working developer all hit the same wall in the same month, the problem isn't the users.
2. Declaring victory early
The subtler complaint — and the one Anthropic effectively confirmed — was overconfidence. 4.7 had a habit of jumping to conclusions and announcing success on thin evidence. It would write a few tests, decide it was done, and move on while the actual bug sat untouched.
You don't have to take a Reddit thread's word for it. Anthropic's own 4.8 announcement describes models that "jump to conclusions, confidently claiming to have made progress... despite the evidence being thin." Boris Cherny from the Claude Code team put it more directly: 4.8 "tells you when it's unsure and catches its own bugs instead of declaring victory early." You only ship that messaging when the prior version had the opposite problem.
3. Skipped tools and lost threads
Two more agentic papercuts rounded it out. 4.7 would sometimes skip a tool call the task required — reasoning about a file instead of reading it, for instance. And on long agentic sessions, it tended to derail after context compaction, losing the plot partway through extended work.
Both are now confirmed by Anthropic in the most backhanded way possible: the official 4.8 notes list "less likely to skip a tool call the task required, an issue some users reported on Claude Opus 4.7" as a fix.
4. The cost creep nobody announced
Finally, the quiet one. 4.7 introduced a new tokenizer, and OpenRouter's analysis of over a million real requests found it produced 32–45% more native tokens for the same text. Base pricing didn't change — still $5 per million input tokens, $25 output — but tokens are the unit you're billed in. For uncached prompts above roughly 2,000 tokens, effective costs rose somewhere between 12% and 27%.
Anthropic's own migration guide confirms the mechanism, noting 4.7 "may use roughly 1x to 1.35x as many tokens." Prompt caching absorbs a lot of this in practice, so it wasn't catastrophic — but it was a real bill increase delivered with zero fanfare.
The part where we left
I'll be honest about how this went for us, because it's the most useful data point I have: we didn't wait 4.7 out. Somewhere in that month the refusals and the half-finished tasks stacked up past our patience, and we moved our whole workflow over to OpenAI's Codex for the entire 4.7 run.
It wasn't a statement or a benchmark-driven decision — it was triage. Codex wasn't categorically better at everything, and switching tools mid-stride has its own friction. But it didn't argue with us about proofreading, and it didn't declare victory at 60% and ask if we wanted to continue. When the model you pay for keeps getting in the way, the cheapest fix is the one that gets out of it. So we switched.
That's the thing benchmarks will never show you. No leaderboard captures "our team quietly migrated to a competitor for six weeks." But it's the truest review 4.7 got from us.
How 4.8 answers each one
Lay 4.8's improvements next to 4.7's complaints and the alignment is almost suspiciously tidy:
| Opus 4.7 complaint | Opus 4.8 response |
|---|---|
| Overconfident, declared victory early | ~4× less likely (Anthropic-reported) to let flaws in its own code pass unremarked |
| Skipped required tool calls | "Better tool triggering" — explicitly less likely to skip a needed call |
| Derailed after long-context compaction | Fewer compactions, better compaction recovery |
| Strong benchmarks | Edges 4.7: SWE-Bench Pro 69.2% vs 64.3%, Verified 88.6% vs 87.6% |
A fair caveat: those honesty and benchmark figures are Anthropic's own internal evaluations, not independent reproductions, so read them directionally. And the long-context fix isn't universal — some users running at maximum reasoning effort have reported 4.8 actually hitting limits sooner, because it spends more tokens thinking.
The one thing 4.8 conspicuously didn't fix? Model version pinning — the ability to lock your workflow to a specific version and opt out of mid-cycle behavior changes. It remains the most-requested feature, and the entire 4.7 episode is the argument for it.
The lesson worth keeping
The 4.7-to-4.8 arc is a small case study in something every team building on LLMs should internalize: the changelog and the experience are different documents.
4.7's regressions didn't show up in its launch benchmarks. They showed up in GitHub issues, in token meters, and in the specific frustration of a researcher who just wanted help proofreading a worksheet. The safeguards that caused the most pain were, on paper, a safety improvement. The tokenizer change was, on paper, a capability upgrade. Both made the day-to-day worse.
4.8 is the better model — genuinely. But the reason it's better isn't that it benchmarks higher. It's that Anthropic spent six weeks listening to the parts of the experience that benchmarks never capture, and shipped fixes aimed squarely at them. That's the upgrade that actually mattered.
As for us? 4.8 won the main seat back. But we didn't fully un-switch — Codex (now on GPT-5.5) stayed in the kit as a sidearm: Opus 4.8 drives the day-to-day, and we reach for Codex to complement it when a second model's take is useful. The fact that 4.8 earned its way back to primary tells you most of what you need to know — and the fact that we kept Codex around tells you the rest. A month of pain makes you keep a backup.
The takeaway for the rest of us: trust the benchmarks for what a model can do, and trust your own logs for what it will do. When the two disagree, your logs are right.