Anthropic shipped Claude 4 this month, and after spending a few weeks integrating it into Orion I have some practical notes — not the marketing take, but what actually changes when you swap models in a real production app.
The short version
Claude 4 Sonnet is the one you want for most product work. It’s faster and significantly cheaper than Opus while closing the gap on reasoning quality. Opus 4 is reserved for tasks where you genuinely need the best possible result and latency is secondary — think long document analysis or complex multi-step agents.
What actually got better
Extended thinking is useful now. Previous models had a “thinking” mode that felt like a party trick. Claude 4’s extended thinking produces reasoning traces I actually trust for multi-step decisions. I’m using it for campaign strategy generation in Orion and the output quality improvement is real.
Tool use is more reliable. In Claude 3, you’d occasionally get hallucinated tool calls — the model would invent parameters that didn’t exist. Claude 4 is noticeably better at staying inside the schema you define. Still not perfect, but rare enough that you can build on it without elaborate validation layers.
200K context actually works. Long-context models have always had a “lost in the middle” problem — information buried in the middle of a giant prompt gets ignored. Anthropic claims they’ve improved this, and from my testing it holds up. I fed it a 60-page marketing strategy doc and the model referenced specific sections correctly.
What hasn’t changed
Prompt engineering still matters. The model is better, but a sloppy system prompt still produces sloppy output. The fundamentals — clear role, explicit constraints, structured output format — still do 80% of the work.
Cost is still the main scaling constraint. For high-volume use cases you need to be deliberate about when you reach for the big model vs. caching, summarization, or a lighter model for simpler tasks.
My current stack after the upgrade
For Orion I’m running:
- Claude 4 Sonnet for all real-time user interactions
- Claude 4 Opus for offline batch jobs (brand analysis reports, weekly summaries)
- Claude 3 Haiku for simple classification tasks that don’t need reasoning
That split keeps quality high and cost predictable. If you’re building something similar, start with Sonnet and only reach for Opus when you have a specific quality gap you can measure.