Everyone’s posting benchmark numbers. I’m going to post something more useful: what happened when I ran both models through the actual tasks I use in production.
I tested on three things: structured data extraction, long-form content generation, and multi-step tool use. These map directly to what Orion does every day.
The setup
I ran the same prompts through GPT-5 (via the OpenAI API) and Claude 4 Sonnet (via the Anthropic API) over about two weeks. Both models got the same temperature (0 for extraction tasks, 0.7 for creative tasks) and were scored against the same evaluation rubric.
This is not a rigorous scientific study. These are a working developer’s notes.
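For anyone who wants to reproduce a comparison like this, here is a minimal sketch of how the shared settings could be pinned down so both models see identical inputs. It is not the harness used for these notes: `RunConfig`, `build_runs`, and the task-type names are hypothetical, and only the two temperature values come from the setup above.

```python
from dataclasses import dataclass
from typing import List

# Temperatures from the setup described above; the task-type keys are illustrative.
TEMPERATURE_BY_TASK = {"extraction": 0.0, "creative": 0.7}


@dataclass(frozen=True)
class RunConfig:
    """One (model, prompt) pairing with the settings it should run under."""
    model: str
    task_type: str
    prompt: str
    temperature: float


def build_runs(models: List[str], task_type: str, prompts: List[str]) -> List[RunConfig]:
    """Pair every prompt with every model, using one temperature per task type."""
    temperature = TEMPERATURE_BY_TASK[task_type]
    return [
        RunConfig(model, task_type, prompt, temperature)
        for prompt in prompts
        for model in models
    ]
```

Building the run list up front makes it harder to accidentally give one model a tweaked prompt or a different temperature halfway through a two-week test.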
Structured data extraction
Winner: Claude 4 Sonnet
I fed both models the same messy marketing briefs — the kind a real client sends, with inconsistent formatting, missing fields, and contradictory information — and asked them to extract structured JSON.
Claude 4 was more likely to flag ambiguity than to silently pick an interpretation. GPT-5 was more likely to fill in gaps confidently, which sounds good until you realize it sometimes filled them incorrectly. For extraction where I need reliability over confidence, Claude wins.
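One way to make "flags ambiguity" measurable is to ask for an explicit ambiguities list in the JSON and then check the output for fields that are empty without being flagged. The sketch below assumes a hypothetical output shape (an `ambiguities` array of `{"field", "reason"}` objects) and hypothetical required fields; neither comes from the briefs tested here.

```python
import json
from typing import Sequence

# Hypothetical required fields for a marketing brief.
REQUIRED_FIELDS = ("client", "budget", "deadline")


def review_extraction(raw: str, required: Sequence[str] = REQUIRED_FIELDS) -> dict:
    """Parse a model's JSON output and report which fields need a human look."""
    data = json.loads(raw)
    # Fields the model explicitly said it was unsure about.
    flagged = {item["field"] for item in data.get("ambiguities", [])}
    # Required fields that are empty and were NOT flagged by the model.
    missing = [f for f in required if data.get(f) is None and f not in flagged]
    return {
        "needs_review": sorted(flagged),
        "silently_missing": missing,
        "ok": not flagged and not missing,
    }
```

Note that this only catches gaps. A confidently wrong value still passes, so it complements a spot check against the source brief rather than replacing it.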
Long-form content generation
Winner: Tie (different strengths)
GPT-5 writes in a more natural, varied style by default. If you hand it a brief and say “write a blog post,” the output reads like a human wrote it. Claude 4’s default output is cleaner and more structured — good for documentation, a bit bland for editorial content.
With the right system prompt, both models can produce excellent long-form content. But GPT-5 needs less prompting to get there. Claude 4 is easier to constrain when you need consistency across many outputs.
Multi-step tool use
Winner: GPT-5 (barely)
I set up a simple agent loop: search for information, extract key points, draft a summary, check it against source material, revise if needed. Five tools, sequential execution.
GPT-5 was more consistent at completing the full loop without getting confused about which step it was on. Claude 4 occasionally got stuck revisiting earlier steps or asked for clarification when the task was unambiguous. The gap is small and probably closeable with better prompting, but GPT-5 felt more “agentic” out of the box.
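To judge whether a model "got confused about which step it was on," it helps to have a reference trace of the intended order. Here is a minimal fixed-order sketch of the loop described above, with the five tools passed in as plain functions. In the actual test the model chooses the tool calls itself; the function names, signatures, and the `max_revisions` cap here are all assumptions for illustration.

```python
from typing import Callable, List, Tuple


def run_reference_loop(
    query: str,
    search: Callable[[str], List[str]],
    extract: Callable[[List[str]], List[str]],
    draft: Callable[[List[str]], str],
    check: Callable[[str, List[str]], List[str]],
    revise: Callable[[str, List[str]], str],
    max_revisions: int = 2,
) -> Tuple[str, List[str]]:
    """Run search -> extract -> draft -> (check -> revise)* and record the step order."""
    trace: List[str] = []
    sources = search(query)
    trace.append("search")
    points = extract(sources)
    trace.append("extract")
    summary = draft(points)
    trace.append("draft")
    # Stops after max_revisions rounds even if problems remain.
    for _ in range(max_revisions):
        problems = check(summary, sources)
        trace.append("check")
        if not problems:
            break
        summary = revise(summary, problems)
        trace.append("revise")
    return summary, trace
```

Comparing the sequence of tool calls a model actually made against a trace like this turns "felt more agentic" into something you can count: extra searches, repeated extracts, or a missing check all show up as diffs.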
Cost and speed
At current API pricing, Claude 4 Sonnet is meaningfully cheaper than GPT-5 for equivalent context lengths. For high-volume production use, that matters. Speed is roughly comparable on streaming — both feel snappy enough for real-time interfaces.
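Because pricing changes, it is worth running the arithmetic yourself with your own traffic shape. A minimal sketch, assuming per-million-token pricing with separate input and output rates; `monthly_cost` is a hypothetical helper and every number you pass it should come from the providers' current pricing pages.

```python
def monthly_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from average tokens per request and per-million-token prices."""
    per_request = (
        input_tokens / 1_000_000 * price_in_per_mtok
        + output_tokens / 1_000_000 * price_out_per_mtok
    )
    return per_request * requests_per_day * days


# Placeholder prices only, NOT real rates for either model.
estimate = monthly_cost(
    requests_per_day=1000,
    input_tokens=2000,
    output_tokens=500,
    price_in_per_mtok=1.00,
    price_out_per_mtok=4.00,
)
```

Run it once per model with the real rates and the gap at your volume becomes a concrete number instead of a feeling.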
My actual decision
I use both. Claude 4 Sonnet for extraction and structured tasks. GPT-5 for content generation where style matters. Neither model is so superior that you should bet your entire stack on it.
The more important variables are your prompt, your context, and your evaluation loop. A mediocre model with excellent prompts beats an excellent model with mediocre prompts, every time.
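An evaluation loop does not need to be elaborate to be useful. Here is a minimal sketch, assuming a rubric expressed as named pass/fail checks on the output text; `score_output`, `compare_variants`, and the example checks are hypothetical, not the rubric used for these notes.

```python
from typing import Callable, Dict, List

Rubric = Dict[str, Callable[[str], bool]]


def score_output(output: str, rubric: Rubric) -> float:
    """Fraction of rubric checks that the output passes."""
    passed = sum(1 for check in rubric.values() if check(output))
    return passed / len(rubric)


def compare_variants(outputs_by_variant: Dict[str, List[str]], rubric: Rubric) -> Dict[str, float]:
    """Average rubric score per variant (a variant can be a prompt, a model, or both)."""
    return {
        name: sum(score_output(o, rubric) for o in outputs) / len(outputs)
        for name, outputs in outputs_by_variant.items()
    }


# Illustrative rubric: each check is a cheap, deterministic test on the text.
example_rubric: Rubric = {
    "mentions_deadline": lambda text: "deadline" in text.lower(),
    "under_50_chars": lambda text: len(text) < 50,
}
```

Even crude checks like these let you tell whether a prompt change or a model swap actually moved the score, which is the measurement the rest of this post depends on.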
If you’re just starting out and need to pick one: start with Claude 4 Sonnet. The API is clean, the reliability is high, and Anthropic’s rate limits are less punishing on lower tiers. Switch or add GPT-5 when you have a specific task where you can measure the difference.