The Maze: Anthropic now sits at the awkward end of the AI market: expensive enough to make every procurement person twitch, useful enough that many developers still keep it in the stack. In the Artificial Analysis cost view, Claude Opus 4.7 costs $5,117 to run the full Intelligence Index. Claude Sonnet 4.6 + max costs $4,206. GPT-5.5 xhigh sits lower at $3,357. Gemini 3.5 Flash is far below at $1,552. Premium AI is not dead. It is just being forced to prove its value on every task.
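The gaps between those totals are easy to check. A minimal Python sketch, using only the four dollar figures quoted above (the dictionary and function names are illustrative, not anything Artificial Analysis publishes):

```python
# Totals quoted above for running the full Intelligence Index (USD).
index_cost = {
    "Claude Opus 4.7": 5117,
    "Claude Sonnet 4.6 + max": 4206,
    "GPT-5.5 xhigh": 3357,
    "Gemini 3.5 Flash": 1552,
}


def cost_ratio(model: str, baseline: str) -> float:
    """Return the model's index cost as a multiple of the baseline's."""
    return index_cost[model] / index_cost[baseline]


for baseline in ("GPT-5.5 xhigh", "Gemini 3.5 Flash"):
    ratio = cost_ratio("Claude Opus 4.7", baseline)
    print(f"Claude Opus 4.7 vs {baseline}: {ratio:.1f}x")
# Claude Opus 4.7 vs GPT-5.5 xhigh: 1.5x
# Claude Opus 4.7 vs Gemini 3.5 Flash: 3.3x
```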

  • Claude's premium is not marginal. The top Claude run is 1.5x GPT-5.5 xhigh and 3.3x Gemini 3.5 Flash in the visible cost view. That gap matters because the benchmark cost includes input, reasoning, and output components, not only a neat per-token sticker price. Opus 4.7 shows $2,319 in reasoning and $2,319 in output on top of $479 in input. Sonnet 4.6 + max shows even more reasoning cost at $2,805. The expensive part is not asking the model a question. It is making the model think, act, revise, and produce enough output to finish valuable work.

  • The ranking turns AI model choice into a workload decision. Cheap models win when the task is high-volume, easy to verify, and tolerant of retries. Premium models win only when failure is more expensive than the model call. That is why the visible comments matter. One developer argued that GPT-5.5 xhigh plus subscription economics can support more parallel agents for the same spend. Another pointed out that tool calls and long context can skew the benchmark fast. Both are right. The buyer is not choosing a model. The buyer is choosing a cost curve for coding agents, research workflows, support automation, and internal operating systems.

  • Artificial Analysis makes the caveat explicit. The methodology describes the Intelligence Index as a text-only English benchmark spanning reasoning, knowledge, math, and programming. Version 4.0 includes GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and other evaluations. That makes it useful for comparing frontier models. It does not make it a perfect proxy for every company workload. A retailer using AI for product-content QA, marketplace seller support, or merchandising analysis should test against its own failure cost, not only the leaderboard.

  • The commercial lesson is uncomfortable for both camps. The "AI will be free" crowd ignores that some customers pay for reliability, workflow fit, and fewer retries. The "quality always wins" crowd ignores that agentic usage multiplies cost brutally. A model that is 10% better at 3x the price can still win for high-stakes code review. It can also lose instantly in catalog enrichment, customer-service triage, or ad-copy variants. The premium has to map to an operational bottleneck, not a vague belief that better intelligence deserves a blank check.
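The "10% better at 3x the price" trade-off can be made concrete with a toy expected-cost model. This is a sketch under loud assumptions: the success rates, the call prices (in units of one cheap call), and the failure costs below are hypothetical, and the model ignores retries entirely. It only shows how a break-even point falls out of the arithmetic.

```python
def expected_cost(call_cost: float, success_rate: float, failure_cost: float) -> float:
    """Expected spend per task: the model call plus the expected cost of a failure."""
    return call_cost + (1.0 - success_rate) * failure_cost


def break_even_failure_cost(
    cheap_cost: float,
    cheap_success: float,
    premium_cost: float,
    premium_success: float,
) -> float:
    """Failure cost above which the premium model has the lower expected cost."""
    if premium_success <= cheap_success:
        raise ValueError("premium model must succeed more often to ever break even")
    return (premium_cost - cheap_cost) / (premium_success - cheap_success)


# Hypothetical numbers: costs are in units of one cheap call.
CHEAP_COST, CHEAP_SUCCESS = 1.0, 0.80
PREMIUM_COST, PREMIUM_SUCCESS = 3.0, 0.88  # 3x the price, 10% higher success rate

threshold = break_even_failure_cost(
    CHEAP_COST, CHEAP_SUCCESS, PREMIUM_COST, PREMIUM_SUCCESS
)
print(f"premium wins once a failure costs more than about {threshold:.0f} cheap calls")

# Hypothetical failure costs for two of the workloads named above.
for label, failure_cost in [("ad-copy variant", 2.0), ("high-stakes code review", 500.0)]:
    cheap = expected_cost(CHEAP_COST, CHEAP_SUCCESS, failure_cost)
    premium = expected_cost(PREMIUM_COST, PREMIUM_SUCCESS, failure_cost)
    winner = "premium" if premium < cheap else "cheap"
    print(f"{label}: cheap={cheap:.2f}, premium={premium:.2f} -> {winner}")
```

With these made-up inputs the cheap model wins the ad-copy case and the premium model wins the code-review case. The useful part is the threshold, which a team would recompute from its own measured success rates and failure costs.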

Why it matters: AI pricing is moving from curiosity to operating leverage. The winners will not be the teams that buy the smartest model by default or the cheapest model by reflex. They will route work by economics: premium models for bottlenecks where quality changes the outcome, cheaper models for volume work where throughput matters more. That is boring. It is also how every expensive technology becomes a real business system.
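Routing by economics could look something like the sketch below. Everything in it is an assumption for illustration: the three tiers, their prices and success rates, and the per-task failure costs are invented, and a real router would use measured numbers and account for retries, latency, and context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    call_cost: float      # cost of one call, in arbitrary units
    success_rate: float   # fraction of tasks completed acceptably


@dataclass(frozen=True)
class Task:
    name: str
    failure_cost: float   # cost of a bad result, in the same units


def expected_cost(tier: ModelTier, task: Task) -> float:
    """Call cost plus the expected cost of a failed result."""
    return tier.call_cost + (1.0 - tier.success_rate) * task.failure_cost


def route(task: Task, tiers: list[ModelTier]) -> ModelTier:
    """Pick the tier with the lowest expected cost for this task."""
    return min(tiers, key=lambda tier: expected_cost(tier, task))


# Hypothetical tiers and tasks; none of these figures come from the benchmark.
TIERS = [
    ModelTier("cheap", call_cost=1.0, success_rate=0.80),
    ModelTier("mid", call_cost=2.0, success_rate=0.85),
    ModelTier("premium", call_cost=3.0, success_rate=0.88),
]
TASKS = [
    Task("catalog enrichment", failure_cost=2.0),
    Task("customer-service triage", failure_cost=30.0),
    Task("high-stakes code review", failure_cost=500.0),
]

for task in TASKS:
    chosen = route(task, TIERS)
    print(f"{task.name} -> {chosen.name} (expected cost {expected_cost(chosen, task):.2f})")
```

The design choice is that no tier is a default. Each task is priced against its own failure cost, so volume work falls to cheaper tiers and only the bottleneck tasks pay the premium.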
