Issue 112 · Apr 2026
OpenAI publishes the o3 system card, with cost-per-answer figures.
Matthew Paver · Editor · London · 5 min read
The lead this week is OpenAI's o3 system card, which puts the first credible numbers on the cost-per-correct-answer of a frontier reasoning model: for the first time at this tier, the figures that matter for production planning sit alongside the benchmark wins. The rest of the issue is context for teams budgeting against that.
If 'should we route this query through a reasoning model?' has been a judgment call on your team, you now have a number to argue with.
Spotlight
Reasoning-model cost figures are the planning input that mattered most this week, regardless of which provider you ship with.
Underhyped
OpenRouter's per-model latency dashboards deserve more attention than they are getting; they are the only public source of like-for-like routing data for production teams.
Risk to watch
Keep an eye on Stripe's policy on AI-generated chargebacks. Payment-rail policy quietly shifts what fraud-side AI agents can do.
Filed under: Industry, Models, Repos, Research, Tools
Lead story
OpenAI has published the o3 and o3-mini system cards alongside per-token pricing, with worked-example costs for benchmark-quality reasoning across maths, coding, and general problem solving.
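As a back-of-envelope sketch of how per-token prices turn into a per-call cost (the prices below are placeholders, not o3's published rates):

# Placeholder prices in USD per 1M tokens; substitute the system-card
# figures for the model you are costing.
PRICE_IN_PER_M = 10.00
PRICE_OUT_PER_M = 40.00

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one completion at the prices above."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Reasoning models bill hidden chain-of-thought as output tokens,
# so the output count runs well past the visible answer.
print(f"${cost_per_call(2_000, 15_000):.2f}")  # -> $0.62 at these prices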
Our take
Reasoning-model pricing has been opaque for months. A first-party cost-per-correct-answer figure is the input that lets a planning meeting be a planning meeting instead of a discussion of vibes.
Try this week
Run o3-mini against one of your existing eval sets this week and compare cost-per-correct-answer with whatever you ship today. The number is the deliverable, not the model.
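If you want a starting point, here is a minimal sketch of that comparison. It assumes the OpenAI Python client, an eval set of (prompt, expected) pairs, and a grade() check of your own; the eval set and grader are placeholders for your plumbing.

from openai import OpenAI

client = OpenAI()

def cost_per_correct(model, eval_set, price_in, price_out, grade):
    """eval_set: [(prompt, expected)]; prices in USD per 1M tokens."""
    total_cost, correct = 0.0, 0
    for prompt, expected in eval_set:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        total_cost += (resp.usage.prompt_tokens * price_in
                       + resp.usage.completion_tokens * price_out) / 1_000_000
        correct += grade(resp.choices[0].message.content, expected)
    return total_cost / max(correct, 1)  # guard against a zero-correct run

Run it once per candidate model; the resulting numbers are the slide for the planning meeting.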
Learn this week
Recommended reading tied to this week's lead.
Prompt of the week
You are a cost-aware routing assistant. For the user query below, decide whether to answer directly with a small model (cheap, fast) or escalate to a reasoning model (expensive, slow but more accurate). Output a single line, either ROUTE: small or ROUTE: reasoning, followed by one sentence of justification.
Works with: Claude, ChatGPT, Gemini
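To wire the prompt into an actual router, a minimal sketch using the OpenAI Python client (the router model name is a placeholder for any cheap, fast model; this is not an official pattern from any of the vendors above):

from openai import OpenAI

client = OpenAI()
ROUTER_PROMPT = (
    "You are a cost-aware routing assistant. For the user query below, decide "
    "whether to answer directly with a small model (cheap, fast) or escalate "
    "to a reasoning model (expensive, slow but more accurate). Output a single "
    "line, either ROUTE: small or ROUTE: reasoning, followed by one sentence "
    "of justification."
)

def route(query: str) -> str:
    """Ask a cheap model to pick a route; default to 'small' if unparsable."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap, fast model
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": query}],
    )
    text = resp.choices[0].message.content or ""
    return "reasoning" if "ROUTE: reasoning" in text else "small"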
What changed in industry
2 min read
OpenAI has released Sora 2, with synchronised audio output, longer maximum clip durations, and noticeable improvements in physical plausibility over the original v1 preview. Audio-synced video at this quality tier expands the addressable use cases beyond marketing B-roll. Worth a scoped evaluation if your team works on creative tooling or short-form content.
Mistral has cut Mistral Large 2 inference and fine-tuning pricing and added a managed fine-tuning path through La Plateforme aimed at teams that want a closed-source alternative to OpenAI and Anthropic. Pricing pressure at the closed-source frontier is real now. If you have been on a single-vendor contract, the renewal conversation just got more interesting.
Microsoft has continued the Phi-3 line of small language models, with sizes starting at 3.8B parameters aimed at on-device inference and a quality bar Microsoft benchmarks against much larger frontier models. If the small-model claims hold on your eval set, the cost-per-good-output for a whole class of tasks drops a step. Useful if you have been priced out of frontier inference for high-volume workloads.
Research worth your time
1 min read
Researchers from Google DeepMind and Berkeley have argued that, for a fixed quality target, scaling inference-time compute can sometimes substitute for additional pre-training compute, with worked-example trade-offs across maths and reasoning. If the trade-off generalises, smaller models with more thinking time are a viable architecture for cost-sensitive deployments. Worth a careful read before the next budget cycle.
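To make the trade-off concrete with invented numbers (nothing below comes from the paper): if a small model costs a twentieth as much per call as a large one, it can afford up to twenty sampled attempts, majority-voted, before the large model's single pass is cheaper.

# Invented per-call costs for illustration; not the paper's figures.
SMALL_COST = 0.005  # USD per small-model call (hypothetical)
LARGE_COST = 0.100  # USD per large-model call (hypothetical)

def break_even_samples(small_cost: float, large_cost: float) -> int:
    """How many small-model samples fit inside one large-model call."""
    return round(large_cost / small_cost)

print(break_even_samples(SMALL_COST, LARGE_COST))  # -> 20

Whether twenty samples actually reach the quality target is exactly what your eval set has to tell you.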
Tools to try
1 min read
OpenRouter has expanded its public dashboards to include p50 and p99 latency per model and current pricing across providers, alongside the existing fallback-routing primitives. Like-for-like latency data has been the missing input for production routing decisions. Worth bookmarking even if you do not route through OpenRouter directly.
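If you do route through it, the fallback primitive is one extra request field. A minimal sketch via the OpenAI Python client (the model slugs are examples; the models list is OpenRouter's documented fallback parameter):

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="<OPENROUTER_API_KEY>")

resp = client.chat.completions.create(
    model="openai/o3-mini",  # primary choice (example slug)
    extra_body={"models": ["openai/o3-mini",
                           "mistralai/mistral-large"]},  # tried in order
    messages=[{"role": "user", "content": "..."}],
)
print(resp.model)  # which model actually served the request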
NVIDIA's NIM offering bundles open-weight models into containerised microservices with TensorRT-LLM optimisations, an OpenAI-compatible API, and Helm charts for Kubernetes deployment. If you self-host inference on NVIDIA GPUs, NIM closes the gap between 'open-weight model' and 'production-ready endpoint' without a long DevOps tail.
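Because the endpoint is OpenAI-compatible, pointing an existing client at a NIM container is close to a one-line change. A sketch, where host, port, and model id are deployment-specific examples rather than fixed values:

from openai import OpenAI

# A NIM container serves an OpenAI-compatible API on the pod; the
# base_url and model id below are examples from a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # example NIM model id
    messages=[{"role": "user", "content": "Summarise: ..."}],
)
print(resp.choices[0].message.content)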
Open-source picks
1 min read
Hugging Face has released smolagents, a deliberately small Python library for agent loops that writes actions as Python code rather than JSON tool calls, with built-in sandboxing options. Code-as-action is a noticeable ergonomic upgrade over JSON tool calling for many tasks. Worth a half-hour even if you stay on your current framework.
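A hello-world in the style of the project's examples gives the flavour (class names as at the initial release; check the current docs before depending on them):

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# The agent writes each action as a Python snippet and executes it,
# rather than emitting a JSON tool call.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would a leopard at full speed take to run the length of Pont des Arts?")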
Hugging Face has continued the SmolLM family of small open-weight models (135M / 360M / 1.7B) trained on a curated dataset, with the training recipes published for reproducibility. Small-and-reproducible is the missing tier between 'too big to host' and 'hosted API'. If you want to fine-tune your own model without renting an H100 cluster, this is the level the field is converging on.
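Getting started is a standard transformers load; the checkpoint below follows the published SmolLM naming (pick the size your hardware can afford):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-360M"  # 135M and 1.7B variants also published
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))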
Inference · Issue 112 · Apr 2026
Set in Space Grotesk and Source Serif 4. Compiled in London.
