Issue 112 · Apr 2026
OpenAI publishes the o3 system card, with cost-per-answer figures.
Matthew Paver · Editor · London · 5 min read
The lead this week is OpenAI's o3 system card, which puts the first credible numbers on the cost-per-correct-answer of a frontier reasoning model: for the first time at this tier, the figures that matter for production planning sit alongside the benchmark wins. The rest of the issue is context for teams budgeting against that.
If 'should we route this query through a reasoning model?' has been a judgment call on your team, you now have a number to argue with.
Spotlight
Reasoning-model cost figures are the planning input that mattered most this week, regardless of which provider you ship with.
Underhyped
OpenRouter's per-model latency dashboards deserve more attention than they are getting; they are the only public source of like-for-like routing data for production teams.
Risk to watch
Keep an eye on Stripe's policy on AI-generated chargebacks. Payment-rail policy quietly shifts what fraud-side AI agents can do.
Filed under: Industry, Models, Repos, Research, Tools
Lead story
OpenAI has published the o3 and o3-mini system cards alongside per-token pricing, with worked-example costs for benchmark-quality reasoning across maths, coding, and general problem solving.
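As a back-of-envelope sketch of how per-token prices turn into a per-call cost (the prices below are placeholders, not o3's published rates):

# Placeholder prices in USD per 1M tokens; substitute the system-card
# figures for the model you are costing.
PRICE_IN_PER_M = 10.00
PRICE_OUT_PER_M = 40.00

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one completion at the prices above."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Reasoning models bill hidden chain-of-thought as output tokens,
# so the output count runs well past the visible answer.
print(f"${cost_per_call(2_000, 15_000):.2f}")  # -> $0.62 at these prices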
Our take
Reasoning-model pricing has been opaque for months. A first-party cost-per-correct-answer figure is the input that lets a planning meeting be a planning meeting instead of a discussion of vibes.
Try this week
Run o3-mini against one of your existing eval sets this week and compare cost-per-correct-answer with whatever you ship today. The number is the deliverable, not the model.
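If you want a starting point, here is a minimal sketch of that comparison. It assumes the OpenAI Python client, an eval set of (prompt, expected) pairs, and a grade() check of your own; the eval set and grader are placeholders for your plumbing.

from openai import OpenAI

client = OpenAI()

def cost_per_correct(model, eval_set, price_in, price_out, grade):
    """eval_set: [(prompt, expected)]; prices in USD per 1M tokens."""
    total_cost, correct = 0.0, 0
    for prompt, expected in eval_set:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        total_cost += (resp.usage.prompt_tokens * price_in
                       + resp.usage.completion_tokens * price_out) / 1_000_000
        correct += grade(resp.choices[0].message.content, expected)
    return total_cost / max(correct, 1)  # guard against a zero-correct run

Run it once per candidate model; the resulting numbers are the slide for the planning meeting.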
Learn this week
Recommended reading tied to this week's lead.
Prompt of the week
You are a cost-aware routing assistant. For the user query below, decide whether to answer directly with a small model (cheap, fast) or escalate to a reasoning model (expensive, slow but more accurate). Output a single line, either ROUTE: small or ROUTE: reasoning, followed by one sentence of justification.
Works with: Claude, ChatGPT, Gemini
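To wire the prompt into an actual router, a minimal sketch using the OpenAI Python client (the router model name is a placeholder for any cheap, fast model; this is not an official pattern from any of the vendors above):

from openai import OpenAI

client = OpenAI()
ROUTER_PROMPT = (
    "You are a cost-aware routing assistant. For the user query below, decide "
    "whether to answer directly with a small model (cheap, fast) or escalate "
    "to a reasoning model (expensive, slow but more accurate). Output a single "
    "line, either ROUTE: small or ROUTE: reasoning, followed by one sentence "
    "of justification."
)

def route(query: str) -> str:
    """Ask a cheap model to pick a route; default to 'small' if unparsable."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap, fast model
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": query}],
    )
    text = resp.choices[0].message.content or ""
    return "reasoning" if "ROUTE: reasoning" in text else "small"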
What changed in industry
2 min read
OpenAI has released Sora 2, with synchronised audio output, longer maximum clip durations, and noticeable improvements in physical plausibility over the original v1 preview. Audio-synced video at this quality tier expands the addressable use cases beyond marketing B-roll. Worth a scoped evaluation if your team works on creative tooling or short-form content.
Mistral has cut Mistral Large 2 inference and fine-tuning pricing and added a managed fine-tuning path through La Plateforme aimed at teams that want a closed-source alternative to OpenAI and Anthropic. Pricing pressure at the closed-source frontier is real now. If you have been on a single-vendor contract, the renewal conversation just got more interesting.
Microsoft has continued the Phi-3 line of small language models, with sizes starting at 3.8B parameters aimed at on-device inference and a quality bar Microsoft benchmarks against much larger frontier models. If the small-model claims hold on your eval set, the cost-per-good-output for a whole class of tasks drops a step. Useful if you have been priced out of frontier inference for high-volume workloads.
Research worth your time
1 min read
Researchers from Google DeepMind and Berkeley have argued that, for a fixed quality target, scaling inference-time compute can sometimes substitute for additional pre-training compute, with worked-example trade-offs across maths and reasoning. If the trade-off generalises, smaller models with more thinking time are a viable architecture for cost-sensitive deployments. Worth a careful read before the next budget cycle.
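To make the trade-off concrete with invented numbers (nothing below comes from the paper): if a small model costs a twentieth as much per call as a large one, it can afford up to twenty sampled attempts, majority-voted, before the large model's single pass is cheaper.

# Invented per-call costs for illustration; not the paper's figures.
SMALL_COST = 0.005  # USD per small-model call (hypothetical)
LARGE_COST = 0.100  # USD per large-model call (hypothetical)

def break_even_samples(small_cost: float, large_cost: float) -> int:
    """How many small-model samples fit inside one large-model call."""
    return round(large_cost / small_cost)

print(break_even_samples(SMALL_COST, LARGE_COST))  # -> 20

Whether twenty samples actually reach the quality target is exactly what your eval set has to tell you.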
Tools to try
1 min read
OpenRouter has expanded its public dashboards to include p50 and p99 latency per model and current pricing across providers, alongside the existing fallback-routing primitives. Like-for-like latency data has been the missing input for production routing decisions. Worth bookmarking even if you do not route through OpenRouter directly.
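If you do route through it, the fallback primitive is one extra request field. A minimal sketch via the OpenAI Python client (the model slugs are examples; the models list is OpenRouter's documented fallback parameter):

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="<OPENROUTER_API_KEY>")

resp = client.chat.completions.create(
    model="openai/o3-mini",  # primary choice (example slug)
    extra_body={"models": ["openai/o3-mini",
                           "mistralai/mistral-large"]},  # tried in order
    messages=[{"role": "user", "content": "..."}],
)
print(resp.model)  # which model actually served the request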
NVIDIA's NIM offering bundles open-weight models into containerised microservices with TensorRT-LLM optimisations, an OpenAI-compatible API, and Helm charts for Kubernetes deployment. If you self-host inference on NVIDIA GPUs, NIM closes the gap between 'open-weight model' and 'production-ready endpoint' without a long DevOps tail.
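Because the endpoint is OpenAI-compatible, pointing an existing client at a NIM container is close to a one-line change. A sketch, where host, port, and model id are deployment-specific examples rather than fixed values:

from openai import OpenAI

# A NIM container serves an OpenAI-compatible API on the pod; the
# base_url and model id below are examples from a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # example NIM model id
    messages=[{"role": "user", "content": "Summarise: ..."}],
)
print(resp.choices[0].message.content)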
Open-source picks
1 min read
Hugging Face has released smolagents, a deliberately small Python library for agent loops that writes actions as Python code rather than JSON tool calls, with built-in sandboxing options. Code-as-action is a noticeable ergonomic upgrade over JSON tool calling for many tasks. Worth a half-hour even if you stay on your current framework.
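A hello-world in the style of the project's examples gives the flavour (class names as at the initial release; check the current docs before depending on them):

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# The agent writes each action as a Python snippet and executes it,
# rather than emitting a JSON tool call.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would a leopard at full speed take to run the length of Pont des Arts?")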
Hugging Face has continued the SmolLM family of small open-weight models (135M / 360M / 1.7B) trained on a curated dataset, with the training recipes published for reproducibility. Small-and-reproducible is the missing tier between 'too big to host' and 'hosted API'. If you want to fine-tune your own model without renting an H100 cluster, this is the level the field is converging on.
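Getting started is a standard transformers load; the checkpoint below follows the published SmolLM naming (pick the size your hardware can afford):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-360M"  # 135M and 1.7B variants also published
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))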
Inference · Issue 112 · Apr 2026
Set in Space Grotesk and Source Serif 4. Compiled in London.
