I’m a bit late to this model review, but the extra time has let me reflect on the dimensions that actually matter for agents. Traditional benchmarks distill a model’s performance down to a single correctness score. They have always done so because it is straightforward, quick to apply, and convenient for assessing overall capability. I also recommend this to people building strong benchmarks: the result should ultimately boil down to a single, easily interpretable number. That will probably remain the case for the next year or two. Agent benchmarks will improve, but right now they don’t align well with our actual experience, because agentic work is really about balancing correctness, usability, speed, and cost. In the end, benchmarks will tackle these issues one by one.

While GPT-5.4 may look like just another incremental upgrade on standard on-paper benchmarks, in real-world use it feels like a significant leap forward across all four of those qualities. GPT-5.4 in Codex, running in always-on fast mode with high or extra-high effort, is the first OpenAI agent that feels capable of handling almost any random task you throw at it.

I haven’t been deeply involved in software engineering lately, so most of my agent work has focused on smaller projects (not completely one-off, but still modest in scope, usually ones I design and build entirely myself over a few weeks), along with data analysis and research tasks. Working in an agent-native way involves extensive use of regular APIs, background packages (such as installing and managing LaTeX binaries, ffmpeg, and other multimedia conversion tools), git operations, file management, search, and more. Before GPT-5.4, I always churned away from OpenAI’s agents because of a thousand tiny frustrations.
Interconnects AI