I’m a bit late to reviewing this model, but the extra time has let me think more carefully about the dimensions that matter for agents. Traditional benchmarks condense a model’s performance into a single accuracy score, a practice driven by simplicity, speed, and ease of interpretation. I recommend the same to people building benchmarks: a strong benchmark should ultimately boil down to a single, easily interpretable number. That will likely still be true in a year or two, and benchmarks for agents will get better, but for the time being a single score doesn’t map to what we feel in use, because agentic tasks are about a mix of correctness, ease of use, speed, and cost. In the end, benchmarks will tackle these issues one by one. While GPT-5.4 may look like just another incremental upgrade on standard paper benchmarks, in real-world use it feels like a significant leap forward across all four of those qualities.

GPT-5.4 in Codex, always on fast mode and high or extra-high effort, is the first OpenAI agent that feels like it can do a lot of the random things you throw at it. I haven’t been particularly deep in software engineering over the last few months, so most of my work with agents has been smaller projects (not totally one-off, but small enough that I’ve built the entire thing and managed the design over weeks), data analysis, and research tasks.

Embracing an agent-native approach involves extensive use of regular APIs, background packages (such as installing and managing LaTeX binaries, ffmpeg, multimedia conversion tools, and similar utilities), git operations, file management, search functionality, and more. Before GPT-5.4, I always churned away from OpenAI’s agents because of a thousand tiny frustrations.
Interconnects AI