This month was packed: every major open frontier lab, including DeepSeek, released new models. That prompted an evaluation by the Center for AI Standards and Innovation (CAISI), an organization that has previously assessed open models and the risks they pose. Their findings indicate that open models continue to trail the American frontier, and that the gap is growing over time.

For the report, they compute an Elo rating using Item Response Theory (IRT), a method frequently used to compare models even when they were evaluated on entirely different benchmark suites. For V4, CAISI used nine different benchmarks.

The large Elo gap stems from DeepSeek V4's poor performance on CTF-Archive-Diamond (which was only partially evaluated and then extrapolated via IRT for V4), PortBench (a private CAISI benchmark), and ARC-AGI-2 (which was scored with a different method than the public leaderboards use). Scores on these few benchmarks weigh heavily on the overall Elo rating, which can widen the perceived capability gap further. Using Epoch AI's ECI, which also applies IRT across a diverse set of benchmarks, we observe that the gap has stayed at roughly 3–7 months since R1.

The open-closed gap in ECI (from https://mcnair.center/china/).
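To make the IRT point concrete, here is a minimal sketch of how a single rating can be fit when models were not all run on the same benchmarks, and how much one benchmark can move the result. It is not CAISI's or Epoch AI's method: neither model is specified in the text, so this assumes a one-parameter (Rasch-style) curve, an Elo conversion based on Elo's standard base-10, 400-point logistic, and invented scores. Every name in it (`fit_rasch`, `elo_gap`, `demo`) is hypothetical.

```python
import math

# Elo's expected-score curve is 1 / (1 + 10 ** (-d / 400)), so one logit
# of ability difference corresponds to 400 / ln(10) Elo points.
ELO_PER_LOGIT = 400.0 / math.log(10.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(scores, n_models, n_benchmarks, steps=5000, lr=1.0):
    """Fit ability (theta) per model and difficulty per benchmark.

    scores maps (model_index, benchmark_index) -> accuracy in (0, 1).
    Pairs that were never evaluated are simply absent from the dict.
    Assumed model: expected accuracy = sigmoid(theta - difficulty).
    """
    theta = [0.0] * n_models
    diff = [0.0] * n_benchmarks
    n_m = [0] * n_models
    n_b = [0] * n_benchmarks
    for m, j in scores:
        n_m[m] += 1
        n_b[j] += 1
    for _ in range(steps):
        g_m = [0.0] * n_models
        g_b = [0.0] * n_benchmarks
        for (m, j), y in scores.items():
            # Gradient of the cross-entropy loss with respect to the logit.
            err = sigmoid(theta[m] - diff[j]) - y
            g_m[m] += err
            g_b[j] -= err
        # Averaged gradient step per parameter, over observed pairs only.
        theta = [t - lr * g / max(n, 1) for t, g, n in zip(theta, g_m, n_m)]
        diff = [b - lr * g / max(n, 1) for b, g, n in zip(diff, g_b, n_b)]
        # Only theta - diff is identified, so pin mean difficulty to zero.
        shift = sum(diff) / n_benchmarks
        theta = [t - shift for t in theta]
        diff = [b - shift for b in diff]
    return theta, diff


def elo_gap(theta_a, theta_b):
    """Convert an ability difference in logits to Elo points."""
    return (theta_a - theta_b) * ELO_PER_LOGIT


# Invented scores, for illustration only: 3 models, 4 benchmarks.
demo = {
    (0, 0): 0.80, (0, 1): 0.70, (0, 2): 0.60, (0, 3): 0.50,
    (1, 0): 0.75, (1, 1): 0.65, (1, 2): 0.20,  # model 1 never ran benchmark 3
    (2, 0): 0.60, (2, 1): 0.50, (2, 2): 0.40, (2, 3): 0.30,
}

theta_all, _ = fit_rasch(demo, 3, 4)
print("gap, all benchmarks:", round(elo_gap(theta_all[0], theta_all[1])))

# Refit without benchmark 2, where model 1 scored unusually low.
without_outlier = {k: v for k, v in demo.items() if k[1] != 2}
theta_sub, _ = fit_rasch(without_outlier, 3, 4)
print("gap, benchmark 2 dropped:", round(elo_gap(theta_sub[0], theta_sub[1])))
```

Comparing the two printed gaps shows the mechanism described above: with only a handful of benchmarks, a single low or extrapolated score can account for much of the rating difference between two models.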
Interconnects AI