At a glance: Replacing rigid tool calls with a code-based interface lets AI agents analyze spatial scenes more flexibly and solve complex 3D/4D tasks iteratively.
Researchers present SpatialClaw, a framework that gives Vision-Language Models a code-based action interface for analyzing 3D and 4D scenes and reasoning about spatial relationships between objects. The system runs on a stateful Python kernel and achieves an average accuracy of 59.9% across 20 benchmarks – a gain of 11.2 percentage points over previous spatial agents.
Understanding spatial relationships – where objects are located, how they relate to each other, how they move in 3D – presents fundamental challenges for Vision-Language Models (VLMs). Tool-augmented agents attempt to solve this through specialized perception modules, but their effectiveness is limited by the interface through which these tools are invoked.
SpatialClaw addresses this interface problem by using code itself as the action interface, rather than relying on single-pass code execution (which must commit to a strategy in advance) or on rigid tool-call interfaces. The approach is training-free and runs on a stateful Python kernel preloaded with the input images and a suite of perception and geometry primitives. The VLM agent iteratively writes Python cells, each of which can build on all previous outputs. This lets it combine perception results flexibly, adapt its analysis to intermediate results and visual observations, and tailor the solution to the specific question.
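To make the idea of a stateful kernel concrete, here is a minimal sketch in plain Python. It is not SpatialClaw's implementation: the `StatefulKernel` class, the `distance_3d` primitive, and the `detections` data are all hypothetical stand-ins chosen for illustration. The point it shows is that variables defined in one cell remain available to the next, so a later cell can be written in response to an earlier cell's output.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent kernel: one namespace shared by all cells."""

    def __init__(self, preloaded):
        # Names available from the first cell on (images, primitives, ...).
        self.namespace = dict(preloaded)

    def run_cell(self, source):
        """Execute one cell and return everything it printed, or its traceback."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors come back as text so the agent can react in the next cell.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


def distance_3d(a, b):
    """Hypothetical geometry primitive: Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical perception output: object name -> 3D centre in metres.
detections = {"chair": (0.0, 0.0, 2.0), "table": (3.0, 0.0, 6.0)}

kernel = StatefulKernel({"detections": detections, "distance_3d": distance_3d})

# Cell 1: inspect what perception found.
out1 = kernel.run_cell("names = sorted(detections)\nprint(names)")

# Cell 2: written after seeing out1; it reuses `names` defined in cell 1.
out2 = kernel.run_cell(
    "d = distance_3d(detections[names[0]], detections[names[1]])\n"
    "print(round(d, 2))"
)
```

In a real agent loop, the strings passed to `run_cell` would be generated by the VLM, and each returned output would be appended to its context before it writes the next cell. A single-pass design would instead have to emit both cells at once, before seeing what the first one prints.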
In evaluation across 20 benchmarks covering both static and dynamic 3D/4D tasks, SpatialClaw achieved consistent gains across six VLM backbones from two model families – without benchmark- or model-specific tuning. The framework demonstrates that the choice of agent interface itself is a critical success factor for open-ended spatial reasoning.
Source: arxiv.org · Published 10 June 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.6.5.