
VisualClaw: Agent Framework with Intelligent Video Filtering and Self-Learning Skills


In brief: VisualClaw reduces deployment costs for video agents by up to 98 percent through frame filtering and self-learning skill updates, while improving accuracy in most settings.

Researchers introduce VisualClaw, a multimodal agent system for Vision Language Models that reduces API costs by up to 98 percent through selective video frame filtering and continuous skill learning. The framework addresses three core deployment challenges of video-understanding agents: high latency from dense video analysis, static agent structures after deployment, and lack of real-time tool utilization in benchmarks.

VisualClaw implements two optimization principles. The first is a hybrid encoding method that filters less informative frames from a video sequence using cascaded gates and compresses the text skill bank through top-k injection (hot/cold approach). This directly reduces queries to Vision Language Models. The second principle is skill evolution: the agent learns from failed queries by conditioning a retrieval-augmented evolver with direct or guided evidence, updating the skill bank for future tasks.
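The filtering idea can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say what the gates measure or how skills are ranked, so the two gates here (mean and peak pixel change), the word-overlap ranking, and every name (`cascaded_filter`, `top_k_skills`, `mean_abs_diff`) are assumptions chosen purely for illustration. Frames are modeled as flat lists of intensities.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # hypothetical: a frame as a flat list of pixel intensities


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Cheap change measure between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cascaded_filter(
    frames: List[Frame], gates: List[Callable[[Frame, Frame], bool]]
) -> List[int]:
    """Return indices of frames that pass every gate against the last kept frame.

    Gates run in order; all() short-circuits, so a frame is dropped at the
    first gate it fails and later (costlier) gates are never evaluated.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        last = frames[kept[-1]]
        if all(gate(last, frames[i]) for gate in gates):
            kept.append(i)
    return kept


def top_k_skills(query: str, skill_bank: List[str], k: int) -> List[str]:
    """Pick the k skills sharing the most words with the query ("hot").

    Skills that are not selected stay out of the prompt ("cold").
    """
    q = set(query.lower().split())
    ranked = sorted(
        skill_bank,
        key=lambda s: len(q & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Assumed gates: a coarse average-change test, then a stricter peak-change test.
gates = [
    lambda prev, cur: mean_abs_diff(prev, cur) > 0.05,
    lambda prev, cur: max(abs(x - y) for x, y in zip(prev, cur)) > 0.3,
]

frames = [[0.0, 0.0], [0.01, 0.0], [0.5, 0.5], [0.5, 0.52], [0.0, 1.0]]
kept = cascaded_filter(frames, gates)  # near-duplicate frames 1 and 3 are dropped

skills = [
    "count objects across frames",
    "read text on signs",
    "track people entering or leaving",
]
hot = top_k_skills("count people entering the room", skills, k=2)
```

Only the frames in `kept` and the skills in `hot` would be sent to the model in this sketch, which is where the savings in queries and prompt size would come from.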

In experiments across four video QA benchmarks with two different VLMs, VisualClaw reduced per-question API costs by an average of 98 percent compared to uploading all frames and by 25.9 percent compared to the baseline of eight uniform frames per video. On the EgoSchema benchmark, the system achieved an average accuracy improvement of 3.85 percent with Gemini 3 Flash, reaching as high as 15.80 percent in some cases. The researchers also curated VisualClawArena, a new 200-scenario benchmark that requires multimodal agents to leverage video evidence, documents, dynamic updates, and executable checks within a workspace.

On VisualClawArena, the framework with computer-use agent backends improved macro accuracy by 2.9 percent for Codex (GPT-5.5) and 3.2 percent for Claude Code (Sonnet 4.6) compared to baselines without evolution, while cutting costs by 9.5 percent relative to uniformly sampled baselines. In edge scenarios, the number of API calls dropped from approximately 3,600 per one-hour streaming session to between 5 and 20.
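To put the edge-scenario figures in proportion, the short calculation below converts the reported call counts into a relative reduction. The numbers come from the summary above; the function name `call_reduction` is hypothetical. A baseline of roughly 3,600 calls per hour is consistent with about one call per second, though that cadence is an inference and is not stated in the source.

```python
def call_reduction(baseline_calls: int, filtered_calls: int) -> float:
    """Fraction of API calls avoided relative to the baseline."""
    return 1.0 - filtered_calls / baseline_calls


BASELINE_CALLS = 3600  # approximate calls per one-hour session, as reported

low = call_reduction(BASELINE_CALLS, 20)   # upper end of the reported range
high = call_reduction(BASELINE_CALLS, 5)   # lower end of the reported range

print(f"{low:.1%} to {high:.1%} fewer calls")  # prints "99.4% to 99.9% fewer calls"
```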


Source: arxiv.org · Published June 14, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
