Bottom Line: InternVideo3 lets foundation models analyze longer video sequences through iterative reasoning and tool use, while compressing the KV cache to limit the memory and latency cost of long contexts.

Researchers present InternVideo3, a framework for extending foundation models with agent-like capabilities for video processing. The system combines multi-step reasoning over longer video sequences with a more efficient architecture for context processing.

The framework addresses a gap in open-source research: while foundation models increasingly support multi-step reasoning and tool use, development remains focused primarily on text-heavy applications. Long-horizon video tasks that require continuous temporal understanding and iterative interaction remain underrepresented.

At the core of InternVideo3 is Multimodal Contextual Reasoning (MCR): a closed feedback loop over a shared, evolving context. This context integrates observations (video input), instructions, reasoning steps, tool actions, and memory. Long-video understanding is thus modeled as iterative evidence collection and verification.

Alongside MCR, the framework introduces Multimodal Multi-head Latent Attention (M²LA): a reparameterization technique that compresses KV cache states while preserving the full token stream. This avoids the memory and latency problems that typically arise with longer video sequences.
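The MCR loop described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the InternVideo3 implementation: the `Context` class, the `retrieve_clip` tool, the caption-based toy video, and the substring check used as "verification" are all hypothetical stand-ins. Only the five context components and the collect-then-verify structure come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Context:
    """Shared, evolving context; fields mirror the five components named above."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    memory: Dict[int, str] = field(default_factory=dict)


def retrieve_clip(video: Dict[int, str], second: int) -> str:
    """Hypothetical tool: return a caption for the clip at a timestamp."""
    return video.get(second, "nothing notable")


def run_loop(
    video: Dict[int, str], instruction: str, target: str, max_steps: int = 5
) -> Tuple[Optional[str], Context]:
    """Collect evidence step by step and stop once it verifies an answer."""
    ctx = Context(instruction=instruction)
    for step, t in enumerate(sorted(video)[:max_steps]):
        # Reasoning step: record why the next tool call is made.
        ctx.reasoning.append(f"step {step}: no evidence for '{target}', inspect t={t}s")
        # Tool action and the observation it produces.
        ctx.actions.append(f"retrieve_clip({t})")
        caption = retrieve_clip(video, t)
        ctx.observations.append(caption)
        # Memory: keep the evidence so later steps can reuse it.
        ctx.memory[t] = caption
        # Verification (toy): accept once the evidence mentions the target.
        if target in caption:
            return f"'{target}' first seen at {t}s", ctx
    return None, ctx  # step budget exhausted without verified evidence


if __name__ == "__main__":
    toy_video = {0: "empty street", 10: "a car passes", 20: "a dog runs by"}
    answer, ctx = run_loop(toy_video, "When does the dog appear?", "dog")
    print(answer)
    print(ctx.actions)
```

In this sketch every step reads from and writes to the same `Context` object, which is the point of a closed loop: a model-driven version would choose the next tool call from that accumulated state rather than scanning timestamps in order.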

Training proceeds in four phases: continued pretraining, supervised fine-tuning for short-to-long scenarios, rule-based reinforcement learning, and on-policy distillation. The model was evaluated on established benchmarks (Video-MME, MLVU, EgoSchema) as well as in a practical video-agent setting with retrieval tools. The results demonstrate that efficient context handling and closed-loop reasoning are necessary to adapt open multimodal models for long, visually grounded agent tasks.

Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.

InternVideo3: Foundation Models with Multimodal Reasoning for Video Agents
