Bottom Line: InternVideo3 lets foundation models analyze longer video sequences through iterative reasoning and tool use, while compressing the KV cache to avoid the memory and latency costs that long sequences usually bring.
Researchers present InternVideo3, a framework for extending foundation models with agent-like capabilities for video processing. The system combines multi-step reasoning over longer video sequences with a more efficient architecture for context processing.
The framework addresses a gap in open-source research: while foundation models increasingly support multi-step reasoning and tool use, development still focuses primarily on text-heavy applications. Long-horizon video tasks, which require continuous temporal understanding and iterative interaction, have so far been underrepresented.
At the core of InternVideo3 is Multimodal Contextual Reasoning (MCR): a closed feedback loop over a shared, evolving context that integrates observations (video input), instructions, reasoning steps, tool actions, and memory. Long-video understanding is thereby modeled as iterative evidence collection and verification. Alongside MCR, the researchers introduce Multimodal Multi-head Latent Attention (M²LA), a reparameterization technique that compresses KV cache states while preserving the full token stream. This avoids the memory and latency problems that typically arise with longer video sequences.
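To make the MCR idea more concrete, the following minimal Python sketch shows what a closed loop over a shared, evolving context could look like. It is an illustration only, not the authors' implementation: the names `Context`, `make_caption_tool`, and `run_mcr_loop` are hypothetical, the "tool" is a plain caption lookup standing in for a real retrieval tool, and verification is reduced to a keyword match.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Window = Tuple[int, int]  # (start_second, end_second); hypothetical unit of video


@dataclass
class Context:
    """Shared context holding the five components named in the text."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    memory: Dict[str, str] = field(default_factory=dict)


def make_caption_tool(captions: Dict[Window, str]) -> Callable[[Window], str]:
    """Hypothetical tool: return a stored caption for a time window."""
    def tool(window: Window) -> str:
        return captions.get(window, "")
    return tool


def run_mcr_loop(
    ctx: Context,
    windows: List[Window],
    tool: Callable[[Window], str],
    target: str,
    max_steps: int = 8,
) -> Optional[Window]:
    """Collect evidence window by window until the target is verified."""
    for step, window in enumerate(windows[:max_steps]):
        ctx.actions.append(f"inspect {window}")   # tool action
        evidence = tool(window)
        ctx.observations.append(evidence)         # new observation
        if target in evidence:                    # toy verification rule (assumption)
            ctx.reasoning.append(f"step {step}: '{target}' verified in {window}")
            ctx.memory["answer_window"] = f"{window[0]}-{window[1]}"
            return window
        ctx.reasoning.append(f"step {step}: no evidence in {window}, continue")
    return None


# Made-up example data, for illustration only.
captions = {
    (0, 60): "a person enters the kitchen",
    (60, 120): "the person opens the fridge",
    (120, 180): "the person pours milk",
}
ctx = Context(instruction="When is the fridge opened?")
hit = run_mcr_loop(ctx, sorted(captions), make_caption_tool(captions), "fridge")
print(hit)  # (60, 120)
```

Every step writes its action, observation, and reasoning back into the same `Context` object, so each later step can build on what was already collected. In a real system, the keyword check would be replaced by the model's own verification step.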
Training proceeds in four phases: continued pretraining, supervised fine-tuning for short-to-long scenarios, rule-based reinforcement learning, and on-policy distillation. The model was evaluated on established benchmarks (Video-MME, MLVU, EgoSchema) as well as in a practical video-agent setting with retrieval tools. According to the researchers, the results show that efficient context handling and closed-loop reasoning are necessary to adapt open multimodal models for long, visually grounded agent tasks.
Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.