In brief: Orca combines video, language and visual question answering in a unified latent space. The authors report that this unified world model outperforms specialized models on text generation, image prediction and action generation tasks.

Researchers present Orca, a foundation model that processes multimodal world signals in a unified latent space and supports three different output interfaces: text generation, image prediction and action generation for embodied systems.

Orca is based on a next-state-prediction paradigm that goes beyond isolated token, frame or action forecasting. The model learns through two complementary procedures: unsupervised learning captures dense natural state transitions from continuous video streams, while supervised learning models sparse, meaningful transitions via language-described events and visual-question-answering supervision.
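To make the distinction between dense and sparse supervision concrete, here is a minimal sketch of how two such terms could be combined into one objective. It is an illustration only and not the authors' loss: the one-parameter transition model, the function names (predict_next, dense_loss, sparse_loss, total_loss), the sparse_weight parameter and the toy data are all assumptions made for this example.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def predict_next(state, gain):
    """Hypothetical one-parameter transition model: scale the latent state."""
    return [gain * s for s in state]

def dense_loss(states, gain):
    """Unsupervised term: every consecutive pair of states is a training example."""
    pairs = list(zip(states[:-1], states[1:]))
    return sum(mse(predict_next(s, gain), nxt) for s, nxt in pairs) / len(pairs)

def sparse_loss(states, events, gain):
    """Supervised term: only annotated (start index, target state) pairs contribute."""
    return sum(
        mse(predict_next(states[i], gain), target) for i, target in events
    ) / len(events)

def total_loss(states, events, gain, sparse_weight=1.0):
    """Weighted sum of the dense and sparse terms (the weighting is an assumption)."""
    return dense_loss(states, gain) + sparse_weight * sparse_loss(states, events, gain)

# Toy latent trajectory in which each state doubles, plus one annotated event.
states = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
events = [(0, [2.0, 4.0])]

loss_good = total_loss(states, events, gain=2.0)  # transition model matches the data
loss_bad = total_loss(states, events, gain=1.0)   # transition model predicts no change
```

In this toy setup the dense term sees every step of the trajectory, while the sparse term only sees the single annotated transition, which mirrors the split between continuous video and event annotations described above.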

For pre-training, the team uses a dataset of 125,000 hours of video and 160 million event annotations. The resulting model develops a unified latent space that captures world dynamics in an abstract manner. During fine-tuning, the pre-trained backbone network stays frozen and only lightweight modality-specific decoders are trained, which makes for an efficient transfer learning approach.
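The frozen-backbone setup can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Orca's architecture: the backbone is stood in for by a fixed random linear map, the decoders are simple linear readouts, and every name here (BACKBONE, encode, LinearDecoder, train_decoder) as well as the dimensions and toy data are hypothetical.

```python
import random

random.seed(0)

LATENT_DIM = 4  # hypothetical size of the shared latent state
INPUT_DIM = 3   # hypothetical size of an input observation

# Stand-in for the pre-trained backbone: a fixed linear map stored in
# immutable tuples, so no training step can modify it.
BACKBONE = tuple(
    tuple(random.uniform(-1.0, 1.0) for _ in range(INPUT_DIM))
    for _ in range(LATENT_DIM)
)

def encode(x):
    """Map an observation to the shared latent state using the frozen backbone."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in BACKBONE]

class LinearDecoder:
    """Lightweight readout head; one instance per output interface."""

    def __init__(self, latent_dim):
        self.weights = [0.0] * latent_dim
        self.bias = 0.0

    def predict(self, z):
        return sum(w * zi for w, zi in zip(self.weights, z)) + self.bias

    def sgd_step(self, z, target, lr=0.05):
        """One squared-error gradient step; only decoder parameters change."""
        error = self.predict(z) - target
        self.weights = [w - lr * error * zi for w, zi in zip(self.weights, z)]
        self.bias -= lr * error
        return error * error

def train_decoder(decoder, data, epochs=500):
    """Fit a decoder on (observation, target) pairs; return the mean loss per epoch."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            total += decoder.sgd_step(encode(x), y)
        losses.append(total / len(data))
    return losses

# Toy regression data; a real system would use separate targets per interface.
data = [
    ([1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], -1.0),
    ([0.0, 0.0, 1.0], 0.5),
    ([1.0, 1.0, 0.0], 0.0),
]

backbone_before = BACKBONE
decoders = {name: LinearDecoder(LATENT_DIM) for name in ("text", "image", "action")}
losses = {name: train_decoder(dec, data) for name, dec in decoders.items()}
```

The point of the design is that all three heads read from the same latent state, so the expensive component is trained once and each new output interface only adds a small number of trainable parameters.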

According to the evaluations, Orca outperforms specialized, similarly sized baseline models on the three downstream tasks of text generation, image prediction and action generation. The authors also report that the paradigm scales: a stronger learned latent space enables more robust downstream readouts. In addition, they document current limitations and outline open questions for the research community.

Source: arxiv.org · Published 28 June 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.7.2.

Orca: Foundation Model for Unified World Understanding Presented
