In a nutshell: Orca learns a shared world representation from videos and language annotations, enabling text generation, image prediction, and agent control with a frozen backbone and modular decoders.
Researchers introduce Orca, a general foundation model that learns a unified latent space of the world from multimodal signals and makes it accessible through various output interfaces. The system combines video processing with language supervision and embodied AI in a shared next-state-prediction framework.
Orca builds on two complementary learning paradigms: “unconscious learning” extracts dense state transitions from continuous videos, while “conscious learning” models sparse, meaningful transitions from language-described events and visual-question-answering supervision. In contrast to isolated prediction objectives (next-token, next-frame, next-action), the approach focuses on unified state transitions, creating a consistent modeling pathway for understanding, prediction, and action.
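To make the idea of a single state-transition objective concrete, here is a minimal sketch in plain Python. It is not Orca's implementation: the `Transition` record, the toy `predict_next` function, and the squared-error loss are all assumptions chosen for illustration. The only point is that a dense video transition (no condition) and a sparse, language-described transition (with a condition vector) can pass through the same prediction-and-loss pathway.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

Vector = List[float]


@dataclass
class Transition:
    """One training example: a state, an optional condition, and the next state."""
    state: Vector
    next_state: Vector
    # Hypothetical: an embedded event description or question.
    # None stands for a raw video transition with no language signal.
    condition: Optional[Vector] = None


def predict_next(state: Vector, condition: Optional[Vector]) -> Vector:
    """Toy stand-in predictor (an assumption, not the paper's model).

    Without a condition the state is carried forward unchanged;
    with a condition the state is shifted by the condition vector.
    """
    if condition is None:
        return list(state)
    return [s + c for s, c in zip(state, condition)]


def transition_loss(batch: Sequence[Transition]) -> float:
    """Mean squared error between predicted and observed next states."""
    total, count = 0.0, 0
    for t in batch:
        pred = predict_next(t.state, t.condition)
        total += sum((p - y) ** 2 for p, y in zip(pred, t.next_state))
        count += len(t.next_state)
    return total / count


# A dense, unconditioned transition and a sparse, conditioned one
# are scored by the same loss function.
dense = Transition(state=[0.0, 1.0], next_state=[0.0, 1.0])
sparse = Transition(state=[0.0, 1.0], next_state=[1.0, 1.0], condition=[1.0, 0.0])
loss = transition_loss([dense, sparse])
```

In this sketch both kinds of supervision differ only in whether `condition` is set, which mirrors the article's contrast with separate next-token, next-frame, and next-action objectives.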
The pretraining dataset comprises 125,000 hours of video and 160 million event annotations. After pretraining, the learned latent space serves as a frozen backbone, and only lightweight, modality-specific decoders are trained. A new downstream application therefore requires training a decoder rather than retraining the backbone.
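The frozen-backbone pattern can be sketched in a few lines of plain Python. Everything here is hypothetical: the two-weight `backbone`, the `LinearDecoder` head, and the toy data are illustrative assumptions, not Orca's architecture. The sketch only shows the training split, where gradient updates touch the decoder's parameters and never the backbone's.

```python
from typing import List, Tuple

Input = Tuple[float, float]

# Frozen backbone parameters: read during training, never updated.
BACKBONE_WEIGHTS = (0.5, -0.25)


def backbone(x: Input) -> float:
    """Hypothetical frozen encoder mapping an input to a scalar latent."""
    return BACKBONE_WEIGHTS[0] * x[0] + BACKBONE_WEIGHTS[1] * x[1]


class LinearDecoder:
    """Lightweight trainable head: y = w * latent + b."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def __call__(self, latent: float) -> float:
        return self.w * latent + self.b

    def sgd_step(self, latent: float, target: float, lr: float) -> None:
        """One gradient step on 0.5 * error**2 with respect to w and b only."""
        error = self(latent) - target
        self.w -= lr * error * latent
        self.b -= lr * error


def train_decoder(data: List[Tuple[Input, float]],
                  epochs: int = 500, lr: float = 0.1) -> LinearDecoder:
    """Fit a decoder on top of the fixed backbone latents."""
    decoder = LinearDecoder()
    for _ in range(epochs):
        for x, target in data:
            decoder.sgd_step(backbone(x), target, lr)
    return decoder


# Toy task (an assumption): the target is an affine function of the latent.
inputs = [(2.0, 0.0), (0.0, 4.0), (4.0, 4.0), (-2.0, 2.0)]
data = [(x, 2.0 * backbone(x) + 1.0) for x in inputs]
decoder = train_decoder(data)
```

A second task would get its own `LinearDecoder` instance trained on the same `backbone` outputs, which is the sense in which the decoders are modular.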
The evaluation covers three representative downstream tasks: text generation, image prediction, and embodied action generation. On these tasks, Orca is reported to outperform specialized baseline models of comparable size. The results indicate that a stronger world latent translates directly into stronger downstream outputs. This positions Orca as a promising foundation-model approach to world understanding and points to the scalability of a unified state-transition paradigm.
Source: arxiv.org · Published 28 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.