Orca combines video, speech and visual questions in a unified latent space, demonstrating that this unified world model can outperform specialized models in text, image and action forecasting tasks.
Orca learns a shared world representation from videos and language annotations, enabling text generation, image prediction, and agent control with a frozen backbone and modular decoders.