A modified Transformer with two independent computation streams for state management and token prediction reduces required resources and improves performance by 2–3 percentage points on downstream tasks.
Orca combines video, speech and visual questions in a unified latent space, demonstrating that this unified world model can outperform specialized models in text, image and action forecasting tasks.