
Transformer Variant with Separate State and Prediction Streams Shows Efficiency Gains

In a nutshell: A modified Transformer with two separate computation streams, one for state management and one for token prediction, reaches given performance targets with less training data and compute and scores 2–3 percentage points higher on downstream tasks.

Researchers have developed a Transformer architecture that splits next-token prediction and the storage of state information into two separate computation streams. The design is based on the State-Prediction Separation Hypothesis and consistently shows better data and computational efficiency.

Conventional Transformers use the same forward pass to simultaneously predict the next token and store useful state information for future predictions. Because a single set of computations has to serve both purposes, optimizing for one task can come at the expense of the other.

The proposed Transformer variant splits these functions into two separate computation streams: one stream focuses on immediate token prediction, while the other manages state updates and maintenance. This enables the model to optimize each stream for its specific role.
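The split described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the article gives no architectural details, so the wiring here (a state stream that only attends to itself, and a prediction stream that reads the state but never writes back to it) is an assumption, and the names `causal_attention`, `init_params`, and `two_stream_layer` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(queries, keys_values, w_q, w_k, w_v):
    # Single-head attention in which position t only sees positions <= t.
    q, k, v = queries @ w_q, keys_values @ w_k, keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    t = scores.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def init_params(rng, d_model, vocab_size):
    # Hypothetical parameter layout: each stream has its own Q/K/V weights.
    def mat(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "embed": mat(vocab_size, d_model),
        "state": [mat(d_model, d_model) for _ in range(3)],
        "pred": [mat(d_model, d_model) for _ in range(3)],
        "unembed": mat(d_model, vocab_size),
    }

def two_stream_layer(tokens, params):
    x = params["embed"][tokens]  # (T, d_model)
    # State stream: maintains information for future positions.
    state = x + causal_attention(x, x, *params["state"])
    # Prediction stream (assumed wiring): reads the state, never feeds back.
    pred = x + causal_attention(x, state, *params["pred"])
    logits = pred @ params["unembed"]  # (T, vocab_size)
    return state, logits

rng = np.random.default_rng(0)
params = init_params(rng, d_model=8, vocab_size=11)
state, logits = two_stream_layer(np.array([1, 2, 3, 4]), params)
```

In this sketch the state stream depends only on its own weights, so each set of parameters serves a single role, which is the property the article attributes to the proposed design.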

Pretraining experiments across various model sizes consistently show lower validation loss. On downstream tasks, the modified Transformer outperforms standard Transformers by an average of 2–3 percentage points. It is also more data- and compute-efficient: the model reaches given performance targets with less training effort.

Through empirical analyses, the authors investigate potential confounders and demonstrate fundamental differences in gradient structures between architectures. The results suggest that decoupling these functions represents a fundamental improvement in Transformer design principles, not merely a hyperparameter variation.


Source: arxiv.org · Published 30 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.
