In a nutshell: FlashMorph converts Transformers into hybrid attention models by learning, under a fixed budget, which layers must keep full attention and which can be replaced with linear attention.

Researchers present FlashMorph, a method for selecting which layers of a Transformer retain full attention and which are replaced by more efficient linear attention. The method optimizes the hybrid configuration under budget constraints rather than relying on heuristics.

The problem with hybrid models: Hybrid attention architectures improve efficiency over long contexts by retaining full attention in only a few layers and replacing the remaining layers with linear attention. This saves computation time and memory. However, the effectiveness of this conversion depends heavily on which layers retain full attention. Previous layer selection methods rely on simple heuristics such as fixed patterns or scoring individual layers, treating layers in isolation rather than accounting for their mutual dependencies in the overall configuration.
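To make the memory argument concrete, here is a back-of-envelope sketch in Python. It is not taken from the FlashMorph paper. It assumes a standard key/value cache for full-attention layers, which grows with context length, and a fixed-size recurrent state of one head_dim x head_dim matrix per head for linear-attention layers, ignoring any normalizer terms. All function names and model dimensions are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(seq_len: int, n_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Per-layer cache of a full-attention layer: keys and values for every token."""
    return 2 * seq_len * n_heads * head_dim * bytes_per_value


def linear_state_bytes(n_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Per-layer state of a linear-attention layer (assumed form):
    one head_dim x head_dim matrix per head, independent of context length."""
    return n_heads * head_dim * head_dim * bytes_per_value


def hybrid_cache_bytes(n_layers: int, n_full: int, seq_len: int,
                       n_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Total inference-time cache when n_full of n_layers keep full attention."""
    if not 0 <= n_full <= n_layers:
        raise ValueError("n_full must be between 0 and n_layers")
    full = n_full * kv_cache_bytes(seq_len, n_heads, head_dim, bytes_per_value)
    linear = (n_layers - n_full) * linear_state_bytes(n_heads, head_dim, bytes_per_value)
    return full + linear


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 heads of dimension 64, 128k-token context.
    for n_full in (32, 8, 4):
        total = hybrid_cache_bytes(32, n_full, 128_000, 8, 64)
        print(f"{n_full:2d} full-attention layers: {total / 2**20:.1f} MiB")
```

Under these assumptions the cache shrinks roughly in proportion to the number of full-attention layers that are replaced. The sketch says nothing about quality, which is why the choice of layers matters.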

FlashMorph’s approach: The new method formulates layer selection as a budget-constrained subset optimization problem. FlashMorph first constructs a morphable model in which each full-attention layer is paired with a converted linear-attention variant. All weights are then frozen, and the layer-wise gates are jointly optimized on synthetic long-context retrieval tasks. A linearization regularizer pushes the gates toward the cheaper linear-attention branch. The learned gates are then discretized, that is, converted into a binary decision per layer, while respecting a specified budget of full-attention layers. Standard logits distillation and long-context fine-tuning follow.
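The gate mechanics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. The summary does not specify how gates blend the two branches, what form the regularizer takes, or how discretization is performed. The sketch assumes a sigmoid gate that interpolates between branch outputs, a penalty proportional to the total gate mass, and a simple top-k rule for the budget. All function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mix_outputs(gate_logit: float, full_out: List[float],
                linear_out: List[float]) -> List[float]:
    """Blend the two branches of one morphable layer (assumed form):
    gate near 1 means full attention, gate near 0 means linear attention."""
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits: List[float], strength: float) -> float:
    """Assumed regularizer: grows with the total weight placed on full attention,
    so minimizing it nudges gates toward the linear branch."""
    return strength * sum(sigmoid(z) for z in gate_logits)


def discretize_gates(gate_logits: List[float], budget: int) -> List[bool]:
    """Assumed top-k rule: keep full attention (True) in the `budget` layers
    with the largest learned gates and use linear attention (False) elsewhere."""
    if not 0 <= budget <= len(gate_logits):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    keep = set(ranked[:budget])
    return [i in keep for i in range(len(gate_logits))]


if __name__ == "__main__":
    learned = [2.0, -1.0, 0.5, -3.0]  # hypothetical gate logits for 4 layers
    print(discretize_gates(learned, budget=2))
```

Because the gates are trained jointly through the whole network, each gate's final value reflects what the other layers are doing, which is what separates this setup from scoring layers one at a time.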

Practical implications: For engineers building long-context LLMs, for example for document processing or retrieval, FlashMorph offers a systematic procedure in place of ad-hoc heuristics. According to the authors, the method significantly reduces selection overhead and finds more effective hybrid configurations that maintain strong long-context recall and overall benchmark performance. That lowers the time and compute needed to convert an existing Transformer for long-context use.

Source: arxiv.org · Published June 28, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.7.2.

FlashMorph: Automatic Selection of Attention Layers in Hybrid Models