Bottom line: CoT fine-tuning degrades long-context retrieval in hybrid LLMs by distorting query-key projections; QK-Restore fixes this without additional training.
Chain-of-thought fine-tuning improves the reasoning performance of hybrid language models, but systematically destroys their ability to retrieve information across long context windows. A new training-free method called QK-Restore addresses this problem.
The Problem: Researchers have documented that chain-of-thought (CoT) supervised fine-tuning in hybrid LLMs with linear attention (such as HypeNet and Jet-Nemotron) leads to severe losses in long-context retrieval performance. For HypeNet-9B, retrieval accuracy on the Needle-In-A-Haystack test (NIAH-S2@256K) dropped from 67.2% to 9.4% after CoT fine-tuning, a loss of 57.8 percentage points. The degradation worsens with larger context windows and more challenging retrieval scenarios.
The Cause: CoT fine-tuning systematically biases attention gradients toward short-range patterns. This damages the query and key projection matrices (W_Q, W_K), which are essential for long-range routing. As a result, the model can no longer reliably locate relevant information across the full context length.
The Solution: QK-Restore is a training-free method that restores only the W_Q and W_K matrices from the pre-SFT checkpoint while keeping all other parameters of the fine-tuned model. A Procrustes variant additionally balances routing preservation against reasoning adaptation. For HypeNet-5B, S3@256K performance improved from 65.4% to 76.4%, while reasoning performance was maintained.
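The basic restore step described above can be sketched as a checkpoint merge. This is a minimal illustration, not the authors' implementation: the function name `restore_qk`, the parameter-name markers `q_proj` and `k_proj`, and the toy checkpoints are all assumptions, and the Procrustes variant is not covered because the summary does not describe it in detail.

```python
from typing import Any, Dict, Tuple


def restore_qk(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    qk_markers: Tuple[str, ...] = ("q_proj", "k_proj"),  # assumed naming scheme
) -> Dict[str, Any]:
    """Return a copy of post_sft whose query/key projections come from pre_sft."""
    # Shallow copy: every parameter that is not W_Q or W_K keeps its fine-tuned value.
    merged = dict(post_sft)
    for name in post_sft:
        if any(marker in name for marker in qk_markers):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            # Swap in the pre-SFT query/key projection.
            merged[name] = pre_sft[name]
    return merged


# Toy checkpoints (hypothetical names); plain lists stand in for weight tensors.
pre = {
    "layers.0.attn.q_proj.weight": [1.0, 2.0],
    "layers.0.attn.k_proj.weight": [3.0, 4.0],
    "layers.0.attn.v_proj.weight": [5.0, 6.0],
}
post = {
    "layers.0.attn.q_proj.weight": [1.5, 2.5],
    "layers.0.attn.k_proj.weight": [3.5, 4.5],
    "layers.0.attn.v_proj.weight": [5.5, 6.5],
}

merged = restore_qk(pre, post)
for name, value in merged.items():
    print(name, value)
```

In a PyTorch setting the two dictionaries would typically come from `model.state_dict()` and the merged result would be passed to `load_state_dict`. Real parameter names differ between models, so the markers would need to be adjusted to match the checkpoint at hand.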
For engineers: the method makes it possible to combine CoT fine-tuning with long-context capability at no additional training cost. This matters most for applications that need both complex reasoning and reliable retrieval over long contexts.
Source: arxiv.org · Published June 8, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.6.5.