Hybrid architectures are hardly a novelty in today’s open-weight models. We now have the recent Qwen 3.5 (previewed by Qwen3-Next), Kimi Linear from last fall (a smaller release than the flagship Kimi K2 models), Nvidia’s Nemotron 3 Nano (with larger models expected soon), IBM Granite 4, and various other less prominent models. This seems to be one of those moments when a research trend is suddenly adopted everywhere at once (and perhaps the Muon optimizer will be next?).

To tell this story, we must rewind a few years to December 2023, when Mamba and Striped Hyena were dominating the scene and prompting the question: do we actually need full attention in our models? These early models ultimately failed, partly because of the same challenges we face today (complex implementations, issues with open-source tools, and greater difficulties during training), but also because the models tended to break down as they were scaled up. The hybrid models available at the time still fell short.

These models are called "hybrid" because they combine new recurrent neural network (RNN) components with the classic attention mechanisms that made transformers famous. They all perform best with this combination of modules. The RNN layers keep a compressed representation of prior computations in a hidden state, which is then used to predict the next token; it serves as a condensed summary of everything that came before. This concept has a remarkably long history in deep learning.
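To make the contrast between a recurrent hidden state and attention concrete, here is a toy sketch in plain Python. It is not how any of the models named above are implemented: the function names, the fixed `decay` value, and the one-attention-layer-in-four layout are all assumptions chosen for illustration. The point is only that the recurrent path carries a state of constant size, while the attention path has to keep every past token around.

```python
from typing import List

Vector = List[float]


def recurrent_step(state: Vector, x: Vector, decay: float = 0.9) -> Vector:
    """Fold one token vector into a fixed-size state.

    Toy linear recurrence (an assumption for illustration):
        h_t = decay * h_{t-1} + (1 - decay) * x_t
    """
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]


def run_recurrent(tokens: List[Vector]) -> Vector:
    """Process a sequence while keeping only a compressed summary of it."""
    state = [0.0] * len(tokens[0])
    for x in tokens:
        state = recurrent_step(state, x)
    return state  # same size no matter how long the sequence was


def run_attention_cache(tokens: List[Vector]) -> List[Vector]:
    """Mimic attention's memory: every past token is stored for later lookup."""
    cache: List[Vector] = []
    for x in tokens:
        cache.append(x)
    return cache  # grows linearly with sequence length


def hybrid_layout(n_layers: int, attention_every: int = 4) -> List[str]:
    """Hypothetical layer schedule: one attention layer after every few recurrent ones."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "recurrent"
        for i in range(n_layers)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(len(run_recurrent(tokens)))        # state size stays at the vector width
    print(len(run_attention_cache(tokens)))  # cache size equals the token count
    print(hybrid_layout(8))
```

Real hybrid models use learned, input-dependent updates rather than a fixed decay, and the ratio of recurrent to attention layers differs from model to model, so treat the numbers here as placeholders.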
Interconnects AI