
Olmo Hybrid and future LLM architectures

Hybrid architectures are hardly a novelty in today's open-weight models. We now have the recently released Qwen3.5 (previewed as Qwen3-Next), Kimi Linear from last fall (a smaller release than the flagship Kimi K2 series), Nvidia's Nemotron 3 Nano (with larger variants expected soon), IBM's Granite 4, and several other less prominent models. This looks like one of those moments when a research trend gets adopted everywhere at once (and maybe the Muon optimizer will follow suit soon?).

To tell this story, we must rewind a few years to December 2023, when Mamba and Striped Hyena were dominating the spotlight and prompting the question: do we really need full attention in our models? These early models ultimately failed to catch on, partly because of the same challenges we face today (complex implementations, weak support in open-source tooling, and harder training), but also because they tended to break down as they were scaled up. The hybrid models available at the time were not yet good enough either.

These models are called "hybrid" because they combine the new recurrent neural network (RNN) modules with the traditional attention mechanism that made the Transformer famous, and they all perform best with this mix of modules. The RNN layers maintain a compressed summary of all preceding information in a hidden state, which is then used to predict the next token, an approach with deep historical roots in deep learning.
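To make the difference between the two kinds of layers concrete, here is a toy sketch in plain Python. It is not how any of the models above are implemented: the function names, the decay constant, and the three-recurrent-to-one-attention layer pattern are all assumptions chosen for illustration, and there are no learned weights. It only shows that a recurrent layer folds every token into a fixed-size state, while an attention layer keeps every past token around.

```python
import math


def recurrent_step(state, x, decay=0.9):
    """Fold one token vector into a fixed-size hidden state.

    Toy linear recurrence (hypothetical, no learned weights): the state
    is a running, exponentially decayed summary of everything seen so far.
    """
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]


def attention_step(cache, x):
    """Append x to the cache, then attend over every cached vector.

    Uses x as the query and the cached vectors as both keys and values.
    The cache grows by one entry per token.
    """
    cache.append(x)
    scores = [sum(q * k for q, k in zip(x, key)) for key in cache]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, cache)) / total
        for i in range(len(x))
    ]


def run_hybrid(tokens, pattern=("rnn", "rnn", "rnn", "attn")):
    """Run tokens through a toy layer stack and report memory per layer.

    `pattern` is an assumed layer layout, not taken from any real model.
    Returns the number of floats each layer has to keep after the last token.
    """
    dim = len(tokens[0])
    states = [[0.0] * dim if kind == "rnn" else [] for kind in pattern]
    for x in tokens:
        h = x
        for i, kind in enumerate(pattern):
            if kind == "rnn":
                states[i] = recurrent_step(states[i], h)
                h = states[i]
            else:
                h = attention_step(states[i], h)
    return [
        len(s) if kind == "rnn" else len(s) * dim
        for kind, s in zip(pattern, states)
    ]


if __name__ == "__main__":
    tokens = [[float(t), 1.0, 0.0, -1.0] for t in range(8)]
    print(run_hybrid(tokens))
```

With eight 4-dimensional tokens, each recurrent layer in this sketch holds 4 floats and the attention layer holds 32. Doubling the sequence length doubles the attention layer's memory and leaves the recurrent layers unchanged, which is the trade-off that mixing the two layer types is meant to balance.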

  Interconnects AI