The bottom line: MSA cuts per-token attention computation at a 1-million-token context by a factor of 28.4 through blockwise sparse selection, and co-design of the algorithm and GPU kernel turns that reduction into measured wall-clock speedups.

MiniMax AI introduces MiniMax Sparse Attention (MSA), a blockwise sparse-attention architecture for language models with extended context windows. The method reduces per-token attention computation at a 1-million-token context by a factor of 28.4 and delivers wall-clock speedups of 14.2x during prefill and 7.6x during decoding on H800 GPUs.

MSA addresses the scaling problem of softmax attention under ultra-long-context workloads: agent workflows, repository-scale code reasoning, and persistent memory systems need to attend over hundreds of thousands to millions of tokens at once. Because the cost of standard attention grows quadratically with sequence length, this is impractical at production scale.

MSA builds on Grouped Query Attention (GQA) and works in two stages. First, a lightweight index branch scores key-value (KV) blocks and selects a top-k subset independently for each GQA group, so retrieval is sparse and group-specific. Second, the main branch performs exact attention restricted to the selected blocks. The architecture deliberately avoids complex mechanisms in favor of simplicity and broad GPU compatibility.
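
The two-stage flow described above can be illustrated with a minimal pure-Python sketch for a single decoding step and a single GQA group. It is not MiniMax's implementation: the function names `select_blocks` and `sparse_attention`, the use of a separate index query and index keys, and the choice of scoring a block by its highest token-level dot product are all assumptions made for illustration, since the summary does not specify how the index branch scores blocks.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(index_query, index_keys, block_size, top_k):
    """Stage 1 (index branch): score every KV block, keep the top_k block ids.

    Assumption: a block's score is the largest dot product between the
    index query and any index key inside that block.
    """
    n_blocks = math.ceil(len(index_keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = index_keys[b * block_size:(b + 1) * block_size]
        scores.append(max(dot(index_query, k) for k in block))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(queries, keys, values, block_ids, block_size):
    """Stage 2 (main branch): exact softmax attention over selected blocks only.

    All query heads of the GQA group share the same keys, values and block_ids.
    """
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [scale * dot(q, keys[t]) for t in token_ids]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(len(values[0]))
        ])
    return outputs

# Toy data: 8 cached tokens, block size 2 (4 blocks), 2 query heads in the group.
keys = [[0.1, 0.0], [0.2, 0.0], [3.0, 0.0], [0.0, 1.0],
        [0.5, 0.0], [0.4, 0.0], [2.0, 1.0], [0.0, 0.0]]
values = [[float(t)] for t in range(len(keys))]
group_queries = [[1.0, 0.0], [0.0, 1.0]]

selected = select_blocks([1.0, 0.0], keys, block_size=2, top_k=2)
outputs = sparse_attention(group_queries, keys, values, selected, block_size=2)
```

In this toy setup only blocks 1 and 3 are selected, so each head attends to four of the eight cached tokens. The saving in a real model comes from the same effect at scale: the main branch touches top-k blocks rather than the full context.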

The kernel implementation uses exponent-free top-k selection and KV-outer-sparse attention to optimize tensor-core utilization at block access granularity. On a 109-billion-parameter model with native multimodal training, MSA achieves parity with standard GQA while attention computation per token drops by 28.4x for a 1-million-token context. Measured wall-clock speedups on H800 hardware are 14.2x (prefill) and 7.6x (decoding).
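
The summary does not explain what "exponent-free" means in detail. One plausible reading, offered here as an assumption rather than a description of the MSA kernel, is that block selection can rank raw scores directly: the exponential in softmax is strictly increasing, so ranking logits and ranking softmax probabilities pick the same blocks, and the selection step can skip the exponentials. The short sketch below checks that equivalence on made-up scores; `top_k_indices` and `softmax` are illustrative helpers.

```python
import math

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-block index scores for five KV blocks.
block_logits = [0.3, -1.2, 2.5, 0.9, 1.7]

from_logits = top_k_indices(block_logits, 2)          # no exp() needed
from_probs = top_k_indices(softmax(block_logits), 2)  # same selection, extra work
```

Both calls select blocks 2 and 4, which is why the normalization can be dropped from the selection step without changing which blocks the main branch reads.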

The inference kernel is publicly available at https://github.com/MiniMax-AI/MSA. A production-ready, natively multimodal MiniMax-M3 model is published on Hugging Face (https://huggingface.co/MiniMaxAI/MiniMax-M3).

Source: arxiv.org · Published 10 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.


MiniMax Sparse Attention: Efficient Long-Context Processing for Billion-Parameter Models
