MSA reduces attention computation for million-token contexts by a factor of 28.4 through blockwise sparse selection and achieves practical speedups via co-design of algorithm and GPU kernel.
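
To illustrate what blockwise sparse selection means in general, here is a minimal NumPy sketch for a single query. It is not MSA's actual algorithm or kernel. The function name `blockwise_sparse_attention`, the parameters `block_size` and `top_k`, and the block-scoring rule (query dotted with each block's mean key) are all assumptions made for the example.

```python
import numpy as np

def blockwise_sparse_attention(q, K, V, block_size, top_k):
    """Attend one query to only the top_k highest-scoring key blocks.

    Assumption (not from the source): a block is scored by the dot product
    between the query and the mean of the keys in that block.
    """
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes n is divisible by block_size"
    n_blocks = n // block_size
    assert 1 <= top_k <= n_blocks

    # One summary key per block, shape (n_blocks, d).
    block_keys = K.reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q

    # Keep the top_k blocks; their order does not affect the result.
    chosen = np.argsort(block_scores)[-top_k:]

    # Expand the chosen block ids into the token indices they cover.
    token_idx = (chosen[:, None] * block_size + np.arange(block_size)).ravel()

    # Standard scaled dot-product attention, restricted to the chosen tokens.
    logits = K[token_idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[token_idx], np.sort(chosen)
```

In this sketch the softmax touches `top_k * block_size` keys instead of all `n`, which is where the saving in attention computation comes from. Selecting every block reproduces dense attention exactly.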
Aligning router rows with the principal singular directions of their associated expert matrices improves the efficiency and stability of Mixture-of-Experts models.
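
One way to make the idea of router-expert alignment concrete is the NumPy sketch below. The source does not say how alignment is measured or enforced, so the following are assumptions for illustration: each expert is a single matrix `W` applied as `W @ x`, the principal direction is its top right singular vector (which lives in the same input space as a router row), and alignment is scored by absolute cosine similarity. The helper names `principal_direction`, `router_alignment`, and `align_router` are hypothetical.

```python
import numpy as np

def principal_direction(W):
    """Top right singular vector of W (shape d_out x d_in): a unit vector in input space."""
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[0]

def router_alignment(router, experts):
    """Absolute cosine between each router row and its expert's principal direction.

    The absolute value is used because singular vectors are defined only up to sign.
    """
    scores = []
    for r, W in zip(router, experts):
        v = principal_direction(W)
        scores.append(abs(r @ v) / (np.linalg.norm(r) * np.linalg.norm(v)))
    return np.array(scores)

def align_router(experts, scale=1.0):
    """Build a router whose i-th row is the i-th expert's principal direction."""
    return np.stack([scale * principal_direction(W) for W in experts])
```

A router built with `align_router` scores 1.0 on `router_alignment` for every expert, while a randomly initialized router generally scores lower. Whether alignment is applied at initialization or encouraged through a training loss is left open here.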