
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

Weiren Yu, Product Manager; Yarong Mu, Senior Staff Software Engineer, Google Cloud; Lihao Ran, Software Engineer, Google Cloud; Zhaoxiang Feng, Research Assistant, UCSD; Yiming Zhao, Research Assistant, UCSD; Hao Zhang, Assistant Professor, UCSD

Large Language Model (LLM) inference acceleration today is driven primarily by autoregressive speculative decoding, in which a compact drafter model predicts tokens one at a time and the target model then verifies them. This serial drafting creates a core execution bottleneck: producing K candidate tokens requires K sequential forward passes of the drafter.
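To make the bottleneck concrete, here is a minimal toy sketch of one autoregressive speculative decoding step. It is not the implementation discussed in this post: the `CountingModel` wrapper, the `speculative_step` function, and the two toy next-token functions are hypothetical names introduced only for illustration, and verification is simplified to greedy exact matching rather than the rejection sampling used in practice. The point is only to count forward passes: the drafter needs K dependent passes, while the target verifies all K candidates in one.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: maps a token prefix to the next (greedy) token.
NextTokenFn = Callable[[List[Token]], Token]


class CountingModel:
    """Wraps a toy next-token function and counts simulated forward passes."""

    def __init__(self, fn: NextTokenFn) -> None:
        self.fn = fn
        self.forward_passes = 0

    def next_token(self, prefix: List[Token]) -> Token:
        # One forward pass yields one token.
        self.forward_passes += 1
        return self.fn(prefix)

    def score_draft(self, prefix: List[Token], draft: List[Token]) -> List[Token]:
        # Counted as ONE pass: a real target model scores all draft positions
        # in parallel. The Python loop below only simulates that batched result.
        self.forward_passes += 1
        return [self.fn(prefix + draft[:i]) for i in range(len(draft) + 1)]


def speculative_step(drafter: CountingModel, target: CountingModel,
                     prefix: List[Token], k: int) -> List[Token]:
    # Drafting: K sequential passes, each conditioned on the previous draft token.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter.next_token(prefix + draft))

    # Verification: a single target pass over all K candidates.
    target_preds = target.score_draft(prefix, draft)

    # Greedy acceptance (an assumption of this sketch): keep the longest
    # prefix of the draft that matches the target's own predictions.
    accepted: List[Token] = []
    for i, tok in enumerate(draft):
        if tok != target_preds[i]:
            break
        accepted.append(tok)

    # The target always contributes one token: a correction at the first
    # mismatch, or a bonus token if every draft token was accepted.
    accepted.append(target_preds[len(accepted)])
    return accepted


def toy_target(prefix: List[Token]) -> Token:
    # Toy "target model": count upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token]) -> Token:
    # Toy "drafter": agrees with the target except right after token 3.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


drafter = CountingModel(toy_drafter)
target = CountingModel(toy_target)
new_tokens = speculative_step(drafter, target, prefix=[1], k=4)
print(new_tokens, drafter.forward_passes, target.forward_passes)
```

With K = 4, the drafter runs four dependent passes before the target can start its single verification pass, so drafting latency grows linearly with K even though verification does not.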

  Google Developers Blog