Weiren Yu is the Product Manager. Yarong Mu is a Senior Staff Software Engineer at Google Cloud. Lihao Ran is a Software Engineer at Google Cloud. Zhaoxiang Feng is a Research Assistant at UCSD. Yiming Zhao is a Research Assistant at UCSD. Hao Zhang is an Assistant Professor at UCSD.

Large Language Model (LLM) inference acceleration today is driven primarily by autoregressive speculative decoding, in which a small, lightweight drafter sequentially predicts tokens that the target model then verifies. However, this serial drafting approach introduces a fundamental execution bottleneck: the drafter needs K sequential forward passes to generate K candidate tokens, because each pass depends on the token produced by the previous one.
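To make the serial bottleneck concrete, here is a minimal sketch of one draft-and-verify round. It is not the authors' implementation: `toy_target`, `toy_drafter`, `draft_tokens`, and `verify` are hypothetical names, the "models" are toy functions over integer tokens, and verification uses simple greedy matching. The loop in `verify` simulates what a real target model would typically compute in a single parallel forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Toy target model (assumption): next token is last token + 1, modulo 10."""
    return (context[-1] + 1) % 10


def toy_drafter(context: List[int]) -> int:
    """Toy drafter (assumption): agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


def draft_tokens(drafter: Model, prefix: List[int], k: int) -> Tuple[List[int], int]:
    """Draft k tokens serially; return the drafts and the number of drafter passes."""
    context = list(prefix)
    drafts: List[int] = []
    passes = 0
    for _ in range(k):
        token = drafter(context)  # depends on the token from the previous pass
        passes += 1
        drafts.append(token)
        context.append(token)
    return drafts, passes


def verify(target: Model, prefix: List[int], drafts: List[int]) -> List[int]:
    """Greedy verification: keep drafts until the first mismatch, then emit the
    target's own token. If every draft matches, append one extra target token."""
    context = list(prefix)
    accepted: List[int] = []
    for token in drafts:
        expected = target(context)
        if token != expected:
            accepted.append(expected)  # correction replaces the rejected draft
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target(context))  # all drafts accepted: bonus token
    return accepted


drafts, drafter_passes = draft_tokens(toy_drafter, prefix=[1], k=4)
accepted = verify(toy_target, prefix=[1], drafts=drafts)
print(drafts, drafter_passes, accepted)  # [2, 3, 0, 1] 4 [2, 3, 4]
```

In this toy run the drafter spends four sequential passes to propose four tokens, yet only the first two survive verification. The drafting cost grows linearly with K regardless of how many candidates are ultimately accepted.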
Google Developers Blog