P-EAGLE: Parallel Speculation for Faster LLM Inference on AWS SageMaker

In a nutshell: AWS has developed P-EAGLE, a parallelized variant of speculative decoding that generates draft tokens in a single forward pass instead of sequentially, achieving inference throughput improvements of up to 1.69x on SageMaker AI.

AWS has developed Parallel-EAGLE (P-EAGLE) and released it as open source. The method turns the drafting step of speculative decoding from a sequential process into a fully parallel one. As a result, significantly faster inference endpoints can be deployed on Amazon SageMaker AI without having to manage custom CUDA kernels.

Speculative decoding is an established technique for accelerating LLM inference: a lightweight draft model proposes several candidate next tokens, which the target LLM then verifies in a single forward pass. However, the current standard, EAGLE-3, generates these draft tokens autoregressively, so each token depends on the one before it. To predict K candidates, EAGLE-3 therefore needs K sequential forward passes through the draft head, and drafting latency grows linearly with the speculation depth.
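
The verification step described above can be illustrated with a minimal sketch. It assumes greedy decoding and uses a hypothetical lookup function, `toy_target_next`, as a stand-in for the target LLM; a real system scores all draft positions in one batched forward pass rather than calling the model per position. The sentence continuation, including the word "museums", is made up for illustration.

```python
from typing import Callable, List, Sequence

# Illustrative sentence only; not taken from any benchmark.
SENTENCE = ["Paris", ",", "known", "for", "its", "museums"]


def toy_target_next(prefix: Sequence[str]) -> str:
    """Hypothetical stand-in for the target LLM's greedy next-token choice."""
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"


def verify_draft(
    context: Sequence[str],
    draft: Sequence[str],
    target_next: Callable[[Sequence[str]], str],
) -> List[str]:
    """Keep the longest draft prefix the target agrees with.

    At the first disagreement, the target's own token replaces the draft
    token and verification stops. If every draft token is accepted, the
    target contributes one extra token.
    """
    accepted: List[str] = []
    for token in draft:
        expected = target_next(list(context) + accepted)
        if token != expected:
            accepted.append(expected)  # correct the draft and stop
            return accepted
        accepted.append(token)
    accepted.append(target_next(list(context) + accepted))  # bonus token
    return accepted


good = verify_draft(["Paris"], [",", "known", "for", "its"], toy_target_next)
bad = verify_draft(["Paris"], [",", "famous", "for", "its"], toy_target_next)
print(good)
print(bad)
```

The more draft tokens the target accepts per verification pass, the fewer target passes are needed per generated token, which is where the speedup comes from.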

P-EAGLE removes this bottleneck by parallelizing the drafting step: instead of generating draft tokens one after another, it predicts all speculative tokens in a single forward pass. This decouples the number of draft tokens from the number of sequential forward passes. A practical example: if the target model generates the token “Paris,” EAGLE-3 needs four sequential draft passes to predict the next four tokens (“, known for its”). P-EAGLE fills positions 2–4 with trainable placeholders and predicts all four tokens simultaneously. In real-world benchmarks on high-end hardware, P-EAGLE achieves a throughput gain of up to 1.69x over classic EAGLE.
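
The difference in draft passes can be made concrete with a toy sketch. Everything here is an assumption for illustration: `ToyDraftHead` is a hypothetical lookup table rather than a neural draft head, and `PLACEHOLDER` merely stands in for the trainable placeholder inputs. The sketch only counts forward passes; it says nothing about how accurate real parallel drafts are.

```python
from typing import List, Sequence

PLACEHOLDER = "<placeholder>"  # hypothetical stand-in for a trainable placeholder


class ToyDraftHead:
    """Hypothetical draft head that counts how often it is invoked."""

    def __init__(self, continuation: Sequence[str]) -> None:
        self.continuation = list(continuation)
        self.forward_passes = 0

    def forward(self, inputs: Sequence[str], start: int) -> List[str]:
        """One forward pass: predict one token per input slot, from `start`."""
        self.forward_passes += 1
        return [self.continuation[start + i] for i in range(len(inputs))]


def draft_sequential(head: ToyDraftHead, last_token: str, k: int) -> List[str]:
    """Autoregressive drafting: k passes, each fed the previous draft token."""
    tokens: List[str] = []
    current = last_token
    for position in range(k):
        current = head.forward([current], start=position)[0]
        tokens.append(current)
    return tokens


def draft_parallel(head: ToyDraftHead, last_token: str, k: int) -> List[str]:
    """Parallel drafting: slot 1 sees the real token, slots 2..k placeholders."""
    inputs = [last_token] + [PLACEHOLDER] * (k - 1)
    return head.forward(inputs, start=0)


continuation = [",", "known", "for", "its"]
seq_head = ToyDraftHead(continuation)
par_head = ToyDraftHead(continuation)
seq_tokens = draft_sequential(seq_head, "Paris", k=4)
par_tokens = draft_parallel(par_head, "Paris", k=4)
print(seq_tokens, seq_head.forward_passes)
print(par_tokens, par_head.forward_passes)
```

In this toy setup both strategies return the same four tokens, but the sequential version invokes the draft head four times and the parallel version once.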

Amazon SageMaker JumpStart now natively supports P-EAGLE for a range of popular foundation models. Developers can deploy P-EAGLE-accelerated inference endpoints in a managed environment, without having to maintain CUDA kernels or distributed serving setups themselves. The integration takes only a few lines of code: select a model from the SageMaker JumpStart catalog, configure the parallel drafting parameters, and start the endpoint.

Benchmark results show P-EAGLE on Qwen3-Coder-30B-A3B-Instruct with NVIDIA B200 GPUs and FP8 quantization consistently outperforming EAGLE-3 and baseline inference (without speculation), measured in output tokens per second. The advantage grows with higher concurrency and greater speculation depth (K values).


Source: aws.amazon.com · Published June 16, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
