arXiv preprint - Speculative Streaming: Fast LLM Inference without Auxiliary Models

In this episode, we discuss Speculative Streaming: Fast LLM Inference without Auxiliary Models by Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi. The paper introduces Speculative Streaming, a method for speeding up large language model inference that, unlike conventional speculative decoding, needs no auxiliary draft model. Instead, it fine-tunes the main model to predict future n-grams rather than only the next token, yielding speedups of 1.8x to 3.1x on tasks such as Summarization and Meaning Representation without a loss in generation quality. Speculative Streaming is also parameter-efficient: its speed gains are comparable to those of more complex architectures while adding vastly fewer parameters, which makes it well suited to deployment on resource-constrained devices.
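
For a concrete picture of the draft-and-verify idea that speculative decoding builds on, here is a minimal sketch of the acceptance step, assuming greedy decoding. It is a generic illustration, not the paper's implementation: the function name `accept_longest_prefix` and the toy token IDs are hypothetical, and how the draft tokens are produced (a separate draft model, or the main model's own future n-gram predictions) is left outside the sketch.

```python
from typing import List, Sequence


def accept_longest_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Return the tokens to commit after one verification pass (greedy variant).

    draft[i]    : speculated token for position i.
    verified[i] : token the model itself picks at position i, given the
                  context plus draft[:i]. It may hold one extra entry
                  (the prediction that follows the full draft).
    """
    accepted: List[int] = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            # First mismatch: keep the model's own token and stop.
            accepted.append(checked)
            return accepted
        accepted.append(drafted)
    # Every drafted token matched; the extra prediction, if present, comes for free.
    if len(verified) > len(draft):
        accepted.append(verified[len(draft)])
    return accepted


# Hypothetical token IDs: the third drafted token is rejected and replaced.
committed = accept_longest_prefix([5, 6, 7], [5, 6, 9, 1])
```

Each verification pass commits at least one token and often several, which is where the speedup comes from; the output matches what plain greedy decoding would have produced.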
