arXiv Preprint - Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

In this episode we discuss “Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding” by Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, and Yu Wang. The paper proposes a method called “Skeleton-of-Thought” (SoT) to reduce the generation latency of large language models (LLMs). Current LLMs decode sequentially, producing one token at a time, which contributes to high latency. SoT instead guides the LLM to first generate a skeleton of the answer and then complete the content of each skeleton point in parallel, either through parallel API calls or through batched decoding.
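
To make the two-stage idea concrete, here is a minimal Python sketch of the API-call variant. It is an illustration under stated assumptions, not the authors' implementation: `fake_llm`, `parse_skeleton`, `skeleton_of_thought`, and the `SKELETON:` / `POINT:` prompt formats are all hypothetical names invented for this example, and `fake_llm` is a stub standing in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call (API request or local model)."""
    if prompt.startswith("SKELETON:"):
        # Stage 1 reply: a short numbered outline of the answer.
        return "1. Define the problem\n2. Describe the method\n3. Summarize results"
    # Stage 2 reply: "expand" the single point named at the end of the prompt.
    point = prompt.rsplit("POINT:", 1)[1].strip()
    return f"Expanded: {point}"


def parse_skeleton(text: str) -> List[str]:
    """Split a numbered outline into its points, dropping the '1.' prefixes."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        head, sep, rest = line.partition(".")
        points.append(rest.strip() if sep and head.isdigit() else line)
    return points


def skeleton_of_thought(
    question: str,
    llm: Callable[[str], str] = fake_llm,
    max_workers: int = 4,
) -> str:
    # Stage 1: one sequential call that asks only for the skeleton.
    points = parse_skeleton(llm(f"SKELETON: {question}"))

    # Stage 2: expand every skeleton point concurrently.
    # pool.map returns results in the same order as the input prompts.
    prompts = [f"QUESTION: {question}\nPOINT: {p}" for p in points]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(llm, prompts))

    # Stitch the expanded points back together in skeleton order.
    return "\n".join(
        f"{i}. {p}: {e}" for i, (p, e) in enumerate(zip(points, expansions), 1)
    )


if __name__ == "__main__":
    print(skeleton_of_thought("What is Skeleton-of-Thought?"))
```

Threads are used here because a real API call spends most of its time waiting on the network, so the expansion requests can overlap. With a locally hosted model, the same second stage would instead be run as one batched decoding pass over all point prompts.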
