arxiv preprint - From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

In this episode, we discuss From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. This paper addresses the challenge Large Language Models (LLMs) face with long-context information retrieval and reasoning. The authors propose finetuning LLMs using a synthetic dataset designed for numerical key-value retrieval tasks, resulting in significant improvements. Experiments demonstrate enhanced performance on longer-context tasks without compromising general benchmark performance, unlike other long-context augmentation methods that can provoke hallucination.

arxiv preprint – From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data