arXiv preprint – Training Chain-of-Thought via Latent-Variable Inference


In this episode we discuss Training Chain-of-Thought via Latent-Variable Inference
by Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A. Saurous. The paper presents a fine-tuning strategy for large language models that improves problem-solving accuracy by maximizing the marginal probability of producing the correct answer, treating chain-of-thought (CoT) rationales as latent variables so that no detailed rationale supervision is required. Because sampling from the posterior distribution over rationales that lead to the correct answer is intractable, the authors propose a Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm, paired with a novel control-variate technique that reduces the variance of the gradient estimates. On a range of difficult reasoning tasks, the method produces more accurate answers than existing fine-tuning approaches, including the self-taught reasoner (STaR) and prompt-tuning with CoT.
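The latent-variable framing rests on a standard identity: the gradient of the marginal log-likelihood, ∇ log p(y|x) = E_{z ~ p(z|x,y)}[∇ log p(z, y|x)], turns fine-tuning into an expectation over posterior samples of the rationale z, which MCMC supplies. Below is a minimal, hypothetical Python sketch of one such MCMC-EM step; the `Model` interface, its method names, and the batch-average baseline are illustrative assumptions for this summary, not the paper's actual code (the paper uses a leave-one-out control variate and its own MCMC scheme).

```python
import random
from typing import Protocol, Sequence

class Model(Protocol):
    """Hypothetical stand-in for a fine-tunable LLM (assumed interface)."""
    def sample_rationale(self, question: str) -> str: ...          # z ~ p(z | x)
    def is_correct(self, question: str, rationale: str) -> bool: ...  # does z yield y*?
    def train_on(self, question: str, rationale: str, weight: float) -> None: ...

def mcmc_em_step(model: Model, batch: Sequence[str],
                 memory: dict, proposals_per_question: int = 4) -> None:
    """One MCMC-EM step in the spirit of the paper: refresh each question's
    remembered rationale (E-step), then take a weighted gradient step on the
    accepted rationales (M-step)."""
    accepted, hit_rates = [], []
    for x in batch:
        # E-step (MCMC): propose rationales from the current model and,
        # if any reach the correct answer, swap one into the memory as a
        # fresh approximate posterior sample (independence-sampler move).
        proposals = [model.sample_rationale(x) for _ in range(proposals_per_question)]
        hits = [z for z in proposals if model.is_correct(x, z)]
        hit_rates.append(len(hits) / len(proposals))
        if hits:
            memory[x] = random.choice(hits)
        if x in memory:
            accepted.append((x, memory[x]))
    # M-step: the batch-average success rate serves as a crude
    # variance-reducing baseline (standing in for the paper's
    # leave-one-out control variate).
    baseline = sum(hit_rates) / len(hit_rates)
    for x, z in accepted:
        model.train_on(x, z, weight=1.0 - baseline)
```

The design choice mirrored here is that rejected rationales still inform the baseline, while only rationales that reach the correct answer receive gradient credit.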

