arXiv preprint – Evaluating Human Alignment and Model Faithfulness of LLM Rationale

In this episode, we discuss "Evaluating Human Alignment and Model Faithfulness of LLM Rationale" by Mohsen Fayyaz, Fan Yin, Jiao Sun, and Nanyun Peng. The paper investigates how effectively large language models (LLMs) can explain their decisions through rationales extracted from input texts. It compares two types of rationale extraction methods, attribution-based and prompting-based, and finds that prompting-based rationales align better with human-annotated rationales. The study also explores the faithfulness limitations of prompting-based methods and shows that fine-tuning models on specific datasets can improve the faithfulness of both rationale extraction approaches.



