arXiv preprint - Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

In this episode, we discuss "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" by Brian DuSell and David Chiang. The paper introduces stack attention, an attention mechanism that incorporates stacks to help transformers recognize hierarchical and nested syntactic structures, which standard scaled dot-product attention handles poorly. Two versions of stack attention are presented, one deterministic and one nondeterministic, both aiming to improve transformers’ ability to model context-free languages (CFLs) without requiring explicit syntactic training data. Experimental results show that transformers equipped with stack attention outperform standard transformers on CFLs with complex parsing requirements, and also show improvements in natural language modeling and machine translation in a setting with a limited parameter budget.
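
To make the link between stacks and nested structure concrete, here is a minimal sketch of an ordinary (non-differentiable) stack recognizing a small context-free language of balanced brackets. This is only an illustration of why a stack suits hierarchical patterns; it is not the paper's stack attention mechanism, and the function name `is_balanced` and the two-bracket alphabet are assumptions made for the example.

```python
def is_balanced(s: str) -> bool:
    """Return True if s is a well-nested string over the brackets ( ) [ ]."""
    closers = {")": "(", "]": "["}  # maps each closing bracket to its opener
    stack = []
    for ch in s:
        if ch in "([":
            # An opening bracket starts a new nested level: push it.
            stack.append(ch)
        elif ch in closers:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != closers[ch]:
                return False
        else:
            # Any other symbol is outside this toy language.
            return False
    # Every opener must have been closed.
    return not stack
```

For instance, `is_balanced("([])[]")` accepts a properly nested string, while `is_balanced("([)]")` rejects one whose brackets cross. Matching each closer against the most recent opener requires memory that grows with nesting depth, which is the kind of pattern the episode describes standard attention as struggling with.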
