arxiv preprint - Block-State Transformers

In this episode we discuss Block-State Transformers
by Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin. The paper introduces the Block-State Transformer (BST) architecture that merges state space models and block-wise attention to effectively capture long-range dependencies and improve performance on language modeling tasks. The BST incorporates an SSM sublayer for long-range context and a Block Transformer sublayer for local sequence processing, enhancing parallellization and combining the strengths of both model types. Experiments demonstrate the BST’s superior performance over traditional Transformers in terms of perplexity, generalization to longer sequences, and a significant acceleration in processing speed due to model parallelization.

arxiv preprint – Block-State Transformers