Exact Linear Attention

Abstract

arXiv:2605.18848v1

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by leveraging the exact decomposition property of kernel functions, without any approximation error. It identifies two problems in prior linear attention methods, gradient explosion and token attention dilution, and addresses them by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, the Summation Squared Euclidean Distance Kernel, and the Subtraction Squared Euclidean Distance Kernel. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module, based on bidirectional linear attention, that captures the transformation flow across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture of Experts that improves interpretability and semantic alignment.
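The central claim, that a kernel which decomposes exactly as k(q, k) = φ(q)·φ(k) allows attention to be computed in linear time with no approximation error, can be illustrated with a minimal sketch. The feature map below (elementwise exp) is an assumption chosen only because it is non-negative and factorizes exactly; it is not taken from the paper and should not be read as the definition of the Hadamard Exp Kernel or any other kernel the paper proposes. The function names `phi`, `quadratic_attention`, and `linear_attention` are likewise hypothetical, and the sketch covers only the bidirectional (non-causal) case.

```python
import math

def phi(x):
    # Assumed feature map (not from the paper): elementwise exp, always positive.
    return [math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """O(n^2) reference: form every pairwise kernel score explicitly."""
    out = []
    for q in Q:
        scores = [dot(phi(q), phi(k)) for k in K]
        total = sum(scores)
        out.append([sum(s * v[j] for s, v in zip(scores, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """O(n) version: summarize keys and values once, then reuse per query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum_i phi(k_i) v_i^T
    z = [0.0] * d                       # z = sum_i phi(k_i)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for j in range(dv):
                S[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)  # normalizer, positive because phi is positive
        out.append([sum(fq[a] * S[a][j] for a in range(d)) / denom
                    for j in range(dv)])
    return out

# Toy inputs: 3 tokens, key/query dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
K = [[0.2, 0.1], [-0.1, 0.3], [0.4, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ref = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
```

Because the kernel factorizes exactly, the two functions agree up to floating-point rounding; the linear version only reorders the sums so that the key/value summary `S` and normalizer `z` are built once rather than once per query.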
