A Theoretical View On Sparsely Activated Networks
2022 · Cenk Baykal, Nishanth Dikkala, Rina Panigrahy, et al.
Abstract
Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm in Switch Transformers). However, prior work is largely empirical, and while existing routing functions work well in practice, they do not lead to theoretical guarantees on approximation ability. We aim to provide a theoretical explanation for the power of sparse networks. As our first contribution, we present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures. We then introduce a routing function based on locality sensitive hashing (LSH) that enables us to reason about how well sparse networks approximate target functions. After
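To make the idea of hash-based routing concrete, here is a minimal sketch of a sparsely activated model in which a locality sensitive hash selects one subnetwork per input. Everything in it is an illustrative assumption rather than the paper's construction: the random-hyperplane (sign) hash, the class names LSHRouter and SparseLSHModel, and the use of one linear expert per hash bucket are all hypothetical choices made for this example.

```python
import numpy as np


class LSHRouter:
    """Hypothetical router: random-hyperplane LSH mapping an input to a bucket id."""

    def __init__(self, dim, num_bits, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane per hash bit (assumed hash family, not from the source).
        self.planes = rng.standard_normal((num_bits, dim))
        # Powers of two used to pack the sign bits into a single integer.
        self.weights = 1 << np.arange(num_bits)

    def route(self, x):
        # Each bit records which side of a hyperplane x falls on.
        bits = (self.planes @ x) >= 0
        return int(bits.astype(np.int64) @ self.weights)


class SparseLSHModel:
    """Hypothetical sparse model: one linear expert per LSH bucket."""

    def __init__(self, dim, num_bits, seed=0):
        self.router = LSHRouter(dim, num_bits, seed)
        rng = np.random.default_rng(seed + 1)
        # 2**num_bits experts are stored, but only one is used per input.
        self.experts = rng.standard_normal((2 ** num_bits, dim))

    def forward(self, x):
        idx = self.router.route(x)
        # Only the selected expert's parameters take part in the computation.
        return float(self.experts[idx] @ x), idx


model = SparseLSHModel(dim=8, num_bits=3)
x = np.random.default_rng(42).standard_normal(8)
output, bucket = model.forward(x)
print(bucket, output)
```

The routing is fixed once the hyperplanes are drawn, so the same input always reaches the same expert, and inputs that point in similar directions tend to share a bucket. The per-input cost depends on the size of one expert plus the hash, not on the total number of experts.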
Related papers
- Functional Similarity Metric For Neural Networks: Overcoming Parametric Ambiguity Via Activation Region Analysis (2026) 0.00
- Practice With Graph-based ANN Algorithms On Sparse Data: Chi-square Two-tower Model, HNSW, Sign Cauchy Projections (2023) 0.00
- Scatterbrain: Unifying Sparse And Low-rank Attention Approximation (2021) 0.00
- On Design Choices In Similarity-preserving Sparse Randomized Embeddings (2024) 2.26
- Representing Deep Neural Networks Latent Space Geometries With Graphs (2020) 7.50
- Hyperformer: Learning Expressive Sparse Feature Representations Via Hypergraph Transformer (2023) 7.50
- Beyond Matryoshka: Revisiting Sparse Coding For Adaptive Representation (2025) 4.30
- Sparsey: Event Recognition Via Deep Hierarchical Sparse Distributed Codes (2016) 8.35