LG | NE | Nov 20, 2024

Deriving Activation Functions Using Integration

arXiv:2411.13010v3 | 3 citations | h-index: 16 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of optimizing activation functions for deep learning models, particularly in large-scale language tasks, though it appears incremental as it builds on prior functions like ELU and ReLU².

The paper designs activation functions by first specifying the desired gradient and then integrating it to obtain the function itself. The resulting activation, xIELU, achieves lower perplexity than existing functions such as ReLU² and SwiGLU in 1.1B and 3B parameter Llama models under matched compute and parameter count.
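The integration step can be sketched as a short derivation. This is a reconstruction from the description in the abstract, not necessarily the exact parametrization used in the reference implementation: the symbols $\alpha_p$, $\alpha_n$, and $\beta$ are assumed names, and ELU is taken with unit scale.

```latex
% Assumed starting point: ELU with unit scale
\mathrm{ELU}(x) =
\begin{cases}
  x,        & x > 0 \\
  e^{x} - 1, & x \le 0
\end{cases}

% Prescribed gradient: an affine transformation of ELU on each side of zero
g(x) =
\begin{cases}
  2\alpha_p \, x + \beta,            & x > 0 \\
  \alpha_n \,(e^{x} - 1) + \beta,    & x \le 0
\end{cases}

% Activation obtained by integrating the gradient from 0 to x
f(x) = \int_{0}^{x} g(t)\, dt =
\begin{cases}
  \alpha_p \, x^{2} + \beta x,               & x > 0 \\
  \alpha_n \,(e^{x} - 1 - x) + \beta x,      & x \le 0
\end{cases}
```

Under these assumptions, both branches agree at the origin, with $f(0) = 0$ and $f'(0) = \beta$. For large negative inputs the gradient approaches $\beta - \alpha_n$, which is negative whenever $\alpha_n > \beta$.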

Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the Exponential Linear Unit (ELU). xIELU combines two key properties for the gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU). Conceptually, xIELU can be viewed as an extension of ReLU$^2$ to handle negative inputs. The trainable parameters in xIELU allow it to adaptively reduce its nonlinearity for higher-level representations deeper in the network. In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to popular activation functions like ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count. A reference implementation is available at https://github.com/Anonymous5823/xielu.
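A minimal Python sketch of an activation with the two gradient properties described above, assuming the piecewise form obtained by integrating an affine transformation of unit-scale ELU. The parameter names `alpha_p`, `alpha_n`, `beta` and their default values are hypothetical and chosen only for illustration. In the paper these quantities are trainable, whereas here they are plain floats. The reference implementation may parametrize them differently.

```python
import math


def xielu(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Sketch of an xIELU-style activation (assumed form, scalar input).

    Positive inputs: quadratic term plus linear term, like a shifted ReLU^2.
    Negative inputs: integral of an affine transformation of ELU.
    """
    if x > 0:
        return alpha_p * x * x + beta * x
    # expm1(x) computes exp(x) - 1 accurately for small |x|
    return alpha_n * (math.expm1(x) - x) + beta * x


def xielu_grad(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Analytic gradient of the sketch above."""
    if x > 0:
        # Linearly increasing gradient for positive inputs
        return 2.0 * alpha_p * x + beta
    # Approaches beta - alpha_n as x -> -inf, so it can be negative
    return alpha_n * math.expm1(x) + beta


def numeric_grad(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the gradient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-3.0, -1.0, -0.1, 0.5, 2.0):
        print(x, xielu(x), xielu_grad(x), numeric_grad(xielu, x))
```

With these illustrative values, `alpha_n > beta`, so the gradient becomes negative for sufficiently negative inputs, while for positive inputs it grows linearly with slope `2 * alpha_p`.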

Code Implementations: 1 repo
