Moderate Adaptive Linear Units (MoLU)
MoLU addresses the need for efficient and robust activation functions across deep learning applications such as LLMs and CNNs, although the contribution appears incremental because it builds on existing activation function designs.
The paper proposes MoLU, an activation function for deep neural networks defined as f(x) = x × (1 + tanh(x))/2, and reports faster convergence and improved accuracy compared with GeLU, SiLU, and Mish.
We propose the Moderate Adaptive Linear Unit (MoLU), a novel activation function for deep neural networks, defined analytically as f(x) = x × (1 + tanh(x))/2. MoLU combines mathematical elegance with empirical effectiveness, exhibiting superior performance in terms of prediction accuracy, convergence speed, and computational efficiency. Because it is analytic, and therefore infinitely differentiable (C-infinity smooth), MoLU is expected to mitigate issues such as vanishing or exploding gradients, making it suitable for a broad range of architectures and applications, including large language models (LLMs), Neural Ordinary Differential Equations (Neural ODEs), Physics-Informed Neural Networks (PINNs), and Convolutional Neural Networks (CNNs). Empirical evaluations show that MoLU consistently achieves faster convergence and improved final accuracy relative to widely used activation functions such as GeLU, SiLU, and Mish. These properties position MoLU as a promising and robust candidate for general-purpose activation across diverse deep learning paradigms.
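To make the definition concrete, the following is a minimal scalar sketch in plain Python, not the authors' implementation. The function names (molu, molu_grad, molu_via_sigmoid) are hypothetical and chosen only for illustration. The derivative is obtained from the stated formula by the product rule, and the sigmoid form relies on the identity (1 + tanh(x))/2 = sigmoid(2x), which follows from the definition rather than from the source text.

```python
import math


def molu(x: float) -> float:
    """MoLU as defined in the text: f(x) = x * (1 + tanh(x)) / 2."""
    return x * (1.0 + math.tanh(x)) / 2.0


def molu_grad(x: float) -> float:
    """Derivative by the product rule:
    f'(x) = (1 + tanh(x)) / 2 + x * (1 - tanh(x)**2) / 2.
    """
    t = math.tanh(x)
    return (1.0 + t) / 2.0 + x * (1.0 - t * t) / 2.0


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def molu_via_sigmoid(x: float) -> float:
    """Equivalent form using (1 + tanh(x)) / 2 == sigmoid(2x)."""
    return x * sigmoid(2.0 * x)


if __name__ == "__main__":
    # Print the activation, its gradient, and the sigmoid-based form
    # side by side for a few sample inputs.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, molu(x), molu_grad(x), molu_via_sigmoid(x))
```

Written this way, MoLU has the form x · sigmoid(2x), that is, a sigmoid-gated linear unit with a fixed slope of 2 inside the gate. For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero.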