LG, AI, CV, NE · Aug 21, 2021

SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function

arXiv:2108.09598v3 · 31 citations
Originality: Incremental advance
AI Analysis

This work addresses training dynamics and performance issues for deep learning practitioners, offering an incremental improvement over existing activation functions like Swish and Mish.

The authors address the Dying ReLU problem and related training issues in deep neural networks by proposing a new activation function called Serf. In their experiments, Serf outperformed ReLU, Swish, and Mish across computer vision and NLP tasks, with the largest margins on deeper architectures.

Activation functions play a pivotal role in determining training dynamics and neural network performance. The widely adopted activation function ReLU, despite being simple and effective, has a few disadvantages, including the Dying ReLU problem. To tackle such problems, we propose a novel activation function called Serf, which is self-regularized and nonmonotonic in nature. Like Mish, Serf also belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multimodal entailment) tasks with different state-of-the-art architectures, it is observed that Serf vastly outperforms ReLU (baseline) and other activation functions, including both Swish and Mish, with a markedly bigger margin on deeper architectures. Ablation studies further demonstrate that Serf-based architectures perform better than those based on Swish and Mish in varying scenarios, validating the effectiveness and compatibility of Serf with varying depth, complexity, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, thereby showing the impact of the preconditioner function ingrained in the first derivative of Serf, which provides a regularization effect, making gradients smoother and optimization faster.
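The abstract does not spell out the formula, so the following is a minimal sketch that assumes the definition Serf(x) = x · erf(ln(1 + e^x)), which is what the title's "log-Softplus ERror" naming suggests. It also writes the first derivative as p(x) · Swish(x) + Serf(x)/x, with p(x) = (2/√π) · exp(−softplus(x)²) standing in for the "preconditioner" the abstract mentions. That form follows from the chain rule under the assumed definition. The helper names (`softplus`, `sigmoid`, `serf_grad`) are illustrative and not taken from the authors' code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def serf(x: float) -> float:
    """Serf under the assumed definition: x * erf(softplus(x))."""
    return x * math.erf(softplus(x))


def serf_grad(x: float) -> float:
    """First derivative of the assumed Serf, written via Swish.

    d/dx [x * erf(sp)] = erf(sp) + x * (2/sqrt(pi)) * exp(-sp^2) * sigmoid(x)
                       = p(x) * swish(x) + erf(sp),
    where erf(sp) equals serf(x) / x for x != 0.
    """
    sp = softplus(x)
    p = 2.0 / math.sqrt(math.pi) * math.exp(-sp * sp)  # preconditioner term
    return p * swish(x) + math.erf(sp)


if __name__ == "__main__":
    # Compare the three Swish-family functions at a few sample points.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  "
              f"mish={mish(x):+.4f}  serf={serf(x):+.4f}")
```

Under this definition the function dips below zero for negative inputs and then returns toward zero as x decreases further, which is the nonmonotonic behaviour the abstract describes. For large positive x it approaches the identity, as ReLU does.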

