LG, AI | Sep 26, 2025

Stochastic activations

arXiv:2509.22358v1 | h-index: 48
Originality: Incremental advance
AI Analysis

This paper addresses optimization and inference-efficiency issues in large language models and is aimed at researchers and practitioners. It appears incremental, since it builds on existing activation functions.

The paper tackles an optimization problem with RELU: its output is constant for negative inputs, which blocks gradient flow. It introduces stochastic activations that randomly select between SILU and RELU during pre-training, then fine-tunes the model with RELU, which is the activation used at inference. The resulting sparse latent vectors reduce inference FLOPs and yield a significant speedup on CPU. Used at generation time, stochastic activations perform reasonably well, only slightly below the best deterministic setup (SILU combined with temperature scaling).
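The random selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function name `stochastic_activation`, the parameter `p_relu`, and the choice of one Bernoulli draw per call (rather than per token or per unit) are all assumptions made for the example.

```python
import math
import random


def relu(x: float) -> float:
    # RELU: zero output (and zero gradient) for negative inputs.
    return x if x > 0.0 else 0.0


def silu(x: float) -> float:
    # SILU: x * sigmoid(x), written in a form that avoids overflow
    # for large negative x.
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def stochastic_activation(xs, p_relu=0.5, training=True, rng=random):
    """Apply RELU or SILU to every element of xs.

    Assumption: a single Bernoulli(p_relu) draw per call picks the
    function during training; outside training RELU is always used.
    """
    if not training:
        return [relu(x) for x in xs]
    use_relu = rng.random() < p_relu  # Bernoulli draw
    f = relu if use_relu else silu
    return [f(x) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [-2.0, -0.5, 0.0, 1.5]
    for step in range(3):
        print(step, stochastic_activation(hidden, p_relu=0.5, rng=rng))
    print("inference", stochastic_activation(hidden, training=False))
```

Whenever the draw picks SILU, negative pre-activations still receive a non-zero gradient, which is the property the summary credits for avoiding the RELU optimization problem.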

We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
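To illustrate why sparse latent vectors reduce inference FLOPs, the sketch below counts the multiplications in the second projection of a feed-forward layer when zeroed hidden units are skipped. It assumes a plain two-matrix feed-forward layer without gating, and the names `ffn_dense`, `ffn_sparse`, `W_in`, and `W_out` are hypothetical; the abstract does not specify the layer layout or how the CPU speedup is implemented.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ffn_dense(W_in, W_out, x):
    """Dense feed-forward pass: every hidden unit is multiplied."""
    h = [relu(v) for v in matvec(W_in, x)]
    y = matvec(W_out, h)
    mults = len(W_out) * len(h)  # multiplications in the second projection
    return y, mults


def ffn_sparse(W_in, W_out, x):
    """Same result, but skips hidden units that RELU set to zero."""
    h = [relu(v) for v in matvec(W_in, x)]
    active = [j for j, v in enumerate(h) if v != 0.0]
    y = [sum(row[j] * h[j] for j in active) for row in W_out]
    mults = len(W_out) * len(active)
    return y, mults


if __name__ == "__main__":
    # Toy sizes: 2 inputs, 4 hidden units, 2 outputs (illustrative values).
    W_in = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    W_out = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
    x = [1.0, 2.0]
    print(ffn_dense(W_in, W_out, x))
    print(ffn_sparse(W_in, W_out, x))
```

Both functions return the same output vector; only the multiplication count differs, and the saving grows with the fraction of hidden units that RELU zeroes out. A smooth activation such as SILU rarely produces exact zeros, so it offers no comparable shortcut.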

