Naoki Sato

LG
h-index3
7papers
38citations
Novelty42%
AI Score40

7 Papers

LGNov 15, 2023
Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization

Naoki Sato, Hideaki Iiduka

The graduated optimization approach is a method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. This paper makes three contributions regarding graduated optimization. First, we extend the definition of function smoothing that is traditionally achieved through convolution with Gaussian noise and characterize for the first time function smoothing with heavy-tailed noise. Second, we show that light- or heavy-tailed stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and the moment of the stochastic noise. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size. Third, we relax the $σ$-nice property, a standard but restrictive condition in the analysis of graduated optimization. Our refinement enables convergence guarantees for a broader class of non-convex functions, thereby bridging the gap between theoretical assumptions and practical optimization landscapes.

LGFeb 3
Lipschitz Multiscale Deep Equilibrium Models: A Theoretically Guaranteed and Accelerated Approach

Naoki Sato, Hideaki Iiduka

Deep equilibrium models (DEQs) achieve infinitely deep network representations without stacking layers by exploring fixed points of layer transformations in neural networks. Such models constitute an innovative approach that achieves performance comparable to state-of-the-art methods in many large-scale numerical experiments, despite requiring significantly less memory. However, DEQs face the challenge of requiring vastly more computational time for training and inference than conventional methods, as they repeatedly perform fixed-point iterations with no convergence guarantee upon each input. Therefore, this study explored an approach to improve fixed-point convergence and consequently reduce computational time by restructuring the model architecture to guarantee fixed-point convergence. Our proposed approach for image classification, Lipschitz multiscale DEQ, has theoretically guaranteed fixed-point convergence for both forward and backward passes by hyperparameter adjustment, achieving up to a 4.75$\times$ speed-up in numerical experiments on CIFAR-10 at the cost of a minor drop in accuracy.

LGJul 2, 2025
Convergence Bound and Critical Batch Size of Muon Optimizer

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling task.

LGDec 16, 2024
Explicit and Implicit Graduated Optimization in Deep Neural Networks

Naoki Sato, Hideaki Iiduka

Graduated optimization is a global optimization technique that is used to minimize a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the performance of the explicit graduated optimization algorithm with an optimal noise scheduling derived from a previous study and discusses its limitations. It uses traditional benchmark functions and empirical loss functions for modern neural network architectures for evaluating. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum, analyzes its convergence, and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures.

LGDec 16, 2024
Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks

Naoki Sato, Koshiro Izumi, Hideaki Iiduka

A scaled conjugate gradient method that accelerates existing adaptive methods utilizing stochastic gradients is proposed for solving nonconvex optimization problems with deep neural networks. It is shown theoretically that, whether with constant or diminishing learning rates, the proposed method can obtain a stationary point of the problem. Additionally, its rate of convergence with diminishing learning rates is verified to be superior to that of the conjugate gradient method. The proposed method is shown to minimize training loss functions faster than the existing adaptive methods in practical applications of image and text classification. Furthermore, in the training of generative adversarial networks, one version of the proposed method achieved the lowest Frechet inception distance score among those of the adaptive methods.

LGFeb 4, 2024
Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent

Naoki Sato, Hideaki Iiduka

For nonconvex objective functions, including those found in training deep neural networks, stochastic gradient descent (SGD) with momentum is said to converge faster and have better generalizability than SGD without momentum. In particular, adding momentum is thought to reduce stochastic noise. To verify this, we estimated the magnitude of gradient noise by using convergence analysis and an optimal batch size estimation formula and found that momentum does not reduce gradient noise. We also analyzed the effect of search direction noise, which is stochastic noise defined as the error between the search direction of the optimizer and the steepest descent direction, and found that it inherently smooths the objective function and that momentum does not reduce search direction noise either. Finally, an analysis of the degree of smoothing introduced by search direction noise revealed that adding momentum offers limited advantage to SGD.

LGJan 28, 2022
Existence and Estimation of Critical Batch Size for Training Generative Adversarial Networks with Two Time-Scale Update Rule

Naoki Sato, Hideaki Iiduka

Previous results have shown that a two time-scale update rule (TTUR) using different learning rates, such as different constant rates or different decaying rates, is useful for training generative adversarial networks (GANs) in theory and in practice. Moreover, not only the learning rate but also the batch size is important for training GANs with TTURs and they both affect the number of steps needed for training. This paper studies the relationship between batch size and the number of steps needed for training GANs with TTURs based on constant learning rates. We theoretically show that, for a TTUR with constant learning rates, the number of steps needed to find stationary points of the loss functions of both the discriminator and generator decreases as the batch size increases and that there exists a critical batch size minimizing the stochastic first-order oracle (SFO) complexity. Then, we use the Fr'echet inception distance (FID) as the performance measure for training and provide numerical results indicating that the number of steps needed to achieve a low FID score decreases as the batch size increases and that the SFO complexity increases once the batch size exceeds the measured critical batch size. Moreover, we show that measured critical batch sizes are close to the sizes estimated from our theoretical results.