Hans De Sterck

NA
h-index27
17papers
262citations
Novelty51%
AI Score43

17 Papers

SYJun 4, 2022
Neural Lyapunov Control of Unknown Nonlinear Systems with Stability Guarantees

Ruikun Zhou, Thanin Quartz, Hans De Sterck et al.

Learning for control of dynamical systems with formal guarantees remains a challenging task. This paper proposes a learning framework to simultaneously stabilize an unknown nonlinear system with a neural controller and learn a neural Lyapunov function to certify a region of attraction (ROA) for the closed-loop system. The algorithmic structure consists of two neural networks and a satisfiability modulo theories (SMT) solver. The first neural network is responsible for learning the unknown dynamics. The second neural network aims to identify a valid Lyapunov function and a provably stabilizing nonlinear controller. The SMT solver then verifies that the candidate Lyapunov function indeed satisfies the Lyapunov conditions. We provide theoretical guarantees of the proposed learning framework in terms of the closed-loop stability for the unknown nonlinear system. We illustrate the effectiveness of the approach with a set of numerical experiments.

CLOct 18, 2023Code
Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

Yanming Kang, Giang Tran, Hans De Sterck

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by the Fast Multipole Method from n-body physics. FMA reduces the time and memory complexity of self-attention from $\mathcal{O}\left(n^2\right)$ to $\mathcal{O}(n \log n)$ and $\mathcal{O}(n)$ while preserving full-context interactions. FMA contains a learned hierarchy with $\mathcal{O}(\log n)$ levels of resolution. In this hierarchy, nearby tokens interact at full resolution, while distant tokens engage through progressively coarser, learned basis functions. We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively. On autoregressive and bidirectional language modeling benchmarks, the 1D variant either matches or outperforms leading efficient attention baselines with substantially lower memory use. With linear complexity, the 2D variant demonstrates superior performance over strong vision transformer baselines in classification and semantic segmentation tasks. Our results confirm that the multilevel attention implemented by FMA allows Transformer-based models to scale to much longer sequences and higher-resolution inputs without loss in accuracy. This provides a principled, physics-inspired approach for developing scalable neural networks suitable for language, vision, and multimodal tasks. Our code will be available at https://github.com/epoch98/FMA.

NANov 25, 2011
An adaptive algebraic multigrid algorithm for low-rank canonical tensor decomposition

Hans De Sterck, Killian Miller

This paper presents a multigrid algorithm for the computation of the rank-R canonical decomposition of a tensor for low rank R. Standard alternating least squares (ALS) is used as the relaxation method. Transfer operators and coarse-level tensors are constructed in an adaptive setup phase based on multiplicative correction and on Bootstrap algebraic multigrid. An accurate solution is then computed by an additive solve phase based on the Full Approximation Scheme. Numerical tests show that for certain test problems the multilevel method significantly outperforms standalone ALS when a high level of accuracy is required.

NAJun 27, 2018
Nonlinearly Preconditioned L-BFGS as an Acceleration Mechanism for Alternating Least Squares, with Application to Tensor Decomposition

Hans De Sterck, Alexander J. M. Howse

We derive nonlinear acceleration methods based on the limited memory BFGS (L-BFGS) update formula for accelerating iterative optimization methods of alternating least squares (ALS) type applied to canonical polyadic (CP) and Tucker tensor decompositions. Our approach starts from linear preconditioning ideas that use linear transformations encoded by matrix multiplications, and extends these ideas to the case of genuinely nonlinear preconditioning, where the preconditioning operation involves fully nonlinear transformations. As such, the ALS-type iterations are used as fully nonlinear preconditioners for L-BFGS, or, equivalently, L-BFGS is used as a nonlinear accelerator for ALS. Numerical results show that the resulting methods perform much better than either stand-alone L-BFGS or stand-alone ALS, offering substantial improvements in terms of time-to-solution and robustness over state-of-the-art methods for large and noisy tensor problems, including previously described acceleration methods based on nonlinear conjugate gradients and nonlinear GMRES. Our approach provides a general L-BFGS-based acceleration mechanism for nonlinear optimization.

7.8NAApr 21
Stable Mesh-Free Variational Radial Basis Function Approximation for Elliptic PDEs and Obstacle Problems

Tan Phuong Dong Le, Giang Tran, Hans De Sterck

We present a comprehensive study of radial basis function (RBF) approximations for elliptic and obstacle-type boundary value problems under a variational formulation. Our focus is on practical accuracy, robustness and efficiency. To address ill-conditioning in dense systems, we apply truncated singular value decomposition (TSVD) and investigate its effect on stability and accuracy trade-offs. Numerical experiments report benchmarks on accuracy and show fast error decay. We investigate the trade-off between approximation and truncation errors for practical settings for the number of basis functions, the oversampling ratio and the truncation threshold. In comparison with other methods, RBF variational solvers deliver high accuracy at similar or lower cost for boundary value problems.

SYSep 12, 2024
Stochastic Reinforcement Learning with Stability Guarantees for Control of Unknown Nonlinear Systems

Thanin Quartz, Ruikun Zhou, Hans De Sterck et al.

Designing a stabilizing controller for nonlinear systems is a challenging task, especially for high-dimensional problems with unknown dynamics. Traditional reinforcement learning algorithms applied to stabilization tasks tend to drive the system close to the equilibrium point. However, these approaches often fall short of achieving true stabilization and result in persistent oscillations around the equilibrium point. In this work, we propose a reinforcement learning algorithm that stabilizes the system by learning a local linear representation ofthe dynamics. The main component of the algorithm is integrating the learned gain matrix directly into the neural policy. We demonstrate the effectiveness of our algorithm on several challenging high-dimensional dynamical systems. In these simulations, our algorithm outperforms popular reinforcement learning algorithms, such as soft actor-critic (SAC) and proximal policy optimization (PPO), and successfully stabilizes the system. To support the numerical results, we provide a theoretical analysis of the feasibility of the learned algorithm for both deterministic and stochastic reinforcement learning settings, along with a convergence analysis of the proposed learning algorithm. Furthermore, we verify that the learned control policies indeed provide asymptotic stability for the nonlinear systems.

LGSep 30, 2022
Downlink Compression Improves TopK Sparsification

William Zou, Hans De Sterck, Jun Liu

Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) direction, the currently accepted belief says that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not only does it reduce the amount of communication per step, but also, counter-intuitively, can improve the upper bound in the convergence analysis. To show this, we revisit non-convex convergence analysis of topK stochastic gradient descent (SGD) and extend it from the unidirectional to the bidirectional setting. We also remove a restriction of the previous analysis that requires unrealistically large values of K. We experimentally evaluate bidirectional topK SGD against unidirectional topK SGD and show that models trained with bidirectional topK SGD will perform as well as models trained with unidirectional topK SGD while yielding significant communication benefits for large numbers of workers.

OCOct 13, 2018Code
Nesterov Acceleration of Alternating Least Squares for Canonical Tensor Decomposition: Momentum Step Size Selection and Restart Mechanisms

Drew Mitchell, Nan Ye, Hans De Sterck

We present Nesterov-type acceleration techniques for Alternating Least Squares (ALS) methods applied to canonical tensor decomposition. While Nesterov acceleration turns gradient descent into an optimal first-order method for convex problems by adding a momentum term with a specific weight sequence, a direct application of this method and weight sequence to ALS results in erratic convergence behaviour. This is so because the tensor decomposition problem is non-convex and ALS is accelerated instead of gradient descent. Instead, we consider various restart mechanisms and suitable choices of momentum weights that enable effective acceleration. Our extensive empirical results show that the Nesterov-accelerated ALS methods with restart can be dramatically more efficient than the stand-alone ALS or Nesterov accelerated gradient methods, when problems are ill-conditioned or accurate solutions are desired. The resulting methods perform competitively with or superior to existing acceleration methods for ALS, including ALS acceleration by NCG, NGMRES, or LBFGS, and additionally enjoy the benefit of being much easier to implement. We also compare with Nesterov-type updates where the momentum weight is determined by a line search, which are equivalent or closely related to existing line search methods for ALS. On a large and ill-conditioned 71$\times$1000$\times$900 tensor consisting of readings from chemical sensors to track hazardous gases, the restarted Nesterov-ALS method shows desirable robustness properties and outperforms any of the existing methods by a large factor. There is clear potential for extending our Nesterov-type acceleration approach to accelerating other optimization algorithms than ALS applied to other non-convex problems, such as Tucker tensor decomposition. Our Matlab code is available at https://github.com/hansdesterck/nonlinear-preconditioning-for-optimization.

LGJun 30, 2024
Sum-of-norms regularized Nonnegative Matrix Factorization

Andersen Ang, Waqas Bin Hamed, Hans De Sterck

When applying nonnegative matrix factorization (NMF), the rank parameter is generally unknown. This rank, called the nonnegative rank, is usually estimated heuristically since computing its exact value is NP-hard. In this work, we propose an approximation method to estimate the rank on-the-fly while solving NMF. We use the sum-of-norm (SON), a group-lasso structure that encourages pairwise similarity, to reduce the rank of a factor matrix when the initial rank is overestimated. On various datasets, SON-NMF can reveal the correct nonnegative rank of the data without prior knowledge or parameter tuning. SON-NMF is a nonconvex, nonsmooth, non-separable, and non-proximable problem, making it nontrivial to solve. First, since rank estimation in NMF is NP-hard, the proposed approach does not benefit from lower computational complexity. Using a graph-theoretic argument, we prove that the complexity of SON-NMF is essentially irreducible. Second, the per-iteration cost of algorithms for SON-NMF can be high. This motivates us to propose a first-order BCD algorithm that approximately solves SON-NMF with low per-iteration cost via the proximal average operator. SON-NMF exhibits favorable features for applications. Besides the ability to automatically estimate the rank from data, SON-NMF can handle rank-deficient data matrices and detect weak components with small energy. Furthermore, in hyperspectral imaging, SON-NMF naturally addresses the issue of spectral variability.

LGApr 3, 2024
First-order PDES for Graph Neural Networks: Advection And Burgers Equation Models

Yifan Qu, Oliver Krzysik, Hans De Sterck et al.

Graph Neural Networks (GNNs) have established themselves as the preferred methodology in a multitude of domains, ranging from computer vision to computational biology, especially in contexts where data inherently conform to graph structures. While many existing methods have endeavored to model GNNs using various techniques, a prevalent challenge they grapple with is the issue of over-smoothing. This paper presents new Graph Neural Network models that incorporate two first-order Partial Differential Equations (PDEs). These models do not increase complexity but effectively mitigate the over-smoothing problem. Our experimental findings highlight the capacity of our new PDE model to achieve comparable results with higher-order PDE models and fix the over-smoothing problem up to 64 layers. These results underscore the adaptability and versatility of GNNs, indicating that unconventional approaches can yield outcomes on par with established techniques.

NASep 29, 2021
Anderson Acceleration as a Krylov Method with Application to Asymptotic Convergence Analysis

Hans De Sterck, Yunhui He, Oliver A. Krzysik

Anderson acceleration (AA) is widely used for accelerating the convergence of nonlinear fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in \mathbb{R}^n$, but little is known about how to quantify the convergence acceleration provided by AA. As a roadway towards gaining more understanding of convergence acceleration by AA, we study AA($m$), i.e., Anderson acceleration with finite window size $m$, applied to the case of linear fixed-point iterations $x_{k+1}=M x_{k}+b$. We write AA($m$) as a Krylov method with polynomial residual update formulas, and derive recurrence relations for the AA($m$) polynomials. Writing AA($m$) as a Krylov method immediately implies that $k$ iterations of AA($m$) cannot produce a smaller residual than $k$ iterations of GMRES without restart (but without implying anything about the relative convergence speed of (windowed) AA($m$) versus restarted GMRES($m$)). We find that the AA($m$) residual polynomials observe a periodic memory effect where increasing powers of the error iteration matrix $M$ act on the initial residual as the iteration number increases. We derive several further results based on these polynomial residual update formulas, including orthogonality relations, a lower bound on the AA(1) acceleration coefficient $β_k$, and explicit nonlinear recursions for the AA(1) residuals and residual polynomials that do not include the acceleration coefficient $β_k$. Using these recurrence relations we also derive new residual convergence bounds for AA(1) in the linear case, demonstrating how the per-iteration residual reduction $||r_{k+1}||/||r_{k}||$ depends strongly on the residual reduction in the previous iteration and on the angle between the prior residual vectors $r_k$ and $r_{k-1}$. We apply these results to study the influence of the initial guess on the asymptotic convergence factor of AA(1), and to study AA(1) residual convergence patterns.

OCSep 29, 2021
Linear Asymptotic Convergence of Anderson Acceleration: Fixed-Point Analysis

Hans De Sterck, Yunhui He

We study the asymptotic convergence of AA($m$), i.e., Anderson acceleration with window size $m$ for accelerating fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in R^n$. Convergence acceleration by AA($m$) has been widely observed but is not well understood. We consider the case where the fixed-point iteration function $q(x)$ is differentiable and the convergence of the fixed-point method itself is root-linear. We identify numerically several conspicuous properties of AA($m$) convergence: First, AA($m$) sequences $\{x_k\}$ converge root-linearly but the root-linear convergence factor depends strongly on the initial condition. Second, the AA($m$) acceleration coefficients $β^{(k)}$ do not converge but oscillate as $\{x_k\}$ converges to $x^*$. To shed light on these observations, we write the AA($m$) iteration as an augmented fixed-point iteration $z_{k+1} =Ψ(z_k)$, $z_k \in R^{n(m+1)}$ and analyze the continuity and differentiability properties of $Ψ(z)$ and $β(z)$. We find that the vector of acceleration coefficients $β(z)$ is not continuous at the fixed point $z^*$. However, we show that, despite the discontinuity of $β(z)$, the iteration function $Ψ(z)$ is Lipschitz continuous and directionally differentiable at $z^*$ for AA(1), and we generalize this to AA($m$) with $m>1$ for most cases. Furthermore, we find that $Ψ(z)$ is not differentiable at $z^*$. We then discuss how these theoretical findings relate to the observed convergence behaviour of AA($m$). The discontinuity of $β(z)$ at $z^*$ allows $β^{(k)}$ to oscillate as $\{x_k\}$ converges to $x^*$, and the non-differentiability of $Ψ(z)$ allows AA($m$) sequences to converge with root-linear convergence factors that strongly depend on the initial condition. Additional numerical results illustrate our findings.

LGOct 22, 2020
N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations

Aaron Baier-Reinio, Hans De Sterck

We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver. Our goal in proposing the N-ODE Transformer is to investigate whether its depth-adaptivity may aid in overcoming some specific known theoretical limitations of the Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations that can only be overcome by using a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not provide a remedy for the inherently nonlocal nature of the parity problem, and provide explanations for why this is so. Next, we pursue regularization of the N-ODE Transformer by penalizing the arclength of the ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. We suggest future avenues of research for modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency for sequence modelling tasks such as neural machine translation.

OCJul 6, 2020
On the Asymptotic Linear Convergence Speed of Anderson Acceleration Applied to ADMM

Dawei Wang, Yunhui He, Hans De Sterck

Empirical results show that Anderson acceleration (AA) can be a powerful mechanism to improve the asymptotic linear convergence speed of the Alternating Direction Method of Multipliers (ADMM) when ADMM by itself converges linearly. However, theoretical results to quantify this improvement do not exist yet. In this paper we explain and quantify this improvement in linear asymptotic convergence speed for the special case of a stationary version of AA applied to ADMM. We do so by considering the spectral properties of the Jacobians of ADMM and the stationary version of AA evaluated at the fixed point, where the coefficients of the stationary AA method are computed such that its asymptotic linear convergence factor is optimal. The optimal linear convergence factors of this stationary AA-ADMM method are computed analytically or by optimization, based on previous work on optimal stationary AA acceleration. Using this spectral picture and those analytical results, our approach provides new insight into how and by how much the stationary AA method can improve the asymptotic linear convergence factor of ADMM. Numerical results also indicate that the optimal linear convergence factor of the stationary AA methods gives a useful estimate for the asymptotic linear convergence speed of the non-stationary AA method that is used in practice.

OCJul 4, 2020
On the Asymptotic Linear Convergence Speed of Anderson Acceleration, Nesterov Acceleration, and Nonlinear GMRES

Hans De Sterck, Yunhui He

We consider nonlinear convergence acceleration methods for fixed-point iteration $x_{k+1}=q(x_k)$, including Anderson acceleration (AA), nonlinear GMRES (NGMRES), and Nesterov-type acceleration (corresponding to AA with window size one). We focus on fixed-point methods that converge asymptotically linearly with convergence factor $ρ<1$ and that solve an underlying fully smooth and non-convex optimization problem. It is often observed that AA and NGMRES substantially improve the asymptotic convergence behavior of the fixed-point iteration, but this improvement has not been quantified theoretically. We investigate this problem under simplified conditions. First, we consider stationary versions of AA and NGMRES, and determine coefficients that result in optimal asymptotic convergence factors, given knowledge of the spectrum of $q'(x)$ at the fixed point $x^*$. This allows us to understand and quantify the asymptotic convergence improvement that can be provided by nonlinear convergence acceleration, viewing $x_{k+1}=q(x_k)$ as a nonlinear preconditioner for AA and NGMRES. Second, for the case of infinite window size, we consider linear asymptotic convergence bounds for GMRES applied to the fixed-point iteration linearized about $x^*$. Since AA and NGMRES are equivalent to GMRES in the linear case, one may expect the GMRES convergence factors to be relevant for AA and NGMRES as $x_k \rightarrow x^*$. Our results are illustrated numerically for a class of test problems from canonical tensor decomposition, comparing steepest descent and alternating least squares (ALS) as the fixed-point iterations that are accelerated by AA and NGMRES. Our numerical tests show that both approaches allow us to estimate asymptotic convergence speed for nonstationary AA and NGMRES with finite window size.

AIJan 22, 2016
GeoTextTagger: High-Precision Location Tagging of Textual Documents using a Natural Language Processing Approach

Shawn Brunsting, Hans De Sterck, Remco Dolman et al.

Location tagging, also known as geotagging or geolocation, is the process of assigning geographical coordinates to input data. In this paper we present an algorithm for location tagging of textual documents. Our approach makes use of previous work in natural language processing by using a state-of-the-art part-of-speech tagger and named entity recognizer to find blocks of text which may refer to locations. A knowledge base (OpenStreatMap) is then used to find a list of possible locations for each block. Finally, one location is chosen for each block by assigning distance-based scores to each location and repeatedly selecting the location and block with the best score. We tested our geolocation algorithm with Wikipedia articles about topics with a well-defined geographical location that are geotagged by the articles' authors, where classification approaches have achieved median errors as low as 11 km, with attainable accuracy limited by the class size. Our approach achieved a 10th percentile error of 490 metres and median error of 54 kilometres on the Wikipedia dataset we used. When considering the five location tags with the greatest scores, 50% of articles were assigned at least one tag within 8.5 kilometres of the article's author-assigned true location. We also tested our approach on Twitter messages that are tagged with the location from which the message was sent. Twitter texts are challenging because they are short and unstructured and often do not contain words referring to the location they were sent from, but we obtain potentially useful results. We explain how we use the Spark framework for data analytics to collect and process our test data. In general, classification-based approaches for location tagging may be reaching their upper accuracy limit, but our precision-focused approach has high accuracy for some texts and shows significant potential for improvement overall.

NAAug 13, 2015
Algorithmic Acceleration of Parallel ALS for Collaborative Filtering: Speeding up Distributed Big Data Recommendation in Spark

Manda Winlaw, Michael B. Hynes, Anthony Caterini et al.

Collaborative filtering algorithms are important building blocks in many practical recommendation systems. For example, many large-scale data processing environments include collaborative filtering models for which the Alternating Least Squares (ALS) algorithm is used to compute latent factor matrix decompositions. In this paper, we propose an approach to accelerate the convergence of parallel ALS-based optimization methods for collaborative filtering using a nonlinear conjugate gradient (NCG) wrapper around the ALS iterations. We also provide a parallel implementation of the accelerated ALS-NCG algorithm in the Apache Spark distributed data processing environment, and an efficient line search technique as part of the ALS-NCG implementation that requires only one pass over the data on distributed datasets. In serial numerical experiments on a linux workstation and parallel numerical experiments on a 16 node cluster with 256 computing cores, we demonstrate that the combined ALS-NCG method requires many fewer iterations and less time than standalone ALS to reach movie rankings with high accuracy on the MovieLens 20M dataset. In parallel, ALS-NCG can achieve an acceleration factor of 4 or greater in clock time when an accurate solution is desired; furthermore, the acceleration factor increases as greater numerical precision is required in the solution. In addition, the NCG acceleration mechanism is efficient in parallel and scales linearly with problem size on synthetic datasets with up to nearly 1 billion ratings. The acceleration mechanism is general and may also be applicable to other optimization methods for collaborative filtering.