Thomas Chen

LG
h-index72
17papers
170citations
Novelty41%
AI Score44

17 Papers

75.6LGJun 1
A Theoretical Framework for Self-Play Theorem Proving Algorithms

Thomas Chen, Zhiyuan Li

Self-play, a type of training algorithm that enables a model to self-improve, has recently shown promising empirical results in the context of formal theorem proving using Large Language Models (LLMs). (Dong & Ma, 2025) instantiate self-play with two cooperating agents: a prover, which proves theorems, and a conjecturer, which generates new theorems as a curriculum to the prover. In this paper, we provide a theoretical framework for understanding the self-improvement capabilities of self-play algorithms for theorem proving. First, we formalize the set of theorems as a graph, with nodes as theorems and edges between pairs of theorems with similar semantics. We introduce a set of primitive assumptions that characterize the guarantees of a trained prover and how a conjecturer can access the structure of the graph. Second, we show that if the underlying graph of theorems is well-connected, then a prover-conjecturer system, where the conjecturing algorithm is based on a reversible random walk, is sufficient to grow the set of proved theorems exponentially. Third, motivated by an issue encountered empirically by self-play algorithms, where the conjecturer tends to generate artificially complex and non-fundamental theorems, we propose a diversity measure for a training distribution of theorems generated by a conjecturer and an improved conjecturing algorithm that locally maximizes this diversity measure, by computing the diffusion similarity between neighboring theorems in the theorem graph. Finally, we describe a method to compute the diffusion similarity by using contrastive learning to embed nodes into Euclidean space and then computing the inner-product between embeddings.

46.8NAApr 15
Bivariate range functions with superior convergence order

Bingwei Zhang, Thomas Chen, Kai Hormann et al.

Range functions are a fundamental tool for certified computations in geometric modeling, computer graphics, and robotics, but traditional range functions have only quadratic convergence order ($m=2$). For ``superior'' convergence order (i.e., $m>2$), we exploit the Cornelius--Lohner framework in order to introduce new bivariate range functions based on Taylor, Lagrange, and Hermite interpolation. In particular, we focus on practical range functions with cubic and quartic convergence order. We implemented them in Julia and provide experimental validation of their performance in terms of efficiency and efficacy.

LGSep 19, 2023
Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers

Thomas Chen, Patricia Muñoz Ewald

In this paper, we explicitly determine local and global minimizers of the $\mathcal{L}^2$ cost function in underparametrized Deep Learning (DL) networks; our main goal is to shed light on their geometric structure and properties. We accomplish this by a direct construction, without invoking the gradient descent flow at any point of this work. We specifically consider $L$ hidden layers, a ReLU ramp activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input and output spaces $\mathbb{R}^Q$ with equal dimension $Q\geq1$, and hidden layers also defined on $\mathbb{R}^{Q}$; the training inputs are assumed to be sufficiently clustered. The training input size $N$ can be arbitrarily large - thus, we are considering the underparametrized regime. More general settings are left to future work. We construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. In the context presented here, the concatenation of hidden layers of the DL network is reinterpreted as a recursive application of a {\em truncation map} which "curates" the training inputs by minimizing their noise to signal ratio.

LGSep 19, 2023
Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

Thomas Chen, Patrícia Muñoz Ewald

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(δ_P)$ where $δ_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(δ_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

LGAug 2, 2024
Gradient flow in parameter space is equivalent to linear interpolation in output space

Thomas Chen, Patrícia Muñoz Ewald

We prove that the standard gradient flow in parameter space that underlies many training algorithms in deep learning can be continuously deformed into an adapted gradient flow which yields (constrained) Euclidean gradient flow in output space. Moreover, for the $L^{2}$ loss, if the Jacobian of the outputs with respect to the parameters is full rank (for fixed training data), then the time variable can be reparametrized so that the resulting flow is simply linear interpolation, and a global minimum can be achieved. For the cross-entropy loss, under the same rank condition and assuming the labels have positive components, we derive an explicit formula for the unique global minimum.

LGNov 27, 2023
Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning

Thomas Chen

We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the arbitrariness of choice in the Riemannian metric relative to which the gradient descent flow can be defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here instead, we choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in the parameter space, one adapted for the overparametrized setting, and the other for the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that generically, they are not isolated points, but elements of a critical submanifold of parameter space.

LGNov 13, 2023
On non-approximability of zero loss global ${\mathcal L}^2$ minimizers by gradient descent in Deep Learning

Thomas Chen, Patricia Muñoz Ewald

We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL), and give a detailed discussion of the circumstance that in underparametrized DL networks, zero loss minimization can generically not be attained. As a consequence, we conclude that the distribution of training inputs must necessarily be non-generic in order to produce zero loss minimizers, both for the method constructed in [Chen-Munoz Ewald 2023, 2024], or for gradient descent [Chen 2025] (which assume clustering of training data).

MAFeb 19, 2025
Learning Symbolic Task Decompositions for Multi-Agent Teams

Ameesh Shah, Niklas Lauffer, Thomas Chen et al.

One approach for improving sample efficiency in cooperative multi-agent learning is to decompose overall tasks into sub-tasks that can be assigned to individual agents. We study this problem in the context of reward machines: symbolic tasks that can be formally decomposed into sub-tasks. In order to handle settings without a priori knowledge of the environment, we introduce a framework that can learn the optimal decomposition from model-free interactions with the environment. Our method uses a task-conditioned architecture to simultaneously learn an optimal decomposition and the corresponding agents' policies for each sub-task. In doing so, we remove the need for a human to manually design the optimal decomposition while maintaining the sample-efficiency benefits of improved credit assignment. We provide experimental results in several deep reinforcement learning settings, demonstrating the efficacy of our approach. Our results indicate that our approach succeeds even in environments with codependent agent dynamics, enabling synchronous multi-agent learning not achievable in previous works.

LGMay 11, 2024
Interpretable global minima of deep ReLU neural networks on sequentially separable data

Thomas Chen, Patrícia Muñoz Ewald

We explicitly construct zero loss neural network classifiers. We write the weight matrices and bias vectors in terms of cumulative parameters, which determine truncation maps acting recursively on input space. The configurations for the training data considered are (i) sufficiently small, well separated clusters corresponding to each class, and (ii) equivalence classes which are sequentially linearly separable. In the best case, for $Q$ classes of data in $\mathbb{R}^M$, global minimizers can be described with $Q(M+2)$ parameters.

LGJun 3, 2025
Non-Asymptotic Length Generalization

Thomas Chen, Tengyu Ma, Zhiyuan Li

Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$ where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.

LGApr 8, 2025
Architecture independent generalization bounds for overparametrized deep ReLU networks

Thomas Chen, Chun-Kai Kevin Chien, Patricia Muñoz Ewald et al.

We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove that the generalization error is independent of the network architecture.

LGFeb 19, 2025
Zero loss guarantees and explicit minimizers for generic overparametrized Deep Learning networks

Thomas Chen, Andrew G. Moore

We determine sufficient conditions for overparametrized deep learning (DL) networks to guarantee the attainability of zero loss in the context of supervised learning, for the $\mathcal{L}^2$ cost and {\em generic} training data. We present an explicit construction of the zero loss minimizers without invoking gradient descent. On the other hand, we point out that increase of depth can deteriorate the efficiency of cost minimization using a gradient descent algorithm by analyzing the conditions for rank loss of the training Jacobian. Our results clarify key aspects on the dichotomy between zero loss reachability in underparametrized versus overparametrized DL.

LGJan 13, 2025
Derivation of effective gradient flow equations and dynamical truncation of training data in Deep Learning

Thomas Chen

We derive explicit equations governing the cumulative biases and weights in Deep Learning with ReLU activation function, based on gradient descent for the Euclidean cost in the input layer, and under the assumption that the weights are, in a precise sense, adapted to the coordinate system distinguished by the activations. We show that gradient descent corresponds to a dynamical process in the input layer, whereby clusters of data are progressively reduced in complexity ("truncated") at an exponential rate that increases with the number of data points that have already been truncated. We provide a detailed discussion of several types of solutions to the gradient flow equations. A main motivation for this work is to shed light on the interpretability question in supervised learning.

CRFeb 14, 2022
A Review of zk-SNARKs

Thomas Chen, Hui Lu, Teeramet Kunpittaya et al.

A zk-SNARK is a protocol that lets one party, the prover, prove to another party, the verifier, that a statement about some privately-held information is true without revealing the information itself. This paper describes technical foundations, current applications, and some novel applications of zk-SNARKs. Regarding technical foundations, we go over the Quadratic Arithmetic Program reduction and the Pinocchio protocol. We then go over financial security applications like Zcash and Tornado Cash, and zk-Rollup applications like zkEVM and Darkforest. We propose novel zk-SNARK protocols for private auctions and decentralized card games on the blockchain, providing code for the proposed applications. We conclude by touching on promising zk-SNARK innovations, such as zk-STARKs.

CLAug 21, 2021
Hierarchical Summarization for Longform Spoken Dialog

Daniel Li, Thomas Chen, Albert Tung et al.

Every day we are surrounded by spoken dialog. This medium delivers rich diverse streams of information auditorily; however, systematically understanding dialog can often be non-trivial. Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remains markedly poor, especially when compared to written prose. Furthermore, compared to understanding text, auditory communication poses many additional challenges such as speaker disfluencies, informal prose styles, and lack of structure. These concerns all demonstrate the need for a distinctly speech tailored interactive system to help users understand and navigate the spoken language domain. While individual automatic speech recognition (ASR) and text summarization methods already exist, they are imperfect technologies; neither consider user purpose and intent nor address spoken language induced complications. Consequently, we design a two stage ASR and text summarization pipeline and propose a set of semantic segmentation and merging algorithms to resolve these speech modeling challenges. Our system enables users to easily browse and navigate content as well as recover from errors in these underlying technologies. Finally, we present an evaluation of the system which highlights user preference for hierarchical summarization as a tool to quickly skim audio and identify content of interest to the user.

CRJan 24, 2019
OAuthGuard: Protecting User Security and Privacy with OAuth 2.0 and OpenID Connect

Wanpeng Li, Chris J Mitchell, Thomas Chen

Millions of users routinely use Google to log in to websites supporting OAuth 2.0 or OpenID Connect; the security of OAuth 2.0 and OpenID Connect is therefore of critical importance. As revealed in previous studies, in practice RPs often implement OAuth 2.0 incorrectly, and so many real-world OAuth 2.0 and OpenID Connect systems are vulnerable to attack. However, users of such flawed systems are typically unaware of these issues, and so are at risk of attacks which could result in unauthorised access to the victim user's account at an RP. In order to address this threat, we have developed OAuthGuard, an OAuth 2.0 and OpenID Connect vulnerability scanner and protector, that works with RPs using Google OAuth 2.0 and OpenID Connect services. It protects user security and privacy even when RPs do not implement OAuth 2.0 or OpenID Connect correctly. We used OAuthGuard to survey the 1000 top-ranked websites supporting Google sign-in for the possible presence of five OAuth 2.0 or OpenID Connect security and privacy vulnerabilities, of which one has not previously been described in the literature. Of the 137 sites in our study that employ Google Sign-in, 69 were found to suffer from at least one serious vulnerability. OAuthGuard was able to protect user security and privacy for 56 of these 69 RPs, and for the other 13 was able to warn users that they were using an insecure implementation.

CRJan 24, 2018
Mitigating CSRF attacks on OAuth 2.0 and OpenID Connect

Wanpeng Li, Chris J Mitchell, Thomas Chen

Many millions of users routinely use their Google, Facebook and Microsoft accounts to log in to websites supporting OAuth 2.0 and/or OpenID Connect-based single sign on. The security of OAuth 2.0 and OpenID Connect is therefore of critical importance, and it has been widely examined both in theory and in practice. Unfortunately, as these studies have shown, real-world implementations of both schemes are often vulnerable to attack, and in particular to cross-site request forgery (CSRF) attacks. In this paper we propose a new technique which can be used to mitigate CSRF attacks against both OAuth 2.0 and OpenID Connect.