LGFeb 27, 2023
A Brief Survey on the Approximation Theory for Sequence ModellingHaotian Jiang, Qianxiao Li, Zhong Li et al.
We survey current developments in the approximation theory of sequence modelling in machine learning. Particular emphasis is placed on classifying existing results for various model architectures through the lens of classical approximation paradigms, and the insights one can gain from these results. We also outline some future research directions towards building a theory of sequence modelling.
OCAug 7, 2022
Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient Oracle ComplexitySally Dong, Haotian Jiang, Yin Tat Lee et al.
Many fundamental problems in machine learning can be formulated by the convex program \[ \min_{θ\in R^d}\ \sum_{i=1}^{n}f_{i}(θ), \] where each $f_i$ is a convex, Lipschitz function supported on a subset of $d_i$ coordinates of $θ$. One common approach to this problem, exemplified by stochastic gradient descent, involves sampling one $f_i$ term at every iteration to make progress. This approach crucially relies on a notion of uniformity across the $f_i$'s, formally captured by their condition number. In this work, we give an algorithm that minimizes the above convex formulation to $ε$-accuracy in $\widetilde{O}(\sum_{i=1}^n d_i \log (1 /ε))$ gradient computations, with no assumptions on the condition number. The previous best algorithm independent of the condition number is the standard cutting plane method, which requires $O(nd \log (1/ε))$ gradient computations. As a corollary, we improve upon the evaluation oracle complexity for decomposable submodular minimization by Axiotis et al. (ICML 2021). Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.
CLMar 4, 2024Code
Differentially Private Synthetic Data via Foundation Model APIs 2: TextChulin Xie, Zinan Lin, Arturs Backurs et al. · microsoft-research
Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
23.5LGMay 18
InfoFlow: A Framework for Multi-Layer Transformer AnalysisPenghao Yu, Haotian Jiang, Zeyu Bao et al.
While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires least $Ω(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O (\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
60.3CVMar 23
PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and ParagangliomaZelin Liu, Xiangfu Yu, Jie Huang et al.
Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring demands a high workload for clinician because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
ITMay 20, 2022
Low-cost Relevance Generation and Evaluation Metrics for Entity Resolution in AIVenkat Varada, Mina Ghashami, Jitesh Mehta et al.
Entity Resolution (ER) in voice assistants is a prime component during run time that resolves entities in users request to real world entities. ER involves two major functionalities 1. Relevance generation and 2. Ranking. In this paper we propose a low cost relevance generation framework by generating features using customer implicit and explicit feedback signals. The generated relevance datasets can serve as test sets to measure ER performance. We also introduce a set of metrics that accurately measures the performance of ER systems in various dimensions. They provide great interpretability to deep dive and identifying root cause of ER issues, whether the problem is in relevance generation or ranking.
CVMar 16, 2025Code
Pathology Image Restoration via Mixture of PromptsJiangdong Cai, Yan Chen, Zhenrong Shen et al.
In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and high-efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent prevailing image restoration technique provides a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their powers in preserving image fidelity and perceptual quality, respectively. We particularly propose a novel mixture of prompts for the two-stage solution. Given initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potentials of the mixture of prompts to clinical usage. Code will be publicly available at https://github.com/caijd2000/MoP.
IVMar 9, 2025Code
LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer ScreeningJiangdong Cai, Haotian Jiang, Zhenrong Shen et al.
The deployment of computer-aided diagnosis systems for cervical cancer screening using whole slide images (WSIs) faces critical challenges due to domain shifts caused by staining variations across different scanners and imaging environments. While existing stain augmentation methods improve patch-level robustness, they fail to scale to WSIs due to two key limitations: (1) inconsistent stain patterns when extending patch operations to gigapixel slides, and (2) prohibitive computational/storage costs from offline processing of augmented WSIs.To address this, we propose Latent Style Augmentation (LSA), a framework that performs efficient, online stain augmentation directly on WSI-level latent features. We first introduce WSAug, a WSI-level stain augmentation method ensuring consistent stain across patches within a WSI. Using offline-augmented WSIs by WSAug, we design and train Stain Transformer, which can simulate targeted style in the latent space, efficiently enhancing the robustness of the WSI-level classifier. We validate our method on a multi-scanner WSI dataset for cervical cancer diagnosis. Despite being trained on data from a single scanner, our approach achieves significant performance improvements on out-of-distribution data from other scanners. Code will be available at https://github.com/caijd2000/LSA.
23.0DSApr 14
Near-Optimal Constructive Bounds for $\ell_2$ Prefix Discrepancy and Steinitz Problems via Affine Spectral IndependenceKunal Dutta, Agastya Vibhuti Jha, Haotian Jiang
A classical result of Steinitz from 1913 \cite{Ste13}, answering an earlier question of Riemann and Lévy (e.g., \cite{Lev05}), states that for any norm $\|\cdot\|$ in $\mathbb{R}^d$ and any set of vectors $v_1, \cdots, v_n \in \R^d$ satisfying $\sum_{i=1}^n v_i = 0$, there exists an ordering $π: [n] \rightarrow [n]$ such that every partial sum along this order is bounded by $O(d)$, i.e., $\big\| \sum_{i=1}^t v_{π(i)} \big\| \leq O(d)$ for all $t \in [n]$. Steinitz's bound is tight up to constants in general, but for the $\ell_2$ norm $\|\cdot\|_2$, it has been conjectured that the best bound is $O(\sqrt{d})$. Almost a century later, a breakthrough work of Banaszczyk \cite{Ban12} gave a bound of $O(\sqrt{d} + \sqrt{\log n})$ for the $\ell_2$ Steinitz problem, matching the conjecture under the mild assumption that $d \geq Ω(\log n)$. Banaszczyk's result is non-constructive, and the previous best algorithmic bound was $O(\sqrt{d \log n})$, due to Bansal and Garg \cite{BG17}. In this work, we give an efficient algorithm that matches the conjectured $O(\sqrt{d})$ bound for the $\ell_2$ Steinitz problem under the slightly worse, yet still polylogarithmic, condition of $d \geq Ω(\log^7 n)$. As in prior work, our result extends to the harder problem of $\ell_2$ prefix discrepancy. We employ the framework of obtaining the desired ordering via a discrete Brownian motion, guided by a semidefinite program (SDP). To obtain our results, we use the new technique of ``Decoupling via Affine Spectral Independence'', proposed by Bansal and Jiang \cite{BJ26} to achieve substantial progress on the Beck-Fiala and Komlós conjectures, together with a ``Global Interval Tree'' data structure that simultaneously controls the deviations for all prefixes.
IRFeb 17
Automatic Funny Scene Extraction from Long-form Cinematic VideosSibendu Paul, Haotian Jiang, Caren Chen
Automatically extracting engaging and high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, boosting user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, challenge scene localization, while humor's reliance on diverse modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes from long-form cinematic titles, featuring shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles demonstrate 87% of clips extracted by our pipeline are intended to be funny, while 98% of scenes are accurately localized. With successful generalization to trailers, these results showcase the pipeline's potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.
CVFeb 27, 2025
MITracker: Multi-View Integration for Visual Object TrackingMengjie Xu, Yitao Zhu, Haotian Jiang et al.
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird's eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at https://mii-laboratory.github.io/MITracker/.
LGOct 8, 2025
The Effect of Attention Head Count on Transformer ApproximationPenghao Yu, Haotian Jiang, Zeyu Bao et al.
Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $ε$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/ε^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
LGOct 4, 2025
Allocation of Parameters in TransformersRuoxi Yu, Haotian Jiang, Jingpu Cheng et al.
Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention heads and head dimensions -- should be allocated across layers to balance expressivity and efficiency. We first provide mathematical analysis on the role of early layers in information extraction from an approximation perspective, with a theoretical characterization on the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the \emph{saturation} behavior of softmax activations: Continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers' layers, shedding light on theoretically-grounded model efficiency of Transformer-based architectures.
LGJun 24, 2025
The Effect of Depth on the Expressivity of Deep Linear State-Space ModelsZeyu Bao, Penghao Yu, Haotian Jiang et al.
Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how the depth of the SSM affects its expressiveness remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of the architecture. First, we rigorously prove that in the absence of parameter constraints, increasing depth and increasing width are generally equivalent, provided that the parameter count remains within the same order of magnitude. However, under the assumption that the parameter norms are constrained, the effects of depth and width differ significantly. We show that a shallow linear SSM with large parameter norms can be represented by a deep linear SSM with smaller norms using a constructive method. In particular, this demonstrates that deep SSMs are more capable of representing targets with large norms than shallow SSMs under norm constraints. Finally, we derive upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM under constrained parameter norms. We also validate our theoretical results with numerical experiments
LGMay 29, 2023
Forward and Inverse Approximation Theory for Linear Temporal Convolutional NetworksHaotian Jiang, Qianxiao Li
We present a theoretical analysis of the approximation properties of convolutional architectures when applied to the modeling of temporal sequences. Specifically, we prove an approximation rate estimate (Jackson-type result) and an inverse approximation theorem (Bernstein-type result), which together provide a comprehensive characterization of the types of sequential relationships that can be efficiently captured by a temporal convolutional architecture. The rate estimate improves upon a previous result via the introduction of a refined complexity measure, whereas the inverse approximation theorem is new.
LGMay 29, 2023
Approximation Rate of the Transformer Architecture for Sequence ModelingHaotian Jiang, Qianxiao Li
The Transformer architecture is widely applied in sequence modeling applications, yet the theoretical understanding of its working principles remains limited. In this work, we investigate the approximation rate for single-layer Transformers with one head. We consider a class of non-linear relationships and identify a novel notion of complexity measures to establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited for approximating. In particular, the results on approximation rates enable us to concretely analyze the differences between the Transformer and classical sequence modeling methods, such as recurrent neural networks.
LGMay 25, 2023
Learning across Data Owners with Joint Differential PrivacyYangsibo Huang, Haotian Jiang, Daogao Liu et al.
In this paper, we study the setting in which data owners train machine learning models collaboratively under a privacy notion called joint differential privacy [Kearns et al., 2018]. In this setting, the model trained for each data owner $j$ uses $j$'s data without privacy consideration and other owners' data with differential privacy guarantees. This setting was initiated in [Jain et al., 2021] with a focus on linear regressions. In this paper, we study this setting for stochastic convex optimization (SCO). We present an algorithm that is a variant of DP-SGD [Song et al., 2013; Abadi et al., 2016] and provides theoretical bounds on its population loss. We compare our algorithm to several baselines and discuss for what parameter setups our algorithm is more preferred. We also empirically study joint differential privacy in the multi-class classification problem over two public datasets. Our empirical findings are well-connected to the insights from our theoretical results.
LGJul 20, 2021
Approximation Theory of Convolutional Architectures for Time Series ModellingHaotian Jiang, Zhong Li, Qianxiao Li
We study the approximation properties of convolutional architectures applied to time series modelling, which can be formulated mathematically as a functional approximation problem. In the recurrent setting, recent results reveal an intricate connection between approximation efficiency and memory structures in the data generation process. In this paper, we derive parallel results for convolutional architectures, with WaveNet being a prime example. Our results reveal that in this new setting, approximation efficiency is not only characterised by memory, but also additional fine structures in the target relationship. This leads to a novel definition of spectrum-based regularity that measures the complexity of temporal relationships under the convolutional approximation scheme. These analyses provide a foundation to understand the differences between architectural choices for time series modelling and can give theoretically grounded guidance for practical applications.
DSApr 8, 2020
An Improved Cutting Plane Method for Convex Optimization, Convex-Concave Games and its ApplicationsHaotian Jiang, Yin Tat Lee, Zhao Song et al.
Given a separation oracle for a convex set $K \subset \mathbb{R}^n$ that is contained in a box of radius $R$, the goal is to either compute a point in $K$ or prove that $K$ does not contain a ball of radius $ε$. We propose a new cutting plane algorithm that uses an optimal $O(n \log (κ))$ evaluations of the oracle and an additional $O(n^2)$ time per evaluation, where $κ= nR/ε$. $\bullet$ This improves upon Vaidya's $O( \text{SO} \cdot n \log (κ) + n^{ω+1} \log (κ))$ time algorithm [Vaidya, FOCS 1989a] in terms of polynomial dependence on $n$, where $ω< 2.373$ is the exponent of matrix multiplication and $\text{SO}$ is the time for oracle evaluation. $\bullet$ This improves upon Lee-Sidford-Wong's $O( \text{SO} \cdot n \log (κ) + n^3 \log^{O(1)} (κ))$ time algorithm [Lee, Sidford and Wong, FOCS 2015] in terms of dependence on $κ$. For many important applications in economics, $κ= Ω(\exp(n))$ and this leads to a significant difference between $\log(κ)$ and $\mathrm{poly}(\log (κ))$. We also provide evidence that the $n^2$ time per evaluation cannot be improved and thus our running time is optimal. A bottleneck of previous cutting plane methods is to compute leverage scores, a measure of the relative importance of past constraints. Our result is achieved by a novel multi-layered data structure for leverage score maintenance, which is a sophisticated combination of diverse techniques such as random projection, batched low-rank update, inverse maintenance, polynomial interpolation, and fast rectangular matrix multiplication. Interestingly, our method requires a combination of different fast rectangular matrix multiplication algorithms.
LGMay 19, 2017
Practical Algorithms for Best-K Identification in Multi-Armed BanditsHaotian Jiang, Jian Li, Mingda Qiao
In the Best-$K$ identification problem (Best-$K$-Arm), we are given $N$ stochastic bandit arms with unknown reward distributions. Our goal is to identify the $K$ arms with the largest means with high confidence, by drawing samples from the arms adaptively. This problem is motivated by various practical applications and has attracted considerable attention in the past decade. In this paper, we propose new practical algorithms for the Best-$K$-Arm problem, which have nearly optimal sample complexity bounds (matching the lower bound up to logarithmic factors) and outperform the state-of-the-art algorithms for the Best-$K$-Arm problem (even for $K=1$) in practice.