CV · Sep 19, 2023 · Code
DCPT: Darkness Clue-Prompted Tracking in Nighttime UAVs
Jiawen Zhu, Huayi Tang, Zhi-Qi Cheng et al. · cmu, uw
Existing nighttime unmanned aerial vehicle (UAV) trackers follow an "Enhance-then-Track" architecture: they first use a light enhancer to brighten the nighttime video, then employ a daytime tracker to locate the object. Separating enhancement from tracking in this way prevents building an end-to-end trainable vision system. To address this, we propose a novel architecture called Darkness Clue-Prompted Tracking (DCPT) that achieves robust UAV tracking at night by efficiently learning to generate darkness clue prompts. Without a separate enhancer, DCPT directly encodes anti-dark capabilities into prompts using a darkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing and undermining projections for darkness clues. It then injects these learned visual prompts into a daytime tracker with fixed parameters across transformer layers. Moreover, a gated feature aggregation mechanism enables adaptive fusion between prompts and between prompts and the base model. Extensive experiments show state-of-the-art performance for DCPT on multiple dark scenario benchmarks. The unified end-to-end learning of enhancement and tracking in DCPT enables a more trainable system. The darkness clue prompting efficiently injects anti-dark knowledge without extra modules. Code is available at https://github.com/bearyi26/DCPT.
LG · Jul 14, 2023 · Code
Can Large Language Models Empower Molecular Property Prediction?
Chen Qian, Huayi Tang, Zhirui Yang et al.
Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule can be represented either as graph-structured data or as SMILES text. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to utilize LLMs to assist in understanding molecules represented by SMILES, the exploration of how LLMs will impact molecular property prediction is still in its early stage. In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules. To be specific, we first prompt LLMs to do in-context molecular classification and evaluate their performance. After that, we employ LLMs to generate semantically enriched explanations for the original SMILES and then leverage them to fine-tune a small-scale LM for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets, and confirm the immense potential of LLMs in molecular property prediction tasks. Code is available at https://github.com/ChnQ/LLM4Mol.
AI · Jun 4
Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents
Zeyu Gan, Huayi Tang, Yong Liu
As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.
LG · Oct 6, 2023 · Code
Perfect Alignment May be Poisonous to Graph Contrastive Learning
Jingyu Liu, Huayi Tang, Yong Liu
Graph Contrastive Learning (GCL) aims to learn node representations by aligning positive pairs and separating negative ones. However, few researchers have examined the principles behind the specific augmentations used in graph-based learning. What kind of augmentation helps downstream performance, how does contrastive learning actually influence downstream tasks, and why does the magnitude of augmentation matter so much? This paper seeks to address these questions by establishing a connection between augmentation and downstream performance. Our findings reveal that GCL contributes to downstream tasks mainly by separating different classes rather than gathering nodes of the same class. Consequently, perfect alignment and augmentation overlap, which draw all intra-class samples together, cannot fully explain the success of contrastive learning. To understand how augmentation aids the contrastive learning process, we further investigate generalization and find that perfect alignment, which draws each positive pair to the same representation, can help the contrastive loss but is poisonous to generalization. As a result, perfect alignment may not lead to the best downstream performance, and specifically designed augmentation is needed to achieve appropriate alignment and improve downstream accuracy. We further analyze the results using information theory and graph spectrum theory and propose two simple but effective methods to verify the theories. The two methods can be easily applied to various GCL algorithms, and extensive experiments demonstrate their effectiveness. The code is available at https://github.com/somebodyhh1/GRACEIS
CV · Mar 1, 2025 · Code
Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking
Jiawen Zhu, Huayi Tang, Xin Chen et al.
Efficient tracking has garnered attention for its ability to operate on resource-constrained platforms for real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal, as incorporating template computation in every frame leads to redundancy, and pervasive semantic interaction between template and search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. AsymTrack disentangles template and search streams into separate branches, with the template computed only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectionally inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation capacity of lightweight trackers. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state of the art. For instance, AsymTrack-T achieves 60.8% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0% AUC with higher speeds. The code is available at https://github.com/jiawen-zhu/AsymTrack.
CL · Mar 12
Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models
Zeyu Gan, Ruifeng Ren, Wei Yao et al.
The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite the empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as "black boxes". To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.
LG · Nov 8, 2023
Information-Theoretic Generalization Bounds for Transductive Learning and its Applications
Huayi Tang, Yong Liu
In this paper, we establish generalization bounds for transductive learning algorithms in the context of information theory and PAC-Bayes, covering both the random sampling and the random splitting setting. First, we show that the transductive generalization gap can be controlled by the mutual information between training label selection and the hypothesis. Next, we propose the concept of transductive supersample and use it to derive transductive information-theoretic bounds involving conditional mutual information and different information measures. We further establish transductive PAC-Bayesian bounds with weaker assumptions on the type of loss function and the number of training and test data points. Lastly, we use the theoretical results to derive upper bounds for adaptive optimization algorithms under the transductive learning setting. We also apply them to semi-supervised learning and transductive graph learning scenarios, and validate the derived bounds through experiments on synthetic and real-world datasets.
LG · Feb 3
Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry
Ye Su, Huayi Tang, Zixuan Gong et al.
While Mixture-of-Experts (MoE) architectures define the state-of-the-art, their theoretical success is often attributed to heuristic efficiency rather than geometric expressivity. In this work, we present the first analysis of MoE through the lens of tropical geometry, establishing that the Top-$k$ routing mechanism is algebraically isomorphic to the $k$-th elementary symmetric tropical polynomial. This isomorphism partitions the input space into the Normal Fan of a Hypersimplex, revealing that sparsity is combinatorial depth which scales geometric capacity by the binomial coefficient $\binom{N}{k}$. Moving beyond ambient bounds, we introduce the concept of Effective Capacity under the Manifold Hypothesis. We prove that while dense networks suffer from capacity collapse on low-dimensional data, MoE architectures exhibit Combinatorial Resilience, maintaining high expressivity via the transversality of routing cones. In this study, our framework unifies the discrete geometry of the Hypersimplex with the continuous geometry of neural functions, offering a rigorous theoretical justification for the topological supremacy of conditional computation.
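One way to read the Top-$k$ correspondence mentioned above is the standard max-plus identity: the sum of the $k$ largest scores equals the $k$-th elementary symmetric polynomial evaluated with max in place of addition and + in place of multiplication. The sketch below checks that identity numerically. It is an illustration only, not the authors' construction; the function names and the router logits are hypothetical.

```python
from itertools import combinations
from math import comb


def tropical_elementary_symmetric(scores, k):
    """k-th elementary symmetric polynomial in the max-plus semiring.

    Classical sum over k-subsets becomes max; classical product becomes +.
    """
    return max(sum(subset) for subset in combinations(scores, k))


def top_k_sum(scores, k):
    """Sum of the k largest scores, i.e. the total score of a Top-k selection."""
    return sum(sorted(scores, reverse=True)[:k])


# Hypothetical router logits for N = 5 experts (assumed values).
scores = [0.3, -1.2, 2.5, 0.9, 1.7]
for k in range(1, len(scores) + 1):
    lhs = tropical_elementary_symmetric(scores, k)
    rhs = top_k_sum(scores, k)
    # The two quantities agree up to floating-point summation order.
    assert abs(lhs - rhs) < 1e-12
    # Each k-subset is one linear piece; there are C(N, k) of them.
    print(k, comb(len(scores), k), round(lhs, 3))
```

The brute-force maximum ranges over $\binom{N}{k}$ subsets, which is the same binomial count the abstract associates with geometric capacity.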
LG · Nov 2, 2025
Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
Ruifeng Ren, Sheng Ouyang, Huayi Tang et al.
Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework which is composed of three key components: the global energy $F^*$, the energy function $E_i$ and the employed gradient descent (GD) form. Within this framework, standard softmax attention can be viewed as a special case of minimizing the Helmholtz free energy as $F^*$ using standard GD when $E_i$ takes the form of elastic potential energy, with residual connections ensuring that this optimization proceeds in an incremental manner. In addition, linear attentions can also be naturally incorporated into this framework by adjusting the corresponding energy forms. We also extend the above analysis to the multi-head setting, where the energy is defined across multiple low-dimensional subspaces. Building on this framework, we propose energy-based modifications of attention structures. Inspired by classical GD algorithms, we extend the original attention formulation based on standard GD to the momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton's method variants, each inducing a corresponding new attention structure. Our experiments provide preliminary support for the potential of the energy-based framework for designing attention mechanisms.
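The softmax-attention claim above can be made concrete with a small numerical check. The sketch assumes one particular reading that the abstract does not spell out: the free energy is $F(z) = -T \log \sum_i \exp(-E_i(z)/T)$ with $E_i(z) = \tfrac{1}{2}\lVert z - h_i \rVert^2$, and no learned projections are used. Under these assumptions the gradient is $z - \sum_i p_i h_i$ with softmax weights $p_i$, so one gradient step of size 1 lands on the attention-style weighted average. All names are hypothetical and the paper's exact formulation may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def elastic_energy(z, h):
    """Assumed elastic potential energy: 0.5 * ||z - h||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(z, h))


def free_energy(z, keys, temperature=1.0):
    """Assumed Helmholtz-style free energy: -T * log(sum_i exp(-E_i / T))."""
    vals = [-elastic_energy(z, h) / temperature for h in keys]
    peak = max(vals)
    return -temperature * (peak + math.log(sum(math.exp(v - peak) for v in vals)))


def attention_readout(z, keys, temperature=1.0):
    """Softmax-weighted average of the keys, with weights softmax(-E_i / T)."""
    weights = softmax([-elastic_energy(z, h) / temperature for h in keys])
    return [sum(w * h[d] for w, h in zip(weights, keys)) for d in range(len(z))]


def gd_step(z, keys, step=1.0, temperature=1.0):
    """One gradient-descent step on the free energy; the gradient is z - readout."""
    readout = attention_readout(z, keys, temperature)
    grad = [zi - ri for zi, ri in zip(z, readout)]
    return [zi - step * gi for zi, gi in zip(z, grad)]


z = [0.2, -0.1]                                  # hypothetical query state
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]     # hypothetical token vectors
print(gd_step(z, keys, step=1.0))
print(attention_readout(z, keys))
```

With a step size below 1 the update becomes a convex combination of the current state and the readout, which is one way to picture the incremental, residual-style behavior the abstract mentions.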
LG · Mar 28, 2021 · Code
Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering
Jie Xu, Yazhou Ren, Huayi Tang et al.
Multi-view clustering is an important research topic due to its capability to utilize complementary information from multiple views. However, few methods consider the negative impact of views with unclear clustering structures, which results in poor multi-view clustering performance. To address this drawback, we propose self-supervised discriminative feature learning for deep multi-view clustering (SDMVC). Concretely, deep autoencoders are applied to learn embedded features for each view independently. To leverage the multi-view complementary information, we concatenate all views' embedded features to form the global features, which can overcome the negative impact of some views' unclear clustering structures. In a self-supervised manner, pseudo-labels are obtained to build a unified target distribution to perform multi-view discriminative feature learning. During this process, global discriminative information can be mined to supervise all views to learn more discriminative features, which in turn are used to update the target distribution. In addition, this unified target distribution lets SDMVC learn consistent cluster assignments, which achieves clustering consistency across views while preserving the diversity of their features. Experiments on various types of multi-view datasets show that SDMVC outperforms 14 competitors including classic and state-of-the-art methods. The code is available at https://github.com/SubmissionsIn/SDMVC.
CL · Feb 24, 2025
Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization
Zixuan Gong, Xiaolin Hu, Huayi Tang et al.
Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: (a) Limited i.i.d. Setting. Most studies focus on supervised function learning tasks where prompts are constructed with i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. (b) Lack of Emergence Explanation. Most literature answers what ICL does from an implicit optimization perspective but falls short in elucidating how ICL emerges and the impact of the pre-training phase on ICL. In this paper, to go beyond (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. Finally, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, showing that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
LG · Sep 8, 2025
PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification
Huayi Tang, Yong Liu
Graph neural networks (GNNs) have achieved remarkable success in processing graph-structured data across various applications. A critical aspect of real-world graphs is their dynamic nature, where new nodes are continually added and existing connections may change over time. Previous theoretical studies, largely based on the transductive learning framework, fail to adequately model such temporal evolution and structural dynamics. In this paper, we present a PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. We derive novel generalization bounds for one-layer GCNs that explicitly incorporate the effects of data dependency and non-stationarity, and establish sufficient conditions under which the generalization gap converges to zero as the number of nodes increases. Furthermore, we extend our analysis to two-layer GCNs, and reveal that it requires stronger assumptions on graph topology to guarantee convergence. This work establishes a theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.
LG · Jun 3, 2025
On Weak-to-Strong Generalization and f-Divergence
Wei Yao, Gengze Xu, Huayi Tang et al.
Weak-to-strong generalization (W2SG) has emerged as a promising paradigm for stimulating the capabilities of strong pre-trained models by leveraging supervision from weaker supervisors. To improve the performance of the strong model, existing methods often require additional weak models or complex procedures, leading to substantial computational and memory overhead. Motivated by the effectiveness of $f$-divergence loss in various machine learning domains, we introduce $f$-divergence as an information-theoretic loss function framework in W2SG. Our theoretical analysis reveals fundamental limitations and equivalence of different $f$-divergence losses in W2SG, supported by sample complexity bounds and information-theoretic insights. We empirically demonstrate that $f$-divergence loss, which generalizes widely-used metrics like KL divergence, effectively improves generalization and noise tolerance of the strong model in practice.
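For readers unfamiliar with the term, the statement that $f$-divergence generalizes KL divergence can be checked directly from the textbook definition $D_f(P \,\|\, Q) = \sum_i q_i \, f(p_i / q_i)$ for discrete distributions with $q_i > 0$. The sketch below is a generic illustration of that definition, not the loss used in the paper; the function names and the two example distributions are hypothetical.

```python
import math


def f_divergence(p, q, f):
    """Discrete f-divergence D_f(P || Q) = sum_i q_i * f(p_i / q_i), assuming q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    """Generator f(t) = t * log(t), which recovers KL divergence (with 0 * log 0 = 0)."""
    return t * math.log(t) if t > 0 else 0.0


def total_variation_generator(t):
    """Generator f(t) = 0.5 * |t - 1|, which recovers total variation distance."""
    return 0.5 * abs(t - 1.0)


def kl_divergence(p, q):
    """Direct KL(P || Q) for comparison, skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical weak-supervisor and strong-model label distributions.
weak = [0.5, 0.5]
strong = [0.25, 0.75]
print(f_divergence(weak, strong, kl_generator), kl_divergence(weak, strong))
print(f_divergence(weak, strong, total_variation_generator))
```

Swapping the generator changes the divergence without touching the rest of the computation, which is what makes the family convenient as a loss framework.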
LG · May 14, 2023
Towards Understanding the Generalization of Graph Neural Networks
Huayi Tang, Yong Liu
Graph neural networks (GNNs) are the most widely adopted models for learning and representing graph-structured data. Despite their extraordinary success in real-world applications, the theoretical understanding of their working mechanism is still at an early stage. In this paper, we move towards this goal from the perspective of generalization. To be specific, we first establish high-probability bounds on the generalization gap and gradients in transductive learning, taking stochastic optimization into account. After that, we provide high-probability bounds on the generalization gap for popular GNNs. The theoretical results reveal the architecture-specific factors affecting the generalization gap. Experimental results on benchmark datasets show the consistency between theoretical results and empirical evidence. Our results provide new insights into the generalization of GNNs.
CV · Jun 21, 2021
Multi-VAE: Learning Disentangled View-common and View-peculiar Visual Representations for Multi-view Clustering
Jie Xu, Yazhou Ren, Huayi Tang et al.
Multi-view clustering, a long-standing and important research problem, focuses on mining complementary information from diverse views. However, existing works often fuse multiple views' representations or handle clustering in a common feature space, which may result in their entanglement, especially for visual representations. To address this issue, we present a novel VAE-based multi-view clustering framework (Multi-VAE) that learns disentangled visual representations. Concretely, we define a view-common variable and multiple view-peculiar variables in the generative model. The prior of the view-common variable follows an approximately discrete Gumbel-Softmax distribution, which is introduced to extract the common cluster factor of multiple views. Meanwhile, the prior of each view-peculiar variable follows a continuous Gaussian distribution, which is used to represent each view's peculiar visual factors. By controlling the mutual information capacity to disentangle the view-common and view-peculiar representations, continuous visual information of multiple views can be separated so that their common discrete cluster information can be effectively mined. Experimental results demonstrate that Multi-VAE yields disentangled and explainable visual representations while obtaining superior clustering performance compared with state-of-the-art methods.
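The "approximately discrete" Gumbel-Softmax distribution mentioned above can be sampled with the standard recipe: add Gumbel noise to the class logits, divide by a temperature, and apply a softmax. The sketch below shows only that generic sampling step, not Multi-VAE itself; the function name, the logits, and the temperatures are hypothetical.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a categorical with the given logits."""
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(perturbed)
    exps = [math.exp(v - peak) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
cluster_logits = [1.0, 0.2, -0.5]  # hypothetical logits over 3 clusters
for temperature in (5.0, 1.0, 0.1):
    sample = gumbel_softmax_sample(cluster_logits, temperature, rng)
    print(temperature, [round(p, 3) for p in sample])
```

Lower temperatures push samples toward one-hot vectors, which is what allows a continuous relaxation to stand in for a discrete cluster assignment.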
LG · Jun 21, 2021
Multi-level Feature Learning for Contrastive Multi-view Clustering
Jie Xu, Huayi Tang, Yazhou Ren et al.
Multi-view clustering can explore common semantics from multiple views and has attracted increasing attention. However, existing works optimize multiple objectives in the same feature space, ignoring the conflict between learning consistent common semantics and reconstructing inconsistent view-private information. In this paper, we propose a new framework of multi-level feature learning for contrastive multi-view clustering to address this issue. Our method learns different levels of features from the raw features, including low-level features, high-level features, and semantic labels/features, in a fusion-free manner, so that it can effectively achieve the reconstruction objective and the consistency objectives in different feature spaces. Specifically, the reconstruction objective is applied to the low-level features. Two consistency objectives based on contrastive learning are applied to the high-level features and the semantic labels, respectively. They enable the high-level features to effectively explore the common semantics and the semantic labels to achieve multi-view clustering. As a result, the proposed framework can reduce the adverse influence of view-private information. Extensive experiments on public datasets demonstrate that our method achieves state-of-the-art clustering effectiveness.