CL · May 14, 2022
RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL
Jiexing Qi, Jingyao Tang, Ziwei He et al. · meta-ai, mila
Relational structures such as schema linking and schema encoding have been validated as a key component to qualitatively translating natural language into SQL queries. However, introducing these structural relations comes at a price: they often result in a specialized model structure, which largely prohibits using large pretrained models in text-to-SQL. To address this problem, we propose RASAT: a Transformer seq2seq architecture augmented with relation-aware self-attention that can leverage a variety of relational structures while effectively inheriting the pretrained parameters from the T5 model. Our model can incorporate almost all types of existing relations in the literature, and in addition, we propose introducing co-reference relations for the multi-turn scenario. Experimental results on three widely used text-to-SQL datasets, covering both single-turn and multi-turn scenarios, show that RASAT achieves state-of-the-art results across all three benchmarks (75.5% EX on Spider, 52.6% IEX on SParC, and 37.4% IEX on CoSQL).
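To make "relation-aware self-attention" concrete, here is a minimal single-head sketch in NumPy. It follows the generic formulation in which a learned embedding for the relation between tokens i and j is added to the key and to the value; it is not RASAT's or T5's exact attention (T5's own scaling and position bias are omitted), and all names (`relation_aware_attention`, `rel_ids`, `rel_k`, `rel_v`) are hypothetical.

```python
import numpy as np

def relation_aware_attention(x, w_q, w_k, w_v, rel_ids, rel_k, rel_v):
    """Single-head relation-aware self-attention (illustrative sketch).

    x:        (n, d_model) token representations
    w_q/k/v:  (d_model, d_head) projection matrices
    rel_ids:  (n, n) integer relation type between token i and token j
    rel_k/v:  (num_relations, d_head) relation embeddings for keys/values
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    r_k = rel_k[rel_ids]  # (n, n, d_head): relation embedding per token pair
    r_v = rel_v[rel_ids]
    d_head = q.shape[-1]
    # score_ij = q_i . (k_j + r^K_ij) / sqrt(d_head)
    scores = (q @ k.T + np.einsum("id,ijd->ij", q, r_k)) / np.sqrt(d_head)
    # Numerically stable softmax over j
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # z_i = sum_j weights_ij * (v_j + r^V_ij)
    out = weights @ v + np.einsum("ij,ijd->id", weights, r_v)
    return out, weights

# Toy usage: 5 tokens, 3 hypothetical relation types (e.g. schema link, co-reference, none)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
rel_ids = rng.integers(0, 3, size=(5, 5))
out, weights = relation_aware_attention(
    x, w_q, w_k, w_v, rel_ids, rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
)
```

With all relation embeddings set to zero the function reduces to ordinary scaled dot-product attention, which gives some intuition for how relation terms can be added on top of pretrained projection weights.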
LG · Mar 26, 2022
A Roadmap for Big Model
Sha Yuan, Hanyu Zhao, Shuai Zhao et al. · bytedance, pku
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and in the application of BMs to many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. For each topic, we clearly summarize the current studies and propose some future research directions. At the end of this paper, we conclude with a more general view of the further development of BMs.
LG · Feb 25, 2023 · Code
Bayesian Neural Networks Avoid Encoding Complex and Perturbation-Sensitive Concepts
Qihan Ren, Huiqi Deng, Yunuo Chen et al.
In this paper, we focus on mean-field variational Bayesian Neural Networks (BNNs) and explore the representation capacity of such BNNs by investigating which types of concepts are less likely to be encoded by the BNN. It has been observed and studied that a relatively small set of interactive concepts usually emerge in the knowledge representation of a sufficiently-trained neural network, and such concepts can faithfully explain the network output. Based on this, our study proves that compared to standard deep neural networks (DNNs), it is less likely for BNNs to encode complex concepts. Experiments verify our theoretical proofs. Note that the tendency to encode less complex concepts does not necessarily imply weak representation power, considering that complex concepts exhibit low generalization power and high adversarial vulnerability. The code is available at https://github.com/sjtu-xai-lab/BNN-concepts.
LG · May 4, 2022
Towards Theoretical Analysis of Transformation Complexity of ReLU DNNs
Jie Ren, Mingjie Li, Meng Zhou et al. · cmu
This paper aims to theoretically analyze the complexity of feature transformations encoded in piecewise linear DNNs with ReLU layers. We propose metrics to measure three types of complexities of transformations based on information theory. We further discover and prove the strong correlation between the complexity and the disentanglement of transformations. Based on the proposed metrics, we analyze two typical phenomena of the change of the transformation complexity during the training process, and explore the ceiling of a DNN's complexity. The proposed metrics can also be used as a loss to learn a DNN with the minimum complexity, which also controls the over-fitting level of the DNN and influences adversarial robustness, adversarial transferability, and knowledge consistency. Comprehensive comparative studies provide new perspectives for understanding the DNN.
LG · Aug 18, 2022
Quantifying the Knowledge in a DNN to Explain Knowledge Distillation for Classification
Quanshi Zhang, Xu Cheng, Yilan Chen et al.
Compared to traditional learning from scratch, knowledge distillation sometimes makes the DNN achieve superior performance. This paper provides a new perspective to explain the success of knowledge distillation, i.e., quantifying knowledge points encoded in intermediate layers of a DNN for classification, based on information theory. To this end, we consider the signal processing in a DNN as layer-wise information discarding. A knowledge point is defined as an input unit whose information is discarded much less than that of other input units. Thus, we propose three hypotheses for knowledge distillation based on the quantification of knowledge points. 1. The DNN learning from knowledge distillation encodes more knowledge points than the DNN learning from scratch. 2. Knowledge distillation makes the DNN more likely to learn different knowledge points simultaneously. In comparison, the DNN learning from scratch tends to encode various knowledge points sequentially. 3. The DNN learning from knowledge distillation is often optimized more stably than the DNN learning from scratch. In order to verify the above hypotheses, we design three types of metrics with annotations of foreground objects to analyze feature representations of the DNN, i.e., the quantity and the quality of knowledge points, the learning speed of different knowledge points, and the stability of optimization directions. In experiments, we diagnosed various DNNs for different classification tasks, i.e., image classification, 3D point cloud classification, binary sentiment classification, and question answering, which verified the above hypotheses.
LG · Feb 25, 2023
Does a Neural Network Really Encode Symbolic Concepts?
Mingjie Li, Quanshi Zhang
Recently, a series of studies have tried to extract interactions between input variables modeled by a DNN and define such interactions as concepts encoded by the DNN. However, strictly speaking, there is still no solid guarantee that such interactions indeed represent meaningful concepts. Therefore, in this paper, we examine the trustworthiness of interaction concepts from four perspectives. Extensive empirical studies have verified that a well-trained DNN usually encodes sparse, transferable, and discriminative concepts, which is partially aligned with human intuition.
LG · Feb 25, 2023
Explaining Generalization Power of a DNN Using Interactive Concepts
Huilin Zhou, Hao Zhang, Huiqi Deng et al.
This paper explains the generalization power of a deep neural network (DNN) from the perspective of interactions. Although there is no universally accepted definition of the concepts encoded by a DNN, the sparsity of interactions in a DNN has been proved, i.e., the output score of a DNN can be well explained by a small number of interactions between input variables. In this way, to some extent, we can consider such interactions as interactive concepts encoded by the DNN. Therefore, in this paper, we derive an analytic explanation of the inconsistency of concepts of different complexities. This may shed new light on using the generalization power of concepts to explain the generalization power of the entire DNN. Besides, we discover that the DNN with stronger generalization power usually learns simple concepts more quickly and encodes fewer complex concepts. We also discover the detouring dynamics of learning complex concepts, which explains both the high learning difficulty and the low generalization power of complex concepts. The code will be released when the paper is accepted.
LG · Mar 2, 2023
Understanding and Unifying Fourteen Attribution Methods with Taylor Interactions
Huiqi Deng, Na Zou, Mengnan Du et al.
Various attribution methods have been developed to explain deep neural networks (DNNs) by inferring the attribution/importance/contribution score of each input variable to the final output. However, existing attribution methods are often built upon different heuristics. There remains a lack of a unified theoretical understanding of why these methods are effective and how they are related. To this end, for the first time, we formulate core mechanisms of fourteen attribution methods, which were designed on different heuristics, into the same mathematical system, i.e., the system of Taylor interactions. Specifically, we prove that attribution scores estimated by fourteen attribution methods can all be reformulated as the weighted sum of two types of effects, i.e., independent effects of each individual input variable and interaction effects between input variables. The essential difference among the fourteen attribution methods mainly lies in the weights of allocating different effects. Based on the above findings, we propose three principles for a fair allocation of effects to evaluate the faithfulness of the fourteen attribution methods.
93.2 · AI · May 28
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Dongrui Liu, Yu Li, Zhonghao Yang et al.
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
LG · Oct 17, 2022
Defects of Convolutional Decoder Networks in Frequency Representation
Ling Tang, Wen Shen, Zhanpeng Zhou et al.
In this paper, we prove the representation defects of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we extend the 2D circular convolution theorem to represent the forward and backward propagations through convolutional layers in the frequency domain. Based on this, we prove three defects in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum, in which strong signals repetitively appear at certain frequencies. Third, we prove that if the frequency components in the input sample and frequency components in the target output for regression have a small shift, then the decoder usually cannot be effectively learned.
LG · Apr 4, 2023
HarsanyiNet: Computing Accurate Shapley Values in a Single Forward Propagation
Lu Chen, Siyu Lou, Keyan Zhang et al.
The Shapley value is widely regarded as a trustworthy attribution metric. However, when people use Shapley values to explain the attribution of input variables of a deep neural network (DNN), it usually requires a very high computational cost to approximate relatively accurate Shapley values in real-world applications. Therefore, we propose a novel network architecture, the HarsanyiNet, which makes inferences on the input sample and simultaneously computes the exact Shapley values of the input variables in a single forward propagation. The HarsanyiNet is designed on the theoretical foundation that the Shapley value can be reformulated as the redistribution of Harsanyi interactions encoded by the network.
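The reformulation mentioned above, the Shapley value as a redistribution of Harsanyi interactions, can be checked by brute force on a toy game. The sketch below is not the HarsanyiNet architecture: it enumerates all subsets (exponential cost) and only verifies that splitting each interaction I(S) evenly among the variables in S gives the same result as the permutation definition of the Shapley value. The value function `toy_value` and all helper names are hypothetical.

```python
from itertools import combinations, permutations
from math import factorial

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = tuple(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def harsanyi_interactions(v, players):
    """I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(players)
    }

def shapley_from_harsanyi(v, players):
    """phi_i = sum over S containing i of I(S) / |S|."""
    inter = harsanyi_interactions(v, players)
    return {i: sum(val / len(S) for S, val in inter.items() if i in S)
            for i in players}

def shapley_by_permutations(v, players):
    """Reference: average marginal contribution over all orderings."""
    players = tuple(players)
    phi = {i: 0.0 for i in players}
    for order in permutations(players):
        seen = frozenset()
        for i in order:
            phi[i] += v(seen | {i}) - v(seen)
            seen = seen | {i}
    n_fact = factorial(len(players))
    return {i: total / n_fact for i, total in phi.items()}

def toy_value(S):
    """Hypothetical game: variable 0 acts alone, variables 1 and 2 cooperate."""
    return 2.0 * (0 in S) + 3.0 * ({1, 2} <= S)

players = (0, 1, 2)
phi_h = shapley_from_harsanyi(toy_value, players)
phi_p = shapley_by_permutations(toy_value, players)
```

In this toy game the only nonzero interactions are I({0}) = 2 and I({1, 2}) = 3, so both routes assign 2 to variable 0 and 1.5 each to variables 1 and 2.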
LG · Jul 24, 2022
Proving Common Mechanisms Shared by Twelve Methods of Boosting Adversarial Transferability
Quanshi Zhang, Xin Wang, Jie Ren et al.
Although many methods have been proposed to enhance the transferability of adversarial perturbations, these methods are designed in a heuristic manner, and the essential mechanism for improving adversarial transferability is still unclear. This paper summarizes the common mechanism shared by twelve previous transferability-boosting methods in a unified view, i.e., these methods all reduce game-theoretic interactions between regional adversarial perturbations. To this end, we focus on the attacking utility of all interactions between regional adversarial perturbations, and we first discover and prove the negative correlation between the adversarial transferability and the attacking utility of interactions. Based on this discovery, we theoretically prove and empirically verify that twelve previous transferability-boosting methods all reduce interactions between regional adversarial perturbations. More crucially, we consider the reduction of interactions as the essential reason for the enhancement of adversarial transferability. Furthermore, we design the interaction loss to directly penalize interactions between regional adversarial perturbations during attacking. Experimental results show that the interaction loss significantly improves the transferability of adversarial perturbations.
LG · Apr 26, 2023
Technical Note: Defining and Quantifying AND-OR Interactions for Faithful and Concise Explanation of DNNs
Mingjie Li, Quanshi Zhang
In this technical note, we aim to explain a deep neural network (DNN) by quantifying the encoded interactions between input variables, which reflects the DNN's inference logic. Specifically, we first rethink the definition of interactions, and then formally define faithfulness and conciseness for interaction-based explanation. To this end, we propose two kinds of interactions, i.e., the AND interaction and the OR interaction. For faithfulness, we prove the uniqueness of the AND (OR) interaction in quantifying the effect of the AND (OR) relationship between input variables. Besides, based on AND-OR interactions, we design techniques to boost the conciseness of the explanation, while not hurting the faithfulness. In this way, the inference logic of a DNN can be faithfully and concisely explained by a set of symbolic concepts.
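As a rough illustration of what "faithful and concise" can mean here, the sketch below assumes the AND interaction takes the standard Harsanyi-dividend form, I(S) = Σ_{T⊆S} (-1)^{|S|-|T|} v(T), where v(T) is the model output when only the variables in T are left unmasked. That form is an assumption on my part, the OR interaction is not covered, and `toy_model` is a hypothetical stand-in for a real network.

```python
from itertools import combinations

def powerset(items):
    """All subsets of `items` as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def and_interactions(v, variables):
    """Assumed AND interaction: I(S) = sum_{T subset of S} (-1)^(|S|-|T|) v(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in powerset(S))
        for S in powerset(variables)
    }

def reconstruct(interactions, S):
    """Faithfulness check: v(S) should equal the sum of I(T) over all T inside S."""
    return sum(val for T, val in interactions.items() if T <= S)

def toy_model(S):
    """Hypothetical scalar output when only the variables in S are unmasked."""
    return 1.0 * (0 in S) - 2.0 * ({0, 1} <= S) + 4.0 * ({1, 2, 3} <= S)

variables = (0, 1, 2, 3)
interactions = and_interactions(toy_model, variables)
# Conciseness: keep only the interactions with a non-negligible effect
salient = {S: val for S, val in interactions.items() if abs(val) > 1e-9}
```

For this toy model only 3 of the 16 possible interactions are nonzero, yet summing the interactions contained in any masked state S reproduces `toy_model(S)` exactly.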
CL · Apr 3, 2023
Can the Inference Logic of Large Language Models be Disentangled into Symbolic Concepts?
Wen Shen, Lei Cheng, Yuxiao Yang et al.
In this paper, we explain the inference logic of large language models (LLMs) as a set of symbolic concepts. Many recent studies have discovered that traditional DNNs usually encode sparse symbolic concepts. However, because an LLM has many more parameters than traditional DNNs, whether the LLM also encodes sparse symbolic concepts is still an open problem. Therefore, in this paper, we propose to disentangle the inference score of LLMs for dialogue tasks into a small number of symbolic concepts. We verify that those sparse concepts can be used to accurately estimate the inference scores of the LLM on all arbitrarily masked states of the input sentence. We also evaluate the transferability of concepts encoded by an LLM and verify that symbolic concepts usually exhibit high transferability across similar input sentences. More crucially, those symbolic concepts can be used to explain the exact reasons accountable for the LLM's prediction errors.
AI · Oct 15, 2023
Explaining How a Neural Network Play the Go Game and Let People Learn
Huilin Zhou, Huijie Tang, Mingjie Li et al.
The AI model has surpassed human players in the game of Go, and it is widely believed that the AI model has encoded new knowledge about the Go game beyond human players. Explaining the knowledge encoded by the AI model and using it to teach human players is therefore a promising yet challenging issue in explainable AI. To this end, mathematical support is required to ensure that human players can learn accurate and verifiable knowledge, rather than specious intuitive analysis. Thus, in this paper, we extract interaction primitives between stones encoded by the value network for the Go game, so as to enable people to learn from the value network. Experiments show the effectiveness of our method.
LG · May 30, 2022
Why Adversarial Training of ReLU Networks Is Difficult?
Xu Cheng, Hao Zhang, Yue Xin et al.
This paper mathematically derives an analytic solution of the adversarial perturbation on a ReLU network, and theoretically explains the difficulty of adversarial training. Specifically, we formulate the dynamics of the adversarial perturbation generated by the multi-step attack, which shows that the adversarial perturbation tends to strengthen eigenvectors corresponding to a few top-ranked eigenvalues of the Hessian matrix of the loss w.r.t. the input. We also prove that adversarial training tends to strengthen the influence of unconfident input samples with large gradient norms in an exponential manner. Besides, we find that adversarial training strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which makes the adversarial training more likely to oscillate along directions of a few samples, and boosts the difficulty of adversarial training. Crucially, our proofs provide a unified explanation for previous findings in understanding adversarial training.
LG · May 30, 2022
Batch Normalization Is Blind to the First and Second Derivatives of the Loss
Zhanpeng Zhou, Wen Shen, Huixin Chen et al.
In this paper, we prove the effects of the batch normalization (BN) operation on the back-propagation of the first and second derivatives of the loss. When we do the Taylor series expansion of the loss function, we prove that the BN operation will block the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that such a problem is caused by the standardization phase of the BN operation. Experimental results have verified our theoretical conclusions, and we have found that the BN operation significantly affects feature representations in specific tasks, where losses of different samples share similar analytic formulas.
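The role of the standardization phase can be illustrated with a small NumPy check. This is not the paper's proof; it only verifies a well-known property of the backward pass through standardization: any upstream gradient component that is constant across the batch, or proportional to the standardized features themselves, is mapped to zero. Function names are hypothetical, and the learnable scale and shift of BN are left out.

```python
import numpy as np

def standardize(x, eps=0.0):
    """Standardization phase of BN: per-feature zero mean, unit variance over the batch."""
    mu = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    return (x - mu) / std, std

def standardize_backward(grad_out, x_hat, std):
    """Gradient of the loss w.r.t. x, given grad_out = dL/dx_hat.

    dx = (g - mean(g) - x_hat * mean(g * x_hat)) / std, with means over the batch.
    """
    g_mean = grad_out.mean(axis=0)
    gx_mean = (grad_out * x_hat).mean(axis=0)
    return (grad_out - g_mean - x_hat * gx_mean) / std

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(16, 3))
x_hat, std = standardize(x)

# A gradient that is identical for every sample is blocked entirely
dx_const = standardize_backward(np.full_like(x_hat, 0.7), x_hat, std)
# So is a gradient proportional to the standardized features (exact when eps = 0)
dx_prop = standardize_backward(2.0 * x_hat, x_hat, std)
# For an arbitrary gradient, what passes through always sums to zero over the batch
dx_rand = standardize_backward(rng.normal(size=x_hat.shape), x_hat, std)
```

Here `dx_const` and `dx_prop` are zero up to floating-point error, and each column of `dx_rand` sums to zero, so two directions of the upstream gradient never reach the layers below the standardization step.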
LG · Sep 23, 2023
Towards Attributions of Input Variables in a Coalition
Xinhao Zheng, Huiqi Deng, Quanshi Zhang
This paper focuses on the fundamental challenge of partitioning input variables in attribution methods for Explainable AI, particularly in Shapley value-based approaches. Previous methods always compute attributions given a predefined partition but lack theoretical guidance on how to form meaningful variable partitions. We identify that attribution conflicts arise when the attribution of a coalition differs from the sum of its individual variables' attributions. To address this, we analyze the numerical effects of AND-OR interactions in AI models and extend the Shapley value to a new attribution metric for variable coalitions. Our theoretical findings reveal that specific interactions cause attribution conflicts, and we propose three metrics to evaluate coalition faithfulness. Experiments on synthetic data, NLP, image classification, and the game of Go validate our approach, demonstrating consistency with human intuition and practical applicability.
AI · Jan 26
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security
Dongrui Liu, Qihan Ren, Chen Qian et al.
The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
LG · Jul 27, 2024
Towards the Dynamics of a DNN Learning Symbolic Interactions
Qihan Ren, Junpeng Zhang, Yang Xu et al.
This study proves the two-phase dynamics of a deep neural network (DNN) learning interactions. Despite long-standing pessimism about the faithfulness of post-hoc explanations of a DNN, a series of theorems have been proven in recent years to show that for a given input sample, a small set of interactions between input variables can be considered as primitive inference patterns that faithfully represent a DNN's detailed inference logic on that sample. In particular, Zhang et al. have observed that various DNNs all learn interactions of different complexities in two distinct phases, and that this two-phase dynamics explains well how a DNN changes from under-fitting to over-fitting. Therefore, in this study, we mathematically prove the two-phase dynamics of interactions, providing a theoretical mechanism for how the generalization power of a DNN changes during the training process. Experiments show that our theory accurately predicts the real dynamics of interactions on different DNNs trained for various tasks.
73.3 · AI · May 18
Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
Junpeng Zhang, Lei Cheng, Guoxi Zhang et al.
This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.
81.4 · AI · May 18
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
Junyao Yang, Chen Qian, Kun Wang et al.
The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.
LG · Sep 13, 2024
Layerwise Change of Knowledge in Neural Networks
Xu Cheng, Lei Cheng, Zhaoran Peng et al.
This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Although the definition of knowledge encoded by the DNN has not yet reached a consensus, previous studies have derived a series of mathematical evidence to take interactions as symbolic primitive inference patterns encoded by a DNN. We extend the definition of interactions and, for the first time, extract interactions encoded by intermediate layers. We quantify and track the newly emerged interactions and the forgotten interactions in each layer during the forward propagation, which sheds new light on the learning behavior of DNNs. The layer-wise change of interactions also reveals the change of the generalization capacity and instability of feature representations of a DNN.
88.1 · AI · Apr 8
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
Qihan Ren, Peng Wang, Ruikun Cai et al.
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
70.9 · LG · May 13
Understanding Generalization through Decision Pattern Shift
Huiqi Deng, Yibo Li, Quanshi Zhang et al.
Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.
LG · Dec 21, 2025
The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation
Huiqi Deng, Qihan Ren, Zhuofan Chen et al.
Understanding what kinds of cooperative structures deep neural networks (DNNs) can represent remains a fundamental yet insufficiently understood problem. In this work, we treat interactions as the fundamental units of such structure and investigate a largely unexplored question: how DNNs encode interactions under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. To quantify this complexity, we use multi-order interactions [57], where each order reflects the amount of contextual information required to evaluate the joint interaction utility of a variable pair. This formulation enables a stratified analysis of cooperative patterns learned by DNNs. Building on this formulation, we develop a comprehensive study of interaction structure in DNNs. (i) We empirically discover a universal interaction bottleneck: across architectures and tasks, DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. (ii) We theoretically explain this bottleneck by proving that mid-order interactions incur the highest contextual variability, yielding large gradient variance and making them intrinsically difficult to learn. (iii) We further modulate the bottleneck by introducing losses that steer models toward emphasizing interactions of selected orders. Finally, we connect microscopic interaction structures with macroscopic representational behavior: low-order-emphasized models exhibit stronger generalization and robustness, whereas high-order-emphasized models demonstrate greater structural modeling and fitting capability. Together, these results uncover an inherent representational bias in modern DNNs and establish interaction order as a powerful lens for interpreting and guiding deep representations.
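To give a feel for what "interaction order" measures, the sketch below assumes the commonly used pairwise definition: the order-m interaction of variables i and j is the average, over all contexts S of size m drawn from the remaining variables, of v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S). Whether this matches the cited formulation in every detail is an assumption; `toy_value` is a hypothetical game, and the exhaustive enumeration is only feasible for a handful of variables.

```python
from itertools import combinations

def pair_interaction_of_order(v, players, i, j, order):
    """Average joint utility of (i, j) over all contexts of exactly `order` other variables."""
    others = [p for p in players if p not in (i, j)]
    deltas = []
    for ctx in combinations(others, order):
        S = frozenset(ctx)
        deltas.append(v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return sum(deltas) / len(deltas)

def toy_value(S):
    """Hypothetical game: a local pattern {0, 1} plus a global pattern needing all four variables."""
    return 1.0 * ({0, 1} <= S) + 2.0 * ({0, 1, 2, 3} <= S)

players = (0, 1, 2, 3)
# One value per order m = 0 .. n-2 for the pair (0, 1)
profile = [pair_interaction_of_order(toy_value, players, 0, 1, m)
           for m in range(len(players) - 1)]
```

In this toy game the pair (0, 1) has interaction 1 at orders 0 and 1 (the local pattern) and 3 at order 2, where the global pattern also becomes active; a model's bottleneck would show up as a dip in such a profile at middle orders.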
82.8 · AI · May 12
Attributing Emergence in Million-Agent Systems
Ling Tang, Jilin Mei, Qian Chen et al.
Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann-Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data (1,671,587 active users), we compute the attribution at both full scale and the visibility-biased $N = 10^2$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.
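The general idea behind Aumann-Shapley path-integral attribution can be sketched in a few lines: integrate each partial derivative of the indicator along the straight line from a baseline to the actual input, then scale by the input's displacement. The sketch below is a generic numerical version, not the paper's method; the macro indicator `toy_indicator`, the zero baseline, and the finite-difference gradients are all assumptions made for illustration.

```python
def aumann_shapley(f, x, baseline=None, steps=200, h=1e-5):
    """phi_i = (x_i - b_i) * integral_0^1 df/dx_i (b + t (x - b)) dt, via the midpoint rule."""
    n = len(x)
    baseline = [0.0] * n if baseline is None else list(baseline)
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    attributions = [0.0] * n
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [bi + t * di for bi, di in zip(baseline, diff)]
        for i in range(n):
            up, down = point.copy(), point.copy()
            up[i] += h
            down[i] -= h
            grad_i = (f(up) - f(down)) / (2 * h)  # central finite difference
            attributions[i] += diff[i] * grad_i / steps
    return attributions

def toy_indicator(x):
    """Hypothetical nonlinear macro indicator over three 'agents'."""
    return x[0] * x[1] + x[2] ** 2

x = [1.0, 2.0, 3.0]
phi = aumann_shapley(toy_indicator, x)
total = sum(phi)  # should match toy_indicator(x) - toy_indicator([0, 0, 0])
```

The attributions sum to the change in the indicator between baseline and input (completeness). The finite differences used here cost two evaluations of `f` per variable per step; with automatic differentiation a single gradient per path step would do, which is what makes path methods attractive at large N compared with enumerating or sampling coalitions.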
55.8AIMay 12
What Do EEG Foundation Models Capture from Human Brain Signals?Ling Tang, Qian Chen, Jilin Mei et al.
Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, \emph{e.g.,} band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: \emph{what does the model learn}, \emph{what does the model use}, and \emph{how much can be explained}. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the $945$ (model, task, feature) units, $648$ ($68.6\%$) are representation-causal and $199$ ($21.1\%$) are encoded-only. Across tasks, $50$ features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, $79.3\%$ of the foundation model's advantage over the random baseline, with a clean task gradient (MDD $\approx 0.99$ down to Stress $\approx 0.56$): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.
CLFeb 20, 2024
Identifying Semantic Induction Heads to Understand In-Context LearningJie Ren, Qipeng Guo, Hang Yan et al.
Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To gain a better understanding of LLMs, we conduct a detailed analysis of the operations of attention heads and aim to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens. More crucially, the formulation of such semantic induction heads has a close correlation with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers, and further provides new insights into the in-context learning of LLMs.
LGJan 29, 2024
Defining and Extracting generalizable interaction primitives from DNNsLu Chen, Siyu Lou, Benhao Huang et al. · cmu
Faithfully summarizing the knowledge encoded by a deep neural network (DNN) into a few symbolic primitive patterns without losing much information represents a core challenge in explainable AI. To this end, Ren et al. (2024) have derived a series of theorems to prove that the inference score of a DNN can be explained as a small set of interactions between input variables. However, the lack of generalization power makes it still hard to consider such interactions as faithful primitive patterns encoded by the DNN. Therefore, given different DNNs trained for the same task, we develop a new method to extract interactions that are shared by these DNNs. Experiments show that the extracted interactions can better reflect common knowledge shared by different DNNs.
LGMay 16, 2024
Two-Phase Dynamics of Interactions Explains the Starting Point of a DNN Learning Over-Fitted FeaturesJunpeng Zhang, Qing Li, Liang Lin et al.
This paper investigates the dynamics of a deep neural network (DNN) learning interactions. Previous studies have discovered and mathematically proven that given each input sample, a well-trained DNN usually only encodes a small number of interactions (non-linear relationships) between input variables in the sample. A series of theorems have been derived to prove that we can consider the DNN's inference equivalent to using these interactions as primitive patterns for inference. In this paper, we discover the DNN learns interactions in two phases. The first phase mainly penalizes interactions of medium and high orders, and the second phase mainly learns interactions of gradually increasing orders. We can consider the two-phase phenomenon as the starting point of a DNN learning over-fitted features. Such a phenomenon has been widely shared by DNNs with various architectures trained for different tasks. Therefore, the discovery of the two-phase dynamics provides a detailed mechanism for how a DNN gradually learns different inference patterns (interactions). In particular, we have also verified the claim that high-order interactions have weaker generalization power than low-order interactions. Thus, the discovered two-phase dynamics also explains how the generalization power of a DNN changes during the training process.
LGMay 20, 2024
Quantifying In-Context Reasoning Effects and Memorization Effects in LLMsSiyu Lou, Yuntian Chen, Xiaodan Liang et al.
In this study, we propose an axiomatic system to define and quantify the precise memorization and in-context reasoning effects used by the large language model (LLM) for language generation. These effects are formulated as non-linear interactions between tokens/words encoded by the LLM. Specifically, the axiomatic system enables us to categorize the memorization effects into foundational memorization effects and chaotic memorization effects, and further classify in-context reasoning effects into enhanced inference patterns, eliminated inference patterns, and reversed inference patterns. Besides, the decomposed effects satisfy the sparsity property and the universal matching property, which mathematically guarantee that the LLM's confidence score can be faithfully decomposed into the memorization effects and in-context reasoning effects. Experiments show that the clear disentanglement of memorization effects and in-context reasoning effects enables a straightforward examination of detailed inference patterns encoded by LLMs.
LGAug 11, 2025
Attribution Explanations for Deep Neural Networks: A Theoretical PerspectiveHuiqi Deng, Hongbin Pei, Quanshi Zhang et al.
Attribution explanation is a typical approach for explaining deep neural networks (DNNs), inferring an importance or contribution score for each input variable to the final output. In recent years, numerous attribution methods have been developed to explain DNNs. However, a persistent concern remains unresolved, i.e., whether and which attribution methods faithfully reflect the actual contribution of input variables to the decision-making process. The faithfulness issue undermines the reliability and practical utility of attribution explanations. We argue that these concerns stem from three core challenges. First, difficulties arise in comparing attribution methods due to their unstructured heterogeneity, differences in heuristics, formulations, and implementations that lack a unified organization. Second, most methods lack solid theoretical underpinnings, with their rationales remaining absent, ambiguous, or unverified. Third, empirically evaluating faithfulness is challenging without ground truth. Recent theoretical advances provide a promising way to tackle these challenges, attracting increasing attention. We summarize these developments, with emphasis on three key directions: (i) Theoretical unification, which uncovers commonalities and differences among methods, enabling systematic comparisons; (ii) Theoretical rationale, clarifying the foundations of existing methods; (iii) Theoretical evaluation, rigorously proving whether methods satisfy faithfulness principles. Beyond a comprehensive review, we provide insights into how these studies help deepen theoretical understanding, inform method selection, and inspire new attribution methods. We conclude with a discussion of promising open problems for further work.
LGMay 2, 2025
Towards the Resistance of Neural Network Watermarking to Fine-tuningLing Tang, Yuefeng Chen, Hui Xue et al.
This paper proposes a new watermarking method that embeds ownership information into a deep neural network (DNN) and is robust to fine-tuning. Specifically, we prove that when the input feature of a convolutional layer contains only low-frequency components, specific frequency components of the convolutional filter will not be changed by gradient descent during the fine-tuning process, where we propose a revised Fourier transform to extract frequency components from the convolutional filter. Additionally, we prove that these frequency components are equivariant to weight scaling and weight permutations. In this way, we design a watermark module to encode the watermark information into specific frequency components of a convolutional filter. Preliminary experiments demonstrate the effectiveness of our method.
CVOct 15, 2025
Counting Hallucinations in Diffusion ModelsShuai Fu, Jian Zhou, Qi Chen et al.
Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.
LGMay 11, 2025
Technical Report: Quantifying and Analyzing the Generalization Power of a DNNYuxuan He, Junpeng Zhang, Lei Cheng et al.
This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interactions encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainable AI, which proves that the detailed inference logic of DNNs can be strictly rewritten as a small number of AND-OR interaction patterns. Based on this, we propose an efficient method to quantify the generalization power of each interaction, and we discover a distinct three-phase dynamics of the generalization power of interactions during training. In particular, the early phase of training typically removes noisy and non-generalizable interactions and learns simple and generalizable ones. The second and the third phases tend to capture increasingly complex interactions that are harder to generalize. Experimental results verify that the learning of non-generalizable interactions is the direct cause of the gap between the training and testing losses.
LGFeb 14, 2025
Revisiting Generalization Power of a DNN in Terms of Symbolic InteractionsLei Cheng, Junpeng Zhang, Qihan Ren et al.
This paper aims to analyze the generalization power of deep neural networks (DNNs) from the perspective of interactions. Unlike previous analyses of a DNN's generalization power in a high-dimensional feature space, we find that the generalization power of a DNN can be explained as the generalization power of the interactions. We find that the generalizable interactions follow a decay-shaped distribution, while non-generalizable interactions follow a spindle-shaped distribution. Furthermore, our theory can effectively disentangle these two types of interactions from a DNN. Experiments verify that our theory matches real interactions in a DNN well.
LGFeb 12, 2025
Randomness of Low-Layer Parameters Determines Confusing Samples in Terms of Interaction Representations of a DNNJunpeng Zhang, Lei Cheng, Qing Li et al.
In this paper, we find that the complexity of interactions encoded by a deep neural network (DNN) can explain its generalization power. We also discover that the confusing samples of a DNN, which are represented by non-generalizable interactions, are determined by its low-layer parameters. In comparison, other factors, such as high-layer parameters and network architecture, have much less impact on the composition of confusing samples. Two DNNs with different low-layer parameters usually have entirely different sets of confusing samples, even though they have similar performance. This finding extends the understanding of the lottery ticket hypothesis and explains the distinctive representation power of different DNNs.
LGMay 3, 2023
Where We Have Arrived in Proving the Emergence of Sparse Symbolic Concepts in AI ModelsQihan Ren, Jiayang Gao, Wen Shen et al.
This study aims to prove the emergence of symbolic concepts (or more precisely, sparse primitive inference patterns) in well-trained deep neural networks (DNNs). Specifically, we prove the following three conditions for the emergence. (i) The high-order derivatives of the network output with respect to the input variables are all zero. (ii) The DNN can be used on occluded samples and when the input sample is less occluded, the DNN will yield higher confidence. (iii) The confidence of the DNN does not significantly degrade on occluded samples. These conditions are quite common, and we prove that under these conditions, the DNN will only encode a relatively small number of sparse interactions between input variables. Moreover, we can consider such interactions as symbolic primitive inference patterns encoded by a DNN, because we show that inference scores of the DNN on an exponentially large number of randomly masked samples can always be well mimicked by numerical effects of just a few interactions.
LGDec 2, 2021
Trap of Feature Diversity in the Learning of MLPsDongrui Liu, Shaobo Wang, Jie Ren et al.
In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs), and we aim to explain the reason for the decrease of feature diversity in the first phase. Specifically, people find that, in the training of MLPs, the training loss does not decrease significantly until the second phase. To this end, we further explore the reason why the diversity of features over different samples keeps decreasing in the first phase, which hurts the optimization of MLPs. We explain such a phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically explain why four typical operations can alleviate the decrease of the feature diversity.
LGNov 11, 2021
Discovering and Explaining the Representation Bottleneck of DNNsHuiqi Deng, Qihan Ren, Hao Zhang et al.
This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
LGNov 11, 2021
Defining and Quantifying the Emergence of Sparse Concepts in DNNsJie Ren, Mingjie Li, Qirui Chen et al.
This paper aims to illustrate the concept-emerging phenomenon in a trained DNN. Specifically, we find that the inference score of a DNN can be disentangled into the effects of a few interactive concepts. These concepts can be understood as causal patterns in a sparse, symbolic causal graph, which explains the DNN. The faithfulness of using such a causal graph to explain the DNN is theoretically guaranteed, because we prove that the causal graph can well mimic the DNN's outputs on an exponential number of different masked samples. Besides, such a causal graph can be further simplified and re-written as an And-Or graph (AOG), without losing much explanation accuracy.
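The disentanglement described above can be sketched on a toy example. The snippet assumes that the effect of an interactive concept $S$ is the Harsanyi dividend $I(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} v(T)$, where $v(T)$ is the model output when only the variables in $T$ are left unmasked. That definition is an assumption about this line of work rather than something stated in the abstract, and `toy_v` is a hypothetical stand-in for a DNN.

```python
from itertools import chain, combinations

def subsets(items):
    """All subsets of items, as tuples, from the empty set up to the full set."""
    items = tuple(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def harsanyi_interactions(v, n):
    """Effect I(S) of every subset S of n variables, via inclusion-exclusion."""
    effects = {}
    for s in subsets(range(n)):
        s = frozenset(s)
        effects[s] = sum(
            (-1) ** (len(s) - len(t)) * v(frozenset(t)) for t in subsets(s)
        )
    return effects

def reconstruct(effects, subset):
    """Output on a masked sample = sum of effects of concepts contained in it."""
    return sum(e for s, e in effects.items() if s <= subset)

# Hypothetical masked-sample score: one AND pattern plus one single-variable effect.
def toy_v(subset):
    return 2.0 * ({0, 1} <= subset) + 0.5 * (2 in subset)

effects = harsanyi_interactions(toy_v, 3)
nonzero = {tuple(sorted(s)): e for s, e in effects.items() if abs(e) > 1e-12}
```

Only two of the eight possible concepts have a non-zero effect here, and `reconstruct` reproduces `toy_v` on every masked sample, which is the sense in which a sparse set of interactions can mimic the model's outputs.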
CVNov 5, 2021
Interpreting Representation Quality of DNNs for 3D Point Cloud ProcessingWen Shen, Qihan Ren, Dongrui Liu et al.
In this paper, we evaluate the quality of knowledge representations encoded in deep neural networks (DNNs) for 3D point cloud processing. We propose a method to disentangle the overall model vulnerability into the sensitivity to rotation, translation, scale, and local 3D structures. We also propose metrics to evaluate the spatial smoothness of encoding 3D structures and the representation complexity of the DNN. Based on this analysis, experiments expose representation problems with classic DNNs and explain the utility of adversarial training.
LGNov 5, 2021
A Unified Game-Theoretic Interpretation of Adversarial RobustnessJie Ren, Die Zhang, Yisen Wang et al.
This paper provides a unified view to explain different adversarial attacks and defense methods, \emph{i.e.}, the view of multi-order interactions between input variables of DNNs. Based on the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide a potential method to unify adversarial perturbations and robustness, which can explain the existing defense methods in a principled way. Besides, our findings also revise the previous inaccurate understanding of the shape bias of adversarially learned features.
CVNov 5, 2021
Visualizing the Emergence of Intermediate Visual Patterns in DNNsMingjie Li, Shaobo Wang, Quanshi Zhang
This paper proposes a method to visualize the discrimination power of intermediate-layer visual patterns encoded by a DNN. Specifically, we visualize (1) how the DNN gradually learns regional visual patterns in each intermediate layer during the training process, and (2) the effects of the DNN using non-discriminative patterns in low layers to construct discriminative patterns in middle/high layers through the forward propagation. Based on our visualization method, we can quantify knowledge points (i.e., the number of discriminative visual patterns) learned by the DNN to evaluate the representation capacity of the DNN. Furthermore, this method also provides new insights into signal-processing behaviors of existing deep-learning techniques, such as adversarial attacks and knowledge distillation.
QMSep 22, 2021
Rapid detection and recognition of whole brain activity in a freely behaving Caenorhabditis elegansYuxiang Wu, Shang Wu, Xin Wang et al.
Advanced volumetric imaging methods and genetically encoded activity indicators have permitted a comprehensive characterization of whole brain activity at single neuron resolution in \textit{Caenorhabditis elegans}. The constant motion and deformation of the nematode nervous system, however, impose a great challenge for consistent identification of densely packed neurons in a behaving animal. Here, we propose a cascade solution for long-term and rapid recognition of head ganglion neurons in a freely moving \textit{C. elegans}. First, potential neuronal regions from a stack of fluorescence images are detected by a deep learning algorithm. Second, 2-dimensional neuronal regions are fused into 3-dimensional neuron entities. Third, by exploiting the neuronal density distribution surrounding a neuron and relative positional information between neurons, a multi-class artificial neural network transforms engineered neuronal feature vectors into digital neuronal identities. With a small number of training samples, our bottom-up approach is able to process each volume - $1024 \times 1024 \times 18$ in voxels - in less than 1 second and achieves an accuracy of $91\%$ in neuronal detection and above $80\%$ in neuronal tracking over a long video recording. Our work represents a step towards rapid and fully automated algorithms for decoding whole brain activity underlying naturalistic behaviors.
LGAug 16, 2021
Interpreting Attributions and Interactions of Adversarial AttacksXin Wang, Shuyun Lin, Hao Zhang et al.
This paper aims to explain adversarial attacks in terms of how adversarial perturbations contribute to the attacking task. We estimate attributions of different image regions to the decrease of the attacking cost based on the Shapley value. We define and quantify interactions among adversarial perturbation pixels, and decompose the entire perturbation map into relatively independent perturbation components. The decomposition of the perturbation map shows that adversarially-trained DNNs have more perturbation components in the foreground than normally-trained DNNs. Moreover, compared to the normally-trained DNN, the adversarially-trained DNN has more components which mainly decrease the score of the true category. The above analyses provide new insights into the understanding of adversarial attacks.
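To make the Shapley-based region attribution concrete, here is a minimal sketch that computes exact Shapley values by averaging marginal contributions over all orderings. The utility `toy_v` is a hypothetical stand-in for "decrease of the attacking cost when only these regions are perturbed"; the paper's actual estimator and cost function are not described in the abstract.

```python
from itertools import permutations

def shapley_values(v, n):
    """Exact Shapley value of each of n players for a set function v."""
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present = set()
        for player in order:
            before = v(frozenset(present))
            present.add(player)
            # Marginal contribution of this player given who arrived earlier.
            totals[player] += v(frozenset(present)) - before
    return [t / len(orders) for t in totals]

# Hypothetical utility over three image regions: region 0 helps on its own,
# regions 1 and 2 only help when perturbed together.
def toy_v(regions):
    return 1.0 * (0 in regions) + 2.0 * ({1, 2} <= regions)

phi = shapley_values(toy_v, 3)
```

Regions 1 and 2 split their joint effect equally, and the attributions sum to the utility of perturbing all regions. Enumerating orderings is factorial in the number of regions, so sampled permutations would be needed for anything beyond a handful of regions.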
LGJul 31, 2021
A Hypothesis for the Aesthetic Appreciation in Neural NetworksXu Cheng, Xin Wang, Haotian Xue et al.
This paper proposes a hypothesis for the aesthetic appreciation that aesthetic images make a neural network strengthen salient concepts and discard inessential concepts. In order to verify this hypothesis, we use multi-variate interactions to represent salient concepts and inessential concepts contained in images. Furthermore, we design a set of operations to revise images towards more beautiful ones. In experiments, we find that the revised images are more aesthetic than the original ones to some extent.
LGJul 16, 2021
Proceedings of ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AIQuanshi Zhang, Tian Han, Lixin Fan et al.
This is the Proceedings of ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI. Deep neural networks (DNNs) have undoubtedly brought great success to a wide range of applications in computer vision, computational linguistics, and AI. However, foundational principles underlying the DNNs' success and their resilience to adversarial attacks are still largely missing. Interpreting and theorizing the internal mechanisms of DNNs has become a compelling yet controversial topic. This workshop takes a special interest in theoretic foundations, limitations, and new application trends in the scope of XAI. These issues reflect new bottlenecks in the future development of XAI.
CVJul 9, 2021
Interpretable Compositional Convolutional Neural NetworksWen Shen, Zhihua Wei, Shikun Huang et al.
The reasonable definition of semantic interpretability presents the core challenge in explainable AI. This paper proposes a method to modify a traditional convolutional neural network (CNN) into an interpretable compositional CNN, in order to learn filters that encode meaningful visual patterns in intermediate convolutional layers. In a compositional CNN, each filter is supposed to consistently represent a specific compositional object part or image region with a clear meaning. The compositional CNN learns from image labels for classification without any annotations of parts or regions for supervision. Our method can be broadly applied to different types of CNNs. Experiments have demonstrated the effectiveness of our method.