LGMay 30, 2022
Why Adversarial Training of ReLU Networks Is Difficult?Xu Cheng, Hao Zhang, Yue Xin et al.
This paper mathematically derives an analytic solution of the adversarial perturbation on a ReLU network, and theoretically explains the difficulty of adversarial training. Specifically, we formulate the dynamics of the adversarial perturbation generated by the multi-step attack, which shows that the adversarial perturbation tends to strengthen eigenvectors corresponding to a few top-ranked eigenvalues of the Hessian matrix of the loss w.r.t. the input. We also prove that adversarial training tends to strengthen the influence of unconfident input samples with large gradient norms in an exponential manner. Besides, we find that adversarial training strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which makes the adversarial training more likely to oscillate along directions of a few samples, and boosts the difficulty of adversarial training. Crucially, our proofs provide a unified explanation for previous findings in understanding adversarial training.
LGJul 27, 2024
Towards the Dynamics of a DNN Learning Symbolic InteractionsQihan Ren, Junpeng Zhang, Yang Xu et al.
This study proves the two-phase dynamics of a deep neural network (DNN) learning interactions. Despite the long disappointing view of the faithfulness of post-hoc explanation of a DNN, a series of theorems have been proven in recent years to show that for a given input sample, a small set of interactions between input variables can be considered as primitive inference patterns that faithfully represent a DNN's detailed inference logic on that sample. Particularly, Zhang et al. have observed that various DNNs all learn interactions of different complexities in two distinct phases, and this two-phase dynamics well explains how a DNN changes from under-fitting to over-fitting. Therefore, in this study, we mathematically prove the two-phase dynamics of interactions, providing a theoretical mechanism for how the generalization power of a DNN changes during the training process. Experiments show that our theory well predicts the real dynamics of interactions on different DNNs trained for various tasks.
ROMar 14
URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical SimulationZhuangzhe Wu, Yue Xin, Chengkai Hou et al.
Articulated objects are fundamental for robotics, simulation of physics, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure. We present, an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, resulting in complete URDF models without reliance on multi-stage pipelines. The generation proceeds until the model determines that all parts have been produced, automatically inferring complete geometry and kinematics. Building on this capability, we enable a new Real-Follow-Sim paradigm, where high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation. Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that outperforms prior methods in geometric reconstruction quality, joint parameter accuracy, and physical executability.
IRSep 29, 2024
DIIT: A Domain-Invariant Information Transfer Method for Industrial Cross-Domain RecommendationHeyuan Huang, Xingyu Lou, Chaochao Chen et al.
Cross-Domain Recommendation (CDR) have received widespread attention due to their ability to utilize rich information across domains. However, most existing CDR methods assume an ideal static condition that is not practical in industrial recommendation systems (RS). Therefore, simply applying existing CDR methods in the industrial RS environment may lead to low effectiveness and efficiency. To fill this gap, we propose DIIT, an end-to-end Domain-Invariant Information Transfer method for industrial cross-domain recommendation. Specifically, We first simulate the industrial RS environment that maintains respective models in multiple domains, each of them is trained in the incremental mode. Then, for improving the effectiveness, we design two extractors to fully extract domain-invariant information from the latest source domain models at the domain level and the representation level respectively. Finally, for improving the efficiency, we design a migrator to transfer the extracted information to the latest target domain model, which only need the target domain model for inference. Experiments conducted on one production dataset and two public datasets verify the effectiveness and efficiency of DIIT.
AISep 18, 2025
Generalizable Geometric Image Caption SynthesisYue Xin, Wenyuan Wang, Rui Pan et al.
Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4\%\text{-}3.9\%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.
AISep 20, 2025
SalaMAnder: Shapley-based Mathematical Expression Attribution and Metric for Chain-of-Thought ReasoningYue Xin, Chen Shen, Shaotian Yan et al.
Chain-of-Thought (CoT) prompting enhances the math reasoning capability of large language models (LLMs) to a large margin. However, the mechanism underlying such improvements remains unexplored. In this paper, we present \textbf{SalaMAnder} (\textbf{S}h\textbf{a}p\textbf{l}ey-b\textbf{a}sed \textbf{M}athematical Expression \textbf{A}ttribution a\textbf{nd} M\textbf{e}t\textbf{r}ic), a theoretically grounded methodology as well as a mathematically rigorous evaluation metric for quantifying component-level contributions in few-shot CoT reasoning. Concretely, we leverage the Shapley value for mathematical expression attribution and develop an efficient stratified sampling algorithm that significantly reduces the computational complexity. Besides, we develop the \textbf{CoSP} (\textbf{C}ardinality \textbf{o}f \textbf{S}hapley \textbf{P}ositives) metric through covariance analysis. Comprehensive validation across popular LLM models and diverse mathematical benchmarks demonstrates that the CoSP metric within our SalaMAnder framework exhibits a robust monotonic correlation with model performance, not only providing theoretical explanations for the empirical success of existing few-shot CoT but also establishing mathematically rigorous principles for prompt construction optimization. Furthermore, we verify the reliability of the explanation, based on which we unify the insights of previous work.
CLJul 14, 2025
Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuningChenxi Huang, Shaotian Yan, Liang Xie et al.
Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
CVJan 9, 2024
Mix-GENEO: A Flexible Filtration for Multiparameter Persistent Homology Detects Digital ImagesJiaxing He, Bingzhe Hou, Tieru Wu et al.
Two important tasks in the field of Topological Data Analysis are building practical multifiltrations on objects and using TDA to detect the geometry. Motivated by the tasks, we build multiparameter filtrations by operators on images named multi-GENEO, multi-DGENEO and mix-GENEO, and we prove the stability of both the interleaving distance and multiparameter persistence landscape of multi-GENEO with respect to the pseudometric on bounded functions. We also give the estimations of upper bound for multi-DGENEO and mix-GENEO. In practical applications, we regard image as a discrete function space, and then we build multifiltrations on the discrete function space. Finally, we construct comparable experiment on MNIST dataset to demonstrate our bifiltrations are superior to 1-parameter filtrations including lower-star filtration and upper-star filtration. For instance, 6 and 9 can be distinguished by our bifiltrations, while they cannot be distinguished by 1-parameter filtrations. The experiment results demonstrate our bifiltrations have ability to detect geometric and topological differences of digital images.