LGFeb 13, 2025Code
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense ConnectionsDa Xiao, Qingye Meng, Shengping Li et al.
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer .
LGMay 14, 2024Code
Improving Transformers with Dynamically Composable Multi-Head AttentionDa Xiao, Qingye Meng, Shengping Li et al.
Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.
40.1SPApr 4
Interpretable Fuzzy Modeling Reveals Population-Level Representation Differences in P300 Brain Computer Interfaces Across Neurodivergent and Neurotypical CohortsXiaowei Jiang, Sudong Shang, Adrian Wilkinson et al.
P300-based brain-computer interfaces (BCIs) are widely used for communication, but population heterogeneity may alter the neural patterns available for decoding. Prior work has mainly examined such differences at the signal or performance level, while the representation structure learned by the decoder remains underexplored. In this study, we propose an interpretable fuzzy spatiotemporal framework for P300 classification and use it to analyze population-level differences across amyotrophic lateral sclerosis (ALS), autism (AUT), and neurotypical (NT) cohorts. The model employs spatial and temporal fuzzy filters with learnable prototypes, enabling both classification and reconstruction of cohort-specific fuzzy centers. Experiments were conducted on ALS and NT subsets from bigP3BCI and on the BCIAUT-P300 benchmark in a within-subject setting. The proposed model achieved competitive performance against multiple deep learning baselines. More importantly, the reconstructed fuzzy centers revealed systematic cohort-dependent differences in waveform morphology and representation geometry. Point-wise statistical analysis identified significant temporal differences between cohorts, including intervals overlapping with the canonical P300 window, and low-dimensional embeddings showed partially separated cohort-specific prototype organizations. These results suggest that population heterogeneity in P300-BCI is reflected not only in decoding performance but also in the discriminative structure learned by the model. The proposed framework provides an interpretable route toward population-aware P300-BCI analysis and design.
CLDec 17, 2024Code
Benchmarking and Understanding Compositional Relational Reasoning of LLMsRuikang Ni, Da Xiao, Qingye Meng et al.
Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing transformer large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in mechanistic interpretability (MI) study in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks and a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. Especially, we identify two classes of heads whose activations represent the abstract notion of true and false in GAR tasks respectively. They play a fundamental role in CRR across various models and tasks. The dataset and code are available at https://github.com/Caiyun-AI/GAR.
LGJan 30
FedCARE: Federated Unlearning with Conflict-Aware Projection and Relearning-Resistant RecoveryYue Li, Mingmin Chu, Xilei Yang et al.
Federated learning (FL) enables collaborative model training without centralizing raw data, but privacy regulations such as the right to be forgotten require FL systems to remove the influence of previously used training data upon request. Retraining a federated model from scratch is prohibitively expensive, motivating federated unlearning (FU). However, existing FU methods suffer from high unlearning overhead, utility degradation caused by entangled knowledge, and unintended relearning during post-unlearning recovery. In this paper, we propose FedCARE, a unified and low overhead FU framework that enables conflict-aware unlearning and relearning-resistant recovery. FedCARE leverages gradient ascent for efficient forgetting when target data are locally available and employs data free model inversion to construct class level proxies of shared knowledge. Based on these insights, FedCARE integrates a pseudo-sample generator, conflict-aware projected gradient ascent for utility preserving unlearning, and a recovery strategy that suppresses rollback toward the pre-unlearning model. FedCARE supports client, instance, and class level unlearning with modest overhead. Extensive experiments on multiple datasets and model architectures under both IID and non-IID settings show that FedCARE achieves effective forgetting, improved utility retention, and reduced relearning risk compared to state of the art FU baselines.
ROAug 1, 2025
Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D ReconstructionCheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le et al. · stanford
Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, images for reconstruction are still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for the real-world application. Our project page is https://johnnylu305.github.io/hestia web.
LGFeb 8, 2018
Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator AbstractionDa Xiao, Jo-Yu Liao, Xingyuan Yuan
To overcome the limitations of Neural Programmer-Interpreters (NPI) in its universality and learnability, we propose the incorporation of combinator abstraction into neural programing and a new NPI architecture to support this abstraction, which we call Combinatory Neural Programmer-Interpreter (CNPI). Combinator abstraction dramatically reduces the number and complexity of programs that need to be interpreted by the core controller of CNPI, while still allowing the CNPI to represent and interpret arbitrary complex programs by the collaboration of the core with the other components. We propose a small set of four combinators to capture the most pervasive programming patterns. Due to the finiteness and simplicity of this combinator set and the offloading of some burden of interpretation from the core, we are able construct a CNPI that is universal with respect to the set of all combinatorizable programs, which is adequate for solving most algorithmic tasks. Moreover, besides supervised training on execution traces, CNPI can be trained by policy gradient reinforcement learning with appropriately designed curricula.
MLDec 21, 2013
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural NetworksIan J. Goodfellow, Mehdi Mirza, Da Xiao et al.
Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.