91.5LGApr 24Code
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform InferenceYuqi Pan, Jinghao Zhuang, Yupeng Feng et al.
Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
CVJul 13, 2024Code
Low-Rank Interconnected Adaptation across LayersYibo Zhong, Jinman Zhao, Yao Zhou
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning (PEFT) method that learns weight updates $ΔW = AB$ for pretrained weights $W$ through low-rank adapters $A$ and $B$. While LoRA ensures hardware efficiency, its low-rank weight updates limit adaptation performance. In this paper, we propose low-rank interconnected adaptation across layers (Lily), a novel PEFT method that introduces an interconnected framework with locally shared $A$ and globally shared $B$ experts. This structure eliminates redundant per-layer $AB$ pairs, enabling higher-rank $ΔW$ with equal or fewer parameters. To enhance expressiveness, we use data-dependent routers to determine $A$-$B$ interconnections, preventing $B$ experts from converging to the same behavior and improving representational power across domains. Experiments across modalities, architectures, and model sizes demonstrate Lily's superior performance and efficiency. GitHub: https://github.com/yibozhong/lily
CVSep 29, 2024Code
Pear: Pruning and Sharing Adapters in Visual Parameter-Efficient Fine-TuningYibo Zhong, Yao Zhou
Adapters have been widely explored to alleviate computational and storage costs when fine-tuning pretrained foundation models. However, the adapter itself can exhibit redundancy, leading to unnecessary storage overhead and inferior performance. In this paper, we propose Prune and Share (Pear), a novel adapter-pruning framework for efficient fine-tuning of pretrained visual foundation models. Specifically, we prune certain adapters and share the more important unpruned ones with positions where adapters are pruned, allowing continual adaptation at these positions after pruning. Additionally, a knowledge checkpoint strategy is introduced, which preserves the information of the pruned adapters and further boosts performance. Experimental results on visual adaptation benchmark validate the effectiveness and efficiency of the proposed Pear comparing to other competitive methods. Code is in https://github.com/yibozhong/pear.
LGMar 3, 2025
Building Machine Learning Challenges for Anomaly Detection in ScienceElizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova et al.
Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.
CVApr 13, 2024
Rethinking Low-Rank Adaptation in Vision: Exploring Head-Level Responsiveness across Diverse TasksYibo Zhong, Yao Zhou
Low-rank adaptation (LoRA) has shifted the paradigm of adapting pre-trained Vision Transformers (ViT), achieving great efficiency by updating only a subset of tailored parameters to approximate weight updates. However, the multi-head design of the self-attention mechanism, with the heads working in parallel in the computation flow, exhibiting similar visual patterns and requiring update over all of them, incurs unnecessary storage and computational overhead. In this paper, we propose Head-level responsiveness tuning for low-rank adaptation (Heart-LoRA). The proposed method explores redundancy among the heads and selectively activates task-responsive heads, thus enabling fine-grained head-level tuning. Additionally, given the different responsiveness of heads to diverse visual tasks, our proposed method dynamically activates a subset of the approximated heads that are tailored to the current task. Experimental results show that Heart-LoRA yields superior performance over state-of-the-art PETL approaches on visual adaptation benchmark datasets.
LGAug 16, 2025
ENA: Efficient N-dimensional AttentionYibo Zhong
Efficient modeling of long sequences of high-order data requires a more efficient architecture than Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for language modeling, to high-order data (1D to ND): scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate types of attention and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We term the resulting hybrid architecture of linear recurrence and high-order SWA as Efficient N-dimensional Attention (ENA). We then conduct several experiments to demonstrate its effectiveness. The intuition behind ENA is that linear recurrence compresses global information into a state, while SWA complements it by enforcing strict local modeling. Together, they form a simple framework that offers a promising and practical solution for ultra-long high-order data modeling.