Yufei Gu

LG
h-index3
6papers
20citations
Novelty62%
AI Score52

6 Papers

NEOct 10, 2023Code
SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network

Changze Lv, Tianlong Li, Wenhao Liu et al.

Spiking Neural Networks (SNNs) have emerged as a promising alternative to conventional Artificial Neural Networks (ANNs), demonstrating comparable performance in both visual and linguistic tasks while offering the advantage of improved energy efficiency. Despite these advancements, the integration of linguistic and visual features into a unified representation through spike trains poses a significant challenge, and the application of SNNs to multimodal scenarios remains largely unexplored. This paper presents SpikeCLIP, a novel framework designed to bridge the modality gap in spike-based computation. Our approach employs a two-step recipe: an ``alignment pre-training'' to align features across modalities, followed by a ``dual-loss fine-tuning'' to refine the model's performance. Extensive experiments reveal that SNNs achieve results on par with ANNs while substantially reducing energy consumption across various datasets commonly used for multimodal model evaluation. Furthermore, SpikeCLIP maintains robust image classification capabilities, even when dealing with classes that fall outside predefined categories. This study marks a significant advancement in the development of energy-efficient and biologically plausible multimodal learning systems. Our code is available at https://github.com/Lvchangze/SpikeCLIP.

LGOct 20, 2023
Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

Yufei Gu, Xiaoqing Zheng, Tomaso Aste

Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory to account for its occurrence in deep learning remains yet to be established. In this study, we revisit the phenomenon of double descent and demonstrate that its occurrence is strongly influenced by the presence of noisy data. Through conducting a comprehensive analysis of the feature space of learned representations, we unveil that double descent arises in imperfect models trained with noisy data. We argue that double descent is a consequence of the model first learning the noisy data until interpolation and then adding implicit regularization via over-parameterization acquiring therefore capability to separate the information from the noise.

58.9LGMay 10
Dimension-Free Saddle-Point Escape in Muon

Yanlin Long, Yufei Gu, Zeke Xie

Modern Large Language Model (LLM) training is fundamentally bottlenecked by pathologically flat saddle points in extreme high-dimensional landscapes. Motivated by this challenge, we analyze the saddle-point escape dynamics of the emerging Muon optimizer, demonstrating its resilience against the $\mathcal{O}(D)$ dimensional curse that severely traps element-wise adaptive optimizers like AdamW. By extending generalized matrix perturbation theory, we develop a theoretical framework to capture Muon's non-equilibrium optimization trajectories. This theoretical machinery mathematically proves that Muon elegantly bypasses the dimensional curse via a non-linear spectral shaping mechanism. By leveraging resolvent functional calculus and macroscopic Cauchy contour integration, we avoid isotropic noise assumptions and Tracy-Widom edge singularities. We establish that structural incoherence securely shields the trajectory from orthogonal drift, enabling a dimension-free saddle-point escape, and triggering a deterministic $\mathcal{O}(1)$ discrete ballistic ejection under sufficient spectral gap. Consequently, we provide an algebraically dimension-free escape bound for Muon, formalizing the underlying mechanics of its non-convex optimization dynamics.

LGMay 13, 2024Code
Class-wise Activation Unravelling the Engima of Deep Double Descent

Yufei Gu

Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory for its occurring mechanism in deep learning remains yet to be established. In this study, we revisited the phenomenon of double descent and discussed the conditions of its occurrence. This paper introduces the concept of class-activation matrices and a methodology for estimating the effective complexity of functions, on which we unveil that over-parameterized models exhibit more distinct and simpler class patterns in hidden activations compared to under-parameterized ones. We further looked into the interpolation of noisy labelled data among clean representations and demonstrated overfitting w.r.t. expressive capacity. By comprehensively analysing hypotheses and presenting corresponding empirical evidence that either validates or contradicts these hypotheses, we aim to provide fresh insights into the phenomenon of double descent and benign over-parameterization and facilitate future explorations. By comprehensively studying different hypotheses and the corresponding empirical evidence either supports or challenges these hypotheses, our goal is to offer new insights into the phenomena of double descent and benign over-parameterization, thereby enabling further explorations in the field. The source code is available at https://github.com/Yufei-Gu-451/sparse-generalization.git.

CLFeb 5
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Ji Zhao, Yufei Gu, Shitong Shao et al.

As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: \textit{Can we leverage existing small pretrained models to accelerate the training of larger models?} In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5\% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.

LGJan 30
Mano: Restriking Manifold Optimization for LLM Training

Yufei Gu, Zeke Xie

While large language models (LLMs) have emerged as a significant advancement in artificial intelligence, the hardware and computational costs for training LLMs are also significantly burdensome. Among the state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the expense of losing curvature information. In this study, we restriked manifold optimization methods for training LLMs, which may address both optimizers' limitations, while conventional manifold optimization methods have been largely overlooked due to the poor performance in large-scale model optimization. By innovatively projecting the momentum onto the tangent space of model parameters and constraining it on a rotational Oblique manifold, we propose a novel, powerful, and efficient optimizer **Mano** that is the first to bridge the performance gap between manifold optimization and modern optimizers. Extensive experiments on the LLaMA and Qwen3 models demonstrate that Mano consistently and significantly outperforms AdamW and Muon even with less memory consumption and computational complexity, respectively, suggesting an expanded Pareto frontier in terms of space and time efficiency.